In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.cluster import DBSCAN
from sklearn.metrics import calinski_harabasz_score
from sklearn.decomposition import PCA
from scipy.cluster.hierarchy import dendrogram
up, down = True, False
np.set_printoptions(precision=3)
import math
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

<a id='Topic0'></a>
# Assignment 2: Clustering


In the modern era of big data, clustering algorithms have found great use in situations such as grouping customers together to refine targeted advertisements, or organizing the themes of a large set of text documents to split them into topical groups. This week, we're going to be exploring clustering algorithms using a dataset of food recipes. We'll begin with k-means clustering, move on to an example of hierarchical clustering, and then finish by exploring the capabilities of DBSCAN. 


Run the code below to prepare the dataset and some functions for the assignment.



In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# get_ingredients_vectorized (defined below) returns:
# - a matrix X with one row per document (recipe ingredient list). Each row is a sparse
# vector containing tf-idf term weights for the words in the document.
#
# - the vectorizer used to create X
#
# - the actual ingredient lists used as input to the vectorizer
recipes_df = pd.read_csv("./assets/RAW_recipes_processed.csv")
ingredients_series = recipes_df["ingredients"]
ingredients_series = ingredients_series.dropna()  # drop any rows with empty ingredient lists

def get_ingredients_vectorized(top_n=-1, ngram_range=(1, 1), max_features=1000):

    vectorizer = TfidfVectorizer(
        max_df=0.5,
        max_features=max_features,
        min_df=2,
        stop_words="english",
        ngram_range=ngram_range,
        use_idf=True,
    )
    if top_n >= 0:
        recipe_list_instances = ingredients_series.values[0:top_n]
    else:
        recipe_list_instances = ingredients_series.values

    X = vectorizer.fit_transform(recipe_list_instances)

    return (X, vectorizer, recipe_list_instances)



# Part 1: K-means clustering:

A big problem in applying $K$-means (and many other unsupervised clustering-type methods) is that we need to specify the number of clusters $K$ in advance - but we don't know in advance what value of $K$ will give the best-quality clustering!  So, to pick the "best" number of clusters we're going to rely on calculating some automated measures of cluster quality. You'll compute these cluster quality measures for each choice of $K$, and then pick the value of $K$ that gives the best value of the quality measure(s).


First, we need to select the value of k that yields the highest Calinski-Harabasz index.

Your function should return that value of k, along with its Calinski-Harabasz index.

k will be a value from 2 to 9, inclusive.

Hint: 

- for each value of k from 2 to 9, create a `KMeans` object with arguments: n_clusters=k, init='k-means++', max_iter=100, n_init=1, random_state=42
- fit it with the vectorized ingredient lists stored by the starter code in `X` by calling `.fit()` on the `KMeans` object
- use the [calinski_harabasz_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.calinski_harabasz_score.html) with `.toarray()` applied to the vectorized ingredient lists and the `.labels_` attribute of the `KMeans` object as the arguments to obtain the score. 
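
As a rough illustration of the selection loop described in the hints, here is a minimal sketch on synthetic data rather than the recipe matrix. The blob centers, the `toy_X` array, and the narrower range of k are assumptions made only for this example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

# Toy data (an assumption for illustration): three well-separated 2-D blobs.
toy_X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

# Score each candidate k and remember the score.
scores = {}
for k in range(2, 6):
    km = KMeans(n_clusters=k, init="k-means++", max_iter=100, n_init=1, random_state=42)
    km.fit(toy_X)
    scores[k] = float(calinski_harabasz_score(toy_X, km.labels_))

# The k with the highest Calinski-Harabasz index wins.
toy_best_k = max(scores, key=scores.get)
print(toy_best_k, scores[toy_best_k])
```

Because `toy_X` is already a dense array, no `.toarray()` call is needed here; the sparse TF-IDF matrix in the assignment does need it.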

In [None]:
task_id = '1a'



In [None]:
def task_1a_solution():

    #X here is useful: DO NOT MODIFY THE BELOW LINE.
    (X, _, _) = get_ingredients_vectorized(10000, (1, 2))

    best_k = -1
    highest_calinski = -1.0
    
    # YOUR CODE HERE
    raise NotImplementedError()
    
    return best_k, highest_calinski

In [None]:
# Use this cell to explore your solution.

task_1a_solution()

In [None]:
print(f"Task {task_id} - AG tests")
stu_ans = task_1a_solution()

print(f"Task {task_id} - your answer:\n{stu_ans}")

assert isinstance(
    stu_ans, tuple
), f"Task {task_id}: Your function should return a tuple. "

assert isinstance(
    stu_ans[0], int
), f"Task {task_id}: The ideal k should be an integer. "

assert isinstance(
    stu_ans[1], float
), f"Task {task_id}: The calinski harabasz score should be a float. "


del stu_ans



# Task 1b: Perform K-means review


Now you will write your assignment code that accomplishes clustering using the optimal K you obtained in task 1a, and finds what the predominant terms are in the resulting clusters. You should return a list containing K lists of 10 strings where each list contains the top 10 strings for that cluster.

Remember, each cluster is a collection of vectors, each of which represents a recipe's ingredient list. To find good representative terms for each cluster, you want to know what the "typical" ingredient list looks like for each of the K clusters. A "typical" ingredient list for a cluster is just the mean of all the vectors (ingredients) that belong to that cluster. This vector is known as the cluster centroid or center. *Note that when you run K-Means you don't have to compute the cluster centers yourself -- they are computed for you by K-Means after you run 'fit'. Just use the resulting cluster_centers_ property.*

The cluster center, since it's just the mean of a bunch of ingredient list vectors, is itself a vector of the same dimension as the input ingredient vectors: think of the cluster center vector as an "ingredient cloud" that holds the weight of each possible ingredient term. 

So to get the most predominant terms in a vector, just sort the entries of the cluster centroid *from highest to lowest term weight*, and return the corresponding term strings. *You can do this sorting with one line of code, on all the cluster centers simultaneously, by applying argsort(..) to the cluster_centers_ array which we store in centroids.*
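
The sorting step can be sketched on a tiny hand-made example. The `toy_terms` and `toy_centroids` arrays below are made up for illustration and stand in for the real vocabulary and `cluster_centers_`; only the top 3 terms are taken to keep the output short.

```python
import numpy as np

# Hypothetical vocabulary and cluster centers (one row per cluster).
toy_terms = np.array(["butter", "flour", "garlic", "onion", "sugar"])
toy_centroids = np.array([
    [0.40, 0.30, 0.00, 0.05, 0.50],  # a "baking"-like cluster
    [0.10, 0.02, 0.60, 0.45, 0.01],  # a "savory"-like cluster
])

# argsort sorts each row in ascending order, so reverse the columns
# to get the indices of the highest weights first.
order = toy_centroids.argsort(axis=1)[:, ::-1]

# Map the first few indices of each row back to term strings.
top_n = 3
toy_top_terms = [toy_terms[row[:top_n]].tolist() for row in order]
print(toy_top_terms)
```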





In [None]:
task_id = '1b'



In [None]:
def task_1b_solution():
    #X and vectorizer here are useful. DO NOT MODIFY THE BELOW LINE.
    (X, vectorizer, _) = get_ingredients_vectorized(10000, (1, 2))
    
    kmeans_optimal = KMeans(n_clusters=task_1a_solution()[0], init='k-means++', max_iter=100, n_init=1, random_state=42)
    kmeans_optimal.fit(X)
    terms = vectorizer.get_feature_names_out()
    """
    terms is an array of 1000 strings: the vocabulary learned by the vectorizer, printed in a form like
    ['add' 'added' 'adds' 'adjunct' 'aftertaste' 'age' 'aged' 'aggressive', ... 1000 items]
    terms[j] is the term for column j of X and of each cluster center.
    """
    centroids = kmeans_optimal.cluster_centers_
    """
    centroids has one row per cluster; row i holds the weight of each term in cluster i,
    in the same order as terms. For example, with two clusters:
    [[3.947e-03 7.001e-03 2.971e-03 ... 4.293e-04 0.000e+00 5.621e-05] 
    [4.655e-03 6.402e-03 3.254e-03 ... 1.364e-02 4.012e-03 3.327e-03]]
    """
    
    top_cluster_terms = []
    
    # YOUR CODE HERE
    raise NotImplementedError()
        
    return top_cluster_terms



In [None]:
print(f"Task {task_id} - AG tests")
stu_ans = task_1b_solution()

print(f"Task {task_id} - your answer:\n{stu_ans}")

assert isinstance(
    stu_ans, list
), f"Task {task_id}: Your function should return a list containing lists of strings. "




del stu_ans

# Part 2: Hierarchical clustering 

For this part, we're going to work through an example of putting the recipe data through a hierarchical clustering algorithm.

Start by taking a look at the graph generated by the code below. 



    

In [None]:

def plot_recipes():
    (X, vectorizer, recipe_list_instances) = get_ingredients_vectorized(100, (1, 2))


    #Here, we convert the ingredient text to a set of x y coordinates through dimensionality reduction
    
    pca = PCA(n_components=2)
    X_pca = pca.fit_transform(X.toarray())
    x_coordinates = X_pca[:10, 0]
    y_coordinates = X_pca[:10, 1]
    #Plot the PCA components.
    
    plt.scatter(x_coordinates, y_coordinates)
    plt.title('PCA of Text Data')
    plt.xlabel('Principal Component 1')
    plt.ylabel('Principal Component 2')
    for i in range(0, 10):
        recipe_name = recipes_df.loc[i,'name']
        plt.annotate(f'{recipe_name}', (x_coordinates[i], y_coordinates[i]))

plot_recipes()

# Task 2a: Calculate distances.


Compute the condensed distance matrix and use it to return the linkage matrix, which we can use to create a dendrogram for the first 10 recipes.

For this question, loop through each pair of points and calculate the distance between them to fill in the uncondensed distance matrix we've defined for you. Then, use `squareform()` to condense the distance matrix. Pass the resulting condensed distance matrix to the `linkage()` function with `method='complete'`. Your function should return the output of `linkage()`.

Note: In the distance matrix, the entries on the diagonal, i.e., the distances between a point and itself, should be 0. For example, the distance between element 0 and element 0 is zero.
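
To make the square-to-condensed step concrete, here is a small sketch on four made-up points. The coordinates `toy_x` and `toy_y` are assumptions for illustration and are not the PCA coordinates used in the task.

```python
import math
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# Four hypothetical 2-D points.
toy_x = [0.0, 1.0, 5.0, 5.0]
toy_y = [0.0, 0.0, 0.0, 2.0]
n = len(toy_x)

# Fill a symmetric n x n matrix; the diagonal stays at 0.
square = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        d = math.sqrt((toy_x[j] - toy_x[i]) ** 2 + (toy_y[j] - toy_y[i]) ** 2)
        square[i, j] = d
        square[j, i] = d

# squareform turns the symmetric matrix into a flat vector holding
# only the n * (n - 1) / 2 distances above the diagonal.
condensed = squareform(square)

# linkage accepts the condensed vector directly.
toy_Z = linkage(condensed, method="complete")
print(condensed.shape, toy_Z.shape)
```

Note that `squareform()` expects the input matrix to be symmetric with zeros on the diagonal, which is why both `square[i, j]` and `square[j, i]` are filled.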

In [None]:
task_id = '2a'

# We've defined a function to calculate euclidean distance between coordinate points.
def calc_distance(x1, x2, y1, y2):
    return math.sqrt((x2 - x1)**2 + (y2 - y1)**2)


In [None]:


def task_2a_solution():

    #Do not modify this block of code, you'll need it for your solution!
    (X, vectorizer, recipe_list_instances) = get_ingredients_vectorized(10, (1, 2))
    pca = PCA(n_components=2)
    X_pca = pca.fit_transform(X.toarray())
    x_coordinates = X_pca[:10, 0]
    y_coordinates = X_pca[:10, 1]

    linkage_matrix = None

    uncondensed_distance_matrix = np.zeros((10, 10))

    # YOUR CODE HERE
    raise NotImplementedError()
        
    return linkage(condensed_distance_matrix, method='complete')

    
 



In [None]:
print(f"Task {task_id} - AG tests")
stu_ans = task_2a_solution()


print(f"Task {task_id} - your answer:\n{stu_ans}")

assert isinstance(
    stu_ans, np.ndarray
), f"Task {task_id}: Your function should return a np.ndarray. "




del stu_ans

In [None]:
def plot_dendogram():
    #Step 2: Plot the Dendrogram
    linkage_matrix = task_2a_solution()
    plt.figure(figsize=(8, 6))
    dendrogram(linkage_matrix)
    plt.title('Hierarchical Clustering Dendrogram')
    plt.xlabel('Documents')
    plt.ylabel('Distance')
    plt.show()

plot_dendogram()

# Task 2b:

Now we have our dendrogram!

Return the names of each pair of recipes that are merged into a cluster at a distance of less than 0.2. These are the tightest clusters in this set, so the recipes will likely be for similar foods. 
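
One way to read such pairs programmatically, instead of off the plot, is to scan the linkage matrix. Each row of a SciPy linkage matrix holds the two cluster indices that were merged, the merge distance, and the number of original observations in the new cluster. The sketch below uses made-up points and names (`toy_points`, `toy_names`) and is only one possible approach.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Hypothetical points: two tight pairs and one far-away point.
toy_points = np.array([
    [0.0, 0.0], [0.1, 0.0],    # close pair
    [3.0, 0.0], [3.0, 0.15],   # close pair
    [10.0, 10.0],              # far from everything
])
toy_names = ["soup a", "soup b", "cake a", "cake b", "salad"]

# Passing an observation matrix makes linkage compute Euclidean distances itself.
pair_Z = linkage(toy_points, method="complete")

threshold = 0.2
close_pairs = []
for left, right, dist, count in pair_Z:
    # count == 2 means the merge joined two original observations,
    # so both indices refer to rows of toy_points.
    if dist < threshold and count == 2:
        close_pairs.append([toy_names[int(left)], toy_names[int(right)]])

print(close_pairs)
```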

In [None]:
task_id = '2b'




In [None]:


def task_2b_solution():
    name_series = recipes_df['name']
    sol = []

    ### solution = [[name_series[n],name_series[n]],[name_series[n], name_series[n]],...]
    # YOUR CODE HERE
    raise NotImplementedError()

    return sol
    


In [None]:
print(f"Task {task_id} - AG tests")
stu_ans = task_2b_solution()


print(f"Task {task_id} - your answer:\n{stu_ans}")

assert isinstance(
    stu_ans, list
), f"Task {task_id}: Your function should return a list of lists. "




del stu_ans

# Part 3: DBSCAN

Now, for the last question, we're going to have you run DBSCAN on the dataset and explore outliers in the dataset.  

# Task 3a:

We will first be obtaining the cluster assignments from the DBSCAN algorithm for the first hundred recipes in the dataset. 

When creating the DBSCAN object, use eps=0.1 and min_samples=2.

You may find the [DBSCAN documentation](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html#sklearn.cluster.DBSCAN.fit_predict) helpful. Take note of the `fit_predict` method.
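
As a quick illustration of what `fit_predict` returns, here is a sketch on six made-up points (`toy_points` is an assumption for this example, not the recipe data). With `min_samples=2`, a point needs at least one other point within `eps` to start a cluster; points with no such neighbor are labeled -1.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical 2-D points: two small groups and one isolated point.
toy_points = np.array([
    [0.00, 0.00], [0.05, 0.00], [0.00, 0.05],  # first group
    [1.00, 1.00], [1.05, 1.00],                # second group
    [5.00, 5.00],                              # isolated point
])

# fit_predict fits the model and returns one cluster label per row.
toy_labels = DBSCAN(eps=0.1, min_samples=2).fit_predict(toy_points)
print(toy_labels)
```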

In [None]:
task_id = '3a'




In [None]:
def task_3a_solution():
    # Vectorize the data using TF-IDF
    (X, vectorizer, recipe_list_instances) = get_ingredients_vectorized(100, (1, 2))
    pca = PCA(n_components=2)
    X_pca = pca.fit_transform(X.toarray())

    dbscan = None
    cls = None
    # Apply DBSCAN clustering
    # YOUR CODE HERE
    raise NotImplementedError()

    return cls

In [None]:
print(f"Task {task_id} - AG tests")
stu_ans = task_3a_solution()


print(f"Task {task_id} - your answer:\n{stu_ans}")


assert isinstance(
    stu_ans, np.ndarray
), f"Task {task_id}: Your function should return a np.ndarray "




del stu_ans

# Task 3b:

Something interesting you can do with DBSCAN is obtain outliers, i.e., the points not assigned to any cluster. Which are the outlier recipes?

Hint(s):
- np.where is useful for this question
- The outliers are stored as -1 for their cluster assignment.
- When you obtain the indices of the outliers, access recipes_df to obtain the name of each recipe. 
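
The hints can be sketched on a tiny made-up example. The `toy_df` frame and `toy_labels` array below are hypothetical stand-ins for `recipes_df` and the DBSCAN output, and the sketch assumes that label positions line up with row positions in the frame.

```python
import numpy as np
import pandas as pd

# Hypothetical recipe names and cluster labels (-1 marks an outlier).
toy_df = pd.DataFrame({"name": ["soup", "stew", "cake", "pie", "mystery dish"]})
toy_labels = np.array([0, 0, 1, -1, -1])

# np.where returns a tuple of index arrays; take the first (and only) one.
outlier_idx = np.where(toy_labels == -1)[0]

# Look the names up by position and convert to a plain Python list.
toy_outliers = toy_df["name"].iloc[outlier_idx].tolist()
print(toy_outliers)
```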

In [None]:
task_id = '3b'




In [None]:
"""
According to dbscan, what are the names of the outlier recipes?

"""
def task_3b_solution():
    cls = task_3a_solution()

    # YOUR CODE HERE
    raise NotImplementedError()
     



In [None]:
print(f"Task {task_id} - AG tests")
stu_ans = task_3b_solution()


print(f"Task {task_id} - your answer:\n{stu_ans}")

assert isinstance(
    stu_ans, list
), f"Task {task_id}: Your function should return a list of lists "




del stu_ans

Here, you can see the outliers plotted as noise in the PCA scatter plot. 

In [None]:
from matplotlib.colors import ListedColormap, BoundaryNorm
import matplotlib.patches as mpatches

def plot_labelled_scatter(X, y, class_labels):
    num_labels = len(class_labels)

    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1

    marker_array = ['o', '^', '*']
    color_array = ['#FFFF00', '#00AAFF', '#000000', '#FF00AA']
    cmap_bold = ListedColormap(color_array)
    bnorm = BoundaryNorm(np.arange(0, num_labels + 1, 1), ncolors=num_labels)
    plt.figure()

    plt.scatter(X[:, 0], X[:, 1], s=65, c=y, cmap=cmap_bold, norm = bnorm, alpha = 0.40, edgecolor='black', lw = 1)

    plt.xlim(x_min, x_max)
    plt.ylim(y_min, y_max)

    h = []
    for c in range(0, num_labels):
        h.append(mpatches.Patch(color=color_array[c], label=class_labels[c]))
    plt.legend(handles=h)

    plt.show()



def plot_dbscan():
    # Vectorize the data using TF-IDF
    (X, vectorizer, recipe_list_instances) = get_ingredients_vectorized(100, (1, 2))
    pca = PCA(n_components=2)
    X_pca = pca.fit_transform(X.toarray())
    cls = task_3a_solution()
    plot_labelled_scatter(X_pca, cls + 1, ["Noise", "Cluster 0", "Cluster 1", "Cluster 2"])
    plt.tight_layout()
    
plot_dbscan()

In [None]:
"""clustering on text vector vs nutritional content"""
from sklearn.preprocessing import StandardScaler


def get_PCA():
    numerical_fields = ['calories','total_fat_pdv','sugar_pdv','sodium_pdv','protein_pdv','saturated_fat_pdv','carbs_pdv']
    numerical_df = recipes_df[numerical_fields].iloc[:100, :]
    normalized_data = StandardScaler().fit_transform(numerical_df)
    pca = PCA(n_components=2)
    X_pca = pca.fit_transform(normalized_data)
    return X_pca
    

    

# Task 3c:

Let's switch gears for this question. So far, we've been looking at the text of recipe ingredients when making clusters. Let's change to using the nutritional info. Start by recreating the DBSCAN object and fitting it to the numeric data from above. Use your code from 3a, but set eps to 1 instead.
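
The `get_PCA` helper above standardizes the nutritional columns before applying PCA. The sketch below, on a made-up two-column array (`toy_nutrition` is an assumption for illustration), shows what that standardization does: each column ends up with mean 0 and unit variance, so a large-valued column such as calories does not dominate the distances that DBSCAN relies on.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: column 0 is on a much larger scale than column 1.
toy_nutrition = np.array([
    [100.0, 5.0],
    [300.0, 10.0],
    [500.0, 15.0],
    [700.0, 10.0],
])

# Subtract each column's mean and divide by its standard deviation.
scaled = StandardScaler().fit_transform(toy_nutrition)

print(scaled.mean(axis=0).round(6))
print(scaled.std(axis=0).round(6))
```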

In [None]:
task_id = '3c'




In [None]:
def task_3c_solution():

    X_pca = get_PCA()
    dbscan = None
    cls = None

    # YOUR CODE HERE
    raise NotImplementedError()

    return cls

In [None]:
print(f"Task {task_id} - AG tests")
stu_ans = task_3c_solution()


print(f"Task {task_id} - your answer:\n{stu_ans}")


assert isinstance(
    stu_ans, np.ndarray
), f"Task {task_id}: Your function should return a ndarray. "




del stu_ans

Run the code below; the nutritional info paints a very different picture!

In [None]:
def plot_dbscan():
    # Load the PCA-reduced nutritional data
    X_pca = get_PCA()
    cls = task_3c_solution()
    plot_labelled_scatter(X_pca, cls + 1, ["Noise", "Cluster 0", "Cluster 1", "Cluster 2"])
    plt.tight_layout()
    
plot_dbscan()

# Task 3d:

Now, based on the nutritional info, which recipes are outliers?

The code here is the same as your 3b solution, applied to the cluster assignments from task 3c.

In [None]:
task_id = '3d'




In [None]:


def task_3d_solution():

    cls = task_3c_solution()

    # YOUR CODE HERE
    raise NotImplementedError()



In [None]:
print(f"Task {task_id} - AG tests")
stu_ans = task_3d_solution()


print(f"Task {task_id} - your answer:\n{stu_ans}")

assert isinstance(
    stu_ans, list
), f"Task {task_id}: Your function should return a list. "




del stu_ans