# Lab 7 Goals

The goals of this lab are:

* To practice basic data cleaning, such as removing columns, dropping rows with missing values, and accessing row names of a dataframe.
* To implement $K$-means clustering, including selecting an appropriate value for $K$ and interpreting the cluster output. 
* To implement hierarchical clustering, including creation of dendrograms.

For this lab, it may be helpful to install and load the following modules:
 
* `matplotlib` 
* `numpy`
* `pandas`
* `plotnine`
* `random`
* `scipy`
* `sklearn`

In [None]:
import matplotlib
import numpy as np
import pandas as pd
import plotnine as p9
import random
import scipy
import sklearn

We will be using the Epicurious dataset from Labs 1 and 2. 

In [None]:
epi = pd.read_csv('/Users/amynussbaum/Documents/U of C/Courses/119/Week 1/Lab/epi_r.csv')

# Data Cleaning and Exploration

Recall that the Epicurious dataset, created by a Kaggle user, has [over 20,000 recipes from the website Epicurious](https://www.kaggle.com/datasets/hugodarwood/epirecipes/code). Loosely speaking, there are a few groups of variables in the dataset:
  
* The nutritional variables (`calories`, `protein`, `fat`, `sodium`)
* Ingredient tags (`almond`, `amaretto`, `anchovy`, and so on)
* Place tags (`alabama`, `alaska`, `aspen`, `australia`, and so forth)
* Other tags (`advance.prep.required`, `anthony.bourdain`, etc.)

1. Before we do anything else, look at the column names of `epi`. 


2. Notice that the first column is called `title`. Print it out and examine it. 


3. We've mostly ignored this type of variable so far, that is, variables that are ID numbers or names. They are technically a type of categorical variable and haven't really been useful to us yet. However, they can be very valuable later in a clustering analysis, when we try to interpret the clusters. They are most helpful when they are the row names rather than a separate column. Edit the line of code below to "re-import" the data with recipe titles as row names, and remember this trick for later!

In [None]:
epi = pd.read_csv('/Users/amynussbaum/Documents/U of C/Courses/119/Week 1/Lab/epi_r.csv', index_col = 'title')
epi.head()
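
As a minimal sketch of the row-name trick, here is the same idea on made-up data. The three recipes and the in-memory "file" below are hypothetical stand-ins, not part of `epi_r.csv`.

```python
import io
import pandas as pd

# Hypothetical three-recipe file standing in for epi_r.csv.
toy_csv = io.StringIO(
    "title,rating,almond\n"
    "Lentil Soup,4.4,0\n"
    "Almond Cake,3.8,1\n"
    "Plain Toast,2.5,0\n"
)

# index_col turns the named column into row labels instead of a data column.
toy = pd.read_csv(toy_csv, index_col="title")

print(toy.index.tolist())                # the row names
print(toy.columns.tolist())              # 'title' is no longer a column
print(toy.loc["Almond Cake", "rating"])  # look up a row by its name
```

Once the titles are the index, they stay attached to each row through later steps, which is what makes them useful for reading off cluster members.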

4. Now, check out the dataset with the usual suspects (`.head()`, `.shape`, `.describe()`, etc.). Note that `.shape` is an attribute, so it takes no parentheses. Do you spot any issues with the data that might slow us down when attempting to analyze it?

5. Pay careful attention to the `count` row. There should be 20,052 values for each variable, but you should be able to see that at least a few columns (e.g., `calories`) have fewer. This can be a sign of missing values.

6. If a column has fewer values than the number of rows given in the output of `epi.shape`, some of the values in that column are missing. Examine the following line of code to see what it is doing, and then run it. How many columns have at least one missing value?

In [None]:
sum(epi.describe().iloc[0,] < 20052)
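
To see what that line is doing, here is a small sketch on a made-up dataframe (the column values are hypothetical). It compares the `describe()` approach with a more direct check using `isna()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two of the three columns contain missing values.
toy = pd.DataFrame({
    "calories": [250.0, np.nan, 410.0, 90.0],
    "protein": [12.0, 8.0, np.nan, np.nan],
    "almond": [0.0, 1.0, 0.0, 0.0],
})

n_rows = toy.shape[0]

# Approach from the lab: the first row of describe() is the non-missing count,
# so any column whose count is below the number of rows has missing values.
via_describe = sum(toy.describe().iloc[0] < n_rows)

# Equivalent check that asks directly whether any value in a column is missing.
via_isna = toy.isna().any().sum()

print(via_describe, via_isna)
```

One difference to keep in mind: by default `describe()` summarizes only numeric columns when the dataframe has any, while `isna()` looks at every column.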

7. In earlier labs, we focused on data cleaning and identifying unusual values in the nutritional variables. We could just drop rows with missing values, for example, with the `.dropna()` method, but I would like to focus on the "tag" variables in this lab, so let's delete the columns with numerical variables entirely.

Most of the time, I've been redefining dataframes with only a subset of columns by explicitly naming the variables I would like to keep. However, there are over 600 columns to keep here, and I don't really want to have to name all of them! Instead, I can supply the indices of the columns that I would like to keep. 

The line of code below returns a dataframe with all of the columns. Can you edit it so that I keep only the last 674 columns and drop the numerical variables (a.k.a., the first five columns)? Print out the column names to confirm.

In [None]:
epi = epi.iloc[:, 0:679]
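
If the `.iloc` syntax is rusty, here is a minimal sketch of selecting columns by position on a made-up dataframe. The column names and the cut-off position are hypothetical and much smaller than in `epi`.

```python
import pandas as pd

# Hypothetical toy frame: two numeric columns followed by three tag columns.
toy = pd.DataFrame(
    [[250.0, 12.0, 0, 1, 0], [410.0, 8.0, 1, 0, 0]],
    columns=["calories", "protein", "almond", "bake", "vegan"],
    index=["Almond Cake", "Lentil Soup"],
)

# iloc[rows, columns]: ':' keeps every row, '2:' keeps columns from position 2 on.
tags_only = toy.iloc[:, 2:]

# Negative positions count from the end, so this keeps the same three columns.
last_three = toy.iloc[:, -3:]

print(tags_only.columns.tolist())
```

Positions start at 0, and the end of a slice is excluded, just like with Python lists.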

# $K$-Means Clustering

8. *From Week 7 Course Notes, Monday, May 1* Let's take a look at how to cluster with $K$-Means. The `sklearn` function [`KMeans`](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) will do this for us--read the linked documentation to brush up on the syntax and see what kind of arguments you can change. I'm particularly interested in:

* `n_clusters`
* `n_init`

Obviously, `n_clusters` is the number of clusters. `n_init` is the number of times that we run the clustering algorithm from different random starting points; the best of those runs is kept. We mentioned in class that we need to run the clustering algorithm multiple times because of the random starts. This is especially true in large datasets like the one we are working with.

The following lines of code create 12 clusters using 10 random starts. Can you edit them so that they create 6 clusters with 20 random starts? Save the object as `clust_1` so we can see what outputs are possible in the next few questions.

In [None]:
from sklearn.cluster import KMeans

## keep random_state for reproducibility! You can change the seed if you want.
## (random.seed() does not control sklearn; KMeans takes its seed through random_state.)
kmeans1 = KMeans(n_clusters=12, n_init=10, random_state=944)
clust_1 = kmeans1.fit(epi)
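
Here is a small sketch of the same syntax on made-up data, where the right answer is obvious. The two groups of points and the parameter values are assumptions chosen for illustration, not the values asked for in this step.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: two well-separated groups of points in 2D.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=0.5, size=(20, 2))
group_b = rng.normal(loc=10.0, scale=0.5, size=(20, 2))
X = np.vstack([group_a, group_b])

# n_init=5 runs the algorithm from 5 different random starts and keeps the
# best run; random_state makes those starts repeatable.
model = KMeans(n_clusters=2, n_init=5, random_state=0).fit(X)

print(model.labels_)           # cluster label for each row, in row order
print(model.cluster_centers_)  # one center per cluster
```

Note that `.fit()` returns the fitted model itself, so `model` here plays the same role as `clust_1` above.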

9. We are interested in a few outputs. First, the within-cluster variation; in `sklearn`, this is called something different. Can you identify from the documentation which attribute of the fitted object you should extract? Once you find it, print it in the cell below.

In [None]:
clust_1

10. We used $K=6$ arbitrarily in Step 8, but of course we need to choose it. As I mentioned in class, we will need to write a loop and create the elbow plot to justify the choice of $K$. Fill in the loop below to get the inertia for values of $K$ from 1 to 25. (I'm picking a large number because the dataset is so large; in smaller datasets you may not need to go so far.)

In [None]:
inertias = []

for i in range():
    kmeans = KMeans()
    kmeans.fit()
    inertias.append()
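
For reference, here is a sketch of the same kind of loop on made-up data with three obvious groups. The data, the range of $K$, and the names `toy_inertias` and `k_values` are all assumptions for illustration; your loop on `epi` will need its own values.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: three well-separated groups of 2D points.
rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

toy_inertias = []
k_values = range(1, 7)  # K = 1, 2, ..., 6 (range excludes its upper bound)

for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    model.fit(X)
    toy_inertias.append(model.inertia_)

# Inertia drops sharply until K reaches the number of groups (3 here),
# then flattens out; that bend is the "elbow".
for k, inertia in zip(k_values, toy_inertias):
    print(k, round(inertia, 1))
```

On real data the bend is rarely this clean, which is why the plot is worth looking at rather than trusting a single number.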

11. Now, let's create the elbow plot. First, create a data frame with two columns--one for the value of $K$ and one for the inertias.

In [None]:
chooseK = {}
chooseK_df = pd.DataFrame()
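
As a reminder of the dictionary-to-dataframe pattern, here is a minimal sketch. The numbers are made up purely for illustration; real inertias come from your loop.

```python
import pandas as pd

# Hypothetical values for illustration only; real inertias come from KMeans.
example_k = [1, 2, 3, 4]
example_inertia = [400.0, 180.0, 60.0, 55.0]

# Each dictionary key becomes a column name; each list becomes that column.
example = {"K": example_k, "inertia": example_inertia}
example_df = pd.DataFrame(example)

print(example_df)
```

The two lists must have the same length, one entry per value of $K$.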

12. Now, create the plot with $K$ on the $x$-axis and inertia on the $y$. Do you see a clear choice for $K$?


In [None]:
import plotnine as p9

print(p9.ggplot() +
       p9.geom_line())

13. There does not appear to be a very clear "elbow" in the plot (although it's not terrible, either). This happens sometimes. I might consider the values 2, 5, 6, or 11 based on my plot (yours will look different). Pick a value and create an object `clust_2` on the `epi` dataset.

In [None]:
kmeans2 = KMeans()
clust_2 = 

14. One other thing we are interested in is the actual cluster labels. Using your `clust_2` object and the `sklearn` documentation, can you figure out how to print them out?


15. Now comes the fun part! We can try to interpret each cluster. You might learn more advanced methods for visualizing the clusters in later classes, but in this class, the best we can do is print out the "names" of the points in each cluster and see if there's a pattern. Adapt the code below to print out the names of the points in each cluster in turn (note that `sklearn` numbers the cluster labels starting at 0). Try to give each cluster a name. This might depend on context clues; you will really have to think about what the recipes have in common!

In [None]:
epi.index[clust_2.labels_ == 0.0]
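
Here is a sketch of why that line works, using four made-up recipes. The data and names are hypothetical. The comparison `labels_ == label` produces one True/False per row, and indexing the row names with it keeps only the rows in that cluster.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical toy data: 0/1 tag columns with recipe names as the index.
toy = pd.DataFrame(
    {"dessert": [1, 1, 0, 0], "soup": [0, 0, 1, 1]},
    index=["Almond Cake", "Lemon Tart", "Lentil Soup", "Miso Soup"],
)

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy)

# labels_ has one integer per row, in row order, numbered from 0.
for label in np.unique(model.labels_):
    members = toy.index[model.labels_ == label]
    print(label, members.tolist())

# value_counts gives the size of each cluster at a glance.
sizes = pd.Series(model.labels_).value_counts()
print(sizes)
```

Which group gets label 0 and which gets label 1 is arbitrary, so the label numbers themselves carry no meaning; only the groupings do.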

16. If you have time, try playing with some of the other settings in `KMeans()`. Do your clusters appear relatively stable?

# Hierarchical Clustering 

17. *From Week 7 Course Notes, Monday, May 1* Now let's move on to hierarchical clustering. Remember that I mentioned in class that this also goes by the name of agglomerative clustering, which is the name of the `sklearn` class: [`AgglomerativeClustering()`](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html). Read the linked documentation to brush up on the syntax and see what kind of arguments you can change. I'm particularly interested in:

* `n_clusters`
* `metric`
* `linkage`

`n_clusters` may seem a little odd, because one of the advantages of hierarchical clustering is that we do not have to choose a number of clusters. You can set this to `None` (and supply a `distance_threshold` instead) to get the whole dendrogram, but if you want to cut your dendrogram, this is one way to do it! That would make it easier to investigate the members of each cluster. `metric` is how to measure the distance between two points, and `linkage` is how to measure the distance between groups of points.

The following line of code creates clusters using `ward` linkage. Can you edit it so that it creates clusters with `complete` linkage? Save the object as `clust_3` so we can see what outputs are possible in the next few questions.


In [None]:
from sklearn.cluster import AgglomerativeClustering

clust_3 = AgglomerativeClustering(distance_threshold = 0, n_clusters = None, 
                                  linkage = 'ward').fit(epi)
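
To make the two ways of calling `AgglomerativeClustering` concrete, here is a sketch on four made-up points. The data and the choice of `average` linkage are assumptions for illustration only, not the setting asked for in this step.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical toy data: two tight pairs of points on a line.
X = np.array([[0.0], [1.0], [10.0], [11.0]])

# Option 1: ask for a fixed number of clusters (this "cuts" the tree).
cut = AgglomerativeClustering(n_clusters=2, linkage="average").fit(X)
print(cut.labels_)

# Option 2: build the full tree. n_clusters must be None when
# distance_threshold is set; a threshold of 0 means no merge is cut off.
full = AgglomerativeClustering(
    distance_threshold=0, n_clusters=None, linkage="average"
).fit(X)
print(full.children_)   # which two nodes merge at each step
print(full.distances_)  # the linkage distance of each merge
```

With Option 2, every point ends up with its own label, so the interesting output is the tree (`children_` and `distances_`) rather than `labels_`.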

18. As mentioned in class, we like the dendrogram output. Unfortunately, `sklearn` does not have a nice function for plotting dendrograms. However, I did find a nice example from `sklearn` for [plotting hierarchical clustering dendrograms](https://scikit-learn.org/stable/auto_examples/cluster/plot_agglomerative_dendrogram.html#sphx-glr-auto-examples-cluster-plot-agglomerative-dendrogram-py); it makes use of a function from `scipy`. One other major change is that it uses `matplotlib` syntax rather than `plotnine`; we'll have to wait for `plotnine` to catch up.

One thing you can do is copy and paste the function `plot_dendrogram` they have written in the tutorial, and apply that function to your own analysis. I've done that for you in the code chunk below--can you plot the dendrogram from the previous question?


In [None]:
from matplotlib import pyplot as plt
from scipy.cluster.hierarchy import dendrogram

def plot_dendrogram(model, **kwargs):
    # Create linkage matrix and then plot the dendrogram

    # create the counts of samples under each node
    counts = np.zeros(model.children_.shape[0])
    n_samples = len(model.labels_)
    for i, merge in enumerate(model.children_):
        current_count = 0
        for child_idx in merge:
            if child_idx < n_samples:
                current_count += 1  # leaf node
            else:
                current_count += counts[child_idx - n_samples]
        counts[i] = current_count

    linkage_matrix = np.column_stack(
        [model.children_, model.distances_, counts]
    ).astype(float)

    # Plot the corresponding dendrogram
    dendrogram(linkage_matrix, **kwargs)

In [None]:
plt.figure()                                      
plt.title("")
plt.xlabel("")

plot_dendrogram()

plt.show()    
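
Since `plot_dendrogram` passes any extra keyword arguments straight to `scipy`'s `dendrogram`, its truncation options are available too. Here is a sketch of what truncation does, on made-up data. It calls `scipy` directly instead of going through `sklearn`, and uses `no_plot=True` so that nothing is drawn; both are choices made for this illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# Hypothetical toy data: 40 random points in 2D.
rng = np.random.default_rng(2)
X = rng.normal(size=(40, 2))

# scipy's linkage builds the same kind of matrix that plot_dendrogram assembles.
Z = linkage(X, method="average")

# truncate_mode="lastp" with p=5 collapses the tree to its last 5 merged groups.
# no_plot=True returns the layout without drawing; drop it to draw the figure
# with matplotlib.
full = dendrogram(Z, no_plot=True)
short = dendrogram(Z, truncate_mode="lastp", p=5, no_plot=True)

print(len(full["leaves"]), len(short["leaves"]))
```

In the lab's setting, the equivalent would be passing `truncate_mode` and `p` as extra arguments to `plot_dendrogram`, after the fitted model.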

19. Unfortunately, this dendrogram isn't very helpful because there are so many observations! Let's try re-running the hierarchical clustering with `n_clusters` set to the same value of $K$ you used previously (when you set `n_clusters`, leave out `distance_threshold`). Are your clusters somewhat stable, or did the interpretations change?

In [None]:
clust_3 =

epi.index[clust_3.labels_ == 0.0]
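
One way to compare two sets of cluster labels is a cross-tabulation. Here is a sketch with made-up label vectors; the labels are hypothetical and do not come from `clust_2` or `clust_3`.

```python
import pandas as pd

# Hypothetical label vectors from two clusterings of the same six rows.
labels_a = [0, 0, 1, 1, 2, 2]
labels_b = [1, 1, 0, 0, 2, 2]

# Rows are clusters from the first method, columns from the second;
# each cell counts the observations that fall in both.
table = pd.crosstab(
    pd.Series(labels_a, name="a"),
    pd.Series(labels_b, name="b"),
)

print(table)
```

Here each row has a single non-zero cell, so the two clusterings agree exactly even though the label numbers differ. Counts spread across many cells in a row would suggest the clusters are not stable.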