# [Chapter 7] Clustering the nutrition data

[DSLC stages]: Analysis

**Note that the clustering results in the book were created with R, which has different implementations of the clustering algorithms. The results computed here thus look a bit different from the version you will find in the book.**

In this document, we will conduct a cluster analysis on the food nutrition data. Recall that our goal is to come up with some meaningful food groups that will help us organize the food items into categories for the user of our hypothetical nutrition app. For example, if a user wanted to look up an item, they could click on the "meats" category or the "dessert" category to filter to the food item they are searching for.


We will continue to use the version of the FNDDS dataset that we ended up with previously: each variable has been *log-transformed* and *scaled* (divided by its standard deviation), but left *uncentered* (we did not subtract the mean from each variable before dividing by the standard deviation). We will later explore the stability of our results to see how these pre-processing judgment calls impact our cluster findings (although note that the judgment call to not mean-center should have no impact on our clustering results, because subtracting the same value from every observation of a variable does not change the distances between data points).
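As a quick illustration of why mean-centering should not matter, the sketch below checks on made-up data (not the FNDDS data) that the Euclidean distances between rows are the same before and after centering. The helper `pairwise_distances` and the toy array `X` are hypothetical names used only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
# toy data: 6 items measured on 3 variables
X = rng.normal(loc=5.0, scale=2.0, size=(6, 3))

def pairwise_distances(data):
    """Euclidean distance between every pair of rows."""
    diffs = data[:, None, :] - data[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=2))

# subtract each column's mean from that column
X_centered = X - X.mean(axis=0)

# the two distance matrices agree up to floating-point error
print(np.allclose(pairwise_distances(X), pairwise_distances(X_centered)))
```

Both hierarchical clustering with Euclidean distance and K-means depend on the data only through such distances, which is why the centering choice drops out.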

The following code sets up the libraries and the cleaned and pre-processed data that we will use in this document.


In [1]:
import pandas as pd
import numpy as np
from random import sample
import plotly.express as px
import plotly.graph_objects as go
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score, silhouette_samples, rand_score, adjusted_rand_score
from itertools import product

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 150)

In [2]:
from functions.preprocess_food_data import preprocess_food_data
from functions.clean_food_data import clean_food_data

In [3]:
nutrient_amount = pd.read_csv("../data/food_nutrient.csv")
food_name = pd.read_csv("../data/food.csv")
nutrient_name = pd.read_csv("../data/nutrient_name.csv")

# create the clean dataset
food_fndds = clean_food_data(nutrient_amount_data=nutrient_amount, 
                             food_name_data=food_name, 
                             nutrient_name_data=nutrient_name, 
                             select_data_type="survey_fndds_food")

# Preprocess the FNDDS data
food_fndds_scaled = preprocess_food_data(food_fndds, 
                                         log_transform=False,
                                         center=False,
                                         scale=True,
                                         remove_fat=False)

food_fndds_log_scaled = preprocess_food_data(food_fndds, 
                                             log_transform=True,
                                             center=False,
                                             scale=True,
                                             remove_fat=False)





## Applying hierarchical and K-means clustering with 6 clusters to the training data

Let's apply hierarchical clustering to the `food_fndds_log_scaled` pre-processed training dataset.

In [4]:

food_hclust_init = AgglomerativeClustering(n_clusters=6, metric='euclidean', linkage='ward')
food_hclust = food_hclust_init.fit(food_fndds_log_scaled).labels_ + 1



The resulting `food_hclust` object is an array of cluster memberships:

In [5]:
food_hclust

array([2, 2, 2, ..., 1, 1, 2])


Next, we can apply K-means with $K = 6$:


In [6]:
food_kmeans_init = KMeans(n_clusters=6, random_state=879, n_init="auto")
food_kmeans = food_kmeans_init.fit(food_fndds_log_scaled).labels_ + 1


In [7]:
# rearrange the clusters so that they match the hierarchical clusters more closely (this is based on the observations below)
label_indexes = [food_kmeans == 1,
                 food_kmeans == 2,
                 food_kmeans == 3,
                 food_kmeans == 4,
                 food_kmeans == 5,
                 food_kmeans == 6]
new_labels = [5, 6, 4, 3, 1, 2]
food_kmeans = np.select(label_indexes, new_labels)
#food_kmeans



The resulting `food_kmeans` object is also an array of cluster memberships:


In [8]:
food_kmeans

array([3, 4, 4, ..., 5, 5, 3])
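The K-means labels above were rearranged by hand. As a rough sketch of how such a relabeling could be automated (this is not how the mapping above was chosen), the hypothetical helper `match_labels` below tries every permutation of the label values and keeps the one that agrees with a reference labeling most often. It assumes both labelings use the same set of label values, and it is only practical for a small number of clusters (6 clusters means 720 permutations).

```python
from itertools import permutations
import numpy as np

def match_labels(reference, labels):
    """Relabel `labels` so that it agrees with `reference` as often as possible."""
    reference = np.asarray(reference)
    labels = np.asarray(labels)
    values = np.unique(labels)
    best_agreement, best_relabeled = -1, labels
    for perm in permutations(values):
        # candidate mapping: old label value -> new label value
        mapping = dict(zip(values, perm))
        relabeled = np.array([mapping[v] for v in labels])
        agreement = int((relabeled == reference).sum())
        if agreement > best_agreement:
            best_agreement, best_relabeled = agreement, relabeled
    return best_relabeled

# toy example: the same grouping with the label names shuffled
reference = np.array([1, 1, 2, 2, 3, 3])
labels = np.array([3, 3, 1, 1, 2, 2])
print(match_labels(reference, labels))  # [1 1 2 2 3 3]
```

On the real data this would be called with the hierarchical labels as the reference and the original K-means labels as the second argument.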


## Examining the clusters

### Qualitative explorations

Let's look at some randomly sampled food items from each cluster, and the size of each cluster. 


Let's define a function that returns a random sample of 15 food items from each cluster.


In [9]:
def sample_foods_cluster(description, cluster, seed):
    food_samples_by_cluster = (pd.DataFrame({"description": description, 
                                             "cluster": cluster})
        .groupby("cluster")
        .sample(15, random_state = seed))
    # number the sampled rows 0-14 within each cluster so that the pivot below
    # places the clusters side by side (this works for any number of clusters)
    food_samples_by_cluster.index = (food_samples_by_cluster
                                     .groupby("cluster")
                                     .cumcount()
                                     .to_numpy())
    # pivot to wider format
    food_samples_by_cluster = food_samples_by_cluster.pivot(columns="cluster").droplevel(axis=1, level=0)
    # add word cluster to column names
    food_samples_by_cluster.columns = ["cluster_" + str(i) for i in food_samples_by_cluster.columns]

    return food_samples_by_cluster



We also define a function that plots the number of food items in each cluster.

In [10]:
def plot_cluster_size(cluster, title=None):
    cluster_size = pd.Series(cluster).value_counts()
    return px.bar(cluster_size, title=title)


Let's look at the hierarchical clusters in terms of 15 randomly selected food items from each cluster and the size of each cluster.

Note that since the results in the book were produced using R, they will look slightly different to what is shown in this document.

In [11]:
sample_foods_cluster(description=food_fndds_log_scaled.index,
                     cluster=food_hclust, 
                     seed=23489)


index,cluster_1,cluster_2,cluster_3,cluster_4,cluster_5,cluster_6
0,"Perch, baked or broiled, made with cooking spray","Almond milk, sweetened, chocolate","Trout, cooked, NS as to cooking method","Cookie, coconut",Cereal (Kellogg's Low Fat Granola),"Artichoke, cooked, from fresh, made with marga..."
1,"Cheese enchilada, frozen meal","Cream of wheat, regular or quick, made with wa...","Mullet, baked or broiled, fat not added in coo...","Ice cream bar or stick, rich ice cream, chocol...","Nutritional powder mix, high protein (Herbalife)","Greens, cooked, from canned, fat added in cook..."
2,"Cookie, rugelach","Ice pop, sweetened with low calorie sweetener",Mussels with tomato-based sauce,Baby Ruth,Cereal (Malt-O-Meal Colossal Crunch),"Beef, rice, and vegetables including carrots, ..."
3,"Pretzels, soft, from school lunch","Coffee, Cafe Mocha, nonfat","Mackerel, baked or broiled, fat not added in c...",Whatchamacallit,"Brown rice cereal, baby food, dry, instant","Asparagus, cooked, from frozen, fat not added ..."
4,"Lentils, dry, cooked, NS as to fat added in co...","Orange-apple-banana juice, baby food","Barracuda, steamed or poached",Whipped topping,Cereal (Malt-O-Meal Golden Puffs),"Broccoli, cooked, from fresh, made with butter"
5,"Crisp, rhubarb","Bean sprouts, cooked, NS as to form, fat added...","Sardines, canned in oil","Ice cream sundae, not fruit or chocolate toppi...","Tomatoes, red, dried","Pumpkin, cooked, from frozen, fat not added in..."
6,"Beans, lima, immature, from canned, creamed or...","Infant formula, powder, made with water, NFS (...","Swordfish, cooked, NS as to cooking method","Beer cheese soup, made with milk","Cereal, chocolate flavored, frosted, puffed corn","Stewed beans with pork, tomatoes, and chili pe..."
7,"Egg casserole with bread, cheese, milk and meat","Yogurt, Greek, nonfat milk, flavors other than...","Salmon, dried","Potato, scalloped, from dry mix","Cereal, muesli","Vegetable and turkey, baby food, junior"
8,Honey mustard dip,"Prunes with oatmeal, baby food, strained","Salmon, coated, fried, made with oil","Cake, tres leche",Cereal (Kellogg's Honey Smacks),"Cactus, raw"
9,"Frankfurter or hot dog sandwich, with chili, o...","Coconut water, unsweetened","Sardines, cooked","Cheese spread, cream cheese, regular",Cereal (Kellogg's Apple Jacks),"Corn, cooked, NS as to form, with cream sauce,..."


Let's also look at the number of food items in each cluster:

In [12]:
plot_cluster_size(food_hclust)


For the hierarchical clusters (note that these differ from the R implementation shown in the book):

- The first cluster contains vegetables, chicken and seafood.

- The second cluster has no clear theme.

- The third cluster contains seafood.

- The fourth cluster contains dairy and desserts.

- The fifth cluster has mostly cereals.

- The sixth cluster has no clear theme, and is also the largest cluster.




Next, let's look at the K-means clusters:

In [13]:
sample_foods_cluster(description=food_fndds_log_scaled.index,
                     cluster=food_kmeans, 
                     seed=23489)

index,cluster_1,cluster_2,cluster_3,cluster_4,cluster_5,cluster_6
0,"Fish, NS as to type, coated, fried, made with ...","Vegetable rice soup, canned, prepared with wat...","Bananas and pineapple, baby food, NS as to str...","Crepe, NS as to filling","Grilled cheese sandwich, Cheddar cheese, on wh...","Barley cereal, baby food, dry, instant"
1,"Tilapia, coated, baked or broiled, made with m...","Broccoflower, cooked, made with oil","Peach, cooked or canned, NS as to sweetened or...","Chocolate milk, made from no sugar added dry m...","Cookie, sugar wafer, sugar free",Cereal (Post Grape-Nuts Flakes)
2,"Meat loaf made with beef, veal and pork",Matzo ball soup,"Dutch apple dessert, baby food, junior","Chocolate milk, made from reduced sugar mix wi...",Taco or tostada with chicken and sour cream,"Seaweed, dried"
3,"Croaker, coated, baked or broiled, fat added i...","Peas and corn, cooked, fat added in cooking, N...","Apple, raw","Refried beans, made with oil",Sweet potato chips,"Cereal bar with fruit filling, baby food"
4,"Chicken breast, baked, broiled, or roasted, sk...","Beets, cooked, from canned, made with margarine",Gelatin dessert with whipped cream,"Meat, baby food, NS as to type, NS as to strai...","Cake, carrot, diet",Cereal (Kellogg's Corn Flakes)
5,"Chiliburger, with or without cheese, on bun","Turnip greens, canned, reduced sodium, cooked,...","Grapefruit juice, 100%, with calcium added","Infant formula, liquid concentrate, made with ...","Onion rings, from fresh, batter-dipped, baked ...",Cereal (Kellogg's Cinnabon)
6,"Flounder, baked or broiled, made with margarine","Minestrone soup, canned, prepared with water, ...","Milk, lactose free, fat free (skim)","Turkey, ham, and roast beef club sandwich, wit...","Calzone, with cheese, meatless",Cereal (Post Alpha-Bits)
7,"Chicken nuggets, from fast food","Sausage, potatoes, and vegetables including ca...",Taffy,"Refried beans, made with animal fat or meat dr...","Potato sticks, fry shaped",Soybean meal
8,"Lamb, loin chop, cooked, NS as to fat eaten","Rice, white, with carrots and tomatoes and/or ...",Mai Tai,"Infant formula, NS as to form (Gerber Graduate...",Carob chips,"Nutritional powder mix, sugar free (Slim Fast)"
9,"Veal cutlet or steak, fried, lean and fat eaten","Lettuce, raw",Mimosa,Ham and noodles with cream or white sauce,"Snack cake, not chocolate, with icing or filli...",Cereal (Post Honey Bunches of Oats with Almonds)


In [14]:
plot_cluster_size(food_kmeans)

For the K-means clusters:

- The first cluster has no clear theme.

- The second cluster has no clear theme.

- The third cluster contains seafood.

- The fourth cluster contains dairy and desserts.

- The fifth cluster contains cereals and other things.

- The sixth cluster contains vegetables.



Overall, it seems like 6 is probably not enough clusters, but it is encouraging to see distinct themes emerging in at least some of the clusters for both algorithms. 

Since it is easier to visualize fewer clusters, we will first explore these sets of six clusters, to give a sense of how to examine and compare the clusters we have computed, before conducting an analysis to identify what a more appropriate number of clusters might be.





### Comparing variable distributions across each cluster

The plot below shows the distribution of the "carbohydrates" nutrient variable across the (a) hierarchical and (b) K-means clusters using boxplots. Notice that the third and fourth clusters from both algorithms tend to consist of more carbohydrate-heavy food items relative to the other clusters. We have re-ordered the K-means clusters so that they line up with the hierarchical clusters as much as possible (since the original cluster order was arbitrary).

In [15]:
carbs_by_cluster = food_fndds_log_scaled.copy()
carbs_by_cluster["hclust"] = food_hclust
carbs_by_cluster["kmeans"] = food_kmeans
carbs_by_cluster = carbs_by_cluster.melt(value_vars=["hclust", "kmeans"], 
                                         var_name="algorithm",
                                         value_name="cluster",
                                         id_vars=["carbohydrates"], 
                                         ignore_index=False)
carbs_by_cluster

description,carbohydrates,algorithm,cluster
"Milk, human",1.838299,hclust,2
"Milk, NFS",1.572064,hclust,2
"Milk, whole",1.564424,hclust,2
"Milk, low sodium, whole",1.510663,hclust,2
"Milk, calcium fortified, whole",1.564424,hclust,2
...,...,...,...
Breading or batter as ingredient in food,3.297084,kmeans,4
Wheat bread as ingredient in sandwiches,3.412933,kmeans,4
Sauce as ingredient in hamburgers,2.581662,kmeans,5
Industrial oil as ingredient in food,0.000000,kmeans,5


In [16]:
px.box(carbs_by_cluster, x="cluster", y="carbohydrates", facet_col="algorithm")


### Projecting clusters onto a two-dimensional scatterplot

Next, let's visualize the two sets of clusters. The clusters were computed using all of the variables, but we will plot them in the two-dimensional space defined by just the sodium and protein variables.


In [17]:
foods_1000 = food_fndds_log_scaled.copy()
foods_1000["hclust"] = food_hclust
foods_1000["kmeans"] = food_kmeans
foods_1000 = foods_1000.sample(1000, random_state=3487)

foods_1000_long = foods_1000.melt(value_vars=["hclust", "kmeans"], 
                                  var_name="algorithm",
                                  value_name="cluster",
                                  id_vars=["sodium", "protein"], 
                                  ignore_index=False)
foods_1000_long["cluster"] = foods_1000_long["cluster"].astype(str)

In [18]:
px.scatter(foods_1000_long, x="sodium", y="protein", 
           color="cluster", 
           opacity=0.3, 
           facet_col="algorithm")


The two sets of clusters bear similarities, but also differences. Again, we have tried to manually arrange the clusters so that the same colors were capturing similar clusters in each plot, but since the hierarchical and K-means clusters are not perfectly identical, the colors don't exactly "match up".



## Quantifying cluster quality

Next, we can compute some of the quantitative metrics of "cluster quality" for our clusters.


### Within-cluster Sum of Squares


Let's compute the within-cluster sum of squares for each set of clusters.  This compares each data point to the center/mean of its cluster, squares the distance, and then adds up all of these squared distances.
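To make the definition concrete, here is a minimal worked example on four made-up two-dimensional points (not the food data). The arrays `points` and `clusters` are illustrative assumptions.

```python
import numpy as np

# toy data: four points in two dimensions, split into two clusters
points = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 1.0], [10.0, 3.0]])
clusters = np.array([1, 1, 2, 2])

wss_total = 0.0
for label in np.unique(clusters):
    members = points[clusters == label]
    center = members.mean(axis=0)
    # squared Euclidean distance from each member to its cluster center
    wss_total += ((members - center) ** 2).sum()

print(wss_total)  # 4.0
```

The first cluster has center (1, 0) and the second has center (10, 2). Each point lies at distance 1 from its center, so the four squared distances add up to 4.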

Note that the K-means algorithm reports the total within-cluster sum of squares, added across all of the clusters, as the `inertia_` attribute of the fitted object (since we can't match the random seed between R and Python, this is slightly different from the result reported in the book).


In [19]:
round(food_kmeans_init.inertia_, 1)

319090.5


The hierarchical clustering function, however, does not report the total within-cluster sum of squares, so we need to write a function that computes the within-cluster SS for a given dataset and set of clusters.


In [20]:
def tot_within_sum_of_square(data, clusters):
    data_clustered = data.copy()
    # add clusters to data frame
    data_clustered["cluster"] = clusters
    # compute the cluster means for each variable
    data_clustered_mean = data_clustered.groupby("cluster").mean().reset_index()
    # melt both data frames to long-form
    data_clustered_mean_melted = data_clustered_mean.melt(id_vars="cluster",
                                                          value_name="cluster_mean")
    data_clustered_melted = data_clustered.melt(id_vars="cluster")
    # add cluster means to data
    data_clustered_melted = data_clustered_melted.merge(data_clustered_mean_melted, 
                                                    on=["cluster", "variable"])
    # compute the squared deviation from the cluster mean for each value
    data_clustered_melted["wss"] = (data_clustered_melted["value"] - data_clustered_melted["cluster_mean"])**2
    # add SS across each cluster
    wss = data_clustered_melted.groupby("cluster")["wss"].sum()
    return round(sum(wss), 1)

The total within sum of squares for the hierarchical clustering algorithm is computed to be:

In [21]:
tot_within_sum_of_square(food_fndds_log_scaled, food_hclust)

324588.1


The total within sum of squares for K-means is slightly lower:

In [22]:
tot_within_sum_of_square(food_fndds_log_scaled, food_kmeans)

319090.5


Since a lower sum of squares indicates "better" performance, this indicates that the K-means algorithm yields slightly "better" clusters, at least according to this WSS metric. 

#### A simulation study: WSS vs K

The following plot shows how the WSS changes as $K$ increases for the same dataset. It is very clear that as $K$ increases, the WSS decreases.

In [23]:
def fit_kmeans(data, k):
    return KMeans(n_clusters=k, n_init="auto").fit(data).labels_
k_range = np.arange(3, 21)
wss_simulation_k = [tot_within_sum_of_square(food_fndds_log_scaled, fit_kmeans(food_fndds_log_scaled, k)) for k in k_range]

px.line(pd.DataFrame({"K": k_range,
                      "WSS": wss_simulation_k}),
                      x = "K", y="WSS")


#### A simulation study: WSS vs sample size

Next, we can conduct a similar study, but this time observing how the WSS changes as we gradually increase the sample size (using subsampling) while fixing $K = 6$.  From this figure, it is clear that as the sample size increases, the WSS increases too. 

In [24]:
def sampled_kmeans_wss(data, prop, k):
    data_sampled = data.copy()
    data_sampled = data_sampled.sample(int(round(prop * len(data_sampled.index), 0)))
    return tot_within_sum_of_square(data_sampled, fit_kmeans(data_sampled, k))


prop_range = np.arange(0.1, 1.1, 0.1)
wss_simulation_prop = [sampled_kmeans_wss(food_fndds_log_scaled, prop, k=6) for prop in prop_range]
sample_size_range = prop_range * len(food_fndds_log_scaled.index)
px.line(pd.DataFrame({"sample_size": sample_size_range,
                      "WSS": wss_simulation_prop}),
                      x = "sample_size", y="WSS")



### Silhouette score


Next, we can evaluate the cluster quality by computing the silhouette score for each data point, and plotting them as bars. 
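As a reminder of what is being computed, the silhouette width of a data point compares $a$, its average distance to the other points in its own cluster, with $b$, its average distance to the points in the nearest other cluster, via $(b - a) / \max(a, b)$. The sketch below computes this by hand for one point of a made-up one-dimensional dataset. The function `silhouette_of_point` is a hypothetical helper that assumes every cluster has at least two points.

```python
import numpy as np

# toy data: four one-dimensional points in two clusters
points = np.array([[0.0], [1.0], [5.0], [6.0]])
clusters = np.array([1, 1, 2, 2])

def silhouette_of_point(i, points, clusters):
    """Silhouette width of point i: (b - a) / max(a, b)."""
    distances = np.sqrt(((points - points[i]) ** 2).sum(axis=1))
    same = clusters == clusters[i]
    same[i] = False  # exclude the point itself
    # a: average distance to the other points in the same cluster
    a = distances[same].mean()
    # b: smallest average distance to the points of any other cluster
    b = min(distances[clusters == other].mean()
            for other in np.unique(clusters) if other != clusters[i])
    return (b - a) / max(a, b)

print(round(silhouette_of_point(0, points, clusters), 3))  # 0.818
```

For the point at 0, $a = 1$ and $b = 5.5$, which gives $4.5 / 5.5 \approx 0.818$. Values close to 1 indicate a point that sits comfortably inside its cluster, while negative values indicate a point that is closer, on average, to another cluster.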

First, we will calculate the average silhouette score for each set of clusters:

In [25]:
silhouette_score(food_fndds_log_scaled, food_hclust)

0.10070346109742716

In [26]:
silhouette_score(food_fndds_log_scaled, food_kmeans)

0.093656730829967

Note that the values here are different from the versions we computed in R. In this Python run, the hierarchical clusters have a slightly higher average silhouette score (0.101) than the K-means clusters (0.094), so, unlike the R results in the book, the silhouette metric does not favor K-means here. The two scores are very close, however, and both are low.

Let's write a function that takes the data and a set of cluster labels, computes the silhouette score for each data point, and creates a silhouette plot from the results.


In [27]:
def plot_silhouette(data, cluster):
    silhouette_scores = silhouette_samples(data, cluster)
    silhouette_df = pd.DataFrame({"silhouette": silhouette_scores,
                                  "cluster": cluster.astype("str")}).sort_values(["cluster", "silhouette"])
    silhouette_df["i"] = np.arange(len(silhouette_scores))
    fig = px.bar(silhouette_df, x="silhouette", y="i", 
                 color="cluster", orientation="h")
    fig.update_traces(marker_line_width=0)
    return fig




The following plots show the silhouette widths for each observation in each cluster. We matched the colors to the scatterplots above. Note that there definitely seem to be more negative silhouette widths in each of the hierarchical clusters than in the K-means clusters.



In [28]:
plot_silhouette(food_fndds_log_scaled, food_hclust)

In [29]:
plot_silhouette(food_fndds_log_scaled, food_kmeans)



### Rand and adjusted Rand indexes


Another metric that we can compute is the Rand index and the adjusted Rand index. Rather than being metrics for summarizing the "performance" of one set of clusters (as the WSS and silhouette score were), these are metrics for *comparing* two different sets of clusters of the same data units. This will be useful for conducting a stability analysis for each clustering algorithm (to compare how much the resulting clusters from each algorithm change across perturbations), but for now, let's compare the K-means clusters we have created to the hierarchical clustering clusters. 

The Rand index and the adjusted Rand index for the K-means and hierarchical clusters are:

In [30]:
rand_score(food_hclust, food_kmeans)

0.7494763758947631

In [31]:
adjusted_rand_score(food_hclust, food_kmeans)

0.33156877264369855



These indicate some similarity, but since an adjusted Rand index near 0 indicates agreement no better than chance, while an adjusted Rand index of 1 indicates perfect agreement, it seems that these two sets of clusters have some substantial differences.
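To make the (unadjusted) Rand index concrete, the sketch below computes it from scratch on two tiny made-up labelings: it is the fraction of pairs of items that the two clusterings treat the same way (both place the pair together, or both place it apart). The function `rand_index` is a hypothetical helper written for illustration. In practice, `rand_score` from scikit-learn (used above) does this far more efficiently.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of pairs of items on which two clusterings agree."""
    agree = 0
    total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        # the pair counts as an agreement if both clusterings put it
        # together, or both put it apart
        agree += same_a == same_b
        total += 1
    return agree / total

a = [1, 1, 2, 2]
print(rand_index(a, [2, 2, 1, 1]))            # 1.0
print(round(rand_index(a, [1, 2, 1, 2]), 3))  # 0.333
```

The first comparison gives 1.0 because the grouping is identical and only the label names differ, which is why the index can compare clusterings whose labels do not line up. The second gives 2 agreeing pairs out of 6.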



## Choosing K (the number of clusters)

The code above demonstrates how to conduct a cluster analysis when you already know the number of clusters that you are trying to identify. We based our original choice of $K = 6$ clusters on our admittedly novice domain understanding of food groups.

### Quantitative assessments of K

In this section, we will use cross validation to explore whether there might be some other number of clusters, $K$, that might "better" (in terms of the metrics we have examined above) separate the food items into clusters.

Since our above analyses found that the K-means algorithm had the lower total WSS, with an average silhouette score similar to that of the hierarchical clustering algorithm, we will focus the rest of our analyses on the K-means algorithm.


Let's first create some cross validation folds, and add the fold index as a column to our data.


In [32]:
n_obs = len(food_fndds_log_scaled.index)
# assign the fold labels 0-9 as evenly as possible
# (this also works when n_obs is not a multiple of 10)
fold = [i % 10 for i in range(n_obs)]
# randomly rearrange fold list
fold = sample(fold, n_obs)


In [33]:
# add folds to food data
food_cv = food_fndds_log_scaled.copy()
food_cv["fold"] = fold


Then we want to iterate through all values of $K$ we will consider (10 to 50). For each fold, we will use all but the selected fold to identify the clusters, and then compute the silhouette score on the remaining withheld fold. Since we have 10 folds, we will thus end up with 10 values of the metric for each $K$. The cluster membership of each withheld data point is decided based on which cluster center it is closest to.

First, let's write a function that computes which cluster each withheld data point belongs to. 

In [34]:
def compute_cluster(data, centers):
    """Returns, for each row of data, the index of the closest center (by squared Euclidean distance)"""
    k = len(centers)
    distances = np.empty((k, len(data.index)))
    for i in range(k):
        distances[i] = np.square(centers[i] - data).sum(axis=1)
    return np.argmin(distances, axis=0) # return the index of the lowest distance

In [35]:

# try out our function
compute_cluster(food_fndds_log_scaled, food_kmeans_init.cluster_centers_)

array([3, 2, 2, ..., 0, 0, 3])
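For readers who prefer to see the same idea on plain arrays, here is a small sketch that assigns made-up points to the nearest of two made-up centers using NumPy broadcasting. The function `nearest_center` and the toy inputs are assumptions for illustration, not part of the analysis.

```python
import numpy as np

def nearest_center(points, centers):
    """Index of the closest center (squared Euclidean distance) for each row."""
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # shape (n_points, n_centers): squared distance from every point to every center
    sq_dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return sq_dist.argmin(axis=1)

centers = [[0.0, 0.0], [10.0, 10.0]]
points = [[1.0, -1.0], [9.0, 12.0], [4.0, 4.0]]
print(nearest_center(points, centers))  # [0 1 0]
```

For a fitted K-means object, scikit-learn's `KMeans.predict` method also assigns new points to the nearest fitted center. Writing the assignment ourselves has the advantage that the same function can be applied to centers computed from the hierarchical clusters.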

Then we will use this function to loop through each fold and each value of $K$ to compute a range of CV silhouette scores. 

This code is a bit complex (and is almost certainly not the most efficient way to do this, as it takes a while to run), but try to run through it line by line to understand what it is doing (e.g., define `k = 10` and `_fold = 1` to see what is happening inside the loops).

In [36]:
np.random.seed(seed=389)
k_range = np.arange(10, 51, 5)
k_pos = 0
eval_cv_k_hclust = np.empty((10, len(k_range)))
eval_cv_k_kmeans = np.empty((10, len(k_range)))
for k in k_range:
    for _fold in np.arange(0, 10, 1):
        # define the "training" folds: all but the current fold
        # this will be used to define the clusters
        food_train_fold = food_cv.copy()
        food_train_fold = food_train_fold.query("fold != @_fold").drop(columns=["fold"])
        # define the "validation" fold: the current withheld fold
        # this will be used to evaluate the clusters
        food_val_fold = food_cv.copy()
        food_val_fold = food_val_fold.query("fold == @_fold").drop(columns=["fold"])

        # run k-means with the current k on the "training" fold
        kmeans_fold_init = KMeans(n_clusters=k, n_init="auto")
        kmeans_fold_fit = kmeans_fold_init.fit(food_train_fold)
        kmeans_fold = kmeans_fold_fit.labels_ 
        # run hclust with the current k on the "training" fold
        hclust_fold_init = AgglomerativeClustering(n_clusters=k, metric='euclidean', linkage='ward')
        hclust_fold_fit = hclust_fold_init.fit(food_train_fold)
        hclust_fold = hclust_fold_fit.labels_

        # compute the cluster centers for hclust
        food_train_fold_hclust = food_train_fold.copy()
        # add clusters to data frame
        food_train_fold_hclust["cluster"] = hclust_fold
        # compute the cluster means for each variable
        hclust_fold_centers = food_train_fold_hclust.groupby("cluster").mean().reset_index(drop=True)
        hclust_fold_centers = hclust_fold_centers.values.tolist()

        # identify which cluster the "validation" data points are in based on which
        # "training" cluster centers it is closest to
        kmeans_val_membership = compute_cluster(food_val_fold, kmeans_fold_fit.cluster_centers_)
        hclust_val_membership = compute_cluster(food_val_fold, hclust_fold_centers)

        # calculate silhouette for "validation" fold
        eval_cv_k_kmeans[_fold, k_pos] = silhouette_score(food_val_fold, kmeans_val_membership)
        eval_cv_k_hclust[_fold, k_pos] = silhouette_score(food_val_fold, hclust_val_membership)
    k_pos += 1



Let's look at the results of our cross validation analysis by plotting the distribution of the silhouette scores (computed for each CV iteration using only the data in the withheld fold) for each $K$. The figures below show that there is generally a higher silhouette score for larger values of $K$, but that this improvement levels off around $K = 25$.

In [37]:
# compute the average silhouette score across the folds for each k
kmeans_mean_sil_cv_k = eval_cv_k_kmeans.mean(axis=0)
hclust_mean_sil_cv_k = eval_cv_k_hclust.mean(axis=0)

In [38]:
fig = px.box(eval_cv_k_kmeans, labels=dict(variable="K", value="Silhouette"))
fig.update_layout(
    xaxis = dict(
        tickmode = 'array',
        tickvals = np.arange(len(k_range)),
        ticktext = [str(x) for x in k_range]
    )
)
fig.show()

In [39]:
fig = px.box(eval_cv_k_hclust, labels=dict(variable="K", value="Silhouette"))
fig.update_layout(
    xaxis = dict(
        tickmode = 'array',
        tickvals = np.arange(len(k_range)),
        ticktext = [str(x) for x in k_range]
    )
)
fig.show()


Let's take a look at the $K = 30$ results. 




### Qualitative assessments of K



Let's look at the clusters we uncover with K-means with $K = 30$.

In [40]:
kmeans_clust_k30_init = KMeans(n_clusters=30, n_init="auto", random_state=914)
kmeans_clust_k30_fit = kmeans_clust_k30_init.fit(food_fndds_log_scaled)
kmeans_clust_k30 = kmeans_clust_k30_fit.labels_ + 1


We can examine the themes of the clusters by looking at samples of the food items that ended up in each cluster, as well as the size of each cluster.

In [41]:
food_clustered_30 = food_fndds_log_scaled.copy()
food_clustered_30["cluster"] = kmeans_clust_k30

In [42]:
food_clustered_samples = food_clustered_30.reset_index() \
                 .groupby("cluster")[["cluster", "description"]] \
                 .sample(15, random_state=311)
                 
food_clustered_samples["index"] = list(range(15))*30
food_clustered_samples = food_clustered_samples.set_index("index")
food_clustered_samples.pivot(columns="cluster")
                

index,cluster_1,cluster_2,cluster_3,cluster_4,cluster_5,cluster_6,cluster_7,cluster_8,cluster_9,cluster_10,cluster_11,cluster_12,cluster_13,cluster_14,cluster_15,cluster_16,cluster_17,cluster_18,cluster_19,cluster_20,cluster_21,cluster_22,cluster_23,cluster_24,cluster_25,cluster_26,cluster_27,cluster_28,cluster_29,cluster_30
0,"Fish shish kabob with vegetables, excluding po...","Milk dessert bar, frozen, made from lowfat mil...","Thousand Island dressing, light","Cauliflower, cooked, from frozen, fat added in...","Pork, spareribs, cooked, NS as to fat eaten","Cereal or granola bar, with coconut, chocolate...",Whiskey,Cereal (General Mills Lucky Charms Chocolate),"Sardines, canned in oil","Broccoli, cooked, from fresh, NS as to fat add...","Light ice cream, fudgesicle","Whiting, baked or broiled, made with butter",Nutrition bar (South Beach Living Meal Bar),"Chicken drumstick, baked, coated, skin / coati...","Cream, whipped",Popover,"Coffee, Cappuccino, nonfat","Vegetables, NS as to type, cooked, made with oil",Passion fruit nectar,"Mixed nuts, NFS","Swordfish, baked or broiled, fat added in cooking",Industrial oil as ingredient in food,"Fish sandwich, on bun, with spread","Fruit salad, excluding citrus fruits, with mar...","Syrup, dietetic","Lentils, dry, cooked, made with oil","Meat turnover, Puerto Rican style","Beans, string, green, canned, low sodium, NS a...","Infant formula, NS as to form (Enfamil ProSobee)","Egg omelet or scrambled egg, with cheese and m..."
1,"Beef chow mein or chop suey, no noodles","Coffee, Iced Latte, decaffeinated, nonfat, fla...","Pie, plum, two crust","Coffee creamer, soy, liquid","Chicken, back","Cake, rice flour, without icing or filling","Beer, light",Nutrition bar (Zone Perfect Classic Crunch),"Mackerel, smoked","Broccoli raab, cooked, NS as to fat added in c...","Icing, chocolate","Fish, NS as to type, coated, baked or broiled,...",Cereal (Weetabix Whole Grain),"Beef steak, fried, NS as to fat eaten",Queso Chihuahua,"Dippin' Dots, flash frozen ice cream snacks, f...","Coffee, Iced Latte, nonfat, flavored","Vegetable soup, chunky style","Apples and pears, baby food, junior","Almonds, NFS","Flounder, smoked",Soybean oil,Beef and noodles with cream or white sauce,"Salad dressing, light, NFS","Coffee, decaffeinated, pre-lightened","Waffle, from school, NFS","Pizza, cheese, stuffed crust","Peas and carrots, cooked, from frozen, fat add...","Infant formula, powder, made with plain bottle...","Egg omelet or scrambled egg, with meat and tom..."
2,"Stewed chickpeas with Spanish sausages, Puerto...","Asparagus soup, cream of, prepared with milk",Yuca fries,"Chicken or turkey and corn hominy soup, home ...","Frankfurter or hot dog, cheese-filled","Doughnut, chocolate, cake type, with chocolate...",Screwdriver,Cereal (General Mills Cookie Crisp),"Barracuda, baked or broiled, fat not added in ...","Bitter melon leaves, horseradish leaves, jute ...","Topping, chocolate flavored hazelnut spread","Ocean perch, baked or broiled, fat not added i...",Cereal (Post Golden Crisp),"Beef steak, braised, lean and fat eaten","Cheese, Cheddar","Shrimp, coated, baked or broiled, made with bu...","Coffee, macchiato, sweetened","Carrots, canned, low sodium, made with margarine","Lime, raw","Crackers, corn","Eel, cooked, NS as to cooking method","Shortening, NS as to vegetable or animal","Bologna and cheese sandwich, with spread",Russian dressing,Marshmallow,"Beans, canned, drained, NS as to type, fat add...",Taquito or flauta with meat and cheese,"Corn, cooked, from frozen, NS as to color, mad...","Infant formula, NS as to form (Similac Isomil ...","Egg, whole, fried without fat"
3,"Beef, rice, and vegetables excluding carrots, ...",Gelatin dessert with fruit and whipped cream,"Pie, pear, two crust","Onions, cooked, NS as to form, NS as to fat ad...","Chicken thigh, fried, coated, skin / coating e...","Nuts, carob-coated",Vodka and tonic,Cereal (Post Cocoa Pebbles),"Salmon, dried","Cress, raw","Pie, pudding, chocolate, with chocolate coatin...","Perch, baked or broiled, made with margarine","Cereal, frosted oat cereal with marshmallows","Turkey, light meat, skin eaten","Cheese, Mexican blend, reduced fat","Cheese sandwich, reduced fat Cheddar cheese, o...","Coffee, instant, pre-lightened and pre-sweeten...","Rice, brown, with vegetables and gravy, fat ad...","Blueberries, cooked or canned, unsweetened, wa...","Cereal or Granola bar, NFS","Trout, baked or broiled, made with butter",Animal fat or drippings,"Chicken nuggets, from school lunch","Ranch dip, light",Fruit flavored syrup used for milk beverages,"Edamame, cooked","Burrito with beans, meatless","Corn, yellow, NS as to form, cream style, fat ...","Infant formula, NS as to form (Gerber Graduate...","Bear, cooked"
4,"Chicken or turkey, potatoes, and vegetables in...","Yogurt, whole milk, baby food, with fruit and ...","Biscuit dough, fried","Rice, white, with gravy, NS as to fat added in...","Beef, sandwich steak, flaked, formed, thinly s...",Banana chips,Frozen margarita,Cereal (General Mills Count Chocula),"Herring, raw","Mustard greens, cooked, from fresh, fat not ad...","Snack cake, chocolate, with icing or filling, ...","Fish, NS as to type, raw","Nutritional powder mix, protein, light, NFS","Chicken, NS as to part, stewed, skin eaten","Cheese, Cheddar, reduced sodium","Pie, coconut cream","Soft drink, cola, fruit or vanilla flavored, diet","Plantain, boiled, NS as to green or ripe","Strawberries, cooked or canned, in syrup",Flax seeds,"Ray, baked or broiled, fat not added in cooking",Cottonseed oil,Ham salad sandwich,"Dill dip, light","Fluid replacement, 5% glucose in water","Lima beans, dry, cooked, made with animal fat ...","Pizza with pepperoni, from restaurant or fast ...","Dandelion greens, cooked, made with margarine","Infant formula, powder, made with baby water (...","Egg, whole, fried, NS as to fat added in cooking"
5,"Macaroni or pasta salad, made with light mayon...","Light ice cream, soft serve, flavors other tha...","Empanada, Mexican turnover, pumpkin","Flavored rice, brown and wild","Ham, fresh, cooked, NS as to fat eaten","Dietetic or low calorie candy, chocolate covered",Vodka and cola,Cereal (General Mills Chex Chocolate),"Herring, coated, baked or broiled, fat not add...","Brussels sprouts, cooked, from fresh, fat adde...","Cookie bar, with chocolate, nuts, and graham c...","Tuna, fresh, dried",Cereal (Post Honey Bunches of Oats with Vanill...,"Chicken, NS as to part, grilled with sauce, sk...","Butter, whipped, stick, salted",Quesadilla with vegetables,Frozen coffee drink,"Mixed vegetables, cooked, NS as to form, fat n...","Apricots, baby food, junior","Crackers, woven wheat, flavored (Triscuit)","Sea bass, baked or broiled, fat not added in c...","Margarine, tub, salted","Cornbread muffin, stick, round","Dill dip, regular","Gelatin dessert, dietetic, with whipped toppin...","White beans, dry, cooked, made with animal fat...","Stuffed shells, cheese-filled, with meat sauce","Rice, white, with corn, NS as to fat added in ...","Infant formula, NS as to form (Similac Expert ...","Egg, whole, baked, NS as to fat added in cooking"
6,"Cannelloni, cheese-filled, with tomato sauce, ...","Yogurt, whole milk, plain","Cookie, oatmeal, with raisins","Mushroom, Asian, cooked, from dried","Frankfurter or hot dog sandwich, NFS, plain, o...",Cereal or granola bar (Quaker Granola Bites),Jagerbomb,Cereal (General Mills Reese's Puffs),"Herring, coated, fried","Brussels sprouts, cooked, from fresh, made wit...",TWIX Chocolate Fudge Cookie Bars,"Sushi, topped with eel","Finger Foods, Puffs, baby food","Veal cutlet or steak, fried, NS as to fat eaten","Cheese, cream","Gnocchi, potato","Coffee, Latte, nonfat, flavored",Spaghetti sauce,"Tutti-fruitti pudding, baby food, strained","Potato chips, restructured, fat free","Shrimp, dried",Safflower oil,"Taco or tostada with meat, from fast food","Chipotle dip, yogurt based","Water, bottled, flavored, sugar free (Glaceau ...","Peas, cowpeas, field peas, or blackeye peas, n...","Bacon cheeseburger, 1 medium patty, with condi...","Carrots, cooked, from frozen, fat not added in...","Infant formula, powder, made with plain bottle...","Egg omelet or scrambled egg, with cheese, meat..."
7,"Ravioli, meat-filled, with tomato sauce or mea...","Fat free ice cream, NS as to flavor","Coffee creamer, powder, fat free","Beans, lima, immature, cooked, from frozen, NS...","Frankfurter or hot dog sandwich, NFS, plain, o...",TWIX Caramel Cookie Bars,Mojito,Cereal (General Mills Cocoa Puffs),Sardines with mustard sauce,"Peas, green, raw","Pie shell, chocolate wafer","Whiting, coated, baked or broiled, made with m...",Cereal (Kellogg's Special K Red Berries),"Chicken drumstick, baked, broiled, or roasted,...",Lemon-butter sauce,"Potato, baked, peel not eaten, with vegetables",Energy drink (Monster),"Mixed vegetables, canned, low sodium, fat not ...","Lettuce, raw","Pecans, unroasted","Mussels, steamed or poached",Sunflower oil,"Cake, Ravani","Dip, NFS",Butterscotch hard candy,"Bagel, wheat, with raisins","Cheeseburger, 1 medium patty, plain, on wheat bun",Mustard pickles,"Infant formula, powder, made with plain bottle...","Egg omelet or scrambled egg, with meat and veg..."
8,"Chicken or turkey, rice, and vegetables includ...","Macaroni or noodles with cheese, made from red...",Vegetable tempura,"Barley, NS as to fat added in cooking","Frankfurter or hot dog sandwich, beef, plain, ...","Rice, cooked with coconut milk","Fruit punch, alcoholic",Nutritional powder mix (Carnation Instant Brea...,"Herring, coated, baked or broiled, fat added i...","Snowpea, cooked, from frozen, NS as to fat add...","Milk chocolate candy, plain","Halibut, coated, baked or broiled, made with m...","Cereal, fruit whirls","Chicken drumstick, sauteed, skin not eaten","Queso Anejo, aged Mexican cheese","Pasta, whole grain, with cream sauce and seafo...","Gravy, redeye","Mixed vegetables, cooked, from frozen, made wi...","Apricot, dried, cooked, NS as to sweetened or ...","Crackers, wheat, reduced sodium","Ray, cooked, NS as to cooking method","Margarine, stick, unsalted","Potato, hash brown, from restaurant, with cheese","Margarine, stick, salted",Snow cone,"Frankfurter or hot dog sandwich, fat free, pla...","Chiles rellenos, cheese-filled","Beans, lima and corn, cooked, NS as to fat add...","Infant formula, NS as to form (Enfamil Enfagro...","Cookie, ladyfinger"
9,Fish and rice with tomato-based sauce,"Chocolate milk, made from light syrup with red...","Egg white, omelet, scrambled, or fried, with c...","Mushrooms, cooked, from fresh, made with marga...","Beef, shortribs, cooked, lean and fat eaten",Cereal or granola bar (Quaker Chewy Dipps Gran...,Irish Coffee,Nutrition bar (Balance Original Bar),"Herring, cooked, NS as to cooking method","Peas and onions, cooked, fat added in cooking,...","Light ice cream, bar or stick, with low-calori...","Whiting, coated, baked or broiled, made withou...","Cereal, bran flakes","Catfish, baked or broiled, made without fat","Cheese, Brick",Sour cream,"Energy drink, sugar free","Calabaza, cooked","Grapefruit, canned or frozen, unsweetened, wat...",Pine nuts,"Croaker, steamed or poached",Flaxseed oil,"Chicken tenders or strips, breaded, from fast ...",Butter-vegetable oil blend,Fruit syrup,"Jelly sandwich, regular jelly, on white bread","Burrito with beans, rice, and sour cream, meat...","Corn, yellow and white, cooked, from frozen, N...","Infant formula, liquid concentrate, made with ...",Lamb or mutton loaf


In [43]:
px.bar(food_clustered_30["cluster"].value_counts())


**Note that these results aren't exactly identical to the R results shown in the book, due to different random initializations and implementations of the K-means algorithm, but many of the categories identified are similar.**

If we look at the cluster groups identified by $K = 30$, we identify the following groups (not necessarily in the same order as printed above). Note that each of the entries below is our best approximation to a theme (not all food items in each cluster fit the theme):

1. Meals

2. (Unclear category)

3. (Unclear category)

4. Vegetable-based meals

5. Meats

6. Cereal and dessert

7. Alcoholic beverages

8. Cereals

9. Fish

10. Vegetables

11. Desserts

12. Fish

13. Cereals

14. Meats

15. Cheese and dairy

16. (Unclear category)

17. Caffeinated beverages

18. Vegetables

19. Fruits

20. Nuts and crackers

21. Fish

22. Fats (oils)

23. Meals

24. Dips and oils

25. Desserts

26. Beans

27. Meat meals

28. Vegetables

29. Infant formula

30. Eggs


While our quantitative analysis results indicated that $K = 30$ was a good choice, our own investigations of other values of $K$ mostly demonstrate that the higher the value of $K$, the more specific the cluster categories are. For instance, one particular run of K-means with $K = 100$ yielded highly specific categories like "pizza", "alcoholic beverages", "egg omelets", "leafy greens". Overall, it increasingly feels like the choice of $K$ should be based on how specific we want our categories to be, rather than the values of the metrics above. The level of detail we obtained with $K = 30$ seems fairly reasonable, although we will likely want to manually combine some categories.
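
A comparison across values of $K$ like the one described above can be sketched as follows. This is a minimal sketch on simulated data, not the code used for the runs mentioned here; the helper name `sample_items_per_cluster`, its arguments, and the toy data frame are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans


def sample_items_per_cluster(data, k, n_items=5, seed=0):
    """Fit K-means with `k` clusters and sample row labels from each cluster.

    Hypothetical helper for illustration only.
    """
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(data)
    labels = pd.Series(kmeans.labels_ + 1, index=data.index, name="cluster")
    # Sample with replacement so small clusters still return n_items entries
    return {
        cluster: members.sample(n_items, replace=True, random_state=seed).index.tolist()
        for cluster, members in labels.groupby(labels)
    }


# Simulated stand-in for a food-by-nutrient data frame
rng = np.random.default_rng(1)
toy_data = pd.DataFrame(rng.normal(size=(200, 4)),
                        columns=["a", "b", "c", "d"],
                        index=[f"item_{i}" for i in range(200)])

# Sampled item names per cluster for a coarse and a finer choice of K
toy_samples = {k: sample_items_per_cluster(toy_data, k) for k in (5, 20)}
```

Reading through the sampled names for each value of `k` is one way to judge how specific the resulting categories are.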

That said, let's investigate the predictability and stability of our results when $K = 30$. 


## PCS analysis of cluster results with K = 30


Now let's examine the predictability and stability of our results.

### Predictability

To explore the predictability of our results, we will see what the clusters we identified (using K-means with $K = 30$) look like when we use their cluster centers to cluster the external "legacy" data.


First, we will create and pre-process the legacy dataset:

In [44]:
# Clean the Legacy data
food_legacy = clean_food_data(nutrient_amount_data=nutrient_amount,
                              food_name_data=food_name,
                              nutrient_name_data=nutrient_name,
                              select_data_type="sr_legacy_food")
# Filter to the columns in the FNDDS dataset
food_legacy = food_legacy[food_fndds.columns].dropna()
# Preprocess the Legacy data
food_legacy_log_scaled = preprocess_food_data(food_legacy,
                                              log_transform=True,
                                              center=False,
                                              scale=True,
                                              remove_fat=False)

Next, we will identify which cluster each "legacy" food item belongs to based on the 30 cluster centers we identified from the FNDDS data above.

In [45]:
kmeans_legacy_cluster = compute_cluster(food_legacy_log_scaled, 
                                        kmeans_clust_k30_init.cluster_centers_)
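
The `compute_cluster` helper used above is not shown in this section. As a rough illustration of what assigning new items to existing cluster centers involves, here is a minimal sketch that assumes each item is assigned to the center with the smallest Euclidean distance. The function name `assign_to_nearest_center` and the toy data are hypothetical, and this is not necessarily how `compute_cluster` is implemented.

```python
import numpy as np


def assign_to_nearest_center(data, centers):
    """Return, for each row of `data`, the index of the closest center.

    Hypothetical helper for illustration only; assumes Euclidean distance.
    """
    data = np.asarray(data, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # Pairwise squared distances, shape (n_items, n_centers)
    sq_dist = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return sq_dist.argmin(axis=1)


# Toy example: two centers in two dimensions
toy_centers = np.array([[0.0, 0.0], [10.0, 10.0]])
toy_items = np.array([[1.0, -1.0], [9.0, 11.0], [4.0, 4.0]])
toy_labels = assign_to_nearest_center(toy_items, toy_centers)
```

For a fitted scikit-learn `KMeans` object, the `.predict()` method performs the same kind of nearest-center assignment on new data.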


We can then calculate the average silhouette width of these clustered legacy food items.

Below, the silhouette width is computed both for the original FNDDS data with K = 30 and for the legacy food items (clustered according to the 30 original FNDDS cluster centers). The Legacy silhouette score is actually *higher* (better) than for the FNDDS foods!

In [46]:
# fndds silhouette score
silhouette_score(food_fndds_log_scaled, kmeans_clust_k30)

0.15448090621912813

In [47]:
# legacy silhouette score
silhouette_score(food_legacy_log_scaled, kmeans_legacy_cluster)

0.1895141376411138


Note that since the two datasets have a different number of food items, the total WSS is not directly comparable, so we won't compute it here.

Unfortunately, since we are looking at so many clusters, it is also hard to visualize the clusters in the data space using a color-coded scatterplot as we did for K = 6.

We can, however, look at 15 randomly chosen legacy food items from each cluster to see whether their themes resemble the categories we identified for the FNDDS clusters (the items are sampled with replacement, so an item can appear more than once, particularly in small clusters):



In [48]:
food_legacy_clustered_30 = food_legacy_log_scaled.copy()
food_legacy_clustered_30["cluster"] = kmeans_legacy_cluster + 1

food_legacy_clustered_samples = food_legacy_clustered_30.reset_index() \
    .groupby("cluster")[["cluster", "description"]] \
    .sample(15, random_state=311, replace=True)
                 
food_legacy_clustered_samples["index"] = list(range(15))*30
food_legacy_clustered_samples = food_legacy_clustered_samples.set_index("index")
food_legacy_clustered_samples.pivot(columns="cluster")
                

Unnamed: 0_level_0,description,description,description,description,description,description,description,description,description,description,description,description,description,description,description,description,description,description,description,description,description,description,description,description,description,description,description,description,description,description
cluster,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30
index,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2,Unnamed: 22_level_2,Unnamed: 23_level_2,Unnamed: 24_level_2,Unnamed: 25_level_2,Unnamed: 26_level_2,Unnamed: 27_level_2,Unnamed: 28_level_2,Unnamed: 29_level_2,Unnamed: 30_level_2
0,"Turnover, chicken- or turkey-, and vegetable-f...","Sauce, homemade, white, medium","Restaurant, family style, fish fillet, battere...","Currants, zante, dried","Beef, rib eye steak, boneless, lip-on, separab...","Beverages, coffee and cocoa, instant, decaffei...","Alcoholic beverage, wine, cooking","Formulated bar, MARS SNACKFOOD US, SNICKERS MA...","Fish, mackerel, Atlantic, raw","Collards, raw","Beverages, Dairy drink mix, chocolate, reduced...","Mollusks, mussel, blue, raw","Beverages, rich chocolate, powder","Beef, shoulder pot roast, boneless, separable ...","Cheese, gouda","Candies, white chocolate","Beverages, coffee, ready to drink, milk based,...","Carrot juice, canned","Beverages, Orange drink, breakfast type, with ...","Peanuts, all types, oil-roasted, with salt","Fish, Salmon, pink, canned, drained solids, wi...","Oil, industrial, cottonseed, fully hydrogenated","Danish pastry, cheese","Salad dressing, italian dressing, commercial, ...","Pear nectar, canned, without added ascorbic acid","Bread, reduced-calorie, white","Fast foods, hamburger, large, single patty, wi...","Sweet potato leaves, cooked, steamed, with salt","Infant formula, MEAD JOHNSON, ENFAMIL, AR, rea...","Egg, whole, cooked, omelet"
1,"Tomatoes, sun-dried","Milk shakes, thick chocolate","Cake, yellow, commercially prepared, with choc...","Rice noodles, dry","Lamb, New Zealand, imported, frozen, loin, sep...","Nuts, coconut milk, raw (liquid expressed from...","Alcoholic beverage, wine, dessert, sweet","Formulated bar, MARS SNACKFOOD US, SNICKERS MA...","Fish, mackerel, Atlantic, raw","Onions, spring or scallions (includes tops and...","Candies, MARS SNACKFOOD US, MILKY WAY Bar","Fish, cod, Atlantic, dried and salted","Spices, parsley, dried","Beef, chuck, under blade steak, boneless, sepa...","Cheese, cheddar (Includes foods for USDA's Foo...","Infant formula, PBM PRODUCTS, store brand, soy...","Beverages, tea, Oolong, brewed","Babyfood, vegetables, squash, strained","Cherries, sweet, canned, juice pack, solids an...","Seeds, sesame butter, tahini, from roasted and...","Fish, trout, rainbow, farmed, cooked, dry heat","Oil, industrial, soy, low linolenic","School Lunch, chicken nuggets, whole grain bre...","Margarine-like vegetable-oil spread, stick/tub...","Beverages, Apple juice drink, light, fortified...","Cereals, QUAKER, QUAKER MultiGrain Oatmeal, dry","School Lunch, pizza, cheese topping, thick cru...","Peppers, jalapeno, raw","Infant formula, ABBOTT NUTRITION, SIMILAC, ADV...","Babyfood, meat, turkey sticks, junior"
2,"HOT POCKETS, CROISSANT POCKETS Chicken, Brocco...","Whey, sweet, dried","Crackers, multigrain","Beans, baked, canned, plain or vegetarian","Pork, fresh, spareribs, separable lean and fat...","Candies, REESE'S PIECES Candy","Alcoholic beverage, wine, light","Formulated bar, ZONE PERFECT CLASSIC CRUNCH BA...","Fish, mackerel, Pacific and jack, mixed specie...","Spinach, frozen, chopped or leaf, unprepared (...","Candies, milk chocolate","Crustaceans, crab, blue, cooked, moist heat","Cereals ready-to-eat, POST Bran Flakes","Veal, loin, separable lean and fat, raw","Cheese, parmesan, shredded","Cheese, Mexican, blend, reduced fat","Beverages, Energy drink, ROCKSTAR, sugar free","Babyfood, dinner, macaroni and tomato and beef...","Soup, tomato, dry, mix, prepared with water","Crackers, whole grain, sandwich-type, with pea...","Fish, trout, rainbow, farmed, cooked, dry heat","Oil, cottonseed, salad or cooking","Infant formula, MEAD JOHNSON, ENFAMIL, ENFAGRO...","Puff pastry, frozen, ready-to-bake, baked","Candies, fondant, prepared-from-recipe","Cereals, oats, instant, fortified, with cinnam...","Fast foods, hamburger; single, regular patty; ...","Beans, snap, green, canned, no salt added, dra...","Margarine-like, vegetable oil spread, 20% fat,...","Egg, duck, whole, fresh, raw"
3,"Lasagna, Vegetable, frozen, baked","Leavening agents, baking powder, double-acting...","Danish pastry, fruit, enriched (includes apple...","Lima beans, immature seeds, frozen, baby, cook...","Beef, brisket, flat half, separable lean and f...","Snacks, granola bar, with coconut, chocolate c...","Alcoholic beverage, beer, regular, all","Beverage, instant breakfast powder, chocolate,...","Fish, herring, Atlantic, pickled","Lettuce, red leaf, raw","Cocoa, dry powder, unsweetened, processed with...","Crustaceans, lobster, northern, cooked, moist ...","Babyfood, cereal, whole wheat, with apples, dr...","Beef, loin, tenderloin steak, boneless, separa...","Butter, without salt","Cheese, monterey, low fat","Beverages, carbonated, low calorie, cola or pe...","Babyfood, vegetables, peas, strained","Beverages, FUZE, orange mango, fortified with ...","Snacks, popcorn, oil-popped, microwave, regula...","Fish, salmon, Atlantic, farmed, raw","Oil, industrial, soy, ultra low linolenic","Infant formula, NESTLE, GOOD START SUPREME, wi...","Salad dressing, sesame seed dressing, regular","Beverages, almond milk, sweetened, vanilla fla...","Beverages, coffee substitute, cereal grain bev...","Fast Food, Pizza Chain, 14"" pizza, cheese topp...","Kumquats, raw","Infant Formula, GERBER GOOD START 2, GENTLE PL...","Egg, whole, raw, frozen, pasteurized (Includes..."
4,"Salad dressing, french dressing, reduced fat","Leavening agents, baking powder, double-acting...","Cake, gingerbread, dry mix","Egg substitute, liquid or frozen, fat free","Nuts, macadamia nuts, dry roasted, with salt a...","Breakfast bars, oats, sugar, raisins, coconut ...","Alcoholic beverage, distilled, all (gin, rum, ...","Cereals ready-to-eat, chocolate-flavored frost...","Fish, herring, Atlantic, cooked, dry heat","Babyfood, vegetables, spinach, creamed, strained","Baking chocolate, unsweetened, squares","Mollusks, snail, raw","Cereals ready-to-eat, RALSTON TASTEEOS","Beef, chuck, under blade pot roast or steak, b...","Cheese, low-sodium, cheddar or colby","Cream, sour, cultured","Beverages, yellow green colored citrus soft dr...","Beans, snap, green, frozen, cooked, boiled, dr...","Bulgur, cooked","Seeds, pumpkin and squash seed kernels, dried","Fish, pompano, florida, raw","Oil, soybean, salad or cooking, (partially hyd...","Toddler formula, MEAD JOHNSON, ENFAGROW, Toddl...","Salad dressing, poppyseed, creamy","Whey, acid, fluid","Wheat flour, white, bread, enriched","School Lunch, pizza, sausage topping, thick cr...","Spices, cinnamon, ground","Infant formula, GERBER, GOOD START 2, PROTECT ...","Beef, variety meats and by-products, liver, co..."
5,"Lasagna with meat sauce, frozen, prepared","Yogurt, vanilla flavor, lowfat milk, sweetened...","Burrito, bean and cheese, frozen","Noodles, egg, enriched, cooked","Luncheon meat, pork, canned","Dessert topping, powdered","Beverages, AMBER, hard cider","Formulated bar, ZONE PERFECT CLASSIC CRUNCH BA...","Fish, shad, american, raw","Chrysanthemum, garland, cooked, boiled, draine...","Beverages, Cocoa mix, low calorie, powder, wit...","Fish, pollock, Alaska, cooked, dry heat (may c...","Cereals ready-to-eat, GENERAL MILLS, CHEERIOS","Beef, round, top round roast, boneless, separa...","Vegetable oil-butter spread, reduced calorie","Cream, fluid, half and half","Beverages, carbonated, limeade, high caffeine","Egg rolls, vegetable, frozen, prepared","Potatoes, baked, flesh, without salt","Snacks, granola bar, GENERAL MILLS NATURE VALL...","Beef, variety meats and by-products, brain, co...","Fat, turkey","Fast foods, bagel, with breakfast steak, egg, ...","Salad dressing, italian dressing, reduced calorie","Beverages, Fruit flavored drink containing les...","Chicken, meatless, breaded, fried","Pizza, pepperoni topping, regular crust, froze...","Peas and carrots, frozen, cooked, boiled, drai...","Infant formula, MEAD JOHNSON, ENFAMIL, NUTRAMI...","Egg, whole, raw, frozen, pasteurized (Includes..."
6,"Fast foods, burrito, with beans, cheese, and beef","Frozen novelties, No Sugar Added, FUDGSICLE pops","Cookies, vanilla sandwich with creme filling, ...","Mushrooms, canned, drained solids","Beef, chuck, blade roast, separable lean and f...","Breakfast bars, oats, sugar, raisins, coconut ...","Alcoholic beverage, beer, regular, all","Beverage, instant breakfast powder, chocolate,...","Fish, halibut, Greenland, raw","Drumstick leaves, cooked, boiled, drained, wit...","Ice creams, chocolate","Mollusks, octopus, common, raw","Beverages, OVALTINE, Classic Malt powder","Pork, fresh, loin, sirloin (chops), bone-in, s...","Cream, fluid, heavy whipping","Cheese, ricotta, whole milk","Beverages, coffee, brewed, prepared with tap w...","Peas, edible-podded, frozen, cooked, boiled, d...","Raspberries, canned, red, heavy syrup pack, so...","Nuts, cashew nuts, dry roasted, without salt a...","Beef, variety meats and by-products, brain, co...","Oil, industrial, soy ( partially hydrogenated)...","Bacon, turkey, low sodium","Salad dressing, honey mustard, regular","Soup, cream of mushroom, low sodium, ready-to-...","Bread, rye","Fast Food, Pizza Chain, 14"" pizza, pepperoni t...","Pickle relish, sweet","Infant Formula, MEAD JOHNSON, ENFAMIL, ENFACAR...","Turkey, all classes, liver, cooked, simmered"
7,"Tomatoes, sun-dried","Frozen yogurts, chocolate","Macaroni and cheese, box mix with cheese sauce...","Cabbage, kimchi","Pork, cured, bacon, cooked, broiled, pan-fried...","Topping, SMUCKER'S MAGIC SHELL","Alcoholic beverage, beer, light, higher alcohol","Cereals ready-to-eat, MALT-O-MEAL, COCO-ROOS","Spices, mustard seed, ground","Mustard greens, cooked, boiled, drained, with ...","Beverages, Cocoa mix, low calorie, powder, wit...","Mollusks, snail, raw","Protein supplement, milk based, Muscle Milk, p...","Pork, cured, ham, shank, bone-in, separable le...","Cheese, swiss, low sodium","Cheese product, pasteurized process, American,...","Beverages, coffee, instant, regular, half the ...","Cabbage, raw","Babyfood, fruit, bananas with tapioca, junior","Snacks, potato sticks","Fish, pompano, florida, raw","Oil, canola","Beverages, Malted drink mix, natural, powder, ...","Salad dressing, mayonnaise and mayonnaise-type...","Beverages, GEROLSTEINER BRUNNEN GmbH & Co. KG,...","Babyfood, cereal, rice with pears and apple, d...","Pizza, meat and vegetable topping, rising crus...","Peas and carrots, frozen, cooked, boiled, drai...","Infant formula, MEAD JOHNSON, ENFAMIL, Prematu...","Chicken, liver, all classes, raw"
8,"Pasta with Sliced Franks in Tomato Sauce, cann...","Leavening agents, baking powder, double-acting...","Toaster pastries, fruit (includes apple, blueb...","Mushrooms, oyster, raw","Beef, rib eye steak/roast, bone-in, lip-on, se...","Candies, MOUNDS Candy Bar","Alcoholic beverage, wine, table, all","Beverage, instant breakfast powder, chocolate,...","Fish, mackerel, Atlantic, raw","Cabbage, chinese (pak-choi), raw","Candies, HERSHEY'S SKOR Toffee Bar","Crustaceans, lobster, northern, raw","Babyfood, cereal, rice, dry fortified","Beef, round, bottom round, steak, separable le...","Cheese, mexican, queso anejo","Cheese, pasteurized process, American, fortifi...","Beverages, Energy Drink with carbonated water ...","Sauce, barbecue","Babyfood, vegetables, beets, strained","Peanuts, all types, dry-roasted, without salt","Fish, cisco, smoked","Oil, canola","Infant formula, MEAD JOHNSON, ENFAMIL, NUTRAMI...","Margarine-like, vegetable oil spread, 60% fat,...","Pears, canned, light syrup pack, solids and li...","Cereals, whole wheat hot natural cereal, dry","Pizza rolls, frozen, unprepared","Babyfood, snack, GERBER, GRADUATES, YOGURT MELTS","Infant formula, MEAD JOHNSON, ENFAMIL, NUTRAMI...","Egg substitute, powder"
9,"Chicken pot pie, frozen entree, prepared","Turkey, drumstick, from whole bird, meat only,...","Cookies, peanut butter sandwich, regular","Potatoes, boiled, cooked without skin, flesh, ...","Beef, top sirloin, steak, separable lean and f...","Cream substitute, powdered","Alcoholic beverage, distilled, all (gin, rum, ...","Cereals ready-to-eat, POST, COCOA PEBBLES","Fish, mackerel, Atlantic, raw","Babyfood, vegetables, spinach, creamed, strained","Ice creams, regular, low carbohydrate, chocolate","Mollusks, abalone, mixed species, raw","Cereals ready-to-eat, MALT-O-MEAL, COLOSSAL CR...","Pork, cured, ham -- water added, shank, bone-i...","Cheese, pasteurized process, pimento","Frozen novelties, ice cream type, chocolate or...","Beverages, carbonated, limeade, high caffeine","Peas, green, cooked, boiled, drained, without ...","Mushrooms, maitake, raw","Soy flour, full-fat, raw","Fish, salmon, pink, canned, total can contents","Shortening, industrial, soy (partially hydroge...","Hush puppies, prepared from recipe","Margarine-like, vegetable oil spread, 60% fat,...","Soup, wonton, Chinese restaurant","Cereals, QUAKER, corn grits, instant, plain, dry","Pizza, pepperoni topping, regular crust, froze...","Tangerines, (mandarin oranges), canned, juice ...","Milk, human, mature, fluid (For Reference Only)","Egg, whole, raw, frozen, pasteurized (Includes..."



The general themes we can identify are shown in the table below. The clusters that seem to at least approximately have the same theme across the two datasets are shown in bold. Note that the two datasets have different naming conventions as well as different foods included in them.


| Cluster | FNDDS | Legacy |
|:--------|:------|:-------|
|1 | **Meals** | **Meals** |
|2 | **(Unclear category)** | **(Unclear category)**  |
|3 | **(Unclear category)** | **(Unclear category)** |
|4 | Vegetable-based meals | Vegetables and rice |
|5 | **Meats** | **Meats** |
|6 | **Cereal and dessert** | **Cereals and dessert** |
|7 | **Alcoholic beverages** | **Alcoholic beverages** |
|8 | **Cereals** | **Cereals** |
|9 | **Fish** | **Fish** |
|10 | **Vegetables** | **Vegetables** |
|11 | **Desserts** | **Desserts** |
|12 | **Fish** | **Fish** |
|13 | **Cereals** | **Cereals and supplements** |
|14 | **Meats** | **Meats** |
|15 | **Cheese and dairy** | **Cheese and dairy** |
|16 | (Unclear category) | Dairy |
|17 | **Caffeinated beverages** | **Caffeinated beverages** |
|18 | **Vegetables** | **Vegetables and baby food** |
|19 | Fruits | (Unclear category) |
|20 | **Nuts and crackers** | **Nuts and seeds** |
|21 | **Fish** | **Fish** |
|22 | **Fats (oils)** | **Fats (oils)** | 
|23 | Meals | Fast food and infant formula |
|24 | **Dips and oils** | **Dressing and oils** |
|25 | Desserts | Fruits and beverages |
|26 | Beans | Cereals and breads |
|27 | **Meat meals** | **Meals** |
|28 | **Vegetables** | **Vegetables** |
|29 | **Infant formula** | **Infant formula** |
|30 | **Eggs** | **Eggs and meat** |




Overall, 24 of the 30 Legacy food item clusters (which are computed based on the FNDDS cluster centers) seem to have a similar theme (based on our subjective opinion) to the corresponding clusters in the original FNDDS dataset, and those that don't quite match tend to have some similar food items too. This is fairly impressive predictability performance. Since the Legacy and FNDDS datasets contain different foods and likely used different nutrient measurement techniques, some differences are not unexpected, and the extent to which the Legacy food items match the FNDDS clusters is certainly encouraging.

Overall, our impression is that the clusters we have identified have pretty good predictability. Moreover, despite using a different random seed and implementation of K-means here than for the results produced using R for the book, these cluster themes, while certainly not identical, bear substantial similarities to the book version.


### Stability


Next, let's investigate the stability of our clustering algorithm and results.

#### Stability to the choice of algorithm

Early on in our analysis, we observed that the K-means algorithm yielded better results than the hierarchical clustering algorithm, but this was just focusing on $K = 6$. Let's consider what happens when we use $K = 30$.

Let's compute some hierarchical clusters with $K = 30$.

In [49]:
hclust_clust_k30_init = AgglomerativeClustering(n_clusters=30, metric='euclidean', linkage='ward')
hclust_clust_k30_fit = hclust_clust_k30_init.fit(food_fndds_log_scaled)
hclust_clust_k30 = hclust_clust_k30_fit.labels_ + 1


The code below computes the average silhouette width for the hierarchical clustering algorithm with 30 clusters.

In [50]:
# fndds silhouette score
silhouette_score(food_fndds_log_scaled, hclust_clust_k30)

0.12267033603237429


Again, we see that the hierarchical clustering yields lower silhouette widths on average than K-means.

The two sets of clusters have a high Rand index, although the adjusted Rand index, which corrects for chance agreement, indicates more moderate similarity:


In [51]:
rand_score(hclust_clust_k30, kmeans_clust_k30)

0.9481758412849812

In [52]:
adjusted_rand_score(hclust_clust_k30, kmeans_clust_k30)

0.4833651201964665
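
To help interpret the gap between these two numbers, the following sketch (on simulated labels, not the food data) computes both indexes for two *unrelated* random labelings with 30 groups each. With this many clusters, most pairs of items are placed in different clusters by both labelings, so the unadjusted Rand index is high even without any real agreement, whereas the adjusted Rand index, which corrects for chance, is close to zero. The variable names and sample size are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.metrics import rand_score, adjusted_rand_score

rng = np.random.default_rng(0)
n_items, n_clusters = 2000, 30

# Two independent random labelings: any agreement is due to chance alone
labels_a = rng.integers(0, n_clusters, size=n_items)
labels_b = rng.integers(0, n_clusters, size=n_items)

random_rand = rand_score(labels_a, labels_b)
random_ari = adjusted_rand_score(labels_a, labels_b)
print(f"Rand: {random_rand:.3f}, adjusted Rand: {random_ari:.3f}")
```

Against this chance baseline, an adjusted Rand index of about 0.48 indicates agreement well above chance, though the two clusterings are far from identical.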


Below we print out some randomly sampled food items from each hierarchical cluster:

In [53]:
food_clustered_30 = food_fndds_log_scaled.copy()
food_clustered_30["cluster"] = hclust_clust_k30 

food_clustered_hclust_samples = food_clustered_30.reset_index() \
    .groupby("cluster")[["cluster", "description"]] \
    .sample(15, random_state=311, replace=True)
                 
food_clustered_hclust_samples["index"] = list(range(15))*30
food_clustered_hclust_samples = food_clustered_hclust_samples.set_index("index")
food_clustered_hclust_samples.pivot(columns="cluster")
                

Unnamed: 0_level_0,description,description,description,description,description,description,description,description,description,description,description,description,description,description,description,description,description,description,description,description,description,description,description,description,description,description,description,description,description,description
cluster,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30
index,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2,Unnamed: 22_level_2,Unnamed: 23_level_2,Unnamed: 24_level_2,Unnamed: 25_level_2,Unnamed: 26_level_2,Unnamed: 27_level_2,Unnamed: 28_level_2,Unnamed: 29_level_2,Unnamed: 30_level_2
0,"Chocolate milk, made from no sugar added dry m...","Roll, gluten free",Animal fat or drippings,"Popcorn, ready-to-eat packaged, cheese flavored","Pasta, whole grain, with tomato-based sauce an...","Salmon, coated, fried, made with oil","Cobbler, plum","Cereal or granola bar with nuts, chocolate coated","Tea, iced, instant, black, decaffeinated, pre-...","Mackerel, coated, baked or broiled, fat not ad...","Yogurt, frozen, chocolate, whole milk",Cereal (General Mills Cheerios),"Seaweed, cooked, fat added in cooking, NS as t...","Vegetable dip, regular","Potato, french fries, NS as to fresh or frozen",Irish Coffee,Hollandaise sauce,"Bitter melon leaves, horseradish leaves, jute ...","Sour cream, reduced fat",Porcupine balls with tomato-based sauce,Gefilte fish,"Rhubarb, cooked or canned, drained solids","Cookie, brownie, without icing","Beef liver, braised","Egg omelet or scrambled egg, with meat, made w...",Queso Asadero,"Cornish game hen, roasted, skin not eaten","Rice, fried, with shrimp","Rice pudding made with coconut milk, Puerto Ri...",Cereal (Quaker Oatmeal Squares)
1,"Cheese, cottage, NFS","Cookie, cone shell, ice cream type, wafer or cake","Ground beef, cooked","Peanuts, roasted, unsalted","Pasta with tomato-based sauce and seafood, res...","Mullet, coated, baked or broiled, fat not adde...","Corn soup, cream of, prepared with water",Taffy,"Water, baby, bottled, unsweetened","Herring, raw","Chocolate milk, made from light syrup with fat...",Cereal (General Mills Kix),"Eggplant, cooked, fat not added in cooking","Vegetable oil-butter spread, stick, salted","Cake batter, raw, not chocolate",Rum and cola,"Light butter, stick, salted","Cabbage, green, cooked, made with margarine","Grilled cheese sandwich, reduced fat Cheddar c...","Spaghetti with corned beef, Puerto Rican style",Gefilte fish,"Apple, dried, cooked, with sugar","Cake, Dobos Torte","Crab, canned","Egg omelet or scrambled egg, with cheese, meat...","Cheese, Cheddar","Pork, NS as to cut, fried, NS as to fat eaten","Pork, rice, and vegetables including carrots, ...","Cornstarch coconut dessert, Puerto Rican style","Nutritional powder mix, protein, NFS"
2,"Tortellini, cheese-filled, no sauce",Refried beans with meat,"Frankfurter or hot dog sandwich, chicken and/o...","Peanut butter and jelly sandwich, with regular...","Beef, noodles, and vegetables excluding carrot...","Salmon, steamed or poached","Vegetable combination, excluding carrots, broc...",Reese's Pieces,Fruit flavored syrup used for milk beverages,"Tuna, fresh, smoked","Caramel, chocolate-flavored roll",Cereal (Kellogg's Frosted Flakes),Mole verde sauce,"Shortening, NS as to vegetable or animal",Potato pancake,Mai Tai,Hollandaise sauce,"Cabbage, green, cooked, made with butter","Cod, coated, fried, made with butter",Soft taco with chicken and sour cream,"Ocean perch, baked or broiled, fat added in co...","Applesauce and apricots, baby food, junior","Fondant, chocolate covered","Lobster, canned","Egg, whole, fried, NS as to type of fat","Cheese, Cheddar","Chicken breast, NS as to cooking method, skin ...","Beef, potatoes, and vegetables, excluding carr...","Topping, chocolate, hard coating","Cereal (Post Great Grains, Cranberry Almond Cr..."
3,"Beef, roast, hash","Pretzel, baby food","Bologna sandwich, with spread","Walnuts, NFS","Chili con carne with beans, from restaurant","Sea bass, cooked, NS as to cooking method","Corn, yellow and white, cooked, from fresh, ma...","Whipped topping, fat free","Coffee, decaffeinated, pre-sweetened with sugar","Mackerel, coated, baked or broiled, fat added ...","Coffee, Iced Cafe Mocha, decaffeinated",Cereal (General Mills Wheaties),"Pudding, ready-to-eat, low calorie, containing...",Sunflower oil,"Breakfast bar, date, with yogurt coating","Wine, dessert, sweet","Table fat, NFS","Celery, cooked, made with oil","Caramel dip, regular","Egg, cheese, and steak on bagel",Flounder with crab stuffing,"Mixed fruit yogurt dessert, baby food, strained","Cookie, chocolate chip, reduced fat","Squid, steamed or boiled","Egg omelet or scrambled egg, with meat, made w...","Cheese, Colby Jack","Chicken wing, stewed","Tomatoes, green, cooked, NS as to form","Cookie, macaroon","Multigrain, whole grain cereal, baby food, dry..."
4,"Chocolate milk, made from no sugar added dry m...","Muffin, English, cheese","Stuffed pot roast, with potatoes, Puerto Rican...","Popcorn, microwave, plain","Mushroom soup, with meat broth, prepared with ...","Goat head, cooked","Rice, white, with vegetables and gravy, fat ad...","Cake or cupcake, coconut, with icing or filling","Water, bottled, flavored, sugar free (SoBe)","Herring, pickled","Chocolate milk, made from dry mix with non-dai...",Nutritional powder mix (Muscle Milk),"Mixed cereal with applesauce and bananas, baby...",Soybean and sunflower oil,"Muffin, NFS",Frozen daiquiri,Lemon-butter sauce,"Beans, string, cooked, from frozen, NS as to c...","Sour cream, light","Chicken breast, grilled with sauce, skin not e...","Catfish, coated, baked or broiled, made with oil","Berries, frozen, NFS","Yogurt, frozen, cone, chocolate","Lobster, coated, baked or broiled, fat not add...","Egg omelet or scrambled egg, with cheese, meat...","Cheese, provolone, reduced fat","Pork, tenderloin, braised","Rice, white, with vegetables, cheese and/or cr...","Cookie, animal, with frosting or icing",Cereal (Post Shredded Wheat Honey Nut)
5,Sausage and noodles with cream or white sauce,"Crackers, cheese, reduced fat","Beef, bacon, reduced sodium, cooked",Bacon bits,"Seafood soup with potatoes, and vegetables exc...","Salmon, baked or broiled, made with oil","Pumpkin, cooked, from fresh, fat not added in ...","Doughnut, chocolate, cake type, with chocolate...","Frozen daiquiri mix, from frozen concentrate, ...","Herring, cooked, NS as to cooking method","Caramel, chocolate-flavored roll",Cereal (Quaker Cap'n Crunch),"Congee, with vegetables","Shortening, vegetable","Cookie, oatmeal, with chocolate and peanut but...",Gin,"Butter, whipped, stick, unsalted","Brussels sprouts, cooked, from fresh, made wit...","Halibut, coated, baked or broiled, made with b...","Pork, spareribs, barbecued, with sauce, lean o...","Shrimp, coated, fried, made with oil","Strawberries, frozen, unsweetened","Cake, pound, chocolate","Squid, dried","Chicken liver, fried","Cheese, Monterey, reduced fat","Chicken, neck or ribs","Ham, potatoes, and vegetables including carrot...","Cookie, coconut",Cereal (Quaker Toasted Oat Bran)
6,Lamb or mutton and noodles with gravy,"Pie, custard, individual size or tart",Polish sausage,"Mixed nuts, without peanuts, unsalted","Stuffed tomato, with rice, meatless",Salmon cake or patty,"Apple rings, fried","Cereal or granola bar, coated with non-chocola...","Soft drink, chocolate flavored, diet","Barracuda, coated, baked or broiled, fat added...","Coffee, macchiato, sweetened","Milk, malted, dry mix, not reconstituted","Quinoa, fat not added in cooking","Mayonnaise, light","Crackers, cheese",Beer,"Butter, whipped, tub, unsalted","Spinach, cooked, NS as to form, fat added in c...","Potato, scalloped, ready-to-heat",Sausage and cheese on English muffin,"Cod, smoked","Apples and pears, baby food, junior","Waffle, chocolate","Liver, beef or calves, and onions","Egg omelet or scrambled egg, with vegetables o...","Cheese, American, reduced fat","Dove, cooked, NS as to cooking method","Bread, pumpkin","Topping, chocolate, hard coating",Cereal (Quaker Corn Bran Crunch)
7,"Infant formula, powder, made with water, NFS (...",Ham sandwich with lettuce and spread,"Veal chop, NS as to cooking method, NS as to f...","Popcorn, microwave, other flavored",Black bean salad,"Abalone, cooked, NS as to cooking method","Peas and carrots, from frozen, creamed","Light ice cream, bar or stick, chocolate cover...","Tea, iced, brewed, green, decaffeinated, pre-s...","Barracuda, coated, baked or broiled, fat added...","Chocolate milk, made from dry mix with reduced...",Cereal (Kellogg's Special K Low Fat Granola),"Lo mein, meatless","Margarine-like spread, reduced calorie, about ...","Sweet potato fries, from fresh, baked","Beer, light, higher alcohol","Table fat, NFS","Lettuce, arugula, raw","Cod, coated, baked or broiled, made with butter","Chimichanga, meatless","Flounder, coated, baked or broiled, made with ...","Fruit juice bar, frozen, flavor other than orange","Muffin, chocolate chip","Lobster, baked or broiled, fat added in cooking","Egg omelet or scrambled egg, with cheese and m...","Cheese, Brick","Tuna, fresh, steamed or poached","Fish shish kabob with vegetables, excluding po...","Coffee creamer, powder",Cereal (Post Honeycomb)
8,"Oatmeal, instant, maple flavored, fat not adde...","Mung beans, dry, cooked, fat added in cooking","Beef steak, breaded or floured, baked or fried...","Cashews, unsalted","Vegetable beef soup, chunky style",Mustard,"Mixed vegetables, cooked, NS as to form, fat a...","Cake, rice flour, without icing or filling","Iced Coffee, brewed","Herring, pickled, in cream sauce","Chocolate milk, ready to drink, whole",Cereal (Quaker Cap'n Crunch's Crunchberries),"Cereal, baby food, jarred, NFS","Onion dip, regular",Tuna loaf,Grasshopper,"Light butter, whipped, tub, salted","Spinach, cooked, from fresh, with cheese sauce",Croissant sandwich with sausage and egg,"Pizza with extra meat, thick crust","Catfish, baked or broiled, made with cooking s...","Peach, frozen, with sugar","Cookie, toffee bar","Squid, baked or broiled, fat added in cooking","Egg omelet or scrambled egg, with tomatoes and...","Cheese spread, cream cheese, regular","Cornish game hen, cooked, skin eaten","Beef with vegetables excluding carrots, brocco...",Chocolate-flavored sprinkles,"Waffle, whole grain, fruit, from frozen"
9,"Oatmeal, instant, plain, made with water, NS a...","Bread, gluten free","Beef steak, battered, fried, lean only eaten","Potato chips, sour cream and onion flavored",Seafood stew with potatoes and vegetables excl...,"Trout, baked or broiled, made without fat","Mixed vegetables, cooked, NS as to form, NS as...",Coconut milk,"Sugar, white, and water syrup","Tuna, fresh, smoked","Yogurt, frozen, chocolate, nonfat milk",Cereal (General Mills Honey Kix),"Egg white, omelet, scrambled, or fried, with v...","Onion dip, regular","Potato salad, from restaurant",Martini,"Butter, whipped, stick, unsalted","Beans, string, yellow, cooked, NS as to form, ...","Ice cream sundae, not fruit or chocolate toppi...","Jalapeno pepper, stuffed with cheese, breaded ...",Shrimp scampi,Apricot nectar,TWIX Peanut Butter Cookie Bars,"Squid, coated, fried","Egg, deviled","Cheese, Mexican blend, reduced fat","Pork, tenderloin, braised","Beef, potatoes, and vegetables, excluding carr...",Coconut oil,"Mixed cereal, baby food, dry, instant"



These clusters seem to have very clear categories. We might even argue that their categories actually seem *clearer* than the K-means clusters (at least there certainly seem to be fewer clusters with ambiguous/unclear categories), despite the fact that the silhouette width is "worse".


#### Stability to algorithmic randomness

Next, let's investigate how much the clusters change across the randomness inherent in the K-means algorithm itself (the algorithm starts with different random initial cluster centers every time).

We will just look at four different runs of the K-means algorithm (each with a different random initialization) to get a sense of how much the clusters change.

In [54]:
kmeans_clust_k30_labels_perturb = []
silhouette_kmeans_k30_perturb = []
for i in range(4):
    # fit the K-means clusters
    kmeans_clust_k30_init_iter = KMeans(n_clusters=30, n_init="auto")
    kmeans_clust_k30_fit_iter = kmeans_clust_k30_init_iter.fit(food_fndds_log_scaled)
    kmeans_clust_k30_labels_iter = kmeans_clust_k30_fit_iter.labels_
    # store the labels in a list
    kmeans_clust_k30_labels_perturb.append(kmeans_clust_k30_labels_iter)
    # store the silhouette scores in a list
    silhouette_iter = silhouette_samples(food_fndds_log_scaled, kmeans_clust_k30_labels_iter)
    silhouette_kmeans_k30_perturb.append(silhouette_iter)


The distribution of silhouette widths for all data points is almost identical across the four runs:

In [55]:
px.box(pd.DataFrame(silhouette_kmeans_k30_perturb).T, 
       labels=dict(variable="Iteration", value="Silhouette score"))


We can also look at the Rand index and adjusted Rand index between each run of the K-means algorithm and the original clustering (as well as the average silhouette width) below.

In [56]:
[rand_score(clust, kmeans_clust_k30) for clust in kmeans_clust_k30_labels_perturb]

[0.9619387289273993,
 0.9561022686382701,
 0.9761739145866611,
 0.9614586171079104]

In [57]:
[adjusted_rand_score(clust, kmeans_clust_k30) for clust in kmeans_clust_k30_labels_perturb]

[0.59835289308491, 0.5442983315949159, 0.7512885238819963, 0.5993643036671712]

We can also compute the average silhouette score for each run:

In [58]:
[sil.mean() for sil in silhouette_kmeans_k30_perturb]

[0.1579090502080682,
 0.14250294169867453,
 0.1478223838912038,
 0.1457061277816669]

These results show that the Rand index and silhouette scores are fairly consistent, but the adjusted Rand index is a bit more variable. These results are a bit different from those shown in the book, which were produced using R. It is unclear whether this is due to differences in the implementation of the K-means algorithm or to the specific set of 30 clusters that were created (to which we are comparing these perturbed clusters). This would be worthy of exploration!
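
One way to separate the influence of the particular reference clustering from the run-to-run variability would be to compare every pair of runs with each other rather than comparing each run to a single original clustering. The following is a minimal sketch of that idea on synthetic data generated with `make_blobs` (an assumption for illustration, not the food data); `pairwise_ari` is a hypothetical helper name.

```python
import numpy as np
from itertools import combinations
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# synthetic stand-in for the food data (assumption for illustration)
X_demo, _ = make_blobs(n_samples=600, centers=8, n_features=5, random_state=0)

def pairwise_ari(label_sets):
    # symmetric matrix of adjusted Rand indexes between every pair of clusterings
    m = len(label_sets)
    ari = np.eye(m)
    for i, j in combinations(range(m), 2):
        ari[i, j] = ari[j, i] = adjusted_rand_score(label_sets[i], label_sets[j])
    return ari

# four K-means runs that differ only in their random initialization
label_sets = [KMeans(n_clusters=8, n_init=1, random_state=seed).fit(X_demo).labels_
              for seed in range(4)]
ari_matrix = pairwise_ari(label_sets)
print(np.round(ari_matrix, 2))
```

If the off-diagonal entries are all similar, no single run is unusual; if one row is consistently lower, that run (which could be the reference clustering) is the outlier.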



#### Stability to data perturbations


Next, we conduct a similar analysis, but this time also perturbing the *data* using reasonable perturbations (the same perturbations that we used in our PCA PCS analysis). Specifically, we will investigate two types of data perturbations below:

1. Bootstrap sampling: to represent the fact that different food items may have been included in the observed data

2. Adding random noise: to represent inaccuracies in the nutrient measurements.



We will re-compute the clusters using 4 different versions of the training data, each of which is a bootstrap sample of the original food items that has been modified by adding a small amount of "noise": random numbers whose magnitude is up to around 20% of the observed measurement. Since our data has been scaled to have standard deviation 1, we roughly approximate such noise using random numbers drawn from a Gaussian distribution with mean 0 and standard deviation equal to 0.2 multiplied by the mean of each column (to represent an error of 20%). Any resulting perturbed values that end up being negative are rounded up to 0.



In [59]:
def perturb_food_data(df, bootstrap=True):
    # add random noise to each observation
    df_perturb = df.copy()
    df_perturb = df_perturb + np.random.normal(0, scale=0.2*df.mean(), size=df.shape)
    if bootstrap:
        # bootstrap sample
        df_perturb = df_perturb.sample(df_perturb.shape[0], replace=True)
    # replace any negative entries with 0
    df_perturb[df_perturb < 0] = 0
    return df_perturb

# create a list of four perturbed versions of the food dataset
food_fndds_perturb = [perturb_food_data(food_fndds_log_scaled) for i in np.arange(4)]
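
Because the perturbation above draws from NumPy's global random state, each run of the notebook produces different perturbed datasets. If reproducibility matters, one option is a seeded variant along the following lines. This is only a sketch: the function name `perturb_food_data_seeded`, the `noise_frac` and `seed` parameters, and the toy data frame are all assumptions introduced for illustration, and the function is not used elsewhere in this document.

```python
import numpy as np
import pandas as pd

def perturb_food_data_seeded(df, bootstrap=True, noise_frac=0.2, seed=None):
    # a dedicated generator makes the perturbation reproducible for a given seed
    rng = np.random.default_rng(seed)
    # Gaussian noise with a per-column standard deviation of noise_frac * column mean
    noise = rng.normal(0, scale=noise_frac * df.mean().to_numpy(), size=df.shape)
    df_perturb = df + noise
    if bootstrap:
        # bootstrap sample of the rows, seeded from the same generator
        sample_seed = int(rng.integers(0, 2**32 - 1))
        df_perturb = df_perturb.sample(len(df_perturb), replace=True,
                                       random_state=sample_seed)
    # replace any negative entries with 0
    return df_perturb.clip(lower=0)

# toy non-negative data (assumption for illustration)
toy = pd.DataFrame({"a": [0.0, 1.0, 2.0, 3.0], "b": [1.0, 1.5, 0.5, 2.0]},
                   index=["w", "x", "y", "z"])
perturbed_1 = perturb_food_data_seeded(toy, seed=1)
perturbed_2 = perturb_food_data_seeded(toy, seed=1)
```

Calling the function with seeds 0 to 3 would then give four perturbed datasets that are identical across notebook runs.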


In [60]:

cluster_perturb_list = []
for i, food_perturb in enumerate(food_fndds_perturb):
    # fit the K-means clusters
    kmeans_clust_k30_init_iter = KMeans(n_clusters=30, n_init="auto")
    kmeans_clust_k30_fit_iter = kmeans_clust_k30_init_iter.fit(food_perturb)
    kmeans_clust_k30_labels_iter = kmeans_clust_k30_fit_iter.labels_
    # store the cluster labels and silhouette scores for this iteration in a data frame
    iter_results = pd.DataFrame(dict(iter=i,
                                     cluster=kmeans_clust_k30_labels_iter,
                                     silhouette=silhouette_samples(food_perturb, kmeans_clust_k30_labels_iter)),
                                index=food_perturb.index)
    cluster_perturb_list.append(iter_results)
# combine the results once at the end (this avoids concatenating onto an empty data frame)
cluster_perturb_iter = pd.concat(cluster_perturb_list)



In [61]:
cluster_perturb_iter

Unnamed: 0,iter,cluster,silhouette
"Caramel dip, light",0,20,0.083708
"Ice cream cone, chocolate covered or dipped, chocolate ice cream",0,22,0.128004
"Fruit mixture, dried",0,4,0.054507
"Venison or deer, potatoes, and vegetables excluding carrots, broccoli, and dark-green leafy; gravy",0,29,0.131534
"Pork steak or cutlet, NS as to cooking method, lean only eaten",0,6,0.020054
...,...,...,...
"Frankfurter or hot dog, reduced fat or light, NFS",3,25,0.078849
"Candied ripe plantain, Puerto Rican style",3,23,0.024342
"Pink beans, canned, drained, fat added in cooking",3,13,0.162134
"Chicken, NS as to part and cooking method, skin eaten",3,8,0.102235



The distribution of silhouette widths for all data points is almost identical across the four perturbed datasets:

In [62]:
# draw one box of silhouette widths per perturbed dataset
px.box(cluster_perturb_iter, x="iter", y="silhouette")


Below we look at the Rand indices and silhouette scores. Again, we see similar results to our analysis in the previous section: the Rand index is again fairly high, but the adjusted Rand index indicates that the similarity after "adjusting for chance" is hovering between roughly 0.48 and 0.57, which indicates some similarity, although the clusterings are far from identical. The silhouette scores are relatively similar across all iterations; however, they are a bit lower than for the repeated runs of the algorithm without perturbing the data itself.

In [63]:
# place the original cluster results in a dataframe
original_cluster_df = pd.DataFrame(dict(cluster_orig=kmeans_clust_k30), index=food_fndds_log_scaled.index)
# merge each set of perturbed clusters with the original clusters using the index to match
cluster_perturb_iter_with_orig = [cluster_perturb_iter.query("iter == @i").merge(original_cluster_df, left_index=True, right_index=True) for i in range(4)]

In [64]:
[rand_score(df.cluster, df.cluster_orig) for df in cluster_perturb_iter_with_orig]


[0.9578110280832041,
 0.9571162353469679,
 0.9591263691868123,
 0.9499412044460273]

In [65]:
[adjusted_rand_score(df.cluster, df.cluster_orig) for df in cluster_perturb_iter_with_orig]

[0.5362374884703421, 0.539869884946795, 0.5667435692963426, 0.4821928200462442]

In [66]:
[df["silhouette"].mean() for df in cluster_perturb_iter_with_orig]


[0.1009289306427285,
 0.11222348694022545,
 0.10976792070741381,
 0.11337490253065267]


#### Stability to pre-processing judgment call perturbations

Finally, let's look at how much the pre-processing judgment calls we made impact our results. As in our PCA analysis, the analysis above is based on a version of the data that has been log-transformed and scaled, but not mean-centered. Since we have nutrients on vastly different scales, and we don't want nutrients with larger values to dominate the clustering algorithm, we always scale our data prior to clustering (i.e., "scaling" is not a judgment call that we are interested in perturbing). We will thus focus on exploring how our results change when we use alternative transformation (log-transformed vs untransformed) and centering (centered vs not centered) options. Note that in theory, centering our data *should* make no difference to the results of the K-means algorithm, but it is worth exploring anyway.
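
The reason centering should not matter is that K-means depends on the data only through distances between points (and between points and cluster centers), and subtracting the same vector from every row leaves those distances unchanged. The following minimal sketch checks this on made-up non-negative data (not the food data). To keep the comparison deterministic, both fits start from the same three rows as initial centers, which is an assumption made for illustration rather than how the clusters in this document were fit.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
# made-up non-negative data (assumption for illustration)
X_raw = rng.gamma(shape=2.0, scale=1.0, size=(300, 4))
X_centered = X_raw - X_raw.mean(axis=0)

def all_pairwise_distances(X):
    # Euclidean distance between every pair of rows
    return np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

# centering shifts every point by the same vector, so distances are unchanged
distances_unchanged = np.allclose(all_pairwise_distances(X_raw),
                                  all_pairwise_distances(X_centered))

# fit K-means from the same starting points (the first three rows) in both versions
labels_raw = KMeans(n_clusters=3, init=X_raw[:3], n_init=1).fit(X_raw).labels_
labels_centered = KMeans(n_clusters=3, init=X_centered[:3], n_init=1).fit(X_centered).labels_
centering_ari = adjusted_rand_score(labels_raw, labels_centered)
print(distances_unchanged, centering_ari)
```

In the analysis below, each K-means fit uses its own random initialization, so any differences between the centered and uncentered results reflect algorithmic randomness rather than an effect of centering.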

Let's repeat our cluster analyses using these alternative data cleaning judgment calls. 

In [67]:
perturb_options = list(product([True, False], [True, False]))
perturb_options = pd.DataFrame(perturb_options, columns=("center", "log"))

In [68]:
# add a column that specifies the perturbation options taking place to use for annotating figures
def specify_option(x):
    # multiplying a column name by its boolean flag gives the name when True and "" when False
    a = x.columns * x
    return a["center"] + a["log"]
perturb_options["perturb_option"] = specify_option(perturb_options)

In [69]:

food_fndds_jc_perturb = [preprocess_food_data(food_fndds,
                                              log_transform=perturb_options["log"][i],
                                              center=perturb_options["center"][i],
                                              scale=True) 
                         for i in range(perturb_options.shape[0])]


In [70]:
perturbed_jc_clusters_list = []
for i, df in enumerate(food_fndds_jc_perturb):
    kmeans_clust_k30_init_iter = KMeans(n_clusters=30, n_init="auto")
    kmeans_clust_k30_fit_iter = kmeans_clust_k30_init_iter.fit(df)
    # store the cluster labels and silhouette scores along with the perturbation option
    results = pd.DataFrame(dict(cluster=kmeans_clust_k30_fit_iter.labels_,
                                silhouette=silhouette_samples(df, kmeans_clust_k30_fit_iter.labels_),
                                perturbation_option=perturb_options["perturb_option"][i]))
    results.index = df.index
    perturbed_jc_clusters_list.append(results)
# combine the results once at the end (this avoids concatenating onto an empty data frame)
perturbed_jc_clusters_df = pd.concat(perturbed_jc_clusters_list)




Let's investigate the distribution of silhouette scores for each perturbation. When we *do not* use a log-transformation, the silhouette widths have a higher median value, but also a wider range (there are many points with substantially worse silhouette widths). Centering the data makes no difference (which is to be expected, since distances are the same regardless of whether we center the data or not, but it is always good to check).


In [71]:
px.box(perturbed_jc_clusters_df, x="perturbation_option", y="silhouette")


The bar chart below shows the number of observations in each cluster. The clusters computed on the log-transformed data are more evenly sized, whereas the clusters computed on the un-transformed data include one big cluster that contains more than 3,000 food items, and another one that contains more than 1,000 food items, which is not ideal.


In [72]:
cluster_counts = perturbed_jc_clusters_df.query("perturbation_option == ['', 'log']") \
    .groupby(["perturbation_option"]) \
    .cluster \
    .value_counts()
px.bar(cluster_counts.reset_index(), x="cluster", y="count", facet_col="perturbation_option")


### PCS conclusion

These findings indicate that our results are generally fairly predictable and are also reasonably stable. 


## Final clustering results

Keep in mind that our goal is to provide a food group label that will help us categorize the food items. It seems as though the higher the value of $K$, the more specific the categories identified. Since the level of specificity in the categories identified with $K = 30$ feels fairly reasonable, we decide to work with the $K = 30$ results that we examined in this document. We showed that these results were reasonably predictable and at least moderately stable. 


Our intention is to use the results of one particular run of the K-means clustering algorithm with $K = 30$. We will then need to manually name the category that each cluster corresponds to based on our opinion ("cereal", "leafy greens", etc.).

The code below converts the clusters to actual categories and, for categories with an unclear theme, places the category name in parentheses. We simplified a few categories (e.g., "leafy greens" and "corn and other vegetables" are now under "fruits and vegetables").

In [73]:
food_clustered_samples.pivot(columns="cluster")
                

Unnamed: 0_level_0,description,description,description,description,description,description,description,description,description,description,description,description,description,description,description,description,description,description,description,description,description,description,description,description,description,description,description,description,description,description
cluster,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30
index,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2,Unnamed: 22_level_2,Unnamed: 23_level_2,Unnamed: 24_level_2,Unnamed: 25_level_2,Unnamed: 26_level_2,Unnamed: 27_level_2,Unnamed: 28_level_2,Unnamed: 29_level_2,Unnamed: 30_level_2
0,"Fish shish kabob with vegetables, excluding po...","Milk dessert bar, frozen, made from lowfat mil...","Thousand Island dressing, light","Cauliflower, cooked, from frozen, fat added in...","Pork, spareribs, cooked, NS as to fat eaten","Cereal or granola bar, with coconut, chocolate...",Whiskey,Cereal (General Mills Lucky Charms Chocolate),"Sardines, canned in oil","Broccoli, cooked, from fresh, NS as to fat add...","Light ice cream, fudgesicle","Whiting, baked or broiled, made with butter",Nutrition bar (South Beach Living Meal Bar),"Chicken drumstick, baked, coated, skin / coati...","Cream, whipped",Popover,"Coffee, Cappuccino, nonfat","Vegetables, NS as to type, cooked, made with oil",Passion fruit nectar,"Mixed nuts, NFS","Swordfish, baked or broiled, fat added in cooking",Industrial oil as ingredient in food,"Fish sandwich, on bun, with spread","Fruit salad, excluding citrus fruits, with mar...","Syrup, dietetic","Lentils, dry, cooked, made with oil","Meat turnover, Puerto Rican style","Beans, string, green, canned, low sodium, NS a...","Infant formula, NS as to form (Enfamil ProSobee)","Egg omelet or scrambled egg, with cheese and m..."
1,"Beef chow mein or chop suey, no noodles","Coffee, Iced Latte, decaffeinated, nonfat, fla...","Pie, plum, two crust","Coffee creamer, soy, liquid","Chicken, back","Cake, rice flour, without icing or filling","Beer, light",Nutrition bar (Zone Perfect Classic Crunch),"Mackerel, smoked","Broccoli raab, cooked, NS as to fat added in c...","Icing, chocolate","Fish, NS as to type, coated, baked or broiled,...",Cereal (Weetabix Whole Grain),"Beef steak, fried, NS as to fat eaten",Queso Chihuahua,"Dippin' Dots, flash frozen ice cream snacks, f...","Coffee, Iced Latte, nonfat, flavored","Vegetable soup, chunky style","Apples and pears, baby food, junior","Almonds, NFS","Flounder, smoked",Soybean oil,Beef and noodles with cream or white sauce,"Salad dressing, light, NFS","Coffee, decaffeinated, pre-lightened","Waffle, from school, NFS","Pizza, cheese, stuffed crust","Peas and carrots, cooked, from frozen, fat add...","Infant formula, powder, made with plain bottle...","Egg omelet or scrambled egg, with meat and tom..."
2,"Stewed chickpeas with Spanish sausages, Puerto...","Asparagus soup, cream of, prepared with milk",Yuca fries,"Chicken or turkey and corn hominy soup, home ...","Frankfurter or hot dog, cheese-filled","Doughnut, chocolate, cake type, with chocolate...",Screwdriver,Cereal (General Mills Cookie Crisp),"Barracuda, baked or broiled, fat not added in ...","Bitter melon leaves, horseradish leaves, jute ...","Topping, chocolate flavored hazelnut spread","Ocean perch, baked or broiled, fat not added i...",Cereal (Post Golden Crisp),"Beef steak, braised, lean and fat eaten","Cheese, Cheddar","Shrimp, coated, baked or broiled, made with bu...","Coffee, macchiato, sweetened","Carrots, canned, low sodium, made with margarine","Lime, raw","Crackers, corn","Eel, cooked, NS as to cooking method","Shortening, NS as to vegetable or animal","Bologna and cheese sandwich, with spread",Russian dressing,Marshmallow,"Beans, canned, drained, NS as to type, fat add...",Taquito or flauta with meat and cheese,"Corn, cooked, from frozen, NS as to color, mad...","Infant formula, NS as to form (Similac Isomil ...","Egg, whole, fried without fat"
3,"Beef, rice, and vegetables excluding carrots, ...",Gelatin dessert with fruit and whipped cream,"Pie, pear, two crust","Onions, cooked, NS as to form, NS as to fat ad...","Chicken thigh, fried, coated, skin / coating e...","Nuts, carob-coated",Vodka and tonic,Cereal (Post Cocoa Pebbles),"Salmon, dried","Cress, raw","Pie, pudding, chocolate, with chocolate coatin...","Perch, baked or broiled, made with margarine","Cereal, frosted oat cereal with marshmallows","Turkey, light meat, skin eaten","Cheese, Mexican blend, reduced fat","Cheese sandwich, reduced fat Cheddar cheese, o...","Coffee, instant, pre-lightened and pre-sweeten...","Rice, brown, with vegetables and gravy, fat ad...","Blueberries, cooked or canned, unsweetened, wa...","Cereal or Granola bar, NFS","Trout, baked or broiled, made with butter",Animal fat or drippings,"Chicken nuggets, from school lunch","Ranch dip, light",Fruit flavored syrup used for milk beverages,"Edamame, cooked","Burrito with beans, meatless","Corn, yellow, NS as to form, cream style, fat ...","Infant formula, NS as to form (Gerber Graduate...","Bear, cooked"
4,"Chicken or turkey, potatoes, and vegetables in...","Yogurt, whole milk, baby food, with fruit and ...","Biscuit dough, fried","Rice, white, with gravy, NS as to fat added in...","Beef, sandwich steak, flaked, formed, thinly s...",Banana chips,Frozen margarita,Cereal (General Mills Count Chocula),"Herring, raw","Mustard greens, cooked, from fresh, fat not ad...","Snack cake, chocolate, with icing or filling, ...","Fish, NS as to type, raw","Nutritional powder mix, protein, light, NFS","Chicken, NS as to part, stewed, skin eaten","Cheese, Cheddar, reduced sodium","Pie, coconut cream","Soft drink, cola, fruit or vanilla flavored, diet","Plantain, boiled, NS as to green or ripe","Strawberries, cooked or canned, in syrup",Flax seeds,"Ray, baked or broiled, fat not added in cooking",Cottonseed oil,Ham salad sandwich,"Dill dip, light","Fluid replacement, 5% glucose in water","Lima beans, dry, cooked, made with animal fat ...","Pizza with pepperoni, from restaurant or fast ...","Dandelion greens, cooked, made with margarine","Infant formula, powder, made with baby water (...","Egg, whole, fried, NS as to fat added in cooking"
5,"Macaroni or pasta salad, made with light mayon...","Light ice cream, soft serve, flavors other tha...","Empanada, Mexican turnover, pumpkin","Flavored rice, brown and wild","Ham, fresh, cooked, NS as to fat eaten","Dietetic or low calorie candy, chocolate covered",Vodka and cola,Cereal (General Mills Chex Chocolate),"Herring, coated, baked or broiled, fat not add...","Brussels sprouts, cooked, from fresh, fat adde...","Cookie bar, with chocolate, nuts, and graham c...","Tuna, fresh, dried",Cereal (Post Honey Bunches of Oats with Vanill...,"Chicken, NS as to part, grilled with sauce, sk...","Butter, whipped, stick, salted",Quesadilla with vegetables,Frozen coffee drink,"Mixed vegetables, cooked, NS as to form, fat n...","Apricots, baby food, junior","Crackers, woven wheat, flavored (Triscuit)","Sea bass, baked or broiled, fat not added in c...","Margarine, tub, salted","Cornbread muffin, stick, round","Dill dip, regular","Gelatin dessert, dietetic, with whipped toppin...","White beans, dry, cooked, made with animal fat...","Stuffed shells, cheese-filled, with meat sauce","Rice, white, with corn, NS as to fat added in ...","Infant formula, NS as to form (Similac Expert ...","Egg, whole, baked, NS as to fat added in cooking"
6,"Cannelloni, cheese-filled, with tomato sauce, ...","Yogurt, whole milk, plain","Cookie, oatmeal, with raisins","Mushroom, Asian, cooked, from dried","Frankfurter or hot dog sandwich, NFS, plain, o...",Cereal or granola bar (Quaker Granola Bites),Jagerbomb,Cereal (General Mills Reese's Puffs),"Herring, coated, fried","Brussels sprouts, cooked, from fresh, made wit...",TWIX Chocolate Fudge Cookie Bars,"Sushi, topped with eel","Finger Foods, Puffs, baby food","Veal cutlet or steak, fried, NS as to fat eaten","Cheese, cream","Gnocchi, potato","Coffee, Latte, nonfat, flavored",Spaghetti sauce,"Tutti-fruitti pudding, baby food, strained","Potato chips, restructured, fat free","Shrimp, dried",Safflower oil,"Taco or tostada with meat, from fast food","Chipotle dip, yogurt based","Water, bottled, flavored, sugar free (Glaceau ...","Peas, cowpeas, field peas, or blackeye peas, n...","Bacon cheeseburger, 1 medium patty, with condi...","Carrots, cooked, from frozen, fat not added in...","Infant formula, powder, made with plain bottle...","Egg omelet or scrambled egg, with cheese, meat..."
7,"Ravioli, meat-filled, with tomato sauce or mea...","Fat free ice cream, NS as to flavor","Coffee creamer, powder, fat free","Beans, lima, immature, cooked, from frozen, NS...","Frankfurter or hot dog sandwich, NFS, plain, o...",TWIX Caramel Cookie Bars,Mojito,Cereal (General Mills Cocoa Puffs),Sardines with mustard sauce,"Peas, green, raw","Pie shell, chocolate wafer","Whiting, coated, baked or broiled, made with m...",Cereal (Kellogg's Special K Red Berries),"Chicken drumstick, baked, broiled, or roasted,...",Lemon-butter sauce,"Potato, baked, peel not eaten, with vegetables",Energy drink (Monster),"Mixed vegetables, canned, low sodium, fat not ...","Lettuce, raw","Pecans, unroasted","Mussels, steamed or poached",Sunflower oil,"Cake, Ravani","Dip, NFS",Butterscotch hard candy,"Bagel, wheat, with raisins","Cheeseburger, 1 medium patty, plain, on wheat bun",Mustard pickles,"Infant formula, powder, made with plain bottle...","Egg omelet or scrambled egg, with meat and veg..."
8,"Chicken or turkey, rice, and vegetables includ...","Macaroni or noodles with cheese, made from red...",Vegetable tempura,"Barley, NS as to fat added in cooking","Frankfurter or hot dog sandwich, beef, plain, ...","Rice, cooked with coconut milk","Fruit punch, alcoholic",Nutritional powder mix (Carnation Instant Brea...,"Herring, coated, baked or broiled, fat added i...","Snowpea, cooked, from frozen, NS as to fat add...","Milk chocolate candy, plain","Halibut, coated, baked or broiled, made with m...","Cereal, fruit whirls","Chicken drumstick, sauteed, skin not eaten","Queso Anejo, aged Mexican cheese","Pasta, whole grain, with cream sauce and seafo...","Gravy, redeye","Mixed vegetables, cooked, from frozen, made wi...","Apricot, dried, cooked, NS as to sweetened or ...","Crackers, wheat, reduced sodium","Ray, cooked, NS as to cooking method","Margarine, stick, unsalted","Potato, hash brown, from restaurant, with cheese","Margarine, stick, salted",Snow cone,"Frankfurter or hot dog sandwich, fat free, pla...","Chiles rellenos, cheese-filled","Beans, lima and corn, cooked, NS as to fat add...","Infant formula, NS as to form (Enfamil Enfagro...","Cookie, ladyfinger"
9,Fish and rice with tomato-based sauce,"Chocolate milk, made from light syrup with red...","Egg white, omelet, scrambled, or fried, with c...","Mushrooms, cooked, from fresh, made with marga...","Beef, shortribs, cooked, lean and fat eaten",Cereal or granola bar (Quaker Chewy Dipps Gran...,Irish Coffee,Nutrition bar (Balance Original Bar),"Herring, cooked, NS as to cooking method","Peas and onions, cooked, fat added in cooking,...","Light ice cream, bar or stick, with low-calori...","Whiting, coated, baked or broiled, made withou...","Cereal, bran flakes","Catfish, baked or broiled, made without fat","Cheese, Brick",Sour cream,"Energy drink, sugar free","Calabaza, cooked","Grapefruit, canned or frozen, unsweetened, wat...",Pine nuts,"Croaker, steamed or poached",Flaxseed oil,"Chicken tenders or strips, breaded, from fast ...",Butter-vegetable oil blend,Fruit syrup,"Jelly sandwich, regular jelly, on white bread","Burrito with beans, rice, and sour cream, meat...","Corn, yellow and white, cooked, from frozen, N...","Infant formula, liquid concentrate, made with ...",Lamb or mutton loaf


In [74]:
# Manually assigned names for each of the 30 K-means clusters.
CLUSTER_NAMES = {
    1: "(Unknown)",
    2: "Dairy",
    3: "Fruits and Vegetables",
    4: "Seafood",
    5: "(Unknown)",
    6: "Cereal",
    7: "(Unknown)",
    8: "Beverages",
    9: "Meats",
    10: "Fats and oils",
    11: "(Unknown)",
    12: "Breads",
    13: "Seafood",
    14: "(Unknown)",
    15: "Dairy",
    16: "Desserts",
    17: "Fruits and Vegetables",
    18: "Fruits and Vegetables",
    19: "Seafood",
    20: "Snacks",
    21: "Meats",
    22: "(Unknown)",
    23: "Desserts",
    24: "Infant formula",
    25: "Meals",
    26: "Eggs",
    27: "Dairy",
    28: "Vegetables",
    29: "Alcoholic beverages",
    30: "Nuts",
}


def name_category(clust):
    # Look up the category name for a cluster ID.
    # Returns None for IDs outside 1-30, as the original if/elif chain did.
    return CLUSTER_NAMES.get(clust)


In [75]:

food_categories = pd.DataFrame(dict(cluster=kmeans_clust_k30))
food_categories.index = food_fndds_log_scaled.index
food_categories["category"] = [name_category(x) for x in food_categories["cluster"]]
food_categories


description,cluster,category
"Milk, human",29,Alcoholic beverages
"Milk, NFS",2,Dairy
"Milk, whole",2,Dairy
"Milk, low sodium, whole",2,Dairy
"Milk, calcium fortified, whole",2,Dairy
...,...,...
Breading or batter as ingredient in food,26,Eggs
Wheat bread as ingredient in sandwiches,26,Eggs
Sauce as ingredient in hamburgers,24,Infant formula
Industrial oil as ingredient in food,22,(Unknown)



Below, we can look at the results of this first pass of clustering with $K = 30$ (where we have manually defined and simplified the names of each cluster) for a random sample of 30 food items:

In [76]:
food_categories.loc[[
    "Mushroom soup, with meat broth, prepared with water",
    "Pasta with cream sauce and poultry, restaurant",
    "Beef, rice, and vegetables excluding carrots, broccoli, and dark-green leafy; gravy",
    "Rice, brown, with other vegetables, fat not added in cooking",
    "Milk, dry, reconstituted, NS as to fat content",
    "Vegetable and fruit juice drink, with high vitamin C, diet",
    "Chicory beverage",
    "Muffin, English, multigrain",
    "Cookie, Pizzelle",
    "Light ice cream, soft serve cone, flavors other than chocolate",
    "Infant formula, powder, made with baby water (Enfamil Gentlease)",
    "Infant formula, powder, made with baby water (Store Brand Soy)",
    "Reese's Peanut Butter Cup",
    "Egg salad, made with light mayonnaise",
    "Corn, yellow and white, cooked, NS as to form, NS as to fat added in cooking",
    "Green banana, cooked in salt water",
    "Beans, string, green, cooked, from frozen, NS as to fat added in cooking",
    "Peaches, baby food, junior",
    "Greens, cooked, NS as to form, NS as to fat added in cooking",
    "Huckleberries, raw",
    "Stewed chitterlings, Puerto Rican style",
    "Chicken wing, stewed",
    "Halibut, coated, fried, made with margarine",
    "Flounder, coated, baked or broiled, made with cooking spray",
    "Trout, coated, baked or broiled, made with oil",
    "Cod, coated, baked or broiled, made with oil",
    "Pistachio nuts, NFS",
    "Crackers, sandwich",
    "Sunflower seeds, plain, unsalted",
    "Tomato and onion, cooked, fat added in cooking, NS as to type of fat"
]]

description,cluster,category
"Mushroom soup, with meat broth, prepared with water",4,Seafood
"Pasta with cream sauce and poultry, restaurant",23,Desserts
"Beef, rice, and vegetables excluding carrots, broccoli, and dark-green leafy; gravy",1,(Unknown)
"Rice, brown, with other vegetables, fat not added in cooking",4,Seafood
"Milk, dry, reconstituted, NS as to fat content",2,Dairy
"Vegetable and fruit juice drink, with high vitamin C, diet",25,Meals
Chicory beverage,25,Meals
"Muffin, English, multigrain",26,Eggs
"Cookie, Pizzelle",26,Eggs
"Light ice cream, soft serve cone, flavors other than chocolate",2,Dairy



These results are pretty impressive. They are also very similar to the results we saw when we used R: only 5 of the categories differ, usually because one or the other is "(Unknown)".

One idea to improve these clusters is to do *another "layer"* of clustering, where we create some additional clusters within each cluster to see if we can tease out some more groups, and to see if the mistakes get separated from the non-mistakes. 
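A minimal sketch of this two-layer idea is given here. It runs on synthetic data rather than on `food_fndds_log_scaled`, and the number of first-layer clusters, the value of `sub_k`, and the column names are all assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for the pre-processed nutrient data (assumption:
# rows are food items, columns are nutrients).
rng = np.random.default_rng(0)
X = pd.DataFrame(
    rng.normal(size=(200, 5)),
    columns=[f"nutrient_{i}" for i in range(5)],
)

# First layer: coarse clusters.
first_layer = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Second layer: re-cluster the items inside each first-layer cluster.
sub_k = 2  # hypothetical number of sub-clusters per cluster
second_layer = pd.Series(-1, index=X.index)
for c in np.unique(first_layer):
    members = X.index[first_layer == c]
    if len(members) < sub_k:
        # Too few items to split further: keep them in one sub-cluster.
        second_layer.loc[members] = 0
        continue
    km = KMeans(n_clusters=sub_k, n_init=10, random_state=0)
    second_layer.loc[members] = km.fit_predict(X.loc[members])

nested = pd.DataFrame({"cluster": first_layer, "subcluster": second_layer})
print(nested.groupby(["cluster", "subcluster"]).size())
```

On the real data, each (cluster, subcluster) pair would then need to be inspected and named by hand, just as we did for the original 30 clusters.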

Another idea is to increase the number of clusters (e.g., set $K = 100$) and manually aggregate the highly specific categories you obtain into more general categories. This will be a lot more work though, since you will need to manually identify the theme of many more clusters.
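The aggregation step could look something like the following sketch. The fine-grained cluster IDs and the `fine_to_broad` mapping are made up for illustration; in practice the mapping would be built by manually inspecting each of the fine-grained clusters.

```python
import pandas as pd

# Hypothetical fine-grained cluster labels (e.g., from K = 100) for a few items.
fine = pd.Series(
    [17, 17, 42, 63, 63, 88],
    index=[
        "Milk, whole",
        "Yogurt, whole milk, plain",
        "Cheese, Brick",
        "Herring, raw",
        "Shrimp, dried",
        "Flaxseed oil",
    ],
    name="fine_cluster",
)

# Manually curated mapping from fine cluster ID to a broad category.
# These IDs and names are invented for this example.
fine_to_broad = {17: "Dairy", 42: "Dairy", 63: "Seafood"}

# Any fine cluster that has not been named yet falls back to "(Unknown)".
broad = fine.map(fine_to_broad).fillna("(Unknown)")
print(broad)
```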

You could also try using the hierarchical clustering algorithm instead of K-means: our brief exploration in the stability analysis suggested that, despite what the quantitative metrics indicated, hierarchical clustering might yield fewer "unclear" clusters.
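As a rough sketch of how such a comparison could be set up, the code below fits both algorithms to synthetic, well-separated data and compares the two sets of labels with the adjusted Rand index. The choice of Ward linkage and the toy data are assumptions, not the settings used elsewhere in this analysis.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

# Three well-separated synthetic blobs standing in for the food data.
rng = np.random.default_rng(1)
centers = np.array([[0, 0], [10, 10], [-10, 10]])
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in centers])

# Fit both algorithms with the same number of clusters.
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
hier_labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)

# The adjusted Rand index compares the two partitions without requiring
# the cluster IDs themselves to match.
ari = adjusted_rand_score(kmeans_labels, hier_labels)
print(f"Adjusted Rand index: {ari:.2f}")
```

On the real data, the more informative check is qualitative: name the hierarchical clusters by hand and count how many end up as "(Unknown)".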

Another option is to combine the results of multiple clustering algorithms using a "majority vote" kind of system.
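One simple way to set up such a vote is sketched below. Raw cluster IDs are not comparable across algorithms, so this sketch assumes each algorithm's clusters have already been given category names and votes on the names instead. The three columns and their contents are hypothetical.

```python
import pandas as pd

# Hypothetical named categories assigned to four items by three clusterings.
votes = pd.DataFrame(
    {
        "kmeans": ["Dairy", "Seafood", "Meats", "(Unknown)"],
        "hierarchical": ["Dairy", "Seafood", "Seafood", "Nuts"],
        "kmeans_reseeded": ["Dairy", "Meats", "Meats", "Snacks"],
    },
    index=["Milk, whole", "Herring, raw", "Chicken wing, stewed", "Pine nuts"],
)


def majority_vote(row):
    # Return the category chosen by a strict majority of the clusterings,
    # or "(Unknown)" when no category wins more than half of the votes.
    counts = row.value_counts()
    if counts.iloc[0] > len(row) / 2:
        return counts.index[0]
    return "(Unknown)"


consensus = votes.apply(majority_vote, axis=1)
print(consensus)
```

Items without a strict majority are flagged as "(Unknown)", which gives a natural shortlist for manual review.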

These explorations are left as an exercise for the ambitious reader. Regardless, if these results are going to be put into production, someone will have to go through and manually fix any remaining mistakes: it is almost impossible for a purely data-driven approach to perfectly categorize every food item in the data.



## EDA of clustering results

Aside from being useful for categorizing food items in our hypothetical app, our results are also a very useful exploratory tool: they allow us to visualize, for the first time, the distribution of food items in our dataset, as shown in the figure below.

In [77]:
px.bar(food_categories["category"].value_counts())


What we can see is that the largest food category (ignoring the "Unknown" category) is fruits and vegetables, followed by meats. 


## [Exercise: to complete] Visualizing the data in principal component space




## [Exercise: to complete] Clustering the columns