
MAT 259: Final Project

Chantal Nguyen

Instructor: George Legrady


The following code reads a JSON file of recipes extracted from the online recipe database Epicurious.com. The data include recipe titles, ratings, review counts, publication dates, ingredient lists, and instructions, among other fields.

tf-idf is used to vectorize each recipe's ingredient list. Dimensionality reduction is performed first with truncated SVD and then with t-SNE to project the recipes into 2D space. k-means clustering, run on the tf-idf vectors, groups the recipes into 15 clusters.

In [0]:
from sklearn.manifold import TSNE
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import numpy as np
import io
import re
import json
import pandas as pd
import time
from datetime import datetime

Load the JSON file into a pandas dataframe, keeping only recipes that have at least one review and an ingredient list

In [0]:
# load the data (one JSON object per line)
data = pd.read_json('epicurious-recipes.json', lines=True)
# keep only recipes with at least one review
data = data[data.reviewsCount > 0]
# drop entries with no ingredient list
data = data[data['ingredients'].apply(lambda v: isinstance(v, list))]
data = data.reset_index(drop=True)
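
The filtering step can be tried on a tiny in-memory example. This is a minimal sketch: the three records below are made up for illustration and only mimic the `hed`, `reviewsCount`, and `ingredients` fields used in this notebook.

```python
import io
import pandas as pd

# Hypothetical records in JSON Lines form (one object per line).
raw = io.StringIO(
    '{"hed": "Soup", "reviewsCount": 3, "ingredients": ["1 onion", "2 cups broth"]}\n'
    '{"hed": "Tea", "reviewsCount": 0, "ingredients": ["1 tea bag"]}\n'
    '{"hed": "Mystery", "reviewsCount": 5, "ingredients": null}\n'
)
df = pd.read_json(raw, lines=True)

# Keep reviewed recipes, then keep rows whose ingredients field is a list.
reviewed = df[df.reviewsCount > 0]
has_list = reviewed["ingredients"].apply(lambda v: isinstance(v, list))
clean = reviewed[has_list].reset_index(drop=True)

print(clean["hed"].tolist())
```

Only the reviewed recipe with a real ingredient list survives, and the index restarts at 0 so that later position-based joins line up.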

Extract ingredient lists and remove numbers and non-alphabetic characters

In [0]:
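ingredients_data = []
# translation table that deletes digits and punctuation
chars_to_remove = str.maketrans('', '', '0123456789!"#%\'()*+,-./:;<=>?@[\\]^_`{|}~')
for ingredients in data['ingredients']:
    # join with a space so the last word of one ingredient does not fuse with the first word of the next
    ingredients_data.append(' '.join(ingredients).translate(chars_to_remove))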
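
To make the cleaning step concrete, here is a minimal sketch on a made-up ingredient list (the sample strings are hypothetical, not taken from the dataset). It uses the same set of removed characters and, as a choice made for this sketch, joins the ingredients with a space. Deleting hyphens fuses compounds such as "all-purpose" into one token, which is why terms like "allpurpose" and "lowsalt" appear in the cluster listings further down.

```python
# Hypothetical ingredient list for illustration only (not from the dataset).
sample_ingredients = ["2 cups all-purpose flour", "1/2 teaspoon salt"]

# Same character set as the notebook: digits plus most ASCII punctuation.
chars = '0123456789!"#%\'()*+,-./:;<=>?@[\\]^_`{|}~'
table = str.maketrans('', '', chars)

# Join with a space so adjacent ingredients stay separate words.
cleaned = ' '.join(sample_ingredients).translate(table)
tokens = cleaned.split()
print(tokens)
```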

Define stopwords: common English stopwords (from https://github.com/stopwords-iso/stopwords-en/) plus measurement units and packaging words such as "cup", "oz", and "can"

In [0]:
with open('stopwords-en.json') as f:
    stopwords = json.load(f)

In [0]:
stopwords = stopwords + [
    'cup', 'cups', 'tablespoon', 'tablespoons', 'teaspoon', 'teaspoons',
    'c', 'tbsp', 'tsp', 'oz', 'g', 'kg', 'lb', 'pt', 'gal', 'qt', 'qts',
    'tbsps', 'tsps', 'ml', 'l', 'inch', 'inches', 'pinch', 'pinches',
    'dash', 'dashes', 'ounce', 'ounces', 'can', 'cans', 'bag', 'bags',
    'package', 'packages', 'gram', 'grams', 'pound', 'pounds',
]

Vectorize with tf-idf

In [6]:
vectorizer = TfidfVectorizer(max_df=0.5, max_features=10000,
                             min_df=2, stop_words=stopwords,
                             use_idf=True)
X = vectorizer.fit_transform(ingredients_data)
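
A small sketch of what the vectorizer produces, using three made-up "recipes" and a shortened, hypothetical stopword list (neither comes from the dataset). It leaves out the `max_df`, `min_df`, and `max_features` limits, which would filter almost everything in a corpus this small.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical cleaned ingredient strings (not from the dataset).
docs = [
    "cups flour sugar butter",
    "cups sugar cream vanilla",
    "tablespoons soy sauce garlic",
]
# Hypothetical, shortened stopword list covering measurement words only.
measure_words = ["cups", "tablespoons"]

vec = TfidfVectorizer(stop_words=measure_words)
X_toy = vec.fit_transform(docs)

vocab = sorted(vec.vocabulary_)   # learned terms, stopwords excluded
sugar = X_toy[0, vec.vocabulary_["sugar"]]   # term shared by two recipes
flour = X_toy[0, vec.vocabulary_["flour"]]   # term unique to one recipe
print(vocab)
print(X_toy.shape)                # (number of recipes, number of terms)
print(sugar < flour)
```

Within the first recipe, "sugar" receives a lower weight than "flour" because it also occurs in another recipe. That is the idf part of the weighting at work.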



First reduce to 50 dimensions with truncated SVD (running t-SNE directly on the full tf-idf matrix would be too computationally expensive)

In [0]:
svd = TruncatedSVD(n_components=50)
Y = svd.fit_transform(X)
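
One way to judge whether 50 components are enough is to check how much variance they retain. The sketch below does this on synthetic data: `X_demo` is a made-up, mostly-zero matrix that only stands in for the tf-idf matrix, so the printed number says nothing about the recipe data itself.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
# Synthetic, mostly-zero matrix standing in for the tf-idf matrix (assumption).
X_demo = rng.random((200, 300)) * (rng.random((200, 300)) < 0.05)

svd_demo = TruncatedSVD(n_components=50, random_state=0)
Y_demo = svd_demo.fit_transform(X_demo)

# Fraction of the total variance kept by the 50 components.
retained = float(svd_demo.explained_variance_ratio_.sum())
print(Y_demo.shape, round(retained, 3))
```

Running the same check on the real `svd` object (`svd.explained_variance_ratio_.sum()`) would show how much of the tf-idf structure reaches t-SNE.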

Reduce to 2D with t-SNE

In [8]:
tsne_model = TSNE(n_components=2, verbose=1, random_state=1)
Z = tsne_model.fit_transform(Y)

[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 31417 samples in 0.084s...
[t-SNE] Computed neighbors for 31417 samples in 116.772s...
[t-SNE] Computed conditional probabilities for sample 1000 / 31417
[t-SNE] Computed conditional probabilities for sample 2000 / 31417
[t-SNE] Computed conditional probabilities for sample 3000 / 31417
[t-SNE] Computed conditional probabilities for sample 4000 / 31417
[t-SNE] Computed conditional probabilities for sample 5000 / 31417
[t-SNE] Computed conditional probabilities for sample 6000 / 31417
[t-SNE] Computed conditional probabilities for sample 7000 / 31417
[t-SNE] Computed conditional probabilities for sample 8000 / 31417
[t-SNE] Computed conditional probabilities for sample 9000 / 31417
[t-SNE] Computed conditional probabilities for sample 10000 / 31417
[t-SNE] Computed conditional probabilities for sample 11000 / 31417
[t-SNE] Computed conditional probabilities for sample 12000 / 31417
[t-SNE] Computed conditional probabilities for s
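
The "91 nearest neighbors" in the log above is consistent with the default perplexity of 30, since scikit-learn searches for roughly 3 × perplexity + 1 neighbors. A minimal sketch of the same call on synthetic data follows; the two Gaussian blobs are made up and only stand in for the SVD output `Y`, and the perplexity of 10 is an arbitrary choice for this small sample.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Two synthetic blobs in 50 dimensions, standing in for the SVD output.
blob_a = rng.normal(loc=0.0, scale=1.0, size=(40, 50))
blob_b = rng.normal(loc=8.0, scale=1.0, size=(40, 50))
Y_demo = np.vstack([blob_a, blob_b])

# perplexity must be smaller than the number of samples
tsne_demo = TSNE(n_components=2, perplexity=10, random_state=1)
Z_demo = tsne_demo.fit_transform(Y_demo)
print(Z_demo.shape)
```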

Extract t-SNE coordinates, save them to a CSV file, and append them to the dataframe

In [0]:
x_coords = Z[:,0]
y_coords = Z[:,1]
# write one "x, y" pair per line
np.savetxt('epi_coords.csv', Z, fmt='%f', delimiter=', ')

In [0]:
coords_df = pd.DataFrame(Z, columns=['x', 'y'])

In [0]:
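# keep the recipe title ('hed'), publication date, and rating fields, then attach the t-SNE coordinates
data = data[['hed', 'pubDate', 'aggregateRating','reviewsCount','willMakeAgainPct']]
data = data.join(coords_df)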
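
`DataFrame.join` matches rows by index label, not by position, so this step relies on the index having been reset after filtering. A minimal sketch with made-up frames (the names `recipes` and `coords` are hypothetical) shows what goes wrong otherwise:

```python
import numpy as np
import pandas as pd

# Hypothetical recipe table whose index has gaps after filtering.
recipes = pd.DataFrame({"hed": ["a", "b", "c"]}, index=[0, 2, 5])
coords = pd.DataFrame(np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]),
                      columns=["x", "y"])

misaligned = recipes.join(coords)                       # joins on index labels
aligned = recipes.reset_index(drop=True).join(coords)   # positions now match

print(misaligned["x"].isna().sum())
print(aligned["x"].isna().sum())
```

In the misaligned case the second recipe picks up the third row of coordinates and the last recipe gets none at all.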

Perform k-means clustering on the tf-idf matrix, with the number of clusters arbitrarily set to 15

In [12]:
num_clusters = 15
km = KMeans(n_clusters=num_clusters, init='k-means++', max_iter=100, n_init=1)
print("Clustering sparse data with %s" % km)
km.fit(X)

print("Top terms per cluster:")
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names_out()  # use get_feature_names() on scikit-learn < 1.0
for i in range(num_clusters):
    print("Cluster %d:" % i)
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind])
        

Clustering sparse data with KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=100,
    n_clusters=15, n_init=1, n_jobs=None, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)
Top terms per cluster:
Cluster 0:
 chicken
 broth
 chopped
 lowsalt
 canned
 white
 dried
 oil
 butter
 garlic
Cluster 1:
 sauce
 soy
 sesame
 rice
 asian
 minced
 oil
 ginger
 peeled
 garlic
Cluster 2:
 baking
 flour
 sugar
 unsalted
 butter
 allpurpose
 powder
 salt
 soda
 vanilla
Cluster 3:
 sugar
 water
 salt
 butter
 lime
 unsalted
 juice
 chopped
 cream
 equipment
Cluster 4:
 lemon
 juice
 grated
 zest
 finely
 chopped
 sugar
 peel
 oil
 olive
Cluster 5:
 chopped
 finely
 onion
 oil
 garlic
 pepper
 red
 olive
 tomatoes
 seeded
Cluster 6:
 orange
 juice
 peel
 sugar
 grated
 zest
 lemon
 finely
 liqueur
 chopped
Cluster 7:
 chocolate
 unsweetened
 bittersweet
 semisweet
 sugar
 cream
 vanilla
 chopped
 unsalted
 powder
Cluster 8:
 sugar
 cream
 vanilla
 egg
 unsalted
 

In [0]:
data['label'] = km.labels_
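
Since the cluster count of 15 was set by hand, one common heuristic for comparing candidate values is the silhouette score. This is not what the notebook does; it is a sketch of an alternative, run on synthetic 2D blobs that stand in for real features.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Three well-separated synthetic blobs (assumption: stand-in for real features).
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Score each candidate cluster count; higher silhouette is better.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(points)
    scores[k] = silhouette_score(points, labels)

best_k = max(scores, key=scores.get)
print(best_k)
```

On the full tf-idf matrix the silhouette computation is expensive, so it would likely need the `sample_size` argument of `silhouette_score` or a subsample of recipes.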

Clean up the data and save it as a JSON file containing the recipe title, rating, review count, publication date, percentage of reviewers who said they would make the recipe again, t-SNE coordinates, and cluster label

In [0]:
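import calendar

datestrings = []
for index, row in data.iterrows():
    pdate = row['pubDate']
    # pubDate is a UTC timestamp (trailing Z): parse it, convert to Unix seconds, and scale by 1e-9
    parsed = datetime.strptime(pdate, '%Y-%m-%dT%H:%M:%S.%fZ')
    data.at[index,'pubDate'] = calendar.timegm(parsed.timetuple())/1e+09
    datestrings.append(pdate[0:10])  # keep the YYYY-MM-DD part as a readable date string
data['datestr'] = datestrings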
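
The date handling can be checked on a single value. In this sketch the timestamp is made up but follows the same format as `pubDate`. It uses `calendar.timegm`, which treats the parsed time as UTC, so the result does not depend on the timezone of the machine running it.

```python
import calendar
from datetime import datetime

# Hypothetical timestamp in the same format as the pubDate field.
pdate = "2015-03-01T12:30:00.000Z"

parsed = datetime.strptime(pdate, "%Y-%m-%dT%H:%M:%S.%fZ")
# The trailing Z marks UTC, so convert with timegm rather than the
# local-time mktime.
epoch_seconds = calendar.timegm(parsed.timetuple())
scaled = epoch_seconds / 1e9   # same scaling as the notebook
date_str = pdate[0:10]         # YYYY-MM-DD prefix
print(epoch_seconds, scaled, date_str)
```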

In [0]:
data.to_json('epicurious_data.json', orient='records')
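
For reference, `orient="records"` produces a JSON array with one object per recipe, which is convenient for visualization code that iterates over rows. A minimal sketch with a made-up two-row table (the values are hypothetical):

```python
import json
import pandas as pd

# Hypothetical two-row table with the same kind of columns as the export.
df = pd.DataFrame({"hed": ["Soup", "Cake"], "x": [0.5, -1.25], "label": [3, 7]})

payload = df.to_json(orient="records")   # a JSON array of row objects
records = json.loads(payload)            # parse it back to check the layout
print(records[0])
```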