## Project Fletcher Notebook: An Examination of Recipes from Around the World

author: Benjamin Sturm <br />
contact: bwsturm@gmail.com <br />
date: June 4, 2018 

**Project Summary**

For this project, I examined recipes from around the world through the lens of a data scientist.  I wanted to see what I could learn about the relationships between different cuisines.  To explore this, I used the ingredient lists of ~12,500 recipes and ran several machine learning models on them.  This notebook gives an overview of how I downloaded the data, how I processed it with NLP methods, and some results obtained from different unsupervised machine learning algorithms.

In order to run this notebook, you will need to have access to the Yummly API.  Yummly was kind enough to grant me student access, because I'm currently a student at the [Metis](https://www.thisismetis.com/) bootcamp.  If you would like to run this code, you will need to request access to the Yummly API and substitute your Application ID and Keys where I've specified.
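If you prefer not to paste your credentials into the notebook, one option is to read them from environment variables. The following is only a sketch: the variable names `YUMMLY_APP_ID` and `YUMMLY_APP_KEY` and the helper `load_yummly_credentials` are hypothetical names chosen for illustration, not something Yummly defines.

```python
import os

def load_yummly_credentials(env=os.environ):
    """Return (app_id, app_key) read from a mapping of environment variables.

    YUMMLY_APP_ID and YUMMLY_APP_KEY are hypothetical names picked for
    this sketch; use whatever names suit your own setup.
    """
    try:
        return env['YUMMLY_APP_ID'], env['YUMMLY_APP_KEY']
    except KeyError as missing:
        raise RuntimeError('Missing environment variable: {}'.format(missing)) from None

# A plain dict stands in for the real environment in this example
demo_env = {'YUMMLY_APP_ID': 'my-app-id', 'YUMMLY_APP_KEY': 'my-app-key'}
app_id, app_key = load_yummly_credentials(demo_env)
print(app_id, app_key)
```

Calling `load_yummly_credentials()` with no argument would read from the actual process environment instead of the demo dict.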

In [1]:
import requests
import pandas as pd
import numpy as np

In [2]:
def get_recipes(cuisine='None',num_pages=2):
    '''
    Function to search for recipes from the Yummly API.
    
    Args:
        cuisine (str): Defaults to None.  The list of possible
            cuisines can be found at https://developer.yummly.com/documentation
        num_pages (int): Number of request pages to return.
            If num_pages = 2, will return 20 recipes
            
    Returns: 
        Pandas DataFrame: Each row of the DataFrame will be
            a different recipe.
    '''
    app_id = 'ab3c96d5'   # substitute your own Yummly application ID here
    app_key = '57476e58107ffca18d1585f2b463d78e'  # substitute your own application key
    # the search parameters are supplied separately through `params` below
    yummly_url = 'http://api.yummly.com/v1/api/recipes?_app_id={}&_app_key={}'.format(app_id,app_key)
    allowedCuisine = 'cuisine^cuisine-{}'.format(cuisine)
    maxResult = 10
    params = {'allowedCuisine[]' :allowedCuisine, 'maxResult' : maxResult, 'start' : 0}
    #print(params)
    yummly_df = pd.DataFrame(columns=['course','ingredients','recipe_name','rating','cuisine'])
    
    for page in range(num_pages):
        params['start']=page*maxResult
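        try:
            resp = requests.get(yummly_url, params=params)
        except requests.exceptions.RequestException:
            # Request failed: report it and return whatever has been collected so far
            print('Exception raised when requesting cuisine: {}'.format(cuisine))
            return yummly_df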
        
        data = resp.json()
        for i in range(len(data['matches'])):
            yummly_dict = data['matches'][i]
            yummly_dict2 = {'ingredients':str(yummly_dict['ingredients']), 
              'recipe_name':yummly_dict['recipeName'], 
              'rating':yummly_dict['rating'],
              'cuisine':cuisine}
            if yummly_dict['flavors'] is not None:
                for key,value in yummly_dict['flavors'].items():
                    yummly_dict2[key] = value
            if 'course' in yummly_dict['attributes'].keys():
                yummly_dict2['course'] = yummly_dict['attributes']['course']
            yummly_df = yummly_df.append(yummly_dict2, ignore_index=True)
       
    return yummly_df  
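To make the paging and flattening logic in `get_recipes()` easier to follow, here is a small offline sketch. It assumes a response shaped like the fields the function reads (`ingredients`, `recipeName`, `rating`, `flavors`, `attributes`); the sample matches are made up, and the helper names `page_offsets` and `matches_to_frame` are hypothetical. Unlike the cell above, it collects rows in a list and builds the DataFrame once at the end.

```python
import pandas as pd

def page_offsets(num_pages, max_result=10):
    """Return the 'start' offset sent with each request page."""
    return [page * max_result for page in range(num_pages)]

def matches_to_frame(matches, cuisine):
    """Flatten a list of match dicts into one DataFrame row per recipe."""
    rows = []
    for match in matches:
        row = {'ingredients': str(match['ingredients']),
               'recipe_name': match['recipeName'],
               'rating': match['rating'],
               'cuisine': cuisine}
        # 'flavors' may be None; otherwise each flavor score gets its own column
        row.update(match.get('flavors') or {})
        if 'course' in match.get('attributes', {}):
            row['course'] = match['attributes']['course']
        rows.append(row)
    return pd.DataFrame(rows)

# Made-up response fragment, for illustration only
fake_matches = [
    {'ingredients': ['salt', 'milk'], 'recipeName': 'Demo A', 'rating': 4,
     'flavors': {'sweet': 0.5}, 'attributes': {'course': ['Desserts']}},
    {'ingredients': ['rice'], 'recipeName': 'Demo B', 'rating': 5,
     'flavors': None, 'attributes': {}},
]

print(page_offsets(3))
demo_df = matches_to_frame(fake_matches, 'demo')
print(demo_df[['recipe_name', 'rating', 'cuisine']])
```

Recipes without flavor scores or a course simply end up with missing values in those columns, which matches the gaps visible in the DataFrame output later in the notebook.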

In [3]:
def merge_cuisines(cuisines_list=None, num_pages=2):
    '''
    Helper function which iterates through all the cuisines, calls get_recipes(), and then merges the data.
    
    Args:
        cuisines_list (list of str): Defaults to None.  A list containing all of the cuisines to query.
        num_pages (int): Number of request pages to return.
            If num_pages = 2, will return 20 recipes
            
    Returns:
        Pandas DataFrame: The results of get_recipes() with all the cuisines merged.
    '''
    
    merged_df = pd.DataFrame(columns=['course','ingredients','recipe_name','rating','cuisine'])
    
    for cuisine in cuisines_list:
        print("loading in data for cuisine: {}".format(cuisine))
        df = get_recipes(cuisine=cuisine, num_pages=num_pages)
        merged_df = merged_df.append(df, ignore_index=True)
        
    return merged_df

This is the list of cuisines supported by Yummly.

In [4]:
S = 'American, Italian, Asian, Mexican, Southern & Soul Food, French, Southwestern, Barbecue, Indian, Chinese, Cajun & Creole, English, Mediterranean, Greek, Spanish, German, Thai, Moroccan, Irish, Japanese, Cuban, Hawaiian, Swedish, Hungarian, Portugese'
S_list = S.split(',')

In [5]:
S_list_new = []
for S_i in S_list:
    S_i_new = S_i.split('&')[0].lower().strip()
    S_list_new.append(S_i_new)
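The loop above keeps only the text before any `&`, lowercases it, and trims whitespace. A quick sketch of the same transformation on three of the names (the function name `normalize_cuisine` is just for illustration):

```python
def normalize_cuisine(name):
    """Keep the text before any '&', then lowercase and trim whitespace."""
    return name.split('&')[0].lower().strip()

examples = ['American', ' Southern & Soul Food', ' Cajun & Creole']
normalized = [normalize_cuisine(name) for name in examples]
print(normalized)  # ['american', 'southern', 'cajun']
```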

Now I'm going to download the data, five cuisines at a time.  I'm specifying num_pages=50, which will provide 500 recipes per cuisine.

In [6]:
yummly_df_merged_first5 = merge_cuisines(S_list_new[:5],num_pages=50)

loading in data for cuisine: american
loading in data for cuisine: italian
loading in data for cuisine: asian
loading in data for cuisine: mexican
loading in data for cuisine: southern


In [7]:
yummly_df_merged_second5 = merge_cuisines(S_list_new[5:10],num_pages=50)

loading in data for cuisine: french
loading in data for cuisine: southwestern
loading in data for cuisine: barbecue
loading in data for cuisine: indian
loading in data for cuisine: chinese


In [11]:
yummly_df_merged_third5 = merge_cuisines(S_list_new[10:15],num_pages=50)

loading in data for cuisine: cajun
loading in data for cuisine: english
loading in data for cuisine: mediterranean
loading in data for cuisine: greek
loading in data for cuisine: spanish


In [8]:
yummly_df_merged_fouth5 = merge_cuisines(S_list_new[15:20],num_pages=50)

loading in data for cuisine: german
loading in data for cuisine: thai
loading in data for cuisine: moroccan
loading in data for cuisine: irish
loading in data for cuisine: japanese


In [9]:
yummly_df_merged_fifth5 = merge_cuisines(S_list_new[20:25],num_pages=50)

loading in data for cuisine: cuban
loading in data for cuisine: hawaiian
loading in data for cuisine: swedish
loading in data for cuisine: hungarian
loading in data for cuisine: portugese


In [12]:
yummly_df_merged_large = pd.concat([yummly_df_merged_first5,yummly_df_merged_second5,
                                    yummly_df_merged_third5,yummly_df_merged_fouth5,
                                    yummly_df_merged_fifth5], ignore_index=True)

In [13]:
yummly_df_merged_large.shape

(12487, 11)

Now I'm going to pickle the merged dataframe so I will be able to access it for future use.

In [14]:
yummly_df_merged_large.to_pickle('yummly_df.pkl')

In [15]:
yummly_df = pd.read_pickle('yummly_df.pkl')

In [16]:
yummly_df.head()

Unnamed: 0,bitter,course,cuisine,ingredients,meaty,piquant,rating,recipe_name,salty,sour,sweet
0,0.666667,[Main Dishes],american,"['dried pasta', 'milk', 'shredded cheddar chee...",0.166667,0.166667,4,Revolutionary Mac & Cheese,0.833333,0.166667,0.166667
1,0.5,[Salads],american,"['tomatoes', 'avocado', 'red onion', 'chopped ...",0.166667,0.0,4,Avocado and Tomato Salad,0.166667,0.833333,0.166667
2,,"[Breakfast and Brunch, Breads]",american,"['melted butter', 'biscuit dough', 'fresh mozz...",,,5,Easy Cheesy Bacon Biscuit Pull-Aparts,,,
3,,[Side Dishes],american,"['cauliflower', 'extra-virgin olive oil', 'red...",,,5,Roasted Spicy Cauliflower,,,
4,0.833333,,american,"['yukon gold potatoes', 'salt', 'smoked paprik...",0.166667,0.166667,5,Shakin’ Hash Browns,0.166667,0.666667,0.0


In [17]:
yummly_df.tail()

Unnamed: 0,bitter,course,cuisine,ingredients,meaty,piquant,rating,recipe_name,salty,sour,sweet
12482,0.166667,[Beverages],portugese,"['water', 'cucumber', 'lemon', 'mint leaves']",0.0,0.0,3,Cucumber Lemon And Mint Water,0.0,0.333333,0.166667
12483,,[Breads],portugese,"['warm water', 'sugar', 'instant yeast', 'melt...",,,4,Quick and Soft English Muffins,,,
12484,0.5,[Main Dishes],portugese,"['crumbs', 'salt', 'freshly ground black peppe...",0.833333,0.0,4,Easy Baked Chicken Drumsticks,0.666667,0.166667,0.333333
12485,0.5,[Desserts],portugese,"['shredded coconut', 'large egg whites', 'suga...",0.333333,0.0,4,How To Make the Best Coconut Macaroons,0.666667,0.0,0.833333
12486,,"[Appetizers, Lunch]",portugese,"['onions', 'sour cream', 'mayonnaise', 'grated...",,,4,Parmesan Onion Canapés,,,


Now looking at our class balance.

In [18]:
yummly_df.groupby('cuisine').count()

cuisine,bitter,course,ingredients,meaty,piquant,rating,recipe_name,salty,sour,sweet
american,322,467,500,322,322,500,500,322,322,322
asian,343,432,500,343,343,500,500,343,343,343
barbecue,352,442,500,352,352,500,500,352,352,352
cajun,380,437,500,380,380,500,500,380,380,380
chinese,361,445,500,361,361,500,500,361,361,361
cuban,278,465,500,278,278,500,500,278,278,278
english,333,479,499,333,333,499,499,333,333,333
french,364,460,497,364,364,497,497,364,364,364
german,374,403,500,374,374,500,500,374,374,374
greek,350,463,499,350,350,499,499,350,350,350


The `ingredients`, `rating`, and `recipe_name` counts above show that the classes are evenly balanced, with approximately 500 recipes for each cuisine type.
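Note that `groupby(...).count()` counts non-missing values per column, which is why the flavor columns show fewer than 500 entries per cuisine. As a sketch on a made-up toy DataFrame, `value_counts()` gives the number of rows per class directly:

```python
import pandas as pd

# Toy data, for illustration only
toy_df = pd.DataFrame({
    'cuisine': ['american', 'american', 'greek', 'greek', 'greek'],
    'recipe_name': ['A', 'B', 'C', 'D', 'E'],
    'bitter': [0.5, None, 0.2, None, None],
})

# Rows per cuisine, regardless of missing values in other columns
recipes_per_cuisine = toy_df['cuisine'].value_counts()
print(recipes_per_cuisine)

# Non-missing values per column and cuisine, as in the cell above
counts_by_column = toy_df.groupby('cuisine').count()
print(counts_by_column)
```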

### Yummly recipe data analysis and modeling

This section of the notebook covers data preprocessing and modeling.  It is a compact version of the yummly_Model.ipynb notebook, which contains a much more thorough exploration.

In [29]:
import ast

In [20]:
pd.set_option('display.max_colwidth', -1)

In [21]:
yummly_df.head()

Unnamed: 0,bitter,course,cuisine,ingredients,meaty,piquant,rating,recipe_name,salty,sour,sweet
0,0.666667,[Main Dishes],american,"['dried pasta', 'milk', 'shredded cheddar cheese', 'salt', 'dijon mustard']",0.166667,0.166667,4,Revolutionary Mac & Cheese,0.833333,0.166667,0.166667
1,0.5,[Salads],american,"['tomatoes', 'avocado', 'red onion', 'chopped cilantro', 'lime', 'extra-virgin olive oil', 'salt']",0.166667,0.0,4,Avocado and Tomato Salad,0.166667,0.833333,0.166667
2,,"[Breakfast and Brunch, Breads]",american,"['melted butter', 'biscuit dough', 'fresh mozzarella', 'bacon', 'shredded cheddar cheese']",,,5,Easy Cheesy Bacon Biscuit Pull-Aparts,,,
3,,[Side Dishes],american,"['cauliflower', 'extra-virgin olive oil', 'red pepper flakes', 'salt', 'ground black pepper']",,,5,Roasted Spicy Cauliflower,,,
4,0.833333,,american,"['yukon gold potatoes', 'salt', 'smoked paprika', 'olive oil']",0.166667,0.166667,5,Shakin’ Hash Browns,0.166667,0.666667,0.0


In [30]:
yummly_df['ingredients'] = yummly_df['ingredients'].apply(ast.literal_eval)

First I'm going to create a new column consisting of the ingredients represented as a string, without the commas.

In [31]:
yummly_df2 = yummly_df.copy()
yummly_df2['ingredients_string'] = yummly_df2['ingredients'].str.join(' ')

In [32]:
yummly_df2.head()

Unnamed: 0,bitter,course,cuisine,ingredients,meaty,piquant,rating,recipe_name,salty,sour,sweet,ingredients_string
0,0.666667,[Main Dishes],american,"[dried pasta, milk, shredded cheddar cheese, salt, dijon mustard]",0.166667,0.166667,4,Revolutionary Mac & Cheese,0.833333,0.166667,0.166667,dried pasta milk shredded cheddar cheese salt dijon mustard
1,0.5,[Salads],american,"[tomatoes, avocado, red onion, chopped cilantro, lime, extra-virgin olive oil, salt]",0.166667,0.0,4,Avocado and Tomato Salad,0.166667,0.833333,0.166667,tomatoes avocado red onion chopped cilantro lime extra-virgin olive oil salt
2,,"[Breakfast and Brunch, Breads]",american,"[melted butter, biscuit dough, fresh mozzarella, bacon, shredded cheddar cheese]",,,5,Easy Cheesy Bacon Biscuit Pull-Aparts,,,,melted butter biscuit dough fresh mozzarella bacon shredded cheddar cheese
3,,[Side Dishes],american,"[cauliflower, extra-virgin olive oil, red pepper flakes, salt, ground black pepper]",,,5,Roasted Spicy Cauliflower,,,,cauliflower extra-virgin olive oil red pepper flakes salt ground black pepper
4,0.833333,,american,"[yukon gold potatoes, salt, smoked paprika, olive oil]",0.166667,0.166667,5,Shakin’ Hash Browns,0.166667,0.666667,0.0,yukon gold potatoes salt smoked paprika olive oil


In [33]:
#loading in the tf-idf and CountVectorizer libraries

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

In [34]:
list_corpus = yummly_df2['ingredients_string'].tolist()
list_labels = yummly_df2['cuisine'].tolist()

In [35]:
vectorizer = TfidfVectorizer()
vectorizer.fit(list_corpus)

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [36]:
vector = vectorizer.transform(yummly_df2['ingredients_string'])
max_value = vector.max(axis=0).toarray().ravel()
sorted_by_tfidf = max_value.argsort()

In [37]:
feature_names = np.array(vectorizer.get_feature_names())
print("Features with the lowest tfidf:\n{}".format(feature_names[sorted_by_tfidf[:100]]))

Features with the lowest tfidf:
['fluff' 'huckleberries' 'cornish' 'partridges' 'hens' 'vineyard'
 'burgundi' 'collect' 'premium' 'dijonnaise' 'pinch' 'perfect' 'jamón'
 'vital' 'perrins' 'lea' 'nacho' 'tradit' 'gelato' 'fresca' 'pressed'
 'four' 'toll' 'roux' 'substitut' 'snaps' 'chex' 'niçoise' 'brazil'
 'barramundi' 'candlenuts' 'garlic' 'fettuccini' 'kasuri' 'lettuc'
 'romain' 'leav' 'sturgeon' 'chees' 'skippi' 'natur' 'turkish' 'rins'
 'world' 'cara' 'mia' 'lentilles' 'du' 'cupcake' 'mm' 'pretzels'
 'valentine' 'craisins' 'cheerios' 'garni' 'creations' 'sheet' 'shrimps'
 'tail' 'traditional' 'klondike' 'super' 'amaranth' 'gourmet' 'aonori'
 'touch' 'leche' 'dulce' 'boned' 'pompeian' 'minicub' 'parslei' 'leafy'
 'pepperocini' 'pack' 'cheek' 'seitan' 'stellette' 'flageolet' 'curls'
 'eatin' 'drain' 'bianca' 'rosa' 'blackberry' 'betty' 'crocker' '100'
 'chana' 'ritz' 'dri' 'crush' 'drumstick' 'flake' 'heath' 'bowl' 'trifle'
 'flatout' 'flatbreads' 'crabs']


In [38]:
print("Features with the highest tfidf:\n{}".format(feature_names[sorted_by_tfidf[-100:]]))

Features with the highest tfidf:
['tikka' 'moscato' 'mirin' 'hazelnuts' 'biscoff' 'ricotta' 'paneer'
 'pimenton' 'saltines' 'beaten' 'pomegranate' 'cornmeal' 'brisket' 'hass'
 'icing' 'creole' 'mitsukan' 'seasoning' 'tart' 'pectin' 'mango' 'pizza'
 'liqueur' 'nonstick' 'lemonade' 'melon' 'alum' 'sofrito' 'violets'
 'buttermilk' 'liver' 'oleo' 'herbs' 'sea' 'beech' 'rose' 'biscuit'
 'papad' 'gizzards' 'orange' 'couscous' 'naan' 'melted' 'limoncello'
 'challa' 'gumbo' 'chambord' 'high' 'drippings' 'yardlong' 'champagne'
 'cottage' 'cookies' 'plantains' 'fruit' 'atta' 'oatmeal' 'citrus' 'roe'
 'sardines' 'potatoes' 'gram' 'sheepshead' 'mahi' 'boudin' 'half' 'apples'
 'strawberries' 'goya' 'lard' 'yucca' 'liquor' 'flavoring' 'semolina'
 'jarlsberg' 'dried' 'konbu' 'chicory' 'juice' 'pickling' 'cauliflower'
 'liquid' 'brats' 'bhaji' 'dates' 'coffee' 'cultured' 'spaetzle' 'ground'
 'vodka' 'grits' 'peach' 'duck' 'homemade' 'mccormick' 'jamaica' 'cabbage'
 'peanuts' 'taro' 'pudding']


The following words are those with the lowest idf scores, that is, those that appear in the most recipes and are therefore deemed less informative.

In [39]:
sorted_by_idf = np.argsort(vectorizer.idf_)
print("Features with the lowest idf:\n{}".format(feature_names[sorted_by_idf[:50]]))

Features with the lowest idf:
['salt' 'oil' 'pepper' 'garlic' 'sugar' 'ground' 'butter' 'flour' 'olive'
 'onions' 'fresh' 'sauce' 'black' 'chicken' 'red' 'cheese' 'eggs' 'water'
 'tomatoes' 'onion' 'milk' 'powder' 'juice' 'green' 'cream' 'lemon'
 'white' 'cilantro' 'ginger' 'soy' 'rice' 'chopped' 'vinegar' 'all'
 'purpose' 'paprika' 'cumin' 'lime' 'parsley' 'large' 'leaves' 'corn'
 'cloves' 'broth' 'bell' 'kosher' 'brown' 'vegetable' 'dried' 'beef']
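To make the idf ranking concrete, here is a sketch on a made-up three-recipe corpus. With `smooth_idf=True` (the setting shown in the vectorizer output above), scikit-learn computes idf as ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df is the number of documents containing the term. A word that appears in every recipe, like 'salt' here, gets the minimum value of 1.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, for illustration only
toy_corpus = ['salt olive oil', 'salt garlic', 'salt paneer masala']
toy_vectorizer = TfidfVectorizer()
toy_vectorizer.fit(toy_corpus)

n_docs = len(toy_corpus)
idf_by_term = {}
for term in ['salt', 'paneer']:
    column = toy_vectorizer.vocabulary_[term]
    doc_freq = sum(term in doc.split() for doc in toy_corpus)
    # Smoothed idf formula used when smooth_idf=True
    expected = np.log((1 + n_docs) / (1 + doc_freq)) + 1
    idf_by_term[term] = (toy_vectorizer.idf_[column], expected)
    print(term, toy_vectorizer.idf_[column], expected)
```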


In [41]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

In [42]:
# Start from scikit-learn's English stop words, then add the 30 lowest-idf ingredient words
custom_stop_words = list(ENGLISH_STOP_WORDS) + feature_names[sorted_by_idf[:30]].tolist()

Next I'm going to do Bag-of-Words processing.

In [43]:
count_vect = CountVectorizer(stop_words=custom_stop_words)

In [44]:
counts = count_vect.fit_transform(yummly_df2["ingredients_string"])  # sparse matrix with columns corresponding to words
words = count_vect.get_feature_names()  # array with words corresponding to columns
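As a small illustration of what the bag-of-words matrix contains, the sketch below runs `CountVectorizer` on a made-up corpus with a short, hypothetical stop-word list. Each row is a recipe, each column is a remaining word, and each entry is how often that word occurs.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus and stop list, for illustration only
toy_docs = ['salt olive oil garlic', 'salt paneer masala', 'garlic paneer paneer']
toy_stop_words = ['salt', 'oil']
toy_vect = CountVectorizer(stop_words=toy_stop_words)
toy_counts = toy_vect.fit_transform(toy_docs)

# vocabulary_ maps each word to its column index; sort words by that index
vocab = sorted(toy_vect.vocabulary_, key=toy_vect.vocabulary_.get)
print(vocab)
print(toy_counts.toarray())
```

The stop words never become columns, so very common ingredients such as salt cannot dominate the distances used by the clustering step.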

Now I'm going to try K-means clustering on my vectorized data set.

In [45]:
from sklearn.cluster import KMeans

In [46]:
number_of_clusters=10
km = KMeans(n_clusters = number_of_clusters)
km.fit(counts)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=10, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

In [47]:
print("Top terms per cluster:")
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = count_vect.get_feature_names()
for i in range(number_of_clusters):
    top_ten_words = [terms[ind] for ind in order_centroids[i, :10]]
    print("Cluster {}: {}".format(i, ' '.join(top_ten_words)))

Top terms per cluster:
Cluster 0: cumin coriander turmeric cinnamon paprika chili chopped cloves masala cayenne
Cluster 1: lime leaves mint coconut fish rum paste chopped curry chilies
Cluster 2: bell paprika parsley broth chopped rice celery seasoning yellow bay
Cluster 3: corn beans shredded tortillas salsa cumin chilies cheddar chili chopped
Cluster 4: vinegar potatoes pineapple purpose brown large kosher vegetable parsley bread
Cluster 5: sesame rice vinegar seeds corn starch brown boneless breasts vegetable
Cluster 6: baking vanilla purpose extract large unsalted soda egg granulated buttermilk
Cluster 7: virgin extra parsley vinegar cloves paprika sea leaves cucumber kosher
Cluster 8: beef broth parsley paprika sour stock potatoes allspice bread large
Cluster 9: dried oregano feta cucumber thyme vinegar basil wine parsley leaves


In [48]:
number_of_clusters=25
km = KMeans(n_clusters = number_of_clusters)
km.fit(counts)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=25, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

In [49]:
print("Top terms per cluster:")
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = count_vect.get_feature_names()
for i in range(number_of_clusters):
    top_ten_words = [terms[ind] for ind in order_centroids[i, :10]]
    print("Cluster {}: {}".format(i, ' '.join(top_ten_words)))

Top terms per cluster:
Cluster 0: sesame seeds rice vinegar corn starch toasted honey brown scallions
Cluster 1: shredded cheddar tortillas mozzarella parmesan seasoning chopped beef salsa sour
Cluster 2: virgin extra parsley vinegar cloves feta wine oregano cucumber leaves
Cluster 3: parsley vegetable purpose bread pork vinegar seasoning kosher chopped cinnamon
Cluster 4: potatoes russet parsley bacon kosher gold sweet yukon vegetable purpose
Cluster 5: dried oregano feta thyme parsley basil wine cucumber vinegar paprika
Cluster 6: skinless boneless breasts corn rice bell starch broth vegetable vinegar
Cluster 7: vanilla extract egg large purpose heavy unsalted baking yolks chocolate
Cluster 8: beef broth parsley allspice bread nutmeg stock sour pork large
Cluster 9: corn beans chili cumin tortillas salsa shredded avocado cheddar chopped
Cluster 10: cumin coriander turmeric masala garam seeds chili leaves seed chilies
Cluster 11: paprika sour parsley sweet broth hungarian smoked wine 

K-means clustering shows that certain words show up a lot but are not informative, such as 'extra', 'virgin', 'extract', and 'unsalted'.  I'm going to add these words to my list of stop words before doing further analysis.

Another observation is that certain word pairs should be hyphenated, since they always go together.  Examples include 'baking-soda', 'baking-powder', and 'sesame-seeds'.  I'm going to do this step first, before removing any stop words.

In [54]:
yummly_df3 = pd.read_pickle('yummly_df.pkl')

In [55]:
def hyphenate_ingredients2(df):
    # Join each two-word ingredient with a hyphen so it stays together as one token
    phrases = ['baking soda', 'baking powder', 'sesame seeds', 'simple syrup',
               'olive oil', 'corn starch', 'garam masala']
    for phrase in phrases:
        # Assign back to the column rather than relying on an inplace replace
        df['ingredients'] = df['ingredients'].str.replace(phrase, phrase.replace(' ', '-'), regex=False)

    return df

In [56]:
yummly_df3 = hyphenate_ingredients2(yummly_df3)
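As a quick sketch of what the hyphenation step does, the snippet below applies the same kind of literal replacement to a one-row toy Series. At this point the ingredients are still stored as a string, so the replacement acts on the text of the whole list.

```python
import pandas as pd

# Toy row: the ingredient list is still a string at this stage
toy_ingredients = pd.Series(["['extra-virgin olive oil', 'baking soda', 'salt']"])

for phrase in ['olive oil', 'baking soda']:
    # Literal (non-regex) replacement of the phrase with its hyphenated form
    toy_ingredients = toy_ingredients.str.replace(phrase, phrase.replace(' ', '-'), regex=False)

print(toy_ingredients[0])
```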

In [57]:
yummly_df3.head()

Unnamed: 0,bitter,course,cuisine,ingredients,meaty,piquant,rating,recipe_name,salty,sour,sweet
0,0.666667,[Main Dishes],american,"['dried pasta', 'milk', 'shredded cheddar cheese', 'salt', 'dijon mustard']",0.166667,0.166667,4,Revolutionary Mac & Cheese,0.833333,0.166667,0.166667
1,0.5,[Salads],american,"['tomatoes', 'avocado', 'red onion', 'chopped cilantro', 'lime', 'extra-virgin olive-oil', 'salt']",0.166667,0.0,4,Avocado and Tomato Salad,0.166667,0.833333,0.166667
2,,"[Breakfast and Brunch, Breads]",american,"['melted butter', 'biscuit dough', 'fresh mozzarella', 'bacon', 'shredded cheddar cheese']",,,5,Easy Cheesy Bacon Biscuit Pull-Aparts,,,
3,,[Side Dishes],american,"['cauliflower', 'extra-virgin olive-oil', 'red pepper flakes', 'salt', 'ground black pepper']",,,5,Roasted Spicy Cauliflower,,,
4,0.833333,,american,"['yukon gold potatoes', 'salt', 'smoked paprika', 'olive-oil']",0.166667,0.166667,5,Shakin’ Hash Browns,0.166667,0.666667,0.0


We can now see that 'olive oil' was replaced with 'olive-oil', so the hyphenation worked.  Now I just need to convert the ingredients from a string to a list.

In [58]:
yummly_df3['ingredients'] = yummly_df3['ingredients'].apply(ast.literal_eval)
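For reference, `ast.literal_eval` safely parses the string form of a Python literal back into the object itself. A minimal sketch using one of the ingredient strings shown above:

```python
import ast

raw = "['yukon gold potatoes', 'salt', 'smoked paprika', 'olive-oil']"
parsed = ast.literal_eval(raw)

print(type(parsed).__name__, len(parsed))
print(parsed[0])
```

Unlike `eval`, `literal_eval` only accepts literals such as strings, numbers, lists, and dicts, so it will not execute arbitrary code stored in the column.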

In [62]:
# A very simple function to tokenize a list of words
def tokenize_list(l):
    new_l = [val.split(" ") for val in l]
    flat_l = [item for sublist in new_l for item in sublist]
    return flat_l

In [60]:
yummly_df3['tokens_ingr'] = yummly_df3['ingredients'].apply(tokenize_list)

In [61]:
yummly_df3.head()

Unnamed: 0,bitter,course,cuisine,ingredients,meaty,piquant,rating,recipe_name,salty,sour,sweet,tokens_ingr
0,0.666667,[Main Dishes],american,"[dried pasta, milk, shredded cheddar cheese, salt, dijon mustard]",0.166667,0.166667,4,Revolutionary Mac & Cheese,0.833333,0.166667,0.166667,"[dried, pasta, milk, shredded, cheddar, cheese, salt, dijon, mustard]"
1,0.5,[Salads],american,"[tomatoes, avocado, red onion, chopped cilantro, lime, extra-virgin olive-oil, salt]",0.166667,0.0,4,Avocado and Tomato Salad,0.166667,0.833333,0.166667,"[tomatoes, avocado, red, onion, chopped, cilantro, lime, extra-virgin, olive-oil, salt]"
2,,"[Breakfast and Brunch, Breads]",american,"[melted butter, biscuit dough, fresh mozzarella, bacon, shredded cheddar cheese]",,,5,Easy Cheesy Bacon Biscuit Pull-Aparts,,,,"[melted, butter, biscuit, dough, fresh, mozzarella, bacon, shredded, cheddar, cheese]"
3,,[Side Dishes],american,"[cauliflower, extra-virgin olive-oil, red pepper flakes, salt, ground black pepper]",,,5,Roasted Spicy Cauliflower,,,,"[cauliflower, extra-virgin, olive-oil, red, pepper, flakes, salt, ground, black, pepper]"
4,0.833333,,american,"[yukon gold potatoes, salt, smoked paprika, olive-oil]",0.166667,0.166667,5,Shakin’ Hash Browns,0.166667,0.666667,0.0,"[yukon, gold, potatoes, salt, smoked, paprika, olive-oil]"
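Splitting on spaces is what keeps the hyphenated pairs intact. As a sketch, the snippet below compares that approach with the default scikit-learn `token_pattern` shown in the `TfidfVectorizer` output earlier, which treats a hyphen as a word boundary. The function is a compact restatement of `tokenize_list` for illustration.

```python
import re

def tokenize_list(ingredients):
    """Split each ingredient on spaces and flatten into a single token list."""
    return [token for ingredient in ingredients for token in ingredient.split(' ')]

space_tokens = tokenize_list(['extra-virgin olive-oil', 'salt'])
print(space_tokens)

# Default token_pattern from the vectorizer repr above: hyphens split tokens
default_pattern = r'(?u)\b\w\w+\b'
pattern_tokens = re.findall(default_pattern, 'extra-virgin olive-oil salt')
print(pattern_tokens)
```

If these tokens are later passed to a vectorizer with its default settings, the hyphenated pairs would be split again, so a custom tokenizer or token pattern would be needed to preserve them.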
