# Solid-Spoon

**Author:** Crissy Bruce
***

## Overview

A one-paragraph overview of the project, including the business problem, data, methods, results and recommendations.

## Business Problem

Summary of the business problem you are trying to solve, and the data questions that you plan to answer to solve them.

***
Questions to consider:
* What are the business's pain points related to this project?
* How did you pick the data analysis question(s) that you did?
* Why are these questions important from a business perspective?
***

## Data Understanding

Describe the data being used for this project.
***
Questions to consider:
* Where did the data come from, and how do they relate to the data analysis questions?
* What do the data represent? Who is in the sample and what variables are included?
* What is the target variable?
* What are the properties of the variables you intend to use?
***

In [1]:
# Import standard packages
import requests
import config
import json
import pandas as pd


In [2]:
data=pd.read_csv(r'C:\Users\owner\Documents\Flatiron\solid-spoon\solid-spoon\data1.csv')

In [3]:
data['extendedIngredients']

0       [{'id': 11090, 'aisle': 'Produce', 'image': 'b...
1       [{'id': 2044, 'aisle': 'Produce;Spices and Sea...
2       [{'id': 9040, 'aisle': 'Produce', 'image': 'ba...
3       [{'id': 2069, 'aisle': 'Oil, Vinegar, Salad Dr...
4       [{'id': 16018, 'aisle': 'Canned and Jarred', '...
                              ...                        
1595    [{'id': 10111352, 'aisle': 'Frozen', 'image': ...
1596    [{'id': 1053, 'aisle': 'Milk, Eggs, Other Dair...
1597    [{'id': 2044, 'aisle': 'Produce;Spices and Sea...
1598    [{'id': 10123, 'aisle': 'Meat', 'image': 'raw-...
1599    [{'id': 20012, 'aisle': 'Pasta and Rice;Ethnic...
Name: extendedIngredients, Length: 1600, dtype: object

In [4]:
#check for duplicates
data.duplicated(subset=['id'])

0       False
1       False
2       False
3       False
4       False
        ...  
1595     True
1596     True
1597     True
1598     True
1599     True
Length: 1600, dtype: bool

In [5]:
data=data.drop_duplicates(subset=['id'])

In [6]:
data.shape

(1000, 44)

In [7]:
data.head()

Unnamed: 0.1,Unnamed: 0,vegetarian,vegan,glutenFree,dairyFree,veryHealthy,cheap,veryPopular,sustainable,weightWatcherSmartPoints,...,spoonacularSourceUrl,usedIngredientCount,missedIngredientCount,missedIngredients,likes,usedIngredients,unusedIngredients,preparationMinutes,cookingMinutes,author
0,0,True,True,True,True,True,False,True,False,4,...,https://spoonacular.com/cauliflower-brown-rice...,0,9,"[{'id': 11090, 'amount': 2.0, 'unit': 'cups', ...",0,[],[],,,
1,1,True,True,False,True,True,False,True,False,19,...,https://spoonacular.com/homemade-garlic-and-ba...,0,2,"[{'id': 2044, 'amount': 0.25, 'unit': 'cup', '...",0,[],[],,,
2,2,True,False,False,False,True,False,True,False,15,...,https://spoonacular.com/berry-banana-breakfast...,0,5,"[{'id': 9040, 'amount': 0.25, 'unit': 'cup', '...",0,[],[],5.0,0.0,
3,3,True,True,True,True,True,False,False,False,5,...,https://spoonacular.com/garlicky-kale-644387,0,3,"[{'id': 2069, 'amount': 3.0, 'unit': 'tablespo...",0,[],[],,,
4,4,False,False,True,True,True,False,True,False,10,...,https://spoonacular.com/chicken-tortilla-soup-...,0,9,"[{'id': 16018, 'amount': 15.0, 'unit': 'oz', '...",0,[],[],,,


In [8]:
list(data.columns)

['Unnamed: 0',
 'vegetarian',
 'vegan',
 'glutenFree',
 'dairyFree',
 'veryHealthy',
 'cheap',
 'veryPopular',
 'sustainable',
 'weightWatcherSmartPoints',
 'gaps',
 'lowFodmap',
 'aggregateLikes',
 'spoonacularScore',
 'healthScore',
 'creditsText',
 'license',
 'sourceName',
 'pricePerServing',
 'extendedIngredients',
 'id',
 'title',
 'readyInMinutes',
 'servings',
 'sourceUrl',
 'image',
 'imageType',
 'nutrition',
 'summary',
 'cuisines',
 'dishTypes',
 'diets',
 'occasions',
 'analyzedInstructions',
 'spoonacularSourceUrl',
 'usedIngredientCount',
 'missedIngredientCount',
 'missedIngredients',
 'likes',
 'usedIngredients',
 'unusedIngredients',
 'preparationMinutes',
 'cookingMinutes',
 'author']

Drop the columns that will not be used in the analysis.

In [9]:
data=data.drop(columns=['Unnamed: 0',
 'vegetarian',
 'vegan',
 'glutenFree',
 'dairyFree',
 'veryHealthy',
 'cheap',
 'veryPopular',
 'sustainable',
 'weightWatcherSmartPoints',
 'gaps',
 'lowFodmap',
 'creditsText',
 'license',
 'sourceName',
 'readyInMinutes',
 'image',
 'imageType',
 'summary',
 'cuisines',
 'occasions',
 'analyzedInstructions',
 'likes',
 'preparationMinutes',
 'cookingMinutes',
 'author'])

In [10]:
data

Unnamed: 0,aggregateLikes,spoonacularScore,healthScore,pricePerServing,extendedIngredients,id,title,servings,sourceUrl,nutrition,dishTypes,diets,spoonacularSourceUrl,usedIngredientCount,missedIngredientCount,missedIngredients,usedIngredients,unusedIngredients
0,3689,99.0,76.0,112.39,"[{'id': 11090, 'aisle': 'Produce', 'image': 'b...",716426,"Cauliflower, Brown Rice, and Vegetable Fried Rice",8,http://fullbellysisters.blogspot.com/2012/01/c...,"{'nutrients': [{'name': 'Calories', 'title': '...",['side dish'],"['gluten free', 'dairy free', 'lacto ovo veget...",https://spoonacular.com/cauliflower-brown-rice...,0,9,"[{'id': 11090, 'amount': 2.0, 'unit': 'cups', ...",[],[]
1,1669,99.0,78.0,83.23,"[{'id': 2044, 'aisle': 'Produce;Spices and Sea...",715594,Homemade Garlic and Basil French Fries,2,http://www.pinkwhen.com/homemade-french-fries/,"{'nutrients': [{'name': 'Calories', 'title': '...","['lunch', 'main course', 'main dish', 'dinner']","['dairy free', 'lacto ovo vegetarian', 'vegan']",https://spoonacular.com/homemade-garlic-and-ba...,0,2,"[{'id': 2044, 'amount': 0.25, 'unit': 'cup', '...",[],[]
2,689,99.0,63.0,204.29,"[{'id': 9040, 'aisle': 'Produce', 'image': 'ba...",715497,Berry Banana Breakfast Smoothie,1,http://www.pinkwhen.com/berry-banana-breakfast...,"{'nutrients': [{'name': 'Calories', 'title': '...","['morning meal', 'brunch', 'breakfast']",['lacto ovo vegetarian'],https://spoonacular.com/berry-banana-breakfast...,0,5,"[{'id': 9040, 'amount': 0.25, 'unit': 'cup', '...",[],[]
3,19,99.0,93.0,69.09,"[{'id': 2069, 'aisle': 'Oil, Vinegar, Salad Dr...",644387,Garlicky Kale,2,http://www.foodista.com/recipe/J2FTJBF7/garlic...,"{'nutrients': [{'name': 'Calories', 'title': '...",['side dish'],"['gluten free', 'dairy free', 'paleolithic', '...",https://spoonacular.com/garlicky-kale-644387,0,3,"[{'id': 2069, 'amount': 3.0, 'unit': 'tablespo...",[],[]
4,1429,99.0,73.0,339.33,"[{'id': 16018, 'aisle': 'Canned and Jarred', '...",715392,Chicken Tortilla Soup (Slow Cooker),2,http://www.pinkwhen.com/chicken-tortilla-soup-...,"{'nutrients': [{'name': 'Calories', 'title': '...",['soup'],"['gluten free', 'dairy free']",https://spoonacular.com/chicken-tortilla-soup-...,0,9,"[{'id': 16018, 'amount': 15.0, 'unit': 'oz', '...",[],[]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,6,76.0,24.0,111.22,"[{'id': 10111352, 'aisle': 'Frozen', 'image': ...",660322,Smashed Fried Lemon Potatoes,4,http://www.foodista.com/recipe/38DN2VQF/smashe...,"{'nutrients': [{'name': 'Calories', 'title': '...",['side dish'],"['gluten free', 'dairy free', 'lacto ovo veget...",https://spoonacular.com/smashed-fried-lemon-po...,0,8,"[{'id': 10111352, 'amount': 1.5, 'unit': 'poun...",[],[]
996,0,76.0,46.0,595.06,"[{'id': 1053, 'aisle': 'Milk, Eggs, Other Dair...",157081,Omega-3 Creamy Leek Soup,2,http://spoonacular.com/-1381436006147,"{'nutrients': [{'name': 'Calories', 'title': '...",['soup'],"['gluten free', 'primal', 'pescatarian']",https://spoonacular.com/omega-3-creamy-leek-so...,0,8,"[{'id': 1053, 'amount': 250.0, 'unit': 'ml', '...",[],[]
997,1,76.0,36.0,242.59,"[{'id': 2044, 'aisle': 'Produce;Spices and Sea...",661351,Spinach Soup With Wontons,4,http://www.foodista.com/recipe/VPKSZHYP/spinac...,"{'nutrients': [{'name': 'Calories', 'title': '...",['soup'],[],https://spoonacular.com/spinach-soup-with-wont...,0,8,"[{'id': 2044, 'amount': 1.0, 'unit': 'teaspoon...",[],[]
998,1,76.0,37.0,363.42,"[{'id': 10123, 'aisle': 'Meat', 'image': 'raw-...",658529,"Roasted Butternut Squash, Pecan, Bacon, Mix Gr...",4,http://www.foodista.com/recipe/67XP25KK/roaste...,"{'nutrients': [{'name': 'Calories', 'title': '...",['salad'],"['gluten free', 'dairy free', 'paleolithic', '...",https://spoonacular.com/roasted-butternut-squa...,0,8,"[{'id': 10123, 'amount': 5.0, 'unit': 'slices'...",[],[]


In [11]:
data.isnull().sum().sum()

0

In [12]:
data.isna().sum()

aggregateLikes           0
spoonacularScore         0
healthScore              0
pricePerServing          0
extendedIngredients      0
id                       0
title                    0
servings                 0
sourceUrl                0
nutrition                0
dishTypes                0
diets                    0
spoonacularSourceUrl     0
usedIngredientCount      0
missedIngredientCount    0
missedIngredients        0
usedIngredients          0
unusedIngredients        0
dtype: int64

## EDA
This section explores the dataset to understand its structure and contents.

In [13]:
data.head()

Unnamed: 0,aggregateLikes,spoonacularScore,healthScore,pricePerServing,extendedIngredients,id,title,servings,sourceUrl,nutrition,dishTypes,diets,spoonacularSourceUrl,usedIngredientCount,missedIngredientCount,missedIngredients,usedIngredients,unusedIngredients
0,3689,99.0,76.0,112.39,"[{'id': 11090, 'aisle': 'Produce', 'image': 'b...",716426,"Cauliflower, Brown Rice, and Vegetable Fried Rice",8,http://fullbellysisters.blogspot.com/2012/01/c...,"{'nutrients': [{'name': 'Calories', 'title': '...",['side dish'],"['gluten free', 'dairy free', 'lacto ovo veget...",https://spoonacular.com/cauliflower-brown-rice...,0,9,"[{'id': 11090, 'amount': 2.0, 'unit': 'cups', ...",[],[]
1,1669,99.0,78.0,83.23,"[{'id': 2044, 'aisle': 'Produce;Spices and Sea...",715594,Homemade Garlic and Basil French Fries,2,http://www.pinkwhen.com/homemade-french-fries/,"{'nutrients': [{'name': 'Calories', 'title': '...","['lunch', 'main course', 'main dish', 'dinner']","['dairy free', 'lacto ovo vegetarian', 'vegan']",https://spoonacular.com/homemade-garlic-and-ba...,0,2,"[{'id': 2044, 'amount': 0.25, 'unit': 'cup', '...",[],[]
2,689,99.0,63.0,204.29,"[{'id': 9040, 'aisle': 'Produce', 'image': 'ba...",715497,Berry Banana Breakfast Smoothie,1,http://www.pinkwhen.com/berry-banana-breakfast...,"{'nutrients': [{'name': 'Calories', 'title': '...","['morning meal', 'brunch', 'breakfast']",['lacto ovo vegetarian'],https://spoonacular.com/berry-banana-breakfast...,0,5,"[{'id': 9040, 'amount': 0.25, 'unit': 'cup', '...",[],[]
3,19,99.0,93.0,69.09,"[{'id': 2069, 'aisle': 'Oil, Vinegar, Salad Dr...",644387,Garlicky Kale,2,http://www.foodista.com/recipe/J2FTJBF7/garlic...,"{'nutrients': [{'name': 'Calories', 'title': '...",['side dish'],"['gluten free', 'dairy free', 'paleolithic', '...",https://spoonacular.com/garlicky-kale-644387,0,3,"[{'id': 2069, 'amount': 3.0, 'unit': 'tablespo...",[],[]
4,1429,99.0,73.0,339.33,"[{'id': 16018, 'aisle': 'Canned and Jarred', '...",715392,Chicken Tortilla Soup (Slow Cooker),2,http://www.pinkwhen.com/chicken-tortilla-soup-...,"{'nutrients': [{'name': 'Calories', 'title': '...",['soup'],"['gluten free', 'dairy free']",https://spoonacular.com/chicken-tortilla-soup-...,0,9,"[{'id': 16018, 'amount': 15.0, 'unit': 'oz', '...",[],[]


In [14]:
#rearranging columns 
data = data[['id', 'title', 'servings', 'sourceUrl', 'nutrition', 'dishTypes', 'diets', 'spoonacularSourceUrl', 'usedIngredientCount','missedIngredientCount','missedIngredients', 'usedIngredients', 'unusedIngredients', 'aggregateLikes','spoonacularScore', 'healthScore', 'extendedIngredients', 'pricePerServing']]

In [15]:
data

Unnamed: 0,id,title,servings,sourceUrl,nutrition,dishTypes,diets,spoonacularSourceUrl,usedIngredientCount,missedIngredientCount,missedIngredients,usedIngredients,unusedIngredients,aggregateLikes,spoonacularScore,healthScore,extendedIngredients,pricePerServing
0,716426,"Cauliflower, Brown Rice, and Vegetable Fried Rice",8,http://fullbellysisters.blogspot.com/2012/01/c...,"{'nutrients': [{'name': 'Calories', 'title': '...",['side dish'],"['gluten free', 'dairy free', 'lacto ovo veget...",https://spoonacular.com/cauliflower-brown-rice...,0,9,"[{'id': 11090, 'amount': 2.0, 'unit': 'cups', ...",[],[],3689,99.0,76.0,"[{'id': 11090, 'aisle': 'Produce', 'image': 'b...",112.39
1,715594,Homemade Garlic and Basil French Fries,2,http://www.pinkwhen.com/homemade-french-fries/,"{'nutrients': [{'name': 'Calories', 'title': '...","['lunch', 'main course', 'main dish', 'dinner']","['dairy free', 'lacto ovo vegetarian', 'vegan']",https://spoonacular.com/homemade-garlic-and-ba...,0,2,"[{'id': 2044, 'amount': 0.25, 'unit': 'cup', '...",[],[],1669,99.0,78.0,"[{'id': 2044, 'aisle': 'Produce;Spices and Sea...",83.23
2,715497,Berry Banana Breakfast Smoothie,1,http://www.pinkwhen.com/berry-banana-breakfast...,"{'nutrients': [{'name': 'Calories', 'title': '...","['morning meal', 'brunch', 'breakfast']",['lacto ovo vegetarian'],https://spoonacular.com/berry-banana-breakfast...,0,5,"[{'id': 9040, 'amount': 0.25, 'unit': 'cup', '...",[],[],689,99.0,63.0,"[{'id': 9040, 'aisle': 'Produce', 'image': 'ba...",204.29
3,644387,Garlicky Kale,2,http://www.foodista.com/recipe/J2FTJBF7/garlic...,"{'nutrients': [{'name': 'Calories', 'title': '...",['side dish'],"['gluten free', 'dairy free', 'paleolithic', '...",https://spoonacular.com/garlicky-kale-644387,0,3,"[{'id': 2069, 'amount': 3.0, 'unit': 'tablespo...",[],[],19,99.0,93.0,"[{'id': 2069, 'aisle': 'Oil, Vinegar, Salad Dr...",69.09
4,715392,Chicken Tortilla Soup (Slow Cooker),2,http://www.pinkwhen.com/chicken-tortilla-soup-...,"{'nutrients': [{'name': 'Calories', 'title': '...",['soup'],"['gluten free', 'dairy free']",https://spoonacular.com/chicken-tortilla-soup-...,0,9,"[{'id': 16018, 'amount': 15.0, 'unit': 'oz', '...",[],[],1429,99.0,73.0,"[{'id': 16018, 'aisle': 'Canned and Jarred', '...",339.33
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,660322,Smashed Fried Lemon Potatoes,4,http://www.foodista.com/recipe/38DN2VQF/smashe...,"{'nutrients': [{'name': 'Calories', 'title': '...",['side dish'],"['gluten free', 'dairy free', 'lacto ovo veget...",https://spoonacular.com/smashed-fried-lemon-po...,0,8,"[{'id': 10111352, 'amount': 1.5, 'unit': 'poun...",[],[],6,76.0,24.0,"[{'id': 10111352, 'aisle': 'Frozen', 'image': ...",111.22
996,157081,Omega-3 Creamy Leek Soup,2,http://spoonacular.com/-1381436006147,"{'nutrients': [{'name': 'Calories', 'title': '...",['soup'],"['gluten free', 'primal', 'pescatarian']",https://spoonacular.com/omega-3-creamy-leek-so...,0,8,"[{'id': 1053, 'amount': 250.0, 'unit': 'ml', '...",[],[],0,76.0,46.0,"[{'id': 1053, 'aisle': 'Milk, Eggs, Other Dair...",595.06
997,661351,Spinach Soup With Wontons,4,http://www.foodista.com/recipe/VPKSZHYP/spinac...,"{'nutrients': [{'name': 'Calories', 'title': '...",['soup'],[],https://spoonacular.com/spinach-soup-with-wont...,0,8,"[{'id': 2044, 'amount': 1.0, 'unit': 'teaspoon...",[],[],1,76.0,36.0,"[{'id': 2044, 'aisle': 'Produce;Spices and Sea...",242.59
998,658529,"Roasted Butternut Squash, Pecan, Bacon, Mix Gr...",4,http://www.foodista.com/recipe/67XP25KK/roaste...,"{'nutrients': [{'name': 'Calories', 'title': '...",['salad'],"['gluten free', 'dairy free', 'paleolithic', '...",https://spoonacular.com/roasted-butternut-squa...,0,8,"[{'id': 10123, 'amount': 5.0, 'unit': 'slices'...",[],[],1,76.0,37.0,"[{'id': 10123, 'aisle': 'Meat', 'image': 'raw-...",363.42


To use the ingredient data in my recommendation model, I need to pull the details out of the extendedIngredients column so that each ingredient gets its own column.
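Because the data was read from a CSV file, each extendedIngredients cell is a string that looks like a Python list of dictionaries. The following is a minimal sketch of turning one such string back into Python objects with `ast.literal_eval` from the standard library, which only evaluates literals. The sample string and the helper name `parse_ingredient_names` are made up for illustration, and skipping missing `nameClean` values is an assumption about how they should be handled.

```python
import ast

# Hypothetical cell value, shaped like one entry of data['extendedIngredients'].
sample_cell = (
    "[{'id': 11090, 'aisle': 'Produce', 'nameClean': 'broccoli'}, "
    "{'id': 11135, 'aisle': 'Produce', 'nameClean': 'cauliflower'}, "
    "{'id': 99999, 'aisle': None, 'nameClean': None}]"
)

def parse_ingredient_names(cell):
    """Return the non-missing 'nameClean' values from one stringified ingredient list."""
    ingredients = ast.literal_eval(cell)  # parses literals only, unlike eval
    return [item['nameClean'] for item in ingredients if item.get('nameClean')]

names = parse_ingredient_names(sample_cell)
print(names)
```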

In [16]:
data['extendedIngredients']

0      [{'id': 11090, 'aisle': 'Produce', 'image': 'b...
1      [{'id': 2044, 'aisle': 'Produce;Spices and Sea...
2      [{'id': 9040, 'aisle': 'Produce', 'image': 'ba...
3      [{'id': 2069, 'aisle': 'Oil, Vinegar, Salad Dr...
4      [{'id': 16018, 'aisle': 'Canned and Jarred', '...
                             ...                        
995    [{'id': 10111352, 'aisle': 'Frozen', 'image': ...
996    [{'id': 1053, 'aisle': 'Milk, Eggs, Other Dair...
997    [{'id': 2044, 'aisle': 'Produce;Spices and Sea...
998    [{'id': 10123, 'aisle': 'Meat', 'image': 'raw-...
999    [{'id': 20012, 'aisle': 'Pasta and Rice;Ethnic...
Name: extendedIngredients, Length: 1000, dtype: object

In [17]:
#accessing the first row of the extendedIngredients data
first=data['extendedIngredients'].iloc[0]

In [18]:
#Viewing the first row of extendedIngredients data
eval(first)

[{'id': 11090,
  'aisle': 'Produce',
  'image': 'broccoli.jpg',
  'consistency': 'solid',
  'name': 'broccoli',
  'nameClean': 'broccoli',
  'original': '2 cups cooked broccoli, chopped small',
  'originalString': '2 cups cooked broccoli, chopped small',
  'originalName': 'cooked broccoli, chopped small',
  'amount': 2.0,
  'unit': 'cups',
  'meta': ['cooked', 'chopped'],
  'metaInformation': ['cooked', 'chopped'],
  'measures': {'us': {'amount': 2.0, 'unitShort': 'cups', 'unitLong': 'cups'},
   'metric': {'amount': 473.176,
    'unitShort': 'ml',
    'unitLong': 'milliliters'}}},
 {'id': 11135,
  'aisle': 'Produce',
  'image': 'cauliflower.jpg',
  'consistency': 'solid',
  'name': 'cauliflower',
  'nameClean': 'cauliflower',
  'original': '1 head of cauliflower, raw',
  'originalString': '1 head of cauliflower, raw',
  'originalName': 'cauliflower, raw',
  'amount': 1.0,
  'unit': 'head',
  'meta': ['raw'],
  'metaInformation': ['raw'],
  'measures': {'us': {'amount': 1.0, 'unitShort'

In [19]:
#creating a function that lists all the ingredient names of a recipe
def ingnames(text):
    ingnames=[]
    ingredientsnamelist=eval(text)
    for item in ingredientsnamelist:
        ingnames.append(item['nameClean'])
    return ingnames

In [20]:
ingnames(first)

['broccoli',
 'cauliflower',
 'coconut oil',
 'cooked brown rice',
 'garlic',
 'grape seed oil',
 'lower sodium soy sauce',
 'green peas',
 'salt',
 'spring onions',
 'spring onions',
 'sesame oil',
 'sesame seeds']

In [22]:
#As shown above, there are duplicate ingredients within the same recipe. After viewing the data in detail, the duplicates appear to be different uses of the same ingredient.

In [23]:
#storing the list of ingredient names for each recipe in a new column
#(copying first so the assignment is not made on a slice of another DataFrame, which raises SettingWithCopyWarning)
data = data.copy()
data['ingredname'] = data['extendedIngredients'].map(ingnames)


In [24]:
#View new ingredname column data
data['ingredname']

0      [broccoli, cauliflower, coconut oil, cooked br...
1      [basil, wheat flour, garlic powder, garlic sal...
2      [banana, graham cracker crumbs, soymilk, straw...
3            [balsamic vinegar, garlic, kale, olive oil]
4      [canned black beans, canned green chiles, cann...
                             ...                        
995    [fingerling potato, parsley, fresh rosemary, t...
996    [cream, dill, lemon juice, leek, green peas, s...
997    [basil, chicken broth, chili sauce, frozen spi...
998    [bacon, ground black pepper, butternut squash,...
999    [bulgur, wheat flour, olive oil, salt, salt an...
Name: ingredname, Length: 1000, dtype: object

In [None]:
#applying Python str to the ingredients column
#data['ingredname']=data['ingredname'].apply(str)

In [25]:
data[['ingredname']].iloc

<pandas.core.indexing._iLocIndexer at 0x242300c2e08>

In [26]:
#Checking out the list of ingredients for the recipe at position 629
data['ingredname'].iloc[629]

['cabbage',
 'red cabbage',
 'carrot',
 'black sesame seeds',
 'yellow onion',
 'olive oil',
 'lemon juice',
 'agave',
 'tahini',
 'water',
 'salt']

In [27]:
#Obtaining the set of all unique ingredients in the 'ingredname' column
total_ingreds = set()
for ingredients in data['ingredname']:
    total_ingreds.update(ingredients)
len(total_ingreds)

1064

In [28]:
import nltk
from nltk.corpus import stopwords
import string
from nltk import word_tokenize, FreqDist
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.datasets import fetch_20newsgroups
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
import pandas as pd
import numpy as np
np.random.seed(0)

In [29]:
for x in data['ingredname']:
    print(x)
    print(type(x))
    break

['broccoli', 'cauliflower', 'coconut oil', 'cooked brown rice', 'garlic', 'grape seed oil', 'lower sodium soy sauce', 'green peas', 'salt', 'spring onions', 'spring onions', 'sesame oil', 'sesame seeds']
<class 'list'>


In [30]:
#exploring the data with frequency distributions
ingreds_concat=[]
for ingreds in data['ingredname']:
    ingreds_concat +=ingreds

In [31]:
ingreds_freqdist = FreqDist(ingreds_concat)
ingreds_freqdist.most_common(200)

[('garlic', 511),
 ('olive oil', 434),
 ('salt', 345),
 ('onion', 295),
 ('salt and pepper', 233),
 ('ground black pepper', 227),
 ('water', 195),
 ('parsley', 174),
 ('lemon juice', 165),
 ('carrot', 163),
 ('bell pepper', 157),
 ('tomato', 129),
 ('spring onions', 128),
 (None, 124),
 ('fresh cilantro', 114),
 ('cumin', 106),
 ('red pepper', 103),
 ('red onion', 102),
 ('parmesan', 90),
 ('ginger', 89),
 ('extra virgin olive oil', 89),
 ('wheat flour', 88),
 ('basil', 87),
 ('red pepper flakes', 86),
 ('sea salt', 76),
 ('lime juice', 75),
 ('oregano', 73),
 ('hass avocado', 71),
 ('egg', 67),
 ('bay leaves', 67),
 ('canned tomatoes', 63),
 ('honey', 63),
 ('spinach', 63),
 ('mushrooms', 63),
 ('coarse kosher salt', 62),
 ('sugar', 61),
 ('butter', 60),
 ('thyme', 59),
 ('soy sauce', 58),
 ('vegetable oil', 57),
 ('zucchini', 56),
 ('garlic powder', 55),
 ('cooking oil', 54),
 ('chili powder', 54),
 ('ground cayenne pepper', 53),
 ('shallot', 52),
 ('boneless chicken breast', 48),
 (

In [None]:
#train test split before vectorization

In [32]:
y = data[['pricePerServing']].copy()
X = data[['ingredname']].copy()

In [33]:
string_col_values = []
for row in X.values:
    true_ingreds = [i for i in row[0] if i]
    string_col_values.append(','.join(true_ingreds))
X.loc[:, 'string_ingreds'] = string_col_values
X.head()

Unnamed: 0,ingredname,string_ingreds
0,"[broccoli, cauliflower, coconut oil, cooked br...","broccoli,cauliflower,coconut oil,cooked brown ..."
1,"[basil, wheat flour, garlic powder, garlic sal...","basil,wheat flour,garlic powder,garlic salt,ve..."
2,"[banana, graham cracker crumbs, soymilk, straw...","banana,graham cracker crumbs,soymilk,strawberr..."
3,"[balsamic vinegar, garlic, kale, olive oil]","balsamic vinegar,garlic,kale,olive oil"
4,"[canned black beans, canned green chiles, cann...","canned black beans,canned green chiles,canned ..."


In [34]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X['string_ingreds'],y, test_size=.33, random_state=42) 

In [35]:
print(len(X_train), len(X_test), len(y_train), len(y_test))

670 330 670 330


In [36]:
X_train

703    bacon,olive oil,beef cubes,garlic,onion,thyme,...
311    ground black pepper,dill,garlic,lemon juice,le...
722    olive oil,garlic,onion,carrot,water,long grain...
629    cabbage,red cabbage,carrot,black sesame seeds,...
0      broccoli,cauliflower,coconut oil,cooked brown ...
                             ...                        
106    brown rice flour,golden brown sugar,carrot,gro...
270    chickpeas,orange pepper,olive oil spray,olive ...
860    butter,corn starch,dried apricots,dried sweete...
435    balsamic vinegar,thyme,garlic,ground black pep...
102    chicken stock,ice,mango,rice,root vegetable,sc...
Name: string_ingreds, Length: 670, dtype: object

In [37]:
print(type(X_train))

<class 'pandas.core.series.Series'>


In [38]:
type(X_train)

pandas.core.series.Series

In [39]:
for x in X_train:
    print(x)
    print(type(x))
    break

bacon,olive oil,beef cubes,garlic,onion,thyme,bay leaves,parsley,pearl onion,mushrooms,carrot,red wine,beef broth,wheat flour,salt and pepper
<class 'str'>


In [40]:
# each row is one comma-separated string, so each[4] is a single character (the fifth character of the first row), not the fifth ingredient
for each in X_train.values:
    # print(each)
    for e in each[4]:
        print(e)
    break

n
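Each element of X_train is a single comma-separated string, so indexing it with `[4]` returns the fifth character rather than the fifth ingredient. A small sketch of the difference, using a shortened copy of the first training row printed earlier:

```python
# Shortened copy of the first training row.
recipe = "bacon,olive oil,beef cubes,garlic,onion,thyme"

# Indexing a string returns one character.
fifth_character = recipe[4]

# Splitting on commas first gives a list of ingredient names.
ingredient_list = recipe.split(',')
fifth_ingredient = ingredient_list[4]

print(fifth_character)
print(fifth_ingredient)
```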


In [42]:
import numpy as np
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.manifold import TSNE
from nltk.tokenize import word_tokenize
np.random.seed(0)
import nltk
nltk.download('punkt')



[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\owner\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [47]:
from sklearn.manifold import TSNE
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer

In [48]:
X_train_new = [count_vectorize(x) for x in X_train]

NameError: name 'count_vectorize' is not defined
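The NameError arises because count_vectorize had not been defined when this cell ran. One way to count-vectorize comma-separated ingredient strings is to learn the vocabulary from the training rows only and then count every row against that fixed vocabulary, so the test rows never influence the feature set. The sketch below uses plain Python; the names `build_vocabulary` and `vectorize` and the sample rows are hypothetical. scikit-learn's `CountVectorizer`, imported above, follows the same fit-then-transform pattern.

```python
def build_vocabulary(rows):
    """Map each distinct ingredient in the training rows to a column index."""
    names = sorted({name for row in rows for name in row.split(',') if name})
    return {name: index for index, name in enumerate(names)}

def vectorize(row, vocabulary):
    """Count how often each vocabulary ingredient appears in one row."""
    counts = [0] * len(vocabulary)
    for name in row.split(','):
        if name in vocabulary:  # ingredients unseen in training are ignored
            counts[vocabulary[name]] += 1
    return counts

# Hypothetical stand-ins for X_train and X_test.
train_rows = ["garlic,olive oil,salt", "garlic,onion,spring onions,spring onions"]
test_rows = ["garlic,tahini"]

vocabulary = build_vocabulary(train_rows)
train_matrix = [vectorize(row, vocabulary) for row in train_rows]
test_matrix = [vectorize(row, vocabulary) for row in test_rows]

print(vocabulary)
print(train_matrix)
print(test_matrix)
```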

In [None]:
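#exploring the training data with frequency distributions
ingreds_concat=[]
for ingreds in X_train:
    #each row is a comma-separated string, so split it into ingredient names first
    ingreds_concat += ingreds.split(',')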

In [None]:
ingreds_freqdist = FreqDist(ingreds_concat)
ingreds_freqdist.most_common(200)

In [49]:
X_train.apply(count_vectorize)

NameError: name 'count_vectorize' is not defined

In [None]:
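#create a function that returns the ingredient counts of one recipe as a Python dictionary
def count_vectorize(ingredient_string, vocab=None):
    #each recipe is a comma-separated string of ingredient names
    ingredients = ingredient_string.split(',')
    if vocab:
        unique_words = vocab
    else:
        unique_words = list(set(ingredients))
    
    counts = {word: 0 for word in unique_words}
    
    for word in ingredients:
        #skip ingredients that are not in a supplied vocabulary
        if word in counts:
            counts[word] += 1
    
    return counts

#vectorize the first training recipe as a check
test_vectorized = count_vectorize(X_train.iloc[0])
print(test_vectorized)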

In [None]:
#confirming that correct information is pulling from the data for the first row
eval(first)[0]['id']

In [None]:
#creating a function that lists all the ingredient ids of a recipe
def ids(text):
    ids=[]
    ingredientslist=eval(text)
    for item in ingredientslist:
        ids.append(item['id'])
    return ids


In [None]:
#viewing the ingredient ids of the first recipe
ids(first)

In [None]:
#creating lambda function that stores all the ingredient IDs as a data point
data['ingredid'] = data['extendedIngredients'].map(lambda x: ids(x))

In [None]:
#View new ingredid column data
data['ingredid']

In [None]:
#creating functions to create new columns for the rest of the data that was in the original extended ingredients column (aisle, name,original name, amount, unit and meta)
def aisle(text):
    aisles=[]
    ingredientslist=eval(text)
    for item in ingredientslist:
        aisles.append(item['aisle'])
    return aisles

In [None]:
data['aisle'] = data['extendedIngredients'].map(lambda x: aisle(x))

In [None]:
data['aisle']

In [None]:
def name(text):
    names=[]
    ingredientslist=eval(text)
    for item in ingredientslist:
        names.append(item['name'])
    return names

In [None]:
data['name'] = data['extendedIngredients'].map(lambda x: name(x))

In [None]:
def originalName(text):
    orignames=[]
    ingredientslist=eval(text)
    for item in ingredientslist:
        orignames.append(item['originalName'])
    return orignames

In [None]:
data['originalName'] = data['extendedIngredients'].map(lambda x: originalName(x))

In [None]:
def amount(text):
    amounts=[]
    ingredientslist=eval(text)
    for item in ingredientslist:
        amounts.append(item['amount'])
    return amounts

In [None]:
data['amount'] = data['extendedIngredients'].map(lambda x: amount(x))

In [None]:
def unit(text):
    units=[]
    ingredientslist=eval(text)
    for item in ingredientslist:
        units.append(item['unit'])
    return units

In [None]:
data['unit'] = data['extendedIngredients'].map(lambda x: unit(x))

In [None]:
def meta(text):
    metas=[]
    ingredientslist=eval(text)
    for item in ingredientslist:
        metas.append(item['meta'])
    return metas

In [None]:
data['meta'] = data['extendedIngredients'].map(lambda x: meta(x))
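The helper functions above differ only in the dictionary key they read, so they could be collapsed into one function that takes the key as a parameter. A minimal sketch of that idea follows; `extract_field` is a hypothetical name, the sample string is made up, and `ast.literal_eval` is swapped in for `eval` because it only parses literals.

```python
import ast

def extract_field(cell, field):
    """Return the value of `field` for every ingredient in one stringified list."""
    ingredients = ast.literal_eval(cell)
    # .get returns None when an ingredient has no such key (an assumption
    # about how missing keys should be handled).
    return [item.get(field) for item in ingredients]

# Hypothetical cell value shaped like one entry of data['extendedIngredients'].
sample_cell = (
    "[{'id': 11090, 'aisle': 'Produce', 'name': 'broccoli', 'amount': 2.0, 'unit': 'cups'}, "
    "{'id': 11135, 'aisle': 'Produce', 'name': 'cauliflower', 'amount': 1.0, 'unit': 'head'}]"
)

fields = ['id', 'aisle', 'name', 'amount', 'unit']
extracted = {field: extract_field(sample_cell, field) for field in fields}
print(extracted)
```

With the DataFrame from this notebook, each column could then be created inside a loop over the field names, for example with `data[field] = data['extendedIngredients'].map(lambda cell: extract_field(cell, field))`.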

In [None]:
#reviewing the changes I made to columns
data.head()

In [None]:
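#create dummies of the ingred ids: one indicator column per ingredient id, summed back to one row per recipe
#(Series.sum(level=0) was removed in pandas 2.0, so group by the index level instead)
ingred_dummies = pd.get_dummies(data['ingredid'].explode()).groupby(level=0).sum()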

In [None]:
#add dummies to df
data=pd.concat([data, ingred_dummies], axis=1)

In [None]:
#looking at the new df head
data.head()

In [None]:
#looking at a single row
data.iloc[1]

In [None]:
#for i in data.ingredid:
    #print (i)

In [None]:
#access unique ingredients
uniqueingred=data['ingredid'].apply(pd.Series).stack().unique()
uniqueingred.sort()

In [None]:
uniqueingred

In [None]:
def recipes_per_ingredient(uniqueingred, data):
        try:
            return data.groupby('id').count()["uniqueingred"].loc[1]
        except KeyError:
            return None

In [None]:
ingredient_names = data['ingredid'].apply(pd.Series).stack().unique()

In [None]:
#determine how many recipes per ingredient to learn if data is imbalanced
def count_recipes_per_ingredient(ingredid,data):
    try:
        return data.groupby(ingredid).count().loc[1]
    except KeyError:
        return None
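To see how many recipes use each ingredient, one option is to count every ingredient once per recipe, so an ingredient listed twice in the same recipe (as with 'spring onions' in the first recipe) is not double counted. The sketch below works on plain lists rather than the DataFrame; the function name `recipe_counts` and the sample lists are hypothetical, and skipping `None` entries is an assumption.

```python
from collections import Counter

def recipe_counts(ingredient_lists):
    """Count the number of recipes in which each ingredient appears."""
    counts = Counter()
    for ingredients in ingredient_lists:
        # The set counts an ingredient once per recipe; None entries are skipped.
        counts.update({name for name in ingredients if name is not None})
    return counts

# Hypothetical ingredient lists, shaped like the values of data['ingredname'].
sample_lists = [
    ['garlic', 'spring onions', 'spring onions', 'salt'],
    ['garlic', None, 'olive oil'],
    ['olive oil', 'garlic'],
]

counts = recipe_counts(sample_lists)
print(counts.most_common(2))
```

A heavily skewed result, with a few ingredients in most recipes and many ingredients in only one, would suggest the ingredient features are imbalanced.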

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
#looking at the column of one ingredient
data[11320420]

In [None]:
data[11320420].sum()

In [None]:
data.sum(axis=0)

In [None]:
data.loc['total']= data.sum()

In [None]:
#looking at all the column titles
for col in data.columns: 
    print(col)

In [None]:
data.apply(lambda column: column[1001 : 13811111].sum(),axis=1)

In [None]:
#selecting subdata to use for the bar chart
subdata=data.iloc[1000, 24:]

In [None]:
#count of recipes per ingredient
ax2=subdata.sort_values(ascending=False).plot(
    kind='bar', stacked=False)

ax2.set_ylabel("Count of Recipes")
ax2.set_title("Count of Recipes Per Ingredient")
plt.figure(figsize=(5,100))
plt.show()

In [None]:
fig, ax = plt.subplots()
fig.set_size_inches(35,15)
subdata.sort_values(ascending=False).plot.bar(stacked=False)
plt.legend(loc=2, prop={'size': 20})
plt.ylabel("Count of Recipes")
plt.title("Recipes per Unique Ingredient")
plt.show()

In [None]:
#len(uniqueingred)

In [None]:
# Here you run your code to explore the data

## Data Preparation

Describe and justify the process for preparing the data for analysis.

***
Questions to consider:
* Were there variables you dropped or created?
* How did you address missing values or outliers?
* Why are these choices appropriate given the data and the business problem?
***

In [None]:
# Here you run your code to clean the data

## Data Modeling
Describe and justify the process for analyzing or modeling the data.

***
Questions to consider:
* How did you analyze or model the data?
* How did you iterate on your initial approach to make it better?
* Why are these choices appropriate given the data and the business problem?
***

In [None]:
# Here you run your code to model the data


## Evaluation
Evaluate how well your work solves the stated business problem.

***
Questions to consider:
* How do you interpret the results?
* How well does your model fit your data? How much better is this than your baseline model?
* How confident are you that your results would generalize beyond the data you have?
* How confident are you that this model would benefit the business if put into use?
***

## Conclusions
Provide your conclusions about the work you've done, including any limitations or next steps.

***
Questions to consider:
* What would you recommend the business do as a result of this work?
* What are some reasons why your analysis might not fully solve the business problem?
* What else could you do in the future to improve this project?
***