### Part 0: Basic Data Cleaning
The first step is some basic data cleaning: get rid of all the columns that won't be of use in any of the projects going forward, and add some useful columns, derived from the existing ones, that will come in handy in both Data Analysis and ML/NLP.

* **Drop:** 
['Name', 'AuthorName', 'CookTime', 'PrepTime', 'TotalTime', 'DatePublished', 'Description', 'Images', 'ReviewCount']

* **Add:**
['TotalMinutes', 'YearPublished', 'MonthPublished', 'DayPublished', 'HourPublished']

* **Replace:**
['RecipeIngredientQuantities', 'RecipeIngredientParts'] with ones scraped from food.com from scratch.

* **Save:**
BasicCleanData.parquet 

We can then perform classical data analysis on `BasicCleanData.parquet`.


#### Imports and sanity checks

In [1]:
import sys
sys.executable

'C:\\Users\\mathe\\anaconda3\\envs\\deepchef\\python.exe'

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re

In [10]:
# This allows scrolling through all the columns. Useful for dataframes with too many columns.
pd.set_option('display.max_columns', 100)

In [None]:
recipes = pd.read_parquet('recipes.parquet')

In [None]:
recipes.sample(2)

In [None]:
recipes.info()

In [None]:
recipes.describe()

#### Adding recipe urls to the dataframe
We will first reconstruct the recipe URLs from the original recipes dataset. 
* We can use these URLs to check the recipe data recorded in the dataset against the actual info on the respective recipe webpages.
* We also use these links to scrape food.com in order to upgrade the ingredients (currently ongoing in another notebook).

In [None]:
recipes['url']= recipes['Name'].apply(lambda x: x.replace(' ','-')+'-')
recipes['url']

In [None]:
recipes['url'] = recipes[['url', 'RecipeId']].apply(lambda x: 'https://www.food.com/recipe/' + x['url'] + str(int(x['RecipeId'])), axis=1)
recipes['url']
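
The same URL can also be built without a row-wise `apply`. The following is a minimal sketch on a toy frame whose two rows are copied from the sample outputs in this notebook; the names `toy`, `slug` and `recipe_id` are only for illustration.

```python
import pandas as pd

# Toy rows taken from sample outputs in this notebook; the real frame is much larger.
toy = pd.DataFrame({
    "RecipeId": [255778.0, 104173.0],
    "Name": ["Spicy Shrimp", "Pizza Sauce"],
})

# Vectorized string operations: spaces become hyphens, the integer id is the suffix.
slug = toy["Name"].str.replace(" ", "-", regex=False)
recipe_id = toy["RecipeId"].astype(int).astype(str)
toy["url"] = "https://www.food.com/recipe/" + slug + "-" + recipe_id

print(toy["url"].tolist())
```

On the full dataset this avoids calling a Python lambda once per row, which is usually noticeably faster.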

In [None]:
recipes.sample(5)

In [None]:
# Save as Parquet so that the read_parquet call in the next cell finds this file.
recipes.to_parquet('recipes_with_urls.parquet')

In [6]:
recipes = pd.read_parquet('recipes_with_urls.parquet')

In [11]:
recipes.sample(5)

Unnamed: 0,RecipeId,Name,AuthorId,AuthorName,CookTime,PrepTime,TotalTime,DatePublished,Description,Images,RecipeCategory,Keywords,RecipeIngredientQuantities,RecipeIngredientParts,AggregatedRating,ReviewCount,Calories,FatContent,SaturatedFatContent,CholesterolContent,SodiumContent,CarbohydrateContent,FiberContent,SugarContent,ProteinContent,RecipeServings,RecipeYield,RecipeInstructions,url
105857,111572.0,Delicious Vegan Taco Pie!,196370,grungrrrl,PT20M,PT20M,PT40M,2005-02-21 20:00:00+00:00,This is a recipe I used to love before I went ...,[],Savory Pies,"[Grains, Soy/Tofu, Beans, Vegetable, Mexican, ...","[1, 1, 1⁄2, 1, 2, 1, None, None, None, None]","[vegan taco seasoning, water, vegan sour cream...",5.0,2.0,412.4,17.0,4.2,28.4,801.1,36.3,7.3,3.7,28.2,,1 Pie,"[Preheat oven to 375 degrees., Cook the""faux"" ...",https://www.food.com/recipe/Delicious-Vegan-Ta...
179632,187795.0,My Crock Pot Green Beans,101732,mydesigirl,PT8H,PT10M,PT8H10M,2006-09-26 16:37:00+00:00,These are just like my mom used to make. They...,[https://img.sndimg.com/food/image/upload/w_55...,Vegetable,"[Low Protein, Low Cholesterol, Easy]","[4, 2, 1, 1⁄2, None, None, None]","[green beans, chicken broth, onion, bacon, bac...",5.0,27.0,432.3,27.7,9.1,38.6,1151.4,32.4,11.6,15.2,18.7,,,"[Add everything into crock pot and mix., Cook ...",https://www.food.com/recipe/My-Crock-Pot-Green...
397321,411865.0,Audrey's Favorite Beef Stew,56905,RdhdA8,PT8H,PT20M,PT8H20M,2010-02-05 09:15:00+00:00,My daughter Audrey always called this &quot;Ca...,[],Meat,[Easy],"[3, 1 1⁄4, 28, 2, 5, 1, 8, 1, 1, 1⁄2, 1]","[boneless beef roast, diced tomatoes, minute t...",,,409.3,8.2,2.8,102.3,1643.9,43.5,6.5,11.8,40.7,,,[Add everything to slow cooker (stir together)...,https://www.food.com/recipe/Audrey's-Favorite-...
245761,255778.0,Spicy Shrimp,448587,Malaika T,PT15M,PT15M,PT30M,2007-09-27 12:00:00+00:00,A spicy shrimp dish with a touch of sweetness....,[],< 30 Mins,[Easy],"[2, 1⁄2, 1⁄2, 2, 1⁄3, 1⁄2, 1⁄2, 1 -2, 1 -2, 1,...","[jumbo shrimp, lime juice, vinegar, marmalade,...",,,620.2,31.4,4.6,345.0,547.6,38.1,1.0,27.4,46.9,4.0,,"[Mix the lime juice, orange juice, hot sauce, ...",https://www.food.com/recipe/Spicy-Shrimp-255778
177729,185838.0,Bob Evan's Reuben,285039,Cook4_6,PT15M,PT15M,PT30M,2006-09-13 17:28:00+00:00,I got this recipe off of a recipe card at Bob ...,[],Lunch/Snacks,"[Pork, Meat, < 30 Mins]","[1, 1⁄2, 16, 1⁄3, 2, 1, 3, 1, 1⁄4]","[hot sausage, onion, sauerkraut, mayonnaise, k...",,,618.9,42.2,17.4,109.6,1433.0,31.8,4.3,4.7,27.5,10.0,,[In medium skillet crumble and brown sausage w...,https://www.food.com/recipe/Bob-Evan's-Reuben-...


In [12]:
recipes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 522517 entries, 0 to 522516
Data columns (total 29 columns):
 #   Column                      Non-Null Count   Dtype              
---  ------                      --------------   -----              
 0   RecipeId                    522517 non-null  float64            
 1   Name                        522517 non-null  object             
 2   AuthorId                    522517 non-null  int32              
 3   AuthorName                  522517 non-null  object             
 4   CookTime                    439972 non-null  object             
 5   PrepTime                    522517 non-null  object             
 6   TotalTime                   522517 non-null  object             
 7   DatePublished               522517 non-null  datetime64[ns, UTC]
 8   Description                 522512 non-null  object             
 9   Images                      522516 non-null  object             
 10  RecipeCategory              521766 non-null 

In [13]:
recipes.describe()

Unnamed: 0,RecipeId,AuthorId,AggregatedRating,ReviewCount,Calories,FatContent,SaturatedFatContent,CholesterolContent,SodiumContent,CarbohydrateContent,FiberContent,SugarContent,ProteinContent,RecipeServings
count,522517.0,522517.0,269294.0,275028.0,522517.0,522517.0,522517.0,522517.0,522517.0,522517.0,522517.0,522517.0,522517.0,339606.0
mean,271821.43697,45725850.0,4.632014,5.227784,484.43858,24.614922,9.559457,86.487003,767.2639,49.089092,3.843242,21.878254,17.46951,8.606191
std,155495.878422,292971400.0,0.641934,20.381347,1397.116649,111.485798,46.622621,301.987009,4203.621,180.822062,8.603163,142.620191,40.128837,114.319809
min,38.0,27.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
25%,137206.0,69474.0,4.5,1.0,174.2,5.6,1.5,3.8,123.3,12.8,0.8,2.5,3.5,4.0
50%,271758.0,238937.0,5.0,2.0,317.1,13.8,4.7,42.6,353.3,28.2,2.2,6.4,9.1,6.0
75%,406145.0,565828.0,5.0,4.0,529.1,27.4,10.8,107.9,792.2,51.1,4.6,17.9,25.0,8.0
max,541383.0,2002886000.0,5.0,3063.0,612854.6,64368.1,26740.6,130456.4,1246921.0,108294.6,3012.0,90682.3,18396.2,32767.0


In [15]:
recipes.isna().sum()

RecipeId                           0
Name                               0
AuthorId                           0
AuthorName                         0
CookTime                       82545
PrepTime                           0
TotalTime                          0
DatePublished                      0
Description                        5
Images                             1
RecipeCategory                   751
Keywords                           0
RecipeIngredientQuantities         0
RecipeIngredientParts              0
AggregatedRating              253223
ReviewCount                   247489
Calories                           0
FatContent                         0
SaturatedFatContent                0
CholesterolContent                 0
SodiumContent                      0
CarbohydrateContent                0
FiberContent                       0
SugarContent                       0
ProteinContent                     0
RecipeServings                182911
RecipeYield                   348071
R

#### Dropping Redundant Columns <a class ='author' id='part-0'></a>
`TotalTime` is the sum of `CookTime` and `PrepTime`, and the latter two seem to be missing from the recipes on the webpages, so I'll just drop `CookTime` and `PrepTime`.

In [16]:
recipes.drop(['CookTime', 'PrepTime'], axis=1,inplace=True)
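
The claim that `TotalTime` equals `CookTime` plus `PrepTime` can be spot-checked before the drop. This is a rough sketch under some assumptions: `iso_minutes` is a hypothetical helper, durations are assumed to contain only hour and minute parts, a missing `CookTime` is counted as zero, and the third toy row is invented to show that case (the first two come from the sample output above).

```python
import re
import pandas as pd

def iso_minutes(duration):
    """Hypothetical helper: minutes in a 'PT#H#M' string; missing values count as 0."""
    if duration is None or pd.isna(duration):
        return 0
    hours = re.search(r"(\d+)H", duration)
    minutes = re.search(r"(\d+)M", duration)
    return (int(hours.group(1)) * 60 if hours else 0) + (
        int(minutes.group(1)) if minutes else 0
    )

toy = pd.DataFrame({
    "CookTime": ["PT20M", "PT8H", None],        # last row: invented missing value
    "PrepTime": ["PT20M", "PT10M", "PT30M"],
    "TotalTime": ["PT40M", "PT8H10M", "PT30M"],
})

# Rows where the parts do not add up to the recorded total.
parts = toy["CookTime"].map(iso_minutes) + toy["PrepTime"].map(iso_minutes)
mismatch = toy[parts != toy["TotalTime"].map(iso_minutes)]
print(len(mismatch))
```

Running the same comparison on `recipes` would show how many rows, if any, break the assumption.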

In [20]:
recipes.sample(2)

Unnamed: 0,RecipeId,Name,AuthorId,AuthorName,TotalTime,DatePublished,Description,Images,RecipeCategory,Keywords,RecipeIngredientQuantities,RecipeIngredientParts,AggregatedRating,ReviewCount,Calories,FatContent,SaturatedFatContent,CholesterolContent,SodiumContent,CarbohydrateContent,FiberContent,SugarContent,ProteinContent,RecipeServings,RecipeYield,RecipeInstructions,url
258526,268939.0,Chicken Hominy Soup,669653,cuteandbublee,PT2H10M,2007-11-29 19:37:00+00:00,I don't eat pork so I made a kosher version of...,[https://img.sndimg.com/food/image/upload/w_55...,Very Low Carbs,"[Low Cholesterol, Healthy, Kosher, < 4 Hours]","[2 1⁄2, 2, 1, 3 -5, 10, 1, 2, 3, 3]","[chicken, garlic cloves, chicken broth, dried ...",5.0,1.0,432.9,25.5,6.5,86.2,2339.6,17.8,3.7,3.7,31.4,6.0,,"[Brown chicken in oil in a pot., Remove chicke...",https://www.food.com/recipe/Chicken-Hominy-Sou...
98661,104173.0,Pizza Sauce,37636,PalatablePastime,PT35M,2004-11-16 20:00:00+00:00,Make and share this Pizza Sauce recipe from Fo...,[https://img.sndimg.com/food/image/upload/w_55...,Sauces,"[Vegetable, European, Savory, < 60 Mins, Stove...","[15, 3, 2, 2, 2, 1, 1⁄2, 1⁄4]","[tomato sauce, garlic cloves, olive oil, dried...",5.0,9.0,416.5,28.4,3.9,0.0,3411.5,41.2,8.2,22.6,6.7,1.0,,[Combine ingredients in a small saucepan and s...,https://www.food.com/recipe/Pizza-Sauce-104173


`AuthorId` is the numeric equivalent of `AuthorName`, so we drop the name. Similarly, `RecipeId` is the numeric equivalent of `Name`. We will eventually drop `url` as well, but for now we keep it because it is still useful.

In [21]:
recipes.drop(['Name', 'AuthorName'], axis=1,inplace=True)
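
Whether an id column really carries the same information as its name column can be checked with a pair of `groupby` counts. The sketch below uses a small toy frame (author values taken from the sample output above, with one row repeated) and does not make any claim about the full dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    "AuthorId": [196370, 101732, 101732],
    "AuthorName": ["grungrrrl", "mydesigirl", "mydesigirl"],
})

# One-to-one means: every id maps to a single name and every name to a single id.
names_per_id = toy.groupby("AuthorId")["AuthorName"].nunique()
ids_per_name = toy.groupby("AuthorName")["AuthorId"].nunique()
one_to_one = bool((names_per_id == 1).all() and (ids_per_name == 1).all())
print(one_to_one)
```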

In [22]:
recipes.sample(3)

Unnamed: 0,RecipeId,AuthorId,TotalTime,DatePublished,Description,Images,RecipeCategory,Keywords,RecipeIngredientQuantities,RecipeIngredientParts,AggregatedRating,ReviewCount,Calories,FatContent,SaturatedFatContent,CholesterolContent,SodiumContent,CarbohydrateContent,FiberContent,SugarContent,ProteinContent,RecipeServings,RecipeYield,RecipeInstructions,url
360332,373504.0,227586,PT30M,2009-05-21 02:17:00+00:00,"This is a super easy candy to make, made so po...",[],Candy,"[Dessert, High In..., < 30 Mins, Easy]","[1, 1, 2, 3⁄4]","[walnuts, butter, walnuts]",5.0,2.0,241.6,16.1,4.2,17.1,62.0,21.5,1.1,19.6,5.4,12.0,27 candies,"[Heat the condensed milk, walnuts, and butter...",https://www.food.com/recipe/Docinho-De-Nozes-(...
349848,362771.0,1178785,PT25M,2009-03-25 02:15:00+00:00,Make and share this Exotic Steak recipe from F...,[],Meat,"[Asian, Lactose Free, Egg Free, Free Of..., Sp...","[1⁄4, 1, 3, 1⁄2 - 1, 3 -4, 2, 2, 1⁄2, 7, 1⁄2, ...","[canola oil, onion, garlic cloves, ginger, red...",,,251.7,22.9,2.3,0.0,509.6,8.6,2.6,2.3,6.3,4.0,,"[Heat oil in a wok or pan., Make a paste or th...",https://www.food.com/recipe/Exotic-Steak-362771
88493,93745.0,113509,PT30M,2004-06-18 20:00:00+00:00,"I got this from Yahoo, but it is originally fr...",[],Chicken Breast,"[Chicken, Poultry, Meat, < 30 Mins]","[1, 1, 1, 1, 3, 4, 1⁄4, 1, 2, 2, 1⁄4, 1⁄2, 2, ...","[chicken cutlet, mushroom, garlic cloves, blac...",,,295.0,11.4,2.7,71.3,563.9,15.9,1.9,1.4,31.9,4.0,,[Combine the breadcrumbs and Parmesan cheese i...,https://www.food.com/recipe/Chicken-Scaloppine...


`DatePublished` packs more detail into one column than we need. Instead we split it into `YearPublished`, `MonthPublished`, `DayPublished` and `HourPublished`. 

We can later use these to see which days, months and years have the highest rate of published recipes, and so on.

In [23]:
recipes['DatePublished'].apply(lambda x: x.hour)

0         21
1         13
2         19
3         14
4          6
          ..
522512    15
522513    15
522514    15
522515    22
522516    22
Name: DatePublished, Length: 522517, dtype: int64

In [24]:
recipes['YearPublished'] = recipes['DatePublished'].apply(lambda x: x.year)
recipes['MonthPublished'] = recipes['DatePublished'].apply(lambda x: x.month)
recipes['DayPublished'] = recipes['DatePublished'].apply(lambda x: x.day)
recipes['HourPublished'] = recipes['DatePublished'].apply(lambda x: x.hour)
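
As an alternative to `apply` with a lambda, pandas exposes the same fields through the `.dt` accessor on datetime columns. A minimal sketch on two timestamps taken from the sample output above (the names `dates` and `parts` are illustrative):

```python
import pandas as pd

# Two timezone-aware timestamps copied from the sample rows.
dates = pd.Series(pd.to_datetime(
    ["2005-02-21 20:00:00+00:00", "2006-09-26 16:37:00+00:00"], utc=True
))

# The .dt accessor extracts each component for the whole column at once.
parts = pd.DataFrame({
    "YearPublished": dates.dt.year,
    "MonthPublished": dates.dt.month,
    "DayPublished": dates.dt.day,
    "HourPublished": dates.dt.hour,
})
print(parts)
```

The values match what the lambda version produces; the integer dtype of the resulting columns may differ depending on the pandas version.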

In [25]:
recipes.drop(['DatePublished'],axis=1,inplace=True)

In [26]:
recipes.sample(3)

Unnamed: 0,RecipeId,AuthorId,TotalTime,Description,Images,RecipeCategory,Keywords,RecipeIngredientQuantities,RecipeIngredientParts,AggregatedRating,ReviewCount,Calories,FatContent,SaturatedFatContent,CholesterolContent,SodiumContent,CarbohydrateContent,FiberContent,SugarContent,ProteinContent,RecipeServings,RecipeYield,RecipeInstructions,url,YearPublished,MonthPublished,DayPublished,HourPublished
423537,439195.0,101920,PT20M,Make and share this P90x Gravy recipe from Foo...,[],Low Protein,"[Low Cholesterol, Healthy, < 30 Mins, Easy]","[1⁄3, 1⁄3, 3, 1⁄4, 1]","[shallot, all-purpose flour, fat-free chicken ...",,,20.9,0.1,0.0,0.0,206.1,4.2,0.1,0.1,0.8,10.0,10.0,[Sauté shallots in some of the broth until sof...,https://www.food.com/recipe/P90x-Gravy-439195,2010,10,11,11
226329,235829.0,20480,PT50M,Mexican treat chez nous.\r\nYou can change or ...,[https://img.sndimg.com/food/image/upload/w_55...,Poultry,"[Vegetable, Meat, Mexican, < 60 Mins]","[2, 1, 1, 1⁄4, 1⁄4, 2, 1⁄4, 1⁄4, 4, 1⁄2, 1, 1,...","[onion, green pepper, paprika, cayenne, garlic...",4.5,4.0,868.3,39.1,15.9,127.8,1644.2,75.3,9.7,8.3,54.5,4.0,,"[Heat 1Tsp oil in skillet., Sauté onion, garli...",https://www.food.com/recipe/Chicken-Fajitas-23...,2007,6,19,22
302218,313952.0,628076,PT10M,We’ve added roasted peppers and garlic to this...,[],Vegetable,"[< 15 Mins, Easy]","[4, 1, 1⁄2, 8 -10, 1⁄4, 1, 1⁄2, None]","[fresh mozzarella cheese, fresh basil leaves, ...",,,331.2,26.2,11.2,59.7,785.4,7.1,1.6,4.0,18.0,6.0,,"[Combine tomatoes, mozzarella, roasted peppers...",https://www.food.com/recipe/Caprese-Salad-With...,2008,7,15,19


Now let's convert `TotalTime` to numbers (in minutes). At the moment the values of this column are duration strings that look like one of the following: 'PT3H30M', 'PT3H', 'PT20M'

In [27]:
re.findall(r'\dH|\d*M','PT3H30M')

['3H', '30M']

In [28]:
[string.replace('H','') for string in re.findall(r'\dH|\d*M','PT3H30M')]

['3', '30M']

In [29]:
result = [int(x.replace('H', '')) * 60 if 'H' in x else int(x.replace('M', '')) for x in re.findall(r'\d+H|\d+M', 'PT3H30M')]
result

[180, 30]

In [30]:
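# \d+ (rather than \d or \d*) captures multi-digit hour counts whole and never matches a bare 'M'.
recipes['TotalMinutes'] = recipes['TotalTime'].apply(lambda string: re.findall(r'\d+H|\d+M', string))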
recipes['TotalMinutes'] = recipes['TotalMinutes'].apply(lambda timelist: [int(x.replace('H', '')) * 60 if 'H' in x else int(x.replace('M', '')) for x in timelist])
recipes['TotalMinutes'] = recipes['TotalMinutes'].apply(lambda timelist: sum(timelist))
recipes['TotalMinutes']

0         285
1         265
2          35
3         260
4          50
         ... 
522512     95
522513    210
522514    240
522515     15
522516     40
Name: TotalMinutes, Length: 522517, dtype: int64
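
The same conversion can be sketched in vectorized form with `Series.str.extract`. This assumes the durations only ever contain hour and minute parts; the first three strings are the examples quoted above, and 'PT24H45M' is an illustrative value added to show a multi-digit hour count.

```python
import pandas as pd

times = pd.Series(["PT3H30M", "PT3H", "PT20M", "PT24H45M"])

# Named groups become columns; a part that is absent comes back as a missing value.
parts = times.str.extract(r"PT(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?")
parts = parts.fillna("0").astype(int)

total_minutes = parts["hours"] * 60 + parts["minutes"]
print(total_minutes.tolist())
```

Matching `\d+` for the hours matters here: a pattern that accepts only a single digit before `H` would read 'PT24H45M' as 4 hours rather than 24.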

In [32]:
recipes.drop(['TotalTime'],axis=1,inplace=True)

In [33]:
recipes.sample(2)

Unnamed: 0,RecipeId,AuthorId,Description,Images,RecipeCategory,Keywords,RecipeIngredientQuantities,RecipeIngredientParts,AggregatedRating,ReviewCount,Calories,FatContent,SaturatedFatContent,CholesterolContent,SodiumContent,CarbohydrateContent,FiberContent,SugarContent,ProteinContent,RecipeServings,RecipeYield,RecipeInstructions,url,YearPublished,MonthPublished,DayPublished,HourPublished,TotalMinutes
445780,462223.0,1680722,"I love octopus, but I couldn&rsquo;t decide ho...",[https://img.sndimg.com/food/image/upload/w_55...,Octopus,"[No Shell Fish, European, Very Low Carbs, < 4 ...","[1000, 3, 3, 1, 1⁄2, None, None, 3, 3, 1, None...","[vinegar, garlic clove, oregano, salt, pepper,...",,,2018.0,133.8,18.4,480.0,2314.7,48.5,6.9,11.1,153.2,,1 octopus,"[Wash your octopus., Cook without water in his...",https://www.food.com/recipe/3-Kind-Grilled-Oct...,2011,8,12,12,75
95529,100962.0,62422,Make and share this Hula recipe from Food.com.,[],Beverages,"[Low Protein, Low Cholesterol, < 15 Mins]","[1, 1, 1⁄2, 1⁄2, None, 1]","[banana, evaporated milk, lime, juice of, salt]",,,264.9,9.8,5.9,36.5,213.2,37.0,1.9,14.7,9.6,2.0,,"[Mash banana., Add milk, fruit juice, and salt...",https://www.food.com/recipe/Hula-100962,2004,9,30,20,5


We won't be using `Images` anywhere in our projects, so I'll remove the column. Same for `ReviewCount`. (For now I'll keep `url` because it helps with double-checking recipe entries against the actual recipe webpage; I'll drop that column too later, when we get to the ML part.)

In [34]:
recipes.drop(['ReviewCount', 'Images'],axis=1,inplace=True)

In [36]:
recipes.sample(2)

Unnamed: 0,RecipeId,AuthorId,Description,RecipeCategory,Keywords,RecipeIngredientQuantities,RecipeIngredientParts,AggregatedRating,Calories,FatContent,SaturatedFatContent,CholesterolContent,SodiumContent,CarbohydrateContent,FiberContent,SugarContent,ProteinContent,RecipeServings,RecipeYield,RecipeInstructions,url,YearPublished,MonthPublished,DayPublished,HourPublished,TotalMinutes
30865,34485.0,43385,Make and share this Honey Chicken and Mint Col...,Chicken,"[Poultry, Vegetable, Meat, European, < 15 Mins...","[2, 1, None, 1⁄2, 1, 200, 1, 1, 1⁄2, 2]","[chicken breasts, honey, light soya sauce, cab...",,177.0,8.0,2.1,46.4,69.7,10.0,2.6,5.5,16.5,4.0,,"[Marinate chicken breats with honey,salt,peppe...",https://www.food.com/recipe/Honey-Chicken-and-...,2002,7,17,22,10
427800,443650.0,143946,I love to cook from scratch but I also love my...,Meatballs,"[Meat, High Protein, High In..., < 30 Mins]","[1, 1, 1⁄2, 1, 1, 1⁄4, 1, 1⁄2, 1]","[ground beef, parmesan cheese, dried parsley, ...",5.0,354.7,23.1,9.9,138.9,477.0,5.8,0.3,0.5,29.1,4.0,,"[Preheat oven to 325., Process slice of bread ...",https://www.food.com/recipe/Parmesan-Meatballs...,2010,12,6,11,30


In [37]:
recipes.isna().sum()

RecipeId                           0
AuthorId                           0
Description                        5
RecipeCategory                   751
Keywords                           0
RecipeIngredientQuantities         0
RecipeIngredientParts              0
AggregatedRating              253223
Calories                           0
FatContent                         0
SaturatedFatContent                0
CholesterolContent                 0
SodiumContent                      0
CarbohydrateContent                0
FiberContent                       0
SugarContent                       0
ProteinContent                     0
RecipeServings                182911
RecipeYield                   348071
RecipeInstructions                 0
url                                0
YearPublished                      0
MonthPublished                     0
DayPublished                       0
HourPublished                      0
TotalMinutes                       0
dtype: int64

In [42]:
# Write real Parquet so the file content matches its .parquet extension.
recipes.to_parquet('BasicCleanData.parquet')