### Part 0: Basic Data Cleaning
The first step is to do some basic data cleaning and rid of all the columns that won't be of any use acrross any of the projects going forward, and add some useful columns to the dataset based on the existing ones that will come handy in both Data Analysis and ML/NLP.

* **Drop:** 
['Name', 'AuthorName', 'CookTime', 'PrepTime', 'TotalTime', 'DatePublished', 'Description', 'Images', 'ReviewCount']

* **Add:**
['TotalMinutes', 'YearPublished', 'MonthPublished', 'DayPublished', 'HourPublished']

* **Replace:**
['RecipeIngredientQuantities', 'RecipeIngredientParts'] with ones scraped from food.com froms scratch.

**Save:**
BasicCleanData.parquet 

We can perform classical data analysis on BasicCleanData.parquet


#### Imports and sanity checks

In [1]:
import sys
sys.executable

'C:\\Users\\mathe\\anaconda3\\envs\\deepchef\\python.exe'

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re

# The Recipes Dataset

In [3]:
# This allows scrolling through all the columns. Useful for dataframes with too many columns.
pd.set_option('display.max_columns', 100)

In [4]:
recipes = pd.read_parquet('../recipes.parquet')

In [5]:
recipes.sample(2)

Unnamed: 0,RecipeId,Name,AuthorId,AuthorName,CookTime,PrepTime,TotalTime,DatePublished,Description,Images,RecipeCategory,Keywords,RecipeIngredientQuantities,RecipeIngredientParts,AggregatedRating,ReviewCount,Calories,FatContent,SaturatedFatContent,CholesterolContent,SodiumContent,CarbohydrateContent,FiberContent,SugarContent,ProteinContent,RecipeServings,RecipeYield,RecipeInstructions
247258,257334.0,Pumpkin Gooey Cake,37305,Karen..,PT50M,PT10M,PT1H,2007-10-06 00:40:00+00:00,Another yummy one from Turkey Hill just in tim...,[],Dessert,"[Vegetable, Very Low Carbs, Low Protein, Low C...","[1, 1, 8, 1, 1, 3, 1, 8, 1, 1, 1⁄2, 1⁄2]","[egg, butter, cream cheese, pumpkin, eggs, van...",5.0,1.0,568.5,28.7,15.2,132.8,468.8,74.1,0.8,56.2,5.9,,,"[Preheat oven to 350 degrees., Combine cake mi..."
11796,15008.0,Chocolate Eclair Cake II,11295,ThatJodiGirl,,PT10M,PT10M,2001-11-29 09:52:00+00:00,Make and share this Chocolate Eclair Cake II r...,[],Dessert,"[Kid Friendly, Potluck, < 15 Mins]","[2, 3, 1, 1, 1⁄4, 1⁄3, 1, 2, 1]","[instant vanilla pudding, milk, chocolate-cove...",3.0,1.0,512.6,22.0,14.6,17.2,438.4,76.5,2.4,60.0,6.1,,,"[In a large bowl, combine pudding mix and 3 cu..."


In [6]:
recipes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 522517 entries, 0 to 522516
Data columns (total 28 columns):
 #   Column                      Non-Null Count   Dtype              
---  ------                      --------------   -----              
 0   RecipeId                    522517 non-null  float64            
 1   Name                        522517 non-null  object             
 2   AuthorId                    522517 non-null  int32              
 3   AuthorName                  522517 non-null  object             
 4   CookTime                    439972 non-null  object             
 5   PrepTime                    522517 non-null  object             
 6   TotalTime                   522517 non-null  object             
 7   DatePublished               522517 non-null  datetime64[ns, UTC]
 8   Description                 522512 non-null  object             
 9   Images                      522516 non-null  object             
 10  RecipeCategory              521766 non-null 

In [7]:
recipes.describe()

Unnamed: 0,RecipeId,AuthorId,AggregatedRating,ReviewCount,Calories,FatContent,SaturatedFatContent,CholesterolContent,SodiumContent,CarbohydrateContent,FiberContent,SugarContent,ProteinContent,RecipeServings
count,522517.0,522517.0,269294.0,275028.0,522517.0,522517.0,522517.0,522517.0,522517.0,522517.0,522517.0,522517.0,522517.0,339606.0
mean,271821.43697,45725850.0,4.632014,5.227784,484.43858,24.614922,9.559457,86.487003,767.2639,49.089092,3.843242,21.878254,17.46951,8.606191
std,155495.878422,292971400.0,0.641934,20.381347,1397.116649,111.485798,46.622621,301.987009,4203.621,180.822062,8.603163,142.620191,40.128837,114.319809
min,38.0,27.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
25%,137206.0,69474.0,4.5,1.0,174.2,5.6,1.5,3.8,123.3,12.8,0.8,2.5,3.5,4.0
50%,271758.0,238937.0,5.0,2.0,317.1,13.8,4.7,42.6,353.3,28.2,2.2,6.4,9.1,6.0
75%,406145.0,565828.0,5.0,4.0,529.1,27.4,10.8,107.9,792.2,51.1,4.6,17.9,25.0,8.0
max,541383.0,2002886000.0,5.0,3063.0,612854.6,64368.1,26740.6,130456.4,1246921.0,108294.6,3012.0,90682.3,18396.2,32767.0


#### Adding recipe urls to the dataframe
We will first reconstruct the recipe urls from the original recipes dataset. 
* We can use these urls to check recipe data recorded in the dataset and the actual info on the respective recipe webpages.
* We also use these links to scrape food.com in order to upgrade the ingredients (currently ongoing in another notebook).

In [8]:
recipes['url']= recipes['Name'].apply(lambda x: x.replace(' ','-')+'-')
recipes['url']

0                        Low-Fat-Berry-Blue-Frozen-Dessert-
1                                                  Biryani-
2                                            Best-Lemonade-
3                           Carina's-Tofu-Vegetable-Kebabs-
4                                             Cabbage-Soup-
                                ...                        
522512                      Meg's-Fresh-Ginger-Gingerbread-
522513    Roast-Prime-Rib-au-Poivre-with-Mixed-Peppercorns-
522514                               Kirshwasser-Ice-Cream-
522515            Quick-&-Easy-Asian-Cucumber-Salmon-Rolls-
522516                             Spicy-Baked-Scotch-Eggs-
Name: url, Length: 522517, dtype: object

In [9]:
recipes['url'] = recipes[['url', 'RecipeId']].apply(lambda x: 'https://www.food.com/recipe/' + x['url'] + str(int(x['RecipeId'])), axis=1)
recipes['url']

0         https://www.food.com/recipe/Low-Fat-Berry-Blue...
1                    https://www.food.com/recipe/Biryani-39
2              https://www.food.com/recipe/Best-Lemonade-40
3         https://www.food.com/recipe/Carina's-Tofu-Vege...
4               https://www.food.com/recipe/Cabbage-Soup-42
                                ...                        
522512    https://www.food.com/recipe/Meg's-Fresh-Ginger...
522513    https://www.food.com/recipe/Roast-Prime-Rib-au...
522514    https://www.food.com/recipe/Kirshwasser-Ice-Cr...
522515    https://www.food.com/recipe/Quick-&-Easy-Asian...
522516    https://www.food.com/recipe/Spicy-Baked-Scotch...
Name: url, Length: 522517, dtype: object

In [10]:
#recipes.to_csv('recipes_with_urls.pkl')

In [11]:
#recipes = pd.read_parquet('../recipes_with_urls.parquet')

In [12]:
recipes.sample(2)

Unnamed: 0,RecipeId,Name,AuthorId,AuthorName,CookTime,PrepTime,TotalTime,DatePublished,Description,Images,RecipeCategory,Keywords,RecipeIngredientQuantities,RecipeIngredientParts,AggregatedRating,ReviewCount,Calories,FatContent,SaturatedFatContent,CholesterolContent,SodiumContent,CarbohydrateContent,FiberContent,SugarContent,ProteinContent,RecipeServings,RecipeYield,RecipeInstructions,url
202525,211333.0,Birds’-Nest Pudding (Little House),69587,Sascha,PT1H,PT15M,PT1H15M,2007-02-13 16:24:00+00:00,"This is a traditional apple dessert, where the...",[],Dessert,"[Apple, Fruit, Low Protein, Kid Friendly, < 4 ...","[1⁄2, 6, 1, 1⁄2, 3, 1, 1, 1, 1, 1⁄2, 1⁄2, 1⁄2, 1]","[butter, tart apples, brown sugar, ground nutm...",,,666.5,34.0,20.1,219.3,324.2,85.3,3.9,61.9,8.6,6.0,,[Butter a baking dish (2-quart). Peel and cor...,https://www.food.com/recipe/Birds’-Nest-Puddin...
390096,404130.0,Bailey's Irish Cream Chocolate Chip Cookies,1226076,Alexa,PT10M,PT10M,PT20M,2009-12-17 21:30:00+00:00,I just thought of eating cookies with Bailey's...,[https://img.sndimg.com/food/image/upload/w_55...,Drop Cookies,"[Dessert, Cookie & Brownie, < 30 Mins, For Lar...","[1⁄2, 1⁄2, 1⁄2, 1, 1, 1⁄2, 2 1⁄4, 1⁄2, 1⁄2, 1]","[butter, granulated sugar, brown sugar, egg, v...",2.0,2.0,100.9,4.2,2.5,12.7,71.8,15.4,0.4,8.3,1.1,36.0,36 cookies,"[Cream butter, sugars and egg until fluffy., A...",https://www.food.com/recipe/Bailey's-Irish-Cre...


In [13]:
recipes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 522517 entries, 0 to 522516
Data columns (total 29 columns):
 #   Column                      Non-Null Count   Dtype              
---  ------                      --------------   -----              
 0   RecipeId                    522517 non-null  float64            
 1   Name                        522517 non-null  object             
 2   AuthorId                    522517 non-null  int32              
 3   AuthorName                  522517 non-null  object             
 4   CookTime                    439972 non-null  object             
 5   PrepTime                    522517 non-null  object             
 6   TotalTime                   522517 non-null  object             
 7   DatePublished               522517 non-null  datetime64[ns, UTC]
 8   Description                 522512 non-null  object             
 9   Images                      522516 non-null  object             
 10  RecipeCategory              521766 non-null 

In [14]:
recipes.describe()

Unnamed: 0,RecipeId,AuthorId,AggregatedRating,ReviewCount,Calories,FatContent,SaturatedFatContent,CholesterolContent,SodiumContent,CarbohydrateContent,FiberContent,SugarContent,ProteinContent,RecipeServings
count,522517.0,522517.0,269294.0,275028.0,522517.0,522517.0,522517.0,522517.0,522517.0,522517.0,522517.0,522517.0,522517.0,339606.0
mean,271821.43697,45725850.0,4.632014,5.227784,484.43858,24.614922,9.559457,86.487003,767.2639,49.089092,3.843242,21.878254,17.46951,8.606191
std,155495.878422,292971400.0,0.641934,20.381347,1397.116649,111.485798,46.622621,301.987009,4203.621,180.822062,8.603163,142.620191,40.128837,114.319809
min,38.0,27.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
25%,137206.0,69474.0,4.5,1.0,174.2,5.6,1.5,3.8,123.3,12.8,0.8,2.5,3.5,4.0
50%,271758.0,238937.0,5.0,2.0,317.1,13.8,4.7,42.6,353.3,28.2,2.2,6.4,9.1,6.0
75%,406145.0,565828.0,5.0,4.0,529.1,27.4,10.8,107.9,792.2,51.1,4.6,17.9,25.0,8.0
max,541383.0,2002886000.0,5.0,3063.0,612854.6,64368.1,26740.6,130456.4,1246921.0,108294.6,3012.0,90682.3,18396.2,32767.0


In [15]:
recipes.isna().sum()

RecipeId                           0
Name                               0
AuthorId                           0
AuthorName                         0
CookTime                       82545
PrepTime                           0
TotalTime                          0
DatePublished                      0
Description                        5
Images                             1
RecipeCategory                   751
Keywords                           0
RecipeIngredientQuantities         0
RecipeIngredientParts              0
AggregatedRating              253223
ReviewCount                   247489
Calories                           0
FatContent                         0
SaturatedFatContent                0
CholesterolContent                 0
SodiumContent                      0
CarbohydrateContent                0
FiberContent                       0
SugarContent                       0
ProteinContent                     0
RecipeServings                182911
RecipeYield                   348071
R

#### Dropping Reduntant Columns <a class ='author' id='part-0'></a>
`TotalTime` is the sum of `CookTime` and `PrepTime`. Plus, the latter two seem to be missing from the recipes on the webpages. I'll just drop `CookTime` and `PrepTime`.

In [16]:
recipes.drop(['CookTime', 'PrepTime'], axis=1,inplace=True)

In [17]:
recipes.sample(2)

Unnamed: 0,RecipeId,Name,AuthorId,AuthorName,TotalTime,DatePublished,Description,Images,RecipeCategory,Keywords,RecipeIngredientQuantities,RecipeIngredientParts,AggregatedRating,ReviewCount,Calories,FatContent,SaturatedFatContent,CholesterolContent,SodiumContent,CarbohydrateContent,FiberContent,SugarContent,ProteinContent,RecipeServings,RecipeYield,RecipeInstructions,url
332868,345374.0,Peppermint Cookies,346383,senseicheryl,PT18M,2008-12-29 22:12:00+00:00,I received an email this morning from the www....,[],Drop Cookies,"[Dessert, Cookie & Brownie, Grains, Potluck, C...","[1 1⁄4, 1 1⁄3, 2 1⁄2, 1⁄4, 3⁄4, 2, 3⁄4, 1]","[granulated sugar, all-purpose flour, salt, bu...",,,1186.0,50.4,30.4,263.0,569.6,168.9,2.8,89.7,15.4,,3 dozen cookies,"[Preheat oven to 350 degrees., Grind or crush ...",https://www.food.com/recipe/Peppermint-Cookies...
94259,99662.0,Taste of Thai Beef Salad - Yam Nuea,64642,Molly53,PT25M,2004-09-13 19:59:00+00:00,From the Taste of Thai company. This recipe fr...,[https://img.sndimg.com/food/image/upload/w_55...,Steak,"[Greens, Onions, Vegetable, Meat, Thai, Asian,...","[1, 1⁄4, 2, 2, 3, 1, 1, 2, 3, 1]","[fresh lime juice, fish sauce, red chili peppe...",5.0,8.0,156.6,6.5,2.6,51.4,532.3,7.0,1.6,2.5,18.0,6.0,,"[Grill or broil flank steak., Slice into thin ...",https://www.food.com/recipe/Taste-of-Thai-Beef...


`AuthorName` has the numeric equivalent of `AuthorId`, so we drop it. Similar for `Name`, which has the equivalent of `RecipeId`. We will eventually also drop `url` but for now we keep it as it serves us.

In [18]:
recipes.drop(['Name', 'AuthorName'], axis=1,inplace=True)

In [19]:
recipes.sample(3)

Unnamed: 0,RecipeId,AuthorId,TotalTime,DatePublished,Description,Images,RecipeCategory,Keywords,RecipeIngredientQuantities,RecipeIngredientParts,AggregatedRating,ReviewCount,Calories,FatContent,SaturatedFatContent,CholesterolContent,SodiumContent,CarbohydrateContent,FiberContent,SugarContent,ProteinContent,RecipeServings,RecipeYield,RecipeInstructions,url
270087,280857.0,169430,PT1H15M,2008-01-22 23:50:00+00:00,Make and share this Apricot-Pumpkin Bread Pudd...,[https://img.sndimg.com/food/image/upload/w_55...,Dessert,"[Vegetable, Winter, Beginner Cook, < 4 Hours, ...","[None, 4, 1⁄3, 3⁄4, 2, 1, 2, 1⁄2, None]","[Egg Beaters egg substitute, nonfat milk, cann...",5.0,2.0,70.8,0.7,0.2,1.1,160.0,13.7,2.1,6.9,3.5,9.0,,"[Preheat oven to 350 degree F., Coat a 2-quart...",https://www.food.com/recipe/Apricot-Pumpkin-Br...
196631,205259.0,151973,PT17M,2007-01-13 21:42:00+00:00,Make and share this Tuscan Simmer Chicken reci...,[],Chicken,"[Poultry, Meat, < 30 Mins, Beginner Cook, Easy]","[1, 1, 1⁄4, 1, 1, None]","[olive oil, boneless skinless chicken, chicken...",,,314.8,9.7,1.7,132.0,243.4,0.1,0.0,0.1,53.1,,,"[Brown chicken in olive oil., Pour chicken bro...",https://www.food.com/recipe/Tuscan-Simmer-Chic...
241541,251423.0,89831,PT55M,2007-09-06 15:57:00+00:00,This is an easy recipe that is delicious serve...,[],Pie,"[Dessert, Fruit, Low Protein, Kid Friendly, Ch...","[6, 1, 1⁄2 - 3⁄4, 3, 1⁄2, 1, 3⁄4, 1, None]","[sugar, flour, cinnamon, salt, vanilla, ice cr...",,,353.0,18.3,9.1,40.8,173.2,46.1,2.8,30.9,3.4,,,"[Set oven to 400°F., Set the pie pastry into t...",https://www.food.com/recipe/Easy-Peach-Cream-P...


`DatePublished` has too much info in it. Instead we turn it into `YearPublished`, `MonthPublished` and `DayPublished`. 

We can later on use these to derive insights on what days, months and years havae the highest rate of published recipes, and so on.

In [20]:
recipes['DatePublished'].apply(lambda x: x.hour)

0         21
1         13
2         19
3         14
4          6
          ..
522512    15
522513    15
522514    15
522515    22
522516    22
Name: DatePublished, Length: 522517, dtype: int64

In [21]:
recipes['YearPublished'] = recipes['DatePublished'].apply(lambda x: x.year)
recipes['MonthPublished'] = recipes['DatePublished'].apply(lambda x: x.month)
recipes['DayPublished'] = recipes['DatePublished'].apply(lambda x: x.day)
recipes['HourPublished'] = recipes['DatePublished'].apply(lambda x: x.hour)

In [22]:
recipes.drop(['DatePublished'],axis=1,inplace=True)

In [23]:
recipes.sample(3)

Unnamed: 0,RecipeId,AuthorId,TotalTime,Description,Images,RecipeCategory,Keywords,RecipeIngredientQuantities,RecipeIngredientParts,AggregatedRating,ReviewCount,Calories,FatContent,SaturatedFatContent,CholesterolContent,SodiumContent,CarbohydrateContent,FiberContent,SugarContent,ProteinContent,RecipeServings,RecipeYield,RecipeInstructions,url,YearPublished,MonthPublished,DayPublished,HourPublished
19969,23374.0,28846,PT1H15M,Make and share this Snicker Cake recipe from F...,[],Dessert,"[Fruit, Nuts, Kid Friendly, Weeknight, Oven, <...","[1, 3, 1⁄3, 1 1⁄3, 1⁄2, 1, 1⁄3, 1, 1]","[eggs, water, butter, milk, walnuts]",5.0,2.0,581.9,33.7,12.2,76.1,428.6,70.5,3.3,50.9,7.2,,1 nine x thirteen inch pan (depending on servi...,"[Preheat oven to 350°F., Grease and flour a 9 ...",https://www.food.com/recipe/Snicker-Cake-23374,2002,3,28,11
285451,296632.0,679438,PT40M,I was wanting some real meatballs and started ...,[],Poultry,"[Meat, Egg Free, Free Of..., Savory, Brunch, <...","[1, 1, 1⁄2, 1⁄2, 1, 1, 1, 1⁄2]","[lean ground turkey, caraway seed, ground cumi...",4.0,1.0,71.4,3.6,0.9,31.3,26.8,0.8,0.3,0.1,9.1,,10-20 meatballs,"[Preheat the oven to 350 degrees., In a bowl, ...",https://www.food.com/recipe/Turkey-Meatballs-2...,2008,4,5,18
137116,143886.0,258213,PT25M,"I was looking for some Halloween recipes, and ...",[],Cheese,"[Vegetable, European, Low Protein, Low Cholest...","[1, 6, 1⁄2, 1, 1, 125, None, None, 2, 1, 1, No...","[chopped tomatoes, tomatoes, onion, garlic clo...",,,158.0,8.6,5.1,20.6,459.2,19.4,4.9,10.8,4.2,3.0,,"[Put your tomatoes, onion and garlic into a bl...",https://www.food.com/recipe/Gooby-Tomato-Soup-...,2005,11,5,6


Now let's turn the `TotalTime` to numbers (in minutes). At the moment the values of this column look like one of the following: 'PT3H30M', 'PT3H', 'PT20M'

In [24]:
re.findall('\dH|\d*M','PT3H30M')

['3H', '30M']

In [25]:
[string.replace('H','') for string in re.findall('\dH|\d*M','PT3H30M')]

['3', '30M']

In [26]:
result = [int(x.replace('H', '')) * 60 if 'H' in x else int(x.replace('M', '')) for x in re.findall('\d+H|\d+M', 'PT3H30M')]
result

[180, 30]

In [27]:
recipes['TotalMinutes'] = recipes['TotalTime'].apply(lambda string: re.findall('\dH|\d*M', string))
recipes['TotalMinutes'] = recipes['TotalMinutes'].apply(lambda timelist: [int(x.replace('H', '')) * 60 if 'H' in x else int(x.replace('M', '')) for x in timelist])
recipes['TotalMinutes'] = recipes['TotalMinutes'].apply(lambda timelist: sum(timelist))
recipes['TotalMinutes']

0         285
1         265
2          35
3         260
4          50
         ... 
522512     95
522513    210
522514    240
522515     15
522516     40
Name: TotalMinutes, Length: 522517, dtype: int64

In [28]:
recipes.drop(['TotalTime'],axis=1,inplace=True)

In [29]:
recipes.sample(2)

Unnamed: 0,RecipeId,AuthorId,Description,Images,RecipeCategory,Keywords,RecipeIngredientQuantities,RecipeIngredientParts,AggregatedRating,ReviewCount,Calories,FatContent,SaturatedFatContent,CholesterolContent,SodiumContent,CarbohydrateContent,FiberContent,SugarContent,ProteinContent,RecipeServings,RecipeYield,RecipeInstructions,url,YearPublished,MonthPublished,DayPublished,HourPublished,TotalMinutes
248510,258622.0,431813,Make and share this Roquefort Cream Quiche rec...,[https://img.sndimg.com/food/image/upload/w_55...,One Dish Meal,[< 60 Mins],"[1, 6, 3, 2, 3, 1 1⁄4, 1, 1, 2, None]","[cream cheese, Roquefort cheese, butter, eggs,...",4.5,2.0,505.2,43.2,22.1,198.0,609.8,18.3,1.2,0.4,11.8,6.0,,"[Preheat oven to 265°F., In a bowl, blend crea...",https://www.food.com/recipe/Roquefort-Cream-Qu...,2007,10,12,1,35
303693,315458.0,250427,This is my favorite way to eat cucumbers fresh...,[],Vegetable,"[Low Protein, Low Cholesterol, Summer, Beginne...","[1⁄2, 1⁄2, 1⁄4, 1⁄8, 2, 1]","[mayonnaise, salt, dried dill weed, pepper, cu...",5.0,1.0,82.6,5.7,0.9,4.4,287.6,8.1,0.6,2.9,0.8,7.0,,[Mix all ingredients. Cover and refrigerate a...,https://www.food.com/recipe/Creamy-Cucumber-Sa...,2008,7,24,2,250


We won't be using `Images` anywhere in our projects, so I'll remove the column. (For now I'll keep `url` because it helps double checking recipe entries using the actual recipe url; I'll later drop that column too when we get to do ML.)

In [30]:
recipes.drop(['Images'],axis=1,inplace=True)

In [31]:
recipes.sample(2)

Unnamed: 0,RecipeId,AuthorId,Description,RecipeCategory,Keywords,RecipeIngredientQuantities,RecipeIngredientParts,AggregatedRating,ReviewCount,Calories,FatContent,SaturatedFatContent,CholesterolContent,SodiumContent,CarbohydrateContent,FiberContent,SugarContent,ProteinContent,RecipeServings,RecipeYield,RecipeInstructions,url,YearPublished,MonthPublished,DayPublished,HourPublished,TotalMinutes
162992,170665.0,269521,A traditional holiday bread that looks like a ...,Breads,"[Norwegian, Scandinavian, European]","[8, 1⁄2, 1⁄2, 1, 1, 4]","[mashed potatoes, heavy cream, butter, butter,...",4.0,1.0,573.7,20.3,12.2,58.9,733.0,86.5,4.8,4.9,10.9,8.0,,[Peel and cook potatoes in large pot of water....,https://www.food.com/recipe/Norwegian-Potato-L...,2006,5,30,19,160
22134,25583.0,28442,Something strange is happening at the breakfas...,Breakfast,"[Apple, Fruit, Low Protein, Low Cholesterol, H...","[1, 1 1⁄2, 1 1⁄2, 1 1⁄2, None]","[golden delicious apples, granny smith apple, ...",4.5,5.0,142.6,3.0,1.8,7.6,27.4,31.4,2.5,26.8,0.5,2.0,,"[Core and slice apple., Mix apple with honey, ...",https://www.food.com/recipe/Hot-Buttered-Apple...,2002,4,18,10,7


In [32]:
recipes.isna().sum()

RecipeId                           0
AuthorId                           0
Description                        5
RecipeCategory                   751
Keywords                           0
RecipeIngredientQuantities         0
RecipeIngredientParts              0
AggregatedRating              253223
ReviewCount                   247489
Calories                           0
FatContent                         0
SaturatedFatContent                0
CholesterolContent                 0
SodiumContent                      0
CarbohydrateContent                0
FiberContent                       0
SugarContent                       0
ProteinContent                     0
RecipeServings                182911
RecipeYield                   348071
RecipeInstructions                 0
url                                0
YearPublished                      0
MonthPublished                     0
DayPublished                       0
HourPublished                      0
TotalMinutes                       0
d

### Dealing with categories

In [33]:
recipes['RecipeCategory'].unique(), recipes['RecipeCategory'].nunique()

(array(['Frozen Desserts', 'Chicken Breast', 'Beverages', 'Soy/Tofu',
        'Vegetable', 'Pie', 'Chicken', 'Dessert', 'Southwestern U.S.',
        'Sauces', 'Stew', 'Black Beans', '< 60 Mins', 'Lactose Free',
        'Weeknight', 'Yeast Breads', 'Whole Chicken', 'High Protein',
        'Cheesecake', 'Free Of...', 'High In...', 'Brazilian', 'Breakfast',
        'Breads', 'Bar Cookie', 'Brown Rice', 'Oranges', 'Pork',
        'Low Protein', 'Asian', 'Potato', 'Cheese', 'Halibut', 'Meat',
        'Lamb/Sheep', 'Very Low Carbs', 'Spaghetti', 'Scones',
        'Drop Cookies', 'Lunch/Snacks', 'Beans', 'Punch Beverage',
        'Pineapple', 'Low Cholesterol', '< 30 Mins', 'Quick Breads',
        'Sourdough Breads', 'Curries', 'Chicken Livers', 'Coconut',
        'Savory Pies', 'Poultry', 'Steak', 'Healthy', 'Lobster', 'Rice',
        'Apple', 'Broil/Grill', 'Spreads', 'Crab', 'Jellies', 'Pears',
        'Chowders', 'Cauliflower', 'Candy', 'Chutneys', 'White Rice',
        'Tex Mex', 'Bass',

We have 311 categories. Turning these into numerical values will add many dimensions to our dataframe. We can reduce these catgeories into some more major categories. Here's a suggestion:


**Desserts**: Frozen Desserts, Cheesecake, Pie, Dessert, Cheesecake, Gelatin, Candy, Jellies, Tarts, Sweet, Chocolate Chip Cookies, Bread Pudding, Lemon Cake, Key Lime Pie, Coconut Cream Pie, Ice Cream, Fruit Desserts, Apple Pie, Pumpkin, Coconut Cream Pie.

**Chicken**: Chicken Breast, Chicken, Chicken Thigh & Leg, Chicken Livers, Whole Chicken, Roast Chicken, Chicken Crock Pot.

**Beverages**: Beverages, Punch Beverage, Smoothies, Shakes.

**Vegetarian/Vegan**: Soy/Tofu, Vegetable, Vegan.

**Sauces/Condiments**: Sauces, Salad Dressings, Spreads, Chutneys.

**Meat**: Pork, Lamb/Sheep, Meat, Meatballs, Beef Organ Meats, Steak, Ground Meat, Roast Beef, Ham, Ground Beef, Ground Turkey.

**Seafood**: Halibut, Lobster, Crab, Crawfish, Bass, Tuna, Trout, Catfish, Squid, Mahi Mahi, Oysters, Salmon.

**International Cuisines**: Asian, Brazilian, Greek, German, Hungarian, Indonesian, Mexican, Dutch, Spanish, Russian, Thai, Cajun, Chinese, Turkish, Vietnamese, Lebanese, Moroccan, Korean, Polish, Scandinavian, African, Norwegian, Belgian, Australian, Scottish, Cuban, Portuguese, Hawaiian, Austrian, Egyptian, Filipino, Welsh, Czech, Iraqi, Pakistani, Chilean, Puerto Rican, Ecuadorean, Sudanese, Mongolian, Peruvian, Cambodian, Honduran, Sudanese, Mongolian, Peruvian.

**Side Dishes**: Potatoes, Rice, Grains, Pasta, Breads, Corn, Lentil, Yam/Sweet Potato, Greens, Collard Greens, Spinach, Chard, Artichoke, Mashed Potatoes.

**Breakfast/Brunch**: Breakfast, Breakfast Eggs, Brunch.

In [34]:
category_mapping = {
    'Frozen Desserts': 'Desserts',
    'Chicken Breast': 'Chicken',
    'Beverages': 'Beverages',
    'Soy/Tofu': 'Vegetarian/Vegan',
    'Vegetable': 'Vegetables',
    'Pie': 'Desserts',
    'Chicken': 'Chicken',
    'Dessert': 'Desserts',
    'Southwestern U.S.': 'Regional',
    'Sauces': 'Sauces/Condiments',
    'Stew': 'Main Dish',
    'Black Beans': 'Beans/Legumes',
    '< 60 Mins': 'Quick and Easy',
    'Lactose Free': 'Special Dietary Needs',
    'Weeknight': 'Quick and Easy',
    'Yeast Breads': 'Baked Goods',
    'Whole Chicken': 'Chicken',
    'High Protein': 'Healthy',
    'Cheesecake': 'Desserts',
    'Free Of...': 'Special Dietary Needs',
    'High In...': 'Healthy',
    'Brazilian': 'International',
    'Breakfast': 'Breakfast/Brunch',
    'Breads': 'Baked Goods',
    'Bar Cookie': 'Desserts',
    'Brown Rice': 'Nuts/Seeds/Grains',
    'Oranges': 'Fruit',
    'Pork': 'Meat',
    'Low Protein': 'Special Dietary Needs',
    'Asian': 'International',
    'Potato': 'Side Dishes',
    'Cheese': 'Dairy',
    'Halibut': 'Seafood',
    'Meat': 'Meat',
    'Lamb/Sheep': 'Meat',
    'Very Low Carbs': 'Healthy',
    'Spaghetti': 'Pasta',
    'Scones': 'Breads',
    'Drop Cookies': 'Desserts',
    'Lunch/Snacks': 'Lunch',
    'Beans': 'Beans/Legumes',
    'Punch Beverage': 'Beverages',
    'Pineapple': 'Fruit',
    'Low Cholesterol': 'Healthy',
    '< 30 Mins': 'Quick and Easy',
    'Quick Breads': 'Baked Goods',
    'Sourdough Breads': 'Baked Goods',
    'Curries': 'International',
    'Chicken Livers': 'Chicken',
    'Coconut': 'Fruit',
    'Savory Pies': 'Main Dish',
    'Poultry': 'Chicken',
    'Steak': 'Meat',
    'Healthy': 'Healthy',
    'Lobster': 'Seafood',
    'Rice': 'Nuts/Seeds/Grains',
    'Apple': 'Fruit',
    'Broil/Grill': 'Cooking Methods',
    'Spreads': 'Sauces/Condiments',
    'Crab': 'Seafood',
    'Jellies': 'Sauces/Condiments',
    'Pears': 'Fruit',
    'Chowders': 'Soups',
    'Cauliflower': 'Vegetables',
    'Candy': 'Desserts',
    'Chutneys': 'Sauces/Condiments',
    'White Rice': 'Nuts/Seeds/Grains',
    'Tex Mex': 'Regional',
    'Bass': 'Seafood',
    'German': 'International',
    'Fruit': 'Fruit',
    'European': 'International',
    'Smoothies': 'Beverages',
    'Hungarian': 'International',
    'Manicotti': 'Pasta',
    'Onions': 'Vegetables',
    'New Zealand': 'International',
    'Chicken Thigh & Leg': 'Chicken',
    'Indonesian': 'International',
    'Greek': 'International',
    'Corn': 'Vegetables',
    'Lentil': 'Beans/Legumes',
    'Summer': 'Seasonal',
    'Long Grain Rice': 'Nuts/Seeds/Grains',
    'Southwest Asia (middle East)': 'International',
    'Spanish': 'International',
    'Dutch': 'International',
    'Gelatin': 'Desserts',
    'Tuna': 'Seafood',
    'Citrus': 'Fruit',
    'Berries': 'Fruit',
    'Peppers': 'Vegetables',
    'Salad Dressings': 'Sauces/Condiments',
    'Clear Soup': 'Soups',
    'Mexican': 'International',
    'Raspberries': 'Fruit',
    'Crawfish': 'Seafood',
    'Beef Organ Meats': 'Meat',
    'Strawberry': 'Fruit',
    'Shakes': 'Beverages',
    'Short Grain Rice': 'Nuts/Seeds/Grains',
    '< 15 Mins': 'Quick and Easy',
    'One Dish Meal': 'Main Dish',
    'Spicy': 'Flavor Profiles',
    'Thai': 'International',
    'Cajun': 'Regional',
    'Oven': 'Cooking Methods',
    'Microwave': 'Cooking Methods',
    'Russian': 'International',
    'Melons': 'Fruit',
    'Papaya': 'Fruit',
    'Veal': 'Meat',
    'No Cook': 'Quick and Easy',
    '< 4 Hours': 'Quick and Easy',
    None: 'Uncategorized',
    'Roast': 'Cooking Methods',
    'Potluck': 'Occasions',
    'Orange Roughy': 'Seafood',
    'Canadian': 'International',
    'Caribbean': 'International',
    'Mussels': 'Seafood',
    'Medium Grain Rice': 'Nuts/Seeds/Grains',
    'Japanese': 'International',
    'Penne': 'Pasta',
    'Easy': 'Quick and Easy',
    'Elk': 'Meat',
    'Colombian': 'International',
    'Gumbo': 'Soups',
    'Roast Beef': 'Meat',
    'Perch': 'Seafood',
    'Vietnamese': 'International',
    'Rabbit': 'Meat',
    'Christmas': 'Occasions',
    'Lebanese': 'International',
    'Turkish': 'International',
    'Kid Friendly': 'Family-Friendly',
    'Vegan': 'Vegetarian/Vegan',
    'For Large Groups': 'Occasions',
    'Whole Turkey': 'Poultry',
    'Chinese': 'International',
    'Grains': 'Nuts/Seeds/Grains',
    'Yam/Sweet Potato': 'Side Dishes',
    'Native American': 'Regional',
    'Meatloaf': 'Meat',
    'Winter': 'Seasonal',
    'Trout': 'Seafood',
    'African': 'International',
    'Ham': 'Meat',
    'Goose': 'Poultry',
    'Pasta Shells': 'Pasta',
    'Stocks': 'Soups',
    "St. Patrick's Day": 'Occasions',
    'Meatballs': 'Meat',
    'Whole Duck': 'Poultry',
    'Scandinavian': 'International',
    'Greens': 'Vegetables',
    'Catfish': 'Seafood',
    'Dehydrator': 'Cooking Methods',
    'Duck Breasts': 'Poultry',
    'Savory': 'Flavor Profiles',
    'Stir Fry': 'Main Dish',
    'Polish': 'International',
    'Spring': 'Seasonal',
    'Deer': 'Meat',
    'Wild Game': 'Meat',
    'Pheasant': 'Meat',
    'No Shell Fish': 'Seafood',
    'Collard Greens': 'Vegetables',
    'Tilapia': 'Seafood',
    'Quail': 'Poultry',
    'Refrigerator': 'Preservation',
    'Canning': 'Preservation',
    'Moroccan': 'International',
    'Pressure Cooker': 'Cooking Methods',
    'Squid': 'Seafood',
    'Korean': 'International',
    'Plums': 'Fruit',
    'Danish': 'International',
    'Creole': 'Regional',
    'Mahi Mahi': 'Seafood',
    'Tarts': 'Desserts',
    'Spinach': 'Vegetables',
    'Hawaiian': 'Regional',
    'Homeopathy/Remedies': 'Healthy',
    'Austrian': 'International',
    'Thanksgiving': 'Occasions',
    'Moose': 'Meat',
    'Bath/Beauty': 'Healthy',
    'Swedish': 'International',
    'High Fiber': 'Healthy',
    'Kosher': 'Special Dietary Needs',
    'Norwegian': 'International',
    'Household Cleaner': 'Household',
    'Ethiopian': 'International',
    'Belgian': 'International',
    'Australian': 'International',
    'Pennsylvania Dutch': 'Regional',
    'Bear': 'Meat',
    'Scottish': 'International',
    'Tempeh': 'Vegetarian/Vegan',
    'Cuban': 'International',
    'Turkey Breasts': 'Poultry',
    'Cantonese': 'International',
    'Tropical Fruits': 'Fruit',
    'Peanut Butter': 'Sauces/Condiments',
    'Szechuan': 'International',
    'Portuguese': 'International',
    'Summer Dip': 'Appetizers',
    'Costa Rican': 'International',
    'Duck': 'Poultry',
    'Sweet': 'Flavor Profiles',
    'Nuts': 'Nuts/Seeds/Grains',
    'Filipino': 'International',
    'Welsh': 'International',
    'Camping': 'Outdoor Cooking',
    'Pot Pie': 'Main Dish',
    'Polynesian': 'International',
    'Mango': 'Fruit',
    'Cherries': 'Fruit',
    'Egyptian': 'International',
    'Chard': 'Vegetables',
    'Lime': 'Flavor Profiles',
    'Lemon': 'Flavor Profiles',
    'Brunch': 'Breakfast/Brunch',
    'Toddler Friendly': 'Family-Friendly',
    'Kiwifruit': 'Fruit',
    'Whitefish': 'Seafood',
    'South American': 'International',
    'Malaysian': 'International',
    'Octopus': 'Seafood',
    'Nigerian': 'International',
    'Mixer': 'Cooking Methods',
    'Venezuelan': 'International',
    'Halloween': 'Occasions',
    'Stove Top': 'Cooking Methods',
    'Bread Machine': 'Baked Goods',
    'French Toast': 'Breakfast/Brunch',
    'French Canadian': 'Regional',
    'Sauerkraut': 'Vegetables',
    'West Virginia': 'Regional',
    'Cooker': 'Cooking Methods',
    'Jewish': 'International',
    'Leek': 'Vegetables',
    'Asian Greens': 'Vegetables',
    'Buffalo': 'Meat',
    'Smoothie': 'Beverages',
    'Indian': 'International',
    'Cooking For One': 'Quick and Easy',
    'Kansas': 'Regional',
    'Carrot': 'Vegetables',
    'Australian And New Zealand': 'International',
    'Canadian Bacon': 'Meat',
    'Zucchini': 'Vegetables',
    'Flounder': 'Seafood',
    'Fijian': 'International',
    'Winter Squash': 'Vegetables',
    'Israeli': 'International',
    'Ethnic': 'International',
    'Eggplant': 'Vegetables',
    'Afghan': 'International',
    'Barbecue': 'Cooking Methods',
    'Vegetarian': 'Vegetarian/Vegan',
    'Main Dish': 'Main Dish',
    'Missouri': 'Regional',
    'Salmon': 'Seafood',
    'Pesto': 'Sauces/Condiments',
    'Braised': 'Cooking Methods',
    'Czech': 'International',
    'Salads': 'Salads',
    'Soul Food': 'Regional',
    'Swiss': 'International',
    'Jamaican': 'International',
    'Easter': 'Occasions',
    'Tex-Mex': 'Regional',
    'Northeastern United States': 'Regional',
    'Swiss Cheese': 'Dairy',
    'Pacific Northwestern': 'Regional',
    'Czechoslovakian': 'International',
    'Meals': 'Main Dish',
    'Microwave Appetizers': 'Appetizers',
    'Northwestern United States': 'Regional',
    'Moravian': 'International',
    'Special Occasion': 'Occasions',
    'California': 'Regional',
    'Mandarin Oranges': 'Fruit',
    'Pennsylvania': 'Regional',
    'Brazil': 'International',
    'Thai Sweet Rice': 'Nuts/Seeds/Grains',
    'Freezer': 'Preservation',
    'Cornish Hens': 'Poultry',
    'Arizona': 'Regional',
    'Pacific Islands': 'International',
    'Rhode Island': 'Regional',
    'Georgian': 'International',
    'Pork Tenderloin': 'Meat',
    'No-Cook': 'Quick and Easy',
    'Basque': 'International',
    'Thanksgiving Leftovers': 'Occasions',
    'Avocado': 'Fruit',
    'Alcoholic': 'Beverages',
    'Hamburger': 'Meat',
    'Michigan': 'Regional',
    'Red Beans And Rice': 'Beans/Legumes',
    'Pan Grilling': 'Cooking Methods',
    'Deep Fryer': 'Cooking Methods',
    'Muffins': 'Baked Goods',
    'Pan Frying': 'Cooking Methods',
    'English': 'International',
    'Pressure Cookers': 'Cooking Methods',
    'High Calcium': 'Healthy',
    'Low Saturated Fat': 'Healthy',
    'Game': 'Meat',
    'Gluten-Free': 'Special Dietary Needs',
    'Wheat': 'Nuts/Seeds/Grains',
    'Finnish': 'International',
    'New England': 'Regional',
    'Swedish Meatballs': 'Meat',
    'Algerian': 'International',
    'Pacific Rim': 'International',
    'Thermomix': 'Cooking Methods',
    'Nuts/Seeds': 'Nuts/Seeds/Grains',
    'Vegetables': 'Vegetables',
    'Apple Pie': 'Desserts',
    'Jerky': 'Meat',
    'Condiments, Etc.': 'Sauces/Condiments',
    'New York': 'Regional',
    'Colombia': 'International',
    'Chicago Style': 'Regional',
    'Mediterranean': 'International',
    'Irish': 'International',
    'Pressure Canning': 'Preservation',
    'Middle Eastern': 'International',
    'Plants': 'Vegetarian/Vegan',
    'Southwestern': 'Regional',
    'Jam': 'Sauces/Condiments',
    'Peaches': 'Fruit',
    'Egg-Free': 'Special Dietary Needs',
    'Eastern European': 'International',
    'Soft Drinks': 'Beverages',
    'Picnics': 'Outdoor Cooking',
    'Kiwi': 'Fruit',
    'Ice Cream': 'Desserts',
    'Turkey': 'Poultry',
    'Cherry': 'Fruit',
    'Vegetable Casserole': 'Vegetables',
    'Goat': 'Meat',
    'Dressings': 'Sauces/Condiments',
    'Cabbage': 'Vegetables',
    'Romaine': 'Vegetables',
    'Low Fat': 'Healthy',
    'Sausage': 'Meat',
    'Roasts': 'Meat',
    'Casseroles': 'Main Dish',
    'North American': 'International',
    'High Potassium': 'Healthy',
    'Soups': 'Soups',
    'Main Dishes': 'Main Dish',
    'Crisps': 'Desserts',
    'French Canadian Tourtiere': 'Regional',
    'Irish Soda Bread': 'Baked Goods',
    'Loaves': 'Baked Goods',
    'Crepes': 'Breakfast/Brunch',
    'Potatoes': 'Vegetables',
    'Rhubarb': 'Vegetables',
    'Salmon Lox': 'Seafood',
    'Apricot': 'Fruit',
    'Bbq': 'Cooking Methods',
    'Herb And Spice Mixes': 'Sauces/Condiments',
    'Low Calorie': 'Healthy',
    'Salmon Fillets': 'Seafood',
    'Apricots': 'Fruit',
    'South Carolina': 'Regional',
    'Shrimp': 'Seafood',
    'Chinese Five-Spice': 'Spices/Seasonings',
    'Grains/Cereals': 'Nuts/Seeds/Grains',
    'Honduran': 'International',
    'Chilean': 'International',
    'Flat Shell Fish': 'Seafood',
    'Portuguese Sausage': 'Meat',
    'Cinnamon': 'Spices/Seasonings',
    'Swiss Chard': 'Vegetables',
    'Bulgarian': 'International',
    'Champagne': 'Beverages',
    'Mashed Potatoes': 'Side Dishes',
    'Vermont': 'Regional',
    'Finger Food': 'Appetizers',
    'Side Dish': 'Side Dishes',
    'Steamed': 'Cooking Methods',
    'Raspberry': 'Fruit',
    'Berries And Currants': 'Fruit',
    'Kentucky': 'Regional',
    'Ethnic Foods': 'International',
    'New Hampshire': 'Regional',
    'Alfredo': 'Pasta',
    'Whole Chicken': 'Poultry',
    'North Dakota': 'Regional',
    'Gelatin Desserts': 'Desserts',
    'Iowa': 'Regional',
    'Spreads': 'Sauces/Condiments',
    'Dried Beans': 'Beans/Legumes',
    'Fruit': 'Fruit',
    'Oklahoma': 'Regional',
    'Pennsylvania Dutch Cooking': 'Regional',
    'Broccoli': 'Vegetables',
    'California Style': 'Regional',
    'Fish': 'Seafood',
    'Crab': 'Seafood',
    'Vegetarian/Vegan': 'Vegetarian/Vegan',
    'Brisket': 'Meat',
    'Jewish Holidays': 'Occasions',
    'Mussels/Squid': 'Seafood',
    'Wok': 'Cooking Methods',
    'St. Louis': 'Regional',
    'Breads': 'Baked Goods',
    'Polenta': 'Nuts/Seeds/Grains',
    'Rice Cooker': 'Cooking Methods',
    'Arizona Style': 'Regional',
    'Cucumber': 'Vegetables',
    'Pineapple': 'Fruit',
    'Cheese': 'Dairy',
    'Omelets': 'Breakfast/Brunch',
    'Cantaloupe': 'Fruit',
    'Pancakes And Waffles': 'Breakfast/Brunch',
    'Danish Pastry': 'Baked Goods',
    'Cherry Tomatoes': 'Vegetables',
    'Freshwater Fish': 'Seafood',
    'Lunch/Snacks': 'Lunch/Snacks',
    'Cornmeal': 'Nuts/Seeds/Grains',
    'Squash': 'Vegetables',
    'Meat': 'Meat',
    'Polynesian/Hawaiian': 'Regional',
    'High Protein': 'Healthy',
    'Chutneys': 'Sauces/Condiments',
    'Southwestern United States': 'Regional',
    'Wine': 'Beverages',
    'Smoothies': 'Beverages',
    'South Dakota': 'Regional',
    'High Fiber Cereals': 'Nuts/Seeds/Grains',
    'Chowders': 'Soups',
    'Chiles': 'Spices/Seasonings',
    'Lamb': 'Meat',
    'Mangoes': 'Fruit',
    'Belgian Waffle': 'Breakfast/Brunch',
    'Jamaican Patties': 'International',
    'Mozzarella': 'Dairy',
    'Fish Fry': 'Main Dish',
    'Swiss Fondue': 'International',
    'Jellies': 'Sauces/Condiments',
    'Southwest': 'Regional',
    'Lettuce': 'Vegetables',
    'Poppy Seeds': 'Nuts/Seeds/Grains',
    'Hummus': 'Sauces/Condiments',
    'Icing/Frosting': 'Desserts',
    'Lobster': 'Seafood',
    'St. Patrick\'s Day': 'Occasions',
    'Food Processor/Blender': 'Cooking Methods',
    'Hamburgers': 'Meat',
    'Lemon Juice': 'Flavor Profiles',
    'Valentine\'s Day': 'Occasions',
    'Cranberries': 'Fruit',
    'North Carolina': 'Regional',
    'Baked Goods': 'Baked Goods',
    'Poultry': 'Poultry',
    'Root Vegetables': 'Vegetables',
    'Tamales': 'International',
    'Vegetarian And Vegan': 'Vegetarian/Vegan',
    'Oats': 'Nuts/Seeds/Grains',
    'Brazilian': 'International',
    'High Vitamin C': 'Healthy',
    'Southern': 'Regional',
    'Hawaiian': 'International',
    'Kiwi Fruit': 'Fruit',
    'Ice Cream Maker': 'Cooking Methods',
    'South': 'Regional',
    'Creole/Cajun': 'Regional',
    'Pork': 'Meat',
    'American': 'International',
    'Moroccan Chicken': 'International',
    'Chicken Breasts': 'Poultry',
    'Austrian/German/Swiss': 'International',
    'Baked Potato': 'Side Dishes',
    'Pineapple Juice': 'Flavor Profiles',
    'Lunch': 'Lunch/Snacks',
    'Peanuts': 'Nuts/Seeds/Grains',
    'Mushrooms': 'Vegetables',
    'Smoker': 'Cooking Methods',
    'Stir-Fry': 'Main Dish',
    'Northwest': 'Regional',
    'Breakfast/Brunch': 'Breakfast/Brunch',
    'Chinese': 'International',
    'Hot Dogs/Poultry': 'Poultry',
    'Mixed Drinks': 'Beverages',
    'Grilled Cheese': 'Sandwiches',
    'South African': 'International',
    'Pakistani': 'International',
    'Pakistani And Indian': 'International',
    'Oranges': 'Fruit',
    'Jewish Cuisine': 'International',
    'Peppers': 'Vegetables',
    'Alaska': 'Regional',
    'Jewish Holidays And Events': 'Occasions',
    'Baked Beans': 'Beans/Legumes',
    'Low Sodium': 'Healthy',
    'Smoothie Bowl': 'Beverages',
    'Southern United States': 'Regional',
    'Alaskan King Crab': 'Seafood',
    'Diabetic': 'Special Dietary Needs',
    'Mideast': 'International',
    'Crock Pot': 'Cooking Methods',
    'Sourdough': 'Baked Goods',
    'German': 'International',
    'West Virginia Style': 'Regional',
    'Fish And Seafood': 'Seafood',
    'Puerto Rican': 'International',
    'Minnesota': 'Regional',
    'Okra': 'Vegetables',
    'Bass': 'Seafood',
    'Panfish': 'Seafood',
    'West': 'Regional',
    'Pumpkin': 'Vegetables',
    'Cajun/Creole': 'Regional',
    'Bundt Cake': 'Desserts',
    'Mexican': 'International',
    'Northwest Usa': 'Regional',
    'Congo': 'International',
    'Alcohol': 'Beverages',
    'Christmas': 'Occasions',
    'Czech Republic': 'International',
    'Vinegar': 'Sauces/Condiments',
    'Soy': 'Vegetarian/Vegan',
    'Sushi': 'International',
    'Crockpot': 'Cooking Methods',
    'California/Mexican': 'Regional',
    'Coffee': 'Beverages',
    'Jerk': 'International',
    'Cheddar Cheese': 'Dairy',
    'Minnesota Style': 'Regional',
    'Ranch Dressing': 'Sauces/Condiments',
    'West Coast': 'Regional',
    'Bavarian': 'International',
    'Spanish': 'International',
    'Middle East': 'International',
    'Southeast Asian': 'International',
    'Cheese Balls': 'Appetizers',
    'Bar Cookies': 'Desserts',
    'Zucchini And Yellow Squash': 'Vegetables',
    'Thai': 'International',
    'Latin American': 'International',
    'Peruvian': 'International',
    'Chocolate': 'Desserts',
    'Corn': 'Vegetables',
    'Seafood': 'Seafood',
    'Cucumber Salad': 'Salads',
    'Greek': 'International',
    'Veal': 'Meat',
    'Beef': 'Meat',
    'Southern Us': 'Regional',
    'Central American': 'International',
    'Scones': 'Baked Goods',
    'Beverages': 'Beverages',
    'Pumpkin Seeds': 'Nuts/Seeds/Grains',
    'Indian Subcontinent': 'International',
    'Italian': 'International',
    'Pork Chops': 'Meat',
    'Curry': 'International',
    'Caribbean': 'International',
    'Caribbean And West Indian': 'International',
    'Chinese Regional': 'International',
    'Hawaiian And Pacific Islands': 'International',
    'Canning/Preserving': 'Preservation',
    'Cookies': 'Desserts',
    'Cookies And Brownies': 'Desserts',
    'Hamburger Patties': 'Meat',
    'Sugar-Free': 'Special Dietary Needs',
    'Grapes': 'Fruit',
    'Meatloaf': 'Meat',
    'Greek Style': 'International',
    'Duck': 'Poultry',
    'Egg Nog': 'Beverages',
    'Bhutan': 'International',
    'Spice Blends': 'Spices/Seasonings',
    'Raisins': 'Fruit',
    'Rye': 'Nuts/Seeds/Grains',
    'Omelet/Frittatas': 'Breakfast/Brunch',
    'Canadian': 'International',
    'Ground Beef': 'Meat',
    'Turkey Leftovers': 'Meat',
    'Hummus And Pita': 'Sauces/Condiments',
    'Broccoli Rabe': 'Vegetables',
    'Polish': 'International',
    'Beans And Peas': 'Beans/Legumes',
    'Butternut Squash': 'Vegetables',
    'Cheddar': 'Dairy',
    'Butter': 'Dairy',
    'Sweet Potatoes/Yams': 'Vegetables',
    'Sesame': 'Nuts/Seeds/Grains',
    'Fish Fillets': 'Seafood',
    'New Mexico': 'Regional',
    'Broth': 'Soups',
    'Crock Pot/Slow Cooker': 'Cooking Methods',
    'Russian': 'International',
    'Tuna': 'Seafood',
    'Artichokes': 'Vegetables',
    'Finnish/Nordic': 'International',
    'Low Cholesterol': 'Healthy',
    'Irish Soda Bread Ii': 'Baked Goods',
    'Salsa': 'Sauces/Condiments',
    'North Carolina Style': 'Regional',
    'Nebraska': 'Regional',
    'Creole': 'Regional',
    'Iced/Cold Beverages': 'Beverages',
    'Southern Style': 'Regional',
    'Iowa Style': 'Regional',
    'Low Carbohydrate': 'Healthy',
    'Creole/Creole And Cajun': 'Regional',
    'Brazilian Favourites': 'International',
    'Asian': 'International',
    'Yogurt': 'Dairy',
    'Oregon': 'Regional',
    'Hamburgers/Hot Dogs': 'Meat',
    'Dairy': 'Dairy',
    'Low Protein': 'Healthy',
    'Freezer': 'Preservation',
    'Buttermilk': 'Dairy',
    'Jam/Jelly': 'Sauces/Condiments',
    'Candy': 'Desserts',
    'Main Dish': 'Main Dish',
    'Easy': 'Easy',
    'Korean': 'International',
    'Oktoberfest': 'Occasions',
    'Lobster/Crab/Shrimp': 'Seafood',
    'English': 'International',
    'Belizean': 'International',
    'Californian': 'Regional',
    'Lebanese': 'International',
    'South American': 'International',
    'Thanksgiving': 'Occasions',
    'Indian': 'International',
    'Fish And Chips': 'Seafood',
    'Vegetables/Fruits': 'Vegetables',
    'Vegetarian': 'Vegetarian/Vegan',
    'Kansas': 'Regional',
    'Salsa/Hot Sauces': 'Sauces/Condiments',
    'Salads': 'Salads',
    'Poultry And Game Birds': 'Poultry',
    'Sauces/Condiments': 'Sauces/Condiments',
    'Thanksgiving Leftovers': 'Meat',
    'Juices': 'Beverages',
    'Chile': 'International',
    'Garlic': 'Flavor Profiles',
    'Candy/Candy Making': 'Desserts',
    'Dutch Oven': 'Cooking Methods',
    'Condiments': 'Sauces/Condiments',
    'Main Course': 'Main Dish',
    'South American And Central American': 'International',
    'English/Irish/Scottish': 'International',
    'No-Cook': 'No-Cook',
    'Maryland': 'Regional',
    'Preservation': 'Preservation',
    'Greece': 'International',
    'Nut-Free': 'Special Dietary Needs',
    'Asian/Asian And Indian': 'International',
    'Jamaican': 'International',
    'German Regional': 'International',
    'French': 'International',
    'Yeast Breads': 'Baked Goods',
    'Scandinavian': 'International',
    'Minnesota Recipes': 'Regional',
    'Cake Mixes': 'Baked Goods',
    'Pacific Northwest': 'Regional',
    'Sweet Corn': 'Vegetables',
    'Cake Decorating': 'Desserts',
    'Moroccan': 'International',
    'Dairy-Free': 'Special Dietary Needs',
    'Icelandic': 'International',
    'European': 'International',
    'Meringue': 'Desserts',
    'Low Carb': 'Healthy',
    'Chickpeas': 'Beans/Legumes',
    'Low Sodium Main Dishes': 'Healthy',
    'Potato Salad': 'Salads',
    'Tarts': 'Desserts',
    'Low Sodium Desserts': 'Healthy',
    'New York Style': 'Regional',
    'Cheesecake': 'Desserts',
    'Candy Bars': 'Desserts',
    'North Carolina Style Bbq Sauce': 'Sauces/Condiments',
    'Condiment': 'Sauces/Condiments',
    'Creole And Cajun': 'Regional',
    'Illinois': 'Regional',
    'South African Cuisine': 'International',
    'Mexican/Southwestern': 'Regional',
    'Pacific Rim/Asian': 'International',
    'African': 'International',
    'Shellfish': 'Seafood',
    'English And Irish': 'International',
    'Lentils': 'Beans/Legumes',
    'Ethiopian': 'International',
    'East Indian': 'International',
    'African American': 'International',
    'German And Austrian': 'International',
    'Microwave': 'Cooking Methods',
    'Hawaiian Regional': 'Regional',
    'Mediterranean': 'International',
    'Quick Breads': 'Baked Goods',
    'Honduran': 'International',
    'Snacks': 'Lunch/Snacks',
    'Swiss': 'International',
    'Caribbean And Jamaican': 'International',
    'East Coast': 'Regional',
    'Chinese Regional And Chinese': 'International',
    'Bakery': 'Baked Goods',
    'Kansas City': 'Regional',
    'Party': 'Occasions',
    'Asian/Asian And Pacific Rim': 'International',
    'Southern/Cajun And Creole': 'Regional',
    'Greek Regional': 'International',
    'Valentine\'s Day And Romantic': 'Occasions',
    'Indian And South Asian': 'International',
    'Seafood/Fish': 'Seafood',
    'Caribbean And Latin American': 'International',
    'Beef Roast': 'Meat',
    'German And Austrian And Swiss': 'International',
    'Pasta': 'Pasta',
    'Baking': 'Baked Goods',
    'Potato': 'Vegetables',
    'Pork Loin': 'Meat',
    'Cajun': 'Regional',
    'Peruvian And Bolivian': 'International',
    'Turkey': 'Meat',
    'Ireland': 'International',
    'High Protein Low Carb': 'Healthy',
    'Indian And South African': 'International',
    'Asian/Indian': 'International',
    'Indian Subcontinent And Pakistan': 'International',
    'Potatoes': 'Vegetables',
    'Special Diets': 'Special Dietary Needs',
    'International': 'International',
    'Cabbage': 'Vegetables',
    'Stir-Fries': 'Main Dish',
    'Czechoslovakian': 'International',
    'New England': 'Regional',
    'Asian/Chinese': 'International',
    'Szechuan/Sichuan': 'International',
    'Czech': 'International',
    'Chile Pepper': 'Spices/Seasonings',
    'Microwave Cooking': 'Cooking Methods',
    'Mid-Atlantic': 'Regional',
    'Pizza': 'Main Dish',
    'Caribbean And Puerto Rican': 'International',
    'Pennsylvania': 'Regional',
    'Soups': 'Soups',
    'Iceland': 'International',
    'Low Cholesterol Desserts': 'Healthy',
    'Cocktails': 'Beverages',
    'Easy Main Dish': 'Easy',
    'Sauce': 'Sauces/Condiments',
    'German And Austrian': 'International',
    'Peruvian And Ecuadorian': 'International',
    'Nuts/Seeds': 'Nuts/Seeds/Grains',
    'Kentucky': 'Regional',
    'Colorado': 'Regional',
    'Asian/Japanese': 'International',
    'Japanese': 'International',
    'Jewish': 'International',
    'Middle Eastern': 'International',
    'Baking Mixes': 'Baked Goods',
    'Low Fat': 'Healthy',
    'Alabama': 'Regional',
    'Cheese Appetizers': 'Appetizers',
    'Jewish And Kosher': 'International',
    'Cakes': 'Desserts',
    'Southwestern': 'Regional',
    'Appetizers': 'Appetizers',
    'Alcoholic': 'Beverages',
    'Czechoslovakian And German': 'International',
    'Desserts': 'Desserts',
    'Maryland Regional': 'Regional',
    'Deli': 'Sandwiches',
    'Chile Pepper And Chile Pepper Sauce': 'Spices/Seasonings',
    'Seafood/Fish And Seafood': 'Seafood',
    'Oklahoma': 'Regional',
    'Salads/Salads And Dressings': 'Salads',
    'New England And Mid-Atlantic': 'Regional',
    'Dairy And Eggs': 'Dairy',
    'Soul Food': 'Regional',
    'Swedish': 'International',
    'Alcoholic Beverages': 'Beverages',
    'Eggs': 'Dairy',
    'Iowa': 'Regional',
    'Arizona': 'Regional',
    'Brazilian And South American': 'International',
    'Lunch/Snacks': 'Lunch/Snacks',
    'Noodles': 'Pasta',
    'Hot Drinks': 'Beverages',
    'Texas': 'Regional',
    'Maryland And Virginia': 'Regional',
    'Pacific Northwest And Western': 'Regional',
    'Poultry': 'Poultry',
    'British Isles': 'International',
    'Polish And Eastern European': 'International',
    'Apples': 'Fruit',
    'Italian Regional': 'International',
    'Gluten-Free': 'Special Dietary Needs',
    'Oregon Regional': 'Regional',
    'Mexican Regional': 'Regional',
    'Austrian': 'International',
    'Southwest': 'Regional',
    'Low Fat Main Dishes': 'Healthy',
    'Casserole': 'Main Dish',
    'Southern/Cajun And Creole And Cajun': 'Regional',
    'Eastern European': 'International',
    'Asian/Indian And South Asian': 'International',
    'Casseroles': 'Main Dish',
    'Noodles And Pasta': 'Pasta',
    'Breads': 'Baked Goods',
    'Sauces': 'Sauces/Condiments',
    'Quick': 'Quick',
    'Southwestern And Mexican': 'Regional',
    'New England And Eastern European': 'Regional',
    'Appetizer': 'Appetizers',
    'California': 'Regional',
    'Curries': 'International',
    'Baking Soda': 'Baking Ingredients',
    'Southwestern And Mexican And Tex-Mex': 'Regional',
    'Pennsylvania Dutch': 'International',
    'Southwestern And Mexican And Southwestern': 'Regional',
    'Caribbean And Central American': 'International',
    'Southern/Cajun And Creole And Southern': 'Regional',
    'Arizona And New Mexican': 'Regional',
    'Midwestern': 'Regional',
    'Middle Eastern And Israeli': 'International',
    'Southwestern And Mexican And Mexican': 'Regional',
    'German And Eastern European': 'International',
    'Dairy And Poultry': 'Dairy',
    'Eggs And Dairy': 'Dairy',
    'German And Polish': 'International',
    'British': 'International',
    'Pasta And Noodles': 'Pasta',
    'Irish': 'International',
    'Chinese': 'International',
    'Muffins': 'Baked Goods',
    'Southwestern And Tex-Mex': 'Regional',
    'Eastern European And Russian': 'International',
    'Dessert Sauces': 'Sauces/Condiments',
    'Jewish And Passover': 'International',
    'Northwestern': 'Regional',
    'Northern Italian': 'International',
    'Taco': 'Main Dish',
    'Italian Regional And Italian': 'International',
    'Crock-Pot': 'Cooking Methods',
    'Breads/Bread Machine': 'Baked Goods',
    'Salads/Salads And Vegetables': 'Salads',
    'Cabbage And Corned Beef': 'Meat',
    'Crepes': 'Breakfast/Brunch',
    'Southern/Cajun And Creole And Cajun And Creole': 'Regional',
    'Chocolate Chip': 'Desserts',
    'Sour Cream': 'Dairy',
    'Caribbean And Cuban': 'International',
    'Mexican And Southwestern': 'Regional',
    'Eastern European And German': 'International',
    'German': 'International',
    'Condiments And Sauces': 'Sauces/Condiments',
    'Southern/Cajun And Creole And Southern And Cajun And Creole': 'Regional',
    'Pies': 'Desserts',
    'German And Austrian And German And Austrian': 'International',
    'Healthy': 'Healthy',
    'Low Sodium': 'Healthy',
    'Scandinavian And Swedish': 'International',
    'Eastern European And Hungarian': 'International',
    'German And Austrian And Polish': 'International',
    'German And Austrian And Swiss And Swiss': 'International',
    'Middle Eastern And Jewish': 'International',
    'Peanut Butter': 'Nuts/Seeds/Grains',
    'Southern/Cajun And Creole And Southern And Creole': 'Regional',
    'Fruit': 'Fruit',
    'Southern/Cajun And Creole And Cajun And Creole And Southern': 'Regional',
    'Dips': 'Appetizers',
    'Thai And Southeast Asian': 'International',
    'South American And Mexican': 'International',
    'Quick And Easy': 'Quick',
    'Low Sodium Main Dishes And Healthy': 'Healthy',
    'Canning': 'Preservation',
    'Mexican And South American': 'International',
    'California And Southwestern': 'Regional',
    'Czech And Eastern European': 'International',
    'California And American': 'Regional',
    'Southern/Cajun And Creole And Southern And Creole And Cajun': 'Regional',
    'Greek And Italian': 'International',
    'Low Fat Desserts': 'Healthy',
    'North Dakota': 'Regional',
    'German And Polish And Eastern European': 'International',
    'Jewish And Hanukkah': 'International',
    'Artichoke': 'Vegetables',
    'Bean Soup': 'Soups',
    'Beef Liver': 'Meat',
    'Beginner Cook': 'Uncategorized',
    'Birthday': 'Occasions',
    'Black Bean Soup': 'Soups',
    'Bread Pudding': 'Desserts',
    'Breakfast Casseroles': 'Breakfast/Brunch',
    'Breakfast Eggs': 'Breakfast/Brunch',
    'Broccoli Soup': 'Soups',
    'Buttermilk Biscuits': 'Baked Goods',
    'Cambodian': 'International',
    'Chicken Crock Pot': 'Chicken',
    'Chocolate Chip Cookies': 'Desserts',
    'Coconut Cream Pie': 'Desserts',
    'Dairy Free Foods': 'Special Dietary Needs',
    'Deep Fried': 'Cooking Methods',
    'Desserts Fruit': 'Desserts',
    'Ecuadorean': 'International',
    'Egg Free': 'Special Dietary Needs',
    'Fish Salmon': 'Seafood',
    'Fish Tuna': 'Seafood',
    'From Scratch': 'Cooking Methods',
    'Guatemalan': 'International',
    'Ham And Bean Soup': 'Soups',
    'Hanukkah': 'Occasions',
    'Hunan': 'International',
    'Inexpensive': 'Budget',
    'Iraqi': 'International',
    'Key Lime Pie': 'Desserts',
    'Labor Day': 'Occasions',
    'Lemon Cake': 'Desserts',
    'Macaroni And Cheese': 'Pasta',
    'Main Dish Casseroles': 'Main Dish',
    'Margarita': 'Beverages',
    'Memorial Day': 'Occasions',
    'Mongolian': 'International',
    'Mushroom Soup': 'Soups',
    'Nepalese': 'International',
    'Oatmeal': 'Breakfast/Brunch',
    'Oysters': 'Seafood',
    'Palestinian': 'International',
    'Peanut Butter Pie': 'Desserts',
    'Pot Roast': 'Meat',
    'Potato Soup': 'Soups',
    'Roast Beef Crock Pot': 'Meat',
    'Small Appliance': 'Cooking Methods',
    'Snacks Sweet': 'Desserts',
    'Somalian': 'International',
    'Soups Crock Pot': 'Soups',
    'Spaghetti Sauce': 'Sauces/Condiments',
    'Steam': 'Cooking Methods',
    'Sudanese': 'International',
    'Turkey Gravy': 'Poultry',
    'Wheat Bread': 'Baked Goods',
    'Appetizers, Dietary Restrictions': 'Special Dietary Needs',
    'Beans/Legumes': 'Beans/Legumes',
    'Beef, Cooking Methods': 'Meat',
    'Beverage': 'Beverages',
    'Cake, Dessert': 'Desserts',
    'Casseroles, Main Dish': 'Main Dish',
    'Chicken, Cooking Methods': 'Chicken',
    'Cookies, Dessert': 'Desserts',
    'Cooking Methods': 'Cooking Methods',
    'Cooking Skill Level': 'Uncategorized',
    'Cooking Times': 'Cooking Times',
    'Cuisine': 'International',
    'Cost': 'Budget',
    'Dessert, Fruit': 'Desserts',
    'Dietary Restrictions': 'Special Dietary Needs',
    'Family-Friendly': 'Occasions',
    'Flavor Profiles': 'Flavor Profiles',
    'Gravy, Turkey': 'Poultry',
    'Health/Wellness': 'Healthy',
    'Household': 'Uncategorized',
    'Occasion': 'Occasions',
    'Occasions': 'Occasions',
    'Outdoor Cooking': 'Occasions',
    'Pasta, Cheese, Main Dish': 'Pasta',
    'Pie, Dessert': 'Desserts',
    'Quick and Easy': 'Quick and Easy',
    'Regional': 'Regional',
    'Sauce, Pasta': 'Pasta',
    'Seasonal':'Seasonal',
    'Side Dishes': 'Side Dishes',
    'Snacks, Dessert': 'Desserts',
    'Soup': 'Soups',
    'Soup, Cooking Methods' : 'Soups',
    'Special Dietary Needs': 'Special Dietary Needs',
    'Uncategorized': 'Uncategorized',
    'Gluten Free Appetizers': 'Special Dietary Needs',
    'Easy': 'Quick and Easy',
    'Family-Friendly': 'Occasions',
    'Outdoor Cooking': 'Occasions',
    

}

Checking if we didn't cover anything:

In [35]:
set(recipes['RecipeCategory'].unique()) - set(category_mapping.keys()) 

set()

In [36]:
recipes['RecipeCategory'] = recipes['RecipeCategory'].map(category_mapping)

In [37]:
recipes['RecipeCategory'].unique(), recipes['RecipeCategory'].nunique()

(array(['Desserts', 'Chicken', 'Beverages', 'Vegetarian/Vegan',
        'Vegetables', 'Regional', 'Sauces/Condiments', 'Main Dish',
        'Beans/Legumes', 'Quick and Easy', 'Special Dietary Needs',
        'Baked Goods', 'Poultry', 'Healthy', 'International',
        'Breakfast/Brunch', 'Nuts/Seeds/Grains', 'Fruit', 'Meat', 'Dairy',
        'Seafood', 'Pasta', 'Lunch/Snacks', 'Cooking Methods', 'Soups',
        'Seasonal', 'Flavor Profiles', 'Uncategorized', 'Occasions',
        'Family-Friendly', 'Side Dishes', 'Preservation', 'Household',
        'Appetizers', 'Outdoor Cooking', 'Budget'], dtype=object),
 36)

We now have 36 categories!

We can still see that some categories barfely have any memebers. I'll merge them into other categories:

In [38]:
recipes['RecipeCategory'].value_counts()

Desserts                 100616
Vegetables                50318
Main Dish                 40235
Meat                      36461
Lunch/Snacks              32586
Quick and Easy            32452
Baked Goods               30131
Chicken                   26383
Beverages                 22822
Sauces/Condiments         22812
Breakfast/Brunch          21913
Healthy                   18419
International             17495
Nuts/Seeds/Grains          8719
Fruit                      8567
Dairy                      8462
Beans/Legumes              7894
Seafood                    6584
Poultry                    6525
Soups                      5208
Pasta                      3962
Side Dishes                2624
Vegetarian/Vegan           1844
Occasions                  1809
Flavor Profiles            1392
Regional                   1319
Family-Friendly            1221
Special Dietary Needs      1142
Uncategorized               942
Seasonal                    817
Cooking Methods             472
Househol

In [39]:
new_cat_list = ['Desserts', 'Chicken', 'Beverages', 'Vegetarian/Vegan',
        'Vegetables', 'Regional', 'Sauces/Condiments', 'Main Dish',
        'Beans/Legumes', 'Quick and Easy', 'Special Dietary Needs',
        'Baked Goods', 'Poultry', 'Healthy', 'International',
        'Breakfast/Brunch', 'Nuts/Seeds/Grains', 'Fruit', 'Meat', 'Dairy',
        'Seafood', 'Pasta', 'Lunch/Snacks', 'Cooking Methods', 'Soups',
        'Seasonal', 'Flavor Profiles', 'Uncategorized', 'Occasions',
        'Family-Friendly', 'Side Dishes', 'Preservation', 'Household',
        'Appetizers', 'Outdoor Cooking', 'Budget']

In [40]:
new_cat_dict = {x:x for x in new_cat_list}

In [41]:
new_cat_dict

{'Desserts': 'Desserts',
 'Chicken': 'Chicken',
 'Beverages': 'Beverages',
 'Vegetarian/Vegan': 'Vegetarian/Vegan',
 'Vegetables': 'Vegetables',
 'Regional': 'Regional',
 'Sauces/Condiments': 'Sauces/Condiments',
 'Main Dish': 'Main Dish',
 'Beans/Legumes': 'Beans/Legumes',
 'Quick and Easy': 'Quick and Easy',
 'Special Dietary Needs': 'Special Dietary Needs',
 'Baked Goods': 'Baked Goods',
 'Poultry': 'Poultry',
 'Healthy': 'Healthy',
 'International': 'International',
 'Breakfast/Brunch': 'Breakfast/Brunch',
 'Nuts/Seeds/Grains': 'Nuts/Seeds/Grains',
 'Fruit': 'Fruit',
 'Meat': 'Meat',
 'Dairy': 'Dairy',
 'Seafood': 'Seafood',
 'Pasta': 'Pasta',
 'Lunch/Snacks': 'Lunch/Snacks',
 'Cooking Methods': 'Cooking Methods',
 'Soups': 'Soups',
 'Seasonal': 'Seasonal',
 'Flavor Profiles': 'Flavor Profiles',
 'Uncategorized': 'Uncategorized',
 'Occasions': 'Occasions',
 'Family-Friendly': 'Family-Friendly',
 'Side Dishes': 'Side Dishes',
 'Preservation': 'Preservation',
 'Household': 'Household

In [42]:
new_dict = {'Desserts': 'Desserts',
 'Chicken': 'Chicken',
 'Beverages': 'Beverages',
 'Vegetarian/Vegan': 'Vegetarian/Vegan',
 'Vegetables': 'Vegetables',
 'Regional': 'Regional',
 'Sauces/Condiments': 'Sauces/Condiments',
 'Main Dish': 'Main Dish',
 'Beans/Legumes': 'Beans/Legumes',
 'Quick and Easy': 'Quick and Easy',
 'Special Dietary Needs': 'Special Dietary Needs',
 'Baked Goods': 'Baked Goods',
 'Poultry': 'Poultry',
 'Healthy': 'Healthy',
 'International': 'International',
 'Breakfast/Brunch': 'Breakfast/Brunch',
 'Nuts/Seeds/Grains': 'Nuts/Seeds/Grains',
 'Fruit': 'Fruit',
 'Meat': 'Meat',
 'Dairy': 'Dairy',
 'Seafood': 'Seafood',
 'Pasta': 'Pasta',
 'Lunch/Snacks': 'Lunch/Snacks',
 'Cooking Methods': 'Cooking Methods',
 'Soups': 'Soups',
 'Seasonal': 'Seasonal',
 'Flavor Profiles': 'Flavor Profiles',
 'Uncategorized': 'Uncategorized',
 'Occasions': 'Occasions',
 'Family-Friendly': 'Family-Friendly',
 'Side Dishes': 'Side Dishes',
 'Preservation': 'Uncategorized',
 'Household': 'Uncategorized',
 'Appetizers': 'Uncategorized',
 'Outdoor Cooking': 'Occasions',
 'Budget': 'Uncategorized'}

In [43]:
recipes['RecipeCategory'] = recipes['RecipeCategory'].map(new_dict)

In [44]:
recipes['RecipeCategory'].value_counts(), recipes['RecipeCategory'].nunique()

(Desserts                 100616
 Vegetables                50318
 Main Dish                 40235
 Meat                      36461
 Lunch/Snacks              32586
 Quick and Easy            32452
 Baked Goods               30131
 Chicken                   26383
 Beverages                 22822
 Sauces/Condiments         22812
 Breakfast/Brunch          21913
 Healthy                   18419
 International             17495
 Nuts/Seeds/Grains          8719
 Fruit                      8567
 Dairy                      8462
 Beans/Legumes              7894
 Seafood                    6584
 Poultry                    6525
 Soups                      5208
 Pasta                      3962
 Side Dishes                2624
 Occasions                  1852
 Vegetarian/Vegan           1844
 Flavor Profiles            1392
 Regional                   1319
 Uncategorized              1270
 Family-Friendly            1221
 Special Dietary Needs      1142
 Seasonal                    817
 Cooking M

Good! Now we have 31 major categories..

In [45]:
recipes.sample(2)

Unnamed: 0,RecipeId,AuthorId,Description,RecipeCategory,Keywords,RecipeIngredientQuantities,RecipeIngredientParts,AggregatedRating,ReviewCount,Calories,FatContent,SaturatedFatContent,CholesterolContent,SodiumContent,CarbohydrateContent,FiberContent,SugarContent,ProteinContent,RecipeServings,RecipeYield,RecipeInstructions,url,YearPublished,MonthPublished,DayPublished,HourPublished,TotalMinutes
123941,130309.0,229591,"These are delicious cheesecakes, perfect for H...",Desserts,"[Dessert, < 30 Mins]","[12, 8, 2, 2, 2⁄3, 1⁄2]","[light cream cheese, eggs, lemon juice, sugar,...",,,166.2,6.7,3.3,49.6,171.9,23.2,0.5,15.9,4.0,12.0,,[·Preheat oven to 350 degrees F (175 degrees C...,"https://www.food.com/recipe/Low-Fat,-Low-Cal,-...",2005,7,18,17,27
315141,327185.0,346383,This is listed as one of Hungry Girl's top rec...,Beverages,"[< 15 Mins, Beginner Cook, Easy]","[1, 2, 1, 3, 2, 1]","[cocoa powder, instant coffee, Splenda sugar s...",,,57.8,0.4,0.1,0.0,28.2,13.3,0.6,9.3,1.6,1.0,1 cup of coffee,[Put all dry ingredients in 16 oz. glass. Dis...,https://www.food.com/recipe/Hg's-Super-Duper-C...,2008,9,25,13,5


In [46]:
recipes.isna().sum()

RecipeId                           0
AuthorId                           0
Description                        5
RecipeCategory                     0
Keywords                           0
RecipeIngredientQuantities         0
RecipeIngredientParts              0
AggregatedRating              253223
ReviewCount                   247489
Calories                           0
FatContent                         0
SaturatedFatContent                0
CholesterolContent                 0
SodiumContent                      0
CarbohydrateContent                0
FiberContent                       0
SugarContent                       0
ProteinContent                     0
RecipeServings                182911
RecipeYield                   348071
RecipeInstructions                 0
url                                0
YearPublished                      0
MonthPublished                     0
DayPublished                       0
HourPublished                      0
TotalMinutes                       0
d

Note that we have also eliminated the null values for `RecipeCategory`. We also have 5 nul values for `Description`; let's drop them too:

In [47]:
recipes['Description'].dropna(inplace=True)

In [48]:
recipes['Description'].isna().sum()

5

In [49]:
recipes[recipes['Description'].isna()]

Unnamed: 0,RecipeId,AuthorId,Description,RecipeCategory,Keywords,RecipeIngredientQuantities,RecipeIngredientParts,AggregatedRating,ReviewCount,Calories,FatContent,SaturatedFatContent,CholesterolContent,SodiumContent,CarbohydrateContent,FiberContent,SugarContent,ProteinContent,RecipeServings,RecipeYield,RecipeInstructions,url,YearPublished,MonthPublished,DayPublished,HourPublished,TotalMinutes
3416,5177.0,1552,,Baked Goods,"[Breakfast, < 15 Mins, For Large Groups, Oven]","[1 1⁄2, 1⁄4, 1, 1, 1, 1, 1, 1⁄4]","[butter, margarine, parmesan cheese, rosemary,...",5.0,4.0,35.5,3.1,1.9,8.8,80.7,0.3,0.1,0.0,1.6,24.0,,"[Grease a fluted tube Bundt pan., combine chee...",https://www.food.com/recipe/Herb-Pull-Aparts-5177,1999,11,30,23,0
3526,5300.0,1992,,Chicken,"[Chicken, Beef Organ Meats, Beef Liver, Poultr...","[2, 1 1⁄2, 900, 9, 8, 1, None]","[sweet sherry, chicken livers, eggs, nutmeg]",,,4650.2,391.1,208.1,7517.1,1606.2,30.1,0.5,5.8,243.4,1.0,,[Bring cream to simmering point. Puree all oth...,https://www.food.com/recipe/Chicken-Liver-Parf...,1999,12,5,13,0
3645,5428.0,1534,,International,"[European, Very Low Carbs, < 15 Mins]","[1, 1⁄3, 10 -12, 1⁄4, 1⁄4, None, 2, 3]","[garlic, fresh swiss chard, red wine vinegar, ...",5.0,5.0,928.6,93.0,15.9,344.6,1172.6,7.2,2.1,2.5,16.4,1.0,,"[Marinate garlic clove in oil for 1 hour., Rem...",https://www.food.com/recipe/Hot-Swiss-Chard-Sa...,1999,12,15,23,0
4590,7426.0,1534,,Sauces/Condiments,[< 15 Mins],"[2, 1⁄2, 1, 1, 1⁄2, 1, 1]","[salt, garlic powder, parsley flakes, mayonnai...",,,119.9,2.3,1.4,9.8,1829.2,16.4,0.8,12.1,9.0,,1 batch,"[Mix instant onion mix, salt, garlic powder, p...",https://www.food.com/recipe/Hidden-Valley-Mix-...,1999,12,15,23,0
4591,7427.0,1534,,Fruit,"[Meat, < 15 Mins, Oven]","[2, 1, 2, 1⁄2, 1⁄3, 3, 2, 1⁄4, 1⁄4, 1, 12, 1, 1]","[beef, eggs, parsley, ketchup, onions, soy sau...",4.5,5.0,1264.1,109.3,45.1,211.8,1364.7,51.9,4.6,40.9,17.5,6.0,,"[In a large bowl, combine ground beef, cornfla...",https://www.food.com/recipe/Cranberry-Cocktail...,1999,12,15,23,0


In [50]:
recipes.drop([3416,3526,3645,4591,4590],axis=0,inplace=True)

In [51]:
recipes.isna().sum()

RecipeId                           0
AuthorId                           0
Description                        0
RecipeCategory                     0
Keywords                           0
RecipeIngredientQuantities         0
RecipeIngredientParts              0
AggregatedRating              253221
ReviewCount                   247487
Calories                           0
FatContent                         0
SaturatedFatContent                0
CholesterolContent                 0
SodiumContent                      0
CarbohydrateContent                0
FiberContent                       0
SugarContent                       0
ProteinContent                     0
RecipeServings                182910
RecipeYield                   348067
RecipeInstructions                 0
url                                0
YearPublished                      0
MonthPublished                     0
DayPublished                       0
HourPublished                      0
TotalMinutes                       0
d

### Dealling with `AggregatedRating`

In [52]:
reviews = pd.read_parquet('../reviews.parquet')

In [53]:
reviews.sample(5)

Unnamed: 0,ReviewId,RecipeId,AuthorId,AuthorName,Rating,Review,DateSubmitted,DateModified
798146,888782,366061,149363,Leslie,5,Loved this simple way of preparing tomatoes. I...,2009-06-14 09:24:07+00:00,2009-06-14 09:24:07+00:00
145564,156172,88747,142583,Laurieh,0,"You will NOT believe what happened to me, my s...",2005-02-25 16:00:30+00:00,2005-02-25 16:00:30+00:00
276115,295129,23966,88585,frazerjane1,5,This recipe sved my dinner! I was in the middl...,2006-07-20 06:47:55+00:00,2006-07-20 06:47:55+00:00
1321400,2002889,61366,2001275176,Mairott,0,I made using canned peaches. Recipe does not s...,2016-11-29 17:15:40+00:00,2016-11-29 17:15:40+00:00
1312866,1506943,457907,305531,lazyme,5,This was such a simple and good salad that was...,2016-07-17 15:57:00+00:00,2016-07-17 15:57:00+00:00


In [54]:
reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1401982 entries, 0 to 1401981
Data columns (total 8 columns):
 #   Column         Non-Null Count    Dtype              
---  ------         --------------    -----              
 0   ReviewId       1401982 non-null  int32              
 1   RecipeId       1401982 non-null  int32              
 2   AuthorId       1401982 non-null  int32              
 3   AuthorName     1401982 non-null  object             
 4   Rating         1401982 non-null  int32              
 5   Review         1401982 non-null  object             
 6   DateSubmitted  1401982 non-null  datetime64[ns, UTC]
 7   DateModified   1401982 non-null  datetime64[ns, UTC]
dtypes: datetime64[ns, UTC](2), int32(4), object(2)
memory usage: 64.2+ MB


In [55]:
reviews.isna().sum()

ReviewId         0
RecipeId         0
AuthorId         0
AuthorName       0
Rating           0
Review           0
DateSubmitted    0
DateModified     0
dtype: int64

In [56]:
reviews.describe()

Unnamed: 0,ReviewId,RecipeId,AuthorId,Rating
count,1401982.0,1401982.0,1401982.0,1401982.0
mean,817973.9,152641.2,155863800.0,4.407951
std,528082.1,130111.2,530511100.0,1.272012
min,2.0,38.0,1533.0,0.0
25%,374386.2,47038.75,133680.0,4.0
50%,771780.5,109327.0,330545.0,5.0
75%,1204126.0,231876.8,818359.0,5.0
max,2090347.0,541298.0,2002902000.0,5.0


In [57]:
reviews['RecipeId'].value_counts()

45809     2892
2886      2182
27208     1614
89204     1584
39087     1491
          ... 
229614       1
320225       1
47944        1
270626       1
230339       1
Name: RecipeId, Length: 271678, dtype: int64

**NOTE:** There's a mismatch batween the actual aggregated rating and one recorded in the recipes dataset:

In [58]:
recipes[recipes['RecipeId'] == 992]

Unnamed: 0,RecipeId,AuthorId,Description,RecipeCategory,Keywords,RecipeIngredientQuantities,RecipeIngredientParts,AggregatedRating,ReviewCount,Calories,FatContent,SaturatedFatContent,CholesterolContent,SodiumContent,CarbohydrateContent,FiberContent,SugarContent,ProteinContent,RecipeServings,RecipeYield,RecipeInstructions,url,YearPublished,MonthPublished,DayPublished,HourPublished,TotalMinutes
702,992.0,1545,Make and share this Jalapeno Pepper Poppers re...,Vegetables,"[< 30 Mins, For Large Groups]","[8, 4, 4, 6, 1⁄4, 1⁄4, 1⁄4, 1, 1⁄2, None]","[cream cheese, sharp cheddar cheese, monterey ...",5.0,15.0,111.4,9.2,4.9,23.7,172.5,3.2,0.6,0.9,4.3,24.0,,"[In a mixing bowl, combine cheeses, bacon and ...",https://www.food.com/recipe/Jalapeno-Pepper-Po...,1999,9,6,4,30


In [59]:
recipes[recipes['RecipeId'] == 992]['AggregatedRating']

702    5.0
Name: AggregatedRating, dtype: float64

In [60]:
reviews[reviews['RecipeId'] == 992]['Rating'].mean()

4.916666666666667

More examples:

In [61]:
print(f"Recorded rating: {recipes[recipes['RecipeId'] == 45809]['AggregatedRating'][41924]}")
print(f"Actual rating: {reviews[reviews['RecipeId'] == 45809]['Rating'].mean()}")

Recorded rating: 5.0
Actual rating: 4.314661134163209


In [62]:
print(f"Recorded rating: {recipes[recipes['RecipeId'] == 2886]['AggregatedRating'][1436]}")
print(f"Actual rating: {reviews[reviews['RecipeId'] == 2886]['Rating'].mean()}")

Recorded rating: 5.0
Actual rating: 4.218148487626031


So we can drop the inaccurate `AggregatedRating` from the recipes dataset, and replace the amount with the average of the `Rating` in the reviews dataset.

In [63]:
ratings = reviews.groupby(['RecipeId']).mean()[['Rating']]
ratings

  ratings = reviews.groupby(['RecipeId']).mean()[['Rating']]


Unnamed: 0_level_0,Rating
RecipeId,Unnamed: 1_level_1
38,4.250000
39,3.000000
40,4.333333
41,4.500000
42,2.666667
...,...
540899,5.000000
541001,0.000000
541030,5.000000
541195,5.000000


In [64]:
ratings.index

Int64Index([    38,     39,     40,     41,     42,     43,     44,     45,
                46,     47,
            ...
            540716, 540717, 540731, 540836, 540876, 540899, 541001, 541030,
            541195, 541298],
           dtype='int64', name='RecipeId', length=271678)

In [65]:
recipes[['RecipeId','AggregatedRating']].isna().sum()

RecipeId                 0
AggregatedRating    253221
dtype: int64

In [66]:
recipe_ids_with_aggrating = recipes[['RecipeId','AggregatedRating']].dropna()['RecipeId']
recipe_ids_with_aggrating

0             38.0
1             39.0
2             40.0
3             41.0
4             42.0
            ...   
522018    540876.0
522039    540899.0
522167    541030.0
522330    541195.0
522431    541298.0
Name: RecipeId, Length: 269291, dtype: float64

In [67]:
recipe_ids_with_aggrating.values

array([3.80000e+01, 3.90000e+01, 4.00000e+01, ..., 5.41030e+05,
       5.41195e+05, 5.41298e+05])

In [68]:
recipes['CorrectAggregatedRating'] = ''

In [69]:
recipes['CorrectAggregatedRating']

0          
1          
2          
3          
4          
         ..
522512     
522513     
522514     
522515     
522516     
Name: CorrectAggregatedRating, Length: 522512, dtype: object

In [70]:
# get the indices from the ratings dataframe that exist in the recipes dataframe as RecipeId:

indices = []
for i,j in zip(ratings.index,ratings.values):
    if i in recipe_ids_with_aggrating.values:
        indices.append(i)

In [71]:
len(indices)

265979

In [72]:
recipes[recipes['RecipeId'].isin(indices)]['AggregatedRating'].isna().sum()

0

In [73]:
recipes[recipes['RecipeId'].isin(indices)]['AggregatedRating'].index

Int64Index([     0,      1,      2,      3,      4,      5,      6,      7,
                 8,      9,
            ...
            521816, 521829, 521863, 521864, 521877, 522018, 522039, 522167,
            522330, 522431],
           dtype='int64', length=265979)

In [74]:
ratings.loc[indices]['Rating'].values

array([4.25      , 3.        , 4.33333333, ..., 5.        , 5.        ,
       5.        ])

In [75]:
# Assign to the `CorrectAggregatedRating` value of recipes with existing AggregatedRating the actual aggregated rating,
# recorded in the ratings dataframe:

recipes.loc[recipes[recipes['RecipeId'].isin(indices)]['AggregatedRating'].index,'CorrectAggregatedRating'] = ratings.loc[indices]['Rating'].values

In [76]:
recipes.loc[recipes[recipes['RecipeId'].isin(indices)]['AggregatedRating'].index].sample(4)

Unnamed: 0,RecipeId,AuthorId,Description,RecipeCategory,Keywords,RecipeIngredientQuantities,RecipeIngredientParts,AggregatedRating,ReviewCount,Calories,FatContent,SaturatedFatContent,CholesterolContent,SodiumContent,CarbohydrateContent,FiberContent,SugarContent,ProteinContent,RecipeServings,RecipeYield,RecipeInstructions,url,YearPublished,MonthPublished,DayPublished,HourPublished,TotalMinutes,CorrectAggregatedRating
313167,325144.0,950428,This is one of my favorite cookies that my mom...,Desserts,"[Dessert, Cookie & Brownie, Sweet, < 30 Mins, ...","[1 3⁄4, 3⁄4, 1⁄4, 1⁄2, 1⁄4, 1, 1⁄2, 1, 1 1⁄2]","[all-purpose flour, baking soda, salt, granula...",1.0,1.0,1041.3,49.6,30.1,227.8,1126.9,133.9,3.0,50.7,14.9,,2 dozen,"[PREHEAT oven to 375°F., COMBINE flour, baking...",https://www.food.com/recipe/Butterfinger-Cooki...,2008,9,15,12,22,1.0
126664,133111.0,24386,Ground beef is mixed together with shredded Mo...,Meat,"[High Protein, High In..., < 30 Mins]","[2, 1⁄4, 2, 1⁄2, 1, 1⁄2, 8, None]","[monterey jack cheese, lean ground beef, onion...",5.0,3.0,429.5,21.8,10.5,98.8,727.2,22.6,1.1,3.3,33.8,8.0,,"[Combine shredded cheese, pepper sauce, ground...",https://www.food.com/recipe/Monterey-Jack-Burg...,2005,8,10,17,20,4.666667
195371,203976.0,334301,Make and share this Homestyle Hash recipe from...,Breakfast/Brunch,"[Lunch/Snacks, Soy/Tofu, Beans, Vegan, < 15 Mins]","[1, 1, 7⁄8, 20, 1, 1⁄2, 2]","[ketchup, boiling water, potatoes, potatoes, o...",5.0,2.0,101.2,2.4,0.4,0.0,34.8,18.6,2.3,1.9,2.1,,,[Mix the first three ingredients and let stand...,https://www.food.com/recipe/Homestyle-Hash-203976,2007,1,7,23,15,5.0
12450,15682.0,21752,"This cake is good, I mean REALLY GOOD. If you ...",Desserts,"[Apple, Fruit, Low Protein, Kid Friendly, Kosh...","[1⁄2, 1, 2, 2, 1, 1, 1, 1⁄2, 2, 1⁄2, 1⁄2, 1 1⁄...","[butter, light brown sugar, eggs, pure vanilla...",5.0,69.0,412.3,23.2,12.6,85.3,304.4,48.8,1.2,38.4,3.4,12.0,,"[Preheat oven to 350°F degrees., Beat butter a...",https://www.food.com/recipe/Chunky-Apple-Spice...,2001,12,12,10,50,4.521739


We now have the actual aggregated ratings, recorded in `CorrectAggregatedRating`:

In [78]:
recipes[['RecipeId','AggregatedRating','CorrectAggregatedRating']].dropna()

Unnamed: 0,RecipeId,AggregatedRating,CorrectAggregatedRating
0,38.0,4.5,4.25
1,39.0,3.0,3.0
2,40.0,4.5,4.333333
3,41.0,4.5,4.5
4,42.0,4.5,2.666667
...,...,...,...
522018,540876.0,5.0,5.0
522039,540899.0,5.0,5.0
522167,541030.0,5.0,5.0
522330,541195.0,5.0,5.0


In [79]:
reviews[reviews['RecipeId'] == 40.0]['Rating'].mean()

4.333333333333333

We can now see how many wrong entries existed in our original recipes dataset, as values of `AggregatedRating`:

In [80]:
(recipes[['RecipeId','AggregatedRating','CorrectAggregatedRating']].dropna()['AggregatedRating'] != recipes[['RecipeId','AggregatedRating','CorrectAggregatedRating']].dropna()['CorrectAggregatedRating']).sum()

88851

Whew, this was some cleaning!! Our EDA and models woul've been filled with wrong entries if we didn't fix this! Let's now drop the original `AggregatedRating`:

In [81]:
recipes.drop(['AggregatedRating'],axis=1,inplace=True)

In [82]:
recipes.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 522512 entries, 0 to 522516
Data columns (total 27 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   RecipeId                    522512 non-null  float64
 1   AuthorId                    522512 non-null  int32  
 2   Description                 522512 non-null  object 
 3   RecipeCategory              522512 non-null  object 
 4   Keywords                    522512 non-null  object 
 5   RecipeIngredientQuantities  522512 non-null  object 
 6   RecipeIngredientParts       522512 non-null  object 
 7   ReviewCount                 275025 non-null  float64
 8   Calories                    522512 non-null  float64
 9   FatContent                  522512 non-null  float64
 10  SaturatedFatContent         522512 non-null  float64
 11  CholesterolContent          522512 non-null  float64
 12  SodiumContent               522512 non-null  float64
 13  CarbohydrateCo

Let's also turn the new values into floats and round the them by 2 decimals:

In [83]:
recipes['CorrectAggregatedRating']

0             4.25
1              3.0
2         4.333333
3              4.5
4         2.666667
            ...   
522512            
522513            
522514            
522515            
522516            
Name: CorrectAggregatedRating, Length: 522512, dtype: object

In [85]:
recipes['CorrectAggregatedRating'] = recipes['CorrectAggregatedRating'].apply(lambda x: round(float(x),2) if x != '' else None)

In [117]:
recipes['CorrectAggregatedRating']

0         4.0
1         3.0
2         4.0
3         4.0
4         2.0
         ... 
522512    NaN
522513    NaN
522514    NaN
522515    NaN
522516    NaN
Name: CorrectAggregatedRating, Length: 522512, dtype: float64

In [86]:
recipes.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 522512 entries, 0 to 522516
Data columns (total 27 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   RecipeId                    522512 non-null  float64
 1   AuthorId                    522512 non-null  int32  
 2   Description                 522512 non-null  object 
 3   RecipeCategory              522512 non-null  object 
 4   Keywords                    522512 non-null  object 
 5   RecipeIngredientQuantities  522512 non-null  object 
 6   RecipeIngredientParts       522512 non-null  object 
 7   ReviewCount                 275025 non-null  float64
 8   Calories                    522512 non-null  float64
 9   FatContent                  522512 non-null  float64
 10  SaturatedFatContent         522512 non-null  float64
 11  CholesterolContent          522512 non-null  float64
 12  SodiumContent               522512 non-null  float64
 13  CarbohydrateCo

In [87]:
recipes.isna().sum()

RecipeId                           0
AuthorId                           0
Description                        0
RecipeCategory                     0
Keywords                           0
RecipeIngredientQuantities         0
RecipeIngredientParts              0
ReviewCount                   247487
Calories                           0
FatContent                         0
SaturatedFatContent                0
CholesterolContent                 0
SodiumContent                      0
CarbohydrateContent                0
FiberContent                       0
SugarContent                       0
ProteinContent                     0
RecipeServings                182910
RecipeYield                   348067
RecipeInstructions                 0
url                                0
YearPublished                      0
MonthPublished                     0
DayPublished                       0
HourPublished                      0
TotalMinutes                       0
CorrectAggregatedRating       256533
d

We now save our dataframe for later use. 

**NOTE:** We didn't touch many of the categorical columns, as well as the `url` column here. We will be dealing with these in other notebooks; here we only wanted to perform some baseic cleaning and feature engineering.

In [88]:
recipes.to_parquet('BasicCleanData.parquet') 