# Part 0: Basic Data Cleaning
The first step is some basic data cleaning: we get rid of all the columns that won't be of any use across any of the projects going forward, and add some useful columns, derived from the existing ones, that will come in handy in both Data Analysis and ML/NLP.

* **Drop:** 
['CookTime', 'PrepTime', 'TotalTime', 'DatePublished', 'Description', 'Images', 'ReviewCount']

* **Add:**
['TotalMinutes', 'YearPublished', 'MonthPublished', 'DayPublished', 'HourPublished']

* **Replace:**
['RecipeIngredientQuantities', 'RecipeIngredientParts'] with ones scraped from food.com from scratch.

* **Correct:**
['AggregatedRating'] using the ratings from the reviews dataset at https://www.kaggle.com/datasets/irkaal/foodcom-recipes-and-reviews.

* **Save:**
BasicCleanData.parquet and upload it to the Kaggle page that the original data was downloaded from.

We can then perform classical data analysis on BasicCleanData.parquet.


## Imports and sanity checks

In [None]:
# Check if I'm in the correct environment

import sys
sys.executable

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re

In [3]:
# This allows scrolling through all the columns. Useful for dataframes with too many columns.
pd.set_option('display.max_columns', 100)

In [4]:
recipes = pd.read_parquet('../recipes.parquet')

In [5]:
recipes.sample(2)

Unnamed: 0,RecipeId,Name,AuthorId,AuthorName,CookTime,PrepTime,TotalTime,DatePublished,Description,Images,RecipeCategory,Keywords,RecipeIngredientQuantities,RecipeIngredientParts,AggregatedRating,ReviewCount,Calories,FatContent,SaturatedFatContent,CholesterolContent,SodiumContent,CarbohydrateContent,FiberContent,SugarContent,ProteinContent,RecipeServings,RecipeYield,RecipeInstructions
19235,22625.0,Tzatziki Sauce,33017,T Likes to cook,,PT2H,PT2H,2002-03-15 13:38:00+00:00,Make and share this Tzatziki Sauce recipe from...,[],Weeknight,"[< 4 Hours, Easy]","[1, 1⁄2, 2, 1, 1⁄2]","[plain yogurt, English cucumber, lemon juice, ...",4.5,3.0,191.1,12.1,8.2,49.3,476.8,11.8,0.5,8.9,10.2,,2 cups,[(To drain yogurt and cucumber- line a straine...
118626,124842.0,Sloppy Joes,28604,BeccaB3c,PT30M,PT10M,PT40M,2005-06-06 13:29:00+00:00,I always love Sloppy Joes. I grew up on Manwhi...,[],Lunch/Snacks,"[Meat, Toddler Friendly, Kid Friendly, Summer,...","[1, 1⁄2, 4, 2, 1, 1⁄2, 8, None, None]","[lean ground beef, onion, celery, brown sugar,...",,,258.1,11.6,4.6,73.7,416.5,14.2,1.8,10.7,23.9,4.0,,"[Brown ground beef in skillet, drain off fat.,..."


In [8]:
recipes.shape

(522517, 28)

In [6]:
recipes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 522517 entries, 0 to 522516
Data columns (total 28 columns):
 #   Column                      Non-Null Count   Dtype              
---  ------                      --------------   -----              
 0   RecipeId                    522517 non-null  float64            
 1   Name                        522517 non-null  object             
 2   AuthorId                    522517 non-null  int32              
 3   AuthorName                  522517 non-null  object             
 4   CookTime                    439972 non-null  object             
 5   PrepTime                    522517 non-null  object             
 6   TotalTime                   522517 non-null  object             
 7   DatePublished               522517 non-null  datetime64[ns, UTC]
 8   Description                 522512 non-null  object             
 9   Images                      522516 non-null  object             
 10  RecipeCategory              521766 non-null 

In [7]:
recipes.describe()

Unnamed: 0,RecipeId,AuthorId,AggregatedRating,ReviewCount,Calories,FatContent,SaturatedFatContent,CholesterolContent,SodiumContent,CarbohydrateContent,FiberContent,SugarContent,ProteinContent,RecipeServings
count,522517.0,522517.0,269294.0,275028.0,522517.0,522517.0,522517.0,522517.0,522517.0,522517.0,522517.0,522517.0,522517.0,339606.0
mean,271821.43697,45725850.0,4.632014,5.227784,484.43858,24.614922,9.559457,86.487003,767.2639,49.089092,3.843242,21.878254,17.46951,8.606191
std,155495.878422,292971400.0,0.641934,20.381347,1397.116649,111.485798,46.622621,301.987009,4203.621,180.822062,8.603163,142.620191,40.128837,114.319809
min,38.0,27.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
25%,137206.0,69474.0,4.5,1.0,174.2,5.6,1.5,3.8,123.3,12.8,0.8,2.5,3.5,4.0
50%,271758.0,238937.0,5.0,2.0,317.1,13.8,4.7,42.6,353.3,28.2,2.2,6.4,9.1,6.0
75%,406145.0,565828.0,5.0,4.0,529.1,27.4,10.8,107.9,792.2,51.1,4.6,17.9,25.0,8.0
max,541383.0,2002886000.0,5.0,3063.0,612854.6,64368.1,26740.6,130456.4,1246921.0,108294.6,3012.0,90682.3,18396.2,32767.0


### Adding recipe urls to the dataframe
We will first reconstruct the recipe urls from the original recipes dataset.
* We can use these urls to check the recipe data recorded in the dataset against the actual info on the respective recipe webpages.
* We also use these links to scrape food.com in order to upgrade the ingredients (currently ongoing in another notebook).

In [10]:
recipes['url']= recipes['Name'].apply(lambda x: x.replace(' ','-')+'-')
recipes['url']

0                        Low-Fat-Berry-Blue-Frozen-Dessert-
1                                                  Biryani-
2                                            Best-Lemonade-
3                           Carina's-Tofu-Vegetable-Kebabs-
4                                             Cabbage-Soup-
                                ...                        
522512                      Meg's-Fresh-Ginger-Gingerbread-
522513    Roast-Prime-Rib-au-Poivre-with-Mixed-Peppercorns-
522514                               Kirshwasser-Ice-Cream-
522515            Quick-&-Easy-Asian-Cucumber-Salmon-Rolls-
522516                             Spicy-Baked-Scotch-Eggs-
Name: url, Length: 522517, dtype: object

In [11]:
recipes['url'] = recipes[['url', 'RecipeId']].apply(lambda x: 'https://www.food.com/recipe/' + x['url'] + str(int(x['RecipeId'])), axis=1)
recipes['url']

0         https://www.food.com/recipe/Low-Fat-Berry-Blue...
1                    https://www.food.com/recipe/Biryani-39
2              https://www.food.com/recipe/Best-Lemonade-40
3         https://www.food.com/recipe/Carina's-Tofu-Vege...
4               https://www.food.com/recipe/Cabbage-Soup-42
                                ...                        
522512    https://www.food.com/recipe/Meg's-Fresh-Ginger...
522513    https://www.food.com/recipe/Roast-Prime-Rib-au...
522514    https://www.food.com/recipe/Kirshwasser-Ice-Cr...
522515    https://www.food.com/recipe/Quick-&-Easy-Asian...
522516    https://www.food.com/recipe/Spicy-Baked-Scotch...
Name: url, Length: 522517, dtype: object
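
The two cells above can also be folded into a single helper, which makes the url format easy to test on individual recipes. This is a minimal sketch: `build_recipe_url` is a hypothetical name, and it only reproduces the string format used in this notebook, not any official food.com scheme.

```python
def build_recipe_url(name: str, recipe_id: float) -> str:
    """Rebuild a recipe url in the same format as the cells above."""
    # Spaces become hyphens; every other character is kept as-is.
    slug = name.replace(' ', '-')
    # RecipeId is stored as a float in the dataset, so cast it to int first.
    return f"https://www.food.com/recipe/{slug}-{int(recipe_id)}"


biryani_url = build_recipe_url('Biryani', 39.0)
lemonade_url = build_recipe_url('Best Lemonade', 40.0)
```

Characters such as `&` in a recipe name pass through unchanged (as in `Quick-&-Easy-Asian-Cucumber-Salmon-Rolls-` above), so urls built from such names are worth spot-checking by hand.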

In [12]:
#recipes = pd.read_parquet('../recipes_with_urls.parquet')

In [12]:
recipes.sample(2)

Unnamed: 0,RecipeId,Name,AuthorId,AuthorName,CookTime,PrepTime,TotalTime,DatePublished,Description,Images,RecipeCategory,Keywords,RecipeIngredientQuantities,RecipeIngredientParts,AggregatedRating,ReviewCount,Calories,FatContent,SaturatedFatContent,CholesterolContent,SodiumContent,CarbohydrateContent,FiberContent,SugarContent,ProteinContent,RecipeServings,RecipeYield,RecipeInstructions,url
177084,185181.0,Janie's Carrot Cake,80937,Just Janie,PT40M,PT20M,PT1H,2006-09-07 21:33:00+00:00,I like this incredibly moist carrott cake beca...,[https://img.sndimg.com/food/image/upload/w_55...,Dessert,[< 60 Mins],"[2, 2, 1, 1⁄4, 2⁄3, 1, 3, 2⁄3, 2, 1⁄2, 1⁄2, 3 ...","[flour, cinnamon, baking powder, salt, butter,...",5.0,3.0,748.1,40.8,21.1,165.1,398.8,90.0,2.7,61.2,9.3,8.0,1 layer,[Beat 2/3 cup butter with sugar thoroughly. Ad...,https://www.food.com/recipe/Janie's-Carrot-Cak...
286757,297975.0,Grown-Up Trail/Snack Mix,771228,Tiffiny D.,,PT5M,PT5M,2008-04-11 00:37:00+00:00,My refined version of snack mix. Thought I'd ...,[],Lunch/Snacks,"[Kid Friendly, < 15 Mins, Easy]","[6, 6, 4, 4, 6, 6]","[dried cranberries, raisins, sunflower seeds]",,,358.7,26.8,5.1,0.0,99.9,25.9,6.2,15.8,11.5,12.0,5 cups,"[Mix all ingredients in a large bowl., Store i...",https://www.food.com/recipe/Grown-Up-Trail/Sna...


Now let's import the dataframe containing the scraped data, which has the full and correct info for `RecipeIngredientQuantities` and `RecipeIngredientParts`:

In [13]:
scraped_data = pd.read_pickle('../Recipes_final2.pkl')

In [14]:
scraped_data.head()

Unnamed: 0,url,ingred_quants,ingred_items
0,https://www.food.com/recipe/Low-Fat-Berry-Blue...,"[4, 1⁄4, 1, 1]","[cups blueberries, fresh or frozen, cup granul..."
1,https://www.food.com/recipe/Biryani-39,"[1, 4, 2, 2, 8, 1⁄4, 8, 1⁄2, 1, 1, 1⁄4, 1⁄4, 1...","[tablespoon saffron, teaspoons milk, warm, hot..."
2,https://www.food.com/recipe/Best-Lemonade-40,"[1 1⁄2, 1, , 1 1⁄2, , 3⁄4]","[cups sugar, tablespoon lemons, rind of or 1 t..."
3,https://www.food.com/recipe/Carina's-Tofu-Vege...,"[12, 1, 2, 1, 10, 1, 3, 2, 2, 2, 1, 2, 1⁄2, 1⁄...","[ounces extra firm tofu, water-packed, medium ..."
4,https://www.food.com/recipe/Cabbage-Soup-42,"[46, 4, 1, 2, 1]","[ounces plain tomato juice, cups cabbage, shre..."


The shape of the scraped dataset:

In [15]:
scraped_data.shape

(521712, 3)

The shape of the original dataset:

In [16]:
recipes.shape

(522517, 29)

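The scraped dataset has 521,712 rows against 522,517 in the original, a difference of 805, so roughly 99.8% of the urls have been scraped!

We now join these on the `url` column. `pd.merge` does an inner join by default, so recipes without scraped data are dropped: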

In [17]:
recipes = pd.merge(recipes,scraped_data,on='url')
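
Before committing to the inner join, it can help to see which rows would be lost. The sketch below uses two toy dataframes as stand-ins for `recipes` and `scraped_data` (the values are made up) and relies on the `indicator` and `validate` options of `pd.merge`:

```python
import pandas as pd

# Toy stand-ins for `recipes` and `scraped_data`; the values are made up.
left = pd.DataFrame({'url': ['u1', 'u2', 'u3'], 'Name': ['A', 'B', 'C']})
right = pd.DataFrame({'url': ['u1', 'u3'], 'ingred_quants': [['1'], ['2', '1⁄2']]})

# A left join keeps every recipe; `_merge` records where each row came from.
# validate='one_to_one' raises if a url is duplicated on either side.
checked = pd.merge(left, right, on='url', how='left',
                   indicator=True, validate='one_to_one')
unmatched_urls = checked.loc[checked['_merge'] == 'left_only', 'url'].tolist()

# The default inner join silently drops the unmatched rows.
inner = pd.merge(left, right, on='url')
```

Here `unmatched_urls` lists the recipes that have no scraped counterpart, which is the set an inner join would discard.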

Let's compare the new columns with the old ones:

In [18]:
recipes[['RecipeIngredientQuantities', 'ingred_quants', 'RecipeIngredientParts', 'ingred_items']]

Unnamed: 0,RecipeIngredientQuantities,ingred_quants,RecipeIngredientParts,ingred_items
0,"[4, 1⁄4, 1, 1]","[4, 1⁄4, 1, 1]","[blueberries, granulated sugar, vanilla yogurt...","[cups blueberries, fresh or frozen, cup granul..."
1,"[1, 4, 2, 2, 8, 1⁄4, 8, 1⁄2, 1, 1, 1⁄4, 1⁄4, 1...","[1, 4, 2, 2, 8, 1⁄4, 8, 1⁄2, 1, 1, 1⁄4, 1⁄4, 1...","[saffron, milk, hot green chili peppers, onion...","[tablespoon saffron, teaspoons milk, warm, hot..."
2,"[1 1⁄2, 1, None, 1 1⁄2, None, 3⁄4]","[1 1⁄2, 1, , 1 1⁄2, , 3⁄4]","[sugar, lemons, rind of, lemon, zest of, fresh...","[cups sugar, tablespoon lemons, rind of or 1 t..."
3,"[12, 1, 2, 1, 10, 1, 3, 2, 2, 2, 1, 2, 1⁄2, 1⁄...","[12, 1, 2, 1, 10, 1, 3, 2, 2, 2, 1, 2, 1⁄2, 1⁄...","[extra firm tofu, eggplant, zucchini, mushroom...","[ounces extra firm tofu, water-packed, medium ..."
4,"[46, 4, 1, 2, 1]","[46, 4, 1, 2, 1]","[plain tomato juice, cabbage, onion, carrots, ...","[ounces plain tomato juice, cups cabbage, shre..."
...,...,...,...,...
521707,"[8, 12, 6, 1, 3, 3, 5, 2, None]","[8, 12, 6, 1, 3, 3, 5, 2, ]","[fettuccine pasta, button mushrooms, silken to...","[ounces fettuccine pasta, ounces button mushro..."
521708,"[1, 1, 1⁄2, 2, 1⁄2]","[1, 1, 1⁄2, 2, 1⁄2]","[sugar, cinnamon]","[cup dark roast coffee, tablespoon sugar, teas..."
521709,"[1, 1, 1, 1, 16, 25, 1, 1, None]","[1, 1, 1, 1, 16, 25, 1, 1, ]","[orange, apple, pear, pineapple, mint leaf]","[Brut champagne, orange, large apple, large pe..."
521710,"[2, 3, 1, 1, 1, 1, 1, 1, 1, 5]","[2, 3, 1, 1, 1, 1, 1, 1, 1, 5]","[lobster tails, garlic powder, parmesan cheese...","[lobster tails, tablespoons garlic powder, cup..."


We can see that the scraped data is much more complete. All the measurements are there now. Also, a large number of mismatches between ingredient quantities and items have been fixed; those mismatches existed in the original dataset because of the way the relevant info was scraped.

For now we keep the old columns as they might turn out useful later. We can drop them later when needed.

In [20]:
recipes.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 521712 entries, 0 to 521711
Data columns (total 31 columns):
 #   Column                      Non-Null Count   Dtype              
---  ------                      --------------   -----              
 0   RecipeId                    521712 non-null  float64            
 1   Name                        521712 non-null  object             
 2   AuthorId                    521712 non-null  int32              
 3   AuthorName                  521712 non-null  object             
 4   CookTime                    439288 non-null  object             
 5   PrepTime                    521712 non-null  object             
 6   TotalTime                   521712 non-null  object             
 7   DatePublished               521712 non-null  datetime64[ns, UTC]
 8   Description                 521707 non-null  object             
 9   Images                      521711 non-null  object             
 10  RecipeCategory              520967 non-null 

In [19]:
recipes.isna().sum()

RecipeId                           0
Name                               0
AuthorId                           0
AuthorName                         0
CookTime                       82424
PrepTime                           0
TotalTime                          0
DatePublished                      0
Description                        5
Images                             1
RecipeCategory                   745
Keywords                           0
RecipeIngredientQuantities         0
RecipeIngredientParts              0
AggregatedRating              252432
ReviewCount                   246700
Calories                           0
FatContent                         0
SaturatedFatContent                0
CholesterolContent                 0
SodiumContent                      0
CarbohydrateContent                0
FiberContent                       0
SugarContent                       0
ProteinContent                     0
RecipeServings                182734
RecipeYield                   347633
R

### Dropping Redundant Columns <a class ='author' id='part-0'></a>
`TotalTime` is the sum of `CookTime` and `PrepTime`. Plus, the latter two seem to be missing from the recipes on the webpages. I'll just drop `CookTime` and `PrepTime`.

In [22]:
recipes.drop(['CookTime', 'PrepTime'], axis=1,inplace=True)
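
The claim that `TotalTime` equals `CookTime` plus `PrepTime` can be spot-checked before dropping anything. This is a minimal sketch using the three rows that appear in the samples above (Tzatziki Sauce, Sloppy Joes, Janie's Carrot Cake). `to_minutes` is a hypothetical helper, and a missing `CookTime` is assumed to count as zero minutes.

```python
import re


def to_minutes(duration):
    """Convert strings like 'PT1H30M' to minutes; None counts as 0 (assumption)."""
    if duration is None:
        return 0
    return sum(int(number) * (60 if unit == 'H' else 1)
               for number, unit in re.findall(r'(\d+)([HM])', duration))


# (CookTime, PrepTime, TotalTime) taken from the sample rows shown earlier.
rows = [
    (None, 'PT2H', 'PT2H'),        # Tzatziki Sauce
    ('PT30M', 'PT10M', 'PT40M'),   # Sloppy Joes
    ('PT40M', 'PT20M', 'PT1H'),    # Janie's Carrot Cake
]

mismatches = [row for row in rows
              if to_minutes(row[0]) + to_minutes(row[1]) != to_minutes(row[2])]
```

An empty `mismatches` list on a larger sample would support dropping the two component columns.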

In [23]:
recipes.sample(2)

Unnamed: 0,RecipeId,Name,AuthorId,AuthorName,TotalTime,DatePublished,Description,Images,RecipeCategory,Keywords,RecipeIngredientQuantities,RecipeIngredientParts,AggregatedRating,ReviewCount,Calories,FatContent,SaturatedFatContent,CholesterolContent,SodiumContent,CarbohydrateContent,FiberContent,SugarContent,ProteinContent,RecipeServings,RecipeYield,RecipeInstructions,url,ingred_quants,ingred_items
431959,447952.0,Jitterbug's Stewed Chicken,714468,Brookelynne26,PT1H,2011-02-01 16:38:00+00:00,Make and share this Jitterbug's Stewed Chicken...,[],Chicken Breast,"[Chicken, Poultry, Onions, Peppers, Vegetable,...","[4, 3⁄4, 1⁄2, 1⁄4, 3⁄4, 1⁄3, 1⁄3, 2, 1⁄4, 1, 1...","[boneless chicken breasts, salt, fresh ground ...",5.0,1.0,683.6,30.2,6.6,100.0,889.6,58.6,1.6,5.6,41.0,4.0,,[Preheat oven to 350 degrees. Season both sid...,https://www.food.com/recipe/Jitterbug's-Stewed...,"[4, 3⁄4, 1⁄2, 1⁄4, 3⁄4, 1⁄3, 1⁄3, 2, 1⁄4, 1, 1...","[boneless chicken breasts, with skin rinsed an..."
244569,254547.0,Orange Walnut Loaf (Abm),283251,dicentra,PT5M,2007-09-21 17:16:00+00:00,Make and share this Orange Walnut Loaf (Abm) r...,[],Yeast Breads,"[Breads, Bread Machine, < 15 Mins, Small Appli...","[1, 1 1⁄2, 1 1⁄2, 1, 1⁄2, 1⁄2, 1, 2 1⁄2, 3⁄4, ...","[water, walnut oil, honey, walnut extract, ora...",,,3041.9,177.0,25.3,15.5,3570.1,348.8,26.3,34.8,61.0,,1 loaf,[Put all of the ingredients except the walnuts...,https://www.food.com/recipe/Orange-Walnut-Loaf...,"[1, 1 1⁄2, 1 1⁄2, 1, 1⁄2, 1⁄2, 1, 2 1⁄2, 3⁄4, ...","[cup water, tablespoons walnut oil, tablespoon..."


### Feature-Engineering `DatePublished`

`DatePublished` packs the full timestamp into a single column. Instead we split it into `YearPublished`, `MonthPublished`, `DayPublished` and `HourPublished`.

We can later use these to find out which hours, days, months and years have the highest rate of published recipes, and so on.

In [24]:
recipes['DatePublished'].apply(lambda x: x.hour)

0         21
1         13
2         19
3         14
4          6
          ..
521707    20
521708    20
521709    20
521710    20
521711    21
Name: DatePublished, Length: 521712, dtype: int64

In [25]:
recipes['YearPublished'] = recipes['DatePublished'].apply(lambda x: x.year)
recipes['MonthPublished'] = recipes['DatePublished'].apply(lambda x: x.month)
recipes['DayPublished'] = recipes['DatePublished'].apply(lambda x: x.day)
recipes['HourPublished'] = recipes['DatePublished'].apply(lambda x: x.hour)
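
The same four columns can also be derived without a Python-level `apply`, through the vectorized `.dt` accessor that pandas provides for datetime columns. A small sketch on a toy series (the two timestamps are taken from the sample rows shown earlier):

```python
import pandas as pd

# Two timestamps from the sample rows above, parsed as timezone-aware UTC values.
dates = pd.Series(pd.to_datetime(
    ['2002-03-15 13:38:00+00:00', '2005-06-06 13:29:00+00:00'], utc=True))

# Each .dt attribute returns a whole column at once.
parts = pd.DataFrame({
    'YearPublished': dates.dt.year,
    'MonthPublished': dates.dt.month,
    'DayPublished': dates.dt.day,
    'HourPublished': dates.dt.hour,
})
```

The values match the `apply` version; the integer dtype of the result may differ depending on the pandas version.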

In [26]:
recipes.drop(['DatePublished'],axis=1,inplace=True)

In [27]:
recipes.sample(3)

Unnamed: 0,RecipeId,Name,AuthorId,AuthorName,TotalTime,Description,Images,RecipeCategory,Keywords,RecipeIngredientQuantities,RecipeIngredientParts,AggregatedRating,ReviewCount,Calories,FatContent,SaturatedFatContent,CholesterolContent,SodiumContent,CarbohydrateContent,FiberContent,SugarContent,ProteinContent,RecipeServings,RecipeYield,RecipeInstructions,url,ingred_quants,ingred_items,YearPublished,MonthPublished,DayPublished,HourPublished
419710,435166.0,Shrimp Scampi With Garlic Toasts,1098956,redsoxgirl09,PT25M,Make and share this Shrimp Scampi With Garlic ...,[],< 30 Mins,[Easy],"[3, 3, 5, None, 1, 8, 1 1⁄4, 3⁄4, 1⁄2, 1⁄2, No...","[extra virgin olive oil, unsalted butter, garl...",,,714.2,25.2,8.2,238.9,997.1,71.2,4.2,1.0,40.8,4.0,,[Preheat the broiler. Heat the olive oil and 2...,https://www.food.com/recipe/Shrimp-Scampi-With...,"[3, 3, 5, , 1, 8, 1 1⁄4, 3⁄4, 1⁄2, 1⁄2, , 1⁄3,...","[tablespoons extra virgin olive oil, tablespoo...",2010,8,16,12
25617,29141.0,5 Ingredient Dump Cake,37305,Karen..,PT1H5M,This is another dump cake variation (see my re...,[],Dessert,"[Pineapple, Tropical Fruits, Fruit, Low Protei...","[1, 1, 1, 3⁄4, 1]","[crushed pineapple, cherry pie filling, butter...",5.0,23.0,431.2,23.1,8.6,31.4,390.2,55.2,2.0,25.7,3.2,,,"[Preheat oven to 350 degrees., Dump pineapples...",https://www.food.com/recipe/5-Ingredient-Dump-...,"[1, 1, 1, 3⁄4, 1]","[(20 ounce) can crushed pineapple, (20 ounce) ...",2002,5,22,18
117729,123904.0,Marshmallow Pineapple Fluff Dessert,217657,startnover,PT15M,This recipe is from my grandmother. She made ...,[https://img.sndimg.com/food/image/upload/w_55...,Dessert,"[Pineapple, Tropical Fruits, Fruit, Low Protei...","[1, 3, 2, 1, 2]","[milk, marshmallows, crushed pineapple, graham...",4.5,4.0,270.1,16.9,9.8,57.2,102.2,28.5,0.7,16.7,2.8,12.0,,[Scald milk. Add marshmallows and stir till me...,https://www.food.com/recipe/Marshmallow-Pineap...,"[1, 3, 2, 1, 2]","[cup milk, cups marshmallows, cups whipping cr...",2005,5,30,8


Now let's turn `TotalTime` into numbers (in minutes). At the moment the values of this column look like one of the following: 'PT3H30M', 'PT3H', 'PT20M'.

In [28]:
re.findall(r'\d+H|\d+M','PT3H30M')

['3H', '30M']

In [29]:
[string.replace('H','') for string in re.findall(r'\d+H|\d+M','PT3H30M')]

['3', '30M']

In [30]:
result = [int(x.replace('H', '')) * 60 if 'H' in x else int(x.replace('M', '')) for x in re.findall(r'\d+H|\d+M', 'PT3H30M')]
result

[180, 30]

In [31]:
# Use \d+ for both units so that multi-digit hours such as 'PT12H' are not truncated to '2H'.
recipes['TotalMinutes'] = recipes['TotalTime'].apply(lambda string: re.findall(r'\d+H|\d+M', string))
recipes['TotalMinutes'] = recipes['TotalMinutes'].apply(lambda timelist: [int(x.replace('H', '')) * 60 if 'H' in x else int(x.replace('M', '')) for x in timelist])
recipes['TotalMinutes'] = recipes['TotalMinutes'].apply(lambda timelist: sum(timelist))
recipes['TotalMinutes']

0         285
1         265
2          35
3         260
4          50
         ... 
521707     20
521708     25
521709     20
521710     20
521711     30
Name: TotalMinutes, Length: 521712, dtype: int64
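
The three-step conversion can also be written as one small function, which is easier to test on individual strings. This is a minimal sketch: `duration_to_minutes` is a hypothetical name, and it assumes that durations only contain hour and minute parts, as in the three examples listed above. Anything else raises an error instead of being silently counted as zero.

```python
import re

# Assumption: durations look like 'PT3H30M', 'PT3H' or 'PT20M' (hours and minutes only).
_DURATION = re.compile(r'PT(?:(\d+)H)?(?:(\d+)M)?')


def duration_to_minutes(duration: str) -> int:
    """Convert an hour/minute duration string to a number of minutes."""
    match = _DURATION.fullmatch(duration)
    if match is None:
        raise ValueError(f'unsupported duration: {duration!r}')
    # A missing hour or minute part defaults to '0'.
    hours, minutes = match.groups(default='0')
    return int(hours) * 60 + int(minutes)


examples = {d: duration_to_minutes(d) for d in ['PT3H30M', 'PT3H', 'PT20M', 'PT12H']}
```

It could be applied with `recipes['TotalTime'].apply(duration_to_minutes)`. Being strict is a design choice: if the column turns out to hold other components (days or seconds, say), the pattern would need to be extended.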

In [32]:
recipes.drop(['TotalTime'],axis=1,inplace=True)

In [33]:
recipes.sample(2)

Unnamed: 0,RecipeId,Name,AuthorId,AuthorName,Description,Images,RecipeCategory,Keywords,RecipeIngredientQuantities,RecipeIngredientParts,AggregatedRating,ReviewCount,Calories,FatContent,SaturatedFatContent,CholesterolContent,SodiumContent,CarbohydrateContent,FiberContent,SugarContent,ProteinContent,RecipeServings,RecipeYield,RecipeInstructions,url,ingred_quants,ingred_items,YearPublished,MonthPublished,DayPublished,HourPublished,TotalMinutes
309836,321728.0,Almendras Saladas (Salted Almonds - Tapas),464080,januarybride,Make and share this Almendras Saladas (Salted ...,[],Spanish,"[European, < 15 Mins, Beginner Cook, Easy]","[2, 1, 1⁄2, None]","[sea salt, sweet paprika, olive oil]",5.0,1.0,428.6,38.1,2.9,0.0,595.3,13.7,7.3,3.4,15.6,4.0,2 cups,[Pour 1 inch depth of olive oil into a saucepa...,https://www.food.com/recipe/Almendras-Saladas-...,"[2, 1, 1⁄2, ]","[cups blanched almonds, teaspoon sea salt, tea...",2008,8,27,19,5
271154,281957.0,Minestrone Tortellini,719313,TheGrumpyChef,Make and share this Minestrone Tortellini reci...,[],One Dish Meal,"[< 30 Mins, Easy]","[1, 1, 1, 1⁄2, 1, 1, 2, 1⁄4, 2]","[olive oil, zucchini, yellow bell pepper, dice...",4.5,2.0,388.1,8.8,2.9,26.8,660.1,59.5,10.6,3.1,19.2,4.0,,[Cook and drain tortellini as directed on pack...,https://www.food.com/recipe/Minestrone-Tortell...,"[1, 1, 1, 1⁄2, 1, 1, 2, 1⁄4, 2]",[(9 ounce) package refrigerated cheese-filled ...,2008,1,28,1,25


We won't be using `Images` anywhere in our projects, so I'll remove the column. (For now I'll keep `url` because it helps with double-checking recipe entries against the actual recipe webpage; I'll drop that column too later, when we get to ML.)

In [34]:
recipes.drop(['Images'],axis=1,inplace=True)

In [35]:
recipes.sample(2)

Unnamed: 0,RecipeId,Name,AuthorId,AuthorName,Description,RecipeCategory,Keywords,RecipeIngredientQuantities,RecipeIngredientParts,AggregatedRating,ReviewCount,Calories,FatContent,SaturatedFatContent,CholesterolContent,SodiumContent,CarbohydrateContent,FiberContent,SugarContent,ProteinContent,RecipeServings,RecipeYield,RecipeInstructions,url,ingred_quants,ingred_items,YearPublished,MonthPublished,DayPublished,HourPublished,TotalMinutes
177775,185885.0,Mum's Hamburger Soup,343981,mumoftwo,This is a recipe I always want whenever I am s...,Grains,"[Vegetable, Meat, Low Cholesterol, Healthy, < ...","[1, 2, 1 1⁄2, 1, 28, 1, 1, 1⁄8, 1⁄8, 1⁄2, 1, 1...","[onion, garlic cloves, lean hamburger, canned ...",5.0,1.0,327.2,9.2,3.6,47.0,737.1,36.9,8.6,6.8,25.1,10.0,,"[In a large pot (Dutch oven), place diced onio...",https://www.food.com/recipe/Mum's-Hamburger-So...,"[1, 2, 1 1⁄2, 1, 28, 1, 1, 1⁄8, 1⁄8, 1⁄2, 1, 1...","[onion, diced, garlic cloves, minced, lbs lean...",2006,9,13,19,100
245837,255856.0,Healthy Black Bean Salad,267665,bernettavan,Make and share this Healthy Black Bean Salad r...,Black Beans,"[Beans, Low Cholesterol, < 15 Mins, Easy]","[1, 1, 1, 1, 1⁄4, 1⁄2, 1⁄4, 1⁄4, 1⁄4, 4, 2, 2,...","[black beans, corn, purple onion, dried cranbe...",5.0,1.0,237.7,12.2,1.6,0.0,48.7,27.6,6.4,6.5,6.9,8.0,,"[Drain beans and corn, mix together with peppe...",https://www.food.com/recipe/Healthy-Black-Bean...,"[1, 1, 1, 1, 1⁄4, 1⁄2, 1⁄4, 1⁄4, 1⁄4, 4, 2, 2,...","[(540 ml) can black beans, (398 ml) can niblet...",2007,9,27,20,15


### Dealing with `RecipeCategory`

In [37]:
recipes['RecipeCategory'].unique(), recipes['RecipeCategory'].nunique()

(array(['Frozen Desserts', 'Chicken Breast', 'Beverages', 'Soy/Tofu',
        'Vegetable', 'Pie', 'Chicken', 'Dessert', 'Southwestern U.S.',
        'Sauces', 'Stew', 'Black Beans', '< 60 Mins', 'Lactose Free',
        'Weeknight', 'Yeast Breads', 'Whole Chicken', 'High Protein',
        'Cheesecake', 'Free Of...', 'High In...', 'Brazilian', 'Breakfast',
        'Breads', 'Bar Cookie', 'Brown Rice', 'Oranges', 'Pork',
        'Low Protein', 'Asian', 'Potato', 'Cheese', 'Halibut', 'Meat',
        'Lamb/Sheep', 'Very Low Carbs', 'Spaghetti', 'Scones',
        'Drop Cookies', 'Lunch/Snacks', 'Beans', 'Punch Beverage',
        'Pineapple', 'Low Cholesterol', '< 30 Mins', 'Quick Breads',
        'Sourdough Breads', 'Curries', 'Chicken Livers', 'Coconut',
        'Savory Pies', 'Poultry', 'Steak', 'Healthy', 'Lobster', 'Rice',
        'Apple', 'Broil/Grill', 'Spreads', 'Crab', 'Jellies', 'Pears',
        'Chowders', 'Cauliflower', 'Candy', 'Chutneys', 'White Rice',
        'Tex Mex', 'Bass',

We have 311 categories. Turning these into numerical values will add many dimensions to our dataframe. We can reduce these categories to a smaller set of major categories. Here's a suggestion:


**Desserts**: Frozen Desserts, Cheesecake, Pie, Dessert, Gelatin, Candy, Jellies, Tarts, Sweet, Chocolate Chip Cookies, Bread Pudding, Lemon Cake, Key Lime Pie, Coconut Cream Pie, Ice Cream, Fruit Desserts, Apple Pie, Pumpkin.

**Chicken**: Chicken Breast, Chicken, Chicken Thigh & Leg, Chicken Livers, Whole Chicken, Roast Chicken, Chicken Crock Pot.

**Beverages**: Beverages, Punch Beverage, Smoothies, Shakes.

**Vegetarian/Vegan**: Soy/Tofu, Vegetable, Vegan.

**Sauces/Condiments**: Sauces, Salad Dressings, Spreads, Chutneys.

**Meat**: Pork, Lamb/Sheep, Meat, Meatballs, Beef Organ Meats, Steak, Ground Meat, Roast Beef, Ham, Ground Beef, Ground Turkey.

**Seafood**: Halibut, Lobster, Crab, Crawfish, Bass, Tuna, Trout, Catfish, Squid, Mahi Mahi, Oysters, Salmon.

**International Cuisines**: Asian, Brazilian, Greek, German, Hungarian, Indonesian, Mexican, Dutch, Spanish, Russian, Thai, Cajun, Chinese, Turkish, Vietnamese, Lebanese, Moroccan, Korean, Polish, Scandinavian, African, Norwegian, Belgian, Australian, Scottish, Cuban, Portuguese, Hawaiian, Austrian, Egyptian, Filipino, Welsh, Czech, Iraqi, Pakistani, Chilean, Puerto Rican, Ecuadorean, Sudanese, Mongolian, Peruvian, Cambodian, Honduran.

**Side Dishes**: Potatoes, Rice, Grains, Pasta, Breads, Corn, Lentil, Yam/Sweet Potato, Greens, Collard Greens, Spinach, Chard, Artichoke, Mashed Potatoes.

**Breakfast/Brunch**: Breakfast, Breakfast Eggs, Brunch.
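
Once a grouping like this is written down as a dictionary, it could be applied roughly as follows. This is a sketch on toy data: `toy_mapping` is a hypothetical three-entry subset, and `'Some New Category'` is a made-up value standing in for a category that the dictionary does not cover.

```python
import pandas as pd

# Hypothetical subset of a fine-to-major category mapping.
toy_mapping = {
    'Frozen Desserts': 'Desserts',
    'Pie': 'Desserts',
    'Chicken Breast': 'Chicken',
}

categories = pd.Series(['Pie', 'Chicken Breast', None, 'Some New Category'])

# Series.map returns NaN for missing values and for keys absent from the dict,
# so a single fillna gives both cases a fallback label.
major = categories.map(toy_mapping).fillna('Uncategorized')

# Listing the uncovered categories shows what still needs a major category.
unmapped = sorted(set(categories.dropna()) - set(toy_mapping))
```

Checking `unmapped` after each revision of the dictionary is a cheap way to confirm that every one of the original categories has been assigned.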

In [38]:
category_mapping = {
    'Frozen Desserts': 'Desserts',
    'Chicken Breast': 'Chicken',
    'Beverages': 'Beverages',
    'Soy/Tofu': 'Vegetarian/Vegan',
    'Vegetable': 'Vegetables',
    'Pie': 'Desserts',
    'Chicken': 'Chicken',
    'Dessert': 'Desserts',
    'Southwestern U.S.': 'Regional',
    'Sauces': 'Sauces/Condiments',
    'Stew': 'Main Dish',
    'Black Beans': 'Beans/Legumes',
    '< 60 Mins': 'Quick and Easy',
    'Lactose Free': 'Special Dietary Needs',
    'Weeknight': 'Quick and Easy',
    'Yeast Breads': 'Baked Goods',
    'Whole Chicken': 'Chicken',
    'High Protein': 'Healthy',
    'Cheesecake': 'Desserts',
    'Free Of...': 'Special Dietary Needs',
    'High In...': 'Healthy',
    'Brazilian': 'International',
    'Breakfast': 'Breakfast/Brunch',
    'Breads': 'Baked Goods',
    'Bar Cookie': 'Desserts',
    'Brown Rice': 'Nuts/Seeds/Grains',
    'Oranges': 'Fruit',
    'Pork': 'Meat',
    'Low Protein': 'Special Dietary Needs',
    'Asian': 'International',
    'Potato': 'Side Dishes',
    'Cheese': 'Dairy',
    'Halibut': 'Seafood',
    'Meat': 'Meat',
    'Lamb/Sheep': 'Meat',
    'Very Low Carbs': 'Healthy',
    'Spaghetti': 'Pasta',
    'Scones': 'Baked Goods',
    'Drop Cookies': 'Desserts',
    'Lunch/Snacks': 'Lunch',
    'Beans': 'Beans/Legumes',
    'Punch Beverage': 'Beverages',
    'Pineapple': 'Fruit',
    'Low Cholesterol': 'Healthy',
    '< 30 Mins': 'Quick and Easy',
    'Quick Breads': 'Baked Goods',
    'Sourdough Breads': 'Baked Goods',
    'Curries': 'International',
    'Chicken Livers': 'Chicken',
    'Coconut': 'Fruit',
    'Savory Pies': 'Main Dish',
    'Poultry': 'Chicken',
    'Steak': 'Meat',
    'Healthy': 'Healthy',
    'Lobster': 'Seafood',
    'Rice': 'Nuts/Seeds/Grains',
    'Apple': 'Fruit',
    'Broil/Grill': 'Cooking Methods',
    'Spreads': 'Sauces/Condiments',
    'Crab': 'Seafood',
    'Jellies': 'Sauces/Condiments',
    'Pears': 'Fruit',
    'Chowders': 'Soups',
    'Cauliflower': 'Vegetables',
    'Candy': 'Desserts',
    'Chutneys': 'Sauces/Condiments',
    'White Rice': 'Nuts/Seeds/Grains',
    'Tex Mex': 'Regional',
    'Bass': 'Seafood',
    'German': 'International',
    'Fruit': 'Fruit',
    'European': 'International',
    'Smoothies': 'Beverages',
    'Hungarian': 'International',
    'Manicotti': 'Pasta',
    'Onions': 'Vegetables',
    'New Zealand': 'International',
    'Chicken Thigh & Leg': 'Chicken',
    'Indonesian': 'International',
    'Greek': 'International',
    'Corn': 'Vegetables',
    'Lentil': 'Beans/Legumes',
    'Summer': 'Seasonal',
    'Long Grain Rice': 'Nuts/Seeds/Grains',
    'Southwest Asia (middle East)': 'International',
    'Spanish': 'International',
    'Dutch': 'International',
    'Gelatin': 'Desserts',
    'Tuna': 'Seafood',
    'Citrus': 'Fruit',
    'Berries': 'Fruit',
    'Peppers': 'Vegetables',
    'Salad Dressings': 'Sauces/Condiments',
    'Clear Soup': 'Soups',
    'Mexican': 'International',
    'Raspberries': 'Fruit',
    'Crawfish': 'Seafood',
    'Beef Organ Meats': 'Meat',
    'Strawberry': 'Fruit',
    'Shakes': 'Beverages',
    'Short Grain Rice': 'Nuts/Seeds/Grains',
    '< 15 Mins': 'Quick and Easy',
    'One Dish Meal': 'Main Dish',
    'Spicy': 'Flavor Profiles',
    'Thai': 'International',
    'Cajun': 'Regional',
    'Oven': 'Cooking Methods',
    'Microwave': 'Cooking Methods',
    'Russian': 'International',
    'Melons': 'Fruit',
    'Papaya': 'Fruit',
    'Veal': 'Meat',
    'No Cook': 'Quick and Easy',
    '< 4 Hours': 'Quick and Easy',
    None: 'Uncategorized',
    'Roast': 'Cooking Methods',
    'Potluck': 'Occasions',
    'Orange Roughy': 'Seafood',
    'Canadian': 'International',
    'Caribbean': 'International',
    'Mussels': 'Seafood',
    'Medium Grain Rice': 'Nuts/Seeds/Grains',
    'Japanese': 'International',
    'Penne': 'Pasta',
    'Easy': 'Quick and Easy',
    'Elk': 'Meat',
    'Colombian': 'International',
    'Gumbo': 'Soups',
    'Roast Beef': 'Meat',
    'Perch': 'Seafood',
    'Vietnamese': 'International',
    'Rabbit': 'Meat',
    'Christmas': 'Occasions',
    'Lebanese': 'International',
    'Turkish': 'International',
    'Kid Friendly': 'Family-Friendly',
    'Vegan': 'Vegetarian/Vegan',
    'For Large Groups': 'Occasions',
    'Whole Turkey': 'Poultry',
    'Chinese': 'International',
    'Grains': 'Nuts/Seeds/Grains',
    'Yam/Sweet Potato': 'Side Dishes',
    'Native American': 'Regional',
    'Meatloaf': 'Meat',
    'Winter': 'Seasonal',
    'Trout': 'Seafood',
    'African': 'International',
    'Ham': 'Meat',
    'Goose': 'Poultry',
    'Pasta Shells': 'Pasta',
    'Stocks': 'Soups',
    "St. Patrick's Day": 'Occasions',
    'Meatballs': 'Meat',
    'Whole Duck': 'Poultry',
    'Scandinavian': 'International',
    'Greens': 'Vegetables',
    'Catfish': 'Seafood',
    'Dehydrator': 'Cooking Methods',
    'Duck Breasts': 'Poultry',
    'Savory': 'Flavor Profiles',
    'Stir Fry': 'Main Dish',
    'Polish': 'International',
    'Spring': 'Seasonal',
    'Deer': 'Meat',
    'Wild Game': 'Meat',
    'Pheasant': 'Meat',
    'No Shell Fish': 'Seafood',
    'Collard Greens': 'Vegetables',
    'Tilapia': 'Seafood',
    'Quail': 'Poultry',
    'Refrigerator': 'Preservation',
    'Canning': 'Preservation',
    'Moroccan': 'International',
    'Pressure Cooker': 'Cooking Methods',
    'Squid': 'Seafood',
    'Korean': 'International',
    'Plums': 'Fruit',
    'Danish': 'International',
    'Creole': 'Regional',
    'Mahi Mahi': 'Seafood',
    'Tarts': 'Desserts',
    'Spinach': 'Vegetables',
    'Hawaiian': 'Regional',
    'Homeopathy/Remedies': 'Healthy',
    'Austrian': 'International',
    'Thanksgiving': 'Occasions',
    'Moose': 'Meat',
    'Bath/Beauty': 'Healthy',
    'Swedish': 'International',
    'High Fiber': 'Healthy',
    'Kosher': 'Special Dietary Needs',
    'Norwegian': 'International',
    'Household Cleaner': 'Household',
    'Ethiopian': 'International',
    'Belgian': 'International',
    'Australian': 'International',
    'Pennsylvania Dutch': 'Regional',
    'Bear': 'Meat',
    'Scottish': 'International',
    'Tempeh': 'Vegetarian/Vegan',
    'Cuban': 'International',
    'Turkey Breasts': 'Poultry',
    'Cantonese': 'International',
    'Tropical Fruits': 'Fruit',
    'Peanut Butter': 'Sauces/Condiments',
    'Szechuan': 'International',
    'Portuguese': 'International',
    'Summer Dip': 'Appetizers',
    'Costa Rican': 'International',
    'Duck': 'Poultry',
    'Sweet': 'Flavor Profiles',
    'Nuts': 'Nuts/Seeds/Grains',
    'Filipino': 'International',
    'Welsh': 'International',
    'Camping': 'Outdoor Cooking',
    'Pot Pie': 'Main Dish',
    'Polynesian': 'International',
    'Mango': 'Fruit',
    'Cherries': 'Fruit',
    'Egyptian': 'International',
    'Chard': 'Vegetables',
    'Lime': 'Flavor Profiles',
    'Lemon': 'Flavor Profiles',
    'Brunch': 'Breakfast/Brunch',
    'Toddler Friendly': 'Family-Friendly',
    'Kiwifruit': 'Fruit',
    'Whitefish': 'Seafood',
    'South American': 'International',
    'Malaysian': 'International',
    'Octopus': 'Seafood',
    'Nigerian': 'International',
    'Mixer': 'Cooking Methods',
    'Venezuelan': 'International',
    'Halloween': 'Occasions',
    'Stove Top': 'Cooking Methods',
    'Bread Machine': 'Baked Goods',
    'French Toast': 'Breakfast/Brunch',
    'French Canadian': 'Regional',
    'Sauerkraut': 'Vegetables',
    'West Virginia': 'Regional',
    'Cooker': 'Cooking Methods',
    'Jewish': 'International',
    'Leek': 'Vegetables',
    'Asian Greens': 'Vegetables',
    'Buffalo': 'Meat',
    'Smoothie': 'Beverages',
    'Indian': 'International',
    'Cooking For One': 'Quick and Easy',
    'Kansas': 'Regional',
    'Carrot': 'Vegetables',
    'Australian And New Zealand': 'International',
    'Canadian Bacon': 'Meat',
    'Zucchini': 'Vegetables',
    'Flounder': 'Seafood',
    'Fijian': 'International',
    'Winter Squash': 'Vegetables',
    'Israeli': 'International',
    'Ethnic': 'International',
    'Eggplant': 'Vegetables',
    'Afghan': 'International',
    'Barbecue': 'Cooking Methods',
    'Vegetarian': 'Vegetarian/Vegan',
    'Main Dish': 'Main Dish',
    'Missouri': 'Regional',
    'Salmon': 'Seafood',
    'Pesto': 'Sauces/Condiments',
    'Braised': 'Cooking Methods',
    'Czech': 'International',
    'Salads': 'Salads',
    'Soul Food': 'Regional',
    'Swiss': 'International',
    'Jamaican': 'International',
    'Easter': 'Occasions',
    'Tex-Mex': 'Regional',
    'Northeastern United States': 'Regional',
    'Swiss Cheese': 'Dairy',
    'Pacific Northwestern': 'Regional',
    'Czechoslovakian': 'International',
    'Meals': 'Main Dish',
    'Microwave Appetizers': 'Appetizers',
    'Northwestern United States': 'Regional',
    'Moravian': 'International',
    'Special Occasion': 'Occasions',
    'California': 'Regional',
    'Mandarin Oranges': 'Fruit',
    'Pennsylvania': 'Regional',
    'Brazil': 'International',
    'Thai Sweet Rice': 'Nuts/Seeds/Grains',
    'Freezer': 'Preservation',
    'Cornish Hens': 'Poultry',
    'Arizona': 'Regional',
    'Pacific Islands': 'International',
    'Rhode Island': 'Regional',
    'Georgian': 'International',
    'Pork Tenderloin': 'Meat',
    'No-Cook': 'Quick and Easy',
    'Basque': 'International',
    'Thanksgiving Leftovers': 'Occasions',
    'Avocado': 'Fruit',
    'Alcoholic': 'Beverages',
    'Hamburger': 'Meat',
    'Michigan': 'Regional',
    'Red Beans And Rice': 'Beans/Legumes',
    'Pan Grilling': 'Cooking Methods',
    'Deep Fryer': 'Cooking Methods',
    'Muffins': 'Baked Goods',
    'Pan Frying': 'Cooking Methods',
    'English': 'International',
    'Pressure Cookers': 'Cooking Methods',
    'High Calcium': 'Healthy',
    'Low Saturated Fat': 'Healthy',
    'Game': 'Meat',
    'Gluten-Free': 'Special Dietary Needs',
    'Wheat': 'Nuts/Seeds/Grains',
    'Finnish': 'International',
    'New England': 'Regional',
    'Swedish Meatballs': 'Meat',
    'Algerian': 'International',
    'Pacific Rim': 'International',
    'Thermomix': 'Cooking Methods',
    'Nuts/Seeds': 'Nuts/Seeds/Grains',
    'Vegetables': 'Vegetables',
    'Apple Pie': 'Desserts',
    'Jerky': 'Meat',
    'Condiments, Etc.': 'Sauces/Condiments',
    'New York': 'Regional',
    'Colombia': 'International',
    'Chicago Style': 'Regional',
    'Mediterranean': 'International',
    'Irish': 'International',
    'Pressure Canning': 'Preservation',
    'Middle Eastern': 'International',
    'Plants': 'Vegetarian/Vegan',
    'Southwestern': 'Regional',
    'Jam': 'Sauces/Condiments',
    'Peaches': 'Fruit',
    'Egg-Free': 'Special Dietary Needs',
    'Eastern European': 'International',
    'Soft Drinks': 'Beverages',
    'Picnics': 'Outdoor Cooking',
    'Kiwi': 'Fruit',
    'Ice Cream': 'Desserts',
    'Turkey': 'Poultry',
    'Cherry': 'Fruit',
    'Vegetable Casserole': 'Vegetables',
    'Goat': 'Meat',
    'Dressings': 'Sauces/Condiments',
    'Cabbage': 'Vegetables',
    'Romaine': 'Vegetables',
    'Low Fat': 'Healthy',
    'Sausage': 'Meat',
    'Roasts': 'Meat',
    'Casseroles': 'Main Dish',
    'North American': 'International',
    'High Potassium': 'Healthy',
    'Soups': 'Soups',
    'Main Dishes': 'Main Dish',
    'Crisps': 'Desserts',
    'French Canadian Tourtiere': 'Regional',
    'Irish Soda Bread': 'Baked Goods',
    'Loaves': 'Baked Goods',
    'Crepes': 'Breakfast/Brunch',
    'Potatoes': 'Vegetables',
    'Rhubarb': 'Vegetables',
    'Salmon Lox': 'Seafood',
    'Apricot': 'Fruit',
    'Bbq': 'Cooking Methods',
    'Herb And Spice Mixes': 'Sauces/Condiments',
    'Low Calorie': 'Healthy',
    'Salmon Fillets': 'Seafood',
    'Apricots': 'Fruit',
    'South Carolina': 'Regional',
    'Shrimp': 'Seafood',
    'Chinese Five-Spice': 'Spices/Seasonings',
    'Grains/Cereals': 'Nuts/Seeds/Grains',
    'Honduran': 'International',
    'Chilean': 'International',
    'Flat Shell Fish': 'Seafood',
    'Portuguese Sausage': 'Meat',
    'Cinnamon': 'Spices/Seasonings',
    'Swiss Chard': 'Vegetables',
    'Bulgarian': 'International',
    'Champagne': 'Beverages',
    'Mashed Potatoes': 'Side Dishes',
    'Vermont': 'Regional',
    'Finger Food': 'Appetizers',
    'Side Dish': 'Side Dishes',
    'Steamed': 'Cooking Methods',
    'Raspberry': 'Fruit',
    'Berries And Currants': 'Fruit',
    'Kentucky': 'Regional',
    'Ethnic Foods': 'International',
    'New Hampshire': 'Regional',
    'Alfredo': 'Pasta',
    'Whole Chicken': 'Poultry',
    'North Dakota': 'Regional',
    'Gelatin Desserts': 'Desserts',
    'Iowa': 'Regional',
    'Spreads': 'Sauces/Condiments',
    'Dried Beans': 'Beans/Legumes',
    'Fruit': 'Fruit',
    'Oklahoma': 'Regional',
    'Pennsylvania Dutch Cooking': 'Regional',
    'Broccoli': 'Vegetables',
    'California Style': 'Regional',
    'Fish': 'Seafood',
    'Crab': 'Seafood',
    'Vegetarian/Vegan': 'Vegetarian/Vegan',
    'Brisket': 'Meat',
    'Jewish Holidays': 'Occasions',
    'Mussels/Squid': 'Seafood',
    'Wok': 'Cooking Methods',
    'St. Louis': 'Regional',
    'Breads': 'Baked Goods',
    'Polenta': 'Nuts/Seeds/Grains',
    'Rice Cooker': 'Cooking Methods',
    'Arizona Style': 'Regional',
    'Cucumber': 'Vegetables',
    'Pineapple': 'Fruit',
    'Cheese': 'Dairy',
    'Omelets': 'Breakfast/Brunch',
    'Cantaloupe': 'Fruit',
    'Pancakes And Waffles': 'Breakfast/Brunch',
    'Danish Pastry': 'Baked Goods',
    'Cherry Tomatoes': 'Vegetables',
    'Freshwater Fish': 'Seafood',
    'Lunch/Snacks': 'Lunch/Snacks',
    'Cornmeal': 'Nuts/Seeds/Grains',
    'Squash': 'Vegetables',
    'Meat': 'Meat',
    'Polynesian/Hawaiian': 'Regional',
    'High Protein': 'Healthy',
    'Chutneys': 'Sauces/Condiments',
    'Southwestern United States': 'Regional',
    'Wine': 'Beverages',
    'Smoothies': 'Beverages',
    'South Dakota': 'Regional',
    'High Fiber Cereals': 'Nuts/Seeds/Grains',
    'Chowders': 'Soups',
    'Chiles': 'Spices/Seasonings',
    'Lamb': 'Meat',
    'Mangoes': 'Fruit',
    'Belgian Waffle': 'Breakfast/Brunch',
    'Jamaican Patties': 'International',
    'Mozzarella': 'Dairy',
    'Fish Fry': 'Main Dish',
    'Swiss Fondue': 'International',
    'Jellies': 'Sauces/Condiments',
    'Southwest': 'Regional',
    'Lettuce': 'Vegetables',
    'Poppy Seeds': 'Nuts/Seeds/Grains',
    'Hummus': 'Sauces/Condiments',
    'Icing/Frosting': 'Desserts',
    'Lobster': 'Seafood',
    'St. Patrick\'s Day': 'Occasions',
    'Food Processor/Blender': 'Cooking Methods',
    'Hamburgers': 'Meat',
    'Lemon Juice': 'Flavor Profiles',
    'Valentine\'s Day': 'Occasions',
    'Cranberries': 'Fruit',
    'North Carolina': 'Regional',
    'Baked Goods': 'Baked Goods',
    'Poultry': 'Poultry',
    'Root Vegetables': 'Vegetables',
    'Tamales': 'International',
    'Vegetarian And Vegan': 'Vegetarian/Vegan',
    'Oats': 'Nuts/Seeds/Grains',
    'Brazilian': 'International',
    'High Vitamin C': 'Healthy',
    'Southern': 'Regional',
    'Hawaiian': 'International',
    'Kiwi Fruit': 'Fruit',
    'Ice Cream Maker': 'Cooking Methods',
    'South': 'Regional',
    'Creole/Cajun': 'Regional',
    'Pork': 'Meat',
    'American': 'International',
    'Moroccan Chicken': 'International',
    'Chicken Breasts': 'Poultry',
    'Austrian/German/Swiss': 'International',
    'Baked Potato': 'Side Dishes',
    'Pineapple Juice': 'Flavor Profiles',
    'Lunch': 'Lunch/Snacks',
    'Peanuts': 'Nuts/Seeds/Grains',
    'Mushrooms': 'Vegetables',
    'Smoker': 'Cooking Methods',
    'Stir-Fry': 'Main Dish',
    'Northwest': 'Regional',
    'Breakfast/Brunch': 'Breakfast/Brunch',
    'Chinese': 'International',
    'Hot Dogs/Poultry': 'Poultry',
    'Mixed Drinks': 'Beverages',
    'Grilled Cheese': 'Sandwiches',
    'South African': 'International',
    'Pakistani': 'International',
    'Pakistani And Indian': 'International',
    'Oranges': 'Fruit',
    'Jewish Cuisine': 'International',
    'Peppers': 'Vegetables',
    'Alaska': 'Regional',
    'Jewish Holidays And Events': 'Occasions',
    'Baked Beans': 'Beans/Legumes',
    'Low Sodium': 'Healthy',
    'Smoothie Bowl': 'Beverages',
    'Southern United States': 'Regional',
    'Alaskan King Crab': 'Seafood',
    'Diabetic': 'Special Dietary Needs',
    'Mideast': 'International',
    'Crock Pot': 'Cooking Methods',
    'Sourdough': 'Baked Goods',
    'German': 'International',
    'West Virginia Style': 'Regional',
    'Fish And Seafood': 'Seafood',
    'Puerto Rican': 'International',
    'Minnesota': 'Regional',
    'Okra': 'Vegetables',
    'Bass': 'Seafood',
    'Panfish': 'Seafood',
    'West': 'Regional',
    'Pumpkin': 'Vegetables',
    'Cajun/Creole': 'Regional',
    'Bundt Cake': 'Desserts',
    'Mexican': 'International',
    'Northwest Usa': 'Regional',
    'Congo': 'International',
    'Alcohol': 'Beverages',
    'Christmas': 'Occasions',
    'Czech Republic': 'International',
    'Vinegar': 'Sauces/Condiments',
    'Soy': 'Vegetarian/Vegan',
    'Sushi': 'International',
    'Crockpot': 'Cooking Methods',
    'California/Mexican': 'Regional',
    'Coffee': 'Beverages',
    'Jerk': 'International',
    'Cheddar Cheese': 'Dairy',
    'Minnesota Style': 'Regional',
    'Ranch Dressing': 'Sauces/Condiments',
    'West Coast': 'Regional',
    'Bavarian': 'International',
    'Spanish': 'International',
    'Middle East': 'International',
    'Southeast Asian': 'International',
    'Cheese Balls': 'Appetizers',
    'Bar Cookies': 'Desserts',
    'Zucchini And Yellow Squash': 'Vegetables',
    'Thai': 'International',
    'Latin American': 'International',
    'Peruvian': 'International',
    'Chocolate': 'Desserts',
    'Corn': 'Vegetables',
    'Seafood': 'Seafood',
    'Cucumber Salad': 'Salads',
    'Greek': 'International',
    'Veal': 'Meat',
    'Beef': 'Meat',
    'Southern Us': 'Regional',
    'Central American': 'International',
    'Scones': 'Baked Goods',
    'Beverages': 'Beverages',
    'Pumpkin Seeds': 'Nuts/Seeds/Grains',
    'Indian Subcontinent': 'International',
    'Italian': 'International',
    'Pork Chops': 'Meat',
    'Curry': 'International',
    'Caribbean': 'International',
    'Caribbean And West Indian': 'International',
    'Chinese Regional': 'International',
    'Hawaiian And Pacific Islands': 'International',
    'Canning/Preserving': 'Preservation',
    'Cookies': 'Desserts',
    'Cookies And Brownies': 'Desserts',
    'Hamburger Patties': 'Meat',
    'Sugar-Free': 'Special Dietary Needs',
    'Grapes': 'Fruit',
    'Meatloaf': 'Meat',
    'Greek Style': 'International',
    'Duck': 'Poultry',
    'Egg Nog': 'Beverages',
    'Bhutan': 'International',
    'Spice Blends': 'Spices/Seasonings',
    'Raisins': 'Fruit',
    'Rye': 'Nuts/Seeds/Grains',
    'Omelet/Frittatas': 'Breakfast/Brunch',
    'Canadian': 'International',
    'Ground Beef': 'Meat',
    'Turkey Leftovers': 'Meat',
    'Hummus And Pita': 'Sauces/Condiments',
    'Broccoli Rabe': 'Vegetables',
    'Polish': 'International',
    'Beans And Peas': 'Beans/Legumes',
    'Butternut Squash': 'Vegetables',
    'Cheddar': 'Dairy',
    'Butter': 'Dairy',
    'Sweet Potatoes/Yams': 'Vegetables',
    'Sesame': 'Nuts/Seeds/Grains',
    'Fish Fillets': 'Seafood',
    'New Mexico': 'Regional',
    'Broth': 'Soups',
    'Crock Pot/Slow Cooker': 'Cooking Methods',
    'Russian': 'International',
    'Tuna': 'Seafood',
    'Artichokes': 'Vegetables',
    'Finnish/Nordic': 'International',
    'Low Cholesterol': 'Healthy',
    'Irish Soda Bread Ii': 'Baked Goods',
    'Salsa': 'Sauces/Condiments',
    'North Carolina Style': 'Regional',
    'Nebraska': 'Regional',
    'Creole': 'Regional',
    'Iced/Cold Beverages': 'Beverages',
    'Southern Style': 'Regional',
    'Iowa Style': 'Regional',
    'Low Carbohydrate': 'Healthy',
    'Creole/Creole And Cajun': 'Regional',
    'Brazilian Favourites': 'International',
    'Asian': 'International',
    'Yogurt': 'Dairy',
    'Oregon': 'Regional',
    'Hamburgers/Hot Dogs': 'Meat',
    'Dairy': 'Dairy',
    'Low Protein': 'Healthy',
    'Freezer': 'Preservation',
    'Buttermilk': 'Dairy',
    'Jam/Jelly': 'Sauces/Condiments',
    'Candy': 'Desserts',
    'Main Dish': 'Main Dish',
    'Easy': 'Easy',
    'Korean': 'International',
    'Oktoberfest': 'Occasions',
    'Lobster/Crab/Shrimp': 'Seafood',
    'English': 'International',
    'Belizean': 'International',
    'Californian': 'Regional',
    'Lebanese': 'International',
    'South American': 'International',
    'Thanksgiving': 'Occasions',
    'Indian': 'International',
    'Fish And Chips': 'Seafood',
    'Vegetables/Fruits': 'Vegetables',
    'Vegetarian': 'Vegetarian/Vegan',
    'Kansas': 'Regional',
    'Salsa/Hot Sauces': 'Sauces/Condiments',
    'Salads': 'Salads',
    'Poultry And Game Birds': 'Poultry',
    'Sauces/Condiments': 'Sauces/Condiments',
    'Thanksgiving Leftovers': 'Meat',
    'Juices': 'Beverages',
    'Chile': 'International',
    'Garlic': 'Flavor Profiles',
    'Candy/Candy Making': 'Desserts',
    'Dutch Oven': 'Cooking Methods',
    'Condiments': 'Sauces/Condiments',
    'Main Course': 'Main Dish',
    'South American And Central American': 'International',
    'English/Irish/Scottish': 'International',
    'No-Cook': 'No-Cook',
    'Maryland': 'Regional',
    'Preservation': 'Preservation',
    'Greece': 'International',
    'Nut-Free': 'Special Dietary Needs',
    'Asian/Asian And Indian': 'International',
    'Jamaican': 'International',
    'German Regional': 'International',
    'French': 'International',
    'Yeast Breads': 'Baked Goods',
    'Scandinavian': 'International',
    'Minnesota Recipes': 'Regional',
    'Cake Mixes': 'Baked Goods',
    'Pacific Northwest': 'Regional',
    'Sweet Corn': 'Vegetables',
    'Cake Decorating': 'Desserts',
    'Moroccan': 'International',
    'Dairy-Free': 'Special Dietary Needs',
    'Icelandic': 'International',
    'European': 'International',
    'Meringue': 'Desserts',
    'Low Carb': 'Healthy',
    'Chickpeas': 'Beans/Legumes',
    'Low Sodium Main Dishes': 'Healthy',
    'Potato Salad': 'Salads',
    'Tarts': 'Desserts',
    'Low Sodium Desserts': 'Healthy',
    'New York Style': 'Regional',
    'Cheesecake': 'Desserts',
    'Candy Bars': 'Desserts',
    'North Carolina Style Bbq Sauce': 'Sauces/Condiments',
    'Condiment': 'Sauces/Condiments',
    'Creole And Cajun': 'Regional',
    'Illinois': 'Regional',
    'South African Cuisine': 'International',
    'Mexican/Southwestern': 'Regional',
    'Pacific Rim/Asian': 'International',
    'African': 'International',
    'Shellfish': 'Seafood',
    'English And Irish': 'International',
    'Lentils': 'Beans/Legumes',
    'Ethiopian': 'International',
    'East Indian': 'International',
    'African American': 'International',
    'German And Austrian': 'International',
    'Microwave': 'Cooking Methods',
    'Hawaiian Regional': 'Regional',
    'Mediterranean': 'International',
    'Quick Breads': 'Baked Goods',
    'Honduran': 'International',
    'Snacks': 'Lunch/Snacks',
    'Swiss': 'International',
    'Caribbean And Jamaican': 'International',
    'East Coast': 'Regional',
    'Chinese Regional And Chinese': 'International',
    'Bakery': 'Baked Goods',
    'Kansas City': 'Regional',
    'Party': 'Occasions',
    'Asian/Asian And Pacific Rim': 'International',
    'Southern/Cajun And Creole': 'Regional',
    'Greek Regional': 'International',
    'Valentine\'s Day And Romantic': 'Occasions',
    'Indian And South Asian': 'International',
    'Seafood/Fish': 'Seafood',
    'Caribbean And Latin American': 'International',
    'Beef Roast': 'Meat',
    'German And Austrian And Swiss': 'International',
    'Pasta': 'Pasta',
    'Baking': 'Baked Goods',
    'Potato': 'Vegetables',
    'Pork Loin': 'Meat',
    'Cajun': 'Regional',
    'Peruvian And Bolivian': 'International',
    'Turkey': 'Meat',
    'Ireland': 'International',
    'High Protein Low Carb': 'Healthy',
    'Indian And South African': 'International',
    'Asian/Indian': 'International',
    'Indian Subcontinent And Pakistan': 'International',
    'Potatoes': 'Vegetables',
    'Special Diets': 'Special Dietary Needs',
    'International': 'International',
    'Cabbage': 'Vegetables',
    'Stir-Fries': 'Main Dish',
    'Czechoslovakian': 'International',
    'New England': 'Regional',
    'Asian/Chinese': 'International',
    'Szechuan/Sichuan': 'International',
    'Czech': 'International',
    'Chile Pepper': 'Spices/Seasonings',
    'Microwave Cooking': 'Cooking Methods',
    'Mid-Atlantic': 'Regional',
    'Pizza': 'Main Dish',
    'Caribbean And Puerto Rican': 'International',
    'Pennsylvania': 'Regional',
    'Soups': 'Soups',
    'Iceland': 'International',
    'Low Cholesterol Desserts': 'Healthy',
    'Cocktails': 'Beverages',
    'Easy Main Dish': 'Easy',
    'Sauce': 'Sauces/Condiments',
    'German And Austrian': 'International',
    'Peruvian And Ecuadorian': 'International',
    'Nuts/Seeds': 'Nuts/Seeds/Grains',
    'Kentucky': 'Regional',
    'Colorado': 'Regional',
    'Asian/Japanese': 'International',
    'Japanese': 'International',
    'Jewish': 'International',
    'Middle Eastern': 'International',
    'Baking Mixes': 'Baked Goods',
    'Low Fat': 'Healthy',
    'Alabama': 'Regional',
    'Cheese Appetizers': 'Appetizers',
    'Jewish And Kosher': 'International',
    'Cakes': 'Desserts',
    'Southwestern': 'Regional',
    'Appetizers': 'Appetizers',
    'Alcoholic': 'Beverages',
    'Czechoslovakian And German': 'International',
    'Desserts': 'Desserts',
    'Maryland Regional': 'Regional',
    'Deli': 'Sandwiches',
    'Chile Pepper And Chile Pepper Sauce': 'Spices/Seasonings',
    'Seafood/Fish And Seafood': 'Seafood',
    'Oklahoma': 'Regional',
    'Salads/Salads And Dressings': 'Salads',
    'New England And Mid-Atlantic': 'Regional',
    'Dairy And Eggs': 'Dairy',
    'Soul Food': 'Regional',
    'Swedish': 'International',
    'Alcoholic Beverages': 'Beverages',
    'Eggs': 'Dairy',
    'Iowa': 'Regional',
    'Arizona': 'Regional',
    'Brazilian And South American': 'International',
    'Lunch/Snacks': 'Lunch/Snacks',
    'Noodles': 'Pasta',
    'Hot Drinks': 'Beverages',
    'Texas': 'Regional',
    'Maryland And Virginia': 'Regional',
    'Pacific Northwest And Western': 'Regional',
    'Poultry': 'Poultry',
    'British Isles': 'International',
    'Polish And Eastern European': 'International',
    'Apples': 'Fruit',
    'Italian Regional': 'International',
    'Gluten-Free': 'Special Dietary Needs',
    'Oregon Regional': 'Regional',
    'Mexican Regional': 'Regional',
    'Austrian': 'International',
    'Southwest': 'Regional',
    'Low Fat Main Dishes': 'Healthy',
    'Casserole': 'Main Dish',
    'Southern/Cajun And Creole And Cajun': 'Regional',
    'Eastern European': 'International',
    'Asian/Indian And South Asian': 'International',
    'Casseroles': 'Main Dish',
    'Noodles And Pasta': 'Pasta',
    'Breads': 'Baked Goods',
    'Sauces': 'Sauces/Condiments',
    'Quick': 'Quick',
    'Southwestern And Mexican': 'Regional',
    'New England And Eastern European': 'Regional',
    'Appetizer': 'Appetizers',
    'California': 'Regional',
    'Curries': 'International',
    'Baking Soda': 'Baking Ingredients',
    'Southwestern And Mexican And Tex-Mex': 'Regional',
    'Pennsylvania Dutch': 'International',
    'Southwestern And Mexican And Southwestern': 'Regional',
    'Caribbean And Central American': 'International',
    'Southern/Cajun And Creole And Southern': 'Regional',
    'Arizona And New Mexican': 'Regional',
    'Midwestern': 'Regional',
    'Middle Eastern And Israeli': 'International',
    'Southwestern And Mexican And Mexican': 'Regional',
    'German And Eastern European': 'International',
    'Dairy And Poultry': 'Dairy',
    'Eggs And Dairy': 'Dairy',
    'German And Polish': 'International',
    'British': 'International',
    'Pasta And Noodles': 'Pasta',
    'Irish': 'International',
    'Chinese': 'International',
    'Muffins': 'Baked Goods',
    'Southwestern And Tex-Mex': 'Regional',
    'Eastern European And Russian': 'International',
    'Dessert Sauces': 'Sauces/Condiments',
    'Jewish And Passover': 'International',
    'Northwestern': 'Regional',
    'Northern Italian': 'International',
    'Taco': 'Main Dish',
    'Italian Regional And Italian': 'International',
    'Crock-Pot': 'Cooking Methods',
    'Breads/Bread Machine': 'Baked Goods',
    'Salads/Salads And Vegetables': 'Salads',
    'Cabbage And Corned Beef': 'Meat',
    'Crepes': 'Breakfast/Brunch',
    'Southern/Cajun And Creole And Cajun And Creole': 'Regional',
    'Chocolate Chip': 'Desserts',
    'Sour Cream': 'Dairy',
    'Caribbean And Cuban': 'International',
    'Mexican And Southwestern': 'Regional',
    'Eastern European And German': 'International',
    'German': 'International',
    'Condiments And Sauces': 'Sauces/Condiments',
    'Southern/Cajun And Creole And Southern And Cajun And Creole': 'Regional',
    'Pies': 'Desserts',
    'German And Austrian And German And Austrian': 'International',
    'Healthy': 'Healthy',
    'Low Sodium': 'Healthy',
    'Scandinavian And Swedish': 'International',
    'Eastern European And Hungarian': 'International',
    'German And Austrian And Polish': 'International',
    'German And Austrian And Swiss And Swiss': 'International',
    'Middle Eastern And Jewish': 'International',
    'Peanut Butter': 'Nuts/Seeds/Grains',
    'Southern/Cajun And Creole And Southern And Creole': 'Regional',
    'Fruit': 'Fruit',
    'Southern/Cajun And Creole And Cajun And Creole And Southern': 'Regional',
    'Dips': 'Appetizers',
    'Thai And Southeast Asian': 'International',
    'South American And Mexican': 'International',
    'Quick And Easy': 'Quick',
    'Low Sodium Main Dishes And Healthy': 'Healthy',
    'Canning': 'Preservation',
    'Mexican And South American': 'International',
    'California And Southwestern': 'Regional',
    'Czech And Eastern European': 'International',
    'California And American': 'Regional',
    'Southern/Cajun And Creole And Southern And Creole And Cajun': 'Regional',
    'Greek And Italian': 'International',
    'Low Fat Desserts': 'Healthy',
    'North Dakota': 'Regional',
    'German And Polish And Eastern European': 'International',
    'Jewish And Hanukkah': 'International',
    'Artichoke': 'Vegetables',
    'Bean Soup': 'Soups',
    'Beef Liver': 'Meat',
    'Beginner Cook': 'Uncategorized',
    'Birthday': 'Occasions',
    'Black Bean Soup': 'Soups',
    'Bread Pudding': 'Desserts',
    'Breakfast Casseroles': 'Breakfast/Brunch',
    'Breakfast Eggs': 'Breakfast/Brunch',
    'Broccoli Soup': 'Soups',
    'Buttermilk Biscuits': 'Baked Goods',
    'Cambodian': 'International',
    'Chicken Crock Pot': 'Chicken',
    'Chocolate Chip Cookies': 'Desserts',
    'Coconut Cream Pie': 'Desserts',
    'Dairy Free Foods': 'Special Dietary Needs',
    'Deep Fried': 'Cooking Methods',
    'Desserts Fruit': 'Desserts',
    'Ecuadorean': 'International',
    'Egg Free': 'Special Dietary Needs',
    'Fish Salmon': 'Seafood',
    'Fish Tuna': 'Seafood',
    'From Scratch': 'Cooking Methods',
    'Guatemalan': 'International',
    'Ham And Bean Soup': 'Soups',
    'Hanukkah': 'Occasions',
    'Hunan': 'International',
    'Inexpensive': 'Budget',
    'Iraqi': 'International',
    'Key Lime Pie': 'Desserts',
    'Labor Day': 'Occasions',
    'Lemon Cake': 'Desserts',
    'Macaroni And Cheese': 'Pasta',
    'Main Dish Casseroles': 'Main Dish',
    'Margarita': 'Beverages',
    'Memorial Day': 'Occasions',
    'Mongolian': 'International',
    'Mushroom Soup': 'Soups',
    'Nepalese': 'International',
    'Oatmeal': 'Breakfast/Brunch',
    'Oysters': 'Seafood',
    'Palestinian': 'International',
    'Peanut Butter Pie': 'Desserts',
    'Pot Roast': 'Meat',
    'Potato Soup': 'Soups',
    'Roast Beef Crock Pot': 'Meat',
    'Small Appliance': 'Cooking Methods',
    'Snacks Sweet': 'Desserts',
    'Somalian': 'International',
    'Soups Crock Pot': 'Soups',
    'Spaghetti Sauce': 'Sauces/Condiments',
    'Steam': 'Cooking Methods',
    'Sudanese': 'International',
    'Turkey Gravy': 'Poultry',
    'Wheat Bread': 'Baked Goods',
    'Appetizers, Dietary Restrictions': 'Special Dietary Needs',
    'Beans/Legumes': 'Beans/Legumes',
    'Beef, Cooking Methods': 'Meat',
    'Beverage': 'Beverages',
    'Cake, Dessert': 'Desserts',
    'Casseroles, Main Dish': 'Main Dish',
    'Chicken, Cooking Methods': 'Chicken',
    'Cookies, Dessert': 'Desserts',
    'Cooking Methods': 'Cooking Methods',
    'Cooking Skill Level': 'Uncategorized',
    'Cooking Times': 'Cooking Times',
    'Cuisine': 'International',
    'Cost': 'Budget',
    'Dessert, Fruit': 'Desserts',
    'Dietary Restrictions': 'Special Dietary Needs',
    'Family-Friendly': 'Occasions',
    'Flavor Profiles': 'Flavor Profiles',
    'Gravy, Turkey': 'Poultry',
    'Health/Wellness': 'Healthy',
    'Household': 'Uncategorized',
    'Occasion': 'Occasions',
    'Occasions': 'Occasions',
    'Outdoor Cooking': 'Occasions',
    'Pasta, Cheese, Main Dish': 'Pasta',
    'Pie, Dessert': 'Desserts',
    'Quick and Easy': 'Quick and Easy',
    'Regional': 'Regional',
    'Sauce, Pasta': 'Pasta',
    'Seasonal': 'Seasonal',
    'Side Dishes': 'Side Dishes',
    'Snacks, Dessert': 'Desserts',
    'Soup': 'Soups',
    'Soup, Cooking Methods': 'Soups',
    'Special Dietary Needs': 'Special Dietary Needs',
    'Uncategorized': 'Uncategorized',
    'Gluten Free Appetizers': 'Special Dietary Needs',
    'Easy': 'Quick and Easy',  # overrides the earlier 'Easy': 'Easy' entry (the last duplicate key wins)
}
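
A dictionary literal this long can easily list the same key twice, and Python silently keeps only the last value. For example, 'Hawaiian' appears above as both 'Regional' and 'International', so only 'International' takes effect. A minimal sketch of how such repeats could be spotted, assuming the literal is available as a source string; `find_duplicate_keys` is a hypothetical helper and not part of this notebook:

```python
import ast
from collections import Counter


def find_duplicate_keys(dict_source: str) -> dict:
    """Return {key: occurrences} for keys listed more than once in a dict literal."""
    node = ast.parse(dict_source, mode="eval").body
    if not isinstance(node, ast.Dict):
        raise ValueError("expected a dict literal")
    # Only constant keys (strings, numbers) are counted; `**other` entries are skipped.
    keys = [k.value for k in node.keys if isinstance(k, ast.Constant)]
    counts = Counter(keys)
    return {key: n for key, n in counts.items() if n > 1}


# Tiny illustrative literal, not the full mapping
sample = "{'Hawaiian': 'Regional', 'Turkey': 'Poultry', 'Hawaiian': 'International'}"
duplicates = find_duplicate_keys(sample)
```

Running this on the cell's source would list every repeated key, so the conflicting ones can be resolved deliberately rather than by position.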

Check that the mapping covers every category in the data (an empty set means nothing was missed):

In [39]:
set(recipes['RecipeCategory'].unique()) - set(category_mapping.keys()) 

set()
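
The same coverage check can be wrapped in a small function so it is easy to rerun whenever the mapping changes. This is only a sketch: `unmapped_values` is a hypothetical name, and dropping missing values first is an assumption about how nulls in the column should be treated.

```python
import pandas as pd


def unmapped_values(series: pd.Series, mapping: dict) -> set:
    """Return the distinct non-null values of `series` that have no key in `mapping`."""
    return set(series.dropna().unique()) - set(mapping)


# Illustrative data, not taken from the recipes dataset
demo = pd.Series(["Dessert", "Pork", None, "Dessert"])
demo_mapping = {"Dessert": "Desserts"}
missing = unmapped_values(demo, demo_mapping)
```

This matters because `Series.map` with a dictionary turns any value without a matching key into NaN, so an uncovered category would otherwise disappear quietly.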

In [40]:
recipes['RecipeCategory'] = recipes['RecipeCategory'].map(category_mapping)

In [41]:
recipes['RecipeCategory'].unique(), recipes['RecipeCategory'].nunique()

(array(['Desserts', 'Chicken', 'Beverages', 'Vegetarian/Vegan',
        'Vegetables', 'Regional', 'Sauces/Condiments', 'Main Dish',
        'Beans/Legumes', 'Quick and Easy', 'Special Dietary Needs',
        'Baked Goods', 'Poultry', 'Healthy', 'International',
        'Breakfast/Brunch', 'Nuts/Seeds/Grains', 'Fruit', 'Meat', 'Dairy',
        'Seafood', 'Pasta', 'Lunch/Snacks', 'Cooking Methods', 'Soups',
        'Seasonal', 'Flavor Profiles', 'Uncategorized', 'Occasions',
        'Family-Friendly', 'Side Dishes', 'Preservation', 'Household',
        'Appetizers', 'Outdoor Cooking', 'Budget'], dtype=object),
 36)

We now have 36 categories!

We can still see that some categories barely have any members. I'll merge them into other categories:

In [42]:
recipes['RecipeCategory'].value_counts()

Desserts                 100467
Vegetables                50272
Main Dish                 40224
Meat                      36411
Lunch/Snacks              32564
Quick and Easy            32320
Baked Goods               30084
Chicken                   26358
Sauces/Condiments         22785
Beverages                 22764
Breakfast/Brunch          21859
Healthy                   18371
International             17429
Nuts/Seeds/Grains          8711
Fruit                      8558
Dairy                      8459
Beans/Legumes              7891
Seafood                    6583
Poultry                    6517
Soups                      5206
Pasta                      3962
Side Dishes                2622
Vegetarian/Vegan           1841
Occasions                  1797
Flavor Profiles            1391
Regional                   1313
Family-Friendly            1219
Special Dietary Needs      1142
Uncategorized               936
Seasonal                    814
Cooking Methods             471
Househol
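
To pick merge candidates without scanning the counts by eye, the rare labels could be listed with a threshold. This is a sketch under assumptions: `rare_categories` is a hypothetical helper, and the cutoff is arbitrary; it is not the rule that was used to choose the merges in this notebook.

```python
import pandas as pd


def rare_categories(series: pd.Series, min_count: int) -> list:
    """Return labels that occur fewer than `min_count` times, most frequent first."""
    counts = series.value_counts()
    return counts[counts < min_count].index.tolist()


# Illustrative data, not taken from the recipes dataset
demo = pd.Series(["Desserts"] * 5 + ["Meat"] * 3 + ["Budget"])
rare = rare_categories(demo, min_count=2)
```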

In [43]:
new_cat_list = ['Desserts', 'Chicken', 'Beverages', 'Vegetarian/Vegan',
        'Vegetables', 'Regional', 'Sauces/Condiments', 'Main Dish',
        'Beans/Legumes', 'Quick and Easy', 'Special Dietary Needs',
        'Baked Goods', 'Poultry', 'Healthy', 'International',
        'Breakfast/Brunch', 'Nuts/Seeds/Grains', 'Fruit', 'Meat', 'Dairy',
        'Seafood', 'Pasta', 'Lunch/Snacks', 'Cooking Methods', 'Soups',
        'Seasonal', 'Flavor Profiles', 'Uncategorized', 'Occasions',
        'Family-Friendly', 'Side Dishes', 'Preservation', 'Household',
        'Appetizers', 'Outdoor Cooking', 'Budget']

In [44]:
# Identity mapping (each category maps to itself), used as a template for the merge dictionary
new_cat_dict = {x: x for x in new_cat_list}

In [45]:
new_cat_dict

{'Desserts': 'Desserts',
 'Chicken': 'Chicken',
 'Beverages': 'Beverages',
 'Vegetarian/Vegan': 'Vegetarian/Vegan',
 'Vegetables': 'Vegetables',
 'Regional': 'Regional',
 'Sauces/Condiments': 'Sauces/Condiments',
 'Main Dish': 'Main Dish',
 'Beans/Legumes': 'Beans/Legumes',
 'Quick and Easy': 'Quick and Easy',
 'Special Dietary Needs': 'Special Dietary Needs',
 'Baked Goods': 'Baked Goods',
 'Poultry': 'Poultry',
 'Healthy': 'Healthy',
 'International': 'International',
 'Breakfast/Brunch': 'Breakfast/Brunch',
 'Nuts/Seeds/Grains': 'Nuts/Seeds/Grains',
 'Fruit': 'Fruit',
 'Meat': 'Meat',
 'Dairy': 'Dairy',
 'Seafood': 'Seafood',
 'Pasta': 'Pasta',
 'Lunch/Snacks': 'Lunch/Snacks',
 'Cooking Methods': 'Cooking Methods',
 'Soups': 'Soups',
 'Seasonal': 'Seasonal',
 'Flavor Profiles': 'Flavor Profiles',
 'Uncategorized': 'Uncategorized',
 'Occasions': 'Occasions',
 'Family-Friendly': 'Family-Friendly',
 'Side Dishes': 'Side Dishes',
 'Preservation': 'Preservation',
 'Household': 'Household

In [46]:
new_dict = {'Desserts': 'Desserts',
 'Chicken': 'Chicken',
 'Beverages': 'Beverages',
 'Vegetarian/Vegan': 'Vegetarian/Vegan',
 'Vegetables': 'Vegetables',
 'Regional': 'Regional',
 'Sauces/Condiments': 'Sauces/Condiments',
 'Main Dish': 'Main Dish',
 'Beans/Legumes': 'Beans/Legumes',
 'Quick and Easy': 'Quick and Easy',
 'Special Dietary Needs': 'Special Dietary Needs',
 'Baked Goods': 'Baked Goods',
 'Poultry': 'Poultry',
 'Healthy': 'Healthy',
 'International': 'International',
 'Breakfast/Brunch': 'Breakfast/Brunch',
 'Nuts/Seeds/Grains': 'Nuts/Seeds/Grains',
 'Fruit': 'Fruit',
 'Meat': 'Meat',
 'Dairy': 'Dairy',
 'Seafood': 'Seafood',
 'Pasta': 'Pasta',
 'Lunch/Snacks': 'Lunch/Snacks',
 'Cooking Methods': 'Cooking Methods',
 'Soups': 'Soups',
 'Seasonal': 'Seasonal',
 'Flavor Profiles': 'Flavor Profiles',
 'Uncategorized': 'Uncategorized',
 'Occasions': 'Occasions',
 'Family-Friendly': 'Family-Friendly',
 'Side Dishes': 'Side Dishes',
 'Preservation': 'Uncategorized',
 'Household': 'Uncategorized',
 'Appetizers': 'Uncategorized',
 'Outdoor Cooking': 'Occasions',
 'Budget': 'Uncategorized'}

In [47]:
recipes['RecipeCategory'] = recipes['RecipeCategory'].map(new_dict)
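
One thing to keep in mind when using `Series.map` with a dictionary: any value that is not a key of the dictionary becomes `NaN`. The sketch below uses toy data (the `toy` series and `toy_map` dictionary are made up for illustration, not taken from the dataset) to show one way of listing such values before overwriting a column:

```python
import pandas as pd

# Toy stand-ins for recipes['RecipeCategory'] and new_dict (illustrative only)
toy = pd.Series(['Desserts', 'Budget', 'Camping', 'Household'])
toy_map = {'Desserts': 'Desserts', 'Budget': 'Uncategorized', 'Household': 'Uncategorized'}

mapped = toy.map(toy_map)

# Values missing from the dictionary come back as NaN after map()
unmapped = toy[mapped.isna()].unique().tolist()
print(unmapped)
```

An empty `unmapped` list would mean every category has a target in the dictionary.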

In [48]:
recipes['RecipeCategory'].value_counts(), recipes['RecipeCategory'].nunique()

(Desserts                 100467
 Vegetables                50272
 Main Dish                 40224
 Meat                      36411
 Lunch/Snacks              32564
 Quick and Easy            32320
 Baked Goods               30084
 Chicken                   26358
 Sauces/Condiments         22785
 Beverages                 22764
 Breakfast/Brunch          21859
 Healthy                   18371
 International             17429
 Nuts/Seeds/Grains          8711
 Fruit                      8558
 Dairy                      8459
 Beans/Legumes              7891
 Seafood                    6583
 Poultry                    6517
 Soups                      5206
 Pasta                      3962
 Side Dishes                2622
 Vegetarian/Vegan           1841
 Occasions                  1840
 Flavor Profiles            1391
 Regional                   1313
 Uncategorized              1264
 Family-Friendly            1219
 Special Dietary Needs      1142
 Seasonal                    814
 Cooking M

Good! Now we have 31 major categories.

In [49]:
recipes.sample(2)

Unnamed: 0,RecipeId,Name,AuthorId,AuthorName,Description,RecipeCategory,Keywords,RecipeIngredientQuantities,RecipeIngredientParts,AggregatedRating,ReviewCount,Calories,FatContent,SaturatedFatContent,CholesterolContent,SodiumContent,CarbohydrateContent,FiberContent,SugarContent,ProteinContent,RecipeServings,RecipeYield,RecipeInstructions,url,ingred_quants,ingred_items,YearPublished,MonthPublished,DayPublished,HourPublished,TotalMinutes
143651,150662.0,Spinach-Filled Fish Rolls,248292,Asha1126,"Quick, light, and very easy meal. Serve it wit...",Main Dish,"[Lunch/Snacks, < 30 Mins, Beginner Cook, Easy]","[1, 1 1⁄2, 1⁄4, None, 1⁄3, 1⁄2, 1⁄4, None]","[flounder fillets, black pepper, low-fat mayon...",5.0,3.0,117.8,1.9,0.5,54.6,138.7,2.0,0.4,0.2,22.0,4.0,,[Preheat oven to 400°F Coat 8x8x2inch baking d...,https://www.food.com/recipe/Spinach-Filled-Fis...,"[1, 1 1⁄2, 1⁄4, , 1⁄3, 1⁄2, 1⁄4, ]",[1 lb orange roughy fillets or 1 lb flounder f...,2006,1,8,17,30
129044,135573.0,Cream Cheese Bread,217226,coconutcream,Make and share this Cream Cheese Bread recipe ...,Baked Goods,"[Breakfast, < 4 Hours]","[1, 1⁄2, 1, 1⁄2, 2, 1⁄2, 2, 4, 2, 3⁄4, 1, 1⁄8,...","[sour cream, sugar, salt, butter, active dry y...",,,1747.2,80.5,48.7,371.8,1250.7,226.9,4.6,122.7,31.1,,4 loaves,[Heat sour cream in saucepan over low heat. Ad...,https://www.food.com/recipe/Cream-Cheese-Bread...,"[1, 1⁄2, 1, 1⁄2, 2, 1⁄2, 2, 4, 2, 3⁄4, 1, 1⁄8,...","[cup sour cream, cup sugar, teaspoon salt, cup...",2005,8,30,16,105


In [50]:
recipes.isna().sum()

RecipeId                           0
Name                               0
AuthorId                           0
AuthorName                         0
Description                        5
RecipeCategory                     0
Keywords                           0
RecipeIngredientQuantities         0
RecipeIngredientParts              0
AggregatedRating              252432
ReviewCount                   246700
Calories                           0
FatContent                         0
SaturatedFatContent                0
CholesterolContent                 0
SodiumContent                      0
CarbohydrateContent                0
FiberContent                       0
SugarContent                       0
ProteinContent                     0
RecipeServings                182734
RecipeYield                   347633
RecipeInstructions                 0
url                                0
ingred_quants                      0
ingred_items                       0
YearPublished                      0
M

Note that we have also eliminated the null values for `RecipeCategory`. We still have 5 null values for `Description`; let's drop them too:

In [51]:
# Caution: this acts on the `Description` Series only, so no rows are removed from `recipes`
recipes['Description'].dropna(inplace=True)

In [52]:
recipes['Description'].isna().sum()

5

In [53]:
recipes[recipes['Description'].isna()]

Unnamed: 0,RecipeId,Name,AuthorId,AuthorName,Description,RecipeCategory,Keywords,RecipeIngredientQuantities,RecipeIngredientParts,AggregatedRating,ReviewCount,Calories,FatContent,SaturatedFatContent,CholesterolContent,SodiumContent,CarbohydrateContent,FiberContent,SugarContent,ProteinContent,RecipeServings,RecipeYield,RecipeInstructions,url,ingred_quants,ingred_items,YearPublished,MonthPublished,DayPublished,HourPublished,TotalMinutes
3416,5177.0,Herb Pull-Aparts,1552,Ron Joyce Ripple S,,Baked Goods,"[Breakfast, < 15 Mins, For Large Groups, Oven]","[1 1⁄2, 1⁄4, 1, 1, 1, 1, 1, 1⁄4]","[butter, margarine, parmesan cheese, rosemary,...",5.0,4.0,35.5,3.1,1.9,8.8,80.7,0.3,0.1,0.0,1.6,24.0,,"[Grease a fluted tube Bundt pan., combine chee...",https://www.food.com/recipe/Herb-Pull-Aparts-5177,"[1 1⁄2, 1⁄4, 1, 1, 1, 1, 1, 1⁄4]","[loaves frozen bread dough, thawed, cup butter...",1999,11,30,23,0
3526,5300.0,Chicken Liver Parfait,1992,Jackie Roe-Lawton,,Chicken,"[Chicken, Beef Organ Meats, Beef Liver, Poultr...","[2, 1 1⁄2, 900, 9, 8, 1, None]","[sweet sherry, chicken livers, eggs, nutmeg]",,,4650.2,391.1,208.1,7517.1,1606.2,30.1,0.5,5.8,243.4,1.0,,[Bring cream to simmering point. Puree all oth...,https://www.food.com/recipe/Chicken-Liver-Parf...,"[2, 1 1⁄2, 900, 9, 8, 1, ]","[tablespoons sweet sherry, pints double cream,...",1999,12,5,13,0
3645,5428.0,Hot Swiss Chard Salad,1534,Tonkcats,,International,"[European, Very Low Carbs, < 15 Mins]","[1, 1⁄3, 10 -12, 1⁄4, 1⁄4, None, 2, 3]","[garlic, fresh swiss chard, red wine vinegar, ...",5.0,5.0,928.6,93.0,15.9,344.6,1172.6,7.2,2.1,2.5,16.4,1.0,,"[Marinate garlic clove in oil for 1 hour., Rem...",https://www.food.com/recipe/Hot-Swiss-Chard-Sa...,"[1, 1⁄3, 10 -12, 1⁄4, 1⁄4, , 2, 3]","[clove garlic, cut, cup vegetable oil, ounces ...",1999,12,15,23,0
4590,7426.0,Hidden Valley Mix for Dressing(copycat),1534,Tonkcats,,Sauces/Condiments,[< 15 Mins],"[2, 1⁄2, 1, 1, 1⁄2, 1, 1]","[salt, garlic powder, parsley flakes, mayonnai...",,,119.9,2.3,1.4,9.8,1829.2,16.4,0.8,12.1,9.0,,1 batch,"[Mix instant onion mix, salt, garlic powder, p...",https://www.food.com/recipe/Hidden-Valley-Mix-...,"[2, 1⁄2, 1, 1, 1⁄2, 1, 1]","[teaspoons onion soup mix, teaspoon salt, dash...",1999,12,15,23,0
4591,7427.0,Cranberry Cocktail Meatballs,1534,Tonkcats,,Fruit,"[Meat, < 15 Mins, Oven]","[2, 1, 2, 1⁄2, 1⁄3, 3, 2, 1⁄4, 1⁄4, 1, 12, 1, 1]","[beef, eggs, parsley, ketchup, onions, soy sau...",4.5,5.0,1264.1,109.3,45.1,211.8,1364.7,51.9,4.6,40.9,17.5,6.0,,"[In a large bowl, combine ground beef, cornfla...",https://www.food.com/recipe/Cranberry-Cocktail...,"[2, 1, 2, 1⁄2, 1⁄3, 3, 2, 1⁄4, 1⁄4, 1, 12, 1, 1]","[lbs beef, Ground, cup corn flakes, eggs, cup ...",1999,12,15,23,0


In [54]:
recipes.drop([3416,3526,3645,4591,4590],axis=0,inplace=True)
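
Dropping by hard-coded index labels works, but the same result can be reached without looking the labels up first. A minimal sketch on a toy frame (the `toy` dataframe is hypothetical), assuming `DataFrame.dropna` with `subset` is acceptable here:

```python
import pandas as pd

toy = pd.DataFrame({
    'Name': ['A', 'B', 'C'],
    'Description': ['tasty', None, 'quick'],
    'RecipeYield': [None, '1 batch', None],
})

# Drop rows whose Description is missing; nulls in other columns are ignored
toy = toy.dropna(subset=['Description'])
print(toy['Description'].isna().sum(), len(toy))
```

Calling `dropna` on the dataframe with `subset` removes whole rows, whereas calling it on a single column only affects that Series.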

In [55]:
recipes.isna().sum()

RecipeId                           0
Name                               0
AuthorId                           0
AuthorName                         0
Description                        0
RecipeCategory                     0
Keywords                           0
RecipeIngredientQuantities         0
RecipeIngredientParts              0
AggregatedRating              252430
ReviewCount                   246698
Calories                           0
FatContent                         0
SaturatedFatContent                0
CholesterolContent                 0
SodiumContent                      0
CarbohydrateContent                0
FiberContent                       0
SugarContent                       0
ProteinContent                     0
RecipeServings                182733
RecipeYield                   347629
RecipeInstructions                 0
url                                0
ingred_quants                      0
ingred_items                       0
YearPublished                      0
M

### Dealing with `AggregatedRating`

As we will see shortly, `AggregatedRating`, which is supposed to record the average of the ratings given to a recipe (if it has any), has many wrong entries. Luckily, we have another dataset that collects the reviews of all reviewed recipes, so we can use those reviews to correct `AggregatedRating` in the recipes dataset.

In [56]:
reviews = pd.read_parquet('../reviews.parquet')

In [57]:
reviews.sample(5)

Unnamed: 0,ReviewId,RecipeId,AuthorId,AuthorName,Rating,Review,DateSubmitted,DateModified
905056,1030367,133062,1004745,Keithletes,5,I made these as a x-mas gift for my friends an...,2010-01-14 15:41:28+00:00,2010-01-14 15:41:28+00:00
1006664,1152665,354922,1660227,august26,5,Really good flavor. I dont like alot o; meat ...,2010-10-06 15:56:14+00:00,2010-10-06 15:56:14+00:00
954963,1092682,155369,888475,PixiBlossom,5,So good!! I used my bread machine to make the ...,2010-04-30 09:42:33+00:00,2010-04-30 09:42:33+00:00
159512,170972,96904,172732,Samantha675,4,"So simple, yet it is sooo wonderful.",2005-04-30 06:08:34+00:00,2005-04-30 06:08:34+00:00
768691,851060,285770,47907,Lvs2Cook,5,These were a great treat. I tossed everything...,2009-04-17 09:57:09+00:00,2009-04-17 09:57:09+00:00


In [58]:
reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1401982 entries, 0 to 1401981
Data columns (total 8 columns):
 #   Column         Non-Null Count    Dtype              
---  ------         --------------    -----              
 0   ReviewId       1401982 non-null  int32              
 1   RecipeId       1401982 non-null  int32              
 2   AuthorId       1401982 non-null  int32              
 3   AuthorName     1401982 non-null  object             
 4   Rating         1401982 non-null  int32              
 5   Review         1401982 non-null  object             
 6   DateSubmitted  1401982 non-null  datetime64[ns, UTC]
 7   DateModified   1401982 non-null  datetime64[ns, UTC]
dtypes: datetime64[ns, UTC](2), int32(4), object(2)
memory usage: 64.2+ MB


In [59]:
reviews.isna().sum()

ReviewId         0
RecipeId         0
AuthorId         0
AuthorName       0
Rating           0
Review           0
DateSubmitted    0
DateModified     0
dtype: int64

In [60]:
reviews.describe()

Unnamed: 0,ReviewId,RecipeId,AuthorId,Rating
count,1401982.0,1401982.0,1401982.0,1401982.0
mean,817973.9,152641.2,155863800.0,4.407951
std,528082.1,130111.2,530511100.0,1.272012
min,2.0,38.0,1533.0,0.0
25%,374386.2,47038.75,133680.0,4.0
50%,771780.5,109327.0,330545.0,5.0
75%,1204126.0,231876.8,818359.0,5.0
max,2090347.0,541298.0,2002902000.0,5.0


In [61]:
reviews['RecipeId'].value_counts()

45809     2892
2886      2182
27208     1614
89204     1584
39087     1491
          ... 
229614       1
320225       1
47944        1
270626       1
230339       1
Name: RecipeId, Length: 271678, dtype: int64

**NOTE:** There's a mismatch between the actual aggregated rating and the one recorded in the recipes dataset:

In [62]:
recipes[recipes['RecipeId'] == 992]

Unnamed: 0,RecipeId,Name,AuthorId,AuthorName,Description,RecipeCategory,Keywords,RecipeIngredientQuantities,RecipeIngredientParts,AggregatedRating,ReviewCount,Calories,FatContent,SaturatedFatContent,CholesterolContent,SodiumContent,CarbohydrateContent,FiberContent,SugarContent,ProteinContent,RecipeServings,RecipeYield,RecipeInstructions,url,ingred_quants,ingred_items,YearPublished,MonthPublished,DayPublished,HourPublished,TotalMinutes
702,992.0,Jalapeno Pepper Poppers,1545,Nancy Van Ess,Make and share this Jalapeno Pepper Poppers re...,Vegetables,"[< 30 Mins, For Large Groups]","[8, 4, 4, 6, 1⁄4, 1⁄4, 1⁄4, 1, 1⁄2, None]","[cream cheese, sharp cheddar cheese, monterey ...",5.0,15.0,111.4,9.2,4.9,23.7,172.5,3.2,0.6,0.9,4.3,24.0,,"[In a mixing bowl, combine cheeses, bacon and ...",https://www.food.com/recipe/Jalapeno-Pepper-Po...,"[8, 4, 4, 6, 1⁄4, 1⁄4, 1⁄4, 1, 1⁄2, ]","[ounces cream cheese, softened, ounces sharp c...",1999,9,6,4,30


In [63]:
recipes[recipes['RecipeId'] == 992]['AggregatedRating']

702    5.0
Name: AggregatedRating, dtype: float64

In [64]:
reviews[reviews['RecipeId'] == 992]['Rating'].mean()

4.916666666666667

More examples:

In [65]:
print(f"Recorded rating: {recipes[recipes['RecipeId'] == 45809]['AggregatedRating'][41924]}")
print(f"Actual rating: {reviews[reviews['RecipeId'] == 45809]['Rating'].mean()}")

Recorded rating: 5.0
Actual rating: 4.314661134163209


In [66]:
print(f"Recorded rating: {recipes[recipes['RecipeId'] == 2886]['AggregatedRating'][1436]}")
print(f"Actual rating: {reviews[reviews['RecipeId'] == 2886]['Rating'].mean()}")

Recorded rating: 5.0
Actual rating: 4.218148487626031


So we can drop the inaccurate `AggregatedRating` from the recipes dataset and replace its values with the average `Rating` from the reviews dataset.

In [67]:
# Select the numeric column before aggregating, so the text and datetime columns are never averaged
ratings = reviews.groupby('RecipeId')[['Rating']].mean()
ratings


Unnamed: 0_level_0,Rating
RecipeId,Unnamed: 1_level_1
38,4.250000
39,3.000000
40,4.333333
41,4.500000
42,2.666667
...,...
540899,5.000000
541001,0.000000
541030,5.000000
541195,5.000000


In [68]:
ratings.index

Int64Index([    38,     39,     40,     41,     42,     43,     44,     45,
                46,     47,
            ...
            540716, 540717, 540731, 540836, 540876, 540899, 541001, 541030,
            541195, 541298],
           dtype='int64', name='RecipeId', length=271678)

In [69]:
recipes[['RecipeId','AggregatedRating']].isna().sum()

RecipeId                 0
AggregatedRating    252430
dtype: int64

In [70]:
recipe_ids_with_aggrating = recipes[['RecipeId','AggregatedRating']].dropna()['RecipeId']
recipe_ids_with_aggrating

0             38.0
1             39.0
2             40.0
3             41.0
4             42.0
            ...   
521464    540288.0
521544    540370.0
521568    540416.0
521622    540470.0
521650    540498.0
Name: RecipeId, Length: 269277, dtype: float64

In [71]:
recipe_ids_with_aggrating.values

array([3.80000e+01, 3.90000e+01, 4.00000e+01, ..., 5.40416e+05,
       5.40470e+05, 5.40498e+05])

In [72]:
recipes['CorrectAggregatedRating'] = ''

In [73]:
recipes['CorrectAggregatedRating']

0          
1          
2          
3          
4          
         ..
521707     
521708     
521709     
521710     
521711     
Name: CorrectAggregatedRating, Length: 521707, dtype: object

In [74]:
# get the indices from the ratings dataframe that exist in the recipes dataframe as RecipeId:

ids_with_aggrating = set(recipe_ids_with_aggrating.values)  # a set gives fast membership tests
indices = [i for i in ratings.index if i in ids_with_aggrating]

In [75]:
len(indices)

265965

In [76]:
recipes[recipes['RecipeId'].isin(indices)]['AggregatedRating'].isna().sum()

0

In [77]:
recipes[recipes['RecipeId'].isin(indices)]['AggregatedRating'].index

Int64Index([     0,      1,      2,      3,      4,      5,      6,      7,
                 8,      9,
            ...
            521254, 521268, 521278, 521418, 521439, 521464, 521544, 521568,
            521622, 521650],
           dtype='int64', length=265965)

In [78]:
ratings.loc[indices]['Rating'].values

array([4.25      , 3.        , 4.33333333, ..., 5.        , 5.        ,
       4.        ])

In [79]:
# Assign to the `CorrectAggregatedRating` value of recipes with existing AggregatedRating the actual aggregated rating,
# recorded in the ratings dataframe:

recipes.loc[recipes[recipes['RecipeId'].isin(indices)]['AggregatedRating'].index,'CorrectAggregatedRating'] = ratings.loc[indices]['Rating'].values
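
The positional assignment above relies on both sides being in the same `RecipeId` order. An alternative that aligns on the key itself is `Series.map`. The sketch below uses two small made-up frames (`toy_recipes`, `toy_reviews`) rather than the real data, and is only meant to illustrate the idea:

```python
import pandas as pd

toy_recipes = pd.DataFrame({
    'RecipeId': [40.0, 38.0, 99.0],          # deliberately unsorted, float like in recipes
    'AggregatedRating': [4.5, 4.5, None],
})
toy_reviews = pd.DataFrame({
    'RecipeId': [38, 38, 40, 40, 40, 99],
    'Rating': [4, 5, 5, 4, 4, 3],
})

# Mean rating per recipe, indexed by RecipeId
mean_rating = toy_reviews.groupby('RecipeId')['Rating'].mean()

# Look up each recipe's mean by its id; recipes without reviews would get NaN
corrected = toy_recipes['RecipeId'].astype('int64').map(mean_rating)

# Keep the corrected value only where a rating was recorded originally
toy_recipes['CorrectAggregatedRating'] = corrected.where(toy_recipes['AggregatedRating'].notna())
print(toy_recipes)
```

Because the lookup is keyed on `RecipeId`, the row order of either frame does not matter, and the new column is numeric from the start.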

In [80]:
recipes.loc[recipes[recipes['RecipeId'].isin(indices)]['AggregatedRating'].index].sample(4)

Unnamed: 0,RecipeId,Name,AuthorId,AuthorName,Description,RecipeCategory,Keywords,RecipeIngredientQuantities,RecipeIngredientParts,AggregatedRating,ReviewCount,Calories,FatContent,SaturatedFatContent,CholesterolContent,SodiumContent,CarbohydrateContent,FiberContent,SugarContent,ProteinContent,RecipeServings,RecipeYield,RecipeInstructions,url,ingred_quants,ingred_items,YearPublished,MonthPublished,DayPublished,HourPublished,TotalMinutes,CorrectAggregatedRating
161144,168762.0,Jelly Filled Thumbprints,103876,Chris from Kansas,Make and share this Jelly Filled Thumbprints r...,Desserts,"[Dessert, Cookie & Brownie, Kid Friendly, Easy]","[1, 1, 1⁄2, 1⁄3]",[],4.0,1.0,95.6,4.4,1.9,4.3,69.4,13.6,0.3,7.3,0.8,,36 cookies,"[Heat oven to 350. In large bowl, break up co...",https://www.food.com/recipe/Jelly-Filled-Thumb...,"[1, 1, 1⁄2, 1⁄3]",[(18 ounce) roll refrigerated sugar cookie dou...,2006,5,19,21,60,4.0
149736,156955.0,Blueberry Compote Topping,177083,rainyfriday,Make and share this Blueberry Compote Topping ...,Sauces/Condiments,"[Breakfast, Low Protein, Low Cholesterol, Heal...","[2, 1⁄2, 1, 1⁄2, 2]","[blueberries, cornstarch, orange zest, sugar]",5.0,3.0,55.5,0.2,0.0,0.0,0.8,13.9,1.2,10.8,0.5,6.0,,"[In heavy saucepan, combine all ingredients., ...",https://www.food.com/recipe/Blueberry-Compote-...,"[2, 1⁄2, 1, 1⁄2, 2]","[cups blueberries, cup orange juice, teaspoon ...",2006,2,21,14,15,5.0
150694,157944.0,Malted Milk Ball (Whoppers) Cookies,237154,smltheppl,Make and share this Malted Milk Ball (Whoppers...,Desserts,"[Dessert, Cookie & Brownie, Sweet, For Large G...","[1, 3⁄4, 1⁄4, 1⁄2, 2, 1, 2 3⁄4, 1⁄4, 1, 2 1⁄4]","[unsalted butter, brown sugar, white sugar, eg...",2.0,1.0,126.6,5.7,3.5,25.4,66.7,17.3,0.5,9.3,1.6,36.0,36 cookies,"[Preheat oven to 350°F (180°C)., In a bowl, mi...",https://www.food.com/recipe/Malted-Milk-Ball-(...,"[1, 3⁄4, 1⁄4, 1⁄2, 2, 1, 2 3⁄4, 1⁄4, 1, 2 1⁄4]","[cup unsalted butter, cup firmly packed brown ...",2006,2,27,19,120,2.0
257695,268085.0,Old Fashioned Oatmeal Pancakes,303700,Lorrie in Montreal,Make and share this Old Fashioned Oatmeal Panc...,Breakfast/Brunch,[< 30 Mins],"[1, 1, 1, 2, 1⁄4, 1, 1⁄2, 1⁄2, 1⁄4, 1, 1⁄2]","[quick oatmeal, buttermilk, egg, margarine, fl...",4.5,6.0,214.1,8.8,1.9,55.3,390.8,26.1,2.3,6.6,7.7,4.0,,[Combine the oats and buttermilk in a bowl. Co...,https://www.food.com/recipe/Old-Fashioned-Oatm...,"[1, 1, 1, 2, 1⁄4, 1, 1⁄2, 1⁄2, 1⁄4, 1, 1⁄2]","[cup uncooked quick oatmeal, cup buttermilk, e...",2007,11,27,17,20,4.5


We now have the actual aggregated ratings, recorded in `CorrectAggregatedRating`:

In [81]:
recipes[['RecipeId','AggregatedRating','CorrectAggregatedRating']].dropna()

Unnamed: 0,RecipeId,AggregatedRating,CorrectAggregatedRating
0,38.0,4.5,4.25
1,39.0,3.0,3.0
2,40.0,4.5,4.333333
3,41.0,4.5,4.5
4,42.0,4.5,2.666667
...,...,...,...
521464,540288.0,5.0,5.0
521544,540370.0,5.0,5.0
521568,540416.0,5.0,5.0
521622,540470.0,5.0,5.0


In [82]:
reviews[reviews['RecipeId'] == 40.0]['Rating'].mean()

4.333333333333333

We can now see how many wrong entries existed in our original recipes dataset, as values of `AggregatedRating`:

In [83]:
rated = recipes[['RecipeId','AggregatedRating','CorrectAggregatedRating']].dropna()
(rated['AggregatedRating'] != rated['CorrectAggregatedRating']).sum()

88851
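
One caveat about the count above: at this point `CorrectAggregatedRating` still holds the placeholder `''` for recipes that have an `AggregatedRating` but no entry in `ratings`. `dropna()` does not remove those rows, and `!=` counts each of them as a mismatch. A minimal sketch on made-up values (the `toy` frame is hypothetical) of a stricter count that only compares rows where both values exist:

```python
import numpy as np
import pandas as pd

# Toy frame: '' marks a recipe with a recorded rating but no corrected one
toy = pd.DataFrame({
    'AggregatedRating': [4.5, 3.0, 5.0, 4.0],
    'CorrectAggregatedRating': [4.25, 3.0, '', 4.0],
})

# Naive comparison: the '' row is counted as a mismatch
naive = (toy['AggregatedRating'] != toy['CorrectAggregatedRating']).sum()

# Compare only rows that actually received a corrected rating
has_corrected = toy['CorrectAggregatedRating'] != ''
recorded = toy.loc[has_corrected, 'AggregatedRating'].astype(float)
corrected = toy.loc[has_corrected, 'CorrectAggregatedRating'].astype(float)
strict = (~np.isclose(recorded, corrected)).sum()
print(naive, strict)
```

`np.isclose` is used instead of `!=` so that tiny floating-point differences are not reported as wrong entries.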

Whew, this was some cleaning! Our EDA and models would have been filled with wrong entries if we hadn't fixed this. Let's now drop the original `AggregatedRating`:

In [84]:
recipes.drop(['AggregatedRating'],axis=1,inplace=True)

In [85]:
recipes.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 521707 entries, 0 to 521711
Data columns (total 31 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   RecipeId                    521707 non-null  float64
 1   Name                        521707 non-null  object 
 2   AuthorId                    521707 non-null  int32  
 3   AuthorName                  521707 non-null  object 
 4   Description                 521707 non-null  object 
 5   RecipeCategory              521707 non-null  object 
 6   Keywords                    521707 non-null  object 
 7   RecipeIngredientQuantities  521707 non-null  object 
 8   RecipeIngredientParts       521707 non-null  object 
 9   ReviewCount                 275009 non-null  float64
 10  Calories                    521707 non-null  float64
 11  FatContent                  521707 non-null  float64
 12  SaturatedFatContent         521707 non-null  float64
 13  CholesterolCon

Let's also turn the new values into floats and round them to 2 decimals:

In [86]:
recipes['CorrectAggregatedRating']

0             4.25
1              3.0
2         4.333333
3              4.5
4         2.666667
            ...   
521707            
521708            
521709            
521710            
521711            
Name: CorrectAggregatedRating, Length: 521707, dtype: object

In [87]:
recipes['CorrectAggregatedRating'] = recipes['CorrectAggregatedRating'].apply(lambda x: round(float(x),2) if x != '' else None)
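
A vectorized alternative to the `apply` call, sketched on a toy column (the values in `toy` are illustrative), is `pd.to_numeric` with `errors='coerce'`, which converts the `''` placeholders to `NaN` in one pass:

```python
import pandas as pd

# Toy column mixing floats with '' placeholders
toy = pd.Series([4.25, 3.0, 4.333333, '', 2.666667], dtype=object)

# errors='coerce' turns anything non-numeric, including '', into NaN
cleaned = pd.to_numeric(toy, errors='coerce').round(2)
print(cleaned.tolist())
```

The result has dtype `float64`, so no separate type conversion is needed afterwards.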

In [88]:
recipes['CorrectAggregatedRating']

0         4.25
1         3.00
2         4.33
3         4.50
4         2.67
          ... 
521707     NaN
521708     NaN
521709     NaN
521710     NaN
521711     NaN
Name: CorrectAggregatedRating, Length: 521707, dtype: float64

In [89]:
recipes.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 521707 entries, 0 to 521711
Data columns (total 31 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   RecipeId                    521707 non-null  float64
 1   Name                        521707 non-null  object 
 2   AuthorId                    521707 non-null  int32  
 3   AuthorName                  521707 non-null  object 
 4   Description                 521707 non-null  object 
 5   RecipeCategory              521707 non-null  object 
 6   Keywords                    521707 non-null  object 
 7   RecipeIngredientQuantities  521707 non-null  object 
 8   RecipeIngredientParts       521707 non-null  object 
 9   ReviewCount                 275009 non-null  float64
 10  Calories                    521707 non-null  float64
 11  FatContent                  521707 non-null  float64
 12  SaturatedFatContent         521707 non-null  float64
 13  CholesterolCon

### Dealing with `RecipeYield`

Finally, let's deal with `RecipeYield`.

`RecipeYield` gives the number of recipe outputs that are obtained using the ingredients and their amounts.

We can extract the numbers and work with those, but the problem with that approach is that, e.g., even though 1 cake and 10 rolls are both the result of several ingredients, the latter's number is 10 times the former's. So only extracting the numbers seems pretty blunt and inaccurate. A better approach would scale up, or give more weight to, items like cakes, pizzas and so on.

Another issue is that the numbers are sometimes given as ranges, such as 7-8 instead of 7. Averaging each range would be one option; for the sake of the experiment, the cell below only extracts the leading token of each entry, so ranges are left as they are:

In [91]:
recipes['RecipeYield'].apply(lambda x: x.split(' ')[0] if x is not None else x).dropna()

3          4
5          1
8          1
9         84
10         1
          ..
521670    22
521671    24
521691     3
521694     1
521698    20
Name: RecipeYield, Length: 174078, dtype: object
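
If the ranges were to be averaged, a small helper could do it. The sketch below is an assumption about how that might look, not something used in this notebook: `yield_number` is a hypothetical function, the sample strings are made up, and fractions such as `1 1⁄2` are not handled.

```python
import re

def yield_number(text):
    """Return the leading quantity of a yield string, averaging ranges like '7-8'."""
    if text is None:
        return None
    # One number, optionally followed by a dash and a second number
    match = re.match(r'\s*(\d+(?:\.\d+)?)(?:\s*-\s*(\d+(?:\.\d+)?))?', text)
    if match is None:
        return None
    low = float(match.group(1))
    high = float(match.group(2)) if match.group(2) else low
    return (low + high) / 2

print([yield_number(s) for s in ['4 loaves', '7-8 rolls', '10 -12 servings', None]])
```

It could be applied with `recipes['RecipeYield'].apply(yield_number)`, although the scaling problem between, say, cakes and rolls would remain.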

Due to these complexities, and given that we don't really need `RecipeYield` for any of our analyses going forward, I'll just drop the column.

In [92]:
recipes.drop(['RecipeYield'],axis=1,inplace=True)

In [93]:
recipes.head()

Unnamed: 0,RecipeId,Name,AuthorId,AuthorName,Description,RecipeCategory,Keywords,RecipeIngredientQuantities,RecipeIngredientParts,ReviewCount,Calories,FatContent,SaturatedFatContent,CholesterolContent,SodiumContent,CarbohydrateContent,FiberContent,SugarContent,ProteinContent,RecipeServings,RecipeInstructions,url,ingred_quants,ingred_items,YearPublished,MonthPublished,DayPublished,HourPublished,TotalMinutes,CorrectAggregatedRating
0,38.0,Low-Fat Berry Blue Frozen Dessert,1533,Dancer,Make and share this Low-Fat Berry Blue Frozen ...,Desserts,"[Dessert, Low Protein, Low Cholesterol, Health...","[4, 1⁄4, 1, 1]","[blueberries, granulated sugar, vanilla yogurt...",4.0,170.9,2.5,1.3,8.0,29.8,37.1,3.6,30.2,3.2,4.0,"[Toss 2 cups berries with sugar., Let stand fo...",https://www.food.com/recipe/Low-Fat-Berry-Blue...,"[4, 1⁄4, 1, 1]","[cups blueberries, fresh or frozen, cup granul...",1999,8,9,21,285,4.25
1,39.0,Biryani,1567,elly9812,Make and share this Biryani recipe from Food.com.,Chicken,"[Chicken Thigh & Leg, Chicken, Poultry, Meat, ...","[1, 4, 2, 2, 8, 1⁄4, 8, 1⁄2, 1, 1, 1⁄4, 1⁄4, 1...","[saffron, milk, hot green chili peppers, onion...",1.0,1110.7,58.8,16.6,372.8,368.4,84.4,9.0,20.4,63.4,6.0,[Soak saffron in warm milk for 5 minutes and p...,https://www.food.com/recipe/Biryani-39,"[1, 4, 2, 2, 8, 1⁄4, 8, 1⁄2, 1, 1, 1⁄4, 1⁄4, 1...","[tablespoon saffron, teaspoons milk, warm, hot...",1999,8,29,13,265,3.0
2,40.0,Best Lemonade,1566,Stephen Little,This is from one of my first Good House Keepi...,Beverages,"[Low Protein, Low Cholesterol, Healthy, Summer...","[1 1⁄2, 1, None, 1 1⁄2, None, 3⁄4]","[sugar, lemons, rind of, lemon, zest of, fresh...",10.0,311.1,0.2,0.0,0.0,1.8,81.5,0.4,77.2,0.3,4.0,"[Into a 1 quart Jar with tight fitting lid, pu...",https://www.food.com/recipe/Best-Lemonade-40,"[1 1⁄2, 1, , 1 1⁄2, , 3⁄4]","[cups sugar, tablespoon lemons, rind of or 1 t...",1999,9,5,19,35,4.33
3,41.0,Carina's Tofu-Vegetable Kebabs,1586,Cyclopz,This dish is best prepared a day in advance to...,Vegetarian/Vegan,"[Beans, Vegetable, Low Cholesterol, Weeknight,...","[12, 1, 2, 1, 10, 1, 3, 2, 2, 2, 1, 2, 1⁄2, 1⁄...","[extra firm tofu, eggplant, zucchini, mushroom...",2.0,536.1,24.0,3.8,0.0,1558.6,64.2,17.3,32.1,29.3,2.0,"[Drain the tofu, carefully squeezing out exces...",https://www.food.com/recipe/Carina's-Tofu-Vege...,"[12, 1, 2, 1, 10, 1, 3, 2, 2, 2, 1, 2, 1⁄2, 1⁄...","[ounces extra firm tofu, water-packed, medium ...",1999,9,3,14,260,4.5
4,42.0,Cabbage Soup,1538,Duckie067,Make and share this Cabbage Soup recipe from F...,Vegetables,"[Low Protein, Vegan, Low Cholesterol, Healthy,...","[46, 4, 1, 2, 1]","[plain tomato juice, cabbage, onion, carrots, ...",11.0,103.6,0.4,0.1,0.0,959.3,25.1,4.8,17.7,4.3,4.0,"[Mix everything together and bring to a boil.,...",https://www.food.com/recipe/Cabbage-Soup-42,"[46, 4, 1, 2, 1]","[ounces plain tomato juice, cups cabbage, shre...",1999,9,19,6,50,2.67


In [95]:
recipes.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 521707 entries, 0 to 521711
Data columns (total 30 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   RecipeId                    521707 non-null  float64
 1   Name                        521707 non-null  object 
 2   AuthorId                    521707 non-null  int32  
 3   AuthorName                  521707 non-null  object 
 4   Description                 521707 non-null  object 
 5   RecipeCategory              521707 non-null  object 
 6   Keywords                    521707 non-null  object 
 7   RecipeIngredientQuantities  521707 non-null  object 
 8   RecipeIngredientParts       521707 non-null  object 
 9   ReviewCount                 275009 non-null  float64
 10  Calories                    521707 non-null  float64
 11  FatContent                  521707 non-null  float64
 12  SaturatedFatContent         521707 non-null  float64
 13  CholesterolCon

In [94]:
recipes.isna().sum()

RecipeId                           0
Name                               0
AuthorId                           0
AuthorName                         0
Description                        0
RecipeCategory                     0
Keywords                           0
RecipeIngredientQuantities         0
RecipeIngredientParts              0
ReviewCount                   246698
Calories                           0
FatContent                         0
SaturatedFatContent                0
CholesterolContent                 0
SodiumContent                      0
CarbohydrateContent                0
FiberContent                       0
SugarContent                       0
ProteinContent                     0
RecipeServings                182733
RecipeInstructions                 0
url                                0
ingred_quants                      0
ingred_items                       0
YearPublished                      0
MonthPublished                     0
DayPublished                       0
H

We now save our dataframe for later use. 

**NOTE:** We left many of the categorical columns, as well as the `url` column, untouched here. We will deal with these in other notebooks; here we only wanted to perform some basic cleaning and feature engineering.

In [96]:
recipes.to_parquet('BasicCleanData.parquet') 