# Intro
This capstone explores how various recipe features affect a recipe's online rating. The dataset consists of Food.com recipes scraped for their cooking instructions, nutritional info, and other key features. It is a rather large dataset (500k+ rows), so we have a lot of room to clean the data and narrow our scope.

This analysis may be useful for bloggers, cookbook writers, and aspiring cooks who want a sense of what kinds of food people are interested in making and react positively to.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from lib.sb_utils import save_file
from datetime import datetime

In [2]:
recipes = pd.read_csv('data/recipes.csv')

In [3]:
recipes.head()

Unnamed: 0,RecipeId,Name,AuthorId,AuthorName,CookTime,PrepTime,TotalTime,DatePublished,Description,Images,...,SaturatedFatContent,CholesterolContent,SodiumContent,CarbohydrateContent,FiberContent,SugarContent,ProteinContent,RecipeServings,RecipeYield,RecipeInstructions
0,38,Low-Fat Berry Blue Frozen Dessert,1533,Dancer,PT24H,PT45M,PT24H45M,1999-08-09T21:46:00Z,Make and share this Low-Fat Berry Blue Frozen ...,"c(""https://img.sndimg.com/food/image/upload/w_...",...,1.3,8.0,29.8,37.1,3.6,30.2,3.2,4.0,,"c(""Toss 2 cups berries with sugar."", ""Let stan..."
1,39,Biryani,1567,elly9812,PT25M,PT4H,PT4H25M,1999-08-29T13:12:00Z,Make and share this Biryani recipe from Food.com.,"c(""https://img.sndimg.com/food/image/upload/w_...",...,16.6,372.8,368.4,84.4,9.0,20.4,63.4,6.0,,"c(""Soak saffron in warm milk for 5 minutes and..."
2,40,Best Lemonade,1566,Stephen Little,PT5M,PT30M,PT35M,1999-09-05T19:52:00Z,This is from one of my first Good House Keepi...,"c(""https://img.sndimg.com/food/image/upload/w_...",...,0.0,0.0,1.8,81.5,0.4,77.2,0.3,4.0,,"c(""Into a 1 quart Jar with tight fitting lid, ..."
3,41,Carina's Tofu-Vegetable Kebabs,1586,Cyclopz,PT20M,PT24H,PT24H20M,1999-09-03T14:54:00Z,This dish is best prepared a day in advance to...,"c(""https://img.sndimg.com/food/image/upload/w_...",...,3.8,0.0,1558.6,64.2,17.3,32.1,29.3,2.0,4 kebabs,"c(""Drain the tofu, carefully squeezing out exc..."
4,42,Cabbage Soup,1538,Duckie067,PT30M,PT20M,PT50M,1999-09-19T06:19:00Z,Make and share this Cabbage Soup recipe from F...,"""https://img.sndimg.com/food/image/upload/w_55...",...,0.1,0.0,959.3,25.1,4.8,17.7,4.3,4.0,,"c(""Mix everything together and bring to a boil..."


In [4]:
recipes.describe()

Unnamed: 0,RecipeId,AuthorId,AggregatedRating,ReviewCount,Calories,FatContent,SaturatedFatContent,CholesterolContent,SodiumContent,CarbohydrateContent,FiberContent,SugarContent,ProteinContent,RecipeServings
count,522517.0,522517.0,269294.0,275028.0,522517.0,522517.0,522517.0,522517.0,522517.0,522517.0,522517.0,522517.0,522517.0,339606.0
mean,271821.43697,45725850.0,4.632014,5.227784,484.43858,24.614922,9.559457,86.487003,767.2639,49.089092,3.843242,21.878254,17.46951,8.606191
std,155495.878422,292971400.0,0.641934,20.381347,1397.116649,111.485798,46.622621,301.987009,4203.621,180.822062,8.603163,142.620191,40.128837,114.319809
min,38.0,27.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
25%,137206.0,69474.0,4.5,1.0,174.2,5.6,1.5,3.8,123.3,12.8,0.8,2.5,3.5,4.0
50%,271758.0,238937.0,5.0,2.0,317.1,13.8,4.7,42.6,353.3,28.2,2.2,6.4,9.1,6.0
75%,406145.0,565828.0,5.0,4.0,529.1,27.4,10.8,107.9,792.2,51.1,4.6,17.9,25.0,8.0
max,541383.0,2002886000.0,5.0,3063.0,612854.6,64368.1,26740.6,130456.4,1246921.0,108294.6,3012.0,90682.3,18396.2,32767.0


In [5]:
recipes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 522517 entries, 0 to 522516
Data columns (total 28 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   RecipeId                    522517 non-null  int64  
 1   Name                        522517 non-null  object 
 2   AuthorId                    522517 non-null  int64  
 3   AuthorName                  522517 non-null  object 
 4   CookTime                    439972 non-null  object 
 5   PrepTime                    522517 non-null  object 
 6   TotalTime                   522517 non-null  object 
 7   DatePublished               522517 non-null  object 
 8   Description                 522512 non-null  object 
 9   Images                      522516 non-null  object 
 10  RecipeCategory              521766 non-null  object 
 11  Keywords                    505280 non-null  object 
 12  RecipeIngredientQuantities  522514 non-null  object 
 13  RecipeIngredie

### Some columns don't seem relevant to us - like description, images, etc. We'll drop those first

In [6]:
drop=['RecipeId','AuthorId', 'AuthorName','Description', 'Images','Keywords','RecipeIngredientQuantities', \
              'RecipeIngredientParts','RecipeYield', 'RecipeInstructions']
recipes.drop(drop, axis=1, inplace=True)
recipes.reset_index(drop=True,inplace=True)
recipes.head()

Unnamed: 0,Name,CookTime,PrepTime,TotalTime,DatePublished,RecipeCategory,AggregatedRating,ReviewCount,Calories,FatContent,SaturatedFatContent,CholesterolContent,SodiumContent,CarbohydrateContent,FiberContent,SugarContent,ProteinContent,RecipeServings
0,Low-Fat Berry Blue Frozen Dessert,PT24H,PT45M,PT24H45M,1999-08-09T21:46:00Z,Frozen Desserts,4.5,4.0,170.9,2.5,1.3,8.0,29.8,37.1,3.6,30.2,3.2,4.0
1,Biryani,PT25M,PT4H,PT4H25M,1999-08-29T13:12:00Z,Chicken Breast,3.0,1.0,1110.7,58.8,16.6,372.8,368.4,84.4,9.0,20.4,63.4,6.0
2,Best Lemonade,PT5M,PT30M,PT35M,1999-09-05T19:52:00Z,Beverages,4.5,10.0,311.1,0.2,0.0,0.0,1.8,81.5,0.4,77.2,0.3,4.0
3,Carina's Tofu-Vegetable Kebabs,PT20M,PT24H,PT24H20M,1999-09-03T14:54:00Z,Soy/Tofu,4.5,2.0,536.1,24.0,3.8,0.0,1558.6,64.2,17.3,32.1,29.3,2.0
4,Cabbage Soup,PT30M,PT20M,PT50M,1999-09-19T06:19:00Z,Vegetable,4.5,11.0,103.6,0.4,0.1,0.0,959.3,25.1,4.8,17.7,4.3,4.0


### Null Ratings
Next we'll look for recipes that don't have an associated rating, since we want to learn about how other features impact rating.

In [7]:
recipes.isna().sum()

Name                        0
CookTime                82545
PrepTime                    0
TotalTime                   0
DatePublished               0
RecipeCategory            751
AggregatedRating       253223
ReviewCount            247489
Calories                    0
FatContent                  0
SaturatedFatContent         0
CholesterolContent          0
SodiumContent               0
CarbohydrateContent         0
FiberContent                0
SugarContent                0
ProteinContent              0
RecipeServings         182911
dtype: int64

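Looks like we have a large number of null values for AggregatedRating and ReviewCount (about 250k each), as well as for RecipeServings (about 183k) and CookTime (about 83k), and a few for RecipeCategory. The dataset is large enough that we can go ahead and drop every row that contains a null.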

In [8]:
recipes.dropna(inplace=True)

In [9]:
recipes.reset_index(drop=True,inplace=True)
recipes.shape

(145363, 18)
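A blanket `dropna()` removes a row if any column is null, which is why the frame shrinks from 522,517 to 145,363 rows even though only about 253k ratings are missing. As a point of comparison, the sketch below shows the difference between that and dropping only rows that lack the target column. The `toy` frame and its values are made up for illustration; this is not the approach taken in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the recipes data (values are made up).
toy = pd.DataFrame({
    "AggregatedRating": [4.5, np.nan, 5.0, 3.0],
    "CookTime": ["PT20M", "PT1H", None, "PT5M"],
    "RecipeServings": [4.0, 6.0, 2.0, np.nan],
})

# Drop every row that has a null in any column, as in the cell above.
all_dropped = toy.dropna()

# Drop only the rows that are missing the target column.
target_only = toy.dropna(subset=["AggregatedRating"])

print(len(all_dropped), len(target_only))
```

The second form keeps more rows but leaves nulls in the other columns, which would then need to be handled later.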

We should also double-check that none of the remaining recipes has a ReviewCount equal to 0. An empty result means there are none.

In [10]:
recipes[recipes.ReviewCount==0]

Unnamed: 0,Name,CookTime,PrepTime,TotalTime,DatePublished,RecipeCategory,AggregatedRating,ReviewCount,Calories,FatContent,SaturatedFatContent,CholesterolContent,SodiumContent,CarbohydrateContent,FiberContent,SugarContent,ProteinContent,RecipeServings


## 2. Explore new df

In [11]:
recipes.describe()

Unnamed: 0,AggregatedRating,ReviewCount,Calories,FatContent,SaturatedFatContent,CholesterolContent,SodiumContent,CarbohydrateContent,FiberContent,SugarContent,ProteinContent,RecipeServings
count,145363.0,145363.0,145363.0,145363.0,145363.0,145363.0,145363.0,145363.0,145363.0,145363.0,145363.0,145363.0
mean,4.616921,5.312452,371.119965,18.73519,7.22446,76.132228,590.804455,33.856629,3.13861,11.773142,17.205572,8.869272
std,0.648521,22.340973,397.682555,28.776871,10.61553,109.175863,2585.703973,43.300951,4.791035,25.805539,22.382921,150.626859
min,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
25%,4.5,1.0,183.2,6.2,1.6,10.2,152.0,12.6,0.9,2.2,4.1,4.0
50%,5.0,2.0,304.0,13.2,4.6,51.3,364.2,26.8,2.1,5.3,10.6,6.0
75%,5.0,4.0,466.2,24.1,9.7,105.6,729.2,44.9,4.0,13.5,26.3,8.0
max,5.0,3063.0,41770.2,4701.1,992.1,11823.8,704129.6,4320.9,835.7,3623.9,3270.3,32767.0


## Initial observations
#### Our max value for each column is way above the others:
- RecipeServings: 30k seems like a lot! That's likely an error, so we'll keep only recipes with fewer than 12 servings, which is closer to a standard recipe. This will also mitigate the problem of our data not differentiating whether nutritional info is for the entire recipe or an individual serving.
- Calories: For consistency, we probably shouldn't have foods with no calories or with 1,000 or more. If we narrow the range, we'll be comparing more similar recipes. (Only the upper cutoff is applied in this notebook; zero-calorie rows remain.)
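One way to sanity-check a candidate cutoff before applying it is to look at the upper quantiles and at the share of rows the cutoff would remove. The sketch below uses a small made-up frame (`toy`), not the real data; the cutoff of 12 servings is the one used in the next cell.

```python
import pandas as pd

# Made-up values; the last row mimics the extreme maxima seen in describe().
toy = pd.DataFrame({
    "Calories": [120.0, 310.5, 450.0, 980.0, 41770.2],
    "RecipeServings": [2.0, 4.0, 6.0, 8.0, 32767.0],
})

# Median, upper quartile, and maximum of each column.
upper = toy.quantile([0.5, 0.75, 1.0])
print(upper)

# Fraction of rows that a "fewer than 12 servings" rule would drop.
share_dropped = (toy["RecipeServings"] >= 12).mean()
print(share_dropped)
```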

In [12]:
recipes[(recipes.RecipeServings < 12)].shape

(118084, 18)

In [13]:
recipes=recipes[(recipes.RecipeServings < 12)]
recipes.describe()

Unnamed: 0,AggregatedRating,ReviewCount,Calories,FatContent,SaturatedFatContent,CholesterolContent,SodiumContent,CarbohydrateContent,FiberContent,SugarContent,ProteinContent,RecipeServings
count,118084.0,118084.0,118084.0,118084.0,118084.0,118084.0,118084.0,118084.0,118084.0,118084.0,118084.0,118084.0
mean,4.617438,5.236196,402.32447,20.389499,7.767378,84.819916,665.925261,35.126941,3.507262,11.048841,19.858911,5.205625
std,0.636985,21.946851,425.046305,30.843039,11.419519,117.427759,2855.882463,46.856677,5.181448,27.627969,23.761114,2.146368
min,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
25%,4.5,1.0,209.3,7.0,1.7,11.4,188.6,12.4,1.1,2.1,5.7,4.0
50%,5.0,2.0,333.2,14.6,5.0,62.5,439.9,27.5,2.4,4.8,15.0,4.0
75%,5.0,4.0,497.5,26.1,10.4,116.9,817.625,46.6,4.5,11.2,29.1,6.0
max,5.0,3063.0,41770.2,4701.1,992.1,11823.8,704129.6,4320.9,835.7,3623.9,3270.3,11.0


From looking around on the internet we can find some baselines for how much fat, sugar, protein, and carbs we can reasonably expect per serving. We'll drop extreme outliers, since they may reflect data-entry mistakes that would compromise our analysis.

In [14]:
recipes = recipes[recipes.Calories<1000]

In [15]:
recipes[recipes.FatContent < 70].shape, recipes[recipes.SaturatedFatContent < 30].shape

((113433, 18), (112893, 18))

In [16]:
recipes = recipes[recipes.FatContent < 70]
recipes = recipes[recipes.SaturatedFatContent < 30]
recipes.describe()

Unnamed: 0,AggregatedRating,ReviewCount,Calories,FatContent,SaturatedFatContent,CholesterolContent,SodiumContent,CarbohydrateContent,FiberContent,SugarContent,ProteinContent,RecipeServings
count,112628.0,112628.0,112628.0,112628.0,112628.0,112628.0,112628.0,112628.0,112628.0,112628.0,112628.0,112628.0
mean,4.617608,5.259234,351.341597,16.867637,6.421974,75.797248,605.350781,31.98316,3.328197,9.797061,18.18162,5.260379
std,0.634997,21.966305,198.307737,13.287817,6.013337,83.355756,1672.547248,25.496569,3.499016,14.224176,15.402431,2.135713
min,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
25%,4.5,1.0,203.5,6.7,1.6,10.2,180.6,12.2,1.1,2.1,5.5,4.0
50%,5.0,2.0,321.1,13.9,4.7,58.9,424.15,26.9,2.4,4.7,14.1,4.0
75%,5.0,4.0,467.9,24.1,9.5,108.9,782.725,45.1,4.4,10.8,27.9,6.0
max,5.0,3063.0,999.8,69.9,29.9,1739.8,351950.2,258.5,70.7,249.8,171.0,11.0


In [17]:
recipes[recipes.CholesterolContent > 400].shape, recipes[recipes.SodiumContent > 2000].shape

((1068, 18), (3008, 18))

In [18]:
recipes = recipes[recipes.CholesterolContent < 400]
recipes = recipes[recipes.SodiumContent < 2000]
recipes.describe()

Unnamed: 0,AggregatedRating,ReviewCount,Calories,FatContent,SaturatedFatContent,CholesterolContent,SodiumContent,CarbohydrateContent,FiberContent,SugarContent,ProteinContent,RecipeServings
count,108632.0,108632.0,108632.0,108632.0,108632.0,108632.0,108632.0,108632.0,108632.0,108632.0,108632.0,108632.0
mean,4.617093,5.260227,344.678661,16.476744,6.291254,70.794306,510.727784,31.758755,3.299498,9.746492,17.646658,5.304625
std,0.63489,22.083658,193.893558,13.006363,5.922506,71.344878,424.335884,25.212445,3.449187,14.18417,14.966645,2.127304
min,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
25%,4.5,1.0,200.175,6.5,1.6,9.0,173.875,12.1,1.1,2.1,5.3,4.0
50%,5.0,2.0,315.8,13.6,4.6,56.6,407.5,26.7,2.4,4.6,13.4,4.0
75%,5.0,4.0,458.9,23.5,9.3,105.8,742.8,44.8,4.4,10.7,27.3,6.0
max,5.0,3063.0,999.8,69.9,29.9,399.9,1999.7,258.5,70.7,249.8,137.4,11.0


In [19]:
recipes[recipes.CarbohydrateContent > 70].shape, recipes[recipes.FiberContent > 15].shape

((8597, 18), (1511, 18))

In [20]:
recipes[recipes.CarbohydrateContent > 80].sort_values('CarbohydrateContent',ascending=True)[:10]

Unnamed: 0,Name,CookTime,PrepTime,TotalTime,DatePublished,RecipeCategory,AggregatedRating,ReviewCount,Calories,FatContent,SaturatedFatContent,CholesterolContent,SodiumContent,CarbohydrateContent,FiberContent,SugarContent,ProteinContent,RecipeServings
31459,Candy Apple Pie,PT50M,PT20M,PT1H10M,2004-04-25T19:59:00Z,Pie,5.0,5.0,607.4,32.0,12.3,32.3,457.4,80.1,3.1,52.9,3.6,6.0
91971,Ww Big Breakfast Cookies,PT20M,PT5M,PT25M,2008-01-03T22:53:00Z,Breakfast,4.0,4.0,414.5,2.2,0.6,9.0,397.6,80.1,4.4,41.5,20.8,4.0
102670,Stuffed Pork Chops,PT10M,PT20M,PT30M,2008-07-13T02:30:00Z,Pork,5.0,1.0,588.1,16.4,5.4,75.0,462.1,80.1,3.4,41.1,30.6,6.0
121524,Chocolate-Cherry Loaf,PT55M,PT35M,PT1H30M,2009-08-26T09:57:00Z,Quick Breads,4.5,5.0,507.3,18.9,10.8,133.5,391.5,80.1,4.6,46.3,9.4,6.0
2762,Asian Vegetables With Tofu and Coconut Milk,PT8M,PT5M,PT13M,2001-09-20T10:22:00Z,Soy/Tofu,5.0,2.0,613.2,28.9,17.8,0.0,1618.3,80.1,31.7,28.6,30.2,2.0
145127,Instant Pot &quot;Sunday Night&quot; Pasta,PT20M,PT10M,PT30M,2018-03-07T21:43:00Z,Kid Friendly,5.0,1.0,412.4,7.3,1.4,2.6,917.8,80.1,12.1,12.2,8.3,6.0
85865,Linda's Toll House Pie,PT1H,PT10M,PT1H10M,2007-09-21T17:11:00Z,Pie,5.0,6.0,784.2,51.5,23.5,140.3,318.6,80.1,4.0,56.4,8.7,8.0
36426,Plum Cobbler,PT35M,PT15M,PT50M,2004-10-20T19:59:00Z,Dessert,4.5,2.0,418.7,11.0,2.2,2.4,379.5,80.1,3.8,38.4,5.6,6.0
1463,Grandma Brenda's Sour Cream Coffee Cake,PT40M,PT30M,PT1H10M,2001-01-27T13:08:00Z,Breads,4.0,3.0,643.8,33.6,15.3,133.9,407.4,80.1,2.1,52.5,8.9,8.0
82662,Comfort Me With Apple Crisp,PT55M,PT20M,PT1H15M,2007-07-30T06:07:00Z,Dessert,5.0,6.0,504.3,20.4,8.2,20.4,141.8,80.1,7.6,52.8,5.7,3.0


In [21]:
recipes[recipes.FiberContent > 15]

Unnamed: 0,Name,CookTime,PrepTime,TotalTime,DatePublished,RecipeCategory,AggregatedRating,ReviewCount,Calories,FatContent,SaturatedFatContent,CholesterolContent,SodiumContent,CarbohydrateContent,FiberContent,SugarContent,ProteinContent,RecipeServings
3,Carina's Tofu-Vegetable Kebabs,PT20M,PT24H,PT24H20M,1999-09-03T14:54:00Z,Soy/Tofu,4.5,2.0,536.1,24.0,3.8,0.0,1558.6,64.2,17.3,32.1,29.3,2.0
11,Betty Crocker's Southwestern Guacamole Dip,PT2H,PT5M,PT2H5M,1999-09-15T03:25:00Z,Southwestern U.S.,5.0,4.0,415.9,36.9,5.4,0.0,310.6,24.9,17.3,2.8,5.5,4.0
15,"Black Bean, Corn, and Tomato Salad",PT15M,PT10M,PT25M,1999-08-19T05:12:00Z,Black Beans,5.0,23.0,407.8,15.4,2.3,0.0,20.0,55.8,16.6,4.3,17.1,2.0
208,Garam Masala,PT33M,PT45M,PT1H18M,1999-08-27T04:45:00Z,Asian,4.0,1.0,410.1,24.8,3.8,0.0,215.9,64.5,31.5,2.3,14.2,1.0
297,Barley &amp; Mushroom Stuffed Green Bell Peppers,PT30M,PT15M,PT45M,1999-09-28T05:45:00Z,Vegetable,5.0,4.0,503.9,14.9,6.6,26.4,429.6,77.9,16.5,11.7,18.1,4.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
144837,Nepali Bara,PT30M,PT30M,PT1H,2016-06-06T14:17:00Z,Nepalese,5.0,1.0,650.2,38.2,5.6,125.9,81.1,44.0,19.3,3.8,34.1,4.0
144850,Nassau Peas and Rice,PT40M,PT10M,PT50M,2016-07-05T15:48:00Z,Caribbean,5.0,1.0,764.9,9.2,1.4,0.0,527.6,138.6,21.6,2.2,33.8,6.0
145224,Cauliflower Rice Burrito Bowl,PT5M,PT10M,PT15M,2018-10-02T14:29:00Z,Cauliflower,5.0,1.0,810.0,44.1,6.3,0.0,275.8,85.7,30.6,13.0,26.8,2.0
145290,Margarita Chicken Salad Sandwiches,PT2H,PT45M,PT2H45M,2019-04-12T15:35:00Z,Chicken,5.0,2.0,767.1,37.6,10.5,135.0,459.8,68.2,19.1,28.1,46.7,6.0


An example of a high-carb food like coffee cake has only about 60 g of carbs per serving, and an example of a high-fiber meal has 12 g of fiber. Let's eliminate anything with 70 g or more of carbs or 15 g or more of fiber, as it is either a mistake or the result of multiplying the nutrition facts by the number of servings.

In [22]:
recipes = recipes[recipes.CarbohydrateContent < 70]
recipes = recipes[recipes.FiberContent < 15]
recipes.describe()

Unnamed: 0,AggregatedRating,ReviewCount,Calories,FatContent,SaturatedFatContent,CholesterolContent,SodiumContent,CarbohydrateContent,FiberContent,SugarContent,ProteinContent,RecipeServings
count,99168.0,99168.0,99168.0,99168.0,99168.0,99168.0,99168.0,99168.0,99168.0,99168.0,99168.0,99168.0
mean,4.61964,5.330459,320.044778,16.137958,6.16393,71.110733,501.084729,26.450199,2.845626,8.060062,17.401364,5.337145
std,0.631562,22.787846,177.111451,12.922131,5.836765,71.333861,417.378153,17.900253,2.55714,9.969453,15.030483,2.126027
min,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
25%,4.5,1.0,189.7,6.3,1.5,9.8,169.875,11.1,1.0,1.9,5.1,4.0
50%,5.0,2.0,296.05,13.2,4.5,57.1,400.3,23.9,2.2,4.3,12.9,5.0
75%,5.0,4.0,423.2,23.0,9.1,105.8,728.9,39.5,3.9,9.5,27.2,6.0
max,5.0,3063.0,999.8,69.9,29.9,399.9,1999.7,69.9,14.9,67.5,137.4,11.0


High-protein meats tend to have 20-40 grams of protein per serving, so let's drop rows with 45 grams or more.

In [23]:
recipes = recipes[recipes.ProteinContent < 45]
recipes.describe()

Unnamed: 0,AggregatedRating,ReviewCount,Calories,FatContent,SaturatedFatContent,CholesterolContent,SodiumContent,CarbohydrateContent,FiberContent,SugarContent,ProteinContent,RecipeServings
count,94156.0,94156.0,94156.0,94156.0,94156.0,94156.0,94156.0,94156.0,94156.0,94156.0,94156.0,94156.0
mean,4.619355,5.257689,303.593669,15.238402,5.888887,64.938077,486.733295,26.620961,2.856928,8.134638,15.324365,5.377321
std,0.631945,19.248944,161.479647,12.064185,5.646956,66.415174,409.77754,17.870142,2.551424,10.060475,12.067098,2.139478
min,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
25%,4.5,1.0,183.775,6.0,1.4,7.6,161.1,11.3,1.0,2.0,4.8,4.0
50%,5.0,2.0,285.3,12.6,4.2,52.2,388.2,24.2,2.2,4.3,11.7,5.0
75%,5.0,4.0,403.2,21.8,8.6,96.5,708.4,39.6,4.0,9.6,25.2,6.0
max,5.0,2273.0,993.6,69.9,29.9,399.9,1999.7,69.9,14.9,67.5,44.9,11.0
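The successive filters above all have the same shape (keep rows strictly below an upper bound), so they could be collected in one place. The sketch below is one possible way to do that; `UPPER_BOUNDS` and `apply_upper_bounds` are hypothetical names, the bounds are copied from the cells above, and the `toy` rows are illustrative only.

```python
import pandas as pd

# Upper bounds copied from the filtering cells above (strict "<" comparisons).
UPPER_BOUNDS = {
    "RecipeServings": 12,
    "Calories": 1000,
    "FatContent": 70,
    "SaturatedFatContent": 30,
    "CholesterolContent": 400,
    "SodiumContent": 2000,
    "CarbohydrateContent": 70,
    "FiberContent": 15,
    "ProteinContent": 45,
}

def apply_upper_bounds(df, bounds):
    """Keep rows where every listed column is strictly below its bound."""
    mask = pd.Series(True, index=df.index)
    for column, bound in bounds.items():
        mask &= df[column] < bound
    return df[mask]

# Three illustrative rows: the second exceeds several bounds,
# the third exceeds only the sodium bound.
toy = pd.DataFrame({
    "RecipeServings": [4.0, 6.0, 4.0],
    "Calories": [170.9, 1110.7, 103.6],
    "FatContent": [2.5, 58.8, 0.4],
    "SaturatedFatContent": [1.3, 16.6, 0.1],
    "CholesterolContent": [8.0, 372.8, 0.0],
    "SodiumContent": [29.8, 368.4, 2500.0],
    "CarbohydrateContent": [37.1, 84.4, 25.1],
    "FiberContent": [3.6, 9.0, 4.8],
    "ProteinContent": [3.2, 63.4, 4.3],
})

kept = apply_upper_bounds(toy, UPPER_BOUNDS)
print(kept.index.tolist())
```

Because every condition is combined with a logical AND, applying the bounds in one pass selects the same rows as applying them one cell at a time.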


In [24]:
recipes.reset_index(drop=True,inplace=True)

## Cleaning datatypes
CookTime, PrepTime, TotalTime, and DatePublished are all 'object' types, but it will be easier for us to have them as numbers (hours) and datetimes.

### Time Columns

In [25]:
recipes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 94156 entries, 0 to 94155
Data columns (total 18 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Name                 94156 non-null  object 
 1   CookTime             94156 non-null  object 
 2   PrepTime             94156 non-null  object 
 3   TotalTime            94156 non-null  object 
 4   DatePublished        94156 non-null  object 
 5   RecipeCategory       94156 non-null  object 
 6   AggregatedRating     94156 non-null  float64
 7   ReviewCount          94156 non-null  float64
 8   Calories             94156 non-null  float64
 9   FatContent           94156 non-null  float64
 10  SaturatedFatContent  94156 non-null  float64
 11  CholesterolContent   94156 non-null  float64
 12  SodiumContent        94156 non-null  float64
 13  CarbohydrateContent  94156 non-null  float64
 14  FiberContent         94156 non-null  float64
 15  SugarContent         94156 non-null 

### Cleaning Times
1. Inspect each time value to understand patterns
2. Write a function to clean them and convert them to hours (as floats)

In [26]:
pt = recipes.PrepTime.unique()
ct = recipes.CookTime.unique()
tt= recipes.TotalTime.unique()
pt.sort()
ct.sort()
tt.sort()
pt,ct,tt

(array(['PT-30M', 'PT0S', 'PT108H', 'PT10H', 'PT10M', 'PT11H', 'PT11M',
        'PT12H', 'PT12M', 'PT13H', 'PT13M', 'PT14H', 'PT14M', 'PT15H',
        'PT15M', 'PT168H', 'PT16H', 'PT16M', 'PT17M', 'PT18M', 'PT19M',
        'PT1H', 'PT1H10M', 'PT1H15M', 'PT1H18M', 'PT1H19M', 'PT1H1M',
        'PT1H20M', 'PT1H23M', 'PT1H25M', 'PT1H28M', 'PT1H30M', 'PT1H35M',
        'PT1H38M', 'PT1H3M', 'PT1H40M', 'PT1H45M', 'PT1H4M', 'PT1H50M',
        'PT1H55M', 'PT1H5M', 'PT1H9M', 'PT1M', 'PT20H', 'PT20M', 'PT21M',
        'PT22M', 'PT23M', 'PT24H', 'PT24H30M', 'PT24M', 'PT25H', 'PT25M',
        'PT26M', 'PT27M', 'PT28H', 'PT28M', 'PT29M', 'PT2H', 'PT2H10M',
        'PT2H12M', 'PT2H15M', 'PT2H17M', 'PT2H18M', 'PT2H20M', 'PT2H30M',
        'PT2H40M', 'PT2H45M', 'PT2H50M', 'PT2H5M', 'PT2H9M', 'PT2M',
        'PT30H', 'PT30M', 'PT31M', 'PT32M', 'PT33M', 'PT34M', 'PT35M',
        'PT36H', 'PT36M', 'PT37M', 'PT38M', 'PT39M', 'PT3H', 'PT3H10M',
        'PT3H15M', 'PT3H20M', 'PT3H24M', 'PT3H30M', 'PT3H40M', 

In [27]:
recipes[recipes.PrepTime=='PT-30M']
recipes.at[11086,'PrepTime']='PT30M'
recipes.iloc[11086]

Name                             Pear Crisp
CookTime                              PT20M
PrepTime                              PT30M
TotalTime                             PT50M
DatePublished          2002-12-08T20:14:00Z
RecipeCategory                      Dessert
AggregatedRating                        4.0
ReviewCount                             5.0
Calories                              259.2
FatContent                             10.2
SaturatedFatContent                     4.6
CholesterolContent                     17.8
SodiumContent                          78.4
CarbohydrateContent                    43.8
FiberContent                            5.9
SugarContent                           29.3
ProteinContent                          2.0
RecipeServings                          6.0
Name: 11086, dtype: object

In [28]:
t1= 'PT1H15M'
t2= 'PT1H'
t3= 'PT5M'

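def cleantime(s):
    """Convert a duration string such as 'PT1H15M' to hours, rounded to 3 places."""
    s = s[2:]  # get rid of 'PT' for each entry
    h_pos, m_pos = s.find('H'), s.find('M')
    if h_pos != -1:
        h = int(s[:h_pos])
        # minutes, if present, sit between 'H' and 'M'
        m = int(s[h_pos+1:m_pos]) / 60 if m_pos != -1 else 0
    else:
        h = 0
        # no hours: take the digits before 'M'; when there is no 'M' (e.g. '0S'),
        # m_pos is -1 and the slice just drops the trailing unit letter
        m = int(s[:m_pos]) / 60
    return round(h+m,3)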
cleantime(t1),cleantime(t2),cleantime(t3)

(1.25, 1, 0.083)
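The time strings look like ISO 8601 durations, so a regular expression is another way to pull out the parts. The sketch below is an alternative to `cleantime`, not the function used in this notebook; `duration_to_hours` is a hypothetical name, and it assumes every value has the form `PT<hours>H<minutes>M<seconds>S` with each part optional. Strings that do not match, such as 'PT-30M', return `None` so they can be inspected by hand.

```python
import re

# Matches durations such as 'PT1H15M', 'PT45M', 'PT24H' or 'PT0S'.
_DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")

def duration_to_hours(text):
    """Return the duration in hours, or None if the string does not match."""
    match = _DURATION.match(text)
    if match is None:
        return None
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return round(hours + minutes / 60 + seconds / 3600, 3)

print(duration_to_hours("PT1H15M"), duration_to_hours("PT5M"), duration_to_hours("PT-30M"))
```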

In [29]:
recipes['PrepTime'] = recipes.PrepTime.apply(lambda x: cleantime(x))
recipes['CookTime'] = recipes.CookTime.apply(lambda x: cleantime(x))
recipes['TotalTime'] = recipes.TotalTime.apply(lambda x: cleantime(x))

In [30]:
recipes.TotalTime.sort_values()

46347        0.017
3397         0.017
62242        0.017
49650        0.017
49633        0.017
           ...    
22083     1008.083
85180     1008.500
22312     1440.167
60528     1440.333
86296    17520.000
Name: TotalTime, Length: 94156, dtype: float64

We can see we have some strangely long total times. Some recipes do take a couple of days, but others don't make sense at all. We'll cut out recipes with a CookTime of 72 hours or more.

In [31]:
recipes = recipes[recipes.CookTime<72]

In [32]:
recipes.reset_index(drop=True,inplace=True)

#### DatePublished

In [33]:
set([len(i) for i in recipes.DatePublished])

{20}

In [34]:
recipes['DatePublished'] = recipes.DatePublished.apply(lambda x: pd.to_datetime(x[:10]))
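Since every DatePublished string has the same length, the conversion can also be done on the whole column at once instead of row by row. A minimal sketch, using two timestamps taken from the rows shown earlier and assuming the first ten characters are always `YYYY-MM-DD`:

```python
import pandas as pd

stamps = pd.Series(["1999-08-09T21:46:00Z", "2004-04-25T19:59:00Z"])

# Slice the date part of every string, then parse the column in one call.
dates = pd.to_datetime(stamps.str[:10], format="%Y-%m-%d")
print(dates)
```

On a column of roughly 94k rows this is typically much faster than calling `pd.to_datetime` once per row inside `apply`.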

## Removing Duplicate Names
It's possible we have multiple entries for the same dish. Let's look at duplicate names and, if we find any, keep only the first entry for each name.

In [35]:
recipes.Name.nunique()

86458

In [36]:
recipes[recipes.duplicated(subset=['Name'])]

Unnamed: 0,Name,CookTime,PrepTime,TotalTime,DatePublished,RecipeCategory,AggregatedRating,ReviewCount,Calories,FatContent,SaturatedFatContent,CholesterolContent,SodiumContent,CarbohydrateContent,FiberContent,SugarContent,ProteinContent,RecipeServings
674,Chicken Fajitas,0.167,2.000,2.167,2000-03-06,Chicken Breast,3.0,1.0,314.2,21.7,4.2,54.5,66.9,2.5,0.5,0.6,18.1,8.0
687,Osso Bucco,1.500,0.000,1.500,2000-03-06,Meat,4.0,1.0,252.2,17.4,6.3,21.6,173.4,15.1,1.4,3.2,2.7,6.0
740,Curried Chicken,1.000,0.333,1.333,2000-03-13,Chicken Breast,4.0,3.0,201.4,7.1,2.0,46.4,366.9,17.4,1.9,9.7,17.1,6.0
1071,Chiles Rellenos Casserole,1.000,0.333,1.333,2001-05-30,Chicken,4.5,3.0,314.5,11.5,5.2,112.4,730.6,30.7,5.5,4.9,23.5,6.0
1092,Butter Chicken,0.417,1.000,1.417,2001-06-01,Lunch/Snacks,4.5,13.0,466.3,30.5,15.1,159.7,329.5,17.6,3.1,9.5,32.7,6.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
94026,Banana Oat Pancakes,0.333,0.250,0.583,2018-10-17,Breakfast,4.0,1.0,410.6,17.7,3.1,95.5,321.9,50.7,2.8,5.7,11.6,4.0
94044,Bacon Cheeseburger Meatloaf,1.250,0.167,1.417,2018-12-10,Meatloaf,4.0,1.0,496.9,32.4,14.0,143.4,993.4,18.4,1.0,5.9,31.7,6.0
94050,Copycat Panera Bread Spinach &amp; Artichoke S...,0.667,0.167,0.833,2019-02-22,Breakfast,5.0,1.0,569.0,40.5,15.5,179.0,664.7,34.5,1.1,3.8,16.8,4.0
94083,Mojo Pork Tenderloin,0.667,0.250,0.917,2019-07-20,Cuban,5.0,1.0,547.1,33.8,5.8,110.7,968.1,29.0,6.5,11.6,37.8,4.0


In [37]:
recipes[recipes.Name=='Bacon Cheeseburger Meatloaf']

Unnamed: 0,Name,CookTime,PrepTime,TotalTime,DatePublished,RecipeCategory,AggregatedRating,ReviewCount,Calories,FatContent,SaturatedFatContent,CholesterolContent,SodiumContent,CarbohydrateContent,FiberContent,SugarContent,ProteinContent,RecipeServings
57370,Bacon Cheeseburger Meatloaf,1.0,0.25,1.25,2007-11-06,Meat,5.0,3.0,359.5,25.7,10.0,128.5,566.6,7.7,0.3,3.3,23.6,8.0
63028,Bacon Cheeseburger Meatloaf,1.0,0.25,1.25,2008-03-27,Onions,5.0,1.0,546.5,28.8,14.1,207.6,1017.0,25.9,1.4,2.6,43.8,4.0
94044,Bacon Cheeseburger Meatloaf,1.25,0.167,1.417,2018-12-10,Meatloaf,4.0,1.0,496.9,32.4,14.0,143.4,993.4,18.4,1.0,5.9,31.7,6.0


In [38]:
recipes = recipes[~recipes.duplicated(subset=['Name'])].reset_index(drop=True)
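Keeping the first occurrence is the simplest rule. An alternative, sketched here as an assumption and not as what this notebook does, is to keep the entry with the most reviews for each name, on the reasoning that its rating rests on more votes. The `toy` rows reuse the ReviewCount and AggregatedRating values shown above for 'Bacon Cheeseburger Meatloaf' and 'Cabbage Soup'.

```python
import pandas as pd

toy = pd.DataFrame({
    "Name": ["Bacon Cheeseburger Meatloaf"] * 3 + ["Cabbage Soup"],
    "ReviewCount": [3.0, 1.0, 1.0, 11.0],
    "AggregatedRating": [5.0, 5.0, 4.0, 4.5],
})

deduped = (
    toy.sort_values("ReviewCount", ascending=False, kind="stable")  # most-reviewed first
       .drop_duplicates(subset=["Name"], keep="first")              # one row per name
       .sort_index()                                                # restore original order
       .reset_index(drop=True)
)
print(deduped)
```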

## Categories

In [39]:
recipes.RecipeCategory.unique()

array(['Frozen Desserts', 'Vegetable', 'Pie', 'Beverages', 'Stew',
       'Dessert', 'Brazilian', 'Breakfast', 'Brown Rice', 'Cheese',
       'Chicken', 'Chicken Breast', 'Scones', 'Whole Chicken',
       'Weeknight', 'Low Protein', 'Curries', 'Chicken Livers',
       '< 60 Mins', 'Savory Pies', 'Coconut', 'Pork', 'Quick Breads',
       '< 30 Mins', 'Lunch/Snacks', 'Crab', 'Potato', 'Lamb/Sheep',
       'Chowders', 'Onions', 'European', 'Indonesian', 'Greek', 'Corn',
       'Healthy', 'Long Grain Rice', 'Pineapple', 'Cauliflower',
       'Mexican', 'Free Of...', 'Meat', 'Soy/Tofu', 'Breads',
       'Yeast Breads', 'Beans', 'Sauces', 'German', 'One Dish Meal',
       'Short Grain Rice', 'Candy', 'Very Low Carbs', 'Oven', 'Microwave',
       'Rice', 'Apple', 'Tuna', 'Lentil', 'Fruit', 'Clear Soup', 'Veal',
       '< 15 Mins', '< 4 Hours', 'Spanish', 'Shakes', 'Orange Roughy',
       'Mussels', 'Chicken Thigh & Leg', 'Halibut', 'Poultry', 'Roast',
       'Cheesecake', 'Colombian', 'Manico

In [40]:
recipes.groupby('RecipeCategory').mean(numeric_only=True).head()

Unnamed: 0_level_0,CookTime,PrepTime,TotalTime,AggregatedRating,ReviewCount,Calories,FatContent,SaturatedFatContent,CholesterolContent,SodiumContent,CarbohydrateContent,FiberContent,SugarContent,ProteinContent,RecipeServings
RecipeCategory,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
< 15 Mins,0.117344,0.082012,0.196822,4.663895,4.973872,228.147743,12.120428,4.245368,55.227078,403.679335,17.286936,1.843943,4.44133,12.226128,4.064133
< 30 Mins,0.234775,0.176591,0.411298,4.628521,4.030282,302.710423,15.910282,5.746831,71.877887,496.131408,23.305845,2.471549,4.543028,16.658662,4.883099
< 4 Hours,1.200285,0.488246,1.688519,4.633218,3.119377,334.746886,18.152941,7.121626,71.913322,610.321626,26.327855,3.371453,7.079412,16.892734,6.449827
< 60 Mins,0.453776,0.293183,0.746958,4.598168,3.956152,335.044437,18.072775,7.159097,79.009555,592.092147,25.436911,2.795353,5.738154,17.917736,5.483639
African,1.230356,0.251422,1.481778,4.666667,3.555556,264.375556,12.577778,3.311111,43.24,362.337778,25.817778,4.066667,6.437778,12.995556,4.822222
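Category means are easier to interpret alongside the number of recipes behind each one, since a mean over a handful of recipes is noisy. The sketch below shows one way to compute both with named aggregation; the `toy` frame, the `n_recipes` and `mean_rating` column names, and the minimum of 2 recipes are all illustrative choices.

```python
import pandas as pd

toy = pd.DataFrame({
    "RecipeCategory": ["Dessert", "Dessert", "Pie", "Dessert", "Pie", "African"],
    "AggregatedRating": [4.5, 5.0, 5.0, 4.0, 5.0, 5.0],
})

# One row per category: how many recipes it has and their mean rating.
summary = toy.groupby("RecipeCategory").agg(
    n_recipes=("AggregatedRating", "size"),
    mean_rating=("AggregatedRating", "mean"),
)

# Keep only categories with at least 2 recipes (threshold chosen for the toy data).
common = summary[summary["n_recipes"] >= 2]
print(common)
```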


# Takeaways
We have a clean dataset with no null values, have removed significant outliers and possibly mistaken entries, and have reformatted the time and date columns into numeric hours and datetimes. We should have enough information to begin exploring relationships between features in EDA.

## Save Our Data

In [42]:
datapath = 'data' 
save_file(recipes, 'cleaned.csv', datapath)

Writing file.  "data/cleaned.csv"
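`save_file` comes from the project's own `lib.sb_utils` module, which is not shown here. For readers without that module, the sketch below is a rough stand-in built on `DataFrame.to_csv`; `save_csv` is a hypothetical helper and may not match what `save_file` actually does. It writes to a temporary directory so that it can be run without touching `data/`.

```python
import os
import tempfile

import pandas as pd

def save_csv(df, filename, directory):
    """Hypothetical stand-in for save_file: write df to directory/filename."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, filename)
    df.to_csv(path, index=False)
    return path

with tempfile.TemporaryDirectory() as tmp:
    toy = pd.DataFrame({"Name": ["Cabbage Soup"], "AggregatedRating": [4.5]})
    written = save_csv(toy, "cleaned.csv", tmp)
    reloaded = pd.read_csv(written)

print(reloaded)
```

CSV does not store dtypes, so when the cleaned file is read back, DatePublished will be a string column again unless it is parsed on load, for example with `pd.read_csv(path, parse_dates=["DatePublished"])`.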
