# 1.1 Birth Rates

The data on US births, provided by the CDC, is in `data/births.csv`.

Reproduce the following plot of births by gender over time given the data:

![](births_gender.png)

Note the `1e6` multiplier on the y axis for scale.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime as dt
import seaborn as sns

# Apply the seaborn style before creating the figure so it takes effect
sns.set(rc={'axes.facecolor': 'whitesmoke', 'figure.facecolor': 'w'})

birth = pd.read_csv("data/births.csv")
fig, ax = plt.subplots()
# Filtering data: fill missing days with 0 and drop rows whose day is recorded as 99
birth.day = birth.day.fillna(0).astype(int)
birth = birth[~birth.day.isin([99])]
# Total births per year for female
female = birth[birth['gender'] == 'F']
female_birth_total = female.groupby('year')\
                           .agg(total_birth=("births", 'sum'))\
                           .reset_index()
# Total births per year for male
male = birth[birth['gender'] == 'M']
male_birth_total = male.groupby('year')\
                       .agg(total_birth=("births", 'sum'))\
                       .reset_index()
# Plotting graph, labelling each line so the legend cannot get out of order
ax.plot(female_birth_total.year, female_birth_total.total_birth, '-', lw=2, label='F')
ax.plot(male_birth_total.year, male_birth_total.total_birth, '-', lw=2, label='M')

# Styling graph
ax.set_xlabel('year')
ax.set_ylabel('total births per year')
ax.legend(title="gender")
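
The two per-gender group-bys above can also be collapsed into a single pivot. A minimal sketch on a small made-up table (the numbers are illustrative, not CDC data), assuming the same `year`, `gender`, and `births` column names:

```python
import pandas as pd

# Toy stand-in for data/births.csv (values are made up for illustration)
toy = pd.DataFrame({
    "year":   [1969, 1969, 1969, 1970, 1970, 1970],
    "gender": ["F", "M", "F", "F", "M", "M"],
    "births": [10, 12, 5, 7, 8, 9],
})

# One row per year, one column per gender, summing births within each cell
totals = toy.pivot_table(values="births", index="year",
                         columns="gender", aggfunc="sum")
print(totals)
```

Calling `totals.plot()` on a table shaped like this draws one line per gender, which is the layout of the target figure.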

# 1.2 Births anomalies

This was analyzed by beloved statistician Andrew Gelman [here](http://andrewgelman.com/2012/06/14/cool-ass-signal-processing-using-gaussian-processes/), leading to this plot:

![](births_gp100.png)

Explain all three plots in Gelman's figure. 

**1.2:** What is the periodic component? What is the residual? Use your research skills to learn about them, then explain them (in English).
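
To make the vocabulary concrete, here is a rough sketch of splitting a series into trend, periodic component, and residual. It is not Gelman's method (his figure comes from a Gaussian process model); it swaps in a much simpler rolling-mean decomposition on synthetic data, and every number in it is an assumption chosen for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic daily series (not the births data):
# slow linear trend + weekly pattern + small noise
rng = np.random.default_rng(0)
days = pd.date_range("2012-01-01", periods=364, freq="D")
trend = np.linspace(100, 110, len(days))
weekly = np.where(days.dayofweek >= 5, -10.0, 4.0)  # weekends lower than weekdays
y = pd.Series(trend + weekly + rng.normal(0, 0.5, len(days)), index=days)

# Trend estimate: a centered 7-day mean averages out one full weekly cycle
trend_hat = y.rolling(window=7, center=True).mean()

# Periodic component: average detrended value for each day of the week
detrended = y - trend_hat
periodic_hat = detrended.groupby(detrended.index.dayofweek).transform("mean")

# Residual: whatever neither the trend nor the periodic component explains
residual = y - trend_hat - periodic_hat

print(periodic_hat.groupby(periodic_hat.index.dayofweek).first().round(1))
print(residual.std())
```

The periodic component repeats with a fixed period (here, one week), while the residual is what remains after the trend and the repeating pattern have been subtracted.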

# 1.3 Holiday Anomalies Plot

Reproduce *as best you can* the first of the 3 figures from Andrew Gelman's blog post (your plot may have small differences)

**1.3.1:** Reproduce the births line in a plot. Hint: make the x axis a datetime index (for example with `pd.to_datetime`)

**1.3.2:** Reproduce the `smoothed` line. Hint: use a rolling window average

**1.3.3:** Reproduce the entire figure with the mean line as a horizontal line. You can make the y axis total births instead of a % deviation from the mean (the two will look the same anyway)
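
The two hints can be tried on a tiny example first. A minimal sketch, assuming a frame with `year`, `month`, `day`, and `births` columns (the values below are made up):

```python
import pandas as pd

# Toy frame with separate date parts (made-up values)
toy = pd.DataFrame({
    "year": [2012] * 5,
    "month": [1] * 5,
    "day": [1, 2, 3, 4, 5],
    "births": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# pd.to_datetime assembles dates from columns named year, month, day
toy.index = pd.to_datetime(toy[["year", "month", "day"]])

# 3-day centered rolling mean; the edges are NaN because the window is incomplete there
smoothed = toy["births"].rolling(window=3, center=True).mean()
print(smoothed)
```

A wider window gives a smoother line at the cost of more missing values at the edges.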

In [None]:
df = pd.read_csv("data/births.csv")
quartiles = np.percentile(df['births'], [25, 50, 75])
mu = quartiles[1]
sig = 0.74 * (quartiles[2] - quartiles[0])
df = df.query('(births > @mu - 5 * @sig) & (births < @mu + 5 * @sig)')
df['day'] = df['day'].astype(int)
# create a datetime index from the year, month, day
df.index = pd.to_datetime(10000 * df.year +
                          100 * df.month +
                          df.day, format='%Y%m%d')
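df = df.pivot_table('births', [df.index.month, df.index.day])
# pd.datetime is gone from recent pandas; pd.Timestamp works everywhere.
# 2012 is a leap year, so February 29 is a valid date.
df.index = [pd.Timestamp(2012, int(month), int(day))
            for (month, day) in df.index]
# Plot the results: daily average, smoothed line, and overall mean
fig, ax = plt.subplots(figsize=(12, 4))
df['births'].plot(ax=ax, label='births')
df['births'].rolling(window=10).mean().plot(color='r', ax=ax, label='smoothed')
ax.axhline(df['births'].mean(), color='k', ls='--', label='mean')
ax.legend()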
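
The `0.74 * IQR` expression in the cell above is a robust estimate of the standard deviation: for a Gaussian, the interquartile range is about 1.349σ, so σ ≈ 0.74 × IQR. The sketch below, on synthetic numbers that are purely an assumption, shows why it is preferred to a plain standard deviation when the data contain a few wild values.

```python
import numpy as np

rng = np.random.default_rng(42)
clean = rng.normal(loc=5000, scale=300, size=100_000)   # made-up "daily births"
dirty = np.concatenate([clean, np.full(500, 150_000)])  # a few absurd outliers

q25, q50, q75 = np.percentile(dirty, [25, 50, 75])
robust_sigma = 0.74 * (q75 - q25)  # IQR-based estimate, barely moved by outliers
plain_sigma = dirty.std()          # inflated by the outliers

print(round(robust_sigma), round(plain_sigma))

# Same 5-sigma clip as in the cell above, centered on the median
kept = dirty[(dirty > q50 - 5 * robust_sigma) & (dirty < q50 + 5 * robust_sigma)]
print(len(dirty) - len(kept), "values removed")
```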

# 2. Recipe Database

### 2.1 

Load the JSON recipe database we saw in lecture 4.

How many of the recipes are for breakfast food? Hint: The `description` would contain the word "breakfast"
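
A minimal sketch of the counting idea on a made-up frame (the descriptions are invented; only the `description` column name comes from the dataset):

```python
import pandas as pd

toy = pd.DataFrame({"description": [
    "A simple breakfast bowl",
    "Breakfast tacos for a crowd",
    "Weeknight pasta",
    None,
]})

# case=False matches any capitalization; na=False counts missing text as "no match"
is_breakfast = toy["description"].str.contains("breakfast", case=False, na=False)
print(is_breakfast.sum())  # 2
```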

In [4]:
import io
import json
import gzip
import numpy as np
import pandas as pd
with gzip.open('data/recipe.json.gz','r') as f:
    # Extract each line
    data = (line.strip().decode() for line in f)
    # Reformat so each line is the element of a list
    data_json = f"[{','.join(data)}]"
# read the result as a JSON; wrapping it in StringIO avoids the
# deprecation warning recent pandas raises for literal JSON strings
recipes = pd.read_json(io.StringIO(data_json))
recipes.shape

(173278, 17)

In [None]:
recipes.description.str.contains("[Bb]reakfast", na=False).sum()


### 2.2 A simple recipe recommender

Let's build a recipe recommender: given a list of basic ingredients, find a recipe that uses all those ingredients.

Here is the list of ingredients that can be asked for:

```
['salt', 'pepper', 'oregano', 'sage', 'parsley',
 'rosemary', 'tarragon', 'thyme', 'paprika', 'cumin']
```

**Hint:** Build a new column for each of the ingredients that indicates whether that ingredient is in the recipe.

**example:**
```
recommend_ingredients(["parsley", "paprika", "tarragon"], df)

result: 
# The rows where these 3 ingredients are in the recipe
[2069, 74964, 93768, 113926, 137686, 140530, 158475, 158486, 163175, 165243]
```
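
The hint above can be sketched on a toy frame as follows. The names `spice_columns` and `recommend_sketch` are hypothetical, the recipes are made up, and matching is a plain case-insensitive substring test, which is an assumption rather than the required behaviour.

```python
import pandas as pd

SPICES = ['salt', 'pepper', 'oregano', 'sage', 'parsley',
          'rosemary', 'tarragon', 'thyme', 'paprika', 'cumin']

def spice_columns(df):
    """One boolean column per spice: does the ingredient text mention it?"""
    return pd.DataFrame({
        spice: df["ingredients"].str.contains(spice, case=False, na=False)
        for spice in SPICES
    })

def recommend_sketch(wanted, df):
    """Row labels of recipes that mention every ingredient in `wanted`."""
    flags = spice_columns(df)
    mask = flags[wanted].all(axis=1)   # True only if all requested columns are True
    return mask[mask].index.tolist()

toy = pd.DataFrame({"ingredients": [
    "1 tsp paprika\n2 tbsp chopped parsley\n1 sprig tarragon",
    "salt\npepper",
    "Parsley, Paprika, Tarragon, butter",
    "parsley and thyme",
]})
print(recommend_sketch(["parsley", "paprika", "tarragon"], toy))  # [0, 2]
```

Substring matching is crude: for example, "sage" also matches "sausage", so a stricter version might match on word boundaries instead.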

In [13]:
recipes.columns
recipes

Unnamed: 0,_id,name,ingredients,url,image,ts,cookTime,source,recipeYield,datePublished,prepTime,description,totalTime,creator,recipeCategory,dateModified,recipeInstructions,check
0,{'$oid': '5160756b96cc62079cc2db15'},Drop Biscuits and Sausage Gravy,Biscuits\n3 cups All-purpose Flour\n2 Tablespo...,http://thepioneerwoman.com/cooking/2013/03/dro...,http://static.thepioneerwoman.com/cooking/file...,{'$date': 1365276011104},PT30M,thepioneerwoman,12,2013-03-11,PT10M,"Late Saturday afternoon, after Marlboro Man ha...",,,,,,
1,{'$oid': '5160756d96cc62079cc2db16'},Hot Roast Beef Sandwiches,12 whole Dinner Rolls Or Small Sandwich Buns (...,http://thepioneerwoman.com/cooking/2013/03/hot...,http://static.thepioneerwoman.com/cooking/file...,{'$date': 1365276013902},PT20M,thepioneerwoman,12,2013-03-13,PT20M,"When I was growing up, I participated in my Ep...",,,,,,
2,{'$oid': '5160756f96cc6207a37ff777'},Morrocan Carrot and Chickpea Salad,Dressing:\n1 tablespoon cumin seeds\n1/3 cup /...,http://www.101cookbooks.com/archives/moroccan-...,http://www.101cookbooks.com/mt-static/images/f...,{'$date': 1365276015332},,101cookbooks,,2013-01-07,PT15M,A beauty of a carrot salad - tricked out with ...,,,,,,
3,{'$oid': '5160757096cc62079cc2db17'},Mixed Berry Shortcake,Biscuits\n3 cups All-purpose Flour\n2 Tablespo...,http://thepioneerwoman.com/cooking/2013/03/mix...,http://static.thepioneerwoman.com/cooking/file...,{'$date': 1365276016700},PT15M,thepioneerwoman,8,2013-03-18,PT15M,It's Monday! It's a brand new week! The birds ...,,,,,,
4,{'$oid': '5160757496cc6207a37ff778'},Pomegranate Yogurt Bowl,For each bowl: \na big dollop of Greek yogurt\...,http://www.101cookbooks.com/archives/pomegrana...,http://www.101cookbooks.com/mt-static/images/f...,{'$date': 1365276020318},,101cookbooks,Serves 1.,2013-01-20,PT5M,A simple breakfast bowl made with Greek yogurt...,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
173273,{'$oid': '551c030e96cc6233c0d0dc3d'},The Easiest Homemade Vanilla Ice Cream,250 milliliters Cream\n395 grams Canned Sweete...,http://tastykitchen.com/recipes/desserts/the-e...,http://tastykitchen.com/recipes/wp-content/upl...,{'$date': 1427899150211},PT,tastykitchen,10,2015-04-01,PT10M,The easiest vanilla ice cream you will ever ma...,,,,,,
173274,{'$oid': '551c030f96cc6233c0d0dc3e'},Butterfinger Eggs with Vanilla,2 cups Candy Corn\n1 teaspoon Vanilla Extract\...,http://tastykitchen.com/recipes/holidays/butte...,http://tastykitchen.com/recipes/wp-content/upl...,{'$date': 1427899151232},PT5M,tastykitchen,24,2015-04-01,PT8H,Chocolate coated peanut butter eggs with a hin...,,,,,,
173275,{'$oid': '551c86b796cc626b1ab4d901'},The Best Homemade Taco Seasoning,1/4 cup ground cumin\n1/4 cup kosher salt\n2 t...,http://picky-palate.com/2015/04/01/the-best-ho...,http://picky-palate.com/wp-content/uploads/201...,{'$date': 1427932855918},,pickypalate,Makes about 1 cup,2015-04-01,,,,,,,,
173276,{'$oid': '551f29b696cc62227991d465'},The Ultimate Queso Bean Dip,Two 16 ounce cans Old El Paso Refried Beans\n4...,http://picky-palate.com/2015/04/03/the-ultimat...,http://picky-palate.com/wp-content/uploads/201...,{'$date': 1428105654508},,pickypalate,8-10 Servings,2015-04-03,,,,,,,,


In [6]:
import pandas as pd
spice_list = ['salt', 'pepper', 'oregano', 'sage', 'parsley', 'rosemary', 'tarragon', 'thyme', 'paprika', 'cumin']
# One boolean column per spice. str.contains is case-sensitive by default;
# pass case=False to also match capitalized names.
spice_df = pd.DataFrame({spice: recipes.ingredients.str.contains(spice, na=False)
                         for spice in spice_list})
spice_df.head()
spice_df.query('parsley & paprika & tarragon').index
ingredients1 = ["parsley", "paprika", "tarragon"]
# Joining the names with '&' builds the same kind of query string used above
print('&'.join(ingredients1))

parsley&paprika&tarragon


In [8]:
def recommend_ingredients(list_of_ingredients, recipes):
    spice_list = ['salt', 'pepper', 'oregano', 'sage', 'parsley', 'rosemary', 'tarragon', 'thyme', 'paprika', 'cumin']
    # One boolean column per spice (case-sensitive substring match)
    spice_df = pd.DataFrame({spice: recipes.ingredients.str.contains(spice, na=False)
                             for spice in spice_list})
    # e.g. 'parsley & paprika & rosemary' keeps rows where every column is True
    join_ingredients = ' & '.join(list_of_ingredients)
    # Return plain row labels, matching the format of the example result
    return spice_df.query(join_ingredients).index.tolist()

list_of_ingredients= ["parsley", "paprika", "rosemary"]
recommend_ingredients(list_of_ingredients, recipes)

[4635, 4875, 5468, 5470, 6955, 12052, 66245, 67429,
 70209, 70655, 71443, 71793, 74265, 87168, 89794, 90652,
 97330, 101573, 105548, 106200, 114843, 131504, 165030, 165476,
 167105, 168089, 169167, 171491, 171724, 172905]

In [43]:
# Series.isin tests whole-cell equality, so it cannot find an ingredient
# inside the free-text ingredients string; str.contains does a substring match.
recipe = pd.DataFrame(recipes['ingredients'].str.contains('[Pp]arsley'))
# Keep only the rows whose ingredients mention parsley
recipe[recipe.ingredients==True]

Unnamed: 0,ingredients
19,True
21,True
33,True
34,True
44,True
...,...
173171,True
173180,True
173201,True
173248,True


In [None]:
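def recommend_ingredients(list_of_ingredients, df):
    # Start with every row selected, then require each ingredient in turn
    mask = pd.Series(True, index=df.index)
    for ingredient in list_of_ingredients:
        mask &= df['ingredients'].str.contains(ingredient, na=False, regex=False)
    # Row labels of the recipes that contain all requested ingredients
    return df.index[mask.to_numpy()].tolist()

#recommend_ingredients(["parsley", "paprika", "tarragon"], recipes)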

# 3. Movies!

Recall the [Movies Dataset](https://www.kaggle.com/rounakbanik/the-movies-dataset) from lecture 4. It is made up of several tables, which we explored in that lecture.

The tables have common columns (`id` and `movie_id`) around which you can merge and join tables.

### 3.1 Best director

Your task is to find **the best director** in terms of the average rating of their movies. The ratings can come from the `ratings` or `ratings_small` table, or simply from the vote average in the `metadata` table. The director can be found in the `credits` table (in its `crew` column).

You will have to use all of your skills to get this done, from groupbys to merging multiple tables together.
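
One possible shape for the solution, sketched on made-up tables. The column names `id`, `vote_average`, and `crew`, and the assumption that `crew` holds a string containing a Python-literal list of dicts with `job` and `name` keys, should be checked against the real files; `credits_toy` and `directors` are hypothetical names.

```python
import ast
import pandas as pd

# Toy stand-ins for the metadata and credits tables (made-up rows)
metadata = pd.DataFrame({
    "id": [1, 2, 3],
    "vote_average": [8.0, 6.0, 9.0],
})
credits_toy = pd.DataFrame({
    "id": [1, 2, 3],
    "crew": [
        "[{'job': 'Director', 'name': 'A'}, {'job': 'Editor', 'name': 'X'}]",
        "[{'job': 'Director', 'name': 'A'}]",
        "[{'job': 'Director', 'name': 'B'}]",
    ],
})

def directors(crew_text):
    """Names of crew members whose job is Director."""
    return [m["name"] for m in ast.literal_eval(crew_text)
            if m.get("job") == "Director"]

# One row per (movie, director), then attach ratings and average per director
credits_toy["director"] = credits_toy["crew"].apply(directors)
per_director = credits_toy.explode("director")[["id", "director"]].dropna()
merged = per_director.merge(metadata, on="id", how="inner")
ranking = (merged.groupby("director")["vote_average"]
                 .agg(["mean", "count"])
                 .sort_values("mean", ascending=False))
print(ranking)
```

On real data it is probably worth requiring a minimum `count` before ranking, so that a director with a single highly rated film does not win by default. The `id` column may also need converting (for example with `pd.to_numeric(..., errors="coerce")`) if the two tables load it with different dtypes.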

In [11]:
movies_url = {
"movies_metadata": "1RLvh6rhzYiDDjPaudDgyS9LmqjbKH-wh",
"keywords": "1YLOIxb-EPC_7QpkmRqkq9E6j7iqmoEh3",
"ratings": "1_5HNurSOMnU0JIcXBJ5mv1NaXCx9oCVG",
"credits": "1bX9othXfLu5NZbVZtIPGV5Hbn8b5URPf",
"ratings_small": "1fCWT69efrj4Oxdm8ZNoTeSahCOy6_u6w",
"links_small": "1fh6pS7XuNgnZk2J3EmYk_9jO_Au_6C15",
"links": "1hWUSMo_GwkfmhehKqs8Rs6mWIauklkbP",
}
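def read_gdrive(file_id):
    """
    Reads a CSV file from Google Drive, given the file ID from its sharing link
    """
    path = 'https://drive.google.com/uc?export=download&id=' + file_id
    # low_memory=False reads each file in one pass, avoiding mixed-dtype warnings
    return pd.read_csv(path, low_memory=False)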

df = read_gdrive(movies_url["movies_metadata"])
df1 = read_gdrive(movies_url["keywords"])
df2 = read_gdrive(movies_url["ratings"])
df3 = read_gdrive(movies_url["credits"])
df4 = read_gdrive(movies_url["ratings_small"])
df5 = read_gdrive(movies_url["links_small"])
df6 = read_gdrive(movies_url["links"])
df6.columns


Index(['movieId', 'imdbId', 'tmdbId'], dtype='object')

In [None]:
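def CustomParser(data):
    # Assumption: the nested columns are stored as Python-literal strings
    # (single quotes, None), which json.loads rejects; literal_eval parses them
    import ast
    return ast.literal_eval(data)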

In [78]:
# credits.join( pd.DataFrame(list(json.loads(d).values())[0] for d in credits.pop('crew')) )