In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import math
from scipy.stats import norm
from random import uniform
import seaborn as sns

# 1.1 Birth Rates

The data on US births, provided by the CDC is in `data/births.csv`.

Reproduce the following plot of births by gender over time given the data:

![](births_gender.png)

Note the `1e6` on the y axis for scale

In [None]:
df = pd.read_csv(r'C:\Users\David\Documents\code\Module 2\m2-3-exploration\data\births.csv')
df.groupby(['year' , 'gender']).births.sum().unstack().plot()

# 1.2 Births anomalies

This was analyzed by beloved statistician Andrew Gelman [here](http://andrewgelman.com/2012/06/14/cool-ass-signal-processing-using-gaussian-processes/), leading to this plot:

![](births_gp100.png)

Explain all three plots in Gelman's figure. 

**1.2:** What is the periodic component? What is the residual? Use your research skills to learn then explain it (in english).

#1.2.

To begin, this plot depicts the births by day. The black line represents the daily % deviation from the mean line, which is represeted in red. Finally, the blue line is a smoothed line showing a rolling average.

Periodic Component: A periodic component uses constructive interference to clear noise, thus giving us a clearer view into periodic changes and other anomolies in the plot. For instance, on the upper most plot, the daily birth line fluctuates so regularily that you might miss particularily high or low trends. On the middle graph, you can observe strong trends in the 'arkened' portion of the plot.

Residual: Simply put, a residual is the distance between a point and the regression line (or in this case, a rolling average). The bottom plot takes all the points from the original plot, and represents them by their distance from the smoothed line. This allows us to see how close the smoothed line is to the true observations.

# 1.3 Holiday Anomalies Plot

Reproduce *as best you can* the first of the 3 figures from Andrew Gelman's blog post (your plot may have small differences)

**1.3.1:** Reproduce the births line in a plot. Hint: Make the x axis a `pd.datetime` object

**1.3.2:** Reproduce the `smoothed` line. Hint: use a rolling window average

**1.3.3:** Reproduce the entire figure with the mean line as a horizontal. You can make the y axis total births instead of a % deviation from mean axis (they'll look the same anyway)

In [None]:
import datetime
from scipy.ndimage.filters import gaussian_filter1d

df = pd.read_csv(r'C:\Users\David\Documents\code\Module 2\m2-3-exploration\data\births.csv')

#A bit of data cleaning
df = df.dropna()
df = df.loc[df['day'] != 99]

#The above plot only uses one year, so I will do the same.
df = df.loc[df['year'] == 1980]

#Create the datetime
df.day = df.day.astype(int)
df['time'] = pd.to_datetime(df[['year', 'month', 'day']], errors='coerce')

#Get our y-variables
df = df.groupby(['time']).sum()
ysmoothed = gaussian_filter1d(df.births, sigma=10)

#And plot
fig, ax = plt.subplots(figsize=(10, 6))

ax.hlines(y=df.births.mean(), xmin = df.index.min(), xmax= df.index.max(), label='Mean', colors='r')
plt.plot(df.index, df.births, lw=1, label='Births', color='black')
plt.plot(df.index, ysmoothed, label='Smoothed', color='b')

ax.legend()
plt.show()

# 2. Recipe Database

### 2.1 

Load the JSON recipe database we saw in lecture 4.

How many of the recipes are for breakfast food? Hint: The `description` would contain the work "breakfast"

In [None]:
import json
import gzip

with gzip.open(r'C:\Users\David\Documents\code\Module 2\recipe.json.gz', 'r') as f:
    data = (line.strip().decode() for line in f)
    data_json = f"[{','.join(data)}]"

recipes = pd.read_json(data_json)

print('The number of breakfast recipes is:', len( recipes[recipes.description.str.contains("breakfast", na=False)] ) )

### 2.2 A simple recipe recommender

Let's build a recipe recommender: given a list of basic ingredients, find a recipe that uses all those ingredients.

Here is the list of ingredients that can be asked for:

```
['salt', 'pepper', 'oregano', 'sage', 'parsley',
 'rosemary', 'tarragon', 'thyme', 'paprika', 'cumin']
```

**Hint:** Build a new column for each of the ingredients that indicates whether that ingredient is in the recipe.

**example:**
```
recommend_ingredients(["parsley", "paprika", "tarragon"], df)

result: 
# The rows where these 3 ingredients are in the recipe
[2069, 74964, 93768, 113926, 137686, 140530, 158475, 158486, 163175, 165243]
```

In [None]:
def recipe_ingredients(ingredients, recipes):
    recipes['contain'] = 0
    true = []

    for element in ingredients:
        true.append(
            recipes[recipes.ingredients
            .str.contains(element, na=False)]
            .index)
    for i in true:
        recipes.contain[i] += 1
    print(
         'The rows where these', len(ingredients),'ingredients are in the recipe', 
     recipes[recipes['contain'] == len(ingredients)].index.tolist())
        

ingredients = ["parsley", "paprika", "tarragon"]


recipe_ingredients(ingredients, recipes)

# 3. Movies!

Recall the [Movies Dataset](https://www.kaggle.com/rounakbanik/the-movies-dataset) from lecture 4. It's made up of several tables which we've played with in lecture 4.

The tables have common columns (`id` and `movie_id`) around which you can merge and join tables.

### 3.1 best director

Your task is to find **The best director** in terms of average ratings of his movies. This can be from the `ratings` or `ratings_small` table, or simply the vote average in the `metadata` table. The director can be found in the `cast` table.

You will have to use all of your skills to get this done, between using groupbys and merging multiple tables together

In [221]:
cast_df = pd.read_csv(r'C:\Users\David\Documents\code\Module 2\archive\credits.csv')
ratings_df = pd.read_csv(r'C:\Users\David\Documents\code\Module 2\archive\ratings.csv')

In [222]:
#The long process of cleaning the crew data and isolating the Director
cast_df['Director'] = 0

cast_df = cast_df.dropna()

for i in cast_df.index:
    cast_df.crew[i] = str(cast_df.crew[i]).replace(" \ / [] } { ", ' ')
    cast_df.crew[i] = str(cast_df.crew[i]).split(',')
    
    if " 'job': 'Director'" in cast_df.crew[i]:
        temp = cast_df.crew[i].index(" 'job': 'Director'") + 1
        cast_df['Director'][i] = cast_df.crew[i][temp]
    else:
        pass

cast_df = cast_df.loc[cast_df.Director != 0]

#Merge the dataframes, groupby Director and sort by ratings value
df = pd.merge(ratings_df, cast_df, left_on=ratings_df.movieId, right_on=cast_df.id, how='inner')
df = df.groupby(['Director'], as_index=False).mean()
df = df.sort_values(by = ['rating'], ascending=False)

#now we just find our top Director
the_director = df['Director'][0]

print('The top rated Director is', the_director)

The top rated Director is  'name': "Antonio D'Ambrosio"
