# 1.1 Birth Rates

The data on US births, provided by the CDC, is in `data/births.csv`.

Reproduce the following plot of births by gender over time given the data:

![](births_gender.png)

Note the `1e6` on the y axis, which gives the scale.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('data/births.csv')
print(df.head(5))

# Total births per year for each gender, with one column per gender
births_by_gender = df.groupby(['year', 'gender'])['births'].sum().unstack('gender')

# The year index becomes the x axis, so the plot shows births over time
births_by_gender.plot()
plt.ylabel('total births per year')
plt.legend(title='gender')
plt.show()

# 1.2 Births anomalies

This was analyzed by beloved statistician Andrew Gelman [here](http://andrewgelman.com/2012/06/14/cool-ass-signal-processing-using-gaussian-processes/), leading to this plot:

![](births_gp100.png)

Explain all three plots in Gelman's figure. 

**1.2:** What is the periodic component? What is the residual? Use your research skills to learn about them, then explain them (in English).

## Explain the plots 
It represents the number of births on each day of the year relative to the average number of births per day, with the average scaled to 100.

## The Periodic Component 
It uses constructive interference to clear noise. It gives us a clearer view of periodic changes and other anomalies in the plot. For example, in the upper plot, the daily birth line fluctuates so regularly that we might miss high or low trends.

## Residual 
It's the difference between an observed point and the fitted line. The bottom plot takes all the points from the first plot and represents their distances from the smoothed line. It gives us another way to analyse the data.
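
As a rough illustration of the residual idea, the sketch below subtracts a smoothed line from a series. It runs on synthetic data, not the CDC file, and it swaps in a centered 7-day rolling mean for the Gaussian process used in Gelman's post; the names `observed`, `smoothed`, and `residual` are made up for this example.

```python
import numpy as np
import pandas as pd

# Synthetic daily series (an assumption, not the CDC data):
# a flat level of 100 plus a repeating 7-day pattern.
days = np.arange(70)
weekly = np.tile([3.0, 2.0, 1.0, 0.0, -1.0, -2.0, -3.0], 10)
observed = pd.Series(100.0 + weekly, index=days)

# A centered 7-day rolling mean averages over one full weekly cycle,
# so the weekly pattern cancels out and only the level remains.
smoothed = observed.rolling(7, center=True).mean()

# Residual = observed value minus the smoothed (fitted) value.
residual = observed - smoothed
```

The first and last three values of `smoothed` are `NaN` because the centered window does not fit at the edges, so the residual is undefined there as well.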

# 1.3 Holiday Anomalies Plot

Reproduce *as best you can* the first of the 3 figures from Andrew Gelman's blog post (your plot may have small differences).

**1.3.1:** Reproduce the births line in a plot. Hint: Make the x axis a datetime, for example with `pd.to_datetime`.

**1.3.2:** Reproduce the `smoothed` line. Hint: use a rolling window average

**1.3.3:** Reproduce the entire figure with the mean line as a horizontal line. You can plot total births on the y axis instead of the % deviation from the mean (the two look the same anyway).
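
The datetime hint can be sketched on a toy table as follows. The rows are made up for illustration, and the column names `year`, `month`, `day`, and `births` are assumed to match the births file; this is one possible approach, not the required one.

```python
import pandas as pd

# Toy rows (made up for illustration); the last one is an impossible date.
toy = pd.DataFrame({
    "year": [1988, 1988, 1988],
    "month": [1, 2, 2],
    "day": [1, 29, 31],
    "births": [8000, 9000, 10],
})

# to_datetime assembles dates from columns named year/month/day;
# errors="coerce" turns invalid combinations into NaT instead of raising.
toy["date"] = pd.to_datetime(toy[["year", "month", "day"]], errors="coerce")

# Drop the impossible date, then derive the day of the year (1-366).
toy = toy.dropna(subset=["date"]).copy()
toy["day_of_year"] = toy["date"].dt.dayofyear
```

Coercing bad dates to `NaT` and dropping them is a simple way to filter out rows such as February 31 without listing every invalid combination by hand.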

In [None]:
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns

df = pd.read_csv('data/births.csv')

#data cleaning 
df = df.dropna()
df = df.loc[(df['day'] > 0) & (df['day'] <= 31)]
# Each comparison needs its own parentheses because & binds tighter than > and <=
df = df.loc[(df['month'] > 0) & (df['month'] <= 12)]


#Create the datetime
df.day = df.day.astype(int)
df['date'] = pd.to_datetime(df[['year', 'month', 'day']], errors='coerce')

# get day of the year
df['day_of_year'] = df['date'].dt.dayofyear


#grouping by day of year
# Sum only the births column; the date and gender columns cannot be summed meaningfully
df = df.groupby('day_of_year')['births'].sum().reset_index()
df['births'] = df['births'] / df['births'].mean() * 100
df['mean'] = 100 

# pandas can compute the centered 7-day rolling average for us
rolling = df.rolling(7, center=True)


ax = df['births'].plot(color='orange', label='births')
rolling.mean()['births'].plot(ax=ax, color='blue', label='smoothed')
df['mean'].plot(ax=ax, color='red', label='mean')
ax.set_xlabel("Day of the year")
ax.set_ylabel("Relative number of births")
ax.legend()

# 2. Recipe Database

### 2.1 

Load the JSON recipe database we saw in lecture 4.

How many of the recipes are for breakfast food? Hint: The `description` would contain the word "breakfast".

In [None]:
import pandas as pd

# The file is line-delimited JSON; compression is inferred from the .gz extension
recipes = pd.read_json('data/recipe.json.gz', compression='infer', lines=True)

# na=False counts recipes with a missing description as non-matches
is_breakfast = recipes.description.str.contains("breakfast", na=False)
print('The number of breakfast recipes is:', is_breakfast.sum())


### 2.2 A simple recipe recommender

Let's build a recipe recommender: given a list of basic ingredients, find a recipe that uses all those ingredients.

Here is the list of ingredients that can be asked for:

```
['salt', 'pepper', 'oregano', 'sage', 'parsley',
 'rosemary', 'tarragon', 'thyme', 'paprika', 'cumin']
```

**Hint:** Build a new column for each of the ingredients that indicates whether that ingredient is in the recipe.

**example:**
```
recommend_ingredients(["parsley", "paprika", "tarragon"], df)

result: 
# The rows where these 3 ingredients are in the recipe
[2069, 74964, 93768, 113926, 137686, 140530, 158475, 158486, 163175, 165243]
```
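
The hint can be tried out on a tiny table first. The sketch below is only an illustration: the recipes are made up, `ingredients` is assumed to hold one free-text string per recipe, and the names `toy_recipes`, `wanted`, `flags`, and `matches` are hypothetical.

```python
import pandas as pd

# Toy recipes (made up); one free-text ingredients string per recipe.
toy_recipes = pd.DataFrame({
    "name": ["roast", "stew", "salad"],
    "ingredients": ["Parsley, paprika, salt", "paprika, cumin", None],
})

wanted = ["parsley", "paprika"]

# One boolean column per ingredient; na=False treats missing text as "not present".
flags = pd.DataFrame({
    ing: toy_recipes["ingredients"].str.contains(ing, case=False, regex=False, na=False)
    for ing in wanted
})

# A recipe matches only if every requested ingredient is present.
matches = toy_recipes[flags.all(axis=1)]
```

Here only the first recipe contains both ingredients, so `matches.index.tolist()` gives the row labels in the same form as the example result above.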

In [None]:
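def recommend_ingredients(ingredients_list, df):
    """Return the rows of df whose ingredients contain every requested ingredient."""
    df = df.copy()  # avoid adding helper columns to the caller's DataFrame
    text = df['ingredients'].str.lower()
    for ingredient in ingredients_list:
        # One boolean column per ingredient; missing text counts as "not present"
        df[ingredient] = text.str.contains(ingredient, regex=False, na=False)
    # Keep only the rows where every requested ingredient is present
    return df.loc[df[ingredients_list].all(axis=1)]

n_found = len(recommend_ingredients(["parsley", "paprika"], recipes))
print(f'There are {n_found} recipes with parsley and paprika')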

# 3. Movies!

Recall the [Movies Dataset](https://www.kaggle.com/rounakbanik/the-movies-dataset) from lecture 4. It's made up of several tables which we've played with in lecture 4.

The tables have common columns (`id` and `movie_id`) around which you can merge and join tables.

### 3.1 Best director

Your task is to find **the best director** in terms of the average rating of their movies. The ratings can come from the `ratings` or `ratings_small` table, or simply from the vote average in the `metadata` table. The director can be found in the `crew` column of the `credits` table.

You will have to use all of your skills to get this done, including groupbys and merging multiple tables together.
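
The merge-then-group pattern can be sketched on two toy tables. Everything here is made up for illustration: the rows, the names `directors`, `ratings`, `summary`, and `best`, and in particular the `min_movies` threshold, which is an optional design choice and not part of the task statement.

```python
import pandas as pd

# Toy tables (made up); column names mirror the ones used in this exercise.
directors = pd.DataFrame({
    "id": ["1", "2", "3", "4"],
    "director_name": ["A", "A", "B", "C"],
})
ratings = pd.DataFrame({
    "id": ["1", "2", "3", "5"],
    "vote_average": [8.0, 6.0, 9.0, 7.0],
})

# Inner merge keeps only movies present in both tables (ids "1", "2", "3").
merged = directors.merge(ratings, on="id", how="inner")

# Mean rating and number of rated movies per director.
summary = (
    merged.groupby("director_name")["vote_average"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)

# Hypothetical threshold: require at least 2 rated movies so that a single
# highly rated film does not decide the ranking.
min_movies = 2
best = summary[summary["count"] >= min_movies].index[0]
```

Without the threshold, director "B" would rank first on the strength of one film; whether that counts as "best" is a judgment call worth stating in the answer.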

In [None]:
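import ast

# Load the credits and metadata tables
df_credit = pd.read_csv('data/credits.csv')
df_movies_metadata = pd.read_csv('data/movies_metadata.csv', low_memory=False)

# `crew` is stored as the string form of a list of dicts;
# ast.literal_eval parses it without executing arbitrary code
df_credit.crew = df_credit.crew.apply(ast.literal_eval)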

In [None]:
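def director_finder(credit_crew):
    """Return the name of the director in a crew list, or None if there is none."""
    dir_name = None
    for member in credit_crew:
        if member['job'] == 'Director':
            # If several directors are credited, the last one listed is kept
            dir_name = member['name']
    return dir_name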

df_credit['director_name'] = df_credit.crew.apply(director_finder)

In [None]:
# Data cleaning: make `id` a string in both tables so the merge keys match
df_dir_id = df_credit[['id', 'director_name']].astype({"id": str})
df_rat_id = df_movies_metadata[['id', 'vote_average']].astype({"id": str})

# Merge directors with ratings, then average the ratings per director
df_merge = pd.merge(df_dir_id, df_rat_id, on='id')
df_best_director = df_merge.groupby('director_name')['vote_average'].mean()

# Keep the director(s) with the highest average rating
df_best_director = df_best_director[df_best_director == df_best_director.max()]

print(df_best_director)
