### Pandas Lab -- Grouping & Merging

Welcome to today's lab! It comes in two parts:

The first section uses the `groupby` method to answer different questions about our data.

The second section combines grouping and merging to create summary statistics, one of the more important features you can add to a dataset for statistical modeling.

### Section I - Grouping

**Question 1:** Which restaurant had the highest total number of visitors throughout the dataset?

In [4]:
# your answer here
import pandas as pd
import numpy as np
df = pd.read_csv('../data/restaurant data/master.csv', parse_dates=['visit_date'])


In [None]:
# the id of the restaurant with the most total visitors
df.groupby('id')['visitors'].sum().idxmax()

In [None]:
# restaurant with the amount attached
visits = df.groupby('id')[['visitors']].sum()
idx    = visits.idxmax()
visits.loc[idx, :]

**Question 2:** What was the average difference in attendance between holidays & non-holidays for each restaurant?

In [None]:
df.columns

In [None]:
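# your answer here -- average visitors per restaurant, split by holiday vs. non-holiday
df.groupby(['id', 'holiday'])['visitors'].mean()

In [None]:
# if you wanted to get the difference between them for each restaurant
df.groupby(['id', 'holiday'])['visitors'].mean().groupby(level='id').diff().dropna()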

**Question 3:** Can you grab the first 15 rows of dates for each restaurant?  The last 15 rows?

In [None]:
# your answer here -- first 15 rows
df.groupby('id').apply(lambda x: x.iloc[:15])

In [None]:
# the last 15 rows
df.groupby('id').apply(lambda x: x.iloc[-15:])
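As an alternative to `apply` with a lambda, `groupby` also offers `head` and `tail`, which return the first or last `n` rows of each group and keep the original row index. The sketch below uses a small made-up frame (`toy` and its values are hypothetical, not the lab data) with `n=2` instead of 15 so the result is easy to check by eye.

```python
import pandas as pd

# Hypothetical toy data: two restaurants with four dated rows each
toy = pd.DataFrame({
    'id': ['a'] * 4 + ['b'] * 4,
    'visit_date': pd.to_datetime(
        ['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04'] * 2),
    'visitors': [5, 7, 9, 11, 2, 4, 6, 8],
})

# head(n) / tail(n) pick the first / last n rows within each group
# and leave the original row index untouched
first_two = toy.groupby('id').head(2)
last_two = toy.groupby('id').tail(2)
```

Because the original index survives, the result can be used directly with `df.loc` or `df.drop`, which is not the case for the multi-level index that `apply` produces.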

**Question 4:** Grab the quarterly sales for each individual restaurant within our dataset.

In [None]:
# your answer here -- notice the use of the date parts within the groupby -- without necessarily creating them
df.groupby(['id', df.visit_date.dt.year, df.visit_date.dt.quarter])['visitors'].sum()
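Another way to express the same grouping is to convert each date to a quarterly period, so that year and quarter travel together as one key. This is a minimal sketch on made-up data (`toy` and the `quarter` name are hypothetical), not a replacement for the solution above.

```python
import pandas as pd

# Hypothetical toy data spanning two quarters of 2017
toy = pd.DataFrame({
    'id': ['a', 'a', 'a', 'b', 'b'],
    'visit_date': pd.to_datetime(
        ['2017-01-15', '2017-02-20', '2017-04-03', '2017-01-10', '2017-05-05']),
    'visitors': [10, 20, 30, 5, 15],
})

# to_period('Q') maps each date to a Period such as 2017Q1,
# so a single key carries both the year and the quarter
quarter = toy['visit_date'].dt.to_period('Q').rename('quarter')
quarterly = toy.groupby(['id', quarter])['visitors'].sum()
```

The resulting index has two levels, `id` and `quarter`, rather than the three levels produced by grouping on year and quarter separately.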

**Question 6:** What restaurant had the highest amount of reservations?

In [None]:
# your answer here -- to get the restaurant and its total together, see the Question 1 solution
df.groupby('id')['reserve_visitors'].sum().idxmax()

**Question 7:** What is the total number of missing entries for each restaurant?  

In [None]:
# your answer here
df.groupby('id').apply(lambda x: x.isnull().sum().sum())

**Question 8:**  Create two variables, `train` and `test`.  Make `train` a dataset that contains all but the **last 15 rows** for each restaurant.  Make `test` the last 15 rows for each restaurant.

In [None]:
# we'll make sure our dataset is sorted properly first
df = df.sort_values(by=['id', 'visit_date'], ascending=[True, True])
# and then apply our lambda functions
train = df.groupby('id').apply(lambda x: x.iloc[:-15])
test  = df.groupby('id').apply(lambda x: x.iloc[-15:])
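A quick way to sanity-check a split like this is to build it on a tiny frame and confirm that the two pieces do not overlap. The sketch below assumes a unique row index and uses `tail` plus `drop` rather than `apply`; the `toy` frame and `n_test` are hypothetical stand-ins for `df` and 15.

```python
import pandas as pd

# Hypothetical toy data: two restaurants with five dated rows each
toy = pd.DataFrame({
    'id': ['a'] * 5 + ['b'] * 5,
    'visit_date': pd.to_datetime(
        ['2017-01-01', '2017-01-02', '2017-01-03',
         '2017-01-04', '2017-01-05'] * 2),
    'visitors': list(range(10)),
})
toy = toy.sort_values(by=['id', 'visit_date'])

n_test = 2  # the lab uses 15

# the last n_test rows of each restaurant become the test set
test = toy.groupby('id').tail(n_test)
# everything else is training data (relies on a unique row index)
train = toy.drop(test.index)
```

Every row should land in exactly one of the two sets, so `len(train) + len(test)` should equal `len(toy)`.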

### Section II - Grouping & Merging

In this section of the lab, we are going to create different types of summary statistics -- where the rows for an individual sample can be compared with a larger group statistic.

**Bonus:** If you want to make this a little bit more effective, instead of using the entire `df`, try using a grouping from the `train` variable you just created, and use the grouping's values to populate both the training and test sets.
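The bonus idea can be sketched as follows: compute the group statistics from the training rows only, then merge that one table into both sets so the test rows never influence their own features. The frames and values here are made up for illustration, and `add_prefix` is just one convenient way to name the columns.

```python
import pandas as pd

# Hypothetical, already-split toy data
train = pd.DataFrame({'id': ['a', 'a', 'a', 'b', 'b', 'b'],
                      'visitors': [10, 20, 30, 1, 2, 3]})
test = pd.DataFrame({'id': ['a', 'b'],
                     'visitors': [100, 200]})

# statistics come from the training rows only
id_vals = (train.groupby('id')['visitors']
                .agg(['mean', 'median', 'std'])
                .add_prefix('id-'))

# the same lookup table populates both sets
train = train.merge(id_vals, left_on='id', right_index=True, how='left')
test = test.merge(id_vals, left_on='id', right_index=True, how='left')
```

With `how='left'`, a restaurant that appears in `test` but not in `train` would get `NaN` for these columns instead of being dropped.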

Use the technique discussed in class to create columns for the following stats:

**Question 1:** Create columns that list the average, median and standard deviation of visitors for each restaurant

In [None]:
# your answer here
id_vals = df.groupby('id')['visitors'].agg(['mean', 'median', 'std']).rename({'mean': 'id-mean', 'median': 'id-median', 'std': 'id-std'}, axis=1)
df = df.merge(id_vals, left_on=['id'], right_index=True, how='left')
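When the statistic only needs to be attached to the same frame it was computed from, `groupby(...).transform` is a shorter alternative to aggregate-then-merge. This is a sketch on hypothetical data; unlike the merge approach, it does not leave a reusable lookup table that could be applied to a separate test set.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'id': ['a', 'a', 'b', 'b'],
                    'visitors': [10, 30, 4, 6]})

grouped = toy.groupby('id')['visitors']

# transform returns one value per original row, aligned on the row index,
# so the result can be assigned straight back as a new column
toy['id-mean'] = grouped.transform('mean')
toy['id-median'] = grouped.transform('median')
toy['id-std'] = grouped.transform('std')
```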

**Question 2:** Create columns that list the average and median sales amount for each restaurant on each day of the week.

In [None]:
# your answer here
id_day_vals = df.groupby(['id', 'day_of_week'])['visitors'].agg(['mean', 'median', 'std']).rename({'mean': 'id-day-mean', 'median': 'id-day-median', 'std': 'id-day-std'}, axis=1)
df = df.merge(id_day_vals, left_on=['id', 'day_of_week'], right_index=True, how='left')

**Question 3:** Create columns that display the average and median sales amount for each genre in each city on each day of the week.  Create a column called `city` that captures the first value of `area` in order to do this.  Values should be `Tokyo`, `Hiroshima`, etc.  **Hint:** You should use the `str` attribute combined with `split` in order to do this.

In [None]:
# your answer here
df['city'] = df['area'].str.split().str[0]
day_city_vals = df.groupby(['genre', 'city', 'day_of_week'])['visitors'].agg(['mean', 'median', 'std']).rename({'mean': 'city-day-mean', 'median': 'city-day-median', 'std': 'city-day-std'}, axis=1)
df = df.merge(day_city_vals, left_on=['genre', 'city', 'day_of_week'], right_index=True, how='left')
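To see what the `str.split().str[0]` step does on its own, here is a minimal sketch on a few made-up area strings (the values are hypothetical and only mimic a "city first, then district" layout). `split()` with no argument splits on whitespace, and `.str[0]` takes the first piece of each resulting list.

```python
import pandas as pd

# Hypothetical area strings; the real `area` values may look different
areas = pd.Series(['Tokyo Minato Shibakoen',
                   'Hiroshima Naka Motomachi',
                   None])

# split on whitespace, then keep the first token of each row;
# missing values stay missing instead of raising an error
city = areas.str.split().str[0]
```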
