### Pandas Lab -- Grouping & Merging

Welcome to today's lab! It comes in two parts:

The first section uses the `groupby` method to answer different questions about our data.

The second section combines grouping & merging to create summary statistics, one of the more important features you can add to a dataset for statistical modeling.

In [2]:
import pandas as pd
import numpy as np
df = pd.read_csv('../../data_master/master.csv', parse_dates = ['visit_date'])

### Section I - Grouping

**Question 1:** Which restaurant had the highest total number of visitors throughout the dataset?

In [26]:
# select the column before aggregating so only 'visitors' is summed
df.groupby('id')['visitors'].sum().sort_values(ascending=False)

id
air_399904bdb7685ca0    18717
air_f26f36ec4dc5adb0    18577
air_e55abd740f93ecc4    18101
air_99157b6163835eec    18097
air_5c817ef28f236bdf    18009
                        ...  
air_9dd7d38b0f1760c4      803
air_5b704df317ed1962      800
air_fdcfef8bd859f650      625
air_bbe1c1a47e09f161      581
air_a21ffca0bea1661a      190
Name: visitors, Length: 829, dtype: int64
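
If you only need the single top restaurant rather than the full sorted list, `idxmax` returns the index label of the largest value. The sketch below runs on a small made-up frame (the ids and counts are invented for illustration), not on the lab data.

```python
import pandas as pd

# Toy data standing in for the lab's DataFrame (ids and counts are made up).
toy = pd.DataFrame({
    "id": ["air_a", "air_a", "air_b", "air_b", "air_c"],
    "visitors": [10, 20, 5, 50, 7],
})

# Total visitors per restaurant.
totals = toy.groupby("id")["visitors"].sum()

top_id = totals.idxmax()   # index label of the largest total
top_total = totals.max()   # the largest total itself
print(top_id, top_total)
```

Sorting is still the better choice when you want to see the ranking, as in the cell above.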

**Question 2:** What was the average attendance for holidays & non-holidays for all restaurants?

In [16]:
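# numeric_only=True skips text and date columns, which recent pandas versions refuse to average
df.groupby(['id', 'holiday']).mean(numeric_only=True)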

| id | holiday | visitors | latitude | longitude | reserve_visitors |
|---|---|---|---|---|---|
| air_00a91d42b08b08d9 | 0 | 26.103896 | 35.694003 | 139.753595 | 15.563636 |
| air_00a91d42b08b08d9 | 1 | 21.000000 | 35.694003 | 139.753595 | NaN |
| air_0164b9927d20bcc3 | 0 | 9.291667 | 35.658068 | 139.751599 | 16.136842 |
| air_0164b9927d20bcc3 | 1 | 8.000000 | 35.658068 | 139.751599 | 21.750000 |
| air_0241aa3964b7f861 | 0 | 9.883905 | 35.712607 | 139.779996 | 16.748201 |
| ... | ... | ... | ... | ... | ... |
| air_fef9ccb3ba0da2f7 | 1 | 12.000000 | 34.815149 | 134.685353 | 18.428571 |
| air_ffcc2d5087e1b476 | 0 | 20.436975 | 35.658068 | 139.751599 | 16.284483 |
| air_ffcc2d5087e1b476 | 1 | 11.000000 | 35.658068 | 139.751599 | 9.666667 |
| air_fff68b929994bfbd | 0 | 5.093385 | 35.708146 | 139.666288 | 17.809160 |
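
The question can be read two ways: one average per restaurant per holiday flag, or one overall average per holiday flag. Here is a minimal sketch of both on made-up data, assuming `holiday` is coded 0/1 as in the output above. `unstack` is used only to put the two holiday values side by side.

```python
import pandas as pd

# Toy data (invented values) with the same column names as the lab.
toy = pd.DataFrame({
    "id": ["air_a", "air_a", "air_a", "air_b", "air_b"],
    "holiday": [0, 0, 1, 0, 1],
    "visitors": [10, 20, 30, 4, 8],
})

# Reading 1: average visitors per restaurant, holiday vs. non-holiday,
# with the holiday flag moved from the index into the columns.
by_holiday = (
    toy.groupby(["id", "holiday"])["visitors"]
       .mean()
       .unstack("holiday")
)

# Reading 2: one average per holiday flag across all restaurants.
overall = toy.groupby("holiday")["visitors"].mean()

print(by_holiday)
print(overall)
```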


**Question 3:** Can you grab the first 15 rows of dates for each restaurant?  The last 15 rows? (**Hint:** Use the `apply` method for this)

In [18]:
df.groupby('id').apply(lambda x: x.iloc[:15])

| id | | id | visit_date | visitors | day_of_week | holiday | genre | area | latitude | longitude | reserve_visitors |
|---|---|---|---|---|---|---|---|---|---|---|---|
| air_00a91d42b08b08d9 | 166836 | air_00a91d42b08b08d9 | 2016-07-01 | 35 | Friday | 0 | Italian/French | Tōkyō-to Chiyoda-ku Kudanminami | 35.694003 | 139.753595 | NaN |
| air_00a91d42b08b08d9 | 166837 | air_00a91d42b08b08d9 | 2016-07-02 | 9 | Saturday | 0 | Italian/French | Tōkyō-to Chiyoda-ku Kudanminami | 35.694003 | 139.753595 | 4.0 |
| air_00a91d42b08b08d9 | 166838 | air_00a91d42b08b08d9 | 2016-07-04 | 20 | Monday | 0 | Italian/French | Tōkyō-to Chiyoda-ku Kudanminami | 35.694003 | 139.753595 | NaN |
| air_00a91d42b08b08d9 | 166839 | air_00a91d42b08b08d9 | 2016-07-05 | 25 | Tuesday | 0 | Italian/French | Tōkyō-to Chiyoda-ku Kudanminami | 35.694003 | 139.753595 | NaN |
| air_00a91d42b08b08d9 | 166840 | air_00a91d42b08b08d9 | 2016-07-06 | 29 | Wednesday | 0 | Italian/French | Tōkyō-to Chiyoda-ku Kudanminami | 35.694003 | 139.753595 | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| air_fff68b929994bfbd | 216418 | air_fff68b929994bfbd | 2016-07-14 | 11 | Thursday | 0 | Bar/Cocktail | Tōkyō-to Nakano-ku Nakano | 35.708146 | 139.666288 | 2.0 |
| air_fff68b929994bfbd | 216419 | air_fff68b929994bfbd | 2016-07-15 | 3 | Friday | 0 | Bar/Cocktail | Tōkyō-to Nakano-ku Nakano | 35.708146 | 139.666288 | NaN |
| air_fff68b929994bfbd | 216420 | air_fff68b929994bfbd | 2016-07-16 | 8 | Saturday | 0 | Bar/Cocktail | Tōkyō-to Nakano-ku Nakano | 35.708146 | 139.666288 | 4.0 |
| air_fff68b929994bfbd | 216421 | air_fff68b929994bfbd | 2016-07-19 | 6 | Tuesday | 0 | Bar/Cocktail | Tōkyō-to Nakano-ku Nakano | 35.708146 | 139.666288 | NaN |
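
The cell above covers the first 15 rows; the last 15 follow the same pattern with `x.iloc[-15:]`. As an alternative to `apply`, the grouped `head` and `tail` methods do the same slicing and keep the original row index. A small sketch on made-up data, with `n = 2` standing in for 15:

```python
import pandas as pd

# Toy data (invented) with two restaurants of different lengths.
toy = pd.DataFrame({
    "id": ["air_a"] * 4 + ["air_b"] * 3,
    "visit_date": pd.to_datetime([
        "2016-01-01", "2016-01-02", "2016-01-03", "2016-01-04",
        "2016-01-01", "2016-01-02", "2016-01-03",
    ]),
    "visitors": [1, 2, 3, 4, 5, 6, 7],
})

n = 2  # the lab uses 15

# First n and last n rows within each restaurant, in their existing order.
first_n = toy.groupby("id").head(n)
last_n = toy.groupby("id").tail(n)

print(first_n)
print(last_n)
```

Both results depend on the row order within each group, so sort by `visit_date` first if the data is not already chronological.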


**Question 4:** Grab the quarterly sales for each individual restaurant within our dataset.

In [24]:
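# add a quarter column (1-4) derived from the visit date
df['quarter'] = df['visit_date'].dt.quarter
# select the columns of interest first, then sum within each restaurant and quarter
df.groupby(['id', 'quarter'])[['visitors', 'holiday', 'reserve_visitors']].sum()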

| id | quarter | visitors | holiday | reserve_visitors |
|---|---|---|---|---|
| air_00a91d42b08b08d9 | 1 | 2041 | 0 | 966.0 |
| air_00a91d42b08b08d9 | 2 | 490 | 0 | 150.0 |
| air_00a91d42b08b08d9 | 3 | 1780 | 1 | 10.0 |
| air_00a91d42b08b08d9 | 4 | 1740 | 0 | 586.0 |
| air_0164b9927d20bcc3 | 1 | 593 | 1 | 864.0 |
| ... | ... | ... | ... | ... |
| air_ffcc2d5087e1b476 | 4 | 1522 | 3 | 685.0 |
| air_fff68b929994bfbd | 1 | 411 | 3 | 1307.0 |
| air_fff68b929994bfbd | 2 | 102 | 0 | 216.0 |
| air_fff68b929994bfbd | 3 | 404 | 5 | 18.0 |
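
One thing to keep in mind: `dt.quarter` returns only 1 through 4, so if the data spans more than one year, the same quarter from different years is pooled together. If you want them kept apart, `dt.to_period('Q')` is one option. A sketch on made-up dates (the `year_quarter` column name is just an illustration):

```python
import pandas as pd

# Toy data (invented): air_a has a Q1 visit in both 2016 and 2017.
toy = pd.DataFrame({
    "id": ["air_a", "air_a", "air_a", "air_b"],
    "visit_date": pd.to_datetime(
        ["2016-02-01", "2017-02-01", "2016-05-01", "2016-02-15"]
    ),
    "visitors": [10, 20, 30, 40],
})

toy["quarter"] = toy["visit_date"].dt.quarter               # 1-4, year ignored
toy["year_quarter"] = toy["visit_date"].dt.to_period("Q")   # e.g. 2016Q1

# Pooled across years vs. split by year.
pooled = toy.groupby(["id", "quarter"])["visitors"].sum()
by_year = toy.groupby(["id", "year_quarter"])["visitors"].sum()

print(pooled)
print(by_year)
```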


**Question 5:** Which restaurant had the highest number of reservations?

In [27]:
# select the column before aggregating so only 'reserve_visitors' is summed
df.groupby('id')['reserve_visitors'].sum().sort_values(ascending=False)

id
air_36bcf77d3382d36e    2724.0
air_5c817ef28f236bdf    2724.0
air_883ca28ef0ed3d55    2724.0
air_05c325d315cc17f5    2724.0
air_232dcee6f7c51d37    2722.0
                         ...  
air_900d755ebd2f7bbd     233.0
air_cb083b4789a8d3a2     182.0
air_b259b4e4a51a690d     159.0
air_d63cfa6d6ab78446     155.0
air_2703dcb33192b181      18.0
Name: reserve_visitors, Length: 829, dtype: float64

**Question 6:** What is the total number of missing entries for each restaurant?  

In [58]:
df[df['reserve_visitors'].isnull()].groupby('id').count()['visit_date']

id
air_00a91d42b08b08d9    122
air_0164b9927d20bcc3     50
air_0241aa3964b7f861    249
air_0328696196e46f18     50
air_034a3d5b40d5b1b1    130
                       ... 
air_fea5dc9594450608    139
air_fee8dcf4d619598e    149
air_fef9ccb3ba0da2f7    123
air_ffcc2d5087e1b476    124
air_fff68b929994bfbd    132
Name: visit_date, Length: 829, dtype: int64
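
Filtering to the null rows first, as above, drops any restaurant that has no missing values at all. If you would rather see those restaurants with a count of 0, one option is to sum the boolean mask within each group. A sketch on made-up data, assuming (as in the cell above) that "missing" refers to `reserve_visitors`:

```python
import pandas as pd
import numpy as np

# Toy data (invented): air_c has no missing reservations.
toy = pd.DataFrame({
    "id": ["air_a", "air_a", "air_b", "air_b", "air_c"],
    "reserve_visitors": [np.nan, 3.0, np.nan, np.nan, 5.0],
})

# True counts as 1 when summed, so this is the number of nulls per restaurant.
missing = toy["reserve_visitors"].isnull().groupby(toy["id"]).sum()

# The filter-first approach, for comparison: air_c does not appear.
filtered = toy[toy["reserve_visitors"].isnull()].groupby("id").size()

print(missing)
print(filtered)
```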

**Question 7:**  Create two variables, `train` and `test`.  Make `train` a dataset that contains all but the **last 15 rows** for each restaurant, ordered chronologically.  Make `test` the last 15 rows for each restaurant.

In [60]:
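train = df.groupby('id').apply(lambda x: x.sort_values('visit_date').iloc[:-15])
test = df.groupby('id').apply(lambda x: x.sort_values('visit_date').iloc[-15:])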
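
A quick way to convince yourself that a split like this is correct is to try it on a tiny frame where you can check the result by eye. The sketch below uses made-up data and `n_test = 2` in place of 15. It also uses `cumcount` instead of `apply`, which is just one alternative and keeps the original flat index.

```python
import pandas as pd

# Toy data (invented); air_a's dates are deliberately out of order.
toy = pd.DataFrame({
    "id": ["air_a"] * 5 + ["air_b"] * 4,
    "visit_date": pd.to_datetime([
        "2016-01-05", "2016-01-01", "2016-01-03", "2016-01-02", "2016-01-04",
        "2016-01-01", "2016-01-02", "2016-01-03", "2016-01-04",
    ]),
    "visitors": list(range(9)),
})

n_test = 2  # the lab uses 15

# Order chronologically within each restaurant before slicing.
toy = toy.sort_values(["id", "visit_date"])

# cumcount(ascending=False) numbers rows from the end of each group,
# so the most recent row of every restaurant gets 0.
from_end = toy.groupby("id").cumcount(ascending=False)

test = toy[from_end < n_test]     # last n_test rows per restaurant
train = toy[from_end >= n_test]   # everything before them

print(train)
print(test)
```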

### Section II - Grouping & Merging

In this section of the lab, we are going to create different types of summary statistics -- where the rows for an individual sample can be compared with a larger group statistic.

**Bonus:** If you want to make this a little bit more effective, instead of using the entire `df`, try using a grouping from the `train` variable you just created, and use the grouping's values to populate both the training and test sets.

Use the technique discussed in class to create columns for the following stats:
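
The in-class technique is not reproduced here, but one common pattern is to aggregate with `groupby`, then `merge` the result back on the grouping key so every row carries its group's statistic. The sketch below uses made-up data and hypothetical column names (`visitors_mean`, `visitors_median`). In the spirit of the bonus, it computes the statistics from a training frame only and merges them into both frames.

```python
import pandas as pd

# Toy training and test frames (invented values).
train = pd.DataFrame({
    "id": ["air_a", "air_a", "air_a", "air_b", "air_b"],
    "visitors": [10, 20, 30, 4, 8],
})
test = pd.DataFrame({
    "id": ["air_a", "air_b"],
    "visitors": [99, 1],
})

# One row per restaurant, with statistics computed from train only.
stats = (
    train.groupby("id")["visitors"]
         .agg(["mean", "median"])
         .add_prefix("visitors_")   # hypothetical names: visitors_mean, visitors_median
         .reset_index()
)

# A left merge keeps every original row and attaches the group statistics.
train = train.merge(stats, on="id", how="left")
test = test.merge(stats, on="id", how="left")

print(train)
print(test)
```

Grouping on more than one key works the same way: pass a list to `groupby` and the same list to `on`.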

**Question 1:** Create columns that list the average, median and standard deviation of visitors for each restaurant

In [None]:
# your answer here

**Question 2:** Create columns that list the average and median sales amount for each restaurant on a particular day of the week.

In [None]:
# your answer here

**Question 3:** Create columns that display the average and median sales amount for each genre in each city on each day of the week.  Create a column called `city` that captures the first value of `area` in order to do this.  Values should be `Tokyo`, `Hiroshima`, etc.  **Hint:** You should use the `str` attribute combined with `split` in order to do this.
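
To illustrate the hint, here is a small sketch of `str.split` on two example strings. The first comes from the output earlier in the lab; the second is an illustrative value, and the `first_token` column is a hypothetical helper. Note that the raw first token carries a suffix such as `-to`, so a second split on the hyphen is one way to drop it. Any romanization differences (`Tōkyō` vs. `Tokyo`) would need a separate step that is not shown here.

```python
import pandas as pd

# Two example area strings; the second one is illustrative only.
toy = pd.DataFrame({
    "area": [
        "Tōkyō-to Chiyoda-ku Kudanminami",
        "Hiroshima-ken Hiroshima-shi Kokutaijimachi",
    ]
})

# Split on spaces and keep the first piece, e.g. "Tōkyō-to".
toy["first_token"] = toy["area"].str.split(" ").str[0]

# Split that piece on the hyphen to drop the suffix, e.g. "Tōkyō".
toy["city"] = toy["first_token"].str.split("-").str[0]

print(toy[["first_token", "city"]])
```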

In [None]:
# your answer here