
***Tout ce qu'un homme peut imaginer, un jour d'autres hommes le réaliseront - [Jules Verne](https://en.wikipedia.org/wiki/Jules_Verne)***

Anything one man can imagine, other men can make real

# Query

In pandas, we can query the columns of DataFrames with boolean expressions using the <a href='https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.query.html'>`query()`</a> method. I'll walk through lots of simple examples.

### Import Modules

In [None]:
import pandas as pd
import seaborn as sns
import numpy as np

### Get Flights Data

Let's get the `flights` dataset included in the `seaborn` library and assign it to the DataFrame `df_flights`.

In [None]:
df_flights = sns.load_dataset('flights')

Preview the first few rows of `df_flights`. 

Each row represents one month of flight history. The `passengers` column represents the total number of passengers that flew that month.

In [None]:
df_flights.head()

Unnamed: 0,year,month,passengers
0,1949,January,112
1,1949,February,118
2,1949,March,132
3,1949,April,129
4,1949,May,121


This dataset spans 1949 to 1960.

### Practice Filtering Rows and Columns

#### Query for rows in which year is equal to 1949

In [None]:
df_flights.query('year==1949')

Unnamed: 0,year,month,passengers
0,1949,January,112
1,1949,February,118
2,1949,March,132
3,1949,April,129
4,1949,May,121
5,1949,June,135
6,1949,July,148
7,1949,August,148
8,1949,September,136
9,1949,October,119


#### Query for rows in which month is equal to January

Notice how `'January'` is in single quotes because it's a string.

In [None]:
df_flights.query("month=='January'")

Unnamed: 0,year,month,passengers
0,1949,January,112
12,1950,January,115
24,1951,January,145
36,1952,January,171
48,1953,January,196
60,1954,January,204
72,1955,January,242
84,1956,January,284
96,1957,January,315
108,1958,January,340


#### Query for rows in which year is equal to 1949 and month is equal to January

In [None]:
df_flights.query("year==1949 and month=='January'")

Unnamed: 0,year,month,passengers
0,1949,January,112


#### Query for rows in which month is January or February

In [None]:
df_flights.query("month==['January', 'February']")

Unnamed: 0,year,month,passengers
0,1949,January,112
1,1949,February,118
12,1950,January,115
13,1950,February,126
24,1951,January,145
25,1951,February,150
36,1952,January,171
37,1952,February,180
48,1953,January,196
49,1953,February,196
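
Comparing a column to a list with `==` inside `query()` behaves like a membership test. Here is a small sketch on a hand-made DataFrame; the `df_demo` data and the `wanted_months` variable are made up for illustration and are not part of the flights dataset. The three filters below should select the same rows, and the `@` prefix lets the query string refer to a Python variable.

```python
import pandas as pd

# Hypothetical miniature of the flights data, made up for illustration.
df_demo = pd.DataFrame({
    "year": [1949, 1949, 1950, 1950],
    "month": ["January", "March", "January", "February"],
    "passengers": [112, 132, 115, 126],
})

wanted_months = ["January", "February"]

# 1) Column compared to a list literal inside the query string.
by_list = df_demo.query("month == ['January', 'February']")

# 2) Same filter with `in` and a Python variable referenced via `@`.
by_in = df_demo.query("month in @wanted_months")

# 3) Same filter written as a plain boolean mask.
by_mask = df_demo[df_demo["month"].isin(wanted_months)]

print(by_list)
```

I tend to reach for the `@` form when the list of values is built elsewhere in the code, since it avoids pasting values into the query string.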


#### Query for rows in which month equals January and year is less than 1955

In [None]:
df_flights.query("month=='January' and year<1955")

Unnamed: 0,year,month,passengers
0,1949,January,112
12,1950,January,115
24,1951,January,145
36,1952,January,171
48,1953,January,196
60,1954,January,204


#### Query for rows in which month equals January and year is greater than 1955

In [None]:
df_flights.query("month=='January' and year>1955")

Unnamed: 0,year,month,passengers
84,1956,January,284
96,1957,January,315
108,1958,January,340
120,1959,January,360
132,1960,January,417


#### Query for rows in which month equals January and the year is not 1955

In [None]:
df_flights.query("month=='January' and year!=1955")

Unnamed: 0,year,month,passengers
0,1949,January,112
12,1950,January,115
24,1951,January,145
36,1952,January,171
48,1953,January,196
60,1954,January,204
84,1956,January,284
96,1957,January,315
108,1958,January,340
120,1959,January,360


#### Query for rows in which month equals January or year equals 1955

In [None]:
df_flights.query("month=='January' or year==1955")

Unnamed: 0,year,month,passengers
0,1949,January,112
12,1950,January,115
24,1951,January,145
36,1952,January,171
48,1953,January,196
60,1954,January,204
72,1955,January,242
73,1955,February,233
74,1955,March,267
75,1955,April,269


# Crosstab

A [crosstab](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.crosstab.html) computes aggregated metrics among two or more columns in a dataset that contains categorical values. 

### Get Tips Dataset

Let's get the `tips` dataset from the `seaborn` library and assign it to the DataFrame `df_tips`.

In [None]:
df_tips = sns.load_dataset('tips')

Each row represents a unique meal at a restaurant for a party of people; the dataset contains the following fields:

column name | column description 
--- | ---
`total_bill` | financial amount of meal in U.S. dollars
`tip` |  financial amount of the meal's tip in U.S. dollars
`sex` | gender of server
`smoker` | boolean to represent if server smokes or not
`day` | day of week
`time` | meal name (Lunch or Dinner)
`size` | count of people eating meal

Preview the first 5 rows of `df_tips`. 

In [None]:
df_tips.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


### Implement Crosstabs with Tips Dataset

#### Grouping By a Single Field for the Index and Column

Let's compute a simple crosstab across the `day` and `sex` columns. We want our returned index to be the unique values from `day` and our returned columns to be the unique values from `sex`. By default in pandas, `crosstab()` computes a count (aka frequency) for each combination.

So, each of the values inside our table represents a count across the index and column. For example, males served 30 unique groups across all Thursdays in our dataset.

In [None]:
pd.crosstab(index=df_tips['day'], columns=df_tips['sex'])

sex,Male,Female
day,Unnamed: 1_level_1,Unnamed: 2_level_1
Thur,30,32
Fri,10,9
Sat,59,28
Sun,58,18


One issue with this crosstab output is that the column names aren't descriptive. Just saying `Male` or `Female` isn't very specific. They should be renamed to be clearer. We can use the <a href='https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rename.html'>rename()</a> method and set the argument `columns` to be a dictionary in which the keys are the current column names and the values are the respective new names to set.

In [None]:
pd.crosstab(index=df_tips['day'], columns=df_tips['sex']
           ).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"})

sex,count_meals_served_by_males,count_meals_served_by_females
day,Unnamed: 1_level_1,Unnamed: 2_level_1
Thur,30,32
Fri,10,9
Sat,59,28
Sun,58,18


Also, in the output above, see where it says `sex` as the column name and `day` for the row name? We can modify those names using arguments in the `crosstab()` method. Let's set the `colnames` argument to `gender` since that's a more specific name than `sex`.

In [None]:
pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], colnames=['gender']
           ).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"})

gender,count_meals_served_by_males,count_meals_served_by_females
day,Unnamed: 1_level_1,Unnamed: 2_level_1
Thur,30,32
Fri,10,9
Sat,59,28
Sun,58,18


In this example, we passed in two columns from our DataFrame. However, one nice feature of `crosstab()` is that you don't need the data to be in a DataFrame. For the `index` and `columns` arguments, you can pass in two numpy arrays.
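
As a minimal sketch of that point, the snippet below builds a crosstab straight from two numpy arrays. The `days` and `genders` arrays are made up for illustration; since plain arrays carry no names, I use the `rownames` and `colnames` arguments to label the output.

```python
import numpy as np
import pandas as pd

# Made-up arrays for illustration; no DataFrame is involved.
days = np.array(["Thur", "Thur", "Fri", "Fri", "Fri"])
genders = np.array(["Male", "Female", "Male", "Male", "Female"])

# Arrays have no names, so label the index and columns explicitly.
table = pd.crosstab(index=days, columns=genders,
                    rownames=["day"], colnames=["gender"])
print(table)
```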

Let's double check the logic from above makes sense. Let's use filtering in pandas to verify that there were 30 meals served by a male on Thursday. Our query below matches the 30 number we see above.

In [None]:
len(df_tips.query("sex=='Male' and day=='Thur'"))

30

Alternatively, given the crosstab output above, you can present it in a different format that may be easier for further analysis. I won't dive into details of this operation, but in addition to the code above, you can chain the `unstack()` method and then the `reset_index()` method to pivot the DataFrame so each row is a unique combination of a value from `sex` and `day` with the appropriate count of meals served.

In [None]:
pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], colnames=['gender']).unstack().reset_index(
                                                                ).rename(columns={0: "count_meals_served"})

Unnamed: 0,gender,day,count_meals_served
0,Male,Thur,30
1,Male,Fri,10
2,Male,Sat,59
3,Male,Sun,58
4,Female,Thur,32
5,Female,Fri,9
6,Female,Sat,28
7,Female,Sun,18


For each row and column of this previous crosstab, we can modify an argument to get the totals. Set the argument `margins` to `True` to get these totals. By default, the returned output will have a column and row name of `All`.

In [None]:
pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], colnames=['gender'], margins=True).rename(
                       columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"})

gender,count_meals_served_by_males,count_meals_served_by_females,All
day,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Thur,30,32,62
Fri,10,9,19
Sat,59,28,87
Sun,58,18,76
All,157,87,244


In the `crosstab()` method, we can also rename the `All` row and column. First, use all the same arguments from above. Then, set the argument `margins_name` to `count_meals_served`.

In [None]:
pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], colnames=['gender'], margins=True, 
    margins_name="count_meals_served").rename(columns={"Male": "count_meals_served_by_males", 
                                                       "Female": "count_meals_served_by_females"})

gender,count_meals_served_by_males,count_meals_served_by_females,count_meals_served
day,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Thur,30,32,62
Fri,10,9,19
Sat,59,28,87
Sun,58,18,76
count_meals_served,157,87,244


For each cell value, we can calculate what proportion (a value between 0 and 1) it is of the row's total. To do that, set the `normalize` argument to `'index'` (since index applies to each row).

In [None]:
pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], colnames=['gender'], margins=True, 
            margins_name="proportion_meals_served", normalize='index').rename(columns={"Male": 
                           "proportion_meals_served_by_males", "Female": "proportion_meals_served_by_females"})

gender,proportion_meals_served_by_males,proportion_meals_served_by_females
day,Unnamed: 1_level_1,Unnamed: 2_level_1
Thur,0.483871,0.516129
Fri,0.526316,0.473684
Sat,0.678161,0.321839
Sun,0.763158,0.236842
proportion_meals_served,0.643443,0.356557
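
To make the `normalize='index'` calculation concrete, here is a small sketch on made-up data (the `df_demo` DataFrame is hypothetical). It compares the normalized crosstab with a manual calculation that divides each count by its row total; I'd expect the two to agree and every row to sum to 1.

```python
import pandas as pd

# Made-up data for illustration only.
df_demo = pd.DataFrame({
    "day": ["Thur", "Thur", "Thur", "Fri", "Fri"],
    "sex": ["Male", "Female", "Male", "Male", "Female"],
})

# Raw counts for each day/sex combination.
counts = pd.crosstab(index=df_demo["day"], columns=df_demo["sex"])

# Proportion of each row's total, computed by crosstab itself.
row_share = pd.crosstab(index=df_demo["day"], columns=df_demo["sex"],
                        normalize="index")

# The same proportions computed by hand: divide each row by its sum.
manual = counts.div(counts.sum(axis=1), axis=0)

print(row_share)
```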


For each cell value, we can also calculate what proportion it is of the column's total. To do that, set the `normalize` argument to `'columns'`.

In [None]:
pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], colnames=['gender'], margins=True, 
            margins_name="proportion_meals_served", normalize='columns').rename(columns={"Male": 
                            "proportion_meals_served_by_males", "Female": "proportion_meals_served_by_females"})

gender,proportion_meals_served_by_males,proportion_meals_served_by_females,proportion_meals_served
day,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Thur,0.191083,0.367816,0.254098
Fri,0.063694,0.103448,0.077869
Sat,0.375796,0.321839,0.356557
Sun,0.369427,0.206897,0.311475


Given two categorical columns, the `crosstab()` method can additionally utilize a column with numerical values to perform an aggregate operation. That sentence may sound daunting - so let's walk through it with a simple example.

We know there exist `total_bill` values in our dataset for males that served meals on Thursday. Below, I preview the first few `total_bill` values that meet these criteria.

In [None]:
df_tips.query("sex=='Male' and day=='Thur'")['total_bill'].head()

77    27.20
78    22.76
79    17.29
80    19.44
81    16.66
Name: total_bill, dtype: float64

We may want to know the average bill size of the meals that meet the criteria above. So, given that Series, we can calculate the mean and we arrive at a result of 18.71.

In [None]:
df_tips.query("sex=='Male' and day=='Thur'")['total_bill'].mean()

18.71466666666667

Now, we can perform this same operation using the `crosstab()` method. Same as before, we want our returned index to be the unique values from `day` and our returned columns to be the unique values from `sex`. Additionally, we want the values inside the table to be from our `total_bill` column so we'll set the argument `values` to be `df_tips['total_bill']`. We also want to calculate the mean total bill for each combination of a day and gender so we'll set the `aggfunc` argument to `'mean'`.

In [None]:
pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], values=df_tips['total_bill'], colnames=['gender'],
            aggfunc='mean').rename(columns={"Male": "mean_bill_size_meals_served_by_males",
                                            "Female": "mean_bill_size_meals_served_by_females"})

gender,mean_bill_size_meals_served_by_males,mean_bill_size_meals_served_by_females
day,Unnamed: 1_level_1,Unnamed: 2_level_1
Thur,18.714667,16.715312
Fri,19.857,14.145556
Sat,20.802542,19.680357
Sun,21.887241,19.872222


This crosstab calculation outputted the same 18.71 value as expected!

We can pass many other aggregate methods to the `aggfunc` argument too, such as the sum, median or standard deviation.
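
As a quick sketch of swapping the aggregate, the example below uses a tiny made-up DataFrame (`df_demo` is hypothetical, not the tips data) and passes `'sum'` and `'median'` as strings to `aggfunc`.

```python
import pandas as pd

# Made-up bills for illustration only.
df_demo = pd.DataFrame({
    "day": ["Thur", "Thur", "Thur", "Fri", "Fri", "Fri"],
    "sex": ["Male", "Male", "Female", "Male", "Female", "Female"],
    "total_bill": [10.0, 20.0, 15.0, 8.0, 12.0, 4.0],
})

# Total of all bills for each day/sex combination.
sum_table = pd.crosstab(index=df_demo["day"], columns=df_demo["sex"],
                        values=df_demo["total_bill"], aggfunc="sum")

# Median bill for each day/sex combination.
median_table = pd.crosstab(index=df_demo["day"], columns=df_demo["sex"],
                           values=df_demo["total_bill"], aggfunc="median")

print(sum_table)
print(median_table)
```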

You can learn more about details of using `crosstab()` from the official pandas <a href='https://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.crosstab.html'>documentation page</a>.

#### Grouping By Multiple Fields for the Index and/or Columns

We can also use the `crosstab()` method to group by multiple pandas columns for the `index` or `columns` arguments. 

For example, we can find out for days of the week, for each gender, what was the count of meals they served for lunch or dinner. We want our returned index to be the unique values from `day` and our returned columns to be the unique values from all combinations of the `sex` (gender) and `time` (meal time) columns. 

To interpret a value from our table below, across all Thursdays in our dataset, females served 31 lunch meals.

In [None]:
pd.crosstab(index=df_tips['day'], columns=[df_tips['sex'], df_tips['time']], colnames=['gender', 'meal']
           ).rename(columns={"Lunch": "count_lunch_meals_served", "Dinner": "count_dinner_meals_served"})

gender,Male,Male,Female,Female
meal,count_lunch_meals_served,count_dinner_meals_served,count_lunch_meals_served,count_dinner_meals_served
day,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
Thur,30,0,31,1
Fri,3,7,4,5
Sat,0,59,0,28
Sun,0,58,0,18


We can verify the logic in a value with a separate query. I'm curious to see the count of meals on Thursday served by females during lunch time. Our result on this query is 31 which matches the value in the crosstab above!

In [None]:
len(df_tips.query("day=='Thur' and sex=='Female' and time=='Lunch'"))

31

If you're familiar with the `groupby()` method in pandas, the code below performs a similar count to the first crosstab in this section (grouped by `day` and `sex` only, without the meal time), but the output format is a little different.

In [None]:
df_tips.groupby(['day', 'sex'])['time'].count()

day   sex   
Thur  Male      30
      Female    32
Fri   Male      10
      Female     9
Sat   Male      59
      Female    28
Sun   Male      58
      Female    18
Name: time, dtype: int64
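
To show how close the two approaches are, here is a sketch on made-up data (`df_demo` is hypothetical). Chaining `size()` and `unstack()` after a `groupby()` should reproduce the crosstab layout; I pass `fill_value=0` so that combinations with no rows show a count of 0 rather than `NaN`.

```python
import pandas as pd

# Made-up data for illustration; note there is no Saturday/Female row.
df_demo = pd.DataFrame({
    "day": ["Thur", "Thur", "Fri", "Sat", "Sat"],
    "sex": ["Male", "Female", "Male", "Male", "Male"],
})

# Counts as a crosstab: days in the index, sex in the columns.
ct = pd.crosstab(index=df_demo["day"], columns=df_demo["sex"])

# Same counts from a groupby, pivoted so sex becomes the columns.
gb = df_demo.groupby(["day", "sex"]).size().unstack(fill_value=0)

print(ct)
print(gb)
```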

We can also group by multiple pandas columns in the `index` argument of the `crosstab()` method. 

Below is a fairly complex operation. The code below helps us find out for every combination of day of the week, gender and meal type what was the count of meals served for the various group sizes (ex - party of two people versus party of three). 

In our output, the index includes unique values from the `day` and `sex` fields while columns include unique values from the `time` and `size` fields.

To interpret a value from our table below, across all Thursdays in our dataset, males served 24 lunch meals for a party of 2 people.

In [None]:
pd.crosstab(index=[df_tips['day'], df_tips['sex']], columns=[df_tips['time'], df_tips['size']], 
            colnames=['meal', 'party_people_size']).rename(columns={"Lunch": "count_lunch_meals_served",
                                                                    "Dinner": "count_dinner_meals_served"})

Unnamed: 0_level_0,meal,count_lunch_meals_served,count_lunch_meals_served,count_lunch_meals_served,count_lunch_meals_served,count_lunch_meals_served,count_lunch_meals_served,count_dinner_meals_served,count_dinner_meals_served,count_dinner_meals_served,count_dinner_meals_served,count_dinner_meals_served,count_dinner_meals_served
Unnamed: 0_level_1,party_people_size,1,2,3,4,5,6,1,2,3,4,5,6
day,sex,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2
Thur,Male,0,24,2,2,1,1,0,0,0,0,0,0
Thur,Female,1,23,2,3,0,2,0,1,0,0,0,0
Fri,Male,1,2,0,0,0,0,0,6,0,1,0,0
Fri,Female,0,3,1,0,0,0,0,5,0,0,0,0
Sat,Male,0,0,0,0,0,0,0,34,13,11,1,0
Sat,Female,0,0,0,0,0,0,2,19,5,2,0,0
Sun,Male,0,0,0,0,0,0,0,32,9,14,2,1
Sun,Female,0,0,0,0,0,0,0,7,6,4,1,0


We can verify 24 lunch meals for a party of 2 were served by males on Thursdays with the query below too.

In [None]:
len(df_tips.query("day=='Thur' and sex=='Male' and time=='Lunch' and size==2"))

24

### Style Output

While the output above is daunting to read with all its cells, there's a cool trick to make your DataFrame output appear as a heatmap: pandas' built-in styling combined with a Seaborn color palette. I added some code based off the example from the <a href='https://pandas.pydata.org/pandas-docs/stable/style.html#Builtin-Styles'>pandas documentation</a> for built-in styles. Now, we can more easily identify the large values by their dark orange color in the styled output below.

In [None]:
orange = sns.light_palette("orange", as_cmap=True)
pd.crosstab(index=[df_tips['day'], df_tips['sex']], columns=[df_tips['time'], df_tips['size']], 
            colnames=['meal', 'party_people_size']).rename(columns={"Lunch": "count_lunch_meals_served",
            "Dinner": "count_dinner_meals_served"}).style.background_gradient(cmap=orange)

Unnamed: 0_level_0,meal,count_lunch_meals_served,count_lunch_meals_served,count_lunch_meals_served,count_lunch_meals_served,count_lunch_meals_served,count_lunch_meals_served,count_dinner_meals_served,count_dinner_meals_served,count_dinner_meals_served,count_dinner_meals_served,count_dinner_meals_served,count_dinner_meals_served
Unnamed: 0_level_1,party_people_size,1,2,3,4,5,6,1,2,3,4,5,6
day,sex,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2
Thur,Male,0,24,2,2,1,1,0,0,0,0,0,0
Thur,Female,1,23,2,3,0,2,0,1,0,0,0,0
Fri,Male,1,2,0,0,0,0,0,6,0,1,0,0
Fri,Female,0,3,1,0,0,0,0,5,0,0,0,0
Sat,Male,0,0,0,0,0,0,0,34,13,11,1,0
Sat,Female,0,0,0,0,0,0,2,19,5,2,0,0
Sun,Male,0,0,0,0,0,0,0,32,9,14,2,1
Sun,Female,0,0,0,0,0,0,0,7,6,4,1,0


# GroupBy

A [group by](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) is a process that typically involves splitting the data into groups based on some criteria, applying a function to each group independently, and then combining the outputted results.

### Implement Group bys with Tips Dataset

#### Group by of a Single Column and Apply a Single Aggregate Method on a Column

The simplest example of a `groupby()` operation is to compute the size of groups in a single column. By size, I mean a count of the rows that share each unique value in that column. Here is the <a href='https://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.GroupBy.size.html'>official documentation</a> for this operation. 

This gives the same counts as the `value_counts()` method in pandas, although `value_counts()` sorts its output by count.

Below, for the `df_tips` DataFrame, I call the `groupby()` method, pass in the `sex` column, and then chain the `size()` method.

In [None]:
df_tips.groupby(by='sex').size()

sex
Male      157
Female     87
dtype: int64

To interpret the output above, 157 meals were served by males and 87 meals were served by females.
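
As a small sketch of how `groupby().size()` relates to `value_counts()`, the example below uses a made-up column (`df_demo` is hypothetical). The counts should match, while the ordering differs: `size()` follows the sorted group keys and `value_counts()` puts the most frequent value first.

```python
import pandas as pd

# Made-up data for illustration only.
df_demo = pd.DataFrame({"sex": ["Male", "Female", "Male", "Male", "Female"]})

# Group sizes, ordered by the sorted group keys.
by_size = df_demo.groupby(by="sex").size()

# Same counts, ordered from most to least frequent.
by_value_counts = df_demo["sex"].value_counts()

print(by_size)
print(by_value_counts)
```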

A note, if there are any `NaN` or `NaT` values in the grouped column that would appear in the index, those are automatically excluded in your output (<a href='https://pandas.pydata.org/pandas-docs/version/0.22/groupby.html#na-and-nat-group-handling'>reference here</a>). 
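
If you do want missing keys to show up as their own group, `groupby()` accepts a `dropna` argument. Below is a minimal sketch on made-up ride data (`df_demo` is hypothetical) that compares the default behavior with `dropna=False`.

```python
import pandas as pd

# Made-up data for illustration; the second row has no person recorded.
df_demo = pd.DataFrame({
    "person": ["Dan", None, "Jamie", "Jamie"],
    "ride_duration_minutes": [4, 6, 8, 10],
})

# Default: rows whose group key is missing are left out.
default_sizes = df_demo.groupby(by="person").size()

# dropna=False keeps the missing key as a separate group.
sizes_with_missing = df_demo.groupby(by="person", dropna=False).size()

print(default_sizes)
print(sizes_with_missing)
```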

In pandas, we can also group by one column and then perform an aggregate method on a different column.

For example, in our dataset, I want to group by the `sex` column and then across the `total_bill` column, find the mean bill size. To do this in pandas, given our `df_tips` DataFrame, apply the `groupby()` method and pass in the `sex` column (that'll be our index), and then reference our `['total_bill']` column (that'll be our returned column) and chain the `mean()` method.

Meals served by males had a mean bill size of 20.74 while meals served by females had a mean bill size of 18.06.

In [None]:
df_tips.groupby(by='sex')['total_bill'].mean()

sex
Male      20.744076
Female    18.056897
Name: total_bill, dtype: float64

We can verify the output above with a query. We get the same result that meals served by males had a mean bill size of 20.74.

In [None]:
df_tips.query("sex=='Male'")['total_bill'].mean()

20.744076433121016

Another interesting tidbit with the `groupby()` method is the ability to group by a single column, and call an aggregate method that will apply to all other numeric columns in the DataFrame.

For example, if I group by the `sex` column and call the `mean()` method, the mean is calculated for the three other numeric columns in `df_tips` which are `total_bill`, `tip`, and `size`.

In [None]:
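# numeric_only=True restricts the mean to the numeric columns; recent pandas
# versions may raise a TypeError on the categorical columns without it
df_tips.groupby(by='sex').mean(numeric_only=True)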

Unnamed: 0_level_0,total_bill,tip,size
sex,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Male,20.744076,3.089618,2.630573
Female,18.056897,2.833448,2.45977


#### Aggregate Methods

Other aggregate methods you could perform with a `groupby()` method in pandas are:

method | description 
--- | ---
`sum()` | summation 
`mean()` | average
`count()` | count of non-null values
`size()` | count of all rows, including null values
`max()` | maximum value
`min()` | minimum value
`std()` | standard deviation
`median()` | median

To illustrate the difference between the `size()` and `count()` methods, I included this simple example below.

The DataFrame below of `df_rides` includes Dan and Jamie's ride data.

In [None]:
data = {'person': ['Dan', 'Dan', 'Jamie', 'Jamie'], 
        'ride_duration_minutes': [4, np.nan, 8, 10]}
df_rides = pd.DataFrame(data)
df_rides

Unnamed: 0,person,ride_duration_minutes
0,Dan,4.0
1,Dan,
2,Jamie,8.0
3,Jamie,10.0


For one of Dan's rides, the `ride_duration_minutes` value is null. However, if we apply the `size()` method, we'll still see a count of `2` rides for Dan. We are 100% sure he took 2 rides; the only issue in our dataset is that the exact duration of one ride wasn't recorded.

In [None]:
df_rides.groupby(by='person')['ride_duration_minutes'].size()

person
Dan      2
Jamie    2
Name: ride_duration_minutes, dtype: int64

Upon applying the `count()` method, we only see a count of `1` for Dan because that's the number of non-null values in the `ride_duration_minutes` field that belongs to him.

In [None]:
df_rides.groupby(by='person')['ride_duration_minutes'].count()

person
Dan      1
Jamie    2
Name: ride_duration_minutes, dtype: int64

#### Group by of a Single Column and Apply the `describe()` Method on a Single Column

With grouping of a single column, you can also apply the `describe()` method to a numerical column. Below, I group by the `sex` column, reference the `total_bill` column and apply the `describe()` method on its values. The `describe` method outputs many descriptive statistics. Learn more about the `describe()` method on the official <a href='https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html'>documentation page</a>.

In [None]:
df_tips.groupby(by='sex')['total_bill'].describe()

Unnamed: 0_level_0,count,mean,std,min,25%,50%,75%,max
sex,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Male,157.0,20.744076,9.246469,7.25,14.0,18.35,24.71,50.81
Female,87.0,18.056897,8.009209,3.07,12.75,16.4,21.52,44.3


#### Group by of a Single Column and Apply a Lambda Expression on a Single Column

Most examples in this tutorial involve using simple aggregate methods like calculating the mean, sum or a count. However, with group bys, we have flexibility to apply custom lambda functions. 

You can learn more about lambda expressions from the <a href='https://docs.python.org/3/tutorial/controlflow.html#lambda-expressions'>Python 3 documentation</a> and about using instance methods in group bys from the <a href='https://pandas.pydata.org/pandas-docs/version/0.22/groupby.html#dispatching-to-instance-methods'>official pandas documentation</a>.

Below, I group by the `sex` column and apply a lambda expression to the `total_bill` column. The expression finds the range of `total_bill` values. The range is the maximum value minus the minimum value. I also rename the single column returned on output so it's understandable.

In [None]:
df_tips.groupby(by='sex').agg({'total_bill': lambda bill: bill.max() - bill.min()}).rename(
    columns={'total_bill': "range_total_bill"})

Unnamed: 0_level_0,range_total_bill
sex,Unnamed: 1_level_1
Male,43.56
Female,41.23


In this dataset, males had a bigger range of `total_bill` values.

#### Group by of Multiple Columns and Apply a Single Aggregate Method on a Column

We can group by multiple columns too. For example, I want to know the count of meals served by people's gender for each day of the week. So, call the `groupby()` method and set the `by` argument to a list of the columns we want to group by.

In [None]:
df_tips.groupby(by=['sex', 'day']).size()

sex     day 
Male    Thur    30
        Fri     10
        Sat     59
        Sun     58
Female  Thur    32
        Fri      9
        Sat     28
        Sun     18
dtype: int64

We can also group by multiple columns and apply an aggregate method on a different column. Below I group by people's gender and day of the week and find the total sum of those groups' bills.

In [None]:
df_tips.groupby(by=['sex', 'day'])['total_bill'].sum()

sex     day 
Male    Thur     561.44
        Fri      198.57
        Sat     1227.35
        Sun     1269.46
Female  Thur     534.89
        Fri      127.31
        Sat      551.05
        Sun      357.70
Name: total_bill, dtype: float64

#### Group by of a Single Column and Apply Multiple Aggregate Methods on a Column

The `agg()` method allows us to specify multiple functions to apply to each column. Below, I group by the `sex` column and then we'll apply multiple aggregate methods to the `total_bill` column. Inside the `agg()` method, I pass a dictionary and specify `total_bill` as the key and a list of aggregate methods as the value. 

You can pass various types of syntax inside the argument for the `agg()` method. I chose a dictionary because that syntax will be helpful when we want to apply aggregate methods to multiple columns later on in this tutorial.

In [None]:
df_tips.groupby(by='sex').agg({'total_bill': ['count', 'mean', 'sum']})

Unnamed: 0_level_0,total_bill,total_bill,total_bill
Unnamed: 0_level_1,count,mean,sum
sex,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
Male,157,20.744076,3256.82
Female,87,18.056897,1570.95


You can learn more about the `agg()` method on the official pandas <a href='https://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.DataFrameGroupBy.agg.html'>documentation page</a>.

The code below performs the same group by operation as above, and additionally I rename columns to have clearer names.

In [None]:
df_tips.groupby(by='sex').agg({'total_bill': ['count', 'mean', 'sum']}).rename(
    columns={'count': 'count_meals_served', 'mean': 'average_bill_of_meal', 'sum': 'total_bills_of_meals'})

Unnamed: 0_level_0,total_bill,total_bill,total_bill
Unnamed: 0_level_1,count_meals_served,average_bill_of_meal,total_bills_of_meals
sex,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
Male,157,20.744076,3256.82
Female,87,18.056897,1570.95
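
As an alternative to chaining `rename()`, pandas also supports named aggregation, where each keyword passed to `agg()` becomes an output column name. Below is a minimal sketch on made-up data (`df_demo` is hypothetical); unlike the dictionary syntax above, the result has a single level of column names.

```python
import pandas as pd

# Made-up bills for illustration only.
df_demo = pd.DataFrame({
    "sex": ["Male", "Male", "Female", "Female", "Female"],
    "total_bill": [10.0, 20.0, 12.0, 15.0, 18.0],
})

# Each keyword is the new column name; each value is a tuple of
# (source column, aggregate to apply).
summary = df_demo.groupby(by="sex").agg(
    count_meals_served=("total_bill", "count"),
    average_bill_of_meal=("total_bill", "mean"),
    total_bills_of_meals=("total_bill", "sum"),
)

print(summary)
```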


We can modify the format of the output above through chaining the `unstack()` and `reset_index()` methods after our group by operation. This format may be ideal for additional analysis later on.

In [None]:
df_tips.groupby(by='sex').agg({'total_bill': ['count', 'mean', 'sum']}).unstack().reset_index(
).rename(columns={'level_0': 'aggregated_column', 'level_1': 'aggregate_metric', 
                  'sex': 'grouped_column', 0: 'aggregate_calculation'}).round(2)

Unnamed: 0,aggregated_column,aggregate_metric,grouped_column,aggregate_calculation
0,total_bill,count,Male,157.0
1,total_bill,count,Female,87.0
2,total_bill,mean,Male,20.74
3,total_bill,mean,Female,18.06
4,total_bill,sum,Male,3256.82
5,total_bill,sum,Female,1570.95


#### Group by of a Single Column and Apply Multiple Aggregate Methods on Multiple Columns

Below, I use the `agg()` method to apply two different aggregate methods to two different columns. I group by the `sex` column and for the `total_bill` column, apply the `max` method, and for the `tip` column, apply the `min` method. 

In [None]:
df_tips.groupby(by='sex').agg({'total_bill': 'max', 'tip': 'min'}).rename(
    columns={'total_bill': 'max_total_bill', 'tip': 'min_tip_amount'})

Unnamed: 0_level_0,max_total_bill,min_tip_amount
sex,Unnamed: 1_level_1,Unnamed: 2_level_1
Male,50.81,1.0
Female,44.3,1.0


#### Group by of Multiple Columns and Apply a Groupwise Calculation on Multiple Columns

In restaurants, guests commonly calculate the tip for the waiter/waitress. My mom thinks a 20% tip is customary. So, if the bill was 10, you should tip 2 and pay 12 in total. I'm curious what the tip percentages are based on the gender of servers, meal and day of the week. We can perform that calculation with a `groupby()` and the `pipe()` method.

The `pipe()` method allows us to call functions in a *chain*. It passes the grouped object returned by `groupby()` to the function we supply and returns that function's result, so we can perform further data manipulations without breaking the chain. You can learn more about `pipe()` from the <a href='https://pandas.pydata.org/pandas-docs/version/0.22/groupby.html#piping-function-calls'>official documentation</a>.

To perform this calculation, we need to group by `sex`, `time` and `day`, then call our `pipe()` method and calculate the `tip` divided by `total_bill` multiplied by `100`.

In [None]:
df_tips.groupby(by=['sex', 'time', 'day']).pipe(lambda group: group.tip.sum()/group.total_bill.sum()*100)

sex     time    day 
Male    Lunch   Thur    15.925121
                Fri     16.686183
                Sat           NaN
                Sun           NaN
        Dinner  Thur          NaN
                Fri     12.912840
                Sat     14.824622
                Sun     14.713343
Female  Lunch   Thur    15.388192
                Fri     19.691535
                Sat           NaN
                Sun           NaN
        Dinner  Thur    15.974441
                Fri     19.636618
                Sat     14.236458
                Sun     16.944367
dtype: float64

The highest tip percentage, about 19.7%, was for females for lunch on Friday, with females for dinner on Friday close behind.
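
To make the role of `pipe()` more concrete, here is a small sketch on made-up data (`df_demo` and the helper `tip_percentage` are hypothetical names). Calling `pipe()` with a function should give the same result as calling that function directly on the grouped object.

```python
import pandas as pd

# Made-up bills and tips for illustration only.
df_demo = pd.DataFrame({
    "sex": ["Male", "Male", "Female"],
    "total_bill": [10.0, 30.0, 20.0],
    "tip": [2.0, 4.0, 4.0],
})


def tip_percentage(group):
    """Total tips as a percentage of total bills, per group."""
    return group["tip"].sum() / group["total_bill"].sum() * 100


grouped = df_demo.groupby(by="sex")

# pipe() hands the grouped object to the function...
via_pipe = grouped.pipe(tip_percentage)

# ...which is the same as calling the function on it directly.
direct = tip_percentage(grouped)

print(via_pipe)
```

Note that this is a ratio of sums (all tips over all bills in a group), which can differ from the average of each meal's individual tip percentage.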

# Melt

In pandas, we can "unpivot" a DataFrame - turn it from a *wide* format - many columns - to a *long* format - few columns but many rows. We can accomplish this with the pandas <a href='https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.melt.html'>`melt()`</a> method. This can be helpful for further analysis of our new unpivoted DataFrame.

### Example: Unpivot Tesla Car Acceleration Details

Here are fictional acceleration test results for three popular Tesla car models. In order to verify acceleration of the cars, I figured a third party may make three *runs* to test the three models alongside one another.

In [None]:
s = 'Tesla Model S P100D'
x = 'Tesla Model X P100D'
three = 'Tesla Model 3 AWD Dual Motor'

s_data = [s, 2.5, 2.51, 2.54]
x_data = [x, 2.92, 2.91, 2.93]
three_data = [three, 3.33, 3.31, 3.35]

data = [s_data, x_data, three_data] 
df = pd.DataFrame(data, columns=['car_model', 'Sept 1 9am', 'Sept 1 10am', 'Sept 1 11am'])
df

Unnamed: 0,car_model,Sept 1 9am,Sept 1 10am,Sept 1 11am
0,Tesla Model S P100D,2.5,2.51,2.54
1,Tesla Model X P100D,2.92,2.91,2.93
2,Tesla Model 3 AWD Dual Motor,3.33,3.31,3.35


Notice how this DataFrame features four columns, one for the car model name, and three for acceleration *runs* of a car. If they were to continue with this trend of data collection and do far more *runs*, this dataset would have lots of columns - perhaps making it daunting to visualize and analyze. 

I want to "unpivot" this data from a wide format to a long format using the pandas `melt()` method.

On the `df` DataFrame, we'll call the `melt()` method and set the following arguments:

- `id_vars` to `['car_model']` since each row from `df` is identified by the car model name
- `var_name` to `'date'` to name the new column that holds the former column headers (the run times)
- `value_name` to `'0-60mph_in_seconds'` to name the new column that holds the measured values

In [None]:
df_unpivoted = df.melt(id_vars=['car_model'], var_name='date', value_name='0-60mph_in_seconds')
df_unpivoted

Unnamed: 0,car_model,date,0-60mph_in_seconds
0,Tesla Model S P100D,Sept 1 9am,2.5
1,Tesla Model X P100D,Sept 1 9am,2.92
2,Tesla Model 3 AWD Dual Motor,Sept 1 9am,3.33
3,Tesla Model S P100D,Sept 1 10am,2.51
4,Tesla Model X P100D,Sept 1 10am,2.91
5,Tesla Model 3 AWD Dual Motor,Sept 1 10am,3.31
6,Tesla Model S P100D,Sept 1 11am,2.54
7,Tesla Model X P100D,Sept 1 11am,2.93
8,Tesla Model 3 AWD Dual Motor,Sept 1 11am,3.35


After this "unpivot", we can easily calculate the minimum (essentially the fastest) 0-60 time that we'd publish in a final report. To do so, we take our `df_unpivoted` DataFrame, group by the `car_model` column, and find the minimum value in the `0-60mph_in_seconds` column.

In [None]:
df_unpivoted.groupby('car_model')['0-60mph_in_seconds'].min()

car_model
Tesla Model 3 AWD Dual Motor    3.31
Tesla Model S P100D             2.50
Tesla Model X P100D             2.91
Name: 0-60mph_in_seconds, dtype: float64
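
If you later need the wide format back, the `pivot()` method can reverse a melt. Below is a minimal sketch on made-up data; the `df_wide` DataFrame and its `run_1`/`run_2` column names are hypothetical and stand in for the dated run columns above.

```python
import pandas as pd

# Made-up wide-format data for illustration only.
df_wide = pd.DataFrame({
    "car_model": ["Model A", "Model B"],
    "run_1": [2.5, 2.9],
    "run_2": [2.6, 3.0],
})

# Wide to long: one row per car model and run.
df_long = df_wide.melt(id_vars=["car_model"], var_name="run",
                       value_name="seconds")

# Long back to wide: one column per run again.
df_back = df_long.pivot(index="car_model", columns="run",
                        values="seconds").reset_index()

print(df_long)
print(df_back)
```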