# 2. Grouping and Aggregating with Multiple Columns

### Objectives

+ Use multiple grouping columns
+ Aggregate multiple columns
+ Use multiple aggregating functions
+ Know different syntax for performing an aggregation

### Overview
In this notebook we will learn how to form groups using more than 1 column. We will also aggregate more than one column as well as learn how to apply multiple aggregation functions to each group.

## Adding Years of Experience to City of Houston Data
Before we get started with grouping and aggregating multiple columns, let's read in the City of Houston employee dataset and append a column for the years of experience.

In [None]:
import pandas as pd
emp = pd.read_csv('../data/employee.csv', parse_dates=['hire_date', 'job_date'])
emp.head()

### Calculate years of experience from hire date
The data was pulled on December 1, 2016. Let's use the **`dt`** accessor with the **`year`** attribute to get the year that each employee was hired. We can subtract this year from 2016 to approximate the years of experience and assign it as a new column.

In [None]:
emp['experience'] = 2016 - emp['hire_date'].dt.year
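
To see why this is only an approximation, here is a minimal sketch on a tiny made-up DataFrame (the `toy` name and its hire dates are hypothetical, not taken from the employee data). Subtracting years ignores the month and day, so someone hired on the last day of 2015 is credited with a full year.

```python
import pandas as pd

# Hypothetical hire dates used only for illustration
toy = pd.DataFrame({
    'hire_date': pd.to_datetime(['2006-06-12', '2015-12-31', '2016-01-04'])
})

# Same calculation as above: only the year part of the date is used
toy['experience'] = 2016 - toy['hire_date'].dt.year
print(toy)  # experience values are 10, 1 and 0
```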

### Take a peek at the distribution of experience
Use the **`value_counts`** method to get a quick understanding of how experience is distributed.

In [None]:
emp['experience'].value_counts(normalize=True).head(10)

In [None]:
emp.head()

## Review grouping and aggregating with a single column
In the previous notebook, we had a single grouping column, aggregating column, and aggregating function. The following syntax was used as a guide:

```
df.groupby('<grouping column>').agg({'<aggregating column>': '<aggregating function>'})
```

Let's see this again by calculating the average years of experience for each gender.

In [None]:
emp.groupby('gender').agg({'experience': 'mean'})

# Grouping with Multiple Columns
To create groups based on distinct values from multiple columns, we will need to pass a list of these columns to the **`groupby`** method. Let's find the average years of experience for every unique combination of race and gender.

In [None]:
emp.groupby(['race', 'gender']).agg({'experience': 'mean'})

### What happened to our index?
Neither race nor gender is a column anymore; both have been pushed into the index. This is called a **multi-level index**, technically a **`MultiIndex`** object. **`race`** and **`gender`** are considered **levels** of the index. They are NOT columns. You'll notice that a repeated value in an index level is displayed only once when the repeats immediately follow one another.
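
One way to confirm this is to inspect the result directly. The sketch below uses a small made-up DataFrame (the `toy` name and its values are hypothetical) and checks which labels ended up in the index and which remain as columns.

```python
import pandas as pd

# Hypothetical data standing in for the employee dataset
toy = pd.DataFrame({
    'race': ['A', 'A', 'A', 'B', 'B'],
    'gender': ['F', 'M', 'M', 'F', 'F'],
    'experience': [10, 4, 6, 2, 8],
})

result = toy.groupby(['race', 'gender']).agg({'experience': 'mean'})

print(list(result.index.names))  # the grouping columns are now index levels
print(result.index.nlevels)      # number of levels in the index
print(list(result.columns))      # only the aggregated column is left
```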

### The MultiIndex is confusing and not necessary for beginners
In my opinion, this multi-level index only adds to confusion. By default, all grouping columns will be added to the index. From this point on, we will chain the **`reset_index`** method to keep these values as columns.

In [None]:
emp.groupby(['race', 'gender']).agg({'experience': 'mean'}).reset_index()
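
As an alternative sketch, the **`groupby`** method also accepts an `as_index=False` argument, which keeps the grouping columns as regular columns without a separate call to `reset_index`. The `toy` DataFrame below is hypothetical data used only to keep the example self-contained.

```python
import pandas as pd

# Hypothetical data standing in for the employee dataset
toy = pd.DataFrame({
    'race': ['A', 'A', 'A', 'B', 'B'],
    'gender': ['F', 'M', 'M', 'F', 'F'],
    'experience': [10, 4, 6, 2, 8],
})

# Approach used in this notebook: aggregate, then reset the index
with_reset = toy.groupby(['race', 'gender']).agg({'experience': 'mean'}).reset_index()

# Alternative: ask groupby not to move the grouping columns into the index
with_flag = toy.groupby(['race', 'gender'], as_index=False).agg({'experience': 'mean'})

print(with_reset)
print(with_flag)
```

Either form is fine. This notebook continues to use `reset_index` because it also works on the Series returned by methods such as `size`.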

### Isn't it easier to read with a MultiIndex?
The MultiIndex can make the results easier to read, but it makes further data analysis more difficult as you need to become familiar with special syntax just for the MultiIndex. This added complexity for beginners is not worth any benefit.

# Aggregating Multiple Columns
To aggregate multiple columns, add the column name to the dictionary paired with its aggregation function. The aggregation functions can be different. The following finds the average salary and max years of experience for each gender.

In [None]:
emp.groupby('gender').agg({'salary': 'mean', 'experience': 'max'}).reset_index()

# Grouping and Aggregating with Multiple Columns
We can combine the last two approaches to group with multiple columns along with aggregating multiple columns.

The following finds the mean salary and max experience for every unique combination of race and gender. Placing each aggregating column on a separate line can make the code more readable.

In [None]:
emp.groupby(['race', 'gender']).agg({'salary': 'mean', 
                                     'experience': 'max'}).reset_index()

# Multiple Aggregation Functions
Let's say we want to find the min, max, mean, and median salary for each race. We do this by using a list of aggregating functions as the value paired with the aggregating column in our **`agg`** dictionary.

In [None]:
emp.groupby('race').agg({'salary': ['min', 'max', 'mean', 'median']})

## What's up with those column names???
The column names probably look pretty bizarre to you. Although it doesn't take much effort to decipher what each column means, the column names are not particularly friendly to work with.

Pandas created a **multi-level column index** with two levels. These are difficult to work with. Unlike the multi-level row index from above, which a single call to **`reset_index`** turns back into columns, there isn't one standard way to deal with them.

## Renaming all the columns
I recommend renaming all the columns after the aggregation. This is quite simple, but tedious. Simply assign a list of the desired column names to the DataFrame's **`columns`** attribute. The list must have the same length as the current number of columns.

In [None]:
race_salary = emp.groupby('race').agg({'salary': ['min', 'max', 'mean', 'median']}).reset_index()
race_salary.columns = ['race', 'min salary', 'max salary', 'mean salary', 'median salary']
race_salary
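
If your pandas version is 0.25 or later (an assumption about your environment), named aggregation is another option: each keyword passed to **`agg`** becomes an output column name paired with a `(column, function)` tuple, so no multi-level column index is created. The sketch below uses a hypothetical `toy` DataFrame and underscore-style names chosen for illustration.

```python
import pandas as pd

# Hypothetical data standing in for the employee dataset
toy = pd.DataFrame({
    'race': ['A', 'A', 'B', 'B', 'B'],
    'salary': [50000, 70000, 40000, 60000, 80000],
})

# Each keyword is the new column name; the tuple is (column, function)
race_salary = toy.groupby('race').agg(
    min_salary=('salary', 'min'),
    max_salary=('salary', 'max'),
    mean_salary=('salary', 'mean'),
    median_salary=('salary', 'median'),
).reset_index()

print(race_salary)
```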

If you are not planning on using the returned DataFrame, you don't need to bother renaming the columns. Otherwise, a single-level index is going to be much easier to work with than a MultiIndex when you are first beginning your pandas journey.

## No added functionality of a MultiIndex
I actually don't think the MultiIndex offers much benefit. All data analysis is possible without it. There are some cool tricks you can do with it, but overall it will not prevent you from achieving any kind of analysis if you do not use it.

# Multiple Grouping Columns, Aggregating Columns, and Aggregating Functions
You can make complex aggregations by having multiple grouping columns, aggregating columns, and aggregating functions.

In [None]:
rg_sal_exp = emp.groupby(['race', 'gender']) \
                .agg({'salary': ['min', 'max', 'mean', 'median'],
                      'experience': ['max', 'std']}).reset_index()
rg_sal_exp

Again, I suggest renaming the columns for easier data manipulation.

In [None]:
rg_sal_exp.columns = ['race', 'gender', 'min salary', 'max salary', 'mean salary',
                      'median salary', 'max exp', 'std exp']
rg_sal_exp
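
Typing the names by hand gets tedious as the number of columns grows. One possible shortcut, sketched here on a hypothetical `toy` DataFrame, relies on the fact that each label of a multi-level column index is a tuple such as `('salary', 'min')`, and that the columns created by `reset_index` get an empty string as their second level. Joining each tuple in reverse order produces names in the same style as above.

```python
import pandas as pd

# Hypothetical data standing in for the employee dataset
toy = pd.DataFrame({
    'race': ['A', 'A', 'A', 'A', 'B', 'B'],
    'gender': ['F', 'F', 'M', 'M', 'F', 'F'],
    'salary': [40000, 60000, 50000, 70000, 55000, 65000],
    'experience': [3, 9, 5, 7, 2, 12],
})

out = toy.groupby(['race', 'gender']) \
         .agg({'salary': ['min', 'max'],
               'experience': ['max', 'std']}).reset_index()

# ('salary', 'min') -> 'min salary'; ('race', '') -> 'race'
out.columns = [' '.join(col[::-1]).strip() for col in out.columns]
print(out.columns.tolist())
```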

# Getting the size of each group
Let's say we just want to know the number of rows in each group. The correct aggregation function is **`size`**, not **`count`**, which returns the number of non-missing values.

In [None]:
emp.groupby(['race', 'gender']).agg({'salary': 'size'}).reset_index()
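
The difference between the two only shows up when the aggregating column has missing values. Here is a minimal sketch with a hypothetical `toy` DataFrame in which two salaries are missing.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing salary in each gender group
toy = pd.DataFrame({
    'gender': ['F', 'F', 'M', 'M', 'M'],
    'salary': [50000, np.nan, 40000, 60000, np.nan],
})

sizes = toy.groupby('gender').size()              # rows per group
counts = toy.groupby('gender')['salary'].count()  # non-missing salaries per group

print(pd.DataFrame({'size': sizes, 'count': counts}))
# F has 2 rows but 1 non-missing salary; M has 3 rows but 2
```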

### The aggregating column doesn't matter
The same result will be returned regardless of what aggregating column we use since the size only depends on the number of rows and not on the actual values in the column. Using the department column does not change the output.

In [None]:
emp.groupby(['race', 'gender']).agg({'dept': 'size'}).reset_index()

## Alternative Syntax for size
You can call the **`size`** method directly after grouping. This will return the same data as a Series.

In [None]:
emp.groupby(['race', 'gender']).size().reset_index()

## Rename the column when using `reset_index`
When calling `reset_index` on a Series, as we did above, the new column holding the Series values is named after the `name` attribute of the Series. If the Series has no name (as in the example above), you can supply the column name with the `name` parameter of `reset_index`.

In [None]:
emp.groupby(['race', 'gender']).size().reset_index(name='size')
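
To make the naming behavior concrete, this sketch on a hypothetical `toy` DataFrame compares the column labels with and without the `name` parameter. Without it, the values column is labeled with the integer `0`.

```python
import pandas as pd

# Hypothetical data standing in for the employee dataset
toy = pd.DataFrame({
    'race': ['A', 'A', 'A', 'B', 'B'],
    'gender': ['F', 'M', 'M', 'F', 'F'],
})

sizes = toy.groupby(['race', 'gender']).size()
print(sizes.name)  # None: the Series has no name

default = sizes.reset_index()
named = sizes.reset_index(name='size')

print(default.columns.tolist())  # ['race', 'gender', 0]
print(named.columns.tolist())    # ['race', 'gender', 'size']
```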

# Exercises

### Problem 1
<span  style="color:green; font-size:16px">For each department and gender find the number of unique position titles, the total number of employees and the average salary. Make sure there is no multi-index for the index or columns.</span>

### Problem 2
<span  style="color:green; font-size:16px">For each department, race and gender find the maximum years of experience and salary.</span>

## Use the college dataset for the rest of the problems

In [None]:
college = pd.read_csv('../data/college.csv')
college.head()

### Problem 3
<span  style="color:green; font-size:16px">Which city name appears the most frequently? Do this in two different ways: once with and once without the `groupby` method.</span>

### Problem 4
<span  style="color:green; font-size:16px">Does the city **`Houston`** only appear in the state of **`Texas`**?</span>

### Problem 5
<span  style="color:green; font-size:16px">Find the maximum undergraduate population for each state.</span>

### Problem 6
<span  style="color:green; font-size:16px">Among colleges that have the largest undergrad population for each state, what is the difference between the most and least populous college?</span>

### Problem 7: Advanced
<span  style="color:green; font-size:16px">Find the name and population of the largest college per state.</span>

### Problem 8
<span  style="color:green; font-size:16px">Do distance-only schools tend to have larger or smaller student populations than non-distance-only schools?</span>

### Problem 9
<span  style="color:green; font-size:16px">Do distance-only schools tend to be more or less religiously affiliated than non-distance-only schools?</span>

### Problem 10
<span  style="color:green; font-size:16px">What state has the lowest percentage of currently operating schools of those that have religious affiliation?</span>

### Problem 11
<span  style="color:green; font-size:16px">Trim the **`college`** DataFrame to only the 'race' columns - those beginning with **`ugds_`**. Create a new column called **`ugds_other`** that is the sum of any race column that averages under 4% for the entire dataset.</span>

### Problem 12
<span  style="color:green; font-size:16px">Which 5 historically black colleges have the highest white percentage?</span>