<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

## Introduction to `pandas` Grouping

_Authors: Kiefer Katovich (SF), Dave Yerrington (SF), Mario Carrillo (SF)_

---

### Learning Objectives
*After this lesson, you will be able to:*
- Identify the situations in which **grouping** is useful.
- Explain and use the **`.groupby()`** function in `pandas`.
- Demonstrate aggregation and plotting methods by groups in `pandas`.

### Lesson Guide

- [Overview of Multi-Dimensional Data Analysis](#overview)
- [Analyzing Data by Group: Examples](#grouping_examples)
- [Exploring the Titanic Data Set with Grouping](#load_titanic)
- [Introducing `pandas`' `.groupby()` Function](#groupby)
- [Grouping by Multiple Variables](#groupby_multiple)
- [Applying Basic Functions to Groups](#basic_functions)
- [Removing the Hierarchical Index](#removing_hierarchical)
- [Applying Custom Functions with `.apply()`](#custom_functions)
- [Plotting Basic Histograms with Groups](#basic_plotting)
- [Grouped Histograms with `pandas`](#grouped_hists)
- [Independent Practice](#independent_practice)


<a id='overview'></a>

### Overview of Multi-Dimensional Data Analysis

---

Multi-dimensional data analysis allows you to:

- Describe segments of your data based on unique values.
- Understand characteristics of your data.
   - Calculate summary statistics across subsets.
   - Discover patterns that exist in some subsets but not others.
- Find duplicate values or redundant data.
- Apply aggregate functions to subsets.


<a id='grouping_examples'></a>

### Analyzing Data by Group: Examples

---

Scenarios include determining the: 

 - Sum of crimes by time of day in San Francisco (morning, afternoon, night).
 - Number of people with the same last name.
 - Median number of multi-unit buildings in a region.
 - Popularity of movie genres by region.
 - Customer segments based on age, buying habits, interests, and behavior.
 
 You can also apply the "GROUP BY" clause in a database query using SQL.

#### Subset Aggregation

This chart stratifies a single variable, "industry," **counting** job openings within a specific category.

![](http://www.rasmussen.edu/images/blogs/1360270834-402_Graphs_JobOpeningsByIndustry.jpg)

#### Hierarchical Aggregation

This chart aggregates first by a top-level group, "industry," and then by a secondary group, "date," within each industry.

![](http://junkcharts.typepad.com/.a/6a00d8341e992c53ef0192acc65090970d-pi)

<a id='load_titanic'></a>

### Exploring the Titanic Data Set with Grouping

---

To explore the power of grouping with `pandas`, we will be using [the famous Titanic data set](https://www.kaggle.com/c/titanic), which can be downloaded from Kaggle. Here's the competition description:

>The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1,502 out of 2,224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

>One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper class.

Although we will not be modeling survival rates in this lesson, there are interesting patterns to be found just by exploring descriptive statistics in cross-sections of the data.

```python
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_style('whitegrid')

%matplotlib inline
%config InlineBackend.figure_format = 'retina'
```

**Load the data into `pandas`.**

```python
path_to_file = './datasets/titanic_clean.csv'
```


The data contains a variety of information about the passengers present at the sinking of the Titanic.

**Describe the data in the columns with summary statistics.**
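As a rough sketch of the load-and-describe step: the example below reads a few made-up rows from an in-memory CSV so that it runs without the data file. The rows and the name `titanic_sample` are assumptions for illustration only; in the lesson you would pass `path_to_file` to `pd.read_csv()` instead.

```python
import io

import pandas as pd

# Made-up stand-in rows (not real passenger data); in the lesson you would
# call pd.read_csv(path_to_file) instead of reading from this string.
sample_csv = io.StringIO(
    "Pclass,Survived,Sex,Age,Fare,Embarked\n"
    "1,1,female,38.0,71.28,C\n"
    "3,0,male,22.0,7.25,S\n"
    "2,1,female,27.0,13.00,S\n"
    "3,0,male,35.0,8.05,Q\n"
)
titanic_sample = pd.read_csv(sample_csv)

# .describe() reports count, mean, std, min, quartiles, and max
# for the numeric columns only (by default).
summary = titanic_sample.describe()
print(summary)
```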

<a id='groupby'></a>

### Introducing `pandas`' `.groupby()` Function

---

The built-in `.groupby()` function for DataFrames is one of the most useful tools in `pandas`. As the name implies, `.groupby()` will group your data set by one or more user-specified column names.

**Using `.groupby()`, create a grouped DataFrame in which the Titanic data set is grouped by "Pclass."**

**Print out the type of the grouped DataFrame.**

Instead of a `DataFrame` object, we now have a `DataFrameGroupBy` object. This operates somewhat differently than the DataFrames we're used to, as we'll soon see.
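To make the difference concrete, here is a minimal sketch on an invented toy DataFrame. The frame `toy` and its values are assumptions that only borrow the Titanic column names.

```python
import pandas as pd

# Invented toy rows that borrow the Titanic column names (not real data).
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 3, 3, 3],
    "Fare": [80.0, 90.0, 60.0, 30.0, 20.0, 10.0, 8.0, 7.0],
})

by_class = toy.groupby("Pclass")

# The result is a lazy grouping object, not a DataFrame.
print(type(by_class))
# .ngroups reports how many distinct groups were found.
print(by_class.ngroups)
```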

**Try pulling out the first group from the grouped DataFrame with index 0.**

Grouped DataFrames operate differently than Python lists. You can't pull out different groups with indexers. Despite this, grouped DataFrame objects **are** iterable! You can step through them using a for loop, for example.

In our grouped DataFrame, each element will be a tuple containing the "Pclass" group as its first element and the subset of the original Titanic DataFrame for that "Pclass" as its second element.
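The tuple structure described above can be sketched on invented toy data (the frame `toy` and the helper dictionary `group_sizes` are assumptions for illustration). The sketch also shows `.get_group()`, which retrieves a single group by its label rather than by position.

```python
import pandas as pd

# Invented toy rows that borrow the Titanic column names (not real data).
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 3, 3, 3],
    "Fare": [80.0, 90.0, 60.0, 30.0, 20.0, 10.0, 8.0, 7.0],
})

group_sizes = {}
for pclass, subset in toy.groupby("Pclass"):
    # pclass is the group label; subset is the matching slice of toy.
    print(pclass)
    print(subset.head())
    group_sizes[int(pclass)] = len(subset)

# .get_group() pulls out one group by its label.
first_class = toy.groupby("Pclass").get_group(1)
```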

**Write a for loop to iterate through the grouped DataFrame, printing out the "Pclass" and the head of the subset each time.**

<a id='groupby_multiple'></a>

### Grouping by Multiple Variables

---

Grouping by more than one column is simple; the `.groupby()` function can accept a list of column names to group by. When you group by more than one column, each subset of the data will correspond to one distinct combination of the grouping columns.

**Create a grouped DataFrame by grouping the Titanic data by "Pclass" and "Survived."**

**Print out the length of this grouped DataFrame.**

Its length equals the number of unique combinations of "Pclass" and "Survived": three classes by two survival values, for six groups.
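A small sketch of the same idea on invented toy data (the frame `toy` is an assumption, not the Titanic file): passing a list of column names produces one group per combination that actually occurs in the data.

```python
import pandas as pd

# Invented toy rows that borrow the Titanic column names (not real data).
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 1, 0, 0],
})

by_class_survived = toy.groupby(["Pclass", "Survived"])

# len() of a grouped object is the number of groups, not the number of rows.
n_groups = len(by_class_survived)
print(n_groups)

# The same number can be checked by counting the distinct pairs directly.
n_pairs = len(toy[["Pclass", "Survived"]].drop_duplicates())
```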

<a id='basic_functions'></a>

### Applying Basic Functions to Groups

---

`pandas` makes it easy to apply basic statistical functions to grouped data with built-in functions. For example, assume you have a grouped DataFrame, `grouped`:

```python
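# numeric_only=True skips text columns, which pandas 2.0+ no longer drops automatically.
print(grouped.mean(numeric_only=True))
print(grouped.median(numeric_only=True))
print(grouped.count())
print(grouped.max())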
```

We can calculate the mean, median, count, and max of the columns by group. 

**Try out these built-in functions on the grouped data you made above.**

You can also operate on single columns or subsets of columns across grouped DataFrames using the indexing syntax for standard DataFrames.
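As an illustration of column selection on a grouped object, the sketch below uses invented toy data (the frame `toy` and its values are assumptions). Selecting one column with a string yields a Series per group statistic; selecting several with a list yields a DataFrame.

```python
import pandas as pd

# Invented toy rows that borrow the Titanic column names (not real data).
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 1, 0, 0],
    "Fare": [80.0, 90.0, 60.0, 30.0, 20.0, 10.0, 8.0, 7.0],
    "Age": [30.0, 45.0, 40.0, 25.0, 35.0, 20.0, 22.0, 28.0],
    "Embarked": ["C", "C", "S", "S", "S", "Q", "S", "Q"],
})

# One column: the mean of a 0/1 column is the share of 1s in each group.
survival_rate = toy.groupby("Embarked")["Survived"].mean()
print(survival_rate)

# Several columns: pass a list of names to get a DataFrame back.
fare_age = toy.groupby("Pclass")[["Fare", "Age"]].mean()
print(fare_age)
```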

**Find the percent of passengers who survived, grouped by where they embarked.**

**Find the average fare and age, grouped by the location embarked as well as the class.**

<a id='removing_hierarchical'></a>

### Removing the Hierarchical Index

---

When you group by more than one column, `pandas` returns the result with a hierarchical index (a `MultiIndex`) by default. If you would prefer ordinary columns, use the `.reset_index()` function to convert the row labels into columns.
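The effect of `.reset_index()` can be sketched on invented toy data (the frame `toy` is an assumption for illustration): the group labels move out of the index and become regular columns.

```python
import pandas as pd

# Invented toy rows that borrow the Titanic column names (not real data).
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 3, 3, 3],
    "Fare": [80.0, 90.0, 60.0, 30.0, 20.0, 10.0, 8.0, 7.0],
    "Age": [30.0, 45.0, 40.0, 25.0, 35.0, 20.0, 22.0, 28.0],
    "Embarked": ["C", "C", "S", "S", "S", "Q", "S", "Q"],
})

# Grouping by two columns gives a two-level row index.
means = toy.groupby(["Embarked", "Pclass"])[["Fare", "Age"]].mean()
print(means.index.nlevels)

# reset_index() turns both index levels back into ordinary columns.
flat = means.reset_index()
print(flat)
```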

**Remove the hierarchical index for the average fare and age data set you just created, converting "Embarked" and "Pclass" to columns.**

<a id='custom_functions'></a>

### Applying Custom Functions with `.apply()`

---

While `pandas` does contain a variety of useful, built-in summary functions, you'll often need to apply a custom function to the data in your groups. 

The `.apply()` function takes a function as an argument and applies it to the subsets of data in your DataFrame groups.

**See what happens when you replace the built-in `.mean()` function with `.apply(np.mean)` for the question above.**

Say we want to determine the mean of fare and age per "Embarked" and "Pclass," but we also want the numbers to be rounded. One way to do this would be to round the columns after we apply the mean function as we did above. 

Another way would be to write a custom function to pass into `.apply()`. *The function passed to `.apply()` will run on all of the subsets of data.*
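One possible shape for such a function, sketched on invented toy data. The name `rounded_mean`, its `decimals` parameter, and the frame `toy` are all assumptions for illustration. The columns are selected before calling `.apply()` so that the function only ever sees numeric data.

```python
import pandas as pd

# Invented toy rows that borrow the Titanic column names (not real data).
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 3, 3, 3],
    "Fare": [80.0, 90.0, 60.0, 30.0, 20.0, 10.0, 8.0, 7.0],
    "Age": [30.0, 45.0, 40.0, 25.0, 35.0, 20.0, 22.0, 28.0],
})


def rounded_mean(group, decimals=1):
    """Return the column means of one group, rounded (hypothetical helper)."""
    return group.mean().round(decimals)


# .apply() calls rounded_mean once per group and stacks the results.
rounded = toy.groupby("Pclass")[["Fare", "Age"]].apply(rounded_mean)
print(rounded)
```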

**Write a function that will take the mean of columns in a data set and round the values.**

**Apply your custom function to the grouped data.**

Functions that can be applied to a DataFrame and return a DataFrame can also be applied to *groups* of DataFrames.
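A hedged sketch of a DataFrame-returning function applied per group, again on invented toy data. The helper `top_female_fares`, its parameter `n`, and the frame `toy` are assumptions, and the toy example keeps the top two rows rather than five because the frame is so small.

```python
import pandas as pd

# Invented toy rows that borrow the Titanic column names (not real data).
toy = pd.DataFrame({
    "Survived": [1, 1, 0, 1, 0, 1, 0, 0],
    "Sex": ["female", "female", "male", "male",
            "female", "female", "male", "female"],
    "Fare": [80.0, 90.0, 60.0, 30.0, 20.0, 10.0, 8.0, 7.0],
})


def top_female_fares(group, n=2):
    """Return the n highest-fare female rows of one group (hypothetical helper)."""
    females = group[group["Sex"] == "female"]
    return females.nlargest(n, "Fare")


# Each group's subset is filtered and trimmed, then the pieces are combined.
top = toy.groupby("Survived")[["Sex", "Fare"]].apply(top_female_fares)
print(top)
```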

**For example, write a function that will return the subset of the Titanic data set with the top five paying female passengers.**


**Group the Titanic data by "Survived" and apply your function to extract the top paying females.**

<a id='basic_plotting'></a>

### Plotting Basic Histograms with Groups

---

We can leverage the power of `pandas` even more by mixing its plotting capabilities with its grouping capabilities.

**First, find the number of passengers per "Pclass" by using `.groupby()` and `.size()`.**

Here we have a Series object with the number of passengers in each class. It's easy to plot these counts as a bar chart by appending `.plot(kind="bar", color="g", width=0.85)`.
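A minimal sketch of this step on invented toy data (the frame `toy` is an assumption, and the example needs `matplotlib` installed for the plot call):

```python
import pandas as pd

# Invented toy rows that borrow the Titanic column names (not real data).
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 3, 3, 3],
    "Fare": [80.0, 90.0, 60.0, 30.0, 20.0, 10.0, 8.0, 7.0],
})

# .size() counts the rows in each group and returns a Series.
counts = toy.groupby("Pclass").size()
print(counts)

# One bar per group; .plot() returns the matplotlib Axes for further tweaks.
ax = counts.plot(kind="bar", color="g", width=0.85)
ax.set_ylabel("Passengers")
```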

**Plot the average fare per sex and class as a bar chart.**

<a id='grouped_hists'></a>

### Grouped Histograms with `pandas`

---

In the chart we just made, each bar represents a distinct combination of our groups in `.groupby()`. This is fine, but it would be a more visually appealing and informative chart if we represented each group with a different color and made a grouped bar chart.

**Calculate the mean of fare by "Pclass" and "Sex" using `.groupby()`, assign it to a variable, and print it out.**

There is another built-in function for `pandas` objects called `.unstack()`. When we have a hierarchical index like we do above with "Pclass" as the broader category and "Sex" as the subcategory, the `.unstack()` command will attempt to move the subcategory from index to column representation.

This is a way to move from a "long" to a "wide" column format.
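The long-to-wide move can be sketched on invented toy data (the frame `toy` and its values are assumptions): the inner index level, "Sex," becomes the column labels.

```python
import pandas as pd

# Invented toy rows that borrow the Titanic column names (not real data).
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 3, 3, 3],
    "Sex": ["female", "female", "male", "male",
            "female", "female", "male", "female"],
    "Fare": [80.0, 90.0, 60.0, 30.0, 20.0, 10.0, 8.0, 7.0],
})

# "Long" format: a Series indexed by (Pclass, Sex).
mean_fare = toy.groupby(["Pclass", "Sex"])["Fare"].mean()
print(mean_fare)

# "Wide" format: one row per Pclass, one column per Sex.
wide = mean_fare.unstack()
print(wide)
```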

**Use the `.unstack()` function on your mean fare variable.**

**Now, use the plot function on the unstacked data to create a bar chart.**

If you add the keyword argument `stacked=True`, it will instead stack the bars within the broader "Pclass" category.
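To compare the two layouts, here is a sketch on invented toy data (the frame `toy` is an assumption, and `matplotlib` must be installed). Each `.plot()` call draws on a fresh figure, so the grouped and stacked charts appear separately.

```python
import pandas as pd

# Invented toy rows that borrow the Titanic column names (not real data).
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 3, 3, 3],
    "Sex": ["female", "female", "male", "male",
            "female", "female", "male", "female"],
    "Fare": [80.0, 90.0, 60.0, 30.0, 20.0, 10.0, 8.0, 7.0],
})

wide = toy.groupby(["Pclass", "Sex"])["Fare"].mean().unstack()

# Grouped bars: one color per column ("Sex"), side by side within each Pclass.
ax_grouped = wide.plot(kind="bar")

# Stacked bars: the same values piled on top of each other within each Pclass.
ax_stacked = wide.plot(kind="bar", stacked=True)
```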

<a id='independent_practice'></a>

### Independent Practice

---

Now that you've covered the basics of grouping, applying functions, aggregating data, and `pandas` plotting with grouped data, [open up the practice notebook](./practice/practice_pandas_grouping.ipynb) and explore the UFO sightings data!