# Lecture 9
# Section 2: Summarization, Aggregation and Grouping of Data

In this section we will look at ways to apply functions to the rows of a DataFrame, as well as to summarise and group the DataFrame.

## Apply a function to every row in a Pandas dataframe

We often want to map across all of the rows in a DataFrame. And Pandas has a function for that: `apply`.

Let's look at an example.

###  United States Census Data

Let's look at some census data. The data is stored in the file `census.csv` and comes from the *United States Census Bureau*. In particular, this is a breakdown of the population level data at the US county level.

In [None]:
import pandas as pd
import numpy as np
df = pd.read_csv('data/census.csv')
df = df[df['SUMLEV']==50]
df.tail()

In this DataFrame we have six columns for population estimates (`POPESTIMATE2010` to `POPESTIMATE2015`). Each column corresponds to one year of estimates. It makes sense to create some new columns for minimum or maximum values, and the `apply` function is an easy way to do this.

First we write a function that 
* takes a certain row of data, 
* finds a minimum and maximum value and 
* returns a new row of data. 

We call this function `min_max`:
* we create a slice of the row by selecting the population columns,
* then we use the NumPy `min` and `max` functions, and
* we return a new Series whose labels (`min` and `max`) name the new values.


In [None]:
import numpy as np
def min_max(row):
    data = row[['POPESTIMATE2010',
                'POPESTIMATE2011',
                'POPESTIMATE2012',
                'POPESTIMATE2013',
                'POPESTIMATE2014',
                'POPESTIMATE2015']]
    return pd.Series({'min': np.min(data), 'max': np.max(data)})

Then we just need to call `apply` on the DataFrame. <br>
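Now we have to be careful: we have talked about axis 0 being the rows of the DataFrame. But the `axis` parameter of `apply` does not name the axis we iterate over. With `axis=1` the function receives one row at a time, and that row is a Series indexed by the column labels. So to apply a function to every row, pass `axis=1`.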

In [None]:
df.apply(min_max, axis=1)
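To make the effect of `axis` concrete, here is a minimal sketch on a tiny made-up DataFrame. The frame `toy`, its values and the list `pop_cols` are assumptions for illustration and are not part of `census.csv`; the function follows the same idea as `min_max` above. The last step shows one possible way to attach the result to the original frame as new columns.

```python
import pandas as pd

# Hypothetical toy data standing in for the census file
toy = pd.DataFrame(
    {
        "CTYNAME": ["A County", "B County"],
        "POPESTIMATE2010": [100, 250],
        "POPESTIMATE2011": [110, 240],
        "POPESTIMATE2012": [105, 260],
    }
)
pop_cols = ["POPESTIMATE2010", "POPESTIMATE2011", "POPESTIMATE2012"]


def min_max(row):
    # `row` is a Series indexed by column labels
    data = row[pop_cols]
    return pd.Series({"min": data.min(), "max": data.max()})


# axis=1: the function is called once per row
per_row = toy.apply(min_max, axis=1)

# axis=0 (the default): the function is called once per column
per_column = toy[pop_cols].apply(lambda col: col.max(), axis=0)

# Attach the new columns to the original frame (aligned on the index)
toy_with_stats = toy.join(per_row)
print(toy_with_stats)
print(per_column)
```

`per_row` has one row per county, while `per_column` has one entry per population column.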

Please note: `apply` is rarely used with a full, named function definition as we have done here. More often it is called with a `lambda` function.

Here is a one-line example that calculates the maximum across the population columns for each row using the `apply` function:

In [None]:
import numpy as np
# These are column labels; the lambda receives one row at a time
cols = ['POPESTIMATE2010',
        'POPESTIMATE2011',
        'POPESTIMATE2012',
        'POPESTIMATE2013',
        'POPESTIMATE2014',
        'POPESTIMATE2015']
df.apply(lambda x: np.max(x[cols]), axis=1)

You can imagine how you can chain multiple `apply` calls with `lambda`s to create a readable yet concise data manipulation script.
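As a rough sketch of such a chain, the example below first rescales every population column and then computes a per-row spread. The frame `toy`, its values and the name `spread` are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for the census file
toy = pd.DataFrame(
    {
        "POPESTIMATE2010": [1000, 2500],
        "POPESTIMATE2011": [1100, 2400],
        "POPESTIMATE2012": [1050, 2600],
    },
    index=["A County", "B County"],
)

spread = (
    toy
    # Step 1 (column-wise, default axis=0): express populations in thousands
    .apply(lambda col: col / 1000)
    # Step 2 (row-wise, axis=1): largest minus smallest estimate per county
    .apply(lambda row: row.max() - row.min(), axis=1)
)
print(spread)
```

Each `apply` returns a new pandas object, so the next call can be attached directly to it.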

## Summarizing the DataFrame

Once the data has been loaded into Python, Pandas makes the calculation of different statistics very simple. For example, `mean`, `max`, `min`, *standard deviations* and more for columns are easily calculable:

In [None]:
import pandas as pd
import numpy as np
df = pd.read_csv('data/census.csv')
df = df[df['SUMLEV']==50]

In [None]:
df['CENSUS2010POP'][df['STNAME'] == 'Florida'].mean()

In [None]:
df['CENSUS2010POP'][df['STNAME'] == 'Florida'].max()

In [None]:
df['CENSUS2010POP'][df['STNAME'] == 'Florida'].min()

In [None]:
df['CENSUS2010POP'][df['STNAME'] == 'Florida'].std()

The need for custom functions is minimal unless you have very specific requirements. The full range of basic statistics that are quickly calculable and built into the base Pandas package are:

| Function | Description                         |
|----------|-------------------------------------|
| count    | Number of non-null observations     |
| sum      | Sum of values                       |
| mean     | Mean of values                      |
| mad      | Mean absolute deviation (removed in pandas 2.0) |
| median   | Arithmetic median of values         |
| min      | Minimum                             |
| max      | Maximum                             |
| mode     | Mode                                |
| abs      | Absolute Value                      |
| prod     | Product of values                   |
| std      | Unbiased standard deviation         |
| var      | Unbiased variance                   |
| sem      | Unbiased standard error of the mean |
| skew     | Unbiased skewness (3rd moment)      |
| kurt     | Unbiased kurtosis (4th moment)      |
| quantile | Sample quantile (value at %)        |
| cumsum   | Cumulative sum                      |
| cumprod  | Cumulative product                  |
| cummax   | Cumulative maximum                  |
| cummin   | Cumulative minimum                  |

The `describe()` function is a useful summary tool that quickly displays statistics for each variable or group to which it is applied. The output of `describe()` varies depending on whether you apply it to a numeric or character column.

In [None]:
df['CENSUS2010POP'][df['STNAME'] == 'Florida'].describe()

For numeric data, the result’s index includes `count`, `mean`, `std`, `min` and `max` as well as the lower, 50th and upper percentiles. By default the lower percentile is the 25th and the upper percentile is the 75th. The 50th percentile is the same as the median.
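The difference between numeric and text columns can be sketched on a small made-up frame. The frame `toy` and its values are assumptions for illustration. The `percentiles` argument of `describe()` is one way to request other percentiles than the defaults.

```python
import pandas as pd

# Hypothetical toy data standing in for the census file
toy = pd.DataFrame(
    {
        "STNAME": ["Florida", "Florida", "Texas", "Texas"],
        "CENSUS2010POP": [100, 300, 200, 400],
    }
)

# Numeric column: count, mean, std, min, percentiles, max
numeric_summary = toy["CENSUS2010POP"].describe()

# Text (object) column: count, unique, top, freq
text_summary = toy["STNAME"].describe()

# Ask for the 10th and 90th percentiles instead of the 25th and 75th
custom_summary = toy["CENSUS2010POP"].describe(percentiles=[0.1, 0.9])

print(numeric_summary)
print(text_summary)
print(custom_summary)
```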

## Group by

In [None]:
import pandas as pd
import numpy as np
df = pd.read_csv('data/census.csv')
df = df[df['SUMLEV']==50]
df.tail()

Mastering the `groupby()` functionality of Pandas gives you a powerful tool. `groupby()` divides the data into groups according to a variable of your choice. For example, the expression `df.groupby('STNAME')` splits our current DataFrame into one group per state name.

The function `groupby()` returns a GroupBy object. This object does not hold a result yet; it describes how the rows of the original data set were split. The `.groups` attribute of the GroupBy object is a dictionary whose keys are the computed unique groups and whose values are the axis labels (here, the row index labels) that belong to each group. For example:

In [None]:
df.groupby('STNAME').groups.keys()

In [None]:
len(df.groupby(['STNAME']).groups['Florida'])

The group "Florida" consists of **67** entries.
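The structure of the `groups` dictionary is easier to see on a tiny example. The sketch below uses a made-up frame; `toy`, its values and `group_sizes` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for the census file
toy = pd.DataFrame(
    {
        "STNAME": ["Florida", "Texas", "Florida", "Texas", "Florida"],
        "CTYNAME": ["A", "B", "C", "D", "E"],
    }
)

grouped = toy.groupby("STNAME")

# Keys are the group names, values are the row labels of each group
florida_labels = list(grouped.groups["Florida"])

# Number of rows per group, computed from the dictionary
group_sizes = {name: len(labels) for name, labels in grouped.groups.items()}

print(florida_labels)
print(group_sizes)
```

The same counts are also available directly via `grouped.size()`.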

Functions like `max()`, `min()`, `mean()`, `first()` and `last()` can be applied directly to the GroupBy object to obtain summary statistics for each group, which is immensely useful. You can also choose which columns to include in or exclude from each summary.

Get the first entry for each state with `first()`. <br>
*Please note: for a clear overview, we only show the first five entries via `head()` here.*

In [None]:
firstdf = df.groupby('STNAME').first()
firstdf.head()

What is the average population of a county per state, based on the 2010 census?

In [None]:
df.groupby('STNAME')['CENSUS2010POP'].mean()


The `groupby` output will have an index or multi-index on rows corresponding to your chosen grouping variables. To avoid setting this index, pass `as_index=False` to the groupby operation.

In [None]:
df = df[df['SUMLEV']==50]
df.groupby('STNAME', as_index=False)['CENSUS2010POP'].mean()
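The effect of `as_index=False` can be compared side by side on a small made-up frame. The frame `toy` and its values are assumptions for illustration. Calling `reset_index()` on the indexed result is another common way to get the grouping variable back as a regular column.

```python
import pandas as pd

# Hypothetical toy data standing in for the census file
toy = pd.DataFrame(
    {
        "STNAME": ["Florida", "Florida", "Texas", "Texas"],
        "CENSUS2010POP": [100, 300, 200, 400],
    }
)

# Default: the group names become the index of the result
indexed = toy.groupby("STNAME")["CENSUS2010POP"].mean()

# as_index=False: the group names stay a regular column
flat = toy.groupby("STNAME", as_index=False)["CENSUS2010POP"].mean()

# Alternative: move the index back into a column afterwards
flat_again = indexed.reset_index()

print(indexed)
print(flat)
```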

We have already seen how we can select a group using the `groups` dictionary and the corresponding key. Another way to select a group is to use `GroupBy.get_group()`.  This function returns a DataFrame containing the data of the given group.

In the following example we get the DataFrame of the `Florida` group.

In [None]:
df.groupby('STNAME').get_group('Florida')

### Applying specific functions to various columns in groups

Please note that `agg` and `aggregate` can be used interchangeably. `agg` is shorter, so let's use it.

Instructions for aggregation are provided in the form of a Python dictionary or list. The dictionary keys specify the columns on which you want to perform operations, and the dictionary values specify the function (or functions) to be applied.

In [None]:
df.groupby('STNAME').agg({'CENSUS2010POP': 'mean'})

In [None]:
df = df[df['SUMLEV']==50]
df.groupby('STNAME').agg(
        {
            'CENSUS2010POP': 'mean',
            'CTYNAME': 'count'
        }
    )

The aggregation dictionary syntax is flexible and can be defined before the operation. You can also define functions inline using `lambda` functions to extract statistics that are not provided by the built-in options.

In [None]:
# Define the aggregation procedure outside of the groupby operation
aggregations = {
    'CTYNAME': 'sum', # concatenates the county names; or try: lambda names: '; '.join(names)
    'CENSUS2010POP': lambda x: sum(x)/1_000_000 # total population of the state in millions
}
df.groupby('STNAME').agg(aggregations)

### Applying multiple functions to columns in groups

To apply multiple functions to a single column in your grouped data, expand the syntax above to pass in a list of functions as the value in your aggregation dictionary:

In [None]:
df = df[df['SUMLEV']==50]
df.groupby('STNAME').agg(
        {
            'CENSUS2010POP':  ['min', 'max', 'sum'],
            'CTYNAME':  ['min', 'max', 'count'],
        }
    )

The `agg()` syntax is flexible and easy to use. Remember that you can pass custom and lambda functions to your list of aggregated calculations, and each will be passed the values from the column in your grouped data.
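As a small sketch of mixing built-in names with a custom function, the example below uses a made-up frame and a hypothetical helper `pop_range`. Passing a list produces two-level column labels, so the last step shows one possible way to flatten them.

```python
import pandas as pd

# Hypothetical toy data standing in for the census file
toy = pd.DataFrame(
    {
        "STNAME": ["Florida", "Florida", "Texas", "Texas"],
        "CENSUS2010POP": [100, 300, 200, 400],
    }
)


def pop_range(values):
    # Receives the column values of one group as a Series
    return values.max() - values.min()


result = toy.groupby("STNAME").agg(
    {"CENSUS2010POP": ["min", "max", pop_range]}
)

# The columns are (column, function) pairs; join them into single labels
flat = result.copy()
flat.columns = ["_".join(col) for col in result.columns]

print(result)
print(flat)
```

The custom function shows up in the result under its own name, here `pop_range`.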

## Renaming grouped aggregation columns

Pandas supports relabelling the output of a group aggregation through **named aggregation**. Each keyword argument names an output column, and its value is a simple Python tuple giving the column to work on along with the function to **apply**.

In [None]:
df = df[df['SUMLEV']==50]
df.groupby('STNAME').agg(
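        # Function names are passed as strings; recent pandas versions
        # warn when the Python built-ins min, max and sum are passed here.
        # get min of the population for each group
        pop_min= ('CENSUS2010POP', 'min'),
        # get max of the population for each group
        pop_max= ('CENSUS2010POP', 'max'),
        # get sum of the population for each group
        pop_sum= ('CENSUS2010POP', 'sum'),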
    )
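Named aggregation also accepts custom functions, and the tuple can be written more explicitly with `pd.NamedAgg`. The sketch below uses a made-up frame; `toy`, its values and the output names `pop_range` and `n_counties` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for the census file
toy = pd.DataFrame(
    {
        "STNAME": ["Florida", "Florida", "Texas", "Texas"],
        "CTYNAME": ["A", "B", "C", "D"],
        "CENSUS2010POP": [100, 300, 200, 400],
    }
)

summary = toy.groupby("STNAME").agg(
    # (input column, function name) tuple
    pop_min=("CENSUS2010POP", "min"),
    # a lambda works as the function part of the tuple too
    pop_range=("CENSUS2010POP", lambda s: s.max() - s.min()),
    # pd.NamedAgg is the explicit form of the same tuple
    n_counties=pd.NamedAgg(column="CTYNAME", aggfunc="count"),
)
print(summary)
```

The keyword names become the column labels of the result, so no renaming step is needed afterwards.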

The GroupBy functionality in Pandas is performant and is well documented in the official [docs](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html).