<a href="https://colab.research.google.com/github/anyuanay/info212/blob/main/INFO212_Week6_Lecture_Aggregation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# INFO 212: Data Science Programming 1
___

## Week 6: Lecture: Data Aggregation and Group Operations
---

**Agenda:**
- frame.groupby(list).agg
- groupby by a function
- groupby by a dictionary
- groupby by a list
- iterate over groups
- split-apply-combine
- pivot tables and cross-tabulation


```
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
```

# GroupBy Mechanics

Categorizing a dataset and applying a function to each group, whether an aggregation
or transformation, is often a critical component of a data analysis workflow. After
loading, merging, and preparing a dataset, you may need to compute group statistics
or possibly pivot tables for reporting or visualization purposes.

## The split-apply-combine Principle for GroupBy
In the first stage of the groupby process, data contained in a pandas object, whether a Series,
DataFrame, or otherwise, is split into groups based on one or more keys that you provide.
The splitting is performed on a particular axis of an object. For example, a DataFrame
can be grouped on its rows (axis=0) or its columns (axis=1). Once this is done, a
function is applied to each group, producing a new value. Finally, the results of all
those function applications are combined into a result object. The form of the resulting
object will usually depend on what’s being done to the data.

![](https://i.imgur.com/89PD9om.png)
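The three stages can be traced by hand on a small frame. The sketch below is only illustrative: the frame `demo` and its values are made up for this example, and in practice the single `groupby(...).sum()` call at the end performs all three steps.

```python
import pandas as pd

# Made-up data: three keys, three values each
demo = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
                     'data': [0, 5, 10, 5, 10, 15, 10, 15, 20]})

# Split: one chunk of rows per distinct key
pieces = {key: chunk for key, chunk in demo.groupby('key')}

# Apply: reduce each chunk to a single value
sums = {key: chunk['data'].sum() for key, chunk in pieces.items()}

# Combine: assemble the per-group values into one result object
manual = pd.Series(sums, name='data')

# The same result in one step
direct = demo.groupby('key')['data'].sum()

print(manual)
print(direct)
```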

## pandas provides a flexible groupby interface, enabling you to slice, dice, and summarize datasets in a natural way.

## Each grouping key can take many forms, and the keys do not have to be all of the same type:
- A list or array of values that is the same length as the axis being grouped
- A value indicating a column name in a DataFrame
- A dict or Series giving a correspondence between the values on the axis being
grouped and the group names
- A function to be invoked on the axis index or the individual labels in the index
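As a rough illustration of the four key forms listed above, the following sketch groups one small Series in each way. The frame `frame`, its index labels, and the helper names (`labels`, `index_to_group`) are assumptions made up for this example.

```python
import numpy as np
import pandas as pd

# Made-up frame with a string index so the dict and function keys have labels to work on
frame = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                      'data1': [1.0, 2.0, 3.0, 4.0, 5.0]},
                     index=['x1', 'x2', 'y1', 'y2', 'x3'])

# 1. An array of values with the same length as the axis
labels = np.array(['u', 'v', 'u', 'v', 'u'])
by_array = frame['data1'].groupby(labels).sum()

# 2. A column name
by_column = frame.groupby('key1')['data1'].sum()

# 3. A dict mapping index labels to group names
index_to_group = {'x1': 'x', 'x2': 'x', 'x3': 'x', 'y1': 'y', 'y2': 'y'}
by_dict = frame['data1'].groupby(index_to_group).sum()

# 4. A function called once per index label
by_function = frame['data1'].groupby(lambda label: label[0]).sum()

print(by_array, by_column, by_dict, by_function, sep='\n\n')
```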

Create a DataFrame example:

```
df = pd.DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],
                   'key2' : ['one', 'two', 'one', 'two', 'one'],
                   'data1' : np.random.randn(5),
                   'data2' : np.random.randn(5)})

df.info()
```

## Frequently, the grouping information is found in the same DataFrame as the data you want to work on. In that case, you can pass column names (whether those are strings, numbers, or other Python objects) as the group keys:
```
df.groupby('key1')[['data1', 'data2']].mean()

df.groupby('key1')

```

## Exercise:
Compute the average total_bill and tip_pct for each unique value in the day column.

## The group keys could be any arrays of the right length:

```
states = np.array(['Ohio', 'California', 'California', 'Ohio', 'Ohio'])
years = np.array([2005, 2005, 2006, 2005, 2006])


df['data1'].groupby(states).mean()

df['data2'].groupby(years).sum()

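# Select the numeric columns first; the string columns key1 and key2 cannot be averaged
df.groupby([states, years])[['data1', 'data2']].mean()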
```

# DataSet: Meals and Tips
Download the `tips.csv` dataset and load it as a DataFrame here.

## Explore the tips dataset

## Add a tip_pct column
Compute the percentage of tips for each meal
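A minimal sketch of this step, assuming `tip_pct` is defined as the tip divided by the total bill (a fraction, not yet scaled to 100). The three rows below are made-up stand-ins for the real file; with the actual dataset you would start from something like `pd.read_csv('tips.csv')`, where the path depends on where you saved the download.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the real tips data
tips_demo = pd.DataFrame({'total_bill': [10.0, 20.0, 40.0],
                          'tip': [1.5, 3.0, 10.0]})

# Assumed definition: tip as a fraction of the total bill
tips_demo['tip_pct'] = tips_demo['tip'] / tips_demo['total_bill']

print(tips_demo)
```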

## Exercise:
Extract the column `time` as a list called `meal_type`. Compute the maximum total_bill, tip, and tip_pct of each type of meal.

## Exercise:
Group the tips data by the combinations of day and smoker. Sort the results by the maximum of tip_pct in descending order.

# Iterating Over Groups
## The GroupBy object supports iteration, generating a sequence of 2-tuples containing the group name along with the chunk of data. Consider the following:
```
for key, group in df.groupby('key1'):
    print(key)
    print(group)
```

## In the case of multiple keys, the first element in the tuple will be a tuple of key values:

```
for key, chunk in df.groupby(['key1', 'key2']):
    print(key[0], key[1])
    print(chunk)
```

## Of course, you can choose to do whatever you want with the pieces of data. A recipe you may find useful is computing a dict of the data pieces as a one-liner:

```
pieces = dict(list(df.groupby('key1')))
for k in pieces:
    print('key = ', k)
    print('chunk =\n', pieces[k])


pieces['b']
```

## Exercise:
Group the tips dataset by the unique values in the day column; then sort all of Sunday's records by size in descending order.

## By default, groupby groups on axis=0, but you can group on any of the other axes. For example, we could group the columns of our example df here by dtype like so:

```
# Note: the axis=1 argument to groupby is deprecated as of pandas 2.1
grouped = df.groupby(df.dtypes, axis=1)


for dtype, group in grouped:
    print(dtype)
    print(group)
```
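On recent pandas releases, where passing `axis=1` to groupby is deprecated, one possible alternative is to transpose the frame and group its rows instead. The sketch below uses a made-up frame `demo` and groups by the dtype names as strings rather than by the dtype objects themselves; that choice is an assumption made here to keep the group keys simple.

```python
import pandas as pd

# Made-up frame with one string column and two float columns
demo = pd.DataFrame({'key1': ['a', 'a', 'b'],
                     'data1': [1.0, 2.0, 3.0],
                     'data2': [4.0, 5.0, 6.0]})

# One dtype name per column, indexed by column name
dtype_names = demo.dtypes.astype(str)

# Transposing turns columns into rows, so an ordinary row-wise groupby applies
columns_by_dtype = {}
for dtype_name, chunk in demo.T.groupby(dtype_names):
    columns_by_dtype[dtype_name] = list(chunk.index)

print(columns_by_dtype)
```

Each `chunk` here is a transposed slice, so `chunk.T` recovers the original column layout for that dtype.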

## Exercise:
Group the columns of the tips dataset by their data types.

# How to Group by Every Three Consecutive Rows

```
df = pd.DataFrame([1, 2, 3, 4, 1, 2, 3], columns=['val'])

# Integer division of the row positions by 3 gives the block labels 0, 0, 0, 1, 1, 1, 2
for key, data in df.groupby(np.arange(len(df)) // 3):
    print(key)
    print(data)
```
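The trick in the example above is that integer division of the row positions by the block size yields one label per block. A small sketch that makes the labels visible and aggregates each block; the names `demo`, `block_size`, and `block_labels` are made up for illustration:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'val': [1, 2, 3, 4, 1, 2, 3]})

block_size = 3
# Row positions 0..6 divided by 3 (rounding down) label each block of rows
block_labels = np.arange(len(demo)) // block_size
print(block_labels)   # [0 0 0 1 1 1 2]

# Any array of the right length can serve as the group key
block_sums = demo.groupby(block_labels)['val'].sum()
print(block_sums)
```

The last block is shorter than the others whenever the number of rows is not a multiple of `block_size`.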

## Exercise:
Group the tips dataset by every 4 consecutive rows.

# Grouping with Dicts and Series
## Grouping information may exist in a form other than an array. Let’s consider another example DataFrame:

```
people = pd.DataFrame(np.random.randn(5, 5),
                      columns=['a', 'b', 'c', 'd', 'e'],
                      index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
people.iloc[2:3, [1, 2]] = np.nan # Add a few NA values
people
```

## Now, suppose I have a group correspondence for the columns and want to sum together the columns by group:

```
mapping = {'a': 'red', 'b': 'red', 'c': 'blue',
           'd': 'blue', 'e': 'red', 'f' : 'orange'}
```

## Now, you could construct an array from this dict to pass to groupby, but instead we can just pass the dict:

```
by_column = people.groupby(mapping, axis=1)
by_column.sum()
```
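To see that the dict is just a shortcut for an explicit array of labels, the sketch below computes the column sums both ways. It groups the transposed frame instead of passing `axis=1`, which is one way to get the same result on recent pandas releases; the names `people_demo` and `mapping_demo` and the seeded random values are assumptions for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
people_demo = pd.DataFrame(rng.standard_normal((5, 5)),
                           columns=['a', 'b', 'c', 'd', 'e'],
                           index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])

# The unused key 'f' is simply ignored
mapping_demo = {'a': 'red', 'b': 'red', 'c': 'blue',
                'd': 'blue', 'e': 'red', 'f': 'orange'}

# Pass the dict directly: rows of the transposed frame are the original columns
by_dict = people_demo.T.groupby(mapping_demo).sum().T

# Build the equivalent array of labels, one per column
labels = np.array([mapping_demo[col] for col in people_demo.columns])
by_array = people_demo.T.groupby(labels).sum().T

print(by_dict)
```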

## The same functionality holds for Series, which can be viewed as a fixed-size mapping:

```
map_series = pd.Series(mapping)
map_series


people.groupby(map_series, axis=1).sum()
```

## Exercise:
For the day column, map 'Sun' and 'Sat' to 'weekend' and 'Thur' and 'Fri' to 'weekday'. Compute the mean total_bill and tip_pct for weekend and weekday.

# Grouping with Functions
Using Python functions is a more generic way of defining a group mapping compared
with a dict or Series. Any function passed as a group key will be called once per index
value, with the return values being used as the group names. More concretely, consider
the example DataFrame above, which has people’s first
names as index values. Suppose you wanted to group by the length of the names;
while you could compute an array of string lengths, it’s simpler to just pass the len
function:

```
for k, chunk in people.groupby(len):
    print(k)
    print(chunk)
```
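As the text notes, passing `len` is equivalent to computing the array of name lengths yourself. A small sketch comparing the two, using a made-up frame `people_demo` with fixed values so the sums are easy to check:

```python
import numpy as np
import pandas as pd

people_demo = pd.DataFrame(np.arange(10.0).reshape(5, 2),
                           columns=['a', 'b'],
                           index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])

# The function is called once per index label; its return value is the group name
by_function = people_demo.groupby(len).sum()

# The longer route: build the array of lengths explicitly
name_lengths = np.array([len(name) for name in people_demo.index])
by_array = people_demo.groupby(name_lengths).sum()

print(by_function)
```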

## Exercise:
Define a function that converts tip_pct to a percentage (out of 100) and rounds it to the nearest multiple of 10 (i.e., 10, 20, 30, ...). For each rounded percentage, compute the average total_bill and tip_pct.

# Grouping by Index Levels
## A convenience for hierarchically indexed datasets is the ability to aggregate using one of the levels of an axis index. Let’s look at an example:

```
columns = pd.MultiIndex.from_arrays([['US', 'US', 'US', 'JP', 'JP'],
                                    [1, 3, 5, 1, 3]],
                                    names=['cty', 'tenor'])
hier_df = pd.DataFrame(np.random.randn(4, 5), columns=columns)
hier_df
```

## To group by level, pass the level number or name using the level keyword:

```
hier_df.groupby(level='cty', axis=1).count()
```
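The `level` keyword is not specific to columns. A minimal sketch of the same idea on a hierarchical row index, which needs no `axis` argument; the frame `hier_rows` and its values are made up for this example:

```python
import pandas as pd

index = pd.MultiIndex.from_arrays([['US', 'US', 'US', 'JP', 'JP'],
                                   [1, 3, 5, 1, 3]],
                                  names=['cty', 'tenor'])
hier_rows = pd.DataFrame({'value': [1.0, 2.0, 3.0, 4.0, 5.0]}, index=index)

# Group by the outer index level, by name
counts = hier_rows.groupby(level='cty').count()

# Group by the inner index level, by position
sums = hier_rows.groupby(level=1).sum()

print(counts)
print(sums)
```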

# Data Aggregation
Aggregations refer to any data transformation that produces scalar values from
arrays. The preceding examples have used several of them, including mean, count,
min, and sum. However, you are not limited to only this set of
methods.

```
df = pd.DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],
                   'key2' : ['one', 'two', 'one', 'two', 'one'],
                   'data1' : np.random.randn(5),
                   'data2' : np.random.randn(5)})
df
```

You can use aggregations of your own devising and additionally call any method that
is also defined on the grouped object. For example, you might recall that quantile
computes sample quantiles of a Series or a DataFrame’s columns.
Because quantile is a Series method, it is also available on a grouped Series.
Conceptually, GroupBy slices up the Series, calls
piece.quantile(0.9) for each piece, and then assembles those results into
the result object:

```
grouped = df.groupby('key1')
grouped['data1'].quantile(0.9)
```

## To use your own aggregation functions, pass any function that aggregates an array to the aggregate or agg method:

```
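def peak_to_peak(arr):
    # Range of the values: largest minus smallest
    return arr.max() - arr.min()

# Applied to whole columns
df[['data1', 'data2']].apply(peak_to_peak)

# Applied per group; select the numeric columns so the string column key2 is not passed in
df.groupby('key1')[['data1', 'data2']].agg(peak_to_peak)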
```
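`agg` also accepts a list, so built-in aggregations (given by name) and your own functions can be computed side by side. A short sketch with made-up values in a frame called `demo`:

```python
import pandas as pd

def peak_to_peak(arr):
    # Range of the values: largest minus smallest
    return arr.max() - arr.min()

demo = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                     'data1': [1.0, 4.0, 2.0, 8.0, 6.0]})

# One output column per aggregation; a function's column is named after the function
result = demo.groupby('key1')['data1'].agg(['min', 'max', peak_to_peak])

print(result)
```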

## You may notice that some methods like describe also work, even though they are not aggregations, strictly speaking:

```
df.describe()

grouped.describe()
```

## Exercise:
For each unique value in the day column, compute the maximum, minimum, and the difference between the max and min for total_bill and tip_pct.

## Select Top Results on Aggregation
Suppose we wanted to select the top five tip_pct values by group. First, write a function that selects the rows with the largest values in a particular column:

```
def top(df, n=5, column='tip_pct'):
    return df.sort_values(by=column, ascending=False)[:n]

```

Now, if we group by smoker, say, and call apply with this function, we get the following:

```
tips.groupby('smoker').apply(top)
```

The top function is called on each row group from the
DataFrame, and then the results are glued together using pandas.concat, labeling the
pieces with the group names. The result therefore has a hierarchical index whose
inner level contains index values from the original DataFrame.
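Extra arguments given to `apply` after the function are forwarded to it, so the same `top` function can be reused with a different `n` or `column`. The sketch below uses a tiny made-up frame `tips_demo` in place of the real tips data and selects the value columns explicitly before calling `apply`.

```python
import pandas as pd

def top(df, n=5, column='tip_pct'):
    return df.sort_values(by=column, ascending=False)[:n]

# Made-up rows standing in for the tips dataset
tips_demo = pd.DataFrame({'smoker': ['No', 'No', 'No', 'Yes', 'Yes'],
                          'total_bill': [10.0, 20.0, 30.0, 40.0, 50.0],
                          'tip_pct': [0.10, 0.30, 0.20, 0.15, 0.25]})

# n=2 is passed through to top for every group
top_two = tips_demo.groupby('smoker')[['total_bill', 'tip_pct']].apply(top, n=2)

print(top_two)
```

The outer index level holds the group key and the inner level holds the original row labels.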

## Exercise:
For each unique value in the day column, extract the top 3 total bills.

In the preceding examples, you see that the resulting object has a hierarchical index formed from the group keys along with the indexes of each piece of the original object. You can disable this by passing group_keys=False to groupby:

```
tips.groupby('smoker', group_keys=False).apply(top)
```

# What is a Pivot Table and How to Use It?
- A pivot table is a data summarization tool frequently found in spreadsheet programs
and other data analysis software.
- It aggregates a table of data by one or more keys,
arranging the data in a rectangle with some of the group keys along the rows and
some along the columns.
- Pivot tables in Python with pandas are made possible
through the groupby facility combined with reshape operations
utilizing hierarchical indexing.
- DataFrame has a pivot_table method, and
there is also a top-level pandas.pivot_table function.
- In addition to providing a
convenience interface to groupby, pivot_table can add partial totals, also known as
margins.
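To make the link between pivot tables and groupby concrete, the sketch below builds the same table of means twice: once with `pivot_table`, and once with groupby followed by the reshape operation `unstack`. The frame `tips_demo` and its values are made up for this example.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the tips dataset
tips_demo = pd.DataFrame({'day': ['Sat', 'Sat', 'Sat', 'Sun', 'Sun', 'Sun'],
                          'smoker': ['No', 'Yes', 'No', 'No', 'Yes', 'Yes'],
                          'tip': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]})

# Pivot table: day along the rows, smoker along the columns, mean by default
pivot = tips_demo.pivot_table('tip', index='day', columns='smoker')

# The same table from groupby plus a reshape of the inner index level
via_groupby = tips_demo.groupby(['day', 'smoker'])['tip'].mean().unstack()

print(pivot)
print(via_groupby)
```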

## Returning to the `tips` dataset, suppose you wanted to compute a table of group means (the default pivot_table aggregation type) arranged by day and smoker on the rows:
```
# pivot_table already computes the group means; no further .mean() call is needed
tips[['day', 'smoker', 'total_bill', 'tip']].pivot_table(index=['day', 'smoker'])
```

## This could have been produced with groupby directly. Now, suppose we want to aggregate only tip_pct and size, and additionally group by time. I’ll put smoker in the table columns and time and day in the rows:
```
tips.pivot_table(['tip_pct', 'size'], index=['time', 'day'],
                 columns='smoker')
```

## We could augment this table to include partial totals by passing margins=True. This has the effect of adding All row and column labels, with corresponding values being the group statistics for all the data within a single tier:
```
tips.pivot_table(['size', 'tip_pct'], index=['time', 'day'],
                 columns='smoker', margins=True)
```

## Here, the All values are means that do not take into account smoker versus nonsmoker (the All columns) or either of the two levels of grouping on the rows (the All row).

## To use a different aggregation function, pass it to aggfunc. For example, 'count' or len will give you a cross-tabulation (count or frequency) of group sizes:
```
tips.pivot_table('tip_pct', index=['time', 'smoker'], columns='day',
                 aggfunc=len, margins=True)
```

## If some combinations are empty (or otherwise NA), you may wish to pass a fill_value:
```
tips.pivot_table('tip_pct', index=['time', 'size', 'smoker'],
                 columns='day', aggfunc='mean', fill_value=0)
```

# What is Cross-Tabulation? How to Use Crosstab?
## A cross-tabulation (or crosstab for short) is a special case of a pivot table that computes group frequencies. Here is an example:

```
import pandas as pd
from io import StringIO
data = """\
Sample  Nationality  Handedness
1   USA  Right-handed
2   Japan    Left-handed
3   USA  Right-handed
4   Japan    Right-handed
5   Japan    Left-handed
6   Japan    Right-handed
7   USA  Right-handed
8   USA  Left-handed
9   Japan    Right-handed
10  USA  Right-handed"""
# Raw string so that \s is read as a regex token, not a Python string escape
data = pd.read_table(StringIO(data), sep=r'\s+')
```

## As part of some survey analysis, we might want to summarize this data by nationality and handedness. You could use pivot_table to do this, but the pandas.crosstab function can be more convenient:

```
pd.crosstab(data.Nationality, data.Handedness, margins=True)
```
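For comparison, here is a sketch of the `pivot_table` route mentioned above, next to `crosstab`. It rebuilds the survey rows as a frame named `survey` (a name chosen for this example) and counts the `Sample` column within each group.

```python
import pandas as pd

survey = pd.DataFrame({
    'Sample': range(1, 11),
    'Nationality': ['USA', 'Japan', 'USA', 'Japan', 'Japan',
                    'Japan', 'USA', 'USA', 'Japan', 'USA'],
    'Handedness': ['Right-handed', 'Left-handed', 'Right-handed', 'Right-handed',
                   'Left-handed', 'Right-handed', 'Right-handed', 'Left-handed',
                   'Right-handed', 'Right-handed']})

# crosstab counts the co-occurrences directly
counts = pd.crosstab(survey['Nationality'], survey['Handedness'])

# pivot_table needs a column to count and an explicit aggfunc
via_pivot = survey.pivot_table('Sample', index='Nationality',
                               columns='Handedness', aggfunc='count')

print(counts)
print(via_pivot)
```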

## The first two arguments to crosstab can each be an array, a Series, or a list of arrays. As in the tips data:

```
pd.crosstab([tips.time, tips.day], tips.smoker, margins=True).plot.bar()
```