# Week 13
# GroupBy Examples

Today we will work through some examples of handling data with the `groupby()` method.

In [None]:
import numpy as np
import pandas as pd

## Example 1: Airports

From the [airports](https://ourairports.com/data/airports.csv) data, find the number of large airports in each country.

In [None]:
airports = pd.read_csv("https://ourairports.com/data/airports.csv", sep=",")
airports.head()
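
One possible approach, sketched on a small hand-made table so that it runs without a network connection: filter on the airport type, then count rows per country. The column names `type` and `iso_country` and the value `large_airport` are assumed to match the OurAirports file; the rows in `toy_airports` are hypothetical stand-ins, not real records.

```python
import pandas as pd

# Hypothetical stand-in rows; the real file is assumed to use the same
# `type` and `iso_country` columns.
toy_airports = pd.DataFrame({
    "ident": ["A1", "A2", "A3", "B1", "B2", "C1"],
    "type": ["large_airport", "large_airport", "small_airport",
             "large_airport", "heliport", "medium_airport"],
    "iso_country": ["US", "US", "US", "CA", "CA", "MX"],
})

# Keep only the large airports, then count rows per country.
large = toy_airports[toy_airports["type"] == "large_airport"]
large_per_country = (large.groupby("iso_country")
                          .size()
                          .sort_values(ascending=False))
print(large_per_country)
```

Countries with no large airport do not appear in the result, because their rows are removed before grouping.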

## Example 2: Filling Missing Values with Group-Specific Values

Previously, we learned that a common way to handle missing values is to fill them with the mean. A more refined approach is to fill each missing value with the mean of the specific group that the record belongs to. Consider the following example:

In [None]:
states = ['Ohio', 'New York', 'Vermont', 'Florida',
          'Oregon', 'Nevada', 'California', 'Idaho']
group_key = ['East'] * 4 + ['West'] * 4
data = pd.DataFrame(np.random.randn(8), index=states, columns=['Value'])
data.loc[['Vermont', 'Nevada', 'Idaho']] = np.nan
data['group_key'] = group_key
data

There are two groups of states: eastern states and western states. Instead of filling the missing values with the average value of all states, let's fill Vermont's value with the average of eastern states, and fill Nevada's value and Idaho's value with the average of western states.
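
A minimal sketch of one way to do this, using fixed made-up values instead of random numbers so the result is reproducible. The names `toy`, `group_means`, and `filled` are illustrative only.

```python
import numpy as np
import pandas as pd

# Fixed, made-up values so the output is repeatable.
toy = pd.DataFrame(
    {"Value": [1.0, 3.0, np.nan, 5.0, 2.0, np.nan, 6.0, np.nan],
     "group_key": ["East"] * 4 + ["West"] * 4},
    index=["Ohio", "New York", "Vermont", "Florida",
           "Oregon", "Nevada", "California", "Idaho"],
)

# Mean of each group; missing entries are skipped by default.
group_means = toy.groupby("group_key")["Value"].mean()

# transform("mean") broadcasts each group's mean back onto the original rows,
# so fillna() can pick the right value for every missing entry.
row_means = toy.groupby("group_key")["Value"].transform("mean")
filled = toy["Value"].fillna(row_means)

print(group_means)
print(filled)
```

Here `transform()` is used rather than `mean()` alone because it returns a result with the same index as the original data, which is what `fillna()` needs for alignment.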

In [None]:
# Find the average value of eastern states and western states



In [None]:
# Fill missing values with the group-specific average.



## Example: Random Sampling and Permutation

In [None]:
# Hearts, Spades, Clubs, Diamonds
suits = ['H', 'S', 'C', 'D']
card_val = (list(range(1, 11)) + [10] * 3) * 4
base_names = ['A'] + list(range(2, 11)) + ['J', 'K', 'Q']
cards = []
for suit in suits:
    cards.extend(str(num) + suit for num in base_names)

deck = pd.Series(card_val, index=cards)
deck
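
The sketch below shows one way to draw random cards, both from the whole deck and from each suit. It rebuilds the deck so that it runs on its own. The seed passed to `random_state` is an arbitrary choice to make the draw repeatable, and `groupby(...).sample()` assumes pandas 1.1 or later.

```python
import pandas as pd

# Rebuild the same deck as above so this block is self-contained.
suits = ["H", "S", "C", "D"]
base_names = ["A"] + list(range(2, 11)) + ["J", "K", "Q"]
card_val = (list(range(1, 11)) + [10] * 3) * 4
cards = [str(name) + suit for suit in suits for name in base_names]
deck = pd.Series(card_val, index=cards)

# Draw 5 cards without replacement.
hand = deck.sample(n=5, random_state=0)

# The suit is the last character of each card label. Passing a function to
# groupby() applies it to every index label to form the groups.
def get_suit(card):
    return card[-1]

# Draw 2 cards from each suit.
two_per_suit = deck.groupby(get_suit).sample(n=2, random_state=0)

print(hand)
print(two_per_suit)
```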

In [None]:
# Randomly sample 5 cards



In [None]:
# Randomly sample 2 cards from each suit



## Example: Analyzing Cellphone History

In [None]:
# Reference:
# https://www.shanelynn.ie/summarising-aggregation-and-grouping-data-in-python-pandas/
url = "https://shanelynnwebsite-mid9n9g1q9y8tt.netdna-ssl.com/wp-content/uploads/2015/06/phone_data.csv"
data = pd.read_csv(url, sep=",", index_col='index')
print(data.shape)
data.head(3)

1. **date**: The date and time of the entry
2. **duration**: The duration (in seconds) for each call, the amount of data (in MB) for each data entry, and the number of texts sent (usually 1) for each sms entry.
3. **item**: A description of the event occurring – can be one of call, sms, or data.
4. **month**: The billing month that each entry belongs to – of form ‘YYYY-MM’.
5. **network**: The mobile network that was called/texted for each entry.
6. **network_type**: Whether the number being called was a mobile, international (‘world’), voicemail, landline, or other (‘special’) number.

In [None]:
# Convert the date column from strings to datetime objects.
# dayfirst=True assumes the file stores dates as day/month/year.
data['date'] = pd.to_datetime(data['date'], dayfirst=True)
data.head(3)
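
The questions in this example all follow the same split-apply-combine pattern. As a rough sketch of that pattern, the block below runs the corresponding queries on a tiny hand-made log. The rows in `toy_log` are hypothetical and only mimic the columns described above; they are not taken from the real file.

```python
import pandas as pd

# Hypothetical rows that mimic the column layout of the phone data.
toy_log = pd.DataFrame({
    "date": pd.to_datetime(["2014-10-15 06:58", "2014-10-16 15:01",
                            "2014-11-03 12:10", "2014-11-04 09:30",
                            "2014-11-05 18:45"]),
    "duration": [34.4, 120.0, 1.0, 300.0, 60.0],
    "item": ["data", "call", "sms", "call", "call"],
    "month": ["2014-10", "2014-10", "2014-11", "2014-11", "2014-11"],
    "network_type": ["data", "mobile", "mobile", "landline", "mobile"],
})

# Months covered by the log.
months = toy_log["month"].unique()

# Restrict to calls before aggregating call durations.
calls = toy_log[toy_log["item"] == "call"]
longest_call = calls.groupby("month")["duration"].max()
total_call = calls.groupby("month")["duration"].sum()

# Number of entries per month, split by item and by network type.
counts_by_item = toy_log.groupby(["month", "item"]).size()
counts_by_type = toy_log.groupby(["month", "network_type"]).size()

print(months, longest_call, total_call, counts_by_item, counts_by_type, sep="\n\n")
```

Grouping by a list of columns yields a result with a hierarchical index, so an individual count can be looked up with a tuple such as `counts_by_item[("2014-11", "call")]`.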

In [None]:
# Which months are covered in this data set?



In [None]:
# What is the longest call duration for each month?



In [None]:
# What is the total call duration for each month?



In [None]:
# How many calls, messages, and data entries are there in each month?



In [None]:
# How many instances are there per month, split by network_type?



## Data Aggregations
Aggregation refers to any data transformation that reduces an array to a single summary value. Examples of aggregation methods include `mean()`, `count()`, `first()`, `min()`, and `sum()`. User-defined functions can also be applied to create a custom summary.

In [None]:
# Define get_range(), which returns max - min
def get_range(array):
    return array.max() - array.min()

In [None]:
# Apply agg() to find the range of each type of cell phone use.
data.groupby(['item'])['duration'].agg(get_range)

In [None]:
# Apply multiple aggregation functions (built-in ones are passed by name)
data.groupby(['item'])['duration'].agg([get_range, 'max', 'min'])

In [None]:
# Declare column names
data.groupby(['item'])['duration'].agg([('range', get_range),
                                        ('maximum', 'max'),
                                        ('minimum', 'min')])
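
As an alternative to a list of `(name, function)` tuples, `agg()` also accepts keyword arguments, where each keyword becomes an output column name (often called named aggregation). The sketch below assumes pandas 0.25 or later and uses a small made-up table, `toy_usage`, in place of the phone data.

```python
import pandas as pd

# Made-up usage records, for illustration only.
toy_usage = pd.DataFrame({
    "item": ["call", "call", "call", "sms", "sms", "data"],
    "duration": [10.0, 250.0, 40.0, 1.0, 1.0, 34.4],
})

def get_range(array):
    """Return the spread (max - min) of the values in one group."""
    return array.max() - array.min()

# Each keyword names an output column; the value is the function to apply.
summary = toy_usage.groupby("item")["duration"].agg(
    range=get_range,
    maximum="max",
    minimum="min",
)
print(summary)
```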