# Cars: Grouping and Aggregation

## Imports

In [1]:
import pandas as pd
import numpy as np
from altair import load_dataset

## Dataset

In [2]:
cars = load_dataset('cars')

In [3]:
cars.head()

,Acceleration,Cylinders,Displacement,Horsepower,Miles_per_Gallon,Name,Origin,Weight_in_lbs,Year
0,12.0,8,307.0,130.0,18.0,chevrolet chevelle malibu,USA,3504,1970-01-01
1,11.5,8,350.0,165.0,15.0,buick skylark 320,USA,3693,1970-01-01
2,11.0,8,318.0,150.0,18.0,plymouth satellite,USA,3436,1970-01-01
3,12.0,8,304.0,150.0,16.0,amc rebel sst,USA,3433,1970-01-01
4,10.5,8,302.0,140.0,17.0,ford torino,USA,3449,1970-01-01


In [4]:
len(cars)

406

## Grouped mean

Compute the average of `Acceleration`, `Displacement` and `Horsepower`, grouped by `Origin`:

In [5]:
# Select the numeric columns before aggregating so that string columns such as Name are never averaged
g1 = cars.groupby('Origin')[['Acceleration', 'Displacement', 'Horsepower']].mean()

In [6]:
assert list(g1.columns)==['Acceleration', 'Displacement', 'Horsepower']
assert list(g1.index)==['Europe', 'Japan', 'USA']

Find the maximum `Acceleration` and `Displacement` grouped by `Origin` and `Cylinders`:

In [7]:
# Select the columns of interest first, then take the per-group maximum
g2 = cars.groupby(['Origin', 'Cylinders'])[['Acceleration', 'Displacement']].max()

In [8]:
assert list(g2.columns)==['Acceleration', 'Displacement']
assert [list(i) for i in list(g2.index.levels)]==[['Europe', 'Japan', 'USA'], [3, 4, 5, 6, 8]]

## Grouped counts

Compute the number of cars, grouped by `Year` and `Cylinders` and unstack the result:

In [9]:
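# size() counts rows (cars) per group; count().sum(axis=1) would add up non-null cells across all columns instead
g3 = cars.groupby(['Year', 'Cylinders']).size().unstack()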

In [10]:
assert list(g3.columns)==[3, 4, 5, 6, 8]
assert list(g3.index)==['1970-01-01', '1971-01-01', '1972-01-01', '1973-01-01',
                        '1974-01-01', '1975-01-01', '1976-01-01', '1977-01-01', 
                        '1978-01-01', '1979-01-01', '1980-01-01', '1982-01-01']
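
When counting group members, it matters whether rows or non-missing values are counted. The following is a minimal sketch on a hypothetical toy frame (the names `toy`, `rows_per_group` and `non_null_mpg` are illustrative and not part of the cars dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the cars dataset.
toy = pd.DataFrame({
    'year': [1970, 1970, 1970, 1971, 1971],
    'cyl': [4, 4, 8, 4, 8],
    'mpg': [30.0, np.nan, 15.0, 28.0, 14.0],
})

# size() counts rows per group, including rows that contain missing values.
rows_per_group = toy.groupby(['year', 'cyl']).size().unstack()

# count() counts non-missing values of a column, so the NaN in mpg is skipped.
non_null_mpg = toy.groupby(['year', 'cyl'])['mpg'].count().unstack()
```

For the `(1970, 4)` group, `rows_per_group` reports two rows while `non_null_mpg` reports one, because one of the two `mpg` values is missing.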

## Multiple aggregations for different columns

Compute two aggregate quantities with the following names and values:

* `min_mpg` as the minimum `Miles_per_Gallon`.
* `max_mpg` as the maximum `Miles_per_Gallon`.
 
Group by `Origin`:

In [11]:
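# Use the string names 'max' and 'min' rather than the Python builtins; the suffix yields max_mpg and min_mpg
g4 = cars.groupby('Origin')['Miles_per_Gallon'].aggregate(['max', 'min']).add_suffix('_mpg')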

In [31]:
g4

Origin,max_mpg,min_mpg
Europe,44.3,16.2
Japan,46.6,18.0
USA,39.0,9.0


In [12]:
assert list(g4.columns)==['max_mpg', 'min_mpg']
assert list(g4.index)==['Europe', 'Japan', 'USA']
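
An alternative way to get named output columns is pandas' named aggregation, where each keyword becomes a column name. This is only a sketch on hypothetical toy data (`toy` and `summary` are illustrative names), not the approach used in the cell above:

```python
import pandas as pd

# Hypothetical toy data, not the cars dataset.
toy = pd.DataFrame({
    'origin': ['USA', 'USA', 'Japan', 'Japan'],
    'mpg': [15.0, 18.0, 31.0, 27.0],
})

# Each keyword is an output column; the tuple is (input column, aggregation).
summary = toy.groupby('origin').agg(
    min_mpg=('mpg', 'min'),
    max_mpg=('mpg', 'max'),
)
```

With this form the column order follows the keyword order, so `min_mpg` comes before `max_mpg`.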

## Custom aggregation function

Compute the range of cylinders (`range` = max - min) grouped by `Origin`:

In [13]:
def num_range(data):
    return (max(data) - min(data))

In [14]:
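# Select with a list so the result is a one-column DataFrame, which has a real `columns` attribute to rename
g5 = cars.groupby('Origin')[['Cylinders']].aggregate(num_range)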

In [15]:
g5.columns = ['range']

In [16]:
assert list(g5.columns)==['range']
assert list(g5.index)==['Europe', 'Japan', 'USA']
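
A custom aggregation function receives one group's values as a `Series` and must return a single value. Here is a small sketch on hypothetical toy data (`toy`, `value_range` and `ranges` are illustrative names) that uses the vectorized `Series.max` and `Series.min` methods and names the result column in one step:

```python
import pandas as pd

# Hypothetical toy data, not the cars dataset.
toy = pd.DataFrame({
    'origin': ['USA', 'USA', 'Japan', 'Japan'],
    'cyl': [8, 4, 4, 3],
})

def value_range(s):
    """Return max - min for one group's values."""
    return s.max() - s.min()

# agg returns a Series indexed by group; to_frame gives it a column name.
ranges = toy.groupby('origin')['cyl'].agg(value_range).to_frame('range')
```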

## Group filtering

Compute the average `Acceleration`, grouped by `Year` and `Origin`, keeping only the groups whose maximum `Cylinders` value is less than 6. Then unstack the `Origin` level of the resulting hierarchical row index:

In [17]:
groups = cars.groupby(['Origin', 'Year'])

In [18]:
groups = groups.filter(lambda x: x['Cylinders'].max() < 6).groupby(['Origin', 'Year'])

In [19]:
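# 'mean' is the idiomatic spelling; pd.DataFrame.mean is a DataFrame method and is misleading for a single column
g6 = groups.aggregate({'Acceleration': 'mean'}).unstack(level='Origin')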

In [20]:
assert [list(i) for i in list(g6.columns.levels)]==[['Acceleration'], ['Europe', 'Japan']]
assert g6.index.name=='Year'
assert len(g6)==10
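
Note that `filter` keeps or drops whole groups: the predicate is evaluated once per group, and every row of a failing group is removed. A minimal sketch on hypothetical toy data (`toy` and `small` are illustrative names):

```python
import pandas as pd

# Hypothetical toy data, not the cars dataset.
toy = pd.DataFrame({
    'origin': ['USA', 'USA', 'Japan', 'Japan'],
    'cyl': [8, 4, 4, 3],
    'accel': [10.0, 16.0, 15.0, 13.0],
})

# The USA group has a maximum of 8 cylinders, so all of its rows are dropped,
# including the 4-cylinder row. The result is a plain DataFrame, not a GroupBy.
small = toy.groupby('origin').filter(lambda g: g['cyl'].max() < 6)
```

Because the result is an ordinary `DataFrame`, it has to be grouped again before any further aggregation.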

## Grouped z-scores

Here is the average `Miles_per_Gallon`, grouped by `Year`:

In [21]:
cars.groupby(['Year'])['Miles_per_Gallon'].mean()

Year
1970-01-01    17.689655
1971-01-01    21.250000
1972-01-01    18.714286
1973-01-01    17.100000
1974-01-01    22.703704
1975-01-01    20.266667
1976-01-01    21.573529
1977-01-01    23.375000
1978-01-01    24.061111
1979-01-01    25.093103
1980-01-01    33.696552
1982-01-01    31.045000
Name: Miles_per_Gallon, dtype: float64

Replace the `Miles_per_Gallon` values by the [z-score](https://en.wikipedia.org/wiki/Standard_score) of that value relative to its group for each `Year`:

In [22]:
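# Standardize only the numeric columns; string columns such as Name and Origin cannot be z-scored
z_scored_mpg = cars.select_dtypes('number').groupby(cars['Year']).transform(lambda x: (x - x.mean()) / x.std())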

In [30]:
z_scored_mpg.head()

,Acceleration,Cylinders,Displacement,Horsepower,Miles_per_Gallon,Weight_in_lbs
0,-0.216631,0.672181,0.169366,-0.379951,0.058125,0.077632
1,-0.368272,0.672181,0.530921,0.325261,-0.503753,0.311694
2,-0.519914,0.672181,0.261857,0.023027,0.058125,-0.006581
3,-0.216631,0.672181,0.144141,0.023027,-0.31646,-0.010297
4,-0.671556,0.672181,0.127325,-0.178462,-0.129168,0.009518


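Check that the z-scores for `Miles_per_Gallon` average to zero. Each `Year` group has mean zero by construction, so the overall mean computed here is also zero, up to floating-point error: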

In [23]:
z_scored_mpg.mean()['Miles_per_Gallon']

6.6032815731277956e-18

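Check the standard deviation of the z-scores for `Miles_per_Gallon`. Within each `Year` group the sample standard deviation is 1.0, but the value computed here pools all groups, so it comes out slightly lower: with `N` non-missing values in `G` groups it equals `sqrt((N - G) / (N - 1))`: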

In [24]:
z_scored_mpg.std()['Miles_per_Gallon']

0.98604877741203478
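
To check the per-group properties directly, the z-scores can be grouped again by the original key. The sketch below uses hypothetical toy data (`toy`, `z`, `per_group`, `pooled_std` and `expected` are illustrative names) and assumes the default sample standard deviation (`ddof=1`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the cars dataset.
toy = pd.DataFrame({
    'year': [1970] * 3 + [1971] * 4,
    'mpg': [15.0, 18.0, 21.0, 20.0, 24.0, 26.0, 30.0],
})

# Standardize each value relative to its own year group.
z = toy.groupby('year')['mpg'].transform(lambda x: (x - x.mean()) / x.std())

# Per-group check: every group has mean 0 and sample standard deviation 1.
per_group = z.groupby(toy['year']).agg(['mean', 'std'])

# Pooled check: the overall standard deviation is sqrt((N - G) / (N - 1)).
n, g = len(z), toy['year'].nunique()
pooled_std = z.std()
expected = np.sqrt((n - g) / (n - 1))
```

The pooled value falls below 1 because each group contributes `n_g - 1` to the total sum of squares, while the pooled estimate divides by `N - 1`.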

## Grouped missing value replacement

## Introduce missing values

Let's introduce some missing values into the `Cylinders` column:

In [25]:
cars['Cylinders'] = np.where(np.random.rand(len(cars)) > 0.8, np.nan, cars['Cylinders'])
cars.Cylinders

0      8.0
1      NaN
2      8.0
3      8.0
4      8.0
5      NaN
6      8.0
7      8.0
8      8.0
9      8.0
10     4.0
11     NaN
12     8.0
13     8.0
14     8.0
15     8.0
16     8.0
17     8.0
18     8.0
19     8.0
20     4.0
21     6.0
22     6.0
23     NaN
24     4.0
25     4.0
26     4.0
27     4.0
28     4.0
29     4.0
      ... 
376    4.0
377    4.0
378    4.0
379    4.0
380    4.0
381    NaN
382    4.0
383    4.0
384    4.0
385    4.0
386    NaN
387    4.0
388    4.0
389    4.0
390    4.0
391    4.0
392    4.0
393    4.0
394    NaN
395    6.0
396    4.0
397    NaN
398    4.0
399    NaN
400    4.0
401    NaN
402    NaN
403    NaN
404    NaN
405    4.0
Name: Cylinders, dtype: float64

Here is the average number of `Cylinders`, grouped by `Year`:

In [26]:
cars.groupby(['Year'])['Cylinders'].mean()

Year
1970-01-01    6.774194
1971-01-01    5.523810
1972-01-01    5.954545
1973-01-01    6.125000
1974-01-01    5.217391
1975-01-01    5.826087
1976-01-01    5.862069
1977-01-01    5.105263
1978-01-01    5.206897
1979-01-01    5.800000
1980-01-01    4.222222
1982-01-01    4.390244
Name: Cylinders, dtype: float64

Replace the missing values in `Cylinders` by the group average (grouped by `Year`):

In [27]:
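# Fill each missing Cylinders value with the mean of its Year group; transform('mean') aligns with the original rows
replaced_nan_cars = cars.copy()
replaced_nan_cars['Cylinders'] = cars['Cylinders'].fillna(cars.groupby('Year')['Cylinders'].transform('mean'))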

Because each missing value is replaced by the mean of its own group, the grouped averages remain unchanged:

In [28]:
replaced_nan_cars.groupby(['Year'])['Cylinders'].mean()

Year
1970-01-01    6.774194
1971-01-01    5.523810
1972-01-01    5.954545
1973-01-01    6.125000
1974-01-01    5.217391
1975-01-01    5.826087
1976-01-01    5.862069
1977-01-01    5.105263
1978-01-01    5.206897
1979-01-01    5.800000
1980-01-01    4.222222
1982-01-01    4.390244
Name: Cylinders, dtype: float64
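
The same invariance can be seen on a small example. This is a sketch on hypothetical toy data (`toy`, `before`, `group_mean`, `filled` and `after` are illustrative names), filling with a group mean that is broadcast back to the original rows:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the cars dataset.
toy = pd.DataFrame({
    'year': [1970, 1970, 1970, 1971, 1971],
    'cyl': [8.0, np.nan, 4.0, 4.0, np.nan],
})

# Group means before filling; NaN values are skipped by mean().
before = toy.groupby('year')['cyl'].mean()

# transform('mean') returns one value per original row, aligned on the index.
group_mean = toy.groupby('year')['cyl'].transform('mean')
filled = toy.assign(cyl=toy['cyl'].fillna(group_mean))

# Group means after filling are the same as before.
after = filled.groupby('year')['cyl'].mean()
```

Adding copies of a group's mean to that group does not move the mean, although it does shrink the group's variance.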