# Activity: Aggregations

## Introduction

In this activity, you will practice computing aggregate statistics with Pandas.

This activity will cover the following topics:
- Measure an aggregate statistic over a specific column.
- Measure an aggregate statistic over a specific column of subsets using `groupby()` over one column.
- Measure multiple aggregate statistics over a specific column of subsets using `groupby()` over one column.
- Measure multiple aggregate statistics over a specific column of subsets using `groupby()` over multiple columns.
- Take the transpose of a DataFrame.
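The topics above can be sketched on a small toy DataFrame before working with the real data. The `toy` frame and its `group` and `value` columns are hypothetical and are not part of `mpg.csv`; this is a minimal illustration, not the activity's solution.

```python
import pandas as pd

# Hypothetical toy data (not taken from mpg.csv).
toy = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "value": [1.0, 3.0, 10.0, 20.0],
})

# Aggregate statistic over a whole column.
overall_mean = toy["value"].mean()

# The same statistic computed per subset, one subset per distinct `group`.
mean_by_group = toy.groupby("group")["value"].mean()

# Several statistics at once: the result is a DataFrame with one column per statistic.
stats_by_group = toy.groupby("group")["value"].agg(["min", "max"])

# The transpose swaps rows and columns.
stats_transposed = stats_by_group.T
```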


In [1]:
import pandas as pd

# Data from https://github.com/mwaskom/seaborn-data/blob/2b29313169bf8dfa77d8dc930f7bd3eba559a906/mpg.csv
df = pd.read_csv('mpg.csv')

In [2]:
df.head()

,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,name
0,18.0,8,307.0,130.0,3504,12.0,70,usa,chevrolet chevelle malibu
1,15.0,8,350.0,165.0,3693,11.5,70,usa,buick skylark 320
2,18.0,8,318.0,150.0,3436,11.0,70,usa,plymouth satellite
3,16.0,8,304.0,150.0,3433,12.0,70,usa,amc rebel sst
4,17.0,8,302.0,140.0,3449,10.5,70,usa,ford torino


In [3]:
df.columns

Index(['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
       'acceleration', 'model_year', 'origin', 'name'],
      dtype='object')

#### Question 1
Assign the average of the `weight` column to the variable `avg_weight`.

In [4]:
# Your code here

avg_weight = df.weight.mean()

In [5]:
# Question 1 Grading Checks

assert isinstance(avg_weight, float), 'Did you assign a number to `avg_weight`?'

#### Question 2

Assign the average of the `weight` column where the `origin` is `'usa'` to the variable `weight_usa`.

In [6]:
# Your code here

usa = df[df.origin == "usa"]
usa

,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,name
0,18.0,8,307.0,130.0,3504,12.0,70,usa,chevrolet chevelle malibu
1,15.0,8,350.0,165.0,3693,11.5,70,usa,buick skylark 320
2,18.0,8,318.0,150.0,3436,11.0,70,usa,plymouth satellite
3,16.0,8,304.0,150.0,3433,12.0,70,usa,amc rebel sst
4,17.0,8,302.0,140.0,3449,10.5,70,usa,ford torino
...,...,...,...,...,...,...,...,...,...
392,27.0,4,151.0,90.0,2950,17.3,82,usa,chevrolet camaro
393,27.0,4,140.0,86.0,2790,15.6,82,usa,ford mustang gl
395,32.0,4,135.0,84.0,2295,11.6,82,usa,dodge rampage
396,28.0,4,120.0,79.0,2625,18.6,82,usa,ford ranger


In [7]:
weight_usa = usa.weight.mean()

In [8]:
# Question 2 Grading Checks

assert isinstance(weight_usa, float), 'Did you assign a number to `weight_usa`?'
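Filtering with a boolean mask and then aggregating gives the same number as grouping and picking out one group. The sketch below shows both routes on hypothetical toy data (the values are made up and do not come from `mpg.csv`).

```python
import pandas as pd

# Hypothetical toy data with the same column names as the activity.
toy = pd.DataFrame({
    "origin": ["usa", "usa", "japan", "europe"],
    "weight": [3500, 3700, 2100, 2400],
})

# Route 1: keep only the rows of interest, then aggregate.
mean_via_mask = toy.loc[toy["origin"] == "usa", "weight"].mean()

# Route 2: aggregate every group, then select the group of interest.
mean_via_groupby = toy.groupby("origin")["weight"].mean()["usa"]
```

The mask is the lighter choice when only one subset matters; `groupby()` pays off once every subset is needed.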

#### Question 3

Assign the maximum of the `weight` column where the `origin` is `'japan'` to the variable `weight_japan`.

In [9]:
# Your code here

japan = df[df.origin == "japan"]

japan.head()

,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,name
14,24.0,4,113.0,95.0,2372,15.0,70,japan,toyota corona mark ii
18,27.0,4,97.0,88.0,2130,14.5,70,japan,datsun pl510
29,27.0,4,97.0,88.0,2130,14.5,71,japan,datsun pl510
31,25.0,4,113.0,95.0,2228,14.0,71,japan,toyota corona
53,31.0,4,71.0,65.0,1773,19.0,71,japan,toyota corolla 1200


In [10]:
# Question 3 asks for the maximum, not the mean
weight_japan = japan.weight.max()

In [11]:
# Question 3 Grading Checks

print(weight_japan)


#### Question 4

Using Pandas' `groupby()` method, group by the `cylinders` column to find the minimum and maximum `horsepower` and assign the result to the variable `weight_by_cylinder`.

In [12]:
# Your code here

# Select `horsepower` before aggregating so only that column is summarized
weight_by_cylinder = df.groupby(by='cylinders')['horsepower'].agg(['min', 'max'])

weight_by_cylinder

cylinders,min,max
3,90.0,110.0
4,46.0,115.0
5,67.0,103.0
6,72.0,165.0
8,90.0,230.0


In [13]:
# Question 4 Grading Checks

assert isinstance(weight_by_cylinder, pd.DataFrame), 'Did you create a DataFrame with `groupby()`?'
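Two common ways to compute several statistics for one column per group are sketched below on hypothetical toy data: selecting the column before calling `agg()`, and named aggregation, where each keyword becomes an output column. The names `hp_min` and `hp_max` are illustrative choices, not something the activity requires.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the activity.
toy = pd.DataFrame({
    "cylinders": [4, 4, 8, 8],
    "horsepower": [70.0, 90.0, 150.0, 200.0],
    "weight": [2000, 2200, 4000, 4300],
})

# Select the column first, then pass a list of statistics.
hp_range = toy.groupby("cylinders")["horsepower"].agg(["min", "max"])

# Named aggregation: keyword = (source column, statistic).
hp_named = toy.groupby("cylinders").agg(
    hp_min=("horsepower", "min"),
    hp_max=("horsepower", "max"),
)
```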

#### Question 5

Using Pandas' `groupby()` method, group by the `model_year` column to find the maximum `acceleration`, returning the result as a DataFrame (not a Series). Assign the result to the variable `acc_by_model_year`.

In [14]:
# Your code here

# Group by `model_year` (not `acceleration`); the double brackets keep the result a DataFrame
acc_by_model_year = df.groupby(by='model_year')[['acceleration']].max()

acc_by_model_year


In [15]:
# Question 5 Grading Checks


assert isinstance(acc_by_model_year, pd.DataFrame), 'Did you create a DataFrame with `groupby()`?'
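Whether a grouped aggregation returns a Series or a DataFrame depends on how the column is selected. The sketch below uses hypothetical toy data to contrast single brackets, double brackets, and `Series.to_frame()`.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the activity.
toy = pd.DataFrame({
    "model_year": [70, 70, 71, 71],
    "acceleration": [12.0, 11.5, 14.0, 15.5],
})

# Single brackets select one column, so the result is a Series.
as_series = toy.groupby("model_year")["acceleration"].max()

# Double brackets select a list of columns, so the result is a DataFrame.
as_frame = toy.groupby("model_year")[["acceleration"]].max()

# A Series can also be converted afterwards.
via_to_frame = as_series.to_frame()
```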

#### Question 6

Using Pandas' `groupby()` method, group by the `origin` column and then by the `cylinders` column to find the median `acceleration` and assign the result to the variable `acc_by_origin_and_cylinders`. Don't create a DataFrame; the result should be a Series.

In [16]:
# Your code here

acc_by_origin_and_cylinders = df.groupby(by=['origin','cylinders'])["acceleration"].median()

acc_by_origin_and_cylinders

origin  cylinders
europe  4            15.50
        5            19.90
        6            16.25
japan   3            13.50
        4            16.50
        6            13.65
usa     4            16.30
        6            16.45
        8            13.00
Name: acceleration, dtype: float64

In [17]:
# Question 6 Grading Checks

assert isinstance(acc_by_origin_and_cylinders, pd.Series), 'Did you use `groupby()` and create a Series (not a DataFrame)? Check how you choose which column you are calculating your median for.'
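Grouping by two columns produces a Series indexed by a two-level `MultiIndex`. As a minimal sketch on hypothetical toy data, a single group can be read with a tuple key, and `unstack()` pivots one index level into columns.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the activity.
toy = pd.DataFrame({
    "origin": ["usa", "usa", "usa", "japan", "japan"],
    "cylinders": [4, 4, 8, 4, 4],
    "acceleration": [16.0, 17.0, 12.0, 15.0, 18.0],
})

median_acc = toy.groupby(["origin", "cylinders"])["acceleration"].median()

# Read one (origin, cylinders) combination with a tuple key.
usa_4 = median_acc.loc[("usa", 4)]

# Pivot the `cylinders` level into columns; combinations absent from the data become NaN.
wide = median_acc.unstack("cylinders")
```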

#### Question 7

Using Pandas' `groupby()` method, group by the `origin` column, then use `describe()` to get summary statistics for `acceleration`. Assign the ***transpose*** of the resulting DataFrame to the variable `acc_stats_by_origin`.

In [21]:
# Your code here (remember to take the transpose after doing .describe())

# `.T` transposes the result: statistics become rows, origins become columns
acc_stats_by_origin = df.groupby(by='origin')['acceleration'].describe().T
acc_stats_by_origin

origin,europe,japan,usa
count,70.0,79.0,249.0
mean,16.787143,16.172152,15.033735
std,3.045687,1.954937,2.751112
min,12.2,11.4,8.0
25%,14.5,14.6,13.0
50%,15.7,16.4,15.0
75%,18.9,17.55,16.9
max,24.8,21.0,22.2


In [22]:
# Question 7 Grading Checks

assert isinstance(acc_stats_by_origin, pd.DataFrame), 'Did you create a DataFrame with `groupby()`?'
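The effect of the transpose can be sketched on hypothetical toy data: `describe()` on a grouped column yields one row per group and one column per statistic, and `.T` swaps the two.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the activity.
toy = pd.DataFrame({
    "origin": ["usa", "usa", "japan", "japan"],
    "acceleration": [12.0, 14.0, 16.0, 18.0],
})

# One row per origin, one column per statistic (count, mean, std, ...).
stats = toy.groupby("origin")["acceleration"].describe()

# After transposing: one row per statistic, one column per origin.
stats_t = stats.T
```

Because both orientations are DataFrames, an `isinstance` check alone cannot tell them apart; inspecting `stats_t.columns` or `stats_t.index` shows which orientation is present.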