# INFO 3402 – Week 02: Aggregating and Summarizing

[Brian C. Keegan, Ph.D.](http://brianckeegan.com/)  
[Assistant Professor, Department of Information Science](https://www.colorado.edu/cmci/people/information-science/brian-c-keegan)  
University of Colorado Boulder  

Copyright and distributed under an [MIT License](https://opensource.org/licenses/MIT)  

In [2]:
import numpy as np
import pandas as pd

pd.options.display.max_columns = 200

In [3]:
deaths_df = pd.read_csv('CDC_deaths_2014_2022.csv',parse_dates=['Week date'])

## Exercises

### Exercise 1: Annual "All Cause" deaths

Perform a groupby-aggregation to compute total "All Cause" deaths by year. How many more deaths were there in 2020 than in 2019?

In [4]:
annual = deaths_df.groupby(['Year']).agg({'All Cause':'sum'})

annual.loc[2020,'All Cause'] - annual.loc[2019,'All Cause']

568256.0
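The same comparison can be made for every pair of consecutive years at once rather than one pair at a time. The sketch below is only illustrative: `toy_df` and its values are invented stand-ins for `deaths_df`, and the `Change` and `Pct change` column names are hypothetical.

```python
import pandas as pd

# Hypothetical toy data standing in for deaths_df; the values are invented
toy_df = pd.DataFrame({
    'Year': [2019, 2019, 2020, 2020],
    'All Cause': [100.0, 110.0, 130.0, 140.0],
})

# Same groupby-aggregation as in the exercise
toy_annual = toy_df.groupby('Year').agg({'All Cause': 'sum'})

# Absolute change from the previous year (NaN for the first year)
toy_annual['Change'] = toy_annual['All Cause'].diff()

# Relative change from the previous year, as a percentage
toy_annual['Pct change'] = toy_annual['All Cause'].pct_change() * 100

print(toy_annual)
```

Reporting the relative change alongside the raw difference makes years with different baselines easier to compare.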

### Exercise 2: Weekly flu and pneumonia patterns

Make a pivot table with Week as the index, Year as the columns, and total "Influenza and pneumonia" deaths as the values. Examining the week with the maximum count in each year, what time of year is deadliest for the flu?

In [5]:
weekly_flu = pd.pivot_table(
    data = deaths_df,
    index = 'Week',
    columns = 'Year',
    values = 'Influenza and pneumonia',
    aggfunc = 'sum'
)

weekly_flu.idxmax()

Year
2014    53
2015     1
2016    10
2017    52
2018     3
2019    11
2020    12
2021     2
dtype: int64
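In the output above, every peak falls in weeks 52 to 53 or weeks 1 to 12, which is roughly late December through March. Because `idxmax()` works column by column, it returns the index label (here, the week) of the largest value in each year. A minimal sketch on invented data, where `toy_df`, `summary`, and the numbers are hypothetical, shows the peak week next to the peak count:

```python
import pandas as pd

# Hypothetical toy data: three weeks in each of two years, invented values
toy_df = pd.DataFrame({
    'Year': [2014, 2014, 2014, 2015, 2015, 2015],
    'Week': [1, 26, 52, 1, 26, 52],
    'Influenza and pneumonia': [90.0, 40.0, 120.0, 150.0, 45.0, 100.0],
})

toy_weekly = pd.pivot_table(
    data=toy_df,
    index='Week',
    columns='Year',
    values='Influenza and pneumonia',
    aggfunc='sum',
)

# Week label of the maximum in each Year column
peak_weeks = toy_weekly.idxmax()

# The maximum value itself in each Year column
peak_counts = toy_weekly.max()

# Combine both into one table indexed by Year
summary = pd.DataFrame({'Peak week': peak_weeks, 'Peak deaths': peak_counts})
print(summary)
```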

### Exercise 3: Top heart disease state

Perform a groupby-aggregation to compute total "Diseases of heart" deaths by state. Which state had the most heart disease deaths from 2014 through 2021?

In [11]:
heart = deaths_df.groupby(['State']).agg({'Diseases of heart':'sum'})

heart.idxmax()

Diseases of heart    California
dtype: object
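`idxmax()` names only the single top state. To see the runners-up as well, one option is `Series.nlargest`. The sketch below assumes a small invented table; `toy_df` and its values are hypothetical and not drawn from the CDC file.

```python
import pandas as pd

# Hypothetical toy data standing in for deaths_df; the values are invented
toy_df = pd.DataFrame({
    'State': ['California', 'California', 'Texas', 'Texas', 'Utah', 'Utah'],
    'Diseases of heart': [500.0, 520.0, 400.0, 410.0, 50.0, 60.0],
})

toy_heart = toy_df.groupby('State').agg({'Diseases of heart': 'sum'})

# The two states with the largest totals, in descending order
top_two = toy_heart['Diseases of heart'].nlargest(2)
print(top_two)
```

These are raw totals rather than per-capita rates, so populous states will tend to rank highest.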

### Exercise 4: Annual mid-year "All Cause" deaths

Perform a groupby-aggregation on Year and Week to compute the total "All Cause" deaths. Use slicing to identify the number of "All Cause" deaths in the 26th week of each year.

In [15]:
idx = pd.IndexSlice

all_cause = deaths_df.groupby(['Year','Week']).agg({'All Cause':'sum'})

all_cause.loc[idx[:,26],:]

Year,Week,All Cause
2014,26,47131.0
2015,26,48798.0
2016,26,49348.0
2017,26,50229.0
2018,26,50403.0
2019,26,51826.0
2020,26,57986.0
2021,26,56896.0
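`pd.IndexSlice` keeps both index levels in the result. When only one level is being fixed, `DataFrame.xs` is an alternative that drops that level, leaving Year as the only index. A minimal sketch on invented data, with `toy_df`, `sliced`, and `cross` as hypothetical names:

```python
import pandas as pd

# Hypothetical toy data: two rows for week 26 in each year, invented values
toy_df = pd.DataFrame({
    'Year': [2019, 2019, 2019, 2020, 2020, 2020],
    'Week': [25, 26, 26, 25, 26, 26],
    'All Cause': [10.0, 20.0, 30.0, 15.0, 25.0, 35.0],
})

toy_all_cause = toy_df.groupby(['Year', 'Week']).agg({'All Cause': 'sum'})

# IndexSlice approach from the exercise: result keeps (Year, Week) as the index
idx = pd.IndexSlice
sliced = toy_all_cause.loc[idx[:, 26], :]

# Cross-section approach: the Week level is dropped, leaving Year as the index
cross = toy_all_cause.xs(26, level='Week')

print(sliced)
print(cross)
```

Both return the same values; the choice mostly depends on whether the Week level is still wanted in the result.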


### Exercise 5: Comparing diabetes deaths in 2019 and 2020 between Utah and Colorado

Perform a groupby-aggregation on Year and State and compute the total "Diabetes mellitus" deaths. Use slicing to identify the number of diabetes deaths in 2019 and 2020 for Utah and Colorado.

In [17]:
diabetes = deaths_df.groupby(['Year','State']).agg({'Diabetes mellitus':'sum'})

diabetes.loc[idx[[2019,2020],['Colorado','Utah']],:]

Year,State,Diabetes mellitus
2019,Colorado,1051.0
2019,Utah,631.0
2020,Colorado,1183.0
2020,Utah,753.0
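In the output above, diabetes deaths rose by 132 in Colorado and by 122 in Utah from 2019 to 2020. Pivoting the Year level into columns with `unstack` puts the two years side by side so that difference can be computed directly. The sketch below uses invented numbers; `toy_df`, `wide`, and the `Change` column are hypothetical.

```python
import pandas as pd

# Hypothetical toy data standing in for deaths_df; the values are invented
toy_df = pd.DataFrame({
    'Year': [2019, 2019, 2020, 2020, 2019, 2020],
    'State': ['Colorado', 'Utah', 'Colorado', 'Utah', 'Texas', 'Texas'],
    'Diabetes mellitus': [10.0, 5.0, 12.0, 6.0, 40.0, 44.0],
})

toy_diabetes = toy_df.groupby(['Year', 'State']).agg({'Diabetes mellitus': 'sum'})

# Same slicing pattern as the exercise
idx = pd.IndexSlice
subset = toy_diabetes.loc[idx[[2019, 2020], ['Colorado', 'Utah']], :]

# Move the Year level into columns: one row per state, one column per year
wide = subset['Diabetes mellitus'].unstack('Year')

# Difference between the two years for each state
wide['Change'] = wide[2020] - wide[2019]
print(wide)
```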
