## Step 3 - Prepare Data: Task 4 - Transform, Task 5 - Sort Data

This notebook provides the Python code for taking COVID total cases at the county level and aggregating them to the state level. It demonstrates how you can transform the data, sort the data, and engineer new features. The results are stored in an intermediate file for the rest of the exercises.

Students will develop a similar notebook for total deaths. The corresponding notebook is included in the answer section.

## Import Libraries

In [1]:
import os
import pandas as pd
from datetime import date


## Set up Environment Flag

In [2]:
using_Google_colab = False
using_Anaconda_on_Mac_or_Linux = True
using_Anaconda_on_windows = False

if using_Google_colab:
    dir_input = "/content/drive/MyDrive/COVID_Project/input"
if using_Anaconda_on_Mac_or_Linux:
    dir_input = "../input"
if using_Anaconda_on_windows:
    dir_input = r"..\input"  
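
Because the three flags are independent booleans, it is possible to set none of them, or more than one, by accident. The following is a minimal sketch of a guard against that, assuming a hypothetical helper named `resolve_input_dir` that is not part of the original notebook; the paths simply mirror the cell above.

```python
def resolve_input_dir(colab: bool, mac_or_linux: bool, windows: bool) -> str:
    """Return the input directory for the single environment flag that is set.

    Hypothetical helper for illustration only.
    """
    choices = {
        "/content/drive/MyDrive/COVID_Project/input": colab,
        "../input": mac_or_linux,
        r"..\input": windows,
    }
    selected = [path for path, enabled in choices.items() if enabled]
    if len(selected) != 1:
        raise ValueError(f"Set exactly one environment flag, found {len(selected)}")
    return selected[0]


dir_input = resolve_input_dir(colab=False, mac_or_linux=True, windows=False)
print(dir_input)
```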

## Connect to Google Drive

This step is executed only if you have set the environment flag using_Google_colab to True.

In [3]:
if using_Google_colab:
    from google.colab import drive
    drive.mount('/content/drive')

## PD 4.1 (Activity 1) Read file in your chosen environment

In [4]:
df_total_cases = pd.read_csv(os.path.join(dir_input, "USA_Facts", "covid_confirmed_usafacts.csv")) 
df_total_cases = df_total_cases.astype({'countyFIPS': str}).astype({'StateFIPS': str})
df_total_cases

Unnamed: 0,countyFIPS,County Name,State,StateFIPS,2020-01-22,2020-01-23,2020-01-24,2020-01-25,2020-01-26,2020-01-27,...,2022-01-18,2022-01-19,2022-01-20,2022-01-21,2022-01-22,2022-01-23,2022-01-24,2022-01-25,2022-01-26,2022-01-27
0,0,Statewide Unallocated,AL,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1001,Autauga County,AL,1,0,0,0,0,0,0,...,12738,12833,12928,13019,13019,13019,13251,13251,13251,13251
2,1003,Baldwin County,AL,1,0,0,0,0,0,0,...,47143,47662,48338,49168,49168,49168,50313,50313,50313,50313
3,1005,Barbour County,AL,1,0,0,0,0,0,0,...,4741,4800,4843,4902,4902,4902,5054,5054,5054,5054
4,1007,Bibb County,AL,1,0,0,0,0,0,0,...,5385,5486,5565,5663,5663,5663,5795,5795,5795,5795
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3188,56037,Sweetwater County,WY,56,0,0,0,0,0,0,...,9082,9184,9241,9449,9449,9449,9609,9712,9810,10007
3189,56039,Teton County,WY,56,0,0,0,0,0,0,...,8531,8638,8741,8814,8814,8814,8960,9049,9121,9195
3190,56041,Uinta County,WY,56,0,0,0,0,0,0,...,4660,4751,4827,4927,4927,4927,5034,5081,5167,5222
3191,56043,Washakie County,WY,56,0,0,0,0,0,0,...,1994,2002,2023,2025,2025,2025,2041,2066,2093,2130
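
The cell above lets pandas parse the FIPS columns as integers and then casts them to strings. An alternative is to read them as text from the start with the `dtype` argument of `pd.read_csv`. The sketch below uses a tiny in-memory stand-in for the file; the leading zero in `StateFIPS` is an assumption made to show the difference, and the real file may not contain one.

```python
import io
import pandas as pd

# Tiny stand-in for the real file; the values are illustrative only.
csv_text = """countyFIPS,County Name,State,StateFIPS,2020-01-22,2020-01-23
0,Statewide Unallocated,AL,01,0,0
1001,Autauga County,AL,01,0,0
"""

# Parse as integers first, then cast to str (the approach used above).
df_cast = pd.read_csv(io.StringIO(csv_text)).astype({"countyFIPS": str, "StateFIPS": str})

# Read the FIPS columns as text from the start.
df_text = pd.read_csv(io.StringIO(csv_text), dtype={"countyFIPS": str, "StateFIPS": str})

print(df_cast["StateFIPS"].tolist())  # leading zero is lost during integer parsing
print(df_text["StateFIPS"].tolist())  # leading zero is preserved
```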


## PD 4.1 (Activity 2) Select data for LA County

In [5]:
df_total_cases_LA = df_total_cases[df_total_cases['County Name'] == 'Los Angeles County ']
df_total_cases_LA

Unnamed: 0,countyFIPS,County Name,State,StateFIPS,2020-01-22,2020-01-23,2020-01-24,2020-01-25,2020-01-26,2020-01-27,...,2022-01-18,2022-01-19,2022-01-20,2022-01-21,2022-01-22,2022-01-23,2022-01-24,2022-01-25,2022-01-26,2022-01-27
209,6037,Los Angeles County,CA,6,375,379,382,384,385,388,...,2276388,2343261,2367401,2384427,2390482,2430653,2453693,2468026,2472960,2473095
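
The filter above compares against a county name that ends in a space, which presumably mirrors how the name appears in the raw file. Two less fragile options are sketched below on made-up rows: strip whitespace before comparing, or filter on the FIPS code instead of the name.

```python
import pandas as pd

# Illustrative rows only; the trailing space imitates the raw county names.
df = pd.DataFrame({
    "countyFIPS": ["6037", "6059"],
    "County Name": ["Los Angeles County ", "Orange County "],
    "State": ["CA", "CA"],
})

# Option 1: normalize whitespace before comparing names.
by_name = df[df["County Name"].str.strip() == "Los Angeles County"]

# Option 2: filter on the county FIPS code.
by_fips = df[df["countyFIPS"] == "6037"]

print(by_name)
print(by_fips)
```

Filtering on the name alone can also match counties with the same name in other states, so combining it with `State` or using the FIPS code is usually safer.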


## PD 4.1 (Activity 3) Transform LA County data to total cases by date

In [6]:
df_total_cases_LA_by_date = df_total_cases_LA.melt(id_vars=['State', 
                                                            'StateFIPS', 
                                                            'County Name',
                                                            'countyFIPS'],
                                                   var_name='Date', 
                                                   value_name='Total Cases')
df_total_cases_LA_by_date

Unnamed: 0,State,StateFIPS,County Name,countyFIPS,Date,Total Cases
0,CA,6,Los Angeles County,6037,2020-01-22,375
1,CA,6,Los Angeles County,6037,2020-01-23,379
2,CA,6,Los Angeles County,6037,2020-01-24,382
3,CA,6,Los Angeles County,6037,2020-01-25,384
4,CA,6,Los Angeles County,6037,2020-01-26,385
...,...,...,...,...,...,...
732,CA,6,Los Angeles County,6037,2022-01-23,2430653
733,CA,6,Los Angeles County,6037,2022-01-24,2453693
734,CA,6,Los Angeles County,6037,2022-01-25,2468026
735,CA,6,Los Angeles County,6037,2022-01-26,2472960
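
To see what `melt` does in isolation, here is a small sketch on a one-row frame with two date columns (the values are the first two LA County totals shown above). Columns listed in `id_vars` are repeated on every output row, and every other column becomes one row with its header in `Date` and its value in `Total Cases`.

```python
import pandas as pd

wide = pd.DataFrame({
    "State": ["CA"],
    "County Name": ["Los Angeles County"],
    "2020-01-22": [375],
    "2020-01-23": [379],
})

# One wide row with two date columns becomes two long rows.
long = wide.melt(id_vars=["State", "County Name"],
                 var_name="Date",
                 value_name="Total Cases")
print(long)
```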


## PD 4.2 (Activity 4) Transform all County data to total cases by date

In [7]:
df_total_county_cases_by_date = df_total_cases.melt(id_vars=['State', 
                                                      'StateFIPS', 
                                                      'County Name',
                                                      'countyFIPS'],
                                             var_name='Date', 
                                             value_name='Total Cases')
df_total_county_cases_by_date

Unnamed: 0,State,StateFIPS,County Name,countyFIPS,Date,Total Cases
0,AL,1,Statewide Unallocated,0,2020-01-22,0
1,AL,1,Autauga County,1001,2020-01-22,0
2,AL,1,Baldwin County,1003,2020-01-22,0
3,AL,1,Barbour County,1005,2020-01-22,0
4,AL,1,Bibb County,1007,2020-01-22,0
...,...,...,...,...,...,...
2353236,WY,56,Sweetwater County,56037,2022-01-27,10007
2353237,WY,56,Teton County,56039,2022-01-27,9195
2353238,WY,56,Uinta County,56041,2022-01-27,5222
2353239,WY,56,Washakie County,56043,2022-01-27,2130
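
A quick sanity check after a melt is that every wide row produced exactly one long row per date column. The sketch below shows the check on a small made-up frame; the same arithmetic can be applied to the full county table.

```python
import pandas as pd

id_cols = ["State", "countyFIPS"]
wide = pd.DataFrame({
    "State": ["AL", "AL", "WY"],
    "countyFIPS": ["1001", "1003", "56043"],
    "2020-01-22": [0, 0, 0],
    "2020-01-23": [0, 0, 0],
})
long = wide.melt(id_vars=id_cols, var_name="Date", value_name="Total Cases")

# Every non-id column is treated as a date column.
n_date_cols = wide.shape[1] - len(id_cols)
expected_rows = len(wide) * n_date_cols
assert len(long) == expected_rows, "melt produced an unexpected number of rows"
print(len(long), expected_rows)
```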


## PD 4.3 (Activity 5) Group total cases by state


In [8]:
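# numeric_only=True restricts the sum to the numeric date columns (countyFIPS and County Name are text)
df_total_cases_by_state = df_total_cases.groupby(['State', 'StateFIPS']).sum(numeric_only=True).reset_index()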
df_total_cases_by_state

Unnamed: 0,State,StateFIPS,2020-01-22,2020-01-23,2020-01-24,2020-01-25,2020-01-26,2020-01-27,2020-01-28,2020-01-29,...,2022-01-18,2022-01-19,2022-01-20,2022-01-21,2022-01-22,2022-01-23,2022-01-24,2022-01-25,2022-01-26,2022-01-27
0,AK,2,0,0,0,0,0,0,0,0,...,179006,181507,183777,187047,188250,190051,193306,195288,197425,199769
1,AL,1,0,0,0,0,0,0,0,0,...,1071264,1088370,1104356,1120881,1120881,1120881,1153149,1153149,1153149,1153149
2,AR,5,0,0,0,0,0,0,0,0,...,677449,690890,701050,709958,709958,709958,721587,732071,738053,743699
3,AZ,4,0,0,0,0,1,1,1,1,...,1645695,1666191,1683915,1701950,1701950,1701950,1767303,1781275,1799504,1813797
4,CA,6,722,733,739,749,756,766,772,776,...,7085843,7274629,7346608,7398487,7469125,7554303,7626265,7680316,7702919,7705725
5,CO,8,0,0,0,0,0,0,0,0,...,1169876,1180568,1184668,1199935,1208995,1212408,1221013,1229321,1232348,1233235
6,CT,9,0,0,0,0,0,0,0,0,...,657680,662425,667230,671674,671674,671674,683731,687555,690350,693386
7,DC,11,0,0,0,0,0,0,0,0,...,125707,126187,126675,127200,127200,127200,128550,128739,129108,129479
8,DE,10,0,0,0,0,0,0,0,0,...,229759,232132,233843,236022,236022,236022,241367,242324,242324,244730
9,FL,12,0,0,0,0,0,0,0,0,...,5196294,5242386,5281000,5303375,5311729,5347828,5382565,5420175,5448288,5478894
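
Summing after `groupby` only makes sense for the numeric date columns. In pandas 2.0 and later, text columns are no longer dropped by default and would be concatenated instead, so it can help to pass `numeric_only=True` explicitly. A small sketch on made-up rows:

```python
import pandas as pd

# Made-up counts; only the column layout follows the notebook.
df = pd.DataFrame({
    "countyFIPS": ["1001", "1003", "2013"],
    "County Name": ["Autauga County", "Baldwin County", "Aleutians East Borough"],
    "State": ["AL", "AL", "AK"],
    "StateFIPS": ["1", "1", "2"],
    "2020-01-22": [1, 2, 3],
    "2020-01-23": [4, 5, 6],
})

by_state = (
    df.groupby(["State", "StateFIPS"])
      .sum(numeric_only=True)   # keep only the numeric date columns
      .reset_index()
)
print(by_state)
```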


## PD 4.3 (Activity 6) Transform state total cases to total cases by date


In [9]:
df_total_cases_by_state_by_date = df_total_cases_by_state.melt(id_vars=['State','StateFIPS'], 
                                                               var_name='Date', 
                                                               value_name='Total Cases')
df_total_cases_by_state_by_date

Unnamed: 0,State,StateFIPS,Date,Total Cases
0,AK,2,2020-01-22,0
1,AL,1,2020-01-22,0
2,AR,5,2020-01-22,0
3,AZ,4,2020-01-22,0
4,CA,6,2020-01-22,722
...,...,...,...,...
37582,VT,50,2022-01-27,103066
37583,WA,53,2022-01-27,1257918
37584,WI,55,2022-01-27,1495240
37585,WV,54,2022-01-27,434221


## Bonus Activity - Pivot is the reverse of melt: you can revert using pivot_table

In [10]:
df_pivot_table = pd.pivot_table(df_total_cases_by_state_by_date, 
                                values='Total Cases', 
                                columns='Date', 
                                index=['State','StateFIPS'])
df_pivot_table

State,StateFIPS,2020-01-22,2020-01-23,2020-01-24,2020-01-25,2020-01-26,2020-01-27,2020-01-28,2020-01-29,2020-01-30,2020-01-31,...,2022-01-18,2022-01-19,2022-01-20,2022-01-21,2022-01-22,2022-01-23,2022-01-24,2022-01-25,2022-01-26,2022-01-27
AK,2,0,0,0,0,0,0,0,0,0,0,...,179006,181507,183777,187047,188250,190051,193306,195288,197425,199769
AL,1,0,0,0,0,0,0,0,0,0,0,...,1071264,1088370,1104356,1120881,1120881,1120881,1153149,1153149,1153149,1153149
AR,5,0,0,0,0,0,0,0,0,0,0,...,677449,690890,701050,709958,709958,709958,721587,732071,738053,743699
AZ,4,0,0,0,0,1,1,1,1,1,1,...,1645695,1666191,1683915,1701950,1701950,1701950,1767303,1781275,1799504,1813797
CA,6,722,733,739,749,756,766,772,776,783,798,...,7085843,7274629,7346608,7398487,7469125,7554303,7626265,7680316,7702919,7705725
CO,8,0,0,0,0,0,0,0,0,0,0,...,1169876,1180568,1184668,1199935,1208995,1212408,1221013,1229321,1232348,1233235
CT,9,0,0,0,0,0,0,0,0,0,0,...,657680,662425,667230,671674,671674,671674,683731,687555,690350,693386
DC,11,0,0,0,0,0,0,0,0,0,0,...,125707,126187,126675,127200,127200,127200,128550,128739,129108,129479
DE,10,0,0,0,0,0,0,0,0,0,0,...,229759,232132,233843,236022,236022,236022,241367,242324,242324,244730
FL,12,0,0,0,0,0,0,0,0,0,0,...,5196294,5242386,5281000,5303375,5311729,5347828,5382565,5420175,5448288,5478894
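
One detail worth knowing is that `pd.pivot_table` aggregates when several rows share the same index and column keys, and its default `aggfunc` is the mean. After a melt each (state, date) pair occurs once, so any aggregation returns the original value. The sketch below round-trips a two-state frame and passes `aggfunc="sum"` to make the aggregation explicit; that choice is an illustration, not something the notebook requires.

```python
import pandas as pd

wide = pd.DataFrame({
    "State": ["AK", "CA"],
    "StateFIPS": ["2", "6"],
    "2020-01-22": [0, 722],
    "2020-01-23": [0, 733],
})
long = wide.melt(id_vars=["State", "StateFIPS"],
                 var_name="Date",
                 value_name="Total Cases")

# Each (State, StateFIPS, Date) key appears once, so the sum is the value itself.
restored = pd.pivot_table(long,
                          values="Total Cases",
                          columns="Date",
                          index=["State", "StateFIPS"],
                          aggfunc="sum")

# Flatten the index back into ordinary columns for comparison with `wide`.
restored = restored.reset_index()
restored.columns.name = None
print(restored)
```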


## Why Sort?

Sorting is needed before any multi-row operation, such as comparing each row with the one before it. As you can see below, the state records are currently ordered by date rather than by state.

In [11]:
df_total_cases_by_state_by_date

Unnamed: 0,State,StateFIPS,Date,Total Cases
0,AK,2,2020-01-22,0
1,AL,1,2020-01-22,0
2,AR,5,2020-01-22,0
3,AZ,4,2020-01-22,0
4,CA,6,2020-01-22,722
...,...,...,...,...
37582,VT,50,2022-01-27,103066
37583,WA,53,2022-01-27,1257918
37584,WI,55,2022-01-27,1495240
37585,WV,54,2022-01-27,434221
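
As an example of a multi-row operation, daily new cases can be derived from cumulative totals by subtracting the previous day's value within each state. `diff` compares each row with the previous row in its group, so the rows must be in date order first. This is a sketch: the CA values are taken from the table above, the AK values are made up, and `New Cases` is a hypothetical column name.

```python
import pandas as pd

# Cumulative totals, deliberately out of date order.
df = pd.DataFrame({
    "State": ["CA", "AK", "CA", "AK", "CA", "AK"],
    "Date": pd.to_datetime(["2020-01-24", "2020-01-22", "2020-01-22",
                            "2020-01-24", "2020-01-23", "2020-01-23"]),
    "Total Cases": [739, 0, 722, 5, 733, 2],
})

df_sorted = df.sort_values(by=["State", "Date"]).reset_index(drop=True)

# Within each state, subtract the previous day's cumulative total.
# The first day of each state has no predecessor and becomes NaN.
df_sorted["New Cases"] = df_sorted.groupby("State")["Total Cases"].diff()
print(df_sorted)
```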


## PD 5.1 (Activity 1) Sort LA County Total Cases

In [12]:
df_total_cases_LA_by_date = df_total_cases_LA_by_date.astype({'Date': 'datetime64[ns]'})
df_sorted_LA_county_total_cases = df_total_cases_LA_by_date.sort_values(by=['Date'])
df_sorted_LA_county_total_cases

Unnamed: 0,State,StateFIPS,County Name,countyFIPS,Date,Total Cases
0,CA,6,Los Angeles County,6037,2020-01-22,375
1,CA,6,Los Angeles County,6037,2020-01-23,379
2,CA,6,Los Angeles County,6037,2020-01-24,382
3,CA,6,Los Angeles County,6037,2020-01-25,384
4,CA,6,Los Angeles County,6037,2020-01-26,385
...,...,...,...,...,...,...
732,CA,6,Los Angeles County,6037,2022-01-23,2430653
733,CA,6,Los Angeles County,6037,2022-01-24,2453693
734,CA,6,Los Angeles County,6037,2022-01-25,2468026
735,CA,6,Los Angeles County,6037,2022-01-26,2472960
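
The `Date` values here are in `YYYY-MM-DD` form, which happens to sort correctly even as text. Converting to a datetime type still matters because other date formats do not sort chronologically as strings. The sketch below uses hypothetical `month/day/year` labels to show the difference.

```python
import pandas as pd

# Hypothetical non-ISO date labels; the notebook's own columns use YYYY-MM-DD.
labels = pd.Series(["1/2/20", "1/10/20", "1/22/20"])

# Sorting as text compares character by character.
as_text = labels.sort_values().tolist()

# Sorting as datetimes compares actual calendar dates.
as_dates = (
    pd.to_datetime(labels, format="%m/%d/%y")
      .sort_values()
      .dt.strftime("%Y-%m-%d")
      .tolist()
)

print(as_text)
print(as_dates)
```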


## PD 5.2 (Activity 2) Sort County Total Cases

In [13]:
df_total_county_cases_by_date = df_total_county_cases_by_date.astype({'Date': 'datetime64[ns]'})
df_sorted_county_total_cases = df_total_county_cases_by_date.sort_values(by=['StateFIPS', 
                                                                             'countyFIPS', 
                                                                             'Date'])
df_sorted_county_total_cases

Unnamed: 0,State,StateFIPS,County Name,countyFIPS,Date,Total Cases
0,AL,1,Statewide Unallocated,0,2020-01-22,0
3193,AL,1,Statewide Unallocated,0,2020-01-23,0
6386,AL,1,Statewide Unallocated,0,2020-01-24,0
9579,AL,1,Statewide Unallocated,0,2020-01-25,0
12772,AL,1,Statewide Unallocated,0,2020-01-26,0
...,...,...,...,...,...,...
2337598,CT,9,Windham County,9015,2022-01-23,23067
2340791,CT,9,Windham County,9015,2022-01-24,23620
2343984,CT,9,Windham County,9015,2022-01-25,23811
2347177,CT,9,Windham County,9015,2022-01-26,23984


## PD 5.3 (Activity 3) Sort State Total Cases

In [14]:
df_total_cases_by_state_by_date = df_total_cases_by_state_by_date.astype({'Date': 'datetime64[ns]'})
df_sorted_state_total_cases = df_total_cases_by_state_by_date.sort_values(by=['StateFIPS', 'Date'])
df_sorted_state_total_cases

Unnamed: 0,State,StateFIPS,Date,Total Cases
1,AL,1,2020-01-22,0
52,AL,1,2020-01-23,0
103,AL,1,2020-01-24,0
154,AL,1,2020-01-25,0
205,AL,1,2020-01-26,0
...,...,...,...,...
37338,CT,9,2022-01-23,671674
37389,CT,9,2022-01-24,683731
37440,CT,9,2022-01-25,687555
37491,CT,9,2022-01-26,690350
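
Because the FIPS columns were cast to strings, sorting on them is lexicographic, not numeric. That is why the sorted output ends with Connecticut (StateFIPS 9) and not with the highest numeric code. Rows within each state are still contiguous and in date order, so per-state operations are unaffected. If numeric ordering is wanted, one option is to zero-pad the codes, as in this sketch; `StateFIPS_padded` is a hypothetical column name.

```python
import pandas as pd

df = pd.DataFrame({
    "State": ["CT", "AL", "DE", "AK"],
    "StateFIPS": ["9", "1", "10", "2"],
})

# Text sort: "10" lands before "2" and "9".
text_order = df.sort_values(by="StateFIPS")["State"].tolist()

# Zero-pad to two digits so text order matches numeric order.
df["StateFIPS_padded"] = df["StateFIPS"].str.zfill(2)
padded_order = df.sort_values(by="StateFIPS_padded")["State"].tolist()

print(text_order)
print(padded_order)
```

County codes could be padded the same way, with five digits in place of two.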
