# OVERVIEW #

This notebook will import the grouped version of the COVID data exported as a .csv file in the COVID19_API_Calls notebook.


In the COVID19_API_Calls.ipynb file, CDC COVID data was called through an API, saved in a DataFrame, then grouped by case reported date with hospitalization and death counts.


**You will need to visit the [TSA website here](https://www.tsa.gov/coronavirus/passenger-throughput) and copy/paste the traveler data into a .csv file on your local machine.**

*NAME YOUR TSA FILE: "TSA_Data.csv"*


This notebook reads in that TSA data file, creates DataFrames for both sets of data, converts datatypes, and merges the data into a single DataFrame. Finally, the merged DataFrame is exported as a .csv file.

In [13]:
#import Pandas, Numpy, and MatPlotLib
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np


# SECTION 1: Import the COVID and TSA .csv files #

In [3]:
#read COVID data csv file in as Pandas DataFrame and preview it
covid_data = pd.read_csv('covid_data.csv')
covid_data.tail()

Unnamed: 0,cdc_report_dt,Cases,Deaths,Hospitalizations
250,2020-09-11,20163,188,812
251,2020-09-12,18277,100,574
252,2020-09-13,15436,154,560
253,2020-09-14,25893,163,1031
254,2020-09-15,39927,400,1754


In [4]:
#read TSA data csv file in as a Pandas DataFrame and preview it
tsa_data = pd.read_csv('TSA_Data.csv',delimiter=',')
tsa_data.head()

Unnamed: 0,Date,Total Traveler Throughput,Total Traveler Throughput_1 Year Ago_Same Weekday
0,10/9/2020,968545.0,2688032.0
1,10/8/2020,936915.0,2605291.0
2,10/7/2020,668519.0,2215233.0
3,10/6/2020,590766.0,2035628.0
4,10/5/2020,816838.0,2400153.0
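
As a possible shortcut, `pd.read_csv` can parse the date column and set it as the index while the file is being read. The sketch below is only an illustration: `sample_csv` is a hypothetical in-memory stand-in for `TSA_Data.csv`, built from rows shown in the preview above, and `tsa_sample` is a name introduced here.

```python
import io
import pandas as pd

# Hypothetical stand-in for TSA_Data.csv, using rows from the preview above
sample_csv = io.StringIO(
    "Date,Total Traveler Throughput,Total Traveler Throughput_1 Year Ago_Same Weekday\n"
    "10/9/2020,968545,2688032\n"
    "10/8/2020,936915,2605291\n"
    "10/7/2020,668519,2215233\n"
)

# parse_dates converts the Date column to datetime while reading;
# index_col makes that column the index in the same step
tsa_sample = pd.read_csv(sample_csv, parse_dates=["Date"], index_col="Date")

print(tsa_sample.index.dtype)
print(tsa_sample.head())
```

This notebook converts the dates in separate steps instead, which makes each datatype change easier to inspect.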


In [6]:
#The next step is to combine the COVID and TSA DataFrames so that for each day, I have all values

#For both DataFrames, I want the date to be the index and I want the dates to have the same datatype


# SECTION 2: Convert Datatypes #

In [5]:
#Check the datatype of the date column in the COVID DataFrame using .info().  It was read in as an object.
covid_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 255 entries, 0 to 254
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   cdc_report_dt     255 non-null    object
 1   Cases             255 non-null    int64 
 2   Deaths            255 non-null    int64 
 3   Hospitalizations  255 non-null    int64 
dtypes: int64(3), object(1)
memory usage: 8.1+ KB


In [6]:
#Check the datatype of the date column in the TSA DataFrame using .info().  It was also read in as an object.
#Note: in this run the traveler numbers were read in as float64; if they come through as objects, they will need to be converted
tsa_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 224 entries, 0 to 223
Data columns (total 3 columns):
 #   Column                                             Non-Null Count  Dtype  
---  ------                                             --------------  -----  
 0   Date                                               223 non-null    object 
 1   Total Traveler Throughput                          223 non-null    float64
 2   Total Traveler Throughput_1 Year Ago_Same Weekday  223 non-null    float64
dtypes: float64(2), object(1)
memory usage: 5.4+ KB
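
The `.info()` output above reports 224 entries but only 223 non-null values in every column, which suggests one completely empty row (an assumption; it may be a blank line left over from the copy/paste). A minimal sketch of how such a row could be dropped, using a hypothetical in-memory sample rather than the real file:

```python
import io
import pandas as pd

# Hypothetical sample with one row where every field is empty
sample_csv = io.StringIO(
    "Date,Total Traveler Throughput\n"
    "10/9/2020,968545\n"
    ",\n"
    "10/8/2020,936915\n"
)
tsa_sample = pd.read_csv(sample_csv)

# how="all" drops a row only when every column in it is missing
tsa_clean = tsa_sample.dropna(how="all")

print(len(tsa_sample), len(tsa_clean))
```

In this notebook the empty row is never removed explicitly; the inner merge later on excludes it because its date matches nothing in the COVID data.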


#### Convert dates to datetime64 ####

In [10]:
#Import the datetime module (not strictly needed for pd.to_datetime, which is part of Pandas)
import datetime


In [11]:
#Convert the COVID report date
case_date = pd.to_datetime(covid_data["cdc_report_dt"]) 
covid_case_date = pd.DataFrame(case_date)

#Check the datatype, it is datetime
covid_case_date.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 255 entries, 0 to 254
Data columns (total 1 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   cdc_report_dt  255 non-null    datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 2.1 KB


In [12]:
#Convert the TSA traveler count date
tsa_date = pd.to_datetime(tsa_data["Date"]) 
tsa_travel_date = pd.DataFrame(tsa_date)

#Check the datatype, it is datetime
tsa_travel_date.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 224 entries, 0 to 223
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   Date    223 non-null    datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 1.9 KB
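
The two files store dates in different layouts (`10/9/2020` in the TSA data, `2020-09-15` in the COVID data). `pd.to_datetime` inferred both correctly above; a more defensive variant, sketched below with sample values only, passes an explicit `format` so that month and day cannot be swapped. The variable names here are illustrative and are not used elsewhere in the notebook.

```python
import pandas as pd

# Sample date strings in the two layouts seen in the previews above
tsa_dates = pd.Series(["10/9/2020", "10/8/2020", None], name="Date")
covid_dates = pd.Series(["2020-09-14", "2020-09-15"], name="cdc_report_dt")

# An explicit format removes any month/day ambiguity;
# errors="coerce" turns missing or unparseable values into NaT
tsa_parsed = pd.to_datetime(tsa_dates, format="%m/%d/%Y", errors="coerce")
covid_parsed = pd.to_datetime(covid_dates, format="%Y-%m-%d")

print(tsa_parsed)
print(covid_parsed)
```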


#### Convert TSA traveler numbers to float64 ####

In [12]:
#Now I can make sure the TSA traveler numbers are numeric using pd.to_numeric (in this run they end up as float64, not integers)

tsa_2020_1 = pd.to_numeric(tsa_data['Total Traveler Throughput'])
tsa_2020 = pd.DataFrame(tsa_2020_1)
tsa_2020.info()  #this is my new 2020 traveler number DataFrame, the datatype is float

tsa_2019_1 = pd.to_numeric(tsa_data['Total Traveler Throughput_1 Year Ago_Same Weekday'])
tsa_2019 = pd.DataFrame(tsa_2019_1)
tsa_2019.info()  #this is my new 2019 traveler number DataFrame, the datatype is float

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 224 entries, 0 to 223
Data columns (total 1 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Total Traveler Throughput  223 non-null    float64
dtypes: float64(1)
memory usage: 1.9 KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 224 entries, 0 to 223
Data columns (total 1 columns):
 #   Column                                             Non-Null Count  Dtype  
---  ------                                             --------------  -----  
 0   Total Traveler Throughput_1 Year Ago_Same Weekday  223 non-null    float64
dtypes: float64(1)
memory usage: 1.9 KB
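
In this run the traveler columns were already `float64`, so `pd.to_numeric` leaves them unchanged. If the pasted data kept thousands separators such as `968,545` (an assumption; that did not happen with this file), the columns would load as strings and need cleaning first. A minimal sketch with hypothetical raw values:

```python
import pandas as pd

# Hypothetical raw values as they might look if pasted with thousands separators
raw = pd.Series(["968,545", "936,915", None], name="Total Traveler Throughput")

# Strip the commas, then convert; errors="coerce" maps bad values to NaN
cleaned = pd.to_numeric(raw.str.replace(",", "", regex=False), errors="coerce")

print(cleaned)
```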


In [13]:
#Now I can create separate DataFrames using the other datapoints
    #I will re-combine them later into one DataFrame together
    
#These are my COVID datapoints
covid_case_date  #the converted covid case date from the COVID DataFrame

covid_cases = pd.DataFrame(covid_data['Cases'])
covid_deaths = pd.DataFrame(covid_data['Deaths'])
covid_hopitalizations = pd.DataFrame(covid_data['Hospitalizations'])

#These are my TSA datapoints
tsa_travel_date  #the converted tsa traveler count date from the TSA DataFrame

tsa_2020 #this is my new 2020 traveler number DataFrame, the datatype is float
tsa_2019 #this is my new 2019 traveler number DataFrame, the datatype is float

tsa_2020.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 224 entries, 0 to 223
Data columns (total 1 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Total Traveler Throughput  223 non-null    float64
dtypes: float64(1)
memory usage: 1.9 KB


In [14]:
#I don't want to create a combined DataFrame at this stage because
    #some COVID case dates are earlier than the first TSA traveler count date, and
    #some TSA traveler count dates are later than the last COVID case date.
    #If I concatenate the DataFrame columns I just created, they will align on the index number, not the date
    
#Next, I will create a COVID DataFrame using the date as the index; I will do the same with the TSA data.
#And then I'll merge those two DataFrames
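
The alignment problem described above can be seen on a toy example. The two small frames below are illustrative only (the names `covid_small` and `tsa_small` and the shortened column names are not part of this notebook); the values are taken from previews elsewhere in the notebook.

```python
import pandas as pd

# Two small frames with default integer indexes and different date ranges
covid_small = pd.DataFrame({"date": pd.to_datetime(["2020-03-01", "2020-03-02"]),
                            "Cases": [233, 161]})
tsa_small = pd.DataFrame({"date": pd.to_datetime(["2020-03-02", "2020-03-03"]),
                          "Travelers": [2089641, 1736393]})

# With the default index, concat pairs row 0 with row 0 even though the dates differ
by_position = pd.concat([covid_small, tsa_small], axis=1)

# With the date as the index, concat lines rows up by date instead
by_date = pd.concat([covid_small.set_index("date"), tsa_small.set_index("date")], axis=1)

print(by_position)
print(by_date)
```

In `by_position`, the cases for March 1 sit next to the traveler count for March 2. In `by_date`, each row holds one date, with `NaN` where one source has no value for that date.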


In [15]:
#Concatenate the COVID DataFrame columns 
covid_df = pd.concat([covid_case_date, covid_cases, covid_deaths, covid_hopitalizations], axis=1)
covid_df.head()

#Set the Date Column as the Index
covid_dataframe = covid_df.set_index('cdc_report_dt')

#Preview the new DataFrame
covid_dataframe.head()

cdc_report_dt,Cases,Deaths,Hospitalizations
2020-01-01,12,0,1
2020-01-02,2,0,0
2020-01-03,2,0,0
2020-01-05,1,0,0
2020-01-08,1,0,0


In [16]:
#Concatenate the TSA DataFrame columns 
tsa_df = pd.concat([tsa_travel_date, tsa_2020, tsa_2019], axis=1)

#Set the Date Column as the Index
tsa_dataframe = tsa_df.set_index('Date')

#Preview the new DataFrame
tsa_dataframe.head()


Date,Total Traveler Throughput,Total Traveler Throughput_1 Year Ago_Same Weekday
2020-10-09,968545.0,2688032.0
2020-10-08,936915.0,2605291.0
2020-10-07,668519.0,2215233.0
2020-10-06,590766.0,2035628.0
2020-10-05,816838.0,2400153.0


In [17]:
#Now, the COVID and TSA DataFrames can be merged together on the Date index
    #This will be accomplished with an inner merge
    #Which will only join the rows that have matching values (matching dates)
    #What will be excluded is any COVID case dates or TSA traveler count dates that are not in both DataFrames
    
merged_df = pd.merge(covid_dataframe, tsa_dataframe, how='inner', left_index=True, right_index=True)
merged_df.tail()

Unnamed: 0,Cases,Deaths,Hospitalizations,Total Traveler Throughput,Total Traveler Throughput_1 Year Ago_Same Weekday
2020-09-11,20163,188,812,731353.0,2484025.0
2020-09-12,18277,100,574,613703.0,1879822.0
2020-09-13,15436,154,560,809850.0,2485134.0
2020-09-14,25893,163,1031,729558.0,2405832.0
2020-09-15,39927,400,1754,522383.0,2013050.0
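
To check which dates an inner merge leaves out, one option is an outer merge with `indicator=True`, which adds a `_merge` column saying where each row came from. The sketch below uses small illustrative frames (`covid_small`, `tsa_small`, and the `Travelers` column name are not part of this notebook) with values from the previews above.

```python
import pandas as pd

covid_small = pd.DataFrame(
    {"Cases": [12, 233, 161]},
    index=pd.to_datetime(["2020-01-01", "2020-03-01", "2020-03-02"]),
)
tsa_small = pd.DataFrame(
    {"Travelers": [2280522.0, 2089641.0, 968545.0]},
    index=pd.to_datetime(["2020-03-01", "2020-03-02", "2020-10-09"]),
)

# Inner merge: only dates present in both frames survive
inner = pd.merge(covid_small, tsa_small, left_index=True, right_index=True, how="inner")

# Outer merge keeps every date; _merge is "both", "left_only", or "right_only"
outer = pd.merge(covid_small, tsa_small, left_index=True, right_index=True,
                 how="outer", indicator=True)

# Rows the inner merge would exclude
dropped = outer[outer["_merge"] != "both"]

print(inner)
print(dropped)
```

Here the January COVID date and the October TSA date appear only in `dropped`, which matches the behavior described in the comments above.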


In [18]:
#Finally, I want to rename my columns so that they are easier to consume

covid_travel = merged_df.rename(columns = {'Total Traveler Throughput': '2020 Traveler Count', 'Total Traveler Throughput_1 Year Ago_Same Weekday': '2019 Traveler Count (Same Weekday)'})
covid_travel.head()

#Datetime format = year-month-day

Unnamed: 0,Cases,Deaths,Hospitalizations,2020 Traveler Count,2019 Traveler Count (Same Weekday)
2020-03-01,233,12,71,2280522.0,2301439.0
2020-03-02,161,14,46,2089641.0,2257920.0
2020-03-03,224,11,59,1736393.0,1979558.0
2020-03-04,217,15,67,1877401.0,2143619.0
2020-03-05,253,9,73,2130015.0,2402692.0


In [19]:
#And here is the information about this finalized DataFrame
covid_travel.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 199 entries, 2020-03-01 to 2020-09-15
Freq: D
Data columns (total 5 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   Cases                               199 non-null    int64  
 1   Deaths                              199 non-null    int64  
 2   Hospitalizations                    199 non-null    int64  
 3   2020 Traveler Count                 199 non-null    float64
 4   2019 Traveler Count (Same Weekday)  199 non-null    float64
dtypes: float64(2), int64(3)
memory usage: 9.3 KB


In [20]:
#Now, I want to store this DataFrame as a .csv in order to open it in a different ('Visualization') notebook.

covid_travel.to_csv('covid_travel.csv')
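
When the exported file is read back in another notebook, the date index comes back as an ordinary first column unless `read_csv` is told otherwise. The following sketch shows one way to restore it; it writes to an in-memory buffer instead of `covid_travel.csv`, and `covid_travel_small` is a hypothetical two-row stand-in for the real DataFrame.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for covid_travel
covid_travel_small = pd.DataFrame(
    {"Cases": [233, 161], "2020 Traveler Count": [2280522.0, 2089641.0]},
    index=pd.to_datetime(["2020-03-01", "2020-03-02"]),
)

# Write to an in-memory buffer rather than a file on disk
buffer = io.StringIO()
covid_travel_small.to_csv(buffer)
buffer.seek(0)

# index_col=0 restores the first column as the index;
# parse_dates=True parses that index as dates
restored = pd.read_csv(buffer, index_col=0, parse_dates=True)

print(restored.index.dtype)
print(restored)
```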