# Objective

The primary objectives of this project are as follows:

* Use Time Series Analysis to forecast the number of COVID-19 positive patients in the United States in the coming months.  

* Determine which models are better suited for making these predictions: ARIMA or Long Short-Term Memory (LSTM) Recurrent Neural Networks (RNNs).

The secondary objectives are the following:

* Determine which states are at higher risk of surges in positive cases.
* Find any social circumstances that could contribute to an increase in cases.

# Obtaining Data

<- In this section, we'll discuss COVID-19. Symptoms, dangers, how it spreads, etc.  Also have a subsection for challenges to modeling ->

## Importing Libraries and Data

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

### Notes on the Data:

We'll be looking at a few different datasets for different tasks, because some are formatted in a way that suits Time Series Analysis while others are better suited to Exploratory Data Analysis (EDA).

With that, the data that we **preprocess** will be from the sets we'll be using for **Time Series Analysis**, while the rest will be used **only for our EDA**.

# Preprocessing

First, let's take a look at the ```us_daily.csv```.

## US_Daily.csv
This dataset contains a lot of information at the national level, starting from 1/21/2020 and running through 9/27/2020 (the day the data was downloaded).  Some of the data includes the total number of positive cases to date, the number of tests performed daily, the number of people who tested negative or positive on a daily basis, etc.  It makes for a great summary and can serve as a "downsampled" dataset for our models.

In [2]:
us_daily = pd.read_csv('csv_files/us_daily.csv')
us_daily.head()

Unnamed: 0,date,states,positive,negative,pending,hospitalizedCurrently,hospitalizedCumulative,inIcuCurrently,inIcuCumulative,onVentilatorCurrently,...,totalTestResults,lastModified,total,posNeg,deathIncrease,hospitalizedIncrease,negativeIncrease,positiveIncrease,totalTestResultsIncrease,hash
0,20200927,56,7080459,90648092,11136.0,29432.0,404083.0,6080.0,20049.0,1511.0,...,101298794,2020-09-27T00:00:00Z,0,0,307,758,665609,35289,806258,e7c64e674bfc2af02802153452e53628d44c241c
1,20200926,56,7045170,89982483,11183.0,29554.0,403325.0,6057.0,20002.0,1509.0,...,100492536,2020-09-26T00:00:00Z,0,0,866,1154,886140,47733,1004261,e98f5076c72de4a27a283d22756b7d0b9a44d41f
2,20200925,56,6997437,89096343,10905.0,29769.0,402171.0,6133.0,19919.0,1506.0,...,99488275,2020-09-25T00:00:00Z,0,0,844,1331,856519,55526,1011675,8d311e73fe038522a1a6be4bc3202de206ec0adb
3,20200924,56,6941911,88239824,12008.0,30043.0,400840.0,6168.0,19555.0,1560.0,...,98476600,2020-09-24T00:00:00Z,0,0,921,1588,823449,43772,940353,375a88dd29991abc1946cd7f98f4f20a9e37fb5d
4,20200923,56,6898139,87416375,10535.0,29905.0,399252.0,6113.0,19452.0,1544.0,...,97536247,2020-09-23T00:00:00Z,0,0,1157,1451,800878,38567,923704,b4fe7067370631b26f8e988fd2524b5691235a09


In the notebook `"CSV_previews"`, we established that the column `states` does not refer to a state's FIPS code, but to the number of states/territories that had patients who tested positive for COVID-19.

Let's check out all 25 columns of the dataset and see what will be relevant to our time series modeling.

In [3]:
us_daily.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250 entries, 0 to 249
Data columns (total 25 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   date                      250 non-null    int64  
 1   states                    250 non-null    int64  
 2   positive                  250 non-null    int64  
 3   negative                  250 non-null    int64  
 4   pending                   208 non-null    float64
 5   hospitalizedCurrently     195 non-null    float64
 6   hospitalizedCumulative    208 non-null    float64
 7   inIcuCurrently            186 non-null    float64
 8   inIcuCumulative           187 non-null    float64
 9   onVentilatorCurrently     187 non-null    float64
 10  onVentilatorCumulative    180 non-null    float64
 11  recovered                 187 non-null    float64
 12  dateChecked               250 non-null    object 
 13  death                     231 non-null    float64
 14  hospitaliz

### Dropping columns

Ok, so we're only interested in the number of positive and negative cases, the number of tests, and possibly the number of deaths.  We can probably use information on the number of states to track how fast the virus spread and maybe even use it to discover where it started in the US.

With that, let's make a new data frame and drop the irrelevant data.

In [4]:
usd_df = us_daily[['date', 'states', 'positive', 'negative', 'death', 
                   'totalTestResults', 'total', 'posNeg', 'deathIncrease', 
                   'negativeIncrease', 'positiveIncrease', 
                   'totalTestResultsIncrease']]
usd_df.head()

Unnamed: 0,date,states,positive,negative,death,totalTestResults,total,posNeg,deathIncrease,negativeIncrease,positiveIncrease,totalTestResultsIncrease
0,20200927,56,7080459,90648092,196869.0,101298794,0,0,307,665609,35289,806258
1,20200926,56,7045170,89982483,196562.0,100492536,0,0,866,886140,47733,1004261
2,20200925,56,6997437,89096343,195696.0,99488275,0,0,844,856519,55526,1011675
3,20200924,56,6941911,88239824,194852.0,98476600,0,0,921,823449,43772,940353
4,20200923,56,6898139,87416375,193931.0,97536247,0,0,1157,800878,38567,923704
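
As a quick sanity check on the columns we kept, `positive` looks like a running total while `positiveIncrease` looks like the day-to-day change. The sketch below tests that reading on the five preview rows above. It is only an illustration: `preview`, `ordered`, and `check` are made-up names, and five rows do not prove the relationship holds for the whole file.

```python
import pandas as pd

# First five rows of the preview above (newest first)
preview = pd.DataFrame({
    "date": [20200927, 20200926, 20200925, 20200924, 20200923],
    "positive": [7080459, 7045170, 6997437, 6941911, 6898139],
    "positiveIncrease": [35289, 47733, 55526, 43772, 38567],
})

# Sort oldest-first so that diff() subtracts the previous day's total
ordered = preview.sort_values("date").reset_index(drop=True)
ordered["recomputed"] = ordered["positive"].diff()

# The oldest row has no predecessor, so its diff is NaN and is skipped
check = ordered.dropna(subset=["recomputed"])
increase_matches = (check["recomputed"] == check["positiveIncrease"]).all()
print(increase_matches)
```

If this prints `True` on the full dataset as well, the cumulative and daily columns carry the same information, which matters when choosing which one to model.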


Alright! Now before we tackle the issue of the date being an integer and not in date-time, let's look at the values inside the columns `total` and `posNeg` and see if they are really only zeros.

In [5]:
print(f'total value counts: {usd_df.total.value_counts()}')
print(f'posNeg value counts: {usd_df.posNeg.value_counts()}')

total value counts: 0    250
Name: total, dtype: int64
posNeg value counts: 0    250
Name: posNeg, dtype: int64


Ok! Looks like the curators of this dataset decided to make different columns for this information and didn't drop these ones, so let's take care of that.
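
The same check can be generalized so that we don't have to eyeball `value_counts()` one column at a time. Below is a minimal sketch of a helper that lists every column holding a single distinct value. The name `constant_columns` and the small `demo` frame are hypothetical and not part of the original notebook.

```python
import pandas as pd

def constant_columns(df):
    """Return the names of columns that hold only one distinct value (NaN counts as a value)."""
    return [col for col in df.columns if df[col].nunique(dropna=False) <= 1]

# Hypothetical miniature of usd_df
demo = pd.DataFrame({
    "positive": [7080459, 7045170, 6997437],
    "total": [0, 0, 0],
    "posNeg": [0, 0, 0],
})

print(constant_columns(demo))
```

On the real data, something like `constant_columns(usd_df)` should flag `total` and `posNeg`, assuming no other column is constant.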

In [6]:
usdf = usd_df.drop(['total', 'posNeg'], axis=1)
usdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250 entries, 0 to 249
Data columns (total 10 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   date                      250 non-null    int64  
 1   states                    250 non-null    int64  
 2   positive                  250 non-null    int64  
 3   negative                  250 non-null    int64  
 4   death                     231 non-null    float64
 5   totalTestResults          250 non-null    int64  
 6   deathIncrease             250 non-null    int64  
 7   negativeIncrease          250 non-null    int64  
 8   positiveIncrease          250 non-null    int64  
 9   totalTestResultsIncrease  250 non-null    int64  
dtypes: float64(1), int64(9)
memory usage: 19.7 KB


### Missing Values

We're missing some data in the column `death`. Let's find out why!

In [7]:
usdf['death'].isna().value_counts()

False    231
True      19
Name: death, dtype: int64

In [8]:
death = usdf[usdf['death'].isna()]
death

Unnamed: 0,date,states,positive,negative,death,totalTestResults,deathIncrease,negativeIncrease,positiveIncrease,totalTestResultsIncrease
231,20200209,2,0,0,,11,0,0,0,0
232,20200208,2,0,0,,11,0,0,0,1
233,20200207,2,0,0,,10,0,0,0,0
234,20200206,2,0,0,,10,0,0,0,1
235,20200205,2,0,0,,9,0,0,0,0
236,20200204,2,0,0,,9,0,0,0,2
237,20200203,2,0,0,,7,0,0,0,3
238,20200202,2,0,0,,4,0,0,0,0
239,20200201,2,0,0,,4,0,0,0,0
240,20200131,2,0,0,,4,0,0,0,0


Ok! It appears that all of these `NaN` values are at the start of the pandemic.  They might have been left blank because it was "possible" that someone could have died without the death being confirmed, since it was in the early stages of the pandemic. However, since the curators were certain enough to say that the total number of positive cases was zero, we can be comfortable imputing these missing values with zeros as well.
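
The preview only displays part of the 19 affected rows, so it may be worth confirming the assumption before filling. The sketch below checks that every row with a missing `death` value also reports zero positive cases, and only then fills with zeros. The `demo` frame is a stand-in: its first two rows mirror the early-February rows above, and the third row is hypothetical.

```python
import numpy as np
import pandas as pd

# Small stand-in for usdf
demo = pd.DataFrame({
    "date": [20200208, 20200209, 20200301],  # third row is hypothetical
    "positive": [0, 0, 50],
    "death": [np.nan, np.nan, 1.0],
})

missing_death = demo["death"].isna()

# True only if every row lacking a death count also reports zero positive cases
safe_to_zero_fill = bool((demo.loc[missing_death, "positive"] == 0).all())

if safe_to_zero_fill:
    demo["death"] = demo["death"].fillna(0)

print(safe_to_zero_fill)
```

If the check came back `False` on the real data, zero-filling would silently understate deaths on days that already had positive cases, and a different strategy would be needed.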

In [9]:
usdf = usdf.fillna(value=0)
usdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250 entries, 0 to 249
Data columns (total 10 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   date                      250 non-null    int64  
 1   states                    250 non-null    int64  
 2   positive                  250 non-null    int64  
 3   negative                  250 non-null    int64  
 4   death                     250 non-null    float64
 5   totalTestResults          250 non-null    int64  
 6   deathIncrease             250 non-null    int64  
 7   negativeIncrease          250 non-null    int64  
 8   positiveIncrease          250 non-null    int64  
 9   totalTestResultsIncrease  250 non-null    int64  
dtypes: float64(1), int64(9)
memory usage: 19.7 KB


### Fixing Dates

Alright, now to address the data type issue.  `date` is stored as an integer, so let's fix that.

In [10]:
usdf['date'] = pd.to_datetime(usdf['date'], format='%Y%m%d')
usdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250 entries, 0 to 249
Data columns (total 10 columns):
 #   Column                    Non-Null Count  Dtype         
---  ------                    --------------  -----         
 0   date                      250 non-null    datetime64[ns]
 1   states                    250 non-null    int64         
 2   positive                  250 non-null    int64         
 3   negative                  250 non-null    int64         
 4   death                     250 non-null    float64       
 5   totalTestResults          250 non-null    int64         
 6   deathIncrease             250 non-null    int64         
 7   negativeIncrease          250 non-null    int64         
 8   positiveIncrease          250 non-null    int64         
 9   totalTestResultsIncrease  250 non-null    int64         
dtypes: datetime64[ns](1), float64(1), int64(8)
memory usage: 19.7 KB


In [11]:
usdf.head()

Unnamed: 0,date,states,positive,negative,death,totalTestResults,deathIncrease,negativeIncrease,positiveIncrease,totalTestResultsIncrease
0,2020-09-27,56,7080459,90648092,196869.0,101298794,307,665609,35289,806258
1,2020-09-26,56,7045170,89982483,196562.0,100492536,866,886140,47733,1004261
2,2020-09-25,56,6997437,89096343,195696.0,99488275,844,856519,55526,1011675
3,2020-09-24,56,6941911,88239824,194852.0,98476600,921,823449,43772,940353
4,2020-09-23,56,6898139,87416375,193931.0,97536247,1157,800878,38567,923704
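
Note that the rows are still ordered newest-first. Time series tooling generally works best with a chronologically sorted `DatetimeIndex`, so a step like the one below may be needed before modeling. This is only a sketch on three rows copied from the preview above; the notebook itself does not perform this step here, and `demo` and `series` are made-up names.

```python
import pandas as pd

# Three rows copied from the preview above (newest first)
demo = pd.DataFrame({
    "date": pd.to_datetime(["2020-09-27", "2020-09-26", "2020-09-25"]),
    "positiveIncrease": [35289, 47733, 55526],
})

# Use the dates as the index and order them oldest-first
series = demo.set_index("date").sort_index()["positiveIncrease"]

print(series.index.is_monotonic_increasing)
```

Whether to do this before or after exporting is a design choice; doing it at load time in the modeling notebook keeps the exported CSV close to the source file.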


### Export to CSV
Awesome! Now we'll export this as a new CSV!

In [12]:
usdf.to_csv('us_daily_preprocessed.csv')

## JHU Confirmed

This dataset is one of many Johns Hopkins University datasets on the COVID-19 pandemic. It contains the number of cases at the state and even county level on a day-by-day basis.  Let's open it up!

In [13]:
jhu_conf = pd.read_csv('csv_files/jhu_confirmed.csv')
jhu_conf.head()

Unnamed: 0,UID,iso2,iso3,code3,FIPS,Admin2,Province_State,Country_Region,Lat,Long_,...,9/19/20,9/20/20,9/21/20,9/22/20,9/23/20,9/24/20,9/25/20,9/26/20,9/27/20,9/28/20
0,84001001,US,USA,840,1001.0,Autauga,Alabama,US,32.539527,-86.644082,...,1673,1690,1691,1714,1715,1738,1757,1764,1773,1785
1,84001003,US,USA,840,1003.0,Baldwin,Alabama,US,30.72775,-87.722071,...,5047,5061,5087,5124,5141,5165,5456,5477,5526,5588
2,84001005,US,USA,840,1005.0,Barbour,Alabama,US,31.868263,-85.387129,...,830,835,838,848,851,857,873,882,885,886
3,84001007,US,USA,840,1007.0,Bibb,Alabama,US,32.996421,-87.125115,...,628,632,636,635,638,642,652,654,656,657
4,84001009,US,USA,840,1009.0,Blount,Alabama,US,33.982109,-86.567906,...,1542,1551,1560,1573,1580,1594,1608,1611,1617,1618


In [14]:
jhu_conf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3340 entries, 0 to 3339
Columns: 262 entries, UID to 9/28/20
dtypes: float64(3), int64(253), object(6)
memory usage: 6.7+ MB


As you can see, we have a very large data frame stretching out in every direction.  There are 262 columns, most of which are dates that belong in the index rather than in the columns, and 3,340 rows that split each state up by county. We'll need to consolidate this information if we're going to get any meaningful visualizations that aren't a giant mess.

Let's first get started with the columns and drop information that isn't useful at all.  Unlike the previous dataset, it will be easier to drop the columns we don't want rather than explicitly name the ones we want to keep. 

Before we do that, let's just double check that this ONLY contains data from the United States.

In [58]:
unassigned = jhu_conf[jhu_conf['Admin2'] == 'Unassigned']
unassigned.head()

Unnamed: 0,UID,iso2,iso3,code3,FIPS,Admin2,Province_State,Country_Region,Lat,Long_,...,9/19/20,9/20/20,9/21/20,9/22/20,9/23/20,9/24/20,9/25/20,9/26/20,9/27/20,9/28/20
64,84090001,US,USA,840,90001.0,Unassigned,Alabama,US,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
95,84090002,US,USA,840,90002.0,Unassigned,Alaska,US,0.0,0.0,...,1,1,1,1,17,17,18,19,20,20
115,84090004,US,USA,840,90004.0,Unassigned,Arizona,US,0.0,0.0,...,3,4,1,1,0,3,0,2,0,1
188,84090005,US,USA,840,90005.0,Unassigned,Arkansas,US,0.0,0.0,...,1649,1691,1686,1592,1624,1778,1766,1857,1772,1900
251,84090006,US,USA,840,90006.0,Unassigned,California,US,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0


In [18]:
# Checking the 3 letter abbreviations for countries
jhu_conf.iso3.unique()

array(['USA', 'ASM', 'GUM', 'MNP', 'PRI', 'VIR'], dtype=object)

In [19]:
# Checking Country_Region
jhu_conf.Country_Region.unique()

array(['US'], dtype=object)

So in addition to the 50 states, we also have information from the US territories American Samoa (ASM), Guam (GUM), the Northern Mariana Islands (MNP), Puerto Rico (PRI), and the Virgin Islands (VIR).  Although these are territories, they are still part of the United States, so we'll keep them.  The only issue that could come up is that the column `Admin2` (reserved for county names) will likely contain NaN values for them. We'll look into this after trimming the columns a bit.

In [20]:
jhu_df = jhu_conf.drop(['UID', 'iso2', 'code3', 'FIPS', 'Country_Region', 
                        'Lat', 'Long_'], axis=1)
jhu_df.head()

Unnamed: 0,iso3,Admin2,Province_State,Combined_Key,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,...,9/19/20,9/20/20,9/21/20,9/22/20,9/23/20,9/24/20,9/25/20,9/26/20,9/27/20,9/28/20
0,USA,Autauga,Alabama,"Autauga, Alabama, US",0,0,0,0,0,0,...,1673,1690,1691,1714,1715,1738,1757,1764,1773,1785
1,USA,Baldwin,Alabama,"Baldwin, Alabama, US",0,0,0,0,0,0,...,5047,5061,5087,5124,5141,5165,5456,5477,5526,5588
2,USA,Barbour,Alabama,"Barbour, Alabama, US",0,0,0,0,0,0,...,830,835,838,848,851,857,873,882,885,886
3,USA,Bibb,Alabama,"Bibb, Alabama, US",0,0,0,0,0,0,...,628,632,636,635,638,642,652,654,656,657
4,USA,Blount,Alabama,"Blount, Alabama, US",0,0,0,0,0,0,...,1542,1551,1560,1573,1580,1594,1608,1611,1617,1618


### Missing Values

Interesting.  `Combined_Key` could take care of any NaN's that we have in `Admin2`, but that could lead to some issues later if we want to consolidate the data on a state-by-state basis.  Let's check for NaN values in the first few columns.

In [26]:
# Broad stroke check of the first few columns
for i in range(0, 4):
    print(jhu_df.columns[i])
    print(jhu_df.iloc[:,i].isna().any())
    print('----------------------------------------')

iso3
False
----------------------------------------
Admin2
True
----------------------------------------
Province_State
False
----------------------------------------
Combined_Key
False
----------------------------------------


As suspected, `Admin2` is missing some values.  Let's check to make sure it's just the territories.

In [27]:
missing = jhu_df[jhu_df['Admin2'].isna()]

# Viewing first 20 rows
missing.head(20)

Unnamed: 0,iso3,Admin2,Province_State,Combined_Key,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,...,9/19/20,9/20/20,9/21/20,9/22/20,9/23/20,9/24/20,9/25/20,9/26/20,9/27/20,9/28/20
100,ASM,,American Samoa,"American Samoa, US",0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
336,USA,,Diamond Princess,"Diamond Princess, US",0,0,0,0,0,0,...,49,49,49,49,49,49,49,49,49,49
570,USA,,Grand Princess,"Grand Princess, US",0,0,0,0,0,0,...,103,103,103,103,103,103,103,103,103,103
571,GUM,,Guam,"Guam, US",0,0,0,0,0,0,...,2074,2074,2147,2190,2235,2263,2286,2286,2286,2286
2121,MNP,,Northern Mariana Islands,"Northern Mariana Islands, US",0,0,0,0,0,0,...,68,68,69,69,69,69,69,70,70,70
3007,VIR,,Virgin Islands,"Virgin Islands, US",0,0,0,0,0,0,...,1257,1269,1269,1278,1290,1290,1296,1317,1317,1318


In [42]:
missing['Admin2'].isna()

100     True
336     True
570     True
571     True
2121    True
3007    True
Name: Admin2, dtype: bool

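So there are only 6 missing values, but not all of them are territories: four are US territories (American Samoa, Guam, the Northern Mariana Islands, and the Virgin Islands) and two are the cruise ships Diamond Princess and Grand Princess.  We'll impute these values with the corresponding `Province_State` names.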

In [54]:
# Assign the result back instead of calling fillna(inplace=True) on a column selection
jhu_df['Admin2'] = jhu_df['Admin2'].fillna(jhu_df['Province_State'])

# Checking
jhu_df.Admin2.isna().sum()

0

In [55]:
jhu_df.Admin2.value_counts()

Unassigned    52
Washington    31
Jefferson     26
Franklin      25
Lincoln       24
              ..
Centre         1
Lanier         1
Irwin          1
Oglethorpe     1
Clare          1
Name: Admin2, Length: 1984, dtype: int64
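
The counts above also show that county names repeat across states (`Washington` appears 31 times), so `Admin2` alone cannot identify a row. Below is a minimal sketch of how one might confirm that the (`Province_State`, `Admin2`) pair is a safer key. The `demo` rows are illustrative and are not taken from the dataset.

```python
import pandas as pd

# Illustrative rows: the county name "Washington" appears in more than one state
demo = pd.DataFrame({
    "Admin2": ["Washington", "Washington", "Jefferson"],
    "Province_State": ["Alabama", "Arkansas", "Alabama"],
})

# The county name alone does not identify a row...
name_is_unique = not demo["Admin2"].duplicated().any()

# ...but the (state, county) pair does
pair_is_unique = not demo.duplicated(subset=["Province_State", "Admin2"]).any()

print(name_is_unique, pair_is_unique)
```

Running the second check on `jhu_df` would tell us whether grouping or merging on both columns is safe for the full dataset.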

### Placeholders
It appears that there is a placeholder value designated as `Unassigned`.  Time to figure out what these values might point to.

In [49]:
placeholder = jhu_df[jhu_df['Admin2'] == 'Unassigned']

In [56]:
placeholder.head(52)

Unnamed: 0,iso3,Admin2,Province_State,Combined_Key,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,...,9/19/20,9/20/20,9/21/20,9/22/20,9/23/20,9/24/20,9/25/20,9/26/20,9/27/20,9/28/20
64,USA,Unassigned,Alabama,"Unassigned, Alabama, US",0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
95,USA,Unassigned,Alaska,"Unassigned, Alaska, US",0,0,0,0,0,0,...,1,1,1,1,17,17,18,19,20,20
115,USA,Unassigned,Arizona,"Unassigned, Arizona, US",0,0,0,0,0,0,...,3,4,1,1,0,3,0,2,0,1
188,USA,Unassigned,Arkansas,"Unassigned, Arkansas, US",0,0,0,0,0,0,...,1649,1691,1686,1592,1624,1778,1766,1857,1772,1900
251,USA,Unassigned,California,"Unassigned, California, US",0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
317,USA,Unassigned,Colorado,"Unassigned, Colorado, US",0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
329,USA,Unassigned,Connecticut,"Unassigned, Connecticut, US",0,0,0,0,0,0,...,121,121,142,137,135,130,132,132,132,147
335,USA,Unassigned,Delaware,"Unassigned, Delaware, US",0,0,0,0,0,0,...,476,480,480,482,482,488,487,489,489,493
339,USA,Unassigned,District of Columbia,"Unassigned, District of Columbia, US",0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
403,USA,Unassigned,Florida,"Unassigned, Florida, US",0,0,0,0,0,0,...,1592,1598,1600,1613,1623,1645,1654,1669,1687,1690


In [60]:
placeholder.Province_State.unique()

array(['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California',
       'Colorado', 'Connecticut', 'Delaware', 'District of Columbia',
       'Florida', 'Georgia', 'Hawaii', 'Idaho', 'Illinois', 'Indiana',
       'Iowa', 'Kansas', 'Kentucky', 'Louisiana', 'Maine', 'Maryland',
       'Massachusetts', 'Michigan', 'Minnesota', 'Mississippi',
       'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire',
       'New Jersey', 'New Mexico', 'New York', 'North Carolina',
       'North Dakota', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania',
       'Puerto Rico', 'Rhode Island', 'South Carolina', 'South Dakota',
       'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington',
       'West Virginia', 'Wisconsin', 'Wyoming'], dtype=object)

In [61]:
# Checking to see if Washington D.C. is treated as its own 'state' in the 
# parent dataset
jhu_df.Province_State.unique()

array(['Alabama', 'Alaska', 'American Samoa', 'Arizona', 'Arkansas',
       'California', 'Colorado', 'Connecticut', 'Delaware',
       'Diamond Princess', 'District of Columbia', 'Florida', 'Georgia',
       'Grand Princess', 'Guam', 'Hawaii', 'Idaho', 'Illinois', 'Indiana',
       'Iowa', 'Kansas', 'Kentucky', 'Louisiana', 'Maine', 'Maryland',
       'Massachusetts', 'Michigan', 'Minnesota', 'Mississippi',
       'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire',
       'New Jersey', 'New Mexico', 'New York', 'North Carolina',
       'North Dakota', 'Northern Mariana Islands', 'Ohio', 'Oklahoma',
       'Oregon', 'Pennsylvania', 'Puerto Rico', 'Rhode Island',
       'South Carolina', 'South Dakota', 'Tennessee', 'Texas', 'Utah',
       'Vermont', 'Virgin Islands', 'Virginia', 'Washington',
       'West Virginia', 'Wisconsin', 'Wyoming'], dtype=object)

So there seems to be an `Unassigned` value attached to every state, to Puerto Rico, and to Washington D.C.  We also see that Washington D.C. is being reported as if it were its own state.

After some research, I discovered that this was Johns Hopkins University's solution for accounting for changes in how states reported their daily statistics, such as a state no longer reporting data at the county level or redefining "probable" cases. Since there is still data in these fields, I don't want to drop them.  Instead, we'll leave them be.
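
To get a feel for how much weight the `Unassigned` rows carry, one option is to compute the share of each state's reported cases that has no county attached. The sketch below does this for a single date column. The `demo` frame is hypothetical apart from the Arkansas and Alabama `Unassigned` counts and the Autauga count, which come from the previews above; the `Pulaski` figure is made up.

```python
import pandas as pd

# Hypothetical miniature of jhu_df with one date column
demo = pd.DataFrame({
    "Admin2": ["Pulaski", "Unassigned", "Autauga", "Unassigned"],
    "Province_State": ["Arkansas", "Arkansas", "Alabama", "Alabama"],
    "9/28/20": [8100, 1900, 1785, 0],  # 8100 is a made-up value
})

latest = "9/28/20"

# Total reported cases per state, including the Unassigned rows
state_total = demo.groupby("Province_State")[latest].sum()

# Unassigned cases per state, indexed the same way so the division aligns
unassigned = (demo[demo["Admin2"] == "Unassigned"]
              .set_index("Province_State")[latest])

# Fraction of each state's reported cases with no county attached
unassigned_share = (unassigned / state_total).fillna(0)
print(unassigned_share)
```

A large share for a state would be a hint that county-level figures for that state should be read with caution, while state-level sums remain usable.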

### Finish Preprocessing & Export

Let's move on with preprocessing this data.  We'll drop `iso3` and `Combined_Key` since they're redundant.

In [62]:
jhu = jhu_df.drop(['iso3', 'Combined_Key'], axis=1)
jhu.head()

Unnamed: 0,Admin2,Province_State,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,1/28/20,1/29/20,...,9/19/20,9/20/20,9/21/20,9/22/20,9/23/20,9/24/20,9/25/20,9/26/20,9/27/20,9/28/20
0,Autauga,Alabama,0,0,0,0,0,0,0,0,...,1673,1690,1691,1714,1715,1738,1757,1764,1773,1785
1,Baldwin,Alabama,0,0,0,0,0,0,0,0,...,5047,5061,5087,5124,5141,5165,5456,5477,5526,5588
2,Barbour,Alabama,0,0,0,0,0,0,0,0,...,830,835,838,848,851,857,873,882,885,886
3,Bibb,Alabama,0,0,0,0,0,0,0,0,...,628,632,636,635,638,642,652,654,656,657
4,Blount,Alabama,0,0,0,0,0,0,0,0,...,1542,1551,1560,1573,1580,1594,1608,1611,1617,1618


We'll save swapping the axes for when it's time to start modeling. So let's export this to a csv!

In [63]:
jhu.to_csv('jhu_confirmed_preprocessed.csv')
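
For reference, the state-level consolidation and axis swap that we are deferring could look roughly like the sketch below. It is a minimal illustration under assumptions: the two Alabama counties are copied from the preview above, the Alaska row is hypothetical, and the names `demo`, `date_cols`, and `by_state` do not appear elsewhere in this notebook.

```python
import pandas as pd

# Two Alabama counties copied from the preview above; the Alaska row is hypothetical
demo = pd.DataFrame({
    "Admin2": ["Autauga", "Baldwin", "Anchorage"],
    "Province_State": ["Alabama", "Alabama", "Alaska"],
    "9/27/20": [1773, 5526, 10],
    "9/28/20": [1785, 5588, 12],
})

# Every column other than the two label columns is a date
date_cols = [c for c in demo.columns if c not in ("Admin2", "Province_State")]

# Sum the counties within each state, then transpose so dates become rows
by_state = demo.groupby("Province_State")[date_cols].sum().T

# Turn the m/d/yy column labels into a proper DatetimeIndex
by_state.index = pd.to_datetime(by_state.index, format="%m/%d/%y")
print(by_state)
```

The result has one row per date and one column per state, which is the shape most time series models expect.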

## States_Daily

This dataset contains a lot of great information that will be useful for EDA and modeling, too!

In [2]:
states_daily = pd.read_csv('csv_files/eda_only/states_daily.csv')
states_daily.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11466 entries, 0 to 11465
Data columns (total 54 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   date                         11466 non-null  int64  
 1   state                        11466 non-null  object 
 2   positive                     11367 non-null  float64
 3   negative                     11229 non-null  float64
 4   pending                      1319 non-null   float64
 5   totalTestResults             11455 non-null  float64
 6   hospitalizedCurrently        8647 non-null   float64
 7   hospitalizedCumulative       6480 non-null   float64
 8   inIcuCurrently               4846 non-null   float64
 9   inIcuCumulative              1845 non-null   float64
 10  onVentilatorCurrently        4072 non-null   float64
 11  onVentilatorCumulative       646 non-null    float64
 12  recovered                    7766 non-null   float64
 13  dataQualityGrade

In [4]:
# .copy() avoids a SettingWithCopyWarning when `date` is reassigned below
state_df = states_daily[['date', 'state', 'positiveIncrease', 
                         'negativeIncrease', 'total', 
                         'totalTestResultsIncrease']].copy()
state_df.head()

Unnamed: 0,date,state,positiveIncrease,negativeIncrease,total,totalTestResultsIncrease
0,20200924,AK,0,0,433198,0
1,20200924,AL,1053,6296,1092953,7277
2,20200924,AR,1086,9436,919786,10466
3,20200924,AS,0,0,1571,0
4,20200924,AZ,568,9983,1420417,10551


In [5]:
state_df['date'] = pd.to_datetime(state_df['date'], format='%Y%m%d')
state_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11466 entries, 0 to 11465
Data columns (total 6 columns):
 #   Column                    Non-Null Count  Dtype         
---  ------                    --------------  -----         
 0   date                      11466 non-null  datetime64[ns]
 1   state                     11466 non-null  object        
 2   positiveIncrease          11466 non-null  int64         
 3   negativeIncrease          11466 non-null  int64         
 4   total                     11466 non-null  int64         
 5   totalTestResultsIncrease  11466 non-null  int64         
dtypes: datetime64[ns](1), int64(4), object(1)
memory usage: 537.6+ KB


In [6]:
state_df.to_csv('states_daily_preprocessed.csv')
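
Since this data is in "long" form (one row per date and state), comparing states side by side usually means pivoting it first. Below is a minimal sketch of that reshaping. The 2020-09-24 rows are copied from the preview above, the 2020-09-23 rows are hypothetical, and `demo` and `wide` are made-up names.

```python
import pandas as pd

# Rows for 2020-09-24 copied from the preview above; the 2020-09-23 rows are hypothetical
demo = pd.DataFrame({
    "date": pd.to_datetime(["2020-09-24", "2020-09-24", "2020-09-23", "2020-09-23"]),
    "state": ["AL", "AR", "AL", "AR"],
    "positiveIncrease": [1053, 1086, 900, 800],
})

# One row per date, one column per state, oldest date first
wide = demo.pivot(index="date", columns="state", values="positiveIncrease").sort_index()
print(wide)
```

Each column of `wide` is then a per-state daily series that can be plotted or modeled on its own.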

In [3]:
all_states = pd.read_csv('csv_files/all-states-history.csv')
all_states

Unnamed: 0,date,state,dataQualityGrade,death,deathConfirmed,deathIncrease,deathProbable,hospitalized,hospitalizedCumulative,hospitalizedCurrently,...,totalTestResults,totalTestResultsIncrease,totalTestsAntibody,totalTestsAntigen,totalTestsPeopleAntibody,totalTestsPeopleAntigen,totalTestsPeopleViral,totalTestsPeopleViralIncrease,totalTestsViral,totalTestsViralIncrease
0,2020-10-01,WY,B,53.0,,3,,274.0,274.0,27.0,...,101295.0,135,,,,,100258.0,0,163788.0,2371
1,2020-10-01,NE,A,478.0,,0,,2349.0,2349.0,226.0,...,459542.0,3904,,,,,459845.0,3902,624954.0,6602
2,2020-10-01,ND,B,191.0,188.0,0,3.0,884.0,884.0,106.0,...,619331.0,5844,9578.0,,,,242900.0,1416,642453.0,6079
3,2020-10-01,NC,A+,3579.0,3551.0,47,28.0,,,939.0,...,3063661.0,28790,,2721.0,,,,0,3058541.0,28599
4,2020-10-01,MT,C,181.0,,1,,727.0,727.0,178.0,...,348709.0,5551,,,,,,0,348709.0,5551
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11853,2020-01-24,WA,,,,0,,,,,...,0.0,0,,,,,,0,,0
11854,2020-01-23,WA,,,,0,,,,,...,0.0,0,,,,,,0,,0
11855,2020-01-23,MA,,,,0,,,,,...,2.0,1,,,,,,0,2.0,1
11856,2020-01-22,WA,,,,0,,,,,...,0.0,0,,,,,,0,,0


In [None]:
all_states_trimmed = all_states[['date', 'deathIncrease']]

Time to move on to some EDA!