***This notebook was written mainly by Veronica; the idea was developed collaboratively by Veronica, Yuan and Yoyo for the end-of-semester project of Machine Learning, Fall 2020.***
# This notebook will cover:
- Sources, description and structure of the datasets we used
- The way we analyze, preprocess, clean and merge the datasets, and the rationale behind that

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Dataset 1. Covid Tracking Project Dataset
- this dataset is downloaded from [CovidTrackingProject](https://covidtracking.com)
- we have uploaded this dataset to GitHub along with the code

### 1.1 Dataset Structure

In [4]:
df = pd.read_csv('all-states-history.csv')
print(df.shape)
df.head()

(15633, 42)


Unnamed: 0,date,state,dataQualityGrade,death,deathConfirmed,deathIncrease,deathProbable,hospitalized,hospitalizedCumulative,hospitalizedCurrently,...,totalTestResults,totalTestResultsIncrease,totalTestsAntibody,totalTestsAntigen,totalTestsPeopleAntibody,totalTestsPeopleAntigen,totalTestsPeopleViral,totalTestsPeopleViralIncrease,totalTestsViral,totalTestsViralIncrease
0,2020-12-06,AK,A,143.0,143.0,0,,799.0,799.0,164.0,...,1077776.0,10545,,,,,,0,1077776.0,10545
1,2020-12-06,AL,A,3889.0,3462.0,12,427.0,26331.0,26331.0,1927.0,...,1645041.0,7880,,,74784.0,,1645041.0,7880,,0
2,2020-12-06,AR,A+,2660.0,2437.0,40,223.0,9401.0,9401.0,1076.0,...,1763150.0,14704,,21856.0,,155934.0,,0,1763150.0,14704
3,2020-12-06,AS,D,0.0,,0,,,,,...,2140.0,0,,,,,,0,2140.0,0
4,2020-12-06,AZ,A+,6950.0,6431.0,25,519.0,28248.0,28248.0,2977.0,...,2370499.0,20586,370928.0,,,,2370499.0,20586,,0


In [5]:
df.drop('dataQualityGrade', axis=1, inplace=True)  ## irrelevant column

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15633 entries, 0 to 15632
Data columns (total 41 columns):
date                                15633 non-null object
state                               15633 non-null object
death                               14807 non-null float64
deathConfirmed                      6850 non-null float64
deathIncrease                       15633 non-null int64
deathProbable                       5104 non-null float64
hospitalized                        9434 non-null float64
hospitalizedCumulative              9434 non-null float64
hospitalizedCurrently               12516 non-null float64
hospitalizedIncrease                15633 non-null int64
inIcuCumulative                     2700 non-null float64
inIcuCurrently                      7713 non-null float64
negative                            15323 non-null float64
negativeIncrease                    15633 non-null int64
negativeTestsAntibody               995 non-null float64
negativeTestsPeopleAnt

In [5]:
# count the null values in each column and sort the counts in ascending order
isnull_sum = df.isnull().sum()
isnull_sum.sort_values()

date                                    0
totalTestsPeopleViralIncrease           0
totalTestResultsIncrease                0
totalTestEncountersViralIncrease        0
positiveScore                           0
positiveIncrease                        0
negativeIncrease                        0
hospitalizedIncrease                    0
totalTestsViralIncrease                 0
state                                   0
deathIncrease                           0
totalTestResults                       35
positive                              152
negative                              310
death                                 826
hospitalizedCurrently                3117
positiveCasesViral                   3516
recovered                            4522
totalTestsViral                      5812
hospitalizedCumulative               6199
hospitalized                         6199
inIcuCurrently                       7920
deathConfirmed                       8783
totalTestsPeopleViral             

#### Conclusion:
- columns named **XXXIncrease** should be dropped (they have no empty entries, but they contain a lot of zero values)
- **totalTestResults, positive, negative, hospitalizedCurrently, positiveCasesViral, recovered** are acceptable to use
- For features representing the hospital capacity, such as ICU bed usage and ventilator usage, we need to find other datasets with fewer nulls
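The claim about zeros in the **XXXIncrease** columns can be checked directly. The sketch below is only an illustration on a tiny made-up frame (`toy` and its values are assumptions, not rows from `all-states-history.csv`); it reports the share of nulls and the share of zeros for every column whose name ends in `Increase`.

```python
import pandas as pd

# Toy stand-in for the Covid Tracking Project frame (values are made up).
toy = pd.DataFrame({
    "death": [10.0, 12.0, None, 15.0],
    "deathIncrease": [0, 2, 0, 0],
    "positiveIncrease": [5, 0, 0, 7],
})

# Pick every column whose name ends in "Increase".
increase_cols = [c for c in toy.columns if c.endswith("Increase")]

# Share of nulls and share of zeros per selected column.
null_share = toy[increase_cols].isnull().mean()
zero_share = (toy[increase_cols] == 0).mean()

summary = pd.DataFrame({"null_share": null_share, "zero_share": zero_share})
print(summary)
```

On the real data the same two lines would be applied to `df` instead of `toy`.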

In [7]:
## calculate the number of null values on each date
null_by_date = df.groupby('date').apply(lambda x: x.isnull().sum())
print(null_by_date['death'].shape)  ## Since our y-variable is death, we should first make sure this column doesn't contain any null
print(null_by_date['death'][:1], null_by_date['death'][-1:])

(320,)
date
2020-01-22    2
Name: death, dtype: int64 date
2020-12-06    0
Name: death, dtype: int64


#### Conclusion:
- this dataset covers 320 distinct dates from 2020-01-22 to 2020-12-06; the `death` column still has nulls on the earliest dates (2 on 2020-01-22) and none on the latest date
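A quick way to confirm the coverage is to compare the number of distinct dates with the number of calendar days in the span. The following is a minimal sketch on a made-up frame (`toy` is an assumption for illustration only); a mismatch between the two counts would indicate missing days.

```python
import pandas as pd

# Toy frame with the same 'date' and 'state' column names as the dataset.
toy = pd.DataFrame({
    "date": ["2020-01-22", "2020-01-23", "2020-01-23", "2020-01-24"],
    "state": ["WA", "WA", "MA", "WA"],
})

dates = pd.to_datetime(toy["date"], format="%Y-%m-%d")
first_day, last_day = dates.min(), dates.max()

# Distinct dates present vs. calendar days between first and last date.
n_days = dates.nunique()
span_days = (last_day - first_day).days + 1
has_gaps = n_days != span_days

# Rows per state show whether every state starts reporting on the same day.
rows_per_state = toy.groupby("state").size()
print(first_day.date(), last_day.date(), n_days, has_gaps)
print(rows_per_state)
```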

### 1.2 Preprocessing specific columns

In [12]:
## transform to 'datetime' format data
datetime = pd.to_datetime(df.iloc[:, 0], format='%Y-%m-%d')
df.insert(0, 'datetime', datetime)
df.drop(['date'], axis=1, inplace=True)

In [13]:
print(df.shape)
df.head()

(15633, 41)


Unnamed: 0,datetime,state,death,deathConfirmed,deathIncrease,deathProbable,hospitalized,hospitalizedCumulative,hospitalizedCurrently,hospitalizedIncrease,...,totalTestResults,totalTestResultsIncrease,totalTestsAntibody,totalTestsAntigen,totalTestsPeopleAntibody,totalTestsPeopleAntigen,totalTestsPeopleViral,totalTestsPeopleViralIncrease,totalTestsViral,totalTestsViralIncrease
0,2020-12-06,AK,143.0,143.0,0,,799.0,799.0,164.0,0,...,1077776.0,10545,,,,,,0,1077776.0,10545
1,2020-12-06,AL,3889.0,3462.0,12,427.0,26331.0,26331.0,1927.0,0,...,1645041.0,7880,,,74784.0,,1645041.0,7880,,0
2,2020-12-06,AR,2660.0,2437.0,40,223.0,9401.0,9401.0,1076.0,21,...,1763150.0,14704,,21856.0,,155934.0,,0,1763150.0,14704
3,2020-12-06,AS,0.0,,0,,,,,0,...,2140.0,0,,,,,,0,2140.0,0
4,2020-12-06,AZ,6950.0,6431.0,25,519.0,28248.0,28248.0,2977.0,242,...,2370499.0,20586,370928.0,,,,2370499.0,20586,,0


In [14]:
us_state_abbrev = {
        'AK': 'Alaska',
        'AL': 'Alabama',
        'AR': 'Arkansas',
        'AS': 'American Samoa',  ##
        'AZ': 'Arizona',
        'CA': 'California',
        'CO': 'Colorado',
        'CT': 'Connecticut',
        'DC': 'District of Columbia',
        'DE': 'Delaware',
        'FL': 'Florida',
        'GA': 'Georgia',
        'GU': 'Guam',  ##
        'HI': 'Hawaii',
        'IA': 'Iowa',
        'ID': 'Idaho',
        'IL': 'Illinois',
        'IN': 'Indiana',
        'KS': 'Kansas',
        'KY': 'Kentucky',
        'LA': 'Louisiana',
        'MA': 'Massachusetts',
        'MD': 'Maryland',
        'ME': 'Maine',
        'MI': 'Michigan',
        'MN': 'Minnesota',
        'MO': 'Missouri',
        'MP': 'Northern Mariana Islands',  ##
        'MS': 'Mississippi',
        'MT': 'Montana',
        'NC': 'North Carolina',
        'ND': 'North Dakota',
        'NE': 'Nebraska',
        'NH': 'New Hampshire',
        'NJ': 'New Jersey',
        'NM': 'New Mexico',
        'NV': 'Nevada',
        'NY': 'New York',
        'OH': 'Ohio',
        'OK': 'Oklahoma',
        'OR': 'Oregon',
        'PA': 'Pennsylvania',
        'PR': 'Puerto Rico',  ##
        'RI': 'Rhode Island',
        'SC': 'South Carolina',
        'SD': 'South Dakota',
        'TN': 'Tennessee',
        'TX': 'Texas',
        'UT': 'Utah',
        'VA': 'Virginia',
        'VI': 'Virgin Islands',  ##
        'VT': 'Vermont',
        'WA': 'Washington',
        'WI': 'Wisconsin',
        'WV': 'West Virginia',
        'WY': 'Wyoming'
}

In [15]:
## transform state name abbreviation to full name
location_name = []
for i in df.loc[:, 'state']:
    location_name.append(us_state_abbrev[i])
df.insert(1, 'location_name', location_name)
df.drop(['state'], axis=1, inplace=True)
print(df.shape)
df.head()

(15633, 41)


Unnamed: 0,datetime,location_name,death,deathConfirmed,deathIncrease,deathProbable,hospitalized,hospitalizedCumulative,hospitalizedCurrently,hospitalizedIncrease,...,totalTestResults,totalTestResultsIncrease,totalTestsAntibody,totalTestsAntigen,totalTestsPeopleAntibody,totalTestsPeopleAntigen,totalTestsPeopleViral,totalTestsPeopleViralIncrease,totalTestsViral,totalTestsViralIncrease
0,2020-12-06,Alaska,143.0,143.0,0,,799.0,799.0,164.0,0,...,1077776.0,10545,,,,,,0,1077776.0,10545
1,2020-12-06,Alabama,3889.0,3462.0,12,427.0,26331.0,26331.0,1927.0,0,...,1645041.0,7880,,,74784.0,,1645041.0,7880,,0
2,2020-12-06,Arkansas,2660.0,2437.0,40,223.0,9401.0,9401.0,1076.0,21,...,1763150.0,14704,,21856.0,,155934.0,,0,1763150.0,14704
3,2020-12-06,American Samoa,0.0,,0,,,,,0,...,2140.0,0,,,,,,0,2140.0,0
4,2020-12-06,Arizona,6950.0,6431.0,25,519.0,28248.0,28248.0,2977.0,242,...,2370499.0,20586,370928.0,,,,2370499.0,20586,,0


In [16]:
## select columns with the fewest number of null values
## note: inIcuCurrently and onVentilatorCurrently still have a lot of nulls, but we keep them for now in case another dataset can fill in the empty entries
features = ['datetime',
 'location_name',
 'death',
 'hospitalizedCurrently',
 'inIcuCurrently',
 'negative',
 'onVentilatorCurrently',
 'positive',
 'recovered',
 'totalTestResults']
df_clean = df.loc[:, features]
print(df_clean.shape)
df_clean.head()

(15633, 10)


Unnamed: 0,datetime,location_name,death,hospitalizedCurrently,inIcuCurrently,negative,onVentilatorCurrently,positive,recovered,totalTestResults
0,2020-12-06,Alaska,143.0,164.0,,1042056.0,21.0,35720.0,7165.0,1077776.0
1,2020-12-06,Alabama,3889.0,1927.0,,1421126.0,,269877.0,168387.0,1645041.0
2,2020-12-06,Arkansas,2660.0,1076.0,374.0,1614979.0,179.0,170924.0,149490.0,1763150.0
3,2020-12-06,American Samoa,0.0,,,2140.0,,0.0,,2140.0
4,2020-12-06,Arizona,6950.0,2977.0,714.0,2018813.0,462.0,364276.0,56382.0,2370499.0


# Dataset 2. Hospitalization data
- this dataset is downloaded from [HealthData](https://covid19.healthdata.org/united-states-of-america?view=total-deaths&tab=trend)
- we cannot upload this dataset to GitHub because the file exceeds the size GitHub allows. You may download the dataset through this [Google link](https://drive.google.com/file/d/1CV4TTALHU3EUFfyGHosF6_L6Cbv4hFjy/view?usp=sharing)

### 2.1 Dataset structure

In [17]:
df_hospital = pd.read_csv('hospitalization_all_locs.csv')
print(df_hospital.shape)
df_hospital.head()

(152488, 73)


Unnamed: 0,location_id,date,V1,location_name,allbed_mean,allbed_lower,allbed_upper,ICUbed_mean,ICUbed_lower,ICUbed_upper,...,est_infections_mean_p100k_rate,est_infections_lower_p100k_rate,est_infections_upper_p100k_rate,inf_cuml_mean,inf_cuml_upper,inf_cuml_lower,seroprev_mean,seroprev_upper,seroprev_lower,seroprev_data_type
0,1,2020/2/4,48609,Global,14282.96574,14282.96574,14282.96574,5827.528414,5827.528414,5827.528414,...,0.257229,0.216345,0.307828,19902.99479,23818.10353,16739.59124,3.3e-05,4.4e-05,2.4e-05,projected
1,1,2020/2/5,48610,Global,15571.17255,15571.17255,15571.17255,6217.948134,6217.948134,6217.948134,...,0.248723,0.209659,0.296248,39147.82713,46740.16291,32961.88382,3.7e-05,4.7e-05,2.8e-05,projected
2,1,2020/2/6,48611,Global,16762.15309,16762.15309,16762.15309,6559.103608,6559.103608,6559.103608,...,0.240485,0.203118,0.284973,57755.26436,68789.85954,48678.06379,4e-05,5.1e-05,3.1e-05,projected
3,1,2020/2/7,48612,Global,17837.85508,17837.85508,17837.85508,6845.497859,6845.497859,6845.497859,...,0.232862,0.196742,0.274455,75772.87967,90025.7151,63900.89907,4.3e-05,5.4e-05,3.4e-05,projected
4,1,2020/2/8,48613,Global,18776.7554,18776.7554,18776.7554,7071.639802,7071.639802,7071.639802,...,0.225968,0.192549,0.266491,93257.07859,110645.3853,78799.28392,4.6e-05,5.7e-05,3.7e-05,projected


In [18]:
df_hospital.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 152488 entries, 0 to 152487
Data columns (total 73 columns):
location_id                         152488 non-null int64
date                                152488 non-null object
V1                                  152488 non-null int64
location_name                       152488 non-null object
allbed_mean                         152488 non-null float64
allbed_lower                        152488 non-null float64
allbed_upper                        152488 non-null float64
ICUbed_mean                         152488 non-null float64
ICUbed_lower                        152488 non-null float64
ICUbed_upper                        152488 non-null float64
InvVen_mean                         152488 non-null float64
InvVen_lower                        152488 non-null float64
InvVen_upper                        152488 non-null float64
admis_mean                          152488 non-null float64
admis_lower                         152488 non-null flo

In [19]:
### The dataset includes regions worldwide, so we first select the U.S. states and territories
us_states = list(us_state_abbrev.values())
frames = [df_hospital[df_hospital['location_name'] == us_states[0]]]
for name in us_states[1:]:
    data = df_hospital[df_hospital['location_name'] == name]
    print(name, data.shape[0])  ## number of rows per region, to check they all cover the same days
    frames.append(data)
## DataFrame.append was removed in pandas 2.0, so collect the pieces and concatenate once
df_us_hospital = pd.concat(frames)
print(df_us_hospital.shape)

Alabama 392
Arkansas 392
American Samoa 0
Arizona 392
California 392
Colorado 392
Connecticut 392
District of Columbia 392
Delaware 392
Florida 392
Georgia 392
Guam 392
Hawaii 392
Iowa 392
Idaho 392
Illinois 392
Indiana 392
Kansas 392
Kentucky 392
Louisiana 392
Massachusetts 392
Maryland 392
Maine 392
Michigan 392
Minnesota 392
Missouri 392
Northern Mariana Islands 0
Mississippi 392
Montana 392
North Carolina 392
North Dakota 392
Nebraska 392
New Hampshire 392
New Jersey 392
New Mexico 392
Nevada 392
New York 392
Ohio 392
Oklahoma 392
Oregon 392
Pennsylvania 392
Puerto Rico 392
Rhode Island 392
South Carolina 392
South Dakota 392
Tennessee 392
Texas 392
Utah 392
Virginia 392
Virgin Islands 0
Vermont 392
Washington 392
Wisconsin 392
West Virginia 392
Wyoming 392
(20776, 73)


#### Conclusion:
- this dataset doesn't have information for **American Samoa, Northern Mariana Islands, and Virgin Islands**, which are U.S. overseas territories rather than part of the 50 states
- each state covers 392 days
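The same check can be written without a loop. This is a sketch under assumptions: `wanted` and `toy_hospital` are small made-up stand-ins for the list of region names and the hospitalization frame. It filters with `isin`, counts rows per region, and lists the requested names that never appear.

```python
import pandas as pd

# Hypothetical list of region names we want to keep.
wanted = ["Alaska", "Guam", "American Samoa"]

# Toy stand-in for the worldwide hospitalization frame (values are made up).
toy_hospital = pd.DataFrame({
    "location_name": ["Global", "Alaska", "Alaska", "Guam"],
    "allbed_mean": [1.0, 0.0, 2.0, 0.5],
})

# Keep only the wanted regions in one step.
subset = toy_hospital[toy_hospital["location_name"].isin(wanted)]

# Rows per region: equal counts suggest every region covers the same days.
rows_per_region = subset.groupby("location_name").size()

# Names that were requested but are absent from the data.
missing = sorted(set(wanted) - set(subset["location_name"]))
print(rows_per_region)
print(missing)
```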

### 2.2 Preprocess certain columns

In [20]:
datetime = pd.to_datetime(df_us_hospital['date'], format='%Y/%m/%d')  ## raw dates look like 2020/2/4
df_us_hospital.insert(2, 'datetime', datetime)
df_us_hospital.drop(['date'], axis=1, inplace=True)
df_us_hospital.head()

Unnamed: 0,location_id,datetime,V1,location_name,allbed_mean,allbed_lower,allbed_upper,ICUbed_mean,ICUbed_lower,ICUbed_upper,...,est_infections_mean_p100k_rate,est_infections_lower_p100k_rate,est_infections_upper_p100k_rate,inf_cuml_mean,inf_cuml_upper,inf_cuml_lower,seroprev_mean,seroprev_upper,seroprev_lower,seroprev_data_type
70168,524,2020-02-04,2353,Alaska,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,projected
70169,524,2020-02-05,2354,Alaska,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,projected
70170,524,2020-02-06,2355,Alaska,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,projected
70171,524,2020-02-07,2356,Alaska,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,projected
70172,524,2020-02-08,2357,Alaska,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,projected


In [21]:
groups = df_us_hospital.groupby('datetime')
dates = []
for name, group in groups:
    dates.append(name)
dates.sort()
print(dates[:1], dates[-1:])

[Timestamp('2020-02-04 00:00:00')] [Timestamp('2021-03-01 00:00:00')]


- this dataset covers data from 2020-02-04 to 2021-03-01 for every state in U.S. **(but data starting from the access date are projected, not observed)**

In [22]:
## This dataset was downloaded on 11/10, so rows dated 11/10 or later are projections; keep observed rows only
df_hospital_clean = df_us_hospital[df_us_hospital['datetime']<'2020-11-10']
df_hospital_clean.shape

(14840, 73)

In [25]:
null_by_state_hospital = df_hospital_clean.groupby('location_name').apply(lambda x: x.isnull().sum())
null_by_state_hospital

location_name,location_id,datetime,V1,location_name,allbed_mean,allbed_lower,allbed_upper,ICUbed_mean,ICUbed_lower,ICUbed_upper,...,est_infections_mean_p100k_rate,est_infections_lower_p100k_rate,est_infections_upper_p100k_rate,inf_cuml_mean,inf_cuml_upper,inf_cuml_lower,seroprev_mean,seroprev_upper,seroprev_lower,seroprev_data_type
Alabama,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Alaska,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Arizona,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Arkansas,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
California,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Colorado,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Connecticut,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Delaware,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
District of Columbia,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Florida,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [23]:
## because this dataset replaces most null values with zeros, count the zero values of each feature for each state
zero_by_state_hospital = df_hospital_clean.groupby('location_name').apply(lambda x: (x==0).sum())
zero_by_state_hospital

location_name,location_id,datetime,V1,location_name,allbed_mean,allbed_lower,allbed_upper,ICUbed_mean,ICUbed_lower,ICUbed_upper,...,est_infections_mean_p100k_rate,est_infections_lower_p100k_rate,est_infections_upper_p100k_rate,inf_cuml_mean,inf_cuml_upper,inf_cuml_lower,seroprev_mean,seroprev_upper,seroprev_lower,seroprev_data_type
Alabama,0,0,0,0,33,33,33,33,33,33,...,18,22,18,18,18,22,30,30,34,0
Alaska,0,0,0,0,26,26,26,26,26,26,...,11,15,11,11,11,15,23,23,27,0
Arizona,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Arkansas,0,0,0,0,30,30,30,30,30,30,...,15,19,15,15,15,19,27,27,31,0
California,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Colorado,0,0,0,0,26,26,26,26,26,26,...,11,15,11,11,11,15,23,23,27,0
Connecticut,0,0,0,0,27,27,27,27,27,27,...,12,16,12,12,12,16,24,24,28,0
Delaware,0,0,0,0,52,52,52,52,52,52,...,37,41,37,37,37,41,49,49,53,0
District of Columbia,0,0,0,0,28,28,28,28,28,28,...,13,17,13,13,13,17,25,25,29,0
Florida,0,0,0,0,22,22,22,22,22,22,...,7,11,7,7,7,11,19,19,23,0


#### Conclusion: we want to select useful columns based on the following criteria:
- The column should contain a relatively small number of zeros
- The column should be interpretable. Columns such as **inf_cuml** are not explained clearly on the source website of this dataset, hence we drop ambiguous columns that may lack reliability or relevance
- Some features are stored as three separate columns (mean, lower bound and upper bound), e.g. **ICUbed** and **InvVen**. For such features we keep only the **mean** column.
- We drop columns that are already covered by the first dataset, e.g. death, total test numbers, etc.
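The first and third criteria can be expressed in code. The sketch below is illustrative only: the frame `toy` is made up, and `MAX_ZERO_SHARE` is a hypothetical threshold that the notebook does not define (the actual feature list was chosen by hand). It keeps the `_mean` columns whose share of zeros stays under the threshold.

```python
import pandas as pd

# Toy stand-in for the hospitalization frame (values are made up).
toy = pd.DataFrame({
    "ICUbed_mean": [0.0, 3.0, 4.0, 5.0],
    "ICUbed_lower": [0.0, 2.0, 3.0, 4.0],
    "ICUbed_upper": [0.0, 4.0, 5.0, 6.0],
    "seroprev_mean": [0.0, 0.0, 0.0, 0.1],
})

MAX_ZERO_SHARE = 0.5  # hypothetical cut-off, not taken from the notebook

# Share of zeros in each column.
zero_share = (toy == 0).mean()

# Keep one column per feature (the mean), then filter by the zero share.
mean_cols = [c for c in toy.columns if c.endswith("_mean")]
kept = [c for c in mean_cols if zero_share[c] <= MAX_ZERO_SHARE]
print(kept)
```

Interpretability and overlap with the first dataset still need a manual review, so a filter like this can only narrow the candidates.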

In [53]:
features = ['datetime', 'location_name', 'allbed_mean', 'ICUbed_mean', 'InvVen_mean', 'admis_mean', 'newICU_mean', 'mobility_composite', 
            'confirmed_infections', 'total_pop']
df_hospital_clean = df_hospital_clean.loc[:, features]
print(df_hospital_clean.shape)
df_hospital_clean.head()

(11872, 10)


Unnamed: 0,datetime,location_name,allbed_mean,ICUbed_mean,InvVen_mean,admis_mean,newICU_mean,mobility_composite,confirmed_infections,total_pop
70224,2020-03-31,Alaska,136.761231,50.402575,35.366337,2.032854,0.741063,-51.364092,5.0,788027.5317
70225,2020-04-01,Alaska,144.557148,53.047031,37.174038,1.961619,0.729537,-52.020827,14.0,788027.5317
70226,2020-04-02,Alaska,150.801533,54.888955,38.411008,1.855147,0.704542,-52.518709,10.0,788027.5317
70227,2020-04-03,Alaska,155.270379,56.304417,39.342377,1.719822,0.667501,-52.871371,14.0,788027.5317
70228,2020-04-04,Alaska,157.813616,57.038762,39.792126,1.563691,0.620584,-53.09101,14.0,788027.5317


# Merge Dataset 1 & 2
#### Until now, we have:
- df_clean (the first): ranging from 1/22 to 12/6; 56 regions (50 states, DC and 5 territories); 10 columns
- df_hospital_clean (the second): observed data ranging from 2/4 to 11/9; 53 regions; 10 columns
- According to previous observations, rows from the early months have more null values. Hence we chose to use data starting from **3/31**
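Before merging, it can help to see which (date, region) keys exist in only one of the two frames. This is a minimal sketch on made-up frames (`left` and `right` are assumptions standing in for `df_clean` and `df_hospital_clean`); it uses an outer merge with `indicator=True`, which adds a `_merge` column marking each row as `both`, `left_only` or `right_only`.

```python
import pandas as pd

# Toy stand-ins for the two cleaned frames (values are made up).
left = pd.DataFrame({
    "datetime": pd.to_datetime(["2020-03-31", "2020-03-31", "2020-04-01"]),
    "location_name": ["Alaska", "American Samoa", "Alaska"],
    "death": [3.0, 0.0, 3.0],
})
right = pd.DataFrame({
    "datetime": pd.to_datetime(["2020-03-31", "2020-04-01"]),
    "location_name": ["Alaska", "Alaska"],
    "ICUbed_mean": [50.4, 53.0],
})

# Outer merge keeps every key; '_merge' records where each key came from.
check = pd.merge(left, right, on=["datetime", "location_name"],
                 how="outer", indicator=True)

# Keys that would be lost by the default inner merge.
unmatched = check[check["_merge"] != "both"]
print(unmatched[["datetime", "location_name", "_merge"]])
```

If `unmatched` is empty after the filtering steps, the inner merge keeps every row of both frames.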

In [54]:
df_clean  = df_clean[df_clean['datetime']>'2020-03-30']
df_clean  = df_clean[df_clean['datetime']<'2020-11-10']
df_hospital_clean = df_hospital_clean[df_hospital_clean['datetime']>'2020-03-30']

In [55]:
## df_hospital_clean doesn't have data for American Samoa, Northern Mariana Islands, Virgin Islands
## Hence we dropped data for the three regions in df_clean
df_clean = df_clean.drop(df_clean[df_clean['location_name']=='American Samoa'].index, axis=0)
df_clean = df_clean.drop(df_clean[df_clean['location_name']=='Northern Mariana Islands'].index, axis=0)
df_clean = df_clean.drop(df_clean[df_clean['location_name']=='Virgin Islands'].index, axis=0)

In [56]:
print(df_clean.shape)
print(df_hospital_clean.shape)

(11872, 10)
(11872, 10)


In [57]:
df_merge = pd.merge(df_clean, df_hospital_clean, on=['datetime', 'location_name'])
print(df_merge.shape)

(11872, 18)


In [58]:
## since the second dataset has complete columns for hospital capacity, we drop the similar columns that came from the first dataset
df_merge.drop(['hospitalizedCurrently', 'inIcuCurrently', 'onVentilatorCurrently', 'recovered'], axis=1, inplace=True)
df_merge

Unnamed: 0,datetime,location_name,death,negative,positive,totalTestResults,allbed_mean,ICUbed_mean,InvVen_mean,admis_mean,newICU_mean,mobility_composite,confirmed_infections,total_pop
0,2020-11-09,Alaska,84.0,748801.0,19196.0,767997.0,108.298152,29.167966,12.430271,5.062989,1.847292,-18.474221,,7.880275e+05
1,2020-11-09,Alabama,3084.0,1232975.0,204857.0,1406829.0,1148.634788,309.362073,131.838276,46.678702,23.462433,-8.458567,1170.0,4.977688e+06
2,2020-11-09,Arkansas,2108.0,1324655.0,122811.0,1436416.0,690.777323,186.047216,79.286203,96.597606,38.754578,-13.490509,945.0,3.057349e+06
3,2020-11-09,Arizona,6164.0,1630206.0,259699.0,1883207.0,1600.698804,431.116579,183.725386,96.360769,42.056134,-24.696158,435.0,7.249680e+06
4,2020-11-09,California,17977.0,18946628.0,971851.0,19918479.0,3335.268826,828.269426,352.976730,131.275756,55.830378,-34.427708,8584.0,3.987203e+07
5,2020-11-09,Colorado,2179.0,1224658.0,134537.0,2274428.0,1142.510694,307.712669,131.135363,44.398528,22.531914,-28.697947,3553.0,5.401063e+06
6,2020-11-09,Connecticut,4698.0,2498115.0,81463.0,2579578.0,438.527188,118.108629,50.333377,21.428839,12.393794,-27.360530,3307.0,3.693747e+06
7,2020-11-09,District of Columbia,655.0,540727.0,18087.0,558814.0,97.145927,26.164335,11.150238,4.449856,1.599537,-40.327409,86.0,6.502245e+05
8,2020-11-09,Delaware,719.0,340672.0,26908.0,593799.0,113.763486,27.337703,11.650283,4.380501,2.203550,-27.260176,,9.750952e+05
9,2020-11-09,Florida,17391.0,5573760.0,836370.0,10600474.0,2658.798631,716.094850,305.172219,179.846546,77.915998,-23.842662,3924.0,2.117489e+07


In [59]:
df_merge.to_csv('covid_us_merge.csv', index=False)

# Dataset 3. Mobility data
 - this dataset was downloaded from [a GitHub dataset](https://github.com/GeoDS/COVID19USFlows) and merged according to its instructions
 - we chose to download data from 2020-03-27 to 2020-11-02
 - we have uploaded the merged dataset to GitHub along with the code

### 3.1 Dataset Structure & Preprocessing

In [34]:
df_mobility = pd.read_csv('daily_state2state.csv')
print(df_mobility.shape)
df_mobility.head()

(589813, 9)


Unnamed: 0,geoid_o,geoid_d,lng_o,lat_o,lng_d,lat_d,date_range,visitor_flows,pop_flows
0,1,1,-86.844521,32.75688,-86.844521,32.75688,2020/3/27,806784,9782825
1,1,2,-86.844521,32.75688,-151.250549,63.788469,2020/3/27,19,230
2,1,4,-86.844521,32.75688,-111.66446,34.293095,2020/3/27,199,2413
3,1,5,-86.844521,32.75688,-92.439237,34.899772,2020/3/27,470,5699
4,1,6,-86.844521,32.75688,-119.663846,37.215308,2020/3/27,479,5808


#### Conclusion:
- The dataset uses **longitude and latitude** to represent different states (but they are defined by the dataset creator rather than being standard reference coordinates)
- For each date, one row records the **visitor_flows and pop_flows** moving from an origin (`geoid_o`) to a destination (`geoid_d`)
- We want to calculate the total visitor flows and total population flows for every state on each date
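To make the aggregation concrete, here is a small sketch on made-up flows (`toy` and its numbers are assumptions; only the column names come from the dataset). Summing by origin gives the outflow, summing by destination gives the inflow. Note that rows with `geoid_o == geoid_d` (flows within a state) count toward both totals; the last two lines show how cross-state flows alone could be isolated if that were wanted.

```python
import pandas as pd

# Toy origin-destination table for one date (values are made up).
toy = pd.DataFrame({
    "date_range": ["2020/3/27"] * 4,
    "geoid_o": [1, 1, 2, 2],
    "geoid_d": [1, 2, 1, 2],
    "visitor_flows": [100, 5, 7, 50],
})

# Total flows leaving each origin and arriving at each destination, per date.
outflow = toy.groupby(["date_range", "geoid_o"])["visitor_flows"].sum()
inflow = toy.groupby(["date_range", "geoid_d"])["visitor_flows"].sum()

# Optional variant: exclude within-state rows to get cross-state flows only.
cross = toy[toy["geoid_o"] != toy["geoid_d"]]
cross_outflow = cross.groupby(["date_range", "geoid_o"])["visitor_flows"].sum()

print(outflow)
print(inflow)
print(cross_outflow)
```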

In [35]:
datetime = pd.to_datetime(df_mobility['date_range'], format='%Y/%m/%d')  ## raw dates look like 2020/3/27
df_mobility.insert(0, 'datetime', datetime)
df_mobility.drop(['date_range'], axis=1, inplace=True)
df_mobility.head()

Unnamed: 0,datetime,geoid_o,geoid_d,lng_o,lat_o,lng_d,lat_d,visitor_flows,pop_flows
0,2020-03-27,1,1,-86.844521,32.75688,-86.844521,32.75688,806784,9782825
1,2020-03-27,1,2,-86.844521,32.75688,-151.250549,63.788469,19,230
2,2020-03-27,1,4,-86.844521,32.75688,-111.66446,34.293095,199,2413
3,2020-03-27,1,5,-86.844521,32.75688,-92.439237,34.899772,470,5699
4,2020-03-27,1,6,-86.844521,32.75688,-119.663846,37.215308,479,5808


In [36]:
df_mobility[df_mobility['geoid_o'] == 1]['geoid_d']

0          1
1          2
2          4
3          5
4          6
5          8
6          9
7         10
8         11
9         12
10        13
11        15
12        16
13        17
14        18
15        19
16        20
17        21
18        22
19        23
20        24
21        25
22        26
23        27
24        28
25        29
26        30
27        31
28        32
29        33
          ..
587180    25
587181    26
587182    27
587183    28
587184    29
587185    30
587186    31
587187    32
587188    33
587189    34
587190    35
587191    36
587192    37
587193    38
587194    39
587195    40
587196    41
587197    42
587198    45
587199    46
587200    47
587201    48
587202    49
587203    50
587204    51
587205    53
587206    54
587207    55
587208    56
587209    72
Name: geoid_d, Length: 11491, dtype: int64

- We noticed that the region with geoid == 72 has a problematic latitude and longitude that cannot be matched to any concrete region on Google Maps, hence we dropped rows whose origin or destination geoid is 72

In [37]:
df_mobility.drop(df_mobility[df_mobility['geoid_o']==72].index, axis=0, inplace=True)
df_mobility.drop(df_mobility[df_mobility['geoid_d']==72].index, axis=0, inplace=True)

### 3.2 Aggregate visitor flows and pop flows of each state on each date

In [38]:
result = df_mobility.groupby(['datetime', 'geoid_o']).agg({'visitor_flows': 'sum', 'pop_flows': 'sum'})
df_outflow = result.reset_index()
df_outflow.rename(columns={'geoid_o': 'geoid', 'visitor_flows': 'visitor_outflow', 'pop_flows':'pop_outflow'}, inplace=True)
print(df_outflow.shape)
df_outflow.head()

(11271, 4)


Unnamed: 0,datetime,geoid,visitor_outflow,pop_outflow
0,2020-03-27,1,851538,10325474
1,2020-03-27,2,40167,1081863
2,2020-03-27,4,603781,11941596
3,2020-03-27,5,446011,6067115
4,2020-03-27,6,2377377,65100117


In [39]:
result2 = df_mobility.groupby(['datetime', 'geoid_d']).agg({'visitor_flows': 'sum', 'pop_flows': 'sum'})
df_inflow = result2.reset_index()
df_inflow.rename(columns={'geoid_d': 'geoid', 'visitor_flows': 'visitor_inflow', 'pop_flows':'pop_inflow'}, inplace=True)
print(df_inflow.shape)
df_inflow.head()

(11271, 4)


Unnamed: 0,datetime,geoid,visitor_inflow,pop_inflow
0,2020-03-27,1,843234,10367598
1,2020-03-27,2,40642,1082693
2,2020-03-27,4,602943,11967096
3,2020-03-27,5,445254,6113562
4,2020-03-27,6,2365173,64430714


In [40]:
df_mob_merge = pd.merge(df_outflow, df_inflow, on=['datetime', 'geoid'])
df_mob_merge.head()

Unnamed: 0,datetime,geoid,visitor_outflow,pop_outflow,visitor_inflow,pop_inflow
0,2020-03-27,1,851538,10325474,843234,10367598
1,2020-03-27,2,40167,1081863,40642,1082693
2,2020-03-27,4,603781,11941596,602943,11967096
3,2020-03-27,5,446011,6067115,445254,6113562
4,2020-03-27,6,2377377,65100117,2365173,64430714


### 3.3 Match state names to latitude & longitude
- As mentioned before, the latitudes and longitudes used in this dataset are defined by the dataset creator
- Since the dataset provides no key from these coordinates to state names, we searched the latitudes and longitudes on Google Maps to locate each state
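Before looking coordinates up by hand, it is worth checking whether each `geoid_o` maps to exactly one coordinate pair. The sketch below uses a made-up frame whose coordinates are rounded versions of values printed in this notebook; `toy` itself is an assumption for illustration.

```python
import pandas as pd

# Toy frame: geoid 1 has one coordinate pair, geoid 2 has two.
toy = pd.DataFrame({
    "geoid_o": [1, 1, 2, 2],
    "lat_o": [32.75688, 32.75688, 63.742989, 63.788469],
    "lng_o": [-86.844521, -86.844521, -151.593422, -151.250549],
})

# Distinct (geoid, latitude, longitude) combinations.
pairs = toy[["geoid_o", "lat_o", "lng_o"]].drop_duplicates()

# Geoids that appear with more than one coordinate pair need a closer look.
pairs_per_geoid = pairs.groupby("geoid_o").size()
ambiguous = pairs_per_geoid[pairs_per_geoid > 1].index.tolist()
print(ambiguous)
```

A geoid listed in `ambiguous` still identifies one region, so the lookup can rely on the geoid rather than on the exact coordinates.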

In [41]:
df_mobility.groupby(['geoid_o', 'lat_o', 'lng_o']).groups.keys()

dict_keys([(1, 32.75687994, -86.844521), (2, 63.74298902, -151.5934219), (2, 63.78846948, -151.25054880000002), (4, 34.29309519, -111.66446029999999), (5, 34.89977242, -92.43923686), (6, 37.21530826, -119.6638459), (8, 38.9985316, -105.5478211), (9, 41.57516415, -72.73825768), (10, 38.99497529, -75.45249263), (11, 38.90477389, -77.01629090000002), (12, 28.47705841, -82.46641839), (13, 32.63861711, -83.42714021), (15, 20.9951112, -158.1099738), (16, 44.38905509, -114.65941399999998), (17, 40.12420083, -89.14863899), (18, 39.91986962, -86.28183839), (19, 42.07464833, -93.50009012), (20, 38.48472707, -98.3801554), (21, 37.52661417, -85.29055223), (22, 30.909072899999998, -91.81423318), (23, 45.27432853, -69.20275986), (24, 38.94649396, -76.68717734), (25, 42.16009327, -71.50397204), (26, 44.874773600000005, -85.73095291), (27, 46.34911038, -94.1983056), (28, 32.71289227, -89.65335941), (29, 38.36763044, -92.4774252), (30, 47.03342111, -109.64520700000001), (31, 41.52715113, -99.81085586),

In [44]:
## match state abbreviation with geoid
geo_dict = {1: 'AL', 2:'AK', 4: 'AZ', 5: 'AR', 6: 'CA', 8: 'CO', 9: 'CT', 10: 'DE', 
        11: 'DC', 12: 'FL', 13: 'GA', 15: 'HI', 16: 'ID', 17: 'IL', 18: 'IN', 19: 'IA', 20: 'KS', 
       21: 'KY', 22: 'LA', 23: 'ME', 24: 'MD', 25: 'MA', 26: 'MI', 27: 'MN', 28: 'MS', 29: 'MO', 30: 'MT', 
       31: 'NE', 32: 'NV', 33: 'NH', 34: 'NJ', 35: 'NM', 36: 'NY', 37: 'NC', 38: 'ND', 39: 'OH', 40: 'OK',
       41: 'OR', 42: 'PA', 44: 'RI', 45: 'SC', 46: 'SD', 47: 'TN', 48: 'TX', 49: 'UT', 50: 'VT', 
       51: 'VA', 53: 'WA', 54: 'WV', 55: 'WI', 56: 'WY'}   ## 51 regions (50 states + District of Columbia)

In [45]:
location_name = df_mob_merge.geoid.apply(lambda x: us_state_abbrev[geo_dict[x]])
df_mob_merge.insert(1, 'location_name', location_name)
df_mob_merge.drop(['geoid'], axis=1, inplace=True)
df_mob_merge.head()

Unnamed: 0,datetime,location_name,visitor_outflow,pop_outflow,visitor_inflow,pop_inflow
0,2020-03-27,Alabama,851538,10325474,843234,10367598
1,2020-03-27,Alaska,40167,1081863,40642,1082693
2,2020-03-27,Arizona,603781,11941596,602943,11967096
3,2020-03-27,Arkansas,446011,6067115,445254,6113562
4,2020-03-27,California,2377377,65100117,2365173,64430714


In [46]:
print(df_mob_merge.shape)

(11271, 6)


In [47]:
df_mob_merge.to_csv('mobility_us_merge.csv', index=False)

# Merge Dataset 1, 2, 3
#### Until now, we have:
- datasets 1 & 2 merged together: covers 3/31 to 11/9, 53 regions
- dataset 3 (mobility data): covers 3/27 to 11/2, 51 regions (compared with the first two datasets, it lacks Guam and Puerto Rico)
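The row count to expect from an inner merge can be worked out ahead of time. This sketch assumes the covid frame runs from 2020-03-31 to 2020-11-09 (the date shown in the `df_merge` preview above), the mobility frame from 2020-03-27 to 2020-11-02, and that every one of the 51 shared regions has a row on every shared date.

```python
import pandas as pd

# Date ranges of the two inputs (assumed complete, one row per day and region).
covid_days = pd.date_range("2020-03-31", "2020-11-09")
mobility_days = pd.date_range("2020-03-27", "2020-11-02")

# An inner merge keeps only dates present in both inputs.
shared_days = covid_days.intersection(mobility_days)

n_shared_regions = 51  # 50 states + District of Columbia
expected_rows = len(shared_days) * n_shared_regions
print(len(shared_days), expected_rows)
```

Comparing `expected_rows` with the shape printed after the merge is a cheap way to detect missing or duplicated keys.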

In [60]:
## this file contains datasets 1 & 2 merged together
df = pd.read_csv('covid_us_merge.csv')
print(df.shape)
df.head()

(11872, 14)


Unnamed: 0,datetime,location_name,death,negative,positive,totalTestResults,allbed_mean,ICUbed_mean,InvVen_mean,admis_mean,newICU_mean,mobility_composite,confirmed_infections,total_pop
0,2020-11-09,Alaska,84.0,748801.0,19196.0,767997.0,108.298152,29.167966,12.430271,5.062989,1.847292,-18.474221,,788027.5
1,2020-11-09,Alabama,3084.0,1232975.0,204857.0,1406829.0,1148.634788,309.362073,131.838276,46.678702,23.462433,-8.458567,1170.0,4977688.0
2,2020-11-09,Arkansas,2108.0,1324655.0,122811.0,1436416.0,690.777323,186.047216,79.286203,96.597606,38.754578,-13.490509,945.0,3057349.0
3,2020-11-09,Arizona,6164.0,1630206.0,259699.0,1883207.0,1600.698804,431.116579,183.725386,96.360769,42.056134,-24.696158,435.0,7249680.0
4,2020-11-09,California,17977.0,18946628.0,971851.0,19918479.0,3335.268826,828.269426,352.97673,131.275756,55.830378,-34.427708,8584.0,39872030.0


In [61]:
df2 = pd.read_csv('mobility_us_merge.csv')
print(df2.shape)
df2.head()

(11271, 6)


Unnamed: 0,datetime,location_name,visitor_outflow,pop_outflow,visitor_inflow,pop_inflow
0,2020-03-27,Alabama,851538,10325474,843234,10367598
1,2020-03-27,Alaska,40167,1081863,40642,1082693
2,2020-03-27,Arizona,603781,11941596,602943,11967096
3,2020-03-27,Arkansas,446011,6067115,445254,6113562
4,2020-03-27,California,2377377,65100117,2365173,64430714


In [62]:
df_all = pd.merge(df2, df, on=['datetime', 'location_name'])
print(df_all.shape)
df_all.head()

(11067, 18)


Unnamed: 0,datetime,location_name,visitor_outflow,pop_outflow,visitor_inflow,pop_inflow,death,negative,positive,totalTestResults,allbed_mean,ICUbed_mean,InvVen_mean,admis_mean,newICU_mean,mobility_composite,confirmed_infections,total_pop
0,2020-03-31,Alabama,769270,8825216,760336,8835878,13.0,6298.0,981.0,7279.0,191.012578,70.396601,49.395688,13.247017,6.528103,-34.04896,89.0,4977688.0
1,2020-03-31,Alaska,41968,1082608,42641,1086290,3.0,3585.0,128.0,3713.0,136.761231,50.402575,35.366337,2.032854,0.741063,-51.364092,5.0,788027.5
2,2020-03-31,Arizona,606842,11593010,602261,11527235,24.0,18082.0,1289.0,19371.0,195.322141,71.984866,50.510137,19.359117,8.312889,-46.676021,131.0,7249680.0
3,2020-03-31,Arkansas,429017,5545115,430337,5616031,8.0,5959.0,523.0,6482.0,58.883668,21.701241,15.227266,6.055093,2.313203,-34.007596,49.0,3057349.0
4,2020-03-31,California,2409606,62351675,2400612,61786250,153.0,21772.0,7482.0,29254.0,1543.960063,686.615038,481.782101,123.75947,49.084774,-53.153443,1072.0,39872030.0


In [63]:
state_names = list(df_all.groupby('location_name').groups.keys())
len(state_names)

51

In [64]:
df_all.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 11067 entries, 0 to 11066
Data columns (total 18 columns):
datetime                11067 non-null object
location_name           11067 non-null object
visitor_outflow         11067 non-null int64
pop_outflow             11067 non-null int64
visitor_inflow          11067 non-null int64
pop_inflow              11067 non-null int64
death                   11066 non-null float64
negative                11067 non-null float64
positive                11067 non-null float64
totalTestResults        11067 non-null float64
allbed_mean             11067 non-null float64
ICUbed_mean             11067 non-null float64
InvVen_mean             11067 non-null float64
admis_mean              11067 non-null float64
newICU_mean             11067 non-null float64
mobility_composite      11067 non-null float64
confirmed_infections    10886 non-null float64
total_pop               11067 non-null float64
dtypes: float64(12), int64(4), object(2)
memory usage: 

In [65]:
df_all = df_all[df_all['datetime']>='2020-04-01']

In [66]:
print(df_all.shape)

(11016, 18)


# Further adjustments
- We noticed that the **death** column (obtained from the first dataset) holds cumulative total deaths rather than daily new deaths, so we need to preprocess this column further.
- We want to add the standard latitude and longitude of each state for 2 reasons: 
    - they differentiate rows that share the same datetime
    - states with similar latitudes and longitudes can be considered neighbours, whose daily death counts might be similar to each other
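
The cumulative-to-daily conversion described above can be sketched on a toy table. This is an illustrative alternative that uses `groupby(...).diff()` rather than an explicit loop; the death counts are made-up values, and the `daily_death` column name is hypothetical.

```python
import pandas as pd

# Toy cumulative death counts for two states (made-up values).
toy = pd.DataFrame({
    'datetime': ['2020-04-01', '2020-04-02', '2020-04-03'] * 2,
    'location_name': ['Alabama'] * 3 + ['Alaska'] * 3,
    'death': [10.0, 16.0, 15.0, 3.0, 3.0, 5.0],
})

# Differences are only meaningful if rows are in date order within each state.
toy = toy.sort_values(['location_name', 'datetime'])

# Daily new deaths = today's cumulative total minus yesterday's, per state.
daily = toy.groupby('location_name')['death'].diff()

# A downward revision of the cumulative total yields a negative difference; clip it to 0.
toy['daily_death'] = daily.clip(lower=0)

# The first day of each state has no previous value, so its difference is NaN; drop it.
toy = toy.dropna(subset=['daily_death']).reset_index(drop=True)
print(toy)
```

Grouping before differencing keeps the first row of one state from being subtracted from the last row of the previous state.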

In [67]:
states = df_all.groupby('location_name')
state_list = []
for name, group in states:
    df_temp = group.copy()  ## work on an explicit copy so pandas does not raise SettingWithCopyWarning
    daily_death = df_temp['death'].diff()  ## cumulative deaths -> daily new deaths
    df_temp = df_temp.drop(['death'], axis=1)
    df_temp['death'] = daily_death
    state_list.append(df_temp.iloc[1:, :])  ## first row of death for each state will be null
    
df_new = pd.concat(state_list)


In [73]:
df_new = df_new.reset_index(drop=True)
df_new.loc[df_new['death'] < 0, 'death'] = 0  ## a few rows contain negative daily deaths after the subtraction, so we clean these entries by setting them to 0


In [72]:
df_new

Unnamed: 0,datetime,location_name,visitor_outflow,pop_outflow,visitor_inflow,pop_inflow,negative,positive,totalTestResults,allbed_mean,ICUbed_mean,InvVen_mean,admis_mean,newICU_mean,mobility_composite,confirmed_infections,total_pop,death
0,2020-04-02,Alabama,911906,9566649,904113,9582535,7503.0,1233.0,8736.0,224.946291,81.876268,57.296591,15.312697,7.590339,-35.207619,175.0,4.977688e+06,6.0
1,2020-04-03,Alabama,974948,10292520,969556,10335855,8187.0,1432.0,9619.0,246.399825,89.349936,62.432738,16.178042,8.042432,-35.667289,265.0,4.977688e+06,3.0
2,2020-04-04,Alabama,854480,9205708,852529,9277032,9273.0,1580.0,10853.0,266.754599,96.413430,67.261197,16.916438,8.432862,-36.044666,120.0,4.977688e+06,8.0
3,2020-04-05,Alabama,699333,7848208,696765,7892404,11282.0,1796.0,13078.0,273.266913,97.884635,68.174187,17.527168,8.760063,-36.334809,153.0,4.977688e+06,2.0
4,2020-04-06,Alabama,815949,9098459,809844,9118196,12797.0,1968.0,14765.0,287.322866,101.783974,70.768875,18.017711,9.026577,-36.532078,189.0,4.977688e+06,5.0
5,2020-04-07,Alabama,800970,8880744,795176,8907915,12797.0,2119.0,14916.0,303.175816,106.121403,73.744129,18.401899,9.238272,-36.631877,219.0,4.977688e+06,6.0
6,2020-04-08,Alabama,822245,9090832,816514,9116636,16753.0,2369.0,19122.0,318.960947,108.492215,75.314414,18.698007,9.403467,-36.631253,161.0,4.977688e+06,10.0
7,2020-04-09,Alabama,860851,9424386,857062,9481502,18058.0,2769.0,20827.0,319.757769,106.504510,73.994431,18.927377,9.532283,-36.529348,380.0,4.977688e+06,8.0
8,2020-04-10,Alabama,904245,10085256,900495,10150378,18058.0,2968.0,21026.0,330.706191,108.816095,75.548602,19.113003,9.635906,-36.327775,247.0,4.977688e+06,6.0
9,2020-04-11,Alabama,853779,9551467,852255,9633466,18058.0,3191.0,21249.0,341.205289,110.535196,76.714825,19.278145,9.725862,-36.031160,273.0,4.977688e+06,11.0


In [74]:
df_geo = pd.read_excel('geo_us_state.xlsx')  
df_geo.head()

Unnamed: 0,latitude,longitude,location_name
0,63.588753,-154.493062,Alaska
1,32.318231,-86.902298,Alabama
2,35.20105,-91.831833,Arkansas
3,34.048928,-111.093731,Arizona
4,36.778261,-119.417932,California


- this dataset (state latitudes and longitudes) is downloaded from [Google Developers](https://developers.google.com/public-data/docs/canonical/states_csv)
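
To make the idea that states with similar coordinates are neighbours more concrete, here is a minimal sketch that computes great-circle distances between state coordinates with the haversine formula. The three coordinate pairs are taken from the table above; the helper `haversine_km` and the Earth radius of 6371 km are assumptions of this sketch, not part of the original pipeline.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    r = 6371.0  # mean Earth radius in km (assumed value)
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Coordinates copied from the geo table shown above.
coords = {
    'Alabama': (32.318231, -86.902298),
    'Arkansas': (35.20105, -91.831833),
    'Alaska': (63.588753, -154.493062),
}

d_near = haversine_km(*coords['Alabama'], *coords['Arkansas'])
d_far = haversine_km(*coords['Alabama'], *coords['Alaska'])
print(round(d_near), round(d_far))
```

A model that receives raw latitude and longitude can only pick up this proximity implicitly; an explicit distance like this is one possible way to check whether two states are plausibly neighbours.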

In [75]:
df_merge2 = pd.merge(df_new, df_geo, on=['location_name'])
df_merge2

Unnamed: 0,datetime,location_name,visitor_outflow,pop_outflow,visitor_inflow,pop_inflow,negative,positive,totalTestResults,allbed_mean,ICUbed_mean,InvVen_mean,admis_mean,newICU_mean,mobility_composite,confirmed_infections,total_pop,death,latitude,longitude
0,2020-04-02,Alabama,911906,9566649,904113,9582535,7503.0,1233.0,8736.0,224.946291,81.876268,57.296591,15.312697,7.590339,-35.207619,175.0,4.977688e+06,6.0,32.318231,-86.902298
1,2020-04-03,Alabama,974948,10292520,969556,10335855,8187.0,1432.0,9619.0,246.399825,89.349936,62.432738,16.178042,8.042432,-35.667289,265.0,4.977688e+06,3.0,32.318231,-86.902298
2,2020-04-04,Alabama,854480,9205708,852529,9277032,9273.0,1580.0,10853.0,266.754599,96.413430,67.261197,16.916438,8.432862,-36.044666,120.0,4.977688e+06,8.0,32.318231,-86.902298
3,2020-04-05,Alabama,699333,7848208,696765,7892404,11282.0,1796.0,13078.0,273.266913,97.884635,68.174187,17.527168,8.760063,-36.334809,153.0,4.977688e+06,2.0,32.318231,-86.902298
4,2020-04-06,Alabama,815949,9098459,809844,9118196,12797.0,1968.0,14765.0,287.322866,101.783974,70.768875,18.017711,9.026577,-36.532078,189.0,4.977688e+06,5.0,32.318231,-86.902298
5,2020-04-07,Alabama,800970,8880744,795176,8907915,12797.0,2119.0,14916.0,303.175816,106.121403,73.744129,18.401899,9.238272,-36.631877,219.0,4.977688e+06,6.0,32.318231,-86.902298
6,2020-04-08,Alabama,822245,9090832,816514,9116636,16753.0,2369.0,19122.0,318.960947,108.492215,75.314414,18.698007,9.403467,-36.631253,161.0,4.977688e+06,10.0,32.318231,-86.902298
7,2020-04-09,Alabama,860851,9424386,857062,9481502,18058.0,2769.0,20827.0,319.757769,106.504510,73.994431,18.927377,9.532283,-36.529348,380.0,4.977688e+06,8.0,32.318231,-86.902298
8,2020-04-10,Alabama,904245,10085256,900495,10150378,18058.0,2968.0,21026.0,330.706191,108.816095,75.548602,19.113003,9.635906,-36.327775,247.0,4.977688e+06,6.0,32.318231,-86.902298
9,2020-04-11,Alabama,853779,9551467,852255,9633466,18058.0,3191.0,21249.0,341.205289,110.535196,76.714825,19.278145,9.725862,-36.031160,273.0,4.977688e+06,11.0,32.318231,-86.902298


In [76]:
df_merge2.to_csv('all_us_data.csv', index=False)

*We have uploaded the final version of the dataset to GitHub*