This database cleaner is for:

United_States_COVID-19_Cases_and_Deaths_by_State_over_Time.csv

Steps to clean the file:

(**Name of relevant dataframes:  df, df_cases, df_deaths**)

1.  Chronological.  Change column "submission_date" dtype to datetime format. Place column "submission" in increasing order AND sort by "state" also.


2.  Column deletion.  Delete column "created_at".


3.  Retaining rows.  Keep row if value of either "consent_cases" or "consent_deaths" is Agree or Not Agree.


4.  Fill in missing values.  Replace all null values with the integer 0 for columns "tot_cases", "conf_cases", "prob_cases", "new_case", "pnew_case", "tot_death", "conf_death", "prob_death", "new_death", "pnew_death".


5.  Change datatype.  For columns "tot_cases", "conf_cases", "prob_cases", "new_case", "pnew_case", "tot_death", "conf_death", "prob_death", "new_death", "pnew_death", change dtype to integer.


6.  Assume that only the year is needed.  Extract the year and place it in a column "Year_submitted."  Delete the column "submission_date."

7.  Add the target columns to df.  This is done before modifying the database because changes make it difficult to add later.  To start with a target column will be added for cases and for deaths.  It will be the mean value of    the year 2020 of columns "tot_cases" and "tot_death."
    The approach is to select a statistical measure to describe an assessment column for states relative to the population.  For *starters* the population mean will be selected as the comparison. criterion against each entry for the states.  For example, if the number of cases for a state is above or equal to the population mean assign a 1 to the assessment column, and if less than the population mean assign a 0.  
    First assessment columns will be added to the datatframe df.  The dataframe df will be split into two dataframes: df_cases and df_deaths.  An assessment column for cases will be assigned to df_cases and an assessment column for deaths will be assigned to df_deaths. 
    The 2020 population mean will be the basis for the assessment columns and will be applied to both the years 2020 and 2021. In this way states will be assessed against the previous year (2020) for training and testing in 2020 and 2021.
    The steps are:

    a.  Perform a describe() on column "tot_cases" over the year 2020 only for all states.  The results apply to the population of states.
    
    b.  Perform a describe() on column "tot_death" over the year 2020 only for all states.  The results apply to the population of states.
    
    c.  Populate "2020_mean_cases" with 1 or 0 
    
    d.  Populate "2020_mean_deaths" with 1 or 0


8.  Apply OneHotEncoder.  At this point columns "consent_cases" or "consent_deaths" will have values Agree, Not Agree, or N/A.  Apply OneHotEncoder to THREE columns:  "consent_cases", "consent_deaths", and "state."

9.  Make a new dataframe for cases only.

10.  Make a new dataframe for deaths only.

11.  Save df_cases as csv file.

12.  Save df_deaths as csv file.


Summary of dataframes used for this cleaning script:

**df**:  this is the *first* time "United_States_COVID-19_Cases_and_Deaths_by_State_over_Time.csv" is read

**df_cases**:  this is made from df, and a new column ">= 2020 mean" is added to it based on statistical analysis from df2_cases_stats

**df_deaths**: this is made from df, and a new column ">= 2020 mean" is added to it based on statistical analysis from df2_cases_stats


**This file is only for cleaning data to be used for input to a machine learning model.**  Machine learing models will have separate file(s) but will read in dataframes "df_cases" and "df_deaths."

In [87]:
#Import dependencies

import pandas as pd


In [88]:
# read the file

file_path = "United_States_COVID-19_Cases_and_Deaths_by_State_over_Time.csv"
df = pd.read_csv(file_path)
df.head()

Unnamed: 0,submission_date,state,tot_cases,conf_cases,prob_cases,new_case,pnew_case,tot_death,conf_death,prob_death,new_death,pnew_death,created_at,consent_cases,consent_deaths
0,03/11/2021,KS,297229,241035.0,56194.0,0,0.0,4851,,,0,0.0,03/12/2021 03:20:13 PM,Agree,
1,03/19/2020,FL,386,,,76,0.0,12,,,2,0.0,03/19/2020 12:00:00 AM,Not agree,Not agree
2,06/11/2021,TX,2965966,,,1463,355.0,51158,,,17,0.0,06/13/2021 12:00:00 AM,Not agree,Not agree
3,03/01/2021,CO,438745,411869.0,26876.0,677,60.0,5952,5218.0,734.0,1,0.0,03/01/2021 12:00:00 AM,Agree,Agree
4,08/22/2020,AR,56199,,,547,0.0,674,,,11,0.0,08/23/2020 02:15:28 PM,Not agree,Not agree


In [89]:
df.columns

Index(['submission_date', 'state', 'tot_cases', 'conf_cases', 'prob_cases',
       'new_case', 'pnew_case', 'tot_death', 'conf_death', 'prob_death',
       'new_death', 'pnew_death', 'created_at', 'consent_cases',
       'consent_deaths'],
      dtype='object')

In [90]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38700 entries, 0 to 38699
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   submission_date  38700 non-null  object 
 1   state            38700 non-null  object 
 2   tot_cases        38700 non-null  int64  
 3   conf_cases       20473 non-null  float64
 4   prob_cases       20401 non-null  float64
 5   new_case         38700 non-null  int64  
 6   pnew_case        34696 non-null  float64
 7   tot_death        38700 non-null  int64  
 8   conf_death       20343 non-null  float64
 9   prob_death       20343 non-null  float64
 10  new_death        38700 non-null  int64  
 11  pnew_death       34635 non-null  float64
 12  created_at       38700 non-null  object 
 13  consent_cases    32245 non-null  object 
 14  consent_deaths   32895 non-null  object 
dtypes: float64(6), int64(4), object(5)
memory usage: 4.4+ MB


**Step 1**

In [91]:
# Change column "submission" dtype to datetime format.

df["submission_date"] = pd.to_datetime(df["submission_date"])
df.head()

Unnamed: 0,submission_date,state,tot_cases,conf_cases,prob_cases,new_case,pnew_case,tot_death,conf_death,prob_death,new_death,pnew_death,created_at,consent_cases,consent_deaths
0,2021-03-11,KS,297229,241035.0,56194.0,0,0.0,4851,,,0,0.0,03/12/2021 03:20:13 PM,Agree,
1,2020-03-19,FL,386,,,76,0.0,12,,,2,0.0,03/19/2020 12:00:00 AM,Not agree,Not agree
2,2021-06-11,TX,2965966,,,1463,355.0,51158,,,17,0.0,06/13/2021 12:00:00 AM,Not agree,Not agree
3,2021-03-01,CO,438745,411869.0,26876.0,677,60.0,5952,5218.0,734.0,1,0.0,03/01/2021 12:00:00 AM,Agree,Agree
4,2020-08-22,AR,56199,,,547,0.0,674,,,11,0.0,08/23/2020 02:15:28 PM,Not agree,Not agree


In [92]:
# Place column "submission" in increasing order AND sort by "state" also.

df = df.sort_values(by = ["submission_date", "state"])
df.head(200)

Unnamed: 0,submission_date,state,tot_cases,conf_cases,prob_cases,new_case,pnew_case,tot_death,conf_death,prob_death,new_death,pnew_death,created_at,consent_cases,consent_deaths
3969,2020-01-22,AK,0,,,0,,0,,,0,,03/26/2020 04:22:39 PM,,
24632,2020-01-22,AL,7,6.0,1.0,7,1.0,0,0.0,0.0,0,0.0,01/24/2020 12:00:00 AM,Agree,Agree
34496,2020-01-22,AR,0,,,0,,0,,,0,,03/26/2020 04:22:39 PM,Not agree,Not agree
5334,2020-01-22,AS,0,,,0,,0,,,0,,03/26/2020 04:22:39 PM,,
30600,2020-01-22,AZ,0,,,0,,0,,,0,,03/26/2020 04:22:39 PM,Agree,Agree
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7495,2020-01-25,IA,0,,,0,,0,,,0,,03/26/2020 04:22:39 PM,Not agree,Not agree
10372,2020-01-25,ID,0,,,0,,0,,,0,,03/26/2020 04:22:39 PM,Agree,Agree
10924,2020-01-25,IL,1,,,0,,0,,,0,,03/26/2020 04:22:39 PM,Agree,Agree
23129,2020-01-25,IN,0,,,0,,0,,,0,,03/26/2020 04:22:39 PM,Not agree,Agree


**Step 2**

In [93]:
# Delete column "created_at".

df.drop(columns=["created_at"], inplace = True)
df.head()

Unnamed: 0,submission_date,state,tot_cases,conf_cases,prob_cases,new_case,pnew_case,tot_death,conf_death,prob_death,new_death,pnew_death,consent_cases,consent_deaths
3969,2020-01-22,AK,0,,,0,,0,,,0,,,
24632,2020-01-22,AL,7,6.0,1.0,7,1.0,0,0.0,0.0,0,0.0,Agree,Agree
34496,2020-01-22,AR,0,,,0,,0,,,0,,Not agree,Not agree
5334,2020-01-22,AS,0,,,0,,0,,,0,,,
30600,2020-01-22,AZ,0,,,0,,0,,,0,,Agree,Agree


**Step 3**

In [94]:
# Keep row if value of either "consent_cases" or "consent_deaths" is Agree or Not Agree.
df = df.loc[(df["consent_cases"]=="Agree")|(df["consent_deaths"]=="Agree") ]
df.head(200)

Unnamed: 0,submission_date,state,tot_cases,conf_cases,prob_cases,new_case,pnew_case,tot_death,conf_death,prob_death,new_death,pnew_death,consent_cases,consent_deaths
24632,2020-01-22,AL,7,6.0,1.0,7,1.0,0,0.0,0.0,0,0.0,Agree,Agree
30600,2020-01-22,AZ,0,,,0,,0,,,0,,Agree,Agree
19662,2020-01-22,CA,0,,,0,,0,,,0,,Agree,Not agree
4559,2020-01-22,CO,0,,,0,,0,,,0,,Agree,Agree
12045,2020-01-22,CT,0,,,0,,0,,,0,,Agree,Agree
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32489,2020-01-26,NYC,0,,,0,,0,,,0,,Agree,Agree
37010,2020-01-26,OH,0,,,0,,0,,,0,,Agree,Agree
5058,2020-01-26,OK,0,,,0,,0,,,0,,Not agree,Agree
29275,2020-01-26,OR,0,,,0,,0,,,0,,Agree,Agree


**Step 4**

In [95]:
# Fill in missing values.  Replace all null values with the integer 0 for columns "tot_cases", "conf_cases",
#"prob_cases", "new_case", "pnew_case", "tot_death", "conf_death", "prob_death", "new_death", "pnew_death".

zero_list = ["tot_cases", "conf_cases","prob_cases", "new_case","pnew_case", "tot_death","conf_death", "prob_death","new_death", "pnew_death"]

for x in zero_list:
    df[x] = df[x].fillna(0)

df.head(200)


Unnamed: 0,submission_date,state,tot_cases,conf_cases,prob_cases,new_case,pnew_case,tot_death,conf_death,prob_death,new_death,pnew_death,consent_cases,consent_deaths
24632,2020-01-22,AL,7,6.0,1.0,7,1.0,0,0.0,0.0,0,0.0,Agree,Agree
30600,2020-01-22,AZ,0,0.0,0.0,0,0.0,0,0.0,0.0,0,0.0,Agree,Agree
19662,2020-01-22,CA,0,0.0,0.0,0,0.0,0,0.0,0.0,0,0.0,Agree,Not agree
4559,2020-01-22,CO,0,0.0,0.0,0,0.0,0,0.0,0.0,0,0.0,Agree,Agree
12045,2020-01-22,CT,0,0.0,0.0,0,0.0,0,0.0,0.0,0,0.0,Agree,Agree
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32489,2020-01-26,NYC,0,0.0,0.0,0,0.0,0,0.0,0.0,0,0.0,Agree,Agree
37010,2020-01-26,OH,0,0.0,0.0,0,0.0,0,0.0,0.0,0,0.0,Agree,Agree
5058,2020-01-26,OK,0,0.0,0.0,0,0.0,0,0.0,0.0,0,0.0,Not agree,Agree
29275,2020-01-26,OR,0,0.0,0.0,0,0.0,0,0.0,0.0,0,0.0,Agree,Agree


**Step 5**

In [96]:
# For columns "tot_cases", "conf_cases", "prob_cases", "new_case", "pnew_case", "tot_death", "conf_death",
# "prob_death", "new_death", "pnew_death", change dtype to integer.


col_headers = ["tot_cases", "conf_cases", "prob_cases", "new_case", "pnew_case", "tot_death", "conf_death",
"prob_death", "new_death", "pnew_death"]

for col in col_headers:
    df[col]=df[col].astype("int64")

df.head(200)

Unnamed: 0,submission_date,state,tot_cases,conf_cases,prob_cases,new_case,pnew_case,tot_death,conf_death,prob_death,new_death,pnew_death,consent_cases,consent_deaths
24632,2020-01-22,AL,7,6,1,7,1,0,0,0,0,0,Agree,Agree
30600,2020-01-22,AZ,0,0,0,0,0,0,0,0,0,0,Agree,Agree
19662,2020-01-22,CA,0,0,0,0,0,0,0,0,0,0,Agree,Not agree
4559,2020-01-22,CO,0,0,0,0,0,0,0,0,0,0,Agree,Agree
12045,2020-01-22,CT,0,0,0,0,0,0,0,0,0,0,Agree,Agree
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32489,2020-01-26,NYC,0,0,0,0,0,0,0,0,0,0,Agree,Agree
37010,2020-01-26,OH,0,0,0,0,0,0,0,0,0,0,Agree,Agree
5058,2020-01-26,OK,0,0,0,0,0,0,0,0,0,0,Not agree,Agree
29275,2020-01-26,OR,0,0,0,0,0,0,0,0,0,0,Agree,Agree


In [97]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 26533 entries, 24632 to 27816
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   submission_date  26533 non-null  datetime64[ns]
 1   state            26533 non-null  object        
 2   tot_cases        26533 non-null  int64         
 3   conf_cases       26533 non-null  int64         
 4   prob_cases       26533 non-null  int64         
 5   new_case         26533 non-null  int64         
 6   pnew_case        26533 non-null  int64         
 7   tot_death        26533 non-null  int64         
 8   conf_death       26533 non-null  int64         
 9   prob_death       26533 non-null  int64         
 10  new_death        26533 non-null  int64         
 11  pnew_death       26533 non-null  int64         
 12  consent_cases    24598 non-null  object        
 13  consent_deaths   25248 non-null  object        
dtypes: datetime64[ns](1), int64(10), o

**Step 6**

In [98]:
#  Assume that only the year is needed.  Extract the year and place it in a column "Year_submitted."  

df["Year_submitted"] = df["submission_date"].dt.year
df.head()


Unnamed: 0,submission_date,state,tot_cases,conf_cases,prob_cases,new_case,pnew_case,tot_death,conf_death,prob_death,new_death,pnew_death,consent_cases,consent_deaths,Year_submitted
24632,2020-01-22,AL,7,6,1,7,1,0,0,0,0,0,Agree,Agree,2020
30600,2020-01-22,AZ,0,0,0,0,0,0,0,0,0,0,Agree,Agree,2020
19662,2020-01-22,CA,0,0,0,0,0,0,0,0,0,0,Agree,Not agree,2020
4559,2020-01-22,CO,0,0,0,0,0,0,0,0,0,0,Agree,Agree,2020
12045,2020-01-22,CT,0,0,0,0,0,0,0,0,0,0,Agree,Agree,2020


In [99]:
df.columns

Index(['submission_date', 'state', 'tot_cases', 'conf_cases', 'prob_cases',
       'new_case', 'pnew_case', 'tot_death', 'conf_death', 'prob_death',
       'new_death', 'pnew_death', 'consent_cases', 'consent_deaths',
       'Year_submitted'],
      dtype='object')

In [100]:
# Delete the column "submission_date."

df.drop(columns = ["submission_date"], inplace = True)

In [101]:
# reorder columns

df_columns_new = [
'Year_submitted',
'state',
'tot_cases', 
'conf_cases', 
'prob_cases',
'new_case', 
'pnew_case', 
'tot_death', 
'conf_death', 
'prob_death',
'new_death', 
'pnew_death', 
'consent_cases',
'consent_deaths'
 ]



In [102]:
len(df_columns_new)

14

In [103]:
df = df.reindex(columns = df_columns_new)
df.head()

Unnamed: 0,Year_submitted,state,tot_cases,conf_cases,prob_cases,new_case,pnew_case,tot_death,conf_death,prob_death,new_death,pnew_death,consent_cases,consent_deaths
24632,2020,AL,7,6,1,7,1,0,0,0,0,0,Agree,Agree
30600,2020,AZ,0,0,0,0,0,0,0,0,0,0,Agree,Agree
19662,2020,CA,0,0,0,0,0,0,0,0,0,0,Agree,Not agree
4559,2020,CO,0,0,0,0,0,0,0,0,0,0,Agree,Agree
12045,2020,CT,0,0,0,0,0,0,0,0,0,0,Agree,Agree


In [104]:
df.shape

(26533, 14)

**Step 7**

In [105]:
# Add the target columns to df.

df["2020_mean_cases"] = 0
df["2020_mean_deaths"] = 0


In [106]:
# Perform a describe() on column "tot_cases" over the year 2020 only for all states.  The results apply to 
# the population of states

df_cases_2020 = df.loc[df["Year_submitted"]==2020]
mean_cases = df_cases_2020["tot_cases"].mean()
mean_cases



90113.83999436103

In [107]:
# Perform a describe() on column "tot_death" over the year 2020 only for all states.  The results apply to the
#population of states.

df_deaths_2020 = df.loc[df["Year_submitted"]==2020]
mean_deaths = df_deaths_2020["tot_death"].mean()
mean_deaths



2634.3666032283077

In [108]:
# Populate "2020_mean_cases" with 1 or 0 
# Populate "2020_mean_deaths" with 1 or 0 

# cases

for index, row in df.iterrows():
    x = row["tot_cases"]
    if x >= int(mean_cases):
        df.loc[index, "2020_mean_cases"]=1
    else:
        df.loc[index, "2020_mean_cases"]=0

print(df["2020_mean_cases"].value_counts())

# deaths

for index, row in df.iterrows():
    x = row["tot_death"]
    if x >= int(mean_deaths):
        df.loc[index, "2020_mean_deaths"]=1
    else:
        df.loc[index, "2020_mean_deaths"]=0

print(df["2020_mean_deaths"].value_counts())


1    15118
0    11415
Name: 2020_mean_cases, dtype: int64
0    13638
1    12895
Name: 2020_mean_deaths, dtype: int64


In [109]:
df.head()

Unnamed: 0,Year_submitted,state,tot_cases,conf_cases,prob_cases,new_case,pnew_case,tot_death,conf_death,prob_death,new_death,pnew_death,consent_cases,consent_deaths,2020_mean_cases,2020_mean_deaths
24632,2020,AL,7,6,1,7,1,0,0,0,0,0,Agree,Agree,0,0
30600,2020,AZ,0,0,0,0,0,0,0,0,0,0,Agree,Agree,0,0
19662,2020,CA,0,0,0,0,0,0,0,0,0,0,Agree,Not agree,0,0
4559,2020,CO,0,0,0,0,0,0,0,0,0,0,Agree,Agree,0,0
12045,2020,CT,0,0,0,0,0,0,0,0,0,0,Agree,Agree,0,0


**Step 8**

In [110]:
# import dependencies

from sklearn.preprocessing import OneHotEncoder, LabelEncoder


In [111]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 26533 entries, 24632 to 27816
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Year_submitted    26533 non-null  int64 
 1   state             26533 non-null  object
 2   tot_cases         26533 non-null  int64 
 3   conf_cases        26533 non-null  int64 
 4   prob_cases        26533 non-null  int64 
 5   new_case          26533 non-null  int64 
 6   pnew_case         26533 non-null  int64 
 7   tot_death         26533 non-null  int64 
 8   conf_death        26533 non-null  int64 
 9   prob_death        26533 non-null  int64 
 10  new_death         26533 non-null  int64 
 11  pnew_death        26533 non-null  int64 
 12  consent_cases     24598 non-null  object
 13  consent_deaths    25248 non-null  object
 14  2020_mean_cases   26533 non-null  int64 
 15  2020_mean_deaths  26533 non-null  int64 
dtypes: int64(13), object(3)
memory usage: 4.4+ MB


In [112]:
obj_list = df.dtypes[df.dtypes == "object"].index.to_list()
obj_list

['state', 'consent_cases', 'consent_deaths']

In [113]:
# Apply OneHotEncoder to THREE columns:  "consent_cases", "consent_deaths", and "state."

enc = OneHotEncoder(sparse = False)
encoded_df = pd.DataFrame(enc.fit_transform(df[obj_list]))
encoded_df.columns = enc.get_feature_names(obj_list)
encoded_df.head(200)

Unnamed: 0,state_AL,state_AZ,state_CA,state_CO,state_CT,state_DE,state_FSM,state_GA,state_ID,state_IL,...,state_VA,state_WI,state_WV,state_WY,consent_cases_Agree,consent_cases_Not agree,consent_cases_nan,consent_deaths_Agree,consent_deaths_Not agree,consent_deaths_nan
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
1,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
2,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
4,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
195,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
196,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
197,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0
198,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0


In [114]:
df = df.merge(encoded_df, left_index = True, right_index = True)
df = df.drop(obj_list, 1)
df.head(200)

Unnamed: 0,Year_submitted,tot_cases,conf_cases,prob_cases,new_case,pnew_case,tot_death,conf_death,prob_death,new_death,...,state_VA,state_WI,state_WV,state_WY,consent_cases_Agree,consent_cases_Not agree,consent_cases_nan,consent_deaths_Agree,consent_deaths_Not agree,consent_deaths_nan
24632,2020,7,6,1,7,1,0,0,0,0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
19662,2020,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
4559,2020,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
12045,2020,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
19606,2020,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16093,2020,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
7151,2020,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
17481,2020,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
22238,2020,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0


In [115]:
df.columns

Index(['Year_submitted', 'tot_cases', 'conf_cases', 'prob_cases', 'new_case',
       'pnew_case', 'tot_death', 'conf_death', 'prob_death', 'new_death',
       'pnew_death', '2020_mean_cases', '2020_mean_deaths', 'state_AL',
       'state_AZ', 'state_CA', 'state_CO', 'state_CT', 'state_DE', 'state_FSM',
       'state_GA', 'state_ID', 'state_IL', 'state_IN', 'state_KS', 'state_KY',
       'state_LA', 'state_MA', 'state_MD', 'state_ME', 'state_MI', 'state_MN',
       'state_MP', 'state_MS', 'state_MT', 'state_NC', 'state_ND', 'state_NE',
       'state_NJ', 'state_NV', 'state_NYC', 'state_OH', 'state_OK', 'state_OR',
       'state_PA', 'state_PR', 'state_RMI', 'state_SC', 'state_SD', 'state_TN',
       'state_UT', 'state_VA', 'state_WI', 'state_WV', 'state_WY',
       'consent_cases_Agree', 'consent_cases_Not agree', 'consent_cases_nan',
       'consent_deaths_Agree', 'consent_deaths_Not agree',
       'consent_deaths_nan'],
      dtype='object')

**Step 9**

In [119]:
#. Make a new dataframe for cases only.

columns_cases = [
'Year_submitted',
'tot_cases',
 'conf_cases',
 'prob_cases',
 'new_case',
 'pnew_case',
 'state_AL',
 'state_AZ',
 'state_CA',
 'state_CO',
 'state_CT',
 'state_DE',
 'state_FSM',
 'state_GA',
 'state_ID',
 'state_IL',
 'state_IN',
 'state_KS',
 'state_KY',
 'state_LA',
 'state_MA',
 'state_MD',
 'state_ME',
 'state_MI',
 'state_MN',
 'state_MP',
 'state_MS',
 'state_MT',
 'state_NC',
 'state_ND',
 'state_NE',
 'state_NJ',
 'state_NV',
 'state_NYC',
 'state_OH',
 'state_OK',
 'state_OR',
 'state_PA',
 'state_PR',
 'state_RMI',
 'state_SC',
 'state_SD',
 'state_TN',
 'state_UT',
 'state_VA',
 'state_WI',
 'state_WV',
 'state_WY',
 'consent_cases_Agree',
 'consent_cases_Not agree',
 'consent_cases_nan',
 '2020_mean_cases'
 ]

df_cases = df.copy()
df_cases.drop(columns = ['tot_death','conf_death','prob_death','new_death','pnew_death','consent_deaths_Agree',
 'consent_deaths_Not agree','consent_deaths_nan'], inplace = True)

df_cases = df_cases[columns_cases]
df_cases.head(200)

Unnamed: 0,Year_submitted,tot_cases,conf_cases,prob_cases,new_case,pnew_case,state_AL,state_AZ,state_CA,state_CO,...,state_TN,state_UT,state_VA,state_WI,state_WV,state_WY,consent_cases_Agree,consent_cases_Not agree,consent_cases_nan,2020_mean_cases
24632,2020,7,6,1,7,1,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0
19662,2020,0,0,0,0,0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0
4559,2020,0,0,0,0,0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0
12045,2020,0,0,0,0,0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0
19606,2020,0,0,0,0,0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16093,2020,0,0,0,0,0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0
7151,2020,0,0,0,0,0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0
17481,2020,0,0,0,0,0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0
22238,2020,0,0,0,0,0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0


In [120]:
df_cases.shape

(18323, 52)

**Step 10**

In [121]:
# Make a dataframe for deaths only

columns_deaths = [
'Year_submitted',
 'tot_death',
 'conf_death',
 'prob_death',
 'new_death',
 'pnew_death',
 'state_AL',
 'state_AZ',
 'state_CA',
 'state_CO',
 'state_CT',
 'state_DE',
 'state_FSM',
 'state_GA',
 'state_ID',
 'state_IL',
 'state_IN',
 'state_KS',
 'state_KY',
 'state_LA',
 'state_MA',
 'state_MD',
 'state_ME',
 'state_MI',
 'state_MN',
 'state_MP',
 'state_MS',
 'state_MT',
 'state_NC',
 'state_ND',
 'state_NE',
 'state_NJ',
 'state_NV',
 'state_NYC',
 'state_OH',
 'state_OK',
 'state_OR',
 'state_PA',
 'state_PR',
 'state_RMI',
 'state_SC',
 'state_SD',
 'state_TN',
 'state_UT',
 'state_VA',
 'state_WI',
 'state_WV',
 'state_WY',
 'consent_deaths_Agree',
 'consent_deaths_Not agree',
 'consent_deaths_nan',
 '2020_mean_deaths'
 ]

df_deaths = df.copy()
df_deaths.drop(columns = ['tot_cases','conf_cases','prob_cases','new_case','pnew_case','consent_cases_Agree',
'consent_cases_Not agree','consent_cases_nan'], inplace = True)

df_deaths = df_deaths[columns_deaths]
df_deaths.head()

Unnamed: 0,Year_submitted,tot_death,conf_death,prob_death,new_death,pnew_death,state_AL,state_AZ,state_CA,state_CO,...,state_TN,state_UT,state_VA,state_WI,state_WV,state_WY,consent_deaths_Agree,consent_deaths_Not agree,consent_deaths_nan,2020_mean_deaths
24632,2020,0,0,0,0,0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0
19662,2020,0,0,0,0,0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0
4559,2020,0,0,0,0,0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0
12045,2020,0,0,0,0,0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0
19606,2020,0,0,0,0,0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0


**Step 11**

In [123]:
import os

In [124]:
# Save df_cases as csv file.

os.makedirs("Cases_Cleaned/",exist_ok=True)
df_cases.to_csv('Cases_Cleaned/ML_cases.csv', index = False)



**Step 12**

In [125]:
# Save df_deaths as csv file.

os.makedirs("Deaths_Cleaned/",exist_ok=True)
df_deaths.to_csv('Deaths_Cleaned/ML_deaths.csv', index = False)

