In this notebook, we are going to explore COVID-19 cases in the United States by state.

Import the necessary modules.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

print("Setup complete.")

Setup complete.


# Inspect and Clean Dataset

Load the dataset to begin analysis.

In [2]:
us_states_covid19_df = pd.read_csv("/kaggle/input/covid19-in-usa/us_states_covid19_daily.csv")

Use the .head() method to preview the first five rows of the DataFrame.

In [3]:
us_states_covid19_df.head()

,date,state,positive,probableCases,negative,pending,totalTestResultsSource,totalTestResults,hospitalizedCurrently,hospitalizedCumulative,...,posNeg,deathIncrease,hospitalizedIncrease,hash,commercialScore,negativeRegularScore,negativeScore,positiveScore,score,grade
0,20201206,AK,35720.0,,1042056.0,,totalTestsViral,1077776.0,164.0,799.0,...,1077776,0,0,7b1d31e2756687bb9259b29195f1db6cdb321ea6,0,0,0,0,0,
1,20201206,AL,269877.0,45962.0,1421126.0,,totalTestsPeopleViral,1645041.0,1927.0,26331.0,...,1691003,12,0,19454ed8fe28fc0a7948fc0771b2f3c846c1c92e,0,0,0,0,0,
2,20201206,AR,170924.0,22753.0,1614979.0,,totalTestsViral,1763150.0,1076.0,9401.0,...,1785903,40,21,25fc83bffff5b32ba1a737be8e087fad9f4fde33,0,0,0,0,0,
3,20201206,AS,0.0,,2140.0,,totalTestsViral,2140.0,,,...,2140,0,0,8c39eec317586b0c34fc2903e6a3891ecb00469e,0,0,0,0,0,
4,20201206,AZ,364276.0,12590.0,2018813.0,,totalTestsPeopleViral,2370499.0,2977.0,28248.0,...,2383089,25,242,7cf59da9e4bc31d905e179211313d08879880a85,0,0,0,0,0,


We first need to inspect the dataset and identify the data type of each variable. Pandas infers data types automatically, but the inferred type is sometimes not the one we want. We use the **.info()** method to quickly inspect each column's data type and number of non-null entries.

In [4]:
us_states_covid19_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15633 entries, 0 to 15632
Data columns (total 55 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   date                         15633 non-null  int64  
 1   state                        15633 non-null  object 
 2   positive                     15481 non-null  float64
 3   probableCases                5449 non-null   float64
 4   negative                     15323 non-null  float64
 5   pending                      1684 non-null   float64
 6   totalTestResultsSource       15633 non-null  object 
 7   totalTestResults             15598 non-null  float64
 8   hospitalizedCurrently        12516 non-null  float64
 9   hospitalizedCumulative       9434 non-null   float64
 10  inIcuCurrently               7713 non-null   float64
 11  inIcuCumulative              2700 non-null   float64
 12  onVentilatorCurrently        6211 non-null   float64
 13  onVentilatorCumu

Using this information, we can tell that the CSV file contains 15,633 entries, each recording COVID-19 results for one state on one date. However, several columns have data types that do not suit the analysis (for example, **date** is stored as an int and the counts are stored as floats), and many columns are missing a significant amount of data.
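One way to quantify how much data each column is missing is to compute the fraction of NaN values per column. The sketch below uses a small made-up frame (`toy_df` and its values are hypothetical, not the Kaggle file); the same call should apply to `us_states_covid19_df`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset; values are illustrative only.
toy_df = pd.DataFrame({
    "state": ["AK", "AL", "AR", "AS"],
    "positive": [35720.0, 269877.0, 170924.0, 0.0],
    "probableCases": [np.nan, 45962.0, 22753.0, np.nan],
    "pending": [np.nan, np.nan, np.nan, np.nan],
})

# Fraction of missing values per column, largest first.
missing_share = toy_df.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Columns with a share close to 1 carry little usable information and are natural candidates to drop.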

We will first drop any columns in the DataFrame that we will not use in the analysis, then convert the remaining columns to appropriate data types.

Create a new DataFrame that will hold the variables used for the analysis. Inspect this new dataset.

In [5]:
us_states_df = us_states_covid19_df[['date', 'state', 'positive', 'probableCases', 'negative', 'totalTestResults', 'hospitalizedCurrently', 'inIcuCurrently', 'onVentilatorCurrently', 'death', 'dataQualityGrade']].reset_index(drop=True)
us_states_df.head()

,date,state,positive,probableCases,negative,totalTestResults,hospitalizedCurrently,inIcuCurrently,onVentilatorCurrently,death,dataQualityGrade
0,20201206,AK,35720.0,,1042056.0,1077776.0,164.0,,21.0,143.0,A
1,20201206,AL,269877.0,45962.0,1421126.0,1645041.0,1927.0,,,3889.0,A
2,20201206,AR,170924.0,22753.0,1614979.0,1763150.0,1076.0,374.0,179.0,2660.0,A+
3,20201206,AS,0.0,,2140.0,2140.0,,,,0.0,D
4,20201206,AZ,364276.0,12590.0,2018813.0,2370499.0,2977.0,714.0,462.0,6950.0,A+


In [6]:
us_states_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15633 entries, 0 to 15632
Data columns (total 11 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   date                   15633 non-null  int64  
 1   state                  15633 non-null  object 
 2   positive               15481 non-null  float64
 3   probableCases          5449 non-null   float64
 4   negative               15323 non-null  float64
 5   totalTestResults       15598 non-null  float64
 6   hospitalizedCurrently  12516 non-null  float64
 7   inIcuCurrently         7713 non-null   float64
 8   onVentilatorCurrently  6211 non-null   float64
 9   death                  14807 non-null  float64
 10  dataQualityGrade       14372 non-null  object 
dtypes: float64(8), int64(1), object(2)
memory usage: 1.3+ MB


We want to make sure the quality of the data is up to par. For the analysis, we will keep only rows with a data quality grade of B- or higher. Rows with a lower grade, or with no grade at all, will be discarded. This will also make our data cleanup quicker.

In [7]:
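# Keep only rows whose data quality grade is between B- and A+.
# .copy() makes the result an independent DataFrame, so the column
# assignments in later cells do not trigger a SettingWithCopyWarning.
us_states_df = us_states_df[us_states_df.dataQualityGrade.isin(['A+', 'A', 'B', 'B-'])].copy()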
us_states_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 12674 entries, 0 to 14539
Data columns (total 11 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   date                   12674 non-null  int64  
 1   state                  12674 non-null  object 
 2   positive               12674 non-null  float64
 3   probableCases          5205 non-null   float64
 4   negative               12671 non-null  float64
 5   totalTestResults       12674 non-null  float64
 6   hospitalizedCurrently  11658 non-null  float64
 7   inIcuCurrently         7492 non-null   float64
 8   onVentilatorCurrently  6005 non-null   float64
 9   death                  12647 non-null  float64
 10  dataQualityGrade       12674 non-null  object 
dtypes: float64(8), int64(1), object(2)
memory usage: 1.2+ MB


We initially had 15,633 COVID-19 entries, but filtered the DataFrame to only include reputable data. We now have 12,674 data entries.
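The filter above relies on `Series.isin`, which keeps only exact matches, so rows with a missing grade are dropped along with the low grades. A minimal sketch on made-up data (`grades_df` and its values are hypothetical) shows the effect:

```python
import numpy as np
import pandas as pd

# Toy data: one low grade (D) and one missing grade (NaN).
grades_df = pd.DataFrame({
    "state": ["AK", "AR", "AS", "NY", "TX"],
    "dataQualityGrade": ["A", "A+", "D", np.nan, "B"],
})

# Count each grade, including missing ones, before filtering.
print(grades_df["dataQualityGrade"].value_counts(dropna=False))

# Same list of accepted grades as in the notebook cell.
accepted = ["A+", "A", "B", "B-"]
kept = grades_df[grades_df["dataQualityGrade"].isin(accepted)].copy()
print(kept["state"].tolist())
```

Checking `value_counts(dropna=False)` first makes it easy to see which grades actually occur in the data before deciding on the cutoff.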

Notice that several columns still contain NaN (Not a Number) values. We will fill each NaN with 0, treating it as the state not having reported a value for that category. We *must* fill in these NaN values before converting the columns from float to int, because NaN cannot be stored in an int column.

In [8]:
us_states_df['positive'] = us_states_df['positive'].fillna(0)
us_states_df['probableCases'] = us_states_df['probableCases'].fillna(0)
us_states_df['negative'] = us_states_df['negative'].fillna(0)
us_states_df['hospitalizedCurrently'] = us_states_df['hospitalizedCurrently'].fillna(0)
us_states_df['inIcuCurrently'] = us_states_df['inIcuCurrently'].fillna(0)
us_states_df['onVentilatorCurrently'] = us_states_df['onVentilatorCurrently'].fillna(0)
us_states_df['death'] = us_states_df['death'].fillna(0)
us_states_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 12674 entries, 0 to 14539
Data columns (total 11 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   date                   12674 non-null  int64  
 1   state                  12674 non-null  object 
 2   positive               12674 non-null  float64
 3   probableCases          12674 non-null  float64
 4   negative               12674 non-null  float64
 5   totalTestResults       12674 non-null  float64
 6   hospitalizedCurrently  12674 non-null  float64
 7   inIcuCurrently         12674 non-null  float64
 8   onVentilatorCurrently  12674 non-null  float64
 9   death                  12674 non-null  float64
 10  dataQualityGrade       12674 non-null  object 
dtypes: float64(8), int64(1), object(2)
memory usage: 1.2+ MB
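The reason the NaN values have to be handled before the int conversion can be seen on a toy Series (`counts` below is hypothetical sample data). The sketch also shows pandas' nullable `Int64` dtype as an alternative that keeps missing values instead of turning them into 0; the notebook itself uses the fill-with-0 approach.

```python
import numpy as np
import pandas as pd

counts = pd.Series([164.0, np.nan, 1076.0], name="hospitalizedCurrently")

# Casting directly fails because NaN has no int64 representation.
try:
    counts.astype("int64")
    cast_failed = False
except ValueError:
    cast_failed = True

# Approach used in this notebook: treat unreported values as 0.
filled = counts.fillna(0).astype("int64")

# Alternative: the nullable integer dtype keeps missing values as <NA>.
nullable = counts.astype("Int64")

print(cast_failed, filled.tolist(), nullable.isna().sum())
```

Filling with 0 keeps the later arithmetic simple, but it makes "not reported" indistinguishable from a true zero, which is worth remembering when interpreting sums and averages.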


Now we can change all the data types. The **date** variable has an int data type, when it should have a datetime data type. Use pd.to_datetime() to change the type, passing format='%Y%m%d' so that integers such as 20201206 are parsed as calendar dates.
The other variables should have an int data type, as they are all discrete counts.

In [9]:
us_states_df['date'] = pd.to_datetime(us_states_df['date'].astype(str), format='%Y%m%d')
us_states_df['positive'] = us_states_df['positive'].astype('int64')
us_states_df['probableCases'] = us_states_df['probableCases'].astype('int64')
us_states_df['negative'] = us_states_df['negative'].astype('int64')
us_states_df['totalTestResults'] = us_states_df['totalTestResults'].astype('int64')
us_states_df['hospitalizedCurrently'] = us_states_df['hospitalizedCurrently'].astype('int64')
us_states_df['inIcuCurrently'] = us_states_df['inIcuCurrently'].astype('int64')
us_states_df['onVentilatorCurrently'] = us_states_df['onVentilatorCurrently'].astype('int64')
us_states_df['death'] = us_states_df['death'].astype('int64')
us_states_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 12674 entries, 0 to 14539
Data columns (total 11 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   date                   12674 non-null  datetime64[ns]
 1   state                  12674 non-null  object        
 2   positive               12674 non-null  int64         
 3   probableCases          12674 non-null  int64         
 4   negative               12674 non-null  int64         
 5   totalTestResults       12674 non-null  int64         
 6   hospitalizedCurrently  12674 non-null  int64         
 7   inIcuCurrently         12674 non-null  int64         
 8   onVentilatorCurrently  12674 non-null  int64         
 9   death                  12674 non-null  int64         
 10  dataQualityGrade       12674 non-null  object        
dtypes: datetime64[ns](1), int64(8), object(2)
memory usage: 1.2+ MB
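Integer dates such as 20201206 are worth checking after conversion: without an explicit format, pd.to_datetime() interprets integers as nanoseconds since 1970-01-01, so the column gets a datetime dtype but the wrong dates. The sketch below compares the two calls on a toy Series (`raw` is a hypothetical stand-in for the **date** column).

```python
import pandas as pd

raw = pd.Series([20201206, 20201205])

# No format: integers are read as nanosecond offsets from the epoch.
naive = pd.to_datetime(raw)

# Explicit format: each value is parsed as year, month, day.
parsed = pd.to_datetime(raw.astype(str), format="%Y%m%d")

print(naive.dt.year.tolist())   # years land in 1970
print(parsed.dt.date.tolist())  # dates land in December 2020
```

Both results report datetime64[ns] in .info(), so looking at a few actual values is the only way to tell them apart.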


The DataFrame is now cleaned up and ready for analysis.

# Exploratory Data Analysis

Now that we have a clean dataset of COVID-19 statistics, we can begin to ask questions about COVID-19 data throughout the United States. 

We can answer questions ranging from the simple, such as "which date had the highest number of positive COVID tests?", to the more involved, such as "are positive test results more prevalent in some regions of the US than in others?"
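As a rough sketch of how the first question could be approached, the example below finds the date with the most new positive results on made-up data (`toy` and its values are hypothetical). It assumes that **positive** is a running total per state, which the preview values suggest but this notebook does not confirm, so daily counts are taken as day-over-day differences.

```python
import pandas as pd

toy = pd.DataFrame({
    "date": pd.to_datetime(["2020-12-04", "2020-12-05", "2020-12-06"] * 2),
    "state": ["AK"] * 3 + ["AL"] * 3,
    "positive": [100, 130, 150, 1000, 1200, 1250],
})

# Assumption: `positive` is cumulative, so daily new cases are the
# day-over-day difference within each state.
toy = toy.sort_values(["state", "date"])
toy["new_positive"] = toy.groupby("state")["positive"].diff()

# Sum new cases across states for each date and pick the largest.
daily_total = toy.groupby("date")["new_positive"].sum()
peak_date = daily_total.idxmax()
print(peak_date.date(), int(daily_total.max()))
```

The first date for each state has no previous day to compare against, so its difference is NaN and contributes nothing to the daily sum.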