# ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png) Pandas EDA
_Author: Noelle Brown (DSI DEN)_

## About the Dataset: COVID-19

This dataset is provided by the [Johns Hopkins University Center for Systems Science and Engineering](https://systems.jhu.edu/) for educational purposes. The complete GitHub repository containing daily updated data can be found [here](https://github.com/CSSEGISandData/COVID-19).

## Imports

First, import pandas and `matplotlib.pyplot`.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

## Reading in Data

Read in the COVID-19 CSV. We read the data from a GitHub link here, but you can also pass a path to a file in your local directory. *This dataset was last updated on March 20, 2020, so the numbers no longer reflect current figures.*

In [21]:
covid = pd.read_csv('https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_daily_reports/03-20-2020.csv')
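
As a minimal sketch of the local-path option, the snippet below writes a three-row sample (copied from the first rows of this dataset) to a temporary file and reads it back with `pd.read_csv`. The `sample_csv` string, the temporary directory, and the `covid_sample` name are assumptions for illustration only; in practice you would point `read_csv` at wherever you saved the daily report.

```python
import os
import tempfile

import pandas as pd

# Three rows copied from the top of the March 20, 2020 report (illustration only).
sample_csv = """Province/State,Country/Region,Last Update,Confirmed,Deaths,Recovered,Latitude,Longitude
Hubei,China,2020-03-20T07:43:02,67800,3133,58382,30.9756,112.2707
,Italy,2020-03-20T17:43:03,47021,4032,4440,41.8719,12.5674
,Spain,2020-03-20T17:43:03,20410,1043,1588,40.4637,-3.7492
"""

# Write the sample to a hypothetical local file, then read it by path.
with tempfile.TemporaryDirectory() as tmp_dir:
    local_path = os.path.join(tmp_dir, "03-20-2020.csv")
    with open(local_path, "w") as f:
        f.write(sample_csv)
    covid_sample = pd.read_csv(local_path)

print(covid_sample.shape)
```

Empty fields in the file, such as the missing `Province/State` for Italy and Spain, are read in as `NaN`.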

## Inspecting our DataFrame: The basics

Look at the first 5 rows:

In [22]:
covid.head()

| | Province/State | Country/Region | Last Update | Confirmed | Deaths | Recovered | Latitude | Longitude |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Hubei | China | 2020-03-20T07:43:02 | 67800 | 3133 | 58382 | 30.9756 | 112.2707 |
| 1 | NaN | Italy | 2020-03-20T17:43:03 | 47021 | 4032 | 4440 | 41.8719 | 12.5674 |
| 2 | NaN | Spain | 2020-03-20T17:43:03 | 20410 | 1043 | 1588 | 40.4637 | -3.7492 |
| 3 | NaN | Germany | 2020-03-20T20:13:15 | 19848 | 67 | 180 | 51.1657 | 10.4515 |
| 4 | NaN | Iran | 2020-03-20T15:13:21 | 19644 | 1433 | 6745 | 32.4279 | 53.688 |


And the last 5 rows:

In [23]:
covid.tail()

| | Province/State | Country/Region | Last Update | Confirmed | Deaths | Recovered | Latitude | Longitude |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 294 | NaN | Jersey | 2020-03-17T18:33:03 | 0 | 0 | 0 | 49.19 | -2.11 |
| 295 | NaN | Puerto Rico | 2020-03-17T16:13:14 | 0 | 0 | 0 | 18.2 | -66.5 |
| 296 | NaN | Republic of the Congo | 2020-03-17T21:33:03 | 0 | 0 | 0 | -1.44 | 15.556 |
| 297 | NaN | The Bahamas | 2020-03-19T12:13:38 | 0 | 0 | 0 | 24.25 | -76.0 |
| 298 | NaN | The Gambia | 2020-03-18T14:13:56 | 0 | 0 | 0 | 13.4667 | -16.6 |


Look at the `shape` of the DataFrame. How many rows are there? Columns?

In [24]:
print(f'This DataFrame has {covid.shape[0]} rows and {covid.shape[1]} columns.')

This DataFrame has 299 rows and 8 columns.


## Selecting Subsets of Columns

Look at just the `Country/Region` and `Confirmed` columns.

In [25]:
covid[['Country/Region', 'Confirmed']]

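| | Country/Region | Confirmed |
| --- | --- | --- |
| 0 | China | 67800 |
| 1 | Italy | 47021 |
| 2 | Spain | 20410 |
| 3 | Germany | 19848 |
| 4 | Iran | 19644 |
| ... | ... | ... |
| 294 | Jersey | 0 |
| 295 | Puerto Rico | 0 |
| 296 | Republic of the Congo | 0 |
| 297 | The Bahamas | 0 |
| 298 | The Gambia | 0 |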
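
The bracket style determines what comes back from a column selection. As a small sketch (the `sample` frame is built by hand from rows shown earlier and is not the full dataset), single brackets with one name return a Series, while a list of names returns a DataFrame, even when the list holds a single name:

```python
import pandas as pd

# Small sample built from rows shown above (illustration only).
sample = pd.DataFrame({
    "Country/Region": ["China", "Italy", "Spain"],
    "Confirmed": [67800, 47021, 20410],
    "Deaths": [3133, 4032, 1043],
})

one_column = sample["Confirmed"]                       # single name -> Series
two_columns = sample[["Country/Region", "Confirmed"]]  # list of names -> DataFrame
still_a_frame = sample[["Confirmed"]]                  # one-item list -> one-column DataFrame

print(type(one_column).__name__, type(two_columns).__name__, type(still_a_frame).__name__)
```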


## Data Types

What are the datatypes of all of the columns?

In [26]:
covid.dtypes

Province/State     object
Country/Region     object
Last Update        object
Confirmed           int64
Deaths              int64
Recovered           int64
Latitude          float64
Longitude         float64
dtype: object
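
`Last Update` is stored as `object` (plain strings) rather than as a date. One possible follow-up, sketched here on a hand-built `sample` with three timestamps copied from the rows above, is to convert the column with `pd.to_datetime` so that date operations such as `max()` work chronologically:

```python
import pandas as pd

# Timestamps copied from the first rows of the dataset (illustration only).
sample = pd.DataFrame({
    "Country/Region": ["China", "Italy", "Germany"],
    "Last Update": ["2020-03-20T07:43:02", "2020-03-20T17:43:03", "2020-03-20T20:13:15"],
})

# Convert the ISO 8601 strings into real timestamps.
sample["Last Update"] = pd.to_datetime(sample["Last Update"])

print(sample.dtypes)

# With a datetime column, max() returns the most recent update.
latest = sample["Last Update"].max()
print(latest)
```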

## Summary Statistics

| statistic | meaning |
| --- | --- |
| count | Number of non-null elements |
| mean | Average of non-null elements |
| std | Standard deviation |
| min | The smallest value in the column |
| 25% | Value greater than 25% of data (lower quartile, Q1) |
| 50% | The middle value (50th percentile, Q2) |
| 75% | Value greater than 75% of data (upper quartile, Q3) |
| max | The largest value in the column |

Display the summary statistics of this dataset.

In [27]:
covid.describe()

| | Confirmed | Deaths | Recovered | Latitude | Longitude |
| --- | --- | --- | --- | --- | --- |
| count | 299.0 | 299.0 | 299.0 | 299.0 | 299.0 |
| mean | 910.257525 | 37.789298 | 292.317726 | 25.562647 | 4.227507 |
| std | 5232.353837 | 312.302854 | 3409.146535 | 23.37487 | 80.443658 |
| min | 0.0 | 0.0 | 0.0 | -41.4545 | -157.4983 |
| 25% | 8.0 | 0.0 | 0.0 | 12.87255 | -71.53655 |
| 50% | 64.0 | 0.0 | 0.0 | 31.8257 | 9.5375 |
| 75% | 253.5 | 2.0 | 5.0 | 42.19795 | 56.7638 |
| max | 67800.0 | 4032.0 | 58382.0 | 72.0 | 178.065 |


In [28]:
covid.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Confirmed | 299.0 | 910.257525 | 5232.353837 | 0.0 | 8.0 | 64.0 | 253.5 | 67800.0 |
| Deaths | 299.0 | 37.789298 | 312.302854 | 0.0 | 0.0 | 0.0 | 2.0 | 4032.0 |
| Recovered | 299.0 | 292.317726 | 3409.146535 | 0.0 | 0.0 | 0.0 | 5.0 | 58382.0 |
| Latitude | 299.0 | 25.562647 | 23.37487 | -41.4545 | 12.87255 | 31.8257 | 42.19795 | 72.0 |
| Longitude | 299.0 | 4.227507 | 80.443658 | -157.4983 | -71.53655 | 9.5375 | 56.7638 | 178.065 |


Based on the summary statistics, what is the mean number of confirmed cases across the dataset?

> 910.26 confirmed cases per row (each row is one reporting location)

Based on the summary statistics, what is the mean number of recovered patients across the dataset?

> 292.32 recovered patients per row

Does it make sense to include latitude and longitude in our summary statistics? Why or why not?

> No. Latitude and longitude are coordinates, not measurements, so statistics such as their mean or standard deviation have no useful interpretation. They are still useful for plotting or for understanding where the cases are located.
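
One way to act on this, sketched below on a hand-built `sample` holding the first five rows of the dataset, is to select only the count columns before calling `describe()`. The `count_columns` list is a name introduced here for illustration.

```python
import pandas as pd

# First five rows of the dataset, numeric columns only (illustration only).
sample = pd.DataFrame({
    "Confirmed": [67800, 47021, 20410, 19848, 19644],
    "Deaths": [3133, 4032, 1043, 67, 1433],
    "Recovered": [58382, 4440, 1588, 180, 6745],
    "Latitude": [30.9756, 41.8719, 40.4637, 51.1657, 32.4279],
    "Longitude": [112.2707, 12.5674, -3.7492, 10.4515, 53.688],
})

# Summarize only the columns where an average or a quartile is meaningful.
count_columns = ["Confirmed", "Deaths", "Recovered"]
summary = sample[count_columns].describe().T

print(summary[["count", "mean", "max"]])
```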

## Counts
Find the value counts of each `Country/Region`.

In [29]:
# count the rows for each Country/Region (both parameters are shown at their default values)
covid['Country/Region'].value_counts(dropna=True, normalize=False)

US                      57
China                   33
Canada                  11
France                   9
Australia                9
                        ..
Vietnam                  1
Colombia                 1
Moldova                  1
Panama                   1
United Arab Emirates     1
Name: Country/Region, Length: 174, dtype: int64

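By default, `value_counts` ignores null values. Pass `dropna=False` to count them as well. `Country/Region` has no null values in this dataset, so the output is the same as before:

In [30]:
# parameter: dropna=False
covid['Country/Region'].value_counts(dropna=False, normalize=False)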

US                      57
China                   33
Canada                  11
France                   9
Australia                9
                        ..
Vietnam                  1
Colombia                 1
Moldova                  1
Panama                   1
United Arab Emirates     1
Name: Country/Region, Length: 174, dtype: int64

There's also an option to normalize these counts. Here we get each value's proportion of all rows (a fraction between 0 and 1) rather than the raw count:

In [31]:
# parameter: normalize=True
covid['Country/Region'].value_counts(dropna=False, normalize=True)

US                      0.190635
China                   0.110368
Canada                  0.036789
France                  0.030100
Australia               0.030100
                          ...   
Vietnam                 0.003344
Colombia                0.003344
Moldova                 0.003344
Panama                  0.003344
United Arab Emirates    0.003344
Name: Country/Region, Length: 174, dtype: float64
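
Because `Country/Region` contains no nulls, the effect of `dropna` is not visible in the outputs above. The sketch below uses a toy Series with one missing value (made up for illustration, not taken from the dataset) to show how both parameters change the result:

```python
import pandas as pd

# Toy Series with one missing value (not taken from the dataset).
regions = pd.Series(["US", "US", "China", None])

# Default behaviour: the missing value is left out of the counts.
counts_default = regions.value_counts()
print(counts_default)

# dropna=False adds an entry for the missing value.
counts_with_null = regions.value_counts(dropna=False)
print(counts_with_null)

# normalize=True returns fractions of the non-null values instead of counts.
shares = regions.value_counts(normalize=True)
print(shares)
```

With `normalize=True` and the default `dropna=True`, the fractions are taken over the non-null values only, so they still sum to 1.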

## US Cases
Filter the DataFrame to keep only US observations. Save the result as a new DataFrame called `us_covid`.

In [32]:
# keep only the rows where Country/Region is 'US'
us_covid = covid[covid['Country/Region'] == 'US']
us_covid.head()

| | Province/State | Country/Region | Last Update | Confirmed | Deaths | Recovered | Latitude | Longitude |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 7 | New York | US | 2020-03-20T22:14:43 | 8310 | 42 | 0 | 42.1657 | -74.9481 |
| 15 | Washington | US | 2020-03-20T23:43:03 | 1524 | 83 | 0 | 47.4009 | -121.4905 |
| 20 | California | US | 2020-03-20T23:43:03 | 1177 | 23 | 0 | 36.1162 | -119.6816 |
| 27 | New Jersey | US | 2020-03-20T21:13:32 | 890 | 11 | 0 | 40.2989 | -74.521 |
| 35 | Illinois | US | 2020-03-20T21:13:32 | 585 | 5 | 0 | 40.3495 | -88.9861 |
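
The filter works through a boolean mask: the comparison produces one `True`/`False` value per row, and indexing with that mask keeps the `True` rows. The sketch below shows the intermediate mask on a hand-built `sample` (four rows copied from the outputs in this notebook) and how two conditions might be combined. The 1000-case threshold is an arbitrary choice for illustration.

```python
import pandas as pd

# Four rows copied from outputs in this notebook (illustration only).
sample = pd.DataFrame({
    "Province/State": ["Hubei", "New York", "Washington", "Tennessee"],
    "Country/Region": ["China", "US", "US", "US"],
    "Confirmed": [67800, 8310, 1524, 233],
    "Deaths": [3133, 42, 83, 0],
})

# The comparison yields a boolean Series with one value per row.
is_us = sample["Country/Region"] == "US"
print(is_us.tolist())  # [False, True, True, True]

# Indexing with the mask keeps only the rows where it is True.
us_rows = sample[is_us]

# Combine conditions with & (and) or | (or); wrap each comparison in parentheses.
us_over_1000 = sample[is_us & (sample["Confirmed"] > 1000)]
print(us_over_1000["Province/State"].tolist())
```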


Which state has the highest number of confirmed cases?

In [33]:
us_covid.sort_values('Confirmed', ascending=False).head(1)

| | Province/State | Country/Region | Last Update | Confirmed | Deaths | Recovered | Latitude | Longitude |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 7 | New York | US | 2020-03-20T22:14:43 | 8310 | 42 | 0 | 42.1657 | -74.9481 |


Which state has the highest number of deaths?

In [34]:
us_covid.sort_values('Deaths', ascending=False).head(1)

| | Province/State | Country/Region | Last Update | Confirmed | Deaths | Recovered | Latitude | Longitude |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 15 | Washington | US | 2020-03-20T23:43:03 | 1524 | 83 | 0 | 47.4009 | -121.4905 |
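
Sorting the whole frame and taking the first row works, but pandas also offers more direct ways to get a maximum. As an alternative sketch (not the approach used in the cells above), `nlargest` returns the top rows by a column and `idxmax` returns the index label of the largest value. The `us_sample` frame is built by hand from the five US rows shown earlier.

```python
import pandas as pd

# Five US rows copied from the output above (illustration only).
us_sample = pd.DataFrame({
    "Province/State": ["New York", "Washington", "California", "New Jersey", "Illinois"],
    "Confirmed": [8310, 1524, 1177, 890, 585],
    "Deaths": [42, 83, 23, 11, 5],
})

# nlargest(1, column) returns the single row with the largest value in that column.
top_confirmed = us_sample.nlargest(1, "Confirmed")

# idxmax() returns the index label of the maximum, which .loc can then look up.
deadliest_label = us_sample["Deaths"].idxmax()
deadliest_state = us_sample.loc[deadliest_label, "Province/State"]

print(top_confirmed["Province/State"].iloc[0], deadliest_state)
```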


List all states that have 0 deaths.

In [35]:
us_covid[us_covid['Deaths'] == 0]

| | Province/State | Country/Region | Last Update | Confirmed | Deaths | Recovered | Latitude | Longitude |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 79 | Tennessee | US | 2020-03-20T23:43:03 | 233 | 0 | 0 | 35.7478 | -86.6923 |
| 88 | North Carolina | US | 2020-03-20T23:43:03 | 172 | 0 | 0 | 35.6301 | -79.8064 |
| 115 | Minnesota | US | 2020-03-20T19:13:30 | 115 | 0 | 0 | 45.6945 | -93.9002 |
| 119 | Arkansas | US | 2020-03-20T21:43:03 | 96 | 0 | 0 | 34.9697 | -92.3731 |
| 129 | Alabama | US | 2020-03-20T21:43:03 | 83 | 0 | 0 | 32.3182 | -86.9023 |
| 133 | Arizona | US | 2020-03-20T19:13:30 | 78 | 0 | 0 | 33.7298 | -111.4312 |
| 134 | Utah | US | 2020-03-20T16:13:23 | 78 | 0 | 0 | 40.15 | -111.8624 |
| 150 | Maine | US | 2020-03-20T19:13:30 | 56 | 0 | 0 | 44.6939 | -69.3819 |
| 152 | Rhode Island | US | 2020-03-20T19:43:03 | 54 | 0 | 0 | 41.6809 | -71.5118 |
| 158 | Diamond Princess | US | 2020-03-20T19:43:03 | 49 | 0 | 0 | 35.4437 | 139.638 |


**Note:**  
The following cells will not run unless you install [plotly](https://plot.ly/python/getting-started/). Installing it is optional: you can simply look at the graph below and answer the question that follows it.

In [36]:
# only plotly express is used for the map below
import plotly.express as px

In [37]:
# need to get us state abbreviations for mapping to work
# source: https://stackabuse.com/using-plotly-library-for-interactive-data-visualization-in-python/
us_state_abbrev = {
    'Alabama': 'AL','Alaska': 'AK','Arizona': 'AZ','Arkansas': 'AR',
    'California': 'CA','Colorado': 'CO','Connecticut': 'CT','Delaware': 'DE',
    'Florida': 'FL','Georgia': 'GA','Hawaii': 'HI','Idaho': 'ID','Illinois': 'IL',
    'Indiana': 'IN','Iowa': 'IA','Kansas': 'KS','Kentucky': 'KY','Louisiana': 'LA',
    'Maine': 'ME','Maryland': 'MD','Massachusetts': 'MA','Michigan': 'MI','Minnesota': 'MN',
    'Mississippi': 'MS','Missouri': 'MO','Montana': 'MT','Nebraska': 'NE','Nevada': 'NV',
    'New Hampshire': 'NH','New Jersey': 'NJ','New Mexico': 'NM','New York': 'NY','North Carolina': 'NC',
    'North Dakota': 'ND','Ohio': 'OH','Oklahoma': 'OK','Oregon': 'OR','Pennsylvania': 'PA',
    'Rhode Island': 'RI','South Carolina': 'SC','South Dakota': 'SD','Tennessee': 'TN','Texas': 'TX',
    'Utah': 'UT','Vermont': 'VT','Virginia': 'VA','Washington': 'WA','West Virginia': 'WV',
    'Wisconsin': 'WI','Wyoming': 'WY',
}
# take an explicit copy first so the new column is not set on a slice of `covid`
us_covid = us_covid.copy()
us_covid['state_abbrev'] = us_covid['Province/State'].map(us_state_abbrev)
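
Adding a column to a filtered DataFrame is a common source of the `SettingWithCopyWarning` in many pandas versions, because pandas cannot tell whether the assignment is meant to reach the original frame. The sketch below shows one way to avoid the ambiguity: call `.copy()` on the filtered rows before assigning. It also shows what `.map` does with names that are missing from the dictionary. The `covid_sample` frame and the shortened `abbrev` dictionary are stand-ins built for illustration.

```python
import pandas as pd

# Four rows copied from outputs in this notebook (illustration only).
covid_sample = pd.DataFrame({
    "Province/State": ["Hubei", "New York", "Washington", "Diamond Princess"],
    "Country/Region": ["China", "US", "US", "US"],
    "Confirmed": [67800, 8310, 1524, 49],
})
abbrev = {"New York": "NY", "Washington": "WA"}  # shortened lookup for illustration

# .copy() gives the filtered rows their own data, so adding a column is unambiguous.
us_sample = covid_sample[covid_sample["Country/Region"] == "US"].copy()
us_sample["state_abbrev"] = us_sample["Province/State"].map(abbrev)

# Names missing from the dictionary (here the cruise ship) map to NaN.
unmapped = us_sample.loc[us_sample["state_abbrev"].isna(), "Province/State"].tolist()
print(unmapped)
```

Rows with a `NaN` abbreviation, such as the Diamond Princess, have no state code to place on a US state map, so it is worth checking for them before plotting.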



In [38]:
fig = px.choropleth(locations=us_covid['state_abbrev'], 
                    locationmode="USA-states", 
                    color=us_covid['Confirmed'],
                    color_continuous_scale="Viridis",
                    scope="usa",
                    labels={'color':'Confirmed'})
fig.show()

In [40]:
# save the figure as a static PNG; static export needs an extra dependency, see:
# https://plot.ly/python/static-image-export/
fig.write_image('uscov.png')

In case the figure above does not show up for you, I will display it here as well:
<img src="./assets/uscov.png" alt="drawing" width="550"/>

**Data literacy**: Looking at the above map, do you think this is an accurate depiction of real-world data? Why or why not?

> **Answer:** No, this map is misleading for at least two reasons. First, because testing was limited, it shows only cases confirmed by a test, not the total number of cases in the US. Second, states differ widely in population, so raw counts are not directly comparable. Dividing each state's case count by its population gives a fairer comparison: New York and California will naturally have more cases than Wyoming or Montana simply because far more people live there.
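
A per-capita comparison could be sketched as follows. The confirmed counts come from the March 20, 2020 report used in this notebook, but the population figures are rounded values assumed here purely for illustration and should be replaced with official estimates. The `per_capita` frame and its column names are also introduced only for this example.

```python
import pandas as pd

per_capita = pd.DataFrame({
    "state": ["New York", "Washington", "California"],
    # Confirmed counts from the March 20, 2020 report used above.
    "confirmed": [8310, 1524, 1177],
    # Rounded populations, ASSUMED for illustration; replace with official figures.
    "population": [19_500_000, 7_600_000, 39_500_000],
})

# Cases per 100,000 residents makes states of different sizes comparable.
per_capita["cases_per_100k"] = (
    per_capita["confirmed"] / per_capita["population"] * 100_000
)

ranked = per_capita.sort_values("cases_per_100k", ascending=False)
print(ranked[["state", "cases_per_100k"]].round(1))
```

Under these assumed populations, Washington and California have similar raw counts, yet Washington's rate per 100,000 residents is several times higher, which is exactly the kind of difference a raw-count map hides.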