# ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png) Pandas EDA
_Author: Noelle Brown (DSI DEN)_

## About the Dataset: COVID-19

This dataset is provided by the [Johns Hopkins University Center for Systems Science and Engineering](https://systems.jhu.edu/) for educational purposes. The complete GitHub repository containing daily updated data can be found [here](https://github.com/CSSEGISandData/COVID-19).

## Imports

First, import `pandas` and `matplotlib.pyplot`.
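A minimal sketch of the import cell. The alias `pd` is what the later `pd.read_csv` call expects; the alias `plt` is only the usual convention and is an assumption here.

```python
# Conventional aliases: pd for pandas, plt for matplotlib's pyplot interface
import pandas as pd
import matplotlib.pyplot as plt
```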

## Reading in Data

Read in the COVID-19 CSV. Note that we read the data from a GitHub link, but you can also pass a path to a file on your own machine. *This dataset was last updated on March 20, 2020, so the numbers may no longer be current.*

In [None]:
covid = pd.read_csv('https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_daily_reports/03-20-2020.csv')

## Inspecting our DataFrame: The basics

Look at the first 5 rows:
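One way to do this is `DataFrame.head()`, which returns the first 5 rows by default. The sketch below uses a small made-up DataFrame named `toy` (hypothetical values, not the real data) so it runs without a network connection; on the real data you would call the same method on `covid`.

```python
import pandas as pd

# Hypothetical toy rows standing in for the real `covid` DataFrame
toy = pd.DataFrame({
    "Country/Region": ["A", "B", "C", "D", "E", "F", "G"],
    "Confirmed": [10, 20, 30, 40, 50, 60, 70],
})

# head() returns the first 5 rows unless another number is passed, e.g. head(3)
first_five = toy.head()
print(first_five)
```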

Look at the `shape` of the DataFrame. How many rows are there? Columns?
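A short sketch of reading `shape`, again on a made-up `toy` DataFrame (hypothetical values). `shape` is an attribute, not a method, so it takes no parentheses, and it returns a `(rows, columns)` tuple.

```python
import pandas as pd

# Hypothetical toy data: 3 rows, 3 columns
toy = pd.DataFrame({
    "Country/Region": ["A", "B", "C"],
    "Confirmed": [1, 2, 3],
    "Deaths": [0, 0, 1],
})

# shape is a (number_of_rows, number_of_columns) tuple
n_rows, n_cols = toy.shape
print(f"{n_rows} rows, {n_cols} columns")
```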

## Selecting Subsets of Columns

Look at just the `Country/Region` and `Confirmed` columns.
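Selecting several columns is usually done by indexing with a list of column names, which is why the sketch below has double square brackets. The `toy` DataFrame and its values are hypothetical stand-ins for `covid`.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the prompt
toy = pd.DataFrame({
    "Country/Region": ["A", "B", "C"],
    "Confirmed": [1, 2, 3],
    "Deaths": [0, 0, 1],
})

# The inner brackets build a list of names; the outer brackets index the DataFrame
subset = toy[["Country/Region", "Confirmed"]]
print(subset)
```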

## Data Types

What are the datatypes of all of the columns?
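The `dtypes` attribute lists one data type per column. This is a sketch on a hypothetical `toy` DataFrame; text columns usually show up as `object` (or a dedicated string dtype in newer pandas versions), while counts and coordinates show up as integer and float types.

```python
import pandas as pd

# Hypothetical toy data mixing text, integer, and float columns
toy = pd.DataFrame({
    "Country/Region": ["A", "B"],
    "Confirmed": [1, 2],
    "Latitude": [1.5, 2.5],
})

# dtypes is a Series indexed by column name
print(toy.dtypes)
```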

## Summary Statistics

| statistic | meaning |
| --- | --- |
| count | Number of non-null elements |
| mean | Average of non-null elements |
| std | Standard deviation |
| min | The smallest value in the column |
| 25% | Value greater than 25% of data (lower quartile, Q1) |
| 50% | The middle value (50th percentile, Q2) |
| 75% | Value greater than 75% of data (upper quartile, Q3) |
| max | The largest value in the column |

Display the summary statistics of this dataset.
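`DataFrame.describe()` produces the table of statistics listed above and, by default, covers only the numeric columns when the DataFrame has any. The sketch below runs it on hypothetical `toy` values (not real case counts) and shows how a single statistic can be read out of the result with `.loc`.

```python
import pandas as pd

# Hypothetical toy data; the numbers are made up for illustration
toy = pd.DataFrame({
    "Country/Region": ["A", "B", "C", "D"],
    "Confirmed": [10, 20, 30, 40],
    "Recovered": [0, 5, 5, 10],
})

# describe() returns a DataFrame: statistics as rows, numeric columns as columns
summary = toy.describe()
print(summary)

# Read one statistic for one column by row label and column label
mean_confirmed = summary.loc["mean", "Confirmed"]
mean_recovered = summary.loc["mean", "Recovered"]
```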

Based on the summary statistics, what is the mean number of confirmed cases across the dataset?

> 

Based on the summary statistics, what is the mean number of recovered patients across the dataset?

> 

Does it make sense to include latitude and longitude in our summary statistics? Why or why not?

> 

## Counts
Find the value counts of each `Country/Region`.
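`Series.value_counts()` counts how many rows carry each distinct value and sorts the result from most to least frequent. A sketch on a hypothetical `toy` column:

```python
import pandas as pd

# Hypothetical toy data: one row per reporting location
toy = pd.DataFrame({
    "Country/Region": ["US", "US", "China", "Italy", "China", "US"],
})

# Number of rows per country, most frequent first
counts = toy["Country/Region"].value_counts()
print(counts)
```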

By default, `value_counts` ignores null values, but we can include them in the result by passing `dropna=False`:
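The sketch below compares the default behavior with `dropna=False`. It uses a hypothetical `Province/State` column in a made-up `toy` DataFrame, on the assumption that this is the kind of column where missing entries are likely.

```python
import pandas as pd

# Hypothetical toy data: 2 named provinces and 3 missing entries
toy = pd.DataFrame({
    "Province/State": ["Hubei", None, "New York", None, None],
})

# Default: missing values are left out of the counts
default_counts = toy["Province/State"].value_counts()

# dropna=False adds one extra entry that counts the missing values
with_nulls = toy["Province/State"].value_counts(dropna=False)
print(with_nulls)
```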

There's also an option to normalize these counts with `normalize=True`; here, we'll get each value's share of the rows (a fraction between 0 and 1) rather than the raw count:
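A sketch of normalized counts on hypothetical `toy` data. The result sums to 1; multiplying by 100 turns the fractions into percentages.

```python
import pandas as pd

# Hypothetical toy data: 4 rows across 3 countries
toy = pd.DataFrame({
    "Country/Region": ["US", "US", "China", "Italy"],
})

# normalize=True divides each count by the total number of non-null rows
shares = toy["Country/Region"].value_counts(normalize=True)

# Optional: express the shares as percentages
percents = shares * 100
print(percents)
```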

## US Cases
Filter the dataframe to only display US observations. Save this as a new dataframe called `us_covid`.
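Filtering rows is typically done with a boolean mask. The sketch below assumes US rows are labeled `"US"` in `Country/Region` (check the value counts to confirm the exact label) and uses hypothetical `toy` values. Calling `.copy()` on the result makes the new DataFrame independent of the original, which avoids pandas' `SettingWithCopyWarning` if columns are added to it later.

```python
import pandas as pd

# Hypothetical toy data; the numbers are made up for illustration
toy = pd.DataFrame({
    "Province/State": ["New York", "Hubei", "Washington", None],
    "Country/Region": ["US", "China", "US", "Italy"],
    "Confirmed": [100, 200, 50, 300],
})

# The comparison builds a True/False mask; indexing with it keeps the True rows
us_toy = toy[toy["Country/Region"] == "US"].copy()
print(us_toy)
```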

Which state has the highest number of confirmed cases?

Which state has the highest number of deaths?
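Both "highest value" questions can be approached the same way: `idxmax()` returns the index label of the largest value in a column, and `.loc` then looks up the state name on that row. This is a sketch on hypothetical numbers in a made-up `us_toy` DataFrame, not the real answer; sorting with `sort_values` is an equally reasonable alternative.

```python
import pandas as pd

# Hypothetical toy data; these are NOT real case counts
us_toy = pd.DataFrame({
    "Province/State": ["New York", "Washington", "California"],
    "Confirmed": [100, 50, 70],
    "Deaths": [3, 8, 2],
})

# idxmax() gives the row label of the maximum; .loc fetches the state on that row
top_confirmed = us_toy.loc[us_toy["Confirmed"].idxmax(), "Province/State"]
top_deaths = us_toy.loc[us_toy["Deaths"].idxmax(), "Province/State"]
print(top_confirmed, top_deaths)
```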

List all states that have 0 deaths.
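A possible approach combines a boolean mask with a column selection in a single `.loc` call. The `us_toy` DataFrame and its values below are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; these are NOT real case counts
us_toy = pd.DataFrame({
    "Province/State": ["New York", "Montana", "Washington", "Wyoming"],
    "Deaths": [3, 0, 8, 0],
})

# Rows where Deaths equals 0, keeping only the state name column
zero_deaths = us_toy.loc[us_toy["Deaths"] == 0, "Province/State"].tolist()
print(zero_deaths)
```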

**Note:**  
The following cells will not run unless you have [plotly](https://plot.ly/python/getting-started/) installed, which is optional for this exercise. If you skip the install, just look at the graph below and answer the question that follows it.

In [36]:
import pandas as pd
import plotly.express as px  # the only plotly module the cells below use

In [37]:
# need to get us state abbreviations for mapping to work
# source: https://stackabuse.com/using-plotly-library-for-interactive-data-visualization-in-python/
us_state_abbrev = {
    'Alabama': 'AL','Alaska': 'AK','Arizona': 'AZ','Arkansas': 'AR',
    'California': 'CA','Colorado': 'CO','Connecticut': 'CT','Delaware': 'DE',
    'Florida': 'FL','Georgia': 'GA','Hawaii': 'HI','Idaho': 'ID','Illinois': 'IL',
    'Indiana': 'IN','Iowa': 'IA','Kansas': 'KS','Kentucky': 'KY','Louisiana': 'LA',
    'Maine': 'ME','Maryland': 'MD','Massachusetts': 'MA','Michigan': 'MI','Minnesota': 'MN',
    'Mississippi': 'MS','Missouri': 'MO','Montana': 'MT','Nebraska': 'NE','Nevada': 'NV',
    'New Hampshire': 'NH','New Jersey': 'NJ','New Mexico': 'NM','New York': 'NY','North Carolina': 'NC',
    'North Dakota': 'ND','Ohio': 'OH','Oklahoma': 'OK','Oregon': 'OR','Pennsylvania': 'PA',
    'Rhode Island': 'RI','South Carolina': 'SC','South Dakota': 'SD','Tennessee': 'TN','Texas': 'TX',
    'Utah': 'UT','Vermont': 'VT','Virginia': 'VA','Washington': 'WA','West Virginia': 'WV',
    'Wisconsin': 'WI','Wyoming': 'WY',
}
# Work on an explicit copy so the column assignment below does not trigger
# pandas' SettingWithCopyWarning (us_covid was created by filtering covid)
us_covid = us_covid.copy()
us_covid['state_abbrev'] = us_covid['Province/State'].map(us_state_abbrev)

In [38]:
fig = px.choropleth(locations=us_covid['state_abbrev'], 
                    locationmode="USA-states", 
                    color=us_covid['Confirmed'],
                    color_continuous_scale="Viridis",
                    scope="usa",
                    labels={'color':'Confirmed'})
fig.show()

In [40]:
# Save the figure as a static image.
# Static export needs an extra package (kaleido in current plotly releases).
# https://plot.ly/python/static-image-export/
fig.write_image('uscov.png')

In case the figure above does not show up for you, I will display it here as well:
<img src="./assets/uscov.png" alt="drawing" width="550"/>

**Data literacy**: Looking at the above map, do you think this is an accurate depiction of real-world data? Why or why not?

> **Answer:**