<img src="https://github.com/djp840/MSDS_422_Public/blob/master/images/NorthwesternHeader.png?raw=1">

## MSDS422 Assignment 01:

<div class="alert alert-block alert-success">
    <b>More Technical</b>: Throughout the notebook, boxes of this type provide more technical details and extra references about what you are seeing. They contain helpful tips, but you can safely skip them the first time you run through the code.
</div>

### European Centre for Disease Prevention and Control 
<div class="alert alert-block alert-info">
https://www.ecdc.europa.eu/en/publications-data/download-todays-data-geographic-distribution-covid-19-cases-worldwide</div>

### Data Dictionary COVID-19 

The MSDS422_COVID19 data frame has 32771 rows and 10 columns.<br>
<br>
This data frame contains the following columns:

<b>Date</b><br>
Date, formatted as datetime64[ns]<br>
<br>
<b>Day</b><br>
Calendar day, dtype int64<br>
<br>
<b>Month</b><br>
Calendar month, dtype int64<br>
<br>
<b>Year</b><br>
Calendar year, dtype int64<br>
<br>
<b>Cases</b><br>
Number of Cases Per Day, dtype int64<br>
<br>
<b>Deaths</b><br>
Number of Deaths, dtype int64<br>
<br>
<b>Country </b><br>
Country Name, dtype object<br>
<br>
<b>Population</b><br>
Country Population<br>
<br>
<b>Continent</b><br>
Continent, one of the continuous expanses of land (Africa, Antarctica, Asia, Australia, Europe, North America, South America)<br>
<br>
<b>CumulativeNumberPer100KCases</b><br>
Cumulative number of COVID-19 cases over 14 days per 100,000 population<br>
<br>
<b>Sources:</b><br>

## Import packages 



In [None]:
import pandas as pd  
import numpy as np  
import scipy as sp
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
import sklearn
import math
from datetime import datetime

In [None]:
#pd.options.display.float_format = '{:.3f}'.format
%matplotlib inline

### Load Data (Local Directory)

In [None]:
covid19_dfA=pd.read_csv('./data/MSDS422_covid19_20200825v3.csv')

### Data Quality Review 

In [None]:
print("Shape:", covid19_dfA.shape,"\n")
print("Variable Types:") 
print(covid19_dfA.dtypes)

covid19_dfA.head(15)

## Exploratory Data Analysis (EDA)

### Number of Countries

In [None]:
len(covid19_dfA.Country.unique())
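
One detail worth knowing when counting distinct values: `unique()` keeps a missing entry as its own value, while `nunique()` skips missing entries by default. The sketch below shows the difference on a made-up frame; `toy_df` and its values are hypothetical and only stand in for `covid19_dfA`.

```python
import pandas as pd

# Hypothetical toy data (not from the COVID-19 file); one country name is missing.
toy_df = pd.DataFrame({"Country": ["A", "A", "B", None]})

# unique() returns every distinct entry, including the missing one.
n_with_missing = len(toy_df["Country"].unique())

# nunique() drops missing entries by default (dropna=True).
n_countries = toy_df["Country"].nunique()

print(n_with_missing, n_countries)
```

If the Country column has no missing records, both approaches return the same number.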

### Summary Statistics 

<div class="alert alert-block alert-warning">
The <b>count</b> row shows the number of non-null values in each column; a count lower than the number of rows indicates missing records.
</div> 

In [None]:
covid19_dfA.describe()
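
To make the role of the count row concrete, the sketch below subtracts it from the number of rows to get the missing records per numeric column. The frame `toy_df` is hypothetical, made-up data used only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for covid19_dfA; values are made up.
toy_df = pd.DataFrame({
    "Cases": [10, 25, np.nan, 40],
    "Deaths": [0, 1, 2, 3],
})

summary = toy_df.describe()

# Number of rows minus the non-null count gives the missing records per column.
missing_from_count = len(toy_df) - summary.loc["count"]
print(missing_from_count)
```

Note that `describe()` only summarizes numeric columns by default, so this check says nothing about object columns such as Country.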

### Review Dataset for Missing Values

<div class="alert alert-block alert-warning">
Review dataset for missing records
</div>

In [None]:
covid19_dfA.isnull().sum()
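
Raw counts of missing values can be hard to judge on a large frame. One possible follow-up, sketched here on hypothetical data (`toy_df` is made up), is to express them as a percentage and to pull out the affected rows for inspection.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the real notebook would use covid19_dfA instead.
toy_df = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "Population": [100.0, 100.0, np.nan, np.nan],
})

# The mean of a boolean mask is the share of True values, here the share missing.
missing_pct = toy_df.isnull().mean() * 100

# Rows that have at least one missing value in any column.
rows_with_gaps = toy_df[toy_df.isnull().any(axis=1)]

print(missing_pct)
print(rows_with_gaps)
```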

## Preprocess Data for Analysis

#### Format the Date column to the ISO 8601 standard (Year-Month-Day)

In [None]:
# Source dates are day/month/year strings; parse them directly to datetime64,
# which pandas displays in ISO 8601 order (YYYY-MM-DD)
covid19_dfA['Date']=pd.to_datetime(covid19_dfA['Date'], format='%d/%m/%Y')
covid19_dfA['Date'].head()
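
As a small illustration of why the explicit day-first format matters, the sketch below parses two made-up date strings. The names `raw_dates`, `parsed`, and `iso_text` are hypothetical and not part of the assignment data.

```python
import pandas as pd

# Hypothetical day/month/year strings; "01/02/2020" is 1 February, not 2 January.
raw_dates = pd.Series(["25/08/2020", "01/02/2020"])

# An explicit format removes the day/month ambiguity.
parsed = pd.to_datetime(raw_dates, format="%d/%m/%Y")

# A datetime column already displays in ISO 8601 order; strftime is only
# needed when the dates must be text again, for example in an export.
iso_text = parsed.dt.strftime("%Y-%m-%d")

print(parsed)
print(iso_text.tolist())
```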

#### Review Data Types (dtypes)

In [None]:
covid19_dfA.dtypes

In [None]:
covid19_dfA.isnull().sum()

In [None]:
covid19_dfA.head()

In [None]:
covid19_dfA.shape

In [None]:
covid19_dfA.dtypes

### Write out file
> - covid19_dfA.to_excel()
> - covid19_dfA.to_csv()
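
A minimal sketch of the CSV route follows, using a hypothetical toy frame and a temporary directory so nothing in the project folder is overwritten. The file name `covid19_clean.csv` is an assumption made for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy frame standing in for the preprocessed covid19_dfA.
toy_df = pd.DataFrame({
    "Date": pd.to_datetime(["2020-08-24", "2020-08-25"]),
    "Cases": [5, 7],
})

# Write to a temporary directory; the file name is made up for this example.
out_dir = tempfile.mkdtemp()
csv_path = os.path.join(out_dir, "covid19_clean.csv")

# index=False keeps the row index out of the file.
toy_df.to_csv(csv_path, index=False)

# CSV stores dates as text, so ask read_csv to parse the column on reload.
reloaded = pd.read_csv(csv_path, parse_dates=["Date"])
print(reloaded.dtypes)
```

`to_excel()` takes a path in the same way, but writing `.xlsx` files needs an Excel writer engine such as openpyxl to be installed.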

## Visualizing Data

In [None]:
world_daily = covid19_dfA.set_index('Date')
sns.set_color_codes("colorblind")
sns.set(rc={'figure.figsize':(15, 11)})
world_daily['Cases'].plot(linewidth = 2.5)


plt.title('Worldwide Cases Over Time', fontsize = 20)
plt.xlabel('Date', fontsize = 16)
plt.xticks(fontsize = 13)
plt.ylabel('Cases', fontsize = 16)
plt.yticks(fontsize = 13)

plt.show()
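
Because the data frame holds one row per country per day, a series indexed by Date alone contains many values for each date. If a single worldwide curve is the goal, one option is to sum across countries before plotting. The sketch below shows that aggregation on hypothetical data; `toy_df` and `world_totals` are made-up names.

```python
import pandas as pd

# Hypothetical toy data: two countries reporting on two days.
toy_df = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2020-08-24", "2020-08-24", "2020-08-25", "2020-08-25"]
    ),
    "Country": ["A", "B", "A", "B"],
    "Cases": [5, 10, 7, 12],
})

# Sum the daily cases across all countries, leaving one value per date.
world_totals = toy_df.groupby("Date")["Cases"].sum().sort_index()
print(world_totals)
```

The resulting series has a date index, so `world_totals.plot(linewidth=2.5)` would draw it in the same way as the cells in this section.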

In [None]:
sns.set(rc={'figure.figsize':(15,11)})
world_daily['Deaths'].plot(linewidth = 2.5)


plt.title('Worldwide Deaths Over Time', fontsize = 20)
plt.xlabel('Date', fontsize = 16)
plt.xticks(fontsize = 13)
plt.ylabel('Number of Deaths', fontsize = 16)
plt.yticks(fontsize = 13)

plt.show()

In [None]:
UScovid19_df = covid19_dfA[covid19_dfA["Country"] == "United_States_of_America"].reset_index()
US_daily = UScovid19_df.set_index('Date')

sns.set(rc={'figure.figsize':(15, 11)})
US_daily['Cases'].plot(linewidth = 2.5)

plt.title('US Cases Over Time', fontsize = 20)
plt.xlabel('Date', fontsize = 16)
plt.xticks(fontsize = 13)
plt.ylabel('Number of Cases', fontsize = 16)
plt.yticks(fontsize = 13)

plt.show()

In [None]:
UScovid19_df = covid19_dfA[covid19_dfA["Country"] == "United_States_of_America"].reset_index()
US_daily = UScovid19_df.set_index('Date')

sns.set(rc={'figure.figsize':(15, 11)})
US_daily['Deaths'].plot(linewidth = 2.5)

plt.title('US Deaths Over Time', fontsize = 20)
plt.xlabel('Date', fontsize = 16)
plt.xticks(fontsize = 13)
plt.ylabel('Number of Deaths', fontsize = 16)
plt.yticks(fontsize = 13)

plt.show()