# ETL Requirements

## Overview
The objective of the project is to evaluate whether there is any relationship between new cases and weather (temperature and humidity) for two cities: New York and São Paulo. This notebook pulls data from the selected data providers (see the discovery stage for more information) and transforms it into one target file that combines all necessary inputs. The final file is saved in the Resources folder as final_combine_data.csv.
This file is then used for the model phase of the project.

## Extract requirements

### Data providers:

    1. Extract New York new cases data from the NYC Health GitHub repository: https://github.com/nychealth/coronavirus-data
    2. Extract São Paulo new cases data from the Seade Foundation, the statistics agency of the State of São Paulo: https://saludata.saludcapital.gov.co/osb/index.php/datos-de-salud/enfermedades-trasmisibles/covid19/
    3. Load temperature data using the OpenWeatherMap history bulk product: www.openweathermap.org
    
### Data Dictionary of Data Sources:

    1. New York metadata: https://github.com/nychealth/coronavirus-data/blob/master/trends/Readme.md#cases-by-daycsv
    2. Brazil new cases by city metadata: https://github.com/seade-R/dados-covid-sp#dicion%C3%A1rio-de-vari%C3%A1veis-fontes-prim%C3%A1rias-e-demais-informa%C3%A7%C3%B5es-t%C3%A9cnicas
    3. Weather data from openweathermap.org: https://openweathermap.org/history-bulk#examples
 
### Source data file links

    1. New York file link: https://github.com/nychealth/coronavirus-data/blob/master/trends/cases-by-day.csv
    2. Brazil city data file link: https://raw.githubusercontent.com/seade-R/dados-covid-sp/master/data/dados_covid_sp.csv
    3. Weather data file link: https://history.openweathermap.org/storage/fa037ddb81b7f7f0a0d1a0ebd131858e.csv
    Note: the weather data was requested once as a bulk history up to May 16, 2021. If we decide to refresh the model using the latest data, new weather data must be added either through another history bulk request or through the daily API.
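
If the weather data is refreshed later, the new extract has to be appended to the existing bulk history without duplicating overlapping rows. The following is a minimal sketch of one way to do that in pandas. The column names (`city_name`, `dt`, `temp`, `humidity`) are assumptions about the weather file layout, and all values are made up for illustration.

```python
import pandas as pd

# Hypothetical frames; the column names (city_name, dt, temp, humidity) are
# assumptions about the weather file layout, and all values are made up.
history = pd.DataFrame({
    "city_name": ["New York", "New York"],
    "dt": [1621123200, 1621126800],
    "temp": [18.2, 18.9],
    "humidity": [55, 53],
})
update = pd.DataFrame({
    "city_name": ["New York", "New York"],
    "dt": [1621126800, 1621130400],  # first row overlaps with history
    "temp": [18.9, 19.4],
    "humidity": [53, 50],
})

# Stack both extracts, keep one row per city and timestamp, restore time order.
weather = (
    pd.concat([history, update], ignore_index=True)
    .drop_duplicates(subset=["city_name", "dt"], keep="last")
    .sort_values(["city_name", "dt"])
    .reset_index(drop=True)
)
print(len(weather))  # 3
```

Using `keep="last"` means that a row from the newer extract wins when both extracts cover the same city and timestamp.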

## Transformation requirements:

### New York data transformation requirements:
Note: ensure the data is ordered by date before running the rolling function.

    1. Extract data from the source into a data frame
    2. Validate the data types and values for date_of_interest, CASE_COUNT, CASE_COUNT_7DAY_AVG
    3. Rename column date_of_interest as "Reported_Date"
    4. Rename column CASE_COUNT as "New_Cases"
    5. Rename column CASE_COUNT_7DAY_AVG as "mavg_7day_new_cases"
    6. Add new column City and populate all rows with value "New York"
    7. Add new column is_newyork and populate all rows with 1
    8. Add new column Population and populate it with the New York City population found by googling the term "new york city 2020 population", i.e. 18804000
    9. Add new column Data_Source and populate it with value https://github.com/nychealth/coronavirus-data/blob/master/trends/data-by-day.csv
    10. Add new calculated column mavg_7day_new_cases using the pandas rolling function
    11. Add new calculated column new_cases_per_100K using formula (New_Cases/Population)*100000
    12. Add new calculated column mavg_7day_per_100k_new_cases using the pandas rolling function
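
The steps above can be sketched on a small, made-up sample as follows. This is an illustration under stated assumptions, not the notebook's final implementation: the helper name `transform_ny_cases` is hypothetical, the sample values are invented, and the sketch recomputes the 7-day average with `rolling` (step 10) instead of reusing the source column (step 5).

```python
import pandas as pd

# Hypothetical sample standing in for cases-by-day.csv (values are made up).
# The rows are deliberately out of date order.
raw = pd.DataFrame({
    "date_of_interest": pd.to_datetime(
        ["2020-03-03", "2020-03-01", "2020-03-02", "2020-03-04",
         "2020-03-05", "2020-03-06", "2020-03-07", "2020-03-08"]),
    "CASE_COUNT": [30, 10, 20, 40, 50, 60, 70, 80],
})

NY_POPULATION = 18804000  # population figure from step 8


def transform_ny_cases(df: pd.DataFrame, population: int) -> pd.DataFrame:
    """Hypothetical helper applying steps 3-12 to a raw cases frame."""
    out = df.rename(columns={"date_of_interest": "Reported_Date",
                             "CASE_COUNT": "New_Cases"}).copy()
    out["City"] = "New York"
    out["is_newyork"] = 1
    out["Population"] = population
    # Sort by date first: rolling() follows row order, not the date values.
    out = out.sort_values("Reported_Date").reset_index(drop=True)
    out["mavg_7day_new_cases"] = out["New_Cases"].rolling(window=7).mean()
    out["new_cases_per_100K"] = out["New_Cases"] / out["Population"] * 100000
    out["mavg_7day_per_100k_new_cases"] = (
        out["new_cases_per_100K"].rolling(window=7).mean())
    return out


ny_sample = transform_ny_cases(raw, NY_POPULATION)
print(ny_sample[["Reported_Date", "New_Cases", "mavg_7day_new_cases"]])
```

With the default `min_periods`, the first six rows of each rolling column are `NaN` because a full 7-day window is not yet available.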

In [2]:
# Import dependencies
import pandas as pd
import numpy as np
import math
import matplotlib.pyplot as plt
%matplotlib inline

## New York Data one time ETL

In [9]:
# import directly from GitHub
ny_raw_df = pd.read_csv("https://raw.githubusercontent.com/nychealth/coronavirus-data/master/trends/cases-by-day.csv",parse_dates=['date_of_interest'])
ny_raw_df.head()

Unnamed: 0,date_of_interest,CASE_COUNT,PROBABLE_CASE_COUNT,CASE_COUNT_7DAY_AVG,ALL_CASE_COUNT_7DAY_AVG,BX_CASE_COUNT,BX_PROBABLE_CASE_COUNT,BX_CASE_COUNT_7DAY_AVG,BX_ALL_CASE_COUNT_7DAY_AVG,BK_CASE_COUNT,...,MN_ALL_CASE_COUNT_7DAY_AVG,QN_CASE_COUNT,QN_PROBABLE_CASE_COUNT,QN_CASE_COUNT_7DAY_AVG,QN_ALL_CASE_COUNT_7DAY_AVG,SI_CASE_COUNT,SI_PROBABLE_CASE_COUNT,SI_CASE_COUNT_7DAY_AVG,SI_ALL_CASE_COUNT_7DAY_AVG,INCOMPLETE
0,2020-02-29,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2020-03-01,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,2020-03-02,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,2020-03-03,1,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
4,2020-03-04,5,0,0,0,0,0,0,0,1,...,0,2,0,0,0,0,0,0,0,0


In [10]:
#check columns data types
ny_raw_df.dtypes

date_of_interest              datetime64[ns]
CASE_COUNT                             int64
PROBABLE_CASE_COUNT                    int64
CASE_COUNT_7DAY_AVG                    int64
ALL_CASE_COUNT_7DAY_AVG                int64
BX_CASE_COUNT                          int64
BX_PROBABLE_CASE_COUNT                 int64
BX_CASE_COUNT_7DAY_AVG                 int64
BX_ALL_CASE_COUNT_7DAY_AVG             int64
BK_CASE_COUNT                          int64
BK_PROBABLE_CASE_COUNT                 int64
BK_CASE_COUNT_7DAY_AVG                 int64
BK_ALL_CASE_COUNT_7DAY_AVG             int64
MN_CASE_COUNT                          int64
MN_PROBABLE_CASE_COUNT                 int64
MN_CASE_COUNT_7DAY_AVG                 int64
MN_ALL_CASE_COUNT_7DAY_AVG             int64
QN_CASE_COUNT                          int64
QN_PROBABLE_CASE_COUNT                 int64
QN_CASE_COUNT_7DAY_AVG                 int64
QN_ALL_CASE_COUNT_7DAY_AVG             int64
SI_CASE_COUNT                          int64
SI_PROBABL

In [14]:
ny_raw_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 452 entries, 0 to 451
Data columns (total 26 columns):
 #   Column                      Non-Null Count  Dtype         
---  ------                      --------------  -----         
 0   date_of_interest            452 non-null    datetime64[ns]
 1   CASE_COUNT                  452 non-null    int64         
 2   PROBABLE_CASE_COUNT         452 non-null    int64         
 3   CASE_COUNT_7DAY_AVG         452 non-null    int64         
 4   ALL_CASE_COUNT_7DAY_AVG     452 non-null    int64         
 5   BX_CASE_COUNT               452 non-null    int64         
 6   BX_PROBABLE_CASE_COUNT      452 non-null    int64         
 7   BX_CASE_COUNT_7DAY_AVG      452 non-null    int64         
 8   BX_ALL_CASE_COUNT_7DAY_AVG  452 non-null    int64         
 9   BK_CASE_COUNT               452 non-null    int64         
 10  BK_PROBABLE_CASE_COUNT      452 non-null    int64         
 11  BK_CASE_COUNT_7DAY_AVG      452 non-null    int64         

In [22]:
#Statistical summary of raw data
ny_raw_df[["CASE_COUNT","CASE_COUNT_7DAY_AVG"]].describe()

Unnamed: 0,CASE_COUNT,CASE_COUNT_7DAY_AVG
count,452.0,452.0
mean,1730.181416,1728.433628
std,1617.196406,1518.278237
min,0.0,0.0
25%,366.75,349.75
50%,1049.5,1068.0
75%,3005.5,2916.0
max,6578.0,5291.0
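
The checks in the cells above are done by reading the output. As a complement, the same validation can be expressed as explicit checks that report problems. The sketch below assumes a hypothetical helper `validate_cases` and a made-up sample frame; the rules it applies (datetime dates, integer counts, no missing or negative values, no duplicate dates) are one reasonable reading of the validation requirement, not a list taken from the data provider.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype, is_integer_dtype

# Hypothetical sample with the three columns to validate (values are made up).
sample = pd.DataFrame({
    "date_of_interest": pd.to_datetime(["2020-02-29", "2020-03-01", "2020-03-02"]),
    "CASE_COUNT": [1, 0, 0],
    "CASE_COUNT_7DAY_AVG": [0, 0, 0],
})


def validate_cases(df: pd.DataFrame) -> list:
    """Hypothetical helper: return a list of problems found (empty if none)."""
    problems = []
    if not is_datetime64_any_dtype(df["date_of_interest"]):
        problems.append("date_of_interest is not a datetime column")
    for col in ["CASE_COUNT", "CASE_COUNT_7DAY_AVG"]:
        if not is_integer_dtype(df[col]):
            problems.append(f"{col} is not an integer column")
        if df[col].isna().any():
            problems.append(f"{col} contains missing values")
        if (df[col] < 0).any():
            problems.append(f"{col} contains negative counts")
    if df["date_of_interest"].duplicated().any():
        problems.append("date_of_interest contains duplicate dates")
    return problems


issues = validate_cases(sample)
print(issues)  # [] for the sample above
```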


In [23]:
# Take a copy of the needed columns so later column assignments modify an independent frame
ny_transform_df = ny_raw_df[["date_of_interest","CASE_COUNT","CASE_COUNT_7DAY_AVG"]].copy()
ny_transform_df.head()

Unnamed: 0,date_of_interest,CASE_COUNT,CASE_COUNT_7DAY_AVG
0,2020-02-29,1,0
1,2020-03-01,0,0
2,2020-03-02,0,0
3,2020-03-03,1,0
4,2020-03-04,5,0


In [24]:
ny_transform_df=ny_transform_df.rename(columns={"date_of_interest":"Reported_Date",
                                               "CASE_COUNT":"New_Cases",
                                               "CASE_COUNT_7DAY_AVG":"mavg_7day_new_cases"})
ny_transform_df.head()

Unnamed: 0,Reported_Date,New_Cases,mavg_7day_new_cases
0,2020-02-29,1,0
1,2020-03-01,0,0
2,2020-03-02,0,0
3,2020-03-03,1,0
4,2020-03-04,5,0


In [25]:
ny_transform_df["City"] ="New York"
ny_transform_df["Population"] = 18804000
ny_transform_df["Data_Source"] ="'https://github.com/nychealth/coronavirus-data/blob/master/trends/data-by-day.csv'"

ny_transform_df.head()

Unnamed: 0,Reported_Date,New_Cases,mavg_7day_new_cases,City,Population,Data_Source
0,2020-02-29,1,0,New York,18804000,'https://github.com/nychealth/coronavirus-data...
1,2020-03-01,0,0,New York,18804000,'https://github.com/nychealth/coronavirus-data...
2,2020-03-02,0,0,New York,18804000,'https://github.com/nychealth/coronavirus-data...
3,2020-03-03,1,0,New York,18804000,'https://github.com/nychealth/coronavirus-data...
4,2020-03-04,5,0,New York,18804000,'https://github.com/nychealth/coronavirus-data...


In [26]:
# Before running the window function to calculate the rolling average, check that the reported dates are in order
ny_transform_df

Unnamed: 0,Reported_Date,New_Cases,mavg_7day_new_cases,City,Population,Data_Source
0,2020-02-29,1,0,New York,18804000,'https://github.com/nychealth/coronavirus-data...
1,2020-03-01,0,0,New York,18804000,'https://github.com/nychealth/coronavirus-data...
2,2020-03-02,0,0,New York,18804000,'https://github.com/nychealth/coronavirus-data...
3,2020-03-03,1,0,New York,18804000,'https://github.com/nychealth/coronavirus-data...
4,2020-03-04,5,0,New York,18804000,'https://github.com/nychealth/coronavirus-data...
...,...,...,...,...,...,...
447,2021-05-21,335,350,New York,18804000,'https://github.com/nychealth/coronavirus-data...
448,2021-05-22,217,341,New York,18804000,'https://github.com/nychealth/coronavirus-data...
449,2021-05-23,168,327,New York,18804000,'https://github.com/nychealth/coronavirus-data...
450,2021-05-24,320,309,New York,18804000,'https://github.com/nychealth/coronavirus-data...
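
The displayed frame shows only the first and last rows, so it cannot confirm the order of every row. A programmatic check covers the whole column. The following minimal sketch uses the pandas `Series.is_monotonic_increasing` property on made-up sample frames.

```python
import pandas as pd

# Hypothetical sample: one frame already in date order, one shuffled copy.
ordered = pd.DataFrame({
    "Reported_Date": pd.to_datetime(["2020-02-29", "2020-03-01", "2020-03-02"]),
    "New_Cases": [1, 0, 0],
})
shuffled = ordered.iloc[[2, 0, 1]]

print(ordered["Reported_Date"].is_monotonic_increasing)   # True
print(shuffled["Reported_Date"].is_monotonic_increasing)  # False
```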


In [27]:
# To be on the safe side, re-sort the dataframe by reported date
ny_transform_df.sort_values(by='Reported_Date', inplace=True)
ny_transform_df

Unnamed: 0,Reported_Date,New_Cases,mavg_7day_new_cases,City,Population,Data_Source
0,2020-02-29,1,0,New York,18804000,'https://github.com/nychealth/coronavirus-data...
1,2020-03-01,0,0,New York,18804000,'https://github.com/nychealth/coronavirus-data...
2,2020-03-02,0,0,New York,18804000,'https://github.com/nychealth/coronavirus-data...
3,2020-03-03,1,0,New York,18804000,'https://github.com/nychealth/coronavirus-data...
4,2020-03-04,5,0,New York,18804000,'https://github.com/nychealth/coronavirus-data...
...,...,...,...,...,...,...
447,2021-05-21,335,350,New York,18804000,'https://github.com/nychealth/coronavirus-data...
448,2021-05-22,217,341,New York,18804000,'https://github.com/nychealth/coronavirus-data...
449,2021-05-23,168,327,New York,18804000,'https://github.com/nychealth/coronavirus-data...
450,2021-05-24,320,309,New York,18804000,'https://github.com/nychealth/coronavirus-data...


In [None]:
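# Add calculated column for new cases per 100K population: (New_Cases / Population) * 100000
ny_transform_df["new_cases_per_100K"] = (ny_transform_df["New_Cases"] / ny_transform_df["Population"]) * 100000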