# ETL Requirements

## Overview
The objective of the project is to evaluate whether there is any relationship between new COVID-19 cases and weather (temperature and humidity) for two cities: New York and Sao Paulo. This notebook pulls data from the selected data providers (see the discovery stage for more information) and transforms it into one target file that combines all necessary inputs. The final file is saved in the Resources folder as final_combine_data.csv.
The final file is then used in the modeling phase of the project.
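
As a rough illustration of what "one target file" could look like, the sketch below joins per-city case rows to daily weather rows. The join keys (`Reported_Date`, `City`), the inner join, and the `New_Cases` numbers are assumptions made for this example; the weather values are taken from the daily aggregates shown later in the notebook.

```python
import pandas as pd

# Toy per-city case rows (New_Cases values are made up for illustration).
ny = pd.DataFrame({
    "Reported_Date": pd.to_datetime(["2021-05-15", "2021-05-16"]),
    "City": "New York",
    "New_Cases": [300, 250],
})
sp = pd.DataFrame({
    "Reported_Date": pd.to_datetime(["2021-05-15", "2021-05-16"]),
    "City": "Sao Paulo",
    "New_Cases": [2000, 1800],
})

# Toy daily weather rows; Sao Paulo has no row for 2021-05-16 here.
weather = pd.DataFrame({
    "Reported_Date": pd.to_datetime(["2021-05-15", "2021-05-15", "2021-05-16"]),
    "City": ["New York", "Sao Paulo", "New York"],
    "daily_temp": [18.58, 17.32, 18.64],
    "daily_humidity": [39.08, 77.08, 41.83],
})

# Stack the two cities, then keep only the days that have both cases and weather.
cases = pd.concat([ny, sp], ignore_index=True)
combined = cases.merge(weather, on=["Reported_Date", "City"], how="inner")
print(combined)
```

With an inner join, a city/day without a weather row is dropped; a left join would keep it with missing weather values instead.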

## Extract requirements

### Data providers:

    1. Extract New York new cases data from the NYC Health GitHub repository: https://github.com/nychealth/coronavirus-data
    2. Extract Sao Paulo new cases data from the Seade Foundation (Statistics Agency of the State of São Paulo): https://github.com/seade-R/dados-covid-sp
    3. Load temperature and humidity data using the OpenWeatherMap history bulk product: www.openweathermap.org
    
### Data Dictionary of Data Sources:

    1. New York metadata: https://github.com/nychealth/coronavirus-data/blob/master/trends/Readme.md#cases-by-daycsv
    2. Brazil new cases by city metadata: https://github.com/seade-R/dados-covid-sp#dicion%C3%A1rio-de-vari%C3%A1veis-fontes-prim%C3%A1rias-e-demais-informa%C3%A7%C3%B5es-t%C3%A9cnicas
    3. Weather data from openweathermap.org: https://openweathermap.org/history-bulk#examples
 
### Source data file links

    1. New York file link: https://raw.githubusercontent.com/nychealth/coronavirus-data/master/trends/cases-by-day.csv
    2. Brazil city data file link: https://raw.githubusercontent.com/seade-R/dados-covid-sp/master/data/dados_covid_sp.csv
    3. Weather data file link: https://history.openweathermap.org/storage/fa037ddb81b7f7f0a0d1a0ebd131858e.csv
    Note: the weather data was requested once as a bulk history up to May 16, 2021. If the model is refreshed with the latest data, new weather data must be added, either through another history bulk request or through the daily API.



In [20]:
#Import dependencies
import pandas as pd
import numpy as np
import datetime as dt
import math
import matplotlib.pyplot as plt
%matplotlib inline


## New York Data one time ETL

### Transformation requirements:

#### New York data transformation requirements:
Note: ensure the data is ordered by date before running the rolling function.

    1. Extract data from the source into a data frame
    2. Validate the data types and data for date_of_interest, CASE_COUNT, CASE_COUNT_7DAY_AVG
    3. Rename column date_of_interest as "Reported_Date"
    4. Rename column CASE_COUNT as "New_Cases"
    5. Rename column CASE_COUNT_7DAY_AVG as "mavg_7day_new_cases"
    6. Add new column City and populate all rows with value "New York"
    7. Add new column is_newyork and populate all rows with value 1
    8. Add new column Population and populate it with the New York City population found by googling the term "new york city 2020 population", i.e. 18804000
    9. Add new column Data_Source and populate it with value https://github.com/nychealth/coronavirus-data/blob/master/trends/data-by-day.csv
    10. Add new column Extract_Date with today's date to stamp the date the data is downloaded
    11. Add new calculated column new_cases_per_100K using formula (New_Cases/Population)*100000
    12. Add new calculated column mavg_7day_per_100k_new_cases using the pandas rolling function on new_cases_per_100K
    13. Change the order of columns to "Extract_Date","Reported_Date","City","is_newyork","Population","Data_Source",
                          "New_Cases","mavg_7day_new_cases","new_cases_per_100K","mavg_7day_per_100k_new_cases"
    14. Export the final results as "NYC_Covid_New_Cases_Final.csv" in the Resources folder

In [133]:
#1. import directly from GitHub
ny_raw_df = pd.read_csv("https://raw.githubusercontent.com/nychealth/coronavirus-data/master/trends/cases-by-day.csv",parse_dates=['date_of_interest'])
ny_raw_df.head()

Unnamed: 0,date_of_interest,CASE_COUNT,PROBABLE_CASE_COUNT,CASE_COUNT_7DAY_AVG,ALL_CASE_COUNT_7DAY_AVG,BX_CASE_COUNT,BX_PROBABLE_CASE_COUNT,BX_CASE_COUNT_7DAY_AVG,BX_ALL_CASE_COUNT_7DAY_AVG,BK_CASE_COUNT,...,MN_ALL_CASE_COUNT_7DAY_AVG,QN_CASE_COUNT,QN_PROBABLE_CASE_COUNT,QN_CASE_COUNT_7DAY_AVG,QN_ALL_CASE_COUNT_7DAY_AVG,SI_CASE_COUNT,SI_PROBABLE_CASE_COUNT,SI_CASE_COUNT_7DAY_AVG,SI_ALL_CASE_COUNT_7DAY_AVG,INCOMPLETE
0,2020-02-29,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2020-03-01,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,2020-03-02,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,2020-03-03,1,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
4,2020-03-04,5,0,0,0,0,0,0,0,1,...,0,2,0,0,0,0,0,0,0,0


In [134]:
#2. check columns data types
ny_raw_df.dtypes

date_of_interest              datetime64[ns]
CASE_COUNT                             int64
PROBABLE_CASE_COUNT                    int64
CASE_COUNT_7DAY_AVG                    int64
ALL_CASE_COUNT_7DAY_AVG                int64
BX_CASE_COUNT                          int64
BX_PROBABLE_CASE_COUNT                 int64
BX_CASE_COUNT_7DAY_AVG                 int64
BX_ALL_CASE_COUNT_7DAY_AVG             int64
BK_CASE_COUNT                          int64
BK_PROBABLE_CASE_COUNT                 int64
BK_CASE_COUNT_7DAY_AVG                 int64
BK_ALL_CASE_COUNT_7DAY_AVG             int64
MN_CASE_COUNT                          int64
MN_PROBABLE_CASE_COUNT                 int64
MN_CASE_COUNT_7DAY_AVG                 int64
MN_ALL_CASE_COUNT_7DAY_AVG             int64
QN_CASE_COUNT                          int64
QN_PROBABLE_CASE_COUNT                 int64
QN_CASE_COUNT_7DAY_AVG                 int64
QN_ALL_CASE_COUNT_7DAY_AVG             int64
SI_CASE_COUNT                          int64
SI_PROBABL

In [135]:
#2.validate
ny_raw_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 454 entries, 0 to 453
Data columns (total 26 columns):
 #   Column                      Non-Null Count  Dtype         
---  ------                      --------------  -----         
 0   date_of_interest            454 non-null    datetime64[ns]
 1   CASE_COUNT                  454 non-null    int64         
 2   PROBABLE_CASE_COUNT         454 non-null    int64         
 3   CASE_COUNT_7DAY_AVG         454 non-null    int64         
 4   ALL_CASE_COUNT_7DAY_AVG     454 non-null    int64         
 5   BX_CASE_COUNT               454 non-null    int64         
 6   BX_PROBABLE_CASE_COUNT      454 non-null    int64         
 7   BX_CASE_COUNT_7DAY_AVG      454 non-null    int64         
 8   BX_ALL_CASE_COUNT_7DAY_AVG  454 non-null    int64         
 9   BK_CASE_COUNT               454 non-null    int64         
 10  BK_PROBABLE_CASE_COUNT      454 non-null    int64         
 11  BK_CASE_COUNT_7DAY_AVG      454 non-null    int64         

In [136]:
#2. Statistical summary of raw data
ny_raw_df[["CASE_COUNT","CASE_COUNT_7DAY_AVG"]].describe()

Unnamed: 0,CASE_COUNT,CASE_COUNT_7DAY_AVG
count,454.0,454.0
mean,1723.779736,1722.169604
std,1616.67578,1518.040463
min,0.0,0.0
25%,365.25,346.0
50%,1043.0,1053.0
75%,2999.0,2914.0
max,6578.0,5291.0
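
The validation in step 2 is done here by reading the `dtypes`, `info()` and `describe()` outputs. A minimal sketch of how the same checks might be automated follows; `validate_cases` is a hypothetical helper, not part of the notebook, and the specific rules (datetime dates, no duplicate dates, numeric non-negative counts) are assumptions about what "valid" means.

```python
import pandas as pd

def validate_cases(df, date_col, count_cols):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if not pd.api.types.is_datetime64_any_dtype(df[date_col]):
        problems.append(f"{date_col} is not a datetime column")
    if df[date_col].duplicated().any():
        problems.append(f"{date_col} has duplicate dates")
    for col in count_cols:
        if not pd.api.types.is_numeric_dtype(df[col]):
            problems.append(f"{col} is not numeric")
        elif df[col].isna().any() or (df[col] < 0).any():
            problems.append(f"{col} has missing or negative values")
    return problems

# Toy rows shaped like the first lines of cases-by-day.csv.
sample = pd.DataFrame({
    "date_of_interest": pd.to_datetime(["2020-02-29", "2020-03-01", "2020-03-02"]),
    "CASE_COUNT": [1, 0, 0],
    "CASE_COUNT_7DAY_AVG": [0, 0, 0],
})
print(validate_cases(sample, "date_of_interest", ["CASE_COUNT", "CASE_COUNT_7DAY_AVG"]))
```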


In [143]:
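# 2. Select the columns; .copy() gives an independent frame so the column assignments below do not act on a slice of ny_raw_df
ny_transform_df = ny_raw_df[["date_of_interest","CASE_COUNT","CASE_COUNT_7DAY_AVG"]].copy()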
ny_transform_df.head()

Unnamed: 0,date_of_interest,CASE_COUNT,CASE_COUNT_7DAY_AVG
0,2020-02-29,1,0
1,2020-03-01,0,0
2,2020-03-02,0,0
3,2020-03-03,1,0
4,2020-03-04,5,0


In [144]:
#3,4,5 Rename columns
ny_transform_df=ny_transform_df.rename(columns={"date_of_interest":"Reported_Date",
                                               "CASE_COUNT":"New_Cases",
                                               "CASE_COUNT_7DAY_AVG":"mavg_7day_new_cases"})
ny_transform_df.head()

Unnamed: 0,Reported_Date,New_Cases,mavg_7day_new_cases
0,2020-02-29,1,0
1,2020-03-01,0,0
2,2020-03-02,0,0
3,2020-03-03,1,0
4,2020-03-04,5,0


In [145]:
ny_transform_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 454 entries, 0 to 453
Data columns (total 3 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   Reported_Date        454 non-null    datetime64[ns]
 1   New_Cases            454 non-null    int64         
 2   mavg_7day_new_cases  454 non-null    int64         
dtypes: datetime64[ns](1), int64(2)
memory usage: 10.8 KB


In [146]:
#6,7,8,9,10 add new static columns
ny_transform_df["City"] ="New York"
ny_transform_df["is_newyork"] =1
ny_transform_df["Population"] = 18804000
ny_transform_df["Data_Source"] ="https://github.com/nychealth/coronavirus-data/blob/master/trends/data-by-day.csv"
ny_transform_df["Extract_Date"] = dt.datetime.now(dt.timezone.utc).date()
ny_transform_df.head()

Unnamed: 0,Reported_Date,New_Cases,mavg_7day_new_cases,City,is_newyork,Population,Data_Source,Extract_Date
0,2020-02-29,1,0,New York,1,18804000,https://github.com/nychealth/coronavirus-data/...,2021-05-30
1,2020-03-01,0,0,New York,1,18804000,https://github.com/nychealth/coronavirus-data/...,2021-05-30
2,2020-03-02,0,0,New York,1,18804000,https://github.com/nychealth/coronavirus-data/...,2021-05-30
3,2020-03-03,1,0,New York,1,18804000,https://github.com/nychealth/coronavirus-data/...,2021-05-30
4,2020-03-04,5,0,New York,1,18804000,https://github.com/nychealth/coronavirus-data/...,2021-05-30


In [147]:
# Before running rolling function ensure data is sorted by reported date
ny_transform_df.sort_values(by='Reported_Date' ,ascending=True, inplace=True)
ny_transform_df

Unnamed: 0,Reported_Date,New_Cases,mavg_7day_new_cases,City,is_newyork,Population,Data_Source,Extract_Date
0,2020-02-29,1,0,New York,1,18804000,https://github.com/nychealth/coronavirus-data/...,2021-05-30
1,2020-03-01,0,0,New York,1,18804000,https://github.com/nychealth/coronavirus-data/...,2021-05-30
2,2020-03-02,0,0,New York,1,18804000,https://github.com/nychealth/coronavirus-data/...,2021-05-30
3,2020-03-03,1,0,New York,1,18804000,https://github.com/nychealth/coronavirus-data/...,2021-05-30
4,2020-03-04,5,0,New York,1,18804000,https://github.com/nychealth/coronavirus-data/...,2021-05-30
...,...,...,...,...,...,...,...,...
449,2021-05-23,175,330,New York,1,18804000,https://github.com/nychealth/coronavirus-data/...,2021-05-30
450,2021-05-24,331,314,New York,1,18804000,https://github.com/nychealth/coronavirus-data/...,2021-05-30
451,2021-05-25,286,298,New York,1,18804000,https://github.com/nychealth/coronavirus-data/...,2021-05-30
452,2021-05-26,228,279,New York,1,18804000,https://github.com/nychealth/coronavirus-data/...,2021-05-30


In [148]:
#11 Add calculated column for new cases per 100K
ny_transform_df["new_cases_per_100K"]=round((ny_transform_df["New_Cases"]/ny_transform_df["Population"])*100000 ,2)
ny_transform_df 

Unnamed: 0,Reported_Date,New_Cases,mavg_7day_new_cases,City,is_newyork,Population,Data_Source,Extract_Date,new_cases_per_100K
0,2020-02-29,1,0,New York,1,18804000,https://github.com/nychealth/coronavirus-data/...,2021-05-30,0.01
1,2020-03-01,0,0,New York,1,18804000,https://github.com/nychealth/coronavirus-data/...,2021-05-30,0.00
2,2020-03-02,0,0,New York,1,18804000,https://github.com/nychealth/coronavirus-data/...,2021-05-30,0.00
3,2020-03-03,1,0,New York,1,18804000,https://github.com/nychealth/coronavirus-data/...,2021-05-30,0.01
4,2020-03-04,5,0,New York,1,18804000,https://github.com/nychealth/coronavirus-data/...,2021-05-30,0.03
...,...,...,...,...,...,...,...,...,...
449,2021-05-23,175,330,New York,1,18804000,https://github.com/nychealth/coronavirus-data/...,2021-05-30,0.93
450,2021-05-24,331,314,New York,1,18804000,https://github.com/nychealth/coronavirus-data/...,2021-05-30,1.76
451,2021-05-25,286,298,New York,1,18804000,https://github.com/nychealth/coronavirus-data/...,2021-05-30,1.52
452,2021-05-26,228,279,New York,1,18804000,https://github.com/nychealth/coronavirus-data/...,2021-05-30,1.21
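
The per-100K formula can be checked by hand against the last rows of the output above. The short sketch below recomputes them; `per_100k` is a name introduced only for this example.

```python
def per_100k(new_cases, population):
    """New cases per 100,000 residents, rounded to 2 decimals as in the notebook."""
    return round(new_cases / population * 100_000, 2)

population = 18_804_000  # value used for New York in this notebook
for new_cases in (175, 331, 286, 228):
    print(new_cases, per_100k(new_cases, population))
```

For example, 331 / 18,804,000 × 100,000 ≈ 1.7603, which rounds to the 1.76 shown for 2021-05-24.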


In [149]:
#12 Add rolling 7 days moving average for new_cases_per_100K
ny_transform_df["mavg_7day_per_100k_new_cases"] = round(ny_transform_df["new_cases_per_100K"].rolling(window=7,min_periods=1).mean(),2)
ny_transform_df.head()

Unnamed: 0,Reported_Date,New_Cases,mavg_7day_new_cases,City,is_newyork,Population,Data_Source,Extract_Date,new_cases_per_100K,mavg_7day_per_100k_new_cases
0,2020-02-29,1,0,New York,1,18804000,https://github.com/nychealth/coronavirus-data/...,2021-05-30,0.01,0.01
1,2020-03-01,0,0,New York,1,18804000,https://github.com/nychealth/coronavirus-data/...,2021-05-30,0.0,0.0
2,2020-03-02,0,0,New York,1,18804000,https://github.com/nychealth/coronavirus-data/...,2021-05-30,0.0,0.0
3,2020-03-03,1,0,New York,1,18804000,https://github.com/nychealth/coronavirus-data/...,2021-05-30,0.01,0.0
4,2020-03-04,5,0,New York,1,18804000,https://github.com/nychealth/coronavirus-data/...,2021-05-30,0.03,0.01
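
The `min_periods=1` argument matters for the first six rows. The sketch below, on a made-up series, contrasts it with the pandas default, where a 7-row window yields no value until seven observations are available.

```python
import pandas as pd

# Made-up daily counts, in date order.
cases = pd.Series([0, 0, 0, 1, 0, 7, 6, 14])

# Default: min_periods equals the window size, so the first six results are NaN.
strict = cases.rolling(window=7).mean()

# min_periods=1: early rows are averaged over however many values exist so far.
lenient = cases.rolling(window=7, min_periods=1).mean()

print(pd.DataFrame({"cases": cases, "strict": strict, "lenient": lenient}))
```

With `min_periods=1`, the early averages are computed over fewer than seven days, so they are not directly comparable to the full-window values that follow.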


In [150]:
# 13 Re order columns to create final data set
nyc_clean_df = ny_transform_df[["Extract_Date","Reported_Date","City","is_newyork","Population","Data_Source",
                          "New_Cases","mavg_7day_new_cases","new_cases_per_100K","mavg_7day_per_100k_new_cases"]]
nyc_clean_df

Unnamed: 0,Extract_Date,Reported_Date,City,is_newyork,Population,Data_Source,New_Cases,mavg_7day_new_cases,new_cases_per_100K,mavg_7day_per_100k_new_cases
0,2021-05-30,2020-02-29,New York,1,18804000,https://github.com/nychealth/coronavirus-data/...,1,0,0.01,0.01
1,2021-05-30,2020-03-01,New York,1,18804000,https://github.com/nychealth/coronavirus-data/...,0,0,0.00,0.00
2,2021-05-30,2020-03-02,New York,1,18804000,https://github.com/nychealth/coronavirus-data/...,0,0,0.00,0.00
3,2021-05-30,2020-03-03,New York,1,18804000,https://github.com/nychealth/coronavirus-data/...,1,0,0.01,0.00
4,2021-05-30,2020-03-04,New York,1,18804000,https://github.com/nychealth/coronavirus-data/...,5,0,0.03,0.01
...,...,...,...,...,...,...,...,...,...,...
449,2021-05-30,2021-05-23,New York,1,18804000,https://github.com/nychealth/coronavirus-data/...,175,330,0.93,1.76
450,2021-05-30,2021-05-24,New York,1,18804000,https://github.com/nychealth/coronavirus-data/...,331,314,1.76,1.67
451,2021-05-30,2021-05-25,New York,1,18804000,https://github.com/nychealth/coronavirus-data/...,286,298,1.52,1.58
452,2021-05-30,2021-05-26,New York,1,18804000,https://github.com/nychealth/coronavirus-data/...,228,279,1.21,1.48


In [151]:
#14 Export clean new york data as NYC_Covid_New_Cases_Final.csv
output_file = "Resources/NYC_Covid_New_Cases_Final.csv"
nyc_clean_df.to_csv(output_file,index=False, header=True)
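
The export assumes that the Resources folder already exists. A minimal sketch of a more defensive export, using `pathlib` so the path separator is handled per platform, is shown below; `export_csv` is a hypothetical helper and the temporary directory is used only to keep the example self-contained.

```python
from pathlib import Path
import tempfile

import pandas as pd

def export_csv(df, folder, file_name):
    """Create the folder if needed, write df without the index, return the path."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    target = folder / file_name
    df.to_csv(target, index=False)
    return target

demo = pd.DataFrame({"Reported_Date": ["2020-02-29"], "New_Cases": [1]})
with tempfile.TemporaryDirectory() as tmp:
    written = export_csv(demo, Path(tmp) / "Resources", "NYC_Covid_New_Cases_Final.csv")
    round_trip = pd.read_csv(written)  # read back to confirm the file is usable
print(round_trip)
```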

## Sao Paulo Data one time ETL

### Transformation requirements:

#### Sao Paulo data transformation requirements:
Note: ensure the data is ordered by date before running the rolling function.

    1. Extract data from the source into a data frame and filter it to rows where codigo_ibge=3550308
    2. Validate the data types and data for datahora, casos_novos, pop
    3. Rename column datahora as "Reported_Date"
    4. Rename column casos_novos as "New_Cases"
    5. Rename column pop as "Population"
    6. Add new column City and populate all rows with value "Sao Paulo"
    7. Add new column is_newyork and populate all rows with value 0
    8. Add new column Data_Source and populate it with value https://raw.githubusercontent.com/seade-R/dados-covid-sp/master/data/dados_covid_sp.csv
    9. Add new column Extract_Date with today's date to stamp the date the data is downloaded
    10. Add new calculated column new_cases_per_100K using formula (New_Cases/Population)*100000
    11. Add new calculated column mavg_7day_per_100k_new_cases using the pandas rolling function on new_cases_per_100K
    12. Add new calculated column mavg_7day_new_cases using the pandas rolling function on the New_Cases column
    13. Change the order of columns to match "Extract_Date","Reported_Date","City","is_newyork","Population","Data_Source",
                          "New_Cases","mavg_7day_new_cases","new_cases_per_100K","mavg_7day_per_100k_new_cases"
    14. Export the final results as "SP_Covid_New_Cases_Final.csv" in the Resources folder

In [152]:
#1. import directly from GitHub
sp_raw_df = pd.read_csv("https://raw.githubusercontent.com/seade-R/dados-covid-sp/master/data/dados_covid_sp.csv",parse_dates=['datahora'],sep=";")
sp_raw_df.head()

Unnamed: 0,nome_munic,codigo_ibge,dia,mes,datahora,casos,casos_novos,casos_pc,casos_mm7d,obitos,...,nome_drs,cod_drs,pop,pop_60,area,map_leg,map_leg_s,latitude,longitude,semana_epidem
0,Adamantina,3500105,25,2,2020-02-25,0,0,"0,00000000000000e+00",0,0,...,Marília,5,33894,7398,41199,0,8.0,-216820,-510737,9
1,Adolfo,3500204,25,2,2020-02-25,0,0,"0,00000000000000e+00",0,0,...,São José do Rio Preto,15,3447,761,21106,0,8.0,-212325,-496451,9
2,Aguaí,3500303,25,2,2020-02-25,0,0,"0,00000000000000e+00",0,0,...,São João da Boa Vista,14,35608,5245,47455,0,8.0,-220572,-469735,9
3,Águas da Prata,3500402,25,2,2020-02-25,0,0,"0,00000000000000e+00",0,0,...,São João da Boa Vista,14,7797,1729,14267,0,8.0,-219319,-467176,9
4,Águas de Lindóia,3500501,25,2,2020-02-25,0,0,"0,00000000000000e+00",0,0,...,Campinas,3,18374,3275,6013,0,8.0,-224733,-466314,9
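
The sample rows show `casos_pc` written with a decimal comma (for example "0,00000000000000e+00"), and the dtypes listing further down reports such columns as `object`. These columns are not used in this ETL, but if they were needed, the `decimal` parameter of `pd.read_csv` could parse them as numbers. The sketch below uses an inline two-row sample rather than the real file.

```python
import io

import pandas as pd

# Inline sample in the same layout as dados_covid_sp.csv: ";" separator, "," decimals.
raw = io.StringIO(
    "nome_munic;casos_novos;casos_pc\n"
    "São Paulo;3298;6,53600018871644e+03\n"
    "São Paulo;2318;6,55552897050126e+03\n"
)

# decimal="," tells the parser that "6,536...e+03" is a float, not text.
df = pd.read_csv(raw, sep=";", decimal=",")
print(df.dtypes)
```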


In [153]:
#1 filter data for Sao Paulo where codigo_ibge=3550308
sp_raw_df = sp_raw_df[sp_raw_df['codigo_ibge'] == 3550308]  
sp_raw_df

Unnamed: 0,nome_munic,codigo_ibge,dia,mes,datahora,casos,casos_novos,casos_pc,casos_mm7d,obitos,...,nome_drs,cod_drs,pop,pop_60,area,map_leg,map_leg_s,latitude,longitude,semana_epidem
562,São Paulo,3550308,25,2,2020-02-25,1,0,842484114962012e-03,0000000000000000,0,...,Grande São Paulo,10,11869660,1853286,152111,<50,7.0,-235329,-466395,9
1207,São Paulo,3550308,26,2,2020-02-26,1,0,842484114962012e-03,0000000000000000,0,...,Grande São Paulo,10,11869660,1853286,152111,<50,7.0,-235329,-466395,9
1852,São Paulo,3550308,27,2,2020-02-27,1,0,842484114962012e-03,0000000000000000,0,...,Grande São Paulo,10,11869660,1853286,152111,<50,7.0,-235329,-466395,9
2497,São Paulo,3550308,28,2,2020-02-28,2,1,168496822992402e-02,0000000000000000,0,...,Grande São Paulo,10,11869660,1853286,152111,<50,7.0,-235329,-466395,9
3142,São Paulo,3550308,29,2,2020-02-29,2,0,168496822992402e-02,0000000000000000,0,...,Grande São Paulo,10,11869660,1853286,152111,<50,7.0,-235329,-466395,9
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
294682,São Paulo,3550308,26,5,2021-05-26,775801,3298,"6,53600018871644e+03",2426000000000000000,30192,...,Grande São Paulo,10,11869660,1853286,152111,>5000,1.0,-235329,-466395,21
295327,São Paulo,3550308,27,5,2021-05-27,778119,2318,"6,55552897050126e+03",2319000000000000000,30340,...,Grande São Paulo,10,11869660,1853286,152111,>5000,1.0,-235329,-466395,21
295972,São Paulo,3550308,28,5,2021-05-28,778550,431,"6,55916007703675e+03",1965714285714285779,30476,...,Grande São Paulo,10,11869660,1853286,152111,>5000,1.0,-235329,-466395,21
296617,São Paulo,3550308,29,5,2021-05-29,780675,2125,"6,57706286447969e+03",1817285714285714221,30566,...,Grande São Paulo,10,11869660,1853286,152111,>5000,1.0,-235329,-466395,21


In [154]:
#2. check columns data types
sp_raw_df.dtypes

nome_munic               object
codigo_ibge               int64
dia                       int64
mes                       int64
datahora         datetime64[ns]
casos                     int64
casos_novos               int64
casos_pc                 object
casos_mm7d               object
obitos                    int64
obitos_novos              int64
obitos_pc                object
obitos_mm7d              object
letalidade               object
nome_ra                  object
cod_ra                    int64
nome_drs                 object
cod_drs                   int64
pop                       int64
pop_60                    int64
area                      int64
map_leg                  object
map_leg_s               float64
latitude                 object
longitude                object
semana_epidem             int64
dtype: object

In [155]:
#validate
sp_raw_df[["datahora","codigo_ibge","casos_novos","pop"]].describe()

Unnamed: 0,codigo_ibge,casos_novos,pop
count,461.0,461.0,461.0
mean,3550308.0,1696.733189,11869660.0
std,0.0,1414.240966,0.0
min,3550308.0,0.0,11869660.0
25%,3550308.0,545.0,11869660.0
50%,3550308.0,1408.0,11869660.0
75%,3550308.0,2572.0,11869660.0
max,3550308.0,8646.0,11869660.0


In [156]:
#3. Rename column datahora as "Reported_Date"
#4. Rename column casos_novos as "New_Cases"
#5. Rename column pop as "Population"
sp_transform_df = sp_raw_df[["datahora","casos_novos","pop"]].copy()
sp_transform_df = sp_transform_df.rename(columns={"datahora":"Reported_Date",
                                               "casos_novos":"New_Cases",
                                               "pop":"Population"})
sp_transform_df.head()

Unnamed: 0,Reported_Date,New_Cases,Population
562,2020-02-25,0,11869660
1207,2020-02-26,0,11869660
1852,2020-02-27,0,11869660
2497,2020-02-28,1,11869660
3142,2020-02-29,0,11869660


In [157]:
#6,7,8,9  add new static columns
sp_transform_df["City"] ="Sao Paulo"
sp_transform_df["is_newyork"] =0
sp_transform_df["Data_Source"] ="https://raw.githubusercontent.com/seade-R/dados-covid-sp/master/data/dados_covid_sp.csv"
sp_transform_df["Extract_Date"] = dt.datetime.now(dt.timezone.utc).date()
sp_transform_df.head()

Unnamed: 0,Reported_Date,New_Cases,Population,City,is_newyork,Data_Source,Extract_Date
562,2020-02-25,0,11869660,Sao Paulo,0,https://raw.githubusercontent.com/seade-R/dado...,2021-05-30
1207,2020-02-26,0,11869660,Sao Paulo,0,https://raw.githubusercontent.com/seade-R/dado...,2021-05-30
1852,2020-02-27,0,11869660,Sao Paulo,0,https://raw.githubusercontent.com/seade-R/dado...,2021-05-30
2497,2020-02-28,1,11869660,Sao Paulo,0,https://raw.githubusercontent.com/seade-R/dado...,2021-05-30
3142,2020-02-29,0,11869660,Sao Paulo,0,https://raw.githubusercontent.com/seade-R/dado...,2021-05-30
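
The cell for step 10 does not appear in this notebook, although the outputs that follow already contain `new_cases_per_100K`. A minimal sketch, assuming the calculation mirrors the New York one, is shown below on a small stand-in frame (`sp_demo` is a name used only here) built from the last four Sao Paulo rows.

```python
import pandas as pd

# Stand-in for sp_transform_df with only the columns the formula needs.
sp_demo = pd.DataFrame({
    "New_Cases": [3298, 2318, 431, 2125],
    "Population": 11869660,
})

# Step 10: new cases per 100,000 residents, rounded to 2 decimals.
sp_demo["new_cases_per_100K"] = round(
    (sp_demo["New_Cases"] / sp_demo["Population"]) * 100000, 2
)
print(sp_demo)
```

The results (27.79, 19.53, 3.63, 17.90) agree with the values shown for 2021-05-26 to 2021-05-29 in the later outputs.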


In [101]:
# Before running the window function to calculate the rolling average, check that the reported dates are in order
sp_transform_df

Unnamed: 0,Reported_Date,New_Cases,Population,City,is_newyork,Data_Source,Extract_Date,new_cases_per_100K
562,2020-02-25,0,11869660,Sao Paulo,0,https://raw.githubusercontent.com/seade-R/dado...,2021-05-30,0.00
1207,2020-02-26,0,11869660,Sao Paulo,0,https://raw.githubusercontent.com/seade-R/dado...,2021-05-30,0.00
1852,2020-02-27,0,11869660,Sao Paulo,0,https://raw.githubusercontent.com/seade-R/dado...,2021-05-30,0.00
2497,2020-02-28,1,11869660,Sao Paulo,0,https://raw.githubusercontent.com/seade-R/dado...,2021-05-30,0.01
3142,2020-02-29,0,11869660,Sao Paulo,0,https://raw.githubusercontent.com/seade-R/dado...,2021-05-30,0.00
...,...,...,...,...,...,...,...,...
294037,2021-05-25,2737,11869660,Sao Paulo,0,https://raw.githubusercontent.com/seade-R/dado...,2021-05-30,23.06
294682,2021-05-26,3298,11869660,Sao Paulo,0,https://raw.githubusercontent.com/seade-R/dado...,2021-05-30,27.79
295327,2021-05-27,2318,11869660,Sao Paulo,0,https://raw.githubusercontent.com/seade-R/dado...,2021-05-30,19.53
295972,2021-05-28,431,11869660,Sao Paulo,0,https://raw.githubusercontent.com/seade-R/dado...,2021-05-30,3.63


In [160]:
# reorder by reported date 
sp_transform_df = sp_transform_df.sort_values(by='Reported_Date', ascending=True).reset_index()
sp_transform_df

Unnamed: 0,index,Reported_Date,New_Cases,Population,City,is_newyork,Data_Source,Extract_Date,new_cases_per_100K
0,562,2020-02-25,0,11869660,Sao Paulo,0,https://raw.githubusercontent.com/seade-R/dado...,2021-05-30,0.00
1,1207,2020-02-26,0,11869660,Sao Paulo,0,https://raw.githubusercontent.com/seade-R/dado...,2021-05-30,0.00
2,1852,2020-02-27,0,11869660,Sao Paulo,0,https://raw.githubusercontent.com/seade-R/dado...,2021-05-30,0.00
3,2497,2020-02-28,1,11869660,Sao Paulo,0,https://raw.githubusercontent.com/seade-R/dado...,2021-05-30,0.01
4,3142,2020-02-29,0,11869660,Sao Paulo,0,https://raw.githubusercontent.com/seade-R/dado...,2021-05-30,0.00
...,...,...,...,...,...,...,...,...,...
456,294682,2021-05-26,3298,11869660,Sao Paulo,0,https://raw.githubusercontent.com/seade-R/dado...,2021-05-30,27.79
457,295327,2021-05-27,2318,11869660,Sao Paulo,0,https://raw.githubusercontent.com/seade-R/dado...,2021-05-30,19.53
458,295972,2021-05-28,431,11869660,Sao Paulo,0,https://raw.githubusercontent.com/seade-R/dado...,2021-05-30,3.63
459,296617,2021-05-29,2125,11869660,Sao Paulo,0,https://raw.githubusercontent.com/seade-R/dado...,2021-05-30,17.90


In [161]:
#11,12 Add rolling 7 days moving average for new_cases_per_100K and new cases
sp_transform_df["mavg_7day_per_100k_new_cases"] = round(sp_transform_df["new_cases_per_100K"].rolling(window=7,min_periods=1).mean(),2)
sp_transform_df["mavg_7day_new_cases"] = round(sp_transform_df["New_Cases"].rolling(window=7,min_periods=1).mean(),2)
sp_transform_df.head()

Unnamed: 0,index,Reported_Date,New_Cases,Population,City,is_newyork,Data_Source,Extract_Date,new_cases_per_100K,mavg_7day_per_100k_new_cases,mavg_7day_new_cases
0,562,2020-02-25,0,11869660,Sao Paulo,0,https://raw.githubusercontent.com/seade-R/dado...,2021-05-30,0.0,0.0,0.0
1,1207,2020-02-26,0,11869660,Sao Paulo,0,https://raw.githubusercontent.com/seade-R/dado...,2021-05-30,0.0,0.0,0.0
2,1852,2020-02-27,0,11869660,Sao Paulo,0,https://raw.githubusercontent.com/seade-R/dado...,2021-05-30,0.0,0.0,0.0
3,2497,2020-02-28,1,11869660,Sao Paulo,0,https://raw.githubusercontent.com/seade-R/dado...,2021-05-30,0.01,0.0,0.25
4,3142,2020-02-29,0,11869660,Sao Paulo,0,https://raw.githubusercontent.com/seade-R/dado...,2021-05-30,0.0,0.0,0.2


In [162]:
# 13 Re order columns to create final data set
sp_clean_df = sp_transform_df[["Extract_Date","Reported_Date","City","is_newyork","Population","Data_Source",
                          "New_Cases","mavg_7day_new_cases","new_cases_per_100K","mavg_7day_per_100k_new_cases"]]
sp_clean_df

Unnamed: 0,Extract_Date,Reported_Date,City,is_newyork,Population,Data_Source,New_Cases,mavg_7day_new_cases,new_cases_per_100K,mavg_7day_per_100k_new_cases
0,2021-05-30,2020-02-25,Sao Paulo,0,11869660,https://raw.githubusercontent.com/seade-R/dado...,0,0.00,0.00,0.00
1,2021-05-30,2020-02-26,Sao Paulo,0,11869660,https://raw.githubusercontent.com/seade-R/dado...,0,0.00,0.00,0.00
2,2021-05-30,2020-02-27,Sao Paulo,0,11869660,https://raw.githubusercontent.com/seade-R/dado...,0,0.00,0.00,0.00
3,2021-05-30,2020-02-28,Sao Paulo,0,11869660,https://raw.githubusercontent.com/seade-R/dado...,1,0.25,0.01,0.00
4,2021-05-30,2020-02-29,Sao Paulo,0,11869660,https://raw.githubusercontent.com/seade-R/dado...,0,0.20,0.00,0.00
...,...,...,...,...,...,...,...,...,...,...
456,2021-05-30,2021-05-26,Sao Paulo,0,11869660,https://raw.githubusercontent.com/seade-R/dado...,3298,2426.00,27.79,20.44
457,2021-05-30,2021-05-27,Sao Paulo,0,11869660,https://raw.githubusercontent.com/seade-R/dado...,2318,2319.00,19.53,19.54
458,2021-05-30,2021-05-28,Sao Paulo,0,11869660,https://raw.githubusercontent.com/seade-R/dado...,431,1965.71,3.63,16.56
459,2021-05-30,2021-05-29,Sao Paulo,0,11869660,https://raw.githubusercontent.com/seade-R/dado...,2125,1817.29,17.90,15.31


In [163]:
#14 Export clean sao paulo data as SP_Covid_New_Cases_Final.csv
output_file = "Resources/SP_Covid_New_Cases_Final.csv"
sp_clean_df.to_csv(output_file,index=False, header=True)

## Weather Data one time ETL

Notes:

    1. The weather data was a one-time pull from openweathermap.org using their history bulk product.
    2. The one-time weather data covers Jan 1st, 2020 to May 16th, 2021.
    3. The historical weather data is hourly.
     
### Transformation requirements:

#### Weather data transformation requirements:
Note: ensure the data is ordered by date before running the rolling function.

    1. Extract data from the source into a data frame
    2. Validate the data types and data for dt_iso, city_name, temp, humidity, temp_min, temp_max
    3. Add new column Reported_Date and set it to dt_iso after converting to a date
    4. Aggregate hourly data to daily by Reported_Date and city_name, keeping only mean temp, min of temp_min, max of temp_max and mean humidity; drop all other weather columns
    5. Add new columns daily_temp and daily_humidity equal to the mean daily temp and mean daily humidity, rounded to 2 decimal places
    6. Create the final ready-to-use weather file "NYC_SP_Daily_Weather_Final.csv" in the Resources folder with the final column list "Reported_Date","City","daily_temp","daily_humidity","mavg_7_temp","mavg_7_humidity","mavg_15_temp","mavg_15_humidity"
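
The final column list includes 7-day and 15-day moving averages, and the file holds two cities, so the rolling windows need to run per city rather than over the interleaved rows. A minimal sketch of one way to do that follows; the use of `groupby(...).transform`, `min_periods=1`, and the toy frame `daily` are assumptions for illustration, not the notebook's confirmed method.

```python
import pandas as pd

# Toy daily rows using the first aggregated values shown in this notebook.
daily = pd.DataFrame({
    "Reported_Date": pd.to_datetime(
        ["2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02"]
    ),
    "City": ["New York", "Sao Paulo", "New York", "Sao Paulo"],
    "daily_temp": [4.07, 25.67, 3.26, 22.89],
    "daily_humidity": [60.71, 66.21, 58.62, 81.40],
})

# Order by city and date so each city's window runs over consecutive days.
daily = daily.sort_values(["City", "Reported_Date"]).reset_index(drop=True)

for window in (7, 15):
    for measure in ("temp", "humidity"):
        # Rolling mean computed separately inside each city group.
        daily[f"mavg_{window}_{measure}"] = (
            daily.groupby("City")[f"daily_{measure}"]
            .transform(lambda s: s.rolling(window=window, min_periods=1).mean())
            .round(2)
        )
print(daily)
```

Without the grouping, a window would mix New York and Sao Paulo readings from the same dates.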


In [112]:
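#1. Load the bulk history file; dt_iso values such as "2020-01-01 00:00:00 +0000 UTC" stay as strings and are converted in step 3
weather_raw_df = pd.read_csv("https://history.openweathermap.org/storage/fa037ddb81b7f7f0a0d1a0ebd131858e.csv")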
weather_raw_df.head()

Unnamed: 0,dt,dt_iso,timezone,city_name,lat,lon,temp,feels_like,temp_min,temp_max,...,wind_deg,rain_1h,rain_3h,snow_1h,snow_3h,clouds_all,weather_id,weather_main,weather_description,weather_icon
0,1577836800,2020-01-01 00:00:00 +0000 UTC,-18000,New York,40.712775,-74.005973,6.36,2.94,5.0,7.22,...,250,,,,,97,804,Clouds,overcast clouds,04n
1,1577840400,2020-01-01 01:00:00 +0000 UTC,-18000,New York,40.712775,-74.005973,6.52,2.4,5.34,7.22,...,260,0.25,,,,90,500,Rain,light rain,10n
2,1577844000,2020-01-01 02:00:00 +0000 UTC,-18000,New York,40.712775,-74.005973,6.11,1.22,5.0,7.22,...,260,0.42,,,,90,500,Rain,light rain,10n
3,1577847600,2020-01-01 03:00:00 +0000 UTC,-18000,New York,40.712775,-74.005973,5.8,0.51,5.0,6.67,...,260,,,,,90,804,Clouds,overcast clouds,04n
4,1577851200,2020-01-01 04:00:00 +0000 UTC,-18000,New York,40.712775,-74.005973,5.46,0.67,4.44,6.11,...,260,,,,,90,804,Clouds,overcast clouds,04n


In [164]:
#2 keep only needed columns
weather_transform_df=weather_raw_df[["dt_iso","city_name","temp","temp_min","temp_max","humidity"]]
weather_transform_df

Unnamed: 0,dt_iso,city_name,temp,temp_min,temp_max,humidity
0,2020-01-01 00:00:00 +0000 UTC,New York,6.36,5.00,7.22,76
1,2020-01-01 01:00:00 +0000 UTC,New York,6.52,5.34,7.22,75
2,2020-01-01 02:00:00 +0000 UTC,New York,6.11,5.00,7.22,75
3,2020-01-01 03:00:00 +0000 UTC,New York,5.80,5.00,6.67,75
4,2020-01-01 04:00:00 +0000 UTC,New York,5.46,4.44,6.11,70
...,...,...,...,...,...,...
24799,2021-05-16 19:00:00 +0000 UTC,Sao Paulo,22.12,20.70,23.00,49
24800,2021-05-16 20:00:00 +0000 UTC,Sao Paulo,21.15,18.35,24.44,56
24801,2021-05-16 21:00:00 +0000 UTC,Sao Paulo,18.24,14.45,21.00,77
24802,2021-05-16 22:00:00 +0000 UTC,Sao Paulo,18.68,14.45,23.33,72


In [165]:
# 2 data checking
weather_transform_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24804 entries, 0 to 24803
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   dt_iso     24804 non-null  object 
 1   city_name  24804 non-null  object 
 2   temp       24804 non-null  float64
 3   temp_min   24804 non-null  float64
 4   temp_max   24804 non-null  float64
 5   humidity   24804 non-null  int64  
dtypes: float64(3), int64(1), object(2)
memory usage: 1.1+ MB


In [166]:
#checking values
weather_transform_df.describe()

Unnamed: 0,temp,temp_min,temp_max,humidity
count,24804.0,24804.0,24804.0,24804.0
mean,16.21036,14.426702,17.957804,68.276488
std,8.525136,8.315217,8.774026,18.321267
min,-9.68,-11.11,-7.95,10.0
25%,10.36,8.89,11.67,55.0
50%,18.02,16.11,20.0,72.0
75%,22.38,20.56,24.44,83.0
max,36.67,36.0,37.22,100.0


In [167]:
#3 Add reported date column; work on a copy so the assignment does not target a slice of weather_raw_df
weather_transform_df = weather_transform_df.copy()
weather_transform_df["Reported_Date"] = pd.to_datetime(weather_transform_df['dt_iso'].str[:10])
weather_transform_df


Unnamed: 0,dt_iso,city_name,temp,temp_min,temp_max,humidity,Reported_Date
0,2020-01-01 00:00:00 +0000 UTC,New York,6.36,5.00,7.22,76,2020-01-01
1,2020-01-01 01:00:00 +0000 UTC,New York,6.52,5.34,7.22,75,2020-01-01
2,2020-01-01 02:00:00 +0000 UTC,New York,6.11,5.00,7.22,75,2020-01-01
3,2020-01-01 03:00:00 +0000 UTC,New York,5.80,5.00,6.67,75,2020-01-01
4,2020-01-01 04:00:00 +0000 UTC,New York,5.46,4.44,6.11,70,2020-01-01
...,...,...,...,...,...,...,...
24799,2021-05-16 19:00:00 +0000 UTC,Sao Paulo,22.12,20.70,23.00,49,2021-05-16
24800,2021-05-16 20:00:00 +0000 UTC,Sao Paulo,21.15,18.35,24.44,56,2021-05-16
24801,2021-05-16 21:00:00 +0000 UTC,Sao Paulo,18.24,14.45,21.00,77,2021-05-16
24802,2021-05-16 22:00:00 +0000 UTC,Sao Paulo,18.68,14.45,23.33,72,2021-05-16
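
The `SettingWithCopyWarning` printed above usually means `weather_transform_df` was created by selecting from another DataFrame. The sketch below shows one common way to avoid it: take an explicit `.copy()` before adding columns. The upstream cell is not shown in this section, so `weather_raw_df` and its extra `lat` column are hypothetical stand-ins.

```python
import pandas as pd

# Hypothetical raw frame standing in for the OpenWeatherMap bulk file.
weather_raw_df = pd.DataFrame({
    "dt_iso": ["2020-01-01 00:00:00 +0000 UTC", "2020-01-01 01:00:00 +0000 UTC"],
    "city_name": ["New York", "New York"],
    "temp": [6.36, 6.52],
    "lat": [40.71, 40.71],  # assumed extra column that the transform step drops
})

# .copy() makes the selection an independent frame, so the column
# assignment below does not raise SettingWithCopyWarning.
weather_transform_df = weather_raw_df[["dt_iso", "city_name", "temp"]].copy()
weather_transform_df["Reported_Date"] = pd.to_datetime(
    weather_transform_df["dt_iso"].str[:10]
)
print(weather_transform_df)
```

Note that `str[:10]` keeps the UTC calendar date from `dt_iso`, not the local date of each city.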


In [168]:
#4 aggregate
weather_agg_df = weather_transform_df.groupby(['Reported_Date','city_name']).\
               agg(temp_mean = pd.NamedAgg(column='temp',aggfunc='mean'),
                  humidity_mean = pd.NamedAgg(column='humidity',aggfunc='mean'),
                  temp_min = pd.NamedAgg(column='temp_min',aggfunc='min'),
                  temp_max = pd.NamedAgg(column='temp_max',aggfunc='max') 
                  ).reset_index()
weather_agg_df

Unnamed: 0,Reported_Date,city_name,temp_mean,humidity_mean,temp_min,temp_max
0,2020-01-01,New York,4.071667,60.708333,2.00,7.22
1,2020-01-01,Sao Paulo,25.668333,66.208333,19.00,33.89
2,2020-01-02,New York,3.264167,58.625000,-3.33,10.00
3,2020-01-02,Sao Paulo,22.892000,81.400000,19.96,29.00
4,2020-01-03,New York,7.418400,78.120000,4.44,9.44
...,...,...,...,...,...,...
999,2021-05-14,Sao Paulo,16.577083,74.250000,12.00,23.33
1000,2021-05-15,New York,18.578750,39.083333,10.00,26.00
1001,2021-05-15,Sao Paulo,17.322500,77.083333,12.72,23.00
1002,2021-05-16,New York,18.638333,41.833333,11.67,26.00
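
Some daily means in the output above cannot come from exactly 24 integer humidity readings (81.4 and 78.12, for example, are consistent with 25 rows in a day), which suggests that a few days carry extra hourly rows. The sketch below uses toy data to spot repeated timestamps, assuming duplicates share the same `city_name` and `dt_iso`. Whether dropping them is appropriate for this dataset is a judgment call.

```python
import pandas as pd

# Toy hourly data: the 01:00 reading appears twice.
hourly_df = pd.DataFrame({
    "dt_iso": [
        "2020-01-02 00:00:00 +0000 UTC",
        "2020-01-02 01:00:00 +0000 UTC",
        "2020-01-02 01:00:00 +0000 UTC",
    ],
    "city_name": ["Sao Paulo"] * 3,
    "humidity": [80, 82, 82],
})

# Flag every row whose (city, timestamp) pair occurs more than once.
dup_mask = hourly_df.duplicated(subset=["city_name", "dt_iso"], keep=False)
print(hourly_df[dup_mask])

# Keep the first reading per (city, timestamp) before aggregating.
deduped_df = hourly_df.drop_duplicates(subset=["city_name", "dt_iso"], keep="first")
print(len(deduped_df))
```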


In [169]:
# 5 add daily_temp and daily_humidity as the mean daily temperature and mean daily humidity, rounded to 2 decimal places
weather_agg_df['daily_temp']=weather_agg_df['temp_mean'].round(decimals=2)
weather_agg_df['daily_humidity']=weather_agg_df['humidity_mean'].round(decimals=2)
weather_agg_df

Unnamed: 0,Reported_Date,city_name,temp_mean,humidity_mean,temp_min,temp_max,daily_temp,daily_humidity
0,2020-01-01,New York,4.071667,60.708333,2.00,7.22,4.07,60.71
1,2020-01-01,Sao Paulo,25.668333,66.208333,19.00,33.89,25.67,66.21
2,2020-01-02,New York,3.264167,58.625000,-3.33,10.00,3.26,58.62
3,2020-01-02,Sao Paulo,22.892000,81.400000,19.96,29.00,22.89,81.40
4,2020-01-03,New York,7.418400,78.120000,4.44,9.44,7.42,78.12
...,...,...,...,...,...,...,...,...
999,2021-05-14,Sao Paulo,16.577083,74.250000,12.00,23.33,16.58,74.25
1000,2021-05-15,New York,18.578750,39.083333,10.00,26.00,18.58,39.08
1001,2021-05-15,Sao Paulo,17.322500,77.083333,12.72,23.00,17.32,77.08
1002,2021-05-16,New York,18.638333,41.833333,11.67,26.00,18.64,41.83


In [170]:
#5 before applying the rolling function, ensure data is ordered by city and date
weather_agg_df = weather_agg_df.sort_values(['city_name','Reported_Date'],ascending=True).reset_index()
weather_agg_df.head()

Unnamed: 0,index,Reported_Date,city_name,temp_mean,humidity_mean,temp_min,temp_max,daily_temp,daily_humidity
0,0,2020-01-01,New York,4.071667,60.708333,2.0,7.22,4.07,60.71
1,2,2020-01-02,New York,3.264167,58.625,-3.33,10.0,3.26,58.62
2,4,2020-01-03,New York,7.4184,78.12,4.44,9.44,7.42,78.12
3,6,2020-01-04,New York,8.625417,94.875,6.67,11.11,8.63,94.88
4,8,2020-01-05,New York,4.917917,59.666667,2.22,10.56,4.92,59.67


In [171]:
#5 add rolling 7 and 15 day temperature averages with a 1 day shift
weather_agg_df['mavg_7_temp']= (weather_agg_df.groupby(['city_name'],sort=False)['daily_temp']
                                   .transform(lambda x: x.shift(1).rolling(window=7).mean())
                                   .reset_index(drop=True))

weather_agg_df['mavg_15_temp']= (weather_agg_df.groupby(['city_name'],sort=False)['daily_temp']
                                   .transform(lambda x: x.shift(1).rolling(window=15).mean())
                                   .reset_index(drop=True))
weather_agg_df

Unnamed: 0,index,Reported_Date,city_name,temp_mean,humidity_mean,temp_min,temp_max,daily_temp,daily_humidity,mavg_7_temp,mavg_15_temp
0,0,2020-01-01,New York,4.071667,60.708333,2.00,7.22,4.07,60.71,,
1,2,2020-01-02,New York,3.264167,58.625000,-3.33,10.00,3.26,58.62,,
2,4,2020-01-03,New York,7.418400,78.120000,4.44,9.44,7.42,78.12,,
3,6,2020-01-04,New York,8.625417,94.875000,6.67,11.11,8.63,94.88,,
4,8,2020-01-05,New York,4.917917,59.666667,2.22,10.56,4.92,59.67,,
...,...,...,...,...,...,...,...,...,...,...,...
999,995,2021-05-12,Sao Paulo,17.551154,79.115385,11.94,25.00,17.55,79.12,18.682857,18.600667
1000,997,2021-05-13,Sao Paulo,15.758571,84.464286,11.67,22.22,15.76,84.46,18.208571,18.492000
1001,999,2021-05-14,Sao Paulo,16.577083,74.250000,12.00,23.33,16.58,74.25,17.447143,18.301333
1002,1001,2021-05-15,Sao Paulo,17.322500,77.083333,12.72,23.00,17.32,77.08,17.352857,18.254667
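
The leading `NaN` values above follow from combining `shift(1)` with `rolling(window=7)`: each average covers the seven days before the current row and excludes the row itself. The sketch below shows this on a toy series; the values are illustrative only.

```python
import pandas as pd

# Toy daily series for one city: temperatures 1.0 .. 9.0 on consecutive days.
toy_df = pd.DataFrame({
    "city_name": ["New York"] * 9,
    "daily_temp": [float(v) for v in range(1, 10)],
})

# shift(1) moves each value down one row, so the window ending at row i
# covers rows i-7 .. i-1 and never includes row i itself.
toy_df["mavg_7_temp"] = (
    toy_df.groupby("city_name", sort=False)["daily_temp"]
    .transform(lambda s: s.shift(1).rolling(window=7).mean())
)
print(toy_df)
```

The first seven rows have no full window of prior days, so they stay `NaN`; the eighth row is the mean of days 1 to 7.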


In [172]:
#5 moving average for humidity using a 1 day shift
weather_agg_df['mavg_7_humidity']= (weather_agg_df.groupby(['city_name'],sort=False)['daily_humidity']
                                   .transform(lambda x: x.shift(1).rolling(window=7).mean())
                                   .reset_index(drop=True))

weather_agg_df['mavg_15_humidity']= (weather_agg_df.groupby(['city_name'],sort=False)['daily_humidity']
                                   .transform(lambda x: x.shift(1).rolling(window=15).mean())
                                   .reset_index(drop=True))
weather_agg_df

Unnamed: 0,index,Reported_Date,city_name,temp_mean,humidity_mean,temp_min,temp_max,daily_temp,daily_humidity,mavg_7_temp,mavg_15_temp,mavg_7_humidity,mavg_15_humidity
0,0,2020-01-01,New York,4.071667,60.708333,2.00,7.22,4.07,60.71,,,,
1,2,2020-01-02,New York,3.264167,58.625000,-3.33,10.00,3.26,58.62,,,,
2,4,2020-01-03,New York,7.418400,78.120000,4.44,9.44,7.42,78.12,,,,
3,6,2020-01-04,New York,8.625417,94.875000,6.67,11.11,8.63,94.88,,,,
4,8,2020-01-05,New York,4.917917,59.666667,2.22,10.56,4.92,59.67,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
999,995,2021-05-12,Sao Paulo,17.551154,79.115385,11.94,25.00,17.55,79.12,18.682857,18.600667,71.952857,72.612000
1000,997,2021-05-13,Sao Paulo,15.758571,84.464286,11.67,22.22,15.76,84.46,18.208571,18.492000,73.975714,72.628000
1001,999,2021-05-14,Sao Paulo,16.577083,74.250000,12.00,23.33,16.58,74.25,17.447143,18.301333,76.582857,73.433333
1002,1001,2021-05-15,Sao Paulo,17.322500,77.083333,12.72,23.00,17.32,77.08,17.352857,18.254667,75.772857,73.452667


In [174]:
#6 create final weather file
weather_agg_df.rename(columns={"city_name":"City"},inplace=True)
weather_final_df= weather_agg_df[["Reported_Date","City","daily_temp","daily_humidity","mavg_7_temp","mavg_7_humidity","mavg_15_temp","mavg_15_humidity"]]
weather_final_df

Unnamed: 0,Reported_Date,City,daily_temp,daily_humidity,mavg_7_temp,mavg_7_humidity,mavg_15_temp,mavg_15_humidity
0,2020-01-01,New York,4.07,60.71,,,,
1,2020-01-02,New York,3.26,58.62,,,,
2,2020-01-03,New York,7.42,78.12,,,,
3,2020-01-04,New York,8.63,94.88,,,,
4,2020-01-05,New York,4.92,59.67,,,,
...,...,...,...,...,...,...,...,...
999,2021-05-12,Sao Paulo,17.55,79.12,18.682857,71.952857,18.600667,72.612000
1000,2021-05-13,Sao Paulo,15.76,84.46,18.208571,73.975714,18.492000,72.628000
1001,2021-05-14,Sao Paulo,16.58,74.25,17.447143,76.582857,18.301333,73.433333
1002,2021-05-15,Sao Paulo,17.32,77.08,17.352857,75.772857,18.254667,73.452667


In [175]:
#6 Export final daily weather data as NYC_SP_Daily_Weather_Final.csv
output_file = "Resources\\NYC_SP_Daily_Weather_Final.csv"
weather_final_df.to_csv(output_file,index=False, header=True)

## Create target data - Combine weather with city data

#### Target ETL requirements:

    1. Append the two cities' data into a combined data set and check the data.
    2. Find the common period for both cities, using the max of the min reported date by city and the min of the max reported date by city.
    3. Filter the combined city data for the period calculated in step 2.
    4. Merge with weather data. Use weather as the source for all dates, so that we can identify any dates where city data is missing.
    5. Check for null rows and delete them, as we have weather data only up to May 16, 2021.
    6. Save the combined final file as Final_Combine_data.csv in the Resources folder using column order "Extract_Date","Reported_Date","City","is_newyork","Population","Data_Source","New_Cases","mavg_7day_new_cases","new_cases_per_100K","mavg_7day_per_100k_new_cases","daily_temp","daily_humidity","mavg_7_temp","mavg_7_humidity","mavg_15_temp","mavg_15_humidity"
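
Steps 4 and 5 describe keeping every weather date and then dropping rows without city data. The sketch below shows that pattern with toy values, using `how="left"` and `indicator=True` so unmatched dates can be inspected before they are removed. The frame names and numbers here are illustrative assumptions, and the merge cell in this notebook uses `how="inner"` instead, which drops unmatched dates directly.

```python
import pandas as pd

# Toy weather data covering three days (values are illustrative only).
weather_df = pd.DataFrame({
    "Reported_Date": pd.to_datetime(["2021-05-14", "2021-05-15", "2021-05-16"]),
    "City": ["Sao Paulo"] * 3,
    "daily_temp": [16.0, 17.0, 18.0],
})
# Toy city data missing the last day.
cases_df = pd.DataFrame({
    "Reported_Date": pd.to_datetime(["2021-05-14", "2021-05-15"]),
    "City": ["Sao Paulo"] * 2,
    "New_Cases": [10, 20],
})

# Left merge keeps all weather dates; _merge records where each row came from.
merged_df = pd.merge(
    weather_df, cases_df, how="left", on=["Reported_Date", "City"], indicator=True
)

# Dates with weather but no city data.
missing_cases_df = merged_df[merged_df["_merge"] == "left_only"]
print(missing_cases_df[["Reported_Date", "City"]])

# Rows present in both sources, ready for the final file.
complete_df = merged_df[merged_df["_merge"] == "both"].drop(columns="_merge")
```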

In [194]:
# 1. Combine NYC and SP new cases

nyc_sp_covid_new_cases_df = pd.concat([nyc_clean_df,sp_clean_df], ignore_index=True)
nyc_sp_covid_new_cases_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 915 entries, 0 to 914
Data columns (total 10 columns):
 #   Column                        Non-Null Count  Dtype         
---  ------                        --------------  -----         
 0   Extract_Date                  915 non-null    object        
 1   Reported_Date                 915 non-null    datetime64[ns]
 2   City                          915 non-null    object        
 3   is_newyork                    915 non-null    int64         
 4   Population                    915 non-null    int64         
 5   Data_Source                   915 non-null    object        
 6   New_Cases                     915 non-null    int64         
 7   mavg_7day_new_cases           915 non-null    float64       
 8   new_cases_per_100K            915 non-null    float64       
 9   mavg_7day_per_100k_new_cases  915 non-null    float64       
dtypes: datetime64[ns](1), float64(3), int64(3), object(3)
memory usage: 71.6+ KB


In [195]:
# check combine data
nyc_sp_covid_new_cases_df.describe()

Unnamed: 0,is_newyork,Population,New_Cases,mavg_7day_new_cases,new_cases_per_100K,mavg_7day_per_100k_new_cases
count,915.0,915.0,915.0,915.0,915.0,915.0
mean,0.496175,15310310.0,1710.153005,1703.459552,11.750503,11.696579
std,0.500259,3468965.0,1517.289908,1269.671924,10.708006,8.484317
min,0.0,11869660.0,0.0,0.0,0.0,0.0
25%,0.0,11869660.0,412.0,522.145,2.435,3.15
50%,0.0,11869660.0,1285.0,1560.0,9.26,11.83
75%,1.0,18804000.0,2737.0,2577.57,18.325,17.605
max,1.0,18804000.0,8646.0,5291.0,72.84,39.65


In [196]:
#check combine data
nyc_sp_covid_new_cases_df

Unnamed: 0,Extract_Date,Reported_Date,City,is_newyork,Population,Data_Source,New_Cases,mavg_7day_new_cases,new_cases_per_100K,mavg_7day_per_100k_new_cases
0,2021-05-30,2020-02-29,New York,1,18804000,https://github.com/nychealth/coronavirus-data/...,1,0.00,0.01,0.01
1,2021-05-30,2020-03-01,New York,1,18804000,https://github.com/nychealth/coronavirus-data/...,0,0.00,0.00,0.00
2,2021-05-30,2020-03-02,New York,1,18804000,https://github.com/nychealth/coronavirus-data/...,0,0.00,0.00,0.00
3,2021-05-30,2020-03-03,New York,1,18804000,https://github.com/nychealth/coronavirus-data/...,1,0.00,0.01,0.00
4,2021-05-30,2020-03-04,New York,1,18804000,https://github.com/nychealth/coronavirus-data/...,5,0.00,0.03,0.01
...,...,...,...,...,...,...,...,...,...,...
910,2021-05-30,2021-05-26,Sao Paulo,0,11869660,https://raw.githubusercontent.com/seade-R/dado...,3298,2426.00,27.79,20.44
911,2021-05-30,2021-05-27,Sao Paulo,0,11869660,https://raw.githubusercontent.com/seade-R/dado...,2318,2319.00,19.53,19.54
912,2021-05-30,2021-05-28,Sao Paulo,0,11869660,https://raw.githubusercontent.com/seade-R/dado...,431,1965.71,3.63,16.56
913,2021-05-30,2021-05-29,Sao Paulo,0,11869660,https://raw.githubusercontent.com/seade-R/dado...,2125,1817.29,17.90,15.31


In [197]:
# check combine data for min and max reported date by city
combine_city_gp = nyc_sp_covid_new_cases_df.groupby(['City']).\
                    agg(reported_date_min = pd.NamedAgg(column='Reported_Date',aggfunc='min'),
                        reported_date_max = pd.NamedAgg(column='Reported_Date',aggfunc='max')) 
combine_city_gp


Unnamed: 0_level_0,reported_date_min,reported_date_max
City,Unnamed: 1_level_1,Unnamed: 2_level_1
New York,2020-02-29,2021-05-27
Sao Paulo,2020-02-25,2021-05-30


In [198]:
#2. find the common period date parameters for keeping the reported dates common to both cities' new cases data
keep_data_start_date = combine_city_gp["reported_date_min"].max()
keep_data_end_date = combine_city_gp["reported_date_max"].min()
print(f"keep new cases data from both cities with reported date starting from: {keep_data_start_date}")
print(f"keep new cases data from both cities with reported date ending at: {keep_data_end_date}")

keep new cases data from both cities with reported date starting from: 2020-02-29 00:00:00
keep new cases data from both cities with reported date ending at: 2021-05-27 00:00:00


In [209]:
# 3 keep the data within the range identified in step 2
nyc_sp_covid_new_cases_df = nyc_sp_covid_new_cases_df[(nyc_sp_covid_new_cases_df['Reported_Date'] >= keep_data_start_date)
                                                      & (nyc_sp_covid_new_cases_df['Reported_Date'] <= keep_data_end_date )]
                                                      
print("Filtered data min reported date:", nyc_sp_covid_new_cases_df["Reported_Date"].min())
print("Filtered data max reported date:", nyc_sp_covid_new_cases_df["Reported_Date"].max())
print("Combined city data frame shape:", nyc_sp_covid_new_cases_df.shape)

Filtered data min reported date: 2020-02-29 00:00:00
Filtered data max reported date: 2021-05-27 00:00:00
Combined city data frame shape: (908, 10)
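
The common-period logic from steps 2 and 3 can be checked on a small example. The sketch below uses toy dates and `Series.between` (inclusive on both ends by default) as an alternative to the two chained comparisons; the frame and variable names are hypothetical.

```python
import pandas as pd

# Toy data: each city covers a different date range.
toy_df = pd.DataFrame({
    "City": ["New York"] * 3 + ["Sao Paulo"] * 3,
    "Reported_Date": pd.to_datetime([
        "2020-02-29", "2020-03-01", "2020-03-02",
        "2020-02-25", "2020-02-29", "2020-03-05",
    ]),
})

# Per-city first and last reported dates.
ranges = toy_df.groupby("City")["Reported_Date"].agg(["min", "max"])

# The shared window starts at the latest first date and ends at the earliest last date.
start_date = ranges["min"].max()
end_date = ranges["max"].min()

# Keep only rows inside the shared window.
common_df = toy_df[toy_df["Reported_Date"].between(start_date, end_date)]
print(start_date, end_date, common_df.shape)
```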


In [210]:
#4 Merge with weather data (inner join keeps only dates present in both the weather and the city data)
final_merged_df = pd.merge(weather_final_df, nyc_sp_covid_new_cases_df, how="inner", on=["Reported_Date", "City"])
final_merged_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 886 entries, 0 to 885
Data columns (total 16 columns):
 #   Column                        Non-Null Count  Dtype         
---  ------                        --------------  -----         
 0   Reported_Date                 886 non-null    datetime64[ns]
 1   City                          886 non-null    object        
 2   daily_temp                    886 non-null    float64       
 3   daily_humidity                886 non-null    float64       
 4   mavg_7_temp                   886 non-null    float64       
 5   mavg_7_humidity               886 non-null    float64       
 6   mavg_15_temp                  886 non-null    float64       
 7   mavg_15_humidity              886 non-null    float64       
 8   Extract_Date                  886 non-null    object        
 9   is_newyork                    886 non-null    int64         
 10  Population                    886 non-null    int64         
 11  Data_Source                   88

In [211]:
#5 Check for nulls in final combined data
for column in final_merged_df.columns:
    print(f"Null count for {column} is",final_merged_df[column].isnull().sum())
    

Null count for Reported_Date is 0
Null count for City is 0
Null count for daily_temp is 0
Null count for daily_humidity is 0
Null count for mavg_7_temp is 0
Null count for mavg_7_humidity is 0
Null count for mavg_15_temp is 0
Null count for mavg_15_humidity is 0
Null count for Extract_Date is 0
Null count for is_newyork is 0
Null count for Population is 0
Null count for Data_Source is 0
Null count for New_Cases is 0
Null count for mavg_7day_new_cases is 0
Null count for new_cases_per_100K is 0
Null count for mavg_7day_per_100k_new_cases is 0
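
The same per-column null counts can be obtained without a loop. The sketch below uses toy data; `toy_df` is a hypothetical stand-in for `final_merged_df`.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing temperature value.
toy_df = pd.DataFrame({
    "daily_temp": [4.07, np.nan, 7.42],
    "New_Cases": [1, 0, 5],
})

# isna() marks missing cells; sum() counts them per column.
null_counts = toy_df.isna().sum()
print(null_counts)

# Drop any row that still has a missing value, as described in step 5.
clean_df = toy_df.dropna()
```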


In [212]:
# check date range for combine data
print(f"For final data, min reported date is:" ,final_merged_df["Reported_Date"].min())
print(f"For final data, max reported date is:",final_merged_df["Reported_Date"].max())
print(f"For final data, data frame shape:" ,final_merged_df.shape )

For final data, min reported date is: 2020-02-29 00:00:00
For final data, max reported date is: 2021-05-16 00:00:00
For final data, data frame shape: (886, 16)


In [213]:
#rearrange columns and save data
final_combine_df = final_merged_df[["Extract_Date","Reported_Date","City","is_newyork","Population","Data_Source",
                                    "New_Cases","mavg_7day_new_cases","new_cases_per_100K","mavg_7day_per_100k_new_cases",
                                    "daily_temp","daily_humidity","mavg_7_temp","mavg_7_humidity","mavg_15_temp",
                                    "mavg_15_humidity"]]
final_combine_df

Unnamed: 0,Extract_Date,Reported_Date,City,is_newyork,Population,Data_Source,New_Cases,mavg_7day_new_cases,new_cases_per_100K,mavg_7day_per_100k_new_cases,daily_temp,daily_humidity,mavg_7_temp,mavg_7_humidity,mavg_15_temp,mavg_15_humidity
0,2021-05-30,2020-02-29,New York,1,18804000,https://github.com/nychealth/coronavirus-data/...,1,0.00,0.01,0.01,0.40,44.08,5.702857,58.961429,3.500000,56.268000
1,2021-05-30,2020-03-01,New York,1,18804000,https://github.com/nychealth/coronavirus-data/...,0,0.00,0.00,0.00,-0.10,45.71,5.394286,59.610000,3.432000,55.587333
2,2021-05-30,2020-03-02,New York,1,18804000,https://github.com/nychealth/coronavirus-data/...,0,0.00,0.00,0.00,6.61,52.25,4.487143,61.085714,3.816000,55.348667
3,2021-05-30,2020-03-03,New York,1,18804000,https://github.com/nychealth/coronavirus-data/...,1,0.00,0.01,0.00,11.27,70.72,4.317143,61.561429,4.190667,54.404000
4,2021-05-30,2020-03-04,New York,1,18804000,https://github.com/nychealth/coronavirus-data/...,5,0.00,0.03,0.01,10.21,58.76,4.688571,60.670000,4.576667,55.585333
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
881,2021-05-30,2021-05-12,Sao Paulo,0,11869660,https://raw.githubusercontent.com/seade-R/dado...,2808,2069.00,23.66,17.43,17.55,79.12,18.682857,71.952857,18.600667,72.612000
882,2021-05-30,2021-05-13,Sao Paulo,0,11869660,https://raw.githubusercontent.com/seade-R/dado...,2505,2084.00,21.10,17.56,15.76,84.46,18.208571,73.975714,18.492000,72.628000
883,2021-05-30,2021-05-14,Sao Paulo,0,11869660,https://raw.githubusercontent.com/seade-R/dado...,3248,2189.00,27.36,18.44,16.58,74.25,17.447143,76.582857,18.301333,73.433333
884,2021-05-30,2021-05-15,Sao Paulo,0,11869660,https://raw.githubusercontent.com/seade-R/dado...,2732,2293.86,23.02,19.33,17.32,77.08,17.352857,75.772857,18.254667,73.452667


In [214]:
#6 Export clean final combine weather and covid new cases data
output_file = "Resources\\Final_Combine_Data.csv"
final_combine_df.to_csv(output_file,index=False, header=True)
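
The hard-coded backslash path above only resolves as intended on Windows. A portable alternative is to build the path with `pathlib`, as sketched below. The temporary directory and `toy_df` are stand-ins for the project root and `final_combine_df`, so the example can run anywhere.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame standing in for final_combine_df.
toy_df = pd.DataFrame({"City": ["New York"], "New_Cases": [1]})

# A temporary directory stands in for the project root in this sketch.
with tempfile.TemporaryDirectory() as project_root:
    resources_dir = Path(project_root) / "Resources"
    resources_dir.mkdir(parents=True, exist_ok=True)  # create the folder if missing

    # Path objects use the correct separator for the current OS.
    output_file = resources_dir / "Final_Combine_Data.csv"
    toy_df.to_csv(output_file, index=False, header=True)

    # Read the file back to confirm the export succeeded.
    round_trip_df = pd.read_csv(output_file)

print(round_trip_df)
```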