# Introduction/Business Problem
The COVID-19 pandemic has taken an unprecedented toll globally: it has already affected over 2M people and taken over 200K lives and, according to the WEF, will likely cost the world 2 trillion USD in economic losses.

In the Philippines, the responsibility of managing the quarantine falls on the shoulders of the mayors of the cities and municipalities.

Not all cities are affected by the virus the same way. Some cities are overwhelmed by the number of cases. Some cities have seen the peak and are now slowly coming down in terms of cases. Some cities are growing in terms of the number of cases.

The mayors will want to know where their city stands on its case curve: is the number of cases on a downward path or still on a growth path?

The citizens, on the other hand, will be interested to know where the hospitals are near their location.

This is the problem we want to solve with data analysis.

As an application, we would have an interface for mayors to see if their daily rates of cases are growing or decreasing.

Citizens can find out where the nearest health facilities and testing centers are, so they can go there if they need to be tested or to see a healthcare professional.

# Data Section
For this exercise, we will be using the following datasets:

1. Case Information dataset provided by the Philippine Department of Health. We will use this dataset to determine the growth rates per city. This dataset includes the following fields:
	* CaseCode : Random code assigned for labelling cases
	* Age : Age
	* AgeGroup : Five-year age group
	* Sex : Sex
	* DateRepConf : Date publicly announced as confirmed case
	* DateRecover : Date recovered
	* DateDied : Date died
	* RemovalType : Type of removal (recovery or death)
	* DateRepRem : Date publicly announced as removed
	* Admitted : Binary variable indicating patient has been admitted to hospital
	* RegionRes : Region of residence
	* ProvCityRes : Province of residence
	* RegionPSGC : Philippine Standard Geographic Code of Region of Residence
	* ProvPSGC : Philippine Standard Geographic Code of Province of Residence
	* MunCityPSGC : Philippine Standard Geographic Code of Municipality or City of Residence
	* HealthStatus : Known current health status of patient (asymptomatic, mild, severe, critical, died, recovered)
	* Quarantined : Ever been home quarantined, not necessarily currently in home quarantine
2. Geojson dataset of cities and municipalities in the Philippines. We will use this dataset to provide the mapping boundaries of the different cities and municipalities in the Philippines.
	* ID_0 : Unique ID 0
	* ISO : ISO Country Code
	* NAME_0 : Country Name
	* NAME_2 : Municipality or City Name
	* PROVINCE : Province Name
	* REGION : Region Name
	* geometry : Polygon, coordinates
3. Foursquare Search API. We will use this dataset to obtain information on health-related venues, especially their locations expressed as latitude and longitude. We will also use the venue categories as a filter. We will mash this data up with the boundaries provided by the city geojson dataset above. The Foursquare Search API returns the following data:
	* id : A unique string identifier for this venue.
	* name : The best known name for this venue.
	* location : An object containing none, some, or all of address (street address), crossStreet, city, state, postalCode, country, lat, lng, and distance. All fields are strings, except for lat, lng, and distance. Distance is measured in meters. Some venues have their locations intentionally hidden for privacy reasons (such as private residences). If this is the case, the parameter isFuzzed will be set to true, and the lat/lng parameters will have reduced precision.
	* categories : An array, possibly empty, of categories that have been applied to this venue. One of the categories will have a primary field indicating that it is the primary category for the venue. For the complete category tree, see categories.
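As a rough illustration of how the venue fields listed above could be flattened and matched against a city boundary, here is a minimal sketch. The helper names `venues_to_frame` and `point_in_polygon`, the `response` / `venues` wrapper keys, and all sample values are assumptions made for illustration; they are not taken from the notebook, and a real geojson geometry may contain several rings or polygons that this simple test does not handle.

```python
import pandas as pd


def venues_to_frame(response_json):
    """Flatten a Foursquare-style search response into one row per venue.

    Assumes the venues sit under response -> venues (hypothetical wrapper).
    """
    rows = []
    for venue in response_json.get("response", {}).get("venues", []):
        location = venue.get("location", {})
        categories = venue.get("categories", [])
        # Prefer the category flagged as primary; fall back to the first one, if any.
        primary = next((c for c in categories if c.get("primary")),
                       categories[0] if categories else {})
        rows.append({
            "id": venue.get("id"),
            "name": venue.get("name"),
            "lat": location.get("lat"),
            "lng": location.get("lng"),
            "category": primary.get("name"),
        })
    return pd.DataFrame(rows, columns=["id", "name", "lat", "lng", "category"])


def point_in_polygon(lng, lat, ring):
    """Ray-casting test: is (lng, lat) inside a single ring of [lng, lat] pairs?"""
    inside = False
    j = len(ring) - 1
    for i in range(len(ring)):
        xi, yi = ring[i]
        xj, yj = ring[j]
        # Toggle when a horizontal ray from the point crosses the edge (j, i).
        if (yi > lat) != (yj > lat) and lng < (xj - xi) * (lat - yi) / (yj - yi) + xi:
            inside = not inside
        j = i
    return inside


# Made-up sample data, for illustration only.
sample_response = {"response": {"venues": [
    {"id": "a1", "name": "Sample Hospital",
     "location": {"lat": 14.50, "lng": 121.00},
     "categories": [{"name": "Hospital", "primary": True}]},
    {"id": "b2", "name": "Sample Clinic",
     "location": {"lat": 15.50, "lng": 122.00},
     "categories": []},
]}}
sample_ring = [[120.9, 14.4], [121.1, 14.4], [121.1, 14.6], [120.9, 14.6], [120.9, 14.4]]

venues_df = venues_to_frame(sample_response)
venues_df["in_city"] = [point_in_polygon(lng, lat, sample_ring)
                        for lng, lat in zip(venues_df["lng"], venues_df["lat"])]
print(venues_df)
```

Note that geojson coordinates are ordered longitude first, then latitude, which is why the ring and the test both use `[lng, lat]`.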

# Methodology

We will use the CRISP-DM (Cross-Industry Standard Process for Data Mining) methodology, a well-proven methodology in data science. CRISP-DM loosely and iteratively follows six major phases:

1. Business Understanding
2. Data Understanding
3. Data Preparation
4. Modeling
5. Evaluation
6. Deployment

As we have already covered #1 and #2 in the sections above, we will continue with Data Preparation.

## Data Preparation

We start by importing all necessary libraries and installing all dependencies for the project.

In [None]:
import numpy as np
import pandas as pd
import json # standard library module to handle JSON files
import requests
from pandas import json_normalize # transform JSON data into a pandas dataframe (pandas >= 1.0)
import matplotlib.cm as cm
import matplotlib.colors as colors
import folium # map rendering library
import seaborn as sns
print('Libraries imported.')

In [None]:
#explore DOH data
#cases_df = pd.read_csv('DOH COVID Data Drop_ 20200510 - 05 Case Information.csv',
#                       parse_dates=[5,6,7,9])
cases_df = pd.read_csv('https://storage.googleapis.com/doh_datadrops/DOH%20Data%20Drop%2020200518.csv',
                      parse_dates=[5,6,7,9,17])

In [None]:
cases_df.dtypes

In [None]:
#cast data to appropriate types for easy handling
#categorical columns: few distinct values, cheaper to store and group by
for col in ['AgeGroup', 'Sex','RemovalType', 'Admitted', 'RegionRes','ProvRes','CityMunRes',
            'CityMuniPSGC','HealthStatus','Quarantined','Pregnanttab']:
    cases_df[col] = cases_df[col].astype('category')
#date columns: recent pandas versions require an explicit unit such as [ns]
for col in ['DateRepConf', 'DateDied', 'DateRecover', 'DateRepRem','DateOnset']:
    cases_df[col] = cases_df[col].astype('datetime64[ns]')
#nullable integer type so that missing ages stay missing
cases_df.Age = cases_df.Age.astype('Int64')

In [None]:
cases_df.dtypes

In [None]:
cases_df.info() 

In [None]:
cases_df.DateRepConf.max()

In [None]:
list_cities = cases_df.CityMunRes.unique()

In [None]:
list_cities

In [None]:
#create dataframe grouping cases dataset and looking at their growth rates in daily cases
#cases_by_city = cases_df.groupby(["CityMunRes","DateRepConf"])
#cases_by_city.get_group('CITY OF PARAÑAQUE')
#group by city only; these two objects are used by the cells that follow
gb_cases_by_city = cases_df.groupby(['CityMunRes'],as_index =False)
gb_cases_by_citytrue = cases_df.groupby(['CityMunRes'])
#cases_by_province = cases_df.groupby(['ProvRes'])['CaseCode'].count()
#cases_by_region = cases_df.groupby(['RegionRes'])['CaseCode'].count()
#group by city and confirmation date, with and without the group keys as index
gb_cases_by_city_by_date = cases_df.groupby(['CityMunRes','DateRepConf'],as_index =False)
gb_cases_by_city_by_date_true = cases_df.groupby(['CityMunRes','DateRepConf'])
#cases_by_region.sort_values()

In [None]:
df_cases_by_city = gb_cases_by_city['CaseCode'].count()
df_cases_by_city.head()

In [None]:
df_cases_by_citytrue = gb_cases_by_citytrue['CaseCode'].count()
df_cases_by_citytrue.head()

In [None]:
df_cases_by_city_by_date = gb_cases_by_city_by_date['CaseCode'].count()
df_cases_by_city_by_date.rename(columns={"CityMunRes": "City", "DateRepConf": "Date","CaseCode":"Casecount"},
                                inplace = True)
#lst_by_city = df_cases_by_city_by_date.loc[('ABUCAY')]
#print(type(df2))
#print(df2.index)
#df2.iloc[-7:1]
#lst_by_city.plot()
#df_cases_by_city_by_date.index

In [None]:
df_cases_by_city_by_date.info()

In [None]:
df_cases_by_city_by_date.head()

In [None]:
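#this dataframe has a plain integer index, so select the city through the City column
df_cases_by_city_by_date.loc[df_cases_by_city_by_date['City'] == 'ABUCAY', :]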

In [None]:
df_cases_by_city_by_date_true = gb_cases_by_city_by_date_true['CaseCode'].count()
#df_cases_by_city_by_date_true.rename(columns={"CityMunRes": "City", "DateRepConf": "Date","CaseCode":"Casecount"},
                                #inplace = True)

In [None]:
df_cases_by_city_by_date_true.head()

In [None]:
df_cases_by_city_by_date_true.loc['ABUCAY','2020-05-01':'2020-05-10']

In [None]:
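#cumulative cases and their day-over-day growth, computed within each city so totals do not spill across cities
df_cases_by_city_by_date_true_cum = (gb_cases_by_city_by_date_true['CaseCode'].count()
                                     .groupby(level='CityMunRes').cumsum()
                                     .groupby(level='CityMunRes').pct_change())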

In [None]:
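#same growth series built directly from cases_df; cumsum and pct_change are applied per city
daily_counts = cases_df.groupby(['CityMunRes','DateRepConf'])['CaseCode'].count()
list_cases_by_city_by_date = daily_counts.groupby(level='CityMunRes').cumsum().groupby(level='CityMunRes').pct_change()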

In [None]:
list_cases_by_city_by_date

In [None]:
list_cases_by_city_by_date.loc['ABUCAY'].tail(14).mean()

In [None]:
a = list_cases_by_city_by_date.index.get_level_values('CityMunRes')

In [None]:
a

In [None]:
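#mean of the last 14 growth values for each city; unique() avoids recomputing for repeated index entries
d = {city: list_cases_by_city_by_date.loc[city].tail(14).mean()
     for city in list_cases_by_city_by_date.index.get_level_values('CityMunRes').unique()}

df_cities_growth = pd.DataFrame(data=list(d.values()), index=list(d.keys()), columns=['mean'])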
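To make the per-city growth calculation easier to follow, here is a minimal sketch on a made-up line list. The column names mirror the DOH dataset, but the rows, the city labels `A` and `B`, and the variable names are invented for illustration. The key assumption is that the cumulative sum and the percentage change should both be computed within each city, so one city's totals never carry over into the next.

```python
import pandas as pd

# Made-up line list: one row per confirmed case (illustrative only).
toy_cases = pd.DataFrame({
    "CityMunRes": ["A", "A", "A", "A", "B", "B", "B"],
    "DateRepConf": pd.to_datetime([
        "2020-05-01", "2020-05-02", "2020-05-02", "2020-05-03",
        "2020-05-01", "2020-05-03", "2020-05-03"]),
    "CaseCode": ["C1", "C2", "C3", "C4", "C5", "C6", "C7"],
})

# Daily new cases per city and date.
daily = toy_cases.groupby(["CityMunRes", "DateRepConf"])["CaseCode"].count()

# Running total per city, then day-over-day growth of that total per city.
cumulative = daily.groupby(level="CityMunRes").cumsum()
growth = cumulative.groupby(level="CityMunRes").pct_change()

# Mean of the last 14 growth values for each city (the first value per city is NaN).
mean_growth = growth.groupby(level="CityMunRes").apply(lambda s: s.tail(14).mean())
print(mean_growth)
```

With real data, dates on which a city reports no cases are simply absent from the grouped series, so the 14 most recent rows are not necessarily the 14 most recent calendar days.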

In [None]:
df_cities_growth.describe()

In [None]:
df_cities_growth.loc[df_cities_growth['mean']>0.000009].plot.hist(bins=500)

In [None]:
df_cities_growth.head()

In [None]:
df_cases_by_city_by_date_true_cum

In [None]:
df_cases_by_city_by_date_true_cum.loc['ABUCAY'].tail(14).mean()

In [None]:
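#percentage change of the daily case count within each city (CaseCode itself is a label, not a number)
df_cases_by_city_by_date_pctchg = df_cases_by_city_by_date.groupby('City')['Casecount'].pct_change()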

In [None]:
print(df_cases_by_city_by_date_pctchg.tail())
type(df_cases_by_city_by_date_pctchg)
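One possible way to turn daily counts into the "growing or decreasing" signal that mayors would look for is to compare the average of the most recent week with the week before it. The sketch below is only an illustration under that assumption: the function `label_trend`, the 7-day window, and the sample series are hypothetical and are not part of the notebook's analysis.

```python
import pandas as pd


def label_trend(daily_counts, window=7):
    """Compare the mean of the latest `window` days with the preceding `window` days."""
    recent = daily_counts.tail(window).mean()
    previous = daily_counts.iloc[-2 * window:-window].mean()
    # Not enough history (empty previous window) or no change at all.
    if pd.isna(previous) or recent == previous:
        return "flat or unknown"
    return "growing" if recent > previous else "decreasing"


# Made-up daily case counts for one city (illustrative only).
dates = pd.date_range("2020-05-01", periods=14, freq="D")
rising = pd.Series([1, 1, 2, 2, 3, 3, 4, 5, 6, 6, 7, 8, 9, 9], index=dates)
falling = pd.Series(rising.values[::-1], index=dates)

print(label_trend(rising), label_trend(falling))
```

Comparing weekly averages rather than single days is a deliberate choice here, since day-to-day reporting tends to be noisy.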

In [None]:
gb_cases_by_city_by_date

In [None]:
#df_cases_by_city_by_date['ABUCAY':'ALFONSO',:]
#df_cases_by_city_by_date[['ABUCAY', 'ALFONSO']]
#df_cases_by_city_by_date[('ABUCAY':'ALFONSO',:)]
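#rows for cities whose names fall alphabetically between ABUCAY and AGOO
#(the dataframe has a plain integer index, so filter on the City column instead of slicing .loc)
df_cases_by_city_by_date[df_cases_by_city_by_date['City'].astype(str).between('ABUCAY', 'AGOO')]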

In [None]:
df_cases_by_city_by_date_true = gb_cases_by_city_by_date_true['CaseCode'].count()

In [None]:
type(df_cases_by_city_by_date_true)

In [None]:
df_cases_by_city_by_date_true.head()

In [None]:
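#assumption: df2 is the daily case-count series for one city (ABUCAY, as in the cells above)
df2 = df_cases_by_city_by_date_true.loc['ABUCAY']
df2

In [None]:
#day-over-day percentage change for the last 7 reporting dates
x = df2.pct_change().tail(7)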

In [None]:
x.shape

In [None]:
#number of rows in each (city, date) group
df3 = gb_cases_by_city_by_date_true.size()

In [None]:
df3

In [None]:
#%matplotlib inline
#total number of cases per city
cases_df.groupby('CityMunRes')['CaseCode'].count()

# Results
section where you discuss the results.

# Discussion
section where you discuss any observations you noted and any recommendations you can make based on the results.

# Conclusion
section where you conclude the report.