# Introduction/Business Problem
The COVID-19 pandemic has taken an unprecedented toll globally: it has already affected over 2M people and taken over 200K lives and, according to the WEF, will likely cost the world 2 trillion USD in economic losses.

Not all cities are affected by the virus in the same way. We want to see whether poverty incidence, a metric used to measure the level of poverty in a geographic area, predicts the severity of the outbreak as measured by cases per capita.

# Data Section
For this exercise, we will be using the following datasets:

1. **Case Information dataset**
    
    **Source:** Philippine Department of Health. See https://www.doh.gov.ph/2019-nCoV
    
    **Description:** List of confirmed COVID-19 cases from the DOH Epidemiological Bureau.

    **Dataset fields:**
	* CaseCode : Random code assigned for labelling cases
	* Age : Age
	* AgeGroup : Five-year age group
	* Sex : Sex
	* DateRepConf : Date publicly announced as confirmed case
	* DateRecover : Date recovered
	* DateDied : Date died
	* RemovalType : Type of removal (recovery or death)
	* DateRepRem : Date publicly announced as removed
	* Admitted : Binary variable indicating patient has been admitted to hospital
	* RegionRes : Region of residence
	* ProvCityRes : Province of residence
	* RegionPSGC : Philippine Standard Geographic Code of Region of Residence
	* ProvPSGC : Philippine Standard Geographic Code of Province of Residence
	* MunCityPSGC : Philippine Standard Geographic Code of Municipality or City of Residence
	* HealthStatus : Known current health status of patient (asymptomatic, mild, severe, critical, died, recovered)
	* Quarantined : Ever been home quarantined, not necessarily currently in home quarantine


2. **Municipal and City Level Poverty Estimates**

    **Source:** The Philippine Statistics Authority (PSA). See https://psa.gov.ph/content/psa-releases-2015-municipal-and-city-level-poverty-estimates

    **Description:** This is a set of estimates using the small area estimation (SAE) technique. See https://psa.gov.ph/sites/default/files/Technical%20Notes%20on%202015%20SAE.pdf

    **Dataset fields:**
    * PSGC
    * Region/Province
    * Municipality/City
    * Poverty Incidence


3. **Philippine Standard Geographic Code (PSGC)**

    **Source:** The Philippine Statistics Authority (PSA). See https://psa.gov.ph/content/psa-releases-2015-municipal-and-city-level-poverty-estimates

    **Description:** The PSGC is a systematic classification and coding of geographic areas in the Philippines. It is based on the four (4) well-established hierarchical levels of geographical-political subdivisions of the country, namely, the administrative region, the province, the municipality/city, and the barangay.

    **Dataset fields:**
    
    * Column "Code": PSGC Code
    * Column "Income Classification"
    * Column "Class" (Average Annual Income)
        * Provinces:
            * 1st	P 450M or more
            * 2nd	P 360M or more but less than P 450M
            * 3rd	P 270M or more but less than P 360M
            * 4th	P 180M or more but less than P 270M
            * 5th	P 90M or more but less than P 180M
            * 6th	Below P 90M
        * Cities
            * 1st	P 400M or more
            * 2nd	P 320M or more but less than P 400M
            * 3rd	P 240M or more but less than P 320M
            * 4th	P 160M or more but less than P 240M
            * 5th	P 80M or more but less than P 160M
            * 6th	Below P 80M
        * Municipalities
            * 1st	P 55M or more
            * 2nd	P 45M or more but less than P 55M
            * 3rd	P 35M or more but less than P 45M
            * 4th	P 25M or more but less than P 35M
            * 5th	P 15M or more but less than P 25M
            * 6th	Below P 15M		
    * Column "Urban / Rural"
	* Column "Population"	
    


4. GeoJSON dataset of cities and municipalities in the Philippines. We will use this dataset to provide the mapping boundaries of the different cities and municipalities in the Philippines.
	* ID_0 : Unique ID 0
	* ISO : ISO Country Code
	* NAME_0 : Country Name
	* NAME_2 : Municipality or City Name
	* PROVINCE : Province Name
	* REGION : Region Name
	* geometry : Polygon, coordinates
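
To make the PSGC hierarchy concrete, the sketch below splits a nine-digit code into its four levels. It assumes the 2-2-2-3 digit layout (region, province, municipality/city, barangay) of the nine-digit PSGC; `split_psgc` and `PSGCParts` are hypothetical names introduced here for illustration, not part of any PSA tooling.

```python
from typing import NamedTuple


class PSGCParts(NamedTuple):
    region: str
    province: str
    municipality: str
    barangay: str


def split_psgc(code: str) -> PSGCParts:
    """Split a nine-digit PSGC string into its four hierarchical parts."""
    if len(code) != 9 or not code.isdigit():
        raise ValueError(f"expected a nine-digit code, got {code!r}")
    # Assumed layout: 2 digits region, 2 province, 2 municipality/city, 3 barangay.
    return PSGCParts(code[:2], code[2:4], code[4:6], code[6:])


parts = split_psgc("137404000")
print(parts)
```

Codes are kept as strings throughout so that leading zeros survive, which matters later when the datasets are joined on PSGC.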

# Methodology

We will use the CRISP-DM (Cross-Industry Standard Process for Data Mining) methodology, a well-proven methodology in data science. CRISP-DM loosely and iteratively follows six major phases:

1. Business Understanding
2. Data Understanding
3. Data Preparation
4. Modeling
5. Evaluation
6. Deployment

As we have covered phases 1 and 2 above, we will continue with Data Preparation.

## Data Preparation

### Loading Data

1. We start by importing all necessary libraries and installing all dependencies for the project.

## plan

- [Setup](#setup)
    - import libraries
- [Prepare data](#data_prep)
    - DOH data
        - explore
        - clean
        - select
    - Poverty data
        - explore
        - clean
        - select
    - City index data
        - explore
        - clean
        - select
    - City Latlong
        - load json
        - explore
        - clean
        - select
    - Merge data
        - calculate case/pop as predicted variable
        - calculate distance from manila as additional feature
- Predict
- Visualize

<a name="setup"></a>
### Setup

<a name="import_lib"></a>
#### Import libraries

In [None]:
import numpy as np
import pandas as pd
import json5 as json # library to handle JSON files
import requests
from pandas import json_normalize # transform JSON file into a pandas dataframe
import matplotlib.cm as cm
import matplotlib.colors as colors
import folium # map rendering library
import seaborn as sns
import xlrd
#import shapely
#import fiona
#import pyproj
import geopandas as gpd

#prep
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.preprocessing import LabelEncoder, StandardScaler, MaxAbsScaler, QuantileTransformer

#models
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV, LinearRegression, Ridge, RidgeCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

#validation libraries
from sklearn.model_selection import KFold, StratifiedKFold
from IPython.display import display
from sklearn import metrics



print('Libraries imported.')

<a name='data_prep'></a>
### Data Preparation

#### DOH Data

1. We define the columns and their proper data types.

In [None]:
#define column types
col_types = {'Age' : 'float64', 
             'AgeGroup' : 'category', 
             'Sex' : 'category', 
             'RemovalType' : 'category', 
             'Admitted' : 'category', 
             'RegionRes' : 'category', 
             'ProvRes' : 'category', 
             'CityMunRes' : 'category', 
             'CityMuniPSGC' : 'category',
             'BarangayRes' : 'category',
             'BarangayPSGC' : 'category',
             'HealthStatus' : 'category', 
             'Quarantined' : 'category', 
             'Pregnanttab' : 'category', 
             'ValidationStatus' : 'category'}
col_date = [4, 5, 6, 7, 8, 19]

2. We read the CSV and load it into a data frame.

In [None]:
df_cases_raw = pd.read_csv("DOH COVID Data Drop_ 20201013 - 04 Case Information.csv", dtype = col_types, parse_dates = col_date)

##### Exploratory Data Analysis
1. We do a quick check on the dataframe

In [None]:
#check our dataframe
df_cases_raw.info()
df_cases_raw.head()


In [None]:
#determine unique values per column
df_cases_raw.nunique(axis=0)

2. Since our study asks whether certain city statistics predict cases, we limit our variables to cases and cities, i.e., CityMuniPSGC.

In [None]:
df_cases = df_cases_raw

In [None]:
df_cases = df_cases[['CaseCode','DateRepConf','CityMuniPSGC','CityMunRes']]

3. We drop the rows where we can't identify the cities

In [None]:
#drop rows with NaN cities (use ~ to invert a boolean mask)
df_cases = df_cases.loc[~df_cases.CityMuniPSGC.isna(),:]

4. For consistency, we rename CityMuniPSGC column to PSGC, which stands for Philippine Standard Geographic Code.

In [None]:
df_cases = df_cases.rename(columns={"CityMuniPSGC": "PSGC"})

5. Then we group the dataframe by city, as represented by PSGC, and count the cases.

In [None]:
df_cases_by_city = df_cases.groupby(['PSGC','CityMunRes'])['CaseCode'].count().reset_index()

6. Then we transform PSGC to standard nine-digit code for consistency

In [None]:
df_cases_by_city.PSGC = df_cases_by_city.PSGC.str.replace('PH','')
df_cases_by_city.rename(columns={"CaseCode": "Cases"}, inplace = True)
df_cases_by_city.rename(columns={"CityMunRes": "Name"}, inplace = True)
df_cases_by_city.sort_values('PSGC', inplace = True)
df_cases_by_city.set_index('PSGC', inplace = True)
#check
df_cases_by_city.info()

7. To limit our dataset, we remove the cities with no cases.

In [None]:
df_cases_by_city = df_cases_by_city.loc[df_cases_by_city.Cases > 0,:].copy()
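
A likely reason zero-case rows appear at all is that the grouping columns are loaded as `category`: with `observed=False`, `groupby` emits every combination of categories, including ones that never occur. The sketch below uses a small hypothetical frame (the values are made up) to show the difference.

```python
import pandas as pd

# Hypothetical stand-in for the case table; values are illustrative only.
cases = pd.DataFrame({
    "PSGC": pd.Categorical(["A", "A", "B"]),
    "Name": pd.Categorical(["alpha", "alpha", "beta"]),
    "CaseCode": ["c1", "c2", "c3"],
})

# observed=False emits every PSGC x Name combination, including empty ones.
all_pairs = cases.groupby(["PSGC", "Name"], observed=False)["CaseCode"].count()
# observed=True keeps only combinations that actually occur in the data.
seen_pairs = cases.groupby(["PSGC", "Name"], observed=True)["CaseCode"].count()

print(len(all_pairs), len(seen_pairs))
```

Passing `observed=True` in the grouping step would make the zero-case filter unnecessary, though keeping the filter is harmless.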

8. Convert city names to lower case

In [None]:
df_cases_by_city.Name = df_cases_by_city.Name.str.lower()

9. Quick checking if data makes sense.

In [None]:
df_cases_by_city.sort_values(by = 'Cases', ascending = False).head()

In [None]:
#spot-check a known city; 'manila' is an assumed search term
df_cases_by_city.loc[df_cases_by_city.Name.str.contains('manila')]

In [None]:
df_cases_by_city.info()

### Poverty Incidence dataset

1. We read the poverty incidence data set.

In [None]:
#read city stats
df_poverty = pd.read_csv('City and Municipal-level Small Area Poverty Estimates_ 2009, 2012 and 2015_0.csv',
                  skiprows = [0,1,2,3,4],
                  encoding = "ISO-8859-1",
                  names = ['PSGC_ID','Poverty Incidence'],
                  usecols = [0,5],
                  dtype = {'PSGC_ID':'object','Poverty Incidence':np.float16})

2. We clean it up by dropping NaNs and invalid rows.

In [None]:
#drop NaN indexes
df_poverty = df_poverty.loc[df_poverty.index.dropna()]

In [None]:
#drop other invalid rows
df_poverty = df_poverty.loc[~df_poverty.PSGC_ID.isna()]

3. For consistency on merge, we transform the PSGC codes.

In [None]:
df_poverty['PSGC'] = df_poverty.PSGC_ID.str.zfill(6)
df_poverty['PSGC'] = df_poverty.PSGC.str.ljust(9, '0')
df_poverty.sort_values('PSGC', inplace = True)
df_poverty.drop('PSGC_ID', axis =1, inplace = True)
df_poverty.set_index('PSGC', inplace = True)
df_poverty.sort_index(inplace = True)
#check
df_poverty.info()
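
The two-step padding can be checked on a few hypothetical raw codes. The sketch assumes the poverty file stores a six-digit municipality-level code that may have lost its leading zero, which is what `zfill(6)` followed by `ljust(9, '0')` would repair.

```python
import pandas as pd

raw = pd.Series(["12801", "133901", "012801"])  # hypothetical raw codes

# Restore a dropped leading zero, then append the three barangay digits as zeros.
padded = raw.str.zfill(6).str.ljust(9, "0")
print(padded.tolist())
```

Both steps work on strings, so the column must be read as `object` rather than as an integer for this to behave as intended.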

In [None]:
df_poverty.head()

In [None]:
df_poverty.loc['133902000']

### City index data

1. We load city data containing geographic code and other features.

In [None]:
df_cityindex = pd.read_excel('PSGC 2Q 2020 Publication.xlsx',
                    sheet_name = 'PSGC',
                    usecols = [i for i in range(7)],
                    names = ['PSGC',
                             'Name',
                             'Geographic Level',
                             'City Class',
                             'Income Classification',
                             'Urban Rural',
                             'Population'])

2. We filter the dataset to only include cities, municipalities and sub municipalities.

In [None]:
df_cityindex.head()

In [None]:
df_cityindex = df_cityindex[df_cityindex['Geographic Level'].isin(['City','Mun','SubMun'])]

3. For our purposes, we replace NaNs with 'UNK'

In [None]:
df_cityindex = df_cityindex.fillna('UNK')

In [None]:
df_cityindex.info()

4. We examine each feature to see if we can further clean it up.

In [None]:
df_cityindex['Geographic Level'].unique()

In [None]:
df_cityindex['City Class'].unique()

In [None]:
df_cityindex['Income Classification'].unique()

In [None]:
#regex=True is required for pattern replacement in recent pandas versions
df_cityindex['Income Classification'] = df_cityindex['Income Classification'].str.replace(r'\*.*','', regex=True)
df_cityindex['Income Classification'] = df_cityindex['Income Classification'].str.replace(r' .*','', regex=True)
df_cityindex['Income Classification'] = df_cityindex['Income Classification'].str.replace('-','UNK', regex=False)
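
As a quick check of what the three replacements do, the sketch below applies them to a handful of hypothetical raw labels (the sample strings are assumptions, not values taken from the PSGC workbook).

```python
import pandas as pd

income = pd.Series(["1st*", "2nd (new)", "-", "Special", "3rd"])  # hypothetical raw labels

cleaned = (
    income
    .str.replace(r"\*.*", "", regex=True)   # drop an asterisk and anything after it
    .str.replace(r" .*", "", regex=True)    # drop everything from the first space
    .str.replace("-", "UNK", regex=False)   # a lone dash marks a missing class
)
print(cleaned.tolist())
```

Passing `regex` explicitly keeps the behaviour the same across pandas versions, since the default for this argument has changed over time.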

In [None]:
df_cityindex['Urban Rural'].unique()

Since Urban Rural only contains UNK, we drop it.

In [None]:
df_cityindex.drop('Urban Rural', axis =1, inplace = True)

5. As before, we transform the geographic code for consistency.

In [None]:
df_cityindex.PSGC = df_cityindex.PSGC.astype('str')
df_cityindex.PSGC = df_cityindex.PSGC.str.zfill(9)

6. Then we sort the dataframe according to the geographic code and make that the index.

In [None]:
df_cityindex.sort_values('PSGC', inplace = True)
df_cityindex.set_index('PSGC', inplace = True)

7. For consistency, we transform city name to lower case.

In [None]:
df_cityindex.Name = df_cityindex.Name.str.lower()

8. Then we change the features into category type.

In [None]:
df_cityindex = df_cityindex.astype(dtype = {'Name':'category',
                                            'Geographic Level':'category',
                                            'City Class':'category',
                                            'Income Classification':'category'})

9. Quick checking

In [None]:
df_cityindex.info()

In [None]:
df_cityindex.head()

In [None]:
df_cityindex.loc['133902000']

### Cities Geodata

1. We load the geojson data on Philippine cities and municipalities.

In [None]:
#df_geodata = gpd.read_file('/Users/gio/Google Drive/Education/IBM Data Science Professional Certificate/Capstone/Municities') 
df_geodata = gpd.read_file('https://github.com/altcoder/philippines-psgc-shapefiles/raw/master/source/2015/Municities.zip')

Note: Other possible geojson files

'https://github.com/faeldon/philippines-json-maps/tree/master/geojson/municties/medres'
'https://github.com/justinelliotmeyers/official_philippines_shapefile_data_2016'
'https://raw.githubusercontent.com/macoymejia/geojsonph/master/MuniCities/MuniCities.json'
'https://github.com/altcoder/philippines-psgc-shapefiles/blob/master/source/2015/Municities.zip'
'https://raw.githubusercontent.com/macoymejia/geojsonph/master/MuniCities/MuniCities.minimal.json'

2. As we are interested in finding out whether distance from the capital is a factor, we need to calculate the distance from Manila.

In [None]:
df_geodata['Centroid'] = df_geodata.geometry.centroid
manila_loc = gpd.tools.geocode('Manila')
df_geodata['Distance from Manila'] = gpd.GeoSeries(df_geodata.Centroid).distance(manila_loc.iloc[0,0])
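
Note that `GeoSeries.distance` returns values in the units of the layer's coordinate reference system, so if the shapefile is in longitude/latitude the distances above are in degrees rather than kilometres. That is enough for ranking and correlation, but for a distance in kilometres a great-circle formula is one option. The sketch below is a plain-Python haversine; `haversine_km` is a hypothetical helper and the coordinates are approximate, illustrative values rather than centroids from the dataset.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Approximate, illustrative coordinates (assumptions, not taken from the dataset).
manila = (14.60, 120.98)
quezon_city = (14.68, 121.04)
print(round(haversine_km(*manila, *quezon_city), 1))
```

An alternative is to reproject the GeoDataFrame to a metric CRS before calling `distance`; the haversine version is shown only because it has no dependencies.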



3. And as usual, let's do a clean up.

In [None]:
# drop other invalid rows
df_geodata = df_geodata.loc[~df_geodata['Distance from Manila'].isna(),:]

In [None]:
df_geodata['PSGC'] = df_geodata.ADM3_PCODE.str.replace('PH','')

In [None]:
#change index to PSGC
df_geodata.set_index('PSGC', inplace = True)

4. Quick checking. 

In [None]:
df_geodata.loc['133902000']

### Merge data

In [None]:
df_merged = pd.merge(df_geodata, df_cityindex, how = 'inner', left_index = True, right_index = True)

In [None]:
df_merged = pd.merge(df_merged, df_cases_by_city, how = 'inner', left_index=True, right_index=True)

In [None]:
df_merged = pd.merge(df_merged, df_poverty, how = 'inner', left_index=True, right_index=True)
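
Because each inner merge silently drops rows whose PSGC is missing on either side, it can help to audit the join first. The sketch below uses two tiny hypothetical frames (names and values are made up) to show `indicator=True` for counting unmatched keys and `validate` for guarding against duplicated codes.

```python
import pandas as pd

geo = pd.DataFrame({"Area": [1.0, 2.0, 3.0]},
                   index=pd.Index(["001", "002", "003"], name="PSGC"))
cases = pd.DataFrame({"Cases": [10, 5]},
                     index=pd.Index(["002", "004"], name="PSGC"))

# An outer merge with indicator=True labels where each key was found.
audit = pd.merge(geo, cases, how="outer", left_index=True, right_index=True, indicator=True)
print(audit["_merge"].value_counts())

# validate="one_to_one" raises MergeError if a key repeats on either side.
merged = pd.merge(geo, cases, how="inner", left_index=True, right_index=True,
                  validate="one_to_one")
print(merged)
```

Rows marked `left_only` or `right_only` in the audit are the ones an inner merge would discard.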

2. We can add 'Density' as an additional feature.

In [None]:
df_merged['Density'] = df_merged.Population/df_merged.Shape_Area

3. We clean up irrelevant columns.

In [None]:
df_merged.drop(['Name_x','Name_y', 'Shape_Leng', 'date', 'validOn', 'validTo', 
                'ADM3_PCODE', 'ADM3_REF', 'ADM3ALT1EN', 'ADM3ALT2EN', 'ADM2_EN', 'ADM2_PCODE', 'ADM1_EN', 
                'ADM1_PCODE', 'ADM0_EN', 'ADM0_PCODE', 'Shape_Area'], 
               axis = 1, inplace = True)

In [None]:
df_merged.rename(columns = {'ADM3_EN' : 'Name'}, inplace=True)

4. We now define 'Case Per Capita' as our target variable.

In [None]:
df_merged['Case Per Capita'] = df_merged.Cases/df_merged.Population

In [None]:
df_merged.head()

### Exploratory Data Analysis
So now we have a dataset of cities and municipalities with features such as city class, income classification, population, poverty incidence, distance from Manila, and density as possible predictors of cases per capita.

1. We check our data.

In [None]:
df_merged.sort_values('Case Per Capita', ascending = False)

In [None]:
df_merged.info()

2. Let's get rid of rows with NaN Density values.

In [None]:
#recent pandas versions reject passing both how and thresh, so only subset is given
df_merged.dropna(subset=['Density'], inplace = True)

3. Since we have explicitly changed NaNs to literal UNK categorical values, it is worth checking how many UNKs there are.

In [None]:
#report the share of UNK values in each column
for columns in df_merged.columns:
    print(f'Percent UNK Column {columns} = {((sum(df_merged[columns] == "UNK"))/len(df_merged[columns])):.0%}')

This shows that City Class consists mostly of UNKs, so we drop it from our list of features.

In [None]:
df_merged.drop(['City Class'], axis = 1, inplace = True)

4. We will now run some quick correlation analysis to determine which of the features could predict our target variable. Note that we exclude Cases and Population as they are already captured in Case Per Capita.

5. We first rearrange the columns for easier reading.

In [None]:
df_merged = df_merged[['Name', 'geometry', 'Centroid', 'Geographic Level', 'Income Classification', 
                       'Population', 'Cases', 'Distance from Manila', 'Poverty Incidence', 'Density', 
                       'Case Per Capita']]

In [None]:
g = sns.pairplot(df_merged, vars = ['Distance from Manila','Poverty Incidence', 'Density', 'Case Per Capita'])
g.fig.suptitle("Pair Plot", y=1.08)

In [None]:
sns.heatmap(df_merged.corr(numeric_only=True).iloc[2:,2:],annot=True,lw=1)

We can therefore conclude that, of the continuous variables we have, Density has the strongest correlation with Case Per Capita.

6. Separating the analysis by geographic level shows correlation to be stronger at City level.

Let us now explore our data if we group them according to Geographic Level and Income Classification.

First let's do a quick pairplot correlation analysis while differentiating the colors by Geographic Level.

In [None]:
sns.pairplot(df_merged, hue = "Geographic Level", vars = ['Distance from Manila','Poverty Incidence', 'Density', 'Case Per Capita'])

In [None]:
geo_levels = df_merged['Geographic Level'].unique()

In [None]:
for level in geo_levels:
    sub_df = df_merged.loc[df_merged['Geographic Level'] == level,['Distance from Manila','Poverty Incidence', 
                                                                   'Density', 'Case Per Capita']]
    print(level)
    #print('\n')
    print(sub_df.corr())
    print('\n\n')
    g = sns.pairplot(sub_df)
    g.fig.suptitle(level, y=1.08)

Consistent with our earlier observation, we can see that even when we drilled down to Geographic Level, Density still has the strongest correlation. Interestingly, at the Sub Municipality Level, the correlation is still strong but negative.

Examining the effects if we group them by Income Classification.

In [None]:
sns.pairplot(df_merged, hue = "Income Classification", vars = ['Distance from Manila','Poverty Incidence', 'Density', 'Case Per Capita'])

In [None]:
income_classes = df_merged['Income Classification'].unique()

for level in income_classes:
    sub_df = df_merged.loc[df_merged['Income Classification'] == level,['Distance from Manila','Poverty Incidence','Density','Case Per Capita']]
    print(level)
    #print('\n')
    print(sub_df.corr())
    print('\n\n')
    g = sns.pairplot(sub_df)
    g.fig.suptitle(level, y=1.08)

When drilling down by Income Classification, our earlier observation remains consistent, with almost perfect correlation in 1st Class cities. The correlation weakens after the 4th Class.

### Metro Manila Cases

Since most of the cases occur in Metro Manila, let us see how our variables predict Cases Per Capita in the Metro Manila area.

Cities in the National Capital Region (Metro Manila) have PSGC codes that begin with 13. Limiting our dataset to Metro Manila cities, we have:

In [None]:
#psgc_mm = ['133900000', '137401000', '137402000', '137403000', '137404000', '137405000', '137501000', '137502000', '137503000', '137504000', '137601000', '137602000', '137603000', '137604000', '137606000', '137605000', '137607000']
psgc_mm = ['137401000', '137402000', '137403000', '137404000', '137405000', '137501000', '137502000', '137503000', '137504000', '137601000', '137602000', '137603000', '137604000', '137606000', '137605000', '137607000']
df_mm = df_merged.loc[psgc_mm]
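
As an alternative to a hard-coded list, rows can be selected by the region digits of the PSGC index. The sketch below uses a small hypothetical frame; the names attached to the codes are illustrative, and the approach assumes the index holds nine-digit strings.

```python
import pandas as pd

# Hypothetical sample rows; only the first two digits of each code matter here.
df = pd.DataFrame(
    {"Name": ["quezon city", "makati", "baguio"]},
    index=pd.Index(["137404000", "137602000", "141102000"], name="PSGC"),
)

# Keep rows whose region digits (the first two characters) are "13".
ncr = df.loc[df.index.str.startswith("13")]
print(ncr)
```

A prefix filter keeps every row in the region, so it would also include any codes that an explicit list deliberately leaves out.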

In [None]:
sns.pairplot(df_mm, vars = ['Distance from Manila','Poverty Incidence', 'Density', 'Case Per Capita'])

In [None]:
sns.heatmap(df_mm.corr(numeric_only=True).iloc[2:,2:],annot=True,lw=1)

Interestingly, for Metro Manila, Density is not such a strong predictor. Distance from the center of old Manila, however, is a strong negative predictor: at least within the capital region, the farther a city is from old Manila, the lower its Cases Per Capita.

### Normalizing Data

In [None]:
df2 = df_merged.copy()

In [None]:
df2[['Distance from Manila','Poverty Incidence', 'Density']]

In [None]:
# normalize variables

#initialize scaler
from sklearn.preprocessing import MinMaxScaler
mm = MinMaxScaler()
data = df2[['Distance from Manila','Poverty Incidence', 'Density']]
mm.fit(data)
df2[['Distance from Manila scaled','Poverty Incidence scaled', 'Density scaled']] = list(mm.transform(data))


In [None]:
df2 = df2[['Name', 'geometry', 'Centroid', 'Distance from Manila','Geographic Level', 'Income Classification', 
     'Population', 'Cases','Poverty Incidence', 'Density', 'Distance from Manila scaled', 
     'Poverty Incidence scaled','Density scaled', 'Case Per Capita']]

In [None]:
g = sns.pairplot(df2, vars = ['Distance from Manila scaled','Poverty Incidence scaled', 'Density scaled', 'Case Per Capita'])
g.fig.suptitle("Pair Plot", y=1.08)

In [None]:
sns.heatmap(df2.corr(numeric_only=True).iloc[5:,5:],annot=True,lw=1)
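
Min-max scaling is a positive linear rescaling, so the Pearson correlation between a scaled feature and the target should equal that of the unscaled feature. The sketch below checks this on synthetic data; the column names mirror the notebook, but the numbers are randomly generated, and the scaling is written out by hand as the same mapping `MinMaxScaler` applies with its default range.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"Density": rng.uniform(100, 50_000, size=50)})
df["Case Per Capita"] = 1e-7 * df["Density"] + rng.normal(0, 1e-3, size=50)

# Min-max scaling to [0, 1], the same mapping MinMaxScaler applies by default.
col = df["Density"]
df["Density scaled"] = (col - col.min()) / (col.max() - col.min())

r_raw = df["Density"].corr(df["Case Per Capita"])
r_scaled = df["Density scaled"].corr(df["Case Per Capita"])
print(r_raw, r_scaled)
```

Scaling therefore matters for models that are sensitive to feature magnitude, such as regularized regression, rather than for the correlation analysis itself.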

We now repeat the grouped analysis on the normalized dataframe, df2, to check that the observations above still hold.

First, a pairplot colored by Geographic Level.

In [None]:
sns.pairplot(df2, hue = "Geographic Level", vars = ['Distance from Manila','Poverty Incidence', 'Density', 'Case Per Capita'])

In [None]:
geo_levels = df2['Geographic Level'].unique()

In [None]:
for level in geo_levels:
    sub_df = df2.loc[df2['Geographic Level'] == level,['Distance from Manila','Poverty Incidence', 
                                                                   'Density', 'Case Per Capita']]
    print(level)
    #print('\n')
    print(sub_df.corr())
    print('\n\n')
    g = sns.pairplot(sub_df)
    g.fig.suptitle(level, y=1.08)

Consistent with our earlier observation, we can see that even when we drilled down to Geographic Level, Density still has the strongest correlation. Interestingly, at the Sub Municipality Level, the correlation is still strong but negative.

Examining the effects if we group them by Income Classification.

In [None]:
sns.pairplot(df2, hue = "Income Classification", vars = ['Distance from Manila','Poverty Incidence', 'Density', 'Case Per Capita'])

In [None]:
income_classes = df2['Income Classification'].unique()

for level in income_classes:
    sub_df = df2.loc[df2['Income Classification'] == level,['Distance from Manila','Poverty Incidence','Density','Case Per Capita']]
    print(level)
    #print('\n')
    print(sub_df.corr())
    print('\n\n')
    g = sns.pairplot(sub_df)
    g.fig.suptitle(level, y=1.08)

When drilling down by Income Classification, our earlier observation remains consistent, with almost perfect correlation in 1st Class cities. The correlation weakens after the 4th Class.

### Metro Manila Cases

Since most of the cases occur in Metro Manila, let us see how our variables predict Cases Per Capita in the Metro Manila area.

Cities in the National Capital Region (Metro Manila) have PSGC codes that begin with 13. Limiting our dataset to Metro Manila cities, we have:

In [None]:
#psgc_mm = ['133900000', '137401000', '137402000', '137403000', '137404000', '137405000', '137501000', '137502000', '137503000', '137504000', '137601000', '137602000', '137603000', '137604000', '137606000', '137605000', '137607000']
psgc_mm = ['137401000', '137402000', '137403000', '137404000', '137405000', '137501000', '137502000', '137503000', '137504000', '137601000', '137602000', '137603000', '137604000', '137606000', '137605000', '137607000']
df_mm = df2.loc[psgc_mm]

In [None]:
df_mm = df_mm[['Name', 'geometry', 'Centroid', 'Distance from Manila','Geographic Level', 'Income Classification', 
               'Population', 'Cases','Poverty Incidence', 'Density', 'Distance from Manila scaled', 
               'Poverty Incidence scaled','Density scaled', 'Case Per Capita']]

In [None]:
sns.pairplot(df_mm, vars = ['Distance from Manila','Poverty Incidence', 'Density', 'Case Per Capita'])

In [None]:
sns.heatmap(df_mm.corr(numeric_only=True).iloc[5:,5:],annot=True,lw=1)

## Visualization

In [None]:
df_mm.head()

In [None]:
df_mm.plot(column='Case Per Capita', cmap='OrRd', scheme='quantiles')

In [None]:
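geo = df_mm[['Name', 'geometry', 'Case Per Capita']]  # drop Centroid: a second geometry column cannot be serialized to GeoJSON
m = folium.Map(location=[14.6, 121.0], zoom_start=11)  # approximate centre of Metro Manila
folium.Choropleth(geo_data=geo, data=geo, columns=['Name', 'Case Per Capita'], key_on='feature.properties.Name', fill_color='YlOrBr').add_to(m)
m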

## Next Steps

* mapped visualizations
* get rid of non normalized data
* write report

# Others

In [None]:
#count UNK values in each column before creating dummy variables
for columns in df_merged.columns:
    print(f'Number of Unknowns for Column {columns} = {sum(df_merged[columns] == "UNK")}')

In [None]:
df_merged = pd.get_dummies(df_merged, 
               columns = ['Geographic Level', 'Income Classification'], 
               prefix = ['GL','IC'])

In [None]:
df_merged = df_merged[['Name', 'Population', 'Poverty Incidence', 'GL_City', 
     'GL_Mun', 'GL_SubMun', 'IC_1st', 'IC_2nd','IC_3rd', 'IC_4th', 
     'IC_5th', 'IC_6th', 'IC_Special', 'IC_UNK','Case Per Capita']]

In [None]:
#check histogram of population
df_merged['Population'][df_merged['IC_3rd'] == 1].hist()

In [None]:
#normalize population
ss = StandardScaler()
df_merged['Population Normalized'] = ss.fit_transform(df_merged[['Population']])

In [None]:
df_merged.corr(numeric_only=True).loc['Case Per Capita']

In [None]:
df_merged.head()

### Inference

In [None]:
#define target variable
y = df_merged['Case Per Capita']

In [None]:
y

In [None]:
X = df_merged[['Population Normalized', 'Poverty Incidence', 'GL_City', 'GL_Mun', 'GL_SubMun', 'IC_1st', 'IC_2nd','IC_3rd', 'IC_4th', 'IC_5th', 'IC_6th', 'IC_Special', 'IC_UNK']]

In [None]:
X

In [None]:
#split into training and test sets
X_train, X_valid, y_train, y_valid = train_test_split(X,y, test_size=0.2)
print(X_train.shape, X_valid.shape, y_train.shape, y_valid.shape)

In [None]:
#fitting a linear model
lm = LinearRegression()
lm.fit(X_train,y_train)

In [None]:
lm.score(X_train,y_train)

In [None]:
lm.score(X_valid,y_valid)

In [None]:
y_pred = lm.predict(X_valid)
rmse = np.sqrt(metrics.mean_squared_error(y_pred, y_valid))
rmse
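
RMSE is expressed in the units of the target, and Case Per Capita is a small fraction, so a small RMSE is not informative on its own. One way to put it in context is to compare it with the RMSE of always predicting the mean. The sketch below does this with NumPy on made-up numbers (both arrays are hypothetical).

```python
import numpy as np

y_true = np.array([0.002, 0.004, 0.001, 0.003])  # hypothetical observed values
y_hat = np.array([0.003, 0.004, 0.002, 0.001])   # hypothetical predictions

# RMSE: square the residuals, average them, then take the square root.
rmse = np.sqrt(np.mean((y_true - y_hat) ** 2))
# Baseline: always predict the mean of the observed values.
baseline_rmse = np.sqrt(np.mean((y_true - y_true.mean()) ** 2))
print(rmse, baseline_rmse)
```

A model whose RMSE is not clearly below the baseline is doing no better than guessing the average, which is the same information the R² score conveys.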

In [None]:
rdgCV = RidgeCV(alphas=[0.01,0.1,1,10,100,1000], cv=5)
rdgCV.fit(X_train,y_train)

In [None]:
print(rdgCV.alpha_)

In [None]:
rdg = Ridge(alpha=10)
rdg.fit(X_train, y_train)
rdg.score(X_valid, y_valid)

In [None]:
y_pred = rdg.predict(X_valid)
rmse = np.sqrt(metrics.mean_squared_error(y_pred, y_valid))
rmse

In [None]:
rfr = RandomForestRegressor(n_jobs=-1, n_estimators=100)
#fit on the training split only so the validation score is not inflated
rfr.fit(X_train,y_train)

In [None]:
rfr.score(X_valid,y_valid)
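
A validation score is only meaningful if the model has not seen the validation rows during fitting. The sketch below illustrates this on synthetic data where the target is pure noise: a forest fitted on all rows still scores well on the "validation" rows because it has memorized them, while one fitted on the training split does not. The data, sizes, and seeds are arbitrary assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = rng.normal(size=200)  # pure noise: no feature can truly predict it

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=0)

# Fitted on every row, including the validation rows.
leaky = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
# Fitted on the training split only.
honest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)

leaky_score = leaky.score(X_valid, y_valid)
honest_score = honest.score(X_valid, y_valid)
print("fit on all rows  :", leaky_score)
print("fit on train only:", honest_score)
```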

In [None]:
y_pred = rfr.predict(X_valid)
rmse = np.sqrt(metrics.mean_squared_error(y_pred, y_valid))
rmse

In [None]:
print(lm.coef_)
print(np.argmax(lm.coef_))
print(X.columns[np.argmax(lm.coef_)])  #coefficients follow the column order of X, not df_merged
print(rdgCV.coef_)
print(np.argmax(rdgCV.coef_))

In [None]:
rfr.fit(X_train,y_train)

y_lm_pred = lm.predict(X_train)
y_rdgCV_pred = rdgCV.predict(X_train)
y_rfr_pred = rfr.predict(X_train)

print('-----training score ---')
print(lm.score(X_train, y_train))
print(rdgCV.score(X_train, y_train))
print(rfr.score(X_train, y_train))
print('----Validation score ---')
print(lm.score(X_valid, y_valid))
print(rdgCV.score(X_valid, y_valid))
print(rfr.score(X_valid, y_valid))

In [None]:
#heatmap needs 2-D input, so keep the target as a single-column DataFrame
sns.heatmap(df_merged.corr(numeric_only=True)[['Case Per Capita']], robust = True)

# Results
section where you discuss the results.

# Discussion
section where you discuss any observations you noted and any recommendations you can make based on the results.

# Conclusion
section where you conclude the report.

# Extras below
## Next Steps
* merge geojson in main dataframe
* calculate distance from manila
* determine correlation between distance from manila
* do same analysis as earlier with baranggay level

In [None]:
feature_to_remove = 'Geographic Level'

In [None]:
sub_df = df_merged

In [None]:
col_list_new = list(sub_df.columns)
col_list_new

In [None]:
sns.pairplot(df_merged[['Geographic Level','Population','Poverty Incidence','Case Per Capita']], hue = 'Geographic Level',
            diag_kind="hist")

In [None]:
def corr_by_level(dataframe, variable):
    #plot pairwise relationships separately for each value of `variable`
    for level in dataframe[variable].unique():
        sub_df = dataframe.loc[dataframe[variable] == level,['Population', 'Poverty Incidence','Case Per Capita']]
        print(level)
        print('\n')
        #print(sub_df.corr().loc['Case Per Capita'])
        sns.pairplot(sub_df)
        print('\n\n')

In [None]:
for level in geo_levels:
    sub_df = df_merged.loc[df_merged['Geographic Level'] == level,['Population', 'Poverty Incidence','Case Per Capita']]
    print(level)
    print('\n')
    #print(sub_df.corr().loc['Case Per Capita'])
    sns.pairplot(sub_df)
    print('\n\n')

In [None]:
#check use of "city of" or "<city name> city"
df_cityindex.loc[df_cityindex.Name.str.contains('city'),:]

In [None]:
df_cityindex['Name2'] = df_cityindex.Name.str.replace('city of','')
df_cityindex['Name2'] = df_cityindex.Name2.str.replace(r'\(capital\)','', regex=True)

In [None]:
df_cityindex.loc[df_cityindex.Name.str.contains('city'),:]

In [None]:
df_merged.columns

In [None]:
df_merged['Geographic Level'].unique()

In [None]:
df_merged.loc[df_merged['Geographic Level'] == 'Mun',['Population', 'Poverty Incidence']]

In [None]:
for level in df_merged['Geographic Level'].unique():
    sub_df = df_merged.loc[df_merged['Geographic Level'] == level,['Population', 'Poverty Incidence','Case Per Capita']]
    print(level)
    print('\n')
    #print(sub_df.corr().loc['Case Per Capita'])
    sns.pairplot(sub_df)
    print('\n\n')

In [None]:
df_merged['Income Classification'].unique()

In [None]:
df1 = gpd.read_file('https://raw.githubusercontent.com/macoymejia/geojsonph/master/MuniCities/MuniCities.minimal.json')

In [None]:
df1['Centroid'] = df1.geometry.centroid
manila_loc = gpd.tools.geocode('Manila')
df1['Distance from Manila'] = gpd.GeoSeries(df1.Centroid).distance(manila_loc.iloc[0,0])

#drop other invalid rows
df1 = df1.loc[~df1['Distance from Manila'].isna(),:]

In [None]:
df1.NAME_2 = df1.NAME_2.str.lower()

In [None]:
df1.loc[df1.NAME_2.str.contains('city'),:]

In [None]:
df1['NAME_3'] = df1.NAME_2.str.replace(' city','')

In [None]:
df1.head()

In [None]:
# clean up NaNs



In [None]:
df1.sort_values(by = 'Distance from Manila')

In [None]:
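#a GitHub 'blob' URL serves an HTML page rather than the zip archive, so the 'raw' URL is used instead
df_geodata = gpd.read_file('https://github.com/altcoder/philippines-psgc-shapefiles/raw/master/source/2015/Municities.zip')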

In [None]:
#df_geodata = pd.read_csv('https://github.com/altcoder/philippines-psgc-shapefiles/blob/master/datasets/CSV/Municipalities.csv')

Note:
    
* bug reported in stackoverflow https://stackoverflow.com/questions/65229994/geopandas-readfile-not-recognizing-a-legit-shape-file and github of repository https://github.com/faeldon/philippines-json-maps/issues/1.

* follow up with faeldon bot@renovateapp.com