# IBM Data Science Capstone Project

# The problem

The world is currently facing the COVID-19 pandemic, a disease that has spread across the globe.

In the absence of a vaccine, governments have taken a range of measures to slow the advance of the virus, the most effective being social distancing and lockdowns. A lockdown is an extreme measure that severely affects a country's economy. For that reason, lockdowns are applied only partially while the number of confirmed cases and the level of risk remain low.

In this notebook, we will explore the city of Santiago, the capital of Chile. The city is made up of 32 neighborhoods, each with a different level of confirmed COVID-19 cases. We will cluster the neighborhoods, look for a relationship between the features of each neighborhood and the level of infection in its population, and finally determine whether each neighborhood should be in lockdown.

# The data

We will scrape websites and use several libraries to build our dataset:

    1. Wikipedia: to get the list of neighborhoods and their demographic data.
    2. The official health department website: to get COVID-19 related information.
    3. The official statistics department (INE): to get up-to-date demographic info.
    4. Foursquare API: to get venue data for each neighborhood.
  

In [270]:
import pandas as pd # Data Analysis
import numpy as np # Mathematical computation

import matplotlib.pyplot as plt
%matplotlib inline

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import folium #Map Rendering

# Web scraping
import requests
from bs4 import BeautifulSoup

import json

In [271]:
# We create a DataFrame of Santiago's neighborhoods from Wikipedia
column_names = ['Neighborhood', 'Borough']

# We use BeautifulSoup to scrape Wikipedia
url = 'https://es.wikipedia.org/wiki/Anexo:Comunas_de_Chile'

res = requests.get(url).text
soup = BeautifulSoup(res, 'lxml')

# Skip the header row of the table
table = soup.find('table', {'class': 'wikitable sortable'}).find_all('tr')[1:]

# Collect the rows in a list first: DataFrame.append was removed in pandas 2.0
rows = []
for items in table:
    data = items.find_all(['th', 'td'])
    #code = data[0].text
    name = data[1].a.text
    borough = data[3].text

    rows.append({'Neighborhood': name,
                 'Borough': borough[:-1]})  # drop the trailing character of the cell text

df_stgo = pd.DataFrame(rows, columns=column_names)

# We process the data and keep only the rows where Borough == 'Santiago'
df_stgo = df_stgo.dropna(subset=['Neighborhood'])
df_stgo = df_stgo[df_stgo['Borough'] == 'Santiago'].reset_index(drop=True)

len(df_stgo) # 32 neighborhoods in Santiago

32

In [272]:
df_stgo.head()

Unnamed: 0,Neighborhood,Borough
0,Santiago,Santiago
1,Cerrillos,Santiago
2,Cerro Navia,Santiago
3,Conchalí,Santiago
4,El Bosque,Santiago


The Wikipedia table we scraped has "population" and "density" columns that might be very valuable for our study, but they are outdated, so they will not provide meaningful insights, given how dynamic the spread of COVID-19 is.

To get up-to-date figures, we will use the official population data from the national statistics department (INE).

In [273]:
# Get data from the national statistics department (INE)
neighborhood_population = pd.read_csv(
    "http://www.ine.cl/docs/default-source/proyecciones-de-poblacion/cuadros-estadisticos/base-2017/ine_estimaciones-y-proyecciones-2002-2035_base-2017_comunas0381d25bc2224f51b9770a705a434b74.csv", 
    encoding='ISO-8859-1', sep=",", thousands='.')

neighborhood_pop = neighborhood_population.groupby('Nombre Comuna')['Poblacion 2020'].sum()
neighborhood_pop = neighborhood_pop.reset_index()
neighborhood_pop.rename(columns = {'Nombre Comuna':'Neighborhood', 'Poblacion 2020':'Population'}, inplace = True)
neighborhood_pop.head()

Unnamed: 0,Neighborhood,Population
0,Algarrobo,15174
1,Alhué,7405
2,Alto Biobío,6775
3,Alto Hospicio,129999
4,Alto del Carmen,5729


In [274]:
# We merge both DataFrames.
df_stgo = pd.merge(df_stgo, neighborhood_pop, on = 'Neighborhood', how = 'left')
df_stgo.head()

Unnamed: 0,Neighborhood,Borough,Population
0,Santiago,Santiago,503147
1,Cerrillos,Santiago,88956
2,Cerro Navia,Santiago,142465
3,Conchalí,Santiago,139195
4,El Bosque,Santiago,172000
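
A left merge keeps every row of `df_stgo` even when a neighborhood name has no match in the INE data, in which case `Population` silently becomes `NaN`. The sketch below shows one way to surface such mismatches with the `indicator` parameter of `pd.merge`. The two small frames are toy stand-ins built from values shown in the tables above, and "Nunoa" is a deliberately misspelled, hypothetical entry added only to trigger a mismatch.

```python
import pandas as pd

# Toy stand-ins for df_stgo and neighborhood_pop (values taken from the tables above).
# "Nunoa" is a hypothetical misspelling, included only to produce an unmatched row.
left = pd.DataFrame({"Neighborhood": ["Santiago", "Cerrillos", "Nunoa"]})
right = pd.DataFrame({
    "Neighborhood": ["Santiago", "Cerrillos", "Alhué"],
    "Population": [503147, 88956, 7405],
})

# indicator=True adds a "_merge" column that records where each row came from.
merged = pd.merge(left, right, on="Neighborhood", how="left", indicator=True)

# Rows tagged "left_only" found no population figure on the right-hand side.
unmatched = merged.loc[merged["_merge"] == "left_only", "Neighborhood"].tolist()
print(unmatched)
```

If the list is empty, every neighborhood received a population value and the `_merge` column can be dropped.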


In [275]:
# We add the confirmed cases in each neighborhood, as of the latest report from the health department
df_stgo['Confirmed cases'] = 0
df_stgo.set_index('Neighborhood', inplace = True)

# Confirmed cases per neighborhood, official data from epidemiological report #32 of the health department
confirmed_cases = {'Santiago':13592, 'Cerrillos':2706, 'Cerro Navia':5759,
                  'Conchalí':5587, 'El Bosque':5700,'Estación Central':5648,
                  'Huechuraba':3387, 'Independencia':5934, 'La Cisterna':2888,
                  'La Florida':12342, 'La Granja':6433, 'La Pintana':8807,
                  'La Reina':2003, 'Las Condes':5255, 'Lo Barnechea':2847,
                  'Lo Espejo':3746, 'Lo Prado':4211, 'Macul':4436,
                  'Maipú':12902, 'Ñuñoa':5000,'Pedro Aguirre Cerda':3826,
                  'Peñalolén':10589, 'Providencia':2833, 'Pudahuel':8752,
                  'Quilicura':8211, 'Quinta Normal':5223, 'Recoleta':7820,
                  'Renca':7075,'San Joaquín':4400, 'San Miguel':4685,
                  'San Ramón':4084, 'Vitacura':1361}

for i in df_stgo.index:
    df_stgo.loc[i, 'Confirmed cases'] = confirmed_cases[i]

df_stgo.reset_index(inplace=True)
df_stgo.head()

Unnamed: 0,Neighborhood,Borough,Population,Confirmed cases
0,Santiago,Santiago,503147,13592
1,Cerrillos,Santiago,88956,2706
2,Cerro Navia,Santiago,142465,5759
3,Conchalí,Santiago,139195,5587
4,El Bosque,Santiago,172000,5700
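
Raw case counts tend to be larger in more populous neighborhoods, so it can help to look at cases relative to population as well. The following is a minimal sketch of one possible normalization, cases per 100,000 inhabitants, computed on the five rows shown above. The column name `Cases per 100k` is an assumption introduced here for illustration and is not part of the original dataset.

```python
import pandas as pd

# The five rows displayed above, copied by hand.
sample = pd.DataFrame({
    "Neighborhood": ["Santiago", "Cerrillos", "Cerro Navia", "Conchalí", "El Bosque"],
    "Population": [503147, 88956, 142465, 139195, 172000],
    "Confirmed cases": [13592, 2706, 5759, 5587, 5700],
})

# Hypothetical derived column: confirmed cases per 100,000 inhabitants.
sample["Cases per 100k"] = (
    sample["Confirmed cases"] / sample["Population"] * 100_000
).round(1)

# Sort so the neighborhoods with the highest relative incidence come first.
print(sample.sort_values("Cases per 100k", ascending=False))
```

In this sample, Santiago has by far the most confirmed cases but the lowest rate per inhabitant, which illustrates why the two views can rank neighborhoods differently.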


In [276]:
# We use Nominatim to get the latitude & longitude of each neighborhood
geolocator = Nominatim(user_agent="my-project")  # create the geocoder once, not on every iteration
for i, loc in enumerate(df_stgo['Neighborhood'].to_list()):
    address = f'{loc.lower()}, Santiago'
    location = geolocator.geocode(address)
    if location is None:  # geocode() returns None when no match is found
        print(f'{i} - {loc} NOT FOUND')
        continue
    df_stgo.loc[i, 'Latitude'] = location.latitude
    df_stgo.loc[i, 'Longitude'] = location.longitude
    print(f'{i} - {loc} Done')

0 - Santiago Done
1 - Cerrillos Done
2 - Cerro Navia Done
3 - Conchalí Done
4 - El Bosque Done
5 - Estación Central Done
6 - Huechuraba Done
7 - Independencia Done
8 - La Cisterna Done
9 - La Florida Done
10 - La Granja Done
11 - La Pintana Done
12 - La Reina Done
13 - Las Condes Done
14 - Lo Barnechea Done
15 - Lo Espejo Done
16 - Lo Prado Done
17 - Macul Done
18 - Maipú Done
19 - Ñuñoa Done
20 - Pedro Aguirre Cerda Done
21 - Peñalolén Done
22 - Providencia Done
23 - Pudahuel Done
24 - Quilicura Done
25 - Quinta Normal Done
26 - Recoleta Done
27 - Renca Done
28 - San Joaquín Done
29 - San Miguel Done
30 - San Ramón Done
31 - Vitacura Done
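
All 32 lookups succeeded here, but a geocoder can return `None` when an address is not found, and accessing `.latitude` on `None` raises an error. The sketch below isolates the lookup in a small helper that records failures instead of crashing. The helper name `geocode_all` and the fake geocoder are hypothetical and exist only so the example runs without network access; in the notebook, the `geocode` argument would be `geolocator.geocode`.

```python
from types import SimpleNamespace


def geocode_all(names, geocode, suffix=", Santiago"):
    """Return {name: (lat, lon)} using `geocode`, or None for names that were not found."""
    coords = {}
    for name in names:
        location = geocode(f"{name.lower()}{suffix}")
        if location is None:
            coords[name] = None  # keep the miss visible instead of raising
        else:
            coords[name] = (location.latitude, location.longitude)
    return coords


def fake_geocode(address):
    """Stand-in for a real geocoder: knows a single address, returns None otherwise."""
    known = {
        "santiago, Santiago": SimpleNamespace(latitude=-33.437797, longitude=-70.650445),
    }
    return known.get(address)


# "Atlantis" is a made-up name used to exercise the not-found branch.
result = geocode_all(["Santiago", "Atlantis"], fake_geocode)
print(result)
```

For larger batches against the public Nominatim service, geopy also ships `geopy.extra.rate_limiter.RateLimiter`, which can wrap the `geocode` callable to space out requests.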


In [277]:
# Each neighborhood has now its geospatial coordinates:
df_stgo.head()

Unnamed: 0,Neighborhood,Borough,Population,Confirmed cases,Latitude,Longitude
0,Santiago,Santiago,503147,13592,-33.437797,-70.650445
1,Cerrillos,Santiago,88956,2706,-33.502503,-70.715918
2,Cerro Navia,Santiago,142465,5759,-33.425145,-70.743954
3,Conchalí,Santiago,139195,5587,-33.384775,-70.674606
4,El Bosque,Santiago,172000,5700,-33.562352,-70.67682


### Geospatial visualizations of Santiago

In [278]:
# We create the map object of Santiago, Chile
address = 'Santiago, Chile'

geolocator = Nominatim(user_agent = 'my-project')
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Santiago are {}, {}.'.format(latitude, longitude))

# create map of Santiago, Chile using latitude and longitude values
stgo_map = folium.Map(location=[latitude, longitude], zoom_start=11)

The geographical coordinates of Santiago are -33.4377968, -70.6504451.


In [279]:
# Preparing the GeoJSON file
# GeoJSON file that marks the boundaries of the different neighborhoods in Santiago
json_geo = 'comunas_santiago.geojson'

# GeoJSON is UTF-8 by specification; being explicit avoids mangling names such as Ñuñoa
with open(json_geo, encoding='utf-8') as json_data:
    stgo_data = json.load(json_data)
    
neighborhoods = df_stgo['Neighborhood'].unique()

# We filter the json data to use only the neighborhoods of Santiago
k = {'type': 'FeatureCollection',
    'features':[]}

for data in stgo_data['features']:
    if data['properties']['nombre'] in neighborhoods:
        k['features'].append(data)
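
The filtering step above silently skips any neighborhood whose name is spelled differently in the GeoJSON file, which would leave that area uncolored on the map. As a rough sketch, the same logic can be wrapped in a function that also reports the names it could not find. The function name `filter_features` and the toy feature collection are hypothetical; only the `nombre` property key comes from the code above.

```python
def filter_features(geojson, names, key="nombre"):
    """Keep only the features whose `key` property is in `names`.

    Returns the filtered FeatureCollection and a sorted list of requested
    names that were absent from the GeoJSON data.
    """
    wanted = set(names)
    features = [f for f in geojson["features"] if f["properties"].get(key) in wanted]
    found = {f["properties"][key] for f in features}
    collection = {"type": "FeatureCollection", "features": features}
    return collection, sorted(wanted - found)


# Toy data with geometry omitted; the real file holds polygon boundaries.
toy = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"nombre": "Macul"}, "geometry": None},
        {"type": "Feature", "properties": {"nombre": "Algarrobo"}, "geometry": None},
    ],
}

filtered, missing = filter_features(toy, ["Macul", "Renca"])
print([f["properties"]["nombre"] for f in filtered["features"]])
print(missing)
```

An empty `missing` list confirms that every neighborhood in the DataFrame has a matching boundary.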

In [280]:
# create a numpy array of length 6 with linear spacing from the minimum total to the maximum total
threshold_scale = np.linspace(df_stgo['Confirmed cases'].min(),
                              df_stgo['Confirmed cases'].max(),
                              6, dtype=int)

threshold_scale = threshold_scale.tolist() # change the numpy array to a list
threshold_scale[-1] = threshold_scale[-1] + 1 # make sure that the last value of the list is greater than the maximum

# Map.choropleth() was deprecated and later removed from folium;
# folium.Choropleth replaces it, with `bins` taking the role of `threshold_scale`
folium.Choropleth(
    geo_data=k,
    data=df_stgo,
    columns=['Neighborhood', 'Confirmed cases'],
    key_on='feature.properties.nombre',
    bins=threshold_scale,
    fill_color='YlOrRd',
    fill_opacity=0.6,
    line_opacity=0.2,
    legend_name='Confirmed cases'
).add_to(stgo_map)

# add markers to map
for lat, lng, cases, neighborhood in zip(df_stgo['Latitude'], df_stgo['Longitude'],df_stgo['Confirmed cases'], df_stgo['Neighborhood']):
    label = '{}, {}'.format(neighborhood, cases)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=7,
        popup=label,
        color='yellow',
        fill=True,
        fill_color='blue',
        fill_opacity=0.7,
        parse_html=False).add_to(stgo_map)  


# display map
stgo_map
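
To make the color scale concrete, the sketch below rebuilds the threshold list from the minimum and maximum of the `confirmed_cases` dictionary (1361 for Vitacura and 13592 for Santiago) and shows which of the five color classes a few neighborhoods would fall into. Using `np.digitize` for the class lookup is an illustration chosen here, not necessarily how folium assigns colors internally.

```python
import numpy as np

# Minimum (Vitacura) and maximum (Santiago) of the confirmed_cases dictionary.
low, high = 1361, 13592

# Six edges define five equally wide classes; dtype=int drops the fractional part.
bins = np.linspace(low, high, 6, dtype=int).tolist()
bins[-1] += 1  # push the last edge past the maximum so it falls inside the top class
print(bins)  # [1361, 3807, 6253, 8699, 11145, 13593]

# A few values from confirmed_cases; np.digitize returns the 1-based class index.
cases = {"Vitacura": 1361, "Ñuñoa": 5000, "Santiago": 13592}
classes = {name: int(np.digitize(count, bins)) for name, count in cases.items()}
print(classes)
```

Without the `+ 1` adjustment, the maximum value would sit exactly on the last edge rather than inside the top class.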