<a id='title'></a>
# **Covid19 Vaccination Centers in Lima, Peru**
### Coursera Data Science Capstone - *The battle of neighborhoods (week 1)*
---

### **Content**

- [1. Introduction / Business Understanding](#introduction)
- [2. Data Requirements](#data_req)
- [3. Data Acquisition and Preparation](#data_prep)

---
<a id='introduction'></a>
### **1. Introduction**
#### **1.1 Background**
Since the beginning of 2020, the COVID-19 pandemic has negatively impacted health, the economy, education and other aspects of our society. Peru has been among the countries with the most infections and deaths per million inhabitants in the world, with almost 200,000 deaths since the start of the pandemic.
The table below shows the cumulative indicators of the impact of the disease throughout the country, as reported by the Ministry of Health on August 29, 2021.

| Measure | Result |
|---|---|
| Tests performed | 16 733 426 |
| Positive cases | 2 149 591 |
| Deaths | 198 263 |
    
**Source:** https://www.gob.pe/institucion/minsa/noticias/514015-minsa-casos-confirmados-por-coronavirus-covid-19-ascienden-a-2-149-591-en-el-peru-comunicado-n-662

#### **1.2 The problem**
The Ministry of Health has installed many vaccination centers in most districts of the country. The national vaccination strategy has been under way since February 9, 2021: appointments are scheduled for each inhabitant by age group, starting with the elderly and lowering the age range over the weeks.

Recently, other risk groups (people with rare and orphan diseases or neurodevelopmental disorders) have been added to the vaccination schedule with no age restriction other than being 12 years or older. In addition, people may now be vaccinated in a district other than the one where they live. These changes are altering the flow of people traveling to vaccination centers, resulting in two possible scenarios:
- Long queues outside the vaccination centers
- Empty vaccination centers

The objective of this project is to suggest the number and location of vaccination centers based on population density in Lima. The result can be used to redirect staff from vaccination centers to more appropriate locations and avoid wasting health professional resources.

#### **1.3 Intended audience**
Directors and management staff of:
- Ministry of Health
- EsSalud (social insurance)
- District municipalities
- Private health centers

#### **1.4 Geographic scope**
This project covers the districts of the Lima and Callao provinces.

- Image on the left: Lima location within the country
- Image on the right: Districts of Lima and Callao

<img src="img scope/Peru_Lima_Lima.png" width="325" /> <img src="img scope/Lima-Callao.png" width="248" />

---
<a id='data_req'></a>
### **2. Data Requirements**
The data to be used in this project comes from two principal sources:
- Open Data National Platform https://www.datosabiertos.gob.pe/
- Foursquare

#### **2.1 Description of the data**
The following data will be required for this project:
- List of *ubigeo* equivalences (codes assigned to each district in Peru) available in the Open Data National Platform
- List of venues of Lima (extracted with Foursquare, from all districts in Lima)
- List of vaccination centers with location data (longitude/latitude) available in the Open Data National Platform


#### **2.2 How the data will be used**

**List of *ubigeo* equivalences:** This dataset will first be filtered to the districts in this project's scope. The coordinates of those districts will then be used to search for venues with Foursquare.
The dataset will also be merged with the list of vaccination centers: the district codes it contains are the key used to keep only the vaccination centers within this project's scope. It has 17 columns, of which the following are of interest for this project:

| Column | Type | Description |
|---|---|---|
| ID Ubigeo | Number | Code assigned to districts |
| Department | Text | 1st level of local government, usually called regional |
| Province | Text | 2nd level of local government |
| District | Text | 3rd level of local government |
| Region | Text | Administrative division, in most cases the same as Department |
| Surface | Number | District area in square kilometers |
| Latitude | Number | Location data |
| Longitude | Number | Location data |

Link: https://www.datosabiertos.gob.pe/dataset/codigos-equivalentes-de-ubigeo-del-peru

**List of vaccination centers:** This is a list of the places where vaccination is being carried out. The locations will be plotted to analyze whether the vaccination centers are close to points of high population density. Some of the columns that will be used from this dataset are:

| Column | Type | Description |
|---|---|---|
| ID Ubigeo | Number | District code where the vaccination center has been installed |
| Vaccination Center | Text | Name of the place that hosts the vaccination center |
| Latitude | Number | Location data |
| Longitude | Number | Location data |

Link: https://www.datosabiertos.gob.pe/dataset/centros-de-vacunacion

**List of venues of Lima:** This list will be extracted with Foursquare, and the venues will be clustered to estimate the concentration of dwellings. Because Lima and Callao are adjacent, the districts of both provinces will be included. This dataset will have the following columns:

| Column | Type |
|---|---|
| Venue Name | Text |
| Category | Text |
| Longitude | Number |
| Latitude | Number |

Each cluster of venues or dwellings will be assumed to be a point of high population density.
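As a rough illustration of what "concentration of venues" can mean in practice, the sketch below counts venues per grid cell. This is only one possible approach, not necessarily the clustering used later in the project; the sample coordinates and the cell size `CELL` are assumptions made for the example.

```python
import pandas as pd

# Hypothetical sample of venue coordinates (not real query results)
venues = pd.DataFrame({
    "lat": [-12.0604, -12.0601, -12.0621, -12.1712, -12.1642],
    "lng": [-77.1473, -77.1473, -77.1481, -76.9309, -76.9199],
})

CELL = 0.01  # assumed grid cell size in degrees (about 1.1 km of latitude)

# Assign each venue to a grid cell by flooring its coordinates
venues["cell_lat"] = (venues["lat"] // CELL).astype(int)
venues["cell_lng"] = (venues["lng"] // CELL).astype(int)

# Count venues per cell; the densest cells are candidate high-density points
density = (venues.groupby(["cell_lat", "cell_lng"])
                 .size()
                 .sort_values(ascending=False))
print(density)
```

Cells with the highest counts would be the first candidates to compare against the locations of the vaccination centers.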

---
<a id='data_prep'></a>
### **3. Data Acquisition and Preparation**
First we will download the list of ubigeos and vaccination centers from https://www.datosabiertos.gob.pe/ and save them in the `data orig` folder.
- https://www.datosabiertos.gob.pe/dataset/codigos-equivalentes-de-ubigeo-del-peru
- https://www.datosabiertos.gob.pe/dataset/centros-de-vacunacion

Then, import the necessary libraries.


In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import json
import requests
import folium

#### **3.1 Creating a dataframe with district data**
Considerations:
- The Lima and Callao provinces correspond to the values *LIMA PROVINCIA* and *CALLAO* in the **region** column.
- A new column **num_requests** will be created by comparing the district surface with the area of a circle of 3 km radius (~ 28.27 km<sup>2</sup>).
- **num_requests** is the number of rounds of queries we will send to Foursquare for the district, each query covering a 3 km radius. If the district surface divided by 28.27 km<sup>2</sup> is greater than 1, num_requests is set to 2, which means every venue returned by the first query is used as the central point of a new request.
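The rule above can be checked by hand with a few surfaces taken from the district table. This is a minimal sketch using only the standard library; the helper name `num_requests` is introduced here for illustration.

```python
import math

RADIUS_KM = 3
circle_area = math.pi * RADIUS_KM ** 2  # about 28.27 km^2

def num_requests(surface_km2):
    """Return 1 if the district fits in one 3 km circle, otherwise 2."""
    return 1 if math.ceil(surface_km2 / circle_area) <= 1 else 2

# Surfaces in km^2 as listed in the district dataset
for district, surface in [("BELLAVISTA", 5.0), ("CALLAO", 46.0), ("ANCON", 285.0)]:
    print(district, round(surface / circle_area, 2), num_requests(surface))
```

Bellavista (5 km<sup>2</sup>) fits inside a single circle, while Callao (46 km<sup>2</sup>) and Ancón (285 km<sup>2</sup>) exceed it and are capped at two rounds.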

In [2]:
# Read dataset TB_UBIGEOS.csv
dfdis = pd.read_csv("data orig/TB_UBIGEOS.csv")
# Filter by regions Lima and Callao (copy so later in-place edits do not touch a view)
dfdis = dfdis[dfdis.region.isin(["LIMA PROVINCIA","CALLAO"])].copy()
# Drop unneeded columns
dfdis.drop(["ubigeo_reniec","ubigeo_inei","departamento_inei","provincia_inei","macroregion_inei",
         "macroregion_minsa","iso_3166_2","fips","altitud"], axis=1, inplace=True)
# Translate column names to English
dfdis.rename(columns={"departamento":"department","provincia":"province","distrito":"district",
                   "superficie":"surface","latitud":"latitude","longitud":"longitude"}, inplace=True)
# Drop rows if latitude is NaN
dfdis.dropna(subset=["latitude"], axis=0, inplace=True)
# Calculate new column Number of Requests (times we will query Foursquare based on the surface divided by a 3 km radius circle area)
dfdis["num_requests"] = np.ceil(dfdis["surface"]/(9*np.pi)) # Area of a 3 km radius circle = pi * radius^2 = 28.27 km^2
dfdis["num_requests"] = dfdis["num_requests"].astype("int") # Change data type
dfdis["num_requests"] = [2 if x>1 else 1 for x in dfdis["num_requests"]] # Two times maximum for running requests to Foursquare
# Reset index
dfdis.reset_index(drop=True, inplace=True)
dfdis

| | id_ubigeo | department | province | district | region | surface | latitude | longitude | num_requests |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 690 | CALLAO | CALLAO | CALLAO | CALLAO | 46.0 | -12.0631 | -77.1469 | 2 |
| 1 | 691 | CALLAO | CALLAO | BELLAVISTA | CALLAO | 5.0 | -12.0625 | -77.1292 | 1 |
| 2 | 692 | CALLAO | CALLAO | CARMEN DE LA LEGUA REYNOSO | CALLAO | 2.0 | -12.0394 | -77.0903 | 1 |
| 3 | 693 | CALLAO | CALLAO | LA PERLA | CALLAO | 3.0 | -12.0658 | -77.1081 | 1 |
| 4 | 694 | CALLAO | CALLAO | LA PUNTA | CALLAO | 18.0 | -12.0728 | -77.1633 | 1 |
| 5 | 695 | CALLAO | CALLAO | VENTANILLA | CALLAO | 70.0 | -11.8772 | -77.1278 | 2 |
| 6 | 696 | CALLAO | CALLAO | MI PERU | CALLAO | 3.0 | -11.855 | -77.125 | 1 |
| 7 | 1281 | LIMA | LIMA | LIMA | LIMA PROVINCIA | 22.0 | -12.0453 | -77.0308 | 1 |
| 8 | 1282 | LIMA | LIMA | ANCON | LIMA PROVINCIA | 285.0 | -11.7739 | -77.1764 | 2 |
| 9 | 1283 | LIMA | LIMA | ATE | LIMA PROVINCIA | 78.0 | -12.0264 | -76.9214 | 2 |


In [3]:
# Checking column data types
dfdis.dtypes

id_ubigeo         int64
department       object
province         object
district         object
region           object
surface         float64
latitude        float64
longitude       float64
num_requests      int64
dtype: object

#### **3.2 Creating a dataframe with vaccination centers data**
Considerations:
- The dataset with vaccination centers will be merged with the district dataframe to get only the vaccination centers in Lima and Callao provinces.

In [4]:
# Read dataset TB_CENTRO_VACUNACION.csv and merge (inner join) with district dataframe (dfdis)
dfvac = pd.merge(pd.read_csv("data orig/TB_CENTRO_VACUNACION.csv"), dfdis, on='id_ubigeo', how='inner')
# Drop unneeded columns, including the district-level coordinates brought in by the merge
dfvac.drop(["id_centro_vacunacion","entidad_administra","surface","latitude","longitude",
         "num_requests"], axis=1, inplace=True)
# Translate column names to English (latitud/longitud are the coordinates of each vaccination center)
dfvac.rename(columns={"nombre":"vaccination_center","latitud":"latitude","longitud":"longitude"}, inplace=True)
# Show all rows in the output
pd.set_option('display.max_rows', None)
dfvac

| | id_ubigeo | vaccination_center | latitude | longitude | department | province | district | region |
|---|---|---|---|---|---|---|---|---|
| 0 | 1317 | Vacunatorio San Isidro labrador | -12.043744 | -76.946264 | LIMA | LIMA | SANTA ANITA | LIMA PROVINCIA |
| 1 | 1317 | Estadio Municipal de Santa Anita | -12.03422 | -76.965259 | LIMA | LIMA | SANTA ANITA | LIMA PROVINCIA |
| 2 | 1292 | Clínica Jesus Del Norte | -11.98958 | -77.058795 | LIMA | LIMA | INDEPENDENCIA | LIMA PROVINCIA |
| 3 | 1292 | Plaza Norte | -12.007614 | -77.058953 | LIMA | LIMA | INDEPENDENCIA | LIMA PROVINCIA |
| 4 | 1292 | Coliseo de la Amistad Perú - Japón | -11.97852 | -77.05048 | LIMA | LIMA | INDEPENDENCIA | LIMA PROVINCIA |
| 5 | 1281 | Centro De Vacunación Aljovin | -12.057287 | -77.03233 | LIMA | LIMA | LIMA | LIMA PROVINCIA |
| 6 | 1281 | Universidad Nacional Mayor de San Marcos - UNMSM | -12.05617 | -77.08424 | LIMA | LIMA | LIMA | LIMA PROVINCIA |
| 7 | 1281 | Clínica Internacional | -12.058217 | -77.038457 | LIMA | LIMA | LIMA | LIMA PROVINCIA |
| 8 | 1281 | Parque La Exposición | -12.062962 | -77.035168 | LIMA | LIMA | LIMA | LIMA PROVINCIA |
| 9 | 1281 | Visita Domiciliaria | -12.04945 | -77.03776 | LIMA | LIMA | LIMA | LIMA PROVINCIA |


In [5]:
pd.reset_option('display.max_rows')
# Checking column data types
dfvac.dtypes

id_ubigeo               int64
vaccination_center     object
latitude              float64
longitude             float64
department             object
province               object
district               object
region                 object
dtype: object
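To see why an inner join is enough to restrict the vaccination centers to the project's scope, here is a minimal sketch with toy tables. The center names and the code 9999 are made up for the example; `validate="m:1"` is an optional safeguard, not something the notebook above uses.

```python
import pandas as pd

# Toy stand-ins for the two tables (values are illustrative only)
centers = pd.DataFrame({
    "id_ubigeo": [1281, 1292, 9999],   # 9999 is a made-up code outside the scope
    "nombre": ["Center A", "Center B", "Center C"],
})
districts = pd.DataFrame({
    "id_ubigeo": [1281, 1292],
    "district": ["LIMA", "INDEPENDENCIA"],
})

# Inner join keeps only centers whose district code exists in `districts`;
# validate="m:1" raises an error if a district code is duplicated on the right side
merged = pd.merge(centers, districts, on="id_ubigeo", how="inner", validate="m:1")
print(merged)
```

The center with the out-of-scope code is dropped, and each remaining center carries the name of its district.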

#### **3.3 Creating a dataframe with venues data**
Considerations:
- To get this dataset we will use Foursquare
- Each request will have 100 venues as limit and 3 Km as radius
- The district coordinates (lat/long) will be used as central points for each request
- If a district has `num_requests = 2`, every venue returned by the first query to Foursquare will be used as the central point of a second query. This covers a larger area that possibly has a higher population density
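The effect of the second round of queries on coverage can be estimated with a distance calculation. The sketch below uses the haversine formula, which is an addition for illustration and is not part of the notebook; the venue coordinates are an assumed example point near the Callao district center.

```python
import math

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometers between two lat/lng points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

center = (-12.0631, -77.1469)     # CALLAO district coordinates
venue = (-12.061041, -77.143544)  # assumed venue returned by the first query

d = haversine_km(*center, *venue)
# A second query centered on this venue reaches up to d + 3 km from the district center
print(round(d, 3), round(d + 3, 3))
```

Since first-round venues lie at most 3 km from the district center, the second round can reach up to 6 km from it.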

#### **Setup:** 
First we will set up the parameters to use in each request to Foursquare

In [6]:
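# Set up Foursquare credentials and limits
import os

# Read the credentials from environment variables instead of hardcoding them in the notebook
# (the variable names FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET are an assumption)
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # Foursquare Secret
VERSION = '20180605' # Foursquare API version
limit = 100 # Limit of number of venues returned by Foursquare API
radius = 3000 # Radius in meters (3 km)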

#### **Functions:**
The following functions will be used to simplify the collection:
- **`get_category`** to get the category of each venue
- **`get_venues`** to request the venues through Foursquare


**Function `get_category`:**

In [7]:
# Function that extracts the category of venues
def get_category(row):
    try:
        categories_list = row['categories']
    except KeyError:  # flattened explore results use the 'venue.' prefix
        categories_list = row['venue.categories']
    
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']
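A quick way to see what `get_category` returns is to call it on a few hand-made rows. The rows below are hypothetical and only mimic the shape of flattened Foursquare results; the function is repeated in compact form so the example runs on its own.

```python
def get_category(row):
    # Same logic as the function above, repeated so the example is self-contained
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
    return categories_list[0]['name'] if categories_list else None

# Hypothetical rows shaped like flattened Foursquare results
row_prefixed = {'venue.categories': [{'name': 'Bakery'}]}
row_plain = {'categories': [{'name': 'Park'}]}
row_empty = {'venue.categories': []}

print(get_category(row_prefixed))  # Bakery
print(get_category(row_plain))     # Park
print(get_category(row_empty))     # None
```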

**Function `get_venues`:**

In [8]:
# Function that returns a dataframe with venues
def get_venues(clientid, secret, vers, lat, lng, rad, lim):
    # Set up the URL and the query parameters
    url = 'https://api.foursquare.com/v2/venues/explore'
    params = {'client_id': clientid, 'client_secret': secret, 'v': vers,
              'll': '{},{}'.format(lat, lng), 'radius': rad, 'limit': lim}
    # Results (fail early on HTTP errors instead of on a missing JSON key)
    response = requests.get(url, params=params, timeout=30)
    response.raise_for_status()
    results = response.json()
    # Venues
    venues_json = results['response']['groups'][0]['items']
    venues = pd.json_normalize(venues_json) # flatten JSON
    # Filter columns
    filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
    venues = venues.loc[:, filtered_columns].copy()
    # Filter the category for each row
    venues['venue.categories'] = venues.apply(get_category, axis=1)
    # Clean columns
    venues.columns = [col.split(".")[-1] for col in venues.columns]
    return venues
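The parsing steps inside `get_venues` can be tried offline on a mock response. The dictionary below is a hypothetical, heavily trimmed stand-in for the JSON that the function reads (only the keys it accesses are included), so it should not be taken as the full response format.

```python
import pandas as pd

# Hypothetical, trimmed response containing only the keys that get_venues reads
mock_response = {
    "response": {"groups": [{"items": [
        {"venue": {"name": "Sample Bakery",
                   "categories": [{"name": "Bakery"}],
                   "location": {"lat": -12.06, "lng": -77.14}}},
        {"venue": {"name": "Sample Park",
                   "categories": [],
                   "location": {"lat": -12.07, "lng": -77.15}}},
    ]}]}
}

items = mock_response["response"]["groups"][0]["items"]
flat = pd.json_normalize(items)  # nested keys become dotted column names
flat = flat.loc[:, ["venue.name", "venue.categories",
                    "venue.location.lat", "venue.location.lng"]].copy()
# Keep only the name of the first category, or None when the list is empty
flat["venue.categories"] = flat["venue.categories"].apply(
    lambda cats: cats[0]["name"] if cats else None)
# Keep the last part of each dotted name: name, categories, lat, lng
flat.columns = [col.split(".")[-1] for col in flat.columns]
print(flat)
```

The result has the same four columns that `get_venues` returns, which makes it easy to test the cleaning logic without spending API quota.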

#### **Queries to Foursquare:**

**First run:** The districts dataframe provides the central points. The resulting dataframe records the origin district from which each query was run.

In [9]:
# Collect the result of each request in a list
frames = []

# First search of venues with data from district list
for i in dfdis.itertuples():
    # Look up venues
    venues = get_venues(CLIENT_ID, CLIENT_SECRET, VERSION, i.latitude, i.longitude, radius, limit)
    venues["origin_district"] = i.district
    venues["num_requests"] = i.num_requests
    # Store the partial result
    frames.append(venues)
# Build dataframe dfven (DataFrame.append was removed in pandas 2.0, so pd.concat is used)
dfven = pd.concat(frames, ignore_index=True)
# Clean the dataframe
dfven.drop_duplicates(subset=["name","categories","lat","lng"], inplace=True)
dfven.reset_index(drop=True, inplace=True)
dfven

| | name | categories | lat | lng | origin_district | num_requests |
|---|---|---|---|---|---|---|
| 0 | CasaCor Callao | Public Art | -12.060004 | -77.147282 | CALLAO | 2 |
| 1 | Casa Fugaz | Art Gallery | -12.060145 | -77.147335 | CALLAO | 2 |
| 2 | Fortaleza del Real Felipe | Monument / Landmark | -12.062062 | -77.148059 | CALLAO | 2 |
| 3 | Cabos Restaurante de Puerto | Seafood Restaurant | -12.060282 | -77.150105 | CALLAO | 2 |
| 4 | Panadería Olcese | Bakery | -12.061041 | -77.143544 | CALLAO | 2 |
| ... | ... | ... | ... | ... | ... | ... |
| 1840 | Pizzeria D'Camila | Pizza Place | -12.171204 | -76.930947 | VILLA MARIA DEL TRIUNFO | 2 |
| 1841 | Cementerio Nueva Esperanza | Cemetery | -12.164202 | -76.919884 | VILLA MARIA DEL TRIUNFO | 2 |
| 1842 | San Juan De Mira flores (Ex boulebard) | Park | -12.167155 | -76.968202 | VILLA MARIA DEL TRIUNFO | 2 |
| 1843 | Santa Alitas | Mexican Restaurant | -12.172082 | -76.966780 | VILLA MARIA DEL TRIUNFO | 2 |


**Second run:** The venues dataframe provides the central points. For this purpose a temporary dataframe *newsource* will be created, containing the venues from districts with `num_requests > 1`.

In [10]:
# Create new source from dataframe dfven
newsource = dfven.copy() # Copy the dataframe
newsource = newsource[newsource.num_requests > 1] # Filter rows by num_requests column
newsource["num_requests"] = newsource["num_requests"] - 1
newsource.drop_duplicates(subset=["name","categories","lat","lng"], inplace=True)
newsource.reset_index(drop=True, inplace=True)

# Collect the result of each request in a list
frames2 = []

# Second search of venues with data from new source
for j in newsource.itertuples():
    # Look up venues
    venues = get_venues(CLIENT_ID, CLIENT_SECRET, VERSION, j.lat, j.lng, radius, limit)
    venues["origin_district"] = j.origin_district
    venues["num_requests"] = j.num_requests
    # Store the partial result
    frames2.append(venues)

# Build dataframe dfven2 (DataFrame.append was removed in pandas 2.0, so pd.concat is used)
dfven2 = pd.concat(frames2, ignore_index=True)
# Clean the dataframe
dfven2.drop_duplicates(subset=["name","categories","lat","lng"], inplace=True)
dfven2.reset_index(drop=True, inplace=True)
dfven2

| | name | categories | lat | lng | origin_district | num_requests |
|---|---|---|---|---|---|---|
| 0 | CasaCor Callao | Public Art | -12.060004 | -77.147282 | CALLAO | 1 |
| 1 | Casa Fugaz | Art Gallery | -12.060145 | -77.147335 | CALLAO | 1 |
| 2 | Cabos Restaurante de Puerto | Seafood Restaurant | -12.060282 | -77.150105 | CALLAO | 1 |
| 3 | Monumental Callao | Art Gallery | -12.059839 | -77.147090 | CALLAO | 1 |
| 4 | Fortaleza del Real Felipe | Monument / Landmark | -12.062062 | -77.148059 | CALLAO | 1 |
| ... | ... | ... | ... | ... | ... | ... |
| 2157 | Fruzymix | Ice Cream Shop | -12.158508 | -76.991243 | VILLA MARIA DEL TRIUNFO | 1 |
| 2158 | Parque La Cruceta | Dog Run | -12.158336 | -76.991784 | VILLA MARIA DEL TRIUNFO | 1 |
| 2159 | Area de Deportes - Saga Atocongo | Sporting Goods Shop | -12.145492 | -76.981388 | VILLA MARIA DEL TRIUNFO | 1 |
| 2160 | Mamara | Plaza | -12.147577 | -76.950001 | VILLA MARIA DEL TRIUNFO | 1 |


After the two runs, we will join both dataframes and drop duplicates.

In [11]:
dfven = pd.concat([dfven, dfven2], ignore_index=True) # DataFrame.append was removed in pandas 2.0
dfven.drop_duplicates(subset=["name","categories","lat","lng"], inplace=True)
dfven.reset_index(drop=True, inplace=True)
dfven.rename(columns={"name":"venue_name","categories":"category"}, inplace=True)
dfven.drop(["num_requests"], axis=1, inplace=True)
dfven

| | venue_name | category | lat | lng | origin_district |
|---|---|---|---|---|---|
| 0 | CasaCor Callao | Public Art | -12.060004 | -77.147282 | CALLAO |
| 1 | Casa Fugaz | Art Gallery | -12.060145 | -77.147335 | CALLAO |
| 2 | Fortaleza del Real Felipe | Monument / Landmark | -12.062062 | -77.148059 | CALLAO |
| 3 | Cabos Restaurante de Puerto | Seafood Restaurant | -12.060282 | -77.150105 | CALLAO |
| 4 | Panadería Olcese | Bakery | -12.061041 | -77.143544 | CALLAO |
| ... | ... | ... | ... | ... | ... |
| 2733 | Fruzymix | Ice Cream Shop | -12.158508 | -76.991243 | VILLA MARIA DEL TRIUNFO |
| 2734 | Parque La Cruceta | Dog Run | -12.158336 | -76.991784 | VILLA MARIA DEL TRIUNFO |
| 2735 | Area de Deportes - Saga Atocongo | Sporting Goods Shop | -12.145492 | -76.981388 | VILLA MARIA DEL TRIUNFO |
| 2736 | Mamara | Plaza | -12.147577 | -76.950001 | VILLA MARIA DEL TRIUNFO |
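When the same venue is found in both runs, which `origin_district` survives depends on the order of the rows, because `drop_duplicates` keeps the first occurrence by default. The toy rows below are made up to show this behavior.

```python
import pandas as pd

# Illustrative rows: the same venue found from two different origin districts
first_run = pd.DataFrame({"name": ["Venue X"], "categories": ["Park"],
                          "lat": [-12.06], "lng": [-77.14],
                          "origin_district": ["CALLAO"]})
second_run = pd.DataFrame({"name": ["Venue X", "Venue Y"],
                           "categories": ["Park", "Bakery"],
                           "lat": [-12.06, -12.07], "lng": [-77.14, -77.15],
                           "origin_district": ["BELLAVISTA", "BELLAVISTA"]})

combined = pd.concat([first_run, second_run], ignore_index=True)
# keep="first" (the default) retains the row that came from the first run
deduped = combined.drop_duplicates(subset=["name", "categories", "lat", "lng"])
print(deduped)
```

Because the first-run rows come first in the combined dataframe, a venue seen in both runs keeps the origin district of the first run.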


In [12]:
# Checking column data types
dfven.dtypes

venue_name          object
category            object
lat                float64
lng                float64
origin_district     object
dtype: object

#### **Data export:**

Finally, we will export the dataframes as CSV files for later use.

In [13]:
dfdis.to_csv("data clean/districts.csv", index=False)
dfvac.to_csv("data clean/vaccination centers.csv", index=False)
dfven.to_csv("data clean/venues.csv", index=False)
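Note that `to_csv` does not create missing folders, so the `data clean` folder has to exist before the export. The sketch below shows one way to guard against that and to check the round trip; it writes to a temporary directory and uses made-up data so it can be run anywhere.

```python
import tempfile
from pathlib import Path

import pandas as pd

df = pd.DataFrame({"district": ["LIMA"], "surface": [22.0]})  # illustrative data

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "data clean"           # temporary stand-in for the real folder
    out_dir.mkdir(parents=True, exist_ok=True)   # create the folder if it is missing
    out_file = out_dir / "districts.csv"
    df.to_csv(out_file, index=False)
    restored = pd.read_csv(out_file)             # read back to verify the export

print(restored)
```

In the notebook itself, the same `mkdir` call on `Path("data clean")` before the three `to_csv` lines would be enough.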