<img src="logo.png" width="350" height="350" align="center"/>

### Final Project: Data Science Professional

___

Diego N. Vilela - Biomedical Scientist

December, 2019

### Data Section

___

In the course of this ETL (Extract, Transform, Load) work, I gather and describe the data I believe to be relevant to my research. Bear in mind that population statistics cannot always be kept up to date, so the sources used here refer to different years and the results are estimates only. Some links returned poorly formatted data, so in those cases I extracted the tables manually into CSV files and keep them available in the repository.

### Resource loading and parameterization

___

In [1]:
import pandas as pd
import numpy as np
import warnings
from geopy.geocoders import Nominatim as geo
import folium as fl

warnings.filterwarnings('ignore')

### Data load on the counties of the state of São Paulo

___

As a starting point, I will create a table with the names of the counties of the state of São Paulo and their respective populations, because everything revolves around their inhabitants.

* Source: São Paulo State Virtual Library (http://www.bibliotecavirtual.sp.gov.br/temas/sao-paulo/sao-paulo-populacao-dos-municipios-paulistas.php)
* Year: 2018

In [2]:
# Population table

df_pop = pd.read_html('http://www.bibliotecavirtual.sp.gov.br/temas/sao-paulo/sao-paulo-populacao-dos-municipios-paulistas.php', thousands = '.', skiprows = 1)[0]

In [3]:
# Renaming the columns

df_pop.columns = ['County', 'Population'] 

In [4]:
# Sorting values

df_pop.sort_values('County', inplace = True)

In [5]:
# One county name appears to be misspelled: "Moji Mirim" should be "Mogi Mirim"

df_pop.loc[df_pop['County'] == 'Moji Mirim']

| | County | Population |
|---|---|---|
| 84 | Moji Mirim | 89738 |


In [6]:
# Correcting the county name

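# Select the row by its current name instead of a hard-coded index label
df_pop.loc[df_pop['County'] == 'Moji Mirim', 'County'] = 'Mogi Mirim'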

In [7]:
# I will leave the County name as index to facilitate the compilation of the information below.

df_pop.set_index('County', inplace = True)

In [8]:
# Checking the table

df_pop.head()

| County | Population |
|---|---|
| Adamantina | 33888 |
| Adolfo | 3469 |
| Aguaí | 34919 |
| Agudos | 35828 |
| Alambari | 5600 |


In [9]:
print('The state of São Paulo has %s counties and a total of %i inhabitants.' % (df_pop.shape[0], df_pop['Population'].sum()))

The state of São Paulo has 645 counties and a total of 43993159 inhabitants.


___

The area of each county, in km², will be important for parameterizing the search for places within its limits.

* Source: Wikipedia (https://pt.wikipedia.org/wiki/Lista_dos_munic%C3%ADpios_de_S%C3%A3o_Paulo_por_%C3%A1rea)
* Year: 2018

In [10]:
# Area table

df_area = pd.read_csv('tab_area.CSV', sep = ';', encoding = 'cp1252', decimal = ',')

In [11]:
# Renaming the columns

df_area.columns = ['County', 'Area KM2']

In [12]:
# Sorting by County

df_area.sort_values('County', inplace = True)

In [13]:
# Reuse the corrected county names from the population table
# (this assumes both tables list the same counties in the same sorted order)

df_area['County'] = df_pop.index.values

In [14]:
# County as index

df_area.set_index('County', inplace = True)
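
Overwriting one table's names with another's by position only works if both tables list the same counties in the same order. The following is a minimal sketch of how that assumption could be checked beforehand. The helpers `normalize` and `misaligned` and the sample name lists are hypothetical and are not part of the original notebook.

```python
import unicodedata

def normalize(name):
    """Lowercase a name and strip accents so spelling variants compare equal."""
    decomposed = unicodedata.normalize('NFKD', name)
    without_accents = ''.join(c for c in decomposed if not unicodedata.combining(c))
    return without_accents.lower().strip()

def misaligned(names_a, names_b):
    """Return (position, a, b) for every position where the normalized names differ."""
    if len(names_a) != len(names_b):
        raise ValueError('the two tables have different lengths')
    return [(i, a, b) for i, (a, b) in enumerate(zip(names_a, names_b))
            if normalize(a) != normalize(b)]

# Illustrative names only: one accent variant and one genuine spelling difference
pop_names = ['Adamantina', 'Adolfo', 'Aguaí', 'Mogi Mirim']
area_names = ['Adamantina', 'Adolfo', 'Aguai', 'Moji Mirim']

mismatches = misaligned(pop_names, area_names)
print(mismatches)
```

An empty result would suggest the positional overwrite is safe. Any remaining entry points to a name that deserves a manual look.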

The location search engine works from geographic coordinates and a radius when looking for establishments. Counties are not regular shapes with a well-defined radius, so one must be estimated. To avoid overlapping neighboring regions, I plan to cover only about 60% of each county: since I already have the area, I divide it by 2 and multiply by 0.6 to get the proportional value. Note that halving the area is a rough heuristic, not the geometric radius of an equivalent circle.

In [15]:
# Proportional radius calculation function

def radius(x, p):
    # Half of the area x, scaled by the proportion p (heuristic estimate)
    return x / 2 * p

In [16]:
# Creating new vector

df_area['Radius KM'] = [radius(a, 0.6) for a in df_area['Area KM2']]

In [17]:
# Checking the table

df_area.head()

| County | Area KM2 | Radius KM |
|---|---|---|
| Adamantina | 411.987 | 123.5961 |
| Adolfo | 211.055 | 63.3165 |
| Aguaí | 474.554 | 142.3662 |
| Agudos | 966.708 | 290.0124 |
| Alambari | 159.6 | 47.88 |
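
The radius values above come from a heuristic (half the area times a factor), so they are much larger than the counties themselves. For comparison only, here is a sketch of the radius of a circle with the same area, $r = \sqrt{A/\pi}$. The function name `equal_area_radius_km` and its `fraction` parameter are assumptions made for illustration. They are not part of the original analysis.

```python
import math

def equal_area_radius_km(area_km2, fraction=1.0):
    """Radius in km of a circle whose area is `fraction` of `area_km2`."""
    return math.sqrt(fraction * area_km2 / math.pi)

# Area of Adamantina taken from the table above
adamantina_area = 411.987

full = equal_area_radius_km(adamantina_area)                 # whole county area
partial = equal_area_radius_km(adamantina_area, fraction=0.6)  # 60% of the area

print(round(full, 2), round(partial, 2))
```

Under this assumption the search radius for Adamantina would be on the order of 10 km rather than more than 100 km, which may matter if the search API caps the radius it accepts.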


___

"Gross Domestic Product" (GDP) is one of the key economic indicators for assessing a country's financial health, based on the sum of all goods and services produced over a one-year period. The next table will be filled with the GDP of each municipality divided among several sectors.

* Source: State Data Analysis System Foundation (https://www.seade.gov.br/produtos/pib-municipal/)
* Year: 2017

In [18]:
# GDP table

df_gdp = pd.read_csv('tab_pib_2017.CSV', sep = ';', encoding = 'cp1252', decimal = ',')

In [19]:
# Renaming the columns

df_gdp.columns = ['County', 'Farming', 'Industry', 'Public administration', 'Public services', 'Taxes', 'GDP', 'GDP per capita']

In [20]:
# Sorting by County

df_gdp.sort_values('County', inplace = True)

In [21]:
# County as index

df_gdp.set_index('County', inplace = True)

In [22]:
# Checking the table

df_gdp.head()

| County | Farming | Industry | Public administration | Public services | Taxes | GDP | GDP per capita |
|---|---|---|---|---|---|---|---|
| Adamantina | 45201.02467 | 138803.3355 | 168908.8097 | 621261.3956 | 80366.51443 | 1054541.0 | 31121.17692 |
| Adolfo | 33549.76852 | 8059.446375 | 21422.15942 | 33338.62674 | 3364.626749 | 99734.63 | 28667.61362 |
| Aguaí | 87873.1925 | 297084.2142 | 121789.1294 | 336929.4595 | 99032.52835 | 942708.5 | 27263.24611 |
| Agudos | 80476.4423 | 800703.35 | 174721.0886 | 672594.7295 | 334526.0885 | 2063022.0 | 57826.59768 |
| Alambari | 35578.25977 | 7684.430367 | 24265.69012 | 28925.0689 | 4113.246791 | 100566.7 | 18245.04644 |
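
Because the GDP is split among sectors, a quick consistency check is to confirm that the sector columns add up to the reported total. The sketch below does this for two rows copied from the table above. The helper `sectors_match_total` and the tolerance are assumptions chosen for illustration.

```python
import math

SECTORS = ['Farming', 'Industry', 'Public administration', 'Public services', 'Taxes']

def sectors_match_total(row, rel_tol=1e-6):
    """True when the sector columns add up to the reported GDP, within rel_tol."""
    total = sum(row[s] for s in SECTORS)
    return math.isclose(total, row['GDP'], rel_tol=rel_tol)

# Rows copied from the table above (GDP as displayed there)
adamantina = {'Farming': 45201.02467, 'Industry': 138803.3355,
              'Public administration': 168908.8097, 'Public services': 621261.3956,
              'Taxes': 80366.51443, 'GDP': 1054541.0}
adolfo = {'Farming': 33549.76852, 'Industry': 8059.446375,
          'Public administration': 21422.15942, 'Public services': 33338.62674,
          'Taxes': 3364.626749, 'GDP': 99734.63}

checks = {name: sectors_match_total(row)
          for name, row in [('Adamantina', adamantina), ('Adolfo', adolfo)]}
print(checks)
```

A row that fails this check would suggest a parsing problem, for example a decimal separator read incorrectly.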


___

According to the source, the "Human Development Index" (HDI) is a comparative measure of wealth, literacy, education, life expectancy, birth rate and other factors for the various countries of the world. It is a standardized way of assessing and measuring the welfare of a population, especially child welfare. The table below uses its municipal version, the HDI-M.

* Source: Wikipedia (https://pt.wikipedia.org/wiki/Lista_de_munic%C3%ADpios_de_S%C3%A3o_Paulo_por_IDH-M)
* Year: 2010

In [23]:
# HDI-M table (IDHM in Portuguese)

df_hdmi = pd.read_csv('tab_idhm_2010.CSV', sep = ';', encoding = 'cp1252', decimal = ',')

In [24]:
# Renaming the columns

df_hdmi.columns = ['County', 'HDI-M', 'HDI-R', 'HDI-L', 'HDI-E']

In [25]:
# Sorting by County

df_hdmi.sort_values('County', inplace = True)

In [26]:
# County as index

df_hdmi.set_index('County', inplace = True)

In [27]:
df_hdmi.head()

| County | HDI-M | HDI-R | HDI-L | HDI-E |
|---|---|---|---|---|
| Adamantina | 0.79 | 0.772 | 0.852 | 0.75 |
| Adolfo | 0.73 | 0.71 | 0.844 | 0.648 |
| Aguaí | 0.715 | 0.703 | 0.858 | 0.606 |
| Agudos | 0.745 | 0.705 | 0.845 | 0.694 |
| Alambari | 0.712 | 0.682 | 0.805 | 0.658 |


___

The FIRJAN Municipal Development Index (IFDM) is a study designed to track the human, economic and social development of Brazilian municipalities, based exclusively on official statistics. It takes three indicators into account: employment and income (as a single indicator), education, and health, each with its own set of variables. Because of these characteristics, the tool has served as a snapshot of public policy and as a source for "national and international studies on Brazilian development". Its result portrays the level of development of each county and thus gives an idea of the quality of life of its citizens.

* Source: Wikipedia (https://pt.wikipedia.org/wiki/Lista_de_munic%C3%ADpios_de_S%C3%A3o_Paulo_por_IFDM)
* Year: 2013

In [28]:
# FMDI table

df_fmdi = pd.read_csv('tab_ifdm_2013.CSV', sep = ';', encoding = 'cp1252', decimal = ',')

In [29]:
# Renaming the columns

df_fmdi.columns = ['County', 'FMDI']

In [30]:
# Sorting by County

df_fmdi.sort_values('County', inplace = True)

In [31]:
# County as index

df_fmdi.set_index('County', inplace = True)

In [32]:
df_fmdi.head()

| County | FMDI |
|---|---|
| Adamantina | 0.7827 |
| Adolfo | 0.7836 |
| Aguaí | 0.7168 |
| Agudos | 0.7705 |
| Alambari | 0.7028 |


___

Time to group all information into one table. The "County" will guide the concatenation of the tables.

In [33]:
# São Paulo table

df_sp = pd.concat([df_pop, df_area, df_gdp, df_hdmi, df_fmdi], axis=1, join='inner')
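
An inner join on the index silently drops any county whose name does not match exactly in every table. The sketch below, which assumes two tiny tables with a deliberately misspelled name, shows one way to list the names that would be lost. The frames `pop` and `area` and the misspelling are hypothetical; only the numeric values come from the tables above.

```python
import pandas as pd

# Two small tables indexed by county; 'Aguai' is a hypothetical misspelling
pop = pd.DataFrame({'Population': [33888, 3469, 34919]},
                   index=['Adamantina', 'Adolfo', 'Aguaí'])
area = pd.DataFrame({'Area KM2': [411.987, 211.055, 474.554]},
                    index=['Adamantina', 'Adolfo', 'Aguai'])

# Same call pattern as the notebook: align on the index, keep only shared names
merged = pd.concat([pop, area], axis=1, join='inner')

# Names present in one table but not the other are the ones the join discards
dropped = sorted(set(pop.index).symmetric_difference(area.index))

print(len(merged), dropped)
```

Comparing `len(merged)` with the expected number of counties is a cheap way to notice that something was dropped.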

In [34]:
# Rounding values for better aesthetics before adding coordinates

df_sp = df_sp.round(decimals = 2)

In [35]:
# Setting the index name

df_sp.index.name = 'County'

In [36]:
# Voilà!

df_sp.head()

| County | Population | Area KM2 | Radius KM | Farming | Industry | Public administration | Public services | Taxes | GDP | GDP per capita | HDI-M | HDI-R | HDI-L | HDI-E | FMDI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Adamantina | 33888 | 411.99 | 123.6 | 45201.02 | 138803.34 | 168908.81 | 621261.4 | 80366.51 | 1054541.08 | 31121.18 | 0.79 | 0.77 | 0.85 | 0.75 | 0.78 |
| Adolfo | 3469 | 211.06 | 63.32 | 33549.77 | 8059.45 | 21422.16 | 33338.63 | 3364.63 | 99734.63 | 28667.61 | 0.73 | 0.71 | 0.84 | 0.65 | 0.78 |
| Aguaí | 34919 | 474.55 | 142.37 | 87873.19 | 297084.21 | 121789.13 | 336929.46 | 99032.53 | 942708.52 | 27263.25 | 0.72 | 0.7 | 0.86 | 0.61 | 0.72 |
| Agudos | 35828 | 966.71 | 290.01 | 80476.44 | 800703.35 | 174721.09 | 672594.73 | 334526.09 | 2063021.7 | 57826.6 | 0.74 | 0.7 | 0.84 | 0.69 | 0.77 |
| Alambari | 5600 | 159.6 | 47.88 | 35578.26 | 7684.43 | 24265.69 | 28925.07 | 4113.25 | 100566.7 | 18245.05 | 0.71 | 0.68 | 0.8 | 0.66 | 0.7 |


___

Only the county coordinates are missing to complete this dataset.

In [37]:
# Centering the coordinates of the state of São Paulo

sp = 'São Paulo, BR'

geolocator = geo(user_agent="sp_explorer")
location = geolocator.geocode(sp)
latitude = location.latitude
longitude = location.longitude

print('The geographical coordinates of the state of São Paulo are %f, %f.' % (latitude, longitude))

The geographical coordinates of the state of São Paulo are -23.550651, -46.633382.


In [38]:
# Searching the coordinates

lat = []
lgt = []

for mun in df_sp.index.values:
    location = geolocator.geocode('%s, São Paulo, BR' % mun, timeout=3)
    lat.append(location.latitude)
    lgt.append(location.longitude)
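
The loop above assumes every lookup succeeds. geopy's `geocode` returns `None` when nothing is found, and the public Nominatim service asks clients to stay at or below about one request per second. The sketch below shows one way to handle both. The function `geocode_all`, its parameters, and the stand-in `fake_geocode` are hypothetical, so the example runs without network access.

```python
import time
from types import SimpleNamespace

def geocode_all(names, geocode, delay_seconds=1.0, sleep=time.sleep):
    """Geocode each county name, pausing between calls and recording misses as None."""
    coords = {}
    for i, name in enumerate(names):
        if i > 0:
            sleep(delay_seconds)  # throttle consecutive requests
        location = geocode('%s, São Paulo, BR' % name)
        if location is None:
            coords[name] = None   # keep the miss visible instead of raising
        else:
            coords[name] = (location.latitude, location.longitude)
    return coords

# Stand-in for geolocator.geocode: knows one county (coordinates from the notebook output)
KNOWN = {'Adamantina, São Paulo, BR':
         SimpleNamespace(latitude=-21.686652, longitude=-51.076298)}

def fake_geocode(query):
    return KNOWN.get(query)

# The no-op sleep keeps the demonstration instant
result = geocode_all(['Adamantina', 'Nowhere'], fake_geocode, sleep=lambda s: None)
missing = [name for name, xy in result.items() if xy is None]
print(result['Adamantina'], missing)
```

With geopy, `geocode` would be `geolocator.geocode`. geopy also provides `RateLimiter` in `geopy.extra.rate_limiter`, which could replace the manual delay.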

In [39]:
# Adding coordinates to the dataset

df_sp['Latitude'] = lat
df_sp['Longitude'] = lgt

In [40]:
# Checking

df_sp.head()

| County | Population | Area KM2 | Radius KM | Farming | Industry | Public administration | Public services | Taxes | GDP | GDP per capita | HDI-M | HDI-R | HDI-L | HDI-E | FMDI | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Adamantina | 33888 | 411.99 | 123.6 | 45201.02 | 138803.34 | 168908.81 | 621261.4 | 80366.51 | 1054541.08 | 31121.18 | 0.79 | 0.77 | 0.85 | 0.75 | 0.78 | -21.686652 | -51.076298 |
| Adolfo | 3469 | 211.06 | 63.32 | 33549.77 | 8059.45 | 21422.16 | 33338.63 | 3364.63 | 99734.63 | 28667.61 | 0.73 | 0.71 | 0.84 | 0.65 | 0.78 | -21.23566 | -49.644192 |
| Aguaí | 34919 | 474.55 | 142.37 | 87873.19 | 297084.21 | 121789.13 | 336929.46 | 99032.53 | 942708.52 | 27263.25 | 0.72 | 0.7 | 0.86 | 0.61 | 0.72 | -22.059204 | -46.979384 |
| Agudos | 35828 | 966.71 | 290.01 | 80476.44 | 800703.35 | 174721.09 | 672594.73 | 334526.09 | 2063021.7 | 57826.6 | 0.74 | 0.7 | 0.84 | 0.69 | 0.77 | -22.471162 | -48.987822 |
| Alambari | 5600 | 159.6 | 47.88 | 35578.26 | 7684.43 | 24265.69 | 28925.07 | 4113.25 | 100566.7 | 18245.05 | 0.71 | 0.68 | 0.8 | 0.66 | 0.7 | -23.550338 | -47.897971 |


In [41]:
# Map for visual inspection

map_sp = fl.Map(location=[latitude, longitude], zoom_start=6.5)

for lat, lng, mun in zip(df_sp['Latitude'], df_sp['Longitude'], df_sp.index.values):
    label = '{}'.format(mun)
    label = fl.Popup(label, parse_html=True)
    fl.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_sp)  
    
map_sp

In [42]:
# Saving the dataset to explore in the next part of the job

df_sp.to_csv('dataset_sp.csv', sep = ';')

The next step of this project will be to use the Foursquare API to gather information on the main categories of establishments, analyze which ones have wheelchair adaptation (the "humanity factor"), produce a statistical summary, and cluster the municipalities to understand which factors are common among those with the most adaptation.

___