# The Battle of Neighborhoods - Week 4 - Part 2

## 2 Data
This project used data from three sources: the Brazilian postal code dataset, the ArcGIS geocoder (through its API and Python library), and the Foursquare API. The data gathered from the two APIs was used to enrich the original postal code dataset.
- Postal code: In Brazil the postal code is known as CEP. Unlike the data provided in the course, a single neighborhood can have several CEPs assigned to it. Link to data: http://cep.la/CEP-dados-2018-UTF8.zip
- Geocoder ArcGIS: This API was used to retrieve latitude and longitude from the postal code (CEP). The ArcGIS API for Python is a Python library for working with maps and geospatial data, powered by web GIS. It provides tools for vector and raster analysis, geocoding, map making, routing and directions, as well as for organizing and managing a GIS with users, groups and information items. Link to documentation: https://developers.arcgis.com/python/guide/
- Foursquare: This API was used to retrieve venue information from latitude and longitude. According to Foursquare, it introduced the check-in in 2009, and its more than 13 billion check-ins are the foundation of its proprietary Pilgrim technology, which helps more than 150,000 partners make sense of where phones go. Link to API documentation: https://developer.foursquare.com/docs/api
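
The two enrichment calls described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual code: it assumes the third-party `geocoder` package and its ArcGIS provider (the project may instead have used the ArcGIS API for Python directly), and the legacy Foursquare v2 `venues/explore` endpoint. The helper names, the `version` date, the radius and limit defaults, and the credentials are all hypothetical placeholders.

```python
from urllib.parse import urlencode


def format_cep(cep):
    """Return a CEP as an 8-digit 'NNNNN-NNN' string, restoring a lost leading zero."""
    digits = str(cep).zfill(8)
    return f"{digits[:5]}-{digits[5:]}"


def geocode_cep(cep, city="São Paulo", country="Brazil"):
    """Look up coordinates for a CEP (assumes the `geocoder` package; needs network access)."""
    import geocoder  # imported lazily so the other helpers work without it installed
    result = geocoder.arcgis(f"{format_cep(cep)}, {city}, {country}")
    return result.latlng  # latitude/longitude pair; falsy when nothing is found


def explore_url(lat, lng, client_id, client_secret,
                version="20180605", radius=500, limit=100):
    """Build a request URL for the legacy Foursquare v2 'venues/explore' endpoint."""
    params = {
        "client_id": client_id,          # placeholder credential
        "client_secret": client_secret,  # placeholder credential
        "v": version,                    # API version date (assumed value)
        "ll": f"{lat},{lng}",            # search centre as "lat,lng"
        "radius": radius,                # search radius in metres
        "limit": limit,                  # maximum number of venues
    }
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)


print(format_cep(1001000))
print(explore_url(-23.55, -46.63, "YOUR_ID", "YOUR_SECRET"))
```

Formatting the CEP before geocoding matters because the dataset stores it as an integer, which drops the leading zero of São Paulo postal codes.
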
### 2.1 Data Cleansing
The original postal code data from http://cep.la/CEP-dados-2018-UTF8.zip was loaded into the environment, and its raw form is shown in the output of the loading step below. The data has several problems, for example:
- City name split across more than one column
- State abbreviation combined with the second column of the city name
- Neighborhood name and street name concatenated
- Three empty columns

To fix these issues and allow the analysis to continue, the data was cleansed using string handling and dataframe manipulation (splitting, dropping and merging columns). The cleaned dataset is presented below to demonstrate the result of the cleansing process.


### Loading Libraries and Data Gathering

In [1]:
import pandas as pd
import numpy as np
import zipfile
import csv
print('Libraries imported.')

Libraries imported.


In [2]:
!wget -q -O 'CEP-dados-2018-UTF8.zip' http://cep.la/CEP-dados-2018-UTF8.zip
print('Data downloaded!')

Data downloaded!


In [3]:
with zipfile.ZipFile("CEP-dados-2018-UTF8.zip","r") as zip_ref:
    zip_ref.extractall("./")
print('Data unzipped!')

Data unzipped!


In [4]:
df = pd.read_fwf('ceps.txt', header = None)
df.head()

|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| 0 | 1001000 | São | Paulo/SP | Sé\tPraça da Sé - lado ímpar |  |  |  |
| 1 | 1001001 | São | Paulo/SP | Sé\tPraça da Sé - lado par |  |  |  |
| 2 | 1001010 | São | Paulo/SP | Sé\tRua Filipe de Oliveira |  |  |  |
| 3 | 1001900 | São | Paulo/SP | Sé\tPraça da Sé, 108 \t UNESP - Universidade E... |  |  |  |
| 4 | 1001901 | São | Paulo/SP | Sé\tPraça da Sé, 371 \t Edifício Santa Lídia |  |  |  |
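
The tab characters visible in column 3 suggest that the file is tab-delimited rather than fixed-width, which would explain why `read_fwf` splits the city name and leaves the neighborhood and street fused. As a hedged alternative, the sketch below parses a small in-memory sample with `pd.read_csv(sep='\t')`. The sample rows, the leading zero in the CEP, the column names and the assumption of at most five tab-separated fields per line are all illustrative guesses about the file layout, not something verified against `ceps.txt`.

```python
import io

import pandas as pd

# Hypothetical sample mirroring the rows shown above (assumed tab layout and leading zero)
sample = (
    "01001000\tSão Paulo/SP\tSé\tPraça da Sé - lado ímpar\n"
    "01001001\tSão Paulo/SP\tSé\tPraça da Sé - lado par\n"
    "01001901\tSão Paulo/SP\tSé\tPraça da Sé, 371 \t Edifício Santa Lídia\n"
)

# Five names so that lines with an extra trailing field still parse;
# shorter lines get NaN in the unused columns
cols = ["CEP", "CityState", "Neighborhood", "Street", "Extra"]
raw = pd.read_csv(io.StringIO(sample), sep="\t", header=None, names=cols, dtype=str)

# Split "São Paulo/SP" into city and state
parts = raw["CityState"].str.split("/", n=1, expand=True)
raw["City"] = parts[0]
raw["State"] = parts[1]

# Remove stray whitespace around the street name
raw["Street"] = raw["Street"].str.strip()

tidy = raw[["City", "State", "CEP", "Neighborhood", "Street"]]
print(tidy)
```

Reading with `dtype=str` keeps the CEP as text, so an 8-digit code would not lose its leading zero the way the integer values above do.
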


### Data Cleansing and Feature Engineering

In [5]:
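df1 = df[1] + " " + df[2]  # rejoin the city name split across columns 1 and 2, e.g. "São Paulo/SP"
df2 = pd.DataFrame(df1, columns=["CS"])


df2 = df2["CS"].str.split('/', expand=True)  # separate the city from the state abbreviation
df2.columns = ['City', 'State']
df2['CEP'] = df[0]  # keep the postal code as the key for the later merge

df2.head()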

|  | City | State | CEP |
|---|---|---|---|
| 0 | São Paulo | SP | 1001000 |
| 1 | São Paulo | SP | 1001001 |
| 2 | São Paulo | SP | 1001010 |
| 3 | São Paulo | SP | 1001900 |
| 4 | São Paulo | SP | 1001901 |


In [6]:
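df3 = pd.DataFrame(df[3])

df4 = df3[3].str.split('\t', expand=True)  # column 3 holds tab-separated neighborhood, street and extra fields
df4['CEP'] = df[0]

df5 = df4.drop([2, 3], axis = 1)  # drop the extra fields, keeping neighborhood and street
df5.columns = ['Neighborhood', 'Street', 'CEP']

df5.head()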

|  | Neighborhood | Street | CEP |
|---|---|---|---|
| 0 | Sé | Praça da Sé - lado ímpar | 1001000 |
| 1 | Sé | Praça da Sé - lado par | 1001001 |
| 2 | Sé | Rua Filipe de Oliveira | 1001010 |
| 3 | Sé | Praça da Sé, 108 | 1001900 |
| 4 | Sé | Praça da Sé, 371 | 1001901 |


In [7]:
dfm = pd.merge(df2, df5, on='CEP')  # join city/state with neighborhood/street on the postal code
dfm.head()

|  | City | State | CEP | Neighborhood | Street |
|---|---|---|---|---|---|
| 0 | São Paulo | SP | 1001000 | Sé | Praça da Sé - lado ímpar |
| 1 | São Paulo | SP | 1001001 | Sé | Praça da Sé - lado par |
| 2 | São Paulo | SP | 1001010 | Sé | Rua Filipe de Oliveira |
| 3 | São Paulo | SP | 1001900 | Sé | Praça da Sé, 108 |
| 4 | São Paulo | SP | 1001901 | Sé | Praça da Sé, 371 |
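
A merge on `CEP` silently multiplies rows if the same CEP appears more than once on either side. As a hedged illustration with made-up sample frames (not the project's data), the `validate` argument of `pd.merge` can be used to make that assumption explicit:

```python
import pandas as pd

# Small hypothetical frames shaped like df2 and df5
left = pd.DataFrame({"CEP": [1001000, 1001001], "City": ["São Paulo", "São Paulo"]})
right = pd.DataFrame({"CEP": [1001000, 1001001], "Neighborhood": ["Sé", "Sé"]})

# validate="one_to_one" raises if CEP is not unique on both sides
merged = pd.merge(left, right, on="CEP", validate="one_to_one")
print(merged)

# Introduce a duplicated CEP on the right side to show the check firing
dup_right = pd.concat([right, right.iloc[[0]]], ignore_index=True)
try:
    pd.merge(left, dup_right, on="CEP", validate="one_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True

print("Duplicate CEP detected:", duplicate_detected)
```

Because both frames here are derived from the same rows of `df`, they also share an index, so a column-wise `pd.concat` would be another option. The merge with validation is shown only because it matches the approach used in the notebook.
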


In [8]:
dt = dfm[dfm['City'] == 'São Paulo']  # keep only rows for the city of São Paulo
dt.tail()

|  | City | State | CEP | Neighborhood | Street |
|---|---|---|---|---|---|
| 88115 | São Paulo | SP | 8490885 | Jardim Pedra Branca | Rua Cambará-Branco |
| 88116 | São Paulo | SP | 8490886 | Jardim Pedra Branca | Rua Erva-Baleeira |
| 88117 | São Paulo | SP | 8490890 | Jardim Pedra Branca | Rua Buritizinho |
| 88118 | São Paulo | SP | 8490895 | Jardim Pedra Branca | Rua Rosa da Venezuela |
| 88119 | São Paulo | SP | 8491030 | Guaianazes | Rua Lauro |


#### Dropping the last three digits of the CEP to coarsen the postal code granularity and get more meaningful data from Foursquare

In [9]:
dt1 = dt.drop_duplicates('Neighborhood', keep='first')  # keep the first CEP listed for each neighborhood
dt1.shape

(1920, 5)

In [10]:
dt1.head(10)

|  | City | State | CEP | Neighborhood | Street |
|---|---|---|---|---|---|
| 0 | São Paulo | SP | 1001000 | Sé | Praça da Sé - lado ímpar |
| 9 | São Paulo | SP | 1002020 | Centro | Viaduto do Chá |
| 173 | São Paulo | SP | 1017000 | Brás | Avenida Rangel Pestana - até 499/500 |
| 337 | São Paulo | SP | 1035100 | República | Avenida São João - de 651 a 1339 - lado ímpar |
| 540 | São Paulo | SP | 1101000 | Luz | Avenida Santos Dumont - até 999/1000 |
| 546 | São Paulo | SP | 1101080 | Ponte Pequena | Ponte Santos Dumont |
| 591 | São Paulo | SP | 1107000 | Bom Retiro | Avenida do Estado - até 2599 - lado ímpar |
| 602 | São Paulo | SP | 1109000 | Canindé | Avenida Cruzeiro do Sul - até 1299 - lado ímpar |
| 723 | São Paulo | SP | 1134901 | Barra Funda | Avenida Rudge, 700 |
| 734 | São Paulo | SP | 1136005 | Várzea da Barra Funda | Rua Américo Del Veneri |
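
The cells above coarsen the data by keeping one row per neighborhood. For comparison, truncating the CEP itself to its first five digits could look like the sketch below. It runs on a small made-up sample, assumes the CEP is stored as an integer that has lost its leading zero (as in the outputs above), and uses a hypothetical column name `CEP5`.

```python
import pandas as pd

# Sample CEPs in the integer form seen in the outputs above
sample = pd.DataFrame({
    "CEP": [1001000, 1001001, 1002020, 8491030],
    "Neighborhood": ["Sé", "Sé", "Centro", "Guaianazes"],
})

# Restore the 8-digit form, then keep the first five digits (drop the last three)
cep8 = sample["CEP"].astype(str).str.zfill(8)
sample["CEP5"] = cep8.str[:5]

# Keep one row per 5-digit prefix
coarse = sample.drop_duplicates("CEP5", keep="first")
print(coarse)
```

Zero-padding before slicing matters: without it, codes beginning with 0 would be cut at the wrong position.
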
