# Capstone Project - The Battle of the Neighborhoods (Week 1) #
## Applied Data Science Capstone by IBM/Coursera ##
### Fernando Tauscheck ###

## Table of contents:
1. [Introduction: Business Problem](#introduction)<br>
    1.1 [Curitiba:](#curitiba)<br>
2. [Data](#data)<br>
    2.1 [Foursquare:](#foursquare)<br>
    2.1.1 [Reference points:](#reference_points)<br>
    2.1.2 [Dataframe from Foursquare:](#foursquare_dataframe)<br>
    2.2 [Geographic Data:](#data_geolocated)<br>
    2.2.1 [Neighborhoods:](#neighborhoods)<br>
    2.2.2 [Master Plan:](#master_plan)<br>
    2.3 [Socioeconomic data of the neighborhoods:](#data_socioeconomic)<br>
    2.4 [Work Flow:](#work_flow)

## 1. Introduction: Business Problem <a name="introduction"></a>

What defines the success of a commercial business? Can we predict whether a location is good enough to open a profitable bakery?

Although the analysis can, in theory, be replicated for any type of business, this report targets stakeholders interested in opening a bakery in Curitiba, Brazil. We will use geographic and socioeconomic data about existing bakeries to build a list of possible locations for a new one.

### 1.1 Curitiba:  <a name="curitiba"></a>

Curitiba is the capital and largest city of the Brazilian state of Paraná. The city's population was 1,948,626 as of 2020, making it the eighth-most populous city in Brazil and the largest in Brazil's South Region. According to Foursquare, Curitiba has 608 bakeries, of which:
* 17 (2.8%) have ratings greater than 9;
* 21 (3.5%) were classified as high cost.

![title](img/01.Curitiba.png)

## 2. Data:  <a name="data"></a>

Some factors will influence our analysis:
* Number of existing bakeries in the neighborhood;
* Socioeconomic data of the neighborhoods (Per capita income, population density, ...);
* Zones from City Master Plan;
* Proximity to parks, public squares, boardwalks, main streets, and high-traffic avenues;
* If possible, the Foursquare Rating, Likes, and Tier of each bakery, to check whether these values correlate with its location.

The data will be aggregated in a MySQL 8.0 RDBMS, using its 'Spatial Analysis Functions'.
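
The proximity factors listed above come down to distances between pairs of coordinates. MySQL 8.0 offers spatial functions such as `ST_Distance_Sphere` for this kind of measurement, but the report does not show its queries, so the following is only a minimal Python sketch of the same idea using the haversine formula. The function name `haversine_m` and the spherical Earth radius are assumptions made for illustration; the two coordinates are bakeries taken from the Foursquare sample in section 2.1.2.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters (spherical approximation)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


# Two bakeries from the Foursquare sample (lat, long columns)
saint_germain = (-25.444152, -49.287664)
provence_boulangerie = (-25.444399, -49.289972)

distance = haversine_m(*saint_germain, *provence_boulangerie)
print(f"{distance:.0f} m between the two bakeries")
```

The same function could measure the distance from a candidate point to the nearest park or main street once those geometries are reduced to reference points.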

### 2.1 Foursquare: <a name="foursquare"></a> ###

This project uses the Foursquare API as its main data source, since it offers a database of millions of venues. To limit the number of venues requested from the API, only places classified as bakeries were queried.

Because of an API limitation, a single query cannot cover neighborhoods with more than 100 bakeries. To work around this, we query the API over a grid of hexagonal clusters, each with a radius of 600 m. The coordinates of these hexagons were generated in code, starting from a central point in Curitiba, and every point was validated as being 'within' the Curitiba area through a MySQL query. The central point itself was obtained from a request to the ‘Google Geocode API’, using the neighborhood ‘Fany’ as the parameter.

With the list of venues in hand, an additional request was made to retrieve the details of each venue:
* Rating;
* Likes;
* Tier;
* Multi-classification: For example, a Bakery with a grocery store;
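
The hexagon grid described above can be sketched as follows. This is not the project's own generator (that code lives in the support scripts), only a minimal illustration under stated assumptions: a pointy-top hexagonal tiling in axial coordinates, a flat-Earth conversion from meters to degrees, and a hypothetical `rings` parameter that controls how far the grid extends. The center coordinates are a placeholder near central Curitiba, not the geocoded point used in the project.

```python
import math

METERS_PER_DEG_LAT = 111_320  # rough length of one degree of latitude


def hexagon_centers(center_lat, center_lon, radius_m=600, rings=2):
    """Return (lat, lon) centers of a pointy-top hexagonal tiling.

    Hexagons are addressed with axial coordinates (q, r); `rings` is the
    number of hexagon rings generated around the central hexagon.
    """
    meters_per_deg_lon = METERS_PER_DEG_LAT * math.cos(math.radians(center_lat))
    centers = []
    for q in range(-rings, rings + 1):
        for r in range(max(-rings, -q - rings), min(rings, -q + rings) + 1):
            # Axial -> planar offsets in meters for pointy-top hexagons
            dx = radius_m * math.sqrt(3) * (q + r / 2)
            dy = radius_m * 1.5 * r
            centers.append((center_lat + dy / METERS_PER_DEG_LAT,
                            center_lon + dx / meters_per_deg_lon))
    return centers


# Placeholder center (assumption), 600 m hexagons, two rings around the center
centers = hexagon_centers(-25.4284, -49.2733, radius_m=600, rings=2)
print(len(centers), "query points")
```

Querying the API with a search radius equal to the hexagon radius makes neighboring circles overlap slightly, so venues returned twice would need to be deduplicated by `id`.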


#### 2.1.1 Reference points: <a name="reference_points"></a> #### 
![title](img/01.clusters.png)

#### 2.1.2 Dataframe from Foursquare: <a name="foursquare_dataframe"></a> #### 

```python
import pandas as pd

# Load the venues collected from Foursquare and preview the main columns
df_venues = pd.read_parquet('./parquet/venues.parquet', engine='fastparquet')
df_venues[['id', 'name', 'lat', 'long', 'address', 'categories', 'tipCount', 'tier', 'likes', 'rating']].head(10)
```

| | id | name | lat | long | address | categories | tipCount | tier | likes | rating |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4b69efebf964a5201bbd2be3 | Confeitaria das Famílias | -25.430643 | -49.270212 | R. Quinze de Novembro, 374 | `[["Dessert Shop", "4bf58dd8d48988d1d0941735"],...` | 114 | 1 | 235 | 6.7 |
| 1 | 4b75d4fcf964a520ee272ee3 | Panetteria Maiochi | -25.472368 | -49.288013 | R. Maranhão, 1730 | `[["Bakery", "4bf58dd8d48988d16a941735"], ["Con...` | 22 | 1 | 34 | 6.3 |
| 2 | 4b7c57d1f964a5209f8d2fe3 | La Patisserie | -25.442422 | -49.279188 | Av. Sete de Setembro, 4194 | `[["Bakery", "4bf58dd8d48988d16a941735"], ["Cof...` | 71 | 2 | 111 | 6.3 |
| 3 | 4b8abddbf964a520c07d32e3 | Saint Germain | -25.432826 | -49.290227 | Al. Prca. Izabel, 1347 | `[["Bakery", "4bf58dd8d48988d16a941735"], ["Del...` | 57 | 3 | 221 | 6.9 |
| 4 | 4ba29a89f964a520680838e3 | Saint Germain | -25.444152 | -49.287664 | Av. Visc. de Guarapuava, 4882 | `[["Bakery", "4bf58dd8d48988d16a941735"], ["Bre...` | 180 | 3 | 617 | 7.8 |
| 5 | 4ba53d22f964a5202bf038e3 | Requinte | -25.412083 | -49.253352 | R. Recife, 34 | `[["Bakery", "4bf58dd8d48988d16a941735"], ["Bre...` | 163 | 3 | 525 | 8.0 |
| 6 | 4bad2568f964a52086323be3 | Panificadora Verdes Mares | -25.422681 | -49.256221 | R. Mauá, 28 | `[["Bakery", "4bf58dd8d48988d16a941735"], ["Bra...` | 31 | 2 | 48 | 6.3 |
| 7 | 4bae12d6f964a5201f813be3 | Provence Boulangerie | -25.444399 | -49.289972 | R. Bruno Filgueira, 548 | `[["Bakery", "4bf58dd8d48988d16a941735"], ["Caf...` | 56 | 3 | 76 | 6.5 |
| 8 | 4bb36e49a32876b028a901fe | Marcolini Alimentari | -25.435534 | -49.286224 | Al. Dr. Carlos de Carvalho, 1181 | `[["Bakery", "4bf58dd8d48988d16a941735"], ["Caf...` | 101 | 3 | 146 | 6.1 |
| 9 | 4bb3a3e1715eef3bd7a186bb | Rico Pão | -25.428397 | -49.262645 | Av. Mal. Deodoro da Fonseca | `[["Bakery", "4bf58dd8d48988d16a941735"]]` | 68 | 2 | 110 | 5.6 |
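
The `categories` column is what carries the multi-classification mentioned earlier. Judging from the sample output, each value looks like a JSON-encoded list of `[name, id]` pairs; that storage format is an assumption, as are the helper names `category_names` and `is_multi_classified`. Under that assumption, a minimal sketch for detecting venues with more than one category could be:

```python
import json


def category_names(raw):
    """Parse a JSON-encoded list of [name, id] pairs and return only the names."""
    return [name for name, _category_id in json.loads(raw)]


def is_multi_classified(raw):
    """True when a venue carries more than one Foursquare category."""
    return len(category_names(raw)) > 1


# Row 9 of the sample (Rico Pão): a single category
single = '[["Bakery", "4bf58dd8d48988d16a941735"]]'
# Illustrative value only, assembled from category pairs visible in the sample
combined = ('[["Bakery", "4bf58dd8d48988d16a941735"], '
            '["Dessert Shop", "4bf58dd8d48988d1d0941735"]]')

print(category_names(single), is_multi_classified(single))
print(category_names(combined), is_multi_classified(combined))
```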




### 2.2 Geographic Data: <a name="data_geolocated"></a> ###

Geographic information about Curitiba comes from the website of the *"Instituto de Pesquisa e Planejamento Urbano de Curitiba"* (Institute for Research and Urban Planning of Curitiba, also known as IPPUC)[^1]. The institute provides a wide range of maps of the city. We will use:

* Zones of City Master Plan;
* Neighborhoods;
* Main streets;
* Boardwalks, public squares, and parks.

These maps are provided in ESRI shapefile (SHP) format. They were then converted to GeoJSON in the WGS84 coordinate reference system. The GeoJSON files were loaded into an RDBMS (MySQL 8.0), whose Spatial Analysis Functions are used for the analysis. The GitHub repository of this project[^2] contains all support scripts and the structure of the tables.
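
To make the spatial step more concrete, here is a minimal Python sketch of the kind of test that a function such as MySQL's `ST_Within` performs: deciding whether a point lies inside a GeoJSON polygon. It uses the classic ray-casting method and is only an illustration, not the project's query. It assumes two-dimensional `[lon, lat]` coordinates and closed rings, as in GeoJSON, and the rectangle `toy_area` is a made-up shape, not a real IPPUC boundary.

```python
def point_in_ring(lon, lat, ring):
    """Ray-casting test: is (lon, lat) inside the closed ring of [lon, lat] pairs?"""
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        # Only edges that straddle the point's latitude can be crossed by the ray
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def point_in_polygon(lon, lat, geometry):
    """Point is inside a GeoJSON Polygon if it is in the outer ring and in no hole."""
    outer, *holes = geometry["coordinates"]
    return point_in_ring(lon, lat, outer) and not any(
        point_in_ring(lon, lat, hole) for hole in holes)


# Made-up rectangle for illustration (GeoJSON order is [lon, lat])
toy_area = {
    "type": "Polygon",
    "coordinates": [[[-49.30, -25.46], [-49.25, -25.46], [-49.25, -25.40],
                     [-49.30, -25.40], [-49.30, -25.46]]],
}

print(point_in_polygon(-49.270212, -25.430643, toy_area))  # a bakery from the sample
print(point_in_polygon(-49.20, -25.43, toy_area))          # a point east of the rectangle
```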

[^1]: https://ippuc.org.br/geodownloads/geo.htm
[^2]: https://github.com/ftauscheck/The-Battle-of-the-Neighborhoods/tree/main/support

#### 2.2.1 Neighborhoods: <a name="neighborhoods"></a> ####

![title](img/01.Neighborhoods.png)

#### 2.2.2 Master Plan: <a name="master_plan"></a> ####

![title](img/01.MasterPlan.png)

### 2.3 Socioeconomic data of the neighborhoods: <a name="data_socioeconomic"></a> ###
The socioeconomic data for the municipality was collected from the Wikipedia article[^3] "Lista de bairros de Curitiba" (List of neighborhoods of Curitiba).

[^3]: https://pt.wikipedia.org/wiki/Lista_de_bairros_de_Curitiba

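```python
# Load the socioeconomic data into its own dataframe so it does not overwrite df_venues
df_neighbourhood = pd.read_parquet('./parquet/data_neighbourhood.parquet', engine='fastparquet')
df_neighbourhood.head(10)
```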

| | id | neighbourhood | norm_neighbourhood | area | men | women | total | households | avg_income |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Ganchinho | GANCHINHO | 11.2 | 3667 | 3658 | 7325 | 1921 | 767.35 |
| 1 | 2 | Sitio Cercado | SITIO CERCADO | 11.12 | 50631 | 51779 | 102410 | 27914 | 934.95 |
| 2 | 3 | Umbará | UMBARA | 22.47 | 7280 | 7315 | 14595 | 17064 | 908.7 |
| 3 | 4 | Abranches | ABRANCHES | 4.32 | 5463 | 5702 | 11165 | 3154 | 1009.67 |
| 4 | 5 | Atuba | ATUBA | 4.27 | 6156 | 6476 | 12632 | 3627 | 1211.6 |
| 5 | 6 | Bacacheri | BACACHERI | 6.98 | 10762 | 12344 | 23106 | 7107 | 3029.0 |
| 6 | 7 | Bairro Alto | BAIRRO ALTO | 7.02 | 20244 | 21789 | 42033 | 12071 | 1211.6 |
| 7 | 8 | Barreirinha | BARREIRINHA | 3.73 | 8079 | 8942 | 17021 | 5024 | 1272.18 |
| 8 | 9 | Boa Vista | BOA VISTA | 5.14 | 13677 | 15714 | 29391 | 9212 | 1817.4 |
| 9 | 10 | Cachoeira | CACHOEIRA | 3.07 | 3811 | 3927 | 7738 | 2091 | 908.7 |
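
Population density, one of the factors listed in section 2, is not a column of this table but can be derived from `total` and `area`. The sketch below shows the arithmetic in plain Python so that it runs without the parquet file; the three rows are copied from the sample above, the helper name `population_density` is hypothetical, and the unit of `area` is not stated in the source (km² is an assumption).

```python
# (neighbourhood, area, total residents) copied from the sample rows
rows = [
    ("Ganchinho", 11.20, 7325),
    ("Sitio Cercado", 11.12, 102410),
    ("Bacacheri", 6.98, 23106),
]


def population_density(total, area):
    """Residents per unit of area (km² is an assumption; the source gives no unit)."""
    return total / area


densities = {name: population_density(total, area) for name, area, total in rows}

# List the neighbourhoods from most to least dense
for name, value in sorted(densities.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {value:,.0f} residents per unit of area")
```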




### 2.4 Work Flow: <a name="work_flow"></a> ###

With the data collected and processed, we will use a Polynomial Regression algorithm to predict the Rating, Likes, and Tier of each sub-cluster, and then rank the results to prepare a list of the best clusters in which to open a bakery.
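
The report does not yet fix the features, polynomial degree, or library for this step, so the following is only a minimal sketch of what polynomial regression means: a least-squares fit of a polynomial to observations, here with a single feature and solved through the normal equations in plain Python. The data is synthetic and generated from a known quadratic, not taken from the project, and the names `fit_polynomial` and `predict` are hypothetical. In practice a library such as scikit-learn (`PolynomialFeatures` combined with `LinearRegression`) would be a more likely choice.

```python
def fit_polynomial(xs, ys, degree=2):
    """Least-squares polynomial fit; returns coefficients [c0, c1, ..., c_degree]."""
    n = degree + 1
    # Normal equations (X^T X) c = X^T y, where X is the Vandermonde matrix of xs
    a = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting
    for col in range(n):
        pivot = max(range(col, n), key=lambda row: abs(a[row][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(a[i][k] * coeffs[k] for k in range(i + 1, n))
        coeffs[i] = (b[i] - tail) / a[i][i]
    return coeffs


def predict(coeffs, x):
    """Evaluate the fitted polynomial at x."""
    return sum(c * x ** i for i, c in enumerate(coeffs))


# Synthetic data from y = 1 + 2x + 0.5x^2 (not project data)
xs = [0, 1, 2, 3, 4]
ys = [1 + 2 * x + 0.5 * x * x for x in xs]

coeffs = fit_polynomial(xs, ys, degree=2)
print([round(c, 6) for c in coeffs])
print(round(predict(coeffs, 5), 6))
```

With several features per sub-cluster (income, density, distance to parks), the same idea extends to polynomial terms of each feature and their products.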