# Capstone Project - The Battle of the Neighborhoods (Week 2)
### Applied Data Science Capstone by IBM/Coursera

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## Introduction: Business Problem <a name="introduction"></a>

In this project we will try to find an optimal location for a restaurant. Specifically, this report is targeted at stakeholders interested in opening a **vegan restaurant** in **Goiânia**, Brazil.

Since there aren't many vegan restaurants in Goiânia, we will try to detect **locations that are most likely to welcome a new vegan restaurant** by looking at places that already have this kind of restaurant and detecting similar neighborhoods. We are also particularly interested in **areas as close to the city center as possible**.

We will use data science techniques to identify the most promising neighborhoods based on these criteria. The advantages of each area will then be clearly stated so that stakeholders can choose the best possible final location.

## Data <a name="data"></a>

Based on the definition of our problem, the factors that will influence our decision are:

* number of existing restaurants in the neighborhood (any type of restaurant)
* number of vegan restaurants in the neighborhood, if any
* similarity between neighborhoods
* distance of the neighborhood from the city center
    
The following data sources will be needed to extract/generate the required information:

* candidate neighborhoods and their distances from the city center will be scraped from the OpenStreetMap Nominatim details page for Goiânia, and the coordinates of each neighborhood will be obtained using **Nominatim geocoding**
* the number of restaurants and their type and location in every neighborhood will be obtained using the Foursquare API
* the coordinates of the Goiânia city center will be obtained using Nominatim geocoding

### Neighborhood Candidates

1. Scraping the page [https://nominatim.openstreetmap.org/details.php?osmtype=R&osmid=334547&class=boundary](https://nominatim.openstreetmap.org/details.php?osmtype=R&osmid=334547&class=boundary) to import Goiânia borough data
2. Preprocessing borough data
3. Getting borough coordinates
4. Preprocessing coordinate data


#### 1. Scraping page to import Goiânia borough data

Importing necessary Libraries

In [1]:
from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.firefox.options import Options

import pandas as pd
import numpy as np

Scraping and cleaning the pages

In [2]:
from selenium.webdriver.common.by import By

# Page to scrape
url = r"https://nominatim.openstreetmap.org/details.php?osmtype=R&osmid=334547&class=boundary"

# Start Chrome, passing the path to chromedriver
driver = webdriver.Chrome(r"C:\Users\ErikaS\Documents\Projetos Érika\Coursera__Capstone/chromedriver")

# Load the page
driver.get(url)

# Select the table element and extract it as HTML
table_xpath = "/html/body/div[2]/div[3]/div/table"
element = driver.find_element(By.XPATH, table_xpath)
element.click()
html_content = element.get_attribute('outerHTML')

# Close the browser
driver.quit()

# Parse the extracted HTML
soup = BeautifulSoup(html_content, 'html.parser')
table = soup.find(name='table')

# Use pandas to load the table into a DataFrame
df_full = pd.read_html(str(table))[0]


In [3]:
df_full.head()

Unnamed: 0,Local name,Type,OSM,Address rank,Admin level,Distance,Unnamed: 6
0,Goiânia,boundary:administrative,relation 334547,16.0,8.0,0.0,details >
1,Microrregião de Goiânia,boundary:administrative,relation 4857379,14.0,7.0,0.1012,details >
2,Região Geográfica Intermediária de Goiânia,boundary:administrative,relation 4873222,10.0,5.0,0.8761,details >
3,Goiás,boundary:administrative(state),relation 334443,8.0,4.0,0.7385,details >
4,Região Centro-Oeste,boundary:administrative,relation 3359944,6.0,3.0,5.2385,details >


#### 2. Preprocessing borough data

In [4]:
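# Selecting only the boroughs (rows with an address rank above 17)
borough_gyn = df_full[df_full['Address rank'] > 17.0].sort_values("Local name")

# Selecting useful columns
borough_gyn = borough_gyn.loc[:, ['Local name', 'Distance']]

# Cleaning the Distance column: strip "~" and the unit suffix, then cast to float.
# Note: the suffix is removed without converting units, so a value given in
# meters keeps its meter magnitude.
borough_gyn['Distance'] = (
    borough_gyn['Distance']
    .str.replace(" km", "")
    .str.replace("~", "")
    .str.replace(" m", "")
    .astype(float)
)

# Removing duplicate names, keeping the largest distance for each
borough_gyn = borough_gyn.groupby('Local name').max()
borough_gyn.reset_index(level=0, inplace=True)

borough_gyn.head()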

Unnamed: 0,Local name,Distance
0,Aldeia do Vale,8.7
1,Alphaville Flamboyant Residencial Araguaia,3.8
2,Bairro Boa Vista,13.6
3,Bairro Capuava,6.6
4,Bairro Feliz,3.0
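
The cleaning step above removes the unit suffix but does not convert between units. A minimal sketch of a unit-aware parser follows. It assumes the scraped values look like `~8.7 km` or `~500 m`, which is inferred from the suffixes being stripped and not confirmed against the page; the name `parse_distance_km` is hypothetical.

```python
import re

# Assumed input format: an optional "~", a number, then a unit ("km" or "m").
_DISTANCE_RE = re.compile(r"^\s*~?\s*([0-9]+(?:\.[0-9]+)?)\s*(km|m)\s*$")


def parse_distance_km(text):
    """Convert a string such as '~8.7 km' or '~500 m' to kilometers."""
    match = _DISTANCE_RE.match(text)
    if match is None:
        raise ValueError(f"Unrecognised distance: {text!r}")
    value, unit = float(match.group(1)), match.group(2)
    # Meters are scaled down so that every result is expressed in km.
    return value / 1000.0 if unit == "m" else value


# Illustrative inputs only, not values taken from the scraped table.
samples = ["~8.7 km", "3.8 km", "~500 m"]
distances_km = [parse_distance_km(s) for s in samples]
```

If the column really holds strings in this format, something like `borough_gyn['Distance'].map(parse_distance_km)` could replace the chained `str.replace` calls.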


To be clear, the scraped Goiânia borough data gives us:

- **Local name:** the borough name;
- **Distance:** the distance from the city center

In [5]:
borough_gyn.shape

(333, 2)

Saving the data to use later

In [6]:
borough_gyn.to_csv("bairros_gyn.csv", index=False)

#### Importing Borough data

In [3]:
import pandas as pd
import numpy as np

borough_gyn = pd.read_csv("bairros_gyn.csv")

Now that we have all the Goiânia boroughs, we can select only those within a 10 km radius of the city center.


In [5]:
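# .copy() makes the selection an independent DataFrame, so columns can be added later without a SettingWithCopyWarning
borough_candidates = borough_gyn[borough_gyn['Distance'] < 10].copy()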
print("Dimensions: ", borough_candidates.shape)
borough_candidates.head()

Dimensions:  (210, 2)


Unnamed: 0,Local name,Distance
0,Aldeia do Vale,8.7
1,Alphaville Flamboyant Residencial Araguaia,3.8
3,Bairro Capuava,6.6
4,Bairro Feliz,3.0
6,Bairro Santo Hilário,6.1


#### 3. Getting borough coordinates

Importing libraries to get geocoordinates

In [9]:
#!pip install geopandas 
#!pip install geopy
import geopandas as gpd
import pandas as pd

Using Nominatim API to get the coordinates

Getting the center coordinate

In [32]:
# Center coordinate (shapely points store x = longitude and y = latitude)
center_goiania = gpd.tools.geocode(" Goiânia, GO", provider='nominatim', user_agent="imp geocode")
lat_gyn = center_goiania['geometry'][0].y
long_gyn = center_goiania['geometry'][0].x
print('Coordinate of Goiânia, GO, Brazil: [{}, {}]'.format(lat_gyn, long_gyn))

Coordinate of Goiânia, GO, Brazil: [-16.680882, -49.2532691]


In [14]:
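# Build one full address string per candidate borough
adrees = pd.DataFrame({"ende": borough_candidates['Local name'] + ", Goiânia, GO "})

# Geocode every address with Nominatim and keep the resulting point geometry
borough_candidates["Geom"] = gpd.tools.geocode(adrees["ende"], provider='nominatim', user_agent="imp geocode", country_bias="Brazil")["geometry"]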



In [15]:
borough_candidates.head()

Unnamed: 0,Local name,Distance,Geom
0,Aldeia do Vale,8.7,POINT (-49.19967 -16.60501)
1,Alphaville Flamboyant Residencial Araguaia,3.8,POINT (-49.21316 -16.69865)
3,Bairro Capuava,6.6,POINT (-49.32322 -16.65565)
4,Bairro Feliz,3.0,POINT (-49.22855 -16.66210)
6,Bairro Santo Hilário,6.1,POINT (-49.19548 -16.65080)
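
The `Geom` values above are printed as WKT points, which list x (longitude) first and y (latitude) second. The sketch below parses such a string with plain Python and returns the pair in latitude, longitude order. The helper name `wkt_point_to_lat_lon` is hypothetical, and the regular expression only covers simple two-dimensional points like the ones shown.

```python
import re

# Matches simple 2D WKT points such as "POINT (-49.19967 -16.60501)".
_POINT_RE = re.compile(r"^POINT \((-?\d+(?:\.\d+)?) (-?\d+(?:\.\d+)?)\)$")


def wkt_point_to_lat_lon(wkt):
    """Return (latitude, longitude) from a WKT point string."""
    match = _POINT_RE.match(wkt.strip())
    if match is None:
        raise ValueError(f"Not a simple WKT point: {wkt!r}")
    # WKT order is x then y, i.e. longitude then latitude.
    lon, lat = float(match.group(1)), float(match.group(2))
    return lat, lon


# First row of the table above (Aldeia do Vale).
lat, lon = wkt_point_to_lat_lon("POINT (-49.19967 -16.60501)")
```

This can also be handy after a CSV round trip, where geometry columns are read back as plain strings.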


In [26]:
# Shapely points store x = longitude and y = latitude
borough_candidates["Lat"] = borough_candidates['Geom'].map(lambda p: p.y)
borough_candidates["Long"] = borough_candidates['Geom'].map(lambda p: p.x)


In [27]:
borough_candidates.head()

Unnamed: 0,Local name,Distance,Geom,geometry,Lat,Long
0,Aldeia do Vale,8.7,POINT (-49.19967 -16.60501),POINT (-49.19967 -16.60501),-16.605009,-49.199667
1,Alphaville Flamboyant Residencial Araguaia,3.8,POINT (-49.21316 -16.69865),POINT (-49.21316 -16.69865),-16.698653,-49.213159
3,Bairro Capuava,6.6,POINT (-49.32322 -16.65565),POINT (-49.32322 -16.65565),-16.655652,-49.323221
4,Bairro Feliz,3.0,POINT (-49.22855 -16.66210),POINT (-49.22855 -16.66210),-16.6621,-49.228554
6,Bairro Santo Hilário,6.1,POINT (-49.19548 -16.65080),POINT (-49.19548 -16.65080),-16.650797,-49.195476


Saving the data to use later

In [29]:
borough_candidates.to_csv("borough_candidates.csv", index=False)

In [30]:
import pandas as pd
import numpy as np

borough_candidates = pd.read_csv("borough_candidates.csv")
borough_candidates.head()


Unnamed: 0,Local name,Distance,Geom,geometry,Lat,Long
0,Aldeia do Vale,8.7,POINT (-49.1996666 -16.6050087),POINT (-49.1996666 -16.6050087),-16.605009,-49.199667
1,Alphaville Flamboyant Residencial Araguaia,3.8,POINT (-49.2131586 -16.6986528),POINT (-49.2131586 -16.6986528),-16.698653,-49.213159
2,Bairro Capuava,6.6,POINT (-49.3232214 -16.6556515),POINT (-49.3232214 -16.6556515),-16.655652,-49.323221
3,Bairro Feliz,3.0,POINT (-49.2285538 -16.6620997),POINT (-49.2285538 -16.6620997),-16.6621,-49.228554
4,Bairro Santo Hilário,6.1,POINT (-49.1954757 -16.6507974),POINT (-49.1954757 -16.6507974),-16.650797,-49.195476
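
With coordinates for both the city center and each candidate, the distance to the center can be recomputed as a sanity check. The sketch below uses the haversine formula on a spherical Earth; the radius constant and the function name `haversine_km` are assumptions of this example, and the result need not match the scraped `Distance` column, whose reference point is not documented here.

```python
import math

# Mean Earth radius in km; the spherical model is an approximation.
EARTH_RADIUS_KM = 6371.0088


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# (latitude, longitude) pairs taken from the geocoding outputs above.
center = (-16.680882, -49.2532691)
aldeia_do_vale = (-16.6050087, -49.1996666)

distance_km = haversine_km(*center, *aldeia_do_vale)
```

Note the argument order: latitude first, then longitude, which is the opposite of the x, y order used by the point geometries.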


Let's visualize the data we have so far: city center location and candidate neighborhood centers:

In [31]:
import folium

In [37]:
# folium expects locations as [latitude, longitude]
map_goiania = folium.Map(location=[lat_gyn, long_gyn], zoom_start=11)
folium.Marker([lat_gyn, long_gyn], popup='Goiania').add_to(map_goiania)
for lat, lon in zip(borough_candidates["Lat"], borough_candidates['Long']):
    #folium.CircleMarker([lat, lon], radius=2, color='blue', fill=True, fill_color='blue', fill_opacity=1).add_to(map_goiania)
    folium.Circle([lat, lon], radius=10, color='blue', fill=False).add_to(map_goiania)
    #folium.Marker([lat, lon]).add_to(map_goiania)
map_goiania

### Foursquare

Now that we have our location candidates, let's use the Foursquare API to get information on the restaurants in each neighborhood.

We're interested in venues in the 'food' category, but only those that are proper restaurants. Coffee shops, pizza places, bakeries, etc. are not direct competitors, so we don't care about those. We will therefore include in our list only venues that have 'restaurant' in the category name, and we'll make sure to detect and include all the subcategories of the specific 'Vegan restaurant' category, as we need information on vegan restaurants in the neighborhood.

Foursquare credentials are defined in a hidden cell below.
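
The filtering rule described above can be sketched without calling the API. The example below classifies venues by their category names: a venue counts as a restaurant when any category name contains 'restaurant', and as a target venue when such a name also contains the cuisine of interest (vegan, for this report). The function name `classify_venue` and the venue dictionaries are hypothetical and do not reproduce the Foursquare response format.

```python
def classify_venue(category_names, target_keyword="vegan"):
    """Return (is_restaurant, is_target) for a venue's category names."""
    is_restaurant = False
    is_target = False
    for name in category_names:
        lowered = name.lower()
        if "restaurant" in lowered:
            is_restaurant = True
            # Only restaurant categories can mark a venue as a target.
            if target_keyword in lowered:
                is_target = True
    return is_restaurant, is_target


# Hypothetical venues, for illustration only.
venues = [
    {"name": "Venue A", "categories": ["Vegetarian / Vegan Restaurant"]},
    {"name": "Venue B", "categories": ["Brazilian Restaurant"]},
    {"name": "Venue C", "categories": ["Coffee Shop"]},
]

restaurants = []
for venue in venues:
    is_restaurant, is_target = classify_venue(venue["categories"])
    if is_restaurant:
        restaurants.append((venue["name"], is_target))
```

Counting the entries of `restaurants` per neighborhood, and how many of them are flagged as targets, would give the first two factors listed in the Data section.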

## Methodology <a name="methodology"></a>

## Analysis <a name="analysis"></a>

## Results and Discussion <a name="results"></a>