# Capstone Project - The Battle of the Neighborhoods (Week 1)
### Applied Data Science Capstone by IBM/Coursera

#### Daniel de Amaral da Silva (BSc in Statistics, Federal University of Ceará - UFC)

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## Introduction: Business Problem <a name="introduction"></a>

This work aims to understand the **economic regions** of the tourist state of **Bahia**, Brazil. 

The city of **Salvador** is a major destination for tourists from many countries who come to have fun, especially during Carnival.

| ![Picture of Carnaval](https://correio-cdn1.cworks.cloud/fileadmin/_processed_/9/0/csm_carnaval_bloco_63a5062383.jpg) | 
|:--:| 
| *Salvador Carnival* |

Many tourist routes and accommodation guides already exist. This notebook instead visualizes the city and tries to **partition it into regions** in order to **identify the economic focus** of each one.

The result of this work can help both people who want to **open their own business** as micro-entrepreneurs and people who want to **buy a house/apartment** in Salvador **in a pre-established neighborhood**.

To make this clearer, imagine a person who **wants to live in Salvador** and **cares less about the region of the city than about the neighborhood itself**. For example, a couple with children might tend to choose locations with good schools, parks and places for their children to have fun. In contrast, a 19- or 20-year-old might choose locations close to bars and clubs.

We will use data science tools to identify the most **promising regions based on the user's profile**. The advantages of each area will be clearly stated, so that interested parties can choose the best possible final region.

## Data <a name="data"></a>

Based on the definition of our problem, the only factor that will influence our decision is the **number of venues in the neighborhood (any type of venue)**.

We decided to use a circular search area, centered on each county, to define our neighborhoods.

The following data sources will be needed to extract/generate the required information:
* the center coordinates of each candidate county will be obtained from the county name using **geopy geocoding** (Nominatim)
* the number of venues in every neighborhood, with their type and location, will be obtained using the **Foursquare API**
* the coordinates of the center of Bahia/BR will be obtained using **geopy geocoding**.

### Neighborhood Counties Candidates

Let's create the latitude and longitude coordinates of the centroids of our candidate counties (neighborhoods), along with their areas and respective radii.

Fortunately, we can obtain the names of all the counties in the state of Bahia and their respective areas $(km^2)$ from IBGE (Brazilian Institute of Geography and Statistics).

This gives us, for each county, a search radius that delimits its neighborhood.

In [1]:
import pandas as pd

neighborhoods = pd.read_csv('https://raw.githubusercontent.com/damarals/Coursera_Capstone/master/mapa.csv')
neighborhoods.columns = ['county', 'area']

neighborhoods.head()

Unnamed: 0,county,area
0,Alcobaça,1477.929
1,Almadina,245.236
2,Anagé,1899.683
3,Arataca,435.962
4,Aurelino Leal,445.394


A naive approach to determining the radius of each county is to treat it as a perfect circle centered on its coordinates, using the following formula (with the area in $km^2$, the radius is in km):

$$radius = \sqrt{\frac{area}{\pi}}$$

In [2]:
import numpy as np
import math

neighborhoods['radius'] = np.sqrt(neighborhoods.area/math.pi).round(3)
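
The radius computed above is in kilometers, because the IBGE areas are in $km^2$. As a quick check, the sketch below recomputes the radius for the first two counties of the table and converts it to meters, which is the unit the Foursquare `radius` parameter is assumed to expect. The helper names and the 100 000 m cap are assumptions made for illustration and are not part of the original notebook.

```python
import math

def radius_km_from_area(area_km2):
    """Radius (km) of a circle with the given area (km^2)."""
    return math.sqrt(area_km2 / math.pi)

def radius_m_for_search(area_km2, cap_m=100_000):
    """Radius in whole meters, capped at cap_m (assumed API limit)."""
    return min(round(radius_km_from_area(area_km2) * 1000), cap_m)

# areas taken from the first rows of the IBGE table
areas = {'Alcobaça': 1477.929, 'Almadina': 245.236}
for county, area in areas.items():
    print(county, round(radius_km_from_area(area), 3), radius_m_for_search(area))
```

The rounded kilometer values should agree with the `radius` column shown later for the same counties.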

Let's get the latitude and longitude coordinates of each county using geopy.

In order to define an instance of the geocoder, we need to define a user_agent. We will name our agent <em>bahia_explorer</em>, as shown below.

In [3]:
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
from geopy.extra.rate_limiter import RateLimiter # delay between geocoding calls
from geopy.exc import GeocoderTimedOut # error messages

locator = Nominatim(user_agent= 'bahia_explorer')

geocode = RateLimiter(locator.geocode, min_delay_seconds = 0.5)

try:
    # geocode every county name; a lookup with no match returns None
    neighborhoods['location'] = (neighborhoods.county + ', Bahia - Brasil').apply(lambda neigh: geocode(neigh, timeout = 10000))
    # keep (latitude, longitude, altitude) as a tuple, or None when there was no match
    neighborhoods['point'] = neighborhoods['location'].apply(lambda loc: tuple(loc.point) if loc else None)
    # spread the tuple over three columns
    neighborhoods[['latitude', 'longitude', 'altitude']] = pd.DataFrame(neighborhoods['point'].tolist(), index=neighborhoods.index)
    # drop the helper columns that are no longer needed
    neighborhoods.drop(['location', 'point', 'altitude', 'area'], axis = 1, inplace = True)
except GeocoderTimedOut as e:
    print('Error: geocode failed with message {}'.format(e))
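
geopy's `geocode` returns `None` when it finds no match, so some counties could end up without coordinates. A minimal sketch of how resolved and unresolved counties might be separated before plotting is shown below. `split_geocoded` is a hypothetical helper, and the second sample entry is a made-up name used only to illustrate a failed lookup.

```python
def split_geocoded(counties, points):
    """Separate counties that have coordinates from those that do not."""
    resolved, unresolved = [], []
    for county, point in zip(counties, points):
        if point is None:
            # the geocoder found no match for this name
            unresolved.append(county)
        else:
            # point is (latitude, longitude, altitude); altitude is ignored
            resolved.append((county, point[0], point[1]))
    return resolved, unresolved

# first entry mirrors the notebook's data; second is hypothetical
counties = ['Alcobaça', 'Hypothetical County']
points = [(-17.521587, -39.196547, 0.0), None]

resolved, unresolved = split_geocoded(counties, points)
print(resolved)
print(unresolved)
```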

In [4]:
neighborhoods.head()

Unnamed: 0,county,radius,latitude,longitude
0,Alcobaça,21.69,-17.521587,-39.196547
1,Almadina,8.835,-14.70466,-39.638199
2,Anagé,24.59,-14.613467,-41.137631
3,Arataca,11.78,-15.261742,-39.412294
4,Aurelino Leal,11.907,-14.316714,-39.325611


Let's make sure that the dataset has all 417 counties.

In [5]:
print('The dataframe has {} counties.'.format(neighborhoods.shape[0]))

The dataframe has 417 counties.


Let's find the latitude and longitude of the center of the state of Bahia using geopy.

In [6]:
# get coordinates of Bahia
try:
    bahia_center = geocode('Bahia, Brasil', timeout = 10000)
    print('The geographical coordinate of Bahia state is {}'.format(bahia_center[1]))
except GeocoderTimedOut as e:
    print('Error: geocode failed with message {}'.format(e))

The geographical coordinate of Bahia state is (-12.285251, -41.9294776)


Let's visualize the data we have so far: the candidate county centers, on a map centered on the state of Bahia:

In [7]:
import folium
import unidecode # strip accents from county names to avoid problems in the popups

map_bahia = folium.Map(location = bahia_center[1], zoom_start = 6)
for lat, lon, county in zip(neighborhoods.latitude, 
                    neighborhoods.longitude,
                    neighborhoods.county):
    label = folium.Popup(unidecode.unidecode(county), parse_html = True)
    folium.CircleMarker(
        [lat, lon],
        radius = 5,
        popup = label,
        color = 'blue',
        fill = True,
        fill_color = '#3186cc',
        fill_opacity = 0.7,
        parse_html = False).add_to(map_bahia) 

map_bahia

### Foursquare
Now that we have our location candidates, let's use the Foursquare API to get information on the venues in each neighborhood.

We're interested in all venues: coffee shops, gyms, pizza places, bakeries, schools...

In [8]:
# The code was removed by Watson Studio for sharing.

Foursquare API version: 20180605


Let's create a function to explore all the neighborhoods in Bahia

In [9]:
import requests

def getNearbyVenues(names, latitudes, longitudes, radius=500, LIMIT = 100):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']

        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
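
The request URL in the function above is assembled by string formatting. One alternative, sketched here under the assumption that the endpoint accepts the same query parameters shown in the function, is to build them as a dictionary and let the standard library encode them, which also makes it easy to pass a different radius per county. `build_explore_url` and the placeholder credentials are hypothetical.

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = 'https://api.foursquare.com/v2/venues/explore'

def build_explore_url(lat, lng, radius_m, client_id, client_secret,
                      version='20180605', limit=100):
    """Build the explore URL from named query parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),  # "latitude,longitude"
        'radius': radius_m,
        'limit': limit,
    }
    # urlencode percent-encodes special characters such as the comma in ll
    return EXPLORE_ENDPOINT + '?' + urlencode(params)

# placeholder credentials, for illustration only
url = build_explore_url(-17.521587, -39.196547, 500, 'MY_ID', 'MY_SECRET')
print(url)
```

With `requests`, the same dictionary could instead be passed directly as `requests.get(EXPLORE_ENDPOINT, params=params)`.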

In [10]:
# apply the function to each neighborhood (uses the default radius=500)
bahia_venues = getNearbyVenues(names = neighborhoods['county'],
                               latitudes = neighborhoods['latitude'],
                               longitudes = neighborhoods['longitude'])

Let's check the size of the resulting dataframe

In [11]:
print(bahia_venues.shape)
bahia_venues.head()

(2669, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Alcobaça,-17.521587,-39.196547,Farol de Alcobaça,-17.519936,-39.193818,Lighthouse
1,Alcobaça,-17.521587,-39.196547,Mercado Estrela do Sul,-17.520896,-39.196534,Grocery Store
2,Alcobaça,-17.521587,-39.196547,Estádio Municipal de Alcobaça,-17.521949,-39.195559,Soccer Stadium
3,Alcobaça,-17.521587,-39.196547,Hebrom Moto Peças,-17.520677,-39.197127,Motorcycle Shop
4,Alcobaça,-17.521587,-39.196547,Distribuidora De Bebidas Alcobaça,-17.522186,-39.196057,Beer Store


Let's check how many venues were returned for each neighborhood (county)

In [12]:
bahia_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Abaré,1,1,1,1,1,1
Abaíra,6,6,6,6,6,6
Acajutiba,5,5,5,5,5,5
Adustina,7,7,7,7,7,7
Alagoinhas,31,31,31,31,31,31
Alcobaça,6,6,6,6,6,6
Almadina,4,4,4,4,4,4
Amargosa,12,12,12,12,12,12
Amélia Rodrigues,8,8,8,8,8,8
América Dourada,2,2,2,2,2,2


Note that venues were returned for only 368 of the 417 counties, a reduction of about $11.75\%$.
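
The share quoted above follows from $(417 - 368)/417$. The sketch below reproduces that arithmetic and shows how the counties without venues could be listed by comparing the two sets of names. The function names and the short sample lists are illustrative assumptions, not data from the notebook.

```python
def missing_share(total, with_venues):
    """Percentage of counties for which no venue was returned."""
    return 100 * (total - with_venues) / total

def counties_without_venues(all_counties, venue_counties):
    """Counties absent from the venue results, in their original order."""
    seen = set(venue_counties)
    return [c for c in all_counties if c not in seen]

print(round(missing_share(417, 368), 2))

# made-up sample, only to illustrate the comparison
sample_all = ['County A', 'County B', 'County C']
sample_found = ['County A', 'County A', 'County C']
print(counties_without_venues(sample_all, sample_found))
```

In the notebook itself, the two inputs would be `neighborhoods['county']` and `bahia_venues['Neighborhood']`.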

Now, let's find out how many unique categories can be curated from all the returned venues


In [13]:
print('There are {} unique categories.'.format(len(bahia_venues['Venue Category'].unique())))

There are 251 unique categories.
