<h1>Capstone Project - The Battle of Neighborhoods (Week 1)</h1>

<h2>Introduction</h2>
<p>Buenos Aires is the largest city in Argentina and one of the largest in South America. As such, it has become a hub for the country's main activities (tourism, education, employment and entertainment) and presents a big opportunity for new venues.</p>

<h2>Business Problem</h2>
<p>The client wants to open a cinema in Buenos Aires and needs to know which part of the city would be the best place for it.
Our objective is therefore to study the city using Foursquare data on the venues that might have a positive or negative impact on a new cinema.</p>

<h2>Data</h2>
<p>In order to help the client to determine the best place to open a Cinema in Buenos Aires we will need the following set of data:</p>
<ul>
    <li>The barrios of Buenos Aires, Argentina, and their coordinates: https://raw.githubusercontent.com/diecou/ds-training/master/buenos_aires_coor.csv
    <li>The following venues data from Foursquare:
        <ul>
            <li>Cinema venues of the Barrios
            <li>Restaurant venues of the Barrios
            <li>Theater venues of the Barrios
        </ul>
</ul>
<p>These venues will be weighted so that we can determine which barrio is the best one in which to open a new cinema.</p>

<h2>Methodology</h2>
<p>To determine where best to place a new cinema, we create a dataset with all 'barrios' of Buenos Aires and their coordinates, and combine it with data on the venues of interest (cinemas, restaurants, theaters) obtained through the Foursquare API.
We then use this data to compute a weighted score for each 'barrio'. The top one on the list will be recommended to the client.</p>

<h3>1. Create dataset</h3>
<p>I've uploaded to GitHub a dataset of all barrios along with their latitudes and longitudes. These coordinates let us place markers on a map and also serve as the center of each 'barrio', around which we will count the venues within a given radius.</p>

In [6]:
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# on_bad_lines='skip' replaces the deprecated error_bad_lines=False (pandas >= 1.3)
df = pd.read_csv("https://raw.githubusercontent.com/diecou/ds-training/master/buenos_aires_coor.csv", on_bad_lines='skip')

df.head()

Unnamed: 0,Comuna,Barrio,Latitude,Longitude
0,15,AGRONOMIA,-34.5925,-58.4944
1,5,ALMAGRO,-34.6111,-58.4202
2,3,BALVANERA,-34.6101,-58.4059
3,4,BARRACAS,-34.6411,-58.3774
4,13,BELGRANO,-34.5621,-58.4567


<h3>2. Import remaining libraries</h3>

In [7]:
import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (import path for pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


<h3>3. Create a map of Buenos Aires with its 'Barrios'</h3>
<p>Using the coordinates of each 'barrio', we can now draw a map of Buenos Aires with a marker for each 'barrio' for reference.</p>

In [8]:
address = 'Buenos Aires'

geolocator = Nominatim(user_agent="ba_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Buenos Aires are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Buenos Aires are -34.6075616, -58.437076.


<h4>Draw the map</h4>

In [9]:
# create map of Buenos Aires using latitude and longitude values
map_ba = folium.Map(location=[latitude, longitude], zoom_start=12)

# add markers to map
for lat, lng, comuna, barrio in zip(df['Latitude'], df['Longitude'], df['Comuna'], df['Barrio']):
    label = '{}, Comuna {}'.format(barrio, comuna)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_ba)  
    
map_ba

<h3>4. Analyze Barrios in Buenos Aires</h3>
<p>We will use the Foursquare API to get data for the three venue types of interest: cinemas, restaurants and theaters. Using the coordinates of each 'barrio', we will count how many of these venues are located within 1 km of its center.</p>
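<p>As a rough illustration of what "within 1 km" means here, the sketch below computes the great-circle (haversine) distance between a barrio center and a venue. The helper name <code>haversine_km</code> and the Earth-radius constant are assumptions made for this example and are not part of the notebook; Foursquare applies its own radius filter on the server side.</p>

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed constant for this sketch)

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometers between two (lat, lng) points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lng2 - lng1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

# AGRONOMIA center and the restaurant "La Floreada", as listed in this notebook's tables
distance = haversine_km(-34.5925, -58.4944, -34.594121, -58.492288)
print(round(distance, 2))  # distance in km, well inside the 1 km radius
```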

<h4>Define Foursquare Credentials and Version</h4>

In [10]:
CLIENT_ID = 'FTXL5TNGA2NSEP5ANEOE0RXKXXVC31N0ABYXTQ525E0P5Q2N' # your Foursquare ID
CLIENT_SECRET = '1F5YPJTE2MJ3YSCMBD3V5KMZR2E23AYF5NK4SWEBOI1PCROG' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: FTXL5TNGA2NSEP5ANEOE0RXKXXVC31N0ABYXTQ525E0P5Q2N
CLIENT_SECRET:1F5YPJTE2MJ3YSCMBD3V5KMZR2E23AYF5NK4SWEBOI1PCROG



<h4>Create a function that searches every barrio in Buenos Aires for the venues of interest</h4>

In [11]:
# function to repeat the venue search for every barrio in Buenos Aires
# (LIMIT is a module-level value that must be set before the function is called)
def getNearbyVenues(names, latitudes, longitudes, radius=5000, categoryIds=''):
    columns = ['Localidad',
               'Localidad Latitude',
               'Localidad Longitude',
               'Venue',
               'Venue Latitude',
               'Venue Longitude',
               'Venue Category']
    venues_list = []

    for name, lat, lng in zip(names, latitudes, longitudes):
        # build the query parameters for the API request
        params = {
            'client_id': CLIENT_ID,
            'client_secret': CLIENT_SECRET,
            'v': VERSION,
            'll': '{},{}'.format(lat, lng),
            'radius': radius,
            'limit': LIMIT,
        }
        if categoryIds != '':
            params['categoryId'] = categoryIds

        # make the GET request; report a failing barrio and move on to the next one
        try:
            response = requests.get('https://api.foursquare.com/v2/venues/search', params=params).json()
            results = response['response']['venues']
        except (requests.RequestException, KeyError, ValueError) as err:
            print('Request failed for {}: {}'.format(name, err))
            continue

        # keep only the relevant information for each venue that has a category
        for v in results:
            if not v.get('categories'):
                continue
            venues_list.append((
                name,
                lat,
                lng,
                v['name'],
                v['location']['lat'],
                v['location']['lng'],
                v['categories'][0]['name']
            ))

    # an empty result still yields a dataframe with the expected columns
    return pd.DataFrame(venues_list, columns=columns)
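
<p>To make the extraction step concrete, the sketch below applies the same idea to a small hand-written payload: venues without a category are skipped, as in the function above. The payload is hypothetical (it only imitates the <code>response</code> / <code>venues</code> structure the function reads, and the venue names are made up), and <code>extract_rows</code> is an illustrative helper rather than part of the notebook.</p>

```python
import pandas as pd

# Hypothetical payload imitating the structure read by getNearbyVenues
sample = {
    "response": {
        "venues": [
            {"name": "Example Cinema",
             "location": {"lat": -34.61, "lng": -58.42},
             "categories": [{"name": "Movie Theater"}]},
            {"name": "Uncategorized Place",
             "location": {"lat": -34.62, "lng": -58.43},
             "categories": []},  # no category: this venue is skipped
        ]
    }
}

def extract_rows(barrio, lat, lng, payload):
    """Return one tuple per venue that has at least one category."""
    rows = []
    for v in payload["response"]["venues"]:
        if not v.get("categories"):
            continue
        rows.append((barrio, lat, lng, v["name"],
                     v["location"]["lat"], v["location"]["lng"],
                     v["categories"][0]["name"]))
    return rows

columns = ['Localidad', 'Localidad Latitude', 'Localidad Longitude',
           'Venue', 'Venue Latitude', 'Venue Longitude', 'Venue Category']
demo = pd.DataFrame(extract_rows('ALMAGRO', -34.6111, -58.4202, sample), columns=columns)
print(demo[['Localidad', 'Venue', 'Venue Category']])
```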

<h4>4a. Run the above function to find the location of existing Cinemas</h4>
<p>Using the function defined above, we retrieve the cinemas found within a 1 km radius of each 'barrio'.</p>

In [12]:
LIMIT = 1000

# Category id 4bf58dd8d48988d17f941735 restricts the search to movie theaters
ba_cinemas = getNearbyVenues(names=df['Barrio'], latitudes=df['Latitude'], longitudes=df['Longitude'], radius=1000, categoryIds='4bf58dd8d48988d17f941735')
ba_cinemas.head()

Unnamed: 0,Localidad,Localidad Latitude,Localidad Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,AGRONOMIA,-34.5925,-58.4944,Cineclub La Pampa,-34.584548,-58.487944,Indie Movie Theater
1,ALMAGRO,-34.6111,-58.4202,Cinemark Caballito,-34.616343,-58.429011,Multiplex
2,BALVANERA,-34.6101,-58.4059,Gaumont,-34.605704,-58.40242,Multiplex
3,BALVANERA,-34.6101,-58.4059,Teatro IFT,-34.603752,-58.406408,Indie Theater
4,BALVANERA,-34.6101,-58.4059,MicroCine Fox/Warner,-34.602391,-58.394731,Movie Theater


<h4>4b. Run the same function to find the location of Restaurants</h4>
<p>Using the function defined above, we retrieve the restaurants found within a 1 km radius of each 'barrio'.</p>

In [13]:
LIMIT = 1000

# Category id 4d4b7105d754a06374d81259 restricts the search to food venues (restaurants, cafés, etc.)
ba_restaurants = getNearbyVenues(names=df['Barrio'], latitudes=df['Latitude'], longitudes=df['Longitude'], radius=1000, categoryIds='4d4b7105d754a06374d81259')
ba_restaurants.head()

Unnamed: 0,Localidad,Localidad Latitude,Localidad Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,AGRONOMIA,-34.5925,-58.4944,La Floreada,-34.594121,-58.492288,Restaurant
1,AGRONOMIA,-34.5925,-58.4944,Bonafide,-34.590722,-58.498184,Coffee Shop
2,AGRONOMIA,-34.5925,-58.4944,Cremolatti,-34.590057,-58.497106,Ice Cream Shop
3,AGRONOMIA,-34.5925,-58.4944,La Casa De Maga,-34.588055,-58.493725,Café
4,AGRONOMIA,-34.5925,-58.4944,Al Piatto,-34.590089,-58.497244,Pizza Place


<h4>4c. Run the same function to find the location of Theaters</h4>
<p>Using the function defined above, we retrieve the theaters found within a 1 km radius of each 'barrio'.</p>

In [14]:
# Category id 4bf58dd8d48988d137941735 restricts the search to theaters
ba_theaters = getNearbyVenues(names=df['Barrio'], latitudes=df['Latitude'], longitudes=df['Longitude'], radius=1000, categoryIds='4bf58dd8d48988d137941735')
ba_theaters.head()

Unnamed: 0,Localidad,Localidad Latitude,Localidad Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,AGRONOMIA,-34.5925,-58.4944,Instituto Sudamericano de Ilusionismo,-34.593473,-58.502755,Theater
1,AGRONOMIA,-34.5925,-58.4944,Gargantúa,-34.599755,-58.501886,Theater
2,ALMAGRO,-34.6111,-58.4202,Actors Studio Teatro ™ y Estudio de Carlos Gan...,-34.608654,-58.420578,Theater
3,ALMAGRO,-34.6111,-58.4202,TEATRO DEL PASILLO,-34.611836,-58.419897,Theater
4,ALMAGRO,-34.6111,-58.4202,Asociación Civil Nuestro Tiempo,-34.607031,-58.417305,Theater


<h3>5. Group Data</h3>
<p>We need a score for each 'barrio' to decide which one is the most suitable for a new cinema. To do this we must first:</p>
<ul>
    <li>Group all the acquired venue data per barrio in a new dataset
    <li>Assign a weight to each venue category according to its impact
    <li>Use the number of venues in each barrio and the weight of each category to determine the score of each barrio
</ul>

<h4>Let's define a function to group all the venues found before</h4>

In [15]:
def addColumn(initDf, title, groupedDf):
    # count the venues found for each barrio
    counts = groupedDf.groupby('Localidad')['Venue'].count()

    # align the counts with the barrio list; barrios without venues get 0
    initDf[title] = initDf['Localidad'].map(counts).fillna(0).astype(float)
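
<p>The counting step can be checked on toy data. The sketch below uses made-up venue rows (the names are hypothetical) and shows one way to obtain a per-barrio count with <code>groupby</code> and <code>map</code>, including a barrio with no venues that ends up with 0. The variable names are illustrative only.</p>

```python
import pandas as pd

# Toy venue list: two venues in ALMAGRO, one in BOCA, none in BARRACAS (made-up rows)
venues = pd.DataFrame({
    'Localidad': ['ALMAGRO', 'ALMAGRO', 'BOCA'],
    'Venue': ['Venue A', 'Venue B', 'Venue C'],
})
barrios = pd.DataFrame({'Localidad': ['ALMAGRO', 'BARRACAS', 'BOCA']})

# Count venues per barrio, then align the counts with the barrio list
counts = venues.groupby('Localidad')['Venue'].count()
barrios['Cinemas'] = barrios['Localidad'].map(counts).fillna(0)
print(barrios)
```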

<h4>Add venue count to each Barrio</h4>

In [16]:
df_venues = df.copy()
df_venues.rename(columns={'Barrio':'Localidad'}, inplace=True)

addColumn(df_venues, 'Cinemas', ba_cinemas)
addColumn(df_venues, 'Restaurants', ba_restaurants)
addColumn(df_venues, 'Theaters', ba_theaters)

df_venues.rename(columns={'Localidad':'Barrio'}, inplace=True)

df_venues.head()

Unnamed: 0,Comuna,Barrio,Latitude,Longitude,Cinemas,Restaurants,Theaters
0,15,AGRONOMIA,-34.5925,-58.4944,1.0,50.0,2.0
1,5,ALMAGRO,-34.6111,-58.4202,1.0,50.0,20.0
2,3,BALVANERA,-34.6101,-58.4059,12.0,50.0,27.0
3,4,BARRACAS,-34.6411,-58.3774,0.0,48.0,1.0
4,13,BELGRANO,-34.5621,-58.4567,12.0,50.0,2.0


<h4>Assign weight according to the customer's needs</h4>
<ul>
    <li>The client wants to open the cinema where there isn't one close by, in order to attract the largest number of customers. Therefore cinemas get a high negative weight
    <li>Places with several restaurants are desirable because they attract people who are already looking to spend time away from home
    <li>Theaters are complementary to cinemas. They attract a large number of people, which makes those places suitable for opening new entertainment businesses
</ul>

In [17]:
weight_cinema = -10
weight_restaurant = 1
weight_theater = 2

In [19]:
df_weighted = df_venues[['Barrio']].copy()

<h2>Result</h2>

<h3>Calculate the score of each Barrio using the determined weights</h3>
<p>Using the weights set above, we compute the score of each barrio. 'Almagro' is the barrio with the highest score and is therefore the most suitable barrio in which to open a new cinema.</p>

In [20]:
df_weighted['Score'] = df_venues['Cinemas'] * weight_cinema + df_venues['Restaurants'] * weight_restaurant + df_venues['Theaters'] * weight_theater
df_weighted = df_weighted.sort_values(by=['Score'], ascending=False)
df_weighted.head(10)

Unnamed: 0,Barrio,Score
1,ALMAGRO,80.0
5,BOCA,62.0
38,VILLA GRAL. MITRE,52.0
14,LINIERS,52.0
3,BARRACAS,50.0
37,VILLA DEVOTO,50.0
34,VERSALLES,50.0
25,PATERNAL,47.0
11,CONSTITUCION,46.0
31,SAN NICOLAS,46.0
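
<p>As a check on the table above, the score of a barrio can be written out explicitly. The symbols S, C, R and T are notation introduced here for the score and for the cinema, restaurant and theater counts of barrio b; the numbers for Almagro and Barracas are those shown in the venue-count table.</p>

```latex
\begin{aligned}
S_b &= -10\,C_b + 1\,R_b + 2\,T_b \\
S_{\text{ALMAGRO}} &= -10(1) + 1(50) + 2(20) = 80 \\
S_{\text{BARRACAS}} &= -10(0) + 1(48) + 2(1) = 50
\end{aligned}
```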


<h3>Show a map of Almagro with all the existing venues that are of our interest</h3>

<h4>Create a function to add the analyzed venues to the map</h4>

In [21]:
# function to add markers for given venues to map
def addToMap(df, color, existingMap):
    for lat, lng, local, venue, venueCat in zip(df['Venue Latitude'], df['Venue Longitude'], df['Localidad'], df['Venue'], df['Venue Category']):
        label = '{} ({}) - {}'.format(venue, venueCat, local)
        label = folium.Popup(label, parse_html=True)
        folium.CircleMarker(
            [lat, lng],
            radius=5,
            popup=label,
            color=color,
            fill=True,
            fill_color=color,
            fill_opacity=0.7).add_to(existingMap)

<h4>Show Map</h4>
<p>The map below shows all the venues of interest, to help the client decide which area within the neighborhood to explore for the new cinema.</p>

In [22]:
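winner = df[df['Barrio'] == 'ALMAGRO']

# center the map on the winning barrio rather than on the city center
map_result = folium.Map(location=[winner['Latitude'].iloc[0], winner['Longitude'].iloc[0]], zoom_start=15)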

for lat, lng, local in zip(winner['Latitude'], winner['Longitude'], winner['Barrio']):
    label = '{}'.format(local)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='blue',
        fill_opacity=0.7).add_to(map_result) 

addToMap(ba_cinemas[ba_cinemas['Localidad'] == 'ALMAGRO'], 'red', map_result)
addToMap(ba_restaurants[ba_restaurants['Localidad'] == 'ALMAGRO'], 'green', map_result)
addToMap(ba_theaters[ba_theaters['Localidad'] == 'ALMAGRO'], 'gold', map_result)

map_result

<h2>Discussion</h2>

<p>Under the scoring system defined above, Almagro has the highest score (80) and is therefore the recommended barrio for the new cinema.</p>
<p>However, this analysis can be improved by adding new variables. Some factors are not yet considered, such as proximity to shopping malls or other commercial areas. Parking space might also matter if the cinema won't have its own. It would also be useful to obtain the population of each barrio, since population affects the number of potential customers. Note as well that the restaurant count is 50 in most of the rows shown above, which suggests the search results are capped per request, so restaurants may contribute little to telling barrios apart.</p>

<p>That being said, I consider this first model a good step towards making an informed choice and a basis for looking for a suitable building within the recommended area.</p>

<h2>Conclusion</h2>

<p>The client wanted to find the best barrio in which to build a new cinema. Using Foursquare data on existing venues, we have identified what we believe is the best barrio to pursue. The scoring system also yields other candidate barrios in case additional factors rule out the top choice.</p>

<p>The top-ranked locations may not all turn out to be good choices. Since the weighting scheme is already in place, I can quickly adjust the weights and make the recommendation again.</p>
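
<p>Re-running the recommendation with different weights could look like the sketch below. The function name <code>rank_barrios</code> and the alternative cinema weight of -5 are assumptions chosen for illustration; the counts are the five rows shown in the venue-count table, not the full dataset.</p>

```python
import pandas as pd

# Five rows copied from the venue-count table in this notebook
counts = pd.DataFrame({
    'Barrio': ['AGRONOMIA', 'ALMAGRO', 'BALVANERA', 'BARRACAS', 'BELGRANO'],
    'Cinemas': [1, 1, 12, 0, 12],
    'Restaurants': [50, 50, 50, 48, 50],
    'Theaters': [2, 20, 27, 1, 2],
})

def rank_barrios(df, w_cinema=-10, w_restaurant=1, w_theater=2):
    """Return barrios sorted by weighted score, highest first."""
    scored = df[['Barrio']].copy()
    scored['Score'] = (df['Cinemas'] * w_cinema
                       + df['Restaurants'] * w_restaurant
                       + df['Theaters'] * w_theater)
    return scored.sort_values('Score', ascending=False).reset_index(drop=True)

baseline = rank_barrios(counts)             # weights used in the notebook
softer = rank_barrios(counts, w_cinema=-5)  # hypothetical, milder cinema penalty
print(baseline)
print(softer)
```

<p>Among these five rows, Almagro stays on top under both weight settings, although the gaps between the other barrios change.</p>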