# Capstone Project - The Battle of the Neighborhoods (Week 2)


Applied Data Science Capstone by IBM/Coursera 

## Table of Contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## Introduction: Business Problem <a name="introduction"></a>

In this project we will try to find an optimal location for a restaurant. Specifically, this report will be targeted to stakeholders interested in opening an **Italian restaurant** in **Mexico City**, Mexico.

Since there are many restaurants in Mexico City, we will try to detect **locations that are not already crowded with restaurants**. We are also particularly interested in **areas with no Italian restaurants in the vicinity**. We would also prefer locations **as close to the city center as possible**, assuming that the first two conditions are met.

We will use data science techniques to identify a few of the most promising neighborhoods based on these criteria. The advantages of each area will then be stated clearly so that stakeholders can choose the best possible final location.

This is essential for the results of this final task.

## Data <a name="data"></a>

Based on the definition of our problem, the factors that will influence our decision are:
* number of existing restaurants in the neighborhood (any type of restaurant)
* number of and distance to Italian restaurants in the neighborhood, if any
* distance of neighborhood from city center
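The third factor needs a distance between a candidate location and the city center. A minimal sketch, assuming great-circle (haversine) distance is accurate enough at city scale; the function name and the approximate Zócalo coordinates used as the "city center" are illustrative assumptions, not values taken from this analysis:

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Assumed city center (approximate Zócalo coordinates)
center = (19.4326, -99.1332)
# LOMAS DE REFORMA coordinates from the neighborhood table shown later in this notebook
candidate = (19.401682, -99.235472)

distance_km = haversine_km(*center, *candidate)
print(round(distance_km, 1))
```

The same function could be applied row by row to rank neighborhoods by their distance from whichever center point is finally chosen.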

We decided to use a regularly spaced grid of locations, centered on the city center, to define our neighborhoods.

The following data sources will be needed to extract or generate the required information:
* centers of candidate areas will be generated algorithmically, and approximate addresses of those centers will be obtained using **Google Maps API reverse geocoding**
* the number of restaurants in every neighborhood, together with their type and location, will be obtained using the **Foursquare API**
* the coordinates of the Mexico City center will be obtained using **Google Maps API geocoding** of a well-known Mexico City location
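The regularly spaced grid of candidate centers described above could be generated along these lines. This is only a sketch under stated assumptions: the center coordinates, grid spacing, and extent are illustrative, the function name is hypothetical, and degrees are converted to kilometres with a flat-Earth approximation that is reasonable only at city scale.

```python
import math

def make_grid(center_lat, center_lon, half_width_km=6.0, step_km=1.0):
    """Return (lat, lon) points on a square grid centered on the given point."""
    km_per_deg_lat = 111.32  # approximate length of one degree of latitude
    km_per_deg_lon = 111.32 * math.cos(math.radians(center_lat))
    n = int(half_width_km // step_km)  # number of steps on each side of the center
    points = []
    for i in range(-n, n + 1):
        for j in range(-n, n + 1):
            lat = center_lat + (i * step_km) / km_per_deg_lat
            lon = center_lon + (j * step_km) / km_per_deg_lon
            points.append((lat, lon))
    return points

grid = make_grid(19.4326, -99.1332)  # assumed city-center coordinates
print(len(grid))  # 13 x 13 = 169 candidate centers
```

Each generated point would then be reverse geocoded to obtain an approximate address for the candidate area.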

### Neighborhood Candidates in Mexico City

In [171]:
import random 
import numpy as np 
import pandas as pd 
import json 
import requests 
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
#!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim 
from pandas import json_normalize  # pandas >= 1.0; older versions expose it as pandas.io.json.json_normalize
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib.colors as colors
#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium 
#!conda install -c conda-forge urllib2 --yes 
from sklearn.cluster import KMeans
print('All libraries were imported successfully.')

All libraries were imported successfully.


In [172]:
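# Raw string so the backslashes in the Windows path are not treated as escape sequences
df = pd.read_csv(r'D:\curso Coursera\Curso 9 Proyecto Final\Semana 5\Capstone Project The Battle of Neighborhood Part 2\cdmx_Nuevo.csv')
df = df.drop(columns=['ENTIDAD', 'CVE_ALC', 'CVE_COL', 'SECC_COM', 'SECC_PAR', 'Geo Shape'])
# Split "lat,lon" into two columns (the values are still strings at this point)
df[['latitud', 'longitud']] = df['Geo Point'].str.split(',', n=1, expand=True)
df = df.drop(columns=['Geo Point'])
# Names are assigned by position; check df.head() to confirm each label matches its column
column_names = ['Borough', 'Neighbourhood', 'Latitude', 'Longitude']
df.columns = column_names
# Drop rows that contain any missing value
df = df.dropna()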
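The coordinates produced by the split above remain strings, which is why `describe()` reports `unique` and `top` for them instead of numeric statistics. A sketch of a numeric conversion on toy data (the toy frame and its variable names are made up for illustration):

```python
import pandas as pd

# Toy frame mimicking the 'Geo Point' column ("lat, lon" stored as one string)
toy = pd.DataFrame({'Geo Point': ['19.4016815485, -99.2354719599',
                                  '19.409218, -99.193839']})

# Split once on the comma, strip stray spaces, then convert to floats
parts = toy['Geo Point'].str.split(',', n=1, expand=True)
toy['Latitude'] = pd.to_numeric(parts[0].str.strip())
toy['Longitude'] = pd.to_numeric(parts[1].str.strip())

print(toy.dtypes)
```

Numeric columns also avoid the stray space that otherwise ends up between the two coordinates when they are formatted into a request URL.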

In [173]:
df.describe()

|  | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|
| count | 1808 | 1808 | 1808.0 | 1808.0 |
| unique | 1739 | 16 | 1808.0 | 1808.0 |
| top | MIGUEL HIDALGO | IZTAPALAPA | 19.2701801068 | -99.1358247041 |
| freq | 4 | 293 | 1.0 | 1.0 |


In [174]:
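# Map center: approximate Zócalo coordinates, assumed here because `latitude` and `longitude` are not defined earlier
latitude, longitude = 19.4326, -99.1332
map_neigb_mexicocity = folium.Map(location=[latitude, longitude], zoom_start=11)

In [175]:
# Add one circle marker per row, labeled with the neighbourhood name
for lat, lng, label in zip(df['Latitude'], df['Longitude'], df['Neighbourhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_neigb_mexicocity)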

In [176]:
map_neigb_mexicocity.save("map_neigb_mexicocity.html")
map_neigb_mexicocity

In [177]:
neighborhoods = pd.DataFrame(df, columns=column_names)
neighborhoods.head(10)

|  | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|
| 0 | MIGUEL HIDALGO | LOMAS DE REFORMA | 19.401682 | -99.235472 |
| 1 | MIGUEL HIDALGO | DANIEL GARZA (AMPL) | 19.409218 | -99.193839 |
| 2 | MIGUEL HIDALGO | IGNACIO MANUEL ALTAMIRANO | 19.463144 | -99.196828 |
| 3 | MIGUEL HIDALGO | LEGARIA | 19.455531 | -99.193048 |
| 4 | MIGUEL HIDALGO | LEGARIA (U HAB) | 19.45002 | -99.201076 |
| 5 | VENUSTIANO CARRANZA | ADOLFO LOPEZ MATEOS | 19.420052 | -99.072184 |
| 6 | COYOACAN | ADOLFO RUIZ CORTINES I | 19.320967 | -99.147393 |
| 7 | COYOACAN | PEDREGAL DE STO DOMINGO III | 19.331433 | -99.164253 |
| 8 | COYOACAN | PASEOS DE TAXQUEA I | 19.348256 | -99.123511 |
| 9 | VENUSTIANO CARRANZA | PROGRESISTA | 19.436092 | -99.109345 |


Define Foursquare Credentials and Version

In [178]:
CLIENT_ID = 'WYAEZZVT00PMD0KJIHOFRGBAQXFI0UMRTDKQWGLKABOWBVL0' # your Foursquare ID
CLIENT_SECRET = 'EEMR5CG1PFXABI4PNF1CPGFNIBQ3VAO4U1LJ230KAPJZIV1L' # your Foursquare Secret
VERSION = '20180605' 

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: WYAEZZVT00PMD0KJIHOFRGBAQXFI0UMRTDKQWGLKABOWBVL0
CLIENT_SECRET:EEMR5CG1PFXABI4PNF1CPGFNIBQ3VAO4U1LJ230KAPJZIV1L


In [179]:
neighborhood_lat = df.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_long = df.loc[0, 'Longitude'] # neighborhood longitude value

In [214]:
neighborhood_name = df.loc[0, 'Neighbourhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_lat, 
                                                               neighborhood_long))

Latitude and longitude values of MIGUEL HIDALGO are 19.4016815485,  -99.2354719599.


In [182]:
LIMIT = 100 # limit of number of venues returned by Foursquare API

radius = 500 # define radius

# create URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_lat, 
    neighborhood_long, 
    radius, 
    LIMIT)
print('This is the URL:', url) # display URL

This is the URL: https://api.foursquare.com/v2/venues/explore?&client_id=WYAEZZVT00PMD0KJIHOFRGBAQXFI0UMRTDKQWGLKABOWBVL0&client_secret=EEMR5CG1PFXABI4PNF1CPGFNIBQ3VAO4U1LJ230KAPJZIV1L&v=20180605&ll=19.4016815485, -99.2354719599&radius=500&limit=100
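The printed URL contains a space after the comma in `ll`, because the longitude string kept the leading space from the original "lat, lon" text. A sketch of building the same query with the standard library's `urlencode`, which also percent-encodes the values; the helper name and the placeholder credentials are assumptions made for illustration:

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=100):
    """Build a venues/explore URL; whitespace around the coordinates is stripped."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(str(lat).strip(), str(lng).strip()),
        'radius': radius,
        'limit': limit,
    }
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params)

# Placeholder credentials, for illustration only
example_url = build_explore_url('ID', 'SECRET', '20180605',
                                '19.4016815485', ' -99.2354719599')
print(example_url)
```

The resulting string can be passed to `requests.get` in the same way as the hand-formatted URL.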


In [216]:
results = requests.get(url).json()

# Flatten the response and keep a CSV copy for later inspection
normalized_df = json_normalize(results)
normalized_df.to_csv('my_csv_file.csv', index=False)

results

{'meta': {'code': 200, 'requestId': '5dc8af98f96b2c002c1c584e'},
 'response': {'headerLocation': 'Miguel Hidalgo',
  'headerFullLocation': 'Miguel Hidalgo, Mexico City',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 4,
  'suggestedBounds': {'ne': {'lat': 19.406181553000007,
    'lng': -99.23070993593554},
   'sw': {'lat': 19.397181543999995, 'lng': -99.24023398386446}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4e5c29b41495cac41936f7dd',
       'name': 'Flores Del Bosque',
       'location': {'lat': 19.40066209833531,
        'lng': -99.23317805788683,
        'labeledLatLngs': [{'label': 'display',
          'lat': 19.40066209833531,
          'lng': -99.23317805788683}],
        'distance': 266,
        'cc': 'MX',
        'city': 'Ciudad de Mé

In [219]:
# Save the raw response; the raw string keeps the backslashes in the Windows path literal
with open(r'D:\curso Coursera\Curso 9 Proyecto Final\Semana 5\Capstone Project The Battle of Neighborhood Part 2\cdmx02.json', 'w') as json_file:
    json.dump(results, json_file)



In [196]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [220]:
#Now we are ready to clean the json and structure it into a pandas dataframe.
venues = results['response']['groups'][0]['items']

In [221]:
nearby_venues = json_normalize(venues) # flatten JSON

In [222]:
# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

In [223]:
# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

In [224]:
# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

print('Nearest venues', nearby_venues.head())

Nearest venues                 name     categories        lat        lng
0  Flores Del Bosque    Flower Shop  19.400662 -99.233178
1         K-vitacion            Spa  19.402467 -99.232648
2          Nine West     Shoe Store  19.403501 -99.238881
3  Plaza de la Radio  Historic Site  19.405568 -99.233329


In [225]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

4 venues were returned by Foursquare.


In [226]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues

In [238]:
cdmx_venues = getNearbyVenues(names=df['Neighbourhood'],latitudes=df['Latitude'], longitudes=df['Longitude'])    

MIGUEL HIDALGO


KeyError: 'groups'
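A `KeyError: 'groups'` is raised when the decoded reply has no `groups` entry under `response`. One plausible cause (an assumption, since the failing payload is not shown) is an error reply such as an exceeded quota, where `meta.code` is not 200. A defensive sketch, with a hypothetical helper name, that returns an empty list instead of raising:

```python
def extract_items(payload):
    """Return the venue items from an explore response, or [] if none are present."""
    groups = payload.get('response', {}).get('groups', [])
    return groups[0].get('items', []) if groups else []

# A well-formed reply, trimmed to the keys used here
ok = {'meta': {'code': 200},
      'response': {'groups': [{'items': [{'venue': {'name': 'Flores Del Bosque'}}]}]}}

# Made-up error payload, for illustration only
failed = {'meta': {'code': 429}, 'response': {}}

print(len(extract_items(ok)), len(extract_items(failed)))
```

Inside `getNearbyVenues`, a helper like this could replace the direct `["response"]['groups'][0]['items']` lookup so that one failed request does not abort the whole loop. Logging `meta` for the failed calls would show the actual reason.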

In [236]:
results['response']['groups'][0]['items']
results['response']
# check the size of the resulting dataframe
print(cdmx_venues.shape)
print(cdmx_venues.head())

NameError: name 'cdmx_venues' is not defined

Check how many venues were returned for each neighborhood

In [233]:
print(cdmx_venues.groupby('Neighborhood').count())

NameError: name 'cdmx_venues' is not defined

Find out how many unique categories can be curated from all the returned venues

In [None]:
print('There are {} unique categories.'.format(len(cdmx_venues['Venue Category'].unique())))

Analyze Each Neighborhood

In [234]:
cdmx_onehot = pd.get_dummies(cdmx_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
cdmx_onehot['Neighborhood'] = cdmx_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [cdmx_onehot.columns[-1]] + list(cdmx_onehot.columns[:-1])
cdmx_onehot = cdmx_onehot[fixed_columns]

print(cdmx_onehot.head())

NameError: name 'cdmx_venues' is not defined

Examine the new dataframe size.

In [235]:
print(cdmx_onehot.shape)

NameError: name 'cdmx_onehot' is not defined

Group rows by neighborhood, taking the mean of the frequency of occurrence of each category

In [None]:
cdmx_grouped = cdmx_onehot.groupby('Neighborhood').mean().reset_index()
cdmx_grouped
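To make the grouping step above concrete, here is a sketch on made-up data (the toy neighborhoods and categories are not from the Foursquare results). After one-hot encoding, the mean of each column per neighborhood is the share of that neighborhood's venues falling in each category:

```python
import pandas as pd

# Toy venue list: neighborhood A has three venues, B has one
toy_venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'B'],
    'Venue Category': ['Italian Restaurant', 'Spa', 'Spa', 'Italian Restaurant'],
})

# One column per category, 1.0 where the venue belongs to it
onehot = pd.get_dummies(toy_venues[['Venue Category']], prefix="", prefix_sep="", dtype=float)
onehot.insert(0, 'Neighborhood', toy_venues['Neighborhood'])

# Mean per neighborhood = relative frequency of each category
toy_grouped = onehot.groupby('Neighborhood').mean().reset_index()
print(toy_grouped)
```

In this toy case, two of A's three venues are spas, so its `Spa` frequency is 2/3, while B's only venue is an Italian restaurant.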

Confirming the new size of the table

In [None]:
cdmx_grouped.shape

Let's print each neighborhood along with its top 5 most common venues

In [None]:
num_top_venues = 5

for hood in cdmx_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = cdmx_grouped[cdmx_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

Let's put that into a pandas dataframe

In [None]:
# First, let's write a function to sort the venues in descending order
    
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]   
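A small usage sketch for the function above, on a made-up row shaped like one row of `cdmx_grouped` (first entry is the neighborhood name, the rest are category frequencies). The function is repeated here only so that the example runs on its own:

```python
import pandas as pd

def return_most_common_venues(row, num_top_venues):
    # Same logic as the function above: skip the name, sort frequencies descending
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

# Toy row with illustrative frequencies
toy_row = pd.Series({'Neighborhood': 'A',
                     'Italian Restaurant': 0.2,
                     'Spa': 0.5,
                     'Flower Shop': 0.3})

top_two = return_most_common_venues(toy_row, 2)
print(list(top_two))
```

The returned array holds category names, not frequencies, which is what gets written into the "Most Common Venue" columns.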

Let's create the new dataframe and display the top venues for each neighborhood (the number of venues is set by `num_top_venues`).

In [None]:
num_top_venues = 2

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)

neighborhoods_venues_sorted['Neighborhood'] = cdmx_grouped['Neighborhood']
print('sort', neighborhoods_venues_sorted)
for ind in np.arange(cdmx_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(cdmx_grouped.iloc[ind, :], num_top_venues)

print(neighborhoods_venues_sorted.head(7))

Cluster Neighborhoods

Run k-means to cluster the neighborhoods into 5 clusters

In [None]:
# set number of clusters
kclusters = 5

cdmx_grouped_clustering = cdmx_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(cdmx_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_
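To illustrate what the labels mean, here is a sketch of the same clustering call on a made-up frequency matrix with two obvious groups. The values and variable names are illustrative, and `n_init` is set explicitly only to keep the behavior the same across scikit-learn versions:

```python
import numpy as np
from sklearn.cluster import KMeans

# Rows are neighborhoods, columns are category frequencies (toy values)
toy_freqs = np.array([[0.9, 0.1],
                      [0.8, 0.2],
                      [0.1, 0.9],
                      [0.2, 0.8]])

toy_kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(toy_freqs)
toy_labels = toy_kmeans.labels_
print(toy_labels)
```

The first two rows end up in one cluster and the last two in the other. The label numbers themselves are arbitrary, so only the grouping is meaningful.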

Create a new dataframe that includes the cluster as well as the top venues for each neighborhood

In [None]:
# Add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
cdmx_merged = df

# merge the sorted venues with df to add latitude/longitude for each neighborhood
cdmx_merged = cdmx_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighbourhood')

# drop neighborhoods without venues (no cluster label); int(cluster) would fail on NaN when mapping
cdmx_merged = cdmx_merged.dropna(subset=['Cluster Labels']).reset_index(drop=True)

cdmx_merged

Visualize the resulting clusters

In [None]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(cdmx_merged['Latitude'], cdmx_merged['Longitude'], cdmx_merged['Neighbourhood'], cdmx_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    cluster = int(cluster)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters.save("map_clusters_mexicocity.html")
map_clusters

Examine Clusters

1st Cluster

In [None]:
cdmx_merged.loc[cdmx_merged['Cluster Labels'] == 0, cdmx_merged.columns[[1] + list(range(5, cdmx_merged.shape[1]))]]

2nd Cluster

In [None]:
cdmx_merged.loc[cdmx_merged['Cluster Labels'] == 1, cdmx_merged.columns[[1] + list(range(5, cdmx_merged.shape[1]))]]

3rd Cluster

In [None]:
cdmx_merged.loc[cdmx_merged['Cluster Labels'] == 2, cdmx_merged.columns[[1] + list(range(5, cdmx_merged.shape[1]))]]

4th Cluster

In [None]:
cdmx_merged.loc[cdmx_merged['Cluster Labels'] == 3, cdmx_merged.columns[[1] + list(range(5, cdmx_merged.shape[1]))]]

5th Cluster

In [None]:
cdmx_merged.loc[cdmx_merged['Cluster Labels'] == 4, cdmx_merged.columns[[1] + list(range(5, cdmx_merged.shape[1]))]]