# Coursera Capstone Project
Alejandro Rojas

## Introduction/Business Problem
Imagine we want to create an app that recommends neighborhoods based on specific requirements such as population, number of employers, or types of venues. For example, someone who wants to open a new Italian restaurant might look for a high population, few existing restaurants (so less competition), and a decent number of parks that bring in customers when people go out for a walk.

Several apps already offer recommendations about places within a city's neighborhoods (Nextdoor, Google Maps, even Foursquare). But users who are moving to a new city often have little or no idea about the area where they will live for the next few months or even years. We want to help them select a neighborhood, or at least get a short list of neighborhoods that meet some of their requirements, whether they want to start a business, live, invest, or anything else.

## Data
For this initial approach we will use Toronto neighborhood data we found on Kaggle: https://www.kaggle.com/youssef19/toronto-neighborhoods-inforamtion, which is built from data on Toronto's open data portal. It provides information for each neighborhood, such as:
- Population
- Number of educated people
- Population with ages between 15 and 45
- Number of employers 
- Coordinates, which will enable us to query the Foursquare API.

Next, after adding venue information from Foursquare, we can group the venues so that instead of 200+ categories we work with a few popular ones: restaurants, gyms, parks, banks, etc.
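
The grouping idea can be sketched as below. This is only an illustration: the `GROUP_KEYWORDS` mapping and the `group_category` helper are hypothetical names, the keywords are guesses, and the toy rows simply mimic the venue table built later in this notebook. The real grouping would be chosen after looking at the categories Foursquare actually returns.

```python
import pandas as pd

# Hypothetical keyword-to-group mapping (an assumption, not the final grouping).
GROUP_KEYWORDS = {
    "Restaurant": ["restaurant", "sandwich", "pizza", "diner"],
    "Gym": ["gym", "fitness"],
    "Park": ["park", "playground"],
    "Bank": ["bank"],
}


def group_category(category: str) -> str:
    """Map a detailed venue category to a broad group, or 'Other'."""
    lowered = category.lower()
    for group, keywords in GROUP_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return group
    return "Other"


# Toy rows shaped like the venue table (Neighborhood, Venue Category).
venues = pd.DataFrame({
    "Neighborhood": ["Agincourt North"] * 4 + ["Agincourt South-Malvern West"],
    "Venue Category": ["Sandwich Place", "Gas Station", "Park",
                       "Asian Restaurant", "Beer Store"],
})
venues["Venue Group"] = venues["Venue Category"].apply(group_category)

# One row per neighborhood, one column per group, values are venue counts.
group_counts = pd.crosstab(venues["Neighborhood"], venues["Venue Group"])
print(group_counts)
```

Plain substring matching is crude (for instance, "Parking" would also land in the Park group), so an explicit category-to-group dictionary may be a better fit once the full category list is known.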

Finally, we can append crime information so users can see it in the app's final results. We imagine that almost anyone would prefer a place with no crime, so these columns could help to select one final place after receiving multiple results from the app. This information will be obtained from https://open.toronto.ca/dataset/neighbourhood-crime-rates/.

In the end we should have a dataframe with one row per neighborhood that contains not only the population data but also venue and crime data.
With that we could build a recommender system. For this initial prototype we could use a content-based model, where a user enters the characteristics of their desired neighborhood and we recommend the closest ones that match the input.
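
As a rough idea of what "closest" could mean, here is a minimal sketch that standardizes the chosen columns and ranks neighborhoods by Euclidean distance to the requested profile. The `recommend` function, the choice of distance, and the `desired` values are assumptions made for illustration; the feature values are the five rows displayed later in this notebook.

```python
import pandas as pd


def recommend(features: pd.DataFrame, desired: dict, top_n: int = 3) -> pd.Series:
    """Rank neighborhoods by distance to the desired profile (smaller = closer)."""
    cols = list(desired)
    # Standardize each column so large-valued features such as population
    # do not dominate the distance.
    mean = features[cols].mean()
    std = features[cols].std(ddof=0).replace(0, 1)
    scaled = (features[cols] - mean) / std
    target = (pd.Series(desired) - mean) / std
    distance = ((scaled - target) ** 2).sum(axis=1) ** 0.5
    return distance.nsmallest(top_n)


# Five neighborhoods with three of the columns used in this notebook.
features = pd.DataFrame(
    {
        "Total population": [30280.0, 21990.0, 11900.0, 29180.0, 26910.0],
        "number_venues": [26.0, 34.0, 17.0, 63.0, 14.0],
        "Assault_Rate2020": [218.2301, 481.6464, 279.414, 792.9642, 266.1451],
    },
    index=["Agincourt North", "Agincourt South-Malvern West", "Alderwood",
           "Annex", "Banbury-Don Mills"],
)

# Hypothetical request: fairly populated, few venues, low assault rate.
desired = {"Total population": 27000, "number_venues": 14, "Assault_Rate2020": 250}
ranking = recommend(features, desired)
print(ranking)
```

Other similarity measures (cosine similarity, weighted distances) would work the same way; the standardization step matters mostly because population counts and venue counts live on very different scales.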

In [52]:
df_crime['Neighbourhood'].value_counts()

The Beaches                          1
Wexford/Maryvale                     1
Morningside                          1
Pleasant View                        1
Beechborough-Greenbrook              1
                                    ..
Henry Farm                           1
Runnymede-Bloor West Village         1
Broadview North                      1
Playter Estates-Danforth             1
Bridle Path-Sunnybrook-York Mills    1
Name: Neighbourhood, Length: 140, dtype: int64

In [1]:
import pandas as pd
import json 
from geopy.geocoders import Nominatim 
import requests
from pandas import json_normalize  # the pandas.io.json import path is deprecated since pandas 1.0
import matplotlib.cm as cm
import matplotlib.colors as colors
import folium 

In [39]:
df_toronto = pd.read_csv('../data/Toronto_neighborhood_demographics_geographics_venues.csv')

In [40]:
df_toronto.head()

Unnamed: 0,Neighborhood,Total population,number of educated people,number of 15-45,number of employers,long_latt,number_gyms,number_venues
0,Agincourt North,30280.0,19805.0,11850.0,13230.0,"[-79.2816161258827, 43.797405754163]",0.0,26.0
1,Agincourt South-Malvern West,21990.0,14535.0,8840.0,9860.0,"[-79.2891688527481, 43.7851873380096]",0.0,34.0
2,Alderwood,11900.0,7915.0,4520.0,6240.0,"[-79.5532040267975, 43.5954996876866]",1.0,17.0
3,Annex,29180.0,23495.0,15095.0,16770.0,"[-79.4121466573202, 43.6744312990078]",3.0,63.0
4,Banbury-Don Mills,26910.0,20555.0,9615.0,13030.0,"[-79.326504539789, 43.7325704244428]",2.0,14.0


We will repeat some of the steps from the segmentation and clustering workshop so that we can plot our data =).

First we need to create the Latitude and Longitude columns!

In [41]:
import ast

# 'long_latt' is read from the CSV as a string like "[-79.28, 43.79]"; parse it into a list
df_toronto['long_latt'] = df_toronto['long_latt'].apply(ast.literal_eval)

# Each pair is stored as [longitude, latitude]; split it once into two columns
coords = pd.DataFrame(df_toronto['long_latt'].tolist(), index=df_toronto.index)
df_toronto['Longitude'] = coords[0]
df_toronto['Latitude'] = coords[1]

In [42]:
df_toronto.head()

Unnamed: 0,Neighborhood,Total population,number of educated people,number of 15-45,number of employers,long_latt,number_gyms,number_venues,Longitude,Latitude
0,Agincourt North,30280.0,19805.0,11850.0,13230.0,"[-79.2816161258827, 43.797405754163]",0.0,26.0,-79.281616,43.797406
1,Agincourt South-Malvern West,21990.0,14535.0,8840.0,9860.0,"[-79.2891688527481, 43.7851873380096]",0.0,34.0,-79.289169,43.785187
2,Alderwood,11900.0,7915.0,4520.0,6240.0,"[-79.5532040267975, 43.5954996876866]",1.0,17.0,-79.553204,43.5955
3,Annex,29180.0,23495.0,15095.0,16770.0,"[-79.4121466573202, 43.6744312990078]",3.0,63.0,-79.412147,43.674431
4,Banbury-Don Mills,26910.0,20555.0,9615.0,13030.0,"[-79.326504539789, 43.7325704244428]",2.0,14.0,-79.326505,43.73257


Good, we have them now. Let's get the city's central coordinates and plot a map of our neighborhoods.

In [5]:
address = 'Toronto, Ontario'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


In [44]:
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, neighborhood in zip(df_toronto['Latitude'], df_toronto['Longitude'], df_toronto['Neighborhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

Cool, I like this dataset even more because we have a lot more neighborhoods! Woohoo, we can probably get cooler information, or at least different results. Let's go ahead.

In [45]:
with open('../../api_credentials.json') as json_data:
    api_credentials = json.load(json_data)

In [46]:
CLIENT_ID = api_credentials['CLIENT_ID']
CLIENT_SECRET = api_credentials['CLIENT_SECRET']
#ACCESS_TOKEN = api_credentials['ACCESS_TOKEN']
VERSION = api_credentials['VERSION']
LIMIT = 100

In [47]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            radius,
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
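
The function above indexes straight into `["response"]['groups'][0]['items']` and `['categories'][0]`, so a response with no groups or a venue with an empty category list would raise an error. A minimal defensive sketch follows, assuming the response shape implied by the keys accessed above. The helper names `items_from_response` and `extract_venue_rows` are hypothetical, and the payload is a hand-written stand-in, not real API output.

```python
def items_from_response(payload):
    """Return the items of the first group, or [] if the response has none."""
    groups = payload.get("response", {}).get("groups") or []
    return groups[0].get("items", []) if groups else []


def extract_venue_rows(name, lat, lng, items):
    """Flatten explore-style items into tuples matching the venue columns."""
    rows = []
    for item in items:
        venue = item.get("venue", {})
        location = venue.get("location", {})
        categories = venue.get("categories") or []
        # Fall back to None instead of raising IndexError on an empty list.
        category = categories[0]["name"] if categories else None
        rows.append((name, lat, lng, venue.get("name"),
                     location.get("lat"), location.get("lng"), category))
    return rows


# Hand-written stand-in payload (an assumption about the response shape).
payload = {"response": {"groups": [{"items": [
    {"venue": {"name": "Subway",
               "location": {"lat": 43.797503, "lng": -79.282181},
               "categories": [{"name": "Sandwich Place"}]}},
    {"venue": {"name": "Venue without category",
               "location": {"lat": 43.8, "lng": -79.28},
               "categories": []}},
]}]}}

rows = extract_venue_rows("Agincourt North", 43.797406, -79.281616,
                          items_from_response(payload))
print(rows)
```

Inside the loop, `requests.get(url).json()` could be passed to `items_from_response` and the result appended to `venues_list` as before.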

In [48]:
toronto_venues = getNearbyVenues(names=df_toronto['Neighborhood'],
                                   latitudes=df_toronto['Latitude'],
                                   longitudes=df_toronto['Longitude']
                                  )

Agincourt North
Agincourt South-Malvern West
Alderwood
Annex
Banbury-Don Mills
Bathurst Manor
Bay Street Corridor
Bayview Village
Bayview Woods-Steeles
Bedford Park-Nortown
Beechborough-Greenbrook
Bendale
Birchcliffe-Cliffside
Black Creek
Blake-Jones
Briar Hill-Belgravia
Bridle Path-Sunnybrook-York Mills
Broadview North
Brookhaven-Amesbury
Cabbagetown-South St. James Town
Caledonia-Fairbank
Casa Loma
Centennial Scarborough
Church-Yonge Corridor
Clairlea-Birchmount
Clanton Park
Cliffcrest
Corso Italia-Davenport
Danforth
Danforth East York
Don Valley Village
Dorset Park
Dovercourt-Wallace Emerson-Junction
Downsview-Roding-CFB
Dufferin Grove
East End-Danforth
Edenbridge-Humber Valley
Eglinton East
Elms-Old Rexdale
Englemount-Lawrence
Eringate-Centennial-West Deane
Etobicoke West Mall
Flemingdon Park
Forest Hill North
Forest Hill South
Glenfield-Jane Heights
Greenwood-Coxwell
Guildwood
Henry Farm
High Park North
High Park-Swansea
Highland Creek
Hillcrest Village
Humber Heights-Westmount
Hu

Now that we have the venues, let's add some of the crime variables to our dataframe.

In [53]:
df_crime = pd.read_csv('../data/neighbourhood-crime-rates.csv')

In [56]:
df_toronto = pd.merge(df_toronto, 
         df_crime[['Neighbourhood','Assault_Rate2020', 'AutoTheft_Rate2020', 'Robbery_Rate2020', 'Homicide_Rate2020']], 
         left_on = 'Neighborhood', right_on = 'Neighbourhood', how = 'left')
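
Because this is a left join on neighborhood names coming from two different sources, any spelling difference silently produces missing crime values. One way to check for that is pandas' `indicator=True` option, sketched here on toy frames. The mismatched name and its rate are made up for illustration; whether the real files contain such mismatches is not verified here.

```python
import pandas as pd

# Toy stand-ins for df_toronto and df_crime; the third name differs on purpose.
left = pd.DataFrame({"Neighborhood": ["Annex", "Alderwood", "Danforth East York"]})
right = pd.DataFrame({
    "Neighbourhood": ["Annex", "Alderwood", "Danforth-East York"],
    "Assault_Rate2020": [792.9642, 279.414, 100.0],  # last value is made up
})

merged = left.merge(right, left_on="Neighborhood", right_on="Neighbourhood",
                    how="left", indicator=True)

# Rows flagged 'left_only' found no matching name in the crime table.
unmatched = merged.loc[merged["_merge"] == "left_only", "Neighborhood"].tolist()
print(unmatched)

# Drop the helper column and the duplicated key once the check is done.
merged = merged.drop(columns=["_merge", "Neighbourhood"])
```

Dropping the duplicated `Neighbourhood` key is optional; it only keeps the merged dataframe from carrying the same name twice.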

In [57]:
df_toronto.head()

Unnamed: 0,Neighborhood,Total population,number of educated people,number of 15-45,number of employers,long_latt,number_gyms,number_venues,Longitude,Latitude,Neighbourhood,Assault_Rate2020,AutoTheft_Rate2020,Robbery_Rate2020,Homicide_Rate2020
0,Agincourt North,30280.0,19805.0,11850.0,13230.0,"[-79.2816161258827, 43.797405754163]",0.0,26.0,-79.281616,43.797406,Agincourt North,218.2301,183.4398,53.76684,0.0
1,Agincourt South-Malvern West,21990.0,14535.0,8840.0,9860.0,"[-79.2891688527481, 43.7851873380096]",0.0,34.0,-79.289169,43.785187,Agincourt South-Malvern West,481.6464,229.8767,65.67905,7.297672
2,Alderwood,11900.0,7915.0,4520.0,6240.0,"[-79.5532040267975, 43.5954996876866]",1.0,17.0,-79.553204,43.5955,Alderwood,279.414,211.4484,90.62075,0.0
3,Annex,29180.0,23495.0,15095.0,16770.0,"[-79.4121466573202, 43.6744312990078]",3.0,63.0,-79.412147,43.674431,Annex,792.9642,106.6897,126.8743,0.0
4,Banbury-Don Mills,26910.0,20555.0,9615.0,13030.0,"[-79.326504539789, 43.7325704244428]",2.0,14.0,-79.326505,43.73257,Banbury-Don Mills,266.1451,157.1218,22.44597,0.0


In [59]:
toronto_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Agincourt North,43.797406,-79.281616,Subway,43.797503,-79.282181,Sandwich Place
1,Agincourt North,43.797406,-79.281616,Shell,43.797544,-79.283345,Gas Station
2,Agincourt North,43.797406,-79.281616,Havendale Park,43.79601,-79.284398,Park
3,Agincourt North,43.797406,-79.281616,凱聲,43.801003,-79.283363,Asian Restaurant
4,Agincourt South-Malvern West,43.785187,-79.289169,The Beer Store,43.785016,-79.289861,Beer Store


Those variables should be enough. The venue category separation will be done in the week 2 notebook!

Thank you for taking the time to read this notebook and I wish you the best of luck on the rest of the course =)

In [58]:
df_toronto.to_csv('../data/df_toronto.csv')
toronto_venues.to_csv('../data/toronto_venues.csv')