# Analysis of Toronto neighborhoods using Machine Learning

## Introduction

## Table of Contents

1.  [Import necessary libraries](#libraries)
2.  [Data collection and cleaning](#data)

    2.1.  [Get name, boundaries and coordinates of each neighborhood in the city of Toronto](#neighborhood)
    
    2.2.  [Get the boundaries of the city of Toronto neighborhoods](#boundaries)
    
    2.4.  [Get socioeconomic data of each neighborhood](#socioeconomic)
    
    2.5.  [Get the number of existing vegan/vegetarian restaurants in each neighborhood](#restaurants)
    
    2.6.  [Get the number of existing farmer's markets in each neighborhood](#market)
         
3.  [Data Exploration](#explore)   
4.  [Machine Learning - Clustering with k-means](#cluster)
5.  [Visualize and examine the final clusters](#examine)


## 1. Import necessary libraries<a name='libraries'></a>

Let's first import all necessary Python libraries.

In [4]:
# library to handle data in a vectorized manner
import numpy as np 

# library for data analysis
import pandas as pd 

# library to handle JSON files
import json 

# convert an address into latitude and longitude values
#!conda install -c conda-forge geocoder --yes
from geopy.geocoders import Nominatim
import geocoder

# library to work with geospatial data
#!conda install -c conda-forge geopandas --yes
import geopandas as gpd

# library to handle requests
#!conda install -c conda-forge requests --yes
import requests 
# transform JSON data into a pandas dataframe
# (pandas >= 1.0; older versions exposed this as pandas.io.json.json_normalize)
from pandas import json_normalize

# Matplotlib and associated plotting modules
#!conda install -c conda-forge matplotlib --yes
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means for the clustering stage
#!conda install -c conda-forge scikit-learn --yes
from sklearn.cluster import KMeans

# map rendering library
#!conda install -c conda-forge folium --yes
import folium 

# library for scraping websites
#!conda install -c anaconda beautifulsoup4
from bs4 import BeautifulSoup

# Library for plotting
import matplotlib.pyplot as plt

print('Libraries imported.')

Libraries imported.


## 2. Collect and clean data <a name='data'></a> 

Now, we must collect all necessary data to perform our analysis. 

### 2.1. Get name, ID number and socioeconomic data for each neighborhood <a name='neighborhood'></a> 

Let's first read the .csv file with the 2016 neighbourhood profiles.

In [12]:
# Read dataset
nghb_profiles = pd.read_csv("neighbourhood-profiles-2016-csv.csv")
nghb_profiles.head()

Unnamed: 0,_id,Category,Topic,Data Source,Characteristic,City of Toronto,Agincourt North,Agincourt South-Malvern West,Alderwood,Annex,...,Willowdale West,Willowridge-Martingrove-Richview,Woburn,Woodbine Corridor,Woodbine-Lumsden,Wychwood,Yonge-Eglinton,Yonge-St.Clair,York University Heights,Yorkdale-Glen Park
0,1,Neighbourhood Information,Neighbourhood Information,City of Toronto,Neighbourhood Number,,129,128,20,95,...,37,7,137,64,60,94,100,97,27,31
1,2,Neighbourhood Information,Neighbourhood Information,City of Toronto,TSNS2020 Designation,,No Designation,No Designation,No Designation,No Designation,...,No Designation,No Designation,NIA,No Designation,No Designation,No Designation,No Designation,No Designation,NIA,Emerging Neighbourhood
2,3,Population,Population and dwellings,Census Profile 98-316-X2016001,"Population, 2016",2731571,29113,23757,12054,30526,...,16936,22156,53485,12541,7865,14349,11817,12528,27593,14804
3,4,Population,Population and dwellings,Census Profile 98-316-X2016001,"Population, 2011",2615060,30279,21988,11904,29177,...,15004,21343,53350,11703,7826,13986,10578,11652,27713,14687
4,5,Population,Population and dwellings,Census Profile 98-316-X2016001,Population Change 2011-2016,4.50%,-3.90%,8.00%,1.30%,4.60%,...,12.90%,3.80%,0.30%,7.20%,0.50%,2.60%,11.70%,7.50%,-0.40%,0.80%
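
Note that the file is laid out "sideways": each row is a characteristic and each neighbourhood is a column. As a minimal sketch of how that layout can be turned into one row per neighbourhood, here is a toy frame built from a few values in the preview above. The names `wide` and `tidy` are only illustrative and are not used elsewhere in this notebook.

```python
import pandas as pd

# Toy stand-in for the profile table: one row per characteristic,
# one column per neighbourhood (values copied from the preview above).
wide = pd.DataFrame({
    "Characteristic": ["Neighbourhood Number", "Population, 2016"],
    "Agincourt North": ["129", "29113"],
    "Alderwood": ["20", "12054"],
    "Annex": ["95", "30526"],
})

# Index by characteristic, then transpose so each neighbourhood becomes a row.
tidy = wide.set_index("Characteristic").T
tidy.index.name = "Neighborhood"

# Every cell is still a string at this point, so convert explicitly.
tidy = tidy.astype(int)
print(tidy)
```

The real file mixes counts, percentages and text labels, so a blanket `astype(int)` would probably fail there; that is one reason to pick rows one by one.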


We are going to create a new dataframe, and populate it with the data of interest for each of the 140 neighborhoods. 

In [28]:
# Create empty dataframe with column names
COLUMN_NAMES = ["ID", "Neighborhoods", "Population density", "Population 15 to 54", "Average Income"]  
df = pd.DataFrame(columns=COLUMN_NAMES)

# Extracting name of neighborhoods from vector of column names 
df["Neighborhoods"] = nghb_profiles.columns[6:]

# Extracting neighborhood number from first row
df["ID"] = np.array(nghb_profiles.iloc[0,6:].values, dtype = 'int')

# Extracting population density from row
pop_density = nghb_profiles[nghb_profiles['Characteristic'] == "Population density per square kilometre"].values[0][6:]
df["Population density"] =  np.array([x.replace(',', '') for x in pop_density], dtype=int)

# Extracting income from row
avg_income = nghb_profiles[nghb_profiles['Characteristic'] == "Total income: Average amount ($)"].values[0][6:]
df["Average Income"] = np.array([x.replace(',', '') for x in avg_income], dtype=int)

# ID is kept as a regular column. Note that set_index returns a new dataframe,
# so making ID the index would require: df = df.set_index('ID')

df.head()

Unnamed: 0,ID,Neighborhoods,Population density,Population 15 to 54,Average Income
0,129,Agincourt North,3929,,30414
1,128,Agincourt South-Malvern West,3034,,31825
2,20,Alderwood,2435,,47709
3,95,Annex,10863,,112766
4,42,Banbury-Don Mills,2775,,67757


Looking good. To get the total population aged 15 to 54, we need to add the population aged 15-24 to the population aged 25-54. Since these values are stored as strings (dtype "object") with thousands separators, we first remove the commas and convert them to 'int'.

In [29]:
# Find the row numbers holding the two age groups for each neighborhood
index1 = nghb_profiles.index[nghb_profiles['Characteristic'] == "Youth (15-24 years)"].tolist()
index2 = nghb_profiles.index[nghb_profiles['Characteristic'] == "Working Age (25-54 years)"].tolist()

# Removing commas and converting to int values
vals15_24 = np.array([x.replace(',', '') for x in nghb_profiles.iloc[index1[0],6:].values], dtype=int)
vals25_54 = np.array([x.replace(',', '') for x in nghb_profiles.iloc[index2[0],6:].values], dtype=int)


# Adding both age groups to get the population aged 15 to 54
df["Population 15 to 54"] = vals15_24 + vals25_54
df.head()

Unnamed: 0,ID,Neighborhoods,Population density,Population 15 to 54,Average Income
0,129,Agincourt North,3929,15010,30414
1,128,Agincourt South-Malvern West,3034,13325,31825
2,20,Alderwood,2435,6455,47709
3,95,Annex,10863,18790,112766
4,42,Banbury-Don Mills,2775,13540,67757
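
The comma removal and type conversion can also be done with pandas string methods. The sketch below is one possible way, not the code used above; the helper `to_int` and the input values are invented for illustration, including an unparseable cell to show what happens to bad data.

```python
import pandas as pd

# Invented values in the same "1,234" string format as the CSV cells.
youth = pd.Series(["1,000", "2,500", "n/a"])
working_age = pd.Series(["3,000", "4,250", "500"])

def to_int(series: pd.Series) -> pd.Series:
    """Strip thousands separators and cast to a nullable integer dtype."""
    cleaned = series.str.replace(",", "", regex=False)
    # errors="coerce" turns unparseable cells into missing values instead of raising
    return pd.to_numeric(cleaned, errors="coerce").astype("Int64")

total = to_int(youth) + to_int(working_age)
print(total)
```

A missing value in either age group propagates to the sum, so bad cells stay visible instead of silently becoming zero.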


Let's check some information on the dataset such as the index, the number of rows and columns, the data type of each column, and the number of null values in each column.

In [30]:
# look at the info of "df"
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 140 entries, 0 to 139
Data columns (total 5 columns):
ID                     140 non-null int32
Neighborhoods          140 non-null object
Population density     140 non-null int32
Population 15 to 54    140 non-null int32
Average Income         140 non-null int32
dtypes: int32(4), object(1)
memory usage: 3.4+ KB


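Everything looks correct: all 140 neighborhoods are present, no column contains null values, and the four numeric columns are integers.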
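
The same checks can be written as code so they are repeated automatically whenever the data is reloaded. This is only a sketch on the first three rows shown earlier; `df_check` is an illustrative name standing in for the full `df`.

```python
import pandas as pd

# First three rows of the dataframe built above.
df_check = pd.DataFrame({
    "ID": [129, 128, 20],
    "Neighborhoods": ["Agincourt North", "Agincourt South-Malvern West", "Alderwood"],
    "Population density": [3929, 3034, 2435],
    "Population 15 to 54": [15010, 13325, 6455],
    "Average Income": [30414, 31825, 47709],
})

numeric_cols = ["Population density", "Population 15 to 54", "Average Income"]

# Number of missing values per column
missing = df_check.isna().sum()
# Each neighbourhood ID should appear exactly once
ids_unique = df_check["ID"].is_unique
# The numeric columns should have an integer dtype
ints_ok = all(pd.api.types.is_integer_dtype(df_check[c]) for c in numeric_cols)

print(missing, ids_unique, ints_ok, sep="\n")
```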

### 2.2. Get the boundaries of the city of Toronto neighborhoods<a name='boundaries'></a> 

Next, we are going to import the GeoJSON file that contains the geographic coordinates of the 140 Toronto neighbourhoods. The neighbourhood boundaries are represented as polygons, defined by latitude and longitude coordinates. 

In [None]:
toronto_geo = r'C:\Users\Osas\Downloads\Data analysis\Capstone\Neighbourhoods.geojson'  # geojson file
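
Later steps need a single point per neighbourhood, so here is a rough sketch of how a centre can be derived from a boundary polygon with the shoelace formula. The feature below is a toy square, and the property name `AREA_NAME` is an assumption, not something read from `Neighbourhoods.geojson`. Keep in mind that GeoJSON stores coordinates as `[longitude, latitude]`.

```python
import json

# Toy GeoJSON feature; the property name and the square polygon are
# assumptions for illustration, not taken from the real file.
feature = json.loads("""
{
  "type": "Feature",
  "properties": {"AREA_NAME": "Example"},
  "geometry": {
    "type": "Polygon",
    "coordinates": [[[-79.40, 43.60], [-79.30, 43.60],
                     [-79.30, 43.70], [-79.40, 43.70], [-79.40, 43.60]]]
  }
}
""")

def ring_centroid(ring):
    """Area-weighted centroid of a closed ring of (lon, lat) pairs."""
    area2 = cx = cy = 0.0
    for (x0, y0), (x1, y1) in zip(ring, ring[1:]):
        cross = x0 * y1 - x1 * y0   # twice the signed area of this edge's triangle
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    return cx / (3 * area2), cy / (3 * area2)

# Use the outer ring only (index 0); holes are ignored in this sketch.
lon, lat = ring_centroid(feature["geometry"]["coordinates"][0])
print(lat, lon)
```

This treats degrees as planar coordinates, which is an approximation that should be acceptable at neighbourhood scale. geopandas can do the same through `read_file` and the `centroid` attribute of a GeoDataFrame.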

### 2.5. Get the number of existing vegan/vegetarian restaurants in each neighborhood<a name='restaurants'></a> 

Now that we have our location candidates, let's use the Foursquare API to get information on the vegan/vegetarian restaurants that are within a radius of 500 meters of the center of each neighborhood.

In [None]:
VERSION = '20201206' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value
VEG_ID = "4bf58dd8d48988d1d3941735" # Foursquare category ID for Vegetarian / Vegan Restaurant

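# Function that gets up to 100 vegan/vegetarian venues within a radius of 500 meters of each neighborhood center.
# CLIENT_ID and CLIENT_SECRET (Foursquare credentials) must be defined before calling it.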
def getNearbyVegRest(names, latitudes, longitudes, radius=500):
    
    venues_list = []
    
    for name, latitude, longitude in zip(names, latitudes, longitudes):
        # Create the GET request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&categoryId={}&radius={}&limit={}'.format(
            CLIENT_ID, CLIENT_SECRET, VERSION, latitude, longitude, VEG_ID, radius, LIMIT)
        
        # Send the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # Get relevant information for each nearby venue
        venues_list.append([(name,
                             latitude,
                             longitude,
                             v['venue']['name'], 
                             v['venue']['location']['lat'], 
                             v['venue']['location']['lng'],  
                             v['venue']['categories'][0]['name']) for v in results])
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                             'Neighborhood Latitude', 
                             'Neighborhood Longitude', 
                             'Venue', 
                             'Venue Latitude', 
                             'Venue Longitude', 
                             'Venue Category']
        
    return nearby_venues
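
The function returns one row per venue, while what we want is a count per neighborhood. A possible way to get there is sketched below with made-up venue rows; the column name `Veg restaurants` is an assumption. Neighborhoods with no venues never appear in the venue table, so they have to be added back with a count of zero.

```python
import pandas as pd

# Made-up rows using two of the columns that getNearbyVegRest returns.
venues = pd.DataFrame({
    "Neighborhood": ["Annex", "Annex", "Alderwood"],
    "Venue": ["Place A", "Place B", "Place C"],
})

# Full list of neighborhoods, including one without any venue.
all_neighborhoods = ["Agincourt North", "Alderwood", "Annex"]

counts = (
    venues.groupby("Neighborhood")["Venue"].count()
    .reindex(all_neighborhoods, fill_value=0)   # restore neighborhoods with zero venues
    .rename("Veg restaurants")
)
print(counts)
```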

### 2.6. Get the number of existing farmer's markets in each neighborhood<a name='market'></a> 

In [None]:
FARMER_MARKET = "4bf58dd8d48988d1fa941735" # Foursquare category ID for farmer's market
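
Since only the category ID changes between restaurants and farmer's markets, the request URL could be assembled by one helper that takes the category as a parameter. The function `build_explore_url` below is hypothetical, and the credentials are placeholders. The parameter names are the ones already used in the request above.

```python
from urllib.parse import urlencode

VEG_ID = "4bf58dd8d48988d1d3941735"         # category ID from section 2.5
FARMER_MARKET = "4bf58dd8d48988d1fa941735"  # category ID from this section

def build_explore_url(lat, lng, category_id, client_id, client_secret,
                      version="20201206", radius=500, limit=100):
    """Hypothetical helper: assemble a venues/explore URL for one category."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",       # latitude first, then longitude
        "categoryId": category_id,
        "radius": radius,
        "limit": limit,
    }
    # urlencode escapes special characters (the comma in "ll" becomes %2C)
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)

# Placeholder credentials, for illustration only.
url = build_explore_url(43.65, -79.38, FARMER_MARKET, "MY_ID", "MY_SECRET")
print(url)
```

Building the query from a dictionary avoids the placeholder-ordering mistakes that are easy to make with a long format string.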



## 3. Data Exploration <a name='explore'></a> 

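Before clustering, we need to check that our data is suitable for it. k-means groups points by distance, so features measured on very different scales (for example, population density and average income) must be compared and, if needed, rescaled first.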
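
As a small illustration of the scale problem, the sketch below standardizes the five neighborhoods shown earlier so that every feature has mean 0 and standard deviation 1. Whether z-scores are the right choice for the full dataset is an assumption; other scalings may suit it better.

```python
import pandas as pd

# The five rows displayed in section 2.1.
features = pd.DataFrame(
    {
        "Population density": [3929, 3034, 2435, 10863, 2775],
        "Population 15 to 54": [15010, 13325, 6455, 18790, 13540],
        "Average Income": [30414, 31825, 47709, 112766, 67757],
    },
    index=["Agincourt North", "Agincourt South-Malvern West",
           "Alderwood", "Annex", "Banbury-Don Mills"],
)

# Raw ranges differ by an order of magnitude between columns.
print(features.max() - features.min())

# z-score: subtract the column mean, divide by the population standard deviation.
scaled = (features - features.mean()) / features.std(ddof=0)
print(scaled.round(2))
```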

## 4. Cluster the neighborhoods to find similar neighborhoods<a name='cluster'></a> 
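
As a placeholder for this step, here is a minimal sketch of k-means on the five neighborhoods shown earlier, using only the three socioeconomic features. The number of clusters (`n_clusters=2`) is picked arbitrarily for the example and is not a result of this analysis.

```python
import numpy as np
from sklearn.cluster import KMeans

# Population density, population 15 to 54, average income
# for the five rows displayed in section 2.1.
X = np.array([
    [3929, 15010, 30414],
    [3034, 13325, 31825],
    [2435, 6455, 47709],
    [10863, 18790, 112766],
    [2775, 13540, 67757],
], dtype=float)

# Standardize each column so no feature dominates the distances.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# n_clusters=2 is an arbitrary choice; random_state makes the run repeatable.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_scaled)
labels = kmeans.labels_
print(labels)
```

With all 140 neighborhoods, the number of clusters would have to be chosen with some criterion, such as the elbow method or silhouette scores.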

## 5. Visualize and examine the final clusters<a name='examine'></a> 