<h1 align=center><font size = 5>Clustering Toronto</font></h1>

## Introduction

In this lab, you will learn how to convert addresses into their equivalent latitude and longitude values. Also, you will use the Foursquare API to explore neighborhoods in Toronto. You will use the **explore** function to get the most common venue categories in each neighborhood, and then use this feature to group the neighborhoods into clusters. You will use the *k*-means clustering algorithm to complete this task. Finally, you will use the Folium library to visualize the neighborhoods in Toronto and their emerging clusters.

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

1. <a href="#item1">Download and Explore Dataset</a>

2. <a href="#item2">Load and Add the Location Data</a>

3. <a href="#item3">Analyze Each Neighborhood</a>

4. <a href="#item4">Cluster Neighborhoods</a>

5. <a href="#item5">Examine Clusters</a>    
</font>
</div>

Before we get the data and start exploring it, let's import all the dependencies that we will need.

In [1]:
import numpy as np # library to handle data in a vectorized manner
import pandas as pd # library for data analysis
import requests # library to handle requests

#!conda install -c conda-forge beautifulsoup4 --y
from bs4 import BeautifulSoup

#!conda install -c conda-forge lxml --y
from lxml import etree

from urllib.request import urlopen

print('Libraries imported.')

Libraries imported.


<a id='item1'></a>

## 1. Download and Explore Dataset

We will use BeautifulSoup for web scraping. First, we download the HTML page with requests.

In [2]:
res = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
res

<Response [200]>

A response code of 200 means that the page was downloaded successfully, so we can proceed with scraping.
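To make the status check explicit, the sketch below uses only the standard library's `http.HTTPStatus` to label a few status codes. The helper name `is_success` is hypothetical and not part of the original notebook. With `requests`, calling `res.raise_for_status()` is a common way to stop early on a 4xx or 5xx response.

```python
from http import HTTPStatus


def is_success(status_code: int) -> bool:
    """Return True for 2xx codes, which signal a successful request."""
    return 200 <= status_code < 300


# Label a few illustrative codes; only 200 falls in the success range.
for code in (200, 404, 503):
    status = HTTPStatus(code)
    outcome = "success" if is_success(code) else "failure"
    print(code, status.phrase, outcome)
```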

In [3]:
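soup = BeautifulSoup(res.content, 'lxml')
table = soup.find_all('table')[0]  # the first table on the page holds the postal codes
df = pd.read_html(str(table))[0]  # read_html returns a list of DataFrames; take the first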
Postal_code = df["Postal code"].tolist()
Borough = df["Borough"].tolist()
Neighborhood = df["Neighborhood"].tolist()
print('Data downloaded!')
df.head()

Data downloaded!


|  | Postal code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | NaN |
| 1 | M2A | Not assigned | NaN |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park / Harbourfront |


#### Replace '/' with a comma for neighborhoods in the same postal code area

In [4]:
df['Neighborhood'] = df['Neighborhood'].str.replace(' /',',')
df['Neighborhood'].head()

0                          NaN
1                          NaN
2                    Parkwoods
3             Victoria Village
4    Regent Park, Harbourfront
Name: Neighborhood, dtype: object
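The default treatment of the pattern in `Series.str.replace` has differed between pandas versions, so passing `regex=False` makes the literal-substring intent explicit. The following is a minimal sketch on toy values shaped like the scraped column; the variable names are illustrative only.

```python
import pandas as pd

# Toy values shaped like the scraped 'Neighborhood' column (illustrative only).
names = pd.Series(["Regent Park / Harbourfront", "Parkwoods", None])

# regex=False treats ' /' as a literal substring, whatever the pandas default is.
cleaned = names.str.replace(" /", ",", regex=False)

print(cleaned.tolist())
```

Missing values pass through unchanged, which is why the first two rows of the notebook output are still `NaN` after this step.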

#### Replace all null and 'Not assigned' Neighborhood values with the Borough

In [5]:
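# Row-wise forward fill: a missing Neighborhood takes the value to its left (Borough)
df = df.ffill(axis=1)
# A literal 'Not assigned' Neighborhood is replaced by the Borough as well
df['Neighborhood'] = np.where(df['Neighborhood'] == 'Not assigned', df['Borough'], df['Neighborhood'])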
df.head()

|  | Postal code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
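To see what the row-wise forward fill does, here is a small sketch on a toy frame (the rows are illustrative, not scraped data). It also shows a more targeted alternative that touches only the `Neighborhood` column; that alternative is an assumption about what one might prefer, not the notebook's method.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Postal code": ["M5A", "M9A", "M3A"],
    "Borough": ["Downtown Toronto", "Etobicoke", "North York"],
    "Neighborhood": [np.nan, "Not assigned", "Parkwoods"],
})

# Row-wise forward fill: a missing Neighborhood takes the value to its left (Borough).
filled = toy.ffill(axis=1)

# Replace the literal placeholder 'Not assigned' with the Borough as well.
filled["Neighborhood"] = np.where(
    filled["Neighborhood"] == "Not assigned",
    filled["Borough"],
    filled["Neighborhood"],
)

# Targeted alternative: blank out the placeholder, then fill gaps from Borough.
targeted = (
    toy["Neighborhood"]
    .mask(toy["Neighborhood"] == "Not assigned")
    .fillna(toy["Borough"])
)

print(filled)
```

Because `ffill(axis=1)` fills every column from its left neighbor, it would also copy a postal code into a missing `Borough`; the targeted variant avoids that side effect.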


#### Drop all rows with a 'Not assigned' value for Borough

In [6]:
Drop=df[df['Borough'] == 'Not assigned'].index
df.drop(Drop, inplace = True)
df.reset_index(drop=True, inplace=True)
df.head()

|  | Postal code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
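The same filtering can be written with a boolean mask instead of collecting index labels and dropping them. This is a minimal sketch on a toy frame, offered as an equivalent formulation rather than the notebook's own code.

```python
import pandas as pd

toy = pd.DataFrame({
    "Postal code": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighborhood": ["Not assigned", "Parkwoods", "Victoria Village"],
})

# Keep only rows whose Borough is assigned, then renumber the index from 0.
assigned = toy[toy["Borough"] != "Not assigned"].reset_index(drop=True)

print(assigned)
```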


#### Let's check the final shape of the dataframe

In [7]:
df.shape

(103, 3)

<a id='item2'></a>

## 2. Load and Add the Location Data

#### Next, let's load the data.

In [8]:
!wget -q -O 'Geospatial_Coordinates.csv' https://cocl.us/Geospatial_data
print('Data downloaded!')

Data downloaded!


In [9]:
geo_df = pd.read_csv('Geospatial_Coordinates.csv')
geo_df = geo_df.sort_values(by=['Postal Code'], ignore_index=True)  # sort_values returns a new DataFrame
geo_df.head()

|  | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |
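Note that `sort_values` returns a sorted copy by default and leaves the original frame untouched unless the result is assigned (or `inplace=True` is passed). The sketch below illustrates this on two rows taken from the table above, deliberately placed out of order; the variable names are illustrative.

```python
import pandas as pd

toy_geo = pd.DataFrame({
    "Postal Code": ["M1C", "M1B"],
    "Latitude": [43.784535, 43.806686],
    "Longitude": [-79.160497, -79.194353],
})

# Calling sort_values without assignment leaves toy_geo unchanged.
toy_geo.sort_values(by=["Postal Code"], ignore_index=True)
unchanged_first = toy_geo["Postal Code"].iloc[0]

# Assigning the result keeps the sorted copy.
toy_geo_sorted = toy_geo.sort_values(by=["Postal Code"], ignore_index=True)
sorted_first = toy_geo_sorted["Postal Code"].iloc[0]

print(unchanged_first, sorted_first)
```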


In [11]:
geo_df.rename(columns={'Postal Code':'Postal code'}, inplace=True)
Toronto_df = pd.merge(df, geo_df, on='Postal code', how='left')  # left join keeps df's row order, as in the output below
Toronto_df.head()

|  | Postal code | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |
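When joining on postal code, it can help to check that every row actually received coordinates. The following is a sketch on toy frames, assuming a left join so the result keeps the order of the first frame. The `validate` and `indicator` arguments are standard `pd.merge` parameters, while the frame names are hypothetical.

```python
import pandas as pd

left = pd.DataFrame({
    "Postal code": ["M3A", "M4A", "M5A"],
    "Borough": ["North York", "North York", "Downtown Toronto"],
})
right = pd.DataFrame({
    "Postal code": ["M4A", "M3A"],
    "Latitude": [43.725882, 43.753259],
    "Longitude": [-79.315572, -79.329656],
})

# how='left' keeps every row of `left` in its original order.
# validate raises MergeError if a postal code is duplicated on either side.
# indicator adds a '_merge' column that flags rows with no match in `right`.
merged = pd.merge(left, right, on="Postal code", how="left",
                  validate="one_to_one", indicator=True)

# Postal codes that found no coordinates in the right-hand frame.
missing = merged.loc[merged["_merge"] == "left_only", "Postal code"].tolist()

print(missing)
```

An empty `missing` list on the real data would confirm that all postal codes were matched.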


This notebook is part of the *Applied Data Science Capstone* course on **Coursera**.