# Segmenting and Clustering Neighborhoods in Toronto

### Import the necessary libraries

In [1]:
# import libraries
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup # module to extract the html data
import lxml

from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values

import folium # plotting library

# import k-means from clustering stage
from sklearn.cluster import KMeans

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

## 1. Extract the HTML Data and Explore the Dataset

Request the Wikipedia page at the given URL.

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
response = requests.get(url)
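
The request above does not check whether the download succeeded. As a minimal sketch, assuming a small helper named fetch_html (a hypothetical name, not part of the original notebook) and an arbitrary 10-second timeout, the same request could fail loudly on an HTTP error instead of passing an error page to the parser:

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page body as text (hypothetical helper, not from the notebook)."""
    # timeout stops the call from hanging indefinitely on a stalled connection
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.text


# Example call (commented out so the sketch does not hit the network when pasted):
# html = fetch_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
```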

Parse the response text as HTML and locate the postal code table.

In [3]:
soup = BeautifulSoup(response.text, 'html.parser')
canadaTable = soup.find('table', {'class':"wikitable sortable"})

Read the table with the pandas read_html function, which returns a list of DataFrames, and keep the first one.

In [4]:
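from io import StringIO # recent pandas versions expect a file-like object rather than a raw HTML string

df = pd.read_html(StringIO(str(canadaTable)))[0] # read_html returns a list; take the first table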
df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


Check the initial shape of the dataset.

In [5]:
df.shape

(180, 3)

#### Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
Keep only the rows whose Borough is not "Not assigned", and reset the index.

In [6]:
df = df[df['Borough'] != 'Not assigned'].reset_index(drop = True)
df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


#### More than one neighborhood can exist in one postal code area. Combine these rows into one row with the neighborhoods separated with a comma.
Group the rows by Postal Code and join the Neighbourhood values of each group with a comma.
This produces a new dataframe with only the Postal Code and Neighbourhood columns.
Merge the new dataframe with the old one on Postal Code, then drop the extra columns.

In [7]:
df1 = df.groupby('Postal Code')['Neighbourhood'].apply(', '.join).reset_index()

In [8]:
df = pd.merge(df, df1, how= 'right', on= 'Postal Code').reset_index(drop=True)
df['Neighbourhood'] = df['Neighbourhood_y']
df = df.drop(['Neighbourhood_x','Neighbourhood_y'], axis=1)

In [9]:
df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
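
The merge in the cells above returns one row per original row, so a repeated postal code would still appear more than once unless duplicates were dropped afterwards. As a sketch of an alternative, and not the approach the notebook takes, a single groupby aggregation can combine the neighbourhoods and keep the borough in one step. The toy rows below are illustrative, and the sketch assumes each postal code maps to exactly one borough:

```python
import pandas as pd

# Toy data with a repeated postal code (illustrative values only)
toy = pd.DataFrame({
    'Postal Code': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighbourhood': ['Regent Park', 'Harbourfront', 'Parkwoods'],
})

combined = (
    toy.groupby('Postal Code', sort=False)   # sort=False keeps first-seen order
       .agg({
           'Borough': 'first',               # assumes one borough per postal code
           'Neighbourhood': ', '.join,       # join neighbourhood names with a comma
       })
       .reset_index()
)

print(combined)
```

Here the two M5A rows collapse into a single row whose Neighbourhood is the comma-separated list of both names.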


#### If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.
Build a boolean mask marking the rows where Neighbourhood is 'Not assigned', then replace the Neighbourhood in those rows with the Borough from the same row.

In [10]:
N_NA = df['Neighbourhood'] == 'Not assigned'
df.loc[N_NA, 'Neighbourhood'] = df.loc[N_NA, 'Borough']
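
To make the effect of the mask concrete, here is a small worked example on toy rows (illustrative values, not the scraped table). It also computes the same result with Series.mask, which is shown only as an equivalent alternative and is not used in the notebook:

```python
import pandas as pd

# Toy rows: the first has a borough but no assigned neighbourhood
toy = pd.DataFrame({
    'Postal Code': ['M7A', 'M3A'],
    'Borough': ["Queen's Park", 'North York'],
    'Neighbourhood': ['Not assigned', 'Parkwoods'],
})

# True for rows whose neighbourhood still needs a value
unassigned = toy['Neighbourhood'] == 'Not assigned'

# Alternative: Series.mask replaces values where the condition is True
via_mask = toy['Neighbourhood'].mask(unassigned, toy['Borough'])

# Same approach as the notebook: assign the borough into the masked rows
toy.loc[unassigned, 'Neighbourhood'] = toy.loc[unassigned, 'Borough']

print(toy)
```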

#### In the last cell of your notebook, use the .shape attribute to print the number of rows of your dataframe.

In [11]:
df.shape

(103, 3)
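
The shape alone does not confirm that the cleaning rules held. As a rough sketch, a helper such as summarize_cleaning (a hypothetical name introduced here for illustration) could count the leftovers each rule was meant to remove. It is demonstrated on toy rows, not on the real table:

```python
import pandas as pd


def summarize_cleaning(frame):
    """Return simple quality counts for a postal code table (hypothetical helper)."""
    return {
        'rows': len(frame),
        # postal codes that appear more than once (counts the extra occurrences)
        'duplicate_postal_codes': int(frame['Postal Code'].duplicated().sum()),
        'unassigned_boroughs': int((frame['Borough'] == 'Not assigned').sum()),
        'unassigned_neighbourhoods': int((frame['Neighbourhood'] == 'Not assigned').sum()),
    }


# Toy rows with one repeated postal code (illustrative values only)
toy = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A', 'M4A'],
    'Borough': ['North York', 'North York', 'North York'],
    'Neighbourhood': ['Parkwoods', 'Victoria Village', 'Not assigned'],
})

summary = summarize_cleaning(toy)
print(summary)
```

On a fully cleaned dataframe, every count except rows would be expected to be zero.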

#### Import the CSV file with the geographical coordinates of each postal code.

The geographical coordinate file is imported using pandas.

In [12]:
coord = pd.read_csv(r'.\Coursera Capstone\Geospatial_Coordinates.csv') # raw string so the backslashes are not read as escape sequences
coord.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
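
The path in the cell above uses Windows-style backslashes. As a sketch of a more portable option, pathlib can build the same relative path with the separator of the current operating system. This assumes the file sits in a 'Coursera Capstone' folder under the working directory, as the original path suggests:

```python
from pathlib import Path

# Build the relative path piece by piece; pathlib picks the right separator
csv_path = Path('Coursera Capstone') / 'Geospatial_Coordinates.csv'

print(csv_path)

# pandas accepts Path objects directly, for example:
# coord = pd.read_csv(csv_path)
```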


Join the coordinates with the neighbourhood dataframe on Postal Code.

In [13]:
toronto_df = pd.merge(df, coord, how='inner', on= 'Postal Code').reset_index(drop=True)
toronto_df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
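
An inner join silently drops any postal code that has no coordinates. As a hedged sketch, and not what the notebook does, a left join with the indicator and validate options of pd.merge can reveal unmatched or repeated keys. The postal code 'M9Z' and 'Example Borough' below are made up for illustration:

```python
import pandas as pd

areas = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A', 'M9Z'],   # 'M9Z' is a made-up code with no coordinates
    'Borough': ['North York', 'North York', 'Example Borough'],
})
coords = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A'],
    'Latitude': [43.753259, 43.725882],
    'Longitude': [-79.329656, -79.315572],
})

# how='left' keeps every area row; indicator=True adds a '_merge' column
# validate='one_to_one' raises MergeError if a postal code repeats on either side
checked = pd.merge(areas, coords, how='left', on='Postal Code',
                   validate='one_to_one', indicator=True)

# Rows marked 'left_only' had no matching coordinates
missing = checked.loc[checked['_merge'] == 'left_only', 'Postal Code'].tolist()
print(missing)
```

If the list of missing codes is empty, switching back to an inner join gives the same rows without the extra check column.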
