<h1 align = 'center'> Segmenting and Clustering Neighborhoods in Toronto </h1>

This notebook has three sections:

1. Data Preparation
1. Geocoding
1. Analysing the Locations (plotting on a map)

Each section is described in detail in the rest of the notebook.

## 1. Data Preparation

The data was downloaded from <a href = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'>this Wikipedia page</a>. It contains three columns:
-  Postal codes,
-  Boroughs within those postal codes, and
-  Neighbourhoods within those boroughs

Some postal codes had a borough of 'Not assigned', so those records were removed from the data.

The resulting clean data contains 102 records with no 'Not assigned' values.
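
The cleaning step described above can be sketched on a small, made-up table. The rows below are illustrative stand-ins rather than the scraped data, and the names `toy`, `clean` and `remaining` are hypothetical; the sketch only shows the filter and a check that no 'Not assigned' value is left.

```python
import pandas as pd

# Toy stand-in for the scraped table; the rows are illustrative, not the real data.
toy = pd.DataFrame({
    "Postal Code": ["M1A", "M2A", "M3A", "M4A"],
    "Borough": ["Not assigned", "Not assigned", "North York", "North York"],
    "Neighbourhood": ["Not assigned", "Not assigned", "Parkwoods", "Victoria Village"],
})

# Keep only rows whose borough is known, then renumber the index from 0.
clean = toy[toy["Borough"] != "Not assigned"].reset_index(drop=True)

# Count any 'Not assigned' values that survived, across all columns.
remaining = int((clean == "Not assigned").sum().sum())
print(len(clean), remaining)
```

A check like `remaining == 0` is a cheap way to confirm the filter did what the text claims before moving on to geocoding.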

In [456]:
import pandas as pd

### 1.1 Downloading the data and saving it as the `raw_data` dataframe

In [457]:
raw_data = pd.read_html("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")[0]

In [613]:
# Work on a copy so the original download can be restored without fetching the page again
dirty_data = raw_data.copy()

### 1.2 Full description of the downloaded data

In [614]:
dirty_data.describe(include='all')

Unnamed: 0,Postal Code,Borough,Neighbourhood
count,180,180,180
unique,180,11,100
top,M1C,Not assigned,Not assigned
freq,1,77,77


The summary shows 180 unique postal codes, 77 of which have 'Not assigned' as their borough. The remaining records can be used once those rows are removed.

### 1.3 Removing 'Not assigned' boroughs and hence neighbourhoods

In [617]:
# Take an explicit copy so later column assignments do not trigger SettingWithCopyWarning
df = dirty_data.loc[dirty_data['Borough'] != "Not assigned"].copy()

In [618]:
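# Renumber the rows from 0 after the filter
df.reset_index(inplace = True, drop = True)
# Display only: show the data indexed by postal code without modifying df
df.set_index('Postal Code', drop=True, inplace=False)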

Postal Code,Borough,Neighbourhood
M3A,North York,Parkwoods
M4A,North York,Victoria Village
M5A,Downtown Toronto,"Regent Park, Harbourfront"
M6A,North York,"Lawrence Manor, Lawrence Heights"
M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...
M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
M4Y,Downtown Toronto,Church and Wellesley
M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


<div class="alert alert-block alert-success">
<b>Done.</b> The data looks good for submission!
</div>

## 2. Geocoding

The Google Maps Geocoding API was used to obtain the latitude and longitude of the postal codes within Ontario, Canada. To do so, I built an `address` string for each record, passed it to the geocoder, and stored the returned geographic coordinates (latitude and longitude) in two separate columns.

### 2.1 Creating a function
Creating a function `get_lat_lng(apiKey, address)` to get the latitude and longitude of a given address. API used: https://developers.google.com/maps/

In [622]:
import requests


def get_lat_lng(apiKey, address):
    """Return (lat, lng) for an address, or (0, 0) if the lookup fails."""
    url = 'https://maps.googleapis.com/maps/api/geocode/json'
    try:
        # Let requests URL-encode the query parameters
        response = requests.get(url, params={'address': address, 'key': apiKey})
        location = response.json()['results'][0]['geometry']['location']
        lat, lng = location['lat'], location['lng']
    except (requests.RequestException, ValueError, KeyError, IndexError):
        # Network error, invalid JSON, or no result for this address
        print('ERROR: {}'.format(address))
        lat, lng = 0, 0
    return lat, lng


if __name__ == '__main__':
    # Read the API key from a local file so it stays out of the notebook
    fname = 'GoogleMapsAPIKey.txt'
    with open(fname, 'r') as file:
        apiKey = file.read().strip()
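
The function above reads the coordinates from `results[0]['geometry']['location']` in the JSON response. As a minimal offline sketch of that parsing step, the hypothetical helper `extract_lat_lng` below works on hand-written sample payloads (not real API output) and also checks the response's `status` field. Returning `None` on failure, rather than `(0, 0)`, is an assumption made for this sketch so that failed lookups cannot be mistaken for a real point on the map.

```python
def extract_lat_lng(payload):
    """Return (lat, lng) from a geocoding response dict, or None if unusable."""
    # A usable response reports status "OK" and carries at least one result.
    if payload.get("status") != "OK" or not payload.get("results"):
        return None
    location = payload["results"][0]["geometry"]["location"]
    return location["lat"], location["lng"]


# Hand-written sample payloads (illustrative only).
ok_payload = {
    "status": "OK",
    "results": [{"geometry": {"location": {"lat": 43.753259, "lng": -79.329656}}}],
}
empty_payload = {"status": "ZERO_RESULTS", "results": []}

print(extract_lat_lng(ok_payload))
print(extract_lat_lng(empty_payload))
```

Separating the parsing from the network call also makes it possible to test the logic without an API key.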

### 2.2 Creating an `Address` column combining `Borough` and `Postal Code`

In [623]:
import warnings
warnings.simplefilter("ignore")  # Suppress the warnings pandas raises when assigning to a filtered dataframe

# Build the address string for every row in one vectorised step
df['Address'] = df['Borough'] + ", " + df['Postal Code'] + ", Canada"

In [624]:
df # Viewing the dataframe with the address column added

Unnamed: 0,Postal Code,Borough,Neighbourhood,Address
0,M3A,North York,Parkwoods,"North York, M3A, Canada"
1,M4A,North York,Victoria Village,"North York, M4A, Canada"
2,M5A,Downtown Toronto,"Regent Park, Harbourfront","Downtown Toronto, M5A, Canada"
3,M6A,North York,"Lawrence Manor, Lawrence Heights","North York, M6A, Canada"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government","Downtown Toronto, M7A, Canada"
...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North","Etobicoke, M8X, Canada"
99,M4Y,Downtown Toronto,Church and Wellesley,"Downtown Toronto, M4Y, Canada"
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C...","East Toronto, M7Y, Canada"
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...","Etobicoke, M8Y, Canada"


### 2.3 Updating the dataframe with latitude and longitude

The `get_lat_lng` function returns a `(latitude, longitude)` tuple; its two values are stored in the `Latitude` and `Longitude` columns. This step could have been merged with the previous one, but it is kept separate to make the step-by-step approach easier to follow.

In [625]:
# Geocode each address once and collect the (lat, lng) tuples
coordinates = [get_lat_lng(apiKey, address) for address in df['Address']]
df['Latitude'] = [lat for lat, lng in coordinates]
df['Longitude'] = [lng for lat, lng in coordinates]
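
Because `get_lat_lng` falls back to `(0, 0)` when a lookup fails, it can help to check for such rows before plotting. The sketch below uses a hypothetical `fake_geocoder` with made-up coordinates in place of the real API call, so it runs without a key; the addresses and column values are illustrative only.

```python
import pandas as pd


def fake_geocoder(address):
    """Hypothetical stand-in for get_lat_lng; returns fixed, made-up coordinates."""
    lookup = {"North York, M3A, Canada": (43.75, -79.33)}
    return lookup.get(address, (0, 0))  # same (0, 0) fallback as the notebook


toy = pd.DataFrame({"Address": ["North York, M3A, Canada", "Nowhere, Z9Z, Canada"]})

# Geocode each address once, then split the (lat, lng) tuples into two columns.
coords = toy["Address"].apply(fake_geocoder)
toy["Latitude"] = [lat for lat, lng in coords]
toy["Longitude"] = [lng for lat, lng in coords]

# Rows where both coordinates are 0 mark lookups that failed.
failed = toy[(toy["Latitude"] == 0) & (toy["Longitude"] == 0)]
print(failed["Address"].tolist())
```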

In [626]:
df # Viewing the dataframe with the latitude and longitude columns added

Unnamed: 0,Postal Code,Borough,Neighbourhood,Address,Latitude,Longitude
0,M3A,North York,Parkwoods,"North York, M3A, Canada",43.753259,-79.329656
1,M4A,North York,Victoria Village,"North York, M4A, Canada",43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront","Downtown Toronto, M5A, Canada",43.654260,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights","North York, M6A, Canada",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government","Downtown Toronto, M7A, Canada",43.662301,-79.389494
...,...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North","Etobicoke, M8X, Canada",43.653654,-79.506944
99,M4Y,Downtown Toronto,Church and Wellesley,"Downtown Toronto, M4Y, Canada",43.665860,-79.383160
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C...","East Toronto, M7Y, Canada",43.662744,-79.321558
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...","Etobicoke, M8Y, Canada",43.636258,-79.498509


<div class="alert alert-block alert-success">
<b>Done.</b> The geocoding is completed!
</div>

## 3. Analysing the locations

The objective is to see the neighbourhoods on a map of Toronto and identify which are closer to the city center and which are not.

### 3.1 Creating a copy of the data frame for this section

The `Address`, `Latitude` and `Longitude` columns are dropped from the copy because they will be re-created later for every `Neighbourhood`.

In [627]:
# drop() returns a new dataframe, so df keeps its geocoded columns
df_map = df.drop(['Address', 'Latitude', 'Longitude'], axis = 1)

### 3.2 Every neighborhood is important

Every neighbourhood must be plotted, not just every borough. Hence, the comma-separated `Neighbourhood` values from the data source are **split** and **stacked** so that each neighbourhood gets its own row.

In [628]:
df_map = (df_map.set_index(['Postal Code', 'Borough'])
   .stack()
   .str.split(',', expand=True)
   .stack()
   .unstack(-2)
   .reset_index(-1, drop=True)
   .reset_index()
)
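
As an alternative sketch of the same split-and-stack idea, `DataFrame.explode` (available in pandas 0.25 and later) gives one row per neighbourhood in a single step. This is not the notebook's method, and the toy rows below only mimic the shape of the real data. The sketch also strips the space left after each comma, which is an added assumption about how the names should be cleaned.

```python
import pandas as pd

# Toy rows shaped like the cleaned data (illustrative values).
toy = pd.DataFrame({
    "Postal Code": ["M5A", "M3A"],
    "Borough": ["Downtown Toronto", "North York"],
    "Neighbourhood": ["Regent Park, Harbourfront", "Parkwoods"],
})

# Split each comma-separated string into a list, then emit one row per list item.
exploded = (
    toy.assign(Neighbourhood=toy["Neighbourhood"].str.split(","))
       .explode("Neighbourhood")
       .reset_index(drop=True)
)

# Remove the whitespace that followed each comma.
exploded["Neighbourhood"] = exploded["Neighbourhood"].str.strip()
print(exploded)
```

`explode` keeps the input row order and repeats the `Postal Code` and `Borough` values for every neighbourhood that came from the same row.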

In [629]:
df_map # Viewing the data

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1B,Scarborough,Malvern
1,M1B,Scarborough,Rouge
2,M1C,Scarborough,Rouge Hill
3,M1C,Scarborough,Port Union
4,M1C,Scarborough,Highland Creek
...,...,...,...
212,M9V,Etobicoke,Beaumond Heights
213,M9V,Etobicoke,Thistletown
214,M9V,Etobicoke,Albion Gardens
215,M9W,Etobicoke,Northwest


### 3.3 Address of Neighbourhood

The **`Address`** of each `Neighbourhood` was created by **concatenating** `Neighbourhood`, `Borough` and `Postal Code`.

In [630]:
# Build the full address string for every neighbourhood in one vectorised step
df_map['Address'] = (df_map['Neighbourhood'] + ", " + df_map['Borough'] + ", "
                     + df_map['Postal Code'] + ", Canada")

### 3.4 Geocoding the neighbourhoods

The `get_lat_lng` function is called again, this time for every neighbourhood address. The latitude and longitude from each returned tuple are stored in their respective columns.

In [632]:
# Geocode each neighbourhood address once and collect the (lat, lng) tuples
coordinates = [get_lat_lng(apiKey, address) for address in df_map['Address']]
df_map['Latitude'] = [lat for lat, lng in coordinates]
df_map['Longitude'] = [lng for lat, lng in coordinates]

In [633]:
df_map # Viewing the data, for the final time, before plotting on map

Unnamed: 0,Postal Code,Borough,Neighbourhood,Address,Latitude,Longitude
0,M1B,Scarborough,Malvern,"Malvern, Scarborough, M1B, Canada",43.806686,-79.194353
1,M1B,Scarborough,Rouge,"Rouge, Scarborough, M1B, Canada",43.806686,-79.194353
2,M1C,Scarborough,Rouge Hill,"Rouge Hill, Scarborough, M1C, Canada",43.794719,-79.134478
3,M1C,Scarborough,Port Union,"Port Union, Scarborough, M1C, Canada",43.784535,-79.160497
4,M1C,Scarborough,Highland Creek,"Highland Creek, Scarborough, M1C, Canada",43.790121,-79.173392
...,...,...,...,...,...,...
212,M9V,Etobicoke,Beaumond Heights,"Beaumond Heights, Etobicoke, M9V, Canada",43.734274,-79.566214
213,M9V,Etobicoke,Thistletown,"Thistletown, Etobicoke, M9V, Canada",43.739416,-79.588437
214,M9V,Etobicoke,Albion Gardens,"Albion Gardens, Etobicoke, M9V, Canada",43.739510,-79.559100
215,M9W,Etobicoke,Northwest,"Northwest, Etobicoke, M9W, Canada",43.706748,-79.594054


### 3.5 Plotting on the map

A `Folium` map was used for plotting.

- Map of Toronto centred on (43.653226, -79.383184), zoomed out to show the entire city
- Marked the city center and surrounding areas with a red circle
- Marked the neighbourhoods with blue dots
    - Clicking a dot pops up the name of the neighbourhood

In [635]:
import folium
latitude = 43.653226
longitude = -79.383184

venues_map = folium.Map(location=[latitude, longitude], zoom_start=10) # generate map centred around Toronto


# add Toronto as a red circle mark
folium.CircleMarker(
    [latitude, longitude],
    radius=100,
    popup='Toronto',
    fill=True,
    color='red',
    fill_color='red',
    fill_opacity=0.5
    ).add_to(venues_map)


# add neighbourhoods to the map as blue circle markers
for lat, lng, label in zip(df_map.Latitude, df_map.Longitude, df_map.Neighbourhood):
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        fill=True,
        color='blue',
        fill_color='blue',
        fill_opacity=0.6
        ).add_to(venues_map)

# display map
venues_map
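
The map shows closeness to the city center visually. To put a number on it, one option is the haversine great-circle distance from the map's center point. The sketch below is a supplementary illustration, not part of the original analysis: `haversine_km` is a hypothetical helper, the Earth radius of 6371 km is the usual mean-radius approximation, and the two sample points reuse coordinates from the geocoded postal-code table earlier in the notebook.

```python
from math import asin, cos, radians, sin, sqrt


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometres between two (lat, lng) points."""
    earth_radius_km = 6371.0  # mean Earth radius (approximation)
    dlat = radians(lat2 - lat1)
    dlng = radians(lng2 - lng1)
    a = (sin(dlat / 2) ** 2
         + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlng / 2) ** 2)
    return 2 * earth_radius_km * asin(sqrt(a))


# Center point used for the map above.
centre = (43.653226, -79.383184)

# Two sample points taken from the geocoded table (M4Y and M3A rows).
points = {
    "Church and Wellesley": (43.665860, -79.383160),
    "Parkwoods": (43.753259, -79.329656),
}

# Distance of each sample point from the center, in kilometres.
distances = {name: haversine_km(*centre, *coords) for name, coords in points.items()}

# List the sample points from nearest to farthest.
for name, km in sorted(distances.items(), key=lambda item: item[1]):
    print(name, round(km, 2))
```

Applied to the `Latitude` and `Longitude` columns of `df_map`, the same function would give a distance column that can be sorted or thresholded.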

<div class="alert alert-block alert-success">
<b>Done.</b> The map looks good for submission!
</div>