# Segmenting and Clustering Neighborhoods in Toronto

### Table of Contents

Part 1

1a. Download and Explore Dataset

Part 2

2a. Get coordinates

3. Explore Neighborhoods in Toronto
3. Analyze Each Neighborhood
4. Cluster Neighborhoods
5. Examine Clusters

### Preface

Acknowledgements: Thanks to the authors of the capstone course. Some of the code was copied from course materials.

## Part 1

This is the first part of the week 3 project of the Applied Data Science Capstone class. In this notebook, I will explore, segment, and cluster neighborhoods in Toronto, Canada.

### 1a. Download and Explore Dataset

#### Do the necessary installations and imports

In [1]:
# import Numerical and dataframe libraries
import numpy as np 
import pandas as pd 

!conda install --yes -c conda-forge geopy folium=0.5.0 beautifulsoup4 lxml geocoder

from geopy.geocoders import Nominatim 

# import web service libraries
import requests 
import json 
from pandas.io.json import json_normalize 

# import Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

import folium # map rendering library

print('Libraries imported.')


## 1. Download and Explore Dataset about Toronto Neighborhoods

### Parse the Wikipedia page that has Toronto neighborhood information

#### Get the Wikipedia page that has the information

In [2]:
canada_M_postal_code_page_url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
response = requests.get(canada_M_postal_code_page_url)
postal_code_page = ""
if (response.status_code == 200):
    postal_code_page = response.text
else:
    print ("Response to request was not OK. It was {}".format(response.status_code))
len(postal_code_page)

78651

#### Parse the Wikipedia page

In [3]:
from bs4 import BeautifulSoup
postal_soup = BeautifulSoup(postal_code_page, 'lxml')
# print(postal_soup.prettify())

#### The page has multiple tables. Get a list of them, print out their lengths, and then proceed with the longest one, since it is the one that has the Toronto postal code information
I assume that the HTML table element that is the longest when pretty printed is the one that has the information we are looking for.

In [4]:
tables = postal_soup.find_all('table')
table_lengths = [len (table.prettify()) for table in tables]
table_lengths

[55576, 148, 180, 8480, 6088]
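
The assumption that the longest pretty-printed table is the one with the postal codes can be expressed in code rather than read off the list by eye. The following is a minimal sketch that reuses the lengths printed above; `index_of_longest` is a hypothetical helper name and is not part of the original notebook.

```python
def index_of_longest(lengths):
    """Return the position of the largest value in a non-empty list of lengths."""
    if not lengths:
        raise ValueError("no tables were found on the page")
    return max(range(len(lengths)), key=lengths.__getitem__)

# Lengths of the pretty-printed tables, copied from the output above
table_lengths = [55576, 148, 180, 8480, 6088]
longest = index_of_longest(table_lengths)
print(longest)
```

In the notebook, the resulting index could be used as `tables[longest]`, which would keep working if the order of the tables on the page changed.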

In [5]:
postal_table = tables[0]
# print(postal_table.prettify())

#### Get a list of all of the rows, since the rows have the postal data. Process each row, putting the ones that have data we need into a new dataframe containing Postal Codes, Boroughs, and Neighborhoods

Assumptions:
- I assume that each row in the HTML table has at least 3 columns
- I assume the first column only contains postal codes
- I assume the second column contains only a name or a hyperlink with a name as text in the hyperlink element
- I assume the third column contains only a name or multiple hyperlinks with the name in the text of the first hyperlink

In [6]:
table_rows = postal_table.find_all('tr')
print ("Wiki page table has {} rows".format(len(table_rows)))
# define the dataframe columns
column_names = ['PostalCode', 'Borough', 'Neighborhood']
# collect one dict per kept row; the dataframe is built once after the loop
# (DataFrame.append, used originally, was removed in pandas 2.0)
records = []
# Iterate through all of the rows, except the first one, since the first one is the header row
for row in table_rows[1:]:
    # The columns of the HTML table are (0) postcode (1) Borough (2) Neighborhood
    table_data = row.find_all('td')
    borough = table_data[1].string
    if borough == 'Not assigned':
        # skip postal codes that have no borough
        continue
    pc = table_data[0].string.strip()
    bor_field = table_data[1]
    # take the borough name from the hyperlink if there is one, otherwise from the cell text
    if bor_field.a is None:
        borough = bor_field.string.strip()
    else:
        borough = bor_field.a.string.strip()
    nei_field = table_data[2]
    a_elems = nei_field.find_all('a')
    if len(a_elems) == 0:
        nei = nei_field.string.strip()
        # a neighborhood that is not assigned takes the name of its borough
        if nei == "Not assigned":
            nei = borough
    else:
        nei = nei_field.a.string.strip()
    records.append({'PostalCode': pc, 'Borough': borough, 'Neighborhood': nei})
# instantiate the dataframe
neighborhoods = pd.DataFrame(records, columns=column_names)
neighborhoods.sort_values(['PostalCode', 'Neighborhood'])

Wiki page table has 288 rows


|    | PostalCode | Borough | Neighborhood |
|----|------------|---------|--------------|
| 8  | M1B | Scarborough | Malvern |
| 7  | M1B | Scarborough | Rouge |
| 20 | M1C | Scarborough | Highland Creek |
| 22 | M1C | Scarborough | Port Union |
| 21 | M1C | Scarborough | Rouge Hill |
| 32 | M1E | Scarborough | Guildwood |
| 33 | M1E | Scarborough | Morningside |
| 34 | M1E | Scarborough | West Hill |
| 38 | M1G | Scarborough | Woburn |
| 42 | M1H | Scarborough | Cedarbrae |


#### For the postal codes that have more than one neighborhood, combine the rows into one row 

In [7]:
def string_list(series):
    ''' Return a string that contains all of the unique strings that are in the series object,
    with the strings separated by commas (with a space after each comma)'''
    return (', '.join(series.unique()))

# Group the neighborhood dataframe by PostalCode
ng = neighborhoods.groupby(['PostalCode'])

# Create a new dataframe that has one row per PostalCode and has the Neighborhoods within each PostalCode in one long string of names separated by ", ".
# Note that if a PostalCode contains more than one Borough, the row will have multiple borough names separated by commas, but a PostalCode should
# not have more than one Borough.
postal_codes_df_first = ng.agg(string_list)
postal_codes_df_first

| PostalCode | Borough | Neighborhood |
|------------|---------|--------------|
| M1B | Scarborough | Rouge, Malvern |
| M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| M1E | Scarborough | Guildwood, Morningside, West Hill |
| M1G | Scarborough | Woburn |
| M1H | Scarborough | Cedarbrae |
| M1J | Scarborough | Scarborough Village |
| M1K | Scarborough | East Birchmount Park, Ionview, Kennedy Park |
| M1L | Scarborough | Clairlea, Golden Mile, Oakridge |
| M1M | Scarborough | Cliffcrest, Cliffside, Scarborough Village West |
| M1N | Scarborough | Birch Cliff, Cliffside West |
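
To make the effect of the aggregation easier to see, here is a small sketch on toy rows taken from the tables above. It is one possible variant, not the notebook's own method: it assumes each postal code has a single borough (so `"first"` is enough for that column) and uses `as_index=False` to keep `PostalCode` as a regular column in one step. The names `rows`, `join_unique`, and `combined` are illustrative only.

```python
import pandas as pd

# Toy rows in the same shape as the scraped table (one row per neighborhood)
rows = pd.DataFrame({
    "PostalCode": ["M1B", "M1B", "M1G"],
    "Borough": ["Scarborough", "Scarborough", "Scarborough"],
    "Neighborhood": ["Rouge", "Malvern", "Woburn"],
})

def join_unique(series):
    """Join the unique values of a series with ', ', keeping first-seen order."""
    return ", ".join(series.unique())

# One row per postal code; PostalCode stays a column because as_index=False
combined = rows.groupby("PostalCode", as_index=False).agg(
    {"Borough": "first", "Neighborhood": join_unique}
)
print(combined)
```

With these toy rows, the two `M1B` entries collapse into a single row whose `Neighborhood` value is `"Rouge, Malvern"`.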


#### Change the dataframe so that it is indexed by integers instead of PostalCodes

In [8]:
postal_codes_df = postal_codes_df_first.reset_index()
postal_codes_df

|   | PostalCode | Borough | Neighborhood |
|---|------------|---------|--------------|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
| 5 | M1J | Scarborough | Scarborough Village |
| 6 | M1K | Scarborough | East Birchmount Park, Ionview, Kennedy Park |
| 7 | M1L | Scarborough | Clairlea, Golden Mile, Oakridge |
| 8 | M1M | Scarborough | Cliffcrest, Cliffside, Scarborough Village West |
| 9 | M1N | Scarborough | Birch Cliff, Cliffside West |


In [9]:
postal_codes_df.shape

(103, 3)

## Part 2

This is the second part of the week 3 project of the Applied Data Science Capstone class. In this notebook, I will explore, segment, and cluster neighborhoods in Toronto, Canada.

### 2a. Get Coordinates

##### Define a function to get the coordinates for one postal code

In [14]:
import geocoder 


postal_code = 'M9W'

def get_coords(p_code):
    '''
    Use geocoder to get the latitude and longitude of the given postal code
    '''
    # initialize your variable to None
    lat_lng_coords = None
    # loop until you get the coordinates, giving up after 2500 attempts
    i = 0
    while(lat_lng_coords is None and i < 2500):
        # query the postal code passed in, not the global postal_code
        g = geocoder.google('{}, Toronto, Ontario'.format(p_code))
        print(g)
        lat_lng_coords = g.latlng
        i = i+1

    # every attempt failed, so there are no coordinates to unpack
    if lat_lng_coords is None:
        return None
    latitude = lat_lng_coords[0]
    longitude = lat_lng_coords[1]
    print('It took {} iteration(s)'.format(i))
    return (latitude, longitude)

#get_coords(postal_code)

<[REQUEST_DENIED] Google - Geocode [empty]>

KeyboardInterrupt: 


Note: I ran the get_coords function once, letting the loop iterate more than 1800 times, and the value of the variable g was always
```   
<[REQUEST_DENIED] Google - Geocode [empty]>
```
so I decided to use the CSV file provided by the course.
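
Since every request was denied, a loop with a high iteration limit mostly produces repeated output. The sketch below shows one way to bound the retries and report failure instead of looping for a long time. It is an illustration under assumptions: `retry_lookup` and `always_denied` are hypothetical names, and the lookup is passed in as a plain function so the example runs without any geocoding service.

```python
import time

def retry_lookup(fetch, max_attempts=5, delay_seconds=0.0):
    """Call fetch() until it returns a non-None value or attempts run out.

    Returns a (result, attempts) tuple; result is None if every attempt failed.
    """
    attempts = 0
    result = None
    while result is None and attempts < max_attempts:
        result = fetch()
        attempts += 1
        if result is None and attempts < max_attempts:
            time.sleep(delay_seconds)  # brief pause before the next try
    return result, attempts

# Stand-in for a geocoding call that is always denied (returns None)
def always_denied():
    return None

coords, tries = retry_lookup(always_denied, max_attempts=3)
print(coords, tries)
```

A caller can then check whether the result is `None` and fall back to another data source, such as the course CSV file.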

###### Get the file and read it into a dataframe

In [22]:
geo_coord = pd.read_csv("https://cocl.us/Geospatial_data", index_col="Postal Code")
geo_coord

| Postal Code | Latitude | Longitude |
|-------------|----------|-----------|
| M1B | 43.806686 | -79.194353 |
| M1C | 43.784535 | -79.160497 |
| M1E | 43.763573 | -79.188711 |
| M1G | 43.770992 | -79.216917 |
| M1H | 43.773136 | -79.239476 |
| M1J | 43.744734 | -79.239476 |
| M1K | 43.727929 | -79.262029 |
| M1L | 43.711112 | -79.284577 |
| M1M | 43.716316 | -79.239476 |
| M1N | 43.692657 | -79.264848 |


###### If the postal_codes_df and geo_coord dataframes have the same number of rows, consider the geo_coord dataframe valid

In [26]:
# If I couldn't visually inspect the data, I would check for invalid values in the code, but I inspected it visually.
if postal_codes_df.shape[0] == geo_coord.shape[0]: 
    print("geo_coords valid")

geo_coords valid
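
Equal row counts do not guarantee that the two tables contain the same postal codes. As a hedged illustration, the sketch below compares the keys directly on toy stand-ins for `postal_codes_df` and `geo_coord`; the names `codes`, `coords`, `missing_coords`, and `unused_coords` are hypothetical, and the coordinates are copied from the table above.

```python
import pandas as pd

# Toy stand-ins for postal_codes_df and geo_coord (both have three rows)
codes = pd.DataFrame({"PostalCode": ["M1B", "M1C", "M1E"]})
coords = pd.DataFrame(
    {"Latitude": [43.806686, 43.784535, 43.770992],
     "Longitude": [-79.194353, -79.160497, -79.216917]},
    index=pd.Index(["M1B", "M1C", "M1G"], name="Postal Code"),
)

# Codes present in one table but not in the other
missing_coords = sorted(set(codes["PostalCode"]) - set(coords.index))
unused_coords = sorted(set(coords.index) - set(codes["PostalCode"]))
print(missing_coords, unused_coords)
```

Here the row counts match, yet `M1E` has no coordinates and `M1G` has no neighborhood row, which a count-only check would not reveal.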


In [29]:
# Join the two dataframes to add the coordinates to the neighborhood data
combined_postal = pd.merge(postal_codes_df, geo_coord, left_on="PostalCode", right_index=True, validate="1:1")
combined_postal

|   | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|------------|---------|--------------|----------|-----------|
| 0 | M1B | Scarborough | Rouge, Malvern | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union | 43.784535 | -79.160497 |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill | 43.763573 | -79.188711 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |
| 5 | M1J | Scarborough | Scarborough Village | 43.744734 | -79.239476 |
| 6 | M1K | Scarborough | East Birchmount Park, Ionview, Kennedy Park | 43.727929 | -79.262029 |
| 7 | M1L | Scarborough | Clairlea, Golden Mile, Oakridge | 43.711112 | -79.284577 |
| 8 | M1M | Scarborough | Cliffcrest, Cliffside, Scarborough Village West | 43.716316 | -79.239476 |
| 9 | M1N | Scarborough | Birch Cliff, Cliffside West | 43.692657 | -79.264848 |
