# Week 3: Segmenting and Clustering Neighborhoods in Toronto

In this assignment, you will explore, segment, and cluster the neighborhoods in the city of Toronto. Unlike for New York, however, the neighborhood data is not readily available on the internet. Each data science project can be challenging in its own way, so you need to be agile and practice picking up new libraries and tools quickly as the project demands.

For the Toronto neighborhood data, a Wikipedia page exists that has all the information we need to explore and cluster the neighborhoods in Toronto. You will be required to scrape the Wikipedia page and wrangle the data, clean it, and then read it into a pandas dataframe so that it is in a structured format like the New York dataset.

Once the data is in a structured format, you can replicate the analysis that we did on the New York City dataset to explore and cluster the neighborhoods in the city of Toronto.

## Section 1

Use the Notebook to build the code to scrape the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M , in order to obtain the data that is in the table of postal codes and to transform the data into a pandas dataframe.

### Import libraries

In [1]:
import requests
import bs4
import lxml.html as lh
import pandas as pd
pd.set_option('display.max_rows',200)
import folium

In [2]:
res = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")

In [3]:
#Store the contents of the website under doc
doc = lh.fromstring(res.content)

In [4]:
print(doc)

<Element html at 0x2afd0b35ae0>


In [5]:
#Parse data that are stored between <tr>..</tr> of HTML
tr_elements = doc.xpath('//tr')

In [6]:
#Check the length of the first 12 rows
[len(T) for T in tr_elements[:12]]

[3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]
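
Selecting every `<tr>` on the page also picks up rows from unrelated tables. The following is a minimal sketch of narrowing the XPath to a single table. It assumes the postal-code table carries the `wikitable` class, which is an assumption about the page markup rather than something shown in this notebook; `sample_html`, `sample_doc`, `sample_rows`, and `sample_cells` are hypothetical names, and the inline HTML is a tiny stand-in for the real page.

```python
import lxml.html as lh

# Tiny stand-in for the Wikipedia page: one data table plus an unrelated footer table
sample_html = """
<html><body>
<table class="wikitable">
  <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
<table class="navbox">
  <tr><td></td><td>Canadian postal codes</td><td></td></tr>
</table>
</body></html>
"""

sample_doc = lh.fromstring(sample_html)

# Only rows of the first table whose class contains "wikitable" (assumed class name)
sample_rows = sample_doc.xpath('(//table[contains(@class, "wikitable")])[1]//tr')

# Strip surrounding whitespace from every cell while collecting the text
sample_cells = [[cell.text_content().strip() for cell in row] for row in sample_rows]
print(sample_cells)
```

With a selector like this, the footer row never enters the data, so no positional clean-up would be needed later.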

In [7]:
#Re-select all rows; the table header is the first row, tr_elements[0]

tr_elements = doc.xpath('//tr')

In [8]:
#Create empty list
col=[]
i=0

#For each cell of the header row, store the column name and an empty list
for t in tr_elements[0]:
    i+=1
    name=t.text_content()
    print('%d:"%s"' %(i,name))
    col.append((name,[]))

1:"Postal Code
"
2:"Borough
"
3:"Neighbourhood
"


In [9]:
print(col)

[('Postal Code\n', []), ('Borough\n', []), ('Neighbourhood\n', [])]


In [10]:
col

[('Postal Code\n', []), ('Borough\n', []), ('Neighbourhood\n', [])]

In [11]:
type(col)

list

In [12]:
col = [('Postal Code', []), ('Borough', []), ('Neighbourhood', [])]

In [13]:
print(col)

[('Postal Code', []), ('Borough', []), ('Neighbourhood', [])]
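
Retyping the header names by hand works for three columns. As a small sketch of an alternative, the trailing newlines can be stripped programmatically; `raw_col` and `clean_col` are hypothetical names used only for this illustration.

```python
# Header tuples as they come out of the scrape, with trailing newlines
raw_col = [('Postal Code\n', []), ('Borough\n', []), ('Neighbourhood\n', [])]

# Strip whitespace from each name and keep the (still empty) value lists
clean_col = [(name.strip(), values) for name, values in raw_col]
print(clean_col)
```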


In [14]:
#Since our first row is the header, data is stored on the second row onwards

for j in range(1,len(tr_elements)):
    
    #T is our j'th row
    T=tr_elements[j]
    
    #If row is not of size 3, the //tr data is not from our table 
    if len(T)!=3:
        break
    
    #i is the index of our column
    i=0
    
    #Iterate through each element of the row
    for t in T.iterchildren():
        data=t.text_content() 
        #Leave the first column (the postal code) as text
        if i>0:
            #Convert any numerical value to integers
            try:
                data=int(data)
            except ValueError:
                pass
        #Append the data to the empty list of the i'th column
        col[i][1].append(data)
        #Increment i for the next column
        i+=1

In [15]:
print(col)

[('Postal Code', ['M1A\n', 'M2A\n', 'M3A\n', 'M4A\n', 'M5A\n', 'M6A\n', 'M7A\n', 'M8A\n', 'M9A\n', 'M1B\n', 'M2B\n', 'M3B\n', 'M4B\n', 'M5B\n', 'M6B\n', 'M7B\n', 'M8B\n', 'M9B\n', 'M1C\n', 'M2C\n', 'M3C\n', 'M4C\n', 'M5C\n', 'M6C\n', 'M7C\n', 'M8C\n', 'M9C\n', 'M1E\n', 'M2E\n', 'M3E\n', 'M4E\n', 'M5E\n', 'M6E\n', 'M7E\n', 'M8E\n', 'M9E\n', 'M1G\n', 'M2G\n', 'M3G\n', 'M4G\n', 'M5G\n', 'M6G\n', 'M7G\n', 'M8G\n', 'M9G\n', 'M1H\n', 'M2H\n', 'M3H\n', 'M4H\n', 'M5H\n', 'M6H\n', 'M7H\n', 'M8H\n', 'M9H\n', 'M1J\n', 'M2J\n', 'M3J\n', 'M4J\n', 'M5J\n', 'M6J\n', 'M7J\n', 'M8J\n', 'M9J\n', 'M1K\n', 'M2K\n', 'M3K\n', 'M4K\n', 'M5K\n', 'M6K\n', 'M7K\n', 'M8K\n', 'M9K\n', 'M1L\n', 'M2L\n', 'M3L\n', 'M4L\n', 'M5L\n', 'M6L\n', 'M7L\n', 'M8L\n', 'M9L\n', 'M1M\n', 'M2M\n', 'M3M\n', 'M4M\n', 'M5M\n', 'M6M\n', 'M7M\n', 'M8M\n', 'M9M\n', 'M1N\n', 'M2N\n', 'M3N\n', 'M4N\n', 'M5N\n', 'M6N\n', 'M7N\n', 'M8N\n', 'M9N\n', 'M1P\n', 'M2P\n', 'M3P\n', 'M4P\n', 'M5P\n', 'M6P\n', 'M7P\n', 'M8P\n', 'M9P\n', 'M1R\n', '

In [16]:
[len(C) for (title,C) in col]

[181, 181, 181]

In [17]:
Dict = {title:column for (title,column) in col}
df = pd.DataFrame(Dict)
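
As a hedged alternative to the column-wise dictionary, a dataframe can also be built directly from a list of rows, with the first row as the header. The `rows` list below is a hypothetical, hand-written sample and `table` is an illustrative name.

```python
import pandas as pd

# Hypothetical sample: header row followed by two data rows
rows = [
    ["Postal Code", "Borough", "Neighbourhood"],
    ["M1A", "Not assigned", "Not assigned"],
    ["M3A", "North York", "Parkwoods"],
]

# The first row supplies the column names, the rest become the data
table = pd.DataFrame(rows[1:], columns=rows[0])
print(table)
```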

In [18]:
df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A\n,Not assigned\n,Not assigned\n
1,M2A\n,Not assigned\n,Not assigned\n
2,M3A\n,North York\n,Parkwoods\n
3,M4A\n,North York\n,Victoria Village\n
4,M5A\n,Downtown Toronto\n,"Regent Park, Harbourfront\n"


In [19]:
df.tail()

Unnamed: 0,Postal Code,Borough,Neighbourhood
176,M6Z\n,Not assigned\n,Not assigned\n
177,M7Z\n,Not assigned\n,Not assigned\n
178,M8Z\n,Etobicoke\n,"Mimico NW, The Queensway West, South of Bloor,..."
179,M9Z\n,Not assigned\n,Not assigned\n
180,\n,Canadian postal codes\n,\n


In [20]:
df.shape

(181, 3)

In [21]:
#Remove \n
df = df.replace('\n','', regex=True)

In [22]:
df

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
7,M8A,Not assigned,Not assigned
8,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
9,M1B,Scarborough,"Malvern, Rouge"
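
The regex replace removes every newline anywhere in a cell. A sketch of a narrower alternative is to strip only leading and trailing whitespace, column by column; `sample` and `stripped` are hypothetical names and the two rows are copied from the output above.

```python
import pandas as pd

# Two rows in the raw, newline-terminated form
sample = pd.DataFrame({
    "Postal Code": ["M3A\n", "M4A\n"],
    "Borough": ["North York\n", "North York\n"],
    "Neighbourhood": ["Parkwoods\n", "Victoria Village\n"],
})

# Apply str.strip to each column; this assumes every column holds strings
stripped = sample.apply(lambda column: column.str.strip())
print(stripped)
```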


In [23]:
df.shape

(181, 3)

In [24]:
#Remove the last row
df = df[0:180]

In [25]:
df

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
7,M8A,Not assigned,Not assigned
8,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
9,M1B,Scarborough,"Malvern, Rouge"
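
Dropping the last row by position depends on the footer always being row 180. A minimal sketch of a content-based filter follows, assuming valid codes in this table look like `M`, a digit, then a capital letter; the pattern and the names `sample`, `looks_like_code`, and `filtered` are illustrative assumptions.

```python
import pandas as pd

# Last rows of the scraped table, including the stray footer row
sample = pd.DataFrame({
    "Postal Code": ["M8Z", "M9Z", ""],
    "Borough": ["Etobicoke", "Not assigned", "Canadian postal codes"],
})

# Keep only rows whose first column looks like a postal code (assumed pattern)
looks_like_code = sample["Postal Code"].str.match(r"^M\d[A-Z]$")
filtered = sample[looks_like_code]
print(filtered)
```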


In [26]:
#Save to csv to check in Excel
#df.to_csv("canadapostalcodes.csv",index=False)

### Data cleaning

In [27]:
df.Borough.str.contains("Not assigned")

0       True
1       True
2      False
3      False
4      False
5      False
6      False
7       True
8      False
9      False
10      True
11     False
12     False
13     False
14     False
15      True
16      True
17     False
18     False
19      True
20     False
21     False
22     False
23     False
24      True
25      True
26     False
27     False
28      True
29      True
30     False
31     False
32     False
33      True
34      True
35      True
36     False
37      True
38      True
39     False
40     False
41     False
42      True
43      True
44      True
45     False
46     False
47     False
48     False
49     False
50     False
51      True
52      True
53      True
54     False
55     False
56     False
57     False
58     False
59     False
60      True
61      True
62      True
63     False
64     False
65     False
66     False
67     False
68     False
69      True
70      True
71      True
72     False
73     False
74     False
75     False
76     False

In [28]:
df1 = df[~df.Borough.str.contains("Not assigned")]

In [29]:
df1.shape

(103, 3)
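
Because the unassigned boroughs are labelled with exactly the text `Not assigned`, an equality test is a possible alternative to `str.contains`. The sketch below uses a hypothetical four-row `sample` and also resets the index in the same step; `assigned` is an illustrative name.

```python
import pandas as pd

# First four rows of the cleaned table
sample = pd.DataFrame({
    "Postal Code": ["M1A", "M2A", "M3A", "M4A"],
    "Borough": ["Not assigned", "Not assigned", "North York", "North York"],
    "Neighbourhood": ["Not assigned", "Not assigned", "Parkwoods", "Victoria Village"],
})

# Exact comparison, then a fresh 0-based index on the result
assigned = sample[sample["Borough"] != "Not assigned"].reset_index(drop=True)
print(assigned)
```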

In [30]:
df1

Unnamed: 0,Postal Code,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
8,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
9,M1B,Scarborough,"Malvern, Rouge"
11,M3B,North York,Don Mills
12,M4B,East York,"Parkview Hill, Woodbine Gardens"
13,M5B,Downtown Toronto,"Garden District, Ryerson"


In [31]:
df1 = df1.reset_index(drop=True)   #Reassign instead of modifying a filtered copy in place

In [32]:
df1   #See and check all rows

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"


In [33]:
print("The shape of the dataframe:", df1.shape)

The shape of the dataframe: (103, 3)


In [34]:
#df1.to_csv('postcodecleaned.csv',index=False) #save a csv file

## Section 2

You now have a dataframe with the postal code, borough name, and neighborhood name of each neighborhood. To use the Foursquare location data, we also need the latitude and longitude coordinates of each neighborhood.

Use the Geocoder package or the csv file to create the following dataframe:

In [35]:
#Import the geospatial data

geodata = pd.read_csv('Geospatial_Coordinates.csv')

In [36]:
geodata.head()

Unnamed: 0,Postcode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [37]:
df1.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [38]:
df2 = df1.merge(geodata, how="inner", left_on="Postal Code", right_on="Postcode")

### Dataframe for Section 2 is below:

In [39]:
df2.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Postcode,Latitude,Longitude
0,M3A,North York,Parkwoods,M3A,43.753259,-79.329656
1,M4A,North York,Victoria Village,M4A,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",M5A,43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",M6A,43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",M7A,43.662301,-79.389494


In [40]:
df2.drop(["Postcode"],axis=1,inplace=True)

In [41]:
df2

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village",43.667856,-79.532242
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
7,M3B,North York,Don Mills,43.745906,-79.352188
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
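
The merge-then-drop step can also be sketched by renaming the key column first, so that only one postal code column ever exists. The example additionally uses `how="left"` with `validate="one_to_one"` so that a missing or duplicated code would surface instead of silently changing the row count. The names `neighbourhoods`, `coordinates`, `merged`, and `missing` are hypothetical; the values are copied from the tables above.

```python
import pandas as pd

neighbourhoods = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Borough": ["North York", "North York"],
    "Neighbourhood": ["Parkwoods", "Victoria Village"],
})
coordinates = pd.DataFrame({
    "Postcode": ["M4A", "M3A", "M1B"],
    "Latitude": [43.725882, 43.753259, 43.806686],
    "Longitude": [-79.315572, -79.329656, -79.194353],
})

# Rename the key so both frames share one column name, then merge on it
merged = neighbourhoods.merge(
    coordinates.rename(columns={"Postcode": "Postal Code"}),
    on="Postal Code",
    how="left",               # keep every neighbourhood row
    validate="one_to_one",    # raise if a postal code is duplicated on either side
)

# Rows without coordinates would show up as NaN in Latitude
missing = int(merged["Latitude"].isna().sum())
print(merged)
print("Rows without coordinates:", missing)
```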


In [42]:
#df2.to_csv("postalcodecomplete.csv",index=False)

## Section 3

Explore and cluster the neighborhoods in Toronto. You can decide to work with only boroughs that contain the word Toronto and then replicate the same analysis we did to the New York City data. It is up to you. 
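
The option of working only with boroughs that contain the word Toronto could be sketched as follows. The four-row `sample` is a hypothetical excerpt of the Section 2 dataframe and `toronto_only` is an illustrative name.

```python
import pandas as pd

# Hypothetical excerpt of the Section 2 dataframe
sample = pd.DataFrame({
    "Postal Code": ["M3A", "M5A", "M7A", "M9A"],
    "Borough": ["North York", "Downtown Toronto", "Downtown Toronto", "Etobicoke"],
})

# Plain substring match on the borough name (no regular expression needed)
toronto_only = sample[sample["Borough"].str.contains("Toronto", regex=False)]
toronto_only = toronto_only.reset_index(drop=True)
print(toronto_only)
```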

In [43]:
(df2["Borough"] == "Mississauga").value_counts()

False    102
True       1
Name: Borough, dtype: int64

In [44]:
latitude = 43.636966
longitude = -79.615819

In [45]:
# Create map of Mississauga using latitude and longitude values
# (named mississauga_map so that Python's built-in map() is not shadowed)
mississauga_map = folium.Map(location=[latitude, longitude], zoom_start=11)

In [46]:
mississauga_map

<img src="map1.jpg">

In [47]:
#Add a marker

tooltip = 'Click me!'

folium.Marker([latitude,longitude], popup='<b>Canada Post Gateway Processing Centre</b>', tooltip=tooltip).add_to(mississauga_map)

<folium.map.Marker at 0x2afd0c546a0>

In [48]:
mississauga_map

<img src="map2.jpg">

<img src="map2.jpg">

### Note: There is no nearby neighbourhood for Mississauga