# Dataset and clustering strategy
This dataset covers the neighborhoods of Toronto.
To segment and explore the neighborhoods, I will use the latitude and longitude coordinates of each one.
I will scrape the neighborhoods from the Wikipedia page listing the Canadian postal codes that start with M, all of which belong to Toronto: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

1. Convert each neighborhood into its latitude and longitude values via the Google Geocoding API.
2. Use the Foursquare API to explore the most common venue categories in each Toronto neighborhood.
3. Use these venue categories as the feature for grouping the neighborhoods into clusters with the k-means clustering algorithm.
4. Use the Folium library to visualize the neighborhoods in Toronto and their emerging clusters.
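
As a rough illustration of step 3, the sketch below clusters a handful of neighborhoods by venue-category frequency. It assumes scikit-learn is available; the neighborhood names and frequency values are made up for the example and are not Foursquare results.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Made-up venue-category frequencies per neighborhood (illustrative only,
# not real Foursquare data). Each row sums to 1.
venue_freq = pd.DataFrame(
    {
        "Coffee Shop": [0.60, 0.55, 0.05, 0.10],
        "Park": [0.10, 0.15, 0.70, 0.65],
        "Restaurant": [0.30, 0.30, 0.25, 0.25],
    },
    index=["A", "B", "C", "D"],  # hypothetical neighborhood names
)

# Fit k-means with two clusters; random_state makes the run repeatable.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(venue_freq)

# Attach the cluster label of each neighborhood to its name.
clusters = pd.Series(kmeans.labels_, index=venue_freq.index, name="Cluster")
print(clusters)
```

Neighborhoods with similar venue profiles (here A and B, and C and D) should end up with the same label. The label numbers themselves are arbitrary.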

## Scraping the data and loading it into a Pandas dataframe
The data is scraped from Wikipedia using Python and the BeautifulSoup library.
Inspecting the HTML of the Wikipedia page shows that all the required data is stored in a table with the class wikitable sortable.
The strategy is therefore to locate this table and then iterate over all of its rows, each of which is a tr element.
The postcode, borough and neighborhood values that we are looking for are stored in the td cells of each tr row.
This data will be loaded into the Pandas dataframe neighborhoods:

In [2]:
import pandas as pd
import numpy as np

In [3]:
# define the dataframe columns
column_names = ['Postalcode', 'Borough', 'Neighborhood'] 

# instantiate the dataframe
neighborhoods = pd.DataFrame(columns=column_names)

# load libraries for scraping data
from bs4 import BeautifulSoup
import urllib.request
def make_soup(url):
    wikipage = urllib.request.urlopen(url)
    soupdata = BeautifulSoup(wikipage, "html.parser")
    return soupdata

soup = make_soup("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")

# find the table
table = soup.find_all('table', class_="wikitable sortable")[0]  # Only use the first table

# iterate over the table rows and collect the three cell values of each row
rows = []
for item in table.find_all('tr'):
    wikidata = [data.text for data in item.find_all('td')]
    if len(wikidata) == 3:
        rows.append(wikidata)

# build the dataframe in one step (DataFrame.append was removed in pandas 2.0)
neighborhoods = pd.DataFrame(rows, columns=column_names)

# strip the trailing newline from the Neighborhood column
neighborhoods['Neighborhood'] = neighborhoods['Neighborhood'].str.strip()
neighborhoods.head()

Unnamed: 0,Postalcode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
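
The row-extraction logic can also be tried offline, which helps when the live page layout changes. The sketch below runs the same approach on a small inline HTML fragment. The fragment is a hand-written stand-in for the Wikipedia table, not a copy of the real markup.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the Wikipedia table (an assumption, not the
# real page markup). The last cell of each row ends with a newline.
html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="wikitable sortable")

rows = []
for tr in table.find_all("tr"):
    # get_text(strip=True) removes surrounding whitespace, including the newline
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if len(cells) == 3:  # the header row has th cells only, so it is skipped
        rows.append(cells)

print(rows)
```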


Some postal codes don't have a borough assigned to them.
Let's drop these rows:

In [4]:
neighborhoods = neighborhoods[neighborhoods['Borough'] != 'Not assigned'].reset_index(drop = True)
neighborhoods.head()

Unnamed: 0,Postalcode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights


Also, some postal codes reference more than one neighborhood (M5A, for example, covers both Harbourfront and Regent Park).
Let's combine these neighborhoods into a single row per postal code:

In [5]:
# join the neighborhood names of each (postal code, borough) pair into one string
neighborhoods = neighborhoods.groupby(['Postalcode', 'Borough'], as_index=False).agg({'Neighborhood': ', '.join})
neighborhoods.head()

Unnamed: 0,Postalcode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
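
To see what the grouping does in isolation, here is a minimal sketch on three rows taken from the earlier output. The variable names `toy` and `merged` are introduced only for this illustration.

```python
import pandas as pd

# Three rows from the table above: M5A appears twice, M6A once.
toy = pd.DataFrame({
    "Postalcode": ["M5A", "M5A", "M6A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Harbourfront", "Regent Park", "Lawrence Heights"],
})

# One row per (Postalcode, Borough); neighborhood names joined with ", ".
merged = toy.groupby(["Postalcode", "Borough"], as_index=False).agg(
    {"Neighborhood": ", ".join}
)
print(merged)
```

The two M5A rows collapse into a single row whose Neighborhood value is "Harbourfront, Regent Park", while M6A is left unchanged.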


Finally, some boroughs don't have a neighborhood assigned to them.
Let's find out which ones and use the borough name as the neighborhood name:

In [6]:
neighborhoods.loc[neighborhoods['Neighborhood'] == 'Not assigned']

Unnamed: 0,Postalcode,Borough,Neighborhood
85,M7A,Queen's Park,Not assigned


In [7]:
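# assign through a single .loc call; chained indexing may silently modify a copy
mask = neighborhoods['Neighborhood'] == 'Not assigned'
neighborhoods.loc[mask, 'Neighborhood'] = neighborhoods.loc[mask, 'Borough']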

Double-check that the Queen's Park borough now has a Neighborhood assigned:

In [8]:
neighborhoods[85:86]

Unnamed: 0,Postalcode,Borough,Neighborhood
85,M7A,Queen's Park,Queen's Park
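
The replacement step can be checked on a two-row example. This is a minimal sketch using rows from the outputs above. It assigns through a boolean mask in a single `.loc` call, which is the pattern pandas recommends over chained indexing.

```python
import pandas as pd

# Two rows from the outputs above: one without a neighborhood, one with.
toy = pd.DataFrame({
    "Postalcode": ["M7A", "M1H"],
    "Borough": ["Queen's Park", "Scarborough"],
    "Neighborhood": ["Not assigned", "Cedarbrae"],
})

# Rows whose neighborhood is missing.
mask = toy["Neighborhood"] == "Not assigned"

# Copy the borough name into the neighborhood column for those rows only.
toy.loc[mask, "Neighborhood"] = toy.loc[mask, "Borough"]
print(toy)
```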


Let's see how many distinct postal codes there are in Toronto:

In [9]:
neighborhoods.shape

(103, 3)
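
Note that `shape` reports the number of rows, which equals the number of distinct postal codes only if no code is repeated. A minimal sketch of how that could be checked directly, shown on a small stand-in frame built from the first codes in the output above:

```python
import pandas as pd

# Stand-in for the neighborhoods dataframe (first three postal codes only).
toy = pd.DataFrame({"Postalcode": ["M1B", "M1C", "M1E"]})

n_rows = toy.shape[0]                    # number of rows
n_codes = toy["Postalcode"].nunique()    # number of distinct postal codes
all_unique = toy["Postalcode"].is_unique # True when no code repeats

print(n_rows, n_codes, all_unique)
```

When the two counts match, each row corresponds to exactly one postal code.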