# Segmenting & Clustering Neighborhoods in Toronto (Part 1)

This notebook was prepared as part of the Data Science Capstone project assignment on Coursera. It covers how to explore, segment, and cluster the neighborhoods in the city of Toronto, using neighborhood data from Wikipedia.

## Explore the Data
### Scrape Wikipedia for Data
To get the neighborhood data for this project, we will use this page from Wikipedia: [List of postal codes of Canada: M](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M).

The Wikipedia page lists the Canadian postal codes that start with "M"; we will use that table as the data source for this project. To get the data ready for analysis, we need to scrape the page and convert the table into a _Pandas_ dataframe. To do so, we will use the _BeautifulSoup_ library, which allows us to parse the page's HTML and extract the table.

Before writing any code, we can inspect the page to learn what to look for in its HTML. For this purpose we can use the _Chrome_ developer tools to _inspect_ the page elements and find our table.

![wiki_inspect_code](img/wiki_inspect.png)

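As we can see from the image above, the header and the rows of the table can be easily identified from the source code. We are also able to identify which class contains the table we're looking for, in this case the table class `"wikitable sortable jquery-tablesorter"`. Note that the code below searches for `wikitable sortable` only: the `jquery-tablesorter` class is added by JavaScript in the browser, so it should not appear in the HTML returned by _requests_. We will use this information in the next step, when parsing the table into a _Pandas_ dataframe.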


#### Utilizing _BeautifulSoup_
First, we're going to use the _requests_ library, which will allow us to send _organic, grass-fed HTTP/1.1_ requests.

In [49]:
import requests
from bs4 import BeautifulSoup

# request the page and keep the HTML of the response as text
html = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text

# parse the HTML (the 'lxml' parser requires the lxml package to be installed)
soup = BeautifulSoup(html, 'lxml')
# find the table whose class attribute is "wikitable sortable"
dtable = soup.find('table', {'class': 'wikitable sortable'})
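
One detail of the lookup above is worth illustrating: when the `class` value passed to `find` contains a space, BeautifulSoup compares it with the whole `class` attribute string, so the match depends on the classes appearing in exactly that order. A CSS selector is a less brittle alternative. The sketch below shows both on a small made-up HTML snippet (the table contents and the reversed class order are assumptions for illustration, not the real Wikipedia markup) and uses the built-in `html.parser` so that it runs without `lxml`.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration: the second table carries the same two
# classes as the first, but in a different order.
html = """
<table class="wikitable sortable"><tr><td>first</td></tr></table>
<table class="sortable wikitable"><tr><td>second</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

# A class value containing a space is compared with the full attribute string,
# so only a table whose classes appear in this exact order is matched.
exact = soup.find_all("table", {"class": "wikitable sortable"})

# A CSS selector matches any table that has both classes, in any order.
by_selector = soup.select("table.wikitable.sortable")

print(len(exact), len(by_selector))
```

For the Wikipedia page either form should find the table as long as its class attribute is exactly `wikitable sortable`; the selector is simply more tolerant if the markup changes.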

Next, we iterate over the rows and append the cell values to a list, which is then converted into a pandas dataframe.

In [81]:
import pandas as pd

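# collect every <th> and <tr> element of the table; elements without <td>
# cells (the header) produce empty rows, which are cleaned up later
drow = dtable.find_all(['th', 'tr'])

# iterate over the rows and append the cell values to a list called tes
tes = []
for row in drow:
    tes1 = []
    cells = row.find_all('td')
    for cell in cells:
        # get_text() always returns a string, whereas cell.string can be
        # None when a cell contains nested tags
        value = cell.get_text()
        value = value.replace('\n', '')
        value = value.replace(' /', ',')
        tes1.append(value)
    tes.append(tes1)

df = pd.DataFrame(tes)
df = df.rename(columns={0: 'PostalCode', 1: 'Borough', 2: 'Neighborhood'})
df.head(10)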

|   | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | | | |
| 1 | | | |
| 2 | | | |
| 3 | | | |
| 4 | M1A | Not assigned | |
| 5 | M2A | Not assigned | |
| 6 | M3A | North York | Parkwoods |
| 7 | M4A | North York | Victoria Village |
| 8 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 9 | M6A | North York | Lawrence Manor, Lawrence Heights |
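
The four empty rows at the top of the output most likely come from the header elements, which contain no `td` cells. A minimal sketch of one way to avoid them, assuming we are free to skip such rows while iterating: the HTML below is a made-up stand-in for the Wikipedia table, `toy_df` is a hypothetical name, and `html.parser` is used so that no extra parser is needed.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Made-up stand-in for the Wikipedia table (an assumption, not the real markup).
html = """
<table class="wikitable sortable">
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td></td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M5A</td><td>Downtown Toronto</td><td>Regent Park / Harbourfront</td></tr>
</table>
"""

table = BeautifulSoup(html, "html.parser").find("table")

records = []
for row in table.find_all("tr"):
    cells = row.find_all("td")
    if not cells:
        # header rows hold only <th> cells, so there is nothing to record
        continue
    # strip surrounding whitespace and turn the " /" separator into a comma
    records.append([cell.get_text(strip=True).replace(" /", ",") for cell in cells])

toy_df = pd.DataFrame(records, columns=["PostalCode", "Borough", "Neighborhood"])
print(toy_df)
```

Skipping the header rows up front is optional here, since the null-value clean up removes them anyway.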


Let's clean up the dataframe so that it follows these requirements:
* The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood
* Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
* More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma.
* If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.

In [83]:
import numpy as np

# drop the rows without a postal code (the empty header rows);
# .copy() lets us modify the result without a SettingWithCopyWarning
df = df[df['PostalCode'].notnull()].copy()

# drop the rows whose borough is 'Not assigned'
df = df[df['Borough'] != 'Not assigned'].copy()

# where the neighborhood is 'Not assigned', use the borough name instead
df['Neighborhood'] = np.where(df['Neighborhood'] == 'Not assigned', df['Borough'], df['Neighborhood'])

# reset the index after the clean up
df = df.reset_index(drop=True)
df.head(10)

|   | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
| 5 | M9A | Etobicoke | Islington Avenue |
| 6 | M1B | Scarborough | Malvern, Rouge |
| 7 | M3B | North York | Don Mills |
| 8 | M4B | East York | Parkview Hill, Woodbine Gardens |
| 9 | M5B | Downtown Toronto | Garden District, Ryerson |
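
In the output above each postal code already holds its neighborhoods in a single comma-separated cell, so no merging step was needed. If the source instead listed one neighborhood per row, as the third requirement describes, the rows could be combined with a `groupby`. A minimal sketch on made-up rows (the `raw` and `combined` dataframes are hypothetical and not part of the scraped data):

```python
import pandas as pd

# Hypothetical input in which M5A appears on two separate rows.
raw = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Harbourfront", "Regent Park", "Parkwoods"],
})

# Group by postal code and borough, then join the neighborhood names of each
# group with a comma; sort=False keeps the groups in order of first appearance.
combined = (
    raw.groupby(["PostalCode", "Borough"], sort=False)["Neighborhood"]
       .agg(", ".join)
       .reset_index()
)
print(combined)
```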


Now, let's check how many unique values there are in the `PostalCode` column and compare that with the size of the dataframe. If the clean up worked, the number of unique `PostalCode` values should equal the number of rows in the dataframe.

In [86]:
print(df['PostalCode'].nunique())

103


The code above gives 103 unique values in the `PostalCode` column. Now, let's look at the number of rows in the dataframe. We should get the same number if the clean up was done right.

In [88]:
print(df.shape)

(103, 3)
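
The same comparison can be written as an assertion, so that it fails loudly if a duplicate postal code ever slips in. A minimal sketch on a small made-up dataframe (the `toy`, `dupes`, and `deduped` names and the values are for illustration only); `Series.duplicated`, `Series.is_unique`, and `DataFrame.drop_duplicates` are standard pandas.

```python
import pandas as pd

# Hypothetical data with one repeated postal code.
toy = pd.DataFrame({
    "PostalCode": ["M3A", "M4A", "M5A", "M5A"],
    "Borough": ["North York", "North York", "Downtown Toronto", "Downtown Toronto"],
})

# keep=False marks every occurrence of a repeated code, not only the later ones
dupes = toy[toy["PostalCode"].duplicated(keep=False)]
print(dupes)

# after dropping the repeats, each postal code appears on exactly one row
deduped = toy.drop_duplicates(subset="PostalCode")
assert deduped["PostalCode"].is_unique
assert deduped["PostalCode"].nunique() == len(deduped)
```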
