# Segmenting and Clustering Neighborhoods in Toronto - Part 1

For this assignment, you will be required to explore and cluster the neighborhoods in Toronto.

1. Start by creating a new Notebook for this assignment.
2. Use the Notebook to build the code to scrape the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, in order to obtain the data that is in the table of postal codes and to transform the data into a pandas dataframe.

#### Using the lxml library to extract data from the Wikipedia page

In [1]:
# !pip install lxml
import requests
import lxml.html as lh
import pandas as pd
print('Import finished.')

Import finished.


In [2]:
url='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

page = requests.get(url)
doc = lh.fromstring(page.content)
#Extract table rows: <tr> elements from table having attribute class='wikitable sortable'
tr_elements = doc.xpath("//table[@class='wikitable sortable']//tr")
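
The predicate `@class='wikitable sortable'` matches only when the attribute string is exactly that value, so it would miss the table if the page carried an additional class. A minimal sketch on an inline HTML snippet compares the exact match with a more tolerant token test. The snippet, the extra class name, and the variable names are made up for illustration and are not taken from the live page.

```python
import lxml.html as lh

# Hypothetical stand-in for the Wikipedia page; the real markup may differ.
html = """
<html><body>
<table class="wikitable sortable jquery-tablesorter">
  <tr><th>Postal code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
</body></html>
"""
doc = lh.fromstring(html)

# Exact string comparison: finds nothing here because of the extra class.
exact = doc.xpath("//table[@class='wikitable sortable']//tr")

# Token-style test: matches any table whose class list includes 'wikitable'.
tolerant = doc.xpath(
    "//table[contains(concat(' ', normalize-space(@class), ' '), ' wikitable ')]//tr"
)

print(len(exact), len(tolerant))
```

If the exact match returns an empty list on the real page, the tolerant form is one thing to try before changing the rest of the code.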

#### Transform the data into a pandas dataframe

In [3]:
# Placeholder for table data
result=[]

# Headers
for header in tr_elements[0]:
    name=header.text_content().rstrip()
    result.append((name,[]))

# Check if it is as expected 
print(result)    

[('Postal code', []), ('Borough', []), ('Neighborhood', [])]


In [4]:
# Values
for tr in tr_elements[1:]:
    for col, td in enumerate(tr.iterchildren()):
        # Strip the trailing newline that each table cell carries
        result[col][1].append(td.text_content().rstrip())

In [5]:
# Check columns lengths
[len(C) for (title,C) in result]

[180, 180, 180]
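
The column-wise accumulator works as long as every row has the same number of cells. A row with a missing or extra cell would shift values into the wrong column without raising an error. An alternative sketch collects whole rows and checks their width before building the dataframe. Here `raw_rows` is a hypothetical stand-in for the text extracted from `tr_elements`, not real scraped data.

```python
import pandas as pd

# Hypothetical extracted text: the first entry is the header row.
raw_rows = [
    ["Postal code\n", "Borough\n", "Neighborhood\n"],
    ["M1A\n", "Not assigned\n", "\n"],
    ["M3A\n", "North York\n", "Parkwoods\n"],
]

# Strip surrounding whitespace from every cell.
cleaned = [[cell.strip() for cell in row] for row in raw_rows]
header, body = cleaned[0], cleaned[1:]

# Fail loudly if any row is wider or narrower than the header.
bad = [row for row in body if len(row) != len(header)]
if bad:
    raise ValueError(f"{len(bad)} row(s) do not match the header width")

df_rows = pd.DataFrame(body, columns=header)
print(df_rows.shape)
```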

### Pandas dataframe
- The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood
- Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
- More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.
- If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough. So for the 9th cell in the table on the Wikipedia page, the value of the Borough and the Neighborhood columns will be Queen's Park.
- Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.
- In the last cell of your notebook, use the .shape method to print the number of rows of your dataframe.
4. Submit a link to your Notebook on your GitHub repository. (10 marks)

**Note:** The Wikipedia page has been updated since the course materials were written. Rows that share a postal code are now already combined on the page, so there is no need to merge them here.
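
For reference, the three cleaning rules listed above can be sketched on a toy frame in the older one-row-per-neighborhood layout. The sample rows are illustrative rather than scraped, and the column names follow the headers printed earlier in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "Postal code": ["M1A", "M5A", "M5A", "M7A"],
    "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto", "Queen's Park"],
    "Neighborhood": ["Not assigned", "Harbourfront", "Regent Park", "Not assigned"],
})

# Keep only rows with an assigned borough.
toy = toy[toy["Borough"] != "Not assigned"].copy()

# An unassigned neighborhood takes the borough name.
unassigned = toy["Neighborhood"] == "Not assigned"
toy.loc[unassigned, "Neighborhood"] = toy.loc[unassigned, "Borough"]

# Merge neighborhoods that share a postal code and borough.
combined = (
    toy.groupby(["Postal code", "Borough"], sort=False)["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)
print(combined)
```

Filling the neighborhood before grouping keeps the placeholder text out of the joined string.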

In [6]:
# Put data into dataframe
Dict={title:column for (title,column) in result}
df=pd.DataFrame(Dict)
df.head()

| | Postal code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | |
| 1 | M2A | Not assigned | |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park / Harbourfront |


#### Drop all rows whose Borough is Not assigned

In [7]:
df.drop(df.index[df['Borough'] == 'Not assigned'], inplace = True)

In [8]:
# Reset the index
df = df.reset_index(drop=True)
df.head(10)

| | Postal code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park / Harbourfront |
| 3 | M6A | North York | Lawrence Manor / Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park / Ontario Provincial Government |
| 5 | M9A | Etobicoke | Islington Avenue |
| 6 | M1B | Scarborough | Malvern / Rouge |
| 7 | M3B | North York | Don Mills |
| 8 | M4B | East York | Parkview Hill / Woodbine Gardens |
| 9 | M5B | Downtown Toronto | Garden District, Ryerson |


#### Combining neighborhoods that share a postal code and borough is not needed (see the Note above)

In [9]:
# df = df.groupby(['Postal code', 'Borough'])['Neighborhood'].apply(','.join).reset_index()


#### Assign the Borough value to the Neighborhood where the value is "Not assigned"

In [10]:
df.loc[df['Neighborhood'] == 'Not assigned', 'Neighborhood'] = df['Borough']
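
In the `df.head()` output earlier in this notebook, unassigned rows show an empty Neighborhood rather than the text "Not assigned", so a comparison against `'Not assigned'` alone may match nothing on the current page. A hedged sketch that treats both the literal marker and a blank cell as missing is shown here. Whether any row with an assigned borough actually has a blank neighborhood is not shown in this notebook, so the sample data is invented.

```python
import pandas as pd

sample = pd.DataFrame({
    "Borough": ["Queen's Park", "North York", "Etobicoke"],
    "Neighborhood": ["Not assigned", "Parkwoods", ""],
})

# Treat the literal marker and blank cells as missing.
missing = sample["Neighborhood"].str.strip().isin(["Not assigned", ""])

# Where the neighborhood is missing, fall back to the borough name.
sample["Neighborhood"] = sample["Neighborhood"].mask(missing, sample["Borough"])
print(sample["Neighborhood"].tolist())
```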

#### In the last cell of your notebook, use the .shape method to print the number of rows of your dataframe

In [11]:
# Check the shape of the data frame
df.shape

(103, 3)
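
`shape` is an attribute holding a `(rows, columns)` tuple, so the row count on its own is the first element. A small sketch with a toy frame (sample values only):

```python
import pandas as pd

frame = pd.DataFrame({
    "Postal code": ["M3A", "M4A"],
    "Borough": ["North York", "North York"],
})

# Attribute access, no parentheses; unpack the tuple into its two parts.
n_rows, n_cols = frame.shape
print(f"{n_rows} rows, {n_cols} columns")
```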

#### Save dataframe to csv file for future use

In [12]:
df.to_csv('df_toronto.csv')
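
By default `to_csv` also writes the index as an unnamed first column, which comes back as `Unnamed: 0` when the file is read again. A sketch of the round trip follows, using in-memory buffers instead of a file. The buffers and the sample row are for illustration only.

```python
import io
import pandas as pd

frame = pd.DataFrame({
    "Postal code": ["M3A"],
    "Borough": ["North York"],
    "Neighborhood": ["Parkwoods"],
})

# Default behaviour: the index is written as an extra, unnamed column.
with_index = io.StringIO()
frame.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()
print(cols_with_index)

# index=False keeps only the three data columns.
without_index = io.StringIO()
frame.to_csv(without_index, index=False)
without_index.seek(0)
reloaded = pd.read_csv(without_index)
print(reloaded.columns.tolist())
```

Passing `index=False` is one option if the saved file should contain only the data columns. Reading with `index_col=0` is another if the index is worth keeping.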