**For this assignment, you will be required to explore and cluster the neighborhoods in Toronto.**

1. Start by creating a new Notebook for this assignment.

2. Use the Notebook to build the code to scrape the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, in order to obtain the data that is in the table of postal codes and to transform the data into a pandas dataframe like the one shown below:


3. To create the above dataframe:

- The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood

- Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.

- More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.

- If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.

- Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.

- In the last cell of your notebook, use the .shape method to print the number of rows of your dataframe.

4. Submit a link to your Notebook on your Github repository. (10 marks)
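
The three cleaning rules in step 3 can be sketched on a few sample rows before touching the real page. This is only an illustration: the rows are a hypothetical stand-in for the scraped table, the helper name `clean_postal_codes` is made up here, and the `", "` separator is an assumption about how the combined neighborhoods should look.

```python
import pandas as pd

# Hypothetical sample rows shaped like the scraped table (not real scraped output).
raw = pd.DataFrame(
    {
        "PostalCode": ["M1A", "M5A", "M5A", "M7A"],
        "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto", "Queen's Park"],
        "Neighborhood": ["Not assigned", "Harbourfront", "Regent Park", "Not assigned"],
    }
)


def clean_postal_codes(df):
    """Apply the assignment's three rules to a PostalCode/Borough/Neighborhood table."""
    # Keep only rows that have an assigned borough.
    out = df[df["Borough"] != "Not assigned"].copy()
    # An unassigned neighborhood takes the name of its borough.
    missing = out["Neighborhood"] == "Not assigned"
    out.loc[missing, "Neighborhood"] = out.loc[missing, "Borough"]
    # Combine neighborhoods that share a postal code into one comma-separated row.
    return out.groupby(["PostalCode", "Borough"], as_index=False)["Neighborhood"].agg(", ".join)


cleaned = clean_postal_codes(raw)
print(cleaned)
print(cleaned.shape[0])  # number of rows
```

Filling the unassigned neighborhoods before grouping means the joined string never contains a stray "Not assigned" entry.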

# **Part 1**

**After several attempts with different scraping methods, I used an older revision of the Wikipedia page to get the right table.**

In [0]:
# Recent revisions of the page gave a lot of errors, so an older revision of the Wiki page is used
from io import StringIO

import bs4 as bs
import numpy as np
import pandas as pd
import requests

url = "https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&oldid=942655364"
res = requests.get(url)
soup = bs.BeautifulSoup(res.content, 'lxml')
table = soup.find_all('table')[0]  # the first table on the page holds the postal codes
data = pd.read_html(StringIO(str(table)))[0]  # read_html returns a list of DataFrames

**Q - The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood**

**Observation - The table from the older revision uses the column names Postcode and Neighbourhood instead of PostalCode and Neighborhood.**

In [8]:
data.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


In [9]:
data

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
...,...,...,...
282,M8Z,Etobicoke,Mimico NW
283,M8Z,Etobicoke,The Queensway West
284,M8Z,Etobicoke,Royal York South West
285,M8Z,Etobicoke,South of Bloor


**If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.**

In [0]:
# Drop rows without an assigned borough, then combine neighbourhoods that share a postcode

data_bor = data[data['Borough'] != 'Not assigned']
data_bor = data_bor.groupby(['Borough', 'Postcode'], as_index=False).agg(','.join)

In [0]:
# Where the neighbourhood is 'Not assigned', use the borough name instead
data_bor['Neighbourhood'] = np.where(data_bor['Neighbourhood'] == 'Not assigned', data_bor['Borough'], data_bor['Neighbourhood'])

In [12]:
data_bor

Unnamed: 0,Borough,Postcode,Neighbourhood
0,Central Toronto,M4N,Lawrence Park
1,Central Toronto,M4P,Davisville North
2,Central Toronto,M4R,North Toronto West
3,Central Toronto,M4S,Davisville
4,Central Toronto,M4T,"Moore Park,Summerhill East"
...,...,...,...
98,York,M6C,Humewood-Cedarvale
99,York,M6E,Caledonia-Fairbanks
100,York,M6M,"Del Ray,Keelesdale,Mount Dennis,Silverthorn"
101,York,M6N,"The Junction North,Runnymede"


In [13]:
data_bor.shape

(103, 3)
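
Beyond the shape, a few quick sanity checks can confirm that the cleaning rules held. The sketch below is an assumption-laden illustration, not part of the original notebook: `check_cleaned` is a hypothetical helper, and the two-row `sample` frame stands in for `data_bor`.

```python
import pandas as pd


def check_cleaned(df, code_col="Postcode"):
    """Return simple pass/fail checks for a cleaned postal-code table."""
    return {
        # Each postcode should appear exactly once after grouping.
        "unique_codes": bool(df[code_col].is_unique),
        # No row should still carry the 'Not assigned' placeholder.
        "no_unassigned_borough": not (df["Borough"] == "Not assigned").any(),
        "no_unassigned_neighbourhood": not (df["Neighbourhood"] == "Not assigned").any(),
    }


# Hypothetical two-row stand-in for data_bor.
sample = pd.DataFrame(
    {
        "Borough": ["Central Toronto", "Central Toronto"],
        "Postcode": ["M4N", "M4P"],
        "Neighbourhood": ["Lawrence Park", "Davisville North"],
    }
)

checks = check_cleaned(sample)
print(checks)
```

In the notebook itself the same call would be `check_cleaned(data_bor)`.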

**Our data contains 103 rows and 3 columns.**

End of Part 1

---



# **Part 2**

Importing all necessary libraries

In [0]:
import requests
import lxml.html as lh
import bs4 as bs
import urllib.request
import numpy as np 
import pandas as pd

**I first uploaded the geospatial coordinates file to my Colab environment and then read it from that path below.**

In [2]:
geo_data = "/content/Geospatial_Coordinates.csv"
geospatial_data = pd.read_csv(geo_data)
geospatial_data.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


**The scraped table names this column Postcode, so the Postal Code column is renamed here to match before merging.**

In [28]:

geospatial_data.columns = ['Postcode', 'Latitude', 'Longitude']
geospatial_data.columns

Index(['Postcode', 'Latitude', 'Longitude'], dtype='object')
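
Assigning a full list to `.columns` relies on the column order in the file. A possible alternative, sketched here on a one-row stand-in for the coordinates file (the `coords` frame is hypothetical), is to rename only the column that differs:

```python
import pandas as pd

# Hypothetical one-row stand-in for Geospatial_Coordinates.csv.
coords = pd.DataFrame(
    {"Postal Code": ["M1B"], "Latitude": [43.806686], "Longitude": [-79.194353]}
)

# Rename by name, so the result does not depend on column position.
renamed = coords.rename(columns={"Postal Code": "Postcode"})
print(list(renamed.columns))
```

With `rename`, a column that is missing from the mapping is simply left unchanged, so a reordered file would not silently mislabel latitude and longitude.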

In [34]:
geo_tbl = data_bor.merge(geospatial_data, on="Postcode")  # inner join on the shared Postcode column
geo_tbl.head()

Unnamed: 0,Borough,Postcode,Neighbourhood,Latitude,Longitude
0,Central Toronto,M4N,Lawrence Park,43.72802,-79.38879
1,Central Toronto,M4P,Davisville North,43.712751,-79.390197
2,Central Toronto,M4R,North Toronto West,43.715383,-79.405678
3,Central Toronto,M4S,Davisville,43.704324,-79.38879
4,Central Toronto,M4T,"Moore Park,Summerhill East",43.689574,-79.38316
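
The default inner join silently drops any postcode that has no coordinates. A minimal sketch of how that could be detected, assuming small stand-in frames (`left` and `right` are hypothetical, and `M9Z` is a made-up code added to force a mismatch):

```python
import pandas as pd

# Hypothetical stand-ins for data_bor and geospatial_data.
left = pd.DataFrame(
    {
        "Postcode": ["M4N", "M4P", "M9Z"],  # M9Z is made up and has no coordinates
        "Borough": ["Central Toronto", "Central Toronto", "Hypothetical"],
    }
)
right = pd.DataFrame(
    {
        "Postcode": ["M4N", "M4P"],
        "Latitude": [43.728020, 43.712751],
        "Longitude": [-79.388790, -79.390197],
    }
)

# A left join keeps every postcode; indicator=True adds a _merge column,
# and validate raises if a postcode is duplicated on either side.
merged = left.merge(right, on="Postcode", how="left", validate="one_to_one", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", "Postcode"].tolist()
print(unmatched)
```

An empty `unmatched` list would mean that every postcode found its coordinates, in which case the inner join used in the notebook loses nothing.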


**Now I will reorder the columns with a simple pandas column selection, as required.**

In [36]:
geo_tbl = geo_tbl[['Postcode','Borough','Neighbourhood','Latitude','Longitude']]
geo_tbl

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M4N,Central Toronto,Lawrence Park,43.728020,-79.388790
1,M4P,Central Toronto,Davisville North,43.712751,-79.390197
2,M4R,Central Toronto,North Toronto West,43.715383,-79.405678
3,M4S,Central Toronto,Davisville,43.704324,-79.388790
4,M4T,Central Toronto,"Moore Park,Summerhill East",43.689574,-79.383160
...,...,...,...,...,...
98,M6C,York,Humewood-Cedarvale,43.693781,-79.428191
99,M6E,York,Caledonia-Fairbanks,43.689026,-79.453512
100,M6M,York,"Del Ray,Keelesdale,Mount Dennis,Silverthorn",43.691116,-79.476013
101,M6N,York,"The Junction North,Runnymede",43.673185,-79.487262


End of Part 2

---

