# Segmenting and Clustering Neighborhoods in Toronto -- 2

### Scrape Wikipedia for data to create the dataframe

The dataframe will contain three columns: 

    - PostalCode
    - Borough
    - Neighborhood

1) Only process the cells that have an assigned borough. Ignore cells with a borough that is **Not assigned**.

2) If a cell has a borough but a **Not assigned** neighborhood, then the neighborhood will be the same as the borough.

3) More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row, with the neighborhoods separated by a comma, as shown in row 11 of the above table.



- Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.

- In the last cell of your notebook, use the `.shape` attribute to print the number of rows of your dataframe.

#### **Submit link when finished**

---

#### Import libraries

In [1]:
from bs4 import BeautifulSoup  # parses HTML for web scraping
import requests  # downloads the web page
import pandas as pd
import numpy as np

Use the `requests` library to download the webpage. Save the text of the response as a variable named `html_data`.

In [2]:
url = "https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&direction=prev&oldid=946126446"

html_data  = requests.get(url).text 
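
The download step above can be made a little more defensive. The sketch below is only one possible variant: the helper name `fetch_html` and the 30-second timeout are assumptions, not part of the original notebook. It fails early on an HTTP error instead of passing an error page on to the parser.

```python
import requests


def fetch_html(url, timeout=30):
    """Download a page and return its text (hypothetical helper)."""
    # timeout is an assumed value; without it a stalled request can hang
    response = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.text
```

With this helper, the cell above would become `html_data = fetch_html(url)`.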

Parse the HTML data using `BeautifulSoup`.

In [3]:
soup = BeautifulSoup(html_data,"html5lib")  # create a soup object using the variable 'html_data'

Check the page title to confirm that the correct webpage was downloaded.

In [4]:
print(soup.title)

<title>List of postal codes of Canada: M - Wikipedia</title>


Using BeautifulSoup, extract the table and store it in a dataframe. The dataframe should have the columns **PostalCode**, **Borough**, and **Neighborhood**. Fill in each variable with the correct data from the list `col`.

Hint: Print the `col` list to see what data to use
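
The row-by-row extraction described above can be tried on a small inline table before running it against the live page. This is a minimal sketch: `sample_html` is a hypothetical two-row stand-in for the Wikipedia table, and it uses the built-in `html.parser` rather than `html5lib` so that it needs no extra parser package.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical sample shaped like the Wikipedia table (not the real page)
sample_html = """
<table class="wikitable sortable">
  <tbody>
    <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
    <tr><td>M1A</td><td>Not assigned</td><td>Not assigned </td></tr>
    <tr><td>M3A</td><td>North York</td><td>Parkwoods </td></tr>
  </tbody>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

rows = []
for tr in sample_soup.find("table", {"class": "wikitable sortable"}).find_all("tr"):
    col = tr.find_all("td")
    if col:  # the header row has only <th> cells, so col is empty for it
        rows.append([cell.text.strip() for cell in col[:3]])

sample_df = pd.DataFrame(rows, columns=["PostalCode", "Borough", "Neighborhood"])
print(sample_df)
```

Printing `col` inside the loop shows the three `<td>` cells of each data row, which is what the hint refers to.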


In [5]:
rows = []  # collect one dict per table row, then build the dataframe once

# print(soup.find("table",{"class":"wikitable sortable"}).find("tbody").find_all("tr"))

for row in soup.find("table",{"class":"wikitable sortable"}).find("tbody").find_all("tr"):
    col = row.find_all("td")
    if col:  # header rows contain only <th> cells, so col is empty for them
        postalCode = col[0].text.strip()
        borough = col[1].text.strip()
        neighborhood = col[2].text.strip()
        rows.append({"PostalCode":postalCode, "Borough":borough, "Neighborhood":neighborhood})

# DataFrame.append was removed in pandas 2.0, so build the dataframe from the list
toronto_data = pd.DataFrame(rows, columns=["PostalCode", "Borough", "Neighborhood"])


Check the dataframe for a quick summary.

In [6]:
toronto_data.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


In [7]:
toronto_data.describe()

Unnamed: 0,PostalCode,Borough,Neighborhood
count,287,287,287
unique,180,11,209
top,M9V,Not assigned,Not assigned
freq,8,77,77


In [8]:
toronto_data.shape

(287, 3)

Copy the dataframe so the original is kept as a reference.

In [9]:
toronto_data_new = toronto_data.copy()
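
Plain assignment only binds a second name to the same dataframe, whereas `.copy()` creates an independent one. The difference can be sketched on a toy dataframe (all names and values below are illustrative only):

```python
import pandas as pd

original = pd.DataFrame({"Borough": ["Not assigned", "North York"]})

alias = original               # same object under a second name
independent = original.copy()  # separate dataframe with its own data

alias.loc[0, "Borough"] = "changed via alias"
independent.loc[1, "Borough"] = "changed via copy"

print(alias is original)        # the alias is the very same object
print(independent is original)  # the copy is not
print(original)                 # only the edit made through the alias shows up
```

Only a dataframe made with `.copy()` leaves the original available as an untouched reference.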

---

### 1) Ignore cells with a borough that is **Not assigned**.

In [10]:
toronto_data_new["Borough"] = toronto_data_new["Borough"].replace("Not assigned", np.nan)

toronto_data_new.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,,Not assigned
1,M2A,,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


Let's count the missing values in each column.

In [11]:
missing_data = toronto_data_new.isnull()
missing_data.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,False,True,False
1,False,True,False
2,False,False,False
3,False,False,False
4,False,False,False


In [12]:
for column in missing_data.columns.values.tolist():
    print(column)
    print (missing_data[column].value_counts())
    print("")  

PostalCode
False    287
Name: PostalCode, dtype: int64

Borough
False    210
True      77
Name: Borough, dtype: int64

Neighborhood
False    287
Name: Neighborhood, dtype: int64



`True` marks a missing value and `False` marks a value that is present in the dataset.
We can see that **Borough** has **77 missing values**.
Let's drop these 77 rows.

In [13]:
toronto_data_new = toronto_data_new.dropna(subset=["Borough"], axis=0)

# Reset the index because rows were dropped
toronto_data_new = toronto_data_new.reset_index(drop=True)

toronto_data_new.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,Lawrence Heights
4,M6A,North York,Lawrence Manor
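
The replace-then-`dropna` sequence above can also be expressed as a single boolean filter, without the round trip through `NaN`. This is a minimal sketch on a toy dataframe (`sample` and `assigned` are illustrative names), not a change to the notebook's own cells:

```python
import pandas as pd

sample = pd.DataFrame({
    "PostalCode": ["M1A", "M2A", "M3A", "M4A"],
    "Borough": ["Not assigned", "Not assigned", "North York", "North York"],
    "Neighborhood": ["Not assigned", "Not assigned", "Parkwoods", "Victoria Village"],
})

# Keep only rows whose borough is assigned, then renumber the index from 0
assigned = sample[sample["Borough"] != "Not assigned"].reset_index(drop=True)
print(assigned)
```

Both approaches keep the same rows. The `NaN` route has the side benefit that `isnull()` can be used to count the unassigned boroughs first.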


---

### 2) If a cell has a **borough** but a **Not assigned** neighborhood, then the **neighborhood** will be the same as the **borough**.

In [14]:
toronto_data_new['Neighborhood'] == 'Not assigned'

0      False
1      False
2      False
3      False
4      False
       ...  
205    False
206    False
207    False
208    False
209    False
Name: Neighborhood, Length: 210, dtype: bool

In [15]:
missing_data = toronto_data_new.isnull()
missing_data.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,False,False,False
1,False,False,False
2,False,False,False
3,False,False,False
4,False,False,False


In [16]:
for column in missing_data.columns.values.tolist():
    print(column)
    print (missing_data[column].value_counts())
    print("")  

PostalCode
False    210
Name: PostalCode, dtype: int64

Borough
False    210
Name: Borough, dtype: int64

Neighborhood
False    210
Name: Neighborhood, dtype: int64



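No `Not assigned` values were found in the **Neighborhood** column, so rule 2 requires no changes here. Note that `isnull()` only detects `NaN`; the string comparison in cell 14 is the check that looks for `Not assigned`.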
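
For data that does contain such rows, rule 2 could be applied as sketched below. The `M7A` row is hypothetical and only serves to show one unassigned neighborhood; counting the matches with `.sum()` also avoids having to read through truncated boolean output.

```python
import pandas as pd

sample = pd.DataFrame({
    "PostalCode": ["M5A", "M7A"],
    "Borough": ["Downtown Toronto", "Queen's Park"],
    "Neighborhood": ["Harbourfront", "Not assigned"],  # second row is hypothetical
})

# Boolean mask of rows with an unassigned neighborhood
unassigned = sample["Neighborhood"] == "Not assigned"
print(unassigned.sum())  # number of rows that need fixing

# Copy the borough into the neighborhood for exactly those rows
sample.loc[unassigned, "Neighborhood"] = sample.loc[unassigned, "Borough"]
print(sample)
```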

---

### 3) Check if more than one neighborhood can exist in one postal code area.

In [17]:
toronto_data_new["PostalCode"].value_counts()

M8Y    8
M9V    8
M5V    7
M9B    5
M4V    5
      ..
M6E    1
M4A    1
M1W    1
M3B    1
M9N    1
Name: PostalCode, Length: 103, dtype: int64

Rows that share the same PostalCode are grouped into one row, with their neighborhoods joined by commas.

In [18]:
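# Collect the neighborhoods of each (PostalCode, Borough) pair into a list
toronto_data_new = toronto_data_new.groupby(['PostalCode', 'Borough'])['Neighborhood'].apply(list)
# sample(frac=1) shuffles the row order; reset_index() turns the group keys back into columns
toronto_data_new = toronto_data_new.sample(frac=1).reset_index()
# Join each list into a single comma-separated string
toronto_data_new['Neighborhood'] = toronto_data_new['Neighborhood'].str.join(', ')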
toronto_data_new

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3B,North York,Don Mills North
1,M4H,East York,Thorncliffe Park
2,M3K,North York,"CFB Toronto, Downsview East"
3,M9P,Etobicoke,Westmount
4,M8Y,Etobicoke,"Humber Bay, King's Mill Park, Kingsway Park So..."
...,...,...,...
98,M6N,York,"The Junction North, Runnymede"
99,M6G,Downtown Toronto,Christie
100,M4S,Central Toronto,Davisville
101,M4L,East Toronto,"The Beaches West, India Bazaar"
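
The grouping step can also be written as one aggregation that joins the strings directly. The sketch below uses the M5A example from the task description on a toy dataframe. It omits the shuffle, so rows come out sorted by the group keys.

```python
import pandas as pd

sample = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Harbourfront", "Regent Park", "Parkwoods"],
})

# One row per (PostalCode, Borough); neighborhoods joined with ", "
grouped = (
    sample.groupby(["PostalCode", "Borough"])["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)
print(grouped)
```

After grouping, each postal code should appear exactly once, which `grouped["PostalCode"].is_unique` can confirm.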


---

### Final dataframe after data cleaning

In [19]:
toronto_data_new.shape

(103, 3)

---

### Download GeoSpatial Dataset

In [20]:
filename = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs_v1/Geospatial_Coordinates.csv"
toronto_data_geo = pd.read_csv(filename)  
toronto_data_geo

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
...,...,...,...
98,M9N,43.706876,-79.518188
99,M9P,43.696319,-79.532242
100,M9R,43.688905,-79.554724
101,M9V,43.739416,-79.588437


In [21]:
toronto_data_geo.rename(columns={'Postal Code':'PostalCode'},inplace=True)

Join the two dataframes: `toronto_data_new` and `toronto_data_geo`

In [22]:
toronto_data_new = toronto_data_new.join(toronto_data_geo.set_index('PostalCode'), on='PostalCode')

In [23]:
toronto_data_new

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3B,North York,Don Mills North,43.745906,-79.352188
1,M4H,East York,Thorncliffe Park,43.705369,-79.349372
2,M3K,North York,"CFB Toronto, Downsview East",43.737473,-79.464763
3,M9P,Etobicoke,Westmount,43.696319,-79.532242
4,M8Y,Etobicoke,"Humber Bay, King's Mill Park, Kingsway Park So...",43.636258,-79.498509
...,...,...,...,...,...
98,M6N,York,"The Junction North, Runnymede",43.673185,-79.487262
99,M6G,Downtown Toronto,Christie,43.669542,-79.422564
100,M4S,Central Toronto,Davisville,43.704324,-79.388790
101,M4L,East Toronto,"The Beaches West, India Bazaar",43.668999,-79.315572
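
A join like the one above silently leaves `NaN` coordinates for any postal code missing from the geospatial file. One way to surface such rows is `merge` with `indicator=True`, sketched here on a toy example. The first two rows reuse values shown above, while the `X0X` code is made up to demonstrate an unmatched row.

```python
import pandas as pd

neighborhoods = pd.DataFrame({
    "PostalCode": ["M3B", "M9P", "X0X"],  # X0X is a made-up code
    "Borough": ["North York", "Etobicoke", "Nowhere"],
    "Neighborhood": ["Don Mills North", "Westmount", "Nowhere"],
})
coords = pd.DataFrame({
    "PostalCode": ["M3B", "M9P"],
    "Latitude": [43.745906, 43.696319],
    "Longitude": [-79.352188, -79.532242],
})

# Left merge keeps every neighborhood row; validate checks key uniqueness
merged = neighborhoods.merge(
    coords, on="PostalCode", how="left", validate="one_to_one", indicator=True
)

# Rows present only on the left side have no coordinates
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched[["PostalCode", "Latitude", "Longitude"]])
```

An empty `unmatched` frame would indicate that every postal code received coordinates.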


---