# Segmenting and Clustering Neighborhoods in Toronto

Tasks: 
1. Start by creating a new Notebook for this assignment.

2. Use the Notebook to build the code to scrape the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, in order to obtain the data that is in the table of postal codes and to transform the data into a pandas dataframe like the one shown below:

3. To create the above dataframe:
     a. The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood. Only process the cells that have an assigned borough. 
     b. Ignore cells with a borough that is Not assigned.
     c. More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.
     d. If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough. So for the 9th cell in the table on the Wikipedia page, the value of the Borough and the Neighborhood columns will be Queen's Park.
     e. Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.
     f. In the last cell of your notebook, use the .shape method to print the number of rows of your dataframe.

### Step 1: Import all necessary libraries 

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
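try:
    from pandas import json_normalize # transform JSON file into a pandas dataframe (pandas >= 1.0)
except ImportError:
    from pandas.io.json import json_normalize # import path used by older pandas versions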

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

print('Libraries imported.')

Libraries imported.


In [2]:
source = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
html_doc = requests.get(source).text

In [3]:
from bs4 import BeautifulSoup 
soup = BeautifulSoup(html_doc, 'html.parser') 

#print(soup.prettify())

## Task 3A: The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood.

Only process the cells that have an assigned borough.

In [4]:
#Declare variables for data needed for dataframes 

PostalCode = []
Borough = []
Neighborhood = []

# Use BeautifulSoup's find() to get the first <tbody> element, assumed to hold the postal code table
tbody = soup.find('tbody')
#print(tbody.find_all('td')) 

In [5]:
# enumerate() pairs each <td> cell with a running index that starts at 0.
for index, value in enumerate(tbody.find_all('td')):
    # Each table row holds three cells, so index % 3 gives the cell's column:
    # 0 is the postal code, 1 the borough, 2 the neighborhood.
    # strip() removes surrounding whitespace and newlines from the cell text.
    if index % 3 == 0:
        PostalCode.append(value.text.strip())
    elif index % 3 == 1:
        Borough.append(value.text.strip())
    else:
        Neighborhood.append(value.text.strip())

# Map each column name to its list of values; the dict keys become the dataframe columns.

dic_colNames = { "PostalCode":PostalCode, "Borough":Borough, "Neighborhood": Neighborhood }
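
The index arithmetic above relies on every table row contributing exactly three cells, in the order postal code, borough, neighborhood. As a minimal sketch of the same routing on a plain list (the `cells` values are illustrative, not scraped output), stepped slices give the same three columns:

```python
# Illustrative flat list of cell texts, three per table row (not scraped data).
cells = ["M1A", "Not assigned", "Not assigned",
         "M3A", "North York", "Parkwoods",
         "M5A", "Downtown Toronto", "Harbourfront"]

postal_codes, boroughs, neighborhoods = [], [], []
for index, text in enumerate(cells):
    # index % 3 is the cell's column position within its row
    if index % 3 == 0:
        postal_codes.append(text)
    elif index % 3 == 1:
        boroughs.append(text)
    else:
        neighborhoods.append(text)

# Stepped slices select the same columns without an explicit loop.
assert postal_codes == cells[0::3]
assert boroughs == cells[1::3]
assert neighborhoods == cells[2::3]
print(postal_codes)
```

If a row ever holds more or fewer than three cells, both forms misalign the columns without raising an error, so it is worth checking that the three lists end up the same length.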


In [6]:
# Build the dataframe from the dict of column lists with pandas.DataFrame.from_dict
toronto_df = pd.DataFrame.from_dict(dic_colNames)


In [7]:
#Print the first five rows 
toronto_df.head()

Unnamed: 0,Borough,Neighborhood,PostalCode
0,Not assigned,Not assigned,M1A
1,Not assigned,Not assigned,M2A
2,North York,Parkwoods,M3A
3,North York,Victoria Village,M4A
4,Downtown Toronto,Harbourfront,M5A


## Task 3B: Ignore cells with a borough that is Not assigned.

In [8]:
# Keep only the rows whose borough is assigned.
toronto_df = toronto_df[toronto_df.Borough != 'Not assigned']
# Replace the old index with a fresh default one after the rows were dropped.
toronto_df.reset_index(drop=True, inplace=True)
toronto_df.head()

Unnamed: 0,Borough,Neighborhood,PostalCode
0,North York,Parkwoods,M3A
1,North York,Victoria Village,M4A
2,Downtown Toronto,Harbourfront,M5A
3,Downtown Toronto,Regent Park,M5A
4,North York,Lawrence Heights,M6A
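
The task list also asks that a row with an assigned borough but a Not assigned neighborhood take the borough name as its neighborhood. A minimal sketch of that step on a small illustrative dataframe (the rows and the name `sample_df` are assumptions for demonstration, not the scraped data):

```python
import pandas as pd

# Illustrative rows only; the real rows come from the scraped table.
sample_df = pd.DataFrame({
    "PostalCode": ["M5A", "M7A"],
    "Borough": ["Downtown Toronto", "Queen's Park"],
    "Neighborhood": ["Harbourfront", "Not assigned"],
})

# Boolean mask of the rows whose neighborhood is missing
missing = sample_df["Neighborhood"] == "Not assigned"

# Copy the borough name into the neighborhood for those rows only
sample_df.loc[missing, "Neighborhood"] = sample_df.loc[missing, "Borough"]

print(sample_df)
```

The same two lines could be applied to `toronto_df` before the rows are grouped by postal code.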


## Task 3C: More than one neighborhood can exist in one postal code area.

For example, in the table on the Wikipedia page, M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma, as shown in row 11 in the above table.

In [10]:

# Aggregation to apply per column. PostalCode and Borough are expected to be
# identical within a group, so 'min' simply returns that shared value.
# The lambda (an anonymous function) joins the group's neighborhood names with commas.
groupsDic = {'PostalCode': 'min',
             'Borough': 'min',
             'Neighborhood': lambda neighbourhood: ','.join(neighbourhood)}

# groupby splits the dataframe into one group per postal code;
# agg then applies groupsDic to each group.
grouped_torontodf = toronto_df.groupby(toronto_df['PostalCode']).agg(groupsDic)
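
As a sketch of what this aggregation does, the following groups a small illustrative dataframe by postal code and joins the neighborhood names with a comma. It groups by column name with `as_index=False` and uses the `'first'` aggregation for the borough; these are alternative spellings chosen for illustration, not the notebook's exact call.

```python
import pandas as pd

# Illustrative rows only: M5A appears twice with different neighborhoods.
sample_df = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M6A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Harbourfront", "Regent Park", "Lawrence Heights"],
})

combined = (
    sample_df
    .groupby("PostalCode", as_index=False)  # keep PostalCode as a regular column
    .agg({"Borough": "first",               # borough is shared within a postal code
          "Neighborhood": ",".join})        # join the group's names with commas
)
print(combined)
```

With `as_index=False` the postal code stays a column, so no `reset_index` call is needed afterwards.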


In [11]:
grouped_torontodf.reset_index(drop=True, inplace=True)
grouped_torontodf

Unnamed: 0,Neighborhood,PostalCode,Borough
0,"Rouge,Malvern",M1B,Scarborough
1,"Highland Creek,Rouge Hill,Port Union",M1C,Scarborough
2,"Guildwood,Morningside,West Hill",M1E,Scarborough
3,Woburn,M1G,Scarborough
4,Cedarbrae,M1H,Scarborough
5,Scarborough Village,M1J,Scarborough
6,"East Birchmount Park,Ionview,Kennedy Park",M1K,Scarborough
7,"Clairlea,Golden Mile,Oakridge",M1L,Scarborough
8,"Cliffcrest,Cliffside,Scarborough Village West",M1M,Scarborough
9,"Birch Cliff,Cliffside West",M1N,Scarborough


## Task 3E: Clean the notebook

## Task 3F: In the last cell of your notebook, use the .shape attribute to print the number of rows of your dataframe.

A neighborhood that is Not assigned takes the same name as its borough. Shape of the postal code dataframe:

In [12]:
grouped_torontodf.shape


(103, 3)
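
`.shape` is an attribute that holds a `(rows, columns)` tuple, so the row count alone is its first element. A minimal sketch on an illustrative two-row dataframe (`sample_df` is a made-up name for demonstration):

```python
import pandas as pd

# Illustrative rows only.
sample_df = pd.DataFrame({
    "PostalCode": ["M1B", "M1C"],
    "Borough": ["Scarborough", "Scarborough"],
    "Neighborhood": ["Rouge,Malvern", "Highland Creek,Rouge Hill,Port Union"],
})

# Unpack the (rows, columns) tuple
n_rows, n_cols = sample_df.shape
print(f"rows: {n_rows}, columns: {n_cols}")
```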

In [13]:
# Load the latitude and longitude of each postal code
canada_spatial = pd.read_csv('http://cocl.us/Geospatial_data')
canada_spatial

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
5,M1J,43.744734,-79.239476
6,M1K,43.727929,-79.262029
7,M1L,43.711112,-79.284577
8,M1M,43.716316,-79.239476
9,M1N,43.692657,-79.264848


In [14]:
# Rename the key column so it matches the PostalCode column of the neighborhood dataframe
canada_spatial.rename(columns={'Postal Code':'PostalCode'},inplace=True)
canada_spatial

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
5,M1J,43.744734,-79.239476
6,M1K,43.727929,-79.262029
7,M1L,43.711112,-79.284577
8,M1M,43.716316,-79.239476
9,M1N,43.692657,-79.264848


In [28]:
# Left-join the neighborhood data onto the coordinates by postal code
new = canada_spatial.merge(grouped_torontodf, on='PostalCode', how='left')
new

Unnamed: 0,PostalCode,Latitude,Longitude,Neighborhood,Borough
0,M1B,43.806686,-79.194353,"Rouge,Malvern",Scarborough
1,M1C,43.784535,-79.160497,"Highland Creek,Rouge Hill,Port Union",Scarborough
2,M1E,43.763573,-79.188711,"Guildwood,Morningside,West Hill",Scarborough
3,M1G,43.770992,-79.216917,Woburn,Scarborough
4,M1H,43.773136,-79.239476,Cedarbrae,Scarborough
5,M1J,43.744734,-79.239476,Scarborough Village,Scarborough
6,M1K,43.727929,-79.262029,"East Birchmount Park,Ionview,Kennedy Park",Scarborough
7,M1L,43.711112,-79.284577,"Clairlea,Golden Mile,Oakridge",Scarborough
8,M1M,43.716316,-79.239476,"Cliffcrest,Cliffside,Scarborough Village West",Scarborough
9,M1N,43.692657,-79.264848,"Birch Cliff,Cliffside West",Scarborough
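
Because the merge uses `how='left'`, any postal code present in the coordinates file but absent from the grouped dataframe would come through with missing Borough and Neighborhood values. One way to check for that, sketched here on illustrative data (the two frames and the `M9Z` code are made up for demonstration), is the `indicator` option of `merge`:

```python
import pandas as pd

# Illustrative coordinates; M9Z is a made-up code with no matching area row.
coords = pd.DataFrame({
    "PostalCode": ["M1B", "M1C", "M9Z"],
    "Latitude": [43.806686, 43.784535, 0.0],
    "Longitude": [-79.194353, -79.160497, 0.0],
})
areas = pd.DataFrame({
    "PostalCode": ["M1B", "M1C"],
    "Borough": ["Scarborough", "Scarborough"],
    "Neighborhood": ["Rouge,Malvern", "Highland Creek,Rouge Hill,Port Union"],
})

# indicator=True adds a _merge column saying where each row was found
merged = coords.merge(areas, on="PostalCode", how="left", indicator=True)

# Rows marked 'left_only' had coordinates but no borough or neighborhood
unmatched = merged.loc[merged["_merge"] == "left_only", "PostalCode"].tolist()
print(unmatched)
```

An empty list would mean every postal code in the coordinates file found a match.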
