## Introduction

This is the peer-graded assignment **Segmenting and Clustering Neighborhoods in Toronto.**

This notebook does the following:  
**1. Scrape the data from a Wikipedia page**  
    
       a. The Beautiful Soup package is installed and imported along with urllib (the page itself is fetched with requests)  
       b. The page HTML is parsed with the html5lib parser  
       
       
**2. Convert the data obtained into a pandas dataframe**  
    
       a. The values in the table are collected into 3 lists, namely **Postal Code, Borough and Neighbourhood**  
       b. The 3 lists are combined into a pandas dataframe  
       
       
**3. Clean up the dataframe**  
    
       a. Dropping all rows whose **Borough** is **"Not assigned"**  
       b. Replacing a **"Not assigned"** Neighbourhood with the name of its **Borough**  

In [4]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes # installs geopy; comment this line out if it is already installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (top-level import, available since pandas 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means for the clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /home/jupyterlab/conda/envs/python

  added / updated specs:
    - geopy


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    geographiclib-1.50         |             py_0          34 KB  conda-forge
    geopy-1.22.0               |     pyh9f0ad1d_0          63 KB  conda-forge
    ------------------------------------------------------------
                                           Total:          97 KB

The following NEW packages will be INSTALLED:

  geographiclib      conda-forge/noarch::geographiclib-1.50-py_0
  geopy              conda-forge/noarch::geopy-1.22.0-pyh9f0ad1d_0



Downloading and Extracting Packages
geopy-1.22.0         | 63 KB     | ##################################### | 100% 
geographiclib-1.50   | 34 KB     | ###############################

### Scraping data from Wikipedia

**1. Installing Beautiful Soup and importing urllib**

In [5]:
# import the library we use to open URLs
import urllib.request

# install Beautiful Soup
!pip install bs4

print("Install done")

Collecting bs4
  Downloading https://files.pythonhosted.org/packages/10/ed/7e8b97591f6f456174139ec089c769f89a94a1a4025fe967691de971f314/bs4-0.0.1.tar.gz
Collecting beautifulsoup4 (from bs4)
  Downloading https://files.pythonhosted.org/packages/66/25/ff030e2437265616a1e9b25ccc864e0371a0bc3adb7c5a404fd661c6f4f6/beautifulsoup4-4.9.1-py3-none-any.whl (115kB)
     |████████████████████████████████| 122kB 6.1MB/s eta 0:00:01
Collecting soupsieve>1.2 (from beautifulsoup4->bs4)
  Downloading https://files.pythonhosted.org/packages/6f/8f/457f4a5390eeae1cc3aeab89deb7724c965be841ffca6cfca9197482e470/soupsieve-2.0.1-py3-none-any.whl
Building wheels for collected packages: bs4
  Building wheel for bs4 (setup.py) ... done
  Stored in directory: /home/jupyterlab/.cache/pip/wheels/a0/b0/b2/4f80b9456b87abedbc0bf2d52235414c3467d8889be38dd472
Successfully built bs4
Installing collected packages: soupsieve, beautifulsoup4, bs4
Successfully installed beautifulsoup4-4.9.1 bs4-0.0.

**2. Fetching the page from the URL**

In [10]:
# import the BeautifulSoup library so we can parse HTML and XML documents
from bs4 import BeautifulSoup

import requests

# the URL of the Wikipedia page that holds the postal code table
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

# fetch the URL with requests and put the HTML text into the page variable
page = requests.get(url).text

# parse the HTML from our URL into the BeautifulSoup parse tree format
soup = BeautifulSoup(page,'html5lib')

# show the page title to confirm the right page was fetched
soup.title.string

'List of postal codes of Canada: M - Wikipedia'
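
The notebook imports `urllib.request` but fetches the page with `requests`. As a minimal sketch of the same fetch step using only the standard library, the helper below downloads a URL and decodes it to text. The name `fetch_html` and the 10-second timeout are assumptions made for illustration, and some servers may reject urllib's default User-Agent, so this is not a drop-in replacement for the cell above.

```python
import urllib.request


def fetch_html(url, timeout=10):
    """Download a page and return its body as text (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # Use the charset declared by the server, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)
```

The returned string can be passed to `BeautifulSoup(...)` in the same way as `page` above.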

**3. Find the table and loop through its rows to collect the three columns into 3 lists**

In [11]:
# 'find_all' returns every <table> tag in the HTML; kept in 'all_tables' for inspection
all_tables = soup.find_all("table")

# the postal code table is the one with the 'wikitable sortable' class
right_table = soup.find('table', class_='wikitable sortable')

# loop through the rows and build one list per column;
# the header row is skipped because it has <th> cells, not <td> cells
A = []  # postal codes
B = []  # boroughs
C = []  # neighbourhoods

for row in right_table.find_all('tr'):
    cells = row.find_all('td')
    if len(cells) == 3:
        # take the first text node of each cell; values still end with "\n" here
        A.append(cells[0].find(string=True))
        B.append(cells[1].find(string=True))
        C.append(cells[2].find(string=True))
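
The row loop above can be tried without a network connection. The sketch below runs the same idea on a small inline HTML string; `SAMPLE_HTML` and `extract_rows` are hypothetical names, the sample rows are a hand-written miniature of the Wikipedia table, and it uses the built-in `html.parser` with `get_text(strip=True)` rather than the notebook's `html5lib` and `find(...)` calls.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the Wikipedia table, for illustration only.
SAMPLE_HTML = """
<table class="wikitable sortable">
  <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A
</td><td>Not assigned
</td><td>
</td></tr>
  <tr><td>M3A
</td><td>North York
</td><td>Parkwoods
</td></tr>
</table>
"""


def extract_rows(html):
    """Return one (postal code, borough, neighbourhood) tuple per data row."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="wikitable")
    rows = []
    for tr in table.find_all("tr"):
        cells = tr.find_all("td")
        if len(cells) == 3:  # the header row has <th> cells, so it is skipped
            # get_text(strip=True) drops the trailing newline in each cell
            rows.append(tuple(cell.get_text(strip=True) for cell in cells))
    return rows


rows = extract_rows(SAMPLE_HTML)
```

Stripping whitespace at extraction time means no separate newline clean-up step is needed afterwards.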
        

**4. Convert all the lists into a pandas dataframe**

In [12]:
#convert the lists into dataframe

df=pd.DataFrame(A,columns=['Postal Code'])
df['Borough']=B
df['Neighbourhood']=C
df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A\n,Not assigned\n,\n
1,M2A\n,Not assigned\n,\n
2,M3A\n,North York\n,Parkwoods\n
3,M4A\n,North York\n,Victoria Village\n
4,M5A\n,Downtown Toronto\n,"Regent Park, Harbourfront\n"
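
As an alternative sketch of the list-to-dataframe step, the three lists can be passed to `pd.DataFrame` in a single call as a dictionary of columns. The list names and values below are toy data chosen for illustration, not the scraped lists.

```python
import pandas as pd

# Toy stand-ins for the lists A, B and C (assumed values).
postal_codes = ["M3A\n", "M4A\n"]
boroughs = ["North York\n", "North York\n"]
neighbourhoods = ["Parkwoods\n", "Victoria Village\n"]

# Each dictionary key becomes a column; column order follows key order.
toy_df = pd.DataFrame({
    "Postal Code": postal_codes,
    "Borough": boroughs,
    "Neighbourhood": neighbourhoods,
})
```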


**5. Removing the "\n" from the values**

In [24]:
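# strip only the trailing "\n" rather than blindly dropping the last character
clean_A = [sub.rstrip('\n') for sub in A]
clean_B = [sub.rstrip('\n') for sub in B]
clean_C = [sub.rstrip('\n') for sub in C]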

clean_df=pd.DataFrame(clean_A,columns=['Postal Code'])
clean_df['Borough']=clean_B
clean_df['Neighbourhood']=clean_C
clean_df.head(10)

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
7,M8A,Not assigned,
8,M9A,Etobicoke,Islington Avenue
9,M1B,Scarborough,"Malvern, Rouge"
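
The same clean-up can also be done on the dataframe itself instead of on the lists. The sketch below, on toy data with assumed values, applies the pandas string method `str.strip` to every column; it assumes all columns hold text.

```python
import pandas as pd

# Toy frame with the same trailing "\n" pattern as the scraped data.
raw_df = pd.DataFrame({
    "Postal Code": ["M1A\n", "M3A\n"],
    "Borough": ["Not assigned\n", "North York\n"],
    "Neighbourhood": ["\n", "Parkwoods\n"],
})

# Strip leading and trailing whitespace (including "\n") from every column.
stripped_df = raw_df.apply(lambda column: column.str.strip())
```

A cell that contained only `"\n"` becomes an empty string, which matches the empty Neighbourhood values in the output above.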


## Data cleaning and Pre-processing

**1. Drop rows whose Borough is "Not assigned"**

In [23]:
# get index of the rows whose Borough is "Not assigned" (the "\n" is already stripped in clean_df)
indexNames = clean_df[clean_df['Borough'] == 'Not assigned'].index

# Delete the above identified rows
clean_df.drop(indexNames , inplace=True)
clean_df.head(10)

Unnamed: 0,Postal Code,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
8,M9A,Etobicoke,Islington Avenue
9,M1B,Scarborough,"Malvern, Rouge"
11,M3B,North York,Don Mills
12,M4B,East York,"Parkview Hill, Woodbine Gardens"
13,M5B,Downtown Toronto,"Garden District, Ryerson"


In [25]:
clean_df.shape

(180, 3)
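
Dropping rows by index works, but the same filter can be written as a boolean mask. The sketch below uses toy data with assumed values and returns a new frame instead of modifying one in place; resetting the index is an optional extra, not something the notebook does.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighbourhood": ["", "Parkwoods", "Victoria Village"],
})

# Keep only rows whose Borough is assigned, then renumber the index from 0.
assigned_df = toy_df[toy_df["Borough"] != "Not assigned"].reset_index(drop=True)
```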

**2. Replacing a "Not assigned" Neighbourhood with its Borough**

In [26]:
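# rows whose Neighbourhood is "Not assigned" take the name of their Borough;
# compare and assign within clean_df so the mask and the values line up
mask = clean_df['Neighbourhood'] == 'Not assigned'
clean_df.loc[mask, 'Neighbourhood'] = clean_df.loc[mask, 'Borough']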
clean_df.shape

(180, 3)
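
The replacement step can also be expressed with `Series.mask`, which swaps in values from another series wherever a condition holds. The sketch below uses toy data with assumed values; treating an empty string as unassigned too is an assumption added here, since the scraped table appears to leave such cells blank.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "Postal Code": ["M5A", "M7A", "M9A"],
    "Borough": ["Downtown Toronto", "Queen's Park", "Etobicoke"],
    "Neighbourhood": ["Regent Park", "Not assigned", ""],
})

# True where the neighbourhood is missing ("Not assigned" or empty string).
missing = toy_df["Neighbourhood"].isin(["Not assigned", ""])

# Where the condition is True, take the value from the Borough column.
toy_df["Neighbourhood"] = toy_df["Neighbourhood"].mask(missing, toy_df["Borough"])
```

The number of rows does not change, only the contents of the Neighbourhood column.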