# Segmenting and Clustering Neighborhoods In Toronto

## Project description
This project scrapes a Wikipedia page, then cleans and wrangles the data into a structured pandas dataframe, which is later used to explore, segment, and cluster the neighborhoods of Toronto, Canada. Clustering is carried out with K-Means.

All tasks (*web scraping, wrangling, data cleaning, exploring, and clustering the neighborhoods*) are implemented in the same notebook for ease of evaluation.

## Installing / Importing the required libraries

In [1]:
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner
import random # library for random number generation
from bs4 import BeautifulSoup

#!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

# libraries for displaying images
from IPython.display import Image 
from IPython.core.display import HTML 
from IPython.display import display_html
    
import requests # library to handle requests
from pandas.io.json import json_normalize # transform a JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes
print('Folium installed')
import folium # plotting library

print('Libraries imported.')


## 1. Data scraping, wrangling, pre-processing and cleaning

### 1.1 Scraping the Wikipedia page, extracting the data, and converting it into a pandas dataframe

Scraping the wiki page

In [2]:
# getting data from internet
dataSourceLink = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
rawData = requests.get(dataSourceLink).text

soup = BeautifulSoup(rawData,'lxml') # parse the HTML/XML codes by using beautiful soup

Get the data from the HTML page and store it into a list

In [3]:
# using soup object, iterate the wikitable
dataTable = soup.find('table')
row = []  #for rows
columnsName = []  # to store headers

for index, tr in enumerate(dataTable.find_all('tr')):
    results = []
    for td in tr.find_all(['th', 'td']):
        results.append(td.text.rstrip())
        
    # as first row of dataTable is the header (title for columns) and rest are rows
    if (index == 0):
        columnsName = results
    else:
        row.append(results)  
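
The row-by-row extraction above can be tried offline on a small inline table. The sketch below is only an illustration: the HTML snippet is a hypothetical miniature of the Wikipedia table, and it uses Python's built-in `html.parser` backend rather than `lxml`, so no extra parser is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the Wikipedia table, for illustration only.
# Cell text ends with a newline, as it typically does in the real page.
html = """
<table>
  <tr><th>Postal Code
</th><th>Borough
</th><th>Neighborhood
</th></tr>
  <tr><td>M1A
</td><td>Not assigned
</td><td>Not assigned
</td></tr>
  <tr><td>M3A
</td><td>North York
</td><td>Parkwoods
</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")  # first <table> element in the document

header, rows = [], []
for index, tr in enumerate(table.find_all("tr")):
    # Collect header (<th>) and data (<td>) cells, trimming surrounding whitespace.
    cells = [cell.get_text().strip() for cell in tr.find_all(["th", "td"])]
    if index == 0:
        header = cells   # the first row holds the column names
    else:
        rows.append(cells)

print(header)
print(rows)
```

Trimming each cell matters because the trailing newline would otherwise end up inside the column names and values.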

Convert into pandas Dataframe

In [4]:
df_toronto = pd.DataFrame(data = row, columns = columnsName)
df_toronto.head()

| | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |


### 1.2 Data pre-processing and cleaning 

Remove the rows where Borough is "Not assigned"

In [5]:
df_toronto = df_toronto[df_toronto['Borough'] != 'Not assigned']
df_toronto.head()

| | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 5 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 6 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
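
Note that the filtered output keeps the original row labels (it starts at 2, not 0), because boolean filtering does not renumber the index. A minimal sketch on hypothetical sample rows, showing the filter followed by an optional `reset_index(drop=True)`:

```python
import pandas as pd

# Hypothetical sample rows with the same column names as the scraped table.
sample = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighborhood": ["Not assigned", "Parkwoods", "Victoria Village"],
})

# Keep only rows whose borough is assigned; the original index labels are preserved.
assigned = sample[sample["Borough"] != "Not assigned"]
print(assigned.index.tolist())

# Optionally renumber the rows from 0 and discard the old labels.
assigned = assigned.reset_index(drop=True)
print(assigned)
```

In this notebook the renumbering happens implicitly in the next step, since `groupby(...).reset_index()` builds a fresh index.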


Combine the neighborhoods with the same Postal Code

In [6]:
df_toronto = df_toronto.groupby(["Postal Code", "Borough"])["Neighborhood"].apply(", ".join).reset_index()
df_toronto.head()

| | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Malvern, Rouge |
| 1 | M1C | Scarborough | Rouge Hill, Port Union, Highland Creek |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
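
To see what the grouping does when one postal code spans several rows, here is a small sketch on hypothetical data in which "M5A" appears twice. It assumes the same three column names as above.

```python
import pandas as pd

# Hypothetical rows: the postal code "M5A" is listed once per neighborhood.
sample = pd.DataFrame({
    "Postal Code": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Regent Park", "Harbourfront", "Parkwoods"],
})

# Group by postal code and borough, then join each group's neighborhoods
# into a single comma-separated string.
combined = (
    sample.groupby(["Postal Code", "Borough"])["Neighborhood"]
    .apply(", ".join)
    .reset_index()
)
print(combined)
```

The result has one row per postal code, sorted by the group keys, which is why the notebook output above starts at M1B.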


Replace any neighborhood name that is 'Not assigned' with the name of its borough

In [7]:
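# where the neighborhood is "Not assigned", take the borough name from the same row
df_toronto["Neighborhood"] = df_toronto["Neighborhood"].mask(df_toronto["Neighborhood"] == "Not assigned", df_toronto["Borough"])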
df_toronto.head(10)

| | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Malvern, Rouge |
| 1 | M1C | Scarborough | Rouge Hill, Port Union, Highland Creek |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
| 5 | M1J | Scarborough | Scarborough Village |
| 6 | M1K | Scarborough | Kennedy Park, Ionview, East Birchmount Park |
| 7 | M1L | Scarborough | Golden Mile, Clairlea, Oakridge |
| 8 | M1M | Scarborough | Cliffside, Cliffcrest, Scarborough Village West |
| 9 | M1N | Scarborough | Birch Cliff, Cliffside West |
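
None of the rows shown here has an unassigned neighborhood, so the effect of this step is not visible in the output. A minimal sketch on hypothetical rows, using `Series.mask` to take the borough name wherever the neighborhood is 'Not assigned':

```python
import pandas as pd

# Hypothetical rows: the first one has a borough but no neighborhood name.
sample = pd.DataFrame({
    "Postal Code": ["M7A", "M3A"],
    "Borough": ["Queen's Park", "North York"],
    "Neighborhood": ["Not assigned", "Parkwoods"],
})

# Where the condition is True, mask() replaces the value with the
# row-aligned entry from the "Borough" column; other rows are unchanged.
unassigned = sample["Neighborhood"] == "Not assigned"
sample["Neighborhood"] = sample["Neighborhood"].mask(unassigned, sample["Borough"])
print(sample)
```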


### 1.3 Print the shape of Dataframe

Print the shape and the number of rows of the dataframe using the `.shape` attribute.

In [8]:
print("Shape of DataFrame df_toronto:", df_toronto.shape)
print("Total number of rows in df_toronto:", df_toronto.shape[0])

Shape of DataFrame df_toronto: (103, 3)
Total number of rows in df_toronto: 103


## 2. Importing and Merging the Geospatial Data

Add the geographical coordinates of each postal code from the provided CSV file

In [9]:
df_toronto_lat_lon = pd.read_csv('https://cocl.us/Geospatial_data')
df_toronto_lat_lon.head()

| | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


Merge the two dataframes (df_toronto & df_toronto_lat_lon) into one dataframe.

In [10]:
df_toronto = pd.merge(df_toronto, df_toronto_lat_lon, on = 'Postal Code')
df_toronto.head(15)

| | Postal Code | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Malvern, Rouge | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Rouge Hill, Port Union, Highland Creek | 43.784535 | -79.160497 |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill | 43.763573 | -79.188711 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |
| 5 | M1J | Scarborough | Scarborough Village | 43.744734 | -79.239476 |
| 6 | M1K | Scarborough | Kennedy Park, Ionview, East Birchmount Park | 43.727929 | -79.262029 |
| 7 | M1L | Scarborough | Golden Mile, Clairlea, Oakridge | 43.711112 | -79.284577 |
| 8 | M1M | Scarborough | Cliffside, Cliffcrest, Scarborough Village West | 43.716316 | -79.239476 |
| 9 | M1N | Scarborough | Birch Cliff, Cliffside West | 43.692657 | -79.264848 |
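
`pd.merge` defaults to an inner join, so any postal code missing from the coordinates file would be dropped silently. One way to check for this, sketched here on hypothetical data (the code "M9Z" is made up to force a mismatch), is a left join with `indicator=True`:

```python
import pandas as pd

# Hypothetical inputs: "M9Z" is a made-up postal code with no coordinates.
neighborhoods = pd.DataFrame({
    "Postal Code": ["M1B", "M1C", "M9Z"],
    "Borough": ["Scarborough", "Scarborough", "Etobicoke"],
})
coordinates = pd.DataFrame({
    "Postal Code": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# A left join keeps every neighborhood row; indicator=True adds a "_merge"
# column, and validate="one_to_one" raises if a postal code is duplicated
# on either side.
checked = pd.merge(
    neighborhoods, coordinates,
    on="Postal Code", how="left",
    indicator=True, validate="one_to_one",
)

# Postal codes that found no match in the coordinates table.
missing = checked.loc[checked["_merge"] == "left_only", "Postal Code"].tolist()
print(missing)
```

If `missing` is empty, the inner join used in the notebook loses no rows.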
