# **THIS NOTEBOOK IS CREATED FOR THE CAPSTONE PROJECT**       

## *1. Data Scraping*
This section scrapes a data table from a website into a pandas dataframe and then drops the rows that are not needed.
The raw table has 180 rows; after cleaning, 103 rows should remain.

###### Step 1: Importing the dependencies

In [1]:
import numpy as np # library to handle data in a vectorized manner
import pandas as pd # library for data analysis

###### Step 2: Scraping the data table from the website

In [2]:
# pd.read_html returns a list of DataFrames, one per table found on the page
df1 = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')

# Converting the first element of the list, the first table from the website, into a dataframe
df = pd.DataFrame(df1[0])

# Validating the dataframe
df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


In [3]:
# Checking the length
len(df)

180

In [4]:
# Count how many 'Not assigned' values exist in the 'Borough' column
numOfRows = int((df['Borough'] == 'Not assigned').sum())
numOfRows

77

In [5]:
# Collecting the index labels of the rows whose 'Borough' is 'Not assigned'
index = df[df['Borough'] == 'Not assigned'].index
# Dropping those rows in place (the row count is validated in the next cell)
df.drop(index, inplace=True)

In [6]:
# Validating the resulting length
len(df)

103
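
As an alternative to collecting index labels and calling `drop`, the same filter can be written as a boolean mask. The sketch below is only an illustration: it runs on a small hand-built frame (`toy`, a name made up here) holding the five rows from the `df.head()` output above, not on the full scraped table.

```python
import pandas as pd

# Toy frame (hypothetical name) with the five rows shown by df.head() above
toy = pd.DataFrame({
    'Postal Code': ['M1A', 'M2A', 'M3A', 'M4A', 'M5A'],
    'Borough': ['Not assigned', 'Not assigned', 'North York',
                'North York', 'Downtown Toronto'],
    'Neighbourhood': ['Not assigned', 'Not assigned', 'Parkwoods',
                      'Victoria Village', 'Regent Park, Harbourfront'],
})

# Keep only the rows whose borough is assigned, then renumber the index from 0
cleaned = toy[toy['Borough'] != 'Not assigned'].reset_index(drop=True)
print(len(toy), len(cleaned))
```

Unlike `drop(..., inplace=True)`, this leaves the original frame untouched, so the cell can be re-run without side effects.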

In [8]:
# Checking whether any 'Not assigned' value remains in the 'Neighbourhood' column
numOfRows = int((df['Neighbourhood'] == 'Not assigned').sum())
numOfRows

0

In [9]:
# Checking whether any duplicated value exists in the 'Postal Code' column
numOfRows = int(df["Postal Code"].duplicated().sum())
numOfRows

0

###### RESULT:
###### - 103 rows of data
###### - No 'Not assigned' values in the last 2 columns
###### - No duplicated postal codes, so each postal code occupies a single row and its neighbourhoods are already combined there
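
The three checks behind this result can be bundled into one small helper. This is a sketch under assumptions: `check_cleaned` and `sample` are hypothetical names introduced here, and the sample holds only three rows taken from the output shown earlier.

```python
import pandas as pd

def check_cleaned(frame):
    """Hypothetical helper: return basic quality counts for a postal code table."""
    return {
        'rows': len(frame),
        'unassigned_borough': int((frame['Borough'] == 'Not assigned').sum()),
        'unassigned_neighbourhood': int((frame['Neighbourhood'] == 'Not assigned').sum()),
        'duplicated_postal_codes': int(frame['Postal Code'].duplicated().sum()),
    }

# Three rows copied from the df.head() output, for illustration only
sample = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A', 'M5A'],
    'Borough': ['North York', 'North York', 'Downtown Toronto'],
    'Neighbourhood': ['Parkwoods', 'Victoria Village', 'Regent Park, Harbourfront'],
})

report = check_cleaned(sample)
print(report)
```

Applied to the notebook's cleaned `df`, the report should show 103 rows and zeros for the other counts, matching the results listed above.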

## *2. Merging*

###### Step 1: Importing the dependencies

In [12]:
import io
import requests

In [13]:
# Import the Geospatial Coordinates csv file from the link provided for this assignment
s=requests.get("https://cocl.us/Geospatial_data").content
c=pd.read_csv(io.StringIO(s.decode('utf-8')))
c.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
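
The download cell converts bytes to text, wraps the text in a file-like object, and hands that to `pd.read_csv`. A minimal sketch of that chain without a network call, assuming the response body looks like the five rows printed above (the `raw` bytes literal is written by hand here and is not the real payload):

```python
import io
import pandas as pd

# Hand-built stand-in for requests.get(...).content (an assumption for illustration)
raw = (
    b"Postal Code,Latitude,Longitude\n"
    b"M1B,43.806686,-79.194353\n"
    b"M1C,43.784535,-79.160497\n"
    b"M1E,43.763573,-79.188711\n"
    b"M1G,43.770992,-79.216917\n"
    b"M1H,43.773136,-79.239476\n"
)

text = raw.decode('utf-8')    # bytes -> str
buffer = io.StringIO(text)    # str -> in-memory file-like object
coords = pd.read_csv(buffer)  # file-like object -> DataFrame
print(coords.dtypes)
```

`pd.read_csv` also accepts a URL directly, so the `requests` and `io` steps are mainly useful when the raw response needs to be inspected or decoded by hand.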


###### Step 2: Merging the 2 dataframes by the 'Postal Code' column's values and keeping the df data on the left

In [14]:
# Creating a new dataframe by merging on the shared 'Postal Code' column (df is the left frame)
dfm = df.merge(c, on='Postal Code')
dfm.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
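
A plain merge silently drops postal codes that have no match in the coordinates file. One way to surface such rows is a left merge with `indicator=True`, with `validate='one_to_one'` guarding against duplicated keys. The frames below are toy data: `left` and `right` are hypothetical names, and M5A is deliberately omitted from `right` to force a non-match.

```python
import pandas as pd

left = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A', 'M5A'],
    'Borough': ['North York', 'North York', 'Downtown Toronto'],
})

# M5A is left out on purpose (illustration only; the real file does contain it)
right = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A'],
    'Latitude': [43.753259, 43.725882],
    'Longitude': [-79.329656, -79.315572],
})

# how='left' keeps every row of `left`; the '_merge' column records where each row came from
merged = left.merge(right, on='Postal Code', how='left',
                    validate='one_to_one', indicator=True)

# Postal codes that found no coordinates
unmatched = merged.loc[merged['_merge'] == 'left_only', 'Postal Code'].tolist()
print(unmatched)
```

If `unmatched` comes back empty on the real data, the left merge and the default inner merge produce the same rows.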


## *3. Mapping, segmenting and clustering*

###### Step 1: Installing and importing the dependencies

In [15]:
# Installing Folium library
!conda install -c conda-forge folium=0.5.0 --yes 


In [17]:
# Importing the dependencies for this section

import folium # map rendering library

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means for the clustering stage
from sklearn.cluster import KMeans

###### Step 2: Getting the coordinates for the postal code starting with 'M5G' 

In [18]:
# Selecting the row for this postal code to find its index
i = dfm[dfm['Postal Code'] == 'M5G']
i

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
24,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383


In [20]:
# Reading its coordinates with iloc: row position 24 (from the output above), columns 3 (Latitude) and 4 (Longitude)
Lat = dfm.iloc[24, 3]
Lon = dfm.iloc[24, 4]
print('The geographical coordinates of Downtown Toronto are {}, {}.'.format(Lat, Lon))

The geographical coordinates of Downtown Toronto are 43.6579524, -79.3873826.
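
The `iloc` cell above depends on the row sitting at position 24 and on the column order, both of which can change if the source table changes. A label-based lookup avoids hard-coded positions. This is a sketch on a two-row toy frame (`toy` is a made-up name; the values are copied from the outputs above):

```python
import pandas as pd

toy = pd.DataFrame({
    'Postal Code': ['M5A', 'M5G'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto'],
    'Neighbourhood': ['Regent Park, Harbourfront', 'Central Bay Street'],
    'Latitude': [43.65426, 43.657952],
    'Longitude': [-79.360636, -79.387383],
})

# Select by postal code and column name instead of by integer position
row = toy.loc[toy['Postal Code'] == 'M5G'].iloc[0]
lat, lon = row['Latitude'], row['Longitude']
print(lat, lon)
```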


###### Step 3: Mapping and segmenting the neighbourhoods

In [22]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[Lat, Lon], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(dfm['Latitude'], dfm['Longitude'], dfm['Borough'], dfm['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

###### From the map above, the data points appear to be scattered around a single central point near 'North Toronto West, Lawrence Park, Central Toronto'.
### We need one big cluster 
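
If the whole data set is treated as one cluster, the k-means centroid is simply the mean of the points, so no model fit is needed to locate it. A rough sketch using only the five rows from the `dfm.head()` output (the notebook itself would use all of `dfm`; `points` and `centroid` are names made up here):

```python
import pandas as pd

# Coordinates copied from the dfm.head() output above (illustration only)
points = pd.DataFrame({
    'Latitude': [43.753259, 43.725882, 43.65426, 43.718518, 43.662301],
    'Longitude': [-79.329656, -79.315572, -79.360636, -79.464763, -79.389494],
})

# With k = 1, the k-means objective (sum of squared distances) is minimised by the column-wise mean
centroid = points.mean()
print(round(centroid['Latitude'], 4), round(centroid['Longitude'], 4))
```

Averaging latitude and longitude directly ignores the curvature of the Earth, which is a reasonable approximation at the scale of a single city.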