# APPLIED DATA SCIENCE CAPSTONE PROJECT
This notebook will be used for the capstone project.

# __PART 1__

In [1]:
import numpy as np
import pandas as pd

In [2]:
# Install the lxml package, which pandas uses to parse the tables on the Wikipedia page
!conda install -c anaconda lxml -y
print('lxml installed')

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.

lxml installed


The first step is to read the table from the given Wikipedia page. The code below reads every table on the page and stores them as a list in 'tables'.

In [3]:
tables = pd.read_html('http://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
print('List contains {} tables!'.format(len(tables)))

List contains 3 tables!


Now we have 3 tables in our list. Examining the elements of the list shows that the first table is the one we need, so let's assign it to a dataframe and check the result:

In [4]:
df_postal = pd.DataFrame(data = tables[0])
df_postal.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
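
Picking the table by position works here, but the position may change if the page is edited. As a hedged sketch, the table could also be selected by its column names. The helper `pick_table` and the toy list `toy_tables` below are hypothetical and only stand in for the list returned by `pd.read_html`:

```python
import pandas as pd

def pick_table(tables, required_columns):
    """Return the first DataFrame whose columns include all required_columns."""
    required = set(required_columns)
    for table in tables:
        if required.issubset(table.columns):
            return table
    raise ValueError(f"no table contains columns {sorted(required)}")

# Toy stand-ins for the list returned by pd.read_html (made-up data).
toy_tables = [
    pd.DataFrame({"Code": ["X"], "Note": ["navigation box"]}),
    pd.DataFrame({
        "Postal Code": ["M1A", "M3A"],
        "Borough": ["Not assigned", "North York"],
        "Neighborhood": [None, "Parkwoods"],
    }),
]

postal = pick_table(toy_tables, ["Postal Code", "Borough", "Neighborhood"])
print(postal.shape)
```

With the real data, the call would be `pick_table(tables, [...])` on the list read from the page, assuming the column names stay as shown in the output above.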


The dataframe looks correct. Now let's keep only the rows with an assigned 'Borough':

In [5]:
# Drop postal codes whose borough is 'Not assigned'; .copy() avoids modifying a view of the original table
df_postal = df_postal[df_postal['Borough'] != 'Not assigned'].copy()
# Sort by postal code, then reset the index
df_postal.sort_values(by='Postal Code', inplace=True)
df_postal.reset_index(drop=True, inplace=True)
df_postal.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
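
The filter, sort, and index reset can also be written as one chained expression that leaves the input dataframe untouched. This is only a sketch on a made-up three-row dataframe (`toy` is hypothetical), not the notebook's data:

```python
import pandas as pd

# Made-up rows that mimic the structure of the postal code table.
toy = pd.DataFrame({
    "Postal Code": ["M3A", "M1A", "M1B"],
    "Borough": ["North York", "Not assigned", "Scarborough"],
    "Neighborhood": ["Parkwoods", None, "Malvern, Rouge"],
})

cleaned = (
    toy[toy["Borough"] != "Not assigned"]  # keep rows with an assigned borough
    .sort_values(by="Postal Code")         # order by postal code
    .reset_index(drop=True)                # renumber rows from 0
)
print(cleaned)
```

Because nothing is modified in place, `toy` still holds all three rows afterwards, which makes the cell safe to re-run.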


In [6]:
# Check whether any neighborhood value is missing
df_postal['Neighborhood'].isna().value_counts()

False    103
Name: Neighborhood, dtype: int64

Since no 'Neighborhood' value is missing, the dataframe is ready for the next part.
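
If the check had found missing values, one possible fallback would be to reuse the borough name as the neighborhood. That policy is an assumption for illustration, not something this notebook needs, and the dataframe `toy` below is made up:

```python
import pandas as pd

# Made-up rows; the second one has no neighborhood.
toy = pd.DataFrame({
    "Postal Code": ["M5A", "M3A"],
    "Borough": ["Downtown Toronto", "North York"],
    "Neighborhood": ["Regent Park, Harbourfront", None],
})

# Count missing neighborhoods before deciding what to do.
missing = int(toy["Neighborhood"].isna().sum())
print("Missing neighborhoods:", missing)

# Assumed fallback: fill each missing neighborhood with the row's borough.
toy["Neighborhood"] = toy["Neighborhood"].fillna(toy["Borough"])
```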

In [7]:
print('Shape of the dataframe:', df_postal.shape)

Shape of the dataframe: (103, 3)


# __PART 2__

Read the coordinates from the CSV file into a dataframe:

In [8]:
coordinates = pd.read_csv('https://cocl.us/Geospatial_data')

In [9]:
coordinates.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Keep only the coordinates whose postal code appears in the dataframe from PART 1:

In [10]:
coordinates = coordinates[coordinates['Postal Code'].isin(df_postal['Postal Code'].tolist())]

Sort the coordinates by postal code so they follow the same order as the dataframe from PART 1. The merge itself matches rows by key, so the sort is not strictly required, but it keeps the two dataframes easy to compare:

In [11]:
# sort_values returns a new dataframe, so assign the result back
coordinates = coordinates.sort_values(by='Postal Code')
coordinates.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Finally, let's merge the two dataframes to get a final dataframe with coordinates:

In [13]:
# Join on the shared 'Postal Code' column (named explicitly for clarity)
df_final = pd.merge(df_postal, coordinates, on='Postal Code')
df_final.head(10)

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"Kennedy Park, Ionview, East Birchmount Park",43.727929,-79.262029
7,M1L,Scarborough,"Golden Mile, Clairlea, Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffside, Cliffcrest, Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848
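
A merge can silently drop or duplicate rows when the keys do not line up. As a sketch of how this could be checked, `pd.merge` accepts `validate` and `indicator` arguments. The two small dataframes and the postal code 'M9Z' are made up for the example:

```python
import pandas as pd

# Made-up inputs: 'M9Z' has no coordinates, 'M1E' has no borough row.
postal = pd.DataFrame({
    "Postal Code": ["M1B", "M1C", "M9Z"],
    "Borough": ["Scarborough", "Scarborough", "Toy Borough"],
})
coords = pd.DataFrame({
    "Postal Code": ["M1B", "M1C", "M1E"],
    "Latitude": [43.806686, 43.784535, 43.763573],
    "Longitude": [-79.194353, -79.160497, -79.188711],
})

merged = pd.merge(
    postal, coords,
    on="Postal Code",
    how="left",              # keep every postal code from the left table
    validate="one_to_one",   # raise if a postal code repeats on either side
    indicator=True,          # add a '_merge' column describing each row's origin
)

# Postal codes that found no coordinates.
unmatched = merged.loc[merged["_merge"] == "left_only", "Postal Code"].tolist()
print(unmatched)
```

An empty `unmatched` list would confirm that every postal code received coordinates.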


# __PART 3__

Now it's time to visualize the neighborhoods on the map. First, we need a numeric label (0, 1, ...) for each borough in place of a categorical value like 'Scarborough'. I'm going to use KMeans clustering for this purpose. Note that the label numbers themselves are arbitrary; they only serve to tell the boroughs apart.

In [14]:
from sklearn.cluster import KMeans

Convert the 'Borough' column into dummy variables using the get_dummies method, then fit a KMeans instance on the resulting 'onehot' dataframe.

In [15]:
onehot = pd.get_dummies(df_final[['Borough']], prefix="", prefix_sep="")
kmeans = KMeans(n_clusters = 10).fit(onehot)

Now, let's add the obtained label for each borough to our final dataframe.

In [16]:
# labels_ is already a 1-D array with one label per row
df_final['Borough_Index'] = kmeans.labels_
df_final.head(10)

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude,Borough_Index
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353,2
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497,2
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711,2
3,M1G,Scarborough,Woburn,43.770992,-79.216917,2
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476,2
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476,2
6,M1K,Scarborough,"Kennedy Park, Ionview, East Birchmount Park",43.727929,-79.262029,2
7,M1L,Scarborough,"Golden Mile, Clairlea, Oakridge",43.711112,-79.284577,2
8,M1M,Scarborough,"Cliffside, Cliffcrest, Scarborough Village West",43.716316,-79.239476,2
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848,2
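
Since the goal is only one integer per borough, a simpler route than clustering would be `pd.factorize`, which numbers categories in order of first appearance. This is an alternative sketch, not the approach used in this notebook, and the series `boroughs` is a made-up sample:

```python
import pandas as pd

# Made-up sample of borough names.
boroughs = pd.Series(
    ["Scarborough", "North York", "Scarborough", "Downtown Toronto"]
)

# codes: one integer per row; uniques: the borough name behind each integer.
codes, uniques = pd.factorize(boroughs)
print(codes)
print(list(uniques))
```

Unlike KMeans labels, these codes are deterministic, and `uniques[code]` maps a code back to its borough name.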


Great! We are ready to create a Toronto map and add markers to the corresponding neighborhoods.

In [17]:
import folium
import matplotlib.cm as cm
import matplotlib.colors as colors

In [18]:
map_clusters = folium.Map(location=[43.693, -79.403], zoom_start=11)
kclusters = 10
# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]
# add markers to the map
# one circle marker per postal code, colored by its borough index
for lat, lon, poi, cluster, bor in zip(df_final['Latitude'], df_final['Longitude'], df_final['Neighborhood'], df_final['Borough_Index'], df_final['Borough']):
    label = folium.Popup(str(poi) + ', ' + str(bor), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
map_clusters
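
The color scheme only needs one distinct hex color per borough index. As a rough sketch of that idea using just the standard library, the helper `make_palette` below (a hypothetical name) spaces hues evenly around the HSV color wheel. It is a swapped-in technique and does not reproduce matplotlib's rainbow colormap:

```python
import colorsys

def make_palette(n, saturation=0.8, value=0.9):
    """Return n hex colors with hues evenly spaced around the HSV wheel."""
    palette = []
    for k in range(n):
        # hsv_to_rgb takes and returns floats in the range [0, 1]
        r, g, b = colorsys.hsv_to_rgb(k / n, saturation, value)
        palette.append(
            "#{:02x}{:02x}{:02x}".format(
                round(r * 255), round(g * 255), round(b * 255)
            )
        )
    return palette

# One color per borough index, assuming 10 boroughs as in kclusters above.
palette = make_palette(10)
print(palette)
```

A list built this way could be indexed exactly like `rainbow[cluster]` in the marker loop.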