**IBM Data Science Capstone Week 3**

**Install wikipedia**

In [3]:
!conda install -c conda-forge wikipedia

Solving environment: done

## Package Plan ##

  environment location: /opt/conda/envs/Python36

  added / updated specs: 
    - wikipedia


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    openssl-1.1.1d             |       h516909a_0         2.1 MB  conda-forge
    wikipedia-1.4.0            |             py_2          13 KB  conda-forge
    certifi-2019.11.28         |   py36h9f0ad1d_1         149 KB  conda-forge
    ca-certificates-2019.11.28 |       hecc5488_0         145 KB  conda-forge
    python_abi-3.6             |          1_cp36m           4 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         2.4 MB

The following NEW packages will be INSTALLED:

    python_abi:      3.6-1_cp36m       conda-forge
    wikipedia:       1.4.0-py_2        conda-forge

The following packages will be UPDATED:

   

**Let's import the necessary libraries**

In [4]:
import pandas as pd
import wikipedia as wp
import numpy as np

In [5]:
# Get the HTML source of the Wikipedia page and parse its first table
html = wp.page("List of postal codes of Canada: M").html().encode("UTF-8")
df = pd.read_html(html)[0]

**Let's look at our dataframe**

In [6]:
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


In [7]:
print('the shape of the dataframe: ', df.shape)

the shape of the dataframe:  (287, 3)


**1. Remove 'Not assigned' from Borough column**

In [8]:
df = df[df['Borough'] != 'Not assigned']
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor


In [9]:
df.shape

(210, 3)
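
As a small illustration of the boolean-mask filter used above, the sketch below applies the same comparison to a toy frame. The sample rows and the names `toy`, `mask` and `toy_assigned` are made up for the example; they are not part of the notebook's data.

```python
import pandas as pd

# Hypothetical sample rows, only to illustrate the filter
toy = pd.DataFrame({
    'Postcode': ['M1A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'North York', 'North York'],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Victoria Village'],
})

# True for every row whose Borough is assigned
mask = toy['Borough'] != 'Not assigned'

# Indexing with the mask keeps only the True rows (original index labels are preserved)
toy_assigned = toy[mask]

print(mask.sum(), 'of', len(toy), 'rows kept')
```

Note that the surviving rows keep their original index labels, which is why the filtered dataframe above starts at index 2.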

**2. Combine rows with the same postcode**

In [10]:
df = df.groupby(['Postcode', 'Borough'], sort=False).agg(', '.join)
df.reset_index(inplace=True)
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,"Lawrence Heights, Lawrence Manor"
4,M7A,Downtown Toronto,Queen's Park
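
To make the grouping step easier to follow, here is a minimal sketch on three rows taken from the table above. It selects the `Neighbourhood` column explicitly before aggregating, which is a small variation on the cell above rather than the exact same call; the names `toy` and `combined` are illustrative.

```python
import pandas as pd

# Three rows from the table above; M6A appears twice
toy = pd.DataFrame({
    'Postcode': ['M5A', 'M6A', 'M6A'],
    'Borough': ['Downtown Toronto', 'North York', 'North York'],
    'Neighbourhood': ['Harbourfront', 'Lawrence Heights', 'Lawrence Manor'],
})

# Group rows sharing a postcode and borough, then join their neighbourhood names.
# sort=False keeps the groups in order of first appearance.
combined = (
    toy.groupby(['Postcode', 'Borough'], sort=False)['Neighbourhood']
       .agg(', '.join)
       .reset_index()
)

print(combined)
```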


**Check whether any "Not assigned" neighbourhoods remain**

In [11]:
df[df['Neighbourhood'] == 'Not assigned']

Unnamed: 0,Postcode,Borough,Neighbourhood


In [12]:
df[df['Borough'] == 'Not assigned']

Unnamed: 0,Postcode,Borough,Neighbourhood
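
Both checks come back empty here, so nothing needs fixing. If a row did have a borough but a "Not assigned" neighbourhood, one possible way to handle it would be to reuse the borough name, as sketched below. The sample rows are hypothetical and only exist to show the pattern.

```python
import pandas as pd

# Hypothetical rows: the first has a borough but no assigned neighbourhood
toy = pd.DataFrame({
    'Postcode': ['M7A', 'M5A'],
    'Borough': ["Queen's Park", 'Downtown Toronto'],
    'Neighbourhood': ['Not assigned', 'Harbourfront'],
})

# Locate the unassigned neighbourhoods and copy the borough name into them
unassigned = toy['Neighbourhood'] == 'Not assigned'
toy.loc[unassigned, 'Neighbourhood'] = toy.loc[unassigned, 'Borough']

print(toy)
```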


In [13]:
print('the shape of the dataframe: ', df.shape)

the shape of the dataframe:  (103, 3)


**Part 2**

**Import the dataframe of latitudes and longitudes**

In [14]:
df_lat_log = pd.read_csv('https://cocl.us/Geospatial_data')
df_lat_log.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [15]:
print('the shape of the df_lat_log: ', df_lat_log.shape)

the shape of the df_lat_log:  (103, 3)


**Rename the column 'Postal Code' to 'Postcode'**

In [16]:
df_lat_log.columns

Index(['Postal Code', 'Latitude', 'Longitude'], dtype='object')

In [17]:
df_lat_log.rename(columns={"Postal Code": "Postcode"}, inplace = True)
df_lat_log.head()

Unnamed: 0,Postcode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


**Merge the two dataframes on the column 'Postcode'**

In [18]:
df_new = pd.merge(df, df_lat_log, on = 'Postcode')
df_new.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,Harbourfront,43.65426,-79.360636
3,M6A,North York,"Lawrence Heights, Lawrence Manor",43.718518,-79.464763
4,M7A,Downtown Toronto,Queen's Park,43.662301,-79.389494
5,M9A,Etobicoke,Islington Avenue,43.667856,-79.532242
6,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
7,M3B,North York,Don Mills North,43.745906,-79.352188
8,M4B,East York,"Woodbine Gardens, Parkview Hill",43.706397,-79.309937
9,M5B,Downtown Toronto,"Ryerson, Garden District",43.657162,-79.378937


In [19]:
df_new.shape

(103, 5)
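
The merge above keeps all 103 rows, so every postcode found a match. A quick way to verify this kind of thing explicitly is an outer merge with `indicator=True`, optionally with `validate` to guard against duplicate keys. The sketch below uses a few rows from the tables above, with `M5A` and `M1B` deliberately left unmatched; the frame names `left`, `right`, `checked` and `unmatched` are illustrative.

```python
import pandas as pd

left = pd.DataFrame({
    'Postcode': ['M3A', 'M4A', 'M5A'],
    'Borough': ['North York', 'North York', 'Downtown Toronto'],
})
right = pd.DataFrame({
    'Postcode': ['M3A', 'M4A', 'M1B'],
    'Latitude': [43.753259, 43.725882, 43.806686],
    'Longitude': [-79.329656, -79.315572, -79.194353],
})

# how='outer' keeps unmatched rows from both sides;
# indicator=True adds a '_merge' column saying where each row came from;
# validate='one_to_one' raises an error if a postcode is duplicated on either side.
checked = pd.merge(left, right, on='Postcode', how='outer',
                   validate='one_to_one', indicator=True)

# Rows that exist in only one of the two frames
unmatched = checked[checked['_merge'] != 'both']
print(unmatched[['Postcode', '_merge']])
```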

**From now on, I will work only with boroughs whose name contains the word Toronto**

In [20]:
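# .copy() makes df2 an independent dataframe, so columns can be added to it later without a SettingWithCopyWarning
df2 = df_new[df_new.Borough.str.contains('Toronto', regex=False)].copy()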
df2.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
2,M5A,Downtown Toronto,Harbourfront,43.65426,-79.360636
4,M7A,Downtown Toronto,Queen's Park,43.662301,-79.389494
9,M5B,Downtown Toronto,"Ryerson, Garden District",43.657162,-79.378937
15,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
19,M4E,East Toronto,The Beaches,43.676357,-79.293031
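
For reference, here is a minimal sketch of how `str.contains` behaves as a filter. It assumes a small made-up Series named `boroughs`; the `None` entry and the `na=False` argument are additions for illustration, showing one way to treat missing values as non-matches.

```python
import pandas as pd

# Borough names from the tables above, plus a missing value for illustration
boroughs = pd.Series(['Downtown Toronto', 'North York', 'East Toronto', 'Etobicoke', None])

# regex=False does a plain substring match; na=False maps missing values to False
is_toronto = boroughs.str.contains('Toronto', regex=False, na=False)

print(boroughs[is_toronto].tolist())
```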


**Latitude and longitude of Toronto**

In [21]:
latitude, longitude = 43.651070, -79.347015

**Visualizing all the Neighbourhoods using Folium**

Let's first install the folium library

In [None]:
!conda install -c conda-forge folium=0.5.0 --yes
import folium # plotting library

In [None]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(df2['Latitude'], df2['Longitude'], df2['Borough'], df2['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

**Now let's use the K-Means clustering algorithm**

In [None]:
# set number of clusters
kclusters = 5

# import libraries
from sklearn.cluster import KMeans

# input data to be clustered
X = df2[['Latitude', 'Longitude']]

# run k-means clustering
kmeans = KMeans(n_clusters = kclusters, random_state=0)
kmeans.fit(X)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 
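
To give a rough idea of what k-means does with the coordinates, the sketch below implements the basic assign-then-update loop (Lloyd's algorithm) in NumPy. It is illustrative only and is not scikit-learn's implementation, which by default uses k-means++ initialization among other refinements. The function name `lloyd_kmeans` and its parameters are hypothetical.

```python
import numpy as np

def lloyd_kmeans(points, k, n_iter=20, seed=0):
    """Toy k-means (Lloyd's algorithm) for illustration only."""
    rng = np.random.default_rng(seed)
    # Start from k distinct input points chosen at random
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every centre, shape (n_points, k)
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        # Assign each point to its nearest centre
        labels = dists.argmin(axis=1)
        # Move each centre to the mean of its points; keep it if its cluster is empty
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Stop once the centres no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Four (latitude, longitude) pairs from the tables above
pts = np.array([
    [43.654260, -79.360636],  # M5A
    [43.657162, -79.378937],  # M5B
    [43.651494, -79.375418],  # M5C
    [43.676357, -79.293031],  # M4E
])

labels, centers = lloyd_kmeans(pts, k=2)
print(labels)
```

The three downtown postcodes end up sharing a label while M4E gets the other one. Which cluster is numbered 0 and which 1 depends on the random starting centres, so the label values themselves carry no meaning.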

**Add cluster labels into df2**

In [None]:
# add clustering labels
df2.insert(0, 'Cluster Labels', kmeans.labels_)

**Let's visualize the clusters of df2 on the map**

In [None]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# set color scheme for the clusters
# one evenly spaced rainbow colour per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
# the column is spelled 'Neighbourhood' in df2; 'Neighborhood' would raise a KeyError
for lat, lon, poi, cluster in zip(df2['Latitude'], df2['Longitude'], df2['Neighbourhood'], df2['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],  # labels are zero-based, so they index the palette directly
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters