# IBM APPLIED DATA SCIENCE CAPSTONE PROJECT PART - 2

## Clustering and Segmenting Neighborhoods in Toronto

<p>The Toronto neighborhood data comes from the Wikipedia page <a href="https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M">List of postal codes of Canada: M</a>.</p>

Before we get the data and start exploring it, let's import the dependencies we need.

In [1]:
import pandas as pd
import numpy as np

<p>Install <b>html5lib</b>, which pandas can use as an HTML parser when reading the Wikipedia page.</p>

In [2]:
!conda install html5lib -y

Solving environment: done

## Package Plan ##

  environment location: /home/nbuser/anaconda3_501

  added / updated specs: 
    - html5lib


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    conda-4.7.12               |           py36_0         3.0 MB
    conda-package-handling-1.6.0|   py36h7b6447c_0         871 KB
    ------------------------------------------------------------
                                           Total:         3.9 MB

The following NEW packages will be INSTALLED:

    conda-package-handling: 1.6.0-py36h7b6447c_0

The following packages will be UPDATED:

    conda:                  4.5.11-py36_0        --> 4.7.12-py36_0


Downloading and Extracting Packages
conda-4.7.12         | 3.0 MB    | ##################################### | 100% 
conda-package-handli | 871 KB    | ##################################### | 100% 
Preparing transaction: done
Verifying transacti

In [5]:
import html5lib

<p><strong>pd.read_html()</strong> reads every table on the Wikipedia page and returns them as a list of DataFrames.</p>

In [18]:
#the URL containing the dataset
url='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

Read the tables from the URL into a list named 'table'.

In [19]:
table=pd.read_html(url)
type(table)

list

<p>The type of table is list.
    We need only the first element of the list.</p>

In [20]:
df=pd.DataFrame(table[0])
print(df.head())

          0             1                 2
0  Postcode       Borough     Neighbourhood
1       M1A  Not assigned      Not assigned
2       M2A  Not assigned      Not assigned
3       M3A    North York         Parkwoods
4       M4A    North York  Victoria Village


Use the first row as the header

In [22]:
new_header=df.iloc[0]

df=df[1:]
df.columns=new_header

In [23]:
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Harbourfront
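
The header promotion above can also be sketched on a small stand-in table. This is a minimal illustration, not the notebook's own cell: the `raw` and `promoted` names and the three sample rows are assumptions. Unlike the cell above, it renumbers the rows from 0 right away and passes the header as a plain list, so the column axis does not carry the name `0`.

```python
import pandas as pd

# Toy stand-in for table[0]: the real header sits in row 0 (sample values only).
raw = pd.DataFrame([
    ["Postcode", "Borough", "Neighbourhood"],
    ["M1A", "Not assigned", "Not assigned"],
    ["M3A", "North York", "Parkwoods"],
])

# Drop row 0 from the data and renumber the remaining rows from 0.
promoted = raw.iloc[1:].reset_index(drop=True)

# Use the values of row 0 as column labels (a plain list, so no axis name is kept).
promoted.columns = raw.iloc[0].tolist()

print(promoted)
```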


<strong>Analyse the dataset</strong>

In [21]:
print(df.shape)
df.info()

(289, 3)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 289 entries, 0 to 288
Data columns (total 3 columns):
0    289 non-null object
1    289 non-null object
2    289 non-null object
dtypes: object(3)
memory usage: 6.9+ KB


<p>We need cells that have an assigned borough. Ignore cells with a borough that is <strong>Not assigned</strong>.</p>

In [24]:
#Find the index labels of the rows whose Borough is 'Not assigned'
index_borough=df[df['Borough'].isin(['Not assigned'])].index

#print the index labels
print(index_borough)


Int64Index([  1,   2,  10,  14,  21,  22,  31,  37,  38,  46,  47,  51,  52,
             53,  55,  56,  60,  61,  62,  74,  75,  76,  89,  90,  91, 105,
            106, 107, 121, 122, 137, 138, 149, 150, 156, 162, 163, 168, 176,
            182, 183, 189, 190, 191, 195, 196, 202, 203, 204, 205, 210, 211,
            224, 225, 238, 239, 242, 243, 248, 249, 254, 255, 259, 260, 261,
            262, 264, 265, 275, 276, 277, 278, 279, 280, 281, 282, 288],
           dtype='int64')


<p>Drop the rows whose borough is <b>Not assigned</b>.</p>

In [25]:
df.drop(index_borough, inplace=True,axis=0)

<p>We can reset the row index of the dataframe with reset_index() so that it starts from 0 again. Passing <b>drop=True</b> discards the original index instead of keeping it as a new column.</p>

In [26]:
df.reset_index(drop=True, inplace=True)
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights
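
As an alternative to collecting index labels and calling `drop`, the same filtering can be written as a boolean mask followed by `reset_index`. The sketch below uses a small made-up frame (`toy` and `assigned` are illustrative names, not part of the notebook) and returns a new dataframe rather than modifying one in place.

```python
import pandas as pd

# Toy data with one unassigned borough (sample values only).
toy = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Victoria Village"],
})

# Keep only rows with an assigned borough, then renumber the index from 0.
assigned = toy[toy["Borough"] != "Not assigned"].reset_index(drop=True)

print(assigned)
```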


Inspect the Neighbourhood column for Not assigned values.

In [28]:
df.loc[df['Neighbourhood'].isin(['Not assigned'])]

Unnamed: 0,Postcode,Borough,Neighbourhood
6,M7A,Queen's Park,Not assigned


<p>Where the Neighbourhood is Not assigned, copy the Borough value into Neighbourhood, so that the neighborhood is the same as the borough. For the row at index 6 (postcode M7A), both the Borough and the Neighbourhood columns will then be Queen's Park.</p>

In [29]:
df.loc[df['Neighbourhood'].isin(['Not assigned']), 'Neighbourhood']=df['Borough']
print(df.iloc[6,:])

0
Postcode                  M7A
Borough          Queen's Park
Neighbourhood    Queen's Park
Name: 6, dtype: object
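
The fill step relies on pandas aligning the right-hand side by index, so only the masked rows receive their own borough. One way to express the same idea more explicitly is `Series.mask`, sketched here on made-up rows (the `toy` frame and `missing` name are assumptions for illustration).

```python
import pandas as pd

# Toy data: the second row has no neighbourhood assigned (sample values only).
toy = pd.DataFrame({
    "Postcode": ["M5A", "M7A"],
    "Borough": ["Downtown Toronto", "Queen's Park"],
    "Neighbourhood": ["Harbourfront", "Not assigned"],
})

# True where the neighbourhood still needs a value.
missing = toy["Neighbourhood"] == "Not assigned"

# Series.mask replaces entries where the condition is True with the
# index-aligned value from the other Series, and leaves the rest unchanged.
toy["Neighbourhood"] = toy["Neighbourhood"].mask(missing, toy["Borough"])

print(toy)
```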


The US English spelling of Neighbourhood is Neighborhood, so rename the column.

In [30]:
df.rename(columns={'Neighbourhood':'Neighborhood'}, inplace=True)
df.head()

Unnamed: 0,Postcode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights


<p>Since each <b>'Postcode'</b> appears with the same <b>'Borough'</b> in this data, we can group by both columns, cast the <b>'Neighborhood'</b> column to str, and join the names in each group with a comma:</p>

In [31]:
df_sort=df.groupby(['Postcode','Borough'])['Neighborhood'].apply(lambda x: ','.join(x.astype(str))).reset_index()

<p>Display the first 10 observations from the Wikipedia page grouped by Postcode</p>

In [37]:
df_sort.head(10)

Unnamed: 0,Postcode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge,Malvern"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park,Ionview,Kennedy Park"
7,M1L,Scarborough,"Clairlea,Golden Mile,Oakridge"
8,M1M,Scarborough,"Cliffcrest,Cliffside,Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff,Cliffside West"


Display the number of rows in the dataframe

In [35]:
df_sort.shape

(103, 3)
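
To see what the group-and-join step does on a small scale, here is a minimal sketch with three made-up rows. The `toy` and `grouped` names are illustrative, and the sketch joins with `", "` (comma plus space) rather than the bare comma used above, purely for readability.

```python
import pandas as pd

# Two rows share postcode M5A and should collapse into one (sample values only).
toy = pd.DataFrame({
    "Postcode": ["M5A", "M5A", "M6A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Harbourfront", "Regent Park", "Lawrence Heights"],
})

# Group on both key columns and join the neighborhood names of each group.
grouped = (
    toy.groupby(["Postcode", "Borough"])["Neighborhood"]
       .agg(", ".join)
       .reset_index()
)

print(grouped)
```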

### PART 2 

<strong>
Now that we have built a dataframe of the postal code of each neighborhood along with the borough name and neighborhood name, we need the latitude and longitude coordinates of each neighborhood in order to utilize the Foursquare location data.
</strong>

In [100]:
df_loc = pd.read_csv('Geospatial_Coordinates.csv')

In [101]:
df_loc.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [102]:
df_loc.shape

(103, 3)

<strong>Rename the column 'Postal Code' to 'Postcode' so that it matches the key column in df_sort and the two dataframes can be merged.</strong>

In [103]:
df_loc.rename(columns={'Postal Code':'Postcode'}, inplace=True)
df_loc.head()

Unnamed: 0,Postcode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


###### Merge the neighborhood dataframe 'df_sort' with the locations dataframe 'df_loc'.

In [104]:
df_merged=pd.merge(left=df_sort, right=df_loc, on='Postcode')

Display the first rows of the merged dataframe

In [111]:
df_merged.head(12)

Unnamed: 0,Postcode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge,Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park,Ionview,Kennedy Park",43.727929,-79.262029
7,M1L,Scarborough,"Clairlea,Golden Mile,Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffcrest,Cliffside,Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff,Cliffside West",43.692657,-79.264848


In [109]:
df_merged.tail(10)

Unnamed: 0,Postcode,Borough,Neighborhood,Latitude,Longitude
93,M9A,Etobicoke,Islington Avenue,43.667856,-79.532242
94,M9B,Etobicoke,"Cloverdale,Islington,Martin Grove,Princess Gar...",43.650943,-79.554724
95,M9C,Etobicoke,"Bloordale Gardens,Eringate,Markland Wood,Old B...",43.643515,-79.577201
96,M9L,North York,Humber Summit,43.756303,-79.565963
97,M9M,North York,"Emery,Humberlea",43.724766,-79.532242
98,M9N,York,Weston,43.706876,-79.518188
99,M9P,Etobicoke,Westmount,43.696319,-79.532242
100,M9R,Etobicoke,"Kingsview Village,Martin Grove Gardens,Richvie...",43.688905,-79.554724
101,M9V,Etobicoke,"Albion Gardens,Beaumond Heights,Humbergate,Jam...",43.739416,-79.588437
102,M9W,Etobicoke,Northwest,43.706748,-79.594054


In [106]:
df_merged.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 103 entries, 0 to 102
Data columns (total 5 columns):
Postcode        103 non-null object
Borough         103 non-null object
Neighborhood    103 non-null object
Latitude        103 non-null float64
Longitude       103 non-null float64
dtypes: float64(2), object(3)
memory usage: 4.8+ KB
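
The merge above is an inner join, so a postcode missing from either dataframe would be dropped without any message. A hedged way to check for that is to merge with `how="left"`, `validate="one_to_one"` and `indicator=True`, as in the sketch below. The `left`, `right`, `checked` and `unmatched` names and the `M9Z` row are made up for illustration.

```python
import pandas as pd

# Left side: postcodes with boroughs; M9Z is a made-up code with no coordinates.
left = pd.DataFrame({
    "Postcode": ["M1B", "M1C", "M9Z"],
    "Borough": ["Scarborough", "Scarborough", "Hypothetical"],
})

# Right side: coordinates for two of the three postcodes.
right = pd.DataFrame({
    "Postcode": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# validate raises if a postcode is duplicated on either side;
# indicator adds a _merge column recording where each row came from.
checked = pd.merge(left, right, on="Postcode", how="left",
                   validate="one_to_one", indicator=True)

# Rows present only on the left have no coordinates.
unmatched = checked[checked["_merge"] == "left_only"]

print(unmatched[["Postcode", "Borough"]])
```

If `unmatched` is empty for the real data, the inner join used above loses no postcodes from `df_sort`.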


In [107]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(df_merged['Borough'].unique()),
        df_merged.shape[0]
    )
)

The dataframe has 11 boroughs and 103 neighborhoods.
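
Note that `df_merged.shape[0]` counts rows, and each row is one postcode that may list several neighborhood names. The sketch below, on a made-up three-row frame, shows one way to count boroughs, postcode rows and individual names separately. It assumes names are separated by a bare comma, as in the join above; the variable names are illustrative.

```python
import pandas as pd

# Toy version of the merged frame (sample values only).
toy = pd.DataFrame({
    "Postcode": ["M1B", "M1C", "M9N"],
    "Borough": ["Scarborough", "Scarborough", "York"],
    "Neighborhood": ["Rouge,Malvern", "Highland Creek,Rouge Hill,Port Union", "Weston"],
})

# Distinct boroughs.
n_boroughs = toy["Borough"].nunique()

# One row per postcode.
n_postcodes = len(toy)

# Split each joined string on commas and add up the number of names per row.
n_names = toy["Neighborhood"].str.split(",").str.len().sum()

print(n_boroughs, n_postcodes, n_names)
```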
