# Part 1

<h1>Segmenting and Clustering Neighborhoods of Toronto </h1>


## <p style="color:#239B56;" > Import the `BeautifulSoup` package to scrape the data </p>
<h3>Step 1: Import the requests and pandas libraries</h3>

In [0]:
import requests

In [0]:
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans


<h3>Step 2: Check the response status code to see if everything went as planned</h3>
<li>status code 200: the request/response cycle was successful</li>
<li>a 4xx or 5xx status code: the request failed (e.g., 404 = page not found)</li>
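
As a small illustration of how status codes are read, the sketch below uses only the standard library's `http.HTTPStatus` enum. The helper name `describe_status` is hypothetical and is not part of the notebook.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a short human-readable summary of an HTTP status code."""
    status = HTTPStatus(code)  # raises ValueError for unknown codes
    outcome = "success" if 200 <= code < 300 else "failure"
    return f"{code} {status.phrase}: {outcome}"


print(describe_status(200))
print(describe_status(404))
```

When working with `requests` directly, `Response.raise_for_status()` is another option: it raises an exception for 4xx and 5xx responses.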

In [0]:
page = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
print(page.status_code)

In [0]:
# keep the HTML of the page as text for parsing
response = page.text

### Retrieve data from the Wikipedia page content

In [0]:
soup = BeautifulSoup(response, 'html.parser')

data = []
for tr in soup.tbody.find_all('tr'):
    data.append([ td.get_text().strip() for td in tr.find_all('td')])
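
To see what the loop above produces, here is a minimal sketch on a hand-written table. The HTML snippet and the `sample_` names are assumptions made for illustration; the real page is larger. Note that the header row uses `<th>` cells, so it yields an empty list.

```python
from bs4 import BeautifulSoup

# Hypothetical three-row table imitating the layout of the scraped page.
sample_html = """
<table><tbody>
<tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
<tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
<tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</tbody></table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# One list of cell texts per table row; only <td> cells are collected.
rows = [
    [td.get_text().strip() for td in tr.find_all("td")]
    for tr in sample_soup.tbody.find_all("tr")
]
print(rows)
```

The empty first list is why the cleaning step later has to remove a row of missing values.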

### Data cleaning

In [0]:
df=pd.DataFrame(data,columns=['PostalCode','Borough','Neighborhood'])

In [0]:
# Find indexes of rows that have "Not assigned" in Borough column
indexNames = df[(df['Borough'] == "Not assigned")].index

# Drop rows that have "Not assigned" in Borough column
df.drop(indexNames,inplace=True)

# Drop the header row: it has no <td> cells, so all of its values are missing
df.dropna(inplace=True)
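
The same cleaning can also be written as a single boolean filter. This is a sketch on toy rows (the `toy` and `cleaned` names are hypothetical), assuming the header row arrives as an empty list as it does in the scrape.

```python
import pandas as pd

# Toy rows mimicking the scraped list, including the empty header row.
toy_rows = [
    [],
    ["M1A", "Not assigned", "Not assigned"],
    ["M3A", "North York", "Parkwoods"],
    ["M4A", "North York", "Victoria Village"],
]
toy = pd.DataFrame(toy_rows, columns=["PostalCode", "Borough", "Neighborhood"])

# Keep rows that have a borough (this drops the all-missing header row)
# and whose borough is not the "Not assigned" placeholder.
cleaned = toy[toy["Borough"].notna() & (toy["Borough"] != "Not assigned")]
print(cleaned)
```

Filtering returns a new frame instead of modifying `toy` in place, which makes the step easier to re-run.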

In [9]:
df.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Regent Park / Harbourfront
6,M6A,North York,Lawrence Manor / Lawrence Heights
7,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government
9,M9A,Etobicoke,Islington Avenue
10,M1B,Scarborough,Malvern / Rouge
12,M3B,North York,Don Mills
13,M4B,East York,Parkview Hill / Woodbine Gardens
14,M5B,Downtown Toronto,"Garden District, Ryerson"


Combine rows that share the same postal code and borough into a single row, joining their neighborhood names with commas.

In [0]:

df=df.groupby(['PostalCode','Borough'])['Neighborhood'].apply(', '.join).reset_index()
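
A minimal sketch of what the `groupby` and `join` step does, using hypothetical rows in which one postal code appears on two lines:

```python
import pandas as pd

# Hypothetical input: M5A is listed twice, once per neighborhood.
toy = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M6A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Harbourfront", "Regent Park", "Lawrence Heights"],
})

# Group by postal code and borough, then join the neighborhood names.
combined = (
    toy.groupby(["PostalCode", "Borough"])["Neighborhood"]
    .apply(", ".join)
    .reset_index()
)
print(combined)
```

If every postal code already occupies a single row, the operation leaves the data unchanged apart from sorting by the group keys and resetting the index.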

### Clean data: fill in unassigned neighborhoods

In [11]:
# Replace "Not assigned" in Neighborhood column with the value in Borough column
def custom_fx(data):
    if data['Neighborhood']=='Not assigned':
        var=data['Borough']
    else:
        var=data['Neighborhood']
    return var

# Apply the function
df['Neighborhood']=df.apply(custom_fx,axis='columns')

# Check that there is no more "Not assigned" in Neighborhood column
n_unassigned = len(df[df['Neighborhood']=='Not assigned'])
print("There are {} rows that have 'Not assigned' in Neighborhood column in the dataframe".format(n_unassigned))
df.head(10)

There are 0 rows that have 'Not assigned' in Neighborhood column in the dataframe


Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,Malvern / Rouge
1,M1C,Scarborough,Rouge Hill / Port Union / Highland Creek
2,M1E,Scarborough,Guildwood / Morningside / West Hill
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,Kennedy Park / Ionview / East Birchmount Park
7,M1L,Scarborough,Golden Mile / Clairlea / Oakridge
8,M1M,Scarborough,Cliffside / Cliffcrest / Scarborough Village West
9,M1N,Scarborough,Birch Cliff / Cliffside West
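
The row-wise `apply` used above can also be expressed as a column operation. The sketch below uses `Series.mask` on toy data; the two rows are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "PostalCode": ["M7A", "M9A"],
    "Borough": ["Queen's Park", "Etobicoke"],
    "Neighborhood": ["Not assigned", "Islington Avenue"],
})

# Where the neighborhood is the placeholder, take the borough instead.
toy["Neighborhood"] = toy["Neighborhood"].mask(
    toy["Neighborhood"] == "Not assigned", toy["Borough"]
)
print(toy)
```

Both approaches give the same result; the column form avoids calling a Python function once per row.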


In [0]:
# Export the dataframe
df.to_csv(r'df_can.csv')
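
By default `to_csv` also writes the row index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back. The sketch below demonstrates this with an in-memory buffer instead of a file; the toy frame is an assumption for illustration.

```python
import io

import pandas as pd

toy = pd.DataFrame({"PostalCode": ["M1B"], "Borough": ["Scarborough"]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)

# index=False writes only the data columns.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)

cols_with = pd.read_csv(io.StringIO(with_index.getvalue())).columns.tolist()
cols_without = pd.read_csv(io.StringIO(without_index.getvalue())).columns.tolist()
print(cols_with)
print(cols_without)
```

Passing `index=False` is usually preferable when the index carries no information of its own.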

### Explore the data

### Shape of the table

In [13]:
df.shape

(103, 3)

# Part 2

## Geographical coordinates of each postal code

We read a CSV file that contains the geographical coordinates of each postal code.

Import **pandas** to read the CSV file.

In [14]:
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner

link = "http://cocl.us/Geospatial_data"
df1= pd.read_csv(link)

df1.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [15]:
df1.rename(columns={'Postal Code':'PostalCode'}, inplace=True)
df1.columns

Index(['PostalCode', 'Latitude', 'Longitude'], dtype='object')

#### Merge the two data frames on their `PostalCode` column

In [16]:
df2 = pd.merge(df, df1, on='PostalCode', how='outer')
df2.head(30)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,Malvern / Rouge,43.806686,-79.194353
1,M1C,Scarborough,Rouge Hill / Port Union / Highland Creek,43.784535,-79.160497
2,M1E,Scarborough,Guildwood / Morningside / West Hill,43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,Kennedy Park / Ionview / East Birchmount Park,43.727929,-79.262029
7,M1L,Scarborough,Golden Mile / Clairlea / Oakridge,43.711112,-79.284577
8,M1M,Scarborough,Cliffside / Cliffcrest / Scarborough Village West,43.716316,-79.239476
9,M1N,Scarborough,Birch Cliff / Cliffside West,43.692657,-79.264848
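
With an outer merge, postal codes present in only one table are kept and padded with missing values. One way to check for them is the `indicator` and `validate` options of `pd.merge`, sketched here on toy frames. The code `M9Z` is a made-up value used to force a mismatch.

```python
import pandas as pd

left = pd.DataFrame({
    "PostalCode": ["M1B", "M1C", "M9Z"],  # M9Z is hypothetical
    "Borough": ["Scarborough", "Scarborough", "Hypothetical"],
})
right = pd.DataFrame({
    "PostalCode": ["M1B", "M1C", "M1E"],
    "Latitude": [43.806686, 43.784535, 43.763573],
    "Longitude": [-79.194353, -79.160497, -79.188711],
})

# indicator=True adds a "_merge" column; validate raises if keys repeat.
checked = pd.merge(left, right, on="PostalCode", how="outer",
                   indicator=True, validate="one_to_one")
unmatched = checked[checked["_merge"] != "both"]
print(unmatched[["PostalCode", "_merge"]])
```

An empty `unmatched` frame would mean that every postal code was found in both tables.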


# Part 3

1. Add enough Markdown cells to explain what you decided to do and to report any observations you make.
2. Generate maps to visualize your neighborhoods and how they cluster together.

Here we generate a map to visualize the Toronto neighborhoods.

In [18]:
# set number of clusters
kclusters = 5

# cluster on the coordinates only
df4_clustering = df2[['Latitude', 'Longitude']]


# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(df4_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:2] 

array([2, 2], dtype=int32)
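
To make the clustering step less of a black box, here is a minimal NumPy sketch of the basic k-means loop (Lloyd's algorithm) on six made-up coordinate pairs. It is an illustration only, not scikit-learn's implementation: `KMeans` uses a more careful initialization (k-means++ by default). The function name `kmeans_sketch` and the sample points are assumptions.

```python
import numpy as np


def kmeans_sketch(points, k, n_iter=20, seed=0):
    """Plain Lloyd's algorithm: assign to nearest center, then recompute centers."""
    rng = np.random.default_rng(seed)
    # Start from k distinct points chosen at random (fancy indexing copies them).
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every center, shape (n_points, k).
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members):  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return labels, centers


# Made-up (latitude, longitude) pairs forming two well-separated groups.
coords = np.array([
    [43.80, -79.19], [43.78, -79.16], [43.76, -79.19],
    [43.65, -79.55], [43.64, -79.57], [43.60, -79.52],
])
labels, centers = kmeans_sketch(coords, k=2)
print(labels)
```

The label numbers themselves are arbitrary; only the grouping matters, which is why the notebook maps each label to a color rather than interpreting its value.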

In [23]:
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
address = 'Toronto, TR'

geolocator = Nominatim(user_agent="tr_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6561136, -79.392321.


In [25]:
import matplotlib.cm as cm
import matplotlib.colors as colors
import folium # map rendering library

# add clustering labels (plain assignment, so the cell can be re-run safely)
df2['Cluster Labels'] = kmeans.labels_

map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(df2['Latitude'], df2['Longitude'], df2['Neighborhood'], df2['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
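
In the color scheme above, `ys` is only used for its length, which equals `kclusters`. A shorter way to build the same kind of palette is sketched below; the names `palette` and `color_for` are hypothetical.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

kclusters = 5

# One evenly spaced sample of the rainbow colormap per cluster label.
palette = [colors.rgb2hex(rgba) for rgba in cm.rainbow(np.linspace(0, 1, kclusters))]

# Map each cluster label (0 .. kclusters-1) to its hex color.
color_for = {label: palette[label] for label in range(kclusters)}
print(color_for)
```

Indexing the palette by the label itself gives each cluster a fixed, predictable color.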