# Segmenting and Clustering Neighborhoods in Toronto

**Instructions:** In this assignment, you will explore, segment, and cluster the neighborhoods in the city of Toronto. Unlike the New York data, however, the Toronto neighborhood data is not readily available on the internet. Each data science project is challenging in its own way, so you need to be agile and able to pick up new libraries and tools quickly as the project demands.

For the Toronto neighborhood data, a Wikipedia page has all the information we need to explore and cluster the neighborhoods in Toronto. You will scrape the Wikipedia page, wrangle and clean the data, and then read it into a pandas DataFrame so that it is in a structured format like the New York dataset.

Once the data is in a structured format, you can replicate the analysis that we performed on the New York City dataset to explore and cluster the neighborhoods in the city of Toronto.

Your submission will be a link to your Jupyter Notebook in your GitHub repository.

## Part 1

### Retrieve HTML File from Wikipedia Containing Table of Toronto Postal Codes

In [53]:
import requests
import pandas as pd
import geocoder
from bs4 import BeautifulSoup
from pandas import json_normalize  # the pandas.io.json import path is deprecated since pandas 1.0

# Download the page and parse its HTML
wiki_html: str = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
wiki_soup = BeautifulSoup(wiki_html, 'lxml')

### Create the DataFrame
Table headers and row data are scraped from the HTML and loaded into a pandas DataFrame. The DataFrame is then cleaned: empty rows are dropped with `dropna()`, and rows without an assigned borough are filtered out with `df['Borough'] != 'Not assigned'`.

In [2]:
table = wiki_soup.find('table', { 'class': 'wikitable sortable'})
table_headers = table.find_all('th')

# Cell text ends with a newline, so strip surrounding whitespace
parsed_headers = [h.text.strip() for h in table_headers]

# The header row has no <td> cells, so it yields an empty row here;
# that row is removed by dropna() in the preprocessing step
table_rows = table.find_all('tr')
parsed_rows = []
for r in table_rows:
    row_data = [d.text.strip() for d in r.find_all('td')]
    parsed_rows.append(row_data)

df = pd.DataFrame(data=parsed_rows, columns=parsed_headers)
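
The row loop can be tried on a small input without touching the network. The sketch below is a minimal illustration, not the notebook's own method: `SAMPLE_HTML` is a made-up stand-in for the Wikipedia table, and the built-in `html.parser` is used instead of `lxml` only to avoid an extra dependency. It shows that the header row contains no `<td>` cells, so it can be skipped explicitly instead of being dropped afterwards.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical stand-in for the Wikipedia table, for illustration only
SAMPLE_HTML = """
<table class="wikitable sortable">
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find("table", {"class": "wikitable sortable"})

# Column names come from the <th> cells
headers = [th.get_text(strip=True) for th in table.find_all("th")]

rows = []
for tr in table.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:  # the header row has no <td> cells, so skip it
        rows.append(cells)

sample_df = pd.DataFrame(rows, columns=headers)
print(sample_df)
```

Skipping empty rows at parse time means the later `dropna()` call only has to deal with rows that are genuinely incomplete.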

#### Preprocess the DataFrame

In [3]:
df = df.dropna()  # Drop empty rows
df = df[df['Borough'] != 'Not assigned']  # Drop rows without an assigned borough
df = df.reset_index(drop=True)  # Renumber from 0 and discard the old index
df.head(12)

|  | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
| 5 | M9A | Etobicoke | Islington Avenue, Humber Valley Village |
| 6 | M1B | Scarborough | Malvern, Rouge |
| 7 | M3B | North York | Don Mills |
| 8 | M4B | East York | Parkview Hill, Woodbine Gardens |
| 9 | M5B | Downtown Toronto | Garden District, Ryerson |


#### The Shape of the DataFrame

In [4]:
rows, cols = df.shape

print(f"The DataFrame has a shape of {rows} rows and {cols} columns.")

The DataFrame has a shape of 103 rows and 3 columns.


## Part 2
### Create DataFrame of Postal Code Coordinates
The latitude and longitude of each postal code are loaded from a CSV file.


In [5]:
df_coords = pd.read_csv('geospatial_coordinates.csv')
df_coords.head()

|  | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


### Create new DataFrame where `df` and `df_coords` are joined on `Postal Code`

In [7]:
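# Convert both frames to arrays; columns are addressed by position below
borough_data = df.to_numpy()     # [Postal Code, Borough, Neighborhood]
geo_data = df_coords.to_numpy()  # [Postal Code, Latitude, Longitude]

# Pair each borough row with the coordinate row that has the same postal code
combined_data = []
for borough in borough_data:
    for geo_entry in geo_data:
        if borough[0] == geo_entry[0]:
            combined_data.append([borough[0], borough[1], borough[2], geo_entry[1], geo_entry[2]])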

df_combined = pd.DataFrame(data=combined_data, columns=["Postal Code", "Borough", "Neighborhood", "Latitude", "Longitude"])
df_combined.head(12)


|  | Postal Code | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |
| 5 | M9A | Etobicoke | Islington Avenue, Humber Valley Village | 43.667856 | -79.532242 |
| 6 | M1B | Scarborough | Malvern, Rouge | 43.806686 | -79.194353 |
| 7 | M3B | North York | Don Mills | 43.745906 | -79.352188 |
| 8 | M4B | East York | Parkview Hill, Woodbine Gardens | 43.706397 | -79.309937 |
| 9 | M5B | Downtown Toronto | Garden District, Ryerson | 43.657162 | -79.378937 |
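
The same join on `Postal Code` can also be expressed with `DataFrame.merge`, which avoids positional indexing into arrays. The sketch below is a minimal illustration using toy frames (`left` and `right` are hypothetical stand-ins for `df` and `df_coords`, and the `M9Z` row is invented to show an unmatched key).

```python
import pandas as pd

# Toy data standing in for `df` and `df_coords`; the M9Z row is invented
left = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M9Z"],
    "Borough": ["North York", "North York", "Etobicoke"],
    "Neighborhood": ["Parkwoods", "Victoria Village", "Hypothetical"],
})
right = pd.DataFrame({
    "Postal Code": ["M4A", "M3A"],
    "Latitude": [43.725882, 43.753259],
    "Longitude": [-79.315572, -79.329656],
})

# An inner join keeps only the postal codes present in both frames
# and preserves the row order of the left frame
merged = left.merge(right, on="Postal Code", how="inner")
print(merged)
```

In the notebook this would correspond to something like `df.merge(df_coords, on='Postal Code')`, assuming both frames carry a `Postal Code` column as the outputs above suggest.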


## Part 3

### Explore Data with Clustering

In [36]:
import folium

class Coordinates:
    def __init__(self, latitude, longitude):
        self.latitude = latitude 
        self.longitude = longitude

starting_coords = Coordinates(df_combined['Latitude'].mean(), df_combined['Longitude'].mean())

# Named toronto_map rather than map to avoid shadowing the Python built-in
toronto_map = folium.Map(location=[starting_coords.latitude, starting_coords.longitude], zoom_start=11)

# Add one circle marker per postal code, labeled with its neighborhood and borough
for lat, lng, borough, neighborhood in zip(df_combined['Latitude'], df_combined['Longitude'], df_combined['Borough'], df_combined['Neighborhood']):
    label = f'{neighborhood}, {borough}'
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=10,
        popup=label,
        color='green',
        fill=True,
        fill_color='green',
        fill_opacity=0.7,
        parse_html=False
    ).add_to(toronto_map)
toronto_map

In [23]:
%load_ext dotenv
%dotenv -v ./../../.env
import os  
CLIENT_ID = os.getenv("FOURSQUARE_CLIENTID")
CLIENT_SECRET = os.getenv("FOURSQUARE_CLIENTSECRET")

The dotenv extension is already loaded. To reload it, use:
  %reload_ext dotenv
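
`os.getenv` returns `None` when a variable is missing, and formatting `None` into a URL produces the literal text `None` instead of an error. A small guard can surface the problem earlier. The sketch below is only an illustration: `require_env` and the `EXAMPLE_CLIENTID` variable are hypothetical names, not part of the notebook or of any library.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Demonstration with a placeholder; real credentials would come from the .env file
os.environ["EXAMPLE_CLIENTID"] = "placeholder-id"
print(require_env("EXAMPLE_CLIENTID"))
```

With such a helper, a missing credential fails at load time rather than as an opaque API error later.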


In [24]:
VERSION = '20180605' # Foursquare API version

In [71]:
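def get_category_type(row):
    """Return the name of the first category of a venue row, or None."""
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']

    if len(categories_list) == 0:
        return None
    return categories_list[0]['name']

radius = 500  # Search radius in meters
limit = 100  # Maximum number of venues to return
# Search around the map center computed earlier (mean latitude and longitude)
url = f'https://api.foursquare.com/v2/venues/search?v={VERSION}&client_id={CLIENT_ID}&client_secret={CLIENT_SECRET}&ll={starting_coords.latitude},{starting_coords.longitude}&radius={radius}&limit={limit}'

results = requests.get(url).json()
venues = results['response']['venues']
df_venues = json_normalize(venues)  # Flatten the nested JSON into columns

filtered_columns = ['name', 'categories', 'location.lat', 'location.lng']
df_nearby_venues = df_venues.loc[:, filtered_columns].copy()  # Copy so the column assignment below is safe

# Extract the first category name of each venue into a new column
df_nearby_venues['venue.categories'] = df_nearby_venues.apply(get_category_type, axis=1)

# Keep only the last part of dotted column names, e.g. 'location.lat' -> 'lat'
df_nearby_venues.columns = [col.split(".")[-1] for col in df_nearby_venues.columns]

df_nearby_venues.head()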

|  | name | categories | lat | lng | categories.1 |
|---|---|---|---|---|---|
| 0 | Coffee & Deli Delight | [{'id': '4bf58dd8d48988d1e0931735', 'name': 'C... | 43.705272 | -79.397969 | Coffee Shop |
| 1 | Eglinton Subway Station | [{'id': '4bf58dd8d48988d1fd931735', 'name': 'M... | 43.706703 | -79.39845 | Metro Station |
| 2 | Starbucks | [{'id': '4bf58dd8d48988d1e0931735', 'name': 'C... | 43.705563 | -79.39763 | Coffee Shop |
| 3 | Minto Midtown - Quantum South | [{'id': '4d954b06a243a5684965b473', 'name': 'R... | 43.705357 | -79.397424 | Residential Building (Apartment / Condo) |
| 4 | Canadian Tire Home Office | [{'id': '4bf58dd8d48988d124941735', 'name': 'O... | 43.704766 | -79.398349 | Office |
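
The request URL is built by hand with an f-string, which leaves escaping of special characters to the author. As a minimal sketch of an alternative, the query string can be assembled from a dictionary with the standard library. All values below are placeholders chosen for illustration, not real credentials or the notebook's actual coordinates.

```python
from urllib.parse import urlencode

# Placeholder values for illustration; real credentials come from the environment
params = {
    "v": "20180605",
    "client_id": "EXAMPLE_ID",
    "client_secret": "EXAMPLE_SECRET",
    "ll": "43.7,-79.4",
    "radius": 500,
    "limit": 100,
}

# urlencode percent-encodes reserved characters such as the comma in `ll`
url = "https://api.foursquare.com/v2/venues/search?" + urlencode(params)
print(url)
```

When using `requests`, the same dictionary can be passed as `requests.get(base_url, params=params)`, which performs the encoding itself.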


In [68]:
print(f'{df_nearby_venues.shape[0]} venues were returned by Foursquare.')

100 venues were returned by Foursquare.


### Clustering with K-means

In [65]:
from sklearn.cluster import KMeans

In [66]:
kclusters = 5
kmeans = KMeans(n_clusters=kclusters, random_state=0)
kmeans.fit(df_combined.to_numpy())

ValueError: could not convert string to float: 'Mimico NW, The Queensway West, South of Bloor, Kingsway Park South West, Royal York South West'
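
The `ValueError` arises because `df_combined.to_numpy()` includes the string columns (`Postal Code`, `Borough`, `Neighborhood`), and `KMeans` requires numeric features. The sketch below shows one way around it on a toy frame: fit only on the numeric columns. Which features to cluster on is a modelling choice this notebook does not settle, so using the coordinates alone is purely an assumption here, and the `toy` data is invented for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy frame with the same column layout as `df_combined`; values are invented
toy = pd.DataFrame({
    "Postal Code": ["A1", "A2", "A3", "B1", "B2", "B3"],
    "Borough": ["East"] * 3 + ["West"] * 3,
    "Neighborhood": ["n1", "n2", "n3", "n4", "n5", "n6"],
    "Latitude": [43.80, 43.81, 43.79, 43.65, 43.64, 43.66],
    "Longitude": [-79.20, -79.19, -79.21, -79.50, -79.51, -79.49],
})

# Fit on the numeric columns only; string columns cannot be converted to float
features = toy[["Latitude", "Longitude"]].to_numpy()
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(features)

# Attach the cluster label of each row back to the frame
toy["Cluster"] = kmeans.labels_
print(toy[["Postal Code", "Cluster"]])
```

Applied to the notebook, the analogous call would be `kmeans.fit(df_combined[['Latitude', 'Longitude']])`, again assuming location is the intended feature set.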