# Segmenting and Clustering Neighborhoods in Toronto Project

## Introduction

This project scrapes a Wikipedia table with BeautifulSoup and converts the scraped data into a pandas dataframe. It then uses the Foursquare API to explore neighborhoods in the city of Toronto. With the API's explore function, we can find the most common venue categories in each neighborhood and use them to group the neighborhoods into clusters.
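
As a rough preview of the clustering idea, here is a minimal sketch. It assumes the venue data has already been reduced to a table of category frequencies per neighborhood; the neighborhood names and numbers below are invented for illustration and are not Foursquare results.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical venue-category frequencies per neighborhood (made-up numbers,
# not Foursquare results); each row sums to 1.
venue_freq = pd.DataFrame(
    {
        "Coffee Shop": [0.6, 0.5, 0.1, 0.0],
        "Park": [0.1, 0.1, 0.7, 0.8],
        "Restaurant": [0.3, 0.4, 0.2, 0.2],
    },
    index=["Neighborhood A", "Neighborhood B", "Neighborhood C", "Neighborhood D"],
)

# Most common venue category in each neighborhood.
top_category = venue_freq.idxmax(axis=1)

# Group the neighborhoods into two clusters by their venue profile.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(venue_freq)
clusters = pd.Series(kmeans.labels_, index=venue_freq.index, name="Cluster")

print(top_category)
print(clusters)
```

The cluster numbers themselves are arbitrary; what matters is which neighborhoods end up sharing a label.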

### Section 1: Web Scraping - Wikipedia Table

In [1]:
#Importing required libraries
import numpy as np 
import pandas as pd 
from pandas import json_normalize #pandas.io.json.json_normalize is deprecated since pandas 1.0
import json 
import requests 
from bs4 import BeautifulSoup 

from sklearn.cluster import KMeans 

import matplotlib.cm as cm
import matplotlib.colors as colors
import sys

In [None]:
#Install folium once; --yes skips the confirmation prompt
!conda install -c conda-forge folium=0.5.0 --yes
import folium # map rendering library

In [2]:
#Wikipedia is blocked in some countries, so this notebook fetches the page through the "wikizero" mirror. That is why the address looks unusual.
#Use the standard Wikipedia address if there is no such restriction.
website_url = requests.get("https://www.wikizeroo.org/index.php?q=aHR0cHM6Ly9lbi53aWtpcGVkaWEub3JnL3dpa2kvTGlzdF9vZl9wb3N0YWxfY29kZXNfb2ZfQ2FuYWRhOl9N").text
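
For readers who can reach Wikipedia directly, the standard address can be recovered from the mirror link. The sketch below assumes the `q` parameter is simply the Base64-encoded target address; that is an observation about this one URL, not documented behavior of the mirror.

```python
import base64
from urllib.parse import urlparse, parse_qs

# The mirror URL used in the cell above.
mirror_url = (
    "https://www.wikizeroo.org/index.php?q="
    "aHR0cHM6Ly9lbi53aWtpcGVkaWEub3JnL3dpa2kvTGlzdF9vZl9wb3N0YWxfY29kZXNfb2ZfQ2FuYWRhOl9N"
)

# Pull out the "q" query parameter and decode it from Base64.
encoded = parse_qs(urlparse(mirror_url).query)["q"][0]
standard_url = base64.b64decode(encoded).decode("utf-8")

print(standard_url)
```

The decoded value can be passed to `requests.get` in place of the mirror URL.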

In [4]:
soup = BeautifulSoup(website_url,"lxml")

In [6]:
webtable = soup.find("table",{"class":"wikitable sortable"})

In [5]:
#define table attributes
column_names = ['Postalcode','Borough','Neighborhood'] #must match the keys used when the rows are added
toronto_data = pd.DataFrame(columns = column_names)
content = soup.find('div', class_='mw-parser-output')
table = content.table.tbody
postcode = 0
borough = 0
neighborhood = 0

In [6]:
#Extract the text of every cell, row by row
rows = []
for tr in table.find_all('tr'):
    cells = [td.text.strip() for td in tr.find_all('td')]
    if len(cells) >= 3: #the header row has no <td> cells, so it is skipped
        rows.append({'Postalcode': cells[0],
                     'Borough': cells[1],
                     'Neighborhood': cells[2].replace(']','')})
#DataFrame.append was removed in pandas 2.0, so build the dataframe once from the collected rows
toronto_data = pd.DataFrame(rows, columns = ['Postalcode','Borough','Neighborhood'])
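
The row-by-row extraction can be tried offline on a small inline table. This is only a sketch: the HTML below is a hand-written stand-in for the Wikipedia page (it assumes three `<td>` cells per data row), and it uses Python's built-in `html.parser` instead of `lxml` so that no extra parser is needed.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hand-written HTML standing in for the Wikipedia table (illustrative only).
html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M5A</td><td>Downtown Toronto</td><td>Harbourfront</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="wikitable")

rows = []
for tr in table.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if len(cells) == 3:  # the header row only has <th> cells, so it is skipped
        rows.append(cells)

sample = pd.DataFrame(rows, columns=["Postalcode", "Borough", "Neighborhood"])
print(sample)
```

Collecting the rows in a list and building the dataframe once avoids growing a dataframe inside the loop.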

In [7]:
#Deleting Not Assigned values
toronto_data = toronto_data[toronto_data.Borough != "Not assigned"]
toronto_data = toronto_data[toronto_data.Borough != 0]
toronto_data.reset_index(drop = True, inplace = True)
#If a neighborhood is not assigned, use the borough name instead.
#A single .loc assignment avoids chained indexing, which triggers pandas' SettingWithCopyWarning.
mask = toronto_data['Neighborhood'] == 'Not assigned'
toronto_data.loc[mask, 'Neighborhood'] = toronto_data.loc[mask, 'Borough']


In [8]:
df = toronto_data.groupby(['Postalcode','Borough'])['Neighborhood'].apply(', '.join).reset_index()
pd.options.display.max_rows = 1000 #display up to 1000 rows, enough to show the whole dataframe
df.head()

|   | Postalcode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
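
The grouping step merges neighborhoods that share a postal code into one comma-separated entry. A minimal sketch on three illustrative rows (assuming one row per neighborhood, as in the scraped table) shows the effect:

```python
import pandas as pd

# Illustrative rows: two neighborhoods share the postal code M5A.
rows = pd.DataFrame({
    "Postalcode": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Harbourfront", "Regent Park", "Parkwoods"],
})

# Join the neighborhood names within each (Postalcode, Borough) group.
merged = (
    rows.groupby(["Postalcode", "Borough"])["Neighborhood"]
    .apply(", ".join)
    .reset_index()
)
print(merged)
```

The two M5A rows collapse into a single row, so the result has one row per postal code.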


In [9]:
#The number of rows: 103 | The number of columns: 3
df.shape

(103, 3)