# Segmenting and Clustering Neighborhoods in Toronto

## Introduction

In this project, I extract the postal codes, boroughs, and neighborhoods of Toronto, Canada by web scraping with BeautifulSoup, and then convert them into their equivalent latitude and longitude values. I then use the Foursquare API to explore neighborhoods in Toronto. The **explore** function returns the most common venue categories in each neighborhood, and I use these as features to group the neighborhoods into clusters with the *k*-means clustering algorithm. Finally, I use the Folium library to visualize the neighborhoods in Toronto and their emerging clusters.

### 1. Extracting Data from a Webpage to Build the Dataset

<div id='#item1'> As no dataset for Canada is directly available, we create our own by scraping the data from the website <a href='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'>https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M</a> using the BeautifulSoup package for Python. </div>

#### Step 1 : Importing All Libraries

In [2]:
# BeautifulSoup helps in scraping data from a webpage
from bs4 import BeautifulSoup
# lxml is the parser used to parse the content of the different HTML tags
import lxml
# requests fetches the content of the webpage (imported here under the alias req)
import requests as req
# library to handle data in a vectorized manner
import numpy as np
# library for data analysis
import pandas as pd
# library to handle JSON files
import json
# convert an address into latitude and longitude values
from geopy.geocoders import Nominatim
# library to handle requests
import requests
# transform a JSON file into a pandas dataframe
# (pandas.io.json.json_normalize is deprecated since pandas 1.0)
from pandas import json_normalize
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
# import k-means from clustering stage
from sklearn.cluster import KMeans
# map rendering library
import folium
print('Libraries imported.')

Libraries imported.


#### Step 2 : Web Scraping Data from the Webpage
We extract the table data and write the filtered rows to a CSV file, **demofile.csv**. After this step we have our raw dataset.

In [3]:
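import csv

r = req.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
soup = BeautifulSoup(r.content,'lxml')

# Assumption: the first <th> on the page belongs to the postal code table,
# so its parent table is the one we want (this avoids hard-coded row counts)
table = soup.find('th').find_parent('table')

# The with-block closes the file even if an error occurs;
# csv.writer quotes any cell that itself contains a comma
with open("demofile.csv", "w", newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    # Header row: Postcode, Borough, Neighbourhood
    writer.writerow([header.text.strip() for header in table.find_all('th')])
    # One CSV row per table row; the header row has no <td> cells and is skipped
    for record in table.find_all('tr'):
        tdata = [data.text.strip() for data in record.find_all('td')]
        if tdata:
            writer.writerow(tdata)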
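
To make the row-extraction logic easier to follow without fetching the live page, here is a minimal sketch that parses a small, made-up HTML table with the same three columns. The sample HTML and the helper name `table_to_rows` are illustrative assumptions, and the sketch uses Python's built-in `html.parser` backend instead of lxml so that it needs no extra parser.

```python
from bs4 import BeautifulSoup

# Hypothetical sample that mimics the structure of the scraped table
SAMPLE_HTML = """
<table class="wikitable">
<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
<tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
<tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
</table>
"""

def table_to_rows(html):
    """Return every row of the first table as a list of stripped cell strings."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    rows = []
    for tr in table.find_all("tr"):
        # Header cells are <th>, data cells are <td>; collect both
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    return rows

rows = table_to_rows(SAMPLE_HTML)
print(rows)
```

Stripping each cell removes the trailing newline that the last column of every row carries in the page source.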

#### Step 3: Reading demofile.csv into the dat dataframe
We can verify the content of the dat dataframe using **dat.head()**.

In [4]:
dat=pd.read_csv('demofile.csv')
dat.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


#### Step 4: Converting the 'Not assigned' values to NaN in the dat dataframe.

In [5]:
dat.replace('Not assigned',np.nan,inplace=True)

#### Step 5: Dropping the rows whose **Borough** is NaN.

In [6]:
dat.dropna(subset=['Borough'],inplace=True)

#### Step 6: If a row has a borough but a 'Not assigned' neighborhood, the neighborhood is set to the borough name

In [7]:
# After Step 4 the missing neighbourhoods are real NaN values, not the string 'NaN',
# and assigning to `row` inside iterrows() does not reliably update the dataframe,
# so select the affected rows with a boolean mask instead
missing = dat['Neighbourhood'].isna() | (dat['Neighbourhood'] == '')
dat.loc[missing, 'Neighbourhood'] = dat.loc[missing, 'Borough']
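
Steps 4 to 6 can be tried end to end on a few rows. The following is a minimal sketch on a made-up sample dataframe (the three rows are illustrative, not the scraped data), using `fillna` as one possible way to fall back to the borough name.

```python
import numpy as np
import pandas as pd

# Made-up sample rows in the same shape as the scraped table
sample = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M7A"],
    "Borough": ["Not assigned", "North York", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Not assigned"],
})

# Step 4: treat 'Not assigned' as a missing value
cleaned = sample.replace("Not assigned", np.nan)
# Step 5: drop rows that have no borough
cleaned = cleaned.dropna(subset=["Borough"])
# Step 6: where the neighbourhood is missing, use the borough name
cleaned["Neighbourhood"] = cleaned["Neighbourhood"].fillna(cleaned["Borough"])

print(cleaned)
```

In this sample, M1A is dropped because it has no borough, and M7A keeps its row with the borough name as its neighbourhood.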

#### Step 7: More than one neighborhood can exist in one postal code area. Such rows are combined into one row, with the neighborhoods separated by commas.

In [8]:
dat.Neighbourhood = dat.Neighbourhood.astype(str)
dat= dat.groupby(['Postcode','Borough'])['Neighbourhood'].apply(', '.join).reset_index()
# Use a 1-based index sized to the dataframe instead of a hard-coded range
dat.index=range(1,len(dat)+1)
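
To illustrate how the grouping collapses several rows into one, here is a minimal sketch on a handful of sample rows. The dataframe name `sample` and its contents are assumptions chosen for illustration.

```python
import pandas as pd

# Sample rows: two postal codes with two neighbourhoods each, one with a single entry
sample = pd.DataFrame({
    "Postcode": ["M5A", "M5A", "M6A", "M6A", "M7A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto",
                "North York", "North York", "Queen's Park"],
    "Neighbourhood": ["Harbourfront", "Regent Park",
                      "Lawrence Heights", "Lawrence Manor", "Queen's Park"],
})

# Group by postal code and borough, then join the neighbourhood names with ', '
combined = (
    sample.groupby(["Postcode", "Borough"])["Neighbourhood"]
    .apply(", ".join)
    .reset_index()
)
# Number the rows from 1, sized to however many groups there are
combined.index = range(1, len(combined) + 1)

print(combined)
```

The five input rows become three, one per postal code, and the row order within each group is preserved in the joined string.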

#### Step 8: Using the .shape attribute to check the number of rows of the dataframe.

In [9]:
dat.shape

(103, 3)

Now that we have a dataframe with the postal code, borough name, and neighborhood name for each area, we need the latitude and longitude coordinates of each postal code in order to use the Foursquare location data.
These can be obtained with the Geocoder package or from a CSV file of coordinates; here the CSV file is used to create the following dataframe:



In [10]:

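# Load the coordinates (columns used here: Postal Code, Longitude, Latitude)
geospat_data=pd.read_csv('Geospatial_Coordinates.csv')
# Left-join the coordinates onto dat by postal code. Unlike a nested row-by-row loop,
# this keeps every row of dat even if a postal code has no match (it gets NaN)
dat=dat.merge(geospat_data[['Postal Code','Longitude','Latitude']],
              left_on='Postcode',right_on='Postal Code',how='left')
dat.drop(columns='Postal Code',inplace=True)
# merge resets the index, so restore the 1-based numbering
dat.index=range(1,len(dat)+1)
# Per-column count of missing values (not displayed, as it is not the cell's last expression)
dat.isnull().sum(axis=0)
dat.to_csv('int_data.csv')
dat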

Unnamed: 0,Postcode,Borough,Neighbourhood,Longitude,Latitude
1,M1B,Scarborough,"Rouge, Malvern",-79.194353,43.806686
2,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",-79.160497,43.784535
3,M1E,Scarborough,"Guildwood, Morningside, West Hill",-79.188711,43.763573
4,M1G,Scarborough,Woburn,-79.216917,43.770992
5,M1H,Scarborough,Cedarbrae,-79.239476,43.773136
6,M1J,Scarborough,Scarborough Village,-79.239476,43.744734
7,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",-79.262029,43.727929
8,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge",-79.284577,43.711112
9,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West",-79.239476,43.716316
10,M1N,Scarborough,"Birch Cliff, Cliffside West",-79.264848,43.692657
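
Joining coordinates by postal code is worth checking on a few rows before trusting the full table. The sketch below is a minimal example on sample data: the dataframe names are illustrative, and `M9Z` with "Example Borough" is a hypothetical entry added only to show what an unmatched postal code looks like.

```python
import pandas as pd

# Sample rows; M9Z is a hypothetical postal code with no coordinates
neighbourhoods = pd.DataFrame({
    "Postcode": ["M1B", "M1C", "M9Z"],
    "Borough": ["Scarborough", "Scarborough", "Example Borough"],
})
coordinates = pd.DataFrame({
    "Postal Code": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# how="left" keeps every neighbourhood row; indicator=True adds a "_merge"
# column saying whether each row found a match in the coordinates table
merged = neighbourhoods.merge(
    coordinates,
    left_on="Postcode",
    right_on="Postal Code",
    how="left",
    indicator=True,
)

# Postal codes that have no coordinates
unmatched = merged.loc[merged["_merge"] == "left_only", "Postcode"].tolist()
print(unmatched)
```

An empty `unmatched` list on the real data would indicate that every postal code received a latitude and longitude.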
