# Segmenting and Clustering Neighborhoods in Toronto

Tasks: 
1. Start by creating a new Notebook for this assignment.

2. Use the Notebook to build the code to scrape the following Wikipedia page:

   https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

   in order to obtain the data that is in the table of postal codes and to transform the data into a pandas dataframe like the one shown below:

3. To create the above dataframe:
     a. The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood. Only process the cells that have an assigned borough. 
     b. Ignore cells with a borough that is Not assigned.
     c. More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.
     d. If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough. So for the 9th cell in the table on the Wikipedia page, the value of the Borough and the Neighborhood columns will be Queen's Park.
     e. Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.
     f. In the last cell of your notebook, use the .shape method to print the number of rows of your dataframe.

### Step 1: Import all necessary libraries 

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

print('Libraries imported.')

Libraries imported.


In [2]:
source = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
html_doc = requests.get(source).text

In [4]:
from bs4 import BeautifulSoup 
soup = BeautifulSoup(html_doc, 'html.parser') 

#print(soup.prettify())

In [5]:
data = pd.read_html(source, index_col=0, attrs={"class": "wikitable"})
df = data[0]
df.reset_index(inplace=True)
df.head()

Unnamed: 0,0,1,2
0,Postcode,Borough,Neighbourhood
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
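
The `head()` output above shows the real header (`Postcode`, `Borough`, `Neighbourhood`) sitting in row 0 under integer column labels. One way to promote that row to column names is sketched below on a small hand-built frame that mimics the output; `raw` and `tidy` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Hand-built stand-in for the frame returned by read_html (assumption:
# integer column labels, with the real header stored in the first row).
raw = pd.DataFrame([
    ["Postcode", "Borough", "Neighbourhood"],
    ["M1A", "Not assigned", "Not assigned"],
    ["M3A", "North York", "Parkwoods"],
])

# Use the first row as column names, then drop it from the data.
tidy = raw.iloc[1:].reset_index(drop=True)
tidy.columns = raw.iloc[0].tolist()

print(tidy)
```

Passing `header=0` to `pd.read_html` is another option when the parser does not pick up the header row on its own.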


## Task 3A: The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood

Only process the cells that have an assigned borough.

In [6]:
#Declare variables for data needed for dataframes 

PostalCode = []
Borough = []
Neighborhood = []

# use beautifulsoup library method 'find' to identify tag with tbody
tbody = soup.find('tbody')
#print(tbody.find_all('td')) 

In [7]:
#The enumerate() method adds counter to an iterable and returns it (the enumerate object).
#The syntax of enumerate() is: enumerate(iterable, start=0)

for index, value in enumerate(tbody.find_all('td')):
    
    #Use str.strip() to remove surrounding whitespace from each cell
    #Use the remainder (index % 3) to route each cell to its column list
    if (index%3 == 0):
        PostalCode.append(value.text.strip())
    elif(index%3 == 1):
        Borough.append(value.text.strip())
    else:
        Neighborhood.append(value.text.strip())
        
#Dictionaries are sometimes found in other languages as “associative memories” or “associative arrays”. 
#Unlike sequences, which are indexed by a range of numbers, dictionaries are indexed by keys 
#which can be any immutable type; strings and numbers can always be keys

dic_colNames = { "PostalCode":PostalCode, "Borough":Borough, "Neighborhood": Neighborhood }


In [8]:
#Construct DataFrame from dict of array-like or dicts 
#pandas.DataFrame.from_dict 
toronto_df = pd.DataFrame.from_dict(dic_colNames)


In [9]:
#Print the first five rows 
toronto_df.head()

Unnamed: 0,Borough,Neighborhood,PostalCode
0,Not assigned,Not assigned,M1A
1,Not assigned,Not assigned,M2A
2,North York,Parkwoods,M3A
3,North York,Victoria Village,M4A
4,Downtown Toronto,Harbourfront,M5A
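
The `index % 3` test in the loop assigns every third cell to the same column, which is the same as taking stride slices of the flattened cell list. The sketch below shows that equivalence on a short hypothetical sample of cell texts and also fixes the column order explicitly, since the output above lists the columns alphabetically. `cells` and `sample_df` are illustrative names.

```python
import pandas as pd

# Flattened cell texts in page order (hypothetical sample of the <td> values).
cells = [
    "M1A", "Not assigned", "Not assigned",
    "M3A", "North York", "Parkwoods",
    "M4A", "North York", "Victoria Village",
]

# Every third cell starting at offset 0, 1 and 2 gives the three columns;
# this matches the index % 3 test used in the loop.
postal_codes = cells[0::3]
boroughs = cells[1::3]
neighborhoods = cells[2::3]

sample_df = pd.DataFrame(
    {"PostalCode": postal_codes, "Borough": boroughs, "Neighborhood": neighborhoods},
    columns=["PostalCode", "Borough", "Neighborhood"],  # explicit column order
)
print(sample_df)
```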


## Task 3B: Ignore cells with a borough that is Not assigned

Only process the cells that have an assigned borough. Rows whose borough is Not assigned are dropped first; neighborhoods that share a postal code are combined in a later step.

In [10]:
#Keep only the rows whose Borough is assigned, dropping every 'Not assigned' borough.
#The index is reset in the next cell so that it runs from 0 again.
toronto_df = toronto_df[toronto_df.Borough != 'Not assigned']

toronto_df.head() 



Unnamed: 0,Borough,Neighborhood,PostalCode
2,North York,Parkwoods,M3A
3,North York,Victoria Village,M4A
4,Downtown Toronto,Harbourfront,M5A
5,Downtown Toronto,Regent Park,M5A
6,North York,Lawrence Heights,M6A


In [11]:
toronto_df.reset_index(drop=True, inplace=True)

In [12]:
toronto_df

Unnamed: 0,Borough,Neighborhood,PostalCode
0,North York,Parkwoods,M3A
1,North York,Victoria Village,M4A
2,Downtown Toronto,Harbourfront,M5A
3,Downtown Toronto,Regent Park,M5A
4,North York,Lawrence Heights,M6A
5,North York,Lawrence Manor,M6A
6,Queen's Park,Not assigned,M7A
7,Etobicoke,Islington Avenue,M9A
8,Scarborough,Rouge,M1B
9,Scarborough,Malvern,M1B


In [37]:
df = toronto_df[toronto_df.Borough != 'Not assigned']
df.reset_index(drop=True, inplace=True)
df.head() 

Unnamed: 0,Borough,Neighborhood,PostalCode
0,North York,Parkwoods,M3A
1,North York,Victoria Village,M4A
2,Downtown Toronto,Harbourfront,M5A
3,Downtown Toronto,Regent Park,M5A
4,North York,Lawrence Heights,M6A
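
The filter and the index reset can also be chained into one expression. This is a minimal sketch on a hand-built sample (the frame `sample` and the name `assigned` are illustrative); because `reset_index(drop=True)` returns a new frame, no `inplace` call on a filtered slice is needed.

```python
import pandas as pd

# Small illustrative sample in the same layout as toronto_df.
sample = pd.DataFrame({
    "PostalCode": ["M1A", "M3A", "M5A", "M5A"],
    "Borough": ["Not assigned", "North York", "Downtown Toronto", "Downtown Toronto"],
    "Neighborhood": ["Not assigned", "Parkwoods", "Harbourfront", "Regent Park"],
})

# The boolean mask keeps assigned boroughs; reset_index(drop=True) renumbers
# the rows from 0 and discards the old index.
assigned = sample[sample["Borough"] != "Not assigned"].reset_index(drop=True)
print(assigned)
```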


## Task 3C: Combine neighborhoods that share a postal code

More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row, with the neighborhoods separated by a comma, as shown in row 11 in the above table.


In [13]:

#lambda - anonymous function 

groupsDic = {'PostalCode': 'min',
                 "Borough": 'min',
                 "Neighborhood": lambda neighbourhood: ','.join(neighbourhood)}

#groupby splits the data into groups according to a variable of your choice. 
#For example, the expression data.groupby('month') will split the DataFrame by month.
grouped_torontodf = toronto_df.groupby(toronto_df['PostalCode']).agg(groupsDic)
grouped_torontodf


PostalCode,Neighborhood,Borough,PostalCode
M1B,"Rouge,Malvern",Scarborough,M1B
M1C,"Highland Creek,Rouge Hill,Port Union",Scarborough,M1C
M1E,"Guildwood,Morningside,West Hill",Scarborough,M1E
M1G,Woburn,Scarborough,M1G
M1H,Cedarbrae,Scarborough,M1H
M1J,Scarborough Village,Scarborough,M1J
M1K,"East Birchmount Park,Ionview,Kennedy Park",Scarborough,M1K
M1L,"Clairlea,Golden Mile,Oakridge",Scarborough,M1L
M1M,"Cliffcrest,Cliffside,Scarborough Village West",Scarborough,M1M
M1N,"Birch Cliff,Cliffside West",Scarborough,M1N
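
The output above carries `PostalCode` twice, once as the index and once as an aggregated column. A possible alternative, sketched here on a hand-built sample rather than the scraped data, is to group with `as_index=False` so the key stays an ordinary column. The names `sample` and `combined` are illustrative.

```python
import pandas as pd

# Small illustrative sample with repeated postal codes.
sample = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M1B", "M1B", "M1G"],
    "Borough": ["Downtown Toronto", "Downtown Toronto",
                "Scarborough", "Scarborough", "Scarborough"],
    "Neighborhood": ["Harbourfront", "Regent Park", "Rouge", "Malvern", "Woburn"],
})

# as_index=False keeps PostalCode as a regular column, so it does not need
# its own aggregation or a later reset of the index.
combined = sample.groupby("PostalCode", as_index=False).agg({
    "Borough": "first",
    "Neighborhood": lambda names: ", ".join(names),
})
print(combined)
```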


In [11]:
grouped_torontodf.reset_index(drop=True, inplace=True)
grouped_torontodf

Unnamed: 0,Neighborhood,PostalCode,Borough
0,"Rouge,Malvern",M1B,Scarborough
1,"Highland Creek,Rouge Hill,Port Union",M1C,Scarborough
2,"Guildwood,Morningside,West Hill",M1E,Scarborough
3,Woburn,M1G,Scarborough
4,Cedarbrae,M1H,Scarborough
5,Scarborough Village,M1J,Scarborough
6,"East Birchmount Park,Ionview,Kennedy Park",M1K,Scarborough
7,"Clairlea,Golden Mile,Oakridge",M1L,Scarborough
8,"Cliffcrest,Cliffside,Scarborough Village West",M1M,Scarborough
9,"Birch Cliff,Cliffside West",M1N,Scarborough
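
The task list also asks that a row with a borough but a Not assigned neighborhood take the borough name (item d), and the M7A / Queen's Park row in the earlier output still shows `Not assigned`. A minimal sketch of that replacement on a two-row illustrative frame follows; on the real data it would most naturally run on `toronto_df` before the grouping step.

```python
import pandas as pd

# Illustrative rows: one with a missing neighborhood, one without.
sample = pd.DataFrame({
    "PostalCode": ["M7A", "M9A"],
    "Borough": ["Queen's Park", "Etobicoke"],
    "Neighborhood": ["Not assigned", "Islington Avenue"],
})

# Where the neighborhood is Not assigned, copy the borough name into it.
unassigned = sample["Neighborhood"] == "Not assigned"
sample.loc[unassigned, "Neighborhood"] = sample.loc[unassigned, "Borough"]
print(sample)
```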


## Task 3E Clean my notebook 

## Task 3F: In the last cell of your notebook, use .shape to print the number of rows of your dataframe

Setting the Neighbourhood to the same name as the Borough if it is Not assigned.

Shape of the postal code dataframe:

In [12]:
grouped_torontodf.shape


(103, 3)
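
`shape` is an attribute that holds a `(rows, columns)` tuple, so the row count alone is its first element. A short sketch on an illustrative three-row frame:

```python
import pandas as pd

# Illustrative frame; the real dataframe above has shape (103, 3).
sample = pd.DataFrame({
    "PostalCode": ["M1B", "M1C", "M1E"],
    "Borough": ["Scarborough"] * 3,
    "Neighborhood": ["Rouge", "Highland Creek", "Guildwood"],
})

n_rows, n_cols = sample.shape  # unpack (rows, columns)
print("rows:", n_rows, "columns:", n_cols)
```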

In [15]:
canada_spatial = pd.read_csv('http://cocl.us/Geospatial_data')
canada_spatial

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
5,M1J,43.744734,-79.239476
6,M1K,43.727929,-79.262029
7,M1L,43.711112,-79.284577
8,M1M,43.716316,-79.239476
9,M1N,43.692657,-79.264848


In [16]:
canada_spatial.rename(columns={'Postal Code':'PostalCode'},inplace=True)
canada_spatial

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
5,M1J,43.744734,-79.239476
6,M1K,43.727929,-79.262029
7,M1L,43.711112,-79.284577
8,M1M,43.716316,-79.239476
9,M1N,43.692657,-79.264848


In [19]:
new = grouped_torontodf.merge(right = canada_spatial, on='PostalCode')
new 

Unnamed: 0,Neighborhood,Borough,PostalCode,Latitude,Longitude
0,"Rouge,Malvern",Scarborough,M1B,43.806686,-79.194353
1,"Highland Creek,Rouge Hill,Port Union",Scarborough,M1C,43.784535,-79.160497
2,"Guildwood,Morningside,West Hill",Scarborough,M1E,43.763573,-79.188711
3,Woburn,Scarborough,M1G,43.770992,-79.216917
4,Cedarbrae,Scarborough,M1H,43.773136,-79.239476
5,Scarborough Village,Scarborough,M1J,43.744734,-79.239476
6,"East Birchmount Park,Ionview,Kennedy Park",Scarborough,M1K,43.727929,-79.262029
7,"Clairlea,Golden Mile,Oakridge",Scarborough,M1L,43.711112,-79.284577
8,"Cliffcrest,Cliffside,Scarborough Village West",Scarborough,M1M,43.716316,-79.239476
9,"Birch Cliff,Cliffside West",Scarborough,M1N,43.692657,-79.264848
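
The default inner merge silently drops any postal code that has no coordinates. One way to check for such rows, sketched on hand-built frames with a made-up code `M9Z`, is a left merge with `indicator=True`; `neighborhoods`, `coords`, `checked` and `missing` are illustrative names.

```python
import pandas as pd

neighborhoods = pd.DataFrame({
    "PostalCode": ["M1B", "M1C", "M9Z"],  # M9Z is a made-up code with no coordinates
    "Borough": ["Scarborough", "Scarborough", "Hypothetical"],
})
coords = pd.DataFrame({
    "PostalCode": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# how="left" keeps every neighborhood row; indicator=True adds a _merge column
# recording whether a match was found; validate checks that keys are unique.
checked = neighborhoods.merge(coords, on="PostalCode", how="left",
                              indicator=True, validate="one_to_one")
missing = checked.loc[checked["_merge"] == "left_only", "PostalCode"].tolist()
print(missing)
```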


# Task 4: Explore and cluster the neighborhoods in Toronto

You can decide to work with only boroughs that contain the word Toronto and then replicate the same analysis we did to the New York City data. It is up to you.

Just make sure:

1. to add enough Markdown cells to explain what you decided to do and to report any observations you make.
2. to generate maps to visualize your neighborhoods and how they cluster together.
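
If the analysis is restricted to boroughs that contain the word Toronto, the filter could look like the sketch below. It runs on a small illustrative frame; `sample` and `toronto_only` are hypothetical names, and the notebook itself keeps all boroughs.

```python
import pandas as pd

# Illustrative rows; only some boroughs contain the word "Toronto".
sample = pd.DataFrame({
    "PostalCode": ["M4E", "M5A", "M1B", "M6G"],
    "Borough": ["East Toronto", "Downtown Toronto", "Scarborough", "Downtown Toronto"],
})

# str.contains does a substring match on every borough name.
toronto_only = sample[sample["Borough"].str.contains("Toronto")].reset_index(drop=True)
print(toronto_only)
```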

In [24]:
location = 'Toronto,CA'
locator = Nominatim(user_agent='explorer')
address = locator.geocode(location)

In [26]:
latitude = address.latitude 
longitude = address.longitude 
print('Geographical coordinates of Toronto are {},{}'.format(latitude,longitude))

Geographical coordinates of Toronto are 43.653963,-79.387207


In [30]:
!conda install -c conda-forge folium=0.5.0 --yes # comment this line out if folium is already installed
import folium # map rendering library
print('folium imported!')

folium imported!


In [31]:
#map of Toronto 
toronto_map = folium.Map(location=[latitude,longitude], zoom_start=11)
toronto_map

In [34]:
#add markers to map 
#loop variables are named lat/lng so they do not overwrite the map-centre latitude/longitude

for lat, lng, borough, neigh in zip(new['Latitude'], new['Longitude'], new['Borough'], new['Neighborhood']):
    label = '{},{}'.format(neigh,borough)
    label = folium.Popup(label,parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7, 
        parse_html=False).add_to(toronto_map)
    
toronto_map

# Make use of FourSquare data and cluster 

In [44]:
api_version = '20180605'
client_id = 'YOUR_FOURSQUARE_CLIENT_ID' # replace with your own Foursquare client ID
client_secret = 'YOUR_FOURSQUARE_CLIENT_SECRET' # replace with your own secret; never publish real credentials

In [45]:
def getVenuesNearby(names, latitudes, longitudes, radius=500): 
    """Return one dataframe of nearby Foursquare venues for all given locations."""
    
    venues_list = []
    
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)

        # create API request URL (LIMIT is a global that must be set before the call)
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            client_id, 
            client_secret, 
            api_version, 
            lat, 
            lng, 
            radius, 
            LIMIT)
        
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # keep only the relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    # flatten the per-location lists into a single dataframe
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
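
The function indexes straight into `["response"]['groups'][0]['items']` and `['categories'][0]`, which raises an error if a location returns no groups or a venue has no category. A more defensive parsing step is sketched below. The helper `parse_explore_items` and the payload `sample_payload` are hypothetical: the payload is hand-written to follow the nesting used in the function and is not real API output.

```python
def parse_explore_items(payload):
    """Flatten one explore-style payload into (name, lat, lng, category) tuples."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories", [])
        # Fall back to None when a venue carries no category.
        category = categories[0]["name"] if categories else None
        rows.append((venue["name"],
                     venue["location"]["lat"],
                     venue["location"]["lng"],
                     category))
    return rows


# Hand-written payload for illustration only.
sample_payload = {
    "response": {
        "groups": [{
            "items": [
                {"venue": {"name": "Wendy's",
                           "location": {"lat": 43.807448, "lng": -79.199056},
                           "categories": [{"name": "Fast Food Restaurant"}]}},
                {"venue": {"name": "Unnamed Venue",
                           "location": {"lat": 43.8, "lng": -79.2},
                           "categories": []}},
            ]
        }]
    }
}

rows = parse_explore_items(sample_payload)
print(rows)
```

An empty payload yields an empty list instead of an exception, so a location without venues does not stop the loop.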
       
    

In [48]:
LIMIT = 100
venues_toronto = getVenuesNearby(names=new['PostalCode'],
                                latitudes=new['Latitude'],
                                longitudes=new['Longitude']
                                )



M1B
M1C
M1E
M1G
M1H
M1J
M1K
M1L
M1M
M1N
M1P
M1R
M1S
M1T
M1V
M1W
M1X
M2H
M2J
M2K
M2L
M2M
M2N
M2P
M2R
M3A
M3B
M3C
M3H
M3J
M3K
M3L
M3M
M3N
M4A
M4B
M4C
M4E
M4G
M4H
M4J
M4K
M4L
M4M
M4N
M4P
M4R
M4S
M4T
M4V
M4W
M4X
M4Y
M5A
M5B
M5C
M5E
M5G
M5H
M5J
M5K
M5L
M5M
M5N
M5P
M5R
M5S
M5T
M5V
M5W
M5X
M6A
M6B
M6C
M6E
M6G
M6H
M6J
M6K
M6L
M6M
M6N
M6P
M6R
M6S
M7A
M7R
M7Y
M8V
M8W
M8X
M8Y
M8Z
M9A
M9B
M9C
M9L
M9M
M9N
M9P
M9R
M9V
M9W


In [47]:
venues_toronto.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,M1B,43.806686,-79.194353,Wendy's,43.807448,-79.199056,Fast Food Restaurant
1,M1C,43.784535,-79.160497,Royal Canadian Legion,43.782533,-79.163085,Bar
2,M1E,43.763573,-79.188711,Swiss Chalet Rotisserie & Grill,43.767697,-79.189914,Pizza Place
3,M1E,43.763573,-79.188711,G & G Electronics,43.765309,-79.191537,Electronics Store
4,M1E,43.763573,-79.188711,Marina Spa,43.766,-79.191,Spa
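
Before clustering, a venue table like this is usually turned into one numeric row per area. The sketch below is one possible preparation step, not the notebook's own: it one-hot encodes the venue categories and averages them per `Neighborhood` value (which holds postal codes here), using a few rows in the layout of the output above. `venues`, `onehot` and `category_freq` are illustrative names; the resulting table is the kind of input the `KMeans` class imported earlier could be fitted on.

```python
import pandas as pd

# A few rows in the layout of venues_toronto (illustrative subset).
venues = pd.DataFrame({
    "Neighborhood": ["M1B", "M1E", "M1E", "M1E"],
    "Venue Category": ["Fast Food Restaurant", "Pizza Place", "Electronics Store", "Spa"],
})

# One column per category, 1.0 where the venue belongs to it.
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# The mean of the one-hot columns is the share of each category per area.
category_freq = onehot.groupby("Neighborhood").mean()
print(category_freq)
```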
