# Assignment of week 3 - Segmenting and Clustering Neighborhoods in Toronto - part 3

## Part 1

To create the dataframe:

- The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood
- Only process the cells that have complete information, i.e. cells that are neither greyed out nor marked "Not assigned".
    - For each cell, the postal code will go under the PostalCode column, the first line under the postal code will go under Borough, and the remaining lines will go under the Neighborhood column formatted nicely and separated with commas as shown in the sample dataframe. 
    - For example, for cell (1, 3) on the Wikipedia page, M3A will go under PostalCode, North York will go under Borough, and Parkwoods will go under Neighborhood.
- If a cell has only one line under the postal code, like cell (1, 7), then that line will go under the Borough and the Neighborhood columns. So for cell (1, 7), the value of the Borough and the Neighborhood column will be Queen's Park.
- Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.
- In the last cell of your notebook, use the .shape attribute to print the number of rows of your dataframe.
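
The rules above can be sketched in plain Python, independent of any HTML parsing. This is only an illustration: `cell_to_row` is a hypothetical helper (it is not used later in the notebook), and it assumes each cell has already been reduced to its postal code plus the list of text lines underneath it.

```python
def cell_to_row(postal_code, lines):
    """Map one table cell to (PostalCode, Borough, Neighborhood), or None if unassigned."""
    # drop empty lines and surrounding whitespace
    lines = [line.strip() for line in lines if line.strip()]
    if not lines or lines[0] == "Not assigned":
        return None
    borough = lines[0]
    # a single line under the postal code is both borough and neighborhood
    neighborhoods = lines[1:] or [borough]
    return postal_code, borough, ", ".join(neighborhoods)


rows = [
    cell_to_row("M1A", ["Not assigned"]),
    cell_to_row("M3A", ["North York", "Parkwoods"]),
    cell_to_row("M7A", ["Queen's Park"]),
    cell_to_row("M5A", ["Downtown Toronto", "Regent Park", "Harbourfront"]),
]
for row in rows:
    print(row)
```

The unassigned cell yields `None`, so it can be filtered out before the dataframe is built.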

## Import needed libraries

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

## Get the Wikipedia page

In [2]:
wikipediaPage = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")

## Extract the info from the page

In [3]:
# import it in a soup object
soup = BeautifulSoup(wikipediaPage.content, 'html5lib')

# get the table with the needed info
table = soup.find_all('table')[0]

# extract all <td> from the table
tds = table.find_all('td')

### Content of a `<td>`
##### Valid
    <td style="vertical-align:top;">
        <p>
            <b>M1B</b>
            <br/>
            <span style="font-size:80%;">
                <a href="/wiki/Scarborough,_Toronto" title="Scarborough, Toronto">Scarborough</a>
                <br/>(<a href="/wiki/Malvern,_Toronto" title="Malvern, Toronto">Malvern</a> / <a href="/wiki/Rouge,_Toronto" title="Rouge, Toronto">Rouge</a>)</span>
        </p>
    </td>

##### Invalid
    <td style="width:11%; vertical-align:top; color:#ccc;">
        <p>
            <b>M1A</b>
            <br/>
            <span style="font-size:80%;">
                <i>Not assigned</i>
            </span>
        </p>
    </td>
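
To make the difference between the two cells concrete, here is a minimal sketch that uses only the standard library's `html.parser` as a stand-in for BeautifulSoup. The names `CellTextParser`, `cell_fragments` and `is_assigned` are hypothetical helpers for illustration, and the check assumes that an invalid cell always contains the literal text "Not assigned".

```python
from html.parser import HTMLParser


class CellTextParser(HTMLParser):
    """Collect the non-empty text fragments found inside one <td> cell."""

    def __init__(self):
        super().__init__()
        self.fragments = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.fragments.append(text)


def cell_fragments(td_html):
    """Return the list of text fragments of a cell, in document order."""
    parser = CellTextParser()
    parser.feed(td_html)
    parser.close()
    return parser.fragments


def is_assigned(td_html):
    """A cell is treated as valid unless it carries the 'Not assigned' marker."""
    return "Not assigned" not in cell_fragments(td_html)


valid_td = """<td style="vertical-align:top;"><p><b>M1B</b><br/>
<span style="font-size:80%;"><a href="/wiki/Scarborough,_Toronto">Scarborough</a>
<br/>(<a href="/wiki/Malvern,_Toronto">Malvern</a> / <a href="/wiki/Rouge,_Toronto">Rouge</a>)</span></p></td>"""

invalid_td = """<td style="width:11%; vertical-align:top; color:#ccc;"><p><b>M1A</b><br/>
<span style="font-size:80%;"><i>Not assigned</i></span></p></td>"""

print(cell_fragments(valid_td), is_assigned(valid_td))
print(cell_fragments(invalid_td), is_assigned(invalid_td))
```

The notebook itself takes a different route: it treats a cell as invalid when it contains no `<a>` link at all.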

## Define a function for parsing the html

In [4]:
# parse a <td> and return one dataframe entry: (PostalCode, Borough, Neighborhood)
# Borough and Neighborhood are None when the cell contains no link, i.e. it is not assigned
def parseTableData(td):
    postalCode = td.b.get_text()

    # the first link in the cell is the borough
    a1 = td.find('a')
    if a1 is None:
        return postalCode, None, None
    borough = a1.get_text()

    # any further sibling links are the neighborhoods
    a2s = a1.find_next_siblings('a')
    if a2s:
        neighborhood = ', '.join(a2.get_text() for a2 in a2s)
    else:
        # a single line under the postal code serves as both borough and neighborhood
        neighborhood = borough

    return postalCode, borough, neighborhood

## Create and fill the dataframe

In [5]:
# define the dataframe columns
column_names = ['PostalCode', 'Borough', 'Neighborhood'] 

# instantiate the dataframe
neighborhoods = pd.DataFrame(columns=column_names)

In [6]:
# fill the dataframe
# rows are collected in a list first because DataFrame.append was removed in pandas 2.0
rows = []
counter = 0
for td in tds:
    data = parseTableData(td)
    # skip invalid data
    if None in data:
        counter = counter + 1
        continue
    rows.append(dict(zip(column_names, data)))
neighborhoods = pd.DataFrame(rows, columns=column_names)
print('Skipped {} invalid entries and added {} valid ones.'.format(counter, len(rows)))

Skipped 79 invalid entries and added 101 valid ones.


In [7]:
neighborhoods.shape

(101, 3)

## Part 2 begins here

In [8]:
# The code was removed by DSX for sharing.

### Create a function for getting the latitude and longitude of each postal code.
NB: search for the location "Toronto<PostalCode>"; otherwise the request returns an empty result.

In [9]:
def getLocation(postalCode):
    """ 
    Search for the place "Toronto<postalCode>" and
    return its coordinates as a tuple (latitude, longitude).
    """
    # baseUrl is not defined in the visible cells; presumably it comes from the cell removed above
    url = (baseUrl+'&address={}').format('Toronto'+postalCode)
    response = requests.get(url).json() # get response
    geographical_data = response['results'][0]['geometry']['location'] # get geographical coordinates
    latitude = geographical_data['lat']
    longitude = geographical_data['lng']
    return latitude, longitude
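
Since an unmatched search term yields an empty result, `response['results'][0]` would raise an `IndexError` in that case. A minimal sketch of a more defensive extraction step follows. `extract_location` is a hypothetical helper, and the response layout is assumed to be exactly the one `getLocation` relies on (`results[0].geometry.location`); the sample coordinates are those of M3A from the dataframe in this notebook.

```python
def extract_location(response):
    """Return (latitude, longitude) from a geocoding response, or None if nothing matched."""
    results = response.get("results") or []
    if not results:
        return None
    location = results[0]["geometry"]["location"]
    return location["lat"], location["lng"]


# hand-built stand-ins for the decoded JSON of a successful and an empty response
sample_response = {
    "results": [{"geometry": {"location": {"lat": 43.753259, "lng": -79.329656}}}]
}
empty_response = {"results": []}

print(extract_location(sample_response))
print(extract_location(empty_response))
```

Returning `None` lets the caller decide whether to retry, skip the postal code, or stop.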

### Get the latitude and longitude for each PostalCode in the dataframe and add the 2 new columns to it

In [10]:
nh = neighborhoods.copy()

latitudeCln = []
longitudeCln = []
for index, row in nh.iterrows():
    # look the postal code up by column name rather than by position
    lat, lng = getLocation(row['PostalCode'])
    latitudeCln.append(lat)
    longitudeCln.append(lng)

nh['Latitude'] = latitudeCln
nh['Longitude'] = longitudeCln

nh.shape

(101, 5)

In [11]:
nh.head()

|   | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M7A | Queen's Park | Queen's Park | 43.662301 | -79.389494 |


## Part 3 begins here

In [12]:
# first make a copy of the dataframe so I can simply get it again if needed
nht = nh.copy()

### Import needed libraries

In [13]:
import numpy as np # library to handle data in a vectorized manner

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')


### Get only the boroughs containing Toronto in their name

In [14]:
nht = nht[nht['Borough'].str.contains('Toronto')]

### Use the geopy library to get the latitude and longitude values of Toronto.

In [15]:
address = 'Toronto, Canada'

geolocator = Nominatim(user_agent="task3")
location = geolocator.geocode(address, timeout=60, exactly_one=True)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.653963, -79.387207.


### Create a map of Toronto with neighborhoods superimposed on top.

In [24]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=12)

# add markers to map
for lat, lng, borough, neighborhood in zip(nht['Latitude'], nht['Longitude'], nht['Borough'], nht['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

### Explore Neighborhoods in Toronto

#### Define Foursquare Credentials and Version

In [22]:
# The code was removed by DSX for sharing.

#### Let's reuse some helper functions from the Foursquare lab in the previous module.

In [23]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        # flattened responses store the list under 'venue.categories' instead
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [25]:
# function to repeat the exploring process for all the neighborhoods in Toronto
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
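
The flattening step inside `getNearbyVenues` indexes `categories[0]` directly, which fails for a venue without categories. The sketch below isolates that step and falls back to `None`, in the same spirit as `get_category_type`. `venue_rows` is a hypothetical helper, and the two venues in `sample_items` are made-up placeholders rather than real Foursquare results; only the nesting of the keys is taken from the function above.

```python
def venue_rows(name, lat, lng, items):
    """Flatten the 'items' of one explore response into one tuple per venue."""
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # use None when a venue has no category instead of raising IndexError
        category = categories[0]["name"] if categories else None
        rows.append((
            name,
            lat,
            lng,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            category,
        ))
    return rows


# placeholder items, shaped like response['groups'][0]['items']
sample_items = [
    {"venue": {"name": "Example Cafe",
               "location": {"lat": 43.6545, "lng": -79.36},
               "categories": [{"name": "Coffee Shop"}]}},
    {"venue": {"name": "Example Venue",
               "location": {"lat": 43.655, "lng": -79.361},
               "categories": []}},
]

flat = venue_rows("Regent Park, Harbourfront", 43.65426, -79.360636, sample_items)
for row in flat:
    print(row)
```

Each tuple has the same seven fields, in the same order, as the columns assigned to `nearby_venues`.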