### Clustering Neighborhood - Toronto
##### Arun Leo Prakash
###### 08-Jun-2020

###### Step 1: Load the neighborhood data from Wikipedia
The page source is read from the URL into a variable and then parsed with BeautifulSoup into soup.
The first table in soup is stored in a separate variable. The code loops through its rows, extracts the cell contents, and appends each row to a list.

In [4]:
from bs4 import BeautifulSoup
import urllib.request
import pandas as pd

source = urllib.request.urlopen('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').read()
soup = BeautifulSoup(source,'lxml')

neighborhoods_data = []
table = soup.table  # first <table> element on the page
table_rows = table.find_all('tr')

for tr in table_rows:
    # rows without <td> cells (such as a header row) yield an empty list
    td = tr.find_all('td')
    row = [i.text for i in td]

    neighborhoods_data.append(row)

In [5]:
neighborhoods_data[1]

['M1A\n', 'Not assigned\n', 'Not assigned\n']
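The row above still carries trailing newline characters, and the first scraped row is empty (presumably because the header row uses th cells rather than td cells). A minimal sketch of cleaning both at the list level, before building a DataFrame, might look like this. The name `raw_rows` is a stand-in for `neighborhoods_data`, and its contents are a small hand-copied sample rather than the full scrape.

```python
# Small sample shaped like the scraped list; raw_rows is a hypothetical
# stand-in for neighborhoods_data.
raw_rows = [
    [],  # a row with no <td> cells produces an empty list
    ['M1A\n', 'Not assigned\n', 'Not assigned\n'],
    ['M3A\n', 'North York\n', 'Parkwoods\n'],
]

# Skip empty rows and strip surrounding whitespace from every cell.
clean_rows = [[cell.strip() for cell in row] for row in raw_rows if row]

print(clean_rows)
```

The notebook does the equivalent cleanup later with `replace` and `dropna` on the DataFrame; doing it here is only an alternative, not the approach the notebook takes.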

###### Step 2: Prepare the DataFrame and clean it up as described in the project description.

* The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood
* Only process the cells that have an assigned borough. 
* Ignore cells with a borough that is Not assigned.
* More than one neighborhood can exist in one postal code area. 
* For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.
* If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.
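The rules in the list above can be sketched in pandas as follows. This is an illustration under assumptions, not the notebook's own cleanup: the `sample` frame is hypothetical and uses a one-neighborhood-per-row layout so that every rule has something to act on.

```python
import pandas as pd

# Hypothetical sample rows, one neighborhood per row.
sample = pd.DataFrame(
    [
        ['M1A', 'Not assigned', 'Not assigned'],
        ['M5A', 'Downtown Toronto', 'Harbourfront'],
        ['M5A', 'Downtown Toronto', 'Regent Park'],
        ['M7A', "Queen's Park", 'Not assigned'],
    ],
    columns=['PostalCode', 'Borough', 'Neighborhood'],
)

# Keep only rows that have an assigned borough.
assigned = sample[sample['Borough'] != 'Not assigned'].copy()

# An unassigned neighborhood takes the name of its borough.
missing = assigned['Neighborhood'] == 'Not assigned'
assigned.loc[missing, 'Neighborhood'] = assigned.loc[missing, 'Borough']

# One row per postal code, with its neighborhoods joined by a comma.
combined = (
    assigned.groupby(['PostalCode', 'Borough'], as_index=False)['Neighborhood']
    .agg(', '.join)
)

print(combined)
```

Grouping on both PostalCode and Borough keeps the borough column in the result without a separate aggregation for it.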

In [6]:
neighborhoods = pd.DataFrame(neighborhoods_data, columns = ['PostalCode', 'Borough', 'Neighborhood']) 

In [7]:
neighborhoods.head()

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | | | |
| 1 | M1A\n | Not assigned\n | Not assigned\n |
| 2 | M2A\n | Not assigned\n | Not assigned\n |
| 3 | M3A\n | North York\n | Parkwoods\n |
| 4 | M4A\n | North York\n | Victoria Village\n |


In [8]:
# create a cleaned copy: strip the newline characters, then drop rows with missing values
df_nb_clean = neighborhoods.replace(to_replace=r'\n', value='', regex=True)
df_nb_clean.dropna(inplace=True)

In [9]:
df_nb_clean = df_nb_clean.query('Borough != "Not assigned"')

In [10]:
df_nb_clean.query('Neighborhood == "Not assigned"')

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|


In [11]:
df_nb_clean.query('PostalCode == "M5A"')

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 5 | M5A | Downtown Toronto | Regent Park, Harbourfront |


In [12]:
df_nb_clean.shape

(103, 3)

##### Part II
Add the latitude and longitude of each postal code from the provided CSV file.

In [34]:
df_geospat = pd.read_csv('Geospatial_Coordinates.csv')

In [35]:
df_geospat.head()

| | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


In [41]:
df_geospat = df_geospat.rename(columns ={'Postal Code': 'PostalCode'})

In [42]:
df_nb_clean_1 = pd.merge(df_nb_clean, df_geospat, on=['PostalCode'])

In [44]:
df_nb_clean_1.head()

| | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |


In [43]:
df_nb_clean_1.shape

(103, 5)
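The merged frame has the same 103 rows as the cleaned frame, so no postal code was lost in the default inner join. A hedged sketch of how such a loss could be detected explicitly, using a left merge with pandas' `indicator` option, is shown here. The two small frames and the postal code M9Z are hypothetical and exist only to produce an unmatched row.

```python
import pandas as pd

# Hypothetical miniature versions of the neighborhood and coordinate frames.
left = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A', 'M9Z'],  # M9Z is made up and has no coordinates
    'Borough': ['North York', 'North York', 'Nowhere'],
})
right = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A'],
    'Latitude': [43.753259, 43.725882],
    'Longitude': [-79.329656, -79.315572],
})

# A left merge keeps every row of `left`; indicator=True adds a `_merge`
# column recording whether a match was found in `right`.
checked = pd.merge(left, right, on='PostalCode', how='left', indicator=True)

# Postal codes that found no coordinates.
unmatched = checked.loc[checked['_merge'] == 'left_only', 'PostalCode'].tolist()
print(unmatched)
```

With an inner join, an unmatched postal code would silently disappear, which is why comparing shapes before and after the merge, as the notebook does, is a useful check.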

##### Note: the geocoder package was complicated to use and did not work, possibly because credentials are required, so I used the CSV file provided in the project description instead.

In [None]:
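import geocoder  # third-party package

MAX_ATTEMPTS = 5  # bound the retries so a failing service cannot loop forever

# try a single row first (index 3) before geocoding the whole frame
for index, row in df_nb_clean.iterrows():
    if index == 3:
        postal_code = row['PostalCode']
        print(index, postal_code)

        # initialize your variable to None
        lat_lng_coords = None

        # retry until coordinates come back or the attempts run out
        for attempt in range(MAX_ATTEMPTS):
            # the query format is an assumption; adjust it to the geocoder used
            g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
            lat_lng_coords = g.latlng
            if lat_lng_coords:
                break

        if not lat_lng_coords:
            print('No coordinates returned for', postal_code)
        else:
            print(type(lat_lng_coords))
            latitude, longitude = lat_lng_coords
            print(latitude, longitude)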

import requests

# direct call to the Google Geocoding API, made without an API key
url = 'https://maps.googleapis.com/maps/api/geocode/json'
params = {'sensor': 'false', 'address': 'Mountain View, CA'}
r = requests.get(url, params=params)
results = r.json()['results']
print(results)  # comes back empty even though the HTTP request succeeds
print(r)
# location = results[0]['geometry']['location']
# location['lat'], location['lng']


[]
<Response [200]>
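An empty result list together with a 200 response suggests that the service answered but did not return a match, which would fit the guess that credentials were required. Assuming the response follows the Geocoding JSON layout with a top-level `status` field and an optional `error_message`, a small helper could surface the reason instead of failing on `results[0]`. The function name `extract_location` and both payloads are hypothetical and hand-written for illustration; they are not captured API responses.

```python
def extract_location(payload):
    """Return (lat, lng) from a Geocoding-style JSON payload, or raise ValueError."""
    status = payload.get('status')
    results = payload.get('results', [])
    if status != 'OK' or not results:
        message = payload.get('error_message', 'no error message supplied')
        raise ValueError('geocoding failed: {} ({})'.format(status, message))
    location = results[0]['geometry']['location']
    return location['lat'], location['lng']


# Hand-written payloads for illustration only.
denied = {'results': [], 'status': 'REQUEST_DENIED', 'error_message': '...'}
ok = {
    'status': 'OK',
    'results': [{'geometry': {'location': {'lat': 43.65426, 'lng': -79.360636}}}],
}

print(extract_location(ok))
try:
    extract_location(denied)
except ValueError as err:
    print(err)
```

In the notebook's cell, the same check could be applied to `r.json()` before indexing into `results`.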
