<a href="https://colab.research.google.com/github/earldennison/ibm_coursera_capstone/blob/master/Segmenting_and_Clustering_Neighborhoods_in_Toronto1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Standard Imports
Before we proceed, we import the standard libraries we will use to manipulate the data.

In [0]:
import numpy as np
import pandas as pd

## Getting The Data
Contrary to what the Coursera instructions suggest, loading tabular data into a data frame is fairly straightforward as long as the data lives on a single page, and BeautifulSoup is not needed for it. The ```pd.read_html``` function returns a list containing every table it finds at the URL passed as its argument.

In [0]:
pcodes = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')

## Creating the Data Frame

I have assigned the returned data to ```pcodes```, short for postal codes. Let us check what type of data ```pcodes``` holds.

In [3]:
type(pcodes)

list

The result is a list, but what we want is a data frame. Let's investigate further.

In [4]:
len(pcodes)

3

Three? It's a list with three elements in it? Let's check it out.

In [5]:
pcodes

[    Postcode  ...                                      Neighbourhood
 0        M1A  ...                                       Not assigned
 1        M2A  ...                                       Not assigned
 2        M3A  ...                                          Parkwoods
 3        M4A  ...                                   Victoria Village
 4        M5A  ...                                       Harbourfront
 5        M5A  ...                                        Regent Park
 6        M6A  ...                                   Lawrence Heights
 7        M6A  ...                                     Lawrence Manor
 8        M7A  ...                                       Not assigned
 9        M8A  ...                                       Not assigned
 10       M9A  ...                                   Islington Avenue
 11       M1B  ...                                              Rouge
 12       M1B  ...                                            Malvern
 13       M2B  ...  

```pcodes``` actually holds three tables! That is unexpected. Let's check the type of each individual element.

In [6]:
for item in pcodes:
  print(type(item))

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>


Bingo! Each element is a data frame, so we can simply pick the one we want by its index. The table we want is the first, so its index is ```0```. Let's double-check its content to be sure it is what we want.

In [7]:
pcodes[0].head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


Since we don't need the other tables, let's assign `pcodes[0]` back to `pcodes` itself, discarding the other tables.

In [0]:
pcodes = pcodes[0]
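
Picking the table by position works for the page as it was when this notebook was run, but the order of tables on a web page can change. Below is a minimal sketch of a more defensive selection, assuming the wanted table is the only one that has the three expected column names. `tables` and `pick_table` are hypothetical names, and the small frames merely stand in for the real `read_html` result.

```python
import pandas as pd

# Stand-in for the list returned by pd.read_html (made-up data)
tables = [
    pd.DataFrame({'Postcode': ['M1A', 'M3A'],
                  'Borough': ['Not assigned', 'North York'],
                  'Neighbourhood': ['Not assigned', 'Parkwoods']}),
    pd.DataFrame({'Unrelated': [1, 2]}),
    pd.DataFrame({'Other': ['x']}),
]

def pick_table(frames, required=('Postcode', 'Borough', 'Neighbourhood')):
    """Return the first frame that has every required column."""
    for frame in frames:
        if set(required).issubset(frame.columns):
            return frame
    raise ValueError('no table with the expected columns')

selected = pick_table(tables)
print(selected.columns.tolist())
```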

## Cleaning the Data
The data frame contains many unusable rows because of the `Not assigned` values. To deal with them we will follow the Coursera guidelines. But first, let's inspect the data frame, as the diligent data scientists that we are.



In [9]:
pcodes.columns

Index(['Postcode', 'Borough', 'Neighbourhood'], dtype='object')

Now that we have the column names, we can manipulate the data more easily. Let's go through the Coursera guidelines for transforming the data.
> - The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood
> - Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
> - More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.
> - If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough. So for the 9th cell in the table on the Wikipedia page, the value of the Borough and the Neighborhood columns will be Queen's Park.
> - Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.
> - In the last cell of your notebook, use the .shape method to print the number of rows of your dataframe.

We already have the three columns, so that's done. Next, we have to drop the rows that have no borough assigned. Let's replace ```Not assigned``` with ```NaN``` so that we can easily drop those rows.

In [0]:
pcodes.replace('Not assigned', np.nan, inplace=True)

Now we drop the `NaN` values using ```dropna()```. However, we only want to drop the rows where the borough isn't assigned. Thankfully, ```dropna()``` has a ```subset``` parameter that restricts the check to the columns we name.

In [0]:
pcodes.dropna(subset=['Borough'], axis=0,inplace=True)

In [13]:
pcodes.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights


Looks like it was a success: M1A and M2A, whose `Borough` values were `Not assigned`, have been dropped.
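
The two cleaning steps above can be tried on a small hand-made frame. This is only a sketch: the three rows are made up to mirror the Wikipedia table, and `demo` is a hypothetical name.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'Postcode': ['M1A', 'M3A', 'M7A'],
    'Borough': ['Not assigned', 'North York', "Queen's Park"],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Not assigned'],
})

# Turn the placeholder text into real missing values
demo = demo.replace('Not assigned', np.nan)

# Drop a row only when its Borough is missing;
# a row that is missing just its Neighbourhood is kept
demo = demo.dropna(subset=['Borough'])

print(demo['Postcode'].tolist())
```

M1A disappears because its borough is missing, while M7A survives even though its neighbourhood is still `NaN`.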

Let's comply with the other guidelines Coursera has given:

> - More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.
> - If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough. So for the 9th cell in the table on the Wikipedia page, the value of the Borough and the Neighborhood columns will be Queen's Park.

Uh oh, doing the former before the latter would give us a headache: the first guideline joins the neighbourhoods of a postal code into a single string in the `Neighbourhood` column, and checking for unassigned values inside those strings would take more complex processing. So I'll do the latter first: wherever `Neighbourhood` is not assigned, I will replace it with the value of `Borough`.


Since we changed `Not assigned` to `NaN`, pandas offers an easy way to do this sort of transformation: the `fillna` method.

In [14]:
pcodes.fillna(method='ffill', axis = 1, inplace=True)
pcodes.loc[8].to_frame()

Unnamed: 0,8
Postcode,M7A
Borough,Queen's Park
Neighbourhood,Queen's Park


A little discussion is in order. As the output above shows, the `Neighbourhood` of row 8 has been filled in with its borough. The `method` argument `ffill` means forward fill: when pandas meets a `NaN`, it fills it with the last valid value that came before it. The `axis` parameter sets the direction of that scan: with `axis=1` pandas moves across the columns of each row, so a missing `Neighbourhood` takes the `Borough` value to its left. Neat, huh?
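
To see the direction of the fill, here is a small sketch on made-up rows. It uses `DataFrame.ffill`, the dedicated forward-fill method, and also shows an alternative that targets only one column with `Series.fillna`. `demo`, `filled` and `targeted` are hypothetical names.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'Postcode': ['M3A', 'M7A'],
    'Borough': ['North York', "Queen's Park"],
    'Neighbourhood': ['Parkwoods', np.nan],
})

# Forward fill across the columns of each row (axis=1):
# the missing Neighbourhood takes the Borough value on its left
filled = demo.ffill(axis=1)

# Alternative that touches only the Neighbourhood column
targeted = demo.copy()
targeted['Neighbourhood'] = targeted['Neighbourhood'].fillna(targeted['Borough'])

print(filled.loc[1, 'Neighbourhood'])
print(targeted.loc[1, 'Neighbourhood'])
```

A row-wise forward fill would also fill a missing `Borough` from `Postcode`, so it matters that the rows without a borough were dropped beforehand.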

Next we are going to put all the neighbourhoods that share a postcode into the same row.

In [0]:
pcodes = pcodes.groupby(
    by=['Postcode','Borough'])['Neighbourhood'].apply(','.join).reset_index()

In [17]:
pcodes.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge,Malvern"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


So what happened? It is a one-liner, but a fairly dense one. The ```groupby``` method does not return a data frame but a `groupby` object, which collects the rows that share the same `Postcode` and `Borough`. From that object we can still select the `['Neighbourhood']` column and apply a function to each group. Here we apply `','.join`, which concatenates the neighbourhood names of a group into one comma-separated string. The result is a Series indexed by `Postcode` and `Borough`, so we call `reset_index()` to turn those two index levels back into ordinary columns and get a data frame again.
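
Splitting the one-liner into its two steps makes the intermediate result visible. This is a sketch on three made-up rows; `demo`, `joined` and `combined` are hypothetical names.

```python
import pandas as pd

demo = pd.DataFrame({
    'Postcode': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighbourhood': ['Harbourfront', 'Regent Park', 'Parkwoods'],
})

# Step 1: group the rows and join the names of each group into one string.
# The result is a Series whose index is the (Postcode, Borough) pair.
joined = demo.groupby(['Postcode', 'Borough'])['Neighbourhood'].apply(','.join)
print(joined.index.names)

# Step 2: move Postcode and Borough out of the index, back into columns
combined = joined.reset_index()
print(combined)
```

`groupby` sorts the group keys by default, which is why M3A comes before M5A in the result even though it was the last row of the input.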



In [18]:
pcodes.shape

(103, 3)

## Getting the Longitudes and Latitudes
Now that the data is clean, we need the latitude and longitude of each postcode, so we will import geocoder as per the Coursera instructions.

In [0]:
import geocoder

However, the method `geocoder.google` doesn't seem to be working, so we will use `geocoder.arcgis` instead. My thanks to [Asim Islam](https://www.coursera.org/learn/applied-data-science-capstone/profiles/8d41d6357cf7033b900aa7daaafdf2c1) for pointing me in the right direction. Let's test whether it really returns the latitude and longitude.

In [20]:
g= geocoder.arcgis('M5A, Toronto, Ontario')
g.latlng

[43.65512000000007, -79.36263979699999]

Cool. Now that this works, we write a function to make our lives easier. It returns the latitude and longitude as a list, and its parameters `city` and `state` have the default arguments Toronto and Ontario respectively.

In [0]:
def get_geocode(postal_code, city='Toronto', state='Ontario'):
  return geocoder.arcgis(f'{postal_code}, {city}, {state}').latlng

Now we create the two new columns. Applying `get_geocode` to the `Postcode` column yields one `[latitude, longitude]` pair per row; `zip(*...)` regroups those pairs into one tuple of latitudes and one tuple of longitudes, which are then assigned to the new columns.

In [0]:
pcodes['Latitude'], pcodes['Longitude'] = zip(*pcodes['Postcode'].apply(get_geocode))
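
The `zip(*...)` idiom is easier to follow without the network call. The sketch below replaces the geocoder with a hypothetical lookup table (`fake_coords` and `fake_geocode` are made up for illustration, and the coordinates are rounded values taken from the output shown in this notebook).

```python
import pandas as pd

# Hypothetical stand-in for get_geocode, so the example runs offline
fake_coords = {'M1B': [43.81, -79.20], 'M1C': [43.79, -79.16]}

def fake_geocode(postal_code):
    return fake_coords[postal_code]

demo = pd.DataFrame({'Postcode': ['M1B', 'M1C']})

# apply gives one [lat, lng] list per row
pairs = demo['Postcode'].apply(fake_geocode)

# zip(*pairs) regroups the pairs: all latitudes first, then all longitudes
lats, lngs = zip(*pairs)
demo['Latitude'], demo['Longitude'] = lats, lngs

print(lats)
print(lngs)
```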

Let's check our data frame.

In [26]:
pcodes.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge,Malvern",43.811525,-79.195517
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.78573,-79.15875
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",43.76569,-79.175256
3,M1G,Scarborough,Woburn,43.768359,-79.21759
4,M1H,Scarborough,Cedarbrae,43.769688,-79.23944


Looks like it worked! On to the next part.

## Filtering out Boroughs only in Toronto
For this part we keep only the boroughs that are in Toronto.

Using the string methods of the `Borough` column, we create a mask that selects only the rows whose borough name contains Toronto.

In [0]:
toronto = pcodes[pcodes['Borough'].str.contains('Toronto')]
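
What the mask looks like can be checked on a few made-up rows. In this sketch `demo`, `mask` and `in_toronto` are hypothetical names.

```python
import pandas as pd

demo = pd.DataFrame({
    'Postcode': ['M1B', 'M4E', 'M5A', 'M3A'],
    'Borough': ['Scarborough', 'East Toronto', 'Downtown Toronto', 'North York'],
})

# Boolean Series: True where the borough name contains the text 'Toronto'
mask = demo['Borough'].str.contains('Toronto')

# Indexing with the mask keeps only the rows marked True
in_toronto = demo[mask]

print(mask.tolist())
print(in_toronto['Postcode'].tolist())
```

The filtered frame keeps the original index labels, which is why the `toronto` frame in this notebook starts at 37 rather than 0.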

In [39]:
#@title Hidden {display-mode: "form"}
toronto

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
37,M4E,East Toronto,The Beaches,43.676845,-79.295225
41,M4K,East Toronto,"The Danforth West,Riverdale",43.683262,-79.35512
42,M4L,East Toronto,"The Beaches West,India Bazaar",43.667965,-79.314673
43,M4M,East Toronto,Studio District,43.662766,-79.33483
44,M4N,Central Toronto,Lawrence Park,43.72816,-79.387085
45,M4P,Central Toronto,Davisville North,43.712815,-79.388526
46,M4R,Central Toronto,North Toronto West,43.714523,-79.40696
47,M4S,Central Toronto,Davisville,43.703395,-79.385964
48,M4T,Central Toronto,"Moore Park,Summerhill East",43.690655,-79.383561
49,M4V,Central Toronto,"Deer Park,Forest Hill SE,Rathnelly,South Hill,...",43.686083,-79.402335


We first import folium and Nominatim. Nominatim is the geocoder class from geopy that we use to look up the coordinates of a place.

In [0]:
import folium
from geopy.geocoders import Nominatim

Let's get the latitude and longitude of Toronto to centre our map.

In [235]:
toronto_loc = Nominatim(user_agent='explorer').geocode('Toronto, Ontario')

print(toronto_loc.latitude)
print(toronto_loc.longitude)

43.653963
-79.387207


In [250]:
# Centre the map on the coordinates of Toronto found above
toronto_map = folium.Map((toronto_loc.latitude, toronto_loc.longitude), zoom_start=13, min_zoom=12)
# Add one circle marker per postcode, labelled with its borough and neighbourhoods
for lon, lat, borough, neigh in zip(toronto['Longitude'], toronto['Latitude'], toronto['Borough'], toronto['Neighbourhood']):
  label = f"{borough}, {neigh}"
  label = folium.Popup(label, parse_html=True)
  folium.CircleMarker([lat, lon],
                      popup=label,
                      radius=4,
                      color='blue',
                      fill=True,
                      fill_color='#3186cc',
                     ).add_to(toronto_map)
# Display the map as the cell output
toronto_map


In [0]:
CLIENT_ID
CLIENT_SECRET

In [35]:
for neighbourhoods,lat,lng in zip(toronto['Neighbourhood'],toronto['Latitude'],toronto['Longitude']):
  print(neighbourhoods.split(','),lat,lng)

['The Beaches'] 43.67684518300007 -79.29522499999996
['The Danforth West', 'Riverdale'] 43.68326150000007 -79.35511999999994
['The Beaches West', 'India Bazaar'] 43.66796500000004 -79.31467251099997
['Studio District'] 43.662765652000076 -79.33482999999995
['Lawrence Park'] 43.72816000000006 -79.38708518799996
['Davisville North'] 43.712815000000035 -79.38852582199996
['North Toronto West'] 43.71452278400005 -79.40695999999997
['Davisville'] 43.70339500000006 -79.38596360499997
['Moore Park', 'Summerhill East'] 43.69065500000005 -79.38356145799997
['Deer Park', 'Forest Hill SE', 'Rathnelly', 'South Hill', 'Summerhill West'] 43.68608285400006 -79.40233499999994
['Rosedale'] 43.681940000000054 -79.37847416699998
['Cabbagetown', 'St. James Town'] 43.66816000000006 -79.36660236199998
['Church and Wellesley'] 43.666585000000055 -79.38130203699995
['Harbourfront', 'Regent Park'] 43.65512000000007 -79.36263979699999
['Ryerson', 'Garden District'] 43.65736301100003 -79.37817999999999
['St. Jam