# WEEK 3 PROJECT PART 1


### Create a dataframe that lists each unique combination of Postcode and Borough, with the Neighborhoods for each combination concatenated into a single list.
### The first step is to import the required libraries. I am using lxml to do an HTML-based scrape of the Wikipedia page for its table contents.

In [1]:
import numpy as np

import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json

!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim

!conda install -c conda-forge geocoder --yes  
import geocoder # import geocoder

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

import folium # map rendering library

# For scraping the wikipedia page to get the information out of the table. 
!conda install -c conda-forge lxml --yes
print('Libraries imported.')

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.

Libraries imported.


### Now, we read in the data and create a pandas dataframe out of it, like so:

In [2]:
toronto_data = pd.read_html('https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&direction=prev&oldid=926287641#') # read_html returns a list of dataframes, one per table found on the page
toronto_df = toronto_data[0]
toronto_df.head()

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |


### Next, we clean up the data: we remove rows with no Borough assigned, and where a Neighborhood is missing we fill it in with the Borough name.

In [3]:
toronto_df.drop(toronto_df.loc[toronto_df['Borough']=='Not assigned'].index, inplace=True) # ignore rows w/ no Borough 
toronto_df.reset_index(drop=True, inplace=True) # reset index numbers 
toronto_df.loc[(toronto_df['Neighbourhood'] == 'Not assigned'),'Neighbourhood'] = toronto_df['Borough'] # assign Neighbourhood a value if it does not have one. 
toronto_df.head()

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront |
| 3 | M5A | Downtown Toronto | Regent Park |
| 4 | M6A | North York | Lawrence Heights |
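
The two cleaning rules can be tried in isolation on a small frame. This is a minimal sketch, assuming the same column names as the scraped table; the three rows are illustrative values only, and `Series.mask` is swapped in for the `.loc` assignment used in the cell above.

```python
import pandas as pd

# Toy rows with the same columns as the scraped table (illustrative values only).
raw = pd.DataFrame({
    'Postcode': ['M1A', 'M3A', 'M7A'],
    'Borough': ['Not assigned', 'North York', "Queen's Park"],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Not assigned'],
})

# Rule 1: keep only rows that have a Borough, then renumber the index.
clean = raw[raw['Borough'] != 'Not assigned'].reset_index(drop=True)

# Rule 2: where the Neighbourhood is a placeholder, fall back to the Borough name.
clean['Neighbourhood'] = clean['Neighbourhood'].mask(
    clean['Neighbourhood'] == 'Not assigned', clean['Borough'])

print(clean)
```

Doing the filter before the fill matters: rows without a Borough are dropped first, so the placeholder is never copied into the Neighbourhood column.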


### Finally, we collapse the unique neighborhood rows into concatenated lists, so that each row entry reflects a unique combination of values for Postcode and Borough. 

In [4]:
toronto_df = toronto_df.groupby(['Postcode', 'Borough'])['Neighbourhood'].apply(', '.join).reset_index()
toronto_df.head()

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
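
To see what the groupby does to a postcode that spans several rows, here is a small sketch built from the three rows for M5A and M6A shown earlier. It uses `.agg` in place of `.apply` (an assumption of this sketch, not a change to the cell above) and adds a uniqueness check.

```python
import pandas as pd

# Rows taken from the cleaned table above: M5A appears twice, M6A once.
rows = pd.DataFrame({
    'Postcode': ['M5A', 'M5A', 'M6A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighbourhood': ['Harbourfront', 'Regent Park', 'Lawrence Heights'],
})

# Join the neighbourhood names within each (Postcode, Borough) group.
grouped = (rows.groupby(['Postcode', 'Borough'])['Neighbourhood']
               .agg(', '.join)
               .reset_index())

# After collapsing, each Postcode should appear exactly once.
assert grouped['Postcode'].is_unique

print(grouped)
```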


### The following shows the dimensions of the new dataframe: the number of rows, then the number of columns.

In [5]:
toronto_df.shape

(103, 3)

# WEEK 3 PROJECT PART 2
### Part 2 asks us to get the geospatial data for each row in our dataframe. This means adding columns for longitude and latitude to our existing dataframe, and reading in the correct data to those places in the table. 

In [6]:
# check the df again
toronto_df.head()

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |


In [7]:
# create new dataframe out of the provided geospatial coordinates csv

geo_df = pd.read_csv('Geospatial_Coordinates.csv')
geo_df.head()

| | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


In [8]:
# join the two tables on postal code
toronto_df = toronto_df.set_index('Postcode').join(geo_df.set_index('Postal Code'))
toronto_df = toronto_df.reset_index()  # reset_index() returns a new dataframe, so assign the result
toronto_df.head()

| | Postcode | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union | 43.784535 | -79.160497 |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill | 43.763573 | -79.188711 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |
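
A join on postal code silently leaves `NaN` coordinates for any code missing from the CSV. The sketch below shows one way to surface that, using `DataFrame.merge` with `indicator=True` rather than the `join` used above. The `M9Z` row is a made-up postcode added only to trigger the mismatch.

```python
import pandas as pd

# Left table: two real postcodes from above plus one hypothetical code (M9Z).
left = pd.DataFrame({
    'Postcode': ['M1B', 'M1C', 'M9Z'],
    'Borough': ['Scarborough', 'Scarborough', 'Hypothetical'],
})

# Right table: coordinates for the two real postcodes only.
geo = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C'],
    'Latitude': [43.806686, 43.784535],
    'Longitude': [-79.194353, -79.160497],
})

# Left merge keeps every postcode; validate raises if keys are duplicated,
# and indicator adds a '_merge' column saying where each row came from.
merged = left.merge(geo, how='left', left_on='Postcode', right_on='Postal Code',
                    validate='one_to_one', indicator=True)

# Postcodes with no matching coordinates.
missing = merged.loc[merged['_merge'] == 'left_only', 'Postcode'].tolist()
print(missing)
```

An empty `missing` list would confirm that every postcode in the dataframe received a latitude and longitude.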


# WEEK 3 PROJECT PART 3

### Part 3 asks us to do some exploratory analysis on the above dataframe. This will show how the postal codes and neighborhoods are laid out in the city of Toronto.

In [9]:
# Import some fun stuff for the map

import matplotlib.cm as cm
import matplotlib.colors as colors

address = 'Toronto, ON'

geolocator = Nominatim(user_agent="Toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(toronto_df['Latitude'], toronto_df['Longitude'], toronto_df['Borough'], toronto_df['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto
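
The cell above depends on the Nominatim lookup succeeding; geopy's `geocode` returns `None` when it finds no match. As a hedged sketch, the map center could fall back to the mean of the coordinates already in the dataframe. The helper name `map_center` is hypothetical, and the five coordinate pairs are the ones shown in the Part 2 output.

```python
import pandas as pd

# The first five coordinate pairs from the geospatial table above.
coords = pd.DataFrame({
    'Latitude': [43.806686, 43.784535, 43.763573, 43.770992, 43.773136],
    'Longitude': [-79.194353, -79.160497, -79.188711, -79.216917, -79.239476],
})

def map_center(df, geocoded=None):
    """Return [lat, lng]: the geocoder result if one is given, else the data mean.

    `geocoded` is expected to expose .latitude and .longitude, as a geopy
    Location does; this helper is illustrative, not part of any library.
    """
    if geocoded is not None:
        return [geocoded.latitude, geocoded.longitude]
    return [df['Latitude'].mean(), df['Longitude'].mean()]

# No geocoder result supplied, so the center is the mean of the data points.
center = map_center(coords)
print(center)
```

The resulting list can be passed as the `location` argument of `folium.Map` in place of `[latitude, longitude]`.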

### We can see in the map above that the neighborhoods are fairly evenly spaced. However, south-central Toronto has a dense cluster of neighborhoods, and therefore of postal codes.
### This cluster is centered in the heart of downtown, where population density is likely higher. Postal codes in Canada seem to be assigned based on population density, according to the methods detailed at [this link](https://en.wikipedia.org/wiki/Postal_codes_in_Canada).