# Toronto feature engineering
## Extract location of neighborhoods

This notebook has two objectives:
- For each neighborhood in Toronto, add the respective latitude and longitude.
- Write the new dataset to csv format.

In [1]:
import numpy as np
import pandas as pd

# to handle HTML and REST API requests
import requests
from bs4 import BeautifulSoup

# geo packages
import pgeocode # to get the location by postal code
import folium # to draw maps

In [2]:
df = pd.read_csv("datasets/toronto-postal-codes.csv")
df.drop(columns=["Unnamed: 0"], inplace=True)
df["latitude"] = np.nan
df["longitude"] = np.nan
df.head()

Unnamed: 0,postal_code,borough,neighbourhood,latitude,longitude
0,M1A,,,,
1,M2A,,,,
2,M3A,North York,Parkwoods,,
3,M4A,North York,Victoria Village,,
4,M5A,Downtown Toronto,"Regent Park, Harbourfront",,


In [3]:
np.unique(df["postal_code"].isna())

array([False])

In [4]:
# define the geolocator
geolocator = pgeocode.Nominatim('CA')

## Find the latitude and longitude

For each postal code in the dataset, we look up the location of the corresponding neighborhood and add it to the dataframe.

In [5]:
for postal_code in df["postal_code"]:
    # look up the postal code; the result exposes latitude and longitude
    query = geolocator.query_postal_code(postal_code)

    # df.at only accepts a single scalar label, so select the
    # matching rows with a boolean mask and assign through df.loc
    mask = df["postal_code"] == postal_code
    df.loc[mask, "latitude"] = query.latitude
    df.loc[mask, "longitude"] = query.longitude
    
df.head()

Unnamed: 0,postal_code,borough,neighbourhood,latitude,longitude
0,M1A,,,,
1,M2A,,,,
2,M3A,North York,Parkwoods,43.7545,-79.33
3,M4A,North York,Victoria Village,43.7276,-79.3148
4,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.6555,-79.3626
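
The loop above fills the coordinates one postal code at a time. As a minimal sketch of the same idea without a loop, the lookup results can be held in a small table and joined onto the dataframe. The `lookup` dictionary below is a hypothetical stand-in for the geocoder (its values are copied from the output above), and `toy` is an illustrative dataframe, not the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for the geocoder results: values copied from the output above.
lookup = {
    "M3A": [43.7545, -79.33],
    "M4A": [43.7276, -79.3148],
    "M5A": [43.6555, -79.3626],
}

# Illustrative dataframe; M1A has no entry in the lookup table.
toy = pd.DataFrame({"postal_code": ["M1A", "M3A", "M4A", "M5A"]})

# One row per postal code, indexed by the postal code itself.
coords = pd.DataFrame.from_dict(
    lookup, orient="index", columns=["latitude", "longitude"]
)

# Left join on the postal code; codes without a match keep NaN coordinates.
toy = toy.join(coords, on="postal_code")
print(toy)
```

Postal codes that are missing from the lookup table end up with NaN in both coordinate columns, which matches what the loop produces when the geocoder returns no location.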


## Standardizing the value format

If we look at the new latitude and longitude values, we notice that some have 3 decimals, others 5, and so on. The idea here is to round all of them to 2 decimals.

In [6]:
# round both coordinate columns to 2 decimals
df[["latitude", "longitude"]] = df[["latitude", "longitude"]].round(2)
df.head()

Unnamed: 0,postal_code,borough,neighbourhood,latitude,longitude
0,M1A,,,,
1,M2A,,,,
2,M3A,North York,Parkwoods,43.75,-79.33
3,M4A,North York,Victoria Village,43.73,-79.31
4,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.66,-79.36
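
Rounding to 2 decimals also discards some positional precision. The sketch below estimates the worst-case shift, assuming a spherical Earth with about 111.32 km per degree (an approximation) and using the Toronto latitude 43.65 from this notebook. The helper name `rounding_error_km` is illustrative only.

```python
import math

# Approximate length of one degree on a spherical Earth (assumption).
KM_PER_DEGREE = 111.32


def rounding_error_km(decimals, latitude_deg):
    """Worst-case north-south and east-west shift (km) caused by rounding."""
    # Rounding moves a value by at most half of the last kept digit.
    half_step = 0.5 * 10 ** (-decimals)
    north_south = half_step * KM_PER_DEGREE
    # A degree of longitude shrinks with the cosine of the latitude.
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return north_south, east_west


ns, ew = rounding_error_km(2, 43.65)
print(f"north-south: {ns:.2f} km, east-west: {ew:.2f} km")
```

Under these assumptions each marker can move by roughly half a kilometre, which is acceptable for a neighborhood-level map but would be too coarse for street-level work.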


## Missing values
Some values in the dataset are NaN.

In [7]:
print("There are {} neighborhoods with location NaN".format(df["latitude"].isna().sum()))
print("And there are {} neighborhoods = NaN".format(df["neighbourhood"].isna().sum()))

There are 78 neighborhoods with location NaN
And there are 77 neighborhoods = NaN


### Handling missing values

If we have neither the name of a neighborhood nor its location, there is little reason to keep it in the dataset, so we drop these rows.

In [8]:
df.dropna(inplace=True)
# drop=True discards the old index instead of adding it as an "index" column
df.reset_index(drop=True, inplace=True)
print("Now the shape of the dataset = {}".format(df.shape))
df.head()

Now the shape of the dataset = (102, 5)


Unnamed: 0,postal_code,borough,neighbourhood,latitude,longitude
0,M3A,North York,Parkwoods,43.75,-79.33
1,M4A,North York,Victoria Village,43.73,-79.31
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.66,-79.36
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.72,-79.45
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.66,-79.39
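
Note that `dropna` with no arguments removes every row that has at least one NaN, which is slightly broader than "no name and no location". The sketch below contrasts the two behaviours on a toy dataframe; the row `M9Z` is a made-up example of a row that has a name but no coordinates.

```python
import numpy as np
import pandas as pd

# Toy data: M9Z is a hypothetical row with a name but no location.
toy = pd.DataFrame({
    "postal_code": ["M1A", "M3A", "M9Z"],
    "neighbourhood": [np.nan, "Parkwoods", "Hypothetical"],
    "latitude": [np.nan, 43.75, np.nan],
})

# Default behaviour: drop a row if ANY column is NaN.
any_nan_dropped = toy.dropna()

# Stricter criterion: drop a row only if BOTH listed columns are NaN.
both_nan_dropped = toy.dropna(subset=["neighbourhood", "latitude"], how="all")

print(len(any_nan_dropped), len(both_nan_dropped))
```

For a map, the default is usually the safer choice, since a row without coordinates cannot be plotted anyway.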


## Visualizing data

Let's visualize our current data on a folium map!

In [9]:
toronto_location = [43.65, -79.38]
toronto_map = folium.Map(location=toronto_location, zoom_start=10)

for lat, lng, borough, neighborhood in zip(df['latitude'], df['longitude'], df['borough'], df['neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    # parse_html is an option of folium.Popup, not of CircleMarker, so it is set only on the popup
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(toronto_map)
toronto_map

## Persist data

To persist the data, we save it to a **csv** file, "datasets/toronto-location.csv".

In [10]:
df.to_csv("datasets/toronto-location.csv", index=False)
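
Passing `index=False` matters here: without it, the index is written as an extra unnamed column, which is the likely origin of the "Unnamed: 0" column dropped at the start of this notebook. The sketch below shows the difference on a toy dataframe, writing to in-memory buffers instead of real files.

```python
import io

import pandas as pd

# Toy dataframe with values taken from the rounded output above.
toy = pd.DataFrame({
    "postal_code": ["M3A", "M4A"],
    "latitude": [43.75, 43.73],
    "longitude": [-79.33, -79.31],
})

# Default: the index is written as a first column with an empty header.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False: only the data columns are written.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)
print(cols_without_index)
```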