# Capstone assignment - Neighborhoods in Toronto

Hello, and welcome to my notebook. It was written for the capstone assignment of the Applied Data Science course.

Importing the libraries needed for this assignment.

In [1]:
import pandas as pd
import numpy as np

# For converting address to coordinates
!pip install geocoder
import geocoder




### 1. Download data

"**Note:** There are different website scraping libraries and packages in Python. For scraping the above table, you can simply use pandas to read the table into a pandas dataframe"

Here I am scraping the table of Toronto neighborhoods by postal code from Wikipedia using the pandas `read_html` function.

In [2]:
df = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')[0]
df.head()

Unnamed: 0,Postal code,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront


### 2. Data cleaning 

Next, we drop the rows whose Borough is `Not assigned` and reset the index of the resulting data frame `df`.

In [3]:
df = df[df.Borough != 'Not assigned'].reset_index(drop=True)
df.head()

Unnamed: 0,Postal code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Regent Park / Harbourfront
3,M6A,North York,Lawrence Manor / Lawrence Heights
4,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government


Here we verify whether any postal codes are duplicated, using `duplicated`.

In [4]:
df['Postal code'].duplicated().any()

False

There are no duplicate postal codes. Next, we check whether there are any NaN values in the Neighborhood column using `isnull`.

In [5]:
df['Neighborhood'].isnull().sum(axis = 0)

0
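The two checks above can also be bundled into one small helper. This is only a minimal sketch: the function name `check_postal_table` is hypothetical, and the toy frame is made-up data for illustration, not the Wikipedia table.

```python
import pandas as pd

def check_postal_table(frame, key="Postal code", required=("Borough", "Neighborhood")):
    """Return simple data-quality counts for a postal-code table."""
    return {
        # number of rows whose key repeats an earlier row
        "duplicate_keys": int(frame[key].duplicated().sum()),
        # number of missing values per required column
        "missing": {col: int(frame[col].isnull().sum()) for col in required},
    }

# Toy data for illustration only (assumption, not the real table)
toy = pd.DataFrame({
    "Postal code": ["M3A", "M4A", "M4A"],
    "Borough": ["North York", "North York", "North York"],
    "Neighborhood": ["Parkwoods", "Victoria Village", None],
})
report = check_postal_table(toy)
print(report)
```

On the real data frame, `check_postal_table(df)` would be expected to report zero duplicates and zero missing values, matching the two outputs above.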

To shape the data frame as asked in the question, the `/` separator between neighborhoods is replaced with `,`.

In [6]:
# regex=False treats ' /' as a literal string rather than a pattern
df['Neighborhood'] = df['Neighborhood'].str.replace(' /', ',', regex=False)
df.head()

Unnamed: 0,Postal code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
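As a small illustration of the replacement step, here is a sketch on a made-up two-element series (the variable names `toy` and `cleaned` are assumptions for this example). It replaces the full separator `" / "` with `", "`, which gives the same result as the cell above, and passes `regex=False` so the pattern is matched literally.

```python
import pandas as pd

# Toy values for illustration; the first mimics the Wikipedia separator style
toy = pd.Series(["Regent Park / Harbourfront", "Parkwoods"])

# Literal (non-regex) replacement of the separator
cleaned = toy.str.replace(" / ", ", ", regex=False)
print(cleaned.tolist())
```

Entries without a separator, such as `Parkwoods`, pass through unchanged.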


There are no NaN values in the Neighborhood column. It seems the table downloaded from Wikipedia has already taken care of these things. Now we print the dimensions of the data frame `df` using `shape`.

In [7]:
print('Data shape:',df.shape)

Data shape: (103, 3)


### 3. Location of neighborhood

Here I use geocoding through the ArcGIS developer platform (via the `geocoder` package) to convert each postal code, formatted as an address in Toronto, Ontario, into location coordinates. The coordinates are then stored in the `lat_lng` list.

In [8]:
lat_lng = []
for pcode in df['Postal code']:
    # initialize your variable to None
    lat_lng_coords = None
    # loop until you get the coordinates
    while lat_lng_coords is None:
        g = geocoder.arcgis('{}, Toronto, Ontario'.format(pcode))
        lat_lng_coords = g.latlng
    # append once per postal code, after the lookup has succeeded,
    # so a failed attempt never adds a None row to the list
    lat_lng.append(lat_lng_coords)
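A `while` loop that retries until coordinates arrive can run forever if the service keeps returning nothing. Below is a minimal sketch of a bounded retry, not the method used in this notebook: `geocode_with_retry`, `max_attempts`, `pause_seconds`, and the stand-in `fake_lookup` are all hypothetical names, and the lookup function is passed in so the example runs without any network call.

```python
import time

def geocode_with_retry(query, lookup, max_attempts=5, pause_seconds=1.0):
    """Call lookup(query) until it returns coordinates or attempts run out."""
    for attempt in range(max_attempts):
        coords = lookup(query)
        if coords is not None:
            return coords
        # wait before the next attempt, but not after the last one
        if attempt < max_attempts - 1:
            time.sleep(pause_seconds)
    return None  # caller decides how to handle a postal code with no result

# Stand-in lookup for illustration: fails twice, then succeeds
calls = {"n": 0}

def fake_lookup(query):
    calls["n"] += 1
    return None if calls["n"] < 3 else [43.752935, -79.335641]

result = geocode_with_retry("M3A, Toronto, Ontario", fake_lookup, pause_seconds=0.0)
print(result, calls["n"])
```

With the real service, the lookup could be `lambda q: geocoder.arcgis(q).latlng`, which is the same call the loop above makes.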

Converting the `lat_lng` list into a data frame of coordinates with `Latitude` and `Longitude` columns.

In [9]:
lat_lng = pd.DataFrame(lat_lng, columns = ['Latitude', 'Longitude'])
lat_lng.head()

Unnamed: 0,Latitude,Longitude
0,43.752935,-79.335641
1,43.728102,-79.31189
2,43.650964,-79.353041
3,43.723265,-79.451211
4,43.66179,-79.38939


Let's join the data frames `df` and `lat_lng` as `final_df`.

In [10]:
final_df = pd.concat([df, lat_lng], axis=1)
final_df.head(10)

Unnamed: 0,Postal code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.752935,-79.335641
1,M4A,North York,Victoria Village,43.728102,-79.31189
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.650964,-79.353041
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.723265,-79.451211
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.66179,-79.38939
5,M9A,Etobicoke,Islington Avenue,43.667481,-79.528953
6,M1B,Scarborough,"Malvern, Rouge",43.808626,-79.189913
7,M3B,North York,Don Mills,43.7489,-79.35722
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.707193,-79.311529
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657491,-79.377529
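`pd.concat` with `axis=1` lines rows up by index label, not by position, which is why resetting the index after filtering matters for this join. The sketch below uses two tiny made-up frames (`left`, `right`, `misaligned`, and `aligned` are names assumed for this example) to show the difference.

```python
import pandas as pd

# Left frame keeps a filtered index (2, 3); right frame has the default index (0, 1)
left = pd.DataFrame({"Postal code": ["M3A", "M4A"]}, index=[2, 3])
right = pd.DataFrame({"Latitude": [43.752935, 43.728102]})

# Index labels do not match, so rows are not paired and NaN values appear
misaligned = pd.concat([left, right], axis=1)

# Resetting the index first pairs the rows by position
aligned = pd.concat([left.reset_index(drop=True), right], axis=1)

print(misaligned.shape, aligned.shape)
```

A quick check such as `len(df) == len(lat_lng)` before concatenating would also catch a coordinate list that has drifted out of step with the postal codes.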
