# Part 1  - Capstone Project - Web scraping the Wikipedia info and creating a data frame

## Data Transformation 

### This part of the project collects Canadian postal code data from a Wikipedia page and transforms it into a pandas dataframe. This dataframe is used in the later steps

#### Before beginning, a few libraries need to be imported. The following code snippet imports them

In [2]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests

print("Required libraries imported")

Required libraries imported


#### The next step is to fetch and parse the HTML page.
#### BeautifulSoup can parse the HTML. The following code snippet fetches the page and parses it.

In [11]:
url_canada = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
postalcode_html = requests.get(url_canada).text
# postalcode_html holds the raw HTML as a string, which is hard to read directly. BeautifulSoup parses it into a searchable tree.
parser = BeautifulSoup(postalcode_html, 'html.parser')


#### Now that the parsed postal code page is available in 'parser', we can build a dataframe arranged in rows and columns.
#### The first step is to create a list of the table contents from 'parser' and then create a dataframe from it.
#### Note: The required data lies in the table body, in its 'tr' rows. The actual values are in the 'td' cells inside each 'tr' row, so only those need to be searched and retrieved.
#### A loop traverses the 'tr' rows and reads the text inside each 'td' cell. A row without 'td' cells, such as a header row, yields an empty list.
#### The column headings are passed to the dataframe separately.
#### The following code snippet does this


In [23]:
listofpostalcodes = []
for tr in parser.tbody.find_all('tr'):
    listofpostalcodes.append([ td.get_text().strip() for td in tr.find_all('td')])
    
listofpostalcodes

[[],
 ['M1A', 'Not assigned', 'Not assigned'],
 ['M2A', 'Not assigned', 'Not assigned'],
 ['M3A', 'North York', 'Parkwoods'],
 ['M4A', 'North York', 'Victoria Village'],
 ['M5A', 'Downtown Toronto', 'Harbourfront'],
 ['M5A', 'Downtown Toronto', 'Regent Park'],
 ['M6A', 'North York', 'Lawrence Heights'],
 ['M6A', 'North York', 'Lawrence Manor'],
 ['M7A', "Queen's Park", 'Not assigned'],
 ['M8A', 'Not assigned', 'Not assigned'],
 ['M9A', 'Etobicoke', 'Islington Avenue'],
 ['M1B', 'Scarborough', 'Rouge'],
 ['M1B', 'Scarborough', 'Malvern'],
 ['M2B', 'Not assigned', 'Not assigned'],
 ['M3B', 'North York', 'Don Mills North'],
 ['M4B', 'East York', 'Woodbine Gardens'],
 ['M4B', 'East York', 'Parkview Hill'],
 ['M5B', 'Downtown Toronto', 'Ryerson'],
 ['M5B', 'Downtown Toronto', 'Garden District'],
 ['M6B', 'North York', 'Glencairn'],
 ['M7B', 'Not assigned', 'Not assigned'],
 ['M8B', 'Not assigned', 'Not assigned'],
 ['M9B', 'Etobicoke', 'Cloverdale'],
 ['M9B', 'Etobicoke', 'Islington'],
 ['M
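
#### The row extraction above can be tried offline on a small table. The sketch below assumes a hypothetical three-row HTML string (sample_html) instead of the live Wikipedia page, and it additionally skips rows that have no 'td' cells, which the cell above does not do.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the postal code table, used only for illustration.
sample_html = """
<table>
  <tbody>
    <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
    <tr><td>M1A</td><td>Not assigned</td><td>Not assigned </td></tr>
    <tr><td>M3A</td><td>North York</td><td> Parkwoods</td></tr>
  </tbody>
</table>
"""

sample_parser = BeautifulSoup(sample_html, "html.parser")

sample_rows = []
for tr in sample_parser.tbody.find_all("tr"):
    # Collect the stripped text of every data cell in this row.
    cells = [td.get_text().strip() for td in tr.find_all("td")]
    if cells:  # the header row has only <th> cells, so its list is empty
        sample_rows.append(cells)

print(sample_rows)
```

#### Skipping empty rows at this stage is one option. The notebook instead keeps the empty row and removes it later with dropna().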

#### A dataframe can now be created from the list of postal codes. The column headings are provided separately

In [24]:
heading = ['PostalCode','Borough','Neighborhood']

postalcode_df = pd.DataFrame(listofpostalcodes, columns = heading)

postalcode_df

Unnamed: 0,PostalCode,Borough,Neighborhood
0,,,
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Harbourfront
6,M5A,Downtown Toronto,Regent Park
7,M6A,North York,Lawrence Heights
8,M6A,North York,Lawrence Manor
9,M7A,Queen's Park,Not assigned


### Data Cleaning
#### Several rows in the above dataframe contain 'Not assigned' values. These rows can be removed because they add nothing to the later steps
#### The following code snippet keeps only the rows in which none of the three columns is 'Not assigned'

In [25]:
postalcode_df = postalcode_df[(postalcode_df.PostalCode != 'Not assigned' ) & (postalcode_df.Borough != 'Not assigned') & (postalcode_df.Neighborhood != 'Not assigned')]

In [26]:
postalcode_df

Unnamed: 0,PostalCode,Borough,Neighborhood
0,,,
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Harbourfront
6,M5A,Downtown Toronto,Regent Park
7,M6A,North York,Lawrence Heights
8,M6A,North York,Lawrence Manor
11,M9A,Etobicoke,Islington Avenue
12,M1B,Scarborough,Rouge
13,M1B,Scarborough,Malvern


#### Rows containing 'None' values, such as row 0 above, can be removed as well

In [27]:
postalcode_df = postalcode_df.dropna()

In [28]:
postalcode_df

Unnamed: 0,PostalCode,Borough,Neighborhood
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Harbourfront
6,M5A,Downtown Toronto,Regent Park
7,M6A,North York,Lawrence Heights
8,M6A,North York,Lawrence Manor
11,M9A,Etobicoke,Islington Avenue
12,M1B,Scarborough,Rouge
13,M1B,Scarborough,Malvern
15,M3B,North York,Don Mills North
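
#### The two cleaning steps can be checked on a small frame. The sketch below uses a made-up four-row dataframe (sample_df) and writes the filter with DataFrame.ne(...).all(axis=1), which is assumed here to be equivalent to the three chained comparisons above.

```python
import pandas as pd

# Made-up rows that mimic the cases in the scraped table.
sample_df = pd.DataFrame(
    [
        [None, None, None],                       # stand-in for the empty header row
        ["M1A", "Not assigned", "Not assigned"],
        ["M3A", "North York", "Parkwoods"],
        ["M7A", "Queen's Park", "Not assigned"],
    ],
    columns=["PostalCode", "Borough", "Neighborhood"],
)

# Keep a row only if none of its three columns equals 'Not assigned'.
mask = sample_df.ne("Not assigned").all(axis=1)
filtered = sample_df[mask]

# A missing value is "not equal" to 'Not assigned', so the empty row survives the filter.
print(filtered.index.tolist())

# dropna() then removes the row whose values are missing.
cleaned = filtered.dropna()
print(cleaned.index.tolist())
```

#### Note that a row such as M7A, which has a borough but no neighborhood, is also dropped by this filter.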


In [29]:
postalcode_df.head(3)

Unnamed: 0,PostalCode,Borough,Neighborhood
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Harbourfront


### Grouping 
#### The next step is to combine the 'Neighborhood' values of rows that share the same 'PostalCode' and 'Borough' into a single comma-separated string

In [31]:
def list_neighborhood(grouped):    
    return ', '.join(sorted(grouped['Neighborhood'].tolist()))
                    
df_group = postalcode_df.groupby(['PostalCode', 'Borough'])

final_df = df_group.apply(list_neighborhood).reset_index(name='Neighborhood')

In [32]:
final_df

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Highland Creek, Port Union, Rouge Hill"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"


In [33]:
final_df.shape

(102, 3)
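
#### The grouping can also be written without a helper function. The sketch below is an alternative to the cell above, not the notebook's own method: it selects the 'Neighborhood' column and aggregates it with a lambda, using a made-up three-row frame (sample_df).

```python
import pandas as pd

# Made-up rows: two neighborhoods share the postal code M5A.
sample_df = pd.DataFrame(
    {
        "PostalCode": ["M5A", "M5A", "M3A"],
        "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
        "Neighborhood": ["Regent Park", "Harbourfront", "Parkwoods"],
    }
)

# Join the sorted neighborhood names of each (PostalCode, Borough) group.
grouped = (
    sample_df.groupby(["PostalCode", "Borough"])["Neighborhood"]
    .agg(lambda names: ", ".join(sorted(names)))
    .reset_index()
)

print(grouped)
```

#### Aggregating a single column keeps its name, so reset_index() needs no 'name' argument here.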

#### 'final_df' is the dataframe that serves as the source for the subsequent steps

# This is end of Part 1 -  Capstone Project

# Part 2  - Capstone Project - Adding Latitude & Longitude information

### In this part, the dataframe 'final_df' is enriched with latitude and longitude coordinates

In [36]:
#!pip install geocoder

Collecting geocoder
  Downloading https://files.pythonhosted.org/packages/4f/6b/13166c909ad2f2d76b929a4227c952630ebaf0d729f6317eb09cbceccbab/geocoder-1.38.1-py2.py3-none-any.whl (98kB)
     |████████████████████████████████| 102kB 10.3MB/s eta 0:00:01
Collecting ratelim (from geocoder)
  Downloading https://files.pythonhosted.org/packages/f2/98/7e6d147fd16a10a5f821db6e25f192265d6ecca3d82957a4fdd592cad49c/ratelim-0.1.6-py2.py3-none-any.whl
Installing collected packages: ratelim, geocoder
Successfully installed geocoder-1.38.1 ratelim-0.1.6


In [37]:
import geocoder

#### The required library has been installed and imported. The next step is to extract the latitude and longitude for each postal code.
#### Instead of repeating the same lookup code for every postal code, it is better to define a function.
#### The following code snippet defines a function named lat_long that looks up the geographic coordinates. Called with a postal code, it returns the corresponding latitude and longitude. It repeats the request until the geocoder returns a result

In [49]:
def lat_long(postal_code):
    lt_lc_Coordinates = None
    while(lt_lc_Coordinates is None):
        geo = geocoder.arcgis('{}, Toronto, Ontario'.format(postal_code))
        lt_lc_Coordinates = geo.latlng
    return lt_lc_Coordinates

In [56]:
print('Testing function lat_long \nThe coordinates of M1G is \n' + str(lat_long('M1G')))


Testing function lat_long 
The coordinates of M1G is 
[43.76835912100006, -79.21758999999997]
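
#### The loop in lat_long never ends if the geocoder keeps returning None. The sketch below shows one way to bound the retries. The names lat_long_with_retries, geocode_fn, max_attempts and delay_seconds are assumptions introduced for illustration, and fake_geocode is a stand-in so the example runs without network access.

```python
import time


def lat_long_with_retries(postal_code, geocode_fn, max_attempts=5, delay_seconds=0.0):
    """Return [latitude, longitude] for postal_code, or None if every attempt fails.

    geocode_fn is a hypothetical callable that takes a query string and
    returns a [latitude, longitude] list, or None when the lookup fails.
    """
    query = '{}, Toronto, Ontario'.format(postal_code)
    for _ in range(max_attempts):
        coordinates = geocode_fn(query)
        if coordinates is not None:
            return coordinates
        time.sleep(delay_seconds)  # optional pause before the next attempt
    return None


# Stand-in geocoder that fails twice before answering; the values are made up.
calls = {"count": 0}


def fake_geocode(query):
    calls["count"] += 1
    return None if calls["count"] < 3 else [43.768, -79.218]


result = lat_long_with_retries('M1G', fake_geocode)
print(result, calls["count"])
```

#### With the library used in this notebook, geocode_fn could be lambda q: geocoder.arcgis(q).latlng, which is the same call that lat_long makes.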


#### Now that we have the lat_long function, its results can be added to the original dataframe.

#### First, let's get a list of all latitude and longitude information for the postal codes

In [67]:
# pc is a temporary Series holding the postal codes
pc = final_df['PostalCode']
# Call lat_long for every postal code in pc
coordinates = [lat_long(n) for n in pc.tolist()]

In [64]:
Lat_Lon_Df = pd.DataFrame(coordinates, columns=['Latitude', 'Longitude'])
final_df['Latitude'] = Lat_Lon_Df['Latitude']
final_df['Longitude'] = Lat_Lon_Df['Longitude']

In [65]:
final_df

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.811525,-79.195517
1,M1C,Scarborough,"Highland Creek, Port Union, Rouge Hill",43.785730,-79.158750
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.765690,-79.175256
3,M1G,Scarborough,Woburn,43.768359,-79.217590
4,M1H,Scarborough,Cedarbrae,43.769688,-79.239440
5,M1J,Scarborough,Scarborough Village,43.743125,-79.231750
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",43.726245,-79.263670
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge",43.713133,-79.285055
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West",43.723575,-79.234976
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.696665,-79.260163
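
#### Assigning columns from a second dataframe matches rows by index label, not by position. It works above because 'final_df' has a fresh 0..n-1 index after reset_index(). The sketch below uses made-up data (sample_df, sample_coordinates) to show what happens when the index differs, and one way to keep the rows aligned.

```python
import pandas as pd

# A frame whose index is not 0..n-1, as happens after filtering without reset_index().
sample_df = pd.DataFrame({"PostalCode": ["M1B", "M1C"]}, index=[12, 16])
sample_coordinates = [[43.81, -79.19], [43.78, -79.15]]  # made-up values

# A coordinate frame with the default 0..n-1 index shares no labels with sample_df.
misaligned = pd.DataFrame(sample_coordinates, columns=["Latitude", "Longitude"])
broken = sample_df.copy()
broken["Latitude"] = misaligned["Latitude"]
print(broken["Latitude"].isna().all())  # every value is missing

# Reusing the target's index keeps each coordinate pair with its postal code.
aligned = pd.DataFrame(
    sample_coordinates,
    columns=["Latitude", "Longitude"],
    index=sample_df.index,
)
enriched = sample_df.join(aligned)
print(enriched)
```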


In [71]:
print('The dimensions of the dataframe are', final_df.shape)

The dimensions of the dataframe are (102, 5)


# This is end of Part 2 -  Capstone Project