## Peer-graded Assignment: Segmenting and Clustering Neighborhoods in Toronto Part One

### 1. Start by creating a new Notebook for this assignment.

In [41]:
#Library import
from bs4 import BeautifulSoup
import requests

### 2. Scrape the Wikipedia Page

Use the Notebook to build the code to scrape the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, in order to obtain the data that is in the table of postal codes and to transform the data into a pandas dataframe like the one shown below:

In [42]:
#Accessing Url
url=requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text

#BeautifulSoup
soup=BeautifulSoup(url,'lxml')
neighborhood_table=soup.find('table',{'class':'wikitable sortable'})

In [43]:
# Search for tr tag
table_rows = neighborhood_table.find_all('tr') 

#Initialize Empty list for storing row data
neighborhood_data=[] 

#Iterate through each row in table_rows and append its cell text to the list
for tr in table_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    neighborhood_data.append(row)   
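
The row-extraction loop above can be tried offline on a small inline table. The following is a minimal sketch, not the notebook's own cell: `sample_html` is a hypothetical miniature of the Wikipedia table, and the built-in `html.parser` is used so that `lxml` is not required. It also shows two optional variations: `get_text(strip=True)` drops the trailing newline in each cell, and header rows (which contain only `th` cells) are skipped instead of being stored as empty lists.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the Wikipedia table, for illustration only.
sample_html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")
table = soup.find("table", {"class": "wikitable sortable"})

rows = []
for tr in table.find_all("tr"):
    # strip=True removes surrounding whitespace, including the trailing "\n"
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:  # the header row has only <th> cells, so it yields an empty list
        rows.append(cells)

print(rows)
```

With this variation the later cleanup of empty rows and newline characters would not be needed, at the cost of deviating slightly from the cells shown in this notebook.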

### 3. To create the above dataframe:

- The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood
- Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
- More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma, as shown in row 11 in the above table.
- If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.
- Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.
- In the last cell of your notebook, use the .shape attribute to print the number of rows of your dataframe.
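
The cleaning rules listed above can be sketched on a toy frame. This is one possible approach under stated assumptions, not necessarily how the cells below do it: the rows in `raw` are made up for illustration (including the M7A row, which exists only to exercise the fallback rule), and the variable names are hypothetical.

```python
import pandas as pd

# Toy rows for illustration only; not the scraped data.
raw = pd.DataFrame(
    [
        ["M1A", "Not assigned", "Not assigned"],
        ["M5A", "Downtown Toronto", "Harbourfront"],
        ["M5A", "Downtown Toronto", "Regent Park"],
        ["M7A", "Queen's Park", "Not assigned"],
    ],
    columns=["PostalCode", "Borough", "Neighborhood"],
)

# Keep only rows that have an assigned borough.
clean = raw[raw["Borough"] != "Not assigned"].copy()

# A "Not assigned" neighborhood falls back to the borough name.
missing = clean["Neighborhood"] == "Not assigned"
clean.loc[missing, "Neighborhood"] = clean.loc[missing, "Borough"]

# Combine neighborhoods that share a postal code into one comma-separated row.
clean = (
    clean.groupby(["PostalCode", "Borough"], sort=False)["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)

print(clean)
print(clean.shape)
```

Applying the fallback before the `groupby` means a postal code never ends up with the literal text "Not assigned" inside its combined neighborhood string.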

- Data Frame Creation

In [44]:
#Remove the empty row produced by the header (it has th cells but no td cells)
neighborhood_data.remove([ ]) 

In [45]:
#import pandas library
import pandas as pd

#Create a DataFrame with column names 'PostalCode','Borough','Neighborhood'
#directly from the list (DataFrame.append was removed in pandas 2.0)
df=pd.DataFrame([each[:3] for each in neighborhood_data],
                columns=['PostalCode','Borough','Neighborhood'])

Each value in the data frame ends with a newline character (\n); the step below removes it.

In [46]:
#Each value ends with a \n character; replace() is used to remove it
#Remove the trailing newline from PostalCode, Borough and Neighborhood
#(assign the result back instead of using inplace=True on a column attribute)
for column in ['PostalCode','Borough','Neighborhood']:
    df[column]=df[column].replace({r'\n$':''},regex=True)

- Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.

In [47]:
#Delete all rows whose Borough is "Not assigned" and renumber the index
import numpy as np  #numpy is still used in a later cell
df=df[df['Borough']!='Not assigned'].reset_index(drop=True)

- Display Data Frame

In [48]:
df.head(12)

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
| 5 | M9A | Etobicoke | Islington Avenue, Humber Valley Village |
| 6 | M1B | Scarborough | Malvern, Rouge |
| 7 | M3B | North York | Don Mills |
| 8 | M4B | East York | Parkview Hill, Woodbine Gardens |
| 9 | M5B | Downtown Toronto | Garden District, Ryerson |
9,M5B,Downtown Toronto,"Garden District, Ryerson"


In [55]:
df.to_csv('Dataset_For_Neighborhoods_in_Toronto.csv',index=False)

- Check for and Merge Different Neighborhoods with the Same PostalCode

In [50]:
#Get the unique PostalCode size
df.PostalCode.unique().size

103

In [51]:
#Display Total PostalCode Size
df.PostalCode.size

103

The count of unique PostalCode values equals the total count (103), so no PostalCode is duplicated and no neighborhoods need to be merged.

In [52]:
np.where(df.PostalCode.duplicated())

(array([], dtype=int64),)

- Shape

In [53]:
df.shape

(103, 3)

##### End of Part one!!!

## Peer-graded Assignment: Segmenting and Clustering Neighborhoods in Toronto Part Two

Now that you have built a dataframe of the postal code of each neighborhood along with the borough name and neighborhood name, in order to utilize the Foursquare location data, we need to get the latitude and the longitude coordinates of each neighborhood.

In an older version of this course, we were leveraging the Google Maps Geocoding API to get the latitude and the longitude coordinates of each neighborhood. However, recently Google started charging for their API: http://geoawesomeness.com/developers-up-in-arms-over-google-maps-api-insane-price-hike/, so we will use the Geocoder Python package instead: https://geocoder.readthedocs.io/index.html.

The problem with this Package is you have to be persistent sometimes in order to get the geographical coordinates of a given postal code. So you can make a call to get the latitude and longitude coordinates of a given postal code and the result would be None, and then make the call again and you would get the coordinates. So, in order to make sure that you get the coordinates for all of our neighborhoods, you can run a while loop for each postal code. Taking postal code M5G as an example, your code would look something like this:
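
A minimal sketch of such a retry loop follows. It is written against a generic `lookup` callable so it can run without network access; `get_coordinates`, `fake_lookup`, and the `max_attempts` cap are hypothetical names introduced here, and the coordinate pair is a placeholder rather than a real geocoding result. With the Geocoder package, `lookup` might be something like `lambda q: geocoder.arcgis(q).latlng`, but the provider is an assumption, since the text above does not name one.

```python
def get_coordinates(postal_code, lookup, max_attempts=10):
    """Call `lookup` repeatedly until it returns coordinates or attempts run out."""
    query = "{}, Toronto, Ontario".format(postal_code)
    for _ in range(max_attempts):
        coords = lookup(query)
        if coords is not None:
            return coords
    return None  # give up instead of looping forever


# Stand-in for a flaky geocoder: returns None twice, then a coordinate pair.
# The pair below is a placeholder value, not a real lookup result.
_responses = iter([None, None, [43.66, -79.39]])


def fake_lookup(query):
    return next(_responses)


latitude, longitude = get_coordinates("M5G", fake_lookup)
print(latitude, longitude)
```

Capping the number of attempts is a design choice made here: an unbounded `while` loop, as described in the text, would hang if the service never answers for a given postal code.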

### Load and Explore datasets

The Google Maps Geocoding API was not working properly, so I downloaded the CSV file from the given link and merged it with the data frame scraped from Wikipedia.

In [76]:
url='http://cocl.us/Geospatial_data'
df1=pd.read_csv(url)

In [61]:
#Rename column in dataframe df1
df1.rename(columns={'Postal Code':'PostalCode'},inplace=True)

In [77]:
df1.head()

| | PostalCode | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


In [63]:
#Merge data frames df and df1 on the common column PostalCode
data=df.merge(df1,on='PostalCode')

In [65]:
#check the Shape of the data Frame
data.shape

(103, 5)
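
The shape check above confirms that all 103 rows survive the merge. A slightly stricter check can be sketched on toy data: a left merge with `indicator=True` exposes postal codes that have no coordinates, and `validate="one_to_one"` raises an error if a postal code appears more than once on either side. The frames `neighborhoods` and `coords` are hypothetical stand-ins, and M5A is deliberately left out of `coords` to show an unmatched row.

```python
import pandas as pd

# Toy stand-ins for the scraped frame and the coordinates CSV.
neighborhoods = pd.DataFrame(
    {
        "PostalCode": ["M3A", "M4A", "M5A"],
        "Borough": ["North York", "North York", "Downtown Toronto"],
    }
)
coords = pd.DataFrame(
    {
        "PostalCode": ["M3A", "M4A"],  # M5A intentionally missing
        "Latitude": [43.753259, 43.725882],
        "Longitude": [-79.329656, -79.315572],
    }
)

# Left merge keeps every neighborhood row; "_merge" records where each row matched.
checked = neighborhoods.merge(
    coords, on="PostalCode", how="left", validate="one_to_one", indicator=True
)

# Postal codes present in the neighborhood frame but absent from the coordinates.
unmatched = checked.loc[checked["_merge"] == "left_only", "PostalCode"].tolist()
print(unmatched)
```

An empty `unmatched` list on the real data would mean that every postal code received a latitude and longitude.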

### Quick Display

In [78]:
data.head(12)

| | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |
| 5 | M9A | Etobicoke | Islington Avenue, Humber Valley Village | 43.667856 | -79.532242 |
| 6 | M1B | Scarborough | Malvern, Rouge | 43.806686 | -79.194353 |
| 7 | M3B | North York | Don Mills | 43.745906 | -79.352188 |
| 8 | M4B | East York | Parkview Hill, Woodbine Gardens | 43.706397 | -79.309937 |
| 9 | M5B | Downtown Toronto | Garden District, Ryerson | 43.657162 | -79.378937 |


### End of Part Two!!!