<h1>Segmenting and Clustering Neighborhoods Part-2</h1>

<h2>1. Description</h2>
For this assignment, you will create a dataframe containing the postal code, borough name, and neighborhood name of each neighborhood. To use the Foursquare location data, we also need the latitude and longitude coordinates of each neighborhood.
The required data is provided on a Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, which contains all the information needed for the segmenting and clustering process. You will scrape the Wikipedia page, wrangle and clean the data, and then read it into a pandas dataframe so that it is in a structured format like the New York dataset. The geographical coordinates (latitude and longitude) of each postal code are available as a csv file at http://cocl.us/Geospatial_data.

<h2>2. Scraping Data from the Wikipedia Page</h2>

<h4>1.a. Import necessary packages.</h4>

In [57]:
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner
import random # library for generating random numbers
import requests # library to handle requests

# map rendering library
!pip install folium
import folium

!pip install lxml # fast XML/HTML parser, used below as the BeautifulSoup backend

from bs4 import BeautifulSoup # for getting data out of HTML, XML, and other markup languages

# import k-means from clustering stage
from sklearn.cluster import KMeans

# transform JSON file into a pandas dataframe
from pandas import json_normalize # pandas.io.json.json_normalize is deprecated since pandas 1.0

# convert an address into latitude and longitude values
from geopy.geocoders import Nominatim

# IPython helpers for displaying images and HTML in the notebook
from IPython.display import Image
from IPython.display import HTML
from IPython.display import display_html

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

print("Done!!!")

Done!!!


<h5>1.b. Scrape the "raw" table and transform the data into a pandas dataframe 'df'</h5>

In [58]:
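source = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M").text
soup = BeautifulSoup(source, 'lxml')
table = soup.find("table")  # the first table on the page holds the postal codes
table_rows = table.tbody.find_all("tr")

res = []
for tr in table_rows:
    cells = tr.find_all("td")
    row = [cell.text for cell in cells]  # the header row has no <td>, so it yields []

    # Cell text still ends with "\n" here, so an exact comparison against
    # "Not assigned" never matches; those rows are dropped after cleaning.
    if row:
        # If the neighborhood is unassigned, fall back to the borough name.
        if "Not assigned" in row[2]:
            row[2] = row[1]
        res.append(row)

df = pd.DataFrame(res, columns=["PostalCode", "Borough", "Neighborhood"])
df.head()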


Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A\n,Not assigned\n,Not assigned\n
1,M2A\n,Not assigned\n,Not assigned\n
2,M3A\n,North York\n,Parkwoods\n
3,M4A\n,North York\n,Victoria Village\n
4,M5A\n,Downtown Toronto\n,"Regent Park, Harbourfront\n"


<h5>1.c. Clean the data by removing "\n" at the end of each string in each column</h5>

In [59]:
df["Neighborhood"]=df["Neighborhood"].str.replace("\n","")
df["Borough"]=df["Borough"].str.replace("\n","")
df["PostalCode"]=df["PostalCode"].str.replace("\n","")
df.head(11)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
7,M8A,Not assigned,Not assigned
8,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
9,M1B,Scarborough,"Malvern, Rouge"
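
The three `str.replace` calls above repeat the same operation once per column, so they can also be written as a loop. The following is a minimal sketch, assuming every column holds strings; the helper name `strip_text_columns` and the toy frame `raw` are illustrative and not part of the notebook. It uses `str.strip()`, which removes all leading and trailing whitespace rather than only "\n".

```python
import pandas as pd


def strip_text_columns(frame):
    """Return a copy with leading/trailing whitespace removed from every column.

    Assumes every column contains strings (hypothetical helper for illustration).
    """
    cleaned = frame.copy()
    for col in cleaned.columns:
        cleaned[col] = cleaned[col].str.strip()
    return cleaned


# Toy rows shaped like the scraped table, with trailing newlines.
raw = pd.DataFrame(
    [
        ["M3A\n", "North York\n", "Parkwoods\n"],
        ["M5A\n", "Downtown Toronto\n", "Regent Park, Harbourfront\n"],
    ],
    columns=["PostalCode", "Borough", "Neighborhood"],
)

clean = strip_text_columns(raw)
print(clean)
```

Because the helper works on a copy, `raw` keeps its original values, which makes it easy to compare the data before and after cleaning.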


<h5>1.d. Clean the data further:</h5>
Drop rows whose borough is 'Not assigned', then group all neighborhoods that share the same postal code into one row. If a neighborhood is 'Not assigned', use the borough name instead.

In [60]:
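# drop rows without an assigned borough
df = df[df.Borough != 'Not assigned']
# combine neighborhoods that share a postal code into one comma-separated string
df = df.groupby(["PostalCode", "Borough"])["Neighborhood"].apply(", ".join).reset_index()
# fall back to the borough name when the neighborhood is not assigned
df['Neighborhood'] = np.where(df['Neighborhood'] == 'Not assigned', df['Borough'], df['Neighborhood'])
df.head(11)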

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"Kennedy Park, Ionview, East Birchmount Park"
7,M1L,Scarborough,"Golden Mile, Clairlea, Oakridge"
8,M1M,Scarborough,"Cliffside, Cliffcrest, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"
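
The grouping step matters when a postal code appears on several rows, one per neighborhood. The following sketch shows the same filter and `groupby` on a small toy frame; the rows are illustrative values chosen for the example, not scraped data.

```python
import pandas as pd

# Toy rows (illustrative): postal code M5A appears once per neighborhood,
# and M7A has no assigned borough.
rows = pd.DataFrame(
    {
        "PostalCode": ["M5A", "M5A", "M3A", "M7A"],
        "Borough": ["Downtown Toronto", "Downtown Toronto", "North York", "Not assigned"],
        "Neighborhood": ["Regent Park", "Harbourfront", "Parkwoods", "Not assigned"],
    }
)

# Drop unassigned boroughs, then join the neighborhoods of each postal code.
assigned = rows[rows["Borough"] != "Not assigned"]
grouped = (
    assigned.groupby(["PostalCode", "Borough"])["Neighborhood"]
    .apply(", ".join)
    .reset_index()
)
print(grouped)
```

The result has one row per postal code, sorted by the grouping keys, with the two M5A neighborhoods combined into a single comma-separated string.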


<h2>3. Get the latitude and the longitude coordinates of each neighborhood.</h2>
<h5>3.a. Download the data and read it into a dataframe:</h5>
Download the csv file at http://cocl.us/Geospatial_data, which contains the geographical coordinates of each postal code, and read it into a dataframe 'Geo_df'.

In [62]:
Geo_df = pd.read_csv("https://cocl.us/Geospatial_data")
Geo_df.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


<h5>3.b. Create a new dataframe:</h5>
Create a new dataframe df_toronto by merging df and Geo_df on the postal code, as the project requires, so that the neighborhood names and coordinates can be accessed from a single dataframe.

In [63]:
df_toronto = pd.merge(df, Geo_df, how='left', left_on = 'PostalCode', right_on = 'Postal Code')

df_toronto.drop("Postal Code", axis=1, inplace=True)
df_toronto.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
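
A left merge keeps every row of the left dataframe, so a postal code with no match in the coordinates file would end up with missing Latitude and Longitude values. The following sketch shows one way to check for this with the `indicator=True` option of `pd.merge`; the toy frames and the postal code "M9Z" are made up for illustration.

```python
import pandas as pd

# Toy frames (illustrative values); "M9Z" is a made-up code with no coordinates.
left = pd.DataFrame(
    {
        "PostalCode": ["M1B", "M1C", "M9Z"],
        "Borough": ["Scarborough", "Scarborough", "Example"],
    }
)
coords = pd.DataFrame(
    {
        "Postal Code": ["M1B", "M1C"],
        "Latitude": [43.806686, 43.784535],
        "Longitude": [-79.194353, -79.160497],
    }
)

# indicator=True adds a "_merge" column that records where each row came from.
merged = pd.merge(
    left,
    coords,
    how="left",
    left_on="PostalCode",
    right_on="Postal Code",
    indicator=True,
)

# Rows marked "left_only" found no coordinates in the right-hand frame.
unmatched = merged.loc[merged["_merge"] == "left_only", "PostalCode"].tolist()
print(unmatched)
```

An empty list means every postal code received coordinates; otherwise the listed codes need attention before any mapping or clustering step.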
