## Peer-graded Assignment: Segmenting and Clustering Neighborhoods in Toronto
# Task 1 

Use the notebook to scrape the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, extract the table of postal codes, and transform the data into a pandas DataFrame.

In [1]:
import requests
import pandas as pd
from bs4 import BeautifulSoup

In [2]:
wiki_url = r'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

### Creating a function that converts an HTML table to a pandas DataFrame

We take the `th` elements and use them as column names, with spaces removed (so a header such as `Postal Code` would become `PostalCode`).

In [3]:
def parse_html_table(table):
    n_columns = 0
    n_rows=0
    column_names = []

    # Find number of rows and columns
    # we also find the column titles if we can
    for row in table.find_all('tr'):

        # Determine the number of rows in the table
        td_tags = row.find_all('td')
        if len(td_tags) > 0:
            n_rows+=1
            if n_columns == 0:
                # Set the number of columns for our table
                n_columns = len(td_tags)

        # Handle column names if we find them
        th_tags = row.find_all('th') 
        if len(th_tags) > 0 and len(column_names) == 0:
            for th in th_tags:
                column_names.append(th.get_text().strip().replace(' ', '')) # trim whitespace and remove spaces

    # Safeguard on Column Titles
    if len(column_names) > 0 and len(column_names) != n_columns:
        raise Exception("Column titles do not match the number of columns")

    columns = column_names if len(column_names) > 0 else range(0,n_columns)
    df = pd.DataFrame(columns = columns,
                      index= range(0,n_rows))
    row_marker = 0
    for row in table.find_all('tr'):
        column_marker = 0
        columns = row.find_all('td')
        for column in columns:
            df.iat[row_marker,column_marker] = column.get_text().strip()
            column_marker += 1
        if len(columns) > 0:
            row_marker += 1

    # Convert to float if possible
    for col in df:
        try:
            df[col] = df[col].astype(float)
        except ValueError:
            pass

    return df
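
As a quick check of the header-and-cell approach used above, the following minimal sketch applies the same idea to a tiny inline table. The HTML snippet and the helper name `table_to_frame` are hypothetical, and the built-in `html.parser` backend is assumed here so that the sketch does not depend on lxml.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical sample markup; in the notebook the page is fetched with requests.
sample_html = """
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

def table_to_frame(table):
    """Build a DataFrame from <th> header cells and <td> data cells."""
    header, rows = [], []
    for tr in table.find_all('tr'):
        th_tags = tr.find_all('th')
        td_tags = tr.find_all('td')
        if th_tags and not header:
            # Column names are the header texts with spaces removed
            header = [th.get_text().strip().replace(' ', '') for th in th_tags]
        if td_tags:
            rows.append([td.get_text().strip() for td in td_tags])
    return pd.DataFrame(rows, columns=header or None)

sample_table = BeautifulSoup(sample_html, 'html.parser').find('table')
sample_df = table_to_frame(sample_table)
print(sample_df.columns.tolist())
```

Collecting the rows in a list and building the DataFrame once avoids pre-sizing the frame and writing cell by cell.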

### Requesting data from the wiki

In [5]:
# load data
response = requests.get(wiki_url)
response.raise_for_status()  # fail early on an HTTP error
soup = BeautifulSoup(response.text, 'lxml')
table = soup.find('table')

# parse data
df_table = parse_html_table(table)
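
`soup.find('table')` returns the first table on the page, which works only as long as the postal-code table comes first. A slightly more defensive sketch, assuming the target table carries a distinguishing CSS class, selects it explicitly. The class name `wikitable` and the sample markup below are assumptions for illustration, not taken from the page.

```python
from bs4 import BeautifulSoup

# Hypothetical page with an unrelated table placed before the data table.
sample_page = """
<table class="infobox"><tr><td>navigation</td></tr></table>
<table class="wikitable sortable">
  <tr><th>Postal Code</th></tr>
  <tr><td>M3A</td></tr>
</table>
"""

page_soup = BeautifulSoup(sample_page, 'html.parser')

# find() without a filter picks whichever table appears first
first_table = page_soup.find('table')

# Filtering on the class picks the data table regardless of its position
target_table = page_soup.find('table', class_='wikitable')
if target_table is None:
    raise ValueError("No table with class 'wikitable' found")

print(first_table['class'], target_table['class'])
```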

### Cleaning data

In [6]:
# drop rows where Borough is 'Not assigned'
df_table.drop(df_table[df_table.Borough == 'Not assigned'].index, inplace=True)

df_table.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
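
The same filter can also be written as a boolean mask that keeps the wanted rows and renumbers the index from zero. A minimal sketch on a small hand-made frame (the name `toy` and its rows are illustrative, not scraped data):

```python
import pandas as pd

# Illustrative rows only; the real frame comes from the scraped table.
toy = pd.DataFrame({
    'PostalCode': ['M1A', 'M2A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'Not assigned', 'North York', 'North York'],
    'Neighbourhood': ['Not assigned', 'Not assigned', 'Parkwoods', 'Victoria Village'],
})

# Keep rows with an assigned borough, then rebuild a 0-based index
cleaned = toy[toy['Borough'] != 'Not assigned'].reset_index(drop=True)
print(cleaned)
```

Resetting the index is optional; without it the surviving rows keep their original labels, as in the output above where the first row is labelled 2.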


### Shape of DataFrame

In [7]:
df_table.shape

(103, 3)

---
# Task 2

We need to get the latitude and longitude coordinates of each neighborhood.

In [None]:
!wget -q -O 'geospatial_data.csv' https://cocl.us/Geospatial_data
print('Data downloaded!')

In [8]:
geospatial_df = pd.read_csv('geospatial_data.csv')
geospatial_df.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


### Add latitude and longitude to df_table

In [16]:
df_geo = df_table.merge(geospatial_df, left_on='PostalCode', right_on='Postal Code', how='inner')
df_geo.drop(columns=['Postal Code'], inplace=True)
df_geo.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
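
An inner merge silently drops postal codes that have no match in the coordinates file, so it can be worth checking for unmatched rows. A minimal sketch on toy data uses a left merge with `indicator=True`; the frames `left` and `right` and the code `M9Z` are made up for illustration.

```python
import pandas as pd

# Toy frames; 'M9Z' is a made-up code with no coordinates on the right side.
left = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A', 'M9Z'],
    'Borough': ['North York', 'North York', 'Etobicoke'],
})
right = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A'],
    'Latitude': [43.753259, 43.725882],
    'Longitude': [-79.329656, -79.315572],
})

# A left merge keeps every row of `left`; the _merge column records
# whether a match was found, and validate guards against duplicate keys.
check = left.merge(right, left_on='PostalCode', right_on='Postal Code',
                   how='left', indicator=True, validate='one_to_one')

unmatched = check.loc[check['_merge'] == 'left_only', 'PostalCode'].tolist()
print(unmatched)
```

An empty `unmatched` list would indicate that the inner merge loses no rows.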


In [18]:
df_geo.shape

(103, 5)