# Segmenting and Clustering Neighborhoods in Toronto

This notebook is for the Week 3 assignment, _Segmenting and Clustering Neighborhoods in Toronto_.

Import the necessary libraries.

In [110]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

## Fetch data to DataFrame

Request the Wikipedia page and extract the postal code records as a list of table rows.

In [111]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

html_doc = requests.get(url).text
soup = BeautifulSoup(html_doc, 'html.parser')

tr_list = soup.find('table', class_='wikitable').find_all('tr')
len(tr_list)

289
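
The first `<tr>` of a wikitable is typically the header row, so the number of data rows is usually one fewer than the count above. A minimal offline sketch of the same lookup, using a made-up HTML snippet (`sample_html` is hypothetical and far smaller than the real page):

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the Wikipedia table, for illustration only.
sample_html = """
<table class="wikitable">
<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
<tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
<tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')
sample_rows = sample_soup.find('table', class_='wikitable').find_all('tr')

# Separate the header row from the rows that carry data.
header_row, data_rows = sample_rows[0], sample_rows[1:]
print(len(sample_rows), len(data_rows))
```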

Build a DataFrame from the records.

In [112]:
column_names = ['PostalCode', 'Borough', 'Neighborhood']

# Collect one dict per table row, then build the DataFrame in one step.
# (DataFrame.append was removed in pandas 2.0.)
records = []
for tr in tr_list[1:]:  # skip the header row
    # Drop the bare newline strings that sit between the <td> tags.
    td_list = [td for td in tr.contents if td != '\n']

    records.append({'PostalCode': td_list[0].text,
                    'Borough': td_list[1].text,
                    'Neighborhood': td_list[2].text.replace('\n', '')})

orig_df = pd.DataFrame(records, columns=column_names)
orig_df.head()

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
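
Iterating over `tr.contents` also yields the whitespace between cells, which is why the newline strings have to be filtered out. A possible alternative, sketched here on a hypothetical one-row table (`row_html` is invented for illustration), is to ask for the `<td>` tags directly and strip their text:

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical one-row table; the real page has many more rows.
row_html = ("<table><tr>\n<td>M5A</td>\n<td>Downtown Toronto</td>\n"
            "<td>Harbourfront\n</td>\n</tr></table>")
row = BeautifulSoup(row_html, 'html.parser').find('tr')

# find_all('td') returns only the cell tags, skipping whitespace text nodes.
cells = [td.get_text(strip=True) for td in row.find_all('td')]

# Pair each cell with its column name and build the DataFrame from records.
records = [dict(zip(['PostalCode', 'Borough', 'Neighborhood'], cells))]
sample_df = pd.DataFrame(records)
print(sample_df)
```

`get_text(strip=True)` removes the trailing newline that the last cell carries, so no separate `replace('\n', '')` is needed in this variant.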


## Pre-processing data

**Remove rows whose borough is `Not assigned`.**

In [113]:
df = orig_df[orig_df['Borough'] != 'Not assigned']
df.head()

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
| 5 | M5A | Downtown Toronto | Regent Park |
| 6 | M6A | North York | Lawrence Heights |
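
The filtered result keeps the original index labels, which is why it starts at 2 rather than 0. A small sketch on toy data (`toy_df` is illustrative, not the scraped table) shows the boolean mask behind the filter and an optional index reset:

```python
import pandas as pd

toy_df = pd.DataFrame({
    'PostalCode': ['M1A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'North York', 'North York'],
})

# The comparison yields a boolean Series with one entry per row.
mask = toy_df['Borough'] != 'Not assigned'

# Indexing with the mask keeps only the True rows; reset_index is optional
# and renumbers the surviving rows from 0.
filtered = toy_df[mask].reset_index(drop=True)
print(filtered)
```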


**Combine the `Neighborhood` values that share the same `PostalCode`.**

1. Find duplicated postal codes

In [114]:
print('There are {} unique postal codes.'.format(len(df['PostalCode'].unique())))

There are 103 unique postal codes.


In [115]:
code_group = df.groupby('PostalCode').count()

duplicated_codes = code_group[code_group['Borough'] > 1].index.values
duplicated_codes

array(['M1B', 'M1C', 'M1E', 'M1K', 'M1L', 'M1M', 'M1N', 'M1P', 'M1R',
       'M1T', 'M1V', 'M2J', 'M2L', 'M2M', 'M3C', 'M3H', 'M3J', 'M3K',
       'M4B', 'M4K', 'M4L', 'M4T', 'M4V', 'M4X', 'M5A', 'M5B', 'M5H',
       'M5J', 'M5K', 'M5L', 'M5M', 'M5P', 'M5R', 'M5S', 'M5T', 'M5V',
       'M5X', 'M6A', 'M6H', 'M6J', 'M6K', 'M6L', 'M6M', 'M6N', 'M6P',
       'M6R', 'M6S', 'M8V', 'M8W', 'M8X', 'M8Y', 'M8Z', 'M9B', 'M9C',
       'M9M', 'M9R', 'M9V'], dtype=object)
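
The same list of repeated codes could also be obtained with `value_counts`, which counts occurrences of each value in a single column. A minimal sketch, using the five rows shown earlier as toy data:

```python
import pandas as pd

toy_df = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A', 'M5A', 'M5A', 'M6A'],
    'Neighborhood': ['Parkwoods', 'Victoria Village', 'Harbourfront',
                     'Regent Park', 'Lawrence Heights'],
})

# Count how often each postal code occurs, then keep codes seen more than once.
counts = toy_df['PostalCode'].value_counts()
dup_codes = sorted(counts[counts > 1].index)
print(dup_codes)
```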

2. Combine the neighborhoods

In [119]:
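for code in duplicated_codes:
    dup_df = df[df['PostalCode'] == code]

    borough = dup_df.iloc[0]['Borough']
    combined_neig = ','.join(dup_df['Neighborhood'].tolist())

    # Replace the duplicate rows with a single combined row.
    # pd.concat is used because DataFrame.append was removed in pandas 2.0.
    combined_row = pd.DataFrame([{'PostalCode': code,
                                  'Borough': borough,
                                  'Neighborhood': combined_neig}])
    df = pd.concat([df[df['PostalCode'] != code], combined_row],
                   ignore_index=True)

df.shape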

(103, 3)
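
A loop-free variant is also possible with `groupby` and a string join. The sketch below runs on toy data and assumes each postal code belongs to exactly one borough; it is an alternative illustration, not the approach used in this notebook:

```python
import pandas as pd

toy_df = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A', 'M5A', 'M5A', 'M6A'],
    'Borough': ['North York', 'North York', 'Downtown Toronto',
                'Downtown Toronto', 'North York'],
    'Neighborhood': ['Parkwoods', 'Victoria Village', 'Harbourfront',
                     'Regent Park', 'Lawrence Heights'],
})

# Group by code and borough, join the neighborhood names with commas,
# and turn the group keys back into ordinary columns.
# sort=False keeps the groups in order of first appearance.
combined = (
    toy_df.groupby(['PostalCode', 'Borough'], sort=False)['Neighborhood']
          .agg(','.join)
          .reset_index()
)
print(combined)
```

Unlike the loop, which moves each combined row to the end of the DataFrame, this version leaves the rows in their original relative order.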

**Assign the `Borough` name to rows with a missing `Neighborhood`.**

In [128]:
# Update df itself with .loc; assigning into a filtered copy would leave df
# unchanged and trigger pandas' SettingWithCopyWarning.
no_neig_mask = df['Neighborhood'] == 'Not assigned'
df.loc[no_neig_mask, 'Neighborhood'] = df.loc[no_neig_mask, 'Borough']

df[no_neig_mask].head()


| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 2 | M7A | Queen's Park | Queen's Park |
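
One way to replace the placeholder directly in the full DataFrame is `Series.mask`, which swaps in values from another Series wherever a condition holds. A minimal sketch on toy data (`toy_df` is illustrative):

```python
import pandas as pd

toy_df = pd.DataFrame({
    'PostalCode': ['M5A', 'M7A'],
    'Borough': ['Downtown Toronto', "Queen's Park"],
    'Neighborhood': ['Harbourfront', 'Not assigned'],
})

# Where the neighborhood is the placeholder, take the borough name instead.
missing = toy_df['Neighborhood'] == 'Not assigned'
toy_df['Neighborhood'] = toy_df['Neighborhood'].mask(missing, toy_df['Borough'])
print(toy_df)
```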


# Question 1

In the last cell of your notebook, use the `.shape` attribute to print the number of rows of your dataframe.

In [129]:
df.shape

(103, 3)
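
`shape` returns a `(rows, columns)` tuple, so the row count alone is its first element. A tiny sketch on toy data:

```python
import pandas as pd

toy_df = pd.DataFrame({'PostalCode': ['M3A', 'M4A', 'M5A']})

# shape is an attribute, not a method: no parentheses are needed.
n_rows, n_cols = toy_df.shape
print('Rows:', n_rows)
```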