# Part 1: Segmenting and Clustering Neighborhoods in Toronto

In [9]:
#get_ipython().system(u' pip install beautifulsoup4')
#from bs4 import BeautifulSoup
!pip install lxml



## Final Dataframe for Part 1
* The goal is to manipulate the data into this **final dataframe**:
![alt text](imgs/part1_final_df.png "Final Dataframe for Part 1")
## Guidance and Requirements
* The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood
* Only process the cells that have an assigned borough. Ignore cells whose borough is "Not assigned".
* More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma, as shown in row 11 in the above table.
* If a cell has a borough but a "Not assigned" neighborhood, then the neighborhood will be the same as the borough.
* Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.
* In the last cell of your notebook, use the `.shape` attribute to print the number of rows of your dataframe.
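
The rule about a borough with a "Not assigned" neighborhood can be sketched on a small hand-made frame. The rows and the `toy`, `assigned`, and `missing` names below are illustrative assumptions, not data from the Wikipedia table; the column names follow the table loaded later in this notebook.

```python
import pandas as pd

# Hand-made rows for illustration only; they are not taken from the Wikipedia table.
toy = pd.DataFrame({
    "Postal Code": ["M1A", "M7A", "M5A"],
    "Borough": ["Not assigned", "Queen's Park", "Downtown Toronto"],
    "Neighbourhood": ["Not assigned", "Not assigned", "Regent Park"],
})

# Keep only rows with an assigned borough; .copy() gives an independent frame to modify.
assigned = toy[toy["Borough"] != "Not assigned"].copy()

# Where the neighbourhood is "Not assigned", fall back to the borough name.
missing = assigned["Neighbourhood"] == "Not assigned"
assigned.loc[missing, "Neighbourhood"] = assigned.loc[missing, "Borough"]

print(assigned)
```

Filtering first and filling second keeps the two rules separate, so each can be checked on its own.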

## Loading in Packages

In [10]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from unicodedata import normalize
import lxml

## Loading in Toronto Wikipedia Data

In [11]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
tor_wiki = pd.read_html(url)
tor_wiki = tor_wiki[0] # read_html returns a list of DataFrames; keep the first table
tor_wiki.head() # Check that the table loaded correctly

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
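
Selecting `tor_wiki[0]` relies on the postal code table being the first table on the page. A possible alternative, sketched here on a hand-made HTML string rather than the live page, is to pass `match` to `pd.read_html` so that only tables containing the given text are returned. The `html`, `tables`, and `postal` names are assumptions for illustration, and the sketch assumes `lxml` is installed as in the first cell.

```python
from io import StringIO

import pandas as pd

# Hand-made HTML with an unrelated table first, to mimic a page with several tables.
html = """
<table><tr><th>Note</th></tr><tr><td>unrelated table</td></tr></table>
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

# match keeps only the tables whose text contains "Postal Code".
tables = pd.read_html(StringIO(html), match="Postal Code")
postal = tables[0]
print(postal)
```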


## Removing Boroughs with a "Not assigned" Value

In [12]:
tor_wiki_na_removed = tor_wiki[tor_wiki['Borough'] != 'Not assigned']
tor_wiki_na_removed.head(20)

Unnamed: 0,Postal Code,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
8,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
9,M1B,Scarborough,"Malvern, Rouge"
11,M3B,North York,Don Mills
12,M4B,East York,"Parkview Hill, Woodbine Gardens"
13,M5B,Downtown Toronto,"Garden District, Ryerson"


## Combining Neighborhoods in the Same Postal Code

In [14]:
# More than one neighbourhood can exist in one postal code area; combine these into one row with the neighbourhoods separated by a comma.
# assign() returns a new DataFrame, which avoids the SettingWithCopyWarning raised when writing a column into a filtered slice.
tor_wiki_na_removed = tor_wiki_na_removed.assign(
    Neighbourhood=tor_wiki_na_removed.groupby("Postal Code")["Neighbourhood"].transform(lambda neigh: ', '.join(neigh))
)

# Removing duplicates
tor_wiki_na_removed = tor_wiki_na_removed.drop_duplicates()

# Make the postal code the index if it isn't already
if tor_wiki_na_removed.index.name != 'Postal Code':
    tor_wiki_na_removed = tor_wiki_na_removed.set_index('Postal Code')

tor_wiki_na_removed.head(20)


Postal Code,Borough,Neighbourhood
M3A,North York,Parkwoods
M4A,North York,Victoria Village
M5A,Downtown Toronto,"Regent Park, Harbourfront"
M6A,North York,"Lawrence Manor, Lawrence Heights"
M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
M1B,Scarborough,"Malvern, Rouge"
M3B,North York,Don Mills
M4B,East York,"Parkview Hill, Woodbine Gardens"
M5B,Downtown Toronto,"Garden District, Ryerson"
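
The loaded table already lists M5A as a single row, so the effect of the combining step is hard to see in the output above. A minimal sketch on hand-made rows, assuming the one-row-per-neighborhood layout described in the guidance, shows the idea. It uses `groupby(...).agg(...)` rather than the `transform` plus `drop_duplicates` approach in the cell above; the `toy` and `combined` names are illustrative.

```python
import pandas as pd

# Hand-made rows with M5A listed twice (illustrative only).
toy = pd.DataFrame({
    "Postal Code": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighbourhood": ["Harbourfront", "Regent Park", "Parkwoods"],
})

# Collapse each (postal code, borough) group into one row,
# joining its neighbourhoods with a comma. sort=False keeps first-seen order.
combined = (
    toy.groupby(["Postal Code", "Borough"], sort=False)["Neighbourhood"]
       .agg(", ".join)
       .reset_index()
)

print(combined)
```

Aggregating directly yields one row per postal code, so no separate de-duplication step is needed in this variant.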


## Using Shape to Print the Dimensions of the Dataframe

In [15]:
tor_wiki_na_removed.shape

(103, 2)
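
The result shows two columns because `Postal Code` is the index, and `.shape` does not count the index as a column. A short sketch on a hand-made frame (the `toy` and `three_col` names are assumptions for illustration) shows how `reset_index()` would give the three-column layout named in the guidance, and how to read the row count from `.shape`.

```python
import pandas as pd

# Hand-made frame indexed by postal code, mirroring the layout above (illustrative only).
toy = pd.DataFrame(
    {
        "Borough": ["North York", "Downtown Toronto"],
        "Neighbourhood": ["Parkwoods", "Regent Park, Harbourfront"],
    },
    index=pd.Index(["M3A", "M5A"], name="Postal Code"),
)

print(toy.shape)  # the index is not counted as a column

# Move the index back into a regular column to get three columns.
three_col = toy.reset_index()
n_rows, n_cols = three_col.shape
print(n_rows, n_cols)
```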