# Segmenting and Clustering Neighborhoods in the City of Toronto, Canada
## 1. Create Data Set

Use the notebook to scrape the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, extract the table of postal codes, and transform it into a pandas DataFrame with one row per postal code.

### a. Install the Wikipedia package

In [1]:
!conda install -c conda-forge wikipedia=1.4.0 --yes # comment this line out if the package is already installed
print('Wikipedia installed')

Solving environment: done

# All requested packages already installed.

Wikipedia installed


### b. Import libraries and scrape the table

In [2]:
import pandas as pd
import wikipedia as wp

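# Get the HTML source of the page
html = wp.page("List_of_postal_codes_of_Canada:_M").html().encode("UTF-8")
# The first table on the page holds the postal codes
df = pd.read_html(html)[0]
# Save the dataset to disk without pandas' own header row, so that the table's
# first row (the column titles) becomes the header when the file is read back
df.to_csv('toronto_postalcodes.csv',header=False,index=False)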
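
The cell above appears to rely on a CSV round trip to turn the table's first row into column names. As a minimal sketch of the same idea done in memory, the first row can be promoted to the header directly. The `raw` frame below is a hypothetical stand-in for a scraped table whose column titles arrived as data; it is not the real scraped result.

```python
import pandas as pd

# Hypothetical stand-in for a scraped table whose header row arrived as data
raw = pd.DataFrame([
    ["Postcode", "Borough", "Neighbourhood"],
    ["M1A", "Not assigned", "Not assigned"],
    ["M3A", "North York", "Parkwoods"],
])

# Use the first row as column labels, then drop it from the data
table = raw.iloc[1:].copy()
table.columns = raw.iloc[0].tolist()
table = table.reset_index(drop=True)

print(table.columns.tolist())
print(len(table))
```

Whether this step is needed at all depends on how `pd.read_html` parses the page in the installed pandas version, so it is worth inspecting `df.columns` first.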

In [3]:
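# Read the dataset
toronto = pd.read_csv('toronto_postalcodes.csv',sep=",")

# Fill unassigned neighbourhoods with "Queen's Park". This assumes the only such
# row that survives the borough filter below is the Queen's Park row.
toronto=toronto.replace({'Neighbourhood':['Not assigned']},"Queen's Park")
# Drop records whose borough is not assigned
toronto = toronto[~toronto['Borough'].isin(['Not assigned'])]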
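
The replacement above hard-codes one neighbourhood name. A more general variant, sketched here under the assumption that an unassigned neighbourhood should simply take its borough's name, drops unassigned boroughs first and then copies the borough into the empty neighbourhood cells. The `sample` rows are illustrative, not the scraped data.

```python
import pandas as pd

# Illustrative rows only; column names follow the scraped table
sample = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M7A"],
    "Borough": ["Not assigned", "North York", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Not assigned"],
})

# Drop rows whose borough is unassigned
cleaned = sample[sample["Borough"] != "Not assigned"].copy()

# Where the neighbourhood is unassigned, fall back to the borough name
unassigned = cleaned["Neighbourhood"] == "Not assigned"
cleaned.loc[unassigned, "Neighbourhood"] = cleaned.loc[unassigned, "Borough"]

print(cleaned)
```

This form keeps working if more than one borough ever has an unassigned neighbourhood.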

In [4]:
# Rename columns for further processing
toronto=toronto.rename(columns={'Postcode':'PostalCode','Neighbourhood':'Neighborhood'})
toronto.columns

Index(['PostalCode', 'Borough', 'Neighborhood'], dtype='object')

In [5]:
# reset_index returns a new DataFrame, so assign the result back
toronto = toronto.reset_index(drop=True)
# Combine rows that share a postal code and borough into one row,
# with the neighborhoods separated by commas
toronto=toronto.groupby(['PostalCode','Borough'])['Neighborhood'].apply(lambda x: ','.join(x)).reset_index()
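
To make the grouping step concrete, here is a small sketch on made-up rows. It uses `agg` with `",".join` in place of the `apply` with a lambda used above; on this data the two should give the same result.

```python
import pandas as pd

# Made-up rows: two neighborhoods share postal code M1B
sample = pd.DataFrame({
    "PostalCode": ["M1B", "M1B", "M1G"],
    "Borough": ["Scarborough", "Scarborough", "Scarborough"],
    "Neighborhood": ["Rouge", "Malvern", "Woburn"],
})

# One row per (PostalCode, Borough), neighborhoods joined with commas
combined = (
    sample.groupby(["PostalCode", "Borough"])["Neighborhood"]
    .agg(",".join)
    .reset_index()
)

print(combined)
```

The final `reset_index()` turns the group keys back into ordinary columns, which is why the result has the same three columns as the input.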

In [6]:
toronto.head()

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge,Malvern |
| 1 | M1C | Scarborough | Highland Creek,Rouge Hill,Port Union |
| 2 | M1E | Scarborough | Guildwood,Morningside,West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |


### c. Use the `shape` attribute to get the number of rows and columns

In [7]:
toronto.shape

(103, 3)
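
Beyond the raw shape, a few quick checks can confirm the cleaning steps did what was intended. The sketch below runs them on a small illustrative frame; the specific checks are suggestions and are not part of the original notebook.

```python
import pandas as pd

# Illustrative frame with the same columns as the cleaned data
sample = pd.DataFrame({
    "PostalCode": ["M1B", "M1C", "M1E"],
    "Borough": ["Scarborough", "Scarborough", "Scarborough"],
    "Neighborhood": [
        "Rouge,Malvern",
        "Highland Creek,Rouge Hill,Port Union",
        "Guildwood,Morningside,West Hill",
    ],
})

# shape is a (rows, columns) tuple, so it can be unpacked directly
n_rows, n_cols = sample.shape

# After grouping, each postal code should appear exactly once
codes_unique = sample["PostalCode"].is_unique

# No unassigned boroughs should remain after filtering
has_unassigned = (sample["Borough"] == "Not assigned").any()

print(n_rows, n_cols, codes_unique, has_unassigned)
```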

In [39]:
# Save to disk
toronto.to_csv('toronto.csv',header=True,index=False)
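
The `index=False` argument matters when the file is read back later. As a small sketch, using an in-memory buffer instead of a file on disk and a made-up two-row frame, the difference looks like this:

```python
import io

import pandas as pd

# Made-up frame standing in for the cleaned data
sample = pd.DataFrame({
    "PostalCode": ["M1B", "M1G"],
    "Borough": ["Scarborough", "Scarborough"],
})

# Default: the row index is written as an extra, unnamed first column
with_index = io.StringIO()
sample.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written
without_index = io.StringIO()
sample.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)     # ['Unnamed: 0', 'PostalCode', 'Borough']
print(cols_without_index)  # ['PostalCode', 'Borough']
```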