#### Scrape the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M to obtain the data in the table of postal codes, and transform that data into a pandas dataframe like the one shown in https://www.coursera.org/learn/applied-data-science-capstone/peer/I1bDq/segmenting-and-clustering-neighborhoods-in-toronto/submit

In [1]:
import pandas as pd # library for data analysis

#### Read the data from the tab-separated file toronto.txt

In [235]:
neighborhood_data = pd.read_csv('toronto.txt', sep="\t", header=None)

#### The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood

In [236]:
neighborhood_data.columns = ["PostalCode","Borough","Neighborhood"]

In [237]:
neighborhood_data.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria_Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence_Heights
6,M6A,North York,Lawrence_Manor
7,M7A,Downtown Toronto,Queen's_Park
8,M8A,Not assigned,Not assigned
9,M9A,Queen's Park,Not assigned
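
The read and the column-naming steps above can also be done in a single call. The following is a minimal sketch, assuming the file is tab-separated with no header row, as the `read_csv` call above implies. The three-row `sample` string and the name `sample_df` are hypothetical stand-ins for `toronto.txt` and are not part of the original notebook.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for toronto.txt
# (tab-separated, no header row).
sample = (
    "M1A\tNot assigned\tNot assigned\n"
    "M3A\tNorth York\tParkwoods\n"
    "M4A\tNorth York\tVictoria_Village\n"
)

# header=None tells pandas the first line is data, not column labels;
# names= assigns the labels while reading, so no separate
# `.columns = [...]` step is needed.
sample_df = pd.read_csv(
    io.StringIO(sample),
    sep="\t",
    header=None,
    names=["PostalCode", "Borough", "Neighborhood"],
)

print(sample_df.shape)
```

With the real file, `io.StringIO(sample)` would be replaced by the path `'toronto.txt'`.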


#### Transform the data into a *pandas* dataframe

In [238]:
index=neighborhood_data.index

In [239]:
toronto=pd.DataFrame(neighborhood_data, index=index, columns=["PostalCode","Borough","Neighborhood"])
toronto.head(5)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria_Village
4,M5A,Downtown Toronto,Harbourfront


In [240]:
toronto.shape

(287, 3)

#### Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.

In [241]:
toronto=toronto[toronto['Borough']!='Not assigned']
toronto.head(3)

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria_Village
4,M5A,Downtown Toronto,Harbourfront


In [242]:
toronto.shape

(210, 3)
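
To make the filtering step easier to follow, here is a small sketch of the same boolean-mask technique on a hypothetical four-row frame (the `sample`, `mask`, and `assigned` names are illustrative only). It shows that the comparison produces one True/False value per row and that the surviving rows keep their original index labels, which is why the index has gaps after filtering.

```python
import pandas as pd

# Hypothetical four-row sample in the same layout as the notebook's data.
sample = pd.DataFrame({
    "PostalCode": ["M1A", "M3A", "M4A", "M8A"],
    "Borough": ["Not assigned", "North York", "North York", "Not assigned"],
    "Neighborhood": ["Not assigned", "Parkwoods", "Victoria_Village", "Not assigned"],
})

# One boolean per row: True where the borough has been assigned.
mask = sample["Borough"] != "Not assigned"

# Indexing with the mask keeps only the True rows; the original
# index labels are preserved, so gaps appear where rows were dropped.
assigned = sample[mask]

print(assigned.index.tolist())
```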

#### Reset the index

In [243]:
toronto.index

Int64Index([  2,   3,   4,   5,   6,   7,   9,  10,  11,  13,
            ...
            268, 269, 270, 271, 272, 281, 282, 283, 284, 285],
           dtype='int64', length=210)

In [244]:
toronto=toronto.set_index('PostalCode')
toronto.head(3)

PostalCode,Borough,Neighborhood
M3A,North York,Parkwoods
M4A,North York,Victoria_Village
M5A,Downtown Toronto,Harbourfront


In [245]:
toronto=toronto.reset_index()
toronto.head(3)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria_Village
2,M5A,Downtown Toronto,Harbourfront
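
The `set_index` / `reset_index` round trip above renumbers the rows from 0. As an alternative sketch (not the notebook's method), `reset_index(drop=True)` gives the same renumbering in one call by discarding the old index instead of turning it into a column. The `filtered` and `renumbered` frames below are hypothetical examples.

```python
import pandas as pd

# Hypothetical frame whose index starts at 2, as it would
# after unassigned rows have been filtered out.
filtered = pd.DataFrame(
    {
        "PostalCode": ["M3A", "M4A", "M5A"],
        "Borough": ["North York", "North York", "Downtown Toronto"],
        "Neighborhood": ["Parkwoods", "Victoria_Village", "Harbourfront"],
    },
    index=[2, 3, 4],
)

# drop=True discards the old index labels rather than adding them
# as a new column, leaving a fresh 0..n-1 index.
renumbered = filtered.reset_index(drop=True)

print(renumbered.index.tolist())
```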


#### If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough. So for the 9th cell in the table on the Wikipedia page, the value of the Borough and the Neighborhood columns will be Queen's Park.

In [246]:
toronto[toronto['Neighborhood']=='Not assigned']

Unnamed: 0,PostalCode,Borough,Neighborhood
6,M9A,Queen's Park,Not assigned


In [251]:
toronto.at[6,'Neighborhood']=toronto.at[6,'Borough']

In [253]:
toronto.iloc[6]

PostalCode               M9A
Borough         Queen's Park
Neighborhood    Queen's Park
Name: 6, dtype: object
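
The assignment above targets row 6 by its label, which works because only one row matched. A sketch of a more general variant, assuming the same "Not assigned" marker, uses a boolean mask with `.loc` so that every matching row is updated without hard-coding an index. The two-row `sample` frame and the `unassigned` name are hypothetical.

```python
import pandas as pd

# Hypothetical two-row sample: one normal row, one row with a
# borough but no assigned neighborhood.
sample = pd.DataFrame({
    "PostalCode": ["M7A", "M9A"],
    "Borough": ["Downtown Toronto", "Queen's Park"],
    "Neighborhood": ["Queen's_Park", "Not assigned"],
})

# True for rows whose neighborhood is still unassigned.
unassigned = sample["Neighborhood"] == "Not assigned"

# Copy the borough into the neighborhood for exactly those rows;
# pandas aligns the right-hand side on the index labels.
sample.loc[unassigned, "Neighborhood"] = sample.loc[unassigned, "Borough"]

print(sample)
```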

#### More than one neighborhood can exist in one postal code area. Combine rows with the same PostalCode into one row with the neighborhoods separated with a comma.

In [257]:
toronto=toronto.groupby(['PostalCode','Borough'], as_index=False).agg(lambda Neighborhood: ', '.join(Neighborhood))
toronto

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland_Creek, Rouge_Hill, Port_Union"
2,M1E,Scarborough,"Guildwood, Morningside, West_Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough_Village
6,M1K,Scarborough,"East_Birchmount_Park, Ionview, Kennedy_Park"
7,M1L,Scarborough,"Clairlea, Golden_Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough_Village_West"
9,M1N,Scarborough,"Birch_Cliff, Cliffside_West"


In [258]:
toronto.shape

(103, 3)
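
The grouping step can be checked on a small case. The sketch below, on a hypothetical three-row frame, selects the `Neighborhood` column explicitly before aggregating and passes `", ".join` directly instead of a lambda. This is a stylistic variant of the call above, not a different technique, and the `sample` and `combined` names are illustrative.

```python
import pandas as pd

# Hypothetical sample: M6A appears twice with different neighborhoods.
sample = pd.DataFrame({
    "PostalCode": ["M6A", "M6A", "M5A"],
    "Borough": ["North York", "North York", "Downtown Toronto"],
    "Neighborhood": ["Lawrence_Heights", "Lawrence_Manor", "Harbourfront"],
})

# Group by postal code and borough, join each group's neighborhood
# names with ", ", then turn the group keys back into columns.
combined = (
    sample.groupby(["PostalCode", "Borough"])["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)

print(combined)
```

Each postal code now occupies one row, so the row count of `combined` equals the number of distinct (PostalCode, Borough) pairs in the input.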