Scraping a Wikipedia page for details of Toronto neighbourhoods

In [80]:
#Extract the postal code table from Wikipedia
import pandas as pd

page = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
#read_html returns a list with one DataFrame per table matching attrs
TN = pd.read_html(page, attrs={'class': 'wikitable'})

print('Extracted {num} wikitables'.format(num=len(TN)))

Extracted 1 wikitables


In [81]:
#View the dataframe
TorontoN = TN[0]
TorontoN.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
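
Taking `TN[0]` relies on the postal code table being the first match on the page. A minimal sketch of a more defensive selection, assuming the column names `Postcode`, `Borough`, and `Neighbourhood` seen in the output above; `pick_table` is a hypothetical helper, and the two small frames only stand in for the list returned by `read_html`:

```python
import pandas as pd

EXPECTED = {"Postcode", "Borough", "Neighbourhood"}  # assumed column names

def pick_table(tables, expected=EXPECTED):
    """Return the first DataFrame whose columns include all expected names."""
    for table in tables:
        if expected.issubset(table.columns):
            return table
    raise ValueError("no table with the expected columns was found")

# Stand-ins for the list returned by read_html (illustrative data only)
tables = [
    pd.DataFrame({"Code": ["X"], "Note": ["unrelated table"]}),
    pd.DataFrame({"Postcode": ["M3A"], "Borough": ["North York"],
                  "Neighbourhood": ["Parkwoods"]}),
]
toronto = pick_table(tables)
print(toronto.shape)
```

Selecting by column names rather than by position means the code fails loudly if the page layout changes, instead of continuing with the wrong table.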


With the dataframe loaded, we clean the data and reshape it into the desired format.

In [83]:
#Drop rows whose borough is "Not assigned"
TorontoN = TorontoN[TorontoN['Borough'] != "Not assigned"]
TorontoN.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor
7,M7A,Queen's Park,Not assigned
9,M9A,Queen's Park,Queen's Park
10,M1B,Scarborough,Rouge
11,M1B,Scarborough,Malvern
13,M3B,North York,Don Mills North


In [84]:
#Give any neighbourhood listed as "Not assigned" the name of its borough
unnamed = TorontoN['Neighbourhood'] == "Not assigned"
TorontoN = TorontoN.assign(Neighbourhood=TorontoN['Neighbourhood'].mask(unnamed, TorontoN['Borough']))
TorontoN.head(10)


Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor
7,M7A,Queen's Park,Queen's Park
9,M9A,Queen's Park,Queen's Park
10,M1B,Scarborough,Rouge
11,M1B,Scarborough,Malvern
13,M3B,North York,Don Mills North
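
Assigning to a column of a filtered DataFrame can trigger pandas' SettingWithCopyWarning, because pandas cannot tell whether the filtered frame is a view or a copy of the original. A minimal sketch of one common way to avoid it, taking an explicit `.copy()` after filtering and writing through `.loc`; the frame `raw` and the names `assigned` and `missing` are illustrative, not part of the notebook:

```python
import pandas as pd

# Illustrative rows in the same shape as the scraped table
raw = pd.DataFrame({
    "Postcode": ["M1A", "M5A", "M7A"],
    "Borough": ["Not assigned", "Downtown Toronto", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Harbourfront", "Not assigned"],
})

# .copy() makes the filtered frame independent of `raw`
assigned = raw[raw["Borough"] != "Not assigned"].copy()

# Overwrite only the rows whose neighbourhood is missing
missing = assigned["Neighbourhood"] == "Not assigned"
assigned.loc[missing, "Neighbourhood"] = assigned.loc[missing, "Borough"]

print(assigned)
```

The original frame `raw` is left unchanged, which makes it easy to re-run the cleaning step from scratch.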


In [102]:
#Combine rows so that each postcode has a single row listing all of its neighbourhoods
join_names = " , ".join

TorontoNGrouped = (TorontoN.groupby(['Postcode', 'Borough'])
                           .agg({'Neighbourhood': join_names})
                           .reset_index())
TorontoNGrouped.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge , Malvern"
1,M1C,Scarborough,"Highland Creek , Rouge Hill , Port Union"
2,M1E,Scarborough,"Guildwood , Morningside , West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park , Ionview , Kennedy Park"
7,M1L,Scarborough,"Clairlea , Golden Mile , Oakridge"
8,M1M,Scarborough,"Cliffcrest , Cliffside , Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff , Cliffside West"
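
To see the aggregation on a small scale, here is a sketch using four rows taken from the tables above. It groups by postcode and borough, joins the neighbourhood names with the same `" , "` separator, and then checks that each postcode appears only once; the names `rows` and `grouped` are illustrative:

```python
import pandas as pd

rows = pd.DataFrame({
    "Postcode": ["M1B", "M1B", "M1G", "M6A"],
    "Borough": ["Scarborough", "Scarborough", "Scarborough", "North York"],
    "Neighbourhood": ["Rouge", "Malvern", "Woburn", "Lawrence Heights"],
})

# as_index=False keeps the grouping keys as ordinary columns
grouped = (
    rows.groupby(["Postcode", "Borough"], as_index=False)
        .agg({"Neighbourhood": " , ".join})
)

# Each postcode should now appear exactly once
assert grouped["Postcode"].is_unique
print(grouped)
```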


In [115]:
#Check the number of rows and columns after grouping
TorontoNGrouped.shape

(103, 3)

In [117]:
#Load the latitude and longitude of each postal code
LatLong = pd.read_csv('http://cocl.us/Geospatial_data')
LatLong.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [122]:
#Join the coordinates onto the grouped table, matching on postal code
TNGEO = TorontoNGrouped.set_index('Postcode').join(LatLong.set_index('Postal Code')).reset_index()
TNGEO.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge , Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek , Rouge Hill , Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood , Morningside , West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
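
`DataFrame.join` performs a left join by default, so a postcode with no entry in the coordinates file would silently receive `NaN` for latitude and longitude. A minimal sketch of how such rows could be detected, using `merge` with `indicator=True` on illustrative data; the postcode `M9Z` and the borough `Hypothetical` are made up for the example:

```python
import pandas as pd

grouped = pd.DataFrame({
    "Postcode": ["M1B", "M1C", "M9Z"],  # M9Z is a made-up code with no coordinates
    "Borough": ["Scarborough", "Scarborough", "Hypothetical"],
})
coords = pd.DataFrame({
    "Postal Code": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# indicator=True adds a _merge column recording where each row was found
merged = grouped.merge(
    coords, how="left", left_on="Postcode", right_on="Postal Code", indicator=True
)

# Rows marked left_only had no match in the coordinates table
unmatched = merged.loc[merged["_merge"] == "left_only", "Postcode"].tolist()
print(unmatched)
```

An empty list here would indicate that every postcode found its coordinates.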
