# This is the notebook for the first dataframe in the week 3 assignment

In [1]:
pip install lxml

Note: you may need to restart the kernel to use updated packages.


First, I read in the table from the Wikipedia page using the 'pd.read_html' function, which returns a list of dataframes, and then assign the first one to the variable 'dfcan'.

In [2]:
import pandas as pd
import numpy as np

page = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')

dfcan = pd.DataFrame(page[0])
dfcan

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
175,M5Z,Not assigned,Not assigned
176,M6Z,Not assigned,Not assigned
177,M7Z,Not assigned,Not assigned
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."
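
Since 'pd.read_html' returns a list with one dataframe per table it finds, it can help to see that behaviour on a small input. The following is only a sketch: the 'html' string is a hand-written stand-in for the live page (an assumption for illustration, not the real page source), and it relies on the lxml parser installed above.

```python
import io

import pandas as pd

# Hypothetical, hand-written HTML standing in for the Wikipedia page.
html = """
<table>
  <thead>
    <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  </thead>
  <tbody>
    <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
    <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  </tbody>
</table>
"""

# read_html parses every <table> it finds and returns a list of dataframes.
tables = pd.read_html(io.StringIO(html))

# The first (and here only) table is selected by position.
df = tables[0]
print(type(tables).__name__, len(tables))
print(df)
```

Selecting by position, as with 'page[0]', works as long as the table of interest stays first on the page.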


I then replace the "Not assigned" values in the "Borough" column with NaN (numpy's 'np.nan'), which pandas recognizes as missing data.

In [3]:
# numpy was already imported as np in the previous cell
dfcan['Borough'] = dfcan['Borough'].replace("Not assigned", np.nan)
dfcan

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,,Not assigned
1,M2A,,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
175,M5Z,,Not assigned
176,M6Z,,Not assigned
177,M7Z,,Not assigned
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."
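
The replacement step can be checked in isolation on a toy frame. This is a minimal sketch: 'toy' is a hypothetical three-row frame built from rows shown above, and the result is assigned back to the column rather than modified in place on a column selection.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame using a few rows from the output above.
toy = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M5A"],
    "Borough": ["Not assigned", "North York", "Downtown Toronto"],
})

# Swap the placeholder string for a real missing value and assign it back.
toy["Borough"] = toy["Borough"].replace("Not assigned", np.nan)

# isna() now flags only the row that held the placeholder.
print(toy["Borough"].isna().tolist())
```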


I assign the result of 'dfcan.isnull()' to the variable "missing_data" to flag which cells hold NaN values (True means the cell is NaN, False means it is not).

In [4]:
missing_data = dfcan.isnull()
missing_data.head(5)

Unnamed: 0,Postal Code,Borough,Neighborhood
0,False,True,False
1,False,True,False
2,False,False,False
3,False,False,False
4,False,False,False


I then use a simple for loop to print the number of NaN values in each column (indicated by the count of True values). I find that there are 77 rows with a missing "Borough", which will then be removed.

In [5]:
for column in missing_data.columns.values.tolist():
    print(column)
    print(missing_data[column].value_counts())
    print("")

Postal Code
False    180
Name: Postal Code, dtype: int64

Borough
False    103
True      77
Name: Borough, dtype: int64

Neighborhood
False    180
Name: Neighborhood, dtype: int64
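
A more compact way to get the same per-column counts is to sum the boolean mask, since True counts as 1. The sketch below uses a hypothetical three-row 'toy' frame rather than the full table.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: two rows have a missing Borough.
toy = pd.DataFrame({
    "Postal Code": ["M1A", "M2A", "M3A"],
    "Borough": [np.nan, np.nan, "North York"],
    "Neighborhood": ["Not assigned", "Not assigned", "Parkwoods"],
})

# isna() gives a True/False mask; summing it counts the True values per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

Note that the "Not assigned" strings in "Neighborhood" are not counted, because they are ordinary strings rather than NaN.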



I then drop all of the rows with NaN values in the "Borough" column and reset the index.

In [6]:
dfcan.dropna(subset=["Borough"], axis=0, inplace=True)
dfcan.reset_index(drop=True, inplace=True)
dfcan

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."
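
The drop and the index reset can also be written as one chained expression that returns a new frame instead of modifying the original. This is a sketch on a hypothetical 'toy' frame; 'cleaned' is an illustrative name.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: the first two rows have no Borough.
toy = pd.DataFrame({
    "Postal Code": ["M1A", "M2A", "M3A", "M4A"],
    "Borough": [np.nan, np.nan, "North York", "North York"],
})

# Drop rows whose Borough is NaN, then renumber the index from 0.
cleaned = toy.dropna(subset=["Borough"]).reset_index(drop=True)
print(cleaned)
```

Passing 'drop=True' discards the old index rather than keeping it as an extra column.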


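I then fill in any missing values in the "Neighborhood" column with the name from the "Borough" column. Since this column still stores "Not assigned" as a string rather than NaN, I convert it to NaN first so that 'fillna' can act on it.

In [7]:
# Turn any remaining "Not assigned" strings into NaN, then fill them from "Borough"
dfcan["Neighborhood"] = dfcan["Neighborhood"].replace("Not assigned", np.nan).fillna(dfcan["Borough"])
dfcan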

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."
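
The difference between a literal "Not assigned" string and a real NaN matters for 'fillna', which only touches NaN. The sketch below illustrates this on a hypothetical 'toy' frame; the "Example Borough" row is invented for illustration and is not taken from the table.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: the first row has a borough but no neighborhood.
toy = pd.DataFrame({
    "Borough": ["Example Borough", "North York"],
    "Neighborhood": ["Not assigned", "Parkwoods"],
})

# fillna alone leaves the placeholder string as it is.
untouched = toy["Neighborhood"].fillna(toy["Borough"])

# Converting the placeholder to NaN first lets fillna copy the borough name.
filled = toy["Neighborhood"].replace("Not assigned", np.nan).fillna(toy["Borough"])

print(untouched.tolist())
print(filled.tolist())
```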


I then look up each of the 12 required rows one by one by postal code, since each postal code is unique. Each match is assigned to its own variable. I then concatenate the 12 single-row dataframes into one and reset the index to get the new dataframe.

In [9]:
one=dfcan[dfcan['Postal Code'].astype(str).str.contains('M5G')]
one

Unnamed: 0,Postal Code,Borough,Neighborhood
24,M5G,Downtown Toronto,Central Bay Street


In [10]:
two=dfcan[dfcan['Postal Code'].astype(str).str.contains('M2H')]
two

Unnamed: 0,Postal Code,Borough,Neighborhood
27,M2H,North York,Hillcrest Village


In [11]:
three=dfcan[dfcan['Postal Code'].astype(str).str.contains('M4B')]
three

Unnamed: 0,Postal Code,Borough,Neighborhood
8,M4B,East York,"Parkview Hill, Woodbine Gardens"


In [12]:
four=dfcan[dfcan['Postal Code'].astype(str).str.contains('M1J')]
four

Unnamed: 0,Postal Code,Borough,Neighborhood
32,M1J,Scarborough,Scarborough Village


In [13]:
five=dfcan[dfcan['Postal Code'].astype(str).str.contains('M4G')]
five

Unnamed: 0,Postal Code,Borough,Neighborhood
23,M4G,East York,Leaside


In [14]:
six=dfcan[dfcan['Postal Code'].astype(str).str.contains('M4M')]
six

Unnamed: 0,Postal Code,Borough,Neighborhood
54,M4M,East Toronto,Studio District


In [15]:
seven=dfcan[dfcan['Postal Code'].astype(str).str.contains('M1R')]
seven

Unnamed: 0,Postal Code,Borough,Neighborhood
71,M1R,Scarborough,"Wexford, Maryvale"


In [16]:
eight=dfcan[dfcan['Postal Code'].astype(str).str.contains('M9V')]
eight

Unnamed: 0,Postal Code,Borough,Neighborhood
89,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest..."


In [17]:
nine=dfcan[dfcan['Postal Code'].astype(str).str.contains('M9L')]
nine

Unnamed: 0,Postal Code,Borough,Neighborhood
50,M9L,North York,Humber Summit


In [18]:
ten=dfcan[dfcan['Postal Code'].astype(str).str.contains('M5V')]
ten

Unnamed: 0,Postal Code,Borough,Neighborhood
87,M5V,Downtown Toronto,"CN Tower, King and Spadina, Railway Lands, Har..."


In [19]:
eleven=dfcan[dfcan['Postal Code'].astype(str).str.contains('M1B')]
eleven

Unnamed: 0,Postal Code,Borough,Neighborhood
6,M1B,Scarborough,"Malvern, Rouge"


In [20]:
twelve=dfcan[dfcan['Postal Code'].astype(str).str.contains('M5A')]
twelve

Unnamed: 0,Postal Code,Borough,Neighborhood
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"


In [25]:
result = pd.concat([one, two, three, four, five, six, seven, eight, nine, ten, eleven, twelve], ignore_index=True, sort=False)
result

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M5G,Downtown Toronto,Central Bay Street
1,M2H,North York,Hillcrest Village
2,M4B,East York,"Parkview Hill, Woodbine Gardens"
3,M1J,Scarborough,Scarborough Village
4,M4G,East York,Leaside
5,M4M,East Toronto,Studio District
6,M1R,Scarborough,"Wexford, Maryvale"
7,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest..."
8,M9L,North York,Humber Summit
9,M5V,Downtown Toronto,"CN Tower, King and Spadina, Railway Lands, Har..."
10,M1B,Scarborough,"Malvern, Rouge"
11,M5A,Downtown Toronto,"Regent Park, Harbourfront"
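
The same selection could also be done in a single step by indexing on the postal code and asking for the codes in the required order. This is a sketch of an alternative, not the approach used in the cells above: 'toy' is a hypothetical four-row subset of the table and 'wanted' is an illustrative list of codes. Unlike 'str.contains', which matches substrings, a label lookup only matches the exact code.

```python
import pandas as pd

# Hypothetical four-row subset of the cleaned table.
toy = pd.DataFrame({
    "Postal Code": ["M5A", "M1B", "M4B", "M5G"],
    "Borough": ["Downtown Toronto", "Scarborough", "East York", "Downtown Toronto"],
    "Neighborhood": [
        "Regent Park, Harbourfront",
        "Malvern, Rouge",
        "Parkview Hill, Woodbine Gardens",
        "Central Bay Street",
    ],
})

# Codes in the order the rows should appear in the result.
wanted = ["M5G", "M4B", "M1B", "M5A"]

# .loc with a list of labels returns the rows in that order
# (and raises a KeyError if a code is missing from the index).
result = toy.set_index("Postal Code").loc[wanted].reset_index()
print(result)
```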


To see how many entries there are in the new dataframe, I use the "shape" attribute.

In [33]:
final = result.shape[0]
print("Number of rows in the new dataframe:", final)

Number of rows in the new dataframe: 12
