<h1>Applied Data Science Capstone: Segmenting and Clustering Neighborhoods in Toronto</h1>
<h3>Author: Francihelena Uzcategui</h3>


## Table of Contents
<ul>

<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#cleaning">Data Cleaning</a></li>
<li><a href="#eda">Exploratory Data Analysis </a></li>
</ul>


<h3>Introduction </h3>

 <p>A Wikipedia page lists all the information we need to explore and cluster the neighborhoods in Toronto. The task is to scrape that page, wrangle and clean the data, and read it into a pandas DataFrame so that it is in a structured format.</p>
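
The scraping step described above can be sketched offline. This is a minimal sketch, not the notebook's own code: `SAMPLE_HTML` is a made-up stand-in for the downloaded page, and it assumes an HTML parser that `pd.read_html` supports (for example `lxml`) is installed. It uses the `match` parameter to pick the table by its content rather than by its position on the page.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the Wikipedia page: two tables, only one of
# which contains postal codes.
SAMPLE_HTML = """
<table>
  <thead><tr><th>Navigation</th></tr></thead>
  <tbody><tr><td>Other provinces</td></tr></tbody>
</table>
<table>
  <thead><tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr></thead>
  <tbody>
    <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
    <tr><td>M4A</td><td>North York</td><td>Victoria Village</td></tr>
  </tbody>
</table>
"""

# read_html returns a list of DataFrames; match= keeps only the tables
# whose text matches the given pattern.
tables = pd.read_html(StringIO(SAMPLE_HTML), match="Postal Code")
postal_df = tables[0]
print(postal_df)
```

Selecting by content is less fragile than a fixed index such as `[0]` if the page layout changes, although the index approach used in this notebook works as long as the postal code table stays first.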

<h3>Data Wrangling</h3>

In [1]:
import pandas as pd

In [2]:
# pd.read_html returns a list with ALL the tables found on the page; the index ([0], [1], [2], ...) selects the one we want.
# Sources:
#   https://www.youtube.com/watch?v=sAuGH1Kto2I&t=1s (we used the second method: getting all the tables in the page)
#   https://stackoverflow.com/questions/41100451/load-scraped-table-via-bs4-into-pandas-dataframe (how to choose the table by index)
df = pd.read_html("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")[0]
print(df)

    Postal Code           Borough  \
0           M1A      Not assigned   
1           M2A      Not assigned   
2           M3A        North York   
3           M4A        North York   
4           M5A  Downtown Toronto   
..          ...               ...   
175         M5Z      Not assigned   
176         M6Z      Not assigned   
177         M7Z      Not assigned   
178         M8Z         Etobicoke   
179         M9Z      Not assigned   

                                          Neighborhood  
0                                                  NaN  
1                                                  NaN  
2                                            Parkwoods  
3                                     Victoria Village  
4                            Regent Park, Harbourfront  
..                                                 ...  
175                                                NaN  
176                                                NaN  
177                                       

In [3]:
df.head(10)

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
7,M8A,Not assigned,
8,M9A,Etobicoke,Islington Avenue
9,M1B,Scarborough,"Malvern, Rouge"


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 180 entries, 0 to 179
Data columns (total 3 columns):
Postal Code     180 non-null object
Borough         180 non-null object
Neighborhood    103 non-null object
dtypes: object(3)
memory usage: 4.3+ KB


<h3>Data Cleaning</h3>

In [5]:
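# According to the report above, Neighborhood has 77 null values (180 - 103); these are the 'Not assigned' rows.
# In this table, Borough == 'Not assigned' is equivalent to Neighborhood == NaN, so dropna() removes both.
df = df.dropna().reset_index(drop=True)  # drop=True discards the old index instead of keeping it as an 'index' column
df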

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,Business reply mail Processing Centre
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."
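
The equivalence assumed in the cleaning cell above (a `Not assigned` borough always comes with a missing neighborhood) can be checked before dropping rows. The following is a small sketch on a four-row sample taken from the table shown earlier; the names `sample`, `masks_agree`, and `cleaned` are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Four rows copied from the head of the scraped table.
sample = pd.DataFrame({
    "Postal Code": ["M1A", "M2A", "M3A", "M4A"],
    "Borough": ["Not assigned", "Not assigned", "North York", "North York"],
    "Neighborhood": [np.nan, np.nan, "Parkwoods", "Victoria Village"],
})

# Boolean masks for the two conditions.
unassigned_borough = sample["Borough"].eq("Not assigned")
missing_neighborhood = sample["Neighborhood"].isna()

# True only if both masks flag exactly the same rows.
masks_agree = bool((unassigned_borough == missing_neighborhood).all())

# Filtering on the borough directly states the intent explicitly.
cleaned = sample[~unassigned_borough].reset_index(drop=True)
print(masks_agree)
print(cleaned)
```

If the masks ever disagreed, `dropna()` alone would either keep a `Not assigned` borough or drop an assigned one, so filtering on `Borough` explicitly is the safer choice in that case.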


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103 entries, 0 to 102
Data columns (total 3 columns):
Postal Code     103 non-null object
Borough         103 non-null object
Neighborhood    103 non-null object
dtypes: object(3)
memory usage: 2.5+ KB


In [7]:
# According to the report above, df no longer contains null values or 'Not assigned' entries; to confirm, we filter with str.match.
df2 = df[df['Borough'].str.match('Not assigned')]
df2

Unnamed: 0,Postal Code,Borough,Neighborhood
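
One detail about the check above: `str.match` treats its argument as a regular expression anchored at the start of the string, so it also flags values that merely begin with the text. A short sketch of the difference, where the value `"Not assigned (pending)"` is invented purely for illustration:

```python
import pandas as pd

# The second value is hypothetical; it does not occur in the Wikipedia table.
boroughs = pd.Series(["Not assigned", "Not assigned (pending)", "North York"])

prefix_hits = boroughs.str.match("Not assigned")  # regex match from the start of each string
exact_hits = boroughs.eq("Not assigned")          # exact string equality

print(prefix_hits.tolist())
print(exact_hits.tolist())
```

For this dataset both approaches return an empty result after cleaning, so the notebook's conclusion is unaffected; `eq` is simply the stricter test.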


In [8]:
df3=df.groupby(['Postal Code', 'Borough'])
df3.first()

Postal Code,Borough,Neighborhood
M1B,Scarborough,"Malvern, Rouge"
M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
M1E,Scarborough,"Guildwood, Morningside, West Hill"
M1G,Scarborough,Woburn
M1H,Scarborough,Cedarbrae
...,...,...
M9N,York,Weston
M9P,Etobicoke,Westmount
M9R,Etobicoke,"Kingsview Village, St. Phillips, Martin Grove ..."
M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest..."


<h3>Exploratory Data Analysis</h3>

Calling `df3.shape` raises `AttributeError: Cannot access attribute 'shape' of 'DataFrameGroupBy' objects, try using the 'apply' method`.

Explanation: a groupby object is only metadata describing how to perform the grouping. It is not a DataFrame or Series, so `.shape` has no meaning until you apply an aggregation or another method such as `head()`, which returns a DataFrame or Series.
Source: https://stackoverflow.com/questions/51246363/pandas-shape-doesnt-work
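
As a rough illustration of the point above, the sketch below inspects a groupby object without asking for `.shape`. The three rows come from the table shown earlier; the variable names are illustrative.

```python
import pandas as pd

sample = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M5A"],
    "Borough": ["North York", "North York", "Downtown Toronto"],
    "Neighborhood": ["Parkwoods", "Victoria Village", "Regent Park, Harbourfront"],
})

grouped = sample.groupby(["Postal Code", "Borough"])

n_groups = grouped.ngroups    # number of distinct (Postal Code, Borough) pairs
group_sizes = grouped.size()  # Series with the row count of each group
first_rows = grouped.first()  # aggregation: returns a regular DataFrame

print(n_groups)
print(group_sizes)
print(first_rows.shape)       # .shape works on the aggregated result
```

`ngroups` and `size()` answer the usual "how many groups, how many rows each" questions directly, and any aggregation such as `first()` yields an object that supports `.shape`.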

In [10]:
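# head() returns a regular DataFrame (up to the first 5 rows of each group), so .shape works again
df4 = df3.head()
df4.shape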

(103, 3)

Verifying that our df follows this instruction: "More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table". The Wikipedia table has already merged the neighborhoods by postal code.
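
If the source table still listed one neighborhood per row, the combining step quoted above could be done with a groupby and a string join. This is a hedged sketch under that assumption: the `raw` frame below reconstructs the unmerged M5A example from the instruction and is not data scraped in this notebook.

```python
import pandas as pd

# Assumed unmerged layout: M5A appears twice, once per neighborhood.
raw = pd.DataFrame({
    "Postal Code": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Harbourfront", "Regent Park", "Parkwoods"],
})

# Join the neighborhoods of each (Postal Code, Borough) pair with a comma.
combined = (
    raw.groupby(["Postal Code", "Borough"], sort=False)["Neighborhood"]
       .agg(", ".join)
       .reset_index()
)
print(combined)
```

`sort=False` keeps the groups in order of first appearance, which preserves the row order of the original table.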

In [11]:
df4.nunique(axis=0)

Postal Code     103
Borough          10
Neighborhood     98
dtype: int64
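
The counts above show 103 distinct postal codes but only 98 distinct neighborhood values, so some neighborhood strings appear under more than one postal code. One possible way to list them is sketched below; the postal codes and names in `sample` are made-up placeholders, not values from the Toronto table.

```python
import pandas as pd

# Placeholder data: "Alpha" is shared by two postal codes.
sample = pd.DataFrame({
    "Postal Code": ["X1A", "X2A", "X3A"],
    "Neighborhood": ["Alpha", "Beta", "Alpha"],
})

# keep=False marks every occurrence of a repeated value, not only the later ones.
repeated = sample[sample["Neighborhood"].duplicated(keep=False)]

# value_counts gives the number of rows per neighborhood string.
counts = sample["Neighborhood"].value_counts()

print(repeated)
print(counts)
```

Running the same two lines on `df4` would show which neighborhood strings account for the gap between 103 rows and 98 unique values.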