## Coursera_IBM_Applied-Data-Science-Capstone
#### This notebook contains my work for the Coursera IBM Applied Data Science Capstone, one of the courses in the IBM Data Science Professional Certificate.

### Peer-graded Assignment: Segmenting and Clustering Neighborhoods in Toronto
###### (C) Ahmed Tealeb

##### 1- Start by creating a new Notebook for this assignment.


##### 2- Use the Notebook to build the code that scrapes the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, to obtain the data in the table of postal codes and transform it into a pandas dataframe.

### Pre-processing

##### 3. To create the above dataframe:

- The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood
- Only process the cells that have an assigned borough. Ignore cells with a borough that is <B>Not assigned</B>.
- More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that <B>M5A</B> is listed twice and has two neighborhoods: <B>Harbourfront</B> and <B>Regent Park</B>. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in <B>row 11</B> in the above table.

- If a cell has a borough but a <B>Not assigned</B> neighborhood, then the neighborhood will be the same as the borough. So for the <B>9th</B> cell in the table on the Wikipedia page, the value of the Borough and the Neighborhood columns will be <B>Queen's Park</B>.
- Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.
- In the last cell of your notebook, use the <B>.shape</B> attribute to print the number of rows of your dataframe.
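
The rules listed above can be sketched on a small in-memory table before touching the real page. This is only an illustration under stated assumptions: the sample rows and the helper name `clean_postal_codes` are hypothetical, and the column names follow the assignment text (PostalCode, Borough, Neighborhood).

```python
import pandas as pd

# Hypothetical sample rows, shaped like the scraped Wikipedia table.
raw = pd.DataFrame(
    {
        "PostalCode": ["M1A", "M5A", "M5A", "M7A"],
        "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto", "Queen's Park"],
        "Neighborhood": ["Not assigned", "Harbourfront", "Regent Park", "Not assigned"],
    }
)


def clean_postal_codes(frame):
    """Apply the three assignment rules to a raw postal code table (illustrative helper)."""
    # Keep only the rows that have an assigned borough.
    assigned = frame[frame["Borough"] != "Not assigned"].copy()
    # An unassigned neighborhood takes the name of its borough.
    missing = assigned["Neighborhood"] == "Not assigned"
    assigned.loc[missing, "Neighborhood"] = assigned.loc[missing, "Borough"]
    # Merge rows that share a postal code into one comma-separated row.
    return (
        assigned.groupby(["PostalCode", "Borough"])["Neighborhood"]
        .agg(", ".join)
        .reset_index()
    )


cleaned = clean_postal_codes(raw)
print(cleaned)
print(cleaned.shape)
```

Filling in the unassigned neighborhoods before grouping keeps the literal text "Not assigned" out of the joined strings.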

##### 4. Submit a link to your Notebook on your GitHub repository. <B>(10 marks)</B>


<B>Note:</B> There are different website scraping libraries and packages in Python. One of the most common packages is BeautifulSoup. Here is the package's main documentation page: http://beautiful-soup-4.readthedocs.io/en/latest/

The package is so popular that there is a plethora of tutorials and examples of how to use it. Here is a very good YouTube video on how to use the BeautifulSoup package: https://www.youtube.com/watch?v=ng2o98k983k

Use the BeautifulSoup package to transform the data in the table on the Wikipedia page into the above pandas dataframe.
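
As a minimal sketch of that idea, the snippet below parses a small HTML table with BeautifulSoup and loads it into pandas. The HTML fragment is a hypothetical stand-in for the Wikipedia table (the real page may be laid out differently), and `sample_df` is an illustrative name.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical HTML fragment standing in for the Wikipedia table.
html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M4A</td><td>North York</td><td>Victoria Village</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

# Collect the text of every header and data cell, one list per table row.
rows = []
for tr in table.find_all("tr"):
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    if cells:
        rows.append(cells)

# The first row holds the column names; the rest are data rows.
header, *body = rows
sample_df = pd.DataFrame(body, columns=header)
print(sample_df)
```

Building the dataframe directly from lists skips the intermediate CSV file, so commas inside a cell cannot shift columns.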

#### Part-1: Extracting the raw table from the Wikipedia page and saving it to a CSV file

In [118]:
# Step 1: Use BeautifulSoup, one of the most common scraping packages, to download the table data
# Import libraries
import pandas as pd
import numpy as np
import os, sys
import urllib
import requests 
from urllib.request import urlopen
from bs4 import BeautifulSoup

wikipedia_link = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
# Fetch the page and parse it; naming the parser explicitly avoids a bs4 warning
with urlopen(wikipedia_link) as raw_wikipedia_page:
    soup = BeautifulSoup(raw_wikipedia_page, "html.parser")

# The postal code table is assumed to be the first <table> on the page
tables = soup.find_all('table')
Toronto_FSAs_table = tables[0]

# Write one CSV line per table row. Every cell is followed by a comma, so each
# line ends with a trailing comma that pandas reads as an empty 'Unnamed: 3' column.
with open("Toronto_FSAs_Raw.csv", "w") as fp:
    for tr in Toronto_FSAs_table.find_all('tr'):
        for cell in tr.find_all(['th', 'td']):
            fp.write(cell.get_text().strip() + ',')
        fp.write('\n')

In [119]:
# Step 2: Load the data into a dataframe
import pandas as pd
df = pd.read_csv('Toronto_FSAs_Raw.csv')
# The trailing comma on each CSV line yields an empty 'Unnamed: 3' column; drop it
df = df.drop(columns='Unnamed: 3')
df = df.rename(columns={'Postcode': 'PostalCode'})
df.head()
# df.shape

|   | PostalCode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
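
The `Unnamed: 3` column dropped in Step 2 appears because every line of the raw CSV ends with a comma, which pandas reads as an empty fourth field. A minimal sketch of that behaviour, assuming a hypothetical two-line sample in place of the real file:

```python
import io

import pandas as pd

# Hypothetical sample in the same shape as Toronto_FSAs_Raw.csv:
# each line ends with a comma, so there is an empty fourth field.
sample_csv = "Postcode,Borough,Neighbourhood,\nM3A,North York,Parkwoods,\n"

sample = pd.read_csv(io.StringIO(sample_csv))
raw_columns = list(sample.columns)
print(raw_columns)

# Dropping the placeholder column leaves the three real columns.
sample = sample.drop(columns="Unnamed: 3")
print(list(sample.columns))
```

Writing the file with Python's `csv.writer` instead of manual string concatenation would avoid the extra column altogether.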


In [154]:
# Step 3: Remove the rows whose "Borough" value is "Not assigned"
df_Cleaned = df[df['Borough'] != 'Not assigned']
df_Cleaned.shape

(212, 3)

In [155]:
# Step 4: Combine the "Neighbourhood" values by grouping on "PostalCode" and "Borough"
grouped = df_Cleaned.groupby(['PostalCode', 'Borough'], as_index=False).agg(', '.join)
df_grouped = pd.DataFrame(grouped)
df_grouped.shape

(103, 3)

In [157]:
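# Step 5: Replace "Not assigned" in the 'Neighbourhood' column with the 'Borough' value.
# Assigning through .loc updates df_grouped itself; editing a row pulled out with
# .iloc only changes a copy and leaves the dataframe untouched.
not_assigned = df_grouped['Neighbourhood'] == 'Not assigned'
df_grouped.loc[not_assigned, 'Neighbourhood'] = df_grouped.loc[not_assigned, 'Borough']
# Save the result once, after all replacements are done
df_grouped.to_csv('TorontoFSAs_Part1.csv', index=False)
df_grouped.head()
# df_grouped.shape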

|   | PostalCode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
