# This assignment explores, segments, and clusters the neighborhoods in the city of Toronto

## 1. A Wikipedia page contains all the information needed to explore and cluster the neighborhoods in Toronto. This step scrapes that page, wrangles and cleans the data, and reads it into a pandas dataframe so that it is in a structured format

In [1]:
#install the libraries required for web page scraping
print("INSTALLING Libraries required for Web Scraping...")
!pip install beautifulsoup4
!pip install lxml
!pip install html5lib
!pip install requests
print("INSTALLING Libraries required for Web Scraping. DONE.")

INSTALLING Libraries required for Web Scraping...
INSTALLING Libraries required for Web Scraping. DONE.


In [2]:
#import the libraries required for web page scraping and dataframe creation
from bs4 import BeautifulSoup
import requests
import pandas as pd

In [3]:
#get the source web page
source_url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
response = requests.get(source_url, timeout=30)
response.raise_for_status()  #stop early if the page could not be fetched
source = response.text

#load it into the HTML parser
soup = BeautifulSoup(source, 'lxml')

#get the html element containing the needed information
data_table = soup.find('table', class_='wikitable sortable')
print(data_table.prettify())

<table class="wikitable sortable">
 <tbody>
  <tr>
   <th>
    Postcode
   </th>
   <th>
    Borough
   </th>
   <th>
    Neighbourhood
   </th>
  </tr>
  <tr>
   <td>
    M1A
   </td>
   <td>
    Not assigned
   </td>
   <td>
    Not assigned
   </td>
  </tr>
  <tr>
   <td>
    M2A
   </td>
   <td>
    Not assigned
   </td>
   <td>
    Not assigned
   </td>
  </tr>
  <tr>
   <td>
    M3A
   </td>
   <td>
    <a href="/wiki/North_York" title="North York">
     North York
    </a>
   </td>
   <td>
    <a href="/wiki/Parkwoods" title="Parkwoods">
     Parkwoods
    </a>
   </td>
  </tr>
  <tr>
   <td>
    M4A
   </td>
   <td>
    <a href="/wiki/North_York" title="North York">
     North York
    </a>
   </td>
   <td>
    <a href="/wiki/Victoria_Village" title="Victoria Village">
     Victoria Village
    </a>
   </td>
  </tr>
  <tr>
   <td>
    M5A
   </td>
   <td>
    <a href="/wiki/Downtown_Toronto" title="Downtown Toronto">
     Downtown Toronto
    </a>
   </td>
   <td>
    <a href="

The HTML table is parsed to create a pandas dataframe that:
* consists of three columns: PostalCode, Borough, and Neighborhood
* excludes table rows whose borough is Not assigned
* uses the borough name as the neighborhood when a row has a borough but a Not assigned neighborhood
* lists all neighborhoods of one postal code area in a single row, separated by commas in the
  Neighborhood column
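
As a quick illustration of these rules, independent of the scraping step, here is a minimal sketch in plain Python. The sample rows are hand-written for the example: the `M9Z` / `Example Borough` row is hypothetical, and the helper name `clean_rows` is an assumption, not part of the notebook.

```python
# Hand-written sample rows: (postal code, borough, neighborhood).
# The "M9Z" row is hypothetical and only exists to exercise the fallback rule.
sample_rows = [
    ("M1A", "Not assigned", "Not assigned"),
    ("M3A", "North York", "Parkwoods"),
    ("M1B", "Scarborough", "Rouge"),
    ("M1B", "Scarborough", "Malvern"),
    ("M9Z", "Example Borough", "Not assigned"),
]


def clean_rows(rows):
    """Apply the three cleaning rules and return one tuple per postal code."""
    grouped = {}
    for postal_code, borough, neighborhood in rows:
        # Rule 1: drop rows without an assigned borough.
        if borough == "Not assigned":
            continue
        # Rule 2: fall back to the borough name when the neighborhood is missing.
        if neighborhood == "Not assigned":
            neighborhood = borough
        # Rule 3: collect neighborhoods that share a postal code and borough.
        grouped.setdefault((postal_code, borough), []).append(neighborhood)
    return [(code, borough, ",".join(names))
            for (code, borough), names in grouped.items()]


cleaned = clean_rows(sample_rows)
for row in cleaned:
    print(row)
```

The dictionary keeps insertion order, so the rows come out in the order their postal codes first appear; the pandas `groupby` used in the notebook sorts by postal code instead.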

In [18]:
#collect the cleaned rows in a list first (DataFrame.append was removed in pandas 2.0)
columns = ['PostalCode', 'Borough', 'Neighborhood']
rows = []

#get the table rows from the parsed html element
data_table_rows = data_table.tbody.find_all('tr')

#fill in the row list, skipping the header row
for data_table_row in data_table_rows[1:]:
    data_table_row_columns = data_table_row.find_all('td')

    #strip before comparing, since cell text may carry surrounding whitespace
    postal_code = data_table_row_columns[0].text.strip()
    borough = data_table_row_columns[1].text.strip()
    neighborhood = data_table_row_columns[2].text.strip()

    if borough == 'Not assigned':
        continue
    if neighborhood == 'Not assigned':
        neighborhood = borough

    rows.append({'PostalCode': postal_code, 'Borough': borough, 'Neighborhood': neighborhood})

#build the dataframe in one step from the collected rows
neighborhoods_df = pd.DataFrame(rows, columns = columns)

#put the neighborhoods with the same postal code in the same row
#separated by commas
neighborhoods_df_final = neighborhoods_df.groupby(['PostalCode', 'Borough'])['Neighborhood'].apply(','.join).reset_index()
print(neighborhoods_df_final.head(12))
neighborhoods_df_final.shape

   PostalCode      Borough                                       Neighborhood
0         M1B  Scarborough                                      Rouge,Malvern
1         M1C  Scarborough               Highland Creek,Rouge Hill,Port Union
2         M1E  Scarborough                    Guildwood,Morningside,West Hill
3         M1G  Scarborough                                             Woburn
4         M1H  Scarborough                                          Cedarbrae
5         M1J  Scarborough                                Scarborough Village
6         M1K  Scarborough          East Birchmount Park,Ionview,Kennedy Park
7         M1L  Scarborough                      Clairlea,Golden Mile,Oakridge
8         M1M  Scarborough      Cliffcrest,Cliffside,Scarborough Village West
9         M1N  Scarborough                         Birch Cliff,Cliffside West
10        M1P  Scarborough  Dorset Park,Scarborough Town Centre,Wexford He...
11        M1R  Scarborough                                   Mar

(103, 3)

## 2. A dataframe with the postal code, borough name, and neighborhood name of each neighborhood has now been built. To use the Foursquare location data, the latitude and longitude coordinates of each neighborhood are needed.
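
One possible way to attach coordinates, sketched here under assumptions, is to join the dataframe with a separate table keyed by postal code. The `coordinates` table below is hypothetical and its latitude and longitude values are rounded placeholders for illustration only, not authoritative data; in practice they would come from a geocoding service or a prepared coordinates file.

```python
import pandas as pd

# Two rows in the same shape as the dataframe built in section 1.
neighborhoods = pd.DataFrame({
    "PostalCode": ["M1B", "M1C"],
    "Borough": ["Scarborough", "Scarborough"],
    "Neighborhood": ["Rouge,Malvern", "Highland Creek,Rouge Hill,Port Union"],
})

# Hypothetical coordinates table; the numbers are placeholders, not real data.
coordinates = pd.DataFrame({
    "PostalCode": ["M1C", "M1B"],
    "Latitude": [43.78, 43.81],
    "Longitude": [-79.16, -79.19],
})

# A left join keeps every neighborhood row even if no coordinates are found;
# validate raises an error if a postal code appears more than once on either side.
located = neighborhoods.merge(
    coordinates, on="PostalCode", how="left", validate="one_to_one"
)

# Rows that did not receive coordinates would show up here.
missing = located[located["Latitude"].isna()]

print(located)
print("rows without coordinates:", len(missing))
```

The left join is a deliberate choice in this sketch: an inner join would silently drop postal codes that lack coordinates, whereas the `missing` frame makes such gaps visible.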

## 3. Explore and cluster the neighborhoods in Toronto