This is a notebook for the IBM Coursera Course 9, Week 3 assignment.

# Table of Contents

1. [Part I. Get dataframe from the Wikipedia page](#getdf)
2. [Part II. Merge dataframe with the geospatial data](#geo)
3. [Part III. Explore and cluster the neighborhoods in Toronto.](#exp)


## Part I. Get dataframe from the Wikipedia page <a name='getdf'></a>

Use the notebook to scrape the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M ,
to obtain the table of postal codes and transform it into a pandas dataframe.

### The dataframe will consist of three columns: Postal Code, Borough, and Neighborhood

In [1]:
import pandas as pd
WIKI_URL = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
# pandas.read_html:
# Read HTML tables into a list of DataFrame objects.
dfs = pd.read_html(WIKI_URL)
df = dfs[0]
df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


In [2]:
df.columns

Index(['Postal Code', 'Borough', 'Neighborhood'], dtype='object')

### Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.

### More than one neighborhood can exist in one postal code area. 

The assignment's example is M5A, which has two neighborhoods: Harbourfront and Regent Park. Rows that share a postal code should be combined into one row, with the neighborhoods separated by a comma. In the table scraped here, M5A already appears as a single combined row (row 4 in the table above), so no extra combining step is needed.
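If a scraped table did list one postal code on several rows, the combining step could be sketched as follows. The `sample` dataframe is hypothetical and built by hand for illustration; it is not the scraped data.

```python
import pandas as pd

# Hypothetical sample: one neighborhood per row, M5A listed twice
sample = pd.DataFrame({
    'Postal Code': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighborhood': ['Harbourfront', 'Regent Park', 'Parkwoods'],
})

# Group rows that share a postal code and borough, then join the
# neighborhood names of each group into one comma-separated string
combined = (
    sample.groupby(['Postal Code', 'Borough'], sort=False)['Neighborhood']
          .agg(', '.join)
          .reset_index()
)
print(combined)
```

Passing `sort=False` keeps the groups in their order of first appearance rather than sorting them by postal code.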

In [3]:
# collect the index labels of rows whose borough is 'Not assigned', then drop them
not_assigned_idx = df[df['Borough'] == 'Not assigned'].index
df.drop(not_assigned_idx, inplace=True)
df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [4]:
df[df['Postal Code'] == 'M5A']

Unnamed: 0,Postal Code,Borough,Neighborhood
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


### If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.

In [5]:
df[df['Neighborhood']=='Not assigned']

Unnamed: 0,Postal Code,Borough,Neighborhood
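The query above returns no rows, so nothing needs fixing in this scrape. Had it returned rows, the replacement rule could be sketched like this. The `sample` dataframe is a hypothetical stand-in with one unassigned neighborhood.

```python
import pandas as pd

# Hypothetical sample: the first row has a borough but no neighborhood
sample = pd.DataFrame({
    'Postal Code': ['M7A', 'M3A'],
    'Borough': ["Queen's Park", 'North York'],
    'Neighborhood': ['Not assigned', 'Parkwoods'],
})

# Boolean mask of the rows whose neighborhood is missing
missing = sample['Neighborhood'] == 'Not assigned'

# Copy the borough name into the neighborhood column for those rows only
sample.loc[missing, 'Neighborhood'] = sample.loc[missing, 'Borough']
print(sample)
```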


### In the last cell of your notebook, use the .shape attribute to print the number of rows of your dataframe.

In [6]:
df.shape

(103, 3)

## Part II. <a name='geo'></a>

To use the Foursquare location data, we need the latitude and longitude coordinates of each neighborhood.
The Geocoder package can be very unreliable, so in case you are not able to get the geographical coordinates of the neighborhoods with it, here is a link to a csv file that has the geographical coordinates of each postal code: http://cocl.us/Geospatial_data

In [7]:
# !wget -O geo_data.csv http://cocl.us/Geospatial_data

In [8]:
geo = pd.read_csv('geo_data.csv')
geo.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [9]:
geo.shape

(103, 3)

In [10]:
# join on the shared 'Postal Code' column (inner join by default)
df_geo = pd.merge(df, geo, on='Postal Code')

In [11]:
df_geo.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
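An inner merge silently drops postal codes that have no match in the coordinate file. As a hedged sketch, the `indicator` and `validate` options of `pd.merge` can make such gaps visible. The two small dataframes and the postal code `M9Z` are hypothetical and exist only to show an unmatched row.

```python
import pandas as pd

# Hypothetical inputs: 'M9Z' has no coordinates on the right-hand side
left = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A', 'M9Z'],
    'Borough': ['North York', 'North York', 'Etobicoke'],
})
right = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A'],
    'Latitude': [43.753259, 43.725882],
    'Longitude': [-79.329656, -79.315572],
})

# how='left' keeps every row of `left`; indicator=True adds a '_merge' column;
# validate raises an error if a postal code is duplicated on either side
checked = pd.merge(left, right, on='Postal Code', how='left',
                   indicator=True, validate='one_to_one')

# Postal codes that found no coordinates
unmatched = checked.loc[checked['_merge'] == 'left_only', 'Postal Code'].tolist()
print(unmatched)
```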


## Part III. Explore and cluster the neighborhoods in Toronto. <a name='exp'></a>

You can decide to work with only boroughs that contain the word "Toronto" and then replicate the same analysis we did to the New York City data.

Just make sure:

- to add enough Markdown cells to explain what you decided to do and to report any observations you make.
- to generate maps to visualize your neighborhoods and how they cluster together.
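As a rough illustration of the clustering step, the sketch below runs k-means from scikit-learn on coordinates alone. This is a simplifying assumption of the example, not the method the assignment prescribes; the New York analysis it refers to may use richer features. The five points are the first five rows of the merged table in Part II, and `n_clusters=2` is an arbitrary choice.

```python
import numpy as np
from sklearn.cluster import KMeans

# Latitude/longitude of the first five postal codes in the merged table
coords = np.array([
    [43.753259, -79.329656],  # M3A
    [43.725882, -79.315572],  # M4A
    [43.654260, -79.360636],  # M5A
    [43.718518, -79.464763],  # M6A
    [43.662301, -79.389494],  # M7A
])

# k=2 is illustrative only; random_state fixes the initialization
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(coords)

# One cluster label per postal code, in the same order as `coords`
labels = kmeans.labels_
print(labels)
```

The labels can then be attached to the dataframe as a new column and used to color the map markers.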

In [12]:
# !conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

In [13]:
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html
df_tor = df_geo[df_geo['Borough'].str.contains('Toronto')]
df_tor.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
15,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
19,M4E,East Toronto,The Beaches,43.676357,-79.293031


In [14]:
df_tor.shape

(39, 5)

In [15]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(df_tor['Borough'].unique()),
        df_tor.shape[0]
    )
)

The dataframe has 4 boroughs and 39 neighborhoods.
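Note that `df_tor.shape[0]` counts rows, that is, postal codes, and some rows list several neighborhoods. A minimal sketch of counting individual neighborhood names instead, using the five rows shown in the `df_tor.head()` output above (the `sample` name is introduced here for illustration):

```python
import pandas as pd

# The 'Neighborhood' values of the first five rows of df_tor
sample = pd.DataFrame({
    'Postal Code': ['M5A', 'M7A', 'M5B', 'M5C', 'M4E'],
    'Neighborhood': [
        'Regent Park, Harbourfront',
        "Queen's Park, Ontario Provincial Government",
        'Garden District, Ryerson',
        'St. James Town',
        'The Beaches',
    ],
})

# Split each comma-separated string into a list, then give every
# neighborhood name its own row
names = sample['Neighborhood'].str.split(', ').explode()

n_rows = len(sample)
n_neighborhoods = names.nunique()
print(n_rows, n_neighborhoods)
```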


In [16]:
cor_Tor = df_tor[df_tor['Postal Code']=='M5G'][['Latitude','Longitude']].values.tolist()

In [17]:
Lat,Long=cor_Tor[0]

In [18]:
print('The geographical coordinates of Downtown Toronto are {}, {}.'.format(Lat, Long))

The geographical coordinates of Downtown Toronto are 43.6579524, -79.3873826.


### Now, let us see how the neighborhoods of Toronto are distributed on the map.

In [19]:
# create map of downtown toronto using latitude and longitude values
map_toronto = folium.Map(location=[Lat, Long], zoom_start=10)

# add markers to map
for lat, lng, borough, nbhd in zip(df_tor['Latitude'], df_tor['Longitude'], df_tor['Borough'], df_tor['Neighborhood']):
    label = '{}, {}'.format(nbhd, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius = 5,
        popup  = label,
        color  ='blue',
        fill   = True,
        fill_color   = '#3186cc',
        fill_opacity = 0.7,
        parse_html   = False).add_to(map_toronto)  
    
map_toronto

In [20]:
east_tor = df_tor[df_tor['Borough'] == 'East Toronto'].reset_index(drop=True)
east_tor.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M4E,East Toronto,The Beaches,43.676357,-79.293031
1,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188
2,M4L,East Toronto,"India Bazaar, The Beaches West",43.668999,-79.315572
3,M4M,East Toronto,Studio District,43.659526,-79.340923
4,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",43.662744,-79.321558


In [21]:
east_tor.shape

(5, 5)

In [22]:
# use postal code M4E as the reference point for East Toronto
cor_eastTor = east_tor[east_tor['Postal Code']=='M4E'][['Latitude','Longitude']].values.tolist()
Lat_east, Long_east = cor_eastTor[0]
print('The geographical coordinates of East Toronto are {}, {}.'.format(Lat_east, Long_east))

The geographical coordinates of East Toronto are 43.67635739999999, -79.2930312.
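Picking one postal code as the reference point is simple, but it can leave the map off-center. One alternative, assumed here for illustration and not part of the assignment, is to center the map on the mean of the coordinates. The values are the five East Toronto rows shown above.

```python
import pandas as pd

# Coordinates of the five East Toronto postal codes from the table above
east = pd.DataFrame({
    'Postal Code': ['M4E', 'M4K', 'M4L', 'M4M', 'M7Y'],
    'Latitude': [43.676357, 43.679557, 43.668999, 43.659526, 43.662744],
    'Longitude': [-79.293031, -79.352188, -79.315572, -79.340923, -79.321558],
})

# Arithmetic mean of each column; adequate over an area this small
center_lat = east['Latitude'].mean()
center_lng = east['Longitude'].mean()
print(center_lat, center_lng)
```

The resulting pair can be passed as `location=[center_lat, center_lng]` when creating the folium map.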


### Now, let us see how the neighborhoods of *East Toronto* are distributed on the map.

In [23]:
# create map using latitude and longitude values
# (use east_tor and Lat_east/Long_east directly so that df, Lat and Long are not overwritten)
map_torontoeast = folium.Map(location=[Lat_east, Long_east], zoom_start=10)

# add markers to map
for lat, lng, borough, nbhd in zip(east_tor['Latitude'], east_tor['Longitude'], east_tor['Borough'], east_tor['Neighborhood']):
    label = '{}, {}'.format(nbhd, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius = 5,
        popup  = label,
        color  ='blue',
        fill   = True,
        fill_color   = '#3186cc',
        fill_opacity = 0.7,
        parse_html   = False).add_to(map_torontoeast)  
    
map_torontoeast