<h1> Applied Data Science Capstone</h1>

<h1>Capstone Project - The Battle of Neighborhoods</h1>

<h2>1. Obtain the data </h2>
<h3>1.1 Data used to define the neighborhoods from Wikipedia </h3>

<h3> Preparation:</h3>
<h4>Do the software installs for Part 1 - only once per launch of notebook</h4>

In [None]:
!conda install -c conda-forge beautifulsoup4 --yes
print("INSTALLED BEAUTIFULSOUP4")
!conda install -c conda-forge lxml --yes
print("INSTALLED LXML")
!conda install -c conda-forge requests --yes
print("INSTALLED REQUESTS")

<h3>Do all the imports - with launch or when you have to restart the python kernel</h3>

In [None]:
from bs4 import BeautifulSoup
import requests

from lxml import html

import pandas as pd 
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

print('IMPORTS DONE')


<h3>Import the Canadian Post codes table from the Wikipedia webpage utilising BeautifulSoup</h3>

In [None]:

# raw string, so the backslashes in the Windows path are not read as escape sequences
with open(r'C:\DATA\AI\cournsera\List of postal codes of Canada Wikipedia.html', encoding="utf-8") as f:
    source = f.read()


print("SOURCE")

soup = BeautifulSoup(source,'lxml')

print("SOUP")
#print(soup)
          
post_codes = soup.find('table',class_="wikitable sortable jquery-tablesorter")

print("POST_CODES")
print(post_codes)

print('FOUND THE POST CODE TABLE ON WIKIPEDIA')

<h4>Use BeautifulSoup find_all to extract the rows from the Canadian Postal code table</h4>

In [None]:
table_rows = post_codes.find_all('tr')

<h3>Prepare the empty Pandas Dataframe</h3>

In [None]:
# define the dataframe columns
column_names = ['PostalCode','Borough', 'Neighborhood'] 

# instantiate the dataframe
pd_post_codes = pd.DataFrame(columns=column_names)

print(pd_post_codes.columns)
pd_post_codes


<h3>Populate the Pandas Dataframe</h3>

#### Loop through the table rows and add them to the pd_post_codes dataframe


In [None]:
rows = []
for tr in table_rows:
    cells = [cell.text for cell in tr.find_all('td')]  # text of every <td> in this row
    if len(cells) > 0:
        # Only rows that contain <td> cells will be added
        row_postalcode = cells[0]
        if cells[1] != 'Not assigned':
            # borough that is assigned
            row_borough = cells[1]
            row_neighborhood = cells[2]
            # The trailing "\n" is kept on purpose: a later step turns it into the comma between neighborhoods.
            if row_neighborhood == 'Not assigned\n':
                # keep the trailing "\n" so the later trimming step does not cut off the last letter
                row_neighborhood = row_borough + '\n'
            rows.append({'PostalCode': row_postalcode, 'Borough': row_borough, 'Neighborhood': row_neighborhood})

# DataFrame.append was removed in pandas 2.0, so build the dataframe from the collected rows
pd_post_codes = pd.DataFrame(rows, columns=column_names)

print('The rows should now be in pd_post_codes - albeit with possibly more than one row per post code') 



<h3>Replace the "\n" in the Neighborhood column with a comma so that there will be a comma between the neighbourhoods once concatenated.</h3>

In [None]:
pd_post_codes['Neighborhood'] = pd_post_codes['Neighborhood'].str.replace('\n',',') 

<h3>Combine neighborhoods that share the same postal code and borough</h3>

In [None]:
pd_post_codes = pd_post_codes.groupby(['PostalCode','Borough'],as_index=False)['Neighborhood'].sum()

<h4> Remove the last character - which is the comma of the rightmost neighborhood in the list</h4>

In [None]:
pd_post_codes['Neighborhood']= pd_post_codes['Neighborhood'].str[:-1]
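
The three steps above (replace the newline with a comma, concatenate with `sum()`, trim the trailing comma) could probably be collapsed into a single aggregation. The following is only a minimal sketch on made-up rows, not the notebook's data; `sample` and `combined` are hypothetical names.

```python
import pandas as pd

# Made-up rows for illustration only: two neighborhoods share one postal code.
sample = pd.DataFrame({
    'PostalCode': ['M5A', 'M5A', 'M4A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighborhood': ['Harbourfront\n', 'Regent Park\n', 'Victoria Village\n'],
})

# Strip the trailing newline first, then join the names per postal code and borough.
sample['Neighborhood'] = sample['Neighborhood'].str.strip()
combined = (sample
            .groupby(['PostalCode', 'Borough'], as_index=False)['Neighborhood']
            .agg(', '.join))

print(combined)
```

Joining with a separator avoids having to remove a trailing comma afterwards.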

In [None]:
pd_post_codes.shape

<h3> Read in the csv file that has the geographical coordinates of each postal code: http://cocl.us/Geospatial_data</h3>
    

In [None]:
Postal_codes_coordinates='http://cocl.us/Geospatial_data'
    
pd_Postal_codes_coordinates = pd.read_csv(Postal_codes_coordinates)

<h3>Rename the 'Postal Code' column to 'PostalCode' so that the merge works. (I tried it without renaming first.)</h3>

In [None]:
pd_Postal_codes_coordinates.rename(columns={'Postal Code':'PostalCode'},inplace=True)


<h3>Merge the two dataframes using the PostalCode column to join.</h3>

In [None]:

pd_Postal_codes_part2 = pd.merge(pd_post_codes,
                                 pd_Postal_codes_coordinates[['PostalCode', 'Latitude', 'Longitude']],
                                 on='PostalCode',
                                 how='outer')

pd_Postal_codes_part2.shape
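
Because the merge is an outer join, a postal code that appears in only one of the two tables would come through with missing values. One way to check for that is the `indicator` argument of `pd.merge`, sketched here on made-up frames; `left`, `right`, `checked` and `unmatched` are hypothetical names and the coordinates are invented.

```python
import pandas as pd

# Invented data: 'M9Z' exists only in the left table.
left = pd.DataFrame({'PostalCode': ['M4A', 'M5A', 'M9Z'],
                     'Borough': ['North York', 'Downtown Toronto', 'Nowhere']})
right = pd.DataFrame({'PostalCode': ['M4A', 'M5A'],
                      'Latitude': [43.7, 43.6],
                      'Longitude': [-79.3, -79.4]})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'.
checked = pd.merge(left, right, on='PostalCode', how='outer', indicator=True)
unmatched = checked[checked['_merge'] != 'both']

print(unmatched[['PostalCode', '_merge']])
```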

<h3>1.2. VENUE data from FOURSQUARE.com</h3>

<h3>Preparation</h3>

<h3>Do the installs</h3>


In [None]:
!conda install -c conda-forge geopy --yes 
print('INSTALLED GEOPY')
!conda install -c conda-forge folium=0.5.0 --yes 
print('INSTALLED FOLIUM')
print('INSTALLATION DONE.')


<h3> Do the imports</h3>

In [None]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

from pandas.io.json import json_normalize

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

import json # library to handle JSON files


import folium # map rendering library

print('LIBRARIES IMPORTED.')

A Pandas DataFrame has been created for Toronto that is similar to the one for New York in Part2.

In [None]:
pd_Postal_codes_part2.head()

But we only want to work with Toronto.

In [None]:
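# na=False treats a missing borough (possible after the outer merge) as "no match" instead of raising
pd_Postal_codes_Toronto = pd_Postal_codes_part2[pd_Postal_codes_part2['Borough'].str.contains('Toronto', na=False)]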


In [None]:
print('Now we are only working with {} postal codes in Toronto.'.format(pd_Postal_codes_Toronto.shape[0]))
print('Here are the first 5 rows')
pd_Postal_codes_Toronto.head()


#### Use geopy library to get the latitude and longitude values of Toronto.

In order to define an instance of the geocoder, we need to define a user_agent. We will name our agent <em>ny_explorer</em>, as shown below.

In [None]:
address = 'Toronto, Canada'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

#### Create a map of Toronto with neighborhoods superimposed on top.

In [None]:
# create map of Toronto using latitude and longitude values
map_Toronto = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, borough, neighborhood in zip(pd_Postal_codes_Toronto['Latitude'], pd_Postal_codes_Toronto['Longitude'], pd_Postal_codes_Toronto['Borough'], pd_Postal_codes_Toronto['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=3,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_Toronto)  


map_Toronto

Please feel free to zoom in where the action is using the "+" in the top left corner.

Next, we are going to start utilizing the Foursquare API to explore the neighborhoods and segment them.

#### Define Foursquare Credentials and Version - Hidden

In [None]:
# The code was removed by Watson Studio for sharing.

### Obtain the Venue data from FOURSQUARE.com

#### Get all the venues, using the latitude and longitude of the postal code as the location of the neighborhood

In [None]:
def getNearbyVenues(names, latitudes, longitudes, radius=500, LIMIT=100):
    
    column_names = ['Neighborhood', 'Neighborhood Latitude', 'Neighborhood Longitude', 
                    'Venue', 'Venue Latitude', 'Venue Longitude', 'Venue Category']
    # collect one dict per venue and build the dataframe once at the end
    # (DataFrame.append was removed in pandas 2.0)
    rows = []
    
    # for each Post Code, add all its food venues to rows
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request and get to the relevant part of the JSON data returned from FOURSQUARE.com
        venues = requests.get(url).json()["response"]['groups'][0]['items']
                
        for item in venues:
            venue = item['venue']
            if not venue['categories']:
                continue  # skip venues that have no category
            # the category group is the sixth "/"-separated piece of the icon URL prefix
            icon_prefix = venue['categories'][0]['icon']['prefix']
            row_category_group = icon_prefix.split('/')[5]
            if row_category_group == 'food':
                rows.append({'Neighborhood': name,
                             'Neighborhood Latitude': lat,
                             'Neighborhood Longitude': lng,
                             'Venue': venue['name'],
                             # each venue's own coordinates
                             'Venue Latitude': venue['location']['lat'],
                             'Venue Longitude': venue['location']['lng'],
                             'Venue Category': venue['categories'][0]['name']})
    
    return pd.DataFrame(rows, columns=column_names)
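
The function keeps a venue only when the sixth `/`-separated piece of its icon URL prefix is `food`, which relies on the exact shape of that URL. As a slightly more defensive sketch, the helper below (hypothetical name `category_group`) looks for the path segment that follows `categories_v2` instead of a fixed position. The example prefixes are illustrative strings in the shape the code above expects, not captured API output.

```python
from urllib.parse import urlparse


def category_group(icon_prefix, marker='categories_v2'):
    """Return the path segment that follows `marker`, or None if it is absent."""
    segments = [s for s in urlparse(icon_prefix).path.split('/') if s]
    if marker in segments:
        position = segments.index(marker)
        if position + 1 < len(segments):
            return segments[position + 1]
    return None


# Illustrative prefixes only.
food_prefix = 'https://ss3.4sqi.net/img/categories_v2/food/coffeeshop_'
other_prefix = 'https://ss3.4sqi.net/img/categories_v2/parks_outdoors/park_'

print(category_group(food_prefix))
print(category_group(other_prefix))
print(category_group('https://example.com/no/marker/here'))
```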

#### Run the above function on each neighborhood and create a new dataframe called *toronto_venues*.

In [None]:

toronto_venues = getNearbyVenues(names=pd_Postal_codes_Toronto['Neighborhood'],
                                   latitudes=pd_Postal_codes_Toronto['Latitude'],
                                   longitudes=pd_Postal_codes_Toronto['Longitude']
                                  )
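
`getNearbyVenues` indexes straight into `response`, `groups` and `items`, so a reply without those keys (an error payload, for instance) would raise an exception partway through the loop. A minimal sketch of a more forgiving lookup follows; it uses plain dictionaries that only mimic the nesting used above, `extract_items` is a hypothetical helper, and both payloads are made up.

```python
def extract_items(payload):
    """Return the list under response -> groups[0] -> items, or [] if anything is missing."""
    groups = payload.get('response', {}).get('groups', [])
    if not groups:
        return []
    return groups[0].get('items', [])


# Made-up payloads for illustration.
ok_payload = {'response': {'groups': [{'items': [{'venue': {'name': 'Cafe A'}}]}]}}
error_payload = {'meta': {'code': 429}}

print(len(extract_items(ok_payload)))
print(len(extract_items(error_payload)))
```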



Let's check how many venues were returned for each neighborhood

In [None]:
toronto_venues.groupby('Neighborhood').count()

### Unique categories of food places and their counts

In [None]:
tv_columns = toronto_venues.groupby(['Venue Category'])['Venue'].count()
tv_columns

<h3>Neighborhoods with count of respective food categories</h3>

In [None]:
toronto_venues.groupby(['Neighborhood','Venue Category']).count()

In [None]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

<a id='item3'></a>

## 2. Analyze Each Neighborhood

In [None]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])

toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

And let's examine the new dataframe size.

In [None]:
toronto_onehot.shape

#### Next, let's group rows by neighborhood, taking the mean of the frequency of occurrence of each category

In [None]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped
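
Taking the mean of the one-hot columns per neighborhood gives, for each category, the share of that neighborhood's venues that fall into it. The same arithmetic in plain Python, on invented venue rows (all names here are hypothetical):

```python
from collections import Counter, defaultdict

# (neighborhood, venue category) pairs, invented for illustration.
venues = [
    ('A', 'Coffee Shop'), ('A', 'Coffee Shop'), ('A', 'Bakery'), ('A', 'Pizza Place'),
    ('B', 'Bakery'), ('B', 'Bakery'),
]

counts = defaultdict(Counter)
for neighborhood, category in venues:
    counts[neighborhood][category] += 1

# Share of a category within a neighborhood = its count / all venues in that neighborhood.
frequencies = {
    hood: {cat: n / sum(cats.values()) for cat, n in cats.items()}
    for hood, cats in counts.items()
}

print(frequencies['A'])
print(frequencies['B'])
```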

#### Let's confirm the new size

In [None]:
toronto_grouped.shape

#### Let's print each neighborhood along with the top 5 most common venues

In [None]:
num_top_venues = 5

for hood in toronto_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighborhood'] == hood].T.reset_index()
    #print(temp)
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

#### Let's put that into a *pandas* dataframe

First, let's write a function to sort the venues in descending order.

In [None]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Now let's create the new dataframe and display the top 10 venues for each neighborhood.

In [None]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']


# create columns according to number of top venues
columns = ['Neighborhood']


for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted9 = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted9['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted9.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted9.head()
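
The column labels rely on a three-item suffix list with a fallback to `th`. That works for the 10 columns used here but would produce labels such as `21th` for longer lists. A small sketch of a more general suffix helper (`ordinal` is a hypothetical name, not part of the notebook):

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    if 10 <= n % 100 <= 13:
        suffix = 'th'  # 10th, 11th, 12th, 13th, 111th, ...
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)


labels = ['{} Most Common Venue'.format(ordinal(i)) for i in range(1, 11)]
print(labels)
```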

<a id='item4'></a>

##  3. Cluster the Neighborhoods

Run *k*-means to cluster the neighborhoods into 9 clusters.

In [None]:
# set number of clusters
kclusters = 9

toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [None]:
# add clustering labels

neighborhoods_venues_sorted9.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = pd_Postal_codes_Toronto

# join the sorted venues (with their cluster labels) onto the Toronto postal-code data, which carries latitude/longitude
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted9.set_index('Neighborhood'), on='Neighborhood')

toronto_merged9 = toronto_merged.dropna()

#toronto_merged9=toronto_merged9.astype({'Cluster Labels':int})

toronto_merged9 #.head() # check the last columns!



Change the type of the "Cluster Labels" column from float to int.

In [None]:
toronto_merged9=toronto_merged9.astype({'Cluster Labels':int})


In [None]:
toronto_merged9.head()

Finally, let's visualize the resulting clusters

In [None]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]


# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged9['Latitude'], toronto_merged9['Longitude'], toronto_merged9['Neighborhood'], toronto_merged9['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
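
Note that the markers are coloured with `rainbow[cluster-1]`, so label 0 wraps around to the last colour in the list and every other label takes the colour one position before its own index. A tiny sketch with a stand-in palette (the names `c0` to `c8` are placeholders, not the hex values that `cm.rainbow` returns):

```python
kclusters = 9
# Placeholder palette: 'c0' stands for the first colour, 'c8' for the last.
palette = ['c{}'.format(i) for i in range(kclusters)]

for label in range(kclusters):
    # Same indexing as the map code: a negative index counts from the end of the list.
    print('label', label, '->', palette[label - 1])
```

Every label still gets a distinct colour; only the assignment is shifted by one.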

<a id='item5'></a>

## 4. Examine the Clusters

Now, you can examine each cluster and determine the discriminating venue categories that distinguish each cluster. Based on the defining categories, you can then assign a name to each cluster. I will leave this exercise to you.

I first had 5 clusters, but most of the neighborhoods were in one cluster, which now constitutes Cluster 8. So I increased the number to 9 to see if I could get some more variety.
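
A quick way to see how lopsided a clustering is, is to count the neighborhoods per label. The sketch below uses an invented label list rather than the notebook's results; with the fitted model one would presumably pass `kmeans.labels_.tolist()` instead.

```python
from collections import Counter

# Invented labels for 10 neighborhoods, for illustration only.
example_labels = [1, 1, 1, 1, 1, 1, 4, 4, 0, 8]

sizes = Counter(example_labels)
largest_label, largest_size = sizes.most_common(1)[0]
share = largest_size / len(example_labels)

print(sizes)
print('Largest cluster: label {} holds {:.0%} of the neighborhoods'.format(largest_label, share))
```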

#### Cluster 1 - CN Tower, Bathurst Quay, Island Airport, harbour

(Reddish in the south) Airport, harbour.

In [None]:
toronto_merged9.loc[toronto_merged9['Cluster Labels'] == 0, toronto_merged9.columns[[2] + list(range(5, toronto_merged9.shape[1]))]]
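
Each of the cluster cells uses the same column selection: position 2 plus everything from position 5 onward. Assuming the column order produced by the steps above (postal code, borough, neighborhood, latitude, longitude, then the cluster label and the ten venue columns), that selects the neighborhood name, the label and the venue columns. The list below is written out by hand for illustration, with placeholder names for the venue columns.

```python
# Hand-written column list; 'venue_1' .. 'venue_10' stand in for the ten venue columns.
columns = ['PostalCode', 'Borough', 'Neighborhood', 'Latitude', 'Longitude',
           'Cluster Labels'] + ['venue_{}'.format(i) for i in range(1, 11)]

positions = [2] + list(range(5, len(columns)))
selected = [columns[p] for p in positions]

print(selected)
```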

#### Cluster 2 - Coffee shops, mostly city centre, north-south


(Purple, running north to south, mostly city centre; 18 neighborhoods.)


In [None]:
toronto_merged9.loc[toronto_merged9['Cluster Labels'] == 1, toronto_merged9.columns[[2] + list(range(5, toronto_merged9.shape[1]))]]

#### Cluster 3 - Davisville North

(Medium blue, north on the map)

Northern suburb: a "standard suburb" with a dumpling restaurant and a donut shop. 

In [None]:
toronto_merged9.loc[toronto_merged9['Cluster Labels'] == 2, toronto_merged9.columns[[2] + list(range(5, toronto_merged9.shape[1]))]]

#### Cluster 4 - Forest Hill North, Forest Hill West

(Light blue, in the middle of the map.) 

Northern Suburb - with Dumpling Restaurant and Donut Shop

In [None]:
toronto_merged9.loc[toronto_merged9['Cluster Labels'] == 3, toronto_merged9.columns[[2] + list(range(5, toronto_merged9.shape[1]))]]

#### Cluster 5 -  East-West


(Turquoise, from east to west, 8 neighborhoods.) A mix of everything.


In [None]:
toronto_merged9.loc[toronto_merged9['Cluster Labels'] == 4, toronto_merged9.columns[[2] + list(range(5, toronto_merged9.shape[1]))]]

#### Cluster 6 - Dovercourt Village, Dufferin: the baker and the brewer



(Light turquoise, north-west of the city) 


In [None]:
toronto_merged9.loc[toronto_merged9['Cluster Labels'] == 5, toronto_merged9.columns[[2] + list(range(5, toronto_merged9.shape[1]))]]

#### Cluster 7 - Business Reply Centre


In [None]:
toronto_merged9.loc[toronto_merged9['Cluster Labels'] == 6, toronto_merged9.columns[[2] + list(range(5, toronto_merged9.shape[1]))]]

#### Cluster 8 - High Park, The Junction South

(Faded Orange - to the West of the city) - 4 "proper" restaurants


In [None]:
toronto_merged9.loc[toronto_merged9['Cluster Labels'] == 7, toronto_merged9.columns[[2] + list(range(5, toronto_merged9.shape[1]))]]

#### Cluster 9 - Cafe club, young family


(Burnt orange, 3 neighborhoods.)
In between the city-centre neighborhoods is Christie. It does not look at all like the others, with a grocery store in first position, a candy store in 4th and a baby store in 6th. Add a nightclub to the mix. Young family?

In [None]:
toronto_merged9.loc[toronto_merged9['Cluster Labels'] == 8, toronto_merged9.columns[[2] + list(range(5, toronto_merged9.shape[1]))]]

## 5. Analysis


Two things stand out:
    • The north-to-south grouping of purple-coloured neighborhoods, with 18 neighborhoods (Cluster 2)
    • The “wave” of turquoise-coloured neighborhoods running east to west, with 8 neighborhoods (Cluster 5) and a good mixture of food establishments

The next cluster is the Cafe club, with 3 neighborhoods (Cluster 9), probably because it is around the University.

Then there are 2 clusters with altogether 3 neighborhoods, Clusters 3 and 4, that contain Dumpling Restaurants, Doner and Donut Shops.

Cluster 8 (High Park, The Junction South, west of the city) has 4 "proper" restaurants: Mexican, Thai, Cajun/Creole and Italian. 

Cluster 6 (Dovercourt Village, Dufferin) has a bakery and a brewery as the star attractions.
Cluster 1 consists of the neighbourhoods CN Tower, Bathurst Quay, Island Airport, harbour. I guess mostly “on the run” meals, but I could be completely wrong.

Lastly, Cluster 7 contains a “neighborhood” called “Business Reply Centre”. It is a post code used for mass-mailing. 


To conclude: I would think twice before opening a coffee shop in the north-south grouping of neighborhoods. The "wave" of turquoise-coloured neighborhoods might be a good environment in which to open a food establishment, as there is a great variety of establishments in the 8 neighbourhoods from east to west. 


### References

This notebook is part of a course on **Coursera** called *Applied Data Science Capstone*. If you accessed this notebook outside the course, you can take this course online by clicking [here](http://cocl.us/DP0701EN_Coursera_Week3_LAB2).

<hr>

Copyright &copy; 2018 [Cognitive Class](https://cognitiveclass.ai/?utm_source=bducopyrightlink&utm_medium=dswb&utm_campaign=bdu). This notebook and its source code are released under the terms of the [MIT License](https://bigdatauniversity.com/mit-license/).