# Segmenting and Clustering Neighborhoods in Toronto -- 3

### Scrape Wikipedia for data to build the dataframe

The dataframe will contain three columns: 

    - PostalCode
    - Borough
    - Neighborhood

1) Only process the cells that have an assigned borough. Ignore cells with a borough that is **Not assigned**.

2) If a cell has a borough but a **Not assigned** neighborhood, then the neighborhood will be the same as the borough.

3) More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row, with the neighborhoods separated by a comma, as shown in row 11 in the above table.



- Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.

- In the last cell of your notebook, use the `.shape` attribute to print the number of rows of your dataframe.

#### **Submit link when finished**

---

#### Import libraries

In [3]:
from bs4 import BeautifulSoup  # parses HTML for web scraping
import requests  # downloads the web page
import pandas as pd
import numpy as np

Use the `requests` library to download the webpage. Save the text of the response as a variable named `html_data`.

In [4]:
url = "https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&direction=prev&oldid=946126446"

html_data  = requests.get(url).text 

Parse the HTML data using `BeautifulSoup`.

In [5]:
soup = BeautifulSoup(html_data,"html5lib")  # create a soup object using the variable 'html_data'

Check the title to confirm that the correct webpage was downloaded

In [6]:
print(soup.title)

<title>List of postal codes of Canada: M - Wikipedia</title>


Using BeautifulSoup, extract the table and store it in a dataframe. The dataframe should have the columns **PostalCode**, **Borough**, and **Neighborhood**. Fill in each variable with the correct data from the list `col`.

Hint: Print the `col` list to see what data to use


In [7]:
# print(soup.find("table",{"class":"wikitable sortable"}).find("tbody").find_all("tr"))

# Collect one dict per table row. DataFrame.append was removed in pandas 2.0,
# so the dataframe is built once from the collected rows.
rows = []
for row in soup.find("table", {"class": "wikitable sortable"}).find("tbody").find_all("tr"):
    col = row.find_all("td")
    if col:  # header rows contain only <th> cells and are skipped
        rows.append({"PostalCode": col[0].text.strip(),
                     "Borough": col[1].text.strip(),
                     "Neighborhood": col[2].text.strip()})

toronto_data = pd.DataFrame(rows, columns=["PostalCode", "Borough", "Neighborhood"])


Check the dataframe for a quick summary

In [8]:
toronto_data.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


In [9]:
toronto_data.describe()

Unnamed: 0,PostalCode,Borough,Neighborhood
count,287,287,287
unique,180,11,209
top,M9V,Not assigned,Not assigned
freq,8,77,77


In [10]:
toronto_data.shape

(287, 3)

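Duplicate the dataframe so the original is kept as a reference

In [11]:
toronto_data_new = toronto_data.copy()  # .copy() creates an independent dataframe; plain assignment would only alias the original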

---

### 1) Ignore cells with a borough that is **Not assigned**.

In [12]:
toronto_data_new["Borough"] = toronto_data_new["Borough"].replace("Not assigned", np.nan)

toronto_data_new.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,,Not assigned
1,M2A,,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


Let's count the missing values in each column

In [13]:
missing_data = toronto_data_new.isnull()
missing_data.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,False,True,False
1,False,True,False
2,False,False,False
3,False,False,False
4,False,False,False


In [14]:
for column in missing_data.columns.values.tolist():
    print(column)
    print (missing_data[column].value_counts())
    print("")  

PostalCode
False    287
Name: PostalCode, dtype: int64

Borough
False    210
True      77
Name: Borough, dtype: int64

Neighborhood
False    287
Name: Neighborhood, dtype: int64



"True" represents a missing value, "False"  means the value is present in the dataset.
We can see that **Borough** has **77 missing values**.
Let's drop these 77 rows.

In [15]:
toronto_data_new = toronto_data_new.dropna(subset=["Borough"], axis=0)

# Reset the index because rows were dropped
toronto_data_new = toronto_data_new.reset_index(drop=True)

toronto_data_new.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,Lawrence Heights
4,M6A,North York,Lawrence Manor


---

### 2) If a cell has a **borough** but a **Not assigned** neighborhood, then the **neighborhood** will be the same as the **borough**.

In [16]:
toronto_data_new['Neighborhood'] == 'Not assigned'

0      False
1      False
2      False
3      False
4      False
       ...  
205    False
206    False
207    False
208    False
209    False
Name: Neighborhood, Length: 210, dtype: bool

In [17]:
missing_data = toronto_data_new.isnull()
missing_data.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,False,False,False
1,False,False,False
2,False,False,False
3,False,False,False
4,False,False,False


In [18]:
for column in missing_data.columns.values.tolist():
    print(column)
    print (missing_data[column].value_counts())
    print("")  

PostalCode
False    210
Name: PostalCode, dtype: int64

Borough
False    210
Name: Borough, dtype: int64

Neighborhood
False    210
Name: Neighborhood, dtype: int64



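No `Not assigned` values were found in the **Neighborhood** column. Note that `isnull()` only flags `NaN`, so it cannot detect the literal string `Not assigned`; the string comparison `== 'Not assigned'` is the relevant check.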
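
A direct way to confirm this, and to apply rule 2 if such rows ever appear, is to count matches against the string itself and fall back to the borough where needed. The sketch below runs on a small made-up dataframe (`toy` and its rows are hypothetical, not the scraped data):

```python
import pandas as pd

# Hypothetical toy rows; the last one has a borough but no neighborhood.
toy = pd.DataFrame({
    "PostalCode": ["M3A", "M4A", "M7A"],
    "Borough": ["North York", "North York", "Queen's Park"],
    "Neighborhood": ["Parkwoods", "Victoria Village", "Not assigned"],
})

# Count rows whose neighborhood is the literal string "Not assigned".
unassigned = toy["Neighborhood"].eq("Not assigned")
print(unassigned.sum())

# Rule 2: where the neighborhood is unassigned, use the borough instead.
toy["Neighborhood"] = toy["Neighborhood"].mask(unassigned, toy["Borough"])
print(toy)
```

A count of zero on the real dataframe would confirm that rule 2 has nothing to change.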

---

### 3) Check if more than one neighborhood can exist in one postal code area.

In [19]:
toronto_data_new["PostalCode"].value_counts()

M8Y    8
M9V    8
M5V    7
M4V    5
M9B    5
      ..
M2R    1
M3N    1
M3A    1
M4H    1
M5N    1
Name: PostalCode, Length: 103, dtype: int64

If the PostalCode is the same, group the rows together

In [20]:
toronto_data_new = toronto_data_new.groupby(['PostalCode', 'Borough'])['Neighborhood'].apply(list)
toronto_data_new = toronto_data_new.sample(frac=1).reset_index()
toronto_data_new['Neighborhood'] = toronto_data_new['Neighborhood'].str.join(', ')
toronto_data_new

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M5L,Downtown Toronto,"Commerce Court, Victoria Hotel"
1,M4L,East Toronto,"The Beaches West, India Bazaar"
2,M6G,Downtown Toronto,Christie
3,M5K,Downtown Toronto,"Design Exchange, Toronto Dominion Centre"
4,M5C,Downtown Toronto,St. James Town
...,...,...,...
98,M4M,East Toronto,Studio District
99,M6P,West Toronto,"High Park, The Junction South"
100,M1J,Scarborough,Scarborough Village
101,M2M,North York,"Newtonbrook, Willowdale"
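
The grouping can also be written without the intermediate list and without `sample(frac=1)`, which shuffles the rows and makes the row order differ between runs. A minimal sketch on made-up rows, assuming the same three column names:

```python
import pandas as pd

# Hypothetical rows; M5A appears twice, as in the Wikipedia table.
toy = pd.DataFrame({
    "PostalCode": ["M5A", "M3A", "M5A"],
    "Borough": ["Downtown Toronto", "North York", "Downtown Toronto"],
    "Neighborhood": ["Harbourfront", "Parkwoods", "Regent Park"],
})

# Join the neighborhoods of each (PostalCode, Borough) pair with ", ".
# groupby sorts the keys by default, so the result has a stable order.
grouped = (
    toy.groupby(["PostalCode", "Borough"])["Neighborhood"]
       .agg(", ".join)
       .reset_index()
)
print(grouped)
```

A stable row order keeps later outputs, such as "the first neighborhood in our dataframe", reproducible across runs.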


---

### Final dataframe after data cleaning

In [21]:
toronto_data_new.shape

(103, 3)

---
---

## Download GeoSpatial Dataset

In [22]:
filename = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs_v1/Geospatial_Coordinates.csv"
toronto_data_geo = pd.read_csv(filename)  
toronto_data_geo

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
...,...,...,...
98,M9N,43.706876,-79.518188
99,M9P,43.696319,-79.532242
100,M9R,43.688905,-79.554724
101,M9V,43.739416,-79.588437


In [23]:
toronto_data_geo.rename(columns={'Postal Code':'PostalCode'},inplace=True)

Join the two dataframes: `toronto_data_new` and `toronto_data_geo`

In [24]:
toronto_data_new = toronto_data_new.join(toronto_data_geo.set_index('PostalCode'), on='PostalCode')

In [25]:
toronto_data_new

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M5L,Downtown Toronto,"Commerce Court, Victoria Hotel",43.648198,-79.379817
1,M4L,East Toronto,"The Beaches West, India Bazaar",43.668999,-79.315572
2,M6G,Downtown Toronto,Christie,43.669542,-79.422564
3,M5K,Downtown Toronto,"Design Exchange, Toronto Dominion Centre",43.647177,-79.381576
4,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
...,...,...,...,...,...
98,M4M,East Toronto,Studio District,43.659526,-79.340923
99,M6P,West Toronto,"High Park, The Junction South",43.661608,-79.464763
100,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
101,M2M,North York,"Newtonbrook, Willowdale",43.789053,-79.408493
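
Because this is a left join on `PostalCode`, a postal code missing from the coordinates file would silently receive `NaN` coordinates. One way to guard against that is `merge` with `validate` plus an explicit missing-value check. The sketch uses two rows copied from the output above and one hypothetical code, `M0X`, that has no coordinates:

```python
import pandas as pd

# Two real rows from the outputs above, plus a hypothetical code "M0X"
# that has no entry in the coordinates table.
hoods = pd.DataFrame({"PostalCode": ["M1B", "M1C", "M0X"],
                      "Borough": ["Scarborough", "Scarborough", "Hypothetical"]})
geo = pd.DataFrame({"PostalCode": ["M1B", "M1C"],
                    "Latitude": [43.806686, 43.784535],
                    "Longitude": [-79.194353, -79.160497]})

# validate="one_to_one" raises MergeError if a postal code repeats on either side.
merged = hoods.merge(geo, on="PostalCode", how="left", validate="one_to_one")

# Rows that found no match keep NaN coordinates; list them explicitly.
unmatched = merged.loc[merged["Latitude"].isna(), "PostalCode"].tolist()
print(unmatched)
```

An empty `unmatched` list on the real data would mean every postal code received coordinates.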


---
---

### Now that we have the dataset, let's do some exploring

In [26]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

from sklearn.cluster import KMeans
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import folium # map rendering library

In [27]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(toronto_data_new['Borough'].unique()),
        toronto_data_new.shape[0]
    )
)

The dataframe has 10 boroughs and 103 neighborhoods.


#### Use geopy library to get the latitude and longitude values of Toronto.

In [28]:
address = 'Toronto, CAN'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.7793879, -79.3046089.


#### Create a map of Toronto with neighborhoods superimposed on top.

In [29]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(toronto_data_new['Latitude'], toronto_data_new['Longitude'], toronto_data_new['Borough'], toronto_data_new['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

Let's take a closer look at `Scarborough`

In [30]:
scarborough_data = toronto_data_new[toronto_data_new['Borough'] == 'Scarborough'].reset_index(drop=True)
scarborough_data.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1S,Scarborough,Agincourt,43.7942,-79.262029
2,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
3,M1T,Scarborough,"Clarks Corners, Sullivan, Tam O'Shanter",43.781638,-79.304302
4,M1R,Scarborough,"Maryvale, Wexford",43.750072,-79.295849


In [31]:
address = 'Scarborough, Toronto'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Scarborough are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Scarborough are 43.7729744, -79.2576479.


In [32]:
# create map of Scarborough using latitude and longitude values
map_scarborough = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(scarborough_data['Latitude'], scarborough_data['Longitude'], scarborough_data['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_scarborough)  
    
map_scarborough

---

### Let's use the Foursquare API to explore the neighborhoods and segment them

#### Define Foursquare Credentials and Version

In [33]:
CLIENT_ID = '' # your Foursquare ID
CLIENT_SECRET = '' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: 
CLIENT_SECRET:


#### Let's explore the first neighborhood in our dataframe.

Get the neighborhood's name.

In [34]:
scarborough_data.loc[0, 'Neighborhood']

'Rouge, Malvern'

Get the neighborhood's latitude and longitude values.

In [35]:
neighborhood_latitude = scarborough_data.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = scarborough_data.loc[0, 'Longitude'] # neighborhood longitude value

neighborhood_name = scarborough_data.loc[0, 'Neighborhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Rouge, Malvern are 43.806686299999996, -79.19435340000001.


#### Now, let's get the top 100 venues within a radius of 500 meters of this neighborhood.

In [36]:
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius

url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)
url # display URL

'https://api.foursquare.com/v2/venues/explore?&client_id=&client_secret=&v=20180605&ll=43.806686299999996,-79.19435340000001&radius=500&limit=100'
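
String formatting works for these values, but query parameters are safer to assemble with a URL encoder, which escapes special characters. A minimal sketch using only the standard library: the function name `build_explore_url` and the placeholder credentials are assumptions, while the endpoint and parameter names are the ones used in the cell above.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Return the venues/explore URL with encoded query parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    # safe="," keeps the comma in "lat,lng" readable instead of encoding it as %2C.
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params, safe=",")

# Placeholder credentials, for illustration only.
example_url = build_explore_url("MY_ID", "MY_SECRET", "20180605",
                                43.806686, -79.194353, 500, 100)
print(example_url)
```

Passing the same dictionary as `params=` to `requests.get` is another option, since `requests` encodes query parameters itself.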

Send the GET request and examine the results

In [35]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '608500deb9979070d120fea7'},
 'response': {'headerLocation': 'Birch Cliff',
  'headerFullLocation': 'Birch Cliff, Toronto',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 4,
  'suggestedBounds': {'ne': {'lat': 43.697157004500006,
    'lng': -79.25863612686784},
   'sw': {'lat': 43.6881569955, 'lng': -79.27106007313215}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '56d51743498e60da346470e2',
       'name': 'The Birchcliff',
       'location': {'address': '1666 Kingston Road',
        'crossStreet': 'Birchcliff Avenue',
        'lat': 43.69166644406541,
        'lng': -79.26453158481682,
        'labeledLatLngs': [{'label': 'display',
          'lat': 43.69166644406541,
          'lng': -79.26453158481682}],
       

The `get_category_type` function below comes from the Foursquare lab.

In [36]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:  # flattened rows use the 'venue.categories' column instead
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']
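
To see what the function does, it helps to call it on hand-made rows. The sketch below redefines an equivalent helper (catching `KeyError` rather than using a bare `except`) and uses plain dictionaries as stand-ins for dataframe rows; the sample values are illustrative.

```python
def get_category_type(row):
    """Return the name of the first category of a venue row, or None."""
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']

    if len(categories_list) == 0:
        return None
    return categories_list[0]['name']

# Plain dicts standing in for dataframe rows (illustrative values).
flat_row = {'venue.categories': [{'name': 'Café'}, {'name': 'Bakery'}]}
bare_row = {'categories': []}

print(get_category_type(flat_row))
print(get_category_type(bare_row))
```

Only the first category is kept, so a venue listed under several categories is represented by its first one.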

Now we are ready to clean the json and structure it into a pandas dataframe.

In [37]:
venues = results['response']['groups'][0]['items']
    
# nearby_venues = json_normalize(venues) # flatten JSON
nearby_venues = pd.json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,The Birchcliff,Café,43.691666,-79.264532
1,Birchmount Community Centre,General Entertainment,43.695175,-79.262161
2,Scarborough Gardens,Skating Rink,43.694647,-79.26223
3,Birchmount Stadium,College Stadium,43.695323,-79.261293


In [38]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

4 venues were returned by Foursquare.


### Explore Neighborhoods in Scarborough

#### Let's create a function to repeat the same process to all the neighborhoods in Scarborough

In [39]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
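
One fragile spot in `getNearbyVenues` is `v['venue']['categories'][0]`, which raises `IndexError` for a venue that has no categories. A possible guard is sketched below as a separate pure function so it can be tried without calling the API; `flatten_items` and the sample items are hypothetical.

```python
def flatten_items(name, lat, lng, items):
    """Turn Foursquare 'items' into flat tuples, tolerating missing categories."""
    rows = []
    for v in items:
        venue = v['venue']
        categories = venue.get('categories') or []
        # Fall back to None when the venue has no category at all.
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng,
                     venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     category))
    return rows

# Hypothetical response items: one venue with a category, one without.
sample_items = [
    {'venue': {'name': 'A', 'location': {'lat': 1.0, 'lng': 2.0},
               'categories': [{'name': 'Park'}]}},
    {'venue': {'name': 'B', 'location': {'lat': 3.0, 'lng': 4.0},
               'categories': []}},
]
flat = flatten_items('Toy Hood', 0.0, 0.0, sample_items)
print(flat)
```

The tuples have the same seven fields, in the same order, as the columns assigned in `getNearbyVenues`.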

Create a new dataframe called `scarborough_venues`

In [40]:
scarborough_venues = getNearbyVenues(names = scarborough_data['Neighborhood'],
                                   latitudes = scarborough_data['Latitude'],
                                   longitudes = scarborough_data['Longitude'])

Birch Cliff, Cliffside West
Scarborough Village
East Birchmount Park, Ionview, Kennedy Park
L'Amoreaux West
Rouge, Malvern
Maryvale, Wexford
Cliffcrest, Cliffside, Scarborough Village West
Agincourt
Woburn
Highland Creek, Rouge Hill, Port Union
Clairlea, Golden Mile, Oakridge
Clarks Corners, Sullivan, Tam O'Shanter
Upper Rouge
Cedarbrae
Agincourt North, L'Amoreaux East, Milliken, Steeles East
Guildwood, Morningside, West Hill
Dorset Park, Scarborough Town Centre, Wexford Heights


#### Let's check the size of the resulting dataframe


In [41]:
print(scarborough_venues.shape)
scarborough_venues.head()

(97, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Birch Cliff, Cliffside West",43.692657,-79.264848,The Birchcliff,43.691666,-79.264532,Café
1,"Birch Cliff, Cliffside West",43.692657,-79.264848,Birchmount Community Centre,43.695175,-79.262161,General Entertainment
2,"Birch Cliff, Cliffside West",43.692657,-79.264848,Scarborough Gardens,43.694647,-79.26223,Skating Rink
3,"Birch Cliff, Cliffside West",43.692657,-79.264848,Birchmount Stadium,43.695323,-79.261293,College Stadium
4,Scarborough Village,43.744734,-79.239476,McCowan Park,43.745089,-79.239336,Playground


Let's check how many venues were returned for each neighborhood


In [42]:
scarborough_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Agincourt,4,4,4,4,4,4
"Agincourt North, L'Amoreaux East, Milliken, Steeles East",3,3,3,3,3,3
"Birch Cliff, Cliffside West",4,4,4,4,4,4
Cedarbrae,9,9,9,9,9,9
"Clairlea, Golden Mile, Oakridge",10,10,10,10,10,10
"Clarks Corners, Sullivan, Tam O'Shanter",14,14,14,14,14,14
"Cliffcrest, Cliffside, Scarborough Village West",3,3,3,3,3,3
"Dorset Park, Scarborough Town Centre, Wexford Heights",6,6,6,6,6,6
"East Birchmount Park, Ionview, Kennedy Park",7,7,7,7,7,7
"Guildwood, Morningside, West Hill",9,9,9,9,9,9


Let's find out how many unique categories can be curated from all the returned venues

In [43]:
print('There are {} unique categories.'.format(len(scarborough_venues['Venue Category'].unique())))

There are 55 unique categories.


## Analyze Each Neighborhood


In [44]:
# one hot encoding
scarborough_onehot = pd.get_dummies(scarborough_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
scarborough_onehot['Neighborhood'] = scarborough_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [scarborough_onehot.columns[-1]] + list(scarborough_onehot.columns[:-1])
scarborough_onehot = scarborough_onehot[fixed_columns]

scarborough_onehot.head()

Unnamed: 0,Neighborhood,American Restaurant,Athletics & Sports,Auto Garage,Bakery,Bank,Bar,Breakfast Spot,Bus Line,Bus Station,...,Playground,Rental Car Location,Restaurant,Sandwich Place,Shopping Mall,Skating Rink,Smoke Shop,Soccer Field,Thai Restaurant,Vietnamese Restaurant
0,"Birch Cliff, Cliffside West",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"Birch Cliff, Cliffside West",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"Birch Cliff, Cliffside West",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
3,"Birch Cliff, Cliffside West",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Scarborough Village,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0


In [45]:
scarborough_onehot.shape

(97, 56)

In [46]:
scarborough_grouped = scarborough_onehot.groupby('Neighborhood').mean().reset_index()
scarborough_grouped

Unnamed: 0,Neighborhood,American Restaurant,Athletics & Sports,Auto Garage,Bakery,Bank,Bar,Breakfast Spot,Bus Line,Bus Station,...,Playground,Rental Car Location,Restaurant,Sandwich Place,Shopping Mall,Skating Rink,Smoke Shop,Soccer Field,Thai Restaurant,Vietnamese Restaurant
0,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0
1,"Agincourt North, L'Amoreaux East, Milliken, St...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Birch Cliff, Cliffside West",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0
3,Cedarbrae,0.0,0.111111,0.0,0.111111,0.111111,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.111111,0.0
4,"Clairlea, Golden Mile, Oakridge",0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.2,0.1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1,0.0,0.0
5,"Clarks Corners, Sullivan, Tam O'Shanter",0.0,0.0,0.0,0.0,0.071429,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.071429,0.0,0.0,0.0,0.071429,0.0
6,"Cliffcrest, Cliffside, Scarborough Village West",0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.333333,0.0,0.0,0.0,0.0
7,"Dorset Park, Scarborough Town Centre, Wexford ...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667
8,"East Birchmount Park, Ionview, Kennedy Park",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.142857,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,"Guildwood, Morningside, West Hill",0.0,0.0,0.0,0.0,0.111111,0.0,0.111111,0.0,0.0,...,0.0,0.111111,0.111111,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [47]:
scarborough_grouped.shape

(16, 56)
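
Taking the mean of one-hot columns per neighborhood gives, for each category, the share of that neighborhood's venues that fall in it. The same table can be produced in one step with `pd.crosstab(..., normalize='index')`. A minimal sketch on made-up venues (the neighborhood names and categories are hypothetical):

```python
import pandas as pd

# Hypothetical venues: three in "Hood A", one in "Hood B".
venues = pd.DataFrame({
    "Neighborhood": ["Hood A", "Hood A", "Hood A", "Hood B"],
    "Venue Category": ["Park", "Park", "Café", "Bank"],
})

# Route 1: one-hot encode, then average per neighborhood (as in the cells above).
onehot = pd.get_dummies(venues["Venue Category"]).astype(float)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])
by_mean = onehot.groupby("Neighborhood").mean()

# Route 2: cross-tabulate and normalise each row so it sums to 1.
by_crosstab = pd.crosstab(venues["Neighborhood"], venues["Venue Category"],
                          normalize="index")

print(by_mean)
print(by_crosstab)
```

Both routes yield one row per neighborhood and one column per category, with each row summing to 1.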

#### Let's print each neighborhood along with the top 5 most common venues


In [48]:
num_top_venues = 5

for hood in scarborough_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = scarborough_grouped[scarborough_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Agincourt----
                       venue  freq
0  Latin American Restaurant  0.25
1               Skating Rink  0.25
2             Breakfast Spot  0.25
3                     Lounge  0.25
4        American Restaurant  0.00


----Agincourt North, L'Amoreaux East, Milliken, Steeles East----
                 venue  freq
0           Playground  0.33
1         Intersection  0.33
2                 Park  0.33
3  American Restaurant  0.00
4         Noodle House  0.00


----Birch Cliff, Cliffside West----
                   venue  freq
0           Skating Rink  0.25
1  General Entertainment  0.25
2                   Café  0.25
3        College Stadium  0.25
4    American Restaurant  0.00


----Cedarbrae----
                venue  freq
0     Thai Restaurant  0.11
1              Bakery  0.11
2                Bank  0.11
3              Lounge  0.11
4  Athletics & Sports  0.11


----Clairlea, Golden Mile, Oakridge----
           venue  freq
0         Bakery   0.2
1       Bus Line   0.2
2    Bus

#### Let's put that into a pandas dataframe

First, let's write a function to sort the venues in descending order.

In [49]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Now let's create the new dataframe and display the top 10 venues for each neighborhood.

In [50]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:  # positions beyond the third use the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = scarborough_grouped['Neighborhood']

for ind in np.arange(scarborough_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(scarborough_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Agincourt,Skating Rink,Breakfast Spot,Latin American Restaurant,Lounge,Vietnamese Restaurant,Construction & Landscaping,General Entertainment,Gas Station,Fried Chicken Joint,Fast Food Restaurant
1,"Agincourt North, L'Amoreaux East, Milliken, St...",Intersection,Playground,Park,Vietnamese Restaurant,College Stadium,General Entertainment,Gas Station,Fried Chicken Joint,Fast Food Restaurant,Electronics Store
2,"Birch Cliff, Cliffside West",College Stadium,General Entertainment,Skating Rink,Café,Vietnamese Restaurant,Hakka Restaurant,Gas Station,Fried Chicken Joint,Fast Food Restaurant,Electronics Store
3,Cedarbrae,Thai Restaurant,Athletics & Sports,Hakka Restaurant,Bakery,Bank,Gas Station,Fried Chicken Joint,Caribbean Restaurant,Lounge,Convenience Store
4,"Clairlea, Golden Mile, Oakridge",Bakery,Bus Line,Soccer Field,Ice Cream Shop,Intersection,Metro Station,Bus Station,Park,Vietnamese Restaurant,Donut Shop
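
The column labels above rely on `indicators` having only three entries, so positions 4 and up fall through to `th`. That is fine for ten columns but would label position 21 as `21th`. A small helper that handles the general case, offered as an optional alternative (`ordinal` is a hypothetical name):

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    # Numbers ending in 11, 12 and 13 are irregular: they always take 'th'.
    if 10 <= n % 100 <= 13:
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

num_top_venues = 10
columns = ['Neighborhood'] + [
    '{} Most Common Venue'.format(ordinal(i)) for i in range(1, num_top_venues + 1)
]
print(columns)
```

For `num_top_venues = 10` this produces the same column names as the `try`/`except` loop.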


### Cluster Neighborhoods


In [51]:
# set number of clusters
kclusters = 5

scarborough_grouped_clustering = scarborough_grouped.drop(columns='Neighborhood')  # keyword form; the positional axis argument was removed in pandas 2.0

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(scarborough_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([4, 0, 1, 1, 1, 1, 4, 1, 1, 1], dtype=int32)

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [52]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

scarborough_merged = scarborough_data

# merge the cluster labels and top venues into scarborough_data, which holds latitude/longitude for each neighborhood
scarborough_merged = scarborough_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

scarborough_merged.head() # check the last columns!

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848,1.0,College Stadium,General Entertainment,Skating Rink,Café,Vietnamese Restaurant,Hakka Restaurant,Gas Station,Fried Chicken Joint,Fast Food Restaurant,Electronics Store
1,M1J,Scarborough,Scarborough Village,43.744734,-79.239476,0.0,Jewelry Store,Playground,Vietnamese Restaurant,College Stadium,General Entertainment,Gas Station,Fried Chicken Joint,Fast Food Restaurant,Electronics Store,Donut Shop
2,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",43.727929,-79.262029,1.0,Coffee Shop,Bus Station,Discount Store,Department Store,Convenience Store,Chinese Restaurant,Hobby Shop,Breakfast Spot,Bus Line,Hakka Restaurant
3,M1W,Scarborough,L'Amoreaux West,43.799525,-79.318389,1.0,Coffee Shop,Fast Food Restaurant,Electronics Store,Discount Store,Chinese Restaurant,Intersection,Bank,Sandwich Place,Breakfast Spot,Pizza Place
4,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353,3.0,Fast Food Restaurant,Vietnamese Restaurant,College Stadium,Hakka Restaurant,General Entertainment,Gas Station,Fried Chicken Joint,Electronics Store,Donut Shop,Discount Store


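Drop rows with a `NaN` cluster label. These are neighborhoods that have no row in `neighborhoods_venues_sorted`, for example because no venues were returned for them.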

In [57]:
scarborough_merged = scarborough_merged.dropna(subset=["Cluster Labels"], axis=0).reset_index(drop=True)

scarborough_merged['Cluster Labels']

0     1.0
1     0.0
2     1.0
3     1.0
4     3.0
5     1.0
6     4.0
7     4.0
8     2.0
9     1.0
10    1.0
11    1.0
12    1.0
13    0.0
14    1.0
15    1.0
Name: Cluster Labels, dtype: float64
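
The `NaN` labels and the float dtype of the column both come from the left join: a neighborhood with no matching row gets `NaN`, which forces the integer labels to float. A minimal sketch on made-up data that flags unmatched rows with `indicator=True` and casts the surviving labels back to integers (all names and values below are hypothetical):

```python
import pandas as pd

# Hypothetical inputs: "Hood C" has no cluster label.
hoods = pd.DataFrame({"Neighborhood": ["Hood A", "Hood B", "Hood C"]})
labels = pd.DataFrame({"Neighborhood": ["Hood A", "Hood B"],
                       "Cluster Labels": [1, 0]})

# indicator=True adds a "_merge" column saying where each row came from.
merged = hoods.merge(labels, on="Neighborhood", how="left", indicator=True)
print(merged)

# Keep matched rows only, then restore the integer dtype.
clustered = (merged[merged["_merge"] == "both"]
             .drop(columns="_merge")
             .reset_index(drop=True))
clustered["Cluster Labels"] = clustered["Cluster Labels"].astype(int)
print(clustered)
```

With integer labels, the `int(cluster)` conversion in the map cell becomes unnecessary.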

In [56]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(scarborough_merged['Latitude'], scarborough_merged['Longitude'], scarborough_merged['Neighborhood'], scarborough_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)-1],
        fill=True,
        fill_color=rainbow[int(cluster)-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

### Examine Clusters


Now, you can examine each cluster and determine the discriminating venue categories that distinguish each cluster. Based on the defining categories, you can then assign a name to each cluster.

#### Cluster 1

In [58]:
scarborough_merged.loc[scarborough_merged['Cluster Labels'] == 0, scarborough_merged.columns[[1] + list(range(5, scarborough_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,Scarborough,0.0,Jewelry Store,Playground,Vietnamese Restaurant,College Stadium,General Entertainment,Gas Station,Fried Chicken Joint,Fast Food Restaurant,Electronics Store,Donut Shop
13,Scarborough,0.0,Intersection,Playground,Park,Vietnamese Restaurant,College Stadium,General Entertainment,Gas Station,Fried Chicken Joint,Fast Food Restaurant,Electronics Store


#### Cluster 2

In [59]:
scarborough_merged.loc[scarborough_merged['Cluster Labels'] == 1, scarborough_merged.columns[[1] + list(range(5, scarborough_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Scarborough,1.0,College Stadium,General Entertainment,Skating Rink,Café,Vietnamese Restaurant,Hakka Restaurant,Gas Station,Fried Chicken Joint,Fast Food Restaurant,Electronics Store
2,Scarborough,1.0,Coffee Shop,Bus Station,Discount Store,Department Store,Convenience Store,Chinese Restaurant,Hobby Shop,Breakfast Spot,Bus Line,Hakka Restaurant
3,Scarborough,1.0,Coffee Shop,Fast Food Restaurant,Electronics Store,Discount Store,Chinese Restaurant,Intersection,Bank,Sandwich Place,Breakfast Spot,Pizza Place
5,Scarborough,1.0,Middle Eastern Restaurant,Smoke Shop,Auto Garage,Bakery,Sandwich Place,Donut Shop,Construction & Landscaping,Convenience Store,Department Store,Discount Store
9,Scarborough,1.0,Bar,Construction & Landscaping,Vietnamese Restaurant,College Stadium,Hakka Restaurant,General Entertainment,Gas Station,Fried Chicken Joint,Fast Food Restaurant,Electronics Store
10,Scarborough,1.0,Bakery,Bus Line,Soccer Field,Ice Cream Shop,Intersection,Metro Station,Bus Station,Park,Vietnamese Restaurant,Donut Shop
11,Scarborough,1.0,Pizza Place,Pharmacy,Gas Station,Fried Chicken Joint,Thai Restaurant,Fast Food Restaurant,Shopping Mall,Bank,Italian Restaurant,Chinese Restaurant
12,Scarborough,1.0,Thai Restaurant,Athletics & Sports,Hakka Restaurant,Bakery,Bank,Gas Station,Fried Chicken Joint,Caribbean Restaurant,Lounge,Convenience Store
14,Scarborough,1.0,Rental Car Location,Medical Center,Intersection,Bank,Restaurant,Mexican Restaurant,Breakfast Spot,Electronics Store,Donut Shop,Discount Store
15,Scarborough,1.0,Indian Restaurant,Pet Store,Chinese Restaurant,Light Rail Station,Vietnamese Restaurant,Shopping Mall,Coffee Shop,Gas Station,Fried Chicken Joint,Fast Food Restaurant


#### Cluster 3

In [60]:
scarborough_merged.loc[scarborough_merged['Cluster Labels'] == 2, scarborough_merged.columns[[1] + list(range(5, scarborough_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
8,Scarborough,2.0,Coffee Shop,Korean BBQ Restaurant,Vietnamese Restaurant,Construction & Landscaping,Hakka Restaurant,General Entertainment,Gas Station,Fried Chicken Joint,Fast Food Restaurant,Electronics Store


...

---