# Clustering the Fifty Largest Metropolitan Areas in the World #

1.	Introduction
I run an immigration agency in Paris, France. Many of my clients who want to start a new life ask me which cities are most culturally similar to Paris. I assume that the mix of venue types in a city reflects that city's distinct culture. In this lab, the fifty largest cities are clustered by venue type to find the ones most similar to Paris.

2.	Data
Using the BeautifulSoup library, a list of the largest cities in the world was scraped from Wikipedia (https://en.wikipedia.org/wiki/List_of_largest_cities). Each city was then assigned its latitude and longitude using the Nominatim geocoder from the geopy package. These coordinates were passed to the Foursquare API to retrieve the venue categories in each city. After the data was cleaned, k-means clustering was applied.


Importing Libraries


In [735]:
from bs4 import BeautifulSoup
import pandas as pd
import urllib.request
import geocoder
import requests
from pandas import json_normalize  # replaces the deprecated pandas.io.json.json_normalize
from collections import Counter
import numpy as np
from sklearn.cluster import KMeans
#!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim



Specifying URL

In [736]:
url = 'https://en.wikipedia.org/wiki/List_of_largest_cities'
page = urllib.request.urlopen(url)

Examining HTML

In [737]:
soup = BeautifulSoup(page, "lxml")
soup.prettify()


'<!DOCTYPE html>\n<html class="client-nojs" dir="ltr" lang="en">\n <head>\n  <meta charset="utf-8"/>\n  <title>\n   List of largest cities - Wikipedia\n  </title>\n  <script>\n   document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"dada9cc0-a644-4544-94a0-90fa94b7b827","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_largest_cities","wgTitle":"List of largest cities","wgCurRevisionId":965531905,"wgRevisionId":965531905,"wgArticleId":14649921,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1 Portuguese-language sources (pt)","CS1 uses Chinese-language script (zh)","CS1 Chinese-language sources (zh)",

In [738]:
soup.title.string



'List of largest cities - Wikipedia'

In [739]:
all_tables=soup.find_all("table")


In [740]:
right_table=soup.find('table', class_='wikitable')


Scraping data from the two columns for the 50 largest cities

In [741]:
A=[]
B=[]
x=0

for row in right_table.findAll('tr'):
    cells=row.findAll('td')
    if len(cells)==11 and x<50:
        A.append(cells[0].find(text=True))
        country=cells[1].findAll('a')
        B.append(country[1].contents[0])
        x = x+1
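
The loop above keeps only rows whose number of `<td>` cells matches the expected table width. As a minimal, network-free sketch of that filtering idea, the snippet below parses a tiny inline table with the standard library's `html.parser` instead of BeautifulSoup. The sample HTML, the `RowCollector` class, and the two-cell row width are assumptions made for illustration; they do not reproduce the layout of the Wikipedia page.

```python
from html.parser import HTMLParser

# Toy table (hypothetical); the real page has many more columns.
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th>City</th><th>Country</th></tr>
  <tr><td>Tokyo</td><td>Japan</td></tr>
  <tr><td>Delhi</td><td>India</td></tr>
  <tr><td>Footnote row</td></tr>
</table>
"""

class RowCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being parsed
        self._cell = None  # text fragments of the cell currently being parsed

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

parser = RowCollector()
parser.feed(SAMPLE_HTML)

EXPECTED_CELLS = 2  # assumed row width for this toy table
# Header rows (only <th>) and short rows are dropped by the length check.
cities = [(row[0], row[1]) for row in parser.rows if len(row) == EXPECTED_CELLS]
print(cities)
```

Filtering on the cell count is fragile if the page layout changes, so checking the number of rows collected before moving on is a cheap safeguard.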
        
       

Cleaning data

In [742]:
df=pd.DataFrame(A,columns=['City'])
df['Country'] = B
df['Latitude'] = ''
df['Longitude'] = ''


In [745]:
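# Create the geocoder once and reuse it for every city
geolocator = Nominatim(user_agent="to_explorer")
for i in range(len(df)):
    # geocode takes a query string such as "Tokyo, Japan"
    place = f"{df.loc[i, 'City']}, {df.loc[i, 'Country']}"
    g = geolocator.geocode(place)
    # .loc writes into df directly and avoids chained assignment
    df.loc[i, 'Latitude'] = g.latitude
    df.loc[i, 'Longitude'] = g.longitude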
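
geopy's `geocode` returns `None` when it finds no match, in which case reading `g.latitude` raises an `AttributeError`. The sketch below shows one way to guard against that. To stay runnable offline it uses a stand-in geocoder: `fake_geocode`, `Location`, and `lookup_coordinates` are hypothetical names introduced for this example, not part of geopy.

```python
from collections import namedtuple

# Stand-in for the object a geocoder returns; only the two attributes used here.
Location = namedtuple("Location", ["latitude", "longitude"])

def fake_geocode(query):
    """Hypothetical offline geocoder used only for this example."""
    known = {"Tokyo, Japan": Location(35.6828, 139.759)}
    return known.get(query)  # None when the place is unknown

def lookup_coordinates(geocode, city, country):
    """Return (latitude, longitude), or (None, None) if nothing was found."""
    result = geocode(f"{city}, {country}")
    if result is None:
        return (None, None)
    return (result.latitude, result.longitude)

print(lookup_coordinates(fake_geocode, "Tokyo", "Japan"))
print(lookup_coordinates(fake_geocode, "Atlantis", "Nowhere"))
```

In the notebook, passing `geolocator.geocode` as the first argument would apply the same guard to the real lookups.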
    
    
    
    

In [746]:
df

Unnamed: 0,City,Country,Latitude,Longitude
0,Tokyo,Japan,35.6828,139.759
1,Delhi,India,28.6517,77.2219
2,Shanghai,China,31.2253,121.489
3,São Paulo,Brazil,-23.5507,-46.6334
4,Mexico City,Mexico,19.4326,-99.1332
5,Cairo,Egypt,30.0488,31.2437
6,Mumbai,India,18.9388,72.8353
7,Beijing,China,39.9062,116.391
8,Dhaka,Bangladesh,23.7594,90.3788
9,Osaka,Japan,34.6199,135.49


Specifying credentials

In [747]:
CLIENT_ID = 'M2T1R4ATNABMR4WMRPWHCE1VOCPMJBULEF3YJJSYKUAPTT5N'
CLIENT_SECRET = 'EQLZ3ODCPE2A5DMRO3EESUYXIYSV0FYYEIGE1HOYXWN3XN55'
print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: M2T1R4ATNABMR4WMRPWHCE1VOCPMJBULEF3YJJSYKUAPTT5N
CLIENT_SECRET:EQLZ3ODCPE2A5DMRO3EESUYXIYSV0FYYEIGE1HOYXWN3XN55


In [749]:

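VERSION = '20200701' # Foursquare API version date, as it appears in the URL output below
LIMIT = 500 # maximum number of venues requested from the Foursquare API
radius = 30000 # search radius in meters
# Coordinates of the first city in df (Tokyo), used for a test request
Latitude = df['Latitude'][0]
Longitude = df['Longitude'][0]
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    Latitude, 
    Longitude, 
    radius, 
    LIMIT)
url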

'https://api.foursquare.com/v2/venues/explore?&client_id=M2T1R4ATNABMR4WMRPWHCE1VOCPMJBULEF3YJJSYKUAPTT5N&client_secret=EQLZ3ODCPE2A5DMRO3EESUYXIYSV0FYYEIGE1HOYXWN3XN55&v=20200701&ll=35.6828387,139.7594549&radius=30000&limit=500'
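
As an alternative to formatting the URL by hand, the query string can be assembled from a dictionary with the standard library's `urlencode`, which also escapes any special characters. This is a minimal sketch: `build_explore_url` is a hypothetical helper, and `YOUR_ID` / `YOUR_SECRET` are placeholders rather than real credentials.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.foursquare.com/v2/venues/explore"

def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Assemble the explore URL from named parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",
        "radius": radius,
        "limit": limit,
    }
    # safe="," keeps the comma inside the ll parameter unescaped
    return f"{BASE_URL}?{urlencode(params, safe=',')}"

example_url = build_explore_url("YOUR_ID", "YOUR_SECRET", "20200701",
                                35.6828387, 139.7594549, 30000, 500)
print(example_url)
```

Keeping the parameters in a dictionary also makes it easier to load the credentials from somewhere other than the notebook itself, for example environment variables.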

In [750]:
results = requests.get(url).json()
results


{'meta': {'code': 200, 'requestId': '5efd3d02587699312dfd45b9'},
 'response': {'suggestedFilters': {'header': 'Tap to show:',
   'filters': [{'name': 'Open now', 'key': 'openNow'}]},
  'headerLocation': 'Tokyo',
  'headerFullLocation': 'Tokyo',
  'headerLocationGranularity': 'city',
  'totalResults': 237,
  'suggestedBounds': {'ne': {'lat': 35.952838970000265,
    'lng': 140.0912411771241},
   'sw': {'lat': 35.41283842999973, 'lng': 139.42766862287593}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '54b45c5a498e35ea4e5f3c0c',
       'name': 'Aman Tokyo (アマン東京)',
       'location': {'address': '大手町1-5-6',
        'crossStreet': '大手町タワー 33F-38F',
        'lat': 35.68523599317857,
        'lng': 139.76540095569973,
        'labeledLatLngs': [{'label': 'display',
         

In [692]:
items = results['response']['groups'][0]['items']
items[0]

{'reasons': {'count': 0,
  'items': [{'summary': 'This spot is popular',
    'type': 'general',
    'reasonName': 'globalInteractionReason'}]},
 'venue': {'id': '50866ab1e4b0413dc57f1c93',
  'name': 'Tsujihan (つじ半)',
  'location': {'address': '日本橋3-1-15',
   'crossStreet': '久栄ビル 1F',
   'lat': 35.680763,
   'lng': 139.771563,
   'labeledLatLngs': [{'label': 'display',
     'lat': 35.680763,
     'lng': 139.771563}],
   'distance': 285,
   'postalCode': '103-0027',
   'cc': 'JP',
   'neighborhood': '八重洲',
   'city': '東京',
   'state': '東京都',
   'country': '日本',
   'formattedAddress': ['日本橋3-1-15 (久栄ビル 1F)', '中央区, 東京都', '103-0027', '日本']},
  'categories': [{'id': '55a59bace4b013909087cb0c',
    'name': 'Donburi Restaurant',
    'pluralName': 'Donburi Restaurants',
    'shortName': 'Donburi',
    'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/food/japanese_',
     'suffix': '.png'},
    'primary': True}],
  'photos': {'count': 0, 'groups': []}},
 'referralId': 'e-0-50866ab1e4b0

Create a function that returns the category of a nearby venue

In [751]:
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']
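
The same lookup can be written directly against the nested item structure shown in the explore response above. The sketch below is a dictionary-based variant that returns `None` instead of raising when a venue has no categories; `first_category_name` and the second sample item are assumptions added for illustration.

```python
def first_category_name(item):
    """Return the first category name of an explore item, or None."""
    categories = item.get("venue", {}).get("categories", [])
    return categories[0]["name"] if categories else None

sample_items = [
    # Shape copied from the response above, trimmed to the fields used here.
    {"venue": {"name": "Tsujihan (つじ半)",
               "categories": [{"name": "Donburi Restaurant", "primary": True}]}},
    # Hypothetical venue without any category.
    {"venue": {"name": "Uncategorised place", "categories": []}},
]

names = [first_category_name(item) for item in sample_items]
print(names)
```

Indexing `categories[0]` without this check raises an `IndexError` for a venue whose category list is empty, so a guard like this is worth having wherever categories are read from the response.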



In [752]:
def getVenues(city, country, latitudes, longitudes):
    
    venues_list=[]
    for city,country, lat, lng in zip(city, country, latitudes, longitudes):
        print(city)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            city, 
            country,  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['City', 
                  'Country', 
                  'Category' ]
    
    return(nearby_venues)

In [753]:
globalvenues = getVenues(city=df['City'],country=df['Country'],
                                   latitudes=df['Latitude'],
                                   longitudes=df['Longitude']
                                  )

Tokyo
Delhi
Shanghai
São Paulo
Mexico City
Cairo
Mumbai
Beijing
Dhaka
Osaka
New York City
Karachi
Buenos Aires
Chongqing
Istanbul
Kolkata
Manila
Lagos
Rio de Janeiro
Tianjin
Kinshasa
Guangzhou
Los Angeles
Moscow
Shenzhen
Lahore
Bangalore
Paris
Bogotá
Jakarta
Chennai
Lima
Bangkok
Seoul
Nagoya
Hyderabad
London
Tehran
Chicago
Chengdu
Nanjing
Wuhan
Ho Chi Minh City
Luanda
Ahmedabad
Kuala Lumpur
Xi'an
Hong Kong
Dongguan
Hangzhou


globalvenues is a DataFrame with one row per venue, giving the venue's category and its city and country

In [754]:
print(globalvenues.shape)
globalvenues


(4536, 3)


Unnamed: 0,City,Country,Category
0,Tokyo,Japan,Hotel
1,Tokyo,Japan,Donburi Restaurant
2,Tokyo,Japan,Roof Deck
3,Tokyo,Japan,Bookstore
4,Tokyo,Japan,Train Station
...,...,...,...
4531,Hangzhou,China,Resort
4532,Hangzhou,China,Coffee Shop
4533,Hangzhou,China,Pizza Place
4534,Hangzhou,China,Hotel


In [755]:
globalvenues.groupby('City').count()
print('There are {} unique categories.'.format(len(globalvenues['Category'].unique())))


There are 409 unique categories.


Convert venues to dummy variables for each city

In [756]:
# one hot encoding
global_dummies = pd.get_dummies(globalvenues[['Category']], prefix="", prefix_sep="")
global_dummies['City'] = globalvenues['City'] 
fixed_columns = [global_dummies.columns[-1]] + list(global_dummies.columns[:-1])
global_dummies = global_dummies[fixed_columns]
global_dummies.head()

Unnamed: 0,City,ATM,Accessories Store,Adult Boutique,Afghan Restaurant,African Restaurant,Airport,Airport Lounge,Airport Service,Airport Terminal,...,Wings Joint,Women's Store,Xinjiang Restaurant,Yakitori Restaurant,Yoga Studio,Yoshoku Restaurant,Yunnan Restaurant,Zhejiang Restaurant,Zoo,Zoo Exhibit
0,Tokyo,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Tokyo,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Tokyo,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Tokyo,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Tokyo,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [757]:
global_dummies = global_dummies.groupby('City').mean().reset_index()
global_dummies.select_dtypes(exclude=['object', 'datetime']) * 100




Unnamed: 0,ATM,Accessories Store,Adult Boutique,Afghan Restaurant,African Restaurant,Airport,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Wings Joint,Women's Store,Xinjiang Restaurant,Yakitori Restaurant,Yoga Studio,Yoshoku Restaurant,Yunnan Restaurant,Zhejiang Restaurant,Zoo,Zoo Exhibit
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.515152,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.515152,3.030303
8,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
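
One-hot encoding the categories and then averaging per city amounts to computing, for each city, the share of its venues that falls into each category. The following dependency-free sketch shows that arithmetic on a handful of made-up rows; the `venues` list is illustrative data, not taken from the Foursquare results.

```python
from collections import Counter, defaultdict

# Toy (city, category) pairs; the values are made up for illustration.
venues = [
    ("Tokyo", "Hotel"), ("Tokyo", "Bookstore"), ("Tokyo", "Hotel"),
    ("Tokyo", "Train Station"),
    ("Paris", "Hotel"), ("Paris", "Bakery"),
]

# Count how often each category occurs in each city.
counts = defaultdict(Counter)
for city, category in venues:
    counts[city][category] += 1

# Fixed column order, shared by every city.
categories = sorted({category for _, category in venues})

# One row per city: share of that city's venues falling in each category.
frequencies = {
    city: [counter[cat] / sum(counter.values()) for cat in categories]
    for city, counter in counts.items()
}

print(categories)
for city, row in frequencies.items():
    print(city, row)
```

Because every row sums to 1, cities with different numbers of returned venues remain comparable, which matters for the distance-based clustering that follows.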


Creating a function to see the most common venues in each city

In [758]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [764]:
num_top_venues = 15

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['City']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # positions past the third fall back to the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
venues_sorted = pd.DataFrame(columns=columns)
venues_sorted['City'] = global_dummies['City']

for ind in np.arange(global_dummies.shape[0]):
    venues_sorted.iloc[ind, 1:] = return_most_common_venues(global_dummies.iloc[ind, :], num_top_venues)

venues_sorted

Unnamed: 0,City,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,11th Most Common Venue,12th Most Common Venue,13th Most Common Venue,14th Most Common Venue,15th Most Common Venue
0,Ahmedabad,Indian Restaurant,Multiplex,Fast Food Restaurant,Dessert Shop,Café,Hotel,Restaurant,Tea Room,Pizza Place,Ice Cream Shop,Sandwich Place,Coffee Shop,Shopping Mall,Vegetarian / Vegan Restaurant,Snack Place
1,Bangalore,Hotel,Ice Cream Shop,Bakery,Brewery,Italian Restaurant,Shopping Mall,Fast Food Restaurant,Indian Restaurant,Coffee Shop,Multiplex,Breakfast Spot,Bowling Alley,Department Store,Burger Joint,Snack Place
2,Bangkok,Hotel,Park,Coffee Shop,Bookstore,Noodle House,Shopping Mall,Italian Restaurant,Bar,Asian Restaurant,Seafood Restaurant,Thai Restaurant,Spa,Dessert Shop,Clothing Store,Historic Site
3,Beijing,Hotel,Historic Site,Park,Brewery,Café,Shopping Mall,Chinese Restaurant,Dumpling Restaurant,Peking Duck Restaurant,Coffee Shop,Pizza Place,Cocktail Bar,Bar,Electronics Store,Mexican Restaurant
4,Bogotá,Park,Coffee Shop,Bakery,Hotel,Golf Course,Bookstore,Ice Cream Shop,Burger Joint,Art Museum,Gym / Fitness Center,Italian Restaurant,BBQ Joint,Asian Restaurant,Brewery,Theater
5,Buenos Aires,Ice Cream Shop,Hotel,Bakery,Plaza,Pizza Place,BBQ Joint,Park,Coffee Shop,Restaurant,Japanese Restaurant,Bar,Middle Eastern Restaurant,Italian Restaurant,Science Museum,Gym / Fitness Center
6,Cairo,Hotel,Café,Historic Site,Dessert Shop,Convenience Store,Sports Club,Bakery,Supermarket,Egyptian Restaurant,Hotel Bar,Ice Cream Shop,Falafel Restaurant,Cupcake Shop,Italian Restaurant,Gym / Fitness Center
7,Chengdu,Hotel,Shopping Mall,Coffee Shop,Hostel,Zoo Exhibit,Historic Site,Italian Restaurant,Bookstore,Chinese Restaurant,Furniture / Home Store,History Museum,Theater,Temple,Grocery Store,Nightclub
8,Chennai,Indian Restaurant,Beach,Hotel,Multiplex,South Indian Restaurant,Sandwich Place,Italian Restaurant,Ice Cream Shop,Restaurant,Food,Café,Theater,Chinese Restaurant,BBQ Joint,Pizza Place
9,Chicago,Hotel,Park,Theater,Ice Cream Shop,Italian Restaurant,Liquor Store,Sandwich Place,Grocery Store,Mediterranean Restaurant,Boat or Ferry,Beach,Tour Provider,Gym / Fitness Center,Trail,Deli / Bodega
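
The column labels above come from a three-element suffix list with a fallback to "th", which is correct for the 15 columns used here but would produce "21th" if `num_top_venues` were raised past 20. A more general suffix rule could look like the sketch below; `ordinal` is a hypothetical helper, not something the notebook defines.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"

columns = ["City"] + [f"{ordinal(i)} Most Common Venue" for i in range(1, 16)]
print(columns[:4], columns[-1])
```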


Using K-means to cluster the cities 

In [765]:
kclusters = 10

cities_clustering = global_dummies.drop(columns='City')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(cities_clustering)

# check cluster labels generated for each row in the dataframe
venues_sorted.insert(0, 'Cluster', kmeans.labels_)

global_merged = df

# join the cluster labels and top venues onto the city coordinates in df

global_merged = global_merged.join(venues_sorted.set_index('City'), on='City')
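
To make the clustering step less of a black box, here is a small pure-Python sketch of Lloyd's algorithm, the iteration usually meant by "k-means": assign every point to its nearest centroid, move each centroid to the mean of its members, and repeat until the labels stop changing. It is a simplified illustration, not a reimplementation of scikit-learn's `KMeans` (which adds smarter initialisation and other refinements); the `kmeans` function and the toy frequency vectors are assumptions.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=100, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centroids, inertia)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k of the data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every point.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
        if new_labels == labels:
            break  # labels unchanged, so the algorithm has converged
        labels = new_labels
    # Inertia: total squared distance of points to their assigned centroid.
    inertia = sum(squared_distance(p, centroids[lab])
                  for p, lab in zip(points, labels))
    return labels, centroids, inertia

# Toy venue-frequency vectors (made up): two cities per venue profile.
toy_points = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.0],
              [0.0, 0.1, 0.9], [0.1, 0.0, 0.9]]

labels_k2, _, inertia_k2 = kmeans(toy_points, k=2)
_, _, inertia_k1 = kmeans(toy_points, k=1)
print(labels_k2, inertia_k1, inertia_k2)
```

Comparing the inertia for several values of k (the "elbow" heuristic) is one common way to sanity-check a choice such as `kclusters = 10`, although the notebook does not say how that value was chosen.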



In [773]:
# keep only City, Country, Latitude, Longitude and Cluster
global_merged.drop(columns=global_merged.columns[5:], inplace=True)


In [783]:
global_merged

Unnamed: 0,City,Country,Latitude,Longitude,Cluster
0,Tokyo,Japan,35.6828,139.759,3
1,Delhi,India,28.6517,77.2219,0
2,Shanghai,China,31.2253,121.489,1
3,São Paulo,Brazil,-23.5507,-46.6334,3
4,Mexico City,Mexico,19.4326,-99.1332,8
5,Cairo,Egypt,30.0488,31.2437,0
6,Mumbai,India,18.9388,72.8353,8
7,Beijing,China,39.9062,116.391,0
8,Dhaka,Bangladesh,23.7594,90.3788,5
9,Osaka,Japan,34.6199,135.49,9


Cluster #1

In [781]:
global_merged.loc[global_merged['Cluster'] == 0, global_merged.columns[[0] + list(range(5, global_merged.shape[1]))]]
        

Unnamed: 0,City
1,Delhi
5,Cairo
7,Beijing
16,Manila
20,Kinshasa
32,Bangkok
40,Nanjing
42,Ho Chi Minh City
45,Kuala Lumpur
47,Hong Kong


Cluster #2

In [785]:
global_merged.loc[global_merged['Cluster'] == 1, global_merged.columns[[0] + list(range(5, global_merged.shape[1]))]]


Unnamed: 0,City
2,Shanghai
13,Chongqing
21,Guangzhou
29,Jakarta
39,Chengdu
46,Xi'an
49,Hangzhou


Cluster #3

In [788]:
global_merged.loc[global_merged['Cluster'] == 2, global_merged.columns[[0] + list(range(5, global_merged.shape[1]))]]


Unnamed: 0,City
19,Tianjin
41,Wuhan
48,Dongguan


Cluster #4

In [789]:
global_merged.loc[global_merged['Cluster'] == 3, global_merged.columns[[0] + list(range(5, global_merged.shape[1]))]]


Unnamed: 0,City
0,Tokyo
3,São Paulo
14,Istanbul
18,Rio de Janeiro
22,Los Angeles
23,Moscow
27,Paris
33,Seoul
34,Nagoya
37,Tehran
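
Since the goal of the lab is to find cities similar to Paris, the cluster containing Paris is the one of interest. A lookup along the following lines would list its cluster mates. The mapping below is copied from the table above, with Delhi and Shanghai added from the earlier output for contrast; `cluster_mates` is a hypothetical helper.

```python
# City -> cluster label, taken from the notebook's outputs.
cluster_of = {
    "Tokyo": 3, "São Paulo": 3, "Istanbul": 3, "Rio de Janeiro": 3,
    "Los Angeles": 3, "Moscow": 3, "Paris": 3, "Seoul": 3, "Nagoya": 3,
    "Tehran": 3, "Delhi": 0, "Shanghai": 1,
}

def cluster_mates(city, labels):
    """Cities that share a cluster label with `city`, excluding the city itself."""
    target = labels[city]
    return sorted(name for name, lab in labels.items()
                  if lab == target and name != city)

paris_mates = cluster_mates("Paris", cluster_of)
print(paris_mates)
```

With the notebook's DataFrame, the equivalent would be filtering `global_merged` on the `Cluster` value of the Paris row.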


Cluster #5

In [790]:
global_merged.loc[global_merged['Cluster'] == 4, global_merged.columns[[0] + list(range(5, global_merged.shape[1]))]]


Unnamed: 0,City
11,Karachi


Cluster #6

In [791]:
global_merged.loc[global_merged['Cluster'] == 5, global_merged.columns[[0] + list(range(5, global_merged.shape[1]))]]


Unnamed: 0,City
8,Dhaka
43,Luanda


Cluster #7

In [792]:
global_merged.loc[global_merged['Cluster'] == 6, global_merged.columns[[0] + list(range(5, global_merged.shape[1]))]]


Unnamed: 0,City
10,New York City
28,Bogotá
31,Lima
36,London


Cluster #8

In [793]:
global_merged.loc[global_merged['Cluster'] == 7, global_merged.columns[[0] + list(range(5, global_merged.shape[1]))]]


Unnamed: 0,City
15,Kolkata
30,Chennai
44,Ahmedabad


Cluster #9

In [794]:
global_merged.loc[global_merged['Cluster'] == 8, global_merged.columns[[0] + list(range(5, global_merged.shape[1]))]]


Unnamed: 0,City
4,Mexico City
6,Mumbai
12,Buenos Aires
17,Lagos
25,Lahore
26,Bangalore
35,Hyderabad


Cluster #10

In [795]:
global_merged.loc[global_merged['Cluster'] == 9, global_merged.columns[[0] + list(range(5, global_merged.shape[1]))]]


Unnamed: 0,City
9,Osaka
24,Shenzhen
