# World Cities

This notebook presents a data science project that measures how similar some of the world's most popular cities are to one another.

This information is useful for travel agencies deciding which travel packages to offer their clients, as well as for solo travelers who enjoyed one city and would like to visit a similar one.

In [260]:
#imports

import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup as BS
import unicodedata
import wget
import json
from collections import Counter
from sklearn.preprocessing import MinMaxScaler
import googlemaps
import matplotlib.cm as cm
import matplotlib.colors as colors
from pandas import json_normalize # transform JSON into a pandas DataFrame (pandas >= 1.0)
from geopy.geocoders import Nominatim
#!conda install -c conda-forge folium=0.5.0 --yes
import folium

print("Libs imported")


Libs imported


First, we need to select the cities to compare.

This page has a table of the most popular cities to visit: https://www.worldatlas.com/articles/the-most-popular-cities-in-the-world-to-visit.html.

We will extract the data from that table and use its 25 cities.

In [73]:
url = "https://www.worldatlas.com/articles/the-most-popular-cities-in-the-world-to-visit.html"
r = requests.get(url)
soup = BS(r.text, "html.parser")  # name the parser explicitly to avoid a bs4 warning

table = soup.find("table")
rows = table.find_all("tr")

ls = []
for tr in rows:
    td = tr.find_all('td')
    thisRow = [cell.text for cell in td]  # avoid shadowing the loop variable tr
    ls.append(thisRow)

data = pd.DataFrame(ls,columns=["Rank","City","Visitors"])
cities = data["City"]
cities = cities.iloc[1:]  # drop the header row, which has no <td> cells
cities.reset_index(inplace=True,drop=True)
print(cities)

0          Bangkok
1           London
2            Paris
3            Dubai
4         New York
5        Singapore
6     Kuala Lumpur
7         Istanbul
8            Tokyo
9            Seoul
10       Hong Kong
11       Barcelona
12       Amsterdam
13           Milan
14          Taipei
15            Rome
16           Osaka
17          Vienna
18        Shanghai
19          Prague
20     Los Angeles
21          Madrid
22          Munich
23           Miami
24          Dublin
Name: City, dtype: object


Use the geopy geolocator to get each city's latitude and longitude.

In [58]:
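geolocator = Nominatim(user_agent="world_cities_notebook")  # recent geopy versions require a user_agent; this name is arbitrary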
latlng = []
for city in cities:
    location = geolocator.geocode(city)
    latlng.append([location.latitude,location.longitude])
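
`geocode` returns `None` when it finds no match, so the loop above would raise an `AttributeError` on `location.latitude` for any name that cannot be resolved. A minimal sketch of a more defensive lookup follows. `geocode_cities` and `fake_geocode` are hypothetical names, and the stub stands in for `geolocator.geocode` so that the example runs offline.

```python
from collections import namedtuple

# Stand-in for geopy's Location object (only the two attributes used here).
Location = namedtuple("Location", ["latitude", "longitude"])

def fake_geocode(name):
    """Hypothetical offline stub; replace with geolocator.geocode."""
    known = {"London": Location(51.507322, -0.127647)}
    return known.get(name)  # None for unknown names, like a failed lookup

def geocode_cities(names, geocode):
    """Return (coords, missing): coordinates per resolved city, plus unresolved names."""
    coords, missing = {}, []
    for name in names:
        location = geocode(name)
        if location is None:
            missing.append(name)
        else:
            coords[name] = (location.latitude, location.longitude)
    return coords, missing

coords, missing = geocode_cities(["London", "Atlantis"], fake_geocode)
print(coords)
print(missing)
```

Collecting the unresolved names separately makes it possible to fix or drop them before the coordinates are joined back onto the city list.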


In [74]:
cities = pd.concat([cities,pd.DataFrame(latlng)],axis=1)
cities.columns = ["City","Lat","Lng"]
print(cities)

            City        Lat         Lng
0        Bangkok  13.753893  100.816080
1         London  51.507322   -0.127647
2          Paris  48.856610    2.351499
3          Dubai  25.075010   55.188761
4       New York  40.730862  -73.987156
5      Singapore   1.290475  103.852036
6   Kuala Lumpur   3.151664  101.694303
7       Istanbul  41.009633   28.965165
8          Tokyo  35.682839  139.759455
9          Seoul  37.566679  126.978291
10     Hong Kong  22.279328  114.162813
11     Barcelona  41.382894    2.177432
12     Amsterdam  52.374540    4.897976
13         Milan  45.466797    9.190498
14        Taipei  25.037520  121.563680
15          Rome  41.894802   12.485338
16         Osaka  34.693757  135.501454
17        Vienna  48.208354   16.372504
18      Shanghai  31.225344  121.488892
19        Prague  50.087465   14.421254
20   Los Angeles  34.053683 -118.242767
21        Madrid  40.416705   -3.703582
22        Munich  48.137108   11.575382
23         Miami  25.774266  -80.193659


Next, we need to decide how to describe a city. All the data must be obtainable from the Foursquare API.

The approach is to fetch the venues around each city's coordinates and count their categories.

The DataFrame will hold, for each city, the number of venues in every category.
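
As a rough illustration of this representation, the sketch below counts the first category of each venue for a single city. The `venues` list is made-up sample data shaped like the `categories` field used later in this notebook, and `count_categories` is a hypothetical helper. Unlike the notebook's own functions, it skips venues that have no category instead of counting them under `None`.

```python
from collections import Counter

# Made-up sample: each venue carries a list of category dicts.
venues = [
    {"name": "A", "categories": [{"name": "Wine Bar"}]},
    {"name": "B", "categories": [{"name": "Pub"}]},
    {"name": "C", "categories": [{"name": "Wine Bar"}, {"name": "Boutique"}]},
    {"name": "D", "categories": []},  # venue without any category
]

def count_categories(venues):
    """Count venues by their first category, skipping venues that have none."""
    names = [v["categories"][0]["name"] for v in venues if v["categories"]]
    return dict(Counter(names))

counts = count_categories(venues)
print(counts)
```

Each city then becomes one such dictionary, and the dictionaries are later aligned on a shared set of category columns.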


In [76]:
# Read the credentials file and close it when done
with open('foursquareCredentials.txt') as f:
    lines = [line.rstrip('\n') for line in f]

credentials = dict(
    client_id=lines[0],
    client_secret=lines[1]
)

In [185]:

def getValue(latitude,longitude,credentials):
    url = 'https://api.foursquare.com/v2/venues/explore'
    params = dict(
        client_id=credentials['client_id'],
        client_secret=credentials['client_secret'],
        v='20180323',
        ll = str(latitude) + ',' + str(longitude),
        radius=500,
        limit=50000
    )
    resp = requests.get(url=url,params=params)
    return resp

def getDict(resp):
    # Count how many venues of each category the response contains.
    results = resp.json()
    venues = results['response']['groups'][0]['items']
    venues = json_normalize(venues)
    cols = ['venue.name','venue.categories']
    venuesCols = venues.loc[:,cols].copy()  # copy to avoid SettingWithCopyWarning
    venuesCols['venue.categories'] = venuesCols.apply(get_category_type,axis=1)
    categories = venuesCols['venue.categories']
    counts = Counter(categories)
    return dict(counts)

In [186]:
def get_category_type(row):
    # Return the name of the venue's first category, or None if it has none.
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']

    if len(categories_list) == 0:
        return None
    return categories_list[0]['name']

In [196]:
citiesDict = []
for lat,lng in zip(cities["Lat"],cities["Lng"]):
    resp = getValue(lat,lng,credentials)
    dd = getDict(resp)
    citiesDict.append(dd)

Here, I collect all the categories returned by Foursquare and use them as columns.

In [238]:
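totalCategories = []
for dt in citiesDict:
    # list() is needed in Python 3, where dict.keys() is a view, not a list
    totalCategories = totalCategories + list(dt.keys())

print(len(totalCategories))
uniques = set(totalCategories)

# Size the frame from the data instead of hard-coding (25, 284)
clusterData = pd.DataFrame(np.zeros((len(citiesDict),len(uniques))),columns=list(uniques))
print(clusterData.shape)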



1071
(25, 284)
(25, 284)
   Fast Food Restaurant  Gukbap Restaurant  Argentinian Restaurant  Boutique  \
0                   0.0                0.0                     0.0       0.0   
1                   0.0                0.0                     0.0       1.0   
2                   0.0                0.0                     0.0       0.0   
3                   0.0                0.0                     0.0       0.0   
4                   0.0                0.0                     0.0       0.0   
5                   0.0                0.0                     0.0       0.0   
6                   0.0                0.0                     0.0       2.0   
7                   1.0                0.0                     0.0       1.0   
8                   0.0                0.0                     0.0       0.0   
9                   0.0                1.0                     0.0       0.0   

   Wine Bar  Unagi Restaurant  Fish Market  Botanical Garden  Pub  \
0       0.0              

Now, for each city, let's fill in the count for each venue category.

In [266]:
for i,dt in enumerate(citiesDict):
    # Use the city's count for each category column, or 0 if the category is absent
    clusterData.iloc[i,:] = [dt.get(col,0) for col in clusterData.columns]
print(clusterData.shape)

(25, 284)
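
For comparison, pandas can build the same kind of matrix directly from a list of count dictionaries: categories missing from a city become `NaN`, which `fillna(0)` turns into zeros. This is a minimal sketch with made-up counts, where `sample_counts` is a hypothetical stand-in for `citiesDict`.

```python
import pandas as pd

# Made-up per-city category counts (one dict per city).
sample_counts = [
    {"Pub": 3, "Wine Bar": 1},
    {"Boutique": 2, "Pub": 1},
]

# Columns are the union of all keys; absent categories are filled with 0.
matrix = pd.DataFrame(sample_counts).fillna(0)
print(matrix)
```

This avoids building the zero-filled frame and the row-by-row assignment, at the cost of being less explicit about the column set.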


Use k-means with 5 clusters to group the most similar cities.

In [242]:
from sklearn.cluster import KMeans

model = KMeans(n_clusters=5)
model = model.fit(clusterData)
print(model.labels_)
print(Counter(model.labels_))

[1 1 1 1 1 1 1 1 1 4 1 3 1 2 1 2 1 2 1 2 1 3 1 1 0]
Counter({1: 17, 2: 4, 3: 2, 0: 1, 4: 1})
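
One cluster holds 17 of the 25 cities. A possible contributor (an assumption, not something verified here) is that categories with large raw counts dominate the Euclidean distances k-means relies on. The sketch below shows how the already imported `MinMaxScaler` could rescale each column to [0, 1] before clustering. The data is random, and `random_state` and `n_init` are set only to make the run repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
# Random stand-in for clusterData: 25 cities x 10 categories of venue counts.
counts = rng.integers(0, 20, size=(25, 10)).astype(float)

scaled = MinMaxScaler().fit_transform(counts)  # each column mapped to [0, 1]

model = KMeans(n_clusters=5, n_init=10, random_state=0).fit(scaled)
print(model.labels_)
```

Whether scaling gives more balanced clusters on the real data would need to be checked by rerunning the notebook.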


Plot the results, where the color indicates the cluster.

In [267]:
map_clusters = folium.Map(location=[0,0],zoom_start=2)

# One color per cluster
colors_array = cm.rainbow(np.linspace(0,1,5))
rainbow = [colors.rgb2hex(i) for i in colors_array]

for lat,lng,poi,cluster in zip(cities["Lat"],cities["Lng"],cities["City"],model.labels_):
    label = folium.Popup(poi)
    folium.CircleMarker(
        [lat,lng],
        radius=5,
        popup = label,
        color=rainbow[cluster],
        fill=True,
        fill_color = rainbow[cluster],
        fill_opacity=1).add_to(map_clusters)


map_clusters
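
The map conveys the grouping by color. For a textual view, the sketch below groups city names by label, using the city list and the labels printed earlier in this notebook and assuming both are in the same order. `cities_by_cluster` is a hypothetical name.

```python
from collections import defaultdict

# City names and k-means labels as printed earlier in the notebook.
city_names = ["Bangkok", "London", "Paris", "Dubai", "New York", "Singapore",
              "Kuala Lumpur", "Istanbul", "Tokyo", "Seoul", "Hong Kong",
              "Barcelona", "Amsterdam", "Milan", "Taipei", "Rome", "Osaka",
              "Vienna", "Shanghai", "Prague", "Los Angeles", "Madrid",
              "Munich", "Miami", "Dublin"]
labels = [1, 1, 1, 1, 1, 1, 1, 1, 1, 4, 1, 3, 1, 2, 1, 2, 1, 2, 1, 2, 1, 3, 1, 1, 0]

# Collect the cities that share each cluster label.
cities_by_cluster = defaultdict(list)
for name, label in zip(city_names, labels):
    cities_by_cluster[label].append(name)

for label in sorted(cities_by_cluster):
    print(label, cities_by_cluster[label])
```

With these labels, cluster 2 contains Milan, Rome, Vienna and Prague, and cluster 3 contains Barcelona and Madrid.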