# City Classifier
<p>The following general steps classify cities by their similarity to European or North American cities:</p>
<ol>
    <li>Obtain a list of large cities in Europe and the US and store them in a dataframe</li>
    <li>Obtain the coordinates of each city's centroid</li>
    <li>Obtain venues for each city from Foursquare</li>
    <li>Pivot the venues (one-hot encode the categories and aggregate by city)</li>
    <li>Train a classifier</li>
    <li>Apply the classifier to the city of Medellín</li>
</ol>
<p>Before starting, import the Python libraries used throughout the notebook:</p>

In [1]:
import pandas as pd
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values
import requests # library to handle requests
import numpy as np
from sklearn.neighbors import KNeighborsClassifier as knn
from sklearn.model_selection import train_test_split as ttsplit
from sklearn import metrics

In [2]:
CLIENT_ID = 'PINJLTKA2NPIJNNAE4T2UBX1BP5HOTQKYBC41Y2XYXJIYRAO' # your Foursquare ID
CLIENT_SECRET = 'ULHL0GUB4MK1VBX05VAUQIC12ZB2XAGZP11SQUD4QCM3CEFC' # your Foursquare Secret
VERSION = '20180604'
LIMIT = 30
print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: PINJLTKA2NPIJNNAE4T2UBX1BP5HOTQKYBC41Y2XYXJIYRAO
CLIENT_SECRET: ULHL0GUB4MK1VBX05VAUQIC12ZB2XAGZP11SQUD4QCM3CEFC
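
Hard-coding and printing API credentials exposes them to anyone who reads the notebook. A minimal sketch of one alternative is shown below: it reads the values from environment variables and masks them when printing. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and both helper functions are assumptions made for this illustration; they are not part of the original notebook.

```python
import os


def load_foursquare_credentials(env=os.environ):
    """Return (client_id, client_secret) read from environment variables.

    FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET are hypothetical
    variable names chosen for this sketch.
    """
    try:
        return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError(f"Missing environment variable: {missing}") from None


def mask(secret, visible=4):
    """Show only the first few characters of a secret when printing it."""
    return secret[:visible] + "*" * max(len(secret) - visible, 0)


# Usage with an explicit mapping, so the sketch runs without a real environment.
demo_env = {"FOURSQUARE_CLIENT_ID": "abc", "FOURSQUARE_CLIENT_SECRET": "xyz"}
client_id, client_secret = load_foursquare_credentials(demo_env)
print("CLIENT_ID:", mask(client_id, visible=1))
```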


## List of Large Cities
The following paragraphs describe how to obtain city information from public sources and transform it into a usable format for training a classifier. Before executing, let's set a sample size: for debugging, a small sample may be convenient because it runs faster.

In [3]:
sampleSize = 200

### European Large Cities
<p>The wiki page at https://ibm.biz/BdzWKG lists the large cities in Europe, ordered by population.</p>

In [4]:
europeanCitiesURL = "https://ibm.biz/BdzWKG"
europeanCities = pd.read_html(europeanCitiesURL)

<p>Exploring the tables returned by <code>read_html</code> shows that the table of interest is the fourth one (index 3).</p>

In [5]:
europeanCitiesDF = europeanCities[3].copy()  # explicit copy avoids SettingWithCopyWarning
# flatten the two-level column header into "level0.level1" names
europeanCitiesDF.columns = europeanCitiesDF.columns.get_level_values(0) + "." + europeanCitiesDF.columns.get_level_values(1)
nan_value = float("NaN")
europeanCitiesDF["Longitude"] = nan_value
europeanCitiesDF["Latitude"] = nan_value


In [6]:
europeanCitiesDF.head()

Unnamed: 0,Clasificación.Europa,Clasificación.UE,Clasificación.País,Ciudad.Nombre en español,Ciudad.Nombre en idioma original,Localización.País,Localización.Entidad administrativa,Localización.Subentidad administrativa,Unnamed: 8_level_0.Población,Longitude,Latitude
0,1,-,TUR-01,Estambul,İstanbul,Turquía,Provincia de Estambul,,14 657 434,,
1,2,-,RUS-0001,Moscú,Москва́,Rusia,Distrito federal Central,Ciudad federal de Moscú,12 380 664,,
2,3,UE-001,GBR-01,Londres,London,Reino Unido,Inglaterra,Gran Londres,8 787 892,,
3,4,-,RUS-0002,San Petersburgo,Санкт-Петербург,Rusia,Distrito federal del Noroeste,Ciudad federal de San Petersburgo,5 281 579,,
4,5,UE-002,ALE-01,Berlín,Berlin,Alemania,Berlín,,3 469 849,,
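
Column names such as `Ciudad.Nombre en español` come from flattening the two-level header that `read_html` produces for this table. The sketch below reproduces that step on a small hand-made frame (the frame itself is an assumption for illustration, only the column labels are taken from the output above) and checks that it matches joining each pair of labels with a dot.

```python
import pandas as pd

# Hand-made two-level header that mimics part of the scraped table.
header = pd.MultiIndex.from_tuples([
    ("Ciudad", "Nombre en español"),
    ("Localización", "País"),
])
df = pd.DataFrame([["Berlín", "Alemania"]], columns=header)

# Equivalent formulation: join the labels of each column with a dot.
joined = [".".join(pair) for pair in df.columns]

# Same operation as in the cell above: level 0 + "." + level 1.
df.columns = df.columns.get_level_values(0) + "." + df.columns.get_level_values(1)

print(list(df.columns))  # ['Ciudad.Nombre en español', 'Localización.País']
```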


<p>This table has to be enriched with the coordinates of each city's centroid. To save time, geocoding stops once <code>sampleSize</code> cities have been located.</p>

In [7]:
geolocator = Nominatim(user_agent="foursquare_agent")
successful = 0  # number of cities geocoded so far
for index, row in europeanCitiesDF.iterrows():
    # query Nominatim with "<city> - <country>"
    location = geolocator.geocode(row["Ciudad.Nombre en español"] + " - " + row["Localización.País"], timeout=15)
    if location is not None:
        europeanCitiesDF.loc[index, "Latitude"] = location.latitude
        europeanCitiesDF.loc[index, "Longitude"] = location.longitude
        successful += 1
    if successful >= sampleSize:
        break



In [8]:
europeanCitiesDF.head()

Unnamed: 0,Clasificación.Europa,Clasificación.UE,Clasificación.País,Ciudad.Nombre en español,Ciudad.Nombre en idioma original,Localización.País,Localización.Entidad administrativa,Localización.Subentidad administrativa,Unnamed: 8_level_0.Población,Longitude,Latitude
0,1,-,TUR-01,Estambul,İstanbul,Turquía,Provincia de Estambul,,14 657 434,28.965165,41.009633
1,2,-,RUS-0001,Moscú,Москва́,Rusia,Distrito federal Central,Ciudad federal de Moscú,12 380 664,37.617494,55.750446
2,3,UE-001,GBR-01,Londres,London,Reino Unido,Inglaterra,Gran Londres,8 787 892,-0.127647,51.507322
3,4,-,RUS-0002,San Petersburgo,Санкт-Петербург,Rusia,Distrito federal del Noroeste,Ciudad federal de San Petersburgo,5 281 579,30.316229,59.938732
4,5,UE-002,ALE-01,Berlín,Berlin,Alemania,Berlín,,3 469 849,13.38886,52.517037
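
The enrichment loop can also be written as a small function that takes the geocoder as an argument, which makes it testable without network access. This is a sketch under assumptions: `geocode_cities`, its parameters, and the stub geocoder are invented for illustration. The pause between requests is there because the public Nominatim service asks clients to stay at or below roughly one request per second.

```python
import time
from types import SimpleNamespace


def geocode_cities(queries, geocode, sample_size, pause_seconds=1.0):
    """Geocode queries in order until sample_size of them have succeeded.

    `geocode` is any callable returning an object with `.latitude` and
    `.longitude`, or None when nothing is found.
    Returns a dict {query: (latitude, longitude)}.
    """
    found = {}
    for query in queries:
        if len(found) >= sample_size:
            break
        location = geocode(query)
        if location is not None:
            found[query] = (location.latitude, location.longitude)
        time.sleep(pause_seconds)  # throttle requests to the public service
    return found


# Stub geocoder standing in for Nominatim: "B" is deliberately unknown.
known = {
    "A": SimpleNamespace(latitude=1.0, longitude=2.0),
    "C": SimpleNamespace(latitude=3.0, longitude=4.0),
    "D": SimpleNamespace(latitude=5.0, longitude=6.0),
}
result = geocode_cities(["A", "B", "C", "D"], known.get, sample_size=2, pause_seconds=0)
print(result)
```

With a real geocoder, `geocode` could be something like `lambda q: geolocator.geocode(q, timeout=15)`.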


In [9]:
# keep only the rows that were successfully geocoded (Longitude is not NaN)
europeanCitiesDF = europeanCitiesDF[europeanCitiesDF['Longitude'].notna()][['Ciudad.Nombre en español', 'Localización.País', 'Longitude', 'Latitude']].copy()
europeanCitiesDF.columns = ['City', 'Country_State', 'Longitude', 'Latitude']
europeanCitiesDF.head()

Unnamed: 0,City,Country_State,Longitude,Latitude
0,Estambul,Turquía,28.965165,41.009633
1,Moscú,Rusia,37.617494,55.750446
2,Londres,Reino Unido,-0.127647,51.507322
3,San Petersburgo,Rusia,30.316229,59.938732
4,Berlín,Alemania,13.38886,52.517037


### United States Large Cities
<p>The wiki page at https://ibm.biz/BdzWK9 lists the large cities in the United States by population and population density. The procedure is essentially the same as for the European cities, but the table layout is slightly different.</p>

In [10]:
USACitiesURL = "https://ibm.biz/BdzWK9"
USACities = pd.read_html(USACitiesURL)

USACitiesDF = USACities[3][["Ciudad", "Estado"]].copy()  # explicit copy avoids SettingWithCopyWarning

USACitiesDF["Longitude"] = nan_value
USACitiesDF["Latitude"] = nan_value

i = 0
for index, row in USACitiesDF.iterrows():
    location = geolocator.geocode(row["Ciudad"] + ", " + row["Estado"] + ", United States of America", timeout=15)
    if location is not None:
        USACitiesDF.loc[index, "Latitude"] = location.latitude
        USACitiesDF.loc[index, "Longitude"] = location.longitude
        i += 1
    if i >= sampleSize:
        break

# keep only the rows that were successfully geocoded (Longitude is not NaN)
USACitiesDF = USACitiesDF[USACitiesDF['Longitude'].notna()].copy()
USACitiesDF.columns = ['City', 'Country_State', 'Longitude', 'Latitude']
USACitiesDF.head()


Unnamed: 0,City,Country_State,Longitude,Latitude
0,Nueva York,Nueva York,-74.006015,40.712728
1,Los Ángeles,California,-118.242767,34.053691
2,Chicago,Illinois,-87.624421,41.875562
3,Houston,Texas,-95.367697,29.758938
4,Filadelfia,Pensilvania,-75.163526,39.952724


## Add Medellín to the dataset

In [11]:
data = [['Medellín', 'Antioquia', -75.590553, 6.230833, 'Colombia']]
Medellin = pd.DataFrame(data, columns=['City', 'Country_State', 'Longitude', 'Latitude', 'Class'])
Medellin

Unnamed: 0,City,Country_State,Longitude,Latitude,Class
0,Medellín,Antioquia,-75.590553,6.230833,Colombia


## Join Datasets together
<p>At this point we have sample cities from Europe and the United States, and we need to join them into a single table.</p>

In [12]:
europeanCitiesDF['Class'] = 'Europe'
USACitiesDF['Class'] = 'USA'
citiesDF = pd.concat([USACitiesDF, europeanCitiesDF, Medellin], ignore_index=True, sort=False)

In [13]:
citiesDF['Composed_City_Name'] = citiesDF['City'] + ', ' + citiesDF['Country_State'] + ', ' + citiesDF['Class']
citiesDF.tail()

Unnamed: 0,City,Country_State,Longitude,Latitude,Class,Composed_City_Name
396,Brighton & Hove,Reino Unido,-0.149759,50.845221,Europe,"Brighton & Hove, Reino Unido, Europe"
397,Graz,Austria,15.438279,47.070868,Europe,"Graz, Austria, Europe"
398,Cherkasy,Ucrania,32.05878,49.444789,Europe,"Cherkasy, Ucrania, Europe"
399,Liubliana,Eslovenia,14.506782,46.049815,Europe,"Liubliana, Eslovenia, Europe"
400,Medellín,Antioquia,-75.590553,6.230833,Colombia,"Medellín, Antioquia, Colombia"


## Enrich with Venues
Next, we obtain the distribution of venue categories around each city's centroid. These distributions are the features used to train the classifier.

In [14]:
def getCityVenues(names, latitudes, longitudes, radius=1000, limit=100):
    
    print("names size: ", names.size)
    print("latitudes size: ", latitudes.size)
    print("longitudes size: ", longitudes.size)
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            limit)
            
        # make the GET request and extract the list of recommended venue items
        results = requests.get(url).json()
        venues = results['response']['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in venues])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    return(nearby_venues)
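
`getCityVenues` indexes `results['response']['groups'][0]['items']` and `v['venue']['categories'][0]` directly, so a response without groups or a venue without categories would raise an exception partway through the loop. The sketch below shows one defensive way to read the same fields. The function `extract_venues` and the sample payloads are hand-written assumptions that only mimic the keys used above; they are not taken from the Foursquare documentation.

```python
def extract_venues(payload):
    """Return (name, lat, lng, category) tuples from an explore-style payload.

    A payload without groups yields an empty list, and a venue without
    categories gets None as its category.
    """
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        category = categories[0]["name"] if categories else None
        rows.append((venue["name"],
                     venue["location"]["lat"],
                     venue["location"]["lng"],
                     category))
    return rows


# Hand-written payload that mimics the fields read by getCityVenues.
sample = {
    "response": {
        "groups": [{
            "items": [
                {"venue": {"name": "City Hall Park",
                           "location": {"lat": 40.712415, "lng": -74.006724},
                           "categories": [{"name": "Park"}]}},
                {"venue": {"name": "Unlabelled venue",
                           "location": {"lat": 40.7, "lng": -74.0},
                           "categories": []}},
            ]
        }]
    }
}
rows = extract_venues(sample)
empty = extract_venues({"response": {}})
print(rows)
print(empty)
```

Whether rows with a missing category should be kept or dropped before one-hot encoding is a design choice. Dropping them keeps the feature matrix free of a spurious "missing" column.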

In [15]:
limit = 100
city_venues = getCityVenues(names=citiesDF['Composed_City_Name'],
                                   latitudes=citiesDF['Latitude'],
                                   longitudes=citiesDF['Longitude'],
                                   limit=limit
                                  )

names size:  401
latitudes size:  401
longitudes size:  401
Nueva York, Nueva York, USA
Los Ángeles, California, USA
Chicago, Illinois, USA
Houston, Texas, USA
Filadelfia, Pensilvania, USA
Phoenix, Arizona, USA
San Antonio, Texas, USA
San Diego, California, USA
Dallas, Texas, USA
San José, California, USA
Jacksonville, Florida, USA
Indianápolis, Indiana, USA
San Francisco, California, USA
Austin, Texas, USA
Columbus, Ohio, USA
Fort Worth, Texas, USA
Charlotte, Carolina del Norte, USA
Detroit, Míchigan, USA
El Paso, Texas, USA
Memphis, Tennessee, USA
Baltimore, Maryland, USA
Boston, Massachusetts, USA
Seattle, Washington, USA
Washington, Distrito de Columbia, USA
Nashville, Tennessee, USA
Denver, Colorado, USA
Louisville, Kentucky, USA
Milwaukee, Wisconsin, USA
Portland, Oregón, USA
Las Vegas, Nevada, USA
Oklahoma City, Oklahoma, USA
Albuquerque, Nuevo México, USA
Tucson, Arizona, USA
Fresno, California, USA
Sacramento, California, USA
Long Beach, California, USA
Kansas City (Misuri), M

In [16]:
city_venues

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Nueva York, Nueva York, USA",40.712728,-74.006015,The Bar Room at Temple Court,40.711448,-74.006802,Hotel Bar
1,"Nueva York, Nueva York, USA",40.712728,-74.006015,The Beekman - A Thompson Hotel,40.711173,-74.006702,Hotel
2,"Nueva York, Nueva York, USA",40.712728,-74.006015,Alba Dry Cleaner & Tailor,40.711434,-74.006272,Laundry Service
3,"Nueva York, Nueva York, USA",40.712728,-74.006015,Gibney Dance Center Downtown,40.713923,-74.005661,Dance Studio
4,"Nueva York, Nueva York, USA",40.712728,-74.006015,City Hall Park,40.712415,-74.006724,Park
...,...,...,...,...,...,...,...
29726,"Medellín, Antioquia, Colombia",6.230833,-75.590553,Pista Bicicross - BMX Track,6.232245,-75.588574,Athletics & Sports
29727,"Medellín, Antioquia, Colombia",6.230833,-75.590553,Asados Candela,6.233966,-75.595463,Fried Chicken Joint
29728,"Medellín, Antioquia, Colombia",6.230833,-75.590553,Papas Con Cosas,6.239296,-75.588040,Fast Food Restaurant
29729,"Medellín, Antioquia, Colombia",6.230833,-75.590553,Melao's,6.237841,-75.587466,Ice Cream Shop


Once the venues have been obtained, their categories are pivoted (one-hot encoded) and aggregated by city:

In [17]:
# one hot encoding
city_venues_onehot = pd.get_dummies(city_venues[['Venue Category']], prefix="", prefix_sep="")

# add the city-name column back to the dataframe
city_venues_onehot['Composed_City_Name'] = city_venues['Neighborhood'] 

# move the city-name column to the first position
fixed_columns = [city_venues_onehot.columns[-1]] + list(city_venues_onehot.columns[:-1])
city_venues_onehot = city_venues_onehot[fixed_columns]

city_venues_onehot.head()

Unnamed: 0,Composed_City_Name,ATM,Accessories Store,Adult Boutique,Advertising Agency,Afghan Restaurant,African Restaurant,Airport,American Restaurant,Amphitheater,...,Whisky Bar,Wine Bar,Wine Shop,Winery,Wings Joint,Women's Store,Yoga Studio,Yoshoku Restaurant,Zoo,Zoo Exhibit
0,"Nueva York, Nueva York, USA",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"Nueva York, Nueva York, USA",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"Nueva York, Nueva York, USA",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"Nueva York, Nueva York, USA",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"Nueva York, Nueva York, USA",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [18]:
cities_grouped = city_venues_onehot.groupby('Composed_City_Name').mean().reset_index()
cities_grouped.head()

Unnamed: 0,Composed_City_Name,ATM,Accessories Store,Adult Boutique,Advertising Agency,Afghan Restaurant,African Restaurant,Airport,American Restaurant,Amphitheater,...,Whisky Bar,Wine Bar,Wine Shop,Winery,Wings Joint,Women's Store,Yoga Studio,Yoshoku Restaurant,Zoo,Zoo Exhibit
0,"Akron, Ohio, USA",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.0,...,0.0,0.0125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"Albuquerque, Nuevo México, USA",0.013699,0.0,0.0,0.0,0.0,0.0,0.0,0.027397,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Alexandría, Virginia, USA",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.06,0.0,...,0.0,0.01,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0
3,"Alicante, España, Europe",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"Amarillo, Texas, USA",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.03125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
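
Taking the mean of the one-hot rows per city turns venue counts into category frequencies, so each city's row sums to 1 regardless of how many venues were returned for it. The toy example below illustrates this with invented data (city names `A` and `B` and their venues are assumptions). It passes `dtype=float` to `get_dummies` so the result does not depend on the default dummy dtype of the installed pandas version.

```python
import pandas as pd

# Invented venues for two cities: A has 4 venues, B has 2.
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Bar", "Bar", "Park", "Café", "Park", "Park"],
})

# One column per category, one row per venue.
onehot = pd.get_dummies(venues[["Venue Category"]], prefix="", prefix_sep="", dtype=float)
onehot["Composed_City_Name"] = venues["Neighborhood"]

# Mean of 0/1 indicators per city = share of venues in each category.
grouped = onehot.groupby("Composed_City_Name").mean().reset_index()
print(grouped)
```

City `A` ends up with `Bar` at 0.5 and `Café` and `Park` at 0.25 each, while city `B` has `Park` at 1.0.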


In [19]:
num_top_venues = 5

for city in cities_grouped['Composed_City_Name']:
    print("----"+city+"----")
    temp = cities_grouped[cities_grouped['Composed_City_Name'] == city].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Akron, Ohio, USA----
                 venue  freq
0                  Bar  0.06
1       Sandwich Place  0.06
2                 Bank  0.05
3  Rental Car Location  0.04
4          Coffee Shop  0.04


----Albuquerque, Nuevo México, USA----
            venue  freq
0     Pizza Place  0.08
1  Sandwich Place  0.05
2     Coffee Shop  0.05
3      Restaurant  0.04
4         Brewery  0.04


----Alexandría, Virginia, USA----
                 venue  freq
0  American Restaurant  0.06
1       Ice Cream Shop  0.05
2          Coffee Shop  0.04
3          Pizza Place  0.03
4    French Restaurant  0.03


----Alicante, España, Europe----
                venue  freq
0  Spanish Restaurant  0.08
1          Restaurant  0.06
2                 Pub  0.06
3    Tapas Restaurant  0.05
4        Burger Joint  0.04


----Amarillo, Texas, USA----
            venue  freq
0      Restaurant  0.06
1  Sandwich Place  0.06
2            Park  0.06
3    Hockey Arena  0.06
4    Burger Joint  0.06


----Amberes, Bélgica, Euro

In [20]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [21]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Composed_City_Name']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # beyond the third position, fall back to the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
city_venues_sorted = pd.DataFrame(columns=columns)
city_venues_sorted['Composed_City_Name'] = cities_grouped['Composed_City_Name']

for ind in np.arange(cities_grouped.shape[0]):
    city_venues_sorted.iloc[ind, 1:] = return_most_common_venues(cities_grouped.iloc[ind, :], num_top_venues)

city_venues_sorted.head()

Unnamed: 0,Composed_City_Name,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Akron, Ohio, USA",Bar,Sandwich Place,Bank,Music Venue,Coffee Shop,Rental Car Location,Art Gallery,Deli / Bodega,Trail,Performing Arts Venue
1,"Albuquerque, Nuevo México, USA",Pizza Place,Sandwich Place,Coffee Shop,Brewery,Restaurant,Bar,Asian Restaurant,Farmers Market,Café,Theater
2,"Alexandría, Virginia, USA",American Restaurant,Ice Cream Shop,Coffee Shop,French Restaurant,New American Restaurant,Boutique,Italian Restaurant,Pizza Place,Pet Store,Gourmet Shop
3,"Alicante, España, Europe",Spanish Restaurant,Pub,Restaurant,Tapas Restaurant,Burger Joint,Plaza,Café,Italian Restaurant,Mediterranean Restaurant,French Restaurant
4,"Amarillo, Texas, USA",Sushi Restaurant,Restaurant,Sandwich Place,Park,Hockey Arena,Burger Joint,Italian Restaurant,Bank,Bar,Baseball Stadium
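
The suffix logic used for the column names is correct for the ten positions shown here, but it would produce labels such as "21th" if `num_top_venues` were raised above 20. A small helper that handles the general case could look like the sketch below (`ordinal` is a name introduced for this illustration).

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 12 -> '12th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


num_top_venues = 10
columns = ["Composed_City_Name"] + [
    f"{ordinal(position)} Most Common Venue"
    for position in range(1, num_top_venues + 1)
]
print(columns[:4])
```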


In [22]:
enrichedCitiesDF = citiesDF.merge(cities_grouped, on='Composed_City_Name')
enrichedCitiesDF.columns

Index(['City', 'Country_State', 'Longitude', 'Latitude', 'Class',
       'Composed_City_Name', 'ATM', 'Accessories Store', 'Adult Boutique',
       'Advertising Agency',
       ...
       'Whisky Bar', 'Wine Bar', 'Wine Shop', 'Winery', 'Wings Joint',
       'Women's Store', 'Yoga Studio', 'Yoshoku Restaurant', 'Zoo',
       'Zoo Exhibit'],
      dtype='object', length=570)

In [23]:
# feature columns (independent variables): one per venue category
columns = city_venues['Venue Category'].unique()
columns

array(['Hotel Bar', 'Hotel', 'Laundry Service', 'Dance Studio', 'Park',
       'Coffee Shop', 'French Restaurant', 'Gym / Fitness Center',
       'Bakery', 'Sandwich Place', 'Taco Place', 'Indian Restaurant',
       'Pizza Place', 'Gym', 'Monument / Landmark', 'Yoga Studio',
       'Furniture / Home Store', 'Burger Joint', 'Spa',
       'Falafel Restaurant', 'Antique Shop', 'Bookstore',
       'Japanese Restaurant', 'Fast Food Restaurant', "Women's Store",
       'Wine Shop', 'Café', 'Juice Bar', 'Baby Store',
       'Molecular Gastronomy Restaurant', 'American Restaurant',
       'Coworking Space', 'Sushi Restaurant', 'Nail Salon',
       'Vegetarian / Vegan Restaurant', 'Plaza', 'Discount Store',
       'Electronics Store', 'Cocktail Bar', 'Breakfast Spot',
       'Comic Shop', 'Shopping Mall', 'General Entertainment',
       'Asian Restaurant', 'New American Restaurant', 'Grocery Store',
       'Bagel Shop', 'Lingerie Store', 'Gourmet Shop', 'Auditorium',
       'Building', 'Memoria

In [24]:
X = enrichedCitiesDF[enrichedCitiesDF["Class"] != 'Colombia'][columns]
y = enrichedCitiesDF[enrichedCitiesDF["Class"] != 'Colombia']['Class']

X_train, X_test, y_train, y_test = ttsplit(X, y, test_size = 0.2)

classifier = knn(n_neighbors=4)
classifier.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=4, p=2,
                     weights='uniform')

In [25]:
yhat = classifier.predict(X_test)
yhat

array(['Europe', 'USA', 'Europe', 'Europe', 'USA', 'Europe', 'USA', 'USA',
       'Europe', 'Europe', 'USA', 'USA', 'Europe', 'Europe', 'USA',
       'Europe', 'Europe', 'USA', 'Europe', 'Europe', 'Europe', 'Europe',
       'USA', 'USA', 'USA', 'Europe', 'Europe', 'USA', 'Europe', 'USA',
       'Europe', 'Europe', 'Europe', 'Europe', 'Europe', 'Europe', 'USA',
       'Europe', 'USA', 'Europe', 'Europe', 'Europe', 'Europe', 'USA',
       'USA', 'USA', 'Europe', 'Europe', 'Europe', 'USA', 'USA', 'Europe',
       'Europe', 'Europe', 'USA', 'Europe', 'USA', 'USA', 'Europe',
       'Europe', 'USA', 'Europe', 'Europe', 'USA', 'USA', 'Europe',
       'Europe', 'Europe', 'Europe', 'Europe', 'Europe', 'USA', 'Europe',
       'USA', 'USA', 'Europe', 'Europe', 'USA', 'USA'], dtype=object)

In [26]:
print("Train set Accuracy: ", metrics.accuracy_score(y_train, classifier.predict(X_train)))
print("Test set Accuracy: ", metrics.accuracy_score(y_test, yhat))

Train set Accuracy:  0.9936708860759493
Test set Accuracy:  0.9746835443037974
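
These accuracies come from a single random split, so they will vary from run to run. The sketch below shows one way to make the evaluation reproducible and less dependent on a particular split: fix `random_state`, stratify the split by class, and complement it with 5-fold cross-validation. The data is synthetic and stands in for the venue-frequency matrix, and the choice of `n_neighbors=5` is an assumption (an odd value avoids tied votes in a two-class problem, which are possible with `n_neighbors=4`).

```python
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the feature matrix: 2 classes, 5 features each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (60, 5)), rng.normal(1.5, 1.0, (60, 5))])
y = np.array(["Europe"] * 60 + ["USA"] * 60)

# Fixed seed plus stratification: same split every run, balanced classes.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

# Cross-validation averages the accuracy over five different splits.
cv_scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=5)
print("Hold-out accuracy:", test_accuracy)
print("CV accuracy: mean", cv_scores.mean(), "std", cv_scores.std())
```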


Finally, Medellín's feature row is extracted and passed to the trained classifier:

In [27]:
X_med = enrichedCitiesDF[enrichedCitiesDF["Class"] == 'Colombia'][columns]
ymed = classifier.predict(X_med)
ymed

array(['USA'], dtype=object)
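
A single predicted label hides how close the vote was and which cities drove it. With a k-nearest-neighbours model, `predict_proba` gives the share of neighbours per class and `kneighbors` returns the neighbours themselves. The sketch below demonstrates both on toy two-feature data. All names and values in it are invented for illustration, and nothing here is a result from the notebook.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy training set: invented city names with two venue-frequency features.
names = np.array(["eu_1", "eu_2", "eu_3", "us_1", "us_2", "us_3"])
X = np.array([[0.9, 0.1], [0.8, 0.2], [0.7, 0.3],
              [0.2, 0.8], [0.1, 0.9], [0.3, 0.7]])
y = np.array(["Europe", "Europe", "Europe", "USA", "USA", "USA"])

model = KNeighborsClassifier(n_neighbors=3).fit(X, y)
new_city = np.array([[0.25, 0.75]])  # the city to classify

# Share of the 3 nearest neighbours belonging to each class.
proba = dict(zip(model.classes_, model.predict_proba(new_city)[0]))

# The nearest training cities and their distances to the new city.
distances, indices = model.kneighbors(new_city)
nearest = list(zip(names[indices[0]], distances[0]))

print(proba)
print(nearest)
```

Applied to the notebook's data, the same two calls on `classifier` and `X_med` would show how many of Medellín's nearest neighbours are US cities and which ones they are.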