# Data Science Capstone Project

## Segmenting the Crossfit Boxes in Budapest

### Data Analysis

First, I import every library I will need for data manipulation and for building the model.

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!pip install geopy==2.2.0
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!pip install folium
import folium # map rendering library

#!pip install pyproj
import pyproj

#pyplot
import matplotlib.pyplot as plt

print('Libraries imported.')

Libraries imported.


The following functions convert longitude/latitude values into projected (UTM) x/y coordinates and vice versa.

In [2]:
def xy_to_lonlat(x, y):
    proj_latlon = pyproj.Proj(proj='latlong',datum='WGS84')
    proj_xy = pyproj.Proj(proj="utm", zone=33, datum='WGS84')
    lonlat = pyproj.transform(proj_xy, proj_latlon, x, y)
    return lonlat[0], lonlat[1]

def lonlat_to_xy(lon, lat):
    proj_latlon = pyproj.Proj(proj='latlong',datum='WGS84')
    proj_xy = pyproj.Proj(proj="utm", zone=33, datum='WGS84')
    xy = pyproj.transform(proj_latlon, proj_xy, lon, lat)
    return xy[0], xy[1]
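
Both helpers hard-code UTM zone 33. As a side check that is not part of the original notebook, the standard zone rule (6° bands counted from 180° W) can be sketched in plain Python. The function name `utm_zone` is my own, and the sketch ignores the special-case zones around Norway and Svalbard. By this rule Budapest's longitude falls in zone 34, so zone 33 is being used slightly outside its nominal band: the round trip shown later still works, but projected distances may carry somewhat more distortion.

```python
import math

def utm_zone(longitude_deg):
    """Return the standard UTM zone number (1-60) for a longitude in degrees."""
    if not -180.0 <= longitude_deg <= 180.0:
        raise ValueError("longitude must be within [-180, 180]")
    # Zones are 6 degrees wide and start at 180 W; 180 E is folded into zone 60.
    return min(int(math.floor((longitude_deg + 180.0) / 6.0)) + 1, 60)

budapest_zone = utm_zone(19.0403594)  # longitude printed by the notebook
print(budapest_zone)
```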

The next step is to load our main dataset, which contains the neighborhood names of Budapest, Hungary. The following page is used for that purpose: https://en.wikipedia.org/wiki/List_of_districts_in_Budapest
For the dataset manipulation itself, I use Python's pandas package.

In [3]:
#parsing the dataset from url
url = 'https://en.wikipedia.org/wiki/List_of_districts_in_Budapest'

data = pd.read_html(url)
df = pd.DataFrame(data = data[0])
column_names = ["DistrictNumber","DistrictName","Neighborhoods","Sights"]
df.columns = column_names
df.drop(labels = 0, axis = 0, inplace = True)
df.drop(columns="Sights", inplace = True)
df

Unnamed: 0,DistrictNumber,DistrictName,Neighborhoods
1,I.,Várkerület(Castle District),"Buda Castle, Tabán, Gellérthegy, Krisztinaváro..."
2,II.,none,"Adyliget, Budakeszierdő, Budaliget, Csatárka, ..."
3,III.,Óbuda-Békásmegyer,"Óbuda, Aquincum, Aranyhegy, Békásmegyer, Csill..."
4,IV.,Újpest(New Pest),"Újpest, Megyer, Káposztásmegyer, Székesdűlő, I..."
5,V.,Belváros-Lipótváros(Inner City-Leopold Town),"Inner City, Lipótváros"
6,VI.,Terézváros(Theresa Town),Terézváros
7,VII.,Erzsébetváros(Elizabeth Town),Erzsébetváros
8,VIII.,Józsefváros(Joseph Town),"Józsefváros, Kerepesdűlő, Tisztviselőtelep"
9,IX.,Ferencváros(Francis Town),"Ferencváros, Gubacsidűlő, József Attila-lakótelep"
10,X.,Kőbánya(Quarry),"Felsőrákos, Gyárdűlő, Keresztúridűlő, Kőbánya-..."


This looks good, but to look up the geographic coordinates of each neighborhood I need one row per neighborhood, so some additional adjustments are required.

In [4]:
s = pd.DataFrame(df.Neighborhoods.str.split(",").tolist(), index=df.DistrictNumber).stack()
added_df = pd.DataFrame(columns=["DistrictNumber", "Neighborhood"])
added_df.Neighborhood = s.values
added_df.DistrictNumber = s.index.get_level_values("DistrictNumber")
added_df

Unnamed: 0,DistrictNumber,Neighborhood
0,I.,Buda Castle
1,I.,Tabán
2,I.,Gellérthegy
3,I.,Krisztinaváros
4,I.,southern Víziváros
5,II.,Adyliget
6,II.,Budakeszierdő
7,II.,Budaliget
8,II.,Csatárka
9,II.,Erzsébetliget
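
Splitting on a bare comma keeps the space that follows each comma, and any trailing period, as part of the name; the venue-lookup log further down prints names such as ` Tabán` and `Zöldmál.` for that reason. A minimal plain-Python sketch of the same one-row-per-neighborhood reshaping, with the names cleaned up, could look like this. The helper `split_neighborhoods` and the two sample rows are illustrative assumptions, not code from the notebook (the trailing period in the second row is added on purpose).

```python
def split_neighborhoods(rows):
    """Turn (district, "a, b, c") rows into one (district, name) pair per name."""
    pairs = []
    for district, names in rows:
        for raw_name in names.split(","):
            # Remove surrounding whitespace and a trailing period, if any.
            name = raw_name.strip().rstrip(".").strip()
            if name:
                pairs.append((district, name))
    return pairs

# Illustrative sample rows, not the full Wikipedia table.
sample_rows = [
    ("VI.", "Terézváros"),
    ("VIII.", "Józsefváros, Kerepesdűlő, Tisztviselőtelep."),
]
pairs = split_neighborhoods(sample_rows)
for district, name in pairs:
    print(district, name)
```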


Because the geolocator cannot handle some of the values, I need to drop those records and re-add some of them with a properly formatted name.

In [5]:
# Drop the records whose names the geolocator cannot handle;
# rows marked "replaced" are re-added with a corrected name in the next cell
rows_to_drop = [
    4,    # southern Víziváros, I.
    110,  # Sashegy, XII.
    104,  # Krisztinaváros, XII.
    165,  # Margit-sziget
    96,   # Tabán, XI.
    2,    # Gellérthegy, I.
    53,   # Szépvölgy, III.
    36,   # northern Víziváros, II. (replaced)
    25,   # Petneházy-rét, II. (replaced)
    34,   # Újlak, II.
    58,   # Újlak, III. (replaced)
    64,   # Inner City, V. (replaced)
    139,  # Szentgyörgytelep, XVI. (replaced)
    141,  # Huszkatelep, XVI.
    159,  # Pesterzsébet-Szabótelep, XX. (replaced)
]
added_df.drop(index=rows_to_drop, inplace=True)
added_df.reset_index(drop=True, inplace=True)

In [6]:
# added_df.replace(to_replace='northern Víziváros', value='Víziváros', inplace=True) is not working due to an internal error, 
# therefore I had to use something else

# data for replacing the inappropriate values, tested with Nominatim
vizivaros={"DistrictNumber":"II.", "Neighborhood":"Víziváros"}
petnehazyret={"DistrictNumber":"II.", "Neighborhood":"Petneházyrét"}
ujlak={"DistrictNumber":"III.", "Neighborhood":"Újlak"}
belvaros={"DistrictNumber":"V.", "Neighborhood":"Belváros"}
szentgyorgytelep ={"DistrictNumber":"XVI.", "Neighborhood":"Szent György telep"}
szabotelep ={"DistrictNumber":"XX.", "Neighborhood":"Szabótelep"}

# DataFrame.append was removed in pandas 2.0, so the new rows are added with pd.concat
replacement_rows = [vizivaros, petnehazyret, ujlak, belvaros, szentgyorgytelep, szabotelep]
added_df = pd.concat([added_df, pd.DataFrame(replacement_rows)], ignore_index=True)

At this point we have a proper name for each neighborhood. The next step is to find their geographic locations (latitude and longitude values), which the following code does with the geopy package.

In [7]:
addr_list = added_df.Neighborhood.to_list()
district_list = added_df.DistrictNumber.to_list()
lat_list = []
lon_list = []

geolocator = Nominatim(user_agent="cf_explorer")

for address in addr_list:
    location = geolocator.geocode(address)
    if location is None:
        # geocode returns None when Nominatim finds no match
        raise ValueError("No geocoding result for {!r}".format(address))
    lat_list.append(location.latitude)
    lon_list.append(location.longitude)

added_df["Latitude"] = lat_list
added_df["Longitude"] = lon_list
added_df

Unnamed: 0,DistrictNumber,Neighborhood,Latitude,Longitude
0,I.,Buda Castle,47.495991,19.039801
1,I.,Tabán,47.490893,19.042639
2,I.,Krisztinaváros,47.496865,19.029776
3,II.,Adyliget,47.54755,18.938984
4,II.,Budakeszierdő,47.510273,18.951182
5,II.,Budaliget,47.567579,18.940664
6,II.,Csatárka,47.531525,19.002578
7,II.,Erzsébetliget,47.561714,18.967558
8,II.,Erzsébettelek,47.544978,18.957372
9,II.,Felhévíz,47.5175,19.037039
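
geopy's `geocode` returns `None` when Nominatim finds no match, and the public Nominatim service asks clients to stay at or below one request per second. The loop can be hardened along the following lines. This is only a sketch: the helper `geocode_all`, its parameters, and the stand-in `fake_geocode` are assumptions made for illustration. In the notebook one would pass a small wrapper around `geolocator.geocode` that returns `(latitude, longitude)` instead of the stand-in.

```python
import time

def geocode_all(names, geocode, suffix=", Budapest, Hungary", pause_seconds=1.0):
    """Geocode every name; return (found, missing).

    `geocode` is any callable mapping a query string to a (lat, lon) tuple,
    or to None when nothing is found.
    """
    found, missing = {}, []
    for name in names:
        result = geocode(name + suffix)
        if result is None:
            missing.append(name)   # keep track instead of crashing
        else:
            found[name] = result
        time.sleep(pause_seconds)  # stay polite towards the geocoding service
    return found, missing

# Stand-in geocoder so the sketch runs offline; coordinates copied from the table above.
known = {
    "Buda Castle, Budapest, Hungary": (47.495991, 19.039801),
    "Tabán, Budapest, Hungary": (47.490893, 19.042639),
}

def fake_geocode(query):
    return known.get(query)

found, missing = geocode_all(["Buda Castle", "Tabán", "Nowhere"], fake_geocode, pause_seconds=0)
print(found)
print(missing)
```

Appending the city and country to each query, as the sketch does, also reduces the chance that a neighborhood name is matched to a place outside Budapest.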


Now we have the proper names and the latitude and longitude values for every neighborhood in Budapest, Hungary. Next, I look up the coordinates of Budapest itself with the geolocator object, verify the coordinate transformation, and then create a map on which the results can be visualized.

In [8]:
#Checking the coordinates of Budapest, Hungary for creating a map
name_bud = "Budapest, Hungary"

location_bud = geolocator.geocode(name_bud)
lat_bud = location_bud.latitude
lon_bud = location_bud.longitude

print('The geographical coordinates of Budapest are {}, {}.'.format(lat_bud, lon_bud))
print('Coordinate transformation check')
print('-------------------------------')
print('Budapest center longitude={}, latitude={}'.format(lon_bud, lat_bud))
x, y = lonlat_to_xy(lon_bud, lat_bud)
print('Budapest center UTM X={}, Y={}'.format(x, y))
lo, la = xy_to_lonlat(x, y)
print('Budapest center longitude={}, latitude={}'.format(lo, la))

The geographical coordinates of Budapest are 47.4979937, 19.0403594.
Coordinate transformation check
-------------------------------
Budapest center longitude=19.0403594, latitude=47.4979937
Budapest center UTM X=804283.1088846511, Y=5268422.807057823
Budapest center longitude=19.0403594, latitude=47.49799370000001




In [9]:
# create map of Budapest using latitude and longitude values
map_budapest = folium.Map(location=[lat_bud, lon_bud], zoom_start=11)

# add markers to map
for lat, lng, label in zip(added_df['Latitude'], added_df['Longitude'], added_df['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_budapest)  
    
map_budapest

The next step is to create the dataset of Crossfit Boxes in Budapest. Since no such dataset was available, I had to create my own.

In [10]:
# I had to create my own dataset of the Crossfit boxes in Budapest, because there was no accessible table for this
cf_list_bud = ["Crossfit Budapest", "Reebok Crossfit Duna", "Crossfit Tesseract", "Crossfit Everybody", 
               "Crossfit Mayfly", "Crossfit BBros", "Crossfit BBros 2", "Crossfit Grund", "Crossfit Bloodfit", 
              "Crossfit Glasshouse", "Crossfit Újbuda", "Bulls Park Crossfit", "Crossfit Corvinus"]
cf_bud_addr = ["Pillangó Park 8", "Hajógyári sziget 130", "Mexikói út 25", "Borszék köz 1", 
               "Kondorfa utca 6", "Zichy Géza utca 12", "Márton utca 4", "Hegedűs Gyula utca 14", "Tüzér utca 46",
              "Nándorfejérvári út 40", "Barázda utca 1", "Sopron út 9", "Gyömrői út 118"]
cf_bud_distr = ["XIV.", "III.", "XIV.", "XI.", "XI.", "XIV.", "IX.", "XIII.", "XIII.", "XI.", "XI.", "XI.", "XVIII."]
cf_df_col_list = ["DistrictNumber", "CrossfitBoxName", "Address", "Latitude", "Longitude"]
cf_df = pd.DataFrame(columns = cf_df_col_list)
cf_df.DistrictNumber = cf_bud_distr
cf_df.CrossfitBoxName = cf_list_bud
cf_df.Address = cf_bud_addr
cf_df

Unnamed: 0,DistrictNumber,CrossfitBoxName,Address,Latitude,Longitude
0,XIV.,Crossfit Budapest,Pillangó Park 8,,
1,III.,Reebok Crossfit Duna,Hajógyári sziget 130,,
2,XIV.,Crossfit Tesseract,Mexikói út 25,,
3,XI.,Crossfit Everybody,Borszék köz 1,,
4,XI.,Crossfit Mayfly,Kondorfa utca 6,,
5,XIV.,Crossfit BBros,Zichy Géza utca 12,,
6,IX.,Crossfit BBros 2,Márton utca 4,,
7,XIII.,Crossfit Grund,Hegedűs Gyula utca 14,,
8,XIII.,Crossfit Bloodfit,Tüzér utca 46,,
9,XI.,Crossfit Glasshouse,Nándorfejérvári út 40,,


In [11]:
cf_bud_district = ["XIV.", "III.", "XIV.", "XI.", "XI.", "XIV.", "IX.", "XIII.", "XIII.", "XI.", "XI.", "XI.", "XVIII."]
cf_bud_address = ["Pillangó Park 8", "Hajógyári sziget 130", "Mexikói út 25", "Borszék köz 1", 
               "Kondorfa utca 6", "Zichy Géza utca 12", "Márton utca 4", "Hegedűs Gyula utca 14", "Tüzér utca 46",
              "Nándorfejérvári út 40", "Barázda utca 1", "Sopron út 9", "Gyömrői út 118"]

cf_lat_list = []
cf_lon_list = []
cf_loc = []

# append the city and country so that each street address is unambiguous
cf_loc = ["{}, Budapest, Hungary".format(address) for address in cf_bud_address]

geolocator = Nominatim(user_agent="hun_explorer")

for address in cf_loc:
    location = geolocator.geocode(address)
    if location is None:
        # geocode returns None when Nominatim finds no match
        raise ValueError("No geocoding result for {!r}".format(address))
    cf_lat_list.append(location.latitude)
    cf_lon_list.append(location.longitude)

cf_df["Latitude"] = cf_lat_list
cf_df["Longitude"] = cf_lon_list
cf_df

Unnamed: 0,DistrictNumber,CrossfitBoxName,Address,Latitude,Longitude
0,XIV.,Crossfit Budapest,Pillangó Park 8,47.505122,19.121754
1,III.,Reebok Crossfit Duna,Hajógyári sziget 130,47.554403,19.059319
2,XIV.,Crossfit Tesseract,Mexikói út 25,47.519416,19.090593
3,XI.,Crossfit Everybody,Borszék köz 1,47.458482,19.026134
4,XI.,Crossfit Mayfly,Kondorfa utca 6,47.453356,19.043747
5,XIV.,Crossfit BBros,Zichy Géza utca 12,47.509135,19.09225
6,IX.,Crossfit BBros 2,Márton utca 4,47.476906,19.077625
7,XIII.,Crossfit Grund,Hegedűs Gyula utca 14,47.514103,19.053601
8,XIII.,Crossfit Bloodfit,Tüzér utca 46,47.524592,19.070284
9,XI.,Crossfit Glasshouse,Nándorfejérvári út 40,47.461456,19.045858


Next, I add the Crossfit Boxes to the previously created map to see where they are located.

In [12]:
# add markers to map
for lat, lng, label in zip(cf_df['Latitude'], cf_df['Longitude'], cf_df['CrossfitBoxName']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='red',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_budapest)
    folium.Circle([lat, lng], radius=500, color='white', fill=False).add_to(map_budapest) 
    
map_budapest

Now we can see that there are plenty of Crossfit Boxes across Budapest, and that their locations seem to follow a pattern: each Box tries to be as close to the inner city as it can and to position itself in a crowded area. To fully understand this behavior and to segment the neighborhoods, we need to create one final dataset.
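
The impression that the boxes sit close to the centre can be checked numerically by measuring each box's great-circle distance from the city-centre coordinates found earlier. The following is a plain-Python sketch using the haversine formula with a mean Earth radius of 6,371 km. The function name and the choice of three sample boxes are mine, while the coordinates are copied from the outputs above.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius, an approximation

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

center = (47.4979937, 19.0403594)  # Budapest, as returned by the geolocator
boxes = {
    "Crossfit Grund": (47.514103, 19.053601),
    "Crossfit Budapest": (47.505122, 19.121754),
    "Reebok Crossfit Duna": (47.554403, 19.059319),
}
distances = {name: haversine_m(*center, *coords) for name, coords in boxes.items()}
for name, metres in sorted(distances.items(), key=lambda item: item[1]):
    print(f"{name}: {metres / 1000:.1f} km from the centre")
```

The same function could also be used to test whether a venue lies inside the 500 m circles drawn on the map.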

Concatenating the two separate datasets into a larger one gives us that final dataset.

In [13]:
# Concatenate the two location lists to see how the clustering performs on them together
bp_df = added_df
bud_cf_df = cf_df
bud_cf_df.rename(columns={'CrossfitBoxName':'Neighborhood'}, inplace=True)
bud_cf_df.drop(labels='Address',axis=1, inplace=True)
total_df = pd.concat([bp_df,bud_cf_df], ignore_index=True)
total_df

Unnamed: 0,DistrictNumber,Neighborhood,Latitude,Longitude
0,I.,Buda Castle,47.495991,19.039801
1,I.,Tabán,47.490893,19.042639
2,I.,Krisztinaváros,47.496865,19.029776
3,II.,Adyliget,47.54755,18.938984
4,II.,Budakeszierdő,47.510273,18.951182
5,II.,Budaliget,47.567579,18.940664
6,II.,Csatárka,47.531525,19.002578
7,II.,Erzsébetliget,47.561714,18.967558
8,II.,Erzsébettelek,47.544978,18.957372
9,II.,Felhévíz,47.5175,19.037039


Visualizing the dataset gives us some insight. First of all, we can see exactly where the neighborhoods are, and therefore how they are distributed across the city.

In [14]:
# create map of Budapest using latitude and longitude values
map_budapest = folium.Map(location=[lat_bud, lon_bud], zoom_start=11)

# add markers to map
for lat, lng, label in zip(total_df['Latitude'], total_df['Longitude'], total_df['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_budapest)  
    
map_budapest

All preparations have been made to start working on the data that will feed the model. I use Foursquare data to cluster the neighborhoods based on the venues that users have recorded around them.

First, I need the nearby venues, which then have to be brought into the proper format. After that, I determine the most common venues and create a dataframe for them.
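
The venue lookup in the next cell builds its query string by hand with `str.format`. As an alternative sketch, the standard library's `urllib.parse.urlencode` can assemble and escape the same parameters. The endpoint and parameter names below are the ones used in this notebook, while the helper name `build_explore_url` and the placeholder credentials and version string are assumptions for illustration.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"

def build_explore_url(lat, lng, client_id, client_secret, version, radius=500, limit=500):
    """Return the explore URL for one location with properly escaped parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    return EXPLORE_ENDPOINT + "?" + urlencode(params)

# Placeholder credentials and version; replace with real values before sending a request.
url = build_explore_url(47.495991, 19.039801, "MY_ID", "MY_SECRET", "20200101")
query = parse_qs(urlsplit(url).query)
print(query["ll"], query["radius"])
```

Passing `radius` and `limit` as arguments also avoids relying on a module-level `LIMIT` that has to be defined before the function is called.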

In [16]:
CLIENT_ID = '' # your Foursquare ID
CLIENT_SECRET = '' # your Foursquare Secret
VERSION = '' # Foursquare API version

#print('Your credentials:')
#print('CLIENT_ID: ' + CLIENT_ID)
#print('CLIENT_SECRET:' + CLIENT_SECRET)

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

total_data = total_df
LIMIT = 500
total_bp_venues = getNearbyVenues(names=total_data['Neighborhood'],
                                   latitudes=total_data['Latitude'],
                                   longitudes=total_data['Longitude']
                                  )

print(total_bp_venues.shape)
total_bp_venues.head()

total_bp_venues.groupby('Neighborhood').count()
print('There are {} unique categories.'.format(len(total_bp_venues['Venue Category'].unique())))

Buda Castle
 Tabán
 Krisztinaváros
Adyliget
 Budakeszierdő
 Budaliget
 Csatárka
 Erzsébetliget
 Erzsébettelek
 Felhévíz
 Gercse
 Hársakalja
 Hárshegy
 Hűvösvölgy
 Kővár
 Kurucles
 Lipótmező
 Máriaremete
 Nyék
 Országút
 Pálvölgy
 Pasarét
 Pesthidegkút-Ófalu
 Remetekertváros
 Rézmál
 Rózsadomb
 Szemlőhegy
 Széphalom
 Szépilona
 Szépvölgy
 Törökvész
 Vérhalom
 Zöldmál.
Óbuda
 Aquincum
 Aranyhegy
 Békásmegyer
 Csillaghegy
 Csúcshegy
 Filatorigát
 Hármashatár-hegy
 Kaszásdűlő
 Mátyáshegy
 Mocsárosdűlő
 Óbudai-sziget
 Remetehegy
 Rómaifürdő
 Solymárvölgy
 Táborhegy
 Testvérhegy
 Törökkő
 Ürömhegy
Újpest
 Megyer
 Káposztásmegyer
 Székesdűlő
 Istvántelek.
 Lipótváros
Terézváros
Erzsébetváros
Józsefváros
 Kerepesdűlő
 Tisztviselőtelep
Ferencváros
 Gubacsidűlő
 József Attila-lakótelep
Felsőrákos
 Gyárdűlő
 Keresztúridűlő
 Kőbánya-Kertváros
Albertfalva
 Dobogó
 Gazdagrét
 Gellérthegy
 Hosszúrét
 Kamaraerdő
 Kelenföld
 Kelenvölgy
 Kőérberek
 Lágymányos
 Madárhegy
 Őrmező
 Örsöd
 Péterhegy
 Pösing

In [17]:
# one hot encoding
total_bp_onehot = pd.get_dummies(total_bp_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
total_bp_onehot['Neighborhood'] = total_bp_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [total_bp_onehot.columns[-1]] + list(total_bp_onehot.columns[:-1])
total_bp_onehot = total_bp_onehot[fixed_columns]

total_bp_grouped = total_bp_onehot.groupby('Neighborhood').mean().reset_index()

num_top_venues = 5

for hood in total_bp_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = total_bp_grouped[total_bp_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

---- Akadémiaújtelep----
                  venue  freq
0                Bakery  0.25
1  Fast Food Restaurant  0.25
2              Pharmacy  0.25
3                  Park  0.25
4           Yoga Studio  0.00


---- Angyalföld----
                   venue  freq
0                    Gym  0.25
1     Light Rail Station  0.25
2           Tram Station  0.25
3                 Bistro  0.25
4  Outdoors & Recreation  0.00


---- Aquincum----
           venue  freq
0  Historic Site  0.20
1         Bakery  0.13
2       Pharmacy  0.07
3          Plaza  0.07
4  Grocery Store  0.07


---- Budafok----
           venue  freq
0       Bus Stop  0.33
1     Playground  0.17
2  Grocery Store  0.17
3           Park  0.17
4    Supermarket  0.17


---- Budakeszierdő----
              venue  freq
0     Historic Site   1.0
1       Yoga Studio   0.0
2      Perfume Shop   0.0
3  Pedestrian Plaza   0.0
4       Pastry Shop   0.0


---- Budaliget----
               venue  freq
0           Bus Stop  0.50
[output truncated]

4    Pastry Shop  0.00


---- Nagyzugló----
                           venue  freq
0                       Bus Stop  0.12
1              Electronics Store  0.12
2                     Playground  0.06
3                       Boutique  0.06
4  Paper / Office Supplies Store  0.06


---- Nyék----
                venue  freq
0        Tennis Court  0.29
1  Athletics & Sports  0.14
2                Park  0.14
3            Bus Stop  0.14
4        Soccer Field  0.14


---- Németvölgy----
                  venue  freq
0            Playground  0.06
1             Wine Shop  0.04
2              Boutique  0.04
3    Italian Restaurant  0.04
4  Gym / Fitness Center  0.04


---- Népsziget----
        venue  freq
0   Beach Bar  0.17
1         Bar  0.08
2  Restaurant  0.08
3   Nightclub  0.08
4     Dog Run  0.08


---- Orbánhegy----
               venue  freq
0           Bus Stop  0.21
1     Ice Cream Shop  0.14
2  Korean Restaurant  0.07
3      Grocery Store  0.07
4             Bakery  0.07


---- Orszá
[output truncated]

                     venue  freq
0            Grocery Store  0.09
1     Gym / Fitness Center  0.07
2                 Bus Stop  0.07
3              Coffee Shop  0.07
4  Health & Beauty Service  0.04


---- Vérhalom----
                       venue  freq
0                    Dog Run  0.12
1                  Wine Shop  0.06
2                     Bakery  0.06
3  Middle Eastern Restaurant  0.06
4                       Park  0.06


---- Wekerletelep----
       venue  freq
0   Bus Stop  0.16
1       Park  0.16
2       Café  0.08
3  Wine Shop  0.04
4   Pharmacy  0.04


---- Zöldmál.----
                 venue  freq
0             Bus Stop   0.5
1            Wine Shop   0.1
2  Japanese Restaurant   0.1
3          Snack Place   0.1
4          Pizza Place   0.1


---- Árpádföld----
               venue  freq
0           Bus Stop  0.29
1         Restaurant  0.14
2      Grocery Store  0.07
3  Food & Drink Shop  0.07
4               Park  0.07


---- Óbudai-sziget----
               venue  freq
0    
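
Averaging the one-hot columns per neighborhood gives, for each category, its share among that neighborhood's venues. The same quantity can be sketched without pandas using `collections.Counter`. The helper name and the sample records are assumptions, with the four Akadémiaújtelep categories taken from the output above.

```python
from collections import Counter, defaultdict

def category_frequencies(venues):
    """Map each neighborhood to {category: share of its venues}."""
    counts = defaultdict(Counter)
    for neighborhood, category in venues:
        counts[neighborhood][category] += 1
    return {
        neighborhood: {cat: n / sum(counter.values()) for cat, n in counter.items()}
        for neighborhood, counter in counts.items()
    }

# Illustrative (neighborhood, venue category) records.
sample_venues = [
    ("Akadémiaújtelep", "Bakery"),
    ("Akadémiaújtelep", "Fast Food Restaurant"),
    ("Akadémiaújtelep", "Pharmacy"),
    ("Akadémiaújtelep", "Park"),
    ("Budakeszierdő", "Historic Site"),
]
frequencies = category_frequencies(sample_venues)
print(frequencies["Akadémiaújtelep"])  # every category has share 0.25
```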

In [18]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted_total_bp = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted_total_bp['Neighborhood'] = total_bp_grouped['Neighborhood']

for ind in np.arange(total_bp_grouped.shape[0]):
    neighborhoods_venues_sorted_total_bp.iloc[ind, 1:] = return_most_common_venues(total_bp_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted_total_bp.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Akadémiaújtelep,Bakery,Park,Fast Food Restaurant,Pharmacy,Falafel Restaurant,Farm,Farmers Market,Field,Fish & Chips Shop,Wine Shop
1,Angyalföld,Light Rail Station,Bistro,Tram Station,Gym,Fish & Chips Shop,Falafel Restaurant,Farm,Farmers Market,Fast Food Restaurant,Field
2,Aquincum,Historic Site,Bakery,Karaoke Bar,Greek Restaurant,Light Rail Station,Bar,Grocery Store,Gym,Train Station,Pharmacy
3,Budafok,Bus Stop,Grocery Store,Playground,Supermarket,Park,Fish Market,Farm,Farmers Market,Fast Food Restaurant,Field
4,Budakeszierdő,Historic Site,Wine Shop,Fabric Shop,Farm,Farmers Market,Fast Food Restaurant,Field,Fish & Chips Shop,Fish Market,Flea Market
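
The column labels rely on a three-element suffix list plus an exception handler for everything else, which is fine for ten columns. A slightly more general sketch of the labelling, with a hypothetical helper `ordinal` that also handles cases such as 11th and 22nd, is:

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 11 <= n % 100 <= 13:
        suffix = "th"  # 11th, 12th and 13th are irregular
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)

num_top_venues = 10
columns = ["Neighborhood"] + [
    "{} Most Common Venue".format(ordinal(rank)) for rank in range(1, num_top_venues + 1)
]
print(columns[1], "...", columns[-1])
```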


It is also informative to check how the neighborhoods are distributed over the 1st most common venue category. As the result shows, the most frequent top categories are bus stops, grocery stores and playgrounds.

In [19]:
neighborhoods_venues_sorted_total_bp['1st Most Common Venue'].value_counts()

Bus Stop                       47
Grocery Store                  11
Playground                      7
Bakery                          5
Scenic Lookout                  5
Park                            5
Hungarian Restaurant            4
Coffee Shop                     4
Historic Site                   3
Bar                             3
Dog Run                         3
Diner                           3
Pub                             3
Fast Food Restaurant            2
Mountain                        2
Athletics & Sports              2
Electronics Store               2
Café                            2
Gym                             2
Cave                            2
Restaurant                      2
Track                           1
Beach Bar                       1
Ice Cream Shop                  1
Art Gallery                     1
Wine Bar                        1
Clothing Store                  1
Burger Joint                    1
Business Service                1
Exhibit       
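
`value_counts` simply tallies how often each category appears in the first-place column. For readers without pandas at hand, the same tally can be sketched with `collections.Counter`. The input list below reuses the first-place venues of the five rows shown in the earlier table and is only a small illustrative sample, so its ranking differs from the full result.

```python
from collections import Counter

# First-place venues of the five neighborhoods displayed earlier (illustrative sample).
first_place = ["Bakery", "Light Rail Station", "Historic Site", "Bus Stop", "Historic Site"]

tally = Counter(first_place)
top_category, top_count = tally.most_common(1)[0]
print(top_category, top_count)
```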

Now that the most-common-venues dataset is available for the Budapest neighborhoods and Crossfit Boxes, all preparations have been made to start building the clustering model.