# Optimal Neighborhood Finder

## Introduction and Motivation

Whenever someone moves to a new city, it is hard to know which neighborhood is the best place to rent a room, an apartment, or a condo according to one's needs. Sometimes we have friends in the new city to ask for tips and advice, but most of the time we are on our own with our families, moving to a new city for work reasons (a promotion, a replacement, etc.).

Likewise, if someone wants to open a new business, for example a Japanese restaurant, it is not easy to choose the neighborhood in which to start building that culinary dream. It is important to know where the business is most likely to succeed, given the other businesses around it, the kind of people living nearby (e.g. income level, education level), and so on.

Therefore, based on the idea proposed in the assignment instructions and taking into account the problems outlined above, this project aims to create a model that finds the optimal placement of a venue, or the optimal place to move to, according to a business's or a person's needs.
The model is intended for anyone who needs to find optimal placements based on the characteristics of the venues in the neighborhoods of a city.

For this project, the Foursquare location data will be used. However, the project will be written in such a manner that extensions using other databases are easily implemented -- and even encouraged. For example, it is possible to add data about criminal records in a neighborhood so that this information is also taken into account when training the model.

## Data

For this project, we will use the Foursquare location data, which can be obtained through RESTful API calls. We will explore many locations in a city, gathering information about the venues around each of them. These locations can be specified by latitude and longitude coordinates or by neighborhood names. As a showcase, all the neighborhoods of a specific city will be surveyed first.

With the information about the venues around a given neighborhood, we create a dataframe showing which kinds (i.e. categories) of venues appear around each neighborhood in that city and how frequently they appear. These frequencies are then weighted according to the user's needs. For example, if we are looking for a place with a large number of coffee shops, we can weight the frequency of these venues in our dataframe for each neighborhood, making it easier to identify locations with this characteristic. Negative weights may also be applied, meaning that the user wants locations or neighborhoods where that category appears as little as possible.

## Methodology

For this project, the methodology will be as follows.

First, the user inputs which kinds (categories) of venues he or she would like to search for in a neighborhood. This input is given as pairs of category name and weight. Weights can also be negative, indicating that the user would like the neighborhood to have as few venues of that category as possible.

Then, we request location data from the Foursquare API through an **explore** endpoint request. We will gather data around a neighborhood in a radius of 700 meters -- this should be adjustable by the user as well. A hard limit of 100 venues per request will be set, and this is not adjustable, since analyzing the closest 100 venues would be more than enough for our purposes here.
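As a minimal sketch of how such a request URL could be assembled, the helper below builds the query string for the explore endpoint with an adjustable radius and a fixed limit of 100 venues. The function name `build_explore_url`, the constant names, and the placeholder credentials are illustrative assumptions, not part of the project code; no request is actually sent.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"
VENUE_LIMIT = 100  # hard limit described above; deliberately not exposed to the user


def build_explore_url(lat, lng, client_id, client_secret, version="20180605", radius=700):
    """Return the explore URL for a point; `radius` (in meters) is user-adjustable."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",
        "radius": radius,
        "limit": VENUE_LIMIT,
    }
    return EXPLORE_ENDPOINT + "?" + urlencode(params)


# Placeholder credentials and coordinates, for illustration only
url = build_explore_url(-33.8688, 151.2093, "MY_ID", "MY_SECRET")
query = parse_qs(urlsplit(url).query)
print(query["radius"], query["limit"])
```

Keeping the limit as a module-level constant rather than a parameter mirrors the decision that only the radius should be adjustable.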

With the data for a neighborhood in hand (or, more precisely, in memory), we find the frequency of all venue categories relative to the total number of categories around the neighborhood. Then, we weight each category according to user input. For negative inputs, the weight will actually be abs(x)^-1, where x is the negative input. For example, if the user inputs a weight of -2, then the actual weight is 0.5. After that, we find the neighborhood that has the highest frequency of venues within the categories with positive weights, and the lowest frequency of venues within the categories with negative weights. This will be our "optimal neighborhood".
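The weighting step can be illustrated with a small, made-up example. The counts below are hypothetical, and the way positive and negative categories are combined into a single score (weighted positives minus weighted negatives) is an assumption made for this sketch only; the text above does not fix a specific formula.

```python
import pandas as pd

# Hypothetical venue counts per neighborhood (rows) and category (columns)
counts = pd.DataFrame(
    {"Café": [8, 1, 3], "Gym": [2, 0, 1], "Music Venue": [0, 2, 1], "Pharmacy": [2, 1, 5]},
    index=["A", "B", "C"],
)

# Relative frequency of each category among the venues of a neighborhood
freq = counts.div(counts.sum(axis=1), axis=0)

# User input: negative values mark categories to avoid
user_weights = {"Café": 4, "Gym": 2, "Music Venue": -2}

# Negative inputs become abs(x)^-1, e.g. -2 -> 0.5
effective = {cat: (1 / abs(w) if w < 0 else w) for cat, w in user_weights.items()}

wanted = [cat for cat, w in user_weights.items() if w > 0]
unwanted = [cat for cat, w in user_weights.items() if w < 0]

# Assumed score: reward wanted categories, penalize unwanted ones
score = sum(freq[cat] * effective[cat] for cat in wanted) \
    - sum(freq[cat] * effective[cat] for cat in unwanted)

optimal = score.idxmax()
print(score.round(3).to_dict(), optimal)
```

In this toy data, neighborhood A has the largest share of cafés and gyms and no music venues, so it obtains the highest score.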

After finding the "optimal neighborhood", we take its coordinates (i.e., features) as one of the centroids for a k-means clustering algorithm. The remaining centroids (depending on how many clusters we choose) will be randomly chosen. We then fit a k-means model to the weighted relative frequency dataset to find neighborhoods that are closely related to our optimal neighborhood.
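A minimal sketch of this seeding strategy with scikit-learn follows. The data is random and the index of the optimal neighborhood is assumed to be known; `KMeans` accepts an array of initial centroids through its `init` parameter, in which case a single initialization (`n_init=1`) is used. Note that k-means moves the centroids while fitting, so the optimal neighborhood only serves as a starting point for one of them.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical weighted relative-frequency matrix: 30 neighborhoods x 3 categories
X = rng.random((30, 3))
optimal_idx = 0  # assumed to come from the scoring step
k = 4            # number of clusters, chosen arbitrarily for this sketch

# Pick the remaining k-1 initial centroids at random among the other neighborhoods
candidates = np.delete(np.arange(len(X)), optimal_idx)
others = rng.choice(candidates, size=k - 1, replace=False)
init = np.vstack([X[optimal_idx], X[others]])

km = KMeans(n_clusters=k, init=init, n_init=1).fit(X)

# Neighborhoods that ended up in the same cluster as the optimal one
optimal_label = km.labels_[optimal_idx]
similar = np.flatnonzero(km.labels_ == optimal_label)
print(similar)
```

The cluster of interest is looked up through the label of the optimal neighborhood after fitting, rather than assumed to be cluster 0, because assignments can change during the iterations.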

Finally, the labeled, clustered data is plotted on a map using the folium library so that the user can visualize the results. For ease of visualization, the optimal neighborhood will have a dark color and no transparency, with opacity gradually decreasing from the optimal neighborhood to the least-related neighborhood that still has the desired venue categories around it. Only the neighborhoods in the same cluster as the optimal neighborhood will be shown.
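One possible way to derive the opacities is a linear mapping from each neighborhood's distance to the optimal one in feature space. The function name `opacity_from_distance` and the floor of 0.2 are assumptions for this sketch; the resulting values could be passed as `fill_opacity` to folium markers.

```python
import numpy as np


def opacity_from_distance(distances, min_opacity=0.2):
    """Map distances to opacities: closest -> 1.0, farthest -> min_opacity."""
    d = np.asarray(distances, dtype=float)
    span = d.max() - d.min()
    if span == 0:
        # All neighborhoods are equally related: draw them fully opaque
        return np.ones_like(d)
    return 1.0 - (1.0 - min_opacity) * (d - d.min()) / span


# Hypothetical distances to the optimal neighborhood (0 is the optimal one itself)
opacities = opacity_from_distance([0.0, 1.0, 2.0])
print(opacities)
```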

## Results

In this section, we proceed as outlined in the **Methodology** section and discuss the results as we obtain them.

We will assume here that our user is someone moving from the city of Melbourne to Sydney, in Australia.
Back in Melbourne, she really enjoyed her neighborhood, especially because of the many cafés and coffee shops nearby. She also enjoys a healthy lifestyle, so having a gym nearby is a plus, since outdoor activity is not always possible. However, she was sometimes annoyed by the music festivals happening just around the corner, so she would like to avoid having a venue that holds music festivals near her new home.

Hence, she will input positive weights of 4 and 2 for the café and gym categories, respectively, and a negative weight of -2 for places that can be used for music festivals.

In [178]:
# First, let's import our modules/libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import colors
import folium
from geopy.geocoders import Nominatim
import bs4
bs = bs4.BeautifulSoup
import requests

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [336]:
# User input variables
user_weights = [['café', 4], ['gym', 2], ['music festival',-2]]

In [64]:
# We then convert the negative weights to x^-1
user_weights = [[x[0],abs(x[1])**-1] if x[1]<0 else [x[0],x[1]] for x in user_weights]
user_weights

[['café', 4], ['gym', 2], ['music festival', 0.5]]

Now that we have the user input, we need to create our dataframe containing the neighborhoods* of Sydney and their location in terms of latitude and longitude.

* Note: The purely geographical division that is usually called neighborhood is called suburbs in Sydney.

In [65]:
# We will read the list of all suburbs in Sydney using the BeautifulSoup library from the page at the following link:
# https://www.walksydneystreets.net/suburbssydneyall.htm
page = requests.get('https://www.walksydneystreets.net/suburbssydneyall.htm')
soup = bs(page.content, 'html.parser')

In [119]:
# Now we dive into the tags until we find the one containing the lists
# Since we know that Manly is a suburb of Sydney, we check each 'Tag' object looking for Manly in its text
# Also, to avoid choosing the incorrect tag, we make sure that the tag also contains the string Ashbury, another suburb

# First, we find the <html> tag
for ele in soup.children:
    if type(ele) is bs4.element.Tag:
        html = ele
        
# Now we search for our texts within the <html> tag.
# Remember that, like any other structured, tagged document (e.g. XML, JSON, etc),
# the document may be seen as a tree-like structure, where each tag may have its own
# child tags. Thus, we need to search through this tree, with root <html>, to find our texts
# It is easier in this case to search in a breadth-first manner

# We first make a list of the children tag whose parent is the root <html>
# and then loop through it finding our tag of interest (i.e. the one containing our texts)
# and make it the new root. Loop over it again, and so on, until no tags are left
root = html

while (True):
    children_list = [tag for tag in root.children if type(tag) is bs4.element.Tag]  # Tag lives in bs4.element
    if len(children_list) == 0: # Empty list: we've exhausted the tag list
        break
    found = False
    for child_tag in children_list:
        if (child_tag.get_text().lower().find('manly') >= 0) and (child_tag.get_text().lower().find('ashbury') >= 0):
            # We found the tag containing our texts, so make it root and loop over it again
            root = child_tag
            found = True
    # We may have run through the list without finding our texts; then, we should return as well with the last root
    if found == False:
        break
    
# The above algorithm is VERY naive and may return an incorrect result
# So we take the returned list of suburbs and find all the same tags with same class attributes in html
suburb_temp = list(child_tag.children)[0]
suburb_tag = suburb_temp.name
suburb_class = suburb_temp.get_attribute_list('class')[0]

suburb_list_tags = html.find_all(suburb_tag, class_=suburb_class)

# Now let us analyze our final results
suburb_list_tags

[<p class="suburb"> 
                           Abbotsbury<br/>
                           Abbotsford<br/>
                           Acacia Gardens<br/>
                           Agnes Banks<br/>
                           Airds<br/>
                           Alexandria<br/>
                           Alfords Point<br/>
                           Allambie Heights<br/>
                           Allawah<br/>
                           Ambarvale<br/>
                           Annandale<br/>
                           Annangrove<br/>
                           Arcadia<br/>
                           Arncliffe<br/>
                           Arndell Park<br/>
                           Artarmon<br/>
                           Ashbury<br/>
                           Ashcroft<br/>
                           Ashfield<br/>
                           Asquith<br/>
                           Auburn<br/>
                           Austral<br/>
                           Avalon Beach<br/>
     

In [137]:
# We see that the list also contains some "urban places" and "neighborhoods" (as per the link)
# Let us keep it this way.

# Now we simply use the get_text() method to get rid of any html tags in our results set
# We use .split() to remove newlines
suburb_list = []
for sublist in suburb_list_tags:
    for substr in sublist.get_text().split('\r\n'):
        tempstr = substr.strip()
        if bool(tempstr): suburb_list.append(tempstr) # Ignore empty strings

suburb_list[:10]

['Abbotsbury',
 'Abbotsford',
 'Acacia Gardens',
 'Agnes Banks',
 'Airds',
 'Alexandria',
 'Alfords Point',
 'Allambie Heights',
 'Allawah',
 'Ambarvale']

With a list of the suburbs of Sydney ready, we can find the latitude and longitude coordinates of each of them to populate our dataframe.

We will use the **geopy** module for this task.

In [161]:
geo = Nominatim(user_agent="foursquare_agent")
df_rows = []

for suburb in suburb_list:
    address = suburb + ', Sydney, Australia'
    loc = geo.geocode(address)
    if (loc): df_rows.append([suburb, loc.latitude, loc.longitude]) # Skip suburbs that could not be geocoded
    
df_rows

[['Abbotsbury', -33.8692846, 150.8667029],
 ['Abbotsford', -33.8505529, 151.129759],
 ['Acacia Gardens', -33.7324595, 150.9125321],
 ['Agnes Banks', -33.6145082, 150.7114482],
 ['Airds', -34.09, 150.8261111],
 ['Alexandria', -33.9091568, 151.1921281],
 ['Alfords Point', -33.9839091, 151.0241615],
 ['Allambie Heights', -33.7705067, 151.249675],
 ['Allawah', -33.9696293, 151.1142847],
 ['Ambarvale', -34.0844252, 150.8017477],
 ['Annandale', -33.881224, 151.1709976],
 ['Annangrove', -33.6574838, 150.9460077],
 ['Arcadia', -33.6227289, 151.0605602],
 ['Arncliffe', -33.938236, 151.145508],
 ['Arndell Park', -33.79, 150.8761111],
 ['Artarmon', -33.8094124, 151.1857609],
 ['Ashbury', -33.9001107, 151.1180724],
 ['Ashcroft', -33.915, 150.9011111],
 ['Ashfield', -33.8894781, 151.1274125],
 ['Asquith', -33.6871935, 151.1110257],
 ['Auburn', -33.8545702, 151.0255673],
 ['Austral', -33.9282672, 150.8082824],
 ['Avalon Beach', -33.6365039, 151.3290299],
 ['Badgerys Creek', -33.8816671, 150.7441627]
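Since the loop above sends one geocoding request per suburb, it may be worth spacing the requests out; the public Nominatim service asks clients to stay at or below roughly one request per second. The sketch below is one way to do that. The function `geocode_suburbs` and the fake geocoder are illustrative assumptions; in the notebook, `geo.geocode` would be passed as the `geocode` argument.

```python
import time
from types import SimpleNamespace


def geocode_suburbs(suburbs, geocode, delay_seconds=1.0, suffix=", Sydney, Australia"):
    """Geocode each suburb with a pause between calls; skip suburbs with no result."""
    rows = []
    for i, suburb in enumerate(suburbs):
        if i > 0:
            time.sleep(delay_seconds)  # throttle consecutive requests
        loc = geocode(suburb + suffix)
        if loc is not None:
            rows.append([suburb, loc.latitude, loc.longitude])
    return rows


# Stand-in geocoder for illustration: knows one suburb, returns None otherwise
def fake_geocode(address):
    known = {"Manly, Sydney, Australia": SimpleNamespace(latitude=-33.8, longitude=151.28)}
    return known.get(address)


rows = geocode_suburbs(["Manly", "Nowhere"], fake_geocode, delay_seconds=0)
print(rows)
```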

Now our list *df_rows* contains lists with the Suburb name, Latitude and Longitude, so we can create our dataframe.

In [237]:
sydney_df = pd.DataFrame(df_rows)
sydney_df.columns = ['Suburb', 'Latitude','Longitude']

sydney_df.head()

Unnamed: 0,Suburb,Latitude,Longitude
0,Abbotsbury,-33.869285,150.866703
1,Abbotsford,-33.850553,151.129759
2,Acacia Gardens,-33.732459,150.912532
3,Agnes Banks,-33.614508,150.711448
4,Airds,-34.09,150.826111


Let us check how many suburbs we have in the list. This also gives us the number of requests we will need to send to the Foursquare API (remember that we have a daily quota).

In [238]:
sydney_df.shape

(759, 3)

Just for visualization purposes, let us create a map centered at Sydney with our Suburbs shown.

In [239]:
# Get Sydney coordinates
syd_loc = geo.geocode('Sydney, Australia')
syd_lat = syd_loc.latitude
syd_long = syd_loc.longitude

syd_map = folium.Map(location=[syd_lat, syd_long], zoom_start=12)

for lab, lat, long in zip(sydney_df['Suburb'], sydney_df['Latitude'], sydney_df['Longitude']):
    folium.CircleMarker([lat, long],
                       radius=3,
                       popup=lab,
                       color='blue',
                       fill=True,
                       fill_color='blue',
                       fill_opacity=1).add_to(syd_map)


syd_map

#### Reducing the data size

From the shape of our dataframe (i.e. the number of rows) and from the map above, it is clear that we have too many suburbs to analyze. In addition, many of them are only a few hundred meters away from each other.

Having this many locations causes two problems. First, since many suburbs are so close together, exploring the venues around them will give almost the same results for some sets of suburbs. Second, since we explore each location separately, this would require a large number of calls to the Foursquare API, which could quickly exhaust our daily quota.

Hence, we will cluster suburbs together using their locations and then use the cluster centroids as the location of each neighborhood (set of suburbs). To cluster the suburbs, we use the K-Means algorithm, with the latitude and longitude of each suburb as the features. By trial and error, I found that fitting the model with 200 cluster centroids gives us neighborhoods that are roughly 1 kilometer apart from each other. Therefore, each cluster centroid represents a neighborhood containing the suburbs within a radius of roughly 1000 meters, which is still a comfortable walking distance.

In [235]:
from sklearn.cluster import KMeans

In [240]:
kmlat = KMeans(init="k-means++", n_init=10, n_clusters=200)

fitdf = sydney_df.drop('Suburb', axis=1)

kmlat.fit(fitdf)

sydney_df['Cluster'] = kmlat.labels_

sydney_df.head()

Unnamed: 0,Suburb,Latitude,Longitude,Cluster
0,Abbotsbury,-33.869285,150.866703,8
1,Abbotsford,-33.850553,151.129759,184
2,Acacia Gardens,-33.732459,150.912532,34
3,Agnes Banks,-33.614508,150.711448,106
4,Airds,-34.09,150.826111,38


In [256]:
sydclust_map = folium.Map(location=[syd_lat, syd_long], zoom_start=12)

cols = plt.cm.Spectral(np.linspace(0,1,len(sydney_df['Cluster'].unique())))
cols = [colors.rgb2hex(c) for c in cols]

for lat, long, clus in zip(sydney_df['Latitude'], sydney_df['Longitude'], sydney_df['Cluster']):
    folium.CircleMarker([lat, long],
                        radius=3,
                        weight=1,
                        popup=clus,
                        color='black',
                        fill=True,
                        fill_color='blue',
                        fill_opacity=0.9).add_to(sydclust_map)

for clus, latlong in enumerate(kmlat.cluster_centers_):
    folium.Circle([latlong[0], latlong[1]],
                  radius=1000,
                  color=cols[clus],
                  weight=3,
                  popup=clus,
                  fill=True,
                  fill_color=cols[clus],
                  fill_opacity=0.3).add_to(sydclust_map)
    
sydclust_map
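The spacing between centroids can also be checked numerically instead of by eye. The sketch below computes, for each centroid, the great-circle distance to its nearest neighbor using the haversine formula. The three coordinates are made up for illustration; in the notebook, `kmlat.cluster_centers_` would take their place. The helper names are assumptions.

```python
import numpy as np


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between points given in degrees."""
    earth_radius_km = 6371.0
    phi1, phi2 = np.radians(lat1), np.radians(lat2)
    dphi = phi2 - phi1
    dlambda = np.radians(lon2) - np.radians(lon1)
    a = np.sin(dphi / 2) ** 2 + np.cos(phi1) * np.cos(phi2) * np.sin(dlambda / 2) ** 2
    return 2 * earth_radius_km * np.arcsin(np.sqrt(a))


def nearest_neighbor_km(centers):
    """Distance from each (lat, lon) center to its closest other center."""
    centers = np.asarray(centers, dtype=float)
    lat = centers[:, 0][:, None]
    lon = centers[:, 1][:, None]
    dist = haversine_km(lat, lon, lat.T, lon.T)  # pairwise distance matrix
    np.fill_diagonal(dist, np.inf)               # ignore distance to itself
    return dist.min(axis=1)


# Hypothetical centroids: the first two are 0.01 degrees of latitude apart
centers = [[-33.87, 151.20], [-33.88, 151.20], [-33.87, 151.22]]
nn = nearest_neighbor_km(centers)
print(nn.round(3))
```

Summary statistics of these distances, such as the median, give a quick indication of whether a chosen number of clusters yields the intended spacing.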

Now we create a new dataframe containing the regions (clusters) we have just obtained, and their latitude and longitude information.

In [281]:
regions_suburbs = []

for clus in range(len(sydney_df['Cluster'].unique())):
    my_members = sydney_df[sydney_df['Cluster'] == clus]['Suburb']
    regions_suburbs.append(', '.join(list(my_members)))

list_df = []
for reg, subs, latlong in zip(range(len(sydney_df['Cluster'].unique())), regions_suburbs, kmlat.cluster_centers_):
    list_df.append([reg+1, subs, latlong[0], latlong[1]])
    
sydney_regions_df = pd.DataFrame(list_df, columns=['Sydney Region', 'Suburbs in Region', 'Latitude', 'Longitude'])
sydney_regions_df.head()

Unnamed: 0,Sydney Region,Suburbs in Region,Latitude,Longitude
0,1,"Bellevue Hill, Bondi Beach, Dover Heights, Nor...",-33.879179,151.269575
1,2,"Clyde, Granville, Harris Park, Holroyd, Parram...",-33.827906,151.010696
2,3,"Riverstone, Schofields",-33.688457,150.866431
3,4,"Eschol Park, Kearns",-34.0275,150.801111
4,5,"Cottage Point, Peach Trees",-33.625591,151.189715


In [299]:
sydney_regions_df.shape

(200, 4)

Now that we have our dataframe resized and with the proper information, we can call the Foursquare API to explore venues around each Sydney Region.

Note that we could make a request only for the categories of interest. Although this would save some memory, it would require *number of categories* times *number of regions* API calls, which could exhaust the quota very quickly. Therefore, we send a single API request per region to get the venues of any category around it, and we refine the data later.

In [277]:
# Set up our app credentials (placeholders: use your own Foursquare credentials and keep them out of shared notebooks)
client_id = 'YOUR_FOURSQUARE_CLIENT_ID'
client_secret = 'YOUR_FOURSQUARE_CLIENT_SECRET'
version = '20180605'

In [297]:
# Since each clustered region spans roughly the suburbs within a radius of 1000 meters, this is 
# the radius that we will explore
rad = 1000
limit = 50

base_url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&radius={}&limit={}'.format(
            client_id, client_secret, version, rad, limit)

venues_list = []

for reg, lat, long in zip(sydney_regions_df['Sydney Region'], sydney_regions_df['Latitude'], sydney_regions_df['Longitude']):
    url = base_url + '&ll={},{}'.format(lat, long)
    
    venues = requests.get(url).json()['response']['groups'][0]['items']
    
    for v in venues:
        venues_list.append([reg, lat, long, v['venue']['name'], v['venue']['categories'][0]['name'], v['venue']['location']['lat'], v['venue']['location']['lng']])

In [300]:
df_columns = ['Sydney Region', 'Region Latitude', 'Region Longitude', 'Venue Name', 'Venue Category',
             'Venue Latitude', 'Venue Longitude']
regions_venues_df = pd.DataFrame(venues_list, columns=df_columns)
regions_venues_df.head()

Unnamed: 0,Sydney Region,Region Latitude,Region Longitude,Venue Name,Venue Category,Venue Latitude,Venue Longitude
0,1,-33.879179,151.269575,Gaslight Pharmacy,Pharmacy,-33.875384,151.272273
1,1,-33.879179,151.269575,SHUK,Israeli Restaurant,-33.882187,151.276701
2,1,-33.879179,151.269575,Pita Mix,Kosher Restaurant,-33.874084,151.27298
3,1,-33.879179,151.269575,Organic Republic Bakery,Bakery,-33.885693,151.273611
4,1,-33.879179,151.269575,La Piadina,Italian Restaurant,-33.885947,151.273351


In [301]:
regions_venues_df.shape

(2751, 7)

#### ***Note***: I will make this last dataframe persistent in case the notebook kernel/environment crashes. If that happened, another 200 calls would have to be made to the Foursquare API, and there might be no quota left. To avoid this situation, we persist the dataframe in *pickle* format.

In [302]:
import pickle

In [303]:
regions_venues_df.to_pickle('./regions_venues_df.pkl')
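Restoring the dataframe after a crash is then a single call to `pd.read_pickle`. The round trip below uses a small stand-in dataframe and a temporary directory, both of which are assumptions for this sketch; in the notebook, the path would be `./regions_venues_df.pkl`.

```python
import os
import tempfile

import pandas as pd

# Small stand-in for regions_venues_df
df = pd.DataFrame({
    "Sydney Region": [1, 1],
    "Venue Name": ["Gaslight Pharmacy", "SHUK"],
    "Venue Category": ["Pharmacy", "Israeli Restaurant"],
})

# Save to a temporary location, then load it back
path = os.path.join(tempfile.mkdtemp(), "regions_venues_df.pkl")
df.to_pickle(path)
restored = pd.read_pickle(path)

print(restored.equals(df))
```

Pickle files should only be loaded when they come from a trusted source, since unpickling can execute arbitrary code.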

We now create the one-hot encoding of the above dataframe, so that we can analyze each region according to the venue categories near it.

In [306]:
regions_onehot = pd.get_dummies(regions_venues_df[['Venue Category']], prefix="", prefix_sep="")

regions_onehot['Sydney Region'] = regions_venues_df['Sydney Region']

# Fix the columns so that Sydney Region is the first column
col_order = ['Sydney Region'] + [col for col in regions_onehot.columns if not col == 'Sydney Region']
regions_onehot = regions_onehot[col_order]

regions_onehot.head()

Unnamed: 0,Sydney Region,Accessories Store,Airport,American Restaurant,Antique Shop,Arcade,Art Gallery,Art Museum,Arts & Crafts Store,Arts & Entertainment,Asian Restaurant,Astrologer,Athletics & Sports,Australian Restaurant,Austrian Restaurant,Auto Workshop,BBQ Joint,Baby Store,Badminton Court,Bakery,Bar,Baseball Field,Basketball Court,Basketball Stadium,Bay,Beach,Beach Bar,Beer Bar,Beer Garden,Belgian Restaurant,Big Box Store,Bike Rental / Bike Share,Bike Trail,Bistro,Board Shop,Boat Rental,Boat or Ferry,Bookstore,Bowling Alley,Bowling Green,Brazilian Restaurant,Breakfast Spot,Brewery,Bubble Tea Shop,Buffet,Burger Joint,Burmese Restaurant,Burrito Place,Bus Line,Bus Station,Bus Stop,Business Service,Butcher,Café,Camera Store,Campground,Candy Store,Cantonese Restaurant,Car Wash,Caribbean Restaurant,Carpet Store,Cheese Shop,Child Care Service,Chinese Restaurant,Chocolate Shop,Church,Climbing Gym,Clothing Store,Cocktail Bar,Coffee Shop,College Rec Center,Comedy Club,Comfort Food Restaurant,Construction & Landscaping,Convenience Store,Cosmetics Shop,Creperie,Cricket Ground,Cuban Restaurant,Cupcake Shop,Dam,Dance Studio,Deli / Bodega,Dentist's Office,Department Store,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Distillery,Dog Run,Donut Shop,Dry Cleaner,Dumpling Restaurant,Eastern European Restaurant,Electronics Store,Event Space,Falafel Restaurant,Farm,Farmers Market,Fast Food Restaurant,Field,Filipino Restaurant,Film Studio,Fish & Chips Shop,Fish Market,Flea Market,Flower Shop,Food,Food & Drink Shop,Food Court,Food Service,Food Truck,Football Stadium,French Restaurant,Fried Chicken Joint,Frozen Yogurt Shop,Fruit & Vegetable Store,Furniture / Home Store,Garden,Garden Center,Gas Station,Gastropub,German Restaurant,Golf Course,Greek Restaurant,Grocery Store,Gun Range,Gym,Gym / Fitness Center,Gym Pool,Gymnastics Gym,Harbor / Marina,Hardware Store,Health & Beauty Service,Health Food Store,Herbs & Spices Store,History Museum,Hobby Shop,Hockey Field,Home 
Service,Hostel,Hotel,Hotel Bar,Hungarian Restaurant,IT Services,Ice Cream Shop,Imported Food Shop,Indian Restaurant,Indie Movie Theater,Indonesian Restaurant,Intersection,Israeli Restaurant,Italian Restaurant,Japanese Restaurant,Jewelry Store,Juice Bar,Kebab Restaurant,Korean BBQ Restaurant,Korean Restaurant,Kosher Restaurant,Lake,Latin American Restaurant,Lebanese Restaurant,Library,Lighthouse,Lingerie Store,Liquor Store,Lounge,Malay Restaurant,Market,Martial Arts School,Massage Studio,Mattress Store,Medical Center,Mediterranean Restaurant,Mexican Restaurant,Middle Eastern Restaurant,Miscellaneous Shop,Mobile Phone Shop,Modern European Restaurant,Moroccan Restaurant,Motel,Motorcycle Shop,Mountain,Movie Theater,Moving Target,Multiplex,Museum,Music Store,Music Venue,National Park,Nature Preserve,Neighborhood,Night Market,Noodle House,Office,Optical Shop,Organic Grocery,Other Great Outdoors,Other Nightlife,Outdoor Supply Store,Outlet Mall,Outlet Store,Paintball Field,Paper / Office Supplies Store,Park,Pastry Shop,Pedestrian Plaza,Performing Arts Venue,Persian Restaurant,Pet Store,Pharmacy,Photography Studio,Pie Shop,Pier,Pizza Place,Planetarium,Platform,Playground,Plaza,Pool,Portuguese Restaurant,Print Shop,Pub,Public Bathroom,Racecourse,Racetrack,Ramen Restaurant,Recreation Center,Rental Car Location,Resort,Restaurant,River,Rock Club,Rugby Pitch,Rugby Stadium,Sake Bar,Salon / Barbershop,Sandwich Place,Scenic Lookout,Sculpture Garden,Seafood Restaurant,Shipping Store,Shoe Store,Shopping Mall,Shopping Plaza,Skate Park,Skating Rink,Smoke Shop,Snack Place,Soccer Field,Soccer Stadium,South American Restaurant,South Indian Restaurant,Souvlaki Shop,Spa,Speakeasy,Sporting Goods Shop,Sports Bar,Sports Club,Stables,Stadium,State / Provincial Park,Steakhouse,Street Food Gathering,Summer Camp,Supermarket,Supplement Shop,Surf Spot,Sushi Restaurant,Szechuan Restaurant,Taiwanese Restaurant,Tapas Restaurant,Tea Room,Tennis Court,Tennis Stadium,Thai Restaurant,Theater,Theme 
Park,Theme Park Ride / Attraction,Thrift / Vintage Store,Toll Plaza,Tour Provider,Tourist Information Center,Toy / Game Store,Trail,Train,Train Station,Tram Station,Travel & Transport,Turkish Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Water Park,Waterfront,Whisky Bar,Wine Bar,Wine Shop,Women's Store,Yoga Studio,Zoo Exhibit
0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [521]:
regions_grouped = regions_onehot.groupby('Sydney Region').sum().reset_index()
regions_grouped.head()

Unnamed: 0,Sydney Region,Accessories Store,Airport,American Restaurant,Antique Shop,Arcade,Art Gallery,Art Museum,Arts & Crafts Store,Arts & Entertainment,Asian Restaurant,Astrologer,Athletics & Sports,Australian Restaurant,Austrian Restaurant,Auto Workshop,BBQ Joint,Baby Store,Badminton Court,Bakery,Bar,Baseball Field,Basketball Court,Basketball Stadium,Bay,Beach,Beach Bar,Beer Bar,Beer Garden,Belgian Restaurant,Big Box Store,Bike Rental / Bike Share,Bike Trail,Bistro,Board Shop,Boat Rental,Boat or Ferry,Bookstore,Bowling Alley,Bowling Green,Brazilian Restaurant,Breakfast Spot,Brewery,Bubble Tea Shop,Buffet,Burger Joint,Burmese Restaurant,Burrito Place,Bus Line,Bus Station,Bus Stop,Business Service,Butcher,Café,Camera Store,Campground,Candy Store,Cantonese Restaurant,Car Wash,Caribbean Restaurant,Carpet Store,Cheese Shop,Child Care Service,Chinese Restaurant,Chocolate Shop,Church,Climbing Gym,Clothing Store,Cocktail Bar,Coffee Shop,College Rec Center,Comedy Club,Comfort Food Restaurant,Construction & Landscaping,Convenience Store,Cosmetics Shop,Creperie,Cricket Ground,Cuban Restaurant,Cupcake Shop,Dam,Dance Studio,Deli / Bodega,Dentist's Office,Department Store,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Distillery,Dog Run,Donut Shop,Dry Cleaner,Dumpling Restaurant,Eastern European Restaurant,Electronics Store,Event Space,Falafel Restaurant,Farm,Farmers Market,Fast Food Restaurant,Field,Filipino Restaurant,Film Studio,Fish & Chips Shop,Fish Market,Flea Market,Flower Shop,Food,Food & Drink Shop,Food Court,Food Service,Food Truck,Football Stadium,French Restaurant,Fried Chicken Joint,Frozen Yogurt Shop,Fruit & Vegetable Store,Furniture / Home Store,Garden,Garden Center,Gas Station,Gastropub,German Restaurant,Golf Course,Greek Restaurant,Grocery Store,Gun Range,Gym,Gym / Fitness Center,Gym Pool,Gymnastics Gym,Harbor / Marina,Hardware Store,Health & Beauty Service,Health Food Store,Herbs & Spices Store,History Museum,Hobby Shop,Hockey Field,Home 
Service,Hostel,Hotel,Hotel Bar,Hungarian Restaurant,IT Services,Ice Cream Shop,Imported Food Shop,Indian Restaurant,Indie Movie Theater,Indonesian Restaurant,Intersection,Israeli Restaurant,Italian Restaurant,Japanese Restaurant,Jewelry Store,Juice Bar,Kebab Restaurant,Korean BBQ Restaurant,Korean Restaurant,Kosher Restaurant,Lake,Latin American Restaurant,Lebanese Restaurant,Library,Lighthouse,Lingerie Store,Liquor Store,Lounge,Malay Restaurant,Market,Martial Arts School,Massage Studio,Mattress Store,Medical Center,Mediterranean Restaurant,Mexican Restaurant,Middle Eastern Restaurant,Miscellaneous Shop,Mobile Phone Shop,Modern European Restaurant,Moroccan Restaurant,Motel,Motorcycle Shop,Mountain,Movie Theater,Moving Target,Multiplex,Museum,Music Store,Music Venue,National Park,Nature Preserve,Neighborhood,Night Market,Noodle House,Office,Optical Shop,Organic Grocery,Other Great Outdoors,Other Nightlife,Outdoor Supply Store,Outlet Mall,Outlet Store,Paintball Field,Paper / Office Supplies Store,Park,Pastry Shop,Pedestrian Plaza,Performing Arts Venue,Persian Restaurant,Pet Store,Pharmacy,Photography Studio,Pie Shop,Pier,Pizza Place,Planetarium,Platform,Playground,Plaza,Pool,Portuguese Restaurant,Print Shop,Pub,Public Bathroom,Racecourse,Racetrack,Ramen Restaurant,Recreation Center,Rental Car Location,Resort,Restaurant,River,Rock Club,Rugby Pitch,Rugby Stadium,Sake Bar,Salon / Barbershop,Sandwich Place,Scenic Lookout,Sculpture Garden,Seafood Restaurant,Shipping Store,Shoe Store,Shopping Mall,Shopping Plaza,Skate Park,Skating Rink,Smoke Shop,Snack Place,Soccer Field,Soccer Stadium,South American Restaurant,South Indian Restaurant,Souvlaki Shop,Spa,Speakeasy,Sporting Goods Shop,Sports Bar,Sports Club,Stables,Stadium,State / Provincial Park,Steakhouse,Street Food Gathering,Summer Camp,Supermarket,Supplement Shop,Surf Spot,Sushi Restaurant,Szechuan Restaurant,Taiwanese Restaurant,Tapas Restaurant,Tea Room,Tennis Court,Tennis Stadium,Thai Restaurant,Theater,Theme 
Park,Theme Park Ride / Attraction,Thrift / Vintage Store,Toll Plaza,Tour Provider,Tourist Information Center,Toy / Game Store,Trail,Train,Train Station,Tram Station,Travel & Transport,Turkish Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Water Park,Waterfront,Whisky Bar,Wine Bar,Wine Shop,Women's Store,Yoga Studio,Zoo Exhibit
0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,3,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,8,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,2,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4,0,0,0,2,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,2,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0
1,2,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,1,0,0,2,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,4,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


We now refine our dataframe to include only the categories of interest, and weight their values.

In [522]:
# Find the column names (categories) that match the user's keywords, and their weights.
# Matching is a case-insensitive substring test, so a keyword like 'gym' also matches 'Gym Pool'.
intcat = []
intcat_weights = []
for cat in regions_grouped.columns:
    for usercat in user_weights:
        if usercat[0].lower() in cat.lower():
            intcat.append(cat)
            intcat_weights.append(usercat[1])
            break  # first matching keyword wins, so a column is never weighted twice

# Keep only the region id and the categories of interest
regions_grouped = regions_grouped[['Sydney Region'] + intcat].copy()

# Scale each category count by its user-defined weight
for cat, weight in zip(intcat, intcat_weights):
    regions_grouped[cat] = regions_grouped[cat] * weight
regions_grouped.head()

Unnamed: 0,Sydney Region,Café,Climbing Gym,Gym,Gym / Fitness Center,Gym Pool,Gymnastics Gym
0,1,32,0,2,0,0,0
1,2,4,0,4,0,0,0
2,3,4,0,0,0,0,0
3,4,0,0,0,0,0,0
4,5,0,0,0,0,0,0
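
The keyword matching above can also be wrapped in a small helper. What follows is only a minimal sketch: it assumes `user_weights` is a list of `(keyword, weight)` pairs, as the loop above suggests, and the function name `match_categories`, the sample column names and the sample weights are all hypothetical.

```python
def match_categories(columns, user_weights):
    """Map each column whose name contains a user keyword to that keyword's weight.

    Matching is a case-insensitive substring test. If several keywords match
    the same column, the first one listed in user_weights wins (an assumption
    made for this sketch).
    """
    matched = {}
    for col in columns:
        for keyword, weight in user_weights:
            if keyword.lower() in col.lower():
                matched[col] = weight
                break
    return matched


# Hypothetical sample input (not the project's data)
columns = ['Sydney Region', 'Coffee Shop', 'Gym', 'Gym Pool', 'Music Venue']
user_weights = [('coffee', 4), ('gym', 2), ('music festival', -5)]

weights_by_column = match_categories(columns, user_weights)
print(weights_by_column)
```

Returning a dictionary keeps each category and its weight together, which avoids having to keep two parallel lists in sync.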


Now we add two new columns containing the weighted sums of the positive and negative categories, that is, the categories that the user would and would not like to have nearby, respectively. In addition, we remove any region that has no positive category of interest (i.e., whose positive sum equals zero).

In [523]:
pos_cat = pd.Series([0]*regions_grouped.shape[0], dtype=int)
neg_cat = pd.Series([0]*regions_grouped.shape[0], dtype=int)
for i, cat in enumerate(intcat):
    if intcat_weights[i] >= 0:
        pos_cat += regions_grouped[cat]
    else:
        neg_cat += regions_grouped[cat]

regions_grouped['Positive Sum'] = pos_cat
regions_grouped['Negative Sum'] = neg_cat

regions_grouped = regions_grouped[regions_grouped['Positive Sum'] != 0].reset_index(drop=True)
regions_grouped.head()

Unnamed: 0,Sydney Region,Café,Climbing Gym,Gym,Gym / Fitness Center,Gym Pool,Gymnastics Gym,Positive Sum,Negative Sum
0,1,32,0,2,0,0,0,34,0
1,2,4,0,4,0,0,0,8,0
2,3,4,0,0,0,0,0,4,0
3,7,16,0,4,0,0,0,20,0
4,9,4,0,2,0,0,0,6,0


In [524]:
regions_grouped.shape

(113, 9)
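
The same sums can probably be computed without an explicit accumulation loop. The sketch below uses a small, made-up dataframe and a hypothetical `weights` dictionary; it is an alternative formulation under those assumptions, not the code that produced the output above.

```python
import pandas as pd

# Hypothetical weighted counts for three regions (not the project's data)
weighted = pd.DataFrame({
    'Sydney Region': [1, 2, 3],
    'Coffee Shop': [32, 4, 0],
    'Gym': [2, 4, 0],
    'Music Venue': [-5, 0, -10],
})
weights = {'Coffee Shop': 4, 'Gym': 2, 'Music Venue': -5}  # assumed user weights

pos_cols = [col for col, w in weights.items() if w >= 0]
neg_cols = [col for col, w in weights.items() if w < 0]

# Row-wise sums; the results stay aligned with the dataframe's own index
weighted['Positive Sum'] = weighted[pos_cols].sum(axis=1)
weighted['Negative Sum'] = weighted[neg_cols].sum(axis=1)

# Drop regions with no positive category of interest
weighted = weighted[weighted['Positive Sum'] != 0].reset_index(drop=True)
print(weighted)
```

Summing with `axis=1` avoids relying on a separately built `pd.Series` sharing the dataframe's index.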

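Finally, we walk through the regions in descending order of positive sum and look up each region's position in the dataframe sorted in ascending order of negative sum. We compare the position of the region with the highest positive sum to that of the region with the second highest, and so on, and stop at the first region whose position is not lower than that of the region before it. That region is taken as the optimal one.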

In [525]:
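neg_order = []

# Sort once up front instead of re-sorting on every iteration
by_pos = regions_grouped.sort_values('Positive Sum', ascending=False).values
by_neg = regions_grouped.sort_values('Negative Sum').values

for i, reg in enumerate(by_pos):
    # Position of this region in ascending order of negative sum
    for j, regcomp in enumerate(by_neg):
        if reg[0] == regcomp[0]:
            neg_order.append(j)
            break
    # Stop at the first region whose position is not lower than the previous region's
    if i != 0 and neg_order[i] >= neg_order[i-1]:
        best_reg = reg[0]
        break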

best_reg_ind = regions_grouped[regions_grouped['Sydney Region'] == best_reg].index[0]
print('Optimal Region found:', 'Region: %s\nDataframe Index: %s' % (best_reg, best_reg_ind), sep='\n')
regions_grouped[regions_grouped['Sydney Region'] == best_reg]

Optimal Region found:
Region: 180
Dataframe Index: 98


Unnamed: 0,Sydney Region,Café,Climbing Gym,Gym,Gym / Fitness Center,Gym Pool,Gymnastics Gym,Positive Sum,Negative Sum
98,180,56,2,2,2,0,0,62,0
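
The stopping rule above can be sketched as a standalone function. This is a minimal illustration on made-up data: the function name `find_optimal_region`, the fallback for the case where the rule never triggers, and the use of a stable sort to make ties deterministic are all assumptions of the sketch rather than part of the original analysis.

```python
import pandas as pd


def find_optimal_region(df, region_col='Sydney Region'):
    """Return the region id selected by the stopping rule.

    Falls back to the region with the highest positive sum if the rule
    never triggers (a choice made for this sketch only).
    """
    # mergesort is stable, so tied regions keep their original relative order
    by_pos = df.sort_values('Positive Sum', ascending=False,
                            kind='mergesort')[region_col].tolist()
    by_neg = df.sort_values('Negative Sum', kind='mergesort')[region_col].tolist()
    neg_rank = {region: rank for rank, region in enumerate(by_neg)}

    # Compare each region with the one just before it in positive-sum order
    for prev, curr in zip(by_pos, by_pos[1:]):
        if neg_rank[curr] >= neg_rank[prev]:
            return curr
    return by_pos[0]


# Hypothetical data (not the project's results)
toy = pd.DataFrame({
    'Sydney Region': [10, 20, 30],
    'Positive Sum': [50, 40, 30],
    'Negative Sum': [-2, -8, 0],
})
best = find_optimal_region(toy)
print(best)
```

The dictionary lookup replaces the inner loop over the second sorted dataframe, and the explicit fallback avoids an undefined result when only one region is left.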


In [530]:
kclusters = 15

km_int = KMeans(init="k-means++", n_clusters=kclusters, n_init=20) # Run 20 k-means++ initializations and keep the one with the lowest inertia

km_int.fit(regions_grouped.drop(['Sydney Region', 'Positive Sum', 'Negative Sum'], axis=1))

km_int.labels_

array([ 7, 14,  2,  8, 14,  2, 13,  6,  6,  8,  8, 12,  2, 12,  4,  5,  9,
        2,  2,  6, 12,  2,  2,  6,  1,  8, 11,  3, 14,  8,  8,  6,  5,  2,
       14, 14,  2,  9,  2,  6,  5,  2, 12,  2, 14,  6,  6,  2, 11,  6,  2,
        6,  0,  7,  5,  7, 12, 13,  2,  9,  5,  7,  6,  2,  1,  5,  5,  2,
        2,  2,  5, 14,  2,  4,  5,  3,  2,  8,  0,  6,  2,  3,  2,  2, 11,
        2,  6,  1,  9,  8, 14,  9,  3,  2,  7,  2,  5,  2, 10,  8,  7,  3,
        5,  3,  2,  5,  9,  8,  6,  2,  9,  5,  3], dtype=int32)

In [531]:
regions_grouped['Cluster'] = km_int.labels_
regions_grouped.head()

Unnamed: 0,Sydney Region,Café,Climbing Gym,Gym,Gym / Fitness Center,Gym Pool,Gymnastics Gym,Positive Sum,Negative Sum,Cluster
0,1,32,0,2,0,0,0,34,0,7
1,2,4,0,4,0,0,0,8,0,14
2,3,4,0,0,0,0,0,4,0,2
3,7,16,0,4,0,0,0,20,0,8
4,9,4,0,2,0,0,0,6,0,14


Let's join the cluster labels obtained with our K-Means algorithm to the dataframe containing the Sydney Regions with their Latitude and Longitude information.

In [561]:
sydney_regions_clustered = sydney_regions_df.join(regions_grouped.set_index('Sydney Region')[['Cluster']], on='Sydney Region')

# Since the first dataframe contains all regions, and we have clustered only regions with venues of interest,
# there will be some NaN in Cluster column. We remove these.
sydney_regions_clustered.dropna(inplace=True)
sydney_regions_clustered['Cluster'] = sydney_regions_clustered['Cluster'].astype(int)
sydney_regions_clustered.reset_index(inplace=True, drop=True)
sydney_regions_clustered.head()

Unnamed: 0,Sydney Region,Suburbs in Region,Latitude,Longitude,Cluster
0,1,"Bellevue Hill, Bondi Beach, Dover Heights, Nor...",-33.879179,151.269575,7
1,2,"Clyde, Granville, Harris Park, Holroyd, Parram...",-33.827906,151.010696,14
2,3,"Riverstone, Schofields",-33.688457,150.866431,2
3,7,"Gladesville, Putney, Ryde, Tennyson Point, Ryd...",-33.822006,151.11458,8
4,9,"Abbotsbury, Bossley Park, Greenfield Park, Wet...",-33.863779,150.886051,14


Now we rank our clusters and set each cluster's opacity for plotting according to its position in descending order of positive sum. The cluster that contains the optimal region is always given full opacity.

In [590]:
# First we create the opacity levels for all of our clusters
op_level = np.linspace(1,0.1,kclusters)

# Then, we set the opacity level = 1 (op_level[0]) to the cluster of our optimal region
cluster_op = [0.0]*kclusters
cluster_op[regions_grouped.loc[best_reg_ind, 'Cluster']] = op_level[0]

# Now we loop through the ordered data set and map the cluster number to the opacity level
done = set()
done.add(regions_grouped.loc[best_reg_ind, 'Cluster'])
i=1
for clus in regions_grouped.sort_values('Positive Sum', ascending=False)['Cluster']:
    if not clus in done:
        cluster_op[clus] = op_level[i]
        done.add(clus)
        i += 1
        
cluster_op

[0.5499999999999999,
 0.8714285714285714,
 0.22857142857142843,
 0.6785714285714286,
 0.9357142857142857,
 0.4214285714285714,
 0.3571428571428571,
 0.7428571428571429,
 0.48571428571428565,
 0.16428571428571415,
 1.0,
 0.6142857142857142,
 0.1,
 0.8071428571428572,
 0.2928571428571428]
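
The mapping from cluster label to opacity can be sketched as a small function. The name `cluster_opacities`, its parameters and the sample labels are hypothetical; the sketch only mirrors the idea of giving the optimal region's cluster the highest level and ranking the remaining clusters by first appearance in descending order of positive sum.

```python
import numpy as np


def cluster_opacities(clusters_by_rank, best_cluster, n_clusters, lo=0.1, hi=1.0):
    """Assign opacity levels from hi down to lo.

    clusters_by_rank: cluster labels of the regions, ordered by descending positive sum.
    best_cluster: cluster of the optimal region; it always receives `hi`.
    Clusters that never appear keep opacity 0.0.
    """
    levels = np.linspace(hi, lo, n_clusters)
    # dict.fromkeys removes duplicates while keeping first-seen order
    ordered = list(dict.fromkeys([best_cluster] + list(clusters_by_rank)))
    opacity = [0.0] * n_clusters
    for level, cluster in zip(levels, ordered):
        opacity[cluster] = float(level)
    return opacity


# Hypothetical labels, already ordered by descending positive sum
opacities = cluster_opacities([2, 0, 2, 1, 0], best_cluster=1, n_clusters=3)
print(opacities)
```

Each entry of the returned list can be passed directly as `fill_opacity` for the circles of the corresponding cluster.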

Finally, we add a map plot using the Folium library to show the user which regions are suitable according to her interests (in this case, Cafés and Gyms nearby, without places that can hold Music Festivals).

In [596]:
final_map = folium.Map(location=[syd_lat, syd_long], zoom_start=10.5)

for lat, long, subs, clus in zip(sydney_regions_clustered['Latitude'], sydney_regions_clustered['Longitude'],
                               sydney_regions_clustered['Suburbs in Region'], sydney_regions_clustered['Cluster']):
    folium.Circle([lat, long],
                 radius=1000,
                 stroke=False,
                 popup='Cluster: %s\nSuburbs: %s' % (clus,subs),
                 fill=True,
                 fill_color='blue',
                 fill_opacity=cluster_op[clus]).add_to(final_map)
    
# Add a marker to the center of Sydney
folium.Marker([syd_lat, syd_long], popup='Sydney').add_to(final_map)
    
final_map

We can also plot a heatmap using a plugin of the Folium library. Heatmaps are great for visualization. In this case, we also weight each region's latitude and longitude point according to the opacity level of the cluster the region is in.

In [600]:
from folium.plugins import HeatMap

data = []

# Create list of [lat, long] points containing the weights (associated with the opacity level)
for lat, long, clus in zip(sydney_regions_clustered['Latitude'], sydney_regions_clustered['Longitude'],
                            sydney_regions_clustered['Cluster']):
    data.append([lat, long, cluster_op[clus]])

ht_map = folium.Map(location=[syd_lat, syd_long], zoom_start=10)
#folium.TileLayer('cartodbpositron').add_to(ht_map)
HeatMap(data).add_to(ht_map)
folium.Marker([syd_lat, syd_long], popup='Sydney').add_to(ht_map)

ht_map

## Discussion

The results obtained and discussed above show what is often intuitive: denser regions tend to have more coffee shops, cafés, gyms, etc. Commercial venues that offer services and consumables are usually located in denser areas, since businesses want their customers nearby.

However, the above analysis can be applied to any other situation. For example, if our user did NOT want gyms and cafés nearby, we would probably find that the optimal neighborhoods are far away from the city centre.

Likewise, the above analysis is general enough that other features can be added to it. For example, a data set containing crime rates for each Sydney suburb could be retrieved and added to our data frames. Then, by applying negative weights to these features, we would find the neighborhoods with the lowest crime rates in Sydney.
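
As a rough illustration of this extension, the sketch below merges a crime-rate table into the per-region sums and applies a negative weight. All numbers, the `Crime Rate` column, and the weight value are made up, and the sketch assumes the crime rates have already been aggregated from suburbs to regions.

```python
import pandas as pd

# Hypothetical per-region sums (not the project's results)
regions = pd.DataFrame({
    'Sydney Region': [1, 2, 3],
    'Positive Sum': [34, 8, 20],
    'Negative Sum': [0, 0, 0],
})

# Hypothetical crime-rate table, already aggregated per region
crime = pd.DataFrame({
    'Sydney Region': [1, 2, 3],
    'Crime Rate': [12.0, 3.5, 7.0],  # made-up incidents per 1,000 residents
})
crime_weight = -2  # negative weight: more crime makes a region less attractive

merged = regions.merge(crime, on='Sydney Region', how='left')
# Regions missing from the crime table contribute nothing (an assumption of this sketch)
merged['Negative Sum'] = (merged['Negative Sum']
                          + merged['Crime Rate'].fillna(0) * crime_weight)
print(merged)
```

With the crime rate folded into the negative sum, the rest of the pipeline could stay unchanged.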

The heatmap above is good for an overall picture, but the map of clustered regions is the better one to consult when actually searching for a place to move to, as was the case in this project. Because it is hard to differentiate between neighboring regions in the heatmap, you should look at the opacity level (or the transparency) of each circle in the first map.

As is always the case, there is much to improve in the model just presented. First, the way we found the "optimal neighborhood" is probably not the best way to do it. Also, unfortunately no region in Sydney had the "Music Festival" category in this case, so we could not check the effect of negative weights on our results. Lastly, it would be nice to also add a marker to the map for each venue in the positive categories of interest, showing the location of each café and gym in each region. This would support the user's own analysis.

## Conclusion

This project showed the potential of location data to explore regions anywhere on the globe and support our decisions based on our interests in each region. In this case, we only used the *explore* endpoint of the API, meaning that we only worked with venue categories. However, there is huge potential in using this API, especially when premium calls are added.

The purpose of this project was to help someone move to a new city without knowing much about each region or neighborhood. In this case, a girl from Melbourne moving to Sydney wanted to find the optimal neighborhood in which to rent an apartment, according to the venues she enjoys the most. The last two map plots show our final results, where each region is colored based on how likely she is to enjoy living there. Stronger fill colors mean regions that are more likely to be preferred by her, according to her needs, whereas weaker fill colors indicate regions she would do best to avoid. The same reasoning applies to the heat map.

Overall, the project reached its final goal with reasonable results, and could be applied to real cases right away.