## CLUSTERING TASK: Banner placement

_Imagine that the international cruise agency Carnival Cruise Line has decided to advertise with banners and has turned to you for help. To test whether such banners are effective, only 20 of them will be placed around the world. You need to choose the 20 locations so that the benefit is large and the agency continues to work with you. The agency is large and has several offices around the world. It wants to place the banners near these offices, where it is easier to negotiate and to check the result. The locations should also be popular among tourists._

To find the best places, we will use data from Foursquare, the largest location-based social network.

Part of this data is openly available, for example on archive.org:

https://archive.org/details/201309_foursquare_dataset_umn

### Data processing
Let's load the data.
For convenience, we convert it to CSV and then delete the rows that do not contain coordinates, since they are not informative for us:

In [1]:
import pandas as pd
import numpy as np

In [2]:
data = pd.read_table('umn_foursquare_datasets/checkins.dat', sep='|', low_memory=False)

In [3]:
data.to_csv('umn_foursquare_datasets/checkins.csv')

In [4]:
data = pd.read_csv('umn_foursquare_datasets/checkins.csv', low_memory=False)

In [5]:
data

Unnamed: 0.1,Unnamed: 0,id,user_id,venue_id,latitude,longitude,created_at
0,0,---------+---------+----------+---------------...,,,,,
1,1,984301,2041916.0,5222.0,,,2012-04-21 17:39:01
2,2,984222,15824.0,5222.0,38.8951118,-77.0363658,2012-04-21 17:43:47
3,3,984315,1764391.0,5222.0,,,2012-04-21 17:37:18
4,4,984234,44652.0,5222.0,33.800745,-84.41052,2012-04-21 17:43:43
...,...,...,...,...,...,...,...
1021963,1021963,956119,1139114.0,29488.0,42.439479,-83.7438303,2012-04-13 11:36:44
1021964,1021964,956447,2088020.0,4432.0,,,2012-04-13 12:58:05
1021965,1021965,956733,960666.0,60.0,42.331427,-83.0457538,2012-04-13 21:56:19
1021966,1021966,957139,1771518.0,10935.0,,,2012-04-14 02:44:52


In [6]:
# The raw header names are padded with spaces; map them to clean names.
data = data.rename(columns={
    '   id    ': 'id',
    ' user_id ': 'user_id',
    ' venue_id ': 'venue_id',
    '     latitude      ': 'latitude',
    '     longitude     ': 'longitude',
    '     created_at      ': 'created_at',
})

In [7]:
data.columns

Index(['Unnamed: 0', 'id', 'user_id', 'venue_id', 'latitude', 'longitude',
       'created_at'],
      dtype='object')

In [8]:
data = data.drop('Unnamed: 0', axis=1)

In [9]:
data

Unnamed: 0,id,user_id,venue_id,latitude,longitude,created_at
0,---------+---------+----------+---------------...,,,,,
1,984301,2041916.0,5222.0,,,2012-04-21 17:39:01
2,984222,15824.0,5222.0,38.8951118,-77.0363658,2012-04-21 17:43:47
3,984315,1764391.0,5222.0,,,2012-04-21 17:37:18
4,984234,44652.0,5222.0,33.800745,-84.41052,2012-04-21 17:43:43
...,...,...,...,...,...,...
1021963,956119,1139114.0,29488.0,42.439479,-83.7438303,2012-04-13 11:36:44
1021964,956447,2088020.0,4432.0,,,2012-04-13 12:58:05
1021965,956733,960666.0,60.0,42.331427,-83.0457538,2012-04-13 21:56:19
1021966,957139,1771518.0,10935.0,,,2012-04-14 02:44:52


In [10]:
data = data.drop(data[data.latitude == '                   '].index)

In [11]:
data = data.iloc[1:-1]
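
The cleaning steps above could also be done in a single pass. The following is only a sketch under assumptions: the three-row `sample` string is a made-up excerpt in the same pipe-delimited layout, and `load_checkins` is a hypothetical helper, not part of the original notebook. It strips the padded column names, parses the coordinates as numbers, and drops every row without them (including the `-----+-----` separator row).

```python
import io
import pandas as pd

# Hypothetical excerpt in the same pipe-delimited layout as checkins.dat.
sample = (
    "   id    | user_id | venue_id |     latitude      |     longitude     |     created_at      \n"
    "---------+---------+----------+-------------------+-------------------+---------------------\n"
    "  984301 | 2041916 |     5222 |                   |                   | 2012-04-21 17:39:01\n"
    "  984222 |   15824 |     5222 |        38.8951118 |       -77.0363658 | 2012-04-21 17:43:47\n"
    "  984234 |   44652 |     5222 |         33.800745 |         -84.41052 | 2012-04-21 17:43:43\n"
)

def load_checkins(source):
    """Read the pipe-delimited file and keep only rows with numeric coordinates."""
    df = pd.read_csv(source, sep="|", dtype=str)
    df.columns = df.columns.str.strip()          # remove padding from header names
    for col in ("latitude", "longitude"):
        # Blank or missing cells become NaN instead of raising an error.
        df[col] = pd.to_numeric(df[col].str.strip(), errors="coerce")
    return df.dropna(subset=["latitude", "longitude"]).reset_index(drop=True)

clean = load_checkins(io.StringIO(sample))
print(clean[["latitude", "longitude"]])
```

With the real file, the same helper would be called with the path to `checkins.dat` instead of the in-memory sample.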

In [12]:
data

Unnamed: 0,id,user_id,venue_id,latitude,longitude,created_at
2,984222,15824.0,5222.0,38.8951118,-77.0363658,2012-04-21 17:43:47
4,984234,44652.0,5222.0,33.800745,-84.41052,2012-04-21 17:43:43
8,984291,105054.0,5222.0,45.5234515,-122.6762071,2012-04-21 17:39:22
10,984318,2146539.0,5222.0,40.764462,-111.904565,2012-04-21 17:35:46
11,984232,93870.0,380645.0,33.4483771,-112.0740373,2012-04-21 17:38:18
...,...,...,...,...,...,...
1021960,955561,626076.0,20073.0,40.8501002,-73.8662464,2012-04-13 09:56:48
1021961,955892,674797.0,2297.0,33.7489954,-84.3879824,2012-04-13 10:56:03
1021962,956377,845102.0,11195.0,42.7653662,-71.467566,2012-04-13 12:08:45
1021963,956119,1139114.0,29488.0,42.439479,-83.7438303,2012-04-13 11:36:44


In [13]:
from sklearn.cluster import MeanShift

In [14]:
X = data.values[:100000, 3:5]

In [15]:
X

array([['        38.8951118 ', '       -77.0363658 '],
       ['         33.800745 ', '         -84.41052 '],
       ['        45.5234515 ', '      -122.6762071 '],
       ...,
       ['        29.7628844 ', '       -95.3830615 '],
       ['         32.802955 ', '        -96.769923 '],
       ['        37.7749295 ', '      -122.4194155 ']], dtype=object)
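
The array above still holds padded strings (`dtype=object`). A minimal sketch of converting such values to floats explicitly before clustering, using two rows copied from the output above; the names `X_raw` and `X_float` are illustrative only:

```python
import numpy as np

# Two rows copied from the X output above: padded strings in an object array.
X_raw = np.array([['        38.8951118 ', '       -77.0363658 '],
                  ['         33.800745 ', '         -84.41052 ']], dtype=object)

# float() ignores surrounding whitespace, so astype(float) parses each cell.
X_float = X_raw.astype(float)
print(X_float.dtype)
print(X_float)
```

Making the conversion explicit avoids relying on the clustering library to coerce the strings on its own.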

Now we need to cluster these coordinates to identify the centers of tourist clusters. Since a banner only covers a relatively small area, we need an algorithm that lets us limit the cluster size and does not require the number of clusters in advance. Use MeanShift with bandwidth=0.1, which, converted from degrees, corresponds to roughly 5 to 10 km at mid-latitudes.
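
The degrees-to-kilometers conversion can be checked with a little arithmetic. This is a rough sketch that assumes a spherical Earth with a mean radius of 6371 km; `degrees_to_km` is a hypothetical helper introduced here for illustration. One degree of latitude is about 111 km everywhere, while a degree of longitude shrinks with the cosine of the latitude.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius; a spherical approximation

def degrees_to_km(delta_deg, latitude_deg):
    """Return (north_south_km, east_west_km) spanned by delta_deg at a latitude."""
    km_per_degree = 2 * math.pi * EARTH_RADIUS_KM / 360  # about 111.2 km
    north_south = delta_deg * km_per_degree
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return north_south, east_west

for lat in (30, 45, 60):
    ns, ew = degrees_to_km(0.1, lat)
    print(f"latitude {lat}: {ns:.1f} km north-south, {ew:.1f} km east-west")
```

Under these assumptions, 0.1 degrees comes out at about 11 km north-south and somewhat less east-west, which is in the same range as the figure quoted above.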

Note: clustering all 396634 rows takes a long time. Being very patient is not forbidden, and the result will only improve. But to pass the assignment, you need the subset of the first 100 thousand rows. This is a compromise between quality and time spent: training takes about an hour on the entire dataset and about 2 minutes on 100 thousand rows, which is enough to get correct results.

Some of the resulting clusters contain too few points, and such clusters are not of interest to advertisers. Therefore, we need to determine which clusters contain, say, more than 15 elements. The centers of these clusters are the best candidates for placement.
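
The size filter described here can be sketched with NumPy. The labels and centers below are made up for illustration, and `large_cluster_ids` is a hypothetical helper; the sketch assumes the labels are non-negative integers, as MeanShift produces with its default settings.

```python
import numpy as np

def large_cluster_ids(labels, min_size=15):
    """Return ids of clusters with strictly more than min_size members."""
    counts = np.bincount(labels)          # counts[k] = number of points labeled k
    return np.flatnonzero(counts > min_size)

# Hypothetical labels: cluster 0 has 20 points, cluster 1 has 3, cluster 2 has 16.
labels = np.array([0] * 20 + [1] * 3 + [2] * 16)
centers = np.array([[40.7, -74.0], [10.0, 10.0], [33.4, -111.9]])

keep = large_cluster_ids(labels)
print(keep)            # ids of the clusters that pass the filter
print(centers[keep])   # their centers, in cluster-id order
```

Indexing the centers with the kept ids preserves the link between a cluster label and its center, which a separately sized output array has to maintain by hand.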

If you want to see the results on a map, you can pass the cluster centers to one of the visualization tools. For example, the site mapcustomizer.com has a Bulk Entry function where you can paste the cluster centers in the format: 38.8951118,-77.0363658
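
A small sketch of producing text in that `latitude,longitude` format. Writing one center per line is an assumption about what the Bulk Entry box accepts, and `to_bulk_entry` is a hypothetical helper:

```python
def to_bulk_entry(centers):
    """Format (latitude, longitude) pairs as one 'lat,lon' line per center."""
    return "\n".join(f"{lat},{lon}" for lat, lon in centers)

# Two coordinate pairs taken from the check-in data shown earlier.
sample_centers = [(38.8951118, -77.0363658), (33.800745, -84.41052)]
print(to_bulk_entry(sample_centers))
```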

As noted, the 20 banners should be placed near the company's offices. A Google Maps search for "Carnival Cruise Line" gives the office coordinates, for example 33.751277, -118.188740 (Los Angeles).

In [16]:
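# The task text calls for bandwidth=0.1; this run uses bandwidth=2 (about 2 degrees),
# so the clusters found here are much coarser.
clustering = MeanShift(bandwidth=2).fit(X)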

In [17]:
unique = len(np.unique(clustering.labels_))

In [18]:
unique

280
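
To see how the bandwidth limits cluster size, here is a toy sketch on synthetic data. The points are invented: two tight groups placed near New York and Los Angeles, each spread over less than the bandwidth. It assumes scikit-learn is installed and uses only `MeanShift`, `labels_`, and `cluster_centers_`, as in the cells above.

```python
import numpy as np
from sklearn.cluster import MeanShift

# Hypothetical points: two tight groups, each within a 0.04-degree box.
rng = np.random.default_rng(0)
new_york = np.array([40.71, -74.01]) + rng.uniform(-0.02, 0.02, size=(30, 2))
los_angeles = np.array([34.05, -118.24]) + rng.uniform(-0.02, 0.02, size=(30, 2))
points = np.vstack([new_york, los_angeles])

# With bandwidth=0.1 each group is smaller than the kernel radius,
# so every group collapses to a single cluster center.
model = MeanShift(bandwidth=0.1).fit(points)
n_clusters = len(np.unique(model.labels_))
print(n_clusters)
print(model.cluster_centers_)
```

The number of clusters is never passed in; it follows from the data and the bandwidth alone.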

In [19]:
cluster_centers = clustering.cluster_centers_

In [20]:
cluster_centers

array([[ 4.06819204e+01, -7.40758674e+01],
       [ 3.34136361e+01, -1.11901424e+02],
       [ 3.37209781e+01, -1.17900102e+02],
       [ 3.77071545e+01, -1.22221237e+02],
       [ 3.92172344e+01, -7.65175569e+01],
       [ 4.21125091e+01, -8.77908054e+01],
       [ 2.73697493e+01, -8.11814703e+01],
       [ 4.22709824e+01, -7.13778405e+01],
       [ 4.71019250e+01, -1.22400517e+02],
       [ 3.18617223e+01, -9.71948575e+01],
       [ 3.38248988e+01, -8.43692105e+01],
       [ 4.14773201e+01, -8.27688701e+01],
       [ 4.49318199e+01, -9.32495237e+01],
       [ 4.19724767e+01, -8.02225577e+01],
       [ 3.96197622e+01, -8.47919751e+01],
       [ 3.97787294e+01, -1.05068159e+02],
       [ 3.60831141e+01, -1.15134381e+02],
       [ 3.55757373e+01, -7.98271936e+01],
       [ 3.96140525e+01, -9.50930500e+01],
       [ 2.98170758e+01, -9.53795058e+01],
       [-6.31395822e+00,  1.06930266e+02],
       [ 3.89329967e+01, -9.01007189e+01],
       [ 3.59294932e+01, -8.65830729e+01],
       [ 4.

In [21]:
len(cluster_centers)

280

In [22]:
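# Count how many check-ins were assigned to each cluster label.
dict = {}
labels = clustering.labels_
for label in labels:
    if label in dict:
        dict[label] += 1
    else:
        dict[label] = 1

# One-row DataFrame of the counts, handy for inspection.
p = pd.DataFrame(dict, index=[0])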

In [24]:
l = []
for c in dict:
    if dict[c] > 15:
        l.append(c)

print(l)        

[4, 10, 8, 27, 1, 0, 7, 2, 16, 3, 5, 33, 14, 13, 81, 18, 12, 25, 17, 9, 56, 22, 45, 24, 26, 31, 11, 85, 30, 57, 6, 29, 47, 52, 44, 84, 75, 15, 58, 19, 69, 74, 28, 50, 51, 40, 60, 65, 21, 39, 32, 48, 35, 89, 62, 63, 87, 36, 34, 53, 23, 20, 61, 43, 88, 80, 101, 93, 42, 59, 68, 46, 116, 67, 37, 49, 70, 99, 201, 66, 94, 86, 79, 77, 64, 95, 83, 97, 55, 100, 107, 98, 71, 90, 72, 38, 78, 73, 92, 104, 41, 76, 82, 91, 103]


In [25]:
len(l)

105

In [46]:
# One row per cluster that passed the size filter (105 in this run).
clusters_select = np.empty((len(l), 2))

In [47]:
# Copy the centers of clusters with more than 15 points, in cluster-index order.
j = 0
for i, center in enumerate(cluster_centers):
    if dict[i] > 15:
        clusters_select[j] = center
        j += 1

In [48]:
clusters_select[0]

array([ 40.68192041, -74.0758674 ])

As noted, the 20 banners should be placed near the company's offices. A Google Maps search for "Carnival Cruise Line" gives these office locations:

- 33.751277, -118.188740 (Los Angeles)
- 25.867736, -80.324116 (Miami)
- 51.503016, -0.075479 (London)
- 52.378894, 4.885084 (Amsterdam)
- 39.366487, 117.036146 (Beijing)
- -33.868457, 151.205134 (Sydney)

It remains to determine the 20 cluster centers nearest to the offices. That is, calculate the distance to the nearest office for each center and choose the 20 with the smallest values. Note: when calculating distances and in clustering, we can neglect the fact that the Earth is round: for points close to each other the error is small, and for points far apart the distance is large enough that the error does not change the ranking.

To submit the assignment, choose from the resulting 20 centers the one that is closest to its nearest office. The answer is the latitude and longitude of this center, separated by a space.
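
The selection step described above can be sketched in vectorized form. Only two of the six offices are used here, the cluster centers are hypothetical values for illustration, and `nearest_office_distances` is a helper name introduced for this sketch. Distances are plain Euclidean distances in degrees, matching the flat-Earth simplification.

```python
import numpy as np

def nearest_office_distances(centers, offices):
    """For each center, return the Euclidean distance (in degrees) to its closest office."""
    diff = centers[:, None, :] - offices[None, :, :]   # shape (n_centers, n_offices, 2)
    return np.sqrt((diff ** 2).sum(axis=2)).min(axis=1)

offices = np.array([[33.751277, -118.188740],   # Los Angeles
                    [25.867736, -80.324116]])   # Miami
# Hypothetical cluster centers, for illustration only.
centers = np.array([[40.68, -74.08], [33.72, -117.90], [27.37, -81.18]])

dists = nearest_office_distances(centers, offices)
order = np.argsort(dists)      # center indices, nearest to an office first
top = centers[order[:20]]      # up to 20 candidate locations
best = centers[order[0]]       # the single closest center
print(best)
```

Sorting once gives both the 20 candidates and the single answer, so no separate search loop is needed.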

In [49]:
offices = np.ndarray(shape=(6,2))
offices[0] = np.array([33.751277, -118.188740])
offices[1] = np.array([25.867736, -80.324116])
offices[2] = np.array([51.503016, -0.075479])
offices[3] = np.array([52.378894, 4.885084])
offices[4] = np.array([39.366487, 117.036146])
offices[5] = np.array([-33.868457, 151.205134])

In [50]:
def distance(point1, point2):
    return ((point1[0] - point2[0])**2 + (point1[1] - point2[1])**2)**0.5

In [52]:
distance(offices[0], clusters_select[69])

40.188018024685455
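
The value above is a distance in degrees, not kilometers. As a rough check of the flat-Earth simplification, the sketch below compares it with a great-circle distance from the standard haversine formula, assuming a spherical Earth of radius 6371 km. The function names are illustrative; the two points are the Sydney office and a cluster center taken from this notebook's output.

```python
import math

def euclidean_degrees(p, q):
    """Straight-line distance between (lat, lon) pairs, measured in degrees."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def haversine_km(p, q, radius_km=6371.0):
    """Great-circle distance in km between (lat, lon) pairs given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))

sydney_office = (-33.868457, 151.205134)
sydney_center = (-33.80932059, 151.14566154)  # a cluster center from this notebook

print(euclidean_degrees(sydney_office, sydney_center))
print(haversine_km(sydney_office, sydney_center))
```

For nearby points the two measures rank candidates similarly, but the degree-based distance treats a degree of longitude as equal to a degree of latitude, which overstates east-west separation away from the equator.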

In [37]:
# Find the selected center with the smallest distance to its nearest office.
answer_index = 0
min_dist = float('inf')  # avoids treating a genuine distance of 0 as "unset"
for i in range(len(clusters_select)):
    nearest = min(distance(office, clusters_select[i]) for office in offices)
    if nearest < min_dist:
        min_dist = nearest
        answer_index = i

In [38]:
answer_index

74

In [39]:
clusters_select[answer_index]

array([-33.80932059, 151.14566154])

In [116]:
def write_answer(center):
    with open("answer.txt", "w") as f:
        f.write(str(center[0]) + ' ' + str(center[1]))

In [117]:
write_answer(clusters_select[answer_index])