# Clustering algorithm for extracting seasons

In this notebook we generate an `isPeakSeason` field for each month of each region. We rely on the following **heuristics**:
1. Months can be split into chunks of "seasons", which are clusters of months grouped around a common centroid.
2. For now, the centroids are defined using only the `visitorIndex` information. We assume that the `peakSeason` is the one that attracts the most visitors (this concept could later be extended to more dimensions).
3. The chunk with the highest centroid value contains the `peakSeason` months.
4. We currently assume that there are 4 seasons, ranked by popularity; the number of seasons is a hyperparameter.

In [5]:
import numpy as np
import pandas as pd 
from scipy.cluster.vq import vq, kmeans

In [7]:
travelRegions = pd.read_json('./travelRegionsWithIndex.json')

travelRegions.head()

Unnamed: 0,ParentRegion,Region,u_name,costPerWeek,jan,feb,mar,apr,may,jun,...,beach,watersports,entertainment,wintersports,culture,culinary,architecture,shopping,budgetLevel,visitorIndex
0,,World,,400.0,o,o,o,o,o,o,...,o,o,o,o,o,o,o,o,,"{'jan': 0, 'feb': 0, 'mar': 0, 'apr': 0, 'may'..."
1,World,Europe,,,,,,,,,...,,,,,,,,,,"{'jan': 0, 'feb': 0, 'mar': 0, 'apr': 0, 'may'..."
2,World,North America,,,,,,,,,...,,,,,,,,,,"{'jan': 0, 'feb': 0, 'mar': 0, 'apr': 0, 'may'..."
3,World,Middle America and Caribbean,,,,,,,,,...,,,,,,,,,,"{'jan': 0, 'feb': 0, 'mar': 0, 'apr': 0, 'may'..."
4,World,South America,,,,,,,,,...,,,,,,,,,,"{'jan': 0, 'feb': 0, 'mar': 0, 'apr': 0, 'may'..."


In [24]:
np.array(list(travelRegions.iloc[30]['visitorIndex'].values()), dtype='f')

array([ 90.,  80., 111.,  90.,  83.,  98.,  84.,  82.,  82., 102.,  83.,
        88.], dtype=float32)

## Use the k-means algorithm to determine the peak season

In [29]:
travelRegionIndexes = np.array(list(travelRegions.iloc[30]['visitorIndex'].values()), dtype='f')
centroids, _ = kmeans(travelRegionIndexes, 4)
cluster_labels, _ = vq(travelRegionIndexes, centroids)
cluster_labels, centroids

(array([0, 3, 2, 0, 3, 1, 3, 3, 3, 1, 3, 0], dtype=int32),
 array([ 89.333336, 100.      , 111.      ,  82.333336], dtype=float32))
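To read this output, we can pick the cluster with the largest centroid and list the months assigned to it. The sketch below reuses the labels and centroids printed above; the `months` list is an assumption, namely that the keys of `visitorIndex` run from `jan` to `dec` in calendar order.

```python
import numpy as np

# Assumed key order of the visitorIndex dict (calendar order)
months = ['jan', 'feb', 'mar', 'apr', 'may', 'jun',
          'jul', 'aug', 'sep', 'oct', 'nov', 'dec']

# Values copied from the output of the cell above
cluster_labels = np.array([0, 3, 2, 0, 3, 1, 3, 3, 3, 1, 3, 0])
centroids = np.array([89.333336, 100.0, 111.0, 82.333336])

# The peak season is the cluster whose centroid is largest
peak_cluster = int(np.argmax(centroids))
peak_months = [m for m, label in zip(months, cluster_labels) if label == peak_cluster]

print(peak_cluster, peak_months)  # 2 ['mar']
```

For this region, the peak cluster holds a single month, the one with the index value of 111.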

In [31]:
travelRegionIndexes

array([ 90.,  80., 111.,  90.,  83.,  98.,  84.,  82.,  82., 102.,  83.,
        88.], dtype=float32)

In [64]:
def applyKMeans(x, k=4):
    """Flag each month in `x` (month -> visitor index) as peak season or not."""
    travelRegionIndexes = np.array(list(x.values()), dtype='f')
    centroids, _ = kmeans(travelRegionIndexes, k)

    # Sort centroids in descending order so that label 0 marks the busiest cluster
    sorted_indices = np.argsort(-centroids)
    sorted_centroids = centroids[sorted_indices]

    cluster_labels, _ = vq(travelRegionIndexes, sorted_centroids)
    # If every month lands in the same cluster (e.g. an all-zero index), no month is a peak
    is_one_cluster = np.all(cluster_labels == 0)
    arePeakSeason = {}
    for idx, key in enumerate(x.keys()):
        if is_one_cluster:
            arePeakSeason[key] = False
        else:
            # Store a plain Python bool rather than a NumPy bool
            arePeakSeason[key] = bool(cluster_labels[idx] == 0)
    return arePeakSeason
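Note that `kmeans` starts from randomly chosen centroids, so repeated runs of the function above may label borderline months differently. A minimal sketch of a reproducible variant follows. It assumes a SciPy version whose `kmeans` accepts the `seed` keyword (1.7 or later, to our knowledge); the name `apply_kmeans_seeded` and the `seed` default are hypothetical, and the sample values are the ones printed earlier for row 30, paired with an assumed calendar key order.

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq

def apply_kmeans_seeded(visitor_index, k=4, seed=0):
    """Hypothetical seeded variant of applyKMeans (not the notebook's own function)."""
    values = np.array(list(visitor_index.values()), dtype=float)
    # A fixed seed makes the random centroid initialisation repeatable
    centroids, _ = kmeans(values, k, seed=seed)
    # Descending order, so label 0 is the cluster with the largest centroid
    ordered = np.sort(centroids)[::-1]
    labels, _ = vq(values, ordered)
    if np.all(labels == 0):
        # Everything in one cluster: no peak season
        return {month: False for month in visitor_index}
    return {month: bool(label == 0) for month, label in zip(visitor_index, labels)}

# Values printed above for row 30; the key order is an assumption
sample = dict(zip(
    ['jan', 'feb', 'mar', 'apr', 'may', 'jun',
     'jul', 'aug', 'sep', 'oct', 'nov', 'dec'],
    [90, 80, 111, 90, 83, 98, 84, 82, 82, 102, 83, 88],
))
flags = apply_kmeans_seeded(sample, seed=0)
print(flags)
```

With the same seed, two calls return the same flags. Which months other than the maximum end up in the peak cluster still depends on the seed chosen.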

In [66]:
applyKMeans(travelRegions.iloc[0]['visitorIndex'])

{'jan': False,
 'feb': False,
 'mar': False,
 'apr': False,
 'may': False,
 'jun': False,
 'jul': False,
 'aug': False,
 'sep': False,
 'oct': False,
 'nov': False,
 'dec': False}

In [62]:
# array([2, 3, 0, 2, 3, 1, 3, 3, 3, 1, 3, 2], dtype=int32)
np.all(np.array([2, 3, 0, 2, 3, 1, 3, 3, 3, 1, 3, 2]) == 0)

False

In [68]:
travelRegions['isPeakSeason'] = travelRegions['visitorIndex'].apply(applyKMeans)
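Before exporting, it can help to sanity-check the new column by listing the peak months per region. The sketch below runs on a small stand-in frame; the `toy` DataFrame, its region names, and the `peakMonths` column are hypothetical and only mirror the shape of the `isPeakSeason` column.

```python
import pandas as pd

# Hypothetical stand-in for travelRegions with an isPeakSeason column
toy = pd.DataFrame({
    'Region': ['A', 'B'],
    'isPeakSeason': [
        {'jan': False, 'feb': True, 'mar': True},
        {'jan': False, 'feb': False, 'mar': False},
    ],
})

# Collect the months flagged as peak season for each row
toy['peakMonths'] = toy['isPeakSeason'].apply(
    lambda flags: [month for month, is_peak in flags.items() if is_peak]
)

print(toy[['Region', 'peakMonths']])
```

Rows with an empty list correspond to regions where all months fell into one cluster, such as the all-zero `World` row.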

In [70]:
travelRegions.to_json('travelRegionsWithSeasons.json', orient='records', indent=2)