* https://www.bookingchallenge.com/

* Predict `city_id`
    * Metric: P@4

##### Dataset
The training dataset consists of over a million anonymized hotel reservations, based on real data, with the following features:
*    user_id - User ID
*    check-in - Reservation check-in date
*    checkout - Reservation check-out date
*    affiliate_id - An anonymized ID of affiliate channels where the booker came from (e.g. direct, some third party referrals, paid search engine, etc.)
*    device_class - desktop/mobile
*    booker_country - Country from which the reservation was made (anonymized)
*    hotel_country - Country of the hotel (anonymized)
*    city_id - city_id of the hotel’s city (anonymized)
*    utrip_id - Unique identification of user’s trip (a group of multi-destinations bookings within the same trip)


* Each reservation is a part of a customer’s trip (identified by utrip_id) which includes at least 4 consecutive reservations. The check-out date of a reservation is the check-in date of the following reservation in their trip.

* The evaluation dataset is constructed similarly, however the city_id of the final reservation of each trip is concealed and requires a prediction.

 
###### Evaluation criteria
The goal of the challenge is to predict (and recommend) the final city (city_id) of each trip (utrip_id). We will evaluate the quality of the predictions based on the top four recommended cities for each trip by using Precision@4 metric (4 representing the four suggestion slots at Booking.com website). When the true city is one of the top 4 suggestions (regardless of the order), it is considered correct.
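
A minimal sketch of the metric as described above: a trip counts as correct when its true `city_id` appears anywhere among the four suggestions. The function name `precision_at_k` and the toy values are illustrative assumptions, not part of the official challenge code.

```python
import numpy as np

def precision_at_k(y_true, top_k_preds, k=4):
    """Fraction of trips whose true city is among the first k suggestions."""
    y_true = np.asarray(y_true)
    top_k = np.asarray(top_k_preds)[:, :k]
    # A trip is a hit if any of its k suggestions equals the true city
    hits = (top_k == y_true[:, None]).any(axis=1)
    return float(hits.mean())

# Toy example: 3 trips, 4 suggested city_ids each (made-up values)
true_cities = [10, 20, 30]
suggestions = [[10, 1, 2, 3],
               [5, 6, 7, 8],
               [9, 8, 30, 7]]
score = precision_at_k(true_cities, suggestions)
```

Order within the four slots does not matter here, matching the "regardless of the order" rule.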

------------------------------------------------------

* If we are given the country in question, this may be more of a _learning to rank_ problem than a massively multiclass one.
    * CatBoost learning to rank on ms dataset (0/1):  https://colab.research.google.com/github/catboost/tutorials/blob/master/ranking/ranking_tutorial.ipynb
        * https://catboost.ai/docs/concepts/loss-functions-ranking.html
        * For CatBoost ranking, all objects in the dataset must be grouped by group_id; in our case this would be user/trip ID X country. (We still need to add negatives within each such group/"query".)

    * lightFM - ranking (implicit interactions)
        * https://github.com/qqwjq/lightFM

    * lstm/w2v - next item recommendation
    * dot product between different factors as features (recc.)
    * xgboost ap - https://www.kaggle.com/anokas/xgboost-2
* Relevant: Kaggle expedia hotel prediction: https://www.kaggle.com/c/expedia-hotel-recommendations/discussion  

* ALSO: `implicit interaction` - recommendation problem (we have only positive feedback, no ranked/negative explicit feedback)


* __BASELINE__ to beat: 4 most popular by country ; 4 most popular by affiliate_id X booker_country X hotel_country (X month?)
    * Ignore/auto-answer with the 4 most popular for countries with fewer than 4 unique cities in the data
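
A rough sketch of the popularity baseline, assuming a DataFrame with the training columns. The helper name `top_k_by_group` and the toy values are made up for illustration; the grouping key should be whichever columns are actually known at prediction time.

```python
import pandas as pd

def top_k_by_group(df, group_cols, target="city_id", k=4):
    """Map each group (e.g. a country) to its k most frequent target values."""
    return (df.groupby(group_cols)[target]
              .apply(lambda s: s.value_counts().index[:k].tolist()))

# Toy data (made-up values, same column names as the training set)
toy = pd.DataFrame({
    "hotel_country": ["A"] * 6 + ["B"] * 3,
    "city_id":       [1, 1, 1, 2, 2, 3, 7, 7, 8],
})
top_by_country = top_k_by_group(toy, ["hotel_country"], k=2)

# Global fallback for groups never seen in training
global_top = toy["city_id"].value_counts().index[:2].tolist()
```

Passing several columns in `group_cols` (for example affiliate, booker country and month) gives the finer-grained variant mentioned above, at the cost of sparser groups.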
 
 
* Likely approach : build a model (and targets/negatives) per country.

-----------
#### Data notes:
* Long tail of cities and countries
* Some (31%) countries have 4 or fewer unique cities - for those, return a fixed answer/prediction?
    * CAN'T! In the test set, we will not have the country ID :(
    
    
----------------------
MF - embedding model

* https://blog.tensorflow.org/2020/09/introducing-tensorflow-recommenders.html
* Implicit recommendations - needs negs
    * example of explicit (simple): https://petamind.com/build-a-simple-recommender-system-with-matrix-factorization/
* sample negatives - how? TFRS requires tf.data.Dataset overhead (and confuses me about what the user id should be)
    * https://www.kaggle.com/skihikingkevin/some-recommender-system-implementations
    
    
Simple keras example of multiple inputs : 
* keras topologies
* https://stackoverflow.com/questions/61722973/why-keras-embedding-not-learning-for-recommendation-system


* Tensorflow ranking (seems in beta): https://colab.research.google.com/github/tensorflow/ranking/blob/master/tensorflow_ranking/examples/handling_sparse_features.ipynb#scrollTo=HfDMGnZY9eVO


Negative pairs training with generator - https://towardsdatascience.com/building-a-recommendation-system-using-neural-network-embeddings-1ef92e5c80c9
* See code. 

* example for implicit recommender (naive) - mainly for negatives data gen?
* https://www.kaggle.com/skihikingkevin/some-recommender-system-implementations


* Could use **lightFM** - implicit recommender? 
    * https://github.com/lyst/lightfm/tree/master/examples/dataset
    * https://github.com/lyst/lightfm/blob/master/examples/stackexchange/hybrid_crossvalidated.ipynb
    * Note use of sparse matrices. Supports metadata
* We could use tuple of features for "user id" for purposes of recommenders? 
  
* user-item sparse OHE creation - https://github.com/piyushpathak03/Recommendation-systems/blob/master/Recomendation%20system%20end%20to%20end/4)%20Feature%20Creation.ipynb
* lightfm -
    * https://github.com/piyushpathak03/Recommendation-systems/tree/master/Recomendation%20system%20end%20to%20end  - building the sparse interactions matrix for implicit recommendation
    * https://making.lyst.com/lightfm/docs/examples/dataset.html#building-the-interactions-matrix
* https://making.lyst.com/lightfm/docs/examples/hybrid_crossvalidated.html - example of metadata features for lightfm
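
A library-agnostic sketch of building the sparse implicit interactions matrix with pandas category codes and SciPy. Treating `utrip_id` as the "user" and `city_id` as the "item" is an assumption made here for illustration, and `build_interactions` is a hypothetical helper, not a lightFM function.

```python
import numpy as np
import pandas as pd
from scipy import sparse

def build_interactions(df, user_col="utrip_id", item_col="city_id"):
    """Return a binary user x item CSR matrix plus the row/column labels."""
    users = df[user_col].astype("category")
    items = df[item_col].astype("category")
    rows = users.cat.codes.to_numpy()
    cols = items.cat.codes.to_numpy()
    data = np.ones(len(df), dtype=np.float32)
    shape = (users.cat.categories.size, items.cat.categories.size)
    mat = sparse.coo_matrix((data, (rows, cols)), shape=shape).tocsr()
    # Repeated (user, item) pairs are summed on conversion; binarize them
    mat.data[:] = 1.0
    return mat, users.cat.categories, items.cat.categories

# Toy data with made-up ids
toy = pd.DataFrame({"utrip_id": ["a", "a", "b", "a"],
                    "city_id":  [1, 2, 2, 1]})
interactions, trip_labels, city_labels = build_interactions(toy)
```

The returned labels map matrix rows and columns back to the original ids, which is needed when turning scores into `city_id` predictions.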
  
* https://github.com/zhangruiskyline/DeepLearning/blob/master/doc/Recommendation.md#ranking  - includes negatives sampling! 

**Spotlight**
*  https://maciejkula.github.io/spotlight/interactions.html
    * Also has sequence support easily
    * Example of loading custom dataset for implicit recc - https://github.com/maciejkula/spotlight/issues/30

**SVD/ALS**
* https://stats.stackexchange.com/questions/354355/what-is-the-relation-between-svd-and-als
* https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.linalg.svds.html
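
A small sketch of the truncated-SVD route using `scipy.sparse.linalg.svds` on a user x item matrix. The function name `svd_embeddings`, the toy matrix, and the choice to split the singular values evenly between both factor matrices are assumptions for illustration.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import svds

def svd_embeddings(interactions, k=2):
    """Truncated SVD of a user x item matrix; k must be < min(matrix shape)."""
    u, s, vt = svds(interactions.astype(np.float64), k=k)
    # Sort components by descending singular value (order is not guaranteed)
    order = np.argsort(s)[::-1]
    u, s, vt = u[:, order], s[order], vt[order, :]
    user_factors = u * np.sqrt(s)
    item_factors = vt.T * np.sqrt(s)
    return user_factors, item_factors

# Toy binary interactions: 6 "users" x 5 "items" (made-up values)
toy = sparse.csr_matrix(np.array([
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1],
    [1, 0, 0, 0, 1],
], dtype=float))
user_factors, item_factors = svd_embeddings(toy, k=2)
scores = user_factors @ item_factors.T  # rank-k reconstruction, usable as item scores
```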
    
    
RankFM (implicit package) - don't know if adds anything vs lightfm?
* https://github.com/etlundquist/rankfm
* Does seem easier to "productionize"

```
from implicit.als import AlternatingLeastSquares
from scipy import sparse

def matrix_decomposition(matrix, k, i):
    matrix = sparse.csr_matrix(matrix.T)
    model = AlternatingLeastSquares(factors=k, iterations=i)
    model.fit(matrix)
    user_latent = model.user_factors
    item_latent = model.item_factors

    return user_latent, item_latent
```

Neural Collaborative Filtering
* https://calvinfeng.gitbook.io/machine-learning-notebook/supervised-learning/recommender/neural_collaborative_filtering
     * Using the model from here: https://nanx.me/blog/post/recsys-binary-implicit-feedback-r-keras/ and https://github.com/hexiangnan/neural_collaborative_filtering/blob/master/GMF.py

```
# Create the Training Set
APPROX_NEGATIVE_SAMPLE_SIZE = int(len(train)*1.2)
n_users = c_user.categories.shape[0]
n_tracks = c_track.categories.shape[0]
# Create Training Set
train_users = train['username'].cat.codes.values
train_tracks = train['track_id'].cat.codes.values
train_labels = np.ones(len(train_users))
# insert negative samples
u = np.random.randint(n_users, size=APPROX_NEGATIVE_SAMPLE_SIZE)
i = np.random.randint(n_tracks, size=APPROX_NEGATIVE_SAMPLE_SIZE)
non_neg_idx = np.where(train_data[u,i] == 0)
train_users = np.concatenate([train_users, u[non_neg_idx[1]]])
train_tracks = np.concatenate([train_tracks, i[non_neg_idx[1]]])
train_labels = np.concatenate([train_labels, np.zeros(u[non_neg_idx[1]].shape[0])])
print((train_users.shape, train_tracks.shape, train_labels.shape))

# random shuffle the data (because Keras takes last 10% as validation split)
X = np.stack([train_users, train_tracks, train_labels], axis=1)
np.random.shuffle(X)
```



* https://vitobellini.github.io/posts/2018/01/03/how-to-build-a-recommender-system-in-tensorflow.html  - easily turn a df into a matrix (need to add "as_sparse") - autoencoder approach: 
    ```
    # Convert DataFrame in user-item matrix
    matrix = df.pivot(index='user', columns='item', values='rating')
    matrix.fillna(0, inplace=True)
    ...
    # Users and items ordered as they are in matrix

    users = matrix.index.tolist()
    items = matrix.columns.tolist()

    matrix = matrix.to_numpy()  # DataFrame.as_matrix() was removed in pandas 1.0
    ```
    
Triplets/siamese + triplet mining - 
* https://github.com/maciejkula/triplet_recommendations_keras

In [1]:
# Recommenders embedding - fit generator
# https://towardsdatascience.com/building-a-recommendation-system-using-neural-network-embeddings-1ef92e5c80c9
# Also has code for generator to generate positive, negative pairs per batch - good for siamese/triplets/metric! 

import numpy as np
import random
random.seed(100)
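def generate_batch(pairs, n_positive = 50, negative_ratio = 1.0, neg_label = 0):
    """Generate batches of samples for training. 
       Randomly select positive samples from pairs, then randomly select negatives.
       Assumes `books`, `links` and `pairs_set` (the set of positive pairs)
       exist as globals, as in the linked article."""
    
    # Create empty array to hold batch (the size must be an int for np.zeros)
    batch_size = int(n_positive * (1 + negative_ratio))
    batch = np.zeros((batch_size, 3))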
    
    # Continue to yield samples
    while True:
        # Randomly choose positive examples
        for idx, (book_id, link_id) in enumerate(random.sample(pairs, n_positive)):
            batch[idx, :] = (book_id, link_id, 1)
        idx += 1
        
        # Add negative examples until reach batch size
        while idx < batch_size:
            
            # Random selection
            random_book = random.randrange(len(books))
            random_link = random.randrange(len(links))
            
            # Check to make sure this is not a positive example
            if (random_book, random_link) not in pairs_set:
                
                # Add to batch and increment index
                batch[idx, :] = (random_book, random_link, neg_label)
                idx += 1
                
        # Make sure to shuffle order
        np.random.shuffle(batch)
        yield {'book': batch[:, 0], 'link': batch[:, 1]}, batch[:, 2]


Possible approach + negatives - https://github.com/zhangruiskyline/DeepLearning/blob/master/doc/Recommendation.md#ranking 
* To assess the quality of the recommendations, we do the following for each user:
    * compute matching scores for items (except the movies that the user has already seen in the training set),
    * compare to the positive feedback actually collected on the test set using the ROC AUC ranking metric,
    * average ROC AUC scores across users to get the average performance of the recommender model on the test set.
```
import numpy as np
from sklearn.metrics import roc_auc_score

def average_roc_auc(match_model, data_train, data_test):
    """Compute the ROC AUC for each user and average over users"""
    max_user_id = max(data_train['user_id'].max(), data_test['user_id'].max())
    max_item_id = max(data_train['item_id'].max(), data_test['item_id'].max())
    user_auc_scores = []
    for user_id in range(1, max_user_id + 1):
        pos_item_train = data_train[data_train['user_id'] == user_id]
        pos_item_test = data_test[data_test['user_id'] == user_id]

        # Rank all items except those already seen in the training set
        all_item_ids = np.arange(1, max_item_id + 1)
        items_to_rank = np.setdiff1d(all_item_ids, pos_item_train['item_id'].values)

        # Ground truth: return 1 for each item positively present in the test set
        # and 0 otherwise.
        expected = np.in1d(items_to_rank, pos_item_test['item_id'].values)

        if np.sum(expected) >= 1:
            # At least one positive test value to rank
            repeated_user_id = np.empty_like(items_to_rank)
            repeated_user_id.fill(user_id)

            predicted = match_model.predict([repeated_user_id, items_to_rank],
                                            batch_size=4096)
            user_auc_scores.append(roc_auc_score(expected, predicted))

    return sum(user_auc_scores) / len(user_auc_scores)
```    



* Negative sampling from the sparse user-item cooccurrence matrix
    * https://stackoverflow.com/questions/49971318/how-to-generate-negative-samples-in-tensorflow
    ```
    def subsampler(data, num_pos=10, num_neg=10):
        """Yield random batches made up of positive and negative samples.

        `data` is expected to be a scipy.sparse COO matrix (uses .row/.col/.data).

        Yields
        ------
        positive_row : np.array
           Row ids of the positive samples
        positive_col : np.array
           Column ids of the positive samples
        positive_data : np.array
           Data values in the positive samples
        negative_row : np.array
           Row ids of the negative samples
        negative_col : np.array
           Column ids of the negative samples

        Note
        ----
        Negative data values are not returned, since they are always zero.
        """
        N, D = data.shape
        y_data = data.data
        y_row = data.row
        y_col = data.col

        # store all of the positive (i, j) coords
        idx = np.vstack((y_row, y_col)).T
        idx = set(map(tuple, idx.tolist()))
        while True:
            # get positive sample
            positive_idx = np.random.choice(len(y_data), num_pos)
            positive_row = y_row[positive_idx].astype(np.int32)
            positive_col = y_col[positive_idx].astype(np.int32)
            positive_data = y_data[positive_idx].astype(np.float32)

            # get negative sample
            negative_row = np.zeros(num_neg, dtype=np.int32)
            negative_col = np.zeros(num_neg, dtype=np.int32)
            for k in range(num_neg):
                i, j = np.random.randint(N), np.random.randint(D)
                # resample until (i, j) is not a known positive
                while (i, j) in idx:
                    i, j = np.random.randint(N), np.random.randint(D)
                negative_row[k] = i
                negative_col[k] = j

            yield (positive_row, positive_col, positive_data,
                   negative_row, negative_col)
    ```

In [2]:
from catboost import CatBoostClassifier
from sklearn.preprocessing import LabelEncoder, StandardScaler, MinMaxScaler
import pandas as pd
from sklearn.model_selection import train_test_split, GroupShuffleSplit
import numpy as np
from tensorflow.keras.metrics import TopKCategoricalAccuracy, Precision, SparseTopKCategoricalAccuracy # @4
from tensorflow.keras.utils import to_categorical
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import GridSearchCV

%matplotlib inline

pd.set_option("display.max_columns", 90)

In [3]:
!nvidia-smi -L

GPU 0: GeForce RTX 2060 (UUID: GPU-d8cefda9-d4cb-990c-cc01-a2a4f2416484)


In [4]:
## https://www.tensorflow.org/guide/mixed_precision ## TF mixed precision - pytorch requires other setup
from tensorflow.keras.mixed_precision import experimental as mixed_precision

policy = mixed_precision.Policy('mixed_float16')
mixed_precision.set_policy(policy)
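## note: in TF >= 2.4 the non-experimental equivalent is tf.keras.mixed_precision.set_global_policy('mixed_float16')
## output layers must be kept in float32 for numeric stability, e.g.: 
## outputs = layers.Activation('softmax', dtype='float32', name='predictions')(x)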

The dtype policy mixed_float16 may run slowly because this machine does not have a GPU. Only Nvidia GPUs with compute capability of at least 7.0 run quickly with mixed_float16.
Instructions for updating:
Use tf.keras.mixed_precision.LossScaleOptimizer instead. LossScaleOptimizer now has all the functionality of DynamicLossScale


#### Features to add:
* Lag 
* Rank (popularity) of city, country (in general, +- given booker country)
* Count of hotel, user, trip size? (may be leaky)
* Seasonal features - Holidays? , datetime

Aggregate feats:
* user changed country? last booking (lag 1) country change? 
* max/min/avg popularity rank of previous locations visited



We should create a dictionary of the rank, count, city/country etc. features, so we can easily merge them when making more "negative" samples/features for ranking.
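
One way this lookup could look, as a sketch: a per-city table built once from the training rows and then left-merged onto any candidate (positive or sampled negative) rows. The helper `build_city_lookup` and the column names `city_count` and `city_rank_in_country` are hypothetical.

```python
import pandas as pd

def build_city_lookup(df):
    """One row per city_id with its count, country, and popularity rank in that country."""
    lookup = (df.groupby("city_id")
                .agg(city_count=("hotel_country", "size"),
                     hotel_country=("hotel_country", "first"))
                .reset_index())
    # Rank 1 = most booked city within its country
    lookup["city_rank_in_country"] = (
        lookup.groupby("hotel_country")["city_count"]
              .rank(method="dense", ascending=False)
              .astype(int))
    return lookup

# Toy training rows (made-up values)
toy = pd.DataFrame({"city_id":       [1, 1, 2, 3, 3, 3],
                    "hotel_country": ["A", "A", "A", "B", "B", "B"]})
lookup = build_city_lookup(toy)

# Candidate cities for one trip, e.g. the true city plus sampled negatives
candidates = pd.DataFrame({"utrip_id": ["t1", "t1"], "city_id": [2, 3]})
candidates = candidates.merge(lookup, on="city_id", how="left")
```

Because the table is keyed only on `city_id`, the same merge works for any number of generated negatives without recomputing group statistics.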


* Consider using a df2 of df without dates + drop_duplicates, +- without user/trip id (after calculating that).


Leaky or potentially leaky (depends on test set): 
* Target freq features - frequency of target city, given source country +- affiliate +- month of year +- given country (and interactions of target freq). 
    * Risk of leaks - depends on whether the test data has a temporal split or not. 
    * CatBoost can do target encoding, but this lets us do it for interactions, e.g. target city freq given the 2 countries and affiliate.
    * beware overfitting! 
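
One common way to reduce this kind of leakage is to compute the target frequency out-of-fold, so a row never contributes to its own encoding. The sketch below assumes folds split by `utrip_id` via scikit-learn's `GroupKFold`; the helper `oof_target_freq` is hypothetical, and a temporal split may be more appropriate if the test set turns out to be later in time.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold

def oof_target_freq(df, key_cols, target="city_id", group="utrip_id", n_splits=5):
    """Estimate P(target | key_cols) for each row using only the other folds."""
    out = pd.Series(np.nan, index=df.index, dtype=float)
    cols = key_cols + [target]
    for fit_idx, enc_idx in GroupKFold(n_splits=n_splits).split(df, groups=df[group]):
        fit = df.iloc[fit_idx]
        pair_n = fit.groupby(cols).size().rename("pair_n").reset_index()
        key_n = fit.groupby(key_cols).size().rename("key_n").reset_index()
        freq = pair_n.merge(key_n, on=key_cols)
        freq["freq"] = freq["pair_n"] / freq["key_n"]
        # A left merge keeps the row order of the fold being encoded
        enc = df.iloc[enc_idx][cols].merge(freq[cols + ["freq"]], on=cols, how="left")
        out.iloc[enc_idx] = enc["freq"].to_numpy()
    # (key, target) pairs unseen in the fitting folds get frequency 0
    return out.fillna(0.0)

# Toy data (made-up values): one key value, one target value
toy = pd.DataFrame({"utrip_id": ["t1", "t1", "t2", "t2", "t3", "t3", "t4", "t4"],
                    "booker_country": ["x"] * 8,
                    "city_id": [5] * 8})
toy["city_freq_by_booker"] = oof_target_freq(toy, ["booker_country"], n_splits=2)
```

Adding more columns to `key_cols` gives the interaction encodings mentioned above; sparser keys make the estimates noisier, which is where the overfitting risk comes from.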

In [5]:
MIN_TARGET_FREQ = 80 # drop target/city_id values that appear fewer than this many times as the final step's target 
KEEP_TOP_K_TARGETS = 900 # keep K most frequent city ID targets (redundant with the above)

## (some) categorical variables that appear less than this many times will be replaced with a placeholder value!
## Includes CITY id (but done after target filtering, to avoid creating a "rare class" target):
LOW_COUNT_THRESH = 11

RUN_TABNET = True
max_epochs = 16

In [6]:
# most basic categorical columns, without 'user_id', 'utrip_id' or device_class - used for count encoding/filtering
BASE_CAT_COLS = ['city_id',  'affiliate_id', 'booker_country', 'hotel_country']

### features to get lags for. Not very robust. May want different feats for lags before -1
LAG_FEAT_COLS = ['city_id', 'device_class',
       'affiliate_id', 'booker_country', 'hotel_country', 
       'duration', 'same_country', 'checkin_weekday',
       'checkin_week',
        'checkout_weekday',
       'city_id_count', 'affiliate_id_count',
       'booker_country_count', 'hotel_country_count', 
       'checkin_month_count', 'checkin_week_count', 'city_id_nunique',
       'affiliate_id_nunique', 'booker_country_nunique',
       'hotel_country_nunique', 'city_id_rank_by_hotel_country',
       'city_id_rank_by_booker_country', 'city_id_rank_by_affiliate',
       'affiliate_id_rank_by_hotel_country',
       'affiliate_id_rank_by_booker_country', 
       'booker_country_rank_by_hotel_country',
       'booker_country_rank_by_booker_country',
       'booker_country_rank_by_affiliate',
#        'hotel_country_rank_by_hotel_country',
       'hotel_country_rank_by_booker_country',
       'hotel_country_rank_by_affiliate',
       'checkin_month_rank_by_hotel_country',
       'checkin_month_rank_by_booker_country',
       'checkin_month_rank_by_affiliate'
                ]

In [7]:
# https://stackoverflow.com/questions/33907537/groupby-and-lag-all-columns-of-a-dataframe
# https://stackoverflow.com/questions/62924987/lag-multiple-variables-grouped-by-columns
## lag features with groupby over many columns: 
def groupbyLagFeatures(df:pd.DataFrame,lag:[]=[1,2],group="utrip_id",lag_feature_cols=[]):
    """
    lag features with groupby over many columns
    https://stackoverflow.com/questions/62924987/lag-multiple-variables-grouped-by-columns"""
    if len(lag_feature_cols)>0:
        df=pd.concat([df]+[df.groupby(group)[lag_feature_cols].shift(x).add_prefix('lag'+str(x)+"_") for x in lag],axis=1)
    else:
        df=pd.concat([df]+[df.groupby(group).shift(x).add_prefix('lag'+str(x)+"_") for x in lag],axis=1)
    return df

def groupbyFirstLagFeatures(df:pd.DataFrame,group="user_id",lag_feature_cols=[]):
    """
    Get  first/head value lag-like of features with groupby over columns. Assumes sorted data!
    """
    if len(lag_feature_cols)>0:
        df=pd.concat([df]+[df.groupby(group)[lag_feature_cols].transform("first").add_prefix("first_")],axis=1)
    else:
#          df=pd.concat([df]+[df.groupby(group).first().add_prefix("first_")],axis=1)
        df=pd.concat([df]+[df.groupby(group).transform("first").add_prefix("first_")],axis=1)
    return df

######## Get n most popular items, per group
def most_popular(group, n_max=4):
    """Find most popular hotel clusters by destination
    Define a function to get most popular hotels for a destination group.

    Previous version used nlargest() Series method to get indices of largest elements. But the method is rather slow.
    Source: https://www.kaggle.com/dvasyukova/predict-hotel-type-with-pandas
    """
    relevance = group['relevance'].values
    hotel_cluster = group['hotel_cluster'].values
    most_popular = hotel_cluster[np.argsort(relevance)[::-1]][:n_max]
    return np.array_str(most_popular)[1:-1] # remove square brackets


## https://codereview.stackexchange.com/questions/149306/select-the-n-most-frequent-items-from-a-pandas-groupby-dataframe
# https://stackoverflow.com/questions/52073054/group-by-a-column-to-find-the-most-frequent-value-in-another-column
## can get modes (sorted)
# https://stackoverflow.com/questions/50592762/finding-most-common-values-with-pandas-groupby-and-value-counts
## df.groupby('tag')['category'].agg(lambda x: x.value_counts().index[0])
# https://stackoverflow.com/questions/15222754/groupby-pandas-dataframe-and-select-most-common-value
# source2.groupby(['Country','City'])['Short name'].agg(pd.Series.mode)


In [8]:
df = pd.read_csv("booking_train_set.csv",
                 nrows=623456,
                 index_col=[0],
                 parse_dates=["checkin","checkout"],infer_datetime_format=True)

df.sort_values(["user_id","checkin"],inplace=True)

df

Unnamed: 0,user_id,checkin,checkout,city_id,device_class,affiliate_id,booker_country,hotel_country,utrip_id
117277,136,2016-09-20,2016-09-22,52933,desktop,9924,The Devilfire Empire,Osterlich,136_4
117278,136,2016-09-22,2016-09-23,51685,desktop,9924,The Devilfire Empire,Osterlich,136_4
117279,136,2016-09-23,2016-09-24,43323,desktop,9924,The Devilfire Empire,Osterlich,136_4
117280,136,2016-09-24,2016-09-26,55990,desktop,9924,The Devilfire Empire,Osterlich,136_4
117281,136,2016-09-26,2016-09-27,46411,desktop,9924,The Devilfire Empire,Osterlich,136_4
...,...,...,...,...,...,...,...,...,...
180791,6258041,2016-05-01,2016-05-02,17338,mobile,9452,Elbonia,Glubbdubdrib,6258041_1
420479,6258087,2016-08-03,2016-08-04,17754,desktop,2436,Gondal,Gondal,6258087_1
420480,6258087,2016-08-04,2016-08-05,50073,desktop,2436,Gondal,Gondal,6258087_1
420481,6258087,2016-08-05,2016-08-06,11662,desktop,2436,Gondal,Gondal,6258087_1


In [9]:
df["duration"] = (df["checkout"] - df["checkin"]).dt.days
df["same_country"] = (df["booker_country"]==df["hotel_country"]).astype(int)

df["checkin_day"] = df["checkin"].dt.day
df["checkin_weekday"] = df["checkin"].dt.weekday
df["checkin_week"] = df["checkin"].dt.isocalendar().week.astype(int) ## week of year
df["checkin_month"] = df["checkin"].dt.month
df["checkin_year"] = df["checkin"].dt.year-2016

df["checkin_quarter"] = df["checkin"].dt.quarter # relatively redundant but may be used for "id"

df["checkout_weekday"] = df["checkout"].dt.weekday
df["checkout_week"] = df["checkout"].dt.isocalendar().week.astype(int) ## week of year
df["checkout_day"] = df["checkout"].dt.day ## day of month

## cyclical datetime embeddings
## drop original variables? 
## TODO: add for other variables, +- those that we'll embed (week?)

df['checkin_weekday_sin'] = np.sin(df["checkin_weekday"]*(2.*np.pi/7))
df['checkin_weekday_cos'] = np.cos(df["checkin_weekday"]*(2.*np.pi/7))
df['checkin_month_sin'] = np.sin((df["checkin_month"]-1)*(2.*np.pi/12))
df['checkin_month_cos'] = np.cos((df["checkin_month"]-1)*(2.*np.pi/12))

#############
# last number in utrip id - probably which trip number it is:
df["utrip_number"] = df["utrip_id"].str.split("_",expand=True)[1].astype(int)

### encode string columns - must be consistent with test data 
### IF we can concat test with train, we can just do a single transformation  for the NON TARGET cols
# obj_cols_list = df.select_dtypes("O").columns.values
obj_cols_list = ['device_class','booker_country','hotel_country'] # we could also define when loading data, dtype
for c in obj_cols_list:
    df[c] = df[c].astype("category")
    df[c] = df[c].cat.codes.astype(int)

## view steps of a trip per user & trip, in order. ## last step == 1.
## count #/pct step in a trip (utrip_id) per user. Useful to get the "final" step per trip - for prediction
## note that the order is ascending, so we would need to select by "last" . (i.e "1" is the first step, 2 the second, etc') , or we could use pct .rank(ascending=True,pct=True)
#### this feature overlaps with the count of each trip id (for the final row)
##  = df.sort_values(["checkin","checkout"])... - df already sorted above
df["utrip_steps_from_end"] = df.groupby("utrip_id")["checkin"].rank(ascending=True,pct=True) #.cumcount("user_id")
# print(df["utrip_steps_from_end"].describe()) # min is greater than 0
# df[["user_id","utrip_steps_from_end","checkin"]].sort_values(["user_id","utrip_steps_from_end"])

In [10]:
### add features to be consistent with the test set: row number within the trip, and total rows in the trip

df["row_num"] = df.groupby("utrip_id")["checkin"].rank(ascending=True,pct=False).astype(int)
utrip_counts = df["utrip_id"].value_counts()
df["total_rows"] = df["utrip_id"].map(utrip_counts)

df[["row_num","total_rows"]].describe()

Unnamed: 0,row_num,total_rows
count,623456.0,623456.0
mean,3.558915,6.11783
std,2.381663,2.812467
min,1.0,1.0
25%,2.0,4.0
50%,3.0,5.0
75%,5.0,7.0
max,48.0,48.0


In [11]:
df["last"] = (df["row_num"] ==df["total_rows"]).astype(int)

* Add first country, city visited in a trip. 
* Drop first row of a trip

In [12]:
## add the "first" place visited/values
### note - will need to drop first row in trip, or impute nans when using this feature 

### first by user results in too much sparsity/rareness for our IDs purposes
df = groupbyFirstLagFeatures(df,group="utrip_id",lag_feature_cols=["hotel_country","city_id"]) # ["hotel_country","city_id"]

## alt - messy, but maybe good enough : 
# df = groupbyFirstLagFeatures(df,group=['device_class', 'affiliate_id',
#                                        'booker_country','checkin_month',"last"],lag_feature_cols=["hotel_country"])

df = df.loc[df["row_num"]>1]
print(df[["first_hotel_country","hotel_country","city_id"]].nunique())
df


first_hotel_country      158
hotel_country            177
city_id                30137
dtype: int64


Unnamed: 0,user_id,checkin,checkout,city_id,device_class,affiliate_id,booker_country,hotel_country,utrip_id,duration,same_country,checkin_day,checkin_weekday,checkin_week,checkin_month,checkin_year,checkin_quarter,checkout_weekday,checkout_week,checkout_day,checkin_weekday_sin,checkin_weekday_cos,checkin_month_sin,checkin_month_cos,utrip_number,utrip_steps_from_end,row_num,total_rows,last,first_hotel_country,first_city_id
117278,136,2016-09-22,2016-09-23,51685,0,9924,4,113,136_4,1,0,22,3,38,9,0,3,4,38,23,0.433884,-0.900969,-0.866025,-5.000000e-01,4,0.285714,2,7,0,113,52933
117279,136,2016-09-23,2016-09-24,43323,0,9924,4,113,136_4,1,0,23,4,38,9,0,3,5,38,24,-0.433884,-0.900969,-0.866025,-5.000000e-01,4,0.428571,3,7,0,113,52933
117280,136,2016-09-24,2016-09-26,55990,0,9924,4,113,136_4,2,0,24,5,38,9,0,3,0,39,26,-0.974928,-0.222521,-0.866025,-5.000000e-01,4,0.571429,4,7,0,113,52933
117281,136,2016-09-26,2016-09-27,46411,0,9924,4,113,136_4,1,0,26,0,39,9,0,3,1,39,27,0.000000,1.000000,-0.866025,-5.000000e-01,4,0.714286,5,7,0,113,52933
117282,136,2016-09-27,2016-09-28,45399,0,9924,4,113,136_4,1,0,27,1,39,9,0,3,2,39,28,0.781831,0.623490,-0.866025,-5.000000e-01,4,0.857143,6,7,0,113,52933
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
180790,6258041,2016-04-30,2016-05-01,7529,1,9452,1,58,6258041_1,1,0,30,5,17,4,0,2,6,17,1,-0.974928,-0.222521,1.000000,6.123234e-17,1,0.750000,3,4,0,58,57109
180791,6258041,2016-05-01,2016-05-02,17338,1,9452,1,58,6258041_1,1,0,1,6,17,5,0,2,0,18,2,-0.781831,0.623490,0.866025,-5.000000e-01,1,1.000000,4,4,1,58,57109
420480,6258087,2016-08-04,2016-08-05,50073,0,2436,2,59,6258087_1,1,1,4,3,31,8,0,3,4,31,5,0.433884,-0.900969,-0.500000,-8.660254e-01,1,0.500000,2,4,0,59,17754
420481,6258087,2016-08-05,2016-08-06,11662,0,2436,2,59,6258087_1,1,1,5,4,31,8,0,3,5,31,6,-0.433884,-0.900969,-0.500000,-8.660254e-01,1,0.750000,3,4,0,59,17754


In [13]:
### replace rare values (fewer than 3 occurrences) with the "-2" dummy

affiliates_counts = df["affiliate_id"].value_counts()
print("before:", affiliates_counts)
print("uniques",df["affiliate_id"].nunique())
affiliates_counts = affiliates_counts.to_dict()
# df["affiliate_id"] = df["affiliate_id"].where(df["affiliate_id"].apply(lambda x: x.map(x.value_counts()))>=3, -1)
df["affiliate_id"] = df["affiliate_id"].where(df["affiliate_id"].map(affiliates_counts)>=3, -2)
df["affiliate_id"] = df["affiliate_id"].astype(int)

print("after\n",df["affiliate_id"].value_counts())
print("uniques",df["affiliate_id"].nunique())

before: 9924    120761
359      76156
9452     38619
384      38455
4541     17739
         ...  
5806         1
1244         1
8511         1
6456         1
6147         1
Name: affiliate_id, Length: 2337, dtype: int64
uniques 2337
after
 9924    120761
359      76156
9452     38619
384      38455
4541     17739
         ...  
1450         3
3499         3
4387         3
3755         3
8825         3
Name: affiliate_id, Length: 1480, dtype: int64
uniques 1480


In [14]:
### for possible "user id" embedding/ID: how many unique values are there for these source tuples?
### Could also maybe add previous location/lag1 country/city ? 
## 'device_class','affiliate_id', 'booker_country' - 7.5 K "uniques"
## 'device_class','affiliate_id', 'booker_country','checkin_month' - 24 K "uniques"
## 'device_class','affiliate_id', 'booker_country','checkin_quarter' 14K "uniques"

print(df[['device_class','affiliate_id', 'booker_country','checkin_month',"total_rows"]].nunique(axis=0))
df.groupby(['device_class','affiliate_id', 'booker_country','checkin_quarter']).size()

device_class         3
affiliate_id      1480
booker_country       5
checkin_month       12
total_rows          36
dtype: int64


device_class  affiliate_id  booker_country  checkin_quarter
0             -2            0               1                   5
                                            2                   7
                                            3                  22
                                            4                  11
                            1               1                  28
                                                               ..
2              10615        1               2                   3
                                            3                  11
                                            4                   1
               10646        2               4                   1
               10668        2               1                   1
Length: 8533, dtype: int64

In [15]:
# df.groupby(['device_class','affiliate_id', 'booker_country','checkin_month']).size() ## 24k

In [16]:
##### Following aggregation features - would be best to use time window (sort data) to generate, otherwise they will LEAK! (e.g. nunique countries visited)

### count features (can also later add rank inside groups).
### Some may be leaks (# visits in a trip should use time window?) , and do users repeat? 
### can add more counts of group X time period (e.g. affiliate X month of year)
## alt way to get counts/freq :
# freq = df["city_id"].value_counts()
# df["city_id_count"] = df["city_id"].map(freq)
# print(df["city_id_count"].describe())

count_cols = [ 'city_id','affiliate_id', 'booker_country', 'hotel_country', 
#               'utrip_id','user_id', 
 "checkin_month","checkin_week"]
for c in count_cols:
    df[f"{c}_count"] = df.groupby([c])["duration"].transform("size")
    
########################################################
## nunique per trip
### https://stackoverflow.com/questions/46470743/how-to-efficiently-compute-a-rolling-unique-count-in-a-pandas-time-series

nunique_cols = [ 'city_id','affiliate_id', 'booker_country', 'hotel_country']
# df["nunique_booker_countries"] = df.groupby("utrip_id")["booker_country"].nunique()
# df["nunique_hotel_country"] = df.groupby("utrip_id")["hotel_country"].nunique()
for c in nunique_cols:
    df[f"{c}_nunique"] = df.groupby(["utrip_id"])[c].transform("nunique")
print(df.nunique())

########################################################
## get frequency/count feature's rank within a group - e.g. within a country (or affiliate) 
## add "_count" to column name to get count col name, then add rank col 

### ALT/ duplicate feat - add percent rank (instead or in addition)

rank_cols = ['city_id','affiliate_id', 'booker_country','hotel_country',
 "checkin_month"]
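### what is the meaning of grouping by a variable and ranking that same variable's count? The count is constant within each group, so every row ties and the (average) rank is just (group size + 1) / 2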
for c in rank_cols:
    df[f"{c}_rank_by_hotel_country"] = df.groupby(['hotel_country'])[f"{c}_count"].transform("rank")
    df[f"{c}_rank_by_booker_country"] = df.groupby(['booker_country'])[f"{c}_count"].transform("rank")
    df[f"{c}_rank_by_affiliate"] = df.groupby(['affiliate_id'])[f"{c}_count"].transform("rank")
    
df

user_id                   110636
checkin                      424
checkout                     424
city_id                    30137
device_class                   3
affiliate_id                1480
booker_country                 5
hotel_country                177
utrip_id                  116230
duration                      29
same_country                   2
checkin_day                   31
checkin_weekday                7
checkin_week                  53
checkin_month                 12
checkin_year                   2
checkin_quarter                4
checkout_weekday               7
checkout_week                 53
checkout_day                  31
checkin_weekday_sin            7
checkin_weekday_cos            7
checkin_month_sin             12
checkin_month_cos             12
utrip_number                  45
utrip_steps_from_end         407
row_num                       47
total_rows                    36
last                           2
first_hotel_country          158
first_city

Unnamed: 0,user_id,checkin,checkout,city_id,device_class,affiliate_id,booker_country,hotel_country,utrip_id,duration,same_country,checkin_day,checkin_weekday,checkin_week,checkin_month,checkin_year,checkin_quarter,checkout_weekday,checkout_week,checkout_day,checkin_weekday_sin,checkin_weekday_cos,checkin_month_sin,checkin_month_cos,utrip_number,utrip_steps_from_end,row_num,total_rows,last,first_hotel_country,first_city_id,city_id_count,affiliate_id_count,booker_country_count,hotel_country_count,checkin_month_count,checkin_week_count,city_id_nunique,affiliate_id_nunique,booker_country_nunique,hotel_country_nunique,city_id_rank_by_hotel_country,city_id_rank_by_booker_country,city_id_rank_by_affiliate,affiliate_id_rank_by_hotel_country,affiliate_id_rank_by_booker_country,affiliate_id_rank_by_affiliate,booker_country_rank_by_hotel_country,booker_country_rank_by_booker_country,booker_country_rank_by_affiliate,hotel_country_rank_by_hotel_country,hotel_country_rank_by_booker_country,hotel_country_rank_by_affiliate,checkin_month_rank_by_hotel_country,checkin_month_rank_by_booker_country,checkin_month_rank_by_affiliate
117278,136,2016-09-22,2016-09-23,51685,0,9924,4,113,136_4,1,0,22,3,38,9,0,3,4,38,23,0.433884,-0.900969,-0.866025,-5.000000e-01,4,0.285714,2,7,0,113,52933,162,120761,124895,8920,61119,13316,6,1,1,1,4510.5,33076.0,54206.5,7513.0,108897.0,60381.0,3427.0,62448.0,61856.0,4460.5,29086.0,34473.0,5397.5,88993.5,73845.5
117279,136,2016-09-23,2016-09-24,43323,0,9924,4,113,136_4,1,0,23,4,38,9,0,3,5,38,24,-0.433884,-0.900969,-0.866025,-5.000000e-01,4,0.428571,3,7,0,113,52933,22,120761,124895,8920,61119,13316,6,1,1,1,1438.5,10831.5,23612.0,7513.0,108897.0,60381.0,3427.0,62448.0,61856.0,4460.5,29086.0,34473.0,5397.5,88993.5,73845.5
117280,136,2016-09-24,2016-09-26,55990,0,9924,4,113,136_4,2,0,24,5,38,9,0,3,0,39,26,-0.974928,-0.222521,-0.866025,-5.000000e-01,4,0.571429,4,7,0,113,52933,102,120761,124895,8920,61119,13316,6,1,1,1,3770.5,25516.0,45383.5,7513.0,108897.0,60381.0,3427.0,62448.0,61856.0,4460.5,29086.0,34473.0,5397.5,88993.5,73845.5
117281,136,2016-09-26,2016-09-27,46411,0,9924,4,113,136_4,1,0,26,0,39,9,0,3,1,39,27,0.000000,1.000000,-0.866025,-5.000000e-01,4,0.714286,5,7,0,113,52933,498,120761,124895,8920,61119,15821,6,1,1,1,7131.5,61554.5,79582.5,7513.0,108897.0,60381.0,3427.0,62448.0,61856.0,4460.5,29086.0,34473.0,5397.5,88993.5,73845.5
117282,136,2016-09-27,2016-09-28,45399,0,9924,4,113,136_4,1,0,27,1,39,9,0,3,2,39,28,0.781831,0.623490,-0.866025,-5.000000e-01,4,0.857143,6,7,0,113,52933,70,120761,124895,8920,61119,15821,6,1,1,1,3121.5,20596.5,38761.0,7513.0,108897.0,60381.0,3427.0,62448.0,61856.0,4460.5,29086.0,34473.0,5397.5,88993.5,73845.5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
180790,6258041,2016-04-30,2016-05-01,7529,1,9452,1,58,6258041_1,1,0,30,5,17,4,0,2,6,17,1,-0.974928,-0.222521,1.000000,6.123234e-17,1,0.750000,3,4,0,58,57109,5,38619,102703,32100,29936,7785,3,1,1,1,2978.0,12566.5,3090.5,21182.5,52328.5,19310.0,5011.5,51352.0,11014.5,16050.5,69022.5,27384.5,7389.5,24928.5,11318.5
180791,6258041,2016-05-01,2016-05-02,17338,1,9452,1,58,6258041_1,1,0,1,6,17,5,0,2,0,18,2,-0.781831,0.623490,0.866025,-5.000000e-01,1,1.000000,4,4,1,58,57109,64,38619,102703,32100,40308,7785,3,1,1,1,11090.5,43578.0,12185.5,21182.5,52328.5,19310.0,5011.5,51352.0,11014.5,16050.5,69022.5,27384.5,13201.0,41994.0,16690.0
420480,6258087,2016-08-04,2016-08-05,50073,0,2436,2,59,6258087_1,1,1,4,3,31,8,0,3,4,31,5,0.433884,-0.900969,-0.500000,-8.660254e-01,1,0.500000,2,4,0,59,17754,26,7990,232590,42719,100507,25712,3,1,1,1,16144.5,56627.5,1975.5,12402.0,64864.0,3995.5,27009.0,116295.5,4007.0,21360.0,162648.0,5819.0,37978.5,203610.0,6964.0
420481,6258087,2016-08-05,2016-08-06,11662,0,2436,2,59,6258087_1,1,1,5,4,31,8,0,3,5,31,6,-0.433884,-0.900969,-0.500000,-8.660254e-01,1,0.750000,3,4,0,59,17754,7,7990,232590,42719,100507,25712,3,1,1,1,8146.5,26038.0,899.5,12402.0,64864.0,3995.5,27009.0,116295.5,4007.0,21360.0,162648.0,5819.0,37978.5,203610.0,6964.0
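The leak warning at the top of this cell can be addressed with a running count: for each reservation, count only the distinct cities seen in earlier rows of the same trip. A minimal sketch, assuming rows can be ordered by `checkin` within `utrip_id`; the helper name `prior_nunique` and the toy data are illustrative and not part of the notebook.

```python
import pandas as pd

def prior_nunique(df, group_col, value_col, order_col):
    """Number of distinct `value_col` values seen in EARLIER rows of the same group."""
    df = df.sort_values([group_col, order_col])
    # 1 on the first occurrence of each (group, value) pair, 0 on repeats
    first_seen = (~df.duplicated(subset=[group_col, value_col])).astype(int)
    # running distinct count per group, including the current row
    running = first_seen.groupby(df[group_col]).cumsum()
    # remove the current row's own contribution so only history is counted
    return running - first_seen

toy = pd.DataFrame({
    "utrip_id": ["a", "a", "a", "a", "b", "b"],
    "checkin": pd.to_datetime(["2016-01-01", "2016-01-02", "2016-01-03",
                               "2016-01-04", "2016-02-01", "2016-02-02"]),
    "city_id": [10, 20, 10, 30, 10, 10],
})
toy["city_id_prior_nunique"] = prior_nunique(toy, "utrip_id", "city_id", "checkin")
print(toy)
```

Unlike `transform("nunique")` over the whole trip, this value never depends on the current or later rows, so it stays usable when the final `city_id` is hidden.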


In [17]:
df.loc[df["city_id_count"]>=15]["city_id"].nunique()

3931

In [18]:
df["utrip_number"].value_counts().describe()

count        45.000000
mean      11270.200000
std       54656.263664
min           2.000000
25%           8.000000
50%          38.000000
75%         293.000000
max      357025.000000
Name: utrip_number, dtype: float64

In [19]:
## counts of each val
# df.groupby(['hotel_country']).size() # same thing as value counts only without ordering by values
df['hotel_country'].value_counts()

35     61882
51     52205
59     42719
58     32100
45     22506
       ...  
27         1
92         1
93         1
107        1
98         1
Name: hotel_country, Length: 177, dtype: int64

In [20]:
assert df.isna().sum().max() ==0

In [21]:
df[[ 'checkin', 'checkout','booker_country', 'hotel_country', 'duration']].describe(include="all",datetime_is_numeric=True)

Unnamed: 0,checkin,checkout,booker_country,hotel_country,duration
count,507159,507159,507159.0,507159.0,507159.0
mean,2016-08-02 14:47:33.244897792,2016-08-04 08:13:48.805236224,2.308181,63.091437,1.726569
min,2016-01-01 00:00:00,2016-01-02 00:00:00,0.0,0.0,1.0
25%,2016-06-08 00:00:00,2016-06-10 00:00:00,2.0,35.0,1.0
50%,2016-08-07 00:00:00,2016-08-09 00:00:00,2.0,51.0,1.0
75%,2016-09-26 00:00:00,2016-09-28 00:00:00,3.0,77.0,2.0
max,2017-02-27 00:00:00,2017-02-28 00:00:00,4.0,178.0,30.0
std,,,1.122346,40.761798,1.196462


In [22]:
# df2 = df[["user_id","city_id"]].drop_duplicates().copy()
df2 = df.drop_duplicates(subset=["user_id","city_id"])["city_id"].copy()
print(df2.shape[0])
print("df2 nunique (cities without duplicate user visits)",df2.nunique())

# c2_counts = df2["city_id"].value_counts()
c2_counts = df2.value_counts()
# df2["new_counts"] = df2["city_id"].map(c2_counts)
# df2["new_counts"] = df2.map(c2_counts)
print("city counts")
print(c2_counts)
print(c2_counts.describe())
print("cities with at least 3:",(c2_counts>=3).sum())
print("cities with at least 7:",(c2_counts>=7).sum())
print("cities with at least 15:",(c2_counts>=15).sum())
print("cities with at least 30:",(c2_counts>=30).sum())
print("cities with at least 100:",(c2_counts>=100).sum())
print("cities with at least 300:",(c2_counts>=300).sum())

c2_freq = df2.value_counts(normalize=True)
print("top 4 sum coverage (normalized): ",c2_freq[0:4].sum().round(3))
print("top 50 sum coverage (normalized): ",c2_freq[0:50].sum().round(3))
print("top 100 sum coverage (normalized): ",c2_freq[0:100].sum().round(3))
print("top 400 sum coverage (normalized): ",c2_freq[0:400].sum().round(3))
print("top 1,000 sum coverage (normalized): ",c2_freq[0:1000].sum().round(3))
print("top 5,000 sum coverage (normalized): ",c2_freq[0:5000].sum().round(3))
print("top 8,000 sum coverage (normalized): ",c2_freq[0:8000].sum().round(3))

462772
df2 nunique (cities without duplicate user visits) 30137
city counts
23921    3858
55128    3510
47499    2866
29319    2795
64876    2718
         ... 
22926       1
6534        1
389         1
22670       1
2049        1
Name: city_id, Length: 30137, dtype: int64
count    30137.000000
mean        15.355609
std         88.402933
min          1.000000
25%          1.000000
50%          2.000000
75%          6.000000
max       3858.000000
Name: city_id, dtype: float64
cities with at least 3: 13628
cities with at least 7: 6983
cities with at least 15: 3768
cities with at least 30: 2150
cities with at least 100: 806
cities with at least 300: 259
top 4 sum coverage (normalized):  0.028
top 50 sum coverage (normalized):  0.179
top 100 sum coverage (normalized):  0.259
top 400 sum coverage (normalized):  0.478
top 1,000 sum coverage (normalized):  0.639
top 5,000 sum coverage (normalized):  0.861
top 8,000 sum coverage (normalized):  0.909


In [23]:
c2_counts

23921    3858
55128    3510
47499    2866
29319    2795
64876    2718
         ... 
22926       1
6534        1
389         1
22670       1
2049        1
Name: city_id, Length: 30137, dtype: int64

In [24]:
### According to the contest description, each trip (utrip_id) should have at least 4 reservations. Note this cell counts rows per user, and a user can have several trips.
df["user_id"].value_counts().describe()#.hist()

count    110636.000000
mean          4.584032
std           2.461063
min           1.000000
25%           3.000000
50%           4.000000
75%           5.000000
max         135.000000
Name: user_id, dtype: float64

## Frequent city target List + City count encoding
* Get the K most frequent target city IDs - selected based on frequency as final destination (not just overall)
* +- Also after this, replace rare city IDs categorical features with count encoding to reduce dimensionality
    * Keep them as count, or aggregate all of them as "under_K"?

##### Output  : `TOP_TARGETS` - filter data by this *after* creation of lag features ! 

* Drop duplicates by the same user (reduces possible bias from frequent users? Only relevant if the test set is separated from "frequent travellers")
    * results in 216,633 , vs 217,686 without dropping duplicates by users
    * ~19.9k unique cities
    
* Could do other encodings - https://contrib.scikit-learn.org/category_encoders/count.html

* Note that all this is after we've added rank, count features beforehand, so that information won't be lost for these variables, despite these transforms



* **NOTE**: the most frequent final destinations are NOT the same as the most popular overall destinations (or first locations)!
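One way to read the "under_K" option above is to keep frequent IDs as they are and collapse everything rarer than a threshold into a single placeholder. A minimal sketch; the helper name `group_rare`, the placeholder `-1`, and the toy IDs are assumptions for illustration. In practice the frequencies would presumably be computed on training rows only.

```python
import pandas as pd

def group_rare(values, min_count, other=-1):
    """Keep values that occur at least `min_count` times, replace the rest with `other`."""
    counts = values.map(values.value_counts())  # per-row frequency of that row's value
    return values.where(counts >= min_count, other)

toy_city = pd.Series([101, 101, 101, 202, 202, 303], name="city_id")
grouped = group_rare(toy_city, min_count=2)
print(grouped.tolist())
```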

In [25]:
if KEEP_TOP_K_TARGETS > 0 :
    df_end = df.loc[df["utrip_steps_from_end"]==1].drop_duplicates(subset=["city_id","hotel_country","user_id"])[["city_id","hotel_country"]].copy()
    print(df_end.shape[0])
    end_city_counts = df_end.city_id.value_counts()
    print(end_city_counts)
    
    TOP_TARGETS = end_city_counts.head(KEEP_TOP_K_TARGETS).index.values
    print(f"top {KEEP_TOP_K_TARGETS} targets \n",TOP_TARGETS)
    
#     assert df.loc[df["city_id"].isin(TOP_TARGETS)]["city_id"].nunique() == KEEP_TOP_K_TARGETS

####        
# replace low-frequency categorical features

# ## group only the rare values: keep a value if it occurs at least LOW_COUNT_THRESH times, otherwise replace it with -1
# df[BASE_CAT_COLS] = df[BASE_CAT_COLS].where(df[BASE_CAT_COLS].apply(lambda x: x.map(x.value_counts()))>=LOW_COUNT_THRESH, -1)
# df[BASE_CAT_COLS].head()

115847
47499    1971
17013    1578
36063    1554
29319    1427
2416     1198
         ... 
50363       1
48308       1
38065       1
27822       1
10245       1
Name: city_id, Length: 14607, dtype: int64
top 900 targets 
 [47499 17013 36063 29319  2416 64876 26436 17127 29770 23921 52815 55763
  4932 26235 51259 62185 51291  3763 66648 48483 52818 16521 27404  2078
 21929 51765 13530 61320 19771 10485 23714 47759  6582 47527 51517 55128
 38677  8462 35160 20345 25025  9608  7410 44869 12308 28154 43306 46854
 63151 47486 64269   382 49668 58819 45188  4202 60143 30520  8335 47976
 24783 34342 14549 58178 60222 53434 66815  2748 28115 29943  2122 40521
 37874 64824  8766 44320 65856 48968 17775   950 65202 58741  6788  3082
 44103  6327 60274 51135 21555 11652 32392 42356  4790 57658 18508 15343
 22065 18820 62611 37689 19448 55196   699    55 35811 67025 22490 47360
 18417 63977 38772 56893  1940  1034 17157 30768 27269 13642 56590 24507
  4476 42482 33022 36905 21033 20392 36435 23243

##### Long tail of targets warning!
* 75% of cities appear fewer than 4 times in the data (as a final destination!)
    * Dropping them removes most of the candidate classes. Note that this is 75% of *cities*, not of trips: the accuracy ceiling is set by the share of trips that end in the kept cities (see the coverage figure below), not by the share of cities kept.
    * Training on intermediate (non-final) steps may help with this.
* Using steps from roughly the 2nd onwards still leaves 75% of cities appearing fewer than 7 times

* Top 4,000 cities (counting only final trip destinations) offer 89% coverage

* Unsure how to handle this - too many targets to learn, and no auxiliary data to help learn them?
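The coverage numbers above are the share of trips whose final city falls inside the K most frequent final cities, which is also the best accuracy a model restricted to those K classes could reach. A small sketch of that computation; `topk_coverage` and the toy series are hypothetical names, not from the notebook.

```python
import pandas as pd

def topk_coverage(targets, k):
    """Fraction of rows whose target is among the k most frequent target values."""
    freq = targets.value_counts(normalize=True)  # sorted from most to least frequent
    return float(freq.iloc[:k].sum())

# 10 toy trips: city 1 ends four of them, city 2 two, the rest one each
toy_targets = pd.Series([1, 1, 1, 1, 2, 2, 3, 4, 5, 6])
coverage_top2 = topk_coverage(toy_targets, k=2)
print(coverage_top2)
```

In the toy data two of six cities (a third of the classes) already cover 60% of the trips, which is the same long-tail effect seen in `df_end`.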

In [26]:
df_end.city_id.nunique()

14607

In [27]:
df_end.city_id.value_counts().describe()


count    14607.000000
mean         7.930924
std         48.222616
min          1.000000
25%          1.000000
50%          1.000000
75%          3.000000
max       1971.000000
Name: city_id, dtype: float64

In [28]:
# df_end.city_id.value_counts(normalize=True)[0:4000].sum().round(3)# .89  (note, this is just for the end count cities, not all cities overall)

df_end.city_id.value_counts(normalize=True)[0:7000].sum().round(3) # ~93% coverage

0.934

In [29]:
## check distribution from roughly the midpoint of each trip onwards (utrip_steps_from_end >= 0.4)
df.loc[df["utrip_steps_from_end"]>=0.4].drop_duplicates(subset=["city_id","hotel_country","user_id"])["city_id"].value_counts().describe()


count    28783.000000
mean        14.375430
std         82.407715
min          1.000000
25%          1.000000
50%          2.000000
75%          6.000000
max       3390.000000
Name: city_id, dtype: float64

In [30]:
# df["c"] = df["city_id"].map(df["city_id"].value_counts())
# df[BASE_CAT_COLS+ ["c"]]

* Continue with EDA 

In [31]:
df["utrip_id"].value_counts().describe()

count    116230.000000
mean          4.363409
std           2.012276
min           1.000000
25%           3.000000
50%           4.000000
75%           5.000000
max          47.000000
Name: utrip_id, dtype: float64

* If country is known, then we need to rank within a given country.  How many cities/points per country? :

(Note - later, need to consider features about multi-country trips)


* EDA on city popularity by country
* Drop rare hotels to simplify
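If the country is known, a simple popularity baseline is to recommend the most booked cities within that country. A hedged sketch of that idea on toy data; `top_cities_per_country` is a hypothetical helper, and ties are broken by `city_id` only to make the output deterministic.

```python
import pandas as pd

def top_cities_per_country(df, k=4):
    """Map each hotel_country to its k most frequently booked city_ids."""
    counts = (
        df.groupby(["hotel_country", "city_id"]).size()
          .reset_index(name="n")
          .sort_values(["hotel_country", "n", "city_id"],
                       ascending=[True, False, True])
    )
    top = counts.groupby("hotel_country").head(k)  # keeps the sorted order within each country
    return top.groupby("hotel_country")["city_id"].apply(list).to_dict()

toy = pd.DataFrame({
    "hotel_country": [1, 1, 1, 1, 1, 1, 2, 2],
    "city_id":       [10, 10, 10, 11, 11, 12, 20, 21],
})
baseline = top_cities_per_country(toy, k=2)
print(baseline)
```

Countries with fewer than `k` cities simply return shorter lists, which is one reason the per-country city counts matter.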

In [32]:
df_locations = df[["hotel_country","city_id"]].drop_duplicates()
print(df_locations.shape[0])
print(df_locations.nunique())
print("After filtering countries with 4 or less unique hotels/cities:")
df_locations = df_locations.loc[df_locations.groupby(["hotel_country"])["city_id"].transform("nunique")>4]
print(df_locations.shape[0])
print(df_locations.nunique())

print(df.groupby(["hotel_country"])["city_id"].nunique().describe())

30137
hotel_country      177
city_id          30137
dtype: int64
After filtering countries with 4 or less unique hotels/cities:
30025
hotel_country      122
city_id          30025
dtype: int64
count     177.000000
mean      170.265537
std       544.551768
min         1.000000
25%         3.000000
50%        19.000000
75%        97.000000
max      4705.000000
Name: city_id, dtype: float64


In [33]:
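### unsure about this filtering - depends if data points are real or mistake
### (note: the filter below uses total_rows, i.e. it drops trips with fewer than 4 rows, not users)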

print("dropping users with less than 4 trips")
# df2 = df.loc[df["utrip_id_count"]>=4]
df2 = df.loc[df["total_rows"]>=4].copy()
print(df.shape[0]-df2.shape[0])

# print("dropping countries+Data with less than 4 unique cities in them:")
# df2 = df2.loc[df2.groupby(["hotel_country"])["city_id"].transform("nunique")>=4]
# print(df2.shape[0])

print(f"dropping cities  with less than {MIN_TARGET_FREQ} occurences:")
df2 = df2.loc[df2.groupby(["city_id"])["hotel_country"].transform("count")>=MIN_TARGET_FREQ]
print(df2.shape[0])
df2 = df2.loc[df2.groupby(["hotel_country"])["city_id"].transform("count")>=MIN_TARGET_FREQ]
print(df2.shape[0])

# print("dropping countries+Data with less than 4 unique cities in them: (after prev filter)")
# df2 = df2.loc[df2.groupby(["hotel_country"])["city_id"].transform("nunique")>=4]
# print(df2.shape[0])

print("nunique cities after freq filt",df2["city_id"].nunique())
print("nunique city_id per hotel_country:")
df2.groupby(["hotel_country"])["city_id"].nunique().describe()

dropping users with less than 4 trips
732
dropping cities  with less than 80 occurences:
335027
335027
nunique cities after freq filt 1038
nunique city_id per hotel_country:


count     78.000000
mean      13.307692
std       21.723952
min        1.000000
25%        2.000000
50%        5.000000
75%       16.250000
max      138.000000
Name: city_id, dtype: float64

In [34]:
df2[["hotel_country","city_id","affiliate_id","user_id"]].nunique()

hotel_country       78
city_id           1038
affiliate_id      1429
user_id          99715
dtype: int64

In [35]:
# LAG_FEAT_COLS = ['city_id', 'device_class',
#        'affiliate_id', 'booker_country', 'hotel_country', 
#        'duration', 'same_country', 'checkin_day', 'checkin_weekday',
#        'checkin_week',
#         'checkout_weekday','checkout_week',
#        'city_id_count', 'affiliate_id_count',
#        'booker_country_count', 'hotel_country_count', 
#        'checkin_month_count', 'checkin_week_count', 'city_id_nunique',
#        'affiliate_id_nunique', 'booker_country_nunique',
#        'hotel_country_nunique', 'city_id_rank_by_hotel_country',
#        'city_id_rank_by_booker_country', 'city_id_rank_by_affiliate',
#        'affiliate_id_rank_by_hotel_country',
#        'affiliate_id_rank_by_booker_country', 'affiliate_id_rank_by_affiliate',
#        'booker_country_rank_by_hotel_country',
#        'booker_country_rank_by_booker_country',
#        'booker_country_rank_by_affiliate',
#        'hotel_country_rank_by_hotel_country',
#        'hotel_country_rank_by_booker_country',
#        'hotel_country_rank_by_affiliate',
#        'checkin_month_rank_by_hotel_country',
#        'checkin_month_rank_by_booker_country',
#        'checkin_month_rank_by_affiliate']

In [36]:
# ### lag features - last n visits
# groupbyLagFeatures(df=df2.head(20), # .set_index(["checkin","checkout","user_id"])
#                    lag=[1,2],group="utrip_id",lag_feature_cols=LAG_FEAT_COLS)

In [37]:
# df2.loc[~df2["utrip_steps_from_end"].between(0.26,0.98)].sort_values("utrip_id")
## # df2["utrip_steps_from_end"].min() ## min is greater than 0

#### get a DF of all cities per country
* +- get from original DF, +- remove cities that appear less than 4? times , and countries with less than 4 hotels? (Or keep - to avoid messing up training?)
* Weighted Sample from it, for negatives, +- most freq by country/affiliate/etc
* Don't drop duplicates by user, keep orig freq? 
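The weighted negative sampling mentioned above could look roughly like this: draw candidate cities from the same country as the true city, with probability proportional to their booking count. This is only a sketch. It assumes a table with one row per city (unlike `df_cities`, which keeps one row per reservation), and `sample_negatives` is a hypothetical helper.

```python
import pandas as pd

def sample_negatives(city_table, country, true_city, n, seed=0):
    """Sample up to n other cities from `country`, weighted by city_id_count."""
    pool = city_table[(city_table["hotel_country"] == country)
                      & (city_table["city_id"] != true_city)]
    if pool.empty:
        return []
    n = min(n, len(pool))  # sampling is without replacement
    picked = pool.sample(n=n, weights="city_id_count", random_state=seed)
    return picked["city_id"].tolist()

# one row per city (assumed to be de-duplicated beforehand)
city_table = pd.DataFrame({
    "city_id":       [10, 11, 12, 13, 20],
    "hotel_country": [1, 1, 1, 1, 2],
    "city_id_count": [500, 50, 5, 1, 300],
})
negatives = sample_negatives(city_table, country=1, true_city=10, n=2)
print(negatives)
```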

In [38]:
df_cities = df[["city_id","hotel_country","city_id_count"]] ## +- drop duplicates by tripid? 
print(df_cities.nunique())
df_cities = df_cities.loc[df_cities.groupby("hotel_country")["city_id"].transform("nunique")>4]
df_cities = df_cities.loc[df_cities["city_id_count"]>=10].sort_values("city_id_count",ascending=False)
print(df_cities.nunique())
print(df_cities.shape[0])


# ### 5 most frequent overall
# df_city_samples = df_cities.drop_duplicates().sort_values("city_id_count",ascending=False).groupby("city_id").head(5) 
# df_city_samples

city_id          30137
hotel_country      177
city_id_count      512
dtype: int64
city_id          5431
hotel_country     109
city_id_count     502
dtype: int64
442618


### add lag features + Train/test/data split
* Lag feats (remember for categorical)
* Drop leak features (target values - country, city)

* drop instances that lack history (e.g. keep the 3rd step onwards) - via dropna on the lag features
* fill NaNs
* Split train/test by `user id`; the split could maybe be by `utrip ID`?
    * Test - only last trip
    * stratified train/test split by class - then drop any train rows that overlap with the test IDs.
        * Could also stratify by users, but this risks some classes being absent from test
        
###### Big possible improvement to lag features: Have "first location" (starting point) "lag" feature
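The notebook's `groupbyLagFeatures` helper is defined elsewhere, so the following is only an assumed, simplified stand-in that shows the idea: shift each column within the trip to get the previous steps, and broadcast the trip's first value as a "starting point" feature. The function name `add_lag_features` is hypothetical; the output column names mirror the `lag1_city_id` / `first_city_id` pattern used here.

```python
import pandas as pd

def add_lag_features(df, group_col, order_col, cols, lags=(1, 2)):
    """Add lag{n}_{col} and first_{col} columns computed within each group."""
    out = df.sort_values([group_col, order_col]).copy()
    for lag in lags:
        # shift within the trip so a row only sees its own trip's earlier steps
        shifted = out.groupby(group_col)[cols].shift(lag)
        for c in cols:
            out[f"lag{lag}_{c}"] = shifted[c]
    for c in cols:
        # the trip's starting value, repeated on every row of the trip
        out[f"first_{c}"] = out.groupby(group_col)[c].transform("first")
    return out

toy = pd.DataFrame({
    "utrip_id": ["a", "a", "a", "b", "b"],
    "checkin":  [1, 2, 3, 1, 2],
    "city_id":  [10, 20, 30, 40, 50],
})
feats = add_lag_features(toy, "utrip_id", "checkin", ["city_id"])
print(feats)
```

On the first row of a trip `first_city_id` equals the target itself; rows without two steps of history are dropped via `dropna` on the lag-2 column, which removes those rows.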

In [39]:
### features to drop - not usable, or leaks (e.g. aggregations on target)

TARGET_COL = 'city_id'
DROP_FEATS = ['user_id',
              'checkin', 'checkout',
              'hotel_country', 'city_id_count', 'same_country',
              'utrip_id',
#               'utrip_steps_from_end',
              'hotel_country_count',
              'city_id_nunique', 'hotel_country_nunique',
              'city_id_rank_by_hotel_country', 'city_id_rank_by_booker_country', 'city_id_rank_by_affiliate',
              'affiliate_id_rank_by_hotel_country', 'affiliate_id_rank_by_booker_country', 'affiliate_id_rank_by_affiliate',
              'hotel_country_rank_by_hotel_country',
              'hotel_country_rank_by_booker_country', 'hotel_country_rank_by_affiliate',
              'booker_country_rank_by_hotel_country', 'booker_country_rank_by_booker_country',
              'checkin_month_rank_by_hotel_country',
             ]

# df2.drop(DROP_FEATS,axis=1).columns

In [40]:
print(df2.shape)
# ### lag features - last n visits
df_feat = groupbyLagFeatures(df=df2.copy(), 
                   lag=[1,2],group="utrip_id",lag_feature_cols=LAG_FEAT_COLS)
df_feat = df_feat.dropna(subset=["lag2_city_id"]).sample(frac=1)

### (filtering to the K most popular trip targets is done in the next cell, after the lag features exist)


df_feat = df_feat.drop(DROP_FEATS,axis=1,errors="ignore")
print(df_feat.shape)

# df_feat.sort_values(["user_id","utrip_steps_from_end"])

(335027, 56)
(142320, 100)


In [41]:
### filter for most frequent targets

if KEEP_TOP_K_TARGETS > 0 :
    print(df_feat.shape[0])
    df_feat = df_feat.loc[df_feat["city_id"].isin(TOP_TARGETS)]
    print(df_feat.shape[0])    
    c = df_feat["city_id"].nunique()
    print(f"{c} unique targets left")
    assert  c<= KEEP_TOP_K_TARGETS

142320
130681
746 unique targets left


In [42]:
########################
## stratified train/test split by class - then drop any train rows that overlap with test IDs. Could also stratify by users, but that risks some classes being absent from test
### split could maybe be by utrip ID ? 
### orig - split by group : 

# train_inds, test_inds = next(GroupShuffleSplit(test_size=.2, n_splits=2, random_state = 7).split(df_feat, groups=df_feat['user_id']))
# X_train = df_feat.iloc[train_inds].drop(DROP_FEATS,axis=1,errors="ignore")
# X_test = df_feat.iloc[test_inds].drop(DROP_FEATS,axis=1,errors="ignore")
# assert (set(X_train[TARGET_COL].unique()) == set(X_test[TARGET_COL].unique()))
#################
## alt: split by class. May be leaky! 
X_train, X_test = train_test_split(df_feat,stratify=df_feat[TARGET_COL])

########################
print("X_train",X_train.shape[0])
## get last row in trip only in test/eval set: 
print("X_test",X_test.shape[0])
X_test = X_test.loc[X_test["utrip_steps_from_end"]==1].copy() # last row per trip; copy so the later pop() works on an independent frame
print("X_test after filtering for last instance per trip",X_test.shape[0])

y_train = X_train.pop(TARGET_COL)
y_test = X_test.pop(TARGET_COL)

print("# classes",y_train.nunique())

# ## check that same classes in train and test - 
# assert (set(y_train.unique()) == set(y_test.unique()))

X_train 98010
X_test 32671
X_test after filtering for last instance per trip 14863
# classes 746
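The stratified split above can place rows of the same trip in both train and test, which is the leak flagged in the cell comment. A minimal sketch of the group-aware alternative, assuming the trip ID is still available at split time (in this notebook `utrip_id` and `user_id` are in `DROP_FEATS`, so the groups would have to be saved before dropping). The toy frame is illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# 10 toy trips with 4 rows each
toy = pd.DataFrame({
    "utrip_id": np.repeat([f"t{i}" for i in range(10)], 4),
    "city_id":  np.tile([1, 2, 3, 4], 10),
})

# every trip goes entirely to train or entirely to test
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=7)
train_idx, test_idx = next(splitter.split(toy, groups=toy["utrip_id"]))
train_part, test_part = toy.iloc[train_idx], toy.iloc[test_idx]

shared_trips = set(train_part["utrip_id"]) & set(test_part["utrip_id"])
print(len(train_part), len(test_part), shared_trips)
```

The trade-off noted earlier still applies: a group split does not guarantee that every class appears in both parts.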


## Model
* For now - simple multiclass model (TabNet? LSTM?) ; +- subsample - only most frequent classes/cities

    * TabNet: `pip install pytorch-tabnet`
        * https://github.com/dreamquark-ai/tabnet/blob/develop/forest_example.ipynb
    * TensorFlow TabNet: https://github.com/ostamand/tensorflow-tabnet/blob/master/examples/train_mnist.py

* split train/test by user id. 
    * Test - only last trip. 
    
* Try multiclass models

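* Try TabNet models (tabular with attention)
    * + Lag feats
    * Note - the embeddings here are learned per column, so the same ID appearing in different columns (e.g. `lag1_city_id` and `lag2_city_id`) is not treated as the same entity (unlike TF's)!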
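Whatever multiclass model is used, the challenge metric only needs the four highest-scoring classes per trip. A small sketch of Precision@4 computed from a matrix of class probabilities; `precision_at_k` is a hypothetical helper, and `classes` is assumed to hold the label for each probability column (as a classifier's `classes_` attribute typically does).

```python
import numpy as np

def precision_at_k(y_true, proba, classes, k=4):
    """Share of rows whose true label is among the k highest-probability classes."""
    top_k = np.argsort(-proba, axis=1)[:, :k]          # column indices of the k best scores
    top_labels = np.asarray(classes)[top_k]            # map column indices back to labels
    hits = (top_labels == np.asarray(y_true)[:, None]).any(axis=1)
    return float(hits.mean())

classes = np.array([10, 20, 30, 40, 50])
proba = np.array([
    [0.50, 0.20, 0.15, 0.10, 0.05],  # top 4 labels: 10, 20, 30, 40
    [0.05, 0.10, 0.15, 0.20, 0.50],  # top 4 labels: 50, 40, 30, 20
])
score = precision_at_k([50, 20], proba, classes)
print(score)
```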

In [43]:
from pytorch_tabnet.tab_model import TabNetClassifier
from pytorch_tabnet.pretraining import TabNetPretrainer
import torch
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import roc_auc_score

import pandas as pd
import numpy as np
np.random.seed(0)

* cat_idxs : list of int (default=[] - Mandatory for embeddings)
    * List of categorical features indices.

* cat_dims : list of int (default=[] - Mandatory for embeddings)

    * List of categorical features number of modalities (number of unique values for a categorical feature) /!\ no new modalities can be predicted

* cat_emb_dim : list of int (optional)

    * List of embeddings size for each categorical features. (default =1)
    
    
    
All categorical values must be known from train (the demo used a label encoder). Consider doing the same here at a late step, to avoid unknown values?
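One possible way to build `cat_idxs` and `cat_dims` while avoiding unknown values is to fit the codes on train only and reserve index 0 for anything unseen. This is a sketch under that assumption and not the demo's `LabelEncoder` approach; `encode_categoricals` and the toy frames are hypothetical.

```python
import pandas as pd

def encode_categoricals(train, test, cat_cols):
    """Integer-encode cat_cols using train categories; unseen values map to 0."""
    train, test = train.copy(), test.copy()
    cat_dims = []
    for c in cat_cols:
        categories = pd.Index(train[c].unique())
        # get_indexer returns -1 for values not in `categories`; +1 shifts that to 0
        train[c] = categories.get_indexer(train[c]) + 1
        test[c] = categories.get_indexer(test[c]) + 1
        cat_dims.append(len(categories) + 1)  # +1 for the reserved "unknown" slot
    cat_idxs = [train.columns.get_loc(c) for c in cat_cols]
    return train, test, cat_idxs, cat_dims

toy_train = pd.DataFrame({"duration": [1, 2, 3],
                          "affiliate_id": [384, 9924, 384],
                          "device_class": [0, 1, 0]})
toy_test = pd.DataFrame({"duration": [2],
                         "affiliate_id": [777],   # never seen in train
                         "device_class": [1]})
enc_train, enc_test, cat_idxs, cat_dims = encode_categoricals(
    toy_train, toy_test, ["affiliate_id", "device_class"])
print(cat_idxs, cat_dims)
```

The resulting lists are in the shape the `cat_idxs` and `cat_dims` arguments described above expect, provided the column order of the feature matrix is left unchanged afterwards.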

In [44]:
X_train

Unnamed: 0,device_class,affiliate_id,booker_country,duration,checkin_day,checkin_weekday,checkin_week,checkin_month,checkin_year,checkin_quarter,checkout_weekday,checkout_week,checkout_day,checkin_weekday_sin,checkin_weekday_cos,checkin_month_sin,checkin_month_cos,utrip_number,utrip_steps_from_end,row_num,total_rows,last,first_hotel_country,first_city_id,affiliate_id_count,booker_country_count,checkin_month_count,checkin_week_count,affiliate_id_nunique,booker_country_nunique,booker_country_rank_by_affiliate,checkin_month_rank_by_booker_country,checkin_month_rank_by_affiliate,lag1_city_id,lag1_device_class,lag1_affiliate_id,lag1_booker_country,lag1_hotel_country,lag1_duration,lag1_same_country,lag1_checkin_weekday,lag1_checkin_week,lag1_checkout_weekday,lag1_city_id_count,lag1_affiliate_id_count,...,lag1_city_id_rank_by_booker_country,lag1_city_id_rank_by_affiliate,lag1_affiliate_id_rank_by_hotel_country,lag1_affiliate_id_rank_by_booker_country,lag1_booker_country_rank_by_hotel_country,lag1_booker_country_rank_by_booker_country,lag1_booker_country_rank_by_affiliate,lag1_hotel_country_rank_by_booker_country,lag1_hotel_country_rank_by_affiliate,lag1_checkin_month_rank_by_hotel_country,lag1_checkin_month_rank_by_booker_country,lag1_checkin_month_rank_by_affiliate,lag2_city_id,lag2_device_class,lag2_affiliate_id,lag2_booker_country,lag2_hotel_country,lag2_duration,lag2_same_country,lag2_checkin_weekday,lag2_checkin_week,lag2_checkout_weekday,lag2_city_id_count,lag2_affiliate_id_count,lag2_booker_country_count,lag2_hotel_country_count,lag2_checkin_month_count,lag2_checkin_week_count,lag2_city_id_nunique,lag2_affiliate_id_nunique,lag2_booker_country_nunique,lag2_hotel_country_nunique,lag2_city_id_rank_by_hotel_country,lag2_city_id_rank_by_booker_country,lag2_city_id_rank_by_affiliate,lag2_affiliate_id_rank_by_hotel_country,lag2_affiliate_id_rank_by_booker_country,lag2_booker_country_rank_by_hotel_country,lag2_booker_country_rank_by_booker_country,lag2_booker_country_rank_by_affiliate,lag2_hotel_country_rank_by_booker_country,lag2_hotel_country_rank_by_affiliate,lag2_checkin_month_rank_by_hotel_country,lag2_checkin_month_rank_by_booker_country,lag2_checkin_month_rank_by_affiliate
117054,0,4933,4,2,5,2,40,10,0,4,4,40,7,0.974928,-0.222521,-1.000000,-1.836970e-16,1,0.833333,5,6,0,40,21929,251,124895,47584,15422,2,1,124.5,74026.0,141.5,46258.0,0.0,7974.0,4.0,45.0,1.0,0.0,1.0,40.0,2.0,551.0,8598.0,...,67442.0,4177.0,7067.0,38448.0,15088.0,62448.0,4271.0,85045.0,6035.0,9841.0,74026.0,5276.5,29770.0,0.0,4933.0,4.0,45.0,1.0,0.0,0.0,40.0,1.0,2523.0,251.0,124895.0,22506.0,47584.0,15422.0,5.0,2.0,1.0,4.0,21245.0,109495.0,239.5,1619.5,9936.5,15088.0,62448.0,124.5,85045.0,192.5,9841.0,74026.0,141.5
84631,0,9924,3,4,10,1,2,1,1,1,5,2,14,0.781831,0.623490,0.000000,1.000000e+00,2,1.000000,4,4,1,112,51259,120761,34384,21339,3455,2,1,8433.0,546.5,2807.0,51259.0,0.0,3449.0,3.0,112.0,2.0,0.0,6.0,1.0,1.0,2481.0,4470.0,...,30612.5,4083.0,602.0,10124.0,1260.5,17192.5,2232.0,6396.5,617.0,135.0,546.5,49.5,51259.0,0.0,3449.0,3.0,112.0,6.0,0.0,0.0,1.0,6.0,2481.0,4470.0,34384.0,3193.0,21339.0,4361.0,1.0,2.0,1.0,1.0,1953.0,30612.5,4083.0,602.0,10124.0,1260.5,17192.5,2232.0,6396.5,617.0,135.0,546.5,49.5
24285,0,384,2,7,4,4,44,11,0,4,4,45,11,-0.433884,-0.900969,-0.866025,5.000000e-01,2,1.000000,6,6,1,35,58413,38455,232590,24686,6507,1,1,19267.5,29867.5,4473.0,11400.0,0.0,384.0,2.0,35.0,1.0,0.0,3.0,44.0,4.0,124.0,38455.0,...,107907.5,18628.0,27544.0,139225.5,49687.5,116295.5,19267.5,220395.5,36164.0,7487.5,29867.5,4473.0,58413.0,0.0,384.0,2.0,35.0,4.0,0.0,6.0,43.0,3.0,594.0,38455.0,232590.0,61882.0,47584.0,9097.0,3.0,1.0,1.0,1.0,39988.5,178449.5,30030.0,27544.0,139225.5,49687.5,116295.5,19267.5,220395.5,36164.0,28606.5,105161.0,16545.5
448104,0,9924,2,1,2,1,31,8,0,3,2,31,3,0.781831,0.623490,-0.500000,-8.660254e-01,1,1.000000,8,8,1,69,47976,120761,232590,100507,25712,4,1,99308.0,203610.0,109414.0,2612.0,0.0,2436.0,2.0,69.0,1.0,0.0,0.0,31.0,1.0,535.0,7990.0,...,174056.5,6139.5,3324.0,64864.0,9319.5,116295.5,4007.0,87995.5,3349.0,11265.5,203610.0,6964.0,5682.0,1.0,9452.0,2.0,69.0,2.0,0.0,5.0,30.0,0.0,201.0,38619.0,232590.0,12961.0,71278.0,21177.0,6.0,4.0,1.0,1.0,2331.0,127790.5,18886.0,6969.0,165447.5,9319.5,116295.5,31585.5,87995.5,18259.0,8717.0,157168.5,29380.0
147588,0,384,2,1,6,1,36,9,0,3,2,36,7,0.781831,0.623490,-0.866025,-5.000000e-01,1,1.000000,6,6,1,51,11179,38455,232590,61119,14836,3,1,19267.5,126902.5,20053.0,30695.0,0.0,8436.0,2.0,51.0,1.0,0.0,0.0,36.0,1.0,158.0,13676.0,...,118480.0,6826.0,23056.0,95527.5,37284.5,116295.5,6860.5,193279.5,11054.0,28036.0,126902.5,6941.0,40992.0,0.0,384.0,2.0,51.0,1.0,0.0,5.0,35.0,6.0,106.0,38455.0,232590.0,52205.0,61119.0,13819.0,5.0,3.0,1.0,1.0,20316.5,102161.5,17690.5,29231.0,139225.5,37284.5,116295.5,19267.5,193279.5,31254.0,28036.0,126902.5,20053.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
336842,0,3449,3,2,20,4,20,5,0,2,6,20,22,-0.433884,-0.900969,0.866025,-5.000000e-01,1,0.666667,4,6,0,45,4021,4470,34384,40308,9434,4,1,2232.0,10947.0,1283.0,15377.0,0.0,9598.0,3.0,31.0,5.0,0.0,6.0,19.0,4.0,400.0,10571.0,...,15229.0,7408.0,3790.0,12765.5,1320.5,17192.5,490.5,20350.5,3362.0,2870.5,10947.0,4024.5,29770.0,0.0,4085.0,3.0,45.0,4.0,0.0,2.0,19.0,6.0,2523.0,20.0,34384.0,22506.0,40308.0,9055.0,5.0,4.0,1.0,2.0,21245.0,31969.5,16.5,348.0,625.0,1589.0,17192.5,10.5,27295.0,15.0,7859.5,10947.0,3.0
244322,0,9924,4,2,8,0,32,8,0,3,2,32,10,0.000000,1.000000,-0.500000,-8.660254e-01,1,1.000000,6,6,1,51,64876,120761,124895,100507,28194,1,1,61856.0,117922.0,109414.0,9608.0,0.0,9924.0,4.0,51.0,3.0,0.0,4.0,31.0,0.0,2374.0,120761.0,...,107544.0,110056.0,46277.0,108897.0,18321.0,62448.0,61856.0,100178.0,97858.0,46113.0,117922.0,109414.0,55128.0,0.0,9924.0,4.0,51.0,3.0,0.0,0.0,31.0,3.0,3762.0,120761.0,124895.0,52205.0,100507.0,25712.0,5.0,1.0,1.0,1.0,50324.5,121623.5,118555.0,46277.0,108897.0,18321.0,62448.0,61856.0,100178.0,97858.0,46113.0,117922.0,109414.0
450081,0,10332,2,2,6,5,31,8,0,3,0,32,8,-0.974928,-0.222521,-0.500000,-8.660254e-01,1,0.714286,5,7,0,45,13360,11806,232590,100507,25712,1,1,8295.0,203610.0,10547.0,37601.0,0.0,10332.0,2.0,45.0,2.0,0.0,3.0,31.0,5.0,337.0,11806.0,...,152364.0,7468.5,9318.0,85200.0,20389.5,116295.5,8295.0,124700.5,6347.0,20113.0,203610.0,10547.0,29770.0,0.0,10332.0,2.0,45.0,4.0,0.0,6.0,30.0,3.0,2523.0,11806.0,232590.0,22506.0,71278.0,21177.0,6.0,1.0,1.0,3.0,21245.0,219948.5,11123.0,9318.0,85200.0,20389.5,116295.5,8295.0,124700.5,6347.0,15685.5,157168.5,8325.0
183953,1,359,4,5,5,4,5,2,0,1,2,6,10,-0.433884,-0.900969,0.500000,8.660254e-01,1,1.000000,5,5,1,76,47499,76156,124895,26230,7976,2,1,39699.5,33645.5,19737.0,47499.0,1.0,359.0,4.0,76.0,1.0,0.0,3.0,5.0,4.0,3762.0,76156.0,...,121623.5,74467.5,12831.5,73643.5,7723.5,62448.0,39699.5,65780.5,41198.5,8495.0,33645.5,19737.0,10485.0,1.0,359.0,4.0,76.0,2.0,0.0,1.0,5.0,3.0,2194.0,76156.0,124895.0,18772.0,26230.0,7976.0,4.0,2.0,1.0,2.0,13913.5,105654.5,65157.5,12831.5,73643.5,7723.5,62448.0,39699.5,65780.5,41198.5,8495.0,33645.5,19737.0


In [45]:
CAT_FEAT_NAMES = ["booker_country", "device_class","affiliate_id",
#                   "user_id", ## ? could use lower dim - depends on train/test overlap
                  "checkin_week",#"checkout_week",
#                     "checkin_weekday",
    "lag1_city_id","lag1_booker_country","lag1_hotel_country","lag1_affiliate_id", "lag1_device_class",
     "lag2_city_id","lag2_booker_country","lag2_hotel_country","lag2_affiliate_id","lag2_device_class",
#       "lag3_city_id","lag3_booker_country","lag3_hotel_country","lag3_affiliate_id","lag3_device_class",
                  "first_hotel_country","first_city_id"
                 ]

In [46]:
NUMERIC_COLS = [item for item in list(df_feat.columns.drop(TARGET_COL))  if item not in CAT_FEAT_NAMES]
print(len(NUMERIC_COLS))
print("numeric cols",NUMERIC_COLS)

for c in NUMERIC_COLS:
    l_enc = StandardScaler()  # alternative: MinMaxScaler()
    l_enc.fit(df_feat[c].values.reshape(-1,1))
    X_train[c] = l_enc.transform(X_train[c].values.reshape(-1,1))
    X_test[c] = l_enc.transform(X_test[c].values.reshape(-1,1))

83
numeric cols ['duration', 'checkin_day', 'checkin_weekday', 'checkin_month', 'checkin_year', 'checkin_quarter', 'checkout_weekday', 'checkout_week', 'checkout_day', 'checkin_weekday_sin', 'checkin_weekday_cos', 'checkin_month_sin', 'checkin_month_cos', 'utrip_number', 'utrip_steps_from_end', 'row_num', 'total_rows', 'last', 'affiliate_id_count', 'booker_country_count', 'checkin_month_count', 'checkin_week_count', 'affiliate_id_nunique', 'booker_country_nunique', 'booker_country_rank_by_affiliate', 'checkin_month_rank_by_booker_country', 'checkin_month_rank_by_affiliate', 'lag1_duration', 'lag1_same_country', 'lag1_checkin_weekday', 'lag1_checkin_week', 'lag1_checkout_weekday', 'lag1_city_id_count', 'lag1_affiliate_id_count', 'lag1_booker_country_count', 'lag1_hotel_country_count', 'lag1_checkin_month_count', 'lag1_checkin_week_count', 'lag1_city_id_nunique', 'lag1_affiliate_id_nunique', 'lag1_booker_country_nunique', 'lag1_hotel_country_nunique', 'lag1_city_id_rank_by_hotel_country'

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
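
The warning above typically means `X_train` and `X_test` were created as slices of another DataFrame. A minimal sketch of one way to avoid it, assuming the splits can be taken as explicit copies. The toy frame and its values are invented for illustration. Unlike the cell above, the scaler here is fitted on the training rows only rather than on all of `df_feat`, which keeps test-set statistics out of the features:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy stand-in for df_feat; column names mirror the notebook, values are made up.
df = pd.DataFrame({
    "duration": [1.0, 2.0, 3.0, 4.0],
    "row_num": [1.0, 1.0, 2.0, 3.0],
    "city_id": [10, 11, 10, 12],
})
numeric_cols = ["duration", "row_num"]

# .copy() gives each split its own data, so later column assignments
# no longer write into a slice of df.
X_train = df.iloc[:3][numeric_cols].copy()
X_test = df.iloc[3:][numeric_cols].copy()

# Fit once on the training split, then apply the same transform to both splits.
scaler = StandardScaler().fit(X_train[numeric_cols])
X_train[numeric_cols] = scaler.transform(X_train[numeric_cols])
X_test[numeric_cols] = scaler.transform(X_test[numeric_cols])
```

Scaling all numeric columns in one call also replaces the per-column loop, so the scaler statistics live in a single fitted object.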
  


In [47]:
for c in CAT_FEAT_NAMES:
    l_enc = LabelEncoder().fit(df_feat[c])
    X_train[c] = l_enc.transform(X_train[c])
    X_test[c] = l_enc.transform(X_test[c])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


In [48]:
# X_train.columns.get_loc(CAT_FEAT_NAMES)
cat_idxs = [X_train.columns.get_loc(c) for c in CAT_FEAT_NAMES if c in X_train]
assert len(cat_idxs) == len(CAT_FEAT_NAMES)
print(cat_idxs)

[2, 0, 1, 6, 33, 36, 37, 35, 34, 66, 69, 70, 68, 67, 22, 23]


In [49]:
#### get the number of unique values and set the embedding dimension per categorical feature
### note: change the cap (100) below if we want a higher embedding dimension!

nunique = X_train.nunique()
types = X_train.dtypes

# categorical_columns = []
categorical_dims = [] #{}
cat_embed_dims = []
for i,col in enumerate(cat_idxs):
#     print(i,col)
#     c_uniques = X_train.iloc[:,col].nunique()
    # count uniques on the full data (df_feat) instead of X_train, which may contain fewer categories
    c_uniques = df_feat[CAT_FEAT_NAMES[i]].nunique()
    
    categorical_dims.append(c_uniques)
#     if col == "user_id" :  cat_embed_dims.append(10) ## would need to compare by name, not index. user_id may overfit
    cat_embed_dims.append(min(100,c_uniques//2))   

In [50]:
print(categorical_dims)
print(cat_embed_dims)

[5, 3, 1180, 52, 1037, 5, 78, 1174, 3, 1037, 5, 78, 1191, 3, 114, 6169]
[2, 1, 100, 26, 100, 2, 39, 100, 1, 100, 2, 39, 100, 1, 57, 100]
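
The sizing rule used above, `min(100, n_unique // 2)`, can be sketched as a small helper keyed by feature name, which is what the commented-out `user_id` line would need. This is only an illustration: the function name, the `overrides` argument, the lower bound of 1, and the override value of 32 are assumptions that are not in the notebook. The cardinalities are taken from the printed output.

```python
def embedding_dims(cardinalities, cap=100, overrides=None):
    """Return {feature: embedding size} using min(cap, n_unique // 2), at least 1."""
    overrides = overrides or {}
    dims = {}
    for name, n_unique in cardinalities.items():
        if name in overrides:
            # Hand-picked size for features that need special treatment.
            dims[name] = overrides[name]
        else:
            # Lower bound of 1 guards against a zero-width embedding
            # when a feature has a single unique value.
            dims[name] = max(1, min(cap, n_unique // 2))
    return dims


cardinalities = {
    "booker_country": 5,
    "device_class": 3,
    "affiliate_id": 1180,
    "first_city_id": 6169,
}
# The override value 32 is purely hypothetical.
dims = embedding_dims(cardinalities, overrides={"affiliate_id": 32})
print(dims)
```

Keying by name rather than by column index means the sizes stay correct if the column order of `X_train` changes.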


In [51]:
assert X_test.isna().sum().max() == X_train.isna().sum().max() == 0

In [52]:
print("sum top4 total percentage:",y_train.value_counts(normalize=True)[0:4].sum().round(3))
y_train.value_counts(normalize=True).round(5)

sum top4 total percentage: 0.071


23921    0.02115
47499    0.01696
29319    0.01668
36063    0.01571
17013    0.01460
          ...   
48683    0.00007
26497    0.00006
48597    0.00005
57385    0.00004
43038    0.00002
Name: city_id, Length: 746, dtype: float64
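
The 0.071 figure is the share of training targets covered by the four most frequent cities, which is roughly what a recommender returning the same four cities for every trip would score. A minimal sketch of that popularity baseline follows. The helper names are hypothetical, and the toy lists reuse a few city ids from the output above with invented counts:

```python
from collections import Counter


def top_k_popular(labels, k=4):
    """Return the k most frequent labels, most frequent first."""
    return [label for label, _ in Counter(labels).most_common(k)]


def hit_rate(y_true, recommended):
    """Share of true labels that appear in the fixed recommendation list."""
    recommended = set(recommended)
    return sum(label in recommended for label in y_true) / len(y_true)


# Toy data: ids come from the notebook output, frequencies are made up.
train_cities = [23921, 23921, 23921, 47499, 47499, 29319, 36063, 17013]
test_cities = [23921, 17013, 47499, 2416]

top4 = top_k_popular(train_cities, k=4)
score = hit_rate(test_cities, top4)
print(top4, score)
```

Any model worth keeping should beat this number on the held-out trips by a clear margin.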

In [53]:
if RUN_TABNET:
    # TabNetPretrainer
    # NOTE: the log below reports max_epochs = 6 and a validation loss, so it was
    # produced by a run with different settings than the ones shown in this cell.
    unsupervised_model = TabNetPretrainer(
        n_d=16, n_a=16, n_steps=4,
        cat_idxs=cat_idxs,
        cat_dims=categorical_dims,
        cat_emb_dim=cat_embed_dims,
        optimizer_fn=torch.optim.Adam,
        optimizer_params=dict(lr=2e-2),
        mask_type='entmax',  # or "sparsemax"
        device_name="auto",  # "auto" or "cpu"
    )

    unsupervised_model.fit(
        X_train=X_train.values,
#         eval_set=[X_test.values],
        pretraining_ratio=0.1,
        max_epochs=4,
        batch_size=256,  # 1024 default, ~256-512 with GPU
    )

    ## save unsup model
    ### https://github.com/dreamquark-ai/tabnet/blob/develop/pretraining_example.ipynb
#     unsupervised_model.save_model('./.4_pretrain')


Device used : cuda
epoch 0  | loss: 126857.16738| val_0_unsup_loss: 2197918.5|  0:00:51s
epoch 1  | loss: 22.13734| val_0_unsup_loss: 2343696.25|  0:01:47s
epoch 2  | loss: 14.09726| val_0_unsup_loss: 2409536.0|  0:02:43s
epoch 3  | loss: 14.55485| val_0_unsup_loss: 2280972.5|  0:03:39s
epoch 4  | loss: 16.57199| val_0_unsup_loss: 2379917.25|  0:04:34s
epoch 5  | loss: 14.00867| val_0_unsup_loss: 2357729.0|  0:05:27s
Stop training because you reached max_epochs = 6 with best_epoch = 0 and best_val_0_unsup_loss = 2197918.5
Best weights from best epoch are automatically used!


In [54]:
if RUN_TABNET:
    clf = TabNetClassifier(
        n_d=16, n_a=16, n_steps=4,
        cat_idxs=cat_idxs,
        cat_dims=categorical_dims,
        cat_emb_dim=cat_embed_dims,
        optimizer_fn=torch.optim.Adam,
        optimizer_params=dict(lr=2e-2),
        scheduler_params={"step_size": 50,  # how to use learning rate scheduler
                          "gamma": 0.9},
        scheduler_fn=torch.optim.lr_scheduler.StepLR,
        mask_type='entmax',  # or "sparsemax"
        device_name="auto",  # "auto" or "cpu"
    )

    clf.fit(
        X_train=X_train.values, y_train=y_train.values,
        eval_set=[(X_train.values, y_train.values), (X_test.values, y_test.values)],
        # eval_set=[(X_test.values, y_test.values)],
        eval_name=['train', 'test'],
        eval_metric=['accuracy'],
        max_epochs=max_epochs,
        batch_size=512,  # 1024 default, ~256-512 with GPU
        from_unsupervised=unsupervised_model,
    )

#     clf.save_model('./.full_tabnet_1192class')

Device used : cuda
Loading weights from unsupervised pretraining
epoch 0  | loss: 5.1598  | train_accuracy: 0.17636 | test_accuracy: 0.23111 |  0:00:37s
epoch 1  | loss: 3.19021 | train_accuracy: 0.27039 | test_accuracy: 0.3362  |  0:01:15s
epoch 2  | loss: 2.70168 | train_accuracy: 0.31076 | test_accuracy: 0.36507 |  0:01:57s
epoch 3  | loss: 2.52072 | train_accuracy: 0.33928 | test_accuracy: 0.3794  |  0:02:36s
epoch 4  | loss: 2.39157 | train_accuracy: 0.36911 | test_accuracy: 0.39285 |  0:03:15s
epoch 5  | loss: 2.28616 | train_accuracy: 0.39359 | test_accuracy: 0.39736 |  0:03:55s
epoch 6  | loss: 2.19956 | train_accuracy: 0.41903 | test_accuracy: 0.41021 |  0:04:35s
epoch 7  | loss: 2.11324 | train_accuracy: 0.43766 | test_accuracy: 0.40584 |  0:05:13s
epoch 8  | loss: 2.04822 | train_accuracy: 0.45589 | test_accuracy: 0.40631 |  0:05:50s
epoch 9  | loss: 1.97378 | train_accuracy: 0.47266 | test_accuracy: 0.4121  |  0:06:29s
epoch 10 | loss: 1.91327 | train_accuracy: 0.49005 | te

In [55]:
# clf2 = TabNetClassifier(    
#     n_d=16, n_a=16, n_steps=4,
#     cat_idxs=cat_idxs,
#    cat_dims=categorical_dims,
#    cat_emb_dim=cat_embed_dims,   
#    optimizer_fn=torch.optim.Adam,
#    optimizer_params=dict(lr=2e-2),
#    scheduler_params={"step_size":50, # how to use learning rate scheduler
#                      "gamma":0.9},
#    scheduler_fn=torch.optim.lr_scheduler.StepLR,
#    mask_type='entmax', # "sparsemax"
#     device_name=  "auto"
# )

# clf2.fit(
#     X_train=X_train.values, y_train=y_train.values,
#     eval_set=[(X_train.values, y_train.values), (X_test.values, y_test.values)],
#     eval_name=['train','test'],
#     eval_metric=['accuracy'],
#      max_epochs=3, 
#     batch_size = 256 ,# 1024 default
#     from_unsupervised= clf #unsupervised_model,    
# )

#### feature importance & evaluation
* Look for leaks!
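* Possible bug in the ordering of results: the evaluation below does not make sense. The model has 746 outputs (classes seen in training) while the test set contains only 740 classes; this mismatch is the likely culprit.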

In [56]:
# clf.feature_importances_

array([1.65053801e-02, 0.00000000e+00, 1.26660505e-02, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 1.12086845e-02, 0.00000000e+00,
       0.00000000e+00, 6.62986348e-04, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 3.37245735e-08,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 4.89281891e-02, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 5.72079810e-02, 4.31556197e-02, 0.00000000e+00,
       1.55294673e-01, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 8.65215350e-03,
       1.24009905e-06, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
      

In [57]:
## top features: non-zero importances, sorted in descending order
# X_train.columns[clf.feature_importances_>1e-7]
feat_imp = pd.DataFrame([X_train.columns,clf.feature_importances_]).T
feat_imp = feat_imp.loc[feat_imp[1]>0].sort_values(1,ascending=False).reset_index(drop=True)
feat_imp

Unnamed: 0,0,1
0,lag2_city_id,0.480811
1,lag1_booker_country,0.155295
2,lag2_booker_country,0.0748131
3,lag2_hotel_country,0.0618727
4,lag1_city_id,0.057208
5,first_hotel_country,0.0489282
6,lag1_device_class,0.0431556
7,device_class,0.0165054
8,lag2_device_class,0.0155616
9,booker_country,0.0126661


In [58]:
print("y_test nunique classes",y_test.nunique())
y_test

y_test nunique classes 740


310807    26235
191955    29319
424066     2416
340203    47486
247728     8766
          ...  
381400    22490
251195     2748
517153    15284
148074    66815
74652     21033
Name: city_id, Length: 14863, dtype: int64

In [59]:
y_preds_test_proba = clf.predict_proba(X_test.values)
y_preds_test = clf.predict(X_test.values)
# y_preds_test[0:2]

In [60]:
y_preds_test_proba.shape

(14863, 746)

In [61]:
y_preds_test.shape

(14863,)

In [62]:
y_test.values.shape

(14863,)

In [63]:
m = tf.keras.metrics.SparseTopKCategoricalAccuracy(k=4)
m.update_state(y_true=y_test.values, y_pred=y_preds_test_proba)    
print(m.result().numpy())

0.0
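
One likely explanation for the 0.0: `SparseTopKCategoricalAccuracy` treats `y_true` as column indices into `y_pred`, while `y_test` holds raw `city_id` values (e.g. 26235) and the probability matrix has only 746 columns. A minimal sketch of a top-k hit rate that maps columns back to labels first. It assumes the columns follow the order of a label array such as the fitted classifier's `classes_` attribute; the function name and the toy numbers are illustrative, not the notebook's results:

```python
import heapq


def top_k_hit_rate(y_true, proba, classes, k=4):
    """Share of rows whose true label is among the k highest-scoring classes.

    proba[i][j] is the score of classes[j] for row i.
    """
    hits = 0
    for label, row in zip(y_true, proba):
        # Column indices of the k largest scores in this row.
        top_cols = heapq.nlargest(k, range(len(row)), key=lambda j: row[j])
        # Translate column indices back to the original labels before comparing.
        hits += label in {classes[j] for j in top_cols}
    return hits / len(y_true)


# Toy example: five classes labelled with raw city ids.
classes = [2416, 8766, 26235, 29319, 47486]
proba = [
    [0.05, 0.05, 0.60, 0.20, 0.10],
    [0.40, 0.30, 0.10, 0.10, 0.10],
    [0.30, 0.05, 0.05, 0.10, 0.50],
]
y_true = [26235, 29319, 2416]
result = top_k_hit_rate(y_true, proba, classes, k=2)
print(result)
```

In the notebook this would be called along the lines of `top_k_hit_rate(y_test.values, y_preds_test_proba, clf.classes_, k=4)`, under the assumption stated above. The helper accepts plain lists as well as NumPy arrays; for large matrices a vectorised `np.argpartition` would be faster.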


In [64]:
# m = InTopK(k=4)
# m.update_state(y_true=y_test.values, y_pred=y_preds_test)
# m.result().numpy()
# m.reset_states()

# top_k_accuracy_score(y_true=y_test.values, y_score=y_preds_test_proba)

In [65]:
# m = tf.keras.metrics.SparseCategoricalAccuracy()
# m.update_state(y_true=y_test.values, y_pred=y_preds_test) ## .reshape(-1,1) / (1,-1) ?  
# m.result().numpy()

In [66]:
# ### likely error with ordering of classes (and test has fewer classes than train)
# y_test_ohe = to_categorical(y_test.values)
# m = Precision(top_k=4)
# m.update_state(y_true=y_test_ohe, y_pred=y_preds_test_proba)
# m.result().numpy()

### Simpler baseline: Linear model + OHE
* Multinomial logistic regression model + one-hot encoding (OHE).
    * Optionally with count encoding (to reduce the number of dimensions).
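
A minimal sketch of what such a baseline could look like with scikit-learn, assuming a pipeline of `OneHotEncoder` followed by `LogisticRegression`. The two toy feature columns and all values are invented for illustration, and nothing here reflects results from the notebook:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy rows of two categorical features (think lag1_city_id, booker_country).
X_train = [["a", "x"], ["a", "y"], ["b", "x"], ["b", "y"], ["c", "x"], ["c", "y"]]
y_train = [10, 10, 20, 20, 30, 30]  # target city ids (made up)

model = make_pipeline(
    # Categories unseen during fit are encoded as all zeros instead of raising.
    OneHotEncoder(handle_unknown="ignore"),
    # With the default lbfgs solver a multiclass target is fitted as one multinomial model.
    LogisticRegression(max_iter=500),
)
model.fit(X_train, y_train)

# One probability column per class, ordered as in classes.
proba = model.predict_proba([["a", "x"], ["z", "y"]])
classes = model[-1].classes_
print(classes, proba.round(3))
```

The columns of `proba` line up with `classes`, so the top four cities per row can be read off by sorting each row and indexing into `classes`.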