* https://www.bookingchallenge.com/

* Predict `city_id`
        * Metric: P@4

##### Dataset
The training dataset consists of over a million anonymized hotel reservations, based on real data, with the following features:
*    user_id - User ID
*    check-in - Reservation check-in date
*    checkout - Reservation check-out date
*    affiliate_id - An anonymized ID of affiliate channels where the booker came from (e.g. direct, some third party referrals, paid search engine, etc.)
*    device_class - desktop/mobile
*    booker_country - Country from which the reservation was made (anonymized)
*    hotel_country - Country of the hotel (anonymized)
*    city_id - city_id of the hotel’s city (anonymized)
*    utrip_id - Unique identification of user’s trip (a group of multi-destinations bookings within the same trip)


* Each reservation is a part of a customer’s trip (identified by utrip_id) which includes at least 4 consecutive reservations. The check-out date of a reservation is the check-in date of the following reservation in their trip.

* The evaluation dataset is constructed similarly, however the city_id of the final reservation of each trip is concealed and requires a prediction.

 
###### Evaluation criteria
The goal of the challenge is to predict (and recommend) the final city (city_id) of each trip (utrip_id). We will evaluate the quality of the predictions based on the top four recommended cities for each trip by using Precision@4 metric (4 representing the four suggestion slots at Booking.com website). When the true city is one of the top 4 suggestions (regardless of the order), it is considered correct.
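
A minimal sketch of the metric as described above, assuming predictions arrive as four candidate city IDs per trip. The function name `precision_at_4` and the toy IDs are illustrative and not part of the challenge code.

```python
import numpy as np

def precision_at_4(y_true, top4_preds):
    """Share of trips whose true final city is among the 4 suggestions.

    y_true:      shape (n_trips,)   true final city_id per trip
    top4_preds:  shape (n_trips, 4) recommended city_ids (order is ignored)
    """
    y_true = np.asarray(y_true).reshape(-1, 1)
    top4_preds = np.asarray(top4_preds)
    hits = (top4_preds == y_true).any(axis=1)  # True where any slot matches
    return hits.mean()

# Toy example with made-up IDs: trips 1 and 3 are hits, trip 2 is a miss.
y_true = [10, 20, 30]
top4 = [[10, 1, 2, 3], [4, 5, 6, 7], [8, 9, 30, 11]]
score = precision_at_4(y_true, top4)
```

Because each trip has exactly one true city, this is the same quantity as a top-4 hit rate.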

------------------------------------------------------

* If we are given the country in question, then this is arguably more of a _learning to rank_ problem than a massively multiclass one.
    * CatBoost learning to rank on ms dataset (0/1):  https://colab.research.google.com/github/catboost/tutorials/blob/master/ranking/ranking_tutorial.ipynb
        * https://catboost.ai/docs/concepts/loss-functions-ranking.html
        * For CatBoost ranking, all objects in the dataset must be grouped by `group_id`; in our case this would be user/trip id X country. (We still need to add negatives within each such group/"query".)

    * lightFM - ranking (implicit interactions)
        * https://github.com/qqwjq/lightFM

    * LSTM/w2v - next-item recommendation
    * dot product between different factors as features (recommender)
    * xgboost ap - https://www.kaggle.com/anokas/xgboost-2
* Relevant: Kaggle expedia hotel prediction: https://www.kaggle.com/c/expedia-hotel-recommendations/discussion  

* ALSO: `implicit interaction` - recommendation problem (we have only positive feedback, no ranked/negative explicit feedback)


* __BASELINE__ to beat: 4 most popular by country ; 4 most popular by affiliate_id X booker_country X hotel_country (X month?)
    * Ignore/auto answer the 4 most popular for countries with fewer than 4 unique cities in data
 
 
* Likely approach : build a model (and targets/negatives) per country.
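
A rough sketch of the popularity baseline mentioned above, assuming a DataFrame with the challenge's column names. The helper `top_k_by_group` and the toy rows are hypothetical; ties are broken by the smaller `city_id`, purely to keep the output deterministic.

```python
import pandas as pd

def top_k_by_group(df, group_cols, target="city_id", k=4):
    """Return {group key: list of the k most frequent target values}."""
    counts = (
        df.groupby(group_cols + [target])
          .size()
          .reset_index(name="n")
          .sort_values(
              group_cols + ["n", target],
              ascending=[True] * len(group_cols) + [False, True],
          )
    )
    top = counts.groupby(group_cols).head(k)  # first k rows per group
    return top.groupby(group_cols)[target].agg(list).to_dict()

# Toy data (made-up IDs): country "A" has three cities, "B" has one.
toy = pd.DataFrame({
    "hotel_country": ["A", "A", "A", "A", "A", "A", "B"],
    "city_id":       [1, 1, 1, 2, 2, 3, 9],
})
top = top_k_by_group(toy, ["hotel_country"], k=2)
```

Passing several columns in `group_cols` (for example affiliate, booker country and hotel country) gives the finer-grained variant; the dictionary keys are then tuples.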

-----------
#### Data notes:
* Long tail of cities and countries
* Some (31%) countries have 4 or fewer unique cities - return a fixed answer/prediction for those?
    * CAN'T! In the test set, we will not have the country ID :(
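
A small sketch of how the share of "small" countries could be measured, assuming the challenge's `hotel_country` and `city_id` columns. The toy frame is made up; the 31% figure above comes from the real data, not from this snippet. Given the long tail, the share of reservations in such countries may be more informative than the share of countries.

```python
import pandas as pd

toy = pd.DataFrame({
    "hotel_country": ["A", "A", "A", "A", "A", "B", "B", "C"],
    "city_id":       [1, 2, 3, 4, 5, 6, 6, 7],
})

# Number of distinct cities seen in each country
cities_per_country = toy.groupby("hotel_country")["city_id"].nunique()
small_countries = cities_per_country[cities_per_country <= 4].index

# Share of countries with 4 or fewer cities
share_small = len(small_countries) / len(cities_per_country)
# Share of reservations (rows) that fall in those countries
row_share = toy["hotel_country"].isin(small_countries).mean()
```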
    
    
----------------------
MF - embedding model

* https://blog.tensorflow.org/2020/09/introducing-tensorflow-recommenders.html
* Implicit recommendations - needs negs
    * example of explicit (simple): https://petamind.com/build-a-simple-recommender-system-with-matrix-factorization/
* sample negatives - how? TFRS requires tf.data.Dataset overhead (and it is unclear to me what the user id should be)
    * https://www.kaggle.com/skihikingkevin/some-recommender-system-implementations
    
    
Simple keras example of multiple inputs : 
* keras topologies
* https://stackoverflow.com/questions/61722973/why-keras-embedding-not-learning-for-recommendation-system


* TensorFlow Ranking (seems in beta): https://colab.research.google.com/github/tensorflow/ranking/blob/master/tensorflow_ranking/examples/handling_sparse_features.ipynb#scrollTo=HfDMGnZY9eVO


Negative pairs training with generator - https://towardsdatascience.com/building-a-recommendation-system-using-neural-network-embeddings-1ef92e5c80c9
* See code. 

* example for implicit recommender (naive) - mainly for negatives data gen?
* https://www.kaggle.com/skihikingkevin/some-recommender-system-implementations


* Could use **lightFM** - implicit recommender? 
    * https://github.com/lyst/lightfm/tree/master/examples/dataset
    * https://github.com/lyst/lightfm/blob/master/examples/stackexchange/hybrid_crossvalidated.ipynb
    * Note use of sparse matrices. Supports metadata
* We could use tuple of features for "user id" for purposes of recommenders? 
  
* user-item sparse OHE creation - https://github.com/piyushpathak03/Recommendation-systems/blob/master/Recomendation%20system%20end%20to%20end/4)%20Feature%20Creation.ipynb
* lightfm -
    * https://github.com/piyushpathak03/Recommendation-systems/tree/master/Recomendation%20system%20end%20to%20end  - building the sparse interactions matrix for implicit recommendation
    * https://making.lyst.com/lightfm/docs/examples/dataset.html#building-the-interactions-matrix
* https://making.lyst.com/lightfm/docs/examples/hybrid_crossvalidated.html - example of metadata features for lightfm
  
* https://github.com/zhangruiskyline/DeepLearning/blob/master/doc/Recommendation.md#ranking  - includes negatives sampling! 
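
A minimal sketch of building a sparse implicit-feedback interactions matrix with pandas and SciPy only, as an alternative to the library helpers linked above. Using `utrip_id` as the "user" axis is an assumption made for illustration, and the toy rows are invented.

```python
import numpy as np
import pandas as pd
from scipy import sparse

toy = pd.DataFrame({
    "utrip_id": ["1_1", "1_1", "2_1", "2_1", "2_1"],
    "city_id":  [100, 200, 100, 300, 300],
})

# Categorical codes give contiguous 0..n-1 row/column indices
rows = toy["utrip_id"].astype("category")
cols = toy["city_id"].astype("category")

interactions = sparse.coo_matrix(
    (
        np.ones(len(toy), dtype=np.float32),
        (rows.cat.codes.to_numpy(), cols.cat.codes.to_numpy()),
    ),
    shape=(len(rows.cat.categories), len(cols.cat.categories)),
).tocsr()  # converting to CSR sums duplicate (row, col) entries

interactions.data[:] = 1.0  # binarize: repeated visits count once
```

`rows.cat.categories` and `cols.cat.categories` map matrix indices back to the original IDs.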

**Spotlight**
*  https://maciejkula.github.io/spotlight/interactions.html
    * Also has sequence support easily
    * Example of loading custom dataset for implicit recc - https://github.com/maciejkula/spotlight/issues/30

**SVD/ALS**
    * https://stats.stackexchange.com/questions/354355/what-is-the-relation-between-svd-and-als
    * https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.linalg.svds.html
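
A small sketch of truncated SVD on a sparse interaction matrix with `scipy.sparse.linalg.svds`. The random matrix, the choice `k = 5`, and the decision to split the singular values evenly between the two factor matrices are all illustrative assumptions.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import svds

# Random binary "interactions" stand in for a real user-item matrix
rng = np.random.default_rng(0)
dense = (rng.random((20, 15)) < 0.2).astype(np.float64)
interactions = sparse.csr_matrix(dense)

k = 5  # number of latent factors; must be smaller than min(interactions.shape)
u, s, vt = svds(interactions, k=k)

# Split the singular values between the two sides
user_factors = u * np.sqrt(s)      # shape (n_users, k)
item_factors = vt.T * np.sqrt(s)   # shape (n_items, k)

# Rank-k reconstruction; higher score = stronger predicted affinity
scores = user_factors @ item_factors.T
```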
    
    
RankFM (implicit package) - don't know if adds anything vs lightfm?
* https://github.com/etlundquist/rankfm
* Does seem easier to "productionize"

```
from implicit.als import AlternatingLeastSquares
from scipy import sparse

def matrix_decomposition(matrix, k, i):
    matrix = sparse.csr_matrix(matrix.T)
    model = AlternatingLeastSquares(factors=k, iterations=i)
    model.fit(matrix)
    user_latent = model.user_factors
    item_latent = model.item_factors

    return user_latent, item_latent
```

Neural Collaborative Filtering
* https://calvinfeng.gitbook.io/machine-learning-notebook/supervised-learning/recommender/neural_collaborative_filtering
     * Using the model from here: https://nanx.me/blog/post/recsys-binary-implicit-feedback-r-keras/ and https://github.com/hexiangnan/neural_collaborative_filtering/blob/master/GMF.py

```
# Create the training set
APPROX_NEGATIVE_SAMPLE_SIZE = int(len(train)*1.2)
n_users = c_user.categories.shape[0]
n_tracks = c_track.categories.shape[0]
# Positive samples
train_users = train['username'].cat.codes.values
train_tracks = train['track_id'].cat.codes.values
train_labels = np.ones(len(train_users))
# insert negative samples
u = np.random.randint(n_users, size=APPROX_NEGATIVE_SAMPLE_SIZE)
i = np.random.randint(n_tracks, size=APPROX_NEGATIVE_SAMPLE_SIZE)
non_neg_idx = np.where(train_data[u,i] == 0)
train_users = np.concatenate([train_users, u[non_neg_idx[1]]])
train_tracks = np.concatenate([train_tracks, i[non_neg_idx[1]]])
train_labels = np.concatenate([train_labels, np.zeros(u[non_neg_idx[1]].shape[0])])
print((train_users.shape, train_tracks.shape, train_labels.shape))

# random shuffle the data (because Keras takes last 10% as validation split)
X = np.stack([train_users, train_tracks, train_labels], axis=1)
np.random.shuffle(X)
```



* https://vitobellini.github.io/posts/2018/01/03/how-to-build-a-recommender-system-in-tensorflow.html  - easily turn df into matrix (need to add "as_sparse") - autoencoder approach: 
    ```
    # Convert DataFrame in user-item matrix
    matrix = df.pivot(index='user', columns='item', values='rating')
    matrix.fillna(0, inplace=True)
    ...
    # Users and items ordered as they are in matrix

    users = matrix.index.tolist()
    items = matrix.columns.tolist()

    matrix = matrix.to_numpy()  # DataFrame.as_matrix() was removed in pandas 1.0
    ```
    
Triplets/siamese + triplet mining - 
* https://github.com/maciejkula/triplet_recommendations_keras

In [1]:
# Recommenders embedding - fit generator
# https://towardsdatascience.com/building-a-recommendation-system-using-neural-network-embeddings-1ef92e5c80c9
# Also has code for generator to generate positive, negative pairs per batch - good for siamese/triplets/metric! 

import numpy as np
import random
random.seed(100)
def generate_batch(pairs, n_positive = 50, negative_ratio = 1.0):
    """Generate batches of samples for training. 
       Random select positive samples
       from pairs and randomly select negatives.

       Relies on names defined elsewhere (not in this cell):
       `books`, `links`, `pairs_set` and `neg_label`."""
    
    # Create empty array to hold batch (cast to int: negative_ratio is a float)
    batch_size = int(n_positive * (1 + negative_ratio))
    batch = np.zeros((batch_size, 3))
    
    # Continue to yield samples
    while True:
        # Randomly choose positive examples
        for idx, (book_id, link_id) in enumerate(random.sample(pairs, n_positive)):
            batch[idx, :] = (book_id, link_id, 1)
        idx += 1
        
        # Add negative examples until reach batch size
        while idx < batch_size:
            
            # Random selection
            random_book = random.randrange(len(books))
            random_link = random.randrange(len(links))
            
            # Check to make sure this is not a positive example
            if (random_book, random_link) not in pairs_set:
                
                # Add to batch and increment index
                batch[idx, :] = (random_book, random_link, neg_label)
                idx += 1
                
        # Make sure to shuffle order
        np.random.shuffle(batch)
        yield {'book': batch[:, 0], 'link': batch[:, 1]}, batch[:, 2]


Possible approach + negatives - https://github.com/zhangruiskyline/DeepLearning/blob/master/doc/Recommendation.md#ranking 
* To assess their quality we do the following for each user:
    * compute matching scores for items (except the movies that the user has already seen in the training set),
    * compare to the positive feedback actually collected on the test set using the ROC AUC ranking metric,
    * average ROC AUC scores across users to get the average performance of the recommender model on the test set.
```
import numpy as np
from sklearn.metrics import roc_auc_score

def average_roc_auc(match_model, data_train, data_test):
    """Compute the ROC AUC for each user and average over users"""
    max_user_id = max(data_train['user_id'].max(), data_test['user_id'].max())
    max_item_id = max(data_train['item_id'].max(), data_test['item_id'].max())
    user_auc_scores = []
    for user_id in range(1, max_user_id + 1):
        pos_item_train = data_train[data_train['user_id'] == user_id]
        pos_item_test = data_test[data_test['user_id'] == user_id]

        # Rank all items except those already seen in the training set
        all_item_ids = np.arange(1, max_item_id + 1)
        items_to_rank = np.setdiff1d(all_item_ids, pos_item_train['item_id'].values)

        # Ground truth: return 1 for each item positively present in the test set
        # and 0 otherwise.
        expected = np.isin(items_to_rank, pos_item_test['item_id'].values)

        if np.sum(expected) >= 1:
            # At least one positive test value to rank
            repeated_user_id = np.empty_like(items_to_rank)
            repeated_user_id.fill(user_id)

            predicted = match_model.predict([repeated_user_id, items_to_rank],
                                            batch_size=4096)
            user_auc_scores.append(roc_auc_score(expected, predicted))

    return sum(user_auc_scores) / len(user_auc_scores)
```    



* Negative sampling from the sparse user-item cooccurrence matrix
    * https://stackoverflow.com/questions/49971318/how-to-generate-negative-samples-in-tensorflow
    ```
    import numpy as np

    def subsampler(data, num_pos=10, num_neg=10):
        """ Obtain random batch size made up of positive and negative samples

        `data` is expected to be a scipy.sparse COO matrix (uses .row/.col/.data).

        Returns
        -------
        positive_row : np.array
           Row ids of the positive samples
        positive_col : np.array
           Column ids of the positive samples
        positive_data : np.array
           Data values in the positive samples
        negative_row : np.array
           Row ids of the negative samples
        negative_col : np.array
           Column ids of the negative samples

        Note
        ----
        We do not return negative data, since the negative values
        are always zero.
        """
        N, D = data.shape
        y_data = data.data
        y_row = data.row
        y_col = data.col

        # store all of the positive (i, j) coords
        idx = np.vstack((y_row, y_col)).T
        idx = set(map(tuple, idx.tolist()))
        while True:
            # get positive sample
            positive_idx = np.random.choice(len(y_data), num_pos)
            positive_row = y_row[positive_idx].astype(np.int32)
            positive_col = y_col[positive_idx].astype(np.int32)
            positive_data = y_data[positive_idx].astype(np.float32)

            # get negative sample
            negative_row = np.zeros(num_neg, dtype=np.int32)
            negative_col = np.zeros(num_neg, dtype=np.int32)
            for k in range(num_neg):
                i, j = np.random.randint(N), np.random.randint(D)
                # resample until (i, j) is not a known positive
                while (i, j) in idx:
                    i, j = np.random.randint(N), np.random.randint(D)
                # store the accepted negative (previously only stored after a collision)
                negative_row[k] = i
                negative_col[k] = j

            yield (positive_row, positive_col, positive_data,
                   negative_row, negative_col)
    ```

In [2]:
from catboost import CatBoostClassifier
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.metrics import TopKCategoricalAccuracy, Precision, SparseTopKCategoricalAccuracy # @4
from tensorflow.keras.utils import to_categorical

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import train_test_split, GroupShuffleSplit, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler, MinMaxScaler, OneHotEncoder

%matplotlib inline

pd.set_option("display.max_columns", 90)

In [3]:
!nvidia-smi -L

GPU 0: GeForce RTX 2060 (UUID: GPU-d8cefda9-d4cb-990c-cc01-a2a4f2416484)


In [4]:
## https://www.tensorflow.org/guide/mixed_precision ## TF mixed precision - pytorch requires other setup
from tensorflow.keras.mixed_precision import experimental as mixed_precision

policy = mixed_precision.Policy('mixed_float16')
mixed_precision.set_policy(policy)
## will need to correct in places, e.g.: 
## outputs = layers.Activation('softmax', dtype='float32', name='predictions')(x)

The dtype policy mixed_float16 may run slowly because this machine does not have a GPU. Only Nvidia GPUs with compute capability of at least 7.0 run quickly with mixed_float16.
Instructions for updating:
Use tf.keras.mixed_precision.LossScaleOptimizer instead. LossScaleOptimizer now has all the functionality of DynamicLossScale


#### Features to add:
* Lag 
* Rank (popularity) of city, country (in general, +- given booker country)
* Count of hotel, user, trip size? (may be leaky)
* Seasonal features - holidays?, datetime

Aggregate feats:
* user changed country? last booking (lag 1) country change? 
* max/min/avg popularity rank of previous locations visited
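
A sketch of the two aggregate ideas above on toy data, assuming rows are already sorted by check-in within each trip. The `city_rank` column is a hypothetical popularity rank, not a column from the dataset. Shifting before aggregating keeps each row's statistics restricted to earlier stops.

```python
import pandas as pd

toy = pd.DataFrame({
    "utrip_id":      ["a", "a", "a", "b", "b"],
    "hotel_country": [1, 1, 2, 3, 3],
    "city_rank":     [5, 1, 9, 2, 4],  # hypothetical popularity rank per city
})

g = toy.groupby("utrip_id")

# Lag-1 country change: 1 if the previous stop was in a different country
prev_country = g["hotel_country"].shift(1)
toy["country_changed"] = (
    prev_country.notna() & (prev_country != toy["hotel_country"])
).astype(int)

# Running min/max of popularity rank over *previous* stops only
prev_rank = g["city_rank"].shift(1)
toy["prev_rank_min"] = prev_rank.groupby(toy["utrip_id"]).cummin()
toy["prev_rank_max"] = prev_rank.groupby(toy["utrip_id"]).cummax()
```

The first row of every trip has no history, so its `prev_rank_*` values are NaN and need imputing or dropping.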



We should create a dictionary of the rank, count, city/country, etc. features, so we can easily merge them when making more "negative" samples/features for ranking.
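
A minimal sketch of such a lookup table, assuming each `city_id` belongs to exactly one `hotel_country`. The feature name `city_id_rank_in_country` and the candidate rows are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "hotel_country": [1, 1, 1, 2, 2],
    "city_id":       [10, 10, 11, 20, 21],
})

# One row per city with its precomputed count and within-country rank
city_feats = (
    toy.groupby(["hotel_country", "city_id"])
       .size()
       .rename("city_id_count")
       .reset_index()
)
city_feats["city_id_rank_in_country"] = (
    city_feats.groupby("hotel_country")["city_id_count"]
              .rank(ascending=False, method="min")
)

# Candidate cities for a trip (positives or sampled negatives) pick up
# the same features through a plain merge
candidates = pd.DataFrame({"utrip_id": ["x", "x"], "city_id": [10, 11]})
candidates = candidates.merge(city_feats, on="city_id", how="left")
```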


* Consider using a df2 copy of df without the date columns + drop_duplicates, +- without user/trip id (after calculating the features that need them).


Leaky or potentially leaky (depends on test set): 
* Target freq features - frequency of target city, given source country +- affiliate +- month of year +- given country (and interactions of target freq). 
    * Risk of leaks - depends on whether the test data has a temporal split or not. 
    * CatBoost can do target encoding, but this lets us do it for interactions, e.g. target city freq given the 2 countries and affiliate.
    * beware overfitting! 
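
One hedged way to limit the leak and overfitting risk noted above is to compute target frequencies out-of-fold, so a row never contributes to its own feature. The helper `oof_target_freq` below is a hypothetical sketch with random row-level folds; splitting folds by trip or by time may match the real test split better.

```python
import numpy as np
import pandas as pd

def oof_target_freq(df, key_cols, target="city_id", n_splits=5, seed=0):
    """Out-of-fold estimate of P(target | key), using the other folds only."""
    rng = np.random.default_rng(seed)
    fold = rng.integers(0, n_splits, size=len(df))
    out = pd.Series(np.nan, index=df.index, dtype="float64")
    for f in range(n_splits):
        fit, hold = df[fold != f], df[fold == f]
        # Counts of (key, target) pairs and of keys on the fitting folds
        pair_n = fit.groupby(key_cols + [target]).size().rename("pair_n").reset_index()
        key_n = fit.groupby(key_cols).size().rename("key_n").reset_index()
        freq = pair_n.merge(key_n, on=key_cols)
        freq["freq"] = freq["pair_n"] / freq["key_n"]
        # A left merge keeps the held-out rows in their original order
        merged = hold[key_cols + [target]].merge(
            freq, on=key_cols + [target], how="left"
        )
        # Pairs unseen in the fitting folds get frequency 0
        out.loc[hold.index] = merged["freq"].fillna(0.0).to_numpy()
    return out

toy = pd.DataFrame({
    "booker_country": [0] * 6 + [1] * 4,
    "city_id":        [10, 10, 10, 11, 11, 10, 20, 20, 21, 20],
})
toy["city_freq_oof"] = oof_target_freq(toy, ["booker_country"], n_splits=2)
```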

In [5]:
MIN_TARGET_FREQ = 40 # drop target/city_id values that appear fewer than this many times, as final step's target 
KEEP_TOP_K_TARGETS = 2000 # keep K most frequent city ID targets (redundant with the above)

## (some) categorical variables that appear fewer than this many times will be replaced with a placeholder value!
## Includes CITY id (but done after target filtering, to avoid creating a "rare class" target):
LOW_COUNT_THRESH = 8

RUN_TABNET = True
max_epochs = 30

In [6]:
# most basic categorical columns, without 'user_id', 'utrip_id' or 'device_class' - used for count encoding/filtering
BASE_CAT_COLS = ['city_id',  'affiliate_id', 'booker_country', 'hotel_country']

### features to get lags for. Not very robust. May want different feats for lags before -1
LAG_FEAT_COLS = ['city_id', 'device_class',
       'affiliate_id', 'booker_country', 'hotel_country', 
       'duration', 'same_country', 'checkin_weekday',
       'checkin_week',
        'checkout_weekday',
       'city_id_count', 'affiliate_id_count',
       'booker_country_count', 'hotel_country_count', 
       'checkin_month_count', 'checkin_week_count', 'city_id_nunique',
       'affiliate_id_nunique', 'booker_country_nunique',
       'hotel_country_nunique', 'city_id_rank_by_hotel_country',
       'city_id_rank_by_booker_country', 'city_id_rank_by_affiliate',
       'affiliate_id_rank_by_hotel_country',
       'affiliate_id_rank_by_booker_country', 
       'booker_country_rank_by_hotel_country',
       'booker_country_rank_by_booker_country',
       'booker_country_rank_by_affiliate',
#        'hotel_country_rank_by_hotel_country',
       'hotel_country_rank_by_booker_country',
       'hotel_country_rank_by_affiliate',
       'checkin_month_rank_by_hotel_country',
       'checkin_month_rank_by_booker_country',
       'checkin_month_rank_by_affiliate'
                ]

In [7]:
# https://stackoverflow.com/questions/33907537/groupby-and-lag-all-columns-of-a-dataframe
# https://stackoverflow.com/questions/62924987/lag-multiple-variables-grouped-by-columns
## lag features with groupby over many columns: 
def groupbyLagFeatures(df:pd.DataFrame,lag:list=[1,2],group="utrip_id",lag_feature_cols=[]):
    """
    lag features with groupby over many columns
    https://stackoverflow.com/questions/62924987/lag-multiple-variables-grouped-by-columns"""
    if len(lag_feature_cols)>0:
        df=pd.concat([df]+[df.groupby(group)[lag_feature_cols].shift(x).add_prefix('lag'+str(x)+"_") for x in lag],axis=1)
    else:
        df=pd.concat([df]+[df.groupby(group).shift(x).add_prefix('lag'+str(x)+"_") for x in lag],axis=1)
    return df

def groupbyFirstLagFeatures(df:pd.DataFrame,group="user_id",lag_feature_cols=[]):
    """
    Get  first/head value lag-like of features with groupby over columns. Assumes sorted data!
    """
    if len(lag_feature_cols)>0:
        df=pd.concat([df]+[df.groupby(group)[lag_feature_cols].transform("first").add_prefix("first_")],axis=1)
    else:
#          df=pd.concat([df]+[df.groupby(group).first().add_prefix("first_")],axis=1)
        df=pd.concat([df]+[df.groupby(group).transform("first").add_prefix("first_")],axis=1)
    return df

######## Get n most popular items, per group
def most_popular(group, n_max=4):
    """Find most popular hotel clusters by destination
    Define a function to get most popular hotels for a destination group.

    Previous version used nlargest() Series method to get indices of largest elements. But the method is rather slow.
    Source: https://www.kaggle.com/dvasyukova/predict-hotel-type-with-pandas
    """
    relevance = group['relevance'].values
    hotel_cluster = group['hotel_cluster'].values
    most_popular = hotel_cluster[np.argsort(relevance)[::-1]][:n_max]
    return np.array_str(most_popular)[1:-1] # remove square brackets


## https://codereview.stackexchange.com/questions/149306/select-the-n-most-frequent-items-from-a-pandas-groupby-dataframe
# https://stackoverflow.com/questions/52073054/group-by-a-column-to-find-the-most-frequent-value-in-another-column
## can get modes (sorted)
# https://stackoverflow.com/questions/50592762/finding-most-common-values-with-pandas-groupby-and-value-counts
## df.groupby('tag')['category'].agg(lambda x: x.value_counts().index[0])
# https://stackoverflow.com/questions/15222754/groupby-pandas-dataframe-and-select-most-common-value
# source2.groupby(['Country','City'])['Short name'].agg(pd.Series.mode)


In [8]:
df = pd.read_csv("booking_train_set.csv",
#                  nrows=323456,
                 index_col=[0],
                 parse_dates=["checkin","checkout"],infer_datetime_format=True)

df.sort_values(["user_id","checkin"],inplace=True)

df



Unnamed: 0,user_id,checkin,checkout,city_id,device_class,affiliate_id,booker_country,hotel_country,utrip_id
1004862,29,2016-07-09,2016-07-11,47054,desktop,1601,Elbonia,Elbonia,29_1
1004863,29,2016-07-11,2016-07-13,34444,desktop,1601,Elbonia,Elbonia,29_1
1004864,29,2016-07-13,2016-07-16,12291,desktop,1601,Elbonia,Elbonia,29_1
1004865,29,2016-07-16,2016-07-18,16386,desktop,8132,Elbonia,Elbonia,29_1
897811,81,2016-05-15,2016-05-16,33665,desktop,9924,Elbonia,Elbonia,81_1
...,...,...,...,...,...,...,...,...,...
1072993,6258065,2016-04-19,2016-04-20,55044,mobile,9452,Gondal,Pullamawang,6258065_1
420479,6258087,2016-08-03,2016-08-04,17754,desktop,2436,Gondal,Gondal,6258087_1
420480,6258087,2016-08-04,2016-08-05,50073,desktop,2436,Gondal,Gondal,6258087_1
420481,6258087,2016-08-05,2016-08-06,11662,desktop,2436,Gondal,Gondal,6258087_1


In [9]:
df["duration"] = (df["checkout"] - df["checkin"]).dt.days
df["same_country"] = (df["booker_country"]==df["hotel_country"]).astype(int)

df["checkin_day"] = df["checkin"].dt.day
df["checkin_weekday"] = df["checkin"].dt.weekday
df["checkin_week"] = df["checkin"].dt.isocalendar().week.astype(int) ## week of year
df["checkin_month"] = df["checkin"].dt.month
df["checkin_year"] = df["checkin"].dt.year-2016

df["checkin_quarter"] = df["checkin"].dt.quarter # relatively redundant but may be used for "id"

df["checkout_weekday"] = df["checkout"].dt.weekday
df["checkout_week"] = df["checkout"].dt.isocalendar().week.astype(int) ## week of year
df["checkout_day"] = df["checkout"].dt.day ## day of month

## cyclical datetime embeddings
## drop original variables? 
## TODO: add for other variables, +- those that we'll embed (week?)

df['checkin_weekday_sin'] = np.sin(df["checkin_weekday"]*(2.*np.pi/7))
df['checkin_weekday_cos'] = np.cos(df["checkin_weekday"]*(2.*np.pi/7))
df['checkin_month_sin'] = np.sin((df["checkin_month"]-1)*(2.*np.pi/12))
df['checkin_month_cos'] = np.cos((df["checkin_month"]-1)*(2.*np.pi/12))

#############
# last number in utrip id - probably which trip number it is:
df["utrip_number"] = df["utrip_id"].str.split("_",expand=True)[1].astype(int)

### encode string columns - must be consistent with test data 
### IF we can concat test with train, we can just do a single transformation  for the NON TARGET cols
# obj_cols_list = df.select_dtypes("O").columns.values
obj_cols_list = ['device_class','booker_country','hotel_country'] # we could also define when loading data, dtype
for c in obj_cols_list:
    df[c] = df[c].astype("category")
    df[c] = df[c].cat.codes.astype(int)

## position of each step within its trip (utrip_id), in chronological order.
## rank(ascending=True, pct=True) gives values in (0, 1]; the LAST step of a trip == 1.0
## (despite the column name, this is a fraction measured from the start of the trip, not a count of steps from the end).
## Useful to get the "final" step per trip - for prediction.
#### this feature overlaps with the count of each trip id (for the final row)
##  = df.sort_values(["checkin","checkout"])... - df already sorted above
df["utrip_steps_from_end"] = df.groupby("utrip_id")["checkin"].rank(ascending=True,pct=True) #.cumcount("user_id")
# print(df["utrip_steps_from_end"].describe()) # min is greater than 0
# df[["user_id","utrip_steps_from_end","checkin"]].sort_values(["user_id","utrip_steps_from_end"])
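
The sine/cosine pairs in the cell above place the end of a cycle next to its start, which a raw integer month or weekday does not. A small numeric check of that property, with a hypothetical helper `cyclical`:

```python
import numpy as np

def cyclical(values, period):
    """Map values in [0, period) onto the unit circle as (sin, cos) pairs."""
    angle = 2.0 * np.pi * np.asarray(values, dtype=float) / period
    return np.column_stack([np.sin(angle), np.cos(angle)])

# Months encoded as month - 1, i.e. 0..11, matching the cell above
months = cyclical(np.arange(12), 12)

# Euclidean distance between encodings
dec_jan = np.linalg.norm(months[11] - months[0])  # adjacent on the circle
dec_jun = np.linalg.norm(months[11] - months[5])  # opposite sides of the circle
```

Here `dec_jan` is much smaller than `dec_jun`, whereas the raw integers would put December (12) and January (1) at the two extremes.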

In [10]:
### add features consistent with the test set: row number within the trip, and total rows in the trip

df["row_num"] = df.groupby("utrip_id")["checkin"].rank(ascending=True,pct=False).astype(int)
utrip_counts = df["utrip_id"].value_counts()
df["total_rows"] = df["utrip_id"].map(utrip_counts)

df[["row_num","total_rows"]].describe()

Unnamed: 0,row_num,total_rows
count,1166835.0,1166835.0
mean,3.558572,6.117143
std,2.375193,2.796383
min,1.0,1.0
25%,2.0,4.0
50%,3.0,5.0
75%,5.0,7.0
max,48.0,48.0


In [11]:
df["last"] = (df["row_num"] ==df["total_rows"]).astype(int)

* Add first country, city visited in a trip. 
* Drop first row of a trip

In [12]:
## add the "first" place visited/values
### note - will need to drop first row in trip, or impute NaNs when using this feature 

### first by user results in too much sparsity/rareness for our IDs purposes
df = groupbyFirstLagFeatures(df,group="utrip_id",lag_feature_cols=["hotel_country","city_id"]) # ["hotel_country","city_id"]

## alt - messy, but maybe good enough : 
# df = groupbyFirstLagFeatures(df,group=['device_class', 'affiliate_id',
#                                        'booker_country','checkin_month',"last"],lag_feature_cols=["hotel_country"])


# df = df.loc[df["row_num"]>1] ## can't do yet, needed for lag features
print(df[["first_hotel_country","hotel_country","city_id"]].nunique())
df


first_hotel_country      174
hotel_country            195
city_id                39901
dtype: int64


Unnamed: 0,user_id,checkin,checkout,city_id,device_class,affiliate_id,booker_country,hotel_country,utrip_id,duration,same_country,checkin_day,checkin_weekday,checkin_week,checkin_month,checkin_year,checkin_quarter,checkout_weekday,checkout_week,checkout_day,checkin_weekday_sin,checkin_weekday_cos,checkin_month_sin,checkin_month_cos,utrip_number,utrip_steps_from_end,row_num,total_rows,last,first_hotel_country,first_city_id
1004862,29,2016-07-09,2016-07-11,47054,0,1601,1,46,29_1,2,1,9,5,27,7,0,3,0,28,11,-0.974928,-0.222521,1.224647e-16,-1.000000e+00,1,0.25,1,4,0,46,47054
1004863,29,2016-07-11,2016-07-13,34444,0,1601,1,46,29_1,2,1,11,0,28,7,0,3,2,28,13,0.000000,1.000000,1.224647e-16,-1.000000e+00,1,0.50,2,4,0,46,47054
1004864,29,2016-07-13,2016-07-16,12291,0,1601,1,46,29_1,3,1,13,2,28,7,0,3,5,28,16,0.974928,-0.222521,1.224647e-16,-1.000000e+00,1,0.75,3,4,0,46,47054
1004865,29,2016-07-16,2016-07-18,16386,0,8132,1,46,29_1,2,1,16,5,28,7,0,3,0,29,18,-0.974928,-0.222521,1.224647e-16,-1.000000e+00,1,1.00,4,4,1,46,47054
897811,81,2016-05-15,2016-05-16,33665,0,9924,1,46,81_1,1,1,15,6,19,5,0,2,0,20,16,-0.781831,0.623490,8.660254e-01,-5.000000e-01,1,0.25,1,4,0,46,33665
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1072993,6258065,2016-04-19,2016-04-20,55044,1,9452,2,133,6258065_1,1,0,19,1,16,4,0,2,2,16,20,0.781831,0.623490,1.000000e+00,6.123234e-17,1,1.00,4,4,1,133,59444
420479,6258087,2016-08-03,2016-08-04,17754,0,2436,2,60,6258087_1,1,1,3,2,31,8,0,3,3,31,4,0.974928,-0.222521,-5.000000e-01,-8.660254e-01,1,0.25,1,4,0,60,17754
420480,6258087,2016-08-04,2016-08-05,50073,0,2436,2,60,6258087_1,1,1,4,3,31,8,0,3,4,31,5,0.433884,-0.900969,-5.000000e-01,-8.660254e-01,1,0.50,2,4,0,60,17754
420481,6258087,2016-08-05,2016-08-06,11662,0,2436,2,60,6258087_1,1,1,5,4,31,8,0,3,5,31,6,-0.433884,-0.900969,-5.000000e-01,-8.660254e-01,1,0.75,3,4,0,60,17754


In [13]:
### replace rare values (fewer than 3 occurrences) with a "-2" dummy

affiliates_counts = df["affiliate_id"].value_counts()
print("before:", affiliates_counts)
print("uniques",df["affiliate_id"].nunique())
affiliates_counts = affiliates_counts.to_dict()
# df["affiliate_id"] = df["affiliate_id"].where(df["affiliate_id"].apply(lambda x: x.map(x.value_counts()))>=3, -1)
df["affiliate_id"] = df["affiliate_id"].where(df["affiliate_id"].map(affiliates_counts)>=3, -2)
df["affiliate_id"] = df["affiliate_id"].astype(int)

print("after\n",df["affiliate_id"].value_counts())
print("uniques",df["affiliate_id"].nunique())

before: 9924     277775
359      171385
384       88137
9452      85476
4541      41504
          ...  
8351          1
8464          1
2202          1
10513         1
2047          1
Name: affiliate_id, Length: 3254, dtype: int64
uniques 3254
after
 9924    277775
359     171385
384      88137
9452     85476
4541     41504
         ...  
2615         3
5963         3
2618         3
838          3
176          3
Name: affiliate_id, Length: 2152, dtype: int64
uniques 2152


In [14]:
### for a possible "user id" embedding/ID: how many unique values are there for these source tuples?
### Could also maybe add previous location/lag1 country/city ? 
## 'device_class','affiliate_id', 'booker_country' - 7.5 K "uniques"
## 'device_class','affiliate_id', 'booker_country','checkin_month' - 24 K "uniques"
## 'device_class','affiliate_id', 'booker_country','checkin_quarter' 14K "uniques"

print(df[['device_class','affiliate_id', 'booker_country','checkin_month',"total_rows"]].nunique(axis=0))
df.groupby(['device_class','affiliate_id', 'booker_country','checkin_quarter']).size()

device_class         3
affiliate_id      2152
booker_country       5
checkin_month       12
total_rows          41
dtype: int64


device_class  affiliate_id  booker_country  checkin_quarter
0             -2            0               1                   6
                                            2                  13
                                            3                  36
                                            4                  13
                            1               1                  50
                                                               ..
2              10615        1               4                   3
               10627        2               2                   1
               10646        2               4                   1
               10668        2               1                   1
                                            3                   1
Length: 13225, dtype: int64
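
A short sketch of turning such a feature tuple into a single integer "pseudo user" ID with `groupby(...).ngroup()`. The column choice mirrors the cell above; the toy rows and the name `pseudo_user_id` are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "device_class":    [0, 0, 1, 0],
    "affiliate_id":    [359, 359, 9924, 359],
    "booker_country":  [1, 1, 2, 1],
    "checkin_quarter": [1, 1, 3, 2],
})

key_cols = ["device_class", "affiliate_id", "booker_country", "checkin_quarter"]

# ngroup() numbers each distinct tuple 0..n-1 (in sorted key order)
toy["pseudo_user_id"] = toy.groupby(key_cols, sort=True).ngroup()
n_pseudo_users = toy["pseudo_user_id"].nunique()
```

Tuples unseen at training time would still need a placeholder ID at prediction time.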

In [15]:
# df.groupby(['device_class','affiliate_id', 'booker_country','checkin_month']).size() ## 24k

In [16]:
##### Following aggregation features - would be best to use time window (sort data) to generate, otherwise they will LEAK! (e.g. nunique countries visited)

### count features (can also later add rank inside groups).
### Some may be leaks (# visits in a trip should use time window?) , and do users repeat? 
### can add more counts of group X time period (e.g. affiliate X month of year)
## alt way to get counts/freq :
# freq = df["city_id"].value_counts()
# df["city_id_count"] = df["city_id"].map(freq)
# print(df["city_id_count"].describe())

count_cols = [ 'city_id','affiliate_id', 'booker_country', 'hotel_country', 
#               'utrip_id','user_id', 
 "checkin_month","checkin_week"]
for c in count_cols:
    df[f"{c}_count"] = df.groupby([c])["duration"].transform("size")
    
########################################################
## nunique per trip
### https://stackoverflow.com/questions/46470743/how-to-efficiently-compute-a-rolling-unique-count-in-a-pandas-time-series

nunique_cols = [ 'city_id','affiliate_id', 'booker_country', 'hotel_country']
# df["nunique_booker_countries"] = df.groupby("utrip_id")["booker_country"].nunique()
# df["nunique_hotel_country"] = df.groupby("utrip_id")["hotel_country"].nunique()
for c in nunique_cols:
    df[f"{c}_nunique"] = df.groupby(["utrip_id"])[c].transform("nunique")
print(df.nunique())

########################################################
## get frequency/count feature's rank within a group - e.g. within a country (or affiliate) 
## add "_count" to column name to get count col name, then add rank col 

### ALT/ duplicate feat - add percent rank (instead or in addition)

rank_cols = ['city_id','affiliate_id', 'booker_country','hotel_country',
 "checkin_month"]
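### what does ranking a variable's count within groups of that same variable mean? The count is constant within each group, so every row gets the same tied (average) rank, (n+1)/2 - no information beyond the count itself.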
for c in rank_cols:
    df[f"{c}_rank_by_hotel_country"] = df.groupby(['hotel_country'])[f"{c}_count"].transform("rank")
    df[f"{c}_rank_by_booker_country"] = df.groupby(['booker_country'])[f"{c}_count"].transform("rank")
    df[f"{c}_rank_by_affiliate"] = df.groupby(['affiliate_id'])[f"{c}_count"].transform("rank")
    
df

user_id                   200153
checkin                      425
checkout                     425
city_id                    39901
device_class                   3
affiliate_id                2152
booker_country                 5
hotel_country                195
utrip_id                  217686
duration                      30
same_country                   2
checkin_day                   31
checkin_weekday                7
checkin_week                  53
checkin_month                 12
checkin_year                   3
checkin_quarter                4
checkout_weekday               7
checkout_week                 53
checkout_day                  31
checkin_weekday_sin            7
checkin_weekday_cos            7
checkin_month_sin             12
checkin_month_cos             12
utrip_number                  48
utrip_steps_from_end         520
row_num                       48
total_rows                    41
last                           2
first_hotel_country          174
first_city

Unnamed: 0,user_id,checkin,checkout,city_id,device_class,affiliate_id,booker_country,hotel_country,utrip_id,duration,same_country,checkin_day,checkin_weekday,checkin_week,checkin_month,checkin_year,checkin_quarter,checkout_weekday,checkout_week,checkout_day,checkin_weekday_sin,checkin_weekday_cos,checkin_month_sin,checkin_month_cos,utrip_number,utrip_steps_from_end,row_num,total_rows,last,first_hotel_country,first_city_id,city_id_count,affiliate_id_count,booker_country_count,hotel_country_count,checkin_month_count,checkin_week_count,city_id_nunique,affiliate_id_nunique,booker_country_nunique,hotel_country_nunique,city_id_rank_by_hotel_country,city_id_rank_by_booker_country,city_id_rank_by_affiliate,affiliate_id_rank_by_hotel_country,affiliate_id_rank_by_booker_country,affiliate_id_rank_by_affiliate,booker_country_rank_by_hotel_country,booker_country_rank_by_booker_country,booker_country_rank_by_affiliate,hotel_country_rank_by_hotel_country,hotel_country_rank_by_booker_country,hotel_country_rank_by_affiliate,checkin_month_rank_by_hotel_country,checkin_month_rank_by_booker_country,checkin_month_rank_by_affiliate
1004862,29,2016-07-09,2016-07-11,47054,0,1601,1,46,29_1,2,1,9,5,27,7,0,3,0,28,11,-0.974928,-0.222521,1.224647e-16,-1.000000e+00,1,0.25,1,4,0,46,47054,116,4909,235344,53965,168571,27883,4,2,1,1,20639.5,91688.5,2090.5,11661.5,47847.0,2455.0,18247.5,117672.5,2435.0,26983.0,140712.5,2722.5,37602.5,180147.5,3490.5
1004863,29,2016-07-11,2016-07-13,34444,0,1601,1,46,29_1,2,1,11,0,28,7,0,3,2,28,13,0.000000,1.000000,1.224647e-16,-1.000000e+00,1,0.50,2,4,0,46,47054,32,4909,235344,53965,168571,37483,4,2,1,1,14593.5,54366.0,1178.5,11661.5,47847.0,2455.0,18247.5,117672.5,2435.0,26983.0,140712.5,2722.5,37602.5,180147.5,3490.5
1004864,29,2016-07-13,2016-07-16,12291,0,1601,1,46,29_1,3,1,13,2,28,7,0,3,5,28,16,0.974928,-0.222521,1.224647e-16,-1.000000e+00,1,0.75,3,4,0,46,47054,34,4909,235344,53965,168571,37483,4,2,1,1,14854.5,56101.5,1217.0,11661.5,47847.0,2455.0,18247.5,117672.5,2435.0,26983.0,140712.5,2722.5,37602.5,180147.5,3490.5
1004865,29,2016-07-16,2016-07-18,16386,0,8132,1,46,29_1,2,1,16,5,28,7,0,3,0,29,18,-0.974928,-0.222521,1.224647e-16,-1.000000e+00,1,1.00,4,4,1,46,47054,3,22254,235344,53965,168571,37483,4,2,1,1,2852.0,9734.0,1171.0,19554.0,81809.5,11127.5,18247.5,117672.5,11043.5,26983.0,140712.5,12679.0,37602.5,180147.5,16667.0
897811,81,2016-05-15,2016-05-16,33665,0,9924,1,46,81_1,1,1,15,6,19,5,0,2,0,20,16,-0.781831,0.623490,8.660254e-01,-5.000000e-01,1,0.25,1,4,0,46,33665,21,277775,235344,53965,93022,21120,4,1,1,2,12518.5,44344.5,34314.5,46470.5,195618.0,138888.0,18247.5,117672.5,65835.0,26983.0,140712.5,170141.5,18839.0,96953.0,115042.5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1072993,6258065,2016-04-19,2016-04-20,55044,1,9452,2,133,6258065_1,1,0,19,1,16,4,0,2,2,16,20,0.781831,0.623490,1.000000e+00,6.123234e-17,1,1.00,4,4,1,133,59444,749,85476,536036,6492,70851,18198,4,2,1,1,4853.0,341710.5,49333.5,3044.5,295975.0,42738.5,4895.5,268018.5,69968.0,3246.5,60429.5,12209.5,2585.5,119670.0,25132.0
420479,6258087,2016-08-03,2016-08-04,17754,0,2436,2,60,6258087_1,1,1,3,2,31,8,0,3,3,31,4,0.974928,-0.222521,-5.000000e-01,-8.660254e-01,1,0.25,1,4,0,60,17754,2,19356,536036,104979,228144,60231,4,1,1,1,3008.5,9080.0,332.0,31190.0,152430.0,9678.5,66192.5,268018.5,9713.0,52490.0,375198.5,14166.0,93671.5,470934.5,16997.0
420480,6258087,2016-08-04,2016-08-05,50073,0,2436,2,60,6258087_1,1,1,4,3,31,8,0,3,4,31,5,0.433884,-0.900969,-5.000000e-01,-8.660254e-01,1,0.50,2,4,0,60,17754,65,19356,536036,104979,228144,60231,4,1,1,1,40003.5,135534.0,5107.5,31190.0,152430.0,9678.5,66192.5,268018.5,9713.0,52490.0,375198.5,14166.0,93671.5,470934.5,16997.0
420481,6258087,2016-08-05,2016-08-06,11662,0,2436,2,60,6258087_1,1,1,5,4,31,8,0,3,5,31,6,-0.433884,-0.900969,-5.000000e-01,-8.660254e-01,1,0.75,3,4,0,60,17754,18,19356,536036,104979,228144,60231,4,1,1,1,21561.5,67413.0,2608.0,31190.0,152430.0,9678.5,66192.5,268018.5,9713.0,52490.0,375198.5,14166.0,93671.5,470934.5,16997.0
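
The leak concern noted in the cell above (counts and nunique computed over the whole data, including future rows) could be avoided with cumulative, "history only" versions of the same features. A minimal sketch on toy data; the `toy` frame and the `*_prior_*` column names are assumptions for illustration and are not part of the notebook:

```python
import pandas as pd

# Toy stand-in for the reservations frame; column names follow the notebook.
toy = pd.DataFrame({
    "utrip_id": ["a", "a", "a", "b", "b", "b"],
    "checkin": pd.to_datetime(["2016-01-01", "2016-01-03", "2016-01-05",
                               "2016-01-02", "2016-01-04", "2016-01-06"]),
    "city_id": [10, 11, 10, 10, 12, 11],
    "hotel_country": [1, 1, 1, 1, 2, 1],
})

# Sort by time first, so that "earlier rows" really means "earlier bookings".
toy = toy.sort_values(["checkin", "utrip_id"]).reset_index(drop=True)

# Number of earlier bookings of the same city (the current row is excluded).
toy["city_id_prior_count"] = toy.groupby("city_id").cumcount()


def prior_nunique(s: pd.Series) -> pd.Series:
    """Distinct values seen strictly before each row of the group."""
    first_seen = ~s.duplicated()  # True the first time a value appears
    return first_seen.cumsum().shift(1, fill_value=0)


# Number of distinct hotel countries visited in the trip before this row.
toy["hotel_country_prior_nunique"] = (
    toy.groupby("utrip_id")["hotel_country"].transform(prior_nunique)
)
print(toy)
```

Because each row only sees rows that precede it, the feature for the final reservation of a trip never includes the target itself.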


In [17]:
df.loc[df["city_id_count"]>=15]["city_id"].nunique()

7074

In [18]:
df["utrip_number"].value_counts().describe()

count        48.000000
mean      24309.062500
std      121954.031725
min           4.000000
25%          11.750000
50%          59.500000
75%         620.750000
max      822764.000000
Name: utrip_number, dtype: float64

In [19]:
## counts of each val
# df.groupby(['hotel_country']).size() # same thing as value counts only without ordering by values
df['hotel_country'].value_counts()

36     137791
52     117717
60     104979
59      74840
46      53965
        ...  
115         1
120         1
188         1
137         1
27          1
Name: hotel_country, Length: 195, dtype: int64

In [20]:
assert df.isna().sum().max() ==0

In [21]:
df[[ 'checkin', 'checkout','booker_country', 'hotel_country', 'duration']].describe(include="all",datetime_is_numeric=True)

Unnamed: 0,checkin,checkout,booker_country,hotel_country,duration
count,1166835,1166835,1166835.0,1166835.0,1166835.0
mean,2016-08-01 17:19:06.304207360,2016-08-03 11:03:30.708662784,2.308905,66.22731,1.739171
min,2015-12-31 00:00:00,2016-01-01 00:00:00,0.0,0.0,1.0
25%,2016-06-07 00:00:00,2016-06-09 00:00:00,2.0,36.0,1.0
50%,2016-08-07 00:00:00,2016-08-08 00:00:00,2.0,55.0,1.0
75%,2016-09-25 00:00:00,2016-09-27 00:00:00,3.0,80.0,2.0
max,2017-02-27 00:00:00,2017-02-28 00:00:00,4.0,194.0,30.0
std,,,1.120163,44.44839,1.197427


In [22]:
# df2 = df[["user_id","city_id"]].drop_duplicates().copy()
df2 = df.drop_duplicates(subset=["user_id","city_id"])["city_id"].copy()
print(df2.shape[0])
print("df2 nunique (cities without duplicate user visits)",df2.nunique())

# c2_counts = df2["city_id"].value_counts()
c2_counts = df2.value_counts()
# df2["new_counts"] = df2["city_id"].map(c2_counts)
# df2["new_counts"] = df2.map(c2_counts)
print("city counts")
print(c2_counts)
print(c2_counts.describe())
print("cities with at least 3:",(c2_counts>=3).sum())
print("cities with at least 7:",(c2_counts>=7).sum())
print("cities with at least 15:",(c2_counts>=15).sum())
print("cities with at least 30:",(c2_counts>=30).sum())
print("cities with at least 100:",(c2_counts>=100).sum())
print("cities with at least 300:",(c2_counts>=300).sum())

c2_freq = df2.value_counts(normalize=True)
print("top 4 sum coverage (normalized): ",c2_freq[0:4].sum().round(3))
print("top 50 sum coverage (normalized): ",c2_freq[0:50].sum().round(3))
print("top 100 sum coverage (normalized): ",c2_freq[0:100].sum().round(3))
print("top 400 sum coverage (normalized): ",c2_freq[0:400].sum().round(3))
print("top 1,000 sum coverage (normalized): ",c2_freq[0:1000].sum().round(3))
print("top 5,000 sum coverage (normalized): ",c2_freq[0:5000].sum().round(3))
print("top 8,000 sum coverage (normalized): ",c2_freq[0:8000].sum().round(3))

1029804
df2 nunique (cities without duplicate user visits) 39901
city counts
23921    8137
55128    7197
47499    7188
64876    6724
29319    6361
         ... 
50916       1
57063       1
46826       1
38638       1
2049        1
Name: city_id, Length: 39901, dtype: int64
count    39901.000000
mean        25.808977
std        173.750203
min          1.000000
25%          1.000000
50%          3.000000
75%          9.000000
max       8137.000000
Name: city_id, dtype: float64
cities with at least 3: 21337
cities with at least 7: 12012
cities with at least 15: 6836
cities with at least 30: 4006
cities with at least 100: 1550
cities with at least 300: 602
top 4 sum coverage (normalized):  0.028
top 50 sum coverage (normalized):  0.183
top 100 sum coverage (normalized):  0.263
top 400 sum coverage (normalized):  0.481
top 1,000 sum coverage (normalized):  0.64
top 5,000 sum coverage (normalized):  0.858
top 8,000 sum coverage (normalized):  0.905


In [23]:
c2_counts

23921    8137
55128    7197
47499    7188
64876    6724
29319    6361
         ... 
50916       1
57063       1
46826       1
38638       1
2049        1
Name: city_id, Length: 39901, dtype: int64

In [24]:
### Per the contest description, each trip (utrip_id) should have at least 4 reservations; here we check the number of reservations per user
df["user_id"].value_counts().describe()#.hist()

count    200153.000000
mean          5.829715
std           3.021691
min           1.000000
25%           4.000000
50%           5.000000
75%           6.000000
max         172.000000
Name: user_id, dtype: float64

## Frequent city target List + City count encoding
* Get the K most frequent target city IDs - selected based on frequency as final destination (not just overall)
* +- Also after this, replace rare city IDs categorical features with count encoding to reduce dimensionality
    * Keep them as count, or aggregate all of them as "under_K"?

##### Output  : `TOP_TARGETS` - filter data by this *after* creation of lag features ! 

* Drop duplicates by the same user (reduces possible bias from frequent users; only relevant if the test set is separate from "frequent travellers") 
    * results in 216,633 , vs 217,686 without dropping duplicates by users
    * ~19.9k unique cities
    
* Could do other encodings - https://contrib.scikit-learn.org/category_encoders/count.html

* Note that all this is after we've added rank, count features beforehand, so that information won't be lost for these variables, despite these transforms



* **NOTE** The most frequent final destinations are NOT the same as the most popular overall destinations +- first location! 
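
A small sketch of how the difference between "popular overall" and "popular as a final destination" can be checked. The toy data and the `is_last` column are assumptions for illustration (the notebook itself uses `utrip_steps_from_end == 1` to mark the last row of a trip):

```python
import pandas as pd

# Toy trips; rows are assumed to be in trip order.
toy = pd.DataFrame({
    "utrip_id": ["t1"] * 4 + ["t2"] * 4 + ["t3"] * 4,
    "city_id": [1, 2, 3, 9, 1, 2, 4, 9, 1, 3, 2, 8],
})

# Mark the last reservation of each trip.
toy["is_last"] = ~toy.duplicated("utrip_id", keep="last")

k = 2
top_overall = toy["city_id"].value_counts().head(k).index
top_final = toy.loc[toy["is_last"], "city_id"].value_counts().head(k).index

# How many of the top-k final destinations are also top-k overall?
overlap = len(set(top_overall) & set(top_final))
print(sorted(top_overall), sorted(top_final), overlap)
```

In this toy example the two top-k lists do not overlap at all, which is the extreme version of the effect described in the note.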

In [25]:
if KEEP_TOP_K_TARGETS > 0 :
    df_end = df.loc[df["utrip_steps_from_end"]==1].drop_duplicates(subset=["city_id","hotel_country","user_id"])[["city_id","hotel_country"]].copy()
    print(df_end.shape[0])
    end_city_counts = df_end.city_id.value_counts()
    print(end_city_counts)
    
    TOP_TARGETS = end_city_counts.head(KEEP_TOP_K_TARGETS).index.values
    print(f"top {KEEP_TOP_K_TARGETS} targets \n",TOP_TARGETS)
    
#     assert df.loc[df["city_id"].isin(TOP_TARGETS)]["city_id"].nunique() == KEEP_TOP_K_TARGETS

####        
# replace low-frequency categorical features

# ## keep categories that appear at least LOW_COUNT_THRESH times; group only the rare ones as "-1":
# df[BASE_CAT_COLS] = df[BASE_CAT_COLS].where(df[BASE_CAT_COLS].apply(lambda x: x.map(x.value_counts()))>=LOW_COUNT_THRESH, -1)
# df[BASE_CAT_COLS].head()

216633
47499    3752
17013    3022
36063    2857
29319    2644
2416     2304
         ... 
42577       1
20044       1
26185       1
7750        1
22197       1
Name: city_id, Length: 19899, dtype: int64
top 8000 targets 
 [47499 17013 36063 ... 27921 51803 41662]
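
The commented-out rare-category grouping in the cell above can be checked on a toy frame. This is a sketch under assumptions: the threshold value of 2 and the toy columns are made up; only the `where(... >= LOW_COUNT_THRESH, -1)` pattern comes from the notebook.

```python
import pandas as pd

LOW_COUNT_THRESH = 2  # hypothetical value, for illustration only

cats = pd.DataFrame({
    "affiliate_id": [5, 5, 7, 8, 5, 7],
    "city_id": [1, 2, 2, 2, 3, 1],
})

# Per-column frequency of each value, aligned row by row with the data.
counts = cats.apply(lambda col: col.map(col.value_counts()))

# Keep frequent values as they are; collapse rare ones into a single "-1" bucket.
grouped = cats.where(counts >= LOW_COUNT_THRESH, -1)
print(grouped)
```

Keeping `counts` as separate columns alongside `grouped` would preserve the frequency information that the bucketing throws away.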


##### Long tail of targets warning!
* 75% of cities appear 4 times or fewer in the data (as a final destination!) 
    * Dropping them removes most candidate classes, but they account for a small share of trips: the top 7,000 final-destination cities still cover ~92% of trips. The accuracy ceiling is set by trip coverage, not by the share of cities kept.
    * Training on intermediate steps may help overcome this. 
* Using steps from roughly the trip midpoint onward still leaves 75% of cities appearing 7 times or fewer

* Top 4,000 cities (just for those as final trip destination) offer 89% coverage

* Unsure how to handle this - too many targets to learn, and no auxiliary data to help learn it? 
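
The relation between "share of cities kept" and "share of trips covered" can be made explicit with a cumulative coverage curve. A minimal sketch on made-up data; `final_cities`, `coverage_at` and `rare_city_share` are hypothetical names, not part of the notebook:

```python
import pandas as pd

# Hypothetical final-destination city per trip (one entry per trip).
final_cities = pd.Series([1] * 50 + [2] * 30 + [3] * 10 + [4] * 5 + [5, 6, 7, 8, 9])

counts = final_cities.value_counts()       # most frequent city first
coverage = counts.cumsum() / counts.sum()  # share of trips covered by the top-k cities


def coverage_at(k: int) -> float:
    """Share of trips whose final city is among the k most frequent ones."""
    return float(coverage.iloc[k - 1])


# Share of *cities* that appear only once as a final destination.
rare_city_share = float((counts <= 1).mean())

print(coverage_at(2), coverage_at(4), round(rare_city_share, 3))
```

Here more than half of the cities are singletons, yet the top 4 cities cover 95% of trips. Coverage at K is also an upper bound on the accuracy of any model restricted to those K targets.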

In [26]:
df_end.city_id.nunique()

19899

In [27]:
df_end.city_id.value_counts().describe()


count    19899.000000
mean        10.886627
std         77.690197
min          1.000000
25%          1.000000
50%          2.000000
75%          4.000000
max       3752.000000
Name: city_id, dtype: float64

In [28]:
# df_end.city_id.value_counts(normalize=True)[0:4000].sum().round(3)# .89  (note, this is just for the end count cities, not all cities overall)

df_end.city_id.value_counts(normalize=True)[0:7000].sum().round(3) # ~92% coverage

0.923

In [29]:
## check distribution from roughly the midpoint of trips onwards (utrip_steps_from_end >= 0.4)
df.loc[df["utrip_steps_from_end"]>=0.4].drop_duplicates(subset=["city_id","hotel_country","user_id"])["city_id"].value_counts().describe()


count    36200.000000
mean        21.362956
std        137.556216
min          1.000000
25%          1.000000
50%          3.000000
75%          7.000000
max       6307.000000
Name: city_id, dtype: float64

In [30]:
# df["c"] = df["city_id"].map(df["city_id"].value_counts())
# df[BASE_CAT_COLS+ ["c"]]

* Continue with EDA 

In [31]:
df["utrip_id"].value_counts().describe()

count    217686.000000
mean          5.360175
std           2.014324
min           1.000000
25%           4.000000
50%           5.000000
75%           6.000000
max          48.000000
Name: utrip_id, dtype: float64

* If country is known, then we need to rank within a given country.  How many cities/points per country? :

(Note - Later, need to consider features about multi-country trips) 


* EDA on city popularity by country
* Drop countries with few cities to simplify
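
If the country is known, a simple within-country popularity ranking gives a baseline of 4 suggestions per country. A minimal sketch on toy data, assuming the column names from the notebook; `city_counts` and `top4` are hypothetical names:

```python
import pandas as pd

toy = pd.DataFrame({
    "hotel_country": [1] * 7 + [2] * 3,
    "city_id": [10, 10, 10, 11, 11, 12, 13, 20, 20, 21],
})

# Bookings per (country, city), most popular first within each country.
# city_id is used as a tie-breaker so that the order is deterministic.
city_counts = (
    toy.groupby(["hotel_country", "city_id"]).size()
    .rename("n").reset_index()
    .sort_values(["hotel_country", "n", "city_id"], ascending=[True, False, True])
)

# Up to 4 most popular cities per country.
top4 = city_counts.groupby("hotel_country")["city_id"].apply(lambda s: s.head(4).tolist())
print(top4)
```

Countries with fewer than 4 cities simply return a shorter list, which is one reason such countries may need separate handling.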

In [32]:
df_locations = df[["hotel_country","city_id"]].drop_duplicates()
print(df_locations.shape[0])
print(df_locations.nunique())
print("After filtering countries with 4 or less unique hotels/cities:")
df_locations = df_locations.loc[df_locations.groupby(["hotel_country"])["city_id"].transform("nunique")>4]
print(df_locations.shape[0])
print(df_locations.nunique())

print(df.groupby(["hotel_country"])["city_id"].nunique().describe())

39901
hotel_country      195
city_id          39901
dtype: int64
After filtering countries with 4 or less unique hotels/cities:
39786
hotel_country      134
city_id          39786
dtype: int64
count     195.000000
mean      204.620513
std       696.592775
min         1.000000
25%         3.000000
50%        15.000000
75%        89.500000
max      6440.000000
Name: city_id, dtype: float64


In [33]:
### unsure about this filtering - depends if data points are real or mistake

print("dropping users with less than 4 trips")
# df2 = df.loc[df["utrip_id_count"]>=4]
df2 = df.loc[df["total_rows"]>=4].copy()
print(df.shape[0]-df2.shape[0])

# print("dropping countries+Data with less than 4 unique cities in them:")
# df2 = df2.loc[df2.groupby(["hotel_country"])["city_id"].transform("nunique")>=4]
# print(df2.shape[0])

print(f"dropping cities  with less than {MIN_TARGET_FREQ} occurences:")
df2 = df2.loc[df2.groupby(["city_id"])["hotel_country"].transform("count")>=MIN_TARGET_FREQ]
print(df2.shape[0])
df2 = df2.loc[df2.groupby(["hotel_country"])["city_id"].transform("count")>=MIN_TARGET_FREQ]
print(df2.shape[0])

# print("dropping countries+Data with less than 4 unique cities in them: (afer prev filter)")
# df2 = df2.loc[df2.groupby(["hotel_country"])["city_id"].transform("nunique")>=4]
# print(df2.shape[0])

print("nunique cities after freq filt",df2["city_id"].nunique())
print("nunique city_id per hotel_country:")
df2.groupby(["hotel_country"])["city_id"].nunique().describe()

dropping users with less than 4 trips
2240
dropping cities  with less than 30 occurences:
991656
991656
nunique cities after freq filt 4175
nunique city_id per hotel_country:


count    118.000000
mean      35.381356
std       79.357346
min        1.000000
25%        2.000000
50%        8.000000
75%       30.750000
max      507.000000
Name: city_id, dtype: float64

In [34]:
df2[["hotel_country","city_id","affiliate_id","user_id"]].nunique()

hotel_country       118
city_id            4175
affiliate_id       2137
user_id          197046
dtype: int64

In [35]:
# LAG_FEAT_COLS = ['city_id', 'device_class',
#        'affiliate_id', 'booker_country', 'hotel_country', 
#        'duration', 'same_country', 'checkin_day', 'checkin_weekday',
#        'checkin_week',
#         'checkout_weekday','checkout_week',
#        'city_id_count', 'affiliate_id_count',
#        'booker_country_count', 'hotel_country_count', 
#        'checkin_month_count', 'checkin_week_count', 'city_id_nunique',
#        'affiliate_id_nunique', 'booker_country_nunique',
#        'hotel_country_nunique', 'city_id_rank_by_hotel_country',
#        'city_id_rank_by_booker_country', 'city_id_rank_by_affiliate',
#        'affiliate_id_rank_by_hotel_country',
#        'affiliate_id_rank_by_booker_country', 'affiliate_id_rank_by_affiliate',
#        'booker_country_rank_by_hotel_country',
#        'booker_country_rank_by_booker_country',
#        'booker_country_rank_by_affiliate',
#        'hotel_country_rank_by_hotel_country',
#        'hotel_country_rank_by_booker_country',
#        'hotel_country_rank_by_affiliate',
#        'checkin_month_rank_by_hotel_country',
#        'checkin_month_rank_by_booker_country',
#        'checkin_month_rank_by_affiliate']

In [36]:
# ### lag features - last n visits
# groupbyLagFeatures(df=df2.head(20), # .set_index(["checkin","checkout","user_id"])
#                    lag=[1,2],group="utrip_id",lag_feature_cols=LAG_FEAT_COLS)

In [37]:
# df2.loc[~df2["utrip_steps_from_end"].between(0.26,0.98)].sort_values("utrip_id")
## # df2["utrip_steps_from_end"].min() ## min is greater than 0

#### get a DF of all cities per country
* +- get from original DF, +- remove cities that appear less than 4? times , and countries with less than 4 hotels? (Or keep - to avoid messing up training?)
* Weighted Sample from it, for negatives, +- most freq by country/affiliate/etc
* Don't drop duplicates by user, keep orig freq? 
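
A sketch of the weighted negative sampling idea above: candidates come from the same country as the positive city and are drawn in proportion to their frequency. The `cities` table, the `sample_negatives` helper and its arguments are assumptions for illustration, not the notebook's implementation:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# One row per city, with its country and overall booking count (toy values).
cities = pd.DataFrame({
    "hotel_country": [1, 1, 1, 1, 2, 2],
    "city_id": [10, 11, 12, 13, 20, 21],
    "city_id_count": [100, 50, 10, 5, 30, 3],
})


def sample_negatives(country: int, positive_city: int, n: int) -> list:
    """Draw up to n distinct negative cities from a country, weighted by frequency."""
    pool = cities[(cities["hotel_country"] == country)
                  & (cities["city_id"] != positive_city)]
    n = min(n, len(pool))  # small countries may not have n candidates
    p = (pool["city_id_count"] / pool["city_id_count"].sum()).to_numpy()
    return rng.choice(pool["city_id"].to_numpy(), size=n, replace=False, p=p).tolist()


negs = sample_negatives(country=1, positive_city=10, n=2)
print(negs)
```

Frequency weighting yields "hard" negatives (popular cities the user did not choose); uniform sampling would be the alternative if rare cities should be seen more often.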

In [38]:
df_cities = df[["city_id","hotel_country","city_id_count"]] ## +- drop duplicates by tripid? 
print(df_cities.nunique())
df_cities = df_cities.loc[df_cities.groupby("hotel_country")["city_id"].transform("nunique")>4]
df_cities = df_cities.loc[df_cities["city_id_count"]>=10].sort_values("city_id_count",ascending=False)
print(df_cities.nunique())
print(df_cities.shape[0])


# ### 5 most frequent overall
# df_city_samples = df_cities.drop_duplicates().sort_values("city_id_count",ascending=False).groupby("city_id").head(5) 
# df_city_samples

city_id          39901
hotel_country      195
city_id_count      792
dtype: int64
city_id          9507
hotel_country     121
city_id_count     782
dtype: int64
1075000


### add lag features + Train/test/data split
* Lag feats (remember for categorical)
* Drop leak features (target values - country, city)

* Drop instances that lack history (e.g. keep only the 3rd step and onwards) - by dropna on the lag features
* fill nans
* Split train/test by `user id` / split could maybe be by `utrip ID` ? ? 
    * Test - only last trip
    *  stratified train/test split by class - then drop any train rows with overlap with tests' IDs.  
        * Could also stratify by users, but risks some classes being non present in test
        
###### Big possible improvement to lag features: Have "first location" (starting point) "lag" feature
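
A minimal sketch of the lag and "first location" features described above, using plain `groupby(...).shift`. It is only a stand-in for the notebook's `groupbyLagFeatures` helper (whose definition is not shown in this section); the toy frame is an assumption:

```python
import pandas as pd

# Toy trips; rows are assumed to be sorted by check-in within each trip.
toy = pd.DataFrame({
    "utrip_id": ["a", "a", "a", "a", "b", "b", "b"],
    "city_id": [1, 2, 3, 4, 7, 8, 9],
})

# lagN_city_id = city visited N steps earlier in the same trip (NaN if no history).
for lag in (1, 2):
    toy[f"lag{lag}_city_id"] = toy.groupby("utrip_id")["city_id"].shift(lag)

# "First location" of the trip, repeated on every row of that trip.
toy["first_city_id"] = toy.groupby("utrip_id")["city_id"].transform("first")

# Keep only rows with at least two previous steps (3rd step and onwards).
with_history = toy.dropna(subset=["lag2_city_id"])
print(with_history)
```

Note that the lag columns become floats because of the NaNs; they would need casting back to integers (after `dropna`) before being treated as categorical.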

In [39]:
### features to drop - not usable, or leaks (e.g. aggregations on target)

TARGET_COL = 'city_id'
DROP_FEATS = ['user_id',
              'checkin', 'checkout',
              'hotel_country', 'same_country',
              'utrip_id',
#               'utrip_steps_from_end',
              'city_id_count', 'hotel_country_count',
              'city_id_nunique', 'hotel_country_nunique',
              'city_id_rank_by_hotel_country', 'city_id_rank_by_booker_country', 'city_id_rank_by_affiliate',
              'affiliate_id_rank_by_hotel_country', 'affiliate_id_rank_by_booker_country', 'affiliate_id_rank_by_affiliate',
              'hotel_country_rank_by_hotel_country',
              'hotel_country_rank_by_booker_country', 'hotel_country_rank_by_affiliate',
              'booker_country_rank_by_hotel_country', 'booker_country_rank_by_booker_country',
              'checkin_month_rank_by_hotel_country',
             ]

# df2.drop(DROP_FEATS,axis=1).columns

In [40]:
print(df2.shape)
# ### lag features - last n visits
df_feat = groupbyLagFeatures(df=df2.copy(), 
                   lag=[1,2],group="utrip_id",lag_feature_cols=LAG_FEAT_COLS)
df_feat = df_feat.dropna(subset=["lag2_city_id"]).sample(frac=1)


### filter for only trip targets that are among the K most popular :


df_feat = df_feat.drop(DROP_FEATS,axis=1,errors="ignore")
print(df_feat.shape)

# df_feat.sort_values(["user_id","utrip_steps_from_end"])

(991656, 56)
(570588, 100)


In [41]:
### filter for most frequent targets

if KEEP_TOP_K_TARGETS > 0 :
    print(df_feat.shape[0])
    df_feat = df_feat.loc[df_feat["city_id"].isin(TOP_TARGETS)]
    print(df_feat.shape[0])    
    c = df_feat["city_id"].nunique()
    print(f"{c} unique targets left")
    assert  c<= KEEP_TOP_K_TARGETS

570588
566251
3978 unique targets left


In [42]:
########################
## stratified train/test split by class - then drop any train rows that overlap with test IDs.  Could also stratify by users, but risks some classes being absent from test
### split could maybe be by utrip ID ? 
### orig - split by group : 

# train_inds, test_inds = next(GroupShuffleSplit(test_size=.2, n_splits=2, random_state = 7).split(df_feat, groups=df_feat['user_id']))
# X_train = df_feat.iloc[train_inds].drop(DROP_FEATS,axis=1,errors="ignore")
# X_test = df_feat.iloc[test_inds].drop(DROP_FEATS,axis=1,errors="ignore")
# assert (set(X_train[TARGET_COL].unique()) == set(X_test[TARGET_COL].unique()))
#################
## alt: split by class. May be leaky! 
X_train, X_test = train_test_split(df_feat,stratify=df_feat[TARGET_COL])

########################
print("X_train",X_train.shape[0])
## get last row in trip only in test/eval set: 
print("X_test",X_test.shape[0])
X_test = X_test.loc[X_test["utrip_steps_from_end"]==1] # last row per trip
print("X_test after filtering for last instance per trip",X_test.shape[0])

y_train = X_train.pop(TARGET_COL)
y_test = X_test.pop(TARGET_COL)

print("# classes",y_train.nunique())

# ## check that same classes in train and test - 
# assert (set(y_train.unique()) == set(y_test.unique()))

X_train 424688
X_test 141563
X_test after filtering for last instance per trip 44467
# classes 3978
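
The stratified split above can place rows of the same trip in both train and test. A sketch of a trip-level split with `GroupShuffleSplit` (the approach in the commented-out lines), on toy data. It assumes the grouping column (`utrip_id` here) is still available at split time, which in the notebook would mean dropping it only after the split:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy data: 10 trips with 4 steps each.
toy = pd.DataFrame({
    "utrip_id": np.repeat([f"t{i}" for i in range(10)], 4),
    "utrip_steps_from_end": np.tile([0.25, 0.5, 0.75, 1.0], 10),
    "city_id": np.arange(40) % 5,
})

# All rows of a trip go to the same side of the split.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=7)
train_idx, test_idx = next(splitter.split(toy, groups=toy["utrip_id"]))
train, test = toy.iloc[train_idx], toy.iloc[test_idx]

# Evaluate only on the last reservation of each held-out trip.
test = test[test["utrip_steps_from_end"] == 1.0]
print(len(train), len(test))
```

The trade-off mentioned in the comments still applies: a group split does not guarantee that every class seen in test also appears in train.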


## Model
* For now - simple multiclass model (Tabnet? LSTM?) ; +- subsample - only most frequent classes/cities

    * Tabnet: `pip install pytorch-tabnet`
        * https://github.com/dreamquark-ai/tabnet/blob/develop/forest_example.ipynb
    * TensorFlow TabNet: https://github.com/ostamand/tensorflow-tabnet/blob/master/examples/train_mnist.py

* split train/test by user id. 
    * Test - only last trip. 
    
* Try multiclass models

* Try tabnet models (tabular with attention)
    * + Lag feats
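    * Note - the embeddings here are learned per column, so the same city ID appearing in different columns (e.g. `lag1_city_id` and `lag2_city_id`) is not treated as the same entity (unlike TF's)! 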
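
Whatever multiclass model is used, the challenge metric (P@4: the true city is among the top 4 suggestions) can be computed from predicted class probabilities. A minimal NumPy sketch; `precision_at_k` and the toy arrays are hypothetical, and `classes` stands for the label order of the probability columns:

```python
import numpy as np


def precision_at_k(proba: np.ndarray, y_true: np.ndarray,
                   classes: np.ndarray, k: int = 4) -> float:
    """Share of rows whose true label is among the k highest-probability classes."""
    top_k_idx = np.argsort(-proba, axis=1)[:, :k]   # column indices of the top-k scores
    top_k_labels = classes[top_k_idx]               # map column indices back to labels
    hits = (top_k_labels == y_true[:, None]).any(axis=1)
    return float(hits.mean())


classes = np.array([10, 20, 30, 40, 50])
proba = np.array([
    [0.50, 0.20, 0.15, 0.10, 0.05],
    [0.05, 0.10, 0.15, 0.20, 0.50],
    [0.30, 0.30, 0.20, 0.15, 0.05],
])
y_true = np.array([40, 10, 20])

score = precision_at_k(proba, y_true, classes)
print(score)
```

Order inside the top 4 does not matter, matching the evaluation criteria of the challenge.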

In [43]:
from pytorch_tabnet.tab_model import TabNetClassifier
from pytorch_tabnet.pretraining import TabNetPretrainer
import torch
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import roc_auc_score

import pandas as pd
import numpy as np
np.random.seed(0)

* cat_idxs : list of int (default=[] - Mandatory for embeddings)
    * List of categorical features indices.

* cat_dims : list of int (default=[] - Mandatory for embeddings)

    * List of categorical features number of modalities (number of unique values for a categorical feature) /!\ no new modalities can be predicted

* cat_emb_dim : list of int (optional)

    * List of embeddings size for each categorical features. (default =1)
    
    
    
All categorical values must be known from train (the demo used a label encoder). Consider doing the same here at a late step, to avoid unknown values?
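
A sketch of one way to build the `cat_idxs` and `cat_dims` lists described above while guarding against values unseen in train. It uses only pandas; reserving code 0 for "unknown" is an assumption of this sketch, not something the notebook or the TabNet demo does, and the toy frames are made up:

```python
import pandas as pd

train = pd.DataFrame({
    "affiliate_id": [359, 9924, 359, 8436],
    "duration": [1, 3, 2, 1],
    "lag1_city_id": [55763, 12846, 55763, 64071],
})
test = pd.DataFrame({
    "affiliate_id": [9924, 1111],      # 1111 never appears in train
    "duration": [2, 1],
    "lag1_city_id": [64071, 99999],    # 99999 never appears in train
})
cat_cols = ["affiliate_id", "lag1_city_id"]

cat_dims = []
for col in cat_cols:
    categories = pd.Index(train[col].unique())  # vocabulary learned from train only
    # get_indexer returns -1 for unseen values; adding 1 reserves code 0 for "unknown".
    train[col] = categories.get_indexer(train[col]) + 1
    test[col] = categories.get_indexer(test[col]) + 1
    cat_dims.append(len(categories) + 1)        # +1 for the "unknown" code

# Positions of the categorical columns in the feature matrix.
cat_idxs = [train.columns.get_loc(col) for col in cat_cols]
print(cat_idxs, cat_dims)
```

With this encoding every test value maps to a valid code below the column's `cat_dims` entry, so no new modality reaches the embedding layer.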

In [44]:
X_train

Unnamed: 0,device_class,affiliate_id,booker_country,duration,checkin_day,checkin_weekday,checkin_week,checkin_month,checkin_year,checkin_quarter,checkout_weekday,checkout_week,checkout_day,checkin_weekday_sin,checkin_weekday_cos,checkin_month_sin,checkin_month_cos,utrip_number,utrip_steps_from_end,row_num,total_rows,last,first_hotel_country,first_city_id,affiliate_id_count,booker_country_count,checkin_month_count,checkin_week_count,affiliate_id_nunique,booker_country_nunique,booker_country_rank_by_affiliate,checkin_month_rank_by_booker_country,checkin_month_rank_by_affiliate,lag1_city_id,lag1_device_class,lag1_affiliate_id,lag1_booker_country,lag1_hotel_country,lag1_duration,lag1_same_country,lag1_checkin_weekday,lag1_checkin_week,lag1_checkout_weekday,lag1_city_id_count,lag1_affiliate_id_count,...,lag1_city_id_rank_by_booker_country,lag1_city_id_rank_by_affiliate,lag1_affiliate_id_rank_by_hotel_country,lag1_affiliate_id_rank_by_booker_country,lag1_booker_country_rank_by_hotel_country,lag1_booker_country_rank_by_booker_country,lag1_booker_country_rank_by_affiliate,lag1_hotel_country_rank_by_booker_country,lag1_hotel_country_rank_by_affiliate,lag1_checkin_month_rank_by_hotel_country,lag1_checkin_month_rank_by_booker_country,lag1_checkin_month_rank_by_affiliate,lag2_city_id,lag2_device_class,lag2_affiliate_id,lag2_booker_country,lag2_hotel_country,lag2_duration,lag2_same_country,lag2_checkin_weekday,lag2_checkin_week,lag2_checkout_weekday,lag2_city_id_count,lag2_affiliate_id_count,lag2_booker_country_count,lag2_hotel_country_count,lag2_checkin_month_count,lag2_checkin_week_count,lag2_city_id_nunique,lag2_affiliate_id_nunique,lag2_booker_country_nunique,lag2_hotel_country_nunique,lag2_city_id_rank_by_hotel_country,lag2_city_id_rank_by_booker_country,lag2_city_id_rank_by_affiliate,lag2_affiliate_id_rank_by_hotel_country,lag2_affiliate_id_rank_by_booker_country,lag2_booker_country_rank_by_hotel_country,lag2_booker_country_rank_by_booker_country,lag2_booker_country_rank_by_affiliate,lag2_hotel_country_rank_by_booker_country,lag2_hotel_country_rank_by_affiliate,lag2_checkin_month_rank_by_hotel_country,lag2_checkin_month_rank_by_booker_country,lag2_checkin_month_rank_by_affiliate
504145,1,359,4,1,7,2,36,9,0,3,3,36,8,0.974928,-0.222521,-8.660254e-01,-5.000000e-01,2,1.000000,4,4,1,23,55763,171385,286244,142213,34012,1,1,89246.0,202077.0,113866.5,55763.0,1.0,359.0,4.0,23.0,1.0,0.0,1.0,36.0,2.0,5317.0,171385.0,...,236941.5,144219.5,26807.0,168462.0,19935.0,143122.5,89246.0,137130.0,85043.0,21090.5,202077.0,113866.5,55763.0,1.0,359.0,4.0,23.0,2.0,0.0,6.0,35.0,1.0,5317.0,171385.0,286244.0,40599.0,142213.0,32100.0,1.0,1.0,1.0,1.0,37941.0,236941.5,144219.5,26807.0,168462.0,19935.0,143122.5,89246.0,137130.0,85043.0,21090.5,202077.0,113866.5
259598,0,9924,1,3,23,6,42,10,0,4,2,43,26,-0.781831,0.623490,-1.000000e+00,-1.836970e-16,2,0.750000,9,12,0,145,66264,277775,235344,106666,18647,3,1,65835.0,119539.5,139051.0,12846.0,0.0,9924.0,1.0,145.0,1.0,0.0,5.0,42.0,6.0,61.0,277775.0,...,72667.5,59119.0,2498.5,195618.0,477.0,117672.5,65835.0,12958.0,14134.5,2393.5,119539.5,139051.0,7519.0,0.0,9924.0,1.0,145.0,1.0,0.0,4.0,42.0,5.0,53.0,277775.0,235344.0,2855.0,106666.0,18647.0,11.0,3.0,1.0,2.0,607.0,68166.0,55198.0,2498.5,195618.0,477.0,117672.5,65835.0,12958.0,14134.5,2393.5,119539.5,139051.0
659221,1,9452,4,1,15,0,33,8,0,3,1,33,16,0.000000,1.000000,-5.000000e-01,-8.660254e-01,2,0.857143,6,7,0,126,65015,85476,286244,228144,52567,1,1,45557.5,269937.0,77896.0,51999.0,1.0,9452.0,4.0,126.0,1.0,0.0,6.0,32.0,0.0,1203.0,85476.0,...,142975.0,57922.0,12071.5,115652.5,13155.0,143122.5,45557.5,110093.0,34968.5,25644.0,269937.0,77896.0,8766.0,1.0,9452.0,4.0,126.0,2.0,0.0,4.0,32.0,6.0,2755.0,85476.0,286244.0,26088.0,228144.0,63551.0,7.0,1.0,1.0,1.0,24711.0,198445.0,70287.5,12071.5,115652.5,13155.0,143122.5,45557.5,110093.0,34968.5,25644.0,269937.0,77896.0
263284,0,8373,1,6,9,1,32,8,0,3,0,33,15,0.781831,0.623490,-5.000000e-01,-8.660254e-01,1,1.000000,4,4,1,72,62611,1775,235344,228144,63551,2,1,881.5,215350.5,1624.5,12903.0,0.0,8373.0,1.0,79.0,4.0,0.0,4.0,31.0,1.0,260.0,1775.0,...,118930.5,1007.5,4788.0,29163.5,6663.0,117672.5,881.5,121935.0,906.0,42415.0,215350.5,1624.5,62611.0,0.0,3497.0,1.0,72.0,4.0,0.0,6.0,30.0,3.0,1798.0,327.0,235344.0,5399.0,168571.0,49219.0,4.0,2.0,1.0,2.0,4500.5,194819.5,289.0,288.0,13070.0,747.0,117672.5,162.5,24761.0,30.0,4613.0,180147.5,250.5
854805,0,8436,2,1,14,6,32,8,0,3,0,33,15,-0.781831,0.623490,-5.000000e-01,-8.660254e-01,1,0.300000,3,10,0,36,38677,31275,536036,228144,63551,1,1,15671.5,470934.5,27070.5,64071.0,0.0,8436.0,2.0,36.0,1.0,0.0,5.0,32.0,6.0,97.0,31275.0,...,160766.5,8728.5,51433.5,223435.5,110917.0,268018.5,15671.5,509162.0,29377.5,124492.5,470934.5,27070.5,38677.0,0.0,8436.0,2.0,36.0,1.0,0.0,4.0,32.0,5.0,2247.0,31275.0,536036.0,137791.0,228144.0,63551.0,9.0,1.0,1.0,2.0,97879.0,440110.0,25472.0,51433.5,223435.5,110917.0,268018.5,15671.5,509162.0,29377.5,124492.5,470934.5,27070.5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
348897,0,2436,2,1,25,6,38,9,0,3,0,39,26,-0.781831,0.623490,-8.660254e-01,-5.000000e-01,1,0.750000,3,4,0,36,65322,19356,536036,142213,31218,1,1,9713.0,292753.0,10512.5,29648.0,0.0,2436.0,2.0,36.0,2.0,0.0,4.0,38.0,6.0,380.0,19356.0,...,273048.5,10173.0,35295.0,152430.0,110917.0,268018.5,9713.0,509162.0,18648.0,80610.5,292753.0,10512.5,65322.0,0.0,2436.0,2.0,36.0,2.0,0.0,2.0,38.0,4.0,3096.0,19356.0,536036.0,137791.0,142213.0,31218.0,4.0,1.0,1.0,1.0,108208.5,465271.5,17014.0,35295.0,152430.0,110917.0,268018.5,9713.0,509162.0,18648.0,80610.5,292753.0,10512.5
440915,0,4541,2,1,30,3,26,6,0,2,4,26,1,0.433884,-0.900969,5.000000e-01,-8.660254e-01,1,0.500000,4,8,0,52,382,41504,536036,91255,22394,4,1,20791.0,158503.5,11105.0,20079.0,0.0,4541.0,2.0,52.0,1.0,0.0,2.0,26.0,3.0,55.0,41504.0,...,125658.0,10554.0,57407.5,259753.0,83566.5,268018.5,20791.0,448136.5,34810.5,26508.5,158503.5,11105.0,20079.0,0.0,5755.0,2.0,52.0,1.0,0.0,1.0,26.0,2.0,55.0,17383.0,536036.0,117717.0,91255.0,22394.0,6.0,4.0,1.0,1.0,25022.0,125658.0,3492.0,33649.0,134109.5,83566.5,268018.5,8706.5,448136.5,14598.0,26508.5,158503.5,5055.0
855601,2,2803,3,3,14,3,28,7,0,3,6,28,17,0.433884,-0.900969,1.224647e-16,-1.000000e+00,2,0.666667,4,6,0,42,2416,1933,80573,168571,37483,3,1,966.5,53496.5,1241.0,2416.0,1.0,359.0,3.0,42.0,1.0,0.0,2.0,28.0,3.0,6641.0,171385.0,...,73317.0,155456.0,11395.0,55975.5,5013.5,40287.0,12575.5,35924.5,42090.0,11636.5,53496.5,134129.0,47193.0,1.0,359.0,3.0,42.0,4.0,0.0,5.0,27.0,2.0,1551.0,171385.0,80573.0,16245.0,168571.0,27883.0,2.0,3.0,1.0,1.0,8829.0,45128.0,107489.0,11395.0,55975.5,5013.5,40287.0,12575.5,35924.5,42090.0,11636.5,53496.5,134129.0
23863,0,384,2,1,31,1,22,5,0,2,2,22,1,0.781831,0.623490,8.660254e-01,-5.000000e-01,1,0.333333,3,9,0,9,63291,88137,536036,93022,19543,3,1,44146.5,199981.5,31678.0,39290.0,0.0,384.0,2.0,9.0,1.0,0.0,0.0,22.0,1.0,46.0,88137.0,...,114070.0,19815.5,18023.5,355474.5,17719.0,268018.5,44146.5,180801.0,28320.5,7031.0,199981.5,31678.0,63291.0,0.0,384.0,2.0,9.0,2.0,0.0,5.0,21.0,0.0,180.0,88137.0,536036.0,28719.0,93022.0,20823.0,8.0,3.0,1.0,2.0,12667.5,206375.0,35380.0,18023.5,355474.5,17719.0,268018.5,44146.5,180801.0,28320.5,7031.0,199981.5,31678.0


In [45]:
CAT_FEAT_NAMES = [
    "booker_country", "device_class", "affiliate_id",
    # "user_id",  ## ? could use lower dim - depends on train/test overlap
    "checkin_week",  # "checkout_week",
    # "checkin_weekday",
    "lag1_city_id", "lag1_booker_country", "lag1_hotel_country", "lag1_affiliate_id", "lag1_device_class",
    "lag2_city_id", "lag2_booker_country", "lag2_hotel_country", "lag2_affiliate_id", "lag2_device_class",
    # "lag3_city_id", "lag3_booker_country", "lag3_hotel_country", "lag3_affiliate_id", "lag3_device_class",
    "first_hotel_country", "first_city_id",
]

In [46]:
# Every non-target column that is not listed as categorical is treated as numeric.
NUMERIC_COLS = [c for c in df_feat.columns.drop(TARGET_COL) if c not in CAT_FEAT_NAMES]
print(len(NUMERIC_COLS))
print("numeric cols", NUMERIC_COLS)

# Note: each scaler is fit on df_feat (not only on X_train) and then applied to both splits.
for c in NUMERIC_COLS:
    scaler = StandardScaler()  # alternative: MinMaxScaler()
    scaler.fit(df_feat[c].values.reshape(-1, 1))
    X_train[c] = scaler.transform(X_train[c].values.reshape(-1, 1))
    X_test[c] = scaler.transform(X_test[c].values.reshape(-1, 1))

83
numeric cols ['duration', 'checkin_day', 'checkin_weekday', 'checkin_month', 'checkin_year', 'checkin_quarter', 'checkout_weekday', 'checkout_week', 'checkout_day', 'checkin_weekday_sin', 'checkin_weekday_cos', 'checkin_month_sin', 'checkin_month_cos', 'utrip_number', 'utrip_steps_from_end', 'row_num', 'total_rows', 'last', 'affiliate_id_count', 'booker_country_count', 'checkin_month_count', 'checkin_week_count', 'affiliate_id_nunique', 'booker_country_nunique', 'booker_country_rank_by_affiliate', 'checkin_month_rank_by_booker_country', 'checkin_month_rank_by_affiliate', 'lag1_duration', 'lag1_same_country', 'lag1_checkin_weekday', 'lag1_checkin_week', 'lag1_checkout_weekday', 'lag1_city_id_count', 'lag1_affiliate_id_count', 'lag1_booker_country_count', 'lag1_hotel_country_count', 'lag1_checkin_month_count', 'lag1_checkin_week_count', 'lag1_city_id_nunique', 'lag1_affiliate_id_nunique', 'lag1_booker_country_nunique', 'lag1_hotel_country_nunique', 'lag1_city_id_rank_by_hotel_country'

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  

In [47]:
for c in CAT_FEAT_NAMES:
    l_enc = LabelEncoder().fit(df_feat[c])
    X_train[c] = l_enc.transform(X_train[c])
    X_test[c] = l_enc.transform(X_test[c])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until
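
The `SettingWithCopyWarning` above appears because `X_train` and `X_test` are slices of another DataFrame. A minimal sketch of one way to avoid it, on toy data: take explicit copies of each split before assigning. The frame `df_toy` and its values are made up for illustration. Fitting the scaler on the training split only is a swapped-in assumption; the notebook fits on `df_feat`.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the feature frame; the values are illustrative only.
df_toy = pd.DataFrame({
    "duration": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "lag1_duration": [2.0, 2.0, 4.0, 4.0, 1.0, 3.0],
})
numeric_cols = ["duration", "lag1_duration"]

# .copy() makes each split an independent frame, so the assignments below
# write to real data instead of a possible view of df_toy.
X_train = df_toy.iloc[:4].copy()
X_test = df_toy.iloc[4:].copy()

# Assumption: statistics come from the training split only.
scaler = StandardScaler().fit(X_train[numeric_cols])
X_train[numeric_cols] = scaler.transform(X_train[numeric_cols])
X_test[numeric_cols] = scaler.transform(X_test[numeric_cols])

print(X_train)
print(X_test)
```

The same `.copy()` step would also cover the `LabelEncoder` loop, since it assigns into the same two frames.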


In [48]:
# X_train.columns.get_loc(CAT_FEAT_NAMES)
cat_idxs = [X_train.columns.get_loc(c) for c in CAT_FEAT_NAMES if c in X_train]
assert len(cat_idxs) == len(CAT_FEAT_NAMES)
print(cat_idxs)

[2, 0, 1, 6, 33, 36, 37, 35, 34, 66, 69, 70, 68, 67, 22, 23]


In [49]:
#### Get the number of unique values and set the embedding dimension per categorical feature.
### Note: change the cap (100) below to allow a higher embedding dimension.

nunique = X_train.nunique()
types = X_train.dtypes

categorical_dims = []
cat_embed_dims = []
for i, col in enumerate(cat_idxs):
    # Count categories on the full feature frame rather than on X_train,
    # since X_train alone may contain fewer distinct values.
    c_uniques = df_feat[CAT_FEAT_NAMES[i]].nunique()

    categorical_dims.append(c_uniques)
    # if CAT_FEAT_NAMES[i] == "user_id": cat_embed_dims.append(10)  ## user_id may overfit
    cat_embed_dims.append(min(100, c_uniques // 2))

In [50]:
print(categorical_dims)
print(cat_embed_dims)

[5, 3, 2023, 53, 4175, 5, 118, 2038, 3, 4175, 5, 118, 2045, 3, 142, 13375]
[2, 1, 100, 26, 100, 2, 59, 100, 1, 100, 2, 59, 100, 1, 71, 100]
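
The sizing rule used above is half the cardinality, capped at 100. A small sketch of it as a helper follows. The function name `embedding_dims` and the lower bound of 1 are assumptions added here; the bound only matters for a feature with a single category, where `n // 2` would otherwise be 0.

```python
def embedding_dims(cardinalities, max_dim=100, min_dim=1):
    """Half the number of categories, clamped to [min_dim, max_dim]."""
    return [max(min_dim, min(max_dim, n // 2)) for n in cardinalities]


# First four cardinalities from the printed output, plus a degenerate case.
dims = embedding_dims([5, 3, 2023, 53, 1])
print(dims)  # [2, 1, 100, 26, 1]
```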


In [51]:
assert X_test.isna().sum().max() == X_train.isna().sum().max() == 0

In [52]:
print("sum top4 total percentage:",y_train.value_counts(normalize=True)[0:4].sum().round(3))
y_train.value_counts(normalize=True).round(5)

sum top4 total percentage: 0.042


23921    0.01258
47499    0.01049
29319    0.00929
36063    0.00915
17013    0.00882
          ...   
11905    0.00001
35189    0.00001
31004    0.00001
58136    0.00001
57556    0.00000
Name: city_id, Length: 3978, dtype: float64
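
The 0.042 figure above is the share of training rows covered by the four most frequent cities, so it doubles as a popularity baseline for P@4: always recommending those four cities would be correct on about 4.2% of those rows. A minimal sketch of that baseline on toy labels (the helper name and the data are illustrative, not part of the notebook):

```python
from collections import Counter


def popularity_hit_rate(train_labels, eval_labels, k=4):
    """Hit rate when the k most frequent training labels are always recommended."""
    top_k = {label for label, _ in Counter(train_labels).most_common(k)}
    hits = sum(1 for label in eval_labels if label in top_k)
    return hits / len(eval_labels)


# Toy labels with distinct counts, so the top-2 set is unambiguous: {1, 2}.
train_labels = [1, 1, 1, 1, 2, 2, 2, 3, 3, 4]
eval_labels = [1, 2, 3, 5]
rate = popularity_hit_rate(train_labels, eval_labels, k=2)
print(rate)  # 0.5
```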

In [53]:
if RUN_TABNET:
    # TabNetPretrainer
    unsupervised_model = TabNetPretrainer(
        n_d=32, n_a=32, n_steps=4,
        cat_idxs=cat_idxs,
        cat_dims=categorical_dims,
        cat_emb_dim=cat_embed_dims,
        optimizer_fn=torch.optim.Adam,
        optimizer_params=dict(lr=2e-2),
        mask_type='entmax',  # "sparsemax"
        device_name="auto",  # "auto" "cpu"
    )

    unsupervised_model.fit(
        X_train=X_train.values,
        # eval_set=[X_test.values],
        pretraining_ratio=0.35,
        max_epochs=6,
        batch_size=512,  # 1024 default, ~256-512 with GPU
    )

    ## save unsup model
    ### https://github.com/dreamquark-ai/tabnet/blob/develop/pretraining_example.ipynb
    # unsupervised_model.save_model('./.4_pretrain')


Device used : cuda
No early stopping will be performed, last training weights will be used.
epoch 0  | loss: 21003.6181|  0:02:27s
epoch 1  | loss: 33.5385 |  0:04:47s
epoch 2  | loss: 19.60213|  0:07:14s
epoch 3  | loss: 12.58031|  0:09:42s
epoch 4  | loss: 14.02596|  0:12:11s
epoch 5  | loss: 8.18307 |  0:14:36s


In [54]:
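from __future__ import print_function, absolute_import

from pytorch_tabnet.metrics import Metric
# from sklearn.metrics import top_k_accuracy_score


__all__ = ['accuracy_k']

def accuracy_k(output, target, topk=(4,)):  # (1,)
    """Computes the top-k accuracy (in percent) for the specified values of k.

    `output` is a (batch, n_classes) score tensor; `target` must hold class
    indices (column positions of `output`), not raw city_id values.
    """
    maxk = max(topk)
    batch_size = target.size(0)

    # Indices of the maxk highest-scoring classes per row; shape (maxk, batch) after the transpose.
    _, pred = output.topk(maxk, 1, True, True)
    pred = pred.t()
    correct = pred.eq(target.view(1, -1).expand_as(pred))

    res = []
    for k in topk:
        # reshape instead of view: the sliced tensor may be non-contiguous.
        correct_k = correct[:k].reshape(-1).float().sum(0)
        res.append(correct_k.mul_(100.0 / batch_size))
    return res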

In [55]:
if RUN_TABNET:
    clf = TabNetClassifier(
        n_d=32, n_a=32, n_steps=4,
        cat_idxs=cat_idxs,
        cat_dims=categorical_dims,
        cat_emb_dim=cat_embed_dims,
        optimizer_fn=torch.optim.Adam,
        optimizer_params=dict(lr=2e-2),
        scheduler_params={"step_size": 50,  # how to use learning rate scheduler
                          "gamma": 0.9},
        scheduler_fn=torch.optim.lr_scheduler.StepLR,
        # mask_type='entmax',  # "sparsemax"
        device_name="auto",  # "auto" "cpu"
    )

    clf.fit(
        X_train=X_train.values, y_train=y_train.values,
        eval_set=[(X_train.values, y_train.values), (X_test.values, y_test.values)],
        # eval_set=[(X_test.values, y_test.values)],
        eval_name=['train', 'test'],
        eval_metric=['accuracy'],
        max_epochs=max_epochs,
        batch_size=512,  # 1024 default, ~256-512 with GPU
        from_unsupervised=unsupervised_model,
    )

#     clf.save_model('./.full_tabnet_1192class')

Device used : cuda
Loading weights from unsupervised pretraining
epoch 0  | loss: 5.89005 | train_accuracy: 0.11422 | test_accuracy: 0.15742 |  0:03:46s
epoch 1  | loss: 4.29659 | train_accuracy: 0.18209 | test_accuracy: 0.2489  |  0:12:03s
epoch 2  | loss: 3.71171 | train_accuracy: 0.21955 | test_accuracy: 0.27256 |  0:21:44s
epoch 3  | loss: 3.46431 | train_accuracy: 0.21491 | test_accuracy: 0.26129 |  0:25:57s
epoch 4  | loss: 3.31889 | train_accuracy: 0.25918 | test_accuracy: 0.30333 |  0:30:45s
epoch 5  | loss: 3.21108 | train_accuracy: 0.26603 | test_accuracy: 0.30355 |  0:34:19s
epoch 6  | loss: 3.12931 | train_accuracy: 0.27805 | test_accuracy: 0.3049  |  0:37:48s
epoch 7  | loss: 3.0604  | train_accuracy: 0.29201 | test_accuracy: 0.31419 |  0:41:17s
epoch 8  | loss: 3.00258 | train_accuracy: 0.30065 | test_accuracy: 0.30958 |  0:44:54s
epoch 9  | loss: 2.94773 | train_accuracy: 0.30614 | test_accuracy: 0.31605 |  0:48:27s
epoch 10 | loss: 2.89801 | train_accuracy: 0.32248 | te
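
A note on the scheduler settings used above: `StepLR` multiplies the learning rate by `gamma` once every `step_size` scheduler steps. Assuming the scheduler is stepped once per epoch (an assumption about how the training loop calls it), `step_size=50` means the rate stays at its initial value for the first 50 epochs. A small sketch of the closed-form schedule, with a hypothetical helper name:

```python
def step_lr(initial_lr, gamma, step_size, epoch):
    """Closed form of a StepLR schedule: decay by gamma every step_size epochs."""
    return initial_lr * gamma ** (epoch // step_size)


# Settings from the classifier cell: lr=2e-2, step_size=50, gamma=0.9.
lrs = {epoch: step_lr(2e-2, 0.9, 50, epoch) for epoch in (0, 10, 49, 50, 100)}
for epoch, lr in lrs.items():
    print(epoch, lr)
```

Under that assumption, a run of a few dozen epochs never sees a decayed rate, so a smaller `step_size` may be worth trying if decay is intended.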

In [56]:
# clf2 = TabNetClassifier(    
#     n_d=16, n_a=16, n_steps=4,
#     cat_idxs=cat_idxs,
#    cat_dims=categorical_dims,
#    cat_emb_dim=cat_embed_dims,   
#    optimizer_fn=torch.optim.Adam,
#    optimizer_params=dict(lr=2e-2),
#    scheduler_params={"step_size":50, # how to use learning rate scheduler
#                      "gamma":0.9},
#    scheduler_fn=torch.optim.lr_scheduler.StepLR,
#    mask_type='entmax', # "sparsemax"
#     device_name=  "auto"
# )

# clf2.fit(
#     X_train=X_train.values, y_train=y_train.values,
#     eval_set=[(X_train.values, y_train.values), (X_test.values, y_test.values)],
#     eval_name=['train','test'],
#     eval_metric=['accuracy'],
#      max_epochs=3, 
#     batch_size = 256 ,# 1024 default
#     from_unsupervised= clf #unsupervised_model,    
# )

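#### Feature importance & evaluation
* Look for leaks!
* The evaluation below does not make sense, probably because of a bug in the ordering of the results. The number of outputs/classes differs (the model outputs 3978 classes, while `y_test` contains only 3565 distinct cities), which is the likely culprit.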

In [57]:
# clf.feature_importances_

In [58]:
## top features with non-zero importance, sorted descending - booker_country ranks first
# X_train.columns[clf.feature_importances_>1e-7]
feat_imp = pd.DataFrame([X_train.columns,clf.feature_importances_]).T
feat_imp = feat_imp.loc[feat_imp[1]>0].sort_values(1,ascending=False).reset_index(drop=True)
feat_imp

Unnamed: 0,0,1
0,booker_country,0.277956
1,lag1_device_class,0.269756
2,lag1_city_id,0.116913
3,lag1_booker_country,0.10527
4,lag2_booker_country,0.101122
5,lag2_device_class,0.0943857
6,lag1_checkin_week,0.0345925
7,lag1_city_id_count,4.58621e-06
8,lag2_hotel_country,8.92654e-09


In [59]:
print(feat_imp[0].values)
# ['lag2_city_id' 'lag1_booker_country' 'lag2_booker_country'
#  'lag2_hotel_country' 'lag1_city_id' 'first_hotel_country'
#  'lag1_device_class' 'device_class' 'lag2_device_class' 'booker_country'
#  'lag1_hotel_country_rank_by_booker_country' 'checkin_week'
#  'lag1_city_id_count' 'checkin_quarter'
#  'lag2_booker_country_rank_by_booker_country' 'lag1_affiliate_id_count'
#  'lag1_checkin_month_rank_by_affiliate'
#  'lag2_checkin_month_rank_by_hotel_country' 'checkin_month_sin'
#  'lag2_hotel_country_count' 'lag2_checkin_month_count'
#  'lag2_affiliate_id_rank_by_booker_country' 'lag2_city_id_nunique']

['booker_country' 'lag1_device_class' 'lag1_city_id' 'lag1_booker_country'
 'lag2_booker_country' 'lag2_device_class' 'lag1_checkin_week'
 'lag1_city_id_count' 'lag2_hotel_country']


In [60]:
print("y_test nunique classes",y_test.nunique())
y_test

y_test nunique classes 3565


741084     45773
1141237    62185
987134     65856
746583     29319
638863     55763
           ...  
521342     12426
561691     37874
924119     59581
334876       382
54880       4932
Name: city_id, Length: 44467, dtype: int64

In [61]:
y_preds_test_proba = clf.predict_proba(X_test.values)
y_preds_test = clf.predict(X_test.values)
# y_preds_test[0:2]

In [62]:
y_preds_test_proba.shape

(44467, 3978)

In [63]:
y_preds_test.shape

(44467,)

In [64]:
y_test.values.shape

(44467,)

In [65]:
m = tf.keras.metrics.SparseTopKCategoricalAccuracy(k=4)
m.update_state(y_true=y_test.values, y_pred=y_preds_test_proba)    
print(m.result().numpy())

2.2488586e-05


In [66]:
# m = InTopK(k=4)
# m.update_state(y_true=y_test.values, y_pred=y_preds_test)
# m.result().numpy()
# m.reset_states()

# top_k_accuracy_score(y_true=y_test.values, y_score=y_preds_test_proba)

In [67]:
# m = tf.keras.metrics.SparseCategoricalAccuracy()
# m.update_state(y_true=y_test.values, y_pred=y_preds_test) ## .reshape(-1,1) / (1,-1) ?  
# m.result().numpy()

In [68]:
# ### likely error with ordering of classes (and test has less classes than train)
# y_test_ohe = to_categorical(y_test.values)
# m = Precision(top_k=4)
# m.update_state(y_true=y_test_ohe, y_pred=y_preds_test_proba)
# m.result().numpy()

### Simpler baseline: Linear model + OHE
* Multinomial logistic regression model + one-hot encoding.
    * With or without count encoding (to reduce the number of dimensions).