### Introduction

Expedia Kaggle Competition: the goal is to use historical web session data (e.g. pageviews and booking data) to predict which hotel cluster a user booked. The hotel cluster is a numeric categorical attribute whose meaning is known only to the team at Expedia.

Given the nature of the problem, a recommender system approach is used. Because the files are too large to process locally, the data is sampled down to a smaller subset for training, cross-validation and testing. To get the full results, training and prediction are run on Amazon Web Services.

The recommender system aggregates user data by various tuple keys (e.g. (user_origin, sales_channel, session_month, destination_city) is considered a single key) and places all information that matches a key into a list. It then counts the elements of this list and keeps the k most frequent.
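
The key-based aggregation described above can be illustrated with a small sketch. It is not the code used for the competition: the function names `build_key_counters` and `top_k_for_key` are hypothetical, the rows are toy data, and a `Counter` per key stands in for the "list, then count" step.

```python
from collections import Counter, defaultdict

def build_key_counters(rows, key_fields, target_field='hotel_cluster'):
    """Group target values by a tuple key and count them per key."""
    counters = defaultdict(Counter)
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        counters[key][row[target_field]] += 1
    return counters

def top_k_for_key(counters, key, k=5):
    """Return the k most frequent target values seen for this key."""
    if key not in counters:
        return []  # unseen key: nothing to recommend
    return [value for value, _ in counters[key].most_common(k)]

# Toy data; the field names mirror the example key and are illustrative only.
key_fields = ('user_origin', 'sales_channel', 'session_month', 'destination_city')
fields = key_fields + ('hotel_cluster',)
raw = [
    (1, 9, 6, 820, 91),
    (1, 9, 6, 820, 91),
    (1, 9, 6, 820, 41),
    (1, 9, 6, 820, 91),
    (1, 9, 6, 820, 41),
    (1, 9, 6, 820, 5),
    (2, 0, 6, 820, 7),
]
rows = [dict(zip(fields, values)) for values in raw]

counters = build_key_counters(rows, key_fields)
top_two = top_k_for_key(counters, (1, 9, 6, 820), k=2)
```

Several such key definitions can be built side by side, with a coarser key used when a finer one has no match.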

Work on this project was stopped when a data leakage problem was discovered and announced in the Kaggle forum.

In [2]:
%matplotlib inline

import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import random
from collections import defaultdict, Counter
from math import log
import entropy
import csv

import pdb

In [2]:
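def read_csv_by_sampling(fname, population_size, sample_size):
    """ Read a random sample of rows from a csv file into a Pandas DataFrame.
        Population size (number of data rows, excluding the header) can be
        counted from the command line with wc -l %filename
    """
    # Row 0 is the header, so only data rows 1..population_size may be skipped
    skip = sorted(random.sample(xrange(1, population_size + 1),
                                population_size - sample_size))
    return pd.read_csv(fname, skiprows=skip, header=0)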

In [3]:
def nan_analysis(_df):
    """ shows nan values in dataframe """
    
    # columns with missing values
    columns_w_missing_values = _df.columns[_df.isnull().any()]
    total = float(len(_df))
    for col in columns_w_missing_values:
        num_nans = _df[col].isnull().sum()
        print '# of nan values in column {0}: {1}, {2}'.format(col, 
                                                          num_nans,
                                                          num_nans / total)

In [4]:
def pct_top_k(labels, k, verbose=False):
    """ returns fraction (0 to 1) of the series covered by its top k elements """
    
    labels_freq = Counter(labels)
    total_freq = float(sum(labels_freq.values()))
    
    cum_k = 0
    # most_common(k) yields the k highest counts in descending order
    for label, freq in labels_freq.most_common(k):
        cum_k += freq / total_freq
        if verbose:
            print '{0}: marginal: {1}, cum: {2}'.format(label, freq / total_freq, cum_k)
                
    return cum_k

In [5]:
def tail_end_analysis(labels, tail_threshold, verbose=False):
    """ outputs metrics on tail end (low-frequency counts) of series """

    labels_freq = Counter(labels)
    total_freq = float(sum(labels_freq.values()))
    total_uniques = len(labels_freq)
    
    cum_total = 0
    tail_counter = 0    
    
    for n, (label, freq) in enumerate(sorted(labels_freq.items(), key=lambda x: x[1], reverse=True)):
        if cum_total > tail_threshold:
            tail_counter += 1
        cum_total += freq / total_freq  
    print 'There are {0} unique labels'.format(len(labels_freq))
    print 'There are {0} elements beyond the {1}-threshold'.format(tail_counter, tail_threshold)


### Get the User IDs 

In [None]:
user_id_training =  pd.read_csv('train.csv', usecols=['user_id'], header=0) 
user_id_testing =  pd.read_csv('test.csv', usecols=['user_id'], header=0) 

In [None]:
print 'Training user ids: {}'.format(len(user_id_training['user_id']))
print 'Testing user ids: {}'.format(len(user_id_testing['user_id']))
print 'Training unique user ids: {}'.format(len(set(user_id_training['user_id'].unique())))
print 'Testing unique user ids: {}'.format(len(set(user_id_testing['user_id'].unique())))
print 'Common user ids: {}'.format(
    len(set(user_id_training['user_id'].unique()) & set(user_id_testing['user_id'].unique())))

### Sampling

In [None]:
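population_size = 2500000
# sample_size is not defined in this notebook and must be set before this call
df_test = read_csv_by_sampling('test.csv', population_size, sample_size)
# the sampled frame is overwritten here by the first 1,000,000 rows of test.csv
df_test = pd.read_csv('test.csv', nrows=1000000, header=0)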

In [None]:
labels = df_train['hotel_cluster'].values
labels_freq = Counter(labels)

In [None]:
cum = pct_top_k(df_train['hotel_cluster'], 3, verbose=True)
tail_end_analysis(df_train['hotel_cluster'], 0.9, verbose=True)
entropy.shannon_entropy(df_train.site_name.values.tobytes())

In [None]:
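columns = df_train.columns
step = 5
for start in xrange(0, len(columns), step):
    # slice width matches the step so that no column is skipped;
    # DataFrame.hist opens its own figure, so no plt.figure() is needed
    _df = df_train[columns[start:start + step]]
    _df.hist(alpha=0.5)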

### Feature Analysis

In [None]:
categorical_features = ['site_name','posa_continent','user_location_country','user_location_region']
df_train[['site_name','posa_continent']] = df_train[['site_name','posa_continent']].applymap(str)
pd.get_dummies(df_train[['site_name','posa_continent']])

In [None]:
def percentile_counter(counter_object, percentile=50):
    """ returns the given percentile of the counts stored in a Counter """
    counts = np.array(list(counter_object.values()))
    return np.percentile(counts, percentile)

print percentile_counter(Counter(df_train.user_id))
print percentile_counter(Counter(df_train.user_id[df_train.is_booking==0]))
print percentile_counter(Counter(df_train.user_id[df_train.is_booking==1]))

In [None]:
def plot_counter(counter_object, title):
    """ bar plot of the counts stored in a Counter """
    labels, values = zip(*counter_object.items())
    indexes = np.arange(len(labels))
    width = 1
    plt.bar(indexes, values, width)
    plt.title(title, fontsize=14)
    plt.show()

def plot_counter_wrapper(generator, title):
    id_counts_distribution = Counter(generator)
    plot_counter(id_counts_distribution, title)

In [None]:
yield_counts_per_userid = (x[1] for x in Counter(df_train.user_id).items())
plot_counter_wrapper(yield_counts_per_userid, '# of recorded events per user')

yield_counts_per_userid = (x[1] for x in Counter(df_train.user_id[df_train.is_booking==0]).items())
plot_counter_wrapper(yield_counts_per_userid, '# of non-bookings per user')

yield_counts_per_userid = (x[1] for x in Counter(df_train.user_id[df_train.is_booking==1]).items())
plot_counter_wrapper(yield_counts_per_userid, '# of bookings per user')

### Model designs

1. Time-invariant hotel cluster approach
    - Method ia. Count the number of times a hotel cluster is selected (bookings + non-bookings) by user
    - Method ib. Count the number of times a hotel cluster is selected (bookings only) by user
    - Method ii. For each user, build a decision tree
2. Keep track of the past k events
    - For each booking event, use the previous k non-booking events to predict y
    - Add a recent-time threshold for an event to be considered, e.g. events 1, 2 and 3 must all be within the past 6 months
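
The second design above (predicting from the past k events) could be sketched as follows. This is only an illustration under stated assumptions: the function name `previous_events` is hypothetical, each event is assumed to be a `(timestamp, is_booking, hotel_cluster)` tuple for a single user, and timestamps are assumed to be plain numbers (days here), so 180 approximates the 6-month threshold.

```python
def previous_events(events, k, max_age):
    """For each booking, collect up to k preceding non-booking events.

    events:  list of (timestamp, is_booking, hotel_cluster) tuples for ONE
             user; this layout is an assumption made for the sketch.
    max_age: non-booking events older than this, relative to the booking,
             are ignored.
    Returns a list of (history_clusters, booked_cluster) pairs.
    """
    examples = []
    history = []  # non-booking events seen so far, oldest first
    for ts, is_booking, cluster in sorted(events):
        if is_booking:
            # keep only recent events, then the last k of those
            recent = [c for t, c in history if ts - t <= max_age][-k:]
            examples.append((recent, cluster))
        else:
            history.append((ts, cluster))
    return examples

# Toy event stream for one user (timestamps in days).
events = [
    (1, 0, 10),
    (5, 0, 20),
    (200, 0, 30),
    (210, 0, 40),
    (215, 1, 40),
    (400, 1, 7),
]
examples = previous_events(events, k=3, max_age=180)
```

Each `(history_clusters, booked_cluster)` pair can then serve as one training example, with the history as features and the booked cluster as y.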


### 1. Basic: Most Frequent hotel cluster of user

In [None]:
most_common_cluster_per_user = df_train[['user_id', 'hotel_cluster']].groupby(by=['user_id']).agg(
    lambda x: [cluster for cluster, _ in Counter(x).most_common(5)])
most_common_cluster_per_user.head()

In [None]:
df_train['hotel_cluster'].values

### Output

In [None]:
most_freq_hotel_clusters = [cluster for cluster, _ in Counter(df_train.hotel_cluster).most_common(5)]

In [None]:
_count = 0
with open('output.csv', 'w') as f:
    csv_writer = csv.writer(f)
    csv_writer.writerow(['id','hotel_cluster'])
    for nid in df_test['id']:
        _count += 1
        output_row = [nid]
        output_row.append(' '.join(map(str, most_freq_hotel_clusters)))
        csv_writer.writerow(output_row)
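
The output cell writes the same five globally most frequent clusters for every test row. One possible refinement, which this notebook does not implement, is to start from a user's own most common clusters and fill the remaining slots from the global list. The helper name `pad_with_fallback` and the cluster values are hypothetical.

```python
def pad_with_fallback(user_clusters, global_clusters, k=5):
    """Return up to k clusters: the user's own first, then global ones
    that are not already in the list."""
    result = list(user_clusters[:k])
    for cluster in global_clusters:
        if len(result) >= k:
            break
        if cluster not in result:
            result.append(cluster)
    return result

# Illustrative values only, not results from the data.
global_top = [91, 41, 48, 64, 65]
known_user = pad_with_fallback([42, 91], global_top)
unknown_user = pad_with_fallback([], global_top)
```

A user with no history falls back to the global list unchanged, which matches what the output cell does for every row.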