<div>
<div style="text-align:center; display:block; float:left; padding:80px;"><img width="200px"  src="https://kaggle2.blob.core.windows.net/competitions/kaggle/4651/logos/front_page.png"/><span style="color:red;">**New User Booking**</span></div>
<div style="">
**Objective:** In this recruiting competition, Airbnb challenges you to predict in which country a new user will make his or her first booking.  
  
**Description:** In this challenge, you are given a list of users along with their demographics, web session records, and some summary statistics. You are asked to predict which country a new user's first booking destination will be. All the users in this dataset are from the USA.
</div>
<img src="https://kaggle2.blob.core.windows.net/competitions/kaggle/4651/media/airbnb_banner.png" />
</div>

## Import Packages

In [53]:
using DataFrames
using MLBase
using XGBoost

## Load Data

In [153]:
train = readtable("data/train_v1.tsv", separator='\t')
test  = readtable("data/test_v1.tsv", separator='\t')
full  = readtable("data/full_v1.tsv", separator='\t');

## Set Features and Output

In [154]:
label    = :country_destination
labels   = Set(train[label])
features = setdiff(names(test), [:id]);

original_labels = keys(labelmap(readtable("data/train_users_2.csv.gz")[label]));

## Prepare Instances

In [12]:
function split_train_val(df; train_size=.85, random_state=1)
    
    srand(random_state)
    
    nrows, ntraining_rows = size(df, 1), round(Int, size(df, 1) * train_size)
    indexes               = shuffle(collect(1:nrows))
    train                 = df[indexes[1:ntraining_rows], :]
    validation            = df[indexes[ntraining_rows+1:end], :]
    
    return train, validation
end

split_train_val (generic function with 1 method)

In [155]:
train[label]  -= 1
X_train, X_val = split_train_val(train, train_size=.85, random_state=1)

train_x = Array{Float64,2}(X_train[:, features])
train_y = Array{Float64,1}(X_train[label])
val_x   = Array{Float64,2}(X_val[:, features])
val_y   = Array{Float64,1}(X_val[label])
test_x  = Array{Float64,2}(test[:, features]);

In [156]:
dtrain  = DMatrix(train_x, label=train_y)
dval    = DMatrix(val_x, label=val_y);

## Train

In [161]:
num_rounds = 100
watchlist  = [(dtrain, "train"), (dval, "eval")]
metrics    = ["merror", "mlogloss"]
params     = Dict("objective"         => "multi:softprob",
                   "booster"          => "gbtree",
                   "eta"              => 0.3,
                   "alpha"            => .5,
                   "max_depth"        => 6,
                   "colsample_bytree" => .5,
                   "min_child_weight" => 10,
                   "subsample"        => .3)

println("Training Base Model...")
tic()
model      = XGBoost.xgboost(dtrain, num_rounds, param=params, metrics=metrics,
                             num_class=length(labels), watchlist=watchlist)
toc()

Training Base Model...


[1]	train-merror:0.372981	train-mlogloss:1.781520	eval-merror:0.372041	eval-mlogloss:1.779704
[2]	train-merror:0.371156	train-mlogloss:1.555581	eval-merror:0.369323	eval-mlogloss:1.553084
[3]	train-merror:0.370600	train-mlogloss:1.414941	eval-merror:0.369136	eval-mlogloss:1.412418
[4]	train-merror:0.370848	train-mlogloss:1.321339	eval-merror:0.369167	eval-mlogloss:1.318485
[5]	train-merror:0.369751	train-mlogloss:1.254377	eval-merror:0.369042	eval-mlogloss:1.251817
[6]	train-merror:0.369056	train-mlogloss:1.206219	eval-merror:0.368480	eval-mlogloss:1.204110
[7]	train-merror:0.368963	train-mlogloss:1.171365	eval-merror:0.368262	eval-mlogloss:1.169390
[8]	train-merror:0.368676	train-mlogloss:1.145111	eval-merror:0.368105	eval-mlogloss:1.143414
[9]	train-merror:0.368571	train-mlogloss:1.125337	eval-merror:0.367606	eval-mlogloss:1.123843
[10]	train-merror:0.368356	train-mlogloss:1.109847	eval-merror:0.368199	eval-mlogloss:1.108825
[11]	train-merror:0.368064	train-mlogloss:1.098307	eval-mer

elapsed time: 224.144738169 seconds

224.144738169


## Predict

In [162]:
function get_top_n(prob_matrix, n)
    
    top_n_list   = Array[]
    nrows, ncols = size(prob_matrix)
    n            = min(ncols, n)
    
    for i=1:nrows
        
        tuple_list = [(j, prob_matrix[i, j]) for j=1:ncols]
        top_n      = sort(tuple_list, by = x -> last(x), rev=true)[1:n]
        top_n      = [first(x) for x in top_n]
        
        push!(top_n_list, top_n)
    end
    
    return top_n_list
end

get_top_n (generic function with 1 method)
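To make the output of `get_top_n` concrete, here is a minimal sketch on a toy probability matrix. `top_n_indices` is a hypothetical, shorter equivalent built on `sortperm`, and the matrix values are made up for illustration. The returned values are 1-based column indices, so they can index directly into a label lookup such as `original_labels`, even though the training labels were shifted to 0-based for XGBoost.

```julia
# Hypothetical compact equivalent of get_top_n: for each row, return the
# column indices of the n largest probabilities, best first.
function top_n_indices(prob_matrix, n)
    nrows, ncols = size(prob_matrix)
    n = min(ncols, n)
    return [sortperm(vec(prob_matrix[i, :]), rev=true)[1:n] for i in 1:nrows]
end

# Toy input: 2 users, 3 classes (values are invented).
probs = [0.1 0.6 0.3;
         0.5 0.2 0.3]

top2 = top_n_indices(probs, 2)
# For user 1 the best classes are columns 2 then 3; for user 2, columns 1 then 3.
```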

In [163]:
yhats = XGBoost.predict(model, test_x)
yhats = reshape(yhats, length(original_labels), size(test_x, 1))';
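The reshape above relies on the prediction vector being flat, with each row's class probabilities stored contiguously (this is an assumption about the `predict` output, inferred from how the cell uses it). Because Julia arrays are column-major, the vector is first reshaped to classes x rows and then transposed. A small sketch with invented numbers:

```julia
# Hypothetical flat output for 2 rows and 3 classes, laid out row by row.
flat = [0.7, 0.2, 0.1,   # probabilities for row 1
        0.1, 0.3, 0.6]   # probabilities for row 2

nclasses, nrows = 3, 2

# reshape fills columns first, so each column holds one row's probabilities;
# the transpose turns that into an nrows x nclasses matrix.
probs = reshape(flat, nclasses, nrows)'
```

Reshaping straight to `(nrows, nclasses)` would run without error but would mix probabilities from different rows.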

In [164]:
yhats_top_n = get_top_n(yhats, 5);

## Generate Submission File

In [171]:
function prepare_dataframe_submission(df)
    
    ids           = repeach(df[:id], 5)
    yhats         = [ original_labels[yhat] for yhat in vcat(yhats_top_n...) ]
    submission_df = DataFrame(id=ids, country=yhats)
    
    return submission_df
end

prepare_dataframe_submission (generic function with 1 method)
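The submission is in long format: one row per (user, guess) pair, with each user's guesses listed best first. The sketch below reproduces that layout with plain arrays, assuming `repeach(ids, 5)` repeats each id five times in place. All names and values here (`ids`, `top_n`, `label_names`) are hypothetical and use two guesses per user to keep the output short.

```julia
ids         = ["u1", "u2"]             # hypothetical user ids
top_n       = Array[[1, 3], [2, 1]]    # per-user class indices, best first
label_names = ["NDF", "US", "other"]   # hypothetical index-to-label lookup
n           = 2                        # guesses per user

# Repeat each id n times, keeping users in their original order.
long_ids    = vcat([fill(id, n) for id in ids]...)

# Flatten the per-user index lists and map every index back to its label.
long_labels = [label_names[k] for k in vcat(top_n...)]
```

Both vectors have `n * length(ids)` entries and line up row by row, which is what the `DataFrame` constructor needs.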

In [172]:
submission_df = prepare_dataframe_submission(test);

In [173]:
writetable("data/submissions/submission_v4_xgboost_msoftprob_gbtree_eta3_md5_ss3_alpha5_csb5_mcw10_nr100.csv", submission_df);

## Kaggle Scores from Submitted Predictions

Best Score: <span style="color:blue;">0.86109</span> [v4]

- v4 **XGBoost** (MSoftProb GBTree Eta.3 MD 6 SS.5 NR100 Alpha5 CSB5 MCW10): **0.86109** tme 0.360, tmll 1.02, eme 0.369, emll 1.064
- v3 **XGBoost** (MSoftMax GBTree Eta.3 MD 6 SS.5 NR100 Alpha5 CSB5 MCW10): **0.70496** tme 0.358, tmll 1.019, eme 0.366, emll 1.061
- v2 **XGBoost** (MSoftMax GBTree Eta.7 MD 5 SS.85 NR2000): **0.65697** tme 0.181, tmll 0.499, eme 0.403, emll 1.477
- v1 **XGBoost** (MSoftMax GBTree Eta.7 MD 5 SS.85 NR100): **0.70174** tme 0.349, tmll 0.948, eme 0.367, emll 1.077

Abbreviations: tme / tmll = train merror / mlogloss; eme / emll = eval merror / mlogloss.
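
The leaderboard scores above are, as far as I know, NDCG@5 values. Each user has exactly one true destination, so a user's score is 1/log2(rank + 1) if that destination appears at position `rank` among the five predictions, and 0 otherwise. That would explain why v4, which submits five ranked guesses from `multi:softprob`, scores higher than v3, which presumably submits a single guess from `multi:softmax` with the same settings. A minimal sketch, where `ndcg_at_k` and the toy country lists are hypothetical:

```julia
# NDCG@k for one user with a single relevant item (the true destination).
# `predicted` is the ranked list of guesses, best first.
function ndcg_at_k(predicted, truth; k=5)
    for (rank, country) in enumerate(predicted[1:min(k, length(predicted))])
        if country == truth
            return 1 / log2(rank + 1)   # ideal DCG is 1, so no further normalisation
        end
    end
    return 0.0                          # truth not among the top k guesses
end

single_guess = ndcg_at_k(["NDF"], "US")                  # wrong single guess
ranked_guess = ndcg_at_k(["NDF", "US", "other"], "US")   # truth at rank 2
```

With one guess, a miss scores 0. With ranked guesses, the same user still earns partial credit (about 0.63 at rank 2).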