# Prelude

## A. Choosing an Algorithm

The first step to building a model is to decide what type of algorithm to use. Below we look at some of the options.

### A.1. Decision Tree

The decision tree is arguably the best-known algorithm, and one of the simplest conceptually. It works much like the flowchart you might draw when trying to work out which decision to make based on a range of variables.

![Decision Tree](images/reg-decision-tree-1.png)

The goal of the decision tree algorithm used for classification problems (like the one we are looking at) is to create one of these decision trees to classify records into a set number of categories. 
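
To make the idea concrete, a trained tree can be read as a set of nested questions. The sketch below is a hand-written tree, not one learned from data: the feature names, thresholds and class names are all invented for illustration.

```python
# Hypothetical hand-written tree: the feature names and thresholds are
# invented for illustration, not learned from the dataset in this notebook.
def classify(record):
    """Walk from the root to a leaf, asking one question per node."""
    if record["age"] < 30:
        if record["searches"] > 10:
            return "books"
        return "does not book"
    if record["signup_method"] == "facebook":
        return "books"
    return "does not book"


print(classify({"age": 25, "searches": 15, "signup_method": "basic"}))
print(classify({"age": 45, "searches": 2, "signup_method": "basic"}))
```

A real decision tree algorithm chooses these questions and thresholds automatically from the training data.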

### A.2. Random Forest

Given the limitations of decision trees and the risk of overfitting, it may be tempting to think “why bother?” Fortunately, methods have been found to reduce the risk of overfitting and increase the predictive power of decision trees, and the two most popular methods share the same basic premise – to train multiple trees.

A random forest constructs a large number of different trees (the number is set by the user) by randomly selecting the features that can be used to build each tree (as opposed to using all the features for each tree). The parameters are often also set so that each tree stays relatively shallow, meaning that the algorithm creates a large number of shallow decision trees (decision bonsai?). Once the trees are constructed, each tree predicts the outcome for a new record, and these predictions serve as votes, with the majority class winning.

![Random Forest](images/random_forest.jpg)
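
The voting step can be sketched in a few lines. The tree predictions below are made up for illustration; in a real forest each one would come from a separately trained tree.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical predictions from five trees for one new record.
tree_predictions = ["US", "NDF", "US", "FR", "US"]
winner = majority_vote(tree_predictions)
print(winner)
```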


### A.3. K-Nearest Neighbors

The k-nearest neighbors algorithm is arguably one of the simplest algorithms in concept. It classifies a given object by looking at the classifications of the k most similar records.

This type of algorithm is called a lazy learner because during the training phase, it essentially just stores the data provided. Only when a new object needs to be classified does the algorithm start looking through the data to try to find the closest matches. **This algorithm scales extremely badly and is extremely sensitive to the number of features.**

![KNN](images/KNN.jpg)
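
A minimal sketch of the idea, assuming numeric features and Euclidean distance as the measure of similarity (the points and labels are invented). Note that there is no training step at all, and that every prediction scans the whole dataset, which is why the approach scales so badly.

```python
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Squared Euclidean distance is enough for ranking neighbours.
    distances = [
        (sum((a - b) ** 2 for a, b in zip(point, query)), label)
        for point, label in zip(train_points, train_labels)
    ]
    nearest = sorted(distances)[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels = ["A", "A", "A", "B", "B", "B"]
prediction_near_a = knn_predict(points, labels, query=(2, 2))
prediction_near_b = knn_predict(points, labels, query=(7, 9))
print(prediction_near_a, prediction_near_b)
```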

### A.4. Neural Networks

As the name suggests, these algorithms simulate biological networks by creating a series of nodes and connecting them together. A neural network typically consists of three layers: an input layer, a hidden layer (although there can be multiple hidden layers) and an output layer.

A model is trained by passing records through the network and continually adjusting the weights at each node so that each record ends up at the right ‘output node’.

![Artificial Neural Network](images/ANN.png)
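
The weight adjustment can be illustrated with the smallest possible case: a single node with two inputs (a perceptron), trained on the logical AND function. This is a deliberately simplified sketch with no hidden layer, so it only hints at how a full network is trained.

```python
def predict(weights, bias, inputs):
    """Fire (1) if the weighted sum of the inputs is positive, else 0."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0


# Toy data: the logical AND of two binary inputs.
records = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights, bias = [0, 0], 0
for _ in range(20):  # pass over the records repeatedly
    for inputs, target in records:
        error = target - predict(weights, bias, inputs)
        # Nudge each weight in the direction that reduces the error.
        weights = [w + error * x for w, x in zip(weights, inputs)]
        bias += error

outputs = [predict(weights, bias, inputs) for inputs, _ in records]
print(outputs)
```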

### A.5. SVM: Support Vector Machine

This type of algorithm, commonly used for text classification problems, is arguably the most difficult to visualize. At the simplest level, the algorithm tries to draw straight lines (or planes for classifications with more than 2 features) that best separate the classes provided. Although this sounds like a fairly simplistic approach to classifying objects, it becomes far more powerful due to the transformations (sometimes called a ‘kernel trick’) the algorithm can apply to the data before drawing these lines/planes. 

![SVM](images/SVM.jpg)
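
The effect of transforming the data first can be shown with a toy example. Below, one class sits in the middle of a single feature and the other on both sides, so no single cut-off separates them. Adding the square of the feature makes a straight-line split possible. The transformation and the threshold are picked by hand here as an assumption for illustration; a real SVM learns the boundary, and a kernel avoids computing the transformed features explicitly.

```python
# One feature, two classes: class 1 sits in the middle, class 0 on both sides,
# so no single threshold on x separates them.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [0, 0, 1, 1, 1, 0, 0]


def transform(x):
    """Hypothetical feature map: add x squared as a second feature."""
    return (x, x * x)


def classify(point, threshold=2.5):
    """A straight line in the new space: class 1 when x^2 is below the threshold."""
    return 1 if point[1] < threshold else 0


predicted = [classify(transform(x)) for x in xs]
print(predicted == ys)
```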

## B. Creating the Model

Creating a model involves more than picking an algorithm and fitting it to the data. Below are two techniques we will rely on when training the model: cross validation and parameter tuning.

### B.1. Cross Validation

As mentioned in relation to decision trees, one of the key risks when creating models of any type is overfitting. One of the primary ways data scientists guard against overfitting is to estimate the accuracy of their models on data that was not used to train the model. To do this they typically use a method called cross validation. There are different methods for doing cross validation, but the method we will employ is called k-fold cross validation.

**k-fold cross validation** involves splitting the training data into k subsets (where k is greater than or equal to 2), training the model using k – 1 of those subsets, then running the model on the subset that was not used in the training process. This is repeated k times, so that each subset is held out exactly once. Because all of the data used in the cross validation process is training data, the correct classification for each record is known and so the predicted category can be compared to the actual category. Once all folds have been completed, the average score across all folds is taken as an estimate of how the model will perform on other data. An example of a 3-fold cross validation is shown below:

![Cross Validation](images/cross-validation.png)
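
The splitting itself is simple enough to sketch by hand. The helper below is a hypothetical illustration that only produces the record indices for each fold. It does not shuffle the data, which library implementations can usually do for you.

```python
def k_fold_indices(n_records, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_records))
    fold_size, remainder = divmod(n_records, k)
    start = 0
    for fold in range(k):
        # Spread any leftover records over the first folds.
        stop = start + fold_size + (1 if fold < remainder else 0)
        validation = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, validation
        start = stop


folds = list(k_fold_indices(n_records=7, k=3))
for train, validation in folds:
    print("train:", train, "validate:", validation)
```

Each record appears in exactly one validation subset, and the model for a given fold never sees its own validation records during training.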

### B.2. Parameter Tuning

As you may have realized from the earlier description of the XGBoost algorithm – there are quite a few options (parameters) that we need to define to build the model. 

* How many trees to build? 
* How deep should each tree be? 
* How much extra weight will be attached to each misclassified record? 

Tuning these parameters to get the best results from the model is often one of the most time-consuming things that data scientists do.

Fortunately, the process can be automated to a large degree so that we do not have to sit there rerunning the model repeatedly and noting down the results. Even better, using the Scikit-Learn package, we can merge the parameter tuning and cross validation steps into one, allowing us to search for the best combination of parameters while using k-fold cross validation to verify the results.
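
To get a feel for why this is time-consuming, it helps to count the work involved. The sketch below enumerates every combination in the parameter grid used later in this notebook and multiplies by the number of folds. It only counts the fits and does not train anything.

```python
from itertools import product

# Same grid as the one searched later in this notebook.
param_grid = {
    "max_depth": [3, 4, 5],
    "learning_rate": [0.1, 0.3],
    "n_estimators": [25, 50],
}

names = sorted(param_grid)
candidates = [
    dict(zip(names, values))
    for values in product(*(param_grid[name] for name in names))
]

n_folds = 3
print(len(candidates), "candidates,", len(candidates) * n_folds, "fits")
```

Every extra value added to any one parameter multiplies the number of models that have to be trained.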

## C. The Importance of Domain Knowledge

One of the things that may have occurred to you as you read through the various ways to modify and expand a dataset is how are you supposed to know what will help or not?

This is where knowledge about the data you are using and what it represents becomes so important. This knowledge – referred to as domain knowledge – helps guide this entire process, including cleaning the data.

Understanding how the data was collected helps to provide insight into potential errors in the data that might need to be addressed or shortcomings in the way the data was sampled (sample selection bias/errors). Understanding the relevant industry or market can also provide a range of insights including:

* what additional information is available to expand your dataset
* what information may help to increase prediction accuracy and what is likely to be irrelevant
* if the model makes intuitive sense (e.g. can you predict the likelihood of waking up with a headache based on whether someone slept with their shoes on?)
* if the industry or market is changing in such a way that it is likely to make the model redundant in the near future.

# Let's train the XGBoost model, buddy!

## 1. Loading the Python libraries

**Installation needed for xgboost:**

1. \$ source activate python2
2. \$ pip install xgboost

In [1]:
import os
from datetime import datetime

import numpy as np
import pandas as pd

import sklearn as sk
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import GridSearchCV  # replaces sklearn.grid_search (removed in 0.20)

import xgboost as xgb



## 2. Re-Import the Check NaN Function from Part 2

In [2]:
def check_NaN_Values_in_df(df):
    # search for NaN values in all the columns
    for col in df:
        nan_count = df[col].isnull().sum()

        if nan_count != 0:
            print(col + " => " + str(nan_count) + " NaN Values")

## 3. Loading in the Data from Part 3

In [3]:
df_all = pd.read_csv("output/enriched.csv", index_col=0)

# Check for NaN Values => We must find: country_destination => 62096 NaN Values
check_NaN_Values_in_df(df_all) 

df_all.sample(n=5) # Only display a few lines and not the whole dataframe

Unnamed: 0,age,gende_other,gende_female,gende_male,gende_unknown,signu_basic,signu_facebook,signu_google,signu_weibo,signu_24,...,action_detail_view_resolutions,action_detail_view_search_results,action_detail_view_security_checks,action_detail_view_user_real_names,action_detail_wishlist,action_detail_wishlist_content_update,action_detail_wishlist_note,action_detail_your_listings,action_detail_your_reservations,action_detail_your_trips
jw6ama99vu,25,0,0,0,1,1,0,0,0,0,...,0.0,15.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
be8jznfeue,42,0,0,0,1,1,0,0,0,0,...,0.0,10.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0
me9mt8pe66,60,0,0,0,1,1,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
tvcpppste4,24,0,0,1,0,0,1,0,0,0,...,0.0,17.0,0.0,0.0,0.0,9.0,0.0,0.0,0.0,0.0
5ncu73qoqt,-1,0,0,0,1,1,0,0,0,0,...,0.0,106.0,0.0,0.0,0.0,32.0,0.0,0.0,0.0,0.0


## 4. Getting a training dataset

In order to train the model (using cross validation and parameter tuning as outlined above), we first need to define our training dataset – remembering that we previously combined the training and test data to simplify the cleaning and transforming process. 

To feed these into the model, we also need to split the training data into its three main components:
- the **user IDs** (we don’t want to use these for training as they are randomly generated)
- the **features** to use for training (X)
- the **labels** we are trying to predict (y).
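
The labels are country codes, but the model needs integers. scikit-learn's `LabelEncoder` handles this by assigning an integer code to each distinct label in sorted order. The plain-Python sketch below mimics that mapping on a few made-up labels, purely to show what the encoded `y` represents.

```python
# Hypothetical labels; the real ones come from the country_destination column.
labels = ["US", "NDF", "FR", "NDF", "US"]

classes = sorted(set(labels))                      # ['FR', 'NDF', 'US']
to_code = {name: code for code, name in enumerate(classes)}

y = [to_code[name] for name in labels]             # integers for the model
decoded = [classes[code] for code in y]            # back to country codes

print(y)
print(decoded)
```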

In [4]:
# Loading training csv file
df_train = pd.read_csv("data/train_users_2.csv")
df_train.head(n=5) # Only display a few lines and not the whole dataframe

Unnamed: 0,id,date_account_created,timestamp_first_active,date_first_booking,gender,age,signup_method,signup_flow,language,affiliate_channel,affiliate_provider,first_affiliate_tracked,signup_app,first_device_type,first_browser,country_destination
0,gxn3p5htnn,2010-06-28,20090319043255,,-unknown-,,facebook,0,en,direct,direct,untracked,Web,Mac Desktop,Chrome,NDF
1,820tgsjxq7,2011-05-25,20090523174809,,MALE,38.0,facebook,0,en,seo,google,untracked,Web,Mac Desktop,Chrome,NDF
2,4ft3gnwmtx,2010-09-28,20090609231247,2010-08-02,FEMALE,56.0,basic,3,en,direct,direct,untracked,Web,Windows Desktop,IE,US
3,bjjt8pjhuk,2011-12-05,20091031060129,2012-09-08,FEMALE,42.0,facebook,0,en,direct,direct,untracked,Web,Mac Desktop,Firefox,other
4,87mebub9p4,2010-09-14,20091208061105,2010-02-18,-unknown-,41.0,basic,0,en,direct,direct,untracked,Web,Mac Desktop,Chrome,US


In [5]:
# Prepare training data for modelling
df_train.set_index('id', inplace=True)
df_train = pd.concat([df_train['country_destination'], df_all], axis=1, join='inner')

id_train = df_train.index.values
labels = df_train['country_destination']
le = LabelEncoder()
y = le.fit_transform(labels)
X = df_train.drop('country_destination', axis=1, inplace=False)

# Checking Up
print (y)
X.sample(n=5)

[10  7  7 ...,  7  7  4]


Unnamed: 0,age,gende_other,gende_female,gende_male,gende_unknown,signu_basic,signu_facebook,signu_google,signu_weibo,signu_24,...,action_detail_view_resolutions,action_detail_view_search_results,action_detail_view_security_checks,action_detail_view_user_real_names,action_detail_wishlist,action_detail_wishlist_content_update,action_detail_wishlist_note,action_detail_your_listings,action_detail_your_reservations,action_detail_your_trips
ewyk4jrscx,-1,0,0,0,1,1,0,0,0,0,...,0.0,6.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ohcbqsu9y6,28,0,1,0,0,0,1,0,0,0,...,0.0,4.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
jhyslwyvsg,-1,0,0,0,1,1,0,0,0,0,...,0.0,20.0,0.0,0.0,0.0,35.0,0.0,0.0,0.0,0.0
rovxhr6nog,-1,0,0,0,1,1,0,0,0,0,...,0.0,40.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
71wrxhkhcc,-1,0,0,0,1,1,0,0,0,1,...,0.0,3.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0


## 5. Training an XGBoost Model

In [6]:
# Grid Search - Used to find best combination of parameters
XGB_model = xgb.XGBClassifier(objective='multi:softprob', subsample=0.5, colsample_bytree=0.5, seed=0)
param_grid = {'max_depth': [3, 4, 5], 'learning_rate': [0.1, 0.3], 'n_estimators': [25, 50]}

# Note: the iid argument was removed from GridSearchCV in scikit-learn 0.24
model = GridSearchCV(
    estimator=XGB_model,
    param_grid=param_grid,
    scoring='accuracy',
    verbose=10,
    n_jobs=1,
    refit=True,
    cv=3
)

model.fit(X, y)
print("Best score: %0.3f" % model.best_score_)
print("Best parameters set:")
best_parameters = model.best_estimator_.get_params()
for param_name in sorted(param_grid.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))

Fitting 3 folds for each of 12 candidates, totalling 36 fits
[CV] n_estimators=25, learning_rate=0.1, max_depth=3 .................
[CV]  n_estimators=25, learning_rate=0.1, max_depth=3, score=0.695258 -  20.9s
[CV] n_estimators=25, learning_rate=0.1, max_depth=3 .................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   20.9s remaining:    0.0s


[CV]  n_estimators=25, learning_rate=0.1, max_depth=3, score=0.692623 -  20.5s
[CV] n_estimators=25, learning_rate=0.1, max_depth=3 .................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:   41.4s remaining:    0.0s


[CV]  n_estimators=25, learning_rate=0.1, max_depth=3, score=0.690719 -  20.8s
[CV] n_estimators=50, learning_rate=0.1, max_depth=3 .................


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:  1.0min remaining:    0.0s


[CV]  n_estimators=50, learning_rate=0.1, max_depth=3, score=0.700540 -  38.7s
[CV] n_estimators=50, learning_rate=0.1, max_depth=3 .................


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:  1.7min remaining:    0.0s


[CV]  n_estimators=50, learning_rate=0.1, max_depth=3, score=0.701036 -  39.0s
[CV] n_estimators=50, learning_rate=0.1, max_depth=3 .................


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:  2.3min remaining:    0.0s


[CV]  n_estimators=50, learning_rate=0.1, max_depth=3, score=0.700313 -  38.9s
[CV] n_estimators=25, learning_rate=0.1, max_depth=4 .................


[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:  3.0min remaining:    0.0s


[CV]  n_estimators=25, learning_rate=0.1, max_depth=4, score=0.697168 -  26.1s
[CV] n_estimators=25, learning_rate=0.1, max_depth=4 .................


[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed:  3.4min remaining:    0.0s


[CV]  n_estimators=25, learning_rate=0.1, max_depth=4, score=0.695875 -  25.8s
[CV] n_estimators=25, learning_rate=0.1, max_depth=4 .................


[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed:  3.8min remaining:    0.0s


[CV]  n_estimators=25, learning_rate=0.1, max_depth=4, score=0.695394 -  26.0s
[CV] n_estimators=50, learning_rate=0.1, max_depth=4 .................


[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed:  4.3min remaining:    0.0s


[CV]  n_estimators=50, learning_rate=0.1, max_depth=4, score=0.701434 -  49.5s
[CV] n_estimators=50, learning_rate=0.1, max_depth=4 .................
[CV]  n_estimators=50, learning_rate=0.1, max_depth=4, score=0.702499 -  49.7s
[CV] n_estimators=50, learning_rate=0.1, max_depth=4 .................
[CV]  n_estimators=50, learning_rate=0.1, max_depth=4, score=0.701980 -  49.8s
[CV] n_estimators=25, learning_rate=0.1, max_depth=5 .................
[CV]  n_estimators=25, learning_rate=0.1, max_depth=5, score=0.698631 -  31.4s
[CV] n_estimators=25, learning_rate=0.1, max_depth=5 .................
[CV]  n_estimators=25, learning_rate=0.1, max_depth=5, score=0.699898 -  31.3s
[CV] n_estimators=25, learning_rate=0.1, max_depth=5 .................
[CV]  n_estimators=25, learning_rate=0.1, max_depth=5, score=0.699378 -  31.4s
[CV] n_estimators=50, learning_rate=0.1, max_depth=5 .................
[CV]  n_estimators=50, learning_rate=0.1, max_depth=5, score=0.701881 - 1.0min
[CV] n_estimators=50,

[Parallel(n_jobs=1)]: Done  36 out of  36 | elapsed: 22.7min finished


Best score: 0.704
Best parameters set:
	learning_rate: 0.3
	max_depth: 4
	n_estimators: 50


## 6. Saving the Model

To load the model back later, use `loaded_model = joblib.load("models/0001.model")`.

In [7]:
# We create the output directory if necessary
if not os.path.exists("models"):
    os.makedirs("models")

# save model to file
joblib.dump(model, 'models/0001.model')

['models/0001.model']

## 7. Making the Predictions

Now that we have trained a model based on the best parameters, the next step is to use the model to make predictions for the records in the testing dataset. Again we need to extract the testing data out of the combined dataset we created for the cleaning and transformation steps, and again we need to separate the main components for the model. After these steps, we use the model created in the previous step to make the predictions.

In [8]:
df_test = pd.read_csv("data/test_users.csv")
df_test.head(n=5) # Only display a few lines and not the whole dataframe

Unnamed: 0,id,date_account_created,timestamp_first_active,date_first_booking,gender,age,signup_method,signup_flow,language,affiliate_channel,affiliate_provider,first_affiliate_tracked,signup_app,first_device_type,first_browser
0,5uwns89zht,2014-07-01,20140701000006,,FEMALE,35.0,facebook,0,en,direct,direct,untracked,Moweb,iPhone,Mobile Safari
1,jtl0dijy2j,2014-07-01,20140701000051,,-unknown-,,basic,0,en,direct,direct,untracked,Moweb,iPhone,Mobile Safari
2,xx0ulgorjt,2014-07-01,20140701000148,,-unknown-,,basic,0,en,direct,direct,linked,Web,Windows Desktop,Chrome
3,6c6puo6ix0,2014-07-01,20140701000215,,-unknown-,,basic,0,en,direct,direct,linked,Web,Windows Desktop,IE
4,czqhjk3yfe,2014-07-01,20140701000305,,-unknown-,,basic,0,en,direct,direct,untracked,Web,Mac Desktop,Safari


In [9]:
# Prepare test data for prediction
df_test.set_index('id', inplace=True)
df_test = pd.merge(df_test.loc[:,['date_first_booking']], df_all, how='left', left_index=True, right_index=True, sort=False)
X_test = df_test.drop('date_first_booking', axis=1, inplace=False)
X_test = X_test.fillna(-1)
id_test = df_test.index.values

# Make predictions
y_pred = model.predict_proba(X_test)

In [10]:
#Taking the 5 classes with highest probabilities
ids = []  #list of ids
cts = []  #list of countries
for i in range(len(id_test)):
    idx = id_test[i]
    ids += [idx] * 5
    cts += le.inverse_transform(np.argsort(y_pred[i])[::-1])[:5].tolist()
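
The ranking step in the loop above can be hard to read at first. For a single user, the idea is to sort the class indices by predicted probability, from most to least likely, and keep the first five. The class names and probabilities below are made up for illustration.

```python
# Hypothetical probabilities for one user over six classes.
classes = ["AU", "ES", "FR", "NDF", "US", "other"]
probabilities = [0.01, 0.04, 0.10, 0.55, 0.25, 0.05]

# Indices sorted from most to least likely, truncated to the best five.
ranked = sorted(range(len(classes)), key=lambda i: probabilities[i], reverse=True)
top5 = [classes[i] for i in ranked[:5]]

print(top5)
```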

## 8. Saving Predictions

In [11]:
# We create the output directory if necessary
if not os.path.exists("output"):
    os.makedirs("output")

#Generate submission
sub = pd.DataFrame(np.column_stack((ids, cts)), columns=['id', 'country'])
sub.to_csv('output/submission.csv', index=False, sep=',')