# Airbnb New User Bookings
*Where will a new guest book their first travel experience?*

* [Kaggle Page](https://www.kaggle.com/c/airbnb-recruiting-new-user-bookings)

**Outline**

* [Read Data](#read)
* [Data Summary](#data check)
* [Exploratory Data Analysis](#eda)
* [Feature Creation and Preprocessing](#preprocess)
* [Model and Score](#model) 
* [Prediction](#predict)
* [Reference](#reference)

**Related Notebooks**
- Link to [Airbnb 2.26.18 - Data Cleansing & Feature Creation](Airbnb 2.26.18 - Data Cleansing & Feature Creation.ipynb)

In [88]:
#%load_ext watermark

In [35]:
%matplotlib inline

import os
import pandas as pd
import numpy as np
import seaborn as sn
import matplotlib.pyplot as plt
import math

from sklearn.linear_model import LogisticRegression 
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn import model_selection
from sklearn.neural_network import MLPClassifier

from sklearn.metrics import classification_report
import statsmodels.api as sm

SEED = 12345



In [2]:
#%watermark -a 'PredictiveII' -d -t -v -p pandas,numpy,sklearn,watermark

---

## <a id="read">Read Data</a>

In [2]:
def data_reader():
    """
    read data into notebook 
    """
        
    data_dir = os.path.join('.', 'data') #/Users/siliangchen/Airbnb

    train_binary_path = os.path.join(data_dir, 'train_binary.csv')  
    train_binary = pd.read_csv(train_binary_path, index_col = 0)

    return train_binary

In [3]:
train_binary = data_reader()

## <a id="data check">Data Summary</a>

### Data dictionary:
**Table 1. Sessions: web session log for users (can be joined with the user tables for additional feature extraction)**
* **user_id**: to be joined with the column 'id' in users table
* **action**
* **action_type**
* **action_detail**
* **device_type**
* **secs_elapsed**

**Table 2. Train users: the training set of users**
* **id**: user id
* **date_account_created**: the date of account creation
* **timestamp_first_active**: timestamp of the first activity; it can be earlier than date_account_created or date_first_booking because a user can search before signing up
* **date_first_booking**: date of first booking
* **gender**
* **age**
* **signup_method**
* **signup_flow**: the page a user came to sign up from
* **language**: international language preference
* **affiliate_channel**: what kind of paid marketing
* **affiliate_provider**: where the marketing is e.g. google, craigslist, other
* **first_affiliate_tracked**: the first marketing the user interacted with before signing up
* **signup_app**
* **first_device_type**
* **first_browser**
* **country_destination**: this is the target variable you are to predict

**Table 3. age_gender: summary statistics of users' age group, gender, country of destination**

**Table 4. country: summary statistics of destination countries in this dataset and their locations**

**Table 5. Test users: the testing set of users**
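
The session-derived columns that appear in `train_binary` (`obs_count`, `unique_action`, `unique_device`, `avg_time`) look like per-user aggregates of Table 1. The sketch below shows one way such features could be built and joined to the users table. The aggregation rules and the toy rows are assumptions for illustration, not the code used in the related notebook.

```python
import pandas as pd

# Toy session log with the Table 1 columns (values are made up).
sessions = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u2", "u2"],
    "action": ["search", "show", "search", "index", "index"],
    "device_type": ["Mac Desktop", "iPhone", "Mac Desktop", "Android Phone", "Android Phone"],
    "secs_elapsed": [100.0, 200.0, 300.0, 50.0, 150.0],
})

# Toy users table with a subset of the Table 2 columns.
users = pd.DataFrame({"id": ["u1", "u2"], "age": [31.0, 34.0]})

# Assumed definitions: one row per user, summarising that user's session log.
session_features = sessions.groupby("user_id").agg(
    obs_count=("action", "size"),
    unique_action=("action", "nunique"),
    unique_device=("device_type", "nunique"),
    avg_time=("secs_elapsed", "mean"),
).reset_index()

# Join on users.id == sessions.user_id, as described in the data dictionary.
user_features = users.merge(
    session_features, left_on="id", right_on="user_id", how="left"
).drop(columns="user_id")
print(user_features)
```

A left join keeps users who have no session rows; their session features would be missing and need imputing before modelling.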

In [4]:
train_binary.head()

Unnamed: 0,obs_count,unique_action,unique_device,avg_time,gender,age,signup_method,signup_flow,language,affiliate_channel,affiliate_provider,first_affiliate_tracked,signup_app,first_device_type,first_browser,isNDF
0,40,13,2,21697.4,-unknown-,31.0,basic,0,en,direct,direct,omg,Web,Mac Desktop,Safari,False
1,90,10,1,3144.055556,-unknown-,34.0,basic,23,en,direct,direct,untracked,Android,Other/Unknown,-unknown-,True
2,31,5,2,9580.967742,-unknown-,34.0,basic,0,en,direct,direct,linked,Moweb,Android Phone,Android Browser,True
3,789,25,2,8221.901141,FEMALE,26.0,facebook,25,en,direct,direct,linked,iOS,iPhone,Mobile Safari,False
4,489,20,1,11706.891616,FEMALE,34.0,basic,0,en,sem-brand,google,omg,Web,Mac Desktop,Safari,False


# Fit NDF Classifier

**Create Dummy Variable**

In [5]:
categorical = ['gender', 'signup_method', 'signup_flow', 'language', 'affiliate_channel',
                'affiliate_provider','first_affiliate_tracked','signup_app','first_device_type',
               'first_browser']                           

In [6]:
# Convert data type as 'category'
for i in categorical:
    train_binary[i] = train_binary[i].astype('category')

In [7]:
# Create dummy variables
train_binary_dummy = pd.get_dummies(train_binary, columns = categorical)
train_binary_dummy.head()

Unnamed: 0,obs_count,unique_action,unique_device,avg_time,age,isNDF,gender_-unknown-,gender_FEMALE,gender_MALE,gender_OTHER,...,first_browser_RockMelt,first_browser_Safari,first_browser_SeaMonkey,first_browser_Silk,first_browser_SiteKiosk,first_browser_Sogou Explorer,first_browser_TenFourFox,first_browser_TheWorld Browser,first_browser_Yandex.Browser,first_browser_wOSBrowser
0,40,13,2,21697.4,31.0,False,1,0,0,0,...,0,1,0,0,0,0,0,0,0,0
1,90,10,1,3144.055556,34.0,True,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,31,5,2,9580.967742,34.0,True,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,789,25,2,8221.901141,26.0,False,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
4,489,20,1,11706.891616,34.0,False,0,1,0,0,...,0,1,0,0,0,0,0,0,0,0
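
To make the expansion concrete, here is a small sketch on made-up rows showing how `pd.get_dummies` replaces each categorical column with one indicator column per level. Passing `drop_first=True` is an optional variation, not something this notebook does; it removes one level per column, which avoids perfectly collinear indicators in linear models.

```python
import pandas as pd

# Made-up rows using two of the categorical columns from this notebook.
toy = pd.DataFrame({
    "age": [31.0, 34.0, 26.0],
    "gender": ["-unknown-", "FEMALE", "MALE"],
    "signup_app": ["Web", "Android", "Web"],
})

# One indicator column per level, as in the cell above.
full = pd.get_dummies(toy, columns=["gender", "signup_app"])

# Optional variation: drop the first level of each categorical column.
reduced = pd.get_dummies(toy, columns=["gender", "signup_app"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```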


## Logistic Regression

**Validation Set Score**

In [8]:
# Split data into response and predictors
y = train_binary_dummy['isNDF']
x = train_binary_dummy.drop('isNDF', axis=1)

# Create training and test data tables
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = .3, random_state = SEED)

# Fit logistic model
logreg = LogisticRegression().fit(x_train, y_train)

# Print out prediction accuracy for the data
print('Model accuracy on train set: {:.2f}'.format(logreg.score(x_train, y_train)))
print('Model accuracy on test set: {:.2f}'.format(logreg.score(x_test, y_test)))

Model accuracy on train set: 0.69
Model accuracy on test set: 0.69


**Cross Validation Score**

In [9]:
# shuffle=True is required by newer scikit-learn releases whenever random_state is set
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=SEED)
modelCV = LogisticRegression()
scoring = 'roc_auc'
results = model_selection.cross_val_score(modelCV, x_train, y_train, cv=kfold, scoring=scoring)
print("10-fold cross validation average AUC: %.3f" % (results.mean()))

10-fold cross validation average AUC: 0.713
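
For readers unfamiliar with `cross_val_score`, the sketch below spells out what the call does with `scoring='roc_auc'`: fit on each training fold, score the held-out fold with continuous scores, and average. It runs on synthetic data from `make_classification`, not the Airbnb data, and the fold count is reduced to keep it quick.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for (x_train, y_train).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)

manual_scores = []
for train_idx, valid_idx in kfold.split(X):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    # AUC is computed from continuous scores, not from hard class labels.
    scores = model.decision_function(X[valid_idx])
    manual_scores.append(roc_auc_score(y[valid_idx], scores))

# The helper performs the same fit-and-score loop internally.
helper_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=kfold, scoring="roc_auc"
)

print("manual mean AUC: %.3f" % np.mean(manual_scores))
print("helper mean AUC: %.3f" % helper_scores.mean())
```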


**Test Misclass Rate**

In [23]:
def get_test_misclass(model, x_test, y_test):
    """Print per-class precision, recall and F1 on the test set for a fitted model."""
    y_true = y_test
    y_pred = model.predict(x_test)
    print(classification_report(y_true, y_pred))

In [24]:
get_test_misclass(logreg,x_test,y_test)

             precision    recall  f1-score   support

      False       0.64      0.45      0.53      8432
       True       0.71      0.84      0.77     13358

avg / total       0.68      0.69      0.68     21790
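
An accuracy of 0.69 is easier to judge against the accuracy of always predicting the majority class. The short calculation below uses only the support counts printed in the report above.

```python
# Support counts copied from the classification report above.
support = {"False": 8432, "True": 13358}

total = sum(support.values())
majority_class = max(support, key=support.get)

# Accuracy of a classifier that always predicts the majority class.
baseline_accuracy = support[majority_class] / total

print("majority class: %s" % majority_class)
print("baseline accuracy: %.3f" % baseline_accuracy)
```

Always predicting NDF would score about 0.61 on this test set, so the logistic model's 0.69 is a gain of roughly eight points over that baseline.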



**To Do**

* Feature Selection
* Model Summary (significant features...etc)

**Reference**

* [Building A Logistic Regression in Python, Step by Step](https://towardsdatascience.com/building-a-logistic-regression-in-python-step-by-step-becd4d56c9c8)
* [Scoring Metric](http://scikit-learn.org/0.15/modules/model_evaluation.html)

## Random forest

**Fit a base model using default parameters and get the cv AUC**

In [16]:
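def fit_randomforest(x_train, y_train, max_features="sqrt"):
    # "sqrt" is what "auto" meant for classifiers; newer scikit-learn releases no longer accept "auto"
    num_trees = 100
    # shuffle=True is required by newer scikit-learn releases whenever random_state is set
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=SEED)
    model = RandomForestClassifier(n_estimators=num_trees, max_features=max_features, random_state=SEED)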
    results = model_selection.cross_val_score(model, x_train, y_train, cv=kfold, scoring='roc_auc')
    print(results.mean())
    
    model.fit(x_train, y_train)
    
    return model

In [17]:
model_rf = fit_randomforest(x_train, y_train)

0.751185891983


In [25]:
get_test_misclass(model_rf,x_test,y_test)

             precision    recall  f1-score   support

      False       0.61      0.60      0.61      8432
       True       0.75      0.76      0.76     13358

avg / total       0.70      0.70      0.70     21790



In [41]:
# matplotlib.pyplot is imported as plt at the top of the notebook
importances = model_rf.feature_importances_
plt.bar(range(len(importances)), importances)
plt.show()
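
Because the dummy encoding produces many columns, a bar per column index is hard to read. One option is to pair each importance with its column name and show only the largest values. The sketch below uses synthetic data with hypothetical column names, and `top_importances` is a helper introduced here for illustration, not part of the notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def top_importances(model, feature_names, n=10):
    """Return the n largest feature importances as a name-indexed Series."""
    importances = pd.Series(model.feature_importances_, index=feature_names)
    return importances.sort_values(ascending=False).head(n)


# Synthetic stand-in for x_train / y_train.
X, y = make_classification(n_samples=300, n_features=8, n_informative=3, random_state=0)
X = pd.DataFrame(X, columns=["feature_%d" % i for i in range(8)])

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
top = top_importances(rf, X.columns, n=5)
print(top)
```

In this notebook the equivalent call would be `top_importances(model_rf, x_train.columns)`, and `.plot.barh()` on the result would draw a labelled chart.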

**Parameter Tuning**

In [35]:
def parameter_tuning(model, X_train, y_train, param_grid):   
    """
    Tune a tree-based model with GridSearchCV and return the model refitted with the best parameters.
    
    Parameters
    ----------
    model: sklearn ensemble tree model
        The model whose hyperparameters we want to tune.
    
    X_train: pandas DataFrame
        Preprocessed training data. All columns must be numeric.
    
    y_train: pandas Series
    
    param_grid: dict
        The parameters to tune for the corresponding model, mapped to their candidate values.
        

    Note
    ----------
    * We pass a kfold splitter to GridSearchCV so that its CV score is consistent with the scores
      from the other fitting functions, including fit_bagging, fit_randomforest and fit_gbm.
    * We use model_selection.KFold with a fixed seed so that GridSearchCV uses the same seed as model_selection.cross_val_score.
    
    """
    seed=SEED
    
#     if 'n_estimators' in param_grid:
#         model.set_params(warm_start=True) 
    
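    # shuffle=True is required by newer scikit-learn releases whenever random_state is set
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
    # score with AUC so that best_score_ is comparable with the cross_val_score results in this notebook
    gs_model = GridSearchCV(model, param_grid, cv=kfold, scoring='roc_auc')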
    gs_model.fit(X_train, y_train)
    
    # best hyperparameter setting
    print('best parameters:{}'.format(gs_model.best_params_)) 
    print('best score:{}'.format(gs_model.best_score_)) 
    
    # refit model on best parameters
    model.set_params(**gs_model.best_params_)
    model.fit(X_train, y_train)

    return model

In [None]:
# num_trees=100
# rf = RandomForestClassifier(n_estimators=num_trees, random_state=SEED)

In [214]:
# param_grid_rf_1 = {
#     'max_depth': [None, 4, 6, 8, 10],
#     'min_samples_leaf': [1, 3, 5, 7, 9],
#     'max_features': ['auto', 'log2', None]
#                   }

In [217]:
# Takes too long to run; revisit later
#rf_2 = parameter_tuning(rf, x_train, y_train, param_grid_rf_1)

In [None]:
# param_grid_rf_2 = {'max_depth': [6, 7, 8, 9, 10, None]}
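
The first grid above has 5 x 5 x 3 = 75 parameter combinations, and with 10 folds that is 750 fits of a 100-tree forest, which likely explains the run time. One cheaper alternative, swapped in here and not used in this notebook, is `RandomizedSearchCV`, which samples a fixed number of combinations. The sketch runs on small synthetic data with fewer trees and folds so it finishes quickly; `'sqrt'` stands in for `'auto'`, which meant the same thing for classifiers.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, RandomizedSearchCV

# Synthetic stand-in for x_train / y_train, kept small so the sketch runs quickly.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

param_distributions = {
    "max_depth": [None, 4, 6, 8, 10],
    "min_samples_leaf": [1, 3, 5, 7, 9],
    "max_features": ["sqrt", "log2", None],
}

search = RandomizedSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    param_distributions=param_distributions,
    n_iter=5,  # sample 5 of the 75 combinations
    cv=KFold(n_splits=3, shuffle=True, random_state=0),
    scoring="roc_auc",
    random_state=0,
)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

The trade-off is coverage: a random sample may miss the best grid point, so `n_iter` controls the balance between run time and search quality.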

## Gradient Boosted Tree

**Fit a base model using default parameters and get the cv AUC**

In [27]:
def fit_gbm(X_train, y_train):
    """Gradient Boosting Machine for Classification"""
    seed = SEED   
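    # shuffle=True is required by newer scikit-learn releases whenever random_state is set
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)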
    model = GradientBoostingClassifier(random_state=seed)
    results = model_selection.cross_val_score(model, X_train, y_train, cv=kfold, scoring='roc_auc')
    print(results.mean())
    
    model.fit(X_train, y_train)
    
    return model

In [28]:
gbm_base = fit_gbm(x_train, y_train)

0.78157352778


In [31]:
get_test_misclass(gbm_base,x_test,y_test)

             precision    recall  f1-score   support

      False       0.66      0.60      0.63      8432
       True       0.76      0.80      0.78     13358

avg / total       0.72      0.72      0.72     21790



**Parameter Tuning**

## Neural Network Classification

In [32]:
def fit_nn(x_train, y_train):
    """Neural Network model with fixed parameters"""        
    
    seed= SEED        
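    # shuffle=True is required by newer scikit-learn releases whenever random_state is set
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)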
    model = MLPClassifier(solver='sgd', alpha=0.0001,
                     hidden_layer_sizes=(100, 2), random_state=seed)    
    
    results = model_selection.cross_val_score(model, x_train, y_train, cv=kfold, scoring='roc_auc')
    print(results.mean())
    
    model.fit(x_train, y_train)
    
    return model

In [33]:
nn_base = fit_nn(x_train, y_train)

0.500392333073


In [40]:
nn_base

MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(100, 2), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=12345,
       shuffle=True, solver='sgd', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)

In [34]:
get_test_misclass(nn_base,x_test,y_test)

             precision    recall  f1-score   support

      False       0.50      0.00      0.00      8432
       True       0.61      1.00      0.76     13358

avg / total       0.57      0.61      0.47     21790
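
A CV AUC of about 0.50 together with a recall of 1.00 on `True` means the network labels nearly every user as NDF. One possible cause, not verified in this notebook, is that the inputs are unscaled: `avg_time` is in the thousands while the dummy columns are 0 or 1, and gradient-based training is sensitive to that. The sketch below shows how scaling could be added with a pipeline, keeping the same network settings. It runs on synthetic data with one inflated column, so it makes no claim about the score on the Airbnb data.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the features; one column is inflated to mimic avg_time.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X[:, 0] *= 10000.0

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# The scaler is fitted on the training split only, then applied to the test split.
scaled_nn = make_pipeline(
    StandardScaler(),
    MLPClassifier(solver="sgd", alpha=0.0001, hidden_layer_sizes=(100, 2), random_state=0),
)
scaled_nn.fit(X_tr, y_tr)

auc = roc_auc_score(y_te, scaled_nn.predict_proba(X_te)[:, 1])
print("test AUC with scaling: %.3f" % auc)
```

Wrapping the scaler and the network in one pipeline also keeps cross-validation honest, because the scaler is refitted inside each training fold.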



**Reference**

* [sklearn.neural_network.MLPClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html#sklearn.neural_network.MLPClassifier)