<h1>Kaggle Airbnb Dataset</h1>
<h2>Capstone Project 1</h2>
<h2>Springboard Data Science Career Track</h2>

This notebook is a full demonstration of machine learning skills, including data wrangling, feature engineering, model training and hyperparameter selection.

The goal of the 2015 Airbnb contest is to predict each user's five most likely destinations, but the distribution of destinations is very skewed: most users either don't book at all ("NDF") or book in the US. This is a multiclass classification problem with <i>imbalanced classes.</i> Because NDF and the US will take two of the five slots for almost every user, the real goal is to maximize accuracy across the other three, less probable destination countries.

In [1]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import GridSearchCV
from xgboost.sklearn import XGBClassifier

np.random.seed(0)

<h2>Data Wrangling & Feature Engineering</h2>

In [2]:
# Load data
df_train = pd.read_csv('data/raw/train_users_2.csv')
df_test = pd.read_csv('data/raw/test_users.csv')
target = df_train['country_destination'].values
df_train = df_train.drop(['country_destination'], axis=1)
id_test = df_test['id']
trainsize = df_train.shape[0]

In [3]:
# Concatenate the train and test data, to wrangle both datasets at once
df_all = pd.concat((df_train, df_test), axis=0, ignore_index=True)
# Remove date first booking because it does not appear in the test data
df_all = df_all.drop(['date_first_booking'], axis=1)
# Fill in null values
df_all = df_all.fillna(-1)

<h3>Datetime Objects</h3>

To break the features "date_account_created" and "timestamp_first_active" into year, month, day, and weekday, first convert each of them into datetime objects, then create four new features from each, then delete the originals.

It would have been possible to break these dates down into any number of features. In a previous exploration, I tried using just the week of the year (i.e., Jan 1-7 is the first week, Jan 8-14 is the second week, and so on).
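As a rough illustration of these date features, here is a minimal sketch on a few made-up dates (the `dates` series and the `week_of_year` column name are assumptions for this example, not part of the notebook's data). It uses the pandas `.dt` accessor and computes the week of the year in the simple "Jan 1-7 is week 1" sense described above, which is not the same as the ISO calendar week.

```python
import pandas as pd

# Toy stand-in for a date column (made-up dates, for illustration only)
dates = pd.to_datetime(pd.Series(['2014-01-01', '2014-01-07',
                                  '2014-01-08', '2014-12-31']))

features = pd.DataFrame({
    'year': dates.dt.year,
    'month': dates.dt.month,
    'day': dates.dt.day,
    'weekday': dates.dt.dayofweek,  # Monday=0 ... Sunday=6
    # Simple week of year: Jan 1-7 -> 1, Jan 8-14 -> 2, ...
    'week_of_year': (dates.dt.dayofyear - 1) // 7 + 1,
})
print(features)
```

Note that under this definition the last day or two of the year falls into a short week 53.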

In [4]:
dac = pd.to_datetime(df_all['date_account_created'])
# The .dt accessor extracts each date part for the whole column at once
df_all['dac_year'] = dac.dt.year
df_all['dac_month'] = dac.dt.month
df_all['dac_day'] = dac.dt.day
df_all['dac_weekday'] = dac.dt.dayofweek
df_all = df_all.drop(['date_account_created'], axis=1)

# The column holds YYYYMMDDHHMMSS digits, which read_csv may load as integers;
# cast to string so the format parses reliably
tfa = pd.to_datetime(df_all['timestamp_first_active'].astype(str),
                     format='%Y%m%d%H%M%S')
df_all['tfa_year'] = tfa.dt.year
df_all['tfa_month'] = tfa.dt.month
df_all['tfa_day'] = tfa.dt.day
df_all['tfa_weekday'] = tfa.dt.dayofweek
df_all = df_all.drop(['timestamp_first_active'], axis=1)

<h3>Age</h3>

Age is the dirtiest feature. There are missing values, illogical ages (under 18, over 100), and birth years and current years incorrectly entered as ages. Birth years are converted to ages; every other kind of error is replaced with its own dummy number, well separated from the valid ages and from the other dummy numbers.

Age is kept as a single numeric (linear) feature. It would also be possible to bin the ages into categories, then either keep those categories ordinal or one hot encode them. Doing so loses a bit of information, but may make for easier-to-interpret results. (If there's a particular age group you think would stand out, it'd be easier to check via one hot encoding.)
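The binning alternative could look something like the sketch below. The cut points, labels, and toy ages are all hypothetical choices for illustration; the notebook itself does not bin ages.

```python
import pandas as pd

ages = pd.Series([19, 25, 34, 47, 62, 85], name='age')  # made-up, already-clean ages
bins = [18, 30, 45, 60, 100]                            # hypothetical cut points
labels = ['18-29', '30-44', '45-59', '60-99']

# right=False makes each bin include its lower edge: [18, 30), [30, 45), ...
age_group = pd.cut(ages, bins=bins, labels=labels, right=False)

# Option 1: keep the groups ordinal (0, 1, 2, 3), still a single numeric feature
age_ordinal = age_group.cat.codes

# Option 2: one hot encode the groups, one column per age group
age_onehot = pd.get_dummies(age_group, prefix='age')
print(age_onehot)
```

With the real data, the dummy values used for bad entries would fall outside these bins, so they would need bins (or categories) of their own.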

In [5]:
ages = df_all.age.values

# Convert birth years to ages:
# (Data was collected in 2014, so 2014 minus birth year = age)
ages = np.where(np.logical_and(ages < 2000, ages > 1900), 2014-ages, ages)
# Create a dummy number for all ages below 18:
ages = np.where(np.logical_and(ages < 18, ages > 0), 4, ages)
# Create a dummy number for all entries with current year instead of age:
ages = np.where(np.logical_and(ages < 2016, ages > 2010), -5, ages)
# Create a dummy number for all entries allegedly aged 100 or older:
ages = np.where(ages > 99, 110, ages)
# Store to age
df_all['age'] = ages

<h3>One Hot Encoding</h3>

Most machine learning algorithms can only handle numerical inputs, but you can still pass categorical data to the algorithm by one hot encoding. This involves creating a new dummy feature for each category, then assigning a 1 to that feature if the original feature was that category (hot) or a 0 if it wasn't.

For example, "Browser: Firefox" can be coded to "Is_Firefox: 1", "Is_Chrome: 0", "Is_Safari: 0", and "Is_Edge: 0".
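The browser example can be reproduced on a toy table with `pd.get_dummies`. The four rows and the `is` prefix below are made up for illustration; the notebook's own call uses the original column names as prefixes.

```python
import pandas as pd

# Toy table with a single categorical column (made-up values)
browsers = pd.DataFrame({'first_browser': ['Firefox', 'Chrome', 'Safari', 'Firefox']})

# One new 0/1 column per category; dtype=int shows 1/0 instead of True/False
encoded = pd.get_dummies(browsers, prefix='is', columns=['first_browser'], dtype=int)
print(encoded)
```

Only categories that actually occur in the data get a column, which is one reason to encode the train and test sets together, as this notebook does.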

In [6]:
dummiescols = ['gender', 'signup_method', 'signup_flow', 'language',
               'affiliate_channel', 'affiliate_provider',
               'first_affiliate_tracked', 'signup_app', 'first_device_type',
               'first_browser']
df_all = pd.get_dummies(df_all, prefix=dummiescols, columns=dummiescols)

<h3>Sessions Data: Split-Apply-Combine</h3>

The Airbnb data set has one file containing the user data (dealt with above) and a separate file of sessions data. No sessions data is given for test users, only training users, so the sessions data can only help sharpen our picture of the training set. Even so, including it lets us append a lot of new features to the user training data, in the hope of finding commonalities between users that carry over to the user (ID) data.

We want to create summary statistics about the sessions users logged on the site: what kind of actions they performed, how often they performed them, and how long they spent between actions.
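For reference, the same kind of per-user summary can be sketched with a single `groupby(...).agg(...)` call. This is an alternative formulation on a made-up five-row sessions table, not the code this notebook runs; the output column names simply mirror the ones used in the notebook's aggregate table.

```python
import pandas as pd

# Made-up sessions for two users
sessions = pd.DataFrame({
    'id': ['u1', 'u1', 'u1', 'u2', 'u2'],
    'action': ['search', 'search', 'show', 'show', 'book'],
    'device_type': ['Mac', 'Mac', 'iPhone', 'Mac', 'Mac'],
    'secs_elapsed': [10.0, 20.0, 30.0, 5.0, 15.0],
})

# Split by user, apply several summary functions, combine into one row per user
summary = sessions.groupby('id').agg(
    action_count=('action', 'size'),
    unique_action_count=('action', 'nunique'),
    device_type_count=('device_type', 'nunique'),
    sum_secs=('secs_elapsed', 'sum'),
    mean_secs=('secs_elapsed', 'mean'),
    # ddof=0 gives the population standard deviation, matching np.std's default
    stdev_secs=('secs_elapsed', lambda s: s.std(ddof=0)),
).reset_index()
print(summary)
```

On the full sessions file this form tends to be much faster than a Python loop over groups, at the cost of being a little less explicit about each step.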

In [7]:
df_sess = pd.read_csv('data/raw/sessions.csv')
# Rename user_id to id so the column name is exactly right for our merge
df_sess = df_sess.rename(columns={'user_id': 'id'})

In [8]:
# Fill nan with 'NAN' in all columns except id and seconds
fillcols = ['action', 'action_type', 'action_detail', 'device_type']
df_sess[fillcols] = df_sess[fillcols].fillna('NAN')

In [9]:
# Group by a plain column name (not a one-element list) so each key is the id itself
grouped = df_sess.groupby('id')
# Initialize an empty list to append each id's feature sets
id_features = []
count = 0
last = len(grouped)
for key, table in grouped:
    if count % 10000 == 0:
        print(f"Processing {count} of {last}")
    # Initialize an empty list to append features
    # (named `row` to avoid shadowing the built-in `list`)
    row = []
    # Append the user id to the list
    row.append(key)
    # Append the number of actions taken
    row.append(len(table))
    # Append the number of unique actions taken
    row.append(table['action'].nunique())
    # Append the number of unique action details taken
    row.append(table['action_detail'].nunique())
    # Append the number of unique device types used
    row.append(table['device_type'].nunique())
    # Append the sum of seconds elapsed
    row.append(np.sum(table['secs_elapsed']))
    # Append the mean number of seconds elapsed
    row.append(np.mean(table['secs_elapsed']))
    # Append the standard deviation of seconds elapsed
    row.append(np.std(table['secs_elapsed']))

    # Append this data as a row of id_features.
    id_features.append(row)
    count += 1

# Create the aggregate sessions table
new_features = ['id', 'action_count', 'unique_action_count',
                'unique_detail_count', 'device_type_count',
                'sum_secs', 'mean_secs', 'stdev_secs']

# Keep id_features as a plain list of rows: converting a mix of string ids
# and numbers to a NumPy array would cast every value to a string

Processing 0 of 135483
Processing 10000 of 135483
Processing 20000 of 135483
Processing 30000 of 135483
Processing 40000 of 135483
Processing 50000 of 135483
Processing 60000 of 135483
Processing 70000 of 135483
Processing 80000 of 135483
Processing 90000 of 135483
Processing 100000 of 135483
Processing 110000 of 135483
Processing 120000 of 135483
Processing 130000 of 135483


In [10]:
df_sess_agg = pd.DataFrame(id_features, columns=new_features)
# Replace infinity and negative infinity with NaN; much easier to handle
df_sess_agg = df_sess_agg.replace([np.inf, -np.inf], np.nan)
df_sess_agg = df_sess_agg.fillna(-2)

In [11]:
# Merge sessions data onto user feature data
df_all = pd.merge(df_all, df_sess_agg, on='id', how='left')
df_all = df_all.fillna(-2) # Summary statistics for users with no session data

# Split back to training and testing data
# (.copy() avoids pandas' SettingWithCopyWarning when adding the target column)
df_train = df_all[:trainsize].copy()
# Add target column back
df_train['country_destination'] = target
# One more check to make sure no bad values
df_train = df_train.replace([np.inf, -np.inf], np.nan)
df_train = df_train.fillna(-2)

# Repeat for test rows
df_test = df_all[trainsize:].copy()
df_test = df_test.replace([np.inf, -np.inf], np.nan)
df_test = df_test.fillna(-2)


In [12]:
# Export data before proceeding

df_train.to_csv('data/processed/train_users_2.csv', index=False)
df_test.to_csv('data/processed/test_users.csv', index=False)

<h2>Model Training & Hyperparameter Selection</h2>

In [13]:
target = df_train['country_destination'].values
df_train = df_train.drop(['id', 'country_destination'], axis=1)
id_test = df_test['id']
df_test = df_test.drop(['id'], axis=1)

In [14]:
# Convert the features to NumPy arrays and encode the target classes as integers
# so the algorithm can process them
X_train = df_train.to_numpy()
le = LabelEncoder()
y_train = le.fit_transform(target)
X_test = df_test.to_numpy()

<h3>XGBoost Classifier</h3>

Here we are creating the algorithm that will classify our data (i.e. try to predict who is going where). XGBoost uses gradient boosting of decision trees, not a random forest: the trees are built one after another rather than independently. We create a weak decision tree, evaluate its predictions combined with the predictions of each previously generated tree, and then create the next tree <i>to correct the errors that the ensemble still makes.</i> After the model is trained, we'll use the whole ensemble of trees to predict data it's never seen before, and see how well it does.
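To make the "each new tree corrects the current errors" idea concrete, here is a deliberately simplified sketch of boosting for a toy regression problem with squared error, where correcting the errors amounts to fitting each tree to the residuals. This is not XGBoost's actual algorithm (which also uses second-order gradient information and regularization), and the data and settings are made up.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Made-up 1-D regression data
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.3
prediction = np.full_like(y, y.mean())      # start from a constant guess
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(25):
    residuals = y - prediction              # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)                  # weak tree fit to the current errors
    prediction += learning_rate * tree.predict(X)  # small corrective step
    mse_history.append(np.mean((y - prediction) ** 2))

print(f"training MSE: {mse_history[0]:.3f} -> {mse_history[-1]:.3f}")
```

The three knobs in this loop (number of rounds, step size, tree depth) play the same roles as `n_estimators`, `learning_rate`, and `max_depth` in the classifier.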

We are choosing the objective 'multi:softprob' because we want the top five choices, not just the top choice. This objective gives us, for each user, a table of the probabilities of belonging to each class; we then keep only the five most probable.
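Cutting a probability table down to the top five classes can be sketched as follows. The six classes and the probabilities are made up for illustration; only the argsort-then-slice pattern matters.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical subset of destination labels
le = LabelEncoder().fit(['AU', 'CA', 'FR', 'NDF', 'US', 'other'])

# Made-up probabilities for two users; columns follow the order of le.classes_
proba = np.array([
    [0.01, 0.02, 0.04, 0.60, 0.30, 0.03],
    [0.02, 0.03, 0.10, 0.25, 0.40, 0.20],
])

# Sort each row's class indices by ascending probability, reverse, keep five
top5_idx = np.argsort(proba, axis=1)[:, ::-1][:, :5]
# Map the integer class indices back to their original labels
top5 = le.classes_[top5_idx]
print(top5)
```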

In [15]:
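# Note: the parameter is spelled colsample_bytree. The original run used the
# misspelling colsample_by_tree, which was ignored (the recorded output shows
# colsample_bytree=1), so results with this corrected spelling may differ.
xgb = XGBClassifier(max_depth=6, learning_rate=0.3, n_estimators=25,
                    objective='multi:softprob', subsample=0.5,
                    colsample_bytree=0.5, seed=0)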

<h3>Hyperparameter Selection</h3>

XGBClassifier has a lot of hyperparameters to choose. How many trees should be built (n_estimators)? How strongly does each new tree correct the previous ones' mistakes (learning_rate)? How many questions can any one decision tree ask when trying to classify (max_depth)?

There's no answer to any of these questions that works in every situation, so we will build the model multiple times and check which settings work best for this problem. We do this by cross-validating: training the model on Third A and Third B of the training set (letting it see the answers) and testing on Third C (withholding the answers and making it guess), then repeating with Third A held out (training on B and C) and with Third B held out (training on A and C). We take the settings that perform best, on average, across the three held-out thirds.
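The three-way rotation can be seen directly on a toy example. The sketch below uses `KFold` on nine made-up samples so the thirds are easy to read; as far as I know, `GridSearchCV` with an integer `cv` and a classifier uses stratified folds instead, which keep class proportions similar in each third, but the rotation is the same.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(9).reshape(9, 1)  # nine made-up samples, indexed 0-8

folds = []
for train_idx, test_idx in KFold(n_splits=3).split(X):
    folds.append((train_idx.tolist(), test_idx.tolist()))
    print(f"train on {train_idx.tolist()}, test on {test_idx.tolist()}")
```

Each sample lands in the held-out third exactly once, so every row of the training set is used for both fitting and evaluation, just never in the same round.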

In [16]:
# Select the parameters to tune
# Note by this grid, we are making 27 different models (3 x 3 x 3)
param_grid = {'n_estimators': [25, 50, 100],
              'learning_rate': [0.3, 0.2, 0.1],
              'max_depth': [5, 6, 7]}

# With 3-fold cross-validation, it comes to 81
clf = GridSearchCV(xgb, param_grid, cv=3)
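As a quick check on the arithmetic, the number of candidate models and fits can be counted with scikit-learn's `ParameterGrid`. This is a small side calculation, not part of the training pipeline.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {'n_estimators': [25, 50, 100],
              'learning_rate': [0.3, 0.2, 0.1],
              'max_depth': [5, 6, 7]}

n_candidates = len(ParameterGrid(param_grid))  # every combination: 3 * 3 * 3
n_fits = n_candidates * 3                      # one fit per candidate per fold
print(n_candidates, n_fits)
```

With `refit=True` (the default), GridSearchCV then trains one more model on the full training set using the best combination.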

In [17]:
clf.fit(X_train, y_train)

GridSearchCV(cv=3, error_score='raise-deprecating',
       estimator=XGBClassifier(base_score=0.5, booster='gbtree', colsample_by_tree=0.5,
       colsample_bylevel=1, colsample_bytree=1, gamma=0, learning_rate=0.3,
       max_delta_step=0, max_depth=6, min_child_weight=1, missing=None,
       n_estimators=25, n_jobs=1, nthread=None, objective='multi:softprob',
       random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
       seed=0, silent=True, subsample=0.5),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'n_estimators': [25, 50, 100], 'learning_rate': [0.3, 0.2, 0.1], 'max_depth': [5, 6, 7]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [23]:
# Print the hyperparameters that worked the best
clf.best_params_

{'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 25}

In [25]:
# Update the model with the best parameters and train it one more time

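# Note: colsample_bytree is the correct spelling. The original run used the
# misspelling colsample_by_tree, which was ignored (see the recorded output).
xgb_best = XGBClassifier(max_depth=5, learning_rate=0.1, n_estimators=25,
                    objective='multi:softprob', subsample=0.5,
                    colsample_bytree=0.5, seed=0)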

xgb_best.fit(X_train, y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_by_tree=0.5,
       colsample_bylevel=1, colsample_bytree=1, gamma=0, learning_rate=0.1,
       max_delta_step=0, max_depth=5, min_child_weight=1, missing=None,
       n_estimators=25, n_jobs=1, nthread=None, objective='multi:softprob',
       random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
       seed=0, silent=True, subsample=0.5)

In [27]:
# Save the model for later use
xgb_best.save_model('airbnbXGB.model')

In [28]:
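# Use the best model to predict class probabilities for the test set
# (with refit=True, clf delegates to the estimator refit on the best parameters)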
y_pred = clf.predict_proba(X_test)

In [39]:
# Keep only the 5 classes with the highest probabilities,
# return them to classes rather than just numbers,
# and export them to a list
ids = [] # List of ids
countries = [] # List of countries
for i in range(len(id_test)):
    this_id = id_test.iloc[i]
    ids += [this_id] * 5
    countries += le.inverse_transform(np.argsort(y_pred[i])[::-1])[:5].tolist()
    
# Generate submission. This could be uploaded to Kaggle
submission = pd.DataFrame(np.column_stack((ids, countries)),
                          columns=['id', 'country'])
submission.to_csv('submission.csv', index=False)

<h2>Conclusion</h2>

We now have a model that predicts the five most likely destinations a new Airbnb user will choose for their first booking. If you upload "submission.csv" to Kaggle, you'll get a Private Score of <b>.87252</b>, which would have put your submission in the top 25% of final submissions for this competition.