In [1]:
%autosave 0

Autosave disabled


In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import xgboost as xgb
import myfunctions

The import_train function holds the relative path to the prepared data folder within the repo. It reads and returns the train data with the corresponding labels.

In [3]:
train, train_labels = myfunctions.import_train()
train.shape, train_labels.shape

((470744, 190), (39007, 2))

The initial_prep function drops the date column 'S_2'.

In [4]:
train = myfunctions.initial_prep(train)

The train_null_counter function groups the dataframe by customer ID and counts the number of nulls in each feature per customer. It returns a dataframe of new features containing these counts. The counts are scaled with the StandardScaler from sklearn so that they fall in a range similar to the raw data. The returned dataframe has a new size: one row per customer. The function also returns the fitted StandardScaler object for use on the validate/test subsets.

In [5]:
null_df, ss = myfunctions.train_null_counter(train)
null_df.shape

(39007, 188)
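As a rough illustration of this step (not the actual myfunctions code), the per-customer null counts could be built as below. The function name `null_counter_sketch`, the `_null_count` suffix, and the optional `scaler` argument are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler


def null_counter_sketch(df, id_col="customer_ID", scaler=None):
    """Count nulls per feature per customer, then standardize the counts.

    Hypothetical stand-in for train_null_counter / valid_null_counter:
    pass scaler=None to fit a new scaler, or a fitted one to reuse it.
    """
    feature_cols = [c for c in df.columns if c != id_col]
    # isna() gives a boolean frame; summing within each customer counts nulls.
    counts = df[feature_cols].isna().groupby(df[id_col]).sum()
    counts.columns = [f"{c}_null_count" for c in feature_cols]
    if scaler is None:
        scaler = StandardScaler().fit(counts)
    scaled = pd.DataFrame(
        scaler.transform(counts), index=counts.index, columns=counts.columns
    )
    return scaled, scaler


null_demo_raw = pd.DataFrame({
    "customer_ID": ["a", "a", "b", "b", "b"],
    "P_2": [0.1, np.nan, 0.3, 0.4, 0.5],
    "B_1": [np.nan, np.nan, 1.0, np.nan, 2.0],
})
null_df_demo, ss_demo = null_counter_sketch(null_demo_raw)
```

Reusing the scaler fitted on train keeps the validation counts on the same scale instead of re-centering them on their own mean.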

The handle_categories function has the list of categorical columns found on the competition data page. It creates a dummy dataframe of these columns, including Null as a potential value for each feature. This function returns a new dataframe with one row per customer, with each value being the last recorded value for that customer. It also returns the list of categorical columns, which must be fed into the next function to determine the numerical columns.

In [6]:
dummy_df, cat_columns = myfunctions.handle_categories(train)
dummy_df.shape

(39007, 21)
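A minimal sketch of the idea, assuming the rows are already sorted by statement date within each customer. The name `last_value_dummies` and the single demo column are hypothetical; the real function works from the competition's list of categorical columns.

```python
import pandas as pd


def last_value_dummies(df, cat_columns, id_col="customer_ID"):
    """One-hot encode categorical columns and keep each customer's last row."""
    # dummy_na=True adds an explicit "<col>_nan" indicator for missing values.
    dummies = pd.get_dummies(
        df[cat_columns], columns=cat_columns, dummy_na=True, dtype=int
    )
    dummies[id_col] = df[id_col]
    # tail(1) keeps the final row of each group, even if that row is null.
    return dummies.groupby(id_col).tail(1).set_index(id_col)


cat_demo = pd.DataFrame({
    "customer_ID": ["a", "a", "b", "b"],
    "D_63": ["CO", "CR", "CO", None],
})
dummy_demo = last_value_dummies(cat_demo, ["D_63"])
```

Using `tail(1)` rather than the groupby `last` aggregation matters here: `last` skips nulls by default, which would hide a customer whose most recent value is missing.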

The cap_numerical_columns function uses the categorical columns list to create a list of numerical columns. The values for each numerical feature are clipped to a lower bound of -3 and an upper bound of 3. Given the features in the dataset are z-score normalized, values beyond ±3 would account for only about 0.3% of the area under a bell curve, so clipping outliers to -3 and 3 still leaves them at the extreme edge of the distribution. The function returns a dataframe the same size as the original train dataframe. It also returns the list of numerical columns, which is used to impute the nulls in numerical columns and calculate aggregate features for these columns.

In [7]:
train, num_columns = myfunctions.cap_numerical_columns(train, cat_columns)
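The clipping itself, and the tail area it refers to, can be sketched as follows. `clip_numeric_sketch` is a hypothetical name, and the tail calculation assumes the features really are standard normal, which the data need not satisfy exactly.

```python
import math

import numpy as np
import pandas as pd


def clip_numeric_sketch(df, num_columns, lower=-3.0, upper=3.0):
    """Clip the numerical columns to [lower, upper]; nulls pass through."""
    out = df.copy()
    out[num_columns] = out[num_columns].clip(lower=lower, upper=upper)
    return out


clip_demo = pd.DataFrame({"P_2": [-7.5, 0.2, np.nan, 4.0]})
clipped_demo = clip_numeric_sketch(clip_demo, ["P_2"])

# Two-sided probability that a standard normal value lies beyond +/-3.
tail_mass = math.erfc(3 / math.sqrt(2))
```

For a standard normal variable `tail_mass` comes out at roughly 0.27%, so very few values should be affected by the cap if the normalization holds.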

The impute_numerical_nulls function imputes -5 for all null values in the numerical columns. This value was chosen because it sits outside the range of possible values (-3 to 3), but not so far that it drastically skews some of the aggregate features. The optional third argument is set to -5 by default and can be changed when calling the function.

In [9]:
train = myfunctions.impute_numerical_nulls(train, num_columns)
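For illustration only, the imputation with its optional third argument might look like this. The name `impute_sketch` and the keyword `fill_value` are assumptions; only the default of -5 comes from the description above.

```python
import numpy as np
import pandas as pd


def impute_sketch(df, num_columns, fill_value=-5):
    """Replace nulls in the numerical columns with a sentinel value."""
    out = df.copy()
    out[num_columns] = out[num_columns].fillna(fill_value)
    return out


imp_demo = pd.DataFrame({"customer_ID": ["a", "a"], "P_2": [np.nan, 1.5]})
imputed_demo = impute_sketch(imp_demo, ["P_2"])
custom_demo = impute_sketch(imp_demo, ["P_2"], fill_value=-4)
```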

The aggregate_features function creates many new features for the dataset: the minimum, maximum, median, standard deviation, last, and change for each feature by customer. The features are calculated for each numerical column and added as columns to a dataframe until all numerical columns have been handled.

In [11]:
agg_features = myfunctions.aggregate_features(train, num_columns)
agg_features.shape

(39007, 1063)
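A compact sketch of the same aggregation with a single groupby. Two things are assumptions here: the name `aggregate_sketch`, and the definition of "change" as last value minus first value, which the notebook does not spell out.

```python
import pandas as pd


def aggregate_sketch(df, num_columns, id_col="customer_ID"):
    """Per-customer min, max, median, std, last and change for each column."""
    stats = ["min", "max", "median", "std", "last", "first"]
    agg = df.groupby(id_col)[num_columns].agg(stats)
    # Flatten the (column, statistic) MultiIndex into names like "P_2_min".
    agg.columns = [f"{col}_{stat}" for col, stat in agg.columns]
    for col in num_columns:
        # Assumed definition of "change": last recorded value minus first.
        agg[f"{col}_change"] = agg[f"{col}_last"] - agg[f"{col}_first"]
    # "first" was only needed to compute the change.
    return agg.drop(columns=[f"{col}_first" for col in num_columns])


agg_input_demo = pd.DataFrame({
    "customer_ID": ["a", "a", "a", "b", "b"],
    "P_2": [0.1, 0.4, 0.3, 1.0, -5.0],
})
agg_demo = aggregate_sketch(agg_input_demo, ["P_2"])
```

Note that pandas computes the sample standard deviation, so a customer with a single statement would get a null in the `_std` column.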

At this point we have three dataframes containing our engineered features: the null counts, the dummy features for the categorical columns, and the aggregate features for the numerical columns.

The concat_dataframes function combines the three feature dataframes into one.

In [24]:
train_final = myfunctions.concat_dataframes(agg_features, dummy_df, null_df)
train_final.shape

(39007, 1271)
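One way such a combination can be written, assuming all three dataframes are indexed by customer_ID. The name `concat_sketch` and the demo columns are hypothetical.

```python
import pandas as pd


def concat_sketch(*frames):
    """Join feature frames side by side, aligning rows on the shared index."""
    return pd.concat(frames, axis=1, join="inner")


idx = pd.Index(["a", "b"], name="customer_ID")
agg_part = pd.DataFrame({"P_2_last": [0.3, -5.0]}, index=idx)
dummy_part = pd.DataFrame({"D_63_CO": [1, 0]}, index=idx)
# Deliberately in reverse row order to show that concat aligns on the index.
null_part = pd.DataFrame({"P_2_null_count": [-1.0, 1.0]}, index=idx[::-1])

final_demo = concat_sketch(agg_part, dummy_part, null_part)
```

Aligning on the index rather than on row position guards against the three dataframes listing customers in different orders.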

At this point, I will cache this version of my features. It took roughly 30 minutes to execute the functions in this notebook and I don't want to take that amount of time to reach this point again.

In [25]:
train_final.to_csv('train_final.csv')
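Reading the cache back needs a little care, because `to_csv` writes the index as the first column. A small round trip, assuming the index is named customer_ID (the temporary file name is made up for the example):

```python
import os
import tempfile

import pandas as pd

cache_demo = pd.DataFrame(
    {"P_2_last": [0.1, 0.2]},
    index=pd.Index(["a", "b"], name="customer_ID"),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "train_final_demo.csv")
    cache_demo.to_csv(path)  # the index is written as the first column
    # index_col restores customer_ID as the index instead of a data column.
    reloaded = pd.read_csv(path, index_col="customer_ID")
```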

I will repeat the preparation on the validation set.

In [27]:
base_url = '../../data/prepared/'

valid = pd.read_csv(base_url + 'val_data.csv')
valid_labels = pd.read_csv(base_url + 'val_labels.csv')

In [28]:
valid = myfunctions.initial_prep(valid)

In [29]:
valid_null = myfunctions.valid_null_counter(valid, ss)

In [30]:
valid_dummies, cat_columns = myfunctions.handle_categories(valid)

In [31]:
valid, num_columns = myfunctions.cap_numerical_columns(valid, cat_columns)

In [32]:
valid = myfunctions.impute_numerical_nulls(valid, num_columns)

In [34]:
valid_agg = myfunctions.aggregate_features(valid, num_columns)

In [35]:
valid_agg.set_index('customer_ID', inplace=True)

In [36]:
valid_final = myfunctions.concat_dataframes(valid_agg, valid_dummies, valid_null)
valid_final.shape

(214152, 1271)

In [37]:
valid_final.to_csv('valid_final.csv')

With both prepared dataframes cached to my local machine, I can move forward with training using the validation subset as the evaluation set.

First I will do a left join of the target labels onto my feature dataframes, to ensure the features and the target will be on the same row for all observations.

In [40]:
train_complete = train_final.merge(train_labels, how='left', on='customer_ID')
valid_complete = valid_final.merge(valid_labels, how='left', on='customer_ID')
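A left join can silently duplicate rows or leave targets null if the keys do not line up. The toy example below shows one way to guard against both, using the `validate` argument of `merge`; the demo frames are made up.

```python
import pandas as pd

features_demo = pd.DataFrame({
    "customer_ID": ["a", "b", "c"],
    "P_2_last": [0.1, 0.2, 0.3],
})
labels_demo = pd.DataFrame({
    "customer_ID": ["c", "a", "b"],
    "target": [1, 0, 0],
})

# validate raises MergeError if either side has duplicate customer IDs.
merged_demo = features_demo.merge(
    labels_demo, how="left", on="customer_ID", validate="one_to_one"
)
# A left join leaves NaN where a customer has no label, so check for that.
all_labelled = bool(merged_demo["target"].notna().all())
```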

Next I define my DMatrices for train and validation using the dataframes from the previous step. I include all my features except the customer ID and the target variable as the training data, and I set the target as the label for the data.

In [44]:
train_matrix = xgb.DMatrix(train_complete.drop(columns=['customer_ID', 'target']), label=train_complete.target)
valid_matrix = xgb.DMatrix(valid_complete.drop(columns=['customer_ID', 'target']), label=valid_complete.target)

Now that my matrices are created, I can move forward with defining the hyperparameters of the XGBoost model and training it on my data. The hyperparameters chosen for the initial pass are based on experimentation done in the xgboost-hyperparameter-tuning notebook.

In [45]:
steps = 1000
seed = 42

params = {
    'verbosity': 1,
    'max_depth': 4,
    'objective': 'binary:logistic',
    'eta': 0.15,
    'random_state': seed,
    'colsample_bytree': 0.8,
    'colsample_bylevel': 0.8
}

In [47]:
model = xgb.train(params, train_matrix, steps, early_stopping_rounds=3,
                  evals=[(train_matrix, 'Train'), (valid_matrix, 'Valid')])

[0]	Train-logloss:0.60497	Valid-logloss:0.60545
[1]	Train-logloss:0.53857	Valid-logloss:0.53966
[2]	Train-logloss:0.48732	Valid-logloss:0.48899
[3]	Train-logloss:0.44650	Valid-logloss:0.44867
[4]	Train-logloss:0.41350	Valid-logloss:0.41627
[5]	Train-logloss:0.38661	Valid-logloss:0.39003
[6]	Train-logloss:0.36436	Valid-logloss:0.36833
[7]	Train-logloss:0.34600	Valid-logloss:0.35075
[8]	Train-logloss:0.33049	Valid-logloss:0.33584
[9]	Train-logloss:0.31737	Valid-logloss:0.32327
[10]	Train-logloss:0.30575	Valid-logloss:0.31216
[11]	Train-logloss:0.29597	Valid-logloss:0.30296
[12]	Train-logloss:0.28773	Valid-logloss:0.29523
[13]	Train-logloss:0.28026	Valid-logloss:0.28826
[14]	Train-logloss:0.27385	Valid-logloss:0.28255
[15]	Train-logloss:0.26794	Valid-logloss:0.27725
[16]	Train-logloss:0.26297	Valid-logloss:0.27287
[17]	Train-logloss:0.25843	Valid-logloss:0.26888
[18]	Train-logloss:0.25460	Valid-logloss:0.26562
[19]	Train-logloss:0.25104	Valid-logloss:0.26270
[20]	Train-logloss:0.24797	Val

The model achieves the lowest validation logloss to date. I bring in a function to calculate the amex_metric for my model (will be added to my script ASAP).

In [50]:
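def model_evaluator(model, data, y_true):
    """Score a trained booster on a DMatrix with the amex_metric."""
    # Predicted default probabilities for each row of the DMatrix.
    y_hat = model.predict(data)

    # Wrap both in DataFrames before passing them to amex_metric.
    y_true_final = pd.DataFrame(y_true)
    y_hat_final = pd.DataFrame(y_hat, columns=['prediction'])

    return myfunctions.amex_metric(y_true_final, y_hat_final)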
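For context, here is a sketch of the metric as I understand the competition's published definition: the mean of a normalized weighted Gini coefficient (G) and the share of defaults captured in the top 4% of predictions (D), with non-defaults weighted 20 to undo their downsampling. This is not the code in myfunctions; the name `amex_metric_sketch` and its internals are assumptions.

```python
import numpy as np
import pandas as pd


def amex_metric_sketch(y_true, y_pred):
    """0.5 * (G + D) for DataFrames with 'target' and 'prediction' columns."""
    target = y_true["target"].to_numpy()
    score = y_pred["prediction"].to_numpy()

    def ranked(by):
        # Sort rows from highest to lowest score and attach the row weights.
        df = pd.DataFrame({"target": target, "by": by})
        df = df.sort_values("by", ascending=False, kind="stable")
        df["weight"] = np.where(df["target"] == 0, 20, 1)
        return df

    def weighted_gini(df):
        random = (df["weight"] / df["weight"].sum()).cumsum()
        pos_weight = df["target"] * df["weight"]
        lorentz = pos_weight.cumsum() / pos_weight.sum()
        return ((lorentz - random) * df["weight"]).sum()

    def top_four_percent_captured(df):
        cutoff = int(0.04 * df["weight"].sum())
        in_top = df["weight"].cumsum() <= cutoff
        return df.loc[in_top, "target"].sum() / df["target"].sum()

    by_score = ranked(score)
    # Normalize by the Gini of a perfect ranking (sorting by the target).
    g = weighted_gini(by_score) / weighted_gini(ranked(target))
    d = top_four_percent_captured(by_score)
    return 0.5 * (g + d)


truth_demo = pd.DataFrame({"target": [1] * 25 + [0] * 75})
perfect_demo = pd.DataFrame({"prediction": truth_demo["target"].astype(float)})
reversed_demo = pd.DataFrame({"prediction": 1.0 - perfect_demo["prediction"]})

perfect_score = amex_metric_sketch(truth_demo, perfect_demo)
reversed_score = amex_metric_sketch(truth_demo, reversed_demo)
```

A perfect ranking scores 1.0 on this toy data, and a fully reversed ranking scores below zero, which gives a feel for the range of the values reported further down.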

In [53]:
train_metric = model_evaluator(model, train_matrix, train_complete.target)
valid_metric = model_evaluator(model, valid_matrix, valid_complete.target)

In [54]:
train_metric, valid_metric

(0.8763042974477815, 0.7708809200189399)

The model is overfit to my training data: the amex_metric is 0.876 on train versus 0.771 on validation. It still performs well, considering the train dataset has less than one-fifth as many customers as the validation dataset (39,007 versus 214,152). I think I could greatly improve my results moving forward by training on the train and validation datasets combined and using the holdout set on Kaggle as the validation set. There are also opportunities here for hyperparameter tuning to improve the performance of my model.

Key hyperparameters that can be adjusted to reduce overfitting:  
colsample_bytree  
subsample   
max_depth  
gamma  
eta  
min_child_weight  
scale_pos_weight
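To make the list concrete, here is a hypothetical, more conservative variant of the parameters used above. The specific values are illustrative guesses, not tuned results; only the direction of each change reflects how these parameters generally behave.

```python
# Parameters from the training cell above.
base_params = {
    "verbosity": 1,
    "max_depth": 4,
    "objective": "binary:logistic",
    "eta": 0.15,
    "random_state": 42,
    "colsample_bytree": 0.8,
    "colsample_bylevel": 0.8,
}

# Hypothetical values for illustration; each should be validated, not assumed.
regularized_params = {
    **base_params,
    "max_depth": 3,           # shallower trees fit fewer interactions
    "eta": 0.05,              # smaller steps, usually paired with more rounds
    "subsample": 0.8,         # each tree sees a random 80% of the rows
    "colsample_bytree": 0.6,  # each tree sees a random 60% of the columns
    "gamma": 1.0,             # minimum loss reduction required to split
    "min_child_weight": 5,    # minimum hessian sum required in a leaf
}
```

scale_pos_weight is left out of the sketch because it mainly rebalances the classes rather than limiting model complexity.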