## Summary

An experimental notebook to see how much of performance we obtain when we combine PCA and xgboost methods.

In [1]:
import pandas as pd
import numpy as np
import time as time
import xgboost as xgb 
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

In [1]:
import sys
sys.path.append('../common_routines/')

### Get input

In [3]:
INPUT_DIR = '../input/'

In [4]:
ts = time.time()
train = pd.read_csv(INPUT_DIR + 'train.csv')
time.time() - ts

5.056200981140137

In [5]:
train['new_target'] = np.log(train['target'] + 1.0)

### Construct the appropriate pipeline

In [6]:
from sklearn.pipeline import Pipeline
def get_pca_xgboost_pipeline(num_pca_components=5):
    pca = PCA(n_components=5, random_state=0)
    my_pipe = Pipeline([('pca', pca),
                        ('model', xgb.XGBRegressor(learning_rate=0.1,
                                                   n_estimators=100,
                                                   objective='reg:squarederror'))])
    return my_pipe

In [7]:
from relevant_functions import get_rel_cols, fit_pipeline_and_cross_validate

In [8]:
REL_COLS = get_rel_cols(17, train)

In [9]:
(my_pipe, cross_val_score1) = fit_pipeline_and_cross_validate(get_pca_xgboost_pipeline(50), 
                                                              train,
                                                              REL_COLS)
print(cross_val_score1)

1.5583826526249025


## Try with a simple validation set.

We are not fully satisfied with the resulta of cross validation. Hence let us test out PCA using one hold out validation set as opposed to a full cross validation.

In [10]:
(train_train, validation_train) = train_test_split(train, random_state=0, shuffle=False)

In [11]:
ALL_COLS = [col for col in train.columns if col not in ['ID', 'target', 'new_target']]

In [12]:
NUM_PCA_COMPONENTS = 20
pca = PCA(n_components=NUM_PCA_COMPONENTS)
train_pca = pca.fit_transform(train_train[ALL_COLS])
validation_pca = pca.transform(validation_train[ALL_COLS])

In [13]:
col_names = ['col' + str(i) for i in np.arange(NUM_PCA_COMPONENTS)]
train_pca_df = pd.DataFrame(data=train_pca, columns=col_names)
train_pca_df['new_target'] = train_train['new_target'].values

validation_pca_df = pd.DataFrame(data=validation_pca, columns=col_names)
validation_pca_df['new_target'] = validation_train['new_target'].values

In [14]:
xgb_train_data = xgb.DMatrix(train_pca_df[col_names], label=train_pca_df[['new_target']], feature_names=col_names)
xgb_validation_data = xgb.DMatrix(validation_pca_df[col_names], 
                                  label=validation_pca_df[['new_target']], 
                                  feature_names=col_names)
xgb_params = {'eta': 0.1, 'eval_metric':'rmse'}

In [15]:
model_1 = xgb.train(params=xgb_params,
                    dtrain=xgb_train_data,
                    num_boost_round=1000,
                    evals=[(xgb_validation_data, 'eval')],
                    early_stopping_rounds=5)

[0]	eval-rmse:12.6301
Will train until eval-rmse hasn't improved in 5 rounds.
[1]	eval-rmse:11.3838
[2]	eval-rmse:10.2637
[3]	eval-rmse:9.25898
[4]	eval-rmse:8.35795
[5]	eval-rmse:7.54987
[6]	eval-rmse:6.82565
[7]	eval-rmse:6.17702
[8]	eval-rmse:5.59863
[9]	eval-rmse:5.08492
[10]	eval-rmse:4.62313
[11]	eval-rmse:4.21696
[12]	eval-rmse:3.85466
[13]	eval-rmse:3.53544
[14]	eval-rmse:3.25585
[15]	eval-rmse:3.0071
[16]	eval-rmse:2.793
[17]	eval-rmse:2.60812
[18]	eval-rmse:2.44642
[19]	eval-rmse:2.30484
[20]	eval-rmse:2.18564
[21]	eval-rmse:2.08604
[22]	eval-rmse:1.99906
[23]	eval-rmse:1.92883
[24]	eval-rmse:1.86981
[25]	eval-rmse:1.82069
[26]	eval-rmse:1.77904
[27]	eval-rmse:1.74714
[28]	eval-rmse:1.72119
[29]	eval-rmse:1.69818
[30]	eval-rmse:1.67999
[31]	eval-rmse:1.66603
[32]	eval-rmse:1.6515
[33]	eval-rmse:1.64385
[34]	eval-rmse:1.6346
[35]	eval-rmse:1.62925
[36]	eval-rmse:1.62512
[37]	eval-rmse:1.62128
[38]	eval-rmse:1.61875
[39]	eval-rmse:1.61544
[40]	eval-rmse:1.61362
[41]	eval-rmse:1

## Conclusion

The benefits from PCA look to be plateauing very fast and hence does not offer us much benefit.