# Tabular Playground Series

In this kernel I have used a simple set of steps to create an ML model to predict continuous output from 14 continuous features. Some of the key steps include:
* Feature Engineering
* Feature Selection using a forward wrapper method
* Cross validation split
* Ensemble Model Training and Prediction

Some additional steps that could be implemented to improve this model's accuracy (at the risk of increasing run time) would be:
* Improve Feature Engineering by making use of more creative methods in tandem with more exploratory data analysis
* Use a Tree based regressor instead of the linear regressor for feature selection
* Optimize and tune the hyperparameters of the gradient boosting regressor by using grid search or alternatively use a more complex gradient boosting moel such as LightGBM or XGBoost.

1. First step will be importing data and storing in a pandas dataframe

In [41]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

data = pd.read_csv('../input/tabular-playground-series-jan-2021/train.csv')
t_data = pd.read_csv('../input/tabular-playground-series-jan-2021/test.csv')
test_data = t_data.iloc[:,1:15]
row_id = t_data.iloc[:,0]
X = data.iloc[:,1:15]
X_copy = X
Y = data.iloc[:,15]
columns = []
for col in X.columns:
    columns.append(col)

2. Next some simple feature engineering is done by adding all the columns of the tabular data to one another

In [43]:
for i in columns:
    for j in columns[columns.index(i)+1:]:
        X[i+'_'+j] = X[i]+X[j]
        test_data[i+'_'+j]= test_data[i]+test_data[j]

3. In order to keep the number of features to a reasonable amount, model specific feature selection is performed, in this case a linear regressor is used to determine the top 20 features by making use of wrapper methods which in turn primarily use p-values to determine the effect a feature has on the target variable 

In [30]:

from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LinearRegression

sfs = SFS(LinearRegression(), k_features = 20, forward = True, floating = False, scoring = 'r2', cv =0)
sfs.fit(X,Y)
top_features = sfs.k_feature_names_
use_data = X.filter(top_features)
test_use_data = test_data.filter(top_features)

4. The data is then prepared to be fed into the model, in order to judge bias and variance of the model we split the training data into a training set and a cross validation set. It is important that the data is binned so we can take a stratified split for the cross validation set

In [32]:
from sklearn.model_selection import train_test_split

data['target'].describe()

bins = [0,5,6,7,8,9,10,11]
y_binned = np.digitize(Y,bins)
X_train,X_val,Y_train,Y_val = train_test_split(use_data,Y, test_size = 0.3, random_state = 42, stratify = y_binned)

5. Finally we make use of an ensemble learning method in the form of a Gradient Boosting Regressor. With the regressor the trick of early stopping is used to determine the number of estimators to use. When the cross validation score does not improve for 3 consecutive estimators the loop is broken and the ideal amount of estimators is determined.

In [33]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

gbrt = GradientBoostingRegressor(max_depth = 5, warm_start = True)
min_score = np.infty
cnt = 0
buffer = 0
for n_estimators in range(1,200):
    gbrt.n_estimators = n_estimators
    gbrt.fit(X_train,Y_train)
    y_pred = gbrt.predict(X_val)
    score = mean_squared_error(y_pred,Y_val)
    
    if score<min_score:
        min_score = score
        best_cnt = n_estimators
        buffer_time = 0
    else:
        buffer+=1
        if buffer>3:
            break
print('best score: ', min_score)
print('Optimal Estimators: ', best_cnt)


best score:  0.5087072200833646
Optimal Estimators:  99


6. Lastly, the data is slightly tweaked so it can be fed into a csv file for submission to the Kaggle Competition

In [45]:
predictions = pd.DataFrame(gbrt.predict(test_use_data))
predictions['id'] = row_id
predictions.set_axis(['target','id'], axis =1, inplace = True)
predictions = predictions[['id','target']]
predictions.to_csv('./submission.csv', index = False)