## Introduction: Tabular Playground Series - Dec 2021
> The objective of this notebook is to apply a step-by-step approach to solving a tabular data competition on Kaggle.
> 
> The subject of this notebook is [a multiclass classification task](https://www.kaggle.com/c/tabular-playground-series-dec-2021/data).
> 
> The target variable we are predicting consists of 7 different types of forest cover.
>
> The training dataset consists of 4 million labeled samples with features like elevation, soil type, etc.
>
> The provided dataset was synthetically generated by a GAN that was trained on the data from the [Forest Cover Type Prediction](https://www.kaggle.com/c/forest-cover-type-prediction/overview) competition. This dataset is (a) much larger, and (b) may or may not have the same relationship to the target as the original data.
> 
> Please refer to this [data page](https://www.kaggle.com/c/forest-cover-type-prediction/data) for a detailed explanation of the features.
>
> For the purposes of this notebook, I will refer to held-back training data as validation data, and to the data we have to submit predictions on as test data.

## Changelog

* Version 1 - basic EDA and linear model -> 0.209 accuracy on validation dataset
* Version 2 - established null accuracy baseline -> 0.565 accuracy on train dataset
* Version 3 - added train-test split, standard scaler, SGD Classifier -> 0.8808 accuracy on public leaderboard (a fraction of the test dataset)
* Version 3.1 - added Linear SVC -> 0.8805 accuracy on public leaderboard
* Version 4 - added XGBoost -> 0.91796 accuracy on public leaderboard
* Version 4.1 - added CatBoost -> 0.94155 accuracy on public leaderboard
* Version 4.2 - added LightGBM -> 0.92976 accuracy on public leaderboard
* Version 4.3 - dropped class 5, which has only 1 occurrence out of 4 million, updated CatBoost -> 0.94135 accuracy on public leaderboard
* Version 5 - add stratified K fold cross validation

## Import

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# Read datasets to pandas dataframe
df_train = pd.read_csv('/kaggle/input/tabular-playground-series-dec-2021/train.csv')
df_test = pd.read_csv('/kaggle/input/tabular-playground-series-dec-2021/test.csv')
df_sample_submission = pd.read_csv('/kaggle/input/tabular-playground-series-dec-2021/sample_submission.csv')

## Reduce Memory Usage

I have used a compression function by Guillaume Martin which is discussed here: https://www.kaggle.com/c/tabular-playground-series-dec-2021/discussion/291844


In [None]:
def reduce_mem_usage(df, verbose=True):
    """Downcast numeric columns to smaller dtypes that still cover each column's value range."""
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2    
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)    
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose: print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df

In [None]:
df_train = reduce_mem_usage(df_train)
df_test = reduce_mem_usage(df_test)
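
To make the effect of downcasting concrete, here is a minimal sketch on a tiny synthetic frame. It uses `pd.to_numeric(..., downcast="integer")` as a stand-in for the function above, and the two column names are only illustrative.

```python
import numpy as np
import pandas as pd

# Small synthetic frame; the column names are illustrative only
demo = pd.DataFrame({
    "Elevation": np.arange(1000, dtype=np.int64) + 2000,   # values fit in int16
    "Soil_Type1": np.tile([0, 1], 500).astype(np.int64),   # values fit in int8
})
before = demo.memory_usage(index=False).sum()

# Downcast each column to the smallest integer dtype that holds its values
small = demo.copy()
for col in small.columns:
    small[col] = pd.to_numeric(small[col], downcast="integer")

after = small.memory_usage(index=False).sum()
print(small.dtypes)
print(f"{before} bytes -> {after} bytes")
```

The values themselves are unchanged; only the storage width per value shrinks.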

## EDA

In [None]:
# Checking out df_train
df_train.describe()

In [None]:
# Let's see if we have any missing values
# .isna().sum().sum() counts missing cells (.any().sum() would count columns instead)
missing_values_train = df_train.isna().sum().sum()
missing_values_test = df_test.isna().sum().sum()
print(f'There are {missing_values_train} missing values in the train dataset')
print(f'There are {missing_values_test} missing values in the test dataset')

In [None]:
# What are the datatypes for our features?
for col in df_train:
    print(df_train[col].dtype, col)

In [None]:
# Let's see which features are the most correlated with the target
df_train.corr()['Cover_Type'].sort_values()

In [None]:
# Let's establish a baseline: what if we always predict the target's most common class?
# AKA: null accuracy
df_train['Cover_Type'].value_counts(normalize=True).head(1)

Since a model that only predicts class 2 would reach 56.5% accuracy, we can judge the models we create by how much they beat this 'dumb model'.
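
The same baseline can be expressed as a model. The sketch below uses scikit-learn's `DummyClassifier` on made-up toy labels (not the competition data), so the 0.6 it prints only reflects the toy class mix.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy labels standing in for Cover_Type: class 2 is the majority (6 of 10)
y_toy = np.array([2] * 6 + [1] * 3 + [3] * 1)
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by this strategy

# Always predict the most frequent class seen during fit
null_model = DummyClassifier(strategy="most_frequent")
null_model.fit(X_toy, y_toy)

null_accuracy = null_model.score(X_toy, y_toy)
print(null_accuracy)  # share of the majority class in the toy labels
```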

In [None]:
# How imbalanced are the class distributions in our target variable?
df_train.groupby('Cover_Type').size()

There is only 1 occurrence of class 5 and only 377 occurrences of class 4 (out of 4 million samples in the train dataset), so we could arguably drop both. For now, let's just drop class 5.

In [None]:
df_train = df_train[df_train['Cover_Type']!=5]

## Data Preprocessing

If the dataset hadn't already converted categorical features into dummy variables, we would do that here.
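
For reference, a minimal sketch of what that conversion could look like with `pd.get_dummies`. The frame and the `Wilderness_Area` string column are hypothetical; the competition data already ships these as 0/1 columns.

```python
import pandas as pd

# Hypothetical raw data with one categorical column
raw = pd.DataFrame({
    "Elevation": [2596, 2590, 2804],
    "Wilderness_Area": ["A", "B", "A"],
})

# One 0/1 column per category; int8 keeps the result compact
encoded = pd.get_dummies(raw, columns=["Wilderness_Area"], dtype="int8")
print(encoded)
```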

In [None]:
# Create list of features, excluding the id column and the target variable 'Cover_Type'
features = list(df_train.columns)
features = features[1:55]

In [None]:
# Create feature dataframe and target dataframe for training
X = df_train[features]
Y = df_train["Cover_Type"]
# Also create feature dataframe to generate our prediction
X_test = df_test[features]

In [None]:
# Do the train test split before standardizing our features (to prevent data leakage)
# Since the dataset is large we could use a smaller test_size than .2
# Even better would be to implement StratifiedKFold, i.e. 5 folds of .2 with the class imbalance replicated in each fold
from sklearn.model_selection import train_test_split

X_train, X_validate, Y_train, Y_validate = train_test_split( X, Y, test_size=0.2, random_state=2)
print ('Train set:', X_train.shape,  Y_train.shape)
print ('Validation set:', X_validate.shape,  Y_validate.shape)

In [None]:
# Implement StratifiedKFold
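
This cell is still a placeholder. As a rough sketch of the idea, the example below runs 5 stratified folds on synthetic, imbalanced data. `X_demo` and `y_demo` are made-up stand-ins for `X` and `Y`, and the SGD pipeline is just one possible model choice.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for X and Y: 3 imbalanced classes, 5 features
y_demo = np.array([1] * 300 + [2] * 150 + [3] * 50)
X_demo = rng.normal(size=(len(y_demo), 5)) + y_demo[:, None]

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=2)
fold_scores = []
for train_idx, valid_idx in skf.split(X_demo, y_demo):
    # The scaler is fit inside each fold, on that fold's training rows only
    model = make_pipeline(StandardScaler(), SGDClassifier(random_state=2))
    model.fit(X_demo[train_idx], y_demo[train_idx])
    fold_scores.append(model.score(X_demo[valid_idx], y_demo[valid_idx]))

print(np.round(fold_scores, 3), np.mean(fold_scores))
```

Each validation fold keeps the class proportions of the full dataset, so rare classes are not concentrated in a single fold.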

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X_train = scaler.fit_transform(X_train)
X_validate = scaler.transform(X_validate)
# Reuse the scaler fitted on the training split; refitting on the test data would scale it differently
X_test = scaler.transform(X_test)

del df_train, df_test
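
A toy illustration of why a scaler fitted on the training split is usually reused for validation and test data instead of being refitted. The numbers are made up to keep the arithmetic easy to follow.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[0.0], [10.0]])      # mean 5, standard deviation 5
test = np.array([[100.0], [110.0]])    # shifted relative to the training data

scaler_demo = StandardScaler().fit(train)
shared = scaler_demo.transform(test)           # uses the training mean and std
refit = StandardScaler().fit_transform(test)   # uses the test set's own statistics

print(shared.ravel())  # [19. 21.]
print(refit.ravel())   # [-1.  1.]
```

With the refitted scaler the shift between the two datasets disappears, so a model trained on the training scale would see test values that mean something different.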

## Modeling

We are predicting a category, have labeled data, and have >100K samples.

done-ish:
* SGD
* Linear SVC
* XGBoost
* CatBoost
* Light GBM

may add:
* Random Forest
* KNeighbors Classifier
* SVC

### SGD Classifier (stochastic gradient descent)

SGDClassifier lets you select a loss function. We will use the default (hinge), which gives a linear SVM trained with stochastic gradient descent; this is typically faster on large datasets.

In [None]:
# Create SGD model
from sklearn.linear_model import SGDClassifier
sgdmodel = SGDClassifier(loss='hinge',  penalty='l2')
sgdmodel.fit(X_train,Y_train)
# Accuracy on training data (.score returns mean accuracy for classifiers)
sgdmodel.score(X_train,Y_train)

In [None]:
# Accuracy on validation data
sgdmodel.score(X_validate,Y_validate)
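
A quick check of what `.score` reports for a scikit-learn classifier: it is the mean accuracy, the same number `accuracy_score` gives. The data below is synthetic and only serves the comparison.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)  # synthetic binary target

clf = SGDClassifier(loss="hinge", penalty="l2", random_state=0).fit(X_demo, y_demo)

via_score = clf.score(X_demo, y_demo)                      # classifier's built-in score
via_metric = accuracy_score(y_demo, clf.predict(X_demo))   # explicit accuracy
print(via_score, via_metric)
```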

In [None]:
# Create test data prediction
# sgdmodel.predict(X_test)

### Linear SVC

In [None]:
# Create Linear SVC model
from sklearn.svm import LinearSVC
lsvcmodel = LinearSVC(penalty='l2', loss='squared_hinge')
lsvcmodel.fit(X_train,Y_train)
# Accuracy on training data
lsvcmodel.score(X_train,Y_train)

In [None]:
# Accuracy on validation data
lsvcmodel.score(X_validate,Y_validate)

### XGBoost

For this version the hyperparameters are arbitrary. In a future version we could run a grid search to find the best-performing hyperparameters, and then fit the model again without GPU acceleration, which may improve accuracy.
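
A minimal sketch of such a grid search, under a few assumptions: it uses scikit-learn's `GradientBoostingClassifier` as a CPU-only stand-in for XGBoost, synthetic data from `make_classification`, and a tiny illustrative grid. None of these values are tuned for the competition.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic 3-class problem standing in for the real features and target
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=0
)

# Illustrative grid: 2 x 2 = 4 candidate settings
param_grid = {"max_depth": [2, 3], "learning_rate": [0.05, 0.2]}

search = GridSearchCV(
    GradientBoostingClassifier(n_estimators=30, random_state=0),
    param_grid,
    scoring="accuracy",
    cv=3,  # 3-fold cross validation per candidate
)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```

On 4 million rows an exhaustive grid gets expensive quickly, so searching on a subsample or using a randomized search may be more practical.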

In [None]:
# Create XGBoost model
from xgboost import XGBClassifier # scikit-learn has its own, separate implementation: from sklearn.ensemble import GradientBoostingClassifier

params = {
            'objective' : 'multi:softmax',
            'tree_method': 'gpu_hist',
            'eval_metric': 'mlogloss',
            'booster' : 'gbtree',
            'gamma' : 0.75,
            'max_depth': 7,
            'alpha': 10,
            'learning_rate': .007,
            'n_estimators':2000,
            'predictor': 'gpu_predictor'
        }  

xgbmodel = XGBClassifier(**params)

xgbmodel.fit(X_train,Y_train,
               early_stopping_rounds=200,
               eval_set=[(X_validate,Y_validate)],
               verbose=True)

# Accuracy on training data
xgbmodel.score(X_train,Y_train)

In [None]:
# Accuracy on validation data
xgbmodel.score(X_validate,Y_validate)
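
One thing to watch after dropping class 5: the remaining labels (1, 2, 3, 4, 6, 7) are no longer consecutive. To my knowledge, newer XGBoost releases expect class labels to be consecutive integers starting at 0, so treat that as an assumption to verify against the installed version. A hedged sketch of remapping the labels with scikit-learn's `LabelEncoder` and mapping predictions back:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy target with class 5 absent, mirroring the labels left after the drop
y_raw = np.array([1, 2, 3, 4, 6, 7, 2, 2])

encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y_raw)   # consecutive codes 0..5
print(y_encoded)                           # [0 1 2 3 4 5 1 1]

# After predicting in the encoded space, map back to the original class ids
restored = encoder.inverse_transform(y_encoded)
print(restored)                            # [1 2 3 4 6 7 2 2]
```

The submission file needs the original class ids, so the inverse transform would be applied to the model's predictions before writing the CSV.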

### CatBoost

In [None]:
# Create CatBoost model
from catboost import CatBoostClassifier
catbmodel = CatBoostClassifier(task_type = 'GPU', devices='0')
catbmodel.fit(X_train, Y_train)

# Accuracy on training data
catbmodel.score(X_train,Y_train)

In [None]:
# Accuracy on validation data
catbmodel.score(X_validate,Y_validate)

### LGBM

In [None]:
# Create LightGBM model
from lightgbm import LGBMClassifier

lgb_params = {
    'objective' : 'multiclass',
    'metric' : 'multi_logloss',
    'device' : 'gpu',
}

lgbmmodel = LGBMClassifier(**lgb_params)

lgbmmodel.fit(X_train,Y_train,
               early_stopping_rounds=200,
               eval_set=[(X_validate,Y_validate)],
               verbose=True)

# Accuracy on training data
lgbmmodel.score(X_train,Y_train)

In [None]:
# Accuracy on validation data
lgbmmodel.score(X_validate,Y_validate)

## Prepare Submission

In [None]:
# View sample submission
df_sample_submission

sgdmodel (public score = 0.88080)

In [None]:
# Copy the sample submission and replace the cover type column with our predictions
df_sgd_submission = df_sample_submission.copy()
df_sgd_submission['Cover_Type'] = sgdmodel.predict(X_test).astype('int')
df_sgd_submission.to_csv("sgd_submission.csv",index=False)

xgbmodel (public score = 0.91796)


In [None]:
# Copy the sample submission and replace the cover type column with our predictions
df_xgb_submission = df_sample_submission.copy()
df_xgb_submission['Cover_Type'] = xgbmodel.predict(X_test).astype('int')
df_xgb_submission.to_csv("xgb_submission.csv",index=False)

lsvcmodel (public score = 0.88050)

In [None]:
# Copy the sample submission and replace the cover type column with our predictions
df_lsvc_submission = df_sample_submission.copy()
df_lsvc_submission['Cover_Type'] = lsvcmodel.predict(X_test).astype('int')
df_lsvc_submission.to_csv("lsvc_submission.csv",index=False)

catbmodel (public score = 0.94155)

In [None]:
# Copy the sample submission and replace the cover type column with our predictions
df_catb_submission = df_sample_submission.copy()
df_catb_submission['Cover_Type'] = catbmodel.predict(X_test).astype('int')
df_catb_submission.to_csv("catb_submission.csv",index=False)

lgbmmodel (public score = 0.92976)

In [None]:
# Copy the sample submission and replace the cover type column with our predictions
df_lgbm_submission = df_sample_submission.copy()
df_lgbm_submission['Cover_Type'] = lgbmmodel.predict(X_test).astype('int')
df_lgbm_submission.to_csv("lgbm_submission.csv",index=False)