## Tabular Playground Series - Dec 2021
> The objective of this notebook is to apply step-by-step approach to solve a tabular data competition on Kaggle.
> 
> The subject of this notebook is [a multi-classification task](https://www.kaggle.com/c/tabular-playground-series-dec-2021/data)
> 
> The target variable we are predicting consists of 7 different types of forest cover.
>
> The training dataset consists of 4 million labeled samples with features like elevation, soil type, etc.
>
> The provided dataset was synthetically generated by a GAN that was trained on a the data from the [Forest Cover Type Prediction](https://www.kaggle.com/c/forest-cover-type-prediction/overview). This dataset is (a) much larger, and (b) may or may not have the same relationship to the target as the original data.
> 
> Please refer to this [data page](https://www.kaggle.com/c/forest-cover-type-prediction/data) for a detailed explanation of the features.

## Import

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/tabular-playground-series-dec-2021/sample_submission.csv
/kaggle/input/tabular-playground-series-dec-2021/train.csv
/kaggle/input/tabular-playground-series-dec-2021/test.csv


In [2]:
# Read datasets to pandas dataframe
df_train = pd.read_csv('/kaggle/input/tabular-playground-series-dec-2021/train.csv')
df_test = pd.read_csv('/kaggle/input/tabular-playground-series-dec-2021/test.csv')
df_sample_submission = pd.read_csv('/kaggle/input/tabular-playground-series-dec-2021/sample_submission.csv')

## Reduce Memory Usage

I have used a compression function by Guillaume Martin which is discussed here: https://www.kaggle.com/c/tabular-playground-series-dec-2021/discussion/291844


In [3]:
def reduce_mem_usage(df, verbose=True):
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2    
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)    
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose: print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df

In [4]:
df_train = reduce_mem_usage(df_train)
df_test = reduce_mem_usage(df_test)

Mem. usage decreased to 259.40 Mb (84.8% reduction)
Mem. usage decreased to 63.90 Mb (84.8% reduction)


## EDA

In [5]:
# Checking out df_train
df_train.describe()

Unnamed: 0,Id,Elevation,Aspect,Slope,Horizontal_Distance_To_Hydrology,Vertical_Distance_To_Hydrology,Horizontal_Distance_To_Roadways,Hillshade_9am,Hillshade_Noon,Hillshade_3pm,...,Soil_Type32,Soil_Type33,Soil_Type34,Soil_Type35,Soil_Type36,Soil_Type37,Soil_Type38,Soil_Type39,Soil_Type40,Cover_Type
count,4000000.0,4000000.0,4000000.0,4000000.0,4000000.0,4000000.0,4000000.0,4000000.0,4000000.0,4000000.0,...,4000000.0,4000000.0,4000000.0,4000000.0,4000000.0,4000000.0,4000000.0,4000000.0,4000000.0,4000000.0
mean,2000000.0,2980.192,151.5857,15.09754,271.3154,51.66262,1766.642,211.8375,221.0614,140.8109,...,0.037462,0.03782075,0.011995,0.0160535,0.01071275,0.0122075,0.0407515,0.03923925,0.0316185,1.771335
std,1154701.0,289.0482,109.9611,8.546731,226.5497,68.21597,1315.61,30.75996,22.23134,43.69864,...,0.189891,0.1907625,0.1088629,0.1256813,0.1029465,0.1098111,0.197714,0.1941637,0.1749822,0.893806
min,0.0,1773.0,-33.0,-3.0,-92.0,-317.0,-287.0,-4.0,49.0,-53.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
25%,999999.8,2760.0,60.0,9.0,110.0,4.0,822.0,198.0,210.0,115.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
50%,2000000.0,2966.0,123.0,14.0,213.0,31.0,1436.0,218.0,224.0,142.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0
75%,2999999.0,3217.0,247.0,20.0,361.0,78.0,2365.0,233.0,237.0,169.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0
max,3999999.0,4383.0,407.0,64.0,1602.0,647.0,7666.0,301.0,279.0,272.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,7.0


In [6]:
# Lets see if we have any missing values
missing_values_train = df_train.isna().any().sum()
missing_values_test = df_test.isna().any().sum()
print(f'There are {missing_values_train} missing values in the train dataset')
print(f'There are {missing_values_test} missing values in the test dataset')

There are 0 missing values in the train dataset
There are 0 missing values in the test dataset


In [7]:
# What are the datatypes for our features?
for col in df_train:
    print(df_train[col].dtype, col)

int32 Id
int16 Elevation
int16 Aspect
int8 Slope
int16 Horizontal_Distance_To_Hydrology
int16 Vertical_Distance_To_Hydrology
int16 Horizontal_Distance_To_Roadways
int16 Hillshade_9am
int16 Hillshade_Noon
int16 Hillshade_3pm
int16 Horizontal_Distance_To_Fire_Points
int8 Wilderness_Area1
int8 Wilderness_Area2
int8 Wilderness_Area3
int8 Wilderness_Area4
int8 Soil_Type1
int8 Soil_Type2
int8 Soil_Type3
int8 Soil_Type4
int8 Soil_Type5
int8 Soil_Type6
int8 Soil_Type7
int8 Soil_Type8
int8 Soil_Type9
int8 Soil_Type10
int8 Soil_Type11
int8 Soil_Type12
int8 Soil_Type13
int8 Soil_Type14
int8 Soil_Type15
int8 Soil_Type16
int8 Soil_Type17
int8 Soil_Type18
int8 Soil_Type19
int8 Soil_Type20
int8 Soil_Type21
int8 Soil_Type22
int8 Soil_Type23
int8 Soil_Type24
int8 Soil_Type25
int8 Soil_Type26
int8 Soil_Type27
int8 Soil_Type28
int8 Soil_Type29
int8 Soil_Type30
int8 Soil_Type31
int8 Soil_Type32
int8 Soil_Type33
int8 Soil_Type34
int8 Soil_Type35
int8 Soil_Type36
int8 Soil_Type37
int8 Soil_Type38
int8 Soil_

In [8]:
# Lets see which features are the most correlated with target
df_train.corr()['Cover_Type'].sort_values()

Elevation                            -0.395961
Wilderness_Area1                     -0.117498
Horizontal_Distance_To_Roadways      -0.093850
Horizontal_Distance_To_Fire_Points   -0.069258
Wilderness_Area2                     -0.044574
Soil_Type29                          -0.031301
Soil_Type22                          -0.026344
Soil_Type23                          -0.022897
Soil_Type30                          -0.011889
Hillshade_Noon                       -0.006536
Soil_Type32                          -0.005048
Hillshade_3pm                        -0.004694
Soil_Type19                          -0.003610
Aspect                               -0.002828
Hillshade_9am                        -0.002229
Soil_Type9                           -0.001578
Soil_Type24                          -0.001350
Soil_Type16                          -0.000811
Soil_Type21                          -0.000348
Soil_Type18                          -0.000229
Soil_Type20                          -0.000052
Soil_Type12  

In [9]:
# Lets establish a baseline if we just always predict the target's most common class
# AKA: null accuracy
df_train['Cover_Type'].value_counts(normalize=True).head(1)

2    0.565522
Name: Cover_Type, dtype: float64

Since the accuracy for a model that only predicts class 2 would be 56.5%, we can judge the models we create by how much they can beat this 'dumb model'

In [10]:
# How imbalanced are the class distrubutions in our target variable?
df_train.groupby('Cover_Type').size()

Cover_Type
1    1468136
2    2262087
3     195712
4        377
5          1
6      11426
7      62261
dtype: int64

Since there is only 1 occurrence of class 5 and there are only 377 occurrences of class 4 (out of 4 million samples in the train dataset) we could arguably drop both, for now lets just drop class 5

In [11]:
df_train = df_train[df_train['Cover_Type']!=5]

## Data Preprocessing

If the dataset hadn't already converted categorical features into dummy variables, we would do that here

In [12]:
# Create list of features without'id' and target variable 'cover_type'
features = list(df_train.columns)
features = features[1:55]

In [13]:
# Create feature dataframe and target dataframe for training
X = df_train[features]
Y = df_train["Cover_Type"]
# Also create feature dataframe to generate our prediction
X_test = df_test[features]

In [14]:
# Do the train test split before standardizing our features (to prevent data leak)
# Since the dataset is large we could do a smaller test_size than .2,
# Even better would be to implement StratifiedKFold, ie 5 folds of .2 with class imbalance replicated in each fold
from sklearn.model_selection import train_test_split

X_train, X_validate, Y_train, Y_validate = train_test_split( X, Y, test_size=0.2, random_state=2)
print ('Train set:', X_train.shape,  Y_train.shape)
print ('Validation set:', X_validate.shape,  Y_validate.shape)

Train set: (3199999, 54) (3199999,)
Validation set: (800000, 54) (800000,)


In [15]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X_train = scaler.fit_transform(X_train)
X_validate = scaler.transform (X_validate)
X_test = scaler.fit_transform(X_test)

del df_train, df_test

## Modeling

Since we are predicting a category, have labled data, and >100K samples I want to test the performance of:
* SGD Classifier

I will also test the following estimators that are better with <100K samples:
* Linear SVC
* KNeighbors Classifier
* SVC

Also I totally forgot about the new hype:
* xgboost
hmm also?
* Random Forest
* Light GBM
* Catboost

done:
* SGD
* Linear SVC
* XGBoost
* CatBoost
todo:
Light GBM

### Step 4.1: SGD Classifier (stochastic gradient descent)

SGD classifier allows you to select a loss function, we will use the default, which is equivalent to a Linear SVM (but faster)

In [16]:
# Create SGD model
from sklearn.linear_model import SGDClassifier
sgdmodel = SGDClassifier(loss='hinge',  penalty='l2')
sgdmodel.fit(X_train,Y_train)
# R^2 for training data
sgdmodel.score(X_train,Y_train)

0.8998240311950098

In [17]:
# R^2 for validation data
sgdmodel.score(X_validate,Y_validate)

0.90058875

In [18]:
# Create test data prediction
# sgdmodel.predict(X_test)

### Step 4.3: Linear SVC

In [19]:
# Create Linear SVC model
from sklearn.svm import LinearSVC
lsvcmodel = LinearSVC(penalty='l2', loss='squared_hinge')
lsvcmodel.fit(X_train,Y_train)
# R^2 for training data
lsvcmodel.score(X_train,Y_train)



0.8962837175886618

In [20]:
# R^2 for validation data
lsvcmodel.score(X_validate,Y_validate)

0.89729625

### Step 4.?: XGBoost

For this version the hyperparameters are arbitrary, for a future version we could do a grid search to establish the best performing hyperparameters, then we could fit the model again without GPU acceleration to improve accuracy

In [21]:
# Create XGBoost model
from xgboost import XGBClassifier # Alternatively there is a sklearn wrapper, from sklearn.ensemble import GradientBoostingClassifier

params = {
#             'objective':'binary:logistic',/
            'objective' : 'multi:softmax',
            'tree_method': 'gpu_hist',
            'eval_metric': 'mlogloss',
            'booster' : 'gbtree',
            'gamma' : 0.75,
            'max_depth': 7,
            'alpha': 10,
            'learning_rate': .007,
            'n_estimators':2000,
            'predictor': 'gpu_predictor'
        }  

xgbmodel = XGBClassifier(**params)

xgbmodel.fit(X_train,Y_train,
               early_stopping_rounds=200,
               eval_set=[(X_validate,Y_validate)],
               verbose=True)

# R^2 for training data
xgbmodel.score(X_train,Y_train)



[0]	validation_0-mlogloss:1.77374
[1]	validation_0-mlogloss:1.75609
[2]	validation_0-mlogloss:1.73879
[3]	validation_0-mlogloss:1.72183
[4]	validation_0-mlogloss:1.70520
[5]	validation_0-mlogloss:1.68889
[6]	validation_0-mlogloss:1.67286
[7]	validation_0-mlogloss:1.65714
[8]	validation_0-mlogloss:1.64171
[9]	validation_0-mlogloss:1.62653
[10]	validation_0-mlogloss:1.61164
[11]	validation_0-mlogloss:1.59698
[12]	validation_0-mlogloss:1.58259
[13]	validation_0-mlogloss:1.56842
[14]	validation_0-mlogloss:1.55450
[15]	validation_0-mlogloss:1.54080
[16]	validation_0-mlogloss:1.52731
[17]	validation_0-mlogloss:1.51404
[18]	validation_0-mlogloss:1.50095
[19]	validation_0-mlogloss:1.48809
[20]	validation_0-mlogloss:1.47542
[21]	validation_0-mlogloss:1.46296
[22]	validation_0-mlogloss:1.45067
[23]	validation_0-mlogloss:1.43857
[24]	validation_0-mlogloss:1.42663
[25]	validation_0-mlogloss:1.41487
[26]	validation_0-mlogloss:1.40328
[27]	validation_0-mlogloss:1.39184
[28]	validation_0-mlogloss:1.3

0.95835811198691

In [22]:
# R^2 for validation data
xgbmodel.score(X_validate,Y_validate)

0.9572

### CatBoost

In [23]:
# Create CatBoost model
from catboost import CatBoostClassifier
catbmodel = CatBoostClassifier(task_type = 'GPU', devices='0')
catbmodel.fit(X_train, Y_train)

# R^2 for training data
catbmodel.score(X_train,Y_train)

Learning rate set to 0.336771
0:	learn: 0.7451837	total: 77.7ms	remaining: 1m 17s
1:	learn: 0.5653344	total: 145ms	remaining: 1m 12s
2:	learn: 0.4587539	total: 196ms	remaining: 1m 5s
3:	learn: 0.3896276	total: 232ms	remaining: 57.7s
4:	learn: 0.3420274	total: 268ms	remaining: 53.2s
5:	learn: 0.3047157	total: 316ms	remaining: 52.3s
6:	learn: 0.2798583	total: 350ms	remaining: 49.7s
7:	learn: 0.2576764	total: 384ms	remaining: 47.6s
8:	learn: 0.2422877	total: 418ms	remaining: 46.1s
9:	learn: 0.2303463	total: 455ms	remaining: 45.1s
10:	learn: 0.2211794	total: 488ms	remaining: 43.9s
11:	learn: 0.2134286	total: 523ms	remaining: 43.1s
12:	learn: 0.2077775	total: 557ms	remaining: 42.3s
13:	learn: 0.2016748	total: 619ms	remaining: 43.6s
14:	learn: 0.1984138	total: 714ms	remaining: 46.9s
15:	learn: 0.1939661	total: 787ms	remaining: 48.4s
16:	learn: 0.1910526	total: 843ms	remaining: 48.7s
17:	learn: 0.1874930	total: 902ms	remaining: 49.2s
18:	learn: 0.1852309	total: 965ms	remaining: 49.8s
19:	lear

0.962873113397848

In [24]:
# R^2 for validation data
catbmodel.score(X_validate,Y_validate)

0.96063375

### LGBM

In [25]:
# Create LightGBM model
from lightgbm import LGBMClassifier

lgb_params = {
    'objective' : 'multiclass',
    'metric' : 'multi_logloss',
    'device' : 'gpu',
}

lgbmmodel = LGBMClassifier(**lgb_params)

lgbmmodel.fit(X_train,Y_train,
               early_stopping_rounds=200,
               eval_set=[(X_validate,Y_validate)],
               verbose=True)

# R^2 for training data
lgbmmodel.score(X_train,Y_train)



[1]	valid_0's multi_logloss: 0.720889
[2]	valid_0's multi_logloss: 0.620505
[3]	valid_0's multi_logloss: 0.545302
[4]	valid_0's multi_logloss: 0.486981
[5]	valid_0's multi_logloss: 0.438079
[6]	valid_0's multi_logloss: 0.406319
[7]	valid_0's multi_logloss: 0.373523
[8]	valid_0's multi_logloss: 0.34697
[9]	valid_0's multi_logloss: 0.322731
[10]	valid_0's multi_logloss: 0.305151
[11]	valid_0's multi_logloss: 0.286787
[12]	valid_0's multi_logloss: 0.268447
[13]	valid_0's multi_logloss: 0.264097
[14]	valid_0's multi_logloss: 0.250825
[15]	valid_0's multi_logloss: 0.24129
[16]	valid_0's multi_logloss: 0.245517
[17]	valid_0's multi_logloss: 0.229384
[18]	valid_0's multi_logloss: 0.216964
[19]	valid_0's multi_logloss: 0.212087
[20]	valid_0's multi_logloss: 0.209445
[21]	valid_0's multi_logloss: 0.205914
[22]	valid_0's multi_logloss: 0.203636
[23]	valid_0's multi_logloss: 0.205307
[24]	valid_0's multi_logloss: 0.195128
[25]	valid_0's multi_logloss: 0.202487
[26]	valid_0's multi_logloss: 0.1944

0.9522624850820266

In [26]:
# R^2 for validation data
lgbmmodel.score(X_validate,Y_validate)

0.95186125

## Prepare Submission

In [27]:
# View sample submission
df_sample_submission

Unnamed: 0,Id,Cover_Type
0,4000000,2
1,4000001,2
2,4000002,2
3,4000003,2
4,4000004,2
...,...,...
999995,4999995,2
999996,4999996,2
999997,4999997,2
999998,4999998,2


sgdmodel (public score = 0.88080)

In [28]:
# Rename df and replace the cover type column with our predictions
df_sgd_submission = df_sample_submission
df_sgd_submission['Cover_Type'] = sgdmodel.predict(X_test).astype('int')
df_sgd_submission.to_csv("sgd_submission.csv",index=False)

xgbmodel (public score = 0.91796)


In [29]:
# Rename df and replace the cover type column with our predictions
df_xgb_submission = df_sample_submission
df_xgb_submission['Cover_Type'] = xgbmodel.predict(X_test).astype('int')
df_xgb_submission.to_csv("xgb_submission.csv",index=False)

lsvcmodel (public score = 0.88050)

In [30]:
# Rename df and replace the cover type column with our predictions
df_lsvc_submission = df_sample_submission
df_lsvc_submission['Cover_Type'] = lsvcmodel.predict(X_test).astype('int')
df_lsvc_submission.to_csv("lsvc_submission.csv",index=False)

catbmodel (public score = 0.94155)

In [31]:
# Rename df and replace the cover type column with our predictions
df_catb_submission = df_sample_submission
df_catb_submission['Cover_Type'] = catbmodel.predict(X_test).astype('int')
df_catb_submission.to_csv("catb_submission.csv",index=False)

lgbmmodel (public score = 0.92976)

In [32]:
# Rename df and replace the cover type column with our predictions
df_lgbm_submission = df_sample_submission
df_lgbm_submission['Cover_Type'] = lgbmmodel.predict(X_test).astype('int')
df_lgbm_submission.to_csv("lgbm_submission.csv",index=False)