# Preface

In this notebook I follow the ML project structure used by experienced Kagglers step by step, to learn how they analyze the question, explore and preprocess data, build machine learning models, evaluate results, and submit predictions. The ultimate goal is to build a pipeline I can reuse in future competitions.

This notebook is mainly based on the work of the following two authors, with some of my own modifications and learning notes. Thanks to [@VAD13IRT](http://www.kaggle.com/vad13irt) and [@Sanskar Hasija](https://www.kaggle.com/odins0n) for developing such high-quality notebooks.
* https://www.kaggle.com/odins0n/tps-feb-22-eda-modelling
* https://www.kaggle.com/vad13irt/tps-2022-january-exploratory-data-analysis 


# Introduction

For the February 2022 Tabular Playground Series competition, your task is to classify 10 different bacteria species using data from a genomic analysis technique that has some data compression and data loss. In this technique, 10-mer snippets of DNA are sampled and analyzed to give the histogram of base count. In other words, the DNA segment $\text{ATATGGCCTT}$ becomes $\text{A}_2\text{T}_4\text{G}_2\text{C}_2$. Can you use this lossy information to accurately predict bacteria species?
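
As a rough illustration of the compression described above, the sketch below counts the bases in a single 10-mer. The function names `base_histogram` and `histogram_label` and the sample segments are my own illustrative choices; they are not part of the competition data or any library.

```python
from collections import Counter

def base_histogram(segment: str) -> dict:
    """Count how often each base (A, T, G, C) occurs in a DNA segment."""
    counts = Counter(segment)
    # Report all four bases, including those that never occur
    return {base: counts.get(base, 0) for base in "ATGC"}

def histogram_label(segment: str) -> str:
    """Render the histogram as a compact string of base/count pairs."""
    hist = base_histogram(segment)
    return "".join(f"{base}{count}" for base, count in hist.items())

example = "ATATGGCCTT"  # hypothetical 10-mer
print(base_histogram(example))
print(histogram_label(example))

# The order of bases is lost: a reshuffled segment gives the same histogram
print(histogram_label("TTCCGGTATA") == histogram_label(example))
```

The last line shows why the representation is lossy: two different segments with the same base counts can no longer be told apart.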

# Imports

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from scipy.stats import mode

from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier

from sklearn.ensemble import VotingClassifier

import warnings
import time
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('float_format', '{:f}'.format)
warnings.filterwarnings('ignore')

RANDOM_STATE = 18
FOLDS = 5

# Data Loading and Preparation

In [None]:
train = pd.read_csv('/kaggle/input/tabular-playground-series-feb-2022/train.csv')
test = pd.read_csv('/kaggle/input/tabular-playground-series-feb-2022/test.csv')
submission = pd.read_csv('/kaggle/input/tabular-playground-series-feb-2022/sample_submission.csv')

## Exploring Train Data

<div class = "alert alert-info" role = "alert"; style="font-size:14px; font-family:verdana;">

📌 <b><u>Observations in Train Data</u></b><br>

* There are <b><u>288</u></b> columns: <b><u>286</u></b> continuous, <b><u>1</u></b> row_id, and <b><u>1</u></b> target column<br>
* There are a total of <b><u>200,000</u></b> rows in the train dataset<br>
* <b><u>"target"</u></b> is the target variable with <b><u>10</u></b> possible values<br>
* There are no missing / null values in this dataset
    
</div>

### Quick view of Train Data

In [None]:
train.head()

In [None]:
print(f'\033[92mNumber of rows in train data: {train.shape[0]}')
print(f'\033[94mNumber of columns in train data: {train.shape[1]}')
print(f'\033[91mNumber of observations in train data: {train.count().sum()}')
print(f'\033[91mNumber of missing values in train data: {sum(train.isnull().sum())}')

### Basic statistics of training data
Below are the basic statistics for each variable in the training dataset: `count`, `mean`, `standard deviation`, `minimum`, `1st quartile`, `median`, `3rd quartile`, and `maximum`.

In [None]:
train.describe()

## Exploring Test Data

<div class = "alert alert-info" role = "alert"; style="font-size:14px; font-family:verdana;">
    
📌 <b><u>Observations in Test Data</u></b><br>

* There are <b><u>287</u></b> columns: <b><u>286</u></b> continuous, and <b><u>1</u></b> row_id<br>
* There are a total of <b><u>100,000</u></b> rows in the test dataset<br>
* There are no missing / null values in this dataset
    
</div>

In [None]:
print(f'\033[92mNumber of rows in test data: {test.shape[0]}')
print(f'\033[94mNumber of columns in test data: {test.shape[1]}')
print(f'\033[91mNumber of observations in test data: {test.count().sum()}')
print(f'\033[91mNumber of missing values in test data: {sum(test.isnull().sum())}')

### Quick view of Test Data

In [None]:
test.head()

### Basic statistics of test data
Below are the basic statistics for each variable in the test dataset: `count`, `mean`, `standard deviation`, `minimum`, `1st quartile`, `median`, `3rd quartile`, and `maximum`.

In [None]:
test.describe()

## Submission File

In [None]:
submission.head()

# Data Preprocessing & Exploratory Data Analysis

<div class = "alert alert-info" role = "alert"; style="font-size:14px; font-family:verdana;">

✍️ <b><u>Check List</u></b><br>

Basic information<br>
* Is one-hot encoding needed for categorical variables?<br>
* Check missing values (drop or impute?)<br>
* Check duplicates (drop?)<br>

EDA<br>
* Check target distribution (balance or imbalance)<br>
* Check normal distribution for numeric columns<br>
* Check correlation<br>
* Detect outliers<br>
* Test any assumptions
    
</div>

In [None]:
train.drop('row_id', axis = 1, inplace = True)
test.drop('row_id', axis = 1, inplace = True)

## Overview of Data

In [None]:
train.iloc[:, :-1].describe().T.sort_values(by='std' , ascending = False)\
                     .style.background_gradient(cmap='YlOrRd')\
                     .bar(subset=["max"], color='#969696')\
                     .bar(subset=["mean",], color='#585858')

## Null Distribution

<div class = "alert alert-info" role = "alert"; style="font-size:14px; font-family:verdana;">
    
📌 <b><u>Observations in Null Distribution</u></b><br>

* No Null values
    
</div>

## Duplicate Values

<div class = "alert alert-info" role = "alert"; style="font-size:14px; font-family:verdana;">
    
📌 <b><u>Observations in Duplicated Data</u></b><br>

* There are 76,007 duplicated records in the Train dataset<br>
* According to a Kaggle employee, the duplicated records in this competition come from the data generation process and should not be excluded from the analysis
    
</div>

In [None]:
train.duplicated().sum()

## Target Distribution

<div class = "alert alert-info" role = "alert"; style="font-size:14px; font-family:verdana;">
    
📌 <b><u>Observations in Target Distribution</u></b><br>

* There are <b>10</b> different target values<br>
* All target values are approximately equally distributed: each target accounts for about 10% of total observations.
    
</div>

In [None]:
sns.set_style('whitegrid')
fig = plt.figure(figsize = (12, 4))
ax = fig.add_subplot(1, 1, 1)

sns.countplot(x = 'target', data = train, palette="Set2")
plt.xticks(rotation=15)

ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)

ax.set_xlabel("target", fontsize=14, labelpad=10)
ax.set_ylabel("Count", fontsize=14, labelpad=10)
ax.set_title('Target Distribution', loc = 'left', fontsize = 20, fontweight = 'bold')

fig.tight_layout()

## Feature Distribution

<div class = "alert alert-info" role = "alert"; style="font-size:14px; font-family:verdana;">
    
📌 <b><u>Observations in Feature Distribution</u></b><br>

* Most features are skewed to the left<br>
* Some features have very low variances (unique values <= 30)
                                                            
</div>
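
Observations like these can also be checked numerically rather than only by eye. The sketch below is a minimal example on made-up data (the column names and the toy DataFrame are assumptions, not columns from the competition): it computes the sample skewness and the number of unique values per column, then lists the columns with at most 30 unique values. Note that in pandas a positive `skew()` means a longer right tail and a negative value a longer left tail.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy stand-in for the feature table (hypothetical columns)
toy = pd.DataFrame({
    "long_right_tail": rng.exponential(scale=1.0, size=1000),
    "few_values": rng.integers(0, 5, size=1000) / 1e6,
    "symmetric": rng.normal(size=1000),
})

# One row per feature: skewness and number of distinct values
summary = pd.DataFrame({"skew": toy.skew(), "n_unique": toy.nunique()})

# Features with very few distinct values (same threshold as noted above)
low_cardinality = summary.index[summary["n_unique"] <= 30].tolist()

print(summary)
print(low_cardinality)
```

On the real data, the same two calls could be applied to `train[numeric_columns]`.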

In [None]:
numeric_columns = train.columns[train.dtypes == 'float64'].to_numpy()
len(numeric_columns)

In [None]:
fig = plt.figure(figsize = (17, 1.5 * len(numeric_columns)))
rows = 143
cols = 2

for idx, numeric_column in enumerate(numeric_columns):
  ax = fig.add_subplot(rows, cols, idx + 1)
  sns.kdeplot(x = numeric_column, data = train, fill = True, alpha = 0.6, linewidth = 0.7, edgecolor = '#000', label = 'Train')
  sns.kdeplot(x = numeric_column, data = test, fill = True, alpha = 0.6, linewidth = 0.7, edgecolor = '#000', label = 'Test')

  ax.xaxis.set_tick_params(labelsize=10, size=0, pad=5)
  ax.yaxis.set_tick_params(labelsize=10, size=0, pad=5)

  ax.spines['right'].set_visible(False)
  ax.spines['top'].set_visible(False)

  if idx % cols == 0:
    ax.set_ylabel('Density')
  else:
    ax.set_ylabel('')

  ax.set_xlabel(numeric_column)
  ax.legend()

fig.tight_layout()
fig.show()

In [None]:
rows = 143
cols = 2
fig = plt.figure(figsize = (17, 1.5 * len(numeric_columns)))

for idx, numeric_column in enumerate(numeric_columns):
  ax = fig.add_subplot(rows, cols, idx + 1)
  sns.kdeplot(x = numeric_column, data = train, hue = 'target', fill = True, alpha = 0.5, linewidth = 0.7, edgecolor = '#000')
 
  ax.xaxis.set_tick_params(labelsize=10, size=0, pad=5)
  ax.yaxis.set_tick_params(labelsize=10, size=0, pad=5)

  ax.spines['right'].set_visible(False)
  ax.spines['top'].set_visible(False)

  if idx % cols == 0:
    ax.set_ylabel('Density')
  else:
    ax.set_ylabel('')

  ax.set_xlabel(numeric_column)

  if (idx + 1) != 2:
    ax.legend([])

fig.tight_layout()
fig.show()

In [None]:
fig = plt.figure(figsize=(15,15))
fig.set_facecolor("#fff")
ax = fig.add_subplot()
ax.set_facecolor("#fff")

corr = train[numeric_columns].corr()
sns.heatmap(corr, annot=False, cmap='magma')
ax.xaxis.set_tick_params(labelsize=8, size=0, pad=5)
ax.yaxis.set_tick_params(labelsize=8, size=0, pad=5)
ax.set_title("Pearson Correlation", loc="left", fontsize=25, fontweight="bold")

plt.show()

In [None]:
train[numeric_columns].nunique().sort_values(ascending = False).tail(10)

In [None]:
low_variance_columns = train[numeric_columns].nunique().sort_values(ascending = False).tail(10).index.to_numpy()

In [None]:
fig, ax = plt.subplots(5, 2, figsize = (17, 1.5 * 5))

for idx, low_variance_column in enumerate(low_variance_columns):
  ax = plt.subplot(5, 2, idx + 1)

  sns.boxplot(x = low_variance_column, data = train)

  ax.set_ylabel(low_variance_column, rotation = 0)

fig.tight_layout()
fig.show()

# Feature Engineering

## Basic Feature Engineering

In [None]:
TARGET = 'target'
FEATURES = [col for col in train.columns if col not in ['row_id', TARGET]]

'''
train["mean"] = train[FEATURES].mean(axis=1)
train["std"] = train[FEATURES].std(axis=1)
train["min"] = train[FEATURES].min(axis=1)
train["max"] = train[FEATURES].max(axis=1)

test["mean"] = test[FEATURES].mean(axis=1)
test["std"] = test[FEATURES].std(axis=1)
test["min"] = test[FEATURES].min(axis=1)
test["max"] = test[FEATURES].max(axis=1)

FEATURES.extend(['mean', 'std', 'min', 'max'])
'''

# Modelling

<div class = "alert alert-info" role = "alert"; style="font-size:14px; font-family:verdana;">
    
📌 <b><u>Observations in Modelling</u></b><br>

* <u><b>LGBMClassifier</b></u>, <u><b>CatBoostClassifier</b></u>, and <u><b>XGBClassifier</b></u> are each trained and evaluated with 5-fold stratified cross validation.

</div>

In [None]:
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()

train[TARGET] = encoder.fit_transform(train[TARGET])

X = train[FEATURES].values  # feature matrix
y = train[TARGET].values    # encoded target labels

<div class = "alert alert-info" role = "alert"; style="font-size:14px; font-family:verdana;">

✍️ <u><b>Train Test Split and Cross Validation</b></u><br>

The scikit-learn library provides many tools to split data into training and test sets. The most basic one is train_test_split, which just divides the data into two parts according to the specified partitioning ratio.<br>

If we split data using train_test_split, we can only train a model with the portion set aside for training. Models generally get better as the amount of training data increases, so holding data out has a cost. One way to overcome this issue is cross validation. With cross validation, the dataset is divided into n splits: n-1 splits are used for training and the remaining split is used for testing. This is repeated n times, each time holding out a different split for testing, so every data point is used for both training and testing. Cross validation also gives a more reliable estimate of how a model will perform on new, previously unseen data points.<br>

There are different methods to split data in cross validation. KFold and StratifiedKFold are commonly used.<br>

As the name suggests, KFold divides the dataset into k folds.<br>

StratifiedKFold takes the cross validation one step further. The class distribution in the dataset is preserved in the training and test splits.<br>

In classification tasks with imbalanced class distributions, we should prefer StratifiedKFold over KFold.
    
</div>
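
To make the difference between the two splitters concrete, here is a small sketch on a toy, deliberately imbalanced label vector (the arrays `X_toy` and `y_toy` and the helper `minority_per_fold` are made up for illustration). It counts how many minority-class samples land in each validation fold.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy imbalanced labels: 90 samples of class 0 followed by 10 of class 1
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.zeros((len(y_toy), 1))  # features do not matter for splitting

def minority_per_fold(splitter):
    """Number of class-1 samples in each validation fold."""
    return [int(y_toy[valid_idx].sum())
            for _, valid_idx in splitter.split(X_toy, y_toy)]

# Without shuffling, KFold cuts the data into contiguous blocks
kfold_counts = minority_per_fold(KFold(n_splits=5))
# StratifiedKFold keeps the class ratio (roughly) constant in every fold
strat_counts = minority_per_fold(StratifiedKFold(n_splits=5))

print("KFold:          ", kfold_counts)
print("StratifiedKFold:", strat_counts)
```

With sorted labels and no shuffling, plain KFold puts every minority sample into the last fold, while StratifiedKFold spreads them evenly. Shuffling reduces the problem for KFold but does not guarantee balanced folds.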

In [None]:
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

## XGBoost Classifier

Documentation - https://xgboost.readthedocs.io/en/stable/parameter.html#general-parameters

<i>When predictor is set to default value auto, the gpu_hist tree method is able to provide GPU based prediction without copying training data to GPU memory. If gpu_predictor is explicitly specified, then all data is copied into GPU, only recommended for performing prediction tasks.</i>



In [None]:
xgb_params = {
    'objective': 'multi:softmax', # For multiclass classification
    'eval_metric': 'mlogloss',    # Multiclass log loss
    'tree_method': 'gpu_hist',    # GPU implementation of the fast histogram algorithm. Much faster and uses considerably less memory.
    'predictor': 'gpu_predictor', # Prediction using GPU. Used when tree_method is gpu_hist.
    'booster': 'gbtree',          # Default value
    'eta': 0.3,                   # Learning rate, default = 0.3
    'gamma': 0,                   # min_split_loss, default = 0
    'max_depth': 6,               # Maximum depth of a tree, default = 6
    'lambda': 1,                  # L2 regularization term on weights, default = 1
    'alpha': 0,                   # L1 regularization term on weights, default = 0
}

In [None]:
xgb_predictions = []
xgb_scores = []
xgb_fimp = []

cv = StratifiedKFold(n_splits = FOLDS, shuffle = True, random_state = RANDOM_STATE)
for fold, (train_idx, valid_idx) in enumerate(cv.split(train[FEATURES], train[TARGET])):
    
    print(10*"=", f"Fold = {fold + 1}", 10*"=")
    start_time = time.time()
    
    X_train, X_valid = train.iloc[train_idx][FEATURES], train.iloc[valid_idx][FEATURES]
    y_train, y_valid = train[TARGET].iloc[train_idx], train[TARGET].iloc[valid_idx]
    
    model = XGBClassifier(**xgb_params) # Use ** to pass a dictionary for parameters
    model.fit(X_train, y_train, verbose = 0)
    
    pred_valid = model.predict(X_valid)
    acc = accuracy_score(y_valid, pred_valid)
    xgb_scores.append(acc)
    run_time = time.time() - start_time
    
    print(f"Fold = {fold + 1}, Accuracy: {acc:.2f}, Run Time: {run_time:.2f}s")
    test_pred = model.predict(test[FEATURES])
    fim = pd.DataFrame(index = FEATURES,
                      data = model.feature_importances_,
                      columns = [f'{fold}_importance'])
    xgb_fimp.append(fim)
    xgb_predictions.append(test_pred)
    
print("Mean Accuracy :", np.mean(xgb_scores))

### Feature Importance for XGBoost Classifier (Top 15 Features)

In [None]:
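# Rank features by their importance in fold 1, then keep the 15 most important
xgb_fis_df = pd.concat(xgb_fimp, axis = 1).sort_values('1_importance', ascending = False).head(15)
xgb_fis_df.sort_values('1_importance').plot(kind = 'barh', figsize = (15, 10), title = 'Feature Importance Across Folds')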
plt.show()

## LGBM Classifier

Documentation - https://lightgbm.readthedocs.io/en/latest/Parameters.html

In [None]:
lgb_params = {
    'objective': 'multiclass',    # For multiclass classification
    'metric': 'multi_logloss',    # For multiclass classification
    'device': 'gpu',              # Use GPU
    'num_iterations': 100,        # Same as num_tree, n_iter, n_estimators. Default = 100
    'learning_rate': 0.1,         # Same as eta, default = 0.1
    'num_leaves': 31,             # Default = 31
    'max_depth': -1,              # Default = -1
    'bagging_freq': 0,            # 0 = disable bagging, default = 0
    'feature_fraction': 1,        # Fraction of features sampled for each tree; 1 = use all features
    'lambda_l1': 0,               # Default = 0
    'lambda_l2': 0,               # Default = 0
}

In [None]:
lgb_predictions = []
lgb_scores = []
lgb_fimp = []

cv = StratifiedKFold(n_splits = FOLDS, shuffle = True, random_state = RANDOM_STATE)
for fold, (train_idx, valid_idx) in enumerate(cv.split(train[FEATURES], train[TARGET])):
    
    print(10*"=", f"Fold = {fold + 1}", 10*"=")
    start_time = time.time()
    
    X_train, X_valid = train.iloc[train_idx][FEATURES], train.iloc[valid_idx][FEATURES]
    y_train, y_valid = train[TARGET].iloc[train_idx], train[TARGET].iloc[valid_idx]
    
    model = LGBMClassifier(**lgb_params)
    model.fit(X_train, y_train) # `verbose` was removed from fit() in LightGBM 4.0
    
    pred_valid = model.predict(X_valid)
    acc = accuracy_score(y_valid, pred_valid)
    lgb_scores.append(acc)
    run_time = time.time() - start_time
    
    print(f"Fold = {fold + 1}, Accuracy: {acc:.2f}, Run Time: {run_time:.2f}s")
    test_pred = model.predict(test[FEATURES])
    fim = pd.DataFrame(index = FEATURES,
                      data = model.feature_importances_,
                      columns = [f'{fold}_importance'])
    lgb_fimp.append(fim)
    lgb_predictions.append(test_pred)
    
print("Mean Accuracy :", np.mean(lgb_scores))

### Feature Importance for LGBM Classifier (Top 15 Features)

In [None]:
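# Rank features by their importance in fold 1, then keep the 15 most important
lgbm_fis_df = pd.concat(lgb_fimp, axis = 1).sort_values('1_importance', ascending = False).head(15)
lgbm_fis_df.sort_values('1_importance').plot(kind = 'barh', figsize = (15, 10), title = 'Feature Importance Across Folds')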
plt.show()

## CatBoost Classifier

Documentation - https://catboost.ai/en/docs/references/training-parameters/

In [None]:
catb_params = {
    'objective': "MultiClass",       # For multiclass classification
    "task_type": "GPU",              # Use GPU
}

In [None]:
catb_predictions = []
catb_scores = []
catb_fimp = []

cv = StratifiedKFold(n_splits = FOLDS, shuffle = True, random_state = RANDOM_STATE)
for fold, (train_idx, valid_idx) in enumerate(cv.split(train[FEATURES], train[TARGET])):
    
    print(10*"=", f"Fold = {fold + 1}", 10*"=")
    start_time = time.time()
    
    X_train, X_valid = train.iloc[train_idx][FEATURES], train.iloc[valid_idx][FEATURES]
    y_train, y_valid = train[TARGET].iloc[train_idx], train[TARGET].iloc[valid_idx]
    
    model = CatBoostClassifier(**catb_params)
    model.fit(X_train, y_train, verbose = 0)
    
    pred_valid = model.predict(X_valid)
    acc = accuracy_score(y_valid, pred_valid)
    catb_scores.append(acc)
    run_time = time.time() - start_time
    
    print(f"Fold = {fold + 1}, Accuracy: {acc:.2f}, Run Time: {run_time:.2f}s")
    test_pred = model.predict(test[FEATURES])
    fim = pd.DataFrame(index = FEATURES,
                      data = model.feature_importances_,
                      columns = [f'{fold}_importance'])
    catb_fimp.append(fim)
    catb_predictions.append(test_pred)
    
print("Mean Accuracy :", np.mean(catb_scores))

### Feature Importance for CatBoost Classifier (Top 15 Features)

In [None]:
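# Rank features by their importance in fold 1, then keep the 15 most important
catb_fis_df = pd.concat(catb_fimp, axis = 1).sort_values('1_importance', ascending = False).head(15)
catb_fis_df.sort_values('1_importance').plot(kind = 'barh', figsize = (15, 10), title = 'Feature Importance Across Folds')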
plt.show()

# Ensemble Model

In [None]:
vote_predictions = []
vote_scores = []

cv = StratifiedKFold(n_splits = FOLDS, shuffle = True, random_state = RANDOM_STATE)
for fold, (train_idx, valid_idx) in enumerate(cv.split(train[FEATURES], train[TARGET])):
    
    print(10*"=", f"Fold = {fold + 1}", 10*"=")
    start_time = time.time()
    
    X_train, X_valid = train.iloc[train_idx][FEATURES], train.iloc[valid_idx][FEATURES]
    y_train, y_valid = train[TARGET].iloc[train_idx], train[TARGET].iloc[valid_idx]
    
    model = VotingClassifier(
            estimators = [
                ('XGB_model', XGBClassifier(**xgb_params)),
                ('LGBM_model', LGBMClassifier(**lgb_params)),
                ('CatBoost_model', CatBoostClassifier(**catb_params))],
            voting = 'soft'
            )
    
    model.fit(X_train, y_train)
    
    pred_valid = model.predict(X_valid)
    acc = accuracy_score(y_valid, pred_valid)
    vote_scores.append(acc)
    run_time = time.time() - start_time
    
    print(f"Fold = {fold + 1}, Accuracy: {acc:.2f}, Run Time: {run_time:.2f}s")
    test_pred = model.predict(test[FEATURES])

    vote_predictions.append(test_pred)
    
print("Mean Accuracy :", np.mean(vote_scores))
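
With `voting = 'soft'`, `VotingClassifier` averages the class probabilities predicted by its estimators and picks the class with the highest average, whereas hard voting lets each estimator cast one vote for its own top class. The sketch below reproduces both rules with plain NumPy on made-up probabilities from three hypothetical models; none of the numbers come from this notebook's models.

```python
import numpy as np

# Made-up class probabilities, shape (n_models, n_samples, n_classes)
probas = np.array([
    [[0.60, 0.30, 0.10], [0.90, 0.10, 0.00]],  # hypothetical model A
    [[0.40, 0.50, 0.10], [0.40, 0.60, 0.00]],  # hypothetical model B
    [[0.50, 0.40, 0.10], [0.45, 0.55, 0.00]],  # hypothetical model C
])
n_classes = probas.shape[2]

# Soft voting: average the probabilities over models, then take the argmax
soft_vote = probas.mean(axis=0).argmax(axis=1)

# Hard voting: each model votes for its own argmax, majority wins
hard_labels = probas.argmax(axis=2)  # shape (n_models, n_samples)
hard_vote = np.array([
    np.bincount(votes, minlength=n_classes).argmax()
    for votes in hard_labels.T
])

print("soft:", soft_vote)
print("hard:", hard_vote)
```

For the second sample the two rules disagree: two models narrowly prefer class 1, but one model is very confident in class 0, so the averaged probabilities favour class 0.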

In [None]:
submission = submission[['row_id']]

In [None]:
xgb_submission = submission.copy()
xgb_submission["target"] = encoder.inverse_transform(np.squeeze(mode(np.column_stack(xgb_predictions),axis = 1)[0]).astype('int'))
xgb_submission.to_csv("xgb-subs.csv",index=False)
xgb_submission.head()

In [None]:
lgb_submission = submission.copy()
lgb_submission["target"] = encoder.inverse_transform(np.squeeze(mode(np.column_stack(lgb_predictions),axis = 1)[0]).astype('int'))
lgb_submission.to_csv("lgb-subs.csv",index=False)
lgb_submission.head()

In [None]:
catb_submission = submission.copy()
catb_submission["target"] = encoder.inverse_transform(np.squeeze(mode(np.column_stack(catb_predictions),axis = 1)[0]).astype('int'))
catb_submission.to_csv("catb.csv",index=False)
catb_submission.head()

In [None]:
mode_submission = submission.copy()
pred_mode = encoder.inverse_transform(np.squeeze(mode(np.column_stack(xgb_predictions + lgb_predictions + catb_predictions),axis = 1)[0]).astype('int'))
mode_submission["target"] = pred_mode
mode_submission.to_csv("pred_mode.csv",index=False)
mode_submission.head()

In [None]:
vote_submission = submission.copy()
vote_pred = encoder.inverse_transform(np.squeeze(mode(np.column_stack(vote_predictions),axis = 1)[0]).astype('int'))
vote_submission["target"] = vote_pred
vote_submission.to_csv("pred_vote.csv",index=False)
vote_submission.head()
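
The submission cells above combine the per-fold test predictions by taking the most frequent label in each row. A minimal NumPy-only sketch of that majority vote, using made-up predictions from five hypothetical folds (it uses `np.bincount` instead of `scipy.stats.mode`, which is my own substitution for illustration):

```python
import numpy as np

# Made-up label predictions from 5 hypothetical folds for 4 test rows
fold_predictions = [
    np.array([0, 1, 2, 2]),
    np.array([0, 1, 1, 2]),
    np.array([0, 2, 1, 2]),
    np.array([1, 2, 1, 0]),
    np.array([0, 2, 2, 0]),
]

# One column per fold, one row per test sample: shape (n_rows, n_folds)
stacked = np.column_stack(fold_predictions)

# Most frequent label per row; on a tie, argmax keeps the smallest label
majority = np.array([np.bincount(row).argmax() for row in stacked])

print(majority)
```

This only works for non-negative integer labels, which is the case here because the targets were passed through `LabelEncoder` before training.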