## Tabular Playground Series - Dec 2021
> The objective of this notebook is to apply a step-by-step approach to solving a tabular data competition on Kaggle.
> 
> The subject of this notebook is [a multiclass classification task](https://www.kaggle.com/c/tabular-playground-series-dec-2021/data). 
> 
> The provided dataset was synthetically generated by a GAN that was trained on the data from the [Forest Cover Type Prediction](https://www.kaggle.com/c/forest-cover-type-prediction/overview) competition. This dataset is (a) much larger, and (b) may or may not have the same relationship to the target as the original data.
> 
> Please refer to this [data page](https://www.kaggle.com/c/forest-cover-type-prediction/data) for a detailed explanation of the features.

## EXAMPLE Table of Contents
1. Import
1. EDA
1. Relationship with numerical variables
1. Correlation with heatmap
1. Missing Value (data preprocessing)
1. Divided categorical and Numerical
1. Encoder features
1. Normalization
1. Convert Into Numpy array
1. Classifier
    * XGBOOST
    * CatBOOST
    
    
## todo:
### - import ml libraries
### - check for missing values
### - train test split the train dataset to validate hyperparameter tuning
### - standardize + 
### - train on train data
### - evaluate fit with test data

## Step 1. Import

In [2]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [3]:
# Read datasets to pandas dataframe
df_train = pd.read_csv('/kaggle/input/tabular-playground-series-dec-2021/train.csv')
df_test = pd.read_csv('/kaggle/input/tabular-playground-series-dec-2021/test.csv')
df_sample_submission = pd.read_csv('/kaggle/input/tabular-playground-series-dec-2021/sample_submission.csv')

## Step 2. EDA

In [4]:
# Checking out df_train
df_train.describe()

In [5]:
# Let's see if we have any missing values
# .isna().sum().sum() counts missing cells; .isna().any().sum() would only count affected columns
missing_values_train = df_train.isna().sum().sum()
missing_values_test = df_test.isna().sum().sum()
print(f'There are {missing_values_train} missing values in the train dataset')
print(f'There are {missing_values_test} missing values in the test dataset')

In [6]:
# How imbalanced are the class distributions in our target variable?
df_train.groupby('Cover_Type').size()

Since there is only 1 occurrence of class 5 and only 377 occurrences of class 4 (out of 4 million samples in the train dataset), we could drop both with a negligible effect on our model's accuracy. For now we will leave them.
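
If we did decide to drop the rare classes, one way to do it is to filter on the class counts. The snippet below is a minimal sketch on a made-up toy dataframe: `df_toy`, its values, and the `min_count` threshold are assumptions for illustration, not part of the competition data.

```python
import pandas as pd

# Toy stand-in for df_train (made-up values): classes 4 and 5 appear only once.
df_toy = pd.DataFrame({
    "Elevation": [2500, 2700, 2650, 2400, 2800, 3100, 3300, 2750, 2450, 2720],
    "Cover_Type": [1, 2, 2, 1, 2, 4, 5, 2, 1, 2],
})

# Count rows per class, then keep only classes with at least min_count rows.
class_counts = df_toy["Cover_Type"].value_counts()
min_count = 2  # hypothetical threshold, chosen only for this toy example
classes_to_keep = class_counts[class_counts >= min_count].index

df_filtered = df_toy[df_toy["Cover_Type"].isin(classes_to_keep)]
print(df_filtered["Cover_Type"].value_counts())
```

A model trained on the filtered data can never predict the dropped classes, so this only makes sense when those classes are as rare as they are here.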

In [7]:
# Let's establish a baseline: the accuracy we get by always predicting the target's most common class
# AKA: null accuracy
df_train['Cover_Type'].value_counts(normalize=True).head(1)

Since a model that only predicts class 2 would reach 56.5% accuracy, we can judge the models we create by how much they beat this 'dumb model'.
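
The same baseline can be expressed as an actual estimator, which makes it easy to score alongside the real models later. This is a minimal sketch using scikit-learn's `DummyClassifier` on a made-up toy target (`y_toy` and `X_toy` are assumptions, not the competition data).

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy target (made-up labels): class 2 is the majority with 6 of 10 rows.
y_toy = np.array([2, 2, 2, 1, 1, 3, 2, 2, 1, 2])
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by this strategy

# "most_frequent" always predicts the most common class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)

# The score equals the share of the majority class, i.e. the null accuracy.
null_accuracy = baseline.score(X_toy, y_toy)
print(f"Null accuracy: {null_accuracy:.2f}")
```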

In [8]:
# Let's see which features are the most correlated with the target
df_train.corr()['Cover_Type'].sort_values()

In [9]:
# What are the datatypes for our features?
for col in df_train:
    print(df_train[col].dtype, col)

## Step 3. Data Preprocessing

If the dataset hadn't already converted categorical features into dummy variables, we would do that here

In [10]:
# Create list of features without 'Id' and the target variable 'Cover_Type'
features = list(df_train.columns)
features = features[1:55]

In [11]:
# Create feature dataframe and target dataframe for training
X = df_train[features]
Y = df_train["Cover_Type"]
# Also create feature dataframe to generate our prediction
X_test = df_test[features]

In [12]:
# Do the train test split before standardizing our features (to prevent data leak)
# Since the dataset is large we could do a smaller test_size than .2,
# Even better would be to implement cross validation, ie 5 folds of .2 
from sklearn.model_selection import train_test_split

X_train, X_validate, Y_train, Y_validate = train_test_split(X, Y, test_size=0.2, random_state=2)
print('Train set:', X_train.shape, Y_train.shape)
print('Validation set:', X_validate.shape, Y_validate.shape)
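
The cross-validation idea mentioned in the comments above (5 folds of 0.2) could look roughly like the sketch below. It runs on synthetic data from `make_classification` rather than the competition data, and the model choice, fold settings, and variable names (`X_demo`, `y_demo`, `model`, `cv`) are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the competition data (assumption: 3 classes, 10 features).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, n_informative=5,
    n_classes=3, random_state=2,
)

# The scaler sits inside the pipeline, so it is refit on each training fold only.
model = make_pipeline(StandardScaler(), SGDClassifier(random_state=2))

# 5 folds, each holding out 20% of the rows while preserving class proportions.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=2)
scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")

print("Fold accuracies:", scores.round(3))
print("Mean accuracy:", scores.mean().round(3))
```

On the real training set, the class with a single row cannot appear in every fold, so a stratified splitter is likely to emit a warning about the least populated class.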

In [13]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Fit the scaler on the training split only, then reuse its statistics for every other split
X_train = scaler.fit_transform(X_train)
X_validate = scaler.transform(X_validate)
X_test = scaler.transform(X_test)

del df_train, df_test
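
To see why the scaler's statistics should come from the training split only, here is a small sketch with made-up one-column arrays (`train` and `test` are assumptions for illustration). Refitting the scaler on the test rows hides the fact that they sit far from the training distribution.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: the test values are shifted far above the training values.
train = np.array([[0.0], [10.0], [20.0]])
test = np.array([[100.0], [110.0], [120.0]])

# Consistent: the test rows are scaled with the training mean and std.
scaler = StandardScaler().fit(train)
consistent = scaler.transform(test)

# Inconsistent: a second fit recentres the test rows around zero,
# so the model would see them as if they looked like typical training rows.
refit = StandardScaler().fit_transform(test)

print("Scaled with training statistics:", consistent.ravel().round(2))
print("Scaled with refit statistics:   ", refit.ravel().round(2))
```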

## Step 4: Modeling

Since we are predicting a category, have labeled data, and have >100K samples, I want to test the performance of:
* SGD Classifier
* kernel approximation

I will also test the following estimators that are better with <100K samples:
* Linear SVC
* KNeighbors Classifier
* SVC

Also I totally forgot about the new hype:
* xgboost

### Step 4.1: SGD Classifier (stochastic gradient descent)

The SGD classifier lets you select a loss function. We will use the default (hinge), which gives a linear SVM trained with stochastic gradient descent, making it faster on large datasets.

In [14]:
# Create SGD model
from sklearn.linear_model import SGDClassifier
sgdmodel = SGDClassifier(loss='hinge',  penalty='l2')
sgdmodel.fit(X_train,Y_train)
# Mean accuracy on training data (score() returns accuracy for classifiers, not R^2)
sgdmodel.score(X_train,Y_train)

In [15]:
# Mean accuracy on validation data
sgdmodel.score(X_validate,Y_validate)

In [16]:
# Create test data prediction
# sgdmodel.predict(X_test)
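
Because the loss function is just a parameter, trying an alternative is cheap. The sketch below compares two of the losses `SGDClassifier` accepts on synthetic data; the dataset, the choice of `modified_huber` as the second loss, and the variable names are assumptions for illustration, and the numbers say nothing about the competition data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the competition data (assumption).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    n_classes=3, random_state=2,
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=2
)

# Fit one scaled SGD model per loss and record its validation accuracy.
results = {}
for loss in ("hinge", "modified_huber"):
    clf = make_pipeline(
        StandardScaler(),
        SGDClassifier(loss=loss, penalty="l2", random_state=2),
    )
    clf.fit(X_tr, y_tr)
    results[loss] = clf.score(X_val, y_val)

for loss, accuracy in results.items():
    print(f"{loss}: validation accuracy = {accuracy:.3f}")
```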

### Step 4.2: Kernel Approximation

In [17]:
# ?
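
One possible way to fill in this step is to map the features through an approximate RBF kernel and then train a linear model on the result. The sketch below uses scikit-learn's `RBFSampler` followed by `SGDClassifier` on synthetic data; the dataset, the `gamma` and `n_components` values, and the variable names are assumptions that would need tuning on the real data.

```python
from sklearn.datasets import make_classification
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the competition data (assumption).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    n_classes=3, random_state=2,
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=2
)

# Scale, project into 100 random Fourier features that approximate an RBF kernel,
# then fit a linear classifier with SGD on the projected features.
kernel_model = make_pipeline(
    StandardScaler(),
    RBFSampler(gamma=0.1, n_components=100, random_state=2),
    SGDClassifier(random_state=2),
)
kernel_model.fit(X_tr, y_tr)

kernel_accuracy = kernel_model.score(X_val, y_val)
print(f"Validation accuracy: {kernel_accuracy:.3f}")
```

The appeal of this approach is that training cost stays linear in the number of samples, unlike an exact kernel SVC.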

### Step 4.3: Linear SVC

In [18]:
# class sklearn.svm.LinearSVC(penalty='l2', loss='squared_hinge', *, dual=True, tol=0.0001, C=1.0, multi_class='ovr', fit_intercept=True, intercept_scaling=1, class_weight=None, verbose=0, random_state=None, max_iter=1000)

In [19]:
# # Create Linear SVC model
# from sklearn.svm import LinearSVC
# lsvcmodel = LinearSVC(penalty='l2', loss='squared_hinge', dual=True, tol=0.0001, C=1.0, multi_class='ovr', fit_intercept=True, intercept_scaling=1, class_weight=None, verbose=0, random_state=None, max_iter=1000)
# lsvcmodel.fit(X_train,Y_train)
# # Mean accuracy on training data
# lsvcmodel.score(X_train,Y_train)

### Step 4.?: XGBoost

In [20]:
# # Create XGBoost model (a classifier, since Cover_Type is a category)
# from xgboost import XGBClassifier
# xgbmodel = XGBClassifier()

In [None]:
# Note: this is scikit-learn's gradient boosting implementation, not the xgboost library
from sklearn.ensemble import GradientBoostingClassifier
xgbmodel = GradientBoostingClassifier()
xgbmodel.fit(X_train,Y_train)
# Mean accuracy on training data
xgbmodel.score(X_train,Y_train)

In [None]:
# Mean accuracy on validation data
xgbmodel.score(X_validate,Y_validate)

## Step 5: Preparing Submission

sgdmodel

In [None]:
# View sample submission
df_sample_submission

In [None]:
# Copy the sample submission and replace the Cover_Type column with our predictions
# (.copy() keeps df_sample_submission itself unchanged)
df_sgd_submission = df_sample_submission.copy()
df_sgd_submission['Cover_Type'] = sgdmodel.predict(X_test).astype('int')
df_sgd_submission.to_csv("sgd_submission.csv",index=False)

In [None]:
df_sgd_submission

xgbmodel

In [None]:
# Copy the sample submission and replace the Cover_Type column with our predictions
# (.copy() keeps df_sample_submission and df_sgd_submission unchanged)
df_xgb_submission = df_sample_submission.copy()
df_xgb_submission['Cover_Type'] = xgbmodel.predict(X_test).astype('int')
df_xgb_submission.to_csv("xgb_submission.csv",index=False)

In [None]:
# kaggle competitions submit -c tabular-playground-series-dec-2021 -f submission.csv -m "Message"