<a href="https://www.kaggle.com/code/valerybonneau/ps-s3e23-eda-automl-and-ensemble?scriptVersionId=193453875" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Table of contents

0. [Introduction](#zero)
1. [Import Libraries and Data](#one)
2. [EDA](#two)
3. [Auto ML test](#three)
4. [Data Preparation](#four)
5. [Random Forest](#five)
6. [Logistic Regression](#six)
7. [XGBoost](#seven)
8. [CatBoost](#eight)
9. [LightGBM](#nine)
10. [ExtraTreesClassifier](#ten)
11. [HistGradientBoostingClassifier](#eleven)
12. [Neural Network - Scikit and Keras](#twelve)
13. [Ensemble implementation](#thirteen)
14. [Conclusion and Submission](#fourteen)

<a id="zero"></a>
# 0. Introduction
This is my first public notebook.
- After a quick EDA, I'll try AutoGluon to get a baseline score.
- Then I'll run a Random Forest to check feature importances.
- Finally, I'll try different algorithms based on the AutoGluon results (and on ideas picked up from discussions).

I'll try to reach the top 20%, but the main goal is to improve my skills and knowledge.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:

        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

<a id="one"></a>
# 1. Import Libraries and Data

In [None]:
# Math and data manipulation packages
import numpy as np
import pandas as pd
from scipy.stats import randint, uniform

# DataViz packages
import matplotlib.pyplot as plt
import seaborn as sns

# Preprocessing packages
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import RepeatedStratifiedKFold, StratifiedKFold

from sklearn.preprocessing import PowerTransformer, FunctionTransformer, PolynomialFeatures

# Model packages
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

from xgboost import XGBClassifier, XGBRegressor
import lightgbm as lgb
from catboost import CatBoostClassifier, Pool

from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import Perceptron

# Neural Network 
import tensorflow as tf
import keras
from keras.models import Sequential
from keras.layers import Dense

# Ensemble model build
from sklearn.ensemble import VotingClassifier

# Metrics and optimization
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_validate
from sklearn.metrics import roc_auc_score, accuracy_score, f1_score

# Execution time
import time

sns.set()
# make sure we can see needed columns and rows
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)

np.set_printoptions(linewidth=195, edgeitems=5)

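# NOTE: this bare variable is never used; to affect LightGBM it would have to be passed as a model parameter
force_row_wise=True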

In [None]:
train = pd.read_csv('/kaggle/input/playground-series-s3e23/train.csv')
test = pd.read_csv('/kaggle/input/playground-series-s3e23/test.csv')
train.head(10)

In [None]:
train.describe().T

In [None]:
train.info()

## First Observation
Every feature has a relevant type. There are no NaN values. I'll study the data further during the EDA.

<a id='two'></a>
# 2. EDA

In [None]:
binary = train.drop('id', axis=1)
binary['defects'] = binary['defects'].map({True:1, False:0})

features = binary.columns.tolist()
features = features[:-1]

## Distribution

In [None]:
# adapted from https://www.kaggle.com/code/ambrosm/pss3e23-eda-which-makes-sense
_, axs = plt.subplots(7, 3, figsize=(12,14))
for col, ax in zip(binary.columns, axs.ravel()):
    if binary[col].dtype == float:
        ax.hist(binary[col], bins=100, color='red')
    else:
        no_float = binary[col].value_counts()
        ax.bar(no_float.index, no_float, color='red')
    ax.set_title(col)
plt.tight_layout()
plt.subplots_adjust(hspace=0.5)  # Increase vertical space between subplots
plt.suptitle('Features distributions')
plt.show()

**First Observations**<br>
- All features are right-skewed.
- `lOComment` and `locCodeAndComment` consist of 71.6% and 91.9% zeros, respectively.
- I will apply a log transformation.
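To illustrate why a log transformation helps with right-skewed features, here is a minimal sketch on synthetic data. The log-normal sample and the `skewness` helper are assumptions made for illustration; they are not part of the competition data or of the notebook's pipeline.

```python
import numpy as np

def skewness(x):
    """Sample skewness: the third standardized moment."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

# Synthetic right-skewed feature (illustrative only, not the real data)
rng = np.random.default_rng(73)
raw = rng.lognormal(mean=2.0, sigma=1.0, size=10_000)

skew_raw = skewness(raw)
skew_log = skewness(np.log1p(raw))
print(f"skewness before: {skew_raw:.2f}, after log1p: {skew_log:.2f}")
```

A skewness close to zero after the transformation indicates a more symmetric distribution, which tends to suit linear models and scalers better.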

In [None]:
# adapted from https://www.kaggle.com/code/ambrosm/pss3e23-eda-which-makes-sense
_, axs = plt.subplots(7, 3, figsize=(12,14))
for col, ax in zip(binary.columns, axs.ravel()):
    ax.hist(np.log1p(binary[col]), bins=100, color='green')
    ax.set_title(col)
plt.tight_layout()
plt.subplots_adjust(hspace=0.5)  # Increase vertical space between subplots
plt.suptitle('Features distributions after log transformation')
plt.show()

## Correlation Matrix

The features seem highly correlated with one another in general, with coefficients as high as 1, 0.9 or 0.8 in many cases.
Still, no feature is highly correlated with the target (the maximum is 0.3 or -0.3).<br>
**Also, as it is a classification problem, the correlation is not as important as it would be in a regression problem.**<br>
Removing features that add little value could still be worthwhile to reduce processing time.
For the time being, I'll keep all features and decide after running a first Random Forest to show feature importances.
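If redundant features need to be shortlisted later, the highly correlated pairs can be extracted from the correlation matrix programmatically. The sketch below is one possible approach; the helper `top_correlated_pairs`, the toy columns and the 0.9 threshold are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(df, threshold=0.9):
    """Return feature pairs whose absolute Pearson correlation is >= threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair appears once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs >= threshold].sort_values(ascending=False)

# Toy frame (hypothetical columns): 'b' is almost a copy of 'a', 'c' is independent
rng = np.random.default_rng(73)
a = rng.normal(size=500)
toy = pd.DataFrame({
    'a': a,
    'b': a + rng.normal(scale=0.05, size=500),
    'c': rng.normal(size=500),
})

redundant = top_correlated_pairs(toy, threshold=0.9)
print(redundant)
```

On the real data, the same call on `binary.drop('defects', axis=1)` would list candidate pairs from which one feature could be dropped.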

In [None]:
sns.set(rc={'figure.figsize':(20,10)})
corr = binary.corr()
mask = np.triu(np.ones_like(corr, dtype=bool))
cmap = sns.diverging_palette(100, 7, s = 75, l = 40, n = 20, center = 'light', as_cmap = True)

sns.heatmap(corr, mask=mask, cmap=cmap, vmax=1, center=0, square=True, linewidths=.5, cbar_kws={'shrink':.5}, annot=True, fmt='.1f')

There are no categorical features.
And right now, I don't see how to derive categories from the existing features.

## Class Distribution
The target class is imbalanced (roughly 8:2). After some reading and going through other notebooks, I decided not to use SMOTE or similar resampling techniques. I'll use class_weight='balanced' when the model offers it.
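For reference, scikit-learn documents the `'balanced'` heuristic as `n_samples / (n_classes * np.bincount(y))`. The sketch below applies that formula to a toy target with the same rough 8:2 ratio; the toy array is an assumption, not the real `defects` column.

```python
import numpy as np

# Toy target with a rough 8:2 ratio (illustrative only)
y = np.array([0] * 80 + [1] * 20)

counts = np.bincount(y)                     # samples per class
weights = len(y) / (len(counts) * counts)   # n_samples / (n_classes * count)
print(dict(enumerate(weights)))
```

The minority class receives a weight four times larger than the majority class, so errors on defective samples cost proportionally more during training.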

In [None]:
plt.figure(figsize=(10,5))
sns.countplot(binary, y="defects");

In [None]:
(binary['defects'].value_counts(normalize=True)*100).round(2)

<a id='three'></a>
# 3. Auto ML test
I'll run AutoGluon and submit a first prediction based on the result.<br>

In [None]:
!pip install autogluon > /dev/null

In [None]:
from autogluon.tabular import TabularDataset, TabularPredictor
automl = TabularPredictor(
    problem_type='binary',
    label='defects',
    eval_metric='roc_auc'
)

In [None]:
automl.fit(train, presets='best_quality')
automl.leaderboard()

In [None]:
predictions = automl.predict(test)

In [None]:
predictions = predictions.map({False: 0, True: 1})

**The result is 0.66175, which is pretty disappointing**<br>
I'll focus on Random Forest and see if I can identify important features.
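One plausible contributor to the low score is that `predict` returns hard class labels, while the competition metric is ROC AUC, which rewards a good ranking of probabilities. AutoGluon's `TabularPredictor` also offers `predict_proba`, which may be worth trying. The sketch below uses made-up labels and probabilities (not AutoGluon output) to show how thresholding can lower the AUC.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and predicted probabilities (made-up values for illustration)
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
proba = np.array([0.05, 0.1, 0.2, 0.3, 0.45, 0.6, 0.4, 0.55, 0.7, 0.9])

# What a submission made of class labels looks like
hard = (proba >= 0.5).astype(int)

auc_proba = roc_auc_score(y_true, proba)
auc_hard = roc_auc_score(y_true, hard)
print(f"AUC with probabilities: {auc_proba:.4f}")
print(f"AUC with hard labels:   {auc_hard:.4f}")
```

Hard labels collapse all scores into two values, so many positive/negative pairs become ties and the ranking information is lost.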

<a id='four'></a>
# 4. Data Preparation

**Creation of X_train and y_train**

In [None]:
X_train = binary.drop('defects', axis=1)
y_train = binary['defects']
X_test = test.drop('id', axis=1)

**Transformed version**

In [None]:
transformer = FunctionTransformer(np.log1p)

X_train_t = transformer.fit_transform(X_train)
X_test_t = transformer.transform(X_test)

X_train_t = pd.DataFrame(data=X_train_t, columns=X_train.columns)
X_test_t = pd.DataFrame(data=X_test_t, columns=X_test.columns)

## Cross-validation strategy

In [None]:
# adapted from https://www.kaggle.com/code/ambrosm/pss3e23-eda-which-makes-sense
# Thanks a lot to AmbrosM (https://www.kaggle.com/ambrosm)

def cross_val(model, X_train_t, y_train):
    """Return the mean ROC AUC of `model` over stratified 5-fold cross-validation."""
    tic = time.time()

    kf = StratifiedKFold(shuffle=True, random_state=73)

    roc_auc = []
    # Split on the data that was passed in rather than on the global X_train
    for fold, (indX_tr, indX_va) in enumerate(kf.split(X_train_t, y_train)):
        X_tr = X_train_t.iloc[indX_tr]
        X_va = X_train_t.iloc[indX_va]
        y_tr = y_train.iloc[indX_tr]
        y_va = y_train.iloc[indX_va]
        model.fit(X_tr, y_tr)
        # Prefer probabilities for ROC AUC; fall back to raw predictions (e.g. regressors)
        if hasattr(model, "predict_proba"):
            y_va_pred = model.predict_proba(X_va)[:, 1]
        else:
            y_va_pred = model.predict(X_va)

        roc_auc.append(roc_auc_score(y_va, y_va_pred))
    roc_auc = np.array(roc_auc).mean()

    tac = time.time()
    print(f'execution time of {model}: {round((tac-tic),2)} seconds')
    return roc_auc
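The reason for using `StratifiedKFold` with an imbalanced target is that every validation fold keeps roughly the same class ratio as the full training set. A minimal sketch on a toy target (an assumption, not the competition data) shows this:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced target: 20% positives (illustrative only)
y = np.array([0] * 80 + [1] * 20)
X = np.arange(len(y)).reshape(-1, 1)

kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=73)

# Share of positive samples in each validation fold
positive_rates = [y[val_idx].mean() for _, val_idx in kf.split(X, y)]
print(positive_rates)
```

With a plain `KFold`, some folds could end up with noticeably more or fewer positives, which makes per-fold ROC AUC values noisier.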

In [None]:
# adapted from https://www.kaggle.com/code/ambrosm/pss3e23-eda-which-makes-sense

scores = []
score = cross_val(make_pipeline(
    StandardScaler(),
    PolynomialFeatures(2, include_bias=False),
    LogisticRegression(C=0.01, solver='newton-cholesky', penalty='l2', max_iter=1000, class_weight=None)),
    X_train_t, y_train)
scores.append(('LogisticRegression', score))

## Searching for best parameters
I used loops and plotted the results with the function below.
I did it manually because I want to understand and see how the score evolves with each parameter.

In [None]:
def plot_score(scores):
    sns.scatterplot(x=[x for x, y in scores], 
                   y=[y for x,y in scores])
    plt.xscale('linear')
    plt.show()
    print(scores)
    return True
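The manual search described above can be sketched as a loop that collects `(parameter value, score)` pairs, which is the input format `plot_score` expects. The synthetic dataset, the choice of `C` as the swept parameter and the candidate values are assumptions for illustration, not the settings actually explored in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the training data (assumption for illustration)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=73)

sweep = []
for C in [0.001, 0.01, 0.1, 1.0]:
    model = LogisticRegression(C=C, max_iter=1000)
    auc = cross_val_score(model, X_demo, y_demo, cv=5, scoring='roc_auc').mean()
    sweep.append((C, auc))   # (parameter value, score) pair

best_C, best_auc = max(sweep, key=lambda pair: pair[1])
print(sweep)
```

Sweeping one parameter at a time is slower than `RandomizedSearchCV`, but it makes the effect of each parameter visible.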

<a id='five'></a>
# 5. Random Forest Testing
As the submission file must contain probabilities rather than binary values, RandomForestRegressor seems better suited than RandomForestClassifier.<br>
I'll test both models anyway, as RandomForestClassifier should also work fine with the training set.

In [None]:
score = cross_val(
    RandomForestClassifier(bootstrap=True, 
                           max_depth=9, 
                           min_samples_leaf=200, 
                           min_samples_split=7,
                           n_estimators=70,
                           max_features=1.0,
                           class_weight='balanced_subsample',
                           random_state=73),
        X_train_t, y_train)
scores.append(('RandomForestClassifier', score))  

In [None]:
# for RandomForestRegressor if needed later
scorerf = []
for n_estimators in [100, 200, 300, 400, 500]:
    score = cross_val(
        RandomForestRegressor(n_estimators=n_estimators,  # value under test
                              max_features=1.0,
                              max_depth=8,
                              min_samples_leaf=72,
                              min_samples_split=16,
                              bootstrap=True,
                              random_state=73,
                              n_jobs=-1),
            X_train_t, y_train)
    scorerf.append((n_estimators, score))  # (parameter value, score) pairs for plot_score

Both models lead to the same conclusion: `loc` is by far the most important feature.<br> The importance of the other features varies but stays very low in every case. I'll keep them all for the time being.
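For readers who want to reproduce this kind of ranking, a fitted forest exposes `feature_importances_`, which can be wrapped in a sorted `pandas.Series`. The sketch below uses synthetic data in which only one column drives the target; the column names `f0`, `f1`, `f2` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: only 'f0' drives the target (column names are hypothetical)
rng = np.random.default_rng(73)
X_demo = pd.DataFrame(rng.normal(size=(1000, 3)), columns=['f0', 'f1', 'f2'])
y_demo = (X_demo['f0'] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=73)
forest.fit(X_demo, y_demo)

# Impurity-based importances, sorted from most to least important
importances = pd.Series(forest.feature_importances_, index=X_demo.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Impurity-based importances can be biased when features are strongly correlated, as they are here, so permutation importance is a useful cross-check.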

<a id='six'></a>
# 6. Logistic Regression

In [None]:
score = cross_val(make_pipeline(
    StandardScaler(),
    PolynomialFeatures(2, include_bias=False),
    LogisticRegression(C=0.001, solver='newton-cholesky', penalty='l2', max_iter=1000, class_weight='balanced')),
    X_train_t, y_train)
scores.append(('LogisticRegression', score))

<a id='seven'></a>
# 7. XGBoost

In [None]:
!pip install xgboost

In [None]:
params = {
    'max_depth': 6,
    'subsample': 0.6,
    'colsample_bytree': 0.7,
    'learning_rate': 0.02,
    'n_estimators': 800,
    'tree_method': 'hist',
    'random_state': 73,
}

score = cross_val(
    XGBClassifier(
        objective='binary:logistic',
        eval_metric='auc',
        **params),
        X_train_t, y_train)
scores.append(('XGBClassifier0', score))

In [None]:
# adapted from https://www.kaggle.com/code/adaubas/pss3e23-comparison-xgboost-parameters#.79065-on-public-LB
params = {
    'tree_method'        : 'hist',
    'random_state'       : 73,
    'max_depth'          : 4,   
    'colsample_bytree'   : 1.,  
    'colsample_bynode'   : .8, 
    'subsample'          : .7,
    'min_child_weight'   : 30, 
    'learning_rate'      : .02, 
    
    'n_estimators'       : 800, # Fixed by using early stopping with another seed
}

score =  cross_val(
    make_pipeline(
        ColumnTransformer(
            # Drop redundant columns: detected in documentation and by using permutation importance
            transformers = [('select', 'drop', ["d", "e", "branchCount", 'iv(g)', 't', 'b', 'n', 'lOCode', 'v', 'i'])], 
            remainder = 'passthrough'),
        XGBClassifier(**params)), 
    X_train_t, y_train)
scores.append(('XGBClassifier1', score))
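The `ColumnTransformer` in the pipeline above acts as a column selector: the `'drop'` transformer removes the listed columns, and `remainder='passthrough'` forwards everything else unchanged. A minimal sketch on a tiny frame with made-up values illustrates the behaviour:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer

# Tiny frame with hypothetical values; 'v' and 'b' stand in for redundant columns
demo = pd.DataFrame({
    'loc': [10, 20],
    'v': [1.5, 2.5],
    'b': [0.1, 0.2],
    'l': [0.3, 0.4],
})

selector = ColumnTransformer(
    transformers=[('select', 'drop', ['v', 'b'])],
    remainder='passthrough')

kept = selector.fit_transform(demo)
print(kept.shape)   # only the passthrough columns remain
```

Placing the selection inside the pipeline means the same columns are dropped consistently during cross-validation and at prediction time.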

In [None]:
score = cross_val(
    XGBClassifier(
        objective='binary:logistic',
        eval_metric='auc',
        booster='gbtree',
        learning_rate=0.05,
        gamma=0.75,
        max_depth=5,
        n_estimators=110,
        random_state=73),
        X_train_t, y_train)
scores.append(('XGBClassifier', score))  

### XGBoost Regressor

In [None]:
score = cross_val(
    XGBRegressor(
        objective='binary:logistic',
        eval_metric='auc',
        colsample_bytree=0.9,
        gamma=0,
        learning_rate=0.1,
        max_depth=3,
        min_child_weight=2,
        n_estimators=200,
        reg_alpha=0.01,
        reg_lambda=10,
        subsample=0.9,
        random_state=73),
        X_train_t, y_train)
scores.append(('XGBRegressor', score))  

In [None]:
xgbr = XGBRegressor(objective='binary:logistic',
                    eval_metric='auc',
                    colsample_bytree=0.9,
                    gamma=0,
                    learning_rate=0.1,
                    max_depth=3,
                    min_child_weight=2,
                    n_estimators=200,
                    reg_alpha=0.01,
                    reg_lambda=10,
                    subsample=0.9)
xgbr.fit(X_train, y_train)

<a id='eight'></a>
# 8. CatBoost

In [None]:
!pip install catboost

{'random_strength': 0.5,
 'learning_rate': 0.03,
 'l2_leaf_reg': 9,
 'iterations': 500,
 'grow_policy': 'SymmetricTree',
 'depth': 6,
 'border_count': 64,
 'bagging_temperature': 0.5}

In [None]:
score = cross_val(
    CatBoostClassifier(
        loss_function='CrossEntropy',
        learning_rate=0.05,
        iterations=500,
        depth=5,
        grow_policy='SymmetricTree',
        l2_leaf_reg=12,
        border_count=62,
        random_strength=1,
        random_seed=73,
        verbose=0),
        X_train_t, y_train)
scores.append(('CatBoostClassifier', score))  

<a id='nine'></a>
# 9. LightGBM

In [None]:
!pip install lightgbm

### Dart model

In [None]:
parameters = {
    'objective': 'binary',
    'metric': 'auc',
    'is_unbalance': 'true',
    'boosting': 'dart',
    'num_leaves': 23,
    'feature_fraction': 0.5,
    'bagging_fraction': 0.85,
    'bagging_freq': 20,
    'learning_rate': 0.1,
    'verbose': 0,
    'random_state':73
}
score = cross_val(lgb.LGBMClassifier(**parameters), X_train_t, y_train)
scores.append(('LGBMClassifier', score))

<a id='ten'></a>
# 10. ExtraTreesClassifier

In [None]:
score = cross_val(
    ExtraTreesClassifier(
        max_depth=100,
        max_features=1.0,
        min_samples_leaf=100,
        n_estimators=250,
        min_samples_split=5,
        bootstrap=False,
        random_state=73,
        verbose=0),
        X_train_t, y_train)
scores.append(('ExtraTreesClassifier', score))  

<a id='eleven'></a>
# 11. HistGradientBoostingClassifier

In [None]:
score = cross_val(
    HistGradientBoostingClassifier(
        max_depth=20,
        max_iter=500,
        learning_rate=0.0125,
        max_leaf_nodes=30,
        l2_regularization=0.1,
        random_state=73,
        verbose=0),
        X_train_t, y_train)
scores.append(('HistGradientBoostingClassifier', score))  

<a id='thirteen'></a>
# 13. Ensemble

In [None]:
best_params_rfc = {'bootstrap': True, 'max_depth': 9, 'min_samples_leaf': 200, 'min_samples_split': 7, 'n_estimators': 70}
rfc = RandomForestClassifier(bootstrap=True,                            
                             max_depth=9,
                             min_samples_leaf=200, 
                             min_samples_split=7,
                             n_estimators=70,
                             max_features=1.0,
                             class_weight='balanced_subsample',
                             random_state=73)

In [None]:
rfr = RandomForestRegressor(n_estimators=300,
                              max_features=1.0,
                              max_depth=8,
                              min_samples_leaf=72,
                              min_samples_split=16,
                              bootstrap=True,
                              random_state=73,
                              n_jobs=-1)

In [None]:
best_params_lr = {'solver': 'liblinear', 'penalty': 'l1', 'max_iter': 1000, 'C': 0.01}
lr = LogisticRegression(**best_params_lr,                      
                      class_weight='balanced',
                      n_jobs=-1,
                      random_state=12345)

lr = make_pipeline(
        StandardScaler(),
        PolynomialFeatures(2, include_bias=False),
        LogisticRegression(C=0.001, solver='newton-cholesky', penalty='l2', max_iter=1000, class_weight='balanced')
)

In [None]:
best_params_xgbc = {'reg_lambda': 0.1, 'reg_alpha': 51.2, 'n_estimators': 130, 'max_depth': 7, 'learning_rate': 0.2, 'gamma': 0}
xgbc0 = XGBClassifier(objective='binary:logistic',
            eval_metric='auc',
            booster='gbtree',
            learning_rate=0.05,
            gamma=0.75,
            max_depth=5,
            n_estimators=110,
            random_state=73)

In [None]:
params = {
    'max_depth': 6,
    'subsample': 0.6,
    'colsample_bytree': 0.7,
    'learning_rate': 0.02,
    'n_estimators': 800,
    'tree_method': 'hist',
    'random_state': 73,
}

xgbc1 = XGBClassifier(objective='binary:logistic',
                      eval_metric='auc',
                      **params)

In [None]:
# adapted from https://www.kaggle.com/code/adaubas/pss3e23-comparison-xgboost-parameters#.79065-on-public-LB
params = {
    'tree_method'        : 'hist',
    'random_state'       : 73,
    'max_depth'          : 4,   # I want only weak learners
    'colsample_bytree'   : 1.,  # To be sure that 'loc' will be in all trees
    'colsample_bynode'   : .8, 
    'subsample'          : .7,
    'min_child_weight'   : 30, 
    'learning_rate'      : .02, # fixed to early stop before 1000 estimators
    
    'n_estimators'       : 800, # Fixed by using early stopping with another seed
}

xgbc2 = make_pipeline(
        ColumnTransformer(
            # Drop redundant columns: detected in documentation and by using permutation importance
            transformers = [('select', 'drop', ["d", "e", "branchCount", 'iv(g)', 't', 'b', 'n', 'lOCode', 'v', 'i'])], 
            remainder = 'passthrough'),
        XGBClassifier(**params))

In [None]:
best_params_xgbr = {}
xgbr = XGBRegressor(objective='binary:logistic',
                    eval_metric='auc',
                    colsample_bytree=0.9,
                    gamma=0,
                    learning_rate=0.1,
                    max_depth=3,
                    min_child_weight=2,
                    n_estimators=200,
                    reg_alpha=0.01,
                    reg_lambda=10,
                    subsample=0.9)

xgbr._estimator_type = "classifier"

In [None]:
best_params_cbr = {'random_strength': 0.5, 'learning_rate': 0.03, 'l2_leaf_reg': 9, 'iterations': 500,
                   'grow_policy': 'SymmetricTree', 'depth': 6, 'border_count': 64, 'bagging_temperature': 0.5}

cb = CatBoostClassifier(
            loss_function='CrossEntropy',
            learning_rate=0.05,
            iterations=500,
            depth=5,
            grow_policy='SymmetricTree',
            l2_leaf_reg=12,
            border_count=62,
            random_strength=1,
            random_seed=73,
            verbose=0)

In [None]:
# adapted from https://www.kaggle.com/code/iqmansingh/software-defect-ensemble-lgbm-sampling#Optuna-Tuning-LGBM

import lightgbm as lgb
from lightgbm import LGBMClassifier
param0 = {'n_estimators': 957, 'max_depth': 5, 'learning_rate': 0.014544218759128154, 
          'min_child_weight': 2.7336861099989402, 'min_child_samples': 44, 
          'subsample': 0.6786668835159727, 'subsample_freq': 2, 'colsample_bytree': 0.7487125800084695}

param1 = {'n_estimators': 844, 'max_depth': 5, 'learning_rate': 0.011533926188770152, 
          'min_child_weight': 0.7353926016580375, 'min_child_samples': 22, 
          'subsample': 0.7440626280651244, 'subsample_freq': 3, 'colsample_bytree': 0.5193554106905941}

param2 = {'n_estimators': 1374, 'max_depth': 8, 'learning_rate': 0.0030464211231350236, 
          'min_child_weight': 2.6773708665039124, 'min_child_samples': 36, 
          'subsample': 0.5495916550024219, 'subsample_freq': 1, 'colsample_bytree': 0.5065855075580986}

lgbm0 =    make_pipeline(
        ColumnTransformer(
            # Drop redundant columns: detected in documentation and by using permutation importance
            transformers = [('select', 'drop', ["d", "e", "branchCount", 'iv(g)', 't', 'b', 'n', 'lOCode', 'v', 'i'])], 
            remainder = 'passthrough'),
        lgb.LGBMClassifier(**param0))
    
lgbm1 =    make_pipeline(
        ColumnTransformer(
            # Drop redundant columns: detected in documentation and by using permutation importance
            transformers = [('select', 'drop', ["d", "e", "branchCount", 'iv(g)', 't', 'b', 'n', 'lOCode', 'v', 'i'])], 
            remainder = 'passthrough'),
        lgb.LGBMClassifier(**param1))
    
lgbm2 =    make_pipeline(
        ColumnTransformer(
            # Drop redundant columns: detected in documentation and by using permutation importance
            transformers = [('select', 'drop', ["d", "e", "branchCount", 'iv(g)', 't', 'b', 'n', 'lOCode', 'v', 'i'])], 
            remainder = 'passthrough'),
        lgb.LGBMClassifier(**param2))

In [None]:
etc = ExtraTreesClassifier(
            max_depth=100,
            max_features=1.0,
            min_samples_leaf=100,
            n_estimators=300,
            min_samples_split=5,
            bootstrap=False,
            random_state=73,
            verbose=0)

In [None]:
best_params_hgbc = {'l2_regularization': 0.1,
 'learning_rate': 0.1,
 'max_depth': 5,
 'max_iter': 469,
 'max_leaf_nodes': 90}
hgbc = HistGradientBoostingClassifier(
            max_depth=20,
            max_iter=500,
            learning_rate=0.0125,
            max_leaf_nodes=30,
            l2_regularization=0.1,
            random_state=73,
            verbose=0)

I searched for the best weight distribution by looping through different values, which is not optimal.
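Soft voting boils down to a weighted average of each model's predicted probabilities, so a weight search can be sketched without refitting any model. The example below is a simplified illustration with made-up validation labels and probabilities; the `blend` helper and the small weight grid are assumptions, not the loop actually used to obtain the weights in this notebook.

```python
import itertools
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up validation labels and positive-class probabilities from three models
y_val = np.array([0, 0, 1, 1, 0, 1])
probas = np.array([
    [0.2, 0.6, 0.5, 0.9, 0.1, 0.4],   # model A
    [0.3, 0.2, 0.7, 0.6, 0.4, 0.8],   # model B
    [0.5, 0.5, 0.5, 0.5, 0.5, 0.5],   # model C (uninformative)
])

def blend(probas, weights):
    """Soft voting: weighted average of the per-model probabilities."""
    return np.average(probas, axis=0, weights=weights)

# Exhaustive loop over a small grid of candidate weights
grid = [0.5, 1.0, 2.0]
results = [(w, roc_auc_score(y_val, blend(probas, w)))
           for w in itertools.product(grid, repeat=len(probas))]
best_weights, best_auc = max(results, key=lambda r: r[1])
print(best_weights, round(best_auc, 4))
```

Weights tuned this way should be evaluated on held-out or out-of-fold predictions, since tuning and scoring on the same data tends to overfit.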

In [None]:
rfr._estimator_type = "classifier"
estimators1 =(
    [('lr', lr),
     ('rf', rfc),
     ('xgbc0', xgbc0),
     ('xgbc1', xgbc1),
     ('xgbc2', xgbc2),
     ('cbc', cb),
     ('hgb', hgbc),
    ('lgbm0', lgbm0),
    ('lgbm1', lgbm1),
    ('lgbm2', lgbm2)])

In [None]:
vcl1 = VotingClassifier(estimators=estimators1,
                        weights=[0.25, 0.5, 1.5, 3.5, 2.75, 0.75, 0.5, 3.75, 2.75, 1.75],
                        voting='soft',
                        n_jobs=-1)
vcl1.fit(X_train_t, y_train)
predictions = vcl1.predict_proba(X_test_t)[:,1]

<a id='fourteen'></a>
# 14. Conclusion and Submission

My final LB score is 0.79035. I'm still in the top 20%, which was one of my goals, but we'll see how it goes with the final score.
Adding a proper neural network could improve the results, but at this stage I'll keep it as is.

In [None]:
output_sample = pd.read_csv('/kaggle/input/playground-series-s3e23/sample_submission.csv')
output = pd.DataFrame({'id': output_sample.id, 'defects': predictions})
output.to_csv('submission.csv', index=False, sep=',')
print("Your submission was successfully saved!")