<a href="https://www.kaggle.com/code/francescoliveras/ps-s3-e23-eda-model-en-es?scriptVersionId=145218466" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# <p style="font-family:Consolas Mono; font-weight:normal; letter-spacing: 2px; color:#37FABC; font-size:160%; text-align:center;padding: 0px; border-bottom: 5px solid #407A68">Playground Series S3 E23: EDA and simple models</p>

## <p style="font-family:Consolas Mono; font-weight:normal; letter-spacing: 2px; color:#06D1C7; font-size:130%; text-align:left;padding: 0px; border-bottom: 5px solid #008F77">Intro</p>

**🟦EN**:
<div class="alert alert-block alert-info" style="font-size:14px; font-family:verdana; line-height: 1.7em; color:#5361fc;">
This Kaggle notebook provides an exploratory data analysis (EDA) and a set of simple baseline models. The models are not optimized, but they give a rough idea of which model families suit this dataset.
Through the EDA we get a better understanding of the structure of the data, the relationships between features, and the missing values, patterns, or outliers that may affect modeling or model selection. It also helps identify potential pitfalls early and decide on the preprocessing needed to improve model performance and accuracy.
</div>

**🟥ES**: 
<div class="alert alert-block alert-info" style="font-size:14px; font-family:verdana; line-height: 1.7em; background-color: #c9b1fa; color:#38196e;">
Este cuaderno de Kaggle tiene el objetivo de proporcionar un análisis exploratorio de datos (AED) y un conjunto de modelos simples. Los modelos no estarán optimizados, pero pueden dar una idea aproximada de qué familias de modelos se ajustan mejor a este conjunto de datos.

A través de este AED, podremos obtener una comprensión más profunda de la estructura de los datos, las relaciones entre las características, y los valores que faltan, patrones o valores anómalos que puedan afectar al modelado o a la selección del modelo. Al realizar un AED, también podemos identificar posibles obstáculos y decidir el procesado necesario para mejorar el rendimiento y la precisión de los modelos.
</div>

## <p style="font-family:Consolas Mono; font-weight:normal; letter-spacing: 2px; color:#06D1C7; font-size:130%; text-align:left;padding: 0px; border-bottom: 3px solid #008F77">Data information</p>

**🟦EN**:

The dataset for this competition (both train and test) was generated from a deep learning model trained on the Software Defect Dataset. Feature distributions are close to, but not exactly the same as, those of the original. Feel free to use the original dataset as part of this competition, both to explore differences and to see whether incorporating the original in training improves model performance.

**Files**

* ```train.csv``` - the training dataset; defects is the binary target, which is treated as a boolean (False=0, True=1)
* ```test.csv``` - the test dataset; your objective is to predict the probability of positive defects (i.e., defects=True)
* ```sample_submission.csv``` - a sample submission file in the correct format


**🟥ES**:

El conjunto de datos para esta competición (tanto de entrenamiento como de prueba) se generó a partir de un modelo de aprendizaje profundo entrenado en el conjunto de datos de defectos de software. Las distribuciones de las características son similares, aunque no exactamente iguales, a las del original. No dude en utilizar el conjunto de datos original como parte de esta competición, tanto para explorar las diferencias como para ver si la incorporación del original en el entrenamiento mejora el rendimiento del modelo.


**Archivos**
* ```train.csv``` - el conjunto de datos de entrenamiento; defectos es el objetivo binario, que se trata como un booleano (Falso=0, Verdadero=1)
* ```test.csv``` - el conjunto de datos de prueba; su objetivo es predecir la probabilidad de defectos positivos (es decir, defectos=Verdadero)
* ```sample_submission.csv``` - un archivo de envío de muestra en el formato correcto


## <p style="font-family:Consolas Mono; font-weight:normal; letter-spacing: 2px; color:#06D1C7; font-size:130%; text-align:left;padding: 0px; border-bottom: 3px solid #008F77">Library import</p>

In [None]:
import os 
import sys
import math
import time
import random
import warnings
import numpy as np 
import pandas as pd
import seaborn as sns
import lightgbm as lgb
import missingno as msno
import plotly.express as px
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import matplotlib.colors as mcolors

from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.decomposition import PCA
from catboost import CatBoostClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import roc_auc_score
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import VotingClassifier
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.metrics import f1_score, accuracy_score
from sklearn.naive_bayes import GaussianNB, BernoulliNB
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.preprocessing import MinMaxScaler, StandardScaler, LabelEncoder
from sklearn.model_selection import KFold, StratifiedKFold, train_test_split, GridSearchCV
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier, HistGradientBoostingClassifier, GradientBoostingClassifier, ExtraTreesClassifier


In [None]:
# Set the notebook theme
from colorama import Fore, Style

# Colors
red = Fore.RED + Style.BRIGHT
mgta = Fore.MAGENTA + Style.BRIGHT
yllw = Fore.YELLOW + Style.BRIGHT
cyn = Fore.CYAN + Style.BRIGHT
blue = Fore.BLUE + Style.BRIGHT

# Reset
res = Style.RESET_ALL
plt.style.use({"figure.facecolor": "#282a36"})

In [None]:
# Colors
YELLOW = "#F7C53E"

CYAN_G = "#0CF7AF"
CYAB_DARK = "#11AB7C"

PURPLE = "#D826F8"
PURPLE_DARJ = "#9309AB"
PURPLE_L = "#b683d6"

BLUE = "#0C97FA"
RED = "#FA1D19"
ORANGE = "#FA9F19"
GREEN = "#0CFA58"
LIGTH_BLUE = "#01FADC"
S_BLUE = "#81c9e6"
DARK_BLUE = "#394be6"
# Palettes
PALETTE_2 = [CYAN_G, PURPLE]
PALETTE_3 = [YELLOW, CYAN_G, PURPLE]
PALETTE_4 = [YELLOW, ORANGE, PURPLE, LIGTH_BLUE]
PALETTE_5 = [PURPLE_DARJ, PURPLE_L, PURPLE, BLUE, LIGTH_BLUE]
PALETTE_6 = [BLUE, RED, ORANGE, GREEN, LIGTH_BLUE, PURPLE]

# Vaporwave palette by Francesc Oliveras
PALETTE_7 = [PURPLE_DARJ, PURPLE_L, PURPLE, BLUE, LIGTH_BLUE, DARK_BLUE, S_BLUE]
PALETTE_7_C = [PURPLE_DARJ, BLUE, PURPLE, LIGTH_BLUE, PURPLE_L, S_BLUE, DARK_BLUE]
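INCLUDE_ORIGINAL = True
# SEED and N_SPLITS are reassigned in the Constants cell below (SEED = 50, N_SPLITS = 5);
# when the cells run in order, those later values are the ones the models use.
SEED = 18
FOLDS = 5
N_SPLITS = 7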
sns.palplot(sns.color_palette(PALETTE_7))

# Set Style
sns.set_style("whitegrid")
sns.despine(left=True, bottom=True)

cmap = mcolors.LinearSegmentedColormap.from_list("", PALETTE_2)
cmap_2 = mcolors.LinearSegmentedColormap.from_list("", [S_BLUE, PURPLE_DARJ])

font_family = dict(layout=go.Layout(font=dict(family="Franklin Gothic", size=10), width=1000, height=500))

warnings.filterwarnings('ignore')

## <p style="font-family:Consolas Mono; font-weight:normal; letter-spacing: 2px; color:#06D1C7; font-size:130%; text-align:left;padding: 0px; border-bottom: 3px solid #008F77">Constants</p>

In [None]:
PATH = "/kaggle/input/playground-series-s3e23"
ORIGINAL_PATH = "/kaggle/input/software-defect-prediction/jm1.csv"
SUBMISSION_FILENAME = "sample_submission.csv"
TEST_FILENAME = "test.csv"
TRAIN_FILENAME = "train.csv"


SUBMISSION_DIR = os.path.join(PATH, SUBMISSION_FILENAME)
TRAIN_DIR = os.path.join(PATH, TRAIN_FILENAME) 
TEST_DIR = os.path.join(PATH, TEST_FILENAME)


TARGET = "defects"

SEED = 50
N_SPLITS = 5
REP = 4

## <p style="font-family:Consolas Mono; font-weight:normal; letter-spacing: 2px; color:#06D1C7; font-size:130%; text-align:left;padding: 0px; border-bottom: 3px solid #008F77">Functions</p>

In [None]:
def show_corr_heatmap(df, title):
    """Plot the lower triangle of the correlation matrix of `df`."""
    corr = df.corr()
    # Boolean mask: True hides the diagonal and the upper triangle
    mask = np.triu(np.ones_like(corr, dtype=bool))

    plt.figure(figsize = (15, 10))
    plt.title(title)
    # Annotate cells only when the matrix is small enough to stay readable
    if df.shape[1] < 25:
        sns.heatmap(corr, annot=True, linewidths=.5, fmt=".2f", square=True, mask=mask, cmap=cmap_2)
    else:
        sns.heatmap(corr, annot=False, linewidths=.5, square=True, mask=mask, cmap=cmap_2)

    plt.show()
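
The masking step in the heatmap helper can be checked in isolation. The sketch below is a minimal illustration on a made-up frame (the column names `a` to `d` are hypothetical): it builds the boolean upper-triangle mask and shows which correlation cells stay visible.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the column names are illustrative only.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(50, 4)), columns=["a", "b", "c", "d"])
corr = toy.corr()

# True marks cells to hide: the diagonal and everything above it.
mask = np.triu(np.ones_like(corr, dtype=bool))

# Cells that remain visible form the strict lower triangle.
visible = corr.where(~mask)
print(visible.round(2))
```

For a 4x4 matrix this leaves six visible cells, one per distinct feature pair, which is why the mask removes no information from the plot.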

In [None]:
def data_description(df):
    print("Data description")
    print(f"Total number of records: {df.shape[0]}")
    print(f"Number of features: {df.shape[1]}\n\n")
    columns = df.columns
    data_type = []
    
    # Get the datatype of features
    for col in df.columns:
        data_type.append(df[col].dtype)
        
    n_uni = df.nunique()
    # Number of NaN values
    n_miss = df.isna().sum()
    
    names = list(zip(columns, data_type, n_uni, n_miss))
    variable_desc = pd.DataFrame(names, columns=["Name","Type","Unique levels","Missing"])
    print(variable_desc)

In [None]:
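# NOTE: this helper depends on `labels`, `histplot_hyperparams`, `plot_cont_dot`, and a
# "set" column in `comb_df`, none of which are defined in this notebook. It is not called below.
def plot_cont(col, ax, color=PALETTE_7[0]):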
    sns.histplot(data=comb_df, x=col,
                hue="set",ax=ax, hue_order=labels,
                common_norm=False, **histplot_hyperparams)
    
    ax_2 = ax.twinx()
    ax_2 = plot_cont_dot(
        comb_df.query('set=="train"'),
        col, TARGET, ax_2,
        color=color
    )
    
    ax_2 = plot_cont_dot(
        comb_df, col,
        TARGET, ax_2,
        color=color
    )

In [None]:
def pie_plot(df: pd.DataFrame, hover_temp: str = "Status",
            feature=TARGET, palette=[LIGTH_BLUE,"#221e8f"], color=[BLUE ,PURPLE_DARJ],
            title_="Target distribution"):
    # Share of each class in percent; selecting a single column keeps plain labels instead of tuples
    target = df[feature].value_counts(normalize=True).sort_index().mul(100).round(1)
    fig = go.Figure()
    
    fig.add_trace(go.Pie(labels=target.index.astype(str), values=target, hole=.4,
                        sort=False, showlegend=True, marker=dict(colors=color, line=dict(color=palette,width=2)),
                        hovertemplate = "%{label} " + hover_temp + ": %{value:.2f}%<extra></extra>"))
    
    fig.update_layout(template=font_family, title=title_, 
                  legend=dict(traceorder="reversed",y=1.05,x=0),
                  uniformtext_minsize=15, uniformtext_mode="hide",height=600)
    fig.show()
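
The percentages shown in the pie chart come from a normalized value count. A minimal sketch of that computation, using a made-up four-row target (an illustration, not competition data):

```python
import pandas as pd

# Hypothetical target column: three non-defective modules and one defective.
target_toy = pd.Series([False, False, False, True], name="defects")

# Class shares in percent, ordered False then True.
share = target_toy.value_counts(normalize=True).sort_index().mul(100).round(1)
print(share)
```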

In [None]:
def string_transform(df):
    for col in df.columns:
        if pd.api.types.is_string_dtype(df[col]):
            df[col] = pd.to_numeric(df[col], errors="coerce")
        
    return df
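
To see what the conversion does, here is a small sketch on made-up rows. It assumes the raw CSV marks missing entries with a placeholder such as `?`; that is an assumption for illustration, and the values below are hypothetical.

```python
import pandas as pd

# Hypothetical rows: "?" stands in for a missing-value marker in a raw CSV.
raw = pd.DataFrame({"uniq_Op": ["12", "?", "7"], "loc": [10.0, 4.0, 3.0]})

converted = raw.copy()
for col in converted.columns:
    # Only string-typed columns are parsed; numeric columns are left untouched.
    if pd.api.types.is_string_dtype(converted[col]):
        converted[col] = pd.to_numeric(converted[col], errors="coerce")

print(converted.dtypes)
print(converted.isna().sum())
```

With `errors="coerce"`, anything that cannot be parsed as a number becomes `NaN`, which the `SimpleImputer` used later in the modeling pipeline can then fill.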

In [None]:
def cvl(X, y, estimator, cv, label):
    val_predictions = np.zeros((len(X)))
    train_sc, val_sc = [], []
    
    # Train the model, predict defect probabilities, and score each fold
    for fold, (train_idx, val_idx) in enumerate(cv.split(X, y)):
        
        model = clone(estimator)
        
        #define train set
        X_train = X.iloc[train_idx]
        y_train = y.iloc[train_idx]
        
        #define validation set
        X_val = X.iloc[val_idx]
        y_val = y.iloc[val_idx]
        
        model.fit(X_train, y_train)
        
        #make predictions
        train_preds = model.predict_proba(X_train)[:, 1]
        val_preds = model.predict_proba(X_val)[:, 1]
                  
        val_predictions[val_idx] += val_preds
        
        #evaluate model for a fold
        train_score = roc_auc_score(y_train, train_preds)
        val_score = roc_auc_score(y_val, val_preds)
        
        #append model score for a fold to list
        train_sc.append(train_score)
        val_sc.append(val_score)
    
    print(f"{red}Train score:{res} {yllw}{np.mean(train_sc):.6f} ± {np.std(train_sc):.6f}{res} \t {cyn}Validation score:{res} {yllw}{np.mean(val_sc):.6f} ± {np.std(val_sc):.6f}{res} \t {blue}Model:{res}{yllw}{label}{res}")
    
    return val_sc, val_predictions
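
The same out-of-fold (OOF) pattern can be tried on synthetic data. The sketch below is a simplified stand-in for `cvl`, assuming a toy dataset from `make_classification` and a single logistic regression; the names `X_toy`, `y_toy`, and `oof` are illustrative.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic binary problem used only for illustration.
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
estimator = LogisticRegression(max_iter=1000)

oof = np.zeros(len(X_toy))
fold_scores = []
for train_idx, val_idx in cv.split(X_toy, y_toy):
    # A fresh clone per fold, so no state leaks between folds.
    model = clone(estimator)
    model.fit(X_toy[train_idx], y_toy[train_idx])
    # Each row is predicted exactly once, by a model that never saw it.
    oof[val_idx] = model.predict_proba(X_toy[val_idx])[:, 1]
    fold_scores.append(roc_auc_score(y_toy[val_idx], oof[val_idx]))

print(f"Fold AUC: {np.mean(fold_scores):.4f} ± {np.std(fold_scores):.4f}")
print(f"OOF AUC:  {roc_auc_score(y_toy, oof):.4f}")
```

Because every row's prediction comes from a model trained without that row, the OOF vector can later serve as an input for blending without leaking the target.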

## <p style="font-family:Consolas Mono; font-weight:normal; letter-spacing: 2px; color:#06D1C7; font-size:130%; text-align:left;padding: 0px; border-bottom: 3px solid #008F77">Import data</p>

In [None]:
train_df = pd.read_csv(TRAIN_DIR, index_col="id")
test_df = pd.read_csv(TEST_DIR, index_col="id")
original_df = pd.read_csv(ORIGINAL_PATH)
submission_df = pd.read_csv(SUBMISSION_DIR)

# comb_df = pd.concat([train_df, original_df], ignore_index=True)

## <p style="font-family:Consolas Mono; font-weight:normal; letter-spacing: 2px; color:#06D1C7; font-size:130%; text-align:left;padding: 0px; border-bottom: 3px solid #008F77">EDA and data modification</p>

**🟦EN**: Displays relevant data about the dataframes


**🟥ES**: Muestra datos relevantes sobre los dataframes

In [None]:
data_description(train_df)
data_description(test_df)
data_description(original_df)

**🟦EN**: Show example rows from the dataframes


**🟥ES**: Mostrar datos de ejemplos de los dataframes

In [None]:
string_transform(original_df)

In [None]:
test_df

**🟦EN**: Combine the original dataframe and the PS train dataframe


**🟥ES**: Combinamos el dataframe original y el dataframe de train de PS

In [None]:
comb_df = pd.concat([train_df, original_df], ignore_index=True)

In [None]:
display(original_df.head())
display(train_df.head())

**🟦EN**: Display a heat map of the correlation of all dataframes


**🟥ES**: Mostramos el mapa de calor referente a la correlación de todos los dataframe

In [None]:
# The helper draws the figure itself and returns None, so no display() wrapper is needed
show_corr_heatmap(train_df, "Train dataframe heatmap")
show_corr_heatmap(test_df, "Test dataframe heatmap")
show_corr_heatmap(original_df, "Original dataframe heatmap")
show_corr_heatmap(comb_df, "Combination of train and original dataframe heatmap")

**🟦EN**: Display target distribution of train dataframe and combination dataframe


**🟥ES**: Mostramos la distribución de targets del dataframe de train y de la combinación

In [None]:
# pie_plot calls fig.show() and returns None, so it is called directly
pie_plot(train_df)
pie_plot(comb_df)

**🟦EN**: Display scatterplots of selected feature pairs


**🟥ES**: Mostramos un scatterplot de diferentes valores

In [None]:
fig, axes = plt.subplots(3, 2, figsize = (20,12))

sns.scatterplot(ax = axes[0][0], data = comb_df, x = 'uniq_Op', y = 'uniq_Opnd', hue = TARGET, palette=PALETTE_7_C)
sns.scatterplot(ax = axes[0][1], data = comb_df, x = 'total_Opnd', y = 'n', hue = TARGET, palette=PALETTE_7_C)
sns.scatterplot(ax = axes[1][0], data = comb_df, x = 'total_Opnd', y = 'b', hue = TARGET, palette=PALETTE_7_C)
sns.scatterplot(ax = axes[1][1], data = comb_df, x = 'total_Op', y = 'total_Opnd', hue = TARGET, palette=PALETTE_7_C)
sns.scatterplot(ax = axes[2][0], data = comb_df, x = 'total_Opnd', y = 'lOCode', hue = TARGET, palette=PALETTE_7_C)
sns.scatterplot(ax = axes[2][1], data = comb_df, x = 'b', y = 'n', hue = TARGET, palette=PALETTE_7_C)

In [None]:
columns_of_interest = ["lOCode", "lOComment", "lOBlank", "defects"]

data_subset = train_df[columns_of_interest]

fig, axes = plt.subplots(nrows=3, ncols=1, figsize=(7, 10), sharex=True)

for i, column in enumerate(["lOCode", "lOComment", "lOBlank"]):
    sns.kdeplot(data=data_subset, x=column, hue=TARGET, common_norm=False, ax=axes[i], palette=PALETTE_7_C)
    axes[i].set_title(f"KDE Plot for {column} with Defects")
    axes[i].set_xlim(-10, 50)

plt.tight_layout()
plt.show()

## <p style="font-family:Consolas Mono; font-weight:normal; letter-spacing: 2px; color:#06D1C7; font-size:130%; text-align:left;padding: 0px; border-bottom: 3px solid #008F77">Model</p>

[scikit-learn documentation: Principal Component Analysis (PCA)](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html)

In [None]:
slr_pip = Pipeline([("scaler", StandardScaler()), 
                   ("PCA", PCA())]).fit(train_df.drop(columns = [TARGET]))
slr_pip
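
A fitted scaler-plus-PCA pipeline is mostly useful for inspecting how much variance each component explains. The following is a minimal sketch on synthetic data (the toy matrix and the 95% threshold are assumptions chosen for illustration, not values from this notebook):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features standing in for the training columns.
X_toy, _ = make_classification(n_samples=200, n_features=10, n_informative=4, random_state=0)
pipe = Pipeline([("scaler", StandardScaler()), ("PCA", PCA())]).fit(X_toy)

# Cumulative share of variance explained by the first k components.
cum_var = np.cumsum(pipe.named_steps["PCA"].explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches 95%.
n_keep = int(np.searchsorted(cum_var, 0.95) + 1)
print(cum_var.round(3))
print(f"Components needed for 95% of the variance: {n_keep}")
```

The same two lines applied to `slr_pip.named_steps["PCA"]` would show how redundant the software metrics are, which the correlation heatmaps already hint at.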

**🟦EN**: Split the data and the target


**🟥ES**: Separamos los datos del target

In [None]:
X = comb_df.drop(columns = [TARGET])
y = comb_df[TARGET].map({False: 0, True: 1})

In [None]:
skf = StratifiedKFold(n_splits = N_SPLITS, random_state = SEED, shuffle = True)
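
Stratification matters here because the target is imbalanced. As a small sketch, assuming a made-up target with 20% positives, each validation fold keeps the same positive rate as the full set:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced target: 20 positives out of 100 rows.
y_toy = np.array([1] * 20 + [0] * 80)
X_toy = np.zeros((len(y_toy), 1))  # features are irrelevant for the split itself

skf_toy = StratifiedKFold(n_splits=5, shuffle=True, random_state=50)
rates = [y_toy[val_idx].mean() for _, val_idx in skf_toy.split(X_toy, y_toy)]
print(rates)
```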

**🟦EN**: Declare all the models and train them


**🟥ES**: Declarar todos los modelos y entrenarlos

In [None]:
scr_list = pd.DataFrame()
oof_list = pd.DataFrame()

all_models = [
    ('gnb', GaussianNB()),
    ('bnb', BernoulliNB()),
    ('lda', LinearDiscriminantAnalysis()),
    ('xgb', XGBClassifier(random_state = SEED)),
    ('lgb', LGBMClassifier(random_state = SEED)),
    ('et', ExtraTreesClassifier(random_state = SEED)),
    ('rf', RandomForestClassifier(random_state = SEED)),
    ('gb', GradientBoostingClassifier(random_state = SEED)),
    ('hgb', HistGradientBoostingClassifier(random_state = SEED)),
    ('cb', CatBoostClassifier(random_state = SEED, verbose = 0)),
    ('log', LogisticRegression(random_state = SEED, max_iter = 1000000)),
    ('dart', LGBMClassifier(random_state = SEED, boosting_type = "dart")),

]

    #('svc', SVC(random_state = seed, probability = True)),
    #('knn', KNeighborsClassifier()),
    #('gauss', GaussianProcessClassifier(random_state = SEED)),


for (lbl, mod) in all_models:
    scr_list[lbl], oof_list[lbl] = cvl(X, y,
        make_pipeline(SimpleImputer(), mod),
        skf,
        label = lbl,
    )


In [None]:
plt.figure(figsize = (10, 6), dpi = 295)
sns.barplot(data = scr_list.reindex((-1 * scr_list).mean().sort_values().index, axis = 1), palette = PALETTE_7_C)
plt.title("Model score Comparison", weight = 'bold', size = 18)
plt.show()

In [None]:
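# Fit a ridge classifier on the out-of-fold predictions; its coefficients serve as blending weights
model_weights = RidgeClassifier(random_state = SEED).fit(oof_list, comb_df.defects).coef_[0]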
df_model_weights = pd.DataFrame(model_weights, index=list(oof_list), columns=["Weight / Model"])
df_model_weights_sorted = df_model_weights.sort_values(by="Weight / Model", ascending=False)
print(df_model_weights_sorted)
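
The idea behind the weights can be illustrated with two made-up OOF columns: one that tracks the target and one that is pure noise. This is a sketch under those assumptions; the column names `good` and `noise` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import RidgeClassifier

rng = np.random.default_rng(0)
y_toy = rng.integers(0, 2, size=500)

# Hypothetical OOF predictions: "good" follows the target, "noise" does not.
oof_toy = pd.DataFrame({
    "good": np.clip(y_toy * 0.6 + rng.normal(0.2, 0.2, size=500), 0, 1),
    "noise": rng.uniform(0, 1, size=500),
})

# For a binary target, coef_ has shape (1, n_models); row 0 holds one weight per model.
weights = RidgeClassifier(random_state=0).fit(oof_toy, y_toy).coef_[0]
print(pd.Series(weights, index=oof_toy.columns).sort_values(ascending=False))
```

The informative column receives a much larger coefficient than the noisy one, which is the behaviour the notebook relies on when ranking its models.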

In [None]:
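# Soft voting averages the members' predicted probabilities using the ridge weights.
# Ridge coefficients are unconstrained, so some of these weights may be negative.
v_class = VotingClassifier(all_models, weights = model_weights, voting = "soft")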

model = make_pipeline(
    SimpleImputer(),
    v_class
)

model.fit(X, y)
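
Soft voting is a weighted average of the members' class probabilities. The sketch below checks this on synthetic data with two members and hand-picked weights (both are assumptions for illustration, not the notebook's models or weights):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=0)
members = [("log", LogisticRegression(max_iter=1000)), ("gnb", GaussianNB())]
w = [3.0, 1.0]  # illustrative weights

vote = VotingClassifier(members, weights=w, voting="soft").fit(X_toy, y_toy)

# Recompute the blend by hand from the fitted members.
manual = np.average(
    [est.predict_proba(X_toy) for est in vote.estimators_], axis=0, weights=w
)
print(np.allclose(vote.predict_proba(X_toy), manual))
```

Since the average is not clipped, negative weights can push blended values outside the range [0, 1]. One possible option, not used in this notebook, is to clip the weights at zero before passing them to the ensemble.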

## <p style="font-family:Consolas Mono; font-weight:normal; letter-spacing: 2px; color:#06D1C7; font-size:130%; text-align:left;padding: 0px; border-bottom: 3px solid #008F77">Submission</p>

In [None]:
# test_df_ = test_df.drop("id", axis=1)

In [None]:
submission_df = test_df.copy()
submission_df[TARGET] = model.predict_proba(submission_df)[:, 1]
submission_df.defects.to_csv("submission.csv")
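
The resulting file should contain an `id` column and a `defects` column. A minimal sketch of that layout, written to an in-memory buffer with made-up ids and probabilities:

```python
import io

import pandas as pd

# Hypothetical ids and probabilities, for illustration only.
preds = pd.DataFrame(
    {"defects": [0.12, 0.87]},
    index=pd.Index([101763, 101764], name="id"),
)

buffer = io.StringIO()
# Writing the Series keeps the named index, so the header row is "id,defects".
preds["defects"].to_csv(buffer)
lines = buffer.getvalue().splitlines()
print(lines)
```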

In [None]:
submission_df

## Thanks for your support. The notebook is still in progress; the model will be uploaded soon.