# <center>Project of Machine Learning</center>

<center>
Master in Data Science and Advanced Analytics <br>
NOVA Information Management School
</center>

** **
## <center>*TO GRANT OR NOT TO GRANT: DECIDING ON COMPENSATION BENEFITS*</center>

<center>
Group 17 <br>
Diogo Ruivo, 20240584  <br>
José Tiago, 20240582  <br>
Matilde Miguel, 20240549  <br>
Nuno Sousa, 20222125  <br>
Rafael Lopes, 20240588  <br>



    
</center>


** **

# Agreement Reached
In this notebook we create the model for the 'Agreement Reached' feature so it can be incorporated into the base model

In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
from scipy import stats
import statistics

#data partition
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import StratifiedKFold

#empty values
import missingno as msno
from sklearn.impute import KNNImputer
from sklearn.impute import SimpleImputer

#feature engineering
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder

from scipy.stats import chi2_contingency
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from imblearn.combine import SMOTEENN

from sklearn.decomposition import PCA

from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LassoCV
from sklearn.naive_bayes import CategoricalNB
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import classification_report
from sklearn.metrics import f1_score
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error

import warnings
warnings.filterwarnings('ignore')

In [2]:
pd.set_option('display.max_columns', None)

### Load Data

In [3]:
train = pd.read_csv('train_data.csv')
test = pd.read_csv('test_data.csv')

# Pre-Processing

set 'Claim Identifier' as the index

In [4]:
train.set_index('Claim Identifier', inplace=True)
test.set_index('Claim Identifier', inplace=True)

#### Drop Duplicates

In [5]:
train.drop_duplicates(inplace=True)

#### Drop Columns

'WCB Decision' and 'Claim Injury Type' are not in the test set, so we drop them from train

In [6]:
train.drop(['WCB Decision'], inplace = True, axis = 1)
train.drop(['Claim Injury Type'], axis = 1, inplace = True)

the 'OIICS Nature of Injury Description' column has no data, so we drop it from both sets

In [7]:
train.drop(['OIICS Nature of Injury Description'], inplace=True, axis=1)
test.drop(['OIICS Nature of Injury Description'], inplace=True, axis=1)

#### Drop Rows

we drop the rows that have no value for 'Agreement Reached'

In [8]:
train.dropna(subset=['Agreement Reached'], inplace=True)

we drop rows that have only 1, 2 or 3 non-NaN values (`thresh=4` keeps rows with at least 4 non-missing values), noting that the 'C-3 Date', 'First Hearing Date' and 'IME-4 Count' columns have roughly 70% of their values missing

In [9]:
train = train.dropna(thresh=4)
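
To make the `thresh` semantics concrete, here is a minimal sketch on a hypothetical toy frame (the column names `a` to `e` are illustrative and not part of the dataset). It shows that `thresh=4` counts non-missing values per row rather than missing ones.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the columns are illustrative only
toy = pd.DataFrame({
    "a": [1.0, 1.0, np.nan],
    "b": [2.0, np.nan, np.nan],
    "c": [3.0, 3.0, 3.0],
    "d": [4.0, 4.0, np.nan],
    "e": [np.nan, 5.0, 5.0],
})

# Number of non-missing values in each row
non_missing = toy.notna().sum(axis=1)

# Keep only rows with at least 4 non-missing values
kept = toy.dropna(thresh=4)

print(non_missing.tolist())
print(kept.index.tolist())
```

The third row has only two non-missing values, so it is the only one dropped.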

#### Categorization

one 'WCIO Part Of Body Code' was wrongly labeled with a negative value, so we set it to 0

In [10]:
train['WCIO Part Of Body Code'] = train['WCIO Part Of Body Code'].apply(lambda x: 0 if x < 0 else x)
test['WCIO Part Of Body Code'] = test['WCIO Part Of Body Code'].apply(lambda x: 0 if x < 0 else x)

In [11]:
date_cols = ['Accident Date', 'Assembly Date', 'C-2 Date', 'C-3 Date', 'First Hearing Date']
int_cols = ['Age at Injury', 'Birth Year', 'IME-4 Count', 'Number of Dependents']
float_to_object = ['Industry Code', 'WCIO Cause of Injury Code', 'WCIO Nature of Injury Code', 'WCIO Part Of Body Code']

instead of keeping the Date columns, we split each one into Year, Month and Day

In [12]:
## IN DATE
for col in date_cols:
    # Convert to datetime
    train[col] = pd.to_datetime(train[col], errors='coerce')
    test[col] = pd.to_datetime(test[col], errors='coerce')
    
    # Extract year, month, and day
    train[f'{col}_Year'] = train[col].dt.year
    train[f'{col}_Month'] = train[col].dt.month
    train[f'{col}_Day'] = train[col].dt.day
    
    test[f'{col}_Year'] = test[col].dt.year
    test[f'{col}_Month'] = test[col].dt.month
    test[f'{col}_Day'] = test[col].dt.day

some columns were floats but should be int

In [13]:
# IN INT
int_cols = ['Age at Injury', 'Birth Year', 'IME-4 Count', 'Number of Dependents']
for col in int_cols:
    train[col] = train[col].astype('Int64')
    test[col] = test[col].astype('Int64')

create dictionaries for mapping codes to descriptions

In [14]:
code_maps = {
    'Industry Code': train.dropna(subset=['Industry Code', 'Industry Code Description']).set_index('Industry Code')['Industry Code Description'].to_dict(),
    'WCIO Cause of Injury Code': train.dropna(subset=['WCIO Cause of Injury Code', 'WCIO Cause of Injury Description']).set_index('WCIO Cause of Injury Code')['WCIO Cause of Injury Description'].to_dict(),
    'WCIO Nature of Injury Code': train.dropna(subset=['WCIO Nature of Injury Code', 'WCIO Nature of Injury Description']).set_index('WCIO Nature of Injury Code')['WCIO Nature of Injury Description'].to_dict(),
    'WCIO Part Of Body Code': train.dropna(subset=['WCIO Part Of Body Code', 'WCIO Part Of Body Description']).set_index('WCIO Part Of Body Code')['WCIO Part Of Body Description'].to_dict()
}

some columns were numeric but represent categorical data

we will detail this further

In [15]:
# IN OBJECT
float_to_object = ['Industry Code', 'WCIO Cause of Injury Code', 'WCIO Nature of Injury Code', 'WCIO Part Of Body Code']
for col in float_to_object:
    train[col] = train[col].astype('Int64')
    test[col] = test[col].astype('Int64')

def to_object(train, val, test):
    for col in float_to_object:
        train[col] = train[col].astype('object')
        val[col] = val[col].astype('object')
        test[col] = test[col].astype('object')

    return train, val, test

for Zip Code, we classify each claim as NY Resident, Non-NY US Resident or Non-US Resident

In [16]:
# keep the first two digits of valid 5-digit ZIP codes; classify any other non-missing value as Non-US Resident
train['Zip Code'] = train['Zip Code'].apply(
    lambda x: x[:2] if isinstance(x, str) and len(x) == 5 and x.isdigit() else ('Non-US Resident' if pd.notna(x) else np.nan)
)
test['Zip Code'] = test['Zip Code'].apply(
    lambda x: x[:2] if isinstance(x, str) and len(x) == 5 and x.isdigit() else ('Non-US Resident' if pd.notna(x) else np.nan)
)

In [17]:
#zip codes that start with 1 come from NY state - where the data set is based
# we decide to divide those that are from NY from those that even though are US residents, are not from NY
train['Zip Code'] = np.where(
    (train['Zip Code'] != 'Unknown') & 
    (train['Zip Code'] != 'Non-US Resident') & 
    train['Zip Code'].notna() & 
    train['Zip Code'].str.startswith('1'), 
    'NY Resident', 
    np.where(
        (train['Zip Code'] != 'Unknown') & 
        (train['Zip Code'] != 'Non-US Resident') & 
        train['Zip Code'].notna(), 
        'Non-NY US Residents', 
        train['Zip Code']
    )
)
test['Zip Code'] = np.where(
    (test['Zip Code'] != 'Unknown') & 
    (test['Zip Code'] != 'Non-US Resident') & 
    test['Zip Code'].notna() & 
    test['Zip Code'].str.startswith('1'), 
    'NY Resident', 
    np.where(
        (test['Zip Code'] != 'Unknown') & 
        (test['Zip Code'] != 'Non-US Resident') & 
        test['Zip Code'].notna(), 
        'Non-NY US Residents', 
        test['Zip Code']
    )
)
# print(train['Zip Code'].value_counts())
# print() 
# print('NaN:', train['Zip Code'].isna().sum())
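
The two cells above can be condensed into a single pass. The sketch below uses a hypothetical helper named `classify_zip` (not part of the notebook) and relies on the notebook's own assumption that ZIP codes starting with `1` belong to NY.

```python
import numpy as np
import pandas as pd

def classify_zip(value):
    """Hypothetical helper mirroring the two cells above in one pass."""
    if pd.isna(value):
        return np.nan  # missing values stay missing
    if isinstance(value, str) and len(value) == 5 and value.isdigit():
        # Assumption taken from the notebook: ZIP codes starting with '1' are in NY
        return 'NY Resident' if value.startswith('1') else 'Non-NY US Residents'
    # Any other non-missing value is not a 5-digit US ZIP code
    return 'Non-US Resident'

# Illustrative values only
zips = pd.Series(['10001', '07030', 'L1N 5', np.nan])
labels = zips.apply(classify_zip)
print(labels.tolist())
```

A single function applied to both `train` and `test` avoids repeating the nested `np.where` expression for each set.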

## Feature Engineering

we add 4 new features representing the number of days between the Accident Date and each of the other dates

In [18]:
# Date feature engineering for train
train['Assembly_to_Accident'] = (train['Assembly Date'] - train['Accident Date']).dt.days.astype('Int64')
train['C2_to_Accident'] = (train['C-2 Date'] - train['Accident Date']).dt.days.astype('Int64')
train['C3_to_Accident'] = (train['C-3 Date'] - train['Accident Date']).dt.days.astype('Int64')
train['Hearing_to_Accident'] = (train['First Hearing Date'] - train['Accident Date']).dt.days.astype('Int64')

# Date feature engineering for test
test['Assembly_to_Accident'] = (test['Assembly Date'] - test['Accident Date']).dt.days.astype('Int64')
test['C2_to_Accident'] = (test['C-2 Date'] - test['Accident Date']).dt.days.astype('Int64')
test['C3_to_Accident'] = (test['C-3 Date'] - test['Accident Date']).dt.days.astype('Int64')
test['Hearing_to_Accident'] = (test['First Hearing Date'] - test['Accident Date']).dt.days.astype('Int64')

we add 4 new features representing the age in different moments related to the claim process

In [19]:
# Age-based features for train
train['Age_at_Assembly'] = (train['Age at Injury'] + (train['Assembly Date'] - train['Accident Date']).dt.days / 365).astype('Int64')
train['Age_at_C2'] = (train['Age at Injury'] + (train['C-2 Date'] - train['Accident Date']).dt.days / 365).astype('Int64')
train['Age_at_C3'] = (train['Age at Injury'] + (train['C-3 Date'] - train['Accident Date']).dt.days / 365).astype('Int64')
train['Age_at_Hearing'] = (train['Age at Injury'] + (train['First Hearing Date'] - train['Accident Date']).dt.days / 365).astype('Int64')

# Age-based features for test
test['Age_at_Assembly'] = (test['Age at Injury'] + (test['Assembly Date'] - test['Accident Date']).dt.days / 365).astype('Int64')
test['Age_at_C2'] = (test['Age at Injury'] + (test['C-2 Date'] - test['Accident Date']).dt.days / 365).astype('Int64')
test['Age_at_C3'] = (test['Age at Injury'] + (test['C-3 Date'] - test['Accident Date']).dt.days / 365).astype('Int64')
test['Age_at_Hearing'] = (test['Age at Injury'] + (test['First Hearing Date'] - test['Accident Date']).dt.days / 365).astype('Int64')

we add a new column to group the wages

In [20]:
# Create a temporary column for wages greater than 0
positive_wages_train = train['Average Weekly Wage'] > 0

# Apply qcut to positive wages only
wage_groups_train = pd.qcut(
    train.loc[positive_wages_train, 'Average Weekly Wage'], 
    q=10, 
    labels=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
)

# Create a new column for wage groups, preserving NaN values
train['Wage Group'] = pd.Series(index=train.index)
train.loc[positive_wages_train, 'Wage Group'] = wage_groups_train.astype(int)

# Assign group 0 to wages that are exactly 0
train.loc[train['Average Weekly Wage'] == 0, 'Wage Group'] = 0

# Wage group feature engineering for test
# Create a temporary column for wages greater than 0
positive_wages_test = test['Average Weekly Wage'] > 0

# Apply qcut to positive wages only
wage_groups_test = pd.qcut(
    test.loc[positive_wages_test, 'Average Weekly Wage'], 
    q=10, 
    labels=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
)

# Create a new column for wage groups, preserving NaN values
test['Wage Group'] = pd.Series(index=test.index)
test.loc[positive_wages_test, 'Wage Group'] = wage_groups_test.astype(int)

# Assign group 0 to wages that are exactly 0
test.loc[test['Average Weekly Wage'] == 0, 'Wage Group'] = 0

# Convert 'Wage Group' to categorical
train['Wage Group'] = train['Wage Group'].astype('category')
test['Wage Group'] = test['Wage Group'].astype('category')
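
The wage grouping above can be checked on a small sample. This is a minimal sketch with a hypothetical helper `wage_groups` and made-up wages; it uses `q=4` instead of `q=10` only so that the toy data has enough distinct values per bin.

```python
import numpy as np
import pandas as pd

def wage_groups(wages, q=10):
    """Hypothetical helper: 0 for zero wages, 1..q for quantile bins of
    positive wages, NaN where the wage is missing."""
    groups = pd.Series(np.nan, index=wages.index, dtype="float64")
    positive = wages > 0
    # Quantile bins are computed on positive wages only
    bins = pd.qcut(wages[positive], q=q, labels=list(range(1, q + 1)))
    groups.loc[positive] = bins.astype(int)
    # Wages that are exactly 0 get their own group
    groups.loc[wages == 0] = 0
    return groups

# Illustrative wages, not taken from the dataset
wages = pd.Series([0, 0, 100, 200, 300, 400, 500, 600, 700, 800, np.nan])
groups = wage_groups(wages, q=4)
print(groups.tolist())
```

Because the notebook computes the bins separately for `train` and `test`, the same wage can fall into different groups in each set; reusing the train bin edges on test would be one way to keep them aligned.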


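we add a column with the distance assigned to each County of Injury, where New York County is the reference (distance 0) and the unknown county gets the mean distance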

In [21]:
counties = { "SUFFOLK": 45.4, "QUEENS": 8.5, "KINGS": 7.5, "NASSAU": 20.1,
        "BRONX": 10.3, "ERIE": 371.1, "NEW YORK": 0, "WESTCHESTER": 20.5,
        "MONROE": 334.8, "ORANGE": 59.5, "ONONDAGA": 194.8, "RICHMOND": 17.1,
        "ALBANY": 155.1, "DUTCHESS": 76.3, "ROCKLAND": 30.8, "SARATOGA": 143.1, 
        "NIAGARA": 373.9, "BROOME": 173.1, "ONEIDA": 203.1, "RENSSELAER": 145.9, 
        "ULSTER": 86.3, "CAYUGA": 221.9, "HERKIMER": 213.9, "CHAUTAUQUA": 407.9, 
        "ONTARIO": 264.9, "CHEMUNG": 201.9, "OSWEGO": 243.9, "FULTON": 223.1, 
        "PUTNAM": 51.9, "ST. LAWRENCE": 314.9, "JEFFERSON": 341.1, "CLINTON": 304.9, 
        "CATTARAUGUS": 371.9, "SULLIVAN": 97.3, "GENESEE": 344.9, "COLUMBIA": 120.1,
        "MADISON": 193.9, "WARREN": 194.9, "LIVINGSTON": 276.9, "DELAWARE": 137.1,
        "WASHINGTON": 204.9, "GREENE": 124.9, "ALLEGANY": 346.9, "WAYNE": 294.9,
        "CHENANGO": 181.9, "TOMPKINS": 209.9, "ORLEANS": 323.9, "SCHENECTADY": 156.1,
        "FRANKLIN": 294.9, "SENECA": 234.9, "LEWIS": 266.9, "TIOGA": 187.1, "STEUBEN": 246.9, 
        "ESSEX": 214.9, "SCHUYLER": 206.1, "OTSEGO": 165.1, "CORTLAND": 193.9, 
        "WYOMING": 313.9, "MONTGOMERY": 173.9, "SCHOHARIE": 146.1, "YATES": 243.9,"HAMILTON": 221.9
}

# Create a list of distances
distances = list(counties.values())

# Calculate the mean distance
mean_distance = statistics.mean(distances)

# Add the "UNKNOWN" county to the dictionary with the mean distance
counties["UNKNOWN"] = mean_distance

# Create a new column called distance_of_county in the train and test DataFrames
train['distance_of_county'] = train['County of Injury'].map(counties)
test['distance_of_county'] = test['County of Injury'].map(counties)

we drop the original Date columns

In [22]:
train.drop(columns=date_cols, inplace=True)
test.drop(columns=date_cols, inplace=True)

### Data Partition

In [23]:
def combine_df(X_num, X_cat):
    return pd.concat([X_num, X_cat], axis=1)

In [24]:
def split_df(X):
    X_num = X.select_dtypes(include=np.number)
    X_cat = X.select_dtypes(exclude=np.number)
    return X_num, X_cat

we split our training data into train and validation sets

this can be done with a single hold-out split (train_test_split) or with cross-validation splitters such as KFold or StratifiedKFold

In [25]:
def split_data(X, y, test, method=None):
    splits = []
    if method is None:
        X_train, X_val, y_train, y_val = train_test_split(X,y, test_size = 0.3, 
                                                random_state = 0, 
                                                stratify = y, 
                                                shuffle = True)
        splits.append((X_train, X_val, y_train, y_val))
    elif isinstance(method, StratifiedKFold):
        for train_index, test_index in method.split(X, y):
            X_train, X_val = X.iloc[train_index], X.iloc[test_index]
            y_train, y_val = y.iloc[train_index], y.iloc[test_index]
            splits.append((X_train, X_val, y_train, y_val))
    else:
        for train_index, test_index in method.split(X):
            X_train, X_val = X.iloc[train_index], X.iloc[test_index]
            y_train, y_val = y.iloc[train_index], y.iloc[test_index]
            splits.append((X_train, X_val, y_train, y_val))

    processed_splits = []
    for X_train, X_val, y_train, y_val in splits:
        X_train_num = X_train.select_dtypes(include=np.number)
        X_val_num = X_val.select_dtypes(include=np.number)
        X_train_cat = X_train.select_dtypes(exclude=np.number)
        X_val_cat = X_val.select_dtypes(exclude=np.number)
        X_test_num = test.select_dtypes(include=np.number)
        X_test_cat = test.select_dtypes(exclude=np.number)
        processed_splits.append((X_train_num, X_val_num, X_test_num, X_train_cat, X_val_cat, X_test_cat, y_train, y_val))

    return processed_splits
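
As a quick illustration of why a stratified splitter matters for an imbalanced target, the sketch below builds a hypothetical target with 10% positives and checks the positive rate in every validation fold. The data is synthetic and only meant to show the behaviour of `StratifiedKFold`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced target: 10 positives, 90 negatives
y = pd.Series([1] * 10 + [0] * 90)
X = pd.DataFrame({"feature": np.arange(100)})

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Share of positives in each validation fold
val_positive_rates = [
    y.iloc[val_idx].mean() for _, val_idx in skf.split(X, y)
]
print(val_positive_rates)
```

An instance such as `skf` can be passed as the `method` argument of `split_data`, which then calls `method.split(X, y)`.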

#### Replacing NaN with nearest neighbor

we use KNN imputation to fill the code columns, such as 'WCIO Cause of Injury Code', with the value from the most similar claim

the associated description is then filled from the code-to-description dictionaries built earlier

In [26]:
def KNN_Imputer(X_train_num, X_val_num, X_test_num, X_train_cat, X_val_cat, X_test_cat, code_maps, float_to_object):
    # Select only the float_to_object features for KNN imputation
    float_to_object_features = [col for col in float_to_object if col in X_train_num.columns]
    print(f"Float to Object Features: {float_to_object_features}")

    X_train_num_knn = X_train_num[float_to_object_features]
    X_val_num_knn = X_val_num[float_to_object_features]
    X_test_num_knn = X_test_num[float_to_object_features]

    # Impute missing values in float_to_object features using KNN
    imputer = KNNImputer(n_neighbors=1).fit(X_train_num_knn)
    train_num_imp_knn = imputer.transform(X_train_num_knn)
    val_num_imp_knn = imputer.transform(X_val_num_knn)
    test_num_imp_knn = imputer.transform(X_test_num_knn)

    # Convert imputed numerical features back to DataFrames
    train_num_df_knn = pd.DataFrame(train_num_imp_knn, columns=X_train_num_knn.columns, index=X_train_num_knn.index)
    val_num_df_knn = pd.DataFrame(val_num_imp_knn, columns=X_val_num_knn.columns, index=X_val_num_knn.index)
    test_num_df_knn = pd.DataFrame(test_num_imp_knn, columns=X_test_num_knn.columns, index=X_test_num_knn.index)

    # Replace the original float_to_object features with the imputed ones
    X_train_num[float_to_object_features] = train_num_df_knn
    X_val_num[float_to_object_features] = val_num_df_knn
    X_test_num[float_to_object_features] = test_num_df_knn

    # Combine numerical and categorical features
    train_combined = combine_df(X_train_num, X_train_cat)
    val_combined = combine_df(X_val_num, X_val_cat)
    test_combined = combine_df(X_test_num, X_test_cat)

    #print(train_combined.columns)
    
    # Map codes to descriptions
    for code, map_dict in code_maps.items():
        # 'Industry Code' pairs with 'Industry Code Description';
        # the WCIO codes drop ' Code' (e.g. 'WCIO Cause of Injury Description')
        desc_col = f'{code} Description'
        if desc_col not in train_combined.columns:
            desc_col = f"{code.replace(' Code', '')} Description"
        if desc_col in train_combined.columns:
            train_combined[desc_col] = train_combined[code].map(map_dict).fillna(train_combined[desc_col])
            val_combined[desc_col] = val_combined[code].map(map_dict).fillna(val_combined[desc_col])
            test_combined[desc_col] = test_combined[code].map(map_dict).fillna(test_combined[desc_col])
        else:
            print(f"Column {desc_col} does not exist")

    X_train_num = train_combined.select_dtypes(include=np.number)
    X_val_num = val_combined.select_dtypes(include=np.number)
    X_test_num = test_combined.select_dtypes(include=np.number)
    X_train_cat = train_combined.select_dtypes(exclude=np.number)
    X_val_cat = val_combined.select_dtypes(exclude=np.number)
    X_test_cat = test_combined.select_dtypes(exclude=np.number)

    return X_train_num, X_val_num, X_test_num, X_train_cat, X_val_cat, X_test_cat
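
To show what `KNNImputer(n_neighbors=1)` does, here is a minimal sketch on hypothetical columns `code_a` and `code_b` (illustrative names and values). The imputer is fitted on the training rows only and copies the value from the nearest training row that has the missing column filled in.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical code columns; values are illustrative only
train_codes = pd.DataFrame({"code_a": [1.0, 2.0, 1.1],
                            "code_b": [10.0, 20.0, np.nan]})
val_codes = pd.DataFrame({"code_a": [1.9], "code_b": [np.nan]})

# Fit on train only, then transform both sets
imputer = KNNImputer(n_neighbors=1).fit(train_codes)
train_filled = pd.DataFrame(imputer.transform(train_codes),
                            columns=train_codes.columns, index=train_codes.index)
val_filled = pd.DataFrame(imputer.transform(val_codes),
                          columns=val_codes.columns, index=val_codes.index)

print(train_filled)
print(val_filled)
```

Note that the imputer treats the codes as numbers, so "most similar" here means numerically closest on the columns that are present.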

we decided to impute the remaining missing values with the median (numerical) or the most frequent value (categorical)

In [27]:
def imputing(X_train_num, X_val_num, X_test_num, X_train_cat, X_val_cat, X_test_cat):
    #Using median for numerical data
    num_imputer = SimpleImputer(strategy="median")
    X_train_num = pd.DataFrame(num_imputer.fit_transform(X_train_num), columns=X_train_num.columns)
    X_val_num = pd.DataFrame(num_imputer.transform(X_val_num), columns=X_val_num.columns)
    X_test_num = pd.DataFrame(num_imputer.transform(X_test_num), columns=X_test_num.columns)

    #Using most frequent for categorical data
    cat_imputer = SimpleImputer(strategy="most_frequent")
    X_train_cat = pd.DataFrame(cat_imputer.fit_transform(X_train_cat), columns=X_train_cat.columns)
    X_val_cat = pd.DataFrame(cat_imputer.transform(X_val_cat), columns=X_val_cat.columns)
    X_test_cat = pd.DataFrame(cat_imputer.transform(X_test_cat), columns=X_test_cat.columns)

    return X_train_num, X_val_num, X_test_num, X_train_cat, X_val_cat, X_test_cat
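
The `imputing` function above rebuilds each DataFrame without passing `index`, so the 'Claim Identifier' index is replaced by a default range index. If the original index needs to be kept, one option is sketched below; `impute_keep_index` is a hypothetical helper and the data is made up.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

def impute_keep_index(train_df, other_df, strategy):
    """Hypothetical variant: fit on train, transform both, keep each index."""
    imputer = SimpleImputer(strategy=strategy).fit(train_df)
    train_out = pd.DataFrame(imputer.transform(train_df),
                             columns=train_df.columns, index=train_df.index)
    other_out = pd.DataFrame(imputer.transform(other_df),
                             columns=other_df.columns, index=other_df.index)
    return train_out, other_out

# Illustrative data with non-default index labels
X_train_num = pd.DataFrame({"wage": [100.0, np.nan, 300.0]}, index=[11, 12, 13])
X_val_num = pd.DataFrame({"wage": [np.nan]}, index=[21])

train_out, val_out = impute_keep_index(X_train_num, X_val_num, "median")
print(train_out)
print(val_out)
```

The validation value is filled with the training median, which keeps validation data out of the fitted statistics.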

<hr>
<a class="anchor" id="outliers">
    
### Outliers
    
</a>

In [28]:
"""
numeric_columns = train.select_dtypes(include=['number']).columns

num_cols = len(numeric_columns)
cols = 2  # Number of columns in the grid
rows = (num_cols // cols) + (num_cols % cols) 

plt.figure(figsize=(15, 5 * rows))
for i, col in enumerate(numeric_columns, 1):
    plt.subplot(rows, cols, i)
    sns.boxplot(x=train[col])
    plt.title(f'Boxplot of {col}')
    plt.xlabel(col)

plt.tight_layout()
plt.show() """

In [29]:
# Function to calculate IQR and identify outliers for a specific column
def identify_outliers_iqr_column(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = (df[column] < lower_bound) | (df[column] > upper_bound)
    return outliers
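
A small worked example of the IQR rule used above, on made-up values:

```python
import pandas as pd

# Illustrative values with one obvious extreme
values = pd.Series([1, 2, 3, 4, 100])

q1 = values.quantile(0.25)   # 2.0
q3 = values.quantile(0.75)   # 4.0
iqr = q3 - q1                # 2.0

lower = q1 - 1.5 * iqr       # -1.0
upper = q3 + 1.5 * iqr       # 7.0

# True where the value falls outside [lower, upper]
is_outlier = (values < lower) | (values > upper)
print(is_outlier.tolist())
```

Only the value 100 falls outside the interval from -1 to 7, so it is the only one flagged.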

In [30]:
def remove_outliers(X_train):
    # Boolean mask aligned with X_train's index (True = row flagged as an outlier)
    outliers = pd.Series(False, index=X_train.index)
    
    # Flag outliers for 'Age at Injury'
    if 'Age at Injury' in X_train.columns:
        outliers_age = (X_train['Age at Injury'] < 12) | (X_train['Age at Injury'] > 80)
        outliers = outliers | outliers_age.fillna(False)

    # Flag outliers for 'Average Weekly Wage'
    if 'Average Weekly Wage' in X_train.columns:
        outliers_wage = X_train['Average Weekly Wage'] > 1e5
        outliers = outliers | outliers_wage.fillna(False)
    
    # Flag outliers for 'Birth Year'
    if 'Birth Year' in X_train.columns:
        outliers_birth_year = X_train['Birth Year'] == 0
        outliers = outliers | outliers_birth_year.fillna(False)
    
    # Missing values are not flagged as outliers
    return outliers.astype(bool)


In [31]:
def sub_outliers(X_train, columns_outliers): #, y_train
    for col in columns_outliers:
        outliers = identify_outliers_iqr_column(X_train, col)
        # Replace outliers with median
        median = X_train[col].median()
        X_train.loc[outliers, col] = median
    
    return X_train #, y_train

#train = sub_outliers(train, train.select_dtypes(include=[np.number]).columns)

In [32]:
"""
numeric_columns = train.select_dtypes(include=['number']).columns

num_cols = len(numeric_columns)
cols = 2  # Number of columns in the grid
rows = (num_cols // cols) + (num_cols % cols)

plt.figure(figsize=(15, 5 * rows))
for i, col in enumerate(numeric_columns, 1):
    plt.subplot(rows, cols, i)
    sns.boxplot(x=train[col])
    plt.title(f'Boxplot of {col}')
    plt.xlabel(col)

plt.tight_layout()
plt.show() """

### Feature Selection

features with zero variance carry no information, so they are candidates to drop

In [33]:
def variance(X_train, threshold, return_variances=False):
    variances = X_train.var()
    low_variance_cols = variances[variances <= threshold].index.tolist()
    if return_variances:
        return low_variance_cols, variances.to_dict()
    return low_variance_cols

Spearman correlation: we list the feature pairs whose absolute correlation reaches the threshold

In [34]:
def high_correlated_vars(X_train, threshold):
    cor_spearman = X_train.corr(method='spearman')
    correlated_pairs = []
    for i in range(len(cor_spearman.columns)):
        for j in range(i):
            correlation = cor_spearman.iloc[i, j]
            if abs(correlation) >= threshold:
                correlated_pairs.append({
                    "feature_1": cor_spearman.columns[i],
                    "feature_2": cor_spearman.columns[j],
                    "correlation": correlation
                })
    return correlated_pairs
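
The pairwise scan above can be tried on a toy frame. In this sketch the columns `a`, `b` and `c` are hypothetical: `b` is a monotonic transform of `a`, so their Spearman correlation is 1 even though the relationship is not linear.

```python
import pandas as pd

# Illustrative data only
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [1, 8, 27, 64, 125],  # a ** 3, monotonic in a
    "c": [2, 1, 4, 3, 5],      # only partly ordered like a
})

corr = toy.corr(method="spearman")

# Scan the lower triangle for pairs at or above the threshold
threshold = 0.9
pairs = [
    (corr.columns[i], corr.columns[j])
    for i in range(len(corr.columns))
    for j in range(i)
    if abs(corr.iloc[i, j]) >= threshold
]
print(pairs)
```

Because Spearman works on ranks, it picks up monotonic but non-linear relationships that Pearson correlation would understate.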

chi-square test of independence between each categorical feature and the target

In [35]:
def test_independence(x,y,alpha=0.05):        
    dfObserved = pd.crosstab(y,x) 
    if dfObserved.empty:
        print(f"Skipping column {x.name} due to empty observed table.")
        return None
    if x.nunique() <= 1:
        print(f"Skipping column {x.name} as it has <= 1 unique value.")
        return None
    chi2, p, dof, expected = stats.chi2_contingency(dfObserved.values)
    is_important = p < alpha
    result = {
        "feature": x.name,
        "p_value": p,
        "chi2_stat": chi2,
        "is_important": is_important
    }
    return result

In [36]:
def chi_square(X_train, y, alpha=0.05):
    if X_train.empty or y.empty:
        raise ValueError("X_train or y is empty.")
    if len(y.unique()) < 2:
        raise ValueError("y must have at least two unique classes.")
    results = []
    for var in X_train.columns:
        test_result = test_independence(X_train[var], y, alpha)
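        if test_result is None:
            # Skipped columns (empty table or a single unique value) are left out of the results
            print(f"No test result for {var}")
            continue
        results.append(test_result)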
    
    results_df = pd.DataFrame(results)
    not_important_features = results_df[~results_df["is_important"]]["feature"].tolist()
    
    return results_df, not_important_features
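
The sketch below runs the same test on two hypothetical features: one fully determined by the target and one unrelated to it. The data is synthetic and serves only to show how the p-value separates the two cases.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Synthetic binary target and two illustrative categorical features
y = pd.Series([0] * 50 + [1] * 50, name="target")
dependent = pd.Series(["a"] * 50 + ["b"] * 50, name="dependent")
independent = pd.Series(["a", "b"] * 50, name="independent")

p_values = {}
for feature in (dependent, independent):
    observed = pd.crosstab(y, feature)          # contingency table
    chi2, p, dof, expected = chi2_contingency(observed.values)
    p_values[feature.name] = p

print(p_values)
```

With `alpha=0.05`, the first feature would be kept as important and the second would land in the list of unimportant features.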

relationship between each categorical feature and the dependent variable

In [37]:
def bar_charts_categorical(df, feature, target):
    cont_tab = pd.crosstab(df[feature], df[target], margins=True)
    categories = cont_tab.index[:-1]
    target_categories = cont_tab.columns[:-1]
    
    fig = plt.figure(figsize=(15, 5))
    
    plt.subplot(121)
    bottom = np.zeros(len(categories))
    colors = plt.cm.tab20.colors  # Use a colormap for different colors
    bars = []
    for i, target_cat in enumerate(target_categories):
        bar = plt.bar(categories, cont_tab.iloc[:-1, i].values, 0.55, bottom=bottom, color=colors[i % len(colors)])
        bars.append(bar[0])
        bottom += cont_tab.iloc[:-1, i].values
    plt.legend(bars, [f'$y_i={cat}$' for cat in target_categories])
    plt.title("Frequency bar chart")
    plt.xlabel(feature)
    plt.ylabel("$Frequency$")

    # auxiliary data for 122
    obs_pct = np.array([np.divide(cont_tab.iloc[:-1, i].values, cont_tab.iloc[:-1, -1].values) for i in range(len(target_categories))])
    
    plt.subplot(122)
    bottom = np.zeros(len(categories))
    bars = []
    for i, target_cat in enumerate(target_categories):
        bar = plt.bar(categories, obs_pct[i], 0.55, bottom=bottom, color=colors[i % len(colors)])
        bars.append(bar[0])
        bottom += obs_pct[i]
    plt.legend(bars, [f'$y_i={cat}$' for cat in target_categories])
    plt.title("Proportion bar chart")
    plt.xlabel(feature)
    plt.ylabel("$p$")

    plt.show()

def plot_and_test_correlation(df, target):
    for feature in df.select_dtypes(include='object').columns:
        print(f"Generating bar charts for {feature}...")
        bar_charts_categorical(df, feature, target)

#plot_and_test_correlation(train, 'Claim Injury Type')

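Recursive Feature Elimination (RFE): we try every possible number of features and keep the one with the best validation score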

In [38]:
def select_optimal_features_rfe(X_train, y_train, X_val, y_val, model, scoring_function=None):
    if scoring_function is None:
        scoring_function = lambda model, X, y: model.score(X, y)

    nof_list=np.arange(1, X_train.shape[1]+1)
    high_score = 0
    nof = 0
    train_score_list = []
    val_score_list = []

    for n in nof_list:
        rfe = RFE(estimator=model, n_features_to_select=n)
        X_train_rfe = rfe.fit_transform(X_train, y_train)
        X_val_rfe = rfe.transform(X_val)
        model.fit(X_train_rfe, y_train)

        # Storing results on training data
        train_score = scoring_function(model, X_train_rfe, y_train)
        train_score_list.append(train_score)

        # Storing results on validation data
        val_score = scoring_function(model, X_val_rfe, y_val)
        val_score_list.append(val_score)

        # Check best score
        if val_score >= high_score:
            high_score = val_score
            nof = n

    # Fit RFE with the optimal number of features
    rfe = RFE(estimator=model, n_features_to_select=nof)
    rfe.fit(X_train, y_train)
    selected_features = X_train.columns[rfe.support_].tolist()

    return selected_features, train_score_list, val_score_list

embedded methods: we select features from the fitted model's coefficients or feature importances

In [39]:
def select_best_features_embedded(X_train, y_train, model, threshold=None):
    # Fit the model
    model.fit(X_train, y_train)
    
    # Get the coefficients or feature importances
    if hasattr(model, 'coef_'):
        if model.coef_.ndim > 1:
            coef = pd.Series(model.coef_.mean(axis=0), index=X_train.columns)
        else:
            coef = pd.Series(model.coef_, index=X_train.columns)
    elif hasattr(model, 'feature_importances_'):
        coef = pd.Series(model.feature_importances_, index=X_train.columns)
    else:
        raise ValueError("The model does not have coef_ or feature_importances_ attributes")
    
    if threshold is not None:
        selected_features = coef[coef.abs() > threshold].index.tolist()
    else:
        selected_features = coef[coef != 0].index.tolist()
    
    return selected_features, coef

### Scaling

we scale our data to improve model performance and to put the features' contributions on a comparable scale

In [40]:
def scaling(X_train, X_val, X_test, scaler):
    scaler.fit(X_train)
    X_train_scaled = pd.DataFrame(scaler.transform(X_train), columns=X_train.columns).set_index(X_train.index)
    X_val_scaled = pd.DataFrame(scaler.transform(X_val), columns=X_val.columns).set_index(X_val.index)
    X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns).set_index(X_test.index)

    return X_train_scaled, X_val_scaled, X_test_scaled
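
A minimal sketch of the fit-on-train, transform-everything pattern used by `scaling`, with made-up wages and `MinMaxScaler` as the example scaler:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Illustrative data with non-default index labels
X_train = pd.DataFrame({"wage": [0.0, 5.0, 10.0]}, index=[101, 102, 103])
X_val = pd.DataFrame({"wage": [20.0]}, index=[201])

# The scaler only learns the range of the training data
scaler = MinMaxScaler().fit(X_train)

X_val_scaled = pd.DataFrame(scaler.transform(X_val),
                            columns=X_val.columns, index=X_val.index)
print(X_val_scaled)
```

The validation value lands outside the 0 to 1 range because the scaler never saw it during fitting, which is the expected behaviour when leakage is avoided.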

### Reducing Cardinality

to make encoding strategies like one-hot encoding practical, we group the infrequent classes of each column into a single 'Other' class

In [41]:
def reduce_cardinality(df, threshold=10, other_label='Other'):
    for col in df.select_dtypes(include='object').columns:
        value_counts = df[col].value_counts()
        frequent_values = value_counts[value_counts > threshold].index
        df[col] = df[col].apply(lambda x: x if x in frequent_values else other_label)
    return df
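
For large frames, the row-wise `apply` can be replaced by a vectorized lookup. The sketch below is a hypothetical variant (`reduce_cardinality_vectorized` is not part of the notebook) that returns a copy instead of modifying the input; the toy column is built with `dtype=object` so that `select_dtypes(include='object')` picks it up.

```python
import pandas as pd

def reduce_cardinality_vectorized(df, threshold=10, other_label='Other'):
    """Hypothetical vectorized variant; returns a copy instead of modifying df."""
    out = df.copy()
    for col in out.select_dtypes(include='object').columns:
        counts = out[col].value_counts()
        frequent = counts[counts > threshold].index
        # Keep frequent values, replace everything else with other_label
        out[col] = out[col].where(out[col].isin(frequent), other_label)
    return out

# Illustrative data only
toy = pd.DataFrame({"county": ["KINGS"] * 3 + ["QUEENS"] * 2 + ["YATES"]},
                   dtype=object)
reduced = reduce_cardinality_vectorized(toy, threshold=1)
print(reduced["county"].tolist())
```

As in the original function, missing values are not in the frequent set, so they are also mapped to `other_label`.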

### Encoding

we use encoding to prepare our categorical features to be used by a model

In [42]:
def encoding_independent(X_train, X_val, X_test, encoder):
    X_train = X_train.astype(str)
    X_val = X_val.astype(str)
    X_test = X_test.astype(str)
    
    encoder.fit(X_train)
    X_train_encoded = encoder.transform(X_train) 
    X_val_encoded = encoder.transform(X_val)
    X_test_encoded = encoder.transform(X_test)

    if isinstance(encoder, OneHotEncoder):
        feature_names = encoder.get_feature_names_out(X_train.columns)
        X_train_encoded = pd.DataFrame(X_train_encoded, columns=feature_names, index=X_train.index)
        X_val_encoded = pd.DataFrame(X_val_encoded, columns=feature_names, index=X_val.index)
        X_test_encoded = pd.DataFrame(X_test_encoded, columns=feature_names, index=X_test.index)
    else:
        X_train_encoded = pd.DataFrame(X_train_encoded, columns=X_train.columns, index=X_train.index)
        X_val_encoded = pd.DataFrame(X_val_encoded, columns=X_val.columns, index=X_val.index)
        X_test_encoded = pd.DataFrame(X_test_encoded, columns=X_test.columns, index=X_test.index)
    
    return X_train_encoded, X_val_encoded, X_test_encoded

In [43]:
def encoding_dependent(y_train, y_val, encoder):
    encoder.fit(y_train)
    y_train_encoded = pd.Series(encoder.transform(y_train))
    y_val_encoded = pd.Series(encoder.transform(y_val))

    return y_train_encoded, y_val_encoded

### Balancing Classes

In [44]:
def custom_sampling_strategy(y):
    class_counts = np.bincount(y)
    max_count = np.max(class_counts)
    sampling_strategy = {}
    for i, count in enumerate(class_counts):
        if i in [5, 6, 7]:  # minority classes
            sampling_strategy[i] = 4000
        else:
            sampling_strategy[i] = count  # use the original count for other classes
    return sampling_strategy

In [45]:
def balance_data(X, y, method='oversample'):
    if method == 'oversample':
        sampler = RandomOverSampler(random_state=42)
    elif method == 'undersample':
        sampler = RandomUnderSampler(random_state=42)
    elif method == 'smote':
        # SMOTEENN combines SMOTE over-sampling with Edited Nearest Neighbours cleaning
        sampler = SMOTEENN(random_state=42, sampling_strategy=custom_sampling_strategy(y))
    else:
        raise ValueError("Method should be 'oversample', 'undersample', or 'smote'")
    
    X_resampled, y_resampled = sampler.fit_resample(X, y)
    return X_resampled, y_resampled
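
Because the target in this notebook is binary, it can help to check what `custom_sampling_strategy` returns in that case. The sketch below repeats the function's logic on a hypothetical 0/1 target (the toy counts are made up). Class indices 5, 6 and 7 never occur, so every class keeps its original count. If, as we understand it, imbalanced-learn generates no samples when the requested count equals the current one, the `'smote'` option would then rely only on the cleaning step of `SMOTEENN`.

```python
import numpy as np

def custom_sampling_strategy(y):
    # Same logic as the notebook function, repeated so the sketch is self-contained
    class_counts = np.bincount(y)
    return {
        i: (4000 if i in [5, 6, 7] else int(count))
        for i, count in enumerate(class_counts)
    }

# Hypothetical binary target: 8 negatives and 2 positives
y_toy = np.array([0] * 8 + [1] * 2)
strategy = custom_sampling_strategy(y_toy)
print(strategy)  # {0: 8, 1: 2}
```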

### PCA

In [46]:
def apply_pca(X_train, X_val, X_test, n_components):
    pca = PCA(n_components=n_components)
    X_train_pca = pca.fit_transform(X_train)
    X_val_pca = pca.transform(X_val)
    X_test_pca = pca.transform(X_test)
    # Name the components from the fitted model so a float or None n_components also works
    pca_feat_names = [f'PC{i}' for i in range(pca.n_components_)]

    X_train_pca = pd.DataFrame(X_train_pca, index=X_train.index, columns=pca_feat_names)
    X_val_pca = pd.DataFrame(X_val_pca, index=X_val.index, columns=pca_feat_names)
    X_test_pca = pd.DataFrame(X_test_pca, index=X_test.index, columns=pca_feat_names)
    
    return X_train_pca, X_val_pca, X_test_pca, pca_feat_names
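
One possible way to choose the number of components, shown here only as a sketch on synthetic data, is to pass a float to `PCA` so that it keeps enough components to reach a target share of explained variance. The data, the 95% threshold and the `f0`..`f4` column names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic numeric features (hypothetical stand-in for the scaled numeric training data):
# two independent signals plus three noisy linear combinations of them
base = rng.normal(size=(200, 2))
mixed = base @ rng.normal(size=(2, 3)) + 0.01 * rng.normal(size=(200, 3))
X_toy = pd.DataFrame(np.hstack([base, mixed]), columns=[f"f{i}" for i in range(5)])

# A float in (0, 1) asks for enough components to explain that share of variance
pca = PCA(n_components=0.95, svd_solver="full")
X_toy_pca = pca.fit_transform(X_toy)

# Derive the component names from the fitted model, not from the requested value
pca_feat_names = [f"PC{i}" for i in range(pca.n_components_)]
X_toy_pca = pd.DataFrame(X_toy_pca, index=X_toy.index, columns=pca_feat_names)

print(pca.n_components_)
print(pca.explained_variance_ratio_.cumsum())
```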

## Modelling and Evaluating

In [47]:
def pre_processing_pipeline(train, test, split_method, feature_selection, scaler, encoder_independent, encoder_dependent, balance_method):
    X = train.drop(columns=['Agreement Reached'])
    y = train['Agreement Reached']
    print("Starting Split Data")
    splits = split_data(X, y, test, split_method)
    print("Split data OK")

    results = []

    for X_train_num, X_val_num, X_test_num, X_train_cat, X_val_cat, X_test_cat, y_train, y_val in splits:
        print("Starting Outliers")
        print(f"Train shape before removing outliers: {X_train_num.shape}")
        outliers = remove_outliers(X_train_num)
        X_train_num = X_train_num[~outliers]
        X_train_cat = X_train_cat[~outliers]
        y_train = y_train[~outliers]
        print(f"Train shape after removing outliers: {X_train_num.shape}")
        print("Outliers OK")

        # print("Starting KNN Imputer")
        # X_train_num, X_val_num, X_test_num, X_train_cat, X_val_cat, X_test_cat = KNN_Imputer(X_train_num, X_val_num, X_test_num, X_train_cat, X_val_cat, X_test_cat, code_maps, float_to_object)
        # print("KNN Imputer OK")
        print("Starting Imputing")
        X_train_num, X_val_num, X_test_num, X_train_cat, X_val_cat, X_test_cat = imputing(X_train_num, X_val_num, X_test_num, X_train_cat, X_val_cat, X_test_cat)
        print("Imputing OK")

        X_train = combine_df(X_train_num, X_train_cat)
        X_val = combine_df(X_val_num, X_val_cat)
        X_test = combine_df(X_test_num, X_test_cat)
        X_train, X_val, X_test = to_object(X_train, X_val, X_test)
        X_train_num, X_train_cat = split_df(X_train)
        X_val_num, X_val_cat = split_df(X_val)
        X_test_num, X_test_cat = split_df(X_test)

        if feature_selection:
            print("Starting Variance")
            low_variance_cols, variances = variance(X_train_num, 0, True)
            print(low_variance_cols)
            print(variances)
            print("Variance OK")
            print("Starting High Correlated Vars")
            X_corr = X_train_num.copy()
            X_corr['Agreement Reached'] = y_train  # the target of this notebook
            correlated_pairs = high_correlated_vars(X_corr, 0.8)
            print(correlated_pairs)
            print("High Correlated Vars OK")
            # print("Starting Chi Square")
            # results, not_important_fatures = chi_square(X_train_cat, y_train)
            # print(results)
            # print(not_important_fatures)
            # print("Chi Square OK")

        print("Starting Scaling")
        X_train_num, X_val_num, X_test_num = scaling(X_train_num, X_val_num, X_test_num, scaler)
        print("Scaling OK")

        print("Reducing Cardinality")
        X_train_cat = reduce_cardinality(X_train_cat)
        X_val_cat = reduce_cardinality(X_val_cat)
        X_test_cat = reduce_cardinality(X_test_cat)
        print("Reducing Cardinality OK")

        print("Starting Encoding Independent")
        X_train_cat, X_val_cat, X_test_cat = encoding_independent(X_train_cat, X_val_cat, X_test_cat, encoder_independent)
        print("Encoding Independent OK")
        print("Starting Encoding Dependent")
        y_train, y_val = encoding_dependent(y_train, y_val, encoder_dependent)
        print("Encoding Dependent OK")

        # print("Starting PCA")
        # X_train_num, X_val_num, X_test_num, pca_feat_names = apply_pca(X_train_num, X_val_num, X_test_num, 10)
        # print("PCA OK")

        X_train = combine_df(X_train_num, X_train_cat)
        X_val = combine_df(X_val_num, X_val_cat)
        X_test = combine_df(X_test_num, X_test_cat)
        
        if balance_method is not None:
            print("Starting Balance Data")
            X_train, y_train = balance_data(X_train, y_train, balance_method)
            print("Balance Data OK")

        results.append((X_train, X_val, X_test, y_train, y_val))

    return results

In [48]:
normal_split = None
kf = KFold(n_splits=10)  # too many splits make the pipeline slow
rkf = RepeatedKFold(n_splits=6, n_repeats=2)
# loo = LeaveOneOut()  # impractical given the size of the dataset
skf = StratifiedKFold(n_splits=10)  # preserves class proportions; suited to imbalanced datasets

min_max = MinMaxScaler()
min_max2 = MinMaxScaler(feature_range=(-1, 1))
standard = StandardScaler()
robust = RobustScaler()

oneHot = OneHotEncoder(sparse_output=False, drop="first", handle_unknown='ignore')
ordinal = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
label = LabelEncoder()

oversample = 'oversample'
undersample = 'undersample'
smote = 'smote'
no_balance = None
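
To illustrate why `StratifiedKFold` is noted as suited to imbalanced data, the sketch below compares the share of positives in each validation fold for `KFold` and `StratifiedKFold`. The target is a made-up 0/1 vector with 5% positives, sorted so that the difference is easy to see. It is not the project data.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical imbalanced binary target: 190 negatives followed by 10 positives
y_toy = np.array([0] * 190 + [1] * 10)
X_toy = np.zeros((len(y_toy), 1))  # feature values do not matter for the split itself

def positive_rates(splitter):
    """Return the share of positives in each validation fold."""
    return [float(y_toy[val_idx].mean()) for _, val_idx in splitter.split(X_toy, y_toy)]

kf_rates = positive_rates(KFold(n_splits=10))
skf_rates = positive_rates(StratifiedKFold(n_splits=10))

print("KFold:          ", kf_rates)
print("StratifiedKFold:", skf_rates)
```

Without shuffling, plain `KFold` puts all positives into the last fold, whereas the stratified splitter keeps roughly 5% positives in every fold.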

In [49]:
processed_data = pre_processing_pipeline(train, test, normal_split, True, standard, ordinal, label, smote)
for i, (X_train, X_val, X_test, y_train, y_val) in enumerate(processed_data):
    print(f"Split {i+1}:")
    print(f"X_train shape: {X_train.shape}, y_train shape: {y_train.shape}")
    print(f"X_val shape: {X_val.shape}, y_val shape: {y_val.shape}")
    print(f"X_test shape: {X_test.shape}")

Starting Split Data
Split data OK
Starting Outliers
Train shape before removing outliers: (401817, 33)
Train shape after removing outliers: (384193, 33)
Outliers OK
Starting Imputing
Imputing OK
Starting Variance
[]
{'Age at Injury': 198.19349163925702, 'Average Weekly Wage': 603570.7186536161, 'Birth Year': 181.3012968175019, 'IME-4 Count': 2.1396326359456688, 'Number of Dependents': 4.008093273175224, 'Accident Date_Year': 2.9666472258224332, 'Accident Date_Month': 11.881004064470508, 'Accident Date_Day': 76.65125835479293, 'Assembly Date_Year': 0.6561729948007808, 'Assembly Date_Month': 11.805455127127571, 'Assembly Date_Day': 76.25989724674965, 'C-2 Date_Year': 1.1145960289107557, 'C-2 Date_Month': 11.560124561431149, 'C-2 Date_Day': 74.0696397247857, 'C-3 Date_Year': 0.2579330823924584, 'C-3 Date_Month': 3.892666213514948, 'C-3 Date_Day': 24.953098575164642, 'First Hearing Date_Year': 0.3093508386964043, 'First Hearing Date_Month': 3.2785115143646357, 'First Hearing Date_Day': 20.

Starting Variance
[]
{'Age at Injury': 198.18715003163175, 'Average Weekly Wage': 593328.9346257411, 'Birth Year': 181.25345368680772, 'IME-4 Count': 2.1339682310632573, 'Agreement Reached': 0.043480791026601726, 'Number of Dependents': 4.0094608120277035, 'Accident Date_Year': 2.976560266497379, 'Accident Date_Month': 11.873280144612893, 'Accident Date_Day': 76.74232654089047, 'Assembly Date_Year': 0.6562224916754912, 'Assembly Date_Month': 11.79377792451315, 'Assembly Date_Day': 76.28770527877323, 'C-2 Date_Year': 1.116449967455882, 'C-2 Date_Month': 11.54371715451866, 'C-2 Date_Day': 74.08105468085947, 'C-3 Date_Year': 0.2574103605283446, 'C-3 Date_Month': 3.901680673864706, 'C-3 Date_Day': 25.12037227072812, 'First Hearing Date_Year': 0.309690685722038, 'First Hearing Date_Month': 3.2766458013681565, 'First Hearing Date_Day': 20.165888611257678, 'Assembly_to_Accident': 300998.0811316094, 'C2_to_Accident': 208948.8283443356, 'C3_to_Accident': 48381.570658223296, 'Hearing_to_Accident': 49116.99307839868, 'Age_at_Assembly': 188.55201250464148, 'Age_at_C2': 184.63142115116767, 'Age_at_C3': 52.75393405972783, 'Age_at_Hearing': 43.566149613971774, 'distance_of_county': 15848.565889636728}
Variance OK

Starting High Correlated Vars
[{'feature_1': 'Birth Year', 'feature_2': 'Age at Injury', 'correlation': -0.9434682621618036}, {'feature_1': 'Assembly Date_Year', 'feature_2': 'Accident Date_Year', 'correlation': 0.9318466133057587}, {'feature_1': 'C-2 Date_Year', 'feature_2': 'Accident Date_Year', 'correlation': 0.9163718626926118}, {'feature_1': 'C-2 Date_Year', 'feature_2': 'Assembly Date_Year', 'correlation': 0.9688395211428722}, {'feature_1': 'C-2 Date_Month', 'feature_2': 'Assembly Date_Month', 'correlation': 0.9310369705137715}, {'feature_1': 'C-2 Date_Day', 'feature_2': 'Assembly Date_Day', 'correlation': 0.81956478383352}, {'feature_1': 'C2_to_Accident', 'feature_2': 'Assembly_to_Accident', 'correlation': 0.948701505976799}, {'feature_1': 'Age_at_Assembly', 'feature_2': 'Age at Injury', 'correlation': 0.9858454839258293}, {'feature_1': 'Age_at_Assembly', 'feature_2': 'Birth Year', 'correlation': -0.9654773357817418}, {'feature_1': 'Age_at_C2', 'feature_2': 'Age at Injury', 'correlation': 0.9778167646555994}, {'feature_1': 'Age_at_C2', 'feature_2': 'Birth Year', 'correlation': -0.955490072529208}, {'feature_1': 'Age_at_C2', 'feature_2': 'Age_at_Assembly', 'correlation': 0.9889200990269661}]
High Correlated Vars OK

## Models

In [50]:
X_train, X_val, X_test, y_train, y_val = processed_data[0]

#### Gradient Boost

In [51]:
def train_xgb(X_train, y_train, X_val, y_val, rfe, random_state=42):
    """Train XGB model"""
    print("\nTraining XGB model...")
    print(f"Starting training with {X_train.shape[1]} features...")
    
    model = XGBClassifier(
        n_estimators=250,
        learning_rate=0.1,
        max_depth=6,
        random_state=random_state,
        n_jobs=2,
        tree_method='hist',
        enable_categorical=True,
        objective='binary:logistic',
        eval_metric=['logloss', 'error'],
        use_label_encoder=False
    )
    if rfe:
        # Select optimal features using RFE
        selected_features, train_scores, val_scores = select_optimal_features_rfe(X_train, y_train, X_val, y_val, model)
        print(f"Selected {len(selected_features)} features: {selected_features}")
        print(f"Train scores: {train_scores}")
        print(f"Validation scores: {val_scores}")

        # Select best features using embedded method
        select_features, coef = select_best_features_embedded(X_train, y_train, model, 0.01)
        print(f"Selected {len(select_features)} features: {select_features}")
        print(f"Feature importances: {coef}")
    else:
        # Fit while logging metrics on the training set; no early stopping is configured
        eval_set = [(X_train, y_train)]
        model.fit(
            X_train, y_train,
            eval_set=eval_set,
            verbose=True
        )
    
    return model

modelXGB = train_xgb(X_train, y_train, X_val, y_val, False)


Training XGB model...
Starting training with 48 features...
[0]	validation_0-logloss:0.11663	validation_0-error:0.00124
[1]	validation_0-logloss:0.10487	validation_0-error:0.00124
[2]	validation_0-logloss:0.09442	validation_0-error:0.00124
[3]	validation_0-logloss:0.08509	validation_0-error:0.00124
[4]	validation_0-logloss:0.07675	validation_0-error:0.00079
[5]	validation_0-logloss:0.06928	validation_0-error:0.00062
[6]	validation_0-logloss:0.06257	validation_0-error:0.00049
[7]	validation_0-logloss:0.05650	validation_0-error:0.00045
[8]	validation_0-logloss:0.05106	validation_0-error:0.00042
[9]	validation_0-logloss:0.04617	validation_0-error:0.00042
[10]	validation_0-logloss:0.04177	validation_0-error:0.00038
[11]	validation_0-logloss:0.03780	validation_0-error:0.00037
[12]	validation_0-logloss:0.03422	validation_0-error:0.00036
[13]	validation_0-logloss:0.03101	validation_0-error:0.00032
[14]	validation_0-logloss:0.02811	validation_0-error:0.00028
[15]	validation_0-logloss:0.02550	

#### Histogram Gradient Boost

In [52]:
def train_hist_gb(X_train, y_train, X_val, y_val, rfe, random_state=42):
    """Train HistGB model"""
    print("\nTraining HistGB model...")
    print(f"Starting training with {X_train.shape[1]} features...")
    
    model = HistGradientBoostingClassifier(
        max_iter=100,
        learning_rate=0.1,
        max_depth=None,
        random_state=random_state,
        verbose=1
    )
    if rfe:
        # Select optimal features using RFE
        selected_features, train_scores, val_scores = select_optimal_features_rfe(X_train, y_train, X_val, y_val, model)
        print(f"Selected {len(selected_features)} features: {selected_features}")
        print(f"Train scores: {train_scores}")
        print(f"Validation scores: {val_scores}")

        # Select best features using embedded method
        select_features, coef = select_best_features_embedded(X_train, y_train, model, 0.01)
        print(f"Selected {len(select_features)} features: {select_features}")
        print(f"Feature importances: {coef}")
    else:
        model.fit(X_train, y_train)
    
    return model

modelHGB = train_hist_gb(X_train, y_train, X_val, y_val, False)


Training HistGB model...
Starting training with 48 features...
Binning 0.116 GB of training data: 0.797 s
Binning 0.013 GB of validation data: 0.015 s
Fitting gradient boosted rounds:
[1/100] 1 tree, 31 leaves, max depth = 11, train loss: 0.01439, val loss: 0.01617, in 0.032s
[2/100] 1 tree, 31 leaves, max depth = 9, train loss: 0.36686, val loss: 0.28434, in 0.031s
[3/100] 1 tree, 31 leaves, max depth = 12, train loss: 0.59063, val loss: 0.60107, in 0.035s
[4/100] 1 tree, 31 leaves, max depth = 11, train loss: 0.54338, val loss: 0.57747, in 0.031s
[5/100] 1 tree, 31 leaves, max depth = 12, train loss: 0.54005, val loss: 0.55474, in 0.035s
[6/100] 1 tree, 31 leaves, max depth = 12, train loss: 0.52045, val loss: 0.53968, in 0.035s
[7/100] 1 tree, 31 leaves, max depth = 12, train loss: 0.49399, val loss: 0.50671, in 0.033s
[8/100] 1 tree, 31 leaves, max depth = 10, train loss: 0.38904, val loss: 0.33127, in 0.033s
[9/100] 1 tree, 31 leaves, max depth = 9, train loss: 0.35478, val loss:

#### Random Forest

In [53]:
def train_simple_rf(X_train, y_train, X_val, y_val, rfe, random_state=42):
    """Train Simple RF model"""
    print("\nTraining Simple RF model...")
    print(f"Starting training with {X_train.shape[1]} features...")
    
    model = RandomForestClassifier(
        n_estimators=100,
        max_depth=10,
        random_state=random_state,
        n_jobs=2,
        verbose=1,
        class_weight='balanced'
    )
    if rfe:
        # Select optimal features using RFE
        selected_features, train_scores, val_scores = select_optimal_features_rfe(X_train, y_train, X_val, y_val, model)
        print(f"Selected {len(selected_features)} features: {selected_features}")
        print(f"Train scores: {train_scores}")
        print(f"Validation scores: {val_scores}")

        # Select best features using embedded method
        select_features, coef = select_best_features_embedded(X_train, y_train, model, 0.01)
        print(f"Selected {len(select_features)} features: {select_features}")
        print(f"Feature importances: {coef}")
    else:
        model.fit(X_train, y_train)
    
    return model

modelRF = train_simple_rf(X_train, y_train, X_val, y_val, False)


Training Simple RF model...
Starting training with 48 features...


[Parallel(n_jobs=2)]: Using backend ThreadingBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:    7.3s
[Parallel(n_jobs=2)]: Done 100 out of 100 | elapsed:   16.1s finished


#### Logistic Regression

In [54]:
def train_logistic_regression(X_train, y_train, X_val, y_val, rfe):
    """Train Logistic Regression model"""
    print("\nTraining Logistic Regression model...")
    print(f"Starting training with {X_train.shape[1]} features...")
    
    model = LogisticRegression(class_weight='balanced', max_iter=1000)

    if rfe:
        # Select optimal features using RFE
        selected_features, train_scores, val_scores = select_optimal_features_rfe(X_train, y_train, X_val, y_val, model)
        print(f"Selected {len(selected_features)} features: {selected_features}")
        print(f"Train scores: {train_scores}")
        print(f"Validation scores: {val_scores}")

        # Select best features using embedded method
        select_features, coef = select_best_features_embedded(X_train, y_train, model, 0.01)
        print(f"Selected {len(select_features)} features: {select_features}")
        print(f"Feature importances: {coef}")
    else:
        model.fit(X_train, y_train)
    
    return model
    
modelLR = train_logistic_regression(X_train, y_train, X_val, y_val, False)


Training Logistic Regression model...
Starting training with 48 features...


#### Decision Tree

In [55]:
def train_decision_tree(X_train, y_train, X_val, y_val, rfe):
    """Train Decision Tree model"""
    print("\nTraining Decision Tree model...")
    print(f"Starting training with {X_train.shape[1]} features...")
    
    model = DecisionTreeClassifier(class_weight='balanced')

    if rfe:
        # Select optimal features using RFE
        selected_features, train_scores, val_scores = select_optimal_features_rfe(X_train, y_train, X_val, y_val, model)
        print(f"Selected {len(selected_features)} features: {selected_features}")
        print(f"Train scores: {train_scores}")
        print(f"Validation scores: {val_scores}")

        # Select best features using embedded method
        select_features, coef = select_best_features_embedded(X_train, y_train, model, 0.01)
        print(f"Selected {len(select_features)} features: {select_features}")
        print(f"Feature importances: {coef}")
    else:
        model.fit(X_train, y_train)
    
    return model
    
modelDT = train_decision_tree(X_train, y_train, X_val, y_val, False)


Training Decision Tree model...
Starting training with 48 features...


#### Naive Bayes

In [56]:
# Function that combines GaussianNB and CategoricalNB
def naive_bayes(X_num_train, X_cat_train, y_train, X_num_test, X_cat_test):
    # Train GaussianNB on the numerical data
    gnb = GaussianNB()
    gnb.fit(X_num_train, y_train)
    
    # Train CategoricalNB on the categorical data
    cnb = CategoricalNB()
    cnb.fit(X_cat_train, y_train)

    # Predict probabilities with both models
    prob_gnb = gnb.predict_proba(X_num_test)
    prob_cnb = cnb.predict_proba(X_cat_test)

    # Combine the probabilities by multiplying them (assuming independence)
    combined_prob = prob_gnb * prob_cnb

    # Return the class with the highest combined probability
    return np.argmax(combined_prob, axis=1)
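
Each of the two posteriors already includes the class prior, so multiplying them counts the prior twice. A possible variant, sketched below under that assumption, works in log space and subtracts the log prior once. The name `naive_bayes_log` and the toy data are hypothetical. Note that `CategoricalNB` expects non-negative integer codes.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB, GaussianNB

def naive_bayes_log(X_num_train, X_cat_train, y_train, X_num_test, X_cat_test):
    """Combine GaussianNB and CategoricalNB in log space (hypothetical variant)."""
    gnb = GaussianNB().fit(X_num_train, y_train)
    cnb = CategoricalNB().fit(X_cat_train, y_train)
    # Sum of the two log posteriors, minus one copy of the log class prior
    log_post = (
        gnb.predict_log_proba(X_num_test)
        + cnb.predict_log_proba(X_cat_test)
        - cnb.class_log_prior_
    )
    # Map the winning column back to the original class label
    return gnb.classes_[np.argmax(log_post, axis=1)]

# Toy data (made up): two numeric and two integer-coded categorical features
rng = np.random.default_rng(0)
y_toy = np.array([0] * 50 + [1] * 50)
X_num_toy = rng.normal(loc=y_toy[:, None] * 3.0, scale=1.0, size=(100, 2))
X_cat_toy = np.column_stack([y_toy, rng.integers(0, 3, size=100)])

pred = naive_bayes_log(X_num_toy, X_cat_toy, y_toy, X_num_toy, X_cat_toy)
print((pred == y_toy).mean())
```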

<hr>
<a class="anchor" id="evaluate">
    
# 6.2 Evaluate the model
    
</a> 

In [57]:
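def evaluate_models(X_val, y_val, X_test):
    """Compare the fitted models on the validation set by macro F1 and predict the test set with the best one."""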
    models = {
        'LogisticRegression': modelLR,
        'XGB': modelXGB, 
        'HistGB': modelHGB,
        'RF': modelRF,
        # 'GaussianNB': modelGNB,
        'DT': modelDT
    }

    best_model = None
    best_score = 0
    best_report = None

    for name, model in models.items():
        print(f"Model: {name}")
        y_pred = model.predict(X_val)
        report = classification_report(y_val, y_pred)
        print(report)
        score = f1_score(y_val, y_pred, average='macro')
        print(f"F1-score: {score:.4f}")
        print()

        if score > best_score:
            best_score = score
            best_model = model
            best_report = report

    print(f"Best model: {best_model.__class__.__name__}")
    print("Best classification report:")
    print(best_report)
    test_pred = best_model.predict(X_test)
    
    return test_pred, best_model

test_pred, best_model = evaluate_models(X_val, y_val, X_test)

Model: LogisticRegression
              precision    recall  f1-score   support

           0       0.97      0.92      0.94    164172
           1       0.19      0.38      0.25      8036

    accuracy                           0.89    172208
   macro avg       0.58      0.65      0.60    172208
weighted avg       0.93      0.89      0.91    172208

F1-score: 0.5955

Model: XGB
              precision    recall  f1-score   support

           0       0.96      1.00      0.98    164172
           1       0.93      0.04      0.08      8036

    accuracy                           0.96    172208
   macro avg       0.94      0.52      0.53    172208
weighted avg       0.95      0.96      0.94    172208

F1-score: 0.5306

Model: HistGB
              precision    recall  f1-score   support

           0       0.95      1.00      0.98    164172
           1       0.62      0.04      0.07      8036

    accuracy                           0.95    172208
   macro avg       0.79      0.52      0.

[Parallel(n_jobs=2)]: Using backend ThreadingBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:    0.1s
[Parallel(n_jobs=2)]: Done 100 out of 100 | elapsed:    0.4s finished


              precision    recall  f1-score   support

           0       0.96      1.00      0.98    164172
           1       0.77      0.06      0.12      8036

    accuracy                           0.96    172208
   macro avg       0.86      0.53      0.55    172208
weighted avg       0.95      0.96      0.94    172208

F1-score: 0.5485

Model: DT
              precision    recall  f1-score   support

           0       0.95      1.00      0.98    164172
           1       0.80      0.04      0.07      8036

    accuracy                           0.95    172208
   macro avg       0.88      0.52      0.52    172208
weighted avg       0.95      0.95      0.93    172208

F1-score: 0.5238

Best model: LogisticRegression
Best classification report:
              precision    recall  f1-score   support

           0       0.97      0.92      0.94    164172
           1       0.19      0.38      0.25      8036

    accuracy                           0.89    172208
   macro avg       0.58
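
The `F1-score` printed for each model is the macro average, i.e. the unweighted mean of the per-class F1 values. For Logistic Regression, the rounded per-class values 0.94 and 0.25 average to about 0.60, in line with the reported 0.5955. The sketch below reproduces the calculation by hand on hypothetical confusion counts (these are not project results).

```python
def f1_from_counts(tp, fp, fn):
    """F1 for one class from its true positives, false positives and false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical binary confusion counts (made up for illustration)
tn, fp, fn, tp = 90, 10, 6, 4

f1_pos = f1_from_counts(tp, fp, fn)  # class 1 treated as the positive class
f1_neg = f1_from_counts(tn, fn, fp)  # class 0: the roles of fp and fn swap
macro_f1 = (f1_pos + f1_neg) / 2     # unweighted mean, as with average='macro'

print(f"F1 class 1: {f1_pos:.4f}, F1 class 0: {f1_neg:.4f}, macro F1: {macro_f1:.4f}")
```

Because both classes weigh equally, a model with very low recall on the minority class is penalised even when its accuracy is high, which is the pattern visible in the reports above.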

<hr>
<a class="anchor" id="export">
    
# 6.3 Export the predictor
    
</a> 

In [58]:
Agreement_Reached = best_model.predict(X_test)

In [59]:
%store Agreement_Reached

Stored 'Agreement_Reached' (ndarray)
