# <center>Machine Learning Project Code</center>

<a class="anchor" id="top"></a>

## <center>*03 - K-Fold*</center>

** **



# Table of Contents  <br>


1. [Importing Libraries & Data](#1.-Importing-Libraries-&-Data) <br><br>
    
2. [Cross Validation](#2.-Cross-Validation) <br><br>

3. [Final Predictions](#3.-Final-Predictions) <br><br>

** **

This notebook implements Stratified K-Fold cross-validation. It fills missing values and treats outliers with the same techniques as Notebook 02. Feature selection is performed only in Notebook 02, and the features selected there are reused here because of computational cost and time constraints.

Data Scientist Manager: António Oliveira, **20211595**

Data Scientist Senior: Tomás Ribeiro, **20240526**

Data Scientist Junior: Gonçalo Pacheco, **20240695**

Data Analyst Senior: Gonçalo Custódio, **20211643**

Data Analyst Junior: Ana Caleiro, **20240696**


** ** 

# 1. Importing Libraries & Data
In this section, we import the Python libraries needed for the project and load the datasets. The libraries provide the tools for data manipulation, cross-validation, and model evaluation used throughout the notebook. We then load the historical claims dataset, which forms the core of our analysis, together with the test data.

In [1]:
import pandas as pd
import numpy as np

# Cross-validation splitter
from sklearn.model_selection import StratifiedKFold

# Models
import models as mod

# Metrics
from sklearn.metrics import classification_report
import metrics as m
from sklearn.metrics import f1_score, precision_score, recall_score

pd.set_option('display.max_columns', None)

# Suppress Warnings
import warnings
warnings.filterwarnings("ignore")

**Import Data**

In [2]:
# Load training data
df = pd.read_csv('./data/train_data_EDA.csv', index_col = 'Claim Identifier')

# Load testing data
test1 = pd.read_csv('./data/test_data_EDA.csv', index_col = 'Claim Identifier')

# Display the first 3 rows of the training data
df.head(3)

Claim Identifier,Age at Injury,Alternative Dispute Resolution,Attorney/Representative,Average Weekly Wage,Birth Year,C-3 Date,Carrier Name,Carrier Type,Claim Injury Type,County of Injury,COVID-19 Indicator,District Name,First Hearing Date,Gender,IME-4 Count,Industry Code,Medical Fee Region,WCIO Cause of Injury Code,WCIO Nature of Injury Code,WCIO Part Of Body Code,Number of Dependents,Gender Enc,Accident Date Year,Accident Date Month,Accident Date Day,Accident Date Day of Week,Assembly Date Year,Assembly Date Month,Assembly Date Day,Assembly Date Day of Week,C-2 Date Year,C-2 Date Month,C-2 Date Day,C-2 Date Day of Week,Accident to Assembly Time,Assembly to C-2 Time,Accident to C-2 Time,WCIO Codes,Insurance,Zip Code Valid,Industry Sector,Age Group
5393875,31.0,N,N,0.0,1988.0,,NEW HAMPSHIRE INSURANCE CO,1A. PRIVATE,1,ST. LAWRENCE,N,SYRACUSE,,M,,44.0,I,27,10,62,1.0,0,2019.0,12.0,30.0,0.0,2020,1,1,2,2019.0,12.0,31.0,1.0,2.0,1.0,1.0,271062,1,0,Retail and Wholesale,1
5393091,46.0,N,Y,1745.93,1973.0,2020-01-14,ZURICH AMERICAN INSURANCE CO,1A. PRIVATE,3,WYOMING,N,ROCHESTER,2020-02-21,F,4.0,23.0,I,97,49,38,4.0,1,2019.0,8.0,30.0,4.0,2020,1,1,2,2020.0,1.0,1.0,2.0,124.0,0.0,124.0,974938,1,0,Manufacturing and Construction,1
5393889,40.0,N,N,1434.8,1979.0,,INDEMNITY INSURANCE CO OF,1A. PRIVATE,3,ORANGE,N,ALBANY,,M,,56.0,II,79,7,10,6.0,0,2019.0,12.0,6.0,4.0,2020,1,1,2,2020.0,1.0,1.0,2.0,26.0,0.0,26.0,79710,1,0,Business Services,1


# 2. Cross Validation

<a href="#top">Top &#129033;</a>

In [3]:
# Split the DataFrame into features (X) and target variable (y)
X = df.drop('Claim Injury Type', axis=1) 
y = df['Claim Injury Type']  

**Stratified K-Fold**

In [4]:
method = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
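The splitter above keeps the class proportions of the target roughly constant in every fold, which matters when some classes are rare. The following is a minimal sketch on a toy target (the arrays `X_toy` and `y_toy` are illustrative and not part of the project data), showing that each validation fold keeps the overall class share:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced target: 80 samples of class 0 and 20 of class 1 (illustrative only).
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.zeros((len(y_toy), 1))  # feature values do not influence the split

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

# Share of class 1 in each validation fold; stratification keeps it at the overall 20%.
val_shares = [float(y_toy[val_idx].mean()) for _, val_idx in skf.split(X_toy, y_toy)]
print(val_shares)
```

A plain `KFold` with shuffling would give shares that fluctuate from fold to fold, so validation scores for rare classes would be noisier.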

In [19]:
XGB_freq_no_out_10f = mod.k_fold(method, X, y, test1, 'XGB', params = {},
                              enc = 'freq', outliers = True,
                              file_name = 'XGB_freq_no_out_10f')

XGB_freq_out_10f = mod.k_fold(method, X, y, test1, 'XGB', params = {},
                              enc = 'freq', outliers = False,
                              file_name = 'XGB_freq_out_10f')

XGB_count_no_out_10f = mod.k_fold(method, X, y, test1, 'XGB', params = {},
                              enc = 'count', outliers = True,
                              file_name = 'XGB_count_no_out_10f')

XGB_count_out_10f = mod.k_fold(method, X, y, test1, 'XGB', params = {},
                              enc = 'count', outliers = False,
                              file_name = 'XGB_count_out_10f')

This Fold took 1.66 minutes
This Fold took 1.62 minutes
This Fold took 1.56 minutes
This Fold took 1.53 minutes
This Fold took 1.52 minutes
This Fold took 1.62 minutes
This Fold took 1.68 minutes
This Fold took 1.6 minutes
This Fold took 1.55 minutes
This Fold took 1.67 minutes
This Fold took 1.39 minutes
This Fold took 1.37 minutes
This Fold took 1.37 minutes
This Fold took 1.35 minutes
This Fold took 1.46 minutes
This Fold took 1.34 minutes
This Fold took 1.33 minutes
This Fold took 1.32 minutes
This Fold took 1.24 minutes
This Fold took 1.12 minutes
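The helper `mod.k_fold` lives in the project's own `models` module and is not shown in this notebook. As a rough idea of what such a helper might do, here is a hypothetical sketch: `k_fold_sketch`, the synthetic data, and the logistic regression model are all assumptions made for illustration, not the project's implementation.

```python
import time

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold


def k_fold_sketch(splitter, X, y, make_model):
    """Hypothetical stand-in for mod.k_fold: per-fold macro F1 and timings."""
    results = {"train_f1": [], "val_f1": [], "minutes": []}
    for train_idx, val_idx in splitter.split(X, y):
        start = time.perf_counter()
        X_tr, X_val = X[train_idx], X[val_idx]
        y_tr, y_val = y[train_idx], y[val_idx]

        # Any preprocessing (imputation, encoding, outlier treatment) would be
        # fitted on X_tr only and then applied to X_val to avoid leakage.
        model = make_model()
        model.fit(X_tr, y_tr)

        results["train_f1"].append(f1_score(y_tr, model.predict(X_tr), average="macro"))
        results["val_f1"].append(f1_score(y_val, model.predict(X_val), average="macro"))
        results["minutes"].append((time.perf_counter() - start) / 60)
    return results


# Synthetic 3-class data, used only to make the sketch runnable.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=0
)
splitter = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = k_fold_sketch(splitter, X_demo, y_demo, lambda: LogisticRegression(max_iter=1000))
print(round(float(np.mean(scores["val_f1"])), 3))
```

Passing a factory (`make_model`) rather than a fitted estimator ensures that every fold starts from a fresh, untrained model.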


In [7]:
RF_freq_no_out = mod.k_fold(method, X, y, test1, 'RF', params = {},
                              enc = 'freq', outliers = True,
                              file_name = 'RF_freq_no_out')

RF_freq_out = mod.k_fold(method, X, y, test1, 'RF', params = {},
                              enc = 'freq', outliers = False,
                              file_name = 'RF_freq_out')

RF_count_no_out = mod.k_fold(method, X, y, test1, 'RF', params = {},
                              enc = 'count', outliers = True,
                              file_name = 'RF_count_no_out')

RF_count_out = mod.k_fold(method, X, y, test1, 'RF', params = {},
                              enc = 'count', outliers = False,
                              file_name = 'RF_count_out')

This Fold took 3.38 minutes
This Fold took 3.45 minutes
This Fold took 3.43 minutes
This Fold took 3.39 minutes
This Fold took 3.36 minutes
This Fold took 3.19 minutes
This Fold took 3.2 minutes
This Fold took 3.18 minutes
This Fold took 3.22 minutes
This Fold took 3.27 minutes
This Fold took 3.14 minutes
This Fold took 3.17 minutes
This Fold took 3.2 minutes
This Fold took 3.19 minutes
This Fold took 3.2 minutes
This Fold took 2.97 minutes
This Fold took 3.0 minutes
This Fold took 3.01 minutes
This Fold took 3.06 minutes
This Fold took 3.05 minutes


In [15]:
LGBM_freq_no_out = mod.k_fold(method, X, y, test1, 'LGBM', params = {},
                              enc = 'freq', outliers = True,
                              file_name = 'LGBM_freq_no_out')

LGBM_freq_out = mod.k_fold(method, X, y, test1, 'LGBM', params = {},
                              enc = 'freq', outliers = False,
                              file_name = 'LGBM_freq_out')

LGBM_count_no_out = mod.k_fold(method, X, y, test1, 'LGBM', params = {},
                              enc = 'count', outliers = True,
                              file_name = 'LGBM_count_no_out')

LGBM_count_out = mod.k_fold(method, X, y, test1, 'LGBM', params = {},
                              enc = 'count', outliers = False,
                              file_name = 'LGBM_count_out')

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.030632 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2763
[LightGBM] [Info] Number of data points in the train set: 454727, number of used features: 50
[LightGBM] [Info] Start training from score -3.874529
[LightGBM] [Info] Start training from score -0.676691
[LightGBM] [Info] Start training from score -2.114330
[LightGBM] [Info] Start training from score -1.357652
[LightGBM] [Info] Start training from score -2.469974
[LightGBM] [Info] Start training from score -4.908650
[LightGBM] [Info] Start training from score -8.683647
[LightGBM] [Info] Start training from score -7.103197
This Fold took 1.52 minutes
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.036254 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not eno

[LightGBM] [Info] Start training from score -3.828846
[LightGBM] [Info] Start training from score -0.679079
[LightGBM] [Info] Start training from score -2.119926
[LightGBM] [Info] Start training from score -1.352046
[LightGBM] [Info] Start training from score -2.475656
[LightGBM] [Info] Start training from score -4.914913
[LightGBM] [Info] Start training from score -8.693479
[LightGBM] [Info] Start training from score -7.107696
This Fold took 1.47 minutes
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.032395 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2763
[LightGBM] [Info] Number of data points in the train set: 454727, number of used features: 50
[LightGBM] [Info] Start training from score -3.874529
[LightGBM] [Info] Start training from score -0.676691
[LightGBM] [Info] Start training from score -2.114330
[LightGBM] [Info] Start 

This Fold took 1.21 minutes
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.027036 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2528
[LightGBM] [Info] Number of data points in the train set: 459220, number of used features: 47
[LightGBM] [Info] Start training from score -3.828846
[LightGBM] [Info] Start training from score -0.679079
[LightGBM] [Info] Start training from score -2.119926
[LightGBM] [Info] Start training from score -1.352046
[LightGBM] [Info] Start training from score -2.475656
[LightGBM] [Info] Start training from score -4.914913
[LightGBM] [Info] Start training from score -8.693479
[LightGBM] [Info] Start training from score -7.107696
This Fold took 1.21 minutes
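The eight `Start training from score` values in the first LightGBM log block above appear to be the logarithms of the class shares in the training fold, one per target class. Under that assumption, exponentiating them recovers the class distribution, as this small sketch shows:

```python
import math

# Initial scores copied from the first LightGBM log block above, one per class.
init_scores = [
    -3.874529, -0.676691, -2.114330, -1.357652,
    -2.469974, -4.908650, -8.683647, -7.103197,
]

# Assumption: each score is log(class share), so exp() recovers the share.
shares = [math.exp(score) for score in init_scores]

for class_position, share in enumerate(shares):
    print(f"class position {class_position}: {share:.4f}")
print(f"sum of shares: {sum(shares):.4f}")
```

The shares sum to approximately 1, which is consistent with the assumption. The second class position alone accounts for about half of the training rows while the two rarest fall below 0.1%, so the target is heavily imbalanced. This is the situation in which stratified folds and macro-averaged metrics are most useful.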


In [None]:
# Play an audio file, presumably to signal that the long-running cells above have finished
import play_song as s
s.play_('audio.mp3')

In [20]:
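# Note: the XGB_* results without the _10f suffix are not created in the cells above;
# they must already exist in the session (e.g., from an earlier run).
models = [XGB_freq_no_out, XGB_freq_out, 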
          XGB_count_no_out, XGB_count_out,
          XGB_freq_no_out_10f, XGB_freq_out_10f, 
          XGB_count_no_out_10f, XGB_count_out_10f,
          RF_freq_no_out, RF_freq_out,
          RF_count_no_out, RF_count_out,
          LGBM_freq_no_out, LGBM_freq_out,
          LGBM_count_no_out, LGBM_count_out]

model_names = ['XGB_freq_no_out', 'XGB_freq_out', 
               'XGB_count_no_out', 'XGB_count_out',
               'XGB_freq_no_out_10f', 'XGB_freq_out_10f', 
               'XGB_count_no_out_10f', 'XGB_count_out_10f',
               'RF_freq_no_out', 'RF_freq_out',
               'RF_count_no_out', 'RF_count_out',
               'LGBM_freq_no_out', 'LGBM_freq_out',
               'LGBM_count_no_out', 'LGBM_count_out']

m.metrics2(models, model_names)

| Metric | XGB_freq_no_out | XGB_freq_out | XGB_count_no_out | XGB_count_out | XGB_freq_no_out_10f | XGB_freq_out_10f | XGB_count_no_out_10f | XGB_count_out_10f | RF_freq_no_out | RF_freq_out | RF_count_no_out | RF_count_out | LGBM_freq_no_out | LGBM_freq_out | LGBM_count_no_out | LGBM_count_out |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Train F1 macro | 0.67+/-0.002 | 0.67+/-0.001 | 0.669+/-0.001 | 0.669+/-0.001 | 0.67+/-0.002 | 0.67+/-0.001 | 0.669+/-0.001 | 0.669+/-0.001 | 1.0+/-0.0 | 1.0+/-0.0 | 1.0+/-0.0 | 1.0+/-0.0 | 0.422+/-0.015 | 0.437+/-0.015 | 0.409+/-0.009 | 0.419+/-0.008 |
| Validation F1 macro | 0.449+/-0.006 | 0.452+/-0.004 | 0.45+/-0.005 | 0.452+/-0.003 | 0.449+/-0.006 | 0.452+/-0.004 | 0.45+/-0.005 | 0.452+/-0.003 | 0.39+/-0.004 | 0.393+/-0.006 | 0.39+/-0.004 | 0.39+/-0.003 | 0.391+/-0.007 | 0.398+/-0.005 | 0.385+/-0.006 | 0.392+/-0.004 |
| Precision Train | 0.837+/-0.002 | 0.835+/-0.001 | 0.838+/-0.003 | 0.835+/-0.003 | 0.837+/-0.002 | 0.835+/-0.001 | 0.838+/-0.003 | 0.835+/-0.003 | 1.0+/-0.0 | 1.0+/-0.0 | 1.0+/-0.0 | 1.0+/-0.0 | 0.496+/-0.032 | 0.514+/-0.016 | 0.485+/-0.025 | 0.491+/-0.021 |
| Precision Validation | 0.569+/-0.011 | 0.571+/-0.003 | 0.572+/-0.011 | 0.575+/-0.007 | 0.569+/-0.011 | 0.571+/-0.003 | 0.572+/-0.011 | 0.575+/-0.007 | 0.52+/-0.027 | 0.531+/-0.018 | 0.524+/-0.027 | 0.53+/-0.027 | 0.433+/-0.016 | 0.444+/-0.01 | 0.431+/-0.012 | 0.44+/-0.008 |
| Recall Train | 0.654+/-0.001 | 0.654+/-0.001 | 0.653+/-0.001 | 0.653+/-0.002 | 0.654+/-0.001 | 0.654+/-0.001 | 0.653+/-0.001 | 0.653+/-0.002 | 1.0+/-0.0 | 1.0+/-0.0 | 1.0+/-0.0 | 1.0+/-0.0 | 0.421+/-0.017 | 0.444+/-0.012 | 0.422+/-0.017 | 0.435+/-0.021 |
| Recall Validation | 0.433+/-0.008 | 0.433+/-0.005 | 0.433+/-0.007 | 0.434+/-0.005 | 0.433+/-0.008 | 0.433+/-0.005 | 0.433+/-0.007 | 0.434+/-0.005 | 0.377+/-0.002 | 0.378+/-0.003 | 0.377+/-0.002 | 0.376+/-0.002 | 0.398+/-0.012 | 0.409+/-0.012 | 0.395+/-0.012 | 0.402+/-0.01 |
| Time | 1.22+/-0.028 | 1.222+/-0.031 | 1.048+/-0.033 | 1.232+/-0.123 | 1.578+/-0.054 | 1.624+/-0.048 | 1.388+/-0.038 | 1.27+/-0.083 | 3.402+/-0.033 | 3.212+/-0.032 | 3.18+/-0.023 | 3.018+/-0.033 | 1.486+/-0.038 | 1.46+/-0.018 | 1.23+/-0.015 | 1.21+/-0.0 |
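Each cell in the results above has the form `mean+/-spread` across folds. The function `m.metrics2` is defined in the project's own `metrics` module and is not shown here, so the following is only a sketch of how such an entry could be produced. The helper `summarize` is hypothetical, and it assumes the spread is the population standard deviation rounded to three decimals.

```python
import numpy as np


def summarize(fold_scores):
    """Hypothetical helper: format per-fold scores as 'mean+/-std'."""
    mean = round(float(np.mean(fold_scores)), 3)
    std = round(float(np.std(fold_scores)), 3)  # population std (ddof=0), an assumption
    return f"{mean}+/-{std}"


print(summarize([0.44, 0.45, 0.46]))  # 0.45+/-0.008
print(summarize([1.0, 1.0, 1.0]))     # 1.0+/-0.0
```

Reading a train entry next to its validation entry gives a quick view of overfitting: a model that scores 1.0 on every training fold but far lower on validation has memorised the training data.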
