# <center>Machine Learning Project Code</center>

<a class="anchor" id="top"></a>

## <center>*03 - K-Fold*</center>

** **



# Table of Contents  <br>


1. [Importing Libraries & Data](#1.-Importing-Libraries-&-Data) <br><br>
    
2. [Cross Validation](#2.-Cross-Validation) <br><br>

3. [Final Predictions](#3.-Final-Predictions) <br><br>

** **

This notebook implements Stratified K-Fold cross-validation. It fills missing values and treats outliers with the same techniques as Notebook 02. Because of computational cost and time constraints, feature selection is performed only in Notebook 02, and the features selected there are reused here.

Data Scientist Manager: António Oliveira, **20211595**

Data Scientist Senior: Tomás Ribeiro, **20240526**

Data Scientist Junior: Gonçalo Pacheco, **20240695**

Data Analyst Senior: Gonçalo Custódio, **20211643**

Data Analyst Junior: Ana Caleiro, **20240696**


** ** 

# 1. Importing Libraries & Data
In this section, we import the Python libraries the notebook needs and load the data. The imports cover data manipulation (`pandas`), fold generation (`StratifiedKFold`), and the project modules `models` and `metrics`, which are used below to train and evaluate the models. We then load the historical claims dataset, which forms the core of our analysis, together with the test set.

In [9]:
import pandas as pd

# Cross-Validation Splitter
from sklearn.model_selection import StratifiedKFold

# Models
import models as mod

# Metrics
import metrics as m

pd.set_option('display.max_columns', None)

# Suppress Warnings
import warnings
warnings.filterwarnings("ignore")

**Import Data**

In [3]:
# Load training data
df = pd.read_csv('./data/train_data_EDA.csv', index_col = 'Claim Identifier')

# Load testing data
test1 = pd.read_csv('./data/test_data_EDA.csv', index_col = 'Claim Identifier')

# Display the first 3 rows of the training data
df.head(3)

Claim Identifier,Age at Injury,Alternative Dispute Resolution,Attorney/Representative,Average Weekly Wage,Birth Year,C-3 Date,Carrier Name,Carrier Type,Claim Injury Type,County of Injury,COVID-19 Indicator,District Name,First Hearing Date,Gender,IME-4 Count,Industry Code,Medical Fee Region,WCIO Cause of Injury Code,WCIO Nature of Injury Code,WCIO Part Of Body Code,Number of Dependents,Gender Enc,Accident Date Year,Accident Date Month,Accident Date Day,Accident Date Day of Week,Assembly Date Year,Assembly Date Month,Assembly Date Day,Assembly Date Day of Week,C-2 Date Year,C-2 Date Month,C-2 Date Day,C-2 Date Day of Week,Accident to Assembly Time,Assembly to C-2 Time,Accident to C-2 Time,WCIO Codes,Insurance,Zip Code Valid,Industry Sector,Age Group
5393875,31.0,N,N,0.0,1988.0,,NEW HAMPSHIRE INSURANCE CO,1A. PRIVATE,1,ST. LAWRENCE,N,SYRACUSE,,M,,44.0,I,27,10,62,1.0,0,2019.0,12.0,30.0,0.0,2020,1,1,2,2019.0,12.0,31.0,1.0,2.0,1.0,1.0,271062,1,0,Retail and Wholesale,1
5393091,46.0,N,Y,1745.93,1973.0,2020-01-14,ZURICH AMERICAN INSURANCE CO,1A. PRIVATE,3,WYOMING,N,ROCHESTER,2020-02-21,F,4.0,23.0,I,97,49,38,4.0,1,2019.0,8.0,30.0,4.0,2020,1,1,2,2020.0,1.0,1.0,2.0,124.0,0.0,124.0,974938,1,0,Manufacturing and Construction,1
5393889,40.0,N,N,1434.8,1979.0,,INDEMNITY INSURANCE CO OF,1A. PRIVATE,3,ORANGE,N,ALBANY,,M,,56.0,II,79,7,10,6.0,0,2019.0,12.0,6.0,4.0,2020,1,1,2,2020.0,1.0,1.0,2.0,26.0,0.0,26.0,79710,1,0,Business Services,1


# 2. Cross Validation

<a href="#top">Top &#129033;</a>

In [4]:
# Split the DataFrame into features (X) and target variable (y)
X = df.drop('Claim Injury Type', axis=1) 
y = df['Claim Injury Type']  

**Stratified K-Fold**

In [11]:
# Initialize StratifiedKFold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
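To illustrate what stratification does to an imbalanced target such as `Claim Injury Type`, here is a simplified pure-Python sketch. It is not how scikit-learn implements `StratifiedKFold`; the helper `stratified_folds` and the toy labels are hypothetical and exist only for this illustration.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(labels, n_splits=5, seed=42):
    """Assign each sample index to a fold so that every fold keeps
    roughly the same class proportions as the full label list.

    Simplified illustration only; not scikit-learn's implementation.
    """
    rng = random.Random(seed)

    # Group the sample indices by class
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the indices of one class round-robin across the folds
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds


# Toy imbalanced target: 80 samples of class 0, 15 of class 1, 5 of class 2
toy_labels = [0] * 80 + [1] * 15 + [2] * 5
toy_folds = stratified_folds(toy_labels)

for fold_id, val_idx in enumerate(toy_folds):
    counts = Counter(toy_labels[i] for i in val_idx)
    print(f"fold {fold_id}: {dict(sorted(counts.items()))}")
```

Every validation fold in this toy example receives 16, 3 and 1 samples of classes 0, 1 and 2, so even the rarest class is represented in each fold. A plain random split gives no such guarantee.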

In [12]:
run_XGB = mod.k_fold(skf, X, y, 'XGB', test1)
run_RF = mod.k_fold(skf, X, y, 'RF', test1)
run_LGBM = mod.k_fold(skf, X, y, 'LGBM', test1)

This Fold took 1.23 minutes
This Fold took 1.23 minutes
This Fold took 1.23 minutes
This Fold took 1.24 minutes
This Fold took 1.26 minutes
This Fold took 3.59 minutes
This Fold took 3.54 minutes
This Fold took 3.58 minutes
This Fold took 3.6 minutes
This Fold took 3.54 minutes
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.022895 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2522
[LightGBM] [Info] Number of data points in the train set: 459220, number of used features: 47
[LightGBM] [Info] Start training from score -3.828846
[LightGBM] [Info] Start training from score -0.679083
[LightGBM] [Info] Start training from score -2.119944
[LightGBM] [Info] Start training from score -1.352037
[LightGBM] [Info] Start training from score -2.475656
[LightGBM] [Info] Start training from score -4.914913
[LightGBM] [Info] Start training from score

In [13]:
models = [run_XGB, run_RF, run_LGBM]
m.metrics2(models, ['XGB', 'RF', 'LGBM'])

| Metric | XGB | RF | LGBM |
|---|---|---|---|
| Train F1 macro | 0.67+/-0.001 | 1.0+/-0.0 | 0.42+/-0.013 |
| Validation F1 macro | 0.451+/-0.007 | 0.392+/-0.004 | 0.394+/-0.007 |
| Precision Train | 0.836+/-0.001 | 1.0+/-0.0 | 0.493+/-0.022 |
| Precision Validation | 0.572+/-0.009 | 0.531+/-0.015 | 0.445+/-0.01 |
| Recall Train | 0.654+/-0.0 | 1.0+/-0.0 | 0.423+/-0.02 |
| Recall Validation | 0.432+/-0.007 | 0.377+/-0.002 | 0.398+/-0.011 |
| Time (minutes) | 1.238+/-0.012 | 3.57+/-0.025 | 1.094+/-0.014 |
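As a reading aid for the metrics table, the sketch below shows how a macro F1 score is computed and how per-fold scores could be condensed into a `mean+/-std` string. It is an illustration only and not the code of `m.metrics2`: the helpers `macro_f1` and `summarize` are hypothetical, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return mean(f1_scores)


def summarize(fold_scores, digits=3):
    """Format per-fold scores as 'mean+/-std' (population std, an assumption)."""
    return f"{round(mean(fold_scores), digits)}+/-{round(pstdev(fold_scores), digits)}"


# Toy example: the minority class (1) is only half recovered
toy_true = [0, 0, 0, 0, 1, 1]
toy_pred = [0, 0, 0, 0, 0, 1]
print(round(macro_f1(toy_true, toy_pred), 3))

# Hypothetical validation scores from three folds
print(summarize([0.44, 0.45, 0.46]))
```

Because the macro average gives every class the same weight, poor performance on rare classes lowers the score as much as poor performance on frequent ones. Accuracy in the toy example is 5/6, while the macro F1 is 7/9.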


In [14]:
import play_song as s
s.play_('audio.mp3')


# 3. Final Predictions

<a href="#top">Top &#129033;</a>

In [58]:
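# Keep only the columns used to fit the model, in the same order.
# .copy() makes test_filtered an independent DataFrame, so the prediction
# column added in the next cell is not written to a view of test_RS.
# Note: test_RS and X_train_RS are not defined in this notebook; they must already exist in the session.
test_filtered = test_RS[X_train_RS.columns].copy()

In [59]:
# Predict the encoded class of every test claim (model must be an already fitted classifier)
test_filtered['Claim Injury Type'] = model.predict(test_filtered)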

**Map Predictions to Original Values**

In [60]:
label_mapping = {
    0: "1. CANCELLED",
    1: "2. NON-COMP",
    2: "3. MED ONLY",
    3: "4. TEMPORARY",
    4: "5. PPD SCH LOSS",
    5: "6. PPD NSL",
    6: "7. PTD",
    7: "8. DEATH"
}


test_filtered['Claim Injury Type'] = test_filtered['Claim Injury Type'].replace(label_mapping)

In [61]:
# Count unique values in column 'Claim Injury Type'
test_filtered['Claim Injury Type'].value_counts() 

Claim Injury Type
2. NON-COMP        315388
4. TEMPORARY        53658
3. MED ONLY          9816
1. CANCELLED         6746
5. PPD SCH LOSS      2332
8. DEATH               35
Name: count, dtype: int64
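The counts above list only six of the eight labels defined in `label_mapping`. A quick way to see which classes the model never predicted is sketched below; the helper `missing_classes` is hypothetical, and the counts are copied from the output above.

```python
# Label mapping as defined in the notebook
label_mapping = {
    0: "1. CANCELLED",
    1: "2. NON-COMP",
    2: "3. MED ONLY",
    3: "4. TEMPORARY",
    4: "5. PPD SCH LOSS",
    5: "6. PPD NSL",
    6: "7. PTD",
    7: "8. DEATH",
}

# Prediction counts copied from the value_counts() output above
observed_counts = {
    "2. NON-COMP": 315388,
    "4. TEMPORARY": 53658,
    "3. MED ONLY": 9816,
    "1. CANCELLED": 6746,
    "5. PPD SCH LOSS": 2332,
    "8. DEATH": 35,
}


def missing_classes(predicted_labels, mapping):
    """Return the labels in `mapping` that never appear among the predictions."""
    predicted = set(predicted_labels)
    return [label for label in mapping.values() if label not in predicted]


missing = missing_classes(observed_counts.keys(), label_mapping)
print(missing)  # → ['6. PPD NSL', '7. PTD']
```

In the notebook itself, the same check could be run on `test_filtered['Claim Injury Type'].unique()` instead of the copied counts.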

In [62]:
# Select the predicted 'Claim Injury Type' labels (indexed by 'Claim Identifier') for submission
predictions = test_filtered['Claim Injury Type']

In [63]:
# Assign a descriptive name for easy reference
name = 'all_feat_scaled_XGB_5f_2'

# Save the predictions to a CSV file.
predictions.to_csv(f'./pred/{name}.csv')