# <center>Machine Learning Project Code</center>

<a class="anchor" id="top"></a>

## <center>*03 - K-Fold*</center>

** **



# Table of Contents  <br>


1. [Importing Libraries & Data](#1.-Importing-Libraries-&-Data) <br><br>
    
2. [Stratified K-Fold](#2.-Stratified-K-Fold) <br><br>

** **

This notebook implements Stratified K-Fold cross-validation. Within each fold, it applies the same missing-value imputation and outlier treatment as Notebook 02. Feature selection is performed only in Notebook 02, and the features selected there are reused here, because repeating the selection inside every fold was not feasible given the computational cost and our time constraints.

Data Scientist Manager: António Oliveira, **20211595**

Data Scientist Senior: Tomás Ribeiro, **20240526**

Data Scientist Junior: Gonçalo Pacheco, **20240695**

Data Analyst Senior: Gonçalo Custódio, **20211643**

Data Analyst Junior: Ana Caleiro, **20240696**


** ** 

# 1. Importing Libraries & Data
In this section, we set up the foundation for the notebook by importing the necessary Python libraries and loading the dataset. These libraries provide the tools for data manipulation and machine learning modeling used throughout the notebook. We also load the historical claims dataset, which forms the core of our analysis.

In [2]:
import pandas as pd
import numpy as np

# Cross-validation splitter
from sklearn.model_selection import StratifiedKFold

import preproc as p 

# Models
from sklearn.ensemble import RandomForestClassifier

# Metrics
from sklearn.metrics import classification_report
import metrics as m

pd.set_option('display.max_columns', None)
import time

# Suppress Warnings
import warnings
warnings.filterwarnings("ignore")

**Import Data**

In [31]:
# Load training data
df2 = pd.read_csv('./data/train_data_EDA.csv', index_col = 'Claim Identifier')

# Load testing data
test = pd.read_csv('./data/test_data_EDA.csv', index_col = 'Claim Identifier')

# Display the first 3 rows of the training data
# (the data was loaded into df2; df is only created in the next cell)
df2.head(3)

Claim Identifier,Age at Injury,Average Weekly Wage,Birth Year,Claim Injury Type,IME-4 Count,Industry Code,WCIO Cause of Injury Code,WCIO Nature of Injury Code,WCIO Part Of Body Code,Number of Dependents,Alternative Dispute Resolution Bin,Attorney/Representative Bin,Carrier Name Enc,Carrier Type freq,Carrier Type_1A. PRIVATE,Carrier Type_2A. SIF,Carrier Type_3A. SELF PUBLIC,Carrier Type_4A. SELF PRIVATE,County of Injury freq,COVID-19 Indicator Enc,District Name freq,Gender Enc,Gender_F,Gender_M,Medical Fee Region freq,Accident Date Year,Accident Date Month,Accident Date Day,Accident Date Day of Week,Assembly Date Year,Assembly Date Month,Assembly Date Day,Assembly Date Day of Week,C-2 Date Year,C-2 Date Month,C-2 Date Day,C-2 Date Day of Week,WCIO Codes,Insurance,Zip Code Valid,Industry Sector Count Enc,Age Group,C-3 Date Binary,First Hearing Date Binary
5393875,31.0,0.0,1988.0,1,0.0,44.0,27,10,62,1.0,0,0,399,285367,1,0,0,0,3355,0,44646,0,0,1,135885,2019.0,12.0,30.0,0.0,2020,1,1,2,2019.0,12.0,31.0,1.0,271062,1,0,103330,1,0,0
5393091,46.0,1745.93,1973.0,3,4.0,23.0,97,49,38,4.0,0,1,1023,285367,1,0,0,0,760,0,40449,1,1,0,135885,2019.0,8.0,30.0,4.0,2020,1,1,2,2020.0,1.0,1.0,2.0,974938,1,0,69053,1,1,1
5393889,40.0,1434.8,1979.0,3,0.0,56.0,79,7,10,6.0,0,0,689,285367,1,0,0,0,17450,0,86171,0,0,1,85033,2019.0,12.0,6.0,4.0,2020,1,1,2,2020.0,1.0,1.0,2.0,79710,1,0,57495,1,0,0


In [32]:
df = df2.copy()

# 2. Stratified K-Fold

<a href="#top">Top &#129033;</a>
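
Stratified K-Fold keeps the class proportions of the target roughly equal across folds, which matters for a target as imbalanced as `Claim Injury Type`. The short sketch below illustrates this on made-up labels; `X_demo`, `y_demo` and `skf_demo` are hypothetical names used only for this illustration and are not part of the project data.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced labels: 90 samples of class 0, 10 samples of class 1
y_demo = np.array([0] * 90 + [1] * 10)
X_demo = np.zeros((len(y_demo), 1))  # feature values do not affect the split

skf_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

val_minority_counts = []
for train_idx, val_idx in skf_demo.split(X_demo, y_demo):
    # Count how many minority-class samples land in this validation fold
    val_minority_counts.append(int((y_demo[val_idx] == 1).sum()))

print(val_minority_counts)  # every validation fold holds 2 of the 10 minority samples
```

With a plain, non-stratified split, some folds could end up with very few or no samples of a rare class, which would make per-class validation metrics unstable.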

In [33]:
from sklearn.linear_model import LogisticRegression, SGDClassifier  # Linear models
from sklearn.tree import DecisionTreeClassifier  
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier, StackingClassifier, VotingClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis  # Discriminant analysis models
from xgboost import XGBClassifier   
from sklearn.neural_network import MLPClassifier

In [34]:
# Split the DataFrame into features (X) and target variable (y)
X = df.drop('Claim Injury Type', axis=1) 
y = df['Claim Injury Type']  

In [35]:
X.drop(['Accident Date Day of Week', 'Assembly Date Day of Week',
        'Zip Code Valid'], axis = 1, inplace = True)

In [40]:
# Initialize StratifiedKFold
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

# Initialize model
model = XGBClassifier()

# Track scores
scores = []

# Perform stratified k-fold cross-validation
for train_index, val_index in skf.split(X, y):

    start_time = time.time()
    
    X_train, X_val = X.iloc[train_index].copy(), X.iloc[val_index].copy()
    y_train, y_val = y.iloc[train_index], y.iloc[val_index]

    # MISSING VALUES
    p.fill_dates(X_train, [X_val], 'Accident Date')
    p.fill_dates(X_train, [X_val], 'C-2 Date')
    
    #p.fill_dow([X_train, X_val], 'Accident Date')
    p.fill_dow([X_train, X_val], 'C-2 Date')
    
    p.fill_birth_year([X_train, X_val])
    

    p.ball_tree_impute([X_train, X_val], 'Average Weekly Wage')

    # OUTLIERS
    X_train['IME-4 Count Log'] = np.log1p(X_train['IME-4 Count'])
    X_val['IME-4 Count Log'] = np.log1p(X_val['IME-4 Count'])
    
    X_train['IME-4 Count Double Log'] = np.log1p(X_train['IME-4 Count Log'])
    X_val['IME-4 Count Double Log'] = np.log1p(X_val['IME-4 Count Log'])
    
    X_train['Average Weekly Wage Sqrt'] = np.sqrt(X_train['Average Weekly Wage'])
    X_val['Average Weekly Wage Sqrt'] = np.sqrt(X_val['Average Weekly Wage'])

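    # Clip the training wages to their own 1st-99th percentile range
    wage_lower = X_train['Average Weekly Wage'].quantile(0.01)
    wage_upper = X_train['Average Weekly Wage'].quantile(0.99)
    X_train['Average Weekly Wage'] = X_train['Average Weekly Wage'].clip(lower=wage_lower, upper=wage_upper)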
    
    # Train the model
    model.fit(X_train, y_train)
    
    # Make predictions
    pred_train = model.predict(X_train)
    pred_val = model.predict(X_val)
    
    # Print the classification reports for the training and validation folds
    m.metrics(y_train, pred_train, y_val, pred_val)

    # Time
    end_time = time.time()
    elapsed_time = round((end_time - start_time) / 60, 2)
    print(f'This Fold took {elapsed_time} minutes')

______________________________________________________________________
                                TRAIN                                 
----------------------------------------------------------------------
              precision    recall  f1-score   support

           0       0.79      0.56      0.65     11229
           1       0.86      0.96      0.91    261970
           2       0.62      0.10      0.17     62015
           3       0.72      0.89      0.80    133656
           4       0.72      0.65      0.68     43452
           5       0.92      0.05      0.10      3790
           6       1.00      0.97      0.98        87
           7       0.97      0.95      0.96       423

    accuracy                           0.80    516622
   macro avg       0.83      0.64      0.66    516622
weighted avg       0.78      0.80      0.76    516622

______________________________________________________________________
                                VALIDATION                       

______________________________________________________________________
                                TRAIN                                 
----------------------------------------------------------------------
              precision    recall  f1-score   support

           0       0.79      0.56      0.65     11228
           1       0.86      0.96      0.91    261970
           2       0.63      0.10      0.17     62015
           3       0.72      0.90      0.80    133657
           4       0.73      0.65      0.68     43452
           5       0.92      0.06      0.11      3790
           6       1.00      0.92      0.96        88
           7       0.97      0.94      0.96       423

    accuracy                           0.80    516623
   macro avg       0.83      0.64      0.65    516623
weighted avg       0.78      0.80      0.76    516623

______________________________________________________________________
                                VALIDATION                       
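
The loop above prints one report per fold, while the `scores` list it initializes is never filled. A possible way to summarize the folds with a single number is sketched below. It is only an illustration under several assumptions: the data is synthetic, `DecisionTreeClassifier` is swapped in for `XGBClassifier` to keep the example light, macro F1 is our choice of summary metric, and the column `f0` stands in for a skewed feature such as `Average Weekly Wage`. The sketch also applies the training-fold clipping bounds to the validation fold, which is one option for keeping both folds on the same scale without using validation data to compute the bounds.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: this is NOT the claims dataset)
X_arr, y_arr = make_classification(n_samples=600, n_features=5, n_informative=3,
                                   weights=[0.8, 0.2], random_state=0)
X_syn = pd.DataFrame(X_arr, columns=[f'f{i}' for i in range(5)])
y_syn = pd.Series(y_arr)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = []

for train_idx, val_idx in cv.split(X_syn, y_syn):
    X_tr, X_va = X_syn.iloc[train_idx].copy(), X_syn.iloc[val_idx].copy()
    y_tr, y_va = y_syn.iloc[train_idx], y_syn.iloc[val_idx]

    # Compute clipping bounds on the training fold only, then apply them to both folds
    lower, upper = X_tr['f0'].quantile(0.01), X_tr['f0'].quantile(0.99)
    X_tr['f0'] = X_tr['f0'].clip(lower=lower, upper=upper)
    X_va['f0'] = X_va['f0'].clip(lower=lower, upper=upper)

    # A fresh model per fold, so no state carries over between folds
    clf = DecisionTreeClassifier(max_depth=3, random_state=0)
    clf.fit(X_tr, y_tr)

    # Store the validation macro F1 of this fold
    fold_scores.append(f1_score(y_va, clf.predict(X_va), average='macro'))

mean_f1, std_f1 = np.mean(fold_scores), np.std(fold_scores)
print(f'Macro F1 over {len(fold_scores)} folds: {mean_f1:.3f} +/- {std_f1:.3f}')
```

Reporting the mean together with the standard deviation gives a sense of how much the validation score varies from fold to fold.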

In [41]:
test = pd.read_csv('./data/test_treated.csv', 
                   index_col = 'Claim Identifier')

## 2.1 Final Predictions

<a href="#top">Top &#129033;</a>

In [42]:
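# Keep only the columns the model was trained on, in the same order.
# .copy() makes this an independent DataFrame, so adding a column later is safe.
test_filtered = test[X_train.columns].copy()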

In [43]:
test_filtered['Claim Injury Type'] = model.predict(test_filtered)

**Map Predictions to Original Values**

In [44]:
label_mapping = {
    0: "1. CANCELLED",
    1: "2. NON-COMP",
    2: "3. MED ONLY",
    3: "4. TEMPORARY",
    4: "5. PPD SCH LOSS",
    5: "6. PPD NSL",
    6: "7. PTD",
    7: "8. DEATH"
}


test_filtered['Claim Injury Type'] = test_filtered['Claim Injury Type'].replace(label_mapping)

In [45]:
# Count how many test claims were assigned to each predicted class
test_filtered['Claim Injury Type'].value_counts() 

Claim Injury Type
2. NON-COMP        265893
3. MED ONLY         74036
4. TEMPORARY        38361
1. CANCELLED         5431
5. PPD SCH LOSS      4217
8. DEATH               37
Name: count, dtype: int64
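
In the counts above, "6. PPD NSL" and "7. PTD" do not appear, meaning the model did not predict them for any test claim. `value_counts()` silently omits such classes, so a small check can make them explicit. The sketch below uses a handful of made-up predictions (`preds_demo` is hypothetical) together with the label names from `label_mapping`.

```python
import pandas as pd

# Label names taken from label_mapping, in class order
label_names = ["1. CANCELLED", "2. NON-COMP", "3. MED ONLY", "4. TEMPORARY",
               "5. PPD SCH LOSS", "6. PPD NSL", "7. PTD", "8. DEATH"]

# Hypothetical predictions, standing in for test_filtered['Claim Injury Type']
preds_demo = pd.Series(["2. NON-COMP", "2. NON-COMP", "3. MED ONLY", "8. DEATH"],
                       name='Claim Injury Type')

# Reindex the counts on the full label list so that missing classes show up as 0
counts = preds_demo.value_counts().reindex(label_names, fill_value=0)

# Classes the model never predicted
never_predicted = counts[counts == 0].index.tolist()
print(never_predicted)
```

Running the same check on the real predictions would list the classes the model never outputs, which is worth knowing before a submission.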

In [46]:
# Extract the predicted 'Claim Injury Type' labels for the test dataset
predictions = test_filtered['Claim Injury Type']

In [47]:
# Assign a descriptive name for easy reference
name = 'all_feat_xgb_10f_drop_chi_sq'

# Save the predictions to a CSV file.
predictions.to_csv(f'./pred/{name}.csv')