# <center>Machine Learning Project Code</center>

<a class="anchor" id="top"></a>

## <center>*03 - K-Fold*</center>

** **



# Table of Contents  <br>


1. [Importing Libraries & Data](#1.-Importing-Libraries-&-Data) <br><br>
    
2. [Cross Validation](#2.-Cross-Validation) <br><br>

3. [Final Predictions](#3.-Final-Predictions) <br><br>

** **

This notebook will consist of the implementation of Stratified K-Fold. It will use the same techniques to fill missing values and treat outliers as Notebook 02. Feature Selection will only be performed in said notebook, and the selected features there will be used here, due to computational complexity and time constraints.

Data Scientist Manager: António Oliveira, **20211595**

Data Scientist Senior: Tomás Ribeiro, **20240526**

Data Scientist Junior: Gonçalo Pacheco, **20240695**

Data Analyst Senior: Gonçalo Custódio, **20211643**

Data Analyst Junior: Ana Caleiro, **20240696**


** ** 

# 1. Importing Libraries & Data
In this section, we set up the foundation for our project by importing the necessary Python libraries and loading the dataset. These libraries provide the tools for data manipulation, visualization, and machine learning modeling throughout the notebook. Additionally, we import the historical claims dataset, which forms the core of our analysis. 

In [1]:
import pandas as pd
import numpy as np

# Train-Test Split
from sklearn.model_selection import StratifiedKFold

# Models
import models as mod

# Metrics
from sklearn.metrics import classification_report
import metrics as m
from sklearn.metrics import f1_score, precision_score, recall_score

pd.set_option('display.max_columns', None)

# Suppress Warnings
import warnings
warnings.filterwarnings("ignore")

**Import Data**

In [2]:
# Load training data
df = pd.read_csv('./data/train_data_EDA.csv', index_col = 'Claim Identifier')

# Load testing data
test1 = pd.read_csv('./data/test_data_EDA.csv', index_col = 'Claim Identifier')

# Display the first 3 rows of the training data
df.head(3)

Unnamed: 0_level_0,Age at Injury,Alternative Dispute Resolution,Attorney/Representative,Average Weekly Wage,Birth Year,C-3 Date,Carrier Name,Carrier Type,Claim Injury Type,County of Injury,COVID-19 Indicator,District Name,First Hearing Date,Gender,IME-4 Count,Industry Code,Medical Fee Region,WCIO Cause of Injury Code,WCIO Nature of Injury Code,WCIO Part Of Body Code,Number of Dependents,Gender Enc,Accident Date Year,Accident Date Month,Accident Date Day,Accident Date Day of Week,Assembly Date Year,Assembly Date Month,Assembly Date Day,Assembly Date Day of Week,C-2 Date Year,C-2 Date Month,C-2 Date Day,C-2 Date Day of Week,Accident to Assembly Time,Assembly to C-2 Time,Accident to C-2 Time,WCIO Codes,Insurance,Zip Code Valid,Industry Sector,Age Group
Claim Identifier,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1
5393875,31.0,N,N,0.0,1988.0,,NEW HAMPSHIRE INSURANCE CO,1A. PRIVATE,1,ST. LAWRENCE,N,SYRACUSE,,M,,44.0,I,27,10,62,1.0,0,2019.0,12.0,30.0,0.0,2020,1,1,2,2019.0,12.0,31.0,1.0,2.0,1.0,1.0,271062,1,0,Retail and Wholesale,1
5393091,46.0,N,Y,1745.93,1973.0,2020-01-14,ZURICH AMERICAN INSURANCE CO,1A. PRIVATE,3,WYOMING,N,ROCHESTER,2020-02-21,F,4.0,23.0,I,97,49,38,4.0,1,2019.0,8.0,30.0,4.0,2020,1,1,2,2020.0,1.0,1.0,2.0,124.0,0.0,124.0,974938,1,0,Manufacturing and Construction,1
5393889,40.0,N,N,1434.8,1979.0,,INDEMNITY INSURANCE CO OF,1A. PRIVATE,3,ORANGE,N,ALBANY,,M,,56.0,II,79,7,10,6.0,0,2019.0,12.0,6.0,4.0,2020,1,1,2,2020.0,1.0,1.0,2.0,26.0,0.0,26.0,79710,1,0,Business Services,1


# 2. Cross Validation

<a href="#top">Top &#129033;</a>

In [3]:
# Split the DataFrame into features (X) and target variable (y)
X = df.drop('Claim Injury Type', axis=1) 
y = df['Claim Injury Type']  

**Stratified K-Fold**

In [4]:
method = StratifiedKFold(n_splits=15, shuffle=True, random_state=42)

In [5]:
XGB_freq_no_out_15f = mod.k_fold(method, X, y, test1, 'XGB', params = {},
                              enc = 'freq', outliers = True,
                              file_name = 'XGB_freq_no_out_15f')

XGB_freq_out_15f = mod.k_fold(method, X, y, test1, 'XGB', params = {},
                              enc = 'freq', outliers = False,
                              file_name = 'XGB_freq_out_15f')

XGB_count_no_out_15f = mod.k_fold(method, X, y, test1, 'XGB', params = {},
                              enc = 'count', outliers = True,
                              file_name = 'XGB_count_no_out_15f')

XGB_count_out_15f = mod.k_fold(method, X, y, test1, 'XGB', params = {},
                              enc = 'count', outliers = False,
                              file_name = 'XGB_count_out_15f')

This Fold took 1.39 minutes
This Fold took 1.34 minutes
This Fold took 1.34 minutes
This Fold took 1.44 minutes
This Fold took 1.51 minutes
This Fold took 1.52 minutes
This Fold took 1.59 minutes
This Fold took 1.59 minutes
This Fold took 1.38 minutes
This Fold took 1.37 minutes
This Fold took 1.43 minutes
This Fold took 1.36 minutes
This Fold took 1.35 minutes
This Fold took 1.38 minutes
This Fold took 1.39 minutes
This Fold took 1.46 minutes
This Fold took 1.38 minutes
This Fold took 1.35 minutes
This Fold took 1.42 minutes
This Fold took 1.58 minutes
This Fold took 1.4 minutes
This Fold took 1.42 minutes
This Fold took 1.41 minutes
This Fold took 1.39 minutes
This Fold took 1.41 minutes
This Fold took 1.42 minutes
This Fold took 1.41 minutes
This Fold took 1.41 minutes
This Fold took 1.4 minutes
This Fold took 1.41 minutes
This Fold took 1.41 minutes
This Fold took 1.35 minutes
This Fold took 1.37 minutes
This Fold took 1.36 minutes
This Fold took 4.14 minutes
This Fold took 19.46 m

In [7]:
RF_freq_no_out_15f = mod.k_fold(method, X, y, test1, 'RF', params = {},
                              enc = 'freq', outliers = True,
                              file_name = 'RF_freq_no_out_15f')

RF_freq_out_15f = mod.k_fold(method, X, y, test1, 'RF', params = {},
                              enc = 'freq', outliers = False,
                              file_name = 'RF_freq_out_15f')

RF_count_no_out_15f = mod.k_fold(method, X, y, test1, 'RF', params = {},
                              enc = 'count', outliers = True,
                              file_name = 'RF_count_no_out_15f')

RF_count_out_15f = mod.k_fold(method, X, y, test1, 'RF', params = {},
                              enc = 'count', outliers = False,
                              file_name = 'RF_count_out_15f')

This Fold took 3.38 minutes
This Fold took 3.45 minutes
This Fold took 3.43 minutes
This Fold took 3.39 minutes
This Fold took 3.36 minutes
This Fold took 3.19 minutes
This Fold took 3.2 minutes
This Fold took 3.18 minutes
This Fold took 3.22 minutes
This Fold took 3.27 minutes
This Fold took 3.14 minutes
This Fold took 3.17 minutes
This Fold took 3.2 minutes
This Fold took 3.19 minutes
This Fold took 3.2 minutes
This Fold took 2.97 minutes
This Fold took 3.0 minutes
This Fold took 3.01 minutes
This Fold took 3.06 minutes
This Fold took 3.05 minutes


In [15]:
LGBM_freq_no_out_15f = mod.k_fold(method, X, y, test1, 'LGBM', params = {},
                              enc = 'freq', outliers = True,
                              file_name = 'LGBM_freq_no_out_15f')

LGBM_freq_out_15f = mod.k_fold(method, X, y, test1, 'LGBM', params = {},
                              enc = 'freq', outliers = False,
                              file_name = 'LGBM_freq_out_15f')

LGBM_count_no_out_15f = mod.k_fold(method, X, y, test1, 'LGBM', params = {},
                              enc = 'count', outliers = True,
                              file_name = 'LGBM_count_no_out_15f')

LGBM_count_out_15f = mod.k_fold(method, X, y, test1, 'LGBM', params = {},
                              enc = 'count', outliers = False,
                              file_name = 'LGBM_count_out_15f')

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.030632 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2763
[LightGBM] [Info] Number of data points in the train set: 454727, number of used features: 50
[LightGBM] [Info] Start training from score -3.874529
[LightGBM] [Info] Start training from score -0.676691
[LightGBM] [Info] Start training from score -2.114330
[LightGBM] [Info] Start training from score -1.357652
[LightGBM] [Info] Start training from score -2.469974
[LightGBM] [Info] Start training from score -4.908650
[LightGBM] [Info] Start training from score -8.683647
[LightGBM] [Info] Start training from score -7.103197
This Fold took 1.52 minutes
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.036254 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not eno

[LightGBM] [Info] Start training from score -3.828846
[LightGBM] [Info] Start training from score -0.679079
[LightGBM] [Info] Start training from score -2.119926
[LightGBM] [Info] Start training from score -1.352046
[LightGBM] [Info] Start training from score -2.475656
[LightGBM] [Info] Start training from score -4.914913
[LightGBM] [Info] Start training from score -8.693479
[LightGBM] [Info] Start training from score -7.107696
This Fold took 1.47 minutes
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.032395 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2763
[LightGBM] [Info] Number of data points in the train set: 454727, number of used features: 50
[LightGBM] [Info] Start training from score -3.874529
[LightGBM] [Info] Start training from score -0.676691
[LightGBM] [Info] Start training from score -2.114330
[LightGBM] [Info] Start 

This Fold took 1.21 minutes
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.027036 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2528
[LightGBM] [Info] Number of data points in the train set: 459220, number of used features: 47
[LightGBM] [Info] Start training from score -3.828846
[LightGBM] [Info] Start training from score -0.679079
[LightGBM] [Info] Start training from score -2.119926
[LightGBM] [Info] Start training from score -1.352046
[LightGBM] [Info] Start training from score -2.475656
[LightGBM] [Info] Start training from score -4.914913
[LightGBM] [Info] Start training from score -8.693479
[LightGBM] [Info] Start training from score -7.107696
This Fold took 1.21 minutes


In [6]:
import play_song as s
s.play_('audio.mp3')

Input #0, wav, from '/var/folders/mm/fxsq_1490x9dd2w76tqvt3kr0000gn/T/tmp5_5igg01.wav':
  Duration: 00:00:10.00, bitrate: 1536 kb/s
  Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 48000 Hz, 2 channels, s16, 1536 kb/s
SDL_OpenAudio (2 channels, 48000 Hz): CoreAudio error (AudioQueueStart): 35
SDL_OpenAudio (1 channels, 48000 Hz): CoreAudio error (AudioQueueStart): 35
SDL_OpenAudio (2 channels, 44100 Hz): CoreAudio error (AudioQueueStart): 35





SDL_OpenAudio (1 channels, 44100 Hz): CoreAudio error (AudioQueueStart): 35
No more combinations to try, audio open failed
Failed to open file '/var/folders/mm/fxsq_1490x9dd2w76tqvt3kr0000gn/T/tmp5_5igg01.wav' or configure filtergraph
    nan    :  0.000 fd=   0 aq=    0KB vq=    0KB sq=    0B 

In [20]:
models = [XGB_freq_no_out_10f, XGB_freq_out_10f, 
          XGB_count_no_out_10f, XGB_count_out_10f,
          RF_freq_no_out, RF_freq_out,
          RF_count_no_out, RF_count_out,
          LGBM_freq_no_out, LGBM_freq_out,
          LGBM_count_no_out, LGBM_count_out]

model_names = ['XGB_freq_no_out_10f', 'XGB_freq_out_10f', 
               'XGB_count_no_out_10f', 'XGB_count_out_10f',
               'RF_freq_no_out', 'RF_freq_out',
               'RF_count_no_out', 'RF_count_out',
               'LGBM_freq_no_out', 'LGBM_freq_out',
               'LGBM_count_no_out', 'LGBM_count_out']

m.metrics2(models, model_names)

Unnamed: 0,XGB_freq_no_out,XGB_freq_out,XGB_count_no_out,XGB_count_out,XGB_freq_no_out_10f,XGB_freq_out_10f,XGB_count_no_out_10f,XGB_count_out_10f,RF_freq_no_out,RF_freq_out,RF_count_no_out,RF_count_out,LGBM_freq_no_out,LGBM_freq_out,LGBM_count_no_out,LGBM_count_out
Train F1 macro,0.67+/-0.002,0.67+/-0.001,0.669+/-0.001,0.669+/-0.001,0.67+/-0.002,0.67+/-0.001,0.669+/-0.001,0.669+/-0.001,1.0+/-0.0,1.0+/-0.0,1.0+/-0.0,1.0+/-0.0,0.422+/-0.015,0.437+/-0.015,0.409+/-0.009,0.419+/-0.008
Validation F1 macro,0.449+/-0.006,0.452+/-0.004,0.45+/-0.005,0.452+/-0.003,0.449+/-0.006,0.452+/-0.004,0.45+/-0.005,0.452+/-0.003,0.39+/-0.004,0.393+/-0.006,0.39+/-0.004,0.39+/-0.003,0.391+/-0.007,0.398+/-0.005,0.385+/-0.006,0.392+/-0.004
Precision Train,0.837+/-0.002,0.835+/-0.001,0.838+/-0.003,0.835+/-0.003,0.837+/-0.002,0.835+/-0.001,0.838+/-0.003,0.835+/-0.003,1.0+/-0.0,1.0+/-0.0,1.0+/-0.0,1.0+/-0.0,0.496+/-0.032,0.514+/-0.016,0.485+/-0.025,0.491+/-0.021
Precision Validation,0.569+/-0.011,0.571+/-0.003,0.572+/-0.011,0.575+/-0.007,0.569+/-0.011,0.571+/-0.003,0.572+/-0.011,0.575+/-0.007,0.52+/-0.027,0.531+/-0.018,0.524+/-0.027,0.53+/-0.027,0.433+/-0.016,0.444+/-0.01,0.431+/-0.012,0.44+/-0.008
Recall Train,0.654+/-0.001,0.654+/-0.001,0.653+/-0.001,0.653+/-0.002,0.654+/-0.001,0.654+/-0.001,0.653+/-0.001,0.653+/-0.002,1.0+/-0.0,1.0+/-0.0,1.0+/-0.0,1.0+/-0.0,0.421+/-0.017,0.444+/-0.012,0.422+/-0.017,0.435+/-0.021
Recall Validation,0.433+/-0.008,0.433+/-0.005,0.433+/-0.007,0.434+/-0.005,0.433+/-0.008,0.433+/-0.005,0.433+/-0.007,0.434+/-0.005,0.377+/-0.002,0.378+/-0.003,0.377+/-0.002,0.376+/-0.002,0.398+/-0.012,0.409+/-0.012,0.395+/-0.012,0.402+/-0.01
Time,1.22+/-0.028,1.222+/-0.031,1.048+/-0.033,1.232+/-0.123,1.578+/-0.054,1.624+/-0.048,1.388+/-0.038,1.27+/-0.083,3.402+/-0.033,3.212+/-0.032,3.18+/-0.023,3.018+/-0.033,1.486+/-0.038,1.46+/-0.018,1.23+/-0.015,1.21+/-0.0


**WORK IN PROGRESS**

In [10]:
# # 
# import time
# import pandas as pd
# import numpy as np

# # Preprocessing
# import utils2 as p

# # Scalers
# from sklearn.preprocessing import (
#     StandardScaler,
#     MinMaxScaler,
#     RobustScaler)

# # Models
# from sklearn.linear_model import LogisticRegression, SGDClassifier
# from sklearn.tree import DecisionTreeClassifier
# from sklearn.ensemble import RandomForestClassifier, \
#     GradientBoostingClassifier, AdaBoostClassifier
# from xgboost import XGBClassifier 
# from lightgbm import LGBMClassifier
# from sklearn.neural_network import MLPClassifier
# from sklearn.naive_bayes import GaussianNB, CategoricalNB
# from sklearn.neighbors import KNeighborsClassifier
# from sklearn.svm import SVC

# # Metrics
# from sklearn.metrics import f1_score, precision_score, recall_score, confusion_matrix


# def run_model(model_name, X, y, params = None):

#     if params is None:
#         params = {}

#     if model_name == 'LR':
#         model = LogisticRegression(**params).fit(X, y)
#     elif model_name == 'SGD':
#         model = SGDClassifier(**params).fit(X, y)
#     elif model_name == 'DT':
#         model = DecisionTreeClassifier(**params).fit(X, y)
#     elif model_name == 'RF':
#         model = RandomForestClassifier(**params).fit(X, y)
#     elif model_name == 'AdaBoost':
#         model = AdaBoostClassifier(**params).fit(X, y)
#     elif model_name == 'GBoost':
#         model = GradientBoostingClassifier(**params).fit(X, y)
#     elif model_name == 'XGB':
#         model = XGBClassifier(**params).fit(X, y)
#     elif model_name == 'MLP':
#         model = MLPClassifier(**params).fit(X, y)
#     elif model_name == 'GNB':  
#         model = GaussianNB().fit(X, y)
#     elif model_name == 'CNB':  
#         model = CategoricalNB().fit(X, y)
#     elif model_name == 'KNN':  
#         model = KNeighborsClassifier(**params).fit(X, y)
#     elif model_name == 'LGBM':  
#         model = LGBMClassifier(**params).fit(X, y)
#     elif model_name == 'SVM':  
#         model = SVC(**params).fit(X, y)
    
#     return model



# def custom_predict(probabilities, threshold):
#     rare_classes = [5, 6, 7]  
#     for rare_class in rare_classes:
        
#         if probabilities[rare_class] > threshold:
#             return rare_class
        
#     return np.argmax(probabilities)



# def k_fold(method, X, y, test1, model_name, 
#            params, enc, outliers = False,
#            file_name = None):

#     # Save metrics
#     f1macro_train = []
#     f1macro_val = []
#     precision_train = []
#     precision_val = []
#     recall_train = []
#     recall_val = []
#     timer = []
    
#     # Mapping
#     label_mapping = {
#         0: "1. CANCELLED",
#         1: "2. NON-COMP",
#         2: "3. MED ONLY",
#         3: "4. TEMPORARY",
#         4: "5. PPD SCH LOSS",
#         5: "6. PPD NSL",
#         6: "7. PTD",
#         7: "8. DEATH"}
    
#     test_preds = np.zeros((len(test1), len(label_mapping)))


#     # For each fold
#     for train_index, val_index in method.split(X, y):
#         X_train, X_val = X.iloc[train_index], X.iloc[val_index]
#         y_train, y_val = y.iloc[train_index], y.iloc[val_index]
#         test = test1.copy()

#         start_time = time.time()
        
#         # ENCODING
#         X_train['Alternative Dispute Resolution Bin'] = X_train['Alternative Dispute Resolution'].replace({'N': 0, 'Y': 1, 'U': 1})
#         X_val['Alternative Dispute Resolution Bin'] = X_val['Alternative Dispute Resolution'].replace({'N': 0, 'Y': 1, 'U': 1})
#         test['Alternative Dispute Resolution Bin'] = test['Alternative Dispute Resolution'].replace({'N': 0, 'Y': 1, 'U': 1})

#         X_train['Attorney/Representative Bin'] = X_train['Attorney/Representative'].replace({'N': 0, 'Y': 1})
#         X_val['Attorney/Representative Bin'] = X_val['Attorney/Representative'].replace({'N': 0, 'Y': 1})
#         test['Attorney/Representative Bin'] = test['Attorney/Representative'].replace({'N': 0, 'Y': 1})

#         train_carriers = set(X_train['Carrier Name'].unique())
#         test_carriers = set(test['Carrier Name'].unique())
#         common_categories = train_carriers.intersection(test_carriers)
#         common_category_map = {category: idx + 1 for idx, 
#                            category in enumerate(common_categories)}

#         X_train['Carrier Name Enc'] = X_train['Carrier Name'].map(common_category_map).fillna(0).astype(int)
#         X_val['Carrier Name Enc'] = X_val['Carrier Name'].map(common_category_map).fillna(0).astype(int)
#         test['Carrier Name Enc'] = test['Carrier Name'].map(common_category_map).fillna(0).astype(int)

#         X_train, X_val, test = p.encode(X_train, X_val, test, 'Carrier Name Enc', enc)

#         X_train, X_val, test = p.encode(X_train, X_val, test, 'Carrier Type', enc)
#         X_train, X_val, test = p.encode(X_train, X_val, test, 'Carrier Type', 'OHE')

#         X_train, X_val, test = p.encode(X_train, X_val, test, 'County of Injury', enc)

#         X_train['COVID-19 Indicator Enc'] = X_train['COVID-19 Indicator'].replace({'N': 0, 'Y': 1})
#         X_val['COVID-19 Indicator Enc'] = X_val['COVID-19 Indicator'].replace({'N': 0, 'Y': 1})
#         test['COVID-19 Indicator Enc'] = test['COVID-19 Indicator'].replace({'N': 0, 'Y': 1})

#         X_train, X_val, test = p.encode(X_train, X_val, test, 'District Name', enc)

#         X_train, X_val, test = p.encode(X_train, X_val, test, 'Gender', 'OHE')

#         X_train, X_val, test = p.encode(X_train, X_val, test, 'Medical Fee Region', enc)

#         X_train, X_val, test = p.encode(X_train, X_val, test, 'Industry Sector', enc)

#         drop = ['Alternative Dispute Resolution', 'Attorney/Representative', 'Carrier Type', 'County of Injury',
#                 'COVID-19 Indicator', 'District Name', 'Gender', 'Carrier Name',
#                 'Medical Fee Region', 'Industry Sector']

#         X_train.drop(columns = drop, axis = 1, inplace = True)
#         X_val.drop(columns = drop, axis = 1, inplace = True)
#         test.drop(columns = drop, axis = 1, inplace = True)

#         # MISSING VALUES
#         X_train['C-3 Date Binary'] = X_train['C-3 Date'].notna().astype(int)
#         X_val['C-3 Date Binary'] = X_val['C-3 Date'].notna().astype(int)
#         test['C-3 Date Binary'] = test['C-3 Date'].notna().astype(int)

#         X_train['First Hearing Date Binary'] = X_train['First Hearing Date'].notna().astype(int)
#         X_val['First Hearing Date Binary'] = X_val['First Hearing Date'].notna().astype(int)
#         test['First Hearing Date Binary'] = test['First Hearing Date'].notna().astype(int)

#         drop = ['C-3 Date', 'First Hearing Date']
#         X_train.drop(columns = drop, axis = 1, inplace = True)
#         X_val.drop(columns = drop, axis = 1, inplace = True)
#         test.drop(columns = drop, axis = 1, inplace = True)

#         X_train['IME-4 Count'] = X_train['IME-4 Count'].fillna(0)
#         X_val['IME-4 Count'] = X_val['IME-4 Count'].fillna(0)
#         test['IME-4 Count'] = test['IME-4 Count'].fillna(0)

#         X_train['Industry Code'] = X_train['Industry Code'].fillna(0)
#         X_val['Industry Code'] = X_val['Industry Code'].fillna(0)
#         test['Industry Code'] = test['Industry Code'].fillna(0)

#         p.fill_dates(X_train, [X_val, test], 'Accident Date')
#         p.fill_dates(X_train, [X_val, test], 'C-2 Date')

#         p.fill_dow([X_train, X_val, test], 'Accident Date')
#         p.fill_dow([X_train, X_val, test], 'C-2 Date')

#         X_train = p.fill_missing_times(X_train, ['Accident to Assembly Time', 
#                                  'Assembly to C-2 Time',
#                                  'Accident to C-2 Time'])

#         X_val = p.fill_missing_times(X_val, ['Accident to Assembly Time', 
#                                  'Assembly to C-2 Time',
#                                  'Accident to C-2 Time'])

#         test = p.fill_missing_times(test, ['Accident to Assembly Time', 
#                                  'Assembly to C-2 Time',
#                                  'Accident to C-2 Time'])

#         p.fill_birth_year([X_train, X_val, test])


#         # Variables
#         num = ['Age at Injury', 'Average Weekly Wage', 'Birth Year',
#            'IME-4 Count', 'Number of Dependents', 'Accident Date Year',
#            'Accident Date Month', 'Accident Date Day', 
#            'Assembly Date Year', 'Assembly Date Month', 
#            'Assembly Date Day', 'C-2 Date Year', 'C-2 Date Month',
#            'C-2 Date Day', 'Accident to Assembly Time',
#            'Assembly to C-2 Time', 'Accident to C-2 Time']
#           # 'Wage to Age Ratio', 'Average Weekly Wage Sqrt',
#           # 'IME-4 Count Log', 'IME-4 Count Double Log']


#         categ = [var for var in X_train.columns if var not in num]

#         categ_count_encoding = ['Carrier Name Enc', 'Carrier Type Enc',
#                                 'County of Injury Enc', 'District Name Enc',
#                                 'Medical Fee Region Enc', 
#                                 'Industry Sector Enc']


#         categ_label_bin = [var for var in X_train.columns if var
#                            in categ and var not in categ_count_encoding]

#         num_count_enc = num + categ_count_encoding


#         # Scale
#         robust = RobustScaler()

#         X_train_num_count_enc_RS = robust.fit_transform(X_train[num_count_enc])
#         X_train_num_count_enc_RS = pd.DataFrame(X_train_num_count_enc_RS, columns=num_count_enc, index=X_train.index)
#         X_val_num_count_enc_RS = robust.transform(X_val[num_count_enc])
#         X_val_num_count_enc_RS = pd.DataFrame(X_val_num_count_enc_RS, columns=num_count_enc, index=X_val.index)
#         test_num_count_enc_RS = robust.transform(test[num_count_enc])
#         test_num_count_enc_RS = pd.DataFrame(test_num_count_enc_RS, columns=num_count_enc, index=test.index)

#         X_train_RS = pd.concat([X_train_num_count_enc_RS, 
#                                 X_train[categ_label_bin]], axis=1)
#         X_val_RS = pd.concat([X_val_num_count_enc_RS, 
#                               X_val[categ_label_bin]], axis=1)
#         test_RS = pd.concat([test_num_count_enc_RS, 
#                              test[categ_label_bin]], axis=1)

#         p.ball_tree_impute([X_train_RS, X_val_RS, test_RS], 
#                            'Average Weekly Wage')
        
#         if outliers:
#             X_train_RS = X_train_RS[X_train_RS['Age at Injury'] < 2.0217391304347827]
            
#             X_train_RS['Average Weekly Wage Sqrt'] = np.sqrt(X_train_RS['Average Weekly Wage'])

#             X_val_RS['Average Weekly Wage Sqrt'] = np.sqrt(X_val_RS['Average Weekly Wage'])

#             test_RS['Average Weekly Wage Sqrt'] = np.sqrt(test_RS['Average Weekly Wage'])
            
#             upper_limit = X_train_RS['Average Weekly Wage'].quantile(0.99)
#             lower_limit = X_train_RS['Average Weekly Wage'].quantile(0.01)

#             X_train_RS['Average Weekly Wage'] = X_train_RS['Average Weekly Wage'].clip(lower = lower_limit
#                                                                   , upper=upper_limit)
            
#             X_train_RS = X_train_RS[X_train_RS['Birth Year'] > -1.9782608695652173]
            
#             X_train_RS['IME-4 Count Log'] = np.log1p(X_train_RS['IME-4 Count'])
#             X_train_RS['IME-4 Count Double Log'] = np.log1p(X_train_RS['IME-4 Count Log'])

#             X_val_RS['IME-4 Count Log'] = np.log1p(X_val_RS['IME-4 Count'])
#             X_val_RS['IME-4 Count Double Log'] = np.log1p(X_val_RS['IME-4 Count Log'])

#             test_RS['IME-4 Count Log'] = np.log1p(test_RS['IME-4 Count'])
#             test_RS['IME-4 Count Double Log'] = np.log1p(test_RS['IME-4 Count Log'])
            
#             X_train_RS = X_train_RS[X_train_RS['Accident Date Year'] > -2.0]
            
#             X_train_RS = X_train_RS[X_train_RS['C-2 Date Year'] > -2.0]
            
#             y_train = y_train[X_train_RS.index]

#         # Training
#         model = run_model(model_name, X_train_RS, y_train, params.get(model_name, {}))

#         # Predictions
#         pred_train = model.predict(X_train_RS)
#         pred_val = model.predict(X_val_RS)
        
#         test_preds = model.predict_proba(test_RS)

#         # Metrics
#         f1macro_train.append(f1_score(y_train, pred_train, average='macro'))
#         f1macro_val.append(f1_score(y_val, pred_val, average='macro'))
#         precision_train.append(precision_score(y_train, pred_train, average='macro')) 
#         precision_val.append(precision_score(y_val, pred_val, average='macro'))  
#         recall_train.append(recall_score(y_train, pred_train, average='macro'))
#         recall_val.append(recall_score(y_val, pred_val, average='macro'))
        
#         # Compute Time
#         end_time = time.time()
#         elapsed_time = round((end_time - start_time) / 60, 2)
#         timer.append(elapsed_time) 
#         print(f'This Fold took {elapsed_time} minutes')

#     # Metrics Average and Stdev
#     avg_time = round(np.mean(timer), 3)
#     avg_f1_train = round(np.mean(f1macro_train), 3)
#     avg_f1_val = round(np.mean(f1macro_val), 3)
#     avg_precision_train = round(np.mean(precision_train), 3)
#     avg_precision_val = round(np.mean(precision_val), 3)
#     avg_recall_train = round(np.mean(recall_train), 3)
#     avg_recall_val = round(np.mean(recall_val), 3)
#     std_time = round(np.std(timer), 3)
#     std_f1_train = round(np.std(f1macro_train), 3)
#     std_f1_val = round(np.std(f1macro_val), 3)
#     std_precision_train = round(np.std(precision_train), 3)
#     std_precision_val = round(np.std(precision_val), 3)
#     std_recall_train = round(np.std(recall_train), 3)
#     std_recall_val = round(np.std(recall_val), 3)

#     # Final Predictions using Soft Voting
#     threshold = 0.5
#     y_custom_pred = np.array([custom_predict(test_preds[i], 
#                                              threshold) for i in range(len(test_preds))])

#     test_RS['Claim Injury Type'] = y_custom_pred 

#     test_RS['Claim Injury Type'] = test_RS['Claim Injury Type'].replace(label_mapping)
    
#     predictions = test_RS['Claim Injury Type']
    
#     if file_name != None:
    
#         predictions.to_csv(f'./pred/{file_name}.csv')


#     # Return data and treated Test_RS
#     return {
#         'avg_time': str(avg_time) + '+/-' + str(std_time),
#         'avg_f1_train': str(avg_f1_train) + '+/-' + str(std_f1_train),
#         'avg_f1_val': str(avg_f1_val) + '+/-' + str(std_f1_val),
#         'avg_precision_train': str(avg_precision_train) + '+/-' + str(std_precision_train),
#         'avg_precision_val': str(avg_precision_val) + '+/-' + str(std_precision_val),
#         'avg_recall_train': str(avg_recall_train) + '+/-' + str(std_recall_train),
#         'avg_recall_val': str(avg_recall_val) + '+/-' + str(std_recall_val),
#         'test_data': test_RS,
#         'predictions': predictions}
    

In [8]:
method = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

In [11]:
try5 = k_fold(method, X, y, test1, 'XGB', params = {},
                              enc = 'freq', outliers = True,
                              file_name = 'try5')

# try1 = simple soft voting
# try2 = 0.1
# try3 = 0.05
# try4 = 0.2
# try5 = 0.5
# above the fold is what appears when one opens the website -> DA

This Fold took 1.15 minutes
This Fold took 1.2 minutes
This Fold took 1.15 minutes


In [None]:
models = [try2]

model_names = ['try2']

m.metrics2(models, model_names)