# <center>Machine Learning Project Code</center>

<a class="anchor" id="top"></a>

## <center>*02 - Holdout Method*</center>

** **



# Table of Contents  <br>


1. [Importing Libraries & Data](#1.-Importing-Libraries-&-Data) <br><br>
    
2. [Train-Test Split](#2.-Train-Test-Split)

   2.1 [Feature Engineering](#2.1-Feature-Engineering) <br>
    
   2.2 [Missing Values](#2.2-Missing-Values) <br>
    
   2.3 [Outliers](#2.3-Outliers) <br><br>
   
3. [Feature Selection](#3.-Feature-Selection) 
    
    3.1 [Filter-Based Methods](#3.1-Filter-Based-Methods) <br>

    3.2 [Wrapper Methods](#3.2-Wrapper-Methods) <br>
    
    3.3 [Embedded Methods](#3.3-Embedded-Methods) <br><br>
    
4. [Modeling](#4.-Modeling) <br>

    4.1 [Hyperparameter Tuning](#4.1-Hyperparameter-Tuning) <br><br>

5. [Export](#5.-Export)


** **

In this notebook we start by splitting the data into training and validation sets using a simple Holdout Method. We then apply more complex methods to fill missing values and address outliers. Feature Selection is also performed, followed by a Modeling section.

Data Scientist Manager: António Oliveira, **20211595**

Data Scientist Senior: Tomás Ribeiro, **20240526**

Data Scientist Junior: Gonçalo Pacheco, **20240695**

Data Analyst Senior: Gonçalo Custódio, **20211643**

Data Analyst Junior: Ana Caleiro, **20240696**


** ** 

# 1. Importing Libraries & Data
In this section, we set up the foundation for our project by importing the necessary Python libraries and loading the dataset. These libraries provide the tools for data manipulation, visualization, and machine learning modeling throughout the notebook. Additionally, we import the historical claims dataset, which forms the core of our analysis. 

In [149]:
import pandas as pd
import numpy as np

# Preprocessing
import utils2 as p

# Train-Test Split
from sklearn.model_selection import train_test_split

# Scaler
from sklearn.preprocessing import (
    StandardScaler,
    MinMaxScaler,
    RobustScaler
)

# Feature Selection
import feature_selection as fs

# Models
import models as mod
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier 
from lightgbm import LGBMClassifier

from sklearn.impute import SimpleImputer

# Hyperparameter Tuning
import tuning as t

# Metrics
import metrics as m

pd.set_option('display.max_columns', None)
# Suppress Warnings
import warnings
warnings.filterwarnings("ignore")

**Import Data**

In [None]:
# Load training data
df = pd.read_csv('./train_data_EDA.csv', index_col = 'Claim Identifier')

# Load testing data
test = pd.read_csv('./test_data_EDA.csv', index_col = 'Claim Identifier')

# Display the first 3 rows of the training data
df.head(3)

# 2. Train-Test Split
The train-test split is a crucial technique used to assess model performance by dividing the dataset into training and testing subsets. This ensures that the model is evaluated on unseen data, which helps detect overfitting and provides a less biased performance estimate.

<a href="#top">Top &#129033;</a>

**Holdout Method**

In [83]:
# Split the DataFrame into features (X) and target variable (y)
X = df.drop('Claim Injury Type', axis=1) 
y = df['Claim Injury Type']  

In [84]:
# Split the dataset into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, 
                                                    test_size=0.2, 
                                                    random_state=42,
                                                    stratify = y) 
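
As a minimal illustration of what `stratify` does in the split above, the sketch below uses made-up toy data (not the claims dataset) with an 80/20 class imbalance and checks that both subsets keep that ratio. The names `X_toy` and `y_toy` are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced target: 80 samples of class 0 and 20 of class 1 (made-up data)
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 80 + [1] * 20)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratification, the share of class 1 is preserved in both subsets
train_share = y_tr.mean()
val_share = y_va.mean()
print(train_share, val_share)
```

Without `stratify`, rare classes could end up over- or under-represented in the validation set purely by chance.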

## 2.1 Feature Engineering

<a href="#top">Top &#129033;</a>

### 2.1.1 Encoding

Encoding transforms categorical data into numerical format for use in machine learning models. For this section, several encoders were considered:
- **One Hot Encoding** -  turns a variable that is stored in a column into dummy variables stored over multiple columns and represented as 0s and 1s
- **Frequency Encoding** - replaces the categories with their proportion in the dataset
- **Count Encoding** - replaces the categories by the number of times they appear in the dataset 
- **Manual Mapping Encoding** - manually attribute values to each category
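
The four encoders listed above can be sketched on a small made-up column using plain pandas. This is only an illustration of the ideas; it is not the implementation inside `utils2`, and the category names and manual mapping are arbitrary.

```python
import pandas as pd

# Made-up categorical column
s = pd.Series(['A', 'B', 'A', 'C', 'A', 'B'], name='cat')

# One Hot Encoding: one 0/1 column per category
one_hot = pd.get_dummies(s, prefix='cat')

# Count Encoding: each category replaced by how often it appears
counts = s.map(s.value_counts())

# Frequency Encoding: each category replaced by its proportion
freqs = s.map(s.value_counts(normalize=True))

# Manual Mapping Encoding: values chosen by hand (arbitrary here)
manual = s.map({'A': 0, 'B': 1, 'C': 2})

print(pd.concat([s, counts.rename('count'), freqs.rename('freq'),
                 manual.rename('manual')], axis=1))
```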

**Alternative Dispute Resolution**

Knowing that 'N' is by far the most common category, and that 'U' appears only 5 times in the training data (`df`) and once in the test data, we decided to merge 'U' and 'Y' into 'Y/U', and encode the variable as:
- 0 - N
- 1 - Y/U

In [None]:
print(X_train['Alternative Dispute Resolution'].value_counts())
print(' ')
print(test['Alternative Dispute Resolution'].value_counts())

In [86]:
X_train['Alternative Dispute Resolution Enc'] = X_train['Alternative Dispute Resolution'].replace({'N': 0, 'Y': 1, 'U': 1})
X_val['Alternative Dispute Resolution Enc'] = X_val['Alternative Dispute Resolution'].replace({'N': 0, 'Y': 1, 'U': 1})
test['Alternative Dispute Resolution Enc'] = test['Alternative Dispute Resolution'].replace({'N': 0, 'Y': 1, 'U': 1})

**Attorney/Representative**

As this variable only has 2 categories, they will be encoded as follows:
- N - 0
- Y - 1

In [None]:
print(X_train['Attorney/Representative'].value_counts())
print(' ')
print(test['Attorney/Representative'].value_counts())

In [88]:
X_train['Attorney/Representative Enc'] = X_train['Attorney/Representative'].replace({'N': 0, 'Y': 1})
X_val['Attorney/Representative Enc'] = X_val['Attorney/Representative'].replace({'N': 0, 'Y': 1})
test['Attorney/Representative Enc'] = test['Attorney/Representative'].replace({'N': 0, 'Y': 1})

**Carrier Name**

As Carrier Name has a considerable number of unique values, it will be encoded using Count Encoding.

We will start by analysing the carriers common to the train and test sets.

In [89]:
train_carriers = set(X_train['Carrier Name'].unique())
test_carriers = set(test['Carrier Name'].unique())

common_categories = train_carriers.intersection(test_carriers)

Then map the common categories to an index

In [90]:
common_category_map = {category: idx + 1 for idx, 
                       category in enumerate(common_categories)}

Fill the non-common categories with 0

In [91]:
X_train['Carrier Name'] = X_train['Carrier Name'].map(common_category_map).fillna(0).astype(int)
X_val['Carrier Name'] = X_val['Carrier Name'].map(common_category_map).fillna(0).astype(int)
test['Carrier Name'] = test['Carrier Name'].map(common_category_map).fillna(0).astype(int)

Encode the common categories using *Count Encoding*

In [92]:
X_train, X_val, test = p.encode(X_train, X_val, test, 'Carrier Name', 'count')
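
The internals of `p.encode` are not shown in this notebook. As a rough sketch, and assuming that count encoding is fitted on the training set only and that categories unseen in training receive 0, a helper could look like the following. The function `count_encode` and the ` Enc` suffix handling are hypothetical.

```python
import pandas as pd

def count_encode(train, others, column):
    """Hypothetical helper: count-encode `column` using training counts only."""
    counts = train[column].value_counts()
    new_col = f'{column} Enc'
    train = train.assign(**{new_col: train[column].map(counts)})
    encoded_others = [
        # Categories never seen in training have no count, so they get 0
        df.assign(**{new_col: df[column].map(counts).fillna(0).astype(int)})
        for df in others
    ]
    return train, encoded_others

# Made-up data: 'z' appears only in the validation set
train_toy = pd.DataFrame({'Carrier Name': ['x', 'x', 'y']})
val_toy = pd.DataFrame({'Carrier Name': ['x', 'z']})

train_enc, (val_enc,) = count_encode(train_toy, [val_toy], 'Carrier Name')
print(train_enc)
print(val_enc)
```

Fitting the counts on the training set alone keeps information from the validation and test sets out of the features.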

**Carrier Type**

After grouping the *5.* categories we decided to encode them in 2 distinct ways, and choose the best option in feature selection.

Starting with *Count Encoding*

In [93]:
X_train, X_val, test = p.encode(X_train, X_val, test, 'Carrier Type', 'count')

And *One-Hot-Encoding*

In [94]:
X_train, X_val, test = p.encode(X_train, X_val, test, 'Carrier Type', 'OHE')

**County of Injury**

As County of Injury has a considerable number of unique values, it will be encoded using Count Encoding.

In [95]:
X_train, X_val, test = p.encode(X_train, X_val, test, 'County of Injury', 'count')

**COVID-19 Indicator**

As this variable only has 2 categories, they will be encoded as follows:
- N - 0
- Y - 1

In [None]:
print(X_train['COVID-19 Indicator'].value_counts())
print(' ')
print(test['COVID-19 Indicator'].value_counts())

In [97]:
X_train['COVID-19 Indicator Enc'] = X_train['COVID-19 Indicator'].replace({'N': 0, 'Y': 1})
X_val['COVID-19 Indicator Enc'] = X_val['COVID-19 Indicator'].replace({'N': 0, 'Y': 1})
test['COVID-19 Indicator Enc'] = test['COVID-19 Indicator'].replace({'N': 0, 'Y': 1})

**District Name**

As this variable has 8 unique values, Count Encoder will be used.

In [None]:
print(X_train['District Name'].value_counts())
print(' ')
print(test['District Name'].value_counts())

In [99]:
X_train, X_val, test = p.encode(X_train, X_val, test, 'District Name', 'count')

**Gender**

Since after grouping we only have 3 categories, we will also use *One-Hot-Encoding* and decide which is better in feature selection.

In [100]:
X_train, X_val, test = p.encode(X_train, X_val, test, 'Gender', 'OHE')

**Medical Fee Region**

Even though this variable contains only 5 unique values, it is not clear whether they are ordered. Therefore we will use *Count Encoding*.

In [None]:
print(df['Medical Fee Region'].value_counts())
print(' ')
print(test['Medical Fee Region'].value_counts())

In [102]:
X_train, X_val, test = p.encode(X_train, X_val, test, 'Medical Fee Region', 'count')

**Industry Sector**

Since it contains too many categories to use OHE, we will use *Count Encoding*

In [103]:
X_train, X_val, test = p.encode(X_train, X_val, test, 'Industry Sector', 'count')

Remove Encoded Variables

In [104]:
drop = ['Alternative Dispute Resolution', 'Attorney/Representative', 'Carrier Type', 'County of Injury',
        'COVID-19 Indicator', 'District Name', 'Gender', 'Carrier Name',
        'Medical Fee Region', 'Industry Sector']

In [105]:
X_train.drop(columns = drop, axis = 1, inplace = True)
X_val.drop(columns = drop, axis = 1, inplace = True)
test.drop(columns = drop, axis = 1, inplace = True)

## 2.2 Missing Values

<a href="#top">Top &#129033;</a>

In [None]:
X_train.isna().sum()

### 2.2.1 Dealing with Missing Values

In this subsection we will use the existence of missing values to create new features

**C-3 Date**

Create a Binary variable:
- 0 if C-3 date is missing
- 1 if C-3 date exists

In [107]:
X_train['C-3 Date Binary'] = X_train['C-3 Date'].notna().astype(int)
X_val['C-3 Date Binary'] = X_val['C-3 Date'].notna().astype(int)
test['C-3 Date Binary'] = test['C-3 Date'].notna().astype(int)

**First Hearing Date**

Create a Binary variable:
- 0 if First Hearing Date is missing
- 1 if First Hearing Date exists

In [108]:
X_train['First Hearing Date Binary'] = X_train['First Hearing Date'].notna().astype(int)
X_val['First Hearing Date Binary'] = X_val['First Hearing Date'].notna().astype(int)
test['First Hearing Date Binary'] = test['First Hearing Date'].notna().astype(int)

Remove transformed features.

In [109]:
drop = ['C-3 Date', 'First Hearing Date']

In [110]:
X_train.drop(columns = drop, axis = 1, inplace = True)
X_val.drop(columns = drop, axis = 1, inplace = True)
test.drop(columns = drop, axis = 1, inplace = True)

### 2.2.2 Filling Missing Values

In this subsection we will deal with missing values by filling them with constants, statistical estimates, or predictions from models.

**IME-4 Count**

Since IME-4 Count represents the number of IME-4 forms received per claim, we considered that a missing value represented 0 received forms, hence we will fill them with 0.

In [111]:
X_train['IME-4 Count'] = X_train['IME-4 Count'].fillna(0)
X_val['IME-4 Count'] = X_val['IME-4 Count'].fillna(0)
test['IME-4 Count'] = test['IME-4 Count'].fillna(0)

**Industry Code**

Assuming that a missing value in Industry Code represents an unknown code, it will be filled with 0.

In [112]:
X_train['Industry Code'] = X_train['Industry Code'].fillna(0)
X_val['Industry Code'] = X_val['Industry Code'].fillna(0)
test['Industry Code'] = test['Industry Code'].fillna(0)

**Accident Date** & **C-2 Date**

Fill Year, Month and Day with the median. Then recompute the full date and use it to fill missing values in Day of Week.

In [113]:
p.fill_dates(X_train, [X_val, test], 'Accident Date')
p.fill_dates(X_train, [X_val, test], 'C-2 Date')
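
`p.fill_dates` lives in `utils2` and is not shown here. A minimal sketch of the idea, assuming the medians are computed on the training set and reused for the other sets, might look like this. The helper `fill_date_parts` is hypothetical, and the data is made up.

```python
import pandas as pd

def fill_date_parts(train, others, prefix):
    """Hypothetical sketch: fill '<prefix> Year/Month/Day' with training medians."""
    for part in ('Year', 'Month', 'Day'):
        col = f'{prefix} {part}'
        median = train[col].median()  # computed on the training set only
        for df in [train, *others]:
            df[col] = df[col].fillna(median)

train_toy = pd.DataFrame({
    'Accident Date Year': [2020, 2021, None, 2022],
    'Accident Date Month': [1, None, 3, 5],
    'Accident Date Day': [10, 20, 30, None],
})
val_toy = pd.DataFrame({
    'Accident Date Year': [None, 2019],
    'Accident Date Month': [None, 2],
    'Accident Date Day': [None, 1],
})

fill_date_parts(train_toy, [val_toy], 'Accident Date')
print(val_toy)
```

Note that a median over an even number of values can be fractional, so a real implementation may need to round before rebuilding the date.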

Identify missing values and recompute Accident Date and C-2 Date to fill Day of the week

In [114]:
p.fill_dow([X_train, X_val, test], 'Accident Date')
p.fill_dow([X_train, X_val, test], 'C-2 Date')

**Time Between**

Fill missing values with the recomputed times using the medians computed above.

In [115]:
X_train = p.fill_missing_times(X_train, ['Accident to Assembly Time', 
                             'Assembly to C-2 Time',
                             'Accident to C-2 Time'])

X_val = p.fill_missing_times(X_val, ['Accident to Assembly Time', 
                             'Assembly to C-2 Time',
                             'Accident to C-2 Time'])

test = p.fill_missing_times(test, ['Accident to Assembly Time', 
                             'Assembly to C-2 Time',
                             'Accident to C-2 Time'])

**Birth Year**

To fill the missing values, we start by creating a mask that selects observations where `Age at Injury` and `Accident Date Year` are not missing and `Birth Year` is either missing or zero. Since `Birth Year` is computed from `Age at Injury` and `Accident Date Year`, it is crucial that those two variables are not missing. We also recompute `Birth Year` when it is 0, since 0 is not a valid birth year.

In [116]:
p.fill_birth_year([X_train, X_val, test])
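
The mask described above can be sketched as follows. This is an illustration on made-up rows, assuming `Birth Year` is recomputed as `Accident Date Year` minus `Age at Injury`; the function name `fill_birth_year_sketch` is hypothetical and is not the `utils2` implementation.

```python
import numpy as np
import pandas as pd

def fill_birth_year_sketch(df):
    """Recompute Birth Year where it is missing or 0 and both inputs are known."""
    mask = (
        df['Age at Injury'].notna()
        & df['Accident Date Year'].notna()
        & (df['Birth Year'].isna() | (df['Birth Year'] == 0))
    )
    df.loc[mask, 'Birth Year'] = (
        df.loc[mask, 'Accident Date Year'] - df.loc[mask, 'Age at Injury']
    )
    return df

toy = pd.DataFrame({
    'Age at Injury': [30.0, 45.0, np.nan, 50.0],
    'Accident Date Year': [2020, 2021, 2022, 2019],
    'Birth Year': [1990.0, np.nan, np.nan, 0.0],
})
toy = fill_birth_year_sketch(toy)
print(toy)
```

The third row stays missing because `Age at Injury` is unknown, which is exactly the case the mask is meant to exclude.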

Before filling the missing values in `Average Weekly Wage` we must scale all numeric and Count-Encoded features, as doing so after filling these missing values may lead to inconsistencies.

Having evaluated models with scaling before and after the imputation of this variable, the difference in performance was not significant, therefore we will proceed with the scaling first.

### 2.2.3 Scaling

<a href="#top">Top &#129033;</a>

**Variable type split**

In [117]:
num = ['Age at Injury', 'Average Weekly Wage', 'Birth Year',
       'IME-4 Count', 'Number of Dependents', 'Accident Date Year',
       'Accident Date Month', 'Accident Date Day', 
       'Assembly Date Year', 'Assembly Date Month', 
       'Assembly Date Day', 'C-2 Date Year', 'C-2 Date Month',
       'C-2 Date Day', 'Accident to Assembly Time',
       'Assembly to C-2 Time', 'Accident to C-2 Time']
      # 'Wage to Age Ratio', 'Average Weekly Wage Sqrt',
      # 'IME-4 Count Log', 'IME-4 Count Double Log']

 
categ = [var for var in X_train.columns if var not in num]

categ_count_encoding = ['Carrier Name Enc', 'Carrier Type Enc',
                        'County of Injury Enc', 'District Name Enc',
                        'Medical Fee Region Enc', 'Industry Sector Enc']


categ_label_bin = [var for var in X_train.columns if var
                   in categ and var not in categ_count_encoding]


**Scaling**

In [118]:
num_count_enc = num + categ_count_encoding

In [119]:
robust = RobustScaler()

In [120]:
# Scaling the numerical features in the training set using RobustScaler
X_train_num_count_enc_RS = robust.fit_transform(X_train[num_count_enc])
X_train_num_count_enc_RS = pd.DataFrame(X_train_num_count_enc_RS, columns=num_count_enc, index=X_train.index)

# Scaling the numerical features in the validation set using the fitted RobustScaler
X_val_num_count_enc_RS = robust.transform(X_val[num_count_enc])
X_val_num_count_enc_RS = pd.DataFrame(X_val_num_count_enc_RS, columns=num_count_enc, index=X_val.index)

# Scaling the numerical features in the test set using the same fitted RobustScaler
test_num_count_enc_RS = robust.transform(test[num_count_enc])
test_num_count_enc_RS = pd.DataFrame(test_num_count_enc_RS, columns=num_count_enc, index=test.index)
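
For reference, `RobustScaler` with its default settings centers each feature on its median and divides by its interquartile range, which makes it less sensitive to extreme values than standardization. The sketch below checks this on a made-up column containing one outlier.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Made-up single feature with one extreme value
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaled = RobustScaler().fit_transform(values)

# Manual equivalent: (x - median) / (Q3 - Q1)
median = np.median(values)
q1, q3 = np.percentile(values, [25, 75])
manual = (values - median) / (q3 - q1)

print(scaled.ravel())
print(manual.ravel())
```

Because the median and quartiles are learned in `fit_transform` on the training set, calling `transform` on the validation and test sets reuses the same statistics.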

Joining the scaled features back with the Categorical features encoded with labels or binary encoding

In [121]:
X_train_RS = pd.concat([X_train_num_count_enc_RS, 
                        X_train[categ_label_bin]], axis=1)
X_val_RS = pd.concat([X_val_num_count_enc_RS, 
                      X_val[categ_label_bin]], axis=1)
test_RS = pd.concat([test_num_count_enc_RS, 
                     test[categ_label_bin]], axis=1)

Having scaled our features, we will proceed with the imputation of missing values.

**Average Weekly Wage**

In [122]:
p.ball_tree_impute([X_train_RS, X_val_RS, test_RS], 
                   'Average Weekly Wage')
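
The body of `p.ball_tree_impute` is not included in this notebook. One plausible reading, shown here only as a sketch under stated assumptions, is a nearest-neighbour imputation: build a `BallTree` on the training rows where the target is known, then fill each missing value with the mean target of its `k` nearest neighbours. The function `ball_tree_impute_sketch`, the choice of `k`, and the use of all other columns as features are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import BallTree

def ball_tree_impute_sketch(dfs, target, k=2):
    """Hypothetical sketch: dfs[0] is the training set; frames are filled in place."""
    train = dfs[0]
    features = [c for c in train.columns if c != target]
    known = train[train[target].notna()]
    tree = BallTree(known[features].to_numpy())
    known_values = known[target].to_numpy()
    for df in dfs:
        missing = df[target].isna()
        if missing.any():
            # Indices of the k nearest training rows with a known target
            _, idx = tree.query(df.loc[missing, features].to_numpy(), k=k)
            df.loc[missing, target] = known_values[idx].mean(axis=1)

train_toy = pd.DataFrame({'f1': [0.0, 1.0, 2.0, 10.0],
                          'wage': [10.0, 20.0, np.nan, 100.0]})
val_toy = pd.DataFrame({'f1': [9.0], 'wage': [np.nan]})

ball_tree_impute_sketch([train_toy, val_toy], 'wage', k=2)
print(train_toy)
print(val_toy)
```

Since the neighbours are found by distance, the features should be on comparable scales, which is consistent with scaling before this imputation step.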

Having treated all missing values, we will create one last feature

**Wage to Age Ratio**

In [123]:
# X_train_RS['Wage to Age Ratio'] = np.where(
#     (X_train_RS['Age at Injury'] != 0) & (X_train_RS['Average Weekly Wage'] != 0),
#     X_train_RS['Average Weekly Wage'] / X_train_RS['Age at Injury'],
#     0)

# X_val_RS['Wage to Age Ratio'] = np.where(
#     (X_val_RS['Age at Injury'] != 0) & (X_val_RS['Average Weekly Wage'] != 0),
#     X_val_RS['Average Weekly Wage'] / X_val_RS['Age at Injury'],
#     0)

# test_RS['Wage to Age Ratio'] = np.where(
#     (test_RS['Age at Injury'] != 0) & (test_RS['Average Weekly Wage'] != 0),
#     test_RS['Average Weekly Wage'] / test_RS['Age at Injury'],
#     0)


## 2.3 Outliers

<a href="#top">Top &#129033;</a>

### Outlier Detection

To detect outliers we will use a function that plots boxplots and identifies outliers based on the Interquartile Range method. This function will also add to a list all columns with a higher percentage of outliers than a previously set threshold.

In [None]:
p.detect_outliers_iqr(X_train_RS, 0.001)
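
The Interquartile Range rule used by the detection function can be sketched as below. This is not the `utils2` code; the helper name `iqr_outlier_share` and the conventional factor of 1.5 are assumptions, and the data is made up.

```python
import pandas as pd

def iqr_outlier_share(series, factor=1.5):
    """Return the IQR bounds and the share of values outside them."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - factor * iqr, q3 + factor * iqr
    outliers = (series < lower) | (series > upper)
    return lower, upper, outliers.mean()

toy = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
lower, upper, share = iqr_outlier_share(toy)
print(lower, upper, share)
```

A column would then be flagged when `share` exceeds the chosen threshold, such as the `0.001` passed in the call above.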

### Dealing With Outliers


**Age at Injury**

Knowing people of a certain age are already retired, we will use the identified upper bound as a limit.

In [125]:
X_train = X_train[X_train['Age at Injury'] < 88.5]

**Average Weekly Wage**

Apply a square root transformation

In [126]:
X_train['Average Weekly Wage Sqrt'] = np.sqrt(X_train['Average Weekly Wage'])

X_val['Average Weekly Wage Sqrt'] = np.sqrt(X_val['Average Weekly Wage'])

test['Average Weekly Wage Sqrt'] = np.sqrt(test['Average Weekly Wage'])

Winsorization for `Average Weekly Wage`

In [127]:
upper_limit = X_train['Average Weekly Wage'].quantile(0.99)
lower_limit = X_train['Average Weekly Wage'].quantile(0.01)

X_train['Average Weekly Wage'] = X_train['Average Weekly Wage'].clip(lower = lower_limit
                                                                  , upper=upper_limit)

**Birth Year**

In [128]:
X_train = X_train[X_train['Birth Year'] > 1932.5]

In [129]:
# lower_limit = X_train['Birth Year'].quantile(0.01)

# X_train['Birth Year'] = X_train['Birth Year'].clip(lower = lower_limit)

**IME-4 Count**

In [130]:
X_train['IME-4 Count Log'] = np.log1p(X_train['IME-4 Count'])
X_train['IME-4 Count Double Log'] = np.log1p(X_train['IME-4 Count Log'])

X_val['IME-4 Count Log'] = np.log1p(X_val['IME-4 Count'])
X_val['IME-4 Count Double Log'] = np.log1p(X_val['IME-4 Count Log'])

test['IME-4 Count Log'] = np.log1p(test['IME-4 Count'])
test['IME-4 Count Double Log'] = np.log1p(test['IME-4 Count Log'])

**Accident Date Year**

In [131]:
X_train = X_train[X_train['Accident Date Year'] > 2017.0]

In [132]:
# lower_limit = X_train['Accident Date Year'].quantile(0.01)

# X_train['Accident Date Year'] = X_train['Accident Date Year'].clip(lower = lower_limit)

**C-2 Date Year**

In [133]:
X_train = X_train[X_train['C-2 Date Year'] > 2017.0]

In [134]:
# lower_limit = X_train['C-2 Date Year'].quantile(0.01)

# X_train['C-2 Date Year'] = X_train['C-2 Date Year'].clip(lower = lower_limit)

**Alternative Dispute Resolution Enc**

Since `Alternative Dispute Resolution Enc` is 1 in only 0.45% of observations and 0 in the rest, we may consider dropping this variable.

In [None]:
X_train['Alternative Dispute Resolution Enc'].value_counts()

In [136]:
# X_train.drop('Alternative Dispute Resolution Enc', 
#              axis = 1, inplace = True)

# X_val.drop('Alternative Dispute Resolution Enc', 
#              axis = 1, inplace = True)

# test.drop('Alternative Dispute Resolution Enc', 
#              axis = 1, inplace = True)

Even though more columns were identified as having outliers, they were not addressed because at least one of the following is true:
- the categories of said variable were encoded, and removing categories would not be wise (as they can exist in the test set and we will need to predict them)
- the percentage of identified outliers was so high that we decided to leave them

Before continuing we must ensure that *y_train* has the same indices as *X_train*

In [137]:
y_train = y_train[X_train.index]

** ** 

Having performed all data transformations, we will export the data for later use in the Modeling section of this notebook. This saves time by avoiding re-running computationally expensive methods such as RFE.

To go to the Modeling section, click the link below.

[Go to Modeling &#129034;](#modeling)

<a class="anchor" id="feature-selection"></a>



In [138]:
# X_train.to_csv('./data/X_train_treated.csv')
# X_val.to_csv('./data/X_val_treated.csv')
# y_train.to_csv('./data/y_train_treated.csv')
# y_val.to_csv('./data/y_val_treated.csv')
# test.to_csv('./data/test_treated.csv')

** **

# 3. Feature Selection

<a href="#top">Top &#129033;</a>

In this section we will perform Feature Selection. Having already split our variables into *Numeric*, *Categorical*, *Categorical encoded using Count Encoding* and *Categorical (other)*, we will apply the following methods:

<br>

| Method                        | Feature Types |
| ----------------------------- | ------------- |
| Variance Threshold            | Numerical     |
| Correlation                   | Numerical     |
| Chi-Square Test               | Categorical   |
| Mutual Information            | Categorical   |
| RFE                           | All           |
| LASSO                         | All           |
| Tree-Based Feature Importance | All           |


## 3.1 Filter-Based Methods

<a href="#top">Top &#129033;</a>

Filter-based methods evaluate the relevance of features independently of the model using statistical measures like correlation, Chi-square tests, and mutual information. This section explores how these methods help reduce dimensionality, improve model performance, and prevent overfitting by selecting the most informative features.



**Variance Threshold**

In [None]:
X_train[num].var() 
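
The cell above only prints the variances. To act on them, one option (an assumption, not something this notebook does) is scikit-learn's `VarianceThreshold`, sketched here on made-up data with one constant column.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Made-up data: 'constant' carries no information
toy = pd.DataFrame({'constant': [1, 1, 1, 1], 'varied': [1, 2, 3, 4]})

selector = VarianceThreshold(threshold=0.0)  # drop zero-variance features
selector.fit(toy)

kept = toy.columns[selector.get_support()].tolist()
print(kept)
```

Note that variances depend on the scale of each feature, so a non-zero threshold is only meaningful when features are on comparable scales.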

**Spearman Correlation Matrix**

In [None]:
fs.correlation_matrix(X_train[num])
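
`fs.correlation_matrix` presumably plots the matrix; its code is not shown. As a sketch of how highly correlated pairs (the `HC_N` groups in the tables later on) could be listed, the following uses pandas' Spearman correlation on made-up data. The 0.8 cutoff is an illustrative assumption.

```python
import pandas as pd

toy = pd.DataFrame({
    'x': [1, 2, 3, 4, 5],
    'x_squared': [1, 4, 9, 16, 25],   # monotonic in x, so Spearman rho is 1
    'noise': [3, 1, 4, 1, 5],
})

corr = toy.corr(method='spearman')

# Collect each pair once (upper triangle only) above the cutoff
cols = corr.columns
high_pairs = [
    (cols[i], cols[j])
    for i in range(len(cols))
    for j in range(i + 1, len(cols))
    if abs(corr.iloc[i, j]) > 0.8
]
print(high_pairs)
```

Spearman correlation works on ranks, so it captures monotonic relationships such as a variable and its log or square root transform.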

**Chi Squared Test**

In [None]:
fs.chi_squared(X_train[categ_label_bin], y_train)

**Mutual Information Test**

In [None]:
fs.mutual_info(X_train_RS[categ], y_train, threshold = 0.05)
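
As a sketch of what a mutual information filter with a threshold could do, the example below uses scikit-learn's `mutual_info_classif` on synthetic data. Whether `fs.mutual_info` wraps this function is an assumption, and the column names are made up.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
target = rng.integers(0, 2, size=500)

toy = pd.DataFrame({
    'informative': target,                    # identical to the target
    'random': rng.integers(0, 3, size=500),   # unrelated to the target
})

scores = pd.Series(
    mutual_info_classif(toy, target, discrete_features=True, random_state=0),
    index=toy.columns,
)

# Keep only features whose score exceeds the threshold
selected = scores[scores > 0.05].index.tolist()
print(scores)
print(selected)
```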

## 3.2 Wrapper Methods

<a href="#top">Top &#129033;</a>

Unlike filter methods, which assess features independently, wrapper methods evaluate the effectiveness of feature subsets by measuring the model’s performance, making them more computationally expensive but often more accurate in selecting relevant features.
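
To make the idea concrete, here is a minimal sketch of a wrapper method using scikit-learn's `RFE` on synthetic data: for several subset sizes, fit the model on the selected features and compare macro F1 on a validation set. It is an illustration only; the loop in `fs.rfe` may differ.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the real features
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

scores = {}
for n in range(2, 5):
    selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=n)
    selector.fit(X_tr, y_tr)
    # RFE.predict applies the fitted estimator to the selected features
    scores[n] = f1_score(y_va, selector.predict(X_va), average='macro')

best_n = max(scores, key=scores.get)
print(scores, best_n)
```

Each subset size requires refitting the model several times, which is why these methods are slow on a dataset of this size.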

In [None]:
imputer = SimpleImputer(strategy='mean')
X_train[['Average Weekly Wage', 'Average Weekly Wage Sqrt']] = imputer.fit_transform(X_train[['Average Weekly Wage', 'Average Weekly Wage Sqrt']])
X_val[['Average Weekly Wage', 'Average Weekly Wage Sqrt']] = imputer.transform(X_val[['Average Weekly Wage', 'Average Weekly Wage Sqrt']])

**Recursive Feature Elimination (RFE) - Logistic Regression**

In [None]:
n_features = np.arange(3, len(X_train.columns) + 1)
model = LogisticRegression()
fs.rfe(X_train, y_train, X_val, y_val, 
    n_features = n_features, 
    model = model)

-------------TRAIN-------------
Classification Report for 3 features:

              precision    recall  f1-score   support

           0       0.00      0.00      0.00      9429
           1       0.74      0.98      0.84    231107
           2       0.00      0.00      0.00     54880
           3       0.60      0.76      0.67    117006
           4       0.00      0.00      0.00     38469
           5       0.00      0.00      0.00      3358
           6       0.00      0.00      0.00        77
           7       0.00      0.00      0.00       374

    accuracy                           0.69    454700
   macro avg       0.17      0.22      0.19    454700
weighted avg       0.53      0.69      0.60    454700

Macro Avg F1 Score for 3 features: 0.1890

----------VALIDATION----------
Classification Report for 3 features:

              precision    recall  f1-score   support

           0       0.00      0.00      0.00      2495
           1       0.73      0.98      0.84     58216
  

In [None]:
import play_song as s  # use alias `s` so the `p` alias for utils2 is not overwritten
s.play_('audio.mp3')

**Recursive Feature Elimination (RFE) - Random Forest Classifier**

In [None]:
# Perform Recursive Feature Elimination (RFE) to select the top numeric features based on their importance for a random forest classifier
n_features = np.arange(5, len(X_train[num].columns) + 1)
model = RandomForestClassifier()
fs.rfe(X_train[num], y_train, n_features = n_features, model = model)


## 3.3 Embedded Methods

<a href="#top">Top &#129033;</a>

These methods use algorithms that inherently select features as part of the model’s learning process. Embedded methods are computationally efficient and tend to be more accurate than filter methods, as they consider feature interactions and model performance simultaneously.


**Least Absolute Shrinkage and Selection Operator (LASSO)**

In [None]:
fs.lasso(X_train_RS[num], y_train, alpha = 0.01)

In [None]:
fs.lasso(X_train_RS[categ], y_train, alpha = 0.01)
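
The selection rule behind LASSO is that the L1 penalty shrinks some coefficients exactly to zero, and those features are discarded. The sketch below shows this with scikit-learn's `Lasso` on synthetic data; how `fs.lasso` handles the multiclass target is not shown in the notebook, so this is only an illustration of the principle, with a larger `alpha` chosen for clarity.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
toy = pd.DataFrame(
    rng.normal(size=(200, 3)), columns=['signal', 'noise_a', 'noise_b']
)
y_toy = 2.0 * toy['signal']   # the target depends on 'signal' only

lasso = Lasso(alpha=0.1).fit(toy, y_toy)
coefs = pd.Series(lasso.coef_, index=toy.columns)

# Features with a non-zero coefficient are the ones to keep
kept = coefs[coefs != 0].index.tolist()
print(coefs)
print(kept)
```

Larger values of `alpha` zero out more coefficients, so the number of kept features depends directly on this parameter.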

**Extra Trees Classifier**

In [None]:
fs.plot_feature_importance(X_train_RS[num], X_train_RS[categ], y_train, 
                        n_estimators = 250)
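
Tree-based importance can be sketched with scikit-learn's `ExtraTreesClassifier`, whose `feature_importances_` attribute sums to 1 across features. The data and feature names below are synthetic, and `fs.plot_feature_importance` may compute or display things differently.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# With shuffle=False the informative features are the first two columns
X_toy, y_toy = make_classification(
    n_samples=300, n_features=6, n_informative=2, n_redundant=0,
    shuffle=False, random_state=0,
)
cols = [f'f{i}' for i in range(6)]

forest = ExtraTreesClassifier(n_estimators=50, random_state=0).fit(X_toy, y_toy)

importances = pd.Series(forest.feature_importances_, index=cols)
print(importances.sort_values(ascending=False))
```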

[Go to Modeling &#129034;](#modeling)

## 3.4 Final Features

<a href="#top">Top &#129033;</a>

**Final Decision**

<br>

`Numeric Variables`
<br><br>


| Variable                   | Variance | Correlation | RFE LR | RFE RF | Lasso | ExtraTrees | `Decision` |
| -------------------------- | -------- | ----------- | ------ | ------ | ----- | ---------- | ---------- |
| Accident Date Day          |    K     |             |        |        |   D   |            |            |
| Accident Date Month        |    K     |             |        |        |   D   |            |            |
| Accident Date Year         |    K     |    HC_4     |        |        |   D   |            |            |
| Accident to Assembly Time  |    K     |             |        |        |   K   |            |            |
| Accident to C-2 Time       |    K     |             |        |        |   K   |            |            |
| Age at Injury              |    K     |    HC_3     |        |        |   K   |            |            |
| Assembly Date Day          |    K     |             |        |        |   D   |            |            |
| Assembly Date Month        |    K     |             |        |        |   D   |            |            |
| Assembly Date Year         |    K     |    HC_4     |        |        |   K   |            |            |
| Assembly to C-2 Time       |    K     |             |        |        |   K   |            |            |
| Average Weekly Wage        |    K     |    HC_2     |        |        |   K   |            |            |
| Average Weekly Wage Sqrt   |    K     |    HC_2     |        |        |   K   |            |            |
| Birth Year                 |    K     |    HC_3     |        |        |   K   |            |            |
| C-2 Date Day               |    K     |             |        |        |   K   |            |            |
| C-2 Date Month             |    K     |             |        |        |   K   |            |            |
| C-2 Date Year              |    K     |    HC_4     |        |        |   D   |            |            |
| IME-4 Count                |    K     |    HC_1     |        |        |   K   |            |            |
| IME-4 Count Double Log     |    K     |    HC_1     |        |        |   D   |            |            |
| IME-4 Count Log            |    K     |    HC_1     |        |        |   K   |            |            |
| Number of Dependents       |    K     |             |        |        |   D   |            |            |
| Wage to Age Ratio          |    K     |             |        |        |   K   |            |            |

<br>

`Categorical Variables`
<br><br>


| Variable                                | Chi-Squared | MI | RFE LR | RFE RF | Lasso | Extra trees | `Decision` |
| --------------------------------------- | ----------- | -- | ------ | ------ | ----- | ----------- | ---------- |
| Accident Date Day of Week               |      D      | D  |        |        |   K   |             |            |
| Assembly Date Day of Week               |      D      | D  |        |        |   K   |             |            |
| Age Group                               |      K      | K  |        |        |   K   |             |            |
| Alternative Dispute Resolution Bin      |      K      | D  |        |        |   K   |             |            |
| Attorney/Representative Bin             |      K      | K  |        |        |   K   |             |            |
| Carrier Name Enc                        |      K      | K  |        |        |   K   |             |            |
| Carrier Type Enc                        |      NA     | D  |        |        |   K   |             |            |
| Carrier Type_2A. SIF                    |      K      | D  |        |        |   D   |             |            |
| Carrier Type_3A. SELF PUBLIC            |      K      | D  |        |        |   D   |             |            |
| Carrier Type_4A. SELF PRIVATE           |      K      | D  |        |        |   D   |             |            |
| Carrier Type_5. SPECIAL FUND OR UNKNOWN |      K      | D  |        |        |   D   |             |            |
| C-2 Date Day of Week                    |      K      | D  |        |        |   K   |             |            |
| C-3 Date Binary                         |      K      | K  |        |        |   K   |             |            |
| COVID-19 Indicator Enc                  |      K      | D  |        |        |   K   |             |            |
| County of Injury Enc                    |      NA     | D  |        |        |   K   |             |            |
| District Name Enc                       |      NA     | D  |        |        |   K   |             |            |
| First Hearing Date Binary               |      K      | K  |        |        |   K   |             |            |
| Gender Enc                              |      K      | D  |        |        |   K   |             |            |
| Gender_M                                |      K      | D  |        |        |   D   |             |            |
| Gender_U/X                              |      K      | D  |        |        |   D   |             |            |
| Industry Code                           |      K      | K  |        |        |   K   |             |            |
| Industry Sector Enc                     |      NA     | D  |        |        |   K   |             |            |
| Insurance                               |      K      | D  |        |        |   D   |             |            |
| Medical Fee Region Enc                  |      NA     | D  |        |        |   K   |             |            |
| WCIO Cause of Injury Code               |      K      | K  |        |        |   K   |             |            |
| WCIO Codes                              |      K      | K  |        |        |   K   |             |            |
| WCIO Nature of Injury Code              |      K      | K  |        |        |   K   |             |            |
| WCIO Part Of Body Code                  |      K      | K  |        |        |   K   |             |            |
| Zip Code Valid                          |      D      | D  |        |        |   D   |             |            |

<br><br><br>

| Symbol | Meaning |
| ------ | ------- |
| K      | Keep    |
| D      | Discard |
| HC_*N*   | High Correlation Identifier |
| NA     | Not Applicable |


# 4. Modeling

<a class="anchor" id="modeling"></a>

[Go to Feature Selection &#129034;](#feature-selection)

<a href="#top">Top &#129033;</a>

Start by importing the correct datasets.


In [3]:
# X_train = pd.read_csv('./data/X_train_treated.csv', index_col = 'Claim Identifier')
# X_val = pd.read_csv('./data/X_val_treated.csv', index_col = 'Claim Identifier')
# y_train = pd.read_csv('./data/y_train_treated.csv', index_col = 'Claim Identifier')
# y_val = pd.read_csv('./data/y_val_treated.csv', index_col = 'Claim Identifier')
# test = pd.read_csv('./data/test_treated.csv', index_col = 'Claim Identifier')

And select the columns to use for prediction.

In [50]:
columns = []


X_train_filtered = X_train_RS[columns]
X_val_filtered = X_val_RS[columns]
test_filtered = test_RS[columns]

**Model Training**

In [None]:
# Define the models to run
model_names = ['LGBM', 'KNN', 'DT', 'RF','XGB']

results = mod.modeling(model_names, X_train_filtered, y_train, X_val_filtered, y_val)
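
`mod.modeling` is defined in a separate module that is not shown. A rough sketch of such a comparison loop, assuming it fits each model and reports macro F1 on the training and validation sets, could be the following. The function `modeling_sketch`, the result keys, and the synthetic data are all hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

def modeling_sketch(models, X_tr, y_tr, X_va, y_va):
    """Hypothetical stand-in: fit each model and record macro F1 on both sets."""
    results = {}
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        results[name] = {
            'train_f1_macro': f1_score(y_tr, model.predict(X_tr), average='macro'),
            'val_f1_macro': f1_score(y_va, model.predict(X_va), average='macro'),
        }
    return results

X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

toy_results = modeling_sketch(
    {'KNN': KNeighborsClassifier(), 'DT': DecisionTreeClassifier(random_state=0)},
    X_tr, y_tr, X_va, y_va,
)
print(toy_results)
```

Comparing the training and validation scores side by side makes overfitting visible: a large gap between the two suggests the model is memorizing the training data.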

In [None]:
import play_song as s
s.play_('audio.mp3')

** ** 

Scaling before using ball_trees

In [None]:
#extra_t = pd.DataFrame(results).T
extra_t

In [None]:
#lasso_num_MIC_categ = pd.DataFrame(results).T
lasso_num_MIC_categ

In [None]:
#lasso_feat = pd.DataFrame(results).T
lasso_feat

In [None]:
#no_out_noWageratio_noSQRT = pd.DataFrame(results).T
no_out_noWageratio_noSQRT

** **

Scaling after using ball_trees

In [None]:
#out_treat_drop_ADR_with_IME_logs_RS = pd.DataFrame(results).T
out_treat_drop_ADR_with_IME_logs_RS 

In [None]:
#outlier_treatment_drop_ADR_with_IME_logs = pd.DataFrame(results).T
outlier_treatment_drop_ADR_with_IME_logs

In [None]:
#outlier_treatment_drop_ADR = pd.DataFrame(results).T
outlier_treatment_drop_ADR

In [None]:
#outlier_treatment = pd.DataFrame(results).T
outlier_treatment

In [None]:
#no_outlier_treatment = pd.DataFrame(results).T
no_outlier_treatment

## 4.1 Hyperparameter Tuning

<a href="#top">Top &#129033;</a>

In [52]:
param_grid = {
    "n_estimators": [375, 400, 425],
    "max_depth": [23, 25, 27],   
#   "min_samples_split": [4, 5],   
#    "min_samples_leaf": [6, 8],     
#   "max_features": ['auto', 'sqrt', 'log2'],  
    "bootstrap": [True, False],       
    "criterion": ['gini', 'entropy']  
}

model = RandomForestClassifier(random_state = 1)

In [None]:
t.hyperparameter_search(model, param_grid, 'random', X_train_RS, 
                      y_train, scoring='f1_macro', 
                      cv=3, n_iter=5, random_state=42)
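
The `'random'` argument suggests a randomized search. Assuming `t.hyperparameter_search` wraps something like scikit-learn's `RandomizedSearchCV` (an assumption, since the `tuning` module is not shown), the equivalent call on a small synthetic problem would be roughly as follows. The reduced grid keeps the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X_toy, y_toy = make_classification(n_samples=150, n_features=6, random_state=0)

# Much smaller grid than the one above, for illustration only
toy_grid = {'n_estimators': [10, 20], 'max_depth': [2, 3]}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=1),
    param_distributions=toy_grid,
    n_iter=2,               # number of sampled parameter combinations
    scoring='f1_macro',
    cv=3,
    random_state=42,
)
search.fit(X_toy, y_toy)

best_params = search.best_params_
best_score = search.best_score_
print(best_params, best_score)
```

With the default `refit=True`, `search.best_estimator_` is a model refitted on the full training data with the best parameters, which can then be used for predictions.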

In [None]:
import play_song as s
s.play_('audio.mp3')

## 4.2 Final Predictions

<a href="#top">Top &#129033;</a>

In [16]:
test_filtered = test

In [17]:
test_filtered['Claim Injury Type'] = model.predict(test_filtered)

Map Predictions to Original Values

In [18]:
label_mapping = {
    0: "1. CANCELLED",
    1: "2. NON-COMP",
    2: "3. MED ONLY",
    3: "4. TEMPORARY",
    4: "5. PPD SCH LOSS",
    5: "6. PPD NSL",
    6: "7. PTD",
    7: "8. DEATH"
}


test_filtered['Claim Injury Type'] = test_filtered['Claim Injury Type'].replace(label_mapping)

In [None]:
# Count unique values in column 'Claim Injury Type'
test_filtered['Claim Injury Type'].value_counts() 

In [20]:
# Extract the predicted 'Claim Injury Type' column from the test dataset
predictions = test_filtered['Claim Injury Type']

In [21]:
# Assign a descriptive name for easy reference
name = 'all_feat_all_correct'

# Save the predictions to a CSV file.
predictions.to_csv(f'./pred/{name}.csv')