# 1. Importing Libraries & Data
In this section, we import the Python libraries used throughout the notebook and load the historical claims dataset, which forms the core of our analysis. The imports below cover data manipulation (pandas) and dataset splitting (scikit-learn).

In [10]:
import warnings

import pandas as pd

# Train-Test Split
from sklearn.model_selection import train_test_split

# Show all columns when displaying DataFrames
pd.set_option('display.max_columns', None)

# Suppress Warnings
warnings.filterwarnings("ignore")

**Import Data**

In [6]:
# Load training data
df = pd.read_csv('./project_data/train_data_EDA.csv', index_col = 'Claim Identifier')

# Load testing data
test = pd.read_csv('./project_data/test_data_EDA.csv', index_col = 'Claim Identifier')

# Display the first 3 rows of the training data
df.head(3)

Claim Identifier,Age at Injury,Average Weekly Wage,Birth Year,C-3 Date,Claim Injury Type,First Hearing Date,IME-4 Count,Industry Code,WCIO Cause of Injury Code,WCIO Nature of Injury Code,WCIO Part Of Body Code,Number of Dependents,Alternative Dispute Resolution Bin,Attorney/Representative Bin,Carrier Name Enc,Carrier Type freq,Carrier Type_1A. PRIVATE,Carrier Type_2A. SIF,Carrier Type_3A. SELF PUBLIC,Carrier Type_4A. SELF PRIVATE,Carrier Type_5. SPECIAL FUND,County of Injury freq,COVID-19 Indicator Enc,District Name freq,Gender Enc,Gender_F,Gender_M,Medical Fee Region freq,Accident Date Year,Accident Date Month,Accident Date Day,Accident Date Day of Week,Assembly Date Year,Assembly Date Month,Assembly Date Day,Assembly Date Day of Week,C-2 Date Year,C-2 Date Month,C-2 Date Day,C-2 Date Day of Week,WCIO Codes,Zip Code Valid,Industry Sector Count Enc
5393875,31.0,0.0,1988.0,,1,,,44.0,27,10,62,1.0,0,0,963,285367,1,0,0,0,0,3355,0,44646,0,0,1,135885,2019.0,12.0,30.0,0.0,2020,1,1,2,2019.0,12.0,31.0,1.0,271062,0,103330
5393091,46.0,1745.93,1973.0,2020-01-14,3,2020-02-21,4.0,23.0,97,49,38,4.0,0,1,9,285367,1,0,0,0,0,760,0,40449,1,1,0,135885,2019.0,8.0,30.0,4.0,2020,1,1,2,2020.0,1.0,1.0,2.0,974938,0,69053
5393889,40.0,1434.8,1979.0,,3,,,56.0,79,7,10,6.0,0,0,1491,285367,1,0,0,0,0,17450,0,86171,0,0,1,85033,2019.0,12.0,6.0,4.0,2020,1,1,2,2020.0,1.0,1.0,2.0,79710,0,57495


# 2. Train-Test Split
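The train-test split assesses model performance by dividing the labeled dataset into a training subset and a validation subset. Evaluating the model on data it did not see during training helps detect overfitting and gives a less biased estimate of how it will perform on new data. Here we use the holdout method: 80% of the data is used for training and 20% for validation, stratified by the target variable.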


**Holdout Method**

In [8]:
# Split the DataFrame into features (X) and target variable (y)
X = df.drop('Claim Injury Type', axis=1) 
y = df['Claim Injury Type']  

In [11]:
# Split the dataset into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    X, y,
    test_size=0.2,    # 20% of the data will be used for validation
    random_state=42,  # Set a fixed seed for reproducibility of the split
    stratify=y,       # Ensure the distribution of the target variable is preserved in both sets
)
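
To confirm that stratification behaved as intended, one option is to compare class proportions across the full data and the two subsets. The sketch below uses a small hypothetical dataset (`toy`, with made-up `feature` and `target` columns) rather than the claims data, so the names and numbers are illustrative assumptions only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: an imbalanced target with three classes
toy = pd.DataFrame({
    "feature": range(100),
    "target": [0] * 70 + [1] * 20 + [2] * 10,
})

X_toy = toy.drop("target", axis=1)
y_toy = toy["target"]

# Same settings as the holdout split above, applied to the toy data
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Class shares in each subset; with stratify they should match closely
shares = pd.DataFrame({
    "full": y_toy.value_counts(normalize=True),
    "train": y_tr.value_counts(normalize=True),
    "val": y_va.value_counts(normalize=True),
}).sort_index()
print(shares)
```

The same comparison could be run on `y`, `y_train`, and `y_val` from the cell above; large gaps between the columns would suggest the split was not stratified.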


# 3. Missing Values

In [12]:
X_train.isna().sum()

Age at Injury                              0
Average Weekly Wage                    12932
Birth Year                             13069
C-3 Date                              168605
First Hearing Date                    185315
IME-4 Count                           195355
Industry Code                           4813
WCIO Cause of Injury Code                  0
WCIO Nature of Injury Code                 0
WCIO Part Of Body Code                     0
Number of Dependents                       0
Alternative Dispute Resolution Bin         0
Attorney/Representative Bin                0
Carrier Name Enc                           0
Carrier Type freq                          0
Carrier Type_1A. PRIVATE                   0
Carrier Type_2A. SIF                       0
Carrier Type_3A. SELF PUBLIC               0
Carrier Type_4A. SELF PRIVATE              0
Carrier Type_5. SPECIAL FUND               0
County of Injury freq                      0
COVID-19 Indicator Enc                     0
District N
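
Raw counts are easier to interpret as shares of the training rows, and any fill value should be learned from the training set only so that no information leaks from the validation set. The sketch below illustrates both ideas on small hypothetical frames (`X_train_demo`, `X_val_demo`); median imputation is just one common choice and is an assumption here, not necessarily the strategy this notebook adopts.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for X_train / X_val with gaps in a numeric column
X_train_demo = pd.DataFrame({"Average Weekly Wage": [500.0, np.nan, 1500.0, 1000.0]})
X_val_demo = pd.DataFrame({"Average Weekly Wage": [np.nan, 800.0]})

# Share of missing values per column, largest first
missing_share = X_train_demo.isna().mean().sort_values(ascending=False)
print(missing_share)

# Learn the fill value from the training set only, then apply it to both subsets
train_median = X_train_demo["Average Weekly Wage"].median()
X_train_filled = X_train_demo.fillna({"Average Weekly Wage": train_median})
X_val_filled = X_val_demo.fillna({"Average Weekly Wage": train_median})
```

Columns that are missing for most rows, such as the date and count columns listed above, may be better handled with a missingness indicator than with a fill value; that decision depends on what the missing entries mean in the claims process.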

# 4. Outliers

# 5. Feature Selection