# Intro

** **

# Table of Contents

<a class="anchor" id="top"></a>

** **


1. [Importing Libraries & Data](#1.-Importing-Libraries-&-Data) <br><br>
    
2. [Exploratory Data Analysis](#2.-Exploratory-Data-Analysis)

    2.1 [Initial Exploration](#2.1-Initial-Exploration)

    2.2 [Incoherencies](#2.2-Incoherencies)
    
    2.3 [Initial Visualisations](#2.3-Initial-Visualisations) <br><br>
    
3. [Data Cleaning & Preprocessing](#3.-Data-Cleaning-&-Preprocessing)

    3.1 [Duplicates](#3.1-Duplicates)
    
    3.5 [Feature Engineering](#3.5-Feature-Engineering)
   
    &emsp; 3.3.1 [Data Type Conversions](#3.3.1-Data-Type-Conversions)
    
    &emsp; 3.3.3 [Unique Feature-Pair Analysis](#3.3.3-Unique-Feature-Pair-Analysis) 
    
    3.3 [Missing Values](#3.3-Missing-Values)
    
    &emsp; 3.3.2 [Missing Values Identification & Treatment](#3.3.2-Missing-Values-Identification-&-Treatment)

    
    
    3.2 [Train-Test Split](#3.2-Train-Test-Split)
    
    3.4 [Outliers](#3.4-Outliers)

    3.6 [Visualisations](#3.6-Visualisations) <br><br>
    
4. [Feature Selection](#4.-Feature-Selection)    <br><br>     
    
5. [Export](#5.-Export)


# 1. Importing Libraries & Data

<a href="#top">Top &#129033;</a>

In [1]:
import pandas as pd
import numpy as np


# profile report
from ydata_profiling import ProfileReport

# visualisations
import seaborn as sns
import matplotlib.pyplot as plt

# train test split
from sklearn.model_selection import train_test_split

# external functions file
import functions as f

pd.set_option('display.max_columns', None)


In [None]:
df = pd.read_csv('./project_data/train_data.csv', index_col = 'Claim Identifier')
df.head(3)

In [None]:
test = pd.read_csv('./project_data/test_data.csv', index_col = 'Claim Identifier')
test.head(3)

# 2. Exploratory Data Analysis

<a href="#top">Top &#129033;</a>

## 2.1 Initial Exploration

In [None]:
# profile = ProfileReport(
#     df, 
#     title='Data',
#     correlations={
#         "pearson": {"calculate": True},
#         "spearman": {"calculate": False},
#         "kendall": {"calculate": False},
#         "phi_k": {"calculate": False},
#         "cramers": {"calculate": False},
#     },
# )

# profile

In [None]:
df.describe(include='object').T

In [None]:
df.describe().T

In [None]:
df.shape

In [None]:
df.info()

**Correlation matrix**

In [None]:
# drop column always missing
temp = df.drop('OIICS Nature of Injury Description', axis = 1)

# drop na
temp = temp.dropna()

# select numbers
corr_data = temp.select_dtypes(include=['number'])

correlation_matrix = corr_data.corr(method='spearman') # Spearman rank correlation (pandas defaults to Pearson)

plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='Blues', linewidths=0.1)
plt.show()

**Unique Values**

In [None]:
for column in df.columns:
    
    if df[column].nunique() < 20:
        
        print(f"Unique values in '{column}': {df[column].unique()}")
        print(df[column].nunique(), '\n')

## 2.2 Incoherencies

<a href="#top">Top &#129033;</a>

**Age at Injury**

The maximum values are implausibly high (ages above 100).

In [None]:
df[df['Age at Injury'] > 100]

In [None]:
f.plot_histogram(df['Age at Injury'], 'Age', 'Frequency',
                'Age Distribution')

**Birth Year**

Implausibly low values for Birth Year, and a large number of rows share the issue.

In [None]:
df[df['Birth Year'] < 1800]

In [None]:
f.plot_histogram(df['Birth Year'], 'Birth Year', 'Frequency',
                'Birth Year Distribution')

**IME-4 Count**

Very high maximum values (counts above 30).

In [None]:
df[df['IME-4 Count'] > 30]

In [None]:
f.plot_histogram(df['IME-4 Count'], 'IME-4 Count', 'Frequency',
                'IME-4 Count Distribution')

**OIICS Nature of Injury Description**

This column is always missing.

In [None]:
print(f"Number of missing rows in variable OIICS Nature of Injury Description: {len(df[df['OIICS Nature of Injury Description'].isna()])}")
print(f"Number of Rows in the Dataset: {len(df)}")

**Agreement Reached**

Mostly zeros.

In [None]:
df['Agreement Reached'].value_counts()

In [None]:
f.plot_histogram(df['Agreement Reached'], 'Agreement Reached', 'Frequency',
                'Agreement Reached Distribution')

**WCB Decision**

Always the same value.

In [None]:
df['WCB Decision'].unique()

**Claim Injury Type**

The target is imbalanced.

In [None]:
df['Claim Injury Type'].value_counts()

**Incoherent Columns**

Some columns exist in the train data but not in the test data.

In [None]:
train_columns = set(df.columns)
test_columns = set(test.columns)

not_in_test = train_columns - test_columns
print(f'Columns in train but not in test: {not_in_test}')


## 2.3 Initial Visualisations

<a href="#top">Top &#129033;</a>

# 3. Data Cleaning & Preprocessing

<a href="#top">Top &#129033;</a>

## 3.1 Duplicates

<a href="#top">Top &#129033;</a>

In [None]:
df[df.duplicated()]

In [None]:
df = df.drop_duplicates()

Verify that no duplicates remain.

In [None]:
df[df.duplicated()]

## 3.5 Feature Engineering

<a href="#top">Top &#129033;</a>

All transformations must be applied to X_test too.

### 3.3.1 Data Type Conversions

Although this is usually part of feature engineering, it is useful to do it before treating missing values.

In [None]:
df.info()

In [None]:
df['Accident Date'] = pd.to_datetime(df['Accident Date'], 
                                     errors='coerce')

df['Assembly Date'] = pd.to_datetime(df['Assembly Date'], 
                                     errors='coerce')

df['C-2 Date'] = pd.to_datetime(df['C-2 Date'], 
                                errors='coerce')

df['C-3 Date'] = pd.to_datetime(df['C-3 Date'], 
                                errors='coerce')

df['First Hearing Date'] = pd.to_datetime(df['First Hearing Date'], 
                                          errors='coerce')

In [None]:
df.head(2)

**Accident Date**

Extract the year, month and day.

In [None]:
df['Accident Year'] = df['Accident Date'].dt.year
df['Accident Month'] = df['Accident Date'].dt.month
df['Accident Day'] = df['Accident Date'].dt.day
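
The later count plots use `Assembly Month`, and the variable table lists `Assembly Year / Month / Day`, but no cell in this notebook derives them. A minimal sketch, assuming the same pattern as the Accident Date cell is what was intended; the toy frame `demo` is illustrative only and stands in for `df`:

```python
import pandas as pd

# Toy frame standing in for df (illustrative values, not project data)
demo = pd.DataFrame({'Assembly Date': ['2020-01-15', '2021-07-03', None]})
demo['Assembly Date'] = pd.to_datetime(demo['Assembly Date'], errors='coerce')

# Same pattern as the Accident Date cell: split the date into its parts
demo['Assembly Year'] = demo['Assembly Date'].dt.year
demo['Assembly Month'] = demo['Assembly Date'].dt.month
demo['Assembly Day'] = demo['Assembly Date'].dt.day

print(demo)
```

Rows with a missing date keep missing values in the three derived columns.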

**Alternative Dispute Resolution**

Values are ['N' nan 'Y' 'U']. Frequency encoding: each value is replaced by its count in the data.

In [None]:
freq = df['Alternative Dispute Resolution'].value_counts()
freq

In [None]:
df['Alternative Dispute Resolution Enc'] = df['Alternative Dispute Resolution'].map(freq)

**Attorney/Representative**

Values are ['N' 'Y' nan]. Binary encoding: N -> 0, Y -> 1.

In [None]:
df['Attorney/Representative'].value_counts()

In [None]:
# map avoids the downcasting warning that replace raises on object columns in recent pandas
df['Attorney/Representative Bin'] = df['Attorney/Representative'].map({'N': 0, 'Y': 1})

**C-2 Date**

Extract the year, month and day.

In [None]:
df['C-2 Year'] = df['C-2 Date'].dt.year
df['C-2 Month'] = df['C-2 Date'].dt.month
df['C-2 Day'] = df['C-2 Date'].dt.day

**Carrier Name**

Too many unique values for one-hot encoding; a candidate for dropping.

In [None]:
df['Carrier Name'].nunique()

Apply frequency encoding.

In [None]:
freq = df['Carrier Name'].value_counts()

In [None]:
df['Carrier Name Enc'] = df['Carrier Name'].map(freq)
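
Since every transformation has to be repeated on the test data, the frequencies should be learned on the training data and then reused. A small sketch of that idea on toy data; the names `train_demo` and `test_demo` and the fallback value 0 for unseen categories are assumptions, not choices made in this notebook:

```python
import pandas as pd

# Toy data (hypothetical carrier names, not project data)
train_demo = pd.DataFrame({'Carrier Name': ['A', 'A', 'B', 'A', 'C']})
test_demo = pd.DataFrame({'Carrier Name': ['A', 'C', 'D']})  # 'D' never appears in train

# Learn the frequencies on the training data only
freq = train_demo['Carrier Name'].value_counts()

train_demo['Carrier Name Enc'] = train_demo['Carrier Name'].map(freq)

# Categories absent from train map to NaN; filling with 0 is one possible fallback
test_demo['Carrier Name Enc'] = test_demo['Carrier Name'].map(freq).fillna(0).astype(int)

print(train_demo)
print(test_demo)
```

The same pattern would apply to the other frequency-encoded columns.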

**Carrier Type**

Frequency encoding (8 unique values).

In [None]:
freq = df['Carrier Type'].value_counts()
freq

In [None]:
df['Carrier Type Enc'] = df['Carrier Type'].map(freq)

**County of Injury**

Probably too many unique values for one-hot encoding, so frequency encoding is used.

In [None]:
df['County of Injury'].nunique()

In [None]:
freq = df['County of Injury'].value_counts()

In [None]:
df['County of Injury Enc'] = df['County of Injury'].map(freq)

**COVID-19 Indicator**

binary encoding

In [None]:
df['COVID-19 Indicator'].value_counts()

In [None]:
# map avoids the downcasting warning that replace raises on object columns in recent pandas
df['COVID-19 Indicator Bin'] = df['COVID-19 Indicator'].map({'N': 0, 'Y': 1})

**District Name**

Frequency encoding (8 unique values).

In [None]:
freq = df['District Name'].value_counts()
freq

In [None]:
df['District Name Enc'] = df['District Name'].map(freq)

**Gender**

Values are ['M' 'F' nan 'U' 'X']. Encoding: M -> 0, F -> 1, U and X -> 2.

In [None]:
df['Gender'].value_counts()

In [None]:
df['Gender Enc'] = df['Gender'].map({
    'M': 0,  # Male
    'F': 1,  # Female
    'U': 2,  # Unknown 
    'X': 2   # Other 
})

**Medical Fee Region**

Frequency encoding.

In [None]:
freq = df['Medical Fee Region'].value_counts()
freq

In [None]:
df['Medical Fee Region Enc'] = df['Medical Fee Region'].map(freq)

**Claim Injury Type**

In [None]:
# from sklearn.preprocessing import LabelEncoder

# encoder = LabelEncoder()
# df['Claim Injury Type'] = encoder.fit_transform(df['Claim Injury Type'])


# #encoder.inverse_transform([result])

**Columns not in test data**

In [None]:
df = df.drop(['Agreement Reached', 'WCB Decision'], axis = 1)

### 3.3.3 Unique Feature-Pair Analysis 

<a href="#top">Top &#129033;</a>

save codes and descriptions in dataframes, for later consultation (if needed)

In [None]:
injury_cause = df[['WCIO Cause of Injury Code', 'WCIO Cause of Injury Description']].drop_duplicates()

injury_cause_df = injury_cause.set_index('WCIO Cause of Injury Code')

injury_cause_df.head(2)

In [None]:
injury_nature = df[['WCIO Nature of Injury Code', 'WCIO Nature of Injury Description']].drop_duplicates()

injury_nature_df = injury_nature.set_index('WCIO Nature of Injury Code')

injury_nature_df.head(2)

In [None]:
body_code = df[['WCIO Part Of Body Code', 'WCIO Part Of Body Description']].drop_duplicates()

body_code_df = body_code.set_index('WCIO Part Of Body Code')


body_code_df.head(2)

In [None]:
industry_code = df[['Industry Code', 'Industry Code Description']].drop_duplicates()

industry_code_df = industry_code.set_index('Industry Code')


industry_code_df.head(2)

remove unnecessary columns from df

In [None]:
# df = df.drop(['WCIO Cause of Injury Description', 'WCIO Nature of Injury Description', 
#               'WCIO Part Of Body Description', 'Industry Code Description'], axis = 1)

Before removing these descriptions, we checked whether any row had a missing code but a non-missing description; this never happened.



**Look at df**

before next step

In [None]:
df.head(3)

## 3.3 Missing Values

<a href="#top">Top &#129033;</a>

In [None]:
df.isna().sum()

In [None]:
df[df['Claim Injury Type'].isna()]

Rows with a missing target are dropped.

In [None]:
df.dropna(subset=['Claim Injury Type'], inplace=True)

Verify that no rows with a missing target remain.

In [None]:
df[df['Claim Injury Type'].isna()]

In [None]:
df.isna().sum() / len(df) * 100

**C-3 Date**

In [None]:
print(f'There are {len(df[df["C-3 Date"].isna()])} rows with missing values')
df[df['C-3 Date'].isna()].head(2)

# var description: Date Form C-3 (Employee Claim Form) was received
## interpretation --> if missing, the form was not received --> fill with 0 or leave as is (?) --> may cause problems in the visualisations

In [None]:
# Create a binary variable: 1 if 'C-3 Date' is not missing, 0 if it is missing
df['C-3 Date Binary'] = df['C-3 Date'].notna().astype(int)

**First Hearing Date**

In [None]:
print(f'There are {len(df[df["First Hearing Date"].isna()])} rows with missing values')
df[df['First Hearing Date'].isna()].head(2)

# var meaning --> Date the first hearing was held on a claim at a WCB hearing location. A blank date means the claim has not yet had a hearing held
## solution --> fill with 0s or leave as is (?) --> may cause problems in the visualisations

In [None]:
# Create a new variable: 0 if 'First Hearing Date' is missing, otherwise extract the year

df['First Hearing Year'] = df['First Hearing Date'].dt.year.fillna(0).astype('int64')

### other alternative: binary flag
## 1 --> first hearing happened
## 0 --> not happened

**IME-4 Count**

In [None]:
print(f'There are {len(df[df["IME-4 Count"].isna()])} rows with missing values')
df[df['IME-4 Count'].isna()].head(2)

# var description --> Number of IME-4 forms received per claim. The IME-4 form is the “Independent Examiner's Report of Independent Medical Examination” form
## ASSUME that if missing, no forms were received --> fill with zero

In [None]:
df['IME-4 Count'] = df['IME-4 Count'].fillna(0)

**OIICS Nature of Injury Description**

In [None]:
print(f'There are {len(df[df["OIICS Nature of Injury Description"].isna()])} rows with missing values')
df[df['OIICS Nature of Injury Description'].isna()].head(2)

In [None]:
# size of missing / size of dataset
len(df[df['OIICS Nature of Injury Description'].isna()]) / len(df)

Drop the variables that were replaced by new ones or are always missing.

In [None]:
df = df.drop(['C-3 Date', 'First Hearing Date', 
             'OIICS Nature of Injury Description'], axis = 1)

In [None]:
df.isna().sum() / len(df) * 100

**Accident Date**

In [None]:
# rows with a missing Accident Date always have Age at Injury equal to 0
df[df['Accident Date'].isna()]

In [None]:
years = df['Accident Date'].dt.year.dropna() 

f.plot_histogram(data=years, 
               xlabel='Year', 
               ylabel='Frequency', 
               title='Distribution of Accident Dates by Year')

Impute using the median difference between Assembly Date and Accident Date.

In [None]:
# Calculate the median difference between 'Assembly Date' and 'Accident Date'
time_diff = (df['Assembly Date'] - df['Accident Date']).median()


In [None]:
df['Accident Date'] = df['Accident Date'].fillna(df['Assembly Date'] - time_diff)
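
To make the median-gap imputation concrete, here is a worked sketch on toy dates (the frame `demo` and its values are illustrative, not project data). It also recomputes the year afterwards, because in this notebook `Accident Year / Month / Day` were derived before the imputation and are not updated automatically for the imputed rows:

```python
import pandas as pd

# Toy frame: the third accident date is missing
demo = pd.DataFrame({
    'Accident Date': pd.to_datetime(['2020-01-01', '2020-02-01', None]),
    'Assembly Date': pd.to_datetime(['2020-01-11', '2020-02-21', '2020-03-31']),
})

# Observed gaps are 10 and 20 days, so the median gap is 15 days
gap = (demo['Assembly Date'] - demo['Accident Date']).median()

# Missing accident dates become assembly date minus the median gap
demo['Accident Date'] = demo['Accident Date'].fillna(demo['Assembly Date'] - gap)

# Recompute the derived part so the imputed row gets a value too
demo['Accident Year'] = demo['Accident Date'].dt.year

print(gap)
print(demo)
```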

**Birth Year**

Can be computed from Age at Injury and Accident Date.

In [None]:
df[df['Birth Year'].isna()]

In [None]:
df[df['Birth Year'] == 0]

In [None]:
df.loc[df['Birth Year'].isna() | (df['Birth Year'] == 0), 
            'Birth Year'] = df['Accident Date'].dt.year - df['Age at Injury']

**C-2 Date**

In [None]:
df[df['C-2 Date'].isna()]

In [None]:
# #fill with median
# median_c2_date = df['C-2 Date'].median()

# df['C-2 Date'] = df['C-2 Date'].fillna(median_c2_date)

**Industry Code**

In [None]:
df[(df['Industry Code'].isna()) & (df['Industry Code Description'].isna())]

In [None]:
df['Industry Code'].unique()

In [None]:
# fill with new code for unknown - 0

df['Industry Code'] = df['Industry Code'].fillna(0)
df['Industry Code Description'] = df['Industry Code Description'].fillna('Unknown')

**WCIO Cause of Injury Code**

In [None]:
df[(df['WCIO Cause of Injury Code'].isna()) & (df['WCIO Cause of Injury Description'].isna())]

In [None]:
# fill with new code for unknown - 0

df['WCIO Cause of Injury Code'] = df['WCIO Cause of Injury Code'].fillna(0)
df['WCIO Cause of Injury Description'] = df['WCIO Cause of Injury Description'].fillna('Unknown')

**WCIO Nature of Injury Code**

In [None]:
df[(df['WCIO Nature of Injury Code'].isna()) & (df['WCIO Nature of Injury Description'].isna())]

In [None]:
# fill with new code for unknown - 0

df['WCIO Nature of Injury Code'] = df['WCIO Nature of Injury Code'].fillna(0)
df['WCIO Nature of Injury Description'] = df['WCIO Nature of Injury Description'].fillna('Unknown')

**WCIO Part Of Body Code**

In [None]:
df[(df['WCIO Part Of Body Code'].isna()) & (df['WCIO Part Of Body Description'].isna())]

In [None]:
# fill with new code for unknown - 0

df['WCIO Part Of Body Code'] = df['WCIO Part Of Body Code'].fillna(0)
df['WCIO Part Of Body Description'] = df['WCIO Part Of Body Description'].fillna('Unknown')

**Zip Code**

In [None]:
df[df['Zip Code'].isna()]

In [None]:
# fill with a placeholder code for unknown - 99999

df['Zip Code'] = df['Zip Code'].fillna(99999)

In [None]:
df.isna().sum() / len(df) * 100

Temporary df to look at outliers and feature selection.

In [None]:
df2 = df.copy()

In [None]:
X = df2.drop('Claim Injury Type', axis = 1)

y = df2['Claim Injury Type']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.2, 
                                                    stratify = y, 
                                                    random_state = 1)

## 3.2 Train-Test Split

<a href="#top">Top &#129033;</a>

In [None]:
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
from sklearn.impute import KNNImputer
from sklearn.preprocessing import RobustScaler
from sklearn.linear_model import LogisticRegression


In [None]:
kf = KFold(n_splits=5, shuffle=True, random_state=42)

In [None]:
pipeline = Pipeline([
    #('imputer', KNNImputer(n_neighbors=5)),
    ('scaler', RobustScaler()),
    #('custom_selector', FunctionTransformer(custom_feature_selection)),
    ('model', LogisticRegression())
])

In [None]:
from sklearn.metrics import classification_report, confusion_matrix

# defined before the cross-validation loop that calls it
def metrics(y_train, pred_train , y_val, pred_val):
    print('___________________________________________________________________________________________________________')
    print('                                                     TRAIN                                                 ')
    print('-----------------------------------------------------------------------------------------------------------')
    print(classification_report(y_train, pred_train))
    print(confusion_matrix(y_train, pred_train))


    print('___________________________________________________________________________________________________________')
    print('                                                VALIDATION                                                 ')
    print('-----------------------------------------------------------------------------------------------------------')
    print(classification_report(y_val, pred_val))
    print(confusion_matrix(y_val, pred_val))

In [None]:
for train_index, val_index in kf.split(X):
    
    # Split data into training and validation folds
    # (fold-specific names, so the X_train / y_train hold-out split above is not overwritten)
    X_fold_train, X_val = X.iloc[train_index], X.iloc[val_index]
    y_fold_train, y_val = y.iloc[train_index], y.iloc[val_index]
    
    # Fit the pipeline on the training fold only
    # NOTE: X still holds datetime and text columns here; the scaler needs numeric input
    pipeline.fit(X_fold_train, y_fold_train)
    
    # Predict on both folds to compare train and validation performance
    pred_train = pipeline.predict(X_fold_train)
    pred_val = pipeline.predict(X_val)
    
    metrics(y_fold_train, pred_train , y_val, pred_val)

**Average Weekly Wage**

In [None]:
X_train[X_train['Average Weekly Wage'].isna()]

KNN - Imputation

In [None]:
from sklearn.impute import KNNImputer

num = X_train.select_dtypes(include=[float, int]).columns

imputer = KNNImputer(n_neighbors=5)

X_train[num] = imputer.fit_transform(X_train[num])
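
The imputer above is fitted on `X_train` only; the hold-out data should then be transformed with the already fitted imputer rather than refitted. A minimal sketch on toy data, where `train_demo`, `test_demo` and the column names are assumptions for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy numeric data with one missing wage in each set (illustrative values)
train_demo = pd.DataFrame({'age': [20.0, 30.0, 40.0, 50.0],
                           'wage': [200.0, np.nan, 400.0, 500.0]})
test_demo = pd.DataFrame({'age': [42.0], 'wage': [np.nan]})

imputer = KNNImputer(n_neighbors=2)

# Fit on the training data only, then reuse the fitted imputer on the test data
train_imputed = pd.DataFrame(imputer.fit_transform(train_demo),
                             columns=train_demo.columns, index=train_demo.index)
test_imputed = pd.DataFrame(imputer.transform(test_demo),
                            columns=test_demo.columns, index=test_demo.index)

print(train_imputed)
print(test_imputed)
```

Passing `index=` keeps the original row labels, which matters here because the dataframes are indexed by `Claim Identifier`.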

In [None]:
X_train[X_train['Average Weekly Wage'].isna()]

verify success of operations

In [None]:
X_train.isna().sum()

## 3.4 Outliers

<a href="#top">Top &#129033;</a>

In [None]:
f.boxplots(X_train)

In [None]:
for column in X_train.columns:
        if pd.api.types.is_numeric_dtype(X_train[column]):
            f.plot_histogram(X_train[column], 
                           xlabel=column, 
                           ylabel='Frequency', 
                           title=f'Histogram of {column}', 
                           color='lightblue')

| Method                 | Distribution Assumption | Sensitivity to Outliers | Complexity | Best Used For                 |
|-----------------------|------------------------|-------------------------|------------|-------------------------------|
| IQR                   | None                    | Low                     | Low        | Skewed distributions          |
| Modified Z-Score      | None                    | Moderate                | Moderate   | Small datasets with outliers  |
| Isolation Forest      | None                    | Low                     | High       | High-dimensional data         |


**Interquartile Range**

In [None]:
def detect_outliers_iqr(df):
    outliers_indices = set()
    for column in df.select_dtypes(include=[np.number]).columns:
        Q1 = df[column].quantile(0.25)
        Q3 = df[column].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        
        # Identify outliers
        outlier_data = df[(df[column] < lower_bound) | (df[column] > upper_bound)]
        
        outliers_indices.update(outlier_data.index)
        
        
        # Print the number of outliers
        print(f'Column: {column} - Number of Outliers: {len(outlier_data)}')
        print(f'Column: {column} - % of Outliers: {len(outlier_data) / len(df) * 100:.2f}% \n')
        
    return outliers_indices

In [None]:
iqr = detect_outliers_iqr(X_train)
iqr

**Isolation Forest**

In [None]:
from sklearn.ensemble import IsolationForest

def detect_outliers_isolation_forest(df):
    outliers_indices = set()
    
    for column in df.select_dtypes(include=[np.number]).columns:
        # Reshape the data for the model
        data = df[column].values.reshape(-1, 1)
        
        # Fit the Isolation Forest model
        iso_forest = IsolationForest(contamination = 0.01, 
                                     random_state=1)
        outlier_predictions = iso_forest.fit_predict(data)
        
        # Identify outliers (predicted as -1)
        outlier_data = df[outlier_predictions == -1]
        
        outliers_indices.update(outlier_data.index)
        
        # Print the number of outliers
        print(f'Column: {column} - Number of Outliers: {len(outlier_data)}')
        print(f'Column: {column} - % of Outliers: {len(outlier_data) / len(df) * 100:.2f}% \n')

    return outliers_indices

In [None]:
iso = detect_outliers_isolation_forest(X_train)
iso 

In [None]:
common_outliers = iqr.intersection(iso)
print(f'Number of Common Outliers: {len(common_outliers)}')

In [None]:
common = X_train.loc[list(common_outliers)]
common

In [None]:
len(common) / len(X_train) * 100

Decision: do nothing about the outliers for now.

**New Variables**

| VARIABLE NAME | DESCRIPTION | 
| -------- | ---------- |
| C-3 Date Binary | 1 if C-3 happened, 0 otherwise |
| First Hearing Year | year of the first hearing (0 if no hearing happened) |
| Accident Year / Month / Day | year / month / day of the accident |
| Assembly Year / Month / Day | year / month / day of the assembly |
| Attorney/Representative Bin | 1 if represented by lawyer, 0 otherwise |
| C-2 Year / Month / Day | year / month / day of receipt of C-2 |
| Carrier Name Enc | replaced Carrier Name by frequency of each carrier name |
| County of Injury Enc | replaced County of Injury by frequency of each county name |
| COVID-19 Indicator Bin | 1 if has covid, 0 otherwise |
| District Name Enc | replaced District Name by frequency of each district name |
| Gender Enc | 0 if male, 1 if female, 2 otherwise |
| Medical Fee Region Enc | replaced Medical Fee Region by frequency of each region name |



## 3.6 Visualisations

<a href="#top">Top &#129033;</a>

In [None]:
all_train = pd.concat([X_train, y_train], axis=1)
all_train.head(2)

In [None]:
accident_counts = all_train['Accident Year'].value_counts().sort_index()

plt.figure(figsize=(10, 6))
plt.plot(accident_counts.index, accident_counts.values)
plt.title('Number of Accidents Over Time')
plt.xlabel('Accident Year')
plt.ylabel('Number of Accidents')
plt.grid(True)
plt.show()

In [None]:
# Set up the figure with subplots
fig, axes = plt.subplots(1, 3, figsize=(18, 6))

# First subplot: Accident Month
sns.countplot(data=all_train, x='Accident Month', ax=axes[0])
axes[0].set_title('Accident Count by Month')
axes[0].set_xlabel('Accident Month')
axes[0].set_ylabel('Count')

# Second subplot: Assembly Month
sns.countplot(data=all_train, x='Assembly Month', ax=axes[1])
axes[1].set_title('Assembly Count by Month')
axes[1].set_xlabel('Assembly Month')
axes[1].set_ylabel('Count')

# Third subplot: C-2 Month
sns.countplot(data=all_train, x='C-2 Month', ax=axes[2])
axes[2].set_title('C-2 Count by Month')
axes[2].set_xlabel('C-2 Month')
axes[2].set_ylabel('Count')

# Adjust layout
plt.tight_layout()
plt.show()

In [None]:
plt.figure(figsize=(10, 6))
sns.regplot(x='Age at Injury', y='Average Weekly Wage', data=all_train, scatter_kws={'alpha':0.5})

# Set the labels and title
plt.title('Age at Injury vs Average Weekly Wage')
plt.xlabel('Age at Injury')
plt.ylabel('Average Weekly Wage')

# Show the plot
plt.grid(True)
plt.show()

In [None]:
plt.figure(figsize=(10, 6))
sns.boxplot(data=X_train, x='Gender', y='Average Weekly Wage')
plt.title('Gender vs. Average Weekly Wage')
plt.xlabel('Gender')
plt.ylabel('Average Weekly Wage')
plt.grid(True)
plt.show()

In [None]:
plt.figure(figsize=(10, 6))
sns.boxplot(data=X_train, x='Carrier Type', y='Age at Injury')
plt.title('Carrier Type vs. Age at Injury')
plt.xlabel('Carrier Type')
plt.ylabel('Age at Injury')
plt.xticks(rotation=30) 
plt.grid(True)
plt.show()

In [None]:
num_temp = X_train.select_dtypes(include=[np.number]).columns.tolist()

# Pairwise Relationship of Numerical Variables
sns.set()

# Create pairplot for the selected numerical columns
sns.pairplot(X_train[num_temp], diag_kind="hist")

# Layout adjustments
plt.subplots_adjust(top=0.95)
plt.suptitle("Pairwise Relationship of Numerical Variables", fontsize=20)

# Display the plot
plt.show()

In [None]:
## see the distribution of the categorical variables
## experiment with density plots
## three-way ANOVA

Look at X_train before removing the replaced columns.

In [None]:
X_train.head(2)

drop variables that will not be needed for modeling purposes

In [None]:
drop = ['Accident Date', 'Alternative Dispute Resolution', 
        'Assembly Date', 'Attorney/Representative', 'C-2 Date',
        'Carrier Name', 'Carrier Type', 'County of Injury', 
        'COVID-19 Indicator', 'District Name', 'Gender',
        'Medical Fee Region']

X_train = X_train.drop(drop, axis = 1)

# 4. Feature Selection

<a href="#top">Top &#129033;</a>

1. Split numeric and categorical features
2. Evaluate different scalers for the numeric features (standard, robust, min-max)
    - we have lots of outliers, so check which scaler copes best with them
3. Filter-based methods
    - variance and Spearman correlation for numeric features
    - chi-square for categorical features
4. Wrapper methods
    - RFE with different models (multinomial logistic regression to start, maybe)
5. Embedded methods
    - lasso
6. Explore other methods not covered in class
7. Produce a table with results/insights


**EXAMPLE TABLE**

NUM DATA

| VARIABLE | SPEARMAN | RFE MODEL1 | RFE MODEL2 | LASSO | DECISION |
| -------- | -------- | ---------- | ---------- | ----- | -------- |
| var_name | discard | discard | keep | discard | discard |

CATEG DATA

| VARIABLE | CHI-SQUARE | DECISION |
| -------- | ---------- | -------- |
| var_name | keep | keep |

In [None]:
X_train.dtypes

**Split Datatypes**

In [None]:
numeric_features = X_train.select_dtypes(include=['float64', 'int64'])
categorical_features = X_train.select_dtypes(include=['object'])

In [None]:
print("\nNumerical Data:")
print(numeric_features.head())

In [None]:
print("\nCategorical Data:")
print(categorical_features.head())

**Scaling Numeric Data**

In [None]:
from sklearn.preprocessing import StandardScaler

scaler_standard = StandardScaler()
# keep the original index so rows stay aligned with y_train
X_train_standard_scaled = pd.DataFrame(scaler_standard.fit_transform(numeric_features), columns=numeric_features.columns, index=numeric_features.index)

print("Standard Scaled Data Sample:")
print(X_train_standard_scaled.describe())

In [None]:
from sklearn.preprocessing import RobustScaler

scaler_robust = RobustScaler()
# keep the original index so rows stay aligned with y_train
X_train_robust_scaled = pd.DataFrame(scaler_robust.fit_transform(numeric_features), columns=numeric_features.columns, index=numeric_features.index)

print("Robust Scaled Data Sample:")
print(X_train_robust_scaled.describe())

In [None]:
from sklearn.preprocessing import MinMaxScaler

scaler_minmax = MinMaxScaler()
# keep the original index so rows stay aligned with y_train
X_train_minmax_scaled = pd.DataFrame(scaler_minmax.fit_transform(numeric_features), columns=numeric_features.columns, index=numeric_features.index)

print("MinMax Scaled Data Sample:")
print(X_train_minmax_scaled.describe())

### Summary of Scaling Methods for Numerical Data

#### 1. **RobustScaler**
- **Ideal for handling outliers**: The **RobustScaler** is best when working with data that contains outliers, as it centers on the median and scales by the interquartile range (IQR), making it less sensitive to extreme values.
- **Example**: Features like **'IME-4 Count'** have large outliers that could distort the scaling. **RobustScaler** mitigates this issue.

#### 2. **StandardScaler**
- **Works well for normally distributed data**: The **StandardScaler** standardizes data by centering it at a mean of zero and scaling it to unit standard deviation. It is most effective when the data is approximately normally distributed.
- **Example**: The feature **'Birth Year'** looks approximately normal and is well-centered by this scaler.

#### 3. **MinMaxScaler**
- **Sensitive to outliers**: The **MinMaxScaler** compresses data into a 0–1 range, but is vulnerable to outliers, which can significantly skew the result. It's most effective when applied to features with moderate ranges or when the model is sensitive to the feature range.
- **Example**: The **'Industry Code'** feature, with a moderate range, benefits from this transformation.

---

### Scaling Results Summary:

| Feature | **RobustScaler** | **StandardScaler** | **MinMaxScaler** |
|---------|------------------|--------------------|------------------|
| **Age at Injury** | Median-based scaling to handle outliers | Scaled to have mean = 0, std = 1 | Values between 0 and 1 |
| **Average Weekly Wage** | Robust to large outliers | Scaled but affected by outliers | Most values compressed due to extreme values |
| **Birth Year** | Handles skew in birth year data | Centered around mean | No major issues |
| **IME-4 Count** | Effective for highly skewed counts | Still influenced by outliers | Compression due to large counts |
| **Industry Code** | Efficient scaling due to moderate range | Normalized for centered values | All values mapped to 0–1 range |

---

### Recommendation:
Given that our dataset contains outliers in variables like **'IME-4 Count'** and **'Average Weekly Wage'**, the **RobustScaler** looks like the best choice for this scenario. Because it centers on the median and scales by the IQR, extreme values have little influence on the scaling. However, if an algorithm needs features standardized to zero mean and unit variance, we can also consider the **StandardScaler** after addressing the outliers.
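
The effect of a single extreme value on each scaler can be checked on a toy column. This is only an illustrative sketch; the array `values` is made up and is not project data:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Four ordinary values and one extreme value
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

robust = RobustScaler().fit_transform(values).ravel()
standard = StandardScaler().fit_transform(values).ravel()
minmax = MinMaxScaler().fit_transform(values).ravel()

# RobustScaler: (x - median) / IQR, so the ordinary values stay well spread out
print('robust  :', np.round(robust, 3))
# StandardScaler: mean and standard deviation are both pulled by the extreme value
print('standard:', np.round(standard, 3))
# MinMaxScaler: the ordinary values are squeezed close to 0
print('minmax  :', np.round(minmax, 3))
```

With the robust scaler the four ordinary values span 1.5 units, whereas min-max scaling compresses them into roughly the first 3% of the 0 to 1 range.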


**Filter-Based Methods**

**Numerical Data: Variance Threshold**

In [None]:
from sklearn.feature_selection import VarianceThreshold

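# Set a variance threshold (0.01)
threshold = 0.01
selector = VarianceThreshold(threshold=threshold)

# Use the min-max scaled data: after StandardScaler every non-constant feature
# has variance 1, so the threshold would never remove anything
X_train_high_variance = selector.fit_transform(X_train_minmax_scaled)
print(f"Number of features after variance threshold: {X_train_high_variance.shape[1]}")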


**Numerical Data: Spearman Correlation**

In [None]:
# Calculate Spearman correlation matrix
correlation_matrix = X_train_standard_scaled.corr(method='spearman')

# Set a correlation threshold, 0.9
correlation_threshold = 0.9

# Collect one feature from each highly correlated pair
correlated_features = set()
for i in range(len(correlation_matrix.columns)):
    for j in range(i):
        if abs(correlation_matrix.iloc[i, j]) > correlation_threshold:
            correlated_features.add(correlation_matrix.columns[i])

# Drop highly correlated features
X_train_dropped_correlated = X_train_standard_scaled.drop(columns=sorted(correlated_features))
print(f"Features dropped due to high correlation: {sorted(correlated_features)}")

**No highly correlated features were found using the threshold of 0.9, so nothing is dropped at this step. The threshold can be lowered if a stricter filter is needed.**

**Categorical Data: Chi-Square Test**

In [None]:
from sklearn.feature_selection import chi2
from sklearn.preprocessing import LabelEncoder

# Ensure all categorical features are strings
categorical_features = categorical_features.astype(str)

# Apply LabelEncoder to each categorical column
categorical_features_encoded = categorical_features.apply(LabelEncoder().fit_transform)

# Apply Chi-Square test
chi2_values, p_values = chi2(categorical_features_encoded, y_train)
chi2_results = pd.DataFrame({'Feature': categorical_features.columns, 'Chi2': chi2_values, 'p-value': p_values})

# Filter features with p-values < 0.05 (significant)
chi2_significant = chi2_results[chi2_results['p-value'] < 0.05]
print(f"Significant categorical features based on Chi-Square test: {chi2_significant['Feature'].tolist()}")

**These are the categorical features that show a significant relationship with the target variable based on the Chi-Square test, with p-values less than 0.05.**
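
One caveat: scikit-learn's `chi2` is designed for non-negative features such as counts or booleans, so on label-encoded columns the statistic depends on the arbitrary integer codes. An alternative sketch, assuming a test of independence on the contingency table is what is wanted, uses `scipy.stats.chi2_contingency`; the toy frame `demo` and its values are illustrative only:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy data: one categorical feature and a categorical target
demo = pd.DataFrame({
    'Gender': ['M', 'M', 'F', 'F', 'M', 'F', 'M', 'F'],
    'Target': ['A', 'A', 'B', 'B', 'A', 'B', 'A', 'B'],
})

# Contingency table of observed counts (feature categories x target classes)
table = pd.crosstab(demo['Gender'], demo['Target'])

# Test of independence between the feature and the target
chi2_stat, p_value, dof, expected = chi2_contingency(table)

print(table)
print(f'chi2 = {chi2_stat:.3f}, p-value = {p_value:.4f}, dof = {dof}')
```

On the real data this would be run once per categorical column against `y_train`, keeping the columns whose p-value falls below the chosen significance level.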

**Wrapper Methods - Recursive Feature Elimination (RFE)**

In [None]:
print(X_train.dtypes)
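
The RFE step itself is not implemented yet. A minimal sketch of how it could look, run on synthetic data from `make_classification` rather than on `X_train`; the number of features to keep (3) and the logistic regression settings are assumptions, not decisions taken in this notebook:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import RobustScaler

# Synthetic multi-class data standing in for the numeric training features
X_demo, y_demo = make_classification(n_samples=200, n_features=8, n_informative=3,
                                     n_redundant=0, n_classes=3,
                                     n_clusters_per_class=1, random_state=1)

# Scale first, as in the pipeline above
X_scaled = RobustScaler().fit_transform(X_demo)

# Recursively drop the weakest feature until 3 remain
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe.fit(X_scaled, y_demo)

print('Selected mask:', rfe.support_)
print('Ranking (1 = selected):', rfe.ranking_)
```

The `support_` mask can be used to fill the RFE columns of the decision table with keep/discard.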

# 5. Export

<a href="#top">Top &#129033;</a>

In [None]:
#df.to_csv('./project_data/treated_data')