# Preprocessing
This notebook preprocesses the dataset: it handles some missing values (others must be dealt with after splitting the data, since they use a model for imputation) and creates new features. No scaling or outlier removal is performed here, because some of the models we use are robust to outliers and need no scaling; those steps are only performed in the models notebook.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter

## Data Loading

In [2]:
# Loading data
df_train = pd.read_csv("../../data/train_data.csv")
df_test = pd.read_csv("../../data/test_data.csv")



In [3]:
def prepare(df, date_columns=None, code_identifiers=['Code']):
    """
    Prepare the DataFrame by:
      - Converting specified date columns to datetime.
      - Casting numeric columns with specified keywords (e.g., 'Code') in their names to the 'object' dtype so they are treated as categorical.

    Parameters:
        df (pd.DataFrame): The DataFrame to prepare.
        date_columns (list): List of columns to convert to datetime. Default is None.
        code_identifiers (list): List of keywords to identify code columns. Default is ['Code'].

    Returns:
        pd.DataFrame: The modified DataFrame with date conversions and categorical castings applied.
    """
    # Convert specified columns to datetime
    if date_columns:
        for col in date_columns:
            df[col] = pd.to_datetime(df[col], errors='coerce')

    # Convert code_identifiers to lowercase for case-insensitive matching
    code_identifiers = [keyword.lower() for keyword in code_identifiers]

    # Cast numeric columns with keywords in their names to 'object' (treated as categorical)
    for col in df.columns:
        if any(keyword in col.lower() for keyword in code_identifiers):
            if pd.api.types.is_numeric_dtype(df[col]):
                df[col] = df[col].astype('object')
                print(f"Column '{col}' cast to 'object' data type.")
            else:
                print(f"Column '{col}' already non-numeric, no casting applied.")

    return df

In [4]:
# Define columns for date conversion
date_columns = ['Accident Date', 'Assembly Date', 'C-2 Date', 'C-3 Date', 'First Hearing Date']

# Prepare the training and test DataFrames
df_train = prepare(df_train, date_columns=date_columns)
df_test = prepare(df_test, date_columns=date_columns)

Column 'Industry Code' cast to 'object' data type.
Column 'Industry Code Description' already non-numeric, no casting applied.
Column 'WCIO Cause of Injury Code' cast to 'object' data type.
Column 'WCIO Nature of Injury Code' cast to 'object' data type.
Column 'WCIO Part Of Body Code' cast to 'object' data type.
Column 'Zip Code' already non-numeric, no casting applied.
Column 'Industry Code' cast to 'object' data type.
Column 'Industry Code Description' already non-numeric, no casting applied.
Column 'WCIO Cause of Injury Code' cast to 'object' data type.
Column 'WCIO Nature of Injury Code' cast to 'object' data type.
Column 'WCIO Part Of Body Code' cast to 'object' data type.
Column 'Zip Code' already non-numeric, no casting applied.


## Individual Feature Processing
- Unifying missing data with NaN


Missing data will be handled in a later step (Imputer)

In [5]:
df_train.dtypes

Accident Date                         datetime64[ns]
Age at Injury                                float64
Alternative Dispute Resolution                object
Assembly Date                         datetime64[ns]
Attorney/Representative                       object
Average Weekly Wage                          float64
Birth Year                                   float64
C-2 Date                              datetime64[ns]
C-3 Date                              datetime64[ns]
Carrier Name                                  object
Carrier Type                                  object
Claim Identifier                               int64
Claim Injury Type                             object
County of Injury                              object
COVID-19 Indicator                            object
District Name                                 object
First Hearing Date                    datetime64[ns]
Gender                                        object
IME-4 Count                                  f

**Age at injury**: An impossible age of 0 is replaced with a missing value (NaN).

In [6]:
df_train['Age at Injury'] = df_train['Age at Injury'].replace(0, np.nan)
df_test['Age at Injury'] = df_test['Age at Injury'].replace(0, np.nan)

**Birth year**: An impossible birth year of 0 is replaced with a missing value (NaN).

In [7]:
df_train['Birth Year'] = df_train['Birth Year'].replace(0, np.nan)
df_test["Birth Year"] = df_test["Birth Year"].replace(0,np.nan)

**Attorney representation**: Replace Y/N Strings with 1 and 0.

In [8]:
# Replace 'Y' with 1, 'N' with 0, and preserve NaNs
df_train['Attorney/Representative'] = df_train['Attorney/Representative'].replace({'Y': 1, 'N': 0})




In [9]:
# Replace 'Y' with 1, 'N' with 0, and preserve NaNs
df_test['Attorney/Representative'] = df_test['Attorney/Representative'].replace({'Y': 1, 'N': 0})




**Alternative Dispute Resolution**: Replace Y/N strings with 1 and 0. Unknown values ('U') are removed from train, since there are only 5 of them (out of 574026 rows); in test they are changed to the most common category.

In [10]:
df_train = df_train[df_train['Alternative Dispute Resolution'] != 'U']

In [11]:
df_train['Alternative Dispute Resolution'] = df_train['Alternative Dispute Resolution'].replace({'Y': 1, 'N': 0})



In [12]:
df_test['Alternative Dispute Resolution'] = df_test['Alternative Dispute Resolution'].replace("U", 0)
df_test['Alternative Dispute Resolution'] = df_test['Alternative Dispute Resolution'].replace({'Y': 1, 'N': 0})



**Claim Identifier**: This is the ID column of the dataset, but it contains duplicates, so every duplicated identifier is removed from train.

In [13]:
# Completely remove duplicates in "Claim Identifier" from train
df_train = df_train[~df_train['Claim Identifier'].duplicated(keep=False)]
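
To illustrate what `keep=False` does in the cell above, here is a minimal sketch on a made-up frame (the values are hypothetical, not taken from the dataset). Every row whose identifier occurs more than once is flagged, including the first occurrence.

```python
import pandas as pd

# Hypothetical toy data: identifier 2 appears twice
toy = pd.DataFrame({"Claim Identifier": [1, 2, 2, 3], "value": ["a", "b", "c", "d"]})

# keep=False flags all members of a duplicated group, not just the repeats
dup_mask = toy["Claim Identifier"].duplicated(keep=False)
toy_unique = toy[~dup_mask]

print(dup_mask.tolist())
print(toy_unique["Claim Identifier"].tolist())
```

With the default `keep='first'`, one of the two conflicting rows would survive instead, which may or may not be the desired behavior.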

**Covid indicator**: Replace Y/N Strings with 1 and 0

In [14]:
df_train['COVID-19 Indicator'] = df_train['COVID-19 Indicator'].replace({'Y': 1, 'N': 0})

df_test['COVID-19 Indicator'] = df_test['COVID-19 Indicator'].replace({'Y': 1, 'N': 0})




**Average Weekly Wage**: An impossible wage of 0 is replaced with a missing value (NaN).

In [15]:
df_train["Average Weekly Wage"] = df_train["Average Weekly Wage"].replace(0, np.nan)
df_test["Average Weekly Wage"] = df_test["Average Weekly Wage"].replace(0, np.nan)


**Zip Code**: Setting placeholder ZIP codes as NaN to be treated by imputer.

In [16]:
# Replace placeholder values with NaN in the original DataFrame
df_train.loc[df_train["Zip Code"].str.match(r'^0+$', na=False), "Zip Code"] = np.nan
df_test.loc[df_test["Zip Code"].str.match(r'^0+$', na=False), "Zip Code"] = np.nan
print("Replaced placeholder values with NaN in 'Zip Code'.")

Replaced placeholder values with NaN in 'Zip Code'.
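
As a small illustration of the placeholder rule, the sketch below applies the same pattern to a few hypothetical ZIP strings. It assumes that only values consisting entirely of zeros count as placeholders; real ZIP codes with leading zeros are left alone.

```python
import numpy as np
import pandas as pd

# Hypothetical ZIP values, including a real code with leading zeros
zips = pd.Series(["00000", "10001", "0", "00501", None], dtype="object")

# ^0+$ matches strings made up only of zeros; na=False treats missing values as non-matches
is_placeholder = zips.str.match(r"^0+$", na=False)
zips_clean = zips.mask(is_placeholder, np.nan)

print(zips_clean.tolist())
```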


## Feature removal
Some features are useless because they are full of missing values, have no variance, or are not present in the test dataset.

In [17]:
# Function to display percentage distribution of values in a column if it exists
def display_value_percentages(df, column_name):
    if column_name in df.columns:
        percentages = df[column_name].value_counts(normalize=True) * 100
        print(f"Percentage distribution in '{column_name}' - {df.name}:")
        print(percentages)
        print("\n")
    else:
        print(f"Column '{column_name}' does not exist in {df.name}.\n")

# Assign names to the dataframes for display purposes
df_train.name = 'df_train'
df_test.name = 'df_test'

**Feature OIICS Nature of Injury Description** is empty both in train and test

In [18]:
# Check and display/drop "OIICS Nature of Injury Description" in df_train and df_test
display_value_percentages(df_train, 'OIICS Nature of Injury Description')
display_value_percentages(df_test, 'OIICS Nature of Injury Description')

Percentage distribution in 'OIICS Nature of Injury Description' - df_train:
Series([], Name: proportion, dtype: float64)


Percentage distribution in 'OIICS Nature of Injury Description' - df_test:
Series([], Name: proportion, dtype: float64)




In [19]:
df_train = df_train.drop(['OIICS Nature of Injury Description'], axis=1, errors='ignore')
df_test = df_test.drop(['OIICS Nature of Injury Description'], axis=1, errors='ignore')

**Feature WCB Decision** has a single value in train (no variance) and does not exist in the test dataset.

In [20]:
# Assign names to the dataframes for display purposes
df_train.name = 'df_train'
df_test.name = 'df_test'
# Check and display/drop "WCB Decision" in df_train and df_test
display_value_percentages(df_train, 'WCB Decision')
display_value_percentages(df_test, 'WCB Decision')

Percentage distribution in 'WCB Decision' - df_train:
WCB Decision
Not Work Related    100.0
Name: proportion, dtype: float64


Column 'WCB Decision' does not exist in df_test.



In [21]:
# Column 'WCB Decision' does not exist in df_test, so we will drop this variable
df_train = df_train.drop(['WCB Decision'], axis=1)

**Feature Agreement Reached** doesn't exist in test dataset.

- This could be a secondary target variable. So we won't drop it.

In [22]:
# Assign names to the dataframes for display purposes
df_train.name = 'df_train'
df_test.name = 'df_test'
# Check and display/drop "Agreement Reached" in df_train and df_test
display_value_percentages(df_train, 'Agreement Reached')
display_value_percentages(df_test, 'Agreement Reached')

Percentage distribution in 'Agreement Reached' - df_train:
Agreement Reached
0.0    95.333446
1.0     4.666554
Name: proportion, dtype: float64


Column 'Agreement Reached' does not exist in df_test.



## Row Removal
Some rows (in TRAIN) can be dropped due to missing values in the target variable. NaN will never be a possible target category as it is invalid.

In [23]:
df_train["Claim Injury Type"].unique()

array(['2. NON-COMP', '4. TEMPORARY', nan, '3. MED ONLY',
       '5. PPD SCH LOSS', '6. PPD NSL', '1. CANCELLED', '8. DEATH',
       '7. PTD'], dtype=object)

In [24]:
#Remove rows where the target variable is NaN
df_train.dropna(axis = 0 , subset=["Claim Injury Type"],inplace = True)

## Feature Engineering

**Accident date SPLIT**: Transform to four new features (day, month, year, weekday)

- Why? E.g. Weekday vs weekend might impact decision as weekend is likely not work related

In [25]:
# Extract accident year, month, day, and day of week for train
df_train['Accident Year'] = df_train['Accident Date'].dt.year
df_train['Accident Month'] = df_train['Accident Date'].dt.month
df_train['Accident Day'] = df_train['Accident Date'].dt.day
df_train['Accident DayOfWeek'] = df_train['Accident Date'].dt.dayofweek

# Extract accident year, month, day, and day of week for test
df_test['Accident Year'] = df_test['Accident Date'].dt.year
df_test['Accident Month'] = df_test['Accident Date'].dt.month
df_test['Accident Day'] = df_test['Accident Date'].dt.day
df_test['Accident DayOfWeek'] = df_test['Accident Date'].dt.dayofweek

**All Dates: Order** Make sure the dates are in the expected chronological order and set those that are not to NaN.

In [26]:
def preprocess_dates(row):
    # Extract dates from relevant columns
    dates = {
        'Accident Date': row['Accident Date'],
        'C-2 Date': row['C-2 Date'],
        'C-3 Date': row['C-3 Date'],
        'Assembly Date': row['Assembly Date'],
        'First Hearing Date': row['First Hearing Date']
    }

    # The dict above is already listed in the expected chronological order
    date_keys = list(dates.keys())

    # Loop through all pairs of dates, with the earlier-expected date first
    for i in range(len(date_keys) - 1):
        for j in range(i + 1, len(date_keys)):
            date1_key, date2_key = date_keys[i], date_keys[j]
            date1, date2 = dates[date1_key], dates[date2_key]

            # If both dates are present, check their order
            if pd.notna(date1) and pd.notna(date2) and date1 > date2:
                # Set the date that should come later to NaN if it precedes the earlier one
                row[date2_key] = np.nan

    return row

In [27]:
df_train = df_train.apply(preprocess_dates, axis=1)
df_test = df_test.apply(preprocess_dates, axis=1)
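
Row-wise `apply` can be slow on a dataset of this size. The following is a possible column-wise sketch of the same rule, not the code used in this notebook; `enforce_date_order` is a hypothetical helper and the toy dates are made up. Like the row-wise version, it compares against the original values.

```python
import pandas as pd

def enforce_date_order(df, ordered_cols):
    """Set a date to NaT when any earlier-expected date in ordered_cols is later than it."""
    out = df.copy()
    original = df[ordered_cols]  # compare against the untouched input values
    for j, later_col in enumerate(ordered_cols):
        for earlier_col in ordered_cols[:j]:
            # Comparisons involving NaT evaluate to False, so missing dates are never flagged
            out_of_order = original[earlier_col] > original[later_col]
            out.loc[out_of_order, later_col] = pd.NaT
    return out

# Hypothetical example: in row 0 the C-2 Date precedes the Accident Date
toy = pd.DataFrame({
    "Accident Date": pd.to_datetime(["2020-01-10", "2020-01-10"]),
    "C-2 Date": pd.to_datetime(["2020-01-05", "2020-01-12"]),
    "Assembly Date": pd.to_datetime(["2020-01-20", None]),
})
fixed = enforce_date_order(toy, ["Accident Date", "C-2 Date", "Assembly Date"])
print(fixed)
```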

In [28]:
# Quality control - the date order of this row was wrong in the EDA
df_train.loc[17, ['Claim Identifier', 'Accident Date', 'C-2 Date', 'C-3 Date', 'Assembly Date', 'First Hearing Date']]

Claim Identifier                  5393811
Accident Date         2019-12-19 00:00:00
C-2 Date              2020-01-01 00:00:00
C-3 Date              2020-01-07 00:00:00
Assembly Date                         NaT
First Hearing Date    2020-08-13 00:00:00
Name: 17, dtype: object

**All Dates**: Days passed since accident

- Why? We assume this holds predictive power and is easier to interpret for the model.

In [29]:
def add_days_since_accident(df, accident_date_col='Accident Date', date_columns=None):
    """
    Add new columns to the DataFrame representing days since the accident date for each specified date column.

    Parameters:
        df (pd.DataFrame): The DataFrame containing the date columns.
        accident_date_col (str): The column name for the accident date.
        date_columns (list): List of date columns to calculate days since accident for. Accident date itself will be excluded.

    Returns:
        pd.DataFrame: The DataFrame with new 'Days Since Accident' columns added.
    """
    if date_columns is None:
        date_columns = []

    for col in date_columns:
        if col != accident_date_col:
            dsa_col = f"{col} DSA"
            df[dsa_col] = (df[col] - df[accident_date_col]).dt.days
            print(f"Column '{dsa_col}' created to represent days since '{accident_date_col}'.")

    return df

In [30]:
df_train = add_days_since_accident(df_train, accident_date_col='Accident Date', date_columns=date_columns)
df_test = add_days_since_accident(df_test, accident_date_col='Accident Date', date_columns=date_columns)

Column 'Assembly Date DSA' created to represent days since 'Accident Date'.
Column 'C-2 Date DSA' created to represent days since 'Accident Date'.
Column 'C-3 Date DSA' created to represent days since 'Accident Date'.
Column 'First Hearing Date DSA' created to represent days since 'Accident Date'.
Column 'Assembly Date DSA' created to represent days since 'Accident Date'.
Column 'C-2 Date DSA' created to represent days since 'Accident Date'.
Column 'C-3 Date DSA' created to represent days since 'Accident Date'.
Column 'First Hearing Date DSA' created to represent days since 'Accident Date'.
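
A minimal sketch of the same subtraction on two toy rows may help show how missing dates behave. The first row reuses the Accident Date and C-2 Date of the quality-control row above; the second row is hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "Accident Date": pd.to_datetime(["2019-12-19", "2019-12-19"]),
    "C-2 Date": pd.to_datetime(["2020-01-01", None]),
})

# Subtracting two datetime columns gives a timedelta; .dt.days turns it into whole days
toy["C-2 Date DSA"] = (toy["C-2 Date"] - toy["Accident Date"]).dt.days
print(toy["C-2 Date DSA"].tolist())
```

A missing date propagates as NaN in the new column, so these values still need to be imputed later.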


In [31]:
# Quality control - the date order of this row was wrong in the EDA
df_train.loc[17, ['Claim Identifier', 'Accident Date', 'C-2 Date DSA', 'C-3 Date DSA', 'Assembly Date DSA', 'First Hearing Date DSA']]

Claim Identifier                      5393811
Accident Date             2019-12-19 00:00:00
C-2 Date DSA                             13.0
C-3 Date DSA                             19.0
Assembly Date DSA                         NaN
First Hearing Date DSA                  238.0
Name: 17, dtype: object

Adding a binary missing-value indicator for each date column, since it might provide useful information that would be lost once the missing values are imputed.


In [32]:
# A missing date can mean that the event (e.g. a hearing) has not taken place yet
date_columns = ['Accident Date', 'First Hearing Date', 'C-3 Date', 'Assembly Date', 'C-2 Date']

for column in date_columns:
    df_train[column + '_missing'] = df_train[column].isnull().astype(int)
    df_test[column + '_missing'] = df_test[column].isnull().astype(int)

**Age Category** - Split age into groups

- Why? To give the model a categorical age feature with low cardinality that only highlights big differences.

In [33]:
# Define age category mapping
age_bins = [0, 19, 25, 40, 60, 120]  # Age ranges
age_labels = ['Teen', 'Young Adult', 'Adult', 'Middle-Aged', 'Senior']  # Category labels
age_column = 'Age at Injury'

# Create 'Age at Injury Category' based on bins
df_train['Age at Injury Category'] = pd.cut(df_train[age_column], bins=age_bins, labels=age_labels, right=False)
df_test['Age at Injury Category'] = pd.cut(df_test[age_column], bins=age_bins, labels=age_labels, right=False)
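
The effect of `right=False` on the bin edges can be checked with a few hypothetical ages that sit exactly on the boundaries. This sketch reuses the bins and labels defined above.

```python
import pandas as pd

age_bins = [0, 19, 25, 40, 60, 120]
age_labels = ['Teen', 'Young Adult', 'Adult', 'Middle-Aged', 'Senior']

# Hypothetical ages chosen to sit on bin edges, plus one missing value
ages = pd.Series([18, 19, 24, 25, 60, 120, None])

# right=False makes each bin closed on the left and open on the right: [0, 19), [19, 25), ...
categories = pd.cut(ages, bins=age_bins, labels=age_labels, right=False)
print(categories.astype('object').tolist())
```

Note that with these bins an age of exactly 120 (or above) falls outside the last interval and becomes NaN, just like a missing age.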

In [34]:
# Quality Control
df_train['Age at Injury Category'].value_counts()

Age at Injury Category
Middle-Aged    246136
Adult          194844
Senior          69166
Young Adult     52715
Teen             5696
Name: count, dtype: int64

In [35]:
# Converting to object type
df_train['Age at Injury Category'] = df_train['Age at Injury Category'].astype('object')


**Carrier Name** - Claim Category

- To indicate if Carrier has a lot of claims, medium amount or low amount of them.
- Why? Easier to interpret for the model than 2000 categories

Decision:
- Category HIGH (2) for State Fund 
- Category MEDIUM (1) for 5k to 50k claims
- Category LOW (0) for 5k and below claims

This feature is only created in the models notebook, since it must use information from X_train only to avoid data leakage.
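
One way this could look in the models notebook is sketched below. It is not the implementation used there: `fit_carrier_categories` and `apply_carrier_categories` are hypothetical helpers, the HIGH category is assumed to be "more than 50k claims" (which, per the decision above, should only be the State Fund), and carriers unseen in training are assumed to fall back to LOW.

```python
import pandas as pd

def fit_carrier_categories(carrier_train, low_max=5_000, medium_max=50_000):
    """Learn a carrier -> category map (0=LOW, 1=MEDIUM, 2=HIGH) from training rows only."""
    counts = carrier_train.value_counts()
    # Assumption: any carrier above medium_max is HIGH (expected to be only the State Fund)
    categories = pd.cut(counts, bins=[0, low_max, medium_max, float("inf")], labels=[0, 1, 2])
    return categories.astype(int).to_dict()

def apply_carrier_categories(carrier, mapping, unseen=0):
    """Map carriers to their category; carriers not seen in training get `unseen` (LOW)."""
    return carrier.map(mapping).fillna(unseen).astype(int)

# Toy usage with small thresholds so the example stays readable
train_carriers = pd.Series(["A"] * 6 + ["B"] * 3 + ["C"])
mapping = fit_carrier_categories(train_carriers, low_max=2, medium_max=5)
test_categories = apply_carrier_categories(pd.Series(["A", "B", "C", "D"]), mapping)
print(mapping, test_categories.tolist())
```

Fitting the map on the training split and only applying it to validation or test rows keeps claim counts from those rows out of the feature.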

**WCIO Part Of Body Code** - Group Body Sections/Regions to reduce cardinality

Creating a mapping from codes to sections (body regions)

- Based on https://www.iaiabc.org/standard-references -> https://assets.noviams.com/novi-file-uploads/iaiabc/EDI_Documents/WCIO_InjuryDescriptionTableandHistory-4dadd33c.xls
- The table contains different sub codes such as 14 -> "head" and 14A -> "IAIABC". Our dataset only has numerical codes, so we only use the purely numerical codes from the Excel file for the map.

In [36]:
# Load the 'Part' sheet from the Excel file
file_path = '../../data/WCIO_InjuryDescriptionTableandHistory-4dadd33c.xlsx'
df_part = pd.read_excel(file_path, sheet_name='Part')

In [37]:
# Filter rows to include only those with purely numerical codes or ranges
numeric_rows = df_part[df_part['Code'].astype(str).str.match(r'^\d+(-\d+)?$')]

# Dictionary to store each code's Body Section
code_to_section_mapping = {}

for _, row in numeric_rows.dropna(subset=['Code', 'Section']).iterrows():
    code = row['Code']
    section = row['Section']

    # If the code is a range like "10-23", expand it and map each number in the range
    if '-' in str(code):
        start, end = map(int, str(code).split('-'))
        for i in range(start, end + 1):
            # Map each code in the range to its section
            code_to_section_mapping[float(i)] = section
    else:
        # For individual numeric codes, map them directly
        numeric_code = float(code)
        code_to_section_mapping[numeric_code] = section
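
To make the filtering and range expansion easier to follow, here is a compact sketch on a made-up stand-in for the 'Part' sheet. The rows in `toy_part` are hypothetical and the real sheet may be laid out differently.

```python
import pandas as pd

# Hypothetical stand-in for the 'Part' sheet
toy_part = pd.DataFrame({
    "Code": ["10-12", "14A", 20, "I. Head", 99],
    "Section": ["Head", "Head", "Neck", None, None],
})

# Keep purely numeric codes or ranges, and drop rows without a section
is_numeric = toy_part["Code"].astype(str).str.match(r"^\d+(-\d+)?$")
numeric = toy_part[is_numeric].dropna(subset=["Section"])

toy_mapping = {}
for code, section in zip(numeric["Code"].astype(str), numeric["Section"]):
    first, _, last = code.partition("-")
    # A single code such as "20" has no "-", so the range collapses to one value
    for value in range(int(first), int(last or first) + 1):
        toy_mapping[float(value)] = section

print(toy_mapping)
```

The keys are floats because the code column in the dataset holds values such as 62.0, so the lookup in `map` only matches if the key types agree.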

In [38]:
code_to_section_mapping

{1.0: 'Unassigned',
 2.0: 'Unassigned',
 3.0: 'Unassigned',
 4.0: 'Unassigned',
 5.0: 'Unassigned',
 6.0: 'Unassigned',
 7.0: 'Unassigned',
 8.0: 'Unassigned',
 9.0: 'Unassigned',
 10.0: 'Head',
 11.0: 'Head',
 12.0: 'Head',
 13.0: 'Head',
 14.0: 'Head',
 15.0: 'Head',
 16.0: 'Head',
 17.0: 'Head',
 18.0: 'Head',
 19.0: 'Head',
 20.0: 'Neck',
 21.0: 'Neck',
 22.0: 'Neck',
 23.0: 'Neck',
 24.0: 'Neck',
 25.0: 'Neck',
 26.0: 'Neck',
 27.0: 'Unassigned',
 28.0: 'Unassigned',
 29.0: 'Unassigned',
 30.0: 'Upper Extremities',
 31.0: 'Upper Extremities',
 32.0: 'Upper Extremities',
 33.0: 'Upper Extremities',
 34.0: 'Upper Extremities',
 35.0: 'Upper Extremities',
 36.0: 'Upper Extremities',
 37.0: 'Upper Extremities',
 38.0: 'Upper Extremities',
 39.0: 'Upper Extremities',
 40.0: 'Trunk',
 41.0: 'Trunk',
 42.0: 'Trunk',
 43.0: 'Trunk',
 44.0: 'Trunk',
 45.0: 'Trunk',
 46.0: 'Trunk',
 47.0: 'Trunk',
 48.0: 'Trunk',
 49.0: 'Trunk',
 60.0: 'Trunk',
 61.0: 'Trunk',
 62.0: 'Trunk',
 63.0: 'Trunk'

In [39]:
# Apply to df_train
df_train["Body Section"] = df_train["WCIO Part Of Body Code"].map(code_to_section_mapping)
df_train["Body Section"] = df_train["Body Section"].fillna("Unassigned")
# Apply to df_test
df_test["Body Section"] = df_test["WCIO Part Of Body Code"].map(code_to_section_mapping)
df_test["Body Section"] = df_test["Body Section"].fillna("Unassigned")

Quality Checks

In [40]:
# Optional: Display the resulting DataFrame to confirm the new feature
df_train[["Body Section","WCIO Part Of Body Code","WCIO Part Of Body Description"]].head()

        Body Section  WCIO Part Of Body Code  WCIO Part Of Body Description
0              Trunk                    62.0                       BUTTOCKS
1  Upper Extremities                    38.0                    SHOULDER(S)
2               Head                    10.0           MULTIPLE HEAD INJURY
4  Upper Extremities                    36.0                      FINGER(S)
5  Upper Extremities                    38.0                    SHOULDER(S)


For comparison: WCIO Part Of Body Description has a very high cardinality

In [41]:
df_train["WCIO Part Of Body Description"].unique()

array(['BUTTOCKS', 'SHOULDER(S)', 'MULTIPLE HEAD INJURY', 'FINGER(S)',
       'LUNGS', 'EYE(S)', 'ANKLE', 'KNEE', 'THUMB', 'LOWER BACK AREA',
       'ABDOMEN INCLUDING GROIN', 'LOWER LEG', 'HIP', 'UPPER LEG',
       'MOUTH', 'WRIST', 'SPINAL CORD', 'HAND', 'SOFT TISSUE',
       'UPPER ARM', 'FOOT', 'ELBOW', 'MULTIPLE UPPER EXTREMITIES',
       'MULTIPLE BODY PARTS (INCLUDING BODY',
       'BODY SYSTEMS AND MULTIPLE BODY SYSTEMS', 'MULTIPLE NECK INJURY',
       'CHEST', 'WRIST (S) & HAND(S)', 'EAR(S)',
       'MULTIPLE LOWER EXTREMITIES', 'DISC', 'LOWER ARM', 'MULTIPLE',
       'UPPER BACK AREA', 'SKULL', 'TOES', 'FACIAL BONES', nan, 'TEETH',
       'NO PHYSICAL INJURY', 'MULTIPLE TRUNK', 'WHOLE BODY',
       'INSUFFICIENT INFO TO PROPERLY IDENTIFY - UNCLASSIFIED', 'PELVIS',
       'NOSE', 'GREAT TOE', 'INTERNAL ORGANS', 'HEART', 'VERTEBRAE',
       'LUMBAR & OR SACRAL VERTEBRAE (VERTEBRA', 'BRAIN',
       'SACRUM AND COCCYX', 'ARTIFICIAL APPLIANCE', 'LARYNX', 'TRACHEA'],
      dtype=ob

In [42]:
df_train["Body Section"].value_counts()

Body Section
Upper Extremities      178380
Lower Extremities      120212
Trunk                  101260
Unassigned              59088
Head                    56885
Multiple Body Parts     46274
Neck                    11922
Name: count, dtype: int64

## Making the id consistent with the submission

In [43]:
#Defining Claim identifier as the index
df_train.set_index('Claim Identifier', inplace = True)
df_test.set_index('Claim Identifier', inplace = True)

### Save enriched data for multivariate analysis (pre-data numerical conversion)

In [44]:
# Save the enriched DataFrames with "_enriched_multivar" added to the filenames
df_train.to_csv("../../data/train_data_enriched_multivar.csv")
df_test.to_csv("../../data/test_data_enriched_multivar.csv")

## Make vars usable for models

Preparing our variables so the models can handle them (converting the dates to numbers and preparing the categorical variables for proportional encoding in the models).
<span style="color: red;">This should not be executed when preprocessing data for multivariate_analysis.ipynb.</span>

#### Converting dates into integers in milliseconds since 1970

In [45]:
# Train dataset
for col in date_columns:
    if df_train[col].dtype == 'datetime64[ns]':
        df_train[col] = df_train[col].apply(
            lambda v: v.value // 10**6 if pd.notnull(v) else pd.NA
        )

# Test dataset
for col in date_columns:
    if df_test[col].dtype == 'datetime64[ns]':
        df_test[col] = df_test[col].apply(
            lambda v: v.value // 10**6 if pd.notnull(v) else pd.NA
        )

In [46]:
# Check for inconsistencies in the train dataset
for col in date_columns:
    if col in df_train.columns:
        negatives_train = df_train[df_train[col] < 0]
        if not negatives_train.empty:
            print(f"Negative values found in train dataset column: {col}")
            print(negatives_train[[col]])

# Check for inconsistencies in the test dataset
for col in date_columns:
    if col in df_test.columns:
        negatives_test = df_test[df_test[col] < 0]
        if not negatives_test.empty:
            print(f"Negative values found in test dataset column: {col}")
            print(negatives_test[[col]])

Negative values found in train dataset column: Accident Date
                  Accident Date
Claim Identifier               
5439830            -23760000000
5483928           -262569600000
5665076           -104198400000
5709762            -20995200000
5750699            -68947200000
5883889            -94694400000
5946151            -86572800000
6061985           -197337600000
6061462           -113184000000
Negative values found in test dataset column: Accident Date
                  Accident Date
Claim Identifier               
6396017           -100310400000


The negative values correspond to dates before 1970 (the Unix epoch), so they are expected and not errors.
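
A short check on two hand-picked dates (not taken from the dataset) shows why the sign flips at the epoch and that the conversion can be reversed.

```python
import pandas as pd

# Timestamp.value is nanoseconds since 1970-01-01; dividing by 10**6 gives milliseconds
before_epoch = pd.Timestamp("1969-12-31").value // 10**6
after_epoch = pd.Timestamp("1970-01-02").value // 10**6

# The conversion is reversible, so no information is lost for pre-1970 dates
restored = pd.to_datetime(before_epoch, unit="ms")

print(before_epoch, after_epoch, restored)
```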

In [47]:
cat_features = ['Carrier Name', 'Carrier Type','County of Injury',
            'District Name', 'Gender','Industry Code',
           'Industry Code Description','WCIO Cause of Injury Code',
           'WCIO Cause of Injury Description', 'WCIO Nature of Injury Code',
           'WCIO Nature of Injury Description', 'WCIO Part Of Body Code', "Medical Fee Region",
           'WCIO Part Of Body Description',"Age at Injury Category","Body Section"]
   

In [48]:
# Force conversion to string
for col in cat_features:
    df_train[col] = df_train[col].astype(str)
    df_test[col] = df_test[col].astype(str)

# Fill in missing values with "Missing" for when we use proportionate encoding in the models
for col in cat_features:
    df_train[col] = df_train[col].replace("nan", "Missing").fillna("Missing")
    df_test[col] = df_test[col].replace("nan", "Missing").fillna("Missing")


## Save enriched DF

In [49]:
# Save the enriched DataFrames with "_enriched" added to the filenames
df_train.to_csv("../../data/train_data_enriched.csv")
df_test.to_csv("../../data/test_data_enriched.csv")