# Preprocessing
This notebook preprocesses the dataset (fixing data problems, handling outliers, unifying missing values, etc.) and creates new features.

## Data Loading

## Individual Features
- Unifying missing data with NaN
- Removing outliers
- etc.

Missing data is only marked as NaN in this section; it will be handled in a later step (Imputer).
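
The "unify missing data with NaN" step could be sketched as one small helper instead of repeating the replacement per column. This is only an illustration: the `PLACEHOLDERS` mapping and the `unify_missing` function are hypothetical names, and the sketch uses `Series.mask` with `isin` rather than the `replace` calls in the cells below.

```python
import pandas as pd

# Hypothetical mapping: column name -> placeholder strings that mean "missing".
PLACEHOLDERS = {
    "Carrier Type": ["UNKNOWN"],
    "Gender": ["U"],
    "Medical Fee Region": ["UK"],
}

def unify_missing(df, placeholders=PLACEHOLDERS):
    """Return a copy of df in which each column's placeholders are set to NaN."""
    out = df.copy()
    for col, values in placeholders.items():
        if col in out.columns:  # skip columns that are absent from this frame
            out[col] = out[col].mask(out[col].isin(values))
    return out

# Toy data for illustration only.
demo = pd.DataFrame({
    "Gender": ["M", "U", "F"],
    "Carrier Type": ["UNKNOWN", "1A. PRIVATE", "UNKNOWN"],
})
clean = unify_missing(demo)
```

Because the helper returns a copy, the same call can be applied to both `df_train` and `df_test` without the two drifting apart.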

**Age at injury**

In [None]:
# An age of 0 is treated as a placeholder for a missing value
df_train['Age at Injury'] = df_train['Age at Injury'].replace(0, np.nan)
df_test['Age at Injury'] = df_test['Age at Injury'].replace(0, np.nan)

In [None]:
# Drop training rows with implausible ages (< 14 or > 80); rows with NaN age are kept
df_train = df_train.drop(df_train[(df_train['Age at Injury'] < 14) | (df_train['Age at Injury'] > 80)].index)
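
Comparisons against NaN evaluate to False, so the range filter above removes only rows with a known, out-of-range age. A minimal sketch on toy data (the `toy` frame is made up for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Age at Injury": [0, 10, 35, 95, 60]})

# Step 1: mark the placeholder 0 as missing.
toy["Age at Injury"] = toy["Age at Injury"].replace(0, np.nan)

# Step 2: flag ages outside [14, 80]; NaN compares as False on both sides.
out_of_range = (toy["Age at Injury"] < 14) | (toy["Age at Injury"] > 80)
toy_filtered = toy[~out_of_range]
```

Here the rows with ages 10 and 95 are dropped, while the row whose age became NaN stays in the frame for later imputation.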

**Birth year**

In [None]:
df_train['by'] = df_train['by'].replace(0, np.nan)

**Carrier type**

In [None]:
# Replace UNKNOWN with NaN
df_train['Carrier Type'] = df_train['Carrier Type'].replace("UNKNOWN", np.nan)
df_test['Carrier Type'] = df_test['Carrier Type'].replace("UNKNOWN", np.nan)

**Attorney representation**

In [None]:
# Replace 'Y' with True, 'N' with False, and preserve NaNs
df_train['Attorney/Representative'] = df_train['Attorney/Representative'].replace({'Y': True, 'N': False})

# Now convert to nullable boolean type
df_train['Attorney/Representative'] = df_train['Attorney/Representative'].astype("boolean")

In [None]:
# Replace 'Y' with True, 'N' with False, and preserve NaNs
df_test['Attorney/Representative'] = df_test['Attorney/Representative'].replace({'Y': True, 'N': False})

# Now convert to nullable boolean type
df_test['Attorney/Representative'] = df_test['Attorney/Representative'].astype("boolean")
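
The same Y/N conversion recurs for several columns and for both frames, so it could be factored into a helper. The sketch below is one possible variant, not the cells above: `yes_no_to_boolean` is a hypothetical name, and it uses `map` instead of `replace`, which means any value other than 'Y' or 'N' (for example 'U') also becomes missing rather than raising during the cast.

```python
import pandas as pd

def yes_no_to_boolean(series):
    """Map 'Y' -> True and 'N' -> False; every other value becomes <NA>."""
    return series.map({"Y": True, "N": False}).astype("boolean")

# Toy data for illustration only.
demo = pd.Series(["Y", "N", None, "U"])
flags = yes_no_to_boolean(demo)
```

Under that assumption, a loop such as `for df in (df_train, df_test): df[col] = yes_no_to_boolean(df[col])` would keep train and test in sync.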

**Alternative Dispute Resolution**

In [None]:
df_train['Alternative Dispute Resolution'] = df_train['Alternative Dispute Resolution'].replace("U", np.nan)

In [None]:
df_train['Alternative Dispute Resolution'] = df_train['Alternative Dispute Resolution'].replace({'Y': True, 'N': False})
df_train['Alternative Dispute Resolution'] = df_train['Alternative Dispute Resolution'].astype("boolean")

In [None]:
df_test['Alternative Dispute Resolution'] = df_test['Alternative Dispute Resolution'].replace("U", np.nan)
df_test['Alternative Dispute Resolution'] = df_test['Alternative Dispute Resolution'].replace({'Y': True, 'N': False})
df_test['Alternative Dispute Resolution'] = df_test['Alternative Dispute Resolution'].astype("boolean")

**Carrier Name**

In [None]:
# Replace unknowns with "others"?

**Claim Identifier**

In [None]:
# Completely remove duplicates in "Claim Identifier" from train
df_train = df_train[~df_train['Claim Identifier'].duplicated(keep=False)]
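
With `keep=False`, `duplicated` flags every member of a duplicated group, so no copy of a repeated identifier survives. A small sketch on made-up data contrasts this with `drop_duplicates`, which keeps the first copy by default:

```python
import pandas as pd

toy = pd.DataFrame({
    "Claim Identifier": [101, 102, 102, 103],
    "value": ["a", "b", "c", "d"],
})

# keep=False: both rows with identifier 102 are flagged and removed.
all_copies_removed = toy[~toy["Claim Identifier"].duplicated(keep=False)]

# Default behaviour of drop_duplicates: the first row with 102 is kept.
first_copy_kept = toy.drop_duplicates(subset="Claim Identifier")
```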

**IME-4 Count**

In [None]:
# Clip values to the range [0, 10]; NaN values are left unchanged
lower_bound = 0
upper_bound = 10
df_train['IME-4 Count'] = df_train['IME-4 Count'].clip(lower=lower_bound, upper=upper_bound)

**Covid indicator**

In [None]:
df_train['COVID-19 Indicator'] = df_train['COVID-19 Indicator'].replace({'Y': True, 'N': False})

# Now convert to nullable boolean type
df_train['COVID-19 Indicator'] = df_train['COVID-19 Indicator'].astype("boolean")

df_test['COVID-19 Indicator'] = df_test['COVID-19 Indicator'].replace({'Y': True, 'N': False})

# Now convert to nullable boolean type
df_test['COVID-19 Indicator'] = df_test['COVID-19 Indicator'].astype("boolean")

**Average Weekly Wage**

In [None]:
# Clip wages to the range [0, 30000]; NaN values are left unchanged
upper_bound = 3.0e+04
lower_bound = 0
df_train['Average Weekly Wage'] = df_train['Average Weekly Wage'].clip(lower=lower_bound, upper=upper_bound)

**Medical Fee Region**

In [None]:
df_train['Medical Fee Region'] = df_train['Medical Fee Region'].replace("UK", np.nan)
df_test['Medical Fee Region'] = df_test['Medical Fee Region'].replace("UK", np.nan)

**WCIO Part Of Body Code**

In [None]:
# Temporarily convert to numeric to apply absolute value, then convert back to category
df_train["WCIO Part Of Body Code"] = pd.to_numeric(df_train["WCIO Part Of Body Code"], errors='coerce').abs().astype('category')
df_test["WCIO Part Of Body Code"] = pd.to_numeric(df_test["WCIO Part Of Body Code"], errors='coerce').abs().astype('category')

print("Converted all 'WCIO Part Of Body Code' values to positive and restored as categorical.")

**Zip Code**

In [None]:
# Replace placeholder values with NaN in the original DataFrame
df_train.loc[df_train["Zip Code"].str.match(r'^0+$', na=False), "Zip Code"] = np.nan
print("Replaced placeholder values with NaN in 'Zip Code'.")
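
To make the placeholder pattern concrete: `^0+$` matches strings consisting only of zeros, so a real zip code with leading zeros is not affected. A sketch on made-up values:

```python
import pandas as pd

zips = pd.Series(["00000", "10001", "0", None, "00501"])

# True only for strings made up entirely of zeros; missing values count as False.
is_placeholder = zips.str.match(r"^0+$", na=False)

# Set the placeholders to missing and leave everything else untouched.
cleaned = zips.mask(is_placeholder)
```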

**Gender**

In [None]:
# Replace unknown with NaN
df_train['Gender'] = df_train['Gender'].replace("U", np.nan)
df_test['Gender'] = df_test['Gender'].replace("U", np.nan)

## Dates

## Imputing Missing Data

## Feature Creation
### General Strategy
#### Dates
- Extract day, month, year, and weekday from the accident date (the most important date). Weekday vs. weekend may affect the decision, since an accident on a weekend is less likely to be work related.
- Other dates: create a feature "days passed since accident date".
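
The weekday vs. weekend idea could be sketched as an extra boolean feature derived from the day of week. The variable names and the toy dates are assumptions for illustration; pandas numbers the days Monday=0 through Sunday=6.

```python
import pandas as pd

# Toy accident dates: a Saturday, a Monday, and a missing date.
dates = pd.Series(pd.to_datetime(["2022-01-01", "2022-01-03", None]))

day_of_week = dates.dt.dayofweek  # Monday=0 ... Sunday=6, NaN for missing dates

# Saturday (5) and Sunday (6) count as weekend; missing dates stay missing.
is_weekend = (day_of_week >= 5).astype("boolean").mask(dates.isna())
```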

#### Ages
- Create age categories (teen, young adult, adult, etc.)
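
One way to build such categories is `pd.cut`. The bin edges and the "senior" label below are assumptions chosen for illustration, not values from this notebook; they would need to be tuned to the data.

```python
import numpy as np
import pandas as pd

# Hypothetical bin edges (right-inclusive): (0, 19], (19, 29], (29, 64], (64, inf)
AGE_BINS = [0, 19, 29, 64, np.inf]
AGE_LABELS = ["teen", "young adult", "adult", "senior"]

ages = pd.Series([16, 25, 40, 70, np.nan])

# Missing ages stay missing in the resulting categorical column.
age_group = pd.cut(ages, bins=AGE_BINS, labels=AGE_LABELS)
```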

#### Carrier Name (?)
- Keep only carriers with 10k+ cases as separate categories.
- Group the rest as "other".
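
The carrier grouping could look roughly like the sketch below. `group_rare_categories` is a hypothetical helper, and the toy threshold of 3 stands in for the 10k+ cutoff mentioned above.

```python
import pandas as pd

def group_rare_categories(series, min_count, other_label="other"):
    """Keep categories with at least min_count rows; relabel the rest."""
    counts = series.value_counts()  # missing values are not counted
    frequent = counts[counts >= min_count].index
    # Keep frequent categories and missing values; everything else becomes other_label.
    return series.where(series.isin(frequent) | series.isna(), other_label)

# Toy data for illustration only.
carriers = pd.Series(["A", "A", "A", "B", "C", None])
grouped = group_rare_categories(carriers, min_count=3)
```

In practice the set of frequent carriers would presumably be computed on `df_train` only and then reused for `df_test`, so that both frames share the same categories.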

**Accident date**: Transform to four new features (day, month, year, weekday)

In [None]:
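# Split the accident date into calendar components (dayofweek: Monday=0 ... Sunday=6)
df_accident_time['Accident Year'] = df_accident_time['Accident Date'].dt.year
df_accident_time['Accident Month'] = df_accident_time['Accident Date'].dt.month
df_accident_time['Accident Day'] = df_accident_time['Accident Date'].dt.day
df_accident_time['Accident DayOfWeek'] = df_accident_time['Accident Date'].dt.dayofweek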

**All Dates**: Days passed since accident
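
A minimal sketch of the "days passed since accident" feature, assuming the date columns are already parsed as datetimes. The helper name, the generated column names, and the `Assembly Date` column in the toy data are hypothetical.

```python
import pandas as pd

def add_days_since_accident(df, date_cols, accident_col="Accident Date"):
    """Return a copy of df with one 'days since accident' column per date column."""
    out = df.copy()
    for col in date_cols:
        # Subtracting two datetime columns gives a timedelta; .dt.days extracts whole days.
        out[f"{col} Days Since Accident"] = (out[col] - out[accident_col]).dt.days
    return out

# Toy data for illustration only; the second row has a missing date.
toy = pd.DataFrame({
    "Accident Date": pd.to_datetime(["2021-03-01", "2021-06-15"]),
    "Assembly Date": pd.to_datetime(["2021-03-11", None]),
})
result = add_days_since_accident(toy, ["Assembly Date"])
```

Rows where either date is missing get NaN in the new column, which leaves them for the imputation step.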