### Step 1: ML Data Processing Pipeline
**Author:** Evan Yip <br>
**Purpose:** This notebook takes the raw CSV file, extracts and transforms the data features, and applies the scaling and one-hot encoding needed to prepare the data for either of the `step2_ml_pipeline*.ipynb` notebooks.

In [1]:
# Standard library imports
import os
import re

# Third-party library imports
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.compose import ColumnTransformer

##########################################################################################
#######       Set the current working directory to the root of your project     ##########
os.chdir(os.path.dirname(os.path.dirname(os.path.dirname(__file__))))
##########################################################################################

# Local imports
from utilities import data_processing as dp_utils
from utilities.drop_unbalanced_features import DropUnbalancedFeatures

# Setting pandas display options for maximum column display
pd.set_option('display.max_columns', None)


### Extract additional data features from raw data

Using the functions in `../utilities/data_processing.py`, we extract and transform the data to add columns such as aggregated DOSPERT scores, height in meters, weight in kg, BMI, ADI, and our calculated spinal risk score. The cell below is left commented out; uncomment it to regenerate `all_risk_processed.csv`, which the next section loads.

In [2]:
# # Loading the data
# df_features = pd.read_csv('./data/data_raw/RiskFinal_DATA_2024-02-05_0017_combined.csv')
# # Transforming features
# # The second argument is the column index where the DOSPERT questions start; adjust it if the raw data layout changes.
# df_features = dp_utils.get_dospert_scores(df_features, 60)
# df_features['height_m'] = df_features.height.apply(lambda h: dp_utils.get_height_value(value=h, unit='metric'))/100
# df_features['weight_kg'] = df_features.weight.apply(lambda w: dp_utils.get_weight_value(value=w, unit='metric'))
# df_features['bmi'] = df_features[['height_m', 'weight_kg']].apply(lambda row: dp_utils.compute_bmi(row.height_m, row.weight_kg), axis=1)
# df_features = dp_utils.get_age_ranges(df_features, age_column='age')
# df_features = dp_utils.get_location_information(df_features)
# df_features = dp_utils.get_adi_score(df_features)
# df_features = dp_utils.get_spinal_risk_score(df_features)

# # Dropping records based on low variance results
# df_features = dp_utils.manual_drop_records(df_features)

# # Write to csv
# df_features.to_csv('./data/data_processed/all_risk_processed.csv', index=False)
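
The helpers in `utilities/data_processing.py` are not shown in this notebook. As a rough illustration of the BMI step only, the sketch below assumes `dp_utils.compute_bmi` follows the standard definition (weight in kilograms divided by height in meters squared). The name `compute_bmi_sketch` is hypothetical, and the real helper may treat missing or invalid values differently.

```python
import math

def compute_bmi_sketch(height_m, weight_kg):
    """Hypothetical stand-in for dp_utils.compute_bmi: BMI = kg / m^2."""
    # Assumption: missing or non-positive inputs yield NaN rather than an error
    if height_m is None or weight_kg is None:
        return math.nan
    if math.isnan(height_m) or math.isnan(weight_kg) or height_m <= 0:
        return math.nan
    return weight_kg / height_m ** 2

# 70 kg at 1.75 m
bmi_example = compute_bmi_sketch(1.75, 70.0)
```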

### Loading the data features file

In [3]:
data = pd.read_csv('./data/data_processed/all_risk_processed.csv')
data.shape

(799, 86)

In [4]:
# Separate out the risk scenario questions for later
risk_df = data.filter(regex='exer_|work_').copy()

In [5]:
# Remove the irrelevant columns for our ML models
data_final = data.drop(['odi_1', 'odi_2', 'odi_3',
       'odi_4', 'odi_5', 'odi_6', 'odi_7', 'odi_8', 'odi_9', 'odi_10',
       'exer_50improv_1drop', 'exer_50improv_10drop', 'exer_50improv_50drop',
       'exer_50improv_90drop', 'att_check_1', 'exer_90improv_1drop',
       'exer_90improv_10drop', 'exer_90improv_50drop', 'exer_90improv_90drop',
       'exer_50pain_1death', 'exer_50pain_10death', 'exer_50pain_50death',
       'exer_90pain_1death', 'exer_90pain_10death', 'exer_90pain_50death',
       'work_50improv_1drop', 'work_50improv_10drop', 'work_50improv_50drop',
       'work_50improv_90drop', 'work_90improv_1drop', 'work_90improv_10drop',
       'work_90improv_50drop', 'work_50improv_1para', 'work_50improv_10para',
       'work_50improv_50para', 'work_50improv_90para', 'work_90improv_1para',
       'work_90improv_10para', 'att_check2', 'work_90improv_50para',
       'work_50improv_1death', 'work_50improv_10death',
       'work_50improv_50death', 'work_90improv_1death',
       'work_90improv_10death', 'work_90improv_50death', 'att_pass',
       'risk_1_complete','height', 'weight', 'record_id', 'risk_1_timestamp', 
       'zipcode','age_range', 'postal_code','state_code','city',
       'province', 'province_code','latitude', 'longitude', 'FIPS', 'fips', 'GISJOIN', 'state'], axis=1)

In [6]:
# Convert the ADI ranks to nullable integers; non-numeric entries are coerced to <NA>
data_final['ADI_NATRANK'] = pd.to_numeric(data_final['ADI_NATRANK'], errors='coerce').astype(float).astype('Int64')
data_final['ADI_STATERNK'] = pd.to_numeric(data_final['ADI_STATERNK'], errors='coerce').astype(float).astype('Int64')
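
To make the effect of `errors='coerce'` concrete, here is a small sketch on a toy column. The values, including the non-numeric code `"GQ"`, are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy ADI column: ranks stored as strings, one hypothetical non-numeric code, one missing value
adi_raw = pd.Series(["12", "87", "GQ", None])

# Non-numeric entries become NaN, then the column is cast to pandas' nullable integer dtype
adi_int = pd.to_numeric(adi_raw, errors="coerce").astype("Int64")
```

The nullable `Int64` dtype keeps the ranks as integers while still allowing missing entries, which the numeric imputer in the pipeline below is then expected to fill.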

### ML Processing pipeline

The next step is to one-hot encode the non-ordinal categorical variables (`ohe_cols`), and to impute and scale the ordinal categorical variables (`cat_cols`) and the numerical columns (`num_cols`).

Additionally, I apply a custom `DropUnbalancedFeatures` transformer to remove categorical columns that are highly unbalanced (e.g. 90% religious and 10% nonreligious).
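
The actual transformer lives in `utilities/drop_unbalanced_features.py` and is not reproduced here. The sketch below is one plausible way such a transformer could work, assuming that "unbalanced" means the most frequent value in a column covers more than `threshold` of the rows. The class name `DropUnbalancedSketch` and the attribute `keep_mask_` are hypothetical.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class DropUnbalancedSketch(BaseEstimator, TransformerMixin):
    """Hypothetical sketch: drop columns whose most common value exceeds `threshold`."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold

    def fit(self, X, y=None):
        X = np.asarray(X)
        keep = []
        for j in range(X.shape[1]):
            # Share of rows taken by the most frequent value in column j
            _, counts = np.unique(X[:, j], return_counts=True)
            keep.append(counts.max() / X.shape[0] <= self.threshold)
        self.keep_mask_ = np.array(keep)
        return self

    def transform(self, X):
        # Keep only the columns judged balanced during fit
        return np.asarray(X)[:, self.keep_mask_]

    def get_feature_names_out(self, input_features=None):
        # Lets an enclosing ColumnTransformer report the names of surviving columns
        if input_features is None:
            input_features = [f"x{j}" for j in range(self.keep_mask_.size)]
        return np.asarray(input_features, dtype=object)[self.keep_mask_]

# Column 0 is 90% ones (unbalanced); column 1 is split 50/50 (balanced)
X_demo = np.column_stack([[1] * 9 + [0], [0, 1] * 5])
sketch = DropUnbalancedSketch(threshold=0.8)
X_kept = sketch.fit_transform(X_demo)
names = sketch.get_feature_names_out(["mostly_ones", "balanced"])
```

Because the mask is learned in `fit` and only applied in `transform`, the same columns are dropped from any new data passed through the fitted pipeline.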

In [7]:
# Categorize the columns for the preprocessing pipeline
ohe_cols = ["religion", "ethnicity"]
cat_cols = ["sex", "income", "education", "prior_surg", "spin_surg", "succ_surg"]
num_cols = ["age", "odi_final", "bmi", "dospert_ethical", "dospert_financial", "dospert_health/safety", "dospert_recreational", "dospert_social", "height_m", "weight_kg", "ADI_NATRANK", "ADI_STATERNK"]

In [8]:
# define preprocessing pipeline
ohe_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore')),
    ('selector', DropUnbalancedFeatures(threshold=0.8, verbose=False))
])

cat_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value=0)),
    ('selector', DropUnbalancedFeatures(threshold=0.8, verbose=False)),
    ('scaler', StandardScaler())
])

num_pipe = Pipeline([
    ('imputer', IterativeImputer(random_state=52)),
    ('scaler', StandardScaler())
])


preprocessor = ColumnTransformer(
    transformers=[
        ('ohe', ohe_pipe, ohe_cols),
        ('cat', cat_pipe, cat_cols),
        ('num', num_pipe, num_cols)
    ])

In [9]:
# Apply the preprocessing pipeline
processed_final = preprocessor.fit_transform(data_final)
transformed_columns = preprocessor.get_feature_names_out(input_features=data_final.columns)

### Saving the processed data

For this step we save two separate data files. The `processed_final` output of the preprocessor, with the `spinal_risk_score` column appended, is saved as `./data/data_processed/ml_data_processed_final.csv` for use in `step2_ml_pipeline_risk_model.ipynb`. A second file that also contains the risk scenario questions is saved as `./data/data_processed/ml_data_w_risk_questions_processed_final.csv` for use in `step2_ml_pipeline_choice_model.ipynb`.

In [10]:
# Saving file for step2_ml_pipeline_risk_model.ipynb
processed_final_df = pd.DataFrame(processed_final, columns=transformed_columns)
processed_final_df['spinal_risk_score'] = data_final['spinal_risk_score'].values
processed_final_df.to_csv('./data/data_processed/ml_data_processed_final.csv', index=False)

In [12]:
# Saving file for step2_ml_pipeline_choice_model.ipynb
processed_w_risk_final_df = pd.concat([processed_final_df, risk_df], axis=1)
processed_w_risk_final_df.to_csv('./data/data_processed/ml_data_w_risk_questions_processed_final.csv', index=False)
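
A note on the `pd.concat(..., axis=1)` call above: it aligns rows by index label, not by position. In this notebook both frames trace back to the same freshly loaded CSV, so the labels should match. The toy sketch below shows what could happen if rows had been filtered from one frame beforehand, and how `reset_index(drop=True)` guards against it.

```python
import pandas as pd

# Left frame has a fresh RangeIndex; right frame has gaps, as after dropping rows
left = pd.DataFrame({"feature": [0.1, 0.2, 0.3]})
right = pd.DataFrame({"exer_q": [1, 0, 1]}, index=[0, 2, 5])

# Label alignment produces the union of both indexes and fills the gaps with NaN
misaligned = pd.concat([left, right], axis=1)

# Resetting the index makes the concat positional again
aligned = pd.concat([left, right.reset_index(drop=True)], axis=1)
```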

### Saving the fitted preprocessing pipeline

Finally, we save the fitted preprocessing pipeline to a pickle file so that new data can be transformed in exactly the same way before it is passed to our ML models.

In [13]:
# Saving preprocessing pipeline to pickle object
import pickle
with open('./data/ml_models/general_model_preprocessor.pkl', 'wb') as f:
    pickle.dump(preprocessor, f)
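
For the reverse direction, the sketch below shows the load-and-transform pattern on a toy pipeline written to a temporary directory. The toy pipeline and the file name `toy_preprocessor.pkl` are stand-ins for illustration. In practice you would open `./data/ml_models/general_model_preprocessor.pkl` and pass a DataFrame with the same columns as `data_final`.

```python
import os
import pickle
import tempfile

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the fitted preprocessor
train = pd.DataFrame({"age": [30.0, 40.0, 50.0]})
toy_pipe = Pipeline([("scaler", StandardScaler())]).fit(train)

with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, "toy_preprocessor.pkl")  # hypothetical file name
    with open(pkl_path, "wb") as f:
        pickle.dump(toy_pipe, f)
    with open(pkl_path, "rb") as f:
        loaded = pickle.load(f)

# Use transform, not fit_transform, so the training mean and scale are reused
new_rows = pd.DataFrame({"age": [40.0]})
scaled = loaded.transform(new_rows)
```

Pickled scikit-learn objects are best loaded with the same scikit-learn version that created them, and pickle files should only be loaded from sources you trust.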