# Gradient-Boosting Machine (XGBoost) - With Tuned Hyperparameters
### *Exploring the association between neoantigen-related variables and immune scores*
This notebook is the continuation of the `xgboost.ipynb` notebook, which details hyperparameter tuning by grid search with cross-validation to model interactions of selected X features on a subset of Y labels (based on the PCA work done by Caitlin from the GB team).

#### **Package and Raw Data Loading**
First, import the necessary packages and load the raw data table into a `pandas` DataFrame.



In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import sklearn as skl
import xgboost as xgb
import matplotlib.pyplot as plt

from warnings import simplefilter, filterwarnings
simplefilter(action="ignore", category=pd.errors.PerformanceWarning)
filterwarnings("ignore", category=UserWarning)
pd.set_option('display.max_columns', None)
%config InlineBackend.figure_format = 'retina'

Load up the cleaned-up dataset wrangled from MH's latest work.

In [None]:
# read in latest data
# use the 202409_new_excludedIHC_batch-duplicate-removed.tsv
df = pd.read_csv("../input-data/SA/202409_new_excludedIHC_batch-duplicate-removed.tsv",sep="\t")
print(df.shape)

# print rows 107, 226, 555 and 872 (first 25 columns only)
df.iloc[[107,226,555,872], :25]

In [None]:
# exclude 19 of the 22 Cibersort scores (leaving only 3), plus Bindea_full, Expanded_IFNg and S_PAM100HRD
df = df.drop(columns=['Bindea_full', 'Expanded_IFNg', 
        'C_Bcellsmemory','C_Plasmacells','C_TcellsCD8','C_TcellsCD4naive',
         'C_TcellsCD4memoryactivated','C_Tcellsfollicularhelper',
         'C_Tcellsregulatory(Tregs)','C_Tcellsgammadelta','C_NKcellsresting',
         'C_NKcellsactivated', 'C_Monocytes', 'C_MacrophagesM0',
         'C_MacrophagesM1','C_Dendriticcellsresting',
         'C_Dendriticcellsactivated', 'C_Mastcellsresting',
         'C_Mastcellsactivated','C_Eosinophils', 'C_Neutrophils', 'S_PAM100HRD'])

print(df.shape)
df.head()

#### **Data Preprocessing**

Decide all the clinical variables and neoantigen-related variables to keep in the X matrix (features).

1. The `Subtype` column already encodes the information in the `HR_status` and `HER_status` columns, so those two columns can be dropped. ***UPDATE: due to its lesser importance during the default XGBoost modeling, the `PAM50` column was dropped as well.***

2. `AgeGroup` is just a binned version of the `Age` column, so it is dropped as redundant.

3. Drop `FusionNeo_bestScore`, `FusionTransscript_Count`, `Fusion_T2NeoRate` columns as well as the `SNVindelNeo_IC50` and `SNVindelNeo_IC50Percentile` columns for now to reduce complexity. 

> **UPDATE 1: Exclude `TotalNeo_Count`, and include `Fusion_T2NeoRate` and `SNVindelNeo_IC50` columns. Also, rename `Fusion_T2NeoRate` to `FN/FT_Ratio`.**

> **UPDATE: put back `FusionNeo_bestScore` into the X variable set and rename it into `FusionNeo_bestIC50`**

In [None]:
# subset df into just features and the immune scores as Y variables
## DEPRECATED ## 
dfd = df.drop(columns = ['HR_status', 'HER_status', 'AgeGroup', 'FusionNeo_bestScore','FusionTransscript_Count', 'Fusion_T2NeoRate', 'SNVindelNeo_IC50', 'SNVindelNeo_IC50Percentile'])
print(dfd.shape)
dfd.head()

In [None]:
# let's drop all NaN for now and set col 'ID' as index
## DEPRECATED ##
dfx = dfd.dropna().set_index('ID')

print(dfx.shape)
dfx.head()

**Sanity Check:** Check that there are no duplicated index rows in the dataset.

In [None]:
## DEPRECATED ##
print(dfx.index[dfx.index.duplicated()].unique())
rows_dupe = list(dfx.index[dfx.index.duplicated()].unique())
rows_dupe

In [None]:
dfd2 = df.drop(columns = ['PAM50', 'HR_status', 'HER_status', 'AgeGroup', 'TotalNeo_Count', 'SNVindelNeo_IC50Percentile'])
print(dfd2.shape)
dfd2.head()

In [None]:
# let's drop all NaN for now and set col 'ID' as index
dfx2 = dfd2.dropna().set_index('ID')
print(dfx2.shape)
dfx2.head()

In [9]:
# rename the column `Fusion_T2NeoRate` to `FN/FT_Ratio` and `FusionNeo_bestScore` to `FusionNeo_bestIC50`
dfx2.rename(columns={'Fusion_T2NeoRate': 'FN/FT_Ratio'}, inplace=True)
dfx2.rename(columns={'FusionNeo_bestScore': 'FusionNeo_bestIC50'}, inplace=True)

In [None]:
dfx2.head()

In [None]:
print(dfx2.index[dfx2.index.duplicated()].unique())
rows_dupe = list(dfx2.index[dfx2.index.duplicated()].unique())
rows_dupe

> **NOTE**: ~~Initially we put `IMPRES` score as part of the target variable set, but it might be more informative to put this column as part of the X variable set, as they describe predictions on ICI response of patients (responders or not). In other words we are not exactly asking the question, "does neoantigen count correlate with patient response to ICI?" (though this is a valid question), so including IMPRES as Y variable might muddle our modeling. Similarly, `ESTIMATE` score is a hybrid tumor purity score that serves as a function of two components derived from ssGSEA: the immune score and the stromal score associated with a tumor sample. So, in this manner, this attribute is closer to being a relevant, potentially informative clinical variable to be used as a feature than a response/target variable.~~

> **UPDATE**: I have decided to keep these two columns as part of the Y variable set.

In [12]:
# dfw = dfdix.copy()

# # first pop these columns
# impres_col = dfw.pop('IMPRES')
# estimate_col = dfw.pop('ESTIMATE')

# # then insert it at the second position
# dfw.insert(7, 'IMPRES', impres_col)
# dfw.insert(8, 'ESTIMATE', estimate_col)

# dfw

Now we can encode categorical variables and integer variables accordingly so they are more amenable to machine learning algorithms. 

Modify the values in the Batch column and then check data types of the resulting cleaned up dataFrame.

In [None]:
# re-encode the Batch column into 1 or 2
## DEPRECATED ##
dfx['Batch'] = dfx['Batch'].apply(lambda x: 1 if x == 'Batch_1' else 2)
dfx.info()

We need to encode the two `object` columns into categorical numerics.

In [None]:
## DEPRECATED ##
# astype returns a new Series, so assign the result back
dfx['Subtype'] = dfx['Subtype'].astype('category')
dfx['PAM50'] = dfx['PAM50'].astype('category')

In [None]:
# astype returns a new Series, so assign the result back
dfx2['Subtype'] = dfx2['Subtype'].astype('category')
dfx2['Batch'] = dfx2['Batch'].apply(lambda x: 1 if x == 'Batch_1' else 2)
dfx2.info()

In [None]:
pd.set_option('display.max_rows', None)
print(dfx2.dtypes)

We need to encode the `object` columns into appropriate types. Change `Stage`, `Age`, `TumorGrade`, and `IMPRES` into `int64` as well as all `*_Count` columns because they are discrete variables. Change the `FN/FT_Ratio` into `float64`.

In [17]:
# change Stage and Age column into int64
## DEPRECATED ##
dfx['Stage'] = dfx['Stage'].astype('int64')
dfx['Age'] = dfx['Age'].astype('int64')
dfx['IMPRES'] = dfx['IMPRES'].astype('int64')


In [None]:
# cast the discrete columns (Stage, Age, TumorGrade, IMPRES, *_Count) to int64 and the ratio column to float64
dfx2['Stage'] = dfx2['Stage'].astype('int64')
dfx2['Age'] = dfx2['Age'].astype('int64')
dfx2['TumorGrade'] = dfx2['TumorGrade'].astype('int64')
dfx2['IMPRES'] = dfx2['IMPRES'].astype('int64')
dfx2['FusionNeo_Count'] = dfx2['FusionNeo_Count'].astype('int64')
dfx2['FusionTransscript_Count'] = dfx2['FusionTransscript_Count'].astype('int64')
dfx2['SNVindelNeo_Count'] = dfx2['SNVindelNeo_Count'].astype('int64')
dfx2['FN/FT_Ratio'] = dfx2['FN/FT_Ratio'].astype('float64')

print(dfx2.dtypes)

Now we can use Feature-engine's `OneHotEncoder()` to create a set of `k` dummy variables for `Subtype` (`PAM50` was dropped above, so it is no longer encoded).

**NOTE**: The encoded columns will be appended at the end of the dataFrame. 


In [None]:
from feature_engine.encoding import OneHotEncoder

encoder = OneHotEncoder(
    variables=['Subtype'],
    drop_last=False,
    )

encoder.fit(dfx2)
df_tmp = encoder.transform(dfx2)
df_tmp.head()
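
To make the `k`-dummy idea concrete, here is a minimal sketch using plain `pandas.get_dummies` rather than Feature-engine. The `toy` frame and its values are made up for illustration; only the column name `Subtype` is borrowed from the dataset.

```python
import pandas as pd

# Toy frame (hypothetical values) with one categorical column
toy = pd.DataFrame(
    {
        "Age": [52, 61, 47, 58],
        "Subtype": ["TNBC", "HR+/HER2-", "HR+/HER2+", "TNBC"],
    }
)

# One dummy column per category (k columns for k categories, none dropped),
# analogous in spirit to OneHotEncoder(drop_last=False)
dummies = pd.get_dummies(toy["Subtype"], prefix="Subtype", dtype=int)

# Replace the original categorical column with its dummy columns
encoded = pd.concat([toy.drop(columns="Subtype"), dummies], axis=1)
print(encoded)
```

Each row has exactly one dummy set to 1. Keeping all `k` columns is fine for tree-based models such as XGBoost; for linear models one column is usually dropped to avoid perfect collinearity.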

In [20]:
# Specify the encoded columns to shift
enc_cols = ['Subtype_HR+/HER2-', 'Subtype_HR+/HER2+', 'Subtype_TNBC', 'Subtype_HR-/HER2+']

# Pull the encoded columns out of df_tmp and drop the original Subtype column from dfx2
encoded_col_df = df_tmp[enc_cols]
dfx2_ = dfx2.drop(columns=['Subtype'])

# Specify the index where you want to reinsert the columns
insert_index = 1  # This will insert after the first column

# Reinsert the columns
for i, col in enumerate(encoded_col_df.columns):
    dfx2_.insert(insert_index + i, col, encoded_col_df[col])

Below is the categorically-encoded dataframe.

In [None]:
print(dfx2_.shape)
dfx2_.head()

And below is the original, unencoded dataframe.

In [None]:
print(dfx2.shape)
dfx2.head()

#### **Spearman Correlation Heatmap**

Before moving on with XGBoost, we can plot a Spearman correlation matrix between the clinical variables (X features) and the immune score variables (Y labels).

In [None]:
df_dropcat = dfx2.drop(columns=['Batch', 'Subtype'])
print(df_dropcat.shape)
df_dropcat.head()

In [None]:
# compute the Spearman correlation matrix for the heatmap
corr_df = df_dropcat.corr(method='spearman')
corr_df = corr_df.round(2)

# Create a mask for the upper triangle
mask = np.triu(np.ones_like(corr_df, dtype=bool))

plt.figure(figsize=(46, 36))

# Create the correlation matrix and represent it as a heatmap.
hm = sns.heatmap(corr_df, annot = False, cmap = 'RdBu_r', square = True, linewidths=0.5, center=0, mask=mask, cbar_kws={"shrink": .5})

# Get current labels
ylabels = hm.get_yticklabels()
xlabels = hm.get_xticklabels()

# Hide the first y-axis label and the last x-axis label
ylabels[0].set_visible(False)
xlabels[-1].set_visible(False)

# Rotate and align the tick labels
plt.setp(xlabels, rotation=45, ha='right')

# Define columns to highlight and their colors
highlight_cols = {
    "TotalNeo_Count": "crimson",
    "FusionNeo_Count": "olive",
    "SNVindelNeo_Count": "darkviolet"
}

# Change color of specific x-axis and y-axis labels
for label in xlabels:
    if label.get_text() in highlight_cols:
        label.set_color(highlight_cols[label.get_text()])
        label.set_fontweight('bold')

for label in ylabels:
    if label.get_text() in highlight_cols:
        label.set_color(highlight_cols[label.get_text()])
        label.set_fontweight('bold')

# Removes all ticks
hm.tick_params(left=False, bottom=False)

hm.set_title('Dataframe Correlation Heatmap', fontsize=30, x=0.45)

plt.show()
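
As a side note on what `method='spearman'` computes in the heatmap above: Spearman correlation is the Pearson correlation of the ranked data, which is why it captures monotone but non-linear relationships. A small sketch on synthetic data (the `toy` frame and its columns are hypothetical):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame(
    {
        "x": x,
        "monotone": np.exp(x),          # monotone but non-linear in x
        "noise": rng.normal(size=200),  # unrelated to x
    }
)

# Spearman correlation as computed by pandas
spearman = toy.corr(method="spearman")

# Same quantity computed by hand: rank every column, then take Pearson
pearson_on_ranks = toy.rank().corr(method="pearson")

print(spearman.round(3))
print(np.allclose(spearman, pearson_on_ranks))
```

Because ranks are unaffected by monotone transformations, the Spearman heatmap would look the same before and after a Yeo-Johnson transformation of the columns.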


#### **Distributions of Neoantigen Counts by Subtype and PAM50 Classes**

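Then, plot the distributions of `FusionNeo_Count`, `SNVindelNeo_Count`, and `TotalNeo_Count` as boxplots grouped by `Subtype` and by `PAM50`. These plots use the earlier `dfx` dataframe, which still contains the `PAM50` and `TotalNeo_Count` columns.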

In [None]:
# Create a figure with a 3x2 grid of subplots
fig, axs = plt.subplots(nrows=3, ncols=2, figsize=(12, 24))

# Flatten the 2D array of axes for easier indexing
axs = axs.flatten()

# List of y-variables and their corresponding titles
y_vars = ['FusionNeo_Count', 'SNVindelNeo_Count', 'TotalNeo_Count']
titles = [
    'Distribution of log(Fusion Neoantigen Count)',
    'Distribution of log(SNV & Indel Neoantigen Count)',
    'Distribution of log(Total Neoantigen Count)'
]

# Plot Subtype boxplots on the left side
for i, (y_var, title) in enumerate(zip(y_vars, titles)):
    sns.boxplot(x="Subtype", y=y_var, data=dfx, hue="Subtype", ax=axs[i*2], palette="Set3")
    axs[i*2].set_title(f"{title} by Subtype")

# Plot PAM50 boxplots on the right side
for i, (y_var, title) in enumerate(zip(y_vars, titles)):
    sns.boxplot(x="PAM50", y=y_var, data=dfx, hue="PAM50", ax=axs[i*2+1], palette="colorblind")
    axs[i*2+1].set_title(f"{title} by PAM50")

# Adjust the spacing between subplots
plt.tight_layout()

# Display the figure
plt.show()

In [None]:
# Create a figure and axis objects
fig, axs = plt.subplots(nrows=3, figsize=(8, 12))

# Plot the first subplot
sns.kdeplot(data=dfx, x="FusionNeo_Count", hue="PAM50", ax=axs[0], common_norm=False)
axs[0].set_title("Distribution of FusionNeo_Count by PAM50")

# Plot the second subplot
sns.kdeplot(data=dfx, x="SNVindelNeo_Count", hue="PAM50", ax=axs[1], common_norm=False)
axs[1].set_title("Distribution of SNVindelNeo_Count by PAM50")

# Plot the third subplot
sns.kdeplot(data=dfx, x="TotalNeo_Count", hue="PAM50", ax=axs[2], common_norm=False)
axs[2].set_title("Distribution of TotalNeo_Count by PAM50")

# Adjust the spacing between subplots
plt.subplots_adjust(hspace=0.5)

# Display the figure
plt.show()

Nothing stands out as interesting here. In a way this might be good, because it suggests there probably isn't strong collinearity between the neoantigen count X variables and the subtype X variables.

#### **Yeo-Johnson Transformation and Z-Score Standardization**
The Yeo-Johnson (YJ) transformation is an extension of the Box-Cox transformation and belongs to the power-transformation family. Unlike Box-Cox, which requires strictly positive inputs, YJ also handles zero and negative values. Like Box-Cox, it aims to stabilize variance, make the data more symmetric, and bring it closer to a normal distribution.

Instead of doing a simple log transformation, we will use the YJ transformation.
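
For reference, here is a minimal sketch of the YJ formula for a fixed `lambda`, checked against `scipy.stats.yeojohnson`. The helper name `yeo_johnson` and the input values are my own for illustration; in practice the transformer estimates `lambda` per column from the data.

```python
import numpy as np
from scipy import stats


def yeo_johnson(x, lam):
    """Yeo-Johnson transform of a 1-D array for a fixed lambda."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    pos = x >= 0

    # Branch for non-negative values
    if abs(lam) > 1e-12:
        out[pos] = ((x[pos] + 1.0) ** lam - 1.0) / lam
    else:
        out[pos] = np.log1p(x[pos])

    # Branch for negative values
    if abs(lam - 2.0) > 1e-12:
        out[~pos] = -(((-x[~pos] + 1.0) ** (2.0 - lam)) - 1.0) / (2.0 - lam)
    else:
        out[~pos] = -np.log1p(-x[~pos])
    return out


x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
lam = 0.5
manual = yeo_johnson(x, lam)
reference = stats.yeojohnson(x, lmbda=lam)
print(np.allclose(manual, reference))
```

Note that for non-negative inputs and `lambda = 0` the transform reduces to `log(x + 1)`, so the simple log transformation is a special case.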

**NOTE**: ~~MH also performed z-score standardization following YJ transformation but that might be unnecessary.~~ **UPDATE:** ~~I changed my mind. Z-score helped center the data which would be useful for some machine learning algorithms. Let's do that.~~

In [None]:
dfc = df_dropcat.copy()
print(dfc.shape)
dfc.head()

First define a plotting function.

In [28]:
# visualise distributions
# Determine the number of rows and columns for the subplot grid
nrows = 10
ncols = 13

#define the plotting function
def visualize_distribution(df, naming_class, colr):
    # Create a figure and a grid of subplots
    # Flatten the axes array for easy iteration
    fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=(30, 22))
    axes = axes.flatten()

    # Plot histograms for each column
    for i, column in enumerate(df.columns):
        sns.histplot(df[column], kde=False, ax=axes[i], color=colr)
        axes[i].set_title(column)
        axes[i].set_xlabel('')
        axes[i].set_ylabel('')

    # Hide any remaining empty subplots
    for j in range(i + 1, len(axes)):
        fig.delaxes(axes[j])

    # Adjust the layout
    plt.tight_layout()
    plt.subplots_adjust(top=0.92) 
    plt.suptitle(f'{naming_class} Distributions of the Dataset', fontsize=20, fontweight='bold')
    # plt.savefig(f'Distribution_before.png',dpi=300)
    plt.show()

##### Original Unscaled Distribution

In [None]:
#execute the function
visualize_distribution(dfc, 'Original', 'green')

In [30]:
from feature_engine.transformation import YeoJohnsonTransformer
# Apply Yeo-Johnson transformation to each numeric column
yjt = YeoJohnsonTransformer()
yjt.fit(dfc)
df_transformed = yjt.transform(dfc)

##### Power-Transformed Distribution (YJ)

In [None]:
#execute the function
visualize_distribution(df_transformed, 'YJ-Transformed', 'orange')
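
To complement the visual check, the effect of the transformation can also be quantified with sample skewness. The sketch below uses synthetic log-normal data and SciPy's `yeojohnson` directly (not the notebook's dataframe), so the numbers are only illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Strongly right-skewed synthetic data (hypothetical stand-in for a count-like column)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

# With lmbda left unset, SciPy estimates lambda by maximum likelihood
transformed, lam_hat = stats.yeojohnson(skewed)

skew_before = stats.skew(skewed)
skew_after = stats.skew(transformed)
print(f"lambda={lam_hat:.3f}, skew before={skew_before:.2f}, skew after={skew_after:.2f}")
```

A skewness close to 0 after the transformation indicates a roughly symmetric distribution; the same two calls could be applied column by column to the real data.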

In [32]:
from scipy import stats
# Apply z-score standardization to the YJ-transformed data, one column at a time
df_tfz = df_transformed.apply(stats.zscore)

##### Power-Transformed, Z-Scaled Distribution (YJ-Z)

In [None]:
#execute the function
visualize_distribution(df_tfz, 'YJZ-Transformed', 'purple')

In [34]:
from sklearn.preprocessing import StandardScaler

# Initialize StandardScaler
scaler = StandardScaler()

# Fit StandardScaler and transform the data
scaled_array = scaler.fit_transform(df_transformed)

# Convert the scaled array back to a DataFrame
# This preserves the original column names and index
df_scaled = pd.DataFrame(scaled_array, columns=df_transformed.columns, index=df_transformed.index)

##### Power-Transformed, Standard-Scaled Distribution (YJ-SC)

In [None]:
visualize_distribution(df_scaled, 'YJ-transformed-standardized', 'red')

*Conclusion*: After the power transformation and standardization, the distributions look variance-stabilized and centered around 0. Let us proceed with the dataset splitting and then data scaling using YJ and Z-score.

**UPDATE:** Instead of SciPy's `zscore`, we use scikit-learn's `StandardScaler`, which fits more naturally into a scikit-learn-style Pipeline.
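
A quick sketch showing that the two approaches give the same numbers on a single dataset (both use the population standard deviation, `ddof=0`). The `toy` frame is synthetic and only for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
toy = pd.DataFrame(rng.normal(loc=5, scale=2, size=(50, 3)), columns=list("abc"))

# Column-wise z-score with SciPy
z_scipy = toy.apply(stats.zscore)

# Same standardization with scikit-learn
z_sklearn = StandardScaler().fit_transform(toy)

print(np.allclose(z_scipy.to_numpy(), z_sklearn))
```

The practical difference is that `StandardScaler` stores the fitted mean and standard deviation, so the training-set statistics can be reused on the test set and reversed with `inverse_transform`.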

#### **Split Dataset with `train_test_split`**

Going back to the original dataframe, we need to split the dataset before modeling to avoid information leakage, then transform the data accordingly as above. First, load the cleaned up dataframe.

In [None]:
pd.set_option('display.max_rows', 8)
dfx2_

Now list all the clinical features that would be the X variables.

In [37]:
clin_features = ['Age', 'TumorGrade', 'TumourSize', 'Subtype_HR+/HER2-', 'Subtype_HR+/HER2+', 'Subtype_TNBC', 'Subtype_HR-/HER2+', 'FusionNeo_Count', 'FusionNeo_bestIC50', 'FN/FT_Ratio', 'SNVindelNeo_Count', 'SNVindelNeo_IC50']

Grab the X features that we will use for the modeling from the cleaned up dataframe.

In [None]:
# define X features; use the clin_features list generated before
X = dfx2_[clin_features]
X

Now grab the Y targets (do this as a whole, but we will train on each column individually later).

In [None]:
# Now get the Y variable set
cols_y = [col for col in dfx2_.drop(columns=['Batch', 'Stage', 'FusionTransscript_Count']).columns if col not in X.columns]
Y = dfx2_[list(cols_y)]
Y

Now we perform train test split on the X and Y variables.

In [40]:
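# Perform train-test split
# (import the function explicitly; `import sklearn as skl` alone does not guarantee that skl.model_selection is loaded)
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)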

In [None]:
X_train.head()

In [None]:
X_test.head()

In [None]:
X_train.info()

In [None]:
X_test.info()

As we don't want to transform all the X columns (because some of them are discrete numerical data and some of them are one-hot encoded categorical variables), we need to specify the columns to transform.

In [45]:
X_vars_to_transform = ['TumourSize', 'FusionNeo_Count', 'FusionNeo_bestIC50', 'FN/FT_Ratio', 'SNVindelNeo_Count', 'SNVindelNeo_IC50']

First, apply the Yeo-Johnson transformation on the split datasets.

In [46]:
from feature_engine.transformation import YeoJohnsonTransformer

# Initialize Yeo-Johnson transformer for features to transform
yj_transformer_X = YeoJohnsonTransformer(variables=X_vars_to_transform)

# Initialize Yeo-Johnson transformer for target
yj_transformer_Y = YeoJohnsonTransformer()

# Fit and transform the training data (features)
X_train_yj = yj_transformer_X.fit_transform(X_train)

# Transform the test data using the fitted transformer (features)
X_test_yj = yj_transformer_X.transform(X_test)

# Fit and transform the training data (targets)
Y_train_yj = yj_transformer_Y.fit_transform(Y_train)

# Transform the test data using the fitted transformer (targets)
Y_test_yj = yj_transformer_Y.transform(Y_test)
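
The fit-on-train, transform-on-test pattern used above is what prevents information leakage. Here is a minimal sketch of the same pattern with a plain `StandardScaler` on synthetic data; the names `X_toy`, `X_tr`, `X_te`, and `toy_scaler` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_toy = rng.normal(loc=10.0, scale=3.0, size=(100, 2))

X_tr, X_te = train_test_split(X_toy, test_size=0.2, random_state=42)

# Statistics are estimated from the training rows only
toy_scaler = StandardScaler().fit(X_tr)

X_tr_sc = toy_scaler.transform(X_tr)
# Test rows reuse the training statistics; nothing is re-estimated here
X_te_sc = toy_scaler.transform(X_te)

print("train mean:", X_tr_sc.mean(axis=0).round(3))
print("test mean: ", X_te_sc.mean(axis=0).round(3))
```

The training means are zero by construction, while the test means are only close to zero. That small mismatch is expected and is the honest picture of how the model will see unseen data.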


In [None]:
X_train_yj.head()

In [None]:
Y_test.head()

In [49]:
Y_test_inversed = yj_transformer_Y.inverse_transform(Y_test_yj)

In [None]:
Y_test_inversed.head()

**DEPRECATED** ~~Then, apply z-score standardization.~~

In [51]:
# from scipy.stats import zscore
# # Apply Z-score normalization to Yeo-Johnson transformed columns
# for col in X_vars_to_transform:
#     X_train_tf[col] = zscore(X_train_tf[col])
#     X_test_tf[col] = zscore(X_test_tf[col])

# X_train_tfz = X_train_tf.copy()
# X_test_tfz = X_test_tf.copy()

# # using lambda function to apply z-score to each column
# zs = lambda x : zscore(x)
# Y_train_tfz = Y_train_tf.apply(zs)
# Y_test_tfz = Y_test_tf.apply(zs)

Let's use `StandardScaler` instead (but wrap it with `feature_engine`'s wrapper).

In [52]:
from sklearn.preprocessing import StandardScaler
from feature_engine.wrappers import SklearnTransformerWrapper

# select variables to transform
cols_X = X_train_yj.columns.tolist()
cols_X = [col for col in cols_X if col not in ['Age', 'TumorGrade', 'Subtype_HR+/HER2-', 'Subtype_HR+/HER2+', 'Subtype_TNBC', 'Subtype_HR-/HER2+']]

# set up the wrapper with the StandardScaler
scaler_X = SklearnTransformerWrapper(transformer = StandardScaler(),
                                    variables = cols_X)

# fit the wrapper + StandardScaler
scaler_X.fit(X_train_yj)

# transform the data
X_train_yjz = scaler_X.transform(X_train_yj)
X_test_yjz = scaler_X.transform(X_test_yj)


In [None]:
X_train_yj

In [None]:
X_train_yjz

Now test if we can inverse-transform the Scaler step.

In [None]:
X_train_yj_inversed = scaler_X.inverse_transform(X_train_yjz)
X_train_yj_inversed

In [None]:
Y_train_yj

In [None]:
# select variables to transform for Y
cols_Y = Y_train_yj.columns.tolist()

# set up the wrapper with the StandardScaler
scaler_Y = SklearnTransformerWrapper(transformer = StandardScaler(),
                                    variables = cols_Y)

# fit the wrapper + StandardScaler
scaler_Y.fit(Y_train_yj)

# transform the data
Y_train_yjz = scaler_Y.transform(Y_train_yj)
Y_test_yjz = scaler_Y.transform(Y_test_yj)
Y_train_yjz

#### **Create a Pipeline from `feature_engine` Package**
This would enable easy inverse transform steps for both X and Y.

In [58]:
from feature_engine.pipeline import Pipeline
from feature_engine.transformation import YeoJohnsonTransformer
from feature_engine.wrappers import SklearnTransformerWrapper
from sklearn.preprocessing import StandardScaler

X_vars_to_transform = ['TumourSize', 'FusionNeo_Count', 'FusionNeo_bestIC50', 'FN/FT_Ratio', 'SNVindelNeo_Count', 'SNVindelNeo_IC50']

# select variables to scale
scale_cols_X = X_train.columns.tolist()
scale_cols_X = [col for col in scale_cols_X if col not in ['Age', 'TumorGrade', 'Subtype_HR+/HER2-', 'Subtype_HR+/HER2+', 'Subtype_TNBC', 'Subtype_HR-/HER2+']]

# Create the pipeline
preprocess_pipeline_X = Pipeline([
    ('yeo_johnson', YeoJohnsonTransformer(variables=X_vars_to_transform)),
    ('scaler', SklearnTransformerWrapper(transformer = StandardScaler(), variables = scale_cols_X))
])

# Fit the pipeline to the training data
preprocess_pipeline_X.fit(X_train)

# Transform the training data
X_train_yjsc = preprocess_pipeline_X.transform(X_train)
# Transform the test data
X_test_yjsc = preprocess_pipeline_X.transform(X_test)


In [None]:
X_train

In [None]:
X_train_yjz

In [None]:
X_train_yjsc

In [None]:
## Try inverse transform on the pipeline
X_train_inv = preprocess_pipeline_X.inverse_transform(X_train_yjsc)
X_train_inv

The Pipeline worked! Now we can do the same with Y.

In [63]:
from feature_engine.pipeline import Pipeline
from feature_engine.transformation import YeoJohnsonTransformer
from feature_engine.wrappers import SklearnTransformerWrapper
from sklearn.preprocessing import StandardScaler

# select variables to scale
scale_cols_Y = Y_train.columns.tolist()

# Create the pipeline
preprocess_pipeline_Y = Pipeline([
    ('yeo_johnson', YeoJohnsonTransformer()),
    ('scaler', SklearnTransformerWrapper(transformer = StandardScaler(), variables = scale_cols_Y))
])

# Fit the pipeline to the training data
preprocess_pipeline_Y.fit(Y_train)

# Transform the training data
Y_train_yjsc = preprocess_pipeline_Y.transform(Y_train)
# Transform the test data
Y_test_yjsc = preprocess_pipeline_Y.transform(Y_test)

In [None]:
Y_test

In [None]:
Y_test_yjz

In [None]:
Y_test_yjsc

In [None]:
## try inverse transform
Y_test_inv = preprocess_pipeline_Y.inverse_transform(Y_test_yjsc)
Y_test_inv

#### **XGBoost Learning**

Time to test XGBoost. Select the `ESTIMATE` column as the first target/label (`y`) variable.

In [None]:
# Extract ESTIMATE column as y label
y_target = 'ESTIMATE'
Y_train_targ = Y_train_yjsc[y_target]
Y_test_targ = Y_test_yjsc[y_target]

print(Y_train_targ.shape)
print(Y_test_targ.shape)

With `y` label selected, create DMatrix objects for XGBoost corresponding to the `X_train_yjsc` and `X_test_yjsc` sets.

In [69]:
# Create DMatrix for XGBoost; enable_categorical=True if there are categorical encoded columns
dtrain = xgb.DMatrix(X_train_yjsc, label=Y_train_targ, enable_categorical=True)
dtest = xgb.DMatrix(X_test_yjsc, label=Y_test_targ, enable_categorical=True)

# XGBoost parameters
params = {
    'objective': 'reg:squarederror',
    'eval_metric': 'rmse'
}

# Train the model
num_rounds = 150
model = xgb.train(params, dtrain, num_rounds)

# Make predictions
y_transformed_preds = model.predict(dtest)

In [None]:
len(y_transformed_preds)

~~Now that XGBoost training and modeling have been completed, we can reverse the Z-score scaling and YJ transformation in the order they were applied. Make sure to use the pre-scaled mean and standard deviation for the z-score reversal.~~

Use the Pipeline object to inverse transform. Make sure to build a dummy DataFrame for the predicted Y array first, because the Pipeline object was fitted on a DataFrame containing all Y columns.

In [71]:
# mean_y = Y_train_tf[y_index].mean()
# print(mean_y)
# std_y = Y_train_tf[y_index].std()
# print(std_y)
# # Step 1: Reverse Z-score normalization
# preds_rev_z = (preds_tf * std_y) + mean_y
# preds_rev_z

In [None]:
# Create a DataFrame with the same columns as the original y used in fit
dummy_y = pd.DataFrame(0, index=X_test_yjsc.index, columns=Y_test_yjsc.columns)
dummy_y[y_target] = y_transformed_preds
dummy_y

In [None]:
# apply inverse transform
dummy_y_inv = preprocess_pipeline_Y.inverse_transform(dummy_y)
dummy_y_inv

In [None]:
# Extract the relevant target column
y_preds = dummy_y_inv[y_target].to_numpy()
y_preds
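
The dummy-frame trick above works because each column is transformed independently, so the placeholder zeros in the other columns do not affect the target column. Below is a small sketch of the same idea using scikit-learn's `PowerTransformer` (Yeo-Johnson plus standardization) as a stand-in for the Feature-engine pipeline; the toy targets `target_a` and `target_b` are made up.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
Y_toy = pd.DataFrame(
    {"target_a": rng.lognormal(size=60), "target_b": rng.normal(size=60)}
)

# Stand-in for the YJ + scaling pipeline: column-wise Yeo-Johnson, then standardization
pt = PowerTransformer(method="yeo-johnson", standardize=True).fit(Y_toy)
Y_scaled = pd.DataFrame(pt.transform(Y_toy), columns=Y_toy.columns, index=Y_toy.index)

# Pretend these are model predictions for one target, on the transformed scale
preds_scaled = Y_scaled["target_a"].to_numpy()

# Placeholder frame: zeros everywhere except the target column
dummy = pd.DataFrame(0.0, index=Y_toy.index, columns=Y_toy.columns)
dummy["target_a"] = preds_scaled

# Inverse-transform the whole frame, then keep only the target column
recovered = np.asarray(pt.inverse_transform(dummy))[:, 0]
print(np.allclose(recovered, Y_toy["target_a"]))
```

An alternative design would be to fit one transformer per target column, which avoids the placeholder frame at the cost of keeping track of several fitted objects.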

#### **Model Evaluation**

Import Scikit-Learn's `metrics` tools to evaluate the machine learning outputs. 

In [None]:
# Evaluate
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

rmse = np.sqrt(mean_squared_error(Y_test[y_target], y_preds))
print(f"RMSE: {rmse}")

mae = mean_absolute_error(Y_test[y_target], y_preds)
r2 = r2_score(Y_test[y_target], y_preds)

print(f"Mean Absolute Error: {mae}")
print(f"R-squared Score: {r2}")
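
For reference, the three metrics can be written out by hand. This sketch uses a tiny made-up pair of arrays and compares the manual formulas with the scikit-learn functions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Hypothetical true values and predictions
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

residuals = y_true - y_pred

# RMSE: square root of the mean squared residual
rmse_manual = np.sqrt(np.mean(residuals ** 2))
# MAE: mean absolute residual
mae_manual = np.mean(np.abs(residuals))
# R^2: 1 - (residual sum of squares / total sum of squares)
r2_manual = 1.0 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(np.isclose(rmse_manual, np.sqrt(mean_squared_error(y_true, y_pred))))
print(np.isclose(mae_manual, mean_absolute_error(y_true, y_pred)))
print(np.isclose(r2_manual, r2_score(y_true, y_pred)))
```

RMSE and MAE are in the units of the target, which is why the predictions are inverse-transformed before scoring; R-squared is unitless and can be negative when the model does worse than predicting the mean.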

Keep in mind that these metrics come from a single `train_test_split` of the full dataset (`random_state=42`) and from a model trained with default hyperparameters.
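
One way to see how much a single split can move the metrics is K-fold cross-validation. The sketch below is only an illustration: it uses synthetic data from `make_regression` and scikit-learn's `GradientBoostingRegressor` as a stand-in regressor, not the XGBoost model or the dataset from this notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (hypothetical); the notebook's X and y would take its place
X_toy, y_toy = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

# Stand-in gradient-boosting regressor
model_cv = GradientBoostingRegressor(n_estimators=50, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# scikit-learn maximizes scores, so RMSE is reported with a negative sign
neg_rmse = cross_val_score(
    model_cv, X_toy, y_toy, cv=cv, scoring="neg_root_mean_squared_error"
)
rmse_per_fold = -neg_rmse

print(f"RMSE per fold: {rmse_per_fold.round(2)}")
print(f"mean={rmse_per_fold.mean():.2f}, sd={rmse_per_fold.std(ddof=1):.2f}")
```

The spread across folds gives a rough sense of how far a single-split RMSE could be from the typical value.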

Regardless, let's look at the built-in feature importance plots as well as SHAP-analyzed feature importance plots.

In [None]:
from xgboost import plot_importance
fig, ax = plt.subplots(figsize=(10, 10))
plot_importance(model, ax=ax)
plt.show()

In [None]:
import shap
# Create the SHAP explainer
explainer = shap.TreeExplainer(model)

# Calculate SHAP values
shap_values = explainer.shap_values(X_train_yjsc)

# Summary plot
shap.summary_plot(shap_values, X_train_yjsc)

In [None]:
import shap

# Create the SHAP explainer
explainer = shap.TreeExplainer(model)

# Calculate SHAP values on the test set
shap_values_test = explainer.shap_values(X_test_yjsc)

# Summary plot for test set
shap.summary_plot(shap_values_test, X_test_yjsc)

# Optionally, you can still calculate and compare with training set
shap_values_train = explainer.shap_values(X_train_yjsc)

# Compare summary plots
import matplotlib.pyplot as plt

plt.figure(figsize=(20, 10))
plt.subplot(121)
shap.summary_plot(shap_values_test, X_test_yjsc, plot_type="bar", show=False)
plt.title("Test Set SHAP Values")
plt.subplot(122)
shap.summary_plot(shap_values_train, X_train_yjsc, plot_type="bar", show=False)
plt.title("Train Set SHAP Values")
plt.tight_layout()
plt.show()

In [None]:
# Feature importance based on mean absolute SHAP values (test set)
# use shap_values_test so the numbers match the "test set" label printed below
feature_importance = np.abs(shap_values_test).mean(axis=0)
feature_importance_df = pd.DataFrame(list(zip(X_test_yjsc.columns, feature_importance)),
                                     columns=['feature', 'importance'])
feature_importance_df = feature_importance_df.sort_values('importance', ascending=False)

print("Top 10 features by SHAP importance based on the test set:")
print(feature_importance_df.head(10))

# Plot feature importance
# DataFrame.plot creates its own figure, so set the size there instead of calling plt.figure first
ax = feature_importance_df.plot(x='feature', y='importance', kind='bar', figsize=(10, 6))
ax.set_title('Feature Importance (Mean Absolute SHAP Values)')
ax.set_xlabel('Features')
ax.set_ylabel('Mean |SHAP Value|')
ax.tick_params(axis='x', rotation=90)
plt.tight_layout()
plt.show()

# SHAP dependence plot for the most important feature
# most_important_feature = feature_importance_df.iloc[0]['feature']
# shap.dependence_plot(most_important_feature, shap_values, X_train_tfz)

In [None]:
# Create the SHAP explainer
explainer = shap.TreeExplainer(model)

# Calculate SHAP values
shap_values = explainer(X_train_yjsc)

# Perform clustering
clust = shap.utils.hclust(X_train_yjsc, Y_train_targ)

# Create the bar plot with clustering
shap.plots.bar(shap_values, clustering=clust, clustering_cutoff=1)

In [None]:
# Create the SHAP explainer
explainer = shap.TreeExplainer(model)

# Calculate SHAP values
shap_values = explainer(X_test_yjsc)

# Perform clustering
clust = shap.utils.hclust(X_test_yjsc, Y_test_targ)

# Create the bar plot with clustering
shap.plots.bar(shap_values, clustering=clust, clustering_cutoff=1)

In [None]:
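# Create SHAP explainers (both wrap the same model, which was trained on transformed features)
explainer_original = shap.TreeExplainer(model)
explainer_transformed = shap.TreeExplainer(model)

# Calculate SHAP values
# NOTE: the model never saw untransformed inputs, so the "original" values below
# are only a rough comparison and should be interpreted with caution
shap_values_original = explainer_original(X_test)
shap_values_transformed = explainer_transformed(X_test_yjsc)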

# Function to plot and compare SHAP summary plots
def compare_shap_plots(shap_values_orig, shap_values_trans, X_orig, X_trans):
    plt.figure(figsize=(60, 30))
    plt.subplot(121)
    shap.summary_plot(shap_values_orig, X_orig, plot_type="bar", show=False)
    plt.title("SHAP Values - Original Features")

    plt.subplot(122)
    shap.summary_plot(shap_values_trans, X_trans, plot_type="bar", show=False)
    plt.title("SHAP Values - Transformed Features")

    plt.tight_layout()
    plt.subplots_adjust(bottom=0.2)
    
    plt.show()

# Compare SHAP plots
compare_shap_plots(shap_values_original, shap_values_transformed, X_test, X_test_yjsc)

# Function to get feature importance from SHAP values
def get_feature_importance(shap_values, feature_names):
    feature_importance = np.abs(shap_values.values).mean(0)
    feature_importance_df = pd.DataFrame(list(zip(feature_names, feature_importance)),
                                         columns=['feature', 'importance'])
    return feature_importance_df.sort_values('importance', ascending=False)

# Get and compare feature importances
importance_original = get_feature_importance(shap_values_original, X_test.columns)
importance_transformed = get_feature_importance(shap_values_transformed, X_test.columns)

# print("Top 10 features (Original):")
# print(importance_original.head(10))
# print("\nTop 10 features (Transformed):")
# print(importance_transformed.head(10))

# For specific feature analysis, use dependence plots
# shap.dependence_plot(importance_original.iloc[2]['feature'], shap_values_original.values, X_train)
# shap.dependence_plot(importance_transformed.iloc[2]['feature'], shap_values_transformed.values, X_train_tfz)

#### **Packaging the ML Workflow for Iterative Analyses**

The code above runs XGBoost through its native API. Let's rerun it through the `scikit-learn` API. The two interfaces wrap the same training code, so expect similar output as long as the hyperparameters and the number of boosting rounds match. Many of the steps will also be packaged into functions to allow iterative analyses on different Y target columns.

In [83]:
### DEPRECATED ###
# dfc_updated = dfc.drop(columns=['Stage', 'FusionTransscript_Count', 'Age'])
# clin_features = ['TumorGrade', 'TumourSize', 'FusionNeo_bestIC50', 'FusionNeo_Count', 'FN/FT_Ratio', 'SNVindelNeo_Count', 'SNVindelNeo_IC50']
# # define X features; use the clin_features list generated before
# X = dfc_updated[clin_features]

# # Now get the Y variable set
# cols_y = [col for col in dfc_updated.columns if col not in X.columns]
# Y = dfc_updated[list(cols_y)]

Remove the `Batch`, `Stage`, and `FusionTransscript_Count` columns; `Stage` and `FusionTransscript_Count` were deemed less important in the earlier SHAP analysis.

In [None]:
dfc = dfx2_.drop(columns=['Batch', 'Stage', 'FusionTransscript_Count'])

clin_features = ['Age', 'TumorGrade', 'TumourSize', 'Subtype_HR+/HER2-', 'Subtype_HR+/HER2+', 'Subtype_TNBC', 'Subtype_HR-/HER2+', 'FusionNeo_Count', 'FusionNeo_bestIC50', 'FN/FT_Ratio', 'SNVindelNeo_Count', 'SNVindelNeo_IC50']

X = dfc[clin_features]
X

In [None]:
# get the Y set
cols_y = [col for col in dfc.columns if col not in X.columns]
Y = dfc[list(cols_y)]
Y

In [86]:
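# Perform train-test split
# (import the function explicitly; `import sklearn as skl` alone does not guarantee that skl.model_selection is loaded)
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)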

In [None]:
# load modules for transformation
from feature_engine.pipeline import Pipeline
from feature_engine.transformation import YeoJohnsonTransformer
from feature_engine.wrappers import SklearnTransformerWrapper
from sklearn.preprocessing import StandardScaler

# select X features to transform
X_vars_to_transform = ['TumourSize', 'FusionNeo_Count', 'FusionNeo_bestIC50', 'FN/FT_Ratio', 'SNVindelNeo_Count', 'SNVindelNeo_IC50']

# select variables to scale
scale_cols_X = X_train.columns.tolist()
scale_cols_X = [col for col in scale_cols_X if col not in ['Age', 'TumorGrade', 'Subtype_HR+/HER2-', 'Subtype_HR+/HER2+', 'Subtype_TNBC', 'Subtype_HR-/HER2+']]

# Create the pipeline
preprocess_pipeline_X = Pipeline([
    ('yeo_johnson', YeoJohnsonTransformer(variables=X_vars_to_transform)),
    ('scaler', SklearnTransformerWrapper(transformer = StandardScaler(), variables = scale_cols_X))
])

# Fit the pipeline to the training data
preprocess_pipeline_X.fit(X_train)

# Transform the training data
X_train_yjsc = preprocess_pipeline_X.transform(X_train)
# Transform the test data
X_test_yjsc = preprocess_pipeline_X.transform(X_test)

#### Y Labels ####
# select variables to scale
scale_cols_Y = Y_train.columns.tolist()

# Create the pipeline
preprocess_pipeline_Y = Pipeline([
    ('yeo_johnson', YeoJohnsonTransformer()),
    ('scaler', SklearnTransformerWrapper(transformer = StandardScaler(), variables = scale_cols_Y))
])

# Fit the pipeline to the training data
preprocess_pipeline_Y.fit(Y_train)

# Transform the training data
Y_train_yjsc = preprocess_pipeline_Y.transform(Y_train)
# Transform the test data
Y_test_yjsc = preprocess_pipeline_Y.transform(Y_test)

print("Shapes after transformations:")
print("Transformed X_train:", X_train_yjsc.shape)
print("Transformed X_test:", X_test_yjsc.shape)
print("Transformed Y_train:", Y_train_yjsc.shape)
print("Transformed Y_test:", Y_test_yjsc.shape)

# Extract ESTIMATE column as y label
y_target = 'ESTIMATE'
y_train_targ = Y_train_yjsc[y_target]
y_test_targ = Y_test_yjsc[y_target]
print(y_train_targ)
print(y_test_targ)

In [None]:
from xgboost import XGBRegressor, plot_importance  # plot_importance is used for the feature-importance plot below

model_xgbreg = XGBRegressor(n_estimators=150, random_state=42)

# fit
model_xgbreg.fit(X_train_yjsc, y_train_targ)

# predict
y_transformed_train_pred = model_xgbreg.predict(X_train_yjsc)
y_transformed_test_pred = model_xgbreg.predict(X_test_yjsc)

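# The Y pipeline was fit on all Y columns, so inverse_transform expects a frame with all of them.
# Build zero-filled placeholder frames and put the predictions in the target column only.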
dummy_train_y = pd.DataFrame(0, index=X_train_yjsc.index, columns=Y_train_yjsc.columns)
dummy_train_y[y_target] = y_transformed_train_pred

dummy_test_y = pd.DataFrame(0, index=X_test_yjsc.index, columns=Y_test_yjsc.columns)
dummy_test_y[y_target] = y_transformed_test_pred

# apply inverse transform
dummy_train_y_inv = preprocess_pipeline_Y.inverse_transform(dummy_train_y)
dummy_test_y_inv = preprocess_pipeline_Y.inverse_transform(dummy_test_y)

# Extract the relevant target column
y_train_pred = dummy_train_y_inv[y_target].to_numpy()
y_test_pred = dummy_test_y_inv[y_target].to_numpy()

# Evaluate
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Calculate metrics
train_r2 = r2_score(Y_train[y_target], y_train_pred)
test_r2 = r2_score(Y_test[y_target], y_test_pred)

train_rmse = np.sqrt(mean_squared_error(Y_train[y_target], y_train_pred))
test_rmse = np.sqrt(mean_squared_error(Y_test[y_target], y_test_pred))

train_mae = mean_absolute_error(Y_train[y_target], y_train_pred)
test_mae = mean_absolute_error(Y_test[y_target], y_test_pred)

# Print results
print("Model Performance:")
print(f"{'Metric':<10} {'Train':<10} {'Test':<10}")
print("-" * 30)
print(f"{'R2':<10} {train_r2:<10.4f} {test_r2:<10.4f}")
print(f"{'RMSE':<10} {train_rmse:<10.4f} {test_rmse:<10.4f}")
print(f"{'MAE':<10} {train_mae:<10.4f} {test_mae:<10.4f}")

# Plot actual vs predicted
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

ax1.scatter(Y_train[y_target], y_train_pred, alpha=0.5)
ax1.plot([Y_train[y_target].min(), Y_train[y_target].max()], [Y_train[y_target].min(), Y_train[y_target].max()], 'r--', lw=2)
ax1.set_xlabel('Actual')
ax1.set_ylabel('Predicted')
ax1.set_title('Train Set')

ax2.scatter(Y_test[y_target], y_test_pred, alpha=0.5)
ax2.plot([Y_test[y_target].min(), Y_test[y_target].max()], [Y_test[y_target].min(), Y_test[y_target].max()], 'r--', lw=2)
ax2.set_xlabel('Actual')
ax2.set_ylabel('Predicted')
ax2.set_title('Test Set')

plt.tight_layout()
plt.show()  # display the figure; it is not saved in this cell, so closing it would discard it

##### OBSOLETE
# rmse = np.sqrt(mean_squared_error(Y_test[y_target], y_test_pred))
# print(f"RMSE: {rmse}")
# mae = mean_absolute_error(Y_test[y_target], y_test_pred)
# r2 = r2_score(Y_test[y_target], y_test_pred)
# print(f"Mean Absolute Error: {mae}")
# print(f"R-squared Score: {r2}")

# Feature importance
fig, ax = plt.subplots(figsize=(18, 16), dpi=300)
plot_importance(model_xgbreg, ax=ax)
plt.show()

import shap
# Create the SHAP explainer
explainer = shap.TreeExplainer(model_xgbreg)

# Explain the test set rather than the training set: the goal is to see which features
# drive predictions on unseen data, not what the model picked up from X_train
shap_values = explainer(X_test_yjsc)

plt.figure(figsize=(18, 16)) 
# Summary plot
shap.summary_plot(shap_values, show=False)
plt.tight_layout()
plt.show()

# Perform clustering
clust = shap.utils.hclust(X_test_yjsc, y_test_targ)
# Create the bar plot with clustering
plt.figure(figsize=(18, 16)) 
shap.plots.bar(shap_values, clustering=clust, clustering_cutoff=1, show=False)
plt.tight_layout()
plt.show()
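
The inverse-transform step in the cell above works because each Y column is transformed independently, so the placeholder values in the other columns do not affect the target column. The sketch below illustrates this with scikit-learn's `PowerTransformer` (Yeo-Johnson plus standardization) standing in for the `feature_engine` pipeline. The column names and synthetic data are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

# Synthetic two-column Y frame (hypothetical column names)
rng = np.random.default_rng(0)
Y_demo = pd.DataFrame({
    "score_a": rng.gamma(2.0, 2.0, size=200),
    "score_b": rng.normal(5.0, 1.0, size=200),
})

# Stand-in for preprocess_pipeline_Y: Yeo-Johnson followed by standardization
pt = PowerTransformer(method="yeo-johnson", standardize=True).fit(Y_demo)
Y_demo_t = pd.DataFrame(pt.transform(Y_demo), index=Y_demo.index, columns=Y_demo.columns)

target = "score_a"
# Use the transformed true values as stand-in "predictions" in transformed space
pred_t = Y_demo_t[target].to_numpy()

def invert_single_target(filler):
    """Pad predictions into a full-width frame, invert, and pull out the target column."""
    dummy = pd.DataFrame(float(filler), index=Y_demo.index, columns=Y_demo.columns)
    dummy[target] = pred_t
    inverted = pd.DataFrame(pt.inverse_transform(dummy), index=dummy.index, columns=dummy.columns)
    return inverted[target].to_numpy()

recovered_zero_fill = invert_single_target(0.0)
recovered_other_fill = invert_single_target(1.0)

# The target column round-trips to the raw scale regardless of the filler value
print(np.allclose(recovered_zero_fill, Y_demo[target].to_numpy()))
print(np.allclose(recovered_zero_fill, recovered_other_fill))
```

The same reasoning would not hold for a transformer that mixes columns (PCA, for example), where the filler values would leak into the inverted target.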


#### **Iterative Learning over all Y labels**

The XGBoost run above modeled only one Y label, the `ESTIMATE` column. The functions below package those steps so the same workflow can run on every Y column we have set up.

In [89]:
# copy the steps above as a function
import os 
import shap
from xgboost import XGBRegressor, plot_importance  # plot_importance is called inside run_xgboost_model
from sklearn.model_selection import learning_curve
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

class YTargetMetrics:
    def __init__(self, target_name, train_r2, test_r2, train_rmse, test_rmse, train_mae, test_mae):
        self.target_name = target_name
        self.train_r2 = train_r2
        self.test_r2 = test_r2
        self.train_rmse = train_rmse
        self.test_rmse = test_rmse
        self.train_mae = train_mae
        self.test_mae = test_mae

    def __str__(self):
        return f"""Model Performance for {self.target_name}:
{'Metric':<10} {'Train':<10} {'Test':<10}
{'-' * 30}
{'R2':<10} {self.train_r2:<10.4f} {self.test_r2:<10.4f}
{'RMSE':<10} {self.train_rmse:<10.4f} {self.test_rmse:<10.4f}
{'MAE':<10} {self.train_mae:<10.4f} {self.test_mae:<10.4f}"""

    def to_dict(self):
        return {
            'target_name': self.target_name,
            'train_r2': self.train_r2,
            'test_r2': self.test_r2,
            'train_rmse': self.train_rmse,
            'test_rmse': self.test_rmse,
            'train_mae': self.train_mae,
            'test_mae': self.test_mae
        }

def plot_learning_curves(estimator, X, y, target, cv=5, train_sizes=np.linspace(.1, 1.0, 5)):
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=-1, train_sizes=train_sizes,
        scoring='neg_mean_squared_error')
    
    train_scores_mean = -train_scores.mean(axis=1)
    train_scores_std = train_scores.std(axis=1)
    test_scores_mean = -test_scores.mean(axis=1)
    test_scores_std = test_scores.std(axis=1)

    plt.figure(figsize=(10, 6), dpi=300)
    plt.title("Learning Curves")
    plt.xlabel("Training examples")
    plt.ylabel("Mean Squared Error")
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1, color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r", label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g", label="Cross-validation score")
    plt.legend(loc="best")

    # Create 'plots' directory if it doesn't exist
    os.makedirs(f'plots/{target}', exist_ok=True)

    # Save the plot
    plt.savefig(f'plots/{target}/{target}-xgb-def-model-learning-curve.png')
    plt.close()

def run_xgboost_model(model_instance, y_target, Y_train, Y_test, X_train_transformed, X_test_transformed, Y_train_transformed, Y_test_transformed, preprocess_pipeline_Y):
    # assign untransformed, raw target data
    raw_y_train = Y_train[y_target]
    raw_y_test = Y_test[y_target]

    # fit
    model_instance.fit(X_train_transformed, Y_train_transformed[y_target])
    
    # predict
    y_train_pred_transformed = model_instance.predict(X_train_transformed)
    y_test_pred_transformed = model_instance.predict(X_test_transformed)

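    # The Y pipeline was fit on all Y columns, so inverse_transform expects a frame with all of them.
    # Build zero-filled placeholder frames and put the predictions in the target column only.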
    dummy_train_y = pd.DataFrame(0, index=X_train_transformed.index, columns=Y_train_transformed.columns)
    dummy_train_y[y_target] = y_train_pred_transformed

    dummy_test_y = pd.DataFrame(0, index=X_test_transformed.index, columns=Y_test_transformed.columns)
    dummy_test_y[y_target] = y_test_pred_transformed

    # apply inverse transform
    dummy_train_y_inv = preprocess_pipeline_Y.inverse_transform(dummy_train_y)
    dummy_test_y_inv = preprocess_pipeline_Y.inverse_transform(dummy_test_y)

    # Extract the relevant target column
    y_train_pred = dummy_train_y_inv[y_target].to_numpy()
    y_test_pred = dummy_test_y_inv[y_target].to_numpy()

    # Evaluate model
    # Calculate metrics
    train_r2 = r2_score(raw_y_train, y_train_pred)
    test_r2 = r2_score(raw_y_test, y_test_pred)

    train_rmse = np.sqrt(mean_squared_error(raw_y_train, y_train_pred))
    test_rmse = np.sqrt(mean_squared_error(raw_y_test, y_test_pred))

    train_mae = mean_absolute_error(raw_y_train, y_train_pred)
    test_mae = mean_absolute_error(raw_y_test, y_test_pred)

    # Plot learning curves
    plot_learning_curves(model_instance, X_train_transformed, Y_train_transformed[y_target], y_target)

    # Print results
    # print("Model Performance:")
    # print(f"{'Metric':<10} {'Train':<10} {'Test':<10}")
    # print("-" * 30)
    # print(f"{'R2':<10} {train_r2:<10.4f} {test_r2:<10.4f}")
    # print(f"{'RMSE':<10} {train_rmse:<10.4f} {test_rmse:<10.4f}")
    # print(f"{'MAE':<10} {train_mae:<10.4f} {test_mae:<10.4f}")

    # Plot actual vs predicted
    _, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6), dpi=300)

    ax1.scatter(raw_y_train, y_train_pred, alpha=0.5)
    ax1.plot([raw_y_train.min(), raw_y_train.max()], [raw_y_train.min(), raw_y_train.max()], 'r--', lw=2)
    ax1.set_xlabel('Actual')
    ax1.set_ylabel('Predicted')
    ax1.set_title('Training Set')

    ax2.scatter(raw_y_test, y_test_pred, alpha=0.5)
    ax2.plot([raw_y_test.min(), raw_y_test.max()], [raw_y_test.min(), raw_y_test.max()], 'r--', lw=2)
    ax2.set_xlabel('Actual')
    ax2.set_ylabel('Predicted')
    ax2.set_title('Testing Set')

    plt.tight_layout()

    # Create 'plots' directory if it doesn't exist
    os.makedirs(f'plots/{y_target}', exist_ok=True)

    plt.savefig(f'plots/{y_target}/{y_target}-xgb-def-model-performance-comparison.png')
    plt.close()

    ##### OBSOLETE
    # rmse = np.sqrt(mean_squared_error(Y_test[y_target], y_test_pred))
    # print(f"RMSE: {rmse}")
    # mae = mean_absolute_error(Y_test[y_target], y_test_pred)
    # r2 = r2_score(Y_test[y_target], y_test_pred)
    # print(f"Mean Absolute Error: {mae}")
    # print(f"R-squared Score: {r2}")

    # Feature importance
    _, ax = plt.subplots(figsize=(18, 16), dpi=300)
    plot_importance(model_instance, ax=ax)

    plt.savefig(f"plots/{y_target}/{y_target}-xgb-def-model-test-set-feature-importance.png")
    plt.close()

    # Create the SHAP explainer
    explainer = shap.TreeExplainer(model_instance)

    # Explain the test set rather than the training set: the goal is to see which features
    # drive predictions on unseen data, not what the model picked up from the training data
    shap_values = explainer(X_test_transformed)

    plt.figure(figsize=(18, 16), dpi=300) 
    # Summary plot
    shap.summary_plot(shap_values, show=False)
    plt.tight_layout()
    plt.savefig(f"plots/{y_target}/{y_target}-xgb-def-model-test-set-SHAP-beeswarm.png", bbox_inches='tight')
    plt.close()

    # Perform clustering
    clust = shap.utils.hclust(X_test_transformed, Y_test_transformed[y_target])
    # Create the bar plot with clustering
    plt.figure(figsize=(18, 16)) 
    shap.plots.bar(shap_values, clustering=clust, clustering_cutoff=1, show=False)
    plt.tight_layout()
    plt.savefig(f"plots/{y_target}/{y_target}-xgb-def-model-test-set-SHAP-summary.png", dpi=300, bbox_inches='tight')
    plt.close()

    print(f"Model training and evaluation for {y_target} completed.")

    # Instead of returning a tuple, return a YTargetMetrics object
    return YTargetMetrics(y_target, train_r2, test_r2, train_rmse, test_rmse, train_mae, test_mae)
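
The learning-curve helper above negates the scores because scikit-learn's `neg_mean_squared_error` scorer returns the negative of the MSE, so that higher is always better. A minimal sketch of that sign convention follows, using `Ridge` on synthetic data as a lightweight stand-in for the XGBoost regressor. The data and `*_demo` names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import learning_curve

# Synthetic data and a simple estimator (stand-ins, not the notebook's model)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

sizes, train_scores, val_scores = learning_curve(
    Ridge(alpha=1.0),
    X_demo,
    y_demo,
    cv=5,
    train_sizes=np.linspace(0.2, 1.0, 4),
    scoring="neg_mean_squared_error",
)

# Scores come back as negative MSE with shape (n_train_sizes, n_cv_folds)
print(train_scores.shape)

# Flip the sign to get ordinary MSE, then average across folds
train_mse = -train_scores.mean(axis=1)
val_mse = -val_scores.mean(axis=1)

for n, tr, va in zip(sizes, train_mse, val_mse):
    print(f"n={n:>4d}  train MSE={tr:10.2f}  CV MSE={va:10.2f}")
```

A persistent gap between the training and cross-validation curves as the training size grows is the usual visual sign of overfitting.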


Now test the functions.

In [None]:
# initialize an empty dictionary to store the results
results = {}

# instantiate model
model_xgbreg = XGBRegressor(n_estimators=150, random_state=42)

for target in cols_y:
    metrics = run_xgboost_model(model_xgbreg, target, Y_train, Y_test, X_train_yjsc, X_test_yjsc, Y_train_yjsc, Y_test_yjsc, preprocess_pipeline_Y)
    # Store the metric result in the dictionary
    results[target] = metrics

In [None]:
import json

# Now you can easily access and print metrics
for target, metrics in results.items():
    print(metrics)  # This will use the __str__ method of YTargetMetrics

# If you need to convert back to a dictionary (e.g., for saving to JSON)
results_dict = {target: metrics.to_dict() for target, metrics in results.items()}

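# Save to a JSON file; create the output directory first so open() does not fail
# (os is imported in the function cell above)
os.makedirs('out_files', exist_ok=True)
with open('out_files/xgboost_default_model_evaluation_metrics.json', 'w') as f:
    json.dump(results_dict, f, indent=4)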

The test-set R2 values are poor. Where the training R2 is much higher than the test R2 for the same target, the gap points to overfitting.
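
One way to make the train/test comparison explicit is to tabulate the R2 gap per target from a dictionary shaped like `results_dict`. The sketch below is a hypothetical helper: the function name, the `gap_threshold` default, and the placeholder numbers are assumptions for illustration and are not results from this notebook.

```python
import pandas as pd

def summarize_generalization_gap(results_dict, gap_threshold=0.2):
    """Tabulate train R2 minus test R2 per target and flag large gaps.

    gap_threshold is an arbitrary illustrative cutoff, not a recommended value.
    """
    table = pd.DataFrame.from_dict(results_dict, orient="index")
    table["r2_gap"] = table["train_r2"] - table["test_r2"]
    table["possible_overfit"] = table["r2_gap"] > gap_threshold
    return table.sort_values("r2_gap", ascending=False)

# Placeholder values only, chosen to show the two cases (NOT notebook output)
placeholder_results = {
    "target_A": {"train_r2": 0.95, "test_r2": 0.10},
    "target_B": {"train_r2": 0.30, "test_r2": 0.25},
}

gap_table = summarize_generalization_gap(placeholder_results)
print(gap_table)
```

In the notebook, the same function could be called on `results_dict`, since `YTargetMetrics.to_dict()` already provides the `train_r2` and `test_r2` keys.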

#### **Grid Search: Using `GridSearchCV` for Hyperparameter Tuning**

In [92]:
# from sklearn.model_selection import GridSearchCV
# from xgboost import XGBRegressor

# def run_grid_search_cv(model_instance, target, param_grid, X_train_transformed, Y_train_transformed):
#     print(target)

#     # assign y target column
#     y_train_target = Y_train_transformed[target]
    
#     # Set up grid search
#     grid_search = GridSearchCV(
#     estimator=model_instance,
#     param_grid=param_grid,
#     cv=5,
#     scoring='neg_mean_squared_error',
#     n_jobs=-1,  # Use all available cores
#     verbose=1
#     )
    
#     print(f"Running grid search on {target} column...")
#     # Fit grid search
#     grid_search.fit(X_train_transformed, y_train_target)

#     # Get best parameters and model
#     best_params = grid_search.best_params_
#     best_model = grid_search.best_estimator_

#     print("Best parameters:", best_params)
#     print("Best score:", -grid_search.best_score_)

#     return best_model, best_params

# def predict_with_best_xgboost_model(best_model, target, X_test_transformed, Y_test_transformed, Y_test_raw, preprocess_pipeline_Y):
#     # Make predictions on the test set using the best model
#     y_pred_transformed = best_model.predict(X_test_transformed)

#     # Create a DataFrame with the same columns as the original y used in preprocess_pipeline to reverse transformation
#     dummy_y = pd.DataFrame(0, index=X_test_transformed.index, columns=Y_test_transformed.columns)
#     dummy_y[target] = y_pred_transformed

#     # apply inverse transform
#     dummy_y_inv = preprocess_pipeline_Y.inverse_transform(dummy_y)

#     # Extract the relevant target column
#     y_pred = dummy_y_inv[target].to_numpy()

#     # Evaluate
#     y_test_raw = Y_test_raw[target]
#     mae = mean_absolute_error(y_test_raw, y_pred)
#     rmse = np.sqrt(mean_squared_error(y_test_raw, y_pred))
#     r2 = r2_score(y_test_raw, y_pred)

#     print(f"Mean Absolute Error (MAE) [Testing]: {mae:.2f}")
#     print(f"Root Mean Squared Error (RMSE) [Testing]: {rmse:.2f}")
#     print(f"Coefficient of Determination (R2) [Testing]: {r2:.2f}")

#     return mae, rmse, r2

In [93]:
# # Instantiate base model
# model_xgb = XGBRegressor(random_state=42)

# # Define parameter grid
# param_grid = {
#     'n_estimators': [100, 200, 300],
#     'max_depth': [3, 5, 7],
#     'learning_rate': [0.01, 0.1, 0.3],
#     'subsample': [0.5, 0.8, 1.0],
#     'gamma': [0, 0.1, 0.3, 0.8, 1]
# }

# # initialize an empty dict
# tuned_results = {}

# for target in cols_y[:2]:
#     best_model, best_params = run_grid_search_cv(model_xgb, target, param_grid, X_train_yjsc, Y_train_yjsc)
    
#     tuned_metrics = predict_with_best_xgboost_model(best_model, target, X_test_yjsc, Y_test_yjsc, Y_test, preprocess_pipeline_Y)

#     # Store the metric result in the dictionary
#     tuned_results[target] = tuned_metrics
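
Before re-enabling the grid search above, it helps to know how many model fits it implies. The sketch below counts the candidates in the commented-out `param_grid` with scikit-learn's `ParameterGrid` and multiplies by the `cv=5` folds used in `run_grid_search_cv`. The variable names `n_candidates` and `n_cv_fits` are introduced here for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the commented-out cell above
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.3],
    'subsample': [0.5, 0.8, 1.0],
    'gamma': [0, 0.1, 0.3, 0.8, 1],
}

cv_folds = 5

# Every combination of values is one candidate: 3 * 3 * 3 * 3 * 5
n_candidates = len(ParameterGrid(param_grid))

# Each candidate is fit once per fold during cross-validation
n_cv_fits = n_candidates * cv_folds

print(f"Candidates per target: {n_candidates}")          # 405
print(f"Cross-validation fits per target: {n_cv_fits}")  # 2025
```

With the default `refit=True`, `GridSearchCV` adds one final fit of the best candidate on the full training set, and the whole count repeats for every Y target in the loop.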