# Support Vector Machine for Classification Problem
### *Exploring the association between neoantigen-related variables and immune scores*
This notebook continues `support_vector_reg.ipynb`. It tests an SVM on our neoantigen dataset after converting the task into a classification problem.

#### **Package and Raw Data Loading**
First, import the necessary packages and load the raw data table into a `pandas` DataFrame.



In [1]:
# first, import packages
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from itables import show
from IPython.display import HTML, display
from warnings import simplefilter, filterwarnings
simplefilter(action="ignore", category=pd.errors.PerformanceWarning)
filterwarnings("ignore", category=UserWarning)
pd.set_option('display.max_columns', None)
%config InlineBackend.figure_format = 'retina'

# load pretty jupyter's magics
%load_ext pretty_jupyter

Load up the cleaned-up dataset wrangled from MH's latest work.

In [None]:
# read in latest data
# use the 202409_new_excludedIHC_batch-duplicate-removed.tsv
df = pd.read_csv("../input-data/SA/202409_new_excludedIHC_batch-duplicate-removed.tsv",sep="\t")
print(f"Before trimming columns: {df.shape}")

# exclude the 29 Cibersort scores, leaving only 3
df = df.drop(columns=['Bindea_full', 'Expanded_IFNg', 
        'C_Bcellsmemory','C_Plasmacells','C_TcellsCD8','C_TcellsCD4naive',
         'C_TcellsCD4memoryactivated','C_Tcellsfollicularhelper',
         'C_Tcellsregulatory(Tregs)','C_Tcellsgammadelta','C_NKcellsresting',
         'C_NKcellsactivated', 'C_Monocytes', 'C_MacrophagesM0',
         'C_MacrophagesM1','C_Dendriticcellsresting',
         'C_Dendriticcellsactivated', 'C_Mastcellsresting',
         'C_Mastcellsactivated','C_Eosinophils', 'C_Neutrophils', 'S_PAM100HRD'])

print(f"After trimming columns: {df.shape}")
df.head()

#### **Data Preprocessing**

Decide all the clinical variables and neoantigen-related variables to keep in the X matrix (features).

1. `Subtype` column has already been encoded categorically by `HR_status` and `HER_status` columns so these two columns can be dropped. ***UPDATE: due to their lesser importance during the default XGBoost modeling, `PAM50` column was dropped as well.***

2.  `AgeGroup` is just a binned version of the `Age` column, so it is dropped as redundant.

~~3. Drop `FusionNeo_bestScore`, `FusionTransscript_Count`, `Fusion_T2NeoRate` columns as well as the `SNVindelNeo_IC50` and `SNVindelNeo_IC50Percentile` columns for now to reduce complexity.~~

4. Drop `Batch` column.

> **UPDATE 1:** ~~Exclude `TotalNeo_Count`, and include `Fusion_T2NeoRate` and `SNVindelNeo_IC50` columns.~~ **Also, rename `Fusion_T2NeoRate` to `FN/FT_Ratio`.**

> **UPDATE 2: put back `FusionNeo_bestScore` into the X variable set and rename it into `FusionNeo_bestIC50`**

NOTE: The `FN/FT_Ratio` can exceed 1. This means there are more predicted neoantigens than predicted transcripts, i.e. a single transcript can produce more than one putative neoantigen. This suggests a highly immunogenic transcript, so this metric might be useful down the line.
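
As a small illustration of the note above, the ratio can be recomputed from the two count columns. This is only a sketch: the toy values are made up, and the formula (neoantigen count divided by transcript count) is an assumption inferred from the column name rather than taken from the upstream pipeline.

```python
import pandas as pd

# Toy counts (hypothetical values, not from the dataset)
toy = pd.DataFrame(
    {"FusionNeo_Count": [0, 2, 6], "FusionTranscript_Count": [4, 4, 4]},
    index=["s1", "s2", "s3"],
)

# Assumed definition: predicted fusion neoantigens per predicted fusion transcript
toy["FN/FT_Ratio"] = toy["FusionNeo_Count"] / toy["FusionTranscript_Count"]

# Samples whose transcripts yield, on average, more than one putative neoantigen
above_one = toy.index[toy["FN/FT_Ratio"] > 1].tolist()
print(toy)
print(above_one)
```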

In [None]:
# let's drop all NaN for now and set col 'ID' as index
# dfd = df.drop(columns = ['Batch', 'Stage', 'PAM50', 'HR_status', 'HER_status', 'AgeGroup', 'TotalNeo_Count', 'FusionTransscript_Count', 'SNVindelNeo_IC50Percentile']).dropna().set_index('ID')
dfd = df.drop(columns = ['Batch', 'Stage', 'PAM50', 'HR_status', 'HER_status', 'AgeGroup', 'SNVindelNeo_IC50Percentile']).dropna().set_index('ID')

# rename `Fusion_T2NeoRate` to `FN/FT_Ratio` and `FusionNeo_bestScore` to `FusionNeo_bestIC50`;
# also fix the typo in `FusionTransscript_Count`
dfd = dfd.rename(columns={
    'Fusion_T2NeoRate': 'FN/FT_Ratio',
    'FusionNeo_bestScore': 'FusionNeo_bestIC50',
    'FusionTransscript_Count': 'FusionTranscript_Count',
})

print(dfd.shape)
dfd.head()

**Sanity Check:** Make sure there are no duplicated index rows in the dataset.
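
To make the behaviour of this check concrete, here is a minimal sketch on a toy frame with one repeated ID. The sample names and values are hypothetical; the final `assert` is an optional guard that fails loudly instead of relying on visual inspection.

```python
import pandas as pd

# Toy frame with one repeated ID (hypothetical sample names)
toy = pd.DataFrame({"Age": [51, 63, 47, 63]}, index=["P1", "P2", "P3", "P2"])

# duplicated() flags the second and later occurrences of each label
dupe_ids = toy.index[toy.index.duplicated()].unique().tolist()
print(dupe_ids)

# keep=False flags every occurrence, which helps when inspecting the rows
print(toy[toy.index.duplicated(keep=False)])

# Keep the first occurrence only, then verify that the index is unique
deduped = toy[~toy.index.duplicated(keep="first")]
assert deduped.index.is_unique
```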

In [None]:
print(dfd.index[dfd.index.duplicated()].unique())
rows_dupe = list(dfd.index[dfd.index.duplicated()].unique())
rows_dupe

Now we need to convert the `object` columns `Subtype` and `FN/FT_Ratio` to appropriate types. Cast `Age`, `TumorGrade`, `IMPRES`, and all `*_Count` columns to `int64` because they are discrete variables, and cast `FN/FT_Ratio` to `float64`.

In [5]:
dfd['Subtype'] = dfd['Subtype'].astype('category')
dfd['Age'] = dfd['Age'].astype('int64')
dfd['TumorGrade'] = dfd['TumorGrade'].astype('int64')
dfd['IMPRES'] = dfd['IMPRES'].astype('int64')
dfd['FusionNeo_Count'] = dfd['FusionNeo_Count'].astype('int64')
dfd['FusionTranscript_Count'] = dfd['FusionTranscript_Count'].astype('int64')
dfd['SNVindelNeo_Count'] = dfd['SNVindelNeo_Count'].astype('int64')
dfd['FN/FT_Ratio'] = dfd['FN/FT_Ratio'].astype('float64')

# print(dfd.dtypes)
pd.set_option('display.max_rows', 8)

In [6]:
# save dfd dataframe to a new file as a pandas object
dfd.to_pickle("../input-data/SA/202409_new_excludedIHC_batch-duplicate-removed_cols-dropped_coltypes_encoded.pkl")


Now we can use `feature_engine`'s `OneHotEncoder()` to create a set of `k` dummy variables for `Subtype`.

**NOTE**: The encoded columns will be appended at the end of the dataFrame. 
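
The sketch below illustrates the same idea on toy data: `k` dummy columns for `k` categories, appended at the end and then moved to the front. It uses `pandas.get_dummies` as a stand-in for `feature_engine`'s encoder, so it shows the shape of the result rather than the exact call used in this notebook; the toy values are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Subtype": ["TNBC", "HR+/HER2-", "HR-/HER2+", "TNBC"],
        "Age": [45, 60, 52, 38],
    }
)

# k dummy columns for k categories (no level dropped), as with drop_last=False
dummies = pd.get_dummies(toy["Subtype"], prefix="Subtype", dtype=int)

# Appending the dummies places the new columns last
appended = pd.concat([toy.drop(columns=["Subtype"]), dummies], axis=1)

# Reorder so the encoded columns come first
enc_cols = list(dummies.columns)
front = appended[enc_cols + [c for c in appended.columns if c not in enc_cols]]
print(front.columns.tolist())
```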


In [None]:
from feature_engine.encoding import OneHotEncoder

encoder = OneHotEncoder(
    variables=['Subtype'],
    drop_last=False)

encoder.fit(dfd)
dfd_ = encoder.transform(dfd)
dfd_.head()

In [8]:
# Specify the encoded columns to shift
enc_cols = ['Subtype_HR+/HER2-', 'Subtype_HR+/HER2+', 'Subtype_TNBC', 'Subtype_HR-/HER2+']

# Select the encoded columns from the encoder output, and drop the original `Subtype` column
encoded_df = dfd_[enc_cols]
dfenc = dfd.drop(columns=['Subtype'])

# Specify the index where you want to reinsert the columns
insert_index = 0  # This will insert at the first column

# Reinsert the columns
for i, col in enumerate(encoded_df.columns):
    dfenc.insert(insert_index + i, col, encoded_df[col])

Below is the categorically-encoded dataframe.

In [None]:
print(dfenc.shape)
dfenc.head()

And below is the original, unencoded dataframe.

In [None]:
print(dfd.shape)
dfd.head()

~~#### **Subsetting Y Labels**~~

~~In the previous exploration, many of the immune scores (Y targets/labels) might not really show much relationship with fusion neoantigen variables so they may not be as informative. We decided to use Caitlin's finding and subset the Y labels into several clinically meaningful groups.~~

In [11]:
# # use the unencoded categorical dataframe (dfd) and drop the Subtype categorical column
# df_dcat = dfd.drop(columns=['Subtype'])
# print(df_dcat.shape)
# df_dcat.head()

First, list all the clinical and neoantigen-related variables that make up the X feature set.

In [12]:
X_features = ['Subtype_HR+/HER2-', 'Subtype_HR+/HER2+', 'Subtype_TNBC', 'Subtype_HR-/HER2+', 'Age', 'TumorGrade', 'TumourSize', 'FusionNeo_Count', 'FusionNeo_bestIC50', 'FusionTranscript_Count', 'FN/FT_Ratio', 'SNVindelNeo_Count', 'SNVindelNeo_IC50', 'TotalNeo_Count']

In [13]:
X_features_nocat = ['Age', 'TumorGrade', 'TumourSize', 'FusionNeo_Count', 'FusionNeo_bestIC50', 'FusionTranscript_Count', 'FN/FT_Ratio', 'SNVindelNeo_Count', 'SNVindelNeo_IC50', 'TotalNeo_Count']

In [None]:
# Now get the Y variable set
Y_labels_all = [col for col in dfd.drop(columns=['Subtype']).columns if col not in X_features]
print(Y_labels_all[:5])
len(Y_labels_all)

#### **Split Dataset with `train_test_split` & Create a Data Transformation Pipeline from `feature_engine` Package**

Split the dataset before modeling to avoid information leakage.

Next, the pipeline applies the Yeo-Johnson transformation to the numeric X features and scales them with `StandardScaler` (wrapped in `feature_engine`'s `SklearnTransformerWrapper`). The one-hot `Subtype` columns are passed through unchanged. Unlike in the regression notebook, the Y labels are not transformed or scaled here; they are binned into discrete classes in the **Dataset Binning** section.

All transformers are fitted on the training split only and then applied to the test split.
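
The fit-on-train, apply-to-test pattern can be sketched as follows. This is an illustration under stated assumptions: it uses scikit-learn's `PowerTransformer` (Yeo-Johnson with built-in standardization) as a stand-in for the `feature_engine` steps, and the skewed toy feature is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
# Right-skewed toy feature (hypothetical), similar in shape to a count variable
toy_X = pd.DataFrame({"count_like": rng.lognormal(mean=1.0, sigma=0.8, size=200)})

toy_train, toy_test = train_test_split(toy_X, test_size=0.2, random_state=42)

# Yeo-Johnson followed by standardization; parameters are learned from the training split only
pt = PowerTransformer(method="yeo-johnson", standardize=True)
train_t = pt.fit_transform(toy_train)
test_t = pt.transform(toy_test)  # reuses the training lambda, mean and scale

# Training data is exactly standardized; test data is only approximately so
print(train_t.mean(), train_t.std())
print(test_t.mean(), test_t.std())

# The fitted transform can be inverted to map values back to raw units
recovered = pt.inverse_transform(test_t)
```

Because the test split never influences the fitted parameters, its transformed mean and standard deviation are close to, but not exactly, 0 and 1.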

In [None]:
# subset X columns as desired
X = dfenc[X_features]
X = X.drop(columns=["FusionNeo_bestIC50", "SNVindelNeo_IC50"])
X

In [None]:
# plot distribution of the X columns
X.hist(figsize=(20, 20))
plt.show()

Now grab the Y targets (do this as a whole, but we will train on each column individually later).

In [None]:
# Now get the Y variable set
Y = dfenc[Y_labels_all]
Y

In [None]:
Y.hist(figsize=(20, 20))
plt.show()

Now we perform the train-test split on the X and Y variables, then create a data transformation pipeline to scale the data.

In [19]:
# Perform train-test split
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

In [None]:
X_train.head()

In [None]:
X_test.head()

In [None]:
Y_train.head()

In [None]:
Y_test.head()

In [None]:
# select variables to scale
cols_X = X_train.columns.tolist()
scale_cols_X = [col for col in cols_X if col not in ['Subtype_HR+/HER2-', 'Subtype_HR+/HER2+', 'Subtype_TNBC', 'Subtype_HR-/HER2+']]
scale_cols_X

In [25]:
categorical_cols_X = ['Subtype_HR+/HER2-', 'Subtype_HR+/HER2+', 'Subtype_TNBC', 'Subtype_HR-/HER2+']

In [26]:
# select variables to target
cols_Y = Y_train.columns.tolist()
# drop ESTIMATE and IMPRES
target_cols_Y = [col for col in cols_Y if col not in ['ESTIMATE', 'IMPRES']]

#### **Dataset Binning**

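To test SVM, we need to bin the target Y datasets into discrete classes. Here each Y column is split into three quantile-based bins (low, mid, high), with bin edges learned from the training split. We can leave the X datasets as they are (transformed and scaled).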
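
As a rough sketch of what quantile binning does to a single target column, the example below applies `KBinsDiscretizer` to synthetic scores. The data are random stand-ins for an immune-score column; only the binner settings (`n_bins=3`, `strategy='quantile'`, ordinal encoding) mirror the ones used in this notebook.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(1)
# Hypothetical continuous scores standing in for one immune-score column
train_scores = rng.normal(size=(300, 1))
test_scores = rng.normal(size=(60, 1))

# Quantile strategy: edges are training-set tertiles, so training classes are roughly balanced
binner = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="quantile")
train_cls = binner.fit_transform(train_scores).ravel().astype(int)
test_cls = binner.transform(test_scores).ravel().astype(int)

edges = binner.bin_edges_[0]
train_counts = np.bincount(train_cls, minlength=3)
test_counts = np.bincount(test_cls, minlength=3)
print(edges)
print(train_counts, test_counts)

# A value exactly on an interior edge is assigned to the upper bin
on_edge = binner.transform([[edges[1]]])[0, 0]
print(on_edge)
```

Since the edges come from the training split, the test classes need not be balanced, which is worth keeping in mind when reading test accuracy.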

In [None]:
from feature_engine.pipeline import Pipeline
from feature_engine.transformation import YeoJohnsonTransformer
from feature_engine.wrappers import SklearnTransformerWrapper
from sklearn.preprocessing import StandardScaler, KBinsDiscretizer
from sklearn.compose import ColumnTransformer
from typing import Dict
import pandas as pd

# Separate numeric and categorical preprocessing
numeric_transformer = Pipeline([
    ('yeo_johnson', YeoJohnsonTransformer(variables=scale_cols_X)),
    ('scaler', SklearnTransformerWrapper(transformer=StandardScaler(), variables=scale_cols_X))
])

# Create column transformer for X preprocessing
preprocess_pipeline_X = ColumnTransformer(
    transformers=[
        ('numeric', numeric_transformer, scale_cols_X),
        ('categorical', 'passthrough', categorical_cols_X)  # categorical_cols_X holds the one-hot Subtype columns
    ],
    remainder='passthrough'  # This will pass through any columns not explicitly specified
)

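# Simplified Y preprocessing - only binning for classification
# Inherit from scikit-learn's base classes so the binner behaves like a standard transformer inside a Pipeline
from sklearn.base import BaseEstimator, TransformerMixin

class MultiColumnBinner(TransformerMixin, BaseEstimator):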
    def __init__(self, n_bins=3, strategy='quantile'):
        self.n_bins = n_bins
        self.strategy = strategy
        self.binner = None
        self.bin_edges_ = None
        
    def fit(self, X, y=None):
        self.binner = KBinsDiscretizer(
            n_bins=self.n_bins,
            encode='ordinal',
            strategy=self.strategy
        )
        self.binner.fit(X)
        self.bin_edges_ = self.binner.bin_edges_
        return self
    
    def transform(self, X):
        binned = self.binner.transform(X)
        return pd.DataFrame(
            binned,
            columns=X.columns,
            index=X.index
        )
    
    def fit_transform(self, X, y=None):
        return self.fit(X).transform(X)

# Simplified Y pipeline - only binning
preprocess_pipeline_Y = Pipeline([
    ('binner', MultiColumnBinner(n_bins=3, strategy='quantile'))
])

# Modified process X data - note that we need to convert to DataFrame after transformation
preprocess_pipeline_X.fit(X_train)
X_train_transformed = pd.DataFrame(
    preprocess_pipeline_X.transform(X_train),
    columns=list(scale_cols_X) + list(categorical_cols_X),  # Maintain column names
    index=X_train.index
)
X_test_transformed = pd.DataFrame(
    preprocess_pipeline_X.transform(X_test),
    columns=list(scale_cols_X) + list(categorical_cols_X),  # Maintain column names
    index=X_test.index
)

# Process Y data - now only binning
preprocess_pipeline_Y.fit(Y_train)
Y_train_binned = preprocess_pipeline_Y.transform(Y_train)
Y_test_binned = preprocess_pipeline_Y.transform(Y_test)

# Print bin edges for interpretation
binner = preprocess_pipeline_Y.steps[0][1]
for col, edges in zip(Y_train.columns, binner.bin_edges_):
    print(f"\nBin edges for {col}:")
    # KBinsDiscretizer bins are left-closed: a value equal to an edge falls in the upper bin
    print(f"low (0): < {edges[1]:.2f}")
    print(f"mid (1): >= {edges[1]:.2f} and < {edges[2]:.2f}")
    print(f"high (2): >= {edges[2]:.2f}")

Now run SVM.

In [None]:
def run_classification_model(
    y_target: str,
    Y_train_binned: pd.DataFrame,
    Y_test_binned: pd.DataFrame,
    X_train_transformed: pd.DataFrame,
    X_test_transformed: pd.DataFrame,
) -> Dict:
    """Run classification model for a single target variable and return performance metrics."""
    from sklearn.svm import SVC
    from sklearn.metrics import accuracy_score, classification_report, ConfusionMatrixDisplay
    import matplotlib.pyplot as plt
    import os
    
    # Get binned target data
    y_train = Y_train_binned[y_target].astype(int)
    y_test = Y_test_binned[y_target].astype(int)
    
    # Initialize and fit model
    model = SVC(kernel='rbf', probability=True)
    model.fit(X_train_transformed, y_train)
    
    # Predict
    y_train_pred = model.predict(X_train_transformed)
    y_test_pred = model.predict(X_test_transformed)
    
    # Calculate metrics
    train_accuracy = accuracy_score(y_train, y_train_pred)
    test_accuracy = accuracy_score(y_test, y_test_pred)
    
    # Generate classification report
    test_report = classification_report(y_test, y_test_pred)
    
    # Create and save confusion matrix plots
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
    
    ConfusionMatrixDisplay.from_predictions(
        y_train, y_train_pred,
        display_labels=['low', 'mid', 'high'],
        ax=ax1
    )
    ax1.set_title('Training Confusion Matrix')
    
    ConfusionMatrixDisplay.from_predictions(
        y_test, y_test_pred,
        display_labels=['low', 'mid', 'high'],
        ax=ax2
    )
    ax2.set_title('Testing Confusion Matrix')
    
    plt.tight_layout()
    # Create the directory that savefig actually writes to
    out_dir = f'plots-classification/{y_target}'
    os.makedirs(out_dir, exist_ok=True)
    plt.savefig(f'{out_dir}/{y_target}-classification-confusion-matrix.png')
    plt.close()
    
    return {
        'target_name': y_target,
        'train_accuracy': train_accuracy,
        'test_accuracy': test_accuracy,
        'classification_report': test_report
    }

# Example usage for all targets
results_dict = {}
for y_target in Y_train.columns:
    print(f"\nProcessing target: {y_target}")
    results = run_classification_model(
        y_target=y_target,
        Y_train_binned=Y_train_binned,
        Y_test_binned=Y_test_binned,
        X_train_transformed=X_train_transformed,
        X_test_transformed=X_test_transformed
    )
    results_dict[y_target] = results
    
    print(f"Results for {y_target}:")
    print(f"Train accuracy: {results['train_accuracy']:.4f}")
    print(f"Test accuracy: {results['test_accuracy']:.4f}")
    print("\nClassification Report:")
    print(results['classification_report'])

In [None]:
import pandas as pd
from typing import Dict

def analyze_model_performances(results_dict: Dict) -> pd.DataFrame:
    """
    Analyze performance metrics across all target variables.
    
    Args:
        results_dict: Dictionary containing results from run_classification_model
        
    Returns:
        DataFrame with sorted performance metrics
    """
    # Create a list to store performance metrics
    performances = []
    
    for target, results in results_dict.items():
        performances.append({
            'target': target,
            'train_accuracy': results['train_accuracy'],
            'test_accuracy': results['test_accuracy'],
            'difference': results['train_accuracy'] - results['test_accuracy']  # to check overfitting
        })
    
    # Convert to DataFrame
    perf_df = pd.DataFrame(performances)
    
    # Sort by test accuracy (primary metric) in descending order
    perf_df_sorted = perf_df.sort_values('test_accuracy', ascending=False)
    
    # Print the best performing target
    best_target = perf_df_sorted.iloc[0]
    print("\nBest performing target:")
    print(f"Target: {best_target['target']}")
    print(f"Test Accuracy: {best_target['test_accuracy']:.4f}")
    print(f"Train Accuracy: {best_target['train_accuracy']:.4f}")
    print(f"Difference (Train-Test): {best_target['difference']:.4f}")
    
    print("\nAll targets sorted by test accuracy:")
    print(perf_df_sorted.to_string(float_format=lambda x: '{:.4f}'.format(x)))
    
    # Identify potential overfitting cases
    overfitting_threshold = 0.1  # You can adjust this threshold
    overfitting_cases = perf_df[perf_df['difference'] > overfitting_threshold]
    
    if not overfitting_cases.empty:
        print(f"\nPotential overfitting cases (Train-Test > {overfitting_threshold}):")
        print(overfitting_cases.to_string(float_format=lambda x: '{:.4f}'.format(x)))
    
    return perf_df_sorted

# After running your models, use this to analyze:
performance_summary = analyze_model_performances(results_dict)

# If you want to get just the top N performing targets
N = 5  # Change this to get more or fewer top performers
top_N_targets = performance_summary.head(N)
print(f"\nTop {N} performing targets:")
print(top_N_targets.to_string(float_format=lambda x: '{:.4f}'.format(x)))

# Get the classification report for the best performing target
best_target = performance_summary.iloc[0]['target']
print(f"\nDetailed classification report for best target ({best_target}):")
print(results_dict[best_target]['classification_report'])

#### **Iterative Learning over Y Labels with `GridSearchCV` for Hyperparameter Tuning**

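Now we can rewrite the functions to incorporate `GridSearchCV`. The cells below hold the commented-out SVR version from the regression notebook, which has not yet been adapted for classification.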
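
For the classification setting, a grid search over an `SVC` might look roughly like the sketch below. Everything here is illustrative: the data are synthetic, the parameter grid values are assumptions rather than tuned recommendations, and the choice of stratified folds with balanced accuracy as the scoring metric is one reasonable option, not the notebook's established method.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.svm import SVC

# Synthetic 3-class data (hypothetical stand-in for the transformed X and one binned Y column)
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

# Small illustrative grid; the values are assumptions, not tuned recommendations
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.1]}

search = GridSearchCV(
    estimator=SVC(kernel="rbf"),
    param_grid=param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="balanced_accuracy",
    n_jobs=1,
)
search.fit(X_tr, y_tr)

# search.score uses the same scoring metric as the grid search
test_score = search.score(X_te, y_te)
print(search.best_params_)
print(f"CV balanced accuracy: {search.best_score_:.3f}")
print(f"Test balanced accuracy: {test_score:.3f}")
```

Note that `epsilon` from the SVR grid has no counterpart in `SVC`, so only `C` and `gamma` are searched here.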

In [31]:
# from sklearn.model_selection import GridSearchCV

# def run_svr_model_gridsearch(
#     y_target: str,
#     Y_train: pd.DataFrame,
#     Y_test: pd.DataFrame,
#     X_train_transformed: pd.DataFrame,
#     X_test_transformed: pd.DataFrame,
#     Y_train_transformed: pd.DataFrame,
#     Y_test_transformed: pd.DataFrame,
#     preprocess_pipeline_Y: Pipeline
# ) -> YTargetMetrics:
#     """Run SVR model with GridSearchCV for a single target variable and return performance metrics."""

#     # Define parameter grid
#     param_grid = {
#         'C': [0.1, 1, 10, 100],
#         'epsilon': [0.01, 0.1, 0.2],
#         'gamma': ['scale', 'auto', 0.1, 0.01]
#     }

#     # Initialize base model
#     base_model = SVR()

#     # Setup GridSearchCV
#     grid_search = GridSearchCV(
#         estimator=base_model,
#         param_grid=param_grid,
#         cv=5,
#         scoring='neg_mean_squared_error',
#         n_jobs=-1,
#         verbose=1
#     )

#     # Fit GridSearchCV
#     print(f"\nPerforming GridSearchCV for {y_target}...")
#     grid_search.fit(X_train_transformed, Y_train_transformed[y_target])

#     # Print best parameters
#     print(f"\nBest parameters for {y_target}:")
#     print(grid_search.best_params_)

#     # Use best model for predictions
#     model_instance = grid_search.best_estimator_
    
#     # Get predictions (transformed space)
#     y_train_pred_transformed = model_instance.predict(X_train_transformed)
#     y_test_pred_transformed = model_instance.predict(X_test_transformed)

#     # Create dummy DataFrames for inverse transform
#     dummy_train_y = pd.DataFrame(0, index=X_train_transformed.index, 
#                                 columns=Y_train_transformed.columns)
#     dummy_train_y[y_target] = y_train_pred_transformed

#     dummy_test_y = pd.DataFrame(0, index=X_test_transformed.index, 
#                                columns=Y_test_transformed.columns)
#     dummy_test_y[y_target] = y_test_pred_transformed

#     # Inverse transform predictions
#     dummy_train_y_inv = preprocess_pipeline_Y.inverse_transform(dummy_train_y)
#     dummy_test_y_inv = preprocess_pipeline_Y.inverse_transform(dummy_test_y)

#     # Extract the relevant target column
#     y_train_pred = dummy_train_y_inv[y_target].to_numpy()
#     y_test_pred = dummy_test_y_inv[y_target].to_numpy()

#     # Get raw target data
#     raw_y_train = Y_train[y_target]
#     raw_y_test = Y_test[y_target]

#     # Calculate metrics
#     train_r2 = r2_score(raw_y_train, y_train_pred)
#     test_r2 = r2_score(raw_y_test, y_test_pred)
#     train_rmse = np.sqrt(mean_squared_error(raw_y_train, y_train_pred))
#     test_rmse = np.sqrt(mean_squared_error(raw_y_test, y_test_pred))
#     train_mae = mean_absolute_error(raw_y_train, y_train_pred)
#     test_mae = mean_absolute_error(raw_y_test, y_test_pred)

#     # Create plots directory
#     os.makedirs(f'plots/{y_target}', exist_ok=True)

#     # Plot CV results
#     cv_results = pd.DataFrame(grid_search.cv_results_)
#     plt.figure(figsize=(15, 5))
#     plt.subplot(1, 2, 1)
#     plt.plot(cv_results['param_C'], -cv_results['mean_test_score'], 'o-')
#     plt.xlabel('C parameter')
#     plt.ylabel('Mean Squared Error')
#     plt.xscale('log')
    
#     plt.subplot(1, 2, 2)
#     plt.plot(cv_results['param_epsilon'], -cv_results['mean_test_score'], 'o-')
#     plt.xlabel('Epsilon parameter')
#     plt.ylabel('Mean Squared Error')
#     plt.tight_layout()
#     plt.savefig(f'plots/{y_target}/{y_target}-SVR-grid-search-results.png')
#     plt.close()

#     # Plot actual vs predicted
#     _, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6), dpi=300)

#     ax1.scatter(raw_y_train, y_train_pred, alpha=0.5)
#     ax1.plot([raw_y_train.min(), raw_y_train.max()], 
#              [raw_y_train.min(), raw_y_train.max()], 'r--', lw=2)
#     ax1.set_xlabel('Actual')
#     ax1.set_ylabel('Predicted')
#     ax1.set_title('Training Set')
#     ax1.grid(True)

#     ax2.scatter(raw_y_test, y_test_pred, alpha=0.5)
#     ax2.plot([raw_y_test.min(), raw_y_test.max()], 
#              [raw_y_test.min(), raw_y_test.max()], 'r--', lw=2)
#     ax2.set_xlabel('Actual')
#     ax2.set_ylabel('Predicted')
#     ax2.set_title('Testing Set')
#     ax2.grid(True)

#     plt.tight_layout()
#     plt.savefig(f'plots/{y_target}/{y_target}-SVR-tuned-model-performance-comparison.png')
#     plt.close()

#     # Create results object with additional grid search info
#     results = YTargetMetrics(y_target, train_r2, test_r2, train_rmse, test_rmse, train_mae, test_mae)
    
#     # Add grid search results to dictionary
#     grid_search_results = {
#         'best_params': grid_search.best_params_,
#         'best_score': -grid_search.best_score_,  # Convert back from negative MSE
#         'cv_results': grid_search.cv_results_
#     }

#     return results, grid_search_results

# # Modified function to handle multiple targets
# def run_svr_for_multiple_targets(
#     y_columns: list[str],
#     Y_train: pd.DataFrame,
#     Y_test: pd.DataFrame,
#     X_train_transformed: pd.DataFrame,
#     X_test_transformed: pd.DataFrame,
#     Y_train_transformed: pd.DataFrame,
#     Y_test_transformed: pd.DataFrame,
#     preprocess_pipeline_Y: Pipeline
# ) -> Dict[str, tuple[YTargetMetrics, dict]]:
    
#     results_dict = {}
    
#     for y_target in y_columns:
#         try:
#             print(f"\nProcessing target: {y_target}")
            
#             metrics, grid_results = run_svr_model_gridsearch(
#                 y_target=y_target,
#                 Y_train=Y_train,
#                 Y_test=Y_test,
#                 X_train_transformed=X_train_transformed,
#                 X_test_transformed=X_test_transformed,
#                 Y_train_transformed=Y_train_transformed,
#                 Y_test_transformed=Y_test_transformed,
#                 preprocess_pipeline_Y=preprocess_pipeline_Y
#             )
            
#             results_dict[y_target] = (metrics, grid_results)
#             print(f"\nResults for {y_target}:")
#             print(metrics)
#             print("\nBest parameters:", grid_results['best_params'])
#             print("Best CV score (RMSE):", np.sqrt(grid_results['best_score']))
            
#         except Exception as e:
#             print(f"Error processing {y_target}: {str(e)}")
#             continue
    
#     return results_dict

# # Usage:
# Y_columns = Y_labels_all
# all_results = run_svr_for_multiple_targets(
#     y_columns=Y_columns,
#     Y_train=Y_train,
#     Y_test=Y_test,
#     X_train_transformed=X_train_yjs,
#     X_test_transformed=X_test_yjs,
#     Y_train_transformed=Y_train_yjs,
#     Y_test_transformed=Y_test_yjs,
#     preprocess_pipeline_Y=preprocess_pipeline_Y
# )

# # Create summary DataFrame with best parameters
# summary_dict = {
#     target: {
#         **metrics.to_dict(),
#         **{'best_' + k: v for k, v in grid_results['best_params'].items()}
#     }
#     for target, (metrics, grid_results) in all_results.items()
# }

# summary_df = pd.DataFrame.from_dict(summary_dict, orient='index')
# print("\nOverall Summary:")

In [32]:
# show(summary_df)

Rerun the grid search using the subset of Y labels. 

In [33]:
# Y_columns = merged_cols
# all_results = run_svr_for_multiple_targets(
#     y_columns=Y_columns,
#     Y_train=Y_train,
#     Y_test=Y_test,
#     X_train_transformed=X_train_yjs,
#     X_test_transformed=X_test_yjs,
#     Y_train_transformed=Y_train_yjs,
#     Y_test_transformed=Y_test_yjs,
#     preprocess_pipeline_Y=preprocess_pipeline_Y
# )

# # Create summary DataFrame with best parameters
# summary_dict_ss = {
#     target: {
#         **metrics.to_dict(),
#         **{'best_' + k: v for k, v in grid_results['best_params'].items()}
#     }
#     for target, (metrics, grid_results) in all_results.items()
# }

# summary_df_ss = pd.DataFrame.from_dict(summary_dict_ss, orient='index')
# print("\nOverall Summary:")

In [34]:
# show(summary_df_ss)