# **Experiment 2: Feature Engineering and Dimensionality Reduction**

## ***Objectives for this Notebook***

Machine learning models thrive on good data. But raw data often comes messy and unrefined, holding hidden gems amidst irrelevant clutter. This is where feature engineering and dimensionality reduction come in like superheroes, transforming your data from unpolished ore to gleaming treasure.

Feature engineering and dimensionality reduction are two crucial steps in the machine learning pipeline that can significantly improve the performance of your models.

**Feature Engineering** is the process of transforming raw data into features that are suitable for training and deploying machine learning models. In short, it's about:
* Selecting the right features: Choosing the most relevant features that hold predictive power for your target variable. Choose carefully, because irrelevant features can mislead your model!
* Creating new features: Combining existing features or extracting hidden patterns to unlock deeper insights. New features can be like secret weapons for your model!
* Transforming features: Scaling, normalizing, or encoding categorical data to ensure all features play fair in the model's eyes. No one wants features dominating the competition due to unfair advantages!
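
The three bullets above can be sketched on a tiny table. This is only an illustration: the column names and values are made up, and `VarianceThreshold` and `MinMaxScaler` are just one possible choice for the selecting and transforming steps, not the method used later in this notebook.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import MinMaxScaler

# Toy abundance table (hypothetical values, not taken from the microbiome dataset)
toy = pd.DataFrame({
    "genus_a": [0.0, 0.2, 0.4, 0.1],
    "genus_b": [0.5, 0.1, 0.0, 0.3],
    "genus_c": [0.0, 0.0, 0.0, 0.0],  # constant column: carries no information
})

# 1) Selecting: drop features whose variance is zero
selector = VarianceThreshold(threshold=0.0)
selector.fit(toy)
selected = toy.loc[:, selector.get_support()]

# 2) Creating: combine two existing features into a new one
engineered = selected.assign(a_plus_b=selected["genus_a"] + selected["genus_b"])

# 3) Transforming: rescale every column to the [0, 1] range
scaled = pd.DataFrame(
    MinMaxScaler().fit_transform(engineered),
    columns=engineered.columns,
    index=engineered.index,
)
print(scaled)
```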


**Dimensionality reduction** is a technique used in machine learning and statistics to reduce the number of features or variables in a dataset while preserving its essential information. The goals are to simplify the data, improve computational efficiency, mitigate the curse of dimensionality, and improve the performance of machine learning models.

Algorithms:
* Non-Negative Matrix Factorization (NMF): NMF is a factorization technique that decomposes a matrix into two non-negative matrices. It is particularly useful for non-negative data, such as images or text, and is often applied in topic modeling and image processing.
* Latent Dirichlet Allocation (LDA): LDA is a probabilistic generative model used for topic modeling. It assumes that documents are mixtures of topics and that each word's presence is attributable to one of the document's topics. LDA helps discover the underlying topics in a collection of documents.
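
To make the two algorithms more concrete, here is a minimal sketch on random count data (the data and the choice of 3 components are assumptions for illustration only). NMF approximates the input matrix as the product of two non-negative matrices `W` and `H`, while LDA returns, for each sample, a distribution over topics that sums to 1.

```python
import numpy as np
from sklearn.decomposition import NMF, LatentDirichletAllocation

# Synthetic non-negative count matrix: 20 samples x 8 features (illustrative only)
rng = np.random.RandomState(0)
X_counts = rng.poisson(lam=2.0, size=(20, 8))

# NMF: X is approximated by W @ H, with W (samples x components) and H (components x features)
nmf = NMF(n_components=3, init="random", random_state=0, max_iter=1000)
W = nmf.fit_transform(X_counts)
H = nmf.components_

# LDA: each row of doc_topic is a probability distribution over the 3 topics
lda = LatentDirichletAllocation(n_components=3, random_state=0)
doc_topic = lda.fit_transform(X_counts)

print("W shape:", W.shape, "H shape:", H.shape)
print("First rows of the topic distribution sum to:", doc_topic.sum(axis=1)[:3])
```

In both cases the reduced representation (`W` or `doc_topic`) has one column per component, which is what `n_components` controls.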


## **1.) Loading the dataset and setting parameters**

In [3]:
import pandas as pd
import numpy as np
import os

from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score, make_scorer
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline

import random
from sklearn.utils import check_random_state

seed_value = 42
# Set Python seed
random.seed(seed_value)

# Set NumPy seed
np.random.seed(seed_value)

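# Build a NumPy RandomState from the seed. scikit-learn has no global seed of its own,
# so the estimators and splitters below also receive random_state=seed_value explicitly.
sklearn_random_state = check_random_state(seed_value)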
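
A small sketch of what the seeding above does and does not do (the synthetic data and model sizes are assumptions). `check_random_state` turns an integer into a `RandomState` object; reproducibility of an estimator comes from passing the same `random_state` to it.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils import check_random_state

seed = 42

# Two RandomState objects built from the same seed produce the same draws
draws_a = check_random_state(seed).randint(0, 100, size=5)
draws_b = check_random_state(seed).randint(0, 100, size=5)

# Two forests given the same random_state and data produce identical predictions
X_demo, y_demo = make_classification(n_samples=60, n_features=5, random_state=seed)
model_1 = RandomForestClassifier(n_estimators=10, random_state=seed).fit(X_demo, y_demo)
model_2 = RandomForestClassifier(n_estimators=10, random_state=seed).fit(X_demo, y_demo)
same_predictions = np.array_equal(model_1.predict_proba(X_demo), model_2.predict_proba(X_demo))

print(draws_a, draws_b, same_predictions)
```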

In [4]:
microbiome_df = pd.read_csv("./dataset/microbiome_preprocessed_files/microbiome_merged_dfs.csv")
microbiome_df

Unnamed: 0,name,Simonsiella,Treponema,Campylobacter,Helicobacter,Paracoccus,Comamonas,Pseudomonas,Xanthomonas,Agrobacterium,...,Hungatella,Pseudopropionibacterium,Peptoanaerobacter,Emergencia,Prevotellamassilia,Criibacterium,Fournierella,Negativibacillus,Duodenibacillus,label
0,TCGA-CG-5720-01A,0.0,0.000000,0.000000,0.895050,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,STAD
1,TCGA-CN-4741-01A,0.0,0.000000,0.010470,0.000000,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,HNSC
2,TCGA-BR-6801-01A,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,STAD
3,TCGA-IG-A3I8-01A,0.0,0.000000,0.000000,0.067717,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,ESCA
4,TCGA-L5-A4OT-01A,0.0,0.000000,0.012202,0.000000,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,ESCA
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
507,TCGA-CG-5719-01A,0.0,0.000000,0.000000,0.106557,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,STAD
508,TCGA-CQ-5329-01A,0.0,0.175564,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.136613,0.0,0.0,0.0,0.0,0.0,0.0,HNSC
509,TCGA-CQ-7068-01A,0.0,0.335060,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.011534,0.0,0.0,0.0,0.0,0.0,0.0,HNSC
510,TCGA-CG-4455-01A,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.014781,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,STAD


## **2.) Creating a data loader**

In [5]:
classes = ["HNSC", "STAD", "COAD", "ESCA", "READ"]

In [6]:
# Show the counts for each class in the df
microbiome_df["label"].value_counts()

label
HNSC    155
STAD    127
COAD    125
ESCA     60
READ     45
Name: count, dtype: int64

In [7]:
def exp_2_data_loader(dataframe, label_column, classes, train_test=True):
    """
    Generate one-vs-all datasets, one per class, for experiment 2.

    Parameters:
    - dataframe: pd.DataFrame, the input DataFrame.
    - label_column: str, the column name representing the labels.
    - classes: list, the classes.
    - train_test: bool, if True split each dataset into stratified train (85%) and test (15%) sets.

    Returns:
    - dataset_dict: a dictionary whose keys are the target classes and whose values are the corresponding features and labels
    """

    dataset_dict = {}

    for i in classes:
        positive_class = i
        dframe = dataframe.copy()
        dframe['label'] = [1 if x == positive_class else 0 for x in dataframe[label_column]]
        print(dframe.label.value_counts())
        X = dframe.drop(["name", "label"], axis=1)
        y = dframe["label"]
        if train_test:
            X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, stratify=y, random_state=seed_value)
            dataset_dict[positive_class] = {"train": (X_train, y_train),
                                            "test": (X_test, y_test)}
        else:
            dataset_dict[positive_class] = {"feature": X, 
                                            "label": y}

    return dataset_dict
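
The one-vs-all relabeling and stratified split used by the loader can be illustrated on a toy frame. The sample names, the single feature, and the 25% test size here are assumptions chosen to keep the numbers easy to follow; the loader above uses 15%.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy dataset: 20 samples, three classes (hypothetical values)
toy_df = pd.DataFrame({
    "name": [f"sample_{i}" for i in range(20)],
    "feat_1": [i / 20 for i in range(20)],
    "label": ["HNSC"] * 8 + ["STAD"] * 8 + ["ESCA"] * 4,
})

# One-vs-all: the positive class becomes 1, every other class becomes 0
y_binary = (toy_df["label"] == "HNSC").astype(int)
X_toy = toy_df.drop(columns=["name", "label"])

# stratify keeps the 40% positive rate in both the train and test parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_binary, test_size=0.25, stratify=y_binary, random_state=42
)
print("train:", y_tr.value_counts().to_dict())
print("test:", y_te.value_counts().to_dict())
```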

In [8]:
exp2_datasets = exp_2_data_loader(microbiome_df, "label", classes, train_test=True)
exp2_datasets

label
0    357
1    155
Name: count, dtype: int64
label
0    385
1    127
Name: count, dtype: int64
label
0    387
1    125
Name: count, dtype: int64
label
0    452
1     60
Name: count, dtype: int64
label
0    467
1     45
Name: count, dtype: int64


{'HNSC': {'train': (     Simonsiella  Treponema  Campylobacter  Helicobacter  Paracoccus  \
   155          0.0   0.010944            0.0      0.000000         0.0   
   414          0.0   0.000000            0.0      0.107591         0.0   
   172          0.0   0.000000            0.0      0.000000         0.0   
   367          0.0   0.000000            0.0      0.421787         0.0   
   462          0.0   0.000000            0.0      0.000000         0.0   
   ..           ...        ...            ...           ...         ...   
   33           0.0   0.000000            0.0      0.000000         0.0   
   15           0.0   0.000000            0.0      0.000000         0.0   
   198          0.0   0.000000            0.0      0.000000         0.0   
   211          0.0   0.000000            0.0      0.000000         0.0   
   494          0.0   0.000000            0.0      0.000000         0.0   
   
        Comamonas  Pseudomonas  Xanthomonas  Agrobacterium  Bradyrhizobium  ...

In [9]:
exp1_best_hyperparam = pd.read_csv("./dataset/microbiome_preprocessed_files/exp1_best_hyperparam.csv")
exp1_best_hyperparam

Unnamed: 0,label,criterion,max_depth,min_samples_leaf,min_samples_split,n_estimators,test_score,mean_train_score,mean_validation_score
0,HNSC,entropy,20,1,40,50,0.785829,0.911599,0.831183
1,STAD,entropy,30,1,10,100,0.780853,0.981524,0.79308
2,COAD,gini,20,1,10,200,0.895644,0.96504,0.892741
3,ESCA,gini,10,1,10,1,0.596405,0.734715,0.66917
4,READ,entropy,10,1,20,1,0.7,0.676505,0.609679


## **3.) Feature Engineering with Cross Validation and Hyperparameter Grid Search**

Earlier, we explained that feature engineering and dimensionality reduction can sculpt and refine our data. These techniques are often stacked on top of each other, followed by the classifier. That is a lot of steps. Wouldn't it be easier if we could chain them one after the other?


**Pipeline** chains these data transformations together into a single, powerful workflow. 

**Concept and use:**
* **Chaining transformations and estimators**: You define a list of steps, each consisting of a transformer (e.g., scaling, encoding) and its corresponding parameters. The transformers pre-process the data, while the final estimator (e.g., classifier, regressor) learns from the transformed data.
* **Streamlined workflow**: Instead of calling individual fit and predict methods on each step separately, you simply call them on the Pipeline object, simplifying your code and reducing boilerplate.
* **Joint parameter tuning**: You can optimize hyperparameters for all components in the pipeline simultaneously through grid search or other methods, ensuring all steps work together seamlessly.

To import pipeline:

In [10]:
from sklearn.pipeline import Pipeline
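
Before applying this to the microbiome data, here is a minimal, self-contained sketch of the idea on synthetic data. The dataset, the step names, and the tiny parameter grid are assumptions for illustration; the absolute value is taken only because NMF requires non-negative input.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import NMF
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

# Synthetic, non-negative data (illustrative only)
X_syn, y_syn = make_classification(n_samples=120, n_features=12, n_informative=6, random_state=0)
X_syn = np.abs(X_syn)

# Chain a transformer and a classifier into one estimator
toy_pipeline = Pipeline([
    ("dim_reduc", NMF(init="random", random_state=0, max_iter=2000)),
    ("rf", RandomForestClassifier(n_estimators=20, random_state=0)),
])

# Parameters of every step are tuned jointly, addressed as "<step name>__<parameter>"
toy_grid = {"dim_reduc__n_components": [2, 4], "rf__max_depth": [3, None]}

search = GridSearchCV(
    toy_pipeline,
    toy_grid,
    scoring="balanced_accuracy",
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
)
search.fit(X_syn, y_syn)       # fits NMF and the forest inside each fold
preds = search.predict(X_syn)  # applies NMF.transform, then the forest's predict
print(search.best_params_)
```

Because the transformer sits inside the pipeline, it is refitted on the training part of every fold, so the validation part never influences the factorization.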

### **a) Dimensionality Reduction**

First, we're going to perform dimensionality reduction for the HNSC vs all setup.

In [11]:
# Retrieve data from dataset dictionary
features = exp2_datasets["HNSC"]["train"][0]
target = exp2_datasets["HNSC"]["train"][1]

# Instantiate classifier
rf = RandomForestClassifier(random_state=seed_value)

# Instantiate the NMF (dim reduc function)
dim_reduc = NMF(init='random', random_state=seed_value, max_iter=20000)

# Cross validation object
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed_value)

For our project, we're going to use Pipeline to perform dimensionality reduction/feature engineering and classification steps in one go. Read further to learn more.

In [12]:
# Create the pipeline object. Here we are chaining the dimensionality reduction step and the classifier step.
pipeline = Pipeline([
    ('dim_reduc', dim_reduc),
    ('rf', rf)
])

At this point, we need to set up the parameter grid. The best classifier parameters for each class have already been identified in notebook 02. We want to use those results plus the new parameters for our dimensionality reduction step.
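
Parameter names in a pipeline grid follow the pattern `<step name>__<parameter>`. A quick way to check which names are valid is `get_params()`, sketched here on a stand-alone pipeline that reuses the step names `dim_reduc` and `rf` (the values set at the end are arbitrary examples).

```python
from sklearn.decomposition import NMF
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

demo_pipeline = Pipeline([("dim_reduc", NMF()), ("rf", RandomForestClassifier())])

# Every key returned by get_params() can be used in a parameter grid
tunable = sorted(demo_pipeline.get_params().keys())
print([name for name in tunable if name.startswith("dim_reduc__")])

# The same names work with set_params
demo_pipeline.set_params(dim_reduc__n_components=10, rf__n_estimators=50)
print(demo_pipeline.named_steps["dim_reduc"].n_components)
```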

In [59]:
# Our new parameters for nmf
nmf_param_grid = {
    "dim_reduc__n_components": list(range(10,100))
}

Retrieve the best hyperparameters saved by experiment 1 (notebook 02).

In [14]:
exp1_best_hyperparam = pd.read_csv("./dataset/microbiome_preprocessed_files/exp1_best_hyperparam.csv")
exp1_best_hyperparam

Unnamed: 0,label,criterion,max_depth,min_samples_leaf,min_samples_split,n_estimators,test_score,mean_train_score,mean_validation_score
0,HNSC,entropy,20,1,40,50,0.785829,0.911599,0.831183
1,STAD,entropy,30,1,10,100,0.780853,0.981524,0.79308
2,COAD,gini,20,1,10,200,0.895644,0.96504,0.892741
3,ESCA,gini,10,1,10,1,0.596405,0.734715,0.66917
4,READ,entropy,10,1,20,1,0.7,0.676505,0.609679


In [16]:
exclude_cols = ["label", "test_score", "mean_train_score", "mean_validation_score"] # We don't want these columns in our parameter grid

# Choose a row (class) from the report DataFrame
selected_class = exp1_best_hyperparam.index[0] #0 is for HNSC
# Get the best hyperparameters and remove excluded columns
best_hyperparameters = exp1_best_hyperparam.loc[selected_class].drop(exclude_cols).to_dict()

best_hyperparameters = {f"rf__{key}": [value] for key, value in best_hyperparameters.items()}

# Add the additional parameters
hnsc_combined_param_grid = {**best_hyperparameters, **nmf_param_grid}
hnsc_combined_param_grid

{'rf__criterion': ['entropy'],
 'rf__max_depth': [20],
 'rf__min_samples_leaf': [1],
 'rf__min_samples_split': [40],
 'rf__n_estimators': [50],
 'dim_reduc__n_components': [10,
  11,
  12,
  13,
  14,
  15,
  16,
  17,
  18,
  19,
  20,
  21,
  22,
  23,
  24,
  25,
  26,
  27,
  28,
  29,
  30,
  31,
  32,
  33,
  34,
  35,
  36,
  37,
  38,
  39,
  40,
  41,
  42,
  43,
  44,
  45,
  46,
  47,
  48,
  49,
  50,
  51,
  52,
  53,
  54,
  55,
  56,
  57,
  58,
  59,
  60,
  61,
  62,
  63,
  64,
  65,
  66,
  67,
  68,
  69,
  70,
  71,
  72,
  73,
  74,
  75,
  76,
  77,
  78,
  79,
  80,
  81,
  82,
  83,
  84,
  85,
  86,
  87,
  88,
  89,
  90,
  91,
  92,
  93,
  94,
  95,
  96,
  97,
  98,
  99]}
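
The cell above picks the class by row position. As a variation, the same grid can be built by looking the class up by name, which avoids relying on row order. This is a sketch on a toy report: `toy_report`, `row_to_param_grid`, and the values in it are hypothetical and not part of the notebook.

```python
import pandas as pd

# Toy stand-in for the experiment 1 report (hypothetical values)
toy_report = pd.DataFrame({
    "label": ["HNSC", "STAD"],
    "criterion": ["entropy", "gini"],
    "max_depth": [20, 30],
    "test_score": [0.79, 0.78],
})
non_param_cols = ["label", "test_score"]


def row_to_param_grid(report, class_name, prefix="rf__"):
    """Turn the report row of one class into a GridSearchCV-style grid."""
    row = report.loc[report["label"] == class_name].iloc[0].drop(non_param_cols)
    # Each value is wrapped in a list because a grid maps names to candidate lists
    return {f"{prefix}{key}": [value] for key, value in row.items()}


grid = row_to_param_grid(toy_report, "STAD")
grid.update({"dim_reduc__n_components": list(range(10, 13))})
print(grid)
```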

Now finally, we have everything we need to make our gridsearch object!

In [17]:
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed_value)


# Create the GridSearchCV object
grid_search = GridSearchCV(
    estimator=pipeline,
    param_grid=hnsc_combined_param_grid,
    scoring=make_scorer(balanced_accuracy_score),  # Choose an appropriate metric for your problem
    cv=cv,
    n_jobs=-1,  # Use all available processors
    return_train_score=True  # Set to True to calculate train scores
)

# Perform the grid search
print("Fitting..")
grid_search.fit(features, target)


Fitting..




After training, let's grab the results and the best hyperparameters from the training.

In [18]:
# Access cv_results_ attribute to get detailed results
cv_results = grid_search.cv_results_

# Print the best hyperparameters
print("Best Hyperparameters:", grid_search.best_params_)

# Get the best model
best_model = grid_search.best_estimator_


Best Hyperparameters: {'dim_reduc__n_components': 54, 'rf__criterion': 'entropy', 'rf__max_depth': 20, 'rf__min_samples_leaf': 1, 'rf__min_samples_split': 40, 'rf__n_estimators': 50}
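
Beyond the single best setting, `cv_results_` holds one entry per candidate and is easier to read as a DataFrame. The sketch below uses a small synthetic search (data and grid are assumptions). Note that `mean_test_score` in `cv_results_` is the score on the validation folds, not on a held-out test set.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X_cv, y_cv = make_classification(n_samples=100, n_features=8, random_state=1)

demo_search = GridSearchCV(
    RandomForestClassifier(n_estimators=10, random_state=1),
    {"max_depth": [2, 4, None]},
    scoring="balanced_accuracy",
    cv=3,
    return_train_score=True,  # needed for the mean_train_score column
)
demo_search.fit(X_cv, y_cv)

# One row per candidate, sorted so the best candidate comes first
results = pd.DataFrame(demo_search.cv_results_)
summary = results[
    ["param_max_depth", "mean_train_score", "mean_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(summary)
```

A large gap between `mean_train_score` and `mean_test_score` for a candidate is a common sign of overfitting.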


In [19]:
# Evaluate the best model on the held-out HNSC test set.
# Note: the value stored in `accuracy` is the balanced accuracy.
predictions = best_model.predict(exp2_datasets["HNSC"]["test"][0])
accuracy = balanced_accuracy_score(exp2_datasets["HNSC"]["test"][1], predictions)
print("Accuracy on Test Set:", accuracy)

# Calculate mean train score and mean test score
mean_train_score = cv_results['mean_train_score'][grid_search.best_index_]
mean_test_score = cv_results['mean_test_score'][grid_search.best_index_]

print("Mean Train Score:", mean_train_score)
print("Mean Test Score:", mean_test_score)

report = {}
report["HNSC"] = {
    "model": best_model,
    "best_hyperparameters": grid_search.best_params_,
    "accuracy_on_test": accuracy,
    "mean_train_score": mean_train_score,
    "mean_test_score": mean_test_score
        }

Accuracy on Test Set: 0.7548309178743962
Mean Train Score: 0.9082153754595149
Mean Test Score: 0.8363061043388912
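
All three scores above are balanced accuracies, that is, the mean of the per-class recalls. The toy labels below (an assumption for illustration) show why this matters for imbalanced one-vs-all problems: a model that always predicts the majority class looks good on plain accuracy but not on balanced accuracy.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# 8 negatives, 2 positives; the "model" always predicts the majority class
y_true = np.array([0] * 8 + [1] * 2)
y_pred = np.zeros(10, dtype=int)

acc = accuracy_score(y_true, y_pred)                            # 8 of 10 correct
bal = balanced_accuracy_score(y_true, y_pred)                   # mean of per-class recall
per_class_recall = recall_score(y_true, y_pred, average=None)   # recall for class 0 and class 1

print("accuracy:", acc)
print("balanced accuracy:", bal)
print("per-class recall:", per_class_recall)
```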


### **b) Feature Engineering**

In [216]:
microbiome_df = pd.read_csv("./dataset/microbiome_preprocessed_files/microbiome_merged_dfs.csv")
microbiome_df

Unnamed: 0,name,Simonsiella,Treponema,Campylobacter,Helicobacter,Paracoccus,Comamonas,Pseudomonas,Xanthomonas,Agrobacterium,...,Hungatella,Pseudopropionibacterium,Peptoanaerobacter,Emergencia,Prevotellamassilia,Criibacterium,Fournierella,Negativibacillus,Duodenibacillus,label
0,TCGA-CG-5720-01A,0.0,0.000000,0.000000,0.895050,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,STAD
1,TCGA-CN-4741-01A,0.0,0.000000,0.010470,0.000000,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,HNSC
2,TCGA-BR-6801-01A,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,STAD
3,TCGA-IG-A3I8-01A,0.0,0.000000,0.000000,0.067717,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,ESCA
4,TCGA-L5-A4OT-01A,0.0,0.000000,0.012202,0.000000,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,ESCA
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
507,TCGA-CG-5719-01A,0.0,0.000000,0.000000,0.106557,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,STAD
508,TCGA-CQ-5329-01A,0.0,0.175564,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.136613,0.0,0.0,0.0,0.0,0.0,0.0,HNSC
509,TCGA-CQ-7068-01A,0.0,0.335060,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.011534,0.0,0.0,0.0,0.0,0.0,0.0,HNSC
510,TCGA-CG-4455-01A,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.014781,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,STAD


In [332]:
def augment_dataset(X, y, exp1_best_hyperparam, algorithm="NMF"):
    """
    Append latent features from NMF (or LDA) to the original feature set.

    The number of components is picked by grid search; the classifier step uses the
    HNSC hyperparameters (row 0 of exp1_best_hyperparam). Relies on the notebook
    globals `rf` and `seed_value`.
    """
    # Create the pipeline object. Here we are chaining the latent-feature transformer and the classifier step.
    pipeline = Pipeline([
        ('transformer', NMF() if algorithm == "NMF" else LatentDirichletAllocation()),
        ('rf', rf)
    ])

    # Candidate numbers of components for the chosen transformer
    feat_eng_param_grid = {
        "transformer__n_components": list(range(4, 64)) if algorithm == "NMF" else list(range(1, 5)),
    }

    exclude_cols = ["label", "test_score", "mean_train_score", "mean_validation_score"] # We don't want these columns in our parameter grid

    # Choose a row (class) from the report DataFrame
    selected_class = exp1_best_hyperparam.index[0] #0 is for HNSC
    # Get the best hyperparameters and remove excluded columns
    best_hyperparameters_1 = exp1_best_hyperparam.loc[selected_class].drop(exclude_cols).to_dict()

    # Prefix each key with the pipeline step name and wrap each value in a list, as GridSearchCV expects
    rf_param_grid = {f"rf__{key}": [value] for key, value in best_hyperparameters_1.items()}

    # Combine the fixed classifier parameters with the transformer search space
    combined_param_grid = {**rf_param_grid, **feat_eng_param_grid}

    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed_value)

    # Create the GridSearchCV object
    grid_search = GridSearchCV(
        estimator=pipeline,
        param_grid=combined_param_grid,
        scoring=make_scorer(balanced_accuracy_score),  # Choose an appropriate metric for your problem
        cv=cv,
        n_jobs=-1,  # Use all available processors
        return_train_score=True  # Set to True to calculate train scores
    )

    # Perform the grid search
    print("Fitting..")
    grid_search.fit(X, y)

    # Print the best hyperparameters
    print("Best Hyperparameters:", grid_search.best_params_)

    # Get the best model
    best_model = grid_search.best_estimator_

    # The number of rows should stay the same as in the original dataset,
    # while the number of columns should grow by the best n_components

    # Transform the original dataset to get the latent features
    latent_features = best_model.named_steps['transformer'].transform(X)
    n_latent = latent_features.shape[1]
    print(f"Count of new columns/latent features: {n_latent}")
    print(f"Count of columns of original feature set: {X.shape[1]}")

    # Reuse X's index so the new columns line up row by row with the original features
    latent_df = pd.DataFrame(latent_features, index=X.index, columns=[f"new_feature {k}" for k in range(1, n_latent + 1)])
    feature_augmented = pd.concat((X, latent_df), axis=1)
    print(f"Count of columns of augmented feature set: {feature_augmented.shape[1]}")
    print(f"Count of rows of augmented dataset: {len(feature_augmented)}")

    return feature_augmented
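
The augmentation step on its own, separated from the grid search, can be sketched as follows. The random data, the fixed `n_components=2`, and the non-default index are assumptions chosen to show why the latent features are given the same index as the original frame before concatenation.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import NMF

# Toy non-negative feature table with a non-default index (illustrative only)
rng = np.random.RandomState(0)
X_orig = pd.DataFrame(
    rng.rand(10, 6),
    columns=[f"genus_{i}" for i in range(6)],
    index=range(100, 110),
)

# Learn 2 latent features from the original columns
nmf_demo = NMF(n_components=2, init="random", random_state=0, max_iter=1000)
latent = nmf_demo.fit_transform(X_orig)

# Reusing X_orig.index keeps rows aligned; without it, pd.concat would align on
# mismatched indices and introduce NaN-filled rows
latent_df = pd.DataFrame(
    latent,
    index=X_orig.index,
    columns=[f"new_feature {k}" for k in range(1, latent.shape[1] + 1)],
)
augmented = pd.concat([X_orig, latent_df], axis=1)
print(augmented.shape)
```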




In [337]:
def extend_exp_2_dataset(dataframe, 
                         label_column, 
                         classes, 
                         exp1_best_hyperparam=exp1_best_hyperparam, 
                         algorithm="NMF", 
                         train_test=True):
    """

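    Build one-vs-all datasets for experiment 2 from the feature-augmented data.

    Parameters:
    - dataframe: pd.DataFrame, the input DataFrame.
    - label_column: str, the column name representing the labels.
    - classes: list, the classes.
    - exp1_best_hyperparam: pd.DataFrame, the best random forest hyperparameters from experiment 1.
    - algorithm: str, "NMF" for NMF; any other value selects LatentDirichletAllocation.
    - train_test: bool, if True split each dataset into stratified train and test sets.

    Returns:
    - dataset_dict: a dictionary whose keys are the target classes and whose values are the corresponding features and labels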
    """
    dataset_dict = {}

    exclude_cols = ["name", "label"]

    feat = dataframe.drop(columns=exclude_cols)
    tar = dataframe["label"]

    # print(f"len of feat: {len(feat)}")
    # print(f"len of tar: {len(tar)}")

    feature_augmented = augment_dataset(feat, tar,exp1_best_hyperparam, algorithm=algorithm)

    # print(f"Count of rows in original dataset:{len(dataframe)}")
    # print(f"Count of rows in augmented dataset:{len(feature_augmented)}")

    for i in classes:
        positive_class = i
        dframe = pd.DataFrame(index=feature_augmented.index)
        dframe["name"] = dataframe["name"]
        dframe = pd.concat([dframe, feature_augmented], axis=1)
        dframe['label'] = [1 if x == positive_class else 0 for x in dataframe[label_column]]
        print(dframe.label.value_counts())
        X = dframe.drop(["name", "label"], axis=1)
        y = dframe["label"]
        if train_test:
            X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, stratify=y, random_state=seed_value)
            dataset_dict[positive_class] = {"train": (X_train, y_train),
                                            "test": (X_test, y_test)}
        else:
            dataset_dict[positive_class] = {"feature": X, 
                                            "label": y}

    return dataset_dict
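
In the function above, the transformer is fitted on all rows before the train/test split, so the latent features of the test rows are learned with those rows included. A common variation, sketched here on synthetic data and not the approach used in this notebook, fits the transformer on the training rows only and then applies it to both parts. The helper `add_latent` and all data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import NMF
from sklearn.model_selection import train_test_split

# Synthetic non-negative features and balanced binary labels (illustrative only)
rng = np.random.RandomState(0)
X_all = pd.DataFrame(rng.rand(40, 5), columns=[f"genus_{i}" for i in range(5)])
y_all = pd.Series([0, 1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.25, stratify=y_all, random_state=0
)

# Fit the factorization on the training rows only
nmf_split = NMF(n_components=2, init="random", random_state=0, max_iter=1000).fit(X_tr)


def add_latent(X, fitted_transformer):
    """Append the transformer's latent features to X, keeping X's index."""
    latent = fitted_transformer.transform(X)
    cols = [f"new_feature {k}" for k in range(1, latent.shape[1] + 1)]
    return pd.concat([X, pd.DataFrame(latent, index=X.index, columns=cols)], axis=1)


X_tr_aug = add_latent(X_tr, nmf_split)
X_te_aug = add_latent(X_te, nmf_split)
print(X_tr_aug.shape, X_te_aug.shape)
```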

In [339]:
exp2_datasets_feat_eng = extend_exp_2_dataset(microbiome_df, "label", classes=classes, train_test=True)

Fitting..
Best Hyperparameters: {'rf__criterion': 'entropy', 'rf__max_depth': 20, 'rf__min_samples_leaf': 1, 'rf__min_samples_split': 40, 'rf__n_estimators': 50, 'transformer__n_components': 15}
Count of new columns/latent features: 15
Count of columns of original feature set: 131
Count of columns of augmented feature set: 146
Count of rows of augmented dataset: 512
label
0    357
1    155
Name: count, dtype: int64
label
0    385
1    127
Name: count, dtype: int64
label
0    387
1    125
Name: count, dtype: int64
label
0    452
1     60
Name: count, dtype: int64
label
0    467
1     45
Name: count, dtype: int64


In [340]:
exp2_datasets_feat_eng["HNSC"]["train"][0]

Unnamed: 0,Simonsiella,Treponema,Campylobacter,Helicobacter,Paracoccus,Comamonas,Pseudomonas,Xanthomonas,Agrobacterium,Bradyrhizobium,...,new_feature 6,new_feature 7,new_feature 8,new_feature 9,new_feature 10,new_feature 11,new_feature 12,new_feature 13,new_feature 14,new_feature 15
155,0.0,0.010944,0.0,0.000000,0.0,0.0,0.017489,0.0,0.0,0.0,...,0.000091,0.019054,0.032869,0.006719,0.000503,0.000111,0.001004,0.002378,0.021345,0.017670
414,0.0,0.000000,0.0,0.107591,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.072503,0.017535,0.028632,0.001330,0.000000,0.000024,0.000130,0.000000,0.048650,0.000000
172,0.0,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000921,0.000000,0.002354,0.000000,0.000996
367,0.0,0.000000,0.0,0.421787,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.284333,0.020627,0.059243,0.000000,0.000000,0.000000,0.002250,0.000000,0.035826,0.000000
462,0.0,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.000000,0.000000,0.000325,0.000000,0.000000,0.000179,0.000106,0.000000,0.000000,0.000830
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
33,0.0,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.000000,0.017618,0.000873,0.000000,0.000574,0.000129,0.000138,0.000000,0.000000,0.000064
15,0.0,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.000312,0.000199,0.000822,0.000000,0.004203,0.000000,0.000775,0.014377,0.000000,0.000000
198,0.0,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.000000,0.029941,0.000236,0.000000,0.000000,0.000000,0.000000,0.000000,0.011949,0.000499
211,0.0,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.000000,0.000000,0.000144,0.000000,0.000000,0.010166,0.000055,0.000139,0.000000,0.000423


In [311]:
# Retrieve data from dataset dictionary
features = exp2_datasets["HNSC"]["train"][0]
target = exp2_datasets["HNSC"]["train"][1]

print(f"Feature: {len(features)}")
print(f"Target: {len(target)}")

# Instantiate classifier
rf = RandomForestClassifier(random_state=seed_value)

# Instantiate NMF, used here as the feature engineering transformer
feat_eng = NMF()

# Cross validation object
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed_value)

Feature: 435
Target: 435


In [244]:
augment_dataset(features, target, exp1_best_hyperparam=exp1_best_hyperparam,algorithm="NMF")

Fitting..
Best Hyperparameters: {'rf__criterion': 'entropy', 'rf__max_depth': 20, 'rf__min_samples_leaf': 1, 'rf__min_samples_split': 40, 'rf__n_estimators': 50, 'transformer__n_components': 58}
Length of latent features: 58
Length of original feature set: 131
Length of augmented feature set: 189


Unnamed: 0,Simonsiella,Treponema,Campylobacter,Helicobacter,Paracoccus,Comamonas,Pseudomonas,Xanthomonas,Agrobacterium,Bradyrhizobium,...,new_feature 49,new_feature 50,new_feature 51,new_feature 52,new_feature 53,new_feature 54,new_feature 55,new_feature 56,new_feature 57,new_feature 58
155,0.0,0.010944,0.0,0.000000,0.0,0.0,0.017489,0.0,0.0,0.0,...,0.0,0.000000,0.000000,0.000000,0.0,0.104182,0.000000,0.000000,0.000000,0.0
414,0.0,0.000000,0.0,0.107591,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.000000,0.000016,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.0
172,0.0,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.000097,0.000000,0.016179,0.0,0.000177,0.099734,0.000000,0.009247,0.0
367,0.0,0.000000,0.0,0.421787,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.011709,0.000000,0.0
462,0.0,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.000000,0.000000,0.000272,0.0,0.000000,0.000000,0.000000,0.000571,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
33,0.0,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.0
15,0.0,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.0
198,0.0,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.0
211,0.0,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.000000,0.000000,0.000000,0.0,0.000000,0.003761,0.000000,0.000000,0.0


In [318]:
# Create the pipeline object. Here we are chaining the dimensionality reduction step and the classifier step.
pipeline = Pipeline([
    ('transformer', NMF()),
    ('rf', rf)
])

In [319]:
# Candidate numbers of NMF components to search over
nmf_feat_eng_param_grid = {
    "transformer__n_components": list(range(4,64))
}

In [320]:
exclude_cols = ["label", "test_score", "mean_train_score", "mean_validation_score"] # We don't want these columns in our parameter grid

# Choose a row (class) from the report DataFrame
selected_class = exp1_best_hyperparam.index[0] #0 is for HNSC
# Get the best hyperparameters and remove excluded columns
best_hyperparameters = exp1_best_hyperparam.loc[selected_class].drop(exclude_cols).to_dict()

best_hyperparameters = {f"{key}": [value] for key, value in best_hyperparameters.items()}

best_hyperparameters_copy_for_combination = {f"rf__{key}": value for key, value in best_hyperparameters.items()}
# Add the additional parameters
hnsc_feat_eng_combined_param_grid = {**best_hyperparameters_copy_for_combination, **nmf_feat_eng_param_grid}
hnsc_feat_eng_combined_param_grid

{'rf__criterion': ['entropy'],
 'rf__max_depth': [20],
 'rf__min_samples_leaf': [1],
 'rf__min_samples_split': [40],
 'rf__n_estimators': [50],
 'transformer__n_components': [4,
  5,
  6,
  7,
  8,
  9,
  10,
  11,
  12,
  13,
  14,
  15,
  16,
  17,
  18,
  19,
  20,
  21,
  22,
  23,
  24,
  25,
  26,
  27,
  28,
  29,
  30,
  31,
  32,
  33,
  34,
  35,
  36,
  37,
  38,
  39,
  40,
  41,
  42,
  43,
  44,
  45,
  46,
  47,
  48,
  49,
  50,
  51,
  52,
  53,
  54,
  55,
  56,
  57,
  58,
  59,
  60,
  61,
  62,
  63]}

In [315]:
best_hyperparameters

{'criterion': ['entropy'],
 'max_depth': [20],
 'min_samples_leaf': [1],
 'min_samples_split': [40],
 'n_estimators': [50]}

In [321]:
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed_value)

# Create the GridSearchCV object
grid_search = GridSearchCV(
    estimator=pipeline,
    param_grid=hnsc_feat_eng_combined_param_grid,
    scoring=make_scorer(balanced_accuracy_score),  # Choose an appropriate metric for your problem
    cv=cv,
    n_jobs=-1,  # Use all available processors
    return_train_score=True  # Set to True to calculate train scores
)

# Perform the grid search
print("Fitting..")
grid_search.fit(features, target)


Fitting..


In [326]:
# Access cv_results_ attribute to get detailed results
cv_results = grid_search.cv_results_

# Print the best hyperparameters
print("Best Hyperparameters:", grid_search.best_params_)

# Get the best model
best_model = grid_search.best_estimator_

# The number of rows should stay at 435, while the number of columns should grow by the best n_components

# Transform the original dataset to get the latent features
latent_features = best_model.named_steps['transformer'].transform(features)
print(f"Length of latent features: {len(latent_features.T)}")



feature_augmented = pd.concat((features, pd.DataFrame(latent_features, index=features.index, columns=[f"new_feature {str(x)}" for x in range(1,len(latent_features.T)+1)])), axis=1)
print(f"Length of augmented feature set: {len(feature_augmented.T)}")
print(f"length of feature set: {len(feature_augmented)}")

Best Hyperparameters: {'rf__criterion': 'entropy', 'rf__max_depth': 20, 'rf__min_samples_leaf': 1, 'rf__min_samples_split': 40, 'rf__n_estimators': 50, 'transformer__n_components': 58}
Length of latent features: 58
Length of augmented feature set: 189
length of feature set: 435


In [202]:
feature_augmented

Unnamed: 0,Simonsiella,Treponema,Campylobacter,Helicobacter,Paracoccus,Comamonas,Pseudomonas,Xanthomonas,Agrobacterium,Bradyrhizobium,...,new_feature 49,new_feature 50,new_feature 51,new_feature 52,new_feature 53,new_feature 54,new_feature 55,new_feature 56,new_feature 57,new_feature 58
155,0.0,0.010944,0.0,0.000000,0.0,0.0,0.017489,0.0,0.0,0.0,...,0.0,0.000000,0.000000,0.000000,0.0,0.104219,0.000000,0.000000,0.000000,0.0
414,0.0,0.000000,0.0,0.107591,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.000000,0.000016,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.0
172,0.0,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.000097,0.000000,0.016178,0.0,0.000177,0.099851,0.000000,0.009232,0.0
367,0.0,0.000000,0.0,0.421787,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.011658,0.000000,0.0
462,0.0,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.000000,0.000000,0.000272,0.0,0.000000,0.000000,0.000000,0.000570,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
33,0.0,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.0
15,0.0,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.0
198,0.0,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.0
211,0.0,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.000000,0.000000,0.000000,0.0,0.000000,0.003765,0.000000,0.000000,0.0


In [179]:
hnsc_feat_eng_combined_param_grid

{'rf__criterion': ['entropy'],
 'rf__max_depth': [20],
 'rf__min_samples_leaf': [1],
 'rf__min_samples_split': [40],
 'rf__n_estimators': [50],
 'nmf_concatenator__n_components': [4,
  5,
  6,
  7,
  8,
  9,
  10,
  11,
  12,
  13,
  14,
  15,
  16,
  17,
  18,
  19,
  20,
  21,
  22,
  23,
  24,
  25,
  26,
  27,
  28,
  29,
  30,
  31,
  32,
  33,
  34,
  35,
  36,
  37,
  38,
  39,
  40,
  41,
  42,
  43,
  44,
  45,
  46,
  47,
  48,
  49,
  50,
  51,
  52,
  53,
  54,
  55,
  56,
  57,
  58,
  59,
  60,
  61,
  62,
  63]}

In [182]:
exp1_best_hyperparam

Unnamed: 0,label,criterion,max_depth,min_samples_leaf,min_samples_split,n_estimators,test_score,mean_train_score,mean_validation_score
0,HNSC,entropy,20,1,40,50,0.785829,0.911599,0.831183
1,STAD,entropy,30,1,10,100,0.780853,0.981524,0.79308
2,COAD,gini,20,1,10,200,0.895644,0.96504,0.892741
3,ESCA,gini,10,1,10,1,0.596405,0.734715,0.66917
4,READ,entropy,10,1,20,1,0.7,0.676505,0.609679


In [206]:
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed_value)

# Create the GridSearchCV object
grid_search = GridSearchCV(
    estimator=rf,
    param_grid=best_hyperparameters,
    scoring=make_scorer(balanced_accuracy_score),  # Choose an appropriate metric for your problem
    cv=cv,
    n_jobs=-1,  # Use all available processors
    return_train_score=True  # Set to True to calculate train scores
)

# Perform the grid search
print("Fitting..")
grid_search.fit(feature_augmented, target)

Fitting..


In [207]:
# Access cv_results_ attribute to get detailed results
cv_results = grid_search.cv_results_

# Get the best model
best_model = grid_search.best_estimator_

In [209]:
exp2_datasets["HNSC"]["test"][0]

Unnamed: 0,Simonsiella,Treponema,Campylobacter,Helicobacter,Paracoccus,Comamonas,Pseudomonas,Xanthomonas,Agrobacterium,Bradyrhizobium,...,Mageeibacillus,Hungatella,Pseudopropionibacterium,Peptoanaerobacter,Emergencia,Prevotellamassilia,Criibacterium,Fournierella,Negativibacillus,Duodenibacillus
83,0.0,0.000000,0.034599,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.037657,0.0,0.0,0.0,0.0,0.0,0.0
411,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.029619,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
214,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
252,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
472,0.0,0.000000,0.010804,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
120,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
359,0.0,0.105388,0.024326,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
434,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
185,0.0,0.000000,0.020363,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0


In [324]:
# Evaluate the best model on the held-out HNSC test set
# (the printed "Accuracy" is balanced accuracy)
predictions = best_model.predict(exp2_datasets["HNSC"]["test"][0])
accuracy = balanced_accuracy_score(exp2_datasets["HNSC"]["test"][1], predictions)
print("Accuracy on Test Set:", accuracy)

# Mean train and cross-validation scores of the best candidate;
# cv_results_ stores the validation score under 'mean_test_score'
mean_train_score = cv_results['mean_train_score'][grid_search.best_index_]
mean_test_score = cv_results['mean_test_score'][grid_search.best_index_]

print("Mean Train Score:", mean_train_score)
print("Mean Test Score:", mean_test_score)

report = {}
report["HNSC"] = {
    "model": best_model,
    "best_hyperparameters": grid_search.best_params_,
    "accuracy_on_test": accuracy,
    "mean_train_score": mean_train_score,
    "mean_test_score": mean_test_score
        }

Accuracy on Test Set: 0.7858293075684379
Mean Train Score: 0.9187798236111557
Mean Test Score: 0.8293153052169446
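
The value printed above as "Accuracy on Test Set" is computed with `balanced_accuracy_score`, which averages the recall of each class instead of counting correct predictions overall. The sketch below uses made-up toy labels (not the microbiome data) to show why the two metrics can differ on imbalanced classes.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Toy labels (illustrative only): four negatives, one positive
y_true = [0, 0, 0, 0, 1]
# A classifier that always predicts the majority class
y_pred = [0, 0, 0, 0, 0]

plain = accuracy_score(y_true, y_pred)              # 4 of 5 predictions are correct
balanced = balanced_accuracy_score(y_true, y_pred)  # mean of per-class recall: (1.0 + 0.0) / 2

print("Accuracy:", plain)
print("Balanced accuracy:", balanced)
```

A majority-class predictor looks good on plain accuracy but scores no better than chance on balanced accuracy, which is why the latter is the safer scorer when class sizes differ.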


## **4.) Creating a Training and Evaluation Loop Function**

Now let's create a single function that loops over all of the classes and wraps the training and evaluation steps.

In [99]:
exp1_best_hyperparam

Unnamed: 0,label,criterion,max_depth,min_samples_leaf,min_samples_split,n_estimators,test_score,mean_train_score,mean_validation_score
0,HNSC,entropy,20,1,40,50,0.785829,0.911599,0.831183
1,STAD,entropy,30,1,10,100,0.780853,0.981524,0.79308
2,COAD,gini,20,1,10,200,0.895644,0.96504,0.892741
3,ESCA,gini,10,1,10,1,0.596405,0.734715,0.66917
4,READ,entropy,10,1,20,1,0.7,0.676505,0.609679


In [237]:
# Build one parameter grid per class: the best Random Forest settings from
# Experiment 1 (each wrapped in a one-element list) plus the extra parameters to search
def create_combined_param_grid(report_df, additional_params):
    if report_df is None or report_df.empty:
        raise ValueError("Report DataFrame is required.")

    exclude_cols = ["label", "test_score", "mean_train_score", "mean_validation_score"]
    combined_param_grid_dict = {}

    # Iterate over rows directly so this also works when the index is not 0..n-1
    for _, row in report_df.iterrows():
        # Keep only the hyperparameter columns and prefix them for the 'rf' pipeline step
        best_hyperparameters = row.drop(exclude_cols).to_dict()
        best_hyperparameters = {f"rf__{key}": [value] for key, value in best_hyperparameters.items()}

        # Add the additional parameters (e.g. the transformer's n_components range)
        combined_param_grid_dict[row["label"]] = {**best_hyperparameters, **additional_params}

    return combined_param_grid_dict

# Switch between the NMF and LDA search ranges for n_components
use_nmf_grid = False
nmf_param_grid = {
    "transformer__n_components": list(range(10,100))
}

lda_param_grid = {
    "transformer__n_components": list(range(4,64))
}

combined_param_grid = create_combined_param_grid(exp1_best_hyperparam, nmf_param_grid if use_nmf_grid else lda_param_grid)
combined_param_grid

{'HNSC': {'rf__criterion': ['entropy'],
  'rf__max_depth': [20],
  'rf__min_samples_leaf': [1],
  'rf__min_samples_split': [40],
  'rf__n_estimators': [50],
  'transformer__n_components': [4,
   5,
   6,
   7,
   8,
   9,
   10,
   11,
   12,
   13,
   14,
   15,
   16,
   17,
   18,
   19,
   20,
   21,
   22,
   23,
   24,
   25,
   26,
   27,
   28,
   29,
   30,
   31,
   32,
   33,
   34,
   35,
   36,
   37,
   38,
   39,
   40,
   41,
   42,
   43,
   44,
   45,
   46,
   47,
   48,
   49,
   50,
   51,
   52,
   53,
   54,
   55,
   56,
   57,
   58,
   59,
   60,
   61,
   62,
   63]},
 'STAD': {'rf__criterion': ['entropy'],
  'rf__max_depth': [30],
  'rf__min_samples_leaf': [1],
  'rf__min_samples_split': [10],
  'rf__n_estimators': [100],
  'transformer__n_components': [4,
   5,
   6,
   7,
   8,
   9,
   10,
   11,
   12,
   13,
   14,
   15,
   16,
   17,
   18,
   19,
   20,
   21,
   22,
   23,
   24,
   25,
   26,
   27,
   28,
   29,
   30,
   31,
   32,
   33,
   34,
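
The `rf__` and `transformer__` prefixes in the grid above follow scikit-learn's `<step name>__<parameter>` convention for addressing parameters inside a `Pipeline`. The following minimal sketch, which assumes a pipeline with the same two step names but default estimators, checks that a set of grid keys is recognised and shows how `set_params` routes values to each step.

```python
from sklearn.decomposition import NMF
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

# Step names must match the prefixes used in the parameter grid
pipeline = Pipeline([
    ("transformer", NMF()),
    ("rf", RandomForestClassifier()),
])

# Every grid key has to appear among the names returned by get_params()
grid_keys = ["rf__criterion", "rf__max_depth", "transformer__n_components"]
available = pipeline.get_params()
missing = [k for k in grid_keys if k not in available]
print("Missing keys:", missing)

# set_params forwards each value to the step named before the double underscore
pipeline.set_params(transformer__n_components=12, rf__max_depth=20)
print(pipeline.named_steps["transformer"].n_components)
print(pipeline.named_steps["rf"].max_depth)
```

A check like this is a cheap way to catch a renamed pipeline step (for example `dim_reduc` versus `transformer`) before starting a long grid search.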


In [66]:
class FeatureConcatenator(BaseEstimator, TransformerMixin):
    """Append NMF or LDA latent features to the original feature matrix."""

    def __init__(self, method='nmf', n_components=10, random_state=None, max_iter=200):
        # Only store parameters here: GridSearchCV changes them through set_params(),
        # so a model built in __init__ would keep the default n_components.
        self.method = method
        self.n_components = n_components
        self.random_state = random_state
        self.max_iter = max_iter

    def fit(self, X, y=None):
        # Build the model at fit time so the current parameter values are used
        if self.method == 'nmf':
            self.model_ = NMF(n_components=self.n_components, random_state=self.random_state, max_iter=self.max_iter)
        elif self.method == 'lda':
            self.model_ = LatentDirichletAllocation(n_components=self.n_components, random_state=self.random_state)
        else:
            raise ValueError("Invalid method. Choose 'nmf' or 'lda'.")
        self.model_.fit(X)
        return self

    def transform(self, X):
        new_features = self.model_.transform(X)
        return np.concatenate((X, new_features), axis=1)
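
A custom transformer used inside `GridSearchCV` only sees new parameter values through `set_params`, which writes attributes on the object and never re-runs `__init__`. The sketch below illustrates this with a hypothetical stand-in class (`LatentFeatureAppender`, not part of this notebook) on synthetic non-negative data: the inner NMF model is built in `fit`, so the width of the output follows the updated `n_components`.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.decomposition import NMF


class LatentFeatureAppender(BaseEstimator, TransformerMixin):
    """Hypothetical demo class: appends NMF latent features to X."""

    def __init__(self, n_components=10, random_state=None, max_iter=500):
        # Store parameters only; clone() and set_params() read and write these attributes
        self.n_components = n_components
        self.random_state = random_state
        self.max_iter = max_iter

    def fit(self, X, y=None):
        # Build the model at fit time so the current n_components is used
        self.model_ = NMF(n_components=self.n_components, init="random",
                          random_state=self.random_state, max_iter=self.max_iter)
        self.model_.fit(X)
        return self

    def transform(self, X):
        latent = self.model_.transform(X)
        return np.concatenate((np.asarray(X), latent), axis=1)


rng = np.random.default_rng(0)
X = rng.random((30, 8))  # synthetic non-negative data with 8 original features

appender = clone(LatentFeatureAppender(random_state=0))
appender.set_params(n_components=3)  # mimics what a grid search does per candidate
augmented = appender.fit_transform(X)
print(augmented.shape)  # 8 original columns plus 3 latent columns
```

The original columns are kept unchanged on the left and the latent features are appended on the right, so the augmented matrix always has `n_features + n_components` columns.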

In [101]:
# NOTE: the dimensionality-reduction path (method="dim_reduc") works and gives accurate results;
# the feature-engineering path still has a known bug that needs to be fixed.
def perform_experiment_2(dataset_dict, classes, method, use_nmf=True, cv_n_splits=5, n_jobs=12):
    report = {}

    for c in classes:
        print("Class: ", c)
        features = dataset_dict[c]["train"][0]
        target = dataset_dict[c]["train"][1]

        # Choose the transformer: NMF/LDA alone (dimensionality reduction) or
        # NMF/LDA features appended to the original ones (feature engineering)
        if method == "dim_reduc":
            if use_nmf:
                transformer = NMF(init='random', random_state=42, max_iter=20000)  # n_components is set by the grid search
            else:
                transformer = LatentDirichletAllocation(random_state=42)  # n_components is set by the grid search
        else:
            transformer = FeatureConcatenator(method="nmf" if use_nmf else "lda")

        # Define the Random Forest classifier
        rf_classifier = RandomForestClassifier(random_state=seed_value)

        pipeline = Pipeline([
            ('transformer', transformer),
            ('rf', rf_classifier)
        ])

        # Define the hyperparameters and their potential values for the grid search
        feat_eng_param_grid = {
            'transformer__n_components': list(range(4,64)) if use_nmf else list(range(1, 5)),
        }

        dim_reduc_param_grid = {
            'transformer__n_components': list(range(10,100)) if use_nmf else list(range(1, 5)),
        }

        combined_param_grid = create_combined_param_grid(exp1_best_hyperparam, dim_reduc_param_grid if method=="dim_reduc" else feat_eng_param_grid)
   

        cv = StratifiedKFold(n_splits=cv_n_splits, shuffle=True, random_state=seed_value)

        # Create the GridSearchCV object
        grid_search = GridSearchCV(
            estimator=pipeline,
            param_grid=combined_param_grid[c],
            scoring=make_scorer(balanced_accuracy_score),  # Balanced accuracy: mean of per-class recall
            cv=cv,
            n_jobs=n_jobs,  # Number of parallel jobs (function default: 12)
            return_train_score=True  # Also record scores on the training folds
        )

        # Perform the grid search
        print("Fitting..")
        grid_search.fit(features, target)

        # Access cv_results_ attribute to get detailed results
        cv_results = grid_search.cv_results_

        # Print the best hyperparameters
        print("Best Hyperparameters:", grid_search.best_params_)

        # Get the best model
        best_model = grid_search.best_estimator_

        # Evaluate the best model on the held-out test set for this class
        # (the printed "Accuracy" is balanced accuracy)
        predictions = best_model.predict(dataset_dict[c]["test"][0])
        accuracy = balanced_accuracy_score(dataset_dict[c]["test"][1], predictions)
        print("Accuracy on Test Set:", accuracy)

        # Mean train and cross-validation scores of the best candidate;
        # cv_results_ stores the validation score under 'mean_test_score'
        mean_train_score = cv_results['mean_train_score'][grid_search.best_index_]
        mean_test_score = cv_results['mean_test_score'][grid_search.best_index_]

        print("Mean Train Score:", mean_train_score)
        print("Mean Test Score:", mean_test_score)

        report[c] = {
            "model": best_model,
            "best_hyperparameters": grid_search.best_params_,
            "test_accuracy_score": accuracy,
            "mean_train_score": mean_train_score,
            "mean_test_score": mean_test_score
        }

    return report


In [102]:
# Despite the variable name, this run uses method="dim_reduc" (NMF as dimensionality reduction)
exp2_feat_eng_nmf = perform_experiment_2(exp2_datasets, classes, method="dim_reduc", use_nmf=True)

Class:  HNSC
Fitting..




Best Hyperparameters: {'rf__criterion': 'entropy', 'rf__max_depth': 20, 'rf__min_samples_leaf': 1, 'rf__min_samples_split': 40, 'rf__n_estimators': 50, 'transformer__n_components': 54}
Accuracy on Test Set: 0.7548309178743962
Mean Train Score: 0.9082153754595149
Mean Test Score: 0.8363061043388912
Class:  STAD
Fitting..




Best Hyperparameters: {'rf__criterion': 'entropy', 'rf__max_depth': 30, 'rf__min_samples_leaf': 1, 'rf__min_samples_split': 10, 'rf__n_estimators': 100, 'transformer__n_components': 10}
Accuracy on Test Set: 0.7636116152450091
Mean Train Score: 0.9252973208643918
Mean Test Score: 0.795900765900766
Class:  COAD
Fitting..




Best Hyperparameters: {'rf__criterion': 'gini', 'rf__max_depth': 20, 'rf__min_samples_leaf': 1, 'rf__min_samples_split': 10, 'rf__n_estimators': 200, 'transformer__n_components': 52}
Accuracy on Test Set: 0.9047186932849365
Mean Train Score: 0.9592534646074122
Mean Test Score: 0.8944955044955044
Class:  ESCA
Fitting..




Best Hyperparameters: {'rf__criterion': 'gini', 'rf__max_depth': 10, 'rf__min_samples_leaf': 1, 'rf__min_samples_split': 10, 'rf__n_estimators': 1, 'transformer__n_components': 46}
Accuracy on Test Set: 0.5522875816993464
Mean Train Score: 0.7792552071349641
Mean Test Score: 0.6891011619958989
Class:  READ
Fitting..




Best Hyperparameters: {'rf__criterion': 'entropy', 'rf__max_depth': 10, 'rf__min_samples_leaf': 1, 'rf__min_samples_split': 20, 'rf__n_estimators': 1, 'transformer__n_components': 93}
Accuracy on Test Set: 0.5571428571428572
Mean Train Score: 0.6780982378801057
Mean Test Score: 0.6656396925858952


In [41]:
# Old function without feature engineering
def perform_experiment_2(dataset_dict, classes, use_nmf=True, cv_n_splits=5, n_jobs=12):
    report = {}

    for c in classes:
        print("Class: ", c)
        features = dataset_dict[c]["train"][0]
        target = dataset_dict[c]["train"][1]

        # Define the Random Forest classifier
        rf_classifier = RandomForestClassifier(random_state=seed_value)

        # Create a pipeline with NMF or LDA (based on the use_nmf switch) and Random Forest
        if use_nmf:
            dimensionality_reduction = NMF(init='random', random_state=42, max_iter=20000)  # Initialize NMF without specifying n_components
        else:
            dimensionality_reduction = LatentDirichletAllocation(random_state=42)  # Initialize LDA without specifying n_components

        pipeline = Pipeline([
            ('dim_reduc', dimensionality_reduction),
            ('rf', rf_classifier)
        ])

        # Define the hyperparameters and their potential values for the grid search
        dim_reduc_param_grid = {
            'dim_reduc__n_components': list(range(10,100)) if use_nmf else list(range(1, 5)),
        }

        combined_param_grid = create_combined_param_grid(exp1_best_hyperparam, dim_reduc_param_grid)

    
        cv = StratifiedKFold(n_splits=cv_n_splits, shuffle=True, random_state=seed_value)

        # Create the GridSearchCV object
        grid_search = GridSearchCV(
            estimator=pipeline,
            param_grid=combined_param_grid[c],
            scoring=make_scorer(balanced_accuracy_score),  # Balanced accuracy: mean of per-class recall
            cv=cv,
            n_jobs=n_jobs,  # Number of parallel jobs (function default: 12)
            return_train_score=True  # Also record scores on the training folds
        )

        # Perform the grid search
        print("Fitting..")
        grid_search.fit(features, target)

        # Access cv_results_ attribute to get detailed results
        cv_results = grid_search.cv_results_

        # Print the best hyperparameters
        print("Best Hyperparameters:", grid_search.best_params_)

        # Get the best model
        best_model = grid_search.best_estimator_

        # Evaluate the best model on the held-out test set for this class
        # (the printed "Accuracy" is balanced accuracy)
        predictions = best_model.predict(dataset_dict[c]["test"][0])
        accuracy = balanced_accuracy_score(dataset_dict[c]["test"][1], predictions)
        print("Accuracy on Test Set:", accuracy)

        # Mean train and cross-validation scores of the best candidate;
        # cv_results_ stores the validation score under 'mean_test_score'
        mean_train_score = cv_results['mean_train_score'][grid_search.best_index_]
        mean_test_score = cv_results['mean_test_score'][grid_search.best_index_]

        print("Mean Train Score:", mean_train_score)
        print("Mean Test Score:", mean_test_score)

        report[c] = {
            "model": best_model,
            "best_hyperparameters": grid_search.best_params_,
            "test_accuracy_score": accuracy,
            "mean_train_score": mean_train_score,
            "mean_test_score": mean_test_score
        }

    return report


In [30]:
exp2_report = perform_experiment_2(exp2_datasets, classes, use_nmf=True)

Class:  HNSC
Fitting..




Best Hyperparameters: {'dim_reduc__n_components': 54, 'rf__criterion': 'entropy', 'rf__max_depth': 40, 'rf__min_samples_leaf': 1, 'rf__min_samples_split': 30, 'rf__n_estimators': 250}
Accuracy on Test Set: 0.823268921095008
Mean Train Score: 0.9365063909260594
Mean Test Score: 0.8402104058661436
Class:  STAD
Fitting..




Best Hyperparameters: {'dim_reduc__n_components': 10, 'rf__criterion': 'entropy', 'rf__max_depth': 40, 'rf__min_samples_leaf': 1, 'rf__min_samples_split': 20, 'rf__n_estimators': 50}
Accuracy on Test Set: 0.7636116152450091
Mean Train Score: 0.8856077656774903
Mean Test Score: 0.8017682317682319
Class:  COAD
Fitting..




Best Hyperparameters: {'dim_reduc__n_components': 34, 'rf__criterion': 'gini', 'rf__max_depth': 40, 'rf__min_samples_leaf': 1, 'rf__min_samples_split': 10, 'rf__n_estimators': 50}
Accuracy on Test Set: 0.8956442831215972
Mean Train Score: 0.9665358355578665
Mean Test Score: 0.8895171495171494
Class:  ESCA
Fitting..




Best Hyperparameters: {'dim_reduc__n_components': 68, 'rf__criterion': 'gini', 'rf__max_depth': 20, 'rf__min_samples_leaf': 1, 'rf__min_samples_split': 30, 'rf__n_estimators': 1}
Accuracy on Test Set: 0.559640522875817
Mean Train Score: 0.6870044387169199
Mean Test Score: 0.6834552289815446
Class:  READ
Fitting..




Best Hyperparameters: {'dim_reduc__n_components': 13, 'rf__criterion': 'gini', 'rf__max_depth': 40, 'rf__min_samples_leaf': 1, 'rf__min_samples_split': 10, 'rf__n_estimators': 1}
Accuracy on Test Set: 0.45
Mean Train Score: 0.7578775456914046
Mean Test Score: 0.6817902350813743


In [103]:
def best_hyperparameter_to_df(nested_dict, exp_name):
    """Flatten an experiment report into a DataFrame and save it as <exp_name>_best_hyperparam.csv."""
    path = "./dataset/microbiome_preprocessed_files/"
    data = []

    for key, value in nested_dict.items():
        entry = {'label': key}
        entry.update(value['best_hyperparameters'])
        entry['test_accuracy_score'] = value['test_accuracy_score']
        entry['mean_train_score'] = value['mean_train_score']
        entry['mean_test_score'] = value['mean_test_score']
        data.append(entry)

    df = pd.DataFrame(data)

    df.to_csv(path+f"{exp_name}_best_hyperparam.csv", index=False)
    return df

In [104]:
best_hyperparameter_to_df(exp2_feat_eng_nmf, "exp2_dim_rec_nmf")

Unnamed: 0,label,rf__criterion,rf__max_depth,rf__min_samples_leaf,rf__min_samples_split,rf__n_estimators,transformer__n_components,test_accuracy_score,mean_train_score,mean_test_score
0,HNSC,entropy,20,1,40,50,54,0.754831,0.908215,0.836306
1,STAD,entropy,30,1,10,100,10,0.763612,0.925297,0.795901
2,COAD,gini,20,1,10,200,52,0.904719,0.959253,0.894496
3,ESCA,gini,10,1,10,1,46,0.552288,0.779255,0.689101
4,READ,entropy,10,1,20,1,93,0.557143,0.678098,0.66564
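
To see at a glance where NMF-based dimensionality reduction helped, the test scores from this table can be set side by side with the Experiment 1 test scores. The sketch below copies the values from the two tables shown in this notebook by hand; the column names `exp1_test`, `exp2_nmf_test`, and `delta` are chosen here for illustration.

```python
import pandas as pd

# Test scores copied from the Experiment 1 table and the NMF table above
scores = pd.DataFrame({
    "label": ["HNSC", "STAD", "COAD", "ESCA", "READ"],
    "exp1_test": [0.785829, 0.780853, 0.895644, 0.596405, 0.700000],
    "exp2_nmf_test": [0.754831, 0.763612, 0.904719, 0.552288, 0.557143],
}).set_index("label")

# Positive delta means the NMF pipeline scored higher on the test set
scores["delta"] = scores["exp2_nmf_test"] - scores["exp1_test"]
improved = scores.index[scores["delta"] > 0].tolist()

print(scores.round(4))
print("Improved with NMF reduction:", improved)
```

With these numbers only COAD improves, and by a small margin, so the comparison is better read as a first look than as evidence that the reduction step helps or hurts in general.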


In [105]:
exp2_dim_rec_lda_report = perform_experiment_2(exp2_datasets, classes, method="dim_reduc", use_nmf=False)

Class:  HNSC
Fitting..
Best Hyperparameters: {'rf__criterion': 'entropy', 'rf__max_depth': 20, 'rf__min_samples_leaf': 1, 'rf__min_samples_split': 40, 'rf__n_estimators': 50, 'transformer__n_components': 3}
Accuracy on Test Set: 0.677536231884058
Mean Train Score: 0.8259928681388564
Mean Test Score: 0.79383704637803
Class:  STAD
Fitting..
Best Hyperparameters: {'rf__criterion': 'entropy', 'rf__max_depth': 30, 'rf__min_samples_leaf': 1, 'rf__min_samples_split': 10, 'rf__n_estimators': 100, 'transformer__n_components': 3}
Accuracy on Test Set: 0.7554446460980035
Mean Train Score: 0.9056766944653598
Mean Test Score: 0.6613686313686313
Class:  COAD
Fitting..
Best Hyperparameters: {'rf__criterion': 'gini', 'rf__max_depth': 20, 'rf__min_samples_leaf': 1, 'rf__min_samples_split': 10, 'rf__n_estimators': 200, 'transformer__n_components': 4}
Accuracy on Test Set: 0.852087114337568
Mean Train Score: 0.9527194541833325
Mean Test Score: 0.8778055278055279
Class:  ESCA
Fitting..
Best Hyperparameter

In [42]:
exp2_lda_report = perform_experiment_2(exp2_datasets, classes, use_nmf=False)

Class:  HNSC
Fitting..
Best Hyperparameters: {'dim_reduc__n_components': 4, 'rf__criterion': 'entropy', 'rf__max_depth': 40, 'rf__min_samples_leaf': 1, 'rf__min_samples_split': 30, 'rf__n_estimators': 250}
Accuracy on Test Set: 0.6493558776167472
Mean Train Score: 0.8384517568984883
Mean Test Score: 0.7942809770678623
Class:  STAD
Fitting..
Best Hyperparameters: {'dim_reduc__n_components': 3, 'rf__criterion': 'entropy', 'rf__max_depth': 40, 'rf__min_samples_leaf': 1, 'rf__min_samples_split': 20, 'rf__n_estimators': 50}
Accuracy on Test Set: 0.7813067150635209
Mean Train Score: 0.8394373808420956
Mean Test Score: 0.6754845154845156
Class:  COAD
Fitting..
Best Hyperparameters: {'dim_reduc__n_components': 3, 'rf__criterion': 'gini', 'rf__max_depth': 40, 'rf__min_samples_leaf': 1, 'rf__min_samples_split': 10, 'rf__n_estimators': 50}
Accuracy on Test Set: 0.8784029038112522
Mean Train Score: 0.9511662096570396
Mean Test Score: 0.8697735597735597
Class:  ESCA
Fitting..
Best Hyperparameters: 

In [106]:
best_hyperparameter_to_df(exp2_dim_rec_lda_report, "exp2_dim_rec_lda")

Unnamed: 0,label,rf__criterion,rf__max_depth,rf__min_samples_leaf,rf__min_samples_split,rf__n_estimators,transformer__n_components,test_accuracy_score,mean_train_score,mean_test_score
0,HNSC,entropy,20,1,40,50,3,0.677536,0.825993,0.793837
1,STAD,entropy,30,1,10,100,3,0.755445,0.905677,0.661369
2,COAD,gini,20,1,10,200,4,0.852087,0.952719,0.877806
3,ESCA,gini,10,1,10,1,4,0.441176,0.729457,0.552235
4,READ,entropy,10,1,20,1,2,0.635714,0.697575,0.599469
