# **Experiment 2: Feature Engineering and Dimensionality Reduction**

## ***Objectives for this Notebook***

Machine learning models depend on good data, but raw data is often noisy and mixes informative variables with irrelevant ones. Feature engineering and dimensionality reduction address this by turning raw measurements into a smaller, more informative set of features.

Feature engineering and dimensionality reduction are two crucial steps in the machine learning pipeline that can significantly improve the performance of your models.

**Feature engineering** is the process of transforming raw data into features that are suitable for training and deploying machine learning models. In practice, it involves:
* Selecting the right features: choosing the features that carry predictive power for the target variable. Irrelevant features add noise and can mislead the model.
* Creating new features: combining existing features or extracting latent patterns that the raw columns do not expose directly.
* Transforming features: scaling, normalizing, or encoding categorical data so that no feature dominates the model simply because of its units or range.
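
The three steps above can be sketched on a toy table. This is only an illustration: the column names and values are made up and are not part of the microbiome dataset, and the ratio feature is an arbitrary example of a derived feature.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data; names and values are illustrative only.
toy = pd.DataFrame({
    "abundance_a": [0.0, 0.2, 0.4, 0.6],
    "abundance_b": [10.0, 20.0, 30.0, 40.0],
    "site": ["oral", "gut", "gut", "oral"],
})

# Creating a new feature: the ratio of two existing columns.
toy["a_to_b"] = toy["abundance_a"] / toy["abundance_b"]

# Transforming features: standardize numeric columns to zero mean and unit variance.
numeric_cols = ["abundance_a", "abundance_b", "a_to_b"]
scaled = pd.DataFrame(
    StandardScaler().fit_transform(toy[numeric_cols]),
    columns=numeric_cols,
    index=toy.index,
)

# Encoding categorical data: one indicator column per category.
encoded = pd.get_dummies(toy["site"], prefix="site")

engineered = pd.concat([scaled, encoded], axis=1)
print(engineered.shape)
```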


**Dimensionality reduction** is a technique used in machine learning and statistics to reduce the number of features or variables in a dataset while preserving its essential information. The goal is to simplify the data and improve computational efficiency, mitigate the curse of dimensionality, and enhance the performance of machine learning models.

Algorithms used in this notebook:
* Non-Negative Matrix Factorization (NMF): NMF approximates a non-negative matrix as the product of two lower-rank non-negative matrices. It is particularly useful for non-negative data, such as images or text, and is often applied in topic modeling and image processing.
* Latent Dirichlet Allocation (LDA): LDA is a probabilistic generative model used for topic modeling. It assumes that documents are mixtures of topics and that each word's presence is attributable to one of the document's topics. LDA helps discover the underlying topics in a collection of documents.
* Principal Component Analysis (PCA): PCA projects the data onto the orthogonal directions of greatest variance. It is included in the code below alongside NMF and LDA.
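
As a rough sketch of what NMF and LDA return, the example below fits scikit-learn's `NMF` and `LatentDirichletAllocation` on a small random count matrix. The data, the choice of three components, and the variable names are assumptions for illustration only; the notebook itself tunes `n_components` by grid search.

```python
import numpy as np
from sklearn.decomposition import NMF, LatentDirichletAllocation

# Hypothetical non-negative count matrix: 20 samples x 8 features.
rng = np.random.RandomState(0)
X_toy = rng.randint(0, 10, size=(20, 8)).astype(float)

# NMF: X_toy is approximated by W @ H, with both factors non-negative.
nmf = NMF(n_components=3, init="nndsvda", max_iter=500, random_state=0)
W = nmf.fit_transform(X_toy)   # sample-by-component weights
H = nmf.components_            # component-by-feature loadings

# LDA: each row of the output is a distribution over topics.
lda = LatentDirichletAllocation(n_components=3, random_state=0)
doc_topic = lda.fit_transform(X_toy)

print(W.shape, H.shape, doc_topic.shape)
```

In both cases the transformed matrix has one row per sample and one column per component, which is the form of the latent features appended to the dataset later in this notebook.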


## **1.) Loading the dataset and setting parameters**

In [116]:
import os
import random

import numpy as np
import pandas as pd

from sklearn.decomposition import NMF, LatentDirichletAllocation, PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score, make_scorer
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.utils import check_random_state

seed_value = 42
# Set Python seed
random.seed(seed_value)

# Set NumPy seed
np.random.seed(seed_value)

# Set scikit-learn seed
sklearn_random_state = check_random_state(seed_value)

In [2]:
microbiome_df = pd.read_csv("./dataset/microbiome_preprocessed_files/microbiome_merged_dfs.csv")
microbiome_df

Unnamed: 0,name,Simonsiella,Treponema,Campylobacter,Helicobacter,Paracoccus,Comamonas,Pseudomonas,Xanthomonas,Agrobacterium,...,Hungatella,Pseudopropionibacterium,Peptoanaerobacter,Emergencia,Prevotellamassilia,Criibacterium,Fournierella,Negativibacillus,Duodenibacillus,label
0,TCGA-CG-5720-01A,0.0,0.000000,0.000000,0.895050,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,STAD
1,TCGA-CN-4741-01A,0.0,0.000000,0.010470,0.000000,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,HNSC
2,TCGA-BR-6801-01A,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,STAD
3,TCGA-IG-A3I8-01A,0.0,0.000000,0.000000,0.067717,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,ESCA
4,TCGA-L5-A4OT-01A,0.0,0.000000,0.012202,0.000000,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,ESCA
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
507,TCGA-CG-5719-01A,0.0,0.000000,0.000000,0.106557,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,STAD
508,TCGA-CQ-5329-01A,0.0,0.175564,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.136613,0.0,0.0,0.0,0.0,0.0,0.0,HNSC
509,TCGA-CQ-7068-01A,0.0,0.335060,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.011534,0.0,0.0,0.0,0.0,0.0,0.0,HNSC
510,TCGA-CG-4455-01A,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.014781,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,STAD


In [10]:
classes = ["HNSC", "STAD", "COAD", "ESCA", "READ"]

## **2.) Creating a data loader**

In [128]:
exp1_best_hyperparam_report = pd.read_csv("./dataset/microbiome_preprocessed_files/exp1_v2_best_hyperparam.csv")
exp1_best_hyperparam_report

Unnamed: 0,label,criterion,max_depth,max_features,min_samples_leaf,min_samples_split,n_estimators,test_score,mean_train_score,mean_validation_score
0,HNSC,gini,21,61,1,42,51,0.792271,0.940221,0.845135
1,STAD,gini,11,101,1,2,101,0.772232,0.974606,0.833919
2,COAD,gini,11,61,1,42,101,0.92196,0.952209,0.924366
3,ESCA,entropy,11,81,1,22,1,0.526144,0.78764,0.716555
4,READ,gini,11,61,1,2,1,0.5,0.827371,0.67752


In [129]:
# Build one hyperparameter grid per class from the best random forest hyperparameters found in experiment 1

def create_hyperparameter_grids(report_df, additional_param=None):
    if report_df is None or report_df.empty:
        raise ValueError("Report DataFrame is required.")

    # Columns that hold the class name or scores rather than hyperparameters
    exclude_cols = ["label", "test_score", "mean_train_score", "mean_validation_score"]
    hyperparameter_grid_by_class = {}

    # Iterate over rows directly so the function does not depend on a default integer index
    for _, row in report_df.iterrows():
        # Wrap each value in a list, the format GridSearchCV expects for a grid
        best_hyperparameters = {key: [value] for key, value in row.drop(exclude_cols).to_dict().items()}

        if additional_param is not None:
            # Prefix with the pipeline step name "rf", then add the extra grid (e.g. n_components)
            best_hyperparameters = {f"rf__{key}": value for key, value in best_hyperparameters.items()}
            best_hyperparameters = {**best_hyperparameters, **additional_param}

        hyperparameter_grid_by_class[row["label"]] = best_hyperparameters

    return hyperparameter_grid_by_class

nmf_pca_param_grid = {
    "transformer__n_components": list(range(4, 65))  # 4 to 64 components, as in the paper
}

lda_param_grid = {
    "transformer__n_components": list(range(1, len(classes)))  # 1 to len(classes) - 1 components, as in the paper
}
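
The `rf__` and `transformer__` prefixes follow scikit-learn's `<step name>__<parameter>` convention for addressing parameters of a pipeline step. The sketch below, which uses a stand-in pipeline and a hand-written grid rather than the notebook's objects, checks that every key in a grid is a valid pipeline parameter and counts the candidates the grid expands to.

```python
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import Pipeline

# Stand-in pipeline with the same step names as the notebook ("transformer", "rf").
toy_pipeline = Pipeline([
    ("transformer", PCA()),
    ("rf", RandomForestClassifier(random_state=0)),
])

# Hand-written grid for illustration: fixed forest settings, searched n_components.
toy_grid = {
    "rf__max_depth": [11],
    "rf__n_estimators": [51],
    "transformer__n_components": list(range(4, 65)),
}

# Every grid key must appear among the pipeline's parameter names.
valid_names = toy_pipeline.get_params().keys()
unknown = [name for name in toy_grid if name not in valid_names]

# Single-valued lists contribute a factor of 1, so only n_components is searched.
n_candidates = len(ParameterGrid(toy_grid))
print(unknown, n_candidates)
```

Because the random forest entries are single-valued lists, the number of candidates equals the number of `n_components` values.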

In [130]:
def augment_dataset(X, y, hyperparameters, algorithm="NMF"):
    """Append latent features from NMF, LDA, or PCA to X, with n_components tuned for each class grid."""
    augmented_datasets_dict = {}

    # Chain the dimensionality reduction step and the classifier step
    algorithms_dict = {"NMF": NMF(), "LDA": LatentDirichletAllocation(), "PCA": PCA()}
    pipeline = Pipeline([
        ('transformer', algorithms_dict[algorithm]),
        ('rf', RandomForestClassifier(random_state=seed_value))
    ])

    # Experiment 1 random forest hyperparameters plus the n_components grid for the transformer
    transformer_grid = lda_param_grid if algorithm == "LDA" else nmf_pca_param_grid
    combined_param_grid = create_hyperparameter_grids(hyperparameters, transformer_grid)

    print(f'Using algorithm {algorithm}:')
    for i in classes:
        print(f"Class {i}")
        cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed_value)

        grid_search = GridSearchCV(
            estimator=pipeline,
            param_grid=combined_param_grid[i],
            scoring=make_scorer(balanced_accuracy_score),
            cv=cv,
            n_jobs=-1,  # Use all available processors
            return_train_score=True
        )

        # Search over n_components; the refit transformer sees every row of X
        print("Fitting..")
        grid_search.fit(X, y)
        print("Best Hyperparameters:", grid_search.best_params_)
        best_model = grid_search.best_estimator_

        # Project X onto the latent space of the best transformer.
        # The row count is unchanged; the column count grows by the best n_components.
        latent_features = best_model.named_steps['transformer'].transform(X)
        n_latent = latent_features.shape[1]
        print(f"Count of new columns/latent features: {n_latent}")
        print(f"Count of columns of original feature set: {X.shape[1]}")

        latent_df = pd.DataFrame(
            latent_features,
            index=X.index,
            columns=[f"new_feature {k}" for k in range(1, n_latent + 1)]
        )
        feature_augmented = pd.concat((X, latent_df), axis=1)
        print(f"Count of columns of augmented feature set: {feature_augmented.shape[1]}")
        print(f"Count of rows of augmented dataset: {len(feature_augmented)}")

        augmented_datasets_dict[i] = feature_augmented

    return augmented_datasets_dict
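
The augmentation step in `augment_dataset` can be seen in isolation on a small made-up frame. The sketch below uses PCA with two components and hypothetical column names; it only illustrates that the latent columns are appended next to the original ones and that reusing the original index keeps the rows aligned.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical feature table with a non-default index, as after shuffling or filtering.
rng = np.random.RandomState(0)
X_small = pd.DataFrame(
    rng.rand(6, 4),
    columns=["g1", "g2", "g3", "g4"],
    index=[10, 11, 12, 13, 14, 15],
)

# Latent features come back as a plain array with one row per sample.
latent = PCA(n_components=2).fit_transform(X_small)

# Reusing X_small.index makes pd.concat align row by row instead of introducing NaNs.
latent_df = pd.DataFrame(
    latent,
    index=X_small.index,
    columns=[f"new_feature {k}" for k in range(1, latent.shape[1] + 1)],
)
augmented = pd.concat([X_small, latent_df], axis=1)
print(augmented.shape)
```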




In [122]:
def exp_2_data_loader(dataframe,
                      label_column,
                      classes,
                      report_df=exp1_best_hyperparam_report,
                      algorithm="NMF",
                      train_test=True):
    """
    Build one binary (one-vs-rest) dataset per class from the augmented features.

    Parameters:
    - dataframe: samples with a "name" column, feature columns, and the label column
    - label_column: name of the column that holds the class labels
    - classes: class labels to build datasets for
    - report_df: best hyperparameters per class from experiment 1
    - algorithm: "NMF", "LDA", or "PCA"
    - train_test: if True, return a stratified 85/15 train/test split

    Returns:
    - augmented_dataset_dict: maps each class to {"train": (X, y), "test": (X, y)},
      or to {"feature": X, "label": y} when train_test is False
    """
    augmented_dataset_dict = {}

    feat = dataframe.drop(columns=["name", label_column])
    tar = dataframe[label_column]

    augmented_features_dict = augment_dataset(feat, tar, hyperparameters=report_df, algorithm=algorithm)

    for positive_class in classes:
        dframe = pd.DataFrame(index=augmented_features_dict[positive_class].index)
        dframe["name"] = dataframe["name"]
        dframe = pd.concat([dframe, augmented_features_dict[positive_class]], axis=1)
        # One-vs-rest target: 1 for the positive class, 0 for every other class
        dframe["label"] = [1 if x == positive_class else 0 for x in dataframe[label_column]]
        print(dframe.label.value_counts())
        X = dframe.drop(["name", "label"], axis=1)
        y = dframe["label"]
        if train_test:
            X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, stratify=y, random_state=seed_value)
            augmented_dataset_dict[positive_class] = {"train": (X_train, y_train),
                                                      "test": (X_test, y_test)}
        else:
            augmented_dataset_dict[positive_class] = {"feature": X,
                                                      "label": y}

    return augmented_dataset_dict
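
The loader turns the multi-class labels into a one-vs-rest target for each class and then splits with `stratify=y`. The sketch below repeats those two steps on made-up labels; the class counts and the 25% test size are chosen only so the arithmetic is easy to follow and do not match the notebook's data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical labels: 8 HNSC, 8 STAD, 4 COAD samples.
labels = pd.Series(["HNSC"] * 8 + ["STAD"] * 8 + ["COAD"] * 4)
features = pd.DataFrame({"f1": range(20)})

# One-vs-rest target for HNSC: 8 positives, 12 negatives.
y_binary = (labels == "HNSC").astype(int)

# Stratifying keeps the positive fraction the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    features, y_binary, test_size=0.25, stratify=y_binary, random_state=42
)

print(y_tr.value_counts().to_dict(), y_te.value_counts().to_dict())
```

Stratification matters most for the rare classes: without it, a small test split could end up with very few positive samples, or none.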

In [131]:
exp2_datasets_nmf = exp_2_data_loader(microbiome_df, "label", classes=classes, report_df=exp1_best_hyperparam_report, algorithm="NMF", train_test=True)

Using algorithm NMF:
Class HNSC
Fitting..
Best Hyperparameters: {'rf__criterion': 'gini', 'rf__max_depth': 21, 'rf__max_features': 61, 'rf__min_samples_leaf': 1, 'rf__min_samples_split': 42, 'rf__n_estimators': 51, 'transformer__n_components': 54}
Count of new columns/latent features: 54
Count of columns of original feature set: 131
Count of columns of augmented feature set: 185
Count of rows of augmented dataset: 512
Class STAD
Fitting..
Best Hyperparameters: {'rf__criterion': 'gini', 'rf__max_depth': 11, 'rf__max_features': 101, 'rf__min_samples_leaf': 1, 'rf__min_samples_split': 2, 'rf__n_estimators': 101, 'transformer__n_components': 60}
Count of new columns/latent features: 60
Count of columns of original feature set: 131
Count of columns of augmented feature set: 191
Count of rows of augmented dataset: 512
Class COAD
Fitting..
Best Hyperparameters: {'rf__criterion': 'gini', 'rf__max_depth': 11, 'rf__max_features': 61, 'rf__min_samples_leaf': 1, 'rf__min_samples_split': 42, 'rf__n

In [132]:
exp2_datasets_lda = exp_2_data_loader(microbiome_df, "label", classes=classes, report_df=exp1_best_hyperparam_report, algorithm="LDA", train_test=True)

Using algorithm LDA:
Class HNSC
Fitting..
Best Hyperparameters: {'rf__criterion': 'gini', 'rf__max_depth': 21, 'rf__max_features': 61, 'rf__min_samples_leaf': 1, 'rf__min_samples_split': 42, 'rf__n_estimators': 51, 'transformer__n_components': 4}
Count of new columns/latent features: 4
Count of columns of original feature set: 131
Count of columns of augmented feature set: 135
Count of rows of augmented dataset: 512
Class STAD
Fitting..
Best Hyperparameters: {'rf__criterion': 'gini', 'rf__max_depth': 11, 'rf__max_features': 101, 'rf__min_samples_leaf': 1, 'rf__min_samples_split': 2, 'rf__n_estimators': 101, 'transformer__n_components': 3}
Count of new columns/latent features: 3
Count of columns of original feature set: 131
Count of columns of augmented feature set: 134
Count of rows of augmented dataset: 512
Class COAD
Fitting..
Best Hyperparameters: {'rf__criterion': 'gini', 'rf__max_depth': 11, 'rf__max_features': 61, 'rf__min_samples_leaf': 1, 'rf__min_samples_split': 42, 'rf__n_est

In [133]:
exp2_datasets_pca = exp_2_data_loader(microbiome_df, "label", classes=classes, report_df=exp1_best_hyperparam_report, algorithm="PCA", train_test=True)

Using algorithm PCA:
Class HNSC
Fitting..
Best Hyperparameters: {'rf__criterion': 'gini', 'rf__max_depth': 21, 'rf__max_features': 61, 'rf__min_samples_leaf': 1, 'rf__min_samples_split': 42, 'rf__n_estimators': 51, 'transformer__n_components': 45}
Count of new columns/latent features: 45
Count of columns of original feature set: 131
Count of columns of augmented feature set: 176
Count of rows of augmented dataset: 512
Class STAD
Fitting..
Best Hyperparameters: {'rf__criterion': 'gini', 'rf__max_depth': 11, 'rf__max_features': 101, 'rf__min_samples_leaf': 1, 'rf__min_samples_split': 2, 'rf__n_estimators': 101, 'transformer__n_components': 45}
Count of new columns/latent features: 45
Count of columns of original feature set: 131
Count of columns of augmented feature set: 176
Count of rows of augmented dataset: 512
Class COAD
Fitting..
Best Hyperparameters: {'rf__criterion': 'gini', 'rf__max_depth': 11, 'rf__max_features': 61, 'rf__min_samples_leaf': 1, 'rf__min_samples_split': 42, 'rf__n

In [134]:
exp2_datasets_nmf["HNSC"]["train"][0]

Unnamed: 0,Simonsiella,Treponema,Campylobacter,Helicobacter,Paracoccus,Comamonas,Pseudomonas,Xanthomonas,Agrobacterium,Bradyrhizobium,...,new_feature 45,new_feature 46,new_feature 47,new_feature 48,new_feature 49,new_feature 50,new_feature 51,new_feature 52,new_feature 53,new_feature 54
155,0.0,0.010944,0.0,0.000000,0.0,0.0,0.017489,0.0,0.0,0.0,...,0.000000,0.000000,0.0,0.063332,0.000951,0.000000,0.000000,0.000000,0.0,0.000000
414,0.0,0.000000,0.0,0.107591,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.000000,0.000000,0.0,0.000000,0.000000,0.000093,0.000000,0.000000,0.0,0.000060
172,0.0,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.002634,0.023495,0.0,0.000000,0.000000,0.000000,0.005242,0.043899,0.0,0.000000
367,0.0,0.000000,0.0,0.421787,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.000000,0.000000,0.0,0.001253,0.000000,0.000000,0.000000,0.000000,0.0,0.000000
462,0.0,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.000398,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
33,0.0,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.051051,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000
15,0.0,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000003
198,0.0,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.000000,0.000000,0.0,0.000000,0.000000,0.000068,0.000000,0.000000,0.0,0.000050
211,0.0,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000112,0.000000,0.0,0.000000


In [141]:
exp1_best_hyperparam_by_class = create_hyperparameter_grids(exp1_best_hyperparam_report, None)
exp1_best_hyperparam_report

Unnamed: 0,label,criterion,max_depth,max_features,min_samples_leaf,min_samples_split,n_estimators,test_score,mean_train_score,mean_validation_score
0,HNSC,gini,21,61,1,42,51,0.792271,0.940221,0.845135
1,STAD,gini,11,101,1,2,101,0.772232,0.974606,0.833919
2,COAD,gini,11,61,1,42,101,0.92196,0.952209,0.924366
3,ESCA,entropy,11,81,1,22,1,0.526144,0.78764,0.716555
4,READ,gini,11,61,1,2,1,0.5,0.827371,0.67752


In [146]:
def perform_gridsearchcv(dataset_dict, classes, cv_n_splits=5, n_jobs=-1):
    report = {}

    # Each grid holds a single candidate per class (the best hyperparameters from experiment 1),
    # so GridSearchCV acts here as a cross-validated fit rather than a search
    exp1_best_hyperparameter_grids = create_hyperparameter_grids(exp1_best_hyperparam_report, None)

    for c in classes:
        print("Class: ", c)
        features, target = dataset_dict[c]["train"]
        X_test, y_test = dataset_dict[c]["test"]

        rf = RandomForestClassifier(random_state=seed_value)
        cv = StratifiedKFold(n_splits=cv_n_splits, shuffle=True, random_state=seed_value)

        grid_search = GridSearchCV(
            estimator=rf,
            param_grid=exp1_best_hyperparameter_grids[c],
            scoring=make_scorer(balanced_accuracy_score),
            cv=cv,
            n_jobs=n_jobs,  # -1 uses all available processors
            return_train_score=True
        )

        grid_search.fit(features, target)
        cv_results = grid_search.cv_results_
        print("Best Hyperparameters:", grid_search.best_params_)

        # The best estimator is refit on the full training split;
        # evaluate it with balanced accuracy on the held-out test split
        best_model = grid_search.best_estimator_
        predictions = best_model.predict(X_test)
        accuracy = balanced_accuracy_score(y_test, predictions)
        print("Accuracy on Test Set:", accuracy)

        # Mean scores across the CV folds; 'mean_test_score' is the validation score
        mean_train_score = cv_results['mean_train_score'][grid_search.best_index_]
        mean_test_score = cv_results['mean_test_score'][grid_search.best_index_]
        print("Mean Train Score:", mean_train_score)
        print("Mean Test Score:", mean_test_score)

        report[c] = {
            "model": best_model,
            "best_hyperparameters": grid_search.best_params_,
            "test_score": accuracy,
            "mean_train_score": mean_train_score,
            "mean_validation_score": mean_test_score
        }

    return report
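
Both the cross-validation scorer and the test-set metric in `perform_gridsearchcv` are balanced accuracy, the mean of the per-class recalls. The sketch below uses made-up labels, not the notebook's data, to show how it differs from plain accuracy on an imbalanced binary task.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Hypothetical imbalanced test set: 9 negatives, 1 positive.
y_true = [0] * 9 + [1]

# A classifier that never predicts the positive class.
y_all_negative = [0] * 10

# Plain accuracy rewards the majority class: 9 of 10 correct.
acc = accuracy_score(y_true, y_all_negative)

# Balanced accuracy averages recall per class: (1.0 + 0.0) / 2.
bal = balanced_accuracy_score(y_true, y_all_negative)

print(acc, bal)  # 0.9 0.5
```

In this sketch, a balanced accuracy of exactly 0.5 is what a model obtains when it assigns every sample to a single class, which is one possible reading of a 0.5 test score in a one-vs-rest setting.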


In [147]:
exp2_nmf_report = perform_gridsearchcv(exp2_datasets_nmf, classes)
exp2_nmf_report

Class:  HNSC
Best Hyperparameters: {'criterion': 'gini', 'max_depth': 21, 'max_features': 61, 'min_samples_leaf': 1, 'min_samples_split': 42, 'n_estimators': 51}
Accuracy on Test Set: 0.7922705314009661
Mean Train Score: 0.9399107192703534
Mean Test Score: 0.8222000373639717
Class:  STAD
Best Hyperparameters: {'criterion': 'gini', 'max_depth': 11, 'max_features': 101, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 101}
Accuracy on Test Set: 0.7722323049001816
Mean Train Score: 0.9907778668805133
Mean Test Score: 0.8185747585747587
Class:  COAD
Best Hyperparameters: {'criterion': 'gini', 'max_depth': 11, 'max_features': 61, 'min_samples_leaf': 1, 'min_samples_split': 42, 'n_estimators': 101}
Accuracy on Test Set: 0.8956442831215972
Mean Train Score: 0.953363617482942
Mean Test Score: 0.9133266733266734
Class:  ESCA
Best Hyperparameters: {'criterion': 'entropy', 'max_depth': 11, 'max_features': 81, 'min_samples_leaf': 1, 'min_samples_split': 22, 'n_estimators': 1}
Accurac

{'HNSC': {'model': RandomForestClassifier(max_depth=21, max_features=61, min_samples_split=42,
                         n_estimators=51, random_state=42),
  'best_hyperparameters': {'criterion': 'gini',
   'max_depth': 21,
   'max_features': 61,
   'min_samples_leaf': 1,
   'min_samples_split': 42,
   'n_estimators': 51},
  'test_score': 0.7922705314009661,
  'mean_train_score': 0.9399107192703534,
  'mean_validation_score': 0.8222000373639717},
 'STAD': {'model': RandomForestClassifier(max_depth=11, max_features=101, n_estimators=101,
                         random_state=42),
  'best_hyperparameters': {'criterion': 'gini',
   'max_depth': 11,
   'max_features': 101,
   'min_samples_leaf': 1,
   'min_samples_split': 2,
   'n_estimators': 101},
  'test_score': 0.7722323049001816,
  'mean_train_score': 0.9907778668805133,
  'mean_validation_score': 0.8185747585747587},
 'COAD': {'model': RandomForestClassifier(max_depth=11, max_features=61, min_samples_split=42,
                        

In [148]:
def best_hyperparameter_to_df(nested_dict, exp_name):
    """Flatten a per-class report into a DataFrame and save it as <exp_name>_best_hyperparam.csv."""
    path = "./dataset/microbiome_preprocessed_files/"
    data = []

    for key, value in nested_dict.items():
        # One row per class: label, best hyperparameters, then the three scores
        entry = {'label': key}
        entry.update(value['best_hyperparameters'])
        entry['test_score'] = value['test_score']
        entry['mean_train_score'] = value['mean_train_score']
        entry['mean_validation_score'] = value['mean_validation_score']
        data.append(entry)

    df = pd.DataFrame(data)

    df.to_csv(os.path.join(path, f"{exp_name}_best_hyperparam.csv"), index=False)
    return df

In [149]:
best_hyperparameter_to_df(exp2_nmf_report, "exp2_nmf_v2")

Unnamed: 0,label,criterion,max_depth,max_features,min_samples_leaf,min_samples_split,n_estimators,test_score,mean_train_score,mean_validation_score
0,HNSC,gini,21,61,1,42,51,0.792271,0.939911,0.8222
1,STAD,gini,11,101,1,2,101,0.772232,0.990778,0.818575
2,COAD,gini,11,61,1,42,101,0.895644,0.953364,0.913327
3,ESCA,entropy,11,81,1,22,1,0.526144,0.786434,0.645629
4,READ,gini,11,61,1,2,1,0.5,0.827056,0.568287


In [150]:
exp2_lda_report = perform_gridsearchcv(exp2_datasets_lda, classes)
exp2_lda_report

Class:  HNSC
Best Hyperparameters: {'criterion': 'gini', 'max_depth': 21, 'max_features': 61, 'min_samples_leaf': 1, 'min_samples_split': 42, 'n_estimators': 51}
Accuracy on Test Set: 0.8047504025764896
Mean Train Score: 0.9348363099535199
Mean Test Score: 0.83845593386577
Class:  STAD
Best Hyperparameters: {'criterion': 'gini', 'max_depth': 11, 'max_features': 101, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 101}
Accuracy on Test Set: 0.7459165154264973
Mean Train Score: 0.980379577653034
Mean Test Score: 0.8293972693972694
Class:  COAD
Best Hyperparameters: {'criterion': 'gini', 'max_depth': 11, 'max_features': 61, 'min_samples_leaf': 1, 'min_samples_split': 42, 'n_estimators': 101}
Accuracy on Test Set: 0.8956442831215972
Mean Train Score: 0.9532422533479344
Mean Test Score: 0.9146253746253746
Class:  ESCA
Best Hyperparameters: {'criterion': 'entropy', 'max_depth': 11, 'max_features': 81, 'min_samples_leaf': 1, 'min_samples_split': 22, 'n_estimators': 1}
Accuracy 

{'HNSC': {'model': RandomForestClassifier(max_depth=21, max_features=61, min_samples_split=42,
                         n_estimators=51, random_state=42),
  'best_hyperparameters': {'criterion': 'gini',
   'max_depth': 21,
   'max_features': 61,
   'min_samples_leaf': 1,
   'min_samples_split': 42,
   'n_estimators': 51},
  'test_score': 0.8047504025764896,
  'mean_train_score': 0.9348363099535199,
  'mean_validation_score': 0.83845593386577},
 'STAD': {'model': RandomForestClassifier(max_depth=11, max_features=101, n_estimators=101,
                         random_state=42),
  'best_hyperparameters': {'criterion': 'gini',
   'max_depth': 11,
   'max_features': 101,
   'min_samples_leaf': 1,
   'min_samples_split': 2,
   'n_estimators': 101},
  'test_score': 0.7459165154264973,
  'mean_train_score': 0.980379577653034,
  'mean_validation_score': 0.8293972693972694},
 'COAD': {'model': RandomForestClassifier(max_depth=11, max_features=61, min_samples_split=42,
                         n_

In [151]:
best_hyperparameter_to_df(exp2_lda_report, "exp2_lda_v2")

Unnamed: 0,label,criterion,max_depth,max_features,min_samples_leaf,min_samples_split,n_estimators,test_score,mean_train_score,mean_validation_score
0,HNSC,gini,21,61,1,42,51,0.80475,0.934836,0.838456
1,STAD,gini,11,101,1,2,101,0.745917,0.98038,0.829397
2,COAD,gini,11,61,1,42,101,0.895644,0.953242,0.914625
3,ESCA,entropy,11,81,1,22,1,0.685458,0.803227,0.668746
4,READ,gini,11,61,1,2,1,0.5,0.835796,0.645608


In [152]:
exp2_pca_report = perform_gridsearchcv(exp2_datasets_pca, classes)
exp2_pca_report

Class:  HNSC
Best Hyperparameters: {'criterion': 'gini', 'max_depth': 21, 'max_features': 61, 'min_samples_leaf': 1, 'min_samples_split': 42, 'n_estimators': 51}
Accuracy on Test Set: 0.823268921095008
Mean Train Score: 0.9458994255220672
Mean Test Score: 0.811541263836346
Class:  STAD
Best Hyperparameters: {'criterion': 'gini', 'max_depth': 11, 'max_features': 101, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 101}
Accuracy on Test Set: 0.7459165154264973
Mean Train Score: 0.9942261427425823
Mean Test Score: 0.8124442224442225
Class:  COAD
Best Hyperparameters: {'criterion': 'gini', 'max_depth': 11, 'max_features': 61, 'min_samples_leaf': 1, 'min_samples_split': 42, 'n_estimators': 101}
Accuracy on Test Set: 0.9219600725952812
Mean Train Score: 0.9522587192498845
Mean Test Score: 0.9100799200799201
Class:  ESCA
Best Hyperparameters: {'criterion': 'entropy', 'max_depth': 11, 'max_features': 81, 'min_samples_leaf': 1, 'min_samples_split': 22, 'n_estimators': 1}
Accuracy

{'HNSC': {'model': RandomForestClassifier(max_depth=21, max_features=61, min_samples_split=42,
                         n_estimators=51, random_state=42),
  'best_hyperparameters': {'criterion': 'gini',
   'max_depth': 21,
   'max_features': 61,
   'min_samples_leaf': 1,
   'min_samples_split': 42,
   'n_estimators': 51},
  'test_score': 0.823268921095008,
  'mean_train_score': 0.9458994255220672,
  'mean_validation_score': 0.811541263836346},
 'STAD': {'model': RandomForestClassifier(max_depth=11, max_features=101, n_estimators=101,
                         random_state=42),
  'best_hyperparameters': {'criterion': 'gini',
   'max_depth': 11,
   'max_features': 101,
   'min_samples_leaf': 1,
   'min_samples_split': 2,
   'n_estimators': 101},
  'test_score': 0.7459165154264973,
  'mean_train_score': 0.9942261427425823,
  'mean_validation_score': 0.8124442224442225},
 'COAD': {'model': RandomForestClassifier(max_depth=11, max_features=61, min_samples_split=42,
                         n