# Environment Sanity Check #

Click the _Runtime_ dropdown at the top of the page, then _Change Runtime Type_ and confirm the instance type is _GPU_.

Check the output of `!nvidia-smi` to make sure you've been allocated a Tesla T4, P4, or P100.
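
The same check can also be done programmatically. The following is a minimal sketch, not part of the original notebook: it assumes `nvidia-smi` is on the PATH and uses its `--query-gpu=name` option. The helper names `is_supported_gpu` and `query_gpu_names` are made up for illustration.

```python
import re
import shutil
import subprocess

# GPU models named in the text above; matched as whole tokens so that
# e.g. "P40" is not mistaken for "P4".
SUPPORTED_PATTERN = re.compile(r"\b(T4|P4|P100)\b")


def is_supported_gpu(name):
    """Return True if the GPU name mentions a T4, P4, or P100."""
    return SUPPORTED_PATTERN.search(name.upper()) is not None


def query_gpu_names():
    """Return the GPU names reported by nvidia-smi, or [] if it is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


names = query_gpu_names()
if not names:
    print("No NVIDIA GPU visible; check the runtime type.")
for name in names:
    status = "OK" if is_supported_gpu(name) else "not in the list above"
    print(f"{name}: {status}")
```

On a CPU-only runtime the sketch prints the warning instead of raising an error.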

In [1]:
!nvidia-smi

Wed Apr 26 01:06:35 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   76C    P8    10W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [2]:
# This gets the RAPIDS-Colab install files and checks your GPU.  Run this and the next cell only.
# Please read the output of this cell.  If your Colab instance is not RAPIDS-compatible, it will warn you and give you remediation steps.
try:
  import cuml
except (ImportError, KeyError, ModuleNotFoundError):
  !git clone https://github.com/rapidsai/rapidsai-csp-utils.git
  !python rapidsai-csp-utils/colab/pip-install.py


# RAPIDS is now installed on Colab.  
You can copy your code into the cells below, or use the cells below to validate your RAPIDS installation and version.

----

Connect to Google Drive and change to the folder that contains the project files.

In [3]:
# Mount google drive to colab and change to correct directory
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/Othercomputers/ThinkPad/master-thesis-vt23

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/Othercomputers/ThinkPad/master-thesis-vt23


In [4]:
# Quick fix for dependencies in Colab
# Try to import the packages; if an exception is thrown, install the dependencies and kill the runtime
try:
  from pycaret.classification import *
except (ImportError, KeyError, ModuleNotFoundError):
  ## code to install dependencies
  !pip install -r colab_requirements.txt
  #!pip install pomegranate==0.14.8  #needed for sdmetrics
  display('Stopping RUNTIME! Colaboratory will restart automatically. Please run again.')
  import os
  os.kill(os.getpid(), 9)
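
Before installing and restarting, it can help to see which packages are actually missing. The following is a small sketch using only the standard library. The helper name `missing_packages` and the example package names are hypothetical, and it only handles top-level module names.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level module names that cannot be found in this runtime."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# "json" ships with Python; the second name is deliberately bogus.
required = ["json", "definitely_not_installed_pkg_xyz"]
print(missing_packages(required))
```

An empty list means nothing needs to be installed, so the runtime does not have to be restarted.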

In [5]:
%cd /content/drive/Othercomputers/ThinkPad/master-thesis-vt23/notebooks

import cudf as pd
import cuml

display(f"cudf version: {pd.__version__}")
display(f"cuml version: {cuml.__version__}")

/content/drive/Othercomputers/ThinkPad/master-thesis-vt23/notebooks


'cudf version: 23.04.01'

'cuml version: 23.04.01'

In [6]:
#%run -t Step2-Model-GPU.ipynb

In [7]:
import pandas 
import os 
import sys 
import pickle
import re
import ast
import cudf as pd

from sklearn.metrics import (classification_report, 
                             roc_auc_score, 
                             matthews_corrcoef,
                             cohen_kappa_score)

from cuml.model_selection import train_test_split, StratifiedKFold

# Import helper methods
sys.path.append('../src')
from utils import (getExperimentConfig, 
                   getPicklesFromDir, 
                   translate_model_name,
                   get_synthetic_filepaths_from_original_data_id,
                   convert_and_clean_dict)

from tuning_grids import Grids
from mlflow_manager import MLFlowManager
from gpuclassification import GPUClassifierPipeline, GPUModels, opt_tune_model

# Get global variables for the experiment
config = getExperimentConfig()
# Get folders
folders = config['folders']
# Load dataset specific settings (from the real-data)
dataset_settings = getPicklesFromDir(folders['settings_dir'])

In [8]:
# Select which datasets to use in the experiment, by their dataset id
run_dataset = [
    'D0'
]

In [9]:
"""
Create the dataset used to save model performance. The initial plan was to use MLflow for this.
However, a bug surfaced on Google Colab: it got stuck in an endless loop
trying to read the logs from the Colab cell. Hence this implementation.

Columns:
    Dataset id: str
        the id of the dataset that the model was evaluated on.
    model: str
        the shortened model name/id (e.g. lr = logistic regression, rf = random forest, etc.)
    F1, Accuracy, AUC, MCC, Kappa: float
        performance metrics from evaluating the model on the hold-out data.
    Params: dict
        the hyperparameters of the model.
    Tuned on: str
        whether the hyperparameters come from tuning on original or synthetic data
    Trained on: str
        the type of data that the model was trained on, "original" or "synthetic"
    Quality: str
        if synthetic, the quality id of the generator
    SDG: str
        the synthetic data generator id.
    Dataset type: str
        whether the dataset that the model was trained on is "original" or "synthetic"
    USI: str
        Unique Settings Identifier, a unique string generated by pycaret setup on each initialization
"""

# Create an empty DataFrame with the specified columns
columns = ["Dataset id", "model", "F1", "Accuracy", "AUC", "MCC", "Kappa", "Params", "Tuned on", "Trained on", "USI", "Quality", "SDG"]

# if it exists, read it, else create a new one
if os.path.isfile(folders['model_perf_filepath']):
    model_performance_df = pd.read_csv(folders['model_perf_filepath'])
else:
    model_performance_df = pd.DataFrame(columns=columns)

performance_row = {}
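
A plain-Python variant of the "read it if it exists, else create it" logic above is sketched below. This is not the notebook's pandas/cuDF approach: it appends rows with the standard `csv` module and writes the header only for a new file. The helper `append_performance_row`, the temporary path, and all values are made up for illustration.

```python
import csv
import os
import tempfile

COLUMNS = ["Dataset id", "model", "F1", "Accuracy", "AUC", "MCC", "Kappa",
           "Params", "Tuned on", "Trained on", "USI", "Quality", "SDG"]


def append_performance_row(path, row):
    """Append one result row; write the header only when the file is new or empty."""
    is_new = not os.path.isfile(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as handle:
        # Columns missing from `row` (e.g. AUC) are written as empty strings.
        writer = csv.DictWriter(handle, fieldnames=COLUMNS, restval="")
        if is_new:
            writer.writeheader()
        writer.writerow(row)


# Hypothetical tags and metrics, merged the same way as performance_row.
tags = {"Dataset id": "D0", "model": "lr", "Tuned on": "original", "Trained on": "synthetic"}
metrics = {"F1": 0.5, "Accuracy": 0.75}

demo_path = os.path.join(tempfile.mkdtemp(), "model_performance.csv")
append_performance_row(demo_path, {**tags, **metrics, "Params": {"C": 1.0}})
append_performance_row(demo_path, {**tags, **metrics, "model": "rf",
                                   "Params": {"n_estimators": 100}})

with open(demo_path, newline="") as handle:
    saved = list(csv.DictReader(handle))
print(len(saved), saved[1]["model"])
```

The `Params` dict is stored as its string representation, which is the form that `ast.literal_eval` can later turn back into a dict.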

In [10]:
############ STEP 4

# read performance data from Step 2
model_performance_df = pandas.read_csv(folders['model_perf_filepath'])
# Specify the metrics to sort by for choosing best model
# Choose the target metric when tuning the models
sort_by = config['clf']['tuning_param']['optimize']

#run_dataset = config['run_dataset']

for settings in dataset_settings:
        
    if run_dataset is not None and settings['meta']['id'] not in run_dataset:
        continue
        
    # update system_log name
    settings['setup_param']['system_log'] = folders['log_dir']+"Step4_SD"
    # disable saving train-test split data (to save space)
    settings['setup_param']['log_data'] = False

    #### Define features (use meta) ####
    ordinal_features = settings['meta']['ordinal_features']
    numeric_features = settings['meta']['numeric_features']
    text_features = settings['meta']['text_features']
    categorical_features = settings['meta']['categorical_features']

    target_label = settings['meta']['target']
    train_size = settings['setup_param']['train_size']
    # Get experiment logging
    experiment_name = f"{settings['meta']['id']}-{settings['meta']['name']}"
    mlflow = MLFlowManager(experiment_name)
     
    # load original dataset
    cols_dtype = None
    if settings['meta']['cols_dtype'] is not None:
        cols_dtype = settings['meta']['cols_dtype']
        
    original_data = pd.read_csv(f"{folders['real_dir']}{settings['meta']['filename']}", dtype=cols_dtype)
    
    # Only need the test data, using same stratified split size as in Step 2 
    _, x_test, _, y_test = train_test_split(original_data.drop(columns=[settings['meta']['target']], axis=1), # X (predictors)
                                            original_data[settings['meta']['target']],        # y (target label)
                                            train_size=train_size, 
                                            stratify=original_data[settings['meta']['target']])


    logg_tags = {
        'Trained on': 'synthetic',
        'Tuned on': 'original',
    }
    mlflow.start_run('Synthetic data models', tags=logg_tags)
    
    # Filter the DataFrame based on the Dataset id and sort by specified column
    # to get hyperparameters and model name for the "best model"
    filtered_df = model_performance_df[model_performance_df["Dataset id"] == settings['meta']['id']]
    sorted_df = filtered_df.sort_values(by=sort_by, ascending=False)
    
    best_ml_model = sorted_df.iloc[0].model
    best_hyperparameters = ast.literal_eval(sorted_df.iloc[0].Params)
    
    # bugfix: remove 'priors' from the hyperparameters and rename 'var_smoothing' to 'alpha'
    best_hyperparameters.pop('priors', None)
    if 'var_smoothing' in best_hyperparameters:
        best_hyperparameters['alpha'] = best_hyperparameters.pop('var_smoothing')

    synthetic_datasets = get_synthetic_filepaths_from_original_data_id(settings['meta']['id'])

    for sd_filename in synthetic_datasets:
        
        sd_id = os.path.splitext(sd_filename)[0]
        quality = re.findall(r'Q\d+', sd_id)[0]
        sd_path = folders['sd_dir']+sd_filename

        ########### Test the model with best performance from best original dataset ###########
        #mlflow version# hyperparameters = convert_and_clean_dict(hyperparameters)
        model_name = f"Original_{sd_id}{translate_model_name(best_ml_model)}"      
        run_name = model_name
                
        # Add custom tags to the log, defining the dataset type and id
        logg_tags = {
            'Trained on': 'synthetic',
            'Dataset id': sd_id,
            'model': best_ml_model,
            'Quality': quality,
            'Tuned on': 'original',
            'SDG': sd_id.split("_")[0],
        }
        mlflow.start_run(run_name, tags=logg_tags, nested=True)

        # create model
        # Train on synthetic, test on real: load the synthetic dataset and keep only
        # the train part of a stratified split made with cuML's train_test_split
        synthetic_data = pd.read_csv(sd_path, dtype=cols_dtype)
        x_train, _, y_train, _ = train_test_split(X=synthetic_data.drop(target_label, axis=1), 
                                                  y=synthetic_data[target_label],
                                                  train_size=train_size, 
                                                  stratify=synthetic_data[target_label], 
                                                  shuffle=True)

        estimator = GPUModels(best_ml_model)
        tuned_model = GPUClassifierPipeline(classifier=estimator,
                                            numeric_features=numeric_features,
                                            categorical_features=categorical_features,
                                            ordinal_features=ordinal_features
                                          )
        tuned_model.set_params(**best_hyperparameters)

        tuned_model.fit(x_train, y_train)

        y_pred = tuned_model.predict(x_test).to_pandas()

        metrics =  classification_report(y_true=y_test.to_pandas(), y_pred=y_pred, output_dict=True, digits=4)
        holdout_score = pandas.DataFrame.from_dict(metrics).transpose()

        test_metrics = {
            "Accuracy": metrics['accuracy'],
            "F1": metrics['macro avg']['f1-score'],
            "MCC": matthews_corrcoef(y_true=y_test.to_pandas(), y_pred=y_pred),
            "Kappa": cohen_kappa_score(y1=y_test.to_pandas(), y2=y_pred)
        }

        # AUC is only computed for binary targets, using the scores from predict_proba
        if y_test.nunique() == 2:
            y_prob = tuned_model.predict_proba(x_test)
            test_metrics['AUC'] = roc_auc_score(y_true=y_test.to_pandas(), y_score=y_prob)
        
        # log parameters     
        mlflow.log_params(tuned_model.get_classifier().get_params())
        # log performance
        mlflow.log_tag('model', best_ml_model)
        mlflow.log_metrics(test_metrics)
        # no validation score is logged here: these hyperparameters were tuned on the original data
        mlflow.log_score_report_to_html(holdout_score, "Holdout")
        # log model
        mlflow.log_model(model=tuned_model)
        # end run for the model
        mlflow.end_run()
        
        performance_row = {**logg_tags, **test_metrics}
        performance_row['Params'] = tuned_model.get_params()
        model_performance_df = pandas.concat([model_performance_df, pandas.DataFrame([performance_row])], ignore_index=True)
        ########### End test hyper-param ###########
        
        # Start testing all models
        for ml_model in config['clf']['ml_models']:
            #start log run
            logg_tags['model'] = ml_model
            logg_tags['Tuned on'] = 'synthetic'
            
            model_name = f"{sd_id}-{translate_model_name(ml_model)}"
            mlflow.start_run(model_name, tags=logg_tags, nested=True)
            
            # create & tune model
            estimator = GPUModels(ml_model)
            model = GPUClassifierPipeline(classifier=estimator,
                                          numeric_features=numeric_features,
                                          categorical_features=categorical_features,
                                          ordinal_features=ordinal_features
                                      )

            cv = StratifiedKFold(n_splits=config['clf']['cv_folds'])
            optimize = config['clf']['tuning_param']['optimize']    
            tune_grid = Grids.get_tuning_grid(ml_model)
            
            tuned_model, val_score = opt_tune_model(X=x_train, 
                                                    y=y_train, 
                                                    cv=cv, 
                                                    model=model, 
                                                    optimize=optimize, 
                                                    tune_grid=tune_grid,
                                                    n_trials=config['clf']['tuning_param']['early_stopping_max_iters'])

            y_pred = tuned_model.predict(x_test).to_pandas()

            metrics =  classification_report(y_true=y_test.to_pandas(), y_pred=y_pred, output_dict=True, digits=4)
            holdout_score = pandas.DataFrame.from_dict(metrics).transpose()

            test_metrics = {
                "Accuracy": metrics['accuracy'],
                "F1": metrics['macro avg']['f1-score'],
                "MCC": matthews_corrcoef(y_true=y_test.to_pandas(), y_pred=y_pred),
                "Kappa": cohen_kappa_score(y1=y_test.to_pandas(), y2=y_pred)
            }

            # AUC is only computed for binary targets, using the scores from predict_proba
            if y_test.nunique() == 2:
                y_prob = tuned_model.predict_proba(x_test)
                test_metrics['AUC'] = roc_auc_score(y_true=y_test.to_pandas(), y_score=y_prob)
            
            # log parameters     
            mlflow.log_params(tuned_model.get_classifier().get_params())
            # log performance
            mlflow.log_tag('model', ml_model)
            mlflow.log_metrics(test_metrics)
            mlflow.log_metric(f"val_{optimize}", val_score)
            mlflow.log_score_report_to_html(holdout_score, "Holdout")
            # log model
            mlflow.log_model(model=tuned_model)
            # end run for the model
            mlflow.end_run()
            
            performance_row = {**logg_tags, **test_metrics}
            performance_row['Params'] = tuned_model.get_params()
            model_performance_df = pandas.concat([model_performance_df, pandas.DataFrame([performance_row])], ignore_index=True)

    # end logging for the synthetic datasets based on original id
    mlflow.end_run()          

# Save model performance to csv
#model_performance_df.to_csv(folders['model_perf_filepath'], index=False)
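
The "best model" lookup in the cell above (filter by dataset id, sort by the tuning metric, parse `Params`) can be sketched on plain dictionaries. This is an illustration only: the rows and scores are invented, and `best_model_for` is a hypothetical helper, not a function from the project's `utils`.

```python
import ast

# Invented rows, shaped like lines read back from the performance CSV
# (all values arrive as strings).
rows = [
    {"Dataset id": "D0", "model": "lr", "F1": "0.71", "Params": "{'C': 1.0}"},
    {"Dataset id": "D0", "model": "nb", "F1": "0.74",
     "Params": "{'priors': None, 'var_smoothing': 1e-09}"},
    {"Dataset id": "D1", "model": "rf", "F1": "0.90", "Params": "{'n_estimators': 100}"},
]


def best_model_for(rows, dataset_id, sort_by):
    """Return (model id, hyperparameter dict) of the top row for one dataset."""
    candidates = [row for row in rows if row["Dataset id"] == dataset_id]
    if not candidates:
        raise ValueError(f"no results recorded for dataset {dataset_id!r}")
    best = max(candidates, key=lambda row: float(row[sort_by]))
    # literal_eval only accepts Python literals, so it is safe on untrusted text.
    return best["model"], ast.literal_eval(best["Params"])


model, params = best_model_for(rows, "D0", "F1")

# Same clean-up as in the cell above: drop 'priors', rename 'var_smoothing'.
params.pop("priors", None)
if "var_smoothing" in params:
    params["alpha"] = params.pop("var_smoothing")
print(model, params)
```

Raising an explicit error for an unknown dataset id avoids the less readable index error that `iloc[0]` gives on an empty frame.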

ValueError: ignored
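
To make the logged MCC and Kappa values easier to interpret, here is a small worked example for the binary case in plain Python. It is a sketch with invented labels, not the scikit-learn implementation used above. The zero-denominator fallbacks are a convention chosen for this sketch.

```python
import math


def binary_confusion(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp, tn, fp, fn


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def cohen_kappa(tp, tn, fp, fn):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = tp + tn + fp + fn
    observed = (tp + tn) / n
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    return 0.0 if expected == 1 else (observed - expected) / (1 - expected)


# Invented hold-out labels and predictions.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

counts = binary_confusion(y_true, y_pred)
print(counts, mcc(*counts), cohen_kappa(*counts))
```

With 3 true positives, 3 true negatives, 1 false positive, and 1 false negative, accuracy is 0.75 while MCC and Kappa are both 0.5, which shows how these two metrics discount agreement expected by chance.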

In [14]:
x_test.info()


<class 'cudf.core.dataframe.DataFrame'>
Int64Index: 154 entries, 60 to 38
Data columns (total 8 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Pregnancies               154 non-null    int64
 1   Glucose                   154 non-null    int64
 2   BloodPressure             154 non-null    int64
 3   SkinThickness             154 non-null    int64
 4   Insulin                   154 non-null    int64
 5   BMI                       154 non-null    float64
 6   DiabetesPedigreeFunction  154 non-null    float64
 7   Age                       154 non-null    int64
dtypes: float64(2), int64(6)
memory usage: 10.8 KB
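
Note that the split calls above do not pass a seed, so unless one is set elsewhere, the hold-out rows may differ from the ones used in Step 2. The idea of a reproducible stratified hold-out can be sketched in plain Python. The helper `stratified_test_indices`, the 0.8 train size, and the toy labels are assumptions for illustration, not cuML's implementation.

```python
import random
from collections import defaultdict


def stratified_test_indices(labels, train_size, seed):
    """Return sorted test-row indices, sampled per class with a fixed seed."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    test = []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        # Each class contributes the same fraction of its rows to the test set.
        n_test = round(len(members) * (1 - train_size))
        test.extend(members[:n_test])
    return sorted(test)


# Toy target column: 60 negatives followed by 40 positives.
labels = [0] * 60 + [1] * 40

first = stratified_test_indices(labels, train_size=0.8, seed=42)
second = stratified_test_indices(labels, train_size=0.8, seed=42)
print(len(first), first == second)
```

Using the same seed and train size in every step yields the same hold-out rows, so models trained on synthetic data are scored on exactly the rows that were held out when tuning on the original data.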
