#### Summer of Reproducibility - noWorkflow base experiment

This Jupyter Notebook walks you through noWorkflow applications in Data Science and Machine Learning. It is a result of our work in the 2023 Summer of Reproducibility from OSPO UCSC.

This notebook is a use case based on the problem of fraud detection. We partially replicate the work "The effect of feature extraction and data sampling on credit card fraud detection". The interested reader is invited to consult the original work [here](https://link.springer.com/article/10.1186/s40537-023-00684-w).

In this notebook we aim to assess how provenance collection with [noWorkflow](https://github.com/gems-uff/noworkflow) works in a classical DS/ML task.

In [2]:
from sklearn.decomposition import PCA
from sklearn.metrics import roc_auc_score, f1_score
from sklearn.ensemble import RandomForestClassifier
from imblearn.under_sampling import RandomUnderSampler
from sklearn.model_selection import train_test_split
import pandas as pd
import xgboost as xgb
import lightgbm as lgb
import catboost as cat
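
The cells below call `now_tag` and `now_variable`, which are not imported in this notebook; they appear to be provided by the noWorkflow session. As a minimal sketch, assuming you only want to run the cells without capturing provenance, pass-through stand-ins such as the following would keep the code working. Both fallback bodies are hypothetical and are not noWorkflow's implementation.

```python
# Hypothetical stand-ins, defined only when the session has not provided the names.
try:
    now_variable
except NameError:
    def now_variable(name, value):
        """Return the value unchanged; no provenance is recorded."""
        return value

try:
    now_tag
except NameError:
    def now_tag(tag):
        """Do nothing; a noWorkflow session would mark the cell with this tag."""
        return None

# The tagged value is used exactly like the plain value.
pca_components = now_variable('pca_components', 3)
```

With these stand-ins the notebook still computes the same models, but nothing is tagged, so the comparison cells at the end have no provenance to query.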

#### Reading the dataset

In [3]:
now_tag('dataset_reading')
df = pd.read_csv('dataset/creditcard.csv', encoding='utf-8')

### Feature engineering stage

Separate the features from the target variable. This is the first step in feature treatment.

In [5]:
X = df.drop('Class', axis=1)
y = df['Class']

#### Feature engineering: Apply PCA for feature extraction.

Here we tag a hyperparameter definition (hyperparam_def): the n_components argument of PCA, which sets how many components are kept.

In [6]:
pca_components = now_variable('pca_components', 3)
pca = PCA(n_components=pca_components)  # Adjust the number of components as needed
X_pca = pca.fit_transform(X)

Evaluation(id=48, checkpoint=1399.998662652, code_component_id=919, activation_id=45, repr=3)


#### Feature engineering: Apply random undersampling over the extracted features

Another feature engineering operation with a hyperparameter definition. Here it is the random_state value for RandomUnderSampler.


In [7]:
random_seed = now_variable('random_seed', 42)
rus = RandomUnderSampler(random_state=random_seed)
X_resampled, y_resampled = rus.fit_resample(X_pca, y)

Evaluation(id=66, checkpoint=1404.694020646, code_component_id=952, activation_id=63, repr=42)
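
To make the role of the seed concrete, here is a minimal pure-Python sketch of what random undersampling does: every class is reduced, by sampling without replacement, to the size of the smallest class. This is an illustration under that assumption, not imbalanced-learn's implementation; `random_undersample` and the toy data are hypothetical.

```python
import random
from collections import Counter

def random_undersample(rows, labels, seed=42):
    """Keep every minority-class row and an equally sized random draw from each other class."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    # The smallest class sets the number of rows kept per class.
    target = min(len(indices) for indices in by_class.values())
    kept = []
    for indices in by_class.values():
        kept.extend(rng.sample(indices, target))
    kept.sort()  # preserve the original row order
    return [rows[i] for i in kept], [labels[i] for i in kept]

# Toy data: 8 legitimate rows (label 0) and 2 fraudulent rows (label 1).
toy_rows = [[float(i)] for i in range(10)]
toy_labels = [0] * 8 + [1] * 2
rows_bal, labels_bal = random_undersample(toy_rows, toy_labels, seed=42)
print(Counter(labels_bal))
```

Because the majority rows are drawn at random, the seed decides which rows survive, which is why it is worth tagging as a hyperparameter.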


#### Feature engineering: Splitting dataset into train and test

Here we have two hyperparameter assignments: the test_size proportion and the random_state. One option would be to implement some logic that collects all scalar values tagged as hyperparam_def in the cells. It is not clear at the moment whether there are corner cases where a hyperparameter is a vector or an object.

In [8]:
now_tag('feature_eng')
test_dim = now_variable('test_dim', 0.2)
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=test_dim, random_state=random_seed)

Evaluation(id=88, checkpoint=1407.9494737789998, code_component_id=994, activation_id=82, repr=0.2)
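
The idea of collecting only scalar hyperparameters can be sketched as a simple type filter. The function below is hypothetical and works on a plain dictionary of tagged names and values, not on noWorkflow's internal records. Non-scalar hyperparameters do occur in scikit-learn, for example a `class_weight` dictionary or the `hidden_layer_sizes` tuple of `MLPClassifier`; neither is used in this notebook, they only illustrate the corner case.

```python
from numbers import Number

def split_scalar_hyperparams(tagged):
    """Separate scalar hyperparameters from vector-like or object-valued ones."""
    scalars, others = {}, {}
    for name, value in tagged.items():
        # bool, int and float are all Number; str and None are treated as scalars too.
        if value is None or isinstance(value, (Number, str)):
            scalars[name] = value
        else:
            others[name] = value
    return scalars, others

tagged_values = {
    "pca_components": 3,
    "random_seed": 42,
    "test_dim": 0.2,
    "class_weight": {0: 1.0, 1: 10.0},  # illustrative object-valued hyperparameter
    "hidden_layer_sizes": (64, 32),     # illustrative vector-valued hyperparameter
}
scalars, others = split_scalar_hyperparams(tagged_values)
print(scalars)
print(sorted(others))
```

Returning the non-scalar entries separately, rather than dropping them, leaves room to decide later how they should be compared across trials.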


#### Scoring: model training and transforming features into predictions
##### RandomForest

Train the Random Forest classifier. It is unclear at this point whether adding a model_training tag would be redundant here; a scoring tag seems enough at first sight.

In [9]:
#now_tag('scoring')
now_tag('model_training')
rf = RandomForestClassifier()
rf.fit(X_train, y_train)

RandomForestClassifier()

#### Evaluating: evaluating the performance of models
##### RandomForest
Compute the performance metrics (ROC AUC and F1) for the Random Forest model.

In [21]:
now_tag('evaluating')
y_pred_rf = rf.predict(X_test)

roc_rf = now_variable('roc_rf', roc_auc_score(y_test, y_pred_rf))
#roc_rf = roc_auc_score(y_test, y_pred_rf)
f1_rf = now_variable('f1_rf', f1_score(y_test, y_pred_rf))
#f1_rf = f1_score(y_test, y_pred_rf)

print("Random Forest - ROC = %f, F1 = %f" % (roc_rf, f1_rf))

Random Forest - ROC = 0.903370, F1 = 0.899471
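
Note that the cell above passes hard class predictions from `rf.predict` to `roc_auc_score`. ROC AUC is a ranking metric, so it is commonly computed from continuous scores instead, for example `rf.predict_proba(X_test)[:, 1]`. The sketch below uses a hypothetical pure-Python helper, `pairwise_auc`, and toy values (not the fraud dataset) to show how thresholding the scores first can change the result.

```python
def pairwise_auc(y_true, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for s, y in zip(scores, y_true) if y == 1]
    negatives = [s for s, y in zip(scores, y_true) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Toy values for illustration only.
y_true = [0, 0, 1, 1]
probabilities = [0.1, 0.4, 0.45, 0.8]                 # positive-class scores
hard_labels = [int(p >= 0.5) for p in probabilities]  # thresholded predictions

auc_from_scores = pairwise_auc(y_true, probabilities)
auc_from_labels = pairwise_auc(y_true, hard_labels)
print(auc_from_scores, auc_from_labels)
```

In this toy case the scores rank every positive above every negative, while the thresholded labels create ties that lower the value. Whether the same gap appears on the credit card data would have to be checked by rerunning the cell with probabilities.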


### Experiment comparison

Save the operations dictionary in a shelve object, using the current trial_id as the key.

The steps are:
1. Call get_pre for a given tagged variable and keep the operations dictionary it returns.
2. Call store_operations() to store the dictionary in a shelve object under the current trial_id key.


In [24]:
ops_dict = get_pre('roc_rf')

trial_id = __noworkflow__.trial_id
store_operations(trial_id, ops_dict)

Dictionary stored in shelve.
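
For readers unfamiliar with `shelve`, the storage step can be sketched with the standard library alone. The helpers `save_ops` and `load_ops`, the trial id, and the dictionary content below are hypothetical; this is not how `store_operations` is implemented, only an illustration of keeping one dictionary per trial id.

```python
import os
import shelve
import tempfile

def save_ops(shelf_path, trial_id, ops_dict):
    """Store one trial's operations dictionary under its trial id."""
    with shelve.open(shelf_path) as shelf:
        shelf[str(trial_id)] = ops_dict  # shelve keys must be strings

def load_ops(shelf_path, trial_id):
    """Read back the operations dictionary saved for a trial."""
    with shelve.open(shelf_path) as shelf:
        return shelf[str(trial_id)]

# Hypothetical trial id and dictionary content, written to a temporary directory.
shelf_path = os.path.join(tempfile.mkdtemp(), "ops_demo")
save_ops(shelf_path, "trial-a", {"roc_rf": ["predict", "roc_auc_score"]})
restored = load_ops(shelf_path, "trial-a")
print(restored)
```

Keying the shelf by trial id means a later trial can be stored alongside earlier ones and the two dictionaries compared side by side.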
