#### Summer of Reproducibility - noWorkflow Base Experiment - Notebook 1
This Jupyter Notebook walks you through applications of [noWorkflow](https://github.com/gems-uff/noworkflow) in Data Science and Machine Learning. It is an outcome of our participation in the Summer of Reproducibility at OSPO UCSC 2023.

This Notebook serves as a use case based on the problem of Fraud Detection. We have partially replicated the work entitled "The Effect of Feature Extraction and Data Sampling on Credit Card Fraud Detection." Interested readers are encouraged to refer to the original paper [here](https://link.springer.com/article/10.1186/s40537-023-00684-w).

For the sake of clarity, we have divided this experiment into different notebooks:

1. Covers the steps from reading the dataset to training a Random Forest model, which together make up a single trial.
2. Repeats all previous steps but with changes in the experimental setup, such as modified hyperparameters.
3. Utilizes noWorkflow to summarize the results from previous trials.
4. Repeats the experiment, changing the model and the order of operations.
5. Compares the modifications and differences between the last and first experiments.

**Please remember to select the noWorkflow kernel before running these Notebooks.**

In [1]:
from sklearn.decomposition import PCA
from sklearn.metrics import roc_auc_score, f1_score
from sklearn.ensemble import RandomForestClassifier
from imblearn.under_sampling import RandomUnderSampler
from sklearn.model_selection import train_test_split
import pandas as pd
import xgboost as xgb

from noworkflow.now.tagging.var_tagging import backward_deps, \
    global_backward_deps, store_operations, resume_trials, trial_diff, \
    trial_intersection_diff, var_tag_plot, var_tag_values

#### Reading the dataset

In [2]:
df = pd.read_csv('dataset/creditcard.csv', encoding='utf-8')

### Feature engineering stage

Separate the features from the target variable. This is the first step in feature treatment.

In [3]:
X = df.drop('Class', axis=1)
y = df['Class']

#### Feature engineering: Applying PCA for feature extraction.

Here, a pca_components tag is created to retain the n_components argument used in the PCA operation.

In [12]:
pca_components = now_variable('pca_components', 3)
pca = PCA(n_components=pca_components)
X_pca = pca.fit_transform(X)

Evaluation(id=155, checkpoint=3859.822927932, code_component_id=1099, activation_id=152, repr=3)


#### Feature engineering: Applying random undersampling over the extracted features

This is another feature engineering operation with a tagged hyperparameter: here, the *random_state* value passed to RandomUnderSampler.


In [5]:
random_seed = now_variable('random_seed', 42)
rus = RandomUnderSampler(random_state=random_seed)
X_resampled, y_resampled = rus.fit_resample(X_pca, y)

Evaluation(id=51, checkpoint=32.867074206, code_component_id=903, activation_id=48, repr=42)
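
To make the role of the seed concrete, the sketch below mimics random undersampling in plain Python: every class is cut down to the size of the rarest class by sampling without replacement. It is an illustration under those assumptions, not imbalanced-learn's implementation; `random_undersample` and the toy data are hypothetical names introduced here.

```python
import random
from collections import Counter, defaultdict


def random_undersample(rows, labels, seed):
    """Shrink every class to the size of the rarest one (hypothetical helper)."""
    rng = random.Random(seed)  # a fixed seed makes the selection repeatable
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    target = min(len(indices) for indices in by_class.values())
    kept = []
    for label in sorted(by_class):
        # Sample without replacement from each class.
        kept.extend(rng.sample(by_class[label], target))
    kept.sort()  # keep the original row order
    return [rows[i] for i in kept], [labels[i] for i in kept]


# Toy data: 8 legitimate transactions (0) and 2 fraudulent ones (1).
toy_rows = [[float(i)] for i in range(10)]
toy_labels = [0] * 8 + [1] * 2

rows_a, labels_a = random_undersample(toy_rows, toy_labels, seed=42)
rows_b, labels_b = random_undersample(toy_rows, toy_labels, seed=42)
class_counts = Counter(labels_a)
```

Running the helper twice with the same seed selects the same rows, which is why tagging the seed matters when trials are compared later.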


#### Feature engineering: Splitting dataset into train and test

Here we have two hyperparameter assignments: the test set proportion (test_size) and the random_state, which reuses the random_seed tagged above.

In [6]:
test_dim = now_variable('test_dim', 0.2)
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=test_dim, random_state=random_seed)

Evaluation(id=70, checkpoint=33.053857322, code_component_id=939, activation_id=67, repr=0.2)
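
As a rough picture of what the two hyperparameters control, the sketch below splits a toy list in plain Python: the seed fixes the shuffle, and the proportion fixes how many items land in the test set. `shuffled_split` is a hypothetical helper, and rounding the test count up is an assumption of this sketch rather than a statement about scikit-learn's internals.

```python
import math
import random


def shuffled_split(items, test_size, seed):
    """Shuffle indices with a seeded generator, then cut off the test share."""
    rng = random.Random(seed)
    order = list(range(len(items)))
    rng.shuffle(order)
    n_test = math.ceil(len(items) * test_size)  # assumption: round up
    test_part = [items[i] for i in order[:n_test]]
    train_part = [items[i] for i in order[n_test:]]
    return train_part, test_part


toy_items = list(range(10))
train_a, test_a = shuffled_split(toy_items, test_size=0.2, seed=42)
train_b, test_b = shuffled_split(toy_items, test_size=0.2, seed=42)
```

With ten items and a proportion of 0.2, two items go to the test set and eight to the training set, and the same seed reproduces the same partition.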


#### Scoring: model training and transforming features into predictions
##### RandomForest

Instantiate and train a Random Forest classifier. Here we tag the model object itself under the name 'model'.

In [7]:
rf = now_variable('model', RandomForestClassifier())
rf.fit(X_train, y_train)

Evaluation(id=89, checkpoint=33.209249015, code_component_id=973, activation_id=85, repr=RandomForestClassifier())


RandomForestClassifier()

#### Evaluating: evaluating the performance of models
##### RandomForest

Compute the performance metrics. Two control variables are tagged here: *roc_metric* stores the ROC AUC score, a classical classification metric, and *f1_metric* stores the F1 score.

In [8]:
y_pred = rf.predict(X_test)

roc_metric = now_variable('roc_metric', roc_auc_score(y_test, y_pred))
f1_metric = now_variable('f1_metric', f1_score(y_test, y_pred))

print("Random Forest - ROC = %f, F1 = %f" % (roc_metric, f1_metric))

Evaluation(id=111, checkpoint=33.548601773, code_component_id=1010, activation_id=100, repr=0.817305710162853)
Evaluation(id=120, checkpoint=33.550994317, code_component_id=1026, activation_id=100, repr=0.8181818181818183)
Random Forest - ROC = 0.817306, F1 = 0.818182
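
Because `y_pred` holds hard 0/1 labels rather than probabilities, the ROC curve has a single operating point, and its area reduces to the mean of the true positive rate and the true negative rate. The sketch below computes both metrics from confusion counts on made-up labels. The helper names are hypothetical and the code is not scikit-learn's implementation.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def f1_from_labels(y_true, y_pred):
    tp, _, fp, fn = confusion_counts(y_true, y_pred)
    return 2 * tp / (2 * tp + fp + fn)


def roc_auc_from_hard_labels(y_true, y_pred):
    # Assumes both classes occur in y_true, otherwise a rate is undefined.
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    true_positive_rate = tp / (tp + fn)
    true_negative_rate = tn / (tn + fp)
    return (true_positive_rate + true_negative_rate) / 2


# Made-up labels: one missed fraud and one false alarm.
toy_true = [1, 1, 1, 1, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 0, 0, 0, 1]

toy_f1 = f1_from_labels(toy_true, toy_pred)
toy_auc = roc_auc_from_hard_labels(toy_true, toy_pred)
```

On this toy data both values come out at 0.75, since precision, recall, and the two per-class rates are all 3/4.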


### Experiment dependencies from roc_metric variable

Calling backward_deps('tagged_var_name') returns a dictionary of the variables and operations involved in computing the tagged variable. In this example, calling it with the 'roc_metric' tag returns every operation that contributed to its final value.

In [9]:
dict_ops = backward_deps('roc_metric', False)
dict_ops

{26: ('y_test', 'complex data type'),
 25: ('RandomForestClassifier()', 'complex data type'),
 24: ("now_variable('model', RandomForestClassifier())", 'complex data type'),
 23: ('rf', 'complex data type'),
 22: ('X_resampled', 'complex data type'),
 21: ('RandomUnderSampler(random_state=random_seed)', 'complex data type'),
 20: ('rus', 'complex data type'),
 19: ("now_variable('pca_components', 3)", '3'),
 18: ('pca_components', '3'),
 17: ('PCA(n_components=pca_components)', 'PCA(n_components=3)'),
 16: ('pca', 'PCA(n_components=3)'),
 15: ('X', 'complex data type'),
 14: ('X_pca', 'complex data type'),
 13: ('df', 'complex data type'),
 12: ("df['Class']", 'complex data type'),
 11: ('y', 'complex data type'),
 10: ('y_resampled', 'complex data type'),
 9: ("now_variable('test_dim', 0.2)", '0.2'),
 8: ('test_dim', '0.2'),
 7: ("now_variable('random_seed', 42)", '42'),
 6: ('random_seed', '42'),
 5: ('train_test_split(X_resampled, y_resampled, test_size=test_dim, random_state=random_
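
The idea behind such a listing can be sketched as a backward walk over a dependency graph: start at the tagged variable and keep following "was computed from" links until nothing new turns up. The graph below is hand-written to resemble this notebook and `backward_closure` is a hypothetical helper; neither reflects how noWorkflow actually collects or stores provenance.

```python
def backward_closure(direct_deps, target):
    """Collect every name reachable from target by following dependencies."""
    seen = set()
    ordered = []
    stack = [target]
    while stack:
        name = stack.pop()
        for dep in direct_deps.get(name, []):
            if dep not in seen:  # visit each name once, so cycles cannot loop
                seen.add(dep)
                ordered.append(dep)
                stack.append(dep)
    return ordered


# Hand-written toy graph: each name maps to the names it was computed from.
toy_deps = {
    "roc_metric": ["y_test", "y_pred"],
    "f1_metric": ["y_test", "y_pred"],
    "y_pred": ["rf", "X_test"],
    "rf": ["X_train", "y_train"],
    "X_train": ["X_resampled", "test_dim", "random_seed"],
    "X_test": ["X_resampled", "test_dim", "random_seed"],
    "y_train": ["y_resampled", "test_dim", "random_seed"],
    "y_test": ["y_resampled", "test_dim", "random_seed"],
    "X_resampled": ["X_pca", "y", "random_seed"],
    "y_resampled": ["X_pca", "y", "random_seed"],
    "X_pca": ["X", "pca_components"],
    "X": ["df"],
    "y": ["df"],
}

roc_ancestors = backward_closure(toy_deps, "roc_metric")
```

In this toy graph the walk reaches `pca_components`, `random_seed`, and `test_dim`, but not `f1_metric`, because nothing that `roc_metric` was computed from depends on it.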

### Storing the dependencies of roc_metric

Save the operations dictionary in a shelve object, using this trial's trial_id as the key.

The steps are:
1. Call store_operations() to store the dictionary in a shelve object under this trial_id key.
2. Check the list of stored trials available for comparison with resume_trials().

In [10]:
trial_id = __noworkflow__.trial_id
store_operations(trial_id, dict_ops)

Dictionary stored in shelve.


In [11]:
resume_trials()

['edb94455-f97b-46f0-b30e-ed01eaf81081']
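
For readers unfamiliar with Python's standard `shelve` module, the sketch below shows the general pattern of keeping one dictionary per trial under a string key and listing the stored keys afterwards. The helper names, the file name `ops_demo`, and the key `trial-a` are hypothetical; this is not the code behind store_operations() or resume_trials().

```python
import os
import shelve
import tempfile


def save_trial_ops(db_path, trial_id, ops):
    """Store one operations dictionary under its trial id."""
    with shelve.open(db_path) as db:
        db[trial_id] = ops  # shelve keys must be strings; values are pickled


def list_trials(db_path):
    """Return the ids of all trials stored so far."""
    with shelve.open(db_path) as db:
        return sorted(db.keys())


def load_trial_ops(db_path, trial_id):
    with shelve.open(db_path) as db:
        return db[trial_id]


# Toy operations dictionary in the same shape as the output shown earlier.
toy_ops = {2: ("pca_components", "3"), 1: ("random_seed", "42")}

with tempfile.TemporaryDirectory() as demo_dir:
    demo_db = os.path.join(demo_dir, "ops_demo")
    save_trial_ops(demo_db, "trial-a", toy_ops)
    stored_ids = list_trials(demo_db)
    restored_ops = load_trial_ops(demo_db, "trial-a")
```

Keying by trial id means each new trial adds an entry instead of overwriting an earlier one, so the dictionaries remain available for side-by-side comparison.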

### Next steps

The [next notebook](now_usecase_part_2.ipynb) performs a similar (but not identical) trial, which we will later compare with this one.