<center>
<table>
  <tr>
    <td><img src="https://portal.nccs.nasa.gov/datashare/astg/training/python/logos/nasa-logo.svg" width="100"/> </td>
     <td><img src="https://portal.nccs.nasa.gov/datashare/astg/training/python/logos/ASTG_logo.png?raw=true" width="80"/> </td>
     <td> <img src="https://www.nccs.nasa.gov/sites/default/files/NCCS_Logo_0.png" width="130"/> </td>
    </tr>
</table>
</center>

        
<center>
<h1><font color= "blue" size="+3">ASTG Python Course Series</font></h1>
</center>

---

<center>
    <h1><font color="red">
        Banknote Authentication Problem with PyTorch
    </font></h1>
</center>

# <font color="red">Objectives</font>

In this presentation, we use a simple classification dataset to:

- Build a PyTorch model
- Train the model
- Evaluate the model

We show the steps for building a Machine Learning (ML) model with PyTorch. The functions presented here can be used as a reference for other ML applications.

# <font color="red">References</font>

- [Banknote Authentication using Machine Learning Algorithms](https://www.coditude.com/insights/banknote-authentication-using-machine-learning-algorithms/)
- [Comparative Analysis Of Machine Learning Based Bank Note Authentication Through Variable Selection](https://nhsjs.com/2023/comparative-analysis-of-machine-learning-based-bank-note-authentication-through-variable-selection/) by Rick Nie.

# <font color="red"> Python packages used</font>

- __Matplotlib__: Creates visualizations.
- __Pandas__: Data (two-dimensional labelled array) manipulation and analysis.
- __Scikit-Learn__: Provides supervised and unsupervised Machine Learning algorithms.
- __PyTorch__: Used to build, train, and evaluate a deep machine learning algorithm based on Neural Networks.

In [None]:
try:
    import google.colab
    print("Running in Google Colab")
except ImportError:
    print("Not running in Google Colab")
else:
    print("Installing modules in Google Colab")
    !pip install seaborn
    !pip install -U scikit-learn
    !pip3 uninstall --yes torch torchaudio torchvision torchtext torchdata
    !pip3 install torch torchaudio torchvision torchtext torchdata

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
import matplotlib.pyplot as plt

In [None]:
import numpy as np

In [None]:
import pandas as pd
import seaborn as sns

In [None]:
from sklearn.model_selection import train_test_split
#from sklearn import metrics
from sklearn.metrics import r2_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

In [None]:
import torch
from torch import nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor

In [None]:
from torchviz import make_dot
from torchsummary import summary

# <font color="red">Loading the dataset</font>

## <font color="blue">Description of the data</font>

- We have a dataset that consists of information on banknotes.
- Each banknote belongs to one of two classes, depending on whether it is real or not:
   - `1`: fake banknote
   - `0`: real banknote
- We want to build a Machine Learning model that predicts the class of each banknote in a given set.
- We will use __logistic regression__, a statistical method for predicting binary classes.
   - It is a generalized linear model: a linear combination of the features is passed through the logistic (sigmoid) function, and the target variable is categorical.
   - It is one of the simplest and most commonly used Machine Learning algorithms for two-class classification.
   - The outcome, or target variable, has only two possible classes.
   - It predicts the probability of a binary event by modeling the log-odds (logit) as a linear function of the features.
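To make the link between the linear score, the logit, and the predicted probability concrete, here is a minimal sketch in plain Python. The weights, bias, and input values are hypothetical and chosen only for illustration; they are not fitted to the banknote data.

```python
import math

def sigmoid(z: float) -> float:
    """Map a real-valued score (log-odds) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p: float) -> float:
    """Inverse of the sigmoid: map a probability back to its log-odds."""
    return math.log(p / (1.0 - p))

# Hypothetical weights, bias, and one sample with four features
weights = [0.5, -1.0, 0.25, 0.0]
bias = 0.1
x = [1.0, 0.5, 2.0, -1.0]

score = sum(w * xi for w, xi in zip(weights, x)) + bias  # linear part
prob = sigmoid(score)                                    # P(class = 1 | x)
label = int(prob >= 0.5)                                 # threshold at 0.5
print(score, prob, label)
```

Applying `logit` to `prob` recovers `score`, which is the sense in which logistic regression models the log-odds linearly.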

## <font color="blue">Read the data</font>

#### Dataset

- We use 1372 images that were taken from genuine and forged banknote-like specimens.
- Wavelet Transform tools were used to extract features from images.
   - Among the five variables, four are features, and one is target class.
   - The four features are continuous numbers that measure the characteristics of digital images of each banknote.
      - `variance`: Measures the spread or distribution of pixel values within the banknote image.
      - `skew`: Quantifies the asymmetry or distortion in the distribution of pixel intensity values, according to GeeksforGeeks.
      - `curtosis`: Describes the sharpness of the peaks in the pixel intensity distribution.
      - `entropy`: Describes the amount of information that must be coded for by a compression algorithm.
   - The target class contains two values, 0 and 1, where 0 represents a genuine note, and 1 represents a fake note.
   - The two classes are roughly balanced, with a ratio of about 55:45 (genuine:counterfeit).
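As an illustration of the statistical moments behind the first three features, the sketch below computes the variance, skewness, and kurtosis of a small list of numbers in plain Python. The values in `pixels` are hypothetical. The dataset's own features were extracted from wavelet-transformed images, and whether it uses excess kurtosis (as here) or plain kurtosis is an assumption.

```python
def central_moment(values, order):
    """Population central moment of the given order."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** order for v in values) / len(values)

def skewness(values):
    """Third standardized moment: 0 for a symmetric sample."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Fourth standardized moment minus 3 (0 for a normal distribution)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0

pixels = [1.0, 2.0, 3.0, 4.0, 10.0]   # hypothetical intensity values
print(central_moment(pixels, 2))       # variance
print(skewness(pixels))                # positive: long tail to the right
print(excess_kurtosis(pixels))
```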

In [None]:
url = "https://raw.githubusercontent.com/Kuntal-G/Machine-Learning/master/R-machine-learning/data/banknote-authentication.csv"

In [None]:
df = pd.read_csv(url)

In [None]:
df.head(5)

## <font color='blue'>Perform EDA</font>

### <font color="green">Quick observation on the data types</font>

In [None]:
df.info()

- All the columns have data types of either `float` or `int`.
   - There is no need to do any data conversion.
- There are no missing values.

### <font color="green">Descriptive statistics</font>

In [None]:
df.describe()

### <font color="green">Basic plots</font>

__Percentage of instances for each class__

In [None]:
df['class'].value_counts().plot(kind="pie", autopct='%1.1f%%');

__Pairplot__

In [None]:
sns.pairplot(df)

__Observations__

- `entropy` and `variance` have a slight linear correlation.
- There is an inverse linear correlation between the `curtosis` and `skew`.
- The values for `curtosis` and `entropy` are slightly higher for real banknotes, while the values for `skew` and `variance` are higher for the fake banknotes.

__Heatmap__

In [None]:
plt.figure(figsize=(22, 11));
correlation_matrix = df.corr().round(3);
sns.heatmap(correlation_matrix, cmap="YlGnBu", annot=True);

In [None]:
plt.figure(figsize=(22, 11));
sns.heatmap(correlation_matrix[(correlation_matrix >= 0.7) | (correlation_matrix <= -0.6)], 
            cmap='viridis', vmax=1.0, vmin=-1.0, linewidths=0.1,
            annot=True, annot_kws={"size": 8}, square=True);

### <font color="green">Identify possible non-linear relationships</font>

In [None]:
sns.pairplot(df, hue='class')

__Observations__

-  Out of all the combinations of variables, the scatter plot of `curtosis` vs `entropy` has the most significant overlap between classes.
   - The overlapping of the data indicates that there is substantial ambiguity and similarity between classes based just on their `curtosis` and `entropy` values, which implies that there is no distinct separation that allows for easy classification.
   - ML algorithms would face significant challenges in accurately classifying banknotes based on these two variables.
- `skew` vs `entropy` and `skew` vs `curtosis` also have relatively high overlap between classes.
   - The overlapping feature also has some implications for the effectiveness of ML algorithms. 
- A large overlap of different classes will require more complex decision boundaries to determine the class of the banknote accurately.
- __It is reasonable to predict that algorithms such as Logistic Regression and Linear Discriminant Analysis, which assume linear relationships between variables, may struggle with an accurate prediction with only two features.__

__Analyzing 3D plots__

- We extend the analysis with three-feature scatter plots, which add one more feature to each plot.
- The goal of the extra variable is to reduce the significant overlap observed in the two-feature scatter plots and to improve the separability between the classes.
- As the plots below show, the three-feature scatter plots exhibit notably less overlap between the classes than the two-feature plots. This suggests that adding a third variable to any two-variable combination provides additional discriminatory power and a clearer separation between genuine and counterfeit banknotes.
- __The reduced overlap and improved separability in the three-feature scatter plots indicate that ML methods, including those that assume linear relationships among the variables, such as Logistic Regression and Linear Discriminant Analysis, are expected to perform better when using three features for classification.__


In [None]:
def do_3d_plot(data, ax, labels):
    # Plot the data, using Seaborn's palette for color differentiation
    # Iterate through each class to plot separately for distinct colors
    for class_label in data['class'].unique():
        subset = data[data['class'] == class_label]
        ax.scatter(subset[labels[0]], subset[labels[1]], subset[labels[2]], 
                   label=f'Class {class_label}',
                   color=sns.color_palette("tab10", n_colors=2)[int(class_label)],
                   s=3
                  ) # s for marker size

    # Set labels and title
    ax.set_xlabel(labels[0])
    ax.set_ylabel(labels[1])
    ax.set_zlabel(labels[2])
    #ax.legend()

In [None]:
list_labels = [
    ('variance', 'entropy', 'skew'),
    ('variance', 'entropy', 'curtosis'),
    ('entropy', 'skew', 'curtosis'),
    ('skew', 'variance', 'curtosis', )
]

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(12, 6), subplot_kw={'projection': '3d'})

for ax, labels in zip(axes.flat, list_labels):
    do_3d_plot(df, ax, labels)
plt.tight_layout()

- The above observations were based on the visual analysis of the scatter plots.
- Further quantitative analysis is necessary to confirm the extent of the visual analysis and its impact on classification accuracy.

#  <font color="red">Data preparation</font>

##  <font color="blue"> Splitting the data into training and testing sets</font>
- We split the data into training and testing sets. 
- We train the model with 80% of the samples and test with the remaining 20%. 

__Extract the train and test datasets as NumPy arrays__

In [None]:
feature_cols = list(df.columns)
del feature_cols[-1]

In [None]:
feature_cols

In [None]:
label_name = list(df.columns)[-1]
label_name

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df[feature_cols].values, 
                                                    df[label_name].values, 
                                                    test_size=0.2, 
                                                    random_state=42)

In [None]:
X_train

In [None]:
X_train.shape

In [None]:
y_train

In [None]:
y_train.shape

## <font color="blue">Normalize the data</font> <a class="anchor" id="sec_tf_norm"></a>

- In general, variables may not be on a similar scale. Features with large values would gain more importance in any distance-based calculation.
- It is good practice to normalize features that use different scales and ranges.
   - The normalization process brings all variables to a similar scale, preventing certain variables from dominating others in later analysis and ensuring fair comparisons and interpretations.
- Although the model might converge without feature normalization, it makes training more difficult, and it makes the resulting model dependent on the choice of units used in the input.
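The sketch below illustrates this standardization on toy NumPy arrays (hypothetical values, not the banknote data). The key point is that the mean and standard deviation come from the training split only and are then reused for the test split.

```python
import numpy as np

# Toy feature matrices: rows are samples, columns are features on different scales
X_tr = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_te = np.array([[2.0, 400.0]])

# Statistics are computed on the training split only,
# so no information from the test split leaks into training
mean = X_tr.mean(axis=0)
std = X_tr.std(axis=0)

X_tr_norm = (X_tr - mean) / std
X_te_norm = (X_te - mean) / std   # reuse the training statistics

print(X_tr_norm.mean(axis=0))     # close to 0 for every feature
print(X_tr_norm.std(axis=0))      # close to 1 for every feature
print(X_te_norm)
```

After the transformation both training columns have zero mean and unit standard deviation, while the test values are simply expressed in the same units.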

In [None]:
X_train

In [None]:
train_mean = X_train.mean(axis=0)
train_std = X_train.std(axis=0)

In [None]:
train_mean

In [None]:
train_std

__Normalization of the train features__

In [None]:
X_train = (X_train - train_mean) / train_std

In [None]:
X_train

__Normalization of the test features__

In [None]:
X_test = (X_test - train_mean) / train_std

# <font color="red">Creating the ML model</font>

## <font color="blue">Set the hyperparameters</font>

It is a good practice to declare the following parameters before creating the model for ease of change and understanding.

__Dataset parameters__

These parameters are defined by the dataset used:

- number of features
- number of classes to predict

In [None]:
input_size = len(feature_cols)
num_classes = 2

__Model parameters__

- batch size
- number of epochs
- learning rate (optimizer steps)

In [None]:
batch_size = 4
num_epochs = 10
learning_rate = 0.1
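To get a feel for what these values imply, the following back-of-the-envelope sketch estimates how many parameter updates the training loop performs. It assumes the 1372 samples and the 80/20 split described earlier, and that the test split is rounded up (which is how scikit-learn's `train_test_split` treats a float `test_size`). The variable names are local to this sketch.

```python
import math

n_samples = 1372        # dataset size quoted earlier
test_fraction = 0.2
batch = 4               # same value as batch_size
epochs = 10             # same value as num_epochs

# Assumption: the test split is rounded up and the training split takes the rest
n_test = math.ceil(test_fraction * n_samples)
n_train = n_samples - n_test

batches_per_epoch = math.ceil(n_train / batch)   # the last batch may be smaller
total_updates = batches_per_epoch * epochs
print(n_train, n_test, batches_per_epoch, total_updates)
```

A small batch size means many noisy updates per epoch; a larger batch size gives fewer, smoother updates.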

#### Device configuration: check for CUDA availability and set device accordingly

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

## <font color="blue">Building the PyTorch model</font>

### <font color="green"> Class to create a simple model with one linear layer</font>

- We define a neural network by subclassing `nn.Module`, and initialize the neural network layers in `__init__`.
- Every `nn.Module` subclass implements the operations on input data in the `forward` method.
   - The `__init__()` method defines the layers and other components of a model.
   - The `forward()` method is where the computation gets done.
- The input layer has `num_features` nodes and the output layer `num_classes` nodes.
- The most basic type of neural network layer is a linear or fully connected layer.
   - This is a layer where every input influences every output of the layer to a degree specified by the layer’s weights.
   - If a layer has `m` inputs and `n` outputs, the weights form an `m x n` matrix (PyTorch's `nn.Linear` stores it transposed, with shape `(n, m)`).
- One of the most common places you will see linear layers is in classifier models, which will usually have one or more linear layers at the end, where the last layer will have `n` outputs, where `n` is the number of classes the classifier addresses.
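As a quick check of how a fully connected layer works, the sketch below builds a standalone `nn.Linear` layer with 4 inputs and 2 outputs (matching the four features and two classes used here) and reproduces its output by hand. The random input batch is arbitrary.

```python
import torch
from torch import nn

m, n = 4, 2                       # 4 input features, 2 output classes
layer = nn.Linear(m, n)

print(layer.weight.shape)         # stored as (n, m)
print(layer.bias.shape)           # one bias per output

x = torch.randn(3, m)             # a batch of 3 samples
manual = x @ layer.weight.T + layer.bias   # y = x W^T + b
print(torch.allclose(layer(x), manual, atol=1e-6))
```

Every output is a weighted sum of all inputs plus a bias, which is what "fully connected" means.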

In [None]:
class LogisticRegression(torch.nn.Module):

    def __init__(self, num_features, num_classes):
        super().__init__()
        self.linear1 = torch.nn.Linear(num_features, num_classes)

    def forward(self, x):
        logits = self.linear1(x)
        return logits

Note that we do not have any activation function here because there is only one layer:
- Activation functions make deep learning possible.
   - Inserting non-linear activation functions between layers is what allows a deep learning model to simulate any function, rather than just linear ones.
- __The model defined above can be seen as a single matrix multiplication followed by the addition of a bias vector.__
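The role of the activation function can be checked numerically. In this NumPy sketch (random, hypothetical weights), two linear layers stacked without an activation are shown to be equivalent to one linear layer, while a ReLU between them is evaluated for comparison.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=3)   # layer 1: 4 -> 3
W2, b2 = rng.normal(size=(2, 3)), rng.normal(size=2)   # layer 2: 3 -> 2
x = rng.normal(size=4)

# Two linear layers with no activation in between
stacked = W2 @ (W1 @ x + b1) + b2

# The same map written as a single linear layer
W, b = W2 @ W1, W2 @ b1 + b2
single = W @ x + b
print(np.allclose(stacked, single))    # True

# With a ReLU in between, the result generally differs from the single layer
with_relu = W2 @ np.maximum(W1 @ x + b1, 0.0) + b2
print(with_relu, single)
```

Without a non-linearity, adding layers does not add expressive power, so a one-layer model needs no activation between layers.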

### <font color="green">Create the model</font>

In [None]:
torch.manual_seed(1)

model = LogisticRegression(num_features=input_size, num_classes=num_classes)

Move the model to the selected device (the GPU if CUDA is available, otherwise the CPU):

In [None]:
model.to(device)

### <font color="green">Model Summary</font>

- The function `summary()` of `torchsummary` provides an architectural summary of the model, similar to Keras' `model.summary()`.
- It shows the layer types, the output shape of each layer, and the number of parameters in the model.

In [None]:
summary(model, input_size=(input_size, ))

### <font color="green"> Print model information</font>

In [None]:
print('\t Model information: \n')
print(model)

In [None]:
print('\t Layer information: \n')
print(model.linear1)

In [None]:
print('\t Model trainable parameters: \n')
model_dict = model.state_dict()
for key in model_dict:
    print(f"{key}: \n \t {model_dict[key]}")

In [None]:
print('\t Model parameters: \n')
for param in model.parameters():
    print(param)

### <font color="green"> Basic testing of the model with arbitrary data</font>

In [None]:
x = torch.tensor([[-0.6391558 ,  1.80557961, -0.18836535, -3.05096841],
                  [ 0.82188925,  0.85239902, -0.59407847,  0.60345479],
                  [-1.65703344, -1.63328321,  2.38386151, -0.34235536]])

In [None]:
with torch.no_grad():
    logits = model(x.to(device))
    probas = F.softmax(logits, dim=1)

In [None]:
print(probas)

## <font color="blue"> Defining a Dataset</font>

- A dataset is represented by a regular Python class that inherits from the `Dataset` class.
   - It can be seen as a kind of Python list of tuples, each corresponding to one data point (features, label).
- Unless the dataset is huge (cannot fit in memory), you do not need to define this class explicitly; `TensorDataset` can be used instead.
- There are three components:
   - `__init__(self)`
   - `__getitem__(self, index)`
   - `__len__(self)`
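For data that fits in memory, the `TensorDataset` alternative mentioned above can be sketched as follows. The arrays `X` and `y` hold hypothetical values with four features per sample.

```python
import numpy as np
import torch
from torch.utils.data import TensorDataset

# Toy arrays: 3 samples, 4 features each, with integer class labels
X = np.array([[0.1, 0.2, 0.3, 0.4],
              [1.0, 1.1, 1.2, 1.3],
              [2.0, 2.1, 2.2, 2.3]])
y = np.array([0, 1, 0])

dataset = TensorDataset(torch.tensor(X, dtype=torch.float32),
                        torch.tensor(y, dtype=torch.int64))

print(len(dataset))            # number of samples
features, label = dataset[1]   # indexing returns a (features, label) tuple
print(features, label)
```

`TensorDataset` already provides `__getitem__` and `__len__`, so a custom class is mainly useful when samples must be loaded or transformed on the fly.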

In [None]:
class MyDataset(Dataset):
    def __init__(self, X, y):
        self.features = torch.tensor(X, dtype=torch.float32)
        self.labels = torch.tensor(y, dtype=torch.int64)

    def __getitem__(self, index):
        x = self.features[index]
        y = self.labels[index]
        return x, y

    def __len__(self):
        return self.labels.shape[0]

## <font color="blue"> Defining a DataLoader</font>

- Very useful if we have a huge dataset.
- We pass the dataset and our `batch_size` hyperparameter to the data loader as initialization arguments.
- This creates an iterable data loader, so we can easily iterate over each batch using a loop.
   - It behaves like an __iterator__, so we can __loop over__ it and fetch a different __mini-batch__ every time.
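A small sketch of this iteration, using a toy dataset of 10 samples (hypothetical values) and a batch size of 4, shows how the loader splits the data and that the last batch can be smaller than the others.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

X = torch.arange(40, dtype=torch.float32).reshape(10, 4)  # 10 samples, 4 features
y = torch.tensor([0, 1] * 5)

loader = DataLoader(TensorDataset(X, y), batch_size=4, shuffle=True)

batch_shapes = []
for features, labels in loader:
    # Each iteration yields one mini-batch of features and labels
    batch_shapes.append((tuple(features.shape), tuple(labels.shape)))

print(len(loader))      # 3 batches: 4 + 4 + 2 samples
print(batch_shapes)
```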

In [None]:
def instantiate_data(Xdata: np.ndarray, 
                     ydata: np.ndarray, 
                     batch_size: int, 
                     shuffle: bool=False) -> DataLoader:
    """
    Take the NumPy arrays for the features and labels and
    create a PyTorch DataLoader object that serves the data
    in mini-batches of size batch_size.
    If shuffle is set to True (for the training set only),
    the samples are reshuffled at every epoch, which generally
    makes training more stable and helps convergence.
    """
    dataset = MyDataset(Xdata, ydata)
    dataloader = DataLoader(dataset=dataset, 
                            batch_size=batch_size, 
                            shuffle=shuffle)
    return dataloader

In [None]:
train_loader = instantiate_data(X_train, y_train, batch_size, shuffle=True)

In [None]:
X_train.shape

In [None]:
test_loader = instantiate_data(X_test, y_test, batch_size)

## <font color="blue">The training loop</font>

__Define the loss function__

- We use the Cross-Entropy Loss, which is the standard loss for multi-class classification models.
- It applies a log-softmax to the model outputs (logits) and then computes the negative log-likelihood of the target labels, which is why the model itself returns raw logits.

In [None]:
loss_function = nn.CrossEntropyLoss()
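For a single sample, the quantity computed by the cross-entropy loss can be written out by hand. The plain-Python sketch below uses hypothetical logits and a helper named `cross_entropy` that exists only in this example.

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy for one sample: -log(softmax(logits)[target])."""
    shift = max(logits)   # subtract the maximum for numerical stability
    log_sum_exp = shift + math.log(sum(math.exp(v - shift) for v in logits))
    return log_sum_exp - logits[target]

logits = [2.0, 0.0]       # hypothetical raw outputs for classes 0 and 1
loss_correct = cross_entropy(logits, target=0)   # true class has the larger logit
loss_wrong = cross_entropy(logits, target=1)     # true class has the smaller logit
print(loss_correct, loss_wrong)
```

The loss is small when the largest logit belongs to the true class and grows as the model puts more weight on the wrong class. With the default settings, `nn.CrossEntropyLoss` averages this quantity over the samples of a batch.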

__Define the optimizer__

- We use the SGD optimizer that implements the stochastic gradient descent method.

In [None]:
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
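The update rule behind plain stochastic gradient descent (no momentum or weight decay, which is the default configuration used above) can be sketched on a one-dimensional toy problem. The function `sgd_step` and the quadratic objective are illustrative only.

```python
def sgd_step(params, grads, lr):
    """Plain SGD update: move each parameter against its gradient."""
    return [p - lr * g for p, g in zip(params, grads)]

# Minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3)
w = [0.0]
lr = 0.1                       # same value as learning_rate
for _ in range(50):
    grad = [2.0 * (w[0] - 3.0)]
    w = sgd_step(w, grad, lr)

print(w[0])                    # approaches the minimizer 3.0
```

In the training loop, the gradients come from `loss.backward()` on one mini-batch rather than from a closed-form expression, which is what makes the method stochastic.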

__Feed train data into the model__

In [None]:
for epoch in range(num_epochs):

    model = model.train()
    for batch_idx, (features, class_labels) in enumerate(train_loader):
        # Predict outputs
        outputs = model(features.to(device))

        # Compute the loss function
        loss = loss_function(outputs, class_labels.to(device))

        # Reset the gradients left over from the previous batch
        optimizer.zero_grad()
        
        # Back propagation
        loss.backward()

        # Update model parameters
        optimizer.step()

        ### LOGGING
        print(f'Epoch: {epoch+1:03d}/{num_epochs:03d}'
               f' | Batch {batch_idx+1:03d}/{len(train_loader):03d}'
               f' | Loss: {loss:.4f}')
    print(43*'-')

## <font color="blue">Evaluating the results</font>

In [None]:
def compute_accuracy(model, dataloader):
    """
    Compute the fraction of correct classifications and return it
    together with the true and predicted labels.
    """

    model = model.eval()

    pred_values = list()
    true_values = list()

    correct = 0.0
    total_examples = 0

    for idx, (features, class_labels) in enumerate(dataloader):

        with torch.no_grad():
            logits = model(features.to(device))

        pred = torch.argmax(logits, dim=1)
        
        pred_values = np.append(pred_values, pred.cpu().numpy())
        true_values = np.append(true_values, class_labels)

        compare = class_labels.to(device) == pred
        correct += torch.sum(compare).item()  # accumulate as a Python number
        total_examples += len(compare)

    return correct / total_examples, true_values, pred_values

### <font color="green">Evaluation on the train dataset</font>

In [None]:
train_acc, train_true, train_pred = compute_accuracy(model, train_loader)

In [None]:
print(f"Train accuracy: {train_acc*100:.2f}%")

### <font color="green">Evaluation on the test dataset</font>

In [None]:
test_acc, test_true, test_pred = compute_accuracy(model, test_loader)

__Accuracy__

- Measures the proportion of correct predictions in the total sample.

In [None]:
print(f"Test accuracy: {test_acc*100:.2f}%")

__Precision score__

- Measures how many of the items identified as positive are actually positive.

In [None]:
precision_score(test_true, test_pred)

__Recall score__

- Measures how many of the actual positive cases were identified correctly.

In [None]:
recall_score(test_true, test_pred)

__F1 score__

- Harmonic mean of Precision and Recall.

In [None]:
f1_score(test_true, test_pred)
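The three scores above follow directly from counts of true positives, false positives, and false negatives. The plain-Python sketch below uses hypothetical labels and a helper `binary_metrics` that is not part of scikit-learn.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return (precision, recall, f1) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical labels: 1 = fake banknote (positive class), 0 = real banknote
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
precision, recall, f1 = binary_metrics(y_true, y_pred)
print(precision, recall, f1)
```

Here one fake note is missed and one real note is flagged, so precision and recall are both 3/4.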

__ROC-AUC__

- The area under the Receiver Operating Characteristic curve.

In [None]:
roc_auc_score(test_true, test_pred)

__Generate the confusion matrix__

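- A table used to describe the performance of a classification model.
- The table below shows the generic layout. With labels `0` and `1`, scikit-learn's `confusion_matrix` orders rows (actual) and columns (predicted) by label, so its output reads `[[TN, FP], [FN, TP]]` when class `1` is the positive class.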

| | Predicted Class A | Predicted Class B |
|---|---|---|
| **Actual Class A** | True Positive (TP) | False Negative (FN) |
| **Actual Class B** | False Positive (FP) | True Negative (TN) |

In [None]:
print(confusion_matrix(test_true, test_pred))
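To see how the four cells are filled, the sketch below tallies a 2x2 matrix by hand for labels `0` and `1`, with rows for the actual class and columns for the predicted class. The labels are hypothetical and `confusion_counts` is a helper defined only for this example.

```python
def confusion_counts(y_true, y_pred):
    """2x2 matrix for labels 0/1: rows are actual classes, columns are predicted."""
    matrix = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        matrix[int(t)][int(p)] += 1
    return matrix

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

matrix = confusion_counts(y_true, y_pred)
(tn, fp), (fn, tp) = matrix    # unpack with class 1 as the positive class
print(matrix)
print(tn, fp, fn, tp)
```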

__ROC curve__

- Visualizes how well the model distinguishes between two classes (e.g., positive and negative) by plotting the true positive rate against the false positive rate at different threshold settings.
   - True Positive Rate (TPR): Proportion of actual positive cases that are correctly classified as positive.
   - False Positive Rate (FPR): Proportion of actual negative cases that are incorrectly classified as positive.
- A curve closer to the top-left corner (higher TPR, lower FPR) indicates better performance, meaning the model is more accurate at distinguishing between the two classes. 

In [None]:
fpr, tpr, thresholds = roc_curve(test_true, test_pred)
plt.plot(fpr, tpr)
plt.title('ROC Curve')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate');
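The threshold sweep behind an ROC curve is easiest to see with continuous scores such as predicted probabilities. The plain-Python sketch below uses hypothetical scores and two helpers, `roc_points` and `auc`, defined only for this example. It compares the curve obtained from scores with the one obtained from hard 0/1 predictions, which is what the cells above pass to `roc_curve`.

```python
def roc_points(y_true, scores):
    """Sweep thresholds over the scores and return (fpr, tpr) lists."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    fpr, tpr = [0.0], [0.0]
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 0)
        tpr.append(tp / pos)
        fpr.append(fp / neg)
    return fpr, tpr

def auc(fpr, tpr):
    """Area under the curve with the trapezoidal rule."""
    return sum((fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
               for i in range(1, len(fpr)))

y_true = [0, 0, 1, 1]
scores = [0.1, 0.6, 0.35, 0.8]            # hypothetical P(class = 1)
hard = [int(s >= 0.5) for s in scores]    # thresholded 0/1 predictions

auc_scores = auc(*roc_points(y_true, scores))
auc_hard = auc(*roc_points(y_true, hard))
print(auc_scores, auc_hard)
```

With hard predictions the curve has a single intermediate point, so the area reflects only one operating threshold. Passing the softmax probability of class `1` instead would give the full curve, at the cost of returning those probabilities from the evaluation function.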