<center>
<table>
  <tr>
    <td><img src="https://portal.nccs.nasa.gov/datashare/astg/training/python/logos/nasa-logo.svg" width="100"/> </td>
     <td><img src="https://portal.nccs.nasa.gov/datashare/astg/training/python/logos/ASTG_logo.png?raw=true" width="80"/> </td>
     <td> <img src="https://www.nccs.nasa.gov/sites/default/files/NCCS_Logo_0.png" width="130"/> </td>
    </tr>
</table>
</center>

        
<center>
<h1><font color= "blue" size="+3">ASTG Python Course Series</font></h1>
</center>

---

<center>
    <h1><font color="red">
        Banknote Authentication Problem with Scikit-Learn
    </font></h1>
</center>

# <font color="red">Objectives</font>

In this presentation, we use a simple classification dataset to:

- Perform Exploratory Data Analysis (EDA)
- Create, train and evaluate various Scikit-Learn (`sklearn`) models.

We show the steps for building a Machine Learning (ML) model with `sklearn`. 

# <font color="red"> Python packages used</font>

- __Matplotlib__: Create visualization.
- __Pandas__: Data (two-dimensional labelled array) manipulation and analysis.
- __Scikit-Learn__:  Provide supervised and unsupervised Machine Learning algorithms.
- __PyTorch__: Used to build, train, and evaluate a deep learning algorithm based on Neural Networks.

In [None]:
try:
    # The import only succeeds inside a Google Colab runtime
    import google.colab
    print("Running in Google Colab")
except ImportError:
    print("Not running in Google Colab")
else:
    print("Installing modules in Google Colab")
    !pip install seaborn
    !pip install -U scikit-learn

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
import matplotlib.pyplot as plt

In [None]:
import numpy as np

In [None]:
import pandas as pd
import seaborn as sns

In [None]:
import sklearn

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
from sklearn.svm import SVC

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
print(f"Numpy version:        {np.__version__}")
print(f"Pandas version:       {pd.__version__}")
print(f"Seaborn version:      {sns.__version__}")
print(f"Scikit-Learn version: {sklearn.__version__}")

# <font color="red">Loading the dataset</font>

## <font color="blue">Description of the data</font>

- We have a dataset that consists of information about banknotes.
- Each banknote belongs to one of two classes, depending on whether it is real or not:
   - `1`: fake banknote
   - `0`: real banknote
- We want to build various ML models to be able to predict the classes given a set of banknotes.

## <font color="blue">Read the data</font>

#### Dataset

- We use 1372 images that were taken from genuine and forged banknote-like specimens.
- Wavelet Transform tools were used to extract features from images.
   - Among the five variables, four are features, and one is the target class.
   - The four features are continuous numbers that measure the characteristics of digital images of each banknote.
      - `variance`: Measures the spread or distribution of pixel values within the banknote image.
      - `skew`: Quantifies the asymmetry or distortion in the distribution of pixel intensity values, according to GeeksforGeeks.
      - `curtosis`: Describes the sharpness of the peaks in the pixel intensity distribution.
      - `entropy`: Describes the amount of information that must be coded for by a compression algorithm.
   - The target class contains two values, 0 and 1, where 0 represents a genuine note, and 1 represents a fake note.
   - The two classes are roughly balanced, with a ratio of about 55:45 (genuine:counterfeit).
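
To give a feel for what the four features measure, here is a minimal sketch that computes the same kinds of statistics on a synthetic array of pixel values. It is an illustration under stated assumptions: the helper `pixel_statistics` and the random "image" are hypothetical, the entropy is a simple histogram-based Shannon entropy, and no wavelet transform is applied, so the numbers are not comparable to those in the dataset.

```python
import numpy as np

def pixel_statistics(pixels, bins=16):
    """Hypothetical helper: summary statistics of a pixel-intensity array."""
    x = np.asarray(pixels, dtype=float).ravel()
    z = (x - x.mean()) / x.std()          # standardized intensities
    counts, _ = np.histogram(x, bins=bins)
    p = counts[counts > 0] / counts.sum() # empirical bin probabilities
    return {
        "variance": x.var(),              # spread of the values
        "skew": np.mean(z ** 3),          # asymmetry of the distribution
        "curtosis": np.mean(z ** 4) - 3,  # excess kurtosis (0 for a Gaussian)
        "entropy": -np.sum(p * np.log2(p)),  # Shannon entropy in bits
    }

# Synthetic stand-in for an image: Gaussian noise around a mid-gray level
rng = np.random.default_rng(0)
synthetic_image = rng.normal(loc=128, scale=20, size=(64, 64))

stats = pixel_statistics(synthetic_image)
for name, value in stats.items():
    print(f"{name:10s} {value:8.3f}")
```

For Gaussian noise the skewness and excess kurtosis are both close to zero, and the entropy cannot exceed `log2(bins)`.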

In [None]:
url = "https://raw.githubusercontent.com/Kuntal-G/Machine-Learning/master/R-machine-learning/data/banknote-authentication.csv"

In [None]:
df = pd.read_csv(url)

In [None]:
df.head(5)

## <font color='blue'>Perform EDA</font>

### <font color="green">Quick observation on the data types</font>

In [None]:
df.info()

- All the columns have data types of either `float` or `int`.
   - There is no need to do any data conversion.
- There are no missing values.
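
The same two checks can also be done programmatically rather than by reading the `df.info()` output. The sketch below uses a small made-up DataFrame (`toy`, with one deliberately missing value) so that it runs on its own; on the real data one would apply the same calls to `df`.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration, with one missing value on purpose
toy = pd.DataFrame({
    "variance": [3.6, 4.5, np.nan],
    "skew": [8.7, 8.2, -2.6],
    "class": [0, 0, 1],
})

# Number of missing entries in each column
missing_per_column = toy.isna().sum()

# True only if every column has a numeric dtype (int or float)
all_numeric = all(pd.api.types.is_numeric_dtype(t) for t in toy.dtypes)

print(missing_per_column)
print("All columns numeric:", all_numeric)
```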

### <font color="green">Descriptive statistics</font>

In [None]:
df.describe()

### <font color="green">Basic plots</font>

__Percentage of instances for each class__

In [None]:
df['class'].value_counts().plot(kind="pie", autopct='%1.1f%%');

__Pairplot__

- The `sns.pairplot()` function takes a dataset as a parameter and plots the pairwise relationships between all of its numeric columns.

In [None]:
sns.pairplot(df)

__Observations__

- `entropy` and `variance` have a slight linear correlation.
- There is an inverse linear correlation between the `curtosis` and `skew`.
- The values for `curtosis` and `entropy` are slightly higher for real banknotes, while the values for `skew` and `variance` are higher for the fake banknotes.

### <font color="green">Identify possible linear relationships</font>

__Heatmap__

In [None]:
plt.figure(figsize=(22, 11));
correlation_matrix = df.corr().round(3);
sns.heatmap(correlation_matrix, cmap="YlGnBu", annot=True);

In [None]:
plt.figure(figsize=(22, 11));
sns.heatmap(correlation_matrix[(correlation_matrix >= 0.7) | (correlation_matrix <= -0.6)], 
            cmap='viridis', vmax=1.0, vmin=-1.0, linewidths=0.1,
            annot=True, annot_kws={"size": 8}, square=True);

### <font color="green">Identify possible non-linear relationships</font>

In [None]:
sns.pairplot(df, hue='class')

__Observations__

-  Out of all the combinations of variables, the scatter plot of `curtosis` vs `entropy` has the most significant overlap between classes.
   - The overlap indicates that there is substantial ambiguity and similarity between classes based just on their `curtosis` and `entropy` values, which implies that there is no distinct separation that allows for easy classification.
   - ML algorithms would face significant challenges in accurately classifying banknotes based on these two variables.
- `skew` vs `entropy` and `skew` vs `curtosis` also have relatively high overlap between classes.
   - This overlap likewise limits how well ML algorithms can separate the classes when using only these feature pairs.
- A large overlap of different classes will require more complex decision boundaries to determine the class of the banknote accurately.
- __It is reasonable to predict that algorithms such as Logistic Regression and Linear Discriminant Analysis, which assume linear relationships between variables, may struggle with an accurate prediction with only two features.__

### <font color="green">Analyzing 3D plots</font>

- We extend our analysis by adding a third feature to the plots, producing three-feature scatter plots.
- By including a new variable, our goal is to mitigate the significant overlap observed in the two-feature scatter plots and improve the separability between the classes.
- As the plots below show, the three-feature scatter plots exhibit notably less overlap between the classes than the two-feature plots. This suggests that adding a third variable to any two-variable combination provides additional discriminatory power, leading to clearer separation between genuine and counterfeit banknotes.
- __The reduced overlap and improved separability in the three-feature scatter plots indicate that ML methods, including those that assume linear relationships among the variables, such as Logistic Regression and Linear Discriminant Analysis, are expected to perform better when using three features for classification.__


In [None]:
def do_3d_plot(data, ax, labels):
    # Plot the data, using Seaborn's palette for color differentiation
    # Iterate through each class to plot separately for distinct colors
    for class_label in data['class'].unique():
        subset = data[data['class'] == class_label]
        ax.scatter(subset[labels[0]], subset[labels[1]], subset[labels[2]], 
                   label=f'Class {class_label}',
                   color=sns.color_palette("tab10", n_colors=2)[int(class_label)],
                   s=3
                  ) # s for marker size

    # Set labels and title
    ax.set_xlabel(labels[0])
    ax.set_ylabel(labels[1])
    ax.set_zlabel(labels[2])
    #ax.legend()

In [None]:
list_labels = [
    ('variance', 'entropy', 'skew'),
    ('variance', 'entropy', 'curtosis'),
    ('entropy', 'skew', 'curtosis'),
    ('skew', 'variance', 'curtosis')
]

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(12, 6), subplot_kw={'projection': '3d'})

for ax, labels in zip(axes.flat, list_labels):
    do_3d_plot(df, ax, labels)
plt.tight_layout()

- The above observations were based on the visual analysis of the scatter plots.
- Further quantitative analysis is necessary to confirm the extent of the visual analysis and its impact on classification accuracy.
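
One possible way to do such a quantitative check is to score a linear classifier on every two- and three-feature subset with cross-validation. The sketch below is only an outline under stated assumptions: it uses synthetic data from `make_classification` in place of the banknote table (the column names are attached purely as labels), and a scaled `LogisticRegression` as the linear model.

```python
from itertools import combinations

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the four banknote features (assumption)
names = ["variance", "skew", "curtosis", "entropy"]
X, y = make_classification(n_samples=400, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)

subset_scores = {}
for size in (2, 3):
    for cols in combinations(range(len(names)), size):
        # Scaling and fitting happen inside each cross-validation fold
        model = make_pipeline(StandardScaler(), LogisticRegression())
        scores = cross_val_score(model, X[:, list(cols)], y, cv=5)
        subset_scores[tuple(names[c] for c in cols)] = scores.mean()

# Best-scoring feature subsets first
for subset, score in sorted(subset_scores.items(), key=lambda kv: -kv[1]):
    print(f"{', '.join(subset):32s} {score:.3f}")
```

Comparing the mean accuracies of the two-feature and three-feature subsets gives a number to set against the visual impression from the scatter plots.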

#  <font color="red">Data preparation</font>

##  <font color="blue"> Splitting the data into training and testing sets</font>
- We split the data into training and testing sets. 
- We train the model with 80% of the samples and test with the remaining 20%. 

__Extract the train and test datasets as NumPy arrays__

In [None]:
feature_cols = list(df.columns)
del feature_cols[-1]

In [None]:
feature_cols

In [None]:
label_name = list(df.columns)[-1]
label_name

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df[feature_cols].values, 
                                                    df[label_name].values, 
                                                    test_size=0.2, 
                                                    random_state=42)

In [None]:
X_train

In [None]:
X_train.shape

In [None]:
y_train

In [None]:
y_train.shape

In [None]:
y_test.shape
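
Because the classes are not perfectly balanced, a plain random split can shift the class proportions slightly between the training and testing sets. As an optional variation (not used in this notebook), `train_test_split` accepts a `stratify` argument that preserves the proportions. The sketch below shows the effect on made-up labels with a 55:45 ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 1000 samples with a 55:45 class ratio
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 4))
y_demo = np.array([0] * 550 + [1] * 450)

# Plain random split
_, _, _, y_te_plain = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

# Stratified split: class proportions are preserved in both subsets
_, _, _, y_te_strat = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

print("Share of class 1 in test set (plain):     ", y_te_plain.mean())
print("Share of class 1 in test set (stratified):", y_te_strat.mean())
```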

## <font color="blue">Normalize the data</font> <a class="anchor" id="sec_tf_norm"></a>

- In general, variables may not be on a similar scale. Features with large values would then dominate any distance-based calculation.
- It is good practice to normalize features that use different scales and ranges.
   - The normalization process brings all variables to a similar scale, preventing certain variables from dominating others in later analysis and ensuring fair comparisons and interpretations.
- Although the model might converge without feature normalization, it makes training more difficult, and it makes the resulting model dependent on the choice of units used in the input.

In [None]:
X_train

In [None]:
train_mean = X_train.mean(axis=0)
train_std = X_train.std(axis=0)

In [None]:
train_mean

In [None]:
train_std

__Normalization of the train features__

In [None]:
X_train = (X_train - train_mean) / train_std

In [None]:
X_train

__Normalization of the test features__

In [None]:
X_test = (X_test - train_mean) / train_std
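
The manual normalization above can also be written with `sklearn.preprocessing.StandardScaler`, which stores the training mean and standard deviation and reuses them for the test set. The sketch below, on made-up arrays, checks that both approaches give the same result. `StandardScaler` uses the population standard deviation, which matches NumPy's default `std()`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up features on very different scales
rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=[0, 5, -3, 10], scale=[1, 4, 2, 0.5], size=(200, 4))
X_test_demo = rng.normal(loc=[0, 5, -3, 10], scale=[1, 4, 2, 0.5], size=(50, 4))

# Manual normalization with statistics from the training set only
mean = X_train_demo.mean(axis=0)
std = X_train_demo.std(axis=0)
manual_train = (X_train_demo - mean) / std
manual_test = (X_test_demo - mean) / std

# Same operation with StandardScaler: fit on train, apply to both
scaler = StandardScaler().fit(X_train_demo)
scaled_train = scaler.transform(X_train_demo)
scaled_test = scaler.transform(X_test_demo)

print(np.allclose(manual_train, scaled_train), np.allclose(manual_test, scaled_test))
```

In both versions the test set is scaled with the training statistics, so no information from the test set leaks into the preprocessing.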

# <font color="red">Creating the ML models</font>

- We define, train and test different ML algorithms.
- We use various metrics to evaluate the performance of each model.

__Classification report__

- The `sklearn.metrics.classification_report` function generates a text-based report summarizing the performance of a classification model.
- The report includes the following metrics:
  - _Precision_: The ability of the classifier not to label as positive a sample that is negative. It is the ratio: $\frac{TP}{TP + FP}$.
  - _Recall_: The ability of the classifier to find all the positive samples. It is the ratio: $\frac{TP}{TP + FN}$.
  - _F1-Score_: The harmonic mean of precision and recall: $\frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$. It is a good measure to use when you need to balance both precision and recall.
  - _Support_: The number of actual occurrences of each class in the specified dataset.
- The report also provides aggregated averages across all classes:
  - _Accuracy_: The overall proportion of correctly classified instances.
  - _Macro Average_: The unweighted average of precision, recall, and F1-score across all classes. This treats all classes equally, regardless of their size.
  - _Weighted Average_: The average of precision, recall, and F1-score, weighted by the support of each class. This gives more importance to larger classes.
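
As a small worked example of these definitions, the sketch below computes precision, recall and F1 by hand for the positive class and compares them with `sklearn.metrics.precision_recall_fscore_support`. The labels are made up for illustration.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Made-up labels: 1 is the positive class
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_hat = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])

tp = np.sum((y_true == 1) & (y_hat == 1))  # true positives
fp = np.sum((y_true == 0) & (y_hat == 1))  # false positives
fn = np.sum((y_true == 1) & (y_hat == 0))  # false negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Per-class scores from scikit-learn, then the two averages built by hand
p_cls, r_cls, f_cls, support = precision_recall_fscore_support(y_true, y_hat)
macro_f1 = f_cls.mean()                            # every class counts equally
weighted_f1 = np.average(f_cls, weights=support)   # weighted by class size

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
print(f"macro F1={macro_f1:.3f} weighted F1={weighted_f1:.3f}")
```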

__Confusion matrix__

- A table used to describe the performance of a classification model.

| | Predicted Class A | Predicted Class B |
|---|---|---|
| **Actual Class A** | True Positive (TP) | False Negative (FN) |
| **Actual Class B** | False Positive (FP) | True Negative (TN) |
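
Note that `sklearn.metrics.confusion_matrix` orders rows (actual) and columns (predicted) by label. With labels `[0, 1]` and `1` treated as the positive class, the top-left cell is therefore the true negatives, unlike the table above where the positive class is listed first. A short sketch on made-up labels:

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration
y_true = [0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 0, 1, 1, 1, 1, 0]

cm = confusion_matrix(y_true, y_hat, labels=[0, 1])

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
tn, fp, fn, tp = cm.ravel()
print(cm)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```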

## <font color="blue">Logistic regression</font>


Logistic regression is a statistical method for predicting binary classes.

- It can be seen as an extension of linear regression to the case where the target variable is categorical.
- It is one of the simplest and most commonly used ML algorithms for two-class classification.
- The outcome or the target variable has only two possible classes.
- It predicts the probability of occurrence of a binary event by applying the logistic (sigmoid) function, the inverse of the logit, to a linear combination of the features.
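
To make the probability model concrete, the following sketch fits a `LogisticRegression` on synthetic data (an assumption, so that the example runs on its own) and reproduces `predict_proba` by hand from the fitted coefficients.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic two-class data with four features
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
clf = LogisticRegression().fit(X, y)

# Linear score z = w . x + b for every sample
z = X @ clf.coef_.ravel() + clf.intercept_[0]

# The sigmoid maps the score to P(class = 1)
p_manual = 1.0 / (1.0 + np.exp(-z))
p_sklearn = clf.predict_proba(X)[:, 1]

print(np.allclose(p_manual, p_sklearn))
```

A sample is assigned to class `1` when this probability exceeds 0.5, which is the same as the linear score `z` being positive.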

__Create model__

In [None]:
lr_estimator = LogisticRegression() 

__Train the model__

In [None]:
lr_estimator.fit(X_train, y_train)

__Make prediction__

In [None]:
y_pred = lr_estimator.predict(X_test) 

__Evaluate the model__

In [None]:
lr_class_rep = classification_report(y_test, y_pred)
print(f"Classification report: \n {lr_class_rep}")

In [None]:
lr_conf_mat = confusion_matrix(y_test, y_pred)
lr_disp = ConfusionMatrixDisplay(confusion_matrix=lr_conf_mat, display_labels=[0, 1]) 
lr_disp.plot();

In [None]:
lr_acc_score = accuracy_score(y_test, y_pred)
print(f"Accuracy report: \n {lr_acc_score}")

__Observations__

- The output showed an accuracy of `97.82%`, which corresponds to 6 misclassifications among the 275 test samples.

## <font color="blue">Random forest</font>

- An ensemble learning method that operates by constructing a multitude of decision trees during training and outputting the class that is the mode of the classes of the individual trees.
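
In scikit-learn's implementation, the trees are combined by averaging their predicted class probabilities rather than by counting hard votes; the two usually agree. The sketch below, on synthetic data (an assumption), rebuilds the forest prediction from its individual trees.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic two-class data with four features
X, y = make_classification(n_samples=300, n_features=4, random_state=0)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# Class probabilities from every tree, shape (n_trees, n_samples, n_classes)
tree_probas = np.stack([tree.predict_proba(X) for tree in forest.estimators_])

# Average over the trees, then pick the most probable class
mean_proba = tree_probas.mean(axis=0)
manual_pred = forest.classes_[mean_proba.argmax(axis=1)]

print(np.allclose(mean_proba, forest.predict_proba(X)))
print(np.array_equal(manual_pred, forest.predict(X)))
```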

__Create model__

In [None]:
rf_estimator = RandomForestClassifier(n_estimators=200, random_state=0) 

__Train the model__

In [None]:
rf_estimator.fit(X_train, y_train)

__Make prediction__

In [None]:
y_pred = rf_estimator.predict(X_test) 

__Evaluate the model__

In [None]:
rf_class_rep = classification_report(y_test, y_pred)
print(f"Classification report: \n {rf_class_rep}")

In [None]:
rf_conf_mat = confusion_matrix(y_test, y_pred)
rf_disp = ConfusionMatrixDisplay(confusion_matrix=rf_conf_mat, display_labels=[0, 1]) 
rf_disp.plot();

In [None]:
rf_acc_score = accuracy_score(y_test, y_pred)
print(f"Accuracy report: \n {rf_acc_score}")

__Observations__

- The output showed an accuracy of `99.27%`, i.e. only 2 of the 275 test samples were misclassified.

## <font color="blue">Support Vector Machine (SVM)</font>

- SVM aims to find an optimal hyperplane that separates different classes in the feature space.
- In a binary classification problem (as with the banknotes), this hyperplane acts as a decision boundary, dividing the data points into two distinct classes.
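
The shape of the decision boundary depends on the kernel. As a rough sketch, the code below compares a few kernel choices with 5-fold cross-validation on synthetic data (an assumption). The resulting scores say nothing about the banknote data; the point is only how such a comparison can be set up.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic two-class data with four features
X, y = make_classification(n_samples=300, n_features=4, random_state=0)

candidates = {
    "linear": SVC(kernel="linear"),
    "rbf": SVC(kernel="rbf"),
    "poly (degree 3)": SVC(kernel="poly", degree=3),
    "poly (degree 8)": SVC(kernel="poly", degree=8),
}

kernel_scores = {}
for name, svc in candidates.items():
    # Scale inside the pipeline so each fold uses its own training statistics
    model = make_pipeline(StandardScaler(), svc)
    kernel_scores[name] = cross_val_score(model, X, y, cv=5).mean()

for name, score in kernel_scores.items():
    print(f"{name:16s} {score:.3f}")
```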

__Create model__

In [None]:
svm_estimator = SVC(kernel='poly', degree=8)

__Train the model__

In [None]:
svm_estimator.fit(X_train, y_train)

__Make prediction__

In [None]:
y_pred = svm_estimator.predict(X_test) 

__Evaluate the model__

In [None]:
svm_class_rep = classification_report(y_test, y_pred)
print(f"Classification report: \n {svm_class_rep}")

In [None]:
svm_conf_mat = confusion_matrix(y_test, y_pred)
svm_disp = ConfusionMatrixDisplay(confusion_matrix=svm_conf_mat, display_labels=[0, 1]) 
svm_disp.plot();

In [None]:
svm_acc_score = accuracy_score(y_test, y_pred)
print(f"Accuracy report: \n {svm_acc_score}")

__Observations__

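- The output showed an accuracy of `70.90%`, i.e. about 80 of the 275 test samples were misclassified. This is far worse than the other models and suggests that the degree-8 polynomial kernel is poorly suited to this data.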

## <font color="blue">K-Nearest Neighbors (KNN)</font>

- KNN calculates distances between a new instance and existing instances in the training set to identify the k-nearest neighbors.
- It predicts the class label of a new data point based on the majority class of its 'k' nearest neighbors in the training data.
- One of the key parameters is `n_neighbors`, i.e., the number of neighbors (`k`) to consider.
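
The procedure described above fits in a few lines of NumPy. The sketch below is a simplified illustration: the helper `knn_predict` is hypothetical, it assumes labels `0` and `1`, Euclidean distance and an odd `k`, and it is checked against `KNeighborsClassifier` on made-up points.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def knn_predict(X_train, y_train, X_new, k=3):
    """Hypothetical helper: majority vote among the k nearest training points."""
    # Pairwise Euclidean distances, shape (n_new, n_train)
    dists = np.linalg.norm(X_new[:, None, :] - X_train[None, :, :], axis=2)
    # Indices of the k closest training points for each new point
    nearest = np.argsort(dists, axis=1)[:, :k]
    # With labels 0/1, the majority class is 1 when more than half the votes are 1
    return (y_train[nearest].mean(axis=1) > 0.5).astype(int)

# Made-up data: two Gaussian clusters, one per class
rng = np.random.default_rng(0)
X_tr = np.vstack([rng.normal(loc=-1, size=(50, 2)),
                  rng.normal(loc=1, size=(50, 2))])
y_tr = np.array([0] * 50 + [1] * 50)
X_new = rng.normal(size=(20, 2))

manual = knn_predict(X_tr, y_tr, X_new, k=3)
reference = KNeighborsClassifier(n_neighbors=3).fit(X_tr, y_tr).predict(X_new)
print(np.array_equal(manual, reference))
```

An odd `k` avoids tied votes in a two-class problem; with an even `k`, ties have to be broken by some convention.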

__Create model__

In [None]:
knn_estimator = KNeighborsClassifier(n_neighbors=2)

__Train the model__

In [None]:
knn_estimator.fit(X_train, y_train)

__Make prediction__

In [None]:
y_pred = knn_estimator.predict(X_test) 

__Evaluate the model__

In [None]:
knn_class_rep = classification_report(y_test, y_pred)
print(f"Classification report: \n {knn_class_rep}")

In [None]:
knn_conf_mat = confusion_matrix(y_test, y_pred)
knn_disp = ConfusionMatrixDisplay(confusion_matrix=knn_conf_mat, display_labels=[0, 1]) 
knn_disp.plot();

In [None]:
knn_acc_score = accuracy_score(y_test, y_pred)
print(f"Accuracy report: \n {knn_acc_score}")

__Determine the performance as function of `n_neighbors`__

In [None]:
knn_errors = list()
num_neighbors = 50
for k in range(1, num_neighbors):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    y_test_pred = knn.predict(X_test)
    error = np.mean(y_test_pred != y_test)
    knn_errors.append(error) 
    #print(f"k = {k:03} Error = {error:.05f}")

In [None]:
plt.figure(figsize=(12, 6))
plt.plot(range(1, num_neighbors), knn_errors, 
         color='red', linestyle='dashed', 
         marker='o',
         markerfacecolor='blue', 
         markersize=10)
plt.title(r'Error as function of $k$')
plt.xlabel('Number of neighbors (k)')
plt.ylabel('Mean Error');

__Observations__

- The largest value of `k` that still gives an accurate prediction is 18.
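
The error curve above is computed on the test set, so choosing `k` from it reuses the test data for model selection. A common alternative is to choose `k` by cross-validation on the training set and keep the test set for the final score. The sketch below shows one way to do this with `GridSearchCV`, on synthetic data (an assumption) and with an arbitrary search range of 1 to 30.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data with four features
X, y = make_classification(n_samples=400, n_features=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# 5-fold cross-validation over k = 1..30, using the training set only
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": list(range(1, 31))},
    cv=5,
)
search.fit(X_tr, y_tr)

best_k = search.best_params_["n_neighbors"]
test_accuracy = search.score(X_te, y_te)  # best model, refit on all training data

print(f"Best k: {best_k}")
print(f"Test accuracy with best k: {test_accuracy:.3f}")
```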

# <font color="red">References</font>

- [Banknote Authentication using Machine Learning Algorithms](https://www.coditude.com/insights/banknote-authentication-using-machine-learning-algorithms/)
- [Comparative Analysis Of Machine Learning Based Bank Note Authentication Through Variable Selection](https://nhsjs.com/2023/comparative-analysis-of-machine-learning-based-bank-note-authentication-through-variable-selection/) by Rick Nie.