#Logistic Regression for Banknote Authentication


---



##Objective:
This assignment aims to assess your ability to apply binary logistic regression to a real-world dataset containing only continuous features. You will practice loading data, preparing it for modeling, training a logistic regression model, interpreting its results, evaluating its performance using accuracy, and drawing basic conclusions.

##Scenario:
Imagine you are working for a financial institution developing automated systems to detect counterfeit banknotes. Data has been collected from images of genuine and forged banknote-like specimens. Features were extracted from these images using Wavelet Transform tools, resulting in four continuous numerical measurements. The goal is to build a model that can predict whether a banknote is genuine or forged based on these image-derived features.

##Dataset:
Banknote Authentication Dataset


---



##Import Relevant Libraries

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, f1_score, classification_report
import warnings

##Tasks

###Task 1.1: Starter Code

In [None]:
# Starter Code for Assignment: Logistic Regression for Banknote Authentication

# Suppress potential convergence warnings for cleaner output (optional)
warnings.filterwarnings('ignore', category=FutureWarning)
warnings.filterwarnings('ignore', category=UserWarning)

# --- Reference Date ---
# Assignment context date: Saturday, April 5, 2025 (as per environment context)
print("Assignment starter code executed. Context Date: April 5, 2025")

# --- Task 1: Data Loading and Preparation ---
print("\n--- Task 1: Data Loading and Preparation ---")

# URL for the dataset
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00267/data_banknote_authentication.txt'

# Define column names
column_names = ['Variance', 'Skewness', 'Curtosis', 'Entropy', 'Class']

# Load data, specifying no header and comma separator
try:
    df = pd.read_csv(url, header=None, names=column_names, sep=',')
    print("Data loaded successfully.")
except Exception as e:
    print(f"Error loading data: {e}")
    # Re-raise so execution stops here; exit() would shut down a notebook kernel
    raise

# Display basic info - Verify data loaded correctly
print("\nDataFrame Head:")
print(df.head())
print("\nDataFrame Info:")
df.info()
print("\nDataFrame Description:")
print(df.describe())

Assignment starter code executed. Context Date: April 5, 2025

--- Task 1: Data Loading and Preparation ---
Data loaded successfully.

DataFrame Head:
   Variance  Skewness  Curtosis  Entropy  Class
0   3.62160    8.6661   -2.8073 -0.44699      0
1   4.54590    8.1674   -2.4586 -1.46210      0
2   3.86600   -2.6383    1.9242  0.10645      0
3   3.45660    9.5228   -4.0112 -3.59440      0
4   0.32924   -4.4552    4.5718 -0.98880      0

DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1372 entries, 0 to 1371
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Variance  1372 non-null   float64
 1   Skewness  1372 non-null   float64
 2   Curtosis  1372 non-null   float64
 3   Entropy   1372 non-null   float64
 4   Class     1372 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 53.7 KB

DataFrame Description:
          Variance     Skewness     Curtosis      Entropy        Class
count  1372.000000 
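
The `header=None` and `names=` arguments used in the starter code can be tried without a network connection. The following is a minimal sketch, assuming the first three rows of the head output above stand in for the file; `sample_text` and `sample_df` are illustrative names that do not appear in the assignment.

```python
import io

import pandas as pd

# The first three rows from the DataFrame head above, written as raw
# comma-separated text with no header line (an assumed stand-in for the file).
sample_text = (
    "3.6216,8.6661,-2.8073,-0.44699,0\n"
    "4.5459,8.1674,-2.4586,-1.4621,0\n"
    "3.866,-2.6383,1.9242,0.10645,0\n"
)

column_names = ['Variance', 'Skewness', 'Curtosis', 'Entropy', 'Class']

# header=None tells pandas the first line is data, not column labels;
# names= supplies the labels instead.
sample_df = pd.read_csv(io.StringIO(sample_text), header=None, names=column_names)

print(sample_df.shape)
print(sample_df.dtypes)
```

Without `header=None`, pandas would treat the first data row as column labels and silently drop one observation.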

###Task 1.2: Data Loading and Prep

In [None]:
# 1. Define the feature matrix X
X_auth = df[['Variance', 'Skewness', 'Curtosis', 'Entropy']]

# 2. Define the target vector y
y_auth = df['Class']

# 3. Describe the data
print('X Described:',X_auth.describe())
print('Y Described:',y_auth.describe())

X Described:           Variance     Skewness     Curtosis      Entropy
count  1372.000000  1372.000000  1372.000000  1372.000000
mean      0.433735     1.922353     1.397627    -1.191657
std       2.842763     5.869047     4.310030     2.101013
min      -7.042100   -13.773100    -5.286100    -8.548200
25%      -1.773000    -1.708200    -1.574975    -2.413450
50%       0.496180     2.319650     0.616630    -0.586650
75%       2.821475     6.814625     3.179250     0.394810
max       6.824800    12.951600    17.927400     2.449500
Y Described: count    1372.000000
mean        0.444606
std         0.497103
min         0.000000
25%         0.000000
50%         0.000000
75%         1.000000
max         1.000000
Name: Class, dtype: float64
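
For a binary target, `describe()` reports the class balance only indirectly through the mean. A minimal sketch of a more direct check follows. It rebuilds a label column from the summary above (mean 0.444606 over 1372 rows); `y_demo` is an illustrative stand-in, not the real column in its original row order.

```python
import pandas as pd

# Class counts implied by the summary above: mean 0.444606 over 1372 rows
# gives round(0.444606 * 1372) = 610 rows with Class == 1.
n_rows = 1372
n_class_1 = round(0.444606 * n_rows)
n_class_0 = n_rows - n_class_1

# Rebuild a label column with the same balance (illustrative only).
y_demo = pd.Series([0] * n_class_0 + [1] * n_class_1, name='Class')

# Absolute counts and relative shares per class, ordered by label.
class_counts = y_demo.value_counts().sort_index()
class_share = y_demo.value_counts(normalize=True).sort_index()

print(class_counts)
print(class_share.round(3))
```

On the real data the same check would be `y_auth.value_counts()`.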


###Task 2: Train-Test Split

In [None]:
print("--- Task 2: Train-Test Split ---")
# 1. Split X and y into training and testing sets
X_auth_train, X_auth_test, y_auth_train, y_auth_test = train_test_split(
    X_auth, y_auth,
    test_size=0.20,
    random_state=42,
    stratify=y_auth
    )

# 2. Print the shapes of the resulting arrays
print(f"X Train set size: {X_auth_train.shape[0]}, X Test set size: {X_auth_test.shape[0]}")
print(f"Y Train set size: {y_auth_train.shape[0]}, Y Test set size: {y_auth_test.shape[0]}")

--- Task 2: Train-Test Split ---
X Train set size: 1097, X Test set size: 275
Y Train set size: 1097, Y Test set size: 275
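
The `stratify=y_auth` argument is meant to keep the class proportions the same in both splits. A small sketch of how that could be checked follows. It assumes placeholder data with the class balance implied by the summary statistics (762 zeros and 610 ones, from a mean of 0.444606 over 1372 rows); `X_demo` and `y_demo` are illustrative names.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative labels with the same balance as the dataset (762 zeros, 610 ones);
# the single feature is a placeholder row index, not a real measurement.
y_demo = np.array([0] * 762 + [1] * 610)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42, stratify=y_demo
)

# With stratification, the share of class 1 should be nearly identical everywhere.
print(f"Overall share of class 1: {y_demo.mean():.4f}")
print(f"Train share of class 1:   {y_tr.mean():.4f}")
print(f"Test share of class 1:    {y_te.mean():.4f}")
```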


###Task 3: Model Training

In [None]:
print("--- Task 3: Model Training ---")
# 1. Initialize a LogisticRegression model (with random_state=42)
log_reg_auth = LogisticRegression(random_state=42)

# 2. Train the model using X_auth_train and y_auth_train (features are not scaled)
log_reg_auth.fit(X_auth_train, y_auth_train)

print('Model trained on banknote authentication data')

--- Task 3: Model Training ---
Model trained on banknote authentication data
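
`StandardScaler` is imported at the top of the notebook, but the model above is fitted on unscaled features. The sketch below shows one way scaling could be added. It is an alternative, not the approach this notebook takes, and it runs on synthetic data from `make_classification` rather than the banknote data. The names `scaled_model` and `demo_accuracy` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with four continuous features (not the banknote data).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42, stratify=y_demo
)

# The pipeline fits the scaler on the training data only, then reuses those
# statistics at prediction time, which avoids leaking test-set information.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(random_state=42))
scaled_model.fit(X_tr, y_tr)

demo_accuracy = scaled_model.score(X_te, y_te)
print(f"Accuracy on the synthetic test split: {demo_accuracy:.4f}")
```

Scaling changes the units of the coefficients, so coefficients from a scaled model are read per standard deviation of a feature rather than per raw unit.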


###Task 4: Model Evaluation

In [None]:
print("--- Task 4: Model Evaluation ---")
# 1. Make predictions (y_pred) on the test data using the trained model
y_auth_pred_labels = log_reg_auth.predict(X_auth_test)
y_auth_pred_proba = log_reg_auth.predict_proba(X_auth_test)[:, 1]

print("Banknote Authentication Predictions (Test Set):")
print(f"Actual Class:    {y_auth_test.iloc[:5].values}")
print(f"Predicted Class: {y_auth_pred_labels[:5]}")
print(f"Predicted Probability (Class 1): {y_auth_pred_proba[:5].round(3)}")

# 2. Calculate accuracy and the confusion matrix
accuracy_auth = accuracy_score(y_auth_test, y_auth_pred_labels)
cm_auth = confusion_matrix(y_auth_test, y_auth_pred_labels)

# 3. Calculate metrics, use zero_division=0 for robustness
precision_auth = precision_score(y_auth_test, y_auth_pred_labels, zero_division=0)
recall_auth = recall_score(y_auth_test, y_auth_pred_labels, zero_division=0)
f1_auth = f1_score(y_auth_test, y_auth_pred_labels, zero_division=0)

# Print Metrics
print('\n---Metrics---')
print(f"Accuracy:  {accuracy_auth:.4f}")
print(f"Precision: {precision_auth:.4f}")
print(f"Recall:    {recall_auth:.4f}")
print(f"F1-Score:  {f1_auth:.4f}")

print('\n---Confusion Matrix---')
print(cm_auth)

--- Task 4: Model Evaluation ---
Banknote Authentication Predictions (Test Set):
Actual Class:    [0 1 0 1 1]
Predicted Class: [0 1 0 1 1]
Predicted Probability (Class 1): [0.    0.996 0.107 0.994 1.   ]

---Metrics---
Accuracy:  0.9855
Precision: 0.9683
Recall:    1.0000
F1-Score:  0.9839

---Confusion Matrix---
[[149   4]
 [  0 122]]
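
The reported metrics can be reproduced by hand from the confusion matrix, which helps when checking what each number means. The sketch below uses only the matrix printed above; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix reported above: rows are actual classes, columns are predictions.
cm = np.array([[149, 4],
               [0, 122]])

# For binary labels 0/1, scikit-learn's layout unravels to TN, FP, FN, TP.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()      # share of all test notes classified correctly
precision = tp / (tp + fp)           # share of predicted class 1 that is truly class 1
recall = tp / (tp + fn)              # share of actual class 1 that was found
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1-Score:  {f1:.4f}")
```

All four errors are false positives: class 0 banknotes predicted as class 1. There are no false negatives, which is why recall is 1.0 while precision is lower.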


###Task 5: Coefficient Interpretation

In [None]:
print("--- Task 5: Coefficient Interpretation ---")
# 1. Extract coefficients from the trained model

#get intercept and coefficients
intercept_auth = log_reg_auth.intercept_[0]
coefs = log_reg_auth.coef_[0]

# Get the feature names from the training columns
feature_names = X_auth_train.columns

# Pair each feature name with its coefficient in a DataFrame
coeffs_log_df = pd.DataFrame({'Feature': feature_names, 'Coefficient': coefs})

# 2. Print the coefficients for 'Skewness' and 'Entropy'
print(f"\nModel Intercept: {intercept_auth:.4f}") #prints intercept for the whole model
print(f"Coefficient for Skewness: {coeffs_log_df.loc[1, 'Coefficient']:.4f}") #skewness
print(f"Coefficient for Entropy: {coeffs_log_df.loc[3, 'Coefficient']:.4f}") #entropy

--- Task 5: Coefficient Interpretation ---

Model Intercept: 3.7438
Coefficient for Skewness: -1.7832
Coefficient for Entropy: -0.0901
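
Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are often easier to read. The sketch below uses only the two coefficients printed above. It assumes the positive class is `Class == 1`, which is what scikit-learn uses for 0/1 labels, and that the features are unscaled as in the training cell.

```python
import math

# Coefficients reported above (log-odds change per one-unit increase in the feature).
coef_skewness = -1.7832
coef_entropy = -0.0901

# exp(coefficient) is the multiplicative change in the odds of Class == 1
# for a one-unit increase in the feature, holding the other features fixed.
odds_ratio_skewness = math.exp(coef_skewness)
odds_ratio_entropy = math.exp(coef_entropy)

print(f"Odds ratio for Skewness: {odds_ratio_skewness:.3f}")
print(f"Odds ratio for Entropy:  {odds_ratio_entropy:.3f}")
```

Under these values, a one-unit increase in Skewness multiplies the odds of class 1 by roughly 0.17, while a one-unit increase in Entropy multiplies them by roughly 0.91, so Skewness has a much stronger per-unit effect in this model.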


###Task 6: Conclusion

In [None]:
print("--- Task 6: Interpretation ---")
# 1. Write 1-2 print statements summarizing your findings (accuracy, interpretation highlights, model utility)

print('The model reaches 98.55% accuracy on the test set when separating the two banknote classes.\nPrecision for the positive class (Class = 1) is 96.83%: 122 of the 126 banknotes predicted as class 1 really are class 1. Recall is 100%: the model found every class 1 banknote in the test set.')

--- Task 6: Interpretation ---
The model reaches 98.55% accuracy on the test set when separating the two banknote classes.
Precision for the positive class (Class = 1) is 96.83%: 122 of the 126 banknotes predicted as class 1 really are class 1. Recall is 100%: the model found every class 1 banknote in the test set.


###Task 7: Code Quality

In [None]:
print("--- Task 7: Code Quality ---")
#no code necessary
print("Code is commented, uses meaningful variable names, and runs without errors.")

--- Task 7: Code Quality ---
Code is commented, uses meaningful variable names, and runs without errors.
