# Classification with Scikit-Learn

This notebook compares six classification algorithms on the BankNote Authentication dataset.

**Classifiers tested:**
- Logistic Regression
- Support Vector Machine (SVM)
- Decision Tree
- Random Forest
- K-Nearest Neighbors (KNN)
- Naive Bayes

In [None]:
# ============================================
# IMPORTS
# ============================================
import pandas as pd                          # Data manipulation and CSV loading
import numpy as np                           # Numerical operations
import seaborn as sb                         # Visualization (pairplots)
from sklearn.metrics import confusion_matrix # Shows prediction breakdown
from sklearn.metrics import classification_report  # Precision, recall, F1

In [None]:
# ============================================
# LOAD DATASET
# ============================================
# BankNote Authentication: detect forged banknotes
# Features extracted from images using Wavelet Transform
# Target: 0 = genuine, 1 = forged
dataset = pd.read_csv("BankNote_Authentication.csv")
dataset.head()

In [None]:
# ============================================
# PREPARE FEATURES AND TARGET
# ============================================
# X = features (first 4 columns, indices 0-3): variance, skewness, kurtosis, entropy
# Y = target (fifth column, index 4): class label (0 or 1)
X = dataset.iloc[:, :4]   # All rows, columns 0-3
Y = dataset.iloc[:, 4]    # All rows, column 4 (target)

print(f"Features shape: {X.shape}")
print(f"Target shape: {Y.shape}")

In [None]:
# ============================================
# TRAIN/TEST SPLIT
# ============================================
# Split data: 90% training, 10% testing
# random_state=0 ensures reproducible results
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, Y, 
    test_size=0.1,      # 10% for testing
    random_state=0      # Seed for reproducibility
)

print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")
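
With only 10% of the rows held out, the class balance of the test set can drift from that of the full dataset. The sketch below shows how the `stratify` argument of `train_test_split` keeps the proportions aligned. It runs on hypothetical synthetic data (`X_demo`, `y_demo` are stand-ins, not the banknote file), so the numbers say nothing about the real dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 1000 samples, 4 features, 30% positive class.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 4))
y_demo = np.array([1] * 300 + [0] * 700)

# stratify=y_demo asks the splitter to keep the 30/70 ratio in both partitions.
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=0, stratify=y_demo
)

print(f"Positive share in train: {yd_train.mean():.3f}")
print(f"Positive share in test:  {yd_test.mean():.3f}")
```

In the notebook itself, the equivalent change would be to pass `stratify=Y` to the existing call.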

In [None]:
# ============================================
# 1. LOGISTIC REGRESSION
# ============================================
# Despite the name, it's a CLASSIFICATION algorithm
# Applies the sigmoid function to a linear score to output a probability between 0 and 1
# Good baseline, fast, works well when the classes are close to linearly separable
from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression()
classifier.fit(X_train, y_train)           # Train
predictions = classifier.predict(X_test)   # Predict

print("LOGISTIC REGRESSION")
print(classification_report(y_test, predictions))
print(confusion_matrix(y_test, predictions))
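
To make the sigmoid step concrete, the following sketch applies the sigmoid by hand to the linear scores from `decision_function` and compares the result with `predict_proba`. It assumes a binary problem with default `LogisticRegression` settings and uses synthetic data from `make_classification` as a stand-in for the banknote features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the banknote features (4 numeric columns).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
model = LogisticRegression().fit(X_demo, y_demo)

# decision_function returns the linear score w.x + b for each sample.
scores = model.decision_function(X_demo)

# Applying the sigmoid by hand should reproduce the probability of class 1.
manual_proba = 1.0 / (1.0 + np.exp(-scores))
sklearn_proba = model.predict_proba(X_demo)[:, 1]

print(np.allclose(manual_proba, sklearn_proba))
```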

In [None]:
# ============================================
# 2. SUPPORT VECTOR MACHINE (SVM)
# ============================================
# Finds optimal hyperplane to separate classes
# Effective in high-dimensional spaces
# Can use kernel trick for non-linear boundaries
from sklearn.svm import SVC

classifier = SVC()                         # Default: RBF kernel
classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)

print("SUPPORT VECTOR MACHINE")
print(classification_report(y_test, predictions))
print(confusion_matrix(y_test, predictions))
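
The effect of the kernel trick is easiest to see on data that no straight line can separate. This is a minimal sketch on a synthetic two-ring dataset (`make_circles`), not on the banknote data; it compares a linear kernel with the default RBF kernel.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings: the classes are not linearly separable.
X_demo, y_demo = make_circles(n_samples=400, noise=0.05, factor=0.5, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# Same estimator, different kernels; score() returns test accuracy.
linear_acc = SVC(kernel="linear").fit(Xd_train, yd_train).score(Xd_test, yd_test)
rbf_acc = SVC(kernel="rbf").fit(Xd_train, yd_train).score(Xd_test, yd_test)

print(f"Linear kernel accuracy: {linear_acc:.2f}")
print(f"RBF kernel accuracy:    {rbf_acc:.2f}")
```

On this kind of data the RBF kernel should clearly outperform the linear one, since it can draw a closed boundary around the inner ring.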

In [None]:
# ============================================
# 3. DECISION TREE
# ============================================
# Creates a tree of if-then rules
# Easy to interpret and visualize
# Prone to overfitting without pruning
from sklearn.tree import DecisionTreeClassifier

classifier = DecisionTreeClassifier(random_state=0)   # Fixed seed for reproducible results
classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)

print("DECISION TREE")
print(classification_report(y_test, predictions))
print(confusion_matrix(y_test, predictions))
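
The if-then rules a tree learns can be printed with `export_text`, and limiting `max_depth` is one simple way to curb overfitting. The sketch below assumes synthetic data from `make_classification`; the feature names are borrowed from the banknote columns purely as labels.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data with 4 numeric features.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
feature_names = ["variance", "skewness", "kurtosis", "entropy"]  # labels only

# max_depth=2 is a form of pre-pruning: at most two questions per path.
shallow_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_demo, y_demo)
# Without a depth limit the tree keeps splitting until the leaves are pure.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

rules = export_text(shallow_tree, feature_names=feature_names)
print(rules)
print(f"Depth of unrestricted tree: {full_tree.get_depth()}")
```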

In [None]:
# ============================================
# 4. RANDOM FOREST
# ============================================
# Ensemble of many decision trees
# Each tree is trained on a bootstrap sample of the rows and considers a random subset of features at each split
# Averaging the trees reduces overfitting and is more robust than a single tree
from sklearn.ensemble import RandomForestClassifier

classifier = RandomForestClassifier(random_state=0)   # Fixed seed for reproducible results
classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)

print("RANDOM FOREST")
print(classification_report(y_test, predictions))
print(confusion_matrix(y_test, predictions))
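
A fitted forest exposes its individual trees and an aggregated feature-importance score. The following sketch, again on synthetic stand-in data, shows both. The values `n_estimators=50` and `max_features="sqrt"` are illustrative choices, not settings taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with 4 numeric features.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
feature_names = ["variance", "skewness", "kurtosis", "entropy"]  # labels only

# n_estimators sets the number of trees; max_features limits the
# number of features examined at each split.
forest = RandomForestClassifier(n_estimators=50, max_features="sqrt", random_state=0)
forest.fit(X_demo, y_demo)

print(f"Number of trees: {len(forest.estimators_)}")
for name, importance in zip(feature_names, forest.feature_importances_):
    print(f"{name:>9}: {importance:.3f}")
```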

In [None]:
# ============================================
# 5. K-NEAREST NEIGHBORS (KNN)
# ============================================
# Classifies each sample by majority vote among its K closest training examples
# No real training phase: fit() mainly stores the training data (lazy learner)
# Simple, but prediction can be slow on large datasets
from sklearn.neighbors import KNeighborsClassifier

classifier = KNeighborsClassifier()        # Default: k=5
classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)

print("K-NEAREST NEIGHBORS")
print(classification_report(y_test, predictions))
print(confusion_matrix(y_test, predictions))
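
The voting rule behind KNN fits in a few lines. The sketch below writes it out by hand (the `knn_predict` helper is hypothetical, written for illustration) and compares it with `KNeighborsClassifier` on synthetic data. It assumes Euclidean distance, uniform weights, and two classes, so an odd `k` cannot produce a tied vote.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data; make_classification shuffles rows by default.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
Xd_train, yd_train = X_demo[:150], y_demo[:150]
Xd_test = X_demo[150:]

def knn_predict(query, k=5):
    """Majority vote among the k training points closest to `query`."""
    distances = np.linalg.norm(Xd_train - query, axis=1)  # Euclidean distances
    nearest = np.argsort(distances)[:k]                   # indices of the k closest
    votes = np.bincount(yd_train[nearest], minlength=2)   # votes for labels 0 and 1
    return int(np.argmax(votes))

manual = np.array([knn_predict(row) for row in Xd_test])
library = KNeighborsClassifier(n_neighbors=5).fit(Xd_train, yd_train).predict(Xd_test)

print(f"Agreement with scikit-learn: {(manual == library).mean():.2%}")
```

The loop over every test row makes the cost visible: each prediction computes a distance to every training sample, which is why KNN slows down as the training set grows.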

In [None]:
# ============================================
# 6. NAIVE BAYES
# ============================================
# Based on Bayes' theorem
# Assumes features are conditionally independent given the class ("naive")
# Fast, works well with high-dimensional data
# Other variants (e.g. MultinomialNB) are popular for text classification
from sklearn.naive_bayes import GaussianNB

classifier = GaussianNB()                  # Gaussian for continuous features
classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)

print("NAIVE BAYES")
print(classification_report(y_test, predictions))
print(confusion_matrix(y_test, predictions))
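
Since every classifier above follows the same fit/predict pattern, the comparison can also be written as a single loop. This is a sketch under the assumption that one headline number (accuracy) per model is enough; it uses synthetic data, so its ranking does not reflect results on the banknote dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, split 90/10 like the notebook.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(),
    "SVM": SVC(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "KNN": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB(),
}

# Fit each model on the same split and record its test accuracy.
results = {}
for name, model in models.items():
    model.fit(Xd_train, yd_train)
    results[name] = accuracy_score(yd_test, model.predict(Xd_test))

# Print the models from best to worst accuracy.
for name, acc in sorted(results.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<20} {acc:.3f}")
```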

## Understanding the Metrics

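**Precision**: Of all samples predicted positive, how many are actually positive? Computed as TP / (TP + FP).
- High precision = few false positives

**Recall**: Of all actual positives, how many did the model correctly identify? Computed as TP / (TP + FN).
- High recall = few false negatives

**F1-Score**: Harmonic mean of precision and recall, 2 × precision × recall / (precision + recall).
- High only when both precision and recall are high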
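
The three definitions above can be checked by hand on a tiny example. The labels below are made up for illustration; the sketch counts the outcomes directly and compares the results with scikit-learn's `precision_score`, `recall_score`, and `f1_score`.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up labels: 4 actual negatives followed by 6 actual positives.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 0])

tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # predicted 1, actually 1
fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # predicted 1, actually 0
fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # predicted 0, actually 1

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"By hand:      P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
print(
    f"scikit-learn: P={precision_score(y_true, y_pred):.3f} "
    f"R={recall_score(y_true, y_pred):.3f} F1={f1_score(y_true, y_pred):.3f}"
)
```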

**Confusion Matrix** (scikit-learn layout for labels 0 and 1; rows are actual classes, columns are predicted classes):
```
[[TN  FP]
 [FN  TP]]
```
- TN = True Negatives (correctly predicted 0)
- FP = False Positives (predicted 1, actually 0)
- FN = False Negatives (predicted 0, actually 1)
- TP = True Positives (correctly predicted 1)
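
For a binary problem, the four cells can be unpacked straight from the matrix with `ravel()`. A minimal sketch on made-up labels follows; passing `labels=[0, 1]` pins the row and column order explicitly.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 4 actual negatives followed by 6 actual positives.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

# ravel() flattens row by row, giving TN, FP, FN, TP in that order.
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```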