# Implementing SVMs and Kernel Methods Using scikit-learn and Custom Code

Support Vector Machines (SVMs) use kernel functions to implicitly map data into a higher-dimensional space where the classes become easier to separate. Each kernel has its own strengths and suits different scenarios. Here's a simple breakdown of the most common kernels:

The Linear Kernel is the simplest and works by drawing straight lines or flat planes to separate data. Imagine trying to divide different colored marbles on a table with a ruler: this is what the linear kernel does. It's fast and efficient for datasets where classes can be separated cleanly with straight boundaries, which is often the case for high-dimensional data such as text classification. However, it struggles with more complex, curved patterns.
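
As a minimal sketch of what "linear kernel" means numerically, the kernel matrix is just the table of pairwise dot products. The three points below are made up for illustration; `linear_kernel` is scikit-learn's helper from `sklearn.metrics.pairwise`.

```python
import numpy as np
from sklearn.metrics.pairwise import linear_kernel

# Three made-up 2-D points (illustrative only)
X = np.array([[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]])

# Linear kernel: K[i, j] = <x_i, x_j>
K_manual = X @ X.T
K_sklearn = linear_kernel(X, X)

print(K_manual)
print("Matches scikit-learn:", np.allclose(K_manual, K_sklearn))
```

Each entry of the matrix is large when two points lie in a similar direction from the origin, which is all the "similarity" a linear kernel can express.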

For more flexibility, the Polynomial Kernel creates curved decision boundaries using polynomial functions. Think of it like using bendy straws instead of straight rulers to separate marbles—it can twist and turn to fit moderately complex data. You can control its flexibility with the degree parameter: a higher degree allows more complex curves, but too high may cause overfitting. This kernel is useful when data requires gentle curves but isn’t overly intricate.
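
To make the role of the degree parameter concrete, here is a small sketch that evaluates the polynomial kernel `(gamma * <x, y> + coef0) ** degree` by hand and compares it with scikit-learn's `polynomial_kernel`. The points and the values of `gamma` and `coef0` are arbitrary choices for illustration, not tuned settings.

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel

# Made-up points and untuned, illustrative kernel parameters
X = np.array([[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]])
gamma, coef0 = 0.5, 1.0

poly_values = {}
for degree in (1, 2, 3):
    # Polynomial kernel computed directly from its formula
    K_manual = (gamma * (X @ X.T) + coef0) ** degree
    K_sklearn = polynomial_kernel(X, X, degree=degree, gamma=gamma, coef0=coef0)
    assert np.allclose(K_manual, K_sklearn)
    poly_values[degree] = K_manual[0, 0]
    print(f"degree={degree}: K[0, 0] = {K_manual[0, 0]:.3f}")
```

The same pair of points produces rapidly growing kernel values as the degree increases, which is one way to see why high degrees make the model more sensitive to individual points.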

The RBF (Radial Basis Function) Kernel, also called the Gaussian Kernel, is the most versatile and widely used. It creates smooth, flexible boundaries that can adapt to almost any shape, much like a lasso rope that can form perfect loops around groups of marbles. It works exceptionally well for highly complex datasets, such as image recognition or handwriting classification. The gamma parameter controls how tight or loose these boundaries are—small gamma gives broader curves, while large gamma fits closer to individual points.
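
The effect of gamma can be seen directly on a single pair of points. The sketch below (with two made-up points at squared distance 2) evaluates `exp(-gamma * ||a - b||^2)` through scikit-learn's `rbf_kernel` for a few illustrative gamma values.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

a = np.array([[0.0, 0.0]])
b = np.array([[1.0, 1.0]])  # squared Euclidean distance to a is 2

sims = {}
for gamma in (0.1, 1.0, 10.0):
    # RBF similarity between a and b for this gamma
    sims[gamma] = rbf_kernel(a, b, gamma=gamma)[0, 0]
    print(f"gamma={gamma:>4}: similarity = {sims[gamma]:.6f}")
```

With a small gamma the two points still look similar, so each training point influences a wide region. With a large gamma the similarity drops to almost zero, so the boundary is shaped by only the closest points.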

Lastly, the Sigmoid Kernel mimics the behavior of neural networks, acting like a stretchy rubber sheet to separate data. While interesting theoretically, it’s rarely used in practice today because actual neural networks tend to perform better for similar tasks. It’s mostly kept for historical reasons or specific edge cases.
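
One practical reason the sigmoid kernel can be awkward, offered here as a side note rather than something the text above claims, is that `tanh(gamma * <x, y> + coef0)` is not a valid (positive semi-definite) kernel for every parameter choice. The sketch below uses two made-up 1-D points and hand-picked parameters to show a kernel matrix with a negative eigenvalue.

```python
import numpy as np
from sklearn.metrics.pairwise import sigmoid_kernel

# Two made-up 1-D points and hand-picked parameters (illustrative only)
X = np.array([[1.0], [2.0]])
gamma, coef0 = 1.0, -1.0

# Sigmoid kernel computed from its formula and via scikit-learn
K_manual = np.tanh(gamma * (X @ X.T) + coef0)
K_sklearn = sigmoid_kernel(X, X, gamma=gamma, coef0=coef0)

# A valid kernel matrix would have only non-negative eigenvalues
eigvals = np.linalg.eigvalsh(K_manual)
print("Matches scikit-learn:", np.allclose(K_manual, K_sklearn))
print("Eigenvalues:", eigvals)
```

With other parameter values the matrix can be perfectly well behaved, so this is a caution about tuning rather than a reason to rule the kernel out.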

# Implementing Support Vector Machines (SVMs) with scikit-learn

1. Standard SVC (Classification)
The basic Support Vector Classifier (SVC) separates data using different kernel functions ('linear', 'poly', 'rbf', 'sigmoid') to transform input space. The polynomial kernel creates curved boundaries while RBF fits complex patterns. All kernels aim to maximize the margin between classes while allowing some misclassifications.

In [63]:
# Import necessary libraries from scikit-learn
from sklearn import datasets  # To load standard datasets
from sklearn.model_selection import train_test_split  # For splitting data into train/test sets
from sklearn.svm import SVC  # Support Vector Classifier
from sklearn.metrics import accuracy_score  # To evaluate model performance
from sklearn.preprocessing import StandardScaler  # For feature scaling
import numpy as np  # Numerical computing library

# Load the iris dataset
# The dataset contains 150 samples of iris flowers with 4 features each and labels representing 3 species of iris
iris = datasets.load_iris()
X = iris.data  # Feature matrix (150 samples × 4 features)
y = iris.target  # Target vector (150 labels)

# Split the dataset into training (70%) and testing (30%) sets
# random_state=42 ensures reproducible results
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize a StandardScaler to standardize features
# This transforms features to have mean=0 and std=1, which is important for SVM performance
scaler = StandardScaler()
# Fit the scaler on training data and transform training data
X_train = scaler.fit_transform(X_train)
# Transform test data using the same scaler (don't fit on test data!)
X_test = scaler.transform(X_test)

# Define the different kernel types we want to test
# Kernels determine how SVMs transform the input space
kernels = ['linear', 'poly', 'rbf', 'sigmoid']

# Test each kernel type
for kernel in kernels:
    # Create an SVM classifier with the current kernel type

    # Special case for polynomial kernel where we need to specify degree
    if kernel == 'poly':
        # degree=3 means we're using cubic polynomial features
        # gamma='scale' automatically sets gamma = 1/(n_features * X.var())
        svm = SVC(kernel=kernel, degree=3, gamma='scale')  # Polynomial kernel of degree 3
    else:
        # For other kernels, just specify kernel type and gamma
        svm = SVC(kernel=kernel, gamma='scale')

    # Train the SVM model on the training data
    svm.fit(X_train, y_train)

    # Make predictions on the test set
    y_pred = svm.predict(X_test)

    # Calculate accuracy by comparing predictions to true labels
    acc = accuracy_score(y_test, y_pred)

    # Print results with kernel name (left-aligned in 7 chars) and accuracy
    print(f"Kernel: {kernel:<7} | Accuracy: {acc:.4f}")

Kernel: linear  | Accuracy: 0.9778
Kernel: poly    | Accuracy: 0.9556
Kernel: rbf     | Accuracy: 1.0000
Kernel: sigmoid | Accuracy: 0.8889
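
The trade-off between a wide margin and misclassifications is governed by SVC's `C` parameter, which the cell above leaves at its default of 1.0. As a rough sketch (the three `C` values are arbitrary and the variable names are local to this example), the following shows how the number of support vectors changes with `C` on the same scaled Iris split.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit the scaler on the training split only, then apply it to both splits
scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

sv_counts = {}
for C in (0.01, 1.0, 100.0):  # illustrative values, not tuned
    clf = SVC(kernel='rbf', C=C, gamma='scale').fit(X_tr, y_tr)
    sv_counts[C] = int(clf.n_support_.sum())  # n_support_ holds the count per class
    print(f"C={C:<6} | support vectors: {sv_counts[C]:3d} | test accuracy: {clf.score(X_te, y_te):.4f}")
```

A small `C` tolerates many margin violations, so most training points end up as support vectors; a large `C` penalizes violations heavily and typically leaves far fewer.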


2. LinearSVC
An optimized implementation for linear classification that is typically faster than SVC with a linear kernel, especially on large datasets. LinearSVC uses squared hinge loss by default; the example below sets loss='hinge' explicitly, which is the loss that SVC uses. Unlike regular SVC, its training time scales roughly linearly with the number of samples, but it can only create straight decision boundaries.

In [64]:
from sklearn.svm import LinearSVC

# Same data prep as above
linear_svc = LinearSVC(C=1.0, loss='hinge', max_iter=10000, random_state=42)
linear_svc.fit(X_train, y_train)
print(f"LinearSVC accuracy: {accuracy_score(y_test, linear_svc.predict(X_test)):.4f}")

LinearSVC accuracy: 0.9111
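
To see what "straight decision boundaries" means in terms of the fitted model, the sketch below inspects `coef_` and `intercept_` and rebuilds the decision scores by hand. It fits on the full scaled Iris data with the default squared hinge loss purely for illustration, so its numbers are not comparable to the accuracy printed above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)  # scaled on all data: illustration only

clf = LinearSVC(C=1.0, max_iter=10000, random_state=42).fit(X, y)

# One weight vector and one intercept per class (one-vs-rest)
print(clf.coef_.shape, clf.intercept_.shape)

# Each class score is a linear function of the features: w . x + b
scores_manual = X @ clf.coef_.T + clf.intercept_
pred_manual = clf.classes_[np.argmax(scores_manual, axis=1)]
print("Scores match:", np.allclose(scores_manual, clf.decision_function(X)))
print("Predictions match:", np.array_equal(pred_manual, clf.predict(X)))
```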


3. Nu-SVC
A variant where the 'nu' parameter (between 0 and 1) replaces C. It is often easier to interpret: nu is an upper bound on the fraction of margin errors and a lower bound on the fraction of support vectors. For example, nu=0.1 means at most 10% of training points can be misclassified or sit inside the margin, and at least 10% of them become support vectors.

In [65]:
from sklearn.svm import NuSVC

nu_svc = NuSVC(kernel='rbf', nu=0.1, gamma='scale', random_state=42)
nu_svc.fit(X_train, y_train)
print(f"Nu-SVC accuracy: {accuracy_score(y_test, nu_svc.predict(X_test)):.4f}")

Nu-SVC accuracy: 0.9778
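
The link between nu and the number of support vectors can be checked empirically. This is a small sketch on a two-class subset of Iris (versicolor vs. virginica), scaled on all samples for simplicity; the nu values are arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.svm import NuSVC

X, y = load_iris(return_X_y=True)
mask = y != 0  # keep versicolor (1) and virginica (2) only
X, y = StandardScaler().fit_transform(X[mask]), y[mask]

fractions = {}
for nu in (0.1, 0.3, 0.5):  # illustrative values
    clf = NuSVC(kernel='rbf', nu=nu, gamma='scale').fit(X, y)
    # support_ holds the indices of the support vectors in the training set
    fractions[nu] = clf.support_.size / len(X)
    print(f"nu={nu}: fraction of support vectors = {fractions[nu]:.2f}")
```

In each case the fraction of support vectors should be at least nu, which is the lower-bound side of the parameter's interpretation.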


4. SVR (Regression)
The regression version uses an epsilon-insensitive tube (ε): errors smaller than ε aren't penalized. It still uses kernels like RBF but predicts continuous values instead of classes. The C parameter balances the flatness of the fitted function against how strongly points outside the tube are penalized, while epsilon controls the tube's width.


In [71]:
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, r2_score  # r2_score is used below

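# Regression target: sepal length (column 0), predicted from the other three features
y_reg = iris.data[:, 0]
# Note: no random_state is set here, so the split and the scores below vary between runs
X_train, X_test, y_train, y_test = train_test_split(iris.data[:, 1:], y_reg, test_size=0.3)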

# Scale
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

svr = SVR(kernel='rbf', C=1.0, epsilon=0.1)
svr.fit(X_train, y_train)
print(f"SVR MSE: {mean_squared_error(y_test, svr.predict(X_test)):.4f}")
print(f"SVR R-squared: {r2_score(y_test, svr.predict(X_test)):.4f}") # Use R-squared for regression



SVR MSE: 0.1046
SVR R-squared: 0.8386
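
To illustrate the tube, here is a minimal sketch of the epsilon-insensitive loss together with a quick experiment on synthetic data. The helper name `epsilon_insensitive_loss`, the noisy sine data, and the epsilon values are all assumptions made for this example.

```python
import numpy as np
from sklearn.svm import SVR

def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Zero inside the tube of half-width epsilon, linear outside it."""
    return np.maximum(0.0, np.abs(y_true - y_pred) - epsilon)

# Errors of 0.05, 0.3 and 0.5 with epsilon = 0.1
demo_loss = epsilon_insensitive_loss(
    np.array([1.0, 1.0, 1.0]), np.array([1.05, 1.3, 0.5]), epsilon=0.1
)
print(demo_loss)  # approximately [0, 0.2, 0.4]

# Synthetic noisy sine curve (illustrative data, fixed seed)
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(80, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=80)

sv_counts = {}
for eps in (0.01, 0.1, 0.5):
    model = SVR(kernel='rbf', C=1.0, epsilon=eps).fit(X, y)
    sv_counts[eps] = len(model.support_)  # points on or outside the tube
    print(f"epsilon={eps}: {sv_counts[eps]} support vectors out of {len(X)}")
```

A wider tube swallows more points, so fewer of them remain as support vectors and the fitted function becomes simpler.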


5. Custom Kernel SVM
SVC also accepts your own kernel function for specialized similarity measures; the example below wraps scikit-learn's rbf_kernel with a fixed gamma. The kernel function takes two data matrices and returns their similarity (Gram) matrix. This flexibility enables domain-specific kernels for text, graphs, or other complex data structures.

In [72]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.metrics.pairwise import rbf_kernel
import numpy as np

# 1. Load classification data (iris dataset)
iris = load_iris()
X, y = iris.data, iris.target  # y contains discrete class labels (0, 1, 2)

# 2. Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 3. Scale features (critical for SVM)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# 4. Define custom RBF kernel function
def custom_kernel(X, Y):
    return rbf_kernel(X, Y, gamma=0.5)  # gamma=0.5 is our custom parameter

# 5. Create and train SVM with custom kernel
custom_svm = SVC(kernel=custom_kernel)
custom_svm.fit(X_train, y_train)  # Now using proper classification labels

# 6. Evaluate
y_pred = custom_svm.predict(X_test)
print(f"Custom kernel accuracy: {accuracy_score(y_test, y_pred):.4f}")

Custom kernel accuracy: 1.0000
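
A related option, sketched here as an alternative rather than as part of the example above, is `kernel='precomputed'`: you compute the Gram matrices yourself and pass them to SVC instead of the raw features. The variable names are local to this sketch.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)
scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

def custom_kernel(A, B):
    # Same fixed-gamma RBF kernel as in the cell above
    return rbf_kernel(A, B, gamma=0.5)

# Option 1: pass the kernel as a callable
svm_callable = SVC(kernel=custom_kernel).fit(X_tr, y_tr)

# Option 2: pass precomputed Gram matrices
K_train = custom_kernel(X_tr, X_tr)  # shape (n_train, n_train)
K_test = custom_kernel(X_te, X_tr)   # shape (n_test, n_train)
svm_precomputed = SVC(kernel='precomputed').fit(K_train, y_tr)

pred_callable = svm_callable.predict(X_te)
pred_precomputed = svm_precomputed.predict(K_test)
print("Same predictions:", np.array_equal(pred_callable, pred_precomputed))
```

The precomputed route is handy when the similarity is expensive to compute or is defined on objects that are not feature vectors, since the Gram matrix can be built once and reused.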



# Custom SVM Implementation with Different Kernels

This code implements a Support Vector Machine (SVM) from scratch with three different kernel functions:
1. Linear Kernel
2. Polynomial Kernel
3. RBF (Gaussian) Kernel

We'll test these on the Iris dataset for binary classification.

In [60]:
# Import required libraries
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

### Custom SVM Class Implementation

The following class implements our custom SVM with:
- Three kernel functions (linear, polynomial, RBF):

  The kernel functions compute the similarity between two input vectors.
  Different kernels allow the SVM to learn different types of decision boundaries.
- Sequential Minimal Optimization (SMO) training algorithm:

  The fit method trains the SVM using SMO, which optimizes the Lagrange multipliers (α) in pairs.
- Decision-function and prediction methods:

  These methods make predictions using the trained SVM model.
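
For reference, the quantities that `fit` manipulates can be written out as follows. This is the standard soft-margin dual problem and the simplified SMO pair update, stated in the notation the class below uses ($E_i$, $\eta$, $L$, $H$); treat it as a sketch to read alongside the code rather than a full derivation.

$$
\max_{\alpha}\ \sum_{i}\alpha_i - \frac{1}{2}\sum_{i}\sum_{j}\alpha_i\alpha_j y_i y_j K(x_i, x_j)
\quad\text{subject to}\quad 0 \le \alpha_i \le C,\quad \sum_{i}\alpha_i y_i = 0
$$

$$
f(x) = \sum_i \alpha_i y_i K(x_i, x) + b, \qquad E_i = f(x_i) - y_i, \qquad \eta = 2K_{ij} - K_{ii} - K_{jj}
$$

$$
\alpha_j \leftarrow \operatorname{clip}\!\left(\alpha_j - \frac{y_j\,(E_i - E_j)}{\eta},\ L,\ H\right), \qquad
\alpha_i \leftarrow \alpha_i + y_i y_j \left(\alpha_j^{\text{old}} - \alpha_j\right)
$$

Updating the multipliers in pairs is what keeps the equality constraint $\sum_i \alpha_i y_i = 0$ satisfied after every step, and the bounds $L$ and $H$ keep both multipliers inside $[0, C]$.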

In [61]:
class CustomSVM:
    def __init__(self, kernel='linear', C=1.0, gamma=1.0, degree=3, coef0=1.0):
        """
        Initialize the Support Vector Machine classifier with specified parameters.

        Parameters:
        kernel : str
            Type of kernel function ('linear', 'poly', 'rbf')
        C : float
            Regularization parameter (smaller values allow more margin violations)
        gamma : float
            Kernel coefficient for 'rbf' and 'poly' kernels
        degree : int
            Degree of polynomial kernel
        coef0 : float
            Independent term in polynomial kernel
        """
        # Store kernel type and parameters as instance variables
        self.kernel = kernel  # Type of kernel to use
        self.C = C            # Regularization parameter
        self.gamma = gamma    # Kernel coefficient
        self.degree = degree  # Polynomial degree
        self.coef0 = coef0    # Polynomial independent term

        # Initialize variables that will be set during training
        self.alpha = None     # Lagrange multipliers (dual coefficients)
        self.b = 0            # Bias term (intercept)
        self.X_train = None   # Training data features
        self.y_train = None   # Training data labels
        self.support_vectors = None  # Support vectors after training

    def _kernel(self, X1, X2):
        """
        Compute the kernel matrix between two sets of samples.

        Parameters:
        X1 : array-like, shape (n_samples1, n_features)
            First set of samples
        X2 : array-like, shape (n_samples2, n_features)
            Second set of samples

        Returns:
        K : array, shape (n_samples1, n_samples2)
            Kernel matrix
        """
        if self.kernel == 'linear':
            # Linear kernel: simple dot product between samples
            return X1 @ X2.T  # Matrix multiplication

        elif self.kernel == 'poly':
            # Polynomial kernel: (gamma*<x1,x2> + coef0)^degree
            return (self.gamma * (X1 @ X2.T) + self.coef0) ** self.degree

        elif self.kernel == 'rbf':
            # RBF (Gaussian) kernel: exp(-gamma * ||x1-x2||^2)
            # Compute squared Euclidean distances efficiently
            dists = np.sum(X1**2, axis=1)[:, np.newaxis] + np.sum(X2**2, axis=1) - 2 * X1 @ X2.T
            return np.exp(-self.gamma * dists)

        else:
            raise ValueError("Unknown kernel type")

    def fit(self, X, y, max_iter=1000, tol=1e-3):
        """
        Train the SVM model using Sequential Minimal Optimization (SMO) algorithm.

        Parameters:
        X : array-like, shape (n_samples, n_features)
            Training vectors
        y : array-like, shape (n_samples,)
            Target values (will be converted to -1/1)
        max_iter : int
            Maximum number of iterations
        tol : float
            Tolerance for stopping criterion
        """
        # Convert labels to -1/+1 (binary classification only):
        # labels <= 0 (e.g. 0 or -1) become -1, all other labels become +1
        y = np.where(np.asarray(y) <= 0, -1, 1)  # np.where returns a new array

        # Store training data and shape information
        n_samples = X.shape[0]  # Number of training samples
        self.X_train = X        # Store training features
        self.y_train = y        # Store training labels

        # Initialize Lagrange multipliers (alpha) to zeros
        self.alpha = np.zeros(n_samples)
        # Initialize bias term to zero
        self.b = 0.0

        # Precompute the kernel matrix (Gram matrix) for all training samples
        K = self._kernel(X, X)

        # Sequential Minimal Optimization (SMO) algorithm
        for _ in range(max_iter):
            num_changed = 0  # Track number of alpha pairs changed

            for i in range(n_samples):  # Iterate through all samples
                # Calculate prediction error for sample i
                Ei = (self.alpha * y) @ K[:, i] + self.b - y[i]

                # Check KKT conditions (violation means we should optimize this alpha_i)
                if ((y[i] * Ei < -tol) and (self.alpha[i] < self.C)) or \
                   ((y[i] * Ei > tol) and (self.alpha[i] > 0)):

                    # Randomly select a different sample j to optimize together
                    j = i
                    while j == i:  # Ensure j is different from i
                        j = np.random.randint(0, n_samples)

                    # Calculate error for sample j
                    Ej = (self.alpha * y) @ K[:, j] + self.b - y[j]

                    # Save old alpha values before updating
                    alpha_i_old, alpha_j_old = self.alpha[i], self.alpha[j]

                    # Compute L and H bounds for alpha_j
                    if y[i] != y[j]:
                        L = max(0, self.alpha[j] - self.alpha[i])
                        H = min(self.C, self.C + self.alpha[j] - self.alpha[i])
                    else:
                        L = max(0, self.alpha[i] + self.alpha[j] - self.C)
                        H = min(self.C, self.alpha[i] + self.alpha[j])

                    if L == H:  # Skip if bounds are equal
                        continue

                    # Compute eta (second derivative of objective function)
                    eta = 2 * K[i, j] - K[i, i] - K[j, j]
                    if eta >= 0:  # Skip if not making progress
                        continue

                    # Compute the new alpha_j and clip it to stay within [L, H]
                    alpha_j_new = alpha_j_old - y[j] * (Ei - Ej) / eta
                    alpha_j_new = np.clip(alpha_j_new, L, H)

                    # Skip if the change is too small (alpha_j is left untouched,
                    # so the constraint sum(alpha * y) = 0 keeps holding)
                    if abs(alpha_j_new - alpha_j_old) < 1e-5:
                        continue

                    # Commit alpha_j, then update alpha_i by the same amount in the opposite direction
                    self.alpha[j] = alpha_j_new
                    self.alpha[i] = alpha_i_old + y[i] * y[j] * (alpha_j_old - self.alpha[j])

                    # Update bias term b using both samples
                    b1 = self.b - Ei - y[i] * (self.alpha[i] - alpha_i_old) * K[i, i] - \
                         y[j] * (self.alpha[j] - alpha_j_old) * K[i, j]
                    b2 = self.b - Ej - y[i] * (self.alpha[i] - alpha_i_old) * K[i, j] - \
                         y[j] * (self.alpha[j] - alpha_j_old) * K[j, j]

                    # Choose b based on where alpha_i and alpha_j are in their bounds
                    if 0 < self.alpha[i] < self.C:
                        self.b = b1
                    elif 0 < self.alpha[j] < self.C:
                        self.b = b2
                    else:
                        self.b = (b1 + b2) / 2

                    num_changed += 1  # Increment changed counter

            # If no alphas were changed in this iteration, we've converged
            if num_changed == 0:
                break

        # After optimization, identify support vectors (alpha > 0)
        sv_indices = self.alpha > 1e-5  # Small threshold to account for numerical precision
        self.support_vectors = X[sv_indices]  # Store support vectors
        self.alpha = self.alpha[sv_indices]   # Keep only alphas for support vectors
        self.y_train = y[sv_indices]          # Keep only labels for support vectors

    def decision_function(self, X):
        """
        Compute the decision value f(x) for each sample.

        The sign of f(x) gives the predicted class; its magnitude is an
        unnormalized (not divided by ||w||) distance to the decision boundary.

        Parameters:
        X : array-like, shape (n_samples, n_features)
            Input samples

        Returns:
        array, shape (n_samples,)
            Decision function values
        """
        # Compute kernel matrix between input and support vectors
        K = self._kernel(X, self.support_vectors)
        # Calculate decision values using support vector alphas and labels
        return (self.alpha * self.y_train) @ K.T + self.b

    def predict(self, X):
        """
        Predict class labels for samples in X.

        Parameters:
        X : array-like, shape (n_samples, n_features)
            Input samples

        Returns:
        array, shape (n_samples,)
            Predicted class labels (-1 or 1)
        """
        # Convert decision function values to class labels
        return np.sign(self.decision_function(X))

## Testing on Iris Dataset

Now we'll test our custom SVM with different kernels on the Iris dataset.

In [62]:
# Load the Iris dataset from scikit-learn
# The Iris dataset contains 150 samples with 4 features each (sepal length/width, petal length/width)
# and 3 classes of iris flowers (setosa=0, versicolor=1, virginica=2)
iris = load_iris()
X, y = iris.data, iris.target  # X contains features, y contains class labels

# Prepare for binary classification by:
# 1. Removing class 2 (virginica) to create a 2-class problem
# 2. Converting remaining classes (0 and 1) to -1 and 1 respectively
# This creates a binary classification problem between setosa (-1) and versicolor (1)
X = X[y != 2]  # Keep only samples where class is not 2
y = y[y != 2]  # Similarly filter the labels
y = np.where(y == 0, -1, 1)  # Convert class 0 to -1, class 1 to 1

# Split the dataset into training and testing sets:
# - 70% for training (X_train, y_train)
# - 30% for testing (X_test, y_test)
# random_state=42 ensures reproducible results across runs
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Feature scaling is crucial for SVMs because:
# 1. Features on different scales can dominate the kernel calculations
# 2. It helps the optimization converge faster
scaler = StandardScaler()  # Creates a scaler that standardizes features (mean=0, std=1)
X_train = scaler.fit_transform(X_train)  # Fit scaler to training data and transform
X_test = scaler.transform(X_test)  # Transform test data using same scaling parameters

# Define the kernels to test:
# 1. Linear kernel (no additional parameters needed)
# 2. Polynomial kernel with degree=3 and gamma=0.1
# 3. RBF kernel with gamma=0.1
kernels = [
    ('linear', {}),  # Simple linear kernel
    ('poly', {'degree': 3, 'gamma': 0.1}),  # 3rd degree polynomial kernel
    ('rbf', {'gamma': 0.1}),  # Radial Basis Function (Gaussian) kernel
]

print("Improved Iris Dataset Classification Results:")
print("-----------------------------------")

# Test each kernel configuration
for kernel, params in kernels:
    # Initialize SVM with current kernel and parameters
    # C=1.0 is the regularization parameter (balance between margin and classification error)
    svm = CustomSVM(kernel=kernel, C=1.0, **params)

    # Train the SVM model on the training data
    # This runs the SMO algorithm to find the optimal decision boundary
    svm.fit(X_train, y_train)

    # Make predictions on the test set
    y_pred = svm.predict(X_test)

    # Calculate accuracy by comparing predictions to true labels
    acc = accuracy_score(y_test, y_pred)

    # Print results for this kernel:
    print(f"{kernel.upper()} Kernel:")
    print(f"  Accuracy: {acc:.4f}")  # Classification accuracy on test set
    print(f"  Number of support vectors: {len(svm.support_vectors)}")  # Points defining the boundary
    print(f"  Bias term: {svm.b:.4f}")  # Offset of the decision boundary
    print("-----------------------------------")

Improved Iris Dataset Classification Results:
-----------------------------------
LINEAR Kernel:
  Accuracy: 1.0000
  Number of support vectors: 8
  Bias term: 0.9467
-----------------------------------
POLY Kernel:
  Accuracy: 1.0000
  Number of support vectors: 12
  Bias term: 0.0298
-----------------------------------
RBF Kernel:
  Accuracy: 1.0000
  Number of support vectors: 13
  Bias term: 0.1133
-----------------------------------
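
As a cross-check on the formula used in `decision_function` above, the same sum over support vectors can be rebuilt from a fitted scikit-learn `SVC`. This sketch fits on the full scaled setosa/versicolor subset, so it illustrates the formula rather than reproducing the numbers printed above; `gamma` and `C` mirror the RBF configuration used there.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X, y = X[y != 2], y[y != 2]            # setosa vs. versicolor
X = StandardScaler().fit_transform(X)  # scaled on all data: illustration only

gamma, C = 0.1, 1.0
clf = SVC(kernel='rbf', C=C, gamma=gamma).fit(X, y)

# dual_coef_ stores alpha_i * y_i for the support vectors only
signed_alpha = clf.dual_coef_[0]
K = rbf_kernel(clf.support_vectors_, X, gamma=gamma)  # shape (n_sv, n_samples)

# f(x) = sum_i alpha_i * y_i * K(x_i, x) + b
f_manual = signed_alpha @ K + clf.intercept_[0]

print("Decision values match:", np.allclose(f_manual, clf.decision_function(X)))
print("sum(alpha * y) is ~0:", abs(signed_alpha.sum()) < 1e-6)
print("All alphas within [0, C]:", bool(np.all(np.abs(signed_alpha) <= C + 1e-9)))
```

The last two checks correspond to the constraints that the SMO loop maintains: the signed multipliers sum to zero and each multiplier stays between 0 and C.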
