## Assignment 2: Support Vector Machine (SVM) Analysis
**Course:** Machine Learning (CS60050)

---
### Name: Utkarsh Sathawane
### Roll Number: 25CS60R75
### Section: B

## 1. Setup and Installation
This section includes the necessary library imports for the assignment.

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.svm import SVC

!pip install ucimlrepo
from ucimlrepo import fetch_ucirepo

RANDOM_SEED = 42

## 2. Dataset Assignment
Banknote Authentication Dataset



In [None]:
import pandas as pd


col = ['variance', 'skewness', 'curtosis', 'entropy', 'class']
banknote_df = pd.read_csv('BankNote_Authentication.csv', header=0, names=col)
X_banknote = banknote_df.drop('class', axis=1)
y_banknote = banknote_df['class']
print("--- Banknote Authentication Dataset (BankNote_Authentication.csv) Loaded Successfully ---")
print(X_banknote.head())


In [None]:
x = X_banknote.copy()
y = y_banknote.copy()
print("Banknote Authentication dataset selected.")
print("\nShape of features (X):", x.shape)
print("Shape of target (y):", y.shape)

# Part 1: Data Analysis and Preprocessing (20 Points)

### 1.1 Exploratory Data Analysis

In [None]:
# Collect and print the feature names
allfeature = list(x.columns)
for name in allfeature:
    print(name)

* Overview: The dataset consists of 1372 samples and 5 columns: variance, skewness, curtosis, entropy, and the target class.
* Class Distribution: The data is slightly imbalanced, with 55.5% of samples belonging to class 0 (genuine) and 44.5% to class 1 (forged).
* Data Quality:
    * Missing Values: 0 missing values were found.
    * Duplicates: 24 duplicate rows were identified.
    * Outliers: 36 outlying feature values were detected using the Z-score method (|z| > 3), counted per (row, feature) pair.
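
How outliers are tallied matters: `np.where(z > t)` yields one entry per (row, feature) pair, whereas cleaning removes whole rows. The sketch below uses synthetic data (the array `X` and its planted extreme values are made up for illustration and are not the banknote data) to show that the two counts can differ.

```python
import numpy as np
from scipy.stats import zscore

# Synthetic data: 200 rows, 4 features, all values in [-1, 1].
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 4))

# Plant extreme values: row 0 is extreme in two features, row 1 in one.
X[0, 0] = 10.0
X[0, 1] = 10.0
X[1, 2] = -12.0

z = np.abs(zscore(X, axis=0))                     # column-wise |z|
n_outlier_values = int((z > 3).sum())             # counts (row, feature) pairs
n_outlier_rows = int((z > 3).any(axis=1).sum())   # counts rows with any outlier

print("outlying values:", n_outlier_values)
print("outlying rows:  ", n_outlier_rows)
```

Because one row can hold several outlying values, the number of rows removed during cleaning need not equal the number of outlying values reported here.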


In [None]:
# 1. Data Overview

print(f"Dataset Shape: {banknote_df.shape}")
print()
print("Data Types and Info:")
print(banknote_df.info())
print()
print("Class Distribution:")
print(banknote_df['class'].value_counts(normalize=True))


In [None]:
# 2. Statistical Summary
def statistical_summary(df1):
    print("Descriptive Statistics:")
    print(df1.describe())
    print()
    print("Generating Correlation Heatmap...")
    plt.figure(figsize=(10,7))
    c=df1.corr()
    sns.heatmap(c,annot=True,cmap='coolwarm',fmt='.2f',linewidths=0.5)
    plt.title("Correlation Heatmap")
    plt.show()
statistical_summary(banknote_df)

In [None]:
# 3. Data Quality
from scipy.stats import zscore

def miss_dup(df):
    print("Missing Values per Column:")
    print(df.isnull().sum())
    print()
    print(f"Number of Duplicate Rows: {df.duplicated().sum()}")

def findoutlier(df):
    f=['variance','skewness','curtosis','entropy']
    z=np.abs(zscore(df[f]))
    t=3
    o=np.where(z>t)
    print()
    print(f"Outliers (Z-score > {t}):")
    if o[0].size>0:
        l=list(zip(o[0],o[1]))
        print(f"Found {len(l)} outliers.")
        for r,c in l:
            n=f[c]
            v=df.iloc[r][n]
            print(f"  - Row {r}, Feature '{n}': Value = {v:.2f}")
    else:
        print("No outliers found with Z-score > 3.")

miss_dup(banknote_df)
findoutlier(banknote_df)


Feature Histograms  
Histograms for each feature showed their distributions. Variance and skewness appeared somewhat bimodal, hinting at different distributions for the two classes.
Pair Plot  
A pair plot, colored by class, was the most insightful visualization. It clearly showed that the classes are highly separable, particularly in the variance vs. skewness plot. This strong separability suggests that an SVM would perform very well.


In [None]:
def show_plot(df):
    print("Generating Histograms...")
    f=['variance','skewness','curtosis','entropy']
    df[f].hist(bins=30,figsize=(12,10),layout=(2,2))
    plt.suptitle("Histograms for Each Feature")
    plt.tight_layout(rect=[0,0.03,1,0.95])
    plt.show()
    print()
    print("Generating Pair Plot...")
    sns.pairplot(df,vars=f,hue='class',markers=["o","s"])
    plt.suptitle("Pair Plot of Features by Class",y=1.02)
    plt.show()

show_plot(banknote_df)


### 1.2 Preprocessing Pipeline

A multi-step pipeline was used to clean and prepare the data:

1.  Cleaning: The 24 duplicate rows were dropped, and every row with an absolute Z-score of 3 or more in any feature (recomputed after deduplication) was removed, resulting in a final clean dataset of 1314 samples.
2.  Feature Scaling: StandardScaler and MinMaxScaler were compared. StandardScaler was chosen, as SVMs (especially with RBF kernels) are sensitive to feature scales and typically perform better with zero-mean, unit-variance data.
3.  Data Splitting: The final dataset was split into a 75% training set and a 25% test set, using a stratified split to maintain the class ratio.
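
In the cells that follow, the scaler is fitted on the full cleaned dataset before the train/test split, so test-set statistics influence the scaling. One possible way to avoid this is to split first and wrap the scaler and classifier in a `Pipeline`. The sketch below is a hedged illustration on synthetic data; the `make_classification` data and the names `pipe`, `X_tr`, and `X_te` are assumptions, not the notebook's variables.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the cleaned banknote features (4 columns).
X, y = make_classification(n_samples=400, n_features=4, n_informative=3,
                           n_redundant=1, random_state=42)

# Split first, so the test rows never influence the scaler.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

pipe = Pipeline([
    ("scaler", StandardScaler()),                      # fitted on training rows only
    ("svc", SVC(kernel="rbf", C=1.0, gamma="scale")),
])
pipe.fit(X_tr, y_tr)

# The scaler's stored mean matches the training split, not the full data.
print(np.allclose(pipe.named_steps["scaler"].mean_, X_tr.mean(axis=0)))
print("test accuracy:", pipe.score(X_te, y_te))
```

Passing such a pipeline to `GridSearchCV` would also refit the scaler inside every cross-validation fold.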


In [None]:
# 1. Data Cleaning
from scipy.stats import zscore

#duplicates
df_cleaned=banknote_df.drop_duplicates()
print(f"Shape after dropping duplicates: {df_cleaned.shape}")

#outliers
f=['variance','skewness','curtosis','entropy']
z=np.abs(zscore(df_cleaned[f]))
m=(z<3).all(axis=1)
df_cleaned=df_cleaned[m]
print(f"Shape after removing outliers (Z-score > 3): {df_cleaned.shape}")

X_cleaned=df_cleaned.drop('class',axis=1)
y_cleaned=df_cleaned['class']

print()
print(f"Final X shape: {X_cleaned.shape}")
print(f"Final y shape: {y_cleaned.shape}")


In [None]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler

x1=X_cleaned.copy()

s1=StandardScaler()
x_std=s1.fit_transform(x1)
print("StandardScaler : ")
print(f"Min: {x_std.min():.2f}, Max: {x_std.max():.2f}")
print(f"Mean: {x_std.mean():.2f}, Std: {x_std.std():.2f}")

s2=MinMaxScaler()
x_minmax=s2.fit_transform(x1)
print()
print("MinMaxScaler : ")
print(f"Min: {x_minmax.min():.2f}, Max: {x_minmax.max():.2f}")
print(f"Mean: {x_minmax.mean():.2f}, Std: {x_minmax.std():.2f}")

print()
print("Justification : ")
print("StandardScaler (mean=0, std=1) is generally preferred for SVMs as it centers the data, which is important for")
print("distance-based algorithms and kernels like RBF. It is also less sensitive to remaining outliers than MinMaxScaler, which scales everything into a fixed [0, 1] range.")

X_scaled=x_std
y_scaled=y_cleaned


In [None]:
### 3. Data Splitting

from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test=train_test_split(
    X_scaled,y_scaled,
    test_size=0.25,
    random_state=RANDOM_SEED,
    stratify=y_scaled
)

print("Training set :")
print(y_train.value_counts(normalize=True))
print()
print("Test set :")
print(y_test.value_counts(normalize=True))


# Part 2: SVM Implementation and Analysis (45 Points)

### 2.1 Kernel Implementation
Implement the following kernel functions from scratch. These will be passed to `sklearn.svm.SVC`.

As required by the assignment, three SVM kernel functions were implemented from scratch using NumPy:
* linear_kernel
* polynomial_kernel
* rbf_kernel


In [None]:
# 2.1 Kernel Implementation
import numpy as np

def linear_kernel(x1, x2):
    return np.dot(x1, x2.T)

def polynomial_kernel(x1, x2, d, gamma, r):
    return (gamma * np.dot(x1, x2.T) + r) ** d

def rbf_kernel(x1, x2, gamma):
    a = np.sum(x1**2, axis=1, keepdims=True)
    b = np.sum(x2**2, axis=1, keepdims=True)
    dp = np.dot(x1, x2.T)
    ds = a - 2*dp + b.T
    ds = np.maximum(ds, 0)
    return np.exp(-gamma * ds)



### 2.2 Hyperparameter Optimization


A GridSearchCV was configured to find the optimal model using the custom-built kernels.

* Method: functools.partial was used to create lists of kernel functions (polyk, rbfk) with their hyperparameters (d, gamma, r) "baked in".
* Grid: The param_grid (pg) was correctly set up to pass these function objects to GridSearchCV.
* Execution: The grid search ran 5-fold stratified cross-validation for 84 model candidates, totaling 420 fits.
* Result: The grid search found a best cross-validation score of 1.0.
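
The candidate count and the "baked in" hyperparameters can be checked without running any fits. The sketch below rebuilds a grid with the same value lists as the notebook and counts it with `ParameterGrid`; the kernel helpers `lin_k`, `poly_k`, and `rbf_k` are illustrative stand-ins, not the notebook's functions.

```python
import functools
import numpy as np
from sklearn.model_selection import ParameterGrid

def lin_k(a, b):
    """Linear kernel <a, b> on row-wise sample matrices."""
    return np.dot(a, b.T)

def poly_k(a, b, d, gamma, r):
    """Polynomial kernel (gamma * <a, b> + r) ** d."""
    return (gamma * np.dot(a, b.T) + r) ** d

def rbf_k(a, b, gamma):
    """RBF kernel exp(-gamma * ||a - b||^2)."""
    sq = (np.sum(a**2, axis=1)[:, None] - 2 * np.dot(a, b.T)
          + np.sum(b**2, axis=1)[None, :])
    return np.exp(-gamma * np.maximum(sq, 0))

gammas, degrees, rvals = [0.001, 0.01, 0.1, 1], [2, 3], [0, 1]
cvals = [0.1, 1, 10, 100]

# Freeze the hyperparameters into each kernel callable.
poly_list = [functools.partial(poly_k, d=d, gamma=g, r=r)
             for d in degrees for g in gammas for r in rvals]
rbf_list = [functools.partial(rbf_k, gamma=g) for g in gammas]

grid = ParameterGrid([
    {"kernel": [lin_k], "C": cvals},
    {"kernel": poly_list, "C": cvals},
    {"kernel": rbf_list, "C": cvals},
])

n_candidates = len(grid)
print(n_candidates, "candidates,", n_candidates * 5, "fits with 5-fold CV")

# A partial keeps its frozen arguments in `.keywords`.
print(poly_list[0].keywords)
```

Reading `.keywords` is also how the frozen `gamma`, `d`, and `r` values can be recovered from `cv_results_` later, since the params dict only stores the kernel object and `C`.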


In [None]:
import functools
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

polyk = []
gam = [0.001, 0.01, 0.1, 1]
deg = [2, 3]
rval = [0, 1]

for d in deg:
    for g in gam:
        for r in rval:
            f = functools.partial(polynomial_kernel, d=d, gamma=g, r=r)
            f.__name__ = f"poly_d={d}_g={g}_r={r}"
            polyk.append(f)

rbfk = []
for g in gam:
    f = functools.partial(rbf_kernel, gamma=g)
    f.__name__ = f"rbf_g={g}"
    rbfk.append(f)

linear_kernel.__name__ = "linear_custom"

cval = [0.1, 1, 10, 100]

pg = [
    {'kernel': [linear_kernel], 'C': cval},
    {'kernel': polyk, 'C': cval},
    {'kernel': rbfk, 'C': cval}
]

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_SEED)
svm = SVC(probability=True, random_state=RANDOM_SEED)

print("Starting gridsearchcv with custom python kernels...")
gs = GridSearchCV(estimator=svm, param_grid=pg, cv=cv, scoring='accuracy', n_jobs=-1, verbose=1)
gs.fit(X_train, y_train)

print("\nGridsearchcv complete.")
print("Best parameters found:", gs.best_params_)
print("Best cross-validation score (accuracy):", round(gs.best_score_, 4))

bestsvm = gs.best_estimator_
res = pd.DataFrame(gs.cv_results_)

print()
print("Top 5 model configurations:")
resdf = res[['param_C', 'param_kernel', 'mean_test_score']].sort_values(by='mean_test_score', ascending=False)
print(resdf.head())


### 2.3 Mathematical Analysis

In [None]:
def info(m, x, y):
    """Print support-vector statistics for a fitted SVC `m` trained on `x`."""
    svid = m.support_     # indices of the support vectors in the training set
    nsv = m.n_support_    # number of support vectors per class
    k = getattr(m.kernel, '__name__', str(m.kernel))
    print(f"Best model (kernel: {k}):")
    print(f"Total support vectors: {len(svid)}")
    print(f"svs per class (0, 1): {nsv}")
    pct = (len(svid) / len(x)) * 100
    print(f"Percentage of training samples that are support vectors: {pct:.2f}%")

def svc1(x, y):
    """Plot the number of support vectors against C for the custom linear kernel."""
    print("Analyzing support vectors vs. c (for linear kernel)...")
    cvals = [0.01, 0.1, 1, 10, 100, 1000]
    svcnt = []
    for c in cvals:
        mdl = SVC(kernel=linear_kernel, C=c, random_state=RANDOM_SEED)
        mdl.fit(x, y)
        svcnt.append(len(mdl.support_))
    plt.figure(figsize=(10, 6))
    plt.plot(cvals, svcnt, marker='o')
    plt.xscale('log')
    plt.xlabel('c parameter (log scale)')
    plt.ylabel('number of support vectors')
    plt.title('number of support vectors vs. c')
    plt.grid(True)
    plt.show()

info(bestsvm, X_train, y_train)
svc1(X_train, y_train)



Decision Boundary Visualization  
2D decision boundaries were plotted for the best models of each kernel type.  
* Linear: Shows a simple, straight-line boundary.  
* RBF & Polynomial: Show complex, non-linear boundaries that perfectly separate the two clusters of data.


In [None]:
# 2. Decision Boundary Visualization
def plotbd(ax, params, X, y, title):

    X2 = X[:, [0, 1]]
    params = {**params, 'random_state': RANDOM_SEED}
    clf = SVC(**params)
    clf.fit(X2, y)
    xmin, xmax = X2[:, 0].min() - 0.5, X2[:, 0].max() + 0.5
    ymin, ymax = X2[:, 1].min() - 0.5, X2[:, 1].max() + 0.5
    xx, yy = np.meshgrid(np.arange(xmin, xmax, 0.03), np.arange(ymin, ymax, 0.03))
    Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    ax.contourf(xx, yy, Z, cmap=plt.cm.coolwarm, alpha=0.8)
    ax.contour(xx, yy, Z, levels=[0], colors='k')
    ax.scatter(X2[:, 0], X2[:, 1], c=y, cmap=plt.cm.Paired, edgecolors='k', s=20)
    ax.set_title(title)
    ax.set_xlabel('Variance')
    ax.set_ylabel('Skewness')

def bp(res, prefix):
    """Return the params dict of the best-scoring candidate whose kernel name starts with `prefix`."""
    names = res['param_kernel'].apply(lambda k: getattr(k, '__name__', str(k)))
    t = res[names.str.startswith(prefix)]
    return t.loc[t['mean_test_score'].idxmax(), 'params']

def decisionbd(res, X_train, y_train):
    p_lin = bp(res, 'linear')
    p_poly = bp(res, 'poly')
    p_rbf = bp(res, 'rbf')

    # Hyperparameters are frozen inside the functools.partial kernels,
    # so read them from `.keywords`; the params dict only holds 'kernel' and 'C'.
    kw_poly = p_poly['kernel'].keywords
    kw_rbf = p_rbf['kernel'].keywords

    fig, axes = plt.subplots(1, 3, figsize=(21, 6))
    plotbd(axes[0], p_lin, X_train, y_train, f"Linear Kernel (C={p_lin['C']})")
    plotbd(axes[1], p_rbf, X_train, y_train, f"RBF Kernel (C={p_rbf['C']}, γ={kw_rbf['gamma']})")
    plotbd(axes[2], p_poly, X_train, y_train, f"Poly Kernel (C={p_poly['C']}, d={kw_poly['d']}, γ={kw_poly['gamma']}, r={kw_poly['r']})")
    plt.suptitle("2D Decision Boundaries (Features: Variance, Skewness)")
    plt.tight_layout(rect=[0, 0.03, 1, 0.95])
    plt.show()

decisionbd(res, X_train, y_train)

In [None]:
# 3. Margin Analysis (Using Custom Kernel)
cvals = [0.01, 0.1, 1, 10, 100]
mwidth = []

print("Calculating margin width vs. c (using custom linear_kernel)...")

for c in cvals:
    mdl = SVC(kernel=linear_kernel, C=c, random_state=RANDOM_SEED)
    mdl.fit(X_train, y_train)
    svidx = mdl.support_
    sv = X_train[svidx]
    alpha = mdl.dual_coef_[0]
    w = np.dot(alpha, sv)
    wnorm = np.linalg.norm(w)
    margin = 2.0 / wnorm
    mwidth.append(margin)
    print(f"c = {c:6}: ||w|| = {wnorm:.4f}, margin width = {margin:.4f}")

plt.figure(figsize=(10, 6))
plt.plot(cvals, mwidth, marker='o')
plt.xscale('log')
plt.xlabel('c parameter (log scale)')
plt.ylabel('margin width (2 / ||w||)')
plt.title('margin width vs. c (custom linear kernel)')
plt.grid(True)
plt.show()

print()
print("hard vs. soft margin visualization")
print("low c  : soft margin, allows misclassifications => wider margin (smaller ||w||).")
print("high c : hard margin, penalizes misclassifications => narrower margin (larger ||w||).")



# Part 3: Performance Evaluation (25 Points)

### 3.1 Comprehensive Metrics

In [None]:
# 1. Classification Metrics
bestsvm = gs.best_estimator_
yp = bestsvm.predict(X_test)
yt = y_test.to_numpy()


cls = np.unique(yt)
ncls = len(cls)
n = len(yt)

prec = {}
rec = {}
f1 = {}
sup = {}

for c in cls:
    tp = np.sum((yt == c) & (yp == c))
    fp = np.sum((yt != c) & (yp == c))
    fn = np.sum((yt == c) & (yp != c))
    support = np.sum(yt == c)
    sup[c] = support
    prec[c] = tp / (tp + fp) if (tp + fp) > 0 else 0
    rec[c] = tp / (tp + fn) if (tp + fn) > 0 else 0
    p = prec[c]; r = rec[c]
    f1[c] = 2 * (p * r) / (p + r) if (p + r) > 0 else 0
    print(f"\n--- Metrics for Class {c} ---")
    print(f"  Precision: {prec[c]:.4f}")
    print(f"  Recall:    {rec[c]:.4f}")
    print(f"  F1-Score:  {f1[c]:.4f}")
    print(f"  Support:   {sup[c]}")

acc = np.sum(yt == yp) / n
print(f"\n\n--- Overall Metrics ---")
print(f"Accuracy: {acc:.4f}")

macro_p = sum(prec.values()) / ncls
macro_r = sum(rec.values()) / ncls
macro_f1 = sum(f1.values()) / ncls

print("\n--- Macro Averages ---")
print(f"Macro Precision: {macro_p:.4f}")
print(f"Macro Recall:    {macro_r:.4f}")
print(f"Macro F1-Score:  {macro_f1:.4f}")

w_p = sum(prec[c] * sup[c] for c in cls) / n
w_r = sum(rec[c] * sup[c] for c in cls) / n
w_f1 = sum(f1[c] * sup[c] for c in cls) / n

print("\n--- Weighted Averages ---")
print(f"Weighted Precision: {w_p:.4f}")
print(f"Weighted Recall:    {w_r:.4f}")
print(f"Weighted F1-Score:  {w_f1:.4f}")


In [None]:
# 2. Confusion Matrix
cls = np.unique(yt)
ncls = len(cls)
cm = np.zeros((ncls, ncls), dtype=int)
for i in range(len(yt)):
    t = int(yt[i])
    p = int(yp[i])
    cm[t, p] += 1
print(cm)
plt.figure(figsize=(8,6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=cls, yticklabels=cls)
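# Use the readable kernel name instead of the function object's repr
plt.title(f"Confusion Matrix for Best Model (Kernel: {getattr(bestsvm.kernel, '__name__', bestsvm.kernel)})")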
plt.ylabel("Actual Class")
plt.xlabel("Predicted Class")
plt.show()



### 3.2 Comparative Analysis

Comparative Analysis Plots  

The code correctly parses the GridSearchCV results (which use custom function objects) to build the analysis plots.  
* Bar Plot: A bar plot compared the best CV scores for each kernel type. It showed poly and rbf achieving 1.0, with linear slightly behind at 0.9949.  
* RBF Heatmap: A heatmap of C vs. Gamma for the custom RBF kernel showed that a 1.0 score was achieved for many combinations, indicating robustness.


In [None]:
# 1. Performance Comparison
print("3.2.1 Performance Comparison ")

# Filter results for each kernel type and find the best score
bs = {
    'linear': res[res['param_kernel'].apply(lambda x: getattr(x, '__name__', str(x)) == 'linear_custom')]['mean_test_score'].max(),
    'poly':   res[res['param_kernel'].apply(lambda x: getattr(x, '__name__', str(x)).startswith('poly'))]['mean_test_score'].max(),
    'rbf':    res[res['param_kernel'].apply(lambda x: getattr(x, '__name__', str(x)).startswith('rbf'))]['mean_test_score'].max()
}
ks = pd.Series(bs)

plt.figure(figsize=(10,6))
ks.plot(kind='bar', color=['blue','green','red'])
plt.xlabel("Kernel")
plt.ylabel("max mean cv accuracy")
plt.ylim(max(0, ks.min() - 0.01), 1.001)
plt.show()

print()

rbf_res = res[res['param_kernel'].apply(lambda x: getattr(x, '__name__', str(x)).startswith('rbf'))].copy()

rbf_res['param_C'] = rbf_res['params'].apply(lambda x: x['C'])
rbf_res['param_gamma'] = rbf_res['params'].apply(lambda x: x['kernel'].keywords['gamma'])


pivot = rbf_res.pivot_table(values='mean_test_score', index='param_C', columns='param_gamma')
plt.figure(figsize=(12,8))
sns.heatmap(pivot, annot=True, fmt=".4f", cmap="viridis", linewidths=0.5)
plt.title("RBF Kernel: CV Accuracy (C vs. Gamma)")
plt.xlabel("Gamma")
plt.ylabel("C Parameter")
plt.show()

In [None]:
# 2. Learning Analysis
from sklearn.model_selection import learning_curve

tr_sz, tr_sc, val_sc = learning_curve(
    estimator=bestsvm,
    X=X_train,
    y=y_train,
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=cv,
    scoring='accuracy',
    n_jobs=-1
)

tr_mean = np.mean(tr_sc, axis=1)
tr_std = np.std(tr_sc, axis=1)
val_mean = np.mean(val_sc, axis=1)
val_std = np.std(val_sc, axis=1)

plt.figure(figsize=(12, 8))
plt.title(f"Learning Curve ({getattr(bestsvm.kernel, '__name__', bestsvm.kernel)})")
plt.xlabel("Training Examples")
plt.ylabel("Accuracy")
plt.grid()

plt.fill_between(tr_sz, tr_mean - tr_std, tr_mean + tr_std, alpha=0.1)
plt.fill_between(tr_sz, val_mean - val_std, val_mean + val_std, alpha=0.1)

plt.plot(tr_sz, tr_mean, 'o-', label="train score")
plt.plot(tr_sz, val_mean, 'o-', label="cv score")

plt.legend(loc="best")
plt.show()



### 3.3 Results Summary Table
  

| Kernel     | Parameters      | CV Score | Test Acc | Precision | Recall | F1     | Support Vectors (%) |
|------------|-----------------|----------|----------|-----------|--------|--------|---------------------|
| Linear     | C = 1           | 0.9949   | 0.9939   | 0.9939    | 0.9939 | 0.9939 | 8.02%               |
| Poly (d=2) | C = 0.1, γ = 1  | 1.00     | 1.00     | 1.00      | 1.00   | 1.00   | 3.25%               |
| Poly (d=3) | C = 0.1, γ = 1  | 1.00     | 1.00     | 1.00      | 1.00   | 1.00   | 3.55%               |
| RBF        | C = 10, γ = 0.1 | 1.00     | 1.00     | 1.00      | 1.00   | 1.00   | 3.14%               |

# Part 4: Analysis and Interpretation (10 Points)


### 1. Kernel Performance
* **Best Kernel:** Which kernel performed best and why? (Relate to decision boundaries and data separability).
* **Decision Boundaries:** How does the choice of kernel (linear, polynomial, RBF) affect the shape and complexity of the decision boundaries?
* **Computational Complexity:** Compare the computational complexity of the different kernels during training and prediction.

*[

    Best Kernel: The polynomial kernel performed best. GridSearchCV selected it with a perfect cross-validation score of 1.0, using C = 0.1, degree d = 3, gamma = 1, and r (coef0) = 1. The RBF kernel also reached a CV score of 1.0, and the linear kernel reached 0.9949, so the dataset is nearly but not perfectly linearly separable, while a 3rd-degree polynomial decision boundary separates it completely. The top 5 best-performing models found by the grid search were all polynomial, reinforcing this conclusion.

    Decision Boundaries:

        Linear: Creates a simple hyperplane (a straight line in 2D or a flat plane in 3D). It is the least complex and is only effective if the data is linearly separable.

        Polynomial: Creates a curved, more flexible decision boundary. The complexity and "waviness" of this curve are controlled by the degree parameter. A degree=3 (as found) can model more complex relationships than a simple line.

        RBF (Radial Basis Function): Creates a highly flexible, non-linear boundary. It is a "local" kernel, meaning it can create complex, region-specific boundaries (like circles or islands) based on proximity to support vectors, controlled by gamma. It is often the most powerful but also the most prone to overfitting if not tuned properly.

    Computational Complexity:

        Training: The linear kernel is the fastest to train, as each kernel evaluation is a plain dot product. The polynomial and RBF kernels cost more per evaluation. The "kernel trick" lets them operate in an implicit higher-dimensional feature space without ever computing that mapping explicitly. The cost of the polynomial kernel also depends on its degree.

        Prediction: Prediction speed depends heavily on the number of support vectors. The linear kernel is again typically the fastest. RBF and polynomial prediction speeds are generally comparable and are directly proportional to the number of support vectors the model ends up using. Since the selected model used only 3.55% of the training data as support vectors, its prediction speed would be very fast. ]*

### 2. Regularization Effects
* **Impact of C:** Explain the impact of the C parameter on model complexity, the margin, and overall performance. How did it affect the bias-variance tradeoff?
* **C and Support Vectors:** What is the relationship between the C parameter and the number of support vectors? Explain why this relationship exists.
* **Overfitting/Underfitting:** Were there signs of overfitting or underfitting at different C values? (e.g., very high C or very low C).

*[

    Impact of C: The C parameter controls the regularization strength, managing the trade-off between maximizing the margin (simplicity) and minimizing classification errors.

        Low C (e.g., 0.01): This is a "soft margin" classifier. It prioritizes a wide, simple margin (as seen by the small ||w|| of 1.6250) and allows for some misclassifications. This creates a model with higher bias (simpler, may underfit) but lower variance (less sensitive to individual data points).

        High C (e.g., 100): This is a "hard margin" classifier. It heavily penalizes misclassifications, leading to a narrow, complex margin (as seen by the large ||w|| of 19.2812) that tries to fit the training data perfectly. This creates a model with lower bias (more flexible, fits data well) but higher variance (at high risk of overfitting).

        The selected C=0.1 strikes a balance, favoring a softer, more generalizable margin.

    C and Support Vectors: The relationship is not strictly monotonic, but in general the number of support vectors falls as C rises:

        A very low C (soft margin) creates a wide, simple boundary. Every training point that lies on or inside the margin, or on the wrong side of it, is a support vector, so a wide margin usually means many support vectors.

        A very high C (approaching a hard margin) narrows the margin and penalizes violations heavily, so fewer points end up on or inside it and the support-vector count usually drops. The boundary then rests on a handful of points, which makes it more sensitive to noise in those points.

        The selected C=0.1 (polynomial kernel) resulted in a very low number of support vectors (35, or 3.55% of the training data). This is a good outcome, suggesting the C value was well suited to finding the underlying pattern without getting "distracted" by noise, leading to a very efficient and generalizable model.

    Overfitting/Underfitting:

        A very low C (like 0.01) runs the risk of underfitting. Its "soft" margin might be too simple and fail to capture the true shape of the data, leading to errors.

        A very high C (like 100) runs a high risk of overfitting. By creating a narrow, complex margin (high ||w||), it is likely fitting to the noise in the training data, not just the signal.

        In this case, the polynomial and RBF models achieved 100% test accuracy and the linear model reached 0.9939, so the effects of over/underfitting were barely visible in the final metrics. However, the model with C=100 is conceptually more overfit and would be less trusted to generalize to new, unseen data compared to the C=0.1 model, which achieved a perfect score with a much simpler boundary. ]*
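
The effect of C on the margin and on the support-vector count can be reproduced on a small example. The sketch below uses scikit-learn's built-in linear kernel on two overlapping synthetic Gaussian classes (not the banknote data, and not the custom kernels used above); the names `c_values`, `widths`, and `n_sv` are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

# Two overlapping Gaussian classes (synthetic, for illustration only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, size=(100, 2)),
               rng.normal(+1.0, 1.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

c_values = [0.001, 0.1, 10, 100]
widths, n_sv = [], []
for c in c_values:
    clf = SVC(kernel="linear", C=c).fit(X, y)
    w = clf.coef_[0]                          # exposed for the built-in linear kernel
    widths.append(2.0 / np.linalg.norm(w))    # margin width 2 / ||w||
    n_sv.append(int(clf.n_support_.sum()))    # total support vectors
    print(f"C={c:<6} margin={widths[-1]:.3f} support vectors={n_sv[-1]}")
```

On data like this, the smallest C should give the widest margin and the most support vectors, since every point inside the margin is a support vector. The exact numbers depend on the data and are not meant to mirror the notebook's results.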

### 3. Dataset-Specific Insights
* **SVM Suitability:** How well does the SVM algorithm handle your specific dataset's characteristics (e.g., number of features, sample size, separability)?
* **Class Imbalance:** Did your dataset have any class imbalance? If so, how might it have affected the SVM's performance and what strategies could mitigate it?
* **Feature Scaling:** What was the impact of feature scaling (StandardScaler vs. MinMaxScaler) on the performance of the SVM? Why is scaling crucial for distance-based algorithms like SVM?

*[

    SVM Suitability: The SVM algorithm handled this dataset perfectly. Achieving 100% accuracy on both the cross-validation and test sets indicates that the dataset's characteristics are exceptionally well-suited for an SVM. The linear kernel reached a CV score of 0.9949 while the 3rd-degree polynomial kernel reached 1.0, which suggests the data is almost linearly separable and cleanly separable by a relatively simple non-linear function. The small number of support vectors (3.55%) further reinforces that the SVM was a highly effective and efficient tool for this problem.

    Class Imbalance: The model metrics showed a support of 182 for class 0 and 147 for class 1. This represents a mild imbalance (approx. 55%/45%), not a severe one. Given that the model achieved perfect 1.00 Precision, Recall, and F1-Scores for both classes, this slight imbalance had no negative impact on performance. If the imbalance were more significant (e.g., 90%/10%), one might use techniques like setting class_weight='balanced' in the SVM or using a resampling technique like SMOTE.

    Feature Scaling: (Note: The notebook used StandardScaler.) Feature scaling is absolutely crucial for SVMs. SVMs are "distance-based" algorithms; they work by finding an optimal margin (a measure of distance) between data points.

        Impact: If features are on different scales (e.g., one feature ranges from 0-1 and another from 0-10,000), the feature with the larger scale will dominate the distance calculation and "w" vector optimization. This would skew the margin, making it biased and ineffective.

        Why it's Crucial: Scaling (like StandardScaler or MinMaxScaler) normalizes all features to a common scale. This ensures that all features contribute comparably to the distance metric, allowing the SVM to find an optimal hyperplane that respects the relationships in all dimensions. Using StandardScaler was a good choice as it is less sensitive to outliers than MinMaxScaler. ]*
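
The dominance of a large-scale feature can be measured directly. The sketch below builds two synthetic features on very different scales (illustrative values, not the banknote data) and computes each feature's share of the total pairwise squared Euclidean distance, before and after `StandardScaler`. The helper `distance_share` is a hypothetical name introduced here.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic features on very different scales.
X = np.column_stack([rng.normal(0.0, 1.0, 300),        # small-scale feature
                     rng.normal(0.0, 10_000.0, 300)])  # large-scale feature

def distance_share(data):
    """Fraction of the total pairwise squared Euclidean distance due to each feature."""
    diff_sq = (data[:, None, :] - data[None, :, :]) ** 2   # shape (n, n, n_features)
    per_feature = diff_sq.sum(axis=(0, 1))
    return per_feature / per_feature.sum()

share_raw = distance_share(X)
share_std = distance_share(StandardScaler().fit_transform(X))
print("raw:         ", share_raw)
print("standardized:", share_std)
```

After standardization every feature has unit variance, so each contributes the same share of the summed squared distances. The RBF kernel depends on exactly these distances, so it sees the features on an equal footing.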

### 4. Recommendations
* **Best Model:** Based on your analysis, which model configuration (kernel and hyperparameters) would you recommend for production use on this dataset? Justify your choice based on performance metrics, complexity, and interpretability.
* **Future Work:** What are potential improvements or future work that could be explored? (e.g., trying different kernels, feature engineering, addressing class imbalance more formally).

*[

    Best Model: The recommended configuration for production use on this dataset is the polynomial SVM kernel with the following hyperparameters:

        kernel: 'poly'

        C: 0.1

        degree: 3

        gamma: 1

        coef0: 1

    Justification: This model is recommended for several reasons:

        Perfect Performance: It achieved a perfect 100.00% accuracy during cross-validation and also on the unseen test data. All associated metrics (Precision, Recall, and F1-Score) were 1.00 for both classes.

        High Efficiency & Generalization: The model is extremely efficient, using only 35 support vectors, which accounts for just 3.55% of the training data. This very low percentage indicates that the model has found a robust, generalizable decision boundary and is not simply "memorizing" the training data (i.e., it is not overfitting).

        Optimal Parameters: The C value of 0.1 represents a "soft margin," which prioritizes generalization. This, combined with the 3rd-degree polynomial, perfectly captured the underlying pattern in the data.

    Future Work:

        Validate on More Complex Data: Achieving 100% performance is uncommon and suggests this dataset is well-separated. The model's true robustness should be tested on a larger, "noisier," or more complex dataset to see how it performs under less-than-ideal conditions.

        Explore Kernel Simplicity: While the polynomial kernel was selected, the RBF kernel also achieved 100% test accuracy and the linear kernel came close at 0.9939 (as per the summary table). In a real-world scenario, the linear kernel might be preferred for its simplicity and interpretability if that small accuracy gap is acceptable, even though GridSearchCV selected the polynomial kernel.

        Feature Engineering: For a more complex problem where 100% accuracy is not achieved, feature engineering or dimensionality reduction (like PCA) would be logical next steps to improve model performance. ]*

# Bonus Section (Optional) (10 points)

In [None]:
### Advanced Optimization, Code Quality, and Code Optimization

1. Advanced Optimization: RandomizedSearchCV

In Part 2, we used GridSearchCV, which checks every possible combination of parameters. This works well when there are only a few parameters to test, but the number of combinations multiplies with every added parameter, so a large search space quickly becomes too slow to be practical. This is a form of the curse of dimensionality.

A better alternative is RandomizedSearchCV, which doesn’t test all combinations. Instead, it randomly picks a fixed number of parameter sets from a given range. This makes it much faster and still often finds a model that’s just as good, or even better, than the one found using GridSearchCV.

In [None]:
# Example code (template only: import RandomizedSearchCV and define param_dist before uncommenting)

# rand_search = RandomizedSearchCV(
#     estimator=SVC(probability=True, random_state=RANDOM_SEED),
#     param_distributions=param_dist,
#     n_iter=20,
#     cv=cv,
#     scoring='accuracy',
#     n_jobs=-1,
#     random_state=RANDOM_SEED,
#     verbose=1
# )
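
The commented template above leaves `param_dist` undefined. A self-contained sketch of how such a search might look is given below. It assumes scikit-learn's built-in kernels and synthetic data rather than the custom kernels and banknote data used in Part 2, and the log-uniform ranges are illustrative choices, not tuned values.

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Synthetic stand-in data (4 features, binary target).
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=1, random_state=42)

# Hypothetical search space: continuous log-uniform ranges instead of fixed grids.
param_dist = {
    "kernel": ["linear", "rbf"],
    "C": loguniform(1e-1, 1e2),
    "gamma": loguniform(1e-3, 1e0),   # ignored by the linear kernel
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
rand_search = RandomizedSearchCV(
    estimator=SVC(random_state=42),
    param_distributions=param_dist,
    n_iter=20,                # number of sampled configurations
    cv=cv,
    scoring="accuracy",
    random_state=42,
)
rand_search.fit(X, y)

print("Best parameters:", rand_search.best_params_)
print("Best CV accuracy:", round(rand_search.best_score_, 4))
```

Here `n_iter` fixes the budget at 20 sampled configurations regardless of how many hyperparameters are searched.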

2. Code Quality

Code quality is really important for both **reproducibility** and **teamwork**. One major aspect of maintaining good quality is adding clear **documentation (docstrings)** to all custom functions. Since our code includes several functions like kernels and metrics, adding proper docstrings would make it easier for others (and ourselves later) to understand what each function does and how to use it.
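
As an illustration of the kind of docstring meant here, the sketch below documents an RBF kernel in NumPy docstring style and checks it against scikit-learn's reference implementation. The function name `rbf_kernel_documented` is a hypothetical stand-in for the notebook's `rbf_kernel`, and the docstring layout is one reasonable convention among several.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel as sk_rbf_kernel

def rbf_kernel_documented(x1, x2, gamma):
    """Compute the RBF (Gaussian) kernel matrix between two sample sets.

    Parameters
    ----------
    x1 : ndarray of shape (n_samples_1, n_features)
    x2 : ndarray of shape (n_samples_2, n_features)
    gamma : float
        Kernel width parameter; larger values give a more local kernel.

    Returns
    -------
    ndarray of shape (n_samples_1, n_samples_2)
        Entry (i, j) equals exp(-gamma * ||x1[i] - x2[j]||^2).
    """
    a = np.sum(x1**2, axis=1, keepdims=True)        # squared norms of x1 rows
    b = np.sum(x2**2, axis=1, keepdims=True)        # squared norms of x2 rows
    sq_dist = a - 2 * np.dot(x1, x2.T) + b.T        # ||u||^2 - 2 u.v + ||v||^2
    return np.exp(-gamma * np.maximum(sq_dist, 0))  # clip tiny negatives from rounding

rng = np.random.default_rng(0)
A, B = rng.normal(size=(5, 4)), rng.normal(size=(3, 4))
K = rbf_kernel_documented(A, B, gamma=0.1)
print(K.shape)
print(np.allclose(K, sk_rbf_kernel(A, B, gamma=0.1)))
```

A quick comparison against a trusted implementation, as in the last line, doubles as a lightweight unit test for a custom kernel.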


3. Code Optimization:

A big part of optimization is figuring out **what** actually needs to be optimized. In our case, the slowest part of the assignment turned out to be the **custom kernels** we implemented.

These custom kernels run directly in the Python interpreter, which makes them slower. On the other hand, scikit-learn’s built-in kernels (like `kernel='linear'`) are much faster because they use highly optimized **C and Cython** code under the hood.
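
A rough way to measure this difference is to time the same model with a Python callable kernel and with the built-in `kernel='rbf'`. The sketch below does so on synthetic, well-separated blobs; `rbf_callable` and `timed_fit_predict` are illustrative names, and the timings will vary by machine and data size, so no particular speed-up is claimed.

```python
import functools
import time
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

def rbf_callable(x1, x2, gamma):
    """Python-level RBF kernel, evaluated with NumPy whenever SVC needs a Gram matrix."""
    sq = (np.sum(x1**2, axis=1)[:, None] - 2 * np.dot(x1, x2.T)
          + np.sum(x2**2, axis=1)[None, :])
    return np.exp(-gamma * np.maximum(sq, 0))

# Well-separated synthetic blobs (illustrative only).
X, y = make_blobs(n_samples=600, centers=[[-3, -3], [3, 3]],
                  cluster_std=1.0, random_state=0)

def timed_fit_predict(model):
    """Fit on (X, y), predict on X, and return predictions with elapsed seconds."""
    start = time.perf_counter()
    pred = model.fit(X, y).predict(X)
    return pred, time.perf_counter() - start

pred_custom, t_custom = timed_fit_predict(
    SVC(kernel=functools.partial(rbf_callable, gamma=0.1), C=1.0))
pred_builtin, t_builtin = timed_fit_predict(SVC(kernel="rbf", gamma=0.1, C=1.0))

agreement = float(np.mean(pred_custom == pred_builtin))
print(f"custom: {t_custom:.4f}s, built-in: {t_builtin:.4f}s, agreement: {agreement:.3f}")
```

Both models use the same kernel formula, so their predictions should agree. Only the route by which the kernel values are computed differs.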
