# Support Vector Machine Implementation and Analysis

This project demonstrates the implementation and analysis of Support Vector Machines (SVM) for classification tasks. We explore both linear and non-linear classification problems using different SVM kernels and hyperparameter optimization.

## Project Objectives
1. Implement SVM for different classification scenarios:
   - Linearly separable data
   - Non-linearly separable data (circular pattern)
2. Explore different SVM kernels
3. Optimize hyperparameters using Grid Search
4. Visualize decision boundaries and results

## Key Concepts
- Support Vector Machines
- Kernel Methods
- Hyperparameter Optimization
- Cross-validation

## Required Libraries and Tools

In [11]:
%matplotlib notebook
import sklearn.datasets as data
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import svm
import sklearn.metrics as metrics
import pandas as pd
from matplotlib.colors import ListedColormap
from sklearn import model_selection

## Visualization Tools

### create_graphs Function
Creates a comprehensive 2x2 visualization of SVM results:
1. Original data distribution
2. Training data with decision boundary
3. Test data predictions
4. Training data predictions

Features:
- Custom color mapping
- Decision boundary visualization
- Scatter plots of data points
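
The decision boundary in these plots comes from evaluating the classifier on a dense grid and reshaping the predictions back to the grid's shape. The following is a minimal sketch of that step on its own, assuming a small `make_blobs` dataset and a linear `SVC`; the names `xs`, `ys`, `xx`, `yy`, and `zz` are illustrative and do not appear in the notebook.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Assumed toy data: two 2-D clusters, as in the linear example of this project.
X, y = make_blobs(centers=[[5, 5], [10, 10]], random_state=21)
clf = SVC(kernel="linear").fit(X, y)

# Grid covering the data range with a 0.5 margin on each side.
xs = np.linspace(X[:, 0].min() - 0.5, X[:, 0].max() + 0.5, 200)
ys = np.linspace(X[:, 1].min() - 0.5, X[:, 1].max() + 0.5, 150)
xx, yy = np.meshgrid(xs, ys)  # both arrays have shape (150, 200)

# Flatten to (n_points, 2), predict, then restore the grid shape.
grid_points = np.column_stack([xx.ravel(), yy.ravel()])
zz = clf.predict(grid_points).reshape(xx.shape)

print(zz.shape)
```

Passing `xx`, `yy`, and `zz` to `contourf` then fills each predicted region with its own color.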

### graph_results_greed Function
Visualizes grid search results:
- Heat map of performance metrics
- Parameter combinations (C and gamma)
- Color-coded performance values

In [2]:
def create_graphs(x_data,
                  y_target,
                  data_x_train,
                  data_x_test,
                  data_y_train,
                  data_y_test,
                  y_pred_target_test,
                  y_pred_target_train,
                  model):

    """
    Creates a comprehensive 2x2 visualization of SVM classification results.

    Parameters
    ----------
    x_data : numpy.ndarray
        Complete feature dataset of shape (n_samples, 2)
    y_target : numpy.ndarray
        Complete target labels
    data_x_train : numpy.ndarray
        Training feature data of shape (n_train_samples, 2)
    data_x_test : numpy.ndarray
        Test feature data of shape (n_test_samples, 2)
    data_y_train : numpy.ndarray
        Training target labels
    data_y_test : numpy.ndarray
        Test target labels
    y_pred_target_test : numpy.ndarray
        Predicted labels for test data
    y_pred_target_train : numpy.ndarray
        Predicted labels for training data
    model : sklearn.svm.SVC
        Fitted SVM classifier

    Returns
    -------
    None
        Displays a matplotlib figure with four subplots:
        1. Original data distribution
        2. Training data with decision boundary
        3. Test data with predictions
        4. Training data with predictions
    """


    min_x_0 = np.min(x_data[:, 0]) - 0.5
    max_x_0 = np.max(x_data[:, 0]) + 0.5

    min_x_1 = np.min(x_data[:, 1]) - 0.5
    max_x_1 = np.max(x_data[:, 1]) + 0.5

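    # Evaluate the model on a 200x200 grid covering the data range.
    line_x_0 = np.linspace(min_x_0, max_x_0, 200)
    line_x_1 = np.linspace(min_x_1, max_x_1, 200)

    vx, vy = np.meshgrid(line_x_0, line_x_1)

    # Flatten the grid into an (n_points, 2) array of coordinates.
    coord_vx_vy = np.column_stack([vx.ravel(), vy.ravel()])

    # Predict a label per grid point and restore the grid shape for contourf.
    coord_vx_vy_pred = model.predict(coord_vx_vy).reshape(vx.shape)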

    cmap = ListedColormap(['#999999', '#333333'])


    fig_results, ax_results = plt.subplots(2, 2)

    ax_results[0, 0].contourf(vx, vy, coord_vx_vy_pred, cmap=cmap)
    ax_results[0, 0].set_aspect(1)
    ax_results[0, 0].scatter(x_data[:, 0], x_data[:, 1], c=y_target)
    ax_results[0, 0].set_title('Original Data (2D)')

    ax_results[0, 1].contourf(vx, vy, coord_vx_vy_pred, cmap=cmap)
    ax_results[0, 1].set_aspect(1)
    ax_results[0, 1].scatter(data_x_train[:, 0], data_x_train[:, 1], c=data_y_train)
    ax_results[0, 1].set_title('Training Data with Decision Boundary')

    ax_results[1, 0].contourf(vx, vy, coord_vx_vy_pred, cmap=cmap)
    ax_results[1, 0].set_aspect(1)
    ax_results[1, 0].scatter(data_x_test[:, 0], data_x_test[:, 1], c=y_pred_target_test)
    ax_results[1, 0].set_title('Test Data with Predicted Labels')

    ax_results[1, 1].contourf(vx, vy, coord_vx_vy_pred, cmap=cmap)
    ax_results[1, 1].set_aspect(1)
    ax_results[1, 1].scatter(data_x_train[:, 0], data_x_train[:, 1], c=y_pred_target_train)
    ax_results[1, 1].set_title('Train Data with Predicted Labels')
    plt.tight_layout()

    plt.show()

def graph_results_greed(results, parameters_model, title):
    """
    Visualizes grid search results as a heatmap with annotated values.

    Parameters
    ----------
    results : numpy.ndarray
        2D array of performance metrics from grid search,
        of shape (len(C), len(gamma)): rows index C, columns index gamma
    parameters_model : dict
        Dictionary containing parameter grids with keys:
        - 'gamma': list of gamma values
        - 'C': list of C values
    title : str
        Title for the plot

    Returns
    -------
    None
        Displays a matplotlib figure showing:
        - Heatmap of performance metrics
        - Text annotations for each parameter combination
        - Axis labels for C and gamma values
    """

    fig, ax1 = plt.subplots()
    ax1.set_xlabel('gamma')
    ax1.set_ylabel('C')
    ax1.set_xticks(np.arange(len(parameters_model['gamma'])))
    ax1.set_yticks(np.arange(len(parameters_model['C'])))
    ax1.set_xticklabels(parameters_model['gamma'])
    ax1.set_yticklabels(parameters_model['C'])
    ax1.imshow(results, cmap="viridis", origin='lower')

    # Rows of `results` index C (y axis); columns index gamma (x axis).
    for i in range(len(parameters_model['C'])):
        for j in range(len(parameters_model['gamma'])):
            # Blue text on bright cells, white text on dark cells.
            color = "b" if results[i, j] > 0.8 else "w"
            ax1.text(j, i, f"{results[i, j]:.2f}", ha="center", va="center", color=color)

    ax1.set_title(title)
    fig.tight_layout()
    plt.show()


## Model Implementation

### create_model Function
Implements SVM with the following features:
1. Flexible kernel selection
2. Optional parameter tuning
3. Train-test split
4. Performance evaluation
5. Visualization options

### accuracy_scores Function
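Evaluates the fitted model on both splits:
- Training and test scores (from `model.score`)
- Accuracy (from `metrics.accuracy_score`); for a classifier this is the same quantity as the score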
- Formatted results in DataFrame
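
For scikit-learn classifiers, `score` returns the mean accuracy, so the "Score" and "Accuracy" columns report the same number. The sketch below checks this on a held-out split; the dataset and the variable names are assumptions chosen for illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Assumed toy data: two 2-D clusters.
X, y = make_blobs(centers=[[5, 5], [10, 10]], random_state=21)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=41)

clf = SVC(kernel="linear").fit(X_train, y_train)

# Two routes to the same metric.
score = clf.score(X_test, y_test)
accuracy = accuracy_score(y_test, clf.predict(X_test))

print(score, accuracy)
```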

In [3]:
def create_model(x_data, y_target, kernel, graph=False, params=None):
    """
    Creates and trains an SVM model with specified parameters and optional visualization.

    Parameters
    ----------
    x_data : numpy.ndarray
        Feature dataset of shape (n_samples, n_features)
    y_target : numpy.ndarray
        Target labels
    kernel : str
        Kernel type to be used in SVM
        Options: 'linear', 'rbf', 'poly', 'sigmoid'
    graph : bool, optional (default=False)
        If True, creates visualization of results
    params : dict, optional (default=None)
        Dictionary containing SVM parameters:
        - 'C': float, regularization parameter
        - 'gamma': float, kernel coefficient

    Returns
    -------
    pandas.DataFrame
        DataFrame containing model performance metrics:
        - Training and test scores
        - Training and test accuracies
    """


    if params is None:
        model = svm.SVC(kernel=kernel)
    else:
        model = svm.SVC(kernel=kernel, C=params['C'], gamma=params['gamma'])

    # Hold out a test set (25% by default) and fit on the training portion only.
    x_train, x_test, y_train, y_test = train_test_split(x_data,
                                                        y_target,
                                                        random_state=41)
    model.fit(x_train, y_train)
    y_pred_target_test = model.predict(x_test)
    y_pred_target_train = model.predict(x_train)

    results = accuracy_scores(model,
                              x_train,
                              y_train,
                              x_test,
                              y_test,
                              y_pred_target_test,
                              y_pred_target_train)

    # Optionally plot the data, the decision regions, and the predictions.

    if graph:
        create_graphs(x_data,
                      y_target,
                      x_train,
                      x_test,
                      y_train,
                      y_test,
                      y_pred_target_test,
                      y_pred_target_train,
                      model)
    return results

def accuracy_scores(model,
                    x_train,
                    y_train,
                    x_test,
                    y_test,
                    y_pred_target_test,
                    y_pred_target_train):
    """
    Calculates and formats various accuracy metrics for model evaluation.

    Parameters
    ----------
    model : sklearn.svm.SVC
        Fitted SVM classifier
    x_train : numpy.ndarray
        Training feature data
    y_train : numpy.ndarray
        Training target labels
    x_test : numpy.ndarray
        Test feature data
    y_test : numpy.ndarray
        Test target labels
    y_pred_target_test : numpy.ndarray
        Predicted labels for test data
    y_pred_target_train : numpy.ndarray
        Predicted labels for training data

    Returns
    -------
    pandas.DataFrame
        DataFrame containing:
        - Dataset type (Training/Test)
        - Score (from model.score, the mean accuracy for a classifier)
        - Accuracy (from metrics.accuracy_score; equals Score here)
    """


    train_score = model.score(x_train, y_train)
    test_score = model.score(x_test, y_test)
    train_accuracy = metrics.accuracy_score(y_train, y_pred_target_train)
    test_accuracy = metrics.accuracy_score(y_test, y_pred_target_test)
    # Create a DataFrame to display the results
    results = pd.DataFrame({
        'Dataset': ['Training Data', 'Test Data'],
        'Score': [train_score, test_score],
        'Accuracy': [train_accuracy, test_accuracy]
    })

    return results

## Linearly Separable Data Analysis

### Data Generation
- Two distinct clusters using make_blobs
- Clear linear separation
- Controlled random state for reproducibility

In [13]:
x_data_1, y_target_1 = data.make_blobs(centers=[[5, 5], [10, 10]], random_state=21)


plt_1, axis_1 = plt.subplots()
axis_1.scatter(x_data_1[:, 0], x_data_1[:, 1], c=y_target_1)
plt_1.show()


### Results
1. Model Performance
   - Accuracy metrics
   - Decision boundary visualization
2. Advantages of the linear kernel
   - Simple decision boundary
   - Fast computation
   - High accuracy for linearly separable data
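
With a linear kernel the decision boundary is the hyperplane `w · x + b = 0`, and a fitted `SVC` exposes `w` and `b` as `coef_` and `intercept_`. The following sketch, on an assumed `make_blobs` dataset, recomputes the decision values by hand and compares them with `decision_function`; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Assumed toy data: two 2-D clusters.
X, y = make_blobs(centers=[[5, 5], [10, 10]], random_state=21)
clf = SVC(kernel="linear").fit(X, y)

# Hyperplane parameters (coef_ is only available for the linear kernel).
w = clf.coef_[0]
b = clf.intercept_[0]

# Signed decision value per sample; its sign decides the predicted class.
manual_decision = X @ w + b
manual_labels = (manual_decision > 0).astype(int)

print("w =", w, "b =", b)
print("support vectors per class:", clf.n_support_)
```

Only the support vectors determine `w` and `b`, which is part of why the linear case is cheap to fit and evaluate.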


In [14]:
create_model(x_data=x_data_1,
             y_target=y_target_1,
             kernel='linear',
             graph=True)


|   | Dataset       | Score | Accuracy |
|---|---------------|-------|----------|
| 0 | Training Data | 1.0   | 1.0      |
| 1 | Test Data     | 1.0   | 1.0      |


## Non-linear Data Analysis

### Data Generation
- Circular pattern using make_circles
- Non-linear decision boundary required
- Controlled noise level

In [15]:
x_data_2, y_target_2 = data.make_circles(n_samples=200,
                                         noise=0.1,
                                         factor=0.35,
                                         random_state=9)

plt_2, axis_2 = plt.subplots()
axis_2.scatter(x_data_2[:, 0], x_data_2[:, 1], c=y_target_2)
plt_2.show()


### Hyperparameter Optimization
1. Grid Search Parameters:
   - C: [0.25, 0.05, 0.1, 0.2, 0.4]
   - gamma: [0.25, 0.5, 1.0, 2.0, 4.0]

2. Cross-validation
   - 5-fold CV
   - Mean and standard deviation analysis
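
Conceptually, the grid search scores every `(C, gamma)` pair with 5-fold cross-validation and keeps the pair with the highest mean score. The sketch below spells that loop out by hand with `cross_val_score`. It is a simplified illustration rather than what `GridSearchCV` does internally, and the array names are assumptions.

```python
import itertools

import numpy as np
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, noise=0.1, factor=0.35, random_state=9)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=20)

C_values = [0.25, 0.05, 0.1, 0.2, 0.4]
gamma_values = [0.25, 0.5, 1.0, 2.0, 4.0]

# One cell per (C, gamma) pair: rows index C, columns index gamma.
mean_scores = np.zeros((len(C_values), len(gamma_values)))
std_scores = np.zeros_like(mean_scores)

for (i, C), (j, gamma) in itertools.product(enumerate(C_values),
                                            enumerate(gamma_values)):
    fold_scores = cross_val_score(SVC(kernel="rbf", C=C, gamma=gamma),
                                  X_train, y_train, cv=5)
    mean_scores[i, j] = fold_scores.mean()
    std_scores[i, j] = fold_scores.std()

# Position of the highest mean score (first one wins on ties).
best_i, best_j = np.unravel_index(mean_scores.argmax(), mean_scores.shape)
print("best C:", C_values[best_i], "best gamma:", gamma_values[best_j])
```

The standard deviation across folds indicates how stable each combination is; a high mean with a large spread is less trustworthy than the same mean with a small one.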

In [16]:
parameters = {'C': [0.25, 0.05, 0.1, 0.2, 0.4],
              'gamma': [0.25, 0.5, 1.0 , 2.0, 4.0]}

x_train2, x_test2, y_train2, y_test2 = model_selection.train_test_split(x_data_2,
                                                                        y_target_2,
                                                                        test_size = 0.2,
                                                                        random_state = 20 )

# RBF is SVC's default kernel; it is named explicitly here for clarity.
grid_search = model_selection.GridSearchCV(svm.SVC(kernel='rbf'), parameters, cv=5)

grid_search.fit(x_train2, y_train2)

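# cv_results_ lists the combinations with C as the outer loop and gamma as the
# inner loop, so the flat scores reshape to (len(C), len(gamma)).
results_mean = grid_search.cv_results_['mean_test_score'].reshape(
    len(parameters['C']), len(parameters['gamma']))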

graph_results_greed(results_mean, parameters, 'Mean Results')


In [17]:
# Same layout as the mean scores: rows index C, columns index gamma.
results_std = grid_search.cv_results_['std_test_score'].reshape(
    len(parameters['C']), len(parameters['gamma']))

results_std = np.around(results_std, decimals=2)

graph_results_greed(results_std, parameters, 'Std Results')
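
Reshaping the flat score arrays relies on the order in which the grid search enumerates the parameters. An alternative that does not depend on that order is to pivot `cv_results_` by parameter value. The sketch below assumes a smaller, hypothetical grid to keep it quick; `mean_table` and `std_table` are illustrative names.

```python
import pandas as pd
from sklearn.datasets import make_circles
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, noise=0.1, factor=0.35, random_state=9)

# Hypothetical reduced grid, used only for this illustration.
param_grid = {"C": [0.05, 0.1, 0.2], "gamma": [0.5, 1.0]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X, y)

cv = pd.DataFrame(search.cv_results_)
# The param_* columns may be stored as objects; cast them to float before pivoting.
cv["C"] = cv["param_C"].astype(float)
cv["gamma"] = cv["param_gamma"].astype(float)

# Rows are C values, columns are gamma values, both sorted ascending.
mean_table = cv.pivot(index="C", columns="gamma", values="mean_test_score").round(2)
std_table = cv.pivot(index="C", columns="gamma", values="std_test_score").round(2)

print(mean_table)
print(std_table)
```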


## Best Model

In [18]:
best_model = grid_search.best_estimator_

print('Best Model:')
print(grid_search.best_params_)
print()

y_pred_target_test_2 = best_model.predict(x_test2)
y_pred_target_train_2 = best_model.predict(x_train2)

results_accuracy_score = accuracy_scores(best_model,
                                         x_train2,
                                         y_train2,
                                         x_test2,
                                         y_test2,
                                         y_pred_target_test_2,
                                         y_pred_target_train_2)

results_accuracy_score

Best Model:
{'C': 0.25, 'gamma': 0.5}



|   | Dataset       | Score | Accuracy |
|---|---------------|-------|----------|
| 0 | Training Data | 1.0   | 1.0      |
| 1 | Test Data     | 1.0   | 1.0      |


In [19]:
create_graphs(x_data_2,
              y_target_2,
              x_train2,
              x_test2,
              y_train2,
              y_test2,
              y_pred_target_test_2,
              y_pred_target_train_2,
              best_model)


# Project Conclusions

## Linearly Separable Dataset Analysis
The first dataset is linearly separable, which makes it a natural fit for an SVM with a linear kernel. The model reached a score of 1.0 on both the training and test sets, as expected for two distinct clusters.

## Non-Linear Dataset Analysis
For the second dataset, no straight line can separate the two concentric classes, yet the SVM with the RBF (Radial Basis Function) kernel classified them well. The RBF kernel implicitly maps the data into a higher-dimensional space in which the classes become separable. With the best parameters found by the grid search (`C = 0.25`, `gamma = 0.5`), the model scored 1.0 on both the training and test sets.
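
The idea of mapping to a higher-dimensional space can be made concrete with an explicit feature map. The sketch below appends the squared radius `x1² + x2²` as a third feature, a hypothetical hand-made map that is not the one the RBF kernel uses, and compares a linear SVM on the raw and lifted data against an RBF SVM on the raw data. All variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, noise=0.1, factor=0.35, random_state=9)

# Hypothetical explicit feature map: add the squared distance from the origin.
X_lifted = np.column_stack([X, (X ** 2).sum(axis=1)])

# 5-fold cross-validated accuracy for each setup.
acc_linear_raw = cross_val_score(SVC(kernel="linear"), X, y, cv=5).mean()
acc_linear_lifted = cross_val_score(SVC(kernel="linear"), X_lifted, y, cv=5).mean()
acc_rbf_raw = cross_val_score(SVC(kernel="rbf"), X, y, cv=5).mean()

print("linear, raw 2-D features:   ", acc_linear_raw)
print("linear, lifted 3-D features:", acc_linear_lifted)
print("rbf, raw 2-D features:      ", acc_rbf_raw)
```

In the lifted space the inner and outer circles sit at different heights along the third axis, so a plane can separate them. The RBF kernel achieves a comparable effect without constructing the extra features explicitly.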

Together, the two experiments show how the choice of kernel lets an SVM handle both linear and non-linear classification tasks.