# Model

## Support Vector Machine (SVM)

**Support Vector Machine (SVM)** is a powerful and versatile supervised machine learning algorithm used for classification and regression tasks. SVMs are particularly well-suited for classification of complex but small- or medium-sized datasets.

## Concept

The main idea behind SVM is to find a hyperplane in an N-dimensional space (where N is the number of features) that distinctly classifies the data points. Many hyperplanes could separate two classes of data points. The goal is to find the one with the maximum margin, i.e., the maximum distance between the hyperplane and the nearest data points of each class.
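
One common way to write this idea down is the hard-margin problem for linearly separable data. The notation is introduced here for illustration and does not appear elsewhere in this notebook: weight vector $w$, bias $b$, and labels $y_i \in \{-1, +1\}$.

$$
\min_{w,\,b} \ \frac{1}{2}\lVert w \rVert^2 \quad \text{subject to} \quad y_i\,(w^\top x_i + b) \ge 1 \quad \text{for all } i
$$

Here the hyperplane is $w^\top x + b = 0$ and the margin width is $2 / \lVert w \rVert$, so minimizing $\lVert w \rVert$ is the same as maximizing the margin.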

## Hyperplane and Margin

- **Hyperplane:** This is the decision boundary that separates different classes. In 2D it is a line, in 3D it is a plane, and in higher dimensions it is a hyperplane.
- **Margin:** This is the gap between the two parallel boundaries that pass through the closest data points of each class (the support vectors). SVM seeks the hyperplane with the largest margin.
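
As a minimal sketch of these two definitions, the snippet below fits a linear SVM to four hand-picked points and reads the hyperplane and margin width off the fitted model. The toy data and the variable names are illustrative assumptions, not part of this notebook's dataset, and the very large `C` is only used to approximate a hard margin.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable data (hypothetical): two points per class
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y_toy = np.array([0, 0, 1, 1])

# A very large C approximates a hard-margin SVM
toy_clf = SVC(kernel="linear", C=1e6)
toy_clf.fit(X_toy, y_toy)

w = toy_clf.coef_[0]        # normal vector of the hyperplane w.x + b = 0
b = toy_clf.intercept_[0]   # offset of the hyperplane
margin_width = 2.0 / np.linalg.norm(w)  # gap between the two margin boundaries

print("w =", w, "b =", b)
print("margin width =", margin_width)
print("support vectors:\n", toy_clf.support_vectors_)
```

For this layout the separating line sits halfway between the two columns of points, so the margin width should come out close to the horizontal gap between the classes.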

## Hyperparameters

Key hyperparameters in SVM include:
- **Kernel:** The function used to map lower-dimensional data into a higher-dimensional space. Common kernels include linear, polynomial, radial basis function (RBF), and sigmoid.
- **Regularization (C):** The regularization parameter (often denoted by C) tells the SVM optimization how much you want to avoid misclassifying each training example. For large values of C, the optimization will choose a smaller-margin hyperplane if that hyperplane does a better job of getting all the training points classified correctly. Conversely, a very small value of C will cause the optimizer to look for a larger-margin separating hyperplane, even if that hyperplane misclassifies more points.
- **Gamma (γ):** For non-linear kernels, this parameter defines how far the influence of a single training example reaches, with low values meaning ‘far’ and high values meaning ‘close’. The gamma parameter can be seen as the inverse of the radius of influence of samples selected by the model as support vectors.
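
To make the "radius of influence" reading of gamma concrete, the sketch below evaluates the RBF kernel between two points that are one unit apart, for a few gamma values. The points and the chosen gamma values are illustrative assumptions; `rbf_kernel` is scikit-learn's pairwise helper.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Two points at Euclidean distance 1 from each other (illustrative values)
point_a = np.array([[0.0, 0.0]])
point_b = np.array([[1.0, 0.0]])

similarities = {}
for gamma_value in (0.01, 1.0, 100.0):
    # RBF kernel: k(a, b) = exp(-gamma * ||a - b||^2)
    manual = np.exp(-gamma_value * np.sum((point_a - point_b) ** 2))
    similarities[gamma_value] = rbf_kernel(point_a, point_b, gamma=gamma_value)[0, 0]
    print(f"gamma={gamma_value}: similarity={similarities[gamma_value]:.4f} "
          f"(manual={manual:.4f})")
```

A small gamma leaves the two points looking very similar to the kernel (far-reaching influence), while a large gamma drives the similarity towards zero (influence limited to the immediate neighborhood).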

## Applications

SVMs are used in a variety of applications such as:
- Face detection,
- Handwriting recognition,
- Image classification,
- Bioinformatics (e.g., protein classification).



<img src="svm_illustration.png" alt="SVM Illustration" title="Title">


# Implementation

### Import Libraries

**Press ▶ to import the libraries.**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")

print("Libraries are imported.")

### Import data

**Press ▶ to load the data.**

In [None]:
import os
import pandas as pd
import ipywidgets as widgets
from IPython.display import display, clear_output

# List all .csv and Excel files in the current directory
supported_extensions = ['.csv', '.xlsx', '.xls']
files = [f for f in os.listdir('./data/') if any(f.endswith(ext) for ext in supported_extensions)]

# Create a dropdown widget
dropdown = widgets.Dropdown(
    options=files,
    description='Files:',
    disabled=False,
)

# Create a button widget
button = widgets.Button(
    description='Select',
    disabled=False,
    button_style='',
    tooltip='Click to select file',
    icon='check'
)

# Output widget to display messages
output = widgets.Output()

# Function to handle button click
def on_button_click(b):
    with output:
        clear_output()
        selected_file = dropdown.value
        global data
        if selected_file.endswith('.csv'):
            data = pd.read_csv('./data/'+selected_file)
        elif selected_file.endswith(('.xlsx', '.xls')):
            data = pd.read_excel('./data/'+selected_file)
        print(f"File '{selected_file}' uploaded as data.")

# Attach the function to the button widget
button.on_click(on_button_click)

# Display the dropdown, button widgets, and initial message within the output widget
with output:
    print("Please select a file from the dropdown and click 'Select'.")
display(output)
display(dropdown)
display(button)

**Press ▶ to display the data.**

In [None]:
display(data.head())
print ("Loading pair plot, please wait...")
sns.pairplot(data=data,hue=data.columns[-1],diag_kind="hist")

### Select Target Column

**Press ▶ to specify the target column.**

In [None]:
import ipywidgets as widgets
import pandas as pd

# Create a Dropdown widget for column selection
dropdown = widgets.Dropdown(
    options=data.columns.tolist(),
    value=data.columns[0],
    description='Select Column:',
    disabled=False,
    layout=widgets.Layout(width='500px'),
    style={'description_width': '200px'}
)

# Create a Button widget
button = widgets.Button(
    description='Select',
    button_style='',  # 'success', 'info', 'warning', 'danger' or ''
    tooltip='Click to select the target column as the last column',
    icon='check'  # FontAwesome icon names (without 'fa-')
)

# Create an Output widget for displaying messages
output = widgets.Output()

# Function to handle button click that rearranges the DataFrame
def on_button_clicked(b):
    with output:
        output.clear_output()
        global data
        # Get the selected column name
        selected_column = dropdown.value
        # Reorder the DataFrame columns
        new_columns = [col for col in data.columns if col != selected_column] + [selected_column]
        data = data[new_columns]
        print(f"Column '{selected_column}' has been moved to the last position.")

# Link the button click event to the function
button.on_click(on_button_clicked)

# Display the widgets and output
display(widgets.VBox([dropdown, button, output]))


### Data Preprocessing

**Press ▶ to handle nulls, process categorical values, and normalize data.**

In [None]:
import pandas as pd
import numpy as np
import ipywidgets as widgets
from IPython.display import display
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

df = data.copy()

# Widgets for selecting operations and methods
fill_method_dropdown = widgets.Dropdown(
    options=[('None', None), ('Zero', 'zero'), ('Mean', 'mean'), ('Median', 'median')],
    value=None,
    description='Fill Method:',
)

remove_nulls_checkbox = widgets.Checkbox(value=False, description='Remove Nulls')
encode_categorical_checkbox = widgets.Checkbox(value=False, description='Encode Categorical')
normalize_data_checkbox = widgets.Checkbox(value=False, description='Normalize Data')

# Button to apply selected operations
apply_button = widgets.Button(description='Apply All', button_style='info')

# Button to display data
show_data_button = widgets.Button(description='Show Data')

# Output area
output = widgets.Output()

def apply_operations():
    global data
    with output:
        output.clear_output()
        # Fill missing values based on selected method
        if fill_method_dropdown.value:
            if fill_method_dropdown.value == 'zero':
                data = data.fillna(0)
            elif fill_method_dropdown.value in ['mean', 'median']:
                # Apply fill method only to numeric columns
                numeric_cols = data.select_dtypes(include=np.number).columns
                for column in numeric_cols:
                    # Compute the column statistic (NaN values are skipped by default)
                    if fill_method_dropdown.value == 'mean':
                        fill_value = data[column].mean()
                    else:
                        fill_value = data[column].median()
                    data[column] = data[column].fillna(fill_value)
            print(f"Missing values filled with {fill_method_dropdown.value}.")

        if remove_nulls_checkbox.value:
            data = data.dropna()  # Remove remaining null values
            print("Remaining null values removed.")
        if encode_categorical_checkbox.value:
            # Apply label encoding to categorical columns
            label_encoder = LabelEncoder()
            categorical_cols = data.select_dtypes(include=['object', 'category']).columns
            for col in categorical_cols:
                data[col] = label_encoder.fit_transform(data[col].astype(str))
            print("Categorical data encoded using label encoding.")
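        if normalize_data_checkbox.value:
            # Normalize numeric feature columns using MinMaxScaler.
            # The last column is the target, so it is left unscaled to keep the class labels intact.
            scaler = MinMaxScaler()
            numeric_cols = data.iloc[:, :-1].select_dtypes(include=np.number).columns
            if len(numeric_cols) > 0:
                data[numeric_cols] = scaler.fit_transform(data[numeric_cols])
            print("Feature columns normalized.")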

def show_data(b):
    with output:
        output.clear_output()
        display(data.head())  # Show the head of the DataFrame

apply_button.on_click(lambda b: apply_operations())
show_data_button.on_click(show_data)

# Layout the widgets
widgets.VBox([
    widgets.Label('Select Fill Method and Operations:'),
    fill_method_dropdown,
    remove_nulls_checkbox,
    encode_categorical_checkbox,
    normalize_data_checkbox,
    apply_button,
    show_data_button,
    output
])


### Split data into training and testing

**Press ▶ to split the data into training and testing sets.**

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
import ipywidgets as widgets
from IPython.display import display, clear_output

X = data.iloc[:, :-1]  # all rows, all columns except the last one
y = data.iloc[:, -1]   # all rows, just the last column
# Function to split data and display the shape of the splits

X_train, X_test, y_train, y_test = None, None, None, None

def split_data(button):
    global X_train, X_test, y_train, y_test  # Declare the use of global variables
    test_size = test_size_slider.value
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=42)

    # Display the output
    with out:
        clear_output(wait=True)
        print(f"Training set: {X_train.shape[0]} samples")
        print(f"Test set: {X_test.shape[0]} samples")

# Create output widget
out = widgets.Output()


# Create slider for test size
test_size_slider = widgets.FloatSlider(
    value=0.25,  # Default split 75%-25%
    min=0.1,
    max=0.9,
    step=0.05,
    description='Test Size:',
    readout_format='.2f',  # Display format
)

# Create an Apply button
apply_button = widgets.Button(description="Apply Changes")

# Set up button click event to trigger the data split
apply_button.on_click(split_data)

# Organize widgets in a vertical box
widgets_box = widgets.VBox([test_size_slider, apply_button, out])

# Display the widgets
display(widgets_box)

### SVM

### Key Parameters in SVM

Support Vector Machines (SVMs) are a powerful set of supervised learning methods used for classification, regression, and outlier detection. Their effectiveness depends largely on the choice of hyperparameters, so understanding these parameters is crucial for tuning an SVM to perform well.

#### 1. Gamma (`gamma`)

The `gamma` parameter defines how far the influence of a single training example reaches, with low values meaning 'far' and high values meaning 'close'. It can be seen as the inverse of the radius of influence of the samples selected by the model as support vectors. If `gamma` is too large, the radius of influence of each support vector covers only the support vector itself, and no amount of regularization with `C` can prevent overfitting. If `gamma` is very small, the model is too constrained and cannot capture the complexity or "shape" of the data. Intermediate values tend to generalize best, which is why `gamma` is worth tuning carefully.
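
The following sketch illustrates this behaviour by comparing training and test accuracy for a very small, a moderate, and a very large `gamma`. It uses synthetic `make_moons` data as a stand-in (an assumption, not the notebook's dataset), and the exact scores will depend on the data and the split.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic two-class data (illustrative; not the notebook's dataset)
X_demo, y_demo = make_moons(n_samples=300, noise=0.25, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0)

gamma_scores = {}
for g in (1e-3, 1.0, 1e3):
    model = SVC(kernel="rbf", gamma=g, C=1.0).fit(X_tr, y_tr)
    # Store (training accuracy, test accuracy) for each gamma
    gamma_scores[g] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    print(f"gamma={g:g}: train={gamma_scores[g][0]:.2f}, test={gamma_scores[g][1]:.2f}")
```

A large gap between training and test accuracy at the largest `gamma` is the typical sign of overfitting, while low accuracy on both sets at the smallest `gamma` suggests underfitting.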

#### 2. Penalty (`C`)

The `C` parameter trades off correct classification of training examples against maximization of the decision function's margin. For larger values of `C`, a smaller margin will be accepted if the decision function classifies all training points correctly. A lower `C` will encourage a larger margin, therefore a simpler decision function, at the cost of training accuracy. In other words, `C` behaves as a regularization parameter in the SVM. 
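
As a rough illustration of this trade-off, the sketch below fits a linear SVM with a small and a large `C` on two overlapping synthetic clusters and compares the margin width, computed as `2 / ||w||`. The data, cluster centers, and `C` values are assumptions chosen only for demonstration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping clusters (illustrative synthetic data)
X_blobs, y_blobs = make_blobs(
    n_samples=200, centers=[[0, 0], [3, 3]], cluster_std=1.5, random_state=0)

margins = {}
for C_value in (0.01, 100.0):
    blob_clf = SVC(kernel="linear", C=C_value).fit(X_blobs, y_blobs)
    # For a linear kernel the margin width is 2 / ||w||
    margins[C_value] = 2.0 / np.linalg.norm(blob_clf.coef_[0])
    print(f"C={C_value}: margin width={margins[C_value]:.3f}, "
          f"support vectors={blob_clf.n_support_.sum()}")
```

The smaller `C` should give the wider margin, with more training points falling inside it and acting as support vectors.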

#### 3. Random State (`random_state`)

The `random_state` parameter seeds the pseudo-random number generator used to shuffle the data when computing probability estimates. In scikit-learn's `SVC` it therefore only has an effect when `probability=True`, and it is ignored otherwise. When it does apply, keeping `random_state` constant makes your results reproducible between runs.
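
A quick way to check this is to train two `SVC` models that differ only in their seed and compare their predictions. The sketch below does so on synthetic data (an illustrative assumption) with the default `probability=False`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic data used only for this check
X_rs, y_rs = make_classification(n_samples=200, n_features=5, random_state=0)

# With the default probability=False the seed is not used during training,
# so both models are expected to produce identical predictions.
pred_a = SVC(kernel="rbf", random_state=0).fit(X_rs, y_rs).predict(X_rs)
pred_b = SVC(kernel="rbf", random_state=123).fit(X_rs, y_rs).predict(X_rs)

same_predictions = bool(np.array_equal(pred_a, pred_b))
print("identical predictions:", same_predictions)
```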

#### 4. Kernel (`kernel`)

The `kernel` parameter specifies how the input data is implicitly mapped into a higher-dimensional space. A linear kernel works well when the data is linearly separable (i.e., the classes can be separated by a straight line in 2D, or by a hyperplane in general). Non-linear kernels (like `rbf`, `poly`, and `sigmoid`) allow the algorithm to create more complex boundaries, depending on the nature of the data. Choosing the right kernel and kernel parameters (like `gamma`) is crucial, as it lets the model fit the dataset better without overfitting.
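
The difference between kernels is easiest to see on data that no straight line can separate. The sketch below compares a linear and an RBF kernel on two concentric rings generated with `make_circles`. The dataset and its settings are illustrative assumptions.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings: not separable by a straight line
X_c, y_c = make_circles(n_samples=300, factor=0.4, noise=0.05, random_state=0)
X_ctr, X_cte, y_ctr, y_cte = train_test_split(
    X_c, y_c, test_size=0.3, random_state=0)

kernel_scores = {}
for k in ("linear", "rbf"):
    # Same data and default settings, only the kernel changes
    kernel_scores[k] = SVC(kernel=k).fit(X_ctr, y_ctr).score(X_cte, y_cte)
    print(f"{k}: test accuracy={kernel_scores[k]:.2f}")
```

On this kind of data the RBF kernel should clearly outperform the linear one, because the boundary between the rings is a closed curve.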

Each of these parameters plays a vital role in the performance of an SVM model, and careful tuning of them is crucial to obtaining a model that generalizes well on unseen data.
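
Besides adjusting the parameters by hand, they can be searched systematically. The sketch below is one possible approach, not the method used in this notebook: it runs a small cross-validated grid search over `kernel`, `C`, and `gamma`. The iris dataset, the grid values, and the variable names are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in dataset for the example
X_iris, y_iris = load_iris(return_X_y=True)

# Scaling lives inside the pipeline so it is refit on every CV training fold
pipe = make_pipeline(StandardScaler(), SVC())

# make_pipeline names the SVC step "svc", hence the "svc__" prefix
param_grid = {
    "svc__kernel": ["linear", "rbf"],
    "svc__C": [0.1, 1, 10],
    "svc__gamma": [0.01, 0.1, 1],
}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_iris, y_iris)

print("best parameters:", search.best_params_)
print(f"best cross-validated accuracy: {search.best_score_:.3f}")
```

Keeping the scaler inside the pipeline avoids leaking information from the validation folds into the scaling step.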


**Press ▶ to train the model.**

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
import ipywidgets as widgets
from IPython.display import display, clear_output

# Assuming X_train and y_train, X_test, y_test are already defined

# Standardize the training and test data (assuming X_train and X_test are defined in the global scope)
sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)
X_test_std = sc.transform(X_test)

# Initialize global variable for SVC parameters
gamma, C, random_state, kernel = 0.1, 1.0, 0, 'rbf'
svm = None
y_pred = None

# Function to train SVC and print accuracy
def train_svm(button):
    global gamma, C, random_state, kernel, svm, y_pred
    svm = SVC(kernel=kernel, random_state=random_state, gamma=gamma, C=C)
    svm.fit(X_train_std, y_train)
    accuracy = svm.score(X_test_std, y_test) * 100
    y_pred = svm.predict(X_test_std)
    # Display the output
    with out:
        clear_output(wait=True)
        print(f'The accuracy of the SVM classifier on test data is {accuracy:.1f}%')

# Create output widget
out = widgets.Output()

# Function to update SVC parameters
def update_params(change):
    global gamma, C, random_state, kernel
    gamma = gamma_slider.value
    C = C_slider.value
    random_state = random_state_text.value
    kernel = kernel_dropdown.value

# Create sliders for SVC parameters
gamma_slider = widgets.FloatLogSlider(
    value=0.1,
    base=10,
    min=-4, # min exponent of base
    max=1, # max exponent of base
    step=0.1, # exponent step
    description='Gamma:',
    layout=widgets.Layout(width='500px'),
    style={'description_width': '200px'}
)

C_slider = widgets.FloatLogSlider(
    value=1.0,
    base=10,
    min=-4, # min exponent of base
    max=2, # max exponent of base
    step=0.1, # exponent step
    description='C:',
    layout=widgets.Layout(width='500px'),
    style={'description_width': '200px'}
)

# Create integer text box for random state
random_state_text = widgets.IntText(
    value=0,
    description='Random State:',
    layout=widgets.Layout(width='500px'),
    style={'description_width': '200px'}
)

# Create dropdown for kernel selection
kernel_dropdown = widgets.Dropdown(
    options=['linear', 'poly', 'rbf', 'sigmoid'],  # 'precomputed' is omitted: it expects a kernel matrix, not raw features
    value='rbf',
    description='Kernel:',
    layout=widgets.Layout(width='500px'),
    style={'description_width': '200px'}
)

# Create a button to train the model
train_button = widgets.Button(description="Train")

# Link parameter widgets to the update_params function
gamma_slider.observe(update_params, 'value')
C_slider.observe(update_params, 'value')
random_state_text.observe(update_params, 'value')
kernel_dropdown.observe(update_params, 'value')

# Set up button click event to trigger model training
train_button.on_click(train_svm)

# Organize widgets in a vertical box
widgets_box = widgets.VBox([gamma_slider, C_slider, random_state_text, kernel_dropdown, train_button, out])

# Display the widgets
display(widgets_box)


**Press ▶ to display the confusion matrix and the classification report.**

In [None]:
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score
print("Confusion Matrix: ")
print(confusion_matrix(y_test,y_pred))

print("Classification Report: ")
print(classification_report(y_test,y_pred))

### Visualization

In the decision-region plot, each shaded area marks the region of feature space where the classifier predicts a particular class. With two classes, the <span style="color:blue">blue</span> area is where the classifier predicts the first class and the <span style="color:red">red</span> area is where it predicts the second, based on the values of the two selected features. If the data has more classes, further regions are drawn in light green, gray, and cyan.


**Press ▶ to visualize the decision boundary.**

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
import ipywidgets as widgets
from IPython.display import display
import pandas as pd

def plot_decision_regions(X, y, classifier, label_encoder, feature1, feature2, resolution=0.02):
    X = X[[feature1, feature2]]  # Take the selected features for visualization

    # setup marker generator and color map
    markers = ('s', 'x', 'o', '^', 'v')
    colors = ['blue', 'red', 'lightgreen', 'gray', 'cyan']
    cmap = ListedColormap(colors[:len(np.unique(y))])

    # plot the decision surface
    x1_min, x1_max = X.iloc[:, 0].min() - 1, X.iloc[:, 0].max() + 1
    x2_min, x2_max = X.iloc[:, 1].min() - 1, X.iloc[:, 1].max() + 1
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
                           np.arange(x2_min, x2_max, resolution))
    Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)
    plt.contourf(xx1, xx2, Z, alpha=0.4, cmap=cmap)
    plt.xlim(xx1.min(), xx1.max())
    plt.ylim(xx2.min(), xx2.max())
    
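    for idx, cl in enumerate(np.unique(y)):
        # Use color= (not c=) so the RGBA tuple is read as a single color,
        # and cycle through the markers if there are more classes than markers
        plt.scatter(x=X.loc[y == cl, feature1], y=X.loc[y == cl, feature2],
                    alpha=0.8, color=cmap(idx),
                    marker=markers[idx % len(markers)], label=label_encoder.inverse_transform([cl])[0])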

    plt.xlabel(feature1)
    plt.ylabel(feature2)
    plt.title('SVM Classification with Decision Boundaries')
    plt.legend(loc='upper left')

# Encoding the labels
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

# Splitting the data into separate variables for this plot,
# so the split used for training and evaluation above is not overwritten
X_train_viz, X_test_viz, y_train_viz, y_test_viz = train_test_split(
    X, y_encoded, test_size=0.3, random_state=42)

# Dropdowns for selecting features
feature_selector1 = widgets.Dropdown(
    options=X.columns.tolist(),
    description='X Axis:',
    disabled=False,
)

feature_selector2 = widgets.Dropdown(
    options=X.columns.tolist(),
    description='Y Axis:',
    disabled=False,
)

button = widgets.Button(description="Plot")

output = widgets.Output()

def on_button_click(b):
    with output:
        output.clear_output()
        feature1 = feature_selector1.value
        feature2 = feature_selector2.value

        if feature1 and feature2 and feature1 != feature2:
            # Fit a fresh SVC with default parameters on the two selected features only
            svm_model = SVC()
            svm_model.fit(X_train_viz[[feature1, feature2]], y_train_viz)
            plt.figure(figsize=(10, 5))
            plot_decision_regions(X_test_viz, y_test_viz, svm_model, label_encoder, feature1, feature2)
            plt.show()
        else:
            print("Please select two different features")

button.on_click(on_button_click)

display(feature_selector1, feature_selector2, button, output)
