A linear Support Vector Machine (SVM) finds the hyperplane that separates the classes while maximizing the margin to the closest data points (support vectors) of each class. For a linearly separable dataset, the decision boundary is a hyperplane defined by the equation:

\[ w^T x + b = 0 \]

where:
- \( x \) is the input vector (a feature vector).
- \( w \) is the weight vector (coefficients) perpendicular to the hyperplane.
- \( b \) is the bias term (intercept).
- \( w^T \) denotes the transpose of the weight vector.

The decision boundary divides the feature space into two regions corresponding to the two classes. Points on one side of the hyperplane are classified as one class, while points on the other side are classified as the other class.
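
As a minimal sketch of how the sign of \( w^T x + b \) assigns a class, the snippet below evaluates two points against a hyperplane. The values of `w`, `b`, and `points` are illustrative assumptions, not taken from a trained model.

```python
import numpy as np

# Illustrative (assumed) hyperplane parameters and points
w = np.array([3.0, 4.0])
b = -10.0
points = np.array([[4.0, 2.0],   # w.x + b = 12 + 8 - 10 = 10
                   [1.0, 1.0]])  # w.x + b = 3 + 4 - 10 = -3

scores = points @ w + b                  # value of w^T x + b for each point
labels = np.where(scores >= 0, 1, -1)    # which side of the hyperplane
distances = scores / np.linalg.norm(w)   # signed Euclidean distance (||w|| = 5)

print(scores)
print(labels)
print(distances)
```

The first point lies on the positive side at distance 2, the second on the negative side at distance 0.6.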

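The margin is the distance between the hyperplane and the closest data point (support vector) from each class. With \( w \) and \( b \) scaled so that the closest points satisfy \( |w^T x + b| = 1 \), this distance equals \( \frac{1}{\|w\|} \), and the full gap between the two classes is \( \frac{2}{\|w\|} \). The goal of SVM is to maximize this margin while minimizing the classification error.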

Mathematically, for a linearly separable dataset, the objective function of a linear SVM can be written as:

\[ \min_{w,b} \frac{1}{2} \|w\|^2 \]

subject to the constraints:

\[ y^{(i)}(w^T x^{(i)} + b) \geq 1 \quad \text{for } i = 1, 2, ..., m \]

where:
- \( (x^{(i)}, y^{(i)}) \) are the training examples.
- \( y^{(i)} \) is the class label (+1 or -1).
- \( m \) is the number of training examples.
- \( \|w\| \) is the Euclidean norm (magnitude) of the weight vector \( w \).

The optimization problem is solved using techniques like quadratic programming to find the optimal \( w \) and \( b \) that satisfy the constraints and maximize the margin.
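
One way to see the solution of this problem in practice is to let scikit-learn's `SVC` solve it. The sketch below assumes a small separable toy dataset with labels in \(\{-1, +1\}\) and uses a very large `C` to approximate the hard-margin problem; it is an illustration, not the only way to solve the quadratic program.

```python
import numpy as np
from sklearn.svm import SVC

# Assumed toy dataset: two linearly separable classes
X = np.array([[1, 2], [2, 3], [3, 3], [6, 7], [7, 8], [8, 9]], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1])

# A very large C approximates the hard-margin problem on separable data
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

w = clf.coef_[0]
b = clf.intercept_[0]
functional_margins = y * (X @ w + b)   # each should be >= 1, up to solver tolerance

print("w =", w, "b =", b)
print("min y_i (w^T x_i + b) =", functional_margins.min())
print("margin 1/||w|| =", 1 / np.linalg.norm(w))
```

For this dataset the closest points of the two classes are (3, 3) and (6, 7), so the solver should return approximately \( w = (0.24, 0.32) \) and \( b = -2.68 \), giving a margin of 2.5.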

If the dataset is not linearly separable, a soft-margin SVM is used, which introduces a slack variable \( \xi^{(i)} \) for each training example to allow for some misclassifications. The objective function then becomes:

\[ \min_{w,b,\xi} \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{m} \xi^{(i)} \]

subject to the constraints:

\[ y^{(i)}(w^T x^{(i)} + b) \geq 1 - \xi^{(i)} \quad \text{and} \quad \xi^{(i)} \geq 0 \]

where \( C \) is the regularization parameter, controlling the trade-off between maximizing the margin and minimizing the classification error.
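
To make the role of the slack variables concrete, the sketch below evaluates the soft-margin objective for a fixed \( w \) and \( b \). The helper name `soft_margin_objective` and all numeric values are assumptions chosen for illustration.

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    """Return (objective, slacks) for labels y in {-1, +1}."""
    # Smallest slack satisfying both constraints:
    # xi_i = max(0, 1 - y_i (w^T x_i + b))
    slacks = np.maximum(0.0, 1.0 - y * (X @ w + b))
    objective = 0.5 * (w @ w) + C * slacks.sum()
    return objective, slacks

# Illustrative (assumed) parameters and data
w = np.array([1.0, 0.0])
b = 0.0
X = np.array([[2.0, 0.0], [0.5, 1.0], [-1.0, 0.0], [0.5, -1.0]])
y = np.array([1, 1, -1, -1])

objective, slacks = soft_margin_objective(w, b, X, y, C=1.0)
print(slacks)
print(objective)
```

The slacks come out as 0, 0.5, 0, and 1.5: a slack between 0 and 1 marks a point inside the margin but still correctly classified, and a slack above 1 marks a misclassified point. The objective is \( \frac{1}{2} \cdot 1 + 1 \cdot 2 = 2.5 \).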

The objective function of a linear Support Vector Machine (SVM) aims to maximize the margin between the decision boundary (hyperplane) and the closest data points (support vectors) while minimizing the classification error. Mathematically, the objective function of a linear SVM for a dataset with \( m \) training examples is formulated as follows:

\[ \min_{w,b} \frac{1}{2} \|w\|^2 \]

subject to the constraints:

\[ y^{(i)}(w^T x^{(i)} + b) \geq 1 \]

where:
- \( w \) is the weight vector (coefficients) perpendicular to the hyperplane.
- \( b \) is the bias term (intercept).
- \( x^{(i)} \) is the feature vector of the \( i \)-th training example.
- \( y^{(i)} \) is the class label of the \( i \)-th training example (\( y^{(i)} = 1 \) for positive class, \( y^{(i)} = -1 \) for negative class).
- \( \|w\|^2 \) is the squared Euclidean norm (magnitude) of the weight vector \( w \). It is inversely related to the margin: the margin is \( \frac{1}{\|w\|} \), so a smaller norm means a wider margin.

The objective function seeks to minimize \( \frac{1}{2} \|w\|^2 \), which is equivalent to maximizing the margin between the decision boundary and the closest data points. The factor \( \frac{1}{2} \) is included for mathematical convenience, as it simplifies the derivative of the objective function.

The constraints \( y^{(i)}(w^T x^{(i)} + b) \geq 1 \) ensure that all data points are correctly classified and are at least at a distance of \( \frac{1}{\|w\|} \) from the decision boundary. 
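
The \( \frac{1}{\|w\|} \) distance follows from the standard point-to-hyperplane distance formula. For any point \( x_0 \):

\[ d(x_0) = \frac{|w^T x_0 + b|}{\|w\|} \]

For a support vector the constraint holds with equality, \( y^{(i)}(w^T x^{(i)} + b) = 1 \), so \( |w^T x^{(i)} + b| = 1 \) and therefore:

\[ d(x^{(i)}) = \frac{1}{\|w\|}, \qquad \text{total margin width} = \frac{2}{\|w\|} \]

Maximizing \( \frac{1}{\|w\|} \) is the same as minimizing \( \|w\| \), which has the same minimizer as \( \frac{1}{2} \|w\|^2 \).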

In summary, the objective function of a linear SVM aims to find the optimal hyperplane that maximizes the margin between classes, ensuring good generalization to unseen data.

The kernel trick is a method used in Support Vector Machines (SVMs) to implicitly map data into a higher-dimensional feature space without explicitly computing the transformation. This allows SVMs to efficiently handle non-linearly separable data by effectively finding non-linear decision boundaries in the original input space.

In a traditional SVM with a linear kernel, the decision boundary is a hyperplane defined by a linear combination of input features. However, many real-world datasets are not linearly separable in their original feature space. The kernel trick addresses this limitation by introducing a kernel function that computes the dot product of data points in a higher-dimensional space, allowing SVMs to learn complex decision boundaries.

Mathematically, the kernel trick is based on Mercer's theorem, which states that a symmetric function \( K(x, x') \) can be used as a valid kernel if and only if the corresponding Gram matrix \( K \), defined as \( K_{ij} = K(x^{(i)}, x^{(j)}) \), is positive semi-definite for any set of points \( x^{(1)}, x^{(2)}, ..., x^{(m)} \) in the input space.
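
The condition can be checked numerically on a sample of points. The sketch below builds the Gram matrix of a Gaussian kernel on a few arbitrary points and inspects its eigenvalues; the function name `rbf_kernel`, the sample points, and \( \sigma = 1 \) are assumptions. Passing on one sample is only a sanity check, not a proof that a function is a valid kernel.

```python
import numpy as np

def rbf_kernel(x, z, sigma=1.0):
    """Gaussian kernel exp(-||x - z||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(0)
points = rng.normal(size=(5, 2))   # arbitrary sample points

# Gram matrix K_ij = K(x_i, x_j)
gram = np.array([[rbf_kernel(p, q) for q in points] for p in points])
eigenvalues = np.linalg.eigvalsh(gram)   # eigvalsh is meant for symmetric matrices

print(np.allclose(gram, gram.T))      # symmetry
print(eigenvalues.min() >= -1e-10)    # non-negative up to round-off
```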

There are several commonly used kernel functions, each suitable for different types of data:

1. **Linear Kernel (no kernel trick)**:
   \[ K(x, x') = x^T x' \]
   This is the standard inner product of the input features.

2. **Polynomial Kernel**:
   \[ K(x, x') = (x^T x' + c)^d \]
   This kernel maps the data into a higher-dimensional space using polynomial functions.

3. **Gaussian Radial Basis Function (RBF) Kernel**:
   \[ K(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right) \]
   This kernel maps the data into an infinite-dimensional space using a Gaussian function.

4. **Sigmoid Kernel**:
   \[ K(x, x') = \tanh(\alpha x^T x' + c) \]
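   This kernel applies a hyperbolic tangent to a scaled inner product, resembling a neural-network activation. Unlike the kernels above, it is positive semi-definite only for some choices of \( \alpha \) and \( c \).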
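
The four kernels above can be written directly in NumPy. This is a minimal sketch; the function names and the default parameter values (`c`, `d`, `sigma`, `alpha`) are assumptions chosen for illustration.

```python
import numpy as np

def linear_kernel(x, z):
    return x @ z

def polynomial_kernel(x, z, c=1.0, d=2):
    return (x @ z + c) ** d

def rbf_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

def sigmoid_kernel(x, z, alpha=0.01, c=0.0):
    return np.tanh(alpha * (x @ z) + c)

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

print(linear_kernel(x, z))       # x^T z = 11
print(polynomial_kernel(x, z))   # (11 + 1)^2 = 144
print(rbf_kernel(x, z))          # exp(-8 / 2) = exp(-4)
print(sigmoid_kernel(x, z))      # tanh(0.11)
```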

The kernel trick allows SVMs to find complex decision boundaries in the original input space without explicitly computing the transformations, thus avoiding the computational burden of working in high-dimensional spaces. It provides a powerful tool for handling non-linear relationships in data, making SVMs versatile for various machine learning tasks.
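
A small worked case shows what "without explicitly computing the transformations" means. For 2-D inputs, the polynomial kernel with \( c = 0 \) and \( d = 2 \) equals an ordinary dot product after the feature map \( \phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2) \). The name `phi` and the sample vectors are assumptions for this sketch.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for 2-D input, matching (x^T z)^2."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

explicit = phi(x) @ phi(z)   # dot product computed in the 3-D feature space
implicit = (x @ z) ** 2      # kernel evaluated in the original 2-D space

print(explicit, implicit)    # both equal 121 (up to floating-point rounding)
```

The kernel gives the same number while only ever touching the original 2-D vectors, which is what keeps the cost low when the feature space is very large or infinite-dimensional.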

In Support Vector Machines (SVMs), support vectors are the data points that lie closest to the decision boundary (hyperplane) and play a crucial role in defining the decision boundary. These are the points that have a non-zero value for the Lagrange multiplier (alpha) in the optimization problem. Support vectors are the key elements that determine the position and orientation of the decision boundary, and the entire SVM model is essentially defined by these support vectors.
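
scikit-learn exposes these quantities on a fitted `SVC`. The sketch below assumes a small separable toy dataset and a large `C`; it reads the support vector indices and rebuilds the weight vector as \( w = \sum_i \alpha_i y^{(i)} x^{(i)} \) from the dual coefficients.

```python
import numpy as np
from sklearn.svm import SVC

# Assumed toy dataset: two linearly separable classes
X = np.array([[1, 2], [2, 3], [3, 3], [6, 7], [7, 8], [8, 9]], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)

print(clf.support_)           # indices of training points with non-zero alpha
print(clf.support_vectors_)   # the support vectors themselves

# dual_coef_ stores alpha_i * y_i for the support vectors only
w_from_duals = clf.dual_coef_ @ clf.support_vectors_
print(w_from_duals, clf.coef_)
```

Only the support vectors enter the sum, so the remaining training points could be dropped without changing `w`.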

Here's an explanation of the role of support vectors with an example:

Consider a simple binary classification problem where we have two classes, represented by red and blue points in a 2D feature space. We want to find a decision boundary (hyperplane) that separates the two classes. 

![SVM Example](https://i.imgur.com/zoRsiNB.png)

In the above image, the red and blue points represent the two classes, and the decision boundary is the dashed line. The two classes are not linearly separable in the original feature space.

When we train an SVM on this dataset, the SVM algorithm will find the optimal decision boundary that maximizes the margin between the classes. This decision boundary will be defined by a subset of data points known as support vectors.

In this example, the support vectors are the points marked by larger circles. These are the points that are closest to the decision boundary. They determine the position and orientation of the decision boundary because their distance to the boundary directly affects the margin.

For example:
- The support vectors marked in red are closest to the blue points, and vice versa.
- Changing the position of any other points that are not support vectors will not affect the decision boundary as long as they remain on the correct side of the margin.
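
The second point can be checked numerically. The sketch below assumes a small separable toy dataset, moves one point that is not a support vector further away from the boundary, refits, and compares the two solutions within a loose tolerance.

```python
import numpy as np
from sklearn.svm import SVC

# Assumed toy dataset: two linearly separable classes
X = np.array([[1, 2], [2, 3], [3, 3], [6, 7], [7, 8], [8, 9]], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)

# Move a point that is not a support vector further from the boundary
X_moved = X.copy()
X_moved[0] = [0.0, 0.0]
clf_moved = SVC(kernel="linear", C=1e6).fit(X_moved, y)

same_boundary = (np.allclose(clf.coef_, clf_moved.coef_, atol=1e-2)
                 and np.allclose(clf.intercept_, clf_moved.intercept_, atol=1e-2))
print(same_boundary)
```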

![SVM Example with Support Vectors](https://i.imgur.com/gX2Wp9m.png)

In the above image, the dashed line represents the decision boundary, and the solid lines represent the margin. The support vectors are the points lying on the margin lines.

The significance of support vectors in SVM can be summarized as follows:
1. **Determining the Decision Boundary**: Support vectors are the critical points that define the decision boundary. The decision boundary is constructed by maximizing the margin around these support vectors.
2. **Robustness to Outliers**: Since the decision boundary depends only on the support vectors, points far from the boundary, including outliers on the correct side of the margin, have no influence on it. Outliers that fall near or across the boundary do become support vectors and can shift it.
3. **Efficiency**: Prediction cost depends on the number of support vectors rather than on the size of the training set, which keeps the trained model compact. Training still has to process the full dataset.

In summary, support vectors are the backbone of SVMs, crucial for defining the decision boundary and ensuring the robustness and efficiency of the model.

Let's illustrate the concepts of hyperplane, marginal plane, soft margin, and hard margin in Support Vector Machines (SVM) with examples and graphs.

**1. Hyperplane**:
The hyperplane is the decision boundary that separates the classes in an SVM. In a binary classification problem, it's a flat subspace of dimension \(n-1\), where \(n\) is the number of features. In 2D, it's a line, and in 3D, it's a plane. The hyperplane is defined by the equation \(w^Tx + b = 0\), where \(w\) is the weight vector, \(x\) is the input vector, and \(b\) is the bias.

**2. Marginal Plane**:
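The marginal planes are the two hyperplanes \(w^Tx + b = +1\) and \(w^Tx + b = -1\), which lie parallel to the decision hyperplane and equidistant from it on either side. In a hard-margin SVM, the margin is the distance between the hyperplane and the closest data point (support vector) from each class, and no training point may lie between the marginal planes. In a soft-margin SVM, some points are allowed inside the margin or on the wrong side of the hyperplane, which typically permits a wider margin.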

**3. Soft Margin**:
In a soft-margin SVM, some misclassifications are allowed to achieve a wider margin and better generalization to unseen data. This is useful when the data is not perfectly separable. The soft-margin SVM introduces slack variables (\(\xi\)) to penalize misclassifications. The objective function is modified to minimize the misclassification errors while maximizing the margin.

**4. Hard Margin**:
In a hard-margin SVM, no misclassifications are allowed, and the margin is maximized without any violations. This is suitable for linearly separable data. Hard-margin SVMs are more sensitive to outliers and noise in the data.
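
The trade-off between the two can be seen by varying `C` in scikit-learn's `SVC`. The sketch below assumes a toy dataset in which one point of the negative class, (5, 6), sits close to the positive class; a small `C` behaves like a soft margin and a very large `C` approximates a hard margin. The helper name `margin_width` is hypothetical.

```python
import numpy as np
from sklearn.svm import SVC

# Assumed toy dataset; (5, 6) is a negative point close to the positive class
X = np.array([[1, 2], [2, 3], [3, 3], [5, 6], [6, 7], [7, 8], [8, 9]], dtype=float)
y = np.array([-1, -1, -1, -1, 1, 1, 1])

def margin_width(C):
    """Fit a linear SVM and return the full margin width 2 / ||w||."""
    clf = SVC(kernel="linear", C=C).fit(X, y)
    return 2 / np.linalg.norm(clf.coef_[0])

soft_width = margin_width(0.01)   # small C: margin violations are cheap
hard_width = margin_width(1e6)    # large C: approximates the hard margin

print(soft_width, hard_width)
```

With the large `C`, the margin is pinned by the close pair (5, 6) and (6, 7) and has width of about \( \sqrt{2} \). With the small `C`, the model accepts some slack in exchange for a wider margin.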

Let's illustrate these concepts with examples and graphs:

```python
import numpy as np
import matplotlib.pyplot as plt

# Toy dataset: two linearly separable classes
X = np.array([[1, 2], [2, 3], [3, 3], [6, 7], [7, 8], [8, 9]])
y = np.array([0, 0, 0, 1, 1, 1])  # 0: Class 1, 1: Class 2

# Plot data points
plt.figure(figsize=(10, 6))
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired, s=100)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')

# Maximum-margin hyperplane for this dataset (w^T x + b = 0), scaled so
# that the closest points (3, 3) and (6, 7) satisfy |w^T x + b| = 1
w = np.array([0.24, 0.32])
b = -2.68
x_hyper = np.linspace(0, 10, 100)
y_hyper = (-w[0] * x_hyper - b) / w[1]
plt.plot(x_hyper, y_hyper, 'k--', label='Hyperplane')

# Plot marginal planes: w^T x + b = +1 and w^T x + b = -1
plt.plot(x_hyper, (1 - w[0] * x_hyper - b) / w[1], 'r--', label='Marginal planes')
plt.plot(x_hyper, (-1 - w[0] * x_hyper - b) / w[1], 'r--')

# Mark the support vectors: the training points lying on the marginal planes
plt.scatter([3, 6], [3, 7], c='k', marker='x', s=200, label='Support Vectors')

plt.xlim(0, 10)
plt.ylim(0, 10)
plt.legend()
plt.title('SVM with Hyperplane, Marginal Planes, and Support Vectors')
plt.show()
```

In this example:
- The two classes are shown in different colors.
- The hyperplane (dashed black line) is the decision boundary separating the two classes.
- The dashed red lines are the marginal planes, which run parallel to the hyperplane at equal distance on either side of it.
- Support vectors are the training points lying on the marginal planes.

A soft-margin SVM would allow some points to fall between the marginal planes or on the wrong side of the hyperplane. This toy dataset is separable, so none do.

The graph illustrates the concepts of hyperplane, marginal planes, and support vectors in SVM.

Here's the same example modified to illustrate a hard-margin SVM:

```python
import numpy as np
import matplotlib.pyplot as plt

# Toy dataset: two linearly separable classes
X = np.array([[1, 2], [2, 3], [3, 3], [6, 7], [7, 8], [8, 9]])
y = np.array([0, 0, 0, 1, 1, 1])  # 0: Class 1, 1: Class 2

# Plot data points
plt.figure(figsize=(10, 6))
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired, s=100)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')

# Maximum-margin (hard-margin) hyperplane for this dataset: w^T x + b = 0
w = np.array([0.24, 0.32])
b = -2.68
x_hyper = np.linspace(0, 10, 100)
y_hyper = (-w[0] * x_hyper - b) / w[1]
plt.plot(x_hyper, y_hyper, 'k--', label='Hyperplane')

# Mark the support vectors: the closest training point from each class
plt.scatter([3, 6], [3, 7], c='k', marker='x', s=200, label='Support Vectors')

plt.xlim(0, 10)
plt.ylim(0, 10)
plt.legend()
plt.title('SVM with Hyperplane and Support Vectors (Hard Margin)')
plt.show()
```

In this example:
- The margin lines are no longer drawn; only the hyperplane and the support vectors are shown.
- In a hard-margin SVM the hyperplane lies midway between the closest points of the two classes (the support vectors). The hyperplane itself does not touch them; the marginal planes do.
- No misclassifications or margin violations are allowed in a hard-margin SVM, so the decision boundary is determined solely by the support vectors.

The graph illustrates the concept of a hard-margin SVM with a hyperplane and support vectors.

Let's implement a linear SVM classifier on the Iris dataset, both with scikit-learn and from scratch, then compare their performance and plot the decision boundaries.

First, let's load the Iris dataset, split it into training and testing sets, and import necessary libraries.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```

Now, let's implement a linear SVM classifier using scikit-learn and train it on the training set.

```python
# Train a linear SVM classifier using scikit-learn
svm_clf_sklearn = SVC(kernel='linear')
svm_clf_sklearn.fit(X_train, y_train)

# Predict labels for the testing set
y_pred_sklearn = svm_clf_sklearn.predict(X_test)

# Compute the accuracy of the model
accuracy_sklearn = accuracy_score(y_test, y_pred_sklearn)
print("Accuracy of scikit-learn SVM:", accuracy_sklearn)
```

Next, let's plot the decision boundaries of the trained model using two features.

```python
import copy

# Plot decision boundaries
def plot_decision_boundary(X, y, model, feature1, feature2):
    # The grid is two-dimensional, so refit a copy of the model on just the
    # two selected features (the passed-in model expects all four features)
    X_2d = X[:, [feature1, feature2]]
    model_2d = copy.deepcopy(model)
    model_2d.fit(X_2d, y)

    # Create meshgrid of feature values
    x_min, x_max = X_2d[:, 0].min() - 1, X_2d[:, 0].max() + 1
    y_min, y_max = X_2d[:, 1].min() - 1, X_2d[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02), np.arange(y_min, y_max, 0.02))

    # Predict the labels for each point in the meshgrid
    Z = model_2d.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    # Draw the decision regions first, then the data points on top
    plt.figure(figsize=(10, 6))
    plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.8)
    plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap=plt.cm.Paired, s=100, edgecolors='k')

    plt.xlabel(iris.feature_names[feature1] + ' (standardized)')
    plt.ylabel(iris.feature_names[feature2] + ' (standardized)')
    plt.title('Decision Boundary')
    plt.show()

plot_decision_boundary(X_train, y_train, svm_clf_sklearn, 0, 1)
```

Now, let's implement a linear SVM classifier from scratch using Python. We'll use a simplified version of the Sequential Minimal Optimization (SMO) algorithm for optimization.

```python
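class LinearSVM:
    """Linear soft-margin SVM trained with a simplified SMO algorithm.

    SMO solves a binary problem with labels -1/+1, so multi-class data
    (such as Iris) is handled one-vs-rest: one binary SVM per class.
    """

    def __init__(self, C=1.0, max_iter=100, tol=1e-3, random_state=42):
        self.C = C
        self.max_iter = max_iter  # maximum number of passes over the data
        self.tol = tol
        self.random_state = random_state

    def _fit_binary(self, X, y, rng):
        """Run simplified SMO for labels y in {-1, +1} and return (w, b)."""
        n_samples, n_features = X.shape
        alpha = np.zeros(n_samples)  # one Lagrange multiplier per example
        w = np.zeros(n_features)     # kept equal to sum_i alpha_i * y_i * x_i
        b = 0.0

        for _ in range(self.max_iter):
            num_changed_alphas = 0
            for i in range(n_samples):
                E_i = X[i].dot(w) + b - y[i]

                # Only update multipliers that violate the KKT conditions
                if (y[i] * E_i < -self.tol and alpha[i] < self.C) or (y[i] * E_i > self.tol and alpha[i] > 0):
                    j = rng.choice(np.delete(np.arange(n_samples), i))
                    E_j = X[j].dot(w) + b - y[j]

                    alpha_i_old, alpha_j_old = alpha[i], alpha[j]

                    # Bounds keeping 0 <= alpha <= C and sum_i alpha_i * y_i = 0
                    if y[i] != y[j]:
                        L = max(0, alpha[j] - alpha[i])
                        H = min(self.C, self.C + alpha[j] - alpha[i])
                    else:
                        L = max(0, alpha[i] + alpha[j] - self.C)
                        H = min(self.C, alpha[i] + alpha[j])

                    if L == H:
                        continue

                    eta = 2 * X[i].dot(X[j]) - X[i].dot(X[i]) - X[j].dot(X[j])
                    if eta >= 0:
                        continue

                    alpha_j_new = min(H, max(L, alpha[j] - y[j] * (E_i - E_j) / eta))
                    if abs(alpha_j_new - alpha_j_old) < 1e-5:
                        continue

                    alpha[j] = alpha_j_new
                    alpha[i] = alpha_i_old + y[i] * y[j] * (alpha_j_old - alpha_j_new)

                    # Changes in alpha * y, reused for the bias and weight updates
                    d_i = y[i] * (alpha[i] - alpha_i_old)
                    d_j = y[j] * (alpha[j] - alpha_j_old)

                    b1 = b - E_i - d_i * X[i].dot(X[i]) - d_j * X[i].dot(X[j])
                    b2 = b - E_j - d_i * X[i].dot(X[j]) - d_j * X[j].dot(X[j])
                    if 0 < alpha[i] < self.C:
                        b = b1
                    elif 0 < alpha[j] < self.C:
                        b = b2
                    else:
                        b = (b1 + b2) / 2

                    w += d_i * X[i] + d_j * X[j]
                    num_changed_alphas += 1

            if num_changed_alphas == 0:
                break

        return w, b

    def fit(self, X, y):
        rng = np.random.default_rng(self.random_state)
        self.classes_ = np.unique(y)
        # One binary problem per class: that class is +1, all others are -1
        params = [self._fit_binary(X, np.where(y == c, 1.0, -1.0), rng) for c in self.classes_]
        self.W = np.array([w for w, _ in params])  # shape (n_classes, n_features)
        self.b = np.array([b for _, b in params])  # shape (n_classes,)
        return self

    def decision_function(self, X):
        # One score per class: w_c^T x + b_c
        return np.atleast_2d(X).dot(self.W.T) + self.b

    def predict(self, X):
        # Pick the class whose one-vs-rest classifier gives the highest score
        return self.classes_[np.argmax(self.decision_function(X), axis=1)]

# Train a linear SVM classifier from scratch
svm_clf_scratch = LinearSVM()
svm_clf_scratch.fit(X_train, y_train)

# Predict labels for the testing set
y_pred_scratch = svm_clf_scratch.predict(X_test)

# Compute the accuracy of the model
accuracy_scratch = accuracy_score(y_test, y_pred_scratch)
print("Accuracy of SVM from scratch:", accuracy_scratch)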

# Plot decision boundary for SVM from scratch
plot_decision_boundary