# Support Vector Machines

-- SVMs are particularly well suited to complex but small or medium-sized datasets.

-- SVMs are very sensitive to feature scaling.
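
To see the effect of scaling, one option is to compare cross-validated accuracy with and without a StandardScaler. This is only a rough sketch: the wine dataset, the default RBF kernel and the 5-fold split are assumptions made for illustration, not something these notes prescribe.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# The wine features live on very different scales (some around 1, others in the hundreds)
X_wine, y_wine = load_wine(return_X_y=True)

unscaled_svm = SVC(kernel="rbf")
scaled_svm = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# Mean accuracy over 5 folds for each variant
unscaled_acc = cross_val_score(unscaled_svm, X_wine, y_wine, cv=5).mean()
scaled_acc = cross_val_score(scaled_svm, X_wine, y_wine, cv=5).mean()

print(f"unscaled: {unscaled_acc:.3f}  scaled: {scaled_acc:.3f}")
```

Putting the scaler inside the pipeline means it is refit on each training fold, so no information from the validation fold leaks into the scaling.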

-- In simple terms, an SVM finds the decision boundary that stays as far as possible from the closest training instances of each class.

-- RULE OF THUMB: Start with a linear kernel, especially for large datasets. Then try the Gaussian RBF kernel if the dataset is not too large. If you have time, explore others.

-- Instances that end up inside the margin (on the "street") or on the wrong side of the decision boundary are "margin violations". They are support vectors, so they do affect the boundary. The hyperparameter C controls how many of them are tolerated: a smaller C leads to a wider street but more margin violations, while a larger C leads to a narrower street and fewer margin violations.

-- If your SVM model is overfitting, you can try regularising it by reducing the value of C.

-- **Unlike logistic regression, SVMs do not output probabilities for each class.**

-- For datasets that are not linearly separable, you can add polynomial features.

<img src="../img/linear-poly.png" width="60%">

-- With SVC you can use **kernel="poly"** instead of PolynomialFeatures. This is the kernel trick: it gives the same result as adding many polynomial features without actually creating them, so it avoids the combinatorial explosion in the number of features. The hyperparameter *coef0* controls how much the model is influenced by high-degree polynomials versus low-degree polynomials. You also set *degree* as the polynomial degree. If your model is underfitting, increase the degree; if it is overfitting, decrease it.

-- You can also add *similarity features* with the trick **kernel="rbf"** (explained below).
The hyperparameters to set are gamma and C. Increasing gamma makes the bell curve narrower, so each instance has a smaller range of influence and the decision boundary becomes more irregular; a small gamma gives a wider bell curve and a smoother decision boundary. So gamma acts like a regularisation hyperparameter: if the model is overfitting, reduce it; if it is underfitting, increase it.
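
The effect of gamma can be checked on a toy dataset. The sketch below is an illustration under assumptions: the moons dataset, the three gamma values and C=1 are arbitrary choices, and training accuracy is used only as a rough signal of how tightly the model fits the training set.

```python
from sklearn.datasets import make_moons
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_moons, y_moons = make_moons(n_samples=200, noise=0.25, random_state=42)

train_acc = {}
n_support = {}
for gamma in (0.01, 1, 100):
    rbf_clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma=gamma, C=1))
    rbf_clf.fit(X_moons, y_moons)
    # Accuracy on the training set itself: high values with large gamma hint at overfitting
    train_acc[gamma] = rbf_clf.score(X_moons, y_moons)
    n_support[gamma] = int(rbf_clf.named_steps["svc"].n_support_.sum())
    print(f"gamma={gamma}: train accuracy={train_acc[gamma]:.3f}, "
          f"support vectors={n_support[gamma]}")
```

To judge over- or underfitting properly you would compare these numbers against a held-out set or cross-validation scores.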

-- There are other kernels, but they are mostly used with specific data structures (string kernels, for example).


**THE CHOICE OF C** controls the trade-off between maximizing the margin (distance between the decision boundary and the closest data points) and minimizing the classification error on the training set.

Large C values:
- Prioritize correct classification of training data points.
- May lead to overfitting if the data is noisy or has outliers.
- Results in a narrow margin.

Small C values:
- Prioritize a wider margin, even if it means misclassifying some training points.
- May lead to underfitting if the data is not linearly separable.
- Results in a larger margin.

In essence:
- Larger C: More emphasis on correct classification, potentially at the cost of generalization.
- Smaller C: More emphasis on generalization, potentially at the cost of classification accuracy on the training set.
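
A quick way to see this trade-off is to measure the street width for a few values of C. The sketch below is only an illustration: it assumes the iris petal features used in the code cells of these notes, uses SVC(kernel="linear") so that coef_ is available, and the C values (0.01, 1, 100) are arbitrary.

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = datasets.load_iris()
X_iris = StandardScaler().fit_transform(iris["data"][:, 2:4])  # petal length, petal width
y_iris = (iris["target"] == 2).astype(int)                      # 1 if Iris virginica, else 0

margin_width = {}
for C in (0.01, 1, 100):
    clf = SVC(kernel="linear", C=C).fit(X_iris, y_iris)
    # For a linear SVM the street is 2 / ||w|| wide
    margin_width[C] = 2 / np.linalg.norm(clf.coef_)
    print(f"C={C}: street width={margin_width[C]:.2f}, "
          f"support vectors={clf.n_support_.sum()}")
```

The width is measured in the scaled feature space, since the features are standardised before fitting.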

-- **FOR REGRESSION:** 
Use LinearSVR to perform linear regression or SVR for non-linear.
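
As a rough sketch of both regressors, the example below fits LinearSVR on toy linear data and SVR with an RBF kernel on toy nonlinear data. The datasets are made up for illustration, and the values of C and epsilon are arbitrary assumptions rather than recommended settings.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR, LinearSVR

rng = np.random.default_rng(42)

# Toy linear data (made up): y = 4 + 3x + noise
X_lin = 2 * rng.random((100, 1))
y_lin = 4 + 3 * X_lin[:, 0] + 0.5 * rng.standard_normal(100)

lin_svr = make_pipeline(
    StandardScaler(),
    LinearSVR(epsilon=0.5, C=1.0, max_iter=10_000, random_state=42),
)
lin_svr.fit(X_lin, y_lin)

# Toy nonlinear data (made up): y = sin(x) + noise
X_sin = 6 * rng.random((100, 1))
y_sin = np.sin(X_sin[:, 0]) + 0.1 * rng.standard_normal(100)

rbf_svr = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.1))
rbf_svr.fit(X_sin, y_sin)

# score() returns R^2 on the given data (here the training data)
lin_r2 = lin_svr.score(X_lin, y_lin)
rbf_r2 = rbf_svr.score(X_sin, y_sin)
print(f"LinearSVR R^2: {lin_r2:.3f}   SVR (RBF) R^2: {rbf_r2:.3f}")
```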

In [4]:
# example of LinearSVC for linear classification

# Note that below we use LinearSVC.
# We could also use SVC(kernel="linear", C=1), but it is not recommended because it is much slower.
# Another option is SGDClassifier(loss="hinge", alpha=1/(m*C)), where m is the number of training instances.
# It converges more slowly than LinearSVC, but it is useful for large datasets that do not fit in memory.

import numpy as np
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# load the iris dataset
iris = datasets.load_iris()

X = iris["data"][:, (2, 3)]  # petal length, petal width
y = (iris["target"] == 2).astype(np.float64)  # 1.0 if Iris virginica, else 0.0

svm_classifier = Pipeline([
    # LinearSVC regularises the bias term, so the data should be centred on the mean; StandardScaler does that
    ("scaler", StandardScaler()),
    ("linear_svc", LinearSVC(C=1, loss="hinge")),  # hinge is not the default loss (the default is "squared_hinge")
    # dual=False is usually faster when there are more training instances than features,
    # but LinearSVC only supports it with loss="squared_hinge", not with loss="hinge"
])

# as an alternative to the above you can use
# SVC(kernel="linear", C=1)
# but it is much slower, especially for larger datasets

svm_classifier.fit(X, y)



In [5]:
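# nonlinear classification
# example of how to add polynomial features in a Pipeline

from sklearn.datasets import make_moons
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import LinearSVC

# moons is a toy dataset that is not linearly separable
# (n_samples, noise and random_state are illustrative choices)
X, y = make_moons(n_samples=100, noise=0.15, random_state=42)

polynomial_svm_clf = Pipeline([
    ("poly_features", PolynomialFeatures(degree=3)),
    ("scaler", StandardScaler()),
    ("svm_clf", LinearSVC(C=10, loss="hinge")),
])

polynomial_svm_clf.fit(X, y)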

In [None]:
# SVC with a polynomial kernel instead of PolynomialFeatures
from sklearn.svm import SVC

poly_kernel_svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel="poly", degree=3, coef0=1, C=5)),
])

# coef0 controls how much the model is influenced by high-degree polynomials versus low-degree polynomials.
# if your model is underfitting, increase C; if it is overfitting, decrease it.

poly_kernel_svm_clf.fit(X, y)

**SIMILARITY FEATURES**

- By adding these features you essentially map the original data into a higher-dimensional space, where it is more likely to be linearly separable.

- The choice of landmarks and the γ parameter can significantly impact the performance of your model.

- This technique is often used in conjunction with kernel methods in SVMs, as the kernel function effectively computes these similarity features.

- The simplest approach to choosing landmarks is to create one at the location of each and every instance in the dataset (the example below doesn't do that); however, a training set with m instances then gets m features, so you end up with a very large dataset to train on.

- A trick here is to use kernel="rbf" instead of doing the implementation manually.


In [None]:
# example
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

# Generate a non-linearly separable dataset
X, y = make_moons(n_samples=200, noise=0.25, random_state=0)

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

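# Create landmarks (arbitrary positions): each landmark needs one coordinate per feature, so two values here
landmarks = np.array([[-2.0, 0.0], [1.0, 0.0]])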

# Add similarity features using RBF kernel
X_train_sim = rbf_kernel(X_train, landmarks, gamma=0.3)
X_test_sim = rbf_kernel(X_test, landmarks, gamma=0.3)

# Train SVM model
svm_clf = SVC(kernel="linear")
svm_clf.fit(X_train_sim, y_train)

# Make predictions
y_pred = svm_clf.predict(X_test_sim)

# Evaluate accuracy
accuracy = svm_clf.score(X_test_sim, y_test)
print("Accuracy:", accuracy)

# REVIEW

-- **The fundamental idea behind Support Vector Machines is to fit the widest possible "street" between the classes.** In other words, the goal is to have the largest possible margin between the decision boundary that separates the two classes and the training instances. When performing soft margin classification, the SVM searches for a compromise between perfectly separating the two classes and having the widest possible street (i.e., a few instances may end up on the street). Another key idea is to use kernels when training on nonlinear datasets. SVMs can also be tweaked to perform linear and nonlinear regression, as well as novelty detection.

-- **After training an SVM, a support vector is any instance located on the "street" (see the previous point), including its border. The decision boundary is entirely determined by the support vectors. Any instance that is not a support vector (i.e., is off the street) has no influence whatsoever;** you could remove them, add more instances, or move them around, and as long as they stay off the street they won't affect the decision boundary. Computing the predictions with a kernelized SVM only involves the support vectors, not the whole training set.

-- SVMs try to fit the largest possible "street" between the classes, so if the training set is not scaled, the SVM will tend to neglect small features.
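
The claim that only the support vectors determine the boundary can be checked by refitting on the support vectors alone. This is a minimal sketch under assumptions: it reuses the iris petal features, a linear kernel with C=1, and a tight solver tolerance (tol=1e-6) so that the two weight vectors can be compared numerically.

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = datasets.load_iris()
X_sv = StandardScaler().fit_transform(iris["data"][:, 2:4])  # petal length, petal width
y_sv = (iris["target"] == 2).astype(int)                      # 1 if Iris virginica, else 0

full_clf = SVC(kernel="linear", C=1, tol=1e-6).fit(X_sv, y_sv)
sv_idx = full_clf.support_  # indices of the support vectors in the training set

# Refit on the support vectors only: every off-street instance is dropped
sv_only_clf = SVC(kernel="linear", C=1, tol=1e-6).fit(X_sv[sv_idx], y_sv[sv_idx])

print("training instances:", len(X_sv), " support vectors:", len(sv_idx))
print("w (all data):       ", full_clf.coef_.ravel())
print("w (support vectors):", sv_only_clf.coef_.ravel())
```

The features are scaled once, before both fits, so the comparison is not distorted by recomputing the scaling on the smaller set.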


-- You can use the decision_function() method to get confidence scores. These scores represent the distance between the instance and the decision boundary. However, they cannot be directly converted into an estimation of the class probability. If you set probability=True when creating an SVC, then at the end of training it will use 5-fold cross-validation to generate out-of-sample scores for the training samples, and it will train a LogisticRegression model to map these scores to estimated probabilities. The predict_proba() and predict_log_proba() methods will then be available.
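
A small sketch of the two options, assuming a toy moons dataset and an RBF kernel with C=1 (both arbitrary choices). Details of probability=True may differ between scikit-learn versions, so treat this as an illustration rather than a reference.

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X_p, y_p = make_moons(n_samples=200, noise=0.25, random_state=42)

plain_svc = SVC(kernel="rbf", C=1).fit(X_p, y_p)
scores = plain_svc.decision_function(X_p[:5])    # signed scores, not probabilities
has_proba = hasattr(plain_svc, "predict_proba")  # False without probability=True

# probability=True adds an internal cross-validated calibration step, so training is slower
proba_svc = SVC(kernel="rbf", C=1, probability=True, random_state=42).fit(X_p, y_p)
probas = proba_svc.predict_proba(X_p[:5])        # one row per instance, one column per class

print("scores:", scores)
print("predict_proba available without probability=True:", has_proba)
print("probabilities:\n", probas)
```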

-- LinearSVC, SVC, and SGDClassifier can all be used for large-margin linear classification. The SVC class also supports the kernel trick, which makes it capable of handling nonlinear tasks. However, this comes at a cost: the SVC class does not scale well to datasets with many instances. It does scale well to a large number of features, though. The LinearSVC class implements an optimized algorithm for linear SVMs, while SGDClassifier uses Stochastic Gradient Descent. Depending on the dataset, LinearSVC may be a bit faster than SGDClassifier, but not always, and SGDClassifier is more flexible, plus it supports incremental learning.
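
The three classes can be compared side by side on the same linear task. The sketch below assumes the iris petal features, C=1, and the alpha=1/(m*C) setting mentioned in the code cell comments of these notes; the learned coefficients will be similar but not identical, because each class uses a different optimiser.

```python
from sklearn import datasets
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC

iris = datasets.load_iris()
X_cmp = StandardScaler().fit_transform(iris["data"][:, 2:4])  # petal length, petal width
y_cmp = (iris["target"] == 2).astype(int)                      # 1 if Iris virginica, else 0

C = 1
m = len(X_cmp)  # number of training instances

linear_models = {
    "LinearSVC": LinearSVC(C=C, loss="hinge", max_iter=10_000, random_state=42),
    "SVC(kernel='linear')": SVC(kernel="linear", C=C),
    "SGDClassifier": SGDClassifier(loss="hinge", alpha=1 / (m * C), random_state=42),
}

train_acc_cmp = {}
for name, model in linear_models.items():
    model.fit(X_cmp, y_cmp)
    train_acc_cmp[name] = model.score(X_cmp, y_cmp)
    print(f"{name}: w={model.coef_.ravel().round(2)}, b={model.intercept_.round(2)}, "
          f"train accuracy={train_acc_cmp[name]:.3f}")
```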

-- If an SVM classifier trained with an RBF kernel underfits the training set, there might be too much regularization. To decrease it, you need to increase gamma or C (or both).

-- A regression SVM model tries to fit as many instances as possible within a small margin around its predictions. If you add instances within this margin, the model will not be affected at all: it is said to be ϵ-insensitive.
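
The ϵ-insensitivity can be illustrated by adding instances inside the margin and refitting. This is a sketch under assumptions: the toy data are made up, a linear kernel is used so the fitted slope can be read from coef_, epsilon=1.0 and C=1.0 are arbitrary, and the tight tolerance is there only to make the two slopes comparable.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Toy data (made up): y = 2x + 1 + noise
X_eps = np.linspace(0, 10, 50).reshape(-1, 1)
y_eps = 2 * X_eps[:, 0] + 1 + 0.5 * rng.standard_normal(50)

svr_before = SVR(kernel="linear", C=1.0, epsilon=1.0, tol=1e-6).fit(X_eps, y_eps)

# New instances lying exactly on the fitted line, so well inside the epsilon margin
X_extra = rng.uniform(0, 10, size=(20, 1))
y_extra = svr_before.predict(X_extra)

svr_after = SVR(kernel="linear", C=1.0, epsilon=1.0, tol=1e-6).fit(
    np.vstack([X_eps, X_extra]), np.concatenate([y_eps, y_extra])
)

print("slope before:", svr_before.coef_.ravel(), " slope after:", svr_after.coef_.ravel())
```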

-- The kernel trick is a mathematical technique that makes it possible to train a nonlinear SVM model. The resulting model is equivalent to mapping the inputs to another space using a nonlinear transformation, then training a linear SVM on the resulting high-dimensional inputs. The kernel trick gives the same result without having to transform the inputs at all.
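
A tiny numerical check of this equivalence, for the degree-2 polynomial kernel K(a, b) = (a · b)² on 2D inputs. The helper names phi and poly2_kernel are hypothetical, introduced only for this illustration.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for a 2D vector (illustrative helper)."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly2_kernel(a, b):
    """Degree-2 polynomial kernel computed in the original 2D space."""
    return np.dot(a, b) ** 2

a = np.array([1.0, 2.0])
b = np.array([3.0, -1.0])

explicit = np.dot(phi(a), phi(b))  # map to 3D first, then take the dot product
via_kernel = poly2_kernel(a, b)    # never leaves the 2D input space

print(explicit, via_kernel)
```

Both values agree up to floating-point rounding, which is why the SVM can work with the kernel alone and never needs to build the mapped features.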