# $\newcommand{\ds}{\displaystyle}$ SVM Classifier with $\ds\ell_1$-Regularization

- Here we use CVXPY to train an SVM classifier with $\ds\ell_1$-regularization $\newcommand{\vw}{\mathbf{w}}$ $\newcommand{\vx}{\mathbf{x}}$ $\newcommand{\ds}{\displaystyle}$
- We are given data $\ds(\vx_i,\,y_i)$, $i=1,\ldots, m$ 
- The $\vx_i\in \mathbb{R}^n$ are feature vectors, and the $y_i\in\{\pm 1\}$ are the associated binary labels
- Goal: To construct a good linear classifier $\ds\widehat{y} = {\rm sgn}(\vw\cdot\vx + b)$
- Find the parameters $\vw$, $b$ by minimizing the convex function
\begin{align*}
  f(\vw, b) = \frac{1}{m}\sum_i \big(1 - y_i (\vw\cdot\vx_i + b) \big)_+ + \lambda\,\|\vw\|_1
\end{align*}
- The first term is the average hinge loss, where $(u)_+ = \max\{u, 0\}$; the second term shrinks the coefficients in $\vw$ and encourages sparsity
- $\lambda \geqslant 0$ is a regularization parameter
- Minimizing $f(\vw, b)$ simultaneously selects features and fits the classifier
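- The objective above can be evaluated directly in NumPy, which is a handy sanity check on a solver's output. The sketch below is illustrative only: the function name `svm_l1_objective` and the toy arrays are assumptions, and it uses 1-D arrays rather than the column vectors used in the code cells of this notebook.

```python
import numpy as np

def svm_l1_objective(w, b, X, y, lam):
    """Average hinge loss plus lam * ||w||_1, for labels y in {-1, +1}."""
    margins = y * (X @ w + b)                # y_i (w . x_i + b), shape (m,)
    hinge = np.maximum(0.0, 1.0 - margins)   # (1 - margin)_+
    return hinge.mean() + lam * np.abs(w).sum()

# Tiny hand-checkable example with made-up numbers (not the notebook's data).
X_toy = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
y_toy = np.array([1.0, -1.0, -1.0])
w_toy = np.array([2.0, 0.0])

# Margins are [2, 0, 2], so the hinge losses are [0, 1, 0] and their mean is 1/3.
# The penalty is 0.1 * (|2| + |0|) = 0.2.
value = svm_l1_objective(w_toy, 0.0, X_toy, y_toy, lam=0.1)
print(value)
```

- Only the second toy point lies inside the margin, so it alone contributes to the hinge term; the penalty depends on $\vw$ but not on $b$.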

## Example

- Generate data with $n=20$ features by randomly choosing $\vx_i$ and a sparse $\vw_{\mathrm{true}} \in \mathbb{R}^n$
- Set $\ds y_i = {\rm sgn}(\vw_{\mathrm{true}}\cdot\vx_i +b_{\mathrm{true}} - z_i)$ where the $z_i$ are i.i.d. zero-mean normal random variables
- Divide the data into training and test sets with $m=1000$ examples each
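- Because of the noise $z_i$, some labels disagree with the noise-free rule ${\rm sgn}(\vw_{\mathrm{true}}\cdot\vx_i + b_{\mathrm{true}})$, so no classifier can be expected to fit the labels perfectly. The sketch below measures that disagreement on freshly drawn data. It is a stand-alone illustration: it uses `np.random.default_rng` with its own seed, so the numbers differ from the data generated in the next cell, and the names `clean`, `noisy`, and `flip_rate` are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen for this sketch only
n, m, density, sigma = 20, 1000, 0.2, 45.0

# Sparse ground-truth weights: keep a `density` fraction of the entries.
w_true = rng.standard_normal(n)
zeroed = rng.choice(n, int((1 - density) * n), replace=False)
w_true[zeroed] = 0.0

X = rng.normal(0, 5, size=(m, n))
clean = np.sign(X @ w_true)                               # noise-free labels
noisy = np.sign(X @ w_true + rng.normal(0, sigma, size=m))  # observed labels

nonzero = np.count_nonzero(w_true)
flip_rate = np.mean(clean != noisy)  # fraction of labels changed by the noise
print(nonzero, flip_rate)
```

- With a noise standard deviation this large relative to the feature scale, a substantial fraction of labels is flipped, which is why the error against the observed labels stays well above zero.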

In [None]:
# Generate data for SVM classifier with L1 regularization.
import numpy as np

n = 20
m = 1000
TEST = m
DENSITY = 0.2

np.random.seed(1)
w_true = np.random.randn(n, 1)
# Zero out a random (1 - DENSITY) fraction of the entries so w_true is sparse.
idxs = np.random.choice(range(n), int((1 - DENSITY) * n), replace=False)
w_true[idxs] = 0
offset = 0
sigma = 45  # standard deviation of the label noise
X = np.random.normal(0, 5, size=(m, n))
# Labels are the sign of the noisy linear score.
Y = np.sign(X.dot(w_true) + offset + np.random.normal(0, sigma, size=(m, 1)))
X_test = np.random.normal(0, 5, size=(TEST, n))
Y_test = np.sign(X_test.dot(w_true) + offset + np.random.normal(0, sigma, size=(TEST, 1)))

- Solve the optimization problem for a range of $\lambda$ to compute a trade-off curve
- Plot the train and test error over the trade-off curve
- A reasonable choice of $\lambda$ is the value that minimizes the test error
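- The selection rule in the last bullet can be sketched as a small helper. The name `pick_lambda`, the tie-breaking rule, and the toy error values are assumptions made for illustration; they are not results from this notebook. The helper assumes `lambda_vals` is sorted in increasing order.

```python
import numpy as np

def pick_lambda(lambda_vals, errors):
    """Return (lambda, error) at the minimum error.

    Ties are broken in favor of the largest lambda, which gives the
    sparsest model among the equally good ones (a design choice of this
    sketch). Assumes lambda_vals is sorted in increasing order.
    """
    lambda_vals = np.asarray(lambda_vals)
    errors = np.asarray(errors)
    best = np.flatnonzero(errors == errors.min())[-1]
    return lambda_vals[best], errors[best]

# Made-up error curve on a 5-point grid, for illustration only.
grid = np.logspace(-2, 0, 5)
toy_errors = np.array([0.30, 0.22, 0.18, 0.18, 0.35])
best_lambda, best_error = pick_lambda(grid, toy_errors)
print(best_lambda, best_error)
```

- Choosing $\lambda$ on the test set makes the reported test error optimistic; a common alternative is to hold out a separate validation set for this step.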

In [None]:
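# Form the SVM problem with L1 regularization.
import cvxpy as cp

w = cp.Variable((n, 1))
b = cp.Variable()
# Sum of hinge losses (1 - y_i (w . x_i + b))_+ over the training set.
loss = cp.sum(cp.pos(1 - cp.multiply(Y, X @ w + b)))
reg = cp.norm(w, 1)
# lambd is a Parameter so the same problem can be re-solved for many values.
lambd = cp.Parameter(nonneg=True)
prob = cp.Problem(cp.Minimize(loss / m + lambd * reg))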

In [None]:
# Compute a trade-off curve and record train and test error.
TRIALS = 100
train_error = np.zeros(TRIALS)
test_error = np.zeros(TRIALS)
lambda_vals = np.logspace(-2, 0, TRIALS)
w_vals = []
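for i in range(TRIALS):
    lambd.value = lambda_vals[i]  # update the parameter; the problem is not rebuilt
    prob.solve()
    # Errors are measured against the noise-free labels sgn(X w_true + offset),
    # not against the noisy labels Y and Y_test.
    train_error[i] = (np.sign(X.dot(w_true) + offset) != np.sign(X.dot(w.value) + b.value)).sum() / m
    test_error[i] = (np.sign(X_test.dot(w_true) + offset) != np.sign(X_test.dot(w.value) + b.value)).sum() / TEST
    w_vals.append(w.value)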

In [None]:
# Plot the train and test error over the trade-off curve.
import matplotlib.pyplot as plt

%config InlineBackend.figure_format = 'svg'

plt.plot(lambda_vals, train_error, label="Train error")
plt.plot(lambda_vals, test_error, label="Test error")
plt.xscale("log")
plt.legend(loc="upper left")
plt.xlabel(r"$\lambda$", fontsize=16)
plt.show()

- Plot the regularization path: the coefficients $w_i$ of $\vw$ versus $\lambda$
- The $w_i$ do not necessarily shrink monotonically as $\lambda$ increases
- Four coefficients stay non-zero at larger values of $\lambda$ than the rest; the corresponding features are the most important
- In fact, $\vw_{\mathrm{true}}$ has exactly 4 non-zero entries
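- The comparison with $\vw_{\mathrm{true}}$ can also be made numerically by extracting the set of non-zero coefficients. Numerical solvers typically return tiny values rather than exact zeros, so the sketch below thresholds the magnitudes; the helper name `support`, the tolerance `1e-6`, and the toy vectors are assumptions for illustration, not values from this notebook.

```python
import numpy as np

def support(w, tol=1e-6):
    """Indices of the coefficients whose magnitude exceeds tol."""
    return np.flatnonzero(np.abs(np.ravel(w)) > tol)

# Made-up vectors: a sparse "true" w and a fitted w with solver-level noise.
w_true_toy = np.array([0.0, 1.5, 0.0, -0.7, 0.0])
w_fit_toy = np.array([[1e-9], [1.1], [-3e-10], [-0.4], [2e-8]])  # column vector, like w.value

selected = support(w_fit_toy).tolist()
recovered = selected == support(w_true_toy).tolist()
print(selected, recovered)
```

- Applied to each entry of `w_vals`, a helper like this gives the number of selected features as a function of $\lambda$.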

In [None]:
# Plot the regularization path for w: one curve per coefficient.
for i in range(n):
    plt.plot(lambda_vals, [wi[i, 0] for wi in w_vals])
plt.xlabel(r"$\lambda$", fontsize=16)
plt.ylabel(r"$w_i$", fontsize=16)
plt.xscale("log")
plt.show()