<img src="../data/skl.png">

<h1><b>Steps</b></h1>
<pre>
    |- More than 50 samples
    |- Predict a category
      |- Labeled Data < 100K samples
        |- Linear SVC
        |- If Text Data
          |- Naive Bayes
        |- If not Text Data
          |- KNeighbors Classifier
          |- SVC > Ensemble Classifiers
      |- Labeled Data > 100K samples
        |- SGD Classifier
        |- Kernel Approximation
</pre>

<h1>Generalized Linear Models</h1>
The target is modeled as a linear combination of the input features.

The prediction $\hat{y}$ is given by

$$\hat{y}(w,x) = w_{0} + w_{1}x_{1} + \dots + w_{p}x_{p}$$

`coef_` stores the weight vector $$w = (w_{1}, \dots , w_{p})$$

`intercept_` stores $$w_{0}$$

`fit` takes the arrays $$(X, y)$$

```python
from sklearn.linear_model import LinearRegression  #(fit_intercept=True, normalize=False, copy_X=True, n_jobs=1)
```
> * coef_
> * intercept_ 

| Methods | Parameters | return |
|---------|------------|--------|
| fit | X, y, sample_weight | self |
| get_params | deep | params |
| predict | X | C |
| score | X, y, sample_weight | score |
| set_params | \*\*params | self |
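
As a rough illustration of the attributes and methods listed above, here is a minimal sketch on noise-free toy data. The data and the variable names are assumptions made for the example, not part of the original notes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data generated from y = 1 + 2*x1 + 3*x2 (no noise)
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

model = LinearRegression()
model.fit(X, y)                         # fit returns the estimator itself

print(model.coef_)                      # estimated (w_1, ..., w_p)
print(model.intercept_)                 # estimated w_0
prediction = model.predict([[3.0, 2.0]])
print(prediction)                       # prediction for one new sample
r2 = model.score(X, y)
print(r2)                               # R^2 on the training data
print(model.get_params(deep=True))      # constructor parameters as a dict
```

Because the toy target is an exact linear function of the inputs, the fitted weights should recover the generating coefficients up to floating-point error.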

# Ordinary Least Squares #

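Minimizes the residual sum of squares between the observed values and the values predicted by the linear approximation:

$$\min_{w} || Xw - y ||_{2}^{2}$$

> * *Note*: be aware of multicollinearity. When features are correlated, the design matrix becomes close to singular and the least-squares estimate becomes highly sensitive to random errors in the observed target.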

In [1]:
from sklearn import linear_model
reg = linear_model.LinearRegression()
reg.fit ([[0, 0], [1, 1], [2, 2]], [0, 1, 2])

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [2]:
reg.coef_

array([ 0.5,  0.5])

### Ordinary Least Squares Complexity ###
The solution is computed using the singular value decomposition of $X$.
For a matrix $X$ of size $n \times p$ with $n \ge p$:

> <b>Cost is</b> $O(np^{2})$
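
The least-squares problem can also be solved directly with NumPy, which may help to see what happens with collinear features. This is only a sketch: it assumes `np.linalg.lstsq` (an SVD-based solver, to my understanding) as a stand-in and handles the intercept by adding a column of ones, which is not necessarily how scikit-learn does it internally.

```python
import numpy as np

# Same toy data as the LinearRegression example: the two features are identical
X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
y = np.array([0.0, 1.0, 2.0])

# Prepend a column of ones so the first weight plays the role of w_0
A = np.column_stack([np.ones(len(X)), X])

# lstsq returns the minimum-norm least-squares solution
w, residuals, rank, singular_values = np.linalg.lstsq(A, y, rcond=None)

print(w)     # [w_0, w_1, w_2]
print(rank)  # effective rank of A
```

The rank is lower than the number of columns because of the duplicated feature, so infinitely many weight vectors fit the data equally well. The minimum-norm solution splits the weight evenly between the two identical columns.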

# Ridge Regression #
**Ridge Coefficients** : Ridge regression imposes a penalty on the size of the coefficients, which minimize a penalized residual sum of squares:
$$\min_{w} || Xw - y ||_{2}^{2} + \alpha || w ||_{2}^{2}$$

> *Complexity parameter* $\alpha \ge 0$

As **$\alpha$** increases $\to$ **shrinkage** increases $\to$ the coefficients become more robust to collinearity
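
The shrinkage effect can be checked numerically. The sketch below fits `Ridge` for a few values of `alpha` on two nearly collinear features and records the norm of the coefficient vector. The data and the chosen `alpha` values are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Two nearly collinear features (toy data)
rng = np.random.RandomState(0)
x1 = rng.normal(size=50)
x2 = x1 + 0.01 * rng.normal(size=50)
X = np.column_stack([x1, x2])
y = x1 + x2 + 0.1 * rng.normal(size=50)

alphas = [0.01, 1.0, 100.0]
norms = []
for alpha in alphas:
    coef = Ridge(alpha=alpha).fit(X, y).coef_
    norms.append(np.linalg.norm(coef))  # L2 norm of the fitted weights
    print(alpha, coef)

print(norms)  # the norm does not grow as alpha increases
```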

```python
from sklearn.linear_model import Ridge  # parameter = alpha
```

In [3]:
from sklearn import linear_model
reg = linear_model.Ridge(alpha = .5)
reg.fit ([[0, 0], [0, 0], [1, 1]], [0, .1, 1]) 

Ridge(alpha=0.5, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)

In [4]:
reg.coef_

array([ 0.34545455,  0.34545455])

In [5]:
reg.intercept_ 

0.13636363636363641

### Ridge Complexity ###
The complexity is the same as for Ordinary Least Squares
> <b>Cost is</b> $O(np^{2})$

### Setting the regularization parameter: generalized Cross-Validation ###

`RidgeCV` implements ***ridge regression*** with built-in cross-validation of the alpha parameter.

It works in the same way as <b><i>GridSearchCV</i></b>, except that it defaults to Generalized Cross-Validation (GCV), an efficient form of leave-one-out cross-validation.

```python
from sklearn.linear_model import RidgeCV  # parameter = list(alphas)
```

In [6]:
from sklearn import linear_model
reg = linear_model.RidgeCV(alphas=[0.1, 1.0, 10.0])
reg.fit([[0, 0], [0, 0], [1, 1]], [0, .1, 1])       

RidgeCV(alphas=[0.1, 1.0, 10.0], cv=None, fit_intercept=True, gcv_mode=None,
    normalize=False, scoring=None, store_cv_values=False)

In [7]:
reg.alpha_

0.10000000000000001

# Lasso #
*A linear model that estimates sparse coefficients (fewer non-zero parameters).*
Lasso and its variants are fundamental to the field of compressed sensing.

Uses the $\mathcal{L}_{1}$ regularizer:

$$\min_{w} \frac{1}{2n_{samples}} || Xw - y ||_{2} ^{2} + \alpha || w ||_{1}$$

> $\alpha$ is a constant

> $||w||_{1}$ is the $\mathcal{L}_{1}$ norm of the parameter vector
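
To see the sparsity induced by the $\mathcal{L}_{1}$ penalty, the following sketch compares ordinary least squares with `Lasso` on synthetic data in which only two of ten features matter. The data-generating process and `alpha=0.1` are assumptions chosen for the example.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 10))
# Only the first two features influence the target
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.01 * rng.normal(size=100)

ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print(np.count_nonzero(ols.coef_))    # OLS keeps a weight on every feature
print(np.count_nonzero(lasso.coef_))  # Lasso sets some weights exactly to zero
print(lasso.coef_)
```

The non-zero Lasso coefficients are also shrunk toward zero relative to the generating values, which is the price paid for the sparsity.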

```python
from sklearn.linear_model import Lasso  # parameter = alpha
```

In [8]:
from sklearn import linear_model
reg = linear_model.Lasso(alpha = 0.1)
reg.fit([[0, 0], [1, 1]], [0, 1])

Lasso(alpha=0.1, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False)

In [9]:
reg.predict([[1, 1]])

array([ 0.8])

### Setting regularization parameter ###

$\alpha$ : **alpha** controls the degree of sparsity of the estimated coefficients

#### Using cross-validation ####
> LassoCV
>> Usually preferable for high-dimensional datasets with many collinear regressors

> LassoLarsCV
>> Based on Least Angle Regression

>> Often faster when the number of samples is much smaller than the number of features
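
A short sketch of both cross-validated estimators follows. The synthetic dataset from `make_regression` and the choice of 5 folds are assumptions for illustration only.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, LassoLarsCV

X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=1.0, random_state=0)

# Coordinate-descent Lasso with alpha chosen by 5-fold cross-validation
lasso_cv = LassoCV(cv=5, random_state=0).fit(X, y)

# LARS-based Lasso with alpha chosen by 5-fold cross-validation
lars_cv = LassoLarsCV(cv=5).fit(X, y)

print(lasso_cv.alpha_)  # selected alpha, taken from the grid in lasso_cv.alphas_
print(lars_cv.alpha_)   # selected alpha along the LARS path
```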

#### Information-criteria based model selection ####
LassoLarsIC proposes to use the Akaike information criterion (AIC) and the Bayesian information criterion (BIC)
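
As a hedged sketch, `LassoLarsIC` can be fitted once per criterion and the selected `alpha_` values compared. The synthetic data and the `results` dictionary are assumptions for the example.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoLarsIC

X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=1.0, random_state=0)

results = {}
for criterion in ("aic", "bic"):
    model = LassoLarsIC(criterion=criterion).fit(X, y)
    # Store the selected alpha and the number of non-zero coefficients
    results[criterion] = (model.alpha_, int((model.coef_ != 0).sum()))

print(results)
```

Unlike cross-validation, the information criteria need only a single fit over the regularization path, which makes this approach cheap to run.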

#### Comparison with the regularization parameter of SVM ####
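$C$ is the regularization parameter of SVM. Depending on the estimator and the exact objective function being optimized, it relates to $\alpha$ as

$$\alpha = \frac{1}{C}$$

or

$$\alpha = \frac{1}{n_{samples} \times C}$$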

# Multi-task Lasso #
Estimates sparse coefficients for multiple regression problems jointly. The target $Y$ is a 2D array:

<h2>$$[y]_{n_{samples} \times n_{tasks}}$$</h2>

Trained with a mixed $\mathcal{L_{1}L_{2}}$ norm as regularizer:

$$\min_{W} \frac{1}{2n_{samples}} || XW - Y || _{Fro} ^{2} + \alpha ||W|| _{21}$$

***Fro*** : Frobenius norm
$$||A||_{Fro} = \sqrt{\sum_{ij} a _{ij} ^{2}}$$

**$\mathcal{L_{1}L_{2}}$** :
$$||A||_{21} = \sum_{i}\sqrt{\sum_{j}a_{ij}^{2}}$$
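
The two norms can be computed by hand with NumPy, and `MultiTaskLasso` can be fitted on a small two-task problem. This is a sketch under assumed toy data; `A`, `W_true` and `alpha=0.1` are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import MultiTaskLasso

# Norms from the definitions above
A = np.array([[3.0, 4.0], [0.0, 0.0], [6.0, 8.0]])
fro = np.sqrt((A ** 2).sum())              # Frobenius norm
l21 = np.sqrt((A ** 2).sum(axis=1)).sum()  # sum of the row-wise L2 norms
print(fro, l21)

# Two regression tasks that depend on the same two features
rng = np.random.RandomState(0)
X = rng.normal(size=(50, 5))
W_true = np.zeros((5, 2))
W_true[0] = [1.0, 2.0]
W_true[1] = [-1.0, 0.5]
Y = X @ W_true + 0.01 * rng.normal(size=(50, 2))

mtl = MultiTaskLasso(alpha=0.1).fit(X, Y)
print(mtl.coef_.shape)  # (n_tasks, n_features), i.e. the transpose of W
print(mtl.coef_)
```

Because the penalty acts on whole rows of $W$, a feature tends to be selected or discarded for all tasks at once.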

# Elastic Net #

Trained with both the $\mathcal{L_{1}}$ and $\mathcal{L_{2}}$ norms as regularizer.

> Allows learning a sparse model where few of the weights are non-zero, like <i>Lasso</i>, while still maintaining the regularization properties of <i>Ridge</i>

$$\min_{w} \frac{1}{2n_{samples}}||Xw - y||_{2} ^{2} + \alpha\rho || w ||_{1} + \frac{\alpha(1 - \rho)}{2} ||w||_{2}^{2}$$

> <i>Note:</i> set the parameters alpha ($\alpha$) and l1_ratio ($\rho$) by cross-validation
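
One way to follow this note is `ElasticNetCV`, which searches over both parameters. The sketch below assumes a synthetic dataset and a small hypothetical grid of `l1_ratio` candidates.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV

X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=1.0, random_state=0)

# Candidate values for rho (l1_ratio); alpha is searched on an automatic grid
l1_ratios = [0.1, 0.5, 0.9]
enet_cv = ElasticNetCV(l1_ratio=l1_ratios, cv=5).fit(X, y)

print(enet_cv.alpha_)     # selected alpha
print(enet_cv.l1_ratio_)  # selected rho, one of the candidates above
```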

# Multi-task Elastic Net #
Estimates sparse coefficients for multiple regression problems jointly. The target $Y$ is a 2D array:

<h2>$$[y]_{n_{samples} \times n_{tasks}}$$</h2>

Trained with a mixed $\mathcal{L_{1}L_{2}}$ norm and an $\mathcal{L_{2}}$ norm as regularizer:

$$\min_{W}\frac{1}{2n_{samples}} || XW - Y || _{Fro} ^{2} + \alpha\rho || W || _{21} + \frac{\alpha(1 - \rho)}{2} || W || _{Fro} ^{2}$$

> <i>Note:</i> set the parameters alpha ($\alpha$) and l1_ratio ($\rho$) by cross-validation

# Least Angle Regression (LARS) #
A regression algorithm for high-dimensional data.
>At each step, it finds the predictor most correlated with the response. When multiple predictors have equal correlation, instead of continuing along the same predictor, it proceeds in a direction equiangular between the predictors.

><i>Note:</i> similar to forward stepwise regression

# LARS Lasso #
A lasso model fitted with the LARS algorithm. It yields the exact solution, which is piecewise linear as a function of the norm of its coefficients.

### Mathematical formulation ###
Instead of including variables at each step, the estimated parameters are increased in a direction equiangular to each one's correlations with the residual.

> <i>Note:</i> similar to forward stepwise regression
>> <i>coef\_path\_</i> has size $(n_{features} , max_{features} + 1)$

>> The first column is always zero

```python
from sklearn.linear_model import LassoLars  # parameter = alpha
```

In [10]:
from sklearn import linear_model
reg = linear_model.LassoLars(alpha=.1)
reg.fit([[0, 0], [1, 1]], [0, 1])

LassoLars(alpha=0.1, copy_X=True, eps=2.2204460492503131e-16,
     fit_intercept=True, fit_path=True, max_iter=500, normalize=True,
     positive=False, precompute='auto', verbose=False)

In [11]:
reg.coef_ 

array([ 0.71715729,  0.        ])
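
The `coef_path_` attribute mentioned in the mathematical formulation can be inspected after fitting. The sketch below uses assumed synthetic data with four features, two of which drive the target.

```python
import numpy as np
from sklearn.linear_model import LassoLars

rng = np.random.RandomState(0)
X = rng.normal(size=(50, 4))
y = 2.0 * X[:, 0] - 1.0 * X[:, 2] + 0.1 * rng.normal(size=50)

reg = LassoLars(alpha=0.01).fit(X, y)
path = reg.coef_path_

print(path.shape)  # one row per feature, one column per step along the path
print(path[:, 0])  # the path starts with all coefficients at zero
print(reg.coef_)   # coefficients at the requested alpha
```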

# Orthogonal Matching Pursuit (OMP) #
A greedy algorithm that approximates the fit of a linear model with constraints imposed on the number of non-zero coefficients (i.e. the $\mathcal{L}_{0}$ pseudo-norm).

><i>Note:</i> similar to forward stepwise regression

* with a fixed number of non-zero elements

$$\arg\min_{\gamma}||y - X\gamma||_{2}^{2}$$
subject to $||\gamma||_{0} \leq n_{nonzero\_coefs}$

* targeting a specific error instead of a specific number of non-zero coefficients

$$\arg\min_{\gamma} ||\gamma||_{0}$$
subject to $|| y - X\gamma||_{2}^{2} \leq tol$
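
The first formulation (a fixed budget of non-zero coefficients) can be sketched with `OrthogonalMatchingPursuit`. The synthetic data and the sparse vector `w_true` are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 20))

# Sparse ground truth: only three features carry signal
w_true = np.zeros(20)
w_true[[2, 7, 11]] = [1.5, -2.0, 3.0]
y = X @ w_true + 0.01 * rng.normal(size=100)

# Allow at most three non-zero coefficients
omp = OrthogonalMatchingPursuit(n_nonzero_coefs=3).fit(X, y)

print(np.flatnonzero(omp.coef_))  # indices of the selected features
print(omp.coef_[omp.coef_ != 0])  # their estimated weights
```

Passing `tol` instead of `n_nonzero_coefs` would correspond to the second formulation, where the algorithm stops once the squared residual norm falls below the tolerance.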

# Bayesian Regression #

$$p(y|X,w,\alpha) = \mathcal{N}(y|Xw, \alpha)$$

$\alpha$ is treated as a random variable that is estimated from the data.

> * adapts to the data at hand
> * can be used to include regularization parameters in the estimation procedure, by introducing <i>uninformative priors</i> over the hyperparameters of the model


### Bayesian Ridge Regression ###
Similar to the classical <i>Ridge</i>; the prior for the coefficients $w$ is a spherical Gaussian:

$$p(w|\lambda) = \mathcal{N}(w|0, \lambda^{-1}\mathcal{I} _{p})$$

** *Default :* ** $\alpha_{1} = \alpha_{2} = \lambda_{1} = \lambda_{2} = 10^{-6}$

```python
from sklearn.linear_model import BayesianRidge
```

In [12]:
from sklearn import linear_model
X = [[0., 0.], [1., 1.], [2., 2.], [3., 3.]]
Y = [0., 1., 2., 3.]
reg = linear_model.BayesianRidge()
reg.fit(X, Y)

BayesianRidge(alpha_1=1e-06, alpha_2=1e-06, compute_score=False, copy_X=True,
       fit_intercept=True, lambda_1=1e-06, lambda_2=1e-06, n_iter=300,
       normalize=False, tol=0.001, verbose=False)

In [13]:
reg.predict ([[1, 0.]])

array([ 0.50000013])

In [14]:
reg.coef_

array([ 0.49999993,  0.49999993])

### Automatic Relevance Determination - ARD ###
> * Sparse Bayesian Learning
> * Relevance Vector Machine

$$p(w|\lambda) = \mathcal{N}(w|0, A^{-1})$$
with $diag(A) = \lambda = \{ \lambda_{1}, \dots, \lambda_{p}\}$

The distribution over $w$ is assumed to be an axis-parallel, elliptical Gaussian distribution. Each weight $w_{i}$ is drawn from a Gaussian whose
* *center is 0*
* *precision is $\lambda_{i}$*
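
A minimal sketch with `ARDRegression` shows the per-weight precisions in practice. The synthetic data, in which only the first feature is relevant, is an assumption for illustration.

```python
import numpy as np
from sklearn.linear_model import ARDRegression

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 5))
# Only the first feature is relevant
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=100)

ard = ARDRegression().fit(X, y)

print(ard.coef_)    # weights of irrelevant features are driven toward zero
print(ard.lambda_)  # one estimated precision per weight
```

A large estimated precision means the corresponding weight is pinned close to zero, which is how ARD ends up with sparse solutions.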

# Logistic regression #

AKA:
* logit regression
* maximum-entropy classification (MaxEnt)
* log-linear classifier

***$\mathcal{L}_{2}$ penalized logistic regression***
$$\min_{w,c}\frac{1}{2}w^{T}w + C \sum _{i = 1} ^{n}\log(\exp(-y_{i}(X_{i}^{T}w+c))+1)$$

***$\mathcal{L}_{1}$ regularized logistic regression***
$$\min_{w,c}||w||_{1}+C\sum_{i=1}^{n}\log(\exp(-y_{i}(X_{i}^{T}w+c))+1)$$

| Case | Solver |
|------|------|
| L1 penalty | liblinear, saga |
| Multinomial loss | lbfgs, sag, saga, newton-cg |
| Very Large dataset (n_samples) | sag, saga |
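
The effect of the two penalties can be compared with the `liblinear` solver, which supports both. This is a sketch on assumed synthetic data; the `penalty` argument follows the parameter table in these notes, and other scikit-learn releases may spell the option differently.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=20, n_informative=3,
                           n_redundant=0, random_state=0)

# A small C means strong regularization (C is the inverse regularization strength)
l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.05).fit(X, y)
l2 = LogisticRegression(penalty="l2", solver="liblinear", C=0.05).fit(X, y)

print(np.count_nonzero(l1.coef_))  # L1 zeroes out many weights
print(np.count_nonzero(l2.coef_))  # L2 shrinks weights but keeps them non-zero
print(l1.predict_proba(X[:3]))     # class probabilities for three samples
```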


```python
from sklearn.linear_model import LogisticRegression
```

> * coef_
> * intercept_
> * n_iter_

| Parameters | type | value | default |
|------------|------|-------|---------|
| penalty | str | ‘l1’ or ‘l2’ | ‘l2’ |
| dual | bool |True or False | False |
| tol | float  ||1e-4|
| C | float | | 1.0 |
| fit_intercept | bool | | True |
| intercept_scaling | float | | 1 |
| class_weight | dict | ‘balanced’ | None |
| random_state | int | RandomState instance or None, optional | None |
| solver | str | {‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’, ‘saga’} | ‘liblinear’ |
| max_iter | int | | 100 |
| multi_class | str | {‘ovr’, ‘multinomial’} | ‘ovr’ |
| verbose | int| | 0|
| warm_start | bool| | False |
| n_jobs | int| | 1 |

| Methods | Parameters | return |
|---------|------------|--------|
| decision_function | X | array |
| densify | | self |
| fit | X, y, sample_weight=None | self |
| get_params | deep=True | params |
| predict | X | c |
| predict_log_proba | X | T |
| predict_proba | X | T |
| score | X, y, sample_weight=None | score |
| set_params | \*\*params | self | 
| sparsify | | self |

# Stochastic Gradient Descent - SGD

```python
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import SGDRegressor
```

Use it when the number of samples or features is very large.
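
A minimal sketch of `SGDClassifier` on two well-separated clusters follows. The toy data, the scaling step and the hyperparameters are assumptions for the example. Feature scaling is included because SGD is sensitive to the scale of the inputs.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Two Gaussian clusters centered at (-3, -3) and (3, 3)
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(loc=-3.0, size=(100, 2)),
               rng.normal(loc=3.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

# Standardize the features, then fit a linear classifier with SGD
clf = make_pipeline(StandardScaler(),
                    SGDClassifier(max_iter=1000, tol=1e-3, random_state=0))
clf.fit(X, y)

accuracy = clf.score(X, y)
print(accuracy)  # training accuracy
```

For data that does not fit in memory, `SGDClassifier` also offers `partial_fit`, which updates the model one mini-batch at a time.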