# Linear Models

## Ordinary Least Squares

`LinearRegression` fits a linear model with coefficients w = (w_1, ..., w_p) to minimize the residual sum of squares between the observed targets in the dataset and the targets predicted by the linear approximation. Mathematically, it solves a problem of the form:

![image-2.png](attachment:image-2.png)


 `sklearn.linear_model.LinearRegression(*, fit_intercept=True, copy_X=True, n_jobs=None, positive=False)`

* **fit_intercept** bool, default=True
Whether to calculate the intercept for this model. If set to False, no intercept will be used in calculations (i.e. data is expected to be centered).

* **copy_X** bool, default=True
If True, X will be copied; else, it may be overwritten.

* **n_jobs** int, default=None
The number of jobs to use for the computation. This only provides a speedup for sufficiently large problems, that is, if n_targets > 1 and X is sparse, or if positive is set to True. None means 1 unless in a joblib.parallel_backend context; -1 means using all processors. See the scikit-learn Glossary for more details.

* **positive** bool, default=False
When set to True, forces the coefficients to be positive. This option is only supported for dense arrays.


Methods

* `fit(X, y[, sample_weight])`: Fit linear model.

* `get_metadata_routing()`: Get metadata routing of this object.

* `get_params([deep])`: Get parameters for this estimator.

* `predict(X)`: Predict using the linear model.

* `score(X, y[, sample_weight])`: Return the coefficient of determination of the prediction.

* `set_fit_request(*[, sample_weight])`: Request metadata passed to the fit method.

* `set_params(**params)`: Set the parameters of this estimator.

* `set_score_request(*[, sample_weight])`: Request metadata passed to the score method.
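A minimal sketch of the `fit`, `predict`, and `score` methods listed above. The synthetic, noise-free data set is an assumption made purely for illustration, so the fitted coefficients should match the generating ones closely.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (assumed for illustration): y = 1 + 2*x0 + 3*x1, no noise
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 1.0 + X @ np.array([2.0, 3.0])

ols = LinearRegression(fit_intercept=True)
ols.fit(X, y)

coef = ols.coef_             # estimated coefficients w
intercept = ols.intercept_   # estimated intercept term
r2 = ols.score(X, y)         # coefficient of determination R^2
y_pred = ols.predict(X[:5])  # predictions for the first five rows
```

With `fit_intercept=False` the intercept would be fixed at zero, which only makes sense if the data is already centered.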

## Ridge regression

![image.png](attachment:image.png)

`sklearn.linear_model.Ridge(alpha=1.0, *, fit_intercept=True, copy_X=True, max_iter=None, tol=0.0001, solver='auto', positive=False, random_state=None)`

* **max_iter** int, default=None
Maximum number of iterations for conjugate gradient solver. For ‘sparse_cg’ and ‘lsqr’ solvers, the default value is determined by scipy.sparse.linalg. For ‘sag’ solver, the default value is 1000. For ‘lbfgs’ solver, the default value is 15000.

* **tol** float, default=1e-4
The precision of the solution (coef_) is determined by tol which specifies a different convergence criterion for each solver:

 * ‘svd’: tol has no impact.

 * ‘cholesky’: tol has no impact.

 * ‘sparse_cg’: norm of residuals smaller than tol.

 * ‘lsqr’: tol is set as atol and btol of scipy.sparse.linalg.lsqr, which control the norm of the residual vector in terms of the norms of matrix and coefficients.

 * ‘sag’ and ‘saga’: relative change of coef smaller than tol.

 * ‘lbfgs’: maximum of the absolute (projected) gradient=max|residuals| smaller than tol.

* **solver** {‘auto’, ‘svd’, ‘cholesky’, ‘lsqr’, ‘sparse_cg’, ‘sag’, ‘saga’, ‘lbfgs’}, default=’auto’
Solver to use in the computational routines:

 * ‘auto’ chooses the solver automatically based on the type of data.

 * ‘svd’ uses a Singular Value Decomposition of X to compute the Ridge coefficients. It is the most stable solver, in particular more stable for singular matrices than ‘cholesky’ at the cost of being slower.

 * ‘cholesky’ uses the standard scipy.linalg.solve function to obtain a closed-form solution.

 * ‘sparse_cg’ uses the conjugate gradient solver as found in scipy.sparse.linalg.cg. As an iterative algorithm, this solver is more appropriate than ‘cholesky’ for large-scale data (possibility to set tol and max_iter).

 * ‘lsqr’ uses the dedicated regularized least-squares routine scipy.sparse.linalg.lsqr. It is the fastest and uses an iterative procedure.

 * ‘sag’ uses Stochastic Average Gradient descent, and ‘saga’ uses its improved, unbiased version named SAGA. Both methods also use an iterative procedure, and are often faster than other solvers when both n_samples and n_features are large. Note that fast convergence of ‘sag’ and ‘saga’ is only guaranteed on features with approximately the same scale. You can preprocess the data with a scaler from sklearn.preprocessing.

 * ‘lbfgs’ uses the L-BFGS-B algorithm implemented in scipy.optimize.minimize. It can be used only when positive is True.
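As a rough illustration of the solver options above, the sketch below fits `Ridge` on the same data with three solvers and compares the coefficients. The synthetic data and the specific `alpha` and `tol` values are assumptions chosen for the example, not recommendations.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic regression data (assumed for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 3.0]) + 0.1 * rng.normal(size=100)

# Direct solvers: tol has no impact on these
ridge_svd = Ridge(alpha=1.0, solver="svd").fit(X, y)
ridge_chol = Ridge(alpha=1.0, solver="cholesky").fit(X, y)

# Iterative solver: tol and max_iter control when it stops
ridge_lsqr = Ridge(alpha=1.0, solver="lsqr", tol=1e-10, max_iter=1000).fit(X, y)

# A much larger alpha shrinks the coefficient vector towards zero
ridge_strong = Ridge(alpha=1000.0, solver="svd").fit(X, y)
norm_weak = np.linalg.norm(ridge_svd.coef_)
norm_strong = np.linalg.norm(ridge_strong.coef_)
```

On a small, well-conditioned problem like this one, the solvers should agree to within numerical tolerance; the choice matters mainly for large, sparse, or ill-conditioned data.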

## Lasso

The Lasso is a linear model that estimates sparse coefficients. It is useful in some contexts due to its tendency to prefer solutions with fewer non-zero coefficients, effectively reducing the number of features upon which the given solution is dependent. For this reason, Lasso and its variants are fundamental to the field of compressed sensing. Under certain conditions, it can recover the exact set of non-zero coefficients.

![image.png](attachment:image.png)

`sklearn.linear_model.Lasso(alpha=1.0, *, fit_intercept=True, precompute=False, copy_X=True, max_iter=1000, tol=0.0001, warm_start=False, positive=False, random_state=None, selection='cyclic')`
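The sparsity described above can be seen in a small sketch. The data is synthetic and assumed for illustration: only three of ten features actually influence the target, and the value `alpha=0.1` is an arbitrary choice for this example.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data (assumed): only features 0, 3 and 7 carry signal
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true_coef = np.zeros(10)
true_coef[[0, 3, 7]] = [4.0, -3.0, 2.0]
y = X @ true_coef + 0.01 * rng.normal(size=200)

lasso = Lasso(alpha=0.1, max_iter=1000, tol=1e-4)
lasso.fit(X, y)

# Indices of the coefficients that Lasso left non-zero
support = np.flatnonzero(lasso.coef_)
```

The L1 penalty also shrinks the surviving coefficients slightly towards zero, so their magnitudes come out a little smaller than the generating values.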




## Multi-layer Perceptron

![image.png](attachment:image.png)

![image-2.png](attachment:image-2.png)

![image-3.png](attachment:image-3.png)

`sklearn.neural_network.MLPRegressor(hidden_layer_sizes=(100,), activation='relu', *, solver='adam', alpha=0.0001, batch_size='auto', learning_rate='constant', learning_rate_init=0.001, power_t=0.5, max_iter=200, shuffle=True, random_state=None, tol=0.0001, verbose=False, warm_start=False, momentum=0.9, nesterovs_momentum=True, early_stopping=False, validation_fraction=0.1, beta_1=0.9, beta_2=0.999, epsilon=1e-08, n_iter_no_change=10, max_fun=15000)`


* **hidden_layer_sizes array-like of shape(n_layers - 2,), default=(100,)**
The ith element represents the number of neurons in the ith hidden layer.

* **activation{‘identity’, ‘logistic’, ‘tanh’, ‘relu’}, default=’relu’**
Activation function for the hidden layer.

 * ‘identity’, no-op activation, useful to implement linear bottleneck, returns f(x) = x

 * ‘logistic’, the logistic sigmoid function, returns f(x) = 1 / (1 + exp(-x)).

 * ‘tanh’, the hyperbolic tan function, returns f(x) = tanh(x).

 * ‘relu’, the rectified linear unit function, returns f(x) = max(0, x)
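The four activation formulas above can be written directly in NumPy. This is a plain sketch of the formulas, not scikit-learn's internal implementation; the function and variable names are chosen here for illustration.

```python
import numpy as np

def identity(x):
    # f(x) = x
    return x

def logistic(x):
    # f(x) = 1 / (1 + exp(-x))
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # f(x) = max(0, x), applied element-wise
    return np.maximum(0.0, x)

# Illustrative lookup table; tanh is taken straight from NumPy
ACTIVATIONS = {
    "identity": identity,
    "logistic": logistic,
    "tanh": np.tanh,
    "relu": relu,
}

x = np.array([-2.0, 0.0, 2.0])
outputs = {name: fn(x) for name, fn in ACTIVATIONS.items()}
```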

* **solver {‘lbfgs’, ‘sgd’, ‘adam’}, default=’adam’**
The solver for weight optimization.

 * ‘lbfgs’ is an optimizer in the family of quasi-Newton methods.

 * ‘sgd’ refers to stochastic gradient descent.

 * ‘adam’ refers to a stochastic gradient-based optimizer proposed by Diederik Kingma and Jimmy Ba.

Note: The default solver ‘adam’ works pretty well on relatively large datasets (with thousands of training samples or more) in terms of both training time and validation score. For small datasets, however, ‘lbfgs’ can converge faster and perform better.
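One way to check the note above on your own data is to fit the same network with both solvers and compare scores. The sketch below does this on a small synthetic data set (an assumption for illustration); it makes no claim about which solver wins in general.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Small synthetic data set (assumed): 80 samples of a smooth 1-D function
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(80, 1))
y = np.sin(3.0 * X[:, 0])

scores = {}
for solver in ("lbfgs", "adam"):
    mlp = MLPRegressor(
        hidden_layer_sizes=(20,),
        solver=solver,
        max_iter=500,
        random_state=0,
    )
    mlp.fit(X, y)
    # Training-set R^2; use a held-out split for a fairer comparison
    scores[solver] = mlp.score(X, y)
```

If a solver stops at `max_iter` before converging, scikit-learn emits a `ConvergenceWarning`; that is a hint to raise `max_iter` rather than an error.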

* **alpha float, default=0.0001**
Strength of the L2 regularization term. The L2 regularization term is divided by the sample size when added to the loss.

* **batch_size int, default=’auto’**
Size of minibatches for stochastic optimizers. If the solver is ‘lbfgs’, the regressor will not use minibatch. When set to “auto”, batch_size=min(200, n_samples).

* **learning_rate {‘constant’, ‘invscaling’, ‘adaptive’}, default=’constant’**
Learning rate schedule for weight updates.

 * ‘constant’ is a constant learning rate given by ‘learning_rate_init’.

 * ‘invscaling’ gradually decreases the learning rate at each time step ‘t’ using an inverse scaling exponent of ‘power_t’: `effective_learning_rate = learning_rate_init / pow(t, power_t)`.

 * ‘adaptive’ keeps the learning rate constant to ‘learning_rate_init’ as long as training loss keeps decreasing. Each time two consecutive epochs fail to decrease training loss by at least tol, or fail to increase validation score by at least tol if ‘early_stopping’ is on, the current learning rate is divided by 5.

Only used when solver=’sgd’.
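The ‘invscaling’ formula and the ‘adaptive’ division by 5 can be worked through numerically. This is a plain-Python sketch of the arithmetic only: the helper names are hypothetical, and `t` is treated as a simple positive step counter, which is an assumption about how time steps are counted.

```python
def invscaling_learning_rate(learning_rate_init, t, power_t=0.5):
    """effective_learning_rate = learning_rate_init / pow(t, power_t)."""
    return learning_rate_init / pow(t, power_t)

def adaptive_reduce(current_learning_rate):
    """One 'adaptive' reduction: the current learning rate is divided by 5."""
    return current_learning_rate / 5.0

# With the defaults learning_rate_init=0.001 and power_t=0.5,
# the rate falls with the square root of the step counter t.
schedule = [invscaling_learning_rate(0.001, t) for t in (1, 4, 100)]

# Two consecutive 'adaptive' reductions starting from 0.001
reduced_once = adaptive_reduce(0.001)
reduced_twice = adaptive_reduce(reduced_once)
```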

* **learning_rate_init float, default=0.001**
The initial learning rate used. It controls the step-size in updating the weights. Only used when solver=’sgd’ or ‘adam’.

* **power_t float, default=0.5**
The exponent for inverse scaling learning rate. It is used in updating effective learning rate when the learning_rate is set to ‘invscaling’. Only used when solver=’sgd’.

* **max_iter int, default=200**
Maximum number of iterations. The solver iterates until convergence (determined by ‘tol’) or this number of iterations. For stochastic solvers (‘sgd’, ‘adam’), note that this determines the number of epochs (how many times each data point will be used), not the number of gradient steps.

* **shuffle bool, default=True**
Whether to shuffle samples in each iteration. Only used when solver=’sgd’ or ‘adam’.

* **random_state int, RandomState instance, default=None**
Determines random number generation for weights and bias initialization, train-test split if early stopping is used, and batch sampling when solver=’sgd’ or ‘adam’. Pass an int for reproducible results across multiple function calls. See Glossary.

* **tol float, default=1e-4**
Tolerance for the optimization. When the loss or score is not improving by at least tol for n_iter_no_change consecutive iterations, unless learning_rate is set to ‘adaptive’, convergence is considered to be reached and training stops.

* **verbose bool, default=False**
Whether to print progress messages to stdout.

* **warm_start bool, default=False**
When set to True, reuse the solution of the previous call to fit as initialization, otherwise, just erase the previous solution. See the Glossary.

* **momentum float, default=0.9**
Momentum for gradient descent update. Should be between 0 and 1. Only used when solver=’sgd’.

* **nesterovs_momentum bool, default=True**
Whether to use Nesterov’s momentum. Only used when solver=’sgd’ and momentum > 0.

* **early_stopping bool, default=False**
Whether to use early stopping to terminate training when validation score is not improving. If set to True, it will automatically set aside validation_fraction of training data as validation and terminate training when validation score is not improving by at least tol for n_iter_no_change consecutive epochs. Only effective when solver=’sgd’ or ‘adam’.

* **validation_fraction float, default=0.1**
The proportion of training data to set aside as validation set for early stopping. Must be between 0 and 1. Only used if early_stopping is True.

* **beta_1 float, default=0.9**
Exponential decay rate for estimates of first moment vector in adam, should be in [0, 1). Only used when solver=’adam’.

* **beta_2 float, default=0.999**
Exponential decay rate for estimates of second moment vector in adam, should be in [0, 1). Only used when solver=’adam’.

* **epsilon float, default=1e-8**
Value for numerical stability in adam. Only used when solver=’adam’.

* **n_iter_no_change int, default=10**
Maximum number of epochs to not meet tol improvement. Only effective when solver=’sgd’ or ‘adam’.

* **max_fun int, default=15000**
Only used when solver=’lbfgs’. Maximum number of function calls. The solver iterates until convergence (determined by tol), until the number of iterations reaches max_iter, or until this number of function calls is reached. Note that the number of function calls will be greater than or equal to the number of iterations for the MLPRegressor.
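To tie several of the parameters above together, here is a minimal sketch that trains an `MLPRegressor` with two hidden layers and early stopping. The synthetic data and the chosen layer sizes are assumptions for illustration only.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Synthetic data (assumed): a linear target over three features
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(500, 3))
y = X[:, 0] - 2.0 * X[:, 1] + 0.5 * X[:, 2]

mlp = MLPRegressor(
    hidden_layer_sizes=(32, 16),  # two hidden layers
    activation="relu",
    solver="adam",
    alpha=1e-4,                   # L2 regularization strength
    learning_rate_init=0.001,
    max_iter=200,                 # upper bound on the number of epochs
    early_stopping=True,          # hold out part of the training data
    validation_fraction=0.1,
    n_iter_no_change=10,
    tol=1e-4,
    random_state=0,               # reproducible init, split and batches
)
mlp.fit(X, y)

n_epochs = mlp.n_iter_                        # epochs actually run
weight_shapes = [w.shape for w in mlp.coefs_]  # one matrix per layer transition
val_scores = mlp.validation_scores_           # recorded because early_stopping=True
```

With `hidden_layer_sizes=(32, 16)` and three input features, the weight matrices connect 3 to 32, 32 to 16, and 16 to 1 units.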