# Linear Regression
## Introduction
Linear regression is a method for modeling the relationship between a dependent variable and one or more independent variables. In its simplest form, it consists of fitting a function y = w·x + b to observed data, where y is the dependent variable, x the independent variable, w the weights and b the bias.
![](images/img1.png)
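
As a rough illustration of the model y = w·x + b, the sketch below evaluates it for a single sample. The function name `predict` and the numeric values are assumptions chosen for the example, not part of the original text.

```python
def predict(x, w, b):
    """Return w . x + b for one sample x (a list of feature values)."""
    return sum(w_j * x_j for w_j, x_j in zip(w, x)) + b


# Hypothetical weights, bias and input, chosen only for illustration.
w = [2.0, -1.0]
b = 0.5
x = [3.0, 4.0]

y_hat = predict(x, w, b)  # 2*3 + (-1)*4 + 0.5
print(y_hat)
```

With a single feature, `w` and `x` reduce to one number each and the model is the familiar line y = wx + b.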
## Loss Function

A loss function is a way to map the performance of our model into a real number. It measures how well the model is performing its task, be it a linear regression model fitting the data to a line, a neural network correctly classifying an image of a character, etc. The loss function is particularly important in learning since it is what guides the update of the parameters so that the model can perform better.

One such function is the Squared Loss, also called the mean squared error, which measures the average of the squared differences between the estimates and the ground-truth values.


The squared loss function can be seen below
![](images/img2.png)
where,

M is the number of training points 

y is the ground-truth value

ŷ is the estimated value
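
The definition above can be sketched in a few lines of Python. This assumes the loss is normalized by 1/M; some texts use 1/(2M) instead, which only rescales the value. The name `mean_squared_error` and the sample numbers are hypothetical.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and estimates."""
    m = len(y_true)  # M, the number of training points
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / m


# Hypothetical ground-truth values and model estimates.
y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.5, 2.0]

loss = mean_squared_error(y_true, y_pred)  # (0 + 0.25 + 1) / 3
print(loss)
```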

Now, let's substitute the linear model ŷ = w·x + b into the equation above to get the loss function for linear regression:
![](images/img3.png)
Naturally, we want the model with the smallest possible mean squared error (MSE). One common way to find it is Gradient Descent. Let's see what Gradient Descent is.

## Gradient Descent
This optimization algorithm uses gradients to update the model parameters (w and b in our case) step by step until a minimum is found and the gradient becomes zero. For convex functions, convergence to the global minimum is guaranteed (with some reservations), since every point where the gradient of a convex function is zero is a global minimum.

An animation of the Gradient Descent method is shown below
![](images/img4.mov)
Now let's look at the derivatives of the loss above with respect to w and b:
![](images/img5.png)
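
One way to sanity-check such derivatives is to compare them against finite differences. The sketch below assumes a single input feature and the 1/M-normalized squared loss, so the analytic gradients carry a factor of -2/M; if the loss is defined with 1/(2M), that factor becomes -1/M. All names and data are hypothetical.

```python
def mse(xs, ys, w, b):
    """Mean squared error of the line y = w*x + b on the data."""
    m = len(xs)
    return sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys)) / m


def mse_gradients(xs, ys, w, b):
    """Analytic partial derivatives of the MSE with respect to w and b."""
    m = len(xs)
    residuals = [y - (w * x + b) for x, y in zip(xs, ys)]
    dw = -2.0 / m * sum(x * r for x, r in zip(xs, residuals))
    db = -2.0 / m * sum(residuals)
    return dw, db


# Hypothetical toy data, roughly following y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.1, 2.9, 5.2, 6.8]
w, b = 0.5, 0.0

dw, db = mse_gradients(xs, ys, w, b)

# Central finite differences as an independent check of the formulas.
eps = 1e-6
dw_numeric = (mse(xs, ys, w + eps, b) - mse(xs, ys, w - eps, b)) / (2 * eps)
db_numeric = (mse(xs, ys, w, b + eps) - mse(xs, ys, w, b - eps)) / (2 * eps)

print(dw, dw_numeric)
print(db, db_numeric)
```

Both gradients come out negative here because the starting line lies below the data, so increasing w or b reduces the loss.
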
Remember from calculus that the gradient points in the direction of steepest ascent. Since we want the cost to decrease, we flip its sign, which gives the update equations below:
![](images/img6.png)
Here α is called the learning rate and relates to how much we trust the gradient at a given point; it is usually the case that 0<α<1. Setting the learning rate too high risks overshooting the minimum and can lead to divergence, as the animation below shows:
![](images/img7.mov)
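
The effect of the learning rate can also be seen numerically. The sketch below is an assumption-laden toy example, not the setup used in the animations: it minimizes f(w) = w² (gradient 2w), for which each update multiplies w by (1 - 2α). The function names are hypothetical.

```python
def gradient_descent_1d(grad, w0, alpha, steps):
    """Repeat w <- w - alpha * grad(w) and return every visited value of w."""
    path = [w0]
    w = w0
    for _ in range(steps):
        w = w - alpha * grad(w)
        path.append(w)
    return path


def grad(w):
    """Derivative of the toy objective f(w) = w**2."""
    return 2.0 * w


# Small learning rate: each step shrinks w by a factor of 0.8.
small = gradient_descent_1d(grad, w0=1.0, alpha=0.1, steps=50)

# Large learning rate: each step multiplies w by -1.2, so it overshoots
# the minimum at w = 0 and ends up farther away every iteration.
large = gradient_descent_1d(grad, w0=1.0, alpha=1.1, steps=50)

print(abs(small[-1]), abs(large[-1]))
```

The threshold at which divergence starts depends on the curvature of the function being minimized, so the values used here do not carry over to other problems.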

### Batch Gradient Descent
Batch gradient descent computes the gradient using the *whole dataset*. This works well for convex or relatively smooth error manifolds, where we move fairly directly towards an optimum, either local or global. Gradient descent has often been regarded as slow and unreliable. In the past, applying gradient descent to nonconvex optimization problems was regarded as foolhardy and unprincipled.
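
A minimal sketch of batch gradient descent for a one-feature linear regression follows. It assumes the 1/M-normalized squared loss; the function name, learning rate, epoch count and data are hypothetical choices for illustration.

```python
def batch_gradient_descent(xs, ys, alpha=0.05, epochs=2000):
    """Fit y = w*x + b; every update uses the gradient over all samples."""
    w, b = 0.0, 0.0
    m = len(xs)
    for _ in range(epochs):
        # Residuals over the whole dataset for the current parameters.
        residuals = [y - (w * x + b) for x, y in zip(xs, ys)]
        dw = -2.0 / m * sum(x * r for x, r in zip(xs, residuals))
        db = -2.0 / m * sum(residuals)
        # Step against the gradient.
        w -= alpha * dw
        b -= alpha * db
    return w, b


# Noise-free hypothetical data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]

w, b = batch_gradient_descent(xs, ys)
print(w, b)
```

Each epoch touches every sample once and produces exactly one parameter update, which is why the cost per update grows with the size of the dataset.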

A recurring problem in ML is that large training sets are necessary for good generalization, but large training sets are also computationally expensive. In this setting, the SGD algorithm can help.

### Stochastic Gradient Descent
There is another way to feed the dataset to the model: instead of passing the whole dataset at once, we can pass it as a set of minibatches. Stochastic gradient descent (SGD) works exactly this way: the dataset is divided into several minibatches, and each minibatch is passed to the model in turn, with the parameters updated after each one.
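
The minibatch scheme described above might look like the following sketch for a one-feature linear regression. The function name, learning rate, batch size, epoch count, fixed random seed and data are all hypothetical, and the loss is assumed to be the squared loss averaged over each minibatch.

```python
import random


def minibatch_sgd(xs, ys, alpha=0.02, epochs=2000, batch_size=2, seed=0):
    """Fit y = w*x + b, updating the parameters after every minibatch."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # new minibatch split every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient estimate from this minibatch only.
            residuals = [ys[i] - (w * xs[i] + b) for i in batch]
            dw = -2.0 / len(batch) * sum(xs[i] * r for i, r in zip(batch, residuals))
            db = -2.0 / len(batch) * sum(residuals)
            w -= alpha * dw
            b -= alpha * db
    return w, b


# Noise-free hypothetical data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0 * x + 1.0 for x in xs]

w, b = minibatch_sgd(xs, ys)
print(w, b)
```

Compared with the full-batch approach, each update here is cheaper but noisier, and one pass over the data produces several updates instead of one.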

Minibatch sizes are generally driven by the following factors:
<ol>
    <li>Larger batches provide a more accurate estimate of the gradient, but with less than linear returns.</li>
    <li>Multicore architectures are usually underutilized by extremely small batches. This motivates using some absolute minimum batch size, below which there is no reduction in the time to process a minibatch.</li>
    <li>If all examples in the batch are to be processed in parallel (as is typically the case), then the amount of memory scales with the batch size. For many hardware setups this is the limiting factor in batch size.</li>
    <li>Some kinds of hardware achieve better runtime with specific sizes of arrays. Especially when using GPUs, it is common for power of 2 batch sizes to offer better runtime. Typical power of 2 batch sizes range from 32 to 256, with 16 sometimes being attempted for large models.</li>
    <li>Small batches can offer a regularizing effect (Wilson and Martinez, 2003), perhaps due to the noise they add to the learning process. Generalization error is often best for a batch size of 1. Training with such a small batch size might require a small learning rate to maintain stability because of the high variance in the estimate of the gradient. The total runtime can be very high as a result of the need to make more steps, both because of the reduced learning rate and because it takes more steps to observe the entire training set