### Different components of a simple linear regression model

In Ordinary Least Squares (OLS) linear regression, our goal is to find the best-fitting line, given by the following equation:

<img src="images/ols_equation.png">

That is, we have to find the line (or hyperplane) that minimizes the vertical offsets between the data points and the line (the `red` lines in the following figure).

<img src="images/ols_described_in_graph.png">

The best-fitting line minimizes the `sum of squared errors` (**SSE**) or, equivalently, the `mean squared error` (**MSE**) between the target variable ($y$) and the predicted output over all samples $i$ in our dataset of size $n$. Both criteria select the same line, since MSE is just SSE divided by $n$.

<img src="images/ols_sse_mse.png">
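As a minimal sketch of these two quantities, the snippet below computes SSE and MSE for a small set of targets and predictions. The data values and the helper names `sse` and `mse` are made up for illustration and are not part of the original notebook.

```python
def sse(y_true, y_pred):
    """Sum of squared errors between targets and predictions."""
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred))


def mse(y_true, y_pred):
    """Mean squared error: SSE divided by the number of samples n."""
    return sse(y_true, y_pred) / len(y_true)


# Made-up targets and predictions, for illustration only
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 10.0]

print(sse(y_true, y_pred))  # 0.25 + 0.25 + 0.0 + 1.0 = 1.5
print(mse(y_true, y_pred))  # 1.5 / 4 = 0.375
```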

`Linear regression` is the process of fitting a linear hypothesis to the data. The fit can be obtained using one of the following approaches:

- Solving for the model parameters analytically, using the least-squares loss and the normal equations (a closed-form solution). (e.g. class <span style="background-color:#CBFF33"> LinearRegression </span>)
- Applying an optimization algorithm such as Gradient Descent, Stochastic Gradient Descent, Newton's Method, Simplex Method, etc. (e.g. class <span style="background-color:#CBFF33"> SGDRegressor </span>)
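To make the first, analytical approach concrete, here is a minimal sketch for the single-feature case, where solving the normal equations reduces to two short formulas. The function name `fit_simple_ols`, the parameter names `w0` (intercept) and `w1` (slope), and the data are assumptions made for this example; it is not meant to show how `LinearRegression` is implemented internally.

```python
def fit_simple_ols(x, y):
    """Closed-form least-squares fit of y = w0 + w1 * x for one feature."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Slope: co-variation of x and y divided by the variation of x
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    w1 = s_xy / s_xx
    # Intercept: the fitted line passes through the point of means
    w0 = y_mean - w1 * x_mean
    return w0, w1


# Made-up, noise-free data generated from y = 1 + 2x
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 5.0, 7.0, 9.0]

w0, w1 = fit_simple_ols(x, y)
print(w0, w1)  # 1.0 2.0
```

No iteration or learning rate is involved here: the parameters come out of a single calculation, which is the main practical difference from the optimization-based approach.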

### Gradient Descent (GD) based optimization techniques

- With `SGDRegressor`, we specify the `loss function` and the penalty, and it uses `stochastic gradient descent (SGD)` to do the fitting.

<div class="alert alert-block alert-info">
<b>SGD:</b> In SGD we repeatedly run through the training set <span style="background-color:#33ECFF"> one data point at a time </span> and update the parameters according to the gradient of the error with respect to each individual data point.
</div>

<div class="alert alert-block alert-info">
<b>SGDRegressor:</b> Linear model fitted by minimizing a regularized empirical loss with SGD.
</div>
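The per-sample update described in the SGD note can be sketched in a few lines of plain Python. This is an illustrative toy, not the algorithm `SGDRegressor` actually runs: the function name `sgd_fit`, the learning rate `lr`, the epoch count `n_epochs`, and the data are all assumptions chosen for the example, and no regularization penalty is applied.

```python
import random


def sgd_fit(x, y, lr=0.01, n_epochs=1000, seed=0):
    """Fit y = w0 + w1 * x with plain SGD on the squared error."""
    rng = random.Random(seed)
    w0, w1 = 0.0, 0.0
    indices = list(range(len(x)))
    for _ in range(n_epochs):
        rng.shuffle(indices)  # visit the samples in a new random order
        for i in indices:
            # Prediction error for ONE data point
            error = (w0 + w1 * x[i]) - y[i]
            # Gradient of 0.5 * error**2 with respect to w0 and w1
            w0 -= lr * error
            w1 -= lr * error * x[i]
    return w0, w1


# Made-up, noise-free data generated from y = 1 + 2x
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 5.0, 7.0, 9.0]

w0, w1 = sgd_fit(x, y)
print(round(w0, 3), round(w1, 3))  # should be close to 1.0 and 2.0
```

Each inner-loop step uses the gradient from a single sample, so the parameters are updated `len(x)` times per pass over the data instead of once.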

Be sure to set the number of iterations (`max_iter` in current scikit-learn, `n_iter` in older releases) high enough to get good convergence.
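A hedged usage sketch of this advice follows. In recent scikit-learn releases the iteration budget is set through `max_iter`, together with the stopping tolerance `tol`, which replaced the older `n_iter` argument. The synthetic data, the specific values of `max_iter` and `tol`, and the feature-scaling step are assumptions made for this example rather than settings from the original notebook.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic data, for illustration only: y = 1 + 2x plus a little noise
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 10.0, size=(200, 1))
y = 1.0 + 2.0 * X[:, 0] + rng.normal(0.0, 0.1, size=200)

# SGD is sensitive to feature scale, so standardize the feature first
X_scaled = StandardScaler().fit_transform(X)

# max_iter caps the number of passes (epochs) over the training data;
# tol lets training stop earlier once the loss has stopped improving
model = SGDRegressor(max_iter=2000, tol=1e-4, random_state=0)
model.fit(X_scaled, y)

print(model.n_iter_)             # epochs actually run before stopping
print(model.score(X_scaled, y))  # R^2 on the training data
```

If `n_iter_` comes out equal to `max_iter`, the budget was probably too small and is worth raising.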




#### Let's assume a cost function with `more than one minimum`.
In this scenario, how will regular batch GD and SGD behave?
- Regular batch GD walks toward one `minimum`, which may turn out to be a local minimum rather than the global minimum.
- SGD, however, tends to jump around because of its randomness, so it may cross between the regions around different minima rather than being confined to a single one. This is one of the prominent advantages of SGD.
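The contrast between the two behaviors can be illustrated with a toy one-dimensional cost function. Everything in this sketch is an assumption made for illustration: the cost `w**4 - 4*w**2 + w` (local minimum near `w = 1.35`, global minimum near `w = -1.47`), the step size, and above all the use of Gaussian noise added to the exact gradient as a crude stand-in for the randomness that real SGD gets from sampling individual data points.

```python
import random


def cost(w):
    """Toy non-convex cost: local minimum near w = 1.35,
    lower (global) minimum near w = -1.47."""
    return w**4 - 4 * w**2 + w


def grad(w):
    """Exact derivative of the toy cost."""
    return 4 * w**3 - 8 * w + 1


def descend(w, lr=0.01, n_steps=5000, noise_std=0.0, rng=None):
    """Gradient descent; noise_std > 0 adds Gaussian noise to each gradient
    (an assumed stand-in for SGD's per-sample randomness)."""
    for _ in range(n_steps):
        noise = rng.gauss(0.0, noise_std) if noise_std > 0 else 0.0
        w -= lr * (grad(w) + noise)
    return w


w_start = 2.0  # starts inside the basin of the local minimum

# Batch GD is deterministic: it slides into the nearest minimum and stays
w_batch = descend(w_start)
print(round(w_batch, 2), round(cost(w_batch), 2))  # 1.35 -2.62

# Noisy descent: repeat with different seeds and count where the runs end
n_runs = 20
finals = [descend(w_start, noise_std=15.0, rng=random.Random(seed))
          for seed in range(n_runs)]
n_global = sum(w < 0 for w in finals)  # w < 0 is the global minimum's side
print(n_global, "of", n_runs, "noisy runs ended on the global minimum's side")
```

The same noise that lets a run leave the local minimum also keeps it jittering around wherever it ends up, so the noisy runs do not settle exactly on a minimum; in practice the step size is usually decayed over time to damp this.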

### References
- This notebook is based on the following web resources:
    - this [quora discussion](https://qr.ae/pG9WVc)
    - this [blog post](https://sdsawtelle.github.io/blog/output/week2-andrew-ng-machine-learning-with-python.html)

### [Visualizing regression models](https://seaborn.pydata.org/tutorial/regression.html): a good place to see the fitted lines