
# Exploring gradient descent with linear regression
The task is to build a function that is as close as possible to an unknown target function, using a set of (argument, value) pairs sampled from it.
We also know what kind of function to expect.

For example, based on the situation we conclude that the function should look
like $c_1 + c_2 x + c_3 \sin x$, but we have only $l$ points and don't know the values of the constants. This is linear regression (the model is linear in the unknown constants), and here it
is trained using gradient descent.

So there are $l$ points: $(x_1, y_1), ..., (x_l, y_l)$. We take the weight vector $w = (w_1, w_2, w_3)^T$ and denote by $\color{cyan}{X}$ the following matrix:
$\left(
    \begin{array}{ccc}
        1 & x_1 & \sin x_1 \\
        1 & x_2 & \sin x_2 \\
        \vdots & \vdots & \vdots \\
        1 & x_l & \sin x_l
    \end{array}
\right)$
and let $Y$ be the vector $\left(\begin{array}{cccc} y_1 & y_2 & ... & y_l \end{array}\right)^T$.

$a_w(x) = w_1 + w_2x + w_3 \sin x$ is the model function. Notice that the model (linear regression) response vector $\left(\begin{array}{cccc} a_w(x_1) & a_w(x_2) & ... & a_w(x_l) \end{array}\right)^T$
is just ${\color{cyan}{X}} \cdot w$.
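
As a quick illustration, the matrix and the response vector can be built with NumPy. This is only a minimal sketch: the helper name `design_matrix` and the sample points and weights are assumptions made for the example and are not part of `mlpractice`.

```python
import numpy as np


def design_matrix(x):
    """Stack the features 1, x, sin(x) as columns (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    return np.column_stack([np.ones_like(x), x, np.sin(x)])


x = np.array([0.0, 1.0, 2.0])    # illustrative points
w = np.array([0.5, -1.0, 2.0])   # illustrative weights
X = design_matrix(x)

# Model responses for all points at once: a_w(x_i) = (X @ w)[i]
responses = X @ w
```

Computing all responses with one matrix product is what keeps the later gradient formulas free of Python loops.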

And now we come to the questions: what is gradient descent and how does it work? Gradient descent is a way to find an extremum of a function
by "skating" along its surface, in our case downhill into a valley. Here the function is the loss, that is, the difference between the model's responses and the desired values, and we want to minimize it.

As is known from the course of mathematical analysis, the gradient of a function points in the direction of its steepest ascent, so to minimize the function we move in the opposite direction. Gradient descent computes the gradient and
updates the model weights according to the formulas given below.

Let's move on to the assignments:

Step size (learning rate) formula:
$$
\eta_k = \lambda\left(\frac{s_0}{s_0 + k}\right)^p
$$
You don't need to tune $s_0$ or $p$: you can use the defaults $1$ and $0.5$, but you should adjust $\lambda$.
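
To get a feel for how quickly the step shrinks, the schedule can be written as a small function. This is a sketch only: the name `learning_rate`, its argument names, and the default value of `lambda_` are assumptions for illustration and need not match the `mlpractice` interface.

```python
def learning_rate(k, lambda_=1e-2, s0=1.0, p=0.5):
    """Step size eta_k = lambda * (s0 / (s0 + k)) ** p."""
    return lambda_ * (s0 / (s0 + k)) ** p


# Step sizes for the first few iterations: they decrease monotonically
etas = [learning_rate(k) for k in range(4)]
```

With the defaults $s_0 = 1$ and $p = 0.5$ the step decays like $1/\sqrt{1 + k}$, so $\lambda$ alone sets the overall scale.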

In this task we use the MSE loss function:
$$
Q(w) = \frac{1}{l}\sum\limits_{i=1}^l (a_w(x_i) - y_i)^2
$$
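
In matrix form the loss is $Q(w) = \frac{1}{l}\|Xw - Y\|^2$, so its gradient is $\nabla_w Q(w) = \frac{2}{l} X^T (Xw - Y)$. The sketch below computes both and compares the formula with a finite-difference approximation. The function names and the random data are assumptions for illustration only.

```python
import numpy as np


def mse(X, y, w):
    """Q(w): mean of squared residuals."""
    residual = X @ w - y
    return float(np.mean(residual ** 2))


def mse_gradient(X, y, w):
    """Gradient of Q: (2 / l) * X^T (X w - y)."""
    return 2.0 / X.shape[0] * X.T @ (X @ w - y)


rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
w = rng.normal(size=3)

# Central finite differences as an independent check of the formula
eps = 1e-6
numeric = np.array([
    (mse(X, y, w + eps * e) - mse(X, y, w - eps * e)) / (2 * eps)
    for e in np.eye(3)
])
```

A numerical check like this is a cheap way to catch a missing factor of $2$ or $1/l$ before running any descent.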


In [None]:
#!source<mlpractice.gradient_descent.BaseValues>

#!source<mlpractice.gradient_descent.BaseDescent>

## Gradient descent

$$
w_{k+1} = w_k - \eta_k \nabla_w Q(w_k)
$$
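
The update rule can be tried on a tiny synthetic problem. This is a minimal sketch under stated assumptions (noiseless random data, $\lambda = 0.5$, a fixed number of iterations); it is not the `GradientDescent` class expected by the tests.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                                # noiseless targets for clarity

w = np.zeros(3)
for k in range(2000):
    eta = 0.5 * (1.0 / (1.0 + k)) ** 0.5      # eta_k with lambda = 0.5, s_0 = 1, p = 0.5
    grad = 2.0 / len(y) * X.T @ (X @ w - y)   # full MSE gradient
    w = w - eta * grad                        # w_{k+1} = w_k - eta_k * grad
```

On this well-conditioned problem the weights should end up close to `w_true`; a larger $\lambda$ can make the iterations diverge.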

In [None]:
#!source<mlpractice.gradient_descent.GradientDescent>

In [None]:
from mlpractice.tests.gradient_descent.test_gradient_descent import test_all

test_all(GradientDescent)

## Stochastic Descent

$$
w_{k+1} = w_k - \eta_k \nabla_w q_{i_k}(w_k)
$$
where $\nabla_w q_{i_k}(w_k)$ is the gradient estimate computed on a batch of randomly selected objects.
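
One possible way to form such an estimate is to compute the MSE gradient on a random subset of rows. In the sketch below the helper name `batch_gradient`, the batch size, and sampling with replacement are all assumptions; the library may sample differently.

```python
import numpy as np


def batch_gradient(X, y, w, batch_size, rng):
    """MSE gradient estimated on a random batch (hypothetical helper)."""
    # Assumption: indices are sampled uniformly with replacement
    idx = rng.integers(0, X.shape[0], size=batch_size)
    X_b, y_b = X[idx], y[idx]
    return 2.0 / batch_size * X_b.T @ (X_b @ w - y_b)


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true

estimate = batch_gradient(X, y, np.zeros(3), batch_size=10, rng=rng)
```

Each estimate costs `batch_size` rows instead of all $l$, at the price of noise in the direction of the step.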

In [None]:
#!source<mlpractice.gradient_descent.StochasticDescent>

In [None]:
from mlpractice.tests.gradient_descent.test_stochastic_descent import test_all

test_all(StochasticDescent)

## Momentum Descent

$$
h_0 = 0, h_{k+1} = \alpha h_k + \eta_k \nabla_w Q(w_k),\\
w_{k+1} = w_k - h_{k + 1}
$$
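
The two-line recurrence translates almost literally into code. The sketch below uses synthetic data and illustrative values $\alpha = 0.9$ and $\lambda = 0.1$, which are assumptions rather than library defaults.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true

alpha = 0.9                       # momentum coefficient (illustrative value)
w = np.zeros(3)
h = np.zeros(3)                   # h_0 = 0
for k in range(2000):
    eta = 0.1 * (1.0 / (1.0 + k)) ** 0.5
    grad = 2.0 / len(y) * X.T @ (X @ w - y)
    h = alpha * h + eta * grad    # decaying sum of past scaled gradients
    w = w - h                     # w_{k+1} = w_k - h_{k+1}
```

Because `h` keeps a fraction $\alpha$ of the previous step, consecutive gradients that agree in direction reinforce each other.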

In [None]:
#!source<mlpractice.gradient_descent.MomentumDescent>

In [None]:
from mlpractice.tests.gradient_descent.test_momentum_descent import test_all

test_all(MomentumDescent)

## Adagrad

$$
G_0 = 0, G_{k+1} = G_k + (\nabla_w Q(w_k))^2,\\
w_{k+1} = w_k - \frac{\eta_k}{\sqrt{\varepsilon + G_{k+1}}} \nabla_w Q(w_k)
$$
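
In the Adagrad formulas the square, the square root, and the division are applied element-wise, so every weight gets its own effective step size. A minimal sketch, assuming synthetic data and illustrative values $\varepsilon = 10^{-8}$ and $\lambda = 0.5$:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true

eps = 1e-8                        # illustrative value of epsilon
w = np.zeros(3)
G = np.zeros(3)                   # G_0 = 0, one accumulator per weight
losses = []
for k in range(500):
    losses.append(float(np.mean((X @ w - y) ** 2)))
    eta = 0.5 * (1.0 / (1.0 + k)) ** 0.5
    grad = 2.0 / len(y) * X.T @ (X @ w - y)
    G = G + grad ** 2                        # element-wise squared gradients
    w = w - eta / np.sqrt(eps + G) * grad    # per-coordinate step size
```

Coordinates that have seen large gradients accumulate a large `G` and therefore take smaller steps.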

In [None]:
#!source<mlpractice.gradient_descent.Adagrad>

In [None]:
from mlpractice.tests.gradient_descent.test_adagrad import test_all

test_all(Adagrad)

# Gradient Descent in action
## Linear Regression
To see how gradient descent minimizes the loss, we propose implementing linear regression that is trained using gradient descent.

Note that you must comply with the following conditions:
- Calculations must be vectorized
- Python loops are only allowed for gradient descent iterations
- Training stops when the maximum iteration count (max_iter) is reached or when the error becomes small (the square of the Euclidean norm of the difference
in weights between adjacent iterations is less than tolerance)
- The loss function history must be saved in loss_history, from the zero step (before training) to the last
- Weights must be initialized either to zero or from the normal distribution $N(0, 1)$ with a fixed seed
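
The conditions above can be sketched as a single training loop. This is a hypothetical standalone function, not the `LinearRegression` class the tests expect: the name `fit`, its signature, and the default values are assumptions, and it uses plain gradient descent with zero initialization.

```python
import numpy as np


def fit(X, y, lambda_=0.1, max_iter=1000, tolerance=1e-8):
    """Hypothetical training loop following the listed conditions."""
    w = np.zeros(X.shape[1])                             # zero initialization
    loss_history = [float(np.mean((X @ w - y) ** 2))]    # loss before training
    for k in range(max_iter):                            # the only Python loop
        eta = lambda_ * (1.0 / (1.0 + k)) ** 0.5
        grad = 2.0 / len(y) * X.T @ (X @ w - y)          # vectorized gradient
        w_new = w - eta * grad
        loss_history.append(float(np.mean((X @ w_new - y) ** 2)))
        # Squared Euclidean norm of the change in weights
        converged = np.sum((w_new - w) ** 2) < tolerance
        w = w_new
        if converged:
            break
    return w, loss_history


rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5])
w, loss_history = fit(X, y)
```

Here `loss_history` has one entry for the initial weights plus one per completed iteration, which is one reading of "from the zero step to the last".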

In [None]:
#!source<mlpractice.gradient_descent.LinearRegression>

In [None]:
from mlpractice.tests.gradient_descent.test_linear_regression import test_all

test_all(LinearRegression)

# Regularization
In this task we want to explore how regularization (adding a penalty proportional to the norm of the weights) works. Use the l2-regularization:
$$
G(w) = \frac{1}{l}\sum\limits_{i=1}^l (a_w(x_i) - y_i)^2 + \frac{\mu}{2}||w||^2
$$
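
Differentiating the penalty term gives $\nabla_w \frac{\mu}{2}\|w\|^2 = \mu w$, so the regularized gradient is the MSE gradient plus $\mu w$. The sketch below checks this numerically. The function names, the value of $\mu$, and the choice to penalize all weights (as the formula is written) are assumptions for illustration.

```python
import numpy as np


def reg_loss(X, y, w, mu):
    """MSE plus (mu / 2) * ||w||^2."""
    return float(np.mean((X @ w - y) ** 2) + 0.5 * mu * np.sum(w ** 2))


def reg_gradient(X, y, w, mu):
    """MSE gradient plus the penalty gradient mu * w."""
    return 2.0 / X.shape[0] * X.T @ (X @ w - y) + mu * w


rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
w = rng.normal(size=3)
mu = 0.1                          # illustrative regularization strength

# Central finite differences as an independent check
eps = 1e-6
numeric = np.array([
    (reg_loss(X, y, w + eps * e, mu) - reg_loss(X, y, w - eps * e, mu)) / (2 * eps)
    for e in np.eye(3)
])
```

The extra term $\mu w$ pulls every weight toward zero on each step, more strongly for larger $\mu$.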

In [None]:
#!source<mlpractice.gradient_descent.GradientDescentReg>

In [None]:
from mlpractice.tests.gradient_descent.test_gradient_descent_reg import test_all

test_all(GradientDescentReg)

In [None]:
#!source<mlpractice.gradient_descent.StochasticDescentReg>

In [None]:
from mlpractice.tests.gradient_descent.test_stochastic_descent_reg import test_all

test_all(StochasticDescentReg)

In [None]:
#!source<mlpractice.gradient_descent.MomentumDescentReg>

In [None]:
from mlpractice.tests.gradient_descent.test_momentum_descent_reg import test_all

test_all(MomentumDescentReg)

In [None]:
#!source<mlpractice.gradient_descent.AdagradReg>

In [None]:
from mlpractice.tests.gradient_descent.test_adagrad_reg import test_all

test_all(AdagradReg)

In [None]:
# To submit your solutions, fill in your credentials below
USERNAME = ""
PASSWORD = ""

from mlpractice.stats.stats_utils import _get_stats, submit

submit(USERNAME, PASSWORD, str(_get_stats()))