# What is a likelihood function? Also add a formula. 

Many probability distributions have unknown parameters, which we estimate from sample data. The likelihood function tells us how well different parameter values explain the observed data.<br>
Although a likelihood function might look just like a probability density function, it’s fundamentally different. A probability density function is a function of x, your data point, and it will tell you how likely it is that certain data points appear. A likelihood function, on the other hand, takes the data set as a given, and represents the likeliness of different parameters for your distribution.<br>
__Defining Likelihood Functions in Terms of Probability Density Functions__<br>
Suppose the joint probability density function of your sample $X = (X_1, \ldots, X_n)$ is $f(x \mid θ)$, where θ is a parameter, and $X = x$ is an observed sample point. Then the function of θ defined as

$$ L(θ |x) = f(x |θ)$$

is your likelihood function.<br>
Here it certainly looks like we’re just taking our probability density function and cleverly relabeling it as a likelihood function. The reality, though, is actually quite different. For your probability density function, you thought of θ as a constant and focused on an ever changing x. In the likelihood function, you let a sample point x be a constant and imagine θ to be varying over the whole range of possible parameter values.<br>

If we compare two points on our probability density function, we’ll be looking at two different values of x and examining which one has more probability of occurring. But for the likelihood function, we compare two different parameter points. For example, if we find that $L(θ_1 \mid x) > L(θ_2 \mid x)$, we know that our observed point x is more likely to have been observed under parameter conditions $θ = θ_1$ rather than $θ = θ_2$.<br>
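
To make the comparison concrete, here is a minimal sketch that holds a small sample fixed and evaluates the likelihood under two candidate means. The normal model with known σ = 1, the sample values and the helper names `normal_pdf` and `likelihood` are illustrative assumptions, not part of the original notes.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def likelihood(mu, data, sigma=1.0):
    """L(mu | data): product of the densities, with the data held fixed."""
    result = 1.0
    for x in data:
        result *= normal_pdf(x, mu, sigma)
    return result

data = [4.8, 5.1, 5.3]                # hypothetical observed sample
lik_theta1 = likelihood(5.0, data)    # candidate parameter theta_1 = 5
lik_theta2 = likelihood(3.0, data)    # candidate parameter theta_2 = 3

# The data are better explained by theta_1 than by theta_2
print(lik_theta1 > lik_theta2)
```

Note that the same `normal_pdf` is used in both roles: as a density it is read as a function of x, and as a likelihood it is read as a function of `mu`.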

__Properties of Likelihoods__<br>
Unlike probability density functions, likelihoods aren’t normalized. The area under their curves does not have to add up to 1.<br>

In fact, we can only define a likelihood function up to a constant of proportionality. This means that, rather than being a single function, a likelihood is an equivalence class of functions.<br>

__Using Likelihoods__<br>
Likelihoods are a key part of Bayesian inference. We also use likelihoods to generate estimators; the most common choice is the maximum likelihood estimator.<br>

# What is Maximum Likelihood estimation (MLE) ? 

Maximum likelihood estimation is a method that determines values for the parameters of a model. The parameter values are found such that they maximise the likelihood that the process described by the model produced the data that were actually observed.<br>
The principle of Maximum Likelihood is at the heart of Machine Learning. It guides us to find the best model in a search space of all models. In simple terms, Maximum Likelihood Estimation or MLE lets us choose a model (parameters) that explains the data (training set) better than all other models. For any given neural network architecture, the objective function can be derived based on the principle of Maximum Likelihood.<br>

In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of a statistical model given observations, by finding the parameter values that maximize the likelihood of making the observations given the parameters. MLE can be seen as a special case of the maximum a posteriori estimation (MAP) that assumes a uniform prior distribution of the parameters, or as a variant of the MAP that ignores the prior and which therefore is unregularized.<br>
MLE in a nutshell helps us answer this question: Which are the best parameters/coefficients for my model?<br>
The distinction between probability and likelihood is fundamentally important: Probability attaches to possible results; likelihood attaches to hypotheses.<br>
Maximum likelihood estimation (MLE) is a technique used for estimating the parameters of a given distribution, using some observed data. For example, if a population is known to follow a “normal distribution” but the “mean” and “variance” are unknown, MLE can be used to estimate them using a limited sample of the population. MLE does that by finding the particular values of the parameters (mean and variance) under which the resulting model is most likely to have generated the observed data.<br>
So generally, a likelihood expression has the form L(parameters | data), which reads as “the likelihood of these parameters, given that these data were observed”.<br>
Likelihood and probability are two different things, although they look and behave alike. We talk about probability when we know the model parameters and want to predict a value from the model, that is, how probable it is for a given value to come out of that model. So probability is: P(data | parameters)<br>
Now we can see that likelihood is the other side of probability: we infer the model parameters from the data. Here the results are known, and we know for sure that they have occurred (probability = 1).<br>
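
As a small illustration of the normal-distribution example above, the sketch below estimates the mean and variance of a sample by maximum likelihood. It assumes the well-known closed-form result that the MLEs are the sample mean and the variance with divisor n; the sample values and function names are made up for illustration.

```python
import math

def log_likelihood(mu, var, data):
    """Log-likelihood of a normal model N(mu, var) for the given data."""
    n = len(data)
    squared_error = sum((x - mu) ** 2 for x in data)
    return -0.5 * n * math.log(2 * math.pi * var) - squared_error / (2 * var)

data = [2.1, 1.9, 2.4, 2.0, 1.6]      # hypothetical sample
n = len(data)

# Closed-form maximum likelihood estimates for a normal distribution
mu_hat = sum(data) / n
var_hat = sum((x - mu_hat) ** 2 for x in data) / n   # divisor n, not n - 1

best = log_likelihood(mu_hat, var_hat, data)

# Any other parameter choice gives a log-likelihood that is no larger
print(best >= log_likelihood(mu_hat + 0.1, var_hat, data))
print(best >= log_likelihood(mu_hat, 1.5 * var_hat, data))
```

Working with the log-likelihood instead of the likelihood does not change the maximizer, because the logarithm is strictly increasing.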

# How is linear regression related to Pytorch and gradient descent? 

## linear regression related to Pytorch

Linear regression is an approach that tries to find a linear relationship between a dependent variable and one or more independent variables by minimizing the distance between the predicted and the observed values.<br>
Let’s consider a very basic linear equation, i.e., y = 2x + 1. Here, ‘x’ is the independent variable and ‘y’ is the dependent variable. The cells below first show how PyTorch computes gradients automatically for an expression of the form y = w * x + b, and then train a linear regression model with gradient descent on a small dataset that maps (temp, rainfall, humidity) to (apples, oranges).<br>

In [3]:
import numpy as np
import torch

In [4]:
# Create tensors.
x = torch.tensor(3.)
w = torch.tensor(4., requires_grad=True)
b = torch.tensor(5., requires_grad=True)

In [5]:
# Print tensors
print(x)
print(w)
print(b)

tensor(3.)
tensor(4., requires_grad=True)
tensor(5., requires_grad=True)


In [6]:
# Arithmetic operations
y = w * x + b
print(y)

tensor(17., grad_fn=<AddBackward0>)


In [7]:
# Compute gradients
y.backward()

In [8]:
# Display gradients
print('dy/dw:', w.grad)
print('dy/db:', b.grad)

dy/dw: tensor(3.)
dy/db: tensor(1.)


In [10]:
# Input (temp, rainfall, humidity)
inputs = np.array([[73, 67, 43], 
                   [91, 88, 64], 
                   [87, 134, 58], 
                   [102, 43, 37], 
                   [69, 96, 70]], dtype='float32')

In [11]:
# Targets (apples, oranges)
targets = np.array([[56, 70], 
                    [81, 101], 
                    [119, 133], 
                    [22, 37], 
                    [103, 119]], dtype='float32')

In [12]:
# Convert inputs and targets to tensors
inputs = torch.from_numpy(inputs)
targets = torch.from_numpy(targets)
print(inputs)
print(targets)

tensor([[ 73.,  67.,  43.],
        [ 91.,  88.,  64.],
        [ 87., 134.,  58.],
        [102.,  43.,  37.],
        [ 69.,  96.,  70.]])
tensor([[ 56.,  70.],
        [ 81., 101.],
        [119., 133.],
        [ 22.,  37.],
        [103., 119.]])


In [13]:
# Weights and biases
w = torch.randn(2, 3, requires_grad=True)
b = torch.randn(2, requires_grad=True)
print(w)
print(b)

tensor([[ 0.4588,  0.8561, -0.8744],
        [ 0.6292,  0.3563,  1.2141]], requires_grad=True)
tensor([0.8987, 0.2927], requires_grad=True)


In [14]:
# Define the model
def model(x):
    return x @ w.t() + b

In [15]:
# Generate predictions
preds = model(inputs)
print(preds)

tensor([[ 54.1466, 122.3036],
        [ 62.0199, 166.6079],
        [104.8099, 173.1985],
        [ 52.1514, 124.7130],
        [ 53.5295, 162.9014]], grad_fn=<AddBackward0>)


In [16]:
# Compare with targets
print(targets)

tensor([[ 56.,  70.],
        [ 81., 101.],
        [119., 133.],
        [ 22.,  37.],
        [103., 119.]])


In [17]:
# MSE loss
def mse(t1, t2):
    diff = t1 - t2
    return torch.sum(diff * diff) / diff.numel()

In [18]:
# Compute loss
loss = mse(preds, targets)
print(loss)

tensor(2219.8362, grad_fn=<DivBackward0>)


In [19]:
# Compute gradients
loss.backward()

In [20]:
# Gradients for weights
print(w)
print(w.grad)

tensor([[ 0.4588,  0.8561, -0.8744],
        [ 0.6292,  0.3563,  1.2141]], requires_grad=True)
tensor([[ -687.0081, -1429.7102,  -892.9557],
        [ 5052.3350,  4530.1270,  3019.5906]])


In [21]:
# Gradients for bias
print(b)
print(b.grad)

tensor([0.8987, 0.2927], requires_grad=True)
tensor([-10.8685,  57.9449])


In [22]:
w.grad.zero_()
b.grad.zero_()
print(w.grad)
print(b.grad)

tensor([[0., 0., 0.],
        [0., 0., 0.]])
tensor([0., 0.])


In [23]:
# Generate predictions
preds = model(inputs)
print(preds)

tensor([[ 54.1466, 122.3036],
        [ 62.0199, 166.6079],
        [104.8099, 173.1985],
        [ 52.1514, 124.7130],
        [ 53.5295, 162.9014]], grad_fn=<AddBackward0>)


In [24]:
# Calculate the loss
loss = mse(preds, targets)
print(loss)

tensor(2219.8362, grad_fn=<DivBackward0>)


In [25]:
# Compute gradients
loss.backward()

In [26]:
# Adjust weights & reset gradients
with torch.no_grad():
    w -= w.grad * 1e-5
    b -= b.grad * 1e-5
    w.grad.zero_()
    b.grad.zero_()

In [27]:
print(w)

tensor([[ 0.4656,  0.8704, -0.8654],
        [ 0.5787,  0.3110,  1.1839]], requires_grad=True)


In [28]:
# Calculate loss
preds = model(inputs)
loss = mse(preds, targets)
print(loss)

tensor(1686.8291, grad_fn=<DivBackward0>)


In [29]:
# Train for 100 epochs
for i in range(100):
    preds = model(inputs)
    loss = mse(preds, targets)
    loss.backward()
    with torch.no_grad():
        w -= w.grad * 1e-5
        b -= b.grad * 1e-5
        w.grad.zero_()
        b.grad.zero_()

In [30]:
# Calculate loss
preds = model(inputs)
loss = mse(preds, targets)
print(loss)

tensor(235.4994, grad_fn=<DivBackward0>)


In [31]:
# Print predictions
preds

tensor([[ 61.6818,  74.1907],
        [ 74.6087, 105.1717],
        [128.7154, 116.4812],
        [ 45.3871,  59.4863],
        [ 74.6952, 113.9124]], grad_fn=<AddBackward0>)

In [32]:
# Print targets
targets

tensor([[ 56.,  70.],
        [ 81., 101.],
        [119., 133.],
        [ 22.,  37.],
        [103., 119.]])
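
The same training loop can also be written with PyTorch's built-in layers and optimizers. The following is a minimal sketch, not the notebook's own code: it builds a dummy dataset from the equation y = 2x + 1 mentioned at the start of this section and fits it with `torch.nn.Linear`, `torch.nn.MSELoss` and `torch.optim.SGD`. The learning rate and the number of epochs are assumptions chosen so that this small problem converges.

```python
import torch

torch.manual_seed(0)

# Dummy dataset generated from y = 2x + 1
x_train = torch.arange(0.0, 10.0).unsqueeze(1)   # shape (10, 1)
y_train = 2 * x_train + 1

model = torch.nn.Linear(1, 1)                    # one weight and one bias
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for epoch in range(2000):
    optimizer.zero_grad()                        # reset gradients from the previous step
    loss = loss_fn(model(x_train), y_train)      # forward pass and MSE loss
    loss.backward()                              # compute gradients
    optimizer.step()                             # gradient descent update

# The learned parameters should be close to w = 2 and b = 1
print(model.weight.item(), model.bias.item())
```

Here `optimizer.step()` and `optimizer.zero_grad()` play the role of the manual `w -= w.grad * 1e-5` and `w.grad.zero_()` lines in the cells above.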

## linear regression related to gradient descent

__Linear Regression__<br>
In statistics, linear regression is a linear approach to modelling the relationship between a dependent variable and one or more independent variables. Let X be the independent variable and Y be the dependent variable. We will define a linear relationship between these two variables as follows:<br>
$$ Y = mX + c $$<br>
m is the slope of the line and c is the y intercept. We will use this equation to train our model with a given dataset and predict the value of Y for any given value of X. The challenge is to determine the value of m and c, such that the line corresponding to those values is the best fitting line or gives the minimum error.<br>
<br>__Loss Function__<br>
The loss measures the error of the predictions made with the current values of m and c. The goal is to minimize this error to obtain the most accurate values of m and c.<br>
We will use the Mean Squared Error function to calculate the loss. There are three steps in this function:<br>
1. Find the difference between the actual y and the predicted y value (y = mx + c), for a given x.<br>
2. Square this difference.<br>
3. Find the mean of the squares for every value in X.<br>

$$E = \frac{1}{n}\sum_{i=1}^{n}(y_{i} - \bar{y}_{i})^{2}$$

Here $y_i$ is the actual value and $\bar{y}_i$ is the predicted value. Let's substitute the value of $\bar{y}_i$:
$$E = \frac{1}{n}\sum_{i=1}^{n}(y_{i} - (mx_{i} + c))^{2}$$


<br>__The Gradient Descent Algorithm__<br>
Gradient descent is an iterative optimization algorithm to find the minimum of a function. Here that function is our Loss Function.
<br>__Understanding Gradient Descent__<br>

<br>Let’s try applying gradient descent to m and c and approach it step by step:<br>
1. Initially let m = 0 and c = 0. Let L be our learning rate. This controls how much the value of m changes with each step. L could be a small value like 0.0001 for good accuracy.
2. Calculate the partial derivative of the loss function with respect to m, and plug in the current values of x, y, m and c in it to obtain the derivative value D.

$$D_m = \frac{1}{n}\sum_{i=1}^{n} 2(y_{i} - (mx_{i} + c))(-x_i)$$

$$D_m = \frac{-2}{n}\sum_{i=1}^{n} x_i(y_{i} - \bar{y}_i)$$

$D_m$ is the value of the partial derivative with respect to m. Similarly, let's find the partial derivative with respect to c, $D_c$:<br>

$$D_c = \frac{-2}{n}\sum_{i=1}^{n}(y_{i} - \bar{y}_i)$$

3. Now we update the current value of m and c using the following equation:

$$ m = m - L * D_m $$
$$ c = c - L * D_c $$

4. We repeat this process until our loss function is a very small value or ideally 0 (which means 0 error or 100% accuracy). The values of m and c that we are left with will be the optimum values.<br>
A useful analogy is a person walking down into a valley: m can be considered the current position of the person, D is equivalent to the steepness of the slope, and L is the speed with which they move. The new value of m that we calculate using the above equation will be their next position, and L×D will be the size of the step they take. When the slope is steeper (D is larger) they take longer steps, and when it is less steep (D is smaller) they take smaller steps. Finally they arrive at the bottom of the valley, which corresponds to the minimum of our loss (ideally 0).<br>
Now with the optimum values of m and c, our model is ready to make predictions!
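
The four steps above can be sketched in plain Python as follows. This is an illustrative implementation under assumed settings: the data are generated from y = 2x + 1, the learning rate `lr` plays the role of L, and the fixed number of epochs stands in for "repeat until the loss is small".

```python
def gradient_descent(xs, ys, lr=0.01, epochs=5000):
    """Fit y = m*x + c by gradient descent on the mean squared error."""
    m, c = 0.0, 0.0                      # step 1: start from m = 0 and c = 0
    n = len(xs)
    for _ in range(epochs):
        preds = [m * x + c for x in xs]
        # step 2: partial derivatives of the loss with respect to m and c
        d_m = (-2 / n) * sum(x * (y - p) for x, y, p in zip(xs, ys, preds))
        d_c = (-2 / n) * sum(y - p for y, p in zip(ys, preds))
        # step 3: move against the gradient
        m -= lr * d_m
        c -= lr * d_c
    return m, c                          # step 4: values after repeated updates

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2 * x + 1 for x in xs]             # noise-free data from y = 2x + 1
m, c = gradient_descent(xs, ys)
print(m, c)                              # should be close to 2 and 1
```

With a learning rate that is too large the updates overshoot and diverge, and with one that is too small many more epochs are needed, which is why L is usually tuned.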


Gradient descent is one of the simplest and most widely used algorithms in machine learning, mainly because it can be applied to any differentiable function to optimize it.

# Write out MSE loss for linear regression. Could we also use this loss for classification? 

Mean Squared Error (MSE) is the most commonly used regression loss function. MSE is the mean of the squared differences between our target variable and the predicted values.<br>
The Mean Squared Error (MSE) or Mean Squared Deviation (MSD) of an estimator measures the average of the squared errors, i.e. the average squared difference between the estimated values and the true value. It is a risk function, corresponding to the expected value of the squared error loss. It is always non-negative, and values close to zero are better. The MSE is the second moment of the error (about the origin) and thus incorporates both the variance of the estimator and its bias.<br>
Steps to find the MSE<br>

1. Find the equation for the regression line<br>
$$\hat{Y}_i = \hat{\beta}_0+\hat{\beta}_1{X}_i$$<br>

2. Insert the X values into the equation found in step 1 in order to get the respective predicted Y values, i.e.<br>
$$\hat{Y}_i$$<br>

3. Now subtract the predicted Y values (i.e. $\hat{Y}_i$) from the original Y values. The values found this way are the error terms, also known as the vertical distances of the given points from the regression line.<br>
$$Y_i - \hat{Y}_i$$<br>

4. Square the errors found in step 3.<br>
$$(Y_i - \hat{Y}_i)^2$$<br>
5. Sum up all the squares<br>
$$\sum_{i=1}^{N}(Y_i - \hat{Y}_i)^2$$<br>
6. Divide the value found in step 5 by the total number of observations.<br>
$$MSE = \frac{1}{N}\sum_{i=1}^{N}(Y_i - \hat{Y}_i)^2$$<br>
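
Steps 3 to 6 translate almost directly into code. The sketch below assumes the predictions from steps 1 and 2 are already available; the function name `mse` and the example values are illustrative.

```python
def mse(y_true, y_pred):
    """Mean squared error between observed and predicted values."""
    errors = [y - y_hat for y, y_hat in zip(y_true, y_pred)]   # step 3
    squared = [e ** 2 for e in errors]                          # step 4
    return sum(squared) / len(squared)                          # steps 5 and 6

y_true = [3.0, 5.0, 7.0]    # hypothetical observed values
y_pred = [2.0, 5.0, 9.0]    # hypothetical predictions from a regression line

# Errors are 1, 0 and -2, so the squares sum to 5 and the mean is 5/3
print(mse(y_true, y_pred))
```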

Cross-entropy (often called softmax loss when it is combined with a softmax output) is a better measure than MSE for classification. With class labels as targets, MSE doesn’t punish confident misclassifications enough, whereas it is the right loss for regression, where the targets are continuous and the distance between two values that can be predicted is small.<br>
From a probabilistic point of view, the cross-entropy arises as the natural cost function to use if you have a sigmoid or softmax nonlinearity in the output layer of your network, and you want to maximize the likelihood of classifying the input data correctly. If instead you assume the target is continuous and normally distributed, and you maximize the likelihood of the output of the net under these assumptions, you get the MSE (combined with a linear output layer). For classification, cross-entropy tends to be more suitable than MSE – the underlying assumptions just make more sense for this setting. That said, you can train a classifier with the MSE loss and it will probably work fine (although it does not play very nicely with the sigmoid/softmax nonlinearities, a linear output layer would be a better choice in that case). For regression problems, you would almost always use the MSE. Another alternative for classification is to use a margin loss, which basically amounts to putting a (linear) SVM on top of your network.<br>
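
One way to see why MSE "does not play very nicely" with a sigmoid output is to compare gradients for a confidently wrong prediction. The sketch below is an illustration under simple assumptions: a single example with true label 1, a pre-activation `z = -10`, a squared-error loss `(p - y)**2` and the binary cross-entropy loss; the gradient formulas follow from the chain rule.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

z, y = -10.0, 1.0          # the model is confidently wrong: p is close to 0, label is 1
p = sigmoid(z)

# d/dz of the squared error (p - y)^2 with p = sigmoid(z)
grad_mse = 2 * (p - y) * p * (1 - p)

# d/dz of the cross-entropy -[y ln p + (1 - y) ln(1 - p)] with p = sigmoid(z)
grad_ce = p - y

# The MSE gradient is tiny because of the p * (1 - p) factor,
# while the cross-entropy gradient stays close to -1
print(abs(grad_mse), abs(grad_ce))
```

Under these assumptions the squared-error gradient almost vanishes exactly when the prediction is most wrong, so gradient descent corrects such mistakes very slowly.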

# Write out the Maximum likelihood Estimation for linear regression. How is this related to the MSE loss for linear regression derived in the last point? Derive the relation between them

Maximum likelihood estimation, abbreviated as MLE, is a popular method used to estimate the parameters of a regression model. Other than regression, it is very often used in statistics to estimate the parameters of various distribution models.<br>
Just as we have used likelihood calculations to find the best parameter values for various distribution models in statistics, the MLE method can also be used to find the best model parameters of a linear regression model. But when calculating parameter values for those statistical distribution models, we knew what kind of distribution it was and the relevant PDF.<br>
When we talk about some values being normally distributed, we need to describe that normal distribution further: more precisely, what are its mean and variance?<br>
In linear regression the trick is to take the model that we need to find as the mean of the above-stated normal distribution, because we know how to find the MLE of the mean of a normal distribution.<br>
So let’s define the linear model that needs to be estimated as ŷ.<br>
$$\hat{y} = w_0 + w_1x_1 + \cdots + w_dx_d$$<br>
But in linear regression, the mean is a function (ŷ). For every x value (input), the function ŷ generates a number that serves as the mean, so from the ŷ function we get a set of values as means. The important thing to understand is that the mean for each y value (each result / label) is different: the mean for each y value (label) is the value predicted by our model.<br>
As stated above, in linear regression we treat the line ŷ as the “mean” of the normal distribution.<br>


We can also consider the ŷ values as normally distributed. But this time, their mean values will be the values themselves, since they fall perfectly on the ŷ line, and so the variance of the ŷ values (predicted labels) will be 0. So, ŷ ~ N(XW, 0).<br>
We have an error term called ε (the residual), which is the distance between the predicted value (ŷ) and the actual value (y). There are some important assumptions that we make in linear regression regarding these residuals:<br>
“Residuals are normally distributed”<br>
“Residuals have an equal variance”<br>
“Means of residuals are 0”<br>
So:

![](img2.png)
And we know that y = ŷ + ε, and the y labels are normally distributed. Our aim is to estimate the best values for the mean and variance of the normal distribution of y. Let’s express the mean and variance of y in terms of those of ŷ and ε. The mean is the expectation, so let’s take the expectation of the equation y = ŷ + ε in order to find the mean of y.
$$E(y) = E(\hat{y} + ε)$$
$$E(y) = E(\hat{y}) + E(ε)$$
$$E(y) = XW + 0 = XW$$
And, since ŷ is deterministic for a given X,
$$\mathrm{Var}(y) = \mathrm{Var}(\hat{y} + ε)$$
$$\mathrm{Var}(y) = \mathrm{Var}(\hat{y}) + \mathrm{Var}(ε)$$
$$\mathrm{Var}(y) = 0 + σ^2 = σ^2$$
So we can say that y is normally distributed with mean XW and variance σ².

![](img3.png)
Now let’s calculate the MLEs for XW and σ² as we did in the previous example. But note that here we have n data points (in the earlier example we had only 3), and each of those data points is of dimension d.
![](img4.png)


As y is normally distributed,

![](!img1.png)

![](img.png)

Let’s take the natural logarithm of both sides so that we get the log-likelihood,
![](img5.png)
In order to estimate the best set of weights (weight matrix), let’s partially differentiate the above equation with respect to W.
![](img6.png)
The optimal values for W are obtained when
![](img7.png)
Then,
![](img8.png)

So now we have found the optimal values for W in our model. That is the main aim of linear regression, since once we have found the W matrix, we can predict.
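
The relation to the MSE loss can be written out explicitly. The following is a sketch under the assumptions stated above (independent labels with $y_i \sim N(\hat{y}_i, σ^2)$, where $\hat{y}_i$ is the i-th entry of XW); $\mathrm{MSE}(W)$ denotes the mean squared error from the previous question.

$$
\begin{aligned}
\ln L(W, σ^2) &= \sum_{i=1}^{n} \ln\left[\frac{1}{\sqrt{2πσ^2}}\exp\left(-\frac{(y_i - \hat{y}_i)^2}{2σ^2}\right)\right] \\
&= -\frac{n}{2}\ln(2πσ^2) - \frac{1}{2σ^2}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 \\
&= -\frac{n}{2}\ln(2πσ^2) - \frac{n}{2σ^2}\,\mathrm{MSE}(W)
\end{aligned}
$$

The first term does not depend on W and the factor $\frac{n}{2σ^2}$ is positive, so for any fixed σ² the W that maximizes the log-likelihood is exactly the W that minimizes the MSE. In other words, under the Gaussian noise assumption, maximum likelihood estimation and least squares give the same weights.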

# Write out the likelihood function for linear classification. What is the drawback of using MSE loss here? 

When we apply MLE to linear regression, the objective is to find the best line that fits the data points. But first, let us make some assumptions. We assume each label, $y_i$, is Gaussian distributed with mean $x^T_iθ$ and variance $σ^2$, given by<br>

$$y_i \sim \mathcal{N}(x^T_i\theta,σ^2), \quad \text{i.e.} \quad y_i = x^T_i\theta+ε_i, \quad ε_i \sim \mathcal{N}(0,σ^2)$$
$$\text{prediction: } \hat{y}_i=x^T_i\theta$$<br>
where $x_i$ is a vector of the form $(x^1_i=1,x^2_i)$.<br>

The mean, $x^T_i\theta$, represents the best fit line. The data points will vary about the line, and the second term, $\mathcal{N}(0,σ^2)$, captures this variance.<br>
__Learning__<br>
If we assume that each point $y_i$ is Gaussian distributed, the process of learning becomes the process of maximizing the product of the individual probabilities, which is equivalent to maximizing the log likelihood. We switch to log space, as it is more convenient and it removes the exponential in the Gaussian distribution.<br>





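As the data points are independent, we can write the joint probability of the labels y given X, θ and σ (the likelihood) as,

$$p(y \mid X,θ,σ)=\prod_{i=1}^{n}p(y_i \mid x_i,θ,σ)$$
$$p(y \mid X,θ,σ)=\prod_{i=1}^{n}(2πσ^2)^{-1/2}\exp\left(-\frac{1}{2σ^2}(y_i-x^T_iθ)^2\right)$$
Rewriting in vector form, with $Y$ denoting the vector of all labels $y_i$,

$$p(Y \mid X,θ,σ)=(2πσ^2)^{-n/2}\exp\left(-\frac{1}{2σ^2}(Y-Xθ)^T(Y-Xθ)\right)$$
Log likelihood,

$$l(θ)=-\frac{n}{2}\log(2πσ^2)-\frac{1}{2σ^2}(Y-Xθ)^T(Y-Xθ)$$
The first term does not depend on θ, and the second term is a downward-opening quadratic (a parabola) in θ, the peak (maximum) of which can be found by setting the derivative of l(θ) to zero. Equating the first derivative to zero, we get,

$$\frac{dl(θ)}{dθ}=-\frac{1}{2σ^2}\left(0-2X^TY+2X^TXθ\right)=0$$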
we get,<br>

$$\hat{\theta}_{ML}=(X^TX)^{-1}X^TY$$
Finally, we reach our goal of finding the best model for linear regression. This equation is commonly known as the normal equation. The same equation can be derived using the least squares method.<br>

Similarly, we can get the maximum likelihood estimate of the variance, $σ^2$, by differentiating the log likelihood with respect to σ and equating to zero.<br>

$$\hat{\sigma}^2_{ML}=\frac{1}{n}(Y-X\hat{\theta}_{ML})^T(Y-X\hat{\theta}_{ML})=\frac{1}{n}\sum_{i=1}^{n}(y_i-x^T_i\hat{\theta}_{ML})^2$$<br>
This gives us the standard estimate of variance in the training data, namely the mean of the squared residuals.<br>
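
The two estimates can be checked numerically. The sketch below uses NumPy on synthetic data; the true parameters, the noise level, the sample size and the helper `log_likelihood` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic data: a column of ones for the intercept plus one feature
X = np.column_stack([np.ones(n), rng.uniform(-1.0, 1.0, n)])
theta_true = np.array([1.0, 2.0])
Y = X @ theta_true + rng.normal(0.0, 0.5, n)      # Gaussian noise with sigma = 0.5

# Maximum likelihood estimates from the formulas above
theta_ml = np.linalg.solve(X.T @ X, X.T @ Y)      # solves (X^T X) theta = X^T Y
residuals = Y - X @ theta_ml
sigma2_ml = residuals @ residuals / n

def log_likelihood(theta, sigma2):
    """Gaussian log-likelihood l(theta) for the synthetic data."""
    r = Y - X @ theta
    return -0.5 * n * np.log(2 * np.pi * sigma2) - (r @ r) / (2 * sigma2)

# The ML estimates score at least as well as the parameters that generated the data
print(log_likelihood(theta_ml, sigma2_ml) >= log_likelihood(theta_true, 0.25))
```

`np.linalg.solve` is used here instead of forming the inverse explicitly, which is generally the more numerically stable way to evaluate the same formula.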

# Can gradient descent be used to find the parameters for linear regression? What about linear classification? Why? 

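Yes, gradient descent can be used to find the parameters for linear regression, because the MSE loss is a differentiable (and convex) function of the parameters. Gradient descent can also be used in linear classification, provided the loss function is differentiable, as is the case for the cross-entropy loss of logistic regression. There it is in fact the usual choice, because no closed-form solution like the normal equation is available.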


# What are normal equations? Is it the same as least squares? Explain.  

The normal equation is an analytical approach to linear regression with a least squares cost function: it is the closed-form solution of the least squares problem, so we can find the value of θ directly without using gradient descent. This approach is an effective and time-saving option when we are working with a dataset with a small number of features.<br>

__The normal equation is as follows:__

$$ \theta = ({X}^T{X})^{-1}.({X}^T{y}) $$

In the above equation,<br>
$θ$ : the hypothesis parameters that define the best fit.<br>
$X$ : input feature values of each instance.<br>
$y$ : output value of each instance.<br>

__Maths behind the equation__<br>
Given the hypothesis function<br>

$$ h_{\theta}(x) = \theta_0{x_0} + \theta_1{x_1}+ \cdots + \theta_n{x_n} $$


where,<br>
$n$ : the no. of features in the data set.<br>
${x_0}$ : 1 (for vector multiplication)<br>

Notice that this is the dot product between the θ and x values. So, for convenience, we can write it as:<br>


$$ h_{\theta}(x) = \theta ^ T{x}$$



The objective in linear regression is to minimize the cost function:<br>

$$J(\theta) = \frac{1}{2m} \sum_{i = 1}^{m} [h_{\theta}(x^{(i)}) - y^{(i)}]^{2} $$
where,<br>
$x^{(i)}$ : the input value of the ith training example.<br>
$m$ : no. of training instances<br>
$n$ : no. of data-set features<br>
$y^{(i)}$ : the expected result of the ith instance<br>

Let us represent the cost function in vector form<br>


$$
\begin{bmatrix} 
h_\theta ({x}^{(1)}) \\
h_\theta ({x}^{(2)}) \\
\vdots \\
h_\theta ({x}^{(m)}) \\
\end{bmatrix}  
 - \begin{bmatrix} 
{y}^{(1)} \\
{y}^{(2)} \\
\vdots \\
{y}^{(m)} \\
\end{bmatrix}
$$
<br><br>We have ignored 1/2m here as it will not make any difference to the result. It was used for mathematical convenience while calculating the gradient for gradient descent, but it is no longer needed here.<br>

$$
\begin{bmatrix} 
\theta ^ T {x}^{(1)} \\
\theta ^ T {x}^{(2)} \\
\vdots \\
\theta ^ T {x}^{(m)} \\
\end{bmatrix} - y
$$

$$
\begin{bmatrix} 
\theta_0 x^{(1)}_0 + \theta_1 x^{(1)}_1 + \cdots + \theta_n x^{(1)}_n \\
\theta_0 x^{(2)}_0 + \theta_1 x^{(2)}_1 + \cdots + \theta_n x^{(2)}_n \\
\vdots \\
\theta_0 x^{(m)}_0 + \theta_1 x^{(m)}_1 + \cdots + \theta_n x^{(m)}_n \\
\end{bmatrix} - y
$$

$x^{(i)}_j$ : value of the jth feature in the ith training example.

This can further be reduced to  $X\theta - y$<br>
But each residual value is squared. We cannot simply square the above expression, as the square of a vector/matrix is not equal to the square of each of its values. So, to get the sum of the squared values, multiply the transpose of the vector with the vector itself. So, the final equation derived is

$$(X\theta - y)^{T}(X\theta - y)$$

Therefore, the cost function is

$$Cost = (X\theta - y)^{T}(X\theta - y) $$

Now we find the value of θ by taking the derivative of the cost and setting it to zero

$$\frac{\partial J_{\theta}}{\partial {\theta}} = \frac{\partial}{\partial {\theta}}{[(X{\theta}- y)^T{(X{\theta}- y)}]}$$

$$ \frac{\partial J_{\theta}}{\partial {\theta}} = 2X^TX\theta - 2X^Ty$$

$$ Cost^{'}(\theta) = 0 $$

$$2X^{T}X{\theta} - 2X^Ty = 0$$

$$2X^{T}X{\theta} = 2X^Ty$$

$$ (X^TX)^{-1}(X^TX){\theta} = (X^TX)^{-1}.(X^Ty) $$

$$\theta = (X^TX)^{-1}.(X^Ty)$$
So, this is the final derived normal equation, with θ giving the minimum cost value.
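
To connect the formula to least squares in practice, the sketch below evaluates the normal equation with NumPy on a tiny made-up dataset and compares the result with NumPy's own least squares solver, `np.linalg.lstsq`. The data (generated from y = 2x + 1) are an assumption for illustration.

```python
import numpy as np

# Design matrix with x_0 = 1 in the first column, and targets from y = 2x + 1
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

# Normal equation: theta = (X^T X)^{-1} (X^T y)
theta = np.linalg.inv(X.T @ X) @ (X.T @ y)

# Least squares solution computed by NumPy's solver
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(theta)                              # intercept and slope, close to 1 and 2
print(np.allclose(theta, theta_lstsq))    # both approaches agree
```

Both routes solve the same least squares problem; the normal equation is the closed-form expression, while `lstsq` reaches the same answer with a numerically more robust factorization.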



# Is feature scaling needed for linear regression when using gradient descent?  Why or why not? 

Yes. In practice, feature scaling is needed for linear regression when using gradient descent, because real-world data can come in very different orders of magnitude. For example, human age might range from 0 to 100 years, while income might range from €10,000 to €10,000,000 (and more). Using features with such different ranges as inputs might slow the gradient descent algorithm down to a crawl. We can speed up gradient descent by scaling the features. This is because the coefficients of the linear regression will descend quickly on small ranges and slowly on large ranges, and so will oscillate inefficiently down to the optimum when the variables are very uneven.
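
One common way to scale features is standardization, which subtracts the mean and divides by the standard deviation. The sketch below is a minimal illustration; the helper `standardize` and the age and income values are hypothetical.

```python
def standardize(values):
    """Rescale values to zero mean and unit standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

ages = [22.0, 35.0, 47.0, 58.0, 63.0]                             # years
incomes = [18_000.0, 42_000.0, 75_000.0, 120_000.0, 950_000.0]    # euros

ages_scaled = standardize(ages)
incomes_scaled = standardize(incomes)

# After scaling, both features live on a comparable range
print(min(ages_scaled), max(ages_scaled))
print(min(incomes_scaled), max(incomes_scaled))
```

The mean and standard deviation computed on the training data should be reused to scale any new inputs at prediction time.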

# Write out the MLE approach for logistic regression. How is this related to the binary cross-entropy?

Logistic regression is widely used to model the outcomes of a categorical dependent variable. For categorical variables it is inappropriate to use linear regression because the response values are not measured on a ratio scale and the error terms are not normally distributed. In addition, the linear regression model can generate as predicted values any real number ranging from negative to positive infinity, whereas a categorical variable can only take on a limited number of discrete values within a specified range.
Now let’s consider the logistic model (a binary classifier) to describe log-odds using a linear model:
$$ \ln \frac{p}{1-p}=\beta_{0}+\beta_{1} x_{1}+\cdots+\beta_{m} x_{m} $$<br>
The probability of observing outcome y=1 under this model is given by the following function (sigmoid):<br>
$$
p \equiv p(y=1 | \mathbf{B}, \mathbf{X})=\frac{1}{1+e^{-\left(\beta_{0}+\beta_{1} x_{1}+\cdots+\beta_{m} x_{m}\right)}}
$$<br>
With 0 and 1 being the only possible outcomes, the probability of observing outcome y=0 is simply (1-p), so the probability of either outcome $o \in \{0, 1\}$ can be written compactly as:
$$
p(y=o | \mathbf{B}, \mathbf{X})=p^{o}(1-p)^{1-o}
$$
The likelihood function is given by the product of all individual probabilities:
$$
\mathcal{L}=\prod_{i=1}^{n} p\left(y=y_{i} | \mathbf{B}, \mathbf{X}_{i}\right)=\prod_{i=1}^{n} p_{i}^{y_{i}}\left(1-p_{i}\right)^{1-y_{i}}
$$
It’s easier to maximize the log-likelihood:
$$
\ln \mathcal{L}=\sum_{i=1}^{n}\left(y_{i} \ln p_{i}+\left(1-y_{i}\right) \ln \left(1-p_{i}\right)\right)
$$
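Thus maximum likelihood estimation yields a familiar loss function: the negative log-likelihood averaged over the n samples, $-\frac{1}{n}\ln \mathcal{L}$, is exactly the binary cross-entropy, so maximizing the likelihood is equivalent to minimizing the binary cross-entropy.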
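
The equivalence can be illustrated with a small sketch that fits a one-feature logistic model by gradient ascent on the log-likelihood and tracks the binary cross-entropy. The data, the learning rate and the function names are assumptions for illustration; the gradient used is the standard one, $\sum_i (y_i - p_i)$ for the intercept and $\sum_i (y_i - p_i)x_i$ for the slope.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(b0, b1, xs, ys):
    """ln L = sum of y ln p + (1 - y) ln(1 - p)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(b0 + b1 * x)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

def binary_cross_entropy(b0, b1, xs, ys):
    """Negative log-likelihood averaged over the samples."""
    return -log_likelihood(b0, b1, xs, ys) / len(xs)

xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]    # hypothetical feature values
ys = [0, 0, 1, 0, 1, 1]                   # hypothetical labels (not perfectly separable)

b0, b1, lr = 0.0, 0.0, 0.1
initial_bce = binary_cross_entropy(b0, b1, xs, ys)
for _ in range(500):
    ps = [sigmoid(b0 + b1 * x) for x in xs]
    g0 = sum(y - p for y, p in zip(ys, ps))                  # d ln L / d b0
    g1 = sum((y - p) * x for x, y, p in zip(xs, ys, ps))     # d ln L / d b1
    b0 += lr * g0                                            # ascend the log-likelihood
    b1 += lr * g1
final_bce = binary_cross_entropy(b0, b1, xs, ys)

# Increasing the log-likelihood is the same as decreasing the cross-entropy
print(initial_bce, final_bce)
```

With all coefficients at zero every prediction is 0.5, so the starting cross-entropy equals ln 2; it decreases as the log-likelihood increases.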