# Q: What is a likelihood function? Also add a formula and explain what it means?


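The likelihood function (often simply called the likelihood) expresses how plausible each value of an unknown parameter is, given a set of observations. It is equal to the joint probability distribution of the random sample, evaluated at the observed values and viewed as a function of the parameter.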

Put most simply, the likelihood function is a function that gives the value of the likelihood for each value of an unknown parameter.

Let $X_{1}, X_{2}, \ldots, X_{n}$ have a joint density function $f\left(X_{1}, X_{2}, \dots, X_{n} | \theta\right)$. Given that $X_{1}=x_{1}, X_{2}= x_{2}, \ldots, X_{n}=x_{n}$ is observed, the likelihood function of $\theta$ is defined by:

$$L(\theta)=L\left(\theta | x_{1}, x_{2}, \ldots, x_{n}\right)=f\left(x_{1}, x_{2}, \ldots, x_{n} | \theta\right)$$

However,

1. The likelihood function is not a probability density function of $\theta$; in general it does not integrate to 1 over $\theta$.

2. It is an important component of both frequentist and Bayesian analyses.
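
To make the definition concrete, here is a minimal sketch in Python. It assumes, purely for illustration, an i.i.d. normal sample with known standard deviation 1; the data values and the helper names `normal_pdf` and `likelihood` are hypothetical and not part of the definition above. The data stay fixed while $\theta$ varies.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def likelihood(theta, data, sigma=1.0):
    """L(theta | data): joint density of an i.i.d. sample, viewed as a function of theta."""
    return math.prod(normal_pdf(x, theta, sigma) for x in data)

# Hypothetical observations (assumed for this sketch only)
data = [1.2, 0.8, 1.1]

# Same data, different candidate parameter values
for theta in (0.0, 0.5, 1.0, 1.5):
    print(f"theta = {theta:.1f}  L(theta) = {likelihood(theta, data):.5f}")
```

Among these four candidates, the value of $\theta$ closest to the sample mean gives the largest likelihood.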


# Q: What is Maximum Likelihood Estimation (MLE) ? Can you give an example? 
## Maximum Likelihood Estimation (MLE)

MLE is a method of estimating the unknown parameter $\theta$ of a model, given observed data. It estimates the parameter by finding the value that maximizes the likelihood function. The resulting estimate is called the maximum likelihood estimate, $\hat{\theta}_{M L E}$. 



Let $X_{1}, X_{2}, \ldots, X_{n}$ be a random sample from a distribution that depends on one or more unknown parameters $\theta_{1}, \theta_{2}, \ldots, \theta_{m}$ with probability density (or mass) function $f\left(x_{i} ; \theta_{1}, \theta_{2}, \ldots, \theta_{m}\right)$. Suppose that $\left(\theta_{1}, \theta_{2}, \ldots, \theta_{m}\right)$ is restricted to a given parameter space $Ω$. Then:

(1) When regarded as a function of $\theta_{1}, \theta_{2}, \ldots, \theta_{m}$, the joint probability density (or mass) function of $X_{1}, X_{2}, \ldots, X_{n}$:

$$L\left(\theta_{1}, \theta_{2}, \ldots, \theta_{m}\right)=\prod_{i=1}^{n} f\left(x_{i} ; \theta_{1}, \theta_{2}, \ldots, \theta_{m}\right)$$


$\left(\left(\theta_{1}, \theta_{2}, \ldots, \theta_{m}\right) \text { in } \Omega\right)$ is called the likelihood function.


$\begin{array}{l}{\text { (2) If: }} \\ {\qquad\left[u_{1}\left(x_{1}, x_{2}, \ldots, x_{n}\right), u_{2}\left(x_{1}, x_{2}, \ldots, x_{n}\right), \ldots, u_{m}\left(x_{1}, x_{2}, \ldots, x_{n}\right)\right]}\end{array}$

is the m-tuple that maximizes the likelihood function, then:

$$\hat{\theta}_{i}=u_{i}\left(X_{1}, X_{2}, \ldots, X_{n}\right)$$
is the maximum likelihood estimator of $\theta_{i}, \text { for } i=1,2, \ldots, m$


(3) The corresponding observed values of the statistics in (2), namely:

$$\left[u_{1}\left(x_{1}, x_{2}, \ldots, x_{n}\right), u_{2}\left(x_{1}, x_{2}, \ldots, x_{n}\right), \ldots, u_{m}\left(x_{1}, x_{2}, \ldots, x_{n}\right)\right]$$

are called the maximum likelihood estimates of $\theta_{i}, \text { for } i=1,2, \ldots, m.$

## Example:

Suppose the weights of randomly selected American female college students are normally distributed with unknown mean $μ$ and standard deviation $σ$. A random sample of 10 American female college students yielded the following weights (in pounds):

115   122   130   127   149   160   152   138  149   180    

Based on the definitions given above, identify the likelihood function and the maximum likelihood estimator of $μ$, the mean weight of all American female college students. Using the given sample, find a maximum likelihood estimate of $μ$ as well.


### Solution:
The probability density function of $X_{i}$ is:
$$f\left(x_{i} ; \mu, \sigma^{2}\right)=\frac{1}{\sigma \sqrt{2 \pi}} \exp \left[-\frac{\left(x_{i}-\mu\right)^{2}}{2 \sigma^{2}}\right]$$

for −∞ < x < ∞. The parameter space is Ω = {(μ, σ): −∞ < μ < ∞ and 0 < σ < ∞}. Therefore, (you might want to convince yourself that) the likelihood function is:
$$L(\mu, \sigma)=\sigma^{-n}(2 \pi)^{-n / 2} \exp \left[-\frac{1}{2 \sigma^{2}} \sum_{i=1}^{n}\left(x_{i}-\mu\right)^{2}\right]$$

for −∞ < μ < ∞ and 0 < σ < ∞. It can be shown, by setting the partial derivative of $\log L(\mu, \sigma)$ with respect to μ equal to zero and solving for μ, that the maximum likelihood estimator of μ is:

$$\hat{\mu}=\frac{1}{n} \sum_{i=1}^{n} X_{i}=\overline{X}$$

Based on the given sample, a maximum likelihood estimate of μ is:

$$\hat{\mu}=\frac{1}{n} \sum_{i=1}^{n} x_{i}=\frac{1}{10}(115+\cdots+180)=142.2$$

pounds. Note that the only difference between the formulas for the maximum likelihood estimator and the maximum likelihood estimate is that:

* the estimator is defined using capital letters (to denote that its value is random), and

* the estimate is defined using lowercase letters (to denote that its value is fixed and based on an obtained sample)
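
The estimate can be checked numerically. The sketch below computes the sample mean of the ten weights and compares it with a coarse grid search over the log-likelihood. The grid range, the fixed value `sigma = 20.0`, and the function name `log_likelihood` are assumptions made for this illustration.

```python
import math

weights = [115, 122, 130, 127, 149, 160, 152, 138, 149, 180]

def log_likelihood(mu, sigma, xs):
    """Log of L(mu, sigma) for an i.i.d. normal sample xs."""
    n = len(xs)
    return (-n * math.log(sigma) - (n / 2) * math.log(2 * math.pi)
            - sum((x - mu) ** 2 for x in xs) / (2 * sigma ** 2))

# Closed-form maximum likelihood estimate of mu: the sample mean
mu_hat = sum(weights) / len(weights)

# Grid search over candidate means with sigma held at an arbitrary fixed value
# (the maximizing mu does not depend on sigma)
candidates = [140 + 0.1 * k for k in range(50)]
best = max(candidates, key=lambda mu: log_likelihood(mu, 20.0, weights))

print(mu_hat, round(best, 1))
```

Both approaches should agree to the resolution of the grid.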

# Q: How is linear regression related to PyTorch and gradient descent? 

Linear regression models a linear relationship between a dependent variable and one or more independent variables. If $Y$ is the dependent variable and $X_{1}$ and $X_{2}$ are independent variables, the linear relationship can be written as

$Y = aX_{1} + bX_{2} + C$

Where,

$Y$ = Dependent variable

a = coefficient/slope of $X_{1}$ variable

b = coefficient/slope of $X_{2}$ variable

$X_{1}$ = independent variable

$X_{2}$ = independent variable

$C$ = Intercept

The objective of linear regression is to find the best-fitting line, which can then be used as the reference for a test data set to determine accuracy. The fitted line is determined by the values of the intercept ($C$) and slopes ($a$, $b$). Gradient descent is an iterative method that adjusts these values until the error is at a minimum; in other words, it reduces the loss function. PyTorch is a library that can compute the gradients of the loss automatically and provides built-in optimizers, so a model can be fitted with a minimal amount of code. 
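
The idea can be sketched in plain Python with hand-written gradients. In PyTorch, the three gradient expressions would typically be obtained by automatic differentiation instead of being written by hand. The data, learning rate, and step count below are hypothetical choices for illustration.

```python
# Hypothetical data generated from Y = 2*X1 - 1*X2 + 0.5 (no noise)
X = [(0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (2.0, 1.0), (0.5, 2.0), (1.5, 0.5)]
Y = [2 * x1 - 1 * x2 + 0.5 for x1, x2 in X]

a, b, c = 0.0, 0.0, 0.0   # initial slopes and intercept
lr = 0.1                  # learning rate (step size), chosen by hand
n = len(X)

for step in range(5000):
    # Prediction errors of the current fit
    errors = [a * x1 + b * x2 + c - y for (x1, x2), y in zip(X, Y)]
    # Gradients of the mean squared error with respect to a, b and c
    grad_a = 2 / n * sum(e * x1 for e, (x1, _) in zip(errors, X))
    grad_b = 2 / n * sum(e * x2 for e, (_, x2) in zip(errors, X))
    grad_c = 2 / n * sum(errors)
    # Move each parameter a small step against its gradient
    a -= lr * grad_a
    b -= lr * grad_b
    c -= lr * grad_c

print(round(a, 3), round(b, 3), round(c, 3))
```

With noise-free data like this, the fitted values should approach the coefficients used to generate `Y`.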

# Q: Write out MSE loss for linear regression. Could we also use this loss for classification?

MSE (mean squared error), also called quadratic or L2 loss, is the most commonly used loss function for comparing a model's predicted output with the actual output. Mathematically, it can be expressed as
\begin{equation*}
MSE =  \frac{\sum_{i=1}^n (Y_i-y_i)^2}{n}
\end{equation*}
where $Y_i$ is the actual output, $y_i$ is the predicted output, and $n$ is the number of samples.
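
As a small illustration of the formula, the following sketch computes the MSE for a handful of values. The function name `mse` and the numbers are hypothetical.

```python
def mse(actual, predicted):
    """Mean squared error between actual outputs Y_i and predictions y_i."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum((Y - y) ** 2 for Y, y in zip(actual, predicted)) / len(actual)

# Hypothetical values for illustration
actual = [3.0, 5.0, 7.0]
predicted = [2.5, 5.0, 8.0]

# Squared errors are 0.25, 0.0 and 1.0, so the mean is 1.25 / 3
print(mse(actual, predicted))
```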
Some of the common loss functions used for ML are given here [1](https://heartbeat.fritz.ai/5-regression-loss-functions-all-machine-learners-should-know-4fb140e9d4b0) , [2](https://ml-cheatsheet.readthedocs.io/en/latest/loss_functions.html#mse-l2) and [3](https://machinelearningmastery.com/how-to-choose-loss-functions-when-training-deep-learning-neural-networks/)
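MSE is generally not used for classification problems because the targets are categorical: the outputs are class labels (for example 0 or 1) rather than continuous values, so a squared distance between prediction and label is not a natural measure of error.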

# Q: Write out the Maximum likelihood Estimation for linear regression. How is this related to the MSE loss for linear regression derived in the last point? Derive the relation between them. 

 We assume the noise is Gaussian and can be represented as:
$\epsilon \sim N\left(0, \sigma^{2}\right)$
The linear equation can be written as:
<p>
$y=\theta_{1} x+\theta_{0}+\epsilon$</p>
To apply maximum likelihood, we first need to derive the likelihood function. First, let's rewrite our model from above as a single conditional distribution given x:
$y \sim N\left(\theta_{1} x+\theta_{0}, \sigma^{2}\right)$
This is the equation of a Gaussian distribution's probability density function, with our linear equation in place of the mean (because the mean of $y$ given $x$ is the value of the linear function):
$f\left(y | x ; \theta_{0}, \theta_{1}, \sigma^{2}\right)=\frac{1}{\sqrt{2 \pi \sigma^{2}}} e^{\frac{-\left(y-(\theta_{1} x+\theta_{0})\right)^{2}}{2 \sigma^{2}}}$
<p>Each point is independent and identically distributed (iid), so we can write the likelihood function with respect to all of our observed points as the product of each individual probability density.</p>
$L_{X}\left(\theta_{0}, \theta_{1}, \sigma^{2}\right)=\prod_{(x, y) \in X} \frac{1}{\sqrt{2 \pi \sigma^{2}}} e^{\frac{-\left(y-(\theta_{1} x+\theta_{0})\right)^{2}}{2 \sigma^{2}}}$
<p> To make our equation simpler, let's take the log of our likelihood, writing $n$ for the number of observed points in $X$.</p>
$$
\begin{aligned}
l_{X}\left(\theta_{0}, \theta_{1}, \sigma^{2}\right)
&=\log \left[\prod_{(x, y) \in X} \frac{1}{\sqrt{2 \pi \sigma^{2}}} e^{\frac{-\left(y-\left(\theta_{1} x+\theta_{0}\right)\right)^{2}}{2 \sigma^{2}}}\right] \\
&=\sum_{(x, y) \in X} \log \left(\frac{1}{\sqrt{2 \pi \sigma^{2}}}\right)+\sum_{(x, y) \in X} \log \left(e^{\frac{-\left(y-(\theta_{1} x+\theta_{0})\right)^{2}}{2 \sigma^{2}}}\right) \\
&=n \log \left(\frac{1}{\sqrt{2 \pi \sigma^{2}}}\right)+\sum_{(x, y) \in X} \frac{-\left(y-\left(\theta_{1} x+\theta_{0}\right)\right)^{2}}{2 \sigma^{2}} \\
&=n \log (1)-n \log \left(\sqrt{2 \pi \sigma^{2}}\right)-\frac{1}{2 \sigma^{2}} \sum_{(x, y) \in X}\left[y-\left(\theta_{1} x+\theta_{0}\right)\right]^{2} \\
&=-n \log \left(\sqrt{2 \pi \sigma^{2}}\right)-\frac{1}{2 \sigma^{2}} \sum_{(x, y) \in X}\left[y-\left(\theta_{1} x+\theta_{0}\right)\right]^{2}
\end{aligned}
$$
<p>Maximizing a number is the same thing as minimizing the negative of the number. So instead of maximizing the likelihood, let's minimize the negative log-likelihood:</p>
$-l_{X}\left(\theta_{0}, \theta_{1}, \sigma^{2}\right)=n \log \left(\sqrt{2 \pi \sigma^{2}}\right)+\frac{1}{2 \sigma^{2}} \sum(y-\hat{y})^{2}$
where $\hat{y}=\theta_{1} x+\theta_{0}$ is the predicted value.
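The first term does not depend on $\theta_{0}$ or $\theta_{1}$, and the factor $\frac{1}{2 \sigma^{2}}$ is a positive constant, so both can be neglected. Minimizing the negative log-likelihood (or, equivalently, maximizing the log-likelihood) with respect to $\theta_{0}$ and $\theta_{1}$ therefore amounts to minimizing the sum of squared errors, and hence the mean squared error.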
[Reference](https://towardsdatascience.com/linear-regression-91eeae7d6a2e)
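
The relation can also be checked numerically. The sketch below assumes a known noise variance `sigma2` and a few hypothetical data points. It verifies that the negative log-likelihood equals a constant plus $\frac{n}{2\sigma^{2}}$ times the MSE, so both criteria rank candidate parameters identically.

```python
import math

# Hypothetical (x, y) observations and an assumed known noise variance
points = [(0.0, 1.1), (1.0, 2.9), (2.0, 5.2), (3.0, 6.8)]
sigma2 = 0.5
n = len(points)

def mse(theta0, theta1):
    """Mean squared error of the line y = theta1 * x + theta0."""
    return sum((y - (theta1 * x + theta0)) ** 2 for x, y in points) / n

def neg_log_likelihood(theta0, theta1):
    """Negative Gaussian log-likelihood of the same line."""
    sse = sum((y - (theta1 * x + theta0)) ** 2 for x, y in points)
    return n * math.log(math.sqrt(2 * math.pi * sigma2)) + sse / (2 * sigma2)

# Term that does not depend on theta0 or theta1
const = n * math.log(math.sqrt(2 * math.pi * sigma2))

candidates = [(1.0, 2.0), (0.0, 1.0), (1.5, 1.5), (0.5, 2.5)]
for theta0, theta1 in candidates:
    nll = neg_log_likelihood(theta0, theta1)
    rescaled = const + n / (2 * sigma2) * mse(theta0, theta1)
    print(theta0, theta1, round(nll, 6), round(rescaled, 6))
```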

# Q: Write out the likelihood function for linear classification. What is the drawback of using MSE loss here?

In the logit model, the output variable $y_{i}$ is a Bernoulli random variable (it can take only two values, either 1 or 0) with
<p> $\mathrm{P}\left(y_{i}=1 | x_{i}\right)=S\left(x_{i} \beta\right)$</p>
where
$S(t)=\frac{1}{1+\exp (-t)}$ is the logistic function, $x_{i}$ is a $1 \times K$ vector of inputs and $\beta$ is a $K \times 1$ vector of coefficients.
Furthermore, the likelihood of a single observation is

$L\left(\beta ; y_{i}, x_{i}\right)=\left[S\left(x_{i} \beta\right)\right]^{y_{i}}\left[1-S\left(x_{i} \beta\right)\right]^{1-y_{i}}$

and, assuming the observations are independent, the likelihood of the entire sample is equal to the product of the likelihoods of the single observations:

$L(\beta ; y, X)=\prod_{i=1}^{N}\left[S\left(x_{i} \beta\right)\right]^{y_{i}}\left[1-S\left(x_{i} \beta\right)\right]^{1-y_{i}}$

The log-likelihood is then

$$
\begin{aligned}
l(\beta ; y, X) &=\ln (L(\beta ; y, X)) \\
&=\ln \left(\prod_{i=1}^{N}\left[S\left(x_{i} \beta\right)\right]^{y_{i}}\left[1-S\left(x_{i} \beta\right)\right]^{1-y_{i}}\right) \\
&=\sum_{i=1}^{N}\left[y_{i} \ln \left(S\left(x_{i} \beta\right)\right)+\left(1-y_{i}\right) \ln \left(1-S\left(x_{i} \beta\right)\right)\right] \\
&=\sum_{i=1}^{N}\left[y_{i} \ln \left(\frac{1}{1+\exp \left(-x_{i} \beta\right)}\right)+\left(1-y_{i}\right) \ln \left(1-\frac{1}{1+\exp \left(-x_{i} \beta\right)}\right)\right] \\
&=\sum_{i=1}^{N}\left[y_{i} \ln \left(\frac{1}{1+\exp \left(-x_{i} \beta\right)}\right)+\left(1-y_{i}\right) \ln \left(\frac{1+\exp \left(-x_{i} \beta\right)-1}{1+\exp \left(-x_{i} \beta\right)}\right)\right] \\
&=\sum_{i=1}^{N}\left[y_{i} \ln \left(\frac{1}{1+\exp \left(-x_{i} \beta\right)}\right)+\left(1-y_{i}\right) \ln \left(\frac{\exp \left(-x_{i} \beta\right)}{1+\exp \left(-x_{i} \beta\right)}\right)\right] \\
&=\sum_{i=1}^{N}\left[\ln \left(\frac{\exp \left(-x_{i} \beta\right)}{1+\exp \left(-x_{i} \beta\right)}\right)+y_{i}\left(\ln \left(\frac{1}{1+\exp \left(-x_{i} \beta\right)}\right)-\ln \left(\frac{\exp \left(-x_{i} \beta\right)}{1+\exp \left(-x_{i} \beta\right)}\right)\right)\right] \\
&=\sum_{i=1}^{N}\left[\ln \left(\frac{\exp \left(-x_{i} \beta\right)}{1+\exp \left(-x_{i} \beta\right)} \frac{\exp \left(x_{i} \beta\right)}{\exp \left(x_{i} \beta\right)}\right)+y_{i} \ln \left(\frac{1}{1+\exp \left(-x_{i} \beta\right)} \frac{1+\exp \left(-x_{i} \beta\right)}{\exp \left(-x_{i} \beta\right)}\right)\right] \\
&=\sum_{i=1}^{N}\left[\ln \left(\frac{1}{1+\exp \left(x_{i} \beta\right)}\right)+y_{i} \ln \left(\frac{1}{\exp \left(-x_{i} \beta\right)}\right)\right] \\
&=\sum_{i=1}^{N}\left[\ln (1)-\ln \left(1+\exp \left(x_{i} \beta\right)\right)+y_{i}\left(\ln (1)-\ln \left(\exp \left(-x_{i} \beta\right)\right)\right)\right] \\
&=\sum_{i=1}^{N}\left[-\ln \left(1+\exp \left(x_{i} \beta\right)\right)+y_{i} x_{i} \beta\right]
\end{aligned}
$$
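
As a sanity check on the algebra, the first and last forms of the log-likelihood can be evaluated on the same inputs. This is a minimal sketch; the data, the coefficient vector `beta`, and the function names are hypothetical.

```python
import math

def sigmoid(t):
    """Logistic function S(t)."""
    return 1.0 / (1.0 + math.exp(-t))

def log_likelihood(beta, X, y):
    """Sum of y_i ln S(x_i beta) + (1 - y_i) ln(1 - S(x_i beta))."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        t = sum(x * b for x, b in zip(x_i, beta))   # x_i beta
        total += y_i * math.log(sigmoid(t)) + (1 - y_i) * math.log(1 - sigmoid(t))
    return total

def log_likelihood_simplified(beta, X, y):
    """Sum of -ln(1 + exp(x_i beta)) + y_i x_i beta."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        t = sum(x * b for x, b in zip(x_i, beta))
        total += -math.log(1 + math.exp(t)) + y_i * t
    return total

# Hypothetical data: the first column is a constant 1 for the intercept
X = [(1.0, 0.5), (1.0, -1.2), (1.0, 2.0), (1.0, 0.1)]
y = [1, 0, 1, 0]
beta = (-0.3, 0.8)

print(log_likelihood(beta, X, y), log_likelihood_simplified(beta, X, y))
```

The two printed values should agree up to floating-point rounding.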

# Q: Can gradient descent be used to find the parameters for linear regression? What about linear classification? Why?  

Yes, gradient descent can be used to find the parameters for linear regression. In linear classification, the inputs are continuous but the outputs are binary (0 or 1). The hypothesis function of linear classification is the sigmoid, which is differentiable, so gradient descent can be used there as well. 
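
For the classification case, a minimal sketch of gradient descent on the average negative log-likelihood of a sigmoid model might look like this. The one-feature data set, learning rate, and iteration count are assumptions chosen for illustration; the labels overlap on purpose so that a finite optimum exists.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

# Hypothetical one-feature data with overlapping classes
xs = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [0, 0, 1, 0, 1, 1]

def nll(b0, b1):
    """Average negative log-likelihood of the sigmoid model."""
    return -sum(y * math.log(sigmoid(b0 + b1 * x))
                + (1 - y) * math.log(1 - sigmoid(b0 + b1 * x))
                for x, y in zip(xs, ys)) / len(xs)

def gradient(b0, b1):
    """Gradient of nll with respect to the intercept b0 and slope b1."""
    errs = [sigmoid(b0 + b1 * x) - y for x, y in zip(xs, ys)]
    g0 = sum(errs) / len(xs)
    g1 = sum(e * x for e, x in zip(errs, xs)) / len(xs)
    return g0, g1

b0, b1 = 0.0, 0.0
lr = 0.1                 # learning rate, chosen by hand
start = nll(b0, b1)

for _ in range(5000):
    g0, g1 = gradient(b0, b1)
    b0 -= lr * g0
    b1 -= lr * g1

print(round(b0, 3), round(b1, 3), round(start, 4), round(nll(b0, b1), 4))
```

The loss after training should be lower than the starting loss, and the gradient should be close to zero.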



# Q: What are normal equations? Is it the same as least squares? Explain.
Normal equations are a technique for computing the coefficients of multivariate linear regression.
This problem is also called OLS (ordinary least squares) regression, and the Normal Equation is one approach to solving it.
It finds the regression coefficients analytically.
It is a one-step learning algorithm, as opposed to gradient descent, which is an iterative process for finding the regression coefficients.
This approach is an effective and time-saving option when working with a dataset that has a small number of features.<br>
__The Normal Equation is as follows:__
$$ \theta = ({X}^T{X})^{-1}.({X}^T{y}) $$
In the above equation,<br>
$θ$ : the hypothesis parameters that best fit the data.<br>
$X$ : input feature values of each instance.<br>
$y$ : output value of each instance.<br>
__Maths Behind the equation –__
Given the hypothesis function <br>
$$ h_{\theta}(x) = \theta_0{x_0} + \theta_1{x_1}+ \cdots + \theta_n{x_n} $$
where,<br>
$n$ : the no. of features in the data set.<br>
${x_0}$ : 1 (for vector multiplication)<br>
Notice that this is the dot product between the θ and x values. So, for convenience, we can write it as:<br>
$$ h_{\theta}(x) = \theta ^ T{x}$$
The goal in linear regression is to minimize the cost function:<br>
$$J(\theta) = \frac{1}{2m} \sum_{i = 1}^{m} [h_{\theta}(x^{(i)}) - y^{(i)}]^{2} $$
where,<br>
$x^{(i)}$ : the input value of the $i$-th training example.<br>
$m$ : no. of training instances<br>
$n$ : no. of data-set features<br>
$y^{(i)}$ : the expected result of the $i$-th instance<br>
Let us represent the cost function in vector form. The residuals of all $m$ training examples are<br>
$$
\begin{bmatrix}
h_\theta (x^{(1)}) \\
h_\theta (x^{(2)}) \\
\vdots \\
h_\theta (x^{(m)}) \\
\end{bmatrix}
- \begin{bmatrix}
y^{(1)} \\
y^{(2)} \\
\vdots \\
y^{(m)} \\
\end{bmatrix}
$$
<br><br>We have ignored the factor $\frac{1}{2m}$ here because a positive constant does not change the minimizer. It is a mathematical convenience when computing the gradient for gradient descent, but it is not needed here.<br>
$$
\begin{bmatrix}
\theta ^ T x^{(1)} \\
\theta ^ T x^{(2)} \\
\vdots \\
\theta ^ T x^{(m)} \\
\end{bmatrix} - y
=
\begin{bmatrix}
\theta_0 x^{(1)}_0 + \theta_1 x^{(1)}_1 + \cdots + \theta_n x^{(1)}_n \\
\theta_0 x^{(2)}_0 + \theta_1 x^{(2)}_1 + \cdots + \theta_n x^{(2)}_n \\
\vdots \\
\theta_0 x^{(m)}_0 + \theta_1 x^{(m)}_1 + \cdots + \theta_n x^{(m)}_n \\
\end{bmatrix} - y
$$
$x^{(i)}_j$ : value of the $j$-th feature in the $i$-th training example.
This can further be reduced to $X\theta - y$, where row $i$ of $X$ is $(x^{(i)})^T$.<br>
Each residual value must be squared, but we cannot simply square the above expression, because the square of a vector/matrix is not equal to the square of each of its values. To get the sum of the squared values, multiply the transpose of the vector by the vector itself. So, the final expression derived is
$$(X\theta - y)^{T}(X\theta - y)$$
Therefore, the cost function is
$$Cost = (X\theta - y)^{T}(X\theta - y) $$
Now we obtain the value of θ by taking the derivative of the cost with respect to θ and setting it to zero:
$$\frac{\partial J_{\theta}}{\partial {\theta}} = \frac{\partial}{\partial {\theta}}{[(X{\theta}- y)^T{(X{\theta}- y)}]}$$
$$ \frac{\partial J_{\theta}}{\partial {\theta}} = 2X^TX\theta - 2X^Ty$$
$$ Cost^{'}(\theta) = 0 $$
$$2X^{T}X{\theta} - 2X^Ty = 0$$
$$2X^{T}X{\theta} = 2X^Ty$$
$$ (X^TX)^{-1}(X^TX){\theta} = (X^TX)^{-1}.(X^Ty) $$
$$\theta = (X^TX)^{-1}.(X^Ty)$$
This is the final derived Normal Equation, with θ giving the minimum cost value, provided $X^TX$ is invertible.
[Reference](https://www.geeksforgeeks.org/ml-normal-equation-in-linear-regression/)
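
A minimal sketch of the Normal Equation in plain Python, for the special case of one feature plus the constant $x_0 = 1$ (so that $X^TX$ is a 2x2 matrix). The helper names and the data are hypothetical; in practice a linear-algebra library would normally be used for the matrix operations.

```python
def transpose(M):
    """Transpose of a matrix stored as a list of rows."""
    return [list(row) for row in zip(*M)]

def matmul(A, B):
    """Matrix product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def inverse_2x2(M):
    """Inverse of a 2x2 matrix via the adjugate formula."""
    (a, b), (c, d) = M
    det = a * d - b * c
    if det == 0:
        raise ValueError("X^T X is singular; the normal equation has no unique solution")
    return [[d / det, -b / det], [-c / det, a / det]]

# Hypothetical data lying exactly on y = 1 + 2x; the first column is x_0 = 1
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [[1.0], [3.0], [5.0], [7.0]]

Xt = transpose(X)
theta = matmul(inverse_2x2(matmul(Xt, X)), matmul(Xt, y))   # (X^T X)^{-1} (X^T y)
print(theta)
```

Because the data lie exactly on a line, the recovered parameters should match its intercept and slope up to rounding.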








## Alternative!

Given a matrix equation

 $Ax = b$, 
the normal equation is that which minimizes the sum of the square differences between the left and right sides:

 $A^{T}Ax = A^{T}b$. 
It is called a normal equation because the residual $b - Ax$ is normal (orthogonal) to the range of $A$, that is, to every vector in the span of the columns of $A$.

Here, $A^{T}A$ is a normal matrix.

Normal equations are therefore used to solve least-squares problems.
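
The orthogonality property can be checked on a small example. The sketch below uses a hypothetical overdetermined system with two unknowns, solves the 2x2 normal equations by Cramer's rule, and computes the dot product of the residual with each column of $A$.

```python
# Hypothetical overdetermined system A x = b (4 equations, 2 unknowns)
A = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
b = [1.1, 2.9, 5.2, 6.8]

# Form A^T A (2x2) and A^T b (length 2)
AtA = [[sum(row[i] * row[j] for row in A) for j in range(2)] for i in range(2)]
Atb = [sum(row[i] * bi for row, bi in zip(A, b)) for i in range(2)]

# Solve the normal equations A^T A x = A^T b by Cramer's rule
det = AtA[0][0] * AtA[1][1] - AtA[0][1] * AtA[1][0]
x = [(Atb[0] * AtA[1][1] - AtA[0][1] * Atb[1]) / det,
     (AtA[0][0] * Atb[1] - AtA[1][0] * Atb[0]) / det]

# Residual r = b - A x and its dot product with each column of A
r = [bi - (row[0] * x[0] + row[1] * x[1]) for row, bi in zip(A, b)]
dots = [sum(row[i] * ri for row, ri in zip(A, r)) for i in range(2)]

print(x, dots)
```

Both dot products should be zero up to floating-point rounding, even though the residual itself is not zero.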


