# What is a likelihood function? Also add a formula and explain what it means.


 The likelihood function measures how well a given set of parameter values explains the observed data. The “parameters” here aren’t population parameters; they are the parameters of a particular probability density function (PDF). 
 Suppose the joint probability density function of your sample X = (X1, …, Xn) is f(x | θ), where θ is a parameter, and X = x is an observed sample point. Then the function of θ defined as

L(θ |x) = f(x |θ)
is the likelihood function.

Let $X^n = (X_1, \ldots, X_n)$ have joint density $p(x^n; θ) = p(x_1, \ldots, x_n; θ)$, where
θ ∈ Θ. The likelihood function L : Θ → [0,∞) is defined by:
<p>$L(θ) ≡ L(θ; x^n) = p(x^n; θ)$</p>

where $ x^n $ is fixed and θ varies in Θ. The log-likelihood function is 
<p>$\ell(\theta)=\log L(\theta)$</p>

1. The likelihood function is a function of θ.
2. The likelihood function is not a probability density function: viewed as a function of θ, it need not integrate to 1.
3. If the data are iid (independent and identically distributed), then the likelihood is
<p>$L(\theta)=\prod_{i=1}^{n} p\left(x_{i} ; \theta\right)$</p> This product form holds in the iid case only.
4. The likelihood is only defined up to a constant of proportionality. In other words, it is
an equivalence class of functions.
5. The likelihood function is used (i) to generate estimators (the maximum likelihood
estimator) and (ii) as a key ingredient in Bayesian inference.
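
The points above can be illustrated with a small sketch. It assumes a Bernoulli model and a made-up sample of coin flips (neither is from the text); `bernoulli_likelihood` and the other names are hypothetical helpers chosen for this example. It evaluates $L(\theta)$ on a grid, shows that the log-likelihood is the log of the product, and checks that the area under $L(\theta)$ is not 1.

```python
import math

def bernoulli_likelihood(theta, data):
    """L(theta) = prod_i theta^x_i * (1 - theta)^(1 - x_i) for iid 0/1 data."""
    like = 1.0
    for x in data:
        like *= theta if x == 1 else (1.0 - theta)
    return like

def bernoulli_log_likelihood(theta, data):
    """log L(theta), computed as a sum of per-observation log terms."""
    return sum(math.log(theta if x == 1 else 1.0 - theta) for x in data)

# Hypothetical sample: 6 ones and 2 zeros. The data are fixed; theta varies.
data = [1, 0, 1, 1, 0, 1, 1, 1]
thetas = [i / 100 for i in range(1, 100)]

# Grid value of theta with the highest likelihood.
best_theta = max(thetas, key=lambda t: bernoulli_likelihood(t, data))

# Riemann-sum approximation of the area under L(theta) on (0, 1).
area = sum(bernoulli_likelihood(t, data) for t in thetas) * 0.01

print(best_theta, area)
```

On this sample the grid maximum sits at the fraction of ones in the data, and the area is far below 1, which is what point 2 says.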

#  What is Maximum Likelihood estimation (MLE) ? Can you give an example?



Maximum likelihood is a way to find the parameter values under which a set of observed data is most probable.
Maximum Likelihood Estimation (MLE) finds the parameters of the population distribution that are most likely to have generated the observed sample. How well the data matches the model is known as “goodness of fit.”
<p>
For example, a researcher might be interested in finding out the mean weight gain of people eating a particular diet. The researcher is unable to weigh every person in the population, so instead takes a sample. Weight gains tend to follow a normal distribution; Maximum Likelihood Estimation can be used to estimate the mean and variance of the weight gain in the general population based on this sample.</p>

Suppose X1, X2, X3, . . . , Xn have joint density denoted
<p>
$f_θ(x1, x2, . . . , xn) = f(x1, x2, . . . , xn|θ)$</p>
Given observed values X1 = x1, X2 = x2, . . . , Xn = xn, the likelihood of θ is the function
<p>$lik(θ) = f(x1, x2, . . . , xn|θ)$</p>
considered as a function of θ.
If the distribution is discrete, f will be the probability mass function (frequency function) instead of a density.
In words:
<b>lik(θ)=probability of observing the given data as a function of θ.</b>

If the Xi are iid, then the likelihood simplifies to
<p>
$\operatorname{lik}(\theta)=\prod_{i=1}^{n} f\left(x_{i} | \theta\right)$</p>

Rather than maximising this product, which can be quite tedious, we often use the fact
that the logarithm is an increasing function, so it is equivalent to maximise the
log-likelihood:
<p>$l(\theta)=\sum_{i=1}^{n} \log \left(f\left(x_{i} | \theta\right)\right)$</p>
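
As a rough sketch of the weight-gain example, the snippet below assumes a normal model and a small made-up sample (the numbers are hypothetical, not from the text). For the normal distribution the log-likelihood is maximised by the sample mean and by the variance computed with a divisor of $n$; the code checks numerically that nearby parameter values give a lower log-likelihood.

```python
import math

def normal_log_likelihood(mu, sigma2, xs):
    """Sum of log N(x_i | mu, sigma2) over an iid sample."""
    n = len(xs)
    squared = sum((x - mu) ** 2 for x in xs)
    return -0.5 * n * math.log(2 * math.pi * sigma2) - squared / (2 * sigma2)

def normal_mle(xs):
    """Closed-form maximisers of the normal log-likelihood."""
    n = len(xs)
    mu_hat = sum(xs) / n
    sigma2_hat = sum((x - mu_hat) ** 2 for x in xs) / n  # divides by n, not n - 1
    return mu_hat, sigma2_hat

# Hypothetical weight gains in kg for eight sampled people.
weight_gain = [1.2, 0.8, 2.1, 1.5, 0.9, 1.7, 1.3, 1.1]

mu_hat, sigma2_hat = normal_mle(weight_gain)
best_ll = normal_log_likelihood(mu_hat, sigma2_hat, weight_gain)

# Moving either parameter away from the estimate lowers the log-likelihood.
worse_mu = normal_log_likelihood(mu_hat + 0.1, sigma2_hat, weight_gain)
worse_var = normal_log_likelihood(mu_hat, sigma2_hat * 1.2, weight_gain)
print(mu_hat, sigma2_hat, best_ll > worse_mu, best_ll > worse_var)
```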

# How is linear regression related to Pytorch and gradient descent

Linear regression is a linear approach to modelling the relationship between a dependent variable and one or more independent variables. Let X be the independent variable and Y be the dependent variable. We will define a linear relationship between these two variables as follows:
$$Y=mX+c$$
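$m$ is the slope of the line and $c$ is the y-intercept. We will use this equation to train our model with a given dataset and predict the value of $Y$ for any given value of $X$. Our challenge is to determine the values of $m$ and $c$ such that the corresponding line is the best-fitting line, i.e. the one that gives the minimum error.
The loss measures the error of the predictions made with the current values of $m$ and $c$. Our goal is to minimize this loss to obtain the most accurate values of $m$ and $c$.
Gradient descent performs this minimization iteratively: it repeatedly moves $m$ and $c$ a small step in the direction opposite to the gradient of the loss. PyTorch provides built-in loss functions, automatic differentiation to compute those gradients, and optimizers to apply the updates, which keeps the number of lines of code small.
This is how they are related: linear regression is the model, gradient descent is the method used to fit it, and PyTorch is a library that implements both.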
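
A minimal sketch of that loop, written in plain Python so it has no dependencies. The gradients of the mean squared error are derived by hand here; in PyTorch they would typically come from autograd (`loss.backward()`) and the update from an optimizer such as `torch.optim.SGD`. The function name `fit_line`, the learning rate, the step count, and the data are all assumptions made for this illustration.

```python
def fit_line(xs, ys, lr=0.05, steps=2000):
    """Fit Y = m*X + c by gradient descent on the mean squared error."""
    m, c = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Prediction errors with the current parameters.
        errors = [(m * x + c) - y for x, y in zip(xs, ys)]
        # Partial derivatives of (1/n) * sum(error^2) w.r.t. m and c.
        grad_m = (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_c = (2.0 / n) * sum(errors)
        # Step against the gradient.
        m -= lr * grad_m
        c -= lr * grad_c
    return m, c

# Hypothetical noise-free data generated from Y = 2X + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

m, c = fit_line(xs, ys)
print(round(m, 4), round(c, 4))
```

Because the data lie exactly on a line, the fitted slope and intercept approach 2 and 1.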

# Write out MSE loss for linear regression. Could we also use this loss for classification?


The Mean Squared Error (MSE) is perhaps the simplest and most common loss function, often taught in introductory Machine Learning courses. To calculate the MSE, you take the difference between your model’s predictions and the ground truth, square it, and average it out across the whole dataset.
The MSE will never be negative, since we are always squaring the errors. The MSE is formally defined by the following equation:
<p>
$\operatorname{MSE}=\frac{1}{N} \sum_{i=1}^{N}\left(y_{i}-\hat{y}_{i}\right)^{2}$</p>
Using the MSE loss makes sense under the assumption that your outputs are a real-valued function of your inputs plus a certain amount of irreducible Gaussian noise with constant mean and variance. If these assumptions don’t hold (such as in classification, where the outputs are discrete labels), the MSE loss may not be the best bet.
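
The formula translates directly into code. This is a small sketch with made-up targets and predictions; the function name `mse` is just a label for this example.

```python
def mse(y_true, y_pred):
    """Mean squared error: average of squared differences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    n = len(y_true)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / n

# Hypothetical ground truth and model predictions.
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

loss = mse(y_true, y_pred)
print(loss)  # squared errors 0.25, 0.25, 0, 1 -> 1.5 / 4 = 0.375
```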

# Write out the Maximum likelihood Estimation for linear regression. How is this related to the MSE loss for linear regression derived in the last point? Derive the relation between them.


We write our linear model with Gaussian noise like this:
$\epsilon \sim N\left(0, \sigma^{2}\right)$
<p>
$y=\theta_{1} x+\theta_{0}+\epsilon$</p>
To apply maximum likelihood, we first need to derive the likelihood function. First, let’s rewrite our model from above as a single conditional distribution given x:
$y \sim N\left(\theta_{1} x+\theta_{0}, \sigma^{2}\right)$
This gives the equation of a Gaussian distribution’s probability density function, with our linear equation in place of the mean:
$f\left(y | x ; \theta_{0}, \theta_{1}, \sigma^{2}\right)=\frac{1}{\sqrt{2 \pi \sigma^{2}}} e^{\frac{-\left(y-\left(\theta_{1} x+\theta_{0}\right)\right)^{2}}{2 \sigma^{2}}}$
<p>Each point is independent and identically distributed (iid), so we can write the likelihood function with respect to all of our observed points as the product of each individual probability density.</p>

$L_{X}\left(\theta_{0}, \theta_{1}, \sigma^{2}\right)=\prod_{(x, y) \in X} \frac{1}{\sqrt{2 \pi \sigma^{2}}} e^{\frac{-\left(y-\left(\theta_{1} x+\theta_{0}\right)\right)^{2}}{2 \sigma^{2}}}$
<p> To make our equation simpler, let’s take the log of our likelihood, writing $n$ for the number of observed points in $X$.</p> 
$l_{X}\left(\theta_{0}, \theta_{1}, \sigma^{2}\right)=\log \left[\prod_{(x, y) \in X} \frac{1}{\sqrt{2 \pi \sigma^{2}}} e^{\frac{-\left(y-\left(\theta_{1} x+\theta_{0}\right)\right)^{2}}{2 \sigma^{2}}}\right]$

$=n \log \left(\frac{1}{\sqrt{2 \pi \sigma^{2}}}\right)+\sum_{(x, y) \in X} \log \left(e^{\frac{-\left(y-\left(\theta_{1} x+\theta_{0}\right)\right)^{2}}{2 \sigma^{2}}}\right)$

$=n \log \left(\frac{1}{\sqrt{2 \pi \sigma^{2}}}\right)+\sum_{(x, y) \in X} \frac{-\left(y-\left(\theta_{1} x+\theta_{0}\right)\right)^{2}}{2 \sigma^{2}}$

$=n \log (1)-n \log \left(\sqrt{2 \pi \sigma^{2}}\right)-\frac{1}{2 \sigma^{2}} \sum_{(x, y) \in X}\left[y-\left(\theta_{1} x+\theta_{0}\right)\right]^{2}$

$=-n \log \left(\sqrt{2 \pi \sigma^{2}}\right)-\frac{1}{2 \sigma^{2}} \sum_{(x, y) \in X}\left[y-\left(\theta_{1} x+\theta_{0}\right)\right]^{2}$

<p>Maximizing a number is the same thing as minimizing the negative of the number. So instead of maximizing the likelihood, let’s minimize the negative log-likelihood, writing $\hat{y}=\theta_{1} x+\theta_{0}$ for the prediction:</p>
$-l_{X}\left(\theta_{0}, \theta_{1}, \sigma^{2}\right)=n \log \left(\sqrt{2 \pi \sigma^{2}}\right)+\frac{1}{2 \sigma^{2}} \sum_{(x, y) \in X}(y-\hat{y})^{2}$
<p>The first term and the factor $\frac{1}{2 \sigma^{2}}$ do not depend on $\theta_{0}$ or $\theta_{1}$, so for a fixed $\sigma^{2}$, minimizing the negative log-likelihood over $\theta_{0}, \theta_{1}$ is the same as minimizing $\sum_{(x, y) \in X}(y-\hat{y})^{2}=n \cdot \operatorname{MSE}$. The maximum likelihood estimate of the line is therefore exactly the minimum-MSE fit.</p>
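
The relation can be checked numerically. The sketch below assumes a handful of made-up points and a fixed noise variance (`points` and `sigma2` are hypothetical). It computes the negative log-likelihood directly from the Gaussian density and again as a constant plus a scaled MSE; the two agree, so ranking candidate lines by one ranks them by the other.

```python
import math

def gaussian_nll(theta0, theta1, sigma2, points):
    """Negative log-likelihood, summing -log of each Gaussian density."""
    total = 0.0
    for x, y in points:
        mean = theta1 * x + theta0
        density = math.exp(-(y - mean) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)
        total -= math.log(density)
    return total

def mse_of_line(theta0, theta1, points):
    """Mean squared error of the line y = theta1 * x + theta0."""
    return sum((y - (theta1 * x + theta0)) ** 2 for x, y in points) / len(points)

# Hypothetical observations and an assumed, fixed noise variance.
points = [(0.0, 1.1), (1.0, 2.9), (2.0, 5.2), (3.0, 6.8)]
sigma2 = 0.5
n = len(points)

def nll_from_mse(theta0, theta1):
    """n * log(sqrt(2*pi*sigma2)) + n * MSE / (2 * sigma2)."""
    constant = n * math.log(math.sqrt(2 * math.pi * sigma2))
    return constant + n * mse_of_line(theta0, theta1, points) / (2 * sigma2)

# Candidate (theta0, theta1) pairs; both criteria should order them identically.
candidates = [(1.0, 2.0), (0.0, 2.5), (2.0, 1.0), (1.1, 1.9)]
by_nll = sorted(candidates, key=lambda t: gaussian_nll(t[0], t[1], sigma2, points))
by_mse = sorted(candidates, key=lambda t: mse_of_line(t[0], t[1], points))
print(by_nll == by_mse)
```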

# Write out the likelihood function for linear classification. What is the drawback of using MSE loss here?


 In the logit model, the output variable $y_{i}$ is a Bernoulli random variable (it can take only two values, either 1 or 0), with
<p> $\mathrm{P}\left(y_{i}=1 | x_{i}\right)=S\left(x_{i} \beta\right)$</p>
where $S(t)=\frac{1}{1+\exp (-t)}$ is the logistic function, $x_{i}$ is a $1 \times K$ vector of inputs and $\beta$ is a $K \times 1$ vector of coefficients.

The likelihood of a single observation is
$L\left(\beta ; y_{i}, x_{i}\right)=\left[S\left(x_{i} \beta\right)\right]^{y_{i}}\left[1-S\left(x_{i} \beta\right)\right]^{1-y_{i}}$
Since the observations are independent, the likelihood of the entire sample is equal to the product of the likelihoods of the single observations, and the log-likelihood is:

$\begin{aligned} l(\beta ; y, X) &=\ln (L(\beta ; y, X)) \\
&=\ln \left(\prod_{i=1}^{N}\left[S\left(x_{i} \beta\right)\right]^{y_{i}}\left[1-S\left(x_{i} \beta\right)\right]^{1-y_{i}}\right) \\
&=\sum_{i=1}^{N}\left[y_{i} \ln \left(S\left(x_{i} \beta\right)\right)+\left(1-y_{i}\right) \ln \left(1-S\left(x_{i} \beta\right)\right)\right] \\
&=\sum_{i=1}^{N}\left[y_{i} \ln \left(\frac{1}{1+\exp \left(-x_{i} \beta\right)}\right)+\left(1-y_{i}\right) \ln \left(1-\frac{1}{1+\exp \left(-x_{i} \beta\right)}\right)\right] \\
&=\sum_{i=1}^{N}\left[y_{i} \ln \left(\frac{1}{1+\exp \left(-x_{i} \beta\right)}\right)+\left(1-y_{i}\right) \ln \left(\frac{1+\exp \left(-x_{i} \beta\right)-1}{1+\exp \left(-x_{i} \beta\right)}\right)\right] \\
&=\sum_{i=1}^{N}\left[y_{i} \ln \left(\frac{1}{1+\exp \left(-x_{i} \beta\right)}\right)+\left(1-y_{i}\right) \ln \left(\frac{\exp \left(-x_{i} \beta\right)}{1+\exp \left(-x_{i} \beta\right)}\right)\right] \\
&=\sum_{i=1}^{N}\left[\ln \left(\frac{\exp \left(-x_{i} \beta\right)}{1+\exp \left(-x_{i} \beta\right)}\right)+y_{i}\left(\ln \left(\frac{1}{1+\exp \left(-x_{i} \beta\right)}\right)-\ln \left(\frac{\exp \left(-x_{i} \beta\right)}{1+\exp \left(-x_{i} \beta\right)}\right)\right)\right] \\
&=\sum_{i=1}^{N}\left[\ln \left(\frac{\exp \left(-x_{i} \beta\right)}{1+\exp \left(-x_{i} \beta\right)} \cdot \frac{\exp \left(x_{i} \beta\right)}{\exp \left(x_{i} \beta\right)}\right)+y_{i} \ln \left(\frac{1}{1+\exp \left(-x_{i} \beta\right)} \cdot \frac{1+\exp \left(-x_{i} \beta\right)}{\exp \left(-x_{i} \beta\right)}\right)\right] \\
&=\sum_{i=1}^{N}\left[\ln \left(\frac{1}{1+\exp \left(x_{i} \beta\right)}\right)+y_{i} \ln \left(\frac{1}{\exp \left(-x_{i} \beta\right)}\right)\right] \\
&=\sum_{i=1}^{N}\left[\ln (1)-\ln \left(1+\exp \left(x_{i} \beta\right)\right)+y_{i}\left(\ln (1)-\ln \left(\exp \left(-x_{i} \beta\right)\right)\right)\right] \\
&=\sum_{i=1}^{N}\left[-\ln \left(1+\exp \left(x_{i} \beta\right)\right)+y_{i} x_{i} \beta\right]
\end{aligned}$
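
As a sanity check on the algebra, the sketch below evaluates the log-likelihood in its starting form (with $S(x_i \beta)$) and in its final simplified form, and confirms they agree. The inputs `X`, labels `y`, and coefficients `beta` are made-up values for illustration; the first entry of each input row is a constant 1 acting as the intercept.

```python
import math

def sigmoid(t):
    """Logistic function S(t) = 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + math.exp(-t))

def dot(x, beta):
    """Inner product x_i * beta for a 1xK row and a Kx1 column."""
    return sum(a * b for a, b in zip(x, beta))

def log_likelihood_bernoulli_form(beta, X, y):
    """sum_i [ y_i ln S(x_i b) + (1 - y_i) ln(1 - S(x_i b)) ]."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        p = sigmoid(dot(x_i, beta))
        total += y_i * math.log(p) + (1 - y_i) * math.log(1.0 - p)
    return total

def log_likelihood_simplified(beta, X, y):
    """sum_i [ -ln(1 + exp(x_i b)) + y_i x_i b ]."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        t = dot(x_i, beta)
        total += -math.log(1.0 + math.exp(t)) + y_i * t
    return total

# Hypothetical data: K = 2 (intercept column plus one feature).
X = [[1.0, 0.5], [1.0, -1.2], [1.0, 2.0], [1.0, 0.1]]
y = [1, 0, 1, 0]
beta = [0.3, -0.7]

ll_a = log_likelihood_bernoulli_form(beta, X, y)
ll_b = log_likelihood_simplified(beta, X, y)
print(ll_a, ll_b)
```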

    

# Can gradient descent be used to find the parameters for linear regression? What about linear classification? Why? 


Yes. Gradient descent is used in linear regression to find the optimal parameters: the weights are updated iteratively by stepping against the gradient of the loss.

In linear classification, it is hard to optimize the misclassification cost directly with gradient descent, because the cost only changes when the decision boundary crosses a data point. Elsewhere the cost is flat, so the gradient gives no local information about which direction to move. Hence we use an algorithm that estimates the parameters with updates similar to those of regression, i.e. the Perceptron algorithm. With a differentiable loss, such as the negative of the logistic log-likelihood derived above, gradient descent can be applied to classification as well.
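
A minimal sketch of the Perceptron algorithm mentioned above, assuming labels coded as +1 and -1 and a small, linearly separable toy dataset (both are assumptions for this example; the first entry of each input is a constant 1 for the bias). The weights are only changed when a point is misclassified, which is how the algorithm sidesteps the flat cost.

```python
def train_perceptron(X, y, epochs=100):
    """Perceptron updates: w <- w + y_i * x_i whenever x_i is misclassified."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            score = sum(w_j * x_j for w_j, x_j in zip(w, x_i))
            if y_i * score <= 0:  # wrong side of (or on) the boundary
                w = [w_j + y_i * x_j for w_j, x_j in zip(w, x_i)]
                mistakes += 1
        if mistakes == 0:  # every point is classified correctly
            break
    return w

def predict(w, x):
    """Return +1 or -1 depending on the side of the boundary."""
    return 1 if sum(w_j * x_j for w_j, x_j in zip(w, x)) > 0 else -1

# Hypothetical separable data: [bias, feature 1, feature 2].
X = [[1.0, 2.0, 1.0], [1.0, 1.0, 3.0], [1.0, -1.0, -2.0], [1.0, -2.0, -1.0]]
y = [1, 1, -1, -1]

w = train_perceptron(X, y)
predictions = [predict(w, x) for x in X]
print(w, predictions)
```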

# What are normal equations? Is it the same as least squares? Explain. 


The normal equation is an analytical approach to linear regression with a least-squares cost function. We can directly find the value of θ without using gradient descent. This approach is an effective and time-saving option when working with a dataset that has a small number of features.

The normal equation is a method of finding the optimum θ without iteration. It is as follows:

$\theta = (X^T X)^{-1}X^T y$

There is no need to do feature scaling with the normal equation.

In the above equation,

θ : the hypothesis parameters that fit the data best.

X : the matrix of input feature values, one row per instance.

y : the vector of output values, one per instance.
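
For a single feature plus an intercept, $X^T X$ is a 2x2 matrix and the normal equation can be written out by hand. The sketch below does exactly that in plain Python; the function name and data are hypothetical. With a linear algebra library one would more commonly solve the system $X^T X \theta = X^T y$ rather than form the inverse explicitly.

```python
def normal_equation_line(xs, ys):
    """Solve theta = (X^T X)^{-1} X^T y for X with rows [1, x_i]."""
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # X^T X = [[n, sx], [sx, sxx]] and X^T y = [sy, sxy].
    det = n * sxx - sx * sx
    if det == 0:
        raise ValueError("X^T X is singular; all x values are identical")
    # Inverse of a 2x2 matrix is (1/det) * [[sxx, -sx], [-sx, n]].
    intercept = (sxx * sy - sx * sxy) / det
    slope = (n * sxy - sx * sy) / det
    return intercept, slope

# Hypothetical noise-free data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

intercept, slope = normal_equation_line(xs, ys)
print(intercept, slope)
```

No learning rate or iteration count appears anywhere, which is the practical difference from gradient descent.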

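Are the normal equation and the least squares method the same?

Least squares method:

The least squares method is a form of regression analysis that finds the line of best fit for a set of data by minimizing the sum of squared differences between observed and predicted values, providing a visual demonstration of the relationship between the data points. Each data point represents the relationship between a known independent variable and the dependent variable being modelled.

No, they are not the same. Least squares is the criterion that defines the best fit; the normal equation is one way, a closed-form one, of computing the parameters that satisfy that criterion. Gradient descent is another.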

# Is feature scaling needed for linear regression when using gradient descent?  Why or why not?


Feature scaling is not strictly required, but it can be very helpful. It can speed up the convergence of gradient descent: when features are on very different scales, a single learning rate is either too large for the steep directions of the loss or too slow for the flat ones.
It can also make the coefficients easier to analyse. If your features differ in scale, the resulting coefficients differ in scale too, which makes them hard to compare and interpret.
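
The convergence effect can be seen on a toy problem. The sketch below is an illustration under assumptions: the data, learning rates, and step count are made up, and `standardize` rescales the feature to zero mean and unit standard deviation. On the raw feature the learning rate has to be tiny to stay stable, so the intercept barely moves; on the standardized feature a much larger rate works and the loss drops to nearly zero in the same number of steps.

```python
import math

def standardize(xs):
    """Rescale to zero mean and unit (population) standard deviation."""
    n = len(xs)
    mean = sum(xs) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    return [(x - mean) / std for x in xs]

def gd_final_loss(xs, ys, lr, steps):
    """Run gradient descent on y = m*x + c and return the final MSE."""
    m, c = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [(m * x + c) - y for x, y in zip(xs, ys)]
        m -= lr * (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
        c -= lr * (2.0 / n) * sum(errors)
    return sum(((m * x + c) - y) ** 2 for x, y in zip(xs, ys)) / n

# Hypothetical feature on a large scale; targets follow y = 2 + 0.003 * x.
xs = [1000.0, 2000.0, 3000.0, 4000.0, 5000.0]
ys = [5.0, 8.0, 11.0, 14.0, 17.0]

# Raw feature: a much larger rate than this would make the updates diverge.
raw_loss = gd_final_loss(xs, ys, lr=5e-8, steps=1000)

# Standardized feature: a far larger learning rate is stable.
scaled_loss = gd_final_loss(standardize(xs), ys, lr=0.1, steps=1000)

print(raw_loss, scaled_loss)
```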

# Write out the MLE approach for logistic regression. How is this related to the binary cross-entropy? You can use this reference to answer this question. Video Reference


Typically, to find the maximum likelihood estimates we’d differentiate the log
likelihood with respect to the parameters, set the derivatives equal to zero, and solve.

 
Now let’s consider the logistic model (a binary classifier) to describe log-odds using a linear model:

$$\ln \frac{p}{1-p}=\beta_{0}+\beta_{1} x_{1}+\cdots+\beta_{m} x_{m}$$

The probability of observing outcome y=1 under this model is given by the following function (sigmoid):

$$p \equiv p(y=1 | \mathbf{B}, \mathbf{X})=\frac{1}{1+e^{-\left(\beta_{0}+\beta_{1} x_{1}+\cdots+\beta_{m} x_{m}\right)}}$$

With 0 and 1 being the only possible outcomes, the probability of observing outcome y=0 is simply (1-p). Both cases can be written compactly for an outcome $o \in \{0,1\}$:

$$p(y=o | \mathbf{B}, \mathbf{X})=p^{o}(1-p)^{1-o}$$
The likelihood function is given by the product of all individual probabilities:

$$\mathcal{L}=\prod_{i=1}^{n} p\left(y=y_{i} | \mathbf{B}, \mathbf{X}_{i}\right)=\prod_{i=1}^{n} p_{i}^{y_{i}}\left(1-p_{i}\right)^{1-y_{i}}$$
It’s easier to maximize the log-likelihood:

$$\ln \mathcal{L}=\sum_{i=1}^{n}\left(y_{i} \ln p_{i}+\left(1-y_{i}\right) \ln \left(1-p_{i}\right)\right)$$
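Thus maximum likelihood estimation yields a familiar loss function: the negative log-likelihood, averaged over the $n$ observations, is exactly the binary cross-entropy,

$$\mathrm{BCE}=-\frac{1}{n} \sum_{i=1}^{n}\left(y_{i} \ln p_{i}+\left(1-y_{i}\right) \ln \left(1-p_{i}\right)\right)$$

so maximizing the likelihood is equivalent to minimizing the binary cross-entropy.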
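
The link to cross-entropy can be verified on a few numbers. This sketch assumes made-up predicted probabilities `ps` and labels `ys`. It computes the likelihood as a plain product, computes the binary cross-entropy from its usual formula, and checks that the cross-entropy equals the negative log of the likelihood divided by $n$. PyTorch's `torch.nn.BCELoss` computes the same averaged quantity by default.

```python
import math

def likelihood(ps, ys):
    """Product of p_i^y_i * (1 - p_i)^(1 - y_i) over all observations."""
    value = 1.0
    for p, y in zip(ps, ys):
        value *= (p ** y) * ((1.0 - p) ** (1 - y))
    return value

def binary_cross_entropy(ps, ys):
    """-(1/n) * sum_i [ y_i ln p_i + (1 - y_i) ln(1 - p_i) ]."""
    n = len(ys)
    total = sum(y * math.log(p) + (1 - y) * math.log(1.0 - p) for p, y in zip(ps, ys))
    return -total / n

# Hypothetical predicted probabilities of y = 1, and the true labels.
ps = [0.9, 0.2, 0.7, 0.4]
ys = [1, 0, 1, 0]

bce = binary_cross_entropy(ps, ys)
nll_per_sample = -math.log(likelihood(ps, ys)) / len(ys)
print(bce, nll_per_sample)
```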