

\title{Gaussian Assumption Over Linear Regression}
\author{Dristanta Das}
\date{February 2021}




\maketitle

\section{Introduction}
Linear regression attempts to model the relationship between two variables by fitting a line to the observed data. One variable is considered to be the independent variable and the other is considered to be the dependent variable.

\section{Probabilistic Modelling}

\subsection{Linear Model}



\qquad $y_{i} \approx \theta^T x_{i}$ \\

$y_{i} = \theta^T x_{i} + \epsilon_{i}$, where $\epsilon_{i}\sim \mathcal{N}(0,\sigma^{2})$

Here the $\epsilon_{i}$ are i.i.d.\ noise terms, $x_{i}$ is the feature vector of the $i$-th example, and $\theta$ is the parameter vector. \\


$\epsilon_{i} \sim \mathcal{N} (0,\sigma^{2})$ 
    
$\Rightarrow p(\epsilon_{i}) = \frac{1}{\sqrt{2 \pi}\sigma} \exp\left(-\frac{\epsilon_{i}^2}{2 \sigma^2}\right)$ \\

$\Rightarrow p(y_{i}-\theta^T x_{i}) = \frac{1}{\sqrt{2 \pi}\sigma} \exp\left(-\frac{(y_{i}-\theta^T x_{i})^2}{2 \sigma^2}\right)$ \\
    
However, the conventional way to write this is as a conditional density of $y_{i}$ given $x_{i}$, parametrized by $\theta$: \\

$p(y_{i} \mid x_{i};\theta) = \frac{1}{\sqrt{2 \pi}\sigma} \exp\left(-\frac{(y_{i}-\theta^T x_{i})^2}{2 \sigma^2}\right)$ \\
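
Read as a statement about $y_{i}$, this density says that, for fixed $x_{i}$ and $\theta$, the label is Gaussian with mean $\theta^T x_{i}$ and variance $\sigma^2$. A brief sketch of why, assuming the noise $\epsilon_{i}$ is independent of $x_{i}$ (an assumption the model above leaves implicit): the map $\epsilon_{i} \mapsto y_{i} = \theta^T x_{i} + \epsilon_{i}$ is a shift, so the density keeps its shape and only its center moves:
$$y_{i} \mid x_{i}; \theta \sim \mathcal{N}(\theta^T x_{i}, \sigma^2), \qquad \mathrm{E}[y_{i} \mid x_{i}] = \theta^T x_{i}, \qquad \mathrm{Var}[y_{i} \mid x_{i}] = \sigma^2.$$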
    




\subsection{Parameter Estimation}

Suppose we are given a dataset $\mathcal{D} = \{(x_{i}, y_{i})\}_{i=1}^{m}$. \\

Then Bayes' theorem states that


$P(\theta \mid \mathcal{D}) = \frac{P(\mathcal{D} \mid \theta) \, P(\theta)}{P(\mathcal{D})}$ \\

$\qquad = \frac{P(\theta , \mathcal{D})}{P(\mathcal{D})}$ \\
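
As a short sketch of how this connects to the estimator discussed next: the evidence $P(\mathcal{D})$ does not depend on $\theta$, so the posterior is proportional to the likelihood times the prior,
$$P(\theta \mid \mathcal{D}) \propto P(\mathcal{D} \mid \theta) \, P(\theta).$$
If we additionally assume a flat prior, i.e.\ $P(\theta)$ constant in $\theta$ (an assumption made here only for illustration), then maximizing the posterior is the same as maximizing the likelihood:
$$\arg\max_\theta P(\theta \mid \mathcal{D}) = \arg\max_\theta P(\mathcal{D} \mid \theta).$$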
    














\subsection{Maximum Likelihood Estimation (MLE)}

The idea behind maximum likelihood estimation (MLE) is to define a function of the parameters that enables us to find a model that fits the data
well. The estimation problem is focused on the likelihood function, or
more precisely its negative logarithm. For data represented by a random
variable $x$ and for a family of probability densities $p(x \mid \theta)$ parametrized
by $\theta$, the negative log-likelihood is given by
$$\mathcal{L}_{x}(\theta) = -\log p(x \mid \theta)$$
The notation $\mathcal{L}_{x}(\theta)$ emphasizes the fact that the parameter $\theta$ is varying
and the data $x$ is fixed. We very often drop the reference to $x$ when writing
the negative log-likelihood, as it is really a function of $\theta$, and write it as
$\mathcal{L}(\theta)$ when the random variable representing the uncertainty in the data
is clear from the context.

Let us interpret what the probability density $p(x \mid \theta)$ is modeling for a
fixed value of $\theta$. It is a distribution that models the uncertainty of the data.
In other words, once we have chosen the type of function we want as a
predictor, the likelihood gives the probability (density) of observing the data $x$.

In a complementary view, if we consider the data to be fixed (because
it has been observed) and we vary the parameters $\theta$, what does $\mathcal{L}(\theta)$ tell us? It tells us how likely a particular setting of $\theta$ is for the observations $x$, with smaller values of $\mathcal{L}(\theta)$ corresponding to more likely settings. Based on this second view, the maximum likelihood estimator gives us the most likely parameter $\theta$ for the set of data.

We consider the supervised learning setting, where we obtain pairs
$(x_{1},y_{1}), \ldots, (x_{N},y_{N})$ with feature vectors $x_{n}$ and labels $y_{n} \in \mathbb{R}$. We are interested in constructing a predictor that takes a feature vector $x_{n}$ as input and produces a prediction $y_{n}$ (or something close to it), i.e., given a vector $x_{n}$ we want the probability distribution of the label $y_{n}$. In other words,
we specify the conditional probability distribution of the labels given the
examples for the particular parameter setting $\theta$.\\
We assume that the examples $(x_{1},y_{1}), \ldots, (x_{N},y_{N})$ are independent
and identically distributed (i.i.d.). The word ``independent''
implies that the likelihood of the whole dataset $Y = \{y_{1}, \ldots, y_{N}\}$ and
$X = \{x_{1}, \ldots, x_{N}\}$ factorizes into a product of the likelihoods of each individual example,
$$p(Y|X, \theta) = \prod_{n=1}^{N}p(y_{n}|x_{n}, \theta)$$

where $p(y_{n} \mid x_{n}, \theta)$ is a particular distribution (Gaussian in our model). The expression ``identically distributed'' means that each term
in the product has the same distribution, and all of them share
the same parameters. It is often easier from an optimization viewpoint to
work with functions that can be decomposed into sums of simpler functions.
Hence, in machine learning we often consider the negative log-likelihood,

$$\mathcal{L}(\theta) = -\log p(Y \mid X, \theta) = - \sum_{n=1}^{N} \log p(y_{n} \mid x_{n}, \theta)$$
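
As a worked instance, substituting the Gaussian density of the linear model, $p(y_{n} \mid x_{n}, \theta) = \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{(y_{n} - \theta^T x_{n})^2}{2\sigma^2}\right)$, into this sum gives (treating $\sigma$ as known and fixed, which is an assumption of this sketch)
$$\mathcal{L}(\theta) = \frac{N}{2} \log(2 \pi \sigma^2) + \frac{1}{2 \sigma^2} \sum_{n=1}^{N} (y_{n} - \theta^T x_{n})^2 .$$
The first term does not depend on $\theta$, so only the sum of squared residuals matters when minimizing over $\theta$.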

While it is tempting to conclude that $\theta$, because it appears to the right of the
conditioning bar in $p(y_{n} \mid x_{n}, \theta)$, should be interpreted as observed
and fixed, this interpretation is incorrect. The negative log-likelihood $\mathcal{L}(\theta)$
is a function of $\theta$. Therefore, to find a good parameter vector $\theta$ that
explains the data $(x_{1},y_{1}), \ldots, (x_{N},y_{N})$ well, we minimize the negative
log-likelihood $\mathcal{L}(\theta)$ with respect to $\theta$.\\

Remark. The negative sign  is a historical artifact that is due
to the convention that we want to maximize likelihood, but numerical
optimization literature tends to study minimization of functions.



We now make the Gaussian model assumption, with $\theta \in \mathbb{R}^n$, and write $L(\theta \mid \mathcal{D}) = P(\mathcal{D} \mid \theta)$ for the likelihood itself (not its negative logarithm). \\

$\theta^{*} = \arg\max_\theta L(\theta \mid \mathcal{D})$ \\

$= \arg\max_\theta P(\mathcal{D} \mid \theta)$ \\

$= \arg\max_\theta P(y_{1},x_{1},\ldots,y_{m},x_{m};\theta)$ \\

$= \arg\max_\theta \prod_{i = 1}^{m} P(y_{i},x_{i} ; \theta)$ \qquad (i.i.d.\ examples) \\

$= \arg\max_\theta \prod_{i = 1}^{m} [P(y_{i} \mid x_{i} ; \theta) \, P(x_{i}; \theta)]$ \\

$= \arg\max_\theta \prod_{i = 1}^{m} [P(y_{i} \mid x_{i} ; \theta) \, P(x_{i})]$ \qquad (the distribution of $x_{i}$ does not depend on $\theta$) \\

$= \arg\max_\theta \prod_{i = 1}^{m} P(y_{i} \mid x_{i} ; \theta)$ \\

$= \arg\max_\theta \sum_{i = 1}^{m} \log P(y_{i} \mid x_{i} ; \theta)$ \qquad ($\log$ is strictly increasing) \\

$= \arg\max_\theta \sum_{i=1}^{m} \left[\log\left(\frac{1}{\sqrt{2 \pi}\sigma}\right) + \log\left(\exp\left(-\frac{(\theta^T x_{i} - y_{i})^2}{2 \sigma^2}\right)\right)\right]$ \\

$= \arg\max_\theta \, -\frac{1}{2 \sigma^2}\sum_{i=1}^{m} (\theta^T x_{i} - y_{i})^2$ \\

$= \arg\min_\theta \frac{1}{m}\sum_{i=1}^{m} (\theta^T x_{i} - y_{i})^2$ \\
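
For completeness, here is a sketch of how the least-squares objective $\sum_{i=1}^{m} (\theta^T x_{i} - y_{i})^2$ can be minimized in closed form. The matrix notation is introduced only for this illustration: let $\mathbf{X} \in \mathbb{R}^{m \times n}$ be the matrix whose $i$-th row is $x_{i}^T$, and let $\mathbf{y} \in \mathbb{R}^{m}$ collect the labels $y_{i}$. Then
$$\sum_{i=1}^{m} (\theta^T x_{i} - y_{i})^2 = \|\mathbf{X}\theta - \mathbf{y}\|^2, \qquad \nabla_\theta \|\mathbf{X}\theta - \mathbf{y}\|^2 = 2 \, \mathbf{X}^T (\mathbf{X}\theta - \mathbf{y}).$$
Setting the gradient to zero gives the normal equations $\mathbf{X}^T \mathbf{X} \, \theta = \mathbf{X}^T \mathbf{y}$. Assuming $\mathbf{X}^T \mathbf{X}$ is invertible, the minimizer is
$$\theta^{*} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}.$$
Note that $\sigma^2$ does not appear: the noise variance rescales the objective but does not change where its minimum lies.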





\section{Conclusion}
Hence, under the assumption of i.i.d.\ Gaussian noise, maximum likelihood estimation for linear regression amounts to least squares.