# Linear Regression With Maximum Likelihood Estimation

Linear regression is a classical model for predicting a numerical quantity. 
- The parameters of a linear regression model can be estimated using a least squares procedure or a maximum likelihood estimation (MLE) procedure.
- Supervised learning can be framed as a conditional probability problem, and MLE can be used to fit the parameters of a model that best summarizes the conditional probability distribution. This is called conditional maximum likelihood estimation.

Linear Regression is a model that maps one or more numerical inputs to a numerical output. In terms of predictive modeling, it is suited to regression type problems: that is, the prediction of a real-valued quantity. The input data is denoted as X with n examples and the output is denoted y with one output for each input. 

The prediction of the model for a given input is denoted as yhat:

$yhat = model(X)$

The model is defined in terms of parameters called coefficients (beta or $\beta$), where there is one coefficient per input and an additional coefficient that provides the intercept or bias.

The model can also be described using linear algebra, with a vector for the coefficients ($\beta$), a matrix for the input data (X), and a vector for the output (y).

$y = X \beta$

The sample is known to be incomplete, being drawn from a broader population, and measurement error or statistical noise is also expected. The regression problem is to estimate the parameters of the model ($\beta$) from this noisy sample. The two most common frameworks are:
1. Least Squares Optimization - seeks a set of parameters that results in the smallest squared error between the predictions of the model (yhat) and the actual outputs (y), averaged over all examples in the dataset, i.e. the mean squared error.
2. Maximum Likelihood Estimation - a frequentist probabilistic framework that seeks a set of parameters for the model that maximizes a likelihood function.

Under both frameworks, different optimization algorithms may be used, such as local search methods like the BFGS algorithm (or variants), and general optimization methods like stochastic gradient descent. The linear regression model is special in that an analytical solution also exists, meaning that the coefficients can be calculated directly using linear algebra.
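As a minimal sketch of the analytical route, the example below fits a linear regression with NumPy's least squares solver. The synthetic data (a true intercept of 2 and slope of 3 with unit-variance Gaussian noise) and all variable names are assumptions made for illustration only.

```python
import numpy as np

# Hypothetical synthetic data: y = 2 + 3 * x plus Gaussian noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 2.0 + 3.0 * x + rng.normal(0.0, 1.0, size=200)

# Prepend a column of ones so the first coefficient acts as the intercept.
X = np.column_stack((np.ones_like(x), x))

# Solve the least squares problem for beta directly with linear algebra.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Predictions and mean squared error on the training sample.
yhat = X @ beta
mse = np.mean((y - yhat) ** 2)
print('intercept=%.3f, slope=%.3f, mse=%.3f' % (beta[0], beta[1], mse))
```

The estimated coefficients should land near the values used to generate the data, though not exactly on them, because the sample is noisy.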


# Maximum Likelihood Estimation
Because the examples are assumed to be independent and identically distributed (i.i.d.), the joint probability distribution can be restated as the product of the conditional probabilities of observing each example given the distribution parameters. Multiplying many small probabilities together can be numerically unstable; hence it is common to restate this problem as the sum of the natural log of each conditional probability.

Given the common use of log in the likelihood function, it is referred to as a log-likelihood function. It is also common in optimization problems to prefer to minimize a cost function rather than to maximize an objective. Therefore, the negative of the log-likelihood function is used, referred to generally as a Negative Log-Likelihood (NLL) function.
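The numerical motivation for working with logs can be seen in a small sketch. The probabilities here are made up for illustration: 2,000 examples, each assigned a probability of 0.01 by some model.

```python
import numpy as np

# Hypothetical sample: 2,000 i.i.d. examples, each with probability 0.01.
probs = np.full(2000, 0.01)

# The raw product is far too small for double precision and underflows.
likelihood = np.prod(probs)

# The sum of natural logs stays finite and well behaved.
log_likelihood = np.sum(np.log(probs))

# Negating the log-likelihood gives a cost that can be minimized.
nll = -log_likelihood

print('likelihood=%s' % likelihood)
print('log-likelihood=%.3f' % log_likelihood)
print('negative log-likelihood=%.3f' % nll)
```

The product collapses to zero and carries no information, while the log-likelihood remains an ordinary finite number that an optimizer can compare across parameter settings.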

#  MLE Relationship to Machine Learning
We can frame the problem of fitting a machine learning model as the problem of probability density estimation.
1. The choice of model and model parameters is referred to as a modeling hypothesis h.
2. The problem involves finding the h that best explains the data X, i.e. that maximizes P(X; h), or more fully:

$\max_h \sum_{i=1}^{n} \log P(x_i ; h)$

Now we can replace h with our linear regression model. Assuming the examples are i.i.d. and that the target variable (y) has statistical noise with a Gaussian distribution, zero mean, and the same variance for all examples, we can frame the problem of estimating y given X as estimating the mean value for $y$ from a Gaussian probability distribution given X.
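A short derivation sketch shows how these assumptions connect maximum likelihood to least squares. The symbols $x_i$ (the i-th row of X), $y_i$ (the i-th output), $\epsilon_i$ (the noise term), and $\sigma^2$ (the noise variance) are notation introduced here for illustration. If each output is modeled as $y_i = x_i \beta + \epsilon_i$ with $\epsilon_i \sim N(0, \sigma^2)$, the log-likelihood of a single example is

$$\log P(y_i \mid x_i ; \beta, \sigma) = -\frac{1}{2} \log(2 \pi \sigma^2) - \frac{(y_i - x_i \beta)^2}{2 \sigma^2}$$

and, because the examples are i.i.d., the log-likelihood of the whole sample is the sum

$$\sum_{i=1}^{n} \log P(y_i \mid x_i ; \beta, \sigma) = -\frac{n}{2} \log(2 \pi \sigma^2) - \frac{1}{2 \sigma^2} \sum_{i=1}^{n} (y_i - x_i \beta)^2$$

The first term does not depend on $\beta$, so under these assumptions maximizing the log-likelihood with respect to $\beta$ is equivalent to minimizing the sum of squared errors.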



#  Logistic Regression With Maximum Likelihood Estimation

Logistic regression is a classical linear method for binary classification. Logistic regression has a lot in common with linear regression: both techniques model the target variable with a line (or hyperplane, depending on the number of dimensions of the input). Linear regression fits the line to the data, which can be used to predict a new quantity, whereas logistic regression fits a line to best separate the two classes.

The model is initially identical to linear regression: it computes a weighted sum of the inputs. It then squashes the output of this weighted sum using a nonlinear function, the sigmoid (logistic) function, to ensure the outputs are a value between 0 and 1.
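The two stages can be sketched in a few lines of NumPy. The coefficient values and the three input rows below are hypothetical and chosen only to show the shape of the computation; the first column of ones plays the role of the intercept.

```python
import numpy as np

def sigmoid(z):
    """Squash any real value into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients: intercept followed by two input weights.
beta = np.array([-1.0, 0.8, -0.5])

# Three hypothetical examples, each with a leading 1 for the intercept.
X = np.array([[1.0, 2.0, 1.0],
              [1.0, 0.0, 3.0],
              [1.0, 5.0, 0.5]])

# Stage 1: the same weighted sum used by linear regression.
z = X @ beta

# Stage 2: the sigmoid turns each weighted sum into a probability.
yhat = sigmoid(z)

print(z)
print(yhat)
```

Large positive weighted sums map to probabilities near 1, large negative sums map to probabilities near 0, and a weighted sum of exactly 0 maps to 0.5.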

#  Logistic Regression and Log-Odds

The linear part of the model (the weighted sum of the inputs) calculates the log-odds of a successful event. Specifically, the log-odds that a sample belongs to class 1, for the input variables at each level (all observed values). 

Odds are often stated as wins to losses (wins : losses), e.g. a one to ten chance or ratio of winning is stated as 1:10. Given the probability of success (p) predicted by the logistic regression model, we can convert it to odds of success as the probability of success divided by the probability of not success: p / (1-p) 

The logarithm of the odds is the log-odds and may be referred to as the logit (logistic unit), a unit of measure.
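The conversions between probability, odds, and log-odds can be sketched with the standard library. The function names and the example probability of 0.8 are illustrative choices, not part of any particular library.

```python
from math import exp, log

def odds(p):
    """Odds of success: probability of success over probability of failure."""
    return p / (1.0 - p)

def log_odds(p):
    """Log-odds (logit) of a probability."""
    return log(odds(p))

def prob_from_log_odds(z):
    """Invert the logit: map log-odds back to a probability."""
    return 1.0 / (1.0 + exp(-z))

# Hypothetical probability of success.
p = 0.8
print('odds=%.3f' % odds(p))
print('log-odds=%.3f' % log_odds(p))
print('recovered probability=%.3f' % prob_from_log_odds(log_odds(p)))
```

Mapping log-odds back to a probability is the same sigmoid transform that logistic regression applies to its weighted sum.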

So the problem of fitting a machine learning model is again a problem of probability density estimation. The choice of model and model parameters is referred to as a modeling hypothesis h, and the problem involves finding the h that best explains the data X.

The probability distribution most often used when there are two classes is the Bernoulli distribution (a binomial distribution with a single trial). This distribution has a single parameter, p, that is the probability of an event or a specific class. The likelihood function for a specific input is built from the probability given by the model (yhat) and the actual label given by the dataset:

likelihood = yhat * y + (1 - yhat) * (1 - y) 

This function will always return a large probability when the model is close to the matching class value, and a small value when it is far away, for both y = 0 and y = 1 cases.


In [6]:
# test of the Bernoulli likelihood function
def likelihood(y, yhat):
    return yhat * y + (1 - yhat) * (1 - y)

# test for y = 1
y, yhat = 1, 0.9
print('y= %.1f, yhat=%.1f, likelihood=%.1f' % (y, yhat, likelihood(y, yhat)))
y, yhat = 1, 0.1
print('y= %.1f, yhat=%.1f, likelihood=%.1f' % (y, yhat, likelihood(y, yhat)))
# test for y = 0
y, yhat = 0, 0.1
print('y= %.1f, yhat=%.1f, likelihood=%.1f' % (y, yhat, likelihood(y, yhat)))
y, yhat = 0, 0.9
print('y= %.1f, yhat=%.1f, likelihood=%.1f' % (y, yhat, likelihood(y, yhat)))

y= 1.0, yhat=0.9, likelihood=0.9
y= 1.0, yhat=0.1, likelihood=0.1
y= 0.0, yhat=0.1, likelihood=0.9
y= 0.0, yhat=0.9, likelihood=0.1


It is common practice to minimize a cost function for optimization problems; we can invert the function so that we minimize the negative log-likelihood.
Calculating the negative of the log-likelihood function for the Bernoulli distribution is equivalent to calculating the cross-entropy function for the Bernoulli distribution, where p() represents the probability of class 0 or class 1, and q() represents the estimation of the probability distribution.
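The equivalence can be checked numerically with a small sketch. The function names are illustrative, the labels are assumed to be exactly 0 or 1, and the predictions are assumed to lie strictly between 0 and 1 so that every logarithm is defined.

```python
from math import log

def negative_log_likelihood(y, yhat):
    """Negative log of the Bernoulli likelihood yhat*y + (1-yhat)*(1-y)."""
    return -log(yhat * y + (1 - yhat) * (1 - y))

def cross_entropy(y, yhat):
    """Cross-entropy between the label distribution p = (1-y, y)
    and the predicted distribution q = (1-yhat, yhat)."""
    return -(y * log(yhat) + (1 - y) * log(1 - yhat))

# Same four cases as the likelihood test: label y and prediction yhat.
for y, yhat in [(1, 0.9), (1, 0.1), (0, 0.1), (0, 0.9)]:
    nll = negative_log_likelihood(y, yhat)
    ce = cross_entropy(y, yhat)
    print('y=%d, yhat=%.1f, nll=%.3f, cross-entropy=%.3f' % (y, yhat, nll, ce))
```

Both columns agree in every case, and the cost is small when the prediction is close to the label and large when it is far away.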

# Expectation Maximization (EM) Algorithm

MLE involves treating the problem as an optimization or search problem, where we seek a set of parameters that results in the best fit for the joint probability of the data sample. 
A limitation is that it assumes that the dataset is complete, or fully observed. There may be datasets where only some of the relevant variables can be observed, and some cannot, and although they influence other random variables in the dataset, they remain hidden. More generally, these unobserved or hidden variables are referred to as latent variables.

The Expectation-Maximization algorithm provides an alternative approach: it performs maximum likelihood estimation in the presence of latent variables.

The EM algorithm is an iterative approach that cycles between two modes. The first mode attempts to estimate the missing or latent variables, called the expectation step (or estimation step), or E-step. The second mode attempts to optimize the parameters of the model to best explain the data, called the maximization step or M-step.
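The two modes can be sketched by hand for a one-dimensional mixture of two Gaussians. This is a minimal NumPy illustration under stated assumptions, not how any particular library implements EM: the helper names `normal_pdf` and `em_two_gaussians`, the quartile-based starting means, and the fixed iteration count are all choices made here for the example.

```python
import numpy as np

def normal_pdf(x, mean, std):
    """Gaussian density, evaluated elementwise with broadcasting."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def em_two_gaussians(x, n_iter=200):
    """Fit a two-component 1-D Gaussian mixture with plain EM."""
    # Crude starting guesses (an assumption of this sketch).
    means = np.percentile(x, [25, 75]).astype(float)
    stds = np.array([x.std(), x.std()])
    weights = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: estimate the latent variable, i.e. the probability
        # that each point was generated by each component.
        dens = weights * normal_pdf(x[:, None], means, stds)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate the parameters given those probabilities.
        nk = resp.sum(axis=0)
        weights = nk / len(x)
        means = (resp * x[:, None]).sum(axis=0) / nk
        stds = np.sqrt((resp * (x[:, None] - means) ** 2).sum(axis=0) / nk)
    return weights, means, stds

# Bimodal sample: 300 points around 20 and 700 points around 40.
rng = np.random.default_rng(1)
x = np.hstack((rng.normal(20, 5, 300), rng.normal(40, 5, 700)))

weights, means, stds = em_two_gaussians(x)
print('weights=%s' % weights)
print('means=%s' % means)
print('stds=%s' % stds)
```

Which component generated each point is the latent variable here: the E-step fills it in softly, and the M-step then updates the weights, means, and standard deviations as if those soft assignments were observed.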


In [29]:
# constructing a bimodal distribution
from numpy import hstack
from numpy.random import normal
from sklearn.mixture import GaussianMixture
# sample 300 points around 20 and 700 points around 40
X1, X2 = normal(loc=20, scale=5, size=300), normal(loc=40, scale=5, size=700)
X = hstack((X1, X2))
X = X.reshape((len(X), 1))
# fit model
model = GaussianMixture(n_components=2, init_params='random')
model.fit(X)
# predict the latent component for each example
yhat = model.predict(X)
# component assignments for the first and last 100 examples
print(yhat[:100])
print(yhat[-100:])

[0 1 1 0 0 1 1 0 0 0 0 1 1 0 1 1 1 0 1 0 0 1 0 1 1 1 1 0 0 1 1 0 1 1 0 0 1
 0 0 0 0 1 0 0 1 1 1 1 1 0 1 0 1 0 0 0 0 0 1 1 1 0 1 0 1 1 0 1 1 0 0 0 1 0
 0 1 1 1 0 0 1 1 1 0 0 1 0 1 1 1 0 1 0 1 1 1 0 1 0 0]
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]


#  Probabilistic Model Selection with AIC, BIC, and MDL

Model selection is the problem of choosing one from among a set of candidate models. It is common to choose a model that performs the best on a hold-out test dataset or to estimate model performance using a resampling technique, such as k-fold cross-validation. An alternative approach to model selection involves using probabilistic statistical measures that attempt to quantify both the model performance on the training dataset and the complexity of the model.
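As a rough illustration of such measures, the sketch below uses the standard definitions AIC = 2k - 2 ln(L) and BIC = k ln(n) - 2 ln(L), where L is the maximized likelihood on the training data, k is the number of estimated parameters, and n is the number of training examples. The function names and the two candidate models with their log-likelihood values are hypothetical.

```python
from math import log

def aic(log_likelihood, k):
    """Akaike Information Criterion: lower is better."""
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood, k, n):
    """Bayesian Information Criterion: lower is better."""
    return k * log(n) - 2 * log_likelihood

# Hypothetical candidates fit on the same n = 100 training examples.
n = 100
candidates = {
    'simple': {'log_likelihood': -150.0, 'k': 3},
    'complex': {'log_likelihood': -148.0, 'k': 8},
}

for name, m in candidates.items():
    print('%s: AIC=%.3f, BIC=%.3f' % (
        name,
        aic(m['log_likelihood'], m['k']),
        bic(m['log_likelihood'], m['k'], n)))
```

In this made-up case the more complex model fits the training data slightly better, but both criteria still prefer the simpler one because the gain in log-likelihood does not offset the penalty for the extra parameters.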

