
> Examples of Maximum Likelihood and Maximum A Posteriori.

# Introduction 

- Both MLE and MAP try to figure out the best approximation of an unseen quantity.
- Consider a series of measurements over a noisy channel. We have two things: our observations and some information (or, at best, a compromise guess) about the channel. The goal becomes to recover the original data.

$$y = ax + n$$  
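
To make the channel model concrete, here is a minimal simulation sketch. The gain `a=0.9`, the noise level `noise_std=0.5`, and the constant true signal of 12.0 are made-up values for illustration, and the noise is assumed to be Gaussian.

```python
import random

def simulate_channel(x_values, a=0.9, noise_std=0.5, seed=0):
    """Pass each true value x through y = a*x + n, with Gaussian noise n."""
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    return [a * x + rng.gauss(0.0, noise_std) for x in x_values]

# Hypothetical true signal: the same value sent 1000 times.
true_signal = [12.0] * 1000
observations = simulate_channel(true_signal)

# Individual observations are noisy, but their average sits near a * x.
mean_obs = sum(observations) / len(observations)
print(f"mean observation: {mean_obs:.2f}, noise-free value: {0.9 * 12.0:.2f}")
```

Each single observation is off by a random amount, which is exactly why we need an estimation procedure to get back to $x$.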

- Consider a supervised learning problem where we have some training data and test data. In this case, we know the labels of the training data (our observations), and we have the "true" data itself (the evidence), our input features. The goal becomes to learn $a$, the "channel" information.
- That way, when we get some new test data (new evidence), we can pass it through our channel to determine what the observation should be.
- If we have enough data, pick the right model, and train it correctly, we will have an accurate estimate of the channel $a$. Whether its accuracy is enough will depend on the specific use case.

$$ f_{\theta}(x) \rightarrow y $$ 
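
As a rough sketch of "learning the channel", the snippet below fits $a$ from training pairs with least squares and then uses it on new inputs. It assumes the simplest possible model, $y = ax + n$ with no intercept, and the data is synthetic; `fit_channel` and `predict` are illustrative names, not part of any library.

```python
import random

def fit_channel(xs, ys):
    """Least-squares estimate of a in y = a*x + n (no intercept)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / sxx

# Synthetic training data generated with a known (hypothetical) channel.
rng = random.Random(1)
true_a = 0.9
train_x = [rng.uniform(5.0, 20.0) for _ in range(500)]
train_y = [true_a * x + rng.gauss(0.0, 0.5) for x in train_x]

a_hat = fit_channel(train_x, train_y)

def predict(x_new, a=a_hat):
    """Pass new evidence through the learned channel."""
    return a * x_new

print(f"estimated a: {a_hat:.3f}, prediction for x=10: {predict(10.0):.2f}")
```

With Gaussian noise, this least-squares fit is also the maximum likelihood estimate of $a$, which ties it back to the rest of these notes.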


- In both cases $n$ can be seen as the irreducible noise. In the signal example, it captures everything we don't know and can't control: imperfect knowledge of the measurement environment, micro and macro electronic faults, heat and electrical noise from circuits, interference from unseen or unknown sources, occlusions, etc. For the supervised learning example, it captures the inherent uncertainty we have by the mere fact that $x$ does not contain all possible examples of the data, recorded under every possible condition.
- The best we can do is find a good set of weights $a$ that best maps the data to labels. Or, make the best and most accurate estimate of the channel to figure out the most likely true measurement from the observations.
- This "best we can do" is what both MLE and MAP attempt to do.  

- The following examples are for future notes, since there is an "inversion" when considering MAP and MLE for classification and regression. In regression, we have the measurements and want to find the inputs/data that most likely caused these measurements. In classification, we have the inputs/data for one set of observations, and want to make sure that we can correctly derive measurements for a new, unseen set of inputs/data.
- Notice that $x$ and $y$ swapped places as the quantities of interest.
- However, with some manipulation of Bayes Rule, we can easily handle both cases.  
- The key is remembering that we are always after some unknown quantity. The quantity itself will depend on the problem setup and its constraints.  

# Setting up the examples  
- To see how the unseen quantity of interest varies, let's consider two examples.
- One for regression, the other for classification.  

## Regression - temperature readings from an IoT sensor.  

- Imagine we've set up a sensor in a warehouse. This warehouse stores food and must be kept within a certain temperature range to keep the food from spoiling.
- The food is very sensitive, and one of the worst things bacteria-wise is for food to warm up then cool back down, especially if this happens many times.
- To prevent this, the warehouse owner wants measurements taken every 5 minutes.
- If one measurement is above the threshold, the sensor is flagged for human inspection.
- If the same sensor is above the threshold for a second reading, this raises a system alarm and demands intervention.
- For our purposes, let's consider only one of these sensors.


## Classification - the Iris flower problem  

- To keep things simple, let's use the classic Iris flower problem.
- While rich in history, this dataset has been analyzed and worked for all it's worth.
- There are no more key Machine Learning insights to gain from it; it is now a toy problem.
- But, it is perfect for visualizations or instructions, since we can focus on aspects other than data cleaning, augmentation, error analysis, etc. In other words, we are sure that the data and its results are on good footing, and can focus our effort on other things. 

# Quick recap of Bayes Rule  

$$P(\theta|\textbf{D}) = P(\theta ) \frac{P(\textbf{D} |\theta)}{P(\textbf{D})}$$

## Posterior  
- The probability we truly care about. It is what people often think they already have given a problem setup, but this is usually a causal trap because of all our priors and biases.
## Likelihood  
- What we usually have, i.e. we start with a certain set of evidence, and think we have a hypothesis about it. 
## Prior  
- Our background information about the hypothesis, and how prevalent it is in the real world.  
## Evidence  
- The total probability of our evidence.  
- This is usually intractable for several reasons. It involves knowing absolutely everything about our given evidence, both in scope and in degree. For a given set of symptoms, we'd have to know not only about every possible symptom we *could* have, but also about all variations of the symptoms we do have. For example, I may have a slight cough and a headache on the right side of my head. Not only do we have to know about the weaker or slightly harder coughs that other people may have, but we also have to know about our exact same symptoms with the headache on the left side instead. This labored example is only to highlight that, to truly know $P(\textbf{D})$, we usually need infinite and complete knowledge about some aspect of our reality. Were it so easy...
- Thankfully, many of the quantities we care about don't directly need this value. The evidence does not depend on $\theta$, so the posterior is proportional to the likelihood times the prior, and we can hand-wave the evidence away as a constant we don't truly care about. We only care about the relative sizes of the numerators.
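
A small numeric sketch of why the evidence can be ignored when we only want the most probable hypothesis. The two hypotheses and all the probabilities below are invented for illustration; with a finite set of hypotheses the evidence is easy to compute, which is not the case in general.

```python
def posterior(priors, likelihoods):
    """Apply Bayes Rule over a finite set of hypotheses."""
    # Numerators only: P(theta) * P(D | theta)
    unnormalized = {h: priors[h] * likelihoods[h] for h in priors}
    # Evidence P(D): the sum of the numerators over every hypothesis
    evidence = sum(unnormalized.values())
    normalized = {h: v / evidence for h, v in unnormalized.items()}
    return unnormalized, normalized

priors = {"cold": 0.9, "flu": 0.1}        # hypothetical P(theta)
likelihoods = {"cold": 0.2, "flu": 0.6}   # hypothetical P(D | theta)

unnorm, norm = posterior(priors, likelihoods)

# Dividing by the evidence rescales every score by the same constant,
# so the winning hypothesis is the same before and after.
best_unnorm = max(unnorm, key=unnorm.get)
best_norm = max(norm, key=norm.get)
print(best_unnorm, best_norm, norm)
```

Note that the flu has the higher likelihood here, but the prior pulls the answer toward the far more common cold.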

## Bayes Rule with Gaussian Distributions   

- The above terms are general and hold for any probability distributions.  
- To keep things simple, let's use Bayes Rule with Gaussian distributions.
- The reasons for this will become clear in the MLE and MAP sections. It is mainly a convenience choice from three angles:
    - By the central limit theorem, sums and averages of many independent effects are approximately Gaussian, so many real-world quantities can be treated as ~Gaussian distributed.
    - Their theory is very well fleshed out, with loads of equations and knowledge to draw from.
    - With a few algebraic manipulations, the Gaussian quantities for each of the Bayes terms are very easy to compute.

$$P(x_{i}\mid y) = \frac{1}{\sqrt{2\pi \sigma_y^{2}}} \exp \left(-\frac{(x_{i} -\mu_{y})^2}{2\sigma_y^{2}} \right)$$
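
The density above translates almost directly into code. This is a minimal sketch using only the standard library; the log version is included because products of many small densities underflow quickly, and working with sums of logs is the usual workaround.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Gaussian density with mean mu and standard deviation sigma."""
    coeff = 1.0 / math.sqrt(2.0 * math.pi * sigma ** 2)
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

def gaussian_log_pdf(x, mu, sigma):
    """Log of the density above, written out to avoid underflow."""
    return (-0.5 * math.log(2.0 * math.pi * sigma ** 2)
            - ((x - mu) ** 2) / (2.0 * sigma ** 2))

# The density peaks at the mean and falls off symmetrically around it.
print(gaussian_pdf(0.0, 0.0, 1.0), gaussian_pdf(1.0, 0.0, 1.0))
```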

# Maximum Likelihood Estimate  

Let's dive into some numbers for our two examples. First we will deal with the sensor measurements for regression, then the Iris flower classification.

## Regression with MLE  

- We have a set of sensor temperature observations $y$.
- We have some knowledge about the channel $a$.
- For a given $y$, we want to find the most likely true temperature $x$ that caused this reading.
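
One way to sketch this: assume the channel is $y = ax + n$ with Gaussian noise, score every candidate temperature by its log-likelihood, and keep the best one. The readings, the gain `a`, and the noise level `noise_std` below are hypothetical values, and the grid search is only for illustration since this model also has a closed-form answer.

```python
import math

def log_likelihood(x, readings, a, noise_std):
    """log P(readings | x) for y = a*x + n with Gaussian noise n."""
    return sum(
        -0.5 * math.log(2.0 * math.pi * noise_std ** 2)
        - (y - a * x) ** 2 / (2.0 * noise_std ** 2)
        for y in readings
    )

readings = [11.2, 10.6, 11.0]   # hypothetical sensor readings y
a, noise_std = 0.9, 0.5         # assumed channel gain and noise level

# Brute force: score candidate temperatures from 0.00 to 30.00 degrees.
candidates = [i / 100.0 for i in range(0, 3001)]
x_mle_grid = max(candidates,
                 key=lambda x: log_likelihood(x, readings, a, noise_std))

# Closed form for this model: average the readings, then undo the gain.
x_mle_closed = sum(readings) / (len(readings) * a)

print(f"grid: {x_mle_grid:.2f}, closed form: {x_mle_closed:.3f}")
```

Notice that the grid had to cover some range of temperatures. MLE itself gives no guidance on which range is sensible, which is the limitation the recap below picks up on.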

## Classification with MLE  

- For the training set:
    - We have a given set of flower feature inputs $x$.
    - We also have the species that each measurement belongs to as labels $y$.
    - We want to find a good mapping $a$ (technically $\theta$ or $w$) from our features to the species.

- Then, critically, when given a new, unseen set of flower features x (maybe a different plot of land, or the same plot after some time), we want to make sure our mapping can still accurately tell us the flower species.  
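
A minimal sketch of the idea, assuming one Gaussian per species over a single feature. The petal lengths below are made-up stand-ins, not rows from the actual Iris dataset, and a real model would use all four features.

```python
import math
from statistics import mean, pstdev

def gaussian_pdf(x, mu, sigma):
    coeff = 1.0 / math.sqrt(2.0 * math.pi * sigma ** 2)
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

# Hypothetical petal lengths (cm) per species, for illustration only.
train = {
    "setosa": [1.4, 1.5, 1.3, 1.6, 1.4],
    "versicolor": [4.5, 4.0, 4.7, 4.2, 4.4],
    "virginica": [5.8, 5.5, 6.0, 5.1, 5.6],
}

# "Training": the maximum likelihood Gaussian for each class is its
# sample mean and population standard deviation.
params = {species: (mean(v), pstdev(v)) for species, v in train.items()}

def classify_mle(x):
    """Pick the species whose Gaussian gives x the highest likelihood."""
    return max(params, key=lambda s: gaussian_pdf(x, *params[s]))

for petal_length in (1.5, 4.3, 5.9):
    print(petal_length, classify_mle(petal_length))
```

The decision looks only at $P(x \mid \text{species})$. How common each species is never enters the picture.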

## MLE Recap  and limitations
- We saw two examples: finding the true measured temperature of our sensor, and finding which class a flower most likely belongs to.

- But in both cases, we were using a limited amount of information. For the temperature, knowing that we are on planet Earth, it makes sense to think the sensor will never read -100 or +100 degrees Celsius. Why check those temperatures at all? Further, if we knew that the temperature would be, for example, between 10 and 18 degrees, we could focus on those values to get more accurate readings.
- And for the flowers, we assumed that each flower species had an equal number of examples. That is the ideal case with "balanced classes". But it's hardly ever the reality. If we know that one type of flower is vastly more frequent than the others, either because it grows better or because we planted more of them, that should factor into our decision!

- In other words, it helps to bring in our knowledge about the outside world, or about the known limits or characteristics of our problem. While this risks biasing us in one direction vs. all the others, if this previous knowledge, our *priors*, is well-founded and earned, we can gain much by using it.

# Maximum A Posteriori  

- MAP builds upon MLE by bringing in knowledge about our priors.
- At its most basic, we scale the likelihood estimate from the previous section based on our prior knowledge about the unobserved quantities.

> When you think hooves, think horses not zebras.

- For our sensor example, we can bring in the knowledge both about the average temperatures inside our warehouse, and even knowledge about the temperature outside in case we ever experience a catastrophic breach.   
- For classification, we can bring in knowledge about the frequency of each flower in our training data. If we expect the test data to be a good representation of the data our model will see in the field, this will also give us better performance.  

## Regression with MAP  

- Let's say we know that the true temperature will almost certainly be between 6 and 18 degrees Celsius, with a mean of 12 degrees and a standard deviation of 2 degrees. We could have figured this out with a series of accurate thermometers spread throughout the warehouse, recorded at different times of the day over a long period of time. Now we arrive at the key motivation for our sensors! While we could have someone walk around the entire warehouse, checking each thermometer and dealing with any problems, that is a labor-intensive operation. Ideally we'd need several people to cover the warehouse efficiently. And there are certainly better things for employees to do than walking around staring at thermometers.
- That means that, for any candidate true temperature, we know how likely it is before looking at the sensor reading, independent of any other measurements or sensors. That is the prior, $P(\text{temperature})$.
- Now, when we find the likelihood of our measurement, we can likewise scale it by this prior knowledge. 
- This also makes our alarm system potentially better: if we see two extreme temperatures back to back, far outside the expected range, it is even more likely that either the sensor is broken or, worse, that there is a problem in the warehouse.  
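
Here is a sketch of what that scaling does for a single reading. It assumes the same hypothetical channel as before ($y = ax + n$, Gaussian noise, with made-up `a` and `noise_std`) and the Gaussian prior with mean 12 and standard deviation 2 from above. With a Gaussian prior and a Gaussian likelihood, the maximum of the posterior has a closed form: a precision-weighted average of the prior mean and the data.

```python
def mle_temperature(readings, a):
    """Likelihood only: average the readings, then undo the gain."""
    return sum(readings) / (len(readings) * a)

def map_temperature(readings, a, noise_std, prior_mean, prior_std):
    """Maximum of prior * likelihood for a Gaussian prior and Gaussian noise."""
    prior_precision = 1.0 / prior_std ** 2
    data_precision = len(readings) * a ** 2 / noise_std ** 2
    weighted = (prior_mean * prior_precision
                + a * sum(readings) / noise_std ** 2)
    return weighted / (prior_precision + data_precision)

single_reading = [16.2]     # hypothetical reading on the warm side
a, noise_std = 0.9, 0.5     # assumed channel gain and noise level

x_mle = mle_temperature(single_reading, a)
x_map = map_temperature(single_reading, a, noise_std,
                        prior_mean=12.0, prior_std=2.0)

# The MAP estimate is pulled from the MLE toward the prior mean of 12.
print(f"MLE: {x_mle:.2f}, MAP: {x_map:.2f}")
```

The pull toward the prior is strongest with few or noisy readings. As more readings arrive, the data term dominates and the MAP estimate approaches the MLE.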

## Classification with MAP  

- For classification, we can scale the likelihood by the normalized frequency of each class in the dataset.  
- This means we are now taking the class balance into account in our answers.
- For example, if we had a likelihood score that was very high, but it's for a class that is exceedingly rare in our dataset, then we should not be so sure that it is actually the rare class.
- In the MLE case, we'd always take the class with the highest likelihood. If, based on our training data, we ended up with a rare class that always has a high likelihood by virtue of its features (either intrinsic or via some peculiarity in the data collection), we'd almost always predict that class. 
- By scaling our belief with our prior, we can now more accurately gauge the true probability of the class.  
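
A sketch of the rare-class situation described above. The class means, spreads, and counts are hypothetical numbers chosen to make the effect visible; they are not taken from the Iris dataset.

```python
import math

def gaussian_pdf(x, mu, sigma):
    coeff = 1.0 / math.sqrt(2.0 * math.pi * sigma ** 2)
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

# Hypothetical per-class Gaussians over one feature (petal length, cm).
class_params = {"versicolor": (4.4, 0.5), "virginica": (5.6, 0.5)}

# Hypothetical, heavily imbalanced training counts.
class_counts = {"versicolor": 95, "virginica": 5}
total = sum(class_counts.values())
priors = {c: n / total for c, n in class_counts.items()}  # normalized frequency

def classify(x, use_prior):
    """MLE when use_prior is False, MAP when it is True."""
    def score(c):
        mu, sigma = class_params[c]
        likelihood = gaussian_pdf(x, mu, sigma)
        return likelihood * priors[c] if use_prior else likelihood
    return max(class_params, key=score)

x_new = 5.1
mle_label = classify(x_new, use_prior=False)
map_label = classify(x_new, use_prior=True)
print(f"MLE says {mle_label}, MAP says {map_label}")
```

For this borderline flower the likelihood slightly favors the rare class, but the prior tips the decision to the common one. A flower far into the rare class's range (say 6.5 cm) is still assigned to the rare class under MAP, so the prior tempers the evidence rather than overriding it.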

## MAP Recap  

- MAP is a way to bring in our prior outside knowledge of the world.  
- Adding in the prior risks biasing our estimates, especially if the prior is inaccurate.
- But, if the priors are good approximations or even flat out very accurate, they will help scale our decisions to the truly most probable outcomes, given the entire context of the observed and hidden quantities.

# MLE vs. MAP

- MLE only looks at the likelihood function.  
- MAP brings in our prior information about the problem.
- In the case where the priors are equally likely, aka a uniform distribution, MAP and MLE give the same estimate.
- So in a way, by doing MLE we are always doing MAP with a maximum entropy distribution (Uniform) over our labels.
- And whenever we have non-uniform prior knowledge and use it, aka whenever we account for even a slight class imbalance in the labels, we are doing MAP.
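
The equivalence is easy to check numerically. In this sketch the likelihood values and the skewed prior are made up; the point is only that multiplying every likelihood by the same constant cannot change which one is largest.

```python
def argmax_mle(likelihoods):
    """Pick the class with the highest likelihood."""
    return max(likelihoods, key=likelihoods.get)

def argmax_map(likelihoods, priors):
    """Pick the class with the highest likelihood * prior."""
    return max(likelihoods, key=lambda c: likelihoods[c] * priors[c])

# Hypothetical likelihoods P(x | class) for one flower.
likelihoods = {"setosa": 0.05, "versicolor": 0.40, "virginica": 0.55}

uniform = {c: 1.0 / len(likelihoods) for c in likelihoods}
skewed = {"setosa": 0.1, "versicolor": 0.8, "virginica": 0.1}

print("MLE:            ", argmax_mle(likelihoods))
print("MAP (uniform):  ", argmax_map(likelihoods, uniform))
print("MAP (imbalance):", argmax_map(likelihoods, skewed))
```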

# Conclusion

- This post looked at two examples for MAP and MLE:
    - Sensor temperature readings to find the most likely temperature.
    - Flower characteristics to find the most likely flower species.
- MAP and MLE can be used for any problem where we have a set of givens and a set of hiddens we want to know more about.  
- Depending on the problem details and constraints, the terms in our Bayes Theorem will move around. 

###### References. 

[Supervised vs. Unsupervised Learning](https://towardsdatascience.com/supervised-vs-unsupervised-learning-14f68e32ea8d)