In [2]:
%matplotlib inline
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

$\LaTeX \text{ commands here}
\newcommand{\E}{\mathbb{E}}
\newcommand{\R}{\mathbb{R}}
\newcommand{\im}{\text{im}\,}
\newcommand{\norm}[1]{||#1||}
\newcommand{\inner}[1]{\langle #1 \rangle}
\newcommand{\span}{\mathrm{span}}
\newcommand{\proj}{\mathrm{proj}}
\newcommand{\OPT}{\mathrm{OPT}}
\newcommand{\grad}{\nabla}
\newcommand{\eps}{\varepsilon}
$

<hr style="border: 5px solid black">

**Georgia Tech, CS 4540**

# L16: Computing Eigenvalues & Eigenvectors

*Tuesday, October 16, 2018*

### Outline

* Some review of probability and statistics
    * Random variables
    * Expectation
    * Variance
* Maximum Likelihood Estimation
    * The basics
    * Some examples
    

# Probability Theory

### What is a random variable? Discrete case.

Technically speaking, a random variable $X : \Omega \to \R$ is just a map from some *sample space* $\Omega$ to the real numbers. We assume we have a *probability measure* on $\Omega$.

When $\Omega$ is a finite or countably infinite set, we can assume our measure is just a probability $p(\omega)$ assigned to every $\omega \in \Omega$, where $\sum_{\omega \in \Omega} p(\omega) = 1$.
* We will often write $P(X = a)$, which is precisely $\sum_{\omega \in \Omega : X(\omega) = a} p(\omega)$.
* The expectation of a random variable $\E[X]$ is defined as $\sum_{\omega \in \Omega} X(\omega) p(\omega)$.
* Similarly, the expectation of a function of $X$ is $\E[g(X)] = \sum_{\omega \in \Omega} g(X(\omega)) p(\omega)$.
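
The definitions above can be sketched directly in NumPy. This is a minimal illustration under assumed choices that are not part of the lecture: the sample space is a fair six-sided die, and the random variable is $X(\omega) = \omega \bmod 3$.

```python
import numpy as np

# Hypothetical example: a fair six-sided die.
omega = np.arange(1, 7)                  # sample space {1, ..., 6}
p = np.full(omega.shape, 1.0 / 6.0)      # p(omega) for each outcome, sums to 1

# An assumed random variable: X(omega) = omega mod 3, so X takes values 1,2,0,1,2,0.
X = omega % 3

def prob_X_equals(a):
    """P(X = a): add up p(omega) over the outcomes where X(omega) = a."""
    return p[X == a].sum()

# E[X] = sum over omega of X(omega) p(omega)
expectation = np.sum(X * p)

# E[g(X)] with g(x) = x^2, again summing over the sample space
expectation_g = np.sum(X ** 2 * p)

print(prob_X_equals(1), expectation, expectation_g)
```

Note that every quantity is a sum over $\Omega$ weighted by $p(\omega)$; the random variable only decides which outcomes are included or how they are weighted.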


### What is a random variable? Continuous case.

When the sample space $\Omega$ is uncountably infinite, we usually need to go to full measure theory to talk about random variables in general. We won't do that today, but consider a measure theory course!

Continuous random variables are the most common in Machine Learning. The best way to think about them is through their *probability density function* (PDF) and *cumulative distribution function* (CDF):
* For random variable $X$, the CDF is $F(x) := P(X \leq x)$
* Also, the PDF of $X$ is the derivative of the CDF, $f(x) := F'(x)$
* WARNING: not every random variable has a PDF! But it always has a CDF! (The CDF may not be differentiable.)
* When a r.v. has a PDF $f(\cdot)$, we can write
$$ \E[X] = \int x f(x)\, dx $$
* When $\mu$ is the mean of random variable $X$, then the variance $\text{Var}(X)$ is $\E[(X-\mu)^2]$
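
As a rough numerical check of these relationships, the sketch below uses an exponential random variable with rate 2 (an assumed example, chosen because its CDF and PDF have simple closed forms). It differentiates the CDF on a grid to recover the PDF, then approximates $\E[X]$ and $\text{Var}(X)$ with Riemann sums. The grid bounds and spacing are arbitrary choices.

```python
import numpy as np

rate = 2.0                                  # assumed rate parameter
x = np.linspace(0.0, 20.0, 200_001)         # grid on [0, 20]; mass beyond 20 is negligible
dx = x[1] - x[0]

cdf = 1.0 - np.exp(-rate * x)               # F(x) = P(X <= x)
pdf = rate * np.exp(-rate * x)              # f(x) = F'(x), closed form

# Finite-difference derivative of the CDF should be close to the PDF.
pdf_numeric = np.gradient(cdf, dx)
max_gap = np.max(np.abs(pdf_numeric - pdf))

# Riemann-sum approximations of the integrals.
mean = np.sum(x * pdf) * dx                 # E[X] = integral of x f(x) dx
var = np.sum((x - mean) ** 2 * pdf) * dx    # Var(X) = E[(X - mu)^2]

print(max_gap, mean, var)
```

For this distribution the mean is $1/2$ and the variance is $1/4$, so the printed values should land close to those numbers.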


### PDF vs CDF

<img src="images/pdf_cdf.gif">

### Independence

* Strictly speaking, two random variables $X$ and $Y$ are independent if for any (measurable) sets $A, B \subset \R$ we have $P(X \in A \text{ and } Y \in B) = P(X \in A) P( Y \in B)$.
* Perhaps the most useful fact of independence of $X$ and $Y$ is that $\E[XY] = \E[X]\E[Y]$
<div style="padding:10px; margin:10px; border: 1px solid black">
<b>Problem:</b> Show the following<br>
<b>Part A:</b> Let $X$ be any random variable. Show that $\text{Var}(X) = \E[X^2] - (\E[X])^2$<br>
<b>Part B:</b> Let $X$ be some random variable. Then $\text{Var}(X + X + X) = 9 \text{Var}(X)$<br>
<b>Part C:</b> Let $X_1, \ldots, X_n$ be <i>independent</i> random variables. Then $\text{Var}(X_1 + \ldots + X_n) = \text{Var}(X_1) + \ldots + \text{Var}(X_n)$<br>
</div>
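
Before attempting the proofs, it can help to sanity-check Parts B and C by simulation. The sketch below is only a numerical check, not a proof; the two distributions, the sample size, and the seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Two samples drawn independently of each other (assumed distributions).
x = rng.uniform(0.0, 1.0, size=n)
y = rng.exponential(scale=1.0, size=n)

# Part B: X + X + X is the same variable added three times, not three independent copies.
var_3x = np.var(x + x + x)
nine_var_x = 9.0 * np.var(x)

# Part C: for independent X and Y the variances add (up to sampling noise).
var_sum = np.var(x + y)
sum_var = np.var(x) + np.var(y)

print(var_3x, nine_var_x)
print(var_sum, sum_var)
```

The first pair agrees up to floating-point rounding, while the second pair agrees only approximately, because the sample covariance of two independent samples is small but not exactly zero.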


### Problem: Markov's Inequality

Try to show the famous Markov's Inequality:

Let $X$ be a random variable that only takes positive values. Then for any $k > 0$ we have
$$P(X \geq k) \leq \frac{\E[X]}{k}$$
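
To get a feel for how loose or tight the bound is, here is a small empirical comparison. It assumes exponential samples with mean 1 and a handful of thresholds $k$; both are illustrative choices only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed example: exponential samples with mean 1 (all values are nonnegative).
x = rng.exponential(scale=1.0, size=100_000)

ks = np.array([0.5, 1.0, 2.0, 4.0])
tail = np.array([(x >= k).mean() for k in ks])   # empirical P(X >= k)
bound = x.mean() / ks                            # Markov bound E[X] / k

for k, t, b in zip(ks, tail, bound):
    print(f"k={k:4.1f}  P(X>=k) ~ {t:.4f}  E[X]/k ~ {b:.4f}")
```

Because the empirical distribution of the samples is itself a distribution on nonnegative values, the inequality holds for these empirical quantities exactly, not just approximately.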

### Problem: Chebyshev's Inequality

Now try to prove Chebyshev's inequality:

Let $X$ be any random variable with mean $\E[X] = \mu$ and finite variance. Then for any $k > 0$ we have
$$P(|X - \mu| \geq k) \leq \frac{\text{Var}(X)}{k^2}$$
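
The same kind of empirical comparison works here. This sketch assumes uniform samples on $[0, 1]$ and a few illustrative thresholds, and uses the sample mean and the population-style sample variance (`np.var` with its default `ddof=0`) in place of $\mu$ and $\text{Var}(X)$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed example: uniform samples on [0, 1].
x = rng.uniform(0.0, 1.0, size=100_000)
mu = x.mean()
var = x.var()                                    # ddof=0: variance of the empirical distribution

ks = np.array([0.1, 0.25, 0.4])
tail = np.array([(np.abs(x - mu) >= k).mean() for k in ks])   # empirical P(|X - mu| >= k)
bound = var / ks ** 2                                          # Chebyshev bound Var(X) / k^2

for k, t, b in zip(ks, tail, bound):
    print(f"k={k:4.2f}  P(|X-mu|>=k) ~ {t:.4f}  Var/k^2 ~ {b:.4f}")
```

For small $k$ the bound exceeds 1 and says nothing; it only becomes informative once $k$ is larger than the standard deviation.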

### Some Random Variable Distributions

### Parameterized families of probability distributions

* Generally, we assume that the data we observe in the real world are drawn from some unknown distribution
* Much of the research in statistics and ML can be viewed as trying to *reason about uncertainty* by estimating these distributions, or properties of these distributions
* For example: what is the *mean*? What is the *median*? What is the threshold of the upper decile? Given that you are in the top 30% of the distribution, what is your most likely value? Etc.
* Most distributions we work with come from *parameterized families*, i.e. we can describe the distribution as a function of some inputs, known as parameters
* Most basic example:
    * the *Gaussian* distribution has PDF
    $$p_{\mu, \sigma^2}(x) := \frac{1}{\sqrt{2 \pi \sigma^2}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
    * this is a *function of $x$* but it is *parameterized by* $\mu$ and $\sigma^2$
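
The distinction between the argument $x$ and the parameters $\mu, \sigma^2$ maps naturally onto a function signature. The sketch below is one possible way to write it; the function name `gaussian_pdf`, the parameter settings, and the integration grid are assumptions made for illustration.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma2):
    """Gaussian density at x, parameterized by mean mu and variance sigma2."""
    return np.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)

# Each (mu, sigma2) pair picks out one member of the family.
x = np.linspace(-20.0, 20.0, 40_001)
dx = x[1] - x[0]

areas = []
for mu, sigma2 in [(0.0, 1.0), (1.0, 0.25), (-2.0, 4.0)]:
    area = np.sum(gaussian_pdf(x, mu, sigma2)) * dx   # Riemann sum of the density
    areas.append(area)
    print(f"mu={mu:5.1f}  sigma2={sigma2:4.2f}  area ~ {area:.6f}")
```

Whatever the parameters, the density should integrate to approximately 1, which is what makes each member of the family a valid probability distribution.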

### Distributions of interest:


**Discrete distributions**
* Bernoulli: https://en.wikipedia.org/wiki/Bernoulli_distribution
* Binomial: https://en.wikipedia.org/wiki/Binomial_distribution
* Poisson: https://en.wikipedia.org/wiki/Poisson_distribution

**Continuous distributions**
* Gaussian/Normal: https://en.wikipedia.org/wiki/Normal_distribution
* Uniform: https://en.wikipedia.org/wiki/Uniform_distribution_(continuous)
* Exponential: https://en.wikipedia.org/wiki/Exponential_distribution
* Beta: https://en.wikipedia.org/wiki/Beta_distribution


### Estimating Parameters of Distributions: MLE

Let us say we have $n$ independent draws $X_1, \ldots, X_n$ from some distribution $p_\theta$ parameterized by $\theta$. How might we estimate $\theta$?

**Maximum Likelihood Estimation** (a natural heuristic): Choose the $\theta$ that makes the data most likely! That is, it should maximize the joint density (PDF) of the samples.

(Warning: this is a very popular but *controversial* approach to parameter estimation. Just ask a Bayesian what they think about the MLE! We won't get into this debate here.)

$$
\theta_{\text{MLE}} = \arg\max_\theta p_\theta(X_1, \ldots, X_n) = \arg\max_\theta \prod_{i=1}^n p_\theta(X_i)
$$
It's often easier to take the $\log$ of the product, since it doesn't change the argmax!
$$\theta_{\text{MLE}} = \arg\max_\theta \sum_{i=1}^n \log p_\theta(X_i)$$
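
The log form also matters numerically. The sketch below maximizes the log-likelihood by brute-force grid search for an assumed setup that is not one of the exercises as stated: Gaussian data with known variance 1 and an unknown mean. The true mean, the sample size, and the grid are all illustrative choices, and grid search is used only because it is easy to read, not because it is the recommended optimizer.

```python
import numpy as np

rng = np.random.default_rng(0)
true_mu = 1.5                                    # hypothetical ground truth
samples = rng.normal(loc=true_mu, scale=1.0, size=5000)

def log_density(x, mu):
    """log p_mu(x) for a Gaussian with unknown mean mu and known variance 1."""
    return -0.5 * np.log(2.0 * np.pi) - 0.5 * (x - mu) ** 2

# Evaluate the log-likelihood sum_i log p_mu(X_i) at each candidate mu.
mu_grid = np.linspace(0.0, 3.0, 3001)
log_lik = np.array([log_density(samples, mu).sum() for mu in mu_grid])
mu_hat = mu_grid[np.argmax(log_lik)]

# The raw product of 5000 densities underflows to 0.0 in floating point,
# even at the best candidate, so it cannot be maximized directly.
raw_likelihood = np.prod(np.exp(log_density(samples, mu_hat)))

print(mu_hat, log_lik.max(), raw_likelihood)
```

The product of densities is exactly `0.0` in double precision here, while the sum of logs stays a finite, well-behaved number with the same maximizer.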

### Problem: MLE of the exponential distribution

Assume $X_1, \ldots, X_n > 0$ are sampled from an exponential distribution $p_\lambda(x) = \lambda \exp(-\lambda x)$. What is the MLE for $\lambda$?

Also, what is the mean of the distribution $p_\lambda$?

### Problem: MLE of the Bernoulli distribution

Assume $X_1, \ldots, X_n$ are sampled from a Bernoulli distribution, which is supported on $\{0,1\}$: $p_\theta(x) = \theta^x (1-\theta)^{1-x}$. What is the MLE for $\theta$?


### Problem: MLE of the Poisson distribution

Assume $X_1, \ldots, X_n \geq 0$ are nonnegative integers sampled from the Poisson distribution: $p_\lambda(k) = \frac{\lambda^k\exp(-\lambda)}{k!}$. Note that this is a *discrete* distribution. What is the MLE for $\lambda$?


### Problem: MLE of the Gaussian Distribution

Assume $X_1, \ldots, X_n$ are real numbers sampled from a Gaussian distribution:
$$p_{\mu, \sigma^2}(x) := \frac{1}{\sqrt{2 \pi \sigma^2}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
What are the maximum likelihood estimates of $\mu$ and $\sigma^2$? *Hint*: solve for $\mu$ first.