### Review of Clustering
- Intuition: data assigned to cluster $k$ should be close to that cluster's mean $\mu_k$.
- Distortion measure: $J = \sum_n \sum_k r_{nk} || x_n - \mu_k||_2^2$.
- We have the indicator variables $r_{nk} = 1$ iff $A(x_n) = k$, where $A(x_n)$ is the cluster assigned to $x_n$.
- Minimize the distortion measure through alternating optimization between $r_{nk}$ and $\mu_k$.
- Lloyd's algorithm:
    - Step 0: Initialize $\mu_k$.
    - Step 1: Update $r_{nk}$ based on $\mu_k$ (basically assign data to clusters).
    - Step 2: Recompute means $\mu_k$. 
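
The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the function name `lloyd`, the toy points, and the choice to keep the old mean when a cluster becomes empty are all assumptions made for the example.

```python
def lloyd(points, init_means, n_iters=100):
    """Minimal k-means (Lloyd's algorithm) on points given as tuples of floats."""
    means = [tuple(m) for m in init_means]  # Step 0: initial means
    assign = None
    for _ in range(n_iters):
        # Step 1: assign each point to the nearest mean (squared Euclidean distance).
        new_assign = [
            min(range(len(means)),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(x, means[k])))
            for x in points
        ]
        if new_assign == assign:  # assignments stopped changing: converged
            break
        assign = new_assign
        # Step 2: recompute each mean as the average of its assigned points.
        for k in range(len(means)):
            members = [x for x, a in zip(points, assign) if a == k]
            if members:  # assumption: keep the old mean if a cluster is empty
                means[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return means, assign


# Hypothetical toy data: two well-separated groups in 2-D.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
means, assign = lloyd(points, init_means=[(0.0, 0.0), (10.0, 10.0)])
```

Each pass through the loop can only lower (or keep) the distortion $J$, which is why the loop stops once the assignments no longer change.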

### GMMs: Mixture Models - Probabilistic Interpretation
- GMM = probabilistic interpretation of k-means
- Model each cluster with a Gaussian.
- GMM: $p(x) = \sum_{k=1}^{K} w_{k} N(x | \mu_k, \sigma_k)$, or equivalently, for a data point $x_n$ with latent component label $z_n$, $p(x_n) = \sum_k p(z_n = k)N(x_n | \mu_k, \sigma_k)$.
- $\mu_k$ and $\sigma_k$ are the mean and covariance of the $k$th component.
- $w_k$ are the mixture weights. We have $w_k \geq 0 \forall k$, and $\sum_k w_k = 1$. 
- Given incomplete data $D = [x_n]$ we must learn parameters of the Gaussian distributions for each of our mixture components, as well as the values $w_k$ that give us weights for our mixture components. 

### Parameter estimation for GMMs 
- The parameters are the mixture weights as well as the means and covariances of each Gaussian distribution: $\theta = (w_k, \mu_k, \sigma_k), k \in 1...K$.
- If we knew these parameters, we could compute $P(x_n | z_n = k)$ by substituting into the relevant normal distribution: $P(x_n | z_n = k) = N(x_n | \mu_k, \sigma_k)$.
- With Bayes theorem, we can compute the posterior probability $P(z_n = k | x_n)$ (also remember that the prior is $w_k = P(z_n = k)$). The quantity $P(z_n = k | x_n)$ tells us the probability with which the point $x_n$ is assigned to cluster $k$.
- Bayes theorem states that $P(a | b) = \frac{P(b |a) P(a)}{P(b)}$. 
- Applying this, we have $P(z_n = k | x_n) = \frac{P(x_n | z_n = k)P(z_n = k)}{P(x_n)}$. 
- Expanding the denominator with the law of total probability, we have $P(z_n = k | x_n) = \frac{P(x_n | z_n = k)P(z_n = k)}{\sum_{k'} p(x_n | z_n = k') p(z_n = k')}$. We can compute this quantity if we assume we know all of the parameters.

- If we assume we have the labels and not the parameters, then we have the problem of finding the parameters given the complete data, i.e., the points $x_n$ and the associated cluster assignments $z_n$.
- We can do this by maximizing the complete log-likelihood: $\hat{\theta} = \arg\max_{\theta} \log P(D') = \arg\max_{\theta} \sum_n \log p(x_n, z_n)$, where $D' = \{(x_n, z_n)\}$ is the complete data.
- We obtain some pretty intuitive values for the parameters (explicitly written in the last lecture). $w_k$ is just the fraction of all data points whose $z_n = k$, $\mu_k$ is the mean of all points whose $z_n = k$, and $\sigma_k$ is the covariance matrix of all points whose $z_n = k$.
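
As a small illustration of these complete-data estimates, here is a sketch for the one-dimensional case, where the covariance reduces to a scalar variance. The function name `fit_complete`, the 0-based labels, and the toy data are assumptions for the example; it also assumes every component has at least one labelled point.

```python
def fit_complete(xs, zs, K):
    """MLE of (w_k, mu_k, var_k) for a 1-D GMM when the labels z_n are observed."""
    N = len(xs)
    weights, mus, variances = [], [], []
    for k in range(K):
        members = [x for x, z in zip(xs, zs) if z == k]
        n_k = len(members)  # assumption: n_k > 0 for every component
        mu = sum(members) / n_k
        # The MLE of the variance divides by n_k, not n_k - 1.
        var = sum((x - mu) ** 2 for x in members) / n_k
        weights.append(n_k / N)  # fraction of points with z_n = k
        mus.append(mu)
        variances.append(var)
    return weights, mus, variances


# Hypothetical labelled data: two points in component 0, three in component 1.
w, mu, var = fit_complete([1.0, 3.0, 10.0, 12.0, 14.0], [0, 0, 1, 1, 1], K=2)
```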

### Parameter estimation for GMMs: Incomplete Data
- The interesting problem is when we only have observed training data $D = \{x_n\}$ and the cluster assignments $z_n$ are hidden.
- We have a similar goal of obtaining the maximum likelihood estimate of the parameters $\theta$.
- This is $\arg\max_{\theta} \sum_{n} \log p(x_n | \theta)$. This new objective function $l(\theta)$ is called the incomplete log-likelihood.

### Optimization with the EM Algorithm
- No easy/typical way to optimize the incomplete log likelihood. 
- Expectation maximization algorithm: a strategy for iteratively optimizing this function.
- Two steps as applied to GMMs:
    - E-step: guess values for $z_n$ given values of the parameters $\theta$.
    - M-step: obtain new values for the parameters $\theta$ given the newly computed values for $z_n$.

- E-step: Soft cluster assignments
    - Define $\gamma_{nk} = p(z_n = k | x_n, \theta)$. This is the posterior of $z_n$ given $x_n$ and $\theta$. 
    - In the complete data setting, $\gamma$ was binary, but now it's a soft/probabilistic assignment of $x_n$ to the $k$th component, so $x_n$ is assigned to each component with some probability.
    - Given $\theta = (w_k, \mu_k, \sigma_k)$ we can compute $\gamma_{nk}$ with Bayes theorem:
    - $\gamma_{nk} = p(z_n = k | x_n) = \frac{p(x_n | z_n = k) p(z_n = k)}{p(x_n)}$, where we can expand the denominator using the law of total probability.

- M-step: 
    - Maximize the expected complete log-likelihood.
    - Previously, we had $\gamma_{nk}$ as binary, but now we define $\gamma_{nk} = p(z_n = k | x_n)$, the soft cluster assignments from the previous step.
    - We now compute the maximum likelihood estimate in the same fashion as with complete data, except that $z_n$ is unobserved, so each term is weighted by its posterior: $\arg\max_{\theta} \sum_n \sum_k \gamma_{nk} \log p(x_n, z_n = k)$. We end up with the same expressions for the MLEs as before (written in the previous lecture notes), but now the gammas are probabilistic instead of just binary.
    - Intuition: each point contributes fractionally to each component's parameters, with weights determined by $\gamma_{nk}$.
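
For reference, the soft-assignment updates take the following standard form. The previous lecture's expressions are not reproduced in these notes, so treat this as the usual textbook statement rather than a quotation; $N_k$ is a helper symbol introduced here for the effective number of points in component $k$, and $N$ is the total number of points.

$$
N_k = \sum_n \gamma_{nk}, \qquad
w_k = \frac{N_k}{N}, \qquad
\mu_k = \frac{1}{N_k}\sum_n \gamma_{nk}\, x_n, \qquad
\sigma_k = \frac{1}{N_k}\sum_n \gamma_{nk}\,(x_n - \mu_k)(x_n - \mu_k)^T
$$

Setting every $\gamma_{nk}$ to 0 or 1 recovers the complete-data estimates: a fraction, a mean, and a covariance over the points labelled $k$.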
    
- EM procedure for GMM: 
    - Alternate between estimating $\gamma_{nk}$ and $\theta$.
    - Initialize $\theta$ with some values (random or otherwise). 
        - Repeat: E-step: compute soft assignment gammas, M-step: maximize complete likelihood to obtain MLEs for $\theta = (w_k, \mu_k, \sigma_k)$. 
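
The alternation can be sketched for a one-dimensional mixture, where each covariance is a scalar variance. This is an illustrative sketch under assumptions: the names `normal_pdf`, `log_likelihood`, and `em_gmm_1d`, the fixed iteration count, the variance floor `min_var`, and the toy data are not from the lecture.

```python
import math


def normal_pdf(x, mu, var):
    """Density of a 1-D Gaussian with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def log_likelihood(xs, w, mu, var):
    """Incomplete log-likelihood: sum_n log sum_k w_k N(x_n | mu_k, var_k)."""
    return sum(
        math.log(sum(w[k] * normal_pdf(x, mu[k], var[k]) for k in range(len(w))))
        for x in xs
    )


def em_gmm_1d(xs, w, mu, var, n_iters=50, min_var=1e-6):
    K, N = len(w), len(xs)
    w, mu, var = list(w), list(mu), list(var)
    history = [log_likelihood(xs, w, mu, var)]
    for _ in range(n_iters):
        # E-step: gamma[n][k] = p(z_n = k | x_n, theta) via Bayes theorem.
        gamma = []
        for x in xs:
            joint = [w[k] * normal_pdf(x, mu[k], var[k]) for k in range(K)]
            total = sum(joint)
            gamma.append([j / total for j in joint])
        # M-step: weighted versions of the complete-data estimates.
        for k in range(K):
            n_k = sum(g[k] for g in gamma)
            w[k] = n_k / N
            mu[k] = sum(g[k] * x for g, x in zip(gamma, xs)) / n_k
            s = sum(g[k] * (x - mu[k]) ** 2 for g, x in zip(gamma, xs)) / n_k
            var[k] = max(s, min_var)  # assumption: floor to avoid a zero variance
        history.append(log_likelihood(xs, w, mu, var))
    return w, mu, var, history


# Hypothetical data: two tight groups around -2 and +2.
xs = [-2.1, -1.9, -2.0, 1.9, 2.0, 2.1]
w, mu, var, history = em_gmm_1d(xs, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
```

The `history` list records the incomplete log-likelihood after each iteration, which makes it easy to check that it does not go down from one iteration to the next.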



### GMMs and K-means
- GMM = probabilistic interpretation of K-means.
- GMMs reduce to K-means under some specific conditions. In this case, the EM algorithm for GMMs also simplifies down to Lloyd's algorithm. These conditions are:
    - Assume that all Gaussians share the covariance $\sigma^2 I$, i.e., their covariance matrices are diagonal with the same variance in every dimension.
    - Further assume that $\sigma^2$ tends to 0, so the soft assignments become hard and we only need to estimate $\mu_k$, leaving fewer parameters.
- Therefore, GMMs can be considered a more general model. K-means = hard GMM, or GMM = soft K-means.
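
To see why the soft assignments harden, write out the E-step posterior under the shared covariance $\sigma^2 I$. The Gaussian normalizing constants are identical across components and cancel. This is a sketch of the standard argument, assuming all $w_k > 0$ and no ties in distance.

$$
\gamma_{nk} = \frac{w_k \exp\left(-||x_n - \mu_k||_2^2 / 2\sigma^2\right)}{\sum_j w_j \exp\left(-||x_n - \mu_j||_2^2 / 2\sigma^2\right)} \;\longrightarrow\; r_{nk} \quad \text{as } \sigma^2 \to 0
$$

As $\sigma^2$ shrinks, the term with the smallest distance $||x_n - \mu_j||_2^2$ dominates the denominator, so $\gamma_{nk}$ tends to 1 for the nearest mean and 0 for all others, which is exactly the assignment step of Lloyd's algorithm.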
    
### EM Algorithm
- With the EM algorithm, the parameter estimates from the M-step of each iteration never decrease the likelihood.
- The algorithm converges, but only to a local optimum.

### Hidden Markov Models
- Motivation: many models make assumptions. The assumption of linearly separable data can be removed with nonlinear basis functions and kernels. We also typically assume that the data are independently distributed. Can we challenge this assumption? It is not realistic in many cases, such as time-series data.

- Sequential process: 
- A system that can occupy one of K states. 
- Let $X_t$ be the state of the system at time $t$. 
- $X_t$ is a random variable that takes values between 1 and K. 
- If we observe the state of the system from time 1 to $t$, what is its state at time $t+1$?
- It's given by the conditional distribution: 
- $P(x_{t+1} | x_1 ... x_t)$. 
- What is the probability of observing that the system is in states $x_1 ... x_t$ at times 1 to t? $P(x_1 ... x_t) = P(x_1) P(x_2 | x_1) P(x_3 | x_2, x_1) ... P(x_t | x_1 ... x_{t-1})$. Basically the probability of a current state at a particular time is dependent on all previous states. 
- If we know all the conditional probabilities, we can compute the joint.
- This is kind of difficult (but possible with things like RNNs and LSTMs/GRUs), but for Markov Processes, we just assume that the next state depends only on the current state. The future is conditioned on the present and independent of the past: $P(x_{t+1} | x_1 ... x_t) = P(x_{t+1} | x_t)$. 

### Markov Chains
- Initial probabilities to begin the chain: $\pi_i = P(X_1 = i)$. 
- Transition probabilities: $q_{ij} = P(X_{t+1} = i | X_t = j)$. 
- For the transition probability matrix we require that each column sums to 1 (column $j$ is the distribution over next states given current state $j$), and the initial probability vector must also sum to 1.
- Computing on Markov Chains: 
- What is the probability of observing a particular sequence of states such as $X_1, X_2, X_3 = 1,1,3$? It is $\pi_1 q_{1,1} q_{3,1}$: start in state 1, stay in state 1, then move from state 1 to state 3.
- More generally: $P(x_1 ... x_T) = \pi_{x_1} q_{x_2, x_1} ... q_{x_T, x_{T-1}}$. This gives us the probability of observing a particular sequence of states.
- We may also be interested in computing the probability of a particular state at a particular time. 
- This means that we want to compute $P(x_T = i)$, and we want to consider all previous states that could have led to this current state. 
- We know that a particular sequence has probability $P(x_1 ... x_T) = \pi_{x_1} q_{x_2, x_1} ... q_{x_T, x_{T-1}}$, so we just have to consider all possible such sequences that end in the desired state.
- This value is $P(X_T = i) = \sum_{x_1, x_2, ... x_{T-1}} P(x_1, x_2, ... x_{T-1}, x_T = i)$. Since each $x_t$ can take $K$ values, the computational cost of this is $O(K^T)$.
- However, there is an algorithm to compute this more efficiently. 
- We want to compute $P(X_T = i)$. To do this, we first define $p_t(i) = P(X_t = i)$. For the first state, we have $p_1(i) = \pi_i$; if the start state is known, this is 1 if $i$ is the start state and 0 otherwise.
- Now assume that we know $p_{t-1}(j) \forall j$. Then to compute the next step in the time series, we have $p_t(i) = \sum_j p_{t-1}(j)q_{ij}$.
- Each $p_t(i)$ is a sum over $K$ terms, so computing $p_t(i)$ for all $i$ at a given $t$ costs $O(K^2)$, and doing this $T-1$ times gives a total cost of $O(K^2(T-1))$, far better than exponential. This is an example of dynamic programming.
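
The two computations above (the probability of one sequence, and the marginal $P(X_T = i)$ by dynamic programming) can be sketched as follows, with a brute-force sum over all $K^T$ sequences for comparison. The chain below is a made-up example, states are 0-indexed, and `Q[i][j]` follows the convention $q_{ij} = P(X_{t+1} = i | X_t = j)$, so each column sums to 1.

```python
from itertools import product

# Hypothetical 3-state chain. Q[i][j] = P(X_{t+1} = i | X_t = j): columns sum to 1.
pi = [0.5, 0.3, 0.2]
Q = [
    [0.7, 0.2, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.3, 0.6],
]


def sequence_prob(states, pi, Q):
    """P(x_1, ..., x_T) = pi[x_1] * product of q[x_{t+1}, x_t]."""
    p = pi[states[0]]
    for prev, nxt in zip(states, states[1:]):
        p *= Q[nxt][prev]
    return p


def marginal_dp(T, pi, Q):
    """p_T(i) for all i via p_t(i) = sum_j p_{t-1}(j) q_ij, in O(K^2 (T-1))."""
    K = len(pi)
    p = list(pi)  # p_1(i) = pi_i
    for _ in range(T - 1):
        p = [sum(Q[i][j] * p[j] for j in range(K)) for i in range(K)]
    return p


def marginal_brute_force(T, pi, Q):
    """Same quantity by summing over all K^T sequences (exponential cost)."""
    K = len(pi)
    p = [0.0] * K
    for states in product(range(K), repeat=T):
        p[states[-1]] += sequence_prob(states, pi, Q)
    return p
```

With 0-indexed states, the sequence $1,1,3$ from the text corresponds to `sequence_prob([0, 0, 2], pi, Q)`, and `marginal_dp(5, pi, Q)` should agree with `marginal_brute_force(5, pi, Q)` up to floating-point error.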

### Hidden Markov Models
- Previously, we directly observed the states $X_t$. Now we actually observe another random variable $Y_t$ that is affected by $X_t$.
- In this case we refer to $Y_t$ as the observed states and $X_t$ as the hidden states. Having observed $Y = (Y_1 ... Y_T)$, we want to ask questions about the hidden states.
- We define the set of observed states/emission symbols $B = \{b_1 ... b_L\}$ to be the values that the observed states can take.
- We define emission probabilities: the probability of observing a state given a particular hidden state. This is $e_k(b) = P(Y_t = b | X_t = k)$. We require $\sum_{b} e_k(b) = 1$. 
- Hidden Markov Model definitions: 
    - Hidden and observed states. Some initial probability $\pi_i =  P(X_1 = i)$. Transition probabilities $q_{ij} = P(X_t = i | X_{t-1} = j)$. Emission probabilities $e_k(b) = P(Y_t = b | X_t = k)$. 

### Querying HMMs
- HMMs can be generative models: 
    - Pick a state $X_1$ according to the distribution $\pi$. 
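    - Let $t = 1$.
    - Emit an observation $y_t$ according to the emission distribution $e_{x_t}$.
    - Choose the next state $x_{t+1}$ according to the transition distribution $q_{\cdot, x_t}$ (column $x_t$ of the transition matrix).
    - Set $t \leftarrow t + 1$ and repeat.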
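
The generative procedure above can be sketched with the standard library's `random.Random.choices`. The two-state, two-symbol HMM, the function name `sample_hmm`, and the fixed sequence length `T` are assumptions for illustration; states and symbols are 0-indexed, `Q[i][j]` is $P(X_{t+1} = i | X_t = j)$, and `E[k][b]` is $e_k(b)$.

```python
import random

# Hypothetical HMM: columns of Q sum to 1, rows of E sum to 1.
pi = [0.6, 0.4]
Q = [[0.7, 0.4],
     [0.3, 0.6]]
E = [[0.9, 0.1],
     [0.2, 0.8]]


def sample_hmm(T, pi, Q, E, seed=0):
    """Draw hidden states x_1..x_T and observations y_1..y_T from the HMM."""
    rng = random.Random(seed)
    K, L = len(pi), len(E[0])
    xs, ys = [], []
    x = rng.choices(range(K), weights=pi)[0]  # pick x_1 according to pi
    for _ in range(T):
        y = rng.choices(range(L), weights=E[x])[0]  # emit y_t from e_{x_t}
        xs.append(x)
        ys.append(y)
        column = [Q[i][x] for i in range(K)]  # distribution over the next state
        # The draw after the final emission is simply discarded.
        x = rng.choices(range(K), weights=column)[0]
    return xs, ys


xs, ys = sample_hmm(10, pi, Q, E, seed=1)
```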
    
- We may also be interested in the joint probability of a sequence of hidden and observed states. We have $P(y, x) = P(y_{1:T}, x_{1:T}) = P(y_{1:T} | x_{1:T}) P(x_{1:T}) = \prod_{t=1}^{T} e_{x_t}(y_t) \pi_{x_1}\prod_{t=1}^{T-1}q_{x_{t+1}, x_t}$.
- The most probable path (MPP) problem for HMMs:
- Given a sequence of observations $y_1 ... y_T$, what is the most probable sequence of hidden states $x_1 ... x_T$?
- We want $\arg\max_{x_{1:T}} P(y_{1:T}, x_{1:T})$, i.e., the sequence of hidden states that is most probable jointly with what we observed. Since from above we already know how to compute $P(y_{1:T} | x_{1:T}) P(x_{1:T})$ for a specific sequence, we can solve this problem by searching over all possible values of $x_{1:T}$. Since each $X_t$ can take on $K$ values, there are $K^T$ possibilities, so this search would take exponential time.
- However, there is an efficient DP solution to this. 
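
To make the joint probability and the exponential search concrete, here is a small sketch. The HMM parameters and the names `joint_prob` and `brute_force_mpp` are assumptions for illustration; states and symbols are 0-indexed, `Q[i][j]` is $P(X_{t+1} = i | X_t = j)$, and `E[k][b]` is $e_k(b)$.

```python
from itertools import product

# Hypothetical 2-state, 2-symbol HMM.
pi = [0.6, 0.4]
Q = [[0.7, 0.4],
     [0.3, 0.6]]
E = [[0.9, 0.1],
     [0.2, 0.8]]


def joint_prob(ys, xs, pi, Q, E):
    """P(y_{1:T}, x_{1:T}) = pi[x_1] e_{x_1}(y_1) * prod_t q[x_t, x_{t-1}] e_{x_t}(y_t)."""
    p = pi[xs[0]] * E[xs[0]][ys[0]]
    for t in range(1, len(ys)):
        p *= Q[xs[t]][xs[t - 1]] * E[xs[t]][ys[t]]
    return p


def brute_force_mpp(ys, pi, Q, E):
    """Most probable hidden path by enumerating all K^T candidates."""
    K = len(pi)
    return max(product(range(K), repeat=len(ys)),
               key=lambda xs: joint_prob(ys, xs, pi, Q, E))


ys = [0, 1, 1]
best_path = brute_force_mpp(ys, pi, Q, E)
best_prob = joint_prob(ys, best_path, pi, Q, E)
```

For three observations this enumerates only $2^3 = 8$ paths, but the count grows as $K^T$, which is what motivates a dynamic-programming solution.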

### The Viterbi Algorithm
- Remember that the MPP problem is trying to find the most probable sequence of hidden states. This is $\arg\max_{x_{1:T}} P(y_{1:T}, x_{1:T})$.
- Define $v_t(k)$ for all $t \in 1...T$ and $k \in 1...K$ as the probability of the MPP for observations $y_1 ... y_t$ that ends in state $k$ (so the path has length $t$):
- $v_t(k) = \max_{x_{1:t-1}} P(y_{1:t}, x_{1:t-1}, x_t = k)$.
- Suppose that we have computed $v_t(k)$ for all $t$ and $k$.
- If we look at $v_T(k)$, it tells us the probability of the MPP of length $T$ that ends in state $k$.
- The probability of the MPP is then just $\max_k v_T(k)$. The path itself is recovered by remembering the maximizing previous state at every step and tracing back from the best final state.
- Can we compute $v_t(k)$ efficiently? 
- Assume that we have computed $v_{t-1}(l)$. 
- This means that we have computed the probability for the MPP of length $t-1$ that ends in state $l$ for all values of $l$. 
- How do we use this to compute the probability of the MPP of length $t$ that ends in state $k$?

- Viterbi Algorithm: 
    - Start with t = 1. 
    - $v_1(k) = P(y_1, x_1 = k) = P(y_1 | x_1 = k )P(x_1 = k) = e_k(y_1)\pi_k$ (there are no earlier states to maximize over).
- Now assume that we've computed $v_{t-1}(l)$. This means that we have computed the probability of the MPP of length $t-1$ that ends in state $l$, for all values of $l$.
- How do we use this to compute the probability of the MPP of length $t$ that ends in state $k$?
- The most probable path with last 2 states $l,k$ is the MPP with state $l$ at time $t-1$, followed by a transition from state $l$ to state $k$ and the emission of the observation at time $t$.
- The probability of this path is $v_{t-1}(l) P(X_t = k | X_{t-1} = l)P(y_t | X_t = k)$
- = $v_{t-1}(l)q_{kl}e_k(y_t)$.
- The MPP that ends in state $k$ at time $t$ is obtained by maximizing over all possible states $l$ at the previous time $t-1$: $v_t(k) = \max_l v_{t-1}(l) q_{kl}e_k(y_t)$.

- So basically the Viterbi algorithm boils down to 2 things: 
- Base case: $v_1(k) = P(y_1, x_1 = k) = e_k(y_1)\pi_k$.
- The MPP that ends in state $k$ at time $t$ is given by maximizing over all possible states $l$ at the previous time $t-1$:
    - $v_t(k) = \max_l v_{t-1}(l) q_{kl}e_k(y_t)$.
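
The base case and recursion above translate fairly directly into code. The sketch below also stores back-pointers so that the path itself, not just its probability, can be recovered. The function name `viterbi`, the example HMM, and the 0-indexed states are assumptions for illustration; `Q[k][l]` is $q_{kl} = P(X_t = k | X_{t-1} = l)$ and `E[k][b]` is $e_k(b)$.

```python
# Hypothetical 2-state, 2-symbol HMM.
pi = [0.6, 0.4]
Q = [[0.7, 0.4],
     [0.3, 0.6]]
E = [[0.9, 0.1],
     [0.2, 0.8]]


def viterbi(ys, pi, Q, E):
    """Return (most probable hidden path, its joint probability with ys)."""
    K, T = len(pi), len(ys)
    v = [[0.0] * K for _ in range(T)]   # v[t][k]: prob. of best path ending in k
    back = [[0] * K for _ in range(T)]  # back[t][k]: best previous state l
    for k in range(K):
        v[0][k] = pi[k] * E[k][ys[0]]   # base case: v_1(k) = e_k(y_1) pi_k
    for t in range(1, T):
        for k in range(K):
            # Maximize v_{t-1}(l) q_kl over l; e_k(y_t) does not depend on l.
            best_l = max(range(K), key=lambda l: v[t - 1][l] * Q[k][l])
            back[t][k] = best_l
            v[t][k] = v[t - 1][best_l] * Q[k][best_l] * E[k][ys[t]]
    # Trace back from the best final state.
    last = max(range(K), key=lambda k: v[T - 1][k])
    path = [last]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, v[T - 1][last]


path, prob = viterbi([0, 1, 1], pi, Q, E)
```

The two nested loops over states inside the loop over time give a cost of $O(K^2 T)$. For long sequences the products of probabilities can underflow, so a practical version would typically work with sums of log-probabilities instead.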

### Learning HMMs
- We previously assumed that the parameters are known. Can we actually learn them from data?
- Params of HMMs: $\theta = (\pi, Q, E)$.
- Here $E$ is the matrix of emission probabilities $E_{kb} = e_k(b)$.
- Given training data of observed states, find parameters $\theta$ that maximize the log-likelihood.
- We have incomplete data: observed states $D = y_{1:T}$ and unobserved states $x_{1:T}$.
- We have $\hat{\theta} = \arg\max_{\theta} l(\theta) = \arg\max_{\theta} \log P(y_{1:T} | \theta)$.
- This is $\arg\max_\theta \log \sum_{x_{1:T}} P(y_{1:T}, x_{1:T} | \theta)$, since the hidden states must be summed out.
- The objective is called the incomplete log-likelihood.
- Can be optimized with EM algorithm, like we did for GMMs. 
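
Although the objective sums over all $K^T$ hidden paths, it can be evaluated with the same dynamic-programming idea used for the Markov-chain marginals, replacing the max in Viterbi with a sum. This recursion is commonly called the forward algorithm; it is not covered in these notes, so the sketch below is a supplementary assumption, as are the example HMM and the name `log_likelihood`.

```python
import math

# Hypothetical 2-state, 2-symbol HMM; Q[k][l] = P(X_t = k | X_{t-1} = l).
pi = [0.6, 0.4]
Q = [[0.7, 0.4],
     [0.3, 0.6]]
E = [[0.9, 0.1],
     [0.2, 0.8]]


def log_likelihood(ys, pi, Q, E):
    """log P(y_{1:T} | theta), summing out the hidden states in O(K^2 T)."""
    K = len(pi)
    # alpha[k] = P(y_1..y_t, X_t = k), starting at t = 1.
    alpha = [pi[k] * E[k][ys[0]] for k in range(K)]
    for y in ys[1:]:
        alpha = [E[k][y] * sum(Q[k][l] * alpha[l] for l in range(K))
                 for k in range(K)]
    return math.log(sum(alpha))


ll = log_likelihood([0, 1, 1], pi, Q, E)
```

An EM procedure for HMMs would evaluate this quantity between iterations to monitor progress, in the same way the incomplete log-likelihood is tracked for GMMs.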

### HMM Summary
- Allows us to model dependencies. 
- Can efficiently perform computations using DP.
- Learn params with EM algorithm. 
- Pretty common model. 


