# Midterm

## Technical Objectives 

The first half of the class had two high-level objectives: 

1) Internalize the formal language around probabilistic modeling in the context of deep learning for natural language processing. 
2) Practice building high-performing neural networks for core classes of NLP models, in preparation for research. 

The homeworks have primarily focused on (2). The midterm will focus **exclusively** on (1). 


In particular we have covered the following technical topics, which are fair game for the midterm: 

* Specification of generative models through *generative processes* and directed graphical models.
* Parameterization of distributions through features and *neural networks*. 
* Geometric representations of distributions through softmax and *simplex* representations. 
* Information-theoretic properties of discrete distributions, primarily *KL* but also entropy, cross-entropy. 
* Maximum-likelihood estimation (MLE) through *back-propagation*, particularly the chain rule applied to log-softmax.
* Familiarity with basic neural network structures, in particular *attention*.
* Mastery of *Naive Bayes* and *Softmax regression*: parameterization, class sizes, posterior inference, features, and the differences between the two. 
* Comprehension of the notation for *latent variables* and their usage, including MLE in the presence of latent variables.
* Understanding the *variational* formulation of the MLE objective in terms of ELBO and posterior gap. 
* Writing down the *EM* steps for clustering and understanding what each step is doing.
* Knowing the conditions under which EM is intractable and alternative variational approaches using simpler $q$.
* Using variational auto-encoders with neural $\rho$ as an alternative to EM.
* Conditions under which REINFORCE is used for backpropagation and the reasoning.
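
As a reference point for the variational items above, the decomposition of the log-likelihood into ELBO and posterior gap can be written as follows. The notation is an assumption made here for illustration: $x$ stands for the observed variables, $z$ for the latent variable, $\theta$ for the model parameters, and $q$ for any distribution over $z$.

$$
\log p(x; \theta) = \underbrace{\mathbb{E}_{z \sim q}\left[ \log \frac{p(x, z; \theta)}{q(z)} \right]}_{\text{ELBO}(\theta, q)} + \underbrace{\mathrm{KL}\left( q(z) \,\|\, p(z \mid x; \theta) \right)}_{\text{posterior gap}}
$$

Since the KL term is non-negative, the ELBO is a lower bound on $\log p(x; \theta)$ for every choice of $q$.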

# Midterm Practice

The following is roughly the form that the midterm will take. Mastery of the following questions will give a good foundation for the midterm itself. 



Consider the following natural language processing classification task. You are given a sentence $x_1 \ldots x_T$ of fixed length $T$, with each word drawn from a vocabulary $V$ of size $|V|$. We want to classify this sentence into a sentiment class $y \in \{ \text{positive, neutral, negative}\}$. However, we will assume that our sentences come from a broad set of different *domains*, in particular $z \in \{ \text{books, movies, music}\}$. We are given a dataset with $x, y$ observed but with $z$ unobserved; nevertheless, we believe it is important to model $z$ as part of a generative model of this data. 

1) Specify this model as a naive generative process and a directed graphical model. Assume the simplest parameterization of the model where there is a single parameter for each probability. How many learned parameters (big-O) does this model have?

2) In the parameterization for (1), there is an assumption that each word is generated in a completely different manner for each domain. Modify the above parameterization so that there is parameter sharing through *embeddings*. Argue that the model should take into account that "good" may be used similarly across different domains. How many learned parameters does this model have?

2b) This model will likely utilize a softmax when generating words. What might be the advantages or disadvantages of instead specifying these distributions with an argmax or a sparse max?
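
To make the three candidates in (2b) concrete, here is a minimal sketch in plain Python that maps one score vector through softmax, sparsemax, and a one-hot argmax. The function names and the example scores are assumptions chosen for illustration, and the sparsemax routine follows the standard sort-and-threshold projection onto the simplex. It is meant as a way to look at the outputs, not as a model answer.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability; every output is strictly positive.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sparsemax(scores):
    # Euclidean projection of the scores onto the probability simplex.
    # Sort descending, find the largest k with 1 + k * s_(k) > sum of the top-k scores,
    # then shift by the threshold tau and clip at zero.
    sorted_scores = sorted(scores, reverse=True)
    cumulative = 0.0
    tau = 0.0
    for k, s in enumerate(sorted_scores, start=1):
        cumulative += s
        if 1.0 + k * s > cumulative:
            tau = (cumulative - 1.0) / k
    return [max(s - tau, 0.0) for s in scores]

def argmax_one_hot(scores):
    # All mass on the single highest-scoring entry.
    best = max(range(len(scores)), key=lambda i: scores[i])
    return [1.0 if i == best else 0.0 for i in range(len(scores))]

# Hypothetical scores for three words.
scores = [1.0, 0.5, -1.0]
soft = softmax(scores)
sparse = sparsemax(scores)
hard = argmax_one_hot(scores)
```

All three outputs lie on the simplex, but only softmax keeps every entry strictly positive, which matters when taking the log-probability of an observed word.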

3) What is the MLE objective for fitting the learned parameters for (1) and (2)? Highlight which aspect of this objective makes this model difficult to fit in closed form. Write out the variational objective (ELBO + posterior gap) for this model. 

4) The variational objective is a function of the model parameters $\theta$ and the variational distribution $q$. Assume that we want to maximize the ELBO with respect to $q$. Solve in closed-form for $q$ in this case. Argue (with a diagram) why this maximizes the ELBO.
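
One way to sanity-check a derivation for (3) and (4) is to evaluate the quantities numerically on a toy example. The sketch below is only an illustration under stated assumptions: the joint probabilities are made-up numbers for a single fixed $(x, y)$ pair, and the helper names are hypothetical. It computes the log marginal by summing over the three domains, then compares the ELBO under a uniform $q$ and under the exact posterior.

```python
import math

def logsumexp(values):
    # Numerically stable log of a sum of exponentials.
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def elbo(q, log_joint):
    # E_q[log p(x, y, z) - log q(z)] over a discrete z.
    return sum(qz * (lj - math.log(qz)) for qz, lj in zip(q, log_joint) if qz > 0)

def kl(q, p):
    # KL(q || p) for discrete distributions with matching support.
    return sum(qz * (math.log(qz) - math.log(pz)) for qz, pz in zip(q, p) if qz > 0)

# Made-up values of log p(x, y, z) for z in {books, movies, music}.
log_joint = [math.log(0.02), math.log(0.05), math.log(0.01)]

# The sum over z sits inside the log, which is what blocks a closed-form MLE.
log_marginal = logsumexp(log_joint)
posterior = [math.exp(lj - log_marginal) for lj in log_joint]

uniform_q = [1.0 / 3.0] * 3
gap_uniform = log_marginal - elbo(uniform_q, log_joint)
gap_posterior = log_marginal - elbo(posterior, log_joint)
```

In this toy setting the gap under the uniform $q$ matches `kl(uniform_q, posterior)`, and the gap under the exact posterior is zero up to floating-point error.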

5) Now assume that we want to maximize the ELBO with respect to the model parameters $\theta$ for model (2). Write down the objective directly. How might we optimize this in PyTorch? Give the computational complexity for computing this objective. 

6) What if instead of 3 domains, the model had 100? How does the run-time change? What approach might we take instead to speed up training? Specify the complete derivation and where any approximations occur. 

7) Instead of EM, we now want to utilize a variational auto-encoder. Specify a form of VAE for this problem. Give the ELBO objective in terms of both $\theta$ and $q$. Draw a diagram of the gradients in terms of the variational parameters and specify where approximations occur. 