### Introspective Classifier Learning: Empower Generatively

**Authors:** L. Jin, J. Lazarow, Z. Tu  
**Link:** https://arxiv.org/pdf/1704.07816.pdf  

---

**Contributions**
- Proposed `Introspective Classifier Learning` (ICL) Framework - A single model that is simultaneously discriminative and generative
- Studied how the generative aspect of the model benefits its own discriminative training
- Developed an efficient sampling procedure to synthesize new data (from scratch) from a discriminative classifier
- Developed *Reclassification-By-Synthesis* algorithm to iteratively augment negative samples and update the classifier.
- Proposed a formulation to train a *multi-class classifier* on training set and augmented samples.

---

**Improving Classifier Performance**
- Using more data (hard examples) to train the classifier (a common way)
- Bootstrapping, active learning, semi-supervised learning
- Data augmentation
- *The above approaches use data that is already present in the training set, or that is created by humans or separate algorithms*; they rely on the given positive samples

---

**Generative-Discriminative Modeling Concept** (Tu, 2007)
- A generative model can be successfully modeled by learning a sequence of discriminative classifiers via self-generated samples called `pseudo-negatives`
- `New samples that pass the learned classifier are considered as a new set of pseudo-negatives in the next round`

---

**ICL Advantages compared to other approaches**
- Convolutional - Automatic feature learning (Vs features pre-selected manually)
- More efficient learning process
- More efficient sampling process (Vs time consuming MCMC simulations)
- Simplicity of having a single classifier as opposed to having a sequence of boosting classifiers

---

- `Reference distribution` - Generates first batch of pseudo-negatives
- Discriminative classifier is trained to separate the given input data and `pseudo-negatives`
- Algorithm repeats until `pseudo-negatives` are no longer distinguishable from the input data

- Discriminative classifier computes the probability of $\mathbf x$ being positive or negative, i.e. compute: $p(y|\mathbf x)$

- Probabilities should sum to 1, i.e. $p(y=+1|\mathbf x) + p(y=-1|\mathbf x) = 1$

- Generative model models $p(y,\mathbf x) = p(\mathbf x|y)p(y)$. This captures the underlying generation process of $\mathbf x$ for class $y$
	
- Binary classification: Positive samples are of primary interest. Using Bayes theorem and assuming equal priors $p(y=+1)=p(y=-1)$: 

	$$p(\mathbf x|y=+1) = \frac{p(y=+1|\mathbf x)}{p(y=-1|\mathbf x)}p(\mathbf x|y=-1)$$
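
The identity above follows from writing Bayes' theorem once per class and dividing the two expressions. A short derivation, using only the symbols already defined in these notes:

$$p(\mathbf x|y=+1) = \frac{p(y=+1|\mathbf x)\,p(\mathbf x)}{p(y=+1)}, \qquad p(\mathbf x|y=-1) = \frac{p(y=-1|\mathbf x)\,p(\mathbf x)}{p(y=-1)}$$

$$\frac{p(\mathbf x|y=+1)}{p(\mathbf x|y=-1)} = \frac{p(y=+1|\mathbf x)}{p(y=-1|\mathbf x)}\cdot\frac{p(y=-1)}{p(y=+1)} = \frac{p(y=+1|\mathbf x)}{p(y=-1|\mathbf x)}$$

The marginal $p(\mathbf x)$ cancels in the ratio, and the prior ratio equals 1 under the equal-priors assumption. Multiplying both sides by $p(\mathbf x|y=-1)$ gives the formula.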

---

**Input** - Input data has an underlying distribution we wish to learn

![](images/icl-data.png)

- Goal is to gradually learn $p_t(\mathbf x|y=-1)$ such that samples drawn from it become indistinguishable from the given positive samples, i.e. $p_t(\mathbf x|y=-1)\overset{t\to\infty}{\longrightarrow}p(\mathbf x|y=+1)$

![](images/icl-goal.png)

---

**Initialization**

![](images/icl-init.png)

---

**Synthesis-Reclassification Loop**

**For $t=0, 1, ..., T$**

Update **Generative** Model: 

$p_t(\mathbf x|y=-1) = \frac{1}{Z_t}\frac{q_t(y=+1|\mathbf x)}{q_t(y=-1|\mathbf x)}p_r(\mathbf x|y=-1)$ 

- Where 
$Z_t = \int \frac{q_t(y=+1|\mathbf x)}{q_t(y=-1|\mathbf x)}p_r(\mathbf x|y=-1)\,d\mathbf x$

- $Z_t$ is the partition function (also known as the normalizing constant). An unnormalized probability distribution $\tilde{p}(x)$ is guaranteed to be non-negative everywhere, but it is not guaranteed to sum or integrate to 1. To obtain a valid probability distribution, it must be normalized: $p(x) = \frac{1}{Z}\tilde{p}(x)$

- $Z$ is the integral (or sum) of $\tilde{p}(x)$ over all possible states $x$, i.e. $Z = \int \tilde{p}(x)\,dx$. It is often intractable to compute, so one resorts to approximation.
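
As a small illustration of normalization, the sketch below approximates $Z$ numerically for a one-dimensional unnormalized density. The density `exp(-x^2 / 2)`, the integration range, and all function names are assumptions chosen for illustration; in one dimension the integral is easy, whereas for images it is the intractable case described above.

```python
import math

def unnormalized_density(x):
    """Hypothetical unnormalized density p~(x) = exp(-x^2 / 2); non-negative everywhere."""
    return math.exp(-0.5 * x * x)

def integrate(f, lo, hi, steps=10_000):
    """Approximate the integral of f over [lo, hi] with the midpoint rule."""
    width = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * width) for i in range(steps)) * width

# Partition function: Z = integral of p~(x) dx (tails beyond +-10 are negligible here).
Z = integrate(unnormalized_density, -10.0, 10.0)

def density(x):
    """Normalized density p(x) = p~(x) / Z."""
    return unnormalized_density(x) / Z

total = integrate(density, -10.0, 10.0)
print(f"Z ~= {Z:.4f}, integral of normalized density ~= {total:.4f}")
```

For this particular choice, $Z$ should come out close to $\sqrt{2\pi}$, and the normalized density integrates to 1 up to numerical error.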
    
- **Synthesis Step**
    - Sample $l$ pseudo-negative samples 
    
    ![](images/l-samples.png)
    
    - $\mathbf x_i \sim p_t(\mathbf x|y=-1), i=n+tl+1, ..., n+tl+l$ from the current model $p_t(\mathbf x|y=-1)$ using `Variational Sampling Procedure (Stochastic Gradient on Input)`
    
    - The authors update a random sample $\mathbf x$ drawn from $p_r(\mathbf x|y=-1)$ by increasing $\frac{q_t(y=+1|\mathbf x; W_t)}{q_t(y=-1|\mathbf x; W_t)}$ using backpropagation.
    
- **Augment Pseudo-negative Set**

	![](images/pn-samples-t.png) 

    - $\mathbf S_{pn}^{t+1} = \mathbf S_{pn}^{t} \cup \{(\mathbf x_i, -1), i=n+tl+1, ..., n+tl+l\}$
    
	![](images/pn-samples-tt.png)

- **Reclassification Step**
    - Expand training set: $\mathbf S_{e}^{t+1} = \mathbf S \cup \mathbf S_{pn}^{t+1}$
    
    ![](images/e-set.png) 

    - Update CNN classifier to $\mathbf C^{t+1}$ on expanded set $\mathbf S_{e}^{t+1}$ to get $q_{t+1}(y=+1|\mathbf x)$
    
    ![](images/c-t.png)
    
- $t \leftarrow t+1$ until convergence *(e.g. no improvement on validation set)*
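
The loop above can be sketched on a one-dimensional toy problem. This is a minimal sketch under loud assumptions, not the authors' implementation: the CNN is replaced by logistic regression on a hypothetical fixed feature map `features`, the reference distribution is assumed to be a standard Gaussian, the synthesis step uses plain gradient ascent with clipping, and the loop runs for a fixed number of rounds instead of checking a validation set. The name `batch` plays the role of $l$.

```python
import math
import random

def features(x):
    """Hypothetical fixed feature map phi(x); the paper learns features with a CNN."""
    return [1.0, x, x * x]

def log_odds(w, x):
    """ln(q(y=+1|x) / q(y=-1|x)) under a logistic model with weights w."""
    return sum(wj * fj for wj, fj in zip(w, features(x)))

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train(w, data, epochs=200, lr=0.01):
    """Reclassification step: gradient ascent on the mean log-likelihood (labels +1 / -1)."""
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in data:
            err = (1.0 if y == 1 else 0.0) - sigmoid(log_odds(w, x))
            for j, fj in enumerate(features(x)):
                grad[j] += err * fj
        w = [wj + lr * gj / len(data) for wj, gj in zip(w, grad)]
    return w

def synthesize(w, rng, steps=20, step_size=0.05):
    """Synthesis step: start from the reference distribution, then ascend the log-odds in x."""
    x = rng.gauss(0.0, 1.0)                  # assumed reference distribution p_r
    for _ in range(steps):
        grad_x = w[1] + 2.0 * w[2] * x       # d/dx of w0 + w1*x + w2*x^2
        x = max(-5.0, min(5.0, x + step_size * grad_x))  # clip to keep the toy stable
    return x

rng = random.Random(0)
positives = [(rng.gauss(2.0, 0.5), +1) for _ in range(200)]       # training set S
batch, rounds = 50, 5
pseudo_negatives = [(rng.gauss(0.0, 1.0), -1) for _ in range(batch)]  # first batch from p_r
w = train([0.0, 0.0, 0.0], positives + pseudo_negatives)

for t in range(rounds):
    # Synthesis: draw `batch` new pseudo-negatives using the current classifier.
    pseudo_negatives += [(synthesize(w, rng), -1) for _ in range(batch)]
    # Reclassification: retrain on S united with the augmented pseudo-negative set.
    w = train(w, positives + pseudo_negatives)

last = [x for x, _ in pseudo_negatives[-batch:]]
print(f"{len(pseudo_negatives)} pseudo-negatives, last-batch mean = {sum(last) / len(last):.3f}")
```

The pseudo-negative set only ever grows: earlier batches are kept, so the classifier keeps seeing the regions it has already learned to reject.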

---

**Synthesis Process**

- Goal is to draw fair samples from 

$p_t(\mathbf x|y=-1) = \frac{1}{Z_t}\frac{q_t(y=+1|\mathbf x; W_t)}{q_t(y=-1|\mathbf x; W_t)}p_r(\mathbf x|y=-1)$

- Update a random sample $\mathbf x$ drawn from $p_r(\mathbf x|y=-1)$ by increasing $\frac{q_t(y=+1|\mathbf x; W_t)}{q_t(y=-1|\mathbf x; W_t)}$ using backpropagation.

- Partition function $Z_t$ is a constant and not dependent on the sample $\mathbf x$
- Let $g_t(\mathbf x) = \frac{q_t(y=+1|\mathbf x; W_t)}{q_t(y=-1|\mathbf x; W_t)} = \exp\{\mathbf w_t^{(1)T}\cdot\phi(\mathbf x; \mathbf w_t^{(0)})\}$, which is essentially the odds of $\mathbf x$ being positive. Taking the *natural log* of both sides: $\ln(g_t(\mathbf x)) = \mathbf w_t^{(1)T}\cdot\phi(\mathbf x; \mathbf w_t^{(0)})$

- Starting from $\mathbf x$ drawn from $p_r(\mathbf x|y=-1)$, the authors directly increase $\mathbf w_t^{(1)T}\cdot\phi(\mathbf x; \mathbf w_t^{(0)})$ using stochastic gradient **ascent** on $\mathbf x$ via backpropagation, which allows them to obtain fair samples subject to $p_t(\mathbf x|y=-1) = \frac{1}{Z_t}\frac{q_t(y=+1|\mathbf x; W_t)}{q_t(y=-1|\mathbf x; W_t)}p_r(\mathbf x|y=-1)$
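
To see why the partition function can be ignored during synthesis, it helps to take the log of the model and differentiate with respect to $\mathbf x$. The step size $\lambda$ below is notation introduced here for illustration, not a symbol from the notes:

$$\ln p_t(\mathbf x|y=-1) = \ln g_t(\mathbf x) + \ln p_r(\mathbf x|y=-1) - \ln Z_t$$

$$\nabla_{\mathbf x}\ln p_t(\mathbf x|y=-1) = \nabla_{\mathbf x}\left[\mathbf w_t^{(1)T}\cdot\phi(\mathbf x; \mathbf w_t^{(0)})\right] + \nabla_{\mathbf x}\ln p_r(\mathbf x|y=-1)$$

Since $\ln Z_t$ does not depend on $\mathbf x$, it vanishes from the gradient. A single ascent step on the classifier term, as described above, then has the form

$$\mathbf x \leftarrow \mathbf x + \lambda\,\nabla_{\mathbf x}\left[\mathbf w_t^{(1)T}\cdot\phi(\mathbf x; \mathbf w_t^{(0)})\right]$$

with the weights $W_t$ held fixed; the reference term enters through the choice of starting point $\mathbf x \sim p_r(\mathbf x|y=-1)$.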

---

**Synthesis Process Example**

- Draw **3** samples $\{\mathbf x_1, \mathbf x_2, \mathbf x_3\}$ targeting $p_t(\mathbf x|y=-1)$, each initialized from the reference distribution $p_r(\mathbf x|y=-1)$

  ![](images/samples-3.png)

- Update each of the **3** samples $\mathbf x_i$ by increasing $\mathbf w_t^{(1)T}\cdot\phi(\mathbf x_i; \mathbf w_t^{(0)})$ using `stochastic gradient ascent` on $\mathbf x_i$ via backpropagation as follows:

  ![](images/synth-input.png)
  
  ![](images/synth-grad.png)
  
  ![](images/synth-update.png)
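
The three stages pictured above (input, gradient, update) can be traced by hand on a tiny network. This is a hedged sketch: the weights `W0`, `B0`, `W1`, the `tanh` hidden layer, the two-dimensional inputs, and the step size are all hypothetical values chosen for illustration, standing in for the trained CNN and image inputs of the paper.

```python
import math

# Hypothetical fixed weights; in the paper these come from the trained CNN.
W0 = [[1.0, -0.5], [0.5, 1.0], [-1.0, 0.5]]   # w^(0): hidden weights (3 units, 2 inputs)
B0 = [0.0, 0.1, -0.1]                          # w^(0): hidden biases
W1 = [1.0, -0.5, 0.75]                         # w^(1): top-layer weights

def hidden(x):
    """phi(x; w^(0)): tanh activations of the hidden layer."""
    return [math.tanh(sum(wkj * xj for wkj, xj in zip(row, x)) + b)
            for row, b in zip(W0, B0)]

def log_odds(x):
    """w^(1)T . phi(x; w^(0)), the quantity being increased."""
    return sum(w * h for w, h in zip(W1, hidden(x)))

def grad_input(x):
    """Backpropagate to the input: d/dx_j = sum_k W1[k] * (1 - h_k^2) * W0[k][j]."""
    h = hidden(x)
    return [sum(W1[k] * (1.0 - h[k] ** 2) * W0[k][j] for k in range(len(W1)))
            for j in range(len(x))]

def ascent_step(x, step_size=0.1):
    """One gradient-ascent update on the input itself; the weights stay fixed."""
    return [xj + step_size * gj for xj, gj in zip(x, grad_input(x))]

samples = [[0.2, -0.1], [-0.5, 0.3], [1.0, 1.0]]   # stand-ins for x_1, x_2, x_3
updated = [ascent_step(x) for x in samples]

for x, x_new in zip(samples, updated):
    print(f"log-odds {log_odds(x):+.4f} -> {log_odds(x_new):+.4f}")
```

Only the inputs change in this step. Repeating it moves each sample toward regions the classifier scores as more positive, which is what turns a reference sample into a pseudo-negative.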
  
---

NOTE: See the paper for the derivation and details of the `Multi-class Classification Loss Function`

