disco  
Copyright (C) 2022-present NAVER Corp.  
Creative Commons Attribution-NonCommercial-ShareAlike 4.0 license  

# Quick introduction

This is only a brief introduction, taken from the [README](../README.MD). It skips some details and relies on toy use cases;
see the other notebooks in the tutorial folder for more depth and more use cases.

## Distributions

The generative model that you want to tune must be wrapped by a `Distribution` object.  For example, for a (causal or seq2seq) language model compatible with the 🤗 Hugging Face interface use an `LMDistribution`.

A valid `Distribution` must have the following two methods:
- `.sample(context)` that given an optional `context` on which the distribution can be conditioned, returns a list of samples from the underlying distribution and a tensor with their corresponding log-probabilities. 
- `.log_score(samples, context)` that given a list of samples and the `context` on which to condition the distribution, returns their corresponding log-probabilities.

In [None]:
from disco.distributions import LMDistribution

distribution = LMDistribution()

In [None]:
incipit = "It was a cold and stormy night"
samples, log_scores = distribution.sample(context=incipit)

In [None]:
distribution.log_score(samples, context=incipit)

`LMDistribution` generates samples of the `TextSample` type, which are named tuples with `text` and `token_ids` fields.
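
To make the two-method interface concrete, here is a minimal sketch of an object that satisfies it. Everything in it is hypothetical: `ToyDistribution` is not part of disco, it works over a few fixed strings instead of a language model, and it returns plain Python lists where disco returns tensors.

```python
import math
import random


class ToyDistribution:
    """Hypothetical categorical distribution over a handful of fixed strings."""

    def __init__(self, probabilities, seed=0):
        self.probabilities = dict(probabilities)
        self.rng = random.Random(seed)

    def log_score(self, samples, context=None):
        # This toy ignores the context; a language model would condition on it
        return [math.log(self.probabilities[s]) for s in samples]

    def sample(self, context=None, sampling_size=4):
        items = list(self.probabilities)
        weights = [self.probabilities[item] for item in items]
        samples = self.rng.choices(items, weights=weights, k=sampling_size)
        # Return the samples together with their log-probabilities
        return samples, self.log_score(samples, context)


toy = ToyDistribution({"rain": 0.5, "wind": 0.25, "snow": 0.25})
toy_samples, toy_log_scores = toy.sample(context="It was a cold and stormy night")
```

The point of the sketch is only that `.sample` and `.log_score` agree with each other: scoring the samples again gives back the log-probabilities that were returned at sampling time.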

### Features

Features are represented by an object with the method
- `.score(samples, context)` which, given a list of samples and an optional context, returns a tensor of real-valued scores.

A convenient way to define one is the `Scorer` class, which accepts a function (or a lambda) that takes a sample and a context, and vectorizes it. For example, we can compute the effective length of a GPT-2 text sample by finding the EOS token:

In [None]:
from disco.scorers.scorer import Scorer

sequence_length = Scorer(lambda s, c: s.text.index("<|endoftext|>"))

where `s` is the sample (assumed to be a `TextSample`) and `c` is an optional context.
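
As a rough illustration of what "vectorizes" means here, the sketch below applies a per-sample function to a whole batch. `ToyScorer` and the local `TextSample` are hypothetical stand-ins (only the field names come from the text above), and the result is a list of floats and not a tensor. It also falls back to the full text length when the EOS marker is absent, since `str.index` raises `ValueError` in that case.

```python
from collections import namedtuple

# Stand-in for disco's TextSample; only the field names are taken from the text
TextSample = namedtuple("TextSample", ["token_ids", "text"])


class ToyScorer:
    """Hypothetical wrapper that applies a per-sample function to a batch."""

    def __init__(self, scoring_function):
        self.scoring_function = scoring_function

    def score(self, samples, context=None):
        # One real-valued score per sample, in the same order as the input
        return [float(self.scoring_function(s, context)) for s in samples]


EOS = "<|endoftext|>"
# Use the position of the EOS marker, or the full length if there is none
effective_length = ToyScorer(
    lambda s, c: s.text.index(EOS) if EOS in s.text else len(s.text)
)

batch = [
    TextSample(token_ids=[], text="Hello there" + EOS + "ignored"),
    TextSample(token_ids=[], text="No marker"),
]
lengths = effective_length.score(batch)
```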

#### Boolean Features

An important class of features is that of *boolean* features. While general features can only be used to define *distributional* constraints, boolean features can also be used to define *pointwise* constraints (see below). To define one, we can use the `BooleanScorer` helper class, which takes a function as an argument.  
For example, we can score the presence of the string "amazing", as follows:

In [None]:
from disco.scorers.boolean_scorer import BooleanScorer

In [None]:
amazing = BooleanScorer(lambda s, c: "amazing" in s.text)

The `False`/`True` results from the lambda are cast to `0.0`/`1.0` float values so that they can be used in the definition of an energy-based model (EBM).

`BooleanScorer` belongs to the more general class of `PositiveScorer`s, which can be used to construct EBMs. The two main properties of a `PositiveScorer` are that it returns positive scores, and that it provides the method

 - `.log_score(samples, context)` that, given a list of samples and the `context` on which to condition the scorer, returns their corresponding log-scores (which are log-probabilities when the scorer is a distribution).

As a consequence, we can see that a `Distribution` is also a `PositiveScorer` that is able to sample as well.
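
The relationship between `.score` and `.log_score` for a boolean feature can be sketched as follows. `ToyBooleanScorer` is a hypothetical stand-in that works on plain strings and returns lists; mapping a score of `0.0` to a log-score of `-inf` is the mathematical convention assumed here, not a statement about disco's internals.

```python
import math


class ToyBooleanScorer:
    """Hypothetical stand-in that turns a predicate into a 0/1-valued scorer."""

    def __init__(self, predicate):
        self.predicate = predicate

    def score(self, samples, context=None):
        # True/False become 1.0/0.0
        return [float(bool(self.predicate(s, context))) for s in samples]

    def log_score(self, samples, context=None):
        # log(1.0) = 0.0, while log(0.0) is taken to be -inf,
        # which rules the sample out entirely
        return [
            math.log(v) if v > 0 else -math.inf
            for v in self.score(samples, context)
        ]


toy_amazing = ToyBooleanScorer(lambda s, c: "amazing" in s)
texts = ["an amazing night", "a dull night"]
toy_scores = toy_amazing.score(texts)
toy_log_scores = toy_amazing.log_score(texts)
```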

## Controlling Generation

### Expressing preferences in an EBM

We express preferences over the distribution by defining target moments for specific features. This results in a target distribution that matches the desired moments while minimizing the KL divergence to the original distribution. In other words, it incorporates the preferences while avoiding catastrophic forgetting. This distribution is represented as an EBM, which can be used to score samples (in other words, it is a `PositiveScorer`) but cannot be sampled from directly. We'll see how to sample below.

We can express either *pointwise* or *distributional* constraints on a distribution and compose them at will. The former expresses a (boolean) property that must apply to *all* sequences, whereas the latter represents properties at the distributional level.  

To obtain the target distribution that incorporates our constraints, we use the `constrain` method of the corresponding `Distribution`. This method takes a list of features and their corresponding target moments.

For example, we can define an EBM with a *pointwise* constraint requiring that all our samples must include "amazing" by setting the target moment to `1` on a `BooleanScorer`:

In [None]:
from disco.distributions.lm_distribution import LMDistribution

base = LMDistribution()
ebm = base.constrain([amazing], [1])
# ebm = base.constrain([amazing]) # would also work
# ebm = base * amazing # as well, using a product notation 

Or we can ask for a _distributional_ constraint requiring that _half_ of the samples include "amazing":

In [None]:
import os
# disabling parallelism to avoid deadlocks (see warning from HF's tokenizer)
os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [None]:
ebm = base.constrain([amazing], [1/2])
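
To see what the two kinds of constraint do to a distribution, here is a small worked example on a three-sentence toy space. It is a sketch of the underlying math and not of disco's implementation: the pointwise EBM is the base probability multiplied by the 0/1 feature, and the distributional EBM has the exponential-family form `a(x) * exp(lambda * b(x))`, which is the minimum-KL solution for a moment constraint. With a single binary feature the coefficient has a closed form, which is used below; with a real language model it would presumably have to be estimated from samples. All names are hypothetical.

```python
import math

# Toy base distribution a(x) over three sentences
toy_base = {"an amazing night": 0.1, "a dull night": 0.6, "a long night": 0.3}


def is_amazing(text):
    return 1.0 if "amazing" in text else 0.0


def normalize(unnormalized):
    z = sum(unnormalized.values())
    return {x: v / z for x, v in unnormalized.items()}


def moment(dist, feature):
    # Expected value of the feature under a normalized distribution
    return sum(p * feature(x) for x, p in dist.items())


# Pointwise constraint (target moment 1): P(x) = a(x) * b(x)
pointwise_ebm = {x: a * is_amazing(x) for x, a in toy_base.items()}

# Distributional constraint (target moment 1/2): P(x) = a(x) * exp(lam * b(x)),
# with lam chosen so that the normalized distribution has the target moment
target_moment = 0.5
m = moment(toy_base, is_amazing)
lam = math.log(target_moment * (1 - m) / (m * (1 - target_moment)))
distributional_ebm = {
    x: a * math.exp(lam * is_amazing(x)) for x, a in toy_base.items()
}

pointwise = normalize(pointwise_ebm)
distributional = normalize(distributional_ebm)
```

In the distributional case the two non-amazing sentences keep their original 2:1 ratio, which is the toy-scale version of staying as close as possible to the base model.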

### Approximating the target EBM

Given an EBM target distribution, we now want to train a model to approximate it so that we can use it to generate samples. In the _unconditional_ case, namely when a single fixed context is used in generation, we can use a `Tuner`, more specifically a `DPGTuner`, as follows.

In [None]:
from disco.tuners.dpg_tuner import DPGTuner

In [None]:
target_ebm = base.constrain([amazing], [1])

model = LMDistribution(freeze=False)
incipit = "It was a cold and stormy night"

tuner = DPGTuner(model, target_ebm, context=incipit, n_gradient_steps=4)
tuner.tune()

And we can sample _amazing_ sequences from the tuned model.

In [None]:
samples, log_scores = model.sample(context=incipit)
for s in samples:
    print(incipit + s.text)
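
For intuition about what tuning does, the following is a toy sketch of a DPG-style update on a three-item space, under the assumption that the model is a softmax over three logits and that samples are drawn from the current model (online training). Each sampled item contributes the gradient of its log-probability, weighted by the ratio of its unnormalized target score to its probability under the sampling distribution. This is a simplified illustration with hypothetical names and hyperparameters, not disco's implementation.

```python
import math
import random

rng = random.Random(0)

# Unnormalized target scores P(x) = a(x) * b(x): a(x) is the base probability
# and b(x) is 1.0 only for the two "amazing" items
base_probs = [0.1, 0.3, 0.6]
feature = [1.0, 1.0, 0.0]
target_scores = [a * b for a, b in zip(base_probs, feature)]
z = sum(target_scores)
target = [s / z for s in target_scores]  # normalized, used only for evaluation


def softmax(logits):
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward_kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# The policy starts as a copy of the base distribution
logits = [math.log(a) for a in base_probs]
kl_before = forward_kl(target, softmax(logits))

learning_rate, n_gradient_steps, n_samples_per_step = 0.5, 400, 64
for _ in range(n_gradient_steps):
    policy = softmax(logits)
    batch = rng.choices(range(len(policy)), weights=policy, k=n_samples_per_step)
    gradient = [0.0] * len(logits)
    for x in batch:
        weight = target_scores[x] / policy[x]  # importance weight P(x) / pi(x)
        for i in range(len(logits)):
            # Derivative of log pi(x) with respect to logit i for a softmax
            grad_log_prob = (1.0 if i == x else 0.0) - policy[i]
            gradient[i] += weight * grad_log_prob / n_samples_per_step
    logits = [v + learning_rate * g for v, g in zip(logits, gradient)]

kl_after = forward_kl(target, softmax(logits))
```

After the loop, the forward KL divergence from the target to the toy policy is much smaller than at the start, which is the quantity the tuner is meant to drive down.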

#### Tuning parameters

Important parameters of the `Tuner` include:
- `n_gradient_steps`: number of total gradient steps in the full tuning process;
- `n_samples_per_step`: total number of samples used in performing a gradient step (aka batch size);
- `scoring_size`: number of samples sent in a batch to the `.score` function. This parameter affects training speed or helps solve GPU memory errors, but does not affect final results;
- `sampling_size`: number of samples obtained from a single call to the `.sample` function. This parameter affects training speed or helps solve GPU memory errors, but does not affect final results;
- `features`: list of pairs (`name`, `feature`) so that the `feature` moments will be computed by importance sampling (and reported using the key given by `name`);
- `track_divergence_from_base`: set to True to track the reverse KL divergence from the original model (this requires an additional round of scoring the samples).

### Logging

The Tuner reports a number of metrics that are useful to monitor the training progress. A number of `Logger` classes are provided to keep track of these metrics. Basic logging is provided through the console, as follows:

In [None]:
from disco.tuners.loggers.console import ConsoleLogger

console_logger = ConsoleLogger(tuner)

However, more detailed statistics can be kept through the JSON, WandB, or Neptune loggers:

In [None]:
from disco.tuners.loggers.json import JSONLogger
from disco.tuners.loggers.neptune import NeptuneLogger
from disco.tuners.loggers.wandb import WandBLogger


In [None]:
project = "example_project"
name = "run_01"
json_logger = JSONLogger(tuner, project, name)
neptune_logger = NeptuneLogger(tuner, project, name)
wandb_logger = WandBLogger(tuner, project, name)

where `project` and `name` refer to the project and run name, respectively.

#### Logged Metrics

Loggers store a number of metrics about the training process. Here we list a few of the most relevant ones:

-  `kl_target_model` and `kl_target_proposal`: estimates of the forward KL divergence to the target EBM from the tuned model and the proposal distribution, respectively. With online training, the two are equivalent. Note that `kl_target_model` is the metric being optimized, not the value reported as `loss`;
-  `kl_model_base` and `kl_proposal_base`: estimates of the reverse KL divergence to the original model from the tuned model and the proposal distribution, respectively (only reported if `track_divergence_from_base` is set to True);
-  Feature moments: estimate of the features' moments for those features specified with the `features` parameter at the Tuner's construction time.

## Controlled Conditional Generation

The _conditional_ case is superficially very similar, with an extra step needed to instantiate a `ContextDistribution`, which allows sampling contexts that can then be used to condition the model. Furthermore, we use the more general `CDPGTuner` class.

Assuming we have a file `data/incipits.txt` containing one incipit per line, we could do:

In [None]:
from disco.tuners.cdpg_tuner import CDPGTuner
from disco.distributions.context_distribution import ContextDistribution

In [None]:
target_ebm = base.constrain([amazing], [1])

model = LMDistribution(freeze=False)

tuner = CDPGTuner(
    model,
    target_ebm,
    n_gradient_steps=4,  # use a much higher value for actual tuning
    context_distribution=ContextDistribution("data/incipits.txt"),
    context_sampling_size=2**3,
)
tuner.tune()

Note that while we have used a decoder-only model here for illustrative purposes, the real power of the `CDPGTuner` is that it allows controlling _seq2seq models_ such as those used in NMT, summarization, etc. Please refer to the dedicated [tutorial notebook](tutorials/4.conditional_tuning.ipynb) for an example of how to control an actual conditional model.

### Monte-Carlo sampling to improve the approximation

After tuning, `model` is a better approximation to the target EBM, but it is not guaranteed to match this distribution perfectly. While further training can improve the situation, an alternative is [quasi-rejection sampling (QRS)](https://disco.europe.naverlabs.com/QRS/), a Monte-Carlo sampling technique that allows trading off sampling efficiency for higher fidelity to the target distribution: a higher value of `beta` yields better fidelity, although at a higher computational cost.

In [None]:
from disco.samplers.quasi_rejection_sampler import QuasiRejectionSampler

In [None]:
beta=0.5
sampler = QuasiRejectionSampler(target_ebm, model, beta=beta)
samples, log_scores = sampler.sample(sampling_size=2**7)
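
The trade-off controlled by `beta` can be seen on a toy example. The sketch below assumes the usual QRS acceptance rule, in which a sample `x` drawn from the proposal `q` is kept with probability `min(1, P(x) / (beta * q(x)))`, where `P` is the unnormalized target score. The three-item space, the numbers, and the function names are all hypothetical, and the exact distribution of accepted samples is computed by enumeration, which is only possible because the space is tiny.

```python
import random


def acceptance_probability(target_score, proposal_probability, beta):
    # Keep x with probability min(1, P(x) / (beta * q(x)))
    return min(1.0, target_score / (beta * proposal_probability))


def qrs_distribution(target_scores, proposal, beta):
    """Exact distribution of accepted samples and overall acceptance rate."""
    kept = [
        q * acceptance_probability(p, q, beta)
        for p, q in zip(target_scores, proposal)
    ]
    acceptance_rate = sum(kept)
    return [k / acceptance_rate for k in kept], acceptance_rate


def qrs_sample(items, target_scores, proposal, beta, n_samples, rng):
    accepted = []
    while len(accepted) < n_samples:
        x = rng.choices(range(len(items)), weights=proposal, k=1)[0]
        if rng.random() < acceptance_probability(target_scores[x], proposal[x], beta):
            accepted.append(items[x])
    return accepted


items = ["an amazing night", "amazing weather", "a dull night"]
toy_target_scores = [0.1, 0.3, 0.0]  # unnormalized P(x); normalized: 0.25, 0.75, 0
toy_proposal = [0.2, 0.5, 0.3]       # q(x), standing in for the tuned model

low_beta, low_rate = qrs_distribution(toy_target_scores, toy_proposal, beta=0.5)
high_beta, high_rate = qrs_distribution(toy_target_scores, toy_proposal, beta=1.0)
drawn = qrs_sample(items, toy_target_scores, toy_proposal, 1.0, 5, random.Random(0))
```

In this toy setting, `beta=1.0` is large enough to recover the normalized target exactly, at the price of a lower acceptance rate than `beta=0.5`, which keeps more samples but only approximates the target.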

### In summary

To put some of this (distributional constraint, tuning in the unconditional case and using QRS) together:

In [None]:
base = LMDistribution()
target_ebm = base.constrain([amazing], [1/2])

model = LMDistribution(freeze=False)  # unfrozen so that its weights can be tuned

tuner = DPGTuner(model, target_ebm)
tuner.tune()

beta=0.5
sampler = QuasiRejectionSampler(target_ebm, model, beta=beta)
samples, log_scores = sampler.sample(context=incipit, sampling_size=2**7)