# Generative models

This tutorial will show how to use generative models to solve:
- Question answering
- Dialogue generation
However, the same principles are applicable to

Yay we made it!

## Generative Question Answering (QA)

Up to now we have seen how to retrieve relevant passages that may contain the answer to a question.
- What if the answer is not written explicitly in the passage?
- What if the passage is too long to read (and our lives are too dynamic to spend more than one minute reading)?

We can train a model to output the answer to a question given:
- a relevant passage (and we know how to gather relevant documents)
- the question (yes, the question contains relevant information to answer the question)

... Or we can put a pre-trained model on top of our retrieval pipeline.

### QA data preparation

In this section we will be using the [WikiQA](https://aclanthology.org/D15-1237/) data set.
It's a data set for open-domain QA; it was designed for answer sentence selection, but we can use its questions and answers for generative QA.

It's available via the HuggingFace `datasets` package, so let's install it.

In [None]:
!pip install datasets

Let's download the validation split of the WikiQA data set

In [None]:
import datasets

wiki_qa = datasets.load_dataset('wiki_qa', split='validation')
wiki_qa[:10]

### Knowledge preparation

We have the questions (and the target answers), now we need to prepare our knowledge source.


### Answering a question

### Putting it all together

We can finally set up an entire question answering pipeline:
- We have the knowledge
- We have the retrieval system
    - We also have the re-ranking system
- We have the answering system

Let's define a function that puts everything together and 

In [None]:
# TODO

Try it out

In [None]:
question = ""

## Generative chatbots

Generative chatbot

### Pretrained models

To start, let's play around with a pre-trained model.
We can load the [DialoGPT](https://arxiv.org/abs/1911.00536) chatbot, a fine-tuning of GPT-2 trained on large collections of conversations crawled from Reddit.

We can start by looking at different ways to decode (generate) responses using this autoregressive model.
What we want to do is use the output probability distribution to select, one at a time, the tokens composing a response.
Ideally we would select the most probable sequence overall, but searching over all possible sequences is not feasible.

Let's proceed step-by-step.
First of all, we get the model and the tokeniser.

#### How does it work?

First we need to understand how to provide data to our model.

Up to a couple of years ago, the standard approach to present the input to these models was to separate each utterance with an `end-of-sequence` token.
The model would generate an answer and stop as soon as the `end-of-sequence` token was generated.

```
"<|endoftext|>Summer loving had me a blast<|endoftext|>Summer loving happened so fast<|endoftext|>I met a girl crazy for me<|endoftext|>Met a boy cute as can be<|endoftext|>"
```

Nowadays the approach is to have an uninterrupted stream of text, like a movie script:

```
"
A: Hello.
B: Is it me you're looking for?
A: I can see it in your eyes...
B: I can see it in your smile!
"
```

DialoGPT uses the `end-of-sequence` token.

In [None]:
context = [
    "Hello, how are you?",
    "I'm fine thanks, how about you?"
]

Encode input
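
Before tokenisation, the context has to be flattened into a single string. The following is a minimal sketch of the two input formats described above, not DialoGPT's official preprocessing. It assumes that the `end-of-sequence` string is `<|endoftext|>` (the one used by GPT-2) and that every utterance is closed by it. The helper names `format_with_eos` and `format_as_script` are hypothetical, and the context is repeated so that the block runs on its own.

```python
# Assumption: GPT-2's end-of-sequence string, which DialoGPT reuses.
EOS = "<|endoftext|>"


def format_with_eos(utterances, eos=EOS):
    """Concatenate the turns, closing each one with the end-of-sequence token."""
    return "".join(utterance + eos for utterance in utterances)


def format_as_script(utterances, speakers=("A", "B")):
    """Render the turns as a script, alternating between the speaker labels."""
    lines = [
        f"{speakers[i % len(speakers)]}: {utterance}"
        for i, utterance in enumerate(utterances)
    ]
    return "\n".join(lines)


context = [
    "Hello, how are you?",
    "I'm fine thanks, how about you?",
]

eos_input = format_with_eos(context)      # the format DialoGPT expects
script_input = format_as_script(context)  # the "movie script" format
print(eos_input)
print(script_input)
```

The resulting string is what would be passed to the tokeniser to obtain the input token IDs.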

If we run the sequence through the model, we get a series of logits as output.
Since we are using an autoregressive model, the rightmost position holds the logits of the next token.

We can run these logits through a $\mathrm{softmax}(\cdot)$ and obtain the probability distribution over tokens:
- For each possible token we have the probability of it being the next in the sequence
- We can sample a token from this probability distribution and feed it back as input to get a new token
- We can iterate this process to compose a response
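
A single step of this process can be sketched in plain Python. The vocabulary and the logits below are made-up toy values standing in for the model output at the rightmost position; a real model would score tens of thousands of tokens.

```python
import math
import random


def softmax(logits):
    """Turn a list of logits into a probability distribution."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical toy vocabulary and next-token logits (not real model output).
vocab = ["I", "am", "fine", "<|endoftext|>"]
next_token_logits = [2.0, 0.5, 1.0, -1.0]

probs = softmax(next_token_logits)

# Sample the next token according to its probability.
rng = random.Random(0)  # fixed seed, only to make the sketch repeatable
next_token = rng.choices(vocab, weights=probs, k=1)[0]
print([round(p, 3) for p in probs], next_token)
```

To compose a whole response, the sampled token would be appended to the input and the step repeated until the `end-of-sequence` token is drawn.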

#### Deterministic decoding

Deterministic approaches always yield the same output for a given input.

##### Greedy decoding

The most straightforward way is to pick the most probable token at each step and feed it back as input for the next step.
This is a very suboptimal solution: it usually yields dull responses like `"I don't know"` or causes degenerate generation (e.g., repeating the same token many times).
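
The greedy loop can be sketched with a toy stand-in for the language model. Everything in `TOY_LOGITS` is invented for illustration: the scores depend only on the last token, whereas DialoGPT conditions on the whole sequence. Since the softmax preserves the ordering of the logits, taking the arg-max of the logits is enough.

```python
EOS = "<|endoftext|>"
VOCAB = ["I", "don't", "know", EOS]

# Hypothetical toy "model": next-token logits given the last token only.
TOY_LOGITS = {
    "<start>": [3.0, 0.1, 0.2, 0.0],
    "I": [0.0, 2.5, 1.0, 0.1],
    "don't": [0.1, 0.0, 3.0, 0.2],
    "know": [0.3, 0.1, 0.0, 2.0],
}


def next_token_logits(prefix):
    """Return the toy logits for the token following the given prefix."""
    last = prefix[-1] if prefix else "<start>"
    return TOY_LOGITS[last]


def greedy_decode(max_new_tokens=10):
    """Pick the highest-scoring token at every step until EOS is produced."""
    response = []
    for _ in range(max_new_tokens):
        logits = next_token_logits(response)
        best = max(range(len(VOCAB)), key=lambda i: logits[i])
        if VOCAB[best] == EOS:
            break
        response.append(VOCAB[best])
    return response


greedy_response = greedy_decode()
print(" ".join(greedy_response))
```

Calling `greedy_decode()` again returns exactly the same tokens, which is what makes the approach deterministic.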

##### Beam search

We cannot do an exhaustive search, but we can keep the top $n$ most probable sequences generated so far.
This is what beam search does.
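
Below is a minimal sketch of beam search over a toy model, under simplifying assumptions: the probabilities in `TOY_PROBS` are invented, they depend only on the last token, and hypotheses are ranked by their raw cumulative log-probability with no length normalisation. With a single beam the procedure reduces to greedy decoding, which makes the two easy to compare.

```python
import math

EOS = "<eos>"
VOCAB = ["A", "B", EOS]

# Hypothetical toy "model": next-token probabilities given the last token only.
TOY_PROBS = {
    "<start>": [0.5, 0.4, 0.1],
    "A": [0.3, 0.3, 0.4],
    "B": [0.05, 0.05, 0.9],
}


def next_token_log_probs(prefix):
    """Return the toy log-probabilities for the token following the prefix."""
    last = prefix[-1] if prefix else "<start>"
    return [math.log(p) for p in TOY_PROBS[last]]


def beam_search(n_beams=2, max_new_tokens=5):
    """Return the surviving hypotheses, most probable first."""
    # Each hypothesis is (tokens, cumulative log-probability, finished flag).
    beams = [([], 0.0, False)]
    for _ in range(max_new_tokens):
        candidates = []
        for tokens, score, finished in beams:
            if finished:
                candidates.append((tokens, score, True))
                continue
            for token, log_p in zip(VOCAB, next_token_log_probs(tokens)):
                if token == EOS:
                    candidates.append((tokens, score + log_p, True))
                else:
                    candidates.append((tokens + [token], score + log_p, False))
        # Keep only the n most probable hypotheses so far.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:n_beams]
        if all(finished for _, _, finished in beams):
            break
    return beams


best_tokens, best_score, _ = beam_search(n_beams=2)[0]
greedy_tokens, greedy_score, _ = beam_search(n_beams=1)[0]
print(best_tokens, round(math.exp(best_score), 2))
print(greedy_tokens, round(math.exp(greedy_score), 2))
```

In this toy setting the single-beam (greedy) run commits to `A` because it is the most probable first token, while two beams are enough to find the sequence starting with `B`, which is more probable overall.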

#### Sampling

Sampling-based decoding adds more spice to the output by sampling the next token according to the probability distribution given by the language model.
The nice thing is that given the same input the generated content may change (higher diversity in the text of responses); the bad thing is that given the same input the generated content may change (possibly inconsistent behaviour).

##### Temperature rescoring

Divide the logits by a value $\tau > 0$ before applying the softmax:
- if $\tau > 1$ (high temperature) the distribution gets softer (reduces the probability of the most probable tokens and increases that of the least probable)
- if $\tau = 1$ the distribution is unchanged
- if $\tau < 1$ (low temperature) the distribution gets sharper (increases the probability of the most probable tokens and reduces that of the least probable)
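
The effect of the temperature can be checked on a small example. The logits below are arbitrary toy values and `softmax_with_temperature` is a hypothetical helper name.

```python
import math


def softmax_with_temperature(logits, tau=1.0):
    """Divide the logits by tau, then apply a softmax."""
    scaled = [x / tau for x in logits]
    shift = max(scaled)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(x - shift) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # hypothetical next-token logits

soft = softmax_with_temperature(logits, tau=2.0)   # high temperature
same = softmax_with_temperature(logits, tau=1.0)   # plain softmax
sharp = softmax_with_temperature(logits, tau=0.5)  # low temperature

for name, dist in [("tau=2.0", soft), ("tau=1.0", same), ("tau=0.5", sharp)]:
    print(name, [round(p, 3) for p in dist])
```

The probability of the top token grows as $\tau$ decreases, and sampling from a very sharp distribution behaves almost like greedy decoding.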


##### Top-k

Consider only the $k$ most probable tokens and zero out the probabilities of the others.
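
A minimal sketch of the filtering step on a toy distribution follows. Renormalising the surviving probabilities so that they sum to one again is an assumption of this sketch; `top_k_filter` is a hypothetical helper name.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens, zero out the rest and renormalise."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order[:k])
    kept_mass = sum(probs[i] for i in keep)
    return [p / kept_mass if i in keep else 0.0 for i, p in enumerate(probs)]


probs = [0.5, 0.25, 0.15, 0.1]  # hypothetical next-token probabilities
filtered = top_k_filter(probs, k=2)
print([round(p, 3) for p in filtered])
```

The next token is then sampled from the filtered distribution instead of the original one.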

##### Top-p (nucleus sampling)

Consider only the smallest set of most probable tokens whose probabilities sum up to at least $p \in [0, 1]$ and zero out the probabilities of the others.
Similar to top-$k$, but with a variable window size.
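
The variable window can be seen by filtering a peaked and a flat toy distribution with the same $p$. As with the top-$k$ case, the renormalisation of the surviving probabilities is an assumption of this sketch and `top_p_filter` is a hypothetical helper name.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of most probable tokens whose mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = set(), 0.0
    for i in order:
        keep.add(i)
        mass += probs[i]
        if mass >= p:
            break
    # Zero out everything outside the nucleus and renormalise what is left.
    return [q / mass if i in keep else 0.0 for i, q in enumerate(probs)]


peaked = [0.9, 0.05, 0.03, 0.02]  # hypothetical: the model is confident
flat = [0.3, 0.3, 0.2, 0.2]       # hypothetical: the model is uncertain

peaked_filtered = top_p_filter(peaked, p=0.7)
flat_filtered = top_p_filter(flat, p=0.7)
print([round(q, 3) for q in peaked_filtered])
print([round(q, 3) for q in flat_filtered])
```

With the same $p$, a single token survives when the model is confident, while three tokens survive when the probability mass is spread out.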

##### Contrastive

### Fine tuning

#### Data preparation

#### Training

#### Testing

## ELIZA meets DialoGPT

In the 70s they made ELIZA and PARRY meet each other: https://www.theatlantic.com/technology/archive/2014/06/when-parry-met-eliza-a-ridiculous-chatbot-conversation-from-1972/372428/
We could have ELIZA meet ChatGPT, but since we are humble we will settle for DialoGPT.