<a href="https://colab.research.google.com/github/annalisad98/BERT-Text-Generator-and-QA-Model/blob/main/MLDLproject.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**PROJECT: NEURAL TEXT GENERATOR**


bert-babble script (colab) + evaluation (github)

# **INTRODUCTORY PART**

In [None]:
!pip3 install pytorch_pretrained_bert

This command installs the pytorch_pretrained_bert package, which provides the BERT classes used in the rest of the notebook.

In [None]:
import numpy as np
import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel, BertForMaskedLM

In [None]:
#help(BertForMaskedLM)

BertTokenizer: perform end-to-end tokenization, i.e. basic tokenization followed by WordPiece tokenization.

BertModel: raw BERT Transformer model (fully pre-trained).

BertForMaskedLM: BERT Transformer with the pre-trained masked language modeling head on top (fully pre-trained).

In [None]:
# Load pre-trained model (weights)
model_version = 'bert-base-uncased'  #'bert-large-uncased'
model = BertForMaskedLM.from_pretrained(model_version)
model.eval()
cuda = torch.cuda.is_available()
if cuda:
    model = model.cuda()

100%|██████████| 1248501532/1248501532 [00:31<00:00, 40086531.21B/s]


In [None]:
type(model)

In [None]:
cuda

from_pretrained: lets you instantiate a model, configuration or tokenizer from a pretrained version (the command above downloads and loads the pre-trained weights of 'bert-base-uncased').

The line model.eval() sets the model in evaluation mode, which deactivates the Dropout modules. This is IMPORTANT to obtain reproducible results during evaluation.

The last 3 lines of that cell move the model to the GPU if one is available. PyTorch can exploit the GPU, which offers a much higher level of parallelism than the CPU.

In [None]:
# Load pre-trained model tokenizer (vocabulary)
tokenizer = BertTokenizer.from_pretrained(model_version, do_lower_case=model_version.endswith("uncased"))

100%|██████████| 231508/231508 [00:00<00:00, 16698738.25B/s]


In [None]:
print(tokenizer)

<pytorch_pretrained_bert.tokenization.BertTokenizer object at 0x7f3acf131d90>


tokenizer is used for the tokenization of sentences/batches.

In [None]:
def tokenize_batch(batch):
    return [tokenizer.convert_tokens_to_ids(sent) for sent in batch]

The method convert_tokens_to_ids converts a sequence of tokens into a sequence of integer ids, using the vocabulary.
The function defined above applies it to a batch, i.e. to a list of sentences that are already split into tokens.
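As a rough illustration of the token/id mapping, here is a minimal sketch that uses a tiny hand-made vocabulary. The names toy_vocab, toy_tokenize_batch and toy_untokenize_batch and the id values are assumptions made for this example; BERT's real vocabulary is much larger and uses different ids.

```python
# Toy vocabulary (hypothetical ids, not BERT's real ones).
toy_vocab = {"[CLS]": 0, "[SEP]": 1, "[MASK]": 2, "hello": 3, "world": 4}
toy_ids_to_tokens = {idx: tok for tok, idx in toy_vocab.items()}

def toy_tokenize_batch(batch):
    """Map each sentence (a list of tokens) to a list of ids."""
    return [[toy_vocab[tok] for tok in sent] for sent in batch]

def toy_untokenize_batch(batch):
    """Map each list of ids back to a list of tokens."""
    return [[toy_ids_to_tokens[idx] for idx in sent] for sent in batch]

toy_batch = [["[CLS]", "hello", "world", "[SEP]"],
             ["[CLS]", "[MASK]", "world", "[SEP]"]]
toy_ids = toy_tokenize_batch(toy_batch)
print(toy_ids)  # [[0, 3, 4, 1], [0, 2, 4, 1]]
print(toy_untokenize_batch(toy_ids) == toy_batch)  # True
```

The round trip only works because every token is in the vocabulary; the toy version has no handling for unknown words.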

In [None]:
def untokenize_batch(batch):
    return [tokenizer.convert_ids_to_tokens(sent) for sent in batch]

The method convert_ids_to_tokens converts a sequence of ids into a sequence of tokens, using the vocabulary.
The function defined above applies it to a batch of id sequences, turning them back into tokens.

Ids stand for indices in the vocabulary, and tokens stand for word objects (words, word pieces, punctuation marks, ...).

In [None]:
def detokenize(sent):
    """ Roughly detokenizes (mainly undoes wordpiece) """
    new_sent = []
    for i, tok in enumerate(sent):
        if tok.startswith("##"):
            new_sent[len(new_sent) - 1] = new_sent[len(new_sent) - 1] + tok[2:]
        else:
            new_sent.append(tok)
    return new_sent

CLS = '[CLS]'
SEP = '[SEP]'
MASK = '[MASK]'
mask_id = tokenizer.convert_tokens_to_ids([MASK])[0]
sep_id = tokenizer.convert_tokens_to_ids([SEP])[0]
cls_id = tokenizer.convert_tokens_to_ids([CLS])[0]

new_sent starts as an empty list and is populated by the for loop.

At each step of the loop a token is added to new_sent. If a token starts with "##", it is the continuation of the previous word (WordPiece splits rare words into pieces), so it is concatenated to the last element instead of being appended.
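A small worked example may help to see the merging of word pieces. The function merge_wordpieces below is an illustrative copy of the same logic under a different name, so that it can run on its own; the extra guard for a leading "##" piece is an assumption added for the sketch and is not in the original detokenize.

```python
def merge_wordpieces(tokens):
    """Glue '##' continuation pieces onto the previous token."""
    merged = []
    for tok in tokens:
        if tok.startswith("##") and merged:
            # continuation piece: append it (without '##') to the last word
            merged[-1] += tok[2:]
        else:
            merged.append(tok)
    return merged

pieces = ["[CLS]", "the", "un", "##believ", "##able", "story", "[SEP]"]
print(merge_wordpieces(pieces))
# ['[CLS]', 'the', 'unbelievable', 'story', '[SEP]']
```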

Remember that BERT is pretrained on 2 tasks simultaneously: Masked Language Modeling and Next Sentence Prediction.

The Masked LM task is implemented by randomly selecting 15% of the tokens in every input sequence and training the model to predict them. This is why we introduced above an id for the [MASK] token.

For the Next Sentence Prediction task the goal is to decide whether one sentence follows another. To do this we need to mark the beginning of the sample (with [CLS]) and we need a special separator token ([SEP]), for example to separate questions from answers.
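The masking step can be sketched in a few lines of plain Python. This is a simplified illustration, not the pretraining code: the function name mask_tokens and its arguments are assumptions, and every selected token is replaced by [MASK], whereas the actual BERT recipe replaces only part of the selected tokens with [MASK] and leaves or randomizes the rest.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace each ordinary token by [MASK] with probability mask_prob.

    Returns the masked sequence and a dict {position: original token}
    holding the targets the model would have to predict.
    """
    rng = random.Random(seed)
    masked, targets = [], {}
    for pos, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]"):
            masked.append(tok)          # special tokens are never masked
        elif rng.random() < mask_prob:
            masked.append("[MASK]")
            targets[pos] = tok
        else:
            masked.append(tok)
    return masked, targets

sentence = ["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]"]
masked_sentence, masked_targets = mask_tokens(sentence)
print(masked_sentence)
print(masked_targets)
```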
__________
__________


___________
___________
# **GENERATION PART (BERT)**


## GENERATION: Functions for generation

The generation part follows, with all its related functions. The main function is generate.
It generates sentences using one of 3 possible modes:

    - parallel_sequential_generation

    - sequential_generation
    
    - parallel_generation

There are then some minor functions:

    - generate_step

    - get_init_text

    - printer (and with Github also read_sents and write_sents)

In [None]:
def generate_step(out, gen_idx, temperature=None, top_k=0, sample=False, return_list=True):
    """ Generate a word from out[gen_idx]
    
    args:
        - out (torch.Tensor): tensor of logits of size batch_size x seq_len x vocab_size
        - gen_idx (int): position in the sequence for which to generate a token
        - temperature (float): if not None, logits are divided by this value before sampling
        - top_k (int): if >0, only sample from the top k most probable words
        - sample (Bool): if True, sample from full distribution. Overridden by top_k
        - return_list (Bool): if True, return a Python list instead of a tensor
    """
    logits = out[:, gen_idx]
    # logits has shape batch_size x vocab_size: one row of scores per sentence.
    if temperature is not None:
        logits = logits / temperature
        # temperature rescales the logits and so smooths or sharpens the
        # next-word distribution: values above 1 push it towards uniform,
        # values below 1 make it more peaked, and 1 leaves it unchanged.
    if top_k > 0:# in this case we sample from the top k most probable words.
        kth_vals, kth_idx = logits.topk(top_k, dim=-1)
        # returns the k biggest entries of the input.
        dist = torch.distributions.categorical.Categorical(logits=kth_vals)
        # The distributions package contains parameterizable
        # probability distributions and sampling functions.
        idx = kth_idx.gather(dim=1, index=dist.sample().unsqueeze(-1)).squeeze(-1)
    elif sample:  # in this case we sample from the whole distribution.
        dist = torch.distributions.categorical.Categorical(logits=logits)
        # dist.sample() already has shape (batch_size,); squeezing it would
        # collapse a batch of size 1 into a scalar and break idxs[jj] later.
        idx = dist.sample()
    else:
        idx = torch.argmax(logits, dim=-1)
    return idx.tolist() if return_list else idx
  

The function generate_step is applied to generate a word.

It returns a list of indices if return_list is True, otherwise it returns the tensor idx. These indices identify words in the vocabulary (see the generate function part).

First the object "logits" is created; if a temperature is specified, the logits are divided by it, which makes the resulting distribution more uniform (temperature above 1) or more peaked (temperature below 1).

Then an if-elif-else block selects a word from the distribution defined by the logits. If top_k > 0, the distribution is built over the k most probable words only and a word is sampled from it (using .sample()). If instead sample is True, the distribution is built over all the logits. Finally, if neither option is set, no sampling takes place and the word with the highest logit (the argmax) is returned.
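To make the three branches concrete without needing the model, here is a sketch of the same decision logic for a single row of logits, written with the standard library only. The names softmax and pick_token are assumptions for this example; the notebook's generate_step works on whole batches of torch tensors instead.

```python
import math
import random

def softmax(logits):
    """Turn a list of logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pick_token(logits, temperature=None, top_k=0, sample=False, rng=random):
    """Pick one index: top-k sampling, full sampling, or argmax."""
    if temperature is not None:
        logits = [x / temperature for x in logits]
    if top_k > 0:
        # keep only the k highest-scoring indices
        order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        candidates = order[:top_k]
    elif sample:
        candidates = list(range(len(logits)))
    else:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax([logits[i] for i in candidates])
    return rng.choices(candidates, weights=probs, k=1)[0]

toy_logits = [2.0, 1.0, 0.1, -1.0]
print(pick_token(toy_logits))  # 0 (argmax, no sampling)
# A low temperature concentrates the mass on the best word, a high one spreads it.
print([round(p, 3) for p in softmax([x / 0.5 for x in toy_logits])])
print([round(p, 3) for p in softmax([x / 2.0 for x in toy_logits])])
```

With top_k=2 the sampled index can only be 0 or 1, since the other two entries are discarded before sampling.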

In [None]:
def get_init_text(seed_text, max_len, batch_size = 1, rand_init=False):
    """ Get initial sentence by padding seed_text with either masks or random words to max_len """
    # Builds the initial text: seed_text followed by max_len [MASK] tokens and
    # a closing [SEP]. Note that rand_init is accepted but not used here, so
    # the padding always consists of masks.
    # seed_text is the prefix from which generation starts (in the BERT
    # setting the best seed is [CLS], see the generate function).
    # max_len = number of positions to generate after seed_text.
    batch = [seed_text + [MASK] * max_len + [SEP] for _ in range(batch_size)]
    # one such sentence is built for each of the batch_size rows.

    # the tokens are converted to vocabulary ids before being returned.
    return tokenize_batch(batch)


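The function get_init_text builds the initial batch used by the generation functions: each sentence consists of seed_text followed by max_len [MASK] tokens and a closing [SEP], so it contains len(seed_text) + max_len + 1 tokens, and it is returned as a list of ids. Although the docstring mentions random words, the rand_init argument is not used in this version, so the padding is always made of masks.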
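The shape of the initial batch is easiest to see on tokens rather than ids. The helper init_token_batch below is a hypothetical stand-in that skips the final conversion to ids, so it runs without the tokenizer.

```python
MASK_TOKEN, SEP_TOKEN = "[MASK]", "[SEP]"

def init_token_batch(seed_tokens, max_len, batch_size=1):
    """Seed tokens, then max_len masks, then a closing [SEP], once per row."""
    return [seed_tokens + [MASK_TOKEN] * max_len + [SEP_TOKEN]
            for _ in range(batch_size)]

init_batch = init_token_batch(["[CLS]"], max_len=3, batch_size=2)
print(init_batch[0])       # ['[CLS]', '[MASK]', '[MASK]', '[MASK]', '[SEP]']
print(len(init_batch[0]))  # 5, i.e. len(seed) + max_len + 1
```

Each row is a separate list, so overwriting a position in one sentence does not affect the others.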

In [None]:
def printer(sent, should_detokenize=True):
    if should_detokenize:
        sent = detokenize(sent)[1:-1]
    print(" ".join(sent))

The function printer prints a sentence given as input (if specified, it is first detokenized and its first and last tokens, normally [CLS] and [SEP], are dropped).

The following code block is taken from https://github.com/nyu-dl/bert-gen/blob/master/bert-babble.ipynb (file bert-babble).

(with respect to the Colab demo, two more utility functions are given: read_sents and write_sents)

In [None]:
# Utility functions
    
def read_sents(in_file, should_detokenize=False):
    # reads one whitespace-tokenized sentence per line from in_file.
    with open(in_file) as in_fh:
        sents = [sent.strip().split() for sent in in_fh]
    if should_detokenize:
        sents = [detokenize(sent) for sent in sents]
    return sents

def write_sents(out_file, sents, should_detokenize=False):
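    # writes one sentence per line to out_file; with should_detokenize=True
    # the first and last tokens ([CLS] and [SEP]) are dropped first.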
    with open(out_file, "w") as out_fh:         
        for sent in sents:
            sent = detokenize(sent[1:-1]) if should_detokenize else sent
            out_fh.write("%s\n" % " ".join(sent))

The function read_sents reads sentences from an external file, named in_file, one sentence per line. If should_detokenize is True, each sentence is detokenized before the list of sentences is returned.

The function write_sents writes the (generated) sentences to an external file, named out_file. If should_detokenize is True, the first and last tokens are dropped and the previously defined detokenize function is applied. Nothing is returned: the function only writes to out_file.
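The file format is simply one sentence per line with tokens separated by spaces. A minimal round-trip sketch follows; write_token_lines and read_token_lines are simplified stand-ins written for this example (no detokenization, no trimming of special tokens), and the file name is arbitrary.

```python
import os
import tempfile

def write_token_lines(path, sents):
    """Write one space-joined sentence per line."""
    with open(path, "w") as fh:
        for sent in sents:
            fh.write(" ".join(sent) + "\n")

def read_token_lines(path):
    """Read the file back into a list of token lists."""
    with open(path) as fh:
        return [line.strip().split() for line in fh]

toy_sents = [["hello", "world", "."], ["bert", "can", "babble"]]
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_sents.txt")
    write_token_lines(toy_path, toy_sents)
    restored = read_token_lines(toy_path)
print(restored == toy_sents)  # True
```

The round trip is exact only when no token contains whitespace, which holds for WordPiece tokens.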

_____________


The following code is the core of the algorithm. The general idea is:

      1- start from all masks
      2- repeatedly pick a location, mask the token at that location, and generate from the probability distribution given by BERT
      3- stop when converged or tired of waiting

We consider three "modes" of generating:

      . generate a single token for a position chosen uniformly at random, for a chosen number of time steps (**PARALLEL SEQUENTIAL GENERATION**)
      . generate in sequential order (Left->Right), one token at a time (**SEQUENTIAL GENERATION**)
      . generate for all positions at once, for a chosen number of time steps (**PARALLEL GENERATION**)

The generate function wraps and batches these three generation modes. In practice, we find that the first leads to the most fluent samples.
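The mask-and-refill loop of the first mode can be sketched for a single sentence with a stand-in scorer. Everything model-related here is an assumption: TOY_VOCAB is a made-up vocabulary and toy_scores returns random numbers where the notebook would call BERT, so the output is not meaningful text; only the control flow matches the description above.

```python
import random

TOY_VOCAB = ["the", "cat", "sat", "on", "mat"]

def toy_scores(tokens, position, rng):
    """Stand-in for BERT: one random score per vocabulary word (hypothetical)."""
    return [rng.random() for _ in TOY_VOCAB]

def toy_parallel_sequential(max_len=5, max_iter=50, seed=0):
    rng = random.Random(seed)
    # 1- start from all masks
    sent = ["[CLS]"] + ["[MASK]"] * max_len + ["[SEP]"]
    for _ in range(max_iter):
        # 2- pick a position uniformly at random among the generated slots,
        #    mask it, and refill it from the scorer's output
        pos = 1 + rng.randrange(max_len)
        sent[pos] = "[MASK]"
        scores = toy_scores(sent, pos, rng)
        sent[pos] = TOY_VOCAB[scores.index(max(scores))]
    # 3- stop after a fixed number of iterations
    return sent

toy_result = toy_parallel_sequential()
print(" ".join(toy_result))
```

Because positions are drawn at random, a slot that is never drawn stays masked; with many more iterations than positions this becomes unlikely.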

In [None]:
# Generation modes as functions
import math
import time
# the time package above is used to measure the time required for
# generating an entire sentence.

def parallel_sequential_generation(seed_text, batch_size=10, max_len=15, top_k=0, temperature=None, max_iter=300, burnin=200,
                                   cuda=False, print_every=10, verbose=True):
    """ Generate for one random position at a timestep
    
    args:
        - burnin: during the burn-in period, sample from the full distribution;
          afterwards sample from the top_k words (or take the argmax if top_k is 0)
    """
    seed_len = len(seed_text)
    batch = get_init_text(seed_text, max_len, batch_size)
    # These first 2 lines are the same in parallel_sequential_generation,
    # parallel_generation and sequential_generation.
    
    for ii in range(max_iter):
        kk = np.random.randint(0, max_len)
        # choose a random offset in [0, max_len): the token at position
        # seed_len+kk will be regenerated in this iteration.
        for jj in range(batch_size):
            batch[jj][seed_len+kk] = mask_id
            # jj runs over the rows (sentences) of the batch: in every
            # sentence the token at position seed_len+kk is replaced by [MASK].
        inp = torch.tensor(batch).cuda() if cuda else torch.tensor(batch)
        # the batch (a list of lists of ids) is converted into a tensor and
        # moved to the GPU if cuda is True.
        out = model(inp)  # the pretrained BertForMaskedLM returns the logits
        # for every position (see the model-loading cell).
        topk = top_k if (ii >= burnin) else 0
        # during burn-in (ii < burnin) topk is 0 and sample is True, so we
        # sample from the full distribution; afterwards we sample from the
        # top_k most likely words (or take the argmax if top_k is 0).
        idxs = generate_step(out, gen_idx=seed_len+kk, top_k=topk, temperature=temperature, sample=(ii < burnin))
        for jj in range(batch_size):
            batch[jj][seed_len+kk] = idxs[jj]
            # jj selects the row (the sentence) inside the batch and
            # seed_len+kk selects the token inside that sentence; idxs holds
            # one new token id per sentence.
            # Remember that batch is a list of lists, where each entry is
            # the vocabulary id of a token.
            
        if verbose and np.mod(ii+1, print_every) == 0:
            # if verbose is True and ii+1 is a multiple of print_every,
            # we print a progress message.
            for_print = tokenizer.convert_ids_to_tokens(batch[0])
            # batch[0] is the first list of ids in the batch;
            # convert_ids_to_tokens maps these ids back to tokens using the
            # tokenizer's vocabulary, giving the first sentence of the batch.

            for_print = for_print[:seed_len+kk+1] + ['(*)'] + for_print[seed_len+kk+1:]
            # idea: use the + as concatenation and show with the "(*)" where we
            # have sampled the kk in this last external for (ii) cycle.
            # In this way, we know that the token we see before the "(*)" is the
            # one we have just updated.

            print("iter", ii+1, " ".join(for_print))     
            # we could think this if and print as a command to show the user
            # that the process is going on, the iterations are moving, and 
            # we show the first sentence of the batch to illustrate
            # how the generated sentence is changing, how the process is
            # modifying iteratively our first sentence of the batch.        
    return untokenize_batch(batch)

def parallel_generation(seed_text, batch_size=10, max_len=15, top_k=0, temperature=None, max_iter=300, sample=True, 
                        cuda=False, print_every=10, verbose=True):
    """ Generate for all positions at a time step """
    seed_len = len(seed_text)
    batch = get_init_text(seed_text, max_len, batch_size)
    # These first 2 lines are the same both in parallel_sequential_generation,
    # parallel_generation and sequential_generation. 
    
    for ii in range(max_iter):
        # unlike sequential_generation, all positions are regenerated at every
        # iteration, so there is no need to build a truncated input. The loop
        # simply runs max_iter times, and ii is only used to decide when to
        # print a progress message.

        inp = torch.tensor(batch).cuda() if cuda else torch.tensor(batch)
        # the batch is converted into a tensor and moved to the GPU if cuda is True.
        out = model(inp)  # the pretrained BertForMaskedLM returns the logits
        # for every position (see the model-loading cell).
        for kk in range(max_len):
            idxs = generate_step(out, gen_idx=seed_len+kk, top_k=top_k, temperature=temperature, sample=sample)
            for jj in range(batch_size):
                batch[jj][seed_len+kk] = idxs[jj]
                # jj selects the row (the sentence) inside the batch and
                # seed_len+kk selects the token inside that sentence; idxs
                # holds one new token id per sentence.
                # Remember that batch is a list of lists, where each entry is
                # the vocabulary id of a token.
            
        if verbose and np.mod(ii, print_every) == 0:
            # if verbose is True and ii is a multiple of print_every,
            # we print a progress message.
            print("iter", ii+1, " ".join(tokenizer.convert_ids_to_tokens(batch[0])))
            # batch[0] is the first list of ids in the batch;
            # convert_ids_to_tokens maps it back to tokens, so this shows how
            # the first sentence of the batch changes from iteration to iteration.
    
    return untokenize_batch(batch)
            
def sequential_generation(seed_text, batch_size=10, max_len=15, leed_out_len=15, 
                          top_k=0, temperature=None, sample=True, cuda=False):
    """ Generate one word at a time, in L->R order """# from left to right.
    # This function is called inside generate (the main generation function)
    # to produce one batch of sentences.
    seed_len = len(seed_text)
    batch = get_init_text(seed_text, max_len, batch_size)

    # get_init_text initializes the batch as seed_text followed by max_len
    # [MASK] tokens and a closing [SEP]; the loop below then fills in the
    # masked positions one at a time, from left to right.
    # max_len = number of tokens to generate after seed_text.
    # seed_text is the prefix to generate from (it was found crucial to
    # start with the [CLS] token).
    # batch_size is the number of sentences in the batch.
    
    for ii in range(max_len):
        inp = [sent[:seed_len+ii+leed_out_len]+[sep_id] for sent in batch]
        # for each sentence, inp keeps a growing prefix (up to leed_out_len
        # tokens past the current position) followed by [SEP].
        # NOTE: the next line rebuilds inp from the full batch, so the
        # truncated version above is not what the model receives.
        inp = torch.tensor(batch).cuda() if cuda else torch.tensor(batch)
        # the batch is converted into a tensor and moved to the GPU if cuda is True.
        out = model(inp)  # the pretrained BertForMaskedLM returns the logits
        # for every position (see the model-loading cell).
        idxs = generate_step(out, gen_idx=seed_len+ii, top_k=top_k, temperature=temperature, sample=sample)
        # generate_step picks one new token id per sentence for position
        # seed_len+ii (idxs is a list with batch_size entries).
        for jj in range(batch_size):
            batch[jj][seed_len+ii] = idxs[jj]
            # jj selects the row (the sentence) inside the batch and
            # seed_len+ii selects the token inside that sentence.
            # Remember that batch is a list of lists, where each entry is
            # the vocabulary id of a token.
    return untokenize_batch(batch)


def generate(n_samples, seed_text="[CLS]", batch_size=10, max_len=25, 
             generation_mode="parallel-sequential",
             sample=True, top_k=100, temperature=1.0, burnin=200, max_iter=500,
             leed_out_len=15, cuda=False, print_every=1):
    # main generation function to call

    # n_samples = number of sentences to generate.
    # math.ceil rounds up to the nearest integer; it is used here to compute
    # the number of batches, n_batches.
    # sentences is the list of generated sentences.
    # print_every specifies after how many batches the elapsed time is printed.
    # seed_text is the prefix of every sequence (here the [CLS] token).
    # leed_out_len is only used by the sequential mode; it is a parameter here
    # (default taken from sequential_generation) so that generate does not
    # depend on a global variable.
    sentences = []
    n_batches = math.ceil(n_samples / batch_size)
    start_time = time.time()
    # for each batch, the generation function selected by generation_mode is
    # applied (parallel_sequential_generation, sequential_generation or
    # parallel_generation).
    # the final "if" prints the time taken every print_every batches;
    # at the end of each iteration the generated batch is added to sentences.
    for batch_n in range(n_batches):
        if generation_mode == "parallel-sequential":
            batch = parallel_sequential_generation(seed_text, batch_size=batch_size, max_len=max_len, top_k=top_k,
                                                   temperature=temperature, burnin=burnin, max_iter=max_iter, 
                                                   cuda=cuda, verbose=False)
        elif generation_mode == "sequential":
            batch = sequential_generation(seed_text, batch_size=batch_size, max_len=max_len, top_k=top_k, 
                                          temperature=temperature, leed_out_len=leed_out_len, sample=sample,
                                          cuda=cuda)
        elif generation_mode == "parallel":
            batch = parallel_generation(seed_text, batch_size=batch_size,
                                        max_len=max_len, top_k=top_k, temperature=temperature, 
                                        sample=sample, max_iter=max_iter, 
                                        cuda=cuda, verbose=False)
        
        if (batch_n + 1) % print_every == 0:
          # if a number of batches equal to "print_every" has been
          # generated, then an output message is shown giving the time required.
            print("Finished batch %d in %.3fs" % (batch_n + 1, time.time() - start_time))
            start_time = time.time()
        
        sentences += batch
    return sentences

The code above contains three mode-specific generation functions (parallel_sequential_generation, sequential_generation and parallel_generation) and one more general function, generate, which produces the sentences.

For each batch, the generation method selected by generation_mode is applied.
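The number of batches follows from a ceiling division, and since every batch contains batch_size sentences, the total returned can exceed n_samples when the two do not divide evenly. A small sketch of the arithmetic, where plan_batches is a helper name introduced only for this example:

```python
import math

def plan_batches(n_samples, batch_size):
    """Return (number of batches, total sentences actually generated)."""
    n_batches = math.ceil(n_samples / batch_size)
    return n_batches, n_batches * batch_size

print(plan_batches(1000, 50))  # (20, 1000)
print(plan_batches(105, 50))   # (3, 150)
```

If exactly n_samples sentences are needed, slicing the returned list (for example sentences[:n_samples]) is one simple option.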


Sequential_generation. We use this function to generate a batch of sentences (that will be added to the final set of sentences). The batch is initialized with masks and then filled in from left to right: at iteration ii the pretrained model BertForMaskedLM is applied, and its output (out) is passed to generate_step to generate the word at position seed_len+ii. The code also builds a growing portion of each sentence (inp), but as written inp is immediately overwritten with the full batch, so the model always receives the complete sentences, including the positions that are still masked.

Parallel_generation. While sequential_generation moves through the sentence from the first position to the last, here a for loop runs for max_iter iterations.
At each iteration the model is applied once and every position is regenerated from its output (NOTE that the value ii is not used in the generation process itself; it is only used to decide when to print a progress message).

Parallel_sequential_generation. This function generates a single token for a position chosen uniformly at random, for a chosen number of time steps. As in sequential_generation, we fill in the batch one idx (token) at a time, but the position of this idx in the batch's vectors (sentences) is sampled uniformly at random instead of being given by the iteration number. The selected position is masked and the full batch is given as input (inp) to the pretrained model BertForMaskedLM.
During the burn-in period (iterations with ii < burnin) generate_step samples the word from the full distribution; once the burn-in is over, it samples from the top_k most likely words only (or takes the argmax if top_k is 0).
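The burn-in switch in parallel_sequential_generation can be isolated as a tiny function. The sketch below reproduces the two expressions used in the loop (topk and sample) and reports which branch of generate_step they would select; the function name sampling_schedule and the returned labels are made up for the illustration.

```python
def sampling_schedule(ii, burnin, top_k):
    """Which branch of generate_step is taken at iteration ii."""
    topk = top_k if ii >= burnin else 0   # same expression as in the loop
    sample = ii < burnin                  # same expression as in the loop
    if topk > 0:
        return "top-k sampling"
    if sample:
        return "full-distribution sampling"
    return "argmax"

for ii in (0, 199, 200, 499):
    print(ii, sampling_schedule(ii, burnin=200, top_k=100))
# 0 full-distribution sampling
# 199 full-distribution sampling
# 200 top-k sampling
# 499 top-k sampling
```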

_____



## APPLICATION OF THE GENERATION FUNCTION


### Application Example 

The following code is taken from https://github.com/nyu-dl/bert-gen/blob/master/bert-babble.ipynb

In [None]:
n_samples = 1000 
batch_size = 50 
max_len = 40
top_k = 100
temperature = 0.7

leed_out_len = 5 # max_len
burnin = 250
sample = True
max_iter = 500

# Choose the prefix context
seed_text = "[CLS]".split()

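# NOTE: the loop variable temp (1.0) is what is passed to generate; the
# temperature value defined above is not used in this cell.
for temp in [1.0]:
    bert_sents = generate(n_samples, seed_text=seed_text, batch_size=batch_size, max_len=max_len,
                          sample=sample, top_k=top_k, temperature=temp, burnin=burnin, max_iter=max_iter,
                          cuda=cuda)  # use the GPU only if one was detected when loading the model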
    out_file = "Bert_using_pytorch.txt"
    write_sents(out_file, bert_sents, should_detokenize=True)


In [None]:
write_sents(out_file, bert_sents, should_detokenize=True)

In [None]:
in_file = "Bert_using_pytorch.txt"
bert_sents = read_sents(in_file, should_detokenize=False)

In [None]:
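for i in range(50):
    # NOTE: with should_detokenize=True, printer also drops the first and last
    # token of each sentence; since write_sents already removed [CLS] and
    # [SEP] before saving, this trims one generated token at each end.
    printer(bert_sents[i], should_detokenize=True)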

it is true that the world is not just governed by one force , but by some other force , then you will select where the events begin and when they end , that is what you say
in britain , temp is fondly remembered for this , because just a month later several of his men had been gang - raped , while the two men living close to the camp were tortured
they were real ugly . ) . . our beckoned . bambi , clearly ugly as well , had been believed to be a wanker for this character because of her matted dark hair
king , professional boxer . stephen jones , jon jones , john jones ( the waste had been wiped clean , cobbled up into garbage ) . brian jones , john jones ( man in black )
hurry " - said one of the women . " alright , alright , alright " - nobody could hurt her . bearl could not . the dead woman and lissy were at least twenty
by ( author and illustrator ) e . mccready , univ . and philosophical research institute , n . y . c . [ revised and enlarged ] philadelphia : w . a . t
see list be

_______________
_______________
# **GENERATION PART (OpenAI GPT)**

**Comparing to existing models**

The OpenAI Generative Pretraining Transformer is another pretrained model successfully used for transfer learning. Since the model is a unidirectional language model, we can straightforwardly generate from the model. See this repo by Thomas Wolf at Huggingface for instructions for setting up the model.
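To illustrate what "straightforwardly generate" means for a unidirectional model, here is a toy left-to-right loop in which each new token depends only on the tokens already produced. The bigram table TOY_BIGRAMS and the function toy_left_to_right are assumptions for this sketch and have nothing to do with the GPT weights; they only stand in for "a model that proposes the next token given the prefix".

```python
import random

# Hypothetical next-token table: for each token, the tokens that may follow it.
TOY_BIGRAMS = {
    "<s>": ["the", "a"],
    "the": ["cat", "dog"],
    "a": ["cat", "dog"],
    "cat": ["sleeps", "runs"],
    "dog": ["sleeps", "runs"],
    "sleeps": ["<EOS>"],
    "runs": ["<EOS>"],
}

def toy_left_to_right(gen_len=5, seed=0):
    """Append one token at a time, conditioning only on the previous token."""
    rng = random.Random(seed)
    tokens = ["<s>"]
    for _ in range(gen_len):
        next_token = rng.choice(TOY_BIGRAMS[tokens[-1]])
        tokens.append(next_token)
        if next_token == "<EOS>":
            break
    return tokens[1:]  # drop the start symbol

print(" ".join(toy_left_to_right()))
```

Unlike the BERT modes above, no position is ever revisited: once a token is emitted it stays fixed, and generation ends at <EOS> or after gen_len tokens.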

In [None]:
!git clone https://github.com/huggingface/pytorch-openai-transformer-lm.git 'OpenAi'
%cd /content/

In [None]:
!git clone https://github.com/openai/finetune-transformer-lm.git


In order to run the next cells, we need to move the folder "model" from "finetune-transformer-lm" to "OpenAi" (the code below loads its files from /content/OpenAi/model/).

In [None]:
!pip install ftfy
!pip install tqdm
!pip install scikit-learn
!pip install spacy
!pip install pandas

In [None]:
import os
import sys

sys.path.insert(1, os.path.join(".", "OpenAi"))  #pytorch-openai-transformer-lm

from OpenAi.model_pytorch import LMModel, load_openai_pretrained_model, DEFAULT_CONFIG
from OpenAi.text_utils import TextEncoder

def load_openai_gpt(n_special=1, n_ctx=512):
    text_encoder = TextEncoder("/content/OpenAi/model/encoder_bpe_40000.json", 
                               "/content/OpenAi/model/vocab_40000.bpe")
    encoder = text_encoder.encoder
    n_vocab = len(text_encoder.encoder)
    vocab = n_vocab + n_special + n_ctx

    args = DEFAULT_CONFIG
    lm_model = LMModel(args, vocab, n_ctx, return_probs=True)
    load_openai_pretrained_model(lm_model.transformer, n_ctx=n_ctx, n_special=n_special,
                                 path="/content/OpenAi/model/",
                                 path_names="/content/OpenAi/")
    #lm_model.to(device)
    lm_model.return_probs = False
    lm_model.eval()
    return lm_model, text_encoder

def make_batch(X, n_vocab, n_special, batch_size):
    X = np.array(X)
    assert X.ndim in [1, 2]
    if X.ndim == 1:
        X = np.expand_dims(X, axis=0)
    pos_enc = np.arange(n_vocab + n_special, n_vocab + n_special + X.shape[-1])
    pos_enc = np.tile(pos_enc, (batch_size, 1))  # one copy of the position ids per sentence
    batch = np.stack([X, pos_enc], axis=-1)
    batch = torch.tensor(batch, dtype=torch.long)#.to(device)
    return batch

def append_batch(X, next_idx):
    next_pos = X[:, -1:, 1] + 1
    next_x = torch.cat((next_idx, next_pos), -1).unsqueeze(1)
    return torch.cat((X, next_x), 1)

def _generate_sentence_openai(model, text_encoder, seed_text, batch_size=10, gen_len=20, 
                             topk=100, sample=True, n_special=0):
    n_vocab = len(text_encoder.encoder)
    #X = np.random.randint(n_vocab, size=(batch_size, 1)).tolist()
    #sents = [[text_encoder.decoder[X[i][0]]].replace('</w>', '') for i in range(batch_size)]
    X = [[n_vocab - 1] for _ in range(batch_size)]
    sents = [[] for _ in range(batch_size)]
    if seed_text:
        seed_ids = text_encoder.encode([seed_text,])
        X = [X[i] + seed_ids[0] for i in range(batch_size)]
        sents = [[seed_text] for _ in range(batch_size)]
    XMB = make_batch(X, n_vocab, n_special, batch_size=batch_size)


    for step_n in range(gen_len):
        out = model(XMB) + model.pos_emb_mask
        next_idxs = generate_step(out, gen_idx=step_n, top_k=topk, sample=sample, return_list=False)
        idxs = next_idxs.tolist()
        for i in range(batch_size):
            next_token = idxs[i]
            if next_token == n_vocab:
                next_token = "<EOS>"
            else:
                next_token = text_encoder.decoder[next_token].replace('</w>', '')
            sents[i].append(next_token)
        XMB = append_batch(XMB, next_idxs.unsqueeze(-1))
        
    return [[tok for tok in sent if tok != '\n'] for sent in sents]

def generate_openai(model, text_encoder, n_samples, seed_text, 
                    batch_size=10, gen_len=20, 
                    topk=100, temperature=0.7, sample=True,
                    n_special=0, print_every=1):
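    # NOTE: `temperature` is not forwarded to _generate_sentence_openai below,
    # so changing it has no effect on the sampling done by this function.
    sents = []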
    start_time = time.time()
    n_batches = math.ceil(n_samples / batch_size)
    for batch_n in range(n_batches):
        batch_sents = _generate_sentence_openai(model, text_encoder, seed_text,
                                                batch_size=batch_size, gen_len=gen_len, 
                                                topk=topk, sample=sample,
                                                n_special=n_special)
        sents += batch_sents
        if (batch_n + 1) % print_every == 0:
            print("Generated batch %d of %d in %.3fs" % (batch_n + 1, n_batches, time.time() - start_time))
            start_time = time.time()
    return sents
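
The layout that `make_batch` and `append_batch` build can be illustrated with plain lists. This is only a minimal sketch, not the notebook's tensor code: `toy_make_batch` and `toy_append` are hypothetical helpers and the vocabulary size is made up. Each token id is paired with a position id that starts at `n_vocab + n_special`, and every appended token receives the previous position plus one.

```python
def toy_make_batch(X, n_vocab, n_special):
    """Pair every token id with a position id, as make_batch does with arrays."""
    first_pos = n_vocab + n_special
    return [[[tok, first_pos + t] for t, tok in enumerate(row)] for row in X]


def toy_append(batch, next_ids):
    """Append one new token per row; its position is the last position + 1."""
    return [row + [[tok, row[-1][1] + 1]] for row, tok in zip(batch, next_ids)]


# Hypothetical sizes, chosen only to keep the numbers readable.
batch = toy_make_batch([[9, 3], [9, 4]], n_vocab=10, n_special=1)
batch = toy_append(batch, [5, 6])
print(batch[0])  # [[9, 11], [3, 12], [5, 13]]
```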

In [None]:
import json

gpt_model, gpt_text_encoder = load_openai_gpt(n_special=1)

Loading weights...


Loading weights...

### Application of OpenAI GPT

In [None]:
n_samples = 1000
batch_size = 50
max_len = 40
top_k = 100
temperature = 0.7

leed_out_len = 5 # max_len
burnin = 250
sample = True
max_iter = 500

openai_sents = generate_openai(gpt_model, gpt_text_encoder, seed_text="", 
                               n_samples=n_samples, batch_size=batch_size, gen_len=max_len,
                               topk=top_k, temperature=temperature, sample=sample,
                               n_special=1, print_every=1)



Generated batch 1 of 20 in 147.889s
Generated batch 2 of 20 in 144.145s
Generated batch 3 of 20 in 143.557s
Generated batch 4 of 20 in 143.098s
Generated batch 5 of 20 in 142.696s
Generated batch 6 of 20 in 143.003s
Generated batch 7 of 20 in 143.652s
Generated batch 8 of 20 in 143.495s
Generated batch 9 of 20 in 143.918s
Generated batch 10 of 20 in 143.643s
Generated batch 11 of 20 in 143.896s
Generated batch 12 of 20 in 142.684s
Generated batch 13 of 20 in 144.480s
Generated batch 14 of 20 in 142.908s
Generated batch 15 of 20 in 142.632s
Generated batch 16 of 20 in 142.571s
Generated batch 17 of 20 in 142.476s
Generated batch 18 of 20 in 143.346s
Generated batch 19 of 20 in 144.636s
Generated batch 20 of 20 in 143.657s


In [None]:
out_file = "openaitext.txt"
    #out_file = "data/%s-len%d-burnin%d-topk%d-temp%.3f.txt" % (model_version, max_len, burnin, top_k, temp)
write_sents(out_file, openai_sents, should_detokenize=True)

To obtain further values for the table shown in the paper "BERT has a Mouth, and It Must Speak: BERT as a Markov Random Field Language Model", we sample 1000 sentences from the training split of the WT103 dataset and compute their corpus BLEU.
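
The notebook does not show how the 1000-sentence sample was produced, so the following is only a sketch of one possible way to do it. The helper `sample_sentences`, the fixed seed, and the heading pattern (lines wrapped in `=` signs) are assumptions made for illustration.

```python
import random


def sample_sentences(lines, k=1000, seed=0):
    """Drop blank lines and WikiText-style headings, then sample k lines."""
    kept = []
    for line in lines:
        text = line.strip()
        # Assumption: headings look like "= Title =" or "= = Section = =".
        if not text or (text.startswith("=") and text.endswith("=")):
            continue
        kept.append(text)
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(kept, min(k, len(kept)))


toy_lines = [" = Some Title = \n", "\n", "first sentence .\n",
             " = = Some Section = = \n", "second sentence .\n"]
print(sorted(sample_sentences(toy_lines, k=2)))
# ['first sentence .', 'second sentence .']
```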

___________
___________
# **EVALUATION PART**

Evaluation methods for unconditional generation aren't perfect. We'll measure the diversity of our generated samples via _self-BLEU_: we compute _corpus BLEU_ where for each generated sentence, we compute BLEU treating the other sentences as references. 

We also compute the percentage of  _n-grams_ that are unique among the generations. 

(From Wikipedia: an n-gram is an n-elements subsequence of a sequence).

We try some other strategies, including comparing to outside models, in our report, and you can see some of the code for that here (SEE SECTION TEXYGEN).

The following part is taken from https://github.com/nyu-dl/bert-gen/blob/master/bert-babble.ipynb

**Evaluation**

In [None]:
!pip3 install nltk==3.6.2

In [None]:
from nltk.translate import bleu_score as bleu

## Quality Measures: Corpus-BLEU


We want to know how similar the generated sentences are to the original training data (Toronto Book Corpus and Wikipedia dumps). We follow Yu et al. (2017) and compute the BLEU between the generations and the test sets of both corpora by treating the test set as the references for each generation. The test sets are large; we subsample 5000 examples from each.

In [None]:
#help(bleu.corpus_bleu)

In [None]:
def prepare_data(data_file, replacements=None, uncased=True):
    """ Prepare data to compute the BLEU score: corpus_bleu expects each
        sentence to be a list of tokens, so the file becomes a list of lists
    """
    replacements = replacements or {}
    with open(data_file, 'r') as f:
        data = [d.strip().split() for d in f]
    # strip() removes leading/trailing whitespace, split() turns the line
    # into a list of tokens; this is done for each line in the data_file
    if uncased:
        data = [[t.lower() for t in sent] for sent in data]
        # lower case for every word
        
    for k, v in replacements.items():
        # example from "prepare_wiki": replace "@@unknown@@"(k) with  "[UNK]"(v)
        data = [[t if t != k else v for t in sent] for sent in data]
        # if the token t is different from k (what we have to change), then leave t
        # otherwise, if t is = k, replace t with v 
        
        # recall: replacements is a dictionary that connects tokens to be substituted
        # and token that substitute, e.g. "@@unknown@@" with "[UNK]"
 
        # at the end the data are ready to be used in corpus_bleu
    return data

def prepare_wiki(data_file, uncased=True):
    """ prepare the data from wiki103 so we can use these phrases as 
    references in the corpus bleu function """
    replacements = {"@@unknown@@": "[UNK]"}
    return prepare_data(data_file, replacements=replacements, uncased=uncased)

def prepare_tbc(data_file):     
    """ prepare the data from tbc so we can use these phrases as 
    references in the corpus bleu function """   
    replacements = {"``": "\"", "\'\'": "\""}
    return prepare_data(data_file, replacements=replacements)

def corpus_bleu(generated, references):
    """ Compute similarity between two corpora as measured by
    comparing each sentence of `generated` against all sentences in `references` 
    
    args:
        - generated (List[List[str]]): list of sentences (split into tokens)
        - references (List[List[str]]): list of sentences (split into tokens)
        
    returns:
        - bleu (float)
    """    
    # generated is a list of sentences, where each sentence is represented as a list
    # of tokens.
    # references has the same basic structure as generated.
    return bleu.corpus_bleu([references for _ in range(len(generated))], generated)
    # compare each sentence of 'generated' against all sentences in 'references'
    # corpus_bleu -> ([references, references, ... (|generated| times)], generated)
    # corpus_bleu scores each sentence of "generated" against all the sentences
    # of "references" and pools the n-gram counts over all sentences before
    # computing the precisions (it is not a simple average of sentence scores),
    # while in self_bleu we compare against the other generated
    # sentences, not against some reference sentences!
    

The function _prepare_data_ prepares the data for computing the BLEU score: each line of the data_file becomes a list of tokens, so the whole file becomes a list of lists of tokens.
The functions _prepare_wiki_ and _prepare_tbc_ then apply it to the reference samples (from the Wikipedia dumps and the Toronto Book Corpus, respectively) used to study the similarity.
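
The effect of these functions is easier to see on a line held in memory. The sketch below is an illustrative variant, not the notebook's function: `prepare_lines` is a hypothetical name, and it takes a list of strings instead of a file path. As in _prepare_data_, lowercasing happens before the replacements are applied.

```python
def prepare_lines(lines, replacements=None, uncased=True):
    """In-memory variant of prepare_data: one token list per input line."""
    replacements = replacements or {}
    data = [line.strip().split() for line in lines]
    if uncased:
        data = [[t.lower() for t in sent] for sent in data]
    # Swap every token that is a key of `replacements` for its value.
    return [[replacements.get(t, t) for t in sent] for sent in data]


wiki_like = ["The @@UNKNOWN@@ river flows .\n"]
print(prepare_lines(wiki_like, {"@@unknown@@": "[UNK]"}))
# [['the', '[UNK]', 'river', 'flows', '.']]
```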


In [None]:
wiki103_file = 'datawiki103.5k.txt'
#this comes from wikitext103 test set
tbc_file = 'tbc.5k.txt'

wiki_data = prepare_wiki(wiki103_file)
tbc_data = prepare_tbc(tbc_file)

In [None]:
# Some initializations for the table of corpus-BLEU
# THIS WHOLE CODE BLOCK HAS TO BE REPEATED FOR
# THE OTHER BERT MODEL VERSION TOO
print(model_version)

TITLE_CORPUS = ['Model', 'Corpus-BLEU against WT103', 'Corpus-BLEU against TBC']
# values_corpus_bleu is a list of 3 lists (one with the model name, two for corpus-BLEU)
# each one with 4 elements (first element refers to BERTlarge, second element is for
# BERTbase, third element for GPT, fourth element for WT103)
# initialization:
values_corpus_bleu = [[0,0,0,0],[0,0,0,0],[0,0,0,0]]
values_corpus_bleu[0] = ['BERTlarge', 'BERTbase', 'GPT', 'WT103']

bert-base-uncased


The following code block has to be repeated two times, one for 'bert-base-uncased', one for 'bert-large-uncased'.

In [None]:
value = corpus_bleu(bert_sents, tbc_data)
print("BERT-TBC BLEU: %.2f" % (100 * value))
if model_version == 'bert-base-uncased':
  # paper value: 7.06
  values_corpus_bleu[2][1] = 100 * value
else: #'bert-large-uncased'
  # paper value: 7.60
  values_corpus_bleu[2][0] = 100 * value

value = corpus_bleu(bert_sents, wiki_data)
print("BERT-Wiki103 BLEU: %.2f" % (100 * value ))
if model_version == 'bert-base-uncased':
  # paper value: 7.80
  values_corpus_bleu[1][1] = 100 * value
else: #'bert-large-uncased'
  # paper value: 5.05
  values_corpus_bleu[1][0] = 100 * value


BERT-TBC BLEU: 7.04
BERT-Wiki103 BLEU: 8.51
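
Since the same cell is run once per BERT version, the if/else on `model_version` could also be expressed as a lookup. This is a sketch of that alternative, not the notebook's code: `MODEL_COLUMN` and `store_score` are hypothetical names, and the value 0.25 is a placeholder rather than a measured score.

```python
# Column of each BERT variant in the result tables (same order as the model-name row).
MODEL_COLUMN = {'bert-large-uncased': 0, 'bert-base-uncased': 1}


def store_score(table, row, model_version, value):
    """Write a BLEU value (as a percentage) into the column of `model_version`."""
    table[row][MODEL_COLUMN[model_version]] = 100 * value
    return table


toy_table = [['BERTlarge', 'BERTbase', 'GPT', 'WT103'], [0, 0, 0, 0], [0, 0, 0, 0]]
store_score(toy_table, row=2, model_version='bert-base-uncased', value=0.25)
print(toy_table[2])  # [0, 25.0, 0, 0]
```

An unknown `model_version` raises a `KeyError` here instead of silently falling into the `else` branch.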


In [None]:
import random
wiki1000_file = 'wiki_train_1000_samples.txt'
##this comes from wikitext103 training set
wiki1000_data = prepare_wiki(wiki1000_file)

value = corpus_bleu(wiki1000_data, wiki_data)
print("Wiki103_train-Wiki103 BLEU: %.2f" % (100 * value))
# paper value: 17.48
values_corpus_bleu[1][3] = 100 * value


value = corpus_bleu(wiki1000_data, tbc_data)
print("Wiki103_train-TBC BLEU: %.2f" % (100 * value))
# paper value: 6.57
values_corpus_bleu[2][3] = 100 * value


Wiki103_train-Wiki103 BLEU: 15.18
Wiki103_train-TBC BLEU: 6.04


In [None]:
value = corpus_bleu(openai_sents, tbc_data)
print("GPT-TBC BLEU: %.2f" % (100 * value))
# paper value 30.75
values_corpus_bleu[2][2] = 100 * value

value = corpus_bleu(openai_sents, wiki_data)
print("GPT-Wiki103 BLEU: %.2f" % (100 * value))
# paper value: 10.81
values_corpus_bleu[1][2] = 100 * value


GPT-TBC BLEU: 30.02
GPT-Wiki103 BLEU: 11.32


## Diversity measures: Self-BLEU


Self-BLEU: treat each sentence as a hypothesis and treat rest of corpus as reference. Lower is better.

In [None]:
#help(bleu)

The function below implements self_bleu, which measures how much each sentence resembles all the others in the document.
Recall the difference between BLEU and Self-BLEU. Since BLEU assesses how similar two sentences are, it can also be used to evaluate how much one sentence resembles the rest of a generated collection. Regarding one sentence as the hypothesis and the others as references, we can calculate the BLEU score for every generated sentence, and define the average BLEU score to be the Self-BLEU of the document.

So the difference is that BLEU scores a group of generated sentences against a separate group of reference sentences, whereas Self-BLEU compares sentences from the same collection with each other (e.g. the generated sentences).
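
The "one sentence as hypothesis, the others as references" construction can be sketched on a toy collection. The helper name `leave_one_out_references` is hypothetical; the list comprehension is the same one that self_bleu passes to corpus_bleu.

```python
def leave_one_out_references(sents):
    """For sentence i, the references are all sentences except sentence i."""
    return [[s for j, s in enumerate(sents) if j != i] for i in range(len(sents))]


toy_sents = [["a", "cat"], ["a", "dog"], ["the", "bird"]]
refs = leave_one_out_references(toy_sents)
print(refs[0])  # [['a', 'dog'], ['the', 'bird']]
```

Note that this builds one reference list per sentence, so memory and time grow quadratically with the number of sentences.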

In [None]:
def self_bleu(sents):
    # Self-BLEU measures the diversity of the generated data: each sentence is
    # treated as a hypothesis and the rest of the corpus as its references.
    # A higher self-BLEU score indicates less diversity in the corpus, so lower is better.
    # bleu.corpus_bleu() computes a single BLEU score over multiple sentences,
    # such as a paragraph or a document.
    # https://machinelearningmastery.com/calculate-bleu-score-for-text-python/
    return bleu.corpus_bleu([[s for (j, s) in enumerate(sents) if j != i] for i in range(len(sents))], sents)

In [None]:
# Some initializations for the table of self-BLEU
# THIS WHOLE CODE BLOCK HAS TO BE REPEATED FOR
# THE OTHER BERT MODEL VERSION TOO
print(model_version)

TITLE_SELF = ['Model', 'Self-BLEU']
# values_self_bleu is a list of 2 lists (one with the model name, one for self-BLEU)
# each one with 4 elements (first element refers to BERTlarge, second element is for
# BERTbase, third element for GPT, fourth element for WT103)
# initialization:
values_self_bleu = [[0,0,0,0],[0,0,0,0]]
values_self_bleu[0] = ['BERTlarge', 'BERTbase', 'GPT', 'WT103']

bert-base-uncased


In [None]:
value = self_bleu(bert_sents)
print("BERT self-BLEU: %.2f" % (100 * value))
# paper value: 10.06
if model_version == 'bert-base-uncased':
  values_self_bleu[1][1] = 100 * value
else:
  values_self_bleu[1][0] = 100 * value

value = self_bleu(openai_sents)
print("OpenAI self-BLEU: %.2f" % (100 * value))
values_self_bleu[1][2] = 100 * value
# paper value: 40.02

value = self_bleu(wiki1000_data)
print("Wiki103_train SELF-BLEU: %.2f" % (100 * value))  
# paper value: 9.80
values_self_bleu[1][3] = 100 * value

BERT self-BLEU: 8.49
OpenAI self-BLEU: 38.09
Wiki103_train SELF-BLEU: 17.42


## Diversity measures: n-grams

In [None]:
from collections import Counter
from nltk.util import ngrams

Class Counter: Dict subclass for counting hashable items.  Sometimes called a bag or multiset.  Elements are stored as dictionary keys and their counts are stored as dictionary values.

Function ngrams: Return the ngrams generated from a sequence of items, as an iterator.
For example:
    
    >>> from nltk.util import ngrams
    >>> list(ngrams([1,2,3,4,5], 3))
        [(1, 2, 3), (2, 3, 4), (3, 4, 5)]

Other interesting measures are those regarding n-grams. In the following part we define _get_ngram_counts_ , _ref_unique_ngrams_ , _self_unique_ngrams_ .

In [None]:
def get_ngram_counts(sents, max_n=4):
    size2count = {} #empty dictionary
    for i in range(1, max_n + 1):
        size2count[i] = Counter([n for sent in sents for n in ngrams(sent, i)])
    return size2count
    # size2count is a dictionary whose keys are "i" and for each i a counter is
    # applied. For key 1, this counter counts all the occurrences of the 1-grams
    # inside the sentences, while for key 2, this counter counts all the occurrences
    # of the 2-grams (i.e. two consecutive words) inside the sentences, and 
    # so on for all the other keys up to max_n

def ref_unique_ngrams(preds, refs, max_n=4):
    # get # of *distinct* pred ngrams that don't appear in ref
    pct_unique = {}
    pred_ngrams = get_ngram_counts(preds, max_n)
    # builds the ngrams of the generated sentences
    ref_ngrams = get_ngram_counts(refs, max_n)
    # builds the ngrams of the reference sentences
    for i in range(1, max_n + 1):
        pred_ngram_counts = set(pred_ngrams[i].keys())
        # with the above command we save the keys of the i-th dictionary (w.r.t 
        # our predicted sentences) inside
        # a set.
        total = sum(pred_ngrams[i].values())
        # with the above command we compute the sum of all the occurrences
        # of the grams of length i (i-grams).
        ref_ngram_counts = set(ref_ngrams[i].keys())
        # with the above command we save the keys of the i-th dictionary (w.r.t 
        # the reference sentences) inside
        # a set.
        pct_unique[i] = len(pred_ngram_counts.difference(ref_ngram_counts)) / total
        # we measure the proportion of predicted i-grams that don't appear
        # in the reference i-grams
    return pct_unique
        
def self_unique_ngrams(preds, max_n=4):
    # get # of pred ngrams with count 1
    pct_unique = {}
    # empty dictionary
    pred_ngrams = get_ngram_counts(preds, max_n)
    # build the dictionary of Counters, where the i-th Counter contains the
    # i-grams and their number of occurrences in the generated sentences (i.e.
    # the predicted sentences).
    for i in range(1, max_n + 1):
        n_unique = len([k for k, v in pred_ngrams[i].items() if v == 1])
        # n_unique is the number of i-grams that are unique 
        # in the i-th dictionary
        total = sum(pred_ngrams[i].values())
        pct_unique[i] = n_unique / total
        # we measure the proportion of generated i-grams that are unique
        # (i.e. which occur just one single time)
    return pct_unique

_ref_unique_ngrams_: We use this function to compute the percentage of n-grams that appear in preds (which in our case is bert_sents, our generated sentences) and don't appear in refs (which in our case is wiki_data, our 5000 sentences from wiki103). The results are in Table 2 of the paper.

_self_unique_ngrams_: We compute the percentage of n-grams that appear only once in preds (bert_sents). The results are in Table 2 of the paper.

_get_ngram_counts_: The two functions above rely on this function. It creates a dictionary of four Counters, one for each n from 1 to 4, each containing all the n-grams of that size with their number of occurrences.
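
A toy example makes the two percentages concrete. The sketch below is standard-library only: `toy_ngrams` is a hypothetical stand-in for `nltk.util.ngrams`, and the sentences are made up. Both ratios use the total number of bigram occurrences in preds as denominator, as in the functions above.

```python
from collections import Counter


def toy_ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return list(zip(*(tokens[k:] for k in range(n))))


preds = [["a", "b", "c"], ["a", "b", "d"]]
refs = [["a", "b", "x"]]

pred_counts = Counter(g for sent in preds for g in toy_ngrams(sent, 2))
ref_counts = Counter(g for sent in refs for g in toy_ngrams(sent, 2))
total = sum(pred_counts.values())  # 4 bigram occurrences in preds

# Share of bigrams that occur exactly once in preds (as in self_unique_ngrams).
self_unique = sum(1 for c in pred_counts.values() if c == 1) / total
# Distinct pred bigrams missing from refs, over all occurrences (as in ref_unique_ngrams).
ref_unique = len(set(pred_counts) - set(ref_counts)) / total
print(self_unique, ref_unique)  # 0.5 0.5
```

Here ("a", "b") occurs twice in preds and also appears in refs, so only ("b", "c") and ("b", "d") count in either measure.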

In [None]:
max_n = 4
print(model_version)

TITLE2 = ['Model', '% unique 2-grams vs WT103', '% unique 3-grams vs WT103', '% unique 4-grams vs WT103']
TITLE1 = ['Model', '% unique 2-grams vs Self', '% unique 3-grams vs Self', '% unique 4-grams vs Self']
TITLE3 = ['Model', '% unique 2-grams vs TBC', '% unique 3-grams vs TBC', '% unique 4-grams vs TBC']
# values_grams_VS_SELF is a list of 4 lists (one with the model name, one for
# n=2, one for n=3, one for n=4) each one with 4 elements (first element is 
# the percentage of unique n-grams of BERTlarge VS ITSELF, second element is the 
# percentage of unique n-grams of BERTbase VS ITSELF, third element is the percentage
# of unique n-grams of GPT VS ITSELF, fourth element is the percentage of unique
# n-grams of WT103 VS ITSELF)
# initialization:
values_grams_VS_SELF = [[0,0,0,0],[0,0,0,0],[0,0,0,0],[0,0,0,0]]
values_grams_VS_SELF[0] = ['BERTlarge', 'BERTbase', 'GPT', 'WT103']

# values_grams_VS_WT103 is a list of 4 lists (one with the model name, one for
# n=2, one for n=3, one for n=4) each one with 4 elements (first element is 
# the percentage of unique n-grams of BERTlarge VS WT103, second element is the 
# percentage of unique n-grams of BERTbase VS WT103, third element is the percentage
# of unique n-grams of GPT VS WT103, fourth element is the percentage of unique
# n-grams of WT103 VS WT103)
# initialization:
values_grams_VS_WT103 = [[0,0,0,0],[0,0,0,0],[0,0,0,0],[0,0,0,0]]
values_grams_VS_WT103[0] = ['BERTlarge', 'BERTbase', 'GPT', 'WT103']

# values_grams_VS_TBC is a list of 4 lists (one with the model name, one for
# n=2, one for n=3, one for n=4) each one with 4 elements (first element is 
# the percentage of unique n-grams of BERTlarge VS TBC, second element is the 
# percentage of unique n-grams of BERTbase VS TBC, third element is the percentage
# of unique n-grams of GPT VS TBC, fourth element is the percentage of unique
# n-grams of WT103 VS TBC)
# initialization:
values_grams_VS_TBC = [[0,0,0,0],[0,0,0,0],[0,0,0,0],[0,0,0,0]]
values_grams_VS_TBC[0] = ['BERTlarge', 'BERTbase', 'GPT', 'WT103']

bert-base-uncased


In [None]:
# BERT VS WT103
pct_uniques = ref_unique_ngrams(bert_sents, wiki_data, max_n)
for i in range(1, max_n + 1):
    print("BERT unique %d-grams relative to Wiki: %.2f" % (i, 100 * pct_uniques[i]))
    if (i != 1) and (model_version == 'bert-base-uncased'):
      values_grams_VS_WT103[i-1][1] = 100 * pct_uniques[i]
    elif (i != 1):  #'bert-large-uncased'
      values_grams_VS_WT103[i-1][0] = 100 * pct_uniques[i]

# BERT VS TBC
pct_uniques = ref_unique_ngrams(bert_sents, tbc_data, max_n)
for i in range(1, max_n + 1):
    print("BERT unique %d-grams relative to TBC: %.2f" % (i, 100 * pct_uniques[i]))
    if (i != 1) and (model_version == 'bert-base-uncased'):
      values_grams_VS_TBC[i-1][1] = 100 * pct_uniques[i]
    elif (i != 1):  #'bert-large-uncased'
      values_grams_VS_TBC[i-1][0] = 100 * pct_uniques[i]

# BERT VS BERT
pct_uniques = self_unique_ngrams(bert_sents, max_n)
for i in range(1, max_n + 1):
    print("BERT unique %d-grams relative to self: %.2f" % (i, 100 * pct_uniques[i]))
    if (i != 1) and (model_version == 'bert-base-uncased'):
      values_grams_VS_SELF[i-1][1] = 100 * pct_uniques[i]
    elif (i != 1):  #'bert-large-uncased'
      values_grams_VS_SELF[i-1][0] = 100 * pct_uniques[i]

BERT unique 1-grams relative to Wiki: 9.42
BERT unique 2-grams relative to Wiki: 59.05
BERT unique 3-grams relative to Wiki: 91.80
BERT unique 4-grams relative to Wiki: 98.60
BERT unique 1-grams relative to TBC: 12.40
BERT unique 2-grams relative to TBC: 62.68
BERT unique 3-grams relative to TBC: 92.53
BERT unique 4-grams relative to TBC: 98.67
BERT unique 1-grams relative to self: 12.38
BERT unique 2-grams relative to self: 63.13
BERT unique 3-grams relative to self: 92.38
BERT unique 4-grams relative to self: 98.24


From the table we see that the BERT model with more parameters (BERT Large) gives better results than the one with fewer parameters (BERT Base): its percentage of unique n-grams is always higher (for n=2, n=3 and n=4), meaning more diverse generated sentences. By the same reasoning we conclude that the sentences generated with BERT are more diverse than those generated with GPT.

In [None]:
# GPT VS WT103
pct_uniques = ref_unique_ngrams(openai_sents, wiki_data, max_n)
for i in range(1, max_n + 1):
    print("GPT unique %d-grams relative to Wiki: %.2f" % (i, 100 * pct_uniques[i]))
    if (i != 1):
      values_grams_VS_WT103[i-1][2] = 100 * pct_uniques[i]

# GPT VS TBC
pct_uniques = ref_unique_ngrams(openai_sents, tbc_data, max_n)
for i in range(1, max_n + 1):
    print("GPT unique %d-grams relative to TBC: %.2f" % (i, 100 * pct_uniques[i]))
    if (i != 1):
      values_grams_VS_TBC[i-1][2] = 100 * pct_uniques[i]

# GPT VS GPT
pct_uniques = self_unique_ngrams(openai_sents, max_n)
for i in range(1, max_n + 1):
    print("GPT unique %d-grams relative to self: %.2f" % (i, 100 * pct_uniques[i]))
    if (i != 1):
      values_grams_VS_SELF[i-1][2] = 100 * pct_uniques[i]

GPT unique 1-grams relative to Wiki: 3.15
GPT unique 2-grams relative to Wiki: 34.11
GPT unique 3-grams relative to Wiki: 73.93
GPT unique 4-grams relative to Wiki: 91.90
GPT unique 1-grams relative to TBC: 2.10
GPT unique 2-grams relative to TBC: 26.05
GPT unique 3-grams relative to TBC: 66.06
GPT unique 4-grams relative to TBC: 89.14
GPT unique 1-grams relative to self: 4.42
GPT unique 2-grams relative to self: 31.83
GPT unique 3-grams relative to self: 68.60
GPT unique 4-grams relative to self: 88.60


In the following block we fill the last row of Table 2 (regarding WT103). Remember that the WT103 on the rows is a sample of 1000 sentences from the training dataset (we sampled after removing the titles from the original training dataset), whereas the WT103 on the columns is a sample of 5000 sentences from the test set (see https://github.com/nyu-dl/bert-gen/blob/master/data/wiki103.5k.txt ).

In [None]:
# WT103 (1000) VS WT103 (1000) (SELF)
pct_uniques = self_unique_ngrams(wiki1000_data, max_n)
for i in range(1, max_n + 1):
    print("WT103 unique %d-grams relative to self(1000): %.2f" % (i, 100 * pct_uniques[i]))
    if (i != 1):
      values_grams_VS_SELF[i-1][3] = 100 * pct_uniques[i]

# WT103 (1000) VS WT103 (5000)
pct_uniques = ref_unique_ngrams(wiki1000_data, wiki_data, max_n)
for i in range(1, max_n + 1):
    print("WT103 unique %d-grams relative to WT103(5000): %.2f" % (i, 100 * pct_uniques[i]))
    if (i != 1):
      values_grams_VS_WT103[i-1][3] = 100 * pct_uniques[i]

# WT103 (1000) VS TBC
pct_uniques = ref_unique_ngrams(wiki1000_data,  tbc_data, max_n)
for i in range(1, max_n + 1):
    print("WT103 unique %d-grams relative to TBC: %.2f" % (i, 100 * pct_uniques[i]))
    if (i != 1):
      values_grams_VS_TBC[i-1][3] = 100 * pct_uniques[i]

WT103 unique 1-grams relative to self(1000): 7.05
WT103 unique 2-grams relative to self(1000): 51.79
WT103 unique 3-grams relative to self(1000): 85.86
WT103 unique 4-grams relative to self(1000): 96.76
WT103 unique 1-grams relative to WT103(5000): 7.52
WT103 unique 2-grams relative to WT103(5000): 50.67
WT103 unique 3-grams relative to WT103(5000): 85.55
WT103 unique 4-grams relative to WT103(5000): 96.81
WT103 unique 1-grams relative to TBC: 10.18
WT103 unique 2-grams relative to TBC: 57.59
WT103 unique 3-grams relative to TBC: 89.34
WT103 unique 4-grams relative to TBC: 97.94


__________
__________
# TABLES



Since we want to replicate the results of the original paper, we collect them in tables built with the following code block.

In [None]:
import plotly.graph_objects as go
# Three tables of the n-gram percentages against SELF, WT103, TBC.
Table2_n_grams_SELF = go.Figure(
    data=[go.Table(
        header=dict(values=TITLE1),
        cells=dict(values=values_grams_VS_SELF))
                     ])
Table2_n_grams_SELF.show()
# values_grams_VS_SELF is completed during the process.

Table2_n_grams_WT103 = go.Figure(
    data=[go.Table(
        header=dict(values=TITLE2),
        cells=dict(values=values_grams_VS_WT103))
                     ])
Table2_n_grams_WT103.show()
# values_grams_VS_WT103 is completed during the process.

Table2_n_grams_TBC = go.Figure(
    data=[go.Table(
        header=dict(values=TITLE3),
        cells=dict(values=values_grams_VS_TBC))
                     ])
Table2_n_grams_TBC.show()
# values_grams_VS_TBC is completed during the process.

We also add a table concerning Self-BLEU and one for Corpus-BLEU.

In [None]:
# Self-BLEU
Table2_self_bleu = go.Figure(
    data=[go.Table(
        header=dict(values=TITLE_SELF),
        cells=dict(values=values_self_bleu))
                     ])
Table2_self_bleu.show()
# values_self_bleu is completed during the process.

# Corpus - BLEU
Table3_corpus_bleu = go.Figure(
    data=[go.Table(
        header=dict(values=TITLE_CORPUS),
        cells=dict(values=values_corpus_bleu))
                     ])
Table3_corpus_bleu.show()
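
The value lists above are column-wise: the first inner list holds the model names and each following list holds one column of numbers, which is the layout that `go.Table` cells expect. As a plotly-free sketch of the same layout, the hypothetical helper `columns_to_rows` below transposes the columns into tab-separated text rows. The two scores are the self-BLEU values printed earlier in this notebook.

```python
def columns_to_rows(header, columns):
    """Turn column-wise values (the layout passed to go.Table) into text rows."""
    rows = [header] + [list(r) for r in zip(*columns)]
    return ["\t".join(str(c) for c in row) for row in rows]


toy_header = ['Model', 'Self-BLEU']
toy_columns = [['BERTbase', 'GPT'], [8.49, 38.09]]
for line in columns_to_rows(toy_header, toy_columns):
    print(line)
```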


# TEXYGEN


## Introductory commands

In [None]:
!git clone https://github.com/geek-ai/Texygen.git
%cd Texygen
# we clone the Texygen repository from github

In [None]:
!pip install -r requirements.txt
#these are the libraries required for the Texygen models

In [None]:
%tensorflow_version 1.x
#Some functions of Texygen require the old version of tensorflow

In [None]:
import nltk
nltk.download('punkt')

## Texygen tutorial:

python main.py -g GAN type -t training method -d data location

  -g GAN type : 
    specify the GAN type in the experiment

    (GAN type = seqgan | maligan | rankgan | leakgan | gsgan | textgan | mle)

  -t training method :
    specify the training method in the experiment

    (training method = oracle | cfg | real  ;  default is oracle)

  -d data location : 
    use user's own dataset, only available with real data training 
    (default is 'data/image_coco.txt')

more details: https://github.com/geek-ai/Texygen/blob/master/docs/doc.md

The models inside Texygen are: 
- seqgan 
- maligan 
- rankgan 
- leakgan 
- gsgan 
- textgan 
- mle
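
The options listed above can be assembled programmatically. The sketch below only builds the argument list and does not run Texygen. `texygen_command` is a hypothetical helper, and its checks simply mirror the option values quoted in the tutorial text.

```python
GAN_TYPES = {"seqgan", "maligan", "rankgan", "leakgan", "gsgan", "textgan", "mle"}
TRAINING_METHODS = {"oracle", "cfg", "real"}


def texygen_command(gan_type, training="oracle", data=None):
    """Build the argument list for `python main.py` from the options above."""
    if gan_type not in GAN_TYPES:
        raise ValueError("unknown GAN type: %s" % gan_type)
    if training not in TRAINING_METHODS:
        raise ValueError("unknown training method: %s" % training)
    cmd = ["python", "main.py", "-g", gan_type, "-t", training]
    if data is not None:
        if training != "real":
            raise ValueError("-d is only available with real data training")
        cmd += ["-d", data]
    return cmd


print(" ".join(texygen_command("mle", "real", "wiki_train_1000_samples.txt")))
# python main.py -g mle -t real -d wiki_train_1000_samples.txt
```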

## Texygen models

In [None]:
!python main.py -g mle -t real -d 'wiki_train_1000_samples.txt'


## Texygen metrics

In [None]:
import os
import sys
sys.path.insert(1, os.path.join(".", "Texygen/utils"))

In [None]:
import os
from multiprocessing import Pool

import nltk
from nltk.translate.bleu_score import SmoothingFunction                    

# import Texygen metrics
from utils.metrics.Metrics import Metrics

from Texygen.utils.metrics.Bleu import Bleu
from Texygen.utils.metrics.SelfBleu import SelfBleu
from Texygen.utils.metrics.EmbSim import EmbSim
from Texygen.utils.metrics.Nll import Nll
from Texygen.utils.metrics.UniqueGram import UniqueGram


**BERT VALUES**

In [None]:
print("BERT-WIKI BLEU: %.2f" % (100 *Bleu.get_score(Bleu('Bert_using_pytorch.txt', wiki103_file))))
print("BERT-self-BLEU: %.2f" % (100 *SelfBleu.get_score(SelfBleu('Bert_using_pytorch.txt')))) 
print("BERT-unique4grams: %.2f" % (100 *UniqueGram.get_score(UniqueGram('Bert_using_pytorch.txt',4))))
print("BERT-unique3grams: %.2f" % (100 *UniqueGram.get_score(UniqueGram('Bert_using_pytorch.txt',3))))
print("BERT-unique2grams: %.2f" % (100 *UniqueGram.get_score(UniqueGram('Bert_using_pytorch.txt',2))))

BERT-WIKI BLEU: 8.43
BERT-self-BLEU: 16.27
BERT-unique4grams: 3410.90
BERT-unique3grams: 3383.60
BERT-unique2grams: 2664.00


**GPT VALUES**

In [None]:
print("GPT-WIKI BLEU: %.2f" % (100 *Bleu.get_score(Bleu('openaitext.txt', wiki103_file))))
print("GPT-self-BLEU: %.2f" % (100 *SelfBleu.get_score(SelfBleu('openaitext.txt')))) 
print("GPT-unique4grams: %.2f" % (100 *UniqueGram.get_score(UniqueGram('openaitext.txt',4))))
print("GPT-unique3grams: %.2f" % (100 *UniqueGram.get_score(UniqueGram('openaitext.txt',3))))
print("GPT-unique2grams: %.2f" % (100 *UniqueGram.get_score(UniqueGram('openaitext.txt',2))))

GPT-WIKI BLEU: 9.98
GPT-self-BLEU: 49.60
GPT-unique4grams: 3132.60
GPT-unique3grams: 2688.70
GPT-unique2grams: 1548.80


**MLE VALUES**

In [None]:
print("MLE-WIKI BLEU: %.2f" % (100 *Bleu.get_score(Bleu('/content/Texygen/save/test_file.txt', wiki103_file))))
print("MLE-self-BLEU: %.2f" % (100 *SelfBleu.get_score(SelfBleu('/content/Texygen/save/test_file.txt')))) 
print("MLE-unique4grams: %.2f" % (100 *UniqueGram.get_score(UniqueGram('/content/Texygen/save/test_file.txt',4))))
print("MLE-unique3grams: %.2f" % (100 *UniqueGram.get_score(UniqueGram('/content/Texygen/save/test_file.txt',3))))
print("MLE-unique2grams: %.2f" % (100 *UniqueGram.get_score(UniqueGram('/content/Texygen/save/test_file.txt',2))))

MLE-WIKI BLEU: 12.61
MLE-self-BLEU: 25.08
MLE-unique4grams: 5390.45
MLE-unique3grams: 4743.94
MLE-unique2grams: 2420.25


ATTENTION on Self-BLEU: We observe results that differ from those of the paper implementation. The reason is that the self_bleu function used for the BERT model (defined earlier in this notebook) makes a single call to corpus_bleu, with every sentence as a hypothesis and all the others as its references. The Texygen platform uses a slightly different definition of self-BLEU, based on the sentence_bleu command, and this leads to different results.

In the following we will use the Texygen implementation, but be aware of this.
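
To see why a corpus-level score and an average of sentence-level scores can disagree, consider clipped unigram precision alone. This is a deliberately simplified sketch: real BLEU combines several n-gram orders with a brevity penalty, and the helper `clipped_counts` and the toy sentences are assumptions for illustration, not the NLTK or Texygen code.

```python
from collections import Counter


def clipped_counts(hyp, refs):
    """Return (matched, total) unigram counts for one hypothesis, BLEU-style clipping."""
    hyp_counts = Counter(hyp)
    max_ref = Counter()
    for ref in refs:
        for tok, c in Counter(ref).items():
            max_ref[tok] = max(max_ref[tok], c)
    matched = sum(min(c, max_ref[tok]) for tok, c in hyp_counts.items())
    return matched, len(hyp)


sents = [["a", "b", "c", "d"], ["a", "b", "c", "d"], ["x"]]
pairs = [clipped_counts(s, sents[:i] + sents[i + 1:]) for i, s in enumerate(sents)]

# Pooled (corpus-level) precision: sum the counts first, divide once.
micro = sum(m for m, _ in pairs) / sum(t for _, t in pairs)
# Averaged (sentence-level) precision: divide per sentence, then average.
macro = sum(m / t for m, t in pairs) / len(pairs)
print(round(micro, 3), round(macro, 3))  # 0.889 0.667
```

The short sentence with no overlap barely moves the pooled value but pulls the per-sentence average down by a third.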

ATTENTION on unique n-grams: as with self-BLEU, the Texygen definition of unique n-grams differs from the one used in the BERT evaluation above. One of the main reasons is that the denominator in the Texygen implementation is the number of sentences, whereas in the "correct" definition of unique n-grams it should be the total number of n-grams, which is clearly a larger number (the BERT evaluation uses this second definition). As a consequence, Texygen gives much higher values than the BERT definition, and they can exceed 100.

In the following we will use the Texygen implementation, but be aware of this.
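
The effect of the denominator can be checked on a toy input. In the sketch below only the denominator differs between the two percentages. Taking the number of distinct n-grams as numerator is an assumption made for illustration (the exact Texygen numerator is not shown in this notebook), and `distinct_ngrams` is a hypothetical helper.

```python
def distinct_ngrams(sents, n):
    """Set of distinct n-grams over all sentences."""
    return {tuple(s[k:k + n]) for s in sents for k in range(len(s) - n + 1)}


sents = [["a", "b", "c", "d", "e"], ["f", "g", "h", "i", "j"]]
n = 2
distinct = len(distinct_ngrams(sents, n))           # 8 distinct bigrams
total = sum(max(len(s) - n + 1, 0) for s in sents)  # 8 bigram occurrences

per_ngram = 100 * distinct / total          # bounded by 100
per_sentence = 100 * distinct / len(sents)  # grows with sentence length
print(per_ngram, per_sentence)  # 100.0 400.0
```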

____________________
____________________
# COMPARISON WITH TRANSFORMER XL AND XL-NET

In [None]:
!pip install transformers

In [None]:
from transformers import pipeline

In [None]:
transfo_xl_generator = pipeline('text-generation', model='transfo-xl-wt103')
xlnet_generator = pipeline('text-generation', model='xlnet-base-cased')

In [None]:
with open('wt40.txt') as f:
  content = f.readlines()

# the context manager closes both output files even if generation fails
with open('transfxl_gen.txt', 'w') as f2, open('xlnet_gen.txt', 'w') as f3:
  for line in content:
    #prompt = line[:-1]
    prompt = line.rstrip('\n')
    # Transformer-XL generation for this prompt
    res2 = transfo_xl_generator(prompt, max_length=40, do_sample=True, temperature=0.9)
    print('TRANSFO_XL:'+res2[0]['generated_text'])
    # XLNet generation for the same prompt
    res3 = xlnet_generator(prompt, max_length=40, do_sample=True, temperature=0.9)
    print('XL_NET:'+res3[0]['generated_text']+'\n')
    f2.write(res2[0]['generated_text']+"\n")
    f3.write(res3[0]['generated_text']+"\n")

## Comparison using Texygen metrics

In [None]:
print("transfoxl-WIKI BLEU: %.2f" % (100 *Bleu.get_score(Bleu('transfxl_gen.txt', wiki103_file))))
print("transfoxl-self-BLEU: %.2f" % (100 *SelfBleu.get_score(SelfBleu('transfxl_gen.txt')))) 
print("transfoxl-unique4grams: %.2f" % (100 *UniqueGram.get_score(UniqueGram('transfxl_gen.txt',4))))
print("transfoxl-unique3grams: %.2f" % (100 *UniqueGram.get_score(UniqueGram('transfxl_gen.txt',3))))
print("transfoxl-unique2grams: %.2f" % (100 *UniqueGram.get_score(UniqueGram('transfxl_gen.txt',2))))

transfoxl-WIKI BLEU: 11.65
transfoxl-self-BLEU: 12.26
transfoxl-unique4grams: 2985.00
transfoxl-unique3grams: 3030.00
transfoxl-unique2grams: 2805.00


In [None]:
print("xlnet-WIKI BLEU: %.2f" % (100 *Bleu.get_score(Bleu('xlnet_gen.txt', wiki103_file))))
print("xlnet-self-BLEU: %.2f" % (100 *SelfBleu.get_score(SelfBleu('xlnet_gen.txt')))) 
print("xlnet-unique4grams: %.2f" % (100 *UniqueGram.get_score(UniqueGram('xlnet_gen.txt',4))))
print("xlnet-unique3grams: %.2f" % (100 *UniqueGram.get_score(UniqueGram('xlnet_gen.txt',3))))
print("xlnet-unique2grams: %.2f" % (100 *UniqueGram.get_score(UniqueGram('xlnet_gen.txt',2))))

xlnet-WIKI BLEU: 13.71
xlnet-self-BLEU: 11.78
xlnet-unique4grams: 3015.00
xlnet-unique3grams: 3057.50
xlnet-unique2grams: 2800.00


## Comparison using metrics from the Evaluation part above


In [None]:
#wiki103_file = 'datawiki103.5k.txt'
#wiki_data = prepare_wiki(wiki103_file)
transfxl_data = prepare_data('transfxl_gen.txt')
xlnet_data = prepare_data('xlnet_gen.txt')

In [None]:
#TRANSFORMER-XL EVALUATION (VS WIKI AND SELF)

value = corpus_bleu(transfxl_data, wiki_data)
print("transfoXL-WIKI BLEU: %.2f" % (100 * value))
value = self_bleu(transfxl_data)
print("transfoXL self-BLEU: %.2f" % (100 * value))

pct_uniques = ref_unique_ngrams(transfxl_data, wiki_data, max_n)
for i in range(1, max_n + 1):
    print("transfoXL unique %d-grams relative to Wiki: %.2f" % (i, 100 * pct_uniques[i]))

pct_uniques = self_unique_ngrams(transfxl_data, max_n)
for i in range(1, max_n + 1):
    print("transfoXL unique %d-grams relative to self: %.2f" % (i, 100 * pct_uniques[i]))

transfoXL-WIKI BLEU: 12.08
transfoXL self-BLEU: 5.25
transfoXL unique 1-grams relative to Wiki: 18.77
transfoXL unique 2-grams relative to Wiki: 62.11
transfoXL unique 3-grams relative to Wiki: 90.89
transfoXL unique 4-grams relative to Wiki: 98.76
transfoXL unique 1-grams relative to self: 41.39
transfoXL unique 2-grams relative to self: 86.42
transfoXL unique 3-grams relative to self: 97.61
transfoXL unique 4-grams relative to self: 99.43


In [None]:
#XL-NET EVALUATION (VS WIKI AND SELF)

value = corpus_bleu(xlnet_data, wiki_data)
print("XLNet-WIKI BLEU: %.2f" % (100 * value))
value = self_bleu(xlnet_data)
print("XLNet self-BLEU: %.2f" % (100 * value))

pct_uniques = ref_unique_ngrams(xlnet_data, wiki_data, max_n)
for i in range(1, max_n + 1):
    print("XLNet unique %d-grams relative to Wiki: %.2f" % (i, 100 * pct_uniques[i]))

pct_uniques = self_unique_ngrams(xlnet_data, max_n)
for i in range(1, max_n + 1):
    print("XLNet unique %d-grams relative to self: %.2f" % (i, 100 * pct_uniques[i]))

XLNet-WIKI BLEU: 14.35
XLNet self-BLEU: 0.00


The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()


XLNet unique 1-grams relative to Wiki: 18.96
XLNet unique 2-grams relative to Wiki: 61.54
XLNet unique 3-grams relative to Wiki: 89.92
XLNet unique 4-grams relative to Wiki: 98.29
XLNet unique 1-grams relative to self: 42.44
XLNet unique 2-grams relative to self: 85.94
XLNet unique 3-grams relative to self: 98.17
XLNet unique 4-grams relative to self: 99.81
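
The warning above explains the self-BLEU of 0.00: BLEU is a geometric mean of the n-gram precisions, so a single order with no overlap drives the whole score to zero. The sketch below shows this with a simplified, hypothetical `simple_bleu` function (one reference, no brevity penalty) and an additive-epsilon smoothing of the numerator, in the spirit of the smoothing the warning suggests. It is only an illustration under these assumptions, not the NLTK implementation and not the `self_bleu` function used in this notebook.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(reference, hypothesis, n):
    """Return (clipped n-gram matches, number of hypothesis n-grams)."""
    hyp_counts = Counter(ngrams(hypothesis, n))
    ref_counts = Counter(ngrams(reference, n))
    matches = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return matches, max(sum(hyp_counts.values()), 1)


def simple_bleu(reference, hypothesis, max_n=4, epsilon=0.0):
    """Geometric mean of the 1..max_n-gram precisions (brevity penalty omitted)."""
    log_sum = 0.0
    for n in range(1, max_n + 1):
        matches, total = modified_precision(reference, hypothesis, n)
        if matches == 0:
            if epsilon == 0.0:
                return 0.0     # one zero precision zeroes the geometric mean
            matches = epsilon  # smoothing: replace the zero count with epsilon
        log_sum += math.log(matches / total) / max_n
    return math.exp(log_sum)


# Hypothetical example: the two sentences share no 4-gram
reference = "the cat sat on the mat".split()
hypothesis = "the cat is on the mat".split()
unsmoothed = simple_bleu(reference, hypothesis)
smoothed = simple_bleu(reference, hypothesis, epsilon=0.1)
print("unsmoothed: %.4f, smoothed: %.4f" % (unsmoothed, smoothed))
```

Without smoothing the score is exactly zero even though the lower-order n-grams overlap; with smoothing it becomes a small positive value, so hypotheses that differ only in lower-order overlap can still be told apart.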


In [None]:
#XL-NET vs TRANSFORMER-XL

value = corpus_bleu(xlnet_data, transfxl_data)
print("XLNet-transfoXL BLEU: %.2f" % (100 * value))
value = corpus_bleu(transfxl_data, xlnet_data)
print("transfoXL-XLNet BLEU: %.2f" % (100 * value))

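# n-grams of XLNet that do not appear in the Transformer-XL generations
pct_uniques = ref_unique_ngrams(xlnet_data, transfxl_data, max_n)
for i in range(1, max_n + 1):
    print("XLNet unique %d-grams relative to transfoXL: %.2f" % (i, 100 * pct_uniques[i]))

# n-grams of Transformer-XL that do not appear in the XLNet generations
# (computed outside the loop above, so it does not overwrite its values)
pct_uniques = ref_unique_ngrams(transfxl_data, xlnet_data, max_n)
for i in range(1, max_n + 1):
    print("transfoXL unique %d-grams relative to XLNet: %.2f" % (i, 100 * pct_uniques[i]))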

XLNet-transfoXL BLEU: 24.84
transfoXL-XLNet BLEU: 25.87
XLNet unique 1-grams relative to transfoXL: 31.00
XLNet unique 2-grams relative to transfoXL: 68.68
XLNet unique 3-grams relative to transfoXL: 81.69
XLNet unique 4-grams relative to transfoXL: 86.63
transfoXL unique 1-grams relative to XLNet: 29.48
transfoXL unique 2-grams relative to XLNet: 68.68
transfoXL unique 3-grams relative to XLNet: 81.69
transfoXL unique 4-grams relative to XLNet: 86.63
