# Contrastive Search with Galactica and Hugging Face Transformers

[aicrumb](https://twitter.com/aicrumb)

From the Hugging Face blog post introducing contrastive search: https://huggingface.co/blog/introducing-csearch

> Natural language generation (i.e. text generation) is one of the core tasks in natural language processing (NLP). In this blog, we introduce the current state-of-the-art decoding method, ___Contrastive Search___, for neural text generation. Contrastive search was originally proposed in _"A Contrastive Framework for Neural Text Generation"_ [1] ([[Paper]](https://arxiv.org/abs/2202.06417) [[Official Implementation]](https://github.com/yxuansu/SimCTG)) at NeurIPS 2022. Moreover, in the follow-up work, _"Contrastive Search Is What You Need For Neural Text Generation"_ [2] ([[Paper]](https://arxiv.org/abs/2210.14140) [[Official Implementation]](https://github.com/yxuansu/Contrastive_Search_Is_What_You_Need)), the authors further demonstrate that contrastive search can generate human-level text using **off-the-shelf** language models across **16** languages.
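
To make the idea concrete, here is a minimal sketch of the per-step selection rule from the contrastive search paper: among the top-k candidate tokens, pick the one that maximizes `(1 - alpha) * probability - alpha * (max cosine similarity to the context's hidden states)`. The function names, the toy probabilities, and the 2-dimensional "hidden states" are made up for illustration; this is not the code that `transformers` runs internally.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_pick(candidates, context_states, alpha=0.6):
    """Pick one token from the top-k candidates (hypothetical helper).

    candidates: list of (token, probability, hidden_state) tuples.
    context_states: hidden states of the tokens already in the context.
    """
    best_token, best_score = None, -math.inf
    for token, prob, hidden in candidates:
        # Degeneration penalty: how close the candidate is to anything already said.
        penalty = max(cosine(hidden, h) for h in context_states)
        score = (1 - alpha) * prob - alpha * penalty
        if score > best_score:
            best_token, best_score = token, score
    return best_token

# Toy data, invented for this example.
context_states = [[1.0, 0.0], [0.0, 1.0]]
candidates = [
    ("the", 0.5, [1.0, 0.0]),      # most likely, but identical to a context state
    ("model", 0.4, [-1.0, -1.0]),  # slightly less likely, dissimilar to the context
]

print(contrastive_pick(candidates, context_states))             # prints: model
print(contrastive_pick(candidates, context_states, alpha=0.0))  # prints: the
```

With `alpha=0.0` the rule reduces to picking the most probable candidate; raising `alpha` trades likelihood for dissimilarity to what has already been generated.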

In [None]:
#@title install

from IPython.display import clear_output
!pip install git+https://github.com/paperswithcode/galai -q
clear_output()

In [3]:
#@title load model and test normal generation

import galai as gal
model = gal.load_model("base", num_gpus=1)
clear_output()
model.generate("The Transformer architecture [START_REF]")

'The Transformer architecture [START_REF] Attention is All you Need, Vaswani[END_REF] is a popular choice for sequence-to-sequence models. It consists of a stack of encoder and decoder layers, each of which is composed of a multi-head self-attention mechanism and a feed-forward network. The encoder'

In [6]:
#@title contrastive generation

from galai.utils import escape_custom_split_sequence
import torch

input_text = "The Transformer architecture [START_REF]"
max_new_tokens = 256

texts = [escape_custom_split_sequence(input_text)]

# Treat the prompt as the start of a new document by prepending the pad token.
new_doc = True
if new_doc:
    pad_id = model.tokenizer.padding["pad_id"]
    pad_token = model.tokenizer.id_to_token(pad_id)
    texts = [pad_token + t for t in texts]

list_encoded = model.tokenizer.encode_batch(texts)
context_tokens = [encoded.ids for encoded in list_encoded]
input_v = torch.LongTensor(context_tokens).to(model.model.device)

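# In transformers, passing penalty_alpha > 0 together with top_k > 1
# makes generate() use contrastive search.
out = model.model.generate(
    input_v,
    max_new_tokens=max_new_tokens,
    return_dict_in_generate=True,
    output_hidden_states=True,
    penalty_alpha=0.6,  # weight of the degeneration penalty
    top_k=4,            # number of candidate tokens considered at each step
)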

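decoded = model.tokenizer.decode_batch(
    out['sequences'].tolist(),
    skip_special_tokens=False
)[0]

# Drop the leading pad token as a whole prefix. str.lstrip('<pad>') would
# instead strip any leading run of the characters '<', 'p', 'a', 'd', '>'.
output = decoded[len('<pad>'):] if decoded.startswith('<pad>') else decoded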

print(output)

The Transformer architecture [START_REF] Attention is All you Need, Vaswani[END_REF] is used for the encoder and decoder.

We use 12 layers for the encoder and 6 layers for the decoder. The embedding size is 512 and the hidden size is 2048. The dropout rate is 0.1. The Adam optimizer is used with β 1 = 0.9, β 2 = 0.98, and = 10 −9 . The learning rate is 0.0001 for pretraining and 0.00001 for fine-tuning.

# 4.2 Results

Table 1 shows the results of our model and previous state-of-the-art models. We can see that our model performs competitively with the previous SOTA models, which demonstrates the effectiveness of our model. In addition, we compare the performance of different pretraining methods, as shown in Table 2. It can be seen that BERT-like models are better than RoBERTa-like models, which is consistent with the results of [START_REF] How Multilingual is Multilingual BERT?, Pires[END_REF]. The reason is that BERT can capture long-range dependencies,
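
A note on the decoding cell above: when removing the leading `<pad>` token from decoded text, keep in mind that `str.lstrip` takes a set of characters, not a prefix. The sketch below shows the difference on a made-up string; `strip_leading_pad` is a hypothetical helper written for this example, not part of `galai`.

```python
def strip_leading_pad(text, pad_token="<pad>"):
    """Remove every leading copy of pad_token, treating it as a whole prefix."""
    while pad_token and text.startswith(pad_token):
        text = text[len(pad_token):]
    return text

sample = "<pad>data augmentation"  # invented example text

# Character-set stripping also eats the "da" of "data".
print(sample.lstrip("<pad>"))     # prints: ta augmentation

# Prefix stripping leaves the generated text intact.
print(strip_leading_pad(sample))  # prints: data augmentation
```

For the prompt used here the generated text starts with "The", so both approaches give the same result; the difference only shows up when the text after the pad token begins with one of the characters `<`, `p`, `a`, `d`, or `>`.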
