In [1]:
import transformers, torch



## Setup GPT-2

Using the Hugging Face `distilgpt2` model, we set up a pretrained language model, that is, a model that already contains a good deal of semantic information about the words in text.

In [2]:
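tokenizer = transformers.AutoTokenizer.from_pretrained("distilgpt2")
# GPT-2 has no pad token, so reuse the end-of-text token for padding.
tokenizer.pad_token = tokenizer.eos_token

def preprocess_function(examples):
    # Join each example's answer strings into one text, then tokenize with truncation.
    return tokenizer([" ".join(x) for x in examples["answers.text"]], truncation=True)

# mlm=False selects causal (next-token) language modeling rather than masked LM.
data_collator = transformers.DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)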
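To make the collator's role concrete, here is a minimal pure-Python sketch of what a causal-LM collator does to a batch: right-pad the sequences, build an attention mask, and copy the inputs into labels with pad positions set to -100. The function name `collate_causal_lm` and the small token ids are hypothetical and only for illustration; this is not the library's implementation. One consequence worth noting, assuming the pad token is the end-of-text token as in the cell above: genuine end-of-text tokens are masked out of the labels as well.

```python
# Illustrative stand-in for a causal-LM collator (mlm=False); not the library code.
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def collate_causal_lm(batch, pad_token_id):
    """Right-pad lists of token ids to a common length; build masks and labels."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask, labels = [], [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        padded = ids + [pad_token_id] * n_pad
        input_ids.append(padded)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
        # Labels copy the inputs; every position holding the pad id is ignored.
        labels.append([IGNORE_INDEX if t == pad_token_id else t for t in padded])
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


EOS = 50256  # GPT-2's end-of-text id, reused as the pad id
# Arbitrary small ids stand in for real tokens; the second sequence ends in a real EOS.
batch = collate_causal_lm([[11, 12, 13], [21, EOS]], pad_token_id=EOS)
print(batch["labels"])
```

The labels are not shifted here; for GPT-2 style models the one-position shift between inputs and targets happens inside the model's loss computation.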

In [31]:
model = transformers.AutoModelForCausalLM.from_pretrained("distilgpt2")
model.eval()

In [49]:
# Tokenize straight to PyTorch tensors so input_ids and attention_mask are passed together.
inputs = tokenizer("Hi this is a sentence", return_tensors="pt")
outputs = model.generate(**inputs, max_length=50)
tokenizer.decode(outputs[0], skip_special_tokens=True, clean_up_tokenization_spaces=True)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


'Hi this is a sentence that will be a good part of the year, and I can not wait to finish this week!\n\nThis blog has also been updated with news of the death of our editor-in-chief Jonathan Chait. Here'
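The `max_length=50` argument above counts the prompt tokens together with the newly generated ones. The loop below is a rough sketch of greedy decoding under that rule, written in plain Python with a toy stand-in for the model. `greedy_generate` and `toy_next_token` are hypothetical names, and the real `generate` method supports sampling, beam search, and batching that this sketch leaves out.

```python
def greedy_generate(next_token_fn, prompt_ids, max_length, eos_token_id):
    """Append one token at a time until EOS or until the total length hits max_length."""
    ids = list(prompt_ids)
    while len(ids) < max_length:  # max_length includes the prompt tokens
        next_id = next_token_fn(ids)
        ids.append(next_id)
        if next_id == eos_token_id:
            break
    return ids


def toy_next_token(ids):
    # Hypothetical stand-in for a language model: predicts "previous id + 1".
    return ids[-1] + 1


# A 3-token prompt with max_length=6 leaves room for exactly 3 new tokens.
length_limited = greedy_generate(toy_next_token, [1, 2, 3], max_length=6, eos_token_id=99)
# Generation stops early once the EOS id (9 here) is produced.
eos_limited = greedy_generate(toy_next_token, [7], max_length=50, eos_token_id=9)
print(length_limited, eos_limited)
```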

In [51]:
# Alternative method for doing inference; might be useful after the model is trained.
# Can essentially make the adaptor and image model part of a tokenizer
inferer = transformers.pipeline(task="text-generation", model=model, tokenizer=tokenizer)
inferer("Hi this is a sentence")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Hi this is a sentence we used because the words (e.g.,) are not a sentence we used because the phrase "the world" is not a sentence we used because the term "the world" can be applied. For instance, this'}]