# Transformers-based Language Model: GPT-2

- Code adapted from [A Comprehensive Guide to Build Your Own Language Model in Python](https://medium.com/analytics-vidhya/a-comprehensive-guide-to-build-your-own-language-model-in-python-5141b3917d6d)
- Uses the OpenAI GPT-2 language model (a Transformer-based model) through the `transformers` library

In [11]:
!pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/2c/4e/4f1ede0fd7a36278844a277f8d53c21f88f37f3754abf76a5d6224f76d4a/transformers-3.4.0-py3-none-any.whl (1.3MB)
Collecting tokenizers==0.9.2
  Downloading https://files.pythonhosted.org/packages/7c/a5/78be1a55b2ac8d6a956f0a211d372726e2b1dd2666bb537fea9b03abd62c/tokenizers-0.9.2-cp36-cp36m-manylinux1_x86_64.whl (2.9MB)
Installing collected packages: tokenizers, transformers
Successfully installed tokenizers-0.9.2 transformers-3.4.0


In [1]:
# Import required libraries
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Load pre-trained model tokenizer (vocabulary)
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Encode the input text into a list of token ids
text = "What is the fastest car in the"
indexed_tokens = tokenizer.encode(text)

# Convert the token ids into a PyTorch tensor with a batch dimension
tokens_tensor = torch.tensor([indexed_tokens])

# Load pre-trained model (weights)
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Set the model in evaluation mode to deactivate the DropOut modules
model.eval()

# Use the GPU when one is available, otherwise stay on the CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokens_tensor = tokens_tensor.to(device)
model.to(device)

# Predict all tokens
with torch.no_grad():
    outputs = model(tokens_tensor)
    predictions = outputs[0]

# Get the predicted next sub-word
predicted_index = torch.argmax(predictions[0, -1, :]).item()
predicted_text = tokenizer.decode(indexed_tokens + [predicted_index])

# Print the predicted word
print(predicted_text)

ModuleNotFoundError: No module named 'torch'
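
The cell above picks the next sub-word greedily: it takes the scores at the last position (`predictions[0, -1, :]`) and keeps the index with the highest score. Since the cell did not run here, the following is a minimal pure-Python sketch of that selection step only. The four-word vocabulary, the logits, and the helper names `softmax` and `greedy_next_token` are made up for illustration and do not come from GPT-2.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_next_token(logits_per_position):
    """Return (index, probability) of the most likely token at the last position."""
    last = logits_per_position[-1]          # same role as predictions[0, -1, :]
    probs = softmax(last)
    best = max(range(len(probs)), key=probs.__getitem__)  # argmax
    return best, probs[best]

# Hypothetical vocabulary and made-up logits for a two-token prompt.
toy_vocab = ["world", "city", "race", "garage"]
toy_logits = [
    [0.1, 0.2, 0.3, 0.4],    # scores after the first prompt token (ignored)
    [2.5, 0.3, 1.1, -0.7],   # scores after the last prompt token
]

index, prob = greedy_next_token(toy_logits)
print(toy_vocab[index], round(prob, 3))
```

Taking the argmax of the raw logits gives the same index as taking it after the softmax; the softmax is included only so the choice can be read as a probability.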

In [13]:
!git clone https://github.com/huggingface/transformers.git

Cloning into 'transformers'...
remote: Enumerating objects: 11, done.
remote: Counting objects: 100% (11/11), done.
remote: Compressing objects: 100% (10/10), done.
remote: Total 50085 (delta 1), reused 0 (delta 0), pack-reused 50074
Receiving objects: 100% (50085/50085), 36.98 MiB | 27.11 MiB/s, done.
Resolving deltas: 100% (34943/34943), done.


In [9]:
!ls transformers/examples

adversarial	     lightning_base.py	 seq2seq
benchmarking	     longform-qa	 test_examples.py
bert-loses-patience  lxmert		 test_xla_examples.py
bertology	     movement-pruning	 text-classification
conftest.py	     multiple-choice	 text-generation
contrib		     question-answering  token-classification
deebert		     rag		 xla_spawn.py
distillation	     README.md
language-modeling    requirements.txt


## Text Generation Using GPT-2 with `run_generation.py`

- [Write with Transformer](https://transformer.huggingface.co/)



In [41]:
!python transformers/examples/text-generation/run_generation.py \
    --model_type=gpt2 \
    --model_name_or_path=gpt2 \
    --length=100

2020-11-07 23:30:03.018575: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
11/07/2020 23:30:16 - INFO - __main__ -   Namespace(device=device(type='cuda'), fp16=False, k=0, length=100, model_name_or_path='gpt2', model_type='gpt2', n_gpu=1, no_cuda=False, num_return_sequences=1, p=0.9, padding_text='', prefix='', prompt='', repetition_penalty=1.0, seed=42, stop_token=None, temperature=1.0, xlm_language='')
Model prompt >>> The world is a 
Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence
=== GENERATED SEQUENCE 1 ===
The world is a vernacular of 21st century cinema, and Hollywood has created some impressive collections of great films, series, and episodes on all sorts of topics. The year is always full of sensational screen casting, competition films, time travel, and many other exciting trivia related moments. There are even various levels of passion and enthusiasm for video games like Halo o
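
The `Namespace` line in the log above lists the sampling settings for this run: `p=0.9`, `k=0`, and `temperature=1.0`. Assuming `p` is the top-p (nucleus) threshold and `k=0` means no top-k cut-off, the sketch below shows the idea behind top-p sampling in pure Python. It is a simplified illustration, not the library's implementation; `top_p_filter`, `sample_top_p`, and the toy logits are hypothetical names and values.

```python
import math
import random

def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose total mass reaches p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    # Renormalize so the kept probabilities sum to 1.
    return {i: probs[i] / mass for i in kept}

def sample_top_p(logits, p=0.9, temperature=1.0, rng=random):
    """Sample one token index from the nucleus of the softmax distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    nucleus = top_p_filter(probs, p)
    ids = list(nucleus)
    return rng.choices(ids, weights=[nucleus[i] for i in ids], k=1)[0]

# Filtering a made-up distribution: the two least likely tokens are dropped.
nucleus = top_p_filter([0.5, 0.3, 0.15, 0.04, 0.01], p=0.9)
print(sorted(nucleus))

# Sampling repeatedly from made-up logits with a fixed seed.
rng = random.Random(42)
toy_logits = [3.0, 2.0, 1.0, -2.0, -5.0]
samples = [sample_top_p(toy_logits, p=0.9, rng=rng) for _ in range(200)]
print(sorted(set(samples)))
```

Because low-probability tokens are removed before sampling, the output stays varied without drifting into very unlikely continuations.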

## Text Generation Using GPT-2 with the `pipeline` API

In [23]:
from transformers import pipeline, set_seed
generator = pipeline('text-generation', model='gpt2')
set_seed(42)
generator("Hello, I'm a language model,", max_length=30, num_return_sequences=5)


Some weights of GPT2Model were not initialized from the model checkpoint at gpt2 and are newly initialized: ['h.0.attn.masked_bias', 'h.1.attn.masked_bias', 'h.2.attn.masked_bias', 'h.3.attn.masked_bias', 'h.4.attn.masked_bias', 'h.5.attn.masked_bias', 'h.6.attn.masked_bias', 'h.7.attn.masked_bias', 'h.8.attn.masked_bias', 'h.9.attn.masked_bias', 'h.10.attn.masked_bias', 'h.11.attn.masked_bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence


[{'generated_text': "Hello, I'm a language model, I didn't want to work with another language like Rust. I'm making more time for Rust. I just"},
 {'generated_text': 'Hello, I\'m a language model, and I think this is a great time to look at how languages and their dependencies do it," he said.'},
 {'generated_text': "Hello, I'm a language model, and writing one is like writing a dictionary. I can put any number of objects together, and it keeps the"},
 {'generated_text': "Hello, I'm a language model, I'm a grammar model\n\nLinguistics [ edit ]\n\nThe term linguistics refers to the"},
 {'generated_text': "Hello, I'm a language model, not a system programming language. The language we were writing in Java was very basic and it was pretty much a"}]

In [42]:
generator("Nowaways lots of people need ", max_length=30, num_return_sequences=5)

Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence


[{'generated_text': "Nowaways lots of people need iced water or a cold shower with a baby that they need to get through their kids' school day and to get"},
 {'generated_text': 'Nowaways lots of people need \xa0to get better at doing business, and this is certainly something I can do within the family, but in our'},
 {'generated_text': "Nowaways lots of people need ____________, most don't have ______________________ or many don't want to do ________/ ________________ ______"},
 {'generated_text': "Nowaways lots of people need \xa0a good job...it might have cost them something a few months away.\nThat's why I did the"},
 {'generated_text': "Nowaways lots of people need iced tea and the best iced teas contain lots of vitamins. And they just can't afford tea and they"}]
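
As the outputs above show, the generator returns a list of dictionaries, each with a `generated_text` key whose value repeats the prompt before the continuation. A small sketch of how the continuations could be separated from the prompt follows; `continuations` is a hypothetical helper, and the two sample entries are copied from the first output above so the snippet runs without loading the model.

```python
def continuations(results, prompt):
    """Strip the prompt from each generated text and return the remainder."""
    out = []
    for item in results:
        text = item["generated_text"]
        # Only strip when the text really starts with the prompt.
        out.append(text[len(prompt):] if text.startswith(prompt) else text)
    return out

prompt = "Hello, I'm a language model,"
results = [
    {"generated_text": "Hello, I'm a language model, I'm a grammar model\n\nLinguistics [ edit ]\n\nThe term linguistics refers to the"},
    {"generated_text": "Hello, I'm a language model, not a system programming language. The language we were writing in Java was very basic and it was pretty much a"},
]

for text in continuations(results, prompt):
    print(repr(text.strip()))
```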

## Transforming Texts into Features

In [37]:
# PyTorch version (commented out; the TensorFlow version below is the one that was run)
# from transformers import GPT2Tokenizer, GPT2Model
# tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
# model = GPT2Model.from_pretrained('gpt2')
# text = "Replace me by any text you'd like."
# encoded_input = tokenizer(text, return_tensors='pt')  # return PyTorch tensors
# output = model(**encoded_input)


# TensorFlow version: tokenize the text and run it through the base GPT-2 model
from transformers import GPT2Tokenizer, TFGPT2Model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = TFGPT2Model.from_pretrained('gpt2')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')  # return TensorFlow tensors
output = model(encoded_input)  # hidden states for each input token
print(encoded_input)

All model checkpoint layers were used when initializing TFGPT2Model.

All the layers of TFGPT2Model were initialized from the model checkpoint at gpt2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2Model for predictions without further training.


{'input_ids': <tf.Tensor: shape=(1, 10), dtype=int32, numpy=
array([[3041, 5372,  502,  416,  597, 2420,  345, 1549,  588,   13]],
      dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(1, 10), dtype=int32, numpy=array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], dtype=int32)>}
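
The cell above prints only the tokenizer output; the features themselves are in `output`, whose first element should hold one hidden-state vector per input token (here 10 tokens). One common way to turn those per-token vectors into a single fixed-size feature vector is masked mean pooling. The sketch below illustrates that step with plain Python lists and made-up three-dimensional vectors. This is an assumption about how the features might be used, not something the notebook does, and `mean_pool` is a hypothetical helper.

```python
def mean_pool(hidden_states, attention_mask):
    """Average the token vectors where the mask is 1; one vector per sequence."""
    pooled = []
    for tokens, mask in zip(hidden_states, attention_mask):
        kept = [vec for vec, m in zip(tokens, mask) if m == 1]
        if not kept:
            raise ValueError("sequence has no unmasked tokens")
        dim = len(kept[0])
        pooled.append([sum(vec[d] for vec in kept) / len(kept) for d in range(dim)])
    return pooled

# Made-up batch: 1 sequence, 3 tokens, 3 dimensions; the last token is padding.
toy_hidden = [[[1.0, 2.0, 3.0], [3.0, 4.0, 5.0], [100.0, 100.0, 100.0]]]
toy_mask = [[1, 1, 0]]

features = mean_pool(toy_hidden, toy_mask)
print(features)
```

In the cell above every entry of `attention_mask` is 1, so masking would change nothing; it matters once several texts of different lengths are padded into one batch.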
