### Getting to Know HuggingFace GPT-2

This notebook contains test code that will later be moved into the train.py file.

In [2]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel
import torch

# Load pre-trained model tokenizer (vocabulary)
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Tokenize input
text = "The movie is about"
indexed_tokens = tokenizer.encode(text)

# Convert the indexed tokens into a PyTorch tensor
tokens_tensor = torch.tensor([indexed_tokens])

# Load pre-trained model (weights)
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Set the model to evaluation mode to deactivate the dropout modules
# This is IMPORTANT for reproducible results during evaluation!
model.eval()

# Use a GPU when one is available, otherwise fall back to the CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokens_tensor = tokens_tensor.to(device)
model.to(device)

# Predict all tokens
with torch.no_grad():
    outputs = model(tokens_tensor)
    predictions = outputs[0]

# Get the predicted next sub-word
predicted_index = torch.argmax(predictions[0, -1, :]).item()
predicted_text = tokenizer.decode(indexed_tokens + [predicted_index])
print(predicted_text)


The movie is about a
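
The slice `predictions[0, -1, :]` in the cell above picks the scores at the last position of the first sequence, because that position is the one that predicts the next token. Below is a minimal sketch of the same indexing on a toy stand-in, assuming a made-up 4-word vocabulary and hand-written scores. The names `vocab`, `logits` and `argmax` are hypothetical and are not part of the transformers API.

```python
# Toy stand-in for outputs[0]: scores laid out as (batch=1, seq_len=3, vocab=4).
# The vocabulary and the numbers are invented for illustration only.
vocab = ["The", "movie", "is", "about"]
logits = [[
    [0.1, 2.0, 0.3, 0.2],  # scores after the 1st token
    [0.0, 0.1, 3.0, 0.5],  # scores after the 2nd token
    [0.2, 0.1, 0.4, 2.5],  # scores after the 3rd (last) token
]]


def argmax(values):
    """Return the index of the largest value (the first one on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


# Same slice as predictions[0, -1, :]: first sequence, last position, all vocab scores.
last_position = logits[0][-1]
predicted_index = argmax(last_position)
predicted_word = vocab[predicted_index]
print(predicted_word)
```

Only the last row matters for the next-token prediction; the earlier rows are the model's guesses for positions whose actual tokens are already known.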


In [4]:
model.save_pretrained("./model/gpt2_movie_description")

In [154]:
print(f"Number of parameters in GPT-2 model: {sum(p.numel() for p in model.parameters()):,}")
# model.named_parameters()

Number of parameters in GPT-2 model: 124,439,808
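
As a sanity check, the count above can be reproduced by hand. The sketch below assumes the hyperparameters of the small `gpt2` checkpoint (768-dimensional embeddings, 12 layers, a vocabulary of 50,257 tokens, 1,024 positions) and assumes the output head shares its weights with the token embedding, so that matrix is counted once. The helper `linear` is a hypothetical name used only for this arithmetic.

```python
# Hyperparameters of the small 'gpt2' checkpoint (assumed, not read from the model).
vocab_size, n_positions, n_embd, n_layer = 50257, 1024, 768, 12


def linear(n_in, n_out):
    """Parameter count of a dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out


layer_norm = 2 * n_embd  # scale and shift vectors

# One transformer block: two layer norms, attention (QKV + output projection),
# and a feed-forward network with a 4x hidden expansion.
block = (
    layer_norm
    + linear(n_embd, 3 * n_embd)   # fused query/key/value projection
    + linear(n_embd, n_embd)       # attention output projection
    + layer_norm
    + linear(n_embd, 4 * n_embd)   # feed-forward expansion
    + linear(4 * n_embd, n_embd)   # feed-forward projection back
)

token_embedding = vocab_size * n_embd      # also serves as the tied output head
position_embedding = n_positions * n_embd

total = token_embedding + position_embedding + n_layer * block + layer_norm
print(f"{total:,}")
```

About 31% of the parameters sit in the token embedding alone, and the 12 blocks account for most of the rest.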


In [146]:
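model.eval()

text = tokenizer("Star Wars", return_tensors='pt')
tokens = text['input_ids'].to(model.device)  # keep the inputs on the model's device
print(tokens)
# Id of the end-of-text token, used as the stop signal
eof = tokenizer.encode('<|endoftext|>', return_tensors='pt').item()
print(eof)

max_len = 128  # stop once the sequence reaches this many tokens

# Sample one token at a time; gradients are not needed for inference
with torch.no_grad():
    while tokens.shape[1] < max_len:
        n_inp = tokens
        output = model(n_inp)
        # output[0] holds the logits; softmax turns them into probabilities
        probs = output[0].softmax(dim=-1)
        # Draw the next token from the distribution at the last position
        pred = torch.multinomial(probs[0, -1, :], 1)
        tokens = torch.cat([n_inp, pred.view(1, 1)], dim=-1)
        if pred.item() == eof:
            break
# n_inp is the sequence as it was before the final sampled token was appended
n_inp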


tensor([[8248, 6176]], device='cuda:0')
50256


tensor([[ 8248,  6176,     1,   550, 14869,   422, 15589,   338,  2656,  1621,
            11,   366,   464,  4586, 16147,     1,   284,   366, 27676, 15339,
           290, 10598,   553,  1642,   262,  2551,  2562,    13,  2935, 32968,
           416,  6240, 20320,   371,    13,   775,   600,   430,   549,   355,
           366, 10262,  1758,   262,  2851,   553,  2935, 26674,   274,   318,
           783,   262,  1266,    12,  4002,  4014,   290, 19466,  4014,   508,
         13831,   257,  4334,  2756,   329,   262,  1943,   286,   465,  1492,
          8670,  1505,  6176,   357, 11528,   828,   543, 14999,  1657,   319,
           262,   779,   290, 22036,   286, 18423,  5010,   287,   428,  6980,
            13,   198,   198, 29011,    11,  1810,   373, 11791,   416,   867,
          2458,    13, 12168,  8581, 15434,  8793, 12411,  8088,  1390,  1266,
          4286,    11,  1266,  8674,    11,  1266,  8674,  4286,    11,   290,
          1266,  1492,    13,   679,   635,  6492,  

In [147]:
print(tokenizer.decode(n_inp.view(-1)))

Star Wars" had shifted from 1977's original story, "The Last Jedi" to "Garuda and Rome," making the decision easy. Described by professor Nicholas R. Weintraub as "Agape the Dragon," Descartes is now the best-known critic and libertarian critic who pays a heavy price for the success of his book Opium Wars (2008), which shed light on the use and monopoly of pharmaceutical drugs in this era.

Nevertheless, War was accompanied by many changes. Several Academy Awards achieved notable reviews including best picture, best actor, best actor picture, and best book. He also obtained professional


We can see that the pre-trained GPT-2 produces fluent, detailed sentences, but the output is nowhere near the target of a short movie description.
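
The two cells above pick the next token in different ways: the first takes the `argmax` (greedy), while the generation loop samples with `torch.multinomial`. The sketch below contrasts the two on a toy model, using only the standard library. `toy_logits` is a hypothetical stand-in for the real model call, and the 5-token vocabulary is invented. Unlike `n_inp` in the notebook, the returned list includes the final sampled token.

```python
import math
import random


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_logits(tokens, vocab_size=5):
    """Hypothetical stand-in for the model: favours (last token + 1) mod vocab_size."""
    last = tokens[-1]
    return [1.0 if i == (last + 1) % vocab_size else 0.0 for i in range(vocab_size)]


def generate(prompt, eos_id, max_len, rng, greedy=False):
    """Append tokens until eos_id is produced or max_len is reached."""
    tokens = list(prompt)
    while len(tokens) < max_len:
        probs = softmax(toy_logits(tokens))
        if greedy:
            # Greedy decoding: always take the most likely token.
            nxt = max(range(len(probs)), key=probs.__getitem__)
        else:
            # Sampling: draw a token in proportion to its probability.
            nxt = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        tokens.append(nxt)
        if nxt == eos_id:
            break
    return tokens


greedy_out = generate([0], eos_id=4, max_len=10, rng=random.Random(0), greedy=True)
sampled_out = generate([0], eos_id=4, max_len=10, rng=random.Random(0))
print(greedy_out)
print(sampled_out)
```

Greedy decoding is deterministic, while sampling returns a different continuation for each random seed, which is why the generated "Star Wars" text changes between runs.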

In [90]:
tokenizer.encode("<|endoftext|>", return_tensors='pt').view(-1)

tensor([50256])