In [1]:
#!pip uninstall -y transformers && pip install git+https://github.com/huggingface/transformers

In [2]:
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast
# from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
import pandas as pd

# Introduction

Over the course of this experiment, I found that these AI models are great at producing believable text but not great at fulfilling the actual task of finishing the story.

# Choice of Text and Models

I chose the ROCStories dataset, a large collection of short five-sentence stories (a sample row is shown below).

I chose GPT-2 and Phi-2 because they seemed like powerful text models that would successfully accomplish the task set out for them.

In [3]:
df = pd.read_csv('https://goo.gl/0OYkPK')
df.head(1)

| Unnamed: 0 | storyid | storytitle | sentence1 | sentence2 | sentence3 | sentence4 | sentence5 |
| - | - | - | - | - | - | - | - |
| 0 | 8bbe6d11-1e2e-413c-bf81-eaea05f4f1bd | David Drops the Weight | David noticed he had put on a lot of weight re... | He examined his habits to try and figure out t... | He realized he'd been eating too much fast foo... | He stopped going to burger places and started ... | After a few weeks, he started to feel much bet... |


# Findings

| Text | Model | Accuracy (qualitative) | Loss | Perplexity |
| - | - | - | - | - |
| story1 | GPT-2 | Repeated the last sentence a lot. | 2.986330270767212 | 19.812841415405273 |
| story1 | Phi-2 | The text is exactly the same. | 2.1900503635406494 | 8.935663223266602 |
| story2 | GPT-2 | Gave different variations of a sentence until it started to repeat. | 3.617241859436035 | 37.23472595214844 |
| story2 | Phi-2 | Made a quiz question out of the text. | 2.820509910583496 | 16.7854061126709 |

## Interpretation

GPT-2's perplexity is higher than Phi-2's on both stories (about 19.8 vs. 8.9 on story1 and 37.2 vs. 16.8 on story2). A higher perplexity means the model is less certain about the actual text: at each step, GPT-2 is effectively choosing among more plausible next tokens than Phi-2.

The loss tells the same story, since perplexity is computed as the exponential of the loss: GPT-2 has the higher loss on both stories (about 2.99 vs. 2.19 and 3.62 vs. 2.82).

On accuracy, Phi-2 did better as well. GPT-2 kept repeating the last sentence, while Phi-2 did not finish the story properly either but at least produced something different: a quiz question based on the text.
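The loss and perplexity columns carry the same information, because the evaluation code computes perplexity as the exponential of the mean per-token loss. A minimal sketch that re-derives the table's perplexities from its losses; the helper name `perplexity_from_loss` is my own and is not part of the notebook:

```python
import math

def perplexity_from_loss(mean_nll: float) -> float:
    """Perplexity is exp of the mean negative log-likelihood per token (natural log)."""
    return math.exp(mean_nll)

# Loss values copied from the Findings table.
losses = {
    ("story1", "GPT-2"): 2.986330270767212,
    ("story1", "Phi-2"): 2.1900503635406494,
    ("story2", "GPT-2"): 3.617241859436035,
    ("story2", "Phi-2"): 2.820509910583496,
}

for (story, model_name), loss in losses.items():
    print(f"{story} {model_name}: perplexity = {perplexity_from_loss(loss):.2f}")
```

Because the exponential is monotonic, whichever model has the lower loss on a story will also have the lower perplexity, so the two metrics cannot disagree.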

# Conclusion

Overall, I think Phi-2 is better than GPT-2 for this test. It comes out ahead in every category (accuracy, loss, and perplexity) for this use case.

# Code

In [5]:
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Define the text you want to evaluate
# text = "Once upon a time, there was a bug."
text = df["storytitle"][1]

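# Define additional context: the five sentences of the story.
# Note: the sentences are concatenated with no separator, so the text reads
# "temper.One day ..."; the outputs recorded below were produced with this exact input.
additional_text = df['sentence1'][1] + df['sentence2'][1] + df['sentence3'][1] + df['sentence4'][1] + df['sentence5'][1]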

# Combine the input text and additional context
text_with_context = f"{text} {additional_text}"

# Define the model and tokenizer
device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "openai-community/gpt2"
model = GPT2LMHeadModel.from_pretrained(model_id).to(device)
tokenizer = GPT2TokenizerFast.from_pretrained(model_id)

try:
    # Tokenize the text with context
    inputs = tokenizer(text_with_context, return_tensors="pt", max_length=100, truncation=True)

    # Prepare the input tensor
    input_ids = inputs.input_ids.to(device)

    # Prepare the target labels, ignoring the prompt
    prompt_len = len(tokenizer(text, return_tensors="pt")["input_ids"][0])
    target_ids = input_ids.clone()
    target_ids[:, :prompt_len] = -100

    # Compute the loss
    with torch.no_grad():
        outputs = model(input_ids, labels=target_ids)
        neg_log_likelihood = outputs.loss

    # Compute perplexity
    perplexity = torch.exp(neg_log_likelihood)

    print(f"Loss: {neg_log_likelihood.item()}")
    print(f"Perplexity: {perplexity.item()}")

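except Exception as e:
    print(f"An error occurred: {e}")
else:
    # Debugging outputs (printed only if the evaluation above succeeded,
    # so input_ids and target_ids are guaranteed to exist)
    print("Input Text with Context:", text_with_context)
    print("Tokenized Input IDs:", input_ids)
    print("Target IDs:", target_ids)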


Loss: 3.617241859436035
Perplexity: 37.23472595214844
Input Text with Context: Frustration Tom had a very short temper.One day a guest made him very angry.He punched a hole in the wall of his house.Tom's guest became afraid and left quickly.Tom sat on his couch filled with regret about his actions.
Tokenized Input IDs: tensor([[ 6732, 44027,  4186,   550,   257,   845,  1790,  4124,    13,  3198,
          1110,   257,  8319,   925,   683,   845,  7954,    13,  1544, 25436,
           257,  7604,   287,   262,  3355,   286,   465,  2156,    13, 13787,
           338,  8319,  2627,  7787,   290,  1364,  2952,    13, 13787,  3332,
           319,   465, 18507,  5901,   351, 13721,   546,   465,  4028,    13]],
       device='cuda:0')
Target IDs: tensor([[ -100,  -100,  4186,   550,   257,   845,  1790,  4124,    13,  3198,
          1110,   257,  8319,   925,   683,   845,  7954,    13,  1544, 25436,
           257,  7604,   287,   262,  3355,   286,   465,  2156,    13, 13787,
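
Setting a label to `-100` excludes that position from the loss, so only the story tokens (not the title) are scored. Below is a pure-Python sketch of that averaging, assuming the model's per-position log-probabilities are already available. The helper `masked_mean_nll` and the toy probabilities are hypothetical; in the notebook the real computation happens inside `GPT2LMHeadModel` on tensors.

```python
import math

IGNORE_INDEX = -100  # label value that the loss skips

def masked_mean_nll(next_token_log_probs, labels):
    """Mean negative log-likelihood over positions whose label is not IGNORE_INDEX.

    next_token_log_probs[i] is the log-probability assigned to the token at
    position i + 1, so it is paired with labels[i + 1] (the usual causal shift).
    """
    shifted_labels = labels[1:]
    kept = [
        -log_prob
        for log_prob, label in zip(next_token_log_probs, shifted_labels)
        if label != IGNORE_INDEX
    ]
    if not kept:
        raise ValueError("every label is masked; nothing to average")
    return sum(kept) / len(kept)

# Toy example: 2 masked title tokens followed by 3 story tokens.
labels = [IGNORE_INDEX, IGNORE_INDEX, 4186, 550, 257]
# Hypothetical probabilities the model assigned to the tokens at positions 1..4.
log_probs = [math.log(p) for p in (0.5, 0.1, 0.2, 0.4)]

loss = masked_mean_nll(log_probs, labels)
print(f"loss = {loss:.4f}, perplexity = {math.exp(loss):.4f}")
```

In this toy case only the probabilities 0.1, 0.2, and 0.4 are scored. Their geometric mean is 0.2, so the perplexity is 5; the 0.5 assigned to the second title token has no effect on the result.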
         

In [6]:
# Set the model to evaluation mode
model.eval()

# Generate text
with torch.no_grad():
    output = model.generate(input_ids=input_ids, max_length=150, num_return_sequences=1)

# Decode the generated output
output_text = tokenizer.decode(output[0], skip_special_tokens=True)

print("Generated Output Text:", output_text)


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generated Output Text: Frustration Tom had a very short temper.One day a guest made him very angry.He punched a hole in the wall of his house.Tom's guest became afraid and left quickly.Tom sat on his couch filled with regret about his actions.Tom was a very good student.He was a good student.Tom was a good student.Tom was a good student.Tom was a good student.Tom was a good student.Tom was a good student.Tom was a good student.Tom was a good student.Tom was a good student.Tom was a good student.Tom was a good student.Tom was a good student.Tom was a good student.Tom was a good student.Tom was a good student.Tom was a
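
The looping in the output above could be quantified rather than judged by eye. A small sketch, assuming sentences can be split on periods (a simplification) and using a hypothetical helper `repeated_sentence_fraction`; the sample string is a shortened excerpt in the style of the generated text, not the full output:

```python
from collections import Counter

def repeated_sentence_fraction(text: str) -> float:
    """Fraction of sentences that duplicate an earlier sentence in the text."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    if not sentences:
        return 0.0
    counts = Counter(sentences)
    # Every occurrence beyond the first counts as a repeat.
    repeats = sum(n - 1 for n in counts.values())
    return repeats / len(sentences)

# Shortened, illustrative excerpt: one distinct sentence, then one sentence three times.
sample = (
    "Tom sat on his couch.Tom was a good student."
    "Tom was a good student.Tom was a good student."
)
print(repeated_sentence_fraction(sample))
```

A value near 0 means almost every sentence is new, and a value near 1 means the model is mostly repeating itself. `generate` also accepts options such as `no_repeat_ngram_size` and `repetition_penalty`, which may reduce this kind of looping; they were not used in this run.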
