# Evaluation of GPT Datasets

In [2]:
%pip install transformers datasets accelerate evaluate

Installing collected packages: tokenizers, nvidia-ml-py3, xxhash, multidict, frozenlist, dill, charset-normalizer, async-timeout, yarl, responses, multiprocess, huggingface-hub, aiosignal, accelerate, transformers, aiohttp, datasets, evaluate
Successfully installed accelerate-0.17.1 aiohttp-3.8.4 aiosignal-1.3.1 async-timeout-4.0.2 charset-normalizer-3.1.0 datasets-2.10.1 dill-0.3.6 evaluate-0.4.0 frozenlist-1.3.3 huggingface-hub-0.13.2 multidict-6.0.4 multiprocess-0.70.14 nvidia-ml-py3-7.352.0 responses-0.18.0 tokenizers-0.13.2 transformers-4.27.1 xxhash-3.2.0 yarl-1.8.2


In [3]:
# get the file data
from transformers import GPT2Tokenizer, GPT2LMHeadModel

In [4]:
# mount the drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [37]:
# unzip model config files (google drive only)
!unzip /content/drive/MyDrive/GPTModels/model_setup_5000_3.zip -d /content/models

Archive:  /content/drive/MyDrive/GPTModels/model_setup_5000_3.zip
replace /content/models/content/model_config/tokenizer_config.json? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: /content/models/content/model_config/tokenizer_config.json  
replace /content/models/content/model_config/config.json? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: /content/models/content/model_config/config.json  
replace /content/models/content/model_config/special_tokens_map.json? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: /content/models/content/model_config/special_tokens_map.json  
replace /content/models/content/model_config/generation_config.json? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: /content/models/content/model_config/generation_config.json  
replace /content/models/content/model_config/merges.txt? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: /content/models/content/model_config/merges.txt  
replace /content/models/content/model_config/pytorch_model.b

In [41]:
# read in data
from datasets import Dataset
import pandas as pd

# google drive version
filename = '/content/drive/MyDrive/GPTModels/5000_booksummaries.zip' #data/5000_booksummaries.zip'
tokens_df = pd.read_csv(filename)
tokens_df.head(5)

Unnamed: 0,Text
0,Generate a book summary with genres Science Fi...
1,Generate a book summary with genres Fantasy:\n...
2,Generate a book summary with genres Crime Fict...
3,"Generate a book summary with genres Fiction, N..."
4,"Generate a book summary with genres War novel,..."


In [42]:
# split data into train and test/eval data
from sklearn.model_selection import train_test_split

# split into train (80%), val (10%), test (10%)
train_data, test_eval_dataset = train_test_split(tokens_df, test_size=0.2, random_state=8)
eval_set, test_set = train_test_split(test_eval_dataset, test_size=0.5, random_state=8)

# create HuggingFace Datasets
train_ds = Dataset.from_pandas(train_data)
eval_ds = Dataset.from_pandas(eval_set)
test_ds = Dataset.from_pandas(test_set)

In [49]:
# change dir depending on where it is

# finetuned
"""checkpoint = '/content/models/content/model_config'
model = GPT2LMHeadModel.from_pretrained(checkpoint)
tokenizer = GPT2Tokenizer.from_pretrained(checkpoint)"""

# vanilla model
model = GPT2LMHeadModel.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

Downloading (…)lve/main/config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/548M [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

In [53]:
# THIS JUST GENERATES ONE OUTPUT!

# load input prompt
input_prompt = "Generate a book summary with genre science fiction:\n"
inputs = tokenizer(input_prompt, return_tensors="pt")

# generate output from pretrained experiments (see baseline file)
outputs = model.generate(**inputs, 
    max_length=150, 
    num_beams=2, 
    no_repeat_ngram_size=2, 
    do_sample=True,
    early_stopping=True)

# decode output and print out summary
output = tokenizer.batch_decode(outputs, skip_special_tokens=True)
print(output[0])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generate a book summary with genre science fiction:

The Science Fiction and Fantasy Book Short Story Series: The short story series is a series of short stories written by writers who've written short fiction for television, film, and video games. They have been published in more than 100 languages, including English, French, Spanish, Italian, Japanese, Korean, Chinese, Portuguese, Swedish, Dutch, German, Norwegian, Danish, Polish, Hungarian, Romanian, Russian, Serbian, Ukrainian, Slovene, Turkish, Greek, Arabic, Finnish, Hebrew, Hindi, Indonesian, Malayalam, Vietnamese, Cambodian, Mandarin Chinese (Traditional), Korean (Simplified), Vietnamese (Mandarin), Japanese (Japanese), Portuguese (Brazilian), Romanian


In [10]:
# use BERTScores to analyze
%pip install bert_score
from evaluate import load
bertscore = load("bertscore")

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting bert_score
  Downloading bert_score-0.3.13-py3-none-any.whl (61 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m61.1/61.1 KB[0m [31m5.5 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: bert_score
Successfully installed bert_score-0.3.13


Downloading builder script:   0%|          | 0.00/7.95k [00:00<?, ?B/s]

In [57]:
# helper functions
def truncate_to_prompt(whole_text):
    tok = whole_text.index(':')
    return whole_text[:tok+2] # returns text with new line

def generate_summary_from_prompt(prompt):
    inputs = tokenizer(prompt, return_tensors="pt")

    # generate output from pretrained experiments , just comment out params from num_beans to end if no good decoding
    outputs = model.generate(**inputs), 
        max_length=150, 
        num_beams=2, 
        no_repeat_ngram_size=2, 
        do_sample=True,
        early_stopping=True)

    # decode output and return out summary
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]

In [58]:
# run model to generate predictions
references = []
predictions = []
truncated_test_inputs = []

counter = 0
for example in test_ds:
    # stop scoring at 15
    if counter == 15:
      break
    
    input = example["Text"]
    prompt_only = truncate_to_prompt(input)
    truncated_test_inputs.append(prompt_only)
    references.append(input)

    # make predictions
    predictions.append(generate_summary_from_prompt(prompt_only))
    counter += 1

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Input length of input_ids is 21, but `max_length` is set to 20. This can lead to unexpected behavior. You should consider increasing `max_new_tokens`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-e

In [61]:
results = bertscore.compute(predictions=predictions, references=references, lang="en")

In [62]:
# print results and stats
print("Raw Results")
print('PRECISION: ' + str(results['precision']))
print('RECALL: ' + str(results['recall']))
print('F1: ' + str(results['f1']))
print()

def avg(number_list):
  return sum(number_list)/len(number_list)

print("Averages")
print('PRECISION: ' + str(avg(results['precision'])))
print('RECALL: ' + str(avg(results['recall'])))
print('F1: '  + str(avg(results['f1'])))
print()
print("Max Values")
print('PRECISION: ' + str(max(results['precision'])))
print('RECALL: ' + str(max(results['recall'])))
print('F1: ' + str(max(results['f1'])))

Raw Results
PRECISION: [0.8866522908210754, 0.8581110239028931, 0.9579274654388428, 0.9162920713424683, 0.8598588705062866, 0.8525220155715942, 0.8655290603637695, 0.8911129832267761, 0.8867733478546143, 0.8907850384712219, 0.9089034795761108, 0.9170349836349487, 0.9110207557678223, 0.8657287359237671, 0.8620990514755249]
RECALL: [0.7635667324066162, 0.7781015038490295, 0.7941382527351379, 0.7529852390289307, 0.7635701894760132, 0.7598407864570618, 0.7669138312339783, 0.8274396061897278, 0.8009669780731201, 0.7693344354629517, 0.7538388967514038, 0.7590856552124023, 0.7636767625808716, 0.7625319957733154, 0.7663823366165161]
F1: [0.820519208908081, 0.8161500692367554, 0.868377149105072, 0.8266504406929016, 0.8088590502738953, 0.8035176396369934, 0.8132427930831909, 0.858096718788147, 0.8416889309883118, 0.8256171941757202, 0.8241406083106995, 0.830618143081665, 0.8308669328689575, 0.8108601570129395, 0.8114277124404907]

Averages
PRECISION: 0.8886900782585144
RECALL: 0.7721582134564717

In [63]:
print(predictions[0])
print("SPLIT")
print(references[0])

Generate a book summary with genres Speculative fiction:

A book summary with genres Spe
SPLIT
Generate a book summary with genres Speculative fiction:
 Diana Londen, werewolf, works as the city manger of a small South Carolina town, while moonlighting for the Verdaville Police Department as a police dog. While helping investigate a brutal murder, Diana learns she's not the only magical creature in town. A female vampire has decided to make Verdaville her murderous playground. But Diana is not the only one after the vampire. Llyr Galatyn is the king of the Cachamwri Sidhe--an other-worldly warrior with fantastic abilities. He has sworn to take down the murderous vampire and he's willing to give Diana any help she needs. And not just with the case. Diana is in her Burning Moon, a time of sexual heat for werewolves. When needs rides her hard, Llyr is delighted to answer her erotic prayers. As they hunt for the rogue vampire, an even more deadly enemy urges the vampire to turn her sights 