<a href="https://colab.research.google.com/github/UMWordLab/surprisal_with_minicons/blob/main/EnglishGPT_2_Surprisal.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# GPT-2 English Surprisals for Experimental Stimuli (CSV or single sentence)

This script was created by [Yizhi Tang](https://github.com/tangyizhi2000) and [Lisa Levinson](https://lisalevinson.github.io/) for use in the [WordLab](https://umwordlab.github.io/), and uses the [Minicons](https://github.com/kanishkamisra/minicons) package by [Kanishka Misra](https://kanishka.website/).

This notebook calculates word-level (lexical) surprisal for English using GPT-2.

We use Minicons, a Python library that automates probability computations for transformer LMs that are accessible through the HuggingFace transformers package.

One challenge in matching GPT-2 surprisals with experimental stimuli is the tokenization of GPT-2, which uses BPE (byte-pair encoding). For example, GPT-2 tokenizes the word "inflating" into two separate tokens, "infl" and "ating", and Minicons returns a separate value for each of the two tokens. To obtain a surprisal for the full word that matches a word in the stimulus, we sum the surprisals of its sub-words (addition is appropriate because the values are logarithmic).
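The reason addition works is the chain rule: the probability of a word is the product of the conditional probabilities of its sub-word tokens, so the negative log probabilities add. The following is a minimal sketch of that merging step on its own, without loading a model. The function name `merge_subword_surprisals` and the surprisal values are made up for illustration; they are not Minicons output.

```python
def merge_subword_surprisals(token_scores, words):
    """Sum sub-word surprisals until the pieces spell the next whitespace-delimited word."""
    merged = []
    pieces, total, i = '', 0.0, 0
    for token, surprisal in token_scores:
        pieces += token
        total += surprisal
        # A word is complete once the concatenated pieces equal the stimulus word
        if i < len(words) and pieces == words[i]:
            merged.append((pieces, total))
            pieces, total = '', 0.0
            i += 1
    return merged

# Made-up surprisal values, for illustration only (not real GPT-2 output)
token_scores = [('The', 0.0), ('balloon', 9.5), ('was', 2.0), ('infl', 7.25), ('ating', 1.5)]
merged = merge_subword_surprisals(token_scores, 'The balloon was inflating'.split())
print(merged)
```

Here 'infl' and 'ating' end up as a single 'inflating' entry whose value is the sum of the two token values.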

Note that punctuation is evaluated, so the surprisal will differ depending on whether, for example, you include the period at the end of the sentence. The current version of this script combines punctuation with the previous word (in the same way as other BPE combinations) if they are not separated by a space. If you are analyzing a word that is followed by punctuation, consider carefully whether you want the surprisal for the word alone or for the word with its punctuation, which may carry much more information about prosody and syntax.
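If you want both values, one option is to keep the token-level scores for a word and sum them with and without the trailing punctuation tokens. The sketch below is not part of this script: `word_and_punct_surprisal` is a hypothetical helper, it treats any token made up only of characters in `string.punctuation` as punctuation, and the numbers are invented for illustration.

```python
import string

def word_and_punct_surprisal(subtokens):
    """subtokens: (token, surprisal) pairs that make up one whitespace-delimited word.

    Returns (surprisal of the word alone, surprisal of the word plus trailing punctuation).
    """
    # Count trailing tokens that consist only of punctuation characters
    n_punct = 0
    for token, _ in reversed(subtokens):
        if token and all(ch in string.punctuation for ch in token):
            n_punct += 1
        else:
            break
    core = subtokens[:len(subtokens) - n_punct]
    word_only = sum(s for _, s in core)
    with_punct = sum(s for _, s in subtokens)
    return word_only, with_punct

# Made-up values, for illustration only (not real GPT-2 output)
result = word_and_punct_surprisal([('yesterday', 6.5), ('?', 2.25)])
print(result)
```

For a word with no trailing punctuation the two returned values are identical.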

You can use different GPT-2 models by changing the model name in the setup code. Only English has been tested with this script so far, but it should work for other languages whose orthography separates words with spaces.

## Setup Code

In [None]:
# Minicons Installation
# Introduction can be found https://kanishka.xyz/post/minicons-running-large-scale-behavioral-analyses-on-transformer-lms/
# Tutorial and code can be found https://github.com/kanishkamisra/minicons/blob/master/examples/surprisals.md
!pip install minicons
# Import necessary libraries
from minicons import scorer
import torch
from torch.utils.data import DataLoader
import numpy as np
import json
import csv
# Download GPT-2
# Note that HuggingFace GPT-2 comes in several versions (differing in size)
# 'gpt2-medium' below can be replaced with 'gpt2', 'gpt2-large', or 'gpt2-xl'
# See https://huggingface.co/transformers/model_doc/gpt2.html#gpt2model for details
model = scorer.IncrementalLMScorer('gpt2-medium', 'cpu')

## CSV Input

The following block of code takes in a list of sentences and outputs a csv file containing surprisal values. It may take some time for the output file to appear in the left "files" panel after the cell has finished running. 

**input_file**

input_file is a file containing the list of sentences in comma-separated (csv) format. Each row should contain two columns: the first with a sentence label, the second with the full sentence. The script splits each row at the first comma, so the label must not contain a comma.

Note: if there is a [BOM](https://en.wikipedia.org/wiki/Byte_order_mark) at the beginning of the file, this will mangle the label of your first sentence. 
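One possible workaround, which the script below does not use, is to open the file with Python's `utf-8-sig` codec, which drops a leading BOM if one is present. The sketch writes a small temporary file (the file name and contents are just for the demonstration) and compares the label parsed under each encoding.

```python
import os
import tempfile

# Write a tiny CSV that starts with a UTF-8 byte order mark (BOM)
path = os.path.join(tempfile.mkdtemp(), 'input.csv')
with open(path, 'wb') as f:
    f.write(b'\xef\xbb\xbfintr0,The balloons were popping\n')

# Plain utf-8 keeps the BOM as an invisible first character of the label
with open(path, 'r', encoding='utf-8') as f:
    line = f.readline()
    label_with_bom = line[:line.find(',')]

# utf-8-sig drops a leading BOM if one is present
with open(path, 'r', encoding='utf-8-sig') as f:
    line = f.readline()
    label_clean = line[:line.find(',')]

print(repr(label_with_bom), repr(label_clean))
```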

As an example, the first few lines of your file might look like:


```
intr0,The balloons were popping in the car outside of the the party
tran1,The clown was popping the balloons in the car
tran2,The maid was shrinking the laundry in the dryer
intr3,The sweaters were shrinking in the dryer at the cleaners
tran4,The leak was eroding the pipes at the connection
```

**dest_file**

dest_file is the csv file generated by this code containing surprisal values. dest_file is in "long" format and contains four columns: label, word, word_number, surprisal

For example, after the header row, the first few lines of our dest_file for the sentences above would look like this:

```
intr0,The,0,0.0
intr0,balloons,1,12.588163375854492
intr0,were,2,2.1027908325195312
intr0,popping,3,8.180717468261719
intr0,in,4,3.451995849609375
intr0,the,5,1.3590087890625
intr0,car,6,6.585639953613281
intr0,outside,7,5.377586364746094
intr0,of,8,2.3132476806640625
intr0,the,9,0.9443435668945312
intr0,the,10,7.9730224609375
intr0,party,11,5.788078308105469
tran1,The,0,0.0
tran1,clown,1,11.388120651245117
tran1,was,2,2.9219207763671875
```
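Because the output is in long format, looking up the surprisal at a critical word means filtering on label and word_number. The following is a small sketch using the standard `csv` module on a few of the example rows above, held in a string instead of a file. The helper `surprisal_at` is hypothetical and not part of this notebook.

```python
import csv
import io

# A few rows of the example dest_file, including the header row the script writes
sample_output = """label,word,word_number,surprisal
intr0,The,0,0.0
intr0,balloons,1,12.588163375854492
intr0,were,2,2.1027908325195312
intr0,popping,3,8.180717468261719
"""

def surprisal_at(rows, label, word_number):
    """Return (word, surprisal) for one sentence label and word position, or None."""
    for row in rows:
        if row['label'] == label and int(row['word_number']) == word_number:
            return row['word'], float(row['surprisal'])
    return None

rows = list(csv.DictReader(io.StringIO(sample_output)))
target = surprisal_at(rows, 'intr0', 3)
print(target)
```

To read a real output file, the `io.StringIO(...)` argument would be replaced by an open file handle for dest_file.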

In [None]:
# upload your csv in the files tab and change the filename here as needed
input_file = '/content/input.csv'

# rename the output file if desired
dest_file = '/content/output.csv'

# Open both files in a single with-statement so they are closed when the block ends
with open(input_file, 'r') as file, open(dest_file, mode='w', newline='') as csvfile:
  csv_file = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
  # Write the header row of dest_file
  csv_file.writerow(['label', 'word', 'word_number', 'surprisal'])
  for line in file:
    # Parsing input file
    sentence = line[line.find(',')+1:].strip()
    label = line[:line.find(',')]
    # Minicons can take two sentences at once
    # Since we are processing sentences one by one, we use a 'placeholder' at the second position
    # (named 'batch' rather than 'input' to avoid shadowing Python's built-in input())
    batch = [sentence, 'placeholder']
    # Use Minicons to calculate token-level surprisal (negative log likelihood)
    surprisal = model.token_score(batch, surprisal=True, base_two=False)
    sentence = sentence.split()
    word_number = 0
    prev_word = ''
    prev_surprisal = 0
    # the loop here combines multiple sub-words and calculates the
    # final surprisal value for each word
    for word in surprisal[0]:
      prev_word += word[0]
      prev_surprisal += word[1]
      if prev_word == sentence[word_number]:
        row = [label, prev_word, word_number, prev_surprisal]
        word_number += 1
        csv_file.writerow(row)
        prev_word = ''
        prev_surprisal = 0


## Single Sentence Input

The code block below works on one sentence at a time, but otherwise calculates surprisal in the same way as above.

In [None]:
# Takes in a sentence and prints the surprisal value for each word
# This function is a smaller version of the code above
# Sometimes we don't have an entire list of stimuli;
# instead, we only want a quick check of the surprisal of one sentence
def calculate_surprisal(sentence):
  input = [sentence, 'placeholder']
  # token_score() function of Minicons takes in several parameters
  # if surprisal=True, the output value is surprisal instead of log likelihood
  # if base_two=True, the log likelihood will be in base 2
  # see the Minicons documentation for details
  surprisal = model.token_score(input, surprisal=True, base_two=False)
  word_number = 0
  prev_word = ''
  prev_surprisal = 0
  sentence = sentence.split()
  for word in surprisal[0]:
    prev_word += word[0]
    prev_surprisal += word[1]
    if prev_word == sentence[word_number]:
      print(prev_word, prev_surprisal)
      word_number += 1
      prev_word = ''
      prev_surprisal = 0

# Example Usage:
# As shown in the result, 'infl' and 'ating' are combined as one word
sentence = 'The balloon was inflating for 10 minutes'
calculate_surprisal(sentence)

In [None]:
calculate_surprisal("When did the biker crash yesterday?")

## BPE Demonstration

In [None]:
# This block demonstrates the tokenization problem of GPT-2
# As seen from the result, the word 'inflating' is split into 'infl' and 'ating'
# and each sub-word has its own surprisal value.
sentences = ['When did the biker crash yesterday?', 'The float was inflating at the carnival near the church.']
model.token_score(sentences, surprisal=True, base_two=False)