# Document parsing

In [1]:
import pymupdf
from pypdf import PdfReader
from IPython.display import display, Markdown
import json
import nltk
import pandas as pd

nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /Users/fayad/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

### PyMuPDF4LLM experiments

In [2]:
import pymupdf4llm

In [3]:
lec_doc_path = "../data/input/full/references/superglue.pdf"

In [4]:
doc = pymupdf4llm.to_markdown(lec_doc_path)

In [5]:
display(Markdown(doc))

# **SuperGLUE: A Stickier Benchmark for** **General-Purpose Language Understanding Systems**

**Alex Wang** _[‚àó]_ **Yada Pruksachatkun** _[‚àó]_ **Nikita Nangia** _[‚àó]_
New York University New York University New York University


**Amanpreet Singh** _[‚àó]_ **Julian Michael** **Felix Hill** **Omer Levy**
Facebook AI Research University of Washington DeepMind Facebook AI Research


**Samuel R. Bowman**
New York University


**Abstract**


In the last year, new models and methods for pretraining and transfer learning have
driven striking performance improvements across a range of language understanding tasks. The GLUE benchmark, introduced a little over one year ago, offers
a single-number metric that summarizes progress on a diverse set of such tasks,
but performance on the benchmark has recently surpassed the level of non-expert
humans, suggesting limited headroom for further research. In this paper we present
SuperGLUE, a new benchmark styled after GLUE with a new set of more difficult language understanding tasks, a software toolkit, and a public leaderboard.
SuperGLUE is available at `[super.gluebenchmark.com](https://super.gluebenchmark.com/)` .


**1** **Introduction**


In the past year, there has been notable progress across many natural language processing (NLP)
tasks, led by methods such as ELMo (Peters et al., 2018), OpenAI GPT (Radford et al., 2018),
and BERT (Devlin et al., 2019). The common thread connecting these methods is that they couple
self-supervised learning from massive unlabelled text corpora with a recipe for effectively adapting
the resulting model to target tasks. The tasks that have proven amenable to this general approach
include question answering, sentiment analysis, textual entailment, and parsing, among many others
(Devlin et al., 2019; Kitaev and Klein, 2018, i.a.).


In this context, the GLUE benchmark (Wang et al., 2019a) has become a prominent evaluation
framework for research towards general-purpose language understanding technologies. GLUE is
a collection of nine language understanding tasks built on existing public datasets, together with
private test data, an evaluation server, a single-number target metric, and an accompanying expertconstructed diagnostic set. GLUE was designed to provide a general-purpose evaluation of language
understanding that covers a range of training data volumes, task genres, and task formulations. We
believe it was these aspects that made GLUE particularly appropriate for exhibiting the transferlearning potential of approaches like OpenAI GPT and BERT.


The progress of the last twelve months has eroded headroom on the GLUE benchmark dramatically.
While some tasks (Figure 1) and some linguistic phenomena (Figure 2 in Appendix B) measured
in GLUE remain difficult, the current state of the art GLUE Score as of early July 2019 (88.4 from
Yang et al., 2019) surpasses human performance (87.1 from Nangia and Bowman, 2019) by 1.3


_‚àó_ Equal contribution. Correspondence: `glue-benchmark-admin@googlegroups.com`


33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.


|1.2|Col2|Col3|Col4|Col5|Col6|Col7|Col8|Col9|Col10|Col11|Col12|Col13|
|---|---|---|---|---|---|---|---|---|---|---|---|---|
|.<br>|||||||||||||
|0.7<br>0.8<br>0.9<br>1.0<br>|||||||||||||
|0.7<br>0.8<br>0.9<br>1.0<br>|||||||||||||
|0.7<br>0.8<br>0.9<br>1.0<br>||+ELMo+Attn|OpenAI GPT|sk Adapters|ERT (Large)|T on STILTs|BERT + BAM|SemBERT|orkel MeTaL|LICE (Large)|(ensemble)|(ensemble)|









Figure 1: GLUE benchmark performance for submitted systems, rescaled to set human performance
to 1.0, shown as a single number score, and broken down into the nine constituent task performances.
For tasks with multiple metrics, we use an average of the metrics. More information on the tasks
included in GLUE can be found in Wang et al. (2019a) and in Warstadt et al. (2018, CoLA), Socher
et al. (2013, SST-2), Dolan and Brockett (2005, MRPC), Cer et al. (2017, STS-B), and Williams et al.
(2018, MNLI), and Rajpurkar et al. (2016, the original data source for QNLI).


points, and in fact exceeds this human performance estimate on four tasks. Consequently, while there
remains substantial scope for improvement towards GLUE‚Äôs high-level goals, the original version of
the benchmark is no longer a suitable metric for quantifying such progress.


In response, we introduce SuperGLUE, a new benchmark designed to pose a more rigorous test of
language understanding. SuperGLUE has the same high-level motivation as GLUE: to provide a
simple, hard-to-game measure of progress toward general-purpose language understanding technologies for English. We anticipate that significant progress on SuperGLUE should require substantive
innovations in a number of core areas of machine learning, including sample-efficient, transfer,
multitask, and unsupervised or self-supervised learning.


SuperGLUE follows the basic design of GLUE: It consists of a public leaderboard built around
eight language understanding tasks, drawing on existing data, accompanied by a single-number
performance metric, and an analysis toolkit. However, it improves upon GLUE in several ways:


_‚Ä¢_ **More challenging tasks:** SuperGLUE retains the two hardest tasks in GLUE. The remaining tasks were identified from those submitted to an open call for task proposals and were
selected based on difficulty for current NLP approaches.


_‚Ä¢_ **More diverse task formats:** The task formats in GLUE are limited to sentence- and
sentence-pair classification. We expand the set of task formats in SuperGLUE to include
coreference resolution and question answering (QA).


_‚Ä¢_ **Comprehensive human baselines:** We include human performance estimates for all benchmark tasks, which verify that substantial headroom exists between a strong BERT-based
baseline and human performance.


_‚Ä¢_ **Improved code support:** SuperGLUE is distributed with a new, modular toolkit for work
on pretraining, multi-task learning, and transfer learning in NLP, built around standard tools
including PyTorch (Paszke et al., 2017) and AllenNLP (Gardner et al., 2017).


_‚Ä¢_ **Refined usage rules:** The conditions for inclusion on the SuperGLUE leaderboard have
been revamped to ensure fair competition, an informative leaderboard, and full credit
assignment to data and task creators.


The SuperGLUE leaderboard, data, and software tools are available at `[super.gluebenchmark.com](https://super.gluebenchmark.com/)` .


2


**2** **Related Work**


Much work prior to GLUE demonstrated that training neural models with large amounts of available
supervision can produce representations that effectively transfer to a broad range of NLP tasks
(Collobert and Weston, 2008; Dai and Le, 2015; Kiros et al., 2015; Hill et al., 2016; Conneau and
Kiela, 2018; McCann et al., 2017; Peters et al., 2018). GLUE was presented as a formal challenge
affording straightforward comparison between such task-agnostic transfer learning techniques. Other
similarly-motivated benchmarks include SentEval (Conneau and Kiela, 2018), which specifically
evaluates fixed-size sentence embeddings, and DecaNLP (McCann et al., 2018), which recasts a set
of target tasks into a general question-answering format and prohibits task-specific parameters. In
contrast, GLUE provides a lightweight classification API and no restrictions on model architecture or
parameter sharing, which seems to have been well-suited to recent work in this area.


Since its release, GLUE has been used as a testbed and showcase by the developers of several
influential models, including GPT (Radford et al., 2018) and BERT (Devlin et al., 2019). As shown
in Figure 1, progress on GLUE since its release has been striking. On GLUE, GPT and BERT
achieved scores of 72.8 and 80.2 respectively, relative to 66.5 for an ELMo-based model (Peters
et al., 2018) and 63.7 for the strongest baseline with no multitask learning or pretraining above the
word level. Recent models (Liu et al., 2019d; Yang et al., 2019) have clearly surpassed estimates of
non-expert human performance on GLUE (Nangia and Bowman, 2019). The success of these models
on GLUE has been driven by ever-increasing model capacity, compute power, and data quantity, as
well as innovations in model expressivity (from recurrent to bidirectional recurrent to multi-headed
transformer encoders) and degree of contextualization (from learning representation of words in
isolation to using uni-directional contexts and ultimately to leveraging bidirectional contexts).


In parallel to work scaling up pretrained models, several studies have focused on complementary
methods for augmenting performance of pretrained models. Phang et al. (2018) show that BERT can
be improved using two-stage pretraining, i.e., fine-tuning the pretrained model on an intermediate
data-rich supervised task before fine-tuning it again on a data-poor target task. Liu et al. (2019d,c) and
Bach et al. (2018) get further improvements respectively via multi-task finetuning and using massive
amounts of weak supervision. Anonymous (2018) demonstrate that knowledge distillation (Hinton
et al., 2015; Furlanello et al., 2018) can lead to student networks that outperform their teachers.
Overall, the quantity and quality of research contributions aimed at the challenges posed by GLUE
underline the utility of this style of benchmark for machine learning researchers looking to evaluate
new application-agnostic methods on language understanding.


Limits to current approaches are also apparent via the GLUE suite. Performance on the GLUE
diagnostic entailment dataset, at 0.42 _R_ 3, falls far below the average human performance of 0.80
_R_ 3 reported in the original GLUE publication, with models performing near, or even below, chance
on some linguistic phenomena (Figure 2, Appendix B). While some initially difficult categories
saw gains from advances on GLUE (e.g., double negation), others remain hard (restrictivity) or
even adversarial (disjunction, downward monotonicity). This suggests that even as unsupervised
pretraining produces ever-better statistical summaries of text, it remains difficult to extract many
details crucial to semantics without the right kind of supervision. Much recent work has made similar
observations about the limitations of existing pretrained models (Jia and Liang, 2017; Naik et al.,
2018; McCoy and Linzen, 2019; McCoy et al., 2019; Liu et al., 2019a,b).


**3** **SuperGLUE Overview**


**3.1** **Design Process**


The goal of SuperGLUE is to provide a simple, robust evaluation metric of any method capable of
being applied to a broad range of language understanding tasks. To that end, in designing SuperGLUE,
we identify the following desiderata of tasks in the benchmark:


_‚Ä¢_ **Task substance:** Tasks should test a system‚Äôs ability to understand and reason about texts
written in English.

_‚Ä¢_ **Task difficulty:** Tasks should be beyond the scope of current state-of-the-art systems,
but solvable by most college-educated English speakers. We exclude tasks that require
domain-specific knowledge, e.g. medical notes or scientific papers.


3


Table 1: The tasks included in SuperGLUE. _WSD_ stands for word sense disambiguation, _NLI_ is
natural language inference, _coref._ is coreference resolution, and _QA_ is question answering. For
MultiRC, we list the number of total answers for 456/83/166 train/dev/test questions. The metrics for
MultiRC are binary F1 on all answer-options and exact match.


**Corpus** _|_ **Train** _|_ _|_ **Dev** _|_ _|_ **Test** _|_ **Task** **Metrics** **Text Sources**


BoolQ 9427 3270 3245 QA acc. Google queries, Wikipedia
CB 250 57 250 NLI acc./F1 various
COPA 400 100 500 QA acc. blogs, photography encyclopedia
MultiRC 5100 953 1800 QA F1 _a_ /EM various
ReCoRD 101k 10k 10k QA F1/EM news (CNN, Daily Mail)
RTE 2500 278 300 NLI acc. news, Wikipedia
WiC 6000 638 1400 WSD acc. WordNet, VerbNet, Wiktionary
WSC 554 104 146 coref. acc. fiction books


_‚Ä¢_ **Evaluability:** Tasks must have an automatic performance metric that corresponds well to
human judgments of output quality. Certain text generation tasks fail to meet this criteria
due to issues surrounding automatic metrics like ROUGE and BLEU (Callison-Burch et al.,
2006; Liu et al., 2016, i.a.).


_‚Ä¢_ **Public data:** We require that tasks have _existing_ public training data in order to minimize
the risks involved in newly-created datasets. We also prefer tasks for which we have access
to (or could create) a test set with private labels.


_‚Ä¢_ **Task format:** We prefer tasks that had relatively simple input and output formats, to avoid
incentivizing the users of the benchmark to create complex task-specific model architectures.
Nevertheless, while GLUE is restricted to tasks involving single sentence or sentence pair
inputs, for SuperGLUE we expand the scope to consider tasks with longer inputs. This
yields a set of tasks that requires understanding individual tokens in context, complete
sentences, inter-sentence relations, and entire paragraphs.


_‚Ä¢_ **License:** We require that task data be available under licences that allow use and redistribution for research purposes.


To identify possible tasks for SuperGLUE, we disseminated a public call for task proposals to the
NLP community, and received approximately 30 proposals. We filtered these proposals according
to our criteria. Many proposals were not suitable due to licensing issues, complex formats, and
insufficient headroom; we provide examples of such tasks in Appendix D. For each of the remaining
tasks, we ran a BERT-based baseline and a human baseline, and filtered out tasks which were either
too challenging for humans without extensive training or too easy for our machine baselines.


**3.2** **Selected Tasks**


Following this process, we arrived at eight tasks to use in SuperGLUE. See Tables 1 and 2 for details
and specific examples of each task.


**BoolQ** (Boolean Questions, Clark et al., 2019) is a QA task where each example consists of a short
passage and a yes/no question about the passage. The questions are provided anonymously and
unsolicited by users of the Google search engine, and afterwards paired with a paragraph from a
Wikipedia article containing the answer. Following the original work, we evaluate with accuracy.


**CB** (CommitmentBank, De Marneffe et al., 2019) is a corpus of short texts in which at least one
sentence contains an embedded clause. Each of these embedded clauses is annotated with the degree
to which it appears the person who wrote the text is _committed_ to the truth of the clause. The resulting
task framed as three-class textual entailment on examples that are drawn from the Wall Street Journal,
fiction from the British National Corpus, and Switchboard. Each example consists of a premise
containing an embedded clause and the corresponding hypothesis is the extraction of that clause.
We use a subset of the data that had inter-annotator agreement above 80% . The data is imbalanced
(relatively fewer _neutral_ examples), so we evaluate using accuracy and F1, where for multi-class F1
we compute the unweighted average of the F1 per class.


4


Table 2: Development set examples from the tasks in SuperGLUE. **Bold** text represents part of the
example format for each task. Text in _italics_ is part of the model input. _Underlined_ text is specially
marked in the input. Text in a `monospaced font` represents the expected model output.


**Passage:** _Barq‚Äôs ‚Äì Barq‚Äôs is an American soft drink. Its brand of root beer is notable for having caffeine._
_Barq‚Äôs, created by Edward Barq and bottled since the turn of the 20th century, is owned by the Barq_
_family but bottled by the Coca-Cola Company. It was known as Barq‚Äôs Famous Olde Tyme Root Beer_
_until 2012._
**Question:** _is barq‚Äôs root beer a pepsi product_ **Answer:** `No`


**Text:** _B: And yet, uh, I we-, I hope to see employer based, you know, helping out. You know, child, uh,_
_care centers at the place of employment and things like that, that will help out. A: Uh-huh. B: What do_
_you think, do you think we are, setting a trend?_
**Hypothesis:** _they are setting a trend_ **Entailment:** `Unknown`


**Premise:** _My body cast a shadow over the grass._ **Question:** _What‚Äôs the CAUSE for this?_
**Alternative 1:** _The sun was rising._ **Alternative 2:** _The grass was cut._
**Correct Alternative:** `1`


**Paragraph:** _Susan wanted to have a birthday party. She called all of her friends. She has five friends._
_Her mom said that Susan can invite them all to the party. Her first friend could not go to the party_
_because she was sick. Her second friend was going out of town. Her third friend was not so sure if her_
_parents would let her. The fourth friend said maybe. The fifth friend could go to the party for sure. Susan_
_was a little sad. On the day of the party, all five friends showed up. Each friend had a present for Susan._
_Susan was happy and sent each friend a thank you card the next week_
**Question:** _Did Susan‚Äôs sick friend recover?_ **Candidate answers:** _Yes, she recovered_ ( `T` ), _No_ ( `F` ), _Yes_
( `T` ), _No, she didn‚Äôt recover_ ( `F` ), _Yes, she was at Susan‚Äôs party_ ( `T` )


**Paragraph:** _(_ _CNN_ _)_ _Puerto Rico_ _on Sunday overwhelmingly voted for statehood. But Congress, the only_
_body that can approve new states, will ultimately decide whether the status of the_ _US_ _commonwealth_
_changes. Ninety-seven percent of the votes in the nonbinding referendum favored statehood, an increase_
_over the results of a 2012 referendum, official results from the_ _State Electorcal Commission_ _show. It_
_was the fifth such vote on statehood. "Today, we the people of_ _Puerto Rico_ _are sending a strong and_
_clear message to the_ _US Congress_ _... and to the world ... claiming our equal rights as_ _American_ _citizens,_
_Puerto Rico_ _Gov._ _Ricardo Rossello_ _said in a news release. @highlight_ _Puerto Rico_ _voted Sunday in_
_favor of US statehood_
**Query** For one, they can truthfully say, ‚ÄúDon‚Äôt blame me, I didn‚Äôt vote for them, ‚Äù when discussing the
<placeholder> presidency **Correct Entities:** `US`


**Text:** _Dana Reeve, the widow of the actor Christopher Reeve, has died of lung cancer at age 44,_
_according to the Christopher Reeve Foundation._
**Hypothesis:** _Christopher Reeve had an accident._ **Entailment:** `False`


**Context 1:** _Room and board._ **Context 2:** _He nailed boards across the windows._

**Sense match:** `False`


**Text:** _Mark told_ _Pete_ _many lies about himself, which Pete included in his book._ _He_ _should have been_
_more truthful._ **Coreference:** `False`


**COPA** (Choice of Plausible Alternatives, Roemmele et al., 2011) is a causal reasoning task in which
a system is given a premise sentence and must determine either the cause or effect of the premise
from two possible choices. All examples are handcrafted and focus on topics from blogs and a
photography-related encyclopedia. Following the original work, we evaluate using accuracy.


**MultiRC** (Multi-Sentence Reading Comprehension, Khashabi et al., 2018) is a QA task where each
example consists of a context paragraph, a question about that paragraph, and a list of possible
answers. The system must predict which answers are true and which are false. While many QA
tasks exist, we use MultiRC because of a number of desirable properties: (i) each question can have
multiple possible correct answers, so each question-answer pair must be evaluated independent of
other pairs, (ii) the questions are designed such that answering each question requires drawing facts
from multiple context sentences, and (iii) the question-answer pair format more closely matches
the API of other tasks in SuperGLUE than the more popular span-extractive QA format does. The
paragraphs are drawn from seven domains including news, fiction, and historical text. The evaluation
metrics are F1 over all answer-options (F1 _a_ ) and exact match of each question‚Äôs set of answers (EM).


5


**ReCoRD** (Reading Comprehension with Commonsense Reasoning Dataset, Zhang et al., 2018) is a
multiple-choice QA task. Each example consists of a news article and a Cloze-style question about
the article in which one entity is masked out. The system must predict the masked out entity from a
given list of possible entities in the provided passage, where the same entity may be expressed using
multiple different surface forms, all of which are considered correct. Articles are drawn from CNN
and Daily Mail. Following the original work, we evaluate with max (over all mentions) token-level
F1 and exact match (EM).


**RTE** (Recognizing Textual Entailment) datasets come from a series of annual competitions on textual
entailment. [2] RTE is included in GLUE, and we use the same data and format as GLUE: We merge data
from RTE1 (Dagan et al., 2006), RTE2 (Bar Haim et al., 2006), RTE3 (Giampiccolo et al., 2007), and
RTE5 (Bentivogli et al., 2009). [3] All datasets are combined and converted to two-class classification:
_entailment_ and _not_entailment_ . Of all the GLUE tasks, RTE is among those that benefits from

_‚àº_
transfer learning the most, with performance jumping from near random-chance ( 56%) at the time
of GLUE‚Äôs launch to 86.3% accuracy (Liu et al., 2019d; Yang et al., 2019) at the time of writing.
Given the nearly eight point gap with respect to human performance, however, the task is not yet
solved by machines, and we expect the remaining gap to be difficult to close.


**WiC** (Word-in-Context, Pilehvar and Camacho-Collados, 2019) is a word sense disambiguation task
cast as binary classification of sentence pairs. Given two text snippets and a polysemous word that
appears in both sentences, the task is to determine whether the word is used with the same sense in
both sentences. Sentences are drawn from WordNet (Miller, 1995), VerbNet (Schuler, 2005), and
Wiktionary. We follow the original work and evaluate using accuracy.


**WSC** (Winograd Schema Challenge, Levesque et al., 2012) is a coreference resolution task in
which examples consist of a sentence with a pronoun and a list of noun phrases from the sentence.
The system must determine the correct referrent of the pronoun from among the provided choices.
Winograd schemas are designed to require everyday knowledge and commonsense reasoning to solve.


GLUE includes a version of WSC recast as NLI, known as WNLI. Until very recently, no substantial
progress had been made on WNLI, with many submissions opting to submit majority class predictions. [4] In the past few months, several works (Kocijan et al., 2019; Liu et al., 2019d) have made rapid
progress via a hueristic data augmentation scheme, raising machine performance to 90.4% accuracy.
Given estimated human performance of _‚àº_ 96%, there is still a gap between machine and human
performance, which we expect will be relatively difficult to close. We therefore include a version of
WSC cast as binary classification, where each example consists of a sentence with a marked pronoun
and noun, and the task is to determine if the pronoun refers to that noun. The training and validation
examples are drawn from the original WSC data (Levesque et al., 2012), as well as those distributed
by the affiliated organization _Commonsense Reasoning_ . [5] The test examples are derived from fiction
books and have been shared with us by the authors of the original dataset. We evaluate using accuracy.


**3.3** **Scoring**


As with GLUE, we seek to give a sense of aggregate system performance over all tasks by averaging
scores of all tasks. Lacking a fair criterion with which to weight the contributions of each task to
the overall score, we opt for the simple approach of weighing each task equally, and for tasks with
multiple metrics, first averaging those metrics to get a task score.


**3.4** **Tools for Model Analysis**


**Analyzing Linguistic and World Knowledge in Models** GLUE includes an expert-constructed,
diagnostic dataset that automatically tests models for a broad range of linguistic, commonsense, and
world knowledge. Each example in this broad-coverage diagnostic is a sentence pair labeled with


2 Textual entailment is also known as natural language inference, or NLI
3 RTE4 is not publicly available, while RTE6 and RTE7 do not conform to the standard NLI task.
4 WNLI is especially difficult due to an adversarial train/dev split: Premise sentences that appear in the
training set often appear in the development set with a different hypothesis and a flipped label. If a system
memorizes the training set, which was easy due to the small size of the training set, it could perform far _below_
chance on the development set. We remove this adversarial design in our version of WSC by ensuring that no
sentences are shared between the training, validation, and test sets.
5 `[http://commonsensereasoning.org/disambiguation.html](http://commonsensereasoning.org/disambiguation.html)`


6


a three-way entailment relation ( _entailment_, _neutral_, or _contradiction_ ) and tagged with labels that
indicate the phenomena that characterize the relationship between the two sentences. Submissions
to the GLUE leaderboard are required to include predictions from the submission‚Äôs MultiNLI
classifier on the diagnostic dataset, and analyses of the results were shown alongside the main
leaderboard. Since this broad-coverage diagnostic task has proved difficult for top models, we retain
it in SuperGLUE. However, since MultiNLI is not part of SuperGLUE, we collapse _contradiction_
and _neutral_ into a single _not_entailment_ label, and request that submissions include predictions
on the resulting set from the model used for the _RTE_ task. We collect non-expert annotations to
estimate human performance, following the same procedure we use for the main benchmark tasks
(Section 5.2). We estimate an accuracy of 88% and a Matthew‚Äôs correlation coefficient (MCC, the
two-class variant of the _R_ 3 metric used in GLUE) of 0.77.


**Analyzing Gender Bias in Models** Recent work has identified the presence and amplification
of many social biases in data-driven machine learning models. (Lu et al., 2018; Zhao et al., 2018;
Kiritchenko and Mohammad, 2018). To promote the detection of such biases, we include Winogender
(Rudinger et al., 2018) as an additional diagnostic dataset. Winogender is designed to measure gender
bias in coreference resolution systems. We use the Diverse Natural Language Inference Collection
(DNC; Poliak et al., 2018) version that casts Winogender as a textual entailment task. [6] Each example
consists of a premise sentence with a male or female pronoun and a hypothesis giving a possible
antecedent of the pronoun. Examples occur in _minimal pairs_, where the only difference between
an example and its pair is the gender of the pronoun in the premise. Performance on Winogender
is measured with both accuracy and the _gender parity score_ : the percentage of minimal pairs for
which the predictions are the same. We note that a system can trivially obtain a perfect gender parity
score by guessing the same class for all examples, so a high gender parity score is meaningless unless
accompanied by high accuracy. We collect non-expert annotations to estimate human performance,
and observe an accuracy of 99.7% and a gender parity score of 0.99.


Like any diagnostic, Winogender has limitations. It offers only positive predictive value: A poor
bias score is clear evidence that a model exhibits gender bias, but a good score does not mean that
the model is unbiased. More specifically, in the DNC version of the task, a low gender parity score
means that a model‚Äôs prediction of textual entailment can be changed with a change in pronouns, all
else equal. It is plausible that there are forms of bias that are relevant to target tasks of interest, but
that do not surface in this setting (Gonen and Goldberg, 2019). In addition, Winogender does not
cover all forms of social bias, or even all forms of gender. For instance, the version of the data used
here offers no coverage of gender-neutral _they_ or non-binary pronouns. Despite these limitations, we
believe that Winogender‚Äôs inclusion is worthwhile in providing a coarse sense of how social biases
evolve with model performance and for keeping attention on the social ramifications of NLP models.


**4** **Using SuperGLUE**


**Software Tools** To facilitate using SuperGLUE, we release `jiant` (Wang et al., 2019b), [7] a modular
software toolkit, built with PyTorch (Paszke et al., 2017), components from AllenNLP (Gardner
et al., 2017), and the `pytorch-pretrained-bert` package. [8] `jiant` implements our baselines and
supports the evaluation of custom models and training methods on the benchmark tasks. The toolkit
includes support for existing popular pretrained models such as OpenAI GPT and BERT, as well as
support for multistage and multitask learning of the kind seen in the strongest models on GLUE.


**Eligibility** Any system or method that can produce predictions for the SuperGLUE tasks is eligible
for submission to the leaderboard, subject to the data-use and submission frequency policies stated
immediately below. There are no restrictions on the type of methods that may be used, and there is
no requirement that any form of parameter sharing or shared initialization be used across the tasks in
the benchmark. To limit overfitting to the private test data, users are limited to a maximum of two
submissions per day and six submissions per month.


6 We filter out 23 examples where the labels are ambiguous
7 `[https://github.com/nyu-mll/jiant](https://github.com/nyu-mll/jiant)`
8 `[https://github.com/huggingface/pytorch-pretrained-BERT](https://github.com/huggingface/pytorch-pretrained-BERT)`


7


Table 3: Baseline performance on the SuperGLUE test sets and diagnostics. For CB we report
accuracy and macro-average F1. For MultiRC we report F1 on all answer-options and exact match
of each question‚Äôs set of correct answers. AX _b_ is the broad-coverage diagnostic task, scored using
Matthews‚Äô correlation (MCC). AX _g_ is the Winogender diagnostic, scored using accuracy and the
gender parity score (GPS). All values are scaled by 100. The _Avg_ column is the overall benchmark
score on non-AX _‚àó_ tasks. The bolded numbers reflect the best machine performance on task. *MultiRC
has multiple test sets released on a staggered schedule, and these results evaluate on an installation of
the test set that is a subset of ours.


**Model** **Avg BoolQ** **CB** **COPA** **MultiRC** **ReCoRD RTE WiC WSC** **AX** _b_ **AX** _g_
**Metrics** **Acc.** **F1/Acc.** **Acc.** **F1** _a_ **/EM** **F1/EM** **Acc. Acc.** **Acc.** **MCC GPS Acc.**


Most Frequent 47.1 62.3 21.7/48.4 50.0 61.1 / 0.3 33.4/32.5 50.3 50.0 65.1 0.0 100.0/ 50.0
CBoW 44.3 62.1 49.0/71.2 51.6 0.0 / 0.4 14.0/13.6 49.7 53.0 65.1 -0.4 100.0/ 50.0

BERT 69.0 77.4 75.7/83.6 70.6 70.0 / 24.0 72.0/71.3 71.6 **69.5** **64.3** 23.0 97.8 / 51.7

BERT++ **71.5** 79.0 **84.7** / **90.4** 73.8 70.0 / 24.1 72.0/71.3 79.0 **69.5** **64.3** 38.0 99.4 / 51.4

Outside Best  - **80.4**  - / - **84.4** **70.4** */ **24.5**  - **74.8** / **73.0 82.7**  -  -  -  - /  

Human (est.) 89.8 89.0 95.8/98.9 100.0 81.8*/51.9* 91.7/91.3 93.6 80.0 100.0 77.0 99.3 / 99.7


**Data** Data for the tasks are available for download through the SuperGLUE site and through a
download script included with the software toolkit. Each task comes with a standardized training set,
development set, and _unlabeled_ test set. Submitted systems may use any public or private data when
developing their systems, with a few exceptions: Systems may only use the SuperGLUE-distributed
versions of the task datasets, as these use different train/validation/test splits from other public
versions in some cases. Systems also may not use the unlabeled test data for the tasks in system
development in any way, may not use the structured source data that was used to collect the WiC
labels (sense-annotated example sentences from WordNet, VerbNet, and Wiktionary) in any way, and
may not build systems that share information across separate _test_ examples in any way.


We do not endorse the use of the benchmark data for _non-research_ applications, due to concerns
about socially relevant biases (such as ethnicity‚Äìoccupation associations) that may be undesirable
or legally problematic in deployed systems. Because these biases are evident in texts from a wide
variety of sources and collection methods (e.g., Rudinger et al., 2017), and because none of our task
datasets directly mitigate them, one can reasonably presume that our training sets teach models these
biases to some extent and that our evaluation sets similarly _reward_ models that learn these biases.


To ensure reasonable credit assignment, because we build very directly on prior work, we ask the
authors of submitted systems to directly name and cite the specific datasets that they use, _including the_
_benchmark datasets_ . We will enforce this as a requirement for papers to be listed on the leaderboard.


**5** **Experiments**


**5.1** **Baselines**


**BERT** Our main baselines are built around BERT, variants of which are among the most successful
approach on GLUE at the time of writing. Specifically, we use the `bert-large-cased` variant.
Following the practice recommended in Devlin et al. (2019), for each task, we use the simplest
possible architecture on top of BERT. We fine-tune a copy of the pretrained BERT model separately
for each task, and leave the development of multi-task learning models to future work. For training,
we use the procedure specified in Devlin et al. (2019): We use Adam (Kingma and Ba, 2014) with an
initial learning rate of 10 _[‚àí]_ [5] and fine-tune for a maximum of 10 epochs.


For classification tasks with sentence-pair inputs (BoolQ, CB, RTE, WiC), we concatenate the
sentences with a [ SEP ] token, feed the fused input to BERT, and use a logistic regression classifier that
sees the representation corresponding to [ CLS ]. For WiC only, we also concatenate the representation
of the marked word to the [ CLS ] representation. For COPA, MultiRC, and ReCoRD, for each answer
choice, we similarly concatenate the context with that answer choice and feed the resulting sequence
into BERT to produce an answer representation. For COPA, we project these representations into a
scalar, and take as the answer the choice with the highest associated scalar. For MultiRC, because
each question can have more than one correct answer, we feed each answer representation into


8


a logistic regression classifier. For ReCoRD, we also evaluate the probability of each candidate
independent of other candidates, and take the most likely candidate as the model‚Äôs prediction. For
WSC, which is a span-based task, we use a model inspired by Tenney et al. (2019). Given the BERT
representation for each word in the original sentence, we get span representations of the pronoun
and noun phrase via a self-attention span-pooling operator (Lee et al., 2017), before feeding it into a
logistic regression classifier.


**BERT++** We also report results using BERT with additional training on related datasets before
fine-tuning on the benchmark tasks, following the STILTs two-stage style of transfer learning (Phang
et al., 2018). Given the productive use of MultiNLI in pretraining and intermediate fine-tuning of
pretrained language models (Conneau et al., 2017; Phang et al., 2018, i.a.), for CB, RTE, and BoolQ,
we use MultiNLI as a transfer task by first using the above procedure on MultiNLI. Similarly, given
the similarity of COPA to SWAG (Zellers et al., 2018), we first fine-tune BERT on SWAG. These
results are reported as BERT++. For all other tasks, we reuse the results of BERT fine-tuned on just
that task.


**Simple Baselines** We include a baseline where for each task we simply predict the majority class, [9]
as well as a bag-of-words baseline where each input is represented as an average of its tokens‚Äô GloVe
word vectors (the 300D/840B release from Pennington et al., 2014).


**Outside Best** We list the best known result on each task to date, except on tasks which we recast
(WSC), resplit (CB), or achieve the best known result (WiC). The outside results for COPA, MultiRC,
and RTE are from Sap et al. (2019), Trivedi et al. (2019), and Liu et al. (2019d) respectively.


**5.2** **Human Performance**


Pilehvar and Camacho-Collados (2019), Khashabi et al. (2018), Nangia and Bowman (2019), and
Zhang et al. (2018) respectively provide estimates for human performance on WiC, MultiRC, RTE,
and ReCoRD. For the remaining tasks, including the diagnostic set, we estimate human performance
by hiring crowdworker annotators through Amazon‚Äôs Mechanical Turk platform to reannotate a
sample of each test set. We follow a two step procedure where a crowd worker completes a short
training phase before proceeding to the annotation phase, modeled after the method used by Nangia
and Bowman (2019) for GLUE. For both phases and all tasks, the average pay rate is $23.75/hr. [10]


In the training phase, workers are provided with instructions on the task, linked to an FAQ page, and
are asked to annotate up to 30 examples from the development set. After answering each example,
workers are also asked to check their work against the provided ground truth label. After the training
phase is complete, we provide the qualification to work on the annotation phase to all workers
who annotated a minimum of five examples, i.e. completed five HITs during training and achieved
performance at, or above the median performance across all workers during training.


In the annotation phase, workers are provided with the same instructions as the training phase, and
are linked to the same FAQ page. The instructions for all tasks are provided in Appendix C. For the
annotation phase we randomly sample 100 examples from the task‚Äôs test set, with the exception of
WSC where we annotate the full test set. For each example, we collect annotations from five workers
and take a majority vote to estimate human performance. For additional details, see Appendix C.3.


**5.3** **Results**


Table 3 shows results for all baselines. The simple baselines of predicting the most frequent class
and CBOW do not perform well overall, achieving near chance performance for several of the tasks.
Using BERT increases the average SuperGLUE score by 25 points, attaining significant gains on
all of the benchmark tasks, particularly MultiRC, ReCoRD, and RTE. On WSC, BERT actually
performs worse than the simple baselines, likely due to the small size of the dataset and the lack of
data augmentation. Using MultiNLI as an additional source of supervision for BoolQ, CB, and RTE
leads to a 2-5 point improvement on all tasks. Using SWAG as a transfer task for COPA sees an 8
point improvement.


9 For ReCoRD, we predict the entity that has the highest F1 with the other entity options.
10 This estimate is taken from `[https://turkerview.com](https://turkerview.com)` .


9


Our best baselines still lag substantially behind human performance. On average, there is a nearly 20
point gap between BERT++ and human performance. The largest gap is on WSC, with a 35 point
difference between the best model and human performance. The smallest margins are on BoolQ,
CB, RTE, and WiC, with gaps of around 10 points on each of these. We believe these gaps will be
challenging to close: On WSC and COPA, human performance is perfect. On three other tasks, it is
in the mid-to-high 90s. On the diagnostics, all models continue to lag significantly behind humans.
Though all models obtain near perfect gender parity scores on Winogender, this is due to the fact that
they are obtaining accuracy near that of random guessing.


**6** **Conclusion**


We present SuperGLUE, a new benchmark for evaluating general-purpose language understanding
systems. SuperGLUE updates the GLUE benchmark by identifying a new set of challenging NLU
tasks, as measured by the difference between human and machine baselines. The set of eight tasks in
our benchmark emphasizes diverse task formats and low-data training data tasks, with nearly half the
tasks having fewer than 1k examples and all but one of the tasks having fewer than 10k examples.


We evaluate BERT-based baselines and find that they still lag behind humans by nearly 20 points.
Given the difficulty of SuperGLUE for BERT, we expect that further progress in multi-task, transfer,
and unsupervised/self-supervised learning techniques will be necessary to approach human-level performance on the benchmark. Overall, we argue that SuperGLUE offers a rich and challenging testbed
for work developing new general-purpose machine learning methods for language understanding.


**7** **Acknowledgments**


We thank the original authors of the included datasets in SuperGLUE for their cooperation in the
creation of the benchmark, as well as those who proposed tasks and datasets that we ultimately could
not include.


This work was made possible in part by a donation to NYU from Eric and Wendy Schmidt made by
recommendation of the Schmidt Futures program. We gratefully acknowledge the support of the
NVIDIA Corporation with the donation of a Titan V GPU used at NYU for this research. AW is
supported by the National Science Foundation Graduate Research Fellowship Program under Grant
No. DGE 1342536. Any opinions, findings, and conclusions or recommendations expressed in this
material are those of the author(s) and do not necessarily reflect the views of the National Science
Foundation.


**References**


Anonymous. Bam! Born-again multi-task networks for natural language understanding. Anonymous
preprint under review, 2018. URL `[https://openreview.net/forum?id=SylnYlqKw4](https://openreview.net/forum?id=SylnYlqKw4)` .


Stephen H. Bach, Daniel Rodriguez, Yintao Liu, Chong Luo, Haidong Shao, Cassandra Xia, Souvik
Sen, Alexander Ratner, Braden Hancock, Houman Alborzi, Rahul Kuchhal, Christopher R√©, and
Rob Malkin. Snorkel drybell: A case study in deploying weak supervision at industrial scale. In
_SIGMOD_ . ACM, 2018.


Roy Bar Haim, Ido Dagan, Bill Dolan, Lisa Ferro, Danilo Giampiccolo, Bernardo Magnini, and
Idan Szpektor. The second PASCAL recognising textual entailment challenge. In _Proceedings_
_of the Second PASCAL Challenges Workshop on Recognising Textual Entailment_, 2006. URL
`[http://u.cs.biu.ac.il/~nlp/RTE2/Proceedings/01.pdf](http://u.cs.biu.ac.il/~nlp/RTE2/Proceedings/01.pdf)` .


Luisa Bentivogli, Ido Dagan, Hoa Trang Dang, Danilo Giampiccolo, and Bernardo Magnini. The
fifth PASCAL recognizing textual entailment challenge. In _Textual Analysis Conference (TAC)_,
2009. URL `[http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.232.1231](http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.232.1231)` .


Sven Buechel, Anneke Buffone, Barry Slaff, Lyle Ungar, and Jo√£o Sedoc. Modeling empathy and
distress in reaction to news stories. In _Proceedings of the 2018 Conference on Empirical Methods_
_in Natural Language Processing (EMNLP)_, 2018.


10


Chris Callison-Burch, Miles Osborne, and Philipp Koehn. Re-evaluation the role of bleu in machine
translation research. In _Proceedings of the Conference of the European Chapter of the Association_
_for Computational Linguistics (EACL)_ . Association for Computational Linguistics, 2006. URL
`[https://www.aclweb.org/anthology/E06-1032](https://www.aclweb.org/anthology/E06-1032)` .


Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. Semeval-2017 task
1: Semantic textual similarity multilingual and crosslingual focused evaluation. In _Proceedings_
_of the 11th International Workshop on Semantic Evaluation (SemEval-2017)_ . Association for
Computational Linguistics, 2017. doi: 10.18653/v1/S17-2001. URL `[https://www.aclweb.](https://www.aclweb.org/anthology/S17-2001)`
`[org/anthology/S17-2001](https://www.aclweb.org/anthology/S17-2001)` .


Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, and Luke
Zettlemoyer. QuAC: Question answering in context. In _Proceedings of the 2018 Conference on_
_Empirical Methods in Natural Language Processing (EMNLP)_ . Association for Computational
Linguistics, 2018a.


Eunsol Choi, Omer Levy, Yejin Choi, and Luke Zettlemoyer. Ultra-fine entity typing. In _Proceedings_
_of the Association for Computational Linguistics (ACL)_ . Association for Computational Linguistics,
2018b. URL `[https://www.aclweb.org/anthology/P18-1009](https://www.aclweb.org/anthology/P18-1009)` .


Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina
Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. In _Proceedings_
_of the 2019 Conference of the North American Chapter of the Association for Computational_
_Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)_, pages 2924‚Äì2936,
2019.


Ronan Collobert and Jason Weston. A unified architecture for natural language processing: Deep
neural networks with multitask learning. In _Proceedings of the 25th International Conference on_
_Machine Learning (ICML)_ . Association for Computing Machinery, 2008. URL `[https://dl.acm.](https://dl.acm.org/citation.cfm?id=1390177)`
`[org/citation.cfm?id=1390177](https://dl.acm.org/citation.cfm?id=1390177)` .


Alexis Conneau and Douwe Kiela. SentEval: An evaluation toolkit for universal sentence representations. In _Proceedings of the 11th Language Resources and Evaluation Conference_ . European Language Resource Association, 2018. URL `[https://www.aclweb.org/anthology/L18-1269](https://www.aclweb.org/anthology/L18-1269)` .


Alexis Conneau, Douwe Kiela, Holger Schwenk, Lo√Øc Barrault, and Antoine Bordes. Supervised learning of universal sentence representations from natural language inference data. In
_Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing_
_(EMNLP)_ . Association for Computational Linguistics, 2017. doi: 10.18653/v1/D17-1070. URL

`[https://www.aclweb.org/anthology/D17-1070](https://www.aclweb.org/anthology/D17-1070)` .


Ido Dagan, Oren Glickman, and Bernardo Magnini. The PASCAL recognising textual entailment challenge. In _Machine Learning Challenges. Evaluating Predictive Uncertainty, Vi-_
_sual Object Classification, and Recognising Textual Entailment_ . Springer, 2006. URL `[https:](https://link.springer.com/chapter/10.1007/11736790_9)`
`[//link.springer.com/chapter/10.1007/11736790_9](https://link.springer.com/chapter/10.1007/11736790_9)` .


Andrew M Dai and Quoc V Le. Semi-supervised sequence learning. In _Advances in Neural_
_Information Processing Systems (NeurIPS)_ . Curran Associates, Inc., 2015. URL `[http://papers.](http://papers.nips.cc/paper/5949-semi-supervised-sequence-learning.pdf)`
`[nips.cc/paper/5949-semi-supervised-sequence-learning.pdf](http://papers.nips.cc/paper/5949-semi-supervised-sequence-learning.pdf)` .


Marie-Catherine De Marneffe, Mandy Simons, and Judith Tonhauser. The CommitmentBank:
Investigating projection in naturally occurring discourse. 2019. To appear in _Proceedings of Sinn_
_und Bedeutung 23_ . Data can be found at `[https://github.com/mcdm/CommitmentBank/](https://github.com/mcdm/CommitmentBank/)` .


Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep
bidirectional transformers for language understanding. In _Proceedings of the Conference of the_
_North American Chapter of the Association for Computational Linguistics: Human Language_
_Technologies (NAACL-HLT)_ . Association for Computational Linguistics, 2019. URL `[https:](https://arxiv.org/abs/1810.04805)`
`[//arxiv.org/abs/1810.04805](https://arxiv.org/abs/1810.04805)` .


William B. Dolan and Chris Brockett. Automatically constructing a corpus of sentential paraphrases.
In _Proceedings of IWP_, 2005.


11


Manaal Faruqui and Dipanjan Das. Identifying well-formed natural language questions. In _Pro-_
_ceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP)_ .
Association for Computational Linguistics, 2018. URL `[https://www.aclweb.org/anthology/](https://www.aclweb.org/anthology/D18-1091)`
`[D18-1091](https://www.aclweb.org/anthology/D18-1091)` .


Tommaso Furlanello, Zachary C Lipton, Michael Tschannen, Laurent Itti, and Anima Anandkumar.
Born again neural networks. _International Conference on Machine Learning (ICML)_, 2018. URL
`[http://proceedings.mlr.press/v80/furlanello18a.html](http://proceedings.mlr.press/v80/furlanello18a.html)` .


Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew
Peters, Michael Schmitz, and Luke S. Zettlemoyer. AllenNLP: A deep semantic natural language
processing platform. In _Proceedings of Workshop for NLP Open Source Software_, 2017. URL
`[https://www.aclweb.org/anthology/W18-2501](https://www.aclweb.org/anthology/W18-2501)` .


Danilo Giampiccolo, Bernardo Magnini, Ido Dagan, and Bill Dolan. The third PASCAL recognizing
textual entailment challenge. In _Proceedings of the ACL-PASCAL Workshop on Textual Entailment_
_and Paraphrasing_ . Association for Computational Linguistics, 2007.


Hila Gonen and Yoav Goldberg. Lipstick on a pig: Debiasing methods cover up systematic
gender biases in word embeddings but do not remove them. In _Proceedings of the 2019_
_Conference of the North American Chapter of the Association for Computational Linguistics:_
_Human Language Technologies, Volume 1 (Long and Short Papers)_, pages 609‚Äì614, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. URL `[https:](https://www.aclweb.org/anthology/N19-1061)`
`[//www.aclweb.org/anthology/N19-1061](https://www.aclweb.org/anthology/N19-1061)` .


Felix Hill, Kyunghyun Cho, and Anna Korhonen. Learning distributed representations of sentences
from unlabelled data. In _Proceedings of the Conference of the North American Chapter of_
_the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT)_ .
Association for Computational Linguistics, 2016. doi: 10.18653/v1/N16-1162. URL `[https:](https://www.aclweb.org/anthology/N16-1162)`
`[//www.aclweb.org/anthology/N16-1162](https://www.aclweb.org/anthology/N16-1162)` .


Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. _arXiv_
_preprint 1503.02531_, 2015. URL `[https://arxiv.org/abs/1503.02531](https://arxiv.org/abs/1503.02531)` .


Robin Jia and Percy Liang. Adversarial examples for evaluating reading comprehension systems. In
_Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP)_ .
Association for Computational Linguistics, 2017. doi: 10.18653/v1/D17-1215. URL `[https:](https://www.aclweb.org/anthology/D17-1215)`
`[//www.aclweb.org/anthology/D17-1215](https://www.aclweb.org/anthology/D17-1215)` .


Daniel Khashabi, Snigdha Chaturvedi, Michael Roth, Shyam Upadhyay, and Dan Roth. Looking
beyond the surface: A challenge set for reading comprehension over multiple sentences. In
_Proceedings of the Conference of the North American Chapter of the Association for Computa-_
_tional Linguistics: Human Language Technologies (NAACL-HLT)_ . Association for Computational
Linguistics, 2018. URL `[https://www.aclweb.org/anthology/papers/N/N18/N18-1023/](https://www.aclweb.org/anthology/papers/N/N18/N18-1023/)` .


Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. _arXiv preprint_
_1412.6980_, 2014. URL `[https://arxiv.org/abs/1412.6980](https://arxiv.org/abs/1412.6980)` .


Svetlana Kiritchenko and Saif Mohammad. Examining gender and race bias in two hundred sentiment
analysis systems. In _Proceedings of the Seventh Joint Conference on Lexical and Computational_
_Semantics_ . Association for Computational Linguistics, 2018. doi: 10.18653/v1/S18-2005. URL
`[https://www.aclweb.org/anthology/S18-2005](https://www.aclweb.org/anthology/S18-2005)` .


Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba,
and Sanja Fidler. Skip-thought vectors. In _Advances in neural information processing systems_,
2015.


Nikita Kitaev and Dan Klein. Multilingual constituency parsing with self-attention and pre-training.
_arXiv preprint 1812.11760_, 2018. URL `[https://arxiv.org/abs/1812.11760](https://arxiv.org/abs/1812.11760)` .


Vid Kocijan, Ana-Maria Cretu, Oana-Maria Camburu, Yordan Yordanov, and Thomas Lukasiewicz.
A surprisingly robust trick for winograd schema challenge. _arXiv preprint 1905.06290_, 2019.


12


Kenton Lee, Luheng He, Mike Lewis, and Luke Zettlemoyer. End-to-end neural coreference
resolution. In _Proceedings of the 2017 Conference on Empirical Methods in Natural Language_
_Processing_ . Association for Computational Linguistics, September 2017. doi: 10.18653/v1/
D17-1018. URL `[https://www.aclweb.org/anthology/D17-1018](https://www.aclweb.org/anthology/D17-1018)` .


Hector Levesque, Ernest Davis, and Leora Morgenstern. The Winograd schema challenge. In
_Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning_,
2012. URL `[http://dl.acm.org/citation.cfm?id=3031843.3031909](http://dl.acm.org/citation.cfm?id=3031843.3031909)` .


Chia-Wei Liu, Ryan Lowe, Iulian Serban, Mike Noseworthy, Laurent Charlin, and Joelle Pineau.
How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics
for dialogue response generation. In _Proceedings of the 2016 Conference on Empirical Methods in_
_Natural Language Processing_ . Association for Computational Linguistics, 2016. doi: 10.18653/
v1/D16-1230. URL `[https://www.aclweb.org/anthology/D16-1230](https://www.aclweb.org/anthology/D16-1230)` .


Nelson F. Liu, Matt Gardner, Yonatan Belinkov, Matthew E. Peters, and Noah A. Smith. Linguistic
knowledge and transferability of contextual representations. In _Proceedings of the Conference of_
_the North American Chapter of the Association for Computational Linguistics: Human Language_
_Technologies (NAACL-HLT)_ . Association for Computational Linguistics, 2019a. URL `[https:](https://arxiv.org/abs/1903.08855)`
`[//arxiv.org/abs/1903.08855](https://arxiv.org/abs/1903.08855)` .


Nelson F. Liu, Roy Schwartz, and Noah A. Smith. Inoculation by fine-tuning: A method for
analyzing challenge datasets. In _Proceedings of the Conference of the North American Chapter_
_of the Association for Computational Linguistics: Human Language Technologies (NAACL-_
_HLT)_ . Association for Computational Linguistics, 2019b. URL `[https://arxiv.org/abs/1904.](https://arxiv.org/abs/1904.02668)`
`[02668](https://arxiv.org/abs/1904.02668)` .


Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. Improving multi-task deep neural
networks via knowledge distillation for natural language understanding. _arXiv preprint 1904.09482_,
2019c. URL `[http://arxiv.org/abs/1904.09482](http://arxiv.org/abs/1904.09482)` .


Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. Multi-task deep neural networks for
natural language understanding. _arXiv preprint 1901.11504_, 2019d.


Kaiji Lu, Piotr Mardziel, Fangjing Wu, Preetam Amancharla, and Anupam Datta. Gender bias in
neural natural language processing. _arXiv preprint 1807.11714_, 2018. URL `[http://arxiv.org/](http://arxiv.org/abs/1807.11714)`
`[abs/1807.11714](http://arxiv.org/abs/1807.11714)` .


Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. Learned in translation: Contextualized word vectors. In _Advances in Neural Information Processing Sys-_
_tems (NeurIPS)_ . Curran Associates, Inc., 2017. URL `[http://papers.nips.cc/paper/](http://papers.nips.cc/paper/7209-learned-in-translation-contextualized-word-vectors.pdf)`
`[7209-learned-in-translation-contextualized-word-vectors.pdf](http://papers.nips.cc/paper/7209-learned-in-translation-contextualized-word-vectors.pdf)` .


Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. The natural language
decathlon: Multitask learning as question answering. _arXiv preprint 1806.08730_, 2018. URL
`[https://arxiv.org/abs/1806.08730](https://arxiv.org/abs/1806.08730)` .


R. Thomas McCoy, Ellie Pavlick, and Tal Linzen. Right for the wrong reasons: Diagnosing syntactic
heuristics in natural language inference. In _Proceedings of the Association for Computational_
_Linguistics (ACL)_ . Association for Computational Linguistics, 2019. URL `[https://arxiv.org/](https://arxiv.org/abs/1902.01007)`
`[abs/1902.01007](https://arxiv.org/abs/1902.01007)` .


Richard T. McCoy and Tal Linzen. Non-entailed subsequences as a challenge for natural language
inference. In _Proceedings of the Society for Computational in Linguistics (SCiL) 2019_, 2019. URL
`[https://scholarworks.umass.edu/scil/vol2/iss1/46/](https://scholarworks.umass.edu/scil/vol2/iss1/46/)` .


George A Miller. WordNet: a lexical database for english. _Communications of the ACM_, 1995. URL

`[https://www.aclweb.org/anthology/H94-1111](https://www.aclweb.org/anthology/H94-1111)` .


Aakanksha Naik, Abhilasha Ravichander, Norman M. Sadeh, Carolyn Penstein Ros√©, and Graham
Neubig. Stress test evaluation for natural language inference. In _International Conference on_
_Computational Linguistics (COLING)_, 2018.


13


Nikita Nangia and Samuel R. Bowman. Human vs. Muppet: A conservative estimate of human performance on the GLUE benchmark. In _Proceedings of the Association of Compu-_
_tational Linguistics (ACL)_ . Association for Computational Linguistics, 2019. URL `[https:](https://woollysocks.github.io/assets/GLUE_Human_Baseline.pdf)`
`[//woollysocks.github.io/assets/GLUE_Human_Baseline.pdf](https://woollysocks.github.io/assets/GLUE_Human_Baseline.pdf)` .


Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito,
Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in
PyTorch. In _Advances in Neural Information Processing Systems (NeurIPS)_ . Curran Associates,
Inc., 2017. URL `[https://openreview.net/pdf?id=BJJsrmfCZ](https://openreview.net/pdf?id=BJJsrmfCZ)` .


Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove: Global vectors for word
representation. In _Proceedings of the Conference on Empirical Methods in Natural Language Pro-_
_cessing (EMNLP)_ . Association for Computational Linguistics, 2014. doi: 10.3115/v1/D14-1162.
URL `[https://www.aclweb.org/anthology/D14-1162](https://www.aclweb.org/anthology/D14-1162)` .


Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and
Luke Zettlemoyer. Deep contextualized word representations. In _Proceedings of the Conference of_
_the North American Chapter of the Association for Computational Linguistics: Human Language_
_Technologies (NAACL-HLT)_ . Association for Computational Linguistics, 2018. doi: 10.18653/v1/
N18-1202. URL `[https://www.aclweb.org/anthology/N18-1202](https://www.aclweb.org/anthology/N18-1202)` .


Jason Phang, Thibault F√©vry, and Samuel R Bowman. Sentence encoders on STILTs: Supplementary
training on intermediate labeled-data tasks. _arXiv preprint 1811.01088_, 2018. URL `[https:](https://arxiv.org/abs/1811.01088)`
`[//arxiv.org/abs/1811.01088](https://arxiv.org/abs/1811.01088)` .


Mohammad Taher Pilehvar and Jose Camacho-Collados. WiC: The word-in-context dataset for
evaluating context-sensitive meaning representations. In _Proceedings of the Conference of the_
_North American Chapter of the Association for Computational Linguistics: Human Language_
_Technologies (NAACL-HLT)_ . Association for Computational Linguistics, 2019. URL `[https:](https://arxiv.org/abs/1808.09121)`
`[//arxiv.org/abs/1808.09121](https://arxiv.org/abs/1808.09121)` .


Adam Poliak, Aparajita Haldar, Rachel Rudinger, J. Edward Hu, Ellie Pavlick, Aaron Steven White,
and Benjamin Van Durme. Collecting diverse natural language inference problems for sentence
representation evaluation. In _Proceedings of the 2018 Conference on Empirical Methods in_
_Natural Language Processing_ . Association for Computational Linguistics, 2018. URL `[https:](https://www.aclweb.org/anthology/D18-1007)`
`[//www.aclweb.org/anthology/D18-1007](https://www.aclweb.org/anthology/D18-1007)` .


Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training, 2018. Unpublished ms. available through a link at
`[https://blog.openai.com/language-unsupervised/](https://blog.openai.com/language-unsupervised/)` .


Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions
for machine comprehension of text. In _Proceedings of the Conference on Empirical Methods in_
_Natural Language Processing (EMNLP)_ . Association for Computational Linguistics, 2016. doi:
10.18653/v1/D16-1264. URL `[http://aclweb.org/anthology/D16-1264](http://aclweb.org/anthology/D16-1264)` .


Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S. Gordon. Choice of plausible alternatives:
An evaluation of commonsense causal reasoning. In _2011 AAAI Spring Symposium Series_, 2011.


Rachel Rudinger, Chandler May, and Benjamin Van Durme. Social bias in elicited natural language
inferences. In _Proceedings of the First ACL Workshop on Ethics in Natural Language Processing_ .
Association for Computational Linguistics, 2017. doi: 10.18653/v1/W17-1609. URL `[https:](https://www.aclweb.org/anthology/W17-1609)`
`[//www.aclweb.org/anthology/W17-1609](https://www.aclweb.org/anthology/W17-1609)` .


Rachel Rudinger, Jason Naradowsky, Brian Leonard, and Benjamin Van Durme. Gender bias in
coreference resolution. In _Proceedings of the 2018 Conference of the North American Chapter_
_of the Association for Computational Linguistics: Human Language Technologies_ . Association
for Computational Linguistics, 2018. doi: 10.18653/v1/N18-2002. URL `[https://www.aclweb.](https://www.aclweb.org/anthology/N18-2002)`
`[org/anthology/N18-2002](https://www.aclweb.org/anthology/N18-2002)` .


Maarten Sap, Hannah Rashkin, Derek Chen, Ronan LeBras, and Yejin Choi. Socialiqa: Commonsense
reasoning about social interactions. _arXiv preprint 1904.09728_, 2019. URL `[https://arxiv.](https://arxiv.org/abs/1904.09728)`
`[org/abs/1904.09728](https://arxiv.org/abs/1904.09728)` .


14


Nathan Schneider and Noah A Smith. A corpus and model integrating multiword expressions and
supersenses. In _Proceedings of the Conference of the North American Chapter of the Association_
_for Computational Linguistics: Human Language Technologies (NAACL-HLT)_ . Association for
Computational Linguistics, 2015. URL `[https://www.aclweb.org/anthology/N15-1177](https://www.aclweb.org/anthology/N15-1177)` .


Karin Kipper Schuler. _Verbnet: A Broad-coverage, Comprehensive Verb Lexicon_ . PhD thesis, 2005.
URL `[http://verbs.colorado.edu/~kipper/Papers/dissertation.pdf](http://verbs.colorado.edu/~kipper/Papers/dissertation.pdf)` .


Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Ng,
and Christopher. Potts. Recursive deep models for semantic compositionality over a sentiment
treebank. In _Proceedings of the Conference on Empirical Methods in Natural Language Processing_
_(EMNLP)_ . Association for Computational Linguistics, 2013. URL `[https://www.aclweb.org/](https://www.aclweb.org/anthology/D13-1170)`
`[anthology/D13-1170](https://www.aclweb.org/anthology/D13-1170)` .


Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R Thomas McCoy, Najoung Kim,
Benjamin Van Durme, Sam Bowman, Dipanjan Das, and Ellie Pavlick. What do you learn from
context? probing for sentence structure in contextualized word representations. 2019. URL
`[https://openreview.net/forum?id=SJzSgnRcKX](https://openreview.net/forum?id=SJzSgnRcKX)` .


Harsh Trivedi, Heeyoung Kwon, Tushar Khot, Ashish Sabharwal, and Niranjan Balasubramanian.
Repurposing entailment for multi-hop question answering tasks, 2019. URL `[https://arxiv.](https://arxiv.org/abs/1904.09380)`
`[org/abs/1904.09380](https://arxiv.org/abs/1904.09380)` .


Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman.
GLUE: A multi-task benchmark and analysis platform for natural language understanding. In
_International Conference on Learning Representations_, 2019a. URL `[https://openreview.](https://openreview.net/forum?id=rJ4km2R5t7)`
`[net/forum?id=rJ4km2R5t7](https://openreview.net/forum?id=rJ4km2R5t7)` .


Alex Wang, Ian F. Tenney, Yada Pruksachatkun, Katherin Yu, Jan Hula, Patrick Xia, Raghu Pappagari,
Shuning Jin, R. Thomas McCoy, Roma Patel, Yinghui Huang, Jason Phang, Edouard Grave,
Najoung Kim, Phu Mon Htut, Thibault F‚Äôevry, Berlin Chen, Nikita Nangia, Haokun Liu,, Anhad
Mohananey, Shikha Bordia, Ellie Pavlick, and Samuel R. Bowman. jiant 1.0: A software toolkit
for research on general-purpose text understanding models. `[http://jiant.info/](http://jiant.info/)`, 2019b.


Alex Warstadt, Amanpreet Singh, and Samuel R Bowman. Neural network acceptability judgments.
_arXiv preprint 1805.12471_, 2018. URL `[https://arxiv.org/abs/1805.12471](https://arxiv.org/abs/1805.12471)` .


Kellie Webster, Marta Recasens, Vera Axelrod, and Jason Baldridge. Mind the GAP: A balanced
corpus of gendered ambiguous pronouns. _Transactions of the Association for Computational_
_Linguistics (TACL)_, 2018. URL `[https://www.aclweb.org/anthology/Q18-1042](https://www.aclweb.org/anthology/Q18-1042)` .


Adina Williams, Nikita Nangia, and Samuel Bowman. A broad-coverage challenge corpus for
sentence understanding through inference. In _Proceedings of the Conference of the North American_
_Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-_
_HLT)_ . Association for Computational Linguistics, 2018. URL `[http://aclweb.org/anthology/](http://aclweb.org/anthology/N18-1101)`
`[N18-1101](http://aclweb.org/anthology/N18-1101)` .


Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V.
Le. Xlnet: Generalized autoregressive pretraining for language understanding. _arXiv preprint_
_1906.0823_, 2019.


Fabio Massimo Zanzotto and Lorenzo Ferrone. Have you lost the thread? discovering ongoing
conversations in scattered dialog blocks. _ACM Transactions on Interactive Intelligent Systems_
_(TiiS)_, 2017.


Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin Choi. SWAG: A large-scale adversarial dataset
for grounded commonsense inference. 2018. URL `[https://www.aclweb.org/anthology/](https://www.aclweb.org/anthology/D18-1009)`
`[D18-1009](https://www.aclweb.org/anthology/D18-1009)` .


Sheng Zhang, Xiaodong Liu, Jingjing Liu, Jianfeng Gao, Kevin Duh, and Benjamin Van Durme.
Record: Bridging the gap between human and machine commonsense reading comprehension.
_arXiv preprint 1810.12885_, 2018.


15


Yuan Zhang, Jason Baldridge, and Luheng He. PAWS: Paraphrase adversaries from word scrambling.
_arXiv preprint 1904.01130_, 2019. URL `[https://arxiv.org/abs/1904.01130](https://arxiv.org/abs/1904.01130)` .


Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. Gender bias in
coreference resolution: Evaluation and debiasing methods. In _Proceedings of the 2018 Conference_
_of the North American Chapter of the Association for Computational Linguistics: Human Language_
_Technologies_ . Association for Computational Linguistics, 2018. doi: 10.18653/v1/N18-2003. URL

`[https://www.aclweb.org/anthology/N18-2003](https://www.aclweb.org/anthology/N18-2003)` .


16


Table 4: Baseline performance on the SuperGLUE development.


**Model** **Avg** **BoolQ** **CB** **COPA** **MultiRC** **ReCoRD** **RTE** **WiC** **WSC**
**Metrics** **Acc.** **Acc./F1** **Acc.** **F1** _a_ **/EM** **F1/EM** **Acc.** **Acc.** **Acc.**


Most Frequent Class 47.7 62.2 50 /22.2 55 59.9/ 0.8 32.4/31.5 52.7 50.0 63.5
CBOW 47.7 62.4 71.4/49.6 63.0 20.3/ 0.3 14.4/13.8 54.2 55.3 61.5

BERT 72.2 77.7 94.6/93.7 69.0 70.5/24.7 70.6/69.8 75.8 74.9 68.3

BERT++ 74.6 80.1 96.4/95.0 78.0 70.5/24.7 70.6/69.8 82.3 74.9 68.3


**A** **Development Set Results**


In Table 4, we present results of the baselines on the SuperGLUE tasks development sets.


**B** **Performance on GLUE Diagnostics**


Figure 2 shows the performance on the GLUE diagnostics dataset for systems submitted to the public
leaderboard.









|80<br>60<br>40<br>20<br>0<br>20<br>40<br>60<br>Disjun|Col2|Col3|Col4|Col5|Col6|Col7|Col8|Col9|Col10|
|---|---|---|---|---|---|---|---|---|---|
|~~Disju~~<br>60<br>40<br>20<br>0<br>20<br>40<br>60<br>80||||||||||
|~~Disju~~<br>60<br>40<br>20<br>0<br>20<br>40<br>60<br>80||||||||||
|~~Disju~~<br>60<br>40<br>20<br>0<br>20<br>40<br>60<br>80||||||||||
|~~Disju~~<br>60<br>40<br>20<br>0<br>20<br>40<br>60<br>80||||||||||
|~~Disju~~<br>60<br>40<br>20<br>0<br>20<br>40<br>60<br>80||||||||||
|~~Disju~~<br>60<br>40<br>20<br>0<br>20<br>40<br>60<br>80|||||Chan<br>~~BiLS~~|Chan<br>~~BiLS~~||||
|~~Disju~~<br>60<br>40<br>20<br>0<br>20<br>40<br>60<br>80|||||Chan<br>~~BiLS~~|Chan<br>~~BiLS~~|ce<br>~~M+ELMo+Attn~~|BERT <br>~~SemB~~|BAM<br>~~RT~~|
|~~Disju~~<br>60<br>40<br>20<br>0<br>20<br>40<br>60<br>80|||||Open<br>BERT<br>|Open<br>BERT<br>|AI GPT<br> + Single~~-~~task Ad<br>|apters<br><br>Snorke<br>ALICE <br>|l MeTaL<br>(Large)<br>|
|~~Disju~~<br>60<br>40<br>20<br>0<br>20<br>40<br>60<br>80||||~~ction~~<br>~~Downwar~~|~~Monotone~~<br>~~Restri~~<br>~~BERT~~<br>BERT|~~Monotone~~<br>~~Restri~~<br>~~BERT~~<br>BERT|~~(Large)~~<br> on STILTs|~~MT-DN~~<br>XLNet~~-~~|~~N (ensemble)~~<br>Large (ensemble)|
|~~Disju~~<br>60<br>40<br>20<br>0<br>20<br>40<br>60<br>80||||~~ction~~<br>~~Downwar~~|~~Monotone~~<br>~~Restri~~<br>~~BERT~~<br>BERT|~~Monotone~~<br>~~Restri~~<br>~~BERT~~<br>BERT|~~tivity~~<br>~~Doubl~~|~~Negation~~<br>~~Preposition~~|~~l Phrases~~|


Figure 2: Performance of GLUE submissions on selected diagnostic categories, reported using the
_R_ 3 metric scaled up by 100, as in Wang et al. (2019a, see paper for a description of the categories).
Some initially difficult categories, like double negation, saw gains from advances on GLUE, but
others remain hard (restrictivity) or even adversarial (disjunction, downward monotone).


**C** **Instructions to Crowd Workers**


**C.1** **Training Phase Instructions**


For collecting data to establish human performance on the SuperGLUE tasks, we follow a two
step procedure where we first provide some training to the crowd workers before they proceed to
annotation. In the training step, we provide workers with brief instructions about the training phase.
An example of these instructions is given Table 5. These training instructions are the same across
tasks, only the task name in the instructions is changed.


17


**C.2** **Task Instructions**


During training and annotation for each task, we provide workers with brief instructions tailored to
the task. We also link workers to an FAQ page for the task. Tables 6, 7, 8, and 9, show the instructions
we used for all four tasks: COPA, CommitmentBank, WSC, and BoolQ respectively. The instructions
given to crowd workers for annotations on the diagnostic and bias diagnostic datasets are shown in
Table 11.


We collected data to produce conservative estimates for human performance on several tasks that
we did not ultimately include in our benchmark, including GAP (Webster et al., 2018), PAWS
(Zhang et al., 2019), Quora Insincere Questions, [11] Ultrafine Entity Typing (Choi et al., 2018b), and
Empathetic Reactions datasets (Buechel et al., 2018). The instructions we used for these tasks are
shown in Tables 12, 13, 14, 15, and 16.


**C.3** **Task Specific Details**


For WSC and COPA we provide annotators with a two way classification problem. We then use
majority vote across annotations to calculate human performance.


**CommitmentBank** We follow the authors in providing annotators with a 7-way classification
problem. We then collapse the annotations into 3 classes by using the same ranges for bucketing used
by De Marneffe et al. (2019). We then use majority vote to get human performance numbers on the
task.


Furthermore, for training on CommitmentBank we randomly sample examples from the low interannotator agreement portion of the CommitmentBank data that is not included in the benchmark
version of the task. These low agreement examples are generally harder to classify since they are
more ambiguous.


**Diagnostic Dataset** Since the diagnostic dataset does not come with accompanying training data,
we train our workers on examples from RTE‚Äôs development set. RTE is also a textual entailment
task and is the most closely related task in the main benchmark. Providing the crowd workers with
training on RTE enables them to learn label definitions which should generalize to the diagnostic
dataset.


**Ultrafine Entity Typing** We cast the task into a binary classification problem to make it an easier
task for non-expert crowd workers. We work in cooperation with the authors of the dataset (Choi
et al., 2018b) to do this reformulation: We give workers one possible tag for a word or phrase and
asked them to classify the tag as being applicable or not.


The authors used WordNet (Miller, 1995) to expand the set of labels to include synonyms and
hypernyms from WordNet. They then asked five annotators to validate these tags. The tags from this
validation had high agreement, and were included in the publicly available Ultrafine Entity Typing
dataset, [12] This constitutes our set of positive examples. The rest of the tags from the validation
procedure that are not in the public dataset constitute our negative examples.


**GAP** For the Gendered Ambiguous Pronoun Coreference task (GAP, Webster et al., 2018), we
simplified the task by providing noun phrase spans as part of the input, thus reducing the original
structure prediction task to a classification task. This task was presented to crowd workers as a three
way classification problem: Choose span A, B, or neither.


**D** **Excluded Tasks**


In this section we provide some examples of tasks that we evaluated for inclusion but ultimately could
not include. We report on these excluded tasks only with the permission of their authors. We turned
down many medical text datasets because they are usually only accessible with explicit permission
and credentials from the data owners.


11 `[https://www.kaggle.com/c/quora-insincere-questions-classification/data](https://www.kaggle.com/c/quora-insincere-questions-classification/data)`
12 `[https://homes.cs.washington.edu/~eunsol/open_entity.html](https://homes.cs.washington.edu/~eunsol/open_entity.html)`


18


Tasks like QuAC (Choi et al., 2018a) and STREUSLE (Schneider and Smith, 2015) differed substantially from the format of other tasks in our benchmark, which we worried would incentivize users
to spend significant effort on task-specific model designs, rather than focusing on general-purpose
techniques. It was challenging to train annotators to do well on Quora Insincere Questions [13], Empathetic Reactions (Buechel et al., 2018), and a recast version of Ultra-Fine Entity Typing (Choi et al.,
2018b, see Appendix C.3 for details), leading to low human performance. BERT achieved very high
or superhuman performance on Query Well-Formedness (Faruqui and Das, 2018), PAWS (Zhang
et al., 2019), Discovering Ongoing Conversations (Zanzotto and Ferrone, 2017), and GAP (Webster
et al., 2018).


During the process of selecting tasks for our benchmark, we collected human performance baselines
and run BERT-based machine baselines for some tasks that we ultimately excluded from our task
list. We chose to exclude these tasks because our BERT baseline performs better than our human
performance baseline or if the gap between human and machine performance is small.


On Quora Insincere Questions our BERT baseline outperforms our human baseline by a small margin:
an F1 score of 67.2 versus 66.7 for BERT and human baselines respectively. Similarly, on the
Empathetic Reactions dataset, BERT outperforms our human baseline, where BERT‚Äôs predictions
have a Pearson correlation of 0.45 on empathy and 0.55 on distress, compared to 0.45 and 0.35 for
our human baseline. For PAWS-Wiki, we report that BERT achieves an accuracy of 91.9%, while our
human baseline achieved 84% accuracy. These three tasks are excluded from the benchmark since
our, admittedly conservative, human baselines are worse than machine performance. Our human
performance baselines are subject to the clarity of our instructions (all instructions can be found in
Appendix C), and crowd workers engagement and ability.


For the Query Well-Formedness task, the authors set an estimate human performance at 88.4%
accuracy. Our BERT baseline model reaches an accuracy of 82.3%. While there is a positive gap on
this task, the gap was smaller than we were were willing to tolerate. Similarly, on our recast version
of the Ultrafine Entity Typing, we observe too small a gap between human (60.2 F1) and machine
performance (55.0 F1). Our recasting for this task is described in Appendix C.2. On GAP, when
taken as a classification problem without the related task of span selection (details in C.2), BERT
performs (91.0 F1) comparably to our human baseline (94.9 F1). Given this small margin, we also
exclude GAP.


On Discovering Ongoing Conversations, our BERT baseline achieves an F1 of 51.9 on a version of
the task cast as sentence pair classification (given two snippets of texts from plays, determine if the
second snippet is a continuation of the first). This dataset is very class imbalanced (90% negative), so
we also experimented with a class-balanced version on which our BERT baselines achieves 88.4
F1. Qualitatively, we also found the task challenging for humans as there was little context for the
text snippets and the examples were drawn from plays using early English. Given this fairly high
machine performance and challenging nature for humans, we exclude this task from our benchmark.


_Instructions tables begin on the following page._


13 `[https://www.kaggle.com/c/quora-insincere-questions-classification/data](https://www.kaggle.com/c/quora-insincere-questions-classification/data)`


19


Table 5: The instructions given to crowd-sourced worker describing the training phase for the Choice
of Plausible Answers (COPA) task.


The New York University Center for Data Science is collecting your answers for use in research
on computer understanding of English. Thank you for your help!


This project is a **training task** that needs to be completed before working on the main project
on AMT named Human Performance: Plausible Answer. Once you are done with the training,
please proceed to the main task! The qualification approval is not immediate but we will add
you to our qualified workers list within a day.


In this training, you must answer the question on the page and then, to see how you did, click
the **Check Work** button at the bottom of the page before hitting Submit. The Check Work
button will reveal the true label. Please use this training and the provided answers to build
an understanding of what the answers to these questions look like (the main project, Human
Performance: Plausible Answer, does not have the answers on the page).


Table 6: Task-specific instructions for Choice of Plausible Alternatives (COPA). These instructions
were provided during both training and annotation phases.


**Plausible Answer Instructions**


The New York University Center for Data Science is collecting your answers for use in research
on computer understanding of English. Thank you for your help!


We will present you with a prompt sentence and a question. The question will either be about
what caused the situation described in the prompt, or what a possible effect of that situation is.
We will also give you two possible answers to this question. Your job is to decide, given the
situation described in the prompt, which of the two options is a more plausible answer to the
question:


In the following example, option 1. is a more plausible answer to the question about what caused
the situation described in the prompt,


_The girl received a trophy._
_What‚Äôs the CAUSE for this?_


1. _She won a spelling bee._
2. _She made a new friend._


In the following example, option 2. is a more plausible answer the question about what happened
because of the situation described in the prompt,


_The police aimed their weapons at the fugitive._
_What happened as a RESULT?_


1. _The fugitive fell to the ground._
2. _The fugitive dropped his gun._


[If you have any more questions, please refer to our FAQ page.](https://nyu-mll.github.io/SuperGLUE-human/copa-faq)


20


Table 7: Task-specific instructions for Commitment Bank. These instructions were provided during
both training and annotation phases.


**Speaker Commitment Instructions**


The New York University Center for Data Science is collecting your answers for use in research
on computer understanding of English. Thank you for your help!


We will present you with a prompt taken from a piece of dialogue, this could be a single sentence,
a few sentences, or a short exchange between people. Your job is to figure out, based on this
first prompt (on top), how certain the speaker is about the truthfulness of the second prompt
(on the bottom). You can choose from a 7 point scale ranging from (1) completely certain that
the second prompt is true to (7) completely certain that the second prompt is false. Here are
examples for a few of the labels:


Choose 1 (certain that it is true) if the speaker from the first prompt definitely believes or knows
that the second prompt is true. For example,


_"What fun to hear Artemis laugh. She‚Äôs such a serious child. I didn‚Äôt know_
_she had a sense of humor."_
_"Artemis had a sense of humor"_


Choose 4 (not certain if it is true or false) if the speaker from the first prompt is uncertain if the
second prompt is true or false. For example,


_"Tess is committed to track. She‚Äôs always trained with all her heart and soul._
_One can only hope that she has recovered from the flu and will cross the finish_
_line."_

_"Tess crossed the finish line."_


Choose 7 (certain that it is false) if the speaker from the first prompt definitely believes or knows
that the second prompt is false. For example,


_"Did you hear about Olivia‚Äôs chemistry test? She studied really hard. But_
_even after putting in all that time and energy, she didn‚Äôt manage to pass the_
_test"._

_"Olivia passed the test."_


[If you have any more questions, please refer to our FAQ page.](https://nyu-mll.github.io/SuperGLUE-human/commit-faq)


21


Table 8: Task-specific instructions for Winograd Schema Challenge (WSC). These instructions were
provided during both training and annotation phases.


**Winograd Schema Instructions**


The New York University Center for Data Science is collecting your answers for use in research
on computer understanding of English. Thank you for your help!


We will present you with a sentence that someone wrote, with one bolded pronoun. We will then
ask if you if the pronoun refers to a specific word or phrase in the sentence. Your job is to figure
out, based on the sentence, if the bolded pronoun refers to this selected word or phrase:


Choose Yes if the pronoun refers to the selected word or phrase. For example,


_"I put the cake away in the refrigerator. It has a lot of butter in it."_
_Does_ _**It**_ _in "It has a lot" refer to_ _**cake**_ _?_


Choose No if the pronoun does not refer to the selected word or phrase. For example,


_"The large ball crashed right through the table because it was made of_
_styrofoam."_
_Does_ _**it**_ _in "it was made" refer to_ _**ball**_ _?_


[If you have any more questions, please refer to our FAQ page.](https://nyu-mll.github.io/SuperGLUE-human/wsc-faq)


22


Table 9: Task-specific instructions for BoolQ (continued in Table 10). These instructions were
provided during both training and annotation phases.


**Question-Answering Instructions**


The New York University Center for Data Science is collecting your answers for use in research
on computer understanding of English. Thank you for your help!


We will present you with a passage taken from a Wikipedia article and a relevant question. Your
job is to decide, given the information provided in the passage, if the answer to the question is
Yes or No. For example,


**In the following examples the correct answer is Yes,**


_The thirteenth season of Criminal Minds was ordered on April 7, 2017, by_
_CBS with an order of 22 episodes. The season premiered on September 27,_
_2017 in a new time slot at 10:00PM on Wednesday when it had previously_
_been at 9:00PM on Wednesday since its inception. The season concluded on_
_April 18, 2018 with a two-part season finale._

_will there be a 13th season of criminal minds?_
(In the above example, the first line of the passage says that the 13th season of
the show was ordered.)


_As of 8 August 2016, the FDA extended its regulatory power to include e-_
_cigarettes. Under this ruling the FDA will evaluate certain issues, including_
_ingredients, product features and health risks, as well their appeal to minors_
_and non-users. The FDA rule also bans access to minors. A photo ID is_
_required to buy e-cigarettes, and their sale in all-ages vending machines is not_
_permitted. The FDA in September 2016 has sent warning letters for unlawful_
_underage sales to online retailers and retailers of e-cigarettes._
_is vaping illegal if you are under 18?_
(In the above example, the passage states that the "FDA rule also bans access
to minors." The question uses the word "vaping," which is a synonym for
e-cigrattes.)


**In the following examples the correct answer is No,**


_Badgers are short-legged omnivores in the family Mustelidae, which also_
_includes the otters, polecats, weasels, and wolverines. They belong to the_
_caniform suborder of carnivoran mammals. The 11 species of badgers are_
_grouped in three subfamilies: Melinae (Eurasian badgers), Mellivorinae (the_
_honey badger or ratel), and Taxideinae (the American badger). The Asiatic_
_stink badgers of the genus Mydaus were formerly included within Melinae_
_(and thus Mustelidae), but recent genetic evidence indicates these are actually_
_members of the skunk family, placing them in the taxonomic family Mephitidae._
_is a wolverine the same as a badger?_
(In the above example, the passage says that badgers and wolverines are in
the same family, Mustelidae, which does not mean they are the same animal.)


23


Table 10: Continuation from Table 9 of task-specific instructions for BoolQ. These instructions were
provided during both training and annotation phases.


_More famously, Harley-Davidson attempted to register as a trademark the_
_distinctive ‚Äúchug‚Äù of a Harley-Davidson motorcycle engine. On February_
_1, 1994, the company filed its application with the following description:_
_‚ÄúThe mark consists of the exhaust sound of applicant‚Äôs motorcycles, produced_
_by V-twin, common crankpin motorcycle engines when the goods are in use.‚Äù_
_Nine of Harley-Davidson‚Äôs competitors filed oppositions against the applica-_
_tion, arguing that cruiser-style motorcycles of various brands use the same_
_crankpin V-twin engine which produces the same sound. After six years of_
_litigation, with no end in sight, in early 2000, Harley-Davidson withdrew their_
_application._
_does harley davidson have a patent on their sound?_
(In the above example, the passage states that Harley-Davidson applied for a
patent but then withdrew, so they do not have a patent on the sound.)


[If you have any more questions, please refer to our FAQ page.](https://nyu-mll.github.io/SuperGLUE-human/boolq-faq)


24


Table 11: Task-specific instructions for the diagnostic and the bias diagnostic datasets. These
instructions were provided during both training and annotation phases.


**Textual Entailment Instructions**


The New York University Center for Data Science is collecting your answers for use in research
on computer understanding of English. Thank you for your help!


We will present you with a prompt taken from an article someone wrote. Your job is to figure out,
based on this correct prompt (the first prompt, on top), if another prompt (the second prompt, on
bottom) is also necessarily true:


Choose True if the event or situation described by the first prompt definitely implies that the
second prompt, on bottom, must also be true. For example,


_‚Ä¢ "Murphy recently decided to move to London."_

_"Murphy recently decided to move to England."_
(The above example is True because London is in England and therefore prompt 2 is
clearly implied by prompt 1.)

_‚Ä¢_ _"Russian cosmonaut Valery Polyakov set the record for the longest continuous amount_
_of time spent in space, a staggering 438 days, between 1994 and 1995."_
_"Russians hold record for longest stay in space."_
(The above example is True because the information in the second prompt is contained
in the first prompt: Valery is Russian and she set the record for longest stay in space.)

_‚Ä¢_ _"She does not disgree with her brother‚Äôs opinion, but she believes he‚Äôs too aggresive in_
_his defense"_
_"She agrees with her brother‚Äôs opinion, but she believes he‚Äôs too aggresive in his_
_defense"_
(The above example is True because the second prompt is an exact paraphrase of the
first prompt, with exactly the same meaning.)


Choose False if the event or situation described with the first prompt on top does not necessarily
imply that this second prompt must also be true. For example,


_‚Ä¢ "This method was developed at Columbia and applied to data processing at CERN."_

_"This method was developed at Columbia and applied to data processing at CERN_
_with limited success."_

(The above example is False because the second prompt is introducing new information
not implied in the first prompt: The first prompt does not give us any knowledge of
how succesful the application of the method at CERN was.)

_‚Ä¢ "This building is very tall."_

_"This is the tallest building in New York."_
(The above example is False because a building being tall does not mean it must be the
tallest building, nor that it is in New York.)

_‚Ä¢_ _"Hours earlier, Yasser Arafat called for an end to attacks against Israeli civilians in_
_the two weeks before Israeli elections."_
_"Arafat condemned suicide bomb attacks inside Israel."_
(The above example is False because from the first prompt we only know that Arafat
called for an end to attacks against Israeli citizens, we do not know what kind of attacks
he may have been condemning.)


You do not have to worry about whether the writing style is maintained between the two prompts.


[If you have any more questions, please refer to our FAQ page.](https://nyu-mll.github.io/SuperGLUE-human/diagnostic-faq)


25


Table 12: Task-specific instructions for the Gendered Ambiguous Pronoun Coreference (GAP) task.
These instructions were provided during both training and annotation phases.


**GAP Instructions**


The New York University Center for Data Science is collecting your answers for use in research
on computer understanding of English. Thank you for your help!


We will present you with an extract from a Wikipedia article, with one bolded pronoun. We will
also give you two names from the text that this pronoun could refer to. Your job is to figure out,
based on the extract, if the pronoun refers to option A, options B, or neither:


Choose A if the pronoun refers to option A. For example,


_"In 2010 Ella Kabambe was not the official Miss Malawi; this was Faith_
_Chibale, but Kabambe represented the country in the Miss World pageant._
_At the 2012 Miss World, Susan Mtegha pushed Miss New Zealand, Collette_
_Lochore, during the opening headshot of the pageant, claiming that Miss New_
_Zealand was in her space."_
_Does_ _**her**_ _refer to option A or B below?_


A _Susan Mtegha_


B _Collette Lochore_

C _Neither_


Choose B if the pronoun refers to option B. For example,


_"In 1650 he started his career as advisor in the ministerium of finances in Den_
_Haag. After he became a minister he went back to Amsterdam, and took place_
_as a sort of chairing mayor of this city. After the death of his brother Cornelis,_
_De Graeff became the strong leader of the republicans. He held this position_
_until the rampjaar."_
_Does_ _**He**_ _refer to option A or B below?_


A _Cornelis_


B _De Graeff_

C _Neither_


Choose C if the pronoun refers to neither option. For example,


_"Reb Chaim Yaakov‚Äôs wife is the sister of Rabbi Moishe Sternbuch, as is_
_the wife of Rabbi Meshulam Dovid Soloveitchik, making the two Rabbis his_
_uncles. Reb Asher‚Äôs brother Rabbi Shlomo Arieli is the author of a critical_
_edition of the novallae of Rabbi Akiva Eiger. Before his marriage, Rabbi Arieli_
_studied in the Ponevezh Yeshiva headed by Rabbi Shmuel Rozovsky, and he_
_later studied under his father-in-law in the Mirrer Yeshiva."_
_Does_ _**his**_ _refer to option A or B below?_


A _Reb Asher_


B _Akiva Eiger_

C _Neither_


[If you have any more questions, please refer to our FAQ page.](https://nyu-mll.github.io/SuperGLUE-human/gap-faq)


26


Table 13: Task-specific instructions for the Paraphrase Adversaries from Word Scrambling (PAWS)
task. These instructions were provided during both training and annotation phases.


**Paraphrase Detection Instructions**


The New York University Center for Data Science is collecting your answers for use in research
on computer understanding of English. Thank you for your help!


We will present you with two similar sentences taken from Wikipedia articles. Your job is to
figure out if these two sentences are paraphrases of each other, and convey exactly the same
meaning:


Choose Yes if the sentences are paraphrases and have the exact same meaning. For example,


_"Hastings Ndlovu was buried with Hector Pieterson at Avalon Cemetery in_
_Johannesburg."_
_"Hastings Ndlovu, together with Hector Pieterson, was buried at the Avalon_
_cemetery in Johannesburg ."_


_"The complex of the Trabzon World Trade Center is close to Trabzon Airport_
_."_

_"The complex of World Trade Center Trabzon is situated close to Trabzon_
_Airport ."_


Choose No if the two sentences are not exact paraphrases and mean different things. For
example,


_"She was only a few months in French service when she met some British_
_frigates in 1809 ."_
_"She was only in British service for a few months, when in 1809, she_
_encountered some French frigates ."_


_"This work caused him to trigger important reflections on the practices of_
_molecular genetics and genomics at a time when this was not considered_
_ethical ."_

_"This work led him to trigger ethical reflections on the practices of molecular_
_genetics and genomics at a time when this was not considered important ."_


[If you have any more questions, please refer to our FAQ page.](https://nyu-mll.github.io/SuperGLUE-human/paws-faq)


27


Table 14: Task-specific instructions for the Quora Insincere Questions task. These instructions were
provided during both training and annotation phases.


**Insincere Questions Instructions**


The New York University Center for Data Science is collecting your answers for use in research
on computer understanding of English. Thank you for your help!


We will present you with a question that someone posted on Quora. Your job is to figure out
whether or not this is a sincere question. An insincere question is defined as a question intended
to make a statement rather than look for helpful answers. Some characteristics that can signify
that a question is insincere:


_‚Ä¢_ Has a non-neutral tone


**‚Äì**
Has an exaggerated tone to underscore a point about a group of people

**‚Äì**
Is rhetorical and meant to imply a statement about a group of people

_‚Ä¢_ Is disparaging or inflammatory


**‚Äì**
Suggests a discriminatory idea against a protected class of people, or seeks
confirmation of a stereotype

**‚Äì**
Makes disparaging attacks/insults against a specific person or group of people

**‚Äì**
Based on an outlandish premise about a group of people

**‚Äì**
Disparages against a characteristic that is not fixable and not measurable

_‚Ä¢_ Isn‚Äôt grounded in reality


**‚Äì**
Based on false information, or contains absurd assumptions

**‚Äì**
Uses sexual content (incest, bestiality, pedophilia) for shock value, and not to seek
genuine answers


Please note that there are far fewer insincere questions than there are sincere questions! So you
should expect to label most questions as sincere.


**Examples,**


Choose Sincere if you believe the person asking the question was genuinely seeking an answer
from the forum. For example,


_"How do DNA and RNA compare and contrast?"_
_"Are there any sports that you don‚Äôt like?_ "
_"What is the main purpose of penance?"_


Choose Insincere if you believe the person asking the question was not really seeking an answer
but was being inflammatory, extremely rhetorical, or absurd. For example,


_"How do I sell Pakistan? I need lots of money so I decided to sell Pakistan_
_any one wanna buy?"_
_"If Hispanics are so proud of their countries, why do they move out?"_
_"Why Chinese people are always not welcome in all countries?"_


[If you have any more questions, please refer to our FAQ page.](https://nyu-mll.github.io/SuperGLUE-human/quora-faq)


28


Table 15: Task-specific instructions for the Ultrafine Entity Typing task. These instructions were
provided during both training and annotation phases.


**Entity Typing Instructions**


The New York University Center for Data Science is collecting your answers for use in research
on computer understanding of English. Thank you for your help!


We will provide you with a sentence with on bolded word or phrase. We will also give you a
possible tag for this bolded word or phrase. Your job is to decide, in the context of the sentence,
if this tag is correct and applicable to the bolded word or phrase:


Choose Yes if the tag is applicable and accurately describes the selected word or phrase. For
example,


_‚ÄúSpain was the gold line."_ _**It**_ _started out with zero gold in 1937, and by 1945_
_it had 65.5 tons._

_Tag: nation_


Choose No if the tag is not applicable and does not describes the selected word or phrase. For
example,


_**Iraqi museum workers**_ _are starting to assess the damage to Iraq‚Äôs history._

_Tag: organism_


[If you have any more questions, please refer to our FAQ page.](https://nyu-mll.github.io/SuperGLUE-human/ultra-faq)


29


Table 16: Task-specific instructions for the Empathetic Reaction task. These instructions were
provided during both training and annotation phases.


**Empathy and Distress Analysis Instructions**


The New York University Center for Data Science is collecting your answers for use in research
on computer understanding of English. Thank you for your help!


We will present you with a message someone wrote after reading an article. Your job is to figure
out, based on this message, how disressed and empathetic the author was feeling. Empathy is
defined as feeling warm, tender, sympathetic, moved, or compassionate. Distressed is defined as
feeling worried, upset, troubled, perturbed, grieved, distrubed, or alarmed.


**Examples,**
The author of the following message was not feeling empathetic at all with an empathy score of 1,
and was very distressed with a distress score of 7,


_"I really hate ISIS. They continue to be the stain on society by committing_
_atrocities condemned by every nation in the world. They must be stopped at_
_all costs and they must be destroyed so that they wont hurt another soul. These_
_poor people who are trying to survive get killed, imprisoned, or brainwashed_
_into joining and there seems to be no way to stop them."_


The author of the following message is feeling very empathetic with an empathy score of 7 and
also very distressed with a distress score of 7,


_"All of you know that I love birds. This article was hard for me to read because_
_of that. Wind turbines are killing a lot of birds, including eagles. It‚Äôs really_
_very sad. It makes me feel awful. I am all for wind turbines and renewable_
_sources of energy because of global warming and coal, but this is awful. I_
_don‚Äôt want these poor birds to die like this. Read this article and you‚Äôll see_
_why."_


The author of the following message is feeling moderately empathetic with an
empathy score of 4 and moderately distressed with a distress score of 4,


_"I just read an article about wild fires sending a smokey haze across the state_
_near the Appalachian mountains. Can you imagine how big the fire must be_
_to spread so far and wide? And the people in the area obviously suffer the_
_most. What if you have asthma or some other condition that restricts your_
_breathing?"_


The author of the following message is feeling very empathetic with an empathy score of 7 and
mildly distressed with a distress score of 2,


_"This is a very sad article. Being of of the first female fighter pilots must_
_have given her and her family great honor. I think that there should be more_
_training for all pilots who deal in these acrobatic flying routines. I also think_
_that women have just as much of a right to become a fighter pilot as men."_


[If you have any more questions, please refer to our FAQ page.](https://nyu-mll.github.io/SuperGLUE-human/empathy-faq)


30




In [6]:
# import pathlib
# pathlib.Path("output.md").write_bytes(doc.encode())

### PyPDF experiments

In [7]:
def read_text_with_metadata_from_pdfs(file_path):
    reader = PdfReader(file_path)
    num_pages = len(reader.pages)
    data = []
    for i in range(num_pages):
        data.append({
            "file_type": "pdf",
            "file_name": file_path,
            "marker": i+1, # To identify where in the document this chunk is from
            "text": reader.pages[i].extract_text()
        })

    return data

In [8]:
lec_data = read_text_with_metadata_from_pdfs(lec_doc_path)

In [9]:
lec_data

[{'file_type': 'pdf',
  'file_name': '../data/input/full/references/superglue.pdf',
  'marker': 1,
  'text': 'SuperGLUE: A Stickier Benchmark for\nGeneral-Purpose Language Understanding Systems\nAlex Wang‚àó\nNew York University\nYada Pruksachatkun‚àó\nNew York University\nNikita Nangia‚àó\nNew York University\nAmanpreet Singh‚àó\nFacebook AI Research\nJulian Michael\nUniversity of Washington\nFelix Hill\nDeepMind\nOmer Levy\nFacebook AI Research\nSamuel R. Bowman\nNew York University\nAbstract\nIn the last year, new models and methods for pretraining and transfer learning have\ndriven striking performance improvements across a range of language understand-\ning tasks. The GLUE benchmark, introduced a little over one year ago, offers\na single-number metric that summarizes progress on a diverse set of such tasks,\nbut performance on the benchmark has recently surpassed the level of non-expert\nhumans, suggesting limited headroom for further research. In this paper we present\nSuperGLUE

In [10]:
for data in lec_data:
    print(data["text"])

SuperGLUE: A Stickier Benchmark for
General-Purpose Language Understanding Systems
Alex Wang‚àó
New York University
Yada Pruksachatkun‚àó
New York University
Nikita Nangia‚àó
New York University
Amanpreet Singh‚àó
Facebook AI Research
Julian Michael
University of Washington
Felix Hill
DeepMind
Omer Levy
Facebook AI Research
Samuel R. Bowman
New York University
Abstract
In the last year, new models and methods for pretraining and transfer learning have
driven striking performance improvements across a range of language understand-
ing tasks. The GLUE benchmark, introduced a little over one year ago, offers
a single-number metric that summarizes progress on a diverse set of such tasks,
but performance on the benchmark has recently surpassed the level of non-expert
humans, suggesting limited headroom for further research. In this paper we present
SuperGLUE, a new benchmark styled after GLUE with a new set of more difÔ¨Å-
cult language understanding tasks, a software toolkit, and a public 

### Opening Anoop's lecture PDFs

PyMuPDF's get_text() function does decently well
<ul>
    <li>Extracts meaningful text</li> 
    <li>Sigma for summation often replaced with X</li>
    <li>Equations and formula often split across multiple lines</li>
</ul>

But we found from our experiments that PyPDF does remarkably better
<ul>
    <li>Effectively recognizes most mathematical symbols</li>
    <li>Better at retaining new line information</li>
</ul>

We choose to go with PyPDF's extract_text() function.

For metadata, we provide the file name and page number to refer back to quickly to the source of the knowledge.

In [11]:
lec_doc_path = "../data/input/full/lectures/ff.pdf"

In [12]:
lec_data = read_text_with_metadata_from_pdfs(lec_doc_path)

In [13]:
lec_data

[{'file_type': 'pdf',
  'file_name': '../data/input/full/lectures/ff.pdf',
  'marker': 1,
  'text': '0\nSFU NatLangLab\nNatural Language Processing\nAnoop Sarkar\nanoopsarkar.github.io/nlp-class\nSimon Fraser University\nSeptember 27, 2024'},
 {'file_type': 'pdf',
  'file_name': '../data/input/full/lectures/ff.pdf',
  'marker': 2,
  'text': '1\nNatural Language Processing\nAnoop Sarkar\nanoopsarkar.github.io/nlp-class\nSimon Fraser University\nPart 1: Feedforward neural networks'},
 {'file_type': 'pdf',
  'file_name': '../data/input/full/lectures/ff.pdf',
  'marker': 3,
  'text': '2\nLog-linear models versus Neural networks\nFeedforward neural networks\nStochastic Gradient Descent\nMotivating example: XOR\nComputation Graphs'},
 {'file_type': 'pdf',
  'file_name': '../data/input/full/lectures/ff.pdf',
  'marker': 4,
  'text': '3\nLog linear model\n‚ñ∂ Let there be m features, fk (x, y) for k = 1, . . . ,m\n‚ñ∂ Define a parameter vector v ‚àà Rm\n‚ñ∂ A log-linear model for classificatio

### Sample QAs

For the QA files we found, it was easier to first convert it into a CSV with some preliminary information for each question, the question itself and the answer. We found from our tests that using PyPDF, we were unable to define a heuristic or a method that would copy over the preliminary information for each sub-question.

For metadata, we only provide a random index since the source knowledge would not be super useful here, or at least, we were unable to find a use case for it since the question, preliminary information and all of the answers are already available in the parsed text.

In [14]:
def read_text_with_metadata_from_qa_csvs(qa_doc_path):
    qas = pd.read_csv(qa_doc_path, skiprows=1, names=["info", "question", "answer"])
    qas["info"] = qas["info"].fillna("No extra information given")
    data = []
    for index, row in qas.iterrows():
        text = f"Info: {row['info']}\nQuestion: {row['question']}\nAnswer: {row['answer']}"
        data.append({
            "file_type": "csv",
            "file_name": "test",
            "marker": index+1,
            "sub_marker": index+1, # Not necessary since we are not chunking
            "text": text,
            "first_10_words": " ".join(text.split()[:10])
        })

    return data

In [15]:
qa_doc_path = "../data/input/full/qas/midterm-questions-3.csv"

In [16]:
qa_data = read_text_with_metadata_from_qa_csvs(qa_doc_path)

In [17]:
qa_data

[{'file_type': 'csv',
  'file_name': 'test',
  'marker': 1,
  'sub_marker': 1,
  'text': 'Info: (1) You are given the following training data for the prepositional phrase (PP) attachment task. v        n1        p        n2        Attachment\njoin        board        as        director        V\nis        chairman        of        N.V.        N\nusing        crocidolite        in        filters        V\nWhere the attachment value of V indicates that p attaches to v and the attachment value of N indicates\nthat p attaches to n1.\nIn order to resolve PP attachment ambiguity we can train a probability model: P(A= N |v,n1,p,n2)\nwhich predicts the attachment A as N if P >0.5 and V otherwise.\nQuestion: (6pts) To define P(A= N |v,n1,p,n2) using n-gram probabilities, since we are unlikely to see the\nsame four words v,n1,p,n2 in novel unseen data, in order for this probability model to be useful we\nneed to take care of zero counts.\nÀÜ\nProvide a Jelinek-Mercer style interpolation smoothin

### References

#### PDFs

<ul>
    <li>Struggles with multi-column layouts</li>
    <li>Text often split across multiple lines</li>
    <li>The other usual problems...</li>
</ul>

In [18]:
ref_doc_path = "../data/input/full/references/superglue.pdf"

In [19]:
ref_data = read_text_with_metadata_from_pdfs(ref_doc_path)

In [20]:
ref_data

[{'file_type': 'pdf',
  'file_name': '../data/input/full/references/superglue.pdf',
  'marker': 1,
  'text': 'SuperGLUE: A Stickier Benchmark for\nGeneral-Purpose Language Understanding Systems\nAlex Wang‚àó\nNew York University\nYada Pruksachatkun‚àó\nNew York University\nNikita Nangia‚àó\nNew York University\nAmanpreet Singh‚àó\nFacebook AI Research\nJulian Michael\nUniversity of Washington\nFelix Hill\nDeepMind\nOmer Levy\nFacebook AI Research\nSamuel R. Bowman\nNew York University\nAbstract\nIn the last year, new models and methods for pretraining and transfer learning have\ndriven striking performance improvements across a range of language understand-\ning tasks. The GLUE benchmark, introduced a little over one year ago, offers\na single-number metric that summarizes progress on a diverse set of such tasks,\nbut performance on the benchmark has recently surpassed the level of non-expert\nhumans, suggesting limited headroom for further research. In this paper we present\nSuperGLUE

#### HTMLs

For webpages listed as references, we used boilerpy to remove some of the boilerplate headers and css and js. We then tokenize it at a sentence level.

For metadata, we include the file name and the sentence number to refer back to the source knowledge.

In [21]:
from boilerpy3 import extractors

In [22]:
def read_text_from_html(file_path):
    # Use ArticleExtractor from boilerpy3
    extractor = extractors.ArticleExtractor()
    data = []
    try:
        clean_content = extractor.get_content_from_file(html_doc_path)
        clean_content = nltk.sent_tokenize(clean_content)
        # data.append({
        #     "file_type": "html",
        #     "file_name": file_path,
        #     "text": " ".join(clean_content),
        #     "marker": None
        # })
        seq_num = 0
        for content in clean_content:
            data.append({
                "file_type": "html",
                "file_name": file_path,
                "text": content,
                "marker": seq_num
            })
            seq_num += 1
        
    except Exception as e:
        print(f"Error with BoilerPy3 extraction: {e}")
    finally:
        return data

In [23]:
html_doc_path = "../data/input/full/references/Generation with LLMs.html"

In [24]:
html_data = read_text_from_html(html_doc_path)

In [25]:
html_data

[{'file_type': 'html',
  'file_name': '../data/input/full/references/Generation with LLMs.html',
  'text': 'Graph models\nInternal Helpers\nCustom Layers and Utilities Utilities for pipelines Utilities for Tokenizers Utilities for Trainer Utilities for Generation Utilities for Image Processors Utilities for Audio processing General Utilities Utilities for Time Series\nGeneration with LLMs\nLLMs, or Large Language Models, are the key component behind text generation.',
  'marker': 0},
 {'file_type': 'html',
  'file_name': '../data/input/full/references/Generation with LLMs.html',
  'text': 'In a nutshell, they consist of large pretrained transformer models trained to predict the next word (or, more precisely, token) given some input text.',
  'marker': 1},
 {'file_type': 'html',
  'file_name': '../data/input/full/references/Generation with LLMs.html',
  'text': 'Since they predict one token at a time, you need to do something more elaborate to generate new sentences other than just call

#### Notebooks

For notebooks, we inspect contents at a cell level, separating content into code, text and output blocks. While it is true that, in most cases, the output would be largely pointless, sometimes it does contain important statistics or algorithmic run-throughs that can be useful to refer to.

In [26]:
def read_code_md_outputs_from_notebook_sequence(file_path):
    # Load the notebook file
    with open(file_path, 'r', encoding='utf-8') as f:
        notebook = json.load(f)

    content = []
    seq_num = 0
    
    # Extract cells
    for cell in notebook.get('cells', []):
        cell_type = cell.get('cell_type')
        if cell_type == 'markdown': # Markdown cell
            md_content = "Text block:\n"
            md_content += "".join(cell.get('source', [])) + "\n"
            # print(md_content)
            content.append({
                "file_type": "ipynb",
                "file_name": file_path,
                "marker": seq_num,
                "text": md_content
            })
        elif cell_type == 'code':
            code_content = "Code block:\n"
            code_content += "".join(cell.get('source', [])) + "\n"
            # print(code_content)
            # code_source = ''.join(code_content)
            # outputs = []
            code_content += "Output:\n"
            
            # Extract outputs
            for output in cell.get('outputs', []):
                if output.get('output_type') == 'stream':
                    code_content += "".join(output.get('text', [])) + "\n"
                    # print(output_content)
                    # outputs.append(output_content)
                elif output.get('output_type') == 'execute_result':
                    code_content += "".join(output.get('data', {}).get('text/plain', [])) + "\n"
                    # print(output_content)
                    # outputs.append(output_content)
                elif output.get('output_type') == 'error':
                    code_content += "Error: ".join(output.get('traceback', [])) + "\n"
                    # print(output_content)
                    # outputs.append('Error: ' + output_content)

            content.append({
                "file_type": "ipynb",
                "file_name": file_path,
                "marker": seq_num,
                "text": code_content
            })

        seq_num += 1
    
    return content

In [27]:
notebook_doc_path = "../data/input/full/notebooks/bpe.ipynb"

In [28]:
markdown_data = read_code_md_outputs_from_notebook_sequence(notebook_doc_path)

In [29]:
markdown_data

[{'file_type': 'ipynb',
  'file_name': '../data/input/full/notebooks/bpe.ipynb',
  'marker': 0,
  'text': 'Text block:\n# Byte-Pair Encoding tokenization\n'},
 {'file_type': 'ipynb',
  'file_name': '../data/input/full/notebooks/bpe.ipynb',
  'marker': 1,
  'text': 'Text block:\nBPE training starts by computing the unique set of words used in the corpus (after the normalization and pre-tokenization steps are completed), then building the vocabulary by taking all the symbols used to write those words. As a very simple example, let‚Äôs say our corpus uses these five words:\n\nThis material was adapted from the Huggingface tutorial available here:\n\nhttps://huggingface.co/learn/nlp-course/chapter6/5?fw=pt\n'},
 {'file_type': 'ipynb',
  'file_name': '../data/input/full/notebooks/bpe.ipynb',
  'marker': 2,
  'text': 'Code block:\ncorpus = ["hug", "pug", "pun", "bun", "hugs"]\nOutput:\n'},
 {'file_type': 'ipynb',
  'file_name': '../data/input/full/notebooks/bpe.ipynb',
  'marker': 3,
  'text

### Overall

We parse 4 different kinds of files:
<ul>
    <li>PDFs - PyPDF (our experiments with PyMuPDF4LLM showed worse results than this)</li>
    <li>CSVs (very specific to this use case though) - pandas</li>
    <li>HTMLs - boilerpy + nltk</li>
    <li>ipynb notebooks - json</li>
</ul>

These we get from CMPT-713's course site

The output of these functions is a list of extracted text contents, and these functions are later used in the pipeline.