# Retrieving data from papers using GPT

## Setup

In [2]:
!conda install -y tenacity

Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /Users/bencottier/miniconda3/envs/nlp

  added / updated specs:
    - tenacity


The following packages will be SUPERSEDED by a higher-priority channel:

  ca-certificates    conda-forge::ca-certificates-2023.5.7~ --> pkgs/main::ca-certificates-2023.01.10-hecd8cb5_0
  certifi            conda-forge/noarch::certifi-2023.5.7-~ --> pkgs/main/osx-64::certifi-2023.5.7-py39hecd8cb5_0
  openssl            conda-forge::openssl-1.1.1t-hfd90126_0 --> pkgs/main::openssl-1.1.1t-hca72f7f_0


Preparing transaction: done
Verifying transaction: done
Executing transaction: done


In [3]:
!conda install -y openai

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.



In [4]:
!conda install -y -c conda-forge tiktoken

Collecting package metadata (current_repodata.json): done
Solving environment: done


  current version: 4.11.0
  latest version: 23.3.1

Please update conda by running

    $ conda update -n base conda



## Package Plan ##

  environment location: /Users/bencottier/miniconda3/envs/nlp

  added / updated specs:
    - tiktoken


The following packages will be UPDATED:

  ca-certificates    pkgs/main::ca-certificates-2023.01.10~ --> conda-forge::ca-certificates-2023.5.7-h8857fd0_0

The following packages will be SUPERSEDED by a higher-priority channel:

  certifi            pkgs/main/osx-64::certifi-2023.5.7-py~ --> conda-forge/noarch::certifi-2023.5.7-pyhd8ed1ab_0
  openssl              pkgs/main::openssl-1.1.1t-hca72f7f_0 --> conda-forge::openssl-1.1.1t-hfd90126_0


Preparing transaction: done
Verifying transaction: done
Executing transaction: done


In [5]:
!conda install -y -c conda-forge pdfminer.six

Collecting package metadata (current_repodata.json): done
Solving environment: done


  current version: 4.11.0
  latest version: 23.3.1

Please update conda by running

    $ conda update -n base conda



# All requested packages already installed.



In [6]:
import datetime
import numpy as np
import openai
import os
import pandas as pd
from pdfminer.high_level import extract_text
import re
import requests
from tenacity import (
    retry,
    stop_after_attempt,
    wait_random_exponential,
)  # for exponential backoff on API calls
import tiktoken

In [7]:
os.makedirs('output_data', exist_ok=True)

In [8]:
openai.api_key = os.getenv("OPENAI_API_KEY")

In [9]:
"""
Base character limit for the model.
May be adjusted by prompt length.
I've heard that English has about 4 chars per token on average.
But papers may have parts with a lot of digits, which I think are one token each.
The GPT-3 token limit (including output) is 4096.
So a base value of 4096 characters is safe, even at one token per character.
The prompt length should be subtracted from this later.
"""
BASE_CHAR_LIMIT = 4096

"""
The number of questions the model is asked.
This should be updated along with the prompt.
"""
NUM_QUESTIONS = 2

# Token limit for each model
MAX_TOKENS = {
    "gpt-4": 8192,
    "gpt-4-32k": 32768,
    "gpt-3.5-turbo": 4096,
    "text-davinci-003": 4097,
}

# Token limit for model output
MAX_RESPONSE_TOKENS = 100

DEFAULT_CHAT_MODEL = "gpt-4"  # "gpt-3.5-turbo"
DEFAULT_COMPLETION_MODEL = "text-davinci-003"
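
The limits above combine into a per-request budget for paper text. A minimal sketch of that arithmetic follows, assuming the prompt's token count is already known. The function name `chunk_token_budget` and the 20-token default buffer are illustrative choices, not part of any library.

```python
def chunk_token_budget(model_limit, prompt_tokens, max_response_tokens=100, buffer_tokens=20):
    """Tokens left for paper text after the prompt, the response and a small buffer.

    model_limit: total context size of the model, e.g. 8192 for "gpt-4".
    buffer_tokens: assumed safety margin for chat-format overhead.
    """
    budget = model_limit - (prompt_tokens + max_response_tokens + buffer_tokens)
    if budget <= 0:
        raise ValueError("Prompt and response leave no room for paper text")
    return budget


# Example: a 250-token prompt sent to a model with an 8192-token context
print(chunk_token_budget(8192, prompt_tokens=250))
```

Raising an error when the budget is non-positive avoids silently sending empty chunks if the prompt grows too long for a small-context model.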

## Playground

In [10]:
example_paper_text = """
2 Related work

Language models and dialog models: Language models have attracted much attention recently thanks to their
successes in NLP applications (e.g., [19, 20, 21, 2, 1, 22, 23, 5, 12, 24]). Our study of scaling laws with respect to
model sizes is inspired by recent work on the scaling laws of neural language models [12, 13]. Similar to their ﬁndings,
our results show that model scaling improves our quality (sensibleness, speciﬁcity, and interestingness), safety and
groundedness metrics to some extent. However, ﬁne-tuning combined with scaling signiﬁcantly improves performance
on all metrics.

Our work is also closely related to recent successes in applying language models to dialog modeling (e.g., [25, 26,
17, 18]), which built on earlier research in neural dialog modeling (e.g., [14, 15, 16, 27, 28]). One of our ﬁne-tuning
stages requires training on dialog-only data, which is related to Wolf et al. [29], Dinan et al. [25] and Zhang et al. [30].
Our use of ﬁne-tuning on crowdworker-annotated data to improve interestingness is comparable to Roller et al. [18].
However, we aim to maximize the interestingness of the model’s output distinctly from its ability to engage the user in
further interaction.

Our ﬁnding that pure scaling has a limited effect on key measures of open-domain dialog model performance echoes
that of Shuster et al. [31], who also focus on the problem of groundedness. Recent studies on scaling have found that
performance on question-answering tasks improves with model size [32, 33], similar to our ﬁndings on pre-trained
LaMDA prior to ﬁne-tuning.

Our approach to improving model groundedness is broadly consistent with a growing literature on augmenting neural
language models with retrieval systems. Most of the existing literature focuses on the problem of open-domain
question-answering rather than dialog generation, and the models themselves are used to index and rank knowledge
sources, rather than trained to use an intermediate tool. Given these differences, we note that the range of existing
approaches to this problem include the RNNLM [34], RAG [35], REALM [36], and FiD [37] architectures. Zhu et
al. [38] provide a survey of further recent work. See Karpukhin et al. [39] for details on the ‘dense passage retriever’
used in RAG. Recent work in this direction has expanded and elaborated on neural models’ ability to retrieve and rank
passages [40]. The RETRO architecture demonstrates that language models can be primed with results retrieved from
a database as large as two trillion tokens [41]. At a broad level, our approach is also comparable to that of Byrne et
al. [42], which ﬁne-tunes the model to use external APIs for movie ticketing dialog.

Parts of our ﬁndings are similar to recent studies on dialog groundedness. Granting access to external knowledge
bases has been shown to reduce the rate at which models hallucinate unsourced statements in dialog across a variety of
retrieval systems and model architectures [31]. Another study ﬁnds that a question-answering system’s accuracy is
improved by separating it into a reasoning unit and a response generator, analogous to our separation of ‘Base’ and
‘Research’ models in our study [43]. Meanwhile, the WebGPT framework includes a language system that can interact
with the open web via a text-only interface, and learns to imitate humans in answering questions by citing external
sources [44]. Komeili et al. [45] compare different types of pre-trained models and retrieval methods, and reach a
similar conclusion that augmenting language models with a search engine provides more factually grounded responses.
They encode the input context with grounded information from search to generate the next response, while we augment
the generated responses with information from known sources in our method. This allows us to ﬁne-tune the model for
groundedness without sacriﬁcing gains in safety or quality from other ﬁne-tuning treatments.

Dialog metrics: Deﬁning effective metrics for dialog models remains an open research topic. Our approach is
inspired by Adiwardana et al. [17], who argued for human-like metrics, such as sensibleness and speciﬁcity. Many
automated metrics for dialog models have been studied, including perplexity [16, 17], F1, Hits@1/N [25], USR [46],
or BLEU/ROUGE [47, 15, 27]. However, such automated metrics may not correlate well with human judgment [48].
More reliable metrics for dialog modeling require human evaluation [49, 50, 18, 25, 17, 51], as used in this paper.

Earlier research attempted to combine multifaceted evaluations of dialog quality into a single headline metric [52]. We
follow the pattern established in Adiwardana et al. [17] and Roller et al. [18] by considering the different components
of our evaluations separately. In addition to sensibleness and speciﬁcity per Adiwardana et al. [17], we add new metrics:
interestingness, safety, and groundedness. An advantage of using several different metrics is their debuggability: by
exploring responses with low safety or groundedness scores, we have been able to develop targeted methods to improve
them.

Safety and safety of dialog models:
Inappropriate and unsafe risks and behaviors of language models have been
extensively discussed and studied in previous works (e.g., [53, 54]). Issues encountered include toxicity (e.g., [55, 56,
57]), bias (e.g., [58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72]), and inappropriately revealing personally
identifying information (PII) from training data [73]. Weidinger et al. [54] identify 21 risks associated with large-scale

3

language models and discuss the points of origin for these risks. While many mitigation strategies have also been
suggested (e.g., [74, 75, 76, 77, 78, 79, 80, 81, 82]), meaningfully addressing these issues remains an active research
area.

Similar issues have also been discussed speciﬁcally for dialog models [53]. For instance, examples of bias, offensiveness,
and hate speech have been found both in training data drawn from social media, and consequently in the output of dialog
models trained on such data [83]. Dialog models [84] can learn, and even amplify, biases in the training data. Echoing
Gehman et al. [85], we ﬁnd ﬁne-tuning effective to augment language models for safety. The method we use in this
paper follows previous attempts to tackle these issues by training separate layers to detect unsafe output [17, 86, 18, 79].
Our strategy is similar to recent work that also uses ﬁne-tuning [87]. While their safety guidelines were derived from
human rights principles, they similarly ﬁnd that increasing scale has no impact on toxicity metrics, while ﬁne-tuning on
safety evaluations does.

Groundedness metrics: Similar to other recent research into groundedness cited above, we assess groundedness
by asking crowdworkers to judge whether the model’s output is in accordance with authoritative external sources.
The recently-proposed Attributable to Identiﬁed Sources (AIS) framework [88] articulates a more precise approach
to assess output of language models that pertains to the external world. It splits evaluation into two stages, where
crowdworkers are asked: (1) if they can understand and identify the information shared in a dialog turn, and (2) if all
of this information can be attributed to a source. Meanwhile, a recent study has reopened the question of automatic
evaluation, with the Q2 metric showing performance comparable to human annotation [89].

3 LaMDA pre-training

LaMDA was pre-trained to predict the next token in a text corpus. Unlike previous dialog models trained on dialog data
alone [17, 18], we pre-trained LaMDA on a dataset created from public dialog data and other public web documents.
Therefore, LaMDA can be used as a general language model prior to ﬁne-tuning.

The pre-training dataset consists of 2.97B documents, 1.12B dialogs, and 13.39B dialog utterances, for a total of 1.56T
words (Appendix E). Over 90% of the pre-training dataset is in the English language. We used the SentencePiece
library [90] to tokenize the dataset into 2.81T byte pair encoding (BPE) tokens [91], with a vocabulary of 32K tokens.
For comparison, the total number of words in the training set for Meena [17] was 40B words, which is nearly 40x
smaller.

The largest LaMDA model has 137B non-embedding parameters, which is ~50x more parameters than Meena [17].
We use a decoder-only Transformer [92] language model as the model architecture for LaMDA. The Transformer has
64 layers, dmodel = 8192, df f = 65536, h = 128, dk = dv = 128, relative attention as described in T5 [11], and
gated-GELU activation as described in Raffel et al. [93].

We pre-trained LaMDA on 1024 TPU-v3 chips for a total of about 57.7 days, and 256K tokens per batch. We used
the Lingvo framework [94] for training and achieved 123 TFLOPS/sec with 56.5% FLOPS utilization with the 2D
sharding algorithm, as described in GSPMD [95] (see Section 10 for carbon footprint estimates). We also trained
smaller 2B-parameter and 8B-parameter models to measure the effects of model scaling on our metrics. Hyperparameter
details for the models of different sizes can be found in Table 27, Appendix D.

Figure 2 gives an overview of the pre-training stage. We call the model before any ﬁne-tuning "PT", for PreTrained.

Figure 2: LaMDA pre-training as a language model.

4

PT uses the same sample-and-rank strategy as Meena [17] for decoding. We ﬁrst sample 16 independent candidate
responses using top-k (k = 40) sampling (no temperature). The ﬁnal output is the highest-scoring candidate, where the
score is based on the candidate’s log-likelihood and its length.

4 Metrics

Evaluating generative models in general, and open-ended dialog models in particular, is difﬁcult. See the Related
Work section for a general review of recent work in this area. In this section, we describe the metrics that we use for
evaluation.

4.1 Foundation metrics: Quality, Safety and Groundedness

Sensibleness, Speciﬁcity, Interestingness (SSI): Our overall quality score is an average of sensibleness, speciﬁcity,
and interestingness (SSI).

Adiwardana et al. [17] propose the sensibleness and speciﬁcity average (SSA) metric to measure the quality of Meena.
This metric is a simple average of two scores: sensibleness and speciﬁcity.

The ﬁrst score, sensibleness, measures whether a model’s responses make sense in context and do not contradict
anything that was said earlier. Humans tend to take this basic aspect of communication for granted, but generative
models often struggle to meet this requirement. However, if sensibleness alone is used to evaluate models, we could
inadvertently reward models for playing it safe by always producing short, generic, and boring responses. The
GenericBot algorithm [17], which answers every question with “I don’t know” and every statement with “Ok,” scores
70% on sensibleness, which even surpasses some large dialog models [17].

The second score, speciﬁcity, is used to measure whether a response is speciﬁc to a given context. For example, if a user
says “I love Eurovision” and the model responds “Me too,” then it would score 0 on speciﬁcity, since this response could
be used in many different contexts. If it answers “Me too. I love Eurovision songs,” then it would score 1. Adiwardana
et al. [17] report that Meena narrows the gap to average human performance in the SSA metric.

As the model’s performance increases, however, we ﬁnd that sensibleness and speciﬁcity are not sufﬁcient to measure
the quality of a dialog model. For example, a response to “How do I throw a ball?” could be “You can throw a ball by
ﬁrst picking it up and then throwing it”, which makes sense and is speciﬁc to the question. An alternative deeper and
more satisfying answer could be “One way to toss a ball is to hold it ﬁrmly in both hands and then swing your arm
down and up again, extending your elbow and then releasing the ball upwards.”
"""

In [11]:
response = openai.Completion.create(
    model="text-davinci-003",
    prompt=f"A table summarizing the training hardware from this paper:\n\n====\n\n{example_paper_text}\n\n====\n\n| Number of GPUs or TPUs | Hardware model (e.g. A100) | FLOP/s |\n",
    temperature=0,
    max_tokens=100,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0
)
response

<OpenAIObject text_completion id=cmpl-7H90LNEEoZZDUp9ZAaPl7cymJGMdy at 0x7f8458519130> JSON: {
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "text": "| ---------------------- | --------------------------- | ------ |\n| 1024                  | TPU-v3                      | 123 TFLOPS/sec |"
    }
  ],
  "created": 1684320545,
  "id": "cmpl-7H90LNEEoZZDUp9ZAaPl7cymJGMdy",
  "model": "text-davinci-003",
  "object": "text_completion",
  "usage": {
    "completion_tokens": 29,
    "prompt_tokens": 3117,
    "total_tokens": 3146
  }
}
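
The completion above continues the table header given in the prompt, so the answer arrives as a markdown separator row plus one data row. One way to turn that into a dictionary is sketched below. The helper name `parse_table_completion` is hypothetical, and the sketch assumes the first well-formed data row is the one that matters.

```python
def parse_table_completion(header_row, completion_text):
    """Map header cells to the cells of the first data row in the completion."""

    def split_row(row):
        # "| a | b |" -> ["a", "b"]
        return [cell.strip() for cell in row.strip().strip("|").split("|")]

    headers = split_row(header_row)
    for line in completion_text.splitlines():
        cells = split_row(line)
        # Skip blank lines and the "| --- | --- |" separator row
        if not any(cells) or all(set(cell) <= set("-: ") for cell in cells):
            continue
        if len(cells) == len(headers):
            return dict(zip(headers, cells))
    return None  # no usable data row found


header = "| Number of GPUs or TPUs | Hardware model (e.g. A100) | FLOP/s |"
text = (
    "| ---------------------- | --------------------------- | ------ |\n"
    "| 1024                  | TPU-v3                      | 123 TFLOPS/sec |"
)
print(parse_table_completion(header, text))
```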

In [12]:
prompt_text = f"""
Read the Machine Learning research paper below and answer the following questions. Just state the answer without explanation. If the answer is not mentioned in the text, write "N/A".
1. How many GPUs or TPUs were used to train the model? Just state the number. If the number of GPUs or TPUs is not mentioned in the text, write "N/A".
2. What model of GPU or TPU was used to train the model? Examples include: "A100", "V100", "P100", "TPUv3", "TPUv4". If the GPU or TPU is not mentioned in the text, write "N/A".
3. What FLOP/s (AKA: FLOP/second, FLOPS) was achieved during training? Include the same units as written in the paper. If FLOP/s is not mentioned in the text, write "N/A".

Here are some example answers:

1. 1
2. V100
3. 21 TFLOP/s

1. N/A
2. Titan V
3. 21 petaflops

1. 32
2. N/A
3. 127e12 FLOPS

====

{example_paper_text}

====

"""

response = openai.Completion.create(
    model="text-davinci-003",
    prompt=prompt_text,
    temperature=0,
    max_tokens=100,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0
)
print(response["choices"][0]["text"])


1. N/A
2. N/A
3. 123 TFLOP/s


In [33]:
prompt_text = """Read the Machine Learning research paper below and answer the following questions. 

## Questions

1. How many GPUs or TPUs or chips were used to train the model? Just state the number.
2. What model of GPU or TPU was used to train the model?

## Instructions for how to answer each question

1. Write the question number, e.g. "1. ".
2. Write "Relevant quote: " and copy verbatim the quote from the paper that determines your final answer.
3. Write "Final answer: " and then write your final answer for the question.
4. If the answer cannot be determined from the text, just write "Relevant quotes: N/A. Final answer: N/A.".

## Made-up example answers

1. Relevant quote: "We pre-trained BaLM on 1024 V100 GPUs for a total of about 30 days." Final answer: 1024.
2. Relevant quote: "We pre-trained BaLM on 1024 V100 GPUs for a total of about 30 days." Final answer: V100.

1. Relevant quote: N/A. Final answer: N/A.
2. Relevant quote: "SuperGAN was trained on a TPUv3 cluster, achieving a throughput of 52 TFLOPS." Final answer: TPUv3.

## Paper

{paper_text}

## Answers
"""

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        # {"role": "system", "content": "You are an expert in Machine Learning."},
        {"role": "user", "content": f"{prompt_text.format(paper_text=example_paper_text)}"},
    ],
    temperature=0,
    max_tokens=MAX_RESPONSE_TOKENS,
    top_p=1,
    frequency_penalty=-2,
    presence_penalty=-2,
)


In [34]:
prompt_text.format(paper_text=example_paper_text)

'Read the Machine Learning research paper below and answer the following questions. \n\n## Questions\n\n1. How many GPUs or TPUs or chips were used to train the model? Just state the number.\n2. What model of GPU or TPU was used to train the model?\n\n## Instructions for how to answer each question\n\n1. Write the question number, e.g. "1. ".\n2. Write "Relevant quotes: " and copy verbatim any relevant quotes from the paper that inform your final answer.\n3. Write "Final answer: " and then write your final answer for the question.\n4. If the answer cannot be determined from the text, just write "Relevant quotes: N/A. Final answer: N/A.".\n\n## Made-up example answers\n\n1. Relevant quotes: "We pre-trained BaLM on 1024 V100 GPUs for a total of about 30 days." Final answer: 1024.\n2. Relevant quotes: "We pre-trained BaLM on 1024 V100 GPUs for a total of about 30 days." Final answer: V100.\n\n1. Relevant quotes: N/A. Final answer: N/A.\n2. Relevant quotes: "SuperGAN was trained on a TPUv3

In [35]:
response["choices"][0]["message"]["content"]

'1. Relevant quotes: "We pre-trained LaMDA on 1024 TPU-v3 chips for a total of about 57.7 days". Final answer: 1024. \n2. Relevant quotes: "We pre-trained LaMDA on 1024 TPU-v3 chips for a total of about 57.7 days.  Final answer: TPU-v3. '

In [23]:
answers = response["choices"][0]["message"]["content"].strip().split("\n")
# Remove "" from list of answers in case of multiple "\n" chars
answers = [a for a in answers if a != ""]
processed_answers = list()
for a in answers:
    a = a.strip()
    if len(a) > 1 and a[1] == ".":  # question numbering, e.g. "1. "
        # Split only on the first "." so periods inside the answer are kept
        answer_parts = a.split(".", maxsplit=1)
        print(answer_parts)
#         processed_answers.append(a.split(".")[-1].strip())
#     else:
#         processed_answers.append(a)
# processed_answers

['1', ' Relevant quote: "We pre-trained LaMDA on 1024 TPU-v3 chips for a total of about 57.7 days". Final answer: 1024. ']
['2', ' Relevant quote: "We pre-trained LaMDA on 1024 TPU-v3 chips for a total of about 57.7 days.  Final answer: TPU-v3.']
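
Splitting on the first period separates the question number, but the quote and the final answer are still fused in one string. A regular expression can pull them apart, as in the sketch below. The pattern and the name `parse_quoted_answers` are assumptions made for illustration. The pattern accepts both "Relevant quote:" and "Relevant quotes:", and it tolerates a missing closing quotation mark, as in the second answer above.

```python
import re

ANSWER_PATTERN = re.compile(
    r"^(?P<number>\d+)\.\s*Relevant quotes?:\s*(?P<quote>.*?)\s*"
    r"Final answer:\s*(?P<answer>.*?)\.?\s*$"
)


def parse_quoted_answers(content):
    """Return {question number: (quote, final answer)} for lines that match."""
    parsed = {}
    for line in content.strip().split("\n"):
        match = ANSWER_PATTERN.match(line.strip())
        if match is None:
            continue  # ignore lines that do not follow the requested format
        # Drop surrounding spaces, periods and straight double quotes
        quote = match.group("quote").strip(' ."')
        parsed[int(match.group("number"))] = (quote, match.group("answer"))
    return parsed


content = (
    '1. Relevant quotes: "We pre-trained LaMDA on 1024 TPU-v3 chips for a total of about 57.7 days". Final answer: 1024. \n'
    '2. Relevant quotes: "We pre-trained LaMDA on 1024 TPU-v3 chips for a total of about 57.7 days.  Final answer: TPU-v3. '
)
for number, (quote, answer) in parse_quoted_answers(content).items():
    print(number, answer, "|", quote)
```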


In [25]:
prompt_text = """
Read the excerpt of a Machine Learning research paper below and answer the following questions.
1. How many GPUs or TPUs or chips were used to train the model? Just state the number. If the number of GPUs or TPUs or chips is not mentioned in the text, write "N/A".
2. What model of GPU or TPU was used to train the model? Examples include: "A100", "V100", "P100", "TPUv3", "TPUv4". If the GPU or TPU is not mentioned in the text, write "N/A".
"""

openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are an expert in Machine Learning."},
        {"role": "user", "content": f"{prompt_text}\n\n====\n\n{example_paper_text}\n\n====\n\n"},
    ]
)

<OpenAIObject chat.completion id=chatcmpl-7Gt92obaFxiKh7eWGJ7x38epqSoWj at 0x7f8aa0b269f0> JSON: {
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "1. N/A\n2. TPU-v3",
        "role": "assistant"
      }
    }
  ],
  "created": 1684259580,
  "id": "chatcmpl-7Gt92obaFxiKh7eWGJ7x38epqSoWj",
  "model": "gpt-4-0314",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 11,
    "prompt_tokens": 3072,
    "total_tokens": 3083
  }
}

## Pipeline

In [14]:
chat_message_template = """
Read the Machine Learning research paper below and answer the following questions. Just state the answer without explanation. If the answer is not mentioned in the text, write "N/A".

1. How many GPUs or TPUs or chips were used to train the model? Just state the number. If the number of GPUs or TPUs or chips is not mentioned in the text, write "N/A".
2. What model of GPU or TPU was used to train the model? Examples include: "A100", "V100", "P100", "TPUv3", "TPUv4". If the GPU or TPU is not mentioned in the text, write "N/A".

Here are some made-up example answers:

1. 1
2. V100

1. N/A
2. TPUv3

1. 32
2. N/A

====

{paper_text}

====

"""

@retry(wait=wait_random_exponential(min=1, max=60), stop=stop_after_attempt(6))
def parse_text_gpt_chat(text):
    prompt_text = chat_message_template.format(paper_text=text)
    response = openai.ChatCompletion.create(
        model=DEFAULT_CHAT_MODEL,
        messages=[
            # {"role": "system", "content": "You are an expert in Machine Learning."},
            {"role": "user", "content": prompt_text},
        ],
        temperature=0,
        max_tokens=MAX_RESPONSE_TOKENS,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
    )
    return response

def parse_gpt_chat_response(text):
    response = parse_text_gpt_chat(text)
    # E.g. "1. 6144 TPUs\n2. TPU v4\n3. N/A\n"
    # print(f"Response: {response['choices'][0]['message']['content']}")
    answers = response["choices"][0]["message"]["content"].strip().split("\n")
    # Remove "" from list of answers in case of multiple "\n" chars
    answers = [a for a in answers if a != ""]
    processed_answers = list()
    for a in answers:
        a = a.strip()
        # Strip leading question numbering such as "1. " or "12. ".
        # Only the numbering is removed, so answers containing "." stay intact.
        processed_answers.append(re.sub(r"^\d+\.(?!\d)\s*", "", a))
    return processed_answers

In [15]:
completion_prompt_template = """
Read the Machine Learning research paper below and answer the following questions. Just state the answer without explanation. If the answer is not mentioned in the text, write "N/A".

1. How many GPUs or TPUs or chips were used to train the model? Just state the number. If the number of GPUs or TPUs or chips is not mentioned in the text, write "N/A".
2. What model of GPU or TPU was used to train the model? Examples include: "A100", "V100", "P100", "TPUv3", "TPUv4". If the GPU or TPU is not mentioned in the text, write "N/A".

Here are some made-up example answers:

1. 1
2. V100

1. N/A
2. TPUv3

1. 32
2. N/A

====

{paper_text}

====

"""

@retry(wait=wait_random_exponential(min=1, max=60), stop=stop_after_attempt(6))
def parse_text_gpt(text):
    prompt_text = completion_prompt_template.format(paper_text=text)
    response = openai.Completion.create(
        model=DEFAULT_COMPLETION_MODEL,
        prompt=prompt_text,
        temperature=0,
        max_tokens=MAX_RESPONSE_TOKENS,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
    )
    return response

def parse_gpt_response(text):
    response = parse_text_gpt(text)
    # E.g. "1. 6144 TPUs\n2. TPU v4\n3. N/A\n"
    # print(response["choices"][0]["text"])
    answers = response["choices"][0]["text"].strip().split("\n")
    # Remove "" from list of answers in case of multiple "\n" chars
    answers = [a for a in answers if a != ""]
    processed_answers = list()
    for a in answers:
        a = a.strip()
        # Strip leading question numbering such as "1. " or "12. ".
        # Only the numbering is removed, so answers containing "." stay intact.
        processed_answers.append(re.sub(r"^\d+\.(?!\d)\s*", "", a))
    return processed_answers

In [16]:
def get_model_answers(text, num_questions=1, mode="chat"):
    answer_fcn = parse_gpt_chat_response if mode == "chat" else parse_gpt_response
    model_name = DEFAULT_CHAT_MODEL if mode == "chat" else DEFAULT_COMPLETION_MODEL
    prompt_template = chat_message_template if mode == "chat" else completion_prompt_template

    tokenizer = tiktoken.encoding_for_model(model_name)
    encoded_text = tokenizer.encode(text, disallowed_special=())
    print(f"Full #tokens: {len(encoded_text)}")
    # The prompt text will be added to the paper text later
    # Add constant extra tokens for a little buffer when using the chat model
    max_tokens = MAX_TOKENS[model_name] - (len(tokenizer.encode(prompt_template)) + MAX_RESPONSE_TOKENS + 20)
    print(f"Max #tokens: {max_tokens}")

    text_pos = 0
    final_answers = [None] * num_questions
    while text_pos < len(encoded_text):
        # Get model answers for the next chunk of the text
        encoded_text_chunk = encoded_text[text_pos : text_pos + max_tokens]
        print(f"Chunk #tokens: {len(encoded_text_chunk)}")
        text_chunk = tokenizer.decode(encoded_text_chunk)  # the model will encode again
        answers = answer_fcn(text_chunk)
        # Process each answer
        for i in range(num_questions):
            try:
                if final_answers[i] is None and answers[i] is not None:
                    # Take the first answer as the final answer initially
                    final_answers[i] = answers[i]
                elif "N/A" in final_answers[i] and "N/A" not in answers[i]:
                    # If the answer was "N/A" previously but there's at least
                    # one non-"N/A" answer for a later chunk, then use the
                    # first non-"N/A" answer as the final answer
                    final_answers[i] = answers[i]
            except IndexError:
                print(f"IndexError: index={i}, answers={answers}, final_answers={final_answers}")
        # Move to the next chunk of text
        text_pos += max_tokens
    return final_answers
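
The chunking and answer-merging logic in `get_model_answers` can be checked without any API calls. The sketch below isolates both steps into two hypothetical helpers, `chunk_tokens` and `merge_chunk_answers`, and feeds them hand-written answers in place of model output. It is meant to mirror the loop above, not to replace it.

```python
def chunk_tokens(tokens, max_tokens):
    """Split a token list into consecutive windows of at most max_tokens."""
    return [tokens[pos : pos + max_tokens] for pos in range(0, len(tokens), max_tokens)]


def merge_chunk_answers(answers_per_chunk, num_questions):
    """Keep the first non-"N/A" answer per question, else the first answer seen."""
    final_answers = [None] * num_questions
    for answers in answers_per_chunk:
        # A chunk may return fewer answers than questions; only use what exists
        for i in range(min(num_questions, len(answers))):
            if final_answers[i] is None:
                final_answers[i] = answers[i]
            elif "N/A" in final_answers[i] and "N/A" not in answers[i]:
                final_answers[i] = answers[i]
    return final_answers


print(chunk_tokens(list(range(7)), 3))
# Made-up per-chunk answers standing in for model responses
print(merge_chunk_answers([["N/A", "TPUv3"], ["1024", "N/A"], ["512", "V100"]], 2))
```

Note that a later chunk never overrides an earlier non-"N/A" answer, so the result depends on where in the paper the relevant passage appears.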

In [17]:
def parse_paper(df, i, row, keys):
    url = row['Link']

    # replace "/abs/" with "/pdf/" in arxiv url links
    # (matching the slashes avoids changing other occurrences of "abs")
    url = url.replace('/abs/', '/pdf/')
    print(f"Looking into \"{row['Reference']}\"")

    # try:
    #     response = requests.get(url)
    # except Exception as e:
    #     print(f"There's something wrong with downloading: {e}")
    #     raise e

    # file = open("download.pdf", "wb")
    # file.seek(0) # overwrite previous file
    # file.write(response.content)
    # file.close()

    # try:
    # text = extract_text('download.pdf')
    paper_title = row['Reference'].replace(' ', '_').replace(':', '').replace('"', '').lower()
    with open('input_data/' + paper_title + '.txt', 'r') as f:
        text = f.read()

    answers = get_model_answers(text, num_questions=NUM_QUESTIONS, mode='chat')
    print(f"Answers: {answers}")
    
    for key, answer in zip(keys, answers):
        df.loc[i,key] = answer if answer else "none"
    # except Exception as e:
    #     print(f"There's something wrong with extracting the text: {e}")
    #     raise e
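
`parse_paper` relies on two string conventions: how a paper title maps to a text file name, and how an arXiv abstract link maps to a PDF link. The sketch below puts each into a small helper so the conventions can be tested on their own. Both function names are hypothetical, and the second URL is made up to show why replacing a bare "abs" is risky.

```python
def title_to_filename(title):
    """Mirror the file naming used in parse_paper: underscores, no colons or quotes, lowercase."""
    return title.replace(' ', '_').replace(':', '').replace('"', '').lower() + '.txt'


def arxiv_abs_to_pdf(url):
    """Rewrite only the "/abs/" path segment, leaving other "abs" substrings alone."""
    return url.replace('/abs/', '/pdf/', 1)


print(title_to_filename('LaMDA: Language Models for Dialog Applications'))
print(arxiv_abs_to_pdf('https://arxiv.org/abs/2201.08239'))
# A made-up URL where a bare replace of "abs" would also corrupt "abstracts"
print(arxiv_abs_to_pdf('https://example.org/abstracts/abs/1'))
```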

In [18]:
def evaluate_answers(df, answer_keys, correct_answers):
    correct_dict = dict()
    for key in answer_keys:
        print(key)
        correct = 0
        for i, row in df.iterrows():
            ref = row['Reference']
            answer = row[key]
            correct_answer = correct_answers[key][ref]
            if answer == correct_answer:
                correct += 1
            else:
                print(f"{answer} != {correct_answer}")
        correct_dict[key] = correct
    for k, v in correct_dict.items():
        print(f"{k}: {v}/{len(df)}")
    return correct_dict
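
`evaluate_answers` compares strings exactly, so an answer such as "TPU-v3" counts as wrong against a reference of "TPUv3". If that is too strict, answers could be normalized before comparison. The sketch below shows one possible normalization. The helper names and the specific rules (dropping spaces and hyphens, taking the first integer) are assumptions, not something the evaluation above does.

```python
import re


def normalize_hardware(name):
    """Lowercase and drop spaces/hyphens so "TPU-v3" and "TPUv3" compare equal."""
    return re.sub(r"[\s\-]+", "", name).lower()


def normalize_count(answer):
    """Return the first integer in the answer as a string, or "N/A" if there is none."""
    match = re.search(r"\d[\d,]*", answer)
    return match.group(0).replace(",", "") if match else "N/A"


print(normalize_hardware("TPU-v3") == normalize_hardware("TPUv3"))
print(normalize_count("6144 TPUs"))
```

Normalization of this kind trades strictness for recall: it would also treat "TPU v3" and "tpuv3" as the same answer, which may or may not be desirable when scoring.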

In [19]:
CORRECT_ANSWERS = {
    'Number of hardware units': {
        'No Language Left Behind: Scaling Human-Centered Machine Translation': 'N/A',
        'Solving Quantitative Reasoning Problems with Language Models': '1024',
        'Scaling Autoregressive Models for Content-Rich Text-to-Image Generation': 'N/A',
        'A Generalist Agent': '256',
        'OPT: Open Pre-trained Transformer Language Models': '992',
        'PaLM: Scaling Language Modeling with Pathways': '6144',
        'Training Compute-Optimal Large Language Models': 'N/A',
        'Efficient Language Modeling with Sparse all-MLP': '32',
        'Announcing GPT- NeoX- 20B': '96',
        'LaMDA: Language Models for Dialog Applications': '1024',
        'An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale': 'N/A',
        'GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding': '1024',
        'Generative Pretraining from Pixels': 'N/A',
        'Once for all: Train one network and specialize it for efficient deployment.': '32',
        'Language models are Few- Shot Learners': 'N/A',
        'ProGen: Language Modeling for Protein Generation': '128',
        'Turing-NLG: A 17-billion-parameter language model by Microsoft': '256',
        'ALBERT: A Lite BERT for Self-supervised Learning of Language Representations.': '512',
        'IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures': '1',
        'Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm': '64',
        'Progressive Neural Architecture Search': '100',
        'Mastering the game of Go without human knowledge': '64',
        'Dota 2 ': 'N/A',
        'Revisiting Unreasonable Effectiveness of Data in Deep Learning Era.': '50',
        'Attention Is All You Need': '8',
        'Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer': '64',
        'DeepStack: Expert-Level Artificial Intelligence in No-Limit Poker': '20'
    },
    'Hardware model': {
        'No Language Left Behind: Scaling Human-Centered Machine Translation': 'A100',
        'Solving Quantitative Reasoning Problems with Language Models': 'TPUv4',
        'Scaling Autoregressive Models for Content-Rich Text-to-Image Generation': 'TPUv4',
        'A Generalist Agent': 'TPUv3',
        'OPT: Open Pre-trained Transformer Language Models': 'A100',
        'PaLM: Scaling Language Modeling with Pathways': 'TPUv4',
        'Training Compute-Optimal Large Language Models': 'TPUv3, TPUv4',
        'Efficient Language Modeling with Sparse all-MLP': 'V100',
        'Announcing GPT- NeoX- 20B': 'A100',
        'LaMDA: Language Models for Dialog Applications': 'TPUv3',
        'An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale': 'TPUv3',
        'GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding': 'TPUv3',
        'Generative Pretraining from Pixels': 'V100',
        'Once for all: Train one network and specialize it for efficient deployment.': 'V100',
        'Language models are Few- Shot Learners': 'V100',
        'ProGen: Language Modeling for Protein Generation': 'TPUv3',
        'Turing-NLG: A 17-billion-parameter language model by Microsoft': 'V100',
        'ALBERT: A Lite BERT for Self-supervised Learning of Language Representations.': 'TPUv3',
        'IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures': 'P100',
        'Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm': 'TPUv2',
        'Progressive Neural Architecture Search': 'P100',
        'Mastering the game of Go without human knowledge': 'N/A',
        'Dota 2 ': 'N/A',
        'Revisiting Unreasonable Effectiveness of Data in Deep Learning Era.': 'K80',
        'Attention Is All You Need': 'P100',
        'Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer': 'K40',
        'DeepStack: Expert-Level Artificial Intelligence in No-Limit Poker': 'N/A'
    }
}

In [20]:
paper_titles = list(CORRECT_ANSWERS['Hardware model'].keys())
print(len(paper_titles))
paper_titles

27


['No Language Left Behind: Scaling Human-Centered Machine Translation',
 'Solving Quantitative Reasoning Problems with Language Models',
 'Scaling Autoregressive Models for Content-Rich Text-to-Image Generation',
 'A Generalist Agent',
 'OPT: Open Pre-trained Transformer Language Models',
 'PaLM: Scaling Language Modeling with Pathways',
 'Training Compute-Optimal Large Language Models',
 'Efficient Language Modeling with Sparse all-MLP',
 'Announcing GPT- NeoX- 20B',
 'LaMDA: Language Models for Dialog Applications',
 'An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale',
 'GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding',
 'Generative Pretraining from Pixels',
 'Once for all: Train one network and specialize it for efficient deployment.',
 'Language models are Few- Shot Learners',
 'ProGen: Language Modeling for Protein Generation',
 'Turing-NLG: A 17-billion-parameter language model by Microsoft',
 'ALBERT: A Lite BERT for Self

In [27]:
# Download dataset from the Parameters, Compute and Data Trends in ML sheet
df = pd.read_csv('https://docs.google.com/spreadsheets/d/1AAIebjNsnJj_uKALHbXNfn3_YsT6sHXtCU0q7OIPuc4/export?format=csv#gid=0')

year_start = 2017

# Recode columns
df['Publication date'] = pd.to_datetime(df['Publication date'], errors='coerce', dayfirst=True)

# Keep only papers published after 1 January of year_start
df = df[df['Publication date'] > f'{year_start}-01-01']

# Keep only bibliographical data
df = df.filter(['Author(s)', 'Publication date', 'Reference', 'Link'])
df = df[df['Link'].notna()]
# Keep only links that point to arXiv or end in ".pdf".
# The pattern has no capture group (avoids the pandas match-group warning) and escapes the dot.
df = df[df['Link'].str.contains(r'arxiv|\.pdf$', regex=True)]

keys = ['Number of hardware units', 'Hardware model']

# Enable for test running with the first ten papers
# df = df[1:11]
# Or a specific paper
# idx = 4
# df = df[idx:idx+1]
# Or by title
df = df[df['Reference'].isin(paper_titles)]
df = df.drop_duplicates(subset=['Reference'], keep='first')

for i, row in df.iterrows():
    # try:
    parse_paper(df, i, row, keys)
    print("---")
    # except:
    #     continue
print(len(df))
display(df)

timestamp = datetime.datetime.now()
df.to_csv(f'output_data/parsed_paper_data_{timestamp.strftime("%Y-%m-%d_%H-%M-%S")}.csv')

Looking into "Solving Quantitative Reasoning Problems with Language Models"
Full #tokens: 38377
Max #tokens: 7883
Chunk #tokens: 7883
Chunk #tokens: 7883
Chunk #tokens: 7883
Chunk #tokens: 7883
Chunk #tokens: 6845
Answers: ['N/A', 'TPUv4']
---
Looking into "PaLM: Scaling Language Modeling with Pathways"
Full #tokens: 82875
Max #tokens: 7883
Chunk #tokens: 7883
Chunk #tokens: 7883
Chunk #tokens: 7883
Chunk #tokens: 7883
Chunk #tokens: 7883
Chunk #tokens: 7883
Chunk #tokens: 7883
Chunk #tokens: 7883
Chunk #tokens: 7883
Chunk #tokens: 7883
Chunk #tokens: 4045
Answers: ['6144', 'TPU v4']
---
Looking into "Training Compute-Optimal Large Language Models"
Full #tokens: 30151
Max #tokens: 7883
Chunk #tokens: 7883
Chunk #tokens: 7883
Chunk #tokens: 7883
Chunk #tokens: 6502
Answers: ['N/A', 'TPUv3/TPUv4']
---
Looking into "Scaling Autoregressive Models for Content-Rich Text-to-Image Generation"
Full #tokens: 39185
Max #tokens: 7883
Chunk #tokens: 7883
Chunk #tokens: 7883
Chunk #tokens: 7883
Chun

Unnamed: 0,Author(s),Publication date,Reference,Link,Number of hardware units,Hardware model
3,"Aitor Lewkowycz, Anders Andreassen, David Doha...",2022-06-29,Solving Quantitative Reasoning Problems with L...,https://arxiv.org/abs/2206.14858,,TPUv4
4,"Aakanksha Chowdhery, Sharan Narang, Jacob Devl...",2022-04-04,PaLM: Scaling Language Modeling with Pathways,https://arxiv.org/abs/2204.02311,6144.0,TPU v4
6,"Jordan Hoffmann, Sebastian Borgeaud, Arthur Me...",2022-03-29,Training Compute-Optimal Large Language Models,https://arxiv.org/abs/2203.15556,,TPUv3/TPUv4
7,"Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Lu...",2022-06-22,Scaling Autoregressive Models for Content-Rich...,https://arxiv.org/abs/2206.10789v1,,CloudTPUv4
8,"Romal Thoppilan, Daniel De Freitas, Jamie Hall...",2022-02-10,LaMDA: Language Models for Dialog Applications,https://arxiv.org/abs/2201.08239,1024.0,TPU-v3
21,"Ping Yu, Mikel Artexte, Myle Ott, Sam Shleife...",2022-04-14,Efficient Language Modeling with Sparse all-MLP,https://arxiv.org/abs/2203.06850,,32G V100
100,"Tom B. Brown, Benjamin Mann, Nick Ryder, Melan...",2020-05-28,Language models are Few- Shot Learners,https://arxiv.org/abs/2005.14165,,V100
103,"Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu,...",2020-06-30,GShard: Scaling Giant Models with Conditional ...,https://arxiv.org/abs/2006.16668,2048.0,TPU v3
108,"Zhenzhong Lan, Mingda Chen, Sebastian Goodman,...",2020-02-09,ALBERT: A Lite BERT for Self-supervised Learni...,https://arxiv.org/abs/1909.11942,,Cloud TPU V3
111,"Han Cai, Chuang Gan, Tianzhe Wang, Zhekai Zhan...",2020-04-29,Once for all: Train one network and specialize...,https://arxiv.org/abs/1908.09791,,V100
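
The log above shows each paper split into consecutive chunks of at most 7883 tokens, with a shorter final chunk (for example, 38377 tokens become four chunks of 7883 and one of 6845). The body of `parse_paper` is not shown in this section, so the following is only a minimal sketch of non-overlapping chunking that reproduces those lengths. The function name `split_into_chunks` and the stand-in token list are assumptions for illustration, not the notebook's own code.

```python
def split_into_chunks(tokens, max_tokens):
    """Split a token sequence into consecutive, non-overlapping chunks.

    Hypothetical helper: every chunk has at most `max_tokens` items,
    and only the last chunk may be shorter.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    return [tokens[i:i + max_tokens] for i in range(0, len(tokens), max_tokens)]


# Stand-in token IDs with the same length as the first paper in the log
tokens = list(range(38377))
chunks = split_into_chunks(tokens, 7883)
print([len(chunk) for chunk in chunks])  # [7883, 7883, 7883, 7883, 6845]
```

In practice the token list would presumably come from a tokenizer's `encode` call, and each chunk would be decoded back to text before being placed in a prompt.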


In [28]:
display(df)

Unnamed: 0,Author(s),Publication date,Reference,Link,Number of hardware units,Hardware model
3,"Aitor Lewkowycz, Anders Andreassen, David Doha...",2022-06-29,Solving Quantitative Reasoning Problems with L...,https://arxiv.org/abs/2206.14858,,TPUv4
4,"Aakanksha Chowdhery, Sharan Narang, Jacob Devl...",2022-04-04,PaLM: Scaling Language Modeling with Pathways,https://arxiv.org/abs/2204.02311,6144.0,TPU v4
6,"Jordan Hoffmann, Sebastian Borgeaud, Arthur Me...",2022-03-29,Training Compute-Optimal Large Language Models,https://arxiv.org/abs/2203.15556,,TPUv3/TPUv4
7,"Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Lu...",2022-06-22,Scaling Autoregressive Models for Content-Rich...,https://arxiv.org/abs/2206.10789v1,,CloudTPUv4
8,"Romal Thoppilan, Daniel De Freitas, Jamie Hall...",2022-02-10,LaMDA: Language Models for Dialog Applications,https://arxiv.org/abs/2201.08239,1024.0,TPU-v3
21,"Ping Yu, Mikel Artexte, Myle Ott, Sam Shleife...",2022-04-14,Efficient Language Modeling with Sparse all-MLP,https://arxiv.org/abs/2203.06850,,32G V100
100,"Tom B. Brown, Benjamin Mann, Nick Ryder, Melan...",2020-05-28,Language models are Few- Shot Learners,https://arxiv.org/abs/2005.14165,,V100
103,"Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu,...",2020-06-30,GShard: Scaling Giant Models with Conditional ...,https://arxiv.org/abs/2006.16668,2048.0,TPU v3
108,"Zhenzhong Lan, Mingda Chen, Sebastian Goodman,...",2020-02-09,ALBERT: A Lite BERT for Self-supervised Learni...,https://arxiv.org/abs/1909.11942,,Cloud TPU V3
111,"Han Cai, Chuang Gan, Tianzhe Wang, Zhekai Zhan...",2020-04-29,Once for all: Train one network and specialize...,https://arxiv.org/abs/1908.09791,,V100


In [29]:
evaluate_answers(df, keys, CORRECT_ANSWERS)

Number of hardware units
N/A != 1024
N/A != 32
2048 != 1024
N/A != 512
N/A != 32
N/A != 1
N/A != 64
N/A != 100
N/A != 20
Hardware model
TPU v4 != TPUv4
TPUv3/TPUv4 != TPUv3, TPUv4
CloudTPUv4 != TPUv4
TPU-v3 != TPUv3
32G V100 != V100
TPU v3 != TPUv3
Cloud TPU V3 != TPUv3
N/A != P100
TPU != TPUv2
N/A != K40
NVIDIA GeForce GTX 1080 != N/A
Number of hardware units: 9/18
Hardware model: 7/18


{'Number of hardware units': 9, 'Hardware model': 7}
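
Several of the "Hardware model" mismatches above differ only in spacing, punctuation, or a prefix (`TPU v4` vs `TPUv4`, `Cloud TPU V3` vs `TPUv3`, `32G V100` vs `V100`). One possible way to avoid counting these as errors is to normalize both strings before comparing them. The sketch below is an assumption about how that could be done; `normalize_hardware` is a hypothetical helper and is not part of `evaluate_answers` as defined in this notebook.

```python
import re


def normalize_hardware(answer):
    """Reduce a hardware answer to a canonical, comparable form.

    Hypothetical helper: returns a sorted tuple of lowercase model names
    with separators, "Cloud" prefixes, and memory sizes such as "32G" removed.
    """
    text = answer.strip()
    if text.upper() == 'N/A':
        return ('n/a',)
    names = []
    # Treat "," and "/" as separators between multiple hardware models
    for part in re.split(r'[,/]', text):
        name = part.lower()
        name = name.replace('cloud', '')             # "Cloud TPU V3" -> " tpu v3"
        name = re.sub(r'\b\d+\s*gb?\b', '', name)    # drop memory sizes like "32G"
        name = re.sub(r'[\s\-_]+', '', name)         # drop spaces and hyphens
        if name:
            names.append(name)
    return tuple(sorted(names))


print(normalize_hardware('Cloud TPU V3') == normalize_hardware('TPUv3'))       # True
print(normalize_hardware('TPUv3/TPUv4') == normalize_hardware('TPUv3, TPUv4'))  # True
print(normalize_hardware('TPU') == normalize_hardware('TPUv2'))                 # False
```

Genuine disagreements, such as `TPU` vs `TPUv2` or `N/A` vs `P100`, still compare as unequal under this normalization, so only the formatting differences would be forgiven.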

## Count tokens in papers

In [None]:
total_davinci_tokens = 0
total_gpt_tokens = 0
davinci_tokenizer = tiktoken.encoding_for_model("text-davinci-003")
gpt4_tokenizer = tiktoken.encoding_for_model("gpt-4")  # same for gpt-3.5-turbo
for i, row in df.iterrows():
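    # Convert an arXiv abstract link to its PDF link.
    # Replacing '/abs/' rather than 'abs' avoids touching other parts of the URL.
    # Note: url is not used below; the text is read from a local file.
    url = row['Link'].replace('/abs/', '/pdf/')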
    paper_title = row['Reference'].replace(' ', '_').replace(':', '').replace('"', '').lower()
    with open('input_data/' + paper_title + '.txt', 'r') as f:
        text = f.read()

    davinci_tokens = davinci_tokenizer.encode(text, disallowed_special=())
    gpt_tokens = gpt4_tokenizer.encode(text, disallowed_special=())

    total_davinci_tokens += len(davinci_tokens)
    total_gpt_tokens += len(gpt_tokens)

    print(f"{row['Reference']}: {len(davinci_tokens)} tokens for davinci, {len(gpt_tokens)} tokens for gpt-4")

print(f"Total tokens: {total_davinci_tokens} for davinci, {total_gpt_tokens} for gpt-4")

In [None]:
len(gpt4_tokenizer.encode(chat_message_template))