# Retrieving data from papers using GPT

## Setup

In [1]:
!conda install -y openai

Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /Users/bencottier/miniconda3/envs/nlp

  added / updated specs:
    - openai


The following packages will be UPDATED:

  ca-certificates    conda-forge::ca-certificates-2022.12.~ --> pkgs/main::ca-certificates-2023.01.10-hecd8cb5_0

The following packages will be SUPERSEDED by a higher-priority channel:

  certifi            conda-forge/noarch::certifi-2022.12.7~ --> pkgs/main/osx-64::certifi-2022.12.7-py39hecd8cb5_0
  openssl            conda-forge::openssl-1.1.1t-hfd90126_0 --> pkgs/main::openssl-1.1.1t-hca72f7f_0


Preparing transaction: done
Verifying transaction: done
Executing transaction: done


In [2]:
!conda install -y -c conda-forge pdfminer.six

Collecting package metadata (current_repodata.json): done
Solving environment: done


  current version: 4.11.0
  latest version: 23.3.1

Please update conda by running

    $ conda update -n base conda



## Package Plan ##

  environment location: /Users/bencottier/miniconda3/envs/nlp

  added / updated specs:
    - pdfminer.six


The following packages will be SUPERSEDED by a higher-priority channel:

  ca-certificates    pkgs/main::ca-certificates-2023.01.10~ --> conda-forge::ca-certificates-2022.12.7-h033912b_0
  certifi            pkgs/main/osx-64::certifi-2022.12.7-p~ --> conda-forge/noarch::certifi-2022.12.7-pyhd8ed1ab_0
  openssl              pkgs/main::openssl-1.1.1t-hca72f7f_0 --> conda-forge::openssl-1.1.1t-hfd90126_0


Preparing transaction: done
Verifying transaction: done
Executing transaction: done


In [2]:
import datetime
import openai
import os
import pandas as pd
import re
import requests
from pdfminer.high_level import extract_text

In [3]:
os.makedirs('output_data', exist_ok=True)

In [4]:
openai.api_key = os.getenv("OPENAI_API_KEY")

## Playground

In [5]:
example_paper_text = "PaLM: Scaling Language Modeling with Pathways\nAakanksha Chowdhery∗ Sharan Narang∗ Jacob Devlin∗\nMaarten Bosma Gaurav Mishra Adam Roberts Paul Barham\nHyung Won Chung Charles Sutton Sebastian Gehrmann Parker Schuh Kensen Shi\nSasha Tsvyashchenko Joshua Maynez Abhishek Rao† Parker Barnes Yi Tay\nNoam Shazeer‡ Vinodkumar Prabhakaran Emily Reif Nan Du Ben Hutchinson\nReiner Pope James Bradbury Jacob Austin Michael Isard Guy Gur-Ari\nPengcheng Yin Toju Duke Anselm Levskaya Sanjay Ghemawat Sunipa Dev\nHenryk Michalewski Xavier Garcia Vedant Misra Kevin Robinson Liam Fedus\nDenny Zhou Daphne Ippolito David Luan‡ Hyeontaek Lim Barret Zoph\nAlexander Spiridonov Ryan Sepassi David Dohan Shivani Agrawal Mark Omernick\nAndrew M. Dai Thanumalayan Sankaranarayana Pillai Marie Pellat Aitor Lewkowycz\nErica Moreira Rewon Child Oleksandr Polozov† Katherine Lee Zongwei Zhou\nXuezhi Wang Brennan Saeta Mark Diaz Orhan Firat Michele Catasta† Jason Wei\nKathy Meier-Hellstern Douglas Eck Jeff Dean Slav Petrov Noah Fiedel\nGoogle Research\nAbstract\nLarge language models have been shown to achieve remarkable performance across a variety of natural\nlanguage tasks using few-shot learning, which drastically reduces the number of task-specific training\nexamples needed to adapt the model to a particular application. To further our understanding of the\nimpact of scale on few-shot learning, we trained a 540-billion parameter, densely activated, Transformer\nlanguage model, which we call Pathways Language Model (PaLM).\nWe trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient\ntraining across multiple TPU Pods. We demonstrate continued benefits of scaling by achieving state-of-\nthe-art few-shot learning results on hundreds of language understanding and generation benchmarks. On a\nnumber of these tasks, PaLM 540B achieves breakthrough performance, outperforming the finetuned state-\nof-the-art on a suite of multi-step reasoning tasks, and outperforming average human performance on the\nrecently released BIG-bench benchmark. A significant number of BIG-bench tasks showed discontinuous\nimprovements from model scale, meaning that performance steeply increased as we scaled to our largest\nmodel. PaLM also has strong capabilities in multilingual tasks and source code generation, which we\ndemonstrate on a wide array of benchmarks. We additionally provide a comprehensive analysis on bias\nand toxicity, and study the extent of training data memorization with respect to model scale. Finally,\nwe discuss the ethical considerations related to large language models and discuss potential mitigation\nstrategies.\n∗Equal Contribution. Author contributions and ordering details are listed in Appendix A.\nCorrespondence authors: chowdhery@google.com, sharannarang@google.com\nIn addition to other contributions, the last five authors advised the overall project.\n†Alphabet, X, the Moonshot Factory\n‡Work done while at Google\n\n"

In [6]:
response = openai.Completion.create(
    model="text-davinci-003",
    prompt=f"A table summarizing the training hardware from this paper:\n\n====\n\n{example_paper_text}\n\n====\n\n| Number of GPUs or TPUs | Hardware model (e.g. A100) | FLOP/s |\n",
    temperature=0,
    max_tokens=100,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0
)
response

<OpenAIObject text_completion id=cmpl-7CsqXXlik8mJ051Qo3FUCDKPA8WcU at 0x7fbfb02cd860> JSON: {
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "text": "| ---------------------- | -------------------------- | ------ |\n| 6144                   | TPU v4                    | N/A    |"
    }
  ],
  "created": 1683305121,
  "id": "cmpl-7CsqXXlik8mJ051Qo3FUCDKPA8WcU",
  "model": "text-davinci-003",
  "object": "text_completion",
  "usage": {
    "completion_tokens": 27,
    "prompt_tokens": 753,
    "total_tokens": 780
  }
}
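
The table-style prompt returns the rows as pipe-delimited text, so the completion still has to be turned into structured values. The following is a minimal sketch of one way to do that; `parse_markdown_table` is a hypothetical helper (not part of the notebook or of the `openai` package), and it assumes the model keeps to one row per line with the same number of columns as the header.

```python
def parse_markdown_table(header_line, completion_text):
    """Turn a pipe-delimited table completion into a list of row dicts."""
    def split_row(line):
        # Drop the outer pipes, then split on the inner ones.
        return [cell.strip() for cell in line.strip().strip("|").split("|")]

    columns = split_row(header_line)
    rows = []
    for line in completion_text.strip().splitlines():
        cells = split_row(line)
        # Skip the separator row, which contains only dashes and colons.
        if all(cell and set(cell) <= set("-:") for cell in cells):
            continue
        if len(cells) != len(columns):
            continue  # malformed row; ignore rather than guess
        rows.append(dict(zip(columns, cells)))
    return rows


# Header taken from the prompt, completion text taken from the response above.
header = "| Number of GPUs or TPUs | Hardware model (e.g. A100) | FLOP/s |"
completion = (
    "| ---------------------- | -------------------------- | ------ |\n"
    "| 6144                   | TPU v4                    | N/A    |"
)
rows = parse_markdown_table(header, completion)
print(rows)
```

Rows whose cell count does not match the header are dropped rather than repaired, which keeps a malformed completion from silently shifting values into the wrong column.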

In [7]:
prompt_text = f"""
Read the Machine Learning research paper below and answer the following questions. Just state the answer without explanation. If the answer is not mentioned in the text, write "N/A".
1. How many GPUs or TPUs were used to train the model? Just state the number. If the number of GPUs or TPUs is not mentioned in the text, write "N/A".
2. What model of GPU or TPU was used to train the model? Examples include: "A100", "V100", "P100", "TPUv3", "TPUv4". If the GPU or TPU is not mentioned in the text, write "N/A".
3. What FLOP/s (AKA: FLOP/second, FLOPS) was achieved during training? Include the same units as written in the paper. If FLOP/s is not mentioned in the text, write "N/A".

Here are some example answers:

1. 1
2. V100
3. 21 TFLOP/s

1. N/A
2. Titan V
3. 21 petaflops

1. 32
2. N/A
3. 127e12 FLOPS

====

{example_paper_text}

====

"""

response = openai.Completion.create(
    model="text-davinci-003",
    prompt=prompt_text,
    temperature=0,
    max_tokens=100,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0
)
print(response["choices"][0]["text"])

1. N/A
2. TPUv4
3. N/A


In [8]:
prompt_text = """
Read the following excerpt of a Machine Learning research paper and answer the questions below. Just state the answer without explanation. If the answer is not mentioned in the text, write "N/A".
1. How many GPUs or TPUs were used to train the model? Just state the number. If the number of GPUs or TPUs is not mentioned in the text, write "N/A".
2. What model of GPU or TPU was used to train the model? Examples include: "A100", "V100", "P100", "TPUv3", "TPUv4". If the GPU or TPU is not mentioned in the text, write "N/A".

Here are some example answers:

1. 1
2. V100

1. N/A
2. Titan V

1. 32
2. N/A
"""

openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant who is an expert in the field of Machine Learning."},
        {"role": "user", "content": prompt_text},
        {"role": "assistant", "content": "Understood."},
        {"role": "user", "content": example_paper_text},
    ]
)

<OpenAIObject chat.completion id=chatcmpl-7CsqdCyyBaLu8GwX18EJn6EmFLUXB at 0x7fbfd8d13860> JSON: {
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "1. 6144\n2. TPUv4",
        "role": "assistant"
      }
    }
  ],
  "created": 1683305127,
  "id": "chatcmpl-7CsqdCyyBaLu8GwX18EJn6EmFLUXB",
  "model": "gpt-3.5-turbo-0301",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 12,
    "prompt_tokens": 901,
    "total_tokens": 913
  }
}
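
The calls in this notebook use the module-level interface of the `openai` package from before version 1.0 (`openai.Completion.create`, `openai.ChatCompletion.create`). From version 1.0 onward the package exposes a client object instead. Below is a rough sketch of the same four-message exchange in that style. `ask_chat` is a hypothetical wrapper, the `client` argument is assumed to be an `openai.OpenAI()` instance, and whether a given model name is still served is not something this sketch can guarantee.

```python
def ask_chat(client, instructions, paper_text, model="gpt-3.5-turbo"):
    """Send the instructions/acknowledgement/paper exchange and return the reply text.

    `client` is assumed to be an `openai.OpenAI()` instance (openai>=1.0).
    """
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a helpful assistant who is an expert in the field of Machine Learning."},
            {"role": "user", "content": instructions},
            {"role": "assistant", "content": "Understood."},
            {"role": "user", "content": paper_text},
        ],
    )
    # In the 1.0+ interface the response is an object, not a dict.
    return response.choices[0].message.content


# Usage (requires openai>=1.0 and OPENAI_API_KEY in the environment):
# from openai import OpenAI
# client = OpenAI()
# print(ask_chat(client, prompt_text, example_paper_text))
```

Passing the client in as an argument keeps the helper easy to exercise with a stub object, without network access.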

## Pipeline

In [14]:
chat_message_template = """
Read the Machine Learning research paper below and answer the following questions. Just state the answer without explanation. If the answer is not mentioned in the text, write "N/A".

1. How many GPUs or TPUs or chips were used to train the model? Just state the number. If the number of GPUs or TPUs or chips is not mentioned in the text, write "N/A".
2. What model of GPU or TPU was used to train the model? Examples include: "A100", "V100", "P100", "TPUv3", "TPUv4". If the GPU or TPU is not mentioned in the text, write "N/A".

Here are some example answers:

1. 1
2. V100

1. N/A
2. Titan V

1. 32
2. N/A

====

{paper_text}

====

"""

def parse_text_gpt_chat(text):
    prompt_text = chat_message_template.format(paper_text=text)
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "user", "content": prompt_text},
        ]
    )
    return response

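def parse_gpt_chat_response(response):
    # E.g. "1. 6144 TPUs\n2. TPU v4\n3. N/A\n"
    answers = response["choices"][0]["message"]["content"].strip().split("\n")
    # Strip only the leading "N. " so that answers containing periods
    # (e.g. "21.5 TFLOP/s") are not cut off
    answers = [re.sub(r"^\s*\d+\.\s*", "", a).strip() for a in answers]
    return answers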
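
The parser above returns answers in the order the lines appear, so a skipped question or a stray blank line shifts every later answer into the wrong column. One possible alternative, sketched here under the assumption that the model keeps the `N.` prefixes from the prompt, is to place each answer by the number it carries. `answers_by_number` is a hypothetical helper, not something the pipeline currently calls.

```python
import re


def answers_by_number(content, n_questions):
    """Return a list of n_questions answers, placed by the number the model wrote."""
    answers = [""] * n_questions
    for line in content.splitlines():
        match = re.match(r"\s*(\d+)\.\s*(.*)", line)
        if not match:
            continue  # blank line or free-form commentary
        number = int(match.group(1))
        if 1 <= number <= n_questions:
            answers[number - 1] = match.group(2).strip()
    return answers


print(answers_by_number("1. 6144\n2. TPUv4", 3))
print(answers_by_number("2. TPU v4\n\n3. 21.5 TFLOP/s", 3))
```

Questions the model did not answer stay as empty strings, which matches the `answer if answer else ""` fallback used when writing to the DataFrame.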

In [15]:
prompt_template = """
Read the Machine Learning research paper below and answer the following questions. Just state the answer without explanation. If the answer is not mentioned in the text, write "N/A".

1. How many GPUs or TPUs or chips were used to train the model? Just state the number. If the number of GPUs or TPUs or chips is not mentioned in the text, write "N/A".
2. What model of GPU or TPU was used to train the model? Examples include: "A100", "V100", "P100", "TPUv3", "TPUv4". If the GPU or TPU is not mentioned in the text, write "N/A".
3. What FLOP/s (AKA: FLOP/second, FLOPS) was achieved during training? Include the same units as written in the paper. If FLOP/s is not mentioned in the text, write "N/A".

Here are some example answers:

1. 1
2. V100
3. 21 TFLOP/s

1. N/A
2. Titan V
3. 21 petaflops

1. 32
2. N/A
3. 127e12 FLOPS

====

{paper_text}

====

"""

def parse_text_gpt(text):
    prompt_text = prompt_template.format(paper_text=text)
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt_text,
        temperature=0,
        max_tokens=100,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0
    )
    return response

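def parse_gpt_response(response):
    # E.g. "1. 6144 TPUs\n2. TPU v4\n3. N/A\n"
    answers = response["choices"][0]["text"].strip().split("\n")
    # Strip only the leading "N. " so that answers containing periods
    # (e.g. "21.5 TFLOP/s") are not cut off
    answers = [re.sub(r"^\s*\d+\.\s*", "", a).strip() for a in answers]
    return answers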
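
As the example in the comment shows, the model may answer the first question with "6144 TPUs" rather than a bare number. A small sketch of how such an answer could be normalized before analysis follows. `normalize_count` is a hypothetical helper, and it takes the first number it finds, which would be wrong for an answer like "8 nodes of 8 GPUs".

```python
import re


def normalize_count(answer):
    """Convert an answer such as '6144 TPUs' or '6,144' to an int, or None for 'N/A'."""
    if answer is None or answer.strip().upper() in {"", "N/A"}:
        return None
    # First run of digits, allowing thousands separators.
    match = re.search(r"\d[\d,]*", answer)
    if not match:
        return None
    return int(match.group(0).replace(",", ""))


for raw in ["6144 TPUs", "6,144", "N/A", "several"]:
    print(repr(raw), "->", normalize_count(raw))
```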

In [16]:
# English text averages roughly 4 characters per token.
# The `text-davinci-003` token limit (prompt plus completion) is 4097,
# so budgeting 3 characters per token leaves some headroom for the
# prompt template and the completion.
CHAR_LIMIT = 4097 * 3

def parse_paper(df, i, row, keys):
    url = row['Link']

    # Point arXiv abstract links at the PDF. Match the full path segment
    # so that "abs" elsewhere in a URL is left alone.
    url = url.replace('arxiv.org/abs/', 'arxiv.org/pdf/')
    print(f"Looking into \"{row['Reference']}\"")

    try:
        response = requests.get(url)
    except Exception as e:
        print(f"There's something wrong with downloading: {e}")
        raise e

    # Mode "wb" truncates the file, so any previous download is overwritten
    with open("download.pdf", "wb") as file:
        file.write(response.content)

    try:
        text = extract_text('download.pdf')

        answers = parse_gpt_chat_response(parse_text_gpt_chat(text[:CHAR_LIMIT]))

        for key, answer in zip(keys, answers):
            df.loc[i, key] = answer if answer else ""
    except Exception as e:
        print(f"There's something wrong with extracting the text: {e}")
        raise e
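
`CHAR_LIMIT` is applied to the paper text alone, but the instructions in the prompt template and the completion count against the same token limit. The arithmetic can be made explicit as below. This is only a sketch: `char_budget` is a hypothetical helper, the 3 characters per token is the same conservative rule of thumb used above, and the `prompt_tokens=250` figure is an illustrative placeholder rather than a measured size of the template.

```python
def char_budget(token_limit, prompt_tokens, output_tokens, chars_per_token=3):
    """Characters of paper text that fit after reserving room for prompt and output."""
    available_tokens = token_limit - prompt_tokens - output_tokens
    # Never return a negative budget if the reservations exceed the limit.
    return max(available_tokens, 0) * chars_per_token


# 4097 is the limit quoted above; 100 matches max_tokens in the completion calls.
budget = char_budget(token_limit=4097, prompt_tokens=250, output_tokens=100)
print(budget)  # (4097 - 250 - 100) * 3 = 11241
```

A character-based estimate is still approximate. Counting tokens with the model's own tokenizer would be more reliable if the limit turns out to be exceeded in practice.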

In [17]:
# Download dataset from the Parameters, Compute and Data Trends in ML sheet
df = pd.read_csv('https://docs.google.com/spreadsheets/d/1AAIebjNsnJj_uKALHbXNfn3_YsT6sHXtCU0q7OIPuc4/export?format=csv#gid=0')

year_start = 2017

# Recode columns
df['Publication date'] = pd.to_datetime(df['Publication date'], errors='coerce', dayfirst=True)

# Keep only papers published after the start of `year_start`
df = df[df['Publication date'] > f'{year_start}-01-01']

# Keep only bibliographical data
df = df.filter(['Author(s)', 'Publication date', 'Reference', 'Link'])
df = df[df['Link'].notna()]
# Keep only links that point to arXiv or end in ".pdf".
# No capture group (avoids the pandas match-group warning) and an escaped dot.
df = df[df['Link'].str.contains(r'arxiv|\.pdf$', regex=True)]

keys = ['Number of hardware units', 'Hardware model', 'Training FLOP/s']

# Uncomment to test on the first ten papers
# df = df[:10]
# Or run on a single paper
idx = 4
df = df[idx:idx+1]

for i, row in df.iterrows():
    try:
        parse_paper(df, i, row, keys)
        print("---")
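    except Exception:
        # parse_paper has already printed the error; move on to the next paper.
        # Catching Exception (not a bare except) keeps Ctrl-C working.
        continue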

display(df)

timestamp = datetime.datetime.now()
df.to_csv(f'output_data/parsed_paper_data_{timestamp.strftime("%Y-%m-%d_%H-%M-%S")}.csv')


Looking into "Training Compute-Optimal Large Language Models"
---


| | Author(s) | Publication date | Reference | Link | Number of hardware units | Hardware model |
| --- | --- | --- | --- | --- | --- | --- |
| 5 | Jordan Hoffmann, Sebastian Borgeaud, Arthur Me... | 2022-03-29 | Training Compute-Optimal Large Language Models | https://arxiv.org/abs/2203.15556 | | Titan V |
