# Retrieving data from papers using GPT

## Setup

In [1]:
!conda install -y openai

Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /Users/bencottier/miniconda3/envs/nlp

  added / updated specs:
    - openai


The following packages will be UPDATED:

  ca-certificates    conda-forge::ca-certificates-2022.12.~ --> pkgs/main::ca-certificates-2023.01.10-hecd8cb5_0

The following packages will be SUPERSEDED by a higher-priority channel:

  certifi            conda-forge/noarch::certifi-2022.12.7~ --> pkgs/main/osx-64::certifi-2022.12.7-py39hecd8cb5_0
  openssl            conda-forge::openssl-1.1.1t-hfd90126_0 --> pkgs/main::openssl-1.1.1t-hca72f7f_0


Preparing transaction: done
Verifying transaction: done
Executing transaction: done


In [2]:
!conda install -y -c conda-forge pdfminer.six

Collecting package metadata (current_repodata.json): done
Solving environment: done


  current version: 4.11.0
  latest version: 23.3.1

Please update conda by running

    $ conda update -n base conda



## Package Plan ##

  environment location: /Users/bencottier/miniconda3/envs/nlp

  added / updated specs:
    - pdfminer.six


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    ca-certificates-2023.5.7   |       h8857fd0_0         145 KB  conda-forge
    certifi-2023.5.7           |     pyhd8ed1ab_0         149 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         294 KB

The following packages will be UPDATED:

  ca-certificates    pkgs/main::ca-certificates-2023.01.10~ --> conda-forge::ca-certificates-2023.5.7-h8857fd0_0
  certifi            pkgs/main/osx-64::certifi-2022.12.7-p~ --> conda-forge/noar

In [3]:
import datetime
import openai
import os
import pandas as pd
import re
import requests
from pdfminer.high_level import extract_text

In [4]:
os.makedirs('output_data', exist_ok=True)

In [5]:
openai.api_key = os.getenv("OPENAI_API_KEY")

## Playground

In [32]:
example_paper_text = """
2 Related work

Language models and dialog models: Language models have attracted much attention recently thanks to their
successes in NLP applications (e.g., [19, 20, 21, 2, 1, 22, 23, 5, 12, 24]). Our study of scaling laws with respect to
model sizes is inspired by recent work on the scaling laws of neural language models [12, 13]. Similar to their ﬁndings,
our results show that model scaling improves our quality (sensibleness, speciﬁcity, and interestingness), safety and
groundedness metrics to some extent. However, ﬁne-tuning combined with scaling signiﬁcantly improves performance
on all metrics.

Our work is also closely related to recent successes in applying language models to dialog modeling (e.g., [25, 26,
17, 18]), which built on earlier research in neural dialog modeling (e.g., [14, 15, 16, 27, 28]). One of our ﬁne-tuning
stages requires training on dialog-only data, which is related to Wolf et al. [29], Dinan et al. [25] and Zhang et al. [30].
Our use of ﬁne-tuning on crowdworker-annotated data to improve interestingness is comparable to Roller et al. [18].
However, we aim to maximize the interestingness of the model’s output distinctly from its ability to engage the user in
further interaction.

Our ﬁnding that pure scaling has a limited effect on key measures of open-domain dialog model performance echoes
that of Shuster et al. [31], who also focus on the problem of groundedness. Recent studies on scaling have found that
performance on question-answering tasks improves with model size [32, 33], similar to our ﬁndings on pre-trained
LaMDA prior to ﬁne-tuning.

Our approach to improving model groundedness is broadly consistent with a growing literature on augmenting neural
language models with retrieval systems. Most of the existing literature focuses on the problem of open-domain
question-answering rather than dialog generation, and the models themselves are used to index and rank knowledge
sources, rather than trained to use an intermediate tool. Given these differences, we note that the range of existing
approaches to this problem include the RNNLM [34], RAG [35], REALM [36], and FiD [37] architectures. Zhu et
al. [38] provide a survey of further recent work. See Karpukhin et al. [39] for details on the ‘dense passage retriever’
used in RAG. Recent work in this direction has expanded and elaborated on neural models’ ability to retrieve and rank
passages [40]. The RETRO architecture demonstrates that language models can be primed with results retrieved from
a database as large as two trillion tokens [41]. At a broad level, our approach is also comparable to that of Byrne et
al. [42], which ﬁne-tunes the model to use external APIs for movie ticketing dialog.

Parts of our ﬁndings are similar to recent studies on dialog groundedness. Granting access to external knowledge
bases has been shown to reduce the rate at which models hallucinate unsourced statements in dialog across a variety of
retrieval systems and model architectures [31]. Another study ﬁnds that a question-answering system’s accuracy is
improved by separating it into a reasoning unit and a response generator, analogous to our separation of ‘Base’ and
‘Research’ models in our study [43]. Meanwhile, the WebGPT framework includes a language system that can interact
with the open web via a text-only interface, and learns to imitate humans in answering questions by citing external
sources [44]. Komeili et al. [45] compare different types of pre-trained models and retrieval methods, and reach a
similar conclusion that augmenting language models with a search engine provides more factually grounded responses.
They encode the input context with grounded information from search to generate the next response, while we augment
the generated responses with information from known sources in our method. This allows us to ﬁne-tune the model for
groundedness without sacriﬁcing gains in safety or quality from other ﬁne-tuning treatments.

Dialog metrics: Deﬁning effective metrics for dialog models remains an open research topic. Our approach is
inspired by Adiwardana et al. [17], who argued for human-like metrics, such as sensibleness and speciﬁcity. Many
automated metrics for dialog models have been studied, including perplexity [16, 17], F1, Hits@1/N [25], USR [46],
or BLEU/ROUGE [47, 15, 27]. However, such automated metrics may not correlate well with human judgment [48].
More reliable metrics for dialog modeling require human evaluation [49, 50, 18, 25, 17, 51], as used in this paper.

Earlier research attempted to combine multifaceted evaluations of dialog quality into a single headline metric [52]. We
follow the pattern established in Adiwardana et al. [17] and Roller et al. [18] by considering the different components
of our evaluations separately. In addition to sensibleness and speciﬁcity per Adiwardana et al. [17], we add new metrics:
interestingness, safety, and groundedness. An advantage of using several different metrics is their debuggability: by
exploring responses with low safety or groundedness scores, we have been able to develop targeted methods to improve
them.

Safety and safety of dialog models:
Inappropriate and unsafe risks and behaviors of language models have been
extensively discussed and studied in previous works (e.g., [53, 54]). Issues encountered include toxicity (e.g., [55, 56,
57]), bias (e.g., [58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72]), and inappropriately revealing personally
identifying information (PII) from training data [73]. Weidinger et al. [54] identify 21 risks associated with large-scale

3

language models and discuss the points of origin for these risks. While many mitigation strategies have also been
suggested (e.g., [74, 75, 76, 77, 78, 79, 80, 81, 82]), meaningfully addressing these issues remains an active research
area.

Similar issues have also been discussed speciﬁcally for dialog models [53]. For instance, examples of bias, offensiveness,
and hate speech have been found both in training data drawn from social media, and consequently in the output of dialog
models trained on such data [83]. Dialog models [84] can learn, and even amplify, biases in the training data. Echoing
Gehman et al. [85], we ﬁnd ﬁne-tuning effective to augment language models for safety. The method we use in this
paper follows previous attempts to tackle these issues by training separate layers to detect unsafe output [17, 86, 18, 79].
Our strategy is similar to recent work that also uses ﬁne-tuning [87]. While their safety guidelines were derived from
human rights principles, they similarly ﬁnd that increasing scale has no impact on toxicity metrics, while ﬁne-tuning on
safety evaluations does.

Groundedness metrics: Similar to other recent research into groundedness cited above, we assess groundedness
by asking crowdworkers to judge whether the model’s output is in accordance with authoritative external sources.
The recently-proposed Attributable to Identiﬁed Sources (AIS) framework [88] articulates a more precise approach
to assess output of language models that pertains to the external world. It splits evaluation into two stages, where
crowdworkers are asked: (1) if they can understand and identify the information shared in a dialog turn, and (2) if all
of this information can be attributed to a source. Meanwhile, a recent study has reopened the question of automatic
evaluation, with the Q2 metric showing performance comparable to human annotation [89].

3 LaMDA pre-training

LaMDA was pre-trained to predict the next token in a text corpus. Unlike previous dialog models trained on dialog data
alone [17, 18], we pre-trained LaMDA on a dataset created from public dialog data and other public web documents.
Therefore, LaMDA can be used as a general language model prior to ﬁne-tuning.

The pre-training dataset consists of 2.97B documents, 1.12B dialogs, and 13.39B dialog utterances, for a total of 1.56T
words (Appendix E). Over 90% of the pre-training dataset is in the English language. We used the SentencePiece
library [90] to tokenize the dataset into 2.81T byte pair encoding (BPE) tokens [91], with a vocabulary of 32K tokens.
For comparison, the total number of words in the training set for Meena [17] was 40B words, which is nearly 40x
smaller.

The largest LaMDA model has 137B non-embedding parameters, which is ~50x more parameters than Meena [17].
We use a decoder-only Transformer [92] language model as the model architecture for LaMDA. The Transformer has
64 layers, dmodel = 8192, df f = 65536, h = 128, dk = dv = 128, relative attention as described in T5 [11], and
gated-GELU activation as described in Raffel et al. [93].

We pre-trained LaMDA on 1024 TPU-v3 chips for a total of about 57.7 days, and 256K tokens per batch. We used
the Lingvo framework [94] for training and achieved 123 TFLOPS/sec with 56.5% FLOPS utilization with the 2D
sharding algorithm, as described in GSPMD [95] (see Section 10 for carbon footprint estimates). We also trained
smaller 2B-parameter and 8B-parameter models to measure the effects of model scaling on our metrics. Hyperparameter
details for the models of different sizes can be found in Table 27, Appendix D.

Figure 2 gives an overview of the pre-training stage. We call the model before any ﬁne-tuning "PT", for PreTrained.

Figure 2: LaMDA pre-training as a language model.

4

PT uses the same sample-and-rank strategy as Meena [17] for decoding. We ﬁrst sample 16 independent candidate
responses using top-k (k = 40) sampling (no temperature). The ﬁnal output is the highest-scoring candidate, where the
score is based on the candidate’s log-likelihood and its length.

4 Metrics

Evaluating generative models in general, and open-ended dialog models in particular, is difﬁcult. See the Related
Work section for a general review of recent work in this area. In this section, we describe the metrics that we use for
evaluation.

4.1 Foundation metrics: Quality, Safety and Groundedness

Sensibleness, Speciﬁcity, Interestingness (SSI): Our overall quality score is an average of sensibleness, speciﬁcity,
and interestingness (SSI).

Adiwardana et al. [17] propose the sensibleness and speciﬁcity average (SSA) metric to measure the quality of Meena.
This metric is a simple average of two scores: sensibleness and speciﬁcity.

The ﬁrst score, sensibleness, measures whether a model’s responses make sense in context and do not contradict
anything that was said earlier. Humans tend to take this basic aspect of communication for granted, but generative
models often struggle to meet this requirement. However, if sensibleness alone is used to evaluate models, we could
inadvertently reward models for playing it safe by always producing short, generic, and boring responses. The
GenericBot algorithm [17], which answers every question with “I don’t know” and every statement with “Ok,” scores
70% on sensibleness, which even surpasses some large dialog models [17].

The second score, speciﬁcity, is used to measure whether a response is speciﬁc to a given context. For example, if a user
says “I love Eurovision” and the model responds “Me too,” then it would score 0 on speciﬁcity, since this response could
be used in many different contexts. If it answers “Me too. I love Eurovision songs,” then it would score 1. Adiwardana
et al. [17] report that Meena narrows the gap to average human performance in the SSA metric.

As the model’s performance increases, however, we ﬁnd that sensibleness and speciﬁcity are not sufﬁcient to measure
the quality of a dialog model. For example, a response to “How do I throw a ball?” could be “You can throw a ball by
ﬁrst picking it up and then throwing it”, which makes sense and is speciﬁc to the question. An alternative deeper and
more satisfying answer could be “One way to toss a ball is to hold it ﬁrmly in both hands and then swing your arm
down and up again, extending your elbow and then releasing the ball upwards.”
"""

In [7]:
response = openai.Completion.create(
    model="text-davinci-003",
    prompt=f"A table summarizing the training hardware from this paper:\n\n====\n\n{example_paper_text}\n\n====\n\n| Number of GPUs or TPUs | Hardware model (e.g. A100) | FLOP/s |\n",
    temperature=0,
    max_tokens=100,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0
)
response

<OpenAIObject text_completion id=cmpl-7EJDodhVtqVMFdhqaqJqkZLn8QU4m at 0x7f7df05b5630> JSON: {
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "text": "| ---------------------- | -------------------------- | ------ |\n| 6144                   | TPU v4                    | N/A    |"
    }
  ],
  "created": 1683644836,
  "id": "cmpl-7EJDodhVtqVMFdhqaqJqkZLn8QU4m",
  "model": "text-davinci-003",
  "object": "text_completion",
  "usage": {
    "completion_tokens": 27,
    "prompt_tokens": 753,
    "total_tokens": 780
  }
}

In [8]:
prompt_text = f"""
Read the Machine Learning research paper below and answer the following questions. Just state the answer without explanation. If the answer is not mentioned in the text, write "N/A".
1. How many GPUs or TPUs were used to train the model? Just state the number. If the number of GPUs or TPUs is not mentioned in the text, write "N/A".
2. What model of GPU or TPU was used to train the model? Examples include: "A100", "V100", "P100", "TPUv3", "TPUv4". If the GPU or TPU is not mentioned in the text, write "N/A".
3. What FLOP/s (AKA: FLOP/second, FLOPS) was achieved during training? Include the same units as written in the paper. If FLOP/s is not mentioned in the text, write "N/A".

Here are some example answers:

1. 1
2. V100
3. 21 TFLOP/s

1. N/A
2. Titan V
3. 21 petaflops

1. 32
2. N/A
3. 127e12 FLOPS

====

{example_paper_text}

====

"""

response = openai.Completion.create(
    model="text-davinci-003",
    prompt=prompt_text,
    temperature=0,
    max_tokens=100,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0
)
print(response["choices"][0]["text"])

1. N/A
2. TPUv4
3. N/A


In [48]:
prompt_text = """
Read the following excerpt of a Machine Learning research paper and answer the questions below.
1. How many GPUs or TPUs were used to train the model? Just state the number. If the number of GPUs or TPUs is not mentioned in the text, write "N/A".
2. What model of GPU or TPU was used to train the model? Examples include: "A100", "V100", "P100", "TPUv3", "TPUv4". If the GPU or TPU is not mentioned in the text, write "N/A".
"""

openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a world expert in Machine Learning."},
        {"role": "user", "content": f"{prompt_text}\n\n====\n\n{example_paper_text}\n\n====\n\n"},
    ]
)

<OpenAIObject chat.completion id=chatcmpl-7EKgaCRIo9oPO03hur3oopCp9ly1t at 0x7f7da0af64a0> JSON: {
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "1. 1024 TPUs were used to pre-train the LaMDA model.\n2. TPU-v3 was used to pre-train the LaMDA model.",
        "role": "assistant"
      }
    }
  ],
  "created": 1683650464,
  "id": "chatcmpl-7EKgaCRIo9oPO03hur3oopCp9ly1t",
  "model": "gpt-3.5-turbo-0301",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 37,
    "prompt_tokens": 3071,
    "total_tokens": 3108
  }
}

## Pipeline

In [10]:
chat_message_template = """
Read the Machine Learning research paper below and answer the following questions. Just state the answer without explanation. If the answer is not mentioned in the text, write "N/A".

1. How many GPUs or TPUs or chips were used to train the model? Just state the number. If the number of GPUs or TPUs or chips is not mentioned in the text, write "N/A".
2. What model of GPU or TPU was used to train the model? Examples include: "A100", "V100", "P100", "TPUv3", "TPUv4". If the GPU or TPU is not mentioned in the text, write "N/A".

Here are some example answers:

1. 1
2. V100

1. N/A
2. Titan V

1. 32
2. N/A

====

{paper_text}

====

"""

def parse_text_gpt_chat(text):
    prompt_text = chat_message_template.format(paper_text=text)
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "user", "content": prompt_text},
        ]
    )
    return response

def parse_gpt_chat_response(response):
    # E.g. "1. 6144 TPUs\n2. TPU v4\n3. N/A\n"
    answers = response["choices"][0]["message"]["content"].strip().split("\n")
    answers = [a.split(".")[-1].strip() for a in answers]
    return answers

In [19]:
len(chat_message_template)

638

In [11]:
prompt_template = """
Read the Machine Learning research paper below and answer the following questions. Just state the answer without explanation. If the answer is not mentioned in the text, write "N/A".

1. How many GPUs or TPUs or chips were used to train the model? Just state the number. If the number of GPUs or TPUs or chips is not mentioned in the text, write "N/A".
2. What model of GPU or TPU was used to train the model? Examples include: "A100", "V100", "P100", "TPUv3", "TPUv4". If the GPU or TPU is not mentioned in the text, write "N/A".
3. What FLOP/s (AKA: FLOP/second, FLOPS) was achieved during training? Include the same units as written in the paper. If FLOP/s is not mentioned in the text, write "N/A".

Here are some example answers:

1. 1
2. V100
3. 21 TFLOP/s

1. N/A
2. Titan V
3. 21 petaflops

1. 32
2. N/A
3. 127e12 FLOPS

====

{paper_text}

====

"""

def parse_text_gpt(text):
    prompt_text = prompt_template.format(paper_text=text)
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt_text,
        temperature=0,
        max_tokens=100,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0
    )
    return response

def parse_gpt_response(response):
    # E.g. "1. 6144 TPUs\n2. TPU v4\n3. N/A\n"
    answers = response["choices"][0]["text"].strip().split("\n")
    answers = [a.split(".")[-1].strip() for a in answers]
    return answers

In [25]:
# I've heard that English has about 4 chars per token on average.
# But papers may have parts with a lot of digits, which I think are one token each.
# GPT-3 token limit (including output) is 4096.
# 4096 * 2 minus the initial prompt should be safe.
CHAR_LIMIT = 4096*2 - len(chat_message_template)
NUM_QUESTIONS = 2

def parse_paper(df, i, row, keys):
    url = row['Link']

    # replace "abs" with "pdf" in arxiv url links
    url = url.replace('abs', 'pdf')
    print(f"Looking into \"{row['Reference']}\"")

    # try:
    #     response = requests.get(url)
    # except Exception as e:
    #     print(f"There's something wrong with downloading: {e}")
    #     raise e

    # file = open("download.pdf", "wb")
    # file.seek(0) # overwrite previous file
    # file.write(response.content)
    # file.close()

    try:
        # text = extract_text('download.pdf')
        paper_title = row['Reference'].replace(' ', '_').replace(':', '').replace('"', '').lower()
        with open('input_data/' + paper_title + '.txt', 'r') as f:
            text = f.read()

        text_pos = 0
        final_answers = [None] * NUM_QUESTIONS
        while text_pos < len(text):
            # Get model answers for the next chunk of the text
            text_chunk = text[text_pos : text_pos + CHAR_LIMIT]
            answers = parse_gpt_chat_response(parse_text_gpt_chat(text_chunk))
            # Process each answer
            for i in range(NUM_QUESTIONS):
                if final_answers[i] is None and answers[i] is not None:
                    # Take the first answer as the final answer initially
                    final_answers[i] = answers[i]
                elif "N/A" in final_answers[i] and not ("N/A" in answers[i]):
                    """
                    If the answer was "N/A" previously but there's at least 
                    one non-"N/A" answer for a later chunk, then use the 
                    first non-"N/A" answer as the final answer
                    """
                    final_answers[i] = answers[i]
            # Move to the next chunk of text
            text_pos += CHAR_LIMIT
        
        for key, answer in zip(keys, final_answers):
            df.loc[i,key] = answer if answer else "none"
    except Exception as e:
        print(f"There's something wrong with extracting the text: {e}")
        raise e

In [26]:
# Download dataset from the Parameters, Compute and Data Trends in ML sheet
df = pd.read_csv('https://docs.google.com/spreadsheets/d/1AAIebjNsnJj_uKALHbXNfn3_YsT6sHXtCU0q7OIPuc4/export?format=csv#gid=0')

year_start = 2017

# Recode columns
df['Publication date'] = pd.to_datetime(df['Publication date'], errors='coerce', dayfirst=True)

# Filter for papers of only the last 5 years
df = df[df['Publication date'] > f'{year_start}-01-01']

# Keep only bibliographical data
df = df.filter(['Author(s)', 'Publication date', 'Reference', 'Link'])
df = df[df['Link'].notna()]
# Keep only links which forward to a pdf or an arxiv link
df = df[df['Link'].str.contains('(arxiv|.pdf$)', regex=True)]

keys = ['Number of hardware units', 'Hardware model', 'Training FLOP/s']

# Enable for test running with the first ten papers
df = df[:10]
# Or a specific paper
# idx = 4
# df = df[idx:idx+1]

for i, row in df.iterrows():
    try:
        parse_paper(df, i, row, keys)
        print("---")
    except:
        continue

display(df)

timestamp = datetime.datetime.now()
df.to_csv(f'output_data/parsed_paper_data_{timestamp.strftime("%Y-%m-%d_%H-%M-%S")}.csv')

  df = df[df['Link'].str.contains('(arxiv|.pdf$)', regex=True)]


Looking into "GPT-4 Technical Report"
---
Looking into "Phenaki: Variable Length Video Generation From Open Domain Textual Description"
---
Looking into "Solving Quantitative Reasoning Problems with Language Models"
---
Looking into "PaLM: Scaling Language Modeling with Pathways"
---
Looking into "Training Compute-Optimal Large Language Models"
---
Looking into "Scaling Autoregressive Models for Content-Rich Text-to-Image Generation"
---
Looking into "LaMDA: Language Models for Dialog Applications"
---
Looking into "AlexaTM 20B: Few-Shot Learning Using a Large-Scale Multilingual Seq2Seq Model"
---
Looking into "High-Resolution Image Synthesis with Latent Diffusion Models"
There's something wrong with extracting the text: That model is currently overloaded with other requests. You can retry your request, or contact us through our help center at help.openai.com if the error persists. (Please include the request ID ec853e2be56148771033a5ee58bd849c in your message.)
Looking into "Robust Sp

Unnamed: 0,Author(s),Publication date,Reference,Link,Number of hardware units,Hardware model
0,OpenAI,2023-03-15,GPT-4 Technical Report,https://arxiv.org/abs/2303.08774,,
1,"Ruben Villegas, Mohammad Babaeizadeh, Pieter-J...",2022-10-05,Phenaki: Variable Length Video Generation From...,https://arxiv.org/abs/2210.02399,128.0,Titan V
2,"Aitor Lewkowycz, Anders Andreassen, David Doha...",2022-06-29,Solving Quantitative Reasoning Problems with L...,https://arxiv.org/abs/2206.14858,,
3,"Aakanksha Chowdhery, Sharan Narang, Jacob Devl...",2022-04-04,PaLM: Scaling Language Modeling with Pathways,https://arxiv.org/abs/2204.02311,,
5,"Jordan Hoffmann, Sebastian Borgeaud, Arthur Me...",2022-03-29,Training Compute-Optimal Large Language Models,https://arxiv.org/abs/2203.15556,,
6,"Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Lu...",2022-06-22,Scaling Autoregressive Models for Content-Rich...,https://arxiv.org/abs/2206.10789v1,,
7,"Romal Thoppilan, Daniel De Freitas, Jamie Hall...",2022-02-10,LaMDA: Language Models for Dialog Applications,https://arxiv.org/abs/2201.08239,,
9,"Saleh Soltan, Shankar Ananthakrishnan, Jack Fi...",2022-08-02,AlexaTM 20B: Few-Shot Learning Using a Large-S...,https://arxiv.org/abs/2208.01448,,
12,"Robin Rombach, Andreas Blattmann, Dominik Lore...",2022-04-13,High-Resolution Image Synthesis with Latent Di...,https://arxiv.org/abs/2112.10752,,
13,"Alec Radford, Jong Wook Kim, Tao Xu, Greg Broc...",2022-09-21,Robust Speech Recognition via Large-Scale Weak...,https://cdn.openai.com/papers/whisper.pdf,,
