This builds heavily on [this notebook](https://wandb.ai/ivangoncharov/GPT-3%20to%20Generate%20Doctor%20Who%20Synopses/reports/Using-OpenAI-s-GPT-3-to-Generate-Doctor-Who-Episode-Synopses---VmlldzoxNTI3NDIw) by Ivan Goncharov.

# Setting up

In [1]:
pip install --upgrade openai

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting openai
  Downloading openai-0.20.0.tar.gz (42 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Collecting pandas-stubs>=1.1.0.11
  Downloading pandas_stubs-1.2.0.62-py3-none-any.whl (163 kB)
Building wheels for collected packages: openai
  Building wheel for openai (PEP 517) ... done
  Created wheel for openai: filename=openai-0.20.0-py3-none-any.whl size=54118 sha256=0880574984b72defb9e829a919dccb96f77cc16d4a71196fa8b21565a7b196c7
  Stored in directory: /root/.cache/pip/wheels/71/8d/9b/e28529ec53123e0279208f99148d4661232120d78cb866839b
Successfully built openai
Installing collected packages: pandas-stubs, openai
Successfully in

In [2]:
import openai

Enter your OpenAI API key:

In [25]:
openai.api_key='put your API key here'
%env OPENAI_API_KEY=put your API key here

env: OPENAI_API_KEY=put your API key here
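
If you would rather not paste the key into the notebook, one option is to read it from an environment variable that was set beforehand. The sketch below is only an illustration: `load_api_key` is a hypothetical helper, not part of the `openai` package, and it assumes the variable is called `OPENAI_API_KEY` as in the cell above.

```python
import os

def load_api_key(env_var="OPENAI_API_KEY"):
    """Return the API key stored in env_var, or raise if it is missing or empty."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key

# Usage in the notebook (assumes the variable was set outside the notebook):
# openai.api_key = load_api_key()
```

This keeps the key out of the saved notebook, which matters if you share it.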


Get a quick trial response:

In [None]:
response = openai.Completion.create(
  model="text-curie-001",
  prompt='Write an abstract for a paper titled "Celestial Yang-Mills Amplitudes and D=4 Conformal Blocks", published in 2022',
  temperature=0.4,
  max_tokens=400
)

In [None]:
print(response['choices'][0]['text'])



The study of Yang-Mills amplitudes in D=4 conformal blocks is presented. In particular, the amplitudes of the Majorana fermion and the Dirac fermion are studied. It is found that the Majorana amplitude is larger than the Dirac amplitude in all conformal blocks. The reason for this is not clear, but it may be related to the topological properties of the conformal blocks.


# Creating some prompt-completion pairs

In [4]:
import json
import pandas as pd
from random import sample

Load cleandata.json, produced by the INSPIRE scraper:

In [5]:
with open("cleandata.json", "r") as f:
    cleandata=json.load(f)

In [6]:
len(cleandata)

5001

In [7]:
cleandata[13]['date']

'2004-07'
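
The cells that follow index `title`, `date` and `abstract` on every record, so a single incomplete record would stop the loop with a `KeyError`. A quick check along the following lines can catch that early. This is a sketch, assuming each record is a plain dict with those three keys; `find_incomplete` is a hypothetical helper name.

```python
def find_incomplete(records, required=("title", "date", "abstract")):
    """Return the indices of records with a missing or empty required field."""
    return [
        i for i, rec in enumerate(records)
        if any(not rec.get(field) for field in required)
    ]

# Usage in the notebook:
# bad = find_incomplete(cleandata)
# cleandata = [rec for i, rec in enumerate(cleandata) if i not in set(bad)]
```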

In [8]:
prompt_completion=[]
for c in cleandata:
  pr='Write an abstract for a paper titled "'+c['title']+'" written in '+c['date']+'.'
  comp=c['abstract']
  prompt_completion+=[{'prompt':pr,'completion':comp}]

In [9]:
prompt_completion[:2]

[{'completion': 'The high star formation rates of luminous infrared galaxies (LIRGs) make them ideal places for core-collapse supernova (CCSN) searches. At radio frequencies, free from dust extinction, it is possible to detect compact components within the innermost LIRG nuclear regions, such as SNe and SN remnants, as well as AGN buried deep in the LIRG nuclei. We studied the LIRG IC883 aiming at: (i) investigating its (circum-)nuclear regions using the e-EVN at 5GHz, and e-MERLIN at 6.9GHz, complemented by archival VLBI data; (ii) detecting at radio frequencies the two recently reported circumnuclear SNe 2010cu and 2011hi, which were discovered by near-IR (NIR) adaptive optics observations of IC883; and (iii) further investigating the nature of SN2011hi at NIR by means of observations with Gemini-North. The circumnuclear regions traced by e-MERLIN at 6.9GHz have an extension of ~1kpc, and show a striking double-sided structure, which very likely corresponds to a warped rotating ring,

# Get a couple of examples without fine-tuning

In [None]:
testdata=[]
for i,pc in enumerate(prompt_completion[:19]):
  response = openai.Completion.create(
    model="text-curie-001",
    prompt=pc['prompt'],
    temperature=0.4,
    max_tokens=600
  )
  entry = {'title':cleandata[i]['title'], 'real_abstract':cleandata[i]['abstract'], 'fake_abstract':response['choices'][0]['text']}
  testdata+=[entry]

In [None]:
testdata[10:15]

[{'fake_abstract': '\n\nIn this paper, we study the mean-field theory of baryonic matter in the large $N_{c}$ and heavy quark mass limits. We find that the theory is in good agreement with the latest results from the LHC.',
  'real_abstract': "We discuss theoretical issues pertaining to baryonic matter in the combined heavy-quark and large $N_c$ limits of QCD. Witten's classic argument that baryons and interacting systems of baryons can be described in a mean-field approximation with each of the quarks moving in an average potential due to the remaining quarks is heuristic. It is important to justify this heuristic description for the case of baryonic matter since systems of interacting baryons are intrinsically more complicated than single baryons due to the possibility of hidden color states---states in which the subsystems making up the entire baryon crystal are not color-singlet nucleons but rather colorful states coupled together to make a color-singlet state. In this work, we pro

In [None]:
with open("testdata.json", "w") as write_file:
    json.dump(testdata, write_file, indent=4)

# Fine Tune GPT-3

## Re-create pairs with fine tuning requirements: no instructions included in prompt

In [10]:
prompt_completion=[]
for c in cleandata:
  #avoid double punctuation in prompt
  if c['title'].endswith('.'):
    pr = c['title']
  else:
    pr = c['title'] + '.'
  comp=c['abstract']
  prompt_completion+=[{'prompt':pr,'completion':comp}]
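
The same pair construction can be written as two small functions, which makes it easier to test on a handful of records. This is a sketch rather than a replacement for the cell above: `title_to_prompt` and `make_pairs` are hypothetical names, and skipping records with an empty title or abstract is an added assumption.

```python
def title_to_prompt(title):
    """Strip surrounding whitespace and make sure the prompt ends with a period."""
    title = title.strip()
    return title if title.endswith(".") else title + "."

def make_pairs(records):
    """Build prompt-completion pairs, skipping records with no title or abstract."""
    return [
        {"prompt": title_to_prompt(rec["title"]), "completion": rec["abstract"]}
        for rec in records
        if rec.get("title") and rec.get("abstract")
    ]

# Usage in the notebook:
# prompt_completion = make_pairs(cleandata)
```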

In [11]:
prompt_completion[:3]

[{'completion': 'The high star formation rates of luminous infrared galaxies (LIRGs) make them ideal places for core-collapse supernova (CCSN) searches. At radio frequencies, free from dust extinction, it is possible to detect compact components within the innermost LIRG nuclear regions, such as SNe and SN remnants, as well as AGN buried deep in the LIRG nuclei. We studied the LIRG IC883 aiming at: (i) investigating its (circum-)nuclear regions using the e-EVN at 5GHz, and e-MERLIN at 6.9GHz, complemented by archival VLBI data; (ii) detecting at radio frequencies the two recently reported circumnuclear SNe 2010cu and 2011hi, which were discovered by near-IR (NIR) adaptive optics observations of IC883; and (iii) further investigating the nature of SN2011hi at NIR by means of observations with Gemini-North. The circumnuclear regions traced by e-MERLIN at 6.9GHz have an extension of ~1kpc, and show a striking double-sided structure, which very likely corresponds to a warped rotating ring,

Select 400 random pairs for fine tuning:

In [None]:
train_prompt_completion=sample(prompt_completion,400)
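
`sample` draws a different subset on every run, so the training set changes each time the notebook is re-executed. If you want the selection to be repeatable, a seeded generator is one way to do it. In this sketch `sample_pairs` is a hypothetical helper and the seed value is arbitrary.

```python
import random

def sample_pairs(pairs, k, seed=0):
    """Draw k distinct pairs; the same seed always gives the same selection."""
    rng = random.Random(seed)
    return rng.sample(pairs, k)

# Usage in the notebook:
# train_prompt_completion = sample_pairs(prompt_completion, 400)
```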

In [None]:
train_prompt_completion[:5]

[{'completion': 'Several analyses of the microwave sky maps from the Wilkinson Microwave Anisotropy Probe (WMAP) have drawn attention to alignments amongst the low-order multipoles. Amongst the various possible explanations, an effect of cosmic topology has been invoked by several authors. We focus on an alignment of the first four multipoles (\\ell = 2 to 5) found by Land and Magueijo (2005), and investigate the distribution of their alignment statistic for a set of simulated cosmic microwave background maps for cosmologies with slab-like topology. We find that this topology does offer a modest increase in the probability of the observed value, but that even for the smallest topology considered the probability of the observed value remains below one percent.',
  'prompt': 'Cosmic microwave background multipole alignments in slab topologies.'},
 {'completion': 'We show that recent experiment data for the ratios $E_{1^+}/M_{1^+}$ and $S_{1^+}/M_{1^+}$ can be explained in a dynamical mod

In [None]:
with open("train_prompt_completion.json", "w") as write_file:
    json.dump(train_prompt_completion, write_file, indent=4)

## Use the OpenAI CLI to check the data:

In [None]:
!openai tools fine_tunes.prepare_data -f train_prompt_completion.json

Logging requires wandb to be installed. Run `pip install wandb`.
Analyzing...

- Your file appears to be in a .JSON format. Your file will be converted to JSONL format
- Your file contains 400 prompt-completion pairs
- All prompts end with suffix `.`
- Your data does not contain a common ending at the end of your completions. Having a common ending string appended to the end of the completion makes it clearer to the fine-tuned model where the completion should end. See https://beta.openai.com/docs/guides/fine-tuning/preparing-your-dataset for more detail and examples.
- The completion should start with a whitespace character (` `). This tends to produce better results due to the tokenization we use. See https://beta.openai.com/docs/guides/fine-tuning/preparing-your-dataset for more details

Based on the analysis we will perform the following actions:
- [Necessary] Your format `JSON` will be converted to `JSONL`
- [Recommended] Add a suffix ending ` END` to all completions [Y/n]: y
- [R
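
The two recommendations reported above (a leading space and a common ` END` suffix on every completion) can also be applied by hand. The following is a rough sketch of that idea, not a reproduction of what the CLI tool does: `prepare_pair` and `to_jsonl` are hypothetical helpers, and the output file name in the usage comment is made up.

```python
import json

def prepare_pair(pair, suffix=" END"):
    """Add the leading space and the end marker to one completion."""
    completion = pair["completion"].strip()
    return {"prompt": pair["prompt"], "completion": " " + completion + suffix}

def to_jsonl(pairs, suffix=" END"):
    """Serialize pairs as JSONL: one JSON object per line."""
    return "".join(json.dumps(prepare_pair(p, suffix)) + "\n" for p in pairs)

# Usage in the notebook (hypothetical file name):
# with open("my_prepared_pairs.jsonl", "w") as f:
#     f.write(to_jsonl(train_prompt_completion))
```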

## Fine tune

Use wandb (Weights & Biases) to track performance. This requires a wandb account. The step is optional and can be skipped.

In [15]:
!pip install wandb
!wandb login

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting wandb
  Downloading wandb-0.12.19-py2.py3-none-any.whl (1.8 MB)
Collecting pathtools
  Downloading pathtools-0.1.2.tar.gz (11 kB)
Collecting GitPython>=1.0.0
  Downloading GitPython-3.1.27-py3-none-any.whl (181 kB)
Collecting docker-pycreds>=0.4.0
  Downloading docker_pycreds-0.4.0-py2.py3-none-any.whl (9.0 kB)
Collecting sentry-sdk>=1.0.0
  Downloading sentry_sdk-1.6.0-py2.py3-none-any.whl (145 kB)
Collecting shortuuid>=0.5.0
  Downloading shortuuid-1.0.9-py3-none-any.whl (9.4 kB)
Collecting setproctitle
  Downloading setproctitle-1.2.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (29 kB)
Collecting gitdb<5,>=4.0.1
  Downloading gitdb-4.0.9-

In [16]:
import wandb

In [None]:
run = wandb.init(project='GPT-3 to Generate ArXiv Abstracts')

wandb: Currently logged in as: sarosi. Use `wandb login --relogin` to force relogin


Split the prepared data into a training set (the first 360 lines) and a validation set (the last 40 lines).

In [None]:
!head -n 360 train_prompt_completion_prepared.jsonl > arxiv_train.jsonl
!tail -n 40  train_prompt_completion_prepared.jsonl > arxiv_valid.jsonl
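
The `head`/`tail` split relies on the prepared file having exactly 400 lines. If the preparation step changed the line count, the two files would overlap or skip examples. A Python version of the split, sketched below, avoids that by always holding out the last lines; `split_lines` is a hypothetical helper.

```python
def split_lines(lines, n_valid):
    """Return (train, valid); the last n_valid lines are held out for validation."""
    if not 0 < n_valid < len(lines):
        raise ValueError("n_valid must be between 1 and len(lines) - 1")
    return lines[:-n_valid], lines[-n_valid:]

# Usage in the notebook:
# with open("train_prompt_completion_prepared.jsonl") as f:
#     train, valid = split_lines(f.readlines(), 40)
# with open("arxiv_train.jsonl", "w") as f:
#     f.writelines(train)
# with open("arxiv_valid.jsonl", "w") as f:
#     f.writelines(valid)
```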

Set fine tuning parameters.

In [None]:
model = 'curie'  # can be ada, babbage or curie
n_epochs = 4
batch_size = 4
learning_rate_multiplier = 0.1
prompt_loss_weight = 0.1

Run the fine-tuning itself. It is launched with an API request and runs on OpenAI's servers, so it is fairly fast: the run below finished in about five minutes. It cost \$0.85, well within your \$18 credit.
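
For a rough idea of the cost before launching, you can multiply the number of tokens in the training file by the number of epochs and by the per-token training price. The sketch below assumes that billing works this way and takes the price as an input: `estimate_fine_tune_cost` is a hypothetical helper, and both the token count and the price must come from your own data and from OpenAI's current pricing page.

```python
def estimate_fine_tune_cost(tokens_in_file, n_epochs, price_per_1k_tokens):
    """Rough estimate: every token in the file is trained on once per epoch."""
    trained_tokens = tokens_in_file * n_epochs
    return trained_tokens / 1000 * price_per_1k_tokens

# Usage with made-up numbers (not real prices):
# estimate_fine_tune_cost(tokens_in_file=1000, n_epochs=4, price_per_1k_tokens=0.5)
```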

In [None]:
!openai api fine_tunes.create \
    -t arxiv_train.jsonl \
    -v arxiv_valid.jsonl \
    -m $model \
    --n_epochs $n_epochs \
    --batch_size $batch_size \
    --learning_rate_multiplier $learning_rate_multiplier \
    --prompt_loss_weight $prompt_loss_weight

Upload progress: 100% 331k/331k [00:00<00:00, 430Mit/s]
Uploaded file from arxiv_train.jsonl: file-wPI5nw3Jujt5e8SsLS4YcqNP
Upload progress: 100% 45.2k/45.2k [00:00<00:00, 60.3Mit/s]
Uploaded file from arxiv_valid.jsonl: file-pLHdzL8EZAYcPLMCnP76C8qS
Created fine-tune: ft-uSvPXFOZVvBCeBhGr7VexE3K
Streaming events until fine-tuning is complete...

(Ctrl-C will interrupt the stream, but not cancel the fine-tune)
[2022-06-23 13:45:08] Created fine-tune: ft-uSvPXFOZVvBCeBhGr7VexE3K
[2022-06-23 13:45:30] Fine-tune costs $0.85
[2022-06-23 13:45:30] Fine-tune enqueued. Queue number: 0
[2022-06-23 13:45:32] Fine-tune started
[2022-06-23 13:47:13] Completed epoch 1/4
[2022-06-23 13:48:04] Completed epoch 2/4
[2022-06-23 13:48:56] Completed epoch 3/4
[2022-06-23 13:49:46] Completed epoch 4/4
[2022-06-23 13:50:10] Uploaded model: curie:ft-personal-2022-06-23-13-50-08
[2022-06-23 13:50:43] Uploaded result file: file-j0oycxDZd7sM87GNzxoRGiDo
[2022-0

In [None]:
fine_tuned_model = 'curie:ft-personal-2022-06-23-13-50-08'

Optionally sync performance metrics to wandb.

In [None]:
!openai wandb sync --project "GPT-3 to Generate ArXiv Abstracts"

wandb: Currently logged in as: sarosi. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.12.19
wandb: Run data is saved locally in /content/wandb/run-20220623_135108-ft-uSvPXFOZVvBCeBhGr7VexE3K
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run ft-uSvPXFOZVvBCeBhGr7VexE3K
wandb: ⭐️ View project at https://wandb.ai/sarosi/GPT-3%20to%20Generate%20ArXiv%20Abstracts
wandb: 🚀 View run at https://wandb.ai/sarosi/GPT-3%20to%20Generate%20ArXiv%20Abstracts/runs/ft-uSvPXFOZVvBCeBhGr7VexE3K
File file-wPI5nw3Jujt5e8SsLS4YcqNP could not be retrieved. Make sure you are allowed to download training/validation files
File file-pLHdzL8EZAYcPLMCnP76C8qS could not be retrieved. Make sure you are allowed to download training/validation files
wandb: Waiting for W&B process to finish

# Using the fine tuned model to get predictions

Get a few random title-abstract pairs:

In [23]:
get_prompt_completion=sample(prompt_completion,3)
len(get_prompt_completion)

3

Send them to your newly fine-tuned model for completion. Beware: every request is billed, so sending too many can quickly use up your credit!

In [26]:
data=[]
i=0
for pc in get_prompt_completion:
  print('\r',end='i:{}'.format(i))
  i+=1
  #Get response from API
  response = openai.Completion.create(
    model=fine_tuned_model,
    prompt=pc['prompt'],
    temperature=0.8,
    max_tokens=400
  )
  fake = response['choices'][0]['text']

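  #Remove junk from the end of the response:
  #first trim to the last full sentence (rindex would raise ValueError if there is no period)
  last_period = fake.rfind(".")
  if last_period != -1:
    fake = fake[:last_period+1]
  #then cut everything from the END marker onwards, if the marker is present
  end_marker = fake.find("END")
  if end_marker != -1:
    fake = fake[:end_marker]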

  #Create entry
  entry = {'title':pc['prompt'], 'real_abstract':pc['completion'], 'fake_abstract':fake}
  data+=[entry]

i:2
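
The clean-up step in the loop above can be pulled out into a function so that it can be checked on its own. This sketch is a variant, not the exact logic of the cell: it cuts at the marker first, then trims to the last full sentence, and finally strips surrounding whitespace. `clean_completion` is a hypothetical name, and the plain substring search would also match `END` inside a longer upper-case word.

```python
def clean_completion(text, end_marker="END"):
    """Cut at the end marker if present, then trim to the last full sentence."""
    marker_pos = text.find(end_marker)
    if marker_pos != -1:
        text = text[:marker_pos]
    last_period = text.rfind(".")
    if last_period != -1:
        text = text[:last_period + 1]
    return text.strip()

# Usage in the notebook:
# fake = clean_completion(response['choices'][0]['text'])
```

Another option is to pass `stop=[" END"]` to `openai.Completion.create`, so that generation stops at the marker and fewer tokens are produced.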

In [27]:
data[:3]

[{'fake_abstract': ' We treat a class of massive scalar field Lagrangian four-dimensional spacetimes of gravitational double curvature. The action is chosen as a sum over all configurations of massless particles in the spacetime and its effective action is written in terms of the four-potential and four-magnitude. The attractor is a massless particle with spin and we get its motion in terms of its effective mass and its massless spinor. The massless spinor can be interpreted as the spin vector of a massive particle in a massless spacetime. ',
  'real_abstract': "Earlier we obtained quasi-classical equations of motion of spin 1/2 massless particle in a curved spacetime on base of simple Lagrangian model \\cite{al2}. Now we suggest an approach to derive the equations in framework of field theory. Noether theorem formulated in terms of Cartan' formalism of orthonormal frames gives equations for current of spin of the field and tensor of stress-energy. It is shown that under eikonal approx

Save and download your data.

In [None]:
with open("data.json", "w") as write_file:
    json.dump(data, write_file, indent=4)

In [None]:
from google.colab import files
files.download('data.json')
