# Prompt-based learning with OpenAI's GPT model

In this tutorial, we show how to use GPT-3.5, one of the most advanced large language models (LLMs) available, to help you detect phenotypic abnormalities in clinical notes.

## Install Packages

First, we need to install the OpenAI library in our Colab session.

In [None]:
!pip install openai

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting openai
  Downloading openai-0.27.6-py3-none-any.whl (71 kB)
Collecting aiohttp
  Downloading aiohttp-3.8.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.0 MB)
Collecting multidict<7.0,>=4.5
  Downloading multidict-6.0.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (114 kB)
Collecting frozenlist>=1.1.1
  Downloading frozenlist-1.3.3-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (149 kB)

## Register your OpenAI API tokens

Next, you need to provide your own OpenAI API token to use their cloud services, such as fine-tuning an LLM or running inference with one of their pretrained models.

We will do the latter here. After registering an account, you can follow the instructions on the [OpenAI user page](https://platform.openai.com/account/api-keys) to get an API token.

Once you have a token, please continue the tutorial.

In [None]:
import openai
# Load your API token (replace the placeholder string with your own token)
openai.api_key = "<your-api-token>"
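Hard-coding a token in a notebook makes it easy to leak when the notebook is shared. A minimal sketch of an alternative, assuming the token has already been stored in an environment variable named `OPENAI_API_KEY` (the helper name `load_api_key` is hypothetical, not part of the OpenAI library):

```python
import os


def load_api_key(env=None, var_name="OPENAI_API_KEY"):
    """Return the API token stored in an environment variable.

    `env` defaults to os.environ; passing a plain dict makes the helper easy to test.
    """
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return key


# Usage (assumes the variable is set in your session):
# openai.api_key = load_api_key()
```

The helper fails loudly when the variable is missing, which is usually easier to debug than an authentication error from a later API call.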

To check which models are currently available, you can use the following code.

In [None]:
for candidate in openai.Model.list()['data']:
    print(candidate['id'])
   

# print(openai.Model.list())

babbage
davinci
text-davinci-edit-001
babbage-code-search-code
text-similarity-babbage-001
code-davinci-edit-001
text-davinci-001
ada
babbage-code-search-text
babbage-similarity
code-search-babbage-text-001
text-curie-001
code-search-babbage-code-001
text-ada-001
text-embedding-ada-002
text-similarity-ada-001
gpt-3.5-turbo
curie-instruct-beta
ada-code-search-code
gpt-3.5-turbo-0301
ada-similarity
code-search-ada-text-001
text-search-ada-query-001
davinci-search-document
ada-code-search-text
text-search-ada-doc-001
davinci-instruct-beta
text-similarity-curie-001
code-search-ada-code-001
ada-search-query
text-search-davinci-query-001
curie-search-query
davinci-search-query
babbage-search-document
ada-search-document
text-search-curie-query-001
whisper-1
text-search-babbage-doc-001
curie-search-document
text-davinci-003
text-search-curie-doc-001
babbage-search-query
text-babbage-001
text-search-davinci-doc-001
text-search-babbage-query-001
curie-similarity
curie
text-similarity-davinci-00
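The listing is long, so it can help to filter it. The sketch below keeps only IDs that start with `gpt-`; treating that prefix as the marker of a chat model is an assumption based on the names printed above, and `chat_model_ids` is a hypothetical helper.

```python
def chat_model_ids(model_ids):
    """Keep only IDs that start with 'gpt-' (an assumed naming convention)."""
    return sorted(m for m in model_ids if m.startswith("gpt-"))


# A few IDs copied from the listing above.
sample_ids = ["babbage", "gpt-3.5-turbo", "whisper-1", "gpt-3.5-turbo-0301", "text-davinci-003"]
print(chat_model_ids(sample_ids))
```

With a live session, the same helper could be fed `[m['id'] for m in openai.Model.list()['data']]`.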

Please visit this [page](https://platform.openai.com/docs/models/model-endpoint-compatibility) for details. 

At the time of writing, the most advanced GPT-3.5 model is available for chat completion but not for fine-tuning. Let's start with chat completion.

## Prompt-based learning

First, we need to prepare some learning and testing data. Since we want the model to detect phenotypic abnormalities, we need a few clinical notes and their corresponding phenotypic information in advance.

Let's look at the following example:

**Clinical note 1**:

"
*Thirty six children with typical features of Angelman's syndrome, including global developmental delay, ataxia, episodes of paroxysmal laughter, seizures, and microcephaly were studied. The series included three sibships of three affected sisters, two affected brothers, and two affected sisters, respectively. The facial appearance is characterised by a prominent jaw, a wide mouth, and a pointed chin. Tongue thrusting is common. The movement disorder consists of a wide based, ataxic gait with frequent jerky limb movements and flapping of the hands.*
"


**Clinical note 2**:

"*We describe the clinical findings of 15 individuals in a large kindred affected with distal arthrogryposis type 1A (DA1A). The most consistent findings among individuals were overlapping fingers at birth, abnormal digital flexion creases, and foot deformities, including talipes equinovarus and vertical talus. There was marked intrafamilial variation in the expression of DA1A. Linkage mapping of the locus for DA1A suggests that the use of strict diagnostic criteria excludes unaffected individuals rigorously, but can produce incomplete ascertainment of affected individuals. In the context of an affected family, the range of phenotypes consistent with a diagnosis of DA1A needs to be expanded.*"

### Zero-shot learning

In zero-shot learning, we give the model no examples of what we expect and simply ask it to perform the task. We can run the following code, using clinical note 1 as the input.

In [None]:
completion = openai.ChatCompletion.create(
  model="gpt-3.5-turbo", 
  messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "please identify human phenotype ontology for me"},
        {"role": "user", "content": "Thirty six children with typical features of Angelman's syndrome, including global developmental delay, ataxia, episodes of paroxysmal laughter, seizures, and microcephaly were studied. The series included three sibships of three affected sisters, two affected brothers, and two affected sisters, respectively. The facial appearance is characterised by a prominent jaw, a wide mouth, and a pointed chin. Tongue thrusting is common. The movement disorder consists of a wide based, ataxic gait with frequent jerky limb movements and flapping of the hands."},
        ]
)

In [None]:
print(completion['choices'][0]['message']['content'])

The human phenotype ontology (HPO) terms that could be related to the description of the phenotypic features observed in this case report include: 

- HP:0001249 Global developmental delay
- HP:0002078 Microcephaly
- HP:0000750 Ataxia
- HP:0002141 Paroxysmal bursts of laughter
- HP:0001250 Seizures
- HP:0000202 Prominent jaw
- HP:0000154 Wide mouth
- HP:0000307 Pointed chin
- HP:0000180 Tongue protrusion
- HP:0001288 Gait ataxia
- HP:0003487 Limb tremor
- HP:0001252 Flapping tremor of hands 

It's important to note that this is not an exhaustive list and other HPO terms may also be relevant based on the specific clinical findings of each patient.


Here, the *system* message gives the *assistant* its identity:

"*You are a helpful assistant.*"

The first *user* message then gives the instruction:

"*Please identify human phenotype ontology for me*"

Finally, a second *user* message supplies the clinical note itself, as above. That's all it takes!
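To reuse the zero-shot request for other notes, the message list can be assembled by a small helper. This is only a sketch: `build_zero_shot_messages` is a hypothetical name, and the default instruction is copied from the call above.

```python
def build_zero_shot_messages(note, instruction="please identify human phenotype ontology for me"):
    """Assemble the chat messages for a zero-shot request about one clinical note."""
    return [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": instruction},
        {"role": "user", "content": note},
    ]


messages = build_zero_shot_messages("Tongue thrusting is common.")
for m in messages:
    print(m["role"], "->", m["content"])
```

The returned list can be passed as the `messages` argument of `openai.ChatCompletion.create`.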

### One-shot learning

If we want the "assistant" to align better with what we have in mind (content, format, etc.), we can also perform one-shot learning by providing the expected outcome. All we need is at least one example (a clinical note) with its labeling information.

In our case, we should tell the assistant what we expect to see. We use the first clinical note, together with its labeled HPO terms, as the example, and then ask for a prediction on the second clinical note.

For the second clinical note, the correct output of all phenotypic abnormalities is:


"I have found HPO terms:<br>
overlapping fingers | HP_0001177<br>
abnormal digital flexion creases | HP_0006143<br>
foot deformities | HP_0001760<br>
talipes equinovarus | HP_0001762<br>
vertical talus | HP_0001858<br>
intrafamilial variation | HP_0003828<br>
ascertainment bias | HP_0045088<br>"


We provide the labeled example as a *user* message (the first clinical note) followed by an *assistant* message (its expected output in the same format), so that the model can see what is expected. Immediately after this pair, we give the second clinical note for prediction.

In [None]:
new_completion = openai.ChatCompletion.create(
  model="gpt-3.5-turbo", 
  messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "please identify human phenotype ontology for me"},
        {"role": "user", "content": "Thirty six children with typical features of Angelman's syndrome, including global developmental delay, ataxia, episodes of paroxysmal laughter, seizures, and microcephaly were studied. The series included three sibships of three affected sisters, two affected brothers, and two affected sisters, respectively. The facial appearance is characterised by a prominent jaw, a wide mouth, and a pointed chin. Tongue thrusting is common. The movement disorder consists of a wide based, ataxic gait with frequent jerky limb movements and flapping of the hands."},
        {"role": "assistant", "content": "I have found HPO terms:\nglobal developmental delay | HP_0001263\nataxia | HP_0001251\nepisodes of paroxysmal laughter | HP_0000749\nlaughter | HP_0000748\nseizures | HP_0001250\nmicrocephaly | HP_0000252\nprominent jaw | HP_0002051\nwide mouth | HP_0000154\npointed chin | HP_0000307\nTongue thrusting | HP_0000182\nmovement disorder | HP_0100022\nwide based, ataxic gait | HP_0002136\nataxic gait | HP_0002066\njerky limb movements | HP_0002276\nflapping of the hands | HP_0100023"},
        {"role": "user", "content": "We describe the clinical findings of 15 individuals in a large kindred affected with distal arthrogryposis type 1A (DA1A). The most consistent findings among individuals were overlapping fingers at birth, abnormal digital flexion creases, and foot deformities, including talipes equinovarus and vertical talus. There was marked intrafamilial variation in the expression of DA1A. Linkage mapping of the locus for DA1A suggests that the use of strict diagnostic criteria excludes unaffected individuals rigorously, but can produce incomplete ascertainment of affected individuals. In the context of an affected family, the range of phenotypes consistent with a diagnosis of DA1A needs to be expanded."}
    ]
)
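The same pattern extends to several labeled examples (few-shot prompting): each example becomes a *user* message followed by an *assistant* message, and the note to be labeled comes last. A minimal sketch, where `build_few_shot_messages` is a hypothetical helper and the short strings are toy placeholders rather than real clinical notes:

```python
def build_few_shot_messages(examples, query_note,
                            instruction="please identify human phenotype ontology for me"):
    """Build chat messages from (note, labeled_output) pairs plus one query note."""
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": instruction},
    ]
    for note, labeled_output in examples:
        # One user/assistant pair per labeled example.
        messages.append({"role": "user", "content": note})
        messages.append({"role": "assistant", "content": labeled_output})
    # The final user message is the note we want the model to label.
    messages.append({"role": "user", "content": query_note})
    return messages


toy_examples = [("note A", "I have found HPO terms:\nataxia | HP_0001251")]
few_shot = build_few_shot_messages(toy_examples, "note B")
print([m["role"] for m in few_shot])
```

With a single pair in `examples`, this reproduces the message layout of the one-shot call above.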


In [None]:
print(new_completion['choices'][0]['message']['content'])

I have found HPO terms:
overlapping fingers | HP_0001177
abnormal digital flexion creases | HP_0006143
talipes equinovarus | HP_0001762
vertical talus | HP_0001845
marked intrafamilial variation | HP_0003827
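To see how close this prediction is to the reference terms listed before the one-shot call, we can compare the two sets of HPO IDs. The sketch below uses exact ID matching, which is a deliberately strict assumption (it ignores parent/child relations in the ontology); `parse_hpo_lines` is a hypothetical helper.

```python
def parse_hpo_lines(text):
    """Extract HPO IDs from lines of the form 'mention | HP_0001177'."""
    ids = set()
    for line in text.splitlines():
        if "|" not in line:
            continue  # skip the header line and blank lines
        ids.add(line.rsplit("|", 1)[1].strip())
    return ids


# Model output copied from the cell above.
predicted = parse_hpo_lines("""I have found HPO terms:
overlapping fingers | HP_0001177
abnormal digital flexion creases | HP_0006143
talipes equinovarus | HP_0001762
vertical talus | HP_0001845
marked intrafamilial variation | HP_0003827""")

# Reference IDs for clinical note 2, copied from the list given earlier.
expected = {"HP_0001177", "HP_0006143", "HP_0001760", "HP_0001762",
            "HP_0001858", "HP_0003828", "HP_0045088"}

true_positives = predicted & expected
precision = len(true_positives) / len(predicted)
recall = len(true_positives) / len(expected)
print(sorted(true_positives))
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Three of the five predicted IDs match the reference exactly, giving a precision of 0.60 and a recall of 0.43 under this strict criterion.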


# Fine-tuning with OpenAI API

Some of you may not be satisfied with borrowing a general-purpose pretrained language model for such a specific task, and may want to fine-tune it a little to boost the results.

This is feasible with OpenAI's fine-tuning command-line tool.

## Prepare the fine-tuning dataset

We should be **very careful** about which data we use:

1.   A bad or low-quality dataset will compromise the model and may even destroy its original language-modeling capability;
2.   Sensitive datasets need to be examined before they are uploaded to a third-party platform.



Considering these issues, we will use the BiolarkGSC+ public dataset, which contains around 200 clinical texts and their corresponding phenotypic information.

You can download it from this [webpage](https://data.mendeley.com/datasets/v4t59p8w4z/2): get the file called "biolarkgsc_locs.csv", or run the following cell.

In [None]:
!wget https://data.mendeley.com/public-files/datasets/v4t59p8w4z/files/6c424d2f-3178-441e-bbb6-f533ab2f7350/file_downloaded


--2023-05-06 16:21:51--  https://data.mendeley.com/public-files/datasets/v4t59p8w4z/files/6c424d2f-3178-441e-bbb6-f533ab2f7350/file_downloaded
Resolving data.mendeley.com (data.mendeley.com)... 162.159.133.86, 162.159.130.86
Connecting to data.mendeley.com (data.mendeley.com)|162.159.133.86|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://prod-dcd-datasets-public-files-eu-west-1.s3.eu-west-1.amazonaws.com/e86ecbae-ca83-4736-9bf5-6e597a856185 [following]
--2023-05-06 16:21:52--  https://prod-dcd-datasets-public-files-eu-west-1.s3.eu-west-1.amazonaws.com/e86ecbae-ca83-4736-9bf5-6e597a856185
Resolving prod-dcd-datasets-public-files-eu-west-1.s3.eu-west-1.amazonaws.com (prod-dcd-datasets-public-files-eu-west-1.s3.eu-west-1.amazonaws.com)... 52.92.33.42, 52.218.118.82, 52.218.116.42, ...
Connecting to prod-dcd-datasets-public-files-eu-west-1.s3.eu-west-1.amazonaws.com (prod-dcd-datasets-public-files-eu-west-1.s3.eu-west-1.amazonaws.com)|52.92.33.42|:443

In [None]:
import os

old_name = r"file_downloaded"
new_name = r"biolarkgsc_locs"
os.rename(old_name, new_name)

Let's examine the dataset a bit.

In [None]:
import pandas as pd

df = pd.read_csv('./biolarkgsc_locs', delimiter='\t')
df.head()

Unnamed: 0,id,text,labels
0,1003450,A syndrome of brachydactyly (absence of some m...,HP_0001156|14:27;HP_0009881|29:71;HP_0001798|7...
1,10051003,Townes-Brocks syndrome (TBS) is an autosomal d...,HP_0000006|35:62;HP_0000006|35:53;HP_0000006|4...
2,10066029,Nevoid basal cell carcinoma syndrome (NBCCS) i...,HP_0002671|7:27;HP_0000006|89:107;HP_0000006|8...
3,10196695,Angelman syndrome (AS) is a neurodevelopmental...,HP_0000707|28:55;HP_0001466|839:863
4,10417280,Prader-Willi syndrome (PWS) and Angelman syndr...,HP_0000708|68:93;HP_0003745|223:230
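Each entry in the `labels` column packs several annotations into one string: annotations are separated by `;`, and each one has the form `HPO_ID|start:end`. The sketch below unpacks one such string; reading `start` and `end` as character offsets into `text` is an assumption, and `parse_labels` is a hypothetical helper.

```python
def parse_labels(labels):
    """Split 'HP_0000707|28:55;HP_0001466|839:863' into (hpo_id, start, end) tuples."""
    parsed = []
    for item in labels.split(";"):
        hpo_id, span = item.split("|")
        start, end = span.split(":")
        parsed.append((hpo_id, int(start), int(end)))
    return parsed


# Label string copied from row 3 of the table above.
print(parse_labels("HP_0000707|28:55;HP_0001466|839:863"))
```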


In [None]:
len(df)

228

## Dataset cleaning

We can divide this dataset into two portions: one for fine-tuning and the other for validation.

In [None]:
from sklearn.model_selection import train_test_split
train_df, test_df = train_test_split(df, test_size=0.1, random_state=42)
len(train_df), len(test_df)

(205, 23)

Next, we should process both the training and the testing datasets so that they meet the requirements of the fine-tuning API, as described in OpenAI's [documentation](https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset).

In [None]:
def preprocess(df):
    """Turn each row of the dataset into a prompt/completion pair for fine-tuning."""
    rows = []
    for i, row in df.iterrows():
        # Optionally skip bad examples in train_df:
        # if i in {160,7,23,76}:
        #     continue
        # Each label looks like "HP_0001156|14:27": an HPO ID, then start:end offsets into the text.
        list_hpo_pair = []
        for label in row.labels.split(';'):
            parts = label.split('|')
            list_hpo_pair.append([parts[0], parts[1].split(':')])
        ans = ""
        hpos = set()
        for hpo, [s, e] in list_hpo_pair:
            # Keep only the first mention of each HPO term.
            if hpo in hpos:
                continue
            ans += row.text[int(s):int(e)] + ' | ' + hpo + '\n'
            hpos.add(hpo)
        ans += 'END'
        # The prompt ends with the separator "\n\n###\n\n"; the completion starts with a space and ends with "END".
        rows.append({"prompt": f"{row.text}\n\n###\n\n", "completion": f" {ans}"})
    return pd.DataFrame(rows)

In [None]:
ft = preprocess(test_df)
ft.to_json('test.jsonl', orient='records', lines=True)

In [None]:
ft = preprocess(train_df)
ft.to_json('train.jsonl', orient='records', lines=True)

In [None]:
ft

Unnamed: 0,prompt,completion
0,Familial Angelman syndrome (AS) can result fro...,phenotypic abnormality | HP_0000118\ndominant...
1,The nevoid basal-cell carcinoma syndrome is ch...,basal-cell carcinoma | HP_0002671\ncysts of t...
2,Hereditary isolated brachydactyly type C (OMIM...,brachydactyly | HP_0001156\nbrachydactyly typ...
3,The results of a systematic study of the otolo...,earpit | HP_0004467\ndeafness | HP_0000404\nm...
4,"Six patients, including two sibs, with Angelma...",sporadic | HP_0003745\nheterogeneity | HP_000...
...,...,...
200,This study of 47 patients from 11 families wit...,neurofibromatosis | HP_0006746\nposterior cap...
201,Angelman syndrome is a neuro-developmental dis...,neuro-developmental disorder | HP_0012759\nEND
202,Neurofibromatosis type 2 is an autosomal-domin...,Neurofibromatosis | HP_0006746\nautosomal-dom...
203,"We present 3 individuals, a mother, her son, a...",unilateral renal agenesis | HP_0000122\nrenal...


Refresh the file browser and you should see two new jsonl files ready to use!

Just in case, OpenAI provides a handy (and free) command-line tool for checking and correcting the format.

In [None]:
!openai tools fine_tunes.prepare_data -f train.jsonl

Analyzing...

- Your file contains 205 prompt-completion pairs
- All prompts end with suffix `.\n\n###\n\n`
- All completions end with suffix `\nEND`

No remediations found.

You can use your file for fine-tuning:
> openai api fine_tunes.create -t "train.jsonl"

After you’ve fine-tuned a model, remember that your prompt has to end with the indicator string `.\n\n###\n\n` for the model to start generating completions, rather than continuing with the prompt. Make sure to include `stop=["\nEND"]` so that the generated texts ends at the expected place.
Once your model starts training, it'll approximately take 5.26 minutes to train a `curie` model, and less for `ada` and `babbage`. Queue will approximately take half an hour per job ahead of you.
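The two conventions reported by the tool (every prompt ends with the `\n\n###\n\n` separator, every completion ends with `\nEND`) can also be checked locally before uploading. This is a minimal sketch, not a replacement for the official tool: `check_records` is a hypothetical helper, the leading-space check mirrors the `preprocess` function above, and the two records are toy examples.

```python
import json

SEPARATOR = "\n\n###\n\n"
STOP = "\nEND"


def check_records(lines):
    """Return the 0-based indices of JSONL lines that break the expected format."""
    bad = []
    for i, line in enumerate(lines):
        record = json.loads(line)
        ok = (
            record["prompt"].endswith(SEPARATOR)
            and record["completion"].startswith(" ")
            and record["completion"].endswith(STOP)
        )
        if not ok:
            bad.append(i)
    return bad


# Toy records: the first follows the conventions, the second does not.
good = json.dumps({"prompt": "Some note." + SEPARATOR, "completion": " ataxia | HP_0001251" + STOP})
broken = json.dumps({"prompt": "Some note.", "completion": "ataxia | HP_0001251"})
print(check_records([good, broken]))
```

To check a real file, pass an open file handle, for example `check_records(open('train.jsonl'))`, since iterating over a file yields one JSON line at a time.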


Now we are ready to fine-tune our own GPT model to recognize HPO terms based on `train.jsonl` dataset.

Note that, at the time of writing, fine-tuning is only available for the following base models: `davinci`, `curie`, `babbage`, and `ada`.

In [None]:
%%bash
export OPENAI_API_KEY="<your-api-key>"
openai api fine_tunes.create -t ./train.jsonl -v ./test.jsonl --batch_size 16 -m davinci --suffix phenogpt --n_epochs 12 --learning_rate_multiplier 0.1

Found potentially duplicated files with name 'train.jsonl', purpose 'fine-tune' and size 266305 bytes
file-3oswPITVAaQYeLtmzitGez9Y
Enter file ID to reuse an already uploaded file, or an empty string to upload this file anyway: Uploaded file from ./train.jsonl: file-hbDe2rPZBUI7mLPCx3WJwvIG
Found potentially duplicated files with name 'test.jsonl', purpose 'fine-tune' and size 31331 bytes
file-iic4DKLeem4RsosOEmHwmA9T
file-MpvcteVAPeVADrlGKEKeXF2h
file-XhiEnBOYQl4Z0FVjqkbruZD7
file-TnpeUdp05EgBugrwCbz9axEv
file-1xwMn7poXBvAklpsLzKYKBFE
file-GRvG3W0AY7etf7GkmjCe0bBR
file-GOS25ykHigJcWOjj3EhcH0Nf
file-PzWPdUwKvhxD5IklqSlYuSmr
Enter file ID to reuse an already uploaded file, or an empty string to upload this file anyway: Uploaded file from ./test.jsonl: file-gKUwYWkQ7B3TcA2KIm1l3K2A
Created fine-tune: ft-fKhpCacvqI38am5vtW5AB0mk
Streaming events until fine-tuning is complete...

(Ctrl-C will interrupt the stream, but not cancel the fine-tune)
[2023-05-06 16:32:38] Created fine-tune: ft-fK

Upload progress:   0%|          | 0.00/266k [00:00<?, ?it/s]Upload progress: 100%|██████████| 266k/266k [00:00<00:00, 328Mit/s]
Upload progress:   0%|          | 0.00/31.3k [00:00<?, ?it/s]Upload progress: 100%|██████████| 31.3k/31.3k [00:00<00:00, 61.4Mit/s]


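If the event stream is interrupted (this does not cancel the fine-tune itself), you can use the following cell to resume following the job. Replace the job ID with the one printed when your fine-tune was created.

In [None]:
%%bash
export OPENAI_API_KEY="<your-api-token>"
openai api fine_tunes.follow -i ft-fKhpCacvqI38am5vtW5AB0mk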

[2023-05-06 16:32:38] Created fine-tune: ft-fKhpCacvqI38am5vtW5AB0mk
[2023-05-06 16:32:43] Fine-tune costs $22.72
[2023-05-06 16:32:43] Fine-tune enqueued. Queue number: 0
[2023-05-06 16:42:45] Fine-tune started
[2023-05-06 16:45:00] Completed epoch 1/12
[2023-05-06 16:45:39] Completed epoch 2/12
[2023-05-06 16:46:16] Completed epoch 3/12
[2023-05-06 16:46:56] Completed epoch 4/12
[2023-05-06 16:47:35] Completed epoch 5/12



In [None]:
# You can also monitor your fine-tune jobs' list.
openai.FineTune.list()

<OpenAIObject list at 0x7ff9abaede90> JSON: {
  "data": [
    {
      "created_at": 1679256505,
      "fine_tuned_model": "davinci:ft-personal:phenogpt-2023-03-19-20-21-53",
      "hyperparams": {
        "batch_size": 16,
        "learning_rate_multiplier": 0.2,
        "n_epochs": 4,
        "prompt_loss_weight": 0.01
      },
      "id": "ft-tbnuwkw0jWUFsN2zCaPWSwjg",
      "model": "davinci",
      "object": "fine-tune",
      "organization_id": "org-V4oAQ7pfpLNhWU0v4hORFv9P",
      "result_files": [
        {
          "bytes": 3556,
          "created_at": 1679257314,
          "filename": "compiled_results.csv",
          "id": "file-nAyVYA1Io0OZFkP1kqrX8ezP",
          "object": "file",
          "purpose": "fine-tune-results",
          "status": "processed",
          "status_details": null
        }
      ],
      "status": "succeeded",
      "training_files": [
        {
          "bytes": 257960,
          "created_at": 1679256504,
          "filename": "./train_prepared.j

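After it's done, it will return a model name like `<model_name>:ft-personal:<suffix>-<date>`.
Then you can predict using your own model! Because the fine-tuned model is built on a completion model rather than a chat model, call `openai.Completion.create` with your model name instead of `openai.ChatCompletion.create` with `gpt-3.5-turbo`, and end the prompt with the same `\n\n###\n\n` separator used in training.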
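If you have several fine-tune jobs, the model names can be pulled out of the job listing instead of being copied by hand. A sketch under two assumptions: the listing supports dictionary-style access, and it has the `data`, `status`, and `fine_tuned_model` fields shown in the output above. `succeeded_models` is a hypothetical helper, and the second job in the sample is made up for illustration.

```python
def succeeded_models(listing):
    """Return the fine_tuned_model names of jobs whose status is 'succeeded'."""
    return [
        job["fine_tuned_model"]
        for job in listing.get("data", [])
        if job.get("status") == "succeeded" and job.get("fine_tuned_model")
    ]


# Trimmed-down stand-in for the listing; the pending job is hypothetical.
sample_listing = {"data": [
    {"id": "ft-tbnuwkw0jWUFsN2zCaPWSwjg", "status": "succeeded",
     "fine_tuned_model": "davinci:ft-personal:phenogpt-2023-03-19-20-21-53"},
    {"id": "ft-example-pending", "status": "pending", "fine_tuned_model": None},
]}
print(succeeded_models(sample_listing))
```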

In [None]:
prompt = "We describe the clinical findings of 15 individuals in a large kindred affected with distal arthrogryposis type 1A (DA1A). The most consistent findings among individuals were overlapping fingers at birth, abnormal digital flexion creases, and foot deformities, including talipes equinovarus and vertical talus. There was marked intrafamilial variation in the expression of DA1A. Linkage mapping of the locus for DA1A suggests that the use of strict diagnostic criteria excludes unaffected individuals rigorously, but can produce incomplete ascertainment of affected individuals. In the context of an affected family, the range of phenotypes consistent with a diagnosis of DA1A needs to be expanded.\n\n###\n\n"
openai.Completion.create(
    model="your-model-name",
    prompt=prompt,
    max_tokens = 1000,
    stop='\nEND')
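Two small helpers can make the prediction step less error-prone: one appends the training separator to a note, the other turns the returned text into (mention, HPO ID) pairs. This is a sketch with hypothetical helper names; it assumes the generated text sits under `choices[0]['text']` in the response, and `fake_response` is a hand-written stand-in, not real model output.

```python
SEPARATOR = "\n\n###\n\n"


def make_prompt(note):
    """Append the training separator unless the note already ends with it."""
    if note.endswith(SEPARATOR):
        return note
    return note.rstrip() + SEPARATOR


def completion_pairs(response):
    """Turn the first choice's text into (mention, hpo_id) pairs."""
    text = response["choices"][0]["text"]
    pairs = []
    for line in text.strip().splitlines():
        if "|" in line:
            mention, hpo_id = line.rsplit("|", 1)
            pairs.append((mention.strip(), hpo_id.strip()))
    return pairs


# Hand-written stand-in for a completion response.
fake_response = {"choices": [{"text": " overlapping fingers | HP_0001177\ntalipes equinovarus | HP_0001762"}]}
print(repr(make_prompt("Tongue thrusting is common.")))
print(completion_pairs(fake_response))
```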

More about how to fine-tune using OpenAI API can be found [here](https://platform.openai.com/docs/guides/fine-tuning). Please enjoy!