# OpenAI API Augmentation Experimentation Notebook

In [1]:
%load_ext autoreload
%autoreload 2

## Usage

To generate OpenAI API completions, execute the following three steps **in order**:
1. [RUN ONLY THE FIRST HALF OF THIS NOTEBOOK (Step 1), with `GENERATE_BATCHES` set to `True`]
  Use the first half of this notebook to generate batches of requests in `jsonl` format
2. [RUN `data/text_augmentation/openai_api_request_parallel_processor.py`]
  Use the (**adapted**) OpenAI API parallel processor script to efficiently generate responses for the batched requests.
  **Run this as an independent slurm job with a LONG TIME ALLOWANCE so it doesn't time out**
  See the SETUP VARIABLES cell to see how to configure the script
  

In [None]:
import pandas as pd
import json
import os

from madrigal.utils import DATA_DIR

**SETUP VARIABLES - USE CAUTION TO AVOID OVERWRITING EXISTING REQUESTS/RESULTS FILES!**

`GENERATE_BATCHES`: 
- set this to `True` when you want to generate a new batch of requests
- set this to `False` if you **don't** want to generate any new requests (so previous requests will not be overwritten)

`FILENAME`: 
- this is used in Step 1
- in this script, this is the **destination** path for the `jsonl` file containing the generated **requests**
  - hence, this should be the same as the `--requests_filepath` parameter in `openai_api_request_parallel_processor.py`

`RESULTS_FILENAME`: 
- this is used in Step 2
- in this script, this is the **source** path for the **responses** (also in `jsonl` format) that were generated by the `openai_api_request_parallel_processor.py` program
  - hence, this should be the same as the `--save_filepath` parameter in `openai_api_request_parallel_processor.py`

Notes:
- TODO: I have been storing requests & results in `cached_data`, but we must make sure to keep the results files safe so that we can store full generations and request information (stored by default) for future publishing as a dataset.
- I suggest ensuring that `FILENAME` and `RESULTS_FILENAME` share the same (descriptive) name root (e.g. `text_aug_go_v0.0_requests.jsonl` and `text_aug_go_v0.0_results.jsonl`).
  - TODO: improve code to enforce this.
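
One way to act on the shared-name-root suggestion above is to derive both paths from a single root string. The sketch below is only an illustration: `make_paths` and `build_processor_command` are hypothetical helpers, not part of this notebook or of the processor script, and the only command-line flags used are the two named above (`--requests_filepath` and `--save_filepath`). Any other options the adapted script needs are left out, and the command is built but not executed.

```python
import os
import sys


def make_paths(name_root, directory="."):
    """Hypothetical helper: derive the requests and results paths from one name root."""
    requests_path = os.path.join(directory, f"{name_root}_requests.jsonl")
    results_path = os.path.join(directory, f"{name_root}_results.jsonl")
    return requests_path, results_path


def build_processor_command(script_path, requests_path, results_path):
    """Build (but do not run) the command line for the parallel processor script."""
    return [
        sys.executable,
        script_path,
        "--requests_filepath", requests_path,
        "--save_filepath", results_path,
    ]


requests_path, results_path = make_paths("text_aug_go_v0.0")
command = build_processor_command(
    "data/text_augmentation/openai_api_request_parallel_processor.py",
    requests_path,
    results_path,
)
print(" ".join(command))
```

Deriving both names from one root means a requests file and its results file cannot drift apart by accident.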

In [None]:
GENERATE_BATCHES = True # or just process if False

# Paths to save requests
FILENAME = "../api_requests_multi_new.jsonl"

## Step 1 - GENERATE REQUESTS

In [None]:
# Load data
description_df = pd.read_csv(os.path.join(DATA_DIR, 'polypharmacy_new/DrugBank/drugbank_ddi_directed_final_cleaned.tsv'), index_col=0)

In [110]:
unique_ddis = description_df['generalized_description'].unique()

In [111]:
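class DDITemplateEngine:
    # TODO: assert these strings do not exist in data
    # Maps each template placeholder to the data field it stands for
    MAPPING = {
        '$DESC': 'generalized_description',
    }

    def __init__(self, ddi_detail):
        # ddi_detail: sequence of description strings (the unique `generalized_description` values)
        self.ddi_detail = ddi_detail

    def generate_prompt(self, ddi_idx, template):
        # Each entry is already the description text, so it is substituted directly
        description = self.ddi_detail[ddi_idx]
        for placeholder in self.MAPPING:
            if placeholder in template:
                template = template.replace(placeholder, description)
        return template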
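
The TODO about placeholder strings could be handled with a small check before any prompts are generated. This is a minimal sketch, assuming the data is a sequence of plain strings; `find_placeholder_collisions` is a hypothetical helper, and the second example description is made up for illustration. In the notebook the inputs would be `unique_ddis` and the keys of `MAPPING`.

```python
def find_placeholder_collisions(descriptions, placeholders):
    """Return (placeholder, description) pairs where a placeholder occurs in the data."""
    return [
        (placeholder, description)
        for description in descriptions
        for placeholder in placeholders
        if placeholder in description
    ]


# Illustrative inputs only (the second description is invented for this example).
example_descriptions = [
    "The risk or severity of bleeding and bruising can be increased when @Drug1 is combined with @Drug2.",
    "A made-up description mentioning @Drug1 and @Drug2.",
]
example_placeholders = ["$DESC"]

collisions = find_placeholder_collisions(example_descriptions, example_placeholders)
assert not collisions, f"Placeholder strings found in data: {collisions}"
```

If a placeholder did occur in a description, a later `str.replace` pass could substitute text inside already-inserted data, so failing early is the safer choice.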

In [112]:
# TODO: extend to other datasets and to support task definition rephrasing
template_engine = DDITemplateEngine(unique_ddis)

In [113]:
# API args
# Commented args are not supported on chat-completion models (only supported on completion model endpoints which WE SHOULD NOT USE)
# See https://platform.openai.com/docs/api-reference/chat/create

# TODO: adjust these as necessary - find some good defaults and stick to them 
# Note: we may want to consider using nucleus sampling (e.g. top_p = 0.99) to limit "creativity" (as used in Honovich et al 2022 (Unnatural Instructions) when generating prompts)
API_ARGS = {
    # "suffix": None,
    "temperature": 1,
    "max_tokens": 1500,
    "top_p": 1,
    "n": 1,
    "stop": None,
    "presence_penalty": 0,
    "frequency_penalty": 0,
    # "best_of": 1,
    # "logit_bias": None
}

In [117]:
extend = (
    'Your task is to take original descriptions of drug-drug interactions and augment them. Make sure the information you provide is correct and valid, and the extended text is two sentences long. To ensure diversity in these descriptions, respond as each of the following: medical doctors (physician), pharmacologist and toxicologist. \n Medical doctors: Especially those specializing in fields such as pharmacology, internal medicine, and clinical pharmacology. Hence, they understand the symptoms and organ systems affected due to the drug-drug interaction. \n Pharmacologists: Scientists who study the effects of drugs on biological systems. They understand the genetic and molecular mechanism behind these interactions. \n Toxicologists: Scientists who typically works with chemicals and other substances to determine if they are toxic, or harmful, to humans and other living organisms or the environment. \n Your job is to generate nine versions of the original description (given below after ‘Original description’) - three for each profession, and at the correct level of understanding and expertise. Your responses for the scientists should be diverse in terms of style, length, and language used. Each of the scientist responses must be different in terms of the level of vocabulary used, sentence structure, and word choice. None of the nine responses can be the same or have any repeating sentences. Please also make sure you do not make any assumptions about @Drug1 and @Drug2. \n An example output is given for you after ‘Example’. The description to augment is given after ‘Task’. \n Example: \n Description: The risk or severity of bleeding and bruising can be increased when @Drug1 is combined with @Drug2. \n Output: {"medical_doctor_1": "When @Drug1 is combined with @Drug2, there is an elevated risk of experiencing increased bleeding and bruising, a concern particularly for patients with pre-existing bleeding disorders or those on anticoagulant therapy. '
    'These patients should be closely monitored for any signs of abnormal bleeding and adjustments to therapy may be necessary.", "medical_doctor_2": "The concurrent use of @Drug1 with @Drug2 can lead to a heightened risk or severity of bleeding and bruising, necessitating careful consideration in patients undergoing surgical procedures or those with a history of bleeding problems. Its vital to evaluate the benefits and risks of continuing this drug combination in such contexts.", "medical_doctor_3": "Combining @Drug1 with @Drug2 can significantly increase the likelihood of bleeding and bruising, an issue of particular concern for individuals with liver impairment or those consuming alcohol regularly. These patients may require dose adjustments or more frequent monitoring to mitigate the risks.", "pharmacologist_1": "The interaction between @Drug1 and @Drug2 can escalate the risk of bleeding and bruising due to their synergistic effects on inhibiting platelet aggregation or affecting coagulation pathways. Understanding the molecular basis of this interaction is crucial for predicting patient responses and managing therapy effectively.", "pharmacologist_2": "When @Drug1 and @Drug2 are co-administered, they may interfere with each others pharmacokinetic or pharmacodynamic profiles, leading to an increased risk or severity of bleeding and bruising. This interaction suggests a complex interplay at the cellular level, possibly involving the modulation of enzymes or receptors associated with hemostasis.", "pharmacologist_3": "The combination of @Drug1 and @Drug2 can intensify bleeding and bruising risks, likely through a mechanism involving the alteration of drug metabolism or the exacerbation of anticoagulant effects. '
    'Pharmacologists should explore the specific pathways affected to tailor safer therapeutic strategies.", "toxicologist_1": "The interaction between @Drug1 and @Drug2 may elevate the risk of bleeding and bruising by enhancing the toxic effects on the bodys hemostatic system, potentially leading to hazardous conditions such as uncontrolled bleeding. Toxicologists should assess the cumulative impact of these drugs to ensure patient safety.", "toxicologist_2": "Co-administration of @Drug1 with @Drug2 can lead to increased toxicity, manifesting as heightened bleeding and bruising, by overburdening the bodys natural mechanisms for blood clotting. This interaction underscores the importance of toxicological evaluation for identifying potential adverse effects on the hemostatic process.", "toxicologist_3": "When @Drug1 and @Drug2 are used together, their combined toxic effects can compromise the integrity of the vascular and coagulation systems, increasing the risk of bleeding and bruising. This interaction highlights the need for ongoing monitoring and possibly adjusting dosages to mitigate the toxicological risks involved.”}\n Task: \n Description: $DESC \nOutput:'
)

In [118]:
# System message and model for API

# TODO: experiment further with different system messages
SYSTEM = 'You respond with technical, scientific definitions.'

# Always use chat_complete: the completion endpoints are not instruction-tuned, so they don't work as well
TASK_TYPE = 'chat_complete'

# Use a **specific** model checkpoint (for reproducibility)
MODEL = 'gpt-4-0125-preview'

# Templates to use (with corresponding template IDs)
# A request will be generated FOR EACH UNCOMMENTED TEMPLATE
# NOTE: if TEMPLATE IS CHANGED, A NEW ID MUST BE USED!! 
# TODO: replace with actual prompts (which use pre-prepared compositions) and extend this logic (and/or the template_engine logic) to support other datasets
TEMPLATES = {
    'extend':extend
}

In [119]:
TEMPLATES

{'extend': 'Your task is to take original descriptions of drug-drug interactions and augment them. Make sure the information you provide is correct and valid, and the extended text is two sentences long. To ensure diversity in these descriptions, respond as each of the following: medical doctors (physician), pharmacologist and toxicologist. \n Medical doctors: Especially those specializing in fields such as pharmacology, internal medicine, and clinical pharmacology. Hence, they understand the symptoms and organ systems affected due to the drug-drug interaction. \n Pharmacologists: Scientists who study the effects of drugs on biological systems. They understand the genetic and molecular mechanism behind these interactions. \n Toxicologists: Scientists who typically works with chemicals and other substances to determine if they are toxic, or harmful, to humans and other living organisms or the environment. \n Your job is to generate nine versions of the original description (given below 

In [120]:
def generate_request(idx, template):
    prompt = template_engine.generate_prompt(idx, template)

    return {
        'model': MODEL,
        'messages': [
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": prompt},
        ],
        **API_ARGS,
    }

Generate all requests (they are only built here, not yet executed) and save them in `jsonl` form: one `json` object per request, one request per line, all in a single file.

In [None]:
# Generate batches and save to FILENAME (if required) 
if GENERATE_BATCHES:
    with open(FILENAME, "w") as f:
        for idx in range(unique_ddis.shape[0]):
            
            for template in TEMPLATES.values():
                request = {
                    'idx': idx,
                    'request': generate_request(idx, template)
                }
                json_string = json.dumps(request)
                f.write(json_string + "\n")
    print(f'Saved to {FILENAME}')
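
Before submitting a long-running job, it can be worth reading the requests file back to confirm that every line is valid `json` and carries the two keys written above (`idx` and `request`). The following is a sketch under that assumption; `check_requests_file` is a hypothetical helper, and the demonstration uses a small temporary file with placeholder content rather than the real `FILENAME`.

```python
import json
import os
import tempfile


def check_requests_file(path):
    """Parse every line of a jsonl requests file and return the number of requests."""
    count = 0
    with open(path, "r") as f:
        for line_num, line in enumerate(f):
            record = json.loads(line)  # raises json.JSONDecodeError on a malformed line
            missing = {"idx", "request"} - set(record)
            if missing:
                raise ValueError(f"Line {line_num} is missing keys: {sorted(missing)}")
            count += 1
    return count


# Demonstration on a temporary file with placeholder requests.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_requests.jsonl")
    with open(demo_path, "w") as f:
        for idx in range(3):
            f.write(json.dumps({"idx": idx, "request": {"messages": []}}) + "\n")
    num_requests = check_requests_file(demo_path)

print(num_requests)
```

In the notebook, the returned count could be compared against `unique_ddis.shape[0] * len(TEMPLATES)`.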

## Step 2

**#### Now run openai_api_request_parallel_processor.py to generate responses ####**

See usage in that file

## Validations

In [96]:
TEMPLATES

{'extend': 'Your task is to take original descriptions of drug-drug interactions and augment them. Make sure the information you provide is correct and valid, and the extended text is two sentences long. To ensure diversity in these descriptions, respond as each of the following: medical doctors (physician), pharmacologist and toxicologist. \n Medical doctors: Especially those specializing in fields such as pharmacology, internal medicine, and clinical pharmacology. Hence, they understand the symptoms and organ systems affected due to the drug-drug interaction. \n Pharmacologists: Scientists who study the effects of drugs on biological systems. They understand the genetic and molecular mechanism behind these interactions. \n Toxicologists: Scientists who typically works with chemicals and other substances to determine if they are toxic, or harmful, to humans and other living organisms or the environment. \n Your job is to generate nine versions of the original description (given below 

In [None]:
# TODO: fix hacky mapping of prompts to templates 
#   - I suggest also storing the template ID **and** the actual template text along with the request in the request file, 
#   - this would also require amending openai_api_request_parallel_processor.py to also store the template ID and template text along with the request (search for the TODO in that file)
#       then this block could be removed

PROMPT_TO_TEMPLATE = {
    'concise summary': '$NAME is a $TYPE defined as follows: $DEF \nPlease provide a concise summary of this definition, using only the information given. ',
    'rephrase this definition': '$NAME is a $TYPE defined as follows: $DEF \nPlease rephrase this definition, using different terminology and common abbreviations/acronyms where possible. ',
    'extend the definition': 'Please extend the definition of the following $TYPE by adding multiple paragraphs. \nName: $NAME\nDefinition: $DEF ',
}

def find_template(prompt):
    for k, v in PROMPT_TO_TEMPLATE.items():
        if k in prompt:
            return v
    raise ValueError(f'Could not find template for prompt: {prompt}')
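
The TODO above suggests storing the template ID and text next to each request. A minimal sketch of the request side is shown here; `build_record` and `EXAMPLE_TEMPLATES` are hypothetical names, the template text is a shortened stand-in, and the sketch assumes the processor script would be amended to pass the extra fields through unchanged, which is not shown.

```python
import json

# Shortened stand-in template; the real ones live in the notebook's TEMPLATES dict.
EXAMPLE_TEMPLATES = {"extend": "Description: $DESC \nOutput:"}


def build_record(idx, template_id, request_body, templates=EXAMPLE_TEMPLATES):
    """Hypothetical request wrapper that carries the template ID and text with it."""
    return {
        "idx": idx,
        "template_id": template_id,
        "template": templates[template_id],
        "request": request_body,
    }


record = build_record(0, "extend", {"messages": []})
line = json.dumps(record)

# On the way back, the template is read directly instead of found by substring search.
restored = json.loads(line)
print(restored["template_id"], repr(restored["template"]))
```

With the ID stored in the record, substring matching on the prompt would no longer be needed.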

In [98]:
import tiktoken
def count_tokens(text, model):
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

def strip_newlines(text):
    return text.replace('\\n', ' ').replace('\n', ' ')

In [99]:
TASK_TYPE 

'chat_complete'

Gather responses from `jsonl` results file(s) and store in a python object `data`

In [None]:
# TODO: In future runs: save template & template ID & other data with request
from tqdm.notebook import tqdm
data = []


# Create a list of all results files to be processed together
RESULTS_FILENAMES = ['./api_requests_results_multi_new.jsonl']

for result_filename in RESULTS_FILENAMES:
    print(result_filename)
    with open(result_filename, "r") as f:
        for line_num, line in tqdm(enumerate(f)):
            try:
                json_object = json.loads(line)
            except json.JSONDecodeError:
                print(line_num)
                print(line)
                raise

            try:
                # Note: this is how the openai_api_request_parallel_processor.py has been amended to save results (along with the request and idx)
                
                idx, request, response = json_object[0]['idx'], json_object[0]['request'], json_object[1]
                prompt = request['messages'][1]['content']
                
                if 'rephrase' in prompt:
                    template = 'rephrase1'
                elif 'extend' in prompt:
                    template = 'extend1'
                else:
                    # No known keyword found; avoid reusing a stale value from a previous line
                    template = None
                
                row = {
                    'id': unique_ddis[idx],
                    'api_id': response['id'],
                    'model': response['model'],
                    'system': SYSTEM if TASK_TYPE == 'chat_complete' else None,
                    'object': response['object'],
                    'prompt_tokens': response['usage']['prompt_tokens'],
                    'completion_tokens': response['usage'].get('completion_tokens', 0),
                    'total_tokens': response['usage']['total_tokens'],
                    'template': template,
                    'prompt': prompt,
                    'prompt_stripped': strip_newlines(prompt),
                    'full_response': json.dumps(response),
                    'temperature': request['temperature'],
                }
                
                
                
                for choice in response['choices']:
                    i = choice['index']

                    response_text = choice["text"] if TASK_TYPE == 'complete' else choice['message']['content']
                    # Print warning if truncated
                    # if choice['finish_reason'] != 'stop':
                    #     print(f'!!!!!! WARNING: Truncated response finish_reason:{choice["finish_reason"]}!!!!!!')
                    row[f"text_{i}"] = response_text
                    row[f"finish_reason_{i}"] = choice['finish_reason']
                    row[f"completion_tokens_{i}"] = count_tokens(response_text, MODEL)

                data.append(row)
            
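            except TypeError:
                # Malformed entry (e.g. the response is not a dict): report it and move on.
                # Print the raw record, since `request`/`response` may not be assigned yet.
                print(line_num)
                print(json_object)

            except KeyError:
                print(line_num)
                print(json_object)
                raise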
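
The record layout that the loop above relies on can be summarised in a few lines: each results line is a `json` list whose first element holds `idx` and `request` and whose second element is the API response. The sketch below assumes that layout and a chat-completion response; `extract_chat_texts` is a hypothetical helper and the example line is synthetic, containing far fewer fields than a real response.

```python
import json


def extract_chat_texts(line):
    """Return (idx, [text per choice]) for one line of a chat-completion results file."""
    metadata, response = json.loads(line)[:2]
    texts = [choice["message"]["content"] for choice in response["choices"]]
    return metadata["idx"], texts


# Synthetic line for illustration only.
example_line = json.dumps([
    {"idx": 7, "request": {"messages": []}},
    {"choices": [{"index": 0, "message": {"content": "example text"}, "finish_reason": "stop"}]},
])
idx, texts = extract_chat_texts(example_line)
print(idx, texts)
```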