# Generating Plain Language Texts using LLMs

This notebook generates plain language texts with two LLMs, GPT-4 and GPT-3.5. The input data consists of nearly 300 randomly selected abstracts from the Cochrane Library.

In [7]:
import pandas as pd
import json
import os
import openai

## Setting Up the OpenAI API

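We configure the OpenAI API to use the GPT-4 and GPT-3.5 models through Azure. This involves setting environment variables for the API key and base URL, then applying them to the `openai` module together with the API type and version. The module-level settings and the `openai.ChatCompletion` interface used in this notebook belong to the `openai` Python package before version 1.0.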

In [None]:
# Set environment variables for the OpenAI API (fill in your own endpoint and key)
os.environ["OPENAI_API_BASE"] = ""
os.environ["OPENAI_API_KEY"] = ""

# Retrieve the environment variables
openai.api_key = os.getenv("OPENAI_API_KEY")
# The endpoint should look like https://YOUR_RESOURCE_NAME.openai.azure.com/
openai.api_base = os.getenv("OPENAI_API_BASE")

# Configure the OpenAI API settings
openai.api_type = 'azure'
openai.api_version = '2023-05-15' # this may change in the future
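
Writing credentials directly into a notebook cell makes them easy to leak. A minimal sketch, assuming the two variables are exported in the shell before the notebook starts, reads them and fails early when either one is missing. The helper name `read_azure_settings` is hypothetical and is not part of the `openai` package.

```python
import os

REQUIRED_VARS = ("OPENAI_API_BASE", "OPENAI_API_KEY")


def read_azure_settings(env=None):
    """Return the required settings, raising if any is unset or empty.

    Hypothetical helper: `env` defaults to os.environ and is a parameter
    only so the function can be exercised without real credentials.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

The returned values could then be assigned to `openai.api_base` and `openai.api_key` in the same way as in the cell above.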

## Loading Text Files

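We load all `.txt` files from the `llms_testing/Cochrane/ground_truth` folder into a dictionary. The keys of the dictionary are the filenames, and the values are the contents of the files. The code refers to the folder by the relative path `ground_truth`, so the notebook is assumed to run from `llms_testing/Cochrane`.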

In [5]:
# Initialize an empty dictionary to store the file contents
file_contents = {}

# Specify the directory containing the .txt files
directory = 'ground_truth'

# Iterate through all files in the directory
for filename in os.listdir(directory):
    if filename.endswith('.txt'):
        # Construct the full path to the file
        file_path = os.path.join(directory, filename)
        
        # Open and read the content of the file
        with open(file_path, 'r', encoding='utf-8') as file:
            content = file.read()
        
        # Add the content to the dictionary with the filename as the key
        file_contents[filename] = content

# Output the resulting dictionary to verify its contents
print(file_contents)
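
`os.listdir` returns entries in arbitrary order, so the processing order can differ between machines. A minimal sketch of the same loading step with a sorted, stable order, using `pathlib`; the function name `load_txt_files` is hypothetical.

```python
from pathlib import Path


def load_txt_files(directory):
    """Map each .txt filename in `directory` to its UTF-8 text content."""
    # Sort the paths so the dictionary is filled in a stable order.
    paths = sorted(p for p in Path(directory).glob("*.txt") if p.is_file())
    return {path.name: path.read_text(encoding="utf-8") for path in paths}
```

Calling `load_txt_files('ground_truth')` would build a dictionary with the same keys and values as `file_contents` above.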



## Generating Plain Language Summaries (PLS)

In this section, we use the GPT-4 and GPT-3.5 models to generate a Plain Language Summary (PLS) for each abstract in our dataset. The prompt asks for a fixed four-part structure (plain title, rationale, trial design, results) and a set of plain language guidelines, so that each PLS is clear and accessible to patients with varying levels of health literacy.

In [65]:
# Iterate through each filename and its text content
for key, text in file_contents.items():
    
    # Construct the prompt for the GPT-4 model
    prompt = f"""
    Using the following abstract of a biomedical study as input, generate a Plain Language Summary (PLS) understandable by any patient, regardless of their health literacy. Ensure that the generated text adheres to the following instructions which should be followed step-by-step: 
    a. Specific Structure: the generated PLS should be presented in a logical order, using the following order:  
        1. Plain Title
        2. Rationale
        3. Trial Design
        4. Results 
    b. Sections should be authored following these parameters: 
        1. Plain Title: Simplified title understandable to a layperson that summarizes the research that was done. 
        2. Rationale: Include: background or study rationale providing a general description of the condition, what it may cause or why it is a burden for the patients; the reason and main hypothesis for the study; and why the study is needed, and why the study medication has the potential to treat the condition. 
        3. Trial Design: Answer ‘How is this study designed?’ Include the description of the design, description of study and patient population (age, health condition, genre), and the expected amount of time a person will be in the study.  
        4. Results: answer ‘What were the main results of the study’, include what are the benefits for the patients, how the study was relevant for the area of study, and what are the conclusions from the investigator. 
    c. Consistency and Replicability: the generated PLS should be consistent regardless of the order of sentences or the specific phrasing used in the input protocol text. 
    d. Compliance with Plain Language Guidelines: The generated PLS must follow all of these plain language guidelines: 
        1. Have readability grade level of 6 or below. 
        2. Do not have jargon. All technical or medical words or terms should be defined or broken down into simple and logical explanations. 
        3. Active voice, not passive 
        4. Mostly one or two syllable words 
        5. Sentences of 15 words or less
        6. Short paragraphs of 3-5 sentences
        7. Simple numbers (eg, ratios, no percentages)
    e. Do not invent Content: The AI model should not invent information. If the AI model includes data other than the one given in the input abstract, the AI model should guarantee such data is verified and real.
    f. Aim for an approximate PLS length of 500-900 words.
    
    Abstract of a biomedical study text: {text}"""
    
    # Request the GPT-4 model to generate the PLS
    response = openai.ChatCompletion.create(
        engine="gpt-4-32k-rfmanrique",
        messages=[{'role': 'user', 'content': prompt}],
    ).choices[0].message["content"]
    
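    # Save the generated PLS; key already ends in ".txt", so strip it to avoid "name.txt.txt"
    output_dir = "output_cochrane/gpt-4"
    os.makedirs(output_dir, exist_ok=True)  # create the folder if it does not exist yet
    stem = os.path.splitext(key)[0]
    with open(os.path.join(output_dir, f"{stem}.txt"), 'w', encoding="utf-8") as f:
        f.write(response)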
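
Several of the guidelines in the prompt are countable, so a generated PLS can be screened automatically. The following is a minimal sketch, assuming that sentences end at `.`, `!`, or `?` and that words are separated by whitespace. The function name `check_pls` and the keys of the returned dictionary are hypothetical, and the readability grade level is not computed here.

```python
import re


def check_pls(text, max_sentence_words=15, min_words=500, max_words=900):
    """Report rough, countable checks against the prompt's guidelines.

    The splitting is naive: abbreviations such as "e.g." and decimal
    numbers are treated as sentence breaks.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    long_sentences = [
        s.strip() for s in sentences if len(s.split()) > max_sentence_words
    ]
    word_count = len(text.split())
    return {
        "word_count": word_count,
        "length_ok": min_words <= word_count <= max_words,
        "long_sentences": long_sentences,
    }
```

A report with a non-empty `long_sentences` list or `length_ok` set to `False` flags a summary for manual review; it does not by itself show that the text is hard to read.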


Next, we repeat the configuration for the GPT-3.5 model through Azure.

In [3]:
# Set environment variables for OpenAI API
os.environ["OPENAI_API_BASE"] = "<OPENAI_API_BASE>"
os.environ["OPENAI_API_KEY"] = "<OPENAI_API_KEY>"

# Retrieve the environment variables
openai.api_key = os.getenv("OPENAI_API_KEY")
# The endpoint should look like https://YOUR_RESOURCE_NAME.openai.azure.com/
openai.api_base = os.getenv("OPENAI_API_BASE")

# Configure the OpenAI API settings
openai.api_type = 'azure'
openai.api_version = '2023-05-15' # this may change in the future

Then, we generate the texts using GPT-3.5 with the same prompt instructions.

In [5]:
# Iterate through each filename and its text content
for key, text in file_contents.items():
    
    # Construct the prompt for the GPT-3.5 model
    prompt = f"""Using the following abstract of a biomedical study as input, generate a Plain Language Summary (PLS) understandable by any patient, regardless of their health literacy. Ensure that the generated text adheres to the following instructions which should be followed step-by-step: 
    a. Specific Structure: the generated PLS should be presented in a logical order, using the following order:  
        1. Plain Title
        2. Rationale
        3. Trial Design
        4. Results 
    b. Sections should be authored following these parameters: 
        1. Plain Title: Simplified title understandable to a layperson that summarizes the research that was done. 
        2. Rationale: Include: background or study rationale providing a general description of the condition, what it may cause or why it is a burden for the patients; the reason and main hypothesis for the study; and why the study is needed, and why the study medication has the potential to treat the condition. 
        3. Trial Design: Answer ‘How is this study designed?’ Include the description of the design, description of study and patient population (age, health condition, genre), and the expected amount of time a person will be in the study.  
        4. Results: answer ‘What were the main results of the study’, include what are the benefits for the patients, how the study was relevant for the area of study, and what are the conclusions from the investigator. 
    c. Consistency and Replicability: the generated PLS should be consistent regardless of the order of sentences or the specific phrasing used in the input protocol text. 
    d. Compliance with Plain Language Guidelines: The generated PLS must follow all of these plain language guidelines: 
        1. Have readability grade level of 6 or below. 
        2. Do not have jargon. All technical or medical words or terms should be defined or broken down into simple and logical explanations. 
        3. Active voice, not passive 
        4. Mostly one or two syllable words 
        5. Sentences of 15 words or less
        6. Short paragraphs of 3-5 sentences
        7. Simple numbers (eg, ratios, no percentages)
    e. Do not invent Content: The AI model should not invent information. If the AI model includes data other than the one given in the input abstract, the AI model should guarantee such data is verified and real.
    f. Aim for an approximate PLS length of 500-900 words.
    
    Abstract of a biomedical study text: {text}"""
    
    # Request the GPT-3.5 model to generate the PLS
    response = openai.ChatCompletion.create(
        engine="gpt-35-turbo-16k-rfmanrique",
        messages=[{'role': 'user', 'content': prompt}],
    ).choices[0].message["content"]
    
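    # Save the generated PLS; key already ends in ".txt", so strip it to avoid "name.txt.txt"
    output_dir = "gpt-35"
    os.makedirs(output_dir, exist_ok=True)  # create the folder if it does not exist yet
    stem = os.path.splitext(key)[0]
    with open(os.path.join(output_dir, f"{stem}.txt"), 'w', encoding="utf-8") as f:
        f.write(response)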
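
The two generation loops differ only in the deployment name and the output folder. A minimal sketch of a shared loop follows, assuming the model call is passed in as a function that takes an abstract and returns the generated text. The names `generate_all` and `generate_fn` are hypothetical. Outputs that already exist are skipped, so an interrupted run can be resumed without repeating API calls.

```python
from pathlib import Path


def generate_all(file_contents, generate_fn, output_dir):
    """Write generate_fn(text) to output_dir for every abstract.

    Returns the filenames written during this call. `generate_fn` stands
    in for the request to the model and is an assumption of this sketch.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for filename, text in file_contents.items():
        target = out / filename  # filename already carries the .txt extension
        if target.exists():
            continue  # already generated in an earlier run
        target.write_text(generate_fn(text), encoding="utf-8")
        written.append(filename)
    return written
```

Here `generate_fn` would wrap the prompt construction and the `openai.ChatCompletion.create` call for one deployment, and the function would be called once per model with its own output folder.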