# Text Processing with GPT-3.5 for Pfizer Documents

This notebook leverages the GPT-3.5 model to correct writing errors in Pfizer text documents. The goal is to enhance the text's grammatical correctness without altering the original wording. The process involves importing necessary libraries, setting up the OpenAI API, and processing the text through the GPT-3.5 model.

In [1]:
import os
import openai

## Setting Up the OpenAI API

Next, we configure the OpenAI API to interact with the GPT-3.5 model. The API key, base URL, type, version, and deployment name are retrieved from environment variables to ensure secure and flexible configuration.

- **API Key**: Authenticates requests to the OpenAI API.
- **API Base**: Specifies the base URL for the API.
- **API Type**: Indicates the type of OpenAI API being used (Azure in this case).
- **API Version**: Specifies the version of the OpenAI API.
- **Deployment Name**: Identifies the specific deployment of the GPT-3.5 model.

Using environment variables helps to keep sensitive information secure and allows for easy updates and configuration changes.

In [2]:
openai.api_key = os.getenv("OPENAI_API_KEY_GPT_35")
openai.api_base = os.getenv("OPENAI_API_BASE_GPT_35")
openai.api_type = 'azure'
openai.api_version = '2023-05-15'
openai.deployment_name = os.getenv("OPENAI_DEPLOYMENT_NAME_GPT_35_16k")

## Loading Text Files

In this step, we load the text files from the specified directory. Each text file contains a fragment of Pfizer documents that need grammatical corrections. We read the content of each file and store it in a list for further processing.

- **Directory**: The text files are located in the `fragments/first_section` and `fragments/second_section` directory.
- **File Reading**: We iterate through all files in the directory, read the content of each text file, and store the content in a list. The names of the files are also stored (excluding the `.txt` extension) for reference.

In [7]:
# Initialize lists to store the texts and file names
texts = []
names = []

# Iterate through all files in the specified directory
for file in os.listdir("fragments/second_section"):
    if file.endswith(".txt"):
        # Open and read the content of each text file
        with open(os.path.join("fragments/second_section", file), "r", encoding="utf-8") as f:
            texts.append(f.read())
            names.append(file[:-4])  # Store the file name without the .txt extension

## Correcting Text with GPT-3.5

In this section, we use the GPT-3.5 model to correct writing errors in the loaded text fragments. The model corrects typos, misspelled words, strange characters, and improper spacing without changing the original wording or sentence structure. If a text fragment is empty or too short (less than 250 words), it is skipped to ensure meaningful corrections.

For each text fragment:
- **Text Verification**: Check if the text is non-empty and has at least 250 words.
- **Prompt Preparation**: Create a prompt instructing the GPT-3.5 model to correct errors while preserving the original words and sentences.
- **Model Interaction**: Send the prompt to the GPT-3.5 model and receive the corrected text.
- **Saving Results**: Save the corrected text to a new file in the `fragments/second_section_corrected` directory.

In [None]:
# Iterate through each text in the texts list
for text in texts:
    # Check if the text is empty or too short
    if text == "" or len(text.split()) < 250:
        print(names[texts.index(text)], "is empty or too short")
        continue

    # Prepare the prompt for the GPT-3.5 model
    prompt = """The following text in triple backticks is a fragment of a medical text. This text could have some words or sentences separated, misspelled words, typos or strange characters. Correct the text without changing any of the words or sentences. If you find some vignettes put \"-\" between them. If you find some strange characters, delete them. If you find some misspelled words, correct them. If you find some words or sentences separated, join them. If you find some words or sentences joined, separate them. Return just the text with the corrections.
    Text: ```{text}```
    """
    
    # Interact with the GPT-3.5 model to get the corrected text
    response = openai.ChatCompletion.create(
        engine=openai.deployment_name,
        messages=[{'role': 'user', 'content': prompt}],
    ).choices[0].message["content"]
    
    # Save the corrected text to a new file
    with open(os.path.join("fragments/second_section_corrected/", f"{names[texts.index(text)]}_2.txt"), "w", encoding="utf-8") as f:
        f.write(response)

NCT00935012 is empty or too short
NCT00935012 is empty or too short
NCT00935012 is empty or too short
NCT00935012 is empty or too short
NCT00935012 is empty or too short
NCT00935012 is empty or too short
NCT00935012 is empty or too short
NCT00935012 is empty or too short
NCT00935012 is empty or too short
NCT00935012 is empty or too short
NCT00935012 is empty or too short
NCT00935012 is empty or too short
NCT00935012 is empty or too short
NCT00935012 is empty or too short
NCT00935012 is empty or too short
