Re-running Data

- based on `rowid`s, there are 11920 records
- GPT responses produced 11887 outputs
- of the 33 missing outputs, 10 were spot-checked: in each case the rowid exists in the VA and age files but is missing from the open narrative file.
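
The spot check above could be extended to every missing rowid with an anti-join. This is a minimal sketch, assuming each file is loaded into a dataframe with a `rowid` column; the toy frames and the helper name `rowids_missing_from` are illustrative and not part of the project code.

```python
import pandas as pd

def rowids_missing_from(reference_df: pd.DataFrame, other_df: pd.DataFrame) -> list:
    """Return the rowids present in reference_df but absent from other_df."""
    merged = reference_df[["rowid"]].merge(
        other_df[["rowid"]], on="rowid", how="left", indicator=True
    )
    # "_merge" is "left_only" for rows with no match in other_df
    return sorted(merged.loc[merged["_merge"] == "left_only", "rowid"].tolist())

# Toy data standing in for the questionnaire (VA) and open narrative files
questionnaire_toy = pd.DataFrame({"rowid": [101, 102, 103, 104]})
narrative_toy = pd.DataFrame({"rowid": [101, 103]})

missing = rowids_missing_from(questionnaire_toy, narrative_toy)
print(missing)
```

Because the notebook joins the three files with inner merges, any rowid reported by such a check never reaches the API loop and so produces no output.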

In [None]:
"""
The HEALSL project distributes its verbal autopsy data across multiple files: round one and round two; adult, child, and neonate; and questionnaire,
age, and open narrative are all separate files. This script combines these files into a single dataset (dataframe) to simplify processing.

We extract the deceased's sex from the questionnaire dataset, the deceased's age from the age dataset, and the open narrative recorded during the
verbal autopsy from the narrative dataset. Then, we combine the extracted features (columns) using their rowid as the key.

We use the OpenAI Chat Completions API to generate a response for each verbal autopsy record. Several parameters were used for this project:
messages: the input text to be processed by the API. It combines two text prompts: the system prompt, which is the same for
all requests and gives the model context about its role and objective, and the user prompt, which concatenates more specific instructions
regarding the output response with the data from the dataframe.

model: this parameter specifies the language model to be used. For consistency and reproducibility of the results, we used specific versions
of the GPT-3.5 and GPT-4 models: gpt-3.5-turbo-0125 and gpt-4-0613, respectively.

temperature: this parameter can be any value between 0 and 2. It controls the randomness of the output response. Low temperatures result in more 
deterministic responses, while high temperatures result in more random responses. We used a temperature of 0 to ensure the responses were as 
deterministic as possible.

logprobs: this parameter controls whether the output includes the log probabilities of the output tokens. We set it to True to record how
confident the model was in each token of its response.

The data files used are as follows:
healsl_rd1_neo_v1.csv
healsl_rd1_neo_age_v1.csv
healsl_rd1_neo_narrative_v1.csv

healsl_rd1_child_v1.csv
healsl_rd1_child_age_v1.csv
healsl_rd1_child_narrative_v1.csv

healsl_rd1_adult_v1.csv
healsl_rd1_adult_age_v1.csv
healsl_rd1_adult_narrative_v1.csv

healsl_rd2_neo_v1.csv
healsl_rd2_neo_age_v1.csv
healsl_rd2_neo_narrative_v1.csv

healsl_rd2_child_v1.csv
healsl_rd2_child_age_v1.csv
healsl_rd2_child_narrative_v1.csv

healsl_rd2_adult_v1.csv
healsl_rd2_adult_age_v1.csv
healsl_rd2_adult_narrative_v1.csv

Note: This script processes only one age group (neo, child, or adult) of one round (1 or 2) per execution. Therefore, the user must manually change
the input dataset filenames before each execution to select the correct age group and round. The output filename can remain the same: as long as the rowids are
unique, which is the case for the HEALSL datasets, the results for each rowid will be written only once to the output file.

The prompts are as follows:

System prompt:
"You are a physician with expertise in determining underlying causes of death in Sierra Leone by assigning ICD-10 codes for deaths using verbal autopsy 
narratives. Return only the ICD-10 code without description. E.g. A00 
If there are multiple ICD-10 codes, show one code per line"

User prompt:
"With the highest certainty, determine the underlying cause of death and provide the most accurate ICD-10 code for a verbal autopsy narrative of a
AGE_VALUE_DEATH AGE_UNIT_DEATH old SEX_COD death in Sierra Leone: {open_narrative}"

    AGE_VALUE_DEATH and AGE_UNIT_DEATH: replaced with age_value_death and age_unit_death values from the age dataset.
    SEX_COD: replaced with sex_cod value from the questionnaire dataset.
    open_narrative: replaced with summary value from the open narrative dataset.

The response from the API is then parsed to extract relevant information in plain text, such as the response text and the log probabilities,
along with accounting information (rowids, models used, timestamps, token consumption). These are stored in a dictionary keyed by rowid and exported to a JSON file.
        

Pseudo code:
1. Load the multiple datasets into respective dataframes.
2. Extract the necessary features from each dataframe.
3. Merge extracted features into a single dataframe using the rowid as the key.
4. Load the results storage (a dictionary keyed by rowid) from file.
    If the results storage does not exist, create an empty dictionary.
5. For each row in the dataframe:
    Check if rowid is in the result storage.
        If rowid is in the result storage, skip the row.
    Compose the two prompts and generate a response using the OpenAI API.
    Store the response and other relevant information in the result storage.
    Save the result storage to a file periodically.
    
---------

1. Load datasets into dataframes D1, D2, ..., Dn.
2. Extract necessary features F1, F2, ..., Fn from each dataframe.
3. Merge features into a single dataframe D_merge using rowid as the key.
4. Load results storage R from file, or initialize R as an empty dictionary if the file does not exist.
5. For each row in D_merge:
    a. If rowid is not in R:
        i. Compose prompts and generate response using OpenAI API.
        ii. Store response and relevant information in R.
        iii. Periodically save R to a file.
"""
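
The pseudo code in the docstring above can be sketched as a small resumable loop. This is an illustration under stated assumptions, not the notebook's implementation: `process_rows` and its `generate` argument are hypothetical names, and `generate` is a stub standing in for the OpenAI API call so the sketch runs offline.

```python
import json
import os
import tempfile

def process_rows(rows, storage_path, generate, save_freq=5):
    """Process only rows whose rowid is not yet stored; return (results, new_count)."""
    # Step 4: load existing results, or start with an empty dictionary
    if os.path.exists(storage_path):
        with open(storage_path, "r") as f:
            results = json.load(f)
    else:
        results = {}

    new_count = 0
    for row in rows:
        key = str(row["rowid"])  # JSON object keys are always strings
        if key in results:
            continue  # already processed in an earlier run
        results[key] = {"rowid": row["rowid"], "output_msg": generate(row)}
        new_count += 1
        if new_count % save_freq == 0:  # periodic checkpoint
            with open(storage_path, "w") as f:
                json.dump(results, f)

    # Final save so the last few results are not lost
    with open(storage_path, "w") as f:
        json.dump(results, f)
    return results, new_count

# Toy rows and a stub generator that always returns the same made-up code
rows = [{"rowid": 1, "open_narrative": "a"}, {"rowid": 2, "open_narrative": "b"}]
path = os.path.join(tempfile.mkdtemp(), "demo_storage.json")
_, first_run = process_rows(rows, path, generate=lambda r: "R99")
_, second_run = process_rows(rows, path, generate=lambda r: "R99")
print(first_run, second_run)
```

The second call processes nothing because both rowids are already in the storage file. Unlike the notebook, which checkpoints whenever the dataframe index is a multiple of `SAVE_FREQ`, this sketch checkpoints after every `save_freq` newly generated results.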

In [2]:
import os
import pandas as pd
import numpy as np
from openai import OpenAI
import textwrap
import datetime
import pytz
import json
import re


open_api_key = os.environ.get('OPEN_API_KEY')
client = OpenAI(api_key=open_api_key)

WORDWRAP_WIDTH = 100
DATA_FILE = "data_storage.json"
SAVE_FREQ = 5

# Models
GPT4 = "gpt-4-0613"
GPT3 = "gpt-3.5-turbo-0125"
MODEL_NAME = GPT3

# Set the timezone to Eastern Time
TIMEZONE = pytz.timezone('US/Eastern')



# SYS_PROMPT = """You are a physician with expertise in determining underlying causes of death in Sierra Leone 
# by assigning ICD-10 codes for deaths using verbal autopsy narratives. Return only the ICD-10 code in JSON format: {“icd10”: [code1, code2, code3, code4, code5]}"""

# SYS_PROMPT = """You are a physician with expertise in determining underlying causes of death in Sierra Leone 
# by assigning ICD-10 codes for deaths using verbal autopsy narratives. Return only the ICD-10 code in JSON format: {“icd10”: code1}"""

# USR_PROMPT = """Determine the underlying cause of death and provide an ICD-10 code for a verbal autopsy narrative
# of a AGE_VALUE_DEATH AGE_UNIT_DEATH old SEX_COD death in Sierra Leone: {open_narrative}"""

SYS_PROMPT = """You are a physician with expertise in determining underlying causes of death in Sierra Leone by assigning ICD-10 codes for deaths using verbal autopsy narratives. Return only the ICD-10 code without description. E.g. A00 
If there are multiple ICD-10 codes, show one code per line."""

USR_PROMPT = """With the highest certainty, determine the underlying cause of death and provide the most accurate ICD-10 code for a verbal autopsy narrative of a AGE_VALUE_DEATH AGE_UNIT_DEATH old SEX_COD death in Sierra Leone: {open_narrative}"""
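
To show how the placeholders in the user prompt are filled, here is a minimal sketch that mirrors the substitution steps used later in the notebook. The helper name `fill_prompt`, the copy of the template in `USR_TEMPLATE`, and the example values are illustrative assumptions.

```python
USR_TEMPLATE = (
    "With the highest certainty, determine the underlying cause of death and provide "
    "the most accurate ICD-10 code for a verbal autopsy narrative of a "
    "AGE_VALUE_DEATH AGE_UNIT_DEATH old SEX_COD death in Sierra Leone: {open_narrative}"
)

def fill_prompt(template, age_value, age_unit, sex, narrative):
    """Replace the upper-case placeholders, then insert the narrative."""
    prompt = template.replace("AGE_VALUE_DEATH", str(age_value))
    prompt = prompt.replace("AGE_UNIT_DEATH", age_unit.lower())
    prompt = prompt.replace("SEX_COD", sex.lower())
    # The narrative is passed as a format argument, so braces inside it are not interpreted
    return prompt.format(open_narrative=narrative)

example = fill_prompt(USR_TEMPLATE, 7, "Days", "Male", "Made-up narrative text.")
print(example)
```

When `sex_cod` is missing and has been filled with an empty string, the same steps leave a double space in the prompt ("old  death").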



In [8]:
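# Cost projection in USD for about 12,000 records:
# input tokens ~ (prompt characters // 4) + 300 per record, at 0.0005 USD per 1K tokens;
# output tokens ~ 15 per record, at 0.0015 USD per 1K tokens.
((((len(SYS_PROMPT + USR_PROMPT) // 4) + 300) / 1000) * 0.0005 + (15/1000) * 0.0015 ) * 12000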

2.8320000000000003
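
The arithmetic behind this projection can be written out as a function. This is a sketch that reuses the constants from the expression above (about 4 characters per token, an allowance of 300 extra prompt tokens and 15 output tokens per record, and prices of 0.0005 and 0.0015 USD per 1K tokens); the function and parameter names are illustrative, and `508` is a placeholder prompt length chosen so that the result matches the value shown above.

```python
def project_cost(prompt_chars, n_records, extra_prompt_tokens=300, output_tokens=15,
                 input_price_per_1k=0.0005, output_price_per_1k=0.0015):
    """Estimate the total API cost in USD for n_records requests."""
    # Rough token count: 4 characters per token plus a fixed per-record allowance
    prompt_tokens = prompt_chars // 4 + extra_prompt_tokens
    per_record = ((prompt_tokens / 1000) * input_price_per_1k
                  + (output_tokens / 1000) * output_price_per_1k)
    return per_record * n_records

cost = project_cost(prompt_chars=508, n_records=12000)
print(round(cost, 3))
```

With the real prompts, `prompt_chars` would be `len(SYS_PROMPT + USR_PROMPT)`.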

In [None]:
questionnaire_df =  pd.read_csv("../data_202402/healsl_rd2_neo_v1.csv")
age_df =            pd.read_csv("../data_202402/healsl_rd2_neo_age_v1.csv")
narrative_df =      pd.read_csv("../data_202402/healsl_rd2_neo_narrative_v1.csv")

narrative_df = narrative_df.rename(columns={'summary': 'open_narrative'})


In [None]:
# quick_gp3 = pd.read_csv("../data_202402/healsl_rd1_rapid_chatgpt3_v1.csv")
# quick_gp4 = pd.read_csv("../data_202402/healsl_rd1_rapid_chatgpt4_v1.csv")

# questionnaire_df[questionnaire_df['p1_recon_icd_cod'].isna()][['rowid','p1_icd_cod','p2_icd_cod']]

In [None]:
narrative_only = narrative_df[['rowid','open_narrative']]
sex_only = questionnaire_df[['rowid','sex_cod']]
age_only = age_df[['rowid','age_value_death','age_unit_death']]

merged_df = narrative_only.merge(sex_only, on='rowid').merge(age_only, on='rowid')

# Fill in missing values with empty string
merged_df['sex_cod'] = merged_df['sex_cod'].fillna('')

assert not merged_df.isnull().values.any(), "Execution halted: NaN values found in merged_df"

print(f"Sample of merged_df {merged_df.shape}:")
display(merged_df.sample(5))

Sample of merged_df (233, 5):


     rowid     open_narrative                                     sex_cod  age_value_death  age_unit_death
48   24000243  As per respondent, the child was fresh still b...  Female   0                Days
195  24003291  The deceased was 0 day male who was a still bi...  Male     0                Days
178  24000616  According to the respondent, the deceased was ...  Male     7                Days
74   24000116  According to the respondent the deceased was a...  Female   2                Days
16   24003613  As per respondent, the decease was a 0D old bo...  Male     10               Days


In [None]:
# F(x): Initialize the data storage dictionary

def load_data(filename=DATA_FILE):
    
    if os.path.exists(filename):
        print(f"{filename} found. Loading data...")
        with open(filename, 'r') as file:
            data = json.load(file)
        return data
    else:
        print(f"{filename} not found. Initializing empty dictionary...")
        return {}

def save_data(data, filename=DATA_FILE):
    # Save data to a file    
    with open(filename, 'w') as file:
        json.dump(data, file)
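
One consequence of storing results as JSON is that integer keys come back as strings after a save and reload, which is why the processing loop checks for both forms of the rowid. A minimal sketch of the round trip, using a made-up record:

```python
import json

# An in-memory dictionary keyed by an integer rowid (values are made up)
original = {24000243: {"output_msg": "P95"}}

# Serialize and parse again, as save_data followed by load_data would
restored = json.loads(json.dumps(original))

print(list(restored.keys()))
print(24000243 in restored, "24000243" in restored)
```

After the round trip only the string form of the key is present, so writing entries under `str(rowid)` from the start keeps lookups consistent.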

In [None]:
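# F(x): Send a list of chat messages to the model and return the full completion object
def get_completion(
    messages: list[dict[str, str]],
    model: str = "gpt-3.5-turbo-0125",
    # max_tokens=500,
    temperature=0,
    # stop=None,
    # seed=123,
    tools=None,
    logprobs=None,
    top_logprobs=None,
):
    # Returns the completion object (not a string), so callers can read
    # .choices[0].message.content, .choices[0].logprobs, and .usage.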

    params = {
        "model": model,
        # "response_format": { "type": "json_object" },
        "messages": messages,
        # "max_tokens": max_tokens,
        "temperature": temperature,
        # "stop": stop,
        # "seed": seed,
        "logprobs": logprobs,
        "top_logprobs": top_logprobs,
    }
    if tools:
        params["tools"] = tools

    completion = client.chat.completions.create(**params)
    return completion

In [17]:
import datetime
# Load existing data or initialize an empty dictionary
data_storage = load_data()
skipped_rows = []
repeated_skips = False
print()


for index, row in merged_df.iterrows():
    # Access the values of each column in the current row
    # hijacking row 
    # row = merged_df[merged_df['rowid'] == 14005966].iloc[0]
    
    rowid = row['rowid']
    
    # Check if rowid already processed. Testing both because json changes int keys to str    
    if (rowid) in data_storage or str(rowid) in data_storage:
        if repeated_skips:
            print("\r", end='', flush=True)
        print(f"Skipping index {index}, row {rowid} - Already processed.", end='', flush=True)
        repeated_skips = True
        skipped_rows.append(rowid)
        continue

    
    narrative = row['open_narrative']
    sex_cod = row['sex_cod']
    age_value_death = row['age_value_death']
    age_unit_death = row['age_unit_death']
    
    prompt = USR_PROMPT
    prompt = prompt.replace('AGE_VALUE_DEATH', str(age_value_death))
    prompt = prompt.replace('AGE_UNIT_DEATH', age_unit_death.lower())
    prompt = prompt.replace('SEX_COD', sex_cod.lower())
    prompt = prompt.format(open_narrative=narrative)
    
    # print("Prompt:")    
    # print(textwrap.fill(prompt, width=WORDWRAP_WIDTH))
    # print()
    
    # for a in range(5):
    completion = get_completion(
        [
            {"role": "system", "content": SYS_PROMPT},
            {"role": "user", "content": prompt}
        ] ,
        model=MODEL_NAME,
        logprobs=True,
        # top_logprobs=2,
    )
    
    # print(completion.choices[0].message)
    
    # for token in completion.choices[0].logprobs.content:
    #     print(f"{repr(str(token.token)).ljust(15)}  {str(token.logprob).ljust(20)} {np.round(np.exp(token.logprob)*100,2)}%")
        
    output_msg = completion.choices[0].message.content
    logprob_data = [(token.token, float(token.logprob)) for token in completion.choices[0].logprobs.content]
    usage_data = list(completion.usage)    
    current_time = datetime.datetime.now(tz=TIMEZONE).isoformat()
       
    data_storage[str(rowid)] = {
        'rowid': rowid,
        'model': MODEL_NAME,
        'system_prompt': SYS_PROMPT,
        'user_prompt': prompt,
        'output_msg': output_msg,
        'logprobs': logprob_data,
        'usage': usage_data,
        'timestamp': current_time
    }

    # Save data periodically (you can adjust the frequency based on your needs)    
    if index % SAVE_FREQ == 0 and index > 0:
        if repeated_skips:
            print("\n", flush=True)
        repeated_skips = False
        
        save_data(data_storage)
        print(f"Saving index: {str(index).ljust(8)} Processing: {str(rowid).ljust(12)} Rows skipped: {len(skipped_rows)}", sep=' ', end='\r', flush=True)
        # break
    
try:
    save_data(data_storage)
    print(f"Saving index: {str(index).ljust(8)} Processing: {str(rowid).ljust(12)} Rows skipped: {len(skipped_rows)}", sep=' ', end='\r', flush=True)
    print("\nData saved successfully.")
except Exception as e:
    print(f"Error saving data: {e}")

if len(skipped_rows) > 0:
    print(f"DF length: {len(merged_df)}")
    print(f"Rows skipped: {len(skipped_rows)}")    

data_storage.json found. Loading data...

Saving index: 232      Processing: 24001069     Rows skipped: 0
Data saved successfully.
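
The stored `logprobs` entries are (token, log probability) pairs, and a log probability converts to a probability with `exp`. Here is a small sketch of how the saved values might be inspected; the helper names and the sample tokens are illustrative assumptions, not real model output.

```python
import math

def token_probabilities(logprob_data):
    """Convert (token, logprob) pairs into (token, probability) pairs."""
    return [(token, math.exp(logprob)) for token, logprob in logprob_data]

def sequence_probability(logprob_data):
    """Joint probability of the whole output: exp of the summed log probabilities."""
    return math.exp(sum(logprob for _, logprob in logprob_data))

# Made-up example: two tokens with probabilities 0.9 and 0.5
sample = [("P", math.log(0.9)), ("95", math.log(0.5))]

probs = token_probabilities(sample)
joint = sequence_probability(sample)
print([(token, round(p, 2)) for token, p in probs], round(joint, 2))
```

After a reload from `data_storage.json` the pairs are lists rather than tuples, and the same unpacking still works.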
