In [1]:
"""
This script takes the raw responses from the OpenAI API, extracts the ICD-10 codes and their associated probabilities,
and saves the top-ranked codes to file.

Pseudo code
-----------
1. Load the data from file.
2. Parse the feature "logprobs" and match with various ICD10 code patterns.
    2.1 Extract only the tokens that form the ICD10 code pattern and their associated probabilities.
    2.2 Calculate the mean linear probability of all the tokens involved.
    2.3 Save the ICD10 code, mean linear probability, and relevant information into "output_probs".
3. Sort ICD10 codes in "output_probs" by their mean linear probability in descending order.
4. Extract the top 5 ICD10 codes and their associated mean linear probabilities into their own columns.
5. Reorder the columns and save the dataframe to file.


Details regarding #2 of the pseudo code:
----------------------------------------

Although we specified the requirements in the API prompts, the response output sometimes contains additional information,
such as extra descriptions, multiple ICD10 codes, or other unrelated text. There are only a few dozen such cases
in over ten thousand responses. Nevertheless, these need to be handled as there can only be one best ICD10 code. In
general, we look into the output message, find all the ICD10 codes, calculate their mean probabilities, and save only the
ICD10 codes with the top 5 highest mean probabilities.


Details regarding #2.1 of the pseudo code:
------------------------------------------

An output message may show only one ICD10 code, but behind the scenes, the code is formed by a number of tokens. For 
example, an output message of "M54.2" is composed of four tokens: "M", "54", ".", and "2". Each token has its own log 
probability. All probabilities are recorded in the "logprobs" feature as an array. 

The mean probability of a single ICD10 code is simple to calculate, as we can just take the mean of the whole array.
However, when the output message consists of multiple ICD10 codes or unrelated text, 'logprobs' must be parsed to extract
only the relevant tokens.

We use sliding windows of various sizes to match different ICD10 code patterns using regular expressions. The patterns are
as follows:

    - ANN.NNNA
    - ANN.NNN (one to three digits after the dot)
    - ANN

... where A is a letter and N is a digit. The third character may also be a letter.

This will allow us to capture from detailed ICD10 codes (e.g. G83.9) to the broadest (e.g. B54).


Details regarding #2.2 of the pseudo code:
------------------------------------------
The formula used for calculating the mean linear probability is:

    Linear_Mean_Probability = (1/n) * sum(exp(logprob_i) for i in 1 to n)

... where logprob_i is the log probability of the i-th token and n is the number of tokens that form the ICD10 code.

        
Details regarding #5 of the pseudo code:
----------------------------------------

Below is the data structure of the parsed data:

    dataframe() = []
        'rowid': the unique identifier for the response.
        'cause(n)_icd10': the ICD10 code with the n-th highest mean linear probability. (n) can be 1 to 5.
        'cause(n)_icd10_prob': the mean linear probability of the ICD10 code. (n) can be 1 to 5.
        'output_created': 
        'output_model': 
        'output_system_prompt': 
        'output_user_prompt':
        'output_usage_completion_tokens': Number of tokens used by completion
        'output_usage_prompt_tokens': Number of tokens used by prompt
        'output_probs': Extracted ICD-10 codes, linear mean, and their associated token probabilities.        
        'other_columns': columns carried over from the original dataframe. (optional; by setting)
        'raw': the original raw response (optional; by setting)
    ]

"""
pass
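
As a quick check of the formula in step 2.2, the snippet below computes the mean linear probability for one code by hand. The token log probabilities are made-up values chosen for easy arithmetic, not real API output, and `mean_linear_probability` is a hypothetical helper name that does not appear elsewhere in this notebook.

```python
import math

def mean_linear_probability(logprobs):
    """Average of exp(logprob) over the tokens that form one ICD-10 code."""
    if not logprobs:
        raise ValueError("at least one log probability is required")
    return sum(math.exp(lp) for lp in logprobs) / len(logprobs)

# Illustrative (made-up) log probabilities for the tokens "M", "54", ".", "2"
example_logprobs = [math.log(0.8), math.log(0.5), math.log(1.0), math.log(0.9)]
example_mean = mean_linear_probability(example_logprobs)
print(round(example_mean, 4))  # (0.8 + 0.5 + 1.0 + 0.9) / 4 = 0.8
```

Because this is an arithmetic mean of linear probabilities, a single low-probability token lowers the score less than it would under a geometric mean (the mean of the log probabilities).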

In [2]:
import os
import pandas as pd
import numpy as np
import json
import re
from datetime import datetime

# return the current date and time as a string
def get_datetime_string():
    return datetime.now().strftime('%Y%m%d_%H%M')


# Input response data file
# DATA_FILE = "data_storage.json"
DATA_FILE = "testing_response_mac.json"

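# Compute the timestamp once so both export files share the same suffix
EXPORT_TIMESTAMP = get_datetime_string()
JSON_EXPORT_FILE = f"testing_validated_response_{EXPORT_TIMESTAMP}.json"
CSV_EXPORT_FILE = f"testing_validated_response_{EXPORT_TIMESTAMP}.csv"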

# Define the number of paired ICDs and probabilities we want to capture
PAIRS = 5

# Drop 'other_columns' from output dataframe
DROP_OTHER_COLUMNS = False

# Drop 'raw' from output dataframe
DROP_RAW = True


In [3]:
# F(x): Load and save the data storage dictionary

def load_data(filename=DATA_FILE):
    if os.path.exists(filename):
        print(f"{filename} found. Loading data...")
        with open(filename, 'r') as file:
            data = json.load(file)
        return data
    else:
        print(f"{filename} not found. Initializing empty dictionary...")
        return {}

def save_data(data, filename=DATA_FILE):
    with open(filename, 'w') as file:
        json.dump(data, file)

In [4]:
# F(x): Extract ICD probabilities from tokens

def extract_icd_probabilities(logprobs, debug=False):
    """
    Extracts ICD-10 codes and their associated probabilities from a list of tokens and log probabilities.

    This function iterates over the list of tokens and log probabilities, concatenating tokens together 
    and checking if they match the pattern of an ICD-10 code. If a match is found, it calculates the mean 
    linear probability of the ICD-10 code and packages the ICD-10 code, mean linear probability, and 
    associated tokens and log probabilities into a dictionary. It then appends this dictionary to a list 
    of parsed ICD-10 codes.

    Args:
        logprobs (list): A list of lists, where each inner list contains a token and its associated log probability.
        debug (bool or int, optional): Debug level. 0/False prints nothing; 1/True prints the joined tokens and
            the parsed ICD-10 codes; 2 also prints the match result for every window. Defaults to False.

    Returns:
        list: A list of dictionaries, where each dictionary contains an ICD-10 code, its mean linear probability, 
              and a dictionary of associated tokens and log probabilities.
    """
    parsed_icds = []
    tmp_df = pd.DataFrame(logprobs)
    if debug > 0:
        print(repr(''.join(tmp_df.iloc[:,0])))
    tmp_df_limit = len(tmp_df)
    for pos in range(tmp_df_limit):
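        # Concatenate windows of 2, 4, and 5 tokens to form candidate ICD-10 codes.
        # Note: .strip() removes leading/trailing whitespace from the joined string, so a
        # window that starts or ends with a whitespace token (e.g. '\n') can still match;
        # that token's log probability is then included in the mean.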
        temp_concat_ANN = ''.join(tmp_df.iloc[pos:pos+2, 0]).strip()
        temp_concat_ANN_NNN = ''.join(tmp_df.iloc[pos:pos+4, 0]).strip()
        temp_concat_ANN_NNN_A = ''.join(tmp_df.iloc[pos:pos+5, 0]).strip()
        temp_concat_ANA_NNN = ''.join(tmp_df.iloc[pos:pos+5, 0]).strip()
        
        # Reference: https://www.webpt.com/blog/understanding-icd-10-code-structure
        
        # Regular expression patterns for ICD-10 codes in the formats
        # 'ANN' (e.g., 'A10')
        # 'ANN.NNN' (e.g., 'A10.001'; one to three digits after the dot)
        # 'ANN.NNNA' (e.g., 'A10.001A') 
        # The third character may be a digit or a letter.
        # Note: the trailing letter is valid only when six alphanumeric characters precede it
        # pattern_ANN = r"^[A-Z]\d{2}$"
        pattern_ANN = r"^[A-Z]\d[0-9A-Z]$"
        # pattern_ANN_NNN = r"^[A-Z]\d{2}\.\d{1,3}$"        
        pattern_ANN_NNN = r"^[A-Z]\d[0-9A-Z]\.\d{1,3}$"        
        # pattern_ANN_NNN_A = r"^[A-Z]\d{2}\.\d{3}[A-Z]$"
        pattern_ANN_NNN_A = r"^[A-Z]\d[0-9A-Z]\.\d{3}[A-Z]$"        
        
        # Check if the concatenated tokens match the ICD-10 code patterns
        match_ANN = re.match(pattern_ANN, temp_concat_ANN)
        match_ANN_NNN = re.match(pattern_ANN_NNN, temp_concat_ANN_NNN)
        match_ANN_NNN_A = re.match(pattern_ANN_NNN_A, temp_concat_ANN_NNN_A)
        match_ANA_NNN = re.match(pattern_ANN_NNN, temp_concat_ANA_NNN)
        
        # [debug] Each line shows, for the windows starting at this position, which of the four candidates matched
        if debug == 2:
            print(
                str(pos).ljust(4), 
                repr(temp_concat_ANN).ljust(10), 
                ('yes' if match_ANN else 'no').ljust(15), 
                repr(temp_concat_ANN_NNN).ljust(10), 
                ('yes' if match_ANN_NNN else 'no').ljust(15), 
                repr(temp_concat_ANN_NNN_A).ljust(10), 
                ('yes' if match_ANN_NNN_A else 'no').ljust(15),
                repr(temp_concat_ANA_NNN).ljust(10), 
                ('yes' if match_ANA_NNN else 'no').ljust(5)
                )
        
        # Check match from longest to shortest
        # If a match is found, calculate the mean linear probability 
        # and package the ICD-10 code and associated data
        if match_ANN_NNN_A:
            winning_df = pd.DataFrame(logprobs[pos:pos+5])
            winning_icd = temp_concat_ANN_NNN_A
        elif match_ANA_NNN:
            winning_df = pd.DataFrame(logprobs[pos:pos+5])
            winning_icd = temp_concat_ANA_NNN
        elif match_ANN_NNN:
            winning_df = pd.DataFrame(logprobs[pos:pos+4])
            winning_icd = temp_concat_ANN_NNN            
        elif match_ANN:
            winning_df = pd.DataFrame(logprobs[pos:pos+2])
            winning_icd = temp_concat_ANN            
        else:
            continue
        
        # [debug] Display the winning ICD-10 code and its associated data
        if debug == 2:
            print(f"**** {winning_icd} - VALID ICD ****")
            display(winning_df)
        
        # Convert log probabilities to linear probabilities and calculate the mean
        winning_mean = np.exp(winning_df.iloc[:, 1]).mean()
        
        # Package the ICD-10 code and associated data
        winning_package = {
            'icd': winning_icd,
            'icd_linprob_mean': winning_mean,
            'logprobs': winning_df.rename(columns={0: 'token', 1:'logprob'}).to_dict(orient='list')
        }
        
        # Append the package to the list of parsed ICD-10 codes
        parsed_icds.append(winning_package)
    
    # [debug] Display the parsed ICD-10 codes
    if debug > 0:
        display(parsed_icds) 
    
    # Check if parsed_icds is empty
    if not parsed_icds:
        # If it is, raise an error and show the logprobs in question
        raise ValueError(f"No ICD-10 codes could be parsed from the provided logprobs: {logprobs}")

    return parsed_icds

# # Uncomment the following lines to test the function. 
# # `test` is an example of the `logprobs` field from the JSON data.
# test = [['A', -0.63648945],  ['09', -1.4643841], ['\n', -0.9866263], ['R', -0.6599979], ['50', -1.5362289],
#  ['.', -0.05481864],  ['9', -0.002321772], ['\n', -0.3524723], ['R', -0.56709456], ['11', -1.263591],
#  ['.', -0.05834798], ['0', -0.73551023], ['\n', -0.5051807], ['R', -0.65759194], ['63', -1.0282977],
#  ['.', -0.0006772888], ['4', -0.71002203]]

# test_output = extract_icd_probabilities(test)
# test_output

# # Uncomment to test a specific case
# extract_icd_probabilities(df.loc['24000015', 'logprobs'])
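
The function above joins each window and strips the result, so a window that begins or ends with a whitespace token can still match. The sketch below is one possible variant, not the method used in this notebook: it rejects windows whose first or last token is whitespace only, tries the longest window first, and jumps past matched tokens so that no code is reported twice. It uses only the standard library. The names `ICD10_PATTERN` and `extract_codes_skipping_whitespace` are hypothetical, and the single combined regular expression is assumed to be equivalent to the three patterns in the cell above.

```python
import math
import re

# Same shapes as the notebook's patterns: ANN, ANN.N to ANN.NNN, and ANN.NNNA
ICD10_PATTERN = re.compile(r"[A-Z]\d[0-9A-Z](?:\.(?:\d{1,3}|\d{3}[A-Z]))?")

def extract_codes_skipping_whitespace(logprobs, max_window=5):
    """Return [(code, mean_linear_probability)] for non-overlapping matches."""
    results = []
    pos = 0
    while pos < len(logprobs):
        matched = False
        # Try the longest window first, down to a single token
        for size in range(min(max_window, len(logprobs) - pos), 0, -1):
            window = logprobs[pos:pos + size]
            # Skip windows whose first or last token is whitespace only, so that
            # separators such as "\n" never contribute to a code's probability.
            if not window[0][0].strip() or not window[-1][0].strip():
                continue
            candidate = "".join(token for token, _ in window).strip()
            if ICD10_PATTERN.fullmatch(candidate):
                mean = sum(math.exp(lp) for _, lp in window) / size
                results.append((candidate, mean))
                pos += size  # jump past the consumed tokens to avoid overlaps
                matched = True
                break
        if not matched:
            pos += 1
    return results

# First eight tokens of the commented-out `test` list above
sample = [["A", -0.63648945], ["09", -1.4643841], ["\n", -0.9866263],
          ["R", -0.6599979], ["50", -1.5362289], [".", -0.05481864],
          ["9", -0.002321772], ["\n", -0.3524723]]
codes = extract_codes_skipping_whitespace(sample)
print([code for code, _ in codes])  # ['A09', 'R50.9']
```

Whether to exclude separator tokens is a design choice: including them penalizes codes that the model followed with an unlikely separator, while excluding them scores only the tokens that spell the code.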


In [5]:
# Load JSON data and convert to dataframe
data_storage = load_data()
df = pd.DataFrame(data_storage).T

testing_response.json found. Loading data...


In [6]:
# Extract ICD-10 codes and their associated probabilities as a new column
df['output_probs'] = df['logprobs'].apply(extract_icd_probabilities)

In [7]:
# F(x): Given a list of parsed ICDs (a list of dictionaries), convert it into a 1-dimensional Series

def output_icds_to_cols(value, pairs=PAIRS):
    """
    Converts a list of ICD-10 codes and their associated probabilities into a one-dimensional pandas Series.

    This function takes a list of dictionaries, where each dictionary contains an ICD-10 code, its mean 
    linear probability, and its token log probabilities. It converts this list into a DataFrame, sorts the 
    DataFrame by descending probability, drops the 'logprobs' column, reshapes the DataFrame into a 
    one-dimensional Series, and pads or truncates the Series to a fixed length of pairs*2.

    Args:
        value (list): A list of dictionaries as returned by `extract_icd_probabilities`.
        pairs (int, optional): The number of code/probability pairs to keep. Defaults to PAIRS.

    Returns:
        pandas.Series: A one-dimensional Series containing the ICD-10 codes and their associated probabilities.
    """
    tmp = pd.DataFrame(value) # convert list of dictionaries to dataframe
    tmp = tmp.sort_values(by="icd_linprob_mean", ascending=False) # sort by descending probability
    tmp = tmp.drop(columns=['logprobs'])
    tmp = tmp.stack().reset_index(drop=True) # flatten to code, prob, code, prob, ...
    tmp = tmp.reindex(range(pairs*2)) # pad or truncate to exactly PAIRS*2 entries
    return tmp

# Test
# output_icds_to_cols(test_output)
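
To make the reshaping step easier to follow, here is a minimal pure-Python sketch of the same idea: sort by probability, keep the top pairs, and pad the rest. It is an illustration only; `flatten_top_codes` is a hypothetical name, the example values are made up, and it pads with `None` where the pandas version produces `NaN`.

```python
def flatten_top_codes(parsed_icds, pairs=5):
    """Return [code1, prob1, code2, prob2, ...] padded with None to pairs*2 items."""
    ranked = sorted(parsed_icds, key=lambda item: item["icd_linprob_mean"], reverse=True)
    flat = []
    for item in ranked[:pairs]:
        flat.extend([item["icd"], item["icd_linprob_mean"]])
    # Pad so that every response yields the same number of columns
    flat.extend([None] * (pairs * 2 - len(flat)))
    return flat

# Made-up example in the same shape as the output of extract_icd_probabilities
example = [
    {"icd": "A09", "icd_linprob_mean": 0.38, "logprobs": {}},
    {"icd": "R50.9", "icd_linprob_mean": 0.70, "logprobs": {}},
]
row = flatten_top_codes(example, pairs=3)
print(row)  # ['R50.9', 0.7, 'A09', 0.38, None, None]
```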

In [8]:
# Generate column names for the exploded ICDs in cause{n}_icd10 and cause{n}_icd10_prob format
icd_column_names_mapping = {
    i: f"cause{i // 2 + 1}_icd10" if i % 2 == 0 else f"cause{i // 2 + 1}_icd10_prob"
    for i in range(PAIRS*2)
}

# Apply the `output_icds_to_cols` function to the `output_probs` column
# This will explode the ICDs into separate columns
parsed_df = df.merge(df.output_probs.apply(output_icds_to_cols).rename(columns=icd_column_names_mapping), left_index=True, right_index=True)
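
The column-name mapping is easier to read when written out for a small number of pairs. The sketch below rebuilds the same mapping with an explicit loop; `build_icd_column_names` is a hypothetical helper used only for this illustration.

```python
def build_icd_column_names(pairs):
    """Map flat positions 0..pairs*2-1 to alternating code/probability column names."""
    names = {}
    for i in range(pairs * 2):
        rank = i // 2 + 1  # positions 0 and 1 belong to cause1, 2 and 3 to cause2, ...
        names[i] = f"cause{rank}_icd10" if i % 2 == 0 else f"cause{rank}_icd10_prob"
    return names

mapping = build_icd_column_names(2)
print(mapping)
# {0: 'cause1_icd10', 1: 'cause1_icd10_prob', 2: 'cause2_icd10', 3: 'cause2_icd10_prob'}
```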

In [9]:
# Take 'usage' and extract its first 2 values into separate columns
parsed_df = parsed_df.merge(
    parsed_df['usage'].apply(lambda x: pd.DataFrame(x).iloc[:2,1])
    .rename(columns={
        0: "output_usage_completion_tokens",
        1: "output_usage_prompt_tokens"
        }), left_index=True, right_index=True)

In [10]:
# Define the mapping variable
column_mapping = {
    'model': 'output_model',
    'system_prompt': 'output_system_prompt',
    'user_prompt': 'output_user_prompt',
    'timestamp': 'output_created',
}

# Rename the columns using the mapping
parsed_df = parsed_df.rename(columns=column_mapping)

export_columns = []
export_columns += ['rowid']
export_columns += list(icd_column_names_mapping.values())
export_columns += [
                    'output_created',
                    'output_model',
                    'output_system_prompt' , 
                    'output_user_prompt', 
                    'output_usage_completion_tokens', 
                    'output_usage_prompt_tokens', 
                    'output_msg',
                    'output_probs'
                ]

if not DROP_OTHER_COLUMNS:
    export_columns += ['other_columns']
    
if not DROP_RAW:
    export_columns += ['raw']


# Show only relevant columns in the final dataframe
export_parsed_df = parsed_df[export_columns]

In [None]:
# Save the parsed data to a CSV file (uncomment the JSON line below to also export JSON)

# export_parsed_df.to_json(JSON_EXPORT_FILE, orient='records')
export_parsed_df.to_csv(CSV_EXPORT_FILE, index=True)