# Agreement analysis between reviewers

The objective of this analysis is to measure agreement and disagreement between reviewers' interpretations in systematic reviews. The working hypothesis is that reviewers will not agree on the interpretation of the text and technical details of papers.

The data consists of 3 groups of 23 observations of 11 variables: title of manuscripts, url, full abstract, publication date, review (boolean), llm (boolean), set of llms used, structured_data (boolean), list of medical conditions, and evaluate_patient_trial. Each group represents a reviewer.

## Setup

### Imports

In [1]:
import pandas as pd
import numpy as np
import sklearn
from sklearn.metrics import confusion_matrix
import ast

### Loading the Data

Here we load the reviewer files. Please note that the answers have been manually reviewed and small fixes have been made where needed.

In [2]:
Reviewer1 = pd.read_excel('./results/FullText_reviewer_1.xlsx')
GPT = pd.read_excel('./results/FullText_reviewer_GPT4.xlsx')
Resolution = pd.read_excel('./results/FullText_resolution.xlsx')

## Calculating Inter-rater Agreement

We have four functions to calculate inter-rater agreement. `kappa_calculation` is the main function; it calculates Cohen's Kappa for two lists containing 'yes'/'no' values. `kappa_boolean` and `kappa_non_boolean` re-format the answers from the reviewers into 'yes'/'no' lists and call `kappa_calculation` to generate the agreement values. Finally, `Kappa` puts everything together, calling either `kappa_boolean` or `kappa_non_boolean` for each parameter.

### Kappa Calculation

The code for this Kappa Calculation was taken from this page: https://rowannicholls.github.io/python/statistics/agreement/cohens_kappa.html

The formula for the approximate standard error is found in equation 7 of Cohen (1960).
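
For reference, the quantities computed in the code cell below can be written as follows, where $p_o$ is the observed proportion of agreement, $p_e$ the proportion of agreement expected by chance, and $n$ the number of rated items. The symbol $SE$ is our shorthand for the approximation that the code stores in `se`:

$$\kappa = \frac{p_o - p_e}{1 - p_e}, \qquad SE(\kappa) \approx \sqrt{\frac{p_o\,(1 - p_o)}{n\,(1 - p_e)^2}}, \qquad \kappa \pm 1.96 \, SE(\kappa)$$

Note that $\kappa$ is undefined when $p_e = 1$, which happens when both raters give the same single answer to every item.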

In [3]:
def kappa_calculation(List1, List2):
    """ Function that calculates Cohen's Kappa coefficient for two lists that contain 'yes' or 'no' answers.
    Please note that the input lists should have the same length.

    Parameters:
    List1, List2 (list['yes'|'no']): lists of 'yes' or 'no' values.

    Returns:
    kappa, (lower, upper)
    A float representing the calculated Cohen's Kappa coefficient of the two lists and a tuple of two floats
    representing the approximate lower and the upper ends of the confidence interval for the kappa coefficient.
    """

    readerA = List1
    readerB = List2

    # Confusion matrix
    cm = confusion_matrix(readerA, readerB, labels=['yes','no'])

    # Sample size
    n = np.sum(cm)

    # Expected matrix
    sum0 = np.sum(cm, axis=0)
    sum1 = np.sum(cm, axis=1)
    expected = np.outer(sum0, sum1) / n**2

    # Number of classes
    n_classes = cm.shape[0]

    # Calculate p_o (the observed proportion of agreement) and
    # p_e (the probability of random agreement)
    identity = np.identity(n_classes)
    p_o = np.sum((identity * cm) / n)
    p_e = np.sum((identity * expected))
    # Calculate Cohen's kappa
    kappa = (p_o - p_e) / (1 - p_e)

    # Approximate confidence interval
    # Equation 7 of Cohen (1960)
    se = np.sqrt((p_o * (1 - p_o)) / (n * (1 - p_e)**2))
    lower = kappa - 1.96 * se
    upper = kappa + 1.96 * se

    # Round to two decimal places

    kappa = float(f"{kappa:.2f}")
    lower = float(f"{lower:.2f}")
    upper = float(f"{upper:.2f}")

    return kappa, (lower, upper)
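
As a sanity check on the arithmetic, here is a minimal pure-Python sketch of the same kappa formula applied to a small made-up pair of answer lists. The helper name `cohen_kappa_sketch` and the example data are assumptions for illustration only; this is not the function used in the analysis.

```python
from collections import Counter

def cohen_kappa_sketch(answers_a, answers_b):
    """Cohen's kappa for two equal-length lists of 'yes'/'no' labels (illustrative only)."""
    n = len(answers_a)
    # Observed agreement: fraction of items on which both raters give the same label
    p_o = sum(a == b for a, b in zip(answers_a, answers_b)) / n
    # Chance agreement: product of the raters' marginal proportions, summed over labels
    count_a, count_b = Counter(answers_a), Counter(answers_b)
    p_e = sum((count_a[label] / n) * (count_b[label] / n) for label in ('yes', 'no'))
    return (p_o - p_e) / (1 - p_e)

# Made-up answers: the raters agree on 4 of 6 items
rater_a = ['yes', 'yes', 'no', 'no', 'yes', 'no']
rater_b = ['yes', 'no', 'no', 'no', 'yes', 'yes']
print(round(cohen_kappa_sketch(rater_a, rater_b), 2))  # 0.33
```

Here $p_o = 4/6$ and $p_e = 0.5$, so $\kappa = (4/6 - 0.5)/0.5 = 1/3$. If both raters gave the same single label to every item, $p_e$ would equal 1 and this sketch would raise a `ZeroDivisionError`.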

### Boolean columns

The following function is used to process values for columns that have boolean values (**llm**, **review** and **structured_data**).

Its main purpose is to convert all the YES/NO values in the list to their lowercase form because `kappa_calculation` needs consistent 'yes' or 'no' values.

**NOTE: Please make sure that all the values from reviewers are either 'YES'/'yes' or 'NO'/'no', otherwise it may throw an error.**
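
One way to check a column before computing agreement is sketched below. The helper `find_invalid_answers` is hypothetical and not part of the original notebook; it only lowercases values, mirroring what `kappa_boolean` does, so entries with stray whitespace are flagged as well.

```python
def find_invalid_answers(answers):
    """Return (position, value) pairs for entries that are not a yes/no string.

    Hypothetical helper for illustration; it only lowercases, like `kappa_boolean`.
    """
    invalid = []
    for position, value in enumerate(answers):
        if not isinstance(value, str) or value.lower() not in {'yes', 'no'}:
            invalid.append((position, value))
    return invalid

# Made-up column with a trailing space, a missing value and a free-text answer
example_column = ['YES', 'no', 'Yes ', None, 'N/A']
print(find_invalid_answers(example_column))  # [(2, 'Yes '), (3, None), (4, 'N/A')]
```

An empty result means the column is safe to pass to `kappa_boolean`.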


In [4]:
def kappa_boolean(List1, List2):
    """ Function calculating Cohen's Kappa coefficient for two lists of boolean values. Reviewers were asked
    to answer "YES" or "NO", so those answers need to be converted to lowercase.

    Parameters:
    List1, List2 (list[str]): Answers retrieved by the reviewers, should be "YES" or "NO"

    Returns:
    kappa, (lower, upper)
    A float representing the calculated Cohen's Kappa coefficient of the two lists and a tuple of
    two floats representing the lower and the upper ends of the confidence interval for the kappa coefficient.
    """

    List1 = [(val).lower() for val in List1]
    List2 = [(val).lower() for val in List2]

    return kappa_calculation(List1, List2)


### Non Boolean Columns


This function will be used to process non boolean columns like **llm_name** and **list_of_medical_conditions**. The values in these columns represent lists of tokens separated by commas. We used a one-hot encoding with the union of all values for each entry to generate a list of 'yes'/'no' answers and calculate agreement with `kappa_calculation`. Here is an example of how the function works:

Say the responses to the first article are

    List1[0] = ['BERT', 'ClinicalBERT']
    List2[0] = ['BIOBERT', 'BERT'].

To create the one-hot encodings corresponding to the first article, we first make a union vector of all responses

    UNION = ['BERT','ClinicalBERT','BIOBERT']

Next, the encoding for each list will be a vector of same length as UNION with the i-th entry being 'yes' if UNION[i] is in the list and 'no' otherwise. For our example we'll have

    list1: ['yes', 'yes', 'no']
    list2: ['yes', 'no', 'yes']

This process is repeated for all articles and the one-hot-encodings are concatenated.

We treat the case in which neither of the reviewers identified any tokens as both reviewers agreeing that the answer should be null, so we add a 'yes' to each of the one-hot encoding lists.
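
The encoding of a single article can be sketched as follows. The function `one_hot_pair` is a hypothetical stand-alone helper, not the notebook's implementation. It sorts the union so that the output is reproducible, which means the order differs from the worked example above; the agreement value does not depend on the order as long as both vectors use the same one.

```python
def one_hot_pair(tokens1, tokens2):
    """Encode two token lists against their sorted union as 'yes'/'no' vectors (illustrative)."""
    union = sorted(set(tokens1) | set(tokens2))
    if not union:
        # Both reviewers left the entry empty: count it as a single agreement
        return union, ['yes'], ['yes']
    encoded1 = ['yes' if token in tokens1 else 'no' for token in union]
    encoded2 = ['yes' if token in tokens2 else 'no' for token in union]
    return union, encoded1, encoded2

# Tokens are assumed to be already upper-cased and stripped of white space
union, list1, list2 = one_hot_pair(['BERT', 'CLINICALBERT'], ['BIOBERT', 'BERT'])
print(union)  # ['BERT', 'BIOBERT', 'CLINICALBERT']
print(list1)  # ['yes', 'no', 'yes']
print(list2)  # ['yes', 'yes', 'no']
```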



In [5]:
def remove_whitespace_and_capitalize(input_string):
    """ Helper function used to pre-process the text in a list of token data. This function is used to ensure
    that casing and white space are ignored when comparing answers from reviewers. For example, 'gpt3' will
    be equivalent to 'GPT3' and 'Clinical BERT' to 'clinicalBERT'.
    """
    # Remove white spaces
    no_whitespace = input_string.replace(" ", "")

    # Convert to uppercase
    uppercase_string = no_whitespace.upper()

    return uppercase_string

In [6]:
def kappa_non_boolean(List1, List2):
    """ Function that converts list-of-tokens responses for non boolean parameters into boolean vectors
    and calculates the inter-rater agreement.

    Parameters:
    List1, List2 (list[str]): lists representing the extracted values for non-boolean parameters. Please note that for non-boolean data
                              each entry may have several answers. For example, several LLMs might have been used in a single article.
                              In this case, the entry corresponding to that article would be a string enumeration of all the LLMs used.
                              ex: ['', 'BERT, ClinicalBERT' , 'Glove, BERT', '', ...]

    Returns:
    kappa, (lower, upper)
    A float representing the calculated Cohen's Kappa coefficient of the two lists and a tuple of
    two floats representing the lower and the upper ends of the confidence interval for the kappa coefficient.

    """

    one_hot_list1 = []
    one_hot_list2 = []

    for ind in range(len(List1)):

        # remove white space, convert to upper case and split on commas
        string1 = remove_whitespace_and_capitalize(str(List1[ind])).split(',')
        string2 = remove_whitespace_and_capitalize(str(List2[ind])).split(',')

        # remove empty string or nan values
        string1 = [item for item in string1 if item != '' and item != 'NAN']
        string2 = [item for item in string2 if item != '' and item != 'NAN']

        # Make a union list for each index.
        UNION = list(set(string1) | set(string2))

        # if neither of the reviewers identified any values for the parameter, then they agree that the answer should be null,
        # so we will add a 'yes' to each of the one-hot-encoding lists.
        if len(UNION) == 0:
            one_hot_list1.append('yes')
            one_hot_list2.append('yes')

        # otherwise, we generate one-hot encodings of 'yes' and 'no' based on presence of entries in the UNION
        else:
            for each_item in UNION:

                if each_item in string1:
                    one_hot_list1.append('yes')
                else:
                    one_hot_list1.append('no')

                if each_item in string2:
                    one_hot_list2.append('yes')
                else:
                    one_hot_list2.append('no')

    # kappa_calculation
    final_ans = kappa_calculation(one_hot_list1, one_hot_list2)

    return final_ans


### Agreement between a pair of reviewers

This function calculates the inter-rater agreement between two reviewers for all the parameters of interest. We still use Cohen's kappa for the main agreement value, as we did in the Screening phase, but adjust the variance estimates following the jackknife approach from Blackman and Koval (2000).

The approximate variance given in Cohen (1960, equation 7) used previously is known to suffer from asymmetries in the data. The jackknife approach, seen in Blackman and Koval (2000) provides a less inflated variance estimate. The confidence interval is still based on an asymptotic normal distribution of the estimated coefficient of agreement and can still lead to values outside of the range. However, these will almost certainly be closer to the true range of the possible values of kappa.

Jackknife estimation of the variance consists of getting a first estimate of kappa and then, one at a time, removing an observation and calculating the new estimate. These estimates obtained by "leaving one out" are used in the jackknife variance formula

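$$\widehat{\mathrm{Var}}(\hat{\kappa}) = \frac{n-1}{n} \sum_{i=1}^{n} \left(\hat{\kappa}_{(-i)} - \hat{\kappa}\right)^2$$

where $\hat{\kappa}$ is the estimate using all the data and $\hat{\kappa}_{(-i)}$ is the leave-one-out estimate obtained with observation $i$ removed.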
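
The leave-one-out procedure can be sketched generically as below. The function `jackknife_variance` is a hypothetical helper that accepts any statistic; the sample mean is used here only because its jackknife variance is easy to verify by hand. For kappa, the observations would be the paired answers of the two raters.

```python
def jackknife_variance(observations, statistic):
    """Jackknife variance of `statistic`, centred on the full-sample estimate (illustrative)."""
    n = len(observations)
    full_estimate = statistic(observations)
    # Recompute the statistic n times, each time leaving out one observation
    leave_one_out = [statistic(observations[:i] + observations[i + 1:]) for i in range(n)]
    return (n - 1) / n * sum((value - full_estimate) ** 2 for value in leave_one_out)

def mean(values):
    return sum(values) / len(values)

data = [2.0, 4.0, 6.0, 8.0]
variance = jackknife_variance(data, mean)
print(round(variance, 3))  # 1.667, the sample variance (20/3) divided by n = 4
```

A 95% interval then follows as the full-sample estimate plus or minus 1.96 times the square root of this variance, as in the `Kappa` function.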




In [7]:
def Kappa(rater1, rater2):
    """ Function that calculates inter-rater agreement between `rater1` and `rater2` across all parameters
    of interest.

    Parameters:
    rater1, rater2: dataframes corresponding to the two reviewers. It is assumed that the dataframes have a column
                    for each parameter of interest (defined within the body of the function), that the number of rows
                    is equal and the answers follow expected formatting (YES/NO, list of tokens, etc.)

    Returns:
    dict {str: (float, (float, float))}
    A collection mapping each parameter to the calculated Cohen's Kappa and confidence interval.
    """

    n = rater1.shape[0]
    boolean_columns = ['review','llm', 'structured_data']
    non_boolean_columns = ['llm_name', 'list_of_medical_conditions']
    result = {}

    for column in boolean_columns:
        List1 = rater1[column].to_list()
        List2 = rater2[column].to_list()

        result[column] = {}
        pseudoKappas = []
        kappaEstimate, _ = kappa_boolean(List1,List2)

        for idx_to_remove in range(n):
            # leave out the `idx_to_remove` observation
            pseudoRater1 = rater1.drop(idx_to_remove).reset_index(drop=True)
            pseudoRater2 = rater2.drop(idx_to_remove).reset_index(drop=True)

            pseudoList1 = pseudoRater1[column].to_list()
            pseudoList2 = pseudoRater2[column].to_list()

            # calculate the kappa coefficient with all observations except `idx_to_remove`
            pseudoKappa, _ = kappa_boolean(pseudoList1,pseudoList2)
            if not np.isnan(pseudoKappa):
                pseudoKappas.append(pseudoKappa)

        pseudoKappas = np.array(pseudoKappas)

        # calculate jackknife estimation for variance, and CI bounds
        estimatedVariance = ((n-1)/n)*np.sum((pseudoKappas-kappaEstimate)**2)
        lower = np.around(kappaEstimate-1.96*np.sqrt(estimatedVariance), 2)
        upper = np.around(kappaEstimate+1.96*np.sqrt(estimatedVariance), 2)
        result[column] = (kappaEstimate, (lower, upper))

    for column in non_boolean_columns:
        List1 = rater1[column].to_list()
        List2 = rater2[column].to_list()

        result[column] = {}
        pseudoKappas = []
        kappaEstimate, _ = kappa_non_boolean(List1,List2)

        for idx_to_remove in range(n):
            # leave out the `idx_to_remove` observation
            pseudoRater1 = rater1.drop(idx_to_remove).reset_index(drop=True)
            pseudoRater2 = rater2.drop(idx_to_remove).reset_index(drop=True)

            pseudoList1 = pseudoRater1[column].to_list()
            pseudoList2 = pseudoRater2[column].to_list()

            # calculate the kappa coefficient with all observations except `idx_to_remove`
            pseudoKappa, _ = kappa_non_boolean(pseudoList1,pseudoList2)
            if not np.isnan(pseudoKappa):
                pseudoKappas.append(pseudoKappa)

        pseudoKappas = np.array(pseudoKappas)

        # calculate jackknife estimation for variance, and CI bounds
        estimatedVariance = ((n-1)/n)*np.sum((pseudoKappas-kappaEstimate)**2)
        lower = np.around(kappaEstimate-1.96*np.sqrt(estimatedVariance), 2)
        upper = np.around(kappaEstimate+1.96*np.sqrt(estimatedVariance), 2)
        result[column] = (kappaEstimate, (lower, upper))
    return result



## Running the code

Below we calculate and display the agreement values for each pair of reviewers and consensus.

In [8]:
gpt_reviewer1 = Kappa(GPT,Reviewer1)
gpt_resolution = Kappa(GPT,Resolution)
reviewer1_resolution = Kappa(Reviewer1,Resolution)



In [9]:
gpt_reviewer1

{'review': (0.0, (0.0, 0.0)),
 'llm': (0.47, (0.05, 0.89)),
 'structured_data': (0.57, (0.23, 0.91)),
 'llm_name': (-0.18, (-0.3, -0.06)),
 'list_of_medical_conditions': (-0.2, (-0.35, -0.05))}

In [10]:
gpt_resolution

{'review': (0.0, (0.0, 0.0)),
 'llm': (0.59, (0.21, 0.97)),
 'structured_data': (0.65, (0.33, 0.97)),
 'llm_name': (-0.16, (-0.26, -0.06)),
 'list_of_medical_conditions': (-0.25, (-0.4, -0.1))}

In [11]:
reviewer1_resolution

{'review': (1.0, (1.0, 1.0)),
 'llm': (0.89, (0.67, 1.11)),
 'structured_data': (0.91, (0.74, 1.08)),
 'llm_name': (0.0, (0.0, 0.0)),
 'list_of_medical_conditions': (-0.16, (-0.34, 0.02))}

## Justification  

A literature review stands on two pillars: the guidelines for inclusion of documents and the subjective view of the reviewer. The guidelines presented in a PRISMA review are well accepted and consolidated. They are widely used and accepted as a generally good choice for the first pillar. The subjectivity of the reviewer after that point is under-considered in discussions of the quality of systematic reviews. Scientific texts contain ambiguity of varying degrees, and it may influence systematic reviews.

One approach to dealing with the subjectivity of reviews is to inspect agreement between reviewers. In this analysis a method is proposed for this. Consider a corpus of texts included in a review and consider the representation of the entities, in the sense of natural language processing, as tuples ($entity, semantics, attributes$). When analysing the corpus, all the entities of all the texts are analysed after discovery. For more details on this, see Bird et al. (2009), for example. The proposal is to consider all entities presented in the texts and evaluate, at the very least, whether independent reviewers consider the same entities as belonging to the same texts. If pairs of reviewers cannot agree on whether a particular entity belongs to a text, that would be a clear indication of ambiguity in the texts.

The method is applied to pairs of reviewers and consists of traversing the corpus of papers. For each paper, the union of the sets of entities annotated by each reviewer is obtained. A data frame is created in which each row corresponds to an entity in the union set and each column holds boolean values indicating whether each reviewer included the entity in that manuscript. The data frames of all papers are then stacked and Cohen's $\kappa$ coefficient of agreement is calculated for the lists of boolean values. A higher value of $\kappa$ indicates consensus between reviewers, the opposite being true for lower values. It is important to note that this proposal is aligned with the assumptions of Cohen (1960).

This idea is similar to what was used previously in Liu et al. (2018), where Cohen's kappa is used as an ad hoc metric for the agreement of machine learning methods. McHugh (2012) presents a suggested use of Cohen's kappa that creates a list of yes/no observations based on scores perceived by professionals, in a way that is very similar to the one-hot encoding proposed here.

One particular issue with data on systematic reviews is that sample size can be very restrictive. Most of the widely used formulas for $\kappa$ confidence intervals are approximations based on asymptotic properties of the sampling distributions of the proportions involved in its calculation. To avoid convergence issues or nonsensical values for the confidence intervals, one alternative is to use the jackknife estimator of the variance of $\kappa$ under the null hypothesis described in Cohen (1960). This is seen, for instance, in Blackman and Koval (2000).

## Calculating additional metrics for boolean parameters

Here we calculate accuracy, sensitivity, specificity and F1 score of GPT4 as compared to the resolution answers for the boolean parameters.

First we binarize the answers for boolean parameters by transforming them from "YES"-es and "NO"-s to 1s and 0s. Then we use the same function as in the Abstract Screening phase to calculate all statistics of interest.

In [12]:
def binarize_answers(reviewer, parameter):
  """ Given the dataframe containing the answers of a reviewer, extracts the values for
      'parameter' and converts them from "YES" and "NO" to 1 and 0 respectively.

      Parameters:
      reviewer: dataframe, must contain a column named as parameter
      parameter: string, the name of the column of interest. Answers in this column must
                 be either "YES" or "NO"

      Returns:
      A list of 1s and 0s
  """
  # we expect our answers to be capital YES and NO, but include lowercase options just in case
  acceptable_values_set = {"YES", "NO", "yes", "no"}
  boolean_yes_no_values = reviewer[parameter].to_list()

  # verify that all answers are in the YES/NO format
  for value in boolean_yes_no_values:
    if value not in acceptable_values_set:
      raise Exception("Answers of a rater must be 'YES' or 'NO'")

  binary_values = [1 if answer.lower() == "yes" else 0 for answer in boolean_yes_no_values]
  return binary_values


In [13]:
def calculate_stats(GPT_answers, ground_truth):
  """ Calculates Accuracy, Sensitivity, Specificity and F1 score for a list of binary answers given the ground truth.

      Parameters:
      GPT_answers, ground_truth: lists of 1s and 0s. Must have the same length.

      Returns:
      Dictionary mapping each calculated statistic to its corresponding value.
  """
  n = len(GPT_answers)
  true_positives = 0
  true_negatives = 0
  false_positives = 0
  false_negatives = 0

  for given_answer, correct_answer in zip(GPT_answers, ground_truth):
    if given_answer == correct_answer == 1:
      true_positives+=1
    elif given_answer == correct_answer == 0:
      true_negatives+=1
    # if reached here we know that given_answer != correct_answer since they can only take values of 1 or 0
    elif correct_answer == 1: # then given_answer == 0
      false_negatives+=1
    else: # correct_answer == 0 and given_answer == 1
      false_positives+=1

  if not (true_positives+true_negatives+false_positives+false_negatives == n):
    raise Exception('Something went wrong, check that your lists are binary and of equal length')

  answers_by_category = (true_positives, true_negatives, false_positives, false_negatives)



  precision = true_positives/(true_positives+false_positives)   # how many of classified positives are true positives
  sensitivity = true_positives/(true_positives+false_negatives) # how many of the actual positives are correctly identified as positives
  specificity = true_negatives/(true_negatives+false_positives) # how many of the actual negatives are identified as negatives
  accuracy = (true_positives+true_negatives)/n                  # how many of the examples are correctly classified
  f1_score = 2*precision*sensitivity/(precision+sensitivity)    # harmonic mean of precision and sensitivity

  return {'Results by category (TP, TN, FP, FN)': answers_by_category,
          'Accuracy': accuracy,
          'Sensitivity': sensitivity,
          'Specificity (sensitivity excluded)': specificity,
          'F1 score': f1_score}

In [14]:
# binarize both GPT and resolution responses for all boolean parameters ('review', 'llm', and 'structured_data')
GPT_review_binary = binarize_answers(GPT, 'review')
GPT_LLM_binary = binarize_answers(GPT, 'llm')
GPT_structured_data_binary = binarize_answers(GPT, 'structured_data')

Resolution_review_binary = binarize_answers(Resolution, 'review')
Resolution_LLM_binary = binarize_answers(Resolution, 'llm')
Resolution_structured_data_binary = binarize_answers(Resolution, 'structured_data')


Our results for the 'review' parameter are an edge case as the dataset contained no reviews or meta-analyses. We do not calculate statistics separately for the 'review' parameter because it entails division by 0. We do include the answers for the 'review' parameter in the overall concatenated vector of responses and they are considered in the overall statistical metrics we report.
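
If per-parameter statistics were wanted even in such an edge case, one option is to return NaN for an undefined ratio instead of raising an error. The sketch below uses a hypothetical helper, `safe_ratio`, and made-up counts; it is not part of the analysis reported here.

```python
import math

def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero (hypothetical helper)."""
    return numerator / denominator if denominator else math.nan

# Made-up counts: no answer was classified as positive, so precision is undefined
true_positives, false_positives = 0, 0
precision = safe_ratio(true_positives, true_positives + false_positives)
print(precision)  # nan
```

Any statistic built on an undefined ratio, such as the F1 score, would then also need to be reported as undefined rather than as a number.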

In [15]:
# stats_review = calculate_stats(GPT_review_binary, Resolution_review_binary)
stats_LLM = calculate_stats(GPT_LLM_binary, Resolution_LLM_binary)
stats_structured_data = calculate_stats(GPT_structured_data_binary, Resolution_structured_data_binary)
stats_overall = calculate_stats([*GPT_review_binary, *GPT_LLM_binary, *GPT_structured_data_binary],
                                [*Resolution_review_binary, *Resolution_LLM_binary, *Resolution_structured_data_binary])


# print("Results based only on Review parameter: \n")
# for key, val in stats_review.items():
#   print(key, ": ", val)

print("\n\nResults based only on LLM parameter: \n")
for key, val in stats_LLM.items():
  print(key, ": ", val)

print("\n\nResults based only on Structured Data parameter: \n")
for key, val in stats_structured_data.items():
  print(key, ": ", val)

print("\n\nOverall results: \n")
for key, val in stats_overall.items():
  print(key, ": ", val)



Results based only on LLM parameter: 

Results by category (TP, TN, FP, FN) :  (5, 14, 2, 2)
Accuracy :  0.8260869565217391
Sensitivity :  0.7142857142857143
Specificity (sensitivity excluded) :  0.875
F1 score :  0.7142857142857143


Results based only on Structured Data parameter: 

Results by category (TP, TN, FP, FN) :  (10, 9, 1, 3)
Accuracy :  0.8260869565217391
Sensitivity :  0.7692307692307693
Specificity (sensitivity excluded) :  0.9
F1 score :  0.8333333333333333


Overall results: 

Results by category (TP, TN, FP, FN) :  (15, 45, 3, 6)
Accuracy :  0.8695652173913043
Sensitivity :  0.7142857142857143
Specificity (sensitivity excluded) :  0.9375
F1 score :  0.7692307692307692


## References
1. Blackman, N. J., & Koval, J. J. (2000). Interval estimation for Cohen’s kappa as a measure of agreement. Statistics in Medicine, 19(5), 723–741. https://doi.org/10.1002/(sici)1097-0258(20000315)19:5<723::aid-sim379>3.0.co;2-a

2. Bird, S., Klein, E., & Loper, E. (2009). Natural Language Processing with Python. O'Reilly, Canada.

3. Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37–46. https://doi.org/10.1177/001316446002000104

4. Liu, W., Luo, Z., & Li, S. (2018). Improving deep ensemble vehicle classification by using selected adversarial samples. Knowledge-Based Systems, 160, 167–175. https://doi.org/10.1016/j.knosys.2018.06.035

5. McHugh, M. L. (2012). Interrater reliability: The kappa statistic. Biochemia Medica, 22(3), 276–282. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3900052/
