In this notebook we use a rule-based approach to identify negation and uncertainty cues and their scopes in a medical corpus.
The approach is built on Python's regular-expression library "re" and its matching methods.
The "json" library is used to parse the train and test files.

In [54]:
import re
import json

Parse Training and Testing data from the JSON files.

In [55]:
def parse_json(file_path):
  """Load a JSON file and return the parsed object, or None if the file is missing."""
  try:
    # Step 1: Open the file in read mode
    with open(file_path, "r") as json_file:
      # Step 2: Load the JSON data using json.load()
      parsed_file = json.load(json_file)
  except FileNotFoundError:
    print(f"Error: File not found at {file_path}")
    # Nothing was parsed, so signal the failure to the caller
    return None
  print("JSON data parsed successfully!")
  return parsed_file

In [56]:
training_set = parse_json("./train_data.json")
testing_set = parse_json("./test_data.json")

JSON data parsed successfully!
JSON data parsed successfully!


The data is split between the training and test sets in a roughly 80:20 ratio.

In [57]:
print(len(training_set))
print(len(testing_set))

254
64
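
As a quick check of the split, the training share can be computed from the two sizes printed above. This is only a sketch; `n_train` and `n_test` are hypothetical names holding those counts.

```python
# Sizes printed by the previous cell, hard-coded so the snippet runs on its own.
n_train, n_test = 254, 64

train_share = n_train / (n_train + n_test)
print(f"train share: {train_share:.3f}")
```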


From the training set, extract the ground-truth (GT) annotations, stored under the "predictions" key, and the texts.
From the test set, extract the texts.

In [58]:
predictions = [document["predictions"] for document in training_set]
texts = [document["data"]["text"] for document in training_set]
test_texts = [document["data"]["text"] for document in testing_set]

We are going to build the vocabulary of words that the rule-based model relies on.
First, we use CUTEXT, a tool that provides common medical terms in Spanish.
All of the terms contained in CUTEXT are candidates for our vocabulary.

In [59]:
def extract_terms_from_file(file_path):
    terms = []
    with open(file_path, 'r') as file:
        for line in file:
            if line.startswith("Term:"):
                term = line.split("Term:")[1].strip()
                terms.append(term)
    return terms

Filter out unwanted terms: those that start or end with a non-letter, contain "**" or parentheses, or are three characters or shorter.

In [60]:
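def parse_terms(extracted_terms):
  new_terms = []
  for term in extracted_terms:
    # Keep terms longer than 3 characters that start and end with a letter
    # and contain no "**" markers or parentheses.
    # The length check comes first so that an empty term is never indexed.
    if (len(term) > 3
        and term[0].isalpha() and term[-1].isalpha()
        and "**" not in term and "(" not in term and ")" not in term):
      new_terms.append(term)
  return new_terms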
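
To make the filter concrete, the sketch below applies the same conditions to a handful of sample terms. The helper `keep_term` and the sample terms are invented for illustration and are not taken from the CUTEXT output.

```python
def keep_term(term):
    # Same conditions as the filter above, applied to a single term.
    return (
        len(term) > 3
        and term[0].isalpha()
        and term[-1].isalpha()
        and "**" not in term
        and "(" not in term
        and ")" not in term
    )

# Invented sample terms.
samples = ["dolor abdominal", "tac", "2 dias", "fiebre (38)", "hipertension arterial"]
kept = [t for t in samples if keep_term(t)]
print(kept)  # ['dolor abdominal', 'hipertension arterial']
```

"tac" is dropped for its length, "2 dias" because it starts with a digit, and "fiebre (38)" because it ends with a parenthesis.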

In [61]:
file_path = "./terms_raw.txt"
# Extract terms from the file
cutext_terms = extract_terms_from_file(file_path)

In [62]:
print("Cutext terms before filtering: ", len(cutext_terms))

Cutext terms before filtering:  21554


In [63]:
cutext_terms = parse_terms(cutext_terms)
print("Cutext terms after filtering: ", len(cutext_terms))

Cutext terms after filtering:  18236


Extract NEG, UNC, NSCO and USCO from Training Data annotations (Ground-Truth)

In [64]:
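# Takes a list of (start, end) character offsets and returns the corresponding words from the text
def get_words(text, offsets):
  words = []
  for start, end in offsets:
    # Extend the start by one character when the preceding character is a letter.
    # The start > 0 guard keeps an offset of 0 from wrapping around to text[-1].
    if start > 0 and text[start-1].isalpha():
      s=start-1
    else:
      s=start
    # Drop the last character when it is not a letter (trailing space or punctuation)
    if text[end-1].isalpha():
      e=end
    else:
      e=end-1
    words.append(text[s:e])
  return words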
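
The boundary adjustment can be seen on a short sentence. This is a sketch on invented data; `trim_span` is a hypothetical single-span helper that follows the same two rules as the function above.

```python
def trim_span(text, start, end):
    # Extend the start by one character if the preceding character is a letter.
    s = start - 1 if start > 0 and text[start - 1].isalpha() else start
    # Drop the final character if it is not a letter.
    e = end if text[end - 1].isalpha() else end - 1
    return text[s:e]

text = "paciente no presenta fiebre."  # invented sentence

print(repr(trim_span(text, 9, 12)))   # span "no " with a trailing space -> 'no'
print(repr(trim_span(text, 10, 12)))  # span starting one character late -> 'no'
```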

In [65]:
# Parses a document and returns 4 lists of words, one for each category
def find_cues_and_scopes(document):
  text = document["data"]["text"]
  results = document["predictions"][0]["result"]

  def words_for(label):
    # Collect the (start, end) offsets of every annotation carrying this label
    offsets = [(r["value"]["start"], r["value"]["end"]) for r in results if label in r["value"]["labels"]]
    return get_words(text, offsets)

  return words_for("NEG"), words_for("UNC"), words_for("NSCO"), words_for("USCO")
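
For reference, the sketch below shows the document fields that the code above reads, using a minimal invented document. The field names come from the code; the text and offsets are made up.

```python
# Minimal invented document with only the fields accessed above.
doc = {
    "data": {"text": "paciente no presenta fiebre."},
    "predictions": [{"result": [
        {"value": {"start": 9, "end": 12, "labels": ["NEG"]}},
        {"value": {"start": 12, "end": 27, "labels": ["NSCO"]}},
    ]}],
}

text = doc["data"]["text"]
by_label = {}
for item in doc["predictions"][0]["result"]:
    value = item["value"]
    for label in value["labels"]:
        # Slice the raw annotated span out of the text.
        by_label.setdefault(label, []).append(text[value["start"]:value["end"]])

print(by_label)  # {'NEG': ['no '], 'NSCO': ['presenta fiebre']}
```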

In [66]:
NEG = set()
UNC = set()
NSCO = set()
USCO = set()

for document in training_set:
  neg_words, unc_words, nsco_words, usco_words = find_cues_and_scopes(document)

  NEG.update(neg_words)
  UNC.update(unc_words)
  NSCO.update(nsco_words)
  USCO.update(usco_words)

# Remove spaces and punctuation signs from the start and end of each string
NEG = {word.strip(" ,.!?;)") for word in NEG}
UNC = {word.strip(" ,.!?);") for word in UNC}
NSCO = {word.strip(" ,.!?;)") for word in NSCO}
USCO = {word.strip(" ,.!?);") for word in USCO}

# A few negation cues in NEG also appear as UNC in a small number of annotations.
# To avoid labeling one word as both UNC and NEG, we remove from UNC every word that is also in NEG,
# because those words are annotated as NEG about 90% of the time in the GT.
UNC -= NEG

We now have a vocabulary for each of the 4 categories of words.

NEG and UNC must be disjoint. There is no reason to apply the same rule to NSCO and USCO, as explained next.

To find negations, we first build a regex by joining all the words in NEG and use it to locate the negation cues.

Next, we look for the scope next to each cue: up to 5 vocabulary words after the cue, or a vocabulary term directly before it.

The same algorithm is applied to uncertainties.

Because a scope is only ever searched for next to a cue, the cue decides the label: a scope found next to a NEG cue is an NSCO, and a scope found next to a UNC cue is a USCO. It is therefore impossible to match a NEG with a USCO or a UNC with an NSCO, so we can safely combine all the scopes into one big vocabulary to achieve greater coverage.
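
The idea can be sketched with toy vocabularies. The word lists and the sentence below are invented; the pattern has the same shape as the cue-then-scope regexes defined later in the notebook, with `re.escape` added as a precaution.

```python
import re

# Toy vocabularies, invented for illustration.
toy_neg = ["no", "sin"]
toy_scope = ["alergias", "medicamentosas", "conocidas", "fiebre"]

neg_alt = "|".join(re.escape(w) for w in toy_neg)
scope_alt = "|".join(re.escape(w) for w in toy_scope)

# A cue, whitespace, then zero to five vocabulary words as the scope.
pattern = rf"\b({neg_alt})\b\s+((?:\b(?:{scope_alt})\b\s*){{0,5}})"

text = "paciente sin alergias medicamentosas conocidas hasta la fecha."
m = re.search(pattern, text)
print(m.group(1), "->", repr(m.group(2)))
```

The scope stops at "hasta" because that word is not in the toy vocabulary, and the captured scope keeps its trailing space.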


In [67]:
ALL_SCOPES = NSCO.union(USCO)

print("ALL_SCOPES size: ",  len(ALL_SCOPES))

# A set with all individual words from the scopes
SCOPE_words = set()         # ['erc', '(29/05/18)', 'ser', 'visibles', 'extratono', 'inicia', 'valor', 'frialdad', 'medicamentoses', 'neoformativo']
for scope in ALL_SCOPES:
  SCOPE_words.update(scope.split())

print("Scopes words before processing: ", len(SCOPE_words))

# Remove all symbols and numbers from the set
SCOPE_words = {word for word in SCOPE_words if word.isalpha()}
SCOPE_words = list(SCOPE_words)


print("Scopes words after processing: ", len(SCOPE_words))

ALL_SCOPES size:  2617
Scopes words before processing:  3184
Scopes words after processing:  2895


Combine SCOPE_words with the filtered CUTEXT terms to achieve even greater coverage.

In [68]:
extracted_terms = list(set(cutext_terms+SCOPE_words))

In [69]:
print("Total size of medical scopes terms: ", len(extracted_terms))

Total size of medical scopes terms:  19948


The vocabulary is now complete, and the next step is to build a regex from it.

Regex alternation tries its alternatives from left to right and keeps the first one that matches, so to match the longest possible term we place the terms in the regex in descending order of length.

In [70]:
extracted_terms.sort(key=len,reverse=True)
print(extracted_terms[:10])

['intervencion quirurgica de retirada de material de osteosintesis', 'desgarros puerperio procedimientos venoclisis monitorizacion nst', 'per metapneumovirus procediments aspirat nasofaringi tractament', 'iq antecedents antecedents patologic asma intermitent lleu', 'lesiones de caracteristicas inflamatorio-desmielinizantes', "anys procedencia aguts servei obstetricia data d'ingres", "anys procedencia aguts servei traumatics data d'ingres", "mateix hosp servei reconstruc osteoartic data d'ingres", 'genoll dret antecedents antecedents patologic diabetis', "anys procedencia aguts servei nefrologia data d'ingres"]
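
The effect of the ordering can be checked on two invented terms where one is a prefix of the other. This is a minimal sketch, not part of the pipeline.

```python
import re

terms = ["dolor", "dolor abdominal"]  # invented examples

short_first = "|".join(terms)
long_first = "|".join(sorted(terms, key=len, reverse=True))

text = "dolor abdominal intenso"
first = re.search(rf"\b({short_first})\b", text).group(1)
second = re.search(rf"\b({long_first})\b", text).group(1)
print(first)   # dolor
print(second)  # dolor abdominal
```

With the shorter term first, the alternation stops at "dolor" and never tries the longer term.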


Prepare REGEX

In [71]:
NEG_pattern = "|".join(NEG)
UNC_pattern = "|".join(UNC)
SCOPE_pattern = "|".join(extracted_terms)
SCOPE_pattern_baseline = "|".join(SCOPE_words)

In [72]:
# SCOPE pattern just for CUTEXT.
SCOPE_pattern_CUTEXT = "|".join(cutext_terms)
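
The patterns above join the vocabulary entries as they are. If an entry contained a regex metacharacter such as "." or "+", it would be read as an operator rather than as a literal character. One possible safeguard, shown here on invented terms, is to pass each entry through `re.escape` before joining.

```python
import re

terms = ["ca.", "a+b"]  # invented terms containing regex metacharacters

raw = "|".join(terms)
escaped = "|".join(re.escape(t) for t in terms)

# Unescaped, "." matches any character and "+" means "one or more".
raw_hits = re.findall(raw, "cab aab")
# Escaped, only the literal strings match.
escaped_hits = re.findall(escaped, "cab aab")
literal_hits = re.findall(escaped, "ca. a+b")

print(raw_hits, escaped_hits, literal_hits)
```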

In [108]:
# Regex for identifying the scopes that appear before the cues.
regex_neg_pos=rf"\b({SCOPE_pattern})\b\s\b({NEG_pattern})\b"
regex_unc_pos=rf"\b({SCOPE_pattern})\b\s\b({UNC_pattern})\b"

REGEX Baseline

In [109]:
regex_neg_pre =rf"\b({NEG_pattern})\b\s+((?:\b(?:{SCOPE_pattern_baseline})\b\s*){{0,5}})"
regex_unc_pre =rf"\b({UNC_pattern})\b\s+((?:\b(?:{SCOPE_pattern_baseline})\b\s*){{0,5}})"

REGEX CUTEXT

In [117]:
regex_neg_pre =rf"\b({NEG_pattern})\b\s+((?:\b(?:{SCOPE_pattern_CUTEXT})\b\s*){{0,5}})"
regex_unc_pre=rf"\b({UNC_pattern})\b\s+((?:\b(?:{SCOPE_pattern_CUTEXT})\b\s*){{0,5}})"

REGEX 1 (CUTEXT + Scope_words)

In [121]:
regex_neg_pre =rf"\b({NEG_pattern})\b\s+((?:\b(?:{SCOPE_pattern})\b\s*){{0,5}})"
regex_unc_pre=rf"\b({UNC_pattern})\b\s+((?:\b(?:{SCOPE_pattern})\b\s*){{0,5}})"

Upon qualitative analysis of the ground truths, we concluded that taking all the words after the cue until the end of the sentence as the scope is a viable approach.

REGEX that takes all the words until the end of the sentence

In [125]:
regex_neg_pre = rf"\b({NEG_pattern})\b\s*(.*?)\."
regex_unc_pre = rf"\b({UNC_pattern})\b\s*(.*?)\."
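
A small sketch of how this pattern behaves, using an invented cue list and sentence. Note that "." in a Python regex does not match a newline by default, so the scope ends at the first period on the same line as the cue.

```python
import re

toy_neg = "no|sin"  # invented cue list
pattern = rf"\b({toy_neg})\b\s*(.*?)\."

text = "afebril. no presenta dolor toracico. se decide alta."
matches = [(m.group(1), m.group(2)) for m in re.finditer(pattern, text)]
print(matches)  # [('no', 'presenta dolor toracico')]
```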

In [78]:
print(len(SCOPE_pattern))

323583


Making Predictions

In [126]:
predictions = []
for _ in range(len(test_texts)):
  predictions.append({"NEG":set(),"NSCO":set(),"UNC":set(),"USCO":set()})


for doc_id, test_text in enumerate(test_texts):
  # re.finditer returns a lazy iterator (never None), so it can be looped over directly
  neg_scopes_pre_matches = re.finditer(regex_neg_pre, test_text)
  neg_scopes_pos_matches = re.finditer(regex_neg_pos, test_text)  # not consumed below
  unc_scopes_pre_matches = re.finditer(regex_unc_pre, test_text)
  unc_scopes_pos_matches = re.finditer(regex_unc_pos, test_text)  # not consumed below

  for match in neg_scopes_pre_matches:
      # Get the matched cue and its starting/ending positions
      # (the end position includes the character that follows the cue)
      matched_word = match.group(1)
      start_pos = match.start(1)
      end_pos = match.end(1)+1

      predictions[doc_id]["NEG"].add((start_pos,end_pos,matched_word))

      # Get the scope, which starts where the cue span ends
      scope_word = match.group(2)
      sc_start_pos = end_pos
      sc_end_pos = match.end(2)+1

      predictions[doc_id]["NSCO"].add((sc_start_pos,sc_end_pos,scope_word))

  for match in unc_scopes_pre_matches:
      # Get the matched cue and its starting/ending positions
      matched_word = match.group(1)
      start_pos = match.start(1)
      end_pos = match.end(1)+1

      predictions[doc_id]["UNC"].add((start_pos,end_pos,matched_word))

      # Get the scope, which starts where the cue span ends
      scope_word = match.group(2)
      sc_start_pos = end_pos
      sc_end_pos = match.end(2)+1

      predictions[doc_id]["USCO"].add((sc_start_pos,sc_end_pos,scope_word))

Sort the text predictions by starting point

In [111]:
for doc_predictions in predictions:
    for key, value in doc_predictions.items():
      # Replace each set with a list sorted by start offset
      doc_predictions[key] = sorted(value, key=lambda x: x[0])

In [112]:
print(predictions[0])

{'NEG': [(395, 398, 'no'), (499, 505, 'niega'), (1141, 1144, 'no'), (1163, 1166, 'no'), (1313, 1322, 'negativo'), (2118, 2122, 'sin')], 'NSCO': [(398, 433, 'alergias medicamentosas conocidas '), (505, 541, 'habitos toxicos medicacio habitual '), (1144, 1151, 'inmune'), (1166, 1173, 'immune'), (1322, 1323, ''), (2122, 2134, 'incidencias')], 'UNC': [(3460, 3466, 'puede')], 'USCO': [(3466, 3467, '')]}


Get ground truth from testing_set

In [97]:
def get_gt_format(document):
    neg_predictions, unc_predictions, nsco_predictions, usco_predictions = [], [], [], []
    text = document["data"]["text"]
    for result_element in document["predictions"][0]["result"]:
        start = result_element["value"]["start"]
        end = result_element["value"]["end"]
        if "NEG" in result_element["value"]["labels"]:
            neg_predictions.append((start, end, text[start:end]))
        if "UNC" in result_element["value"]["labels"]:
            unc_predictions.append((start, end, text[start:end]))
        if "NSCO" in result_element["value"]["labels"]:
            nsco_predictions.append((start, end, text[start:end]))
        if "USCO" in result_element["value"]["labels"]:
            usco_predictions.append((start, end, text[start:end]))

    return neg_predictions, unc_predictions, nsco_predictions, usco_predictions

In [103]:
# FORMAT : {LABEL: [(START, END, WORD), ...]}
def get_ground_truth(document):
    neg_results, unc_results, nsco_results, usco_results = get_gt_format(document)

    neg_results_sorted = sorted(neg_results, key=lambda x: x[0])
    unc_results_sorted = sorted(unc_results, key=lambda x: x[0])
    nsco_results_sorted = sorted(nsco_results, key=lambda x: x[0])
    usco_results_sorted = sorted(usco_results, key=lambda x: x[0])

    ground_truth_dict = {"NEG": neg_results_sorted, "UNC": unc_results_sorted, "NSCO": nsco_results_sorted, "USCO": usco_results_sorted}

    return ground_truth_dict

get_ground_truth(testing_set[0])

{'NEG': [(395, 398, 'no '),
  (499, 505, 'niega '),
  (1111, 1119, 'negativo'),
  (1141, 1144, 'no '),
  (1163, 1166, 'no '),
  (1194, 1203, 'negativos'),
  (2118, 2122, 'sin ')],
 'UNC': [],
 'NSCO': [(398, 422, 'alergias medicamentosas '),
  (505, 521, 'habitos toxicos '),
  (1107, 1111, 'vih '),
  (1144, 1150, 'inmune'),
  (1166, 1172, 'immune'),
  (1174, 1194, 'lues vih, vhb y vhc '),
  (2122, 2133, 'incidencias')],
 'USCO': []}

In [104]:
# List of GT dictionaries, one per document in the test set
ground_truths = [get_ground_truth(document) for document in testing_set]

Calculate Metrics

In [113]:
def calculate_metrics(predictions,ground_truths):
  labels = ("NEG","NSCO","UNC","USCO")
  precision = dict.fromkeys(labels, 0)
  recall = dict.fromkeys(labels, 0)
  f1 = dict.fromkeys(labels, 0)
  tp = dict.fromkeys(labels, 0)
  num_of_predictions = dict.fromkeys(labels, 0)
  num_of_ground_truths = dict.fromkeys(labels, 0)
  for d1,d2 in zip(predictions,ground_truths):

    for key in d1:
      for elem in d1[key]:
        for elem2 in d2[key]:
          # We allow an error of 1 character between the GT and our predictions
          # because the GT tagging is inconsistent: a trailing punctuation sign or space
          # is often included in the annotated span.
          if abs(elem[0]-elem2[0]) <= 1 and abs(elem[1]-elem2[1]) <=1:
            tp[key]+=1
            break

      num_of_predictions[key]+=len(d1[key])
      num_of_ground_truths[key]+=len(d2[key])

  for key in precision:
    # Guard each ratio against a zero denominator; the metric stays 0 in that case
    if num_of_predictions[key]:
      precision[key] = tp[key]/num_of_predictions[key]
    if num_of_ground_truths[key]:
      recall[key] = tp[key]/num_of_ground_truths[key]
    if precision[key]+recall[key]:
      f1[key] = 2*precision[key]*recall[key]/(precision[key]+recall[key])


  return precision, recall, f1
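
The matching rule and the three metrics can be traced by hand on a few spans. The sketch below uses a hand-picked subset of the spans printed earlier; `spans_match` is a hypothetical helper name, and this is a simplified single-label version rather than the function above.

```python
def spans_match(pred, gold, tol=1):
    # Both boundaries must lie within `tol` characters of the gold span.
    return abs(pred[0] - gold[0]) <= tol and abs(pred[1] - gold[1]) <= tol

# Spans as (start, end, text).
pred = [(395, 398, "no"), (1313, 1322, "negativo")]
gold = [(395, 398, "no "), (499, 505, "niega ")]

# A prediction is a true positive if it matches at least one gold span.
tp = sum(any(spans_match(p, g) for g in gold) for p in pred)
precision = tp / len(pred)
recall = tp / len(gold)
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
print(tp, precision, recall, f1)  # 1 0.5 0.5 0.5
```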


In [127]:
precision, recall, f1 = calculate_metrics(predictions,ground_truths)

RESULTS

Baseline


In [116]:
print(precision)
print(recall)
print(f1)

{'NEG': 0.9088345864661654, 'NSCO': 0.5281954887218046, 'UNC': 0.5662650602409639, 'USCO': 0.14457831325301204}
{'NEG': 0.8542402826855123, 'NSCO': 0.5232774674115456, 'UNC': 0.7175572519083969, 'USCO': 0.18604651162790697}
{'NEG': 0.8806921675774135, 'NSCO': 0.5257249766136575, 'UNC': 0.632996632996633, 'USCO': 0.16271186440677965}


CUTEXT




In [120]:
print(precision)
print(recall)
print(f1)

{'NEG': 0.9058713886300093, 'NSCO': 0.16775396085740912, 'UNC': 0.5562130177514792, 'USCO': 0.07692307692307693}
{'NEG': 0.8586572438162544, 'NSCO': 0.16759776536312848, 'UNC': 0.7175572519083969, 'USCO': 0.10077519379844961}
{'NEG': 0.8816326530612244, 'NSCO': 0.16767582673497902, 'UNC': 0.6266666666666667, 'USCO': 0.08724832214765099}


CUTEXT + Scope_words REGEX 1

In [124]:
print(precision)
print(recall)
print(f1)

{'NEG': 0.9082308420056765, 'NSCO': 0.5723746452223274, 'UNC': 0.5662650602409639, 'USCO': 0.21084337349397592}
{'NEG': 0.8480565371024735, 'NSCO': 0.5633147113594041, 'UNC': 0.7175572519083969, 'USCO': 0.2713178294573643}
{'NEG': 0.8771128369118318, 'NSCO': 0.5678085405912718, 'UNC': 0.632996632996633, 'USCO': 0.23728813559322035}


Until the end of the sentence


In [128]:
print(precision)
print(recall)
print(f1)

{'NEG': 0.9134615384615384, 'NSCO': 0.55, 'UNC': 0.5253164556962026, 'USCO': 0.2721518987341772}
{'NEG': 0.8392226148409894, 'NSCO': 0.5325884543761639, 'UNC': 0.6335877862595419, 'USCO': 0.3333333333333333}
{'NEG': 0.874769797421731, 'NSCO': 0.5411542100283822, 'UNC': 0.5743944636678201, 'USCO': 0.29965156794425085}
