**Fundamentals of Natural Language Processing**


# Negation and Uncertainty Detection Project

**Installing the necessary packages:**

We install spaCy for natural language processing, along with its Spanish and Catalan models. We also install `unidecode` for text cleaning.

In [57]:
# Install required packages (execute only if not already installed)
!pip install spacy




In [58]:
# Install required packages
!pip install spacy unidecode
!python -m spacy download es_core_news_sm
!python -m spacy download ca_core_news_sm


Collecting es-core-news-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/es_core_news_sm-3.8.0/es_core_news_sm-3.8.0-py3-none-any.whl (12.9 MB)
✔ Download and installation successful
You can now load the package via spacy.load('es_core_news_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
Collecting ca-core-news-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/ca_core_news_sm-3.8.0/ca_core_news_sm-3.8.0-py3-none-any.whl (19.6 MB)
✔ Download and i

### Testing on the Test Set
We use the **`normalize_text`** function for text preprocessing.

Pattern detection is improved by adding helper functions like **`detect_multiword_pattern`** to identify multi-word negation phrases. We also integrate four rule types into the **`analyze_medical_context`** function: prefix, postfix, UMLS_Term + Negation_Phrase, and Negation_Phrase + UMLS_Term.

All detected matches are logged, counters are updated, and relevant examples are stored for the final output.







**Importing libraries for text processing, NLP, and file handling in Colab**

In [59]:
import json
import re
import unicodedata
from collections import Counter, defaultdict
import spacy
from google.colab import files



**Text normalization:**

Before starting any analysis, we define a function to normalize the clinical text, removing special characters, extra spaces, and anything that might interfere with proper analysis.

In [60]:
def normalize_text(text):
    """Normalize medical text with Spanish/Catalan character support."""
    text = text.lower()
    text = unicodedata.normalize('NFC', text)
    # Collapse runs of whitespace into a single space
    text = re.sub(r'\s+', ' ', text).strip()
    # Remove runs of asterisks
    text = re.sub(r'\*+', '', text)
    # Keep word characters, whitespace, basic punctuation, the apostrophe (') and accented letters
    text = re.sub(r"[^\w\s.,;:!?'\-àáèéìíòóùúüñç]", '', text)
    return text


def normalize_for_eval(text):
    """Normalize text for evaluation; applies exactly the same steps as normalize_text."""
    return normalize_text(text)
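
As a quick check of what the normalizer does, the sketch below restates `normalize_text` so it runs on its own and applies it to two made-up strings. The sample inputs are illustrative assumptions, not sentences from the dataset.

```python
import re
import unicodedata


def normalize_text(text):
    """Same steps as the notebook's normalize_text, repeated so this runs standalone."""
    text = unicodedata.normalize('NFC', text.lower())
    text = re.sub(r'\s+', ' ', text).strip()
    text = re.sub(r'\*+', '', text)
    return re.sub(r"[^\w\s.,;:!?'\-àáèéìíòóùúüñç]", '', text)


# Illustrative inputs (assumed, not from the corpus)
sample = "Paciente  **NO** presenta fiebre (ni tos)."
masked = "nombre *** edad"

clean_sample = normalize_text(sample)
clean_masked = normalize_text(masked)

print(repr(clean_sample))  # 'paciente no presenta fiebre ni tos.'
# Two spaces remain: asterisks are removed after whitespace has been collapsed
print(repr(clean_masked))  # 'nombre  edad'
# The normalized string is shorter, so character offsets shift
print(len(sample), len(clean_sample))
```

Because normalization changes the string length, character offsets computed on the normalized text will not generally line up with offsets in the raw annotated text. The evaluation later in this notebook compares cue strings rather than offsets, so it is not affected by this shift.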

Here we define the key elements our system uses to detect negation and uncertainty:  
- A list of negation and uncertainty cues (including multi-word expressions).  
- A base list of UMLS medical terms.  
- A set of common medical suffixes to help catch additional terms not explicitly listed.

In [61]:

# Define lists of cues and UMLS medical terms
NEGATION_WORDS = [
    "no", "sin", "ausencia de", "descarta", "descartado", "excluye", "excluido", "niega", "negado",
    "negativa", "negación", "ningún", "ninguna", "ninguno", "imposible", "inhallable", "carece de", "nunca",
    "jamás", "tampoco", "ni", "nada", "negativo", "mai",
    "sin evidencia de", "no se observa", "no presenta", "no muestra", "no evidencia", "no compatible con",
    "no concluyente", "no parece", "no se detecta", "sin signos de", "sin síntomas de", "sin indicios de",
    "sin hallazgos de", "sin pruebas de", "sin rastro de", "ausente", "no encontrado", "sin cambios",
    "no se aprecian", "no se ven", "descartando", "descartable", "no hay evidencia de", "no hay indicación de",
    "libre de", "exento de", "sin manifestaciones de", "se excluye", "queda descartado", "ninguna evidencia de",
    "ningún signo de", "sin afección", "no identificado", "negado por el paciente", "negado clínicamente",
    "sin enfermedad", "sin afectación", "no afectado", "no positivo", "resultado negativo",
    "resultado no reactivo", "resultado no positivo"
]

UNCERTAINTY_WORDS = [
    "posible", "quizás", "podría", "sospecha de", "considera", "probable", "aparentemente", "puede", "posiblemente",
    "parece", "se considera", "indeterminado", "probabilidad de", "no concluyente", "eventual", "en estudio",
    "pendiente de evaluación", "sugestivo de", "sugiere", "indica que", "se sospecha de", "podría indicar",
    "dudoso", "no definido", "no específico", "no determinado", "valor incierto", "no claro", "no seguro",
    "compatible con", "aparenta ser", "tendría que evaluarse", "a determinar", "probabilidad baja de",
    "probabilidad alta de", "sin certeza", "hipotético", "hipotéticamente", "a confirmar", "falta de certeza",
    "en posible relación con", "estaría asociado", "aparentemente relacionado con", "se intuye", "se deduce que",
    "en consideració", "probablemente", "tal vez", "aproximadamente"
]

# Global list of UMLS medical terms (initially provided, without terms ending in a medical suffix)
UMLS_MEDICAL_TERMS = [
    "interna", "cistoscopia", "uretra", "cronica", "insuficiencia", "renal", "bloqueo", "auriculoventricular",
    "primer grado", "segundo grado", "hipertension", "arterial", "protesis", "cadera", "herniorrafia", "parto",
    "eutocico", "rotura", "membranas", "prematuro", "lactancia materna", "aguda", "parcial", "angiomiolipoma",
    "quistes", "renales", "fractura", "mandibular", "ictus", "infarto", "isquemico", "cerebral", "endovenosa",
    "microcirugia", "endolaringea", "sensitiva", "axonal", "multifactorial", "déficit"
]

# List of medical suffixes
MEDICAL_SUFFIXES = [
    "nosis", "tomia", "patia", "losis", "lisis", "iasis", "scopica", "scopia", "tocico", "itis",
    "algia", "oma", "emia", "plegia", "penia", "cele", "plasia", "ectasia", "uria"
]
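
Most entries in `UMLS_MEDICAL_TERMS` are written without accents (`cronica`, `hipertension`), while `normalize_text` keeps accented letters, so an exact lookup of `crónica` would miss. The sketch below shows one possible way to make the lookup accent-insensitive using only the standard library. `strip_accents` is a hypothetical helper that is not part of the notebook, and the term set is a small excerpt of the list above.

```python
import unicodedata


def strip_accents(text):
    """Remove combining marks after NFD decomposition (hypothetical helper, not in the notebook)."""
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))


# Small excerpt of the term list above, written without accents as in the notebook
umls_terms = {"cronica", "hipertension", "protesis", "isquemico"}

for token in ["crónica", "hipertensión", "cadera"]:
    exact = token in umls_terms
    folded = strip_accents(token) in umls_terms
    print(f"{token}: exact={exact}, accent-folded={folded}")
```

Note that this folding also maps `ñ` to `n` and `ç` to `c`, which may or may not be desirable for Spanish and Catalan text; whether it helps the scores reported later has not been tested here.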


**Helper functions for detection and evaluation**  
Next, we define our three utility functions:  
1. One to detect double negation in a context window.  
2. One to extract ground truth labels (`NEG` and `UNC`) from the annotated dataset.  
3. And one to identify terms based on typical medical suffixes.




In [73]:

def is_double_negation(tokens):
    """Check if a sequence of tokens contains double negation."""
    negation_count = sum(1 for token in tokens if token in NEGATION_WORDS)
    return negation_count >= 2

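def extract_ground_truth(record, record_index):
    """Collect the annotated NEG and UNC spans of a record as (record_index, start, end, text) tuples."""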
    gt_neg = set()
    gt_unc = set()

    text = record.get("data", {}).get("text", "")
    entries = record.get("annotations") or record.get("predictions") or []

    for ann in entries:
        for res in ann.get("result", []):
            labels = res.get("value", {}).get("labels", [])
            start = res.get("value", {}).get("start", None)
            end = res.get("value", {}).get("end", None)

            # Extract the original text using the indices, since there may be no "text" field
            if start is not None and end is not None and start < end and end <= len(text):
                text_val = text[start:end].lower().strip()
                if not text_val:
                    print(f" Empty span extracted in record {record_index}: start={start}, end={end}")
                    continue

                if "NEG" in labels:
                    gt_neg.add((record_index, start, end, text_val))
                if "UNC" in labels:
                    gt_unc.add((record_index, start, end, text_val))

    return gt_neg, gt_unc



def detect_medical_suffix(word):
    """Return True if the word ends with any of the defined medical suffixes."""
    return any(word.endswith(suffix) for suffix in MEDICAL_SUFFIXES)
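
To make the expected input of `extract_ground_truth` concrete, here is a condensed sketch that runs on a made-up record. `extract_spans` is a hypothetical, simplified restatement for a single label, and the record is invented for illustration. The layout (`data.text`, `annotations` or `predictions`, `result`, `value.start`/`end`/`labels`) is inferred from the function itself and resembles a Label Studio export, which is an assumption.

```python
def extract_spans(record, label):
    """Condensed, hypothetical restatement of extract_ground_truth for a single label."""
    text = record.get("data", {}).get("text", "")
    entries = record.get("annotations") or record.get("predictions") or []
    spans = set()
    for ann in entries:
        for res in ann.get("result", []):
            value = res.get("value", {})
            start, end = value.get("start"), value.get("end")
            # Skip spans with missing or out-of-range offsets
            if start is None or end is None or not (start < end <= len(text)):
                continue
            span_text = text[start:end].lower().strip()
            if span_text and label in value.get("labels", []):
                spans.add((start, end, span_text))
    return spans


# Made-up record in the shape the function expects (not taken from the dataset)
record = {
    "data": {"text": "Paciente no presenta fiebre. Posible neumonia."},
    "predictions": [{
        "result": [
            {"value": {"start": 9, "end": 11, "labels": ["NEG"]}},
            {"value": {"start": 29, "end": 36, "labels": ["UNC"]}},
        ]
    }],
}

neg_spans = extract_spans(record, "NEG")
unc_spans = extract_spans(record, "UNC")
print(neg_spans)  # {(9, 11, 'no')}
print(unc_spans)  # {(29, 36, 'posible')}
```

The span text is recovered by slicing the record text with the annotated offsets, so the offsets must refer to the raw, unnormalized text.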


**Main analysis function**  
This function brings everything together. It goes through each sentence, looks for negation or uncertainty cues, and analyzes the surrounding context to detect relevant medical terms, either from the UMLS list or by matching medical suffixes.

It also groups detections by context, flags double negation when found, and keeps track of everything for evaluation later on.


In [75]:


def analyze_medical_context(text, nlp):
    """
    Scan each sentence for negation and uncertainty cues by direct token comparison.
    For every cue found, search a window of five tokens on each side for medical terms
    (UMLS list entries or words ending in a medical suffix) and group the detections.
    """
    normalized_text = normalize_text(text)
    doc = nlp(normalized_text)
    results = {
        "negated_terms_grouped": [],
        "uncertain_terms_grouped": [],
        "double_negated_terms": [],
        "negation_cues_used": Counter(),
        "uncertainty_cues_used": Counter(),
        "medical_terms_found": Counter(),
        "medical_suffix_terms": []
    }

    for sent in doc.sents:
        tokens = list(sent)
        sent_tokens_text = [token.text.lower() for token in tokens]
        i = 0
        while i < len(tokens):
            cue_found = False

            # NEGATION MATCHER
            for cue in NEGATION_WORDS:
                cue_tokens = cue.split()
                if i + len(cue_tokens) <= len(tokens):
                    candidate = " ".join([tokens[j].text.lower() for j in range(i, i+len(cue_tokens))])
                    if candidate == cue:
                        window_start = max(0, i - 5)
                        window_end = min(len(tokens), i + len(cue_tokens) + 5)
                        context = " ".join([tokens[j].text for j in range(window_start, window_end)])
                        window_tokens = sent_tokens_text[window_start:window_end]
                        detections = []
                        for j in range(window_start, window_end):
                            term_candidate = tokens[j].text.lower()
                            if (term_candidate in [t.lower() for t in UMLS_MEDICAL_TERMS]) or detect_medical_suffix(term_candidate):
                                detection = {
                                    "term": term_candidate,
                                    "detected_text": tokens[j].text,
                                    "start": tokens[j].idx,
                                    "end": tokens[j].idx + len(tokens[j].text)
                                }
                                detections.append(detection)
                                results["medical_terms_found"][term_candidate] += 1
                                if detect_medical_suffix(term_candidate) and term_candidate not in [t.lower() for t in UMLS_MEDICAL_TERMS]:
                                    results["medical_suffix_terms"].append(detection)
                        if detections:
                            grouped_detection = {
                                "cue": cue,
                                "context": context,
                                "window_start": tokens[window_start].idx,
                                "window_end": tokens[window_end-1].idx + len(tokens[window_end-1].text),
                                "detections": detections
                            }
                            if is_double_negation(window_tokens):
                                results["double_negated_terms"].append(grouped_detection)
                            else:
                                results["negated_terms_grouped"].append(grouped_detection)
                            results["negation_cues_used"].update([cue])
                        i += len(cue_tokens)
                        cue_found = True
                        break
            if cue_found:
                continue

            # UNCERTAINTY MATCHER
            for cue in UNCERTAINTY_WORDS:
                cue_tokens = cue.split()
                if i + len(cue_tokens) <= len(tokens):
                    candidate = " ".join([tokens[j].text.lower() for j in range(i, i+len(cue_tokens))])
                    if candidate == cue:
                        window_start = max(0, i - 5)
                        window_end = min(len(tokens), i + len(cue_tokens) + 5)
                        context = " ".join([tokens[j].text for j in range(window_start, window_end)])
                        window_tokens = sent_tokens_text[window_start:window_end]
                        detections = []
                        for j in range(window_start, window_end):
                            term_candidate = tokens[j].text.lower()
                            if (term_candidate in [t.lower() for t in UMLS_MEDICAL_TERMS]) or detect_medical_suffix(term_candidate):
                                detection = {
                                    "term": term_candidate,
                                    "detected_text": tokens[j].text,
                                    "start": tokens[j].idx,
                                    "end": tokens[j].idx + len(tokens[j].text)
                                }
                                detections.append(detection)
                                results["medical_terms_found"][term_candidate] += 1
                        if detections:
                            grouped_detection = {
                                "cue": cue,
                                "context": context,
                                "window_start": tokens[window_start].idx,
                                "window_end": tokens[window_end-1].idx + len(tokens[window_end-1].text),
                                "detections": detections
                            }
                            results["uncertainty_cues_used"].update([cue])
                            results["uncertain_terms_grouped"].append(grouped_detection)
                        i += len(cue_tokens)
                        cue_found = True
                        break
            if not cue_found:
                i += 1
    return results
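
In the function above, cues are tried in list order and the loop stops at the first match. Since `no` and `sin` sit at the head of `NEGATION_WORDS`, a longer cue that begins with one of them (for example `no se observa`) cannot match at the same position. The cue frequencies printed in the run further down are consistent with this: they list `no` and `sin` but none of the longer cues that start with them. The sketch below illustrates the effect and one possible variant that tries longer cues first. `first_match_in_order` and `cues_longest_first` are hypothetical names, and the cue list is a short excerpt.

```python
def first_match_in_order(tokens, i, cues):
    """Return the first cue (in list order) that matches tokens starting at position i."""
    for cue in cues:
        cue_tokens = cue.split()
        if tokens[i:i + len(cue_tokens)] == cue_tokens:
            return cue
    return None


# Excerpt of NEGATION_WORDS in the same relative order as the notebook's list
cues = ["no", "sin", "ausencia de", "no se observa", "sin signos de"]

# Hypothetical variant: try cues with more tokens first so short cues cannot shadow long ones
cues_longest_first = sorted(cues, key=lambda c: len(c.split()), reverse=True)

tokens = "no se observa fractura".split()

in_order = first_match_in_order(tokens, 0, cues)
longest = first_match_in_order(tokens, 0, cues_longest_first)
print(in_order)  # no
print(longest)   # no se observa
```

Whether longest-first matching would improve the scores depends on how the ground truth annotates cue spans (as `no` alone or as the full phrase), which has not been evaluated here.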


**Evaluation metrics:**

We now implement a function that computes precision, recall, and F1 score, along with the Jaccard index. These are basic but essential metrics for evaluating how well our system detects negation and uncertainty against the annotated ground truth.

In [76]:
def compute_metrics(predicted, ground_truth):
    tp = len(predicted & ground_truth)
    fp = len(predicted - ground_truth)
    fn = len(ground_truth - predicted)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    f1 = (2 * precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
    jaccard = tp / (tp + fp + fn) if (tp + fp + fn) > 0 else 0
    return precision, recall, f1, jaccard
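
A small worked example may help when reading the scores. The sketch below repeats `compute_metrics` so it runs on its own and applies it to toy sets of `(record_index, cue)` pairs; the pairs are invented for illustration and are not taken from the test set.

```python
def compute_metrics(predicted, ground_truth):
    """Same formulas as above, repeated so the example runs on its own."""
    tp = len(predicted & ground_truth)
    fp = len(predicted - ground_truth)
    fn = len(ground_truth - predicted)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    f1 = (2 * precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
    jaccard = tp / (tp + fp + fn) if (tp + fp + fn) > 0 else 0
    return precision, recall, f1, jaccard


# Toy (record_index, cue) pairs, invented for illustration
predicted = {(0, "no"), (1, "sin"), (2, "ni")}
ground_truth = {(0, "no"), (1, "sin"), (3, "niega"), (4, "negativo")}

# Two shared pairs (tp = 2), one extra prediction (fp = 1), two missed annotations (fn = 2)
precision, recall, f1, jaccard = compute_metrics(predicted, ground_truth)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f} J={jaccard:.3f}")
# P=0.667 R=0.500 F1=0.571 J=0.400
```

The fourth value, tp / (tp + fp + fn), is the Jaccard index of the two sets; the final report prints it under the label "Accuracy (Jaccard)". Because the inputs are sets, a cue that occurs several times in the same record counts only once.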


**Final execution, evaluation, and export**  
Now that we have everything we need, we implement the main function that runs the entire pipeline on the test set. It loads the model and data, processes each record, applies the detection logic, and updates statistics and evaluation metrics for negation and uncertainty.

It also prints cue frequencies, example detections, and performance results (precision, recall, F1, and the Jaccard index).  
At the end, it saves the structured output to a JSON file and makes it available for download.


In [77]:
def main():
    import time
    from google.colab import files

    print("Loading SpaCy model...")
    nlp = spacy.load("es_core_news_sm")

    print("Please upload your JSON file...")
    uploaded = files.upload()
    filename = next(iter(uploaded))
    with open(filename, 'r', encoding='utf-8') as file:
        data = json.load(file)

    all_predicted_neg = set()
    all_predicted_unc = set()
    all_ground_truth_neg = set()
    all_ground_truth_unc = set()

    total_negation_counter = Counter()
    total_uncertainty_counter = Counter()
    total_medical_counter = Counter()
    negated_examples = []
    uncertain_examples = []
    records_with_neg = []
    records_with_unc = []
    output_data = []

    start_time = time.time()
    print(f"Processing {len(data)} records...")
    for i, record in enumerate(data):
        if i % 10 == 0:
            print(f"Processing record {i+1}/{len(data)}...")
        text = record.get("data", {}).get("text", "")
        if not text:
            continue

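        results = analyze_medical_context(text, nlp)
        # Keep the per-record results so the exported JSON file is not empty
        # (the record layout here is an assumption; Counter values serialize as plain dicts)
        output_data.append({"record_index": i, **results})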

        total_negation_counter.update(results["negation_cues_used"])
        total_uncertainty_counter.update(results["uncertainty_cues_used"])
        total_medical_counter.update(results["medical_terms_found"])

        for group in results.get("negated_terms_grouped", []):
            cue_text = normalize_for_eval(group["cue"])
            all_predicted_neg.add((i, cue_text))

        for group in results.get("uncertain_terms_grouped", []):
            cue_text = normalize_for_eval(group["cue"])
            all_predicted_unc.add((i, cue_text))

        # Extract GT
        gt_neg, gt_unc = extract_ground_truth(record, i)
        for (_, _, _, cue_text) in gt_neg:
            all_ground_truth_neg.add((i, normalize_for_eval(cue_text)))
        for (_, _, _, cue_text) in gt_unc:
            all_ground_truth_unc.add((i, normalize_for_eval(cue_text)))

        # Examples for display
        if results.get("negated_terms_grouped"):
            records_with_neg.append(i)
            for group in results["negated_terms_grouped"]:
                terms_info = "; ".join([f"{d['term']} (detected: {d['detected_text']})" for d in group["detections"]])
                if len(negated_examples) < 20:
                    negated_examples.append(
                        f"Record {i}: Context: '{group['context']}' (cue: {group['cue']}, terms: {terms_info}, window: {group['window_start']}-{group['window_end']})"
                    )

        if results.get("uncertain_terms_grouped"):
            records_with_unc.append(i)
            for group in results["uncertain_terms_grouped"]:
                terms_info = "; ".join([f"{d['term']} (detected: {d['detected_text']})" for d in group["detections"]])
                if len(uncertain_examples) < 20:
                    uncertain_examples.append(
                        f"Record {i}: Context: '{group['context']}' (cue: {group['cue']}, terms: {terms_info}, window: {group['window_start']}-{group['window_end']})"
                    )

    end_time = time.time()
    processing_time = end_time - start_time

    print("\nGround truth samples:")
    for item in list(all_ground_truth_neg)[:5]:
        print("GT:", item)

    print("\nPredicted samples:")
    for item in list(all_predicted_neg)[:5]:
        print("PRED:", item)

    intersection = all_predicted_neg & all_ground_truth_neg
    print("\nINTERSECTION:")
    for match in list(intersection)[:5]:
        print("MATCH:", match)

    prec_neg, rec_neg, f1_neg, acc_neg = compute_metrics(all_predicted_neg, all_ground_truth_neg)
    prec_unc, rec_unc, f1_unc, acc_unc = compute_metrics(all_predicted_unc, all_ground_truth_unc)

    combined_pred = all_predicted_neg.union(all_predicted_unc)
    combined_gt = all_ground_truth_neg.union(all_ground_truth_unc)
    prec_comb, rec_comb, f1_comb, acc_comb = compute_metrics(combined_pred, combined_gt)

    print("\n=== STATISTICS ===")
    print("\nNegation Cue Frequencies (affecting medical terms):")
    for word, freq in total_negation_counter.most_common():
        print(f"{word}: {freq}")

    print("\nUncertainty Cue Frequencies (affecting medical terms):")
    for word, freq in total_uncertainty_counter.most_common():
        print(f"{word}: {freq}")

    print("\nMedical Terms Frequencies:")
    for word, freq in total_medical_counter.most_common():
        print(f"{word}: {freq}")

    print("\n=== EXAMPLES OF NEGATED MEDICAL TERMS ===")
    for example in negated_examples:
        print(example)

    print("\n=== EXAMPLES OF UNCERTAIN MEDICAL TERMS ===")
    for example in uncertain_examples:
        print(example)

    print("\n=== TOTAL DETECTED CONTEXTS ===")
    print(f"Total detected negated contexts: {len(all_predicted_neg)}")
    print(f"Total detected uncertain contexts: {len(all_predicted_unc)}")
    print(f"Total combined detected contexts: {len(combined_pred)}")

    print("\n=== RECORDS WITH DETECTIONS ===")
    print(f"Records with negation detections: {records_with_neg}")
    print(f"Records with uncertainty detections: {records_with_unc}")

    print("\n=== EFFICIENCY ===")
    print(f"Total processing time: {processing_time:.2f} seconds")

    print("\n=== EVALUATION METRICS ===")
    print("\nNegation Metrics:")
    print(f"Precision: {prec_neg:.2f}, Recall: {rec_neg:.2f}, F1: {f1_neg:.2f}, Accuracy (Jaccard): {acc_neg:.2f}")
    print(f"Matches found using cue containment: {len(intersection)}")

    print("\nUncertainty Metrics:")
    print(f"Precision: {prec_unc:.2f}, Recall: {rec_unc:.2f}, F1: {f1_unc:.2f}, Accuracy (Jaccard): {acc_unc:.2f}")

    output_filename = "output.json"
    with open(output_filename, "w", encoding="utf-8") as outfile:
        json.dump(output_data, outfile, ensure_ascii=False, indent=2)
    print(f"\nOutput file '{output_filename}' generated.")
    files.download(output_filename)

if __name__ == "__main__":
    main()


Loading SpaCy model...
Please upload your JSON file...


Saving negacio_test_v2024.json to negacio_test_v2024 (18).json
Processing 64 records...
Processing record 1/64...
Processing record 11/64...
Processing record 21/64...
Processing record 31/64...
Processing record 41/64...
Processing record 51/64...
Processing record 61/64...

Ground truth samples:
GT: (20, 'no')
GT: (31, 'no')
GT: (21, 'negativo')
GT: (57, 'afebril,')
GT: (54, 'niega')

Predicted samples:
PRED: (20, 'no')
PRED: (31, 'no')
PRED: (53, 'no')
PRED: (57, 'sin')
PRED: (55, 'no')

INTERSECTION:
MATCH: (20, 'no')
MATCH: (53, 'no')
MATCH: (31, 'no')
MATCH: (55, 'no')
MATCH: (63, 'sin')

=== STATISTICS ===

Negation Cue Frequencies (affecting medical terms):
no: 74
sin: 46
ni: 19
niega: 5
negativa: 3
ausencia de: 3
negativo: 1
descarta: 1

Uncertainty Cue Frequencies (affecting medical terms):
compatible con: 5
probable: 5
posible: 4
probablemente: 4
sospecha de: 3
aparentemente: 3
puede: 1
se considera: 1
parece: 1
sugestivo de: 1
aproximadamente: 1

Medical Terms Frequencies:

