**Fundamentals of Natural Language Processing**

# Negation and Uncertainty Detection Project

## Code 1



**Installing Required Libraries:**

We begin by installing the necessary NLP libraries, including spaCy and its Spanish language model, which will be used later for tokenization and syntactic analysis.

In [None]:
# Install required packages (execute only if not already installed)
!pip install spacy
!python -m spacy download es_core_news_sm

Collecting es-core-news-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/es_core_news_sm-3.8.0/es_core_news_sm-3.8.0-py3-none-any.whl (12.9 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.9/12.9 MB 84.6 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('es_core_news_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


**Importing libraries for text processing, NLP, and file handling in Colab**

In [None]:
import json
import re
import unicodedata
from collections import Counter
import spacy
from google.colab import files

Before starting any analysis, we define a function that normalizes the clinical text and prepares it for further processing.


In [None]:

# Function to normalize text
def normalize_text(text):
    """Normalize medical text with Spanish/Catalan character support"""
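    # Convert to lowercase first, then normalize to composed Unicode form
    text = text.lower()
    text = unicodedata.normalize('NFC', text)

    # Remove anonymization placeholders (runs of asterisks that mask patient identifiers)
    text = re.sub(r'\*+', '', text)

    # Remove unnecessary punctuation (but keep medical-relevant ones).
    # The hyphen goes last so it is a literal "-" rather than the range "?-à";
    # \w already matches accented letters in Python 3.
    text = re.sub(r'[^\w\s.,;:!?-]', '', text)

    # Collapse whitespace last so gaps left by the removals above are cleaned up too
    text = re.sub(r'\s+', ' ', text).strip()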

    return text
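
One detail worth checking in a character class like the one used for punctuation filtering: a hyphen placed between two characters is read as a range. The sketch below compares a class in which `?-à` forms a range with one where the hyphen sits at the end and is literal. `RANGE_CLASS`, `LITERAL_CLASS`, and the sample string are illustrative names and data, not part of the notebook.

```python
import re

# Hyphen between "?" and "à": read as the range U+003F..U+00E0,
# which also covers "@", "[", "]", "^", "`" and similar symbols.
RANGE_CLASS = re.compile(r"[^\w\s.,;:!?-àáèéìíòóùúüñç]")

# Hyphen at the end of the class: read as a literal "-".
LITERAL_CLASS = re.compile(r"[^\w\s.,;:!?àáèéìíòóùúüñç-]")

sample = "tensión 120/80 @ ingreso [ok]"  # made-up example text

kept_by_range = RANGE_CLASS.sub("", sample)      # "@", "[" and "]" survive
kept_by_literal = LITERAL_CLASS.sub("", sample)  # "@", "[" and "]" are removed

print(kept_by_range)
print(kept_by_literal)
```

In both variants the slash is removed, because it falls outside the range and is not listed explicitly.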

Now we define the keywords and phrases that indicate negation, uncertainty, and relevant medical concepts in Spanish and Catalan. These lists will guide our detection system.

In [None]:

  # Define negation, uncertainty and UMLS medical terms

NEGATION_WORDS = [
      # Common negation words
      "no", "sin", "ausencia de", "descarta", "descartado", "excluye", "excluido", "niega", "negado",
      "negativa", "negación", "ningún", "ninguna", "ninguno", "imposible", "inhallable", "carece de", "nunca",
      "jamás", "tampoco", "ni", "nada", "negativo", "mai",

      # Medical-specific negation in Spanish
      "sin evidencia de", "no se observa", "no presenta", "no muestra", "no evidencia", "no compatible con",
      "no concluyente", "no parece", "no se detecta", "sin signos de", "sin síntomas de", "sin indicios de",
      "sin hallazgos de", "sin pruebas de", "sin rastro de", "ausente", "no encontrado", "sin cambios",
      "no se aprecian", "no se ven", "descartando", "descartable", "no hay evidencia de", "no hay indicación de",
      "libre de", "exento de", "sin manifestaciones de", "se excluye", "queda descartado", "ninguna evidencia de",
      "ningún signo de", "sin afección", "no identificado", "negado por el paciente", "negado clínicamente",
      "sin enfermedad", "sin afectación", "no afectado", "no positivo", "resultado negativo",
      "resultado no reactivo", "resultado no positivo",

      # Medical-specific negation in Catalan
      "sense", "no es detecta", "no es veu", "no hi ha", "no presenta", "sense indicis de", "sense evidència de",
      "sense senyals de", "sense rastre de", "sense afectació", "sense afecció", "no concloent", "sense canvis",
      "sense resultats", "sense manifestacions de", "no s'observa", "no s'aprecia", "sense presència de",
      "no compatible amb", "no és visible", "sense símptomes", "no diagnosticat", "sense senyals clars",
      "diagnòstic negatiu"
  ]

UNCERTAINTY_WORDS = [
      # Common & medical uncertainty words in Spanish
      "posible", "quizás", "podría", "sospecha de", "considera", "probable", "aparentemente", "puede", "posiblemente",
      "parece", "se considera", "indeterminado", "probabilidad de", "no concluyente", "eventual", "en estudio",
      "pendiente de evaluación", "sugestivo de", "sugiere", "indica que", "se sospecha de", "podría indicar",
      "dudoso", "no definido", "no específico", "no determinado", "valor incierto", "no claro", "no seguro",
      "compatible con", "aparenta ser", "tendría que evaluarse", "a determinar", "probabilidad baja de",
      "probabilidad alta de", "sin certeza", "hipotético", "hipotéticamente", "a confirmar", "falta de certeza",
      "en posible relación con", "estaría asociado", "aparentemente relacionado con", "se intuye", "se deduce que",
      "en consideración", "posible", "probablemente", "tal vez", "aproximadamente", "probable",

      # Common & medical uncertainty in Catalan
      "possible", "potser", "podria", "sospita de", "es considera", "probable", "aparentment", "pot ser",
      "possiblement", "sembla", "es sospita de", "és indeterminat", "probabilitat de", "no concloent", "eventual",
      "en estudi", "pendent d'avaluació", "suggerent de", "suggerix", "indica que", "dubtós", "no definit",
      "no específic", "no determinat", "valor incert", "no clar", "no segur", "aparentment relacionat amb",
      "es dedueix que", "en consideració"
  ]

UMLS_MEDICAL_TERMS = [
      "uretrotomia", "interna", "cistoscopia", "estenosis", "uretra", "cronica", "diverticulosis", "insuficiencia",
      "renal", "colelitiasis", "bloqueo", "auriculoventricular", "primer grado", "segundo grado", "hipertension",
      "arterial", "protesis", "cadera", "cordectomia", "herniorrafia", "parto", "eutocico", "rotura", "membranas",
      "prematuro", "episiotomia", "lactancia materna", "apendicectomia", "laparoscopica", "gastroenteritis", "aguda",
      "nefrectomia", "parcial", "angiomiolipoma", "quistes", "renales", "fractura", "mandibular", "ictus", "infarto",
      "isquemico", "trombectomia", "cerebral", "fibrinolisis", "endovenosa", "colangitis", "microcirugia",
      "endolaringea", "polineuropatia", "sensitiva", "axonal", "neuropatia", "multifactorial", "mielopatia", "déficit"
  ]
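
Most entries in the term list are written without accents (`cronica`, `hipertension`), while `déficit` keeps its accent, so exact string comparison depends on how the input text is accented. One possible way to compare both sides on equal footing is to fold accents before matching. The helper `strip_accents` below is a hypothetical addition, not part of the notebook; note that it also maps `ñ` to `n` and `ç` to `c`.

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Return text with combining marks removed (e.g. 'crónica' -> 'cronica')."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Small subset of the term list, for illustration only
terms = ["cronica", "hipertension", "déficit"]
folded_terms = {strip_accents(t.lower()) for t in terms}

for word in ["crónica", "Hipertensión", "deficit", "fiebre"]:
    folded = strip_accents(word.lower())
    print(word, "->", folded, folded in folded_terms)
```

Folding both the vocabulary and the input means neither side has to guess how the other was typed.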

Next we add a function to detect double negation, which helps avoid false positives when multiple negation cues appear together:


In [None]:

# Function to detect double negation in a window around a medical term
def is_double_negation(tokens):
    """Check if a sequence of tokens contains double negation"""
    # Simple cues that would indicate negation (a subset of the full NEGATION_WORDS)
    simple_negation_cues = {"no", "sin", "nunca", "jamás", "ningún", "ninguna", "nadie", "ninguno", "negado", "niega"}
    negation_count = sum(1 for token in tokens if token.lower() in simple_negation_cues)
    return negation_count >= 2
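
To see what this check reports, the sketch below repeats the function and applies it to two made-up token lists. It counts cues rather than analysing syntax, so two independent negations in the same window are also reported as a double negation. The example sentences are invented for illustration.

```python
def is_double_negation(tokens):
    """Check if a sequence of tokens contains at least two simple negation cues."""
    simple_negation_cues = {"no", "sin", "nunca", "jamás", "ningún", "ninguna",
                            "nadie", "ninguno", "negado", "niega"}
    return sum(1 for token in tokens if token.lower() in simple_negation_cues) >= 2

# "ni" is not in the simple cue set, so only one cue is counted here
one_cue = ["no", "presenta", "fiebre", "ni", "tos"]
# "sin" and "no" are both in the set
two_cues = ["paciente", "sin", "fiebre", "no", "refiere", "dolor"]

print(is_double_negation(one_cue))
print(is_double_negation(two_cues))
```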

Good! Now that the basics are set up, let’s add a function to analyze medical terms in context to check whether they appear with negation, uncertainty, or double negation cues.


In [None]:

# Function to process a text and find medical terms with their negation/uncertainty context
def analyze_medical_context(text, nlp):
    # Normalize the text
    normalized_text = normalize_text(text)

    # Process with spaCy
    doc = nlp(normalized_text)

    results = {
        "negated_terms": [],
        "uncertain_terms": [],
        "double_negated_terms": [],
        "negation_cues_used": Counter(),
        "uncertainty_cues_used": Counter(),
        "medical_terms_found": Counter()
    }

    # Lowercased term set, built once instead of once per token
    medical_terms = {term.lower() for term in UMLS_MEDICAL_TERMS}

    # Process each sentence to better handle context boundaries
    for sent in doc.sents:
        sent_tokens = list(sent)

        # For each token in the sentence
        for i, token in enumerate(sent_tokens):
            token_text = token.text.lower()

            # Check if it's a medical term
            if token_text in medical_terms:
                # Count this medical term
                results["medical_terms_found"][token_text] += 1

                # Define the context window (5 tokens before and after)
                start_idx = max(0, i - 5)
                end_idx = min(len(sent_tokens), i + 6)
                context_window = sent_tokens[start_idx:end_idx]
                context_tokens = [t.text.lower() for t in context_window]
                context_text = " ".join(context_tokens)

                # Check for double negation in context
                if is_double_negation(context_tokens):
                    results["double_negated_terms"].append({
                        "term": token_text,
                        "context": context_text
                    })
                    continue  # Skip further checks for this term if double negation is found

                # Check for negation cues in context
                negated = False
                for neg_cue in NEGATION_WORDS:
                    # For multi-word cues, check if they appear in the context
                    if " " in neg_cue:
                        if neg_cue in context_text:
                            results["negation_cues_used"][neg_cue] += 1
                            negated = True
                            break
                    # For single-word cues, check if they appear in the context tokens
                    elif neg_cue in context_tokens:
                        results["negation_cues_used"][neg_cue] += 1
                        negated = True
                        break

                if negated:
                    results["negated_terms"].append({
                        "term": token_text,
                        "context": context_text
                    })
                    continue  # If term is negated, don't check for uncertainty

                # Check for uncertainty cues in context
                uncertain = False
                for unc_cue in UNCERTAINTY_WORDS:
                    # For multi-word cues, check if they appear in the context
                    if " " in unc_cue:
                        if unc_cue in context_text:
                            results["uncertainty_cues_used"][unc_cue] += 1
                            uncertain = True
                            break
                    # For single-word cues, check if they appear in the context tokens
                    elif unc_cue in context_tokens:
                        results["uncertainty_cues_used"][unc_cue] += 1
                        uncertain = True
                        break

                if uncertain:
                    results["uncertain_terms"].append({
                        "term": token_text,
                        "context": context_text
                    })

    return results
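
The window logic inside the function can be isolated in a few lines. The sketch below uses a hypothetical helper `context_window` (the name and the `size` parameter are assumptions for illustration) to show how the slice is clipped at the start and end of a sentence.

```python
def context_window(tokens, index, size=5):
    """Return up to `size` tokens before and after tokens[index], inclusive."""
    start = max(0, index - size)
    end = min(len(tokens), index + size + 1)
    return tokens[start:end]

sentence = ["paciente", "sin", "signos", "de", "insuficiencia", "renal", "aguda"]

# Term near the end of the sentence: the right side of the window is clipped
near_end = context_window(sentence, sentence.index("renal"), size=2)
# Term at the start: the left side is clipped
at_start = context_window(sentence, 0, size=2)

print(near_end)
print(at_start)
```

Because the window is cut from the tokens of a single sentence, cues in neighbouring sentences cannot leak into the context.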

In this step, we bring everything together in the main function:
load the language model and the dataset, run the analysis on each record, and display overall statistics.
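
The main function reads each record's text from the `data` → `text` path and skips records where it is missing or empty. A minimal sketch of that access pattern follows, using made-up records; the real dataset may carry additional fields that are not shown here.

```python
import json

# Invented records for illustration; only the "data" -> "text" path is taken from the notebook
raw = json.dumps([
    {"data": {"text": "Paciente sin fiebre."}},
    {"data": {}},                 # no text field
    {"id": 3},                    # no data field
    {"data": {"text": ""}},       # empty text
])

records = json.loads(raw)
texts = []
for record in records:
    # Chained .get() calls fall back to "" instead of raising KeyError
    text = record.get("data", {}).get("text", "")
    if not text:
        continue  # record is skipped
    texts.append(text)

print(texts)
```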


In [None]:


# Main program
def main():
    print("Loading SpaCy model...")
    nlp = spacy.load("es_core_news_sm")

    print("Please upload your JSON file...")
    uploaded = files.upload()

    # Get the name of the uploaded file
    filename = next(iter(uploaded))

    # Open and load the JSON file
    with open(filename, 'r', encoding='utf-8') as file:
        data = json.load(file)

    # Initialize counters for overall statistics
    total_negation_counter = Counter()
    total_uncertainty_counter = Counter()
    total_medical_counter = Counter()

    # Lists to store examples of negated and uncertain terms
    negated_examples = []
    uncertain_examples = []

    # Process each record in the dataset
    print(f"Processing {len(data)} records...")
    for i, record in enumerate(data):
        if i % 10 == 0:  # Status update every 10 records
            print(f"Processing record {i+1}/{len(data)}...")

        text = record.get("data", {}).get("text", "")
        if not text:
            continue

        # Analyze the text
        results = analyze_medical_context(text, nlp)

        # Update overall counters
        total_negation_counter.update(results["negation_cues_used"])
        total_uncertainty_counter.update(results["uncertainty_cues_used"])
        total_medical_counter.update(results["medical_terms_found"])

        # Store examples of negated and uncertain terms (up to 20 of each)
        for neg_term in results["negated_terms"]:
            if len(negated_examples) < 20:  # Collect up to 20 examples
                negated_examples.append(f"'{neg_term['context']}' (term: {neg_term['term']})")

        for unc_term in results["uncertain_terms"]:
            if len(uncertain_examples) < 20:  # Collect up to 20 examples
                uncertain_examples.append(f"'{unc_term['context']}' (term: {unc_term['term']})")

    # Print results
    print("\n=== STATISTICS ===")

    print("\nNegation Cue Frequencies (affecting medical terms):")
    for word, freq in total_negation_counter.most_common():
        print(f"{word}: {freq}")

    print("\nUncertainty Cue Frequencies (affecting medical terms):")
    for word, freq in total_uncertainty_counter.most_common():
        print(f"{word}: {freq}")

    print("\nMedical Terms Frequencies:")
    for word, freq in total_medical_counter.most_common():
        print(f"{word}: {freq}")

    # Print examples of negated and uncertain terms
    print("\n=== EXAMPLES OF NEGATED MEDICAL TERMS ===")
    for example in negated_examples:
        print(example)

    print("\n=== EXAMPLES OF UNCERTAIN MEDICAL TERMS ===")
    for example in uncertain_examples:
        print(example)

if __name__ == "__main__":
    main()

Loading SpaCy model...
Please upload your JSON file...


Saving negacio_train_v2024.json to negacio_train_v2024.json
Processing 254 records...
Processing record 1/254...
Processing record 11/254...
Processing record 21/254...
Processing record 31/254...
Processing record 41/254...
Processing record 51/254...
Processing record 61/254...
Processing record 71/254...
Processing record 81/254...
Processing record 91/254...
Processing record 101/254...
Processing record 111/254...
Processing record 121/254...
Processing record 131/254...
Processing record 141/254...
Processing record 151/254...
Processing record 161/254...
Processing record 171/254...
Processing record 181/254...
Processing record 191/254...
Processing record 201/254...
Processing record 211/254...
Processing record 221/254...
Processing record 231/254...
Processing record 241/254...
Processing record 251/254...

=== STATISTICS ===

Negation Cue Frequencies (affecting medical terms):
sin: 123
no: 101
ni: 9
sense: 7
negativa: 3
ausencia de: 3
negativo: 3
libre de: 1
no se observa: 

## Code 2

**Installing Required Libraries:**


In [None]:
# Install required packages
!pip install spacy unidecode
!python -m spacy download es_core_news_sm
!python -m spacy download ca_core_news_sm


Collecting unidecode
  Downloading Unidecode-1.3.8-py3-none-any.whl.metadata (13 kB)
Downloading Unidecode-1.3.8-py3-none-any.whl (235 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 235.5/235.5 kB 12.3 MB/s eta 0:00:00
Installing collected packages: unidecode
Successfully installed unidecode-1.3.8
Collecting es-core-news-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/es_core_news_sm-3.8.0/es_core_news_sm-3.8.0-py3-none-any.whl (12.9 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.9/12.9 MB 93.7 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('es_core_news_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart ru

**Importing libraries for text processing, NLP, and file handling in Colab**

In [None]:
import json
import re
import unicodedata
from collections import Counter
import spacy
from google.colab import files
from unidecode import unidecode
from spacy.lang.es.stop_words import STOP_WORDS as stopwords_es
from spacy.lang.ca.stop_words import STOP_WORDS as stopwords_ca

Loading the pre-trained spaCy models for Spanish and Catalan to handle language-specific processing:


In [None]:
# Load language models
nlp_es = spacy.load("es_core_news_sm")
nlp_ca = spacy.load("ca_core_news_sm")

**Definition of our vocabulary lists:**
Negation cues, uncertainty expressions, and medical terms in Spanish and Catalan. These will be used later to identify and classify relevant patterns in the text.


In [None]:

# Define negation, uncertainty and UMLS medical terms
NEGATION_WORDS = [
      # Common negation words
      "no", "sin", "ausencia de", "descarta", "descartado", "excluye", "excluido", "niega", "negado",
      "negativa", "negación", "ningún", "ninguna", "ninguno", "imposible", "inhallable", "carece de", "nunca",
      "jamás", "tampoco", "ni", "nada", "negativo", "mai",

      # Medical-specific negation in Spanish
      "sin evidencia de", "no se observa", "no presenta", "no muestra", "no evidencia", "no compatible con",
      "no concluyente", "no parece", "no se detecta", "sin signos de", "sin síntomas de", "sin indicios de",
      "sin hallazgos de", "sin pruebas de", "sin rastro de", "ausente", "no encontrado", "sin cambios",
      "no se aprecian", "no se ven", "descartando", "descartable", "no hay evidencia de", "no hay indicación de",
      "libre de", "exento de", "sin manifestaciones de", "se excluye", "queda descartado", "ninguna evidencia de",
      "ningún signo de", "sin afección", "no identificado", "negado por el paciente", "negado clínicamente",
      "sin enfermedad", "sin afectación", "no afectado", "no positivo", "resultado negativo",
      "resultado no reactivo", "resultado no positivo",

      # Medical-specific negation in Catalan
      "sense", "no es detecta", "no es veu", "no hi ha", "no presenta", "sense indicis de", "sense evidència de",
      "sense senyals de", "sense rastre de", "sense afectació", "sense afecció", "no concloent", "sense canvis",
      "sense resultats", "sense manifestacions de", "no s'observa", "no s'aprecia", "sense presència de",
      "no compatible amb", "no és visible", "sense símptomes", "no diagnosticat", "sense senyals clars",
      "diagnòstic negatiu"
  ]

UNCERTAINTY_WORDS = [
      # Common & medical uncertainty words in Spanish
      "posible", "quizás", "podría", "sospecha de", "considera", "probable", "aparentemente", "puede", "posiblemente",
      "parece", "se considera", "indeterminado", "probabilidad de", "no concluyente", "eventual", "en estudio",
      "pendiente de evaluación", "sugestivo de", "sugiere", "indica que", "se sospecha de", "podría indicar",
      "dudoso", "no definido", "no específico", "no determinado", "valor incierto", "no claro", "no seguro",
      "compatible con", "aparenta ser", "tendría que evaluarse", "a determinar", "probabilidad baja de",
      "probabilidad alta de", "sin certeza", "hipotético", "hipotéticamente", "a confirmar", "falta de certeza",
      "en posible relación con", "estaría asociado", "aparentemente relacionado con", "se intuye", "se deduce que",
      "en consideración", "posible", "probablemente", "tal vez", "aproximadamente", "probable",

      # Common & medical uncertainty in Catalan
      "possible", "potser", "podria", "sospita de", "es considera", "probable", "aparentment", "pot ser",
      "possiblement", "sembla", "es sospita de", "és indeterminat", "probabilitat de", "no concloent", "eventual",
      "en estudi", "pendent d'avaluació", "suggerent de", "suggerix", "indica que", "dubtós", "no definit",
      "no específic", "no determinat", "valor incert", "no clar", "no segur", "aparentment relacionat amb",
      "es dedueix que", "en consideració"
  ]

UMLS_MEDICAL_TERMS = [
      "uretrotomia", "interna", "cistoscopia", "estenosis", "uretra", "cronica", "diverticulosis", "insuficiencia",
      "renal", "colelitiasis", "bloqueo", "auriculoventricular", "primer grado", "segundo grado", "hipertension",
      "arterial", "protesis", "cadera", "cordectomia", "herniorrafia", "parto", "eutocico", "rotura", "membranas",
      "prematuro", "episiotomia", "lactancia materna", "apendicectomia", "laparoscopica", "gastroenteritis", "aguda",
      "nefrectomia", "parcial", "angiomiolipoma", "quistes", "renales", "fractura", "mandibular", "ictus", "infarto",
      "isquemico", "trombectomia", "cerebral", "fibrinolisis", "endovenosa", "colangitis", "microcirugia",
      "endolaringea", "polineuropatia", "sensitiva", "axonal", "neuropatia", "multifactorial", "mielopatia", "déficit"
  ]

We also want to implement a helper function to detect double negation within a context window, to avoid misclassifications when multiple cues appear together:




In [None]:
# Function to detect double negation in a window around a medical term
def is_double_negation(tokens):
    """Check if a sequence of tokens contains double negation"""
    # Simple cues that would indicate negation (a subset of the full NEGATION_WORDS)
    simple_negation_cues = {"no", "sin", "nunca", "jamás", "ningún", "ninguna", "nadie", "ninguno", "negado", "niega"}
    negation_count = sum(1 for token in tokens if token.lower() in simple_negation_cues)
    return negation_count >= 2

Now we will create two functions: one to normalize the text (removing accents, symbols, and extra spaces), and another to preprocess it by tokenizing and filtering stopwords (while keeping negation cues).


In [None]:
def normalize_text(text):
    """Normalize medical text: lowercase, strip accents, drop masking asterisks, collapse whitespace"""
    text = text.lower()
    # Remove accents; accented entries in the cue and term lists can no longer match after this step
    text = unidecode(text)
    text = re.sub(r'\*+', '', text)  # Remove anonymization placeholders that mask patient identifiers
    # Collapse whitespace last so the gaps left by removed asterisks do not glue words together
    text = re.sub(r'\s+', ' ', text).strip()
    #text = re.sub(r'[^\w\s.,;:!?-]', '', text)  # Remove unnecessary punctuation
    return text

def preprocess_text(text, lang="es"):
    """
    Preprocess text by tokenizing, removing stopwords (except negation words), and normalizing case.
    Returns:
        List of clean tokens with negation words preserved.
    """
    text = normalize_text(text)

    if lang == "es":
        nlp = nlp_es
        stopwords = stopwords_es
    else:
        nlp = nlp_ca
        stopwords = stopwords_ca

    doc = nlp(text)

    tokens = [
        token.text for token in doc
        if token.is_alpha and (token.text not in stopwords or token.text in NEGATION_WORDS)
    ]

    return tokens
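
Stop-word lists commonly include short function words such as `no` or `sin`, which is why the filter makes an exception for negation cues. The sketch below reproduces that condition without spaCy. `TOY_STOPWORDS`, `TOY_NEGATION_CUES`, and `filter_tokens` are toy stand-ins invented for illustration, not the sets used by the notebook.

```python
# Toy stand-ins: the notebook uses spaCy's stop-word sets and the full NEGATION_WORDS list
TOY_STOPWORDS = {"el", "la", "de", "no", "sin", "se", "y"}
TOY_NEGATION_CUES = {"no", "sin"}

def filter_tokens(tokens):
    """Drop stop words unless they are negation cues; keep alphabetic tokens only."""
    return [
        tok for tok in tokens
        if tok.isalpha() and (tok not in TOY_STOPWORDS or tok in TOY_NEGATION_CUES)
    ]

tokens = "el paciente no presenta fiebre y sin signos de infarto".split()
filtered = filter_tokens(tokens)
print(filtered)
```

Without the exception, `no` and `sin` would be discarded together with `el`, `y`, and `de`, and the negation would be lost before detection starts.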

Great!

Now we can implement the function that analyzes a medical text: it tokenizes the input, looks for known medical terms, and checks whether they appear in a context of negation, uncertainty, or double negation.


In [None]:
def analyze_medical_context(text, lang="es"):
    """Analyze medical text for negation, uncertainty, and medical terms."""
    tokens = preprocess_text(text, lang)
    processed_text = " ".join(tokens)
    doc = nlp_es(processed_text) if lang == "es" else nlp_ca(processed_text)

    results = {
        "negated_terms": [],
        "uncertain_terms": [],
        "double_negated_terms": [],
        "negation_cues_used": Counter(),
        "uncertainty_cues_used": Counter(),
        "medical_terms_found": Counter()
    }

    # Lowercased term set, built once instead of once per token
    medical_terms = {term.lower() for term in UMLS_MEDICAL_TERMS}

    for sent in doc.sents:
        sent_tokens = [token.text.lower() for token in sent]

        for i, token_text in enumerate(sent_tokens):
            if token_text in medical_terms:
                results["medical_terms_found"][token_text] += 1

                start_idx = max(0, i - 5)
                end_idx = min(len(sent_tokens), i + 6)
                context_tokens = sent_tokens[start_idx:end_idx]
                context_text = " ".join(context_tokens)

                if is_double_negation(context_tokens):
                    results["double_negated_terms"].append({"term": token_text, "context": context_text})
                    continue

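                # Note: `in` on a string is a substring test, so short cues such as "ni" or "no"
                # also match inside longer words (e.g. "clinica", "normal")
                negated = any(neg in context_text for neg in NEGATION_WORDS)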
                if negated:
                    results["negation_cues_used"].update([neg for neg in NEGATION_WORDS if neg in context_text])
                    results["negated_terms"].append({"term": token_text, "context": context_text})
                    continue

                uncertain = any(unc in context_text for unc in UNCERTAINTY_WORDS)
                if uncertain:
                    results["uncertainty_cues_used"].update([unc for unc in UNCERTAINTY_WORDS if unc in context_text])
                    results["uncertain_terms"].append({"term": token_text, "context": context_text})

    return results
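
Python's `in` operator on strings tests for substrings, so a short cue such as `ni` is also found inside `clinica`. A possible alternative, sketched below with a hypothetical helper `cue_in_text` (not part of the notebook), anchors each cue on word boundaries using lookarounds. The sample contexts are invented.

```python
import re

def cue_in_text(cue: str, text: str) -> bool:
    """True if `cue` occurs in `text` as a whole word or phrase, not inside another word."""
    pattern = rf"(?<!\w){re.escape(cue)}(?!\w)"
    return re.search(pattern, text) is not None

context = "insuficiencia renal clinica normal"  # invented, accent-free context

for cue in ["ni", "no"]:
    print(cue, "substring:", cue in context, "| whole word:", cue_in_text(cue, context))

# Multi-word cues work the same way
print(cue_in_text("no presenta", "paciente no presenta fiebre"))
```

The same helper handles single-word and multi-word cues, so both kinds can share one code path.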


**Running the Rule-based approach on the train data:**

The final step is to run our full rule-based system on the dataset.

Let’s see how it goes!
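
The run accumulates per-record counts into dataset-wide totals with `Counter.update` and prints them with `most_common`. A minimal sketch of that aggregation, using invented per-record counters shaped like `negation_cues_used`:

```python
from collections import Counter

# Invented per-record results, for illustration only
per_record = [
    Counter({"sin": 2, "no": 1}),
    Counter({"no": 3}),
    Counter({"sense": 1, "sin": 1}),
]

total = Counter()
for counts in per_record:
    total.update(counts)  # adds counts key by key instead of replacing them

# most_common() yields (cue, frequency) pairs from most to least frequent
for cue, freq in total.most_common():
    print(f"{cue}: {freq}")
```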

In [None]:
print("Please upload your JSON file...")
uploaded = files.upload()
filename = next(iter(uploaded))

with open(filename, 'r', encoding='utf-8') as file:
    data = json.load(file)

total_negation_counter = Counter()
total_uncertainty_counter = Counter()
total_medical_counter = Counter()

negated_examples = []
uncertain_examples = []

print(f"Processing {len(data)} records...")
for i, record in enumerate(data):
    if i % 10 == 0:
        print(f"Processing record {i+1}/{len(data)}...")

    text = record.get("data", {}).get("text", "")
    if not text:
        continue

    results = analyze_medical_context(text)

    total_negation_counter.update(results["negation_cues_used"])
    total_uncertainty_counter.update(results["uncertainty_cues_used"])
    total_medical_counter.update(results["medical_terms_found"])

    negated_examples.extend([f"'{term['context']}' (term: {term['term']})" for term in results["negated_terms"][:20]])
    uncertain_examples.extend([f"'{term['context']}' (term: {term['term']})" for term in results["uncertain_terms"][:20]])

print("\n=== STATISTICS ===")

print("\nNegation Cue Frequencies:")
for word, freq in total_negation_counter.most_common():
    print(f"{word}: {freq}")

print("\nUncertainty Cue Frequencies:")
for word, freq in total_uncertainty_counter.most_common():
    print(f"{word}: {freq}")

print("\nMedical Terms Frequencies:")
for word, freq in total_medical_counter.most_common():
    print(f"{word}: {freq}")

print("\n=== EXAMPLES OF NEGATED MEDICAL TERMS ===")
for example in negated_examples[:20]:
    print(example)

print("\n=== EXAMPLES OF UNCERTAIN MEDICAL TERMS ===")
for example in uncertain_examples[:20]:
    print(example)

Please upload your JSON file...


Saving negacio_train_v2024.json to negacio_train_v2024 (1).json
Processing 254 records...
Processing record 1/254...
Processing record 11/254...
Processing record 21/254...
Processing record 31/254...
Processing record 41/254...
Processing record 51/254...
Processing record 61/254...
Processing record 71/254...
Processing record 81/254...
Processing record 91/254...
Processing record 101/254...
Processing record 111/254...
Processing record 121/254...
Processing record 131/254...
Processing record 141/254...
Processing record 151/254...
Processing record 161/254...
Processing record 171/254...
Processing record 181/254...
Processing record 191/254...
Processing record 201/254...
Processing record 211/254...
Processing record 221/254...
Processing record 231/254...
Processing record 241/254...
Processing record 251/254...

=== STATISTICS ===

Negation Cue Frequencies:
no: 794
ni: 707
sin: 228
negativo: 14
sense: 14
negativa: 12
descarta: 12
niega: 10
nada: 10
no presenta: 8
no muestra: 


## HYBRID VERSION OF THE TWO PREVIOUS CODES

**We will now run the combined version of both codes.**

We use the **`normalize_text`** function from Code 2 and bring in the pattern-based rules from Code 1, including helper functions like **`detect_multiword_pattern`** and the four scope detection strategies: prefix, postfix, UMLS term + negation phrase, and negation phrase + UMLS term.

The output will include updated counters, matched examples, and final stats to evaluate how this hybrid version performs.
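
The helper `detect_multiword_pattern` and the scope strategies are not shown in this part of the notebook. Purely as an illustration of the general idea, and not as the notebook's implementation, the sketch below uses two hypothetical functions: `find_phrase` locates a multi-word phrase in a token list, and `cue_position` reports whether a cue sits before or after a term. The `max_gap` parameter and the example sentences are assumptions.

```python
def find_phrase(tokens, phrase):
    """Return start indices where the (possibly multi-word) `phrase` occurs in `tokens`."""
    words = phrase.split()
    n = len(words)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == words]

def cue_position(tokens, cue, term, max_gap=3):
    """Classify where `cue` sits relative to `term`: 'before', 'after', or None.

    `max_gap` is the largest number of tokens allowed between cue and term.
    """
    cue_len = len(cue.split())
    term_len = len(term.split())
    term_hits = find_phrase(tokens, term)
    for c in find_phrase(tokens, cue):
        for t in term_hits:
            if 0 <= t - (c + cue_len) <= max_gap:
                return "before"   # cue ... term
            if 0 <= c - (t + term_len) <= max_gap:
                return "after"    # term ... cue
    return None

before = cue_position("sin evidencia de infarto cerebral".split(), "sin evidencia de", "infarto")
after = cue_position("ictus isquemico descartado".split(), "descartado", "ictus")
print(before, after)
```

Matching on token lists rather than raw strings keeps a cue from being found inside a longer word.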







(We already have the packages installed from the previous codes)

**Importing libraries for text processing, NLP, and file handling in Colab**

In [None]:
import json
import re
import unicodedata
from collections import Counter, defaultdict
import spacy
from google.colab import files
import time

Let’s use our key word lists again: negation cues, uncertainty expressions, and medical terms. This will be the base of our combined system for tagging relevant patterns in the clinical texts.


In [None]:

# Define negation, uncertainty and UMLS medical terms
NEGATION_WORDS = [
    # Common negation words
    "no", "sin", "ausencia de", "descarta", "descartado", "excluye", "excluido", "niega", "negado",
    "negativa", "negación", "ningún", "ninguna", "ninguno", "imposible", "inhallable", "carece de", "nunca",
    "jamás", "tampoco", "ni", "nada", "negativo", "mai",

    # Medical-specific negation in Spanish
    "sin evidencia de", "no se observa", "no presenta", "no muestra", "no evidencia", "no compatible con",
    "no concluyente", "no parece", "no se detecta", "sin signos de", "sin síntomas de", "sin indicios de",
    "sin hallazgos de", "sin pruebas de", "sin rastro de", "ausente", "no encontrado", "sin cambios",
    "no se aprecian", "no se ven", "descartando", "descartable", "no hay evidencia de", "no hay indicación de",
    "libre de", "exento de", "sin manifestaciones de", "se excluye", "queda descartado", "ninguna evidencia de",
    "ningún signo de", "sin afección", "no identificado", "negado por el paciente", "negado clínicamente",
    "sin enfermedad", "sin afectación", "no afectado", "no positivo", "resultado negativo",
    "resultado no reactivo", "resultado no positivo",

    # Medical-specific negation in Catalan
    "sense", "no es detecta", "no es veu", "no hi ha", "no presenta", "sense indicis de", "sense evidència de",
    "sense senyals de", "sense rastre de", "sense afectació", "sense afecció", "no concloent", "sense canvis",
    "sense resultats", "sense manifestacions de", "no s'observa", "no s'aprecia", "sense presència de",
    "no compatible amb", "no és visible", "sense símptomes", "no diagnosticat", "sense senyals clars",
    "diagnòstic negatiu"
]

UNCERTAINTY_WORDS = [
    # Common & medical uncertainty words in Spanish
    "posible", "quizás", "podría", "sospecha de", "considera", "probable", "aparentemente", "puede", "posiblemente",
    "parece", "se considera", "indeterminado", "probabilidad de", "no concluyente", "eventual", "en estudio",
    "pendiente de evaluación", "sugestivo de", "sugiere", "indica que", "se sospecha de", "podría indicar",
    "dudoso", "no definido", "no específico", "no determinado", "valor incierto", "no claro", "no seguro",
    "compatible con", "aparenta ser", "tendría que evaluarse", "a determinar", "probabilidad baja de",
    "probabilidad alta de", "sin certeza", "hipotético", "hipotéticamente", "a confirmar", "falta de certeza",
    "en posible relación con", "estaría asociado", "aparentemente relacionado con", "se intuye", "se deduce que",
    "en consideración", "posible", "probablemente", "tal vez", "aproximadamente", "probable",

    # Common & medical uncertainty in Catalan
    "possible", "potser", "podria", "sospita de", "es considera", "probable", "aparentment", "pot ser",
    "possiblement", "sembla", "es sospita de", "és indeterminat", "probabilitat de", "no concloent", "eventual",
    "en estudi", "pendent d'avaluació", "suggerent de", "suggerix", "indica que", "dubtós", "no definit",
    "no específic", "no determinat", "valor incert", "no clar", "no segur", "aparentment relacionat amb",
    "es dedueix que", "en consideració"
]

UMLS_MEDICAL_TERMS = [
    "uretrotomia", "interna", "cistoscopia", "estenosis", "uretra", "cronica", "diverticulosis", "insuficiencia",
    "renal", "colelitiasis", "bloqueo", "auriculoventricular", "primer grado", "segundo grado", "hipertension",
    "arterial", "protesis", "cadera", "cordectomia", "herniorrafia", "parto", "eutocico", "rotura", "membranas",
    "prematuro", "episiotomia", "lactancia materna", "apendicectomia", "laparoscopica", "gastroenteritis", "aguda",
    "nefrectomia", "parcial", "angiomiolipoma", "quistes", "renales", "fractura", "mandibular", "ictus", "infarto",
    "isquemico", "trombectomia", "cerebral", "fibrinolisis", "endovenosa", "colangitis", "microcirugia",
    "endolaringea", "polineuropatia", "sensitiva", "axonal", "neuropatia", "multifactorial", "mielopatia", "déficit"
]
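
Some cues occur more than once in the lists above: "posible" and "probable" are listed twice among the Spanish uncertainty cues, and "eventual" appears under both languages. Since cue frequencies are later tallied by iterating over the list, a repeated entry would be counted once per repetition. The following is a minimal sketch of one way to drop repeats while preserving order; `dedupe_cues` is a hypothetical helper, not part of the original notebook.

```python
def dedupe_cues(cues):
    """Drop repeated cues, keeping each cue at its first position."""
    # dict.fromkeys preserves insertion order and discards later repeats
    return list(dict.fromkeys(cues))


sample = ["posible", "probable", "eventual", "posible", "probable"]
print(dedupe_cues(sample))  # ['posible', 'probable', 'eventual']
```

Applying the same call to `NEGATION_WORDS` and `UNCERTAINTY_WORDS` before the analysis would keep the frequency tables free of double counts.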


**Text normalization**

Next, we clean up the input text by normalizing it: removing special characters, extra spaces, and anything else that might interfere with the analysis.


In [None]:
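# Pre-processing
def normalize_text(text):
    """Normalize medical text with Spanish/Catalan character support"""
    text = text.lower()
    text = unicodedata.normalize('NFC', text)
    # Remove asterisk runs used to mask patient identifiers
    text = re.sub(r'\*+', '', text)
    # Drop other punctuation; the hyphen goes last so it is a literal, not a range
    text = re.sub(r'[^\w\s.,;:!?àáèéìíòóùúüñç-]', '', text)
    # Collapse whitespace last so gaps left by the removals are cleaned up too
    text = re.sub(r'\s+', ' ', text).strip()
    return text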
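
Inside a regular-expression character class, a hyphen placed between two characters defines a range, so in a class like the punctuation filter used by `normalize_text` its position decides whether `-` is kept as punctuation. The sketch below compares the two placements on a made-up string; the constant names and the sample are illustrative only.

```python
import re

# Here "?-à" is parsed as a range running from "?" to "à"
RANGE_PATTERN = r'[^\w\s.,;:!?-àáèéìíòóùúüñç]'
# Same class with the hyphen moved to the end, where it is a literal "-"
LITERAL_PATTERN = r'[^\w\s.,;:!?àáèéìíòóùúüñç-]'

sample = "post-op @ 10:30"

# With the range, "-" is stripped while "@" (inside the range) survives
print(repr(re.sub(RANGE_PATTERN, '', sample)))
# With the literal hyphen, "-" survives and "@" is stripped
print(repr(re.sub(LITERAL_PATTERN, '', sample)))
```

Escaping the hyphen (`\-`) would work equally well; placing it last is simply the more common convention.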

**Helper functions for detection and ground truth**  

Here we define four utility functions to support our analysis: detecting double negation, extending the UMLS term list, detecting suffix-based medical terms, and extracting annotated ground truth for evaluation:


In [None]:
# Helper functions
def is_double_negation(tokens):
    """ Check if a sequence of tokens contains at least two simple negation cues """
    simple_negation_cues = {"no", "sin", "nunca", "jamás", "ningún", "ninguna", "nadie", "ninguno", "negado", "niega"}
    negation_count = sum(1 for token in tokens if token in simple_negation_cues)
    return negation_count >= 2

def get_extended_umls_terms(umls_terms, suffix_counter):
    """Combine the original UMLS list with the suffix-detected terms, without duplicates (all lowercase)"""
    extended = set(term.lower() for term in umls_terms)
    for term in suffix_counter:
        if term.lower() not in extended:
            extended.add(term.lower())
    return extended

def detect_medical_suffix(token):
    """Detect whether a token ends with one of the medical suffixes -nosis, -tomia, -patia, -losis"""
    pattern = re.compile(r'.*(nosis|tomia|patia|losis)$')
    return pattern.match(token)

def extract_ground_truth(record):
    """Extract ground truth negation and uncertainty terms from annotations"""
    gt_neg = set()
    gt_unc = set()
    annotations = record.get("annotations", [])
    for ann in annotations:
        for res in ann.get("result", []):
            labels = res.get("value", {}).get("labels", [])
            term = res.get("value", {}).get("text", "").lower()
            if "NEG" in labels:
                gt_neg.add(term)
            if "UNC" in labels:
                gt_unc.add(term)
    return gt_neg, gt_unc
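
The suffix pattern and the entries in `UMLS_MEDICAL_TERMS` are written without accents (only "déficit" carries one), while `normalize_text` keeps accented characters, so an accented token such as "uretrotomía" would not match. One possible way to make the comparison accent-insensitive is sketched below, under the assumption that folding ñ and ç to n and c is acceptable for matching; `strip_accents` and `has_medical_suffix` are hypothetical names.

```python
import re
import unicodedata


def strip_accents(text):
    """Remove combining accent marks (this also maps ñ to n and ç to c)."""
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(ch for ch in decomposed if unicodedata.category(ch) != 'Mn')


# Compiled once instead of on every call
SUFFIX_PATTERN = re.compile(r'(nosis|tomia|patia|losis)$')


def has_medical_suffix(token):
    """Accent-insensitive variant of the suffix check (hypothetical helper)."""
    return bool(SUFFIX_PATTERN.search(strip_accents(token.lower())))


print(has_medical_suffix("uretrotomía"))  # True: the accented form still matches
print(has_medical_suffix("paciente"))     # False
```

The same `strip_accents` call could be applied to tokens before they are looked up in the UMLS list.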


**Evaluation metrics**  
We will now implement a function to calculate precision, recall, and F1 score, which are basic but essential metrics for evaluating how well our system detects negation and uncertainty compared to the annotated ground truth.


In [None]:
# Metrics
def compute_metrics(predicted, ground_truth):
    """Compute precision, recall, and F1 from two sets of terms"""
    tp = len(predicted & ground_truth)
    fp = len(predicted - ground_truth)
    fn = len(ground_truth - predicted)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall    = tp / (tp + fn) if (tp + fn) > 0 else 0
    f1        = (2 * precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
    return precision, recall, f1
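
Because `compute_metrics` takes sets, each distinct term counts once no matter how often it occurs in the corpus. A small worked example with made-up sets (not taken from the dataset) shows the arithmetic:

```python
# Toy sets of unique terms (illustrative values only)
predicted = {"estenosis", "fractura", "ictus"}
ground_truth = {"fractura", "ictus", "infarto", "colangitis"}

tp = len(predicted & ground_truth)  # terms present in both sets
fp = len(predicted - ground_truth)  # predicted but not annotated
fn = len(ground_truth - predicted)  # annotated but missed

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(tp, fp, fn)                                  # 2 1 2
print(f"{precision:.2f} {recall:.2f} {f1:.2f}")    # 0.67 0.50 0.57
```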

**Main analysis function**  
This next function brings everything together: it processes each sentence, detects medical terms, checks for suffix patterns, and analyzes a window of up to five tokens on either side of each term to flag negation, uncertainty, or double negation.

It also updates all relevant counters so we can evaluate later.


In [None]:
def analyze_medical_context(text, nlp):
    normalized_text = normalize_text(text)
    doc = nlp(normalized_text)
    results = {
        "negated_terms": [],
        "uncertain_terms": [],
        "double_negated_terms": [],
        "negation_cues_used": Counter(),
        "uncertainty_cues_used": Counter(),
        "medical_terms_found": Counter(),
        "medical_suffix_terms": Counter()
    }

    # Set used to track contexts that have already been processed
    processed_contexts = set()

    # Lowercase the UMLS terms once instead of rebuilding the list for every token
    umls_terms = {term.lower() for term in UMLS_MEDICAL_TERMS}

    # Process each sentence
    for sent in doc.sents:
        sent_tokens = [token.text.lower() for token in sent]

        # Collect tokens that end with a medical suffix
        for token in sent_tokens:
            if detect_medical_suffix(token):
                results["medical_suffix_terms"][token] += 1

        # Look for UMLS terms (single-token matches only)
        for i, token_text in enumerate(sent_tokens):
            if token_text in umls_terms:
                results["medical_terms_found"][token_text] += 1
                start_idx = max(0, i - 5)
                end_idx = min(len(sent_tokens), i + 6)
                context_tokens = sent_tokens[start_idx:end_idx]
                context_text = " ".join(context_tokens)

                # Skip contexts that have already been processed
                if context_text in processed_contexts:
                    continue

                # Mark this context as processed
                processed_contexts.add(context_text)

                if is_double_negation(context_tokens):
                    results["double_negated_terms"].append({"term": token_text, "context": context_text})
                    continue

                if any(neg in context_text for neg in NEGATION_WORDS):
                    results["negation_cues_used"].update([neg for neg in NEGATION_WORDS if neg in context_text])
                    results["negated_terms"].append({"term": token_text, "context": context_text})
                    continue

                if any(unc in context_text for unc in UNCERTAINTY_WORDS):
                    results["uncertainty_cues_used"].update([unc for unc in UNCERTAINTY_WORDS if unc in context_text])
                    results["uncertain_terms"].append({"term": token_text, "context": context_text})

    return results
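
The cue checks above use substring tests (`neg in context_text`), so a short cue such as "no" or "ni" also matches inside longer words like "diagnostico" or "clinico". A hedged alternative is to require word boundaries around each cue; `find_cues` below is a hypothetical helper, not part of the original function.

```python
import re


def find_cues(context_text, cues):
    """Return the cues that occur in context_text as whole words or phrases."""
    found = []
    for cue in cues:
        # Lookarounds ensure the cue is not glued to other word characters
        pattern = r'(?<!\w)' + re.escape(cue) + r'(?!\w)'
        if re.search(pattern, context_text):
            found.append(cue)
    return found


cues = ["no", "ni", "sin", "no presenta"]

# Substring matching would report "no" and "ni" here; whole-word matching does not
print(find_cues("diagnostico clinico de estenosis", cues))
# Genuine cues are still found, including multi-word ones
print(find_cues("no presenta estenosis ni fractura", cues))
```

Note that a multi-word cue and its one-word prefix ("no presenta" and "no") are both reported; keeping only the longest match would need an extra step.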

**Final execution and evaluation**  
Here we run the complete system on the dataset. It processes each record, applies our detection logic, tracks statistics, and evaluates performance using precision, recall, and F1 score for both negation and uncertainty.
We also print sample outputs and the processing time to assess efficiency.


In [None]:
import time
from collections import defaultdict

print("Loading SpaCy model...")
nlp = spacy.load("es_core_news_sm")

print("Please upload your JSON file...")
uploaded = files.upload()
filename = next(iter(uploaded))
with open(filename, 'r', encoding='utf-8') as file:
    data = json.load(file)

total_negation_counter = Counter()
total_uncertainty_counter = Counter()
total_medical_counter = Counter()
total_suffix_counter = Counter()

# Lists and a set for tracking example contexts without repeats
negated_examples = []
uncertain_examples = []
used_contexts = set()

# Variables for evaluation
all_predicted_neg = set()
all_ground_truth_neg = set()
all_predicted_unc = set()
all_ground_truth_unc = set()

start_time = time.time()
print(f"Processing {len(data)} records...")
for i, record in enumerate(data):
    if i % 10 == 0:
        print(f"Processing record {i+1}/{len(data)}...")
    text = record.get("data", {}).get("text", "")
    if not text:
        continue

    results = analyze_medical_context(text, nlp)
    total_negation_counter.update(results["negation_cues_used"])
    total_uncertainty_counter.update(results["uncertainty_cues_used"])
    total_medical_counter.update(results["medical_terms_found"])
    total_suffix_counter.update(results["medical_suffix_terms"])

    # Process the individual terms, skipping duplicate contexts
    for neg_term in results["negated_terms"]:
        context = neg_term['context']
        norm_term = normalize_text(neg_term["term"])
        if context not in used_contexts and len(negated_examples) < 20:
            negated_examples.append(f"'{context}' (term: {norm_term})")
            used_contexts.add(context)
        all_predicted_neg.add(norm_term)

    for unc_term in results["uncertain_terms"]:
        context = unc_term['context']
        norm_term = normalize_text(unc_term["term"])
        if context not in used_contexts and len(uncertain_examples) < 20:
            uncertain_examples.append(f"'{context}' (term: {norm_term})")
            used_contexts.add(context)
        all_predicted_unc.add(norm_term)

    # Accumulate the annotated terms with the extract_ground_truth helper defined
    # earlier, so the metrics below are not computed against empty sets
    gt_neg, gt_unc = extract_ground_truth(record)
    all_ground_truth_neg.update(normalize_text(term) for term in gt_neg)
    all_ground_truth_unc.update(normalize_text(term) for term in gt_unc)

end_time = time.time()
processing_time = end_time - start_time

prec_neg, rec_neg, f1_neg = compute_metrics(all_predicted_neg, all_ground_truth_neg)
prec_unc, rec_unc, f1_unc = compute_metrics(all_predicted_unc, all_ground_truth_unc)

combined_predicted = all_predicted_neg.union(all_predicted_unc)
combined_ground_truth = all_ground_truth_neg.union(all_ground_truth_unc)
prec_comb, rec_comb, f1_comb = compute_metrics(combined_predicted, combined_ground_truth)

# Build the extended list of UMLS terms (without duplicates)
extended_umls_terms = get_extended_umls_terms(UMLS_MEDICAL_TERMS, total_suffix_counter)

# Build an extended counter that sums the frequencies of the original terms and the suffix-detected terms
extended_terms_freq = defaultdict(int)
for term, freq in total_medical_counter.items():
    extended_terms_freq[term.lower()] += freq
for term, freq in total_suffix_counter.items():
    extended_terms_freq[term.lower()] += freq

print("\n=== STATISTICS ===")
print("\nNegation Cue Frequencies (affecting medical terms):")
for word, freq in total_negation_counter.most_common():
    print(f"{word}: {freq}")
print("\nUncertainty Cue Frequencies (affecting medical terms):")
for word, freq in total_uncertainty_counter.most_common():
    print(f"{word}: {freq}")
print("\nMedical Terms Frequencies:")
for word, freq in total_medical_counter.most_common():
    print(f"{word}: {freq}")
print("\nMedical Suffix Terms Frequencies:")
for word, freq in total_suffix_counter.most_common():
    print(f"{word}: {freq}")

print("\n=== EXTENDED UMLS TERMS ===")
for term in sorted(extended_umls_terms):
    print(term, f"(total frequency: {extended_terms_freq[term]})")

print("\n=== EXAMPLES OF NEGATED MEDICAL TERMS ===")
for example in negated_examples:
    print(example)
print("\n=== EXAMPLES OF UNCERTAIN MEDICAL TERMS ===")
for example in uncertain_examples:
    print(example)

print("\n=== EFFICIENCY ===")
print(f"Total processing time: {processing_time:.2f} seconds")

print("\n=== EVALUATION METRICS ===")
print("\nNegation Metrics:")
print(f"Precision: {prec_neg:.2f}, Recall: {rec_neg:.2f}, F1: {f1_neg:.2f}")
print("\nUncertainty Metrics:")
print(f"Precision: {prec_unc:.2f}, Recall: {rec_unc:.2f}, F1: {f1_unc:.2f}")
print("\nCombined (Negation + Uncertainty) Metrics:")
print(f"Precision: {prec_comb:.2f}, Recall: {rec_comb:.2f}, F1: {f1_comb:.2f}")

Loading SpaCy model...
Please upload your JSON file...


Saving negacio_train_v2024.json to negacio_train_v2024 (22).json
Processing 254 records...
Processing record 1/254...
Processing record 11/254...
Processing record 21/254...
Processing record 31/254...
Processing record 41/254...
Processing record 51/254...
Processing record 61/254...
Processing record 71/254...
Processing record 81/254...
Processing record 91/254...
Processing record 101/254...
Processing record 111/254...
Processing record 121/254...
Processing record 131/254...
Processing record 141/254...
Processing record 151/254...
Processing record 161/254...
Processing record 171/254...
Processing record 181/254...
Processing record 191/254...
Processing record 201/254...
Processing record 211/254...
Processing record 221/254...
Processing record 231/254...
Processing record 241/254...
Processing record 251/254...

=== STATISTICS ===

Negation Cue Frequencies (affecting medical terms):
no: 466
ni: 448
sin: 142
sin signos de: 23
sense: 11
descarta: 8
negativa: 4
ausencia de: 4
n