# MedQuAD Preprocessing

**Description:**  
This notebook preprocesses the MedQuAD dataset to prepare it for training and evaluating transformer-based question generation models (T5 and BART). The pipeline normalizes medical abbreviations and units, lemmatizes the text with spaCy, removes stopwords while keeping medically relevant ones, and builds a cleaned, tagged input column.

In [17]:
!python -m spacy download en_core_web_md

Collecting en-core-web-md==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.8.0/en_core_web_md-3.8.0-py3-none-any.whl (33.5 MB)
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [1]:
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string
import spacy
from tqdm.notebook import tqdm

In [2]:
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [3]:
# Custom dictionary for medical units and abbreviations (lowercase)
normalization_dict = {
    'htn': 'hypertension',
    'bp': 'blood pressure',
    'dm': 'diabetes mellitus',
    'cad': 'coronary artery disease',
    'chf': 'congestive heart failure',
    'copd': 'chronic obstructive pulmonary disease',
    'mi': 'myocardial infarction',
    'cva': 'cerebrovascular accident',
    'tia': 'transient ischemic attack',
    'uri': 'upper respiratory infection',
    'uti': 'urinary tract infection',
    'gi': 'gastrointestinal',
    'cns': 'central nervous system',
    'mg/dl': 'milligrams per deciliter',
    'mmhg': 'millimeters of mercury',
    'iv': 'intravenous',
    'im': 'intramuscular',
    'po': 'by mouth',
    'prn': 'as needed',
    'bid': 'twice a day',
    'tid': 'three times a day',
    'qid': 'four times a day',
    'qhs': 'at bedtime',
    'qam': 'every morning',
    'qpm': 'every evening'
}


# Medically relevant stopwords to preserve (lowercase)
preserve_stopwords = {
    'no', 'not', 'without', 'none', 'neither', 'nor', 'between', 'before',
    'after', 'during', 'until', 'since', 'should', 'could', 'would', 'may',
    'might', 'must', 'shall', 'can', 'if', 'unless', 'whether', 'while',
    'whereas', 'more', 'less', 'greater', 'fewer', 'increase', 'decrease',
    'associated', 'versus', 'compared', 'among', 'including', 'excluding'
}


# Stopword list (excluding preserved ones)
custom_stopwords = set(stopwords.words('english')) - preserve_stopwords
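
The set difference above is what keeps negations and temporal words in the text. The following is a minimal sketch of that effect, assuming a small hand-written stopword list (`base_stopwords`, `preserve`, and `drop_stopwords` are illustrative names, not part of the notebook, which uses NLTK's full English list):

```python
# Illustrative subset of an English stopword list. This is an assumption for
# the sketch; the notebook itself uses nltk's stopwords.words('english').
base_stopwords = {"the", "of", "a", "is", "no", "not", "before", "after", "and"}

# Words that carry clinical meaning and should survive filtering.
preserve = {"no", "not", "before", "after"}

# Same set-difference idea as custom_stopwords in the cell above.
custom = base_stopwords - preserve


def drop_stopwords(tokens):
    """Return the tokens that are not in the reduced stopword set."""
    return [t for t in tokens if t.lower() not in custom]


sentence = "no history of hypertension before the surgery"
kept = drop_stopwords(sentence.split())
print(kept)  # ['no', 'history', 'hypertension', 'before', 'surgery']
```

With the unmodified list, "no" and "before" would be dropped, and the phrase would no longer say whether hypertension was present or when.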

In [4]:
# Load spaCy English model (medium recommended)
nlp = spacy.load("en_core_web_md")

In [5]:
def clean_text(text):
    """
    Clean the input text by applying:
    - Lowercasing
    - Normalization of medical units and abbreviations
    - Removal of brackets, pipes, and slashes
    - Removal of punctuation
    - Collapsing of extra spaces

    Returns a cleaned string.
    """

    text = text.lower()

    # Expand abbreviations first: keys such as 'mg/dl' can only match
    # while the slash is still present. re.escape guards special characters.
    for abbr, full in normalization_dict.items():
        text = re.sub(rf'\b{re.escape(abbr)}\b', full, text)

    text = re.sub(r'[\{\}\<\>\[\]\(\)\|/]', '', text)
    text = text.translate(str.maketrans('', '', string.punctuation))

    # Collapse whitespace last so gaps left by removed characters are cleaned up too
    text = re.sub(r'\s+', ' ', text)
    return text.strip()


def spacy_parse(doc):
    """
    Collect lemmas from a parsed spaCy Doc, skipping whitespace and punctuation tokens.
    Returns a list of lemma strings.
    """

    lemmas = [
        token.lemma_
        for token in doc
        if not token.is_space and not token.is_punct
    ]

    return lemmas


def filter_stopwords(lemmas):
    """
    Remove stopwords from a list of lemmas, except for medically relevant ones.
    Returns a string with useful tokens only.
    """

    tokens = [
        token
        for token in lemmas
        if token.lower() not in custom_stopwords
    ]
    return ' '.join(tokens)
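
Abbreviation expansion depends on matching whole tokens only, and a unit such as `mg/dl` can only match while its slash is still in the text. The sketch below illustrates both points with a three-entry dictionary. It uses a single compiled alternation instead of the per-key loop in `clean_text`, so it is an alternative formulation rather than the notebook's code; `abbreviations`, `pattern`, and `expand` are hypothetical names.

```python
import re

# Tiny illustrative dictionary (a subset of normalization_dict).
abbreviations = {
    "bp": "blood pressure",
    "mi": "myocardial infarction",
    "mg/dl": "milligrams per deciliter",
}

# Longest keys first so longer entries win; re.escape protects any
# regex metacharacters that a key might contain.
keys = sorted(abbreviations, key=len, reverse=True)
pattern = re.compile(
    r"(?<!\w)(" + "|".join(re.escape(k) for k in keys) + r")(?!\w)"
)


def expand(text):
    """Lowercase the text and expand whole-token abbreviations."""
    return pattern.sub(lambda m: abbreviations[m.group(1)], text.lower())


print(expand("BP was stable after a mild MI"))
# blood pressure was stable after a mild myocardial infarction
print(expand("glucose 110 mg/dL"))
# glucose 110 milligrams per deciliter
print(expand("glucose 110 mgdl"))
# glucose 110 mgdl  (slash already stripped, so the unit is not recognized)
```

The "mi" inside "mild" is left alone because a word character follows it. Short keys such as `bid` or `im` can still collide with ordinary words, so spot-checking a sample of the output is advisable.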

In [6]:
def preprocess_pipeline(text):
    """
    Full preprocessing pipeline:
    - Clean text
    - Lemmatization
    - Stopword filtering

    Returns the final preprocessed string.
    """

    text = clean_text(text)

    doc = nlp(text)

    lemmas = spacy_parse(doc)

    processed_text = filter_stopwords(lemmas)

    return processed_text

## Dataset

**Source:** [MedQuAD: Medical Question Answer Dataset](https://www.kaggle.com/datasets/pythonafroz/medquad-medical-question-answer-for-ai-research/data)  
This dataset contains 16,412 medical question–answer pairs extracted from nine NIH (National Institutes of Health) websites.  
It covers a wide range of topics including diagnosis, treatments, side effects, and other medical knowledge.  
Each row includes a *question*, its corresponding *answer*, and additional metadata such as *source* and *focus area*.  
For this notebook, we use only the `question` and `answer` columns and drop rows where either value is missing.


In [7]:
# Load the dataset
df = pd.read_csv('medquad.csv')
df = df[['question', 'answer']].dropna().reset_index(drop=True)

print(f"Dataset loaded: {df.shape[0]} rows")
df.head()

Dataset loaded: 16407 rows


| | question | answer |
|---|---|---|
| 0 | What is (are) Glaucoma ? | Glaucoma is a group of diseases that can damag... |
| 1 | What causes Glaucoma ? | Nearly 2.7 million people have glaucoma, a lea... |
| 2 | What are the symptoms of Glaucoma ? | Symptoms of Glaucoma Glaucoma can develop in ... |
| 3 | What are the treatments for Glaucoma ? | Although open-angle glaucoma cannot be cured, ... |
| 4 | What is (are) Glaucoma ? | Glaucoma is a group of diseases that can damag... |


In [9]:
tqdm.pandas()

# Preprocess question
df['preprocessed_question'] = df['question'].progress_apply(preprocess_pipeline)

# Preprocess answer
df['preprocessed_answer'] = df['answer'].progress_apply(preprocess_pipeline)

# Combine with tags
df['preprocessed_qa'] = (
    '<question> ' + df['preprocessed_question'] + ' <end_question> '
    + '<answer> ' + df['preprocessed_answer'] + ' <end_answer>'
)
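
The tagged layout can be checked with a round trip: build the string for one pair, then split it back into its parts. This is a minimal sketch in plain Python; `build_qa`, `split_qa`, and `QA_PATTERN` are hypothetical helpers, not functions from the notebook.

```python
import re


def build_qa(question, answer):
    """Assemble one tagged string in the same layout as preprocessed_qa."""
    return f"<question> {question} <end_question> <answer> {answer} <end_answer>"


# Captures the text between each pair of tags.
QA_PATTERN = re.compile(
    r"<question> (.*?) <end_question> <answer> (.*?) <end_answer>",
    re.DOTALL,
)


def split_qa(tagged):
    """Recover (question, answer) from a tagged string, or raise ValueError."""
    match = QA_PATTERN.fullmatch(tagged)
    if match is None:
        raise ValueError("string does not follow the tagged layout")
    return match.group(1), match.group(2)


tagged = build_qa("cause glaucoma", "increase eye pressure damage optic nerve")
print(tagged)
print(split_qa(tagged))
```

Whether the T5 or BART tokenizer keeps markers like `<question>` as single tokens depends on the tokenizer's vocabulary. Registering them as special tokens before training is one option, but that step is outside this notebook.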


In [14]:
# Save the preprocessed data
df[['question', 'answer', 'preprocessed_question', 'preprocessed_answer', 'preprocessed_qa']].to_csv(
    'medquad_preprocessed_full.csv', index=False
)