# Medical Data Deduplication Tutorial

This notebook demonstrates how to use the project's `deduplication.py` script to remove duplicate entries from biomedical and question-answering (QA) datasets.

## Overview

The deduplication module provides two main functionalities:
1. **QA Deduplication**: Removes duplicates from question-answering datasets
2. **Biomedical Deduplication**: Removes duplicates from biomedical text datasets

## Table of Contents
1. [Setup and Environment Check](#setup)
2. [QA Dataset Deduplication](#qa-dedup)
3. [Biomedical Dataset Deduplication](#biomed-dedup)

## 1. Setup and Environment Check {#setup}

In [2]:
# Check if the deduplication.py script exists and is accessible
import os
from pathlib import Path

# Get the current notebook directory
notebook_dir = Path.cwd()
script_path = notebook_dir / "deduplication.py"

print(f"Notebook directory: {notebook_dir}")
print(f"Script path: {script_path}")
print(f"Script exists: {script_path.exists()}")

if script_path.exists():
    print("✅ deduplication.py script found!")
else:
    print("deduplication.py script not found!")
    print("Please ensure the script is in the same directory as this notebook.")

Notebook directory: /home/tjl20001104/workspace/Projects/USC/biobank/hugging-health/medicalproject2024/tutorial
Script path: /home/tjl20001104/workspace/Projects/USC/biobank/hugging-health/medicalproject2024/tutorial/deduplication.py
Script exists: True
✅ deduplication.py script found!


In [3]:
# Test the script help to ensure it's working
!python deduplication.py --help

usage: deduplication.py [-h] [--mode {qa,biomed}]
                        [--qa-datasets [{LiveQA,MedicationQA,MedMCQA,MedQA-USMLE,PubMedQA} ...]]
                        [--biomed-datasets [{bc5cdr,BioNLI,CORD19,hoc,SourceData} ...]]
                        [--input-dir INPUT_DIR] [--output-dir OUTPUT_DIR]
                        [--test]

Medical Data Deduplication Tutorial

options:
  -h, --help            show this help message and exit
  --mode {qa,biomed}    Processing mode (default: qa)
  --qa-datasets [{LiveQA,MedicationQA,MedMCQA,MedQA-USMLE,PubMedQA} ...]
                        QA datasets to process
  --biomed-datasets [{bc5cdr,BioNLI,CORD19,hoc,SourceData} ...]
                        Biomedical datasets to process
  --input-dir INPUT_DIR
                        Input data directory (default: data)
  --output-dir OUTPUT_DIR
                        Output directory for deduplicated datasets (default:
                        data)
  --test                Run in test mode wit
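
The usage string above maps closely onto Python's `argparse`. The sketch below is a reconstruction from the help text only, not the script's actual source: `nargs="*"` with `choices` is one way to get the `[{...} ...]` form shown for the dataset options, and the `None` defaults for those options are an assumption. The help text for `--test` is truncated in the output, so the sketch leaves it out.

```python
import argparse

QA_DATASETS = ["LiveQA", "MedicationQA", "MedMCQA", "MedQA-USMLE", "PubMedQA"]
BIOMED_DATASETS = ["bc5cdr", "BioNLI", "CORD19", "hoc", "SourceData"]

def build_parser():
    """Build a parser that mirrors the options listed in the help output."""
    parser = argparse.ArgumentParser(
        prog="deduplication.py",
        description="Medical Data Deduplication Tutorial",
    )
    parser.add_argument("--mode", choices=["qa", "biomed"], default="qa",
                        help="Processing mode (default: qa)")
    # nargs="*" accepts zero or more names, each checked against choices
    parser.add_argument("--qa-datasets", nargs="*", choices=QA_DATASETS,
                        help="QA datasets to process")
    parser.add_argument("--biomed-datasets", nargs="*", choices=BIOMED_DATASETS,
                        help="Biomedical datasets to process")
    parser.add_argument("--input-dir", default="data",
                        help="Input data directory (default: data)")
    parser.add_argument("--output-dir", default="data",
                        help="Output directory for deduplicated datasets (default: data)")
    # Help text for --test is cut off in the captured output, so none is given
    parser.add_argument("--test", action="store_true")
    return parser

args = build_parser().parse_args(
    ["--mode", "biomed", "--biomed-datasets", "bc5cdr", "BioNLI", "--test"]
)
print(args.mode, args.biomed_datasets, args.test)
```

Because `--mode` defaults to `qa`, a biomedical run has to pass `--mode biomed` explicitly, as the commands later in this notebook do.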

## 2. QA Dataset Deduplication {#qa-dedup}
This section covers a quick test run (`--test`) and a full run of QA deduplication.

### Quick Test QA Dataset

In [4]:
# Quick test on multiple QA datasets (running this in a terminal is recommended)
!python deduplication.py --mode qa --qa-datasets LiveQA MedicationQA MedQA-USMLE --input-dir data --output-dir data --test

Script directory: /home/tjl20001104/workspace/Projects/USC/biobank/hugging-health/medicalproject2024/tutorial
Project root: /home/tjl20001104/workspace/Projects/USC/biobank/hugging-health
Python path updated with: /home/tjl20001104/workspace/Projects/USC/biobank/hugging-health
QA DATASET DEDUPLICATION
(RUNNING IN TEST MODE)
✅ Successfully imported deduplication modules
Processing 3 QA datasets: ['LiveQA', 'MedicationQA', 'MedQA-USMLE']
Ignored unknown kwarg option special
Model loaded successfully on device: cuda
Found local copy...
Generating embeddings: 100%|██████████████████████| 2/
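
The log shows embeddings being generated, which suggests that duplicates are matched by similarity rather than by exact string comparison. The script's internals are not shown in this notebook, so the following is only a minimal sketch of that general idea, not the script's method. The toy 2-D vectors, the `drop_near_duplicates` helper, and the 0.95 threshold are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def drop_near_duplicates(texts, embeddings, threshold=0.95):
    """Keep a text only if it is not too similar to any text already kept."""
    kept = []  # indices of the entries retained so far
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return [texts[i] for i in kept]

# Toy vectors standing in for real sentence embeddings (hypothetical values)
texts = [
    "What is aspirin used for?",
    "What is aspirin used for ?",
    "How is insulin stored?",
]
embeddings = [[1.0, 0.0], [0.999, 0.01], [0.0, 1.0]]

unique = drop_near_duplicates(texts, embeddings)
print(unique)
```

The greedy pass compares each entry with everything kept so far, which is quadratic in the worst case; that is fine for an illustration but is usually replaced by an approximate nearest-neighbour index on large datasets.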

### Full Processing QA Dataset

In [None]:
# Full processing of multiple QA datasets (running this in a terminal is recommended)
!python deduplication.py --mode qa --qa-datasets LiveQA MedicationQA MedQA-USMLE --input-dir data --output-dir data

## 3. Biomedical Dataset Deduplication {#biomed-dedup}
This section covers a quick test run (`--test`) and a full run of biomedical deduplication.

### Quick Test Biomed Dataset

In [None]:
# Quick test on multiple biomedical datasets (running this in a terminal is recommended)
!python deduplication.py --mode biomed --biomed-datasets bc5cdr BioNLI --input-dir data --output-dir data --test

### Full Processing Biomed Dataset

In [None]:
# Full processing of multiple biomedical datasets (running this in a terminal is recommended)
!python deduplication.py --mode biomed --biomed-datasets bc5cdr BioNLI --input-dir data --output-dir data

## Utility Functions

The helpers below list the files produced by deduplication and launch `deduplication.py` with custom arguments.

In [None]:
# Function to check output files
def check_output_files(output_dir="data"):
    """Check what output files were created"""
    output_path = Path(output_dir)
    
    qa_dir = output_path / "deduplicate_qa"
    biomed_dir = output_path / "deduplicate_biomed"
    
    print("Output Files:")
    print("-" * 40)
    
    if qa_dir.exists():
        print("QA Deduplication Output:")
        for file in qa_dir.rglob("*"):
            if file.is_file():
                print(f"  {file.relative_to(output_path)}")
    
    if biomed_dir.exists():
        print("Biomedical Deduplication Output:")
        for file in biomed_dir.rglob("*"):
            if file.is_file():
                print(f"  {file.relative_to(output_path)}")

# Check output files
check_output_files()

In [None]:
# Function to run a custom deduplication command
import subprocess
import sys

def run_custom_deduplication(qa_datasets=None, biomed_datasets=None, input_dir="data", output_dir="data"):
    """Run deduplication.py once per mode for the datasets that were given"""
    # --mode defaults to qa, so each mode needs its own invocation
    jobs = [("qa", "--qa-datasets", qa_datasets),
            ("biomed", "--biomed-datasets", biomed_datasets)]
    for mode, flag, datasets in jobs:
        if not datasets:
            continue
        # sys.executable is the interpreter running this notebook
        cmd = [sys.executable, "deduplication.py", "--mode", mode, flag, *datasets,
               "--input-dir", input_dir, "--output-dir", output_dir]
        print(f"Running command: {' '.join(cmd)}")
        print("-" * 60)
        # Pass the arguments as a list so no shell quoting is needed
        subprocess.run(cmd, check=False)

# Example usage:
# run_custom_deduplication(qa_datasets=["LiveQA"], biomed_datasets=["bc5cdr"])

## Summary

This notebook demonstrates how to use the `deduplication.py` script to process medical datasets. The script provides two processing modes, selected with `--mode`:

- **qa**: Process only QA datasets
- **biomed**: Process only biomedical datasets

### Available Datasets:

**QA Datasets:**
- LiveQA
- MedicationQA
- MedMCQA
- MedQA-USMLE
- PubMedQA

**Biomedical Datasets:**
- bc5cdr
- BioNLI
- CORD19
- hoc
- SourceData

### Best Practices:

1. Start with the quick-test mode (`--test`) to verify that everything works.
2. Then process the full datasets, which can surface issues the quick test does not cover.
3. Check the output files after processing.
4. Monitor the console output for errors and warnings.