# Data Preprocessing Pipeline - SemEval 2010 Task 8

## Overview
This notebook implements the data preprocessing pipeline for the **SemEval 2010 Task 8** dataset:
- **Task**: Multi-Way Classification of Semantic Relations Between Pairs of Nominals
- **Dataset**: 8,000 training + 2,717 test sentences
- **Relations**: 9 semantic relations × 2 directions + an undirected "Other" class = 19 labels

## Preprocessing Steps
1. **Load Data**: Read raw text files from SemEval format
2. **Parse Structure**: Extract sentence ID, text, entities (e1, e2), and labels
3. **Clean Text**: Remove the `<e1>`/`<e2>` markup tags to obtain a clean sentence
4. **Extract Components**: Separate entity mentions and context
5. **Save Structured Data**: Store in CSV/JSON for downstream tasks

## 1. Import Libraries

In [17]:
import re
import pandas as pd
import json
from pathlib import Path
from typing import Dict, List, Tuple
from collections import Counter

## 2. Define Data Paths

In [18]:
# Base paths
BASE_DIR = Path("../..").resolve()
RESOURCES_DIR = BASE_DIR / "resources" / "SemEval2010_task8_all_data 2"
DATA_DIR = BASE_DIR / "data"

# Input files
TRAIN_FILE = RESOURCES_DIR / "SemEval2010_task8_training" / "TRAIN_FILE.TXT"
TEST_FILE = RESOURCES_DIR / "SemEval2010_task8_testing_keys" / "TEST_FILE_FULL.TXT"
TEST_KEY_FILE = RESOURCES_DIR / "SemEval2010_task8_testing_keys" / "TEST_FILE_KEY.TXT"

# Output directories
RAW_DIR = DATA_DIR / "raw"
PREPROCESSED_DIR = DATA_DIR / "preprocessed"

# Create directories if they don't exist
RAW_DIR.mkdir(parents=True, exist_ok=True)
PREPROCESSED_DIR.mkdir(parents=True, exist_ok=True)

print(f"Base directory: {BASE_DIR}")
print(f"Train file exists: {TRAIN_FILE.exists()}")
print(f"Test file exists: {TEST_FILE.exists()}")

Base directory: /Users/egeaydin/Github/ML2025WS/Token13-tuw-nlp-ie-2025WS
Train file exists: True
Test file exists: True


## 3. Define Relation Types

SemEval 2010 Task 8 includes 9 relation types. With directionality (e1,e2) and (e2,e1), we have 18 directed relations + "Other" = 19 total labels.

In [19]:
# The 9 base semantic relations, plus the undirected "Other" class
RELATIONS = [
    "Cause-Effect",
    "Instrument-Agency",
    "Product-Producer",
    "Content-Container",
    "Entity-Origin",
    "Entity-Destination",
    "Component-Whole",
    "Member-Collection",
    "Message-Topic",
    "Other"
]

print(f"Total relation types: {len(RELATIONS)}")
print(f"Relations: {', '.join(RELATIONS)}")

Total relation types: 10
Relations: Cause-Effect, Instrument-Agency, Product-Producer, Content-Container, Entity-Origin, Entity-Destination, Component-Whole, Member-Collection, Message-Topic, Other
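
The count of 19 labels follows from 9 directed relations in two directions plus "Other". A minimal sketch of that enumeration is shown below; the names `BASE_RELATIONS`, `ALL_LABELS`, and `LABEL_TO_ID` are hypothetical and not part of this notebook, and the label ordering is arbitrary rather than an official one.

```python
# The 9 directed base relations (same names as RELATIONS above, without "Other").
BASE_RELATIONS = [
    "Cause-Effect",
    "Instrument-Agency",
    "Product-Producer",
    "Content-Container",
    "Entity-Origin",
    "Entity-Destination",
    "Component-Whole",
    "Member-Collection",
    "Message-Topic",
]
DIRECTIONS = ("e1,e2", "e2,e1")

# Hypothetical names: every relation in both directions, then the undirected "Other".
ALL_LABELS = [f"{rel}({d})" for rel in BASE_RELATIONS for d in DIRECTIONS] + ["Other"]

# An integer id per label, e.g. for use as a classification target.
LABEL_TO_ID = {label: idx for idx, label in enumerate(ALL_LABELS)}

print(len(ALL_LABELS))  # 9 * 2 + 1
```

The label strings use the same `Relation(e1,e2)` form that appears in the raw files, so the `relation_full` column can be mapped through `LABEL_TO_ID` directly.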


## 4. Data Parser Functions

### Parse SemEval Format
The data format has 3-4 lines per example:
- Line 1: `ID "sentence with <e1>entity1</e1> and <e2>entity2</e2>"`
- Line 2: `Relation(e1,e2)` or `Other`
- Line 3: `Comment: ...` (optional)
- Blank line separator
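
To make the layout concrete, here is a small sketch that splits one in-memory record into its parts. The record text is the first training example shown later in this notebook; the tab between ID and sentence is an assumption about the raw file (the regex accepts any whitespace there).

```python
import re

# One record as it might appear in the file (tab after the ID is an assumption).
record = (
    '1\t"The system as described above has its greatest application in an arrayed '
    '<e1>configuration</e1> of antenna <e2>elements</e2>."\n'
    "Component-Whole(e2,e1)\n"
    "Comment: Not a collection: there is structure here, organisation.\n"
)

sentence_line, relation_line, comment_line = record.splitlines()

# Line 1: numeric ID, whitespace, then the quoted sentence.
match = re.match(r'^(\d+)\s+"(.+)"$', sentence_line)
example_id, sentence = match.group(1), match.group(2)

# Line 3: drop only the leading "Comment:" prefix.
comment = comment_line[len("Comment:"):].strip()

print(example_id, relation_line, comment, sep=" | ")
```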

In [20]:
def parse_semeval_file(file_path: Path) -> List[Dict[str, str]]:
    """
    Parse SemEval 2010 Task 8 file format.
    
    Parameters
    ----------
    file_path : Path
        Path to the SemEval data file
    
    Returns
    -------
    List[Dict[str, str]]
        List of parsed examples with keys: id, sentence, relation, comment
    """
    examples = []
    
    with open(file_path, "r", encoding="utf-8") as f:
        lines = f.readlines()
    
    i = 0
    while i < len(lines):
        line = lines[i].strip()
        
        # Skip empty lines
        if not line:
            i += 1
            continue
        
        # Parse sentence line: ID "sentence"
        match = re.match(r'^(\d+)\s+"(.+)"$', line)
        if match:
            example_id = match.group(1)
            sentence = match.group(2)
            
            # Get relation (next line)
            i += 1
            relation = lines[i].strip() if i < len(lines) else ""
            
            # Get comment if exists (next line)
            i += 1
            comment = ""
            if i < len(lines) and lines[i].strip().startswith("Comment:"):
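                # Strip only the leading "Comment:" prefix; replace() would also
                # remove any later occurrence inside the comment text
                comment = lines[i].strip()[len("Comment:"):].strip()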
                i += 1
            
            examples.append({
                "id": example_id,
                "sentence": sentence,
                "relation": relation,
                "comment": comment
            })
        else:
            i += 1
    
    return examples


def extract_entities(sentence: str) -> Tuple[str, str, str]:
    """
    Extract entities and clean sentence from markup.
    
    Parameters
    ----------
    sentence : str
        Raw sentence with <e1> and <e2> tags
    
    Returns
    -------
    Tuple[str, str, str]
        (e1_text, e2_text, clean_sentence)
    """
    # Extract entity 1
    e1_match = re.search(r'<e1>(.+?)</e1>', sentence)
    e1_text = e1_match.group(1) if e1_match else ""
    
    # Extract entity 2
    e2_match = re.search(r'<e2>(.+?)</e2>', sentence)
    e2_text = e2_match.group(1) if e2_match else ""
    
    # Clean sentence (remove tags)
    clean_sentence = re.sub(r'</?e[12]>', '', sentence)
    
    return e1_text, e2_text, clean_sentence


def parse_relation_label(relation: str) -> Tuple[str, str]:
    """
    Parse relation string into relation type and direction.
    
    Parameters
    ----------
    relation : str
        Relation string like "Cause-Effect(e1,e2)" or "Other"
    
    Returns
    -------
    Tuple[str, str]
        (relation_type, direction) where direction is "e1,e2", "e2,e1",
        "none" (for "Other"), or "unknown" (label could not be parsed)
    """
    if relation == "Other":
        return "Other", "none"
    
    # Match pattern: RelationType(e1,e2) or RelationType(e2,e1);
    # malformed directions such as (e1,e1) fall through to "unknown"
    match = re.fullmatch(r'(.+)\((e1,e2|e2,e1)\)', relation)
    if match:
        relation_type = match.group(1)
        direction = match.group(2)
        return relation_type, direction
    
    return relation, "unknown"


print("✓ Parser functions defined")

✓ Parser functions defined
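
`extract_entities` returns the entity strings but not where they sit in the clean sentence. If downstream steps need to separate the mentions from their context, character offsets may help. The sketch below is one possible way to compute them; `extract_entity_spans` is a hypothetical helper, not part of this notebook, and the sample sentence is the first test example shown further down.

```python
import re
from typing import Dict, Tuple

TAG_RE = re.compile(r"<(/?)(e[12])>")


def extract_entity_spans(sentence: str) -> Tuple[str, Dict[str, Tuple[int, int]]]:
    """Return the tag-free sentence and (start, end) offsets of e1/e2 within it."""
    clean_parts = []
    starts: Dict[str, int] = {}
    spans: Dict[str, Tuple[int, int]] = {}
    raw_pos = 0    # position in the tagged sentence
    clean_len = 0  # length of the clean sentence built so far

    for tag in TAG_RE.finditer(sentence):
        chunk = sentence[raw_pos:tag.start()]
        clean_parts.append(chunk)
        clean_len += len(chunk)
        is_closing, name = bool(tag.group(1)), tag.group(2)
        if not is_closing:
            starts[name] = clean_len
        elif name in starts:
            spans[name] = (starts[name], clean_len)
        raw_pos = tag.end()

    clean_parts.append(sentence[raw_pos:])
    return "".join(clean_parts), spans


raw = "The most common <e1>audits</e1> were about <e2>waste</e2> and recycling."
clean, spans = extract_entity_spans(raw)
e1_start, e1_end = spans["e1"]
e2_start, e2_end = spans["e2"]
print(clean[e1_start:e1_end], "|", clean[e2_start:e2_end])
```

Offsets are half-open (`end` is exclusive), so slicing the clean sentence with a span gives back the entity text.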


## 5. Load and Parse Training Data

In [21]:
# Parse training file
print("Parsing training data...")
train_examples = parse_semeval_file(TRAIN_FILE)
print(f"Loaded {len(train_examples)} training examples")

# Show first example
if train_examples:
    print("\nFirst training example:")
    print(f"ID: {train_examples[0]['id']}")
    print(f"Sentence: {train_examples[0]['sentence']}")
    print(f"Relation: {train_examples[0]['relation']}")
    print(f"Comment: {train_examples[0]['comment']}")

Parsing training data...
Loaded 8000 training examples

First training example:
ID: 1
Sentence: The system as described above has its greatest application in an arrayed <e1>configuration</e1> of antenna <e2>elements</e2>.
Relation: Component-Whole(e2,e1)
Comment: Not a collection: there is structure here, organisation.


## 6. Load and Parse Test Data

In [22]:
# Parse test file
print("Parsing test data...")
test_examples = parse_semeval_file(TEST_FILE)
print(f"Loaded {len(test_examples)} test examples")

# Show first example
if test_examples:
    print("\nFirst test example:")
    print(f"ID: {test_examples[0]['id']}")
    print(f"Sentence: {test_examples[0]['sentence']}")
    print(f"Relation: {test_examples[0]['relation']}")
    print(f"Comment: {test_examples[0]['comment']}")

Parsing test data...
Loaded 2717 test examples

First test example:
ID: 8001
Sentence: The most common <e1>audits</e1> were about <e2>waste</e2> and recycling.
Relation: Message-Topic(e1,e2)
Comment: Assuming an audit = an audit document.


## 7. Process and Structure Data

Extract entities and create structured records for each example.

In [23]:
def process_examples(examples: List[Dict[str, str]]) -> pd.DataFrame:
    """
    Process raw examples into structured DataFrame.
    
    Parameters
    ----------
    examples : List[Dict[str, str]]
        Raw parsed examples
    
    Returns
    -------
    pd.DataFrame
        Structured DataFrame with all extracted features
    """
    processed_data = []
    
    for ex in examples:
        # Extract entities
        e1_text, e2_text, clean_sent = extract_entities(ex["sentence"])
        
        # Parse relation
        relation_type, direction = parse_relation_label(ex["relation"])
        
        processed_data.append({
            "id": int(ex["id"]),
            "sentence_raw": ex["sentence"],
            "sentence_clean": clean_sent,
            "e1": e1_text,
            "e2": e2_text,
            "relation_full": ex["relation"],
            "relation_type": relation_type,
            "relation_direction": direction,
            "comment": ex["comment"]
        })
    
    return pd.DataFrame(processed_data)


# Process training data
print("Processing training examples...")
train_df = process_examples(train_examples)
print(f"Processed {len(train_df)} training examples")

# Process test data
print("\nProcessing test examples...")
test_df = process_examples(test_examples)
print(f"Processed {len(test_df)} test examples")

# Display sample
print("\nSample processed training data:")
train_df.head(3)

Processing training examples...
Processed 8000 training examples

Processing test examples...
Processed 2717 test examples

Sample processed training data:


| | id | sentence_raw | sentence_clean | e1 | e2 | relation_full | relation_type | relation_direction | comment |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | The system as described above has its greatest... | The system as described above has its greatest... | configuration | elements | Component-Whole(e2,e1) | Component-Whole | e2,e1 | Not a collection: there is structure here, org... |
| 1 | 2 | `The <e1>child</e1> was carefully wrapped and b...` | The child was carefully wrapped and bound into... | child | cradle | Other | Other | none | |
| 2 | 3 | `The <e1>author</e1> of a keygen uses a <e2>dis...` | The author of a keygen uses a disassembler to ... | author | disassembler | Instrument-Agency(e2,e1) | Instrument-Agency | e2,e1 | |


## 8. Data Statistics and Validation

In [24]:
def print_data_statistics(df: pd.DataFrame, dataset_name: str) -> None:
    """
    Print comprehensive statistics about the dataset.
    
    Parameters
    ----------
    df : pd.DataFrame
        Processed dataset
    dataset_name : str
        Name of the dataset (e.g., "Training", "Test")
    """
    print(f"\n{'='*60}")
    print(f"{dataset_name} Dataset Statistics")
    print(f"{'='*60}")
    
    print(f"\nTotal examples: {len(df)}")
    print(f"\nColumns: {list(df.columns)}")
    
    # Relation distribution
    print(f"\nRelation Type Distribution:")
    print("-" * 40)
    relation_counts = df["relation_type"].value_counts()
    for rel, count in relation_counts.items():
        percentage = (count / len(df)) * 100
        print(f"{rel:25s}: {count:4d} ({percentage:5.2f}%)")
    
    # Direction distribution (excluding "Other")
    print(f"\nRelation Direction Distribution (non-Other):")
    print("-" * 40)
    non_other = df[df["relation_type"] != "Other"]
    if len(non_other) > 0:
        direction_counts = non_other["relation_direction"].value_counts()
        for direction, count in direction_counts.items():
            percentage = (count / len(non_other)) * 100
            print(f"{direction:10s}: {count:4d} ({percentage:5.2f}%)")
    
    # Full relation label distribution
    print(f"\nTop 10 Full Relation Labels:")
    print("-" * 40)
    full_rel_counts = df["relation_full"].value_counts().head(10)
    for rel, count in full_rel_counts.items():
        print(f"{rel:30s}: {count:4d}")
    
    # Entity statistics
    print(f"\nEntity Statistics:")
    print("-" * 40)
    print(f"Average e1 length: {df['e1'].str.len().mean():.2f} characters")
    print(f"Average e2 length: {df['e2'].str.len().mean():.2f} characters")
    print(f"Average sentence length: {df['sentence_clean'].str.len().mean():.2f} characters")
    
    # Missing values
    print(f"\nMissing Values:")
    print("-" * 40)
    missing = df.isnull().sum()
    if missing.sum() == 0:
        print("No missing values!")
    else:
        print(missing[missing > 0])


# Print statistics
print_data_statistics(train_df, "Training")
print_data_statistics(test_df, "Test")


Training Dataset Statistics

Total examples: 8000

Columns: ['id', 'sentence_raw', 'sentence_clean', 'e1', 'e2', 'relation_full', 'relation_type', 'relation_direction', 'comment']

Relation Type Distribution:
----------------------------------------
Other                    : 1410 (17.62%)
Cause-Effect             : 1003 (12.54%)
Component-Whole          :  941 (11.76%)
Entity-Destination       :  845 (10.56%)
Product-Producer         :  717 ( 8.96%)
Entity-Origin            :  716 ( 8.95%)
Member-Collection        :  690 ( 8.62%)
Message-Topic            :  634 ( 7.92%)
Content-Container        :  540 ( 6.75%)
Instrument-Agency        :  504 ( 6.30%)

Relation Direction Distribution (non-Other):
----------------------------------------
e1,e2     : 3588 (54.45%)
e2,e1     : 3002 (45.55%)

Top 10 Full Relation Labels:
----------------------------------------
Other                         : 1410
Entity-Destination(e1,e2)     :  844
Cause-Effect(e2,e1)           :  659
Member-Collection(
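
Note that `df.isnull()` only counts `NaN`/`None`. Since `extract_entities` returns `""` when a tag is missing and many comments are empty strings, "No missing values!" does not rule out empty fields. A minimal sketch of an additional check follows; `count_empty_strings` and the toy DataFrame are hypothetical and only illustrate the idea.

```python
from typing import Dict, List

import pandas as pd


def count_empty_strings(df: pd.DataFrame, columns: List[str]) -> Dict[str, int]:
    """Count values that are empty or whitespace-only in the given text columns."""
    counts = {}
    for col in columns:
        as_text = df[col].fillna("").astype(str).str.strip()
        counts[col] = int(as_text.eq("").sum())
    return counts


# Toy data (not from the dataset): one missing entity, two empty comments.
toy_df = pd.DataFrame({
    "e1": ["child", ""],
    "e2": ["cradle", "disassembler"],
    "comment": ["", ""],
})

print(toy_df.isnull().sum().sum())  # isnull() does not flag empty strings
empty_counts = count_empty_strings(toy_df, ["e1", "e2", "comment"])
print(empty_counts)
```

Applied to `train_df` and `test_df`, a non-zero count for `e1` or `e2` would point to sentences whose entity tags were not matched.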

## 9. Save Preprocessed Data

Save the structured data in the following formats:
- **CSV**: Easy to read with pandas, good for tabular data
- **JSON**: Hierarchical structure, good for nested data
- **Parquet**: Efficient binary format (optional; the cell below does not write it)

In [25]:
# Save training data
train_csv_path = PREPROCESSED_DIR / "train.csv"
train_json_path = PREPROCESSED_DIR / "train.json"

train_df.to_csv(train_csv_path, index=False)
train_df.to_json(train_json_path, orient="records", indent=2)

print(f"✓ Training data saved:")
print(f"  - CSV: {train_csv_path}")
print(f"  - JSON: {train_json_path}")

# Save test data
test_csv_path = PREPROCESSED_DIR / "test.csv"
test_json_path = PREPROCESSED_DIR / "test.json"

test_df.to_csv(test_csv_path, index=False)
test_df.to_json(test_json_path, orient="records", indent=2)

print(f"\n✓ Test data saved:")
print(f"  - CSV: {test_csv_path}")
print(f"  - JSON: {test_json_path}")

# Save metadata
metadata = {
    "dataset": "SemEval 2010 Task 8",
    "task": "Multi-Way Classification of Semantic Relations Between Pairs of Nominals",
    "train_size": len(train_df),
    "test_size": len(test_df),
    "num_relations": len(RELATIONS),
    "relations": RELATIONS,
    "columns": list(train_df.columns)
}

metadata_path = PREPROCESSED_DIR / "metadata.json"
with open(metadata_path, "w", encoding="utf-8") as f:
    json.dump(metadata, f, indent=2)

print(f"\n✓ Metadata saved: {metadata_path}")

✓ Training data saved:
  - CSV: /Users/egeaydin/Github/ML2025WS/Token13-tuw-nlp-ie-2025WS/data/preprocessed/train.csv
  - JSON: /Users/egeaydin/Github/ML2025WS/Token13-tuw-nlp-ie-2025WS/data/preprocessed/train.json

✓ Test data saved:
  - CSV: /Users/egeaydin/Github/ML2025WS/Token13-tuw-nlp-ie-2025WS/data/preprocessed/test.csv
  - JSON: /Users/egeaydin/Github/ML2025WS/Token13-tuw-nlp-ie-2025WS/data/preprocessed/test.json

✓ Metadata saved: /Users/egeaydin/Github/ML2025WS/Token13-tuw-nlp-ie-2025WS/data/preprocessed/metadata.json
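
Downstream code that does not use pandas could read the `orient="records"` JSON with the standard library. The sketch below round-trips a single record through a temporary file; the one-row DataFrame and the file name `sample.json` are illustrative assumptions, with values taken from the first test example above.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

# One-row stand-in for train_df / test_df (a subset of the real columns).
sample_df = pd.DataFrame([{
    "id": 8001,
    "sentence_raw": "The most common <e1>audits</e1> were about <e2>waste</e2> and recycling.",
    "e1": "audits",
    "e2": "waste",
    "relation_full": "Message-Topic(e1,e2)",
}])

with tempfile.TemporaryDirectory() as tmp_dir:
    json_path = Path(tmp_dir) / "sample.json"
    sample_df.to_json(json_path, orient="records", indent=2)

    # orient="records" yields a JSON array with one object per row.
    with open(json_path, "r", encoding="utf-8") as f:
        records = json.load(f)

print(records[0]["e1"], records[0]["relation_full"])
```

pandas may write the closing tags with an escaped slash (`<\/e1>`) in the file; `json.load` decodes that back to `</e1>`, so the raw sentence survives the round trip.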


## 10. Validation: Load and Verify Saved Data

In [26]:
# Load saved data
train_loaded = pd.read_csv(train_csv_path)
test_loaded = pd.read_csv(test_csv_path)

print("Validation Results:")
print("-" * 40)
print(f"✓ Training data loaded: {len(train_loaded)} examples")
print(f"✓ Test data loaded: {len(test_loaded)} examples")
print(f"✓ Data integrity check: {len(train_loaded) == len(train_df) and len(test_loaded) == len(test_df)}")

# Display sample
print("\nSample from loaded training data:")
train_loaded.head(3)

Validation Results:
----------------------------------------
✓ Training data loaded: 8000 examples
✓ Test data loaded: 2717 examples
✓ Data integrity check: True

Sample from loaded training data:


| | id | sentence_raw | sentence_clean | e1 | e2 | relation_full | relation_type | relation_direction | comment |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | The system as described above has its greatest... | The system as described above has its greatest... | configuration | elements | Component-Whole(e2,e1) | Component-Whole | e2,e1 | Not a collection: there is structure here, org... |
| 1 | 2 | `The <e1>child</e1> was carefully wrapped and b...` | The child was carefully wrapped and bound into... | child | cradle | Other | Other | none | |
| 2 | 3 | `The <e1>author</e1> of a keygen uses a <e2>dis...` | The author of a keygen uses a disassembler to ... | author | disassembler | Instrument-Agency(e2,e1) | Instrument-Agency | e2,e1 | |
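
The integrity check above compares row counts only. By default `pd.read_csv` turns empty fields into `NaN`, so empty comments would not reload as empty strings. A stricter check could compare full frames after loading with `keep_default_na=False`, as in this sketch; the toy DataFrame and in-memory buffer are stand-ins for `train_df` and the saved CSV file.

```python
import io

import pandas as pd

# Toy stand-in for train_df, including an empty comment.
original = pd.DataFrame({
    "id": [1, 2],
    "e1": ["configuration", "child"],
    "relation_direction": ["e2,e1", "none"],
    "comment": ["Not a collection", ""],
})

buffer = io.StringIO()
original.to_csv(buffer, index=False)

# Default loading: the empty comment comes back as NaN.
buffer.seek(0)
default_load = pd.read_csv(buffer)

# keep_default_na=False keeps empty fields as empty strings.
buffer.seek(0)
strict_load = pd.read_csv(buffer, keep_default_na=False)

print(default_load["comment"].isna().sum())
pd.testing.assert_frame_equal(original, strict_load)  # raises if anything differs
```

Whether `NaN` or `""` is the better representation for empty comments depends on the downstream code; the point is only that the two loads are not equivalent.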


## Summary

**Completed Data Preprocessing Steps:**
1. Loaded SemEval 2010 Task 8 raw data files
2. Parsed structured format (ID, sentence, relation, comment)
3. Extracted entity mentions (e1, e2) and removed markup tags
4. Separated relation type and directionality
5. Generated clean sentences and structured features
6. Saved preprocessed data in CSV and JSON formats
7. Validated that the reloaded CSV files have the expected row counts

**Output Files:**
- `data/preprocessed/train.csv` - Training data (8,000 examples)
- `data/preprocessed/test.csv` - Test data (2,717 examples)
- `data/preprocessed/train.json` - Training data (JSON format)
- `data/preprocessed/test.json` - Test data (JSON format)
- `data/preprocessed/metadata.json` - Dataset metadata