# Text Preprocessing

## Overview
This notebook performs **text analysis** on the preprocessed SemEval 2010 Task 8 data using spaCy.

## 1. Import Libraries and Load Data

In [1]:
import pandas as pd
import spacy
import json
from pathlib import Path
from typing import List, Dict, Tuple, Any
from collections import Counter
from tqdm import tqdm

# For visualizing dependency trees (optional)
from spacy import displacy

In [2]:
# Set paths
BASE_DIR = Path("../..").resolve()
DATA_DIR = BASE_DIR / "data"
RAW_DIR = DATA_DIR / "raw"
PREPROCESSED_DIR = DATA_DIR / "preprocessed"

# Create output directory
PREPROCESSED_DIR.mkdir(parents=True, exist_ok=True)

print(f"Loading preprocessed data from: {RAW_DIR}")

# Load preprocessed data
train_df = pd.read_csv(RAW_DIR / "train.csv")
test_df = pd.read_csv(RAW_DIR / "test.csv")

print(f"✓ Loaded {len(train_df)} training examples")
print(f"✓ Loaded {len(test_df)} test examples")

Loading preprocessed data from: /Users/egeaydin/Github/ML2025WS/Token13-tuw-nlp-ie-2025WS/data/raw
✓ Loaded 8000 training examples
✓ Loaded 2717 test examples


## 2. Load spaCy Model

We'll use the English language model. If not installed, run:
```bash
python -m spacy download en_core_web_sm
```

For better accuracy (but slower), you can use:
- `en_core_web_md` - Medium model with word vectors
- `en_core_web_lg` - Large model with more word vectors

In [3]:
# Load spaCy model
try:
    nlp = spacy.load("en_core_web_lg")
    print(f"✓ Loaded spaCy model: en_core_web_lg")
except OSError:
    print("Model not found. Installing en_core_web_sm...")
    import subprocess
    subprocess.run(["python", "-m", "spacy", "download", "en_core_web_sm"])
    nlp = spacy.load("en_core_web_sm")
    print(f"✓ Installed and loaded spaCy model: en_core_web_sm")

# Check pipeline components
print(f"\nPipeline components: {nlp.pipe_names}")

✓ Loaded spaCy model: en_core_web_lg

Pipeline components: ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']


## 3. Define Text Feature Extraction Functions

In [4]:
def get_entity_positions(doc: spacy.tokens.Doc, e1_text: str, e2_text: str) -> Tuple[Tuple[int, int], Tuple[int, int]]:
    """
    Find token indices for entities in the document.
    
    Parameters
    ----------
    doc : spacy.tokens.Doc
        Processed spaCy document
    e1_text : str
        Entity 1 text
    e2_text : str
        Entity 2 text
    
    Returns
    -------
    Tuple[Tuple[int, int], Tuple[int, int]]
        ((e1_start, e1_end), (e2_start, e2_end)) token indices
    """
    text = doc.text.lower()
    e1_lower = e1_text.lower()
    e2_lower = e2_text.lower()
    
    # Find character positions
    e1_char_start = text.find(e1_lower)
    e2_char_start = text.find(e2_lower)
    
    # Convert to token positions
    e1_pos = (-1, -1)
    e2_pos = (-1, -1)
    
    for i, token in enumerate(doc):
        if e1_char_start != -1 and token.idx <= e1_char_start < token.idx + len(token.text):
            e1_start = i
            # Find end token
            for j in range(i, len(doc)):
                if doc[j].idx + len(doc[j].text) >= e1_char_start + len(e1_lower):
                    e1_pos = (e1_start, j + 1)
                    break
        
        if e2_char_start != -1 and token.idx <= e2_char_start < token.idx + len(token.text):
            e2_start = i
            # Find end token
            for j in range(i, len(doc)):
                if doc[j].idx + len(doc[j].text) >= e2_char_start + len(e2_lower):
                    e2_pos = (e2_start, j + 1)
                    break
    
    return e1_pos, e2_pos


def extract_tokens_between_entities(doc: spacy.tokens.Doc, e1_pos: Tuple[int, int], e2_pos: Tuple[int, int]) -> Dict[str, List[str]]:
    """
    Extract tokens and their properties between two entities.
    
    Parameters
    ----------
    doc : spacy.tokens.Doc
        Processed spaCy document
    e1_pos : Tuple[int, int]
        Entity 1 token positions (start, end)
    e2_pos : Tuple[int, int]
        Entity 2 token positions (start, end)
    
    Returns
    -------
    Dict[str, List[str]]
        Dictionary with tokens, lemmas, POS tags between entities
    """
    # Determine entity order
    first_entity_end = min(e1_pos[1], e2_pos[1]) if e1_pos[0] < e2_pos[0] else max(e1_pos[1], e2_pos[1])
    second_entity_start = max(e1_pos[0], e2_pos[0]) if e1_pos[0] < e2_pos[0] else min(e1_pos[0], e2_pos[0])
    
    # Get tokens between entities
    start_idx = e1_pos[1] if e1_pos[0] < e2_pos[0] else e2_pos[1]
    end_idx = e2_pos[0] if e1_pos[0] < e2_pos[0] else e1_pos[0]
    
    between_tokens = doc[start_idx:end_idx]
    
    return {
        "tokens": [token.text for token in between_tokens],
        "lemmas": [token.lemma_ for token in between_tokens],
        "pos_tags": [token.pos_ for token in between_tokens],
        "dep_tags": [token.dep_ for token in between_tokens],
        "num_tokens": len(between_tokens)
    }


def get_dependency_path(doc: spacy.tokens.Doc, e1_pos: Tuple[int, int], e2_pos: Tuple[int, int]) -> str:
    """
    Get the shortest dependency path between two entities.
    
    Parameters
    ----------
    doc : spacy.tokens.Doc
        Processed spaCy document
    e1_pos : Tuple[int, int]
        Entity 1 token positions
    e2_pos : Tuple[int, int]
        Entity 2 token positions
    
    Returns
    -------
    str
        Dependency path string
    """
    if e1_pos[0] == -1 or e2_pos[0] == -1:
        return ""
    
    # Get head tokens of entities
    e1_head = doc[e1_pos[0]:e1_pos[1]].root
    e2_head = doc[e2_pos[0]:e2_pos[1]].root
    
    # Find path to common ancestor
    e1_ancestors = list(e1_head.ancestors)
    e2_ancestors = list(e2_head.ancestors)
    
    # Simple path: just the dependency relations
    path_parts = []
    path_parts.append(f"{e1_head.text}({e1_head.dep_})")
    
    # Add common ancestor if exists
    common_ancestors = set(e1_ancestors) & set(e2_ancestors)
    if common_ancestors:
        ancestor = min(common_ancestors, key=lambda x: x.i)
        path_parts.append(f"→{ancestor.text}({ancestor.dep_})→")
    
    path_parts.append(f"{e2_head.text}({e2_head.dep_})")
    
    return " ".join(path_parts)


def extract_text_features(row: pd.Series, nlp: spacy.language.Language) -> Dict[str, Any]:
    """
    Extract all text features for a single example.
    
    Parameters
    ----------
    row : pd.Series
        DataFrame row with sentence and entity information
    nlp : spacy.language.Language
        Loaded spaCy model
    
    Returns
    -------
    Dict[str, Any]
        Dictionary with all text features
    """
    # Process sentence
    doc = nlp(row["sentence_clean"])
    
    # Get entity positions
    e1_pos, e2_pos = get_entity_positions(doc, row["e1"], row["e2"])
    
    # Extract features
    features = {
        "tokens": [token.text for token in doc],
        "lemmas": [token.lemma_ for token in doc],
        "pos_tags": [token.pos_ for token in doc],
        "dep_tags": [token.dep_ for token in doc],
        "ner_tags": [ent.label_ for ent in doc.ents],
        "ner_texts": [ent.text for ent in doc.ents],
        "num_tokens": len(doc),
        "num_entities": len(doc.ents),
    }
    
    # Entity-specific features
    if e1_pos[0] != -1:
        e1_span = doc[e1_pos[0]:e1_pos[1]]
        features["e1_pos_tags"] = [token.pos_ for token in e1_span]
        features["e1_head_pos"] = e1_span.root.pos_
        features["e1_head_dep"] = e1_span.root.dep_
    else:
        features["e1_pos_tags"] = []
        features["e1_head_pos"] = ""
        features["e1_head_dep"] = ""
    
    if e2_pos[0] != -1:
        e2_span = doc[e2_pos[0]:e2_pos[1]]
        features["e2_pos_tags"] = [token.pos_ for token in e2_span]
        features["e2_head_pos"] = e2_span.root.pos_
        features["e2_head_dep"] = e2_span.root.dep_
    else:
        features["e2_pos_tags"] = []
        features["e2_head_pos"] = ""
        features["e2_head_dep"] = ""
    
    # Features between entities
    between_features = extract_tokens_between_entities(doc, e1_pos, e2_pos)
    features["between_tokens"] = between_features["tokens"]
    features["between_lemmas"] = between_features["lemmas"]
    features["between_pos_tags"] = between_features["pos_tags"]
    features["between_dep_tags"] = between_features["dep_tags"]
    features["num_between_tokens"] = between_features["num_tokens"]
    
    # Dependency path
    features["dependency_path"] = get_dependency_path(doc, e1_pos, e2_pos)
    
    return features


print("✓ Feature extraction functions defined")

✓ Feature extraction functions defined


## 4. Example: Analyze a Single Sentence

Let's see what text features spaCy extracts from one example.

In [5]:
# Pick a sample sentence
sample_idx = 0
sample_row = train_df.iloc[sample_idx]

print("Sample Sentence:")
print("=" * 60)
print(f"ID: {sample_row['id']}")
print(f"Sentence: {sample_row['sentence_clean']}")
print(f"Entity 1: {sample_row['e1']}")
print(f"Entity 2: {sample_row['e2']}")
print(f"Relation: {sample_row['relation_full']}")

# Process and extract features
features = extract_text_features(sample_row, nlp)

print("\Text Features:")
print("=" * 60)
print(f"Tokens: {features['tokens']}")
print(f"Lemmas: {features['lemmas']}")
print(f"POS Tags: {features['pos_tags']}")
print(f"Dependency Tags: {features['dep_tags']}")
print(f"\nBetween Entities:")
print(f"  Tokens: {features['between_tokens']}")
print(f"  Lemmas: {features['between_lemmas']}")
print(f"  POS Tags: {features['between_pos_tags']}")
print(f"\nEntity 1 Features:")
print(f"  POS Tags: {features['e1_pos_tags']}")
print(f"  Head POS: {features['e1_head_pos']}")
print(f"  Head Dependency: {features['e1_head_dep']}")
print(f"\nEntity 2 Features:")
print(f"  POS Tags: {features['e2_pos_tags']}")
print(f"  Head POS: {features['e2_head_pos']}")
print(f"  Head Dependency: {features['e2_head_dep']}")
print(f"\nDependency Path: {features['dependency_path']}")

Sample Sentence:
ID: 1
Sentence: The system as described above has its greatest application in an arrayed configuration of antenna elements.
Entity 1: configuration
Entity 2: elements
Relation: Component-Whole(e2,e1)
\Text Features:
Tokens: ['The', 'system', 'as', 'described', 'above', 'has', 'its', 'greatest', 'application', 'in', 'an', 'arrayed', 'configuration', 'of', 'antenna', 'elements', '.']
Lemmas: ['the', 'system', 'as', 'describe', 'above', 'have', 'its', 'great', 'application', 'in', 'an', 'array', 'configuration', 'of', 'antenna', 'element', '.']
POS Tags: ['DET', 'NOUN', 'SCONJ', 'VERB', 'ADV', 'VERB', 'PRON', 'ADJ', 'NOUN', 'ADP', 'DET', 'VERB', 'NOUN', 'ADP', 'NOUN', 'NOUN', 'PUNCT']
Dependency Tags: ['det', 'nsubj', 'mark', 'advcl', 'advmod', 'ROOT', 'poss', 'amod', 'dobj', 'prep', 'det', 'amod', 'pobj', 'prep', 'compound', 'pobj', 'punct']

Between Entities:
  Tokens: ['of', 'antenna']
  Lemmas: ['of', 'antenna']
  POS Tags: ['ADP', 'NOUN']

Entity 1 Features:
  POS Ta

  print("\Text Features:")


## 5. Visualize Dependency Parse (Optional)

spaCy's displacy can show the dependency tree structure.

In [6]:
# Visualize dependency parse for sample sentence
doc = nlp(sample_row["sentence_clean"])

# Display inline in notebook using IPython
from IPython.display import HTML

html = displacy.render(doc, style="dep", jupyter=False)
HTML(html)

## 6. Process All Training Data

Extract text features for all training examples.

In [None]:
print("Processing training data...")

train_text_features = []

for idx, row in tqdm(train_df.iterrows(), total=len(train_df), desc="Training"):
    try:
        features = extract_text_features(row, nlp)
        train_text_features.append(features)
    except Exception as e:
        print(f"Error processing training example {row['id']}: {e}")
        # Add empty features for failed examples
        train_text_features.append({})

print(f"\n✓ Processed {len(train_text_features)} training examples")

Processing training data...
This may take several minutes...



Training: 100%|██████████| 8000/8000 [00:41<00:00, 191.18it/s]


✓ Processed 8000 training examples





## 7. Process All Test Data

In [None]:
print("Processing test data...")

test_text_features = []

for idx, row in tqdm(test_df.iterrows(), total=len(test_df), desc="Test"):
    try:
        features = extract_text_features(row, nlp)
        test_text_features.append(features)
    except Exception as e:
        print(f"Error processing test example {row['id']}: {e}")
        # Add empty features for failed examples
        test_text_features.append({})

print(f"\n✓ Processed {len(test_text_features)} test examples")

Processing test data...
This may take a few minutes...



Test: 100%|██████████| 2717/2717 [00:14<00:00, 186.17it/s]


✓ Processed 2717 test examples





## 8. Combine Original Data with Text Features

In [9]:
def combine_features(df: pd.DataFrame, text_features: List[Dict]) -> pd.DataFrame:
    """
    Combine original dataframe with text features.
    
    Parameters
    ----------
    df : pd.DataFrame
        Original preprocessed dataframe
    text_features : List[Dict]
        List of text feature dictionaries
    
    Returns
    -------
    pd.DataFrame
        Combined dataframe
    """
    # Create copy
    enriched_df = df.copy()
    
    # Add text features as new columns
    for col_name in ["tokens", "lemmas", "pos_tags", "dep_tags", 
                     "between_tokens", "between_lemmas", "between_pos_tags", 
                     "between_dep_tags", "dependency_path",
                     "e1_head_pos", "e1_head_dep", "e2_head_pos", "e2_head_dep",
                     "num_tokens", "num_between_tokens", "num_entities"]:
        enriched_df[col_name] = [feat.get(col_name, None) for feat in text_features]
    
    return enriched_df


# Combine features
train_enriched = combine_features(train_df, train_text_features)
test_enriched = combine_features(test_df, test_text_features)

print(f"✓ Combined original data with text features")
print(f"  Training shape: {train_enriched.shape}")
print(f"  Test shape: {test_enriched.shape}")

# Display sample
print("\nSample enriched data (first 3 rows, selected columns):")
display_cols = ["id", "e1", "e2", "relation_type", "between_tokens", 
                "between_pos_tags", "e1_head_pos", "e2_head_pos"]
train_enriched[display_cols].head(3)

✓ Combined original data with text features
  Training shape: (8000, 25)
  Test shape: (2717, 25)

Sample enriched data (first 3 rows, selected columns):


Unnamed: 0,id,e1,e2,relation_type,between_tokens,between_pos_tags,e1_head_pos,e2_head_pos
0,1,configuration,elements,Component-Whole,"[of, antenna]","[ADP, NOUN]",NOUN,NOUN
1,2,child,cradle,Other,"[was, carefully, wrapped, and, bound, into, the]","[AUX, ADV, VERB, CCONJ, VERB, ADP, DET]",NOUN,NOUN
2,3,author,disassembler,Instrument-Agency,"[of, a, keygen, uses, a]","[ADP, DET, NOUN, VERB, DET]",NOUN,NOUN


## 9. Text Feature Statistics

In [None]:
def print_text_statistics(df: pd.DataFrame, dataset_name: str) -> None:
    """
    Print statistics about text features.
    
    Parameters
    ----------
    df : pd.DataFrame
        Enriched dataframe with text features
    dataset_name : str
        Name of the dataset
    """
    print(f"\n{'='*60}")
    print(f"{dataset_name} Text Features Statistics")
    print(f"{'='*60}")
    
    # Token statistics
    print(f"\nToken Statistics:")
    print("-" * 40)
    print(f"Average tokens per sentence: {df['num_tokens'].mean():.2f}")
    print(f"Average tokens between entities: {df['num_between_tokens'].mean():.2f}")
    print(f"Max tokens per sentence: {df['num_tokens'].max()}")
    print(f"Min tokens per sentence: {df['num_tokens'].min()}")
    
    # POS tag distribution
    print(f"\nEntity Head POS Tags Distribution:")
    print("-" * 40)
    print("E1 Head POS:")
    print(df['e1_head_pos'].value_counts().head(10))
    print("\nE2 Head POS:")
    print(df['e2_head_pos'].value_counts().head(10))
    
    # Most common tokens between entities
    print(f"\nMost Common Tokens Between Entities:")
    print("-" * 40)
    all_between_tokens = []
    for tokens in df['between_tokens']:
        if isinstance(tokens, list):
            all_between_tokens.extend(tokens)
    
    token_counts = Counter(all_between_tokens)
    print("Top 20:")
    for token, count in token_counts.most_common(20):
        print(f"  {token:15s}: {count:4d}")
    
    # Most common lemmas between entities
    print(f"\nMost Common Lemmas Between Entities:")
    print("-" * 40)
    all_between_lemmas = []
    for lemmas in df['between_lemmas']:
        if isinstance(lemmas, list):
            all_between_lemmas.extend(lemmas)
    
    lemma_counts = Counter(all_between_lemmas)
    print("Top 20:")
    for lemma, count in lemma_counts.most_common(20):
        print(f"  {lemma:15s}: {count:4d}")


# Print statistics
print_text_statistics(train_enriched, "Training")
print_text_statistics(test_enriched, "Test")


Training text Features Statistics

Token Statistics:
----------------------------------------
Average tokens per sentence: 19.34
Average tokens between entities: 3.88
Max tokens per sentence: 97
Min tokens per sentence: 4

Entity Head POS Tags Distribution:
----------------------------------------
E1 Head POS:
e1_head_pos
NOUN     7830
PROPN      88
ADJ        37
VERB       37
NUM         4
ADV         2
AUX         1
X           1
Name: count, dtype: int64

E2 Head POS:
e2_head_pos
NOUN     7860
VERB       67
PROPN      41
ADJ        25
ADV         3
AUX         3
ADP         1
Name: count, dtype: int64

Most Common Tokens Between Entities:
----------------------------------------
Top 20:
  the            : 3230
  of             : 1950
  a              : 1798
  from           :  914
  in             :  894
  into           :  750
  was            :  739
  by             :  731
  is             :  566
  ,              :  490
  with           :  476
  to             :  472
  caused    

## 10. Save Enriched Data

Save data with linguistic features in both CSV and JSON formats.

In [12]:
# Save training data
train_csv_path = PREPROCESSED_DIR / "train_text.csv"
train_json_path = PREPROCESSED_DIR / "train_text.json"

train_enriched.to_csv(train_csv_path, index=False)
train_enriched.to_json(train_json_path, orient="records", indent=2)

print(f"✓ Training data with text features saved:")
print(f"  - CSV: {train_csv_path}")
print(f"  - JSON: {train_json_path}")

# Save test data
test_csv_path = PREPROCESSED_DIR / "test_text.csv"
test_json_path = PREPROCESSED_DIR / "test_text.json"

test_enriched.to_csv(test_csv_path, index=False)
test_enriched.to_json(test_json_path, orient="records", indent=2)

print(f"\n✓ Test data with text features saved:")
print(f"  - CSV: {test_csv_path}")
print(f"  - JSON: {test_json_path}")

# Save metadata
text_metadata = {
    "spacy_model": "en_core_web_sm",
    "pipeline_components": nlp.pipe_names,
    "text_features": [
        "tokens", "lemmas", "pos_tags", "dep_tags",
        "between_tokens", "between_lemmas", "between_pos_tags",
        "dependency_path", "e1_head_pos", "e2_head_pos"
    ],
    "train_size": len(train_enriched),
    "test_size": len(test_enriched)
}

metadata_path = PREPROCESSED_DIR / "text_metadata.json"
with open(metadata_path, "w") as f:
    json.dump(text_metadata, f, indent=2)

print(f"\n✓ text metadata saved: {metadata_path}")

✓ Training data with text features saved:
  - CSV: /Users/egeaydin/Github/ML2025WS/Token13-tuw-nlp-ie-2025WS/data/preprocessed/train_text.csv
  - JSON: /Users/egeaydin/Github/ML2025WS/Token13-tuw-nlp-ie-2025WS/data/preprocessed/train_text.json

✓ Test data with text features saved:
  - CSV: /Users/egeaydin/Github/ML2025WS/Token13-tuw-nlp-ie-2025WS/data/preprocessed/test_text.csv
  - JSON: /Users/egeaydin/Github/ML2025WS/Token13-tuw-nlp-ie-2025WS/data/preprocessed/test_text.json

✓ text metadata saved: /Users/egeaydin/Github/ML2025WS/Token13-tuw-nlp-ie-2025WS/data/preprocessed/text_metadata.json


## 11. Validation: Load and Verify Saved Data

In [13]:
# Load saved data
train_loaded = pd.read_csv(train_csv_path)
test_loaded = pd.read_csv(test_csv_path)

print("Validation Results:")
print("-" * 40)
print(f"✓ Training data loaded: {len(train_loaded)} examples")
print(f"✓ Test data loaded: {len(test_loaded)} examples")
print(f"✓ Data integrity check: {len(train_loaded) == len(train_enriched) and len(test_loaded) == len(test_enriched)}")

# Display columns
print(f"\nColumns in enriched data: {list(train_loaded.columns)}")

# Display sample
print("\nSample from loaded training data:")
display_cols = ["id", "e1", "e2", "relation_type", "num_tokens", "num_between_tokens", "e1_head_pos"]
train_loaded[display_cols].head(3)

Validation Results:
----------------------------------------
✓ Training data loaded: 8000 examples
✓ Test data loaded: 2717 examples
✓ Data integrity check: True

Columns in enriched data: ['id', 'sentence_raw', 'sentence_clean', 'e1', 'e2', 'relation_full', 'relation_type', 'relation_direction', 'comment', 'tokens', 'lemmas', 'pos_tags', 'dep_tags', 'between_tokens', 'between_lemmas', 'between_pos_tags', 'between_dep_tags', 'dependency_path', 'e1_head_pos', 'e1_head_dep', 'e2_head_pos', 'e2_head_dep', 'num_tokens', 'num_between_tokens', 'num_entities']

Sample from loaded training data:


Unnamed: 0,id,e1,e2,relation_type,num_tokens,num_between_tokens,e1_head_pos
0,1,configuration,elements,Component-Whole,17,2,NOUN
1,2,child,cradle,Other,16,7,NOUN
2,3,author,disassembler,Instrument-Agency,16,5,NOUN


## 11. Export to CoNLL Format


In [14]:
import spacy_conll

# Add the CoNLL formatter to the existing spaCy pipeline
# Use the string name "conll_formatter" which is registered as a factory
nlp.add_pipe("conll_formatter", last=True)

print("✓ Added CoNLL formatter to spaCy pipeline")
print(f"Pipeline components: {nlp.pipe_names}")

✓ Added CoNLL formatter to spaCy pipeline
Pipeline components: ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner', 'conll_formatter']


In [15]:
def add_entity_labels_to_conll(doc, e1_text: str, e2_text: str) -> str:
    """
    Add entity labels to the MISC column of CoNLL output.
    
    Parameters
    ----------
    doc : spacy.Doc
        Processed document with CoNLL formatter
    e1_text : str
        Entity 1 text
    e2_text : str
        Entity 2 text
        
    Returns
    -------
    str
        CoNLL string with entity labels added to MISC column
    """
    # Get base CoNLL output
    conll_lines = doc._.conll_str.strip().split('\n')
    
    # Split entities into tokens for matching
    e1_tokens = e1_text.lower().split()
    e2_tokens = e2_text.lower().split()
    
    # Get list of token texts
    doc_tokens = [token.text.lower() for token in doc]
    
    # Find entity positions
    e1_positions = set()
    e2_positions = set()
    
    # Find e1
    for i in range(len(doc_tokens) - len(e1_tokens) + 1):
        if doc_tokens[i:i+len(e1_tokens)] == e1_tokens:
            e1_positions = set(range(i, i+len(e1_tokens)))
            break
    
    # Find e2
    for i in range(len(doc_tokens) - len(e2_tokens) + 1):
        if doc_tokens[i:i+len(e2_tokens)] == e2_tokens:
            e2_positions = set(range(i, i+len(e2_tokens)))
            break
    
    # Process each line and add entity labels
    result_lines = []
    for line in conll_lines:
        if line.startswith('#') or not line.strip():
            result_lines.append(line)
            continue
        
        parts = line.split('\t')
        if len(parts) >= 10:
            token_id = int(parts[0]) - 1  # Convert to 0-indexed
            misc = parts[9]
            
            # Add entity label
            if token_id in e1_positions:
                entity_label = 'Entity=e1'
            elif token_id in e2_positions:
                entity_label = 'Entity=e2'
            else:
                entity_label = None
            
            # Combine with existing MISC info
            if entity_label:
                if misc == '_':
                    parts[9] = entity_label
                else:
                    parts[9] = f"{misc}|{entity_label}"
                parts[9] = f"{misc}|{entity_label}"
            
            result_lines.append('\t'.join(parts))
        else:
            result_lines.append(line)
    
    return '\n'.join(result_lines)


# Test with a sample sentence from the dataset
sample_row = train_enriched.iloc[0]
doc = nlp(sample_row['sentence_clean'])

print(f"Sentence ID: {sample_row['id']}")
print(f"Relation: {sample_row['relation_full']}")
print(f"e1: {sample_row['e1']}, e2: {sample_row['e2']}")
print(f"\nCoNLL Format Output with Entity Labels:\n")

# Get CoNLL with entity labels
conll_with_entities = add_entity_labels_to_conll(doc, sample_row['e1'], sample_row['e2'])
print(conll_with_entities)

Sentence ID: 1
Relation: Component-Whole(e2,e1)
e1: configuration, e2: elements

CoNLL Format Output with Entity Labels:

1	The	the	DET	DT	Definite=Def|PronType=Art	2	det	_	_
2	system	system	NOUN	NN	Number=Sing	6	nsubj	_	_
3	as	as	SCONJ	IN	_	4	mark	_	_
4	described	describe	VERB	VBN	Aspect=Perf|Tense=Past|VerbForm=Part	2	advcl	_	_
5	above	above	ADV	RB	_	4	advmod	_	_
6	has	have	VERB	VBZ	Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin	0	ROOT	_	_
7	its	its	PRON	PRP$	Gender=Neut|Number=Sing|Person=3|Poss=Yes|PronType=Prs	9	poss	_	_
8	greatest	great	ADJ	JJS	Degree=Sup	9	amod	_	_
9	application	application	NOUN	NN	Number=Sing	6	dobj	_	_
10	in	in	ADP	IN	_	9	prep	_	_
11	an	an	DET	DT	Definite=Ind|PronType=Art	13	det	_	_
12	arrayed	array	VERB	VBN	Aspect=Perf|Tense=Past|VerbForm=Part	13	amod	_	_
13	configuration	configuration	NOUN	NN	Number=Sing	10	pobj	_	_|Entity=e1
14	of	of	ADP	IN	_	13	prep	_	_
15	antenna	antenna	NOUN	NN	Number=Sing	16	compound	_	_
16	elements	element	NOUN	NNS	Number=Plur	1

In [16]:
# Export all data to CoNLL format with entity labels
CONLL_DIR = PREPROCESSED_DIR / "conll"
CONLL_DIR.mkdir(exist_ok=True)

print("Exporting to CoNLL format with entity labels...")

# Export training data
train_conll_path = CONLL_DIR / "train.conll"
with open(train_conll_path, 'w', encoding='utf-8') as f:
    for idx, row in tqdm(train_enriched.iterrows(), total=len(train_enriched), desc="Training"):
        # Write metadata as comments
        f.write(f"# sent_id = {row['id']}\n")
        f.write(f"# text = {row['sentence_clean']}\n")
        f.write(f"# relation = {row['relation_full']}\n")
        f.write(f"# e1 = {row['e1']}\n")
        f.write(f"# e2 = {row['e2']}\n")
        
        # Process with spaCy and add entity labels
        doc = nlp(row['sentence_clean'])
        conll_with_entities = add_entity_labels_to_conll(doc, row['e1'], row['e2'])
        f.write(conll_with_entities)
        f.write("\n\n")

print(f"✓ Training data saved to: {train_conll_path}")

# Export test data  
test_conll_path = CONLL_DIR / "test.conll"
with open(test_conll_path, 'w', encoding='utf-8') as f:
    for idx, row in tqdm(test_enriched.iterrows(), total=len(test_enriched), desc="Test"):
        # Write metadata as comments
        f.write(f"# sent_id = {row['id']}\n")
        f.write(f"# text = {row['sentence_clean']}\n")
        f.write(f"# relation = {row['relation_full']}\n")
        f.write(f"# e1 = {row['e1']}\n")
        f.write(f"# e2 = {row['e2']}\n")
        
        # Process with spaCy and add entity labels
        doc = nlp(row['sentence_clean'])
        conll_with_entities = add_entity_labels_to_conll(doc, row['e1'], row['e2'])
        f.write(conll_with_entities)
        f.write("\n\n")

print(f"✓ Test data saved to: {test_conll_path}")
print(f"\n✓ CoNLL export complete with entity labels!")

Exporting to CoNLL format with entity labels...


Training: 100%|██████████| 8000/8000 [00:53<00:00, 150.56it/s]


✓ Training data saved to: /Users/egeaydin/Github/ML2025WS/Token13-tuw-nlp-ie-2025WS/data/preprocessed/conll/train.conll


Test: 100%|██████████| 2717/2717 [00:18<00:00, 148.15it/s]

✓ Test data saved to: /Users/egeaydin/Github/ML2025WS/Token13-tuw-nlp-ie-2025WS/data/preprocessed/conll/test.conll

✓ CoNLL export complete with entity labels!





## Summary

✅ **Completed text Preprocessing:**
1. Loaded spaCy English model (en_core_web_sm)
2. Performed tokenization on all sentences
3. Added POS (Part-of-Speech) tags
4. Extracted dependency parse information
5. Identified entity head tokens and their properties
6. Extracted text features between entities
7. Generated dependency paths between entity pairs
8. Saved enriched data with all text annotations

**Output Files:**
- `data/text/train_text.csv` - Training data with text features
- `data/text/test_text.csv` - Test data with text features
- `data/text/train_text.json` - Training data (JSON format)
- `data/text/test_text.json` - Test data (JSON format)
- `data/text/text_metadata.json` - Processing metadata

**Key text Features Added:**
- **Tokens**: Word-level tokenization
- **Lemmas**: Base forms of words
- **POS Tags**: Part-of-speech labels (NOUN, VERB, ADJ, etc.)
- **Dependency Tags**: Syntactic relations (nsubj, obj, prep, etc.)
- **Between-Entity Features**: Tokens/lemmas/tags between e1 and e2
- **Entity Head Features**: POS and dependency info for entity heads
- **Dependency Paths**: Syntactic paths connecting entities