## Part 3: Data Pre-processing & Feature Engineering Pipeline

Author: Garvit Mathur

Task: Convert raw .txt data into LiLT-ready Tensors.

### Context & Challenge
The LiLT (Language-Independent Layout Transformer) model requires two inputs: text embeddings (input_ids) and spatial embeddings (bbox). [https://huggingface.co/docs/transformers/en/model_doc/lilt#transformers.LiltForTokenClassification]

However, our source data consists of flat .txt files with no native bounding-box (bbox) coordinates. Furthermore, the LiLT tokenizer splits words into sub-words (e.g., Agreement -> Agree, ment), creating a length mismatch between the original words and the token sequence.

### Solution Architecture
This notebook implements a production-grade pipeline to solve these challenges:

1. Synthetic Spatial Extraction: A heuristic engine (SpatialFeatureExtractor) that simulates layout by calculating 2D bounding boxes from each word's line and character position.

2. Sub-word Alignment: A tokenizer wrapper (LiLTTokenizerPipeline) that uses the word_ids() mapping to propagate the synthetic bounding boxes from parent words to all constituent sub-tokens.

3. Type Safety: Pydantic models (BaseModel) that enforce a strict schema between pipeline stages.
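
The sub-word alignment step can be illustrated without downloading a tokenizer. The sketch below assumes a hand-written `word_ids` list of the kind a fast tokenizer returns (`None` for special or padding tokens, otherwise the index of the parent word). The helper name `align_boxes` and the sample values are hypothetical and only serve the illustration.

```python
from typing import List, Optional

PAD_BOX = [0, 0, 0, 0]  # box assigned to special and padding tokens


def align_boxes(
    word_ids: List[Optional[int]], word_boxes: List[List[int]]
) -> List[List[int]]:
    """Return one box per token by copying each word's box to its sub-tokens."""
    return [PAD_BOX if idx is None else word_boxes[idx] for idx in word_ids]


# Two words; the first is split into two sub-tokens (hypothetical values).
word_boxes = [[0, 0, 90, 20], [100, 0, 130, 20]]
word_ids = [None, 0, 0, 1, None]  # <s>, "Agree", "ment", "NO:", </s>

token_boxes = align_boxes(word_ids, word_boxes)
print(token_boxes)
```

The output list has the same length as `word_ids`, which is the property the model needs: one box per token.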

## Install Libraries

In [None]:
!pip install -q transformers torch pydantic

## Import Libraries

In [None]:
import logging
import warnings
from typing import List, Optional, Dict
from transformers import AutoTokenizer, PreTrainedTokenizerFast
from pydantic import BaseModel, ConfigDict, field_validator
import torch
# Suppress library warnings for cleaner output
warnings.filterwarnings("ignore")

## Configure Logging

In [None]:
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger("LiLTPreProcessing")


## Define Constants

In [None]:
# Constants
MODEL_NAME = "nielsr/lilt-xlm-roberta-base"
MAX_SEQUENCE_LENGTH = 512
BBOX_SCALE = 1000
PAD_TOKEN_BOX = [0, 0, 0, 0]


# Default document coordinate system used by LiLT (0-1000)
DOC_WIDTH = 1000
DOC_HEIGHT = 1000

# Default character and line spacing for synthetic bbox generation
DEFAULT_CHAR_WIDTH = 10
DEFAULT_LINE_HEIGHT = 20
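
These defaults imply a fixed page capacity: beyond it, coordinates are clamped to 1000 and words start sharing the same box. The sketch below works through that arithmetic. The helper `fits_on_page` is a hypothetical name, and using the raw line length is a conservative assumption, since collapsing whitespace can only make a line shorter.

```python
DOC_WIDTH = 1000
DOC_HEIGHT = 1000
DEFAULT_CHAR_WIDTH = 10
DEFAULT_LINE_HEIGHT = 20

# Capacity of the synthetic page before clamping starts.
max_chars_per_line = DOC_WIDTH // DEFAULT_CHAR_WIDTH    # 1000 / 10 = 100
max_lines_per_page = DOC_HEIGHT // DEFAULT_LINE_HEIGHT  # 1000 / 20 = 50


def fits_on_page(text: str) -> bool:
    """Return True if no line or line count exceeds the synthetic page."""
    lines = text.split("\n")
    if len(lines) > max_lines_per_page:
        return False
    return all(len(line) <= max_chars_per_line for line in lines)


print(max_chars_per_line, max_lines_per_page)
print(fits_on_page("AGREEMENT NO: 8842-GH-12\nDATE: 12th DECEMBER 2023"))
```

Documents that fail this check would need smaller spacing values or a split into several synthetic pages before extraction.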

## Define Pydantic Data Classes

In [None]:
class RawDocument(BaseModel):
    doc_id: str
    content: str

class ProcessedWord(BaseModel):
    text: str
    bbox: List[int]  # [x0, y0, x1, y1] normalised to 0-1000

    @field_validator('bbox')
    @classmethod
    def check_bbox_geometry(cls, bbox_value: List[int]) -> List[int]:
        """
        Validates that the BBox has exactly 4 coordinates and
        that they fall within the safe 0-1000 range.
        """
        if len(bbox_value) != 4:
            raise ValueError("BBox must have exactly 4 coordinates [x0, y0, x1, y1]")

        # Check for negative values or values exceeding document bounds.
        # DOC_WIDTH and DOC_HEIGHT are both 1000, so one limit covers x and y.
        if any(coord < 0 for coord in bbox_value):
            raise ValueError(f"BBox coordinates cannot be negative: {bbox_value}")
        if any(coord > DOC_WIDTH for coord in bbox_value):
            raise ValueError(f"BBox coordinates cannot exceed {DOC_WIDTH}: {bbox_value}")

        return bbox_value

class ModelInput(BaseModel):
    """Tensor-ready dictionary for the model."""
    # Configuration required to allow PyTorch Tensors inside Pydantic
    model_config = ConfigDict(arbitrary_types_allowed=True)
    
    input_ids: torch.Tensor
    attention_mask: torch.Tensor
    bbox: torch.Tensor
    token_type_ids: Optional[torch.Tensor] = None
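
The validator above checks the coordinate count and range but not the ordering of the corners. A box with x1 < x0 or y1 < y0 has a negative width or height, which may cause problems for layout models that embed box dimensions. The sketch below shows one way such a check could look as a plain function. `is_well_formed_bbox` is a hypothetical helper, not part of the notebook's classes.

```python
from typing import List


def is_well_formed_bbox(bbox: List[int], limit: int = 1000) -> bool:
    """Check count, range and corner ordering of an [x0, y0, x1, y1] box."""
    if len(bbox) != 4:
        return False
    x0, y0, x1, y1 = bbox
    in_range = all(0 <= value <= limit for value in bbox)
    ordered = x0 <= x1 and y0 <= y1  # top-left must not exceed bottom-right
    return in_range and ordered


print(is_well_formed_bbox([0, 0, 90, 20]))    # valid box
print(is_well_formed_bbox([90, 0, 0, 20]))    # x1 < x0
print(is_well_formed_bbox([0, 0, 1200, 20]))  # out of range
```

The same condition could be added to `check_bbox_geometry` as an additional `ValueError` branch.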

## Pre-processing Steps

### Bounding Box Generation from a .txt File

In [None]:


# Spatial Feature Engineering
class SpatialFeatureExtractor:
    """Generates synthetic bounding boxes for words in a plain text document.

    Since plain .txt files carry no layout metadata, we assume a monospaced
    font. X/Y coordinates are derived from character counts and line numbers.
    This gives the LiLT model the spatial context it needs (e.g., distinguishing
    a header at Y=20 from a footer at Y=900) without running an OCR engine or
    rendering the .txt file to an image.

    Note: words are placed one after another with a single simulated space,
    so leading indentation and runs of whitespace are collapsed.

    Args:
        char_width_scale: Horizontal spacing per character (default 10).
        line_height_scale: Vertical height per line (default 20).
    """

    def __init__(self, char_width_scale: int = DEFAULT_CHAR_WIDTH, line_height_scale: int = DEFAULT_LINE_HEIGHT):
        self.char_width = int(char_width_scale)
        self.line_height = int(line_height_scale)

    def process_text(self, document: RawDocument) -> List[ProcessedWord]:
        """Convert RawDocument into a sequence of ProcessedWord with bboxes.

        Empty lines advance the y cursor. Coordinates are clamped to the
        [0, DOC_WIDTH/DOC_HEIGHT] range.

        Args:
            document: RawDocument to process.

        Returns:
            List of ProcessedWord with synthetic bounding boxes.
        """
        logger.info(f"Generating spatial features for doc: {document.doc_id}")

        words_layout: List[ProcessedWord] = []

        # Split the document into a list of lines
        lines = document.content.split("\n")

        # Initialise Y cursor
        current_y = 0

        # Iterate over each line
        for line in lines:
            # Skip empty lines but advance the Y cursor
            if not line.strip():
                current_y += self.line_height
                continue
            
            # Initialise X cursor for the new line
            current_x = 0

            # Split the line into words
            words = line.split()

            # For each word, compute its width and assign a bbox
            for word in words:
                word_w = len(word) * self.char_width

                bbox = [
                    min(max(0, current_x), DOC_WIDTH),
                    min(max(0, current_y), DOC_HEIGHT),
                    min(max(0, current_x + word_w), DOC_WIDTH),
                    min(max(0, current_y + self.line_height), DOC_HEIGHT),
                ]

                words_layout.append(ProcessedWord(text=word, bbox=bbox))

                # advance cursor (add a simulated space)
                current_x += word_w + self.char_width

            # next line
            current_y += self.line_height

        return words_layout
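
Because the extractor above places words one after another, values that were aligned in a column in the source file (for example after `DATE:` and `TERMS:`) end up at different X positions. If column alignment matters, one alternative is to read each word's actual character offset. The sketch below assumes that approach. The function name `boxes_from_offsets`, the tab width of 4 and the tuple output are illustrative choices, not the notebook's implementation.

```python
import re
from typing import List, Tuple

CHAR_WIDTH = 10
LINE_HEIGHT = 20
DOC_LIMIT = 1000


def boxes_from_offsets(text: str) -> List[Tuple[str, List[int]]]:
    """Place each word at its real column so runs of spaces are preserved."""
    layout = []
    for line_no, line in enumerate(text.split("\n")):
        y0 = min(line_no * LINE_HEIGHT, DOC_LIMIT)
        y1 = min(y0 + LINE_HEIGHT, DOC_LIMIT)
        # expandtabs converts tabs to spaces so every column is well defined
        for match in re.finditer(r"\S+", line.expandtabs(4)):
            x0 = min(match.start() * CHAR_WIDTH, DOC_LIMIT)
            x1 = min(match.end() * CHAR_WIDTH, DOC_LIMIT)
            layout.append((match.group(), [x0, y0, x1, y1]))
    return layout


# Both values start at column 12 in the raw text.
sample = "DATE:" + " " * 7 + "12th" + "\n" + "TERMS:" + " " * 6 + "36"
result = boxes_from_offsets(sample)
for word, box in result:
    print(word, box)
```

With this variant both values start at x0 = 120, so the model sees them as belonging to the same column.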




### Module to split words into tokens that the model can understand

In [None]:

class LiLTTokenizerPipeline:
    """
    Wraps the Hugging Face AutoTokenizer to handle the critical task of 
    aligning word-level Bounding Boxes to sub-word Tokens.
    """
    
    def __init__(self, model_name: str = MODEL_NAME):
        logger.info(f"Loading tokenizer: {model_name}")
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        
        # Ensure we are using a fast tokenizer to get the 'word_ids()' method
        if not isinstance(self.tokenizer, PreTrainedTokenizerFast):
            raise ValueError("This pipeline requires a FastTokenizer (backed by Rust).")

    def normalise_bbox(self, box: List[int]) -> List[int]:
        """Ensure coordinates are strictly within 0-1000 range.

        If not clamped, out-of-bound coordinates can lead to model errors.
        
        Args:
            box: List of 4 integers [x0, y0, x1, y1]
        
        Returns:
            List of 4 integers with values clamped between 0 and 1000.
        """
        return [
            max(0, min(DOC_WIDTH, x)) for x in box
        ]

    def prepare_input(self, layout_data: List[ProcessedWord]) -> Dict[str, torch.Tensor]:
        """
        Converts a list of ProcessedWord objects into PyTorch tensors.
        
        Core Logic:
        1. Separate text and boxes.
        2. Tokenize text (with padding/truncation).
        3. Map original boxes to sub-word tokens using word_ids().
        """
        
        # 1. Unpack data
        words = [word.text for word in layout_data]
        boxes = [self.normalise_bbox(word.bbox) for word in layout_data]
        
        # 2. Tokenize
        # is_split_into_words=True is CRITICAL because we provide a list of words, not a string.
        encoding = self.tokenizer(
            words,
            is_split_into_words=True,
            padding="max_length",
            truncation=True,
            max_length=MAX_SEQUENCE_LENGTH,
            return_tensors="pt"
        )
        # 3. Align Bounding Boxes
        # The model expects one bbox per token, but we only have one bbox per word.
        # Example: the word "Agreement" (1 box) may become the tokens ['Agree', 'ment'] (2 tokens).
        # Passing the word-level boxes unchanged would give a bbox tensor whose sequence
        # length differs from input_ids, and the model's forward pass would fail.
        # Solution: word_ids() maps every sub-token back to its original word index,
        # so we copy the parent word's bbox to each of its sub-tokens.
        batch_index = 0  # We are processing a single doc, so the batch index is 0
        word_ids = encoding.word_ids(batch_index=batch_index)
        word_ids = encoding.word_ids(batch_index=batch_index)

        aligned_boxes = []
        
        for word_idx in word_ids:
            if word_idx is None:
                # Special tokens (<s>, </s>, <pad>) get [0, 0, 0, 0]
                aligned_boxes.append(PAD_TOKEN_BOX)
            else:
                # Real tokens inherit the bbox of their parent word
                aligned_boxes.append(boxes[word_idx])
        
        # Convert boxes to Tensor
        bbox_tensor = torch.tensor([aligned_boxes])
        
        # 4. Construct Final Model Input
        return {
            "input_ids": encoding["input_ids"],
            "attention_mask": encoding["attention_mask"],
            "bbox": bbox_tensor
        }




## Run Pipeline End-to-End

In [None]:
# Create mock data (simulating a file read).
# In production, this would be replaced by a file I/O method that reads a .txt file from disk or cloud storage.
def get_data() -> str:
    """Simulates reading a text document from file/storage."""
    logger.info("Loading raw document content.")
    return """
    AGREEMENT NO:\t 8842-GH-12 \n
    DATE:       12th DECEMBER 2023 \n
    CUSTOMER:   GARVIT MATHUR \n
    AMOUNT:     $15,000 \n
    TERMS:      36 months \n
    -------------------------------------------
    Thank you for choosing our services.
    """

In [None]:
def run_pipeline() -> None:
    """
    Orchestrates the End-to-End transformation: Raw Text -> Tensors.
    """
    
    # 1. Load Raw Document
    # Simulating fetching a specific document from the batch
    raw_content = get_data()
    doc = RawDocument(doc_id="DOC-TEST-001", content=raw_content)

    try:
        # 2. Run Spatial Extraction
        logger.info("--- Stage 1: Spatial Feature Engineering ---")
        extractor = SpatialFeatureExtractor()
        layout_words = extractor.process_text(doc)
        
        logger.info(f"Extracted {len(layout_words)} words with BBoxes.")
        logger.info(f"Sample: {layout_words[1]}") # Print 'NO:' and its bbox

        # 3. Run Tokenization & Alignment (Transformers)
        logger.info("--- Stage 2: Tokenization & Alignment ---")
        pipeline = LiLTTokenizerPipeline()
        model_inputs = pipeline.prepare_input(layout_words)
        
        # 4. Shape Verification
        # Critical check: Input IDs and BBox tensors must match in 2nd dimension (Seq Len)
        logger.info(f"Input IDs Shape: {model_inputs['input_ids'].shape}") # Expect [1, 512]
        logger.info(f"BBox Shape:      {model_inputs['bbox'].shape}")      # Expect [1, 512, 4]
        
        # 5. Visual Verification
        # Decode the tokens and print each one next to its assigned box to confirm the alignment.
        tokens = pipeline.tokenizer.convert_ids_to_tokens(model_inputs['input_ids'][0])
        boxes = model_inputs['bbox'][0].tolist()
        
        print("\n" + "="*60)
        print(f"{'TOKEN':<15} | {'ASSIGNED BBOX (x0, y0, x1, y1)':<30}")
        print("="*60)
        
        # Show the first 9 tokens (skipping <s>) to demonstrate the splitting logic
        for i in range(1, 10): 
            print(f"{tokens[i]:<15} | {boxes[i]}")
            
        print("="*60 + "\n")
        logger.info("[SUCCESS] Pipeline Complete. Tensors ready for Inference.")

    except Exception:
        # logger.exception records the traceback along with the message
        logger.exception("Pipeline execution failed")


In [None]:
# Run the pipeline
run_pipeline()

## Code Integrity and Validation Strategy

While this notebook demonstrates the functional logic of the pre-processing pipeline, deploying this to a production environment requires strict quality assurance standards. Below are the specific techniques and libraries I would implement to guarantee code integrity and reliability in the CI/CD pipeline.

1. Rigorous Unit Testing (pytest)
    - Every function should be covered by unit tests. I would use pytest with parameterised test cases to exercise the classes and methods against edge cases (e.g., empty lines, extremely long words).


2. Runtime Data Validation (pydantic)
    - As demonstrated in the ProcessedWord class, I utilised Pydantic to enforce data validation at runtime. In production, I would extend these validators to include Cross-Field Validation (e.g., ensuring input_ids length matches bbox length exactly before tensor creation).


3. Static Code Analysis & Type Safety
    - Tools such as mypy (strict mode) and ruff should be used to enforce type safety and PEP 8 style compliance across the codebase.


4. Continuous Integration (CI/CD)
    - I would configure Pre-commit hooks and a Cloud Build pipeline to run these checks automatically. Code would only be merged to the main branch if it passes:
        - All Unit Tests (100% Pass Rate).
        - Minimum code coverage requirements (e.g., pytest-cov > 90%).
        - Static Analysis (ruff and mypy).
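
The cross-field validation mentioned in point 2 could be sketched as a plain function that runs before tensor creation. This is a minimal illustration under assumptions: `check_alignment` is a hypothetical name, the inputs are ordinary Python lists rather than tensors, and the token ids in the usage example are made up.

```python
from typing import Dict, List, Optional


def check_alignment(
    input_ids: List[int],
    bbox: List[List[int]],
    word_ids: List[Optional[int]],
) -> None:
    """Raise ValueError if the token-level fields disagree with each other."""
    if not (len(input_ids) == len(bbox) == len(word_ids)):
        raise ValueError(
            f"Length mismatch: input_ids={len(input_ids)}, "
            f"bbox={len(bbox)}, word_ids={len(word_ids)}"
        )

    first_box_for_word: Dict[int, List[int]] = {}
    for position, (word_idx, box) in enumerate(zip(word_ids, bbox)):
        if word_idx is None:
            # Special and padding tokens are expected to carry the zero box.
            if box != [0, 0, 0, 0]:
                raise ValueError(f"Token {position} is special but has box {box}")
        elif first_box_for_word.setdefault(word_idx, box) != box:
            # Every sub-token of a word must share that word's box.
            raise ValueError(f"Token {position} disagrees with word {word_idx}")


# Hypothetical token ids for <s>, two sub-tokens of one word, and </s>.
check_alignment(
    input_ids=[0, 101, 102, 2],
    bbox=[[0, 0, 0, 0], [0, 0, 90, 20], [0, 0, 90, 20], [0, 0, 0, 0]],
    word_ids=[None, 0, 0, None],
)
print("alignment ok")
```

In a Pydantic-based version, the same checks could live in a model-level validator on ModelInput so that a misaligned batch is rejected before it reaches the model.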