### Comprehensive Text Pre-processing Pipeline

This notebook implements a multi-stage pipeline designed to clean, correct, and standardize raw text extracted from historical documents, which often contain OCR errors. The process is broken down into four main stages, each applied sequentially to the data.

**Pipeline Stages:**

1.  **Initial Cleaning (`clean_text_piece_batched`):**
    *   This is the first pass over the raw text. It focuses on removing noise and restructuring the text into a more readable format.
    *   **Actions:** Removes metadata tags, strips invalid characters, intelligently de-hyphenates words at the end of lines, joins lines into paragraphs, and applies basic punctuation spacing.

2.  **Correction & Capitalization (`capitalization_and_correction_batched`):**
    *   This is the core correction stage, which uses a Named Entity Recognition (NER) model to intelligently apply changes.
    *   **Actions:**
        *   Identifies and merges fragmented entities (e.g., "Royal", "Academy" -> "Royal Academy").
        *   Applies sentence case to the entire text and properly capitalizes the identified entities.
        *   Performs multiple passes of spelling and word segmentation correction, using the NER entities as a "shield" to prevent incorrect changes to names and places.

3.  **Honorific Standardization (`final_polishing_batched`):**
    *   This stage focuses on a specific type of token: titles and honorifics.
    *   **Actions:** Standardizes titles like "Mr", "sir", "Dr.", etc., to a consistent format (e.g., `Mr.`, `Sir`, `Dr.`).

4.  **Final Spacing Cleanup (`final_spacing_batched`):**
    *   A final, non-destructive polishing step to ensure all punctuation and spacing is consistent after the previous modifications.
    *   **Actions:** Re-applies rules to fix issues like "word.word" -> "word. word" or "word . " -> "word.".

The `pipeline_driver` function at the end of the notebook orchestrates these four stages, processing all `.csv` files from an input directory and saving the results to an output directory.

In [152]:
import pandas as pd
import os
from pathlib import Path
import numpy as np
import re
import difflib
import pickle
import shutil
import textwrap
from datasets import load_dataset
from transformers import pipeline, AutoTokenizer, AutoModelForTokenClassification
from symspellpy import SymSpell, Verbosity
import importlib.resources
import logging


In [153]:
COMPLETE_WORDLIST = set()
WORDLIST_NAME = 'complete_wordlist.pkl'
DICTIONARY_DATA_DIR = './Dictionary_data/'

In [154]:
# Setup basic logging to see warnings
logging.basicConfig(level=logging.INFO)
log = logging.getLogger(__name__)

# --- Load Model and Tokenizer ---
# It's better to load these outside the function so it's not done repeatedly
model_name = "dslim/bert-large-NER"
try:
    log.info(f"Loading NER model: {model_name}...")
    # Using AutoModel and AutoTokenizer allows for more control if needed later
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForTokenClassification.from_pretrained(model_name)
    # Using aggregation_strategy="simple" helps merge B-TAG I-TAG sequences
    ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
    log.info("NER model loaded successfully.")
except Exception as e:
    log.error(f"Failed to load NER model: {e}")
    # Handle error appropriately - maybe raise it or set pipeline to None
    ner_pipeline = None

INFO:__main__:Loading NER model: dslim/bert-large-NER...
Some weights of the model checkpoint at dslim/bert-large-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0
INFO:__main__:NER model loaded successfully.


In [155]:
with open(DICTIONARY_DATA_DIR + WORDLIST_NAME, 'rb') as file:
    COMPLETE_WORDLIST = pickle.load(file)
print("Total unique words loaded:", len(COMPLETE_WORDLIST))

Total unique words loaded: 495279


In [156]:
def load_csv_file(file_path):
    try:
        ds = load_dataset('csv', data_files=file_path)
        file = Path(file_path)

        return {'name': file.name, 
                'data': ds,
                'length': ds.get('train').num_rows # ds is a dict like dataset while train is an arrow dataset
                }
    
    except Exception as e:
        print(f"Caught error when loading csv file from path: {file_path} \n\n {e}")

### Stage 1: Initial Text Cleaning (`clean_text_piece_batched`)

This function performs several essential text cleaning steps on batches of text data, specifically designed for use with the `.map(batched=True)` method in datasets like Hugging Face's `datasets`. It focuses on improving readability and structure based on common issues found in OCR'd historical texts, without attempting to invent content.

**Cleaning Steps Performed:**

1.  **Metadata and Tag Removal:**
    * Removes backslashes (`\`).
    * Removes `<NEWPAGE>` tags.
    * *(Note: Placeholder for other specific metadata patterns if identified).*

2.  **Initial Noise Removal:**
    * *(Optional line filtering based on character count is currently commented out in the code, as it could interfere with de-hyphenation).*
    * Removes characters that are not letters, numbers, basic punctuation (`.,;:?!'"()`), the pound sign (`£`), or hyphens (`-`), replacing them with spaces.

3.  **Line Break and Hyphenation Handling:**
    * **Robust End-of-Line De-hyphenation:** Identifies and removes hyphens connecting word fragments across lines (e.g., `word-\nfragment` or `word- \n fragment`). This step uses a specific pattern (`([a-zA-Z])-\s*\n\s*([a-zA-Z])`) and is applied repeatedly *before* general line joining to handle consecutive hyphenations accurately.
    * **Paragraph/Sentence Joining:** After handling EOL hyphens, replaces all remaining newline characters (`\n`) with spaces to join lines into continuous text blocks, aiming to reconstruct paragraphs disrupted by arbitrary line breaks.

4.  **Specific OCR/Formatting Fixes (Rule-Based):**
    * **Currency:** Corrects common OCR errors like ` l.` (space-l-dot followed by space/digit) to ` £`.
    * **Contextual:** Includes *example* rules for correcting specific number misrecognitions like ` l00` to ` 100` and ` O0` to ` 00` when followed by space or punctuation. *(More rules should be added based on data analysis).*

5.  **Punctuation and Formatting Refinement:**
    * Ensures consistent spacing around punctuation marks (removes space before `.,;:?!`, ensures one space after).
    * Standardises spacing around parentheses (`()`).
    * Applies basic spacing around quotation marks (`'` `"`).

6.  **Whitespace Normalisation:**
    * Collapses multiple consecutive spaces into single spaces.
    * Removes leading and trailing whitespace from the final text.

**Usage:**

```python
# Assuming 'raw_dataset' is your dataset object with a 'Full_text' column
# and the function 'clean_text_piece_batched' is defined.
cleaned_dataset = raw_dataset.map(clean_text_piece_batched, batched=True)

# The cleaned text will be in a new column named 'Cleaned_text'.

In [227]:
def setup_spellchecker(master_word_set):
    """
    Creates and configures a SymSpellPy instance by loading BOTH a
    standard frequency dictionary AND your custom master word set.
    
    Crucially, it ONLY adds words from the master set if they
    are NOT already present in the standard dictionary, preserving
    the original frequencies of common words.
    """
    log.info(f"Setting up SymSpellPy...")
    symspell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)

    # --- Step 1: Load the STANDARD frequency dictionary ---
    try:
        resource_path = importlib.resources.files("symspellpy").joinpath("frequency_dictionary_en_82_765.txt")
        with importlib.resources.as_file(resource_path) as usable_path:
            dictionary_path = str(usable_path)
            if not os.path.exists(dictionary_path):
                 log.error("SymSpell: Standard dictionary not found.")
                 return None
            if not symspell.load_dictionary(dictionary_path, term_index=0, count_index=1):
                log.error("SymSpell: Could not load standard frequency dictionary.")
                return None
        log.info(f"Loaded standard frequency dictionary. Word count: {symspell.word_count}")
    except Exception as e:
        log.warning(f"Could not load standard dictionary: {e}. Continuing with custom set only.")

    # --- Step 2: Load ONLY NEW words from YOUR master_word_set ---
    # We give them a high, fake frequency so they are treated as valid.
    # We use a count lower than the top words (like "the") but
    # high enough to be a very strong suggestion.
    custom_word_count = 1000000 # 1 million is a high, safe default
    new_words_added = 0
    
    if master_word_set:
        for word in master_word_set:
            # Check if the word already exists in the dictionary
            # lookup() with distance 0 is the way to check for existence
            if not symspell.lookup(word, Verbosity.TOP, max_edit_distance=0):
                # Word does NOT exist, so add it
                symspell.create_dictionary_entry(word, custom_word_count)
                new_words_added += 1
            # If the word *does* exist, we do nothing and preserve its original frequency.
            
    log.info(f"Added {new_words_added} new unique words from master set to SymSpell.")
    log.info(f"SymSpell setup complete. Total unique terms: {symspell.word_count}")
    return symspell

# (You will still need your preserve_case function)
def preserve_case(original_word, new_word):
    """Matches the case of the new word to the original word."""
    if original_word.isupper():
        return new_word.upper()
    if original_word.istitle():
        return new_word.title()
    if original_word[0].isupper() and (len(original_word) == 1 or original_word[1:].islower()):
        return new_word.title()
    return new_word.lower()


def correct_spelling_safe(text, ner_entities, symspell, master_word_set, 
                          max_edit_distance=1, segmentation_threshold=12, 
                          segmentation_ratio=5):
    """
    Corrects misspellings in a text string, using NER entities as a "shield".
    
    This function now includes a 3-stage check:
    1. Check against master word set.
    2. If unknown, check if NER-shielded.
    3. If not shielded, try 1-edit-distance correction.
    4. If that fails AND THE WORD IS LONG, try word_segmentation() as a fallback.
    
    Args:
        text (str): The text after capitalization (output of apply_capitalization).
        ner_entities (list): The list of merged/filtered NER entities for this text.
        master_word_set (set): The set of all known valid words.
        symspell (SymSpell): The pre-configured SymSpellPy instance.
        max_edit_distance (int): The max distance for simple 1-to-1 corrections.
        segmentation_threshold (int): Words *longer* than this will be checked
                                      for segmentation errors (e.g., 'exhibitioanof').
        
    Returns:
        str: The text with safe spelling corrections applied.
    """
    # Skip if spellchecker isn't configured
    if symspell is None or master_word_set is None or not text:
        return text

    # 1. Create a "shield" set of all indices that are part of a NER entity
    entity_indices = set()
    for ent in ner_entities:
        entity_indices.update(range(ent['start'], ent['end']))

    # 2. Iterate through words and build the corrected text piece by piece
    corrected_parts = []
    last_index = 0
    # This regex finds words, including those with apostrophes
    word_pattern = re.compile(r"\b[a-zA-Z]+(?:'[a-zA-Z]+)?\b")

    for match in word_pattern.finditer(text):
        word = match.group(0)
        word_lower = word.lower()
        match_start = match.start()
        match_end = match.end()
        
        corrected_word = word # Default: no change

        # --- Correction Logic ---
        
        # 3. Check if word is known
        if word_lower not in master_word_set:
            
            # 4. If unknown, check if it's "shielded" by NER
            # We check if *any* part of the word is shielded
            is_shielded = any(idx in entity_indices for idx in range(match_start, match_end))
            
            if not is_shielded:
                # 5. Word is UNKNOWN and NOT shielded. Try to correct it.
                
                # --- STAGE 1: Try 1-edit distance (for simple typos like 'wvas') ---
                suggestions_dist1 = symspell.lookup(
                    word,
                    Verbosity.CLOSEST,
                    max_edit_distance=max_edit_distance, 
                    include_unknown=False
                )
                
                if len(suggestions_dist1) == 1:
                    # Success with 1-edit distance
                    correction = suggestions_dist1[0].term
                    corrected_word = preserve_case(word, correction)
                    # log.info(f"Spellcheck 1-edit: Corrected '{word}' -> '{corrected_word}'")
                
                else:
                    # --- STAGE 2: 1-edit failed. Try segmentation (for merged words) ---
                    # Check if the word is LONG enough to be a likely merge.
                    if len(word) > segmentation_threshold:
                        # Calculate a dynamic edit distance budget
                        # (e.g., 1 edit for every 5 chars)
                        dynamic_max_edits = len(word) // segmentation_ratio
                        # Ensure a minimum budget of 2 for segmentation
                        if dynamic_max_edits < 2:
                            dynamic_max_edits = 2

                        try:
                            # Run segmentation on *only this word*
                            segment_suggestion = symspell.word_segmentation(
                                word,
                                max_edit_distance=dynamic_max_edits # Allow more edits
                            )
                            
                            # Check if a valid, *different* segmentation was found
                            if segment_suggestion.corrected_string and \
                               segment_suggestion.corrected_string.lower() != word_lower:
                                
                                corrected_word = preserve_case(word, segment_suggestion.corrected_string)
                                # log.info(f"Spellcheck Segment: Corrected '{word}' -> '{corrected_word}'")
                                
                        except Exception as e:
                            log.error(f"Error during word_segmentation on '{word}': {e}")
                            # Keep original word if segmentation fails

        # 6. Append the text *before* this word, then the (possibly corrected) word
        corrected_parts.append(text[last_index:match_start])
        corrected_parts.append(corrected_word)
        last_index = match_end

    # 7. Append any remaining text after the last word
    corrected_parts.append(text[last_index:])
    
    return "".join(corrected_parts)

def correct_segmentation_errors_safe(text, ner_entities, symspell, max_edit_distance=2):
    """
    Corrects segmentation errors by processing chunks *between* NER entities.
    It further splits these chunks to isolate and correct *word-only* sections,
    avoiding errors from numbers and symbols.
    
    Args:
        text (str): The text to correct (output of 1-to-1 correction).
        ner_entities (list): The list of merged/filtered NER entities for this text.
        symspell (SymSpell): The pre-configured SymSpellPy instance.
    
    Returns:
        str: The text with segmentation corrections applied.
    """
    if symspell is None or not text:
        return text

    # Sort entities by start index to process them in order
    sorted_entities = sorted(ner_entities, key=lambda x: x['start'])
    
    corrected_fragments = []
    last_index = 0 # Our cursor for slicing the text

    # --- This regex splits a string by anything that ISN'T a letter, 
    # apostrophe, or space, but it KEEPS the delimiter.
    # It finds numbers (0-9), symbols (like ()-), and punctuation (\.,;:?!).
    splitter_pattern = re.compile(r"([^a-zA-Z'\s]+)")

    for entity in sorted_entities:
        start_index = entity['start']
        
        # --- 1. Process the "Safe Chunk" BEFORE the entity ---
        if start_index > last_index:
            safe_chunk_before_entity = text[last_index:start_index]
            
            # Now, split this chunk by numbers/symbols
            corrected_sub_fragments = []
            sub_fragments = splitter_pattern.split(safe_chunk_before_entity)
            
            for sub_fragment in sub_fragments:
                if not sub_fragment:
                    continue
                
                # Check if the sub_fragment is a delimiter or "pure text"
                # We check if it's ONLY letters, apostrophes, and spaces.
                if re.fullmatch(r"[a-zA-Z'\s]*", sub_fragment):
                    # It's "pure text". This is safe for lookup_compound.
                    try:
                        suggestions = symspell.lookup_compound(
                            sub_fragment,
                            max_edit_distance=max_edit_distance,
                            ignore_non_words=False,
                            transfer_casing=True
                        )
                        if suggestions:
                            corrected_sub_fragments.append(suggestions[0].term)
                        else:
                            corrected_sub_fragments.append(sub_fragment)
                    except Exception as e:
                        # Catch the string index error specifically
                        log.error(f"Error during lookup_compound on sub-fragment '{sub_fragment}': {e}")
                        corrected_sub_fragments.append(sub_fragment)
                else:
                    # It's a delimiter (like '(', '8300', '.'), append it as-is
                    corrected_sub_fragments.append(sub_fragment)

            # Re-join the processed sub-fragments for this chunk
            corrected_fragments.append("".join(corrected_sub_fragments))

        # --- 2. Append the "Shielded Entity" as-is ---
        if entity['end'] > last_index:
            entity_chunk = text[entity['start']:entity['end']]
            corrected_fragments.append(entity_chunk)
            # Move our cursor to the end of this entity
            last_index = entity['end']

    # --- 3. Process the Final "Safe Chunk" after the last entity ---
    if last_index < len(text):
        final_safe_chunk = text[last_index:]
        
        # Repeat the same sub-fragment logic for the final chunk
        corrected_sub_fragments = []
        sub_fragments = splitter_pattern.split(final_safe_chunk)
        
        for sub_fragment in sub_fragments:
            if not sub_fragment:
                continue
            if re.fullmatch(r"[a-zA-Z'\s]*", sub_fragment):
                # It's "pure text".
                try:
                    suggestions = symspell.lookup_compound(
                        sub_fragment,
                        max_edit_distance=max_edit_distance,
                        ignore_non_words=False,
                        transfer_casing=True
                    )
                    if suggestions:
                        corrected_sub_fragments.append(suggestions[0].term)
                    else:
                        corrected_sub_fragments.append(sub_fragment)
                except Exception as e:
                    log.error(f"Error during lookup_compound on final sub-fragment '{sub_fragment}': {e}")
                    corrected_sub_fragments.append(sub_fragment)
            else:
                # It's a delimiter.
                corrected_sub_fragments.append(sub_fragment)
        
        corrected_fragments.append("".join(corrected_sub_fragments))
    
    # --- 4. Re-join all fragments ---
    return "".join(corrected_fragments)

def fix_targeted_segmentation(text, ner_entities, symspell, max_edit_distance=2):
    """
    Finds and corrects *only* specific, targeted segmentation errors
    (like "wa ter") using regex and SymSpellPy, AND SKIPPING matches
    that overlap with NER entities.
    
    Args:
        text (str): The text to correct (output of 1-to-1 correction).
        ner_entities (list): The list of merged/filtered NER entities for this text.
        symspell (SymSpell): The pre-configured SymSpellPy instance.
    
    Returns:
        str: The text with targeted segmentation corrections.
    """
    if symspell is None or not text:
        return text

    # --- Step 1: Create the "shield" set of all indices ---
    entity_indices = set()
    for ent in ner_entities:
        entity_indices.update(range(ent['start'], ent['end']))

    # Define the patterns (excluding the risky p3)
    p1 = r"\b([a-zA-Z']{1,2})\s([a-zA-Z']{3,})\b"
    p2 = r"\b([a-zA-Z']{3,})\s([a-zA-Z']{1,2})\b"
    candidate_pattern = re.compile(f"({p1})|({p2})", flags=re.IGNORECASE)
    
    matches = list(candidate_pattern.finditer(text))
    corrected_text = text
    
    for match in reversed(matches):
        original_phrase = match.group(0)
        start_index = match.start()
        end_index = match.end()

        # --- Step 2: The Shield Check ---
        # Check if *any* part of this match overlaps with the NER shield
        is_shielded = False
        for i in range(start_index, end_index):
            if i in entity_indices:
                is_shielded = True
                break
        
        if is_shielded:
            # log.info(f"Skipping shielded phrase: '{original_phrase}'")
            continue # This match is part of a NER entity, do not correct.

        # --- Step 3: The Decider (SymSpellPy) ---
        # (This part is only reached if the phrase is NOT shielded)
        try:
            suggestions = symspell.lookup_compound(
                original_phrase,
                max_edit_distance=max_edit_distance,
                transfer_casing=True
            )
            
            if suggestions:
                best_suggestion = suggestions[0].term
                if best_suggestion != original_phrase and " " not in best_suggestion:
                    corrected_text = (
                        corrected_text[:start_index] + 
                        best_suggestion + 
                        corrected_text[end_index:]
                    )
        except Exception as e:
            log.error(f"Error during lookup_compound on phrase '{original_phrase}': {e}")
            # Continue without correcting this phrase

    return corrected_text

In [158]:
def clean_text_piece_batched(batch):
    """
    Cleans a batch of text pieces, incorporating initial normalization,
    **robust end-of-line de-hyphenation**, line break handling,
    currency/contextual fixes, and punctuation spacing.
    Designed for dataset.map(batched=True).
    """
    input_texts = batch['Full_text']
    cleaned_texts = []

    # --- Pre-compiled regex patterns ---
    # Step 1: Metadata/Tags
    backslash_re = re.compile(r'\\')
    newpage_re = re.compile(r'<NEWPAGE>')

    # Step 2: Junk Characters
    junk_char_re = re.compile(r'[^a-zA-Z0-9\s\.,;:?!£\'"()-]')

    # Step 3: De-hyphenation (End-of-Line specific)
    # Looks for letter, hyphen, optional space, newline, optional space, letter
    dehyphen_eol_re = re.compile(r'([a-zA-Z])-\s*\n\s*([a-zA-Z])')

    # Step 3: General Line Joining
    newline_re = re.compile(r'\n')

    # Step 4: Currency & Contextual Fixes
    currency_l_dot_re = re.compile(r'\s+l\.(?=\s+|\d)')
    num_l00_re = re.compile(r'\sl00(?=\s|[,.])')
    num_O0_re = re.compile(r'\sO0(?=\s|[,.])')
    # Add more specific rules based on your data...

    # Step 5: Punctuation Spacing
    space_before_punct_re = re.compile(r'\s+([,\.;:?!])')
    space_after_punct_re = re.compile(r'([,\.;:?!])\s*')
    space_around_brackets_open_re = re.compile(r'\s*\(\s*')
    space_around_brackets_close_re = re.compile(r'\s*\)\s*')
    space_around_quotes_re = re.compile(r'\s*(["\'])\s*')

    # Final whitespace cleanup
    multi_space_re = re.compile(r'\s+')
    # --- End of pre-compiled patterns ---


    for text in input_texts:
        # 1. Remove distracting metadata and tags
        cleaned_text = backslash_re.sub('', text)
        cleaned_text = newpage_re.sub('', cleaned_text)

        # Optional: Filter lines (consider removing if it affects hyphenation)
        # lines = cleaned_text.split('\n')
        # cleaned_lines = [line for line in lines if len(re.findall(r'[a-zA-Z]', line)) > 5]
        # cleaned_text = '\n'.join(cleaned_lines)

        # 2. Initial Noise Removal (Junk Characters)
        cleaned_text = junk_char_re.sub(' ', cleaned_text)

        # --- Step 3: Handle Line Breaks ---
        # 3a. De-hyphenate End-of-Line words repeatedly
        new_text = cleaned_text
        while True:
            processed_text = dehyphen_eol_re.sub(r'\1\2', new_text)
            if processed_text == new_text: # No more changes
                break
            new_text = processed_text # Update for next iteration
        cleaned_text = new_text

        # 3b. Join remaining lines into paragraphs/sentences
        cleaned_text = newline_re.sub(' ', cleaned_text)
        # --- End of Step 3 ---


        # 4. Apply Currency & Contextual Fixes
        cleaned_text = currency_l_dot_re.sub(' £', cleaned_text)
        cleaned_text = num_l00_re.sub(' 100', cleaned_text)
        cleaned_text = num_O0_re.sub(' 00', cleaned_text)
        # Add more specific rules here...

        # 5. Refine Punctuation Spacing
        cleaned_text = space_before_punct_re.sub(r'\1', cleaned_text)
        cleaned_text = space_after_punct_re.sub(r'\1 ', cleaned_text)
        cleaned_text = space_around_brackets_open_re.sub(' (', cleaned_text)
        cleaned_text = space_around_brackets_close_re.sub(') ', cleaned_text)
        cleaned_text = space_around_quotes_re.sub(r' \1 ', cleaned_text) # Basic quote spacing

        # Final Cleanup
        cleaned_text = multi_space_re.sub(' ', cleaned_text)
        cleaned_text = cleaned_text.strip()

        cleaned_texts.append(cleaned_text)

    return {'Cleaned_text': cleaned_texts}

In [159]:
def merge_entities_generic(ner_results, text, entity_types_to_merge={'PER', 'ORG', 'LOC'}):
    """
    Merges adjacent, fragmented entities (PER, ORG, LOC by default)
    from a NER model's output if separated by common connectors or whitespace.
    Handles initials specifically for PER entities.

    Args:
        ner_results (list): The raw output from the NER pipeline (ideally
                            with an aggregation strategy like "simple" already applied).
        text (str): The original text that was analyzed.
        entity_types_to_merge (set): A set of entity group labels to consider for merging
                                     (e.g., {'PER', 'ORG', 'LOC'}).

    Returns:
        list: A new list of entities with fragmented ones merged.
    """
    if not ner_results:
        return []

    merged_results = []
    i = 0
    while i < len(ner_results):
        current_entity = ner_results[i]

        # Check if the current entity is one we want to merge and if there's a next entity
        if current_entity['entity_group'] in entity_types_to_merge and i + 1 < len(ner_results):
            next_entity = ner_results[i+1]

            # --- MERGING LOGIC ---
            # Condition 1: Check if the next entity is of the SAME type
            if next_entity['entity_group'] == current_entity['entity_group']:
                # Get the text between the two entities
                start = current_entity['end']
                end = next_entity['start']
                text_between = text[start:end]
                stripped_text_between = text_between.strip()

                should_merge = False
                # Merge based on simple separators ('and', '&', space/nothing)
                if stripped_text_between in ['and', '&', '']:
                    should_merge = True

                # Merge initials specifically for PER entities
                elif current_entity['entity_group'] == 'PER' and \
                     current_entity['word'].endswith('.') and \
                     len(current_entity['word'].strip('.')) <= 1 and \
                     stripped_text_between == '':
                    # Ensure space is added between initials
                     text_between = " "
                     should_merge = True

                if should_merge:
                    # Combine words, average scores, update indices
                    # Add space if original separation was just whitespace but not empty
                    if stripped_text_between == '' and text_between != '':
                        new_word = current_entity['word'] + ' ' + next_entity['word']
                    else:
                        new_word = current_entity['word'] + text_between + next_entity['word']

                    new_score = (current_entity['score'] + next_entity['score']) / 2

                    # Create a new merged entity dictionary
                    merged_entity = {
                        'entity_group': current_entity['entity_group'],
                        'word': new_word.replace(" ##", ""), # Clean up sub-word tokens if necessary
                        'score': new_score,
                        'start': current_entity['start'],
                        'end': next_entity['end']
                    }

                    # Replace the current entity with the merged one for the next iteration/append step
                    current_entity = merged_entity
                    # Increment 'i' to skip the next entity that we just merged
                    i += 1

        # Append the current (potentially merged) entity to the results
        merged_results.append(current_entity)
        i += 1 # Move to the next entity

    # --- Final Filtering ---
    # Keep only the desired entity types and apply basic length filtering
    # Allows single initials ending with '.' for PER, otherwise length > 1
    final_results = [
        ent for ent in merged_results
        if ent['entity_group'] in entity_types_to_merge and \
           (len(ent['word'].strip()) > 1 or \
            (ent['entity_group'] == 'PER' and ent['word'].strip().endswith('.') and len(ent['word'].strip()) <= 2)
           )
    ]

    return final_results

In [160]:
def apply_capitalization(text, ner_entities):
    """
    Applies sentence case and capitalizes recognized entities,
    handling common abbreviations/acronyms as special cases (keeping them uppercase).
    """
    if not text or text.isspace():
        return text

    # Convert text to list of characters for easier modification
    text_chars = list(text.lower())

    # --- Apply Sentence Capitalization ---
    capitalize_next = True # Capitalize the very first char
    for i, char in enumerate(text_chars):
        if capitalize_next and char.isalpha():
            text_chars[i] = char.upper()
            capitalize_next = False
        # Capitalize after sentence-ending punctuation followed by space
        elif char in '.?!' and i + 1 < len(text_chars) and text_chars[i+1].isspace():
             capitalize_next = True
        # Don't capitalize if the next char isn't a letter or if inside a word
        elif not char.isspace():
            capitalize_next = False
    # Ensure first letter is capitalized even if text starts with non-alpha
    for i, char in enumerate(text_chars):
        if char.isalpha():
            text_chars[i] = char.upper()
            break


    # --- Apply Entity Capitalization ---
    sorted_entities = sorted(ner_entities, key=lambda x: x['start'])
    modified_indices = set()

    # --- Define known abbreviations or patterns ---
    # More robust: list of specific lowercased abbreviations to keep upper
    known_abbreviations = {'llc', 'ltd', 'inc', 'co', 'plc', 'corp',
                           'ra', 'mp', 'esq', 'bart', 'kbe', 'cbe', 'obe', 'mbe', # Titles/Honours
                           'mph', 'kph' # Units often uppercase
                           }
    # Regex for patterns like X.Y. or X.Y.Z. (allows letters only)
    initials_pattern = re.compile(r'^([A-Z]\.\s?)+$')
    # Regex for simple all-caps words (e.g., BBC, NATO) - adjust length as needed
    all_caps_pattern = re.compile(r'^[A-Z]{2,}$')


    for entity in sorted_entities:
        start = entity['start']
        end = entity['end']
        entity_group = entity['entity_group'] # PER, ORG, LOC
        original_word = entity['word'] # Get the word as identified by NER

        # Basic overlap check
        if any(idx in modified_indices for idx in range(start, end)):
            continue

        current_span_list = text_chars[start:end]
        if not current_span_list:
            continue
        current_span_lower = "".join(current_span_list) # This is already lowercase

        # --- Capitalization Rules ---
        capitalized_span = ""

        # Rule 1: Check against known abbreviations list (case-insensitive check)
        # We check the version without trailing punctuation if present
        check_word = current_span_lower.rstrip('.,;:?!')
        if check_word in known_abbreviations:
            capitalized_span = current_span_lower.upper()

        # Rule 2: Check for initials pattern (e.g., R.A., M.P.) using original word
        # Needs original word casing info
        elif initials_pattern.match(original_word):
             capitalized_span = current_span_lower.upper() # Keep uppercase

        # Rule 3: Check for simple all-caps pattern (e.g., BBC) using original word
        elif entity_group == 'ORG' and all_caps_pattern.match(original_word):
             capitalized_span = current_span_lower.upper() # Keep uppercase

        # Rule 4: Default - Title Case
        else:
            capitalized_span = current_span_lower.title()
            # Post-title case fixes (e.g., McDonald -> McDonald) could go here if needed
            # Example:
            # if 'mc' in capitalized_span.lower() and capitalized_span.startswith('Mc'):
            #     mc_index = capitalized_span.lower().find('mc')
            #     if mc_index + 2 < len(capitalized_span):
            #         third_char = capitalized_span[mc_index + 2]
            #         if third_char.islower(): # Already title cased like Mcdonald
            #              capitalized_span = capitalized_span[:mc_index+2] + third_char.upper() + capitalized_span[mc_index+3:]


        # --- Apply the change ---
        if len(capitalized_span) == len(current_span_list):
            text_chars[start:end] = list(capitalized_span)
            modified_indices.update(range(start, end))
        else:
            log.warning(f"Length mismatch during capitalization for entity: '{entity['word']}' "
                        f"Original span: '{current_span_lower}', Capitalized: '{capitalized_span}'")

    return "".join(text_chars)

In [None]:
def capitalization_and_correction_batched(batch, text_column="Cleaned_text", master_word_set=COMPLETE_WORDLIST, symspell=None):
    """
    Applies a multi-step, NER-aware correction and capitalization process.

    This function orchestrates the following steps for each text entry:
    1. Merges fragmented NER entities (e.g., 'Royal' 'Academy' -> 'Royal Academy').
    2. Applies sentence case to the text and capitalizes the merged entities.
    3. Performs a first pass of NER-shielded spelling correction for simple typos.
    4. Corrects targeted segmentation errors (e.g., 'wa ter' -> 'water').
    5. Corrects general segmentation errors (e.g., 'theexhibition' -> 'the exhibition').
    6. Performs a final, more aggressive spelling correction pass on the result.
    
    Designed for dataset.map(batched=True).
    """
    if ner_pipeline is None:
        log.error("NER pipeline not loaded. Cannot perform capitalization.")
        return {f"Corrected_{text_column}": batch[text_column]}

    input_texts = batch[text_column]
    final_texts = []

    try:
        all_ner_raw_results = ner_pipeline(input_texts)
    except Exception as e:
        log.error(f"Error during NER processing: {e}")
        return {f"Corrected_{text_column}": input_texts}

    if len(all_ner_raw_results) != len(input_texts):
         log.error(f"NER results/input length mismatch. Skipping batch.")
         return {f"Corrected_{text_column}": input_texts}

    # Process each text individually
    for i, text in enumerate(input_texts):
        if not text or text.isspace():
             final_texts.append(text)
             continue

        ner_results_for_text = all_ner_raw_results[i]

        # --- Step 1: Merge Entities ---
        merged_entities = merge_entities_generic(ner_results_for_text, text)

        # --- Step 2: Apply Capitalization ---
        capitalized_text = apply_capitalization(text, merged_entities)
        
        # --- Step 3: Apply Safe 1-to-1 Spell Correction ---
        # This fixes simple typos like "wvas" -> "was"
        corrected_text_simple = correct_spelling_safe(
            capitalized_text, 
            merged_entities, 
            symspell,
            master_word_set            
        )
        
# Step 4: Apply *Targeted* Segmentation Correction (now also shielded)
        # This fixes things like "wa ter"
        targeted_segmentation_fixes = fix_targeted_segmentation(
            corrected_text_simple,
            merged_entities, # <-- Pass the shield
            symspell
        )
        
        general_segmentation_fixes = correct_segmentation_errors_safe(
            targeted_segmentation_fixes,
            merged_entities,      # <-- Pass the shield
            symspell
        )

        final_corrected_text = correct_spelling_safe(
            general_segmentation_fixes, 
            merged_entities, 
            symspell,
            master_word_set,
            max_edit_distance=2, # Allow more edits for final pass            
            segmentation_ratio=3, # Lower threshold for final pass
        )
        
        
        final_texts.append(final_corrected_text)

    # Use a new output column name
    return {f"Corrected_{text_column}": final_texts}

In [192]:
def standardize_honorifics(text, style='dot'):
    """
    Standardizes common honorifics (Mr, Mrs, Sir, etc.) to a consistent
    capitalization and punctuation style, running as a final polish.
    
    This function should be applied *after* all NER and spelling corrections.
    
    Args:
        text (str): The text to process.
        style (str): 
            'dot'   -> Enforces a dot: 'Mr.', 'Mrs.', 'Dr.', 'Esq.'
            'no_dot' -> Enforces no dot: 'Mr', 'Mrs', 'Dr', 'Esq' 
                         (Closer to modern British English style)
    
    Returns:
        str: The text with standardized honorifics.
    """
    if not text:
        return text

    # --- Define Replacements ---
    
    # These titles are always capitalized but never take a dot.
    # We run these first.
    no_dot_titles = {
        r'\bsir\b': 'Sir',
        r'\bdame\b': 'Dame',
        r'\blord\b': 'Lord',
        r'\blady\b': 'Lady',
        r'\bmiss\b': 'Miss', # Miss is a full word, not an abbreviation
    }

    # These are abbreviations, and their punctuation depends on the style.
    # The regex \.? matches if a dot is present or not, standardizing both.
    if style == 'dot':
        style_dependent_titles = {
            r'\bmr\.?\b': 'Mr.',
            r'\bmrs\.?\b': 'Mrs.',
            r'\bms\.?\b': 'Ms.',
            r'\bdr\.?\b': 'Dr.',
            r'\brev\.?\b': 'Rev.',
            r'\bprof\.?\b': 'Prof.',
            r'\bcapt\.?\b': 'Capt.',
            r'\bcol\.?\b': 'Col.',
            r'\bgen\.?\b': 'Gen.',
            r'\besq\.?\b': 'Esq.',
            r'\bwm\.?\b': 'Wm.', # For William
        }
    else: # style == 'no_dot' (Modern British)
        style_dependent_titles = {
            r'\bmr\.?\b': 'Mr',
            r'\bmrs\.?\b': 'Mrs',
            r'\bms\.?\b': 'Ms',
            r'\bdr\.?\b': 'Dr',
            r'\brev\.?\b': 'Rev',
            r'\bprof\.?\b': 'Prof',
            r'\bcapt\.?\b': 'Capt',
            r'\bcol\.?\b': 'Col',
            r'\bgen\.?\b': 'Gen',
            r'\besq\.?\b': 'Esq', # For consistency, we'll remove the dot here too
            r'\bwm\.?\b': 'Wm', 
        }

    # Apply the replacements, starting with the non-dotted ones.
    # We use re.IGNORECASE to catch all variants (e.g., 'mr', 'Mr', 'MR').
    
    for pattern, replacement in no_dot_titles.items():
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
        
    for pattern, replacement in style_dependent_titles.items():
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
            
    return text

def final_polishing_batched(batch, text_column="Corrected_Cleaned_text", style='dot'):
    """
    Wrapper function to apply honorific standardization to a
    dataset batch.
    """
    input_texts = batch[text_column]
    
    polished_texts = [
        standardize_honorifics(text, style=style) for text in input_texts
    ]
    
    # Return in a new column to preserve the previous step
    return {f"Polished_{text_column}": polished_texts}

In [214]:
def reapply_punctuation_spacing(text):
    """
    Cleans up all spacing around punctuation. This is a final
    polishing step to be run AFTER all other corrections.
    """
    if not text:
        return text

    # 1. Removes space BEFORE punctuation
    _space_before_punct_re = re.compile(r'\s+([,\.;:?!])')

    # 2. Ensures one space AFTER punctuation if it's followed by a letter/number
    #    (This fixes "word.word" -> "word. word")
    _space_after_punct_re = re.compile(r'([,\.;:?!])([a-zA-Z0-9])')

    # 3. Cleans up spaces around brackets
    _space_around_brackets_open_re = re.compile(r'\s*\(\s*')
    _space_around_brackets_close_re = re.compile(r'\s*\)\s*')

    # 4. Collapses any remaining multiple spaces
    _multi_space_re = re.compile(r'\s+')


    # "word . word" -> "word. word"
    text = _space_before_punct_re.sub(r'\1', text)
    
    # "word.word" -> "word. word"
    text = _space_after_punct_re.sub(r'\1 \2', text)
    
    # "word ( word ) word" -> "word (word) word"
    text = _space_around_brackets_open_re.sub(' (', text)
    text = _space_around_brackets_close_re.sub(') ', text)

    # "word.  word" -> "word. word"
    text = _multi_space_re.sub(' ', text)
    
    # Final strip
    return text.strip()

def final_spacing_batched(batch, text_column="Corrected_Cleaned_text"):
    """
    Batch-processing wrapper for the final spacing cleanup.
    Run this on the output of your *previous* final step.
    """
    # Get the text from the *last* processing column
    input_texts = batch[text_column]
    
    # Use a list comprehension for speed
    polished_texts = [reapply_punctuation_spacing(text) for text in input_texts]
    
    # Return in a new, final column
    return {f"Final_Polished_Text": polished_texts}

In [None]:
def display_record(record):
    """
    Displays a Hugging Face dataset record with metadata followed by two
    text fields (Full_text, Final_Polished_Text) in side-by-side
    columns in the terminal.
    
    This function works with both dictionaries and pandas Series
    (e.g., a row from a DataFrame like df.loc[0]).
    """
    
    # --- 1. Display Metadata ---
    print("-" * 80)
    print("METADATA")
    print("-" * 80)
    
    # Define metadata keys to extract and display
    metadata_keys = [
        'Author', 'Title', 'Publication', 'Date', 'Place', 'URL'
    ]
    
    for key in metadata_keys:
        # Use .get() to safely access keys, providing 'N/A' if key is missing
        value = record.get(key, 'N/A')
        print(f"{key+':':<15} {value}")
        
    print("\n" + "=" * 80)
    print("TEXT COMPARISON")
    print("=" * 80)

    # --- 2. Prepare Columnar Text ---
    
    # Get text fields
    full_text = record.get('Full_text', '')
    capitalized_text = record.get('Final_Polished_Text', '')

    # Get terminal width to make columns responsive
    try:
        terminal_width = shutil.get_terminal_size().columns
    except OSError:
        # Fallback if not in a real terminal (e.g., CI/CD)
        terminal_width = 120

    # Define spacing: 3 columns + 2 separators (" | ")
    # We give a little extra buffer for the separators
    padding = 4 
    col_width = (terminal_width - padding) // 2
    
    if col_width < 10:
        print("Terminal is too narrow to display columns effectively.")
        print("\nFull_text:\n", full_text)
        print("\Final_Polished_Text:\n", capitalized_text)
        return

    # Wrap text for each column
    wrapped_full = textwrap.wrap(full_text, width=col_width)
    wrapped_capitalized = textwrap.wrap(capitalized_text, width=col_width)

    # Find the maximum number of lines needed
    max_lines = max(len(wrapped_full), len(wrapped_capitalized))

    # --- 3. Print Headers and Rows ---
    
    # Create header strings, left-aligned and truncated if necessary
    header_full = "Full_text".ljust(col_width)
    header_capitalized = "Final_Polished_Text".ljust(col_width)

    print(f"{header_full} | {header_capitalized}")
    print(f"{'-' * col_width} | {'-' * col_width}")

    # Print each row
    for i in range(max_lines):
        # Get the line for each column, or an empty string if text is shorter
        line_full = wrapped_full[i] if i < len(wrapped_full) else ""
        line_capitalized = wrapped_capitalized[i] if i < len(wrapped_capitalized) else ""
        
        # Print the formatted row, ensuring each part adheres to the column width
        print(f"{line_full.ljust(col_width)} | {line_capitalized.ljust(col_width)}")

    print("=" * 80)

In [None]:
def pipeline_driver():
    """
    Main driver function to run the full pre-processing pipeline.

    - Finds all .csv files in a specified input directory.
    - Sequentially applies the four main processing stages:
      1. Initial text cleaning (`clean_text_piece_batched`).
      2. NER-driven correction and capitalization (`capitalization_and_correction_batched`).
      3. Honorific standardization (`final_polishing_batched`).
      4. Final punctuation spacing cleanup (`final_spacing_batched`).
    - Saves the fully processed data to new .csv files in an output directory.
    - Returns the DataFrame from the last processed file for inspection.
    """
    # --- Configuration ---
    # Define the folder where your raw .csv files are located
    input_folder = Path('./Data_test')
    # Define the folder where processed files will be saved
    output_folder = Path('./Data_processed/full_pipeline/')
    
    # --- Setup ---
    # Create the output directory if it doesn't exist
    output_folder.mkdir(parents=True, exist_ok=True)
    log.info(f"Input folder: '{input_folder}'")
    log.info(f"Output folder: '{output_folder}'")

    # Setup SymSpellPy with the master word set
    symspell_checker = setup_spellchecker(COMPLETE_WORDLIST)

    # Find all CSV files in the input directory
    csv_files = list(input_folder.glob('*.csv'))
    if not csv_files:
        log.warning(f"No CSV files found in '{input_folder}'. Exiting.")
        return

    log.info(f"Found {len(csv_files)} CSV file(s) to process: {[f.name for f in csv_files]}")

    # --- Processing Loop ---
    for file_path in csv_files:
        log.info(f"\n{'='*20} Processing file: {file_path.name} {'='*20}")
        
        # 1. Load the data using the helper function
        loaded_data = load_csv_file(str(file_path))
        if not loaded_data:
            log.error(f"Failed to load {file_path.name}. Skipping.")
            continue
        
        raw_dataset = loaded_data['data']
        log.info(f"Loaded {loaded_data['length']} rows.")

        # --- 2. RUN THE FULL PIPELINE ---
        log.info("Starting text cleaning pipeline...")

        log.info("Step 1/4: Initial cleaning...")
        cleaned_dataset = raw_dataset['train'].map(
            clean_text_piece_batched, 
            batched=True
        )

        log.info("Step 2/4: NER, capitalization, and correction...")
        corrected_dataset = cleaned_dataset.map(
            capitalization_and_correction_batched,
            batched=True,
            batch_size=8,
            fn_kwargs={
                'text_column': 'Cleaned_text',
                'master_word_set': COMPLETE_WORDLIST,
                'symspell': symspell_checker
            }
        )

        log.info("Step 3/4: Polishing honorifics...")
        honorifics_dataset = corrected_dataset.map(
            final_polishing_batched,
            batched=True,
            fn_kwargs={
                'text_column': 'Corrected_Cleaned_text',
                'style': 'dot' 
            }
        )

        log.info("Step 4/4: Final punctuation spacing...")
        final_dataset = honorifics_dataset.map(
            final_spacing_batched,
            batched=True,
            fn_kwargs={'text_column': 'Polished_Corrected_Cleaned_text'}
        )

        log.info("Pipeline complete. Final text is in 'Final_Polished_Text'.")

        # 4. Save the results to a new CSV file
        log.info("Step 3/3: Saving processed data...")
        output_filename = f"{file_path.stem}_processed.csv"
        output_path = output_folder / output_filename
        
        # Convert the final dataset to a pandas DataFrame to save as CSV
        final_df = final_dataset.to_pandas()
        
        final_df.to_csv(output_path, index=False)
        log.info(f"Successfully saved processed data to '{output_path}'")

    log.info(f"\n{'='*20} Pipeline finished. {'='*20}")

    return final_df



In [228]:
# --- Execute the main function ---
# This block ensures the code runs when the script is executed
df = pipeline_driver()



INFO:__main__:Input folder: 'Data_test'
INFO:__main__:Output folder: 'Data_processed/full_pipeline'
INFO:__main__:Setting up SymSpellPy...
INFO:__main__:Loaded standard frequency dictionary. Word count: 82834
INFO:__main__:Added 415814 new unique words from master set to SymSpell.
INFO:__main__:SymSpell setup complete. Total unique terms: 498648
INFO:__main__:Found 1 CSV file(s) to process: ['200_full_text_samples.csv']
INFO:__main__:
INFO:__main__:Loaded 196 rows.
INFO:__main__:Starting text cleaning pipeline...
INFO:__main__:Step 1/4: Initial cleaning...
INFO:__main__:Step 2/4: NER, capitalization, and correction...


Map:   0%|          | 0/196 [00:00<?, ? examples/s]

ERROR:__main__:Error during word_segmentation on 'leieesteisquare': distance too large
ERROR:__main__:Error during word_segmentation on 'PeterfromPrisontm': distance too large
ERROR:__main__:Error during word_segmentation on 'Tionswhicharcsocrbditable': distance too large
ERROR:__main__:Error during word_segmentation on 'Westminsterbridge': distance too large
ERROR:__main__:Error during word_segmentation on 'Tionswhicharcsocrbditable': distance too large
ERROR:__main__:Error during word_segmentation on 'tleeparliamentary': distance too large
ERROR:__main__:Error during word_segmentation on 'lbcelteceniththtisetandt': distance too large
ERROR:__main__:Error during word_segmentation on 'ethearchntiologset': distance too large
ERROR:__main__:Error during word_segmentation on 'conscieottously': distance too large
ERROR:__main__:Error during word_segmentation on 'theintelligence': distance too large
ERROR:__main__:Error during word_segmentation on 'presenrtpromsinence': distance too large
E

Map:   0%|          | 0/196 [00:00<?, ? examples/s]

INFO:__main__:Step 4/4: Final punctuation spacing...


Map:   0%|          | 0/196 [00:00<?, ? examples/s]

INFO:__main__:Pipeline complete. Final text is in 'Final_Polished_Text'.
INFO:__main__:Step 3/3: Saving processed data...
INFO:__main__:Successfully saved processed data to 'Data_processed/full_pipeline/200_full_text_samples_processed.csv'
INFO:__main__:


In [232]:
display_record(df.loc[6]) 

--------------------------------------------------------------------------------
METADATA
--------------------------------------------------------------------------------
Author:         Null
Title:          BRITISH INSTITUTION.-The annual exhibition of paintings by old masters is now open at the British
Publication:    The Times
Date:           1838-Jun-18
Place:          London, England
URL:            https://link.gale.com/apps/doc/CS100821714/TTDA?u=uppsala&sid=bookmark-TTDA&xid=df660182

TEXT COMPARISON
Full_text                              | Final_Polished_Text                   
-------------------------------------- | --------------------------------------
BRITISH INSTITUTIO'.-The annual        | British institution.-the annual       
exchibition of paiHting;s by old       | exhibition of painting; s by old      
masters is now open at the British     | masters is now open at                
Institution, Pall-mall, and the        | theBritishInstitution, Pall-Mall, and 
nunmbe