# Word Frequency Analysis and OCR Correction

This notebook performs a large-scale word frequency analysis on a corpus of text files (CSV format). It is designed to handle large datasets efficiently using multiprocessing.

**Key Goals:**
1.  **Count Words:** Calculate the frequency of every unique word across the entire corpus.
2.  **Identify Unknowns:** Compare the found words against a "known" dictionary (`complete_wordlist.pkl`) to isolate potential misspellings or OCR errors.
3.  **Analyze OCR Errors:** Use the list of unknown words to identify common OCR patterns (e.g., "wv" instead of "w").
4.  **Develop Patches:** Create and test regex-based "patches" to automatically fix these systematic errors.

**Methodology:**
-   **Parallel Processing:** Uses `multiprocessing` to count words in parallel across multiple CPU cores.
-   **Streaming:** Uses a generator to load and process files in chunks, keeping memory usage low.
-   **Regex Cleaning:** Applies targeted regular expressions to correct specific OCR artifacts found in the analysis.


In [1]:
import glob
import pandas as pd
import re
from collections import Counter
from datasets import load_dataset
import os # To get the number of CPU cores
from tqdm import tqdm
import multiprocessing
import pickle

In [1]:
# This regex will extract words, including those with apostrophes (e.g., "don't", "artist's").
# It excludes standalone punctuation and numbers mixed with symbols.
word_counter_regex = re.compile(r"\b[a-zA-Z'0-9]+\b")

# Path to the directory containing the raw CSV data files
DATA_DIR_PATH = './Data/'

NameError: name 're' is not defined

### 1. Parallel Word Counting
The following functions define the parallel processing logic. `worker_word_counter` is the function that runs on each CPU core, processing a specific batch of text.


In [None]:
def worker_word_counter(job_tuple):
    """
    Worker function for multiprocessing.
    Counts words in a batch of texts and passes the 'is_last_batch' flag through.
    
    Args:
        job_tuple (tuple): (list_of_texts, is_last_batch_boolean)
        
    Returns:
        tuple: (Counter object with word counts, is_last_batch_boolean)
    """
    # 1. Unpack the job
    texts, is_last_batch_flag = job_tuple
    
    batch_counter = Counter()
    
    for text in texts:
        text_counter = Counter()
        if isinstance(text, str):
            # Use the GLOBAL regex to find all words in the text
            # Convert to lowercase to ensure case-insensitive counting
            words = word_counter_regex.findall(text.lower())
            text_counter.update(words)
        batch_counter.update(text_counter)
        
    # 2. Return the result AND the flag so the main process knows when a file is finished
    return (batch_counter, is_last_batch_flag)

In [None]:
def get_file_list(directory_path, file_extension="csv"):
    """
    Retrieves a list of all files with a specific extension in a directory.
    Exits the script if no files are found.
    """
    
    file_pattern = os.path.join(directory_path, "*." + file_extension) # e.g., './Data/*.csv'

    print(f"Finding all files matching: {file_pattern}")
    all_files = glob.glob(file_pattern)
    print(f"Found {len(all_files)} files to process.")

    if not all_files:
        print(f"Error: No files found. Check your DATA_DIR_PATH ({directory_path}) and file extension ({file_extension}).")
        exit() # Exit the script if no files are found
    
    return all_files, len(all_files)

### 2. Data Streaming
To avoid loading huge amount of data into RAM, we use a **generator**. This function yields small chunks (batches) of text one by one. The multiprocessing pool consumes these batches as they are yielded.


In [None]:
def batch_generator(file_list, chunk_size=1000, text_column="Full_text"):
    """
    Generator function. Loads one file at a time and yields 
    its texts in batches. This ensures we only hold one file's worth of data
    in memory at a time.
    """
    print("Batch Generator: Starting up...")
    
    for filename in file_list:
        try:
            # Load the file silently using Hugging Face datasets (efficient memory mapping)
            raw_dataset = load_dataset('csv', data_files=filename) 
            
            texts = raw_dataset['train'][text_column]
            total_texts_in_file = len(texts)
            
            if total_texts_in_file == 0:
                # Special case: file is empty, yield one "last_batch" flag to keep progress bar in sync
                yield ([], True)
                continue

            # Slice the data into chunks
            for i in range(0, total_texts_in_file, chunk_size):
                batch_of_texts = texts[i:i + chunk_size]
                
                # Check if this is the last batch for this specific file
                is_last_batch = (i + chunk_size >= total_texts_in_file)
                
                # Yield the batch AND the flag
                yield (batch_of_texts, is_last_batch)
                
        except Exception as e:
            # Still log errors, but skip the file so the whole process doesn't crash
            print(f"\n[Generator] Error processing file {filename}: {e}\n")
            # Yield a "done" flag for this file so the pbar ticks
            yield ([], True) 
            continue

### 3. Main Execution Driver
This function orchestrates the entire process. It sets up the multiprocessing pool, feeds it data from the generator, and aggregates the results into a single `final_counts` dictionary.


In [None]:
# ...existing code...
def driver_word_frequency(data_dir_path=None, batch_size=1000):
    """
    Main driver function.
    1. Finds files.
    2. Sets up multiprocessing pool.
    3. Aggregates results from workers.
    """
    all_files, file_count = get_file_list(data_dir_path, file_extension="csv")
    try:
        # Determine number of cores to use (leave 2 free for system stability)
        num_cores = max(1, os.cpu_count() - 2) 
        final_counts = Counter()
        
        # Create the generator ITSELF. (This does NOT load any files yet!)
        job_iterator = batch_generator(all_files, chunk_size=batch_size)

        print(f"Starting processing pool with {num_cores} workers...")
        
        with multiprocessing.Pool(processes=num_cores) as pool:
            
            # imap_unordered is faster as it yields results as soon as they are ready,
            # regardless of the order they were submitted.
            results_iterator = pool.imap_unordered(worker_word_counter, job_iterator)
            
            # --- Progress Bar ---
            with tqdm(total=file_count, desc="Processing files") as pbar:
                for batch_counter, is_last_flag in results_iterator:
                    
                    # 1. Do the reduce step (merge batch counts into global counts)
                    final_counts.update(batch_counter)
                    
                    # 2. Update the bar ONLY when a file is completely done
                    if is_last_flag:
                        pbar.update(1)
            
        print("...Processing complete.")

        return final_counts
        
    except Exception as e:
        print(f"An error occurred in the main block: {e}")
# ...existing code...

In [37]:
final_counts = driver_word_frequency(data_dir_path=DATA_DIR_PATH)

Finding all files matching: ./Data/*.csv
Found 7 files to process.
Starting processing pool with 4 workers...
Batch Generator: Starting up...


Processing files: 100%|██████████| 7/7 [00:03<00:00,  1.79it/s]

...Processing complete.





In [38]:
print(final_counts.most_common(100))
print(final_counts.total())
print(len(final_counts))

[('the', 1719250), ('of', 984369), ('and', 760102), ('a', 612475), ('in', 531746), ('to', 519167), ('is', 351825), ('by', 232014), ('it', 208198), ('that', 204275), ('with', 202987), ('was', 197162), ('as', 197064), ('for', 194026), ('his', 171659), ('at', 158738), ('mr', 144961), ('which', 140692), ('on', 138260), ('but', 136189), ('be', 131597), ('he', 114235), ('not', 112796), ('this', 110380), ('an', 108419), ('from', 105573), ('are', 101199), ('has', 97488), ('her', 91688), ('one', 79683), ('have', 77411), ('its', 72505), ('who', 68342), ('more', 67422), ('s', 64982), ('miss', 63013), ('all', 62500), ('music', 62342), ('there', 61087), ('were', 59699), ('their', 58154), ('so', 56800), ('or', 56410), ('i', 55348), ('will', 53839), ('first', 52751), ('they', 52656), ('been', 52099), ('theatre', 51092), ('than', 50927), ('we', 50425), ('she', 50355), ('no', 49462), ('had', 44771), ('two', 43758), ('some', 42666), ('new', 42159), ('last', 40343), ('most', 40312), ('when', 39455), ('wo

In [None]:
# Save final_counts to a text file sorted highest-to-lowest by count
output_path = os.path.join(DATA_DIR_PATH, "word_counts_sorted.txt")

with open(output_path, "w", encoding="utf-8") as f:
    # most_common() already returns (word, count) sorted desc by count
    for word, count in final_counts.most_common():
        f.write(f"{count}\t{word}\n")

print(f"Saved {len(final_counts)} unique tokens to: {output_path}")

Saved 903130 unique tokens to: ./Data/word_counts_sorted_with_numbers.txt


### 4. Dictionary Comparison
Now that we have the raw word counts, we compare them against a "known" dictionary (`complete_wordlist.pkl`). This allows us to separate valid words from potential errors.


In [3]:
COMPLETE_WORDLIST = set()
# Choose a wordlist
WORDLIST_NAME = 'complete_wordlist.pkl' 
#WORDLIST_NAME = 'artist_wordlist.pkl'
DICTIONARY_DATA_DIR = './Dictionary_data/'

In [4]:
with open(DICTIONARY_DATA_DIR + WORDLIST_NAME, 'rb') as file:
    COMPLETE_WORDLIST = pickle.load(file)
print("Total unique words loaded:", len(COMPLETE_WORDLIST))

Total unique words loaded: 502183


### 5. Analyzing Unknown Words
We isolate the words that were *not* found in our dictionary. These are likely to be:
1.  Proper nouns (names, places) not in our list.
2.  OCR errors (e.g., "thc" instead of "the").
3.  Foreign words.

We save these to a file to inspect them manually and design correction rules.

In [40]:
final_count_words = set(final_counts.keys())
non_identified_words = final_count_words - COMPLETE_WORDLIST
print(f"Total non-identified words: {len(non_identified_words)}")
print("Sample non-identified words:", list(non_identified_words)[:20])
non_identified_words_counts = Counter({word: final_counts[word] for word in non_identified_words})

Total non-identified words: 799570
Sample non-identified words: ['pctri', 'dedhouse', 'audieuca', '31al', 'patrty', 'verkldrte', "ffi'cr", 'affec', 'felik', 'instrunmentalists', 'eulow', 'dihiculties', 'exhibitcd', 'onu', 'plentvo', 'itwasno', 'entwinements', 'clodhopperly', '1xa', 'britwell']


In [13]:
# Save non_identified_words to a text file sorted highest-to-lowest by count
output_path = os.path.join('./', "non_identified_word_counts.txt")

with open(output_path, "w", encoding="utf-8") as f:
    # most_common() already returns (word, count) sorted desc by count
    for word, count in non_identified_words_counts.most_common():
        f.write(f"{count}\t{word}\n")

print(f"Saved {len(final_counts)} unique tokens to: {output_path}")

Saved 845044 unique tokens to: ./non_identified_word_counts.txt


In [14]:
#Find identified words which are correctly spelled
identified_words = final_count_words & COMPLETE_WORDLIST
print(f"Total identified words: {len(identified_words)}")
print("Sample identified words:", list(identified_words)[:20])

output_path = os.path.join('./', "identified_words_counts.txt")

total_word_count = sum(final_counts.values())
identified_words_counts = {word: final_counts[word] for word in identified_words}

with open(output_path, "w", encoding="utf-8") as f:
    # most_common() already returns (word, count) sorted desc by count
    for word in identified_words:
        f.write(f"{final_counts[word]}\t{word}\n")

print(f"Saved {len(final_counts)} unique tokens to: {output_path}")

DICTIONARY_DATA_DIR = './Dictionary_data/'
output_path = os.path.join(DICTIONARY_DATA_DIR, "identified_words_freq_percentage.pkl")
with open(output_path, 'wb') as file:
    pickle.dump(identified_words_counts, file)
print(f"Saved {len(identified_words_counts)} identified words with frequencies to: {output_path}")


Total identified words: 103471
Sample identified words: ['tenderly', 'jerkins', 'weald', 'ratter', 'blowings', 'dosser', 'shoring', 'mfr', 'slacking', 'rhapsodizing', 'leavening', 'viewpoints', 'tuned', 'argante', 'bladder', 'regally', 'flexible', 'homier', 'convoluted', 'solita']
Saved 845044 unique tokens to: ./identified_words_counts.txt
Saved 103471 identified words with frequencies to: ./Dictionary_data/identified_words_freq_percentage.pkl


In [44]:
final_counts

Counter({'my': 6925,
         'life': 15727,
         'as': 197064,
         'a': 612475,
         'sitting': 700,
         'duck': 182,
         'both': 19322,
         'artist': 6450,
         'and': 760102,
         'art': 16398,
         'critic': 1652,
         'matthew': 902,
         'collings': 23,
         'sets': 2114,
         'himself': 13440,
         'up': 25315,
         'to': 519167,
         'be': 131597,
         'shot': 750,
         'by': 232014,
         'sides': 751,
         'chris': 473,
         'mcandrew': 5,
         'jrtsjla': 1,
         'aam': 32,
         'ii': 2702,
         '0': 3163,
         'jy': 30,
         'i': 55348,
         'ss': 800,
         'for': 194026,
         'long': 15513,
         'time': 28115,
         "i've": 591,
         'led': 2864,
         'double': 3272,
         'been': 52099,
         'an': 108419,
         'writing': 4594,
         'nearly': 3323,
         '30': 7623,
         'years': 17611,
         'but': 136189,
      

### 6. Filtering Noise
Many "unknown" words are just numbers or measurements (e.g., "1990", "5kg"). We filter these out to focus on textual errors.


In [62]:
words_with_digits = Counter()

endings_1 = ['s', 'd', 'p', 'm']
endings_2 = ['th', 'pm', 'am', 'nd', 'rd', 'st', 'ft', 'kg', 'oz', 'km', 'kw', 'lb']
endings_3 = ['hrs']
endings_4 = ['mins']

for word, count in final_counts.most_common():
    digits = [c.isdigit() for c in word]
    if all(digits):
        continue
    if all(digits[:-1]) and word[-1:] in endings_1:
        continue
    if all(digits[:-2]) and word[-2:] in endings_2:
        continue
    if all(digits[:-3]) and word[-3:] in endings_3:
        continue
    if all(digits[:-4]) and word[-4:] in endings_4:
        continue
    if any(digits):
        words_with_digits.update({word : count})
print(len(words_with_digits))

51670


In [63]:
output_path = os.path.join('./', "words_with_digits.txt")
with open(output_path, "w", encoding="utf-8") as f:
    # most_common() already returns (word, count) sorted desc by count
    for word, count in words_with_digits.most_common():
        f.write(f"{count}\t{word}\n")

In [5]:
NER_WORDLIST = set()
NER_WORDLIST_NAME = 'ner_wordlist.pkl' 
DICTIONARY_DATA_DIR = './Dictionary_data/'
with open(DICTIONARY_DATA_DIR + NER_WORDLIST_NAME, 'rb') as file:
    NER_WORDLIST = pickle.load(file)
print("Total unique words loaded:", len(NER_WORDLIST))

Total unique words loaded: 14874


In [72]:
misspelled_without_digits = non_identified_words.difference(words_with_digits, NER_WORDLIST)

In [77]:
'\'' in "beethoven'"

True

In [75]:
print(sorted(list(NER_WORDLIST)[:100]))

['addis', 'alekseyev', 'alperon', 'altarpiece', 'amy', 'arabia', 'armagnac', 'aurand', 'ayman', 'aziz', 'baroque', 'bastian', 'bologna', 'branamour', 'brendan', 'british', 'canberra', 'cavaliers', 'clemens', 'clijsters', 'das', 'dunem', 'dwarka', 'egypt-gaza', 'elbaradei', 'everblades', 'fernet-branca', 'flannery', 'gamboa', 'gan-based', 'guinea', 'gulzar', 'hassell', 'hetty', 'ibb', 'internacional', 'interpol', 'jacky', 'jafar', 'jalalabad', 'jamestown', 'jinan', 'jirikan', 'joaquín', 'johnson-morris', 'kallio', 'karameh', 'khurshid', 'league-nawaz', 'lithuania', 'marcelo', 'marcinelle', 'mbabane', 'mcqueen', 'medicine', 'mirnyi', 'moi', 'monrovia', 'moyo', 'napoli', 'navy', 'nechvatal', 'neverland', 'nozari', 'nuristan', 'oktay', 'oval', 'pacheco', 'palaeocene', 'perrier', 'peugeot', 'printemps', 'pro-moscow', 'razzano', 'reims', 'reno', 'revava', 'rush', 'ryan', 'rødberg', 'sanha', 'scharping', 'selhurst', 'solaiman', 'somaieh', 'southwark', 'spare-time', 'sudeten', 'suyono', 't-nec

In [6]:
from datasets import load_dataset

ds = load_dataset("wiki_bio")

README.md: 0.00B [00:00, ?B/s]

wiki_bio.py: 0.00B [00:00, ?B/s]

RuntimeError: Dataset scripts are no longer supported, but found wiki_bio.py

### 7. OCR Correction Logic
Based on the analysis of the "unknown" words, we define specific regex rules to fix common OCR mistakes.

*   **Aggressive Patches:** Fix systematic character replacements (e.g., `wv` -> `w`).
*   **Common Patches:** Fix specific high-frequency misspelled words (e.g., `thc` -> `the`).

In [None]:
def apply_aggressive_ocr_patches(text):
    """
    Applies regex-based fixes for systematic OCR errors (wv->w, rn->m)
    and high-frequency specific word errors found in the dataset.
    """
    # --- 1. Systematic 'w' and 'm' Fixes ---
    # Fix "wv" or "vw" anywhere (e.g. "twvo", "wvas") -> likely "w"
    text = re.sub(r'wv|vw', 'w', text, flags=re.IGNORECASE)
    # Fix "wl" at start of word (e.g. "wlhich", "wlho") -> likely "w"
    text = re.sub(r'\bwl', 'w', text, flags=re.IGNORECASE)
    # Fix "w" being read as "v" in specific common words (e.g., "vhich" -> "which")
    text = re.sub(r'\bv(hich|ith|hen|here|ho|hose|ould)\b', r'w\1', text, flags=re.IGNORECASE)
    
    # Fix "mn" or "nm" at start (e.g. "mnore") -> likely "m"
    text = re.sub(r'\b[mn]m', 'm', text, flags=re.IGNORECASE)
    # Fix "rn" -> "m" at start (e.g. "rnore" -> "more")
    text = re.sub(r'\brn(?=[aeiou])', 'm', text, flags=re.IGNORECASE)

    # --- 2. 'The' / 'That' variants ---
    # Fixes for "the" (e.g., "tbe", "tke", "tne")
    text = re.sub(r'\bt[bkn]e\b', 'the', text, flags=re.IGNORECASE) 
    text = re.sub(r'\bt[li]he\b', 'the', text, flags=re.IGNORECASE)
    text = re.sub(r'\b[li]he\b', 'the', text, flags=re.IGNORECASE)
    # Fixes for "that", "this", "those" starting with 't' or 'l'
    text = re.sub(r'\bt[li](at|is|ose)\b', r'th\1', text, flags=re.IGNORECASE)

    # --- 3. High-Frequency Specific Fixes (Safe List) ---
    # Common misreads of "and"
    text = re.sub(r'\ba[ni][li]d\b', 'and', text, flags=re.IGNORECASE) 
    # Honorifics (e.g., "Mliss" -> "Miss")
    text = re.sub(r'\bml(iss|rs|r)\b', r'M\1', text, flags=re.IGNORECASE) 
    # Specific common words
    text = re.sub(r'\bMar\.\s', 'Mr. ', text)  # Fixes Mar. 
    text = re.sub(r'\bAlr\.\s', 'Mr. ', text)  # Fixes Alr. 
    text = re.sub(r'\bMlr\.\s', 'Mr. ', text)  # Fixes Mlr. 
    text = re.sub(r'\b[fl]irst\b', 'first', text, flags=re.IGNORECASE)
    text = re.sub(r'\bhiis\b', 'his', text, flags=re.IGNORECASE)
    text = re.sub(r'\balvays\b', 'always', text, flags=re.IGNORECASE)
    text = re.sub(r'\bprogrammc\b', 'programme', text, flags=re.IGNORECASE)
    
    return text

In [90]:
def apply_common_ocr_patches(text):
    # Fix common title misreads
    text = re.sub(r'\bMar\.\s', 'Mr. ', text)
    text = re.sub(r'\bAlr\.\s', 'Mr. ', text) # Saw "Alr." in your output
    
    # Common miss-spellings based on frequence analysis of 
    # unidentified words from the whole dataset.
    text = re.sub(r'\balvays\b', 'always', text)

    text = re.sub(r'\banld\b', 'and', text)
    text = re.sub(r'\bnnd\b', 'and', text)
    text = re.sub(r'\bsnd\b', 'and', text)
    text = re.sub(r'\bahd\b', 'and', text)
    text = re.sub(r'\banv\b', 'and', text)

    text = re.sub(r'\bcen\b', 'been', text)
    text = re.sub(r'\becn\b', 'been', text)

    text = re.sub(r'\bflrst\b', 'first', text)
    text = re.sub(r'\blirst\b', 'first', text)

    text = re.sub(r'\birom\b', 'from', text)
    text = re.sub(r'\bfiom\b', 'from', text)
    text = re.sub(r'\bfronm\b', 'from', text)
    text = re.sub(r'\btrom\b', 'from', text)

    text = re.sub(r'\bhavc\b', 'have', text)

    text = re.sub(r'\bhiis\b', 'his', text)

    text = re.sub(r'\bmladame\b', 'madame', text)
    text = re.sub(r'\bmiadame\b', 'madame', text)

    text = re.sub(r'\bmanv\b', 'many', text)

    text = re.sub(r'\bmav\b', 'may', text)
    text = re.sub(r'\bmnay\b', 'may', text)

    text = re.sub(r'\bmnore\b', 'more', text)
    text = re.sub(r'\bnmore\b', 'more', text)

    text = re.sub(r'\bmnost\b', 'most', text)

    text = re.sub(r'\bmedtner\b', 'mother', text)

    text = re.sub(r'\bmnuch\b', 'much', text)

    text = re.sub(r'\bmlusic\b', 'music', text)
    text = re.sub(r'\bmnsic\b', 'music', text)

    text = re.sub(r'\bonlv\b', 'only', text)

    text = re.sub(r'\bplav\b', 'play', text)

    text = re.sub(r'\bplaved\b', 'played', text)

    text = re.sub(r'\bprogrammc\b', 'programme', text)

    text = re.sub(r'\bsomc\b', 'some', text)

    text = re.sub(r'\bstvle\b', 'style', text)

    text = re.sub(r'\btbat\b', 'that', text)
    text = re.sub(r'\bthlat\b', 'that', text)
    text = re.sub(r'\bthiat\b', 'that', text)
    text = re.sub(r'\btlhat\b', 'that', text)

    text = re.sub(r'\bthc\b', 'the', text)
    text = re.sub(r'\btbe\b', 'the', text)
    text = re.sub(r'\bthle\b', 'the', text)
    text = re.sub(r'\blhe\b', 'the', text)
    text = re.sub(r'\bt\'he\b', 'the', text)
    text = re.sub(r'\btthe\b', 'the', text)
    text = re.sub(r'\bthr\b', 'the', text)
    text = re.sub(r'\bfhe\b', 'the', text)

    text = re.sub(r'\bteatro\b', 'theatre', text)
    text = re.sub(r'\btbeatre\b', 'theatre', text)
    text = re.sub(r'\bthcatre\b', 'theatre', text)

    text = re.sub(r'\bthemn\b', 'them', text)
    text = re.sub(r'\bthenm\b', 'them', text)

    text = re.sub(r'\bthel\b', 'they', text)

    text = re.sub(r'\bthoso\b', 'those', text)

    text = re.sub(r'\btimc\b', 'time', text)

    text = re.sub(r'\btvo\b', 'two', text)

    text = re.sub(r'\bvcry\b', 'very', text)

    text = re.sub(r'wv', 'w', text)
    text = re.sub(r'vw', 'w', text)

    text = re.sub(r'\bwvas\b', 'was', text)
    text = re.sub(r'\bwias\b', 'was', text)

    text = re.sub(r'\bwerc\b', 'were', text)

    text = re.sub(r'\bwlhat\b', 'what', text)

    text = re.sub(r'\bvhen\b', 'when', text)
    text = re.sub(r'\bwlhen\b', 'when', text)
    text = re.sub(r'\bwben\b', 'when', text)

    text = re.sub(r'\bwvhich\b', 'which', text)
    text = re.sub(r'\bwbich\b', 'which', text)
    text = re.sub(r'\bwhicb\b', 'which', text)
    text = re.sub(r'\bwhieh\b', 'which', text)
    text = re.sub(r'\bwhichl\b', 'which', text)
    text = re.sub(r'\bwhiclh\b', 'which', text)
    text = re.sub(r'\bwhlich\b', 'which', text)
    text = re.sub(r'\bwhicl\b', 'which', text)
    text = re.sub(r'\bwhiich\b', 'which', text)
    text = re.sub(r'\bwrhich\b', 'which', text)

    text = re.sub(r'\bwlho\b', 'who', text)
    text = re.sub(r'\bvho\b', 'who', text)
    text = re.sub(r'\bwbo\b', 'who', text)
    text = re.sub(r'\bwhlo\b', 'who', text)
    text = re.sub(r'\bwhio\b', 'who', text)

    text = re.sub(r'\bwtih\b', 'with', text)
    text = re.sub(r'\bwvith\b', 'with', text)
    text = re.sub(r'\bvith\b', 'with', text)
    text = re.sub(r'\bwitb\b', 'with', text)
    text = re.sub(r'\bwithl\b', 'with', text)
    text = re.sub(r'\bwitl\b', 'with', text)

    text = re.sub(r'\bvould\b', 'would', text)

    text = re.sub(r'\bvears\b', 'years', text)

    text = re.sub(r'\bvou\b', 'you', text)

    # 690

    return text

### 8. Validating Corrections
Here we test our cleaning functions on the list of unknown words. We count how many "unknown" words are successfully transformed into "known" words (found in `COMPLETE_WORDLIST`) after applying the patches. This gives us a metric for how effective our cleaning is.

In [100]:
print(f"Number of words in the full corpus: {len(final_count_words)}")
misspelled_counter = Counter()
for word in final_count_words:
    cleaned_word = apply_aggressive_ocr_patches(apply_common_ocr_patches(word))
    
    if cleaned_word in COMPLETE_WORDLIST:
        continue
    
    if any([c.isdigit() for c in cleaned_word]):
        continue

    if cleaned_word.endswith("'s"):
        root_word = cleaned_word[:-2]
        if root_word in COMPLETE_WORDLIST:
            continue
    elif cleaned_word.endswith("'"):
        root_word = cleaned_word[:-1]
        if root_word in COMPLETE_WORDLIST:
            continue

    misspelled_counter.update({word : final_counts[word]})

print(f"Number of words in the misspelled set: {len(misspelled_counter.keys())}")

Number of words in the full corpus: 903130
Number of words in the misspelled set: 724390


In [101]:
misspelled_counter.most_common()

[('wigmore', 7649),
 ('mlr', 1912),
 ('mliss', 1889),
 ('thie', 1637),
 ('ths', 1551),
 ('philharmonia', 1419),
 ('ofthe', 1235),
 ('lso', 1141),
 ('oi', 1067),
 ('tle', 1060),
 ('bechstein', 999),
 ('ith', 953),
 ('witlh', 911),
 ('grotrian', 910),
 ('und', 889),
 ('tions', 851),
 ('tlie', 803),
 ('wiu', 799),
 ('tiie', 747),
 ('miiss', 721),
 ('thev', 700),
 ('perfor', 642),
 ('www', 619),
 ('andl', 613),
 ('bave', 604),
 ('craine', 597),
 ('tre', 595),
 ('hich', 580),
 ('saens', 578),
 ('heinemann', 550),
 ('ments', 543),
 ('unwin', 532),
 ('glyndebourne', 527),
 ('favourites', 517),
 ("rachmaninov's", 513),
 ('pettitt', 509),
 ('rachmaninov', 506),
 ('lpo', 488),
 ('withi', 472),
 ('aldwych', 449),
 ('tive', 448),
 ('verv', 443),
 ('santley', 437),
 ('mance', 434),
 ('havo', 428),
 ('wel', 426),
 ('ih', 424),
 ('eroica', 419),
 ('gollancz', 414),
 ('fromn', 390),
 ('tlle', 389),
 ('orches', 387),
 ('thte', 386),
 ('alr', 382),
 ('ture', 368),
 ('farren', 367),
 ('aliss', 366),
 ('i