# Notebook: OCR error simulation

**How it works:**

OCR error simulation gradually transforms correct text into a form that imitates typical mistakes made by OCR algorithms:
1. Character replacement (manual maps + `homoglyphs`)
2. Swapping adjacent characters
3. Character deletions and insertions
4. Multi-character sequence replacement (e.g. `cl` -> `d`)

In [2]:
import pandas as pd
import random
import re
from pathlib import Path

from homoglyphs import Homoglyphs

## Loading the data

In [69]:
csv_path = Path("../Datasets/BooksDatasetSubset/BooksDataset_subset.csv")
books_df = pd.read_csv(csv_path)
display(books_df.head())


Unnamed: 0,title,authors,category,publisher,description,publish_month,publish_year
0,Goat Brothers,"By Colton, Larry","History , General",Doubleday,,January,1993
1,The Missing Person,"By Grumbach, Doris","Fiction , General",Putnam Pub Group,,March,1981
2,Don't Eat Your Heart Out Cookbook,"By Piscatella, Joseph C.","Cooking , Reference",Workman Pub Co,,September,1983
3,When Your Corporate Umbrella Begins to Leak: A...,"By Davis, Paul D.",,Natl Pr Books,,April,1991
4,Amy Spangler's Breastfeeding : A Parent's Guide,"By Spangler, Amy",,Amy Spangler,,February,1997


## Defining the error simulation functions
**Initializing homoglyphs**

In [3]:
hg = Homoglyphs()

hg.get_combinations('B')

['B',
 'ℬ',
 'Ꞵ',
 'Ｂ',
 '𝐁',
 '𝐵',
 '𝑩',
 '𝓑',
 '𝔅',
 '𝔹',
 '𝕭',
 '𝖡',
 '𝗕',
 '𝘉',
 '𝘽',
 '𝙱',
 '𝚩',
 '𝛣',
 '𝜝',
 '𝝗',
 '𝞑']
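
The variants returned for `B` are mostly Unicode look-alikes such as mathematical, script, and fullwidth letters. As a sketch that is not part of the original notebook and uses only the standard library, `unicodedata` can name a few of them and show which ones fold back to plain ASCII under NFKC normalization:

```python
import unicodedata

# A few of the variants listed above for 'B'.
variants = ['ℬ', 'Ｂ', '𝐁', 'Ꞵ']

for ch in variants:
    folded = unicodedata.normalize('NFKC', ch)
    # Print the code point, the Unicode name (if known), and the NFKC form.
    name = unicodedata.name(ch, 'UNKNOWN')
    print(f"U+{ord(ch):04X}  {name}  ->  {folded!r}")
```

Characters that fold to `'B'` could in principle be undone by a normalization step during cleanup, while look-alikes without a compatibility mapping would survive it.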

**Manual character map**

In [71]:
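# Single-character confusions only: the loop below inspects one character
# at a time, so a multi-character key such as 'rn' could never match here.
# The 'rn' -> 'm' case is covered by sequence_map further down.
char_map = {
    'l': '1', '1': 'l',
    'O': '0', '0': 'O',
    'm': 'rn',
    'a': '@', '@': 'a',
    'e': 'c', 'c': 'e'
}

def ocr_char_replace_manual(text, prob=0.05):
    # Non-string values (NaN, numbers) pass through unchanged.
    if not isinstance(text, str):
        return text
    chars = list(text)
    for i, ch in enumerate(chars):
        if random.random() < prob and ch in char_map:
            chars[i] = char_map[ch]
    return ''.join(chars)
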

**Character replacement with homoglyphs**

In [72]:
def ocr_char_replace_homoglyph(text, prob=0.05):
    if not isinstance(text, str):
        return text
    chars = list(text)
    for i, ch in enumerate(chars):
        if random.random() < prob:
            # get_combinations() also returns the original character
            # (see the 'B' example above), so some picks change nothing.
            alternatives = hg.get_combinations(ch)
            if alternatives:
                chars[i] = random.choice(alternatives)
    return ''.join(chars)
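
Since the list of alternatives contains the input character itself, a random pick sometimes leaves the text unchanged. If every selected position should really change, one option is to filter the original out first. The helper below is a hypothetical sketch (the name `pick_different` and the `rng` parameter are assumptions, not part of the notebook or of `homoglyphs`):

```python
import random

def pick_different(ch, alternatives, rng=random):
    """Return a random alternative that differs from `ch`, or `ch` if none exists."""
    candidates = [alt for alt in alternatives if alt != ch]
    return rng.choice(candidates) if candidates else ch

# A hand-written list stands in for hg.get_combinations('B') here.
print(pick_different('B', ['B', 'ℬ', 'Ｂ']))
```

Using it would raise the effective corruption rate slightly compared with the function above, because no-op picks disappear.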


**Swapping adjacent characters**

In [73]:
def ocr_swap(text, prob=0.03):
    if not isinstance(text, str) or len(text) < 2: return text
    chars = list(text)
    for i in range(len(chars)-1):
        if random.random() < prob:
            chars[i], chars[i+1] = chars[i+1], chars[i]
    return ''.join(chars)
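
Because swapping only reorders characters, the output always has the same length and the same multiset of characters as the input. The sketch below checks this on a copy of the swap logic; the name `swap_adjacent` and the `rng` parameter are added here for repeatability and are not in the notebook:

```python
import random
from collections import Counter

def swap_adjacent(text, prob=0.03, rng=random):
    chars = list(text)
    for i in range(len(chars) - 1):
        if rng.random() < prob:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return ''.join(chars)

rng = random.Random(0)  # private generator, so the run is repeatable
original = "By Colton, Larry"
swapped = swap_adjacent(original, prob=0.3, rng=rng)
print(swapped)

# Swapping never adds or removes characters.
assert len(swapped) == len(original)
assert Counter(swapped) == Counter(original)
```

Note that consecutive swaps can carry one character several positions to the right, since the loop continues from the already swapped state.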


**Deletions and insertions**

In [74]:
def ocr_delete_insert(text, del_prob=0.02, ins_prob=0.02):
    if not isinstance(text, str): return text
    # deletions
    if random.random() < del_prob and len(text) > 3:
        idx = random.randrange(len(text))
        text = text[:idx] + text[idx+1:]
    # insertions
    if random.random() < ins_prob:
        idx = random.randrange(len(text)+1)
        text = text[:idx] + random.choice('abcdefghijklmnopqrstuvwxyz') + text[idx:]
    return text
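
`ocr_delete_insert` makes at most one deletion and one insertion per string, whereas the replacement and swap functions decide separately for every character. If per-character rates were wanted here as well, one possible variant looks like this (the name `delete_insert_per_char` and the `rng` parameter are hypothetical):

```python
import random
import string

def delete_insert_per_char(text, del_prob=0.02, ins_prob=0.02, rng=random):
    out = []
    for ch in text:
        # Keep the character unless the deletion roll succeeds.
        if rng.random() >= del_prob:
            out.append(ch)
        # Optionally insert a random lowercase letter after this position.
        if rng.random() < ins_prob:
            out.append(rng.choice(string.ascii_lowercase))
    return ''.join(out)

print(delete_insert_per_char("Health Communications Inc", 0.1, 0.1, random.Random(1)))
```

With this variant, longer strings accumulate proportionally more errors, which may or may not match the OCR behavior being modeled.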


**Replacing multi-character sequences (e.g. 'cl' -> 'd')**

In [75]:
sequence_map = {
    'cl': 'd',
    'ol': 'd',
    'tl': 'll',
    'vv': 'w',
    'rn': 'm',
    'nn': 'm',
    'ii': 'u',
    'tt': 't',
    'ff': 'f',
    'oo': 'o',
    'aa': 'ä',
    'ss': 's',
    'sh': 's',
    'cj': 'g',
    'ck': 'k',
    'cd': 'd',
    'ri': 'n',
    'rl': 'bl',
    'mc': 'm',
    'nh': 'm',
    'om': 'm',
    'wc': 'w',
    'tr': 't',
    'kn': 'n',
    'np': 'm',
    'ie': 'll',
    'po': 'p',
    'xo': 'xo',  # identity mapping: has no effect as written
    'ur': 'u',
    'ar': 'a'
}

In [76]:
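def ocr_sequence_replace(text, prob=0.02):
    if not isinstance(text, str):
        return text
    for seq, rep in sequence_map.items():
        # With probability `prob`, replace every occurrence of `seq` at once.
        if random.random() < prob:
            # re.escape keeps the key literal even if it contains regex metacharacters.
            text = re.sub(re.escape(seq), rep, text)
    return text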
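
When a key is selected, `ocr_sequence_replace` rewrites all of its occurrences in the string together. A possible alternative, shown only as a sketch, decides independently for each occurrence by passing a callable to `re.sub`. The function name and the `rng` parameter are assumptions made for this example:

```python
import random
import re

def replace_sequences_per_occurrence(text, mapping, prob=0.02, rng=random):
    # Longer keys first, so they win over shorter keys at the same position.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))

    def maybe_replace(match):
        # Roll once per occurrence; keep the original text on a miss.
        if rng.random() < prob:
            return mapping[match.group(0)]
        return match.group(0)

    return pattern.sub(maybe_replace, text)

print(replace_sequences_per_occurrence("modern corner", {"rn": "m"}, prob=1.0))
```

With `prob=1.0` every occurrence is replaced, so the call above prints `modem comer`; lower values leave a random subset untouched.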

**Function combining all transformations**

In [77]:
def simulate_ocr(text):
    text = ocr_char_replace_manual(text)
    text = ocr_char_replace_homoglyph(text)
    text = ocr_swap(text)
    text = ocr_delete_insert(text)
    text = ocr_sequence_replace(text)
    return text
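
All of the steps draw from the global `random` module, so two calls on the same text generally give different results. For repeatable experiments, one approach is to seed the generator around a call and restore its state afterwards. The wrapper `run_seeded` and the toy transform `noisy_upper` below are hypothetical stand-ins, not part of the notebook:

```python
import random

def noisy_upper(text, prob=0.5):
    # Toy random transform standing in for simulate_ocr.
    return "".join(c.upper() if random.random() < prob else c for c in text)

def run_seeded(func, text, seed):
    """Call func(text) with the global RNG seeded, then restore the RNG state."""
    state = random.getstate()
    random.seed(seed)
    try:
        return func(text)
    finally:
        random.setstate(state)

first = run_seeded(noisy_upper, "goat brothers", seed=1)
second = run_seeded(noisy_upper, "goat brothers", seed=1)
print(first, second, first == second)
```

The same seed yields the same corruption, while different seeds give different variants of the same input.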

## Applying errors to columns

In [78]:
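def introduce_ocr_errors(df, cols, error_rate=0.15):
    df = df.copy()
    for col in cols:
        mask = df[col].notnull()
        # random_state=42 fixes which rows are selected on every call;
        # the corruption itself uses the unseeded `random` module.
        idxs = df.loc[mask].sample(frac=error_rate, random_state=42).index
        df.loc[idxs, col] = df.loc[idxs, col].apply(simulate_ocr)
    return df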
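
Every transformation returns non-string input unchanged, so a column holding numbers (as `publish_year` likely does once the CSV is parsed) would pass through `simulate_ocr` untouched. If such values should be corrupted too, they could be converted to text first. The helper `apply_as_text` and the toy transform `one_to_l` below are assumptions for illustration only:

```python
import math

def apply_as_text(value, transform):
    """Apply `transform` to the string form of `value`; leave missing values alone."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return value
    return transform(str(value))

def one_to_l(text):
    # Toy transform: the classic '1' -> 'l' confusion.
    return text.replace('1', 'l')

print(apply_as_text(1993, one_to_l))  # prints l993
```

The result is a string, so a column processed this way would no longer have a numeric dtype.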


In [79]:
cols_to_corrupt = [
    'title',
    'authors',
    'category',
    'publisher',
    'description',
    'publish_month',
    'publish_year'
]


In [80]:
n_versions = 10

for i in range(n_versions):
    books_ocr_i = introduce_ocr_errors(
        books_df,
        cols=cols_to_corrupt,
        error_rate=0.80
    )
    output_path = Path(f"../Datasets/BooksDatasetOCR/BooksDataset_OCR_v{i+1}.csv")
    # Create the output directory if it does not exist yet.
    output_path.parent.mkdir(parents=True, exist_ok=True)
    books_ocr_i.to_csv(output_path, index=False)
    print(f"Saved version with errors: {output_path}")

Saved version with errors: ..\Datasets\BooksDatasetOCR\BooksDataset_OCR_v1.csv
Saved version with errors: ..\Datasets\BooksDatasetOCR\BooksDataset_OCR_v2.csv
Saved version with errors: ..\Datasets\BooksDatasetOCR\BooksDataset_OCR_v3.csv
Saved version with errors: ..\Datasets\BooksDatasetOCR\BooksDataset_OCR_v4.csv
Saved version with errors: ..\Datasets\BooksDatasetOCR\BooksDataset_OCR_v5.csv
Saved version with errors: ..\Datasets\BooksDatasetOCR\BooksDataset_OCR_v6.csv
Saved version with errors: ..\Datasets\BooksDatasetOCR\BooksDataset_OCR_v7.csv
Saved version with errors: ..\Datasets\BooksDatasetOCR\BooksDataset_OCR_v8.csv
Saved version with errors: ..\Datasets\BooksDatasetOCR\BooksDataset_OCR_v9.csv
Saved version with errors: ..\Datasets\BooksDatasetOCR\BooksDataset_OCR_v10.csv


**Before the replacements**

In [81]:
display(books_df.head(10))

Unnamed: 0,title,authors,category,publisher,description,publish_month,publish_year
0,Goat Brothers,"By Colton, Larry","History , General",Doubleday,,January,1993
1,The Missing Person,"By Grumbach, Doris","Fiction , General",Putnam Pub Group,,March,1981
2,Don't Eat Your Heart Out Cookbook,"By Piscatella, Joseph C.","Cooking , Reference",Workman Pub Co,,September,1983
3,When Your Corporate Umbrella Begins to Leak: A...,"By Davis, Paul D.",,Natl Pr Books,,April,1991
4,Amy Spangler's Breastfeeding : A Parent's Guide,"By Spangler, Amy",,Amy Spangler,,February,1997
5,The Foundation of Leadership: Enduring Princip...,"By Short, Bo",,Excalibur Press,,January,1997
6,Chicken Soup for the Soul: 101 Stories to Open...,"By Canfield, Jack (COM) and Hansen, Mark Victo...","Self-help , Personal Growth , Self-Esteem",Health Communications Inc,,May,1993
7,Journey Through Heartsongs,"By Stepanek, Mattie J. T.","Poetry , General",VSP Books,Collects poems written by the eleven-year-old ...,September,2001
8,In Search of Melancholy Baby,"By Aksyonov, Vassily, Heim, Michael Henry, and...","Biography & Autobiography , General",Random House,The Russian author offers an affectionate chro...,June,1987
9,Christmas Cookies,"By Eakin, Katherine M. and Deaman, Joane (EDT)","Cooking , General",Oxmoor House,,June,1986


**After the replacements**

In [82]:
display(books_ocr_i.head(10))

Unnamed: 0,title,authors,category,publisher,description,publish_month,publish_year
0,Goat Bro𝒽ters,"By Colotn, Larry","History , 𝐆eneral",Doub1eday,,January,1993
1,The 𝜧issing Person,"By Gr𝘶mbach, Doris","Fiction , G𝗲neral",Putnam Pub Group,,March,1981
2,Don't Eat Your Heart Out Cookbook,"By Piscatella, Joseph C.","Cooking , Reference",Workman Pub Co,,September,1983
3,When Your Corportac Umbrella Begins to Leak: Ａ...,"By Davis, Pau𝘭 D𝅭",,Natl Ppr Boᴏks,,April,1991
4,Amy S𝐩angIer's Breastfeeding : A Paent's Guide,By 𝐒pangler‚ Amy,,Amy Spa𝑛gler,,February,1997
5,The Foundation of Leadership: Enduring Princip...,"By Short, Bo",,Excalibur Press,,January,1997
6,Chicken S𝔬up for t𝚑e Soul: 101 Stℴries to Open...,"By 𝒞anfield, Jack (COM) an dHansen, 𝔐ark 𝔙i𝘤to...","Self-he1p , Personal Growth , Self-Esteem",Hcalth Communic@tions I𝑛c,,May,1993
7,Jour𝖓ey Thro𝝊gh Heatsongs,"By Stepanek, Mattie J. T.","Poetry , General",VSP Books,Collects poems written by the eleven-year-old ...,Septcmber,2001
8,In Search oꬵ Mel@nch𝞂ly Baby,"By Aksyonov, Vassily, Heim, Miehael Henry, and...","Biography & Autobiography , General",Random House,The Russian author offers an 𝑎ffectionate chro...,June,1987
9,Christm@s Cookies,"By Eakin ,Katheri𝐧e M. and Deaman, Joane (EDT)","Cooki𝙣g , Gencral",Oxmoor House,,June,1986
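
To put a number on the differences visible between the two tables, one common choice is the character error rate: edit distance divided by the length of the reference string. The notebook does not compute this; the following is a minimal standard-library sketch with hypothetical function names:

```python
def levenshtein(a, b):
    """Edit distance counting insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def char_error_rate(reference, corrupted):
    # Guard against division by zero for empty references.
    return levenshtein(reference, corrupted) / max(len(reference), 1)

# Pairs taken from row 0 of the two tables above.
print(levenshtein("Doubleday", "Doub1eday"))                # prints 1
print(levenshtein("By Colton, Larry", "By Colotn, Larry"))  # prints 2
```

An adjacent swap counts as two edits under plain Levenshtein distance, which is why the second pair scores 2.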


## Saving the results

In [None]:
s

Saved simulated OCR errors: ..\Datasets\BooksDatasetSubset\BooksDataset_OCR.csv
