# **auto-INcorrect**
The evil twin of your existing autocorrect model: a "spell-checker-in-reverse" that, instead of fixing mistakes, artfully creates them. This isn't just random chaos; it's a system that mimics how humans actually mess up, from keyboard-adjacent "fat-finger" slips to those "brain fart" moments where you type "your" instead of "you're." It's a way to prove we understand language so well that we can break it on purpose.

Beyond the laughs, this is a practical way to test autocorrect models. With auto-INcorrect you can take "correct" text you already have, generate large amounts of "wrong" text from it, and use both to train or test an autocorrect model. The idea is still in its infancy, and there is a lot of scope for creative modifications using deep learning systems. Personally, I have used this in some of my NLP projects and found it quite useful.

# The "Fat Finger" Module
This module simulates keyboard errors. Its foundation is a dictionary that maps each letter to its neighbors on a standard QWERTY keyboard.

In [45]:
import random

# A dictionary mapping each key to its adjacent keys on a QWERTY keyboard
KEY_NEIGHBORS = {
    'q': ['w', 'a', 's','1'],
    'w': ['q', 'e', 'a', 's', 'd','2'],
    'e': ['w', 'r', 's', 'd', 'f','3'],
    'r': ['e', 't', 'd', 'f', 'g','4'],
    't': ['r', 'y', 'f', 'g', 'h','5'],
    'y': ['t', 'u', 'g', 'h', 'j','6'],
    'u': ['y', 'i', 'h', 'j', 'k','7'],
    'i': ['u', 'o', 'j', 'k', 'l','8'],
    'o': ['i', 'p', 'k', 'l','9'],
    'p': ['o', 'l','0'],

    'a': ['q', 'w', 's', 'z', 'x'],
    's': ['q', 'w', 'e', 'a', 'd', 'z', 'x', 'c'],
    'd': ['w', 'e', 'r', 's', 'f', 'x', 'c', 'v'],
    'f': ['e', 'r', 't', 'd', 'g', 'c', 'v', 'b'],
    'g': ['r', 't', 'y', 'f', 'h', 'v', 'b', 'n'],
    'h': ['t', 'y', 'u', 'g', 'j', 'b', 'n', 'm'],
    'j': ['y', 'u', 'i', 'h', 'k', 'n', 'm'],
    'k': ['u', 'i', 'o', 'j', 'l', 'm'],
    'l': ['i', 'o', 'p', 'k'],

    'z': ['a', 's', 'x'],
    'x': ['a', 's', 'd', 'z', 'c'],
    'c': ['s', 'd', 'f', 'x', 'v'],
    'v': ['d', 'f', 'g', 'c', 'b'],
    'b': ['f', 'g', 'h', 'v', 'n'],
    'n': ['g', 'h', 'j', 'b', 'm'],
    'm': ['h', 'j', 'k', 'n'],

}

In [46]:
def adjacent_swap(word):
  """
  Replaces a random character in a word with a keyboard-adjacent character.
  """
  # Nothing to swap in an empty string
  if len(word) < 1:
    return word

  # Pick a random index in the word
  char_index = random.randint(0, len(word) - 1)
  char_to_swap = word[char_index].lower() # Use lowercase for a clean dict lookup

  # Check if the character is one we have neighbors for
  if char_to_swap in KEY_NEIGHBORS:
    # Pick a random neighbor
    neighbor = random.choice(KEY_NEIGHBORS[char_to_swap])

    # Rebuild the word: (part before) + (new char) + (part after)
    # Match the case of the character being replaced
    if word[char_index].isupper():
        neighbor = neighbor.upper()

    new_word = word[:char_index] + neighbor + word[char_index + 1:]
    return new_word

  # If the character wasn't in our map (like '!' or ','), just return the original word
  return word

testing

In [47]:
print("Testing 'standard':")
for _ in range(5):
  print(f"  -> {adjacent_swap('standard')}")

print("\nTesting 'example':")
for _ in range(5):
  print(f"  -> {adjacent_swap('example')}")

print("\nTesting 'Python':")
for _ in range(5):
  print(f"  -> {adjacent_swap('Python')}")

Testing 'standard':
  -> stqndard
  -> standare
  -> standqrd
  -> stqndard
  -> standafd

Testing 'example':
  -> rxample
  -> examole
  -> dxample
  -> exampke
  -> exqmple

Testing 'Python':
  -> Pgthon
  -> Pytbon
  -> Pythln
  -> Pythom
  -> Pythpn


In [48]:
def transpose(word):
  """
  Swaps two random adjacent characters in a word.
  """
  # Can't transpose if the word is 1 or 0 chars long
  if len(word) < 2:
    return word

  # Pick a random index to be the *first* char of the swap
  # We -2 because we need to be able to select its neighbor (index + 1)
  # So for "word" (len 4), we can pick index 0 ('w'), 1 ('o'), or 2 ('r')
  char_index = random.randint(0, len(word) - 2)

  # Rebuild the word with the two chars swapped
  new_word = (
      word[:char_index] +         # Part before the swap
      word[char_index + 1] +    # The second char
      word[char_index] +        # The first char
      word[char_index + 2:]     # The part after
  )

  return new_word

testing

In [49]:
print("Testing 'standard':")
for _ in range(5):
  print(f"  -> {transpose('standard')}")

print("\nTesting 'example':")
for _ in range(5):
  print(f"  -> {transpose('example')}")

print("\nTesting 'Python':")
for _ in range(5):
  print(f"  -> {transpose('Python')}")

Testing 'standard':
  -> stnadard
  -> stnadard
  -> standrad
  -> tsandard
  -> stanadrd

Testing 'example':
  -> xeample
  -> examlpe
  -> examlpe
  -> xeample
  -> xeample

Testing 'Python':
  -> Pytohn
  -> Pythno
  -> Pythno
  -> Pyhton
  -> yPthon


In [50]:
def delete_char(word):
  """
  Deletes a random character from a word.
  """
  # Can't delete if the word is 1 char or less
  if len(word) < 2:
    return word

  # Pick a random index to delete
  char_index = random.randint(0, len(word) - 1)

  # Rebuild the word without that character
  new_word = word[:char_index] + word[char_index + 1:]

  return new_word

testing

In [51]:
print("Testing 'standard':")
for _ in range(5):
  print(f"  -> {delete_char('standard')}")

print("\nTesting 'example':")
for _ in range(5):
  print(f"  -> {delete_char('example')}")

Testing 'standard':
  -> stadard
  -> standad
  -> stndard
  -> sandard
  -> tandard

Testing 'example':
  -> exaple
  -> exaple
  -> exaple
  -> exampe
  -> exmple


In [52]:
def insert_char(word):
  """
  Inserts a random keyboard-adjacent character into a word.
  """
  if len(word) < 1:
    return word

  # Pick a random index *in* the word to insert *next to*
  # (Including the end of the word)
  insert_index = random.randint(0, len(word) - 1)
  char_to_neighbor = word[insert_index].lower()

  # Find a neighbor to insert
  if char_to_neighbor in KEY_NEIGHBORS:
    neighbor = random.choice(KEY_NEIGHBORS[char_to_neighbor])

    # Preserve case if the original char was uppercase
    if word[insert_index].isupper():
        neighbor = neighbor.upper()

    # Rebuild the word
    # We'll insert *after* the chosen character
    new_word = word[:insert_index + 1] + neighbor + word[insert_index + 1:]
    return new_word

  # If we can't find a neighbor (e.g., it's a symbol), just return the word
  return word

testing

In [53]:
print("Testing 'standard':")
for _ in range(5):
  print(f"  -> {insert_char('standard')}")

print("\nTesting 'example':")
for _ in range(5):
  print(f"  -> {insert_char('example')}")

Testing 'standard':
  -> standfard
  -> stazndard
  -> stanmdard
  -> sctandard
  -> staqndard

Testing 'example':
  -> example3
  -> efxample
  -> excample
  -> exampole
  -> examplef


In [54]:
def run_fat_finger_module(word):
  """
  Selects one of the four typographical error functions based on
  a weighted probability and applies it to the given word.
  """
  # List of all the error functions we've built
  error_functions = [
      adjacent_swap,
      transpose,
      delete_char,
      insert_char
  ]

  # [swap, transpose, delete, insert]
  weights = [0.4, 0.3, 0.2, 0.1]

  # random.choices returns a list, so we get the first (and only) item [0]
  chosen_function = random.choices(error_functions, weights=weights, k=1)[0]

  # Call the chosen function with the word
  return chosen_function(word)

testing

In [55]:
print("Running the 'Fat Finger' module on 'standard':")
for _ in range(20):
  print(f"  -> {run_fat_finger_module('standard')}")

print("\nRunning the 'Fat Finger' module on 'Python':")
for _ in range(20):
  print(f"  -> {run_fat_finger_module('Python')}")

Running the 'Fat Finger' module on 'standard':
  -> standarx
  -> stanfard
  -> standadr
  -> stawndard
  -> sfandard
  -> stndard
  -> stanhdard
  -> standaxrd
  -> standadr
  -> standadr
  -> stndard
  -> ztandard
  -> stndard
  -> standaed
  -> standrad
  -> atandard
  -> standeard
  -> standarx
  -> standardw
  -> standadd

Running the 'Fat Finger' module on 'Python':
  -> ython
  -> Pthon
  -> Pythn
  -> Pyhton
  -> Pythonh
  -> Pytbon
  -> Pythno
  -> Pytohn
  -> Ptyhon
  -> Pytbon
  -> Pythln
  -> Pyhton
  -> Ptyhon
  -> Pythoj
  -> Pytghon
  -> Pyhhon
  -> Pythno
  -> 0ython
  -> Pythkn
  -> P6thon
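
As a quick sanity check on the weighting, the sketch below samples from `random.choices` with the same weights and counts how often each label comes up. The labels are plain strings standing in for the four functions, and the seed and sample size are arbitrary choices made for illustration.

```python
import random
from collections import Counter

# Labels standing in for the four error functions, in the same order as the weights
labels = ["adjacent_swap", "transpose", "delete_char", "insert_char"]
weights = [0.4, 0.3, 0.2, 0.1]

rng = random.Random(0)  # fixed seed (arbitrary) so the run is repeatable
n_samples = 10_000
counts = Counter(rng.choices(labels, weights=weights, k=n_samples))

# Compare the configured weight with the observed frequency
for label, weight in zip(labels, weights):
    observed = counts[label] / n_samples
    print(f"{label:>14}: expected {weight:.2f}, observed {observed:.3f}")
```

The observed frequencies should land close to the configured weights, which is a cheap way to confirm that the order of the `weights` list matches the order of the function list.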


# The "Brain Fart" Module
This module simulates cognitive mistakes, where the typing is correct but the word choice is wrong. We'll start with the most common one: homophone replacement.

This is when someone uses "their" instead of "there," or "it's" instead of "its."

In [56]:
# A dictionary mapping common homophones.
# We'll store them in sets for easy lookup and manipulation.
HOMOPHONES = {
    'group_1': {'their', 'there', 'they\'re'},
    'group_2': {'your', 'you\'re'},
    'group_3': {'its', 'it\'s'},
    'group_4': {'to', 'too', 'two'},
    'group_5': {'affect', 'effect'},
    'group_6': {'weather', 'whether'},
    'group_7': {'peace', 'piece'},
    'group_8': {'break', 'brake'},
    'group_9': {'buy', 'by', 'bye'},
}

# We also need a fast way to look up *any* of these words
# This "flattens" the dictionary for quick checks.
HOMOPHONE_MAP = {}
for group_set in HOMOPHONES.values():
    for word in group_set:
        # We store the *other* words in the set as the replacement options
        replacements = group_set - {word}
        HOMOPHONE_MAP[word] = list(replacements)


def replace_homophone(word):
  """
  Replaces a word with a random homophone, if one is found.
  Preserves the original capitalization.
  """
  word_lower = word.lower()

  if word_lower in HOMOPHONE_MAP:
    # Get the list of possible replacements
    replacements = HOMOPHONE_MAP[word_lower]

    # Pick one random replacement
    new_word = random.choice(replacements)

    # Preserve capitalization
    if word.isupper():
      return new_word.upper()
    elif word[0].isupper(): # First letter is capitalized (istitle() is False for "It's")
      return new_word.capitalize()
    else:
      return new_word

  # If no homophone was found, return the original word
  return word

testing

In [57]:
print(f"'their'  -> {replace_homophone('their')}")
print(f"'There'  -> {replace_homophone('There')}")
print(f"'YOU'RE'  -> {replace_homophone('YOU\'RE')}")
print(f"'Its'    -> {replace_homophone('Its')}")
print(f"'to'     -> {replace_homophone('to')}")
print(f"'hello'  -> {replace_homophone('hello')}")

'their'  -> they're
'There'  -> They're
'YOU'RE'  -> YOUR
'Its'    -> It's
'to'     -> two
'hello'  -> hello
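
One subtlety in capitalization checks: `str.istitle()` treats an apostrophe as a word boundary, so contractions such as "It's" are not considered title-cased. The sketch below compares it with a plain first-letter check; `starts_upper` is a hypothetical helper introduced here for illustration, not part of the notebook.

```python
def starts_upper(word):
    """Return True if the first character of the word is uppercase."""
    return word[:1].isupper()

# Compare the two checks on plain words and contractions
for word in ["There", "It's", "You're", "its"]:
    print(f"{word!r:>9}: istitle={word.istitle()!s:<5} starts_upper={starts_upper(word)}")
```

A capitalized contraction passed through a check that relies on `istitle()` would therefore come back in lowercase.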


In [23]:
import nltk
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('punkt_tab')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

This function replaces a word with a related word drawn from WordNet. It accepts a POS tag so that only synsets for the right part of speech are considered.

In [58]:
from nltk.corpus import wordnet
from nltk import pos_tag

# Helper function to convert NLTK's POS tag to a WordNet-compatible tag
def get_wordnet_pos(treebank_tag):
    """
    Converts NLTK's POS tag to a tag WordNet understands.
    """
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN # Default to noun

def replace_morphological(word, pos_tag):
    """
    Replaces a word with a related word.
    accepts a pos_tag to find the right lemma.
    """
    # 1. Convert the tag
    wordnet_pos = get_wordnet_pos(pos_tag)

    # 2. Find all "synsets"
    synsets = wordnet.synsets(word, pos=wordnet_pos)
    if not synsets:
        return word

    # 3. Get all related lemma names
    related_lemmas = set()
    for syn in synsets:
        for lemma in syn.lemmas():
            related_lemmas.add(lemma.name())

    # 4. Filter
    original_lower = word.lower()
    replacements = [
        lemma for lemma in related_lemmas
        if lemma.lower() != original_lower and '_' not in lemma
    ]

    if replacements:
        # 5. Pick a replacement and preserve case
        new_word = random.choice(replacements)
        if word.isupper():
            return new_word.upper()
        elif word.istitle():
            return new_word.capitalize()
        else:
            return new_word

    return word

In [59]:
def run_brain_fart_module(word, pos_tag):
    """
    Selects one of the two cognitive error functions.
    """
    error_functions = [
        replace_homophone,
        replace_morphological
    ]
    weights = [0.7, 0.3]

    chosen_function = random.choices(error_functions, weights=weights, k=1)[0]

    # Pass the tag to the "smart" function if it's chosen
    if chosen_function == replace_morphological:
        return chosen_function(word, pos_tag) # Pass the tag
    else:
        return chosen_function(word) # Homophone doesn't need it

# The "Malapropism" Module
The "Malapropism" module finds the root of a word (like 'banking' -> 'bank') and looks up semantically similar words in a pretrained word-embedding model. It then replaces the original with a neighbor that is both semantically related *and* starts with the same root (like 'banker'), creating a plausible "near-miss" error.
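
To make the filtering step concrete without downloading an embedding model, here is a minimal sketch that uses a hand-written neighbor list in place of real nearest-neighbor results. The `TOY_NEIGHBORS` table and the `pick_root_sharing` helper are invented for illustration; in the actual module the neighbors come from the model.

```python
import random

# Hand-written stand-in for embedding nearest neighbors (invented data, not model output)
TOY_NEIGHBORS = {
    "bank": ["banks", "banker", "financial", "lender", "banking"],
}

def pick_root_sharing(word, lemma, rng=random):
    """Pick a neighbor of `lemma` that starts with the lemma and differs from `word`."""
    candidates = [
        neighbor for neighbor in TOY_NEIGHBORS.get(lemma, [])
        if neighbor.startswith(lemma) and neighbor != word
    ]
    # Fall back to the original word when nothing qualifies
    return rng.choice(candidates) if candidates else word

print(pick_root_sharing("banking", "bank"))  # one of: banks, banker
print(pick_root_sharing("river", "river"))   # no neighbors listed, so unchanged
```

Semantically close words that do not share the prefix ("financial", "lender") are filtered out, which is what keeps the error looking like a slip rather than a deliberate synonym.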

In [26]:
!pip install gensim

Collecting gensim
  Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.metadata (8.4 kB)
Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (27.9 MB)
Installing collected packages: gensim
Successfully installed gensim-4.4.0


In [27]:
import gensim.downloader

print("Loading GloVe")
try:
    glove_model = gensim.downloader.load('glove-wiki-gigaword-100')
    print("Model loaded successfully.")
except Exception as e:
    print(f"Error loading model: {e}")

# 'most_similar' is the k-NN function we'll be using.
if 'glove_model' in globals():
    try:
        print("\nTesting model with 'king':")
        print(glove_model.most_similar('king'))
    except KeyError:
        print("\n'king' not in vocabulary, but model is loaded.")

Loading GloVe
Model loaded successfully.

Testing model with 'king':
[('prince', 0.7682328820228577), ('queen', 0.7507690787315369), ('son', 0.7020888328552246), ('brother', 0.6985775232315063), ('monarch', 0.6977890729904175), ('throne', 0.6919989585876465), ('kingdom', 0.6811409592628479), ('father', 0.6802029013633728), ('emperor', 0.6712858080863953), ('ii', 0.6676074266433716)]


In [60]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def replace_semantic_neighbor(word, pos_tag):
    """
    Replaces a word with a semantically similar, root-sharing word.
    Accepts a pos_tag to find the right lemma.
    """
    word_lower = word.lower()

    # 1. Convert tag and get the *correct* lemma. No more guessing!
    wordnet_pos = get_wordnet_pos(pos_tag)
    lemma = lemmatizer.lemmatize(word_lower, pos=wordnet_pos)

    # 2. Check if the lemma is in our model
    if lemma not in glove_model:
        return word

    try:
        # 3. Get neighbors of the LEMMA
        similar_words = glove_model.most_similar(lemma, topn=10)

        # 4. Filter
        replacements = []
        for w_tuple in similar_words:
            w = w_tuple[0].lower() # Neighbor word
            if w.startswith(lemma) and w != word_lower:
                replacements.append(w)

        if replacements:
            # 5. Pick one and preserve case
            new_word = random.choice(replacements)
            if word.isupper():
                return new_word.upper()
            elif word.istitle():
                return new_word.capitalize()
            else:
                return new_word

        return word

    except Exception as e:
        return word

In [61]:
def run_malapropism_module(word, pos_tag): # Added pos_tag
    """
    Selects one of the semantic error functions.
    """
    # Just call our final function and pass the tag
    return replace_semantic_neighbor(word, pos_tag)

In [None]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

# "Punctuation Corruption" Module

This module is a bit different because it needs to handle two separate jobs:

1. Corrupting existing punctuation (remove/replace).
2. Adding new, wrong punctuation to words.

In [62]:
from nltk.tokenize import word_tokenize
from nltk.tokenize.treebank import TreebankWordDetokenizer
import random

# --- Define our punctuation sets ---
PUNCTUATION_MARKS = {'.', ',', '!', '?', ';', ':'}
REPLACEMENT_PUNCTUATION = [',', '!', '?', ';']

def corrupt_punctuation(text,
                   add_rate=0.1,
                   replace_rate=0.3,
                   remove_rate=0.6,
                   case_corruption_prob=0.5): # <-- Added this parameter
    """
    A standalone function to corrupt *both* punctuation AND case in a text.

    :param text: The input string.
    :param add_rate: Probability of adding punctuation *after* any given word.
    :param replace_rate: Probability of replacing *existing* punctuation.
    :param remove_rate: Probability of removing *existing* punctuation.
    :param case_corruption_prob: Probability of flipping a capital letter to lowercase.
    """

    words = word_tokenize(text)
    corrupted_words = []

    for word in words:

        # --- 1. Logic for tokens that ARE punctuation (e.g., '.') ---
        if word in PUNCTUATION_MARKS:
            # First, check if we remove it
            if random.random() < remove_rate:
                continue # Skip appending it

            # Second, check if we replace it
            elif random.random() < replace_rate:
                new_punct = random.choice(REPLACEMENT_PUNCTUATION)
                while new_punct == word: # Avoid replacing '?' with '?'
                    new_punct = random.choice(REPLACEMENT_PUNCTUATION)
                corrupted_words.append(new_punct)

            # If neither, append the original
            else:
                corrupted_words.append(word)

        # --- 2. Logic for tokens that are WORDS ---
        else:

            # --- START: Added Case Corruption Logic ---
            # Loop through the word and apply random case flips
            new_chars = [
                char.lower() if char.isupper() and random.random() < case_corruption_prob else char
                for char in word
            ]
            corrupted_word = "".join(new_chars)
            # --- END: Added Case Corruption Logic ---


            # First, append the (now possibly case-corrupted) word
            corrupted_words.append(corrupted_word)

            # Second, check if we should ADD punctuation after it
            if random.random() < add_rate:
                corrupted_words.append(random.choice(REPLACEMENT_PUNCTUATION))

    # Re-assemble the sentence
    detokenizer = TreebankWordDetokenizer()
    final_text = detokenizer.detokenize(corrupted_words)

    return final_text
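
Because the removal check runs first and the replacement check only runs when removal did not happen, `replace_rate` acts as a conditional probability. Here is a small sketch of the resulting outcome probabilities for one existing punctuation mark, assuming the two random draws are independent (`punctuation_outcomes` is a hypothetical helper, not part of the notebook):

```python
def punctuation_outcomes(remove_rate, replace_rate):
    """Return the probability of each outcome for one existing punctuation mark."""
    p_remove = remove_rate
    p_replace = (1 - remove_rate) * replace_rate      # only reached if not removed
    p_keep = (1 - remove_rate) * (1 - replace_rate)   # neither removed nor replaced
    return {"remove": p_remove, "replace": p_replace, "keep": p_keep}

# Defaults of corrupt_punctuation: remove_rate=0.6, replace_rate=0.3
for outcome, prob in punctuation_outcomes(0.6, 0.3).items():
    print(f"{outcome:>7}: {prob:.2f}")
```

With the defaults, a mark is removed 60% of the time, replaced 12% of the time, and kept 28% of the time, so the effective replacement rate is well below the nominal 0.3.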

# Final 'word-error' layer

In [63]:
from nltk.tokenize.treebank import TreebankWordDetokenizer
from nltk.tokenize import word_tokenize
from nltk import pos_tag
import random

# --- This is our default "mix" of errors if the user doesn't specify one ---
DEFAULT_DISTRIBUTION = {
    'fat_finger': 0.4,   # 40%
    'brain_fart': 0.3,   # 30%
    'malapropism': 0.3   # 30%
}

# Punctuation is handled separately by corrupt_punctuation; this function only touches words.
def word_error(text, error_rate=0.15, distribution=None):
    """
    Intentionally introduces human-like errors into a text string.
    THIS VERSION ONLY AFFECTS WORDS, NOT PUNCTUATION.

    :param text: The input string.
    :param error_rate: A float (0.0 to 1.0) for the probability
                       that any given word will be corrupted.
    :param distribution: A dict specifying the weights for each
                         error module. e.g., {'fat_finger': 0.7, ...}
    """

    # 1. Set the distribution
    if distribution is None:
        distribution = DEFAULT_DISTRIBUTION

    module_names = list(distribution.keys())
    module_weights = list(distribution.values())

    # 2. Tokenize and get POS tags
    words = word_tokenize(text)
    tagged_words = pos_tag(words)

    corrupted_words = []

    # 3. Loop through each (word, tag)
    for word, tag in tagged_words:

        # 4. Check if it's punctuation. If so, skip it.
        # We'll use a basic check.
        if not word.isalnum(): # isalnum() is False if the token is '.', ',', '!', etc.
            corrupted_words.append(word)
            continue # Go to the next token

        # 5. Decide *if* we should apply an error (to this word)
        if random.random() < error_rate:

            # 6. Decide *which* module to use
            chosen_module = random.choices(module_names, weights=module_weights, k=1)[0]

            # 7. Call the chosen module
            if chosen_module == 'fat_finger':
                corrupted_word = run_fat_finger_module(word)
            elif chosen_module == 'brain_fart':
                corrupted_word = run_brain_fart_module(word, tag)
            elif chosen_module == 'malapropism':
                corrupted_word = run_malapropism_module(word, tag)
            else:
                corrupted_word = word # Failsafe

            corrupted_words.append(corrupted_word)

        else:
            # No error applied, just add the original word
            corrupted_words.append(word)

    # 8. Re-assemble the sentence
    detokenizer = TreebankWordDetokenizer()
    final_text = detokenizer.detokenize(corrupted_words)

    return final_text
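
The `isalnum()` check is stricter than "is this punctuation?": any token containing a non-alphanumeric character is passed through untouched. The sketch below uses a hand-written token list (an assumption about what a tokenizer might produce, not actual `word_tokenize` output) to show which tokens would be eligible for corruption.

```python
# Hand-written tokens for illustration; real tokens come from the tokenizer
tokens = ["It", "'s", "a", "well-known", "point", "guard", ".", "3rd", "e.g."]

# Only purely alphanumeric tokens can be selected for a word error
eligible = [tok for tok in tokens if tok.isalnum()]
skipped = [tok for tok in tokens if not tok.isalnum()]

print("eligible:", eligible)
print("skipped: ", skipped)
```

Hyphenated words, abbreviations with periods, and clitics like `'s` are never corrupted under this rule, which may or may not be what you want for a given test set.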

testing

In [64]:
original_text = "Wardell Stephen Curry II, also known as Steph Curry, is an American professional basketball player for the Golden State Warriors of the National Basketball Association, where he plays as a point guard."

print(f"ORIGINAL:\n{original_text}\n")

# --- Test 1: word_error ONLY (Word errors) ---
print("--- 1. WORD ERRORS ONLY ---")
corrupted_words = word_error(original_text, error_rate=0.3)
print(f"  -> {corrupted_words}\n")


# --- Test 2: corrupt_punctuation ONLY (Punctuation errors) ---
print("--- 2. PUNCTUATION ERRORS ONLY ---")
corrupted_punct = corrupt_punctuation(original_text, add_rate=0.2, remove_rate=0.5, replace_rate=0.5, case_corruption_prob=0.5)
print(f"  -> {corrupted_punct}\n")


# --- Test 3: Chaining BOTH functions ---
print("--- 3. CHAINING BOTH (Word errors THEN Punctuation errors) ---")
# First, apply word errors
output_1 = word_error(original_text, error_rate=0.3)
# Then, apply punctuation errors to that output
output_2 = corrupt_punctuation(output_1, add_rate=0.2, remove_rate=0.5, replace_rate=0.5)
print(f"  -> {output_2}\n")

ORIGINAL:
Wardell Stephen Curry II, also known as Steph Curry, is an American professional basketball player for the Golden State Warriors of the National Basketball Association, where he plays as a point guard.

--- 1. WORD ERRORS ONLY ---
  -> Wardell Stephen Curry II, also known as Steph Curry, being na American professional basketball player for the Goldeg State Warriors of the National Basketball Association, where he player s a point guard.

--- 2. PUNCTUATION ERRORS ONLY ---
  -> wardell Stephen Curry iI also known! as steph Curry, is an; American professional basketball player for the Golden state Warriors of the National Basketball Association where he plays as a point guard,.

--- 3. CHAINING BOTH (Word errors THEN Punctuation errors) ---
  -> Wardell! stephen curry? ii! also, jnown as! steph? Curry, is an American professional, basketball player for the golden state! wsrriors? of fhe; subject basketball Association where he plays an a point? guard



# Auto_incorrect, the final function

In [65]:
def auto_incorrect(input_text,
                        # Parameters for word_error
                        error_rate=0.15,
                        distribution=None,

                        # Parameters for corrupt_punctuation
                        add_rate=0.1,
                        replace_rate=0.3,
                        remove_rate=0.3,
                        case_corruption_prob=0.5):
  """
  Runs the input text through the full corruption chain.
  Allows tweaking all parameters or using the defaults.

  :param input_text: The string to corrupt.
  :param error_rate: (from word_error) Probability of corrupting a word.
  :param distribution: (from word_error) Dict of word error weights.
  :param add_rate: (from corrupt_punctuation) Probability of adding punctuation.
  :param replace_rate: (from corrupt_punctuation) Probability of replacing punctuation.
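  :param remove_rate: (from corrupt_punctuation) Probability of removing punctuation.
  :param case_corruption_prob: (from corrupt_punctuation) Probability of flipping a capital letter to lowercase.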
  """

  # Step 1: Apply word errors, passing the user's (or default) settings
  text_with_word_errors = word_error(
      input_text,
      error_rate=error_rate,
      distribution=distribution
  )

  # Step 2: Apply punctuation errors, passing the user's (or default) settings
  final_corrupted_text = corrupt_punctuation(
      text_with_word_errors,
      add_rate=add_rate,
      replace_rate=replace_rate,
      remove_rate=remove_rate,
      case_corruption_prob=case_corruption_prob
  )

  return final_corrupted_text

# --- How to use it ---

my_text = "I am running a test on my banking application for its creative standards. It's a beautiful day to see if your function works correctly."

print(f"Original Text:\n{my_text}\n")

# --- Example 1: Calling with all default settings ---
# (This will work just like the last function)
default_output = auto_incorrect(my_text)
print(f"Fully Corrupted (Default Settings):\n{default_output}\n")

# --- Example 2: Calling with custom, tweaked settings ---
# Let's make it more aggressive: only 'fat_finger' typos, a higher
# word error rate, and more punctuation removal.

# 1. Define a custom distribution
custom_distribution = {
    'fat_finger': 1.0,
    'brain_fart': 0.0,
    'malapropism': 0.0
}

# 2. Call the function with new values
tweaked_output = auto_incorrect(
    my_text,
    error_rate=0.3,
    distribution=custom_distribution,
    add_rate=0.1,
    remove_rate=0.6,
    replace_rate=0.3,
    case_corruption_prob=0.5
)

print(f"Fully Corrupted (Aggressive Tweaked Settings):\n{tweaked_output}")

Original Text:
I am running a test on my banking application for its creative standards. It's a beautiful day to see if your function works correctly.

Fully Corrupted (Default Settings):
i m running a test on my banking application for its creative standards It's! a beautiful day to see if your function works correctly;

Fully Corrupted (Aggressive Tweaked Settings):
I am running a test on! mu banknig applicaion for its creativ satndards, it's a baeutiful dya to sef! if your function; wkrks correctly


In [67]:
custom_text = input("enter the text to be corrupted: ")

custom_distribution = {
    'fat_finger': 0.5,
    'brain_fart': 0.3,
    'malapropism': 0.2
}

corrupted_custom_text = auto_incorrect(custom_text, error_rate=0.0, distribution=custom_distribution, add_rate=0.1, remove_rate=0.8, replace_rate=0.3, case_corruption_prob=0.3)
# error_rate is the probability that any given word is corrupted; 0.0 turns word errors off
print(f"Original Text:\n{custom_text}\n")
print(f"Corrupted Text:\n{corrupted_custom_text}")

enter the text to be corrupted: The complexity of human language presents a profound challenge for artificial intelligence. It's not merely a structured system of vocabulary; rather, it's a dynamic entity deeply intertwined with context, culture, and subtle intention. Natural Language Processing (NLP) models, often built on vast statistical analysis, strive to comprehend and generate text with human-like fluency. However, a fascinating inverse problem exists: modeling human error. Simulating a typo isn't just random substitution; it involves understanding keyboard layouts, common transpositions, and phonetic similarities. More advanced simulations, such as semantic errors, require a sophisticated grasp of how the human mind retrieves and associates words. This "reverse engineering" of mistakes—purposefully creating plausible incorrectness—is not only a creative exercise but also a powerful method for building and evaluating more robust correction systems. 
Original Text:
The complexity

use in testing
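
One way to use the generator for testing is to corrupt clean sentences, run a corrector over them, and score the result against the original. The sketch below is self-contained and makes several assumptions: `toy_corrupt` and `identity_corrector` are hypothetical stand-ins for `auto_incorrect` and a real autocorrect model, and `difflib.SequenceMatcher` from the standard library is swapped in for the similarity score.

```python
import random
from difflib import SequenceMatcher

def toy_corrupt(text, rng):
    """Stand-in corruptor: delete one random character from the text."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text))
    return text[:i] + text[i + 1:]

def identity_corrector(text):
    """Stand-in for an autocorrect model that changes nothing."""
    return text

def similarity(a, b):
    """Similarity in [0, 100] based on difflib's matching-block ratio."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

def evaluate(corrector, clean_sentences, seed=0):
    """Average similarity between the corrector's output and the clean text."""
    rng = random.Random(seed)  # seeded so the corrupted inputs are repeatable
    scores = []
    for clean in clean_sentences:
        noisy = toy_corrupt(clean, rng)
        scores.append(similarity(clean, corrector(noisy)))
    return sum(scores) / len(scores)

sentences = ["The weather is nice today.", "Their house is over there."]
print(f"Baseline (no correction): {evaluate(identity_corrector, sentences):.1f}")
```

The do-nothing corrector gives a baseline score; a useful autocorrect model should score above it on the same corrupted inputs.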

In [33]:
!pip install fuzzywuzzy
!pip install python-Levenshtein

Collecting fuzzywuzzy
  Downloading fuzzywuzzy-0.18.0-py2.py3-none-any.whl.metadata (4.9 kB)
Downloading fuzzywuzzy-0.18.0-py2.py3-none-any.whl (18 kB)
Installing collected packages: fuzzywuzzy
Successfully installed fuzzywuzzy-0.18.0
Collecting python-Levenshtein
  Downloading python_levenshtein-0.27.1-py3-none-any.whl.metadata (3.7 kB)
Collecting Levenshtein==0.27.1 (from python-Levenshtein)
  Downloading levenshtein-0.27.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.6 kB)
Collecting rapidfuzz<4.0.0,>=3.9.0 (from Levenshtein==0.27.1->python-Levenshtein)
  Downloading rapidfuzz-3.14.1-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (12 kB)
Downloading python_levenshtein-0.27.1-py3-none-any.whl (9.4 kB)
Downloading levenshtein-0.27.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (159 kB)
Downloading ra

In [35]:
from fuzzywuzzy import fuzz

def check_similarity(str1, str2):
  """
  Checks the similarity between two strings using fuzzywuzzy.

  Returns:
    An integer (0-100) representing the similarity score.
  """

  # fuzz.ratio() returns a 0-100 similarity score based on the edit distance between the strings
  similarity_score = fuzz.ratio(str1, str2)
  return similarity_score

str1 = input("enter the correct string" )
str2 = input("enter the correcTED string" )

score = check_similarity(str1, str2)


print(f"Str1: {str1}")
print(f"Str2: {str2}")

print(f"Similarity Score: {score}%")

enter the correct stringThe complexity of human language presents a profound challenge for artificial intelligence. It's not merely a structured system of vocabulary; rather, it's a dynamic entity deeply intertwined with context, culture, and subtle intention. Natural Language Processing (NLP) models, often built on vast statistical analysis, strive to comprehend and generate text with human-like fluency. However, a fascinating inverse problem exists: modeling human error. Simulating a typo isn't just random substitution; it involves understanding keyboard layouts, common transpositions, and phonetic similarities. More advanced simulations, such as semantic errors, require a sophisticated grasp of how the human mind retrieves and associates words. This "reverse engineering" of mistakes—purposefully creating plausible incorrectness—is not only a creative exercise but also a powerful method for building and evaluating more robust correction systems. 
enter the correcTED stringThe complex