# 1. Perturbations 
Perturbations are intentional modifications made to data to simulate variations, errors, or noise. They are used to test or improve a model's robustness and adaptability to imperfect, real-world inputs.

## 1.1. Types of Perturbations

Common mistakes humans make while interacting with LLMs, each of which can serve as a perturbation type:
1. **Grammatical Mistakes**: Incorrect tense or prepositions, omitted or misused articles, subject-verb agreement errors, and sentence fragments.
2. **Spelling Errors**: Typos, homophones (words that sound alike but have different meanings), and commonly confused words (e.g., "there" vs. "their").
3. **Unclear Task Definitions**: Vague questions, incomplete thoughts, or requests that lack specific details necessary for clear understanding.
4. **Use of Slang or Informal Language**: Colloquial expressions, regional dialects, internet slang, and abbreviations that may not be universally understood or are too casual for formal contexts.
5. **Overly Complex or Compound Queries**: Questions that contain multiple parts or topics, which may not be clearly delineated.
6. **Non-standard Syntax**: Inversion of the normal word order, overly complex sentences, or sentences that lack the usual structure.
7. **Ambiguity in Queries**: Questions that can be interpreted in multiple ways due to vague wording or lack of context.
8. **Use of Jargon or Highly Technical Language**: Specific to certain fields or interests, which might not be universally understood.
9. **Assumptions of Common Knowledge**: Questions that assume a base level of knowledge or context that the LLM might not have.
10. **Cultural References or Idioms**: Phrases or references that rely on cultural knowledge not shared by all users or the LLM.
11. **Inconsistent Formatting**: Mixed use of date formats, currency, units of measurement, or switching between first, second, and third person.
12. **Omission of Crucial Information**: Leaving out key details necessary to understand the request or assuming the model has access to external information not provided.


## 1.2. Techniques for Introducing Perturbations in Text/Questions/Dataset

1. **Rule-Based Error Injection**: Manually define rules for introducing specific types of errors:
    - *Sentence Scrambling*: Rearrange the order of words or phrases in a sentence to simulate syntactic errors or unclear task definitions.
    - *Homophone Replacement*: Replace words with their homophones to simulate common spelling errors.
    - *Omission Simulation*: Randomly omit crucial words or phrases to simulate the omission of information.
    - *Informal Language Injection*: Introduce slang or informal phrases into sentences to simulate casual language use.
    - etc.
2. **Noise Injection**: Apply random edits to mimic spelling errors, grammatical errors, and vocabulary variation:
    - *Synonym Replacement*: Randomly replace words with their synonyms to simulate the varied vocabulary humans might use. This can also introduce informal language variations.
    - *Character Alteration*: Randomly alter characters to simulate typos.
    - *Word Insertion*: Randomly add words.
    - *Word Deletion*: Randomly delete words.
    - etc.
3. **Back Translation**: Translate the text to another language and then back to the original language. This often introduces grammatical inaccuracies and changes in sentence structure.
4. **Paraphrasing Tools**: Use AI-based paraphrasing tools to rewrite questions in ways that might introduce ambiguity or complexity.
5. **Text Simplification/Complexification**: Use models to either simplify or make the text more complex, aiming to mimic the way different users might phrase the same question.
6. **Synthetic Data Generation**: Leverage generative models to create new questions based on the original ones, introducing errors or variations in the process.
7. **Keyboard Typing Errors**: Introduce errors based on common typing mistakes, considering keyboard layout (e.g., letters often mistyped because of their proximity).

# 2. Implementation in Python
1. **Rule-Based Error Injection**: Implement functions to drop random articles, misuse verbs, and introduce common spelling errors.
2. **Noise Injection**: Create a function to randomly replace characters or words in sentences, mimicking typographical errors.

In [4]:
import random
import re

# Sample questions from a dataset (for demonstration)
sample_questions = [
    "What is the capital of France?",
    "Who wrote the novel '1984'?",
    "How does photosynthesis work?"
]

## 2.1. Rule-Based Error Injection

In [6]:
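# Rule-based error injection: introduce grammatical mistakes and spelling errors
def rule_based_errors(text, drop_prob=0.3):
    # Common spelling mistakes
    spelling_mistakes = {
        'the': ['teh', 'hte', 'thee'],
        'is': ['si', 'iz', 'iss'],
        'does': ['doe', 'doez', 'dose'],
    }

    # Randomly drop articles ('the', 'a', 'an') with probability drop_prob.
    # This runs first because the misspelling step below rewrites every remaining 'the'.
    # The trailing \s* avoids leaving a double space where an article was removed.
    text = re.sub(r'\b(the|a|an)\b\s*',
                  lambda m: '' if random.random() < drop_prob else m.group(0),
                  text, flags=re.IGNORECASE)

    # Introduce spelling mistakes, picking a fresh variant for each occurrence
    for correct, mistakes in spelling_mistakes.items():
        text = re.sub(r'\b' + correct + r'\b',
                      lambda m, mistakes=mistakes: random.choice(mistakes),
                      text, flags=re.IGNORECASE)

    return text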
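
Two of the other rule-based techniques from Section 1.2, homophone replacement and sentence scrambling, could be sketched along the following lines. This is a minimal illustration, not a definitive implementation: the `HOMOPHONES` table, the function names, and the `prob` and `rng` parameters are all assumptions made for this example.

```python
import random
import re

# Illustrative homophone pairs (an assumption); a real table would be much larger.
HOMOPHONES = {
    "their": "there",
    "there": "their",
    "to": "too",
    "write": "right",
    "wrote": "rote",
    "capital": "capitol",
}

def homophone_replacement(text, prob=1.0, rng=random):
    """Swap known words for a homophone with probability `prob`."""
    def swap(match):
        word = match.group(0)
        replacement = HOMOPHONES.get(word.lower())
        if replacement is None or rng.random() >= prob:
            return word
        return replacement
    return re.sub(r"[A-Za-z]+", swap, text)

def scramble_sentence(text, rng=random):
    """Shuffle word order while keeping trailing punctuation at the end."""
    match = re.match(r"^(.*?)([.?!]*)$", text.strip(), flags=re.DOTALL)
    body, tail = match.group(1), match.group(2)
    words = body.split()
    rng.shuffle(words)
    return " ".join(words) + tail

print(homophone_replacement("What is the capital of France?"))
print(scramble_sentence("How does photosynthesis work?"))
```

Passing a seeded `random.Random` instance as `rng` makes the perturbations reproducible, which helps when comparing model behavior across runs.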

## 2.2. Noise Injection

In [8]:
# Noise injection: Randomly alter characters (simulate typos)
def noise_injection(text, error_rate=0.1):
    letters = 'abcdefghijklmnopqrstuvwxyz'
    new_text = ''
    for char in text:
        if random.random() < error_rate and char.isalpha():
            # Simulate typo by replacing the character with a random letter
            new_text += random.choice(letters)
        else:
            new_text += char
    return new_text
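
The function above only substitutes characters. The word-level additions and deletions listed under Noise Injection in Section 1.2 could be sketched as follows. The filler vocabulary, the function name, and the probability parameters are hypothetical choices for illustration.

```python
import random

# Hypothetical filler vocabulary used for random insertions.
FILLER_WORDS = ["like", "um", "actually", "basically", "so"]

def word_level_noise(text, insert_prob=0.1, delete_prob=0.1, rng=random):
    """Randomly insert filler words before tokens and randomly delete tokens."""
    noisy = []
    for token in text.split():
        if rng.random() < insert_prob:
            noisy.append(rng.choice(FILLER_WORDS))
        if rng.random() < delete_prob:
            continue  # drop this token
        noisy.append(token)
    return " ".join(noisy)

print(word_level_noise("How does photosynthesis work?", insert_prob=0.3, delete_prob=0.2))
```

Because tokens are split on whitespace, punctuation attached to a deleted word is removed with it. A tokenizer-based variant would be needed to treat punctuation separately.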


In [9]:
# Apply perturbations to sample questions
perturbed_questions = [noise_injection(rule_based_errors(q)) for q in sample_questions]
perturbed_questions

['What iss hte capital of France?',
 "Who frete teh novel '1984'?",
 'How dose protosynthesis work?']

## 2.3. Back Translation

In [5]:
# TODO: For back translation, try widely spoken pivot languages such as Hindi, Marathi, Bengali, Mandarin, Spanish, French, Arabic, Russian, Portuguese, German, and Japanese.

In [None]:
from transformers import pipeline

# Sample sentence from SQuAD
sentence = "What is the capital of France?"

# Define translation pipelines.
# Assumption: the Helsinki-NLP OPUS-MT checkpoints below are one workable choice.
# Models are named explicitly because a task string such as 'translation_fr_to_en'
# may not resolve to a default model.
translate_to_french = pipeline('translation', model='Helsinki-NLP/opus-mt-en-fr')
translate_back_to_english = pipeline('translation', model='Helsinki-NLP/opus-mt-fr-en')

# Perform back-translation
translated_to_french = translate_to_french(sentence)[0]['translation_text']
back_translated_to_english = translate_back_to_english(translated_to_french)[0]['translation_text']

print("Original:", sentence)
print("Back Translated:", back_translated_to_english)
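
To compare several pivot languages, the round trip can be wrapped in a small helper that accepts any pair of translation callables. The sketch below is an assumption about how this might be organized: `back_translate_variants` and the stub translators are hypothetical, and the stubs only stand in for real models so that the example runs without downloads.

```python
def back_translate_variants(text, pivots):
    """Return one back-translated variant per pivot language.

    `pivots` maps a pivot-language label to a (forward, backward) pair of
    callables, each taking a string and returning a string.
    """
    return {
        label: backward(forward(text))
        for label, (forward, backward) in pivots.items()
    }

# Stub translators standing in for real models (purely illustrative).
def toy_forward(s):
    return s.replace("capital", "capitale")

def toy_backward(s):
    return s.replace("capitale", "main city")

variants = back_translate_variants(
    "What is the capital of France?",
    {"toy-fr": (toy_forward, toy_backward)},
)
print(variants)
```

A Hugging Face translation pipeline `pipe` can be adapted to this interface with `lambda s: pipe(s)[0]['translation_text']`, matching the result format used in the cell above.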

## 2.4. Keyboard Typing Errors

In [35]:
# TODO: For keyboard typing errors, create a 5x14 matrix of the characters on the keyboard, arranged according to their position.
# Then, to create a typo, replace a character with one of its adjacent characters on the keyboard.
import numpy as np

keyboard = np.array([
    list('`1234567890-= '),
    list('\tqwertyuiop[]\\'),
    list('  asdfghjkl;\' '),
    list('  zxcvbnm,./  '),
    list('              ')
])
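
The TODO above could be completed roughly as follows. This is a sketch under stated assumptions: it uses the same layout as the matrix above but stores it as plain strings, omits the empty bottom row, and treats only the keys directly left, right, above, and below as adjacent, which ignores the physical stagger between keyboard rows. The names `KEYBOARD_ROWS`, `build_adjacency`, and `keyboard_typos` are hypothetical.

```python
import random

# Same layout as the matrix above; whitespace cells are padding, not keys.
KEYBOARD_ROWS = [
    "`1234567890-= ",
    "\tqwertyuiop[]\\",
    "  asdfghjkl;' ",
    "  zxcvbnm,./  ",
]

def build_adjacency(rows):
    """Map each non-whitespace key to the keys directly left, right, above, and below it."""
    adjacency = {}
    for r, row in enumerate(rows):
        for c, key in enumerate(row):
            if key.isspace():
                continue
            neighbors = []
            for dr, dc in ((0, -1), (0, 1), (-1, 0), (1, 0)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < len(rows) and 0 <= cc < len(rows[rr]):
                    other = rows[rr][cc]
                    if not other.isspace():
                        neighbors.append(other)
            adjacency[key] = neighbors
    return adjacency

ADJACENT_KEYS = build_adjacency(KEYBOARD_ROWS)

def keyboard_typos(text, error_rate=0.1, rng=random):
    """Replace characters with a neighboring key with probability `error_rate`."""
    result = []
    for char in text:
        neighbors = ADJACENT_KEYS.get(char.lower())
        if neighbors and rng.random() < error_rate:
            typo = rng.choice(neighbors)
            # Preserve the case of the original character where possible.
            result.append(typo.upper() if char.isupper() else typo)
        else:
            result.append(char)
    return "".join(result)

print(keyboard_typos("What is the capital of France?", error_rate=0.2))
```

Unlike `noise_injection`, which substitutes any random letter, this restricts substitutions to physically nearby keys, so the resulting typos look closer to real slips of the finger.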


In [11]:
# TODO: For paraphrasing, can we imitate a particular writer's style?
# How would each of the following authors paraphrase the given sentence?
# 1. Paulo Coelho
# 2. Stephen King
# 3. Agatha Christie
# 4. J.K. Rowling