# 1. Perturbations 
Perturbations are intentional modifications made to data to simulate variations, errors, or noise, aiming to test or improve the robustness and adaptability of models to real-world, imperfect inputs.

## 1.1. Types of Perturbations

Mistakes Humans Make While Interacting with LLMs:
1. **Grammatical Mistakes**: Incorrect use of tense, prepositions, article omission or misuse, verb agreement errors, and sentence fragment errors.
2. **Spelling Errors**: Typos, homophones (words that sound alike but have different meanings), and commonly confused words (e.g., "there" vs. "their").
3. **Unclear Task Definitions**: Vague questions, incomplete thoughts, or requests that lack specific details necessary for clear understanding.
4. **Use of Slang or Informal Language**: Colloquial expressions, regional dialects, internet slang, and abbreviations that may not be universally understood or are too casual for formal contexts.
5. **Non-standard Syntax**: Inversion of the normal word order, overly complex sentences, or sentences that lack the usual structure.
6. **Overly Complex or Compound Queries**: Questions that contain multiple parts or topics, which may not be clearly delineated.
7. **Ambiguity in Queries**: Questions that can be interpreted in multiple ways due to vague wording or lack of context.
8. **Use of Jargon or Highly Technical Language**: Specific to certain fields or interests, which might not be universally understood.
9. **Assumptions of Common Knowledge**: Questions that assume a base level of knowledge or context that the LLM might not have.
10. **Cultural References or Idioms**: Phrases or references that rely on cultural knowledge not shared by all users or the LLM.
11. **Inconsistent Formatting**: Mixed use of date formats, currency, units of measurement, or switching between first, second, and third person.
12. **Omission of Crucial Information**: Leaving out key details necessary to understand the request or assuming the model has access to external information not provided.

In [None]:
# TODO: (Optional) Find resources 


## 1.2. Techniques for Introducing Perturbations in Text/Questions/Dataset

1. **Rule-Based Error Injection**: Manually define rules for introducing specific types of errors:
    - *Sentence Scrambling*: Rearrange the order of words or phrases in a sentence to simulate syntactic errors or unclear task definitions.
    - *Homophone Replacement*: Replace words with their homophones to simulate common spelling errors.
    - *Omission Simulation*: Randomly omit crucial words or phrases to simulate the omission of information.
    - *Informal Language Injection*: Introduce slang or informal phrases into sentences to simulate casual language use.
    - etc.
2. **Noise Injection**: To mimic spelling and grammatical errors:
    - *Synonym Replacement*: Randomly replace words with their synonyms to simulate the varied vocabulary humans might use. This can also introduce informal language variations.
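    - *Character Alteration*: Randomly insert, delete, or substitute characters.
    - *Word Addition*: Randomly add words.
    - *Word Deletion*: Randomly delete words.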
    - etc.
3. **Back Translation**: Translate the text to another language and then back to the original language. This often introduces grammatical inaccuracies and changes in sentence structure.
4. **Paraphrasing Tools**: Use AI-based paraphrasing tools to rewrite questions in ways that might introduce ambiguity or complexity.
5. **Text Simplification/Complexification**: Use models to either simplify or make the text more complex, aiming to mimic the way different users might phrase the same question.
6. **Synthetic Data Generation**: Leverage generative models to create new questions based on the original ones, introducing errors or variations in the process.
7. **Keyboard Typing Errors**: Introduce errors based on common typing mistakes, considering keyboard layout (e.g., letters often mistyped because of their proximity).

# 2. Implementation in Python
1. **Rule-Based Error Injection**: Implement functions to drop random articles, misuse verbs, and introduce common spelling errors.
2. **Noise Injection**: Create a function to randomly replace characters or words in sentences, mimicking typographical errors.

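The class below does not cover the "drop random articles" rule mentioned in item 1. A minimal sketch of that rule follows, assuming whitespace tokenization and only the articles "a", "an", and "the". The names `ARTICLES` and `drop_articles` and the parameters `p` and `rng` are hypothetical and not part of the original notebook.

```python
import random

# Assumption: only these three English articles are considered
ARTICLES = {"a", "an", "the"}


def drop_articles(sentence, p=1.0, rng=None):
    """Drop each article in the sentence with probability p."""
    rng = rng or random.Random()
    kept = []
    for word in sentence.split():
        if word.lower() in ARTICLES and rng.random() < p:
            continue  # simulate an omitted article
        kept.append(word)
    return " ".join(kept)


# With p=1.0 every article is removed
print(drop_articles("What was the former name of the Tampa Bay Storm?"))
# -> What was former name of Tampa Bay Storm?
```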
In [1]:
import random
import re

import nltk
from nltk.corpus import wordnet

# Ensure that NLTK resources are downloaded (if necessary)
# nltk.download('wordnet')
# nltk.download('omw-1.4')

In [2]:
class Perturbation:
    def __init__(self, question):
        self.question = question
        self.homophones = {
            "capital": "capitol",
            "sea": "see",
            "right": "write",
            # Not a true homophone pair; the key is lowercase because lookups use word.lower()
            "france": "french",
            # TODO: Find a corpus of homophones from NLTK or other sources
        } 
    
    #### 2.1. Rule-Based Error Injection ####
    def sentence_scrambling(self):
        """Sentence Scrambling"""
        words = self.question.split()
        random.shuffle(words)
        return ' '.join(words)
    
    def homophone_replacement(self):
        """Replace words with their homophones (updates self.question)"""
        words = self.question.split()
        new_words = []
        for word in words:
            # Split off trailing punctuation so that e.g. "sea?" still matches "sea"
            core = word.rstrip('?.,!;:')
            suffix = word[len(core):]
            # Look up the lowercase form of the word in the homophones dictionary
            key = core.lower()
            if key in self.homophones:
                # Replace with the homophone, preserving the original capitalization
                homophone = self.homophones[key]
                if core.istitle():
                    homophone = homophone.capitalize()
                new_words.append(homophone + suffix)
            else:
                new_words.append(word)
        self.question = ' '.join(new_words)
        return self.question
    
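    def omission(self):
        """Omit one randomly chosen word, never a question word"""
        words = self.question.split()
        # TODO: Improve the logic to omit words more dynamically, while preserving meaning as much as possible
        protected = {'how', 'what', 'why', 'who', 'where', '?'}
        # Ignore trailing punctuation when comparing, so that e.g. "Why?" is still protected
        omissible_words = [word for word in words
                           if (word.lower().rstrip('?.,!') or word) not in protected]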
        if omissible_words:
            omitted_word = random.choice(omissible_words)
            words.remove(omitted_word)
        return ' '.join(words)
    
    def informal_language_injection(self):
        """Introduces slang or informal phrases into sentences"""
        # TODO: This is a basic implementation, and it could be expanded with a more comprehensive mapping or logic.
        informal_replacements = {
            "What is": "What's",
            "How many": "How many",  # currently a no-op: no informal variant is defined yet
            "Why is": "Why's",
            "Who is": "Who's",
            # Add more replacements as needed
        }
        for formal, informal in informal_replacements.items():
            self.question = self.question.replace(formal, informal)
        return self.question
    
    #### 2.2. Noise Injection ####
    def synonym_replacement(self, n=1):
        """Replace up to n words with their synonyms"""
        words = self.question.split()
        new_sentence = []
        replacements = 0

        for word in words:
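            # WordNet joins multi-word lemmas with underscores, so convert them back to spaces
            synonyms = [lem.name().replace('_', ' ') for syn in wordnet.synsets(word)
                        for lem in syn.lemmas() if lem.name().lower() != word.lower()]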
            if synonyms and replacements < n:
                new_sentence.append(random.choice(synonyms))
                replacements += 1
            else:
                new_sentence.append(word)
        
        return ' '.join(new_sentence)

    def random_character_alteration(self, n=1):
        """Randomly alter characters in the question"""
        chars = list(self.question)
        for _ in range(n):
            index = random.randint(0, len(chars) - 1)
            alteration_type = random.choice(['insert', 'delete', 'substitute'])
            if alteration_type == 'insert':
                chars.insert(index, random.choice('abcdefghijklmnopqrstuvwxyz'))
            elif alteration_type == 'delete' and len(chars) > 1:
                del chars[index]
            elif alteration_type == 'substitute':
                chars[index] = random.choice('abcdefghijklmnopqrstuvwxyz')
        return ''.join(chars)

    def random_word_addition_deletion(self, add=True, delete=True):
        """Randomly add or delete words from the sentence"""
        words = self.question.split()
        if add:
            # Add a random word (a duplicate of a word already in the sentence)
            insert_at = random.randint(0, len(words))
            words.insert(insert_at, random.choice(words))
        if delete and len(words) > 1:
            # Delete a random word
            del words[random.randint(0, len(words) - 1)]
        return ' '.join(words)

In [3]:
sample_questions = ['How many Dubai-based airlines are owned by Somalis?',
       'What product did various compounds of zinc show massive differences in absorption?',
       'What do athletes perform for tumbling?',
       'Who plays an influential role in the formation of party policy?',
       "How many thumbs down votes did youtube's official statement about the new commenting system get within two days?",
       'In what year were the Beer Orders passed?',
       'What did the United States of America incorporate?',
       'What two letters can be replaced with each other a lot of the time in Estonian?',
       "What isn't the threshold of the number of copies and the value of the works?",
       "Who received the Judges' Save this season?",
       'What is the cost to build Cornell Tech?',
       'On what day did Paul VI die?',
       'Why is there no saving grace of relying on these theories?',
       'How did the Jewish population increase before the 1st century?',
       'What was the former name of the Tampa Bay Storm?',
       'Why was Shtokavian the most widespread culture in the western Balkans? ',
       'How many commercial airports does Fraport in the UK manage?',
       'What quality has LEDs been used as?',
       'What area was the likely focal point of the Roman senate?',
       'When was the Tower constructed?']

In [5]:
# Apply perturbations to sample questions
perturbation = Perturbation(random.choice(sample_questions))

In [6]:
perturbation.sentence_scrambling()

'Jewish did century? 1st before increase the the How population'

In [7]:
perturbation.question

'How did the Jewish population increase before the 1st century?'

## 2.3. Back Translation

In [5]:
# TODO: For Back Translation, use widely spoken languages such as Hindi, Marathi, Bengali, Mandarin, Spanish, French, Arabic, Russian, Portuguese, German, Japanese, etc.

In [49]:
from transformers import pipeline

# Sample sentence from SQuAD
sentence = "What is the capital of France?"

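# Define translation pipelines.
# 'translation_en_to_fr' ships with a default model in transformers; the reverse
# direction does not, so a model is named explicitly here
# (Helsinki-NLP/opus-mt-fr-en is one option; any French-to-English model should work).
translate_to_french = pipeline('translation_en_to_fr')
translate_back_to_english = pipeline('translation_fr_to_en', model='Helsinki-NLP/opus-mt-fr-en')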

# Perform back-translation
translated_to_french = translate_to_french(sentence)[0]['translation_text']
back_translated_to_english = translate_back_to_english(translated_to_french)[0]['translation_text']

print("Original:", sentence)
print("Back Translated:", back_translated_to_english)

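To try several pivot languages, as the TODO above suggests, the round trip can be factored into a small helper. The sketch below is only illustrative: `back_translate` and `back_translate_many` are hypothetical names, and the translators are assumed to be plain `str -> str` callables. Toy stand-ins are used so that the block runs without downloading a model.

```python
def back_translate(text, forward, backward):
    """Translate text into a pivot language and back again.

    forward and backward are any callables mapping str -> str, for example
    lambda s: translate_to_french(s)[0]['translation_text'] for a pipeline.
    """
    pivot = forward(text)
    return backward(pivot)


def back_translate_many(text, translators):
    """Return one back-translation per pivot language.

    translators maps a language label to a (forward, backward) pair.
    """
    return {
        lang: back_translate(text, fwd, bwd)
        for lang, (fwd, bwd) in translators.items()
    }


# Toy stand-ins (not real translators) that only change the letter case
toy_translators = {"toy": (str.upper, str.lower)}
print(back_translate_many("What is the capital of France?", toy_translators))
```

Keeping the translators as plain callables means the same helper could wrap a `transformers` pipeline, a hosted translation API, or a test double.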
## 2.4. Keyboard Typing Errors

In [35]:
# TODO: For Keyboard Typing Errors, create a matrix of the characters on the keyboard, arranged according to their position (5x14 below).
# Then, to create a typo, replace any character with an adjacent character according to keyboard position.
import numpy as np

keyboard = np.array([
    list('`1234567890-= '),
    list('\tqwertyuiop[]\\'),
    list('  asdfghjkl;\' '),
    list('  zxcvbnm,./  '),
    list('              ')
])

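One possible way to complete the TODO above is sketched here. It assumes the same layout as the matrix, stored as plain strings so that the block does not depend on NumPy, and treats only the keys directly left, right, above, and below as adjacent (the physical stagger between rows is ignored). `KEYBOARD`, `adjacent_keys`, and `keyboard_typo` are hypothetical names.

```python
import random

# Same layout as the matrix above, one string per keyboard row
KEYBOARD = [
    "`1234567890-= ",
    "\tqwertyuiop[]\\",
    "  asdfghjkl;' ",
    "  zxcvbnm,./  ",
    "              ",
]


def adjacent_keys(ch):
    """Return the non-blank keys directly left, right, above and below ch."""
    ch = ch.lower()
    if len(ch) != 1 or ch.isspace():
        return []
    for r, row in enumerate(KEYBOARD):
        c = row.find(ch)
        if c == -1:
            continue
        found = []
        for dr, dc in ((0, -1), (0, 1), (-1, 0), (1, 0)):
            rr, cc = r + dr, c + dc
            if 0 <= rr < len(KEYBOARD) and 0 <= cc < len(KEYBOARD[rr]):
                key = KEYBOARD[rr][cc]
                if not key.isspace():
                    found.append(key)
        return found
    return []  # character is not in the layout (e.g. '?')


def keyboard_typo(text, n=1, rng=None):
    """Replace up to n characters of text with a neighboring key."""
    rng = rng or random.Random()
    chars = list(text)
    # Only characters with at least one neighbor can be mistyped
    positions = [i for i, ch in enumerate(chars) if adjacent_keys(ch)]
    for i in rng.sample(positions, min(n, len(positions))):
        replacement = rng.choice(adjacent_keys(chars[i]))
        chars[i] = replacement.upper() if chars[i].isupper() else replacement
    return "".join(chars)


print(adjacent_keys("g"))  # -> ['f', 'h', 'y', 'b']
print(keyboard_typo("When was the Tower constructed?", n=2))
```

Passing a seeded `random.Random` as `rng` makes the typos reproducible across runs.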

In [11]:
# TODO: For Paraphrasing, can we use a writer's style of writing?
# How would the following authors paraphrase the given sentence?
# 1. Paulo Coelho
# 2. Stephen King
# 3. Agatha Christie
# 4. J.K. Rowling