## Detailed Report on Data Augmentation Methods

- Controlled Synonym Replacement: This method replaces words in the sentence with their synonyms, ensuring that the replacement words are valid and meaningful. This increases variability while retaining the semantic meaning.

- Back-Translation Paraphrasing: This method uses back-translation to generate paraphrases. The sentence is translated to another language and then back to English. By using multiple languages (French, Spanish, German), we generate diverse paraphrases while retaining the original meaning.

- Template-Based Augmentation: This method uses a set of predefined templates to create variations of the original sentence. By inserting the original sentence into different templates, we generate meaningful variations that enhance the dataset.

- Multiple Augmentation Rounds: The dataset is augmented multiple times to generate a larger dataset. Each round includes synonym replacement, back-translation, and template-based augmentation to ensure comprehensive augmentation.

## Install Dependencies

In [4]:
!pip install textblob nltk sklearn

Collecting textblob
  Downloading textblob-0.18.0.post0-py3-none-any.whl.metadata (4.5 kB)
Downloading textblob-0.18.0.post0-py3-none-any.whl (626 kB)
[2K   [38;2;114;156;31m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m626.3/626.3 kB[0m [31m8.7 MB/s[0m eta [36m0:00:00[0m MB/s[0m eta [36m0:00:01[0m
[?25hInstalling collected packages: textblob
Successfully installed textblob-0.18.0.post0


## Initial Dataset

In [49]:
# format: prompt (key) -> top-5 most relevant uiuds (value) [1 (most relevant) to 5 (5th most relevant)]
dataset = {
    "what is the Alex desk": [923, 924, 925, 926, 927],
    "for the Alex desk, what are the warnings I should know of?": [924, 923, 925, 926, 927],
    "for the Alex desk, what parts do I need?": [925, 924, 923, 926, 927],
    "for the Alex desk, what is the first step?": [926, 925, 924, 923, 927],
    "for the Alex desk, what is the second step?": [926, 925, 924, 923, 927],
    "for the Alex desk, how many nails do I need for step one?": [926, 925, 924, 923, 927],
    "for the Alex desk, how many parts do I need for step two?": [926, 925, 924, 923, 927], 
    "what is the Pahl desk": [915, 916, 917, 918, 919],
    "for the Pahl desk, what are the warnings I should know of?": [916, 915, 917, 918, 919],
    "for the Pahl desk, what parts do I need?": [916, 915, 916, 918, 919],
    "for the Pahl desk, what is the first step?": [917, 915, 916, 918, 919],
    "for the Pahl desk, what is the second step?": [918, 915, 916, 917, 919],
    "for the Pahl desk, how many nails do I need for step one?": [917, 915, 916, 918, 919],
    "for the Pahl desk, how many parts do I need for step two?": [918, 915, 916, 917, 919], 
    "what is the Fredrik desk": [907, 908, 909, 910, 911],
    "for the Fredrik desk, what are the warnings I should know of?": [908, 907, 909, 910, 911],
    "for the Fredrik desk, what parts do I need?": [908, 907, 909, 910, 911],
    "for the Fredrik desk, what is the first step?": [909, 907, 908, 909, 910],
    "for the Fredrik desk, what is the second step?": [909, 907, 908, 909, 910],
    "for the Fredrik desk, what tool do I need for step one?": [909, 907, 908, 909, 910],
    "for the Fredrik desk, how many parts do I need for step two?": [909, 907, 908, 909, 910],
    "what is the Flisat desk": [887, 888, 889, 890, 891],
    "for the Flisat desk, what are the warnings I should know of?": [888, 887, 889, 890, 891],
    "for the Flisat desk, what parts do I need?": [889, 887, 888, 890, 891],
    "for the Flisat desk, what is the first step?": [890, 887, 889, 888, 891],
    "for the Flisat desk, what is the second step?": [891, 887, 889, 888, 890],
    "for the Flisat desk, how many nails do I need for step one?": [890, 887, 889, 888, 891],
    "for the Flisat desk, how many parts do I need for step two?": [891, 887, 889, 888, 890],  
    "what is the vittsjo shelf": [871, 876, 885, 884, 883],
    "for the vittsjo shelf, what are the warnings I should know of?": [872, 873, 874, 875, 871],
    "for the vittsjo shelf, what parts do I need?": [876, 871, 885, 884, 883],
    "for the vittsjo shelf, what is the first step?": [877, 876, 871, 885, 878],
    "for the vittsjo shelf, what is the second step?": [878, 871, 885, 884, 883],
    "for the vittsjo shelf, what pieces do I need for step one?": [877, 876, 871, 885, 878],
    "for the vittsjo shelf, how many parts do I need for step two?": [878, 871, 885, 884, 883], 
    "what is the vaniljstang shelf": [859, 860, 861, 869, 870],
    "for the vaniljstang shelf, what are the warnings I should know of?": [860, 859, 861, 869, 870],
    "for the vaniljstang shelf, what parts do I need?": [861, 859, 862, 869, 870],
    "for the vaniljstang shelf, what is the first step?": [862, 859, 861, 863, 870],
    "for the vaniljstang shelf, what is the second step?": [863, 859, 862, 861, 870],
    "for the vaniljstang shelf, what pieces do I need for step one?": [862, 859, 861, 863, 870],
    "for the vaniljstang shelf, how many parts do I need for step two?": [863, 859, 862, 861, 870],
    "what is the satsumas furniture": [847, 848, 849, 857, 858],
    "for the satsumas furniture, what are the warnings I should know of?": [848, 847, 849, 857, 858],
    "for the satsumas furniture, what parts do I need?": [849, 848, 847, 857, 858],
    "for the satsumas furniture, what is the first step?": [850, 847, 851, 857, 858],
    "for the satsumas furniture, what is the second step?": [850, 847, 851, 857, 858],
    "for the satsumas furniture, what pieces do I need for step one?": [850, 847, 851, 857, 858],
    "for the satsumas furniture, how many parts do I need for step two?": [850, 847, 851, 857, 858],  
}

## Naive Augmentation - make the dataset 10x larger

In [31]:
import json
from nltk.corpus import wordnet
from textblob import TextBlob
from tqdm import tqdm
# Example dataset
# dataset = {
#     "what is the Alex desk": [923, 924, 925, 926, 927],
#     "for the Alex desk, what are the warnings I should know of?": [924, 923, 925, 926, 927],
#     "for the Alex desk, what parts do I need?": [925, 924, 923, 926, 927],
#     "for the Alex desk, what is the first step?": [926, 925, 924, 923, 927],
#     "for the Alex desk, what is the second step?": [926, 925, 924, 923, 927],
#     "for the Alex desk, how many nails do I need for step one?": [926, 925, 924, 923, 927],
#     "for the Alex desk, how many parts do I need for step two?": [926, 925, 924, 923, 927],
# }

# Synonym replacement function ensuring semantic meaning
def synonym_replacement(sentence):
    words = sentence.split()
    new_sentence = []
    for word in words:
        synonyms = wordnet.synsets(word)
        if synonyms:
            synonym = synonyms[0].lemmas()[0].name()
            if synonym != word and synonym.isalpha():
                new_sentence.append(synonym)
            else:
                new_sentence.append(word)
        else:
            new_sentence.append(word)
    return ' '.join(new_sentence)

# Paraphrasing using back-translation with multiple languages
def back_translate(sentence, languages=['fr', 'es', 'de']):
    translations = []
    try:
        blob = TextBlob(sentence)
        for lang in languages:
            translated = str(blob.translate(to=lang).translate(to='en'))
            if translated != sentence:
                translations.append(translated)
    except Exception as e:
        # If there's an error in translation, return the original sentence
        translations.append(sentence)
    return translations

# Template-based augmentation
def template_augmentation(sentence):
    templates = [
        "Can you tell me about {}?",
        "I would like to know about {}.",
        "What can you say about {}?",
        "Provide details about {}.",
        "{} - could you elaborate?",
        "Please explain {} in detail.",
        "What information is available on {}?",
        "Could you tell me about {}?",
        "I need information on {}.",
        "Could you provide more details on {}?"
    ]
    augmented_sentences = []
    for template in templates:
        augmented_sentences.append(template.format(sentence))
    return augmented_sentences

# Function to augment the dataset multiple times
def augment_dataset(dataset, rounds=1):
    augmented_dataset = dataset.copy()
    for _ in range(rounds):
        new_entries = {}
        for question, ids in tqdm(augmented_dataset.items()):
            # Synonym Replacement
            augmented_sentence = synonym_replacement(question)
            if augmented_sentence != question:
                new_entries[augmented_sentence] = ids
            
            # Back-Translation Paraphrasing
            augmented_sentences = back_translate(question)
            for augmented_sentence in augmented_sentences:
                if augmented_sentence != question:
                    new_entries[augmented_sentence] = ids
            
            # Template-Based Augmentation
            augmented_sentences = template_augmentation(question)
            for augmented_sentence in augmented_sentences:
                if augmented_sentence != question:
                    new_entries[augmented_sentence] = ids
        
        augmented_dataset.update(new_entries)
    return augmented_dataset

# Augment the dataset
augmented_dataset = augment_dataset(dataset, rounds=1)

# Combine the original dataset with the augmented dataset
final_dataset = {**dataset, **augmented_dataset}

# Save the augmented dataset to a JSON file including the original dataset
with open('augmented_dataset.json', 'w') as f:
    json.dump(final_dataset, f, indent=4)

# Load the augmented dataset from the JSON file
with open('augmented_dataset.json', 'r') as f:
    loaded_augmented_dataset = json.load(f)

# Print the loaded augmented dataset
# print(json.dumps(loaded_augmented_dataset, indent=4))

# Print the length of the augmented dataset
print(f"Length of the augmented dataset: {len(loaded_augmented_dataset)}")


100%|████████████████████████████████████████████████████████████████████| 49/49 [00:00<00:00, 4242.09it/s]

Length of the augmented dataset: 588





In [10]:
# Load the augmented dataset from the JSON file
with open('augmented_dataset.json', 'r') as f:
    loaded_augmented_dataset = json.load(f)

# Print the loaded augmented dataset
# print(json.dumps(loaded_augmented_dataset, indent=4))

# Print the length of the augmented dataset
print(f"Length of the augmented dataset: {len(loaded_augmented_dataset)}")

Length of the augmented dataset: 294


## Augment to 500 samples with train, test, split

- Saved as `augmented_train_dataset.json`, `augmented_test_dataset.json`, `augmented_val_dataset.json`

In [50]:
import json
import random
from nltk.corpus import wordnet
from textblob import TextBlob
from sklearn.model_selection import train_test_split

# Example dataset
# dataset = {
#     "what is the Alex desk": [923, 924, 925, 926, 927],
#     "for the Alex desk, what are the warnings I should know of?": [924, 923, 925, 926, 927],
#     "for the Alex desk, what parts do I need?": [925, 924, 923, 926, 927],
#     "for the Alex desk, what is the first step?": [926, 925, 924, 923, 927],
#     "for the Alex desk, what is the second step?": [926, 925, 924, 923, 927],
#     "for the Alex desk, how many nails do I need for step one?": [926, 925, 924, 923, 927],
#     "for the Alex desk, how many parts do I need for step two?": [926, 925, 924, 923, 927],
# }

# Synonym replacement function ensuring semantic meaning
def synonym_replacement(sentence):
    words = sentence.split()
    new_sentence = []
    for word in words:
        synonyms = wordnet.synsets(word)
        if synonyms:
            synonym = synonyms[0].lemmas()[0].name()
            if synonym != word and synonym.isalpha():
                new_sentence.append(synonym)
            else:
                new_sentence.append(word)
        else:
            new_sentence.append(word)
    return ' '.join(new_sentence)

# Paraphrasing using back-translation with multiple languages
def back_translate(sentence, languages=['fr', 'es', 'de']):
    translations = []
    try:
        blob = TextBlob(sentence)
        for lang in languages:
            translated = str(blob.translate(to=lang).translate(to='en'))
            if translated != sentence:
                translations.append(translated)
    except Exception as e:
        # If there's an error in translation, return the original sentence
        translations.append(sentence)
    return translations

# Expanded Template array
templates = [
    "Can you tell me about {}?",
    "I would like to know about {}.",
    "What can you say about {}?",
    "Provide details about {}.",
    "{} - could you elaborate?",
    "Please explain {} in detail.",
    "What information is available on {}?",
    "Could you tell me about {}?",
    "I need information on {}.",
    "Could you provide more details on {}?",
    "Tell me something regarding {}.",
    "Give me an explanation about {}.",
    "Can you elaborate more on {}?",
    "Provide more insights about {}.",
    "{} - can you give further details?",
    "Could you explain {} more thoroughly?",
    "What do you know about {}?",
    "I want to understand {} better.",
    "{} - could you clarify?",
    "Please provide details regarding {}.",
    "Explain {} to me.",
    "Elaborate on {} please.",
    "Give me an overview of {}.",
    "Tell me what you know about {}.",
    "Could you shed light on {}?",
    "Clarify {} for me.",
    "Discuss {} in depth.",
    "{} - what's the scoop?",
    "Break down {} for me.",
]

# Function to randomly select a template
def random_template(templates):
    return random.choice(templates)

# Function to apply augmentation based on dataset type
def apply_augmentation(dataset, augment_func, rounds=10):
    augmented_dataset = {}
    for _ in range(rounds):
        for question, ids in dataset:
            augmented_sentences = augment_func(question)
            for augmented_sentence in augmented_sentences:
                if augmented_sentence not in augmented_dataset:
                    augmented_dataset[augmented_sentence] = ids
    return augmented_dataset

# Different augmentation strategies for train, val, test sets
def augment_train_set(question):
    selected_template = random_template(templates)
    augmented_sentence = selected_template.format(question)
    return [augmented_sentence]

def augment_val_set(question):
    selected_template = random_template(templates)
    augmented_sentence = selected_template.format(question)
    return [augmented_sentence]

def augment_test_set(question):
    selected_template = random_template(templates)
    augmented_sentence = selected_template.format(question)
    return [augmented_sentence]

# Split dataset into train, val, test sets
train_dataset, test_dataset = train_test_split(list(dataset.items()), test_size=0.2, random_state=42)
train_dataset, val_dataset = train_test_split(train_dataset, test_size=0.1, random_state=42)

# Apply augmentation to each dataset
augmented_train = apply_augmentation(train_dataset, augment_train_set, rounds=10)
augmented_val = apply_augmentation(val_dataset, augment_val_set, rounds=10)
augmented_test = apply_augmentation(test_dataset, augment_test_set, rounds=10)

# Combine augmented datasets into final train, val, test sets
final_train_dataset = {**dataset, **augmented_train}
final_val_dataset = {**dataset, **augmented_val}
final_test_dataset = {**dataset, **augmented_test}

# Save the augmented datasets to JSON files
with open('augmented_train_dataset.json', 'w') as f:
    json.dump(final_train_dataset, f, indent=4)

with open('augmented_val_dataset.json', 'w') as f:
    json.dump(final_val_dataset, f, indent=4)

with open('augmented_test_dataset.json', 'w') as f:
    json.dump(final_test_dataset, f, indent=4)

# Print lengths of augmented datasets
print(f"Length of augmented train dataset: {len(final_train_dataset)}")
print(f"Length of augmented val dataset: {len(final_val_dataset)}")
print(f"Length of augmented test dataset: {len(final_test_dataset)}")

# Print 10 random samples from each dataset for verification
print("\nRandom samples from augmented train dataset:")
print(random.sample(list(final_train_dataset.items()), 10))
print("\nRandom samples from augmented val dataset:")
print(random.sample(list(final_val_dataset.items()), 10))
print("\nRandom samples from augmented test dataset:")
print(random.sample(list(final_test_dataset.items()), 10))


Length of augmented train dataset: 348
Length of augmented val dataset: 84
Length of augmented test dataset: 136

Random samples from augmented train dataset:
[('Provide more insights about for the vittsjo shelf, what parts do I need?.', [876, 871, 885, 884, 883]), ('Can you elaborate more on for the vittsjo shelf, what parts do I need??', [876, 871, 885, 884, 883]), ('Give me an overview of for the Alex desk, how many nails do I need for step one?.', [926, 925, 924, 923, 927]), ('for the Alex desk, how many parts do I need for step two?', [926, 925, 924, 923, 927]), ('for the Pahl desk, how many nails do I need for step one?', [917, 915, 916, 918, 919]), ('for the Pahl desk, how many parts do I need for step two?', [918, 915, 916, 917, 919]), ('Provide details about for the Flisat desk, what parts do I need?.', [889, 887, 888, 890, 891]), ('What do you know about for the vaniljstang shelf, what is the first step??', [862, 859, 861, 863, 870]), ('for the Fredrik desk, what parts do I n

## Augment to above 1 million samples given 7 

In [16]:
import json
from nltk.corpus import wordnet
from textblob import TextBlob

# Example dataset
dataset = {
    "what is the Alex desk": [923, 924, 925, 926, 927],
    "for the Alex desk, what are the warnings I should know of?": [924, 923, 925, 926, 927],
    "for the Alex desk, what parts do I need?": [925, 924, 923, 926, 927],
    "for the Alex desk, what is the first step?": [926, 925, 924, 923, 927],
    "for the Alex desk, what is the second step?": [926, 925, 924, 923, 927],
    "for the Alex desk, how many nails do I need for step one?": [926, 925, 924, 923, 927],
    "for the Alex desk, how many parts do I need for step two?": [926, 925, 924, 923, 927],
}

# Synonym replacement function ensuring semantic meaning
def synonym_replacement(sentence):
    words = sentence.split()
    new_sentence = []
    for word in words:
        synonyms = wordnet.synsets(word)
        if synonyms:
            synonym = synonyms[0].lemmas()[0].name()
            if synonym != word and synonym.isalpha():
                new_sentence.append(synonym)
            else:
                new_sentence.append(word)
        else:
            new_sentence.append(word)
    return ' '.join(new_sentence)

# Paraphrasing using back-translation with multiple languages
def back_translate(sentence, languages=['fr', 'es', 'de']):
    translations = []
    try:
        blob = TextBlob(sentence)
        for lang in languages:
            translated = str(blob.translate(to=lang).translate(to='en'))
            if translated != sentence:
                translations.append(translated)
    except Exception as e:
        # If there's an error in translation, return the original sentence
        translations.append(sentence)
    return translations

# Template-based augmentation
def template_augmentation(sentence):
    templates = [
        "Can you tell me about {}?",
        "I would like to know about {}.",
        "What can you say about {}?",
        "Provide details about {}.",
        "{} - could you elaborate?",
        "Please explain {} in detail.",
        "What information is available on {}?",
        "Could you tell me about {}?",
        "I need information on {}.",
        "Could you provide more details on {}?"
    ]
    augmented_sentences = []
    for template in templates:
        augmented_sentences.append(template.format(sentence))
    return augmented_sentences

# Function to augment the dataset multiple times
def augment_dataset(dataset, rounds=1):
    augmented_dataset = dataset.copy()
    for _ in range(rounds):
        new_entries = {}
        for question, ids in augmented_dataset.items():
            # Synonym Replacement
            augmented_sentence = synonym_replacement(question)
            if augmented_sentence != question:
                new_entries[augmented_sentence] = ids
            
            # Back-Translation Paraphrasing
            augmented_sentences = back_translate(question)
            for augmented_sentence in augmented_sentences:
                if augmented_sentence != question:
                    new_entries[augmented_sentence] = ids
            
            # Template-Based Augmentation
            augmented_sentences = template_augmentation(question)
            for augmented_sentence in augmented_sentences:
                if augmented_sentence != question:
                    new_entries[augmented_sentence] = ids
        
        augmented_dataset.update(new_entries)
    return augmented_dataset

# Augment the dataset
augmented_dataset = augment_dataset(dataset, rounds=5)

# Save the augmented dataset to a JSON file including the original dataset
final_dataset = {**dataset, **augmented_dataset}
with open('augmented_dataset.json', 'w') as f:
    json.dump(final_dataset, f, indent=4)

# Print the augmented dataset
# print(json.dumps(final_dataset, indent=4))

# Print the length of the augmented dataset
print(f"Length of the augmented dataset: {len(final_dataset)}")


Length of the augmented dataset: 1123587
