This is a **contextual augmentation** approach using BERT (Bidirectional Encoder Representations from Transformers). Here's a brief explanation:

1. It selects candidate words in the text: alphabetic tokens longer than 3 characters. No stopword list is applied; the length filter alone removes most short function words.
2. It masks a candidate word and uses BERT's masked language model to predict replacements that fit the surrounding context
3. It picks one of BERT's top predictions at random to replace the original word
4. It aims to keep the original meaning and sentiment while creating variations in the text
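
The flow of these steps can be sketched without loading BERT. This is a minimal illustration under stated assumptions, not the notebook's implementation: `toy_predict` is a hypothetical stand-in for the masked-language-model call (it looks only at the word to the left of the mask), and the function names are invented for this sketch.

```python
import random

MASK = "[MASK]"

def candidate_indices(words, min_len=4):
    """Step 1: indices of alphabetic words with at least `min_len` characters."""
    return [i for i, w in enumerate(words) if w.isalpha() and len(w) >= min_len]

def toy_predict(masked_words, mask_index):
    """Hypothetical stand-in for BERT: candidates keyed on the left neighbour.

    A real masked language model scores its whole vocabulary using the full
    context on both sides; this lookup table only illustrates the data flow.
    """
    left = masked_words[mask_index - 1] if mask_index > 0 else ""
    table = {"this": ["film", "movie", "picture"], "was": ["great", "good", "fine"]}
    return table.get(left, [])

def augment(words, predict_fn, percent=0.15, rng=None):
    """Steps 2-3: mask each chosen word, ask for candidates, pick one at random."""
    rng = rng or random.Random()
    new_words = list(words)
    indices = candidate_indices(words)
    rng.shuffle(indices)
    n_to_replace = max(1, int(len(words) * percent))
    for i in indices[:n_to_replace]:
        masked = list(words)
        masked[i] = MASK
        # Drop predictions identical to the original word
        options = [w for w in predict_fn(masked, i) if w.lower() != words[i].lower()]
        if options:
            new_words[i] = rng.choice(options)
    return new_words

words = "this movie was great and the cast did well".split()
print(candidate_indices(words))  # → [0, 1, 3, 6, 8]
print(" ".join(augment(words, toy_predict, percent=0.3, rng=random.Random(0))))
```

Passing the predictor in as `predict_fn` keeps the selection logic testable on its own; swapping in a real model only changes that one argument.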

Unlike random word replacement or synonym-dictionary methods, this technique conditions each replacement on the surrounding sentence, so the substitutes tend to fit the specific context. That helps keep the sentiment and meaning of the reviews largely intact while adding enough variation to improve the model's ability to generalize.
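
A masked language model's raw top predictions also include tokens that cannot stand in for a whole word, so they need filtering before one is chosen. The sketch below assumes WordPiece-style tokens, where continuation pieces start with `##`. The helper name `filter_predictions` is hypothetical, and the extra checks for `[UNK]` and non-alphabetic tokens are assumptions added for this sketch.

```python
SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[MASK]", "[PAD]", "[UNK]"}

def filter_predictions(predicted_tokens, original_word):
    """Keep only predictions that can replace a whole word."""
    kept = []
    for token in predicted_tokens:
        if token.startswith("##"):        # subword continuation piece
            continue
        if token in SPECIAL_TOKENS:       # model control tokens
            continue
        if token.lower() == original_word.lower():  # no-op replacement
            continue
        if not token.isalpha():           # punctuation, digits (assumed extra check)
            continue
        kept.append(token)
    return kept

print(filter_predictions(["movie", "film", "##s", "[SEP]", ".", "picture"], "movie"))
# → ['film', 'picture']
```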

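The main advantage over other augmentation methods is more natural-sounding variations that usually preserve the original sentiment, which is crucial for sentiment analysis tasks. Preservation is not guaranteed: a masked language model can propose an antonym that fits the context equally well, so spot-checking augmented samples is worthwhile.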

In [4]:
import pandas as pd
import numpy as np
import torch
from transformers import BertTokenizer, BertForMaskedLM
import random
from tqdm import tqdm
import os
import re
import time
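try:
    from google.colab import files  # Only available inside Google Colab
except ImportError:
    files = None  # Not in Colab; the download step is skipped later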

class SimpleTokenizer:
    """A simple tokenizer that doesn't rely on NLTK, to avoid dependency issues."""

    def __init__(self):
        # Common punctuation and symbols to treat as separate tokens
        self.punctuation = set('.,;:!?()[]{}"\'-/')

    def tokenize(self, text):
        """Split text into word tokens."""
        # Replace punctuation with spaces around them for easier splitting
        for punct in self.punctuation:
            text = text.replace(punct, f' {punct} ')

        # Split by whitespace and filter out empty tokens
        tokens = [token for token in text.split() if token]
        return tokens

class ContextualAugmenter:
    """
    Implement contextual augmentation using BERT masked language model.
    This approach replaces words with contextually similar words predicted by BERT.
    """

    def __init__(self, device=None):
        """Initialize the augmenter with BERT model and tokenizer."""
        if device is None:
            # Check for available devices in this order: CUDA → MPS → CPU
            if torch.cuda.is_available():
                self.device = torch.device("cuda")
                print(f"Using CUDA device: {torch.cuda.get_device_name(0)}")
                # A100 detection is informational only; no settings are changed here
                if "A100" in torch.cuda.get_device_name(0):
                    print("Detected A100 GPU - optimizing settings")
            elif hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
                self.device = torch.device("mps")
                print("Using Apple Silicon MPS acceleration")
            else:
                self.device = torch.device("cpu")
                print("Using CPU (no GPU acceleration available)")
        else:
            self.device = device
            print(f"Using specified device: {self.device}")

        # Custom tokenizer instead of NLTK
        self.word_tokenizer = SimpleTokenizer()

        # Load pre-trained model and tokenizer
        print("Loading BERT model and tokenizer...")
        self.tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
        self.model = BertForMaskedLM.from_pretrained('bert-base-uncased')
        self.model.to(self.device)
        self.model.eval()

        # Define mask token ID
        self.mask_token_id = self.tokenizer.convert_tokens_to_ids(['[MASK]'])[0]

        # Maximum sequence length for BERT
        self.max_seq_length = 512

    def _get_word_replacements(self, text, word_to_replace, num_predictions=5):
        """Get contextual replacements for a specific word in the text."""
        # If text is too long, find the occurrence of the word and use a window around it
        if len(text.split()) > 100:  # Approximate threshold
            words = text.split()
            word_pos = -1

            # Find position of the word to replace
            for i, word in enumerate(words):
                if word.lower() == word_to_replace.lower():
                    word_pos = i
                    break

            if word_pos == -1:
                return []  # Word not found

            # Create a window around the word
            start_pos = max(0, word_pos - 50)
            end_pos = min(len(words), word_pos + 50)

            # Create a shorter text with the word in context
            text = ' '.join(words[start_pos:end_pos])

            # Adjust word_to_replace if it might have changed (e.g., with punctuation)
            if word_pos >= start_pos and word_pos < end_pos:
                word_to_replace = words[word_pos]

        # Tokenize the text
        tokens = self.tokenizer.tokenize(text)

        # If the text is still too long, truncate it to fit BERT's limit
        if len(tokens) > self.max_seq_length - 2:  # -2 for [CLS] and [SEP]
            # Find the word in tokens
            word_tokens = self.tokenizer.tokenize(word_to_replace)

            if not word_tokens:
                return []

            word_token = word_tokens[0]

            # Find the positions of the word token
            positions = [i for i, token in enumerate(tokens) if token == word_token]

            if not positions:
                return []

            # Use the first occurrence and create a window around it
            pos = positions[0]
            left_context = max(0, pos - 200)
            right_context = min(len(tokens), pos + 200)

            tokens = tokens[left_context:right_context]

        # Find the token(s) corresponding to the word to replace
        word_tokens = self.tokenizer.tokenize(word_to_replace)

        # If the word splits into multiple tokens, we'll only replace the first one
        # for simplicity (could be extended to handle multi-token replacements)
        if not word_tokens:
            return []

        word_token = word_tokens[0]

        # Find positions of the token in the tokenized text
        positions = [i for i, token in enumerate(tokens) if token == word_token]
        if not positions:
            return []

        # Randomly select one position to mask
        position = random.choice(positions)

        # Create a copy of tokens and replace the selected position with [MASK]
        masked_tokens = tokens.copy()
        masked_tokens[position] = '[MASK]'

        # Convert the WordPiece tokens directly to IDs and add [CLS]/[SEP].
        # Re-encoding " ".join(masked_tokens) would split '##' pieces again,
        # and padding to max_length would add needless computation.
        token_ids = self.tokenizer.convert_tokens_to_ids(masked_tokens)
        input_ids = torch.tensor(
            [self.tokenizer.build_inputs_with_special_tokens(token_ids)],
            device=self.device
        )
        attention_mask = torch.ones_like(input_ids)

        # Find position of mask token in input_ids
        mask_positions = (input_ids == self.mask_token_id).nonzero()
        if mask_positions.shape[0] == 0:
            return []

        mask_position = mask_positions[0, 1]

        # Generate predictions
        with torch.no_grad():
            outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)
            predictions = outputs.logits

        # Get top predictions for masked token
        probs = torch.nn.functional.softmax(predictions[0, mask_position], dim=0)
        top_k_weights, top_k_indices = torch.topk(probs, num_predictions)

        # Convert token IDs to tokens
        replacements = []
        for token_id in top_k_indices:
            token = self.tokenizer.convert_ids_to_tokens([token_id])[0]

            # Skip special tokens and subword tokens (starting with ##)
            if token.startswith('##') or token in ['[CLS]', '[SEP]', '[MASK]', '[PAD]']:
                continue

            # Skip if the replacement is the same as the original
            if token.lower() == word_token.lower():
                continue

            replacements.append(token)

        return replacements

    def augment(self, text, percent=0.15, top_n=5, max_length=10000):
        """
        Augment text by replacing a percentage of words with contextual predictions.

        Args:
            text (str): Input text to augment
            percent (float): Percentage of words to replace
            top_n (int): Number of top predictions to consider
            max_length (int): Maximum text length to process

        Returns:
            str: Augmented text
        """
        # If text is too long, truncate it (this is a safety measure)
        if len(text) > max_length:
            text = text[:max_length]

        # Tokenize into words using our simple tokenizer
        words = self.word_tokenizer.tokenize(text)

        # Determine number of words to replace
        n_to_replace = max(1, int(len(words) * percent))

        # Create a copy of words
        new_words = words.copy()

        # Get indices of words to potentially replace (exclude short words and non-alphabetic)
        valid_indices = []
        for i, word in enumerate(words):
            # Skip words that are too short
            if len(word) <= 3:
                continue

            # Skip words that are not alphabetic
            if not word.isalpha():
                continue

            valid_indices.append(i)

        # Shuffle and select indices to replace
        if valid_indices:
            random.shuffle(valid_indices)
            indices_to_replace = valid_indices[:n_to_replace]

            # Replace selected words
            for idx in indices_to_replace:
                word_to_replace = words[idx]

                # Get contextual replacements
                replacements = self._get_word_replacements(text, word_to_replace, top_n)

                # Replace if we found valid replacements
                if replacements:
                    new_words[idx] = random.choice(replacements)

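        # If nothing was replaced, return the input unchanged so callers can
        # detect a no-op (re-joining tokens alters spacing around punctuation)
        if new_words == words:
            return text

        # Join words back into text
        augmented_text = ' '.join(new_words)

        return augmented_text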


def augment_imdb_with_contextual(input_file, output_file, sample_fraction=0.2, augmentations_per_sample=1,
                                batch_size=1, max_reviews_length=10000):
    """
    Augment the IMDB dataset with contextual augmentation.

    Args:
        input_file: Path to the original IMDB CSV file
        output_file: Path to save the augmented dataset
        sample_fraction: Fraction of dataset to augment
        augmentations_per_sample: Number of augmentations to create per sample
        batch_size: Process this many reviews at once (for progress reporting)
        max_reviews_length: Skip reviews longer than this to avoid memory issues
    """
    # Check if the input file exists
    if not os.path.exists(input_file):
        raise FileNotFoundError(f"Input file not found: {input_file}")

    # Load the IMDB dataset
    print(f"Loading dataset from {input_file}...")
    df = pd.read_csv(input_file)

    # Check for required columns
    if 'review' not in df.columns or 'label' not in df.columns:
        raise ValueError("The dataset must contain 'review' and 'label' columns!")

    # Display dataset statistics
    print(f"Original dataset size: {len(df)} reviews")
    print(f"Class distribution: {df['label'].value_counts().to_dict()}")

    # Initialize the augmenter
    augmenter = ContextualAugmenter()

    # Sample a subset of the data to augment
    n_samples = int(len(df) * sample_fraction)
    print(f"Sampling {n_samples} reviews for augmentation...")

    # Ensure we have equal representation of both classes
    df_pos = df[df['label'] == 1].sample(n=n_samples//2, random_state=42)
    df_neg = df[df['label'] == 0].sample(n=n_samples//2, random_state=42)
    df_to_augment = pd.concat([df_pos, df_neg], ignore_index=True)

    # Filter out extremely long reviews
    df_to_augment = df_to_augment[df_to_augment['review'].str.len() < max_reviews_length]
    print(f"Using {len(df_to_augment)} reviews after filtering by length")

    # Initialize an empty list to store augmented data
    augmented_data = []

    # Augment the sampled data
    print("\nGenerating contextual augmentations...")

    # Process in batches for better progress reporting
    num_batches = len(df_to_augment) // batch_size + (1 if len(df_to_augment) % batch_size > 0 else 0)

    # Add a small delay to ensure tqdm can initialize properly
    time.sleep(0.5)

    # Save intermediate results every save_interval batches
    save_interval = max(1, num_batches // 10)  # Save approximately 10 times during processing

    for batch_idx in tqdm(range(num_batches), desc="Processing batches"):
        start_idx = batch_idx * batch_size
        end_idx = min((batch_idx + 1) * batch_size, len(df_to_augment))

        batch_df = df_to_augment.iloc[start_idx:end_idx]

        for _, row in batch_df.iterrows():
            review = row['review']
            label = row['label']

            # Create multiple augmentations per sample
            for i in range(augmentations_per_sample):
                try:
                    augmented_review = augmenter.augment(review, percent=0.15)

                    # Only add if the augmentation is different from the original
                    if augmented_review != review:
                        augmented_data.append({
                            'review': augmented_review,
                            'label': label,
                            'technique': f'contextual_aug_{i+1}'
                        })

                        # Report progress periodically
                        if len(augmented_data) % 100 == 0:
                            print(f"Created {len(augmented_data)} augmented samples so far")

                except Exception as e:
                    print(f"Error during augmentation: {str(e)[:100]}...")
                    continue

        # Save intermediate results periodically
        if batch_idx > 0 and batch_idx % save_interval == 0:
            print(f"Saving intermediate results after processing {batch_idx} batches...")

            # Add technique column to original data for intermediate save
            df_original = df.copy()
            df_original['technique'] = 'original'

            # Create a DataFrame with current augmented data
            if augmented_data:
                df_augmented = pd.DataFrame(augmented_data)

                # Combine original and current augmented data
                df_combined = pd.concat([df_original, df_augmented], ignore_index=True)

                # Save intermediate results
                intermediate_file = f"{os.path.splitext(output_file)[0]}_intermediate_{batch_idx}.csv"
                df_combined.to_csv(intermediate_file, index=False)
                print(f"Saved {len(df_augmented)} augmented samples to {intermediate_file}")

    # Add technique column to original data
    df_original = df.copy()
    df_original['technique'] = 'original'

    # Create a DataFrame with augmented data
    if augmented_data:
        df_augmented = pd.DataFrame(augmented_data)

        # Combine original and augmented data
        df_combined = pd.concat([df_original, df_augmented], ignore_index=True)

        # Display augmented dataset statistics
        print(f"\nAugmented dataset size: {len(df_combined)} reviews")
        print(f"Added {len(df_augmented)} augmented samples")
        print(f"Percentage increase: {(len(df_combined) - len(df)) / len(df) * 100:.2f}%")

        # Save the augmented dataset
        print(f"Saving augmented dataset to {output_file}...")
        df_combined.to_csv(output_file, index=False)
        print("Done!")

        # Check if running in Google Colab and trigger download
        try:
            import google.colab
            print("Running in Google Colab - initiating download...")
            files.download(output_file)
            print(f"File {output_file} should be downloading to your local machine.")
        except ImportError:
            print("Not running in Google Colab, skipping automatic download.")

        # Display a few examples of original and augmented reviews
        print("\nExamples of original and augmented reviews:")
        for i in range(1, augmentations_per_sample + 1):
            technique = f'contextual_aug_{i}'
            aug_examples = df_combined[df_combined['technique'] == technique].head(2)

            for _, row in aug_examples.iterrows():
                print(f"\nAugmented (contextual):")
                print(row['review'][:200] + "..." if len(row['review']) > 200 else row['review'])

                # Try to find the original review.
                # NOTE: augmented rows do not store their source index, so this
                # lookup is a heuristic and may print a different review.
                try:
                    original_idx = df_to_augment.loc[
                        (df_to_augment['label'] == row['label']) &
                        (df_to_augment.index % len(df_to_augment) ==
                         row.name % len(df_to_augment))
                    ].index

                    if not original_idx.empty:
                        original_review = df.loc[original_idx[0], 'review']
                        print("\nOriginal:")
                        print(original_review[:200] + "..." if len(original_review) > 200 else original_review)
                except Exception:
                    print("Could not retrieve original review for comparison")
    else:
        print("No augmented data was generated. Saving original dataset only.")
        df_original.to_csv(output_file, index=False)

        # Check if running in Google Colab and trigger download
        try:
            import google.colab
            print("Running in Google Colab - initiating download...")
            files.download(output_file)
            print(f"File {output_file} should be downloading to your local machine.")
        except ImportError:
            print("Not running in Google Colab, skipping automatic download.")


if __name__ == "__main__":
    # Set the paths for input and output files
    input_file = "imdb_train_dataset.csv"
    output_file = "imdb_train_contextual_augmented.csv"

    # Set the fraction of the dataset to augment
    sample_fraction = 0.8  # Augment 80% of the dataset

    # Set number of augmentations per sample
    augmentations_per_sample = 1

    # Set batch size for progress reporting
    batch_size = 1

    # Set maximum review length (characters) to process
    max_reviews_length = 8000

    # Run the augmentation
    augment_imdb_with_contextual(
        input_file=input_file,
        output_file=output_file,
        sample_fraction=sample_fraction,
        augmentations_per_sample=augmentations_per_sample,
        batch_size=batch_size,
        max_reviews_length=max_reviews_length
    )

Loading dataset from imdb_train_dataset.csv...
Original dataset size: 25000 reviews
Class distribution: {1: 12500, 0: 12500}
Using CUDA device: NVIDIA A100-SXM4-40GB
Detected A100 GPU - optimizing settings
Loading BERT model and tokenizer...


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Sampling 20000 reviews for augmentation...
Using 19992 reviews after filtering by length

Generating contextual augmentations...


Processing batches:   1%|          | 100/19992 [00:55<2:23:23,  2.31it/s]

Created 100 augmented samples so far


Processing batches:   1%|          | 200/19992 [01:36<3:06:15,  1.77it/s]

Created 200 augmented samples so far


Processing batches:   2%|▏         | 300/19992 [02:20<1:56:50,  2.81it/s]

Created 300 augmented samples so far


Processing batches:   2%|▏         | 400/19992 [03:15<1:52:04,  2.91it/s]

Created 400 augmented samples so far


Processing batches:   3%|▎         | 500/19992 [03:59<2:28:10,  2.19it/s]

Created 500 augmented samples so far


Processing batches:   3%|▎         | 600/19992 [04:55<2:35:10,  2.08it/s]

Created 600 augmented samples so far


Processing batches:   4%|▎         | 700/19992 [05:49<2:15:00,  2.38it/s]

Created 700 augmented samples so far


Processing batches:   4%|▍         | 800/19992 [06:47<5:28:52,  1.03s/it]

Created 800 augmented samples so far


Processing batches:   5%|▍         | 900/19992 [07:29<1:45:38,  3.01it/s]

Created 900 augmented samples so far


Processing batches:   5%|▌         | 1000/19992 [08:24<1:58:36,  2.67it/s]

Created 1000 augmented samples so far


Processing batches:   6%|▌         | 1100/19992 [09:12<2:20:56,  2.23it/s]

Created 1100 augmented samples so far


Processing batches:   6%|▌         | 1200/19992 [10:02<1:59:24,  2.62it/s]

Created 1200 augmented samples so far


Processing batches:   6%|▋         | 1299/19992 [10:51<2:39:38,  1.95it/s]

Created 1300 augmented samples so far


Processing batches:   7%|▋         | 1400/19992 [11:47<3:57:09,  1.31it/s]

Created 1400 augmented samples so far


Processing batches:   8%|▊         | 1500/19992 [12:45<3:50:55,  1.33it/s]

Created 1500 augmented samples so far


Processing batches:   8%|▊         | 1600/19992 [13:37<3:50:52,  1.33it/s]

Created 1600 augmented samples so far


Processing batches:   9%|▊         | 1700/19992 [14:35<5:13:39,  1.03s/it]

Created 1700 augmented samples so far


Processing batches:   9%|▉         | 1800/19992 [15:26<1:49:30,  2.77it/s]

Created 1800 augmented samples so far


Processing batches:  10%|▉         | 1900/19992 [16:08<2:18:03,  2.18it/s]

Created 1900 augmented samples so far


Processing batches:  10%|▉         | 1999/19992 [17:02<2:33:23,  1.96it/s]

Created 2000 augmented samples so far
Saving intermediate results after processing 1999 batches...


Processing batches:  10%|█         | 2001/19992 [17:04<2:52:56,  1.73it/s]

Saved 2000 augmented samples to imdb_train_contextual_augmented_intermediate_1999.csv


Processing batches:  11%|█         | 2100/19992 [18:00<2:23:18,  2.08it/s]

Created 2100 augmented samples so far


Processing batches:  11%|█         | 2200/19992 [18:46<2:22:35,  2.08it/s]

Created 2200 augmented samples so far


Processing batches:  12%|█▏        | 2300/19992 [19:27<1:55:06,  2.56it/s]

Created 2300 augmented samples so far


Processing batches:  12%|█▏        | 2400/19992 [20:19<2:30:00,  1.95it/s]

Created 2400 augmented samples so far


Processing batches:  13%|█▎        | 2501/19992 [21:12<1:22:13,  3.55it/s]

Created 2500 augmented samples so far


Processing batches:  13%|█▎        | 2600/19992 [21:56<1:59:18,  2.43it/s]

Created 2600 augmented samples so far


Processing batches:  14%|█▎        | 2700/19992 [22:45<2:06:26,  2.28it/s]

Created 2700 augmented samples so far


Processing batches:  14%|█▍        | 2800/19992 [23:37<2:08:06,  2.24it/s]

Created 2800 augmented samples so far


Processing batches:  15%|█▍        | 2900/19992 [24:25<2:28:52,  1.91it/s]

Created 2900 augmented samples so far


Processing batches:  15%|█▌        | 3000/19992 [25:22<3:20:14,  1.41it/s]

Created 3000 augmented samples so far


Processing batches:  16%|█▌        | 3101/19992 [26:21<2:05:27,  2.24it/s]

Created 3100 augmented samples so far


Processing batches:  16%|█▌        | 3200/19992 [27:12<1:54:35,  2.44it/s]

Created 3200 augmented samples so far


Processing batches:  17%|█▋        | 3300/19992 [28:10<2:17:13,  2.03it/s]

Created 3300 augmented samples so far


Processing batches:  17%|█▋        | 3400/19992 [29:03<1:45:45,  2.61it/s]

Created 3400 augmented samples so far


Processing batches:  18%|█▊        | 3500/19992 [29:50<2:06:11,  2.18it/s]

Created 3500 augmented samples so far


Processing batches:  18%|█▊        | 3600/19992 [30:37<1:42:58,  2.65it/s]

Created 3600 augmented samples so far


Processing batches:  19%|█▊        | 3700/19992 [31:27<2:09:28,  2.10it/s]

Created 3700 augmented samples so far


Processing batches:  19%|█▉        | 3800/19992 [32:19<2:06:41,  2.13it/s]

Created 3800 augmented samples so far


Processing batches:  20%|█▉        | 3901/19992 [33:14<1:40:06,  2.68it/s]

Created 3900 augmented samples so far


Processing batches:  20%|█▉        | 3998/19992 [34:00<1:31:39,  2.91it/s]

Saving intermediate results after processing 3998 batches...


Processing batches:  20%|██        | 3999/19992 [34:01<2:59:10,  1.49it/s]

Saved 3999 augmented samples to imdb_train_contextual_augmented_intermediate_3998.csv


Processing batches:  20%|██        | 4000/19992 [34:01<2:35:29,  1.71it/s]

Created 4000 augmented samples so far


Processing batches:  21%|██        | 4100/19992 [34:50<2:00:03,  2.21it/s]

Created 4100 augmented samples so far


Processing batches:  21%|██        | 4200/19992 [35:38<1:39:21,  2.65it/s]

Created 4200 augmented samples so far


Processing batches:  22%|██▏       | 4300/19992 [36:32<1:41:14,  2.58it/s]

Created 4300 augmented samples so far


Processing batches:  22%|██▏       | 4401/19992 [37:16<1:39:06,  2.62it/s]

Created 4400 augmented samples so far


Processing batches:  23%|██▎       | 4500/19992 [38:08<1:39:45,  2.59it/s]

Created 4500 augmented samples so far


Processing batches:  23%|██▎       | 4601/19992 [39:05<1:26:00,  2.98it/s]

Created 4600 augmented samples so far


Processing batches:  24%|██▎       | 4700/19992 [39:52<1:01:17,  4.16it/s]

Created 4700 augmented samples so far


Processing batches:  24%|██▍       | 4801/19992 [40:40<1:27:48,  2.88it/s]

Created 4800 augmented samples so far


Processing batches:  25%|██▍       | 4900/19992 [41:29<1:43:27,  2.43it/s]

Created 4900 augmented samples so far


Processing batches:  25%|██▌       | 5000/19992 [42:18<1:56:49,  2.14it/s]

Created 5000 augmented samples so far


Processing batches:  26%|██▌       | 5100/19992 [43:09<2:24:52,  1.71it/s]

Created 5100 augmented samples so far


Processing batches:  26%|██▌       | 5200/19992 [44:00<1:49:51,  2.24it/s]

Created 5200 augmented samples so far


Processing batches:  27%|██▋       | 5300/19992 [44:50<2:27:17,  1.66it/s]

Created 5300 augmented samples so far


Processing batches:  27%|██▋       | 5399/19992 [45:37<1:55:34,  2.10it/s]

Created 5400 augmented samples so far


Processing batches:  28%|██▊       | 5500/19992 [46:26<2:45:12,  1.46it/s]

Created 5500 augmented samples so far


Processing batches:  28%|██▊       | 5600/19992 [47:08<2:59:50,  1.33it/s]

Created 5600 augmented samples so far


Processing batches:  29%|██▊       | 5700/19992 [48:03<1:43:47,  2.29it/s]

Created 5700 augmented samples so far


Processing batches:  29%|██▉       | 5800/19992 [48:52<1:01:46,  3.83it/s]

Created 5800 augmented samples so far


Processing batches:  30%|██▉       | 5901/19992 [49:44<2:18:08,  1.70it/s]

Created 5900 augmented samples so far


Processing batches:  30%|██▉       | 5997/19992 [50:26<2:03:13,  1.89it/s]

Saving intermediate results after processing 5997 batches...


Processing batches:  30%|███       | 5998/19992 [50:28<3:43:04,  1.05it/s]

Saved 5998 augmented samples to imdb_train_contextual_augmented_intermediate_5997.csv


Processing batches:  30%|███       | 6000/19992 [50:29<2:45:43,  1.41it/s]

Created 6000 augmented samples so far


Processing batches:  31%|███       | 6100/19992 [51:23<2:02:53,  1.88it/s]

Created 6100 augmented samples so far


Processing batches:  31%|███       | 6200/19992 [52:09<3:19:41,  1.15it/s]

Created 6200 augmented samples so far


Processing batches:  32%|███▏      | 6300/19992 [52:53<1:22:35,  2.76it/s]

Created 6300 augmented samples so far


Processing batches:  32%|███▏      | 6400/19992 [53:51<1:22:33,  2.74it/s]

Created 6400 augmented samples so far


Processing batches:  33%|███▎      | 6500/19992 [54:45<1:42:23,  2.20it/s]

Created 6500 augmented samples so far


Processing batches:  33%|███▎      | 6600/19992 [55:38<1:41:52,  2.19it/s]

Created 6600 augmented samples so far


Processing batches:  34%|███▎      | 6701/19992 [56:39<1:12:03,  3.07it/s]

Created 6700 augmented samples so far


Processing batches:  34%|███▍      | 6800/19992 [57:27<1:36:38,  2.28it/s]

Created 6800 augmented samples so far


Processing batches:  35%|███▍      | 6900/19992 [58:21<2:26:59,  1.48it/s]

Created 6900 augmented samples so far


Processing batches:  35%|███▌      | 7000/19992 [59:13<1:29:54,  2.41it/s]

Created 7000 augmented samples so far


Processing batches:  36%|███▌      | 7100/19992 [1:00:06<2:36:29,  1.37it/s]

Created 7100 augmented samples so far


Processing batches:  36%|███▌      | 7200/19992 [1:00:52<2:04:19,  1.71it/s]

Created 7200 augmented samples so far


Processing batches:  37%|███▋      | 7300/19992 [1:01:43<1:59:16,  1.77it/s]

Created 7300 augmented samples so far


Processing batches:  37%|███▋      | 7400/19992 [1:02:40<1:22:46,  2.54it/s]

Created 7400 augmented samples so far


Processing batches:  38%|███▊      | 7500/19992 [1:03:29<1:22:57,  2.51it/s]

Created 7500 augmented samples so far


Processing batches:  38%|███▊      | 7600/19992 [1:04:21<1:44:27,  1.98it/s]

Created 7600 augmented samples so far


Processing batches:  39%|███▊      | 7701/19992 [1:05:10<50:25,  4.06it/s]

Created 7700 augmented samples so far


Processing batches:  39%|███▉      | 7800/19992 [1:06:01<2:10:36,  1.56it/s]

Created 7800 augmented samples so far


Processing batches:  40%|███▉      | 7900/19992 [1:06:44<1:07:29,  2.99it/s]

Created 7900 augmented samples so far


Processing batches:  40%|███▉      | 7996/19992 [1:07:36<2:54:45,  1.14it/s]

Saving intermediate results after processing 7996 batches...


Processing batches:  40%|████      | 7997/19992 [1:07:38<4:12:14,  1.26s/it]

Saved 7997 augmented samples to imdb_train_contextual_augmented_intermediate_7996.csv


Processing batches:  40%|████      | 8000/19992 [1:07:39<2:18:32,  1.44it/s]

Created 8000 augmented samples so far


Processing batches:  41%|████      | 8100/19992 [1:08:34<1:08:01,  2.91it/s]

Created 8100 augmented samples so far


Processing batches:  41%|████      | 8200/19992 [1:09:31<2:37:04,  1.25it/s]

Created 8200 augmented samples so far


Processing batches:  42%|████▏     | 8300/19992 [1:10:22<1:41:15,  1.92it/s]

Created 8300 augmented samples so far


Processing batches:  42%|████▏     | 8400/19992 [1:11:15<1:58:49,  1.63it/s]

Created 8400 augmented samples so far


Processing batches:  43%|████▎     | 8500/19992 [1:12:08<1:27:43,  2.18it/s]

Created 8500 augmented samples so far


Processing batches:  43%|████▎     | 8600/19992 [1:12:55<2:05:11,  1.52it/s]

Created 8600 augmented samples so far


Processing batches:  44%|████▎     | 8700/19992 [1:13:44<1:17:16,  2.44it/s]

Created 8700 augmented samples so far


Processing batches:  44%|████▍     | 8800/19992 [1:14:40<1:26:57,  2.15it/s]

Created 8800 augmented samples so far


Processing batches:  45%|████▍     | 8900/19992 [1:15:23<1:38:07,  1.88it/s]

Created 8900 augmented samples so far


Processing batches:  45%|████▌     | 9000/19992 [1:16:12<2:35:59,  1.17it/s]

Created 9000 augmented samples so far


Processing batches:  46%|████▌     | 9100/19992 [1:17:00<1:47:51,  1.68it/s]

Created 9100 augmented samples so far


Processing batches:  46%|████▌     | 9200/19992 [1:17:56<1:28:59,  2.02it/s]

Created 9200 augmented samples so far


Processing batches:  47%|████▋     | 9300/19992 [1:18:40<51:04,  3.49it/s]

Created 9300 augmented samples so far


Processing batches:  47%|████▋     | 9400/19992 [1:19:31<2:11:39,  1.34it/s]

Created 9400 augmented samples so far


Processing batches:  48%|████▊     | 9500/19992 [1:20:23<59:41,  2.93it/s]  

Created 9500 augmented samples so far


Processing batches:  48%|████▊     | 9600/19992 [1:21:21<2:03:01,  1.41it/s]

Created 9600 augmented samples so far


Processing batches:  49%|████▊     | 9700/19992 [1:22:14<1:36:48,  1.77it/s]

Created 9700 augmented samples so far


Processing batches:  49%|████▉     | 9800/19992 [1:23:00<55:30,  3.06it/s]

Created 9800 augmented samples so far


Processing batches:  50%|████▉     | 9900/19992 [1:23:52<1:20:11,  2.10it/s]

Created 9900 augmented samples so far


Processing batches:  50%|████▉     | 9995/19992 [1:24:42<51:19,  3.25it/s]

Saving intermediate results after processing 9995 batches...


Processing batches:  50%|█████     | 9996/19992 [1:24:44<2:23:13,  1.16it/s]

Saved 9996 augmented samples to imdb_train_contextual_augmented_intermediate_9995.csv


Processing batches:  50%|█████     | 10000/19992 [1:24:45<1:07:23,  2.47it/s]

Created 10000 augmented samples so far


Processing batches:  51%|█████     | 10100/19992 [1:25:33<1:03:38,  2.59it/s]

Created 10100 augmented samples so far


Processing batches:  51%|█████     | 10200/19992 [1:26:36<2:16:22,  1.20it/s]

Created 10200 augmented samples so far


Processing batches:  52%|█████▏    | 10300/19992 [1:27:29<1:00:59,  2.65it/s]

Created 10300 augmented samples so far


Processing batches:  52%|█████▏    | 10400/19992 [1:28:22<1:45:43,  1.51it/s]

Created 10400 augmented samples so far


Processing batches:  53%|█████▎    | 10500/19992 [1:29:10<51:09,  3.09it/s]

Created 10500 augmented samples so far


Processing batches:  53%|█████▎    | 10600/19992 [1:30:00<1:14:14,  2.11it/s]

Created 10600 augmented samples so far


Processing batches:  54%|█████▎    | 10700/19992 [1:30:49<57:32,  2.69it/s]  

Created 10700 augmented samples so far


Processing batches:  54%|█████▍    | 10800/19992 [1:31:45<2:09:44,  1.18it/s]

Created 10800 augmented samples so far


Processing batches:  55%|█████▍    | 10900/19992 [1:32:38<1:07:28,  2.25it/s]

Created 10900 augmented samples so far


Processing batches:  55%|█████▌    | 11000/19992 [1:33:32<1:00:06,  2.49it/s]

Created 11000 augmented samples so far


Processing batches:  56%|█████▌    | 11100/19992 [1:34:25<51:09,  2.90it/s]

Created 11100 augmented samples so far


Processing batches:  56%|█████▌    | 11200/19992 [1:35:19<56:14,  2.61it/s]  

Created 11200 augmented samples so far


Processing batches:  57%|█████▋    | 11300/19992 [1:36:08<1:23:15,  1.74it/s]

Created 11300 augmented samples so far


Processing batches:  57%|█████▋    | 11400/19992 [1:37:04<1:27:51,  1.63it/s]

Created 11400 augmented samples so far


Processing batches:  58%|█████▊    | 11500/19992 [1:37:49<1:15:22,  1.88it/s]

Created 11500 augmented samples so far


Processing batches:  58%|█████▊    | 11600/19992 [1:38:37<1:26:43,  1.61it/s]

Created 11600 augmented samples so far


Processing batches:  59%|█████▊    | 11700/19992 [1:39:30<1:24:28,  1.64it/s]

Created 11700 augmented samples so far


Processing batches:  59%|█████▉    | 11800/19992 [1:40:21<59:57,  2.28it/s]

Created 11800 augmented samples so far


Processing batches:  60%|█████▉    | 11900/19992 [1:41:11<1:15:40,  1.78it/s]

Created 11900 augmented samples so far


Processing batches:  60%|█████▉    | 11994/19992 [1:42:00<1:16:27,  1.74it/s]

Saving intermediate results after processing 11994 batches...


Processing batches:  60%|█████▉    | 11995/19992 [1:42:02<2:02:05,  1.09it/s]

Saved 11995 augmented samples to imdb_train_contextual_augmented_intermediate_11994.csv


Processing batches:  60%|██████    | 12000/19992 [1:42:04<1:04:47,  2.06it/s]

Created 12000 augmented samples so far


Processing batches:  61%|██████    | 12100/19992 [1:42:54<53:18,  2.47it/s]

Created 12100 augmented samples so far


Processing batches:  61%|██████    | 12200/19992 [1:43:47<42:01,  3.09it/s]

Created 12200 augmented samples so far


Processing batches:  62%|██████▏   | 12300/19992 [1:44:37<1:20:00,  1.60it/s]

Created 12300 augmented samples so far


Processing batches:  62%|██████▏   | 12400/19992 [1:45:22<34:41,  3.65it/s]

Created 12400 augmented samples so far


Processing batches:  63%|██████▎   | 12500/19992 [1:46:11<1:06:53,  1.87it/s]

Created 12500 augmented samples so far


Processing batches:  63%|██████▎   | 12600/19992 [1:46:58<1:08:59,  1.79it/s]

Created 12600 augmented samples so far


Processing batches:  64%|██████▎   | 12700/19992 [1:47:50<1:08:35,  1.77it/s]

Created 12700 augmented samples so far


Processing batches:  64%|██████▍   | 12800/19992 [1:48:40<52:57,  2.26it/s]

Created 12800 augmented samples so far


Processing batches:  65%|██████▍   | 12900/19992 [1:49:31<49:37,  2.38it/s]

Created 12900 augmented samples so far


Processing batches:  65%|██████▌   | 13000/19992 [1:50:16<45:25,  2.57it/s]

Created 13000 augmented samples so far


Processing batches:  66%|██████▌   | 13100/19992 [1:51:11<57:18,  2.00it/s]  

Created 13100 augmented samples so far


Processing batches:  66%|██████▌   | 13200/19992 [1:52:02<45:25,  2.49it/s]

Created 13200 augmented samples so far


Processing batches:  67%|██████▋   | 13300/19992 [1:52:53<33:30,  3.33it/s]

Created 13300 augmented samples so far


Processing batches:  67%|██████▋   | 13400/19992 [1:53:42<46:52,  2.34it/s]

Created 13400 augmented samples so far


Processing batches:  68%|██████▊   | 13500/19992 [1:54:38<1:00:41,  1.78it/s]

Created 13500 augmented samples so far


Processing batches:  68%|██████▊   | 13600/19992 [1:55:24<45:59,  2.32it/s]  

Created 13600 augmented samples so far


Processing batches:  69%|██████▊   | 13700/19992 [1:56:12<37:45,  2.78it/s]

Created 13700 augmented samples so far


Processing batches:  69%|██████▉   | 13800/19992 [1:57:02<1:02:35,  1.65it/s]

Created 13800 augmented samples so far


Processing batches:  70%|██████▉   | 13900/19992 [1:57:54<50:38,  2.00it/s]

Created 13900 augmented samples so far


Processing batches:  70%|██████▉   | 13993/19992 [1:58:38<1:11:35,  1.40it/s]

Saving intermediate results after processing 13993 batches...


Processing batches:  70%|██████▉   | 13994/19992 [1:58:39<1:43:40,  1.04s/it]

Saved 13994 augmented samples to imdb_train_contextual_augmented_intermediate_13993.csv


Processing batches:  70%|███████   | 14000/19992 [1:58:41<41:32,  2.40it/s]

Created 14000 augmented samples so far


Processing batches:  71%|███████   | 14100/19992 [1:59:24<47:51,  2.05it/s]

Created 14100 augmented samples so far


Processing batches:  71%|███████   | 14200/19992 [2:00:11<40:17,  2.40it/s]

Created 14200 augmented samples so far


Processing batches:  72%|███████▏  | 14300/19992 [2:01:00<44:35,  2.13it/s]

Created 14300 augmented samples so far


Processing batches:  72%|███████▏  | 14400/19992 [2:01:49<37:24,  2.49it/s]

Created 14400 augmented samples so far


Processing batches:  73%|███████▎  | 14500/19992 [2:02:37<41:02,  2.23it/s]

Created 14500 augmented samples so far


Processing batches:  73%|███████▎  | 14600/19992 [2:03:25<28:18,  3.18it/s]

Created 14600 augmented samples so far


Processing batches:  74%|███████▎  | 14700/19992 [2:04:14<45:37,  1.93it/s]

Created 14700 augmented samples so far


Processing batches:  74%|███████▍  | 14800/19992 [2:05:06<1:02:18,  1.39it/s]

Created 14800 augmented samples so far


Processing batches:  75%|███████▍  | 14900/19992 [2:05:47<28:39,  2.96it/s]

Created 14900 augmented samples so far


Processing batches:  75%|███████▌  | 15000/19992 [2:06:39<40:30,  2.05it/s]

Created 15000 augmented samples so far


Processing batches:  76%|███████▌  | 15100/19992 [2:07:36<43:17,  1.88it/s]

Created 15100 augmented samples so far


Processing batches:  76%|███████▌  | 15200/19992 [2:08:28<30:27,  2.62it/s]

Created 15200 augmented samples so far


Processing batches:  77%|███████▋  | 15300/19992 [2:09:24<30:59,  2.52it/s]

Created 15300 augmented samples so far


Processing batches:  77%|███████▋  | 15400/19992 [2:10:17<40:13,  1.90it/s]

Created 15400 augmented samples so far


Processing batches:  78%|███████▊  | 15501/19992 [2:11:09<26:59,  2.77it/s]

Created 15500 augmented samples so far


Processing batches:  78%|███████▊  | 15600/19992 [2:11:58<23:18,  3.14it/s]

Created 15600 augmented samples so far


Processing batches:  79%|███████▊  | 15700/19992 [2:12:40<37:57,  1.88it/s]

Created 15700 augmented samples so far


Processing batches:  79%|███████▉  | 15800/19992 [2:13:30<37:02,  1.89it/s]

Created 15800 augmented samples so far


Processing batches:  80%|███████▉  | 15900/19992 [2:14:17<26:40,  2.56it/s]

Created 15900 augmented samples so far


Processing batches:  80%|███████▉  | 15992/19992 [2:14:59<27:36,  2.41it/s]

Saving intermediate results after processing 15992 batches...


Processing batches:  80%|███████▉  | 15993/19992 [2:15:02<1:10:08,  1.05s/it]

Saved 15993 augmented samples to imdb_train_contextual_augmented_intermediate_15992.csv


Processing batches:  80%|████████  | 16001/19992 [2:15:05<24:50,  2.68it/s]

Created 16000 augmented samples so far


Processing batches:  81%|████████  | 16100/19992 [2:15:48<24:44,  2.62it/s]

Created 16100 augmented samples so far


Processing batches:  81%|████████  | 16200/19992 [2:16:42<41:52,  1.51it/s]

Created 16200 augmented samples so far


Processing batches:  82%|████████▏ | 16300/19992 [2:17:32<29:44,  2.07it/s]

Created 16300 augmented samples so far


Processing batches:  82%|████████▏ | 16400/19992 [2:18:27<19:49,  3.02it/s]

Created 16400 augmented samples so far


Processing batches:  83%|████████▎ | 16500/19992 [2:19:16<23:03,  2.52it/s]

Created 16500 augmented samples so far


Processing batches:  83%|████████▎ | 16600/19992 [2:20:10<35:46,  1.58it/s]

Created 16600 augmented samples so far


Processing batches:  84%|████████▎ | 16700/19992 [2:21:03<20:42,  2.65it/s]

Created 16700 augmented samples so far


Processing batches:  84%|████████▍ | 16800/19992 [2:21:58<34:48,  1.53it/s]

Created 16800 augmented samples so far


Processing batches:  85%|████████▍ | 16900/19992 [2:22:51<25:56,  1.99it/s]

Created 16900 augmented samples so far


Processing batches:  85%|████████▌ | 17000/19992 [2:23:42<26:02,  1.91it/s]

Created 17000 augmented samples so far


Processing batches:  86%|████████▌ | 17100/19992 [2:24:33<18:38,  2.58it/s]

Created 17100 augmented samples so far


Processing batches:  86%|████████▌ | 17200/19992 [2:25:24<32:50,  1.42it/s]

Created 17200 augmented samples so far


Processing batches:  87%|████████▋ | 17300/19992 [2:26:12<14:36,  3.07it/s]

Created 17300 augmented samples so far


Processing batches:  87%|████████▋ | 17400/19992 [2:27:05<29:14,  1.48it/s]

Created 17400 augmented samples so far


Processing batches:  88%|████████▊ | 17500/19992 [2:27:53<24:55,  1.67it/s]

Created 17500 augmented samples so far


Processing batches:  88%|████████▊ | 17600/19992 [2:28:49<16:32,  2.41it/s]

Created 17600 augmented samples so far


Processing batches:  89%|████████▊ | 17700/19992 [2:29:36<26:59,  1.42it/s]

Created 17700 augmented samples so far


Processing batches:  89%|████████▉ | 17800/19992 [2:30:32<17:09,  2.13it/s]

Created 17800 augmented samples so far


Processing batches:  90%|████████▉ | 17900/19992 [2:31:20<21:25,  1.63it/s]

Created 17900 augmented samples so far


Processing batches:  90%|████████▉ | 17991/19992 [2:32:04<10:38,  3.13it/s]

Saving intermediate results after processing 17991 batches...


Processing batches:  90%|████████▉ | 17992/19992 [2:32:07<34:02,  1.02s/it]

Saved 17992 augmented samples to imdb_train_contextual_augmented_intermediate_17991.csv


Processing batches:  90%|█████████ | 18000/19992 [2:32:10<12:54,  2.57it/s]

Created 18000 augmented samples so far


Processing batches:  91%|█████████ | 18099/19992 [2:32:59<21:36,  1.46it/s]

Created 18100 augmented samples so far


Processing batches:  91%|█████████ | 18200/19992 [2:33:49<12:50,  2.33it/s]

Created 18200 augmented samples so far


Processing batches:  92%|█████████▏| 18300/19992 [2:34:31<16:44,  1.68it/s]

Created 18300 augmented samples so far


Processing batches:  92%|█████████▏| 18400/19992 [2:35:15<18:23,  1.44it/s]

Created 18400 augmented samples so far


Processing batches:  93%|█████████▎| 18500/19992 [2:36:10<11:52,  2.10it/s]

Created 18500 augmented samples so far


Processing batches:  93%|█████████▎| 18600/19992 [2:36:58<07:18,  3.18it/s]

Created 18600 augmented samples so far


Processing batches:  94%|█████████▎| 18700/19992 [2:37:49<08:45,  2.46it/s]

Created 18700 augmented samples so far


Processing batches:  94%|█████████▍| 18800/19992 [2:38:36<12:55,  1.54it/s]

Created 18800 augmented samples so far


Processing batches:  95%|█████████▍| 18900/19992 [2:39:30<05:09,  3.53it/s]

Created 18900 augmented samples so far


Processing batches:  95%|█████████▌| 19000/19992 [2:40:17<04:40,  3.54it/s]

Created 19000 augmented samples so far


Processing batches:  96%|█████████▌| 19100/19992 [2:41:09<03:59,  3.72it/s]

Created 19100 augmented samples so far


Processing batches:  96%|█████████▌| 19200/19992 [2:41:59<06:14,  2.11it/s]

Created 19200 augmented samples so far


Processing batches:  97%|█████████▋| 19300/19992 [2:42:54<04:35,  2.51it/s]

Created 19300 augmented samples so far


Processing batches:  97%|█████████▋| 19400/19992 [2:43:43<03:29,  2.83it/s]

Created 19400 augmented samples so far


Processing batches:  98%|█████████▊| 19500/19992 [2:44:34<04:04,  2.01it/s]

Created 19500 augmented samples so far


Processing batches:  98%|█████████▊| 19600/19992 [2:45:25<02:06,  3.10it/s]

Created 19600 augmented samples so far


Processing batches:  99%|█████████▊| 19700/19992 [2:46:14<02:30,  1.94it/s]

Created 19700 augmented samples so far


Processing batches:  99%|█████████▉| 19800/19992 [2:47:08<02:52,  1.11it/s]

Created 19800 augmented samples so far


Processing batches: 100%|█████████▉| 19900/19992 [2:47:58<00:28,  3.18it/s]

Created 19900 augmented samples so far


Processing batches: 100%|█████████▉| 19990/19992 [2:48:41<00:00,  2.63it/s]

Saving intermediate results after processing 19990 batches...


Processing batches: 100%|█████████▉| 19991/19992 [2:48:43<00:01,  1.08s/it]

Saved 19991 augmented samples to imdb_train_contextual_augmented_intermediate_19990.csv


Processing batches: 100%|██████████| 19992/19992 [2:48:44<00:00,  1.97it/s]



Augmented dataset size: 44992 reviews
Added 19992 augmented samples
Percentage increase: 79.97%
Saving augmented dataset to imdb_train_contextual_augmented.csv...
Done!
Running in Google Colab - initiating download...


File imdb_train_contextual_augmented.csv should be downloading to your local machine.

Examples of original and augmented reviews:

Augmented (contextual):
that movie is an eye out for those who can see the dream lifestyles of the stars . It reminds you how people who just like to do so are not allowed to . Plus the gas blast itself is for real . <br / >...

Original:
And a rather Unexpected plot line too-for the era: there is Plague in the City of New Orleans-and only Richard Widmark can stop it! Elia Kazan's trademark subjects: waterfronts, working men, crowds, f...

Augmented (contextual):
- unknown Michelle Rodriguez as Diana was a stroke of genius . She ' s perfect . Her acting on actually works in her favor . We ' ve never seen her before so it just sounds fitting her relationship . ...

Original:
Yesterday I attended the world premiere of "Descent" at the Tribeca Film Festival in NYC. I had a great time. It was sold out and attended by all the major stars including fellow my-spa