## 1. Introduction
This notebook shows how **TextBugger** mounts adversarial attacks on a BERT-based sentiment analysis model.

TextBugger makes small changes to the input text, such as inserting, deleting, swapping, or substituting characters, or replacing a word with a close neighbour, to "fool" the model into giving the wrong prediction.
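To make the kinds of edits concrete, here is a minimal pure-Python sketch of four character-level perturbations in the spirit of TextBugger. The function names and the `HOMOGLYPHS` table are hypothetical and chosen for illustration; they are not TextAttack's implementation, and the real attack also decides *which* words to perturb, which this sketch does not attempt.

```python
# Hypothetical look-alike table for illustration only (not TextAttack's mapping).
HOMOGLYPHS = {"a": "\u0251", "o": "0", "l": "1", "e": "3"}


def insert_space(word, i):
    """Insert a space before position i, splitting the word in two."""
    return word[:i] + " " + word[i:]


def delete_char(word, i):
    """Delete the character at position i."""
    return word[:i] + word[i + 1:]


def swap_adjacent(word, i):
    """Swap the characters at positions i and i + 1."""
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def substitute_char(word):
    """Replace the first character that has a look-alike in HOMOGLYPHS."""
    for i, ch in enumerate(word):
        if ch in HOMOGLYPHS:
            return word[:i] + HOMOGLYPHS[ch] + word[i + 1:]
    return word


word = "amazing"
print(insert_space(word, 3))   # ama zing
print(delete_char(word, 3))    # amaing
print(swap_adjacent(word, 3))  # amaizng
print(substitute_char(word))   # first "a" replaced by a look-alike character
```

Each edit keeps the word readable to a person while changing how the tokenizer splits it, which is what gives the attack its leverage.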

In [1]:
!pip install transformers torch datasets textattack --quiet


In [2]:
!pip install --upgrade datasets



In [3]:
from transformers import BertTokenizer, BertForSequenceClassification
from datasets import load_dataset

dataset = load_dataset('imdb')

# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [4]:
# Tokenization function for the dataset
def tokenize_function(examples):
    return tokenizer(examples['text'], padding="max_length", truncation=True, max_length=512)

# Apply tokenizer to the dataset
tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])



In [None]:
import torch
from transformers import Trainer, TrainingArguments

# Report the available device (the Trainer uses the GPU automatically when present)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
    report_to="none",  # disables W&B logging; replaces the deprecated WANDB_DISABLED variable
)


trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test'],
)

trainer.train()


Using device: cuda



Step,Training Loss
10,0.7342
20,0.6998
30,0.7123
40,0.6732
50,0.6716
60,0.6723
70,0.6775
80,0.653
90,0.6679
100,0.6661



## 2. Fine-tuning BERT
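The training cell above fine-tunes BERT on IMDb before the attack. As written, it requests 3 epochs over the full training split, and the log shown covers only the first 100 steps; to demonstrate the attack process quickly, 1 epoch on a small portion of IMDb is enough.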
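As a rough check on how long a run takes, the number of optimizer steps follows from the dataset size, batch size, and epoch count. The sketch below assumes a single device and no gradient accumulation; the 2,000-example subset size is a hypothetical value for a quick run, not something taken from this notebook.

```python
import math


def total_optimizer_steps(num_examples, batch_size, num_epochs):
    """Optimizer steps for a run, assuming one device and no gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


# Full IMDb training split with the settings from the training cell
full_run = total_optimizer_steps(25_000, batch_size=8, num_epochs=3)

# Hypothetical quick run: 2,000 examples for a single epoch
quick_run = total_optimizer_steps(2_000, batch_size=8, num_epochs=1)

print(full_run)   # 9375
print(quick_run)  # 250
```

Under these assumptions, the 100 logged steps are about 1% of the full run and fall entirely inside the 500-step warmup, which may be part of why the loss is still close to 0.69. With a quick run of this size, `warmup_steps=500` would exceed the total step count, so the warmup value would need to be lowered as well.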

In [None]:
from textattack.attack_recipes import TextBuggerLi2018
from textattack.models.wrappers import HuggingFaceModelWrapper

# Wrap the model and tokenizer so TextAttack can query them
model_wrapper = HuggingFaceModelWrapper(model, tokenizer)

# Build the TextBugger attack recipe
attack = TextBuggerLi2018.build(model_wrapper)

original_sentences = [
    "This movie is amazing! I loved it!",
    "It was the worst film I've ever seen.",
]

# Ground-truth labels: 1 = positive, 0 = negative
labels = [1, 0]

for sentence, label in zip(original_sentences, labels):
    result = attack.attack(sentence, label)
    print("Original Sentence:", result.original_text())
    print("Perturbed Sentence:", result.perturbed_text())
    print("-" * 80)


textattack: Unknown if model of class <class 'transformers.models.bert.modeling_bert.BertForSequenceClassification'> compatible with goal function <class 'textattack.goal_functions.classification.untargeted_classification.UntargetedClassification'>.


Original Sentence: This movie is amazing! I loved it!
Perturbed Sentence: This movie is ɑmazing! I loved it!
--------------------------------------------------------------------------------
Original Sentence: It was the worst film I've ever seen.
Perturbed Sentence: It was the pire film I've ever seen.
--------------------------------------------------------------------------------
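The perturbations in this output are easy to miss by eye: the first swaps a single letter for a look-alike character, and the second replaces a whole word. A small helper, sketched here with hypothetical function names, can list the words that changed and any non-ASCII characters that were introduced. It pairs tokens by position, so it assumes the perturbation did not add or remove whitespace.

```python
import unicodedata


def changed_words(original, perturbed):
    """Pair whitespace-separated tokens by position and keep the pairs that differ."""
    return [(o, p) for o, p in zip(original.split(), perturbed.split()) if o != p]


def non_ascii_chars(text):
    """Return (character, Unicode name) for every non-ASCII character in text."""
    return [(ch, unicodedata.name(ch, "UNKNOWN")) for ch in text if ord(ch) > 127]


# Sentence pairs copied from the attack output above
pairs = [
    ("This movie is amazing! I loved it!", "This movie is ɑmazing! I loved it!"),
    ("It was the worst film I've ever seen.", "It was the pire film I've ever seen."),
]

for original, perturbed in pairs:
    print("Changed words:", changed_words(original, perturbed))
    print("Non-ASCII characters:", non_ascii_chars(perturbed))
    print("-" * 80)
```

Printing the Unicode names makes a character substitution explicit, while a word-level replacement shows up only in the changed-words list.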


In [5]:
import torch
import torch.nn.functional as F

def predict_sentiment(text):
    # Tokenize input text and move the tensors to the model's device
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
    inputs = {name: tensor.to(model.device) for name, tensor in inputs.items()}

    # Put model in eval mode (disables dropout) and get logits
    model.eval()
    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits

    # Apply softmax to get probabilities
    probs = F.softmax(logits, dim=1)
    predicted_class = torch.argmax(probs, dim=1).item()

    # Map label to sentiment
    label_map = {0: "Negative", 1: "Positive"}
    return label_map[predicted_class], probs[0].tolist()
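For reference, the softmax step in `predict_sentiment` can be reproduced in plain Python. This is a minimal sketch of the standard formula for a single example; the logit values are made up for illustration and do not come from the model.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    largest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Equal logits give equal probabilities
print(softmax([0.0, 0.0]))  # [0.5, 0.5]

# A larger logit for class 0 shifts probability toward "Negative"
probs = softmax([2.0, 0.0])
predicted_class = probs.index(max(probs))
print(predicted_class)  # 0
```

With two classes, the probabilities depend only on the difference between the two logits, so near-identical probabilities across very different sentences indicate that the logits barely change with the input.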


In [6]:
sentences = [
    "I really enjoyed this movie, it was fantastic!",
    "This film was boring and too long.",
    "The plot had some interesting twists.",
    "Acting was terrible and the script was weak.",
    "An absolute masterpiece with stunning visuals."
]

for sent in sentences:
    label, probs = predict_sentiment(sent)
    print(f"Sentence: {sent}")
    print(f"Predicted Sentiment: {label} (Confidence: {probs})")
    print("-" * 80)


Sentence: I really enjoyed this movie, it was fantastic!
Predicted Sentiment: Negative (Confidence: [0.7090131044387817, 0.29098689556121826])
--------------------------------------------------------------------------------
Sentence: This film was boring and too long.
Predicted Sentiment: Negative (Confidence: [0.6965874433517456, 0.30341264605522156])
--------------------------------------------------------------------------------
Sentence: The plot had some interesting twists.
Predicted Sentiment: Negative (Confidence: [0.6968936324119568, 0.303106427192688])
--------------------------------------------------------------------------------
Sentence: Acting was terrible and the script was weak.
Predicted Sentiment: Negative (Confidence: [0.7054392695426941, 0.29456081986427307])
--------------------------------------------------------------------------------
Sentence: An absolute masterpiece with stunning visuals.
Predicted Sentiment: Negative (Confidence: [0.6957298517227173, 0.304270