# Defending Against Adversarial Attacks

### Scenario

* **Task**: Sentiment analysis (positive/negative classification of text).
* **Adversarial Attack**: Subtle word substitutions or typos to flip the sentiment prediction.
* **Defense**: Input preprocessing to detect and neutralize adversarial perturbations.
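
To make the "typos" attack concrete, here is a minimal sketch of a character-swap perturbation. The helper names `swap_adjacent_chars` and `perturb_text`, the `rate` parameter, and the fixed seed are assumptions made for illustration; they are not part of textattack. A perturbation like this is not guaranteed to flip a prediction; it only produces candidate inputs to test against the model.

```python
import random

def swap_adjacent_chars(word, rng):
    """Swap one pair of adjacent interior characters, keeping the first and last fixed."""
    if len(word) < 4:
        return word  # too short to perturb without touching the edges
    i = rng.randrange(1, len(word) - 2)  # index of the left character in the swapped pair
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)  # unchanged if the two swapped characters are identical

def perturb_text(text, rate=0.3, seed=0):
    """Apply the typo perturbation to roughly `rate` of the whitespace-separated words."""
    rng = random.Random(seed)  # seeded so the perturbation is reproducible
    return " ".join(
        swap_adjacent_chars(w, rng) if rng.random() < rate else w
        for w in text.split()
    )

print(perturb_text("I really enjoyed the movie, it was great!"))
```

Each candidate can then be passed to the sentiment model to see whether the label changes.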

**Setup**:
* We use DistilBERT fine-tuned on SST-2 (sentiment analysis) as a lightweight proxy for LLaMA, since LLaMA is not distributed with a ready-made sentiment classification head and its weights are access-gated. LLaMA would work similarly with a tokenizer and sequence classification head.
* Install dependencies: <code>pip install transformers torch textattack</code>. (<code>re</code> is part of the Python standard library and does not need to be installed.)

In [1]:
import importlib.util

def check_install(package_name):
    if importlib.util.find_spec(package_name) is None:
        !pip install {package_name}
    else:
        print(f"'{package_name}' is already installed.")

In [2]:
check_install('transformers')
check_install('torch')
check_install('textattack')

'transformers' is already installed.
'torch' is already installed.
'textattack' is already installed.


In [3]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from textattack.augmentation import WordNetAugmenter
import re



In [None]:
# Load a pre-trained model and tokenizer (simulating LLaMA-like behavior)
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

# Function to predict sentiment
def predict_sentiment(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    logits = outputs.logits
    prediction = torch.argmax(logits, dim=1).item()
    return "Positive" if prediction == 1 else "Negative"

# Original input
# The clean input ("I really enjoyed the movie, it was great!") is classified as "Positive."
clean_text = "I really enjoyed the movie, it was great!"
print("Original Text:", clean_text)
print("Original Prediction:", predict_sentiment(clean_text))

# Adversarial attack: perturb the text so that the predicted sentiment flips.
# In practice, tools like textattack can generate subtle adversarial examples
# (e.g., synonym swaps or typos). WordNetAugmenter produces such a variant below,
# but a synonym swap is not guaranteed to flip the prediction.
augmenter = WordNetAugmenter()
augmented_text = augmenter.augment(clean_text)[0]  # Example synonym perturbation (not used below)
# For simplicity, the demo uses a manually crafted example that the model labels "Negative".
# Note that it changes the meaning of the sentence, so it is not a subtle perturbation.
adv_text = "I barely tolerated the film, it was awful!"
print("Adversarial Text:", adv_text)
print("Adversarial Prediction:", predict_sentiment(adv_text))

# Defense: input preprocessing
# - Preprocessing: preprocess_defense cleans the input by removing unusual characters
#   and normalizing text. For LLaMA, you might extend this with semantic checks
#   (e.g., embedding similarity to trusted inputs).
# - Adversarial detection: detect_adversarial (defined further down) compares token sets
#   between the original and perturbed inputs. A significant difference flags a potential attack.
def preprocess_defense(text):
    # Simple sanitization: Remove suspicious patterns, normalize text
    text = re.sub(r'[^\w\s.,!?]', '', text)  # Remove odd characters
    text = text.lower().strip()  # Normalize case and whitespace
    # Add more sophisticated checks (e.g., cosine similarity to known good inputs)
    return text

# Apply defense
defended_text = preprocess_defense(adv_text)
print("Defended Text:", defended_text)
print("Defended Prediction:", predict_sentiment(defended_text))

# Bonus: Check for adversarial intent (rudimentary anomaly detection)
def detect_adversarial(original, perturbed):
    orig_tokens = set(original.lower().split())
    pert_tokens = set(perturbed.lower().split())
    diff = len(orig_tokens.symmetric_difference(pert_tokens))
    threshold = 3  # Arbitrary threshold for significant change
    return diff > threshold

if detect_adversarial(clean_text, adv_text):
    print("Warning: Possible adversarial input detected!")

Original Text: I really enjoyed the movie, it was great!
Original Prediction: Positive


Adversarial Text: I barely tolerated the film, it was awful!
Adversarial Prediction: Negative
Defended Text: i barely tolerated the film, it was awful!
Defended Prediction: Negative
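
The token-set check in detect_adversarial uses an absolute threshold and treats "movie," and "movie" as different tokens. A minimal sketch of a length-normalized variant is shown below, using Jaccard distance over punctuation-free tokens. The names `tokenize`, `jaccard_distance`, and `detect_adversarial_jaccard`, as well as the 0.5 threshold, are assumptions chosen for illustration. Like the original check, it still needs a trusted reference text to compare against.

```python
import re

def tokenize(text):
    """Lowercase and keep only word characters, so 'movie,' and 'movie' match."""
    return set(re.findall(r"\w+", text.lower()))

def jaccard_distance(a, b):
    """Return 1 - |A ∩ B| / |A ∪ B| over the two token sets (0.0 means identical sets)."""
    ta, tb = tokenize(a), tokenize(b)
    if not ta and not tb:
        return 0.0  # two empty inputs are treated as identical
    return 1.0 - len(ta & tb) / len(ta | tb)

def detect_adversarial_jaccard(original, perturbed, threshold=0.5):
    """Flag the input when more than `threshold` of the combined vocabulary differs."""
    return jaccard_distance(original, perturbed) > threshold

clean_text = "I really enjoyed the movie, it was great!"
adv_text = "I barely tolerated the film, it was awful!"
print(round(jaccard_distance(clean_text, adv_text), 2))   # 0.67 (4 shared tokens out of 12)
print(detect_adversarial_jaccard(clean_text, adv_text))   # True
```

Because the distance is a ratio, the same threshold behaves comparably for short and long inputs, which the fixed count of 3 does not.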


### Applying to LLaMA Specifically
* **Accessing LLaMA**: If you have access to LLaMA (e.g., via Meta AI’s research release), replace the model loading with LLaMA’s weights and adjust the tokenizer accordingly. The process remains the same: encode text, predict, perturb, defend.
* **Attack Sophistication**: For LLaMA, adversarial attacks might involve prompt injection (e.g., appending "Ignore previous instructions" to mislead it) or subtle paraphrasing. Use libraries like textattack or OpenAttack for realistic text perturbations.
* **Enhanced Defense**: LLaMA’s scale allows for more robust defenses:
    * **Embedding Checks**: Compare input embeddings to a trusted corpus using cosine similarity.
    * **Contextual Analysis**: Use LLaMA itself to detect semantic inconsistency (e.g., "Does this input align with typical sentiment patterns?").

### Limitations
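* The basic preprocessing here does not neutralize the adversarial example: the defended prediction is still "Negative", because lowercasing and stripping unusual characters do not address semantic shifts (e.g., "enjoyed" to "tolerated"). In addition, detect_adversarial needs the original text for comparison, which is usually unavailable at inference time. A real defense might require training LLaMA with adversarial examples or using a secondary model to filter inputs.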
* This demo simplifies LLaMA’s complexity. In practice, LLaMA’s architecture (e.g., 7B+ parameters) would require more compute and tailored defenses.
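
As a rough illustration of training with adversarial examples, the sketch below builds an augmented training set from label-preserving synonym swaps. This is a simplified form of the idea: full adversarial training would select perturbations that actually fool the current model. The `SYNONYMS` table is a small hand-written, hypothetical stand-in for a resource such as WordNet, and the function names are assumptions.

```python
import random

# Hypothetical hand-written synonym table; a real pipeline might derive this from WordNet.
SYNONYMS = {
    "great": ["excellent", "wonderful"],
    "awful": ["terrible", "dreadful"],
    "movie": ["film"],
    "enjoyed": ["liked", "loved"],
}

def augment_example(text, rng):
    """Replace every word found in SYNONYMS with a random synonym, keeping punctuation."""
    out = []
    for word in text.split():
        core = word.strip(".,!?").lower()
        if core in SYNONYMS:
            # Replaced words are lowercased; surrounding punctuation is preserved.
            word = word.lower().replace(core, rng.choice(SYNONYMS[core]))
        out.append(word)
    return " ".join(out)

def build_adversarial_training_set(examples, copies=2, seed=0):
    """Return the original (text, label) pairs plus `copies` perturbed variants of each."""
    rng = random.Random(seed)
    augmented = list(examples)
    for text, label in examples:
        for _ in range(copies):
            augmented.append((augment_example(text, rng), label))  # label is unchanged
    return augmented

train = [("I really enjoyed the movie, it was great!", 1),
         ("The movie was awful.", 0)]
for text, label in build_adversarial_training_set(train):
    print(label, text)
```

The resulting pairs could then be fed to an ordinary fine-tuning loop, so the model sees perturbed inputs with their correct labels.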