## 📈 Snorkel Intro Tutorial: Data Augmentation for Sentiment Analysis

In the previous tutorial, we used Snorkel's LabelModel to create a labeled training set from noisy heuristics. We will now take that labeled data and augment it using Transformation Functions (TFs).

Data augmentation is a popular technique for increasing the size of labeled training sets by applying class-preserving transformations. For text, this could mean replacing a word with its synonym. The key is that the transformation shouldn't change the original label (i.e., a positive tweet should remain positive).
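
As a minimal, library-free sketch of a class-preserving transformation, the snippet below swaps one word for a synonym and carries the label over unchanged. The `SYNONYMS` table and the `augment` helper are hypothetical names used only for illustration; the tutorial itself relies on WordNet and Snorkel for this step.

```python
import random

# Hypothetical hand-written synonym table (illustration only).
SYNONYMS = {"great": "excellent", "movie": "film"}

def augment(text, label, rng):
    """Replace one word that has a known synonym; keep the label as-is."""
    words = text.split()
    candidates = [i for i, w in enumerate(words) if w in SYNONYMS]
    if not candidates:
        return None  # nothing to transform
    i = rng.choice(candidates)
    words[i] = SYNONYMS[words[i]]
    return " ".join(words), label

rng = random.Random(0)
original = ("great movie tonight", 1)  # 1 = positive
augmented = augment(*original, rng)
print(original, "->", augmented)
```

Returning `None` when nothing can be replaced mirrors the convention the TFs later in this tutorial follow.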

This tutorial is divided into four parts:

1. Loading Labeled Data: We'll start with the labeled training data generated from the previous tutorial.
2. Writing Transformation Functions (TFs): We'll write functions to modify tweets while preserving their sentiment.
3. Applying TFs to Augment Our Dataset: We'll use a policy to apply these TFs and create a larger, augmented training set.
4. Training a Model: We'll train an LSTM model on both the original and augmented datasets to see the impact on performance.

### 1. Loading the Labeled Data

This tutorial assumes the data labeling step is complete. We'll start with the labeled, filtered DataFrame (`df_train_filtered`) and the hard labels (`preds_train_filtered`) that you generated in the previous step.

For completeness, let's re-run the necessary setup code to get us to that starting point.

In [1]:
# --- Initial Setup ---
%matplotlib inline
import os
import re
import pandas as pd
import numpy as np
import random
import utils # Your utility functions
import nltk
import names
import tensorflow as tf
from snorkel.labeling import labeling_function, PandasLFApplier, LFAnalysis
from snorkel.labeling.model import LabelModel
from snorkel.labeling import filter_unlabeled_dataframe
from snorkel.utils import probs_to_preds
from snorkel.augmentation import transformation_function, RandomPolicy, MeanFieldPolicy, PandasTFApplier
from snorkel.preprocess.nlp import SpacyPreprocessor


In [2]:
# For reproducibility
os.environ["PYTHONHASHSEED"] = "0"
seed = 123
np.random.seed(seed)
random.seed(seed)
tf.random.set_seed(seed) # Set TF seed for reproducibility

In [3]:
import pandas as pd
pd.set_option("display.max_colwidth", 0) # Display full text

In [4]:
# --- Reproduce Labeled Data Generation (As before) ---
print("Loading and cleaning data...")
df_train, df_test = utils.load_dataset(csv_path="data/sentiment_analysis.csv")

def clean_text(text):
    text = text.lower()
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # drop URLs
    text = re.sub(r'@[^\s]+', '', text)                # drop @mentions
    text = re.sub(r'#([^\s]+)', r'\1', text)           # keep hashtag word, drop '#'
    text = re.sub(r'[^\w\s]', '', text)                # drop remaining punctuation
    return text

df_train['text'] = df_train['text'].apply(clean_text)
df_test['text'] = df_test['text'].apply(clean_text)
df_train['label'] = -1
Y_test = df_test["label"].values
ABSTAIN = -1
NEGATIVE = 0
POSITIVE = 1
print("Data loaded and cleaned.")

Loading and cleaning data...
Data loaded and cleaned.
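
To make the effect of `clean_text` concrete, here is a standalone copy of the function applied to a made-up tweet (the sample string is an assumption, not a row from the dataset). Note that the mention and punctuation filters also strip `@handles` and emoticons, so any later step that searches for those patterns will not find them in the cleaned text.

```python
import re

def clean_text(text):
    text = text.lower()
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # drop URLs
    text = re.sub(r'@[^\s]+', '', text)                # drop @mentions
    text = re.sub(r'#([^\s]+)', r'\1', text)           # keep hashtag word, drop '#'
    text = re.sub(r'[^\w\s]', '', text)                # drop remaining punctuation
    return text

sample = "LOVE the new album!! http://t.co/abc @bob #happy :)"  # made-up tweet
cleaned = clean_text(sample)
print(cleaned.split())
```

The removed pieces leave extra spaces behind, which is why the example prints the whitespace-split tokens rather than the raw string.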


In [5]:
print("Defining and applying LFs...")
@labeling_function()
def positive_keyword_lf(x): return POSITIVE if any(w in x.text for w in ["love", "great", "happy", "awesome"]) else ABSTAIN
@labeling_function()
def negative_keyword_lf(x): return NEGATIVE if any(w in x.text for w in ["hate", "bad", "sad", "awful"]) else ABSTAIN
@labeling_function()
def emoticon_positive_lf(x): return POSITIVE if re.search(r":\)|:-\)|:d|;d", x.text, re.IGNORECASE) else ABSTAIN
@labeling_function()
def emoticon_negative_lf(x): return NEGATIVE if re.search(r":\(|:-\(", x.text, re.IGNORECASE) else ABSTAIN

lfs = [positive_keyword_lf, negative_keyword_lf, emoticon_positive_lf, emoticon_negative_lf]
applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=df_train)
print("LFs applied.")

Defining and applying LFs...


100%|██████████| 1280000/1280000 [00:24<00:00, 51575.82it/s]


LFs applied.


In [6]:
print("Training LabelModel...")
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train=L_train, n_epochs=500, seed=seed)
probs_train = label_model.predict_proba(L=L_train)
df_train_filtered, probs_train_filtered = filter_unlabeled_dataframe(X=df_train, y=probs_train, L=L_train)
preds_train_filtered = probs_to_preds(probs=probs_train_filtered)
print("LabelModel trained and data filtered.")

df_train_labeled = df_train_filtered.copy()
df_train_labeled["label"] = preds_train_filtered

Training LabelModel...


100%|██████████| 500/500 [00:00<00:00, 5190.01epoch/s]


LabelModel trained and data filtered.


In [7]:
# --- Create a Small Subset for Augmentation ---
subset_size = 10000 # Use 10k examples for faster execution
if len(df_train_labeled) > subset_size:
    df_train_labeled_subset = df_train_labeled.sample(n=subset_size, random_state=seed)
else:
    df_train_labeled_subset = df_train_labeled # Use all if less than subset_size

Y_train_labeled_subset = df_train_labeled_subset["label"].values

print(f"\nCreated a SUBSET of {len(df_train_labeled_subset)} labeled examples for augmentation.")
display(df_train_labeled_subset.head())


Created a SUBSET of 10000 labeled examples for augmentation.


| | text | label |
|---|---|---|
| 872266 | ever been in love with someone and cant tell them | 1 |
| 842689 | new blog layout is soooo awesome i love it | 1 |
| 391489 | tomorrow sounds deadly too bad im in work i cant even go to oasis | 0 |
| 396384 | watching trucalling they made a big mistake cancelling this show i loved it | 1 |
| 1127284 | woah thats cool just landed in london about 2 and12 hours agoi love the scenery beautiful | 1 |


### 2. Writing Transformation Functions (TFs)

Transformation Functions (TFs) are functions that take a data point and return a transformed version of it, while preserving the original label. For sentiment analysis, safe transformations include replacing words with synonyms or replacing specific named entities with generic placeholders.

Just like LFs, TFs are created with a decorator, `transformation_function`, and can use preprocessors like spaCy to parse the text first.
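
The contract a TF follows can be sketched without Snorkel: take a data point, return a modified copy, or return `None` when the transformation does not apply. The `replace_digits` function and the `SimpleNamespace` data point below are hypothetical stand-ins for illustration; in the tutorial, the `transformation_function` decorator wraps functions of this shape.

```python
import copy
import re
from types import SimpleNamespace

def replace_digits(x):
    """Hypothetical TF-style function: mask the first number in the text."""
    x = copy.copy(x)  # work on a copy so the caller's data point is untouched
    new_text = re.sub(r"\d+", "NUM", x.text, count=1)
    if new_text == x.text:
        return None  # transformation did not apply
    x.text = new_text
    return x

tweet = SimpleNamespace(text="im going to be 18 in 183 days", label=1)
out = replace_digits(tweet)
print(out.text, out.label)
```

The label travels with the data point and is never touched, which is what makes the transformation class-preserving by construction.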

In [8]:
from snorkel.preprocess.nlp import SpacyPreprocessor

spacy = SpacyPreprocessor(text_field="text", doc_field="doc", memoize=True)

In [9]:
from nltk.corpus import wordnet as wn

# --- Setup: NLTK and spaCy ---
print("Setting up NLTK and SpaCy...")
try:
    nltk.data.find('corpora/wordnet.zip')
except LookupError:
    nltk.download('wordnet')

try:
    spacy = SpacyPreprocessor(text_field="text", doc_field="doc", memoize=True)
    print("SpacyPreprocessor initialized.")
except IOError:
    print("SpaCy English model not found. Run: !python -m spacy download en_core_web_sm")
    raise

# --- Helper Functions ---
def get_synonym(word, pos=None):
    """Return the first lemma of the word's first WordNet synset that differs from the word, or None."""
    synsets = wn.synsets(word, pos=pos)
    if synsets:
        for lemma in synsets[0].lemmas():
            synonym = lemma.name().replace("_", " ")
            if synonym.lower() != word.lower():
                return synonym
    return None

def replace_token_with_ws(spacy_doc, idx, replacement):
    """Rebuild the text with token idx replaced, keeping that token's trailing whitespace."""
    start = spacy_doc[:idx].text_with_ws if idx > 0 else ""
    rep = replacement + spacy_doc[idx].whitespace_
    end = spacy_doc[idx+1:].text if idx+1 < len(spacy_doc) else ""
    return start + rep + end

# --- Define TFs ---
# Pool of replacement names, generated once and shared by all calls
replacement_names = [names.get_full_name() for _ in range(50)]

@transformation_function(pre=[spacy])
def change_person(x):
    """Replace one PERSON entity with a random name from replacement_names."""
    person_names = [ent.text for ent in x.doc.ents if ent.label_ == "PERSON"]
    if not person_names:
        return None
    name = np.random.choice(person_names)
    replacement = np.random.choice(replacement_names)
    original = x.text
    x.text = original.replace(name, replacement, 1)
    return x if x.text != original else None

@transformation_function(pre=[spacy])
def swap_adjectives(x):
    """Swap the positions of two randomly chosen adjectives."""
    adj_indices = [i for i, token in enumerate(x.doc) if token.pos_ == "ADJ"]
    if len(adj_indices) < 2:
        return None
    i1, i2 = sorted(np.random.choice(adj_indices, 2, replace=False))
    tokens = list(x.doc)
    tokens[i1], tokens[i2] = tokens[i2], tokens[i1]
    x.text = "".join(token.text_with_ws for token in tokens).strip()
    return x

# In the three synonym TFs, the `if s` check is nested inside the index check
# so that `s` is only read after it has been assigned.
@transformation_function(pre=[spacy])
def replace_verb_with_synonym(x):
    v = [i for i, t in enumerate(x.doc) if t.pos_ == "VERB"]
    if v:
        i = np.random.choice(v)
        s = get_synonym(x.doc[i].text, pos=wn.VERB)
        if s:
            x.text = replace_token_with_ws(x.doc, i, s)
            return x
    return None

@transformation_function(pre=[spacy])
def replace_noun_with_synonym(x):
    n = [i for i, t in enumerate(x.doc) if t.pos_ == "NOUN"]
    if n:
        i = np.random.choice(n)
        s = get_synonym(x.doc[i].text, pos=wn.NOUN)
        if s:
            x.text = replace_token_with_ws(x.doc, i, s)
            return x
    return None

@transformation_function(pre=[spacy])
def replace_adjective_with_synonym(x):
    a = [i for i, t in enumerate(x.doc) if t.pos_ == "ADJ"]
    if a:
        i = np.random.choice(a)
        s = get_synonym(x.doc[i].text, pos=wn.ADJ)
        if s:
            x.text = replace_token_with_ws(x.doc, i, s)
            return x
    return None

@transformation_function()
def replace_mention(x):
    """Replace the first @mention with the generic token '@user'."""
    original = x.text
    x.text = re.sub(r'(?<!\w)@[A-Za-z0-9_]+', '@user', original, count=1)
    return x if x.text != original else None

# --- List of TFs ---
tfs = [change_person, swap_adjectives, replace_verb_with_synonym, replace_noun_with_synonym, replace_adjective_with_synonym, replace_mention]
print(f"\nDefined {len(tfs)} Transformation Functions.")

# --- Preview the TFs on the SUBSET ---
print("\nPreviewing transformations on subset:")
display(utils.preview_tfs(df_train_labeled_subset.sample(min(100, len(df_train_labeled_subset)), random_state=seed), tfs))

Setting up NLTK and SpaCy...
SpacyPreprocessor initialized.

Defined 6 Transformation Functions.

Previewing transformations on subset:


| | TF Name | Original Text | Transformed Text |
|---|---|---|---|
| 0 | change_person | finally got to see sixteen candles i love that anthony michael hall dude | finally got to see sixteen candles i love that Kevin North dude |
| 1 | swap_adjectives | happy half birthday to me haha oh my gosh im going to be 18 in 183 days | half happy birthday to me haha oh my gosh im going to be 18 in 183 days |
| 2 | replace_verb_with_synonym | feeling really bad about not going to e3 just found at paul mccartney ringo starr yoko ono were at the ms party for rb the beatles | feeling really bad about not going to e3 just establish at paul mccartney ringo starr yoko ono were at the ms party for rb the beatles |
| 3 | replace_noun_with_synonym | the bourget air show awesome performances but now i look like a racoon | the bourget air show awesome performances but now i look like a raccoon |
| 4 | replace_adjective_with_synonym | the bourget air show awesome performances but now i look like a racoon | the bourget air show amazing performances but now i look like a racoon |


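The synonym-based TFs use `wordnet` from [NLTK](https://www.nltk.org/) to replace different parts of speech with their synonyms. The cell below restates them in a longer, commented form; each one returns `None` when the tweet has no token of the target part of speech or when no synonym is found.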

In [11]:
import numpy as np
import nltk
from nltk.corpus import wordnet as wn
from snorkel.augmentation import transformation_function
# Assuming 'spacy', 'get_synonym', 'replace_token_with_ws' are defined correctly

@transformation_function(pre=[spacy])
def replace_verb_with_synonym(x):
    """Replace a random verb with a synonym."""
    verb_indices = [i for i, token in enumerate(x.doc) if token.pos_ == "VERB"]
    if verb_indices: # Check if verbs exist first
        idx = np.random.choice(verb_indices)
        synonym = get_synonym(x.doc[idx].text, pos=wn.VERB)
        # --- Nest this check ---
        if synonym: # Only proceed if synonym found
            x.text = replace_token_with_ws(x.doc, idx, synonym)
            return x
    # If no verbs OR no synonym found, return None
    return None

@transformation_function(pre=[spacy])
def replace_noun_with_synonym(x):
    """Replace a random noun with a synonym."""
    noun_indices = [i for i, token in enumerate(x.doc) if token.pos_ == "NOUN"]
    if noun_indices: # Check if nouns exist first
        idx = np.random.choice(noun_indices)
        synonym = get_synonym(x.doc[idx].text, pos=wn.NOUN)
        # --- Nest this check ---
        if synonym: # Only proceed if synonym found
            x.text = replace_token_with_ws(x.doc, idx, synonym)
            return x
    # If no nouns OR no synonym found, return None
    return None

@transformation_function(pre=[spacy])
def replace_adjective_with_synonym(x):
    """Replace a random adjective with a synonym."""
    adj_indices = [i for i, token in enumerate(x.doc) if token.pos_ == "ADJ"]
    if adj_indices: # Check if adjectives exist first
        idx = np.random.choice(adj_indices)
        synonym = get_synonym(x.doc[idx].text, pos=wn.ADJ)
        # --- Nest this check ---
        if synonym: # Only proceed if synonym found
            x.text = replace_token_with_ws(x.doc, idx, synonym)
            return x
    # If no adjectives OR no synonym found, return None
    return None

In [14]:
# List of transformation functions to apply
tfs = [
    change_person,
    swap_adjectives,                 # Use with caution for sentiment
    replace_verb_with_synonym,       # Corrected version
    replace_noun_with_synonym,       # Corrected version
    replace_adjective_with_synonym,  # Corrected version
    replace_mention,
]

print(f"Defined list 'tfs' containing {len(tfs)} transformation functions.")

Defined list 'tfs' containing 6 transformation functions.


In [15]:
from utils import preview_tfs

preview_tfs(df_train, tfs)

| | TF Name | Original Text | Transformed Text |
|---|---|---|---|
| 0 | change_person | welcome new quotfollowersquot america is and has been watching iran keep tweeting we hope the best for you iranelection iran9 cnnfail | welcome best Marcia Prieto is and has been watching iran keep tweeting we hope the new for you iranelection iran9 cnnfail |
| 1 | swap_adjectives | welcome new quotfollowersquot america is and has been watching iran keep tweeting we hope the best for you iranelection iran9 cnnfail | welcome best quotfollowersquot america is and has been watching iran keep tweeting we hope the new for you iranelection iran9 cnnfail |
| 2 | replace_verb_with_synonym | glad you enjoying yourself | glad you enjoy yourself |
| 3 | replace_noun_with_synonym | my tiny yard ended up being a big projectand i had to evict so many creatures it broke my heart hoping i didnt kill too many | my tiny pace ended up being a big projectand i had to evict so many creatures it broke my heart hoping i didnt kill too many |
| 4 | replace_adjective_with_synonym | got her ear lobe peirced for the third time today and it still hurts | got her ear lobe peirced for the 3rd time today and it still hurts |


The table above shows each Transformation Function (TF) applied to a tweet to generate augmented data:

change_person: Replaced the span "quotfollowersquot america", evidently tagged by spaCy as a person, with a randomly generated name, "Marcia Prieto". The adjectives "new" and "best" also appear swapped in this row.

swap_adjectives: Swapped the positions of the adjectives "new" and "best".

replace_verb_with_synonym: Changed the verb "enjoying" to its base form "enjoy".

replace_noun_with_synonym: Replaced the noun "yard" with "pace", apparently drawn from the unit-of-length sense of the word, which shows that synonym replacement can pick the wrong sense.

replace_adjective_with_synonym: Replaced the adjective/ordinal "third" with its numerical form "3rd".

### 3. Applying Transformation Functions

We'll first define a `Policy` to determine what sequence of TFs to apply to each data point.
We'll start with a [`RandomPolicy`](https://snorkel.readthedocs.io/en/master/packages/_autosummary/augmentation/snorkel.augmentation.RandomPolicy.html)
that samples `sequence_length=2` TFs to apply uniformly at random per data point.
The `n_per_original` argument determines how many augmented data points to generate per original data point.
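
To see what such a policy produces, here is a plain-Python sketch of uniform sequence sampling. It illustrates the idea and is not Snorkel's implementation; `sample_sequences` is a hypothetical helper, and using an empty sequence to stand for "keep the original" is an assumption made for this sketch.

```python
import random

def sample_sequences(n_tfs, sequence_length, n_per_original, keep_original, rng):
    """Return TF-index sequences for one data point, sampled uniformly."""
    sequences = [[]] if keep_original else []  # [] means "leave the point unchanged"
    for _ in range(n_per_original):
        sequences.append([rng.randrange(n_tfs) for _ in range(sequence_length)])
    return sequences

rng = random.Random(123)
seqs = sample_sequences(n_tfs=6, sequence_length=2, n_per_original=2,
                        keep_original=True, rng=rng)
print(seqs)
```

Each inner list holds indices into the `tfs` list, so a sequence such as `[2, 4]` would mean "apply the third TF, then the fifth".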

In [17]:
from snorkel.augmentation import RandomPolicy

random_policy = RandomPolicy(
    len(tfs), sequence_length=2, n_per_original=2, keep_original=True
)

In some cases, we can do better than uniform random sampling.
We might have domain knowledge that some TFs should be applied more frequently than others,
or have trained an [automated data augmentation model](https://snorkel.org/blog/tanda/)
that learned a sampling distribution for the TFs.
Snorkel supports this use case with a
[`MeanFieldPolicy`](https://snorkel.readthedocs.io/en/master/packages/_autosummary/augmentation/snorkel.augmentation.MeanFieldPolicy.html),
which allows you to specify a sampling distribution for the TFs.
We give higher probabilities to the `replace_[X]_with_synonym` TFs, since they vary the wording the model sees while leaving the sentiment intact, and lower probabilities to `change_person` and `swap_adjectives`.
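
A non-uniform policy differs only in how TF indices are drawn. The sketch below uses the standard library's weighted sampling to show the effect of the distribution used in the next cell; it illustrates the idea and is not Snorkel's code.

```python
import random
from collections import Counter

tf_names = ["change_person", "swap_adjectives", "replace_verb_with_synonym",
            "replace_noun_with_synonym", "replace_adjective_with_synonym",
            "replace_mention"]
p = [0.05, 0.05, 0.25, 0.25, 0.25, 0.15]

# One probability per TF, and the distribution must sum to 1.
assert len(p) == len(tf_names) and abs(sum(p) - 1.0) < 1e-9

rng = random.Random(123)
draws = rng.choices(range(len(tf_names)), weights=p, k=10_000)
counts = Counter(draws)
for i, name in enumerate(tf_names):
    print(f"{name:32s} {counts[i] / len(draws):.3f}")
```

Over many draws the empirical frequencies approach `p`, so the synonym TFs are chosen about five times as often as `change_person` or `swap_adjectives`.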

In [18]:
from snorkel.augmentation import MeanFieldPolicy, RandomPolicy

# Define the probability distribution for sampling TFs.
# The length of 'p' MUST match the length of your 'tfs' list (which is 6).
# Example: Lower probability for change_person (index 0) and swap_adjectives (index 1)
# Higher probability for synonym replacements (indices 2, 3, 4) and replace_mention (index 5)
if len(tfs) == 6:
    probabilities = [0.05, 0.05, 0.25, 0.25, 0.25, 0.15]
    policy = MeanFieldPolicy(
        len(tfs),
        sequence_length=2,      # Apply 2 TFs per sequence
        n_per_original=2,       # Generate 2 new data points per original
        keep_original=True,     # Keep the original data point
        p=probabilities,        # Use the specified probabilities
    )
    print("MeanFieldPolicy defined with custom probabilities.")
else:
    # Fallback if the tfs list length changed unexpectedly
    print(f"Warning: Number of TFs is {len(tfs)}, but probabilities list assumes 6. Using RandomPolicy as fallback.")
    policy = RandomPolicy(len(tfs), sequence_length=2, n_per_original=2, keep_original=True)

MeanFieldPolicy defined with custom probabilities.


To apply one or more TFs that we've written to a collection of data points according to our policy, we use a
[`PandasTFApplier`](https://snorkel.readthedocs.io/en/master/packages/_autosummary/augmentation/snorkel.augmentation.PandasTFApplier.html)
because our data points are represented with a Pandas DataFrame.
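
Conceptually, the applier walks each data point through every sequence the policy emits. The sketch below is a simplified stand-in, assuming that a transformed point is kept only when at least one TF in its sequence returns a non-`None` result; `apply_sequences`, `upper`, and `exclaim` are hypothetical names, and the real applier operates on DataFrame rows rather than plain strings.

```python
def apply_sequences(x, tfs, sequences):
    """Apply each TF-index sequence to x and collect the resulting points."""
    results = []
    for seq in sequences:
        if not seq:              # empty sequence: keep the original point
            results.append(x)
            continue
        current, applied = x, False
        for idx in seq:
            out = tfs[idx](current)
            if out is not None:  # a TF returning None leaves the point as-is
                current, applied = out, True
        if applied:
            results.append(current)
    return results

# Toy TFs on plain strings (illustration only).
def upper(s):
    return s.upper() if s != s.upper() else None

def exclaim(s):
    return None if s.endswith("!") else s + "!"

result = apply_sequences("so happy!", [upper, exclaim], [[], [0, 1], [1, 1]])
print(result)
```

Under this assumption, the last sequence yields nothing because neither of its TFs applies, so a data point can contribute fewer rows than the policy requested.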

In [19]:
from snorkel.augmentation import PandasTFApplier

# Initialize the TF Applier with your TFs and the chosen policy
# Ensure 'tfs' and 'policy' are defined from the previous cells
if 'tfs' not in locals():
    print("Error: 'tfs' list is not defined.")
elif 'policy' not in locals():
     print("Error: 'policy' is not defined.")
elif 'df_train_labeled_subset' not in locals():
    print("Error: 'df_train_labeled_subset' is not defined.")
else:
    tf_applier = PandasTFApplier(tfs, policy)

    # Apply the TFs to the labeled training data SUBSET
    print("\nApplying Transformation Functions to SUBSET...")
    # This step will take some time depending on the subset size and TFs
    df_train_augmented_subset = tf_applier.apply(df_train_labeled_subset)

    # Extract the labels from the augmented DataFrame
    Y_train_augmented_subset = df_train_augmented_subset["label"].values
    print("Data augmentation complete on subset.")

    # Display the change in dataset size
    print(f"\nOriginal subset size: {len(df_train_labeled_subset)}")
    print(f"Augmented subset size: {len(df_train_augmented_subset)}")


Applying Transformation Functions to SUBSET...


100%|██████████| 10000/10000 [01:00<00:00, 165.81it/s]


Data augmentation complete on subset.

Original subset size: 10000
Augmented subset size: 21084
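
The sizes reported above can be put in context with a little arithmetic. With `keep_original=True` and `n_per_original=2`, the augmented set can hold at most three rows per original; the gap to the observed size is consistent with TFs returning `None` for many tweets (for example, tweets with no detected person name or fewer than two adjectives). The calculation assumes every original row was kept.

```python
n_original = 10_000
n_augmented_total = 21_084        # size printed by the cell above
n_per_original = 2

upper_bound = n_original * (1 + n_per_original)   # originals plus all requested copies
n_new = n_augmented_total - n_original            # assumes every original row was kept
yield_rate = n_new / (n_original * n_per_original)

print(upper_bound, n_new, f"{yield_rate:.1%}")
```

Roughly 55% of the requested transformed copies were produced, or about 1.1 new rows per original tweet.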


In [20]:
# --- Featurization for LSTM ---
print("\nFeaturizing data subsets for LSTM...")
# Featurize the original subset
X_train_labeled_lstm_subset = utils.featurize_df_tokens(df_train_labeled_subset)
# Featurize the augmented subset
X_train_augmented_lstm_subset = utils.featurize_df_tokens(df_train_augmented_subset)
# Featurize the FULL test set for evaluation
X_test_lstm = utils.featurize_df_tokens(df_test)
print("Featurization complete.")


Featurizing data subsets for LSTM...
Featurization complete.


### 4. Training a Model

Our final step is to use the augmented data to train a model. We train an LSTM (Long Short-Term Memory) model, a standard architecture for text processing tasks.

In [21]:
# --- Helper function to Train and Test ---
def train_and_test_lstm(X_train, Y_train, X_test, Y_test, num_buckets=30000):
    # Get LSTM model from utils.py
    model = utils.get_keras_lstm(num_buckets)
    print(f"Training LSTM model on {len(X_train)} examples...")
    # Train the model
    history = model.fit(
        X_train,
        Y_train,
        epochs=5,       # Use a small number of epochs for faster demo
        batch_size=64,
        verbose=1,      # Show training progress
        validation_split=0.1, # Use 10% of training data for validation
        callbacks=[utils.get_keras_early_stopping(patience=3)] # Use early stopping
    )
    print("Evaluating model...")
    # Evaluate on the full test set
    loss, acc = model.evaluate(X_test, Y_test, verbose=0)
    return acc

# --- Train and Evaluate Models ---
print("\n--- Training on ORIGINAL SUBSET ---")
acc_original_subset = train_and_test_lstm(
    X_train_labeled_lstm_subset,
    Y_train_labeled_subset, # Labels from the original subset
    X_test_lstm,
    Y_test # Evaluate on the full test set labels
)
print(f"\nTest Accuracy (original subset): {100 * acc_original_subset:.1f}%")

print("\n--- Training on AUGMENTED SUBSET ---")
acc_augmented_subset = train_and_test_lstm(
    X_train_augmented_lstm_subset,
    Y_train_augmented_subset, # Labels from the augmented subset
    X_test_lstm,
    Y_test # Evaluate on the full test set labels
)
print(f"\nTest Accuracy (augmented subset): {100 * acc_augmented_subset:.1f}%")

# --- Final Comparison ---
print("\n--- Results (Training on Subset, Evaluating on Full Test Set) ---")
print(f"Original Subset Accuracy:  {100 * acc_original_subset:.1f}%")
print(f"Augmented Subset Accuracy: {100 * acc_augmented_subset:.1f}%")
improvement = acc_augmented_subset - acc_original_subset
print(f"Improvement from Augmentation (on Subset): {100 * improvement:.1f} percentage points")



--- Training on ORIGINAL SUBSET ---
Training LSTM model on 10000 examples...
Epoch 1/5
141/141 ━━━━━━━━━━━━━━━━━━━━ 2s 8ms/step - accuracy: 0.6381 - loss: 0.6895 - val_accuracy: 0.6720 - val_loss: 0.6831
Epoch 2/5
141/141 ━━━━━━━━━━━━━━━━━━━━ 1s 7ms/step - accuracy: 0.6381 - loss: 0.6841 - val_accuracy: 0.6720 - val_loss: 0.6781
Epoch 3/5
141/141 ━━━━━━━━━━━━━━━━━━━━ 1s 7ms/step - accuracy: 0.6381 - loss: 0.6805 - val_accuracy: 0.6720 - val_loss: 0.6741
Epoch 4/5
141/141 ━━━━━━━━━━━━━━━━━━━━ 1s 7ms/step - accuracy: 0.6381 - loss: 0.6776 - val_accuracy: 0.6720 - val_loss: 0.6707
Epoch 4: early stopping
Restoring model weights from the end of the best epoch: 1.
Evaluating model...

Test Accuracy (original subset): 50.0%

--- Training on AUGMENTED SUBSET ---
Training LSTM model on 21084 examples...
Epoch 1/5
297/297 ━━━━━━━━━━━━━━━━━━━━