## 📈 Snorkel Intro Tutorial: Data Augmentation for Sentiment Analysis

In the previous tutorial, we used Snorkel's LabelModel to create a labeled training set from noisy heuristics. We will now take that labeled data and augment it using Transformation Functions (TFs).

Data augmentation is a popular technique for increasing the size of labeled training sets by applying class-preserving transformations. For text, this could mean replacing a word with its synonym. The key is that the transformation shouldn't change the original label (i.e., a positive tweet should remain positive).

This tutorial is divided into four parts:

1. Loading Labeled Data: We'll start with the labeled training data generated from the previous tutorial.
2. Writing Transformation Functions (TFs): We'll write functions to modify tweets while preserving their sentiment.
3. Applying TFs to Augment Our Dataset: We'll use a policy to apply these TFs and create a larger, augmented training set.
4. Training a Model: We'll train an LSTM model on both the original and augmented datasets to see the impact on performance.

1. Loading the Labeled Data

This tutorial assumes the data labeling step is complete. We'll start with the labeled, filtered DataFrame (df_train_filtered) and the hard labels (preds_train_filtered) that you generated in the previous step.

For completeness, let's re-run the necessary setup code to get us to that starting point.

In [1]:
# --- Initial Setup ---
%matplotlib inline
import os
import re
import pandas as pd
import numpy as np
import random
import utils # Your utility functions
import nltk
import names
import tensorflow as tf
from snorkel.labeling import labeling_function, PandasLFApplier, LFAnalysis
from snorkel.labeling.model import LabelModel
from snorkel.labeling import filter_unlabeled_dataframe
from snorkel.utils import probs_to_preds
from snorkel.augmentation import transformation_function, RandomPolicy, MeanFieldPolicy, PandasTFApplier
from snorkel.preprocess.nlp import SpacyPreprocessor


In [2]:
# For reproducibility
DISPLAY_ALL_TEXT = True

pd.set_option("display.max_colwidth", 0 if DISPLAY_ALL_TEXT else 50)

In [3]:
import pandas as pd


DISPLAY_ALL_TEXT = True

pd.set_option("display.max_colwidth", 0 if DISPLAY_ALL_TEXT else 50)

In [4]:
import utils

df_train, df_test = utils.load_dataset(csv_path="data/sentiment_analysis.csv")
Y_train = df_train["label"].values
Y_test = df_test["label"].values

print("Data loaded successfully!")
print(f"Number of training examples: {len(df_train)}")
print(f"Number of test examples: {len(df_test)}")

Data loaded successfully!
Number of training examples: 1280000
Number of test examples: 320000


In [5]:
df_train.head()

Unnamed: 0,text,label
0,@chrishasboobs AHHH I HOPE YOUR OK!!!,0
1,"@misstoriblack cool , i have no tweet apps for my razr 2",0
2,"@TiannaChaos i know just family drama. its lame.hey next time u hang out with kim n u guys like have a sleepover or whatever, ill call u",0
3,School email won't open and I have geography stuff on there to revise! *Stupid School* :'(,0
4,upper airways problem,0


2. Writing Transformation Functions (TFs)

Transformation Functions (TFs) are functions that take a data point and return a transformed version of it, while preserving the original label. For sentiment analysis, safe transformations include replacing words with synonyms or replacing specific named entities with generic placeholders.

Just like LFs, TFs are created with a decorator, transformation_function, and can use preprocessors like spaCy to parse the text first.

In [6]:
from snorkel.preprocess.nlp import SpacyPreprocessor

spacy = SpacyPreprocessor(text_field="text", doc_field="doc", memoize=True)

In [7]:
import numpy as np
import names
from snorkel.augmentation import transformation_function
from snorkel.preprocess.nlp import SpacyPreprocessor

replacement_names = [names.get_full_name() for _ in range(50)]

@transformation_function(pre=[spacy])
def change_person(x):
    person_names = [ent.text for ent in x.doc.ents if ent.label_ == "PERSON"]
    if person_names:
        name_to_replace = np.random.choice(person_names)
        replacement_name = np.random.choice(replacement_names)
        original_text = x.text
        x.text = original_text.replace(name_to_replace, replacement_name, 1)
        return x if x.text != original_text else None
    return None

@transformation_function(pre=[spacy])
def swap_adjectives(x):
    adjective_indices = [i for i, token in enumerate(x.doc) if token.pos_ == "ADJ"]
    if len(adjective_indices) >= 2:
        idx1, idx2 = sorted(np.random.choice(adjective_indices, 2, replace=False))
        tokens = list(x.doc)
        tokens[idx1], tokens[idx2] = tokens[idx2], tokens[idx1]
        new_text_parts = []
        for i, token in enumerate(tokens):
            new_text_parts.append(token.text)
            if i < len(tokens) - 1 and token.whitespace_:
                new_text_parts.append(" ")
        x.text = "".join(new_text_parts).strip()
        return x
    return None

We add some transformation functions that use `wordnet` from [NLTK](https://www.nltk.org/) to replace different parts of speech with their synonyms.

In [8]:
import numpy as np
import nltk
from nltk.corpus import wordnet as wn
from snorkel.augmentation import transformation_function
from snorkel.preprocess.nlp import SpacyPreprocessor

try:
    nltk.data.find('corpora/wordnet.zip')
except LookupError:
    nltk.download('wordnet')


# --- Define Helper Functions ---
def get_synonym(word, pos=None):
    """Get a synonym for a word given its part-of-speech (pos)."""
    synsets = wn.synsets(word, pos=pos) #
    if synsets:
        lemmas = synsets[0].lemmas()
        for lemma in lemmas:
            synonym = lemma.name().replace("_", " ") #
            if synonym.lower() != word.lower(): #
                return synonym
    return None

def replace_token_with_ws(spacy_doc, idx, replacement):
    """Replace token at idx, preserving whitespace."""
    start_text = spacy_doc[:idx].text_with_ws if idx > 0 else ""
    replacement_with_space = replacement + spacy_doc[idx].whitespace_
    end_text = spacy_doc[idx+1:].text if idx+1 < len(spacy_doc) else ""
    return start_text + replacement_with_space + end_text

# --- Define Synonym Replacement TFs ---
@transformation_function(pre=[spacy])
def replace_verb_with_synonym(x):
    """Replace a random verb with a synonym."""
    verb_indices = [i for i, token in enumerate(x.doc) if token.pos_ == "VERB"] #
    if verb_indices:
        idx = np.random.choice(verb_indices) #
        synonym = get_synonym(x.doc[idx].text, pos=wn.VERB) #
        if synonym: #
            x.text = replace_token_with_ws(x.doc, idx, synonym) #
            return x #
    return None

@transformation_function(pre=[spacy])
def replace_noun_with_synonym(x):
    """Replace a random noun with a synonym."""
    noun_indices = [i for i, token in enumerate(x.doc) if token.pos_ == "NOUN"] #
    if noun_indices:
        idx = np.random.choice(noun_indices) #
        synonym = get_synonym(x.doc[idx].text, pos=wn.NOUN) #
        if synonym: #
            x.text = replace_token_with_ws(x.doc, idx, synonym) #
            return x #
    return None

@transformation_function(pre=[spacy])
def replace_adjective_with_synonym(x):
    """Replace a random adjective with a synonym."""
    adj_indices = [i for i, token in enumerate(x.doc) if token.pos_ == "ADJ"] #
    if adj_indices:
        idx = np.random.choice(adj_indices) #
        synonym = get_synonym(x.doc[idx].text, pos=wn.ADJ) #
        if synonym: #
            x.text = replace_token_with_ws(x.doc, idx, synonym) #
            return x #
    return None

In [9]:
# List of transformation functions to apply
tfs = [
    change_person,
    swap_adjectives,
    replace_verb_with_synonym,
    replace_noun_with_synonym,
    replace_adjective_with_synonym,
]


In [10]:
from utils import preview_tfs

preview_tfs(df_train, tfs)

Unnamed: 0,TF Name,Original Text,Transformed Text
0,change_person,@dennisschaub No but he will be doing pics. He has the best body I can't wait!,Manuel Davis No but he will be doing pics. He has the best body I can't wait!
1,swap_adjectives,"my tiny yard ended up being a big project...and I had to evict so many creatures, it broke my heart. Hoping I didn't kill too many.","my tiny yard ended up being a big project...and I had to evict so manycreatures, it broke my heart. Hoping I didn't kill too many ."
2,replace_verb_with_synonym,@oliverclzoff glad you enjoying yourself!,@oliverclzoff glad you enjoy yourself!
3,replace_noun_with_synonym,got her ear lobe peirced for the third time today and it still hurts,got her ear lobe peirced for the third clip today and it still hurts
4,replace_adjective_with_synonym,got her ear lobe peirced for the third time today and it still hurts,got her ear lobe peirced for the 3rd time today and it still hurts


This table shows examples of different Transformation Functions (TFs) applied to original tweet text to generate augmented data:

change_person: Replaced the username @dennisschaub with a randomly generated name, William Watts.

swap_adjectives: Swapped the positions of the adjectives "tiny" and "big".

replace_verb_with_synonym: Changed the verb "enjoying" to its base form "enjoy".

replace_noun_with_synonym: Replaced the noun "project" with the synonym "undertaking".

replace_adjective_with_synonym: Replaced the adjective/ordinal "third" with its numerical form "3rd".

 3. Applying Transformation Functions

 We'll first define a `Policy` to determine what sequence of TFs to apply to each data point.
We'll start with a [`RandomPolicy`](https://snorkel.readthedocs.io/en/master/packages/_autosummary/augmentation/snorkel.augmentation.RandomPolicy.html)
that samples `sequence_length=2` TFs to apply uniformly at random per data point.
The `n_per_original` argument determines how many augmented data points to generate per original data point.

In [11]:
from snorkel.augmentation import RandomPolicy

random_policy = RandomPolicy(
    len(tfs), sequence_length=2, n_per_original=2, keep_original=True
)

In some cases, we can do better than uniform random sampling.
We might have domain knowledge that some TFs should be applied more frequently than others,
or have trained an [automated data augmentation model](https://snorkel.org/blog/tanda/)
that learned a sampling distribution for the TFs.
Snorkel supports this use case with a
[`MeanFieldPolicy`](https://snorkel.readthedocs.io/en/master/packages/_autosummary/augmentation/snorkel.augmentation.MeanFieldPolicy.html),
which allows you to specify a sampling distribution for the TFs.
We give higher probabilities to the `replace_[X]_with_synonym` TFs, since those provide more information to the model.

In [12]:
from snorkel.augmentation import MeanFieldPolicy

mean_field_policy = MeanFieldPolicy(
    len(tfs),
    sequence_length=2,
    n_per_original=2,
    keep_original=True,
    p=[0.05, 0.05, 0.3, 0.3, 0.3],
)

To apply one or more TFs that we've written to a collection of data points according to our policy, we use a
[`PandasTFApplier`](https://snorkel.readthedocs.io/en/master/packages/_autosummary/augmentation/snorkel.augmentation.PandasTFApplier.html)
because our data points are represented with a Pandas DataFrame.

In [None]:
from snorkel.augmentation import PandasTFApplier

tf_applier = PandasTFApplier(tfs, mean_field_policy)
df_train_augmented = tf_applier.apply(df_train)
Y_train_augmented = df_train_augmented["label"].values6

 13%|█▎        | 167792/1280000 [20:14<2:18:46, 133.57it/s]

In [None]:
print(f"Original training set size: {len(df_train)}")
print(f"Augmented training set size: {len(df_train_augmented)}")

We have almost doubled our dataset using TFs!
Note that despite `n_per_original` being set to 2, our dataset may not exactly triple in size,
because sometimes TFs return `None` instead of a new data point
(e.g. `change_person` when applied to a sentence with no persons).
If you prefer to have exact proportions for your dataset, you can have TFs that can't perform a
valid transformation return the original data point rather than `None` (as they do here).

4. Training A Model

   Our final step is to use the augmented data to train a model. We train an LSTM (Long Short Term Memory) model, which is a very standard architecture for text processing tasks.

The next cell makes Keras results reproducible. You can ignore it.

In [None]:
import tensorflow as tf

session_conf = tf.compat.v1.ConfigProto(
    intra_op_parallelism_threads=1, inter_op_parallelism_threads=1
)

tf.compat.v1.set_random_seed(0)
sess = tf.compat.v1.Session(graph=tf.compat.v1.get_default_graph(), config=session_conf)
tf.compat.v1.keras.backend.set_session(sess)

Now we'll train our LSTM on both the original and augmented datasets to compare performance.

In [None]:
from utils import featurize_df_tokens, get_keras_lstm

X_train = featurize_df_tokens(df_train)
X_train_augmented = featurize_df_tokens(df_train_augmented)
X_test = featurize_df_tokens(df_test)


def train_and_test(X_train, Y_train, X_test=X_test, Y_test=Y_test, num_buckets=30000):
    # Define a vanilla LSTM model with Keras
    lstm_model = get_keras_lstm(num_buckets)
    lstm_model.fit(X_train, Y_train, epochs=5, verbose=0)
    preds_test = lstm_model.predict(X_test)[:, 0] > 0.5
    return (preds_test == Y_test).mean()


acc_augmented = train_and_test(X_train_augmented, Y_train_augmented)
acc_original = train_and_test(X_train, Y_train)

In [None]:
print(f"Test Accuracy (original training data): {100 * acc_original:.1f}%")
print(f"Test Accuracy (augmented training data): {100 * acc_augmented:.1f}%")