# ✂️ Snorkel Intro Tutorial: _Data Slicing_

In real-world applications, some model outcomes are often more important than others, e.g. vulnerable cyclist detections in an autonomous driving task or, in the tweet sentiment application used in this notebook, tweets whose meaning hinges on a negation word.

Traditional machine learning systems optimize for overall quality, which may be too coarse-grained.
Models that achieve high overall performance might still produce unacceptable failure rates on critical _slices_ of the data: subsets that matter disproportionately to the application, such as the examples above.

In this tutorial, we:
1. **Introduce _Slicing Functions (SFs)_** as a programming interface
2. **Monitor** application-critical data subsets
3. **Improve model performance** on slices

## 1. Load Labeled Data

In [1]:
# --- Initial Setup ---
%matplotlib inline
import os
import re
import pandas as pd
import numpy as np
import random
import torch
import torch.nn as nn # Ensure nn is imported
import utils # Your utility functions file
import logging
import tensorflow as tf # For reproducibility seed
import scipy.sparse # For type checking

In [2]:
# For reproducibility
os.environ["PYTHONHASHSEED"] = "0"
SEED = 123
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
tf.random.set_seed(SEED) # Set TF seed as well

In [3]:

# Configure logging and display
logger = logging.getLogger()
logger.setLevel(logging.WARNING) # Reduce verbose Snorkel logging
pd.set_option("display.max_colwidth", 0) # Display full text

In [4]:
# --- Load Full Labeled Data ---
# Note: load_dataset keeps labels for both train and test here
df_train_full, df_test_full = utils.load_dataset(csv_path="data/sentiment_analysis.csv")

# --- Clean Text ---
def clean_text(text):
    text = str(text).lower() # Ensure input is string
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    text = re.sub(r'@[^\s]+', '', text)
    text = re.sub(r'#([^\s]+)', r'\1', text)
    text = re.sub(r'[^\w\s]', '', text)
    return text

print("Cleaning full dataset text...")
df_train_full['text'] = df_train_full['text'].apply(clean_text)
df_test_full['text'] = df_test_full['text'].apply(clean_text)

# --- Create a Small Subset ---
subset_size_train = 10000 # Use 10k examples
subset_size_test = 2000   # Use 2k examples

if len(df_train_full) > subset_size_train:
    df_train = df_train_full.sample(n=subset_size_train, random_state=SEED)
else:
    df_train = df_train_full

if len(df_test_full) > subset_size_test:
    df_test = df_test_full.sample(n=subset_size_test, random_state=SEED)
else:
    df_test = df_test_full

print("\nUsing a SUBSET for training/evaluation:")
print(f"  Training examples: {len(df_train)}")
print(f"  Test examples: {len(df_test)}")

# Extract labels for the SUBSETS
Y_train = df_train["label"].values
Y_test = df_test["label"].values

# Define labels
ABSTAIN = -1
NEGATIVE = 0
POSITIVE = 1

print("\nTraining Subset Head:")
display(df_train.head())

Cleaning full dataset text...

Using a SUBSET for training/evaluation:
  Training examples: 10000
  Test examples: 2000

Training Subset Head:


| | text | label |
|---|---|---|
| 844098 | saaaaad day | 0 |
| 1268963 | | 0 |
| 1118710 | lol hilarious waitis it george bush the one who gets with a shoe or sth | 1 |
| 340147 | are already more than three hours and i have not lunch | 1 |
| 393637 | i miss my bff keith ambers i wish i can visit him soon in australia im not contented with these ims and phone calls come bck keith | 0 |
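
To see how a row such as 1268963 can end up with empty text, it helps to trace `clean_text` on a few inputs. The sketch below repeats the function from the cell above, with a final `strip()` added for readability (the original does not strip), and runs it on made-up tweets; the sample strings are illustrative and not taken from the dataset.

```python
import re

def clean_text(text):
    """Same steps as the notebook's clean_text, plus a final strip()."""
    text = str(text).lower()
    text = re.sub(r"https?://\S+|www\.\S+", "", text)  # drop URLs
    text = re.sub(r"@[^\s]+", "", text)                # drop @mentions
    text = re.sub(r"#([^\s]+)", r"\1", text)           # keep hashtag word, drop '#'
    text = re.sub(r"[^\w\s]", "", text)                # drop punctuation
    return text.strip()

# Hypothetical inputs, chosen to exercise each rule
samples = [
    "@friend http://example.com/abc",   # only a mention and a URL
    "Is it #Monday already?",           # hashtag and question mark
    "I don't know...",                  # apostrophe and ellipsis
]
cleaned = [clean_text(s) for s in samples]
for raw, out in zip(samples, cleaned):
    print(repr(raw), "->", repr(out))
```

A tweet that consists only of mentions and links cleans to an empty string. Since `?` and `'` are stripped as well, a slicing function that looks for those characters in the cleaned text will find nothing.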


## 2. Writing Slicing Functions (SFs)

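Slicing functions (SFs) are similar to Snorkel's labeling functions (LFs), but instead of a label they return a boolean indicating whether a data point belongs to the slice; applied to a whole dataset, each SF yields a boolean mask. We'll define a few SFs relevant to sentiment analysis.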
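
Conceptually, a slicing function is a predicate over one data point, and applying it to every row gives the mask. The following is a minimal plain-Python sketch of that idea, not Snorkel's implementation; `is_short` and the sample rows are made up for illustration.

```python
from types import SimpleNamespace

def is_short(x):
    """Hypothetical predicate: True if the text has fewer than 5 words."""
    return len(x.text.split()) < 5

# Made-up data points with a `text` attribute, mimicking DataFrame rows
rows = [
    SimpleNamespace(text="saaaaad day"),
    SimpleNamespace(text="i have not had lunch for three hours"),
    SimpleNamespace(text=""),
]

# One boolean per data point: membership in the slice
mask = [is_short(x) for x in rows]
print(mask)
```

Note that an empty text counts as short under this definition, so rows that were emptied by cleaning fall into such a slice.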

In [5]:
from snorkel.slicing import slicing_function
from snorkel.preprocess import preprocessor
from textblob import TextBlob

# SF for short tweets
@slicing_function()
def short_tweet(x):
    """Tweets with fewer than 5 words."""
    return len(x.text.split()) < 5

# SF for tweets containing negation words
# Caveat: `word in x.text` is a substring test on already-cleaned text, so "no" also matches "now"
negation_words = ["not", "no", "never", "ain't", "don't", "isn't", "can't", "won't"]
@slicing_function()
def has_negation(x):
    """Tweets containing common negation words."""
    return any(word in x.text for word in negation_words)

# SF using a preprocessor for high polarity
@preprocessor(memoize=True)
def textblob_polarity_score(x):
    scores = TextBlob(x.text)
    x.polarity = scores.sentiment.polarity
    return x

@slicing_function(pre=[textblob_polarity_score])
def high_positive_polarity(x):
    """Tweets with TextBlob polarity > 0.8"""
    return x.polarity > 0.8

# List of SFs to use
# Removed 'is_question' due to previous zero coverage issue
sfs = [
    short_tweet,
    has_negation,
    high_positive_polarity
]
print(f"Defined {len(sfs)} Slicing Functions: {[sf.name for sf in sfs]}")

# (Optional) Visualize examples from a slice
from snorkel.slicing import slice_dataframe
print("\nExamples from the 'has_negation' slice (from test subset):")
# Apply the SF to the test subset DataFrame to get examples
negation_df_subset = slice_dataframe(df_test, has_negation)
display(negation_df_subset[['text', 'label']].head())

Defined 3 Slicing Functions: ['short_tweet', 'has_negation', 'high_positive_polarity']

Examples from the 'has_negation' slice (from test subset):


100%|██████████| 2000/2000 [00:00<00:00, 58474.31it/s]


Unnamed: 0,text,label
57814,went to a harley davidson dealer to show some of my art this weekend allot of looks but no sales ill be judging a tattoo comp next,0
77033,itâs start raining now hmm second cloudy day,0
171509,just saw quotwallk to rememberquot again and its never get old,1
136333,i was when i found out you were in the same house im a huge fan toolol through my tv show im getting racing known in main stream,0
173418,ahh thank god he has leftan angry gay man with a hang over is not my idea of fun on a saturday morninghi 2 all in twitt land,1


### Visualize Slices

As the cell above shows, `slice_dataframe` returns the rows of a DataFrame that belong to a specific slice, which makes it easy to inspect examples.
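
One thing to keep in mind when reading the `has_negation` examples: the SF checks substring containment, and it runs on text from which `clean_text` has already removed apostrophes. The sketch below contrasts that check with a token-based variant. `has_negation_tokens` and its apostrophe-free word set are assumptions for illustration, not part of the notebook, and swapping it in would change the slice sizes reported here.

```python
# Word list from the notebook
negation_words = ["not", "no", "never", "ain't", "don't", "isn't", "can't", "won't"]
# Hypothetical variant: apostrophes removed to match the cleaned text
negation_tokens = {w.replace("'", "") for w in negation_words}

def has_negation_substring(text):
    """The notebook's check: substring containment."""
    return any(word in text for word in negation_words)

def has_negation_tokens(text):
    """Illustrative alternative: whole-token membership."""
    return any(tok in negation_tokens for tok in text.split())

examples = [
    "its start raining now hmm second cloudy day",  # 'no' only inside 'now'
    "i dont like mondays",                          # cleaned form of "don't"
    "allot of looks but no sales",                  # a real negation token
]
results = [(has_negation_substring(t), has_negation_tokens(t)) for t in examples]
for text, (sub, tok) in zip(examples, results):
    print(f"substring={sub!s:5} token={tok!s:5} | {text}")
```

The first example is matched only by the substring check and the second only by the token check, which is worth remembering when interpreting the accuracy on this slice.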

## 3. Train Baseline Model and Monitor Slices
Now, let's train a baseline model and see how it performs overall and on our defined slices. This process works with any model framework.


### Train a Baseline Classifier

We'll use scikit-learn's `LogisticRegression` with TF-IDF features as our baseline.

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from snorkel.slicing import PandasSFApplier
from snorkel.analysis import Scorer

# Featurize SUBSET data using TF-IDF
print("Featurizing subset data with TF-IDF...")
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X_train_sparse, _ = utils.df_to_features(vectorizer, df_train, "train") # Fit on train subset
X_test_sparse, _ = utils.df_to_features(vectorizer, df_test, "test")   # Transform test subset
print("Featurization complete.")

# Train baseline model on the SUBSET
print("Training baseline Logistic Regression model on subset...")
baseline_model = LogisticRegression(solver="liblinear", C=0.1, random_state=SEED)
baseline_model.fit(X=X_train_sparse, y=Y_train) # Train on subset labels
print("Baseline model trained.")

# Get predictions and probabilities for the TEST SUBSET
preds_test = baseline_model.predict(X_test_sparse)
probs_test = baseline_model.predict_proba(X_test_sparse)

# Calculate overall accuracy on the TEST SUBSET
accuracy_baseline = baseline_model.score(X_test_sparse, Y_test) # Score on test subset
print(f"\nBaseline Model Overall Test Subset Accuracy: {accuracy_baseline * 100:.1f}%")

# Apply SFs to TEST SUBSET
print("\nApplying Slicing Functions to test subset...")
applier = PandasSFApplier(sfs)
S_test = applier.apply(df_test) # Apply to df_test (the test subset)
print("SFs applied to test subset.")

# Check for empty slices on TEST SUBSET
print("\nChecking slice coverage on the test subset (S_test):")
empty_slices = []
# The applier returns a NumPy record array with one field per slicing function
if hasattr(S_test, "dtype") and S_test.dtype.names:
    for slice_name in S_test.dtype.names:
        coverage = S_test[slice_name].sum()
        print(f"- Slice '{slice_name}': {coverage} examples")
        if coverage == 0:
            empty_slices.append(slice_name)
else:
    print("Warning: S_test format issue.")

if empty_slices:
    print(f"\nWarning: Empty slices found: {empty_slices}.")
else:
    print("\nAll slices appear to have coverage on the test subset.")

# Score baseline model on TEST SUBSET slices
# Using accuracy metric consistent with previous tutorials
scorer = Scorer(metrics=["accuracy"])
print("\nScoring baseline model performance on test subset slices:")
if not empty_slices:
    try:
        slice_scores = scorer.score_slices(
            S=S_test, golds=Y_test, preds=preds_test, probs=probs_test, as_dataframe=True
        )
        display(slice_scores)
    except ValueError as e:
        print(f"\nError during scoring: {e}")
    except Exception as e:
        print(f"\nUnexpected error during scoring: {e}")
else:
    print("Skipping scoring due to empty slices.")

Featurizing subset data with TF-IDF...
Featurization complete.
Training baseline Logistic Regression model on subset...
Baseline model trained.

Baseline Model Overall Test Subset Accuracy: 71.0%

Applying Slicing Functions to test subset...


100%|██████████| 2000/2000 [00:00<00:00, 6307.02it/s]

SFs applied to test subset.

Checking slice coverage on the test subset (S_test):
- Slice 'short_tweet': 252 examples
- Slice 'has_negation': 492 examples
- Slice 'high_positive_polarity': 33 examples

All slices appear to have coverage on the test subset.

Scoring baseline model performance on test subset slices:





| | accuracy |
|---|---|
| overall | 0.71 |
| short_tweet | 0.777778 |
| has_negation | 0.693089 |
| high_positive_polarity | 0.818182 |
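
Each row of this table is an ordinary accuracy computed only over the test examples whose slice mask is true. A minimal sketch of that computation, assuming plain Python lists in place of `S_test`, `Y_test`, and `preds_test` (the toy values are invented):

```python
def slice_accuracy(golds, preds, mask):
    """Accuracy restricted to positions where mask is True; None if the slice is empty."""
    pairs = [(g, p) for g, p, m in zip(golds, preds, mask) if m]
    if not pairs:
        return None
    return sum(g == p for g, p in pairs) / len(pairs)

# Toy data: 6 examples, one hypothetical slice covering 3 of them
golds = [0, 1, 1, 0, 1, 0]
preds = [0, 1, 0, 0, 0, 1]
in_slice = [True, False, True, False, True, False]

overall = slice_accuracy(golds, preds, [True] * len(golds))
on_slice = slice_accuracy(golds, preds, in_slice)
print(f"overall={overall:.3f}, slice={on_slice:.3f}")
```

Returning `None` for an empty slice mirrors the notebook's choice to skip scoring when a slice has no coverage.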


## 4. Improve Slice Performance with SliceAwareClassifier

In the previous section, we compared the baseline's accuracy on each slice with its overall accuracy (71.0%). It falls below that mark only on `has_negation` (69.3%), and the `high_positive_polarity` figure rests on just 33 test examples. Now we'll use slice-based learning, a technique that adds slice-specific components to the model with the aim of improving performance on such subsets. Snorkel implements this via the `SliceAwareClassifier`.

### Constructing the SliceAwareClassifier

First, we need a base PyTorch model architecture. We'll use the simple multi-layer perceptron (MLP) defined in `utils.py`. Then we initialize the `SliceAwareClassifier` with the following arguments:

- `base_architecture`: the core PyTorch model (our MLP).
- `head_dim`: the output dimension of `base_architecture` before any final classification layer (here `hidden_dim`). The slice-specific heads are built on a representation of this size.
- `slice_names`: the names of the slices we want the model to be aware of.
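
The relationship between `base_architecture` and `head_dim` is easiest to see on a concrete module. `utils.get_pytorch_mlp_base` is not shown in this notebook, so the function below is only a guess at what such a helper could look like, named `make_mlp_base` to make clear that it is hypothetical: a stack of linear and ReLU layers with no classification layer, whose output width is the value to pass as `head_dim`.

```python
import torch
import torch.nn as nn

def make_mlp_base(input_dim, hidden_dim, num_layers=1):
    """Hypothetical stand-in for utils.get_pytorch_mlp_base: Linear+ReLU blocks, no classifier."""
    layers = []
    in_dim = input_dim
    for _ in range(num_layers):
        layers += [nn.Linear(in_dim, hidden_dim), nn.ReLU()]
        in_dim = hidden_dim
    return nn.Sequential(*layers)

# Small made-up dimensions; the notebook uses the TF-IDF vocabulary size and hidden_dim=128
base = make_mlp_base(input_dim=20, hidden_dim=8)
batch = torch.zeros(4, 20)
features = base(batch)

# The last dimension of the output is the value to pass as head_dim
head_dim = features.shape[-1]
print(features.shape, head_dim)
```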

### Prepare Slice-Aware DataLoaders

We need to apply our Slicing Functions (SFs) to the training data (df_train) to get S_train. Then, we create special PyTorch DataLoader objects that include this slice information alongside the features and labels.

In [7]:
from snorkel.slicing import SliceAwareClassifier
from snorkel.classification.data import DictDataset, DictDataLoader
from snorkel.classification import Trainer

# --- Initialize SliceAwareClassifier (using DENSE base MLP) ---
print("\nInitializing SliceAwareClassifier...")
slice_model = None # Initialize to None
train_dl_slice = None
test_dl_slice = None

# Check required variables are defined and valid
required_vars_init = ['X_train_sparse', 'sfs', 'scorer']
missing_vars_init = [var for var in required_vars_init if var not in locals()]

if missing_vars_init:
    print(f"Error: Missing variables for initialization: {missing_vars_init}")
elif not scipy.sparse.issparse(X_train_sparse):  # also accepts newer sparse-array types
    print("Error: X_train_sparse is not sparse.")
else:
    bow_dim = X_train_sparse.shape[1]
    hidden_dim = 128
    try:
        # Use the DENSE base MLP from utils.py
        base_mlp = utils.get_pytorch_mlp_base(input_dim=bow_dim, hidden_dim=hidden_dim, num_layers=1)
        head_dim_to_use = hidden_dim
        slice_model = SliceAwareClassifier(
            base_architecture=base_mlp, head_dim=head_dim_to_use,
            slice_names=[sf.name for sf in sfs], scorer=scorer,
        )
        print("SliceAwareClassifier initialized with DENSE base MLP.")
    except AttributeError: print("Error: 'get_pytorch_mlp_base' not found in utils.py.")
    except Exception as e: print(f"Error initializing SliceAwareClassifier: {e}")

# --- Prepare DENSE Slice-Aware DataLoaders (using SUBSET) ---
if slice_model is not None:
    print("\nApplying Slicing Functions to train subset...")
    S_train = applier.apply(df_train) # Apply SFs to df_train (train subset)
    print("SFs applied to train subset.")

    print("Converting subset features to PyTorch DENSE tensors...")
    try:
        X_train_tensor = torch.FloatTensor(X_train_sparse.toarray()) # Dense Train Subset
        X_test_tensor = torch.FloatTensor(X_test_sparse.toarray())   # Dense Test Subset
        Y_train_tensor = torch.LongTensor(Y_train) # Train subset labels
        Y_test_tensor = torch.LongTensor(Y_test)   # Test subset labels

        train_dataset = DictDataset.from_tensors(X_train_tensor, Y_train_tensor, "train")
        test_dataset = DictDataset.from_tensors(X_test_tensor, Y_test_tensor, "test")
        print("PyTorch DENSE datasets created from subset.")

        BATCH_SIZE = 64
        print("Creating slice-aware dataloaders for DENSE data...")
        # Create dataloaders using subset slice matrices
        train_dl_slice = slice_model.make_slice_dataloader(train_dataset, S_train, shuffle=True, batch_size=BATCH_SIZE)
        test_dl_slice = slice_model.make_slice_dataloader(test_dataset, S_test, shuffle=False, batch_size=BATCH_SIZE)
        print("Dataloaders ready for DENSE data.")
    except Exception as e:
        print(f"Error preparing dense data/dataloaders: {e}.")
        train_dl_slice = None
        test_dl_slice = None


Initializing SliceAwareClassifier...
SliceAwareClassifier initialized with DENSE base MLP.

Applying Slicing Functions to train subset...


100%|██████████| 10000/10000 [00:01<00:00, 6730.78it/s]


SFs applied to train subset.
Converting subset features to PyTorch DENSE tensors...
PyTorch DENSE datasets created from subset.
Creating slice-aware dataloaders for DENSE data...
Dataloaders ready for DENSE data.
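
Calling `.toarray()` materializes every zero of the TF-IDF matrix, so it is worth estimating the dense footprint before running this cell on a larger subset. The arithmetic is rows × columns × 4 bytes for `float32`. The vocabulary size below is a placeholder, since the notebook does not print `bow_dim`.

```python
def dense_size_gib(n_rows, n_cols, bytes_per_value=4):
    """Memory needed for a dense float32 matrix, in GiB."""
    return n_rows * n_cols * bytes_per_value / 2**30

# n_rows matches the 10,000-example training subset; n_cols is an assumed vocabulary size
n_rows = 10_000
assumed_vocab = 50_000
size = dense_size_gib(n_rows, assumed_vocab)
print(f"{n_rows} x {assumed_vocab} float32 values ~ {size:.2f} GiB")
```

With unigram and bigram features the real vocabulary may be larger or smaller than this assumption; `X_train_sparse.shape[1]` gives the actual value.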


### Train the Slice-Aware Model

Now we train the SliceAwareClassifier using Snorkel's Trainer. The trainer handles the multi-task learning process automatically.

In [9]:
# --- Train Slice-Aware Model ---
if slice_model is not None and train_dl_slice is not None:
    print("\nTraining SliceAwareClassifier on subset...")
    trainer = Trainer(n_epochs=3, lr=1e-3, progress_bar=True)
    try:
        trainer.fit(slice_model, [train_dl_slice])
        print("SliceAwareClassifier training complete.")
    except Exception as e:
        print(f"\nAn unexpected error occurred during training: {e}")
        slice_model = None # Mark model as failed
else:
    print("\nSkipping SliceAwareClassifier training due to initialization errors.")


Training SliceAwareClassifier on subset...


Epoch 0:: 100%|██████████| 157/157 [00:06<00:00, 24.84it/s, model/all/train/loss=0.478, model/all/train/lr=0.001]
Epoch 1:: 100%|██████████| 157/157 [00:06<00:00, 25.30it/s, model/all/train/loss=0.229, model/all/train/lr=0.001]
Epoch 2:: 100%|██████████| 157/157 [00:06<00:00, 24.29it/s, model/all/train/loss=0.117, model/all/train/lr=0.001]

SliceAwareClassifier training complete.





### Evaluate the Slice-Aware Model

Finally, evaluate the trained SliceAwareClassifier on the test set slices. The score_slices method will report metrics for the main task and for each slice-specific head.
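
The evaluation cell that follows lines up the two result tables by rewriting the `label` strings with an inline lambda. Pulled out as a named helper, that mapping is easier to read and test. This sketch assumes, as the lambda does, that the main task is labeled `task` and slice prediction heads are labeled `task_slice:<name>_pred`; `normalize_label` is a name introduced here, and it is slightly stricter than the lambda in that it strips only a leading prefix and a trailing suffix.

```python
def normalize_label(label):
    """Map a score_slices label to the slice name used in the baseline table."""
    prefix, suffix = "task_slice:", "_pred"
    if label == "task":
        return "overall"
    if label.startswith(prefix):
        name = label[len(prefix):]
        # Strip the trailing '_pred' only, not occurrences inside the slice name
        return name[: -len(suffix)] if name.endswith(suffix) else name
    return label

labels = ["task", "task_slice:short_tweet_pred", "task_slice:has_negation_pred", "other"]
normalized = [normalize_label(label) for label in labels]
print(normalized)
```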

In [10]:
# --- Evaluate Slice-Aware Model ---
if slice_model is not None and test_dl_slice is not None:
    print("\nEvaluating SliceAwareClassifier on test subset slices:")
    try:
        slice_aware_scores = slice_model.score_slices([test_dl_slice], as_dataframe=True)

        if 'slice_scores' in locals() and isinstance(slice_scores, pd.DataFrame):
            print("\nComparison with Baseline Model (on Test Subset):")
            comparison_df = slice_scores.rename(columns={"accuracy": "baseline_accuracy"})
            score_col_name = 'score'
            if score_col_name in slice_aware_scores.columns:
                 slice_aware_scores['slice_name'] = slice_aware_scores['label'].apply(lambda x: x.replace('task_slice:', '').replace('_pred', '') if 'task_slice:' in x else ('overall' if x == 'task' else x))
                 if comparison_df.index.name != 'slice_name': comparison_df.index.name = 'slice_name'
                 comparison_df = comparison_df.reset_index().merge(slice_aware_scores[['slice_name', score_col_name]], on='slice_name', how='left')
                 comparison_df = comparison_df.rename(columns={score_col_name: "slice_aware_accuracy"}).set_index('slice_name')
                 if 'baseline_accuracy' in comparison_df.columns and 'slice_aware_accuracy' in comparison_df.columns:
                     # Add a column for improvement
                     comparison_df['improvement'] = comparison_df['slice_aware_accuracy'] - comparison_df['baseline_accuracy']
                     display(comparison_df[['baseline_accuracy', 'slice_aware_accuracy', 'improvement']])
                 else:
                     print("Warning: Comparison table columns missing.")
                     display(slice_aware_scores)
            else:
                print(f"Warning: Score column '{score_col_name}' missing.")
                display(slice_aware_scores)
        else:
            print("\nBaseline scores not found/valid. Displaying SliceAwareClassifier scores only:")
            display(slice_aware_scores)
    except Exception as e:
        print(f"\nAn error occurred during SliceAwareClassifier evaluation: {e}")
else:
    print("\nSkipping SliceAwareClassifier evaluation due to initialization or training errors.")


Evaluating SliceAwareClassifier on test subset slices:

Comparison with Baseline Model (on Test Subset):


| slice_name | baseline_accuracy | slice_aware_accuracy | improvement |
|---|---|---|---|
| overall | 0.71 | 0.711 | 0.001 |
| short_tweet | 0.777778 | 0.746032 | -0.031746 |
| has_negation | 0.693089 | 0.70122 | 0.00813 |
| high_positive_polarity | 0.818182 | 0.909091 | 0.090909 |
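
Because the slices differ a lot in size, it helps to translate these accuracy differences back into example counts. The sketch below does that from the figures in the two tables and the slice sizes printed earlier (2000, 252, 492, and 33 test examples). It also adds a rough binomial standard error for the baseline accuracy; that formula is a common approximation added here, not something the notebook computes.

```python
import math

# (slice size, baseline accuracy, slice-aware accuracy), copied from the outputs above
results = {
    "overall": (2000, 0.71, 0.711),
    "short_tweet": (252, 0.777778, 0.746032),
    "has_negation": (492, 0.693089, 0.70122),
    "high_positive_polarity": (33, 0.818182, 0.909091),
}

summary = {}
for name, (n, base_acc, aware_acc) in results.items():
    base_correct = round(base_acc * n)
    aware_correct = round(aware_acc * n)
    # Approximate standard error of the baseline accuracy on n examples
    std_err = math.sqrt(base_acc * (1 - base_acc) / n)
    summary[name] = (aware_correct - base_correct, std_err)
    print(f"{name:24s} n={n:4d}  delta={aware_correct - base_correct:+d} examples  "
          f"baseline SE ~ {std_err:.3f}")
```

On `high_positive_polarity` the nine-point gain corresponds to three test examples, a difference of similar size to the baseline's standard error on that slice, so it should be read with caution. The eight-example drop on `short_tweet` is the largest change in absolute terms.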
