In [None]:
%reload_ext autoreload
%autoreload 2

In [None]:
# | hide
from onprem.core import *

# Few-Shot Classification



The `pipelines.FewShotClassifier` is a simple wrapper around the `SetFit` package that lets you train a text classifier from only a few labeled examples (e.g., 8 examples per class). It is useful when only a small number of labeled examples is available for training. Here we pass `use_smaller=True` to load the smaller version of the default model.

In [None]:
# | notest
from onprem.pipelines import FewShotClassifier
clf = FewShotClassifier(use_smaller=True)

model_head.pkl not found on HuggingFace Hub, initialising classification head with random weights. You should TRAIN this model on a downstream task to use it for predictions and inference.


In this example, we will classify posts from two categories of the 20 Newsgroups dataset.

In [None]:
# | notest

from sklearn.datasets import fetch_20newsgroups
from sklearn.metrics import classification_report
import numpy as np
import pandas as pd

# Fetching data
classes = ["soc.religion.christian", "sci.space"]
newsgroups = fetch_20newsgroups(subset="all", categories=classes)
corpus = np.array(newsgroups.data)
group_labels = np.array(newsgroups.target_names)[newsgroups.target]

# Wrangling data into a dataframe and selecting training examples
data = pd.DataFrame({"text": corpus, "label": group_labels})
train_df = data.groupby("label").sample(5)
test_df = data.drop(index=train_df.index)

# small sample of the entire dataset (much smaller than the test set)
X_sample = train_df['text'].values
y_sample = train_df['label'].values

# test set
X_test = test_df['text'].values
y_test = test_df['label'].values
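
The split above draws a fixed number of examples per class for training and keeps everything else for testing. As a rough illustration of that idea without pandas, the sketch below does the same on a toy corpus. The helper `sample_per_class` and the toy data are hypothetical and are not part of `onprem` or scikit-learn.

```python
import random
from collections import defaultdict

def sample_per_class(texts, labels, n_per_class, seed=0):
    """Pick n_per_class examples per label for training; the rest form the test set."""
    rng = random.Random(seed)
    # Group row indices by label
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    # Draw the same number of indices from every class
    train_idx = set()
    for indices in by_label.values():
        train_idx.update(rng.sample(indices, n_per_class))
    train = [(texts[i], labels[i]) for i in sorted(train_idx)]
    test = [(texts[i], labels[i]) for i in range(len(texts)) if i not in train_idx]
    return train, test

# Toy stand-in corpus (hypothetical data, not the 20 Newsgroups posts)
texts = [f"post {i}" for i in range(20)]
labels = ["sci.space" if i % 2 == 0 else "soc.religion.christian" for i in range(20)]

train, test = sample_per_class(texts, labels, n_per_class=5)
print(len(train), len(test))
```

Fixing a seed, as the sketch does, makes the few-shot sample reproducible; with so few training examples, results can vary noticeably from one random draw to the next.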


There are only 10 training examples.

In [None]:
# | notest

len(X_sample)

10

There are 1974 test examples.

In [None]:
# | notest

len(X_test)

1974

Let's train:

In [None]:
# | notest

clf.train(X_sample, y_sample, num_epochs=1, batch_size=16, num_iterations=20)

Applying column mapping to the training dataset



***** Running training *****
  Num unique pairs = 400
  Batch size = 16
  Num epochs = 1
  Total optimization steps = 25




In [None]:
# | notest

print(clf.evaluate(X_test, y_test, labels=clf.model.labels, print_report=True))

                        precision    recall  f1-score   support

             sci.space       0.96      0.99      0.98       982
soc.religion.christian       0.99      0.96      0.98       992

              accuracy                           0.98      1974
             macro avg       0.98      0.98      0.98      1974
          weighted avg       0.98      0.98      0.98      1974
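
For readers unfamiliar with the report columns, the sketch below computes precision, recall, and F1 for one class from raw predictions. The toy labels are made up for illustration and are not drawn from the test set above; `per_class_metrics` is a hypothetical helper, not an `onprem` function.

```python
def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1) for a single class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy predictions (hypothetical): one sci.space post is misclassified
y_true = ["sci.space"] * 3 + ["soc.religion.christian"] * 3
y_pred = ["sci.space"] * 2 + ["soc.religion.christian"] * 4

precision, recall, f1 = per_class_metrics(y_true, y_pred, "sci.space")
print(round(precision, 2), round(recall, 2), round(f1, 2))  # 1.0 0.67 0.8
```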



Make predictions on new data:

In [None]:
# | notest

clf.predict(['Elon Musk likes launching satellites.'])

array(['sci.space'], dtype='<U22')

In [None]:
# | notest


clf.predict(['My mom likes going to church.'])

array(['soc.religion.christian'], dtype='<U22')

Show prediction probabilities:

In [None]:
# | notest

clf.predict_proba(['Elon Musk likes launching satellites.'])

tensor([[0.6201, 0.3799]], dtype=torch.float64)
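
To turn a row of probabilities into a label, pair each column with its class name and take the largest value. The sketch below assumes the columns follow the order of `clf.model.labels` and that this order is alphabetical; that matches the `sci.space` prediction shown earlier, but it is an assumption, not something the output confirms. `top_label` is a hypothetical helper.

```python
def top_label(probabilities, labels):
    """Return the most probable label and its probability."""
    best = max(range(len(labels)), key=lambda i: probabilities[i])
    return labels[best], probabilities[best]

# Assumption: column order matches clf.model.labels (taken here to be alphabetical)
labels = ["sci.space", "soc.religion.christian"]
probs = [0.6201, 0.3799]  # the row printed by predict_proba above

label, confidence = top_label(probs, labels)
print(label, confidence)  # sci.space 0.6201
```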

Explain predictions:

In [None]:
# | notest

clf.explain(['Elon Musk likes launching satellites.'], labels=clf.model.labels)

Save and reload the model:

In [None]:
# | notest

clf.save('/tmp/my_fewshot_model')

In [None]:
# | notest

clf = FewShotClassifier('/tmp/my_fewshot_model')

In [None]:
# | notest

clf.predict(['Elon Musk likes launching satellites.'])

array(['sci.space'], dtype='<U22')