# SetFit for Text Classification

In this notebook, we'll learn how to do few-shot text classification with SetFit.

## Setup

In [6]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [3]:
!unzip /content/country_classifier.zip

Archive:  /content/country_classifier.zip
   creating: country_classifier/
  inflating: country_classifier/house-addresses.csv  
  inflating: country_classifier/LICENSE.txt  
  inflating: country_classifier/README.txt  


In [5]:
df = pd.read_csv('/content/country_classifier/house-addresses.csv')

In [7]:
df.head()

| Unnamed: 0 | Address | AddressWithCountry | Country |
|---|---|---|---|
| 0 | 32, DUMOND STREET, UNIT 123, BENTLEY, WA, 6102 | 32, DUMOND STREET, UNIT 123, BENTLEY, WA, 6102... | AU |
| 1 | 26, ANDREW ROAD, UNIT 75, GREENBANK, QLD, 4124 | 26, ANDREW ROAD, UNIT 75, GREENBANK, QLD, 4124... | AU |
| 2 | 52, FERNSIDE AVENUE, BRIAR HILL, VIC, 3088 | 52, FERNSIDE AVENUE, BRIAR HILL, VIC, 3088, AU | AU |
| 3 | 44, SIGANTO DRIVE, HELENSVALE, QLD, 4212 | 44, SIGANTO DRIVE, HELENSVALE, QLD, 4212, AU | AU |
| 4 | 6, CORONATION STREET, BELLINGEN, NSW, 2454 | 6, CORONATION STREET, BELLINGEN, NSW, 2454, AU | AU |


## Creating Label Mapping

In [11]:
# Assign each country code an integer id in order of first appearance
mapper = {country: idx for idx, country in enumerate(df.Country.unique())}
mapper

{'AU': 0,
 'BE': 1,
 'BR': 2,
 'CA': 3,
 'ES': 4,
 'FR': 5,
 'JP': 6,
 'MX': 7,
 'US': 8,
 'ZA': 9}
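
The mapping above goes from country code to integer id, while a classifier trained on those ids will predict integers. A minimal sketch of building the same kind of mapping and its inverse in plain Python follows; the sample codes, the names `label2id` and `id2label`, and the predicted ids are all hypothetical and only serve as illustration.

```python
# Hypothetical sample of country codes, in the order they appear in a column.
countries = ["AU", "AU", "BE", "BR", "AU", "CA", "BE"]

# Assign integer ids in order of first appearance, as enumerating unique values would.
label2id = {}
for code in countries:
    label2id.setdefault(code, len(label2id))

# Invert the mapping so integer predictions can be turned back into country codes.
id2label = {idx: code for code, idx in label2id.items()}

predicted_ids = [0, 2, 3]  # hypothetical model output
predicted_codes = [id2label[i] for i in predicted_ids]

print(label2id)         # {'AU': 0, 'BE': 1, 'BR': 2, 'CA': 3}
print(predicted_codes)  # ['AU', 'BR', 'CA']
```

Keeping the inverse dictionary next to the forward one makes it easy to report predictions as readable country codes later on.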

In [12]:
df['labels'] = df.Country.map(mapper)
df.head()

| Unnamed: 0 | Address | AddressWithCountry | Country | labels |
|---|---|---|---|---|
| 0 | 32, DUMOND STREET, UNIT 123, BENTLEY, WA, 6102 | 32, DUMOND STREET, UNIT 123, BENTLEY, WA, 6102... | AU | 0 |
| 1 | 26, ANDREW ROAD, UNIT 75, GREENBANK, QLD, 4124 | 26, ANDREW ROAD, UNIT 75, GREENBANK, QLD, 4124... | AU | 0 |
| 2 | 52, FERNSIDE AVENUE, BRIAR HILL, VIC, 3088 | 52, FERNSIDE AVENUE, BRIAR HILL, VIC, 3088, AU | AU | 0 |
| 3 | 44, SIGANTO DRIVE, HELENSVALE, QLD, 4212 | 44, SIGANTO DRIVE, HELENSVALE, QLD, 4212, AU | AU | 0 |
| 4 | 6, CORONATION STREET, BELLINGEN, NSW, 2454 | 6, CORONATION STREET, BELLINGEN, NSW, 2454, AU | AU | 0 |


In [14]:
train, test = train_test_split(df, test_size=.5, stratify=df['labels'], random_state=20221018)

If you're running this notebook on Colab or another cloud platform, you will need to install the `setfit` library by running the following cell:

In [15]:
%pip install setfit

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting setfit
  Downloading setfit-0.3.0-py3-none-any.whl (21 kB)
Collecting evaluate==0.2.2
  Downloading evaluate-0.2.2-py3-none-any.whl (69 kB)
Collecting datasets==2.3.2
  Downloading datasets-2.3.2-py3-none-any.whl (362 kB)
Collecting sentence-transformers==2.2.2
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
Collecting xxhash
  Downloading xxhash-3.0.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
Collecting responses<0.19
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Collecting huggingface-hub<1.0.0,>=0.1.0
  Downloading huggingface_hub-0.10.1-py3-none-any.whl (163 kB)

To be able to share your model with the community, there are a few more steps to follow.

First, you have to store your authentication token from the Hugging Face Hub (sign up [here](https://huggingface.co/join) if you haven't already!). To do so, execute the following cell and input an [access token](https://huggingface.co/docs/hub/security-tokens) associated with your account:

In [None]:
# from huggingface_hub import notebook_login

# notebook_login()

Then you need to install Git-LFS, which you can do by uncommenting and running the following command:

In [None]:
# !apt install git-lfs

Finally, you may need to configure Git on your system by providing details about who you are:

In [None]:
# !git config --global user.email "you@example.com"
# !git config --global user.name "Your Name"

This notebook is designed to work with any multiclass [text classification dataset](https://huggingface.co/models?pipeline_tag=text-classification&sort=downloads) and pretrained [Sentence Transformer](https://huggingface.co/models?library=sentence-transformers&sort=downloads) on the Hub. Here the data comes from the address CSV loaded above rather than from the Hub, so only `model_id` is used in the cells that follow. Change its value to try a different model!

In [17]:
dataset_id = "sst2"  # not used below: the data comes from the address CSV
model_id = "sentence-transformers/paraphrase-mpnet-base-v2"

## Loading and sampling the dataset

We will use the 🤗 Datasets library to wrap the training split in a `Dataset` object, which can be done as follows:

In [18]:
from datasets import load_dataset, Dataset

dataset = Dataset.from_pandas(train)
dataset

Dataset({
    features: ['Address', 'AddressWithCountry', 'Country', 'labels', '__index_level_0__'],
    num_rows: 50000
})

Most datasets have many more labeled examples than one encounters in few-shot settings. To simulate the effect of training on a limited number of examples, let's subsample the training set to 1,000 examples, an average of 100 per class. Note that `shuffle()` followed by `select()` draws a random sample, so the per-class counts are only approximately equal:

In [19]:
num_samples = 100
num_classes = 10
train_dataset = dataset.shuffle(seed=42).select(range(num_samples * num_classes))
train_dataset

Dataset({
    features: ['Address', 'AddressWithCountry', 'Country', 'labels', '__index_level_0__'],
    num_rows: 1000
})
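
If exactly the same number of examples per class is wanted, the sampling can be done group by group. The following is a minimal sketch in plain Python, assuming the records are available as a list of dictionaries; the helper `sample_per_class` and the toy data are hypothetical and not part of the Datasets library.

```python
import random
from collections import defaultdict


def sample_per_class(records, label_key, k, seed=42):
    """Return exactly k randomly chosen records for every distinct label."""
    by_label = defaultdict(list)
    for rec in records:
        by_label[rec[label_key]].append(rec)

    rng = random.Random(seed)
    sampled = []
    for label in sorted(by_label):
        group = by_label[label]
        if len(group) < k:
            raise ValueError(f"label {label!r} has only {len(group)} examples")
        sampled.extend(rng.sample(group, k))

    # Shuffle so that examples of the same class are not adjacent.
    rng.shuffle(sampled)
    return sampled


# Hypothetical toy data: 3 classes with 20 examples each.
records = [{"Address": f"address {i}", "labels": i % 3} for i in range(60)]
few_shot = sample_per_class(records, "labels", k=8)
print(len(few_shot))  # 24
```

A list of dictionaries like `few_shot` can then be turned into a `Dataset`, for example by first building a pandas DataFrame from it.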

Here we have 1,000 total examples to train with, spread across the 10 country classes. For evaluation, we'll use the held-out `test` half produced by `train_test_split` above:

In [20]:
eval_dataset = Dataset.from_pandas(test)
eval_dataset

Dataset({
    features: ['Address', 'AddressWithCountry', 'Country', 'labels', '__index_level_0__'],
    num_rows: 50000
})

Okay, now we have the dataset, let's load and train a model!

## Fine-tuning the model

To train a SetFit model, the first thing to do is download a pretrained checkpoint from the Hub. We can do so by using the `from_pretrained()` method associated with the `SetFitModel` class:

In [21]:
from setfit import SetFitModel

model = SetFitModel.from_pretrained(model_id)

model_head.pkl not found on HuggingFace Hub, initialising classification head with random weights. You should TRAIN this model on a downstream task to use it for predictions and inference.


Here, we've downloaded a pretrained Sentence Transformer from the Hub and added a logistic regression classification head to create the SetFit model. As indicated in the message, we need to train this model on some labeled examples. We can do so by using the `SetFitTrainer` class as follows:

In [22]:
from sentence_transformers.losses import CosineSimilarityLoss

from setfit import SetFitTrainer

trainer = SetFitTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss_class=CosineSimilarityLoss,
    batch_size=16,
    num_epochs=1,
    num_iterations=20,
    column_mapping={"Address": "text", "labels": "label"},
)

The main arguments to notice in the trainer are the following:

* `loss_class`: The loss function to use for contrastive learning with the Sentence Transformer body
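* `num_iterations`: The number of rounds of text-pair generation for contrastive learning. Each round creates one positive and one negative pair per training example, so 1,000 examples with `num_iterations=20` yield 40,000 pairs, which matches the `Num examples` value in the training log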
* `column_mapping`: The `SetFitTrainer` expects the inputs to be found in a `text` and `label` column. This mapping automatically formats the training and evaluation datasets for us.
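
To make `num_iterations` more concrete, here is a simplified sketch of contrastive pair generation in plain Python. It assumes that each iteration pairs every example with one same-class text (target 1.0) and one different-class text (target 0.0), which is consistent with the 40,000 examples and 2,500 steps reported in the training log further down. The function `generate_pairs` and the toy addresses are hypothetical; this is an illustration of the idea, not SetFit's actual implementation.

```python
import random


def generate_pairs(texts, labels, num_iterations, seed=0):
    """Build (text_a, text_b, similarity) triples for contrastive training.

    Assumes at least two classes and at least two examples per class.
    """
    rng = random.Random(seed)
    indexed = list(enumerate(zip(texts, labels)))
    pairs = []
    for _ in range(num_iterations):
        for i, (text, label) in indexed:
            same = [t for j, (t, lab) in indexed if lab == label and j != i]
            diff = [t for _, (t, lab) in indexed if lab != label]
            pairs.append((text, rng.choice(same), 1.0))  # positive pair
            pairs.append((text, rng.choice(diff), 0.0))  # negative pair
    return pairs


# Hypothetical toy data: two classes with three addresses each.
texts = ["1 A ST, NSW", "2 B RD, QLD", "3 C AVE, VIC",
         "4 RUE D, PARIS", "5 RUE E, LYON", "6 RUE F, NICE"]
labels = [0, 0, 0, 1, 1, 1]
pairs = generate_pairs(texts, labels, num_iterations=3)
print(len(pairs))  # 2 pairs * 6 examples * 3 iterations = 36

# The same arithmetic applied to the settings used in this notebook.
num_examples, num_iterations, batch_size = 1000, 20, 16
total_pairs = 2 * num_examples * num_iterations  # 40000
steps = total_pairs // batch_size                # 2500
print(total_pairs, steps)
```

Because the number of pairs grows linearly with `num_iterations`, doubling it roughly doubles the training time of the Sentence Transformer body.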

Now that we've created a trainer, we can train it!

In [None]:
trainer.train()

Applying column mapping to training dataset
***** Running training *****
  Num examples = 40000
  Num epochs = 1
  Total optimization steps = 2500
  Total train batch size = 16

The final step is to compute the model's performance using the `evaluate()` method:

In [None]:
metrics = trainer.evaluate()
metrics
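
Assuming the trainer is left on its default accuracy metric, `evaluate()` reports the fraction of evaluation examples whose predicted label matches the true one. The sketch below computes that quantity, plus a per-class breakdown, in plain Python; the label lists and the helper names `accuracy` and `per_class_accuracy` are hypothetical and are not taken from a real run.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def per_class_accuracy(y_true, y_pred):
    """Accuracy computed separately for each true label."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in sorted(totals)}


# Hypothetical labels for six evaluation examples across three classes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]

print(accuracy(y_true, y_pred))            # 5 of 6 correct
print(per_class_accuracy(y_true, y_pred))  # {0: 1.0, 1: 0.5, 2: 1.0}
```

With ten country classes, a per-class view like this can reveal whether errors are concentrated in a few countries even when overall accuracy is high.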

And once the model is trained, you can push it to the Hub:

In [None]:
trainer.push_to_hub("my-awesome-setfit-model")

Cloning https://huggingface.co/lewtun/my-awesome-setfit-model-3 into local empty directory.


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

Upload file pytorch_model.bin:   0%|          | 32.0k/418M [00:00<?, ?B/s]

Upload file model_head.pkl: 100%|##########| 6.76k/6.76k [00:00<?, ?B/s]

remote: Scanning LFS files for validity, may be slow...        
remote: LFS file scan complete.        
To https://huggingface.co/lewtun/my-awesome-setfit-model-3
   ce796ab..356d99b  main -> main



'https://huggingface.co/lewtun/my-awesome-setfit-model-3/commit/356d99ba2c33a8bab9c285398f3d15ce55871b9a'
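
The `huggingface/tokenizers` fork warning in the output above names its own remedy: setting the `TOKENIZERS_PARALLELISM` environment variable explicitly. A minimal sketch, assuming it is run in a cell near the top of the notebook before any tokenizer is used:

```python
import os

# Disable tokenizer parallelism up front so that later process forks
# (for example during the Git push) do not trigger the warning.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```

Choosing `"false"` trades some tokenization speed for quieter logs; `"true"` is the other value the warning lists.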

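You can now share this model with all your friends, family, and favorite pets: they can all load it with the identifier `your-username/the-name-you-picked`. The cell below shows the loading and inference pattern with a sentiment classification model; for the address classifier trained in this notebook, you would pass address strings instead and get back the integer labels defined in `mapper`: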

In [None]:
from setfit import SetFitModel

model = SetFitModel.from_pretrained("lewtun/my-awesome-setfit-model")
# Run inference
preds = model(["i loved the spiderman movie!", "pineapple on pizza is the worst 🤮"])
preds   

array([1, 0])