# Simple Vectorization and Classification Using SetFit
This notebook presents a solution to the LegalEval Rhetorical Roles (RR) task (https://sites.google.com/view/legaleval/home?pli=1). It fine-tunes a pre-trained sentence-transformer with SetFit, following the blog post and paper below:

- Outperform OpenAI GPT-3 with SetFit for text-classification: https://www.philschmid.de/getting-started-setfit
- arxiv: https://arxiv.org/abs/2209.11055

# SetFit Example

In [1]:
%pip install setfit[optuna]==0.3.0 datasets -U

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Using cached datasets-2.7.1-py3-none-any.whl (451 kB)


# RR Classification Using SetFit

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datasets import Dataset  # requires: %pip install setfit[optuna]==0.3.0 datasets -U

In [2]:
# Paths to the train and dev CSV files.
# Change these to match your own environment:
train_file = "/content/drive/MyDrive/Colab Notebooks/semEval/legalEval/taskA-RR/data/train.csv"
dev_file = "/content/drive/MyDrive/Colab Notebooks/semEval/legalEval/taskA-RR/data/dev.csv"

In [3]:
# Read in the train data
train = pd.read_csv(train_file)
train.shape

(28986, 9)

In [4]:
train['value.labels'].unique()

array(["['PREAMBLE']", "['NONE']", "['FAC']", "['ARG_RESPONDENT']",
       "['RLC']", "['ARG_PETITIONER']", "['ANALYSIS']", "['PRE_RELIED']",
       "['RATIO']", "['RPC']", "['ISSUE']", "['STA']",
       "['PRE_NOT_RELIED']"], dtype=object)

## Map the labels to numbers

In [5]:
lab2id = {"['PREAMBLE']":1, "['NONE']":2, "['FAC']":3, "['ARG_RESPONDENT']":4,
       "['RLC']":5, "['ARG_PETITIONER']":6, "['ANALYSIS']":7, "['PRE_RELIED']":8,
       "['RATIO']":9, "['RPC']":10, "['ISSUE']":11, "['STA']":12,
       "['PRE_NOT_RELIED']":13}

In [6]:
# Inverse mapping (id -> label string), derived from lab2id so the two cannot drift apart
id2lab = {idx: lab for lab, idx in lab2id.items()}
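The label strings in this dataset are stringified one-item lists such as `"['PREAMBLE']"`. A minimal sketch of how the bare label names could be recovered, assuming every row carries exactly one label; `clean_label` and `name2id` are hypothetical names, not part of the original notebook:

```python
import ast

def clean_label(raw: str) -> str:
    """Turn "['PREAMBLE']" into "PREAMBLE" (hypothetical helper)."""
    parsed = ast.literal_eval(raw)  # safely parses the list literal
    if len(parsed) != 1:
        raise ValueError(f"expected exactly one label, got {parsed!r}")
    return parsed[0]

raw_labels = ["['PREAMBLE']", "['NONE']", "['FAC']"]
names = [clean_label(r) for r in raw_labels]

# Ids start at 1 to match the convention used in lab2id
name2id = {name: i for i, name in enumerate(names, start=1)}
print(name2id)  # {'PREAMBLE': 1, 'NONE': 2, 'FAC': 3}
```

The rest of the notebook keeps the raw strings as dictionary keys, so this step is optional.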

## Extract Training sentences and labels

In [7]:
sents = train["value.text"].str.replace('\n', "").str.lower()
sents

0              in the high court of karnataka,         ...
1              beforethe hon'ble mr.justice anand byrar...
2        this criminal appeal is filed under section 37...
3               this appeal coming on for hearing this ...
4               heard the learned counsel for the appel...
                               ...                        
28981     so section 132 of the evidence act sufficient...
28982     for the reasons aforesaid, the appeal is allo...
28983    the judgment and order dated april 27, 1987 pa...
28984                                               r.s.s.
28985                                      appeal allowed.
Name: value.text, Length: 28986, dtype: object
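The cell above deletes newlines outright, which can fuse the words on either side of a line break (this may be why row 1 reads "beforethe"). A hedged alternative is to turn every run of whitespace, newlines included, into a single space. `normalize` is a hypothetical helper; the outputs shown in this notebook were produced without it, so results would differ if it were used:

```python
import re

def normalize(text: str) -> str:
    """Collapse all whitespace runs (including newlines) to one space, then lowercase."""
    return re.sub(r"\s+", " ", text).strip().lower()

print(normalize("BEFORE\nTHE HON'BLE   MR.JUSTICE"))  # before the hon'ble mr.justice
```

With pandas this could be applied as `train["value.text"].map(normalize)`.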

In [8]:
y = train['value.labels'].map(lab2id)
y

0         1
1         1
2         1
3         1
4         2
         ..
28981     9
28982    10
28983    10
28984     2
28985    10
Name: value.labels, Length: 28986, dtype: int64

## Make the train dataset for SetFit fine-tuning

In [9]:
train_df = pd.DataFrame({'text':sents, 'label':y})
train_df

| | text | label |
|---|---|---|
| 0 | in the high court of karnataka, ... | 1 |
| 1 | beforethe hon'ble mr.justice anand byrar... | 1 |
| 2 | this criminal appeal is filed under section 37... | 1 |
| 3 | this appeal coming on for hearing this ... | 1 |
| 4 | heard the learned counsel for the appel... | 2 |
| ... | ... | ... |
| 28981 | so section 132 of the evidence act sufficient... | 9 |
| 28982 | for the reasons aforesaid, the appeal is allo... | 10 |
| 28983 | the judgment and order dated april 27, 1987 pa... | 10 |
| 28984 | r.s.s. | 2 |


In [10]:
labels = y.unique()
labels

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13])

In [11]:
# Create the few-shot train set: draw samples_per_label sentences for each label
samples_per_label = 8
sampled_dfs = []
for lab in labels:
    sampled_dfs.append(
        train_df[train_df.label == lab].sample(samples_per_label, random_state=42)
    )

In [12]:
train_df_8 = pd.concat(sampled_dfs)
train_df_8

| | text | label |
|---|---|---|
| 14019 | i addl. sessio... | 1 |
| 1708 | bench:khanna, hans rajgoswami, p.k.citation: ... | 1 |
| 5869 | shri. d.patnaik, m.sc,ll.b. ... | 1 |
| 2054 | on the other hand the appellant's case isthat ... | 1 |
| 24948 | in the high court of delhi at new delhi | 1 |
| ... | ... | ... |
| 27393 | although in dhillon case conflicting view were... | 13 |
| 16972 | the other cases which we need not examine are ... | 13 |
| 4466 | the decision of the andhra pradesh high court... | 13 |
| 18029 | the assessee had paid a sum of rs.3,200 as fee... | 13 |
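The per-label sampling loop can also be written as a single `groupby` call. This is a sketch on a toy frame, assuming pandas 1.1 or later (where `DataFrameGroupBy.sample` is available) and at least `samples_per_label` rows per label; it may not select the same rows as the loop above:

```python
import pandas as pd

# Toy stand-in for train_df: 12 sentences spread evenly over 3 labels
toy = pd.DataFrame({
    "text": [f"sentence {i}" for i in range(12)],
    "label": [1, 2, 3] * 4,
})

samples_per_label = 2
few_shot = (
    toy.groupby("label")
       .sample(n=samples_per_label, random_state=42)  # n rows from every label group
       .reset_index(drop=True)
)

print(few_shot["label"].value_counts().sort_index().to_dict())  # {1: 2, 2: 2, 3: 2}
```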


In [13]:
train_dataset = Dataset.from_pandas(train_df_8.reset_index(drop=True))

In [14]:
train_dataset[0:2]

{'text': ['                                i addl. sessions judge,                                mysuru.                                              **',
  'bench:khanna, hans rajgoswami, p.k.citation:          1976 scr (3) 645 1976 air 1172 1976 scc (2) 258act:         income-tax-assessee in princely state-payment bygovernment of india by cheque posted in british india-whether receipt by assessee in british india liable toindian income tax.headnote:         the government of india was placing bulk purchaseorders with the assessee-company, a textile mill, which had,during the assessment years 1945-46, 1946-47 and 1947-48,its registered office in the hyderabad state outside britishindia.'],
 'label': [1, 1]}

## Build the test dataset (from the dev split) for SetFit evaluation

In [15]:
dev = pd.read_csv(dev_file)
dev.shape

(2890, 9)

In [16]:
sents_dev = dev["value.text"].str.replace('\n', "").str.lower()
sents_dev

0       petitioner:the commissioner of income-taxnew d...
1              date of judgment:05/05/1961bench:das, s.k.
2       bench:das, s.k.hidayatullah, m.shah, j.c.citat...
3       itentered into transactions in the nature of f...
4       the assessee claimed deduction of theselosses ...
                              ...                        
2885                           the petitions are allowed.
2886    the impugned orders are set aside with directi...
2887     the respondent having challenged the judgment...
2888    therefore, having regard to the law laid down ...
2889                                        sd/- judge nv
Name: value.text, Length: 2890, dtype: object

In [17]:
y_dev = dev['value.labels'].map(lab2id)
y_dev

0        1
1        1
2        1
3        1
4        1
        ..
2885    10
2886    10
2887    10
2888    10
2889     2
Name: value.labels, Length: 2890, dtype: int64

In [18]:
test_df = pd.DataFrame({'text':sents_dev, 'label':y_dev})
test_df

| | text | label |
|---|---|---|
| 0 | petitioner:the commissioner of income-taxnew d... | 1 |
| 1 | date of judgment:05/05/1961bench:das, s.k. | 1 |
| 2 | bench:das, s.k.hidayatullah, m.shah, j.c.citat... | 1 |
| 3 | itentered into transactions in the nature of f... | 1 |
| 4 | the assessee claimed deduction of theselosses ... | 1 |
| ... | ... | ... |
| 2885 | the petitions are allowed. | 10 |
| 2886 | the impugned orders are set aside with directi... | 10 |
| 2887 | the respondent having challenged the judgment... | 10 |
| 2888 | therefore, having regard to the law laid down ... | 10 |


In [19]:
test_dataset = Dataset.from_pandas(test_df.reset_index(drop=True))

In [20]:
test_dataset[:2]

{'text': ['petitioner:the commissioner of income-taxnew delhivs.respondent:m/s. chuni lal moonga ram',
  'date of judgment:05/05/1961bench:das, s.k.'],
 'label': [1, 1]}

## Fine-tuning SetFit and Classifying

In [24]:
import torch
torch.cuda.empty_cache()

In [26]:
train_dataset.shape, test_dataset.shape

((104, 2), (2890, 2))

In [22]:
from setfit import SetFitModel, SetFitTrainer
from sentence_transformers.losses import CosineSimilarityLoss

# Load a SetFit model from Hub
model_id = "sentence-transformers/all-mpnet-base-v2"
model = SetFitModel.from_pretrained(model_id)

# Create trainer
trainer = SetFitTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    loss_class=CosineSimilarityLoss,
    metric="accuracy",
    #batch_size=64,
    batch_size=32,
    num_iterations=20, # The number of text pairs to generate for contrastive learning
    #num_iterations=10,
    num_epochs=1, # The number of epochs to use for contrastive learning
)

# Train the model; evaluation is run in the next cell
trainer.train()


model_head.pkl not found on HuggingFace Hub, initialising classification head with random weights. You should TRAIN this model on a downstream task to use it for predictions and inference.
***** Running training *****
  Num examples = 4160
  Num epochs = 1
  Total optimization steps = 130
  Total train batch size = 32
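The figures in this log are consistent with SetFit building one positive and one negative sentence pair per training sample in each iteration. That pairing rule is an assumption inferred from the numbers, not something stated in the notebook. The arithmetic can be sketched as follows, where `contrastive_training_size` is a hypothetical helper:

```python
import math

def contrastive_training_size(n_samples, num_iterations, batch_size, num_epochs=1):
    """Estimate pair count and optimizer steps for the contrastive phase."""
    # Assumption: each iteration yields 1 positive + 1 negative pair per sample
    num_pairs = 2 * n_samples * num_iterations
    steps = math.ceil(num_pairs / batch_size) * num_epochs
    return num_pairs, steps

# 104 few-shot sentences, num_iterations=20, batch_size=32, num_epochs=1
print(contrastive_training_size(104, 20, 32))  # (4160, 130)
```

Under this assumption, halving `num_iterations` to 10 would halve both the number of pairs and the number of optimization steps.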



In [23]:
metrics = trainer.evaluate()

print(f"model used: {model_id}")
print(f"train dataset: {len(train_dataset)} samples")
print(f"accuracy: {metrics['accuracy']}")


***** Running evaluation *****


model used: sentence-transformers/all-mpnet-base-v2
train dataset: 104 samples
accuracy: 0.2519031141868512


## Naive Bayes Classification
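The cells in this section use `features` and `features_dev`, which are not defined anywhere in this notebook. A minimal sketch of how such matrices could be built, assuming a TF-IDF bag-of-words representation; the vectorizer actually used is not shown in the source, so `TfidfVectorizer` is an assumption, and the toy sentences stand in for `sents` and `sents_dev`:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for the notebook's `sents` (train) and `sents_dev` (dev)
toy_train = [
    "the appeal is allowed",
    "heard the learned counsel",
    "the appeal is dismissed",
]
toy_dev = ["the petition is allowed"]

vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(toy_train)  # learn the vocabulary on train only
features_dev = vectorizer.transform(toy_dev)    # reuse that vocabulary for dev

print(features.shape, features_dev.shape)
```

Fitting on the training sentences only keeps dev-set vocabulary out of the model. TF-IDF values are non-negative, which `MultinomialNB` requires.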

In [None]:
from sklearn.naive_bayes import MultinomialNB

In [None]:
from sklearn.metrics import precision_recall_fscore_support

In [None]:
nbcls = MultinomialNB()

In [None]:
nbcls.fit(features, y)

MultinomialNB()

In [None]:
predicts = nbcls.predict(features_dev)

In [None]:
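# scikit-learn expects (y_true, y_pred). The figures recorded below came from a run with
# the two arguments swapped: weighted recall (which equals accuracy) is unaffected, but
# weighted precision and F1 will differ when this cell is re-run.
evals_dev = precision_recall_fscore_support(y_dev, predicts, average='weighted')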

  _warn_prf(average, modifier, msg_start, len(result))


In [None]:
print('weighted precision on dev: {}'.format(evals_dev[0]))
print('weighted recall on dev: {}'.format(evals_dev[1]))
print('weighted f1score on dev: {}'.format(evals_dev[2]))

weighted precision on dev: 0.7010339420321985
weighted recall on dev: 0.5141868512110727
weighted f1score on dev: 0.5620552512797896
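`precision_recall_fscore_support` takes its arguments in the order `(y_true, y_pred)`, and with `average='weighted'` that order matters because the class weights come from the first argument. A toy sketch of the effect, using made-up labels:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Toy labels: three items of class 1, one of class 2; one class-1 item is misclassified
y_true = [1, 1, 1, 2]
y_pred = [1, 1, 2, 2]

# Documented order: ground truth first, predictions second
p_ok, r_ok, f_ok, _ = precision_recall_fscore_support(y_true, y_pred, average="weighted")
# Swapped order: predictions are treated as the ground truth
p_sw, r_sw, f_sw, _ = precision_recall_fscore_support(y_pred, y_true, average="weighted")
acc = accuracy_score(y_true, y_pred)

print(f"correct order: precision={p_ok:.4f} recall={r_ok:.4f} f1={f_ok:.4f}")
print(f"swapped order: precision={p_sw:.4f} recall={r_sw:.4f} f1={f_sw:.4f}")
print(f"accuracy: {acc:.4f}")
```

Weighted recall equals accuracy in both orders, so the weighted recall reported above can be compared directly with the SetFit accuracy. Weighted precision and F1 do change with the argument order.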


## Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
lr = LogisticRegression()

In [None]:
lr.fit(features, y)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression()
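The warning above means the default solver stopped at its iteration limit (`max_iter=100`) before converging. One common remedy is to raise `max_iter`. The sketch below uses synthetic data from `make_classification` rather than the notebook's features, and the value 1000 is an arbitrary choice, not a tuned setting:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic 3-class problem standing in for the real feature matrix
X, y_toy = make_classification(
    n_samples=200, n_features=20, n_informative=5, n_classes=3, random_state=0
)

clf = LogisticRegression(max_iter=1000)  # default is max_iter=100
clf.fit(X, y_toy)

# n_iter_ reports how many iterations the solver actually used
print(clf.n_iter_)
```

If `n_iter_` stays below `max_iter`, the solver converged. Re-fitting the notebook's model this way would likely change the scores reported below.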

In [None]:
predicts = lr.predict(features_dev)

In [None]:
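# Argument order is (y_true, y_pred). The tuple recorded below came from a run with the
# arguments swapped, so its precision and F1 entries will differ on re-run.
evals_dev = precision_recall_fscore_support(y_dev, predicts, average='weighted')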

  _warn_prf(average, modifier, msg_start, len(result))


In [None]:
evals_dev

(0.6520520451227161, 0.5539792387543253, 0.5814886909919296, None)