## Stanford Sentiment Treebank with 5 labels (SST-5)

The Stanford Sentiment Treebank with 5 labels (SST-5) is a dataset for fine-grained sentiment analysis. It contains 11,855 sentences taken from movie reviews and 215,154 unique phrases from their parse trees. Each phrase was annotated by three human judges as negative, somewhat negative, neutral, somewhat positive, or positive. These five classes give the dataset its name.

See https://paperswithcode.com/dataset/sst-5 for more detail.
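
For reference, the five classes can be written as a small lookup table. The sketch below assumes the 1-to-5 numbering that this notebook uses when it writes the text files (pytreebank's 0-to-4 labels shifted up by one). The `LABEL_NAMES` dictionary and the `describe` helper are illustrative names, not part of the dataset or of any library.

```python
# Illustrative mapping from the notebook's 1-5 labels to the class names above.
LABEL_NAMES = {
    1: "negative",
    2: "somewhat negative",
    3: "neutral",
    4: "somewhat positive",
    5: "positive",
}

def describe(label):
    """Return the class name for a 1-5 label given as an int or a string."""
    return LABEL_NAMES[int(label)]

print(describe("4"))
```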

### $\S1.$ Storing data into text files

In [1]:
import os
import pytreebank  # loader for the Stanford Sentiment Treebank, from https://pypi.org/project/pytreebank/

dataset = pytreebank.load_sst()

In [2]:
print(dataset['train'][0])

(3 (2 (2 The) (2 Rock) )(4 (3 (2 is) (4 (2 destined) (2 (2 (2 (2 (2 to) (2 (2 be) (2 (2 the) (2 (2 21st) (2 (2 (2 Century) (2 's) )(2 (3 new) (2 (2 ``) (2 Conan) )))))))(2 '') )(2 and) )(3 (2 that) (3 (2 he) (3 (2 's) (3 (2 going) (3 (2 to) (4 (3 (2 make) (3 (3 (2 a) (3 splash) )(2 (2 even) (3 greater) )))(2 (2 than) (2 (2 (2 (2 (1 (2 Arnold) (2 Schwarzenegger) )(2 ,) )(2 (2 Jean-Claud) (2 (2 Van) (2 Damme) )))(2 or) )(2 (2 Steven) (2 Segal) ))))))))))))(2 .) ))
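
The printed tree nests a sentiment label (0 to 4) in front of every phrase, with the label of the whole sentence at the root. As a rough illustration of how that bracketed format can be read, the sketch below pulls out the root label and the leaf tokens with regular expressions. `parse_sst_tree` is a hypothetical helper written for this note, and the example string is a shortened, made-up tree in the same format; pytreebank does this parsing itself.

```python
import re

def parse_sst_tree(line):
    """Return (root_label, tokens) from a bracketed SST tree string."""
    # The root label is the first digit after the opening parenthesis.
    root_label = int(re.match(r"\s*\((\d)", line).group(1))
    # A leaf looks like "(2 word)": a label followed by one token, no nested parentheses.
    tokens = re.findall(r"\(\d ([^()\s]+)\)", line)
    return root_label, tokens

# Shortened example in the same format as the output above (not a real dataset entry).
example = "(3 (2 (2 The) (2 Rock) )(4 (3 (2 is) (4 destined) )(2 .) ))"
label, tokens = parse_sst_tree(example)
print(label, " ".join(tokens))
```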


In [3]:
out_path = 'sst_{}.txt'  # filename template; {} is replaced by the split name

In [4]:
for category in ['train', 'test', 'dev']:
    with open(out_path.format(category), 'w') as outfile:
        for item in dataset[category]:
            # The first labeled line is the full sentence (the root of the tree)
            label, sentence = item.to_labeled_lines()[0]
            # pytreebank labels run from 0 to 4; shift them to 1 to 5
            outfile.write("__label__{}\t{}\n".format(label + 1, sentence))

# Print the size of each split
print("# of training data:", len(dataset['train']))
print("# of validation data:", len(dataset['dev']))
print("# of test data:", len(dataset['test']))

# of training data: 8544
# of validation data: 1101
# of test data: 2210
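
Each line written above has the form `__label__<k>` followed by a tab and the sentence. The sketch below shows that line format in isolation, together with the inverse operation. `format_line` and `parse_line` are hypothetical helpers for illustration only; the notebook itself reads the files back with pandas in the next section.

```python
PREFIX = "__label__"

def format_line(label, sentence):
    # pytreebank labels run from 0 to 4; the notebook stores them as 1 to 5.
    return f"{PREFIX}{label + 1}\t{sentence}\n"

def parse_line(line):
    # Split on the first tab only, so tabs inside the sentence would survive.
    tagged_label, sentence = line.rstrip("\n").split("\t", 1)
    return int(tagged_label[len(PREFIX):]), sentence

line = format_line(3, "Yet the act is still charming here .")
print(repr(line))
print(parse_line(line))
```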


### $\S2.$ Making data frames

In [2]:
import pandas as pd
from tqdm import tqdm
from tqdm import trange

In [6]:
df_train = pd.read_csv("sst_train.txt", sep="\t", header=None, names=['label','text'], encoding='latin-1')
df_train['label'] = df_train['label'].str.replace("__label__","")
df_train

| | label | text |
|---|---|---|
| 0 | 4 | The Rock is destined to be the 21st Century 's... |
| 1 | 5 | The gorgeously elaborate continuation of `` Th... |
| 2 | 4 | Singer/composer Bryan Adams contributes a slew... |
| 3 | 3 | You 'd think by now America would have had eno... |
| 4 | 4 | Yet the act is still charming here . |
| ... | ... | ... |
| 8539 | 1 | A real snooze . |
| 8540 | 2 | No surprises . |
| 8541 | 4 | We 've seen the hippie-turned-yuppie plot befo... |
| 8542 | 1 | Her fans walked out muttering words like `` ho... |


In [7]:
df_dev = pd.read_csv("sst_dev.txt", sep="\t", header=None, names=['label','text'], encoding='latin-1')
df_dev['label'] = df_dev['label'].str.replace("__label__","")
df_dev

| | label | text |
|---|---|---|
| 0 | 4 | It 's a lovely film with lovely performances b... |
| 1 | 3 | No one goes unindicted here , which is probabl... |
| 2 | 4 | And if you 're not nearly moved to tears by a ... |
| 3 | 5 | A warm , funny , engaging film . |
| 4 | 5 | Uses sharp humor and insight into human nature... |
| ... | ... | ... |
| 1096 | 2 | it seems to me the film is about the art of ri... |
| 1097 | 2 | It 's just disappointingly superficial -- a mo... |
| 1098 | 2 | The title not only describes its main characte... |
| 1099 | 3 | Sometimes it feels as if it might have been ma... |


In [8]:
df_test = pd.read_csv("sst_test.txt", sep="\t", header=None, names=['label','text'], encoding='latin-1')
df_test['label'] = df_test['label'].str.replace("__label__","")
df_test

| | label | text |
|---|---|---|
| 0 | 3 | Effective but too-tepid biopic |
| 1 | 4 | If you sometimes like to go to the movies to h... |
| 2 | 5 | Emerges as something rare , an issue movie tha... |
| 3 | 3 | The film provides some great insight into the ... |
| 4 | 5 | Offers that rare combination of entertainment ... |
| ... | ... | ... |
| 2205 | 4 | An imaginative comedy/thriller . |
| 2206 | 5 | ( A ) rare , beautiful film . |
| 2207 | 5 | ( An ) hilarious romantic comedy . |
| 2208 | 4 | Never ( sinks ) into exploitation . |


### $\S3.$ Vectorizing sentences using a pretrained GTE model

In [6]:
from sentence_transformers import SentenceTransformer



In [7]:
# Function that vectorizes a sentence

def get_sentence_embedding(text, sentence_model):
    # .strip() removes leading and trailing whitespace, so blank strings are caught here
    if not text.strip():
        print("Attempted to get embedding for empty text.")
        return []

    embedding = sentence_model.encode(text)

    return embedding.tolist()
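
The function above encodes one sentence per call. `SentenceTransformer.encode` also accepts a list of sentences (with a `batch_size` argument), which is usually faster on a GPU. The sketch below shows only the batching logic, keeping the same convention of returning `[]` for blank text. `embed_in_batches` and `toy_encode` are hypothetical names, and `toy_encode` is a stand-in so the example runs without downloading a model.

```python
def embed_in_batches(texts, encode_fn, batch_size=32):
    """Encode non-blank texts in batches; blank texts get an empty list."""
    results = [[] for _ in texts]
    keep = [i for i, t in enumerate(texts) if t.strip()]
    for start in range(0, len(keep), batch_size):
        idx = keep[start:start + batch_size]
        vectors = encode_fn([texts[i] for i in idx])
        for i, vec in zip(idx, vectors):
            results[i] = list(vec)
    return results

# Stand-in encoder for demonstration only: maps a text to [characters, words].
def toy_encode(batch):
    return [[float(len(t)), float(len(t.split()))] for t in batch]

vectors = embed_in_batches(["A real snooze .", "  ", "No surprises ."], toy_encode, batch_size=2)
print(vectors)
```

With a real model, something like `encode_fn=lambda batch: sentence_model.encode(batch).tolist()` could take the place of `toy_encode`; that substitution is untested here.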

In [11]:
# Loading pretrained GTE model for generating sentence embeddings
gte_model = SentenceTransformer('thenlper/gte-large')

In [12]:
tqdm.pandas()

In [13]:
df_train['vector_gte'] = df_train['text'].progress_apply(lambda x: get_sentence_embedding(x, gte_model))
df_train

100%|██████████| 8544/8544 [02:45<00:00, 51.50it/s]


| | label | text | vector_gte |
|---|---|---|---|
| 0 | 4 | The Rock is destined to be the 21st Century 's... | [0.028988052159547806, -0.009446077048778534, ... |
| 1 | 5 | The gorgeously elaborate continuation of `` Th... | [0.009347114711999893, -0.0242688599973917, 0.... |
| 2 | 4 | Singer/composer Bryan Adams contributes a slew... | [0.008711150847375393, 0.02659980021417141, 0.... |
| 3 | 3 | You 'd think by now America would have had eno... | [0.0290781632065773, -0.013503121212124825, -0... |
| 4 | 4 | Yet the act is still charming here . | [-0.009661804884672165, -0.008504939265549183,... |
| ... | ... | ... | ... |
| 8539 | 1 | A real snooze . | [0.022056950256228447, 0.005160029046237469, -... |
| 8540 | 2 | No surprises . | [-0.02582346461713314, 0.01339962799102068, 0.... |
| 8541 | 4 | We 've seen the hippie-turned-yuppie plot befo... | [0.022835902869701385, 0.005308559164404869, -... |
| 8542 | 1 | Her fans walked out muttering words like `` ho... | [0.0018498250283300877, -0.006132108625024557,... |


In [19]:
df_dev['vector_gte'] = df_dev['text'].progress_apply(lambda x: get_sentence_embedding(x, gte_model))
df_dev

100%|██████████| 1101/1101 [00:25<00:00, 42.51it/s]


| | label | text | vector_gte |
|---|---|---|---|
| 0 | 4 | It 's a lovely film with lovely performances b... | [0.014344191178679466, -0.0142461396753788, -0... |
| 1 | 3 | No one goes unindicted here , which is probabl... | [-0.03392713889479637, 0.004705153871327639, -... |
| 2 | 4 | And if you 're not nearly moved to tears by a ... | [0.03554988279938698, 0.005494722165167332, 0.... |
| 3 | 5 | A warm , funny , engaging film . | [0.0033094475511461496, -0.006415199488401413,... |
| 4 | 5 | Uses sharp humor and insight into human nature... | [-0.003727895673364401, -0.01158248633146286, ... |
| ... | ... | ... | ... |
| 1096 | 2 | it seems to me the film is about the art of ri... | [-0.0041084568947553635, 0.0022037599701434374... |
| 1097 | 2 | It 's just disappointingly superficial -- a mo... | [0.02682664804160595, -0.0030683069489896297, ... |
| 1098 | 2 | The title not only describes its main characte... | [-0.01506032980978489, -0.01804324984550476, -... |
| 1099 | 3 | Sometimes it feels as if it might have been ma... | [0.01700528711080551, -0.019340755417943, 0.01... |


In [20]:
df_test['vector_gte'] = df_test['text'].progress_apply(lambda x: get_sentence_embedding(x, gte_model))
df_test

100%|██████████| 2210/2210 [00:41<00:00, 52.93it/s]


| | label | text | vector_gte |
|---|---|---|---|
| 0 | 3 | Effective but too-tepid biopic | [0.03771600499749184, -0.01407541148364544, -0... |
| 1 | 4 | If you sometimes like to go to the movies to h... | [0.004801100119948387, 0.016335587948560715, -... |
| 2 | 5 | Emerges as something rare , an issue movie tha... | [-0.007570610381662846, -0.014615904539823532,... |
| 3 | 3 | The film provides some great insight into the ... | [0.0174684040248394, 0.0014129565097391605, -0... |
| 4 | 5 | Offers that rare combination of entertainment ... | [-0.0018031998770311475, 0.00950391124933958, ... |
| ... | ... | ... | ... |
| 2205 | 4 | An imaginative comedy/thriller . | [0.0037158015184104443, -0.017833540216088295,... |
| 2206 | 5 | ( A ) rare , beautiful film . | [0.014899299480021, -0.0012953567784279585, -0... |
| 2207 | 5 | ( An ) hilarious romantic comedy . | [0.00384555128403008, -0.028396954759955406, -... |
| 2208 | 4 | Never ( sinks ) into exploitation . | [-0.006038010120391846, 0.018880706280469894, ... |


In [21]:
df_train.to_parquet("SST-5_train_gte.parquet")
df_dev.to_parquet("SST-5_validation_gte.parquet")
df_test.to_parquet("SST-5_test_gte.parquet")

**Remark**. If you have already run the code above, load the saved embeddings instead of recomputing them:

In [3]:
df_train = pd.read_parquet("SST-5_train_gte.parquet")
df_dev = pd.read_parquet("SST-5_validation_gte.parquet")
df_test = pd.read_parquet("SST-5_test_gte.parquet")

In [4]:
df_train['label'] = df_train['label'].astype(int)
df_dev['label'] = df_dev['label'].astype(int)
df_test['label'] = df_test['label'].astype(int)

### $\S4.$ Fine-tuning the pretrained GTE model

In [18]:
# texts[0] contains label 1
# ...
# texts[4] contains label 5

texts = []

for i in range(5):
    texts.append( df_train[df_train['label'] == i+1]['text'].tolist() )

In [19]:
for i in range(5):
    print(f"# of {i+1}-star ratings:", len(texts[i]))

# of 1-star ratings: 1092
# of 2-star ratings: 2218
# of 3-star ratings: 1624
# of 4-star ratings: 2322
# of 5-star ratings: 1288


In [23]:
import random

In [25]:
random.seed(100) # setting the random seed 

In [26]:
sampled_texts = []

In [32]:
# Downsample every class to 1092 sentences, the size of the smallest class (label 1)
for i in range(5):
    sampled_texts.append(random.sample(texts[i], 1092))

In [35]:
for s in sampled_texts:
    print(len(s))

1092
1092
1092
1092
1092
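
The cells above hard-code 1092, the size of the smallest class. A variant that derives the sample size from the data might look like the sketch below. `balanced_sample` is a hypothetical helper and the toy class sizes are made up. It uses a local `random.Random(seed)` rather than the global `random.seed` call, so it will not reproduce the exact samples drawn in the notebook.

```python
import random

def balanced_sample(texts_by_class, seed=100):
    """Downsample every class to the size of the smallest one."""
    rng = random.Random(seed)  # local generator; the global random state is untouched
    n = min(len(t) for t in texts_by_class)
    return [rng.sample(t, n) for t in texts_by_class]

# Toy data: five classes with unequal sizes.
toy = [[f"c{c}_s{i}" for i in range(size)] for c, size in enumerate([3, 6, 4, 7, 5])]
sampled = balanced_sample(toy)
print([len(s) for s in sampled])
```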


In [22]:
from datasets import Dataset
from datasets import concatenate_datasets # https://huggingface.co/docs/datasets/

In [52]:
# Creating an (anchor, positive, negative) triplet dataset that separates rating 5 from rating 1
# Each sampled class has 1092 sentences, split into two halves of 1092 / 2 = 546

dataset_5_from_1_first = Dataset.from_dict({
    'anchor': sampled_texts[4][:546], # index 4 means rating 5
    'positive': sampled_texts[4][546:],
    'negative': sampled_texts[0][:546] # index 0 means rating 1
})

In [53]:
dataset_5_from_1_second = Dataset.from_dict({
    'anchor': sampled_texts[4][:546],
    'positive': sampled_texts[4][546:],
    'negative': sampled_texts[0][546:]
})

In [54]:
# Creating an (anchor, positive, negative) triplet dataset that separates rating 5 from rating 2
# Each sampled class has 1092 sentences, split into two halves of 1092 / 2 = 546

dataset_5_from_2_first = Dataset.from_dict({
    'anchor': sampled_texts[4][:546],
    'positive': sampled_texts[4][546:],
    'negative': sampled_texts[1][:546] # idx 1 means rating 2
})

In [55]:
dataset_5_from_2_second = Dataset.from_dict({
    'anchor': sampled_texts[4][:546],
    'positive': sampled_texts[4][546:],
    'negative': sampled_texts[1][546:] # idx 1 means rating 2
})

In [56]:
# Creating an (anchor, positive, negative) triplet dataset that separates rating 1 from rating 5
# Each sampled class has 1092 sentences, split into two halves of 1092 / 2 = 546

dataset_1_from_5_first = Dataset.from_dict({
    'anchor': sampled_texts[0][:546],
    'positive': sampled_texts[0][546:],
    'negative': sampled_texts[4][:546]
})

In [57]:
dataset_1_from_5_second = Dataset.from_dict({
    'anchor': sampled_texts[0][:546],
    'positive': sampled_texts[0][546:],
    'negative': sampled_texts[4][546:]
})

In [58]:
# Creating an (anchor, positive, negative) triplet dataset that separates rating 1 from rating 4
# Each sampled class has 1092 sentences, split into two halves of 1092 / 2 = 546

dataset_1_from_4_first = Dataset.from_dict({
    'anchor': sampled_texts[0][:546],
    'positive': sampled_texts[0][546:],
    'negative': sampled_texts[3][:546]
})

In [59]:
dataset_1_from_4_second = Dataset.from_dict({
    'anchor': sampled_texts[0][:546],
    'positive': sampled_texts[0][546:],
    'negative': sampled_texts[3][546:]
})

In [60]:
# Concatenate datasets -- https://huggingface.co/docs/datasets/v1.3.0/processing.html

train_dataset = concatenate_datasets(
    [
        dataset_5_from_1_first,
        dataset_5_from_1_second,
        dataset_5_from_2_first,
        dataset_5_from_2_second,
        dataset_1_from_5_first,
        dataset_1_from_5_second,
        dataset_1_from_4_first,
        dataset_1_from_4_second
    ])
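
The eight datasets above follow one pattern: the first half of a class serves as anchors, the second half as positives, and each half of another class serves in turn as negatives. The sketch below expresses that pattern as a single helper on toy data. `make_triplets` is a hypothetical name, and the result is a plain dictionary of lists rather than a `datasets.Dataset`.

```python
def make_triplets(anchor_class, negative_class, half):
    """Build two triplet sets: same anchors and positives, one set per half of the negatives."""
    anchors, positives = anchor_class[:half], anchor_class[half:]
    triplets = {"anchor": [], "positive": [], "negative": []}
    for negatives in (negative_class[:half], negative_class[half:]):
        triplets["anchor"] += anchors
        triplets["positive"] += positives
        triplets["negative"] += negatives
    return triplets

# Toy stand-in for sampled_texts: 5 ratings with 4 sentences each, so half = 2.
toy = [[f"r{r + 1}_{i}" for i in range(4)] for r in range(5)]
# (anchor index, negative index) pairs, in the order used above: 5 vs 1, 5 vs 2, 1 vs 5, 1 vs 4.
pairs = [(4, 0), (4, 1), (0, 4), (0, 3)]

combined = {"anchor": [], "positive": [], "negative": []}
for a, n in pairs:
    part = make_triplets(toy[a], toy[n], half=2)
    for key in combined:
        combined[key] += part[key]

print(len(combined["anchor"]))
```

With the real data (`half=546`) this pattern yields 4 × 2 × 546 = 4368 triplets, and `Dataset.from_dict(combined)` would turn the dictionary into a Hugging Face dataset.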

In [61]:
gte_model._first_module()

Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 

In [62]:
auto_model = gte_model._first_module().auto_model
auto_model

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 1024, padding_idx=0)
    (position_embeddings): Embedding(512, 1024)
    (token_type_embeddings): Embedding(2, 1024)
    (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-23): 24 x BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=1024, out_features=1024, bias=True)
            (key): Linear(in_features=1024, out_features=1024, bias=True)
            (value): Linear(in_features=1024, out_features=1024, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=1024, out_features=1024, bias=True)
            (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inpl

In [63]:
# Freeze every parameter (printing each name); selected layers are unfrozen in the next cell
for name, param in auto_model.named_parameters():
    print(name)
    param.requires_grad = False

embeddings.word_embeddings.weight
embeddings.position_embeddings.weight
embeddings.token_type_embeddings.weight
embeddings.LayerNorm.weight
embeddings.LayerNorm.bias
encoder.layer.0.attention.self.query.weight
encoder.layer.0.attention.self.query.bias
encoder.layer.0.attention.self.key.weight
encoder.layer.0.attention.self.key.bias
encoder.layer.0.attention.self.value.weight
encoder.layer.0.attention.self.value.bias
encoder.layer.0.attention.output.dense.weight
encoder.layer.0.attention.output.dense.bias
encoder.layer.0.attention.output.LayerNorm.weight
encoder.layer.0.attention.output.LayerNorm.bias
encoder.layer.0.intermediate.dense.weight
encoder.layer.0.intermediate.dense.bias
encoder.layer.0.output.dense.weight
encoder.layer.0.output.dense.bias
encoder.layer.0.output.LayerNorm.weight
encoder.layer.0.output.LayerNorm.bias
encoder.layer.1.attention.self.query.weight
encoder.layer.1.attention.self.query.bias
encoder.layer.1.attention.self.key.weight
encoder.layer.1.attention.self.key

In [64]:
# Unfreeze only the feed-forward weights (intermediate.dense and output.dense)
# of the last four encoder layers, 20 through 23
trainable_names = {
    f'encoder.layer.{i}.{part}.dense.{kind}'
    for i in range(20, 24)
    for part in ('intermediate', 'output')
    for kind in ('weight', 'bias')
}

for name, param in auto_model.named_parameters():
    if name in trainable_names:
        param.requires_grad = True
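
As a back-of-the-envelope check on how much of the model this unfreezes, the trainable parameters can be counted by hand. The hidden size of 1024 comes from the printed model above. The feed-forward width of 4096 is an assumption (the usual BERT-large value), since the printed output is truncated before the `intermediate` layer appears.

```python
HIDDEN = 1024        # from the printed model: Linear(in_features=1024, ...)
INTERMEDIATE = 4096  # assumption: BERT-large feed-forward width, not shown in the truncated output

def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

# intermediate.dense maps HIDDEN -> INTERMEDIATE, output.dense maps it back.
per_layer = dense_params(HIDDEN, INTERMEDIATE) + dense_params(INTERMEDIATE, HIDDEN)
trainable = per_layer * len(range(20, 24))
print(per_layer, trainable)
```

On the loaded model, `sum(p.numel() for p in auto_model.parameters() if p.requires_grad)` should report the actual count, which is the number to trust if it differs from this estimate.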

In [65]:
for name, param in auto_model.named_parameters():
    print(name, ": ",param.requires_grad)

embeddings.word_embeddings.weight :  False
embeddings.position_embeddings.weight :  False
embeddings.token_type_embeddings.weight :  False
embeddings.LayerNorm.weight :  False
embeddings.LayerNorm.bias :  False
encoder.layer.0.attention.self.query.weight :  False
encoder.layer.0.attention.self.query.bias :  False
encoder.layer.0.attention.self.key.weight :  False
encoder.layer.0.attention.self.key.bias :  False
encoder.layer.0.attention.self.value.weight :  False
encoder.layer.0.attention.self.value.bias :  False
encoder.layer.0.attention.output.dense.weight :  False
encoder.layer.0.attention.output.dense.bias :  False
encoder.layer.0.attention.output.LayerNorm.weight :  False
encoder.layer.0.attention.output.LayerNorm.bias :  False
encoder.layer.0.intermediate.dense.weight :  False
encoder.layer.0.intermediate.dense.bias :  False
encoder.layer.0.output.dense.weight :  False
encoder.layer.0.output.dense.bias :  False
encoder.layer.0.output.LayerNorm.weight :  False
encoder.layer.0.outp

In [70]:
from sentence_transformers import SentenceTransformerTrainer, SentenceTransformerTrainingArguments, losses
from sentence_transformers.training_args import BatchSamplers

# https://www.sbert.net/docs/sentence_transformer/training_overview.html

In [68]:
loss = losses.MultipleNegativesRankingLoss(gte_model)

# https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss
# https://arxiv.org/pdf/1705.00652 -- Section 4.4
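
To make the loss more concrete, here is a plain-Python sketch of the idea behind it, as described in the references above: for each anchor, its own positive should score higher than every other candidate in the batch (the other positives and all negatives), and the loss is the cross-entropy of that ranking. The scale of 20 and the use of cosine similarity are assumptions about the library defaults. `mnr_loss` is an illustrative function, not the library implementation.

```python
import math

def cos_sim(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mnr_loss(anchors, positives, negatives, scale=20.0):
    """Mean cross-entropy where the correct candidate for anchor i is positive i."""
    candidates = positives + negatives
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cos_sim(anchor, c) for c in candidates]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[i]
    return total / len(anchors)

# Toy 2-d embeddings: positives lie close to their anchors, negatives point away.
anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[0.9, 0.1], [0.1, 0.9]]
negatives = [[-1.0, 0.0], [0.0, -1.0]]

good = mnr_loss(anchors, positives, negatives)
bad = mnr_loss(anchors, negatives, positives)  # roles swapped on purpose
print(good < bad)
```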

In [71]:
# https://www.sbert.net/docs/package_reference/sentence_transformer/training_args.html#sentence_transformers.training_args.SentenceTransformerTrainingArguments

args = SentenceTransformerTrainingArguments(
    # Required parameter:
    output_dir="models/fine_tuned_gte",

    # Optional training parameters:
    num_train_epochs=3, # default 3
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=2e-5, # default 5e-5
    warmup_ratio=0.1, # Ratio of total training steps used for a linear warmup from 0 to learning_rate
    fp16=True,  # Set to False if you get an error that your GPU can't run on FP16
    bf16=False,  # Set to True if you have a GPU that supports BF16
    batch_sampler=BatchSamplers.NO_DUPLICATES  # MultipleNegativesRankingLoss benefits from no duplicate samples in a batch
)

In [72]:
trainer = SentenceTransformerTrainer(
    model=gte_model,
    train_dataset=train_dataset,
    loss=loss,
    args=args
)

In [73]:
trainer.train()

  0%|          | 0/819 [00:00<?, ?it/s]

{'loss': 2.9312, 'grad_norm': 2.674386739730835, 'learning_rate': 8.683853459972865e-06, 'epoch': 1.83}


Computing widget examples:   0%|          | 0/1 [00:00<?, ?example/s]

{'train_runtime': 117.8969, 'train_samples_per_second': 111.148, 'train_steps_per_second': 6.947, 'train_loss': 2.793761014647245, 'epoch': 3.0}


TrainOutput(global_step=819, training_loss=2.793761014647245, metrics={'train_runtime': 117.8969, 'train_samples_per_second': 111.148, 'train_steps_per_second': 6.947, 'train_loss': 2.793761014647245, 'epoch': 3.0})
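
The step count and throughput in the log can be checked against the training setup. The sketch below assumes a single device and no gradient accumulation. The row count follows from the eight datasets of 546 triplets each, and the runtime is copied from the log above.

```python
import math

rows = 8 * 546            # eight triplet datasets of 546 rows each
batch_size = 16           # per_device_train_batch_size
epochs = 3                # num_train_epochs
train_runtime = 117.8969  # seconds, from the log above

steps_per_epoch = math.ceil(rows / batch_size)
total_steps = steps_per_epoch * epochs
warmup_steps = math.ceil(0.1 * total_steps)  # warmup_ratio=0.1, assuming the count is rounded up
samples_per_second = rows * epochs / train_runtime

print(steps_per_epoch, total_steps, warmup_steps, round(samples_per_second, 3))
```

The total of 819 steps matches `global_step=819`, and the throughput agrees with the reported `train_samples_per_second` of 111.148.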

In [74]:
trainer.save_model('./fine_tuned_gte_august22')

**Remark**. If you have already run the code above, load the saved model instead of retraining:

In [8]:
fine_tuned_model = SentenceTransformer('./fine_tuned_gte_august22')

In [12]:
# fine-tuned vectorization:

df_train['vector_ft'] = df_train['text'].progress_apply(lambda x: get_sentence_embedding(x, fine_tuned_model))
df_dev['vector_ft'] = df_dev['text'].progress_apply(lambda x: get_sentence_embedding(x, fine_tuned_model))
df_test['vector_ft'] = df_test['text'].progress_apply(lambda x: get_sentence_embedding(x, fine_tuned_model))

100%|██████████| 8544/8544 [03:17<00:00, 43.23it/s]
100%|██████████| 1101/1101 [00:24<00:00, 45.64it/s]
100%|██████████| 2210/2210 [00:48<00:00, 45.18it/s]


In [13]:
df_train.to_parquet("SST-5_train_august27.parquet")
df_dev.to_parquet("SST-5_validation_august27.parquet")
df_test.to_parquet("SST-5_test_august27.parquet")

**Remark**. If you have already run the code above, load the saved data frames instead:

In [None]:
df_train = pd.read_parquet("SST-5_train_august27.parquet")
df_dev = pd.read_parquet("SST-5_validation_august27.parquet")
df_test = pd.read_parquet("SST-5_test_august27.parquet")