# Retrieve SST-2 Test Set Labels
The SST-2 dataset on HuggingFace does not include labels for the test set, so we extract them manually from the original SST-2 data (https://gluebenchmark.com/tasks).

We match each test-set sentence from HuggingFace against the phrases in the *dictionary.txt* file of the original SST-2 data to obtain its phrase ID, then use that ID to look up the sentiment score in *sentiment\_labels.txt*. Following the instructions in the *README.md* file, every score above $0.6$ is mapped to *positive* and every score equal to or below $0.4$ is mapped to *negative*; sentences with scores in between are neutral and are dropped. A few sentences are matched manually because they differ only in British vs. American English spelling or in how slashes are written.
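The threshold rule above can be sketched as a small function. The name `score_to_label` is hypothetical and does not appear in the notebook; the thresholds are the ones quoted from the *README.md* instructions.

```python
from typing import Optional


def score_to_label(score: float) -> Optional[float]:
    """Map a sentiment score in [0, 1] to 0.0 (negative), 1.0 (positive) or None (neutral)."""
    if score <= 0.4:
        return 0.0
    if score > 0.6:
        return 1.0
    # Scores in (0.4, 0.6] are treated as neutral.
    return None


examples = [0.0, 0.4, 0.5, 0.6, 0.61, 1.0]
mapped = [score_to_label(s) for s in examples]
print(mapped)
```

Note that the boundaries are asymmetric: $0.4$ itself counts as negative, while $0.6$ itself is still neutral.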

## Load Data

In [111]:
import numpy as np
import json

In [102]:
from generalize_checklist.utils import get_dataset
            
dataset = get_dataset("glue", "albert-large-v2", "sst2", split="test")

Reusing dataset glue (/Users/urjakhurana/.cache/huggingface/datasets/glue/sst2/1.0.0/7c99657241149a24692c402a5c3f34d4c9f1df5ac2e4c3759fadea38f6cb29c4)


In [103]:
# Match phrases to their IDs.
labels = np.array([d["labels"] for d in dataset]).astype(int)

dictionary_path = "Downloads/SST-2/original/dictionary.txt"
all_sentences = [d["sentence"] for d in dataset]

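with open(dictionary_path, "r") as f:
    original_sentences = f.read().splitlines()

# Each line has the form "phrase|phrase_id". Split on the last "|" only,
# in case a phrase itself contains a pipe.
sentences = [h.rsplit("|", 1)[0] for h in original_sentences]
original_ids = [h.rsplit("|", 1)[1] for h in original_sentences]
lower_originals = [s.lower() for s in sentences]

phrase_to_id_og = dict(zip(lower_originals, original_ids))
# Keep only the phrases that occur in the test set; a set makes the lookup fast.
test_sentences = set(all_sentences)
phrase_to_id = {k: v for k, v in phrase_to_id_og.items() if k.strip() in test_sentences}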

In [104]:
# Match IDs with the labels. 
labels_path = "Downloads/SST-2/original/sentiment_labels.txt"

with open(labels_path, "r") as f: 
    ids_labels = f.read().splitlines()
    
ids = [h.split("|")[0] for h in ids_labels]
og_labels = [h.split("|")[1] for h in ids_labels]

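# Binarise the sentiment scores. The first line of the file is skipped ([1:]).
labels = []
for score in og_labels[1:]:
    score = float(score)
    if score <= 0.4:
        label = 0.0  # negative
    elif score > 0.6:
        label = 1.0  # positive
    else:
        label = None  # neutral; filtered out later
    labels.append(label)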

id_to_label = dict(zip(ids[1:], labels))

In [105]:
# Map phrases to the labels. 
phrase_to_label = {}
for phrase, phrase_id in phrase_to_id.items():
    label = id_to_label[phrase_id]
    # Get rid of neutral sentences
    if label in (0.0, 1.0):
        phrase_to_label[phrase] = label

In [106]:
len(phrase_to_label), set(phrase_to_label.values())

(1795, {0.0, 1.0})

In [107]:
# Test sentences without an exact match in dictionary.txt.
missed_sents = [sent for sent in all_sentences if sent not in phrase_to_id]
missed_sents

['with spy kids 2 : the island of lost dreams writer/director/producer robert rodriguez has cobbled together a film that feels like a sugar high gone awry .',
 'a very witty take on change , risk and romance , and the film uses humor to make its points about acceptance and growth .',
 'once again , director jackson strikes a rewarding balance between emotion on the human scale and action/effects on the spectacular scale .',
 'he has not learned that storytelling is what the movies are about .',
 "a recent favorite at sundance , this white-trash satire will inspire the affection of even those unlucky people who never owned a cassette of def leppard 's pyromania .",
 "one minute , you think you 're watching a serious actioner ; the next , it 's as though clips from the pink panther strikes again and/or sailor moon have been spliced in .",
 'a teasing drama whose relentless good-deed/bad-deed reversals are just interesting enough to make a sinner like me pray for an even more interesting 

In [108]:
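# These were not matched due to British vs. American English spelling and some issues with slashes (\/).
# The IDs below were looked up manually and follow the order of missed_sents; a None entry is skipped.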
missed_ids = ["150999", "18604", "26285", "223622", "24438", "225308", "24492", "225334", "143102", "222979", "149724", "24391", "143730", "26026", None, "222185", "145027", "225842", "19357", "151070", "13851"]
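The IDs above were found by hand. As a rough illustration of how candidates for such manual matching could be shortlisted, here is a minimal sketch using `difflib.get_close_matches` from the standard library. The helper `suggest_matches`, the toy dictionary, and the cutoff of 0.9 are all hypothetical, and the sketch assumes that slashes appear as `\/` on the dictionary side.

```python
import difflib


def suggest_matches(sentence, candidates, n=3, cutoff=0.9):
    """Return up to n candidate phrases that closely resemble `sentence`."""
    # Assumption: slashes are written as "\/" in the dictionary phrases.
    normalised = {c.replace("\\/", "/").strip(): c for c in candidates}
    close = difflib.get_close_matches(sentence, list(normalised), n=n, cutoff=cutoff)
    # Map the normalised hits back to the phrases as they appear in the dictionary.
    return [normalised[c] for c in close]


# Toy stand-in for the lower-cased phrases from dictionary.txt.
toy_dictionary = [
    "a film full of colour and humour .",
    "action\\/effects on the spectacular scale .",
    "an unrelated phrase .",
]

spelling_hits = suggest_matches("a film full of color and humor .", toy_dictionary)
slash_hits = suggest_matches("action/effects on the spectacular scale .", toy_dictionary)
print(spelling_hits)
print(slash_hits)
```

Fuzzy matching over the full dictionary is likely to be slow, and every suggestion would still need to be checked by eye before its ID is used.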

In [109]:
for sent, sent_id in zip(missed_sents, missed_ids): 
    if sent_id:
        label = id_to_label[sent_id]
        if label == 0.0 or label == 1.0: 
            phrase_to_label[sent] = label

In [110]:
len(phrase_to_label)

1815

In [114]:
with open("sst2_test_labels.json", "w") as f: 
    json.dump(phrase_to_label, f, indent=2)
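A saved mapping of this kind could later be read back and aligned with the order of the test sentences. The sketch below uses a toy mapping written to a temporary file rather than the real `sst2_test_labels.json`; the helper `align_labels` is hypothetical and returns `None` for any sentence without a recovered label.

```python
import json
import os
import tempfile


def align_labels(sentences, phrase_to_label):
    """Return one label per sentence, in order; None where no label was recovered."""
    return [phrase_to_label.get(s) for s in sentences]


# Toy stand-in for the saved phrase-to-label mapping.
toy_labels = {"a great movie .": 1.0, "a dull movie .": 0.0}
path = os.path.join(tempfile.mkdtemp(), "toy_labels.json")
with open(path, "w") as f:
    json.dump(toy_labels, f, indent=2)

with open(path, "r") as f:
    loaded = json.load(f)

toy_sentences = ["a dull movie .", "an unlabelled movie .", "a great movie ."]
aligned = align_labels(toy_sentences, loaded)
print(aligned)
```

Keeping `None` for unlabelled sentences, instead of dropping them, preserves the alignment with the original test set order.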