# ChatGPT API: Zero-Shot Text Classification (Binary Classification)

## The Association for Computational Linguistics
## WASSA 2023 Shared Task on Multi-Label and Multi-Class Emotion Classification on Code-Mixed Text Messages
See more details [here](https://codalab.lisn.upsaclay.fr/competitions/10864#learn_the_details)

In [96]:
import openai
import numpy as np
import pandas as pd
import sklearn
import re, os
import time
import zipfile, pickle
from typing import List
from copy import deepcopy
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression, RidgeClassifier, RidgeClassifierCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, multilabel_confusion_matrix
from tqdm.autonotebook import tqdm
import random
import tiktoken
import backoff
tqdm.pandas()

pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 400)
#os.path.join()

In [2]:
def num_tokens_from_messages(messages, model="gpt-3.5-turbo-0301"):
    '''Return number of tokens used in a list of messages for ChatGPT'''
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        #print("Warning: model not found. Using cl100k_base encoding.")
        encoding = tiktoken.get_encoding("cl100k_base")
    if model == "gpt-3.5-turbo":
        #print("Warning: gpt-3.5-turbo may change over time. Returning num tokens assuming gpt-3.5-turbo-0301.")
        return num_tokens_from_messages(messages, model="gpt-3.5-turbo-0301")
    elif model == "gpt-4":
        #print("Warning: gpt-4 may change over time. Returning num tokens assuming gpt-4-0314.")
        return num_tokens_from_messages(messages, model="gpt-4-0314")
    elif model == "gpt-3.5-turbo-0301":
        tokens_per_message = 4  # every message follows <|start|>{role/name}\n{content}<|end|>\n
        tokens_per_name = -1  # if there's a name, the role is omitted
    elif model == "gpt-4-0314":
        tokens_per_message = 3
        tokens_per_name = 1
    else:
        raise NotImplementedError(f"""num_tokens_from_messages() is not implemented for model {model}. See https://github.com/openai/openai-python/blob/main/chatml.md for information on how messages are converted to tokens.""")
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            num_tokens += len(encoding.encode(value))
            if key == "name":
                num_tokens += tokens_per_name
    num_tokens += 3  # every reply is primed with <|start|>assistant<|message|>
    return num_tokens

In [4]:
random_state = 47

# Load and Prepare Data

In [5]:
file1    = 'data/mcec_train_translated.pkl'
df_train = pd.read_pickle(file1)

file2    = 'data/mcec_dev_translated.pkl'
df_dev   = pd.read_pickle(file2)

file3    = 'data/mcec_test.csv'
df_test  = pd.read_csv(file3)

file4    = 'data/sample_submission/predictions_MCEC.csv'
sample_submission = pd.read_csv(file4)

print(df_train.shape, df_dev.shape, df_test.shape, sample_submission.shape)

(9530, 4) (1191, 10) (1191, 1) (1191, 1)


In [7]:
# copy manually arbitrated translation into English from column 'gpt_translated2_corrected'
#file = 'data/mcec_dev.xlsx'
#df_dev2 = pd.read_excel( file )
#print(df_dev2.shape)
#df_dev2.head()

(1191, 9)


Unnamed: 0,text,text_clean,emotion,target,gtp_translated,gpt_translated2,gpt_translated2_corrected,translated_hi,translated_ur
0,Tension lene ki koi baat ni,Tension lene ki koi baat ni,neutral,1,There's no need to take tension.,There's no need to worry.,There's no need to worry.,There is nothing to take tension,Any talk of taking tangoes
1,Main ghar punch gya hun or ab spny laga hun,Main ghar punch gya hun or ab spny laga hun,neutral,1,I have reached home and now I am going to sleep.,I have reached home and now I am going to sleep.,I have reached home and now I am going to sleep.,I have gone home punch and now I am Sapni,I have gone home punch and now dreams
2,Nai mje nai mili mail..mene check ki ti,Nai mje nai mili mail .. mene check ki ti,pessimism,0,"I didn't receive any mail, I had checked.",I didn't receive any new mail. I had checked.,I didn't receive any new mail. I had checked.,Nai Maje Nai Mile Mail .. I checked,Ni Ni Ni Mille Mail
3,Yr us din mai pura din bzy rahe vo mujy awne h...,Yr us din mai pura din bzy rahe vo mujy awne h...,disgust,0,"That day, they were busy all day and not givin...","I was busy the whole day on that day, they wer...","I was busy the whole day on that day, they wer...",YR Us Din Mai Pura Din Bzy Rahe Vo Mujy Awne H...,Yr us din mai pura din bzy rahe vo mujy awne h...
4,Lakin wo abhe dar dar ka chalata ha,Lakin wo abhe dar dar ka chalata ha,fear,0,But he still walks cautiously.,But he still walks with fear and hesitation.,But he still walks with fear and hesitation.,But it still moves at the rate,But Wu runs the cedar


In [11]:
#df_dev['gpt_translated2_corrected'] = df_dev2['gpt_translated2_corrected'].values

#file2  = 'data/mcec_dev_translated.pkl'
#df_dev.to_pickle( file2 )

In [107]:
df_test.head()

Unnamed: 0,Text,text_clean
0,Razia bta rahe the but wo sure nahe the,Razia bta rahe the but wo sure nahe the
1,Me phr kuch parh hi lun :-P,Me phr kuch parh hi lun :- P
2,Hoxtl life ma hm bht jald matur hO jaty hai,Hoxtl life ma hm bht jald matur hO jaty hai
3,Yar A4 me seminar ha a ja..,Yar A 4 me seminar ha a ja ..
4,K.Quid e azam k 400 bnay hain,K . Quid e azam k 400 bnay hain


In [12]:
# submission format
print( type(sample_submission) )
sample_submission.head()

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,Emotion
0,neutral
1,neutral
2,pessimism
3,disgust
4,fear


In [14]:
df_train['target'] = df_train['emotion'].apply( lambda x: 0 if x=='neutral' else 1 )
df_dev['target']   = df_dev['emotion'].apply( lambda x: 0 if x=='neutral' else 1 )

In [12]:
print(df_train['emotion'].value_counts(), '\n')
print(df_train['target'].value_counts())
df_train.head()

neutral         3262
trust           1118
joy             1022
optimism         880
anticipation     832
disgust          687
sadness          486
fear             453
anger            226
surprise         199
love             187
pessimism        178
Name: emotion, dtype: int64 

1    6268
0    3262
Name: target, dtype: int64


Unnamed: 0,text,emotion,translated_hi,translated_ur,target
0,Yes.I am in fyp lab cabin.but fyp presentation...,neutral,Yes.i am in fyp lab cabin.but fyp presentation...,Y. Um in Fap Lab Cabin. Butt Fap Presentations...,0
1,Yar insan ka bcha bn chawliyn na mar :p,joy,"Dude become a child of a human being, do not die.",Dude human beings do not die: P: P,1
2,Terai uncle nai kahna hai kai ham nai to bahr ...,disgust,Your Uncle Nai says that we had sent out money,Your Ankali says that we sent out money and wa...,1
3,Yr ajao I m cming in the club,neutral,YR AJAO I'M Coming in the Club,Yer organs were the club,0
4,Mje wese Nimra ahmad ka Qur'aan ki aayaat k ba...,joy,Mje wes nimra ahmad ka qur'aan ki aayaat k bar...,Mje Wese Nimra Ahmad Ka Qur'aan Ki Aayaaat K B...,1


In [15]:
print(df_dev['emotion'].value_counts(), '\n')
print(df_dev['target'].value_counts())
df_dev.head()

neutral         388
joy             131
trust           125
disgust         113
optimism        110
anticipation     94
sadness          62
fear             52
surprise         35
anger            35
pessimism        29
love             17
Name: emotion, dtype: int64 

1    803
0    388
Name: target, dtype: int64


Unnamed: 0,text,emotion,target,gtp_translated,translated_hi,translated_ur,text_clean,gpt_pred,gpt_pred_num,gpt_translated2,gpt_translated2_corrected
0,Tension lene ki koi baat ni,neutral,0,There's no need to take tension.,There is nothing to take tension,Any talk of taking tangoes,Tension lene ki koi baat ni,neutral,1,There's no need to worry.,There's no need to worry.
1,Main ghar punch gya hun or ab spny laga hun,neutral,0,I have reached home and now I am going to sleep.,I have gone home punch and now I am Sapni,I have gone home punch and now dreams,Main ghar punch gya hun or ab spny laga hun,neutral,1,I have reached home and now I am going to sleep.,I have reached home and now I am going to sleep.
2,Nai mje nai mili mail..mene check ki ti,pessimism,1,"I didn't receive any mail, I had checked.",Nai Maje Nai Mile Mail .. I checked,Ni Ni Ni Mille Mail,Nai mje nai mili mail .. mene check ki ti,neutral,1,I didn't receive any new mail. I had checked.,I didn't receive any new mail. I had checked.
3,Yr us din mai pura din bzy rahe vo mujy awne h...,disgust,1,"That day, they were busy all day and not givin...",YR Us Din Mai Pura Din Bzy Rahe Vo Mujy Awne H...,Yr us din mai pura din bzy rahe vo mujy awne h...,Yr us din mai pura din bzy rahe vo mujy awne h...,negative,0,"I was busy the whole day on that day, they wer...","I was busy the whole day on that day, they wer..."
4,Lakin wo abhe dar dar ka chalata ha,fear,1,But he still walks cautiously.,But it still moves at the rate,But Wu runs the cedar,Lakin wo abhe dar dar ka chalata ha,neutral,1,But he still walks with fear and hesitation.,But he still walks with fear and hesitation.


In [18]:
# light text cleaning (should I use clean regex for better accuracy?)
pad_punct    = re.compile('([^a-zA-Z ]+)')
multi_spaces = re.compile('\s{2,}')
#clean        = re.compile('[^a-zA-Z0-9,.?!\'\s]+')

def clean_text(s):
    s = s.replace('\n', ' ')
    s = pad_punct.sub(r' \1 ', s)
    #s = clean.sub(' ', s)
    s = multi_spaces.sub(' ', s)
    return s.strip()

df_train['text_clean'] = df_train['text'].apply( clean_text )
df_dev['text_clean']   = df_dev['text'].apply( clean_text )
df_test['text_clean']  = df_test['Text'].apply( clean_text )

In [19]:
# 2K duplicates - these may affect claa imbalance during training! TO BE REDUCED
print(df_train.shape)
temp1 = df_train[ df_train.duplicated(subset=['text_clean'], keep=False) ]
print(temp1.shape)
temp2 = df_train[ df_train.duplicated(subset=['text_clean', 'emotion'], keep=False) ]
print(temp2.shape)
temp3 = df_train[ df_train.duplicated(keep=False) ]
print(temp3.shape)

(9530, 6)
(4222, 6)
(4221, 6)
(4221, 6)


In [20]:
# 82 duplicates ['clean_text', 'emotion'] - can't reduce because this is a dev set
print(df_dev.shape)
temp1 = df_dev[ df_dev.duplicated(subset=['text_clean'], keep=False) ]
print(temp1.shape)
temp2 = df_dev[ df_dev.duplicated(subset=['text_clean', 'emotion'], keep=False) ]
print(temp2.shape)
temp3 = df_dev[ df_dev.duplicated(keep=False) ]
print(temp3.shape)

(1191, 11)
(82, 11)
(82, 11)
(68, 11)


In [21]:
# 93 complete duplicates - can't reduce because this is a test set
print(df_test.shape)
temp1 = df_test[ df_test.duplicated(subset=['text_clean'], keep=False) ]
print(temp1.shape)
temp3 = df_test[ df_test.duplicated(keep=False) ]
print(temp3.shape)

(1191, 2)
(93, 2)
(93, 2)


In [22]:
# df_train vs. df_dev: half of the dev set is in train set
overlap1 = [t for t in df_train['text_clean'].values if t in df_dev['text_clean'].values]
overlap2 = [t for t in df_dev['text_clean'].values if t in df_train['text_clean'].values]
len(overlap1), len(overlap2), len(set(overlap1)), len(set(overlap2))

(714, 554, 526, 526)

In [23]:
# df_test vs. rest
overlap3 = [ t for t in df_train['text_clean'].tolist() + df_dev['text_clean'].tolist()\
             if t in df_test['text_clean'].tolist() ]
overlap4 = [ t for t in df_test['text_clean'].tolist() if t in\
             df_train['text_clean'].tolist() + df_dev['text_clean'].tolist() ]
len(overlap3), len(overlap4), len(set(overlap3)), len(set(overlap4))

(817, 584, 557, 557)

In [24]:
# df_test vs. df_dev
overlap5 = [t for t in df_dev['text_clean'].values if t in df_test['text_clean'].values]
overlap6 = [t for t in df_test['text_clean'].values if t in df_dev['text_clean'].values]
len(overlap5), len(overlap6), len(set(overlap5)), len(set(overlap6))

(90, 97, 88, 88)

In [25]:
# df_test vs. df_train: half of the dev set is in train set
overlap7 = [t for t in df_train['text_clean'].values if t in df_test['text_clean'].values]
overlap8 = [t for t in df_test['text_clean'].values if t in df_train['text_clean'].values]
len(overlap7), len(overlap8), len(set(overlap7)), len(set(overlap8))

(727, 540, 519, 519)

The reason why baseline ML models perform better than ChatGPT is because they get a lot of hints due to duplicates from the training set! ChatGPT doesn't have this knowledge because it's doing a zero-shot classification! The number of duplicates is such that they would not fit the context window of ChatGPT anyway.

The only way to compare ML and ChatGPT correctly is to remove all the duplicates from the TRAINING SET, then train ML model and test it the dev set and compare with ChatGPT! (also, deduplicate the training set)

Submission: use non-overfit ML or ChatGPT (whichever is better) on those samples from the test set that don't have duplicates in the training or dev set. Use training/dev set labels for the duploicates in the test set.

In [26]:
# remove overlap with validation sets
val_sets = df_dev['text_clean'].tolist() + df_test['text_clean'].tolist()
print(len(val_sets), len(set(val_sets)))

print(df_train.shape)
df_train = df_train[ ~df_train['text_clean'].isin(val_sets) ]
print(df_train.shape)

2382 2206
(9530, 6)
(8151, 6)


In [27]:
# remove duplicates from train set
df_train = df_train.drop_duplicates(subset=['text_clean', 'emotion'])
print(df_train.shape)

(6167, 6)


In [28]:
# is additional text cleaning necessary? I don't see why
from collections import Counter
train_words = ' '.join( df_train['text_clean'].tolist() ).lower().split()
c = Counter( train_words )
c.most_common(350)

[('.', 2127),
 ('k', 1244),
 ('to', 1231),
 ('ha', 1214),
 ('hai', 804),
 ('ho', 793),
 ('ka', 726),
 ('me', 640),
 ('?', 615),
 ('b', 604),
 ('kr', 568),
 ('ga', 559),
 ('ni', 553),
 ('ko', 543),
 ('ki', 532),
 ('tha', 528),
 (',', 518),
 ('...', 502),
 ('na', 497),
 ('hn', 473),
 ('hy', 464),
 ('wo', 461),
 ('ma', 453),
 ('nai', 450),
 ('..', 450),
 ('a', 446),
 ('se', 415),
 ('p', 409),
 ('yar', 401),
 ('or', 392),
 ('yr', 389),
 ('h', 388),
 ('i', 385),
 ('han', 385),
 ('tu', 371),
 ('e', 331),
 (':', 327),
 ('ne', 324),
 ('kia', 321),
 ('he', 287),
 ('hain', 284),
 ('main', 281),
 ('ab', 254),
 ('koi', 252),
 ('us', 251),
 ('nae', 250),
 ('ap', 250),
 ('sir', 250),
 ('sy', 248),
 ('tm', 237),
 ('is', 223),
 ('nahi', 223),
 ('hi', 222),
 ('raha', 220),
 ('kal', 218),
 ('rha', 214),
 ('ja', 202),
 ('ny', 200),
 ('aj', 199),
 ('g', 199),
 ('m', 198),
 ('phr', 195),
 (':-', 193),
 ('aur', 192),
 ('mai', 192),
 ('....', 187),
 ('gya', 184),
 ('d', 183),
 ('bht', 181),
 ('u', 173),
 ('p

In [29]:
# https://www.kaggle.com/code/owaisraza009/roman-urdu-sentiment-analysis/notebook
stopwords1 = [ 'ai', 'ayi', 'hy', 'hai', 'main', 'ki', 'tha', 'koi', 'ko', 'sy', 'woh', 'bhi', 'aur', 'wo', 'yeh',
               'rha', 'hota', 'ho', 'ga', 'ka', 'le', 'lye', 'kr', 'kar', 'lye', 'liye', 'hotay', 'waisay', 'gya',
               'gaya', 'kch', 'ab', 'thy', 'thay', 'houn', 'hain', 'han', 'to', 'is', 'hi', 'jo', 'kya', 'thi', 'se',
               'pe', 'phr', 'wala', 'waisay', 'us', 'na', 'ny', 'hun', 'rha', 'raha', 'ja', 'rahay', 'abi', 'uski',
               'ne', 'haan', 'acha', 'nai', 'sent', 'photo', 'you', 'kafi', 'gai', 'rhy', 'kuch', 'jata', 'aye', 'ya',
               'dono', 'hoa', 'aese', 'de', 'wohi', 'jati', 'jb', 'krta', 'lg', 'rahi', 'hui', 'karna', 'krna', 'gi',
               'hova', 'yehi', 'jana', 'jye', 'chal', 'mil', 'tu', 'hum', 'par', 'hay', 'kis', 'sb', 'gy', 'dain',
               'krny', 'tou', ]

# https://github.com/haseebelahi/roman-urdu-stopwords.git
file = 'data/stopwords.txt'
stopwords2 = open(file).read().split()
print(stopwords2 == stopwords1)

from sklearn.feature_extraction import _stop_words
stopwords_en  = _stop_words.ENGLISH_STOP_WORDS
# selected from stopwords_en
stopwords_en2 = [ 'a', 'about', 'also', 'am', 'an', 'and', 'are', 'as', 'at', 'be', 
                  'been', 'being', 'by', 'co', 'con', 'de', 'eg', 'eight', 'eleven', 'else', 'etc', 
                  'fifteen', 'fifty', 'five', 'for', 'forty', 'four', 'from', 'had',
                  'has', 'hasnt', 'have', 'he', 'her', 'here', 'hers', 'herself', 'him', 'himself', 
                  'his', 'how', 'i', 'ie', 'if', 'in', 'inc', 'into', 'is', 'it', 'its', 'itself',
                  'ltd', 'me', 'mine', 'my', 'myself', 'nine', 'no', 'now', 'of', 'off', 'on',
                  'once', 'one', 'onto', 'or', 'other', 'others', 'our', 'ours', 'ourselves',
                  'out', 'part', 'per', 're', 'several', 'she', 'side', 'since', 'six', 'sixty',
                  'so', 'ten', 'than', 'that', 'the', 'their', 'them',
                  'themselves', 'then', 'there', 'these', 'they', 'thick', 'thin', 'third', 'this', 'those', 
                  'three', 'to', 'twelve', 'twenty', 'two', 'un','us', 'very',
                  'via', 'was', 'we', 'were', 'what', 'when', 'where', 'whether', 'which', 'while', 
                  'who', 'whom', 'whose', 'why', 'with', 'within', 'would', 'yet', 'you', 'your', 'yours',
                   'yourself', 'yourselves', ]

print( len(stopwords1), len(stopwords_en), len(stopwords_en2), )

True
102 318 129


# ChatGPT API: Zero-Shot Classification

In [138]:
#prompt_one   = '''The text below may contain words or phrases in Roman Urdu along with English. Translate the text below into English only. Then classify the translated text as 'emotional' if it contains emotions or 'neutral' if it does not contain emotions. Output only 'emotional' or 'neutral' and nothing else. Text: "{}"'''
prompt_one   = '''Act as a binary text classifier. Output the category "emotional" only and only if the text below suggests any human emotion. Otherwise output "neutral". Text: "{}"'''
s = 'This is a text sample'
print(prompt_one.format(s), '\n')

# Using followup questions improves the reponse. but ChatGPT can change its mind too easily sometimes
followup1 = 'Are you sure about that? If yes, output the same category, if no change the category'
followup2 = 'Output only the category and nothing else'
print(followup1)

Act as a binary text classifier. Output the category "emotional" only and only if the text below suggests any human emotion. Otherwise output "neutral". Text: "This is a text sample" 

Are you sure about that? Output only the category


In [121]:
openai.api_key = os.getenv("OPENAI_API_KEY")
model          = 'gpt-3.5-turbo'
labels_set     = {'emotional', 'neutral'}
clean = re.compile(r'[^a-zA-Z ]+')
multi_spaces = re.compile('\s{2,}')
print(labels_set)

{'neutral', 'emotional'}


In [122]:
def verify_label(label_):
    '''
       Verify if label_ contains any of the categories
       from the predefined set of labels
    '''
    label_ = clean.sub(' ', label_)
    label_ = multi_spaces.sub(' ', label_).lower().split()
    res    = [i for i in label_ if i in labels_set]
    res    = list(set(res))
    return '/'.join(res) if res else None

In [123]:
def verify_num_tokens(model, messages):
    '''Check that there is enough tokens available for a ChatGPT repsonse'''
    num_tokens_tiktoken = num_tokens_from_messages(messages, model)
    if num_tokens_tiktoken > 4080:
        print(f'Number of tokens is {num_tokens_tiktoken} which exceeds 3950')
        print(f'TEXT: {text_}\n')
        return False
    else:
        return True


@backoff.on_exception(backoff.expo, openai.error.RateLimitError, max_time=10)
def get_response(model, messages, temperature=0, max_tokens=None):
    '''Send request, return reponse'''
    response  = openai.ChatCompletion.create(
        model = model,
        messages = messages,
        temperature = temperature,        # range(0,2), the more the less deterministic / focused
        top_p = 1,                        # top probability mass, e.g. 0.1 = only tokens from top 10% proba mass
        n = 1,                            # number of chat completions
        #max_tokens = max_tokens,          # tokens to return
        stream = False,        
        stop=None,                        # sequence to stop generation (new line, end of text, etc.)
        )
    content = response['choices'][0]['message']['content'].strip()
    #num_tokens_api = response['usage']['prompt_tokens']
    return content

In [124]:
def translate_text(text_, prompt_):
    '''Translate text_ using prompt_ and ChatGPT API'''    
        
    # compose messages and check num_tokens
    messages = [            
            { "role": "system", "content": "You are an accurate translator from Roman Urdu.", },
            { "role": "user", "content": prompt_.format(text_), },
            ]
    if not verify_num_tokens(model, messages): return None
    return get_response(model, messages)

In [125]:
def classify_text(text_, prompt_):
    '''Classify text_ using prompt_ and ChatGPT API'''
        
    # compose messages and check num_tokens
    messages = [
            { "role": "system", "content": "You are a smart binary text classifier.", },
            { "role": "user", "content": prompt_.format(text_), },
            ]
    if not verify_num_tokens(model, messages): return None
    label_    = get_response(model, messages)
    old_label = label_
    label_    = verify_label(label_)        # get just the category if response is too long
        
    # if label not found in response text - second, extended chat
    if label_ is None:
        messages += [
            { "role": "assistant", "content": old_label, },
            { "role": "user", "content": followup1, }
            ]        
        label_    = get_response(model, messages)        
        old_label = label_
        label_    = verify_label(label_)        # get just the category if response is too long
            
    return label_ if label_ is not None else old_label

In [126]:
def classify_text_with_clarifying(text_, prompt_):
    '''
       Classify text_ using prompt_ and ChatGPT API,
       then clarify response with followup1 question -
       this can help make the response more precise
    '''
        
    # compose messages and check num_tokens
    messages = [
            { "role": "system", "content": "You are a smart binary text classifier.", },
            { "role": "user", "content": prompt_.format(text_), },
            ]
    if not verify_num_tokens(model, messages): return None
    label_    = get_response(model, messages)
    old_label = label_
    label_    = verify_label(label_)                      # get just the category if response is too long
        
    # ask additional clarifying question - sometimes it helps
    messages += [
        { "role": "assistant", "content": old_label, },
        { "role": "user", "content": followup1, }
        ]
    #time.sleep( random.uniform(1.1, 1.8) )                # wait not to overload ChatGPT
    label2_    = get_response(model, messages)
    old_label2 = label2_
    label2_    = verify_label(label2_)                    # get just the category if response is too long

    return old_label, label_, old_label2, label2_

In [127]:
# test as single prompt
idx = 11
text, groundtruth_labels = df_dev[['gpt_translated2_corrected', 'emotion']].values[idx]
label  = classify_text(text, prompt_one)
#labels = classify_text(text, prompt_one)
labels = classify_text_with_clarifying(text, prompt_one)

print(prompt_one.format( text ))
print(f"\nGROUNDTRUTH LABEL:\n{'/'.join( groundtruth_labels )}")
print(f"\nPREDICTED LABEL:\n{labels}")
#print(f'\nTOTAL TOKENS: {tokens}')

Act as a binary text classifier. Output the category "emotional" only and only if the text below suggests any human emotion. Otherwise output "neutral". Text: "Dude, when did I ever say no to you guys? Come on over, I'm free right now anyway."

GROUNDTRUTH LABEL:
n/e/u/t/r/a/l

PREDICTED LABEL:
('Category: Emotional. \n\nExplanation: The use of the word "Dude" and the exclamation mark suggests a friendly and enthusiastic tone, which is a sign of positive emotion.', 'emotional', 'Emotional.', 'emotional')


### Zero-shot classification using simple iteration

In [137]:
# run for the entire text column of the dataframe
start = time.time()
res   = dict()
count = 0
for t in df_dev['text'].tolist():
    if t in res:
        continue
    try:
        res[ t ] = classify_text_with_clarifying(t, prompt_one)
    except openai.error.RateLimitError:
        print(f'\nText: {t}. Rate limit error\n')
    except Exception as e:
        print(f'\nText: {t}. Error: {e}\n')
                
    count += 1    
    if count % 10 == 0:
        print(f'Processing text {count}')
        #with open('data/res.pkl', 'wb') as f:
        #    pickle.dump(res, f, protocol=pickle.HIGHEST_PROTOCOL)    
        
elapsed = (time.time() - start)/60
print(f'\nTime elapsed {round(elapsed, 4)} min')
#file = 'data/res.pkl'
#with open(file, 'rb') as f:
#    res2 = pickle.load(handle)

Processing text 10
Processing text 20
Processing text 30
Processing text 40
Processing text 50
Processing text 60
Processing text 70
Processing text 80
Processing text 90
Processing text 100
Processing text 110
Processing text 120
Processing text 130
Processing text 140
Processing text 150
Processing text 160
Processing text 170
Processing text 180
Processing text 190
Processing text 200
Processing text 210
Processing text 220
Processing text 230
Processing text 240
Processing text 250
Processing text 260
Processing text 270
Processing text 280
Processing text 290
Processing text 300
Processing text 310
Processing text 320
Processing text 330
Processing text 340
Processing text 350
Processing text 360
Processing text 370
Processing text 380
Processing text 390
Processing text 400
Processing text 410
Processing text 420
Processing text 430
Processing text 440
Processing text 450
Processing text 460
Processing text 470
Processing text 480
Processing text 490
Processing text 500
Processin

In [106]:
# duplicates in df_dev
df_dev.shape, len(res), len(set(df_dev['text_clean'].tolist())), len(set(df_dev['gpt_translated2_corrected'].tolist()))

((1191, 13), 1156, 1150, 1156)

In [140]:
df_dev['gpt_pred'] = df_dev['text'].map( res )
#df_dev = df_dev.replace('neutral/emotional', 'emotional')
print(df_dev.isna().sum())
#df_dev['gpt_pred'].value_counts()

text                         0
emotion                      0
target                       0
gtp_translated               0
translated_hi                0
translated_ur                0
text_clean                   0
gpt_pred                     0
gpt_pred_num                 0
gpt_translated2              0
gpt_translated2_corrected    0
gpt_pred_binary              0
gpt_pred_clarified           0
dtype: int64


In [141]:
df_dev['gpt_pred_binary'] = df_dev['gpt_pred'].apply( lambda x: 0 if x[3]=='neutral' else 1 )
df_dev['gpt_pred_binary'].value_counts()

0    773
1    418
Name: gpt_pred_binary, dtype: int64

In [142]:
y_dev      = df_dev['target'].values
y_dev_pred = df_dev['gpt_pred_binary'].values
print( classification_report( y_dev, y_dev_pred, digits=4 ) )

              precision    recall  f1-score   support

           0     0.3894    0.7758    0.5185       388
           1     0.7919    0.4122    0.5422       803

    accuracy                         0.5306      1191
   macro avg     0.5906    0.5940    0.5303      1191
weighted avg     0.6607    0.5306    0.5345      1191



### Zero-shot classification using pandas
NOTE: tqdm or pandas make the API calls twice as slow compared with the simple iteration above

In [114]:
# prompt 1 tqdm results - 1191/1191 [25:38<00:00, 1.18s/it]
def apply_func_with_exception(text_, prompt_):
    try:
        return classify_text_with_clarifying(text_, prompt_)
    except openai.error.RateLimitError:
        print(f'Text: {text_}. Rate limit error\n')
        return np.nan
    except Exception as e:
        print(f'Text: {text_}. Another error: {e}\n')
        return np.nan
    
df_dev['gpt_pred_clarified'] = df_dev['gpt_translated2_corrected'].progress_apply( lambda x: apply_func_with_exception(x, prompt_one) )
print( df_dev.isna().sum() )
#df_dev['gpt_pred'].value_counts()

  0%|          | 0/1191 [00:00<?, ?it/s]

text                         0
emotion                      0
target                       0
gtp_translated               0
translated_hi                0
translated_ur                0
text_clean                   0
gpt_pred                     0
gpt_pred_num                 0
gpt_translated2              0
gpt_translated2_corrected    0
gpt_pred_binary              0
gpt_pred_clarified           0
dtype: int64


In [133]:
df_dev['gpt_pred_binary'] = df_dev['gpt_pred'].apply( lambda x: 0 if x=='neutral' else 1 )
df_dev['gpt_pred_binary'].value_counts()

0    989
1    202
Name: gpt_pred_binary, dtype: int64

In [129]:
# if ChatGPT made no prediction, choose the prediction coming from the classifier
'''def improve_predictions(row):
    if row['gpt_pred_binary'] is None:
        row['gpt_pred_binary'] = row['clf_pred']
    return row

df_dev = df_dev.apply( improve_predictions, axis=1 )'''

In [118]:
y_dev      = df_dev['target'].values
y_dev_pred = df_dev['gpt_pred_binary'].values
print( classification_report( y_dev, y_dev_pred, digits=4 ) )

              precision    recall  f1-score   support

           0     0.3517    0.7088    0.4701       388
           1     0.7237    0.3686    0.4884       803

    accuracy                         0.4794      1191
   macro avg     0.5377    0.5387    0.4793      1191
weighted avg     0.6025    0.4794    0.4825      1191



## APPENDIX

## Prompts and results (in reverse chronological order)

__Best zero-shot result, so far__  
_Disregard translation and use the CODE-MIXED BILINGUAL TEXT column "text" hoping that ChatGPT will will figure it out on its own_  
_Double prompting with classify_text_with_clarifying()_ ON ORIGINAL BILINGUAL TEXT:  
_Prompt 1_: Act as a binary text classifier. Output the category "emotional" only and only if the text below suggests any human emotion. Otherwise output "neutral". Text: "This is a text sample"  
_Prompt 2_: 'Are you sure about that? Output only the category'
```
              precision    recall  f1-score   support

           0     0.3894    0.7758    0.5185       388
           1     0.7919    0.4122    0.5422       803

    accuracy                         0.5306      1191
   macro avg     0.5906    0.5940    0.5303      1191
weighted avg     0.6607    0.5306    0.5345      1191
```

_Disregard translation and use the CODE-MIXED BILINGUAL TEXT column "text" hoping that ChatGPT will will figure it out on its own_:  
_PROMPT_: Act as a binary text classifier. Output the category "emotional" only and only if the text below suggests any human emotion. Otherwise output "neutral". Text: "This is a text sample" 
```
              precision    recall  f1-score   support

           0     0.3782    0.9639    0.5432       388
           1     0.9307    0.2341    0.3741       803

    accuracy                         0.4719      1191
   macro avg     0.6544    0.5990    0.4587      1191
weighted avg     0.7507    0.4719    0.4292      1191
```

_Second double prompting witih classify_text_with_clarifying()_ ON THE GPT TRANSLATED ENGLISH TEXT:
_Prompt 1_: Act as a binary text classifier. Output the category "emotional" only and only if the text below suggests any human emotion. Otherwise output "neutral". Text: "Dude, when did I ever say no to you guys? Come on over, I'm free right now anyway."  
_Prompt 2_: 'Are you sure about that?'
```
              precision    recall  f1-score   support

           0     0.3517    0.7088    0.4701       388
           1     0.7237    0.3686    0.4884       803

    accuracy                         0.4794      1191
   macro avg     0.5377    0.5387    0.4793      1191
weighted avg     0.6025    0.4794    0.4825      1191
```

_Improved single prompt achieves the same result as double prompting_  ON THE GPT TRANSLATED ENGLISH TEXT:  
_Prompt_: Act as a binary text classifier. Output the category "emotional" only and only if the text below suggests any human emotion. Otherwise output "neutral". Text: "Dude, when did I ever say no to you guys? Come on over, I'm free right now anyway."
```
              precision    recall  f1-score   support

           0     0.3916    0.8840    0.5427       388
           1     0.8571    0.3362    0.4830       803

    accuracy                         0.5147      1191
   macro avg     0.6243    0.6101    0.5129      1191
weighted avg     0.7055    0.5147    0.5025      1191
```

_Double prompting witih classify_text_with_clarifying()_  ON THE GPT TRANSLATED ENGLISH TEXT:  
_Prompt 1_: Act as a text classifier. Classify the text below into one most relevant category from this list of categories: emotional, neutral. Use the emotional category only if the text below describes any emotions; use the neutral category only if the text below does not speak about emotions at all. Output only one word: 'emotional' or 'neutral', whichever is more relevant. Text: "This is a text sample"   
_Prompt 2_: 'Are you sure about that?'

```
                precision    recall  f1-score   support

           0     0.3867    0.8273    0.5271       388
           1     0.8144    0.3661    0.5052       803

    accuracy                         0.5164      1191
   macro avg     0.6006    0.5967    0.5161      1191
weighted avg     0.6751    0.5164    0.5123      1191
```

_Prompt to classify after the corrected English translation (zero shot)_  ON THE GPT TRANSLATED ENGLISH TEXT:  
Act as a text classifier. Classify the text below into one most relevant category from this list of categories: emotional, neutral. Use the emotional category only if the text below describes any emotions; use the neutral category only if the text below does not speak about emotions at all. Output only one word: 'emotional' or 'neutral', whichever is more relevant. Text: "This is a text sample."
```
              precision    recall  f1-score   support

           0     0.3775    0.9253    0.5362       388
           1     0.8792    0.2628    0.4046       803

    accuracy                         0.4786      1191
   macro avg     0.6283    0.5940    0.4704      1191
weighted avg     0.7157    0.4786    0.4475      1191
```

_Prompt to classify after the first English translation (zero shot)_  ON THE GPT TRANSLATED ENGLISH TEXT:  
Act as a careful and accurate text classifier. Classify the text below as 'emotional' only if it contains emotions; lassify the text below as 'neutral' only if it does not contain emotions. Output only one word: 'emotional' or 'neutral' whichever is more relevant. Text: "This is a text sample"
```
              precision    recall  f1-score   support

           0     0.3789    0.9510    0.5419       388
           1     0.9124    0.2466    0.3882       803

    accuracy                         0.4761      1191
   macro avg     0.6456    0.5988    0.4650      1191
weighted avg     0.7386    0.4761    0.4383      1191
```

_Prompt to translate and classify_ ON THE GPT TRANSLATED ENGLISH TEXT:  
The text below may contain words or phrases in Roman Urdu along with English. Translate the text below into English only. Then classify the translated text as 'emotional' if it contains emotions or 'neutral' if it does not contain emotions. Output only 'emotional' or 'neutral' and nothing else. Text: "This is a text sample"

```
              precision    recall  f1-score   support

           0     0.3662    0.9278    0.5252       388
           1     0.8654    0.2242    0.3561       803

    accuracy                         0.4534      1191
   macro avg     0.6158    0.5760    0.4406      1191
weighted avg     0.7028    0.4534    0.4112      1191
```