# ChatGPT API: Few-Shot Binary Text Classification

## The Association for Computational Linguistics
## WASSA 2023 Shared Task on Multi-Label and Multi-Class Emotion Classification on Code-Mixed Text Messages
See more details [here](https://codalab.lisn.upsaclay.fr/competitions/10864#learn_the_details)

In [1]:
import openai
import numpy as np
import pandas as pd
import sklearn
import re, os
import time
import zipfile, pickle
from typing import List
from copy import deepcopy
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression, RidgeClassifier, RidgeClassifierCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, multilabel_confusion_matrix
from openai.embeddings_utils import cosine_similarity
from tqdm.autonotebook import tqdm
import random
import tiktoken
import backoff
tqdm.pandas()

pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 400)
#os.path.join()


In [2]:
def num_tokens_from_messages(messages, model="gpt-3.5-turbo-0301"):
    '''Return number of tokens used in a list of messages for ChatGPT'''
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        #print("Warning: model not found. Using cl100k_base encoding.")
        encoding = tiktoken.get_encoding("cl100k_base")
    if model == "gpt-3.5-turbo":
        #print("Warning: gpt-3.5-turbo may change over time. Returning num tokens assuming gpt-3.5-turbo-0301.")
        return num_tokens_from_messages(messages, model="gpt-3.5-turbo-0301")
    elif model == "gpt-4":
        #print("Warning: gpt-4 may change over time. Returning num tokens assuming gpt-4-0314.")
        return num_tokens_from_messages(messages, model="gpt-4-0314")
    elif model == "gpt-3.5-turbo-0301":
        tokens_per_message = 4  # every message follows <|start|>{role/name}\n{content}<|end|>\n
        tokens_per_name = -1  # if there's a name, the role is omitted
    elif model == "gpt-4-0314":
        tokens_per_message = 3
        tokens_per_name = 1
    else:
        raise NotImplementedError(f"""num_tokens_from_messages() is not implemented for model {model}. See https://github.com/openai/openai-python/blob/main/chatml.md for information on how messages are converted to tokens.""")
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            num_tokens += len(encoding.encode(value))
            if key == "name":
                num_tokens += tokens_per_name
    num_tokens += 3  # every reply is primed with <|start|>assistant<|message|>
    return num_tokens
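
The accounting in `num_tokens_from_messages` can be illustrated without downloading a tokenizer. The sketch below repeats the same bookkeeping (a fixed overhead per message, the encoded length of every field, and a fixed priming cost for the reply), but it takes the encoder as an argument. `count_chat_tokens` and `toy_encode` are hypothetical names, and whitespace splitting is only a stand-in for `tiktoken`, so the absolute numbers are not real token counts.

```python
def count_chat_tokens(messages, encode, tokens_per_message=4,
                      tokens_per_name=-1, reply_priming=3):
    """Mirror the per-message bookkeeping used for gpt-3.5-turbo-0301."""
    total = 0
    for message in messages:
        total += tokens_per_message            # fixed wrapper cost per message
        for key, value in message.items():
            total += len(encode(value))        # cost of the role and of the content
            if key == 'name':
                total += tokens_per_name       # a name replaces the role
    return total + reply_priming               # the reply is primed with a few tokens


# Assumption: whitespace splitting stands in for a real tokenizer.
toy_encode = str.split

msgs = [
    {'role': 'user', 'content': 'Text: kal milte hain'},
    {'role': 'assistant', 'content': 'Category: neutral'},
]
print(count_chat_tokens(msgs, toy_encode))     # 19 with the toy encoder
```

With the toy encoder the first message costs 4 + 1 + 4 and the second 4 + 1 + 2, plus 3 for the reply priming.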

In [3]:
random_state = 47

# Load and Prepare Data

In [4]:
file1    = 'data/mcec_train_translated.pkl'
df_train = pd.read_pickle(file1)

file2    = 'data/mcec_dev_translated.pkl'
df_dev   = pd.read_pickle(file2)

file3    = 'data/mcec_test.csv'
df_test  = pd.read_csv(file3)

file4    = 'data/sample_submission/predictions_MCEC.csv'
sample_submission = pd.read_csv(file4)

print(df_train.shape, df_dev.shape, df_test.shape, sample_submission.shape)

(9530, 5) (1191, 12) (1191, 1) (1191, 1)


In [8]:
df_train['target'] = df_train['emotion'].apply( lambda x: 0 if x=='neutral' else 1 )
df_dev['target']   = df_dev['emotion'].apply( lambda x: 0 if x=='neutral' else 1 )

In [9]:
print(df_train['emotion'].value_counts(), '\n')
print(df_train['target'].value_counts())
df_train.head()

neutral         3262
trust           1118
joy             1022
optimism         880
anticipation     832
disgust          687
sadness          486
fear             453
anger            226
surprise         199
love             187
pessimism        178
Name: emotion, dtype: int64 

1    6268
0    3262
Name: target, dtype: int64


Unnamed: 0,text,emotion,translated_hi,translated_ur,gpt_embedding,target
0,Yes.I am in fyp lab cabin.but fyp presentation...,neutral,Yes.i am in fyp lab cabin.but fyp presentation...,Y. Um in Fap Lab Cabin. Butt Fap Presentations...,"[-0.005477939732372761, -0.01738985814154148, ...",0
1,Yar insan ka bcha bn chawliyn na mar :p,joy,"Dude become a child of a human being, do not die.",Dude human beings do not die: P: P,"[0.0006696455529890954, -0.006965265609323978,...",1
2,Terai uncle nai kahna hai kai ham nai to bahr ...,disgust,Your Uncle Nai says that we had sent out money,Your Ankali says that we sent out money and wa...,"[0.021171217784285545, -0.02109299972653389, 0...",1
3,Yr ajao I m cming in the club,neutral,YR AJAO I'M Coming in the Club,Yer organs were the club,"[-0.010511564090847969, -0.02738134376704693, ...",0
4,Mje wese Nimra ahmad ka Qur'aan ki aayaat k ba...,joy,Mje wes nimra ahmad ka qur'aan ki aayaat k bar...,Mje Wese Nimra Ahmad Ka Qur'aan Ki Aayaaat K B...,"[-0.0016743674641475081, -0.021855581551790237...",1


In [10]:
print(df_dev['emotion'].value_counts(), '\n')
print(df_dev['target'].value_counts())
df_dev.head()

neutral         388
joy             131
trust           125
disgust         113
optimism        110
anticipation     94
sadness          62
fear             52
surprise         35
anger            35
pessimism        29
love             17
Name: emotion, dtype: int64 

1    803
0    388
Name: target, dtype: int64


Unnamed: 0,text,emotion,target,gtp_translated,translated_hi,translated_ur,text_clean,gpt_pred,gpt_pred_num,gpt_translated2,gpt_translated2_corrected,gpt_embedding
0,Tension lene ki koi baat ni,neutral,0,There's no need to take tension.,There is nothing to take tension,Any talk of taking tangoes,Tension lene ki koi baat ni,neutral,1,There's no need to worry.,There's no need to worry.,"[-0.00021548081713262945, 0.005029499996453524..."
1,Main ghar punch gya hun or ab spny laga hun,neutral,0,I have reached home and now I am going to sleep.,I have gone home punch and now I am Sapni,I have gone home punch and now dreams,Main ghar punch gya hun or ab spny laga hun,neutral,1,I have reached home and now I am going to sleep.,I have reached home and now I am going to sleep.,"[-0.0010164333507418633, -0.013282055966556072..."
2,Nai mje nai mili mail..mene check ki ti,pessimism,1,"I didn't receive any mail, I had checked.",Nai Maje Nai Mile Mail .. I checked,Ni Ni Ni Mille Mail,Nai mje nai mili mail .. mene check ki ti,neutral,1,I didn't receive any new mail. I had checked.,I didn't receive any new mail. I had checked.,"[-0.010691414587199688, -0.01292553823441267, ..."
3,Yr us din mai pura din bzy rahe vo mujy awne h...,disgust,1,"That day, they were busy all day and not givin...",YR Us Din Mai Pura Din Bzy Rahe Vo Mujy Awne H...,Yr us din mai pura din bzy rahe vo mujy awne h...,Yr us din mai pura din bzy rahe vo mujy awne h...,negative,0,"I was busy the whole day on that day, they wer...","I was busy the whole day on that day, they wer...","[0.009936108253896236, -0.016926730051636696, ..."
4,Lakin wo abhe dar dar ka chalata ha,fear,1,But he still walks cautiously.,But it still moves at the rate,But Wu runs the cedar,Lakin wo abhe dar dar ka chalata ha,neutral,1,But he still walks with fear and hesitation.,But he still walks with fear and hesitation.,"[0.019262924790382385, -0.0011249196249991655,..."


In [11]:
# light text cleaning (should I use clean regex for better accuracy?)
pad_punct    = re.compile('([^a-zA-Z ]+)')
multi_spaces = re.compile(r'\s{2,}')
#clean        = re.compile('[^a-zA-Z0-9,.?!\'\s]+')

def clean_text(s):
    s = s.replace('\n', ' ')
    s = pad_punct.sub(r' \1 ', s)
    #s = clean.sub(' ', s)
    s = multi_spaces.sub(' ', s)
    return s.strip()

df_train['text_clean'] = df_train['text'].apply( clean_text )
df_dev['text_clean']   = df_dev['text'].apply( clean_text )
df_test['text_clean']  = df_test['Text'].apply( clean_text )

In [19]:
# remove overlap with validation sets
val_sets = df_dev['text_clean'].tolist() + df_test['text_clean'].tolist()
print(len(val_sets), len(set(val_sets)))

print(df_train.shape)
df_train = df_train[ ~df_train['text_clean'].isin(val_sets) ]
print(df_train.shape)

2382 2206
(9530, 7)
(8151, 7)


In [20]:
# remove duplicates from train set
df_train = df_train.drop_duplicates(subset=['text_clean', 'emotion'])
print(df_train.shape)

(6167, 7)


# ChatGPT API: Few-Shot Classification

## Helper functions

In [42]:
# Using follow-up questions improves the response, but ChatGPT sometimes changes its mind too easily
followup1 = 'Are you sure about that? If yes, output the same category, if no change the category'
followup2 = 'Output only the category and nothing else'
print(followup1)

Are you sure about that? If yes, output the same category, if no change the category


In [43]:
openai.api_key = os.getenv("OPENAI_API_KEY")
model          = 'gpt-3.5-turbo'
labels_set     = {'emotional', 'neutral'}
clean = re.compile(r'[^a-zA-Z ]+')
multi_spaces = re.compile(r'\s{2,}')
print(labels_set)

{'emotional', 'neutral'}


In [44]:
def verify_label(label_):
    '''
       Verify if label_ contains any of the categories
       from the predefined set of labels
    '''
    label_ = clean.sub(' ', label_)
    label_ = multi_spaces.sub(' ', label_).lower().split()
    res    = [i for i in label_ if i in labels_set]
    res    = list(set(res))
    return '/'.join(res) if res else None
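
To make the behaviour of `verify_label` concrete, the following sketch re-declares the same regexes and label set in standalone form and runs them on a few response strings of the kind that appear later in this notebook. The name `extract_label` and the sample strings are illustrative. One deliberate difference from the cell above: the matches are sorted here so that the result is deterministic when both labels occur in one response.

```python
import re

LABELS = {'emotional', 'neutral'}
NON_LETTERS = re.compile(r'[^a-zA-Z ]+')
MULTI_SPACES = re.compile(r'\s{2,}')


def extract_label(response):
    """Return the known label(s) found in a free-text response, or None."""
    cleaned = NON_LETTERS.sub(' ', response)
    tokens = MULTI_SPACES.sub(' ', cleaned).lower().split()
    found = sorted({tok for tok in tokens if tok in LABELS})
    return '/'.join(found) if found else None


print(extract_label('Category: Emotional.'))                          # emotional
print(extract_label("Yes, I'm sure. The category is motivational."))  # None
print(extract_label('Neutral or emotional, hard to say'))             # emotional/neutral
```

A `None` result is what triggers the follow-up question in the classification functions.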

In [45]:
def verify_num_tokens(model, messages):
    '''Check that the prompt leaves enough tokens for a ChatGPT response'''
    num_tokens_tiktoken = num_tokens_from_messages(messages, model)
    if num_tokens_tiktoken > 4080:
        print(f'Number of tokens is {num_tokens_tiktoken} which exceeds 4080')        
        return False
    else:
        return True


@backoff.on_exception(backoff.expo, openai.error.RateLimitError, max_time=10)
def get_response(model, messages, temperature=0, max_tokens=None):
    '''Send request, return response'''
    response  = openai.ChatCompletion.create(
        model = model,
        messages = messages,
        temperature = temperature,        # 0 to 2; higher values make the output less deterministic / focused
        top_p = 1,                        # top probability mass, e.g. 0.1 = only tokens from top 10% proba mass
        n = 1,                            # number of chat completions
        #max_tokens = max_tokens,          # tokens to return
        stream = False,        
        stop=None,                        # sequence to stop generation (new line, end of text, etc.)
        )
    content = response['choices'][0]['message']['content'].strip()
    #num_tokens_api = response['usage']['prompt_tokens']
    return content
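
The `@backoff.on_exception(backoff.expo, ...)` decorator retries the request with exponentially growing waits, and `max_time=10` makes it give up after roughly ten seconds. The sketch below shows the underlying idea in plain Python. It is not the `backoff` library's implementation (which also adds jitter); `retry_with_expo`, `TransientError`, and the injectable `sleep` argument are hypothetical names used only for illustration.

```python
import time


class TransientError(Exception):
    """Stand-in for a retryable error such as a rate limit."""


def retry_with_expo(func, max_tries=5, base=1.0, factor=2.0, sleep=time.sleep):
    """Call func(); on TransientError wait base * factor**attempt, then retry."""
    for attempt in range(max_tries):
        try:
            return func()
        except TransientError:
            if attempt == max_tries - 1:
                raise                          # out of retries: propagate the error
            sleep(base * factor ** attempt)


# Demo: fail twice, then succeed; record the waits instead of sleeping.
calls = {'n': 0}
waits = []


def flaky():
    calls['n'] += 1
    if calls['n'] < 3:
        raise TransientError('rate limited')
    return 'ok'


result = retry_with_expo(flaky, sleep=waits.append)
print(result, waits)                           # ok [1.0, 2.0]
```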

## Approach 1: concatenate random few-shot examples into one long string

Example prompt:
```
Message: Support has been terrible for 2 weeks...
Sentiment: Negative
###
Message: I love your API, it is simple and so fast!
Sentiment: Positive
###
Message: GPT-J has been released 2 months ago.
Sentiment: Neutral
###
Message: The reactivity of your team has been amazing, thanks!
Sentiment:
```
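
A prompt of this shape can be assembled from labelled pairs in a few lines. This is a minimal sketch; `build_prompt` and its arguments are illustrative names, not part of the notebook.

```python
def build_prompt(examples, query, sep='###'):
    """Join (message, label) pairs and an unlabelled query into one prompt."""
    blocks = [f'Message: {msg}\nSentiment: {label}' for msg, label in examples]
    blocks.append(f'Message: {query}\nSentiment:')   # left open for the model to complete
    return f'\n{sep}\n'.join(blocks)


shots = [
    ('Support has been terrible for 2 weeks...', 'Negative'),
    ('I love your API, it is simple and so fast!', 'Positive'),
]
prompt = build_prompt(shots, 'The reactivity of your team has been amazing, thanks!')
print(prompt)
```

Ending the prompt on an open `Sentiment:` line nudges the model to answer with just the label.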

In [25]:
# total tokens in training set
messages = [{ 'role': 'user', 'content': ' '.join(df_train['text_clean'].tolist()) }]
total_tokens        = num_tokens_from_messages(messages)
avg_tokens_per_msg  = total_tokens / df_train.shape[0]
msg_per_window_size = 4096 /  avg_tokens_per_msg
print('Total tokens:', total_tokens)
print('Avg tokens per msg:', avg_tokens_per_msg)
print('Avg messages per window size:', msg_per_window_size)

Total tokens: 116049
Avg tokens per msg: 18.817739581644236
Avg messages per window size: 217.66695102930657
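
The figures above can be reproduced by hand from the reported totals. The sketch below also shows how quickly the usable number of examples shrinks once template text and a reserve for the query and reply are taken into account. The values of `overhead_per_example` and `reserve` are assumptions chosen for illustration, not measurements from this notebook.

```python
total_tokens = 116049     # reported above
num_messages = 6167       # rows left in df_train after deduplication
context_window = 4096

avg_tokens = total_tokens / num_messages
max_messages = context_window / avg_tokens
print(round(avg_tokens, 4), round(max_messages, 2))   # 18.8177 217.67

# Assumed values: extra tokens per example for the "Message:/Category:/####"
# template, and tokens held back for the query and the model's reply.
overhead_per_example = 10
reserve = 200
usable = int((context_window - reserve) / (avg_tokens + overhead_per_example))
print(usable)
```

The raw ratio is therefore an upper bound; the chunk sizes used in the following cells are well below it.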


Split df into n random chunks to be used as few-shot examples:

In [91]:
# split the shuffled df into chunks of n rows; each chunk becomes one few-shot prompt of roughly 2.4k to 4k tokens
n         = 135
key2label = { 0: 'neutral', 1: 'emotional', }
examples  = []
chunks  = [ df_train.sample(frac=1, random_state=random_state).iloc[i:i+n] for i in range(0, df_train.shape[0], n) ]
for ch in chunks:
    example = ''
    for text, target in ch[['text', 'target']].values:
        example += f'Message: {text}.\nCategory: {key2label[target]}.\n####\n'
    examples.append(example)
        
lengths = [ num_tokens_from_messages([{'role':'user', 'content': example} ]) for example in examples ]
print(lengths)

[3472, 3690, 3662, 3346, 3394, 3608, 3630, 3805, 4022, 3726, 3590, 3530, 3729, 3633, 3722, 3633, 3795, 3630, 3613, 3808, 3540, 3380, 3597, 3528, 3697, 3586, 3634, 3586, 3825, 3750, 3782, 3620, 3472, 3519, 3542, 3583, 3794, 3416, 3455, 3977, 3427, 3476, 3302, 3400, 3531, 2424]


In [105]:
len(lengths)

46

In [92]:
print(examples[0])

Message: Yr jtni jldi ml jaye utna acha he....
Category: emotional.
####
Message: Haha mei kahan se sweet hun :p.
Category: emotional.
####
Message: Han abi kuch theek hy shop pr ly kr gya tha.
Category: neutral.
####
Message: I think .. he made his own slides....
Category: emotional.
####
Message: Plz give me bike keys. I m in office.
Category: emotional.
####
Message: Teri do Aakhiyaan Wich Dubb Mrya,.
Category: emotional.
####
Message: Haha ni wo boy ni smjy g q k wo khud boy ho g.
Category: emotional.
####
Message: Mera Sirf DataStructure Ka Pta Chala Hai. & Your?.
Category: emotional.
####
Message: Salam Faiza Baaji
How are you  ... 
Plz take a  good care of urself. 
Aur jo treat meri taraf due hai uska kiya karna hai?  :-).
Category: emotional.
####
Message: Sir transport dept walay recomendation letter mang rahay han....
Category: neutral.
####
Message: Yar  Maine fee a hi tak Jamaica nahi krai 2000 fine ho gear hai to fine maaf kra sakta hai.
Category: neutral.
####
Message: Ok

In [93]:
def classify_text_few_shot(prompt_):
    '''Classify the final message in prompt_ with the ChatGPT API; ask a follow-up if no known label is returned'''
        
    # compose messages and check num_tokens
    messages = [
            #{ "role": "system", "content": "You are a very accurate text classifier.", },
            { "role": "user", "content": prompt_, },
            ]
    if not verify_num_tokens(model, messages): return None
    label_    = get_response(model, messages)
    old_label = label_
    label_    = verify_label(label_)        # get just the category if response is too long
        
    # if label not found in response text - second, extended chat
    if label_ is None:
        messages += [
            { "role": "assistant", "content": old_label, },
            { "role": "user", "content": followup1, }
            ]        
        label_    = get_response(model, messages)        
        old_label = label_
        label_    = verify_label(label_)        # get just the category if response is too long
            
    return label_ if label_ is not None else old_label
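
The two-step flow used here (ask once, and only if no known label can be parsed, append the reply plus a follow-up question and ask again) can be exercised offline with a scripted stand-in for the API. In this sketch `classify_with_followup`, `ask`, `parse`, and the canned replies are all hypothetical; no request is sent.

```python
def classify_with_followup(ask, messages, parse, followup):
    """ask(messages) -> reply text; parse(reply) -> label or None."""
    messages = list(messages)                  # do not mutate the caller's list
    reply = ask(messages)
    label = parse(reply)
    if label is None:
        # Second, extended chat: show the model its own answer and ask again.
        messages += [
            {'role': 'assistant', 'content': reply},
            {'role': 'user', 'content': followup},
        ]
        reply = ask(messages)
        label = parse(reply)
    return label if label is not None else reply


# Scripted replies stand in for the model (assumption for the demo).
replies = iter(['It sounds cheerful.', 'emotional'])
sent = []


def fake_ask(msgs):
    sent.append(list(msgs))
    return next(replies)


def parse(reply):
    return next((lab for lab in ('emotional', 'neutral') if lab in reply.lower()), None)


result = classify_with_followup(
    fake_ask,
    [{'role': 'user', 'content': 'Message: Haha.\nCategory:'}],
    parse,
    'Output only the category and nothing else',
)
print(result, len(sent))                       # emotional 2
```

If the second reply is still unparseable, the raw text is returned, which is why free-form answers can show up among the predictions.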

In [100]:
start  = time.time()
# reuse results cached in memory by an earlier run, otherwise start empty
res    = globals().get('res', dict())
count1 = 0
count2 = 0
for t in df_dev['text'].tolist():
    if t in res:
        continue
    if count2 >= len(examples):
        count2 = 0
    prompt = examples[ count2 ] + f'\nMessage: {t}.\nCategory:'
    count2 += 1
    try:
        res[ t ] = classify_text_few_shot(prompt)
    except openai.error.RateLimitError:
        print(f'\nText: {t}.\nRate limit error\n')
    except Exception as e:
        print(f'\nText: {t}\nError: {e}\n')
                
    count1 += 1    
    if count1 % 10 == 0:
        print(f'Processing text {count1}; example {count2-1}')
        with open('data/res.pkl', 'wb') as f:
            pickle.dump(res, f, protocol=pickle.HIGHEST_PROTOCOL)
                        
        
elapsed = (time.time() - start)/60
print(f'\nTime elapsed {round(elapsed, 4)} min')
#file = 'data/res.pkl'
#with open(file, 'rb') as f:
#    res2 = pickle.load(f)


Time elapsed 0.1439 min


In [102]:
df_dev['gpt_pred'] = df_dev['text'].map( res )
df_dev['gpt_pred'].value_counts()

emotional                                        715
neutral                                          470
Yes, I'm sure. The category is motivational.       1
Yes, I'm sure. The category is offensive.          1
Yes, I'm sure. The category is inappropriate.      1
Yes, the category is offensive.                    1
Yes, I'm sure. The category is positive.           1
Yes, I'm sure. The category is humorous.           1
Name: gpt_pred, dtype: int64

In [103]:
df_dev['gpt_pred_binary'] = df_dev['gpt_pred'].apply( lambda x: 0 if x=='neutral' else 1 )
df_dev['gpt_pred_binary'].value_counts()

1    721
0    470
Name: gpt_pred_binary, dtype: int64

In [104]:
y_dev      = df_dev['target'].values
y_dev_pred = df_dev['gpt_pred_binary'].values
print( classification_report( y_dev, y_dev_pred, digits=4 ) )

              precision    recall  f1-score   support

           0     0.4723    0.5722    0.5175       388
           1     0.7698    0.6912    0.7283       803

    accuracy                         0.6524      1191
   macro avg     0.6211    0.6317    0.6229      1191
weighted avg     0.6729    0.6524    0.6597      1191
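
The report can be cross-checked with a few lines of arithmetic. The class supports (388 / 803) and the predicted counts (470 / 721) are printed above. The number of correctly predicted neutral texts, 222, is not printed anywhere; it is inferred here from the reported class-0 recall (0.5722 × 388 ≈ 222) and should be read as an assumption.

```python
support_0, support_1 = 388, 803    # true class counts (from the report)
pred_0, pred_1 = 470, 721          # predicted class counts (printed above)
correct_0 = 222                    # assumed: inferred from the class-0 recall

wrong_as_1 = support_0 - correct_0   # neutral texts predicted as emotional
correct_1 = pred_1 - wrong_as_1      # emotional texts predicted as emotional


def prf(correct, predicted, actual):
    """Precision, recall and F1 for one class."""
    p = correct / predicted
    r = correct / actual
    return p, r, 2 * p * r / (p + r)


p0, r0, f0 = prf(correct_0, pred_0, support_0)
p1, r1, f1 = prf(correct_1, pred_1, support_1)
accuracy = (correct_0 + correct_1) / (support_0 + support_1)
macro_f1 = (f0 + f1) / 2

print(round(p0, 4), round(r0, 4), round(f0, 4))   # 0.4723 0.5722 0.5175
print(round(p1, 4), round(r1, 4), round(f1, 4))   # 0.7698 0.6912 0.7283
print(round(accuracy, 4), round(macro_f1, 4))     # 0.6524 0.6229
```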



## Approach 2: concatenate random few-shot examples using the ChatGPT chat mode

Split df into n random chunks to be used as few-shot examples:

In [126]:
# split the shuffled df into chunks of n rows; each chunk becomes one chat history of roughly 2.3k to 3.8k tokens
n          = 100
key2label = { 0: 'neutral', 1: 'emotional', }
examples  = []
chunks = [ df_train.sample(frac=1, random_state=random_state).iloc[i:i+n] for i in range(0, df_train.shape[0], n) ]
for ch in chunks:
    example = []
    for text, target in ch[['text', 'target']].values:
        example += [
            { "role": "user", "content": f'Text: {text}', },
            { "role": "assistant", "content": f'Category: {key2label[target]}', }
        ]
    examples.append(example)
        
lengths = [ num_tokens_from_messages(example) for example in examples ]
print(len(lengths))
print(lengths)

62
[3276, 3350, 3424, 3554, 3229, 3153, 3260, 3487, 3364, 3365, 3709, 3762, 3400, 3387, 3440, 3391, 3424, 3327, 3546, 3423, 3485, 3484, 3543, 3349, 3414, 3372, 3678, 3399, 3280, 3190, 3375, 3327, 3381, 3527, 3335, 3470, 3383, 3494, 3549, 3502, 3499, 3454, 3436, 3261, 3324, 3311, 3371, 3271, 3568, 3475, 3273, 3229, 3403, 3705, 3208, 3326, 3203, 3250, 3275, 3261, 3324, 2279]


In [127]:
# how one example looks
examples[2]

[{'role': 'user', 'content': 'Text: kiu yar a jana mza aye ga!?'},
 {'role': 'assistant', 'content': 'Category: emotional'},
 {'role': 'user', 'content': 'Text: Mujhe msg oe msg kr rhi he'},
 {'role': 'assistant', 'content': 'Category: neutral'},
 {'role': 'user',
  'content': 'Text: ma ab tk k exprnce k hwly sy bl rha hn. life to pta h bht h'},
 {'role': 'assistant', 'content': 'Category: neutral'},
 {'role': 'user', 'content': "Text: Mene jannat k pattay prh liya atlast :')"},
 {'role': 'assistant', 'content': 'Category: emotional'},
 {'role': 'user',
  'content': 'Text: Or wo achi khaasi intelligent t tbi itni okhi paheli boojh li t :P'},
 {'role': 'assistant', 'content': 'Category: emotional'},
 {'role': 'user',
  'content': 'Text: Tm logo  ka off ho gaixa hn aur tm ne rida ki pic daekane the . ...,'},
 {'role': 'assistant', 'content': 'Category: neutral'},
 {'role': 'user',
  'content': 'Text: Pehla name mera e c jinna nu pta a,par mainu ki pta a, ae mainu nai pts'},
 {'role': 'as

In [128]:
def classify_text_few_shot2(text_, messages_, model):
    '''Classify text_ with the ChatGPT API, using messages_ as the few-shot chat history'''
    messages_ = deepcopy(messages_)
    messages_ += [
        { "role": "user", "content": f'Text: {text_}', },
    ]
    if not verify_num_tokens(model, messages_): return None
    label_    = get_response(model, messages_)
    old_label = label_
    label_    = verify_label(label_)        # get just the category if response is too long
        
    # if label not found in response text - second, extended chat
    if label_ is None:
        messages_ += [
            { "role": "assistant", "content": old_label, },
            { "role": "user", "content": followup1, }
            ]        
        label_    = get_response(model, messages_)
        old_label = label_
        label_    = verify_label(label_)        # get just the category if response is too long
            
    return label_ if label_ is not None else old_label

In [129]:
start  = time.time()
res    = dict()
count1 = 0
count2 = 0
for t in df_dev['text'].tolist():
    if t in res:
        continue
    if count2 >= len(examples):
        count2 = 0
    messages = examples[ count2 ]
    count2 += 1
    try:
        res[ t ] = classify_text_few_shot2(t, messages, model)
    except openai.error.RateLimitError:
        print(f'\nText: {t}.\nRate limit error\n')
    except Exception as e:
        print(f'\nText: {t}\nError: {e}\n')
                
    count1 += 1    
    if count1 % 10 == 0:
        print(f'Processing text {count1}; example {count2-1}. Time elapsed: {round((time.time()-start)/60, 4)} min')
        with open('data/res.pkl', 'wb') as f:
            pickle.dump(res, f, protocol=pickle.HIGHEST_PROTOCOL)
                        
        
elapsed = (time.time() - start)/60
print(f'\nTime elapsed {round(elapsed, 4)} min')
#file = 'data/res.pkl'
#with open(file, 'rb') as f:
#    res2 = pickle.load(f)

Processing text 10; example 9. Time elapsed: 0.2034 min
Processing text 20; example 19. Time elapsed: 0.3919 min
Processing text 30; example 29. Time elapsed: 0.5834 min
Processing text 40; example 39. Time elapsed: 0.7687 min
Processing text 50; example 49. Time elapsed: 0.9588 min
Processing text 60; example 59. Time elapsed: 1.1971 min
Processing text 70; example 7. Time elapsed: 1.409 min
Processing text 80; example 17. Time elapsed: 1.6186 min
Processing text 90; example 27. Time elapsed: 1.8049 min
Processing text 100; example 37. Time elapsed: 2.0017 min
Processing text 110; example 47. Time elapsed: 2.1894 min
Processing text 120; example 57. Time elapsed: 2.3815 min
Processing text 130; example 5. Time elapsed: 2.5695 min
Processing text 140; example 15. Time elapsed: 2.7638 min
Processing text 150; example 25. Time elapsed: 2.9498 min
Processing text 160; example 35. Time elapsed: 3.1631 min
Processing text 170; example 45. Time elapsed: 3.4141 min
Processing text 180; exampl

In [130]:
df_dev['gpt_pred'] = df_dev['text'].map( res )
df_dev['gpt_pred'].value_counts()

emotional                                       808
neutral                                         382
Yes, I'm sure. The category is motivational.      1
Name: gpt_pred, dtype: int64

In [131]:
df_dev['gpt_pred_binary'] = df_dev['gpt_pred'].apply( lambda x: 0 if x=='neutral' else 1 )
df_dev['gpt_pred_binary'].value_counts()

1    809
0    382
Name: gpt_pred_binary, dtype: int64

In [132]:
y_dev      = df_dev['target'].values
y_dev_pred = df_dev['gpt_pred_binary'].values
print( classification_report( y_dev, y_dev_pred, digits=4 ) )

              precision    recall  f1-score   support

           0     0.4895    0.4820    0.4857       388
           1     0.7515    0.7572    0.7543       803

    accuracy                         0.6675      1191
   macro avg     0.6205    0.6196    0.6200      1191
weighted avg     0.6662    0.6675    0.6668      1191



## Approach 3: same as 2, but use the most similar examples
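
The retrieval step behind this approach is a nearest-neighbour search: score every training embedding against the query embedding by cosine similarity and keep the top `n` texts. Below is a dependency-free sketch of that idea with toy 2-dimensional vectors; `cosine`, `top_n_similar`, and the toy corpus are illustrative and stand in for the real embeddings stored in the `gpt_embedding` column.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def top_n_similar(query_vec, corpus, n=2):
    """corpus is a list of (text, vector); return n texts, most similar first."""
    scored = ((cosine(query_vec, vec), text) for text, vec in corpus)
    return [text for _, text in heapq.nlargest(n, scored)]


# Toy vectors (assumption); real embeddings have many more dimensions.
corpus = [
    ('a', [1.0, 0.0]),
    ('b', [0.8, 0.6]),
    ('c', [0.0, 1.0]),
]
print(top_n_similar([1.0, 0.1], corpus, n=2))   # ['a', 'b']
```

For a dataset of this size, stacking the embeddings into one matrix and computing all similarities with a single matrix product would likely be much faster than scoring row by row.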

In [23]:
model          = 'gpt-3.5-turbo'
key2label      = { 0: 'neutral', 1: 'emotional', }
embedding_type = 'gpt_embedding'

In [30]:
# find top_n closest df_train embeddings for each df_dev embedding
def batch_cosine(embedding_, df, top_n=100):
    df['similarity'] = df[embedding_type].apply(lambda x: cosine_similarity(x, embedding_))
    return df.sort_values(by='similarity', ascending=False).head(top_n)['text'].tolist()

df_train_copy = df_train.copy()
start = time.time()
res   = dict()
count = 0
for t, e in df_dev[['text', embedding_type]].values:
    if t in res:
        continue
    res[ t ] = batch_cosine( e, df_train_copy, top_n=100, )
    count += 1
    if count % 10 == 0:
        print(f'Processing text {count}. Time elapsed: {round((time.time()-start)/60, 4)} min')
        with open('data/res.pkl', 'wb') as f:
            pickle.dump(res, f, protocol=pickle.HIGHEST_PROTOCOL)
                        
elapsed = (time.time() - start)/60
print(f'\nTime elapsed {round(elapsed, 4)} min')

Processing text 10. Time elapsed: 0.3077 min
Processing text 20. Time elapsed: 0.615 min
Processing text 30. Time elapsed: 0.9206 min
Processing text 40. Time elapsed: 1.2267 min
Processing text 50. Time elapsed: 1.534 min
Processing text 60. Time elapsed: 1.8395 min
Processing text 70. Time elapsed: 2.1474 min
Processing text 80. Time elapsed: 2.4545 min
Processing text 90. Time elapsed: 2.76 min
Processing text 100. Time elapsed: 3.0655 min
Processing text 110. Time elapsed: 3.3716 min
Processing text 120. Time elapsed: 3.68 min
Processing text 130. Time elapsed: 3.9857 min
Processing text 140. Time elapsed: 4.2943 min
Processing text 150. Time elapsed: 4.6008 min
Processing text 160. Time elapsed: 4.9075 min
Processing text 170. Time elapsed: 5.2152 min
Processing text 180. Time elapsed: 5.5222 min
Processing text 190. Time elapsed: 5.8275 min
Processing text 200. Time elapsed: 6.1335 min
Processing text 210. Time elapsed: 6.4391 min
Processing text 220. Time elapsed: 6.7438 min
Pro

In [32]:
df_dev['closest_texts'] = df_dev['text'].map( res )
print(df_dev.isna().sum())

file = 'data/df_dev_100_closest_GptEmbeddings.pkl'
df_dev.to_pickle(file)

text                         0
emotion                      0
target                       0
gtp_translated               0
translated_hi                0
translated_ur                0
text_clean                   0
gpt_pred                     0
gpt_pred_num                 0
gpt_translated2              0
gpt_translated2_corrected    0
gpt_embedding                0
closest_texts                0
dtype: int64


In [26]:
file = 'data/df_dev_100_closest_GptEmbeddings.pkl'
df_dev = pd.read_pickle(file)
df_dev.head()

Unnamed: 0,text,emotion,target,gtp_translated,translated_hi,translated_ur,text_clean,gpt_pred,gpt_pred_num,gpt_translated2,gpt_translated2_corrected,gpt_embedding,closest_texts
0,Tension lene ki koi baat ni,neutral,0,There's no need to take tension.,There is nothing to take tension,Any talk of taking tangoes,Tension lene ki koi baat ni,neutral,1,There's no need to worry.,There's no need to worry.,"[-0.00021548081713262945, 0.005029499996453524...","[Tension na lo hi jaw ga, O nahi yar tension n..."
1,Main ghar punch gya hun or ab spny laga hun,neutral,0,I have reached home and now I am going to sleep.,I have gone home punch and now I am Sapni,I have gone home punch and now dreams,Main ghar punch gya hun or ab spny laga hun,neutral,1,I have reached home and now I am going to sleep.,I have reached home and now I am going to sleep.,"[-0.0010164333507418633, -0.013282055966556072...","[Main tu ghar hi rehta main kahan ja skta, Ghr..."
2,Nai mje nai mili mail..mene check ki ti,pessimism,1,"I didn't receive any mail, I had checked.",Nai Maje Nai Mile Mail .. I checked,Ni Ni Ni Mille Mail,Nai mje nai mili mail .. mene check ki ti,neutral,1,I didn't receive any new mail. I had checked.,I didn't receive any new mail. I had checked.,"[-0.010691414587199688, -0.01292553823441267, ...","[Nai yar mje to pta e nai kuch, Mjhe tu ni lgt..."
3,Yr us din mai pura din bzy rahe vo mujy awne h...,disgust,1,"That day, they were busy all day and not givin...",YR Us Din Mai Pura Din Bzy Rahe Vo Mujy Awne H...,Yr us din mai pura din bzy rahe vo mujy awne h...,Yr us din mai pura din bzy rahe vo mujy awne h...,negative,0,"I was busy the whole day on that day, they wer...","I was busy the whole day on that day, they wer...","[0.009936108253896236, -0.016926730051636696, ...",[Yr masla pta kia ha. Yad ha tm log us din upa...
4,Lakin wo abhe dar dar ka chalata ha,fear,1,But he still walks cautiously.,But it still moves at the rate,But Wu runs the cedar,Lakin wo abhe dar dar ka chalata ha,neutral,1,But he still walks with fear and hesitation.,But he still walks with fear and hesitation.,"[0.019262924790382385, -0.0011249196249991655,...","[Pata nahe ab kahan chala gea hai kamina, Han ..."


In [31]:
df_dev['target'].value_counts()

1    803
0    388
Name: target, dtype: int64

In [28]:
# verify
for i, j in df_dev[['text', 'closest_texts']].values[:10]:
    print(i)
    print(j, '\n')
    #closest = '\n'.join(j)
    #print(f'ORIGINAL TRANSLATED TEXT:\n{i}')
    #print(f"CLOSEST TRANSLATIONS:\n{closest}\n")

Tension lene ki koi baat ni
['Tension na lo hi jaw ga', 'O nahi yar tension na lu', 'Pata hai mujahi par no tension', 'Han yar kam ho jai ga parha dn yar tension na lai', 'Na bro us din koi duty nahe hai. Bus kal ki hai baki phir kisi din ki koi tension nahe', 'Haan mian koi baat bni ? ', 'Result ki tension lag gayi hay', 'Tu rehney dey tension na lay ka', 'Han kiya hoa koi masla ha kiya', 'Mje tensn hti hai na kisi ka kaam na hoo', 'Hn ho jna ha arrange..\nTension na ly..', 'Yar telha sy meri baat krwa day', 'Or sunain janab koi ni tazi', 'G ap bolain kiya baat h. ', 'Nai tm tensi0n na lo mene n0te kia tha lectr me tmy likhwa dn g,', 'Pta ni de ga zaror tension na le', 'Shukar mama ko bht tension thi', 'Han main kr dun ga tun tension mat ly yar', 'Tujy koi aur kaam nhi ha bhai ', 'Haan yawr koi kam ni mila tha henna ? ', 'Jani tujey aik baat batai thi', 'Ijtami faisla hua or koi meeting ni hui', 'Thek se has lo,.thn mood ho to baat kr lena', 'Hahaha koi nai ho jaye ga', 'O ja yar tens

In [68]:
# Using top_n closest embeddings, create ChatGPT messages object (alternating user (text)/assistant(category) Q&As)
def create_messages(df_, closest_texts, top_n=100 ):
    df_temp = df_[ df_['text'].isin(closest_texts[:top_n]) ]
    messages = []
    for text, target in df_temp[['text', 'target']].values:
        messages += [
            { "role": "user", "content": f'Text: {text}', },
            { "role": "assistant", "content": f'Category: {key2label[target]}', }
        ]
    # drop whole user/assistant pairs so the history never ends on an unanswered example
    while messages and num_tokens_from_messages(messages) > 4080:
        messages = messages[:-2]
    return messages
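
One thing to be aware of: filtering with `isin` returns the training rows in their original DataFrame order, not in order of similarity. The sketch below is one possible variant that walks the retrieved texts in similarity order and adds complete user/assistant pairs until a token budget would be exceeded. `build_chat_examples`, `label_by_text`, and `count_words` are hypothetical names, and counting whitespace-separated words is only a stand-in for a real tokenizer.

```python
def build_chat_examples(closest_texts, label_by_text, count_tokens, budget, top_n=10):
    """Build alternating user/assistant messages, most similar example first."""
    messages, used = [], 0
    for text in closest_texts[:top_n]:
        if text not in label_by_text:
            continue                           # no label known for this text
        pair = [
            {'role': 'user', 'content': f'Text: {text}'},
            {'role': 'assistant', 'content': f'Category: {label_by_text[text]}'},
        ]
        cost = sum(count_tokens(m['content']) for m in pair)
        if used + cost > budget:
            break                              # adding this pair would exceed the budget
        messages += pair
        used += cost
    return messages


def count_words(s):
    """Assumption: word count stands in for a token count."""
    return len(s.split())


labels = {'hello there': 'neutral', 'so happy today': 'emotional', 'ok': 'neutral'}
msgs = build_chat_examples(['so happy today', 'hello there', 'ok'],
                           labels, count_words, budget=10)
print(msgs)
```

In the demo the first pair costs 6 words and the second 5, so only the first pair fits in a budget of 10.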

In [59]:
def classify_text_few_shot3(text_, messages_, model):
    '''Classify text_ with the ChatGPT API, using messages_ (nearest-neighbour examples) as chat history'''
    messages_ = deepcopy(messages_)
    messages_ += [
        { "role": "user", "content": f'Text: {text_}', },
    ]
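    # drop the last example pair (user + assistant) while keeping the query as the final message
    while len(messages_) > 1 and num_tokens_from_messages(messages_) > 4080:
        messages_ = messages_[:-3] + [messages_[-1]]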
    if not verify_num_tokens(model, messages_): return None
    label_    = get_response(model, messages_)
    old_label = label_
    label_    = verify_label(label_)        # get just the category if response is too long
        
    # if label not found in response text - second, extended chat
    if label_ is None:
        messages_ += [
            { "role": "assistant", "content": old_label, },
            { "role": "user", "content": followup1, }
            ]        
        label_    = get_response(model, messages_)
        old_label = label_
        label_    = verify_label(label_)        # get just the category if response is too long
            
    return label_ if label_ is not None else old_label

In [66]:
# how one messages object looks
c = df_dev['closest_texts'].values[4]
create_messages(df_train, c, top_n=10)

[{'role': 'user',
  'content': 'Text: Han yar ab lga di hain to karni to pare g na'},
 {'role': 'assistant', 'content': 'Category: emotional'},
 {'role': 'user', 'content': 'Text: Pata nahe ab kahan chala gea hai kamina'},
 {'role': 'assistant', 'content': 'Category: emotional'},
 {'role': 'user', 'content': 'Text: Han keh diya hy wo bta dy ga apko'},
 {'role': 'assistant', 'content': 'Category: neutral'},
 {'role': 'user',
  'content': 'Text: Han khty ab nikal ry to aba g ko kangaal krr k e wapis ain na'},
 {'role': 'assistant', 'content': 'Category: emotional'},
 {'role': 'user', 'content': 'Text: Tera dola sola dekh kr dar Lagta ha'},
 {'role': 'assistant', 'content': 'Category: emotional'},
 {'role': 'user', 'content': 'Text: Dkh lena awain phas na jaon'},
 {'role': 'assistant', 'content': 'Category: emotional'},
 {'role': 'user', 'content': 'Text: Woh dars k liye use hota hai ab'},
 {'role': 'assistant', 'content': 'Category: neutral'},
 {'role': 'user', 'content': 'Text: G abi ki

In [79]:
# this simple iteration is faster than pandas df with tqdm
model = 'gpt-3.5-turbo'
start = time.time()
res   = dict()
count = 0
for t, closest in df_dev[['text', 'closest_texts']].values:
    if t in res:
        continue
    messages = create_messages(df_train, closest, top_n=10)
    try:
        res[ t ] = classify_text_few_shot3(t, messages, model)
    except openai.error.RateLimitError:
        print(f'\nText: {t}.\nRate limit error\n')
    except Exception as e:
        print(f'\nText: {t}\nError: {e}\n')
                
    count += 1    
    if count % 10 == 0:
        print(f'Processing text {count}. Time elapsed: {round((time.time()-start)/60, 4)} min')
        with open('data/res.pkl', 'wb') as f:
            pickle.dump(res, f, protocol=pickle.HIGHEST_PROTOCOL)
                        
        
elapsed = (time.time() - start)/60
print(f'\nTime elapsed {round(elapsed, 4)} min')
# to reload a saved checkpoint:
#file = 'data/res.pkl'
#with open(file, 'rb') as f:
#    res2 = pickle.load(f)

Processing text 10. Time elapsed: 0.2187 min
Processing text 20. Time elapsed: 0.5146 min
Processing text 30. Time elapsed: 1.2399 min
Processing text 40. Time elapsed: 1.4608 min
Processing text 50. Time elapsed: 1.6706 min
Processing text 60. Time elapsed: 1.9425 min
Processing text 70. Time elapsed: 2.2348 min
Processing text 80. Time elapsed: 2.8885 min
Processing text 90. Time elapsed: 3.1024 min
Processing text 100. Time elapsed: 3.3186 min
Processing text 110. Time elapsed: 3.6155 min
Processing text 120. Time elapsed: 3.8622 min
Processing text 130. Time elapsed: 4.0695 min
Processing text 140. Time elapsed: 4.3613 min
Processing text 150. Time elapsed: 4.5798 min
Processing text 160. Time elapsed: 4.8102 min
Processing text 170. Time elapsed: 5.1345 min
Processing text 180. Time elapsed: 5.4468 min
Processing text 190. Time elapsed: 5.7625 min
Processing text 200. Time elapsed: 6.0005 min
Processing text 210. Time elapsed: 6.2346 min
Processing text 220. Time elapsed: 6.5002 m

In [80]:
df_dev['gpt_pred'] = df_dev['text'].map( res )
df_dev['gpt_pred'].value_counts()

emotional                         600
neutral                           577
Yes, the category is humorous.

In [81]:
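# any response other than exactly 'neutral' (including verbose or missing answers) is counted as emotional (1)
df_dev['gpt_pred_binary'] = df_dev['gpt_pred'].apply( lambda x: 0 if x=='neutral' else 1 )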
df_dev['gpt_pred_binary'].value_counts()

1    614
0    577
Name: gpt_pred_binary, dtype: int64

In [82]:
y_dev      = df_dev['target'].values
y_dev_pred = df_dev['gpt_pred_binary'].values
print( classification_report( y_dev, y_dev_pred, digits=4 ) )

              precision    recall  f1-score   support

           0     0.4714    0.7010    0.5637       388
           1     0.8111    0.6202    0.7029       803

    accuracy                         0.6465      1191
   macro avg     0.6412    0.6606    0.6333      1191
weighted avg     0.7004    0.6465    0.6576      1191



## APPENDIX

### Prompts and results (in reverse chronological order)

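__CONCLUSIONS__:
* The more examples the better. With semantically closest examples, macro-F1 rises from 0.6333 (10 examples) to 0.6630 (25), 0.6773 (50), and 0.6813 (100), so a few most similar examples are not as good as 100 examples.
* Choosing the 100 closest examples (Experiment 3, macro-F1 0.6813) works better than using 100 random examples (Experiment 2, macro-F1 0.6200).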

__Experiment 6__  
Same as Experiment 3, but using only the 10 semantically closest texts as examples instead of 100.
```
              precision    recall  f1-score   support

           0     0.4714    0.7010    0.5637       388
           1     0.8111    0.6202    0.7029       803

    accuracy                         0.6465      1191
   macro avg     0.6412    0.6606    0.6333      1191
weighted avg     0.7004    0.6465    0.6576      1191
```
_____

__Experiment 5__  
Same as Experiment 3, but using only the 25 semantically closest texts as examples instead of 100.
```
              precision    recall  f1-score   support

           0     0.5191    0.6289    0.5688       388
           1     0.8003    0.7186    0.7572       803

    accuracy                         0.6893      1191
   macro avg     0.6597    0.6737    0.6630      1191
weighted avg     0.7087    0.6893    0.6958      1191
```
_____

__Experiment 4__  
Same as Experiment 3, but using only the 50 semantically closest texts as examples instead of 100.
```
              precision    recall  f1-score   support

           0     0.5583    0.5799    0.5689       388
           1     0.7931    0.7783    0.7857       803

    accuracy                         0.7137      1191
   macro avg     0.6757    0.6791    0.6773      1191
weighted avg     0.7166    0.7137    0.7151      1191
```
_____

__Experiment 3__  
Instead of splitting the training set into random chunks, select the few-shot examples by semantic similarity:
* Get OpenAI embeddings for the original bilingual texts in the training set and the dev set
* For each dev set text, find the 100 training set texts whose embeddings are closest to its embedding under the cosine similarity metric
* Using these 100 closest texts as examples, create the messages object with alternating user (text) and assistant (category) chat utterances
```
              precision    recall  f1-score   support

           0     0.5948    0.5258    0.5581       388
           1     0.7830    0.8269    0.8044       803

    accuracy                         0.7288      1191
   macro avg     0.6889    0.6763    0.6813      1191
weighted avg     0.7217    0.7288    0.7241      1191
```
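
The retrieval step of Experiment 3 could be sketched roughly as below. This is a minimal illustration, not the notebook's own `create_messages`: the helper names `top_n_closest` and `build_messages` are hypothetical, the 2-D vectors are toy stand-ins for real OpenAI embeddings, and placing the most similar example first is an assumption.

```python
import numpy as np

def top_n_closest(query_emb, train_embs, n=100):
    """Indices of the n training embeddings most cosine-similar to query_emb, most similar first."""
    train_embs = np.asarray(train_embs, dtype=float)
    query_emb = np.asarray(query_emb, dtype=float)
    sims = train_embs @ query_emb / (
        np.linalg.norm(train_embs, axis=1) * np.linalg.norm(query_emb)
    )
    return np.argsort(-sims)[:n]

def build_messages(texts, categories, indices):
    """Alternate user (text) and assistant (category) utterances for the selected examples."""
    messages = []
    for i in indices:
        messages.append({"role": "user", "content": f"Text: {texts[i]}"})
        messages.append({"role": "assistant", "content": f"Category: {categories[i]}"})
    return messages

# Toy data: illustrative values only, not taken from the shared-task data
train_texts = ["text a", "text b", "text c"]
train_cats = ["emotional", "neutral", "emotional"]
train_embs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
dev_emb = [1.0, 0.05]

idx = top_n_closest(dev_emb, train_embs, n=2)
messages = build_messages(train_texts, train_cats, idx)
```

The dev set text itself is then appended as the final user utterance before the request is sent.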
_____

__Experiment 2__  
Split the training set into 62 random, non-stratified chunks of 100 examples each (this keeps the prompt within the ChatGPT context window of 4096 tokens). Convert the 100 examples in each chunk into a messages object of alternating user (text) and assistant (category) chat utterances. Iterate over the chunks and the dev set examples, using the next chunk's messages for each dev set example: the dev set example is appended as the next user utterance for few-shot learning, and ChatGPT responds with an assistant utterance containing the category.
```
              precision    recall  f1-score   support

           0     0.4895    0.4820    0.4857       388
           1     0.7515    0.7572    0.7543       803

    accuracy                         0.6675      1191
   macro avg     0.6205    0.6196    0.6200      1191
weighted avg     0.6662    0.6675    0.6668      1191
```
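
A rough sketch of the chunking scheme of Experiment 2 follows. The helper names are hypothetical, the data is synthetic, and cycling back to the first chunk once all chunks have been used is an assumption about how "the next message" was chosen.

```python
import random

def make_chunks(examples, chunk_size=100, seed=0):
    """Shuffle (text, category) pairs and split them into consecutive chunks of chunk_size."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)   # random, non-stratified split
    return [shuffled[i:i + chunk_size] for i in range(0, len(shuffled), chunk_size)]

def chunk_to_messages(chunk):
    """Convert one chunk into alternating user (text) and assistant (category) utterances."""
    messages = []
    for text, category in chunk:
        messages.append({"role": "user", "content": f"Text: {text}"})
        messages.append({"role": "assistant", "content": f"Category: {category}"})
    return messages

def messages_for_dev_text(chunk_messages, k, dev_text):
    """Take the messages of chunk k (cycling, an assumption) and append the dev text as the last user turn."""
    base = chunk_messages[k % len(chunk_messages)]
    return base + [{"role": "user", "content": f"Text: {dev_text}"}]

# Synthetic training examples, for illustration only
examples = [(f"text {i}", "emotional" if i % 2 else "neutral") for i in range(250)]
chunks = make_chunks(examples, chunk_size=100)
chunk_messages = [chunk_to_messages(c) for c in chunks]
prompt = messages_for_dev_text(chunk_messages, 4, "dev text")
```

Because the chunks are random, the examples shown to the model have no particular relation to the dev set text.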
_____

__Experiment 1__  
Split the training set into 46 random, non-stratified chunks of 135 examples each. Concatenate the 135 examples in each chunk into a single string of the form "Text: {text}\nCategory: {category}\n#####...". Iterate over the chunks and the dev set examples, using the next concatenated chunk as the few-shot prompt for each dev set example.
```
              precision    recall  f1-score   support

           0     0.4723    0.5722    0.5175       388
           1     0.7698    0.6912    0.7283       803

    accuracy                         0.6524      1191
   macro avg     0.6211    0.6317    0.6229      1191
weighted avg     0.6729    0.6524    0.6597      1191
```
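
The concatenated prompt of Experiment 1 might be assembled along these lines. The exact separator is elided in the description above, so the `#####` line used here is an assumption, as are the helper name `chunk_to_prompt` and the way the dev set text is appended with an open `Category:` slot.

```python
SEPARATOR = "#####\n"   # assumption: the original separator is not fully shown

def chunk_to_prompt(chunk, dev_text):
    """Concatenate (text, category) examples and end with the dev text awaiting its category."""
    shots = SEPARATOR.join(
        f"Text: {text}\nCategory: {category}\n" for text, category in chunk
    )
    return f"{shots}{SEPARATOR}Text: {dev_text}\nCategory:"

# Toy chunk, for illustration only
chunk = [("a", "neutral"), ("b", "emotional")]
prompt = chunk_to_prompt(chunk, "dev")
print(prompt)
```

Unlike the messages format of Experiments 2 and 3, the whole prompt is sent here as one block of text.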