# SILLM Tutorial 1

In this notebook, you will:
- download a text-based dataset which has *labeled* examples. For instance, a sentiment analysis dataset with tweets and their corresponding sentiment label.
- define a function that prompts a large language model to label the dataset with predicted sentiment. You can do this in a few different modes, such as zero-shot and few-shot.
- investigate the LLM's predicted labels

## 1. Get the data

In [25]:
import pandas as pd

As an example, we will use one of the first hate speech datasets, https://github.com/t-davidson/hate-speech-and-offensive-language/tree/master, from the 2017 paper '[Automated Hate Speech Detection and the Problem of Offensive Language](https://ojs.aaai.org/index.php/ICWSM/article/view/14955)'.

In [26]:
data_link = 'https://raw.githubusercontent.com/t-davidson/hate-speech-and-offensive-language/master/data/labeled_data.csv'
dataset = pd.read_csv(data_link)

In [27]:
dataset

Unnamed: 0.1,Unnamed: 0,count,hate_speech,offensive_language,neither,class,tweet
0,0,3,0,0,3,2,!!! RT @mayasolovely: As a woman you shouldn't...
1,1,3,0,3,0,1,!!!!! RT @mleew17: boy dats cold...tyga dwn ba...
2,2,3,0,3,0,1,!!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...
3,3,3,0,2,1,1,!!!!!!!!! RT @C_G_Anderson: @viva_based she lo...
4,4,6,0,6,0,1,!!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...
...,...,...,...,...,...,...,...
24778,25291,3,0,2,1,1,you's a muthaf***in lie &#8220;@LifeAsKing: @2...
24779,25292,3,0,1,2,2,"you've gone and broke the wrong heart baby, an..."
24780,25294,3,0,3,0,1,young buck wanna eat!!.. dat nigguh like I ain...
24781,25295,6,0,6,0,1,youu got wild bitches tellin you lies


`class` is the label chosen by the majority of CrowdFlower (CF) users; CrowdFlower is a now-defunct crowdsourcing site.

- 0: hate speech
- 1: offensive language
- 2: neither
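The link between the vote columns and `class` can be illustrated with a small sketch. It assumes that `class` is simply the category with the most annotator votes, which is consistent with the rows shown above; `majority_class` is a hypothetical helper for illustration, not part of the dataset's own tooling, and its tie rule is an assumption.

```python
# Label codes as documented for the `class` column.
CLASS_NAMES = {0: "hate speech", 1: "offensive language", 2: "neither"}


def majority_class(hate_speech, offensive_language, neither):
    """Return the code (0, 1 or 2) of the category with the most votes."""
    votes = [hate_speech, offensive_language, neither]
    # max() returns the first maximum, so ties resolve to the lowest code.
    # That tie rule is an assumption of this sketch.
    return max(range(len(votes)), key=lambda code: votes[code])


# Vote counts (hate_speech, offensive_language, neither) copied from
# three rows of the table above.
rows = [(0, 0, 3), (0, 3, 0), (0, 2, 1)]
for votes in rows:
    code = majority_class(*votes)
    print(votes, "->", code, CLASS_NAMES[code])
```

The three rows come out as 2, 1 and 1, matching their `class` values in the table.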


## 2. Make the Prompt

In [28]:
def make_prompt(task, options, instance, **kwargs):
    # options: all possible labels, rendered as ' 1) hate 2) not hate'
    options_str = ''
    for i in range(len(options)):
        options_str = options_str + ' %d) %s' %(i+1, options[i])
    prompt = 'Given a piece of text, you have to label whether it is %s or not.\
    Please return one of the following options with only the text and no number:%s.'\
    %(task, options_str)

    # zero-shot: append only the instance to be labeled
    if kwargs['zero_shot']:
        return prompt + ' What is the label of this text: "' + instance+ '"'

    # few-shot: list the labeled examples before the instance
    examples_str = ''
    for example in kwargs['examples']:
        examples_str = examples_str + 'text: %s, label: %s\n' %(example[0], example[1])
    return prompt + ' Here are some examples of instances and their labels:\
    \n%sWhat is the label of this text: ' %(examples_str) + instance

In [29]:
task = 'hate speech'
options = ['hate', 'not hate']
examples = [] # the first two instances of hate speech in the dataset are used as few-shot examples
for _, row in dataset.iterrows():
    if row['class'] == 0:
        examples.append([row['tweet'], 'hate'])
    if len(examples) == 2:
        break
instance = dataset['tweet'].values[90]
instance

'"@CCobey: @AydanMcCoy happy birthday nigs" Thanks yo'

In [30]:
make_prompt(task, options, instance, zero_shot = True, examples = examples)

'Given a piece of text, you have to label whether it is hate speech or not.    Please return one of the following options with only the text and no number: 1) hate 2) not hate. What is the label of this text: ""@CCobey: @AydanMcCoy happy birthday nigs" Thanks yo"'

In [31]:
print(make_prompt(task, options, instance, zero_shot = False, examples = examples))

Given a piece of text, you have to label whether it is hate speech or not.    Please return one of the following options with only the text and no number: 1) hate 2) not hate. Here are some examples of instances and their labels:    
text: "@Blackman38Tide: @WhaleLookyHere @HowdyDowdy11 queer" gaywad, label: hate
text: "@CB_Baby24: @white_thunduh alsarabsss" hes a beaner smh you can tell hes a mexican, label: hate
What is the label of this text: "@CCobey: @AydanMcCoy happy birthday nigs" Thanks yo


In [32]:
prompt = make_prompt(task, options, instance, zero_shot = False, examples = examples)

## 3. Call the LLM with the prompt

In [33]:
runs = 3 # specify how many labels we want per instance.

First, we try with a commercial model like ChatGPT using our API key.

In [34]:
! pip install openai

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [35]:
import openai
openai.api_base="http://91.107.239.71:80" #"http://127.0.0.1:8000"
openai.api_key="YOUR_API_KEY" # enter your API key here

# list models
# models = openai.Model.list()
# models

In [36]:
responses = openai.ChatCompletion.create(model="gpt-3.5-turbo",
                                         messages=[{"role": "user", "content": prompt}],
                                         max_tokens = 2,
                                         n=runs)

In [37]:
responses

<OpenAIObject chat.completion id=chatcmpl-8DZmfYIXQcWT2Ujg6GnrxXjhNlaGx at 0x7b9346e83b00> JSON: {
  "id": "chatcmpl-8DZmfYIXQcWT2Ujg6GnrxXjhNlaGx",
  "object": "chat.completion",
  "created": 1698246509,
  "model": "gpt-3.5-turbo-0613",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "hate"
      },
      "finish_reason": "length"
    },
    {
      "index": 1,
      "message": {
        "role": "assistant",
        "content": "hate"
      },
      "finish_reason": "length"
    },
    {
      "index": 2,
      "message": {
        "role": "assistant",
        "content": "hate"
      },
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 155,
    "completion_tokens": 6,
    "total_tokens": 161
  }
}

In [38]:
[i['message']['content'] for i in responses['choices']]

['hate', 'hate', 'hate']
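When you request several labels per instance, you eventually need to reduce them to a single prediction. One common option is a majority vote; the sketch below shows it on plain lists of strings. `majority_label` is a hypothetical helper name, and returning the agreement share alongside the label is a design choice of this example rather than something the notebook does.

```python
from collections import Counter


def majority_label(labels):
    """Return the most frequent label and the share of runs that agree with it."""
    counts = Counter(labels)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(labels)


print(majority_label(["hate", "hate", "hate"]))
print(majority_label(["hate", "not hate", "hate"]))
```

A low agreement share flags instances on which the model is unstable and which may deserve a manual look.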

Now let us try the same thing, but with an open-source model like Flan-T5.

In [39]:
! pip install transformers



In [40]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xl")
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xl", max_new_tokens = 500)
model.cuda()
inputs = tokenizer("A step by step recipe to make bolognese pasta:",
                   return_tensors="pt").to("cuda:0")
outputs = model.generate(**inputs)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]



['In a large saucepan, combine the ground beef, onion, garlic, tomato paste, tomato']


The code to prompt Flan-T5 is a bit more complex. We use the Hugging Face Transformers library to run the pre-trained sequence-to-sequence (seq2seq) model "google/flan-t5-xl", a large variant of the T5 (Text-to-Text Transfer Transformer) architecture suitable for various text-to-text tasks.

Load the pre-trained model and tokenizer:

- `AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xl")` loads the pre-trained seq2seq model.
- `AutoTokenizer.from_pretrained("google/flan-t5-xl", max_new_tokens=500)` loads the tokenizer associated with the model. Note that `max_new_tokens` is a generation setting that is normally passed to `model.generate`; passed to the tokenizer it does not control how much text is generated, which is why the recipe above is cut off mid-sentence.

Move the model to the GPU (CUDA):

- `model.cuda()` moves the loaded model to the GPU ("cuda:0" here) for faster inference. This requires a compatible GPU; without one the call raises an error.

Tokenize the input:

- `tokenizer(text, return_tensors="pt")` tokenizes the input text, and `return_tensors="pt"` makes it return PyTorch tensors. `.to("cuda:0")` then moves the tokenized input to the GPU.

Generate a sequence from the model:

- `model.generate(**inputs)` takes the tokenized input and produces a sequence of output tokens.

Decode and print the generated sequence:

- `tokenizer.batch_decode(outputs, skip_special_tokens=True)` decodes the generated output tokens into text, skipping any special tokens that are not part of the final result.

In summary, this code loads a pre-trained seq2seq model, tokenizes an input text, generates a sequence from it on the GPU, and prints the generated text.

In [41]:
responses = []
for n in range(0, runs):
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
    outputs = model.generate(**inputs)
    responses.append(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])

In [42]:
responses

['hate', 'hate', 'hate']

Now do this for all the instances in your dataset.
**Hint**: Use a loop over your dataframe. When doing few-shot labeling, make sure that the examples are not the same as the instance to be labeled.

- Try both zero-shot and few-shot and compare their performance.
- Try both ChatGPT and Flan-T5 small
- Try to get the label from the LLM output. Is it always as expected and can it always be used as is for quantitative analysis?
- For at least the first 50 instances in your dataset, use metrics like accuracy and F1 score to assess the performance of the LLMs against the ground truth labels.

Bonus:
- try varying the wording of the prompts
- try giving an explicit definition of the task in the prompt
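For the hint about few-shot labeling, one simple way to keep the instance out of its own prompt is to filter the example list before building the prompt. The sketch below assumes the same `[text, label]` layout as the `examples` list built earlier; `examples_without_instance` is a hypothetical helper name, and the strings are toy stand-ins for tweets.

```python
def examples_without_instance(examples, instance):
    """Drop any few-shot example whose text equals the instance being labeled."""
    return [example for example in examples if example[0] != instance]


# Toy stand-ins for tweets, in the same [text, label] layout as `examples`.
toy_examples = [["first example text", "hate"], ["second example text", "hate"]]
print(examples_without_instance(toy_examples, "second example text"))
```

If the filter removes an example, you may want to draw a replacement so that every prompt contains the same number of examples.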

## Labeling multiple instances

In [47]:
data_subset = dataset.sample(50).reset_index()
data_subset.head()

Unnamed: 0.1,Unnamed: 0,count,hate_speech,offensive_language,neither,class,tweet
10050,10323,3,2,1,0,0,I be sayin I'm done drinkin everyday but then ...
10448,10726,3,1,2,0,1,I hate bitches. Thirsty ass bitches
22311,22782,3,1,2,0,1,"Two blonde dykes wanna kiss all night, I just ..."
10309,10587,3,0,0,3,2,I found your arm in that trash receptacle and ...
20377,20823,6,0,6,0,1,RT @xleaaahh: &#8220;@1stBlocJeremiah: I want ...


In [48]:
data_subset.groupby('class').size()

class
0     3
1    41
2     6
dtype: int64

In [49]:
from tqdm import tqdm # to help you keep track of how many instances have been labeled
import time # to deal w/ rate limits

all_responses = []

# for chatgpt zero-shot
for n, row in  tqdm(data_subset.iterrows(), total=data_subset.shape[0]):
    if (n+1) % 30 == 0:
        time.sleep(10)
    prompt = make_prompt(task, options, zero_shot = True, instance = row['tweet'])
    responses = openai.ChatCompletion.create(model="gpt-3.5-turbo",
                                         messages=[{"role": "user", "content": prompt}],
                                         n=runs)
    response_list = [row['tweet'], row['class']]
    response_list.extend([i['message']['content'] for i in responses['choices']])
    all_responses.append(response_list)

100%|██████████| 50/50 [00:36<00:00,  1.37it/s]
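The fixed pause every 30 requests in the loop above is one way to stay under a rate limit. An alternative is to retry a failed call with exponential backoff, sketched below on a fake request so that it runs without an API key. `call_with_retries` and its parameters are hypothetical names, and catching every `Exception` is a simplification: a real loop would catch only the client library's rate-limit error.

```python
import time


def call_with_retries(request, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call request() and retry with exponential backoff if it raises."""
    for attempt in range(max_attempts):
        try:
            return request()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Wait base_delay, then 2x, then 4x, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)


# Demo with a fake request that fails twice before succeeding.
calls = {"count": 0}


def flaky_request():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"


delays = []  # record the waits instead of actually sleeping
result = call_with_retries(flaky_request, sleep=delays.append)
print(result, delays)
```

Passing `sleep` as a parameter keeps the demo instant; in the labeling loop you would leave it at the default `time.sleep`.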


In [50]:
chatgpt_results = pd.DataFrame(all_responses, columns = ['tweet', 'hate speech', 'chatgpt_pred_1',
                                      'chatgpt_pred_2',
                                      'chatgpt_pred_3'])

In [51]:
# repeat for flan-t5 zero-shot. Hint: Try to modularize this further
# another advantage of your own model is that you aren't rate limited
all_responses = []
for _, row in  tqdm(data_subset.iterrows(), total=data_subset.shape[0]):
    prompt = make_prompt(task, options, zero_shot = True, instance = row['tweet'])
    responses = []
    for n in range(0, runs):
        inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
        outputs = model.generate(**inputs)
        responses.append(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
    response_list = [row['tweet'], row['class']]
    response_list.extend(responses)
    all_responses.append(response_list)

100%|██████████| 50/50 [00:25<00:00,  1.99it/s]


In [52]:
flant5_results = pd.DataFrame(all_responses, columns = ['tweet', 'hate speech', 'flant5_pred_1',
                                      'flant5_pred_2',
                                      'flant5_pred_3'])
flant5_results

Unnamed: 0,tweet,hate speech,flant5_pred_1,flant5_pred_2,flant5_pred_3
0,I be sayin I'm done drinkin everyday but then ...,0,hate,hate,hate
1,I hate bitches. Thirsty ass bitches,1,hate,hate,hate
2,"Two blonde dykes wanna kiss all night, I just ...",1,hate,hate,hate
3,I found your arm in that trash receptacle and ...,2,not hate,not hate,not hate
4,RT @xleaaahh: &#8220;@1stBlocJeremiah: I want ...,1,hate,hate,hate
5,Ref finally calls it when it's an arm ball but...,1,hate,hate,hate
6,Tf bitch do mine,1,hate,hate,hate
7,#IndigenousPeoplesDay HAHA!!! You fuckin' reta...,1,hate,hate,hate
8,And it's gotta be some ghetto fool :/,2,hate,hate,hate
9,RT @ImDatNigga_Jack: Welcome to Dallas Nawfsid...,1,hate,hate,hate


## Combine the results of both models

And now, you can see how the LLMs compare against the crowdworkers' labels.

In [53]:
all_results = chatgpt_results.merge(flant5_results, on = ['tweet', 'hate speech'])
all_results

Unnamed: 0,tweet,hate speech,chatgpt_pred_1,chatgpt_pred_2,chatgpt_pred_3,flant5_pred_1,flant5_pred_2,flant5_pred_3
0,I be sayin I'm done drinkin everyday but then ...,0,1) hate,hate,hate,hate,hate,hate
1,I hate bitches. Thirsty ass bitches,1,hate,hate,hate,hate,hate,hate
2,"Two blonde dykes wanna kiss all night, I just ...",1,hate,hate,1) hate,hate,hate,hate
3,I found your arm in that trash receptacle and ...,2,not hate,not hate,not hate,not hate,not hate,not hate
4,RT @xleaaahh: &#8220;@1stBlocJeremiah: I want ...,1,not hate,not hate,not hate,hate,hate,hate
5,Ref finally calls it when it's an arm ball but...,1,hate,hate,1) hate,hate,hate,hate
6,Tf bitch do mine,1,1) hate,1) hate,1) hate,hate,hate,hate
7,#IndigenousPeoplesDay HAHA!!! You fuckin' reta...,1,hate,hate,hate,hate,hate,hate
8,And it's gotta be some ghetto fool :/,2,not hate,not hate,hate,hate,hate,hate
9,RT @ImDatNigga_Jack: Welcome to Dallas Nawfsid...,1,hate,hate,hate,hate,hate,hate


Now, convert the numerical ground truth labels to text, treating 'offensive' as 'not hate'.

In [54]:
label_mapping = {0 : 'hate', 1 : 'not hate', 2: 'not hate'}
all_results['true_label'] = [label_mapping[i] for i in all_results['hate speech']]

In [55]:
all_results

Unnamed: 0,tweet,hate speech,chatgpt_pred_1,chatgpt_pred_2,chatgpt_pred_3,flant5_pred_1,flant5_pred_2,flant5_pred_3,true_label
0,I be sayin I'm done drinkin everyday but then ...,0,1) hate,hate,hate,hate,hate,hate,hate
1,I hate bitches. Thirsty ass bitches,1,hate,hate,hate,hate,hate,hate,not hate
2,"Two blonde dykes wanna kiss all night, I just ...",1,hate,hate,1) hate,hate,hate,hate,not hate
3,I found your arm in that trash receptacle and ...,2,not hate,not hate,not hate,not hate,not hate,not hate,not hate
4,RT @xleaaahh: &#8220;@1stBlocJeremiah: I want ...,1,not hate,not hate,not hate,hate,hate,hate,not hate
5,Ref finally calls it when it's an arm ball but...,1,hate,hate,1) hate,hate,hate,hate,not hate
6,Tf bitch do mine,1,1) hate,1) hate,1) hate,hate,hate,hate,not hate
7,#IndigenousPeoplesDay HAHA!!! You fuckin' reta...,1,hate,hate,hate,hate,hate,hate,not hate
8,And it's gotta be some ghetto fool :/,2,not hate,not hate,hate,hate,hate,hate,not hate
9,RT @ImDatNigga_Jack: Welcome to Dallas Nawfsid...,1,hate,hate,hate,hate,hate,hate,not hate


Finally, compute quantitative performance metrics.

In [56]:
from sklearn.metrics import classification_report
print(classification_report(all_results['true_label'], all_results['flant5_pred_1']))

              precision    recall  f1-score   support

        hate       0.05      0.67      0.09         3
    not hate       0.86      0.13      0.22        47

    accuracy                           0.16        50
   macro avg       0.45      0.40      0.15        50
weighted avg       0.81      0.16      0.21        50
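To see where the numbers in the 'hate' row come from, the sketch below recomputes them from confusion counts. The counts (2 true positives, 41 false positives, 1 false negative, 6 true negatives) are back-calculated from the rounded reports in this notebook and are therefore an assumption, not values the notebook prints; `binary_metrics` is a hypothetical helper.

```python
def binary_metrics(tp, fp, fn, tn):
    """Precision, recall, F1 and accuracy for one class treated as positive."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy


# Counts for the 'hate' class, back-calculated from the rounded report
# (an assumption of this sketch).
precision, recall, f1, accuracy = binary_metrics(tp=2, fp=41, fn=1, tn=6)
print(round(precision, 2), round(recall, 2), round(f1, 2), round(accuracy, 2))
# prints 0.05 0.67 0.09 0.16
```

The low precision reflects that almost every tweet is predicted as 'hate' while only 3 of the 50 carry that label.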



In [57]:
from sklearn.metrics import classification_report
print(classification_report(all_results['true_label'], all_results['chatgpt_pred_1']))

              precision    recall  f1-score   support

     1) hate       0.00      0.00      0.00         0
        hate       0.07      0.67      0.12         3
    not hate       1.00      0.36      0.53        47

    accuracy                           0.38        50
   macro avg       0.36      0.34      0.22        50
weighted avg       0.94      0.38      0.51        50





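These results aren't very promising.

1) Flan-T5 has an accuracy of 16%, far worse than random chance and far below the 94% that always answering 'not hate' would reach on this sample (47 of the 50 tweets are 'not hate').

2) ChatGPT is a bit better at 38%, but some of its outputs are malformed ('1) hate'), and these are counted as a separate, always-wrong label.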
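Malformed answers such as '1) hate' can be mapped back onto the allowed options before computing metrics. The sketch below is one possible cleanup, assuming the only deviations are a leading list number, surrounding punctuation and letter case; `normalize_label` is a hypothetical helper, and anything it cannot match is returned as `None` so that you can inspect it by hand.

```python
import re


def normalize_label(raw, options):
    """Map a raw model answer onto one of `options`, or None if nothing matches."""
    # Remove a leading list marker such as "1) " or "2. ", then tidy the rest.
    text = re.sub(r"^\s*\d+\s*[).:]\s*", "", raw).strip().strip(".\"'").lower()
    for option in options:
        if text == option.lower():
            return option
    return None


labels = ["hate", "not hate"]
raw_answers = ["1) hate", "hate", "Not hate.", "unsure"]
print([normalize_label(answer, labels) for answer in raw_answers])
# prints ['hate', 'hate', 'not hate', None]
```

Applying such a step to the prediction columns before calling `classification_report` removes the spurious '1) hate' row from the report.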

Let's now consider 'offensive' to also be hate.

In [58]:
label_mapping = {0 : 'hate', 1 : 'hate', 2: 'not hate'}
all_results['true_label_2'] = [label_mapping[i] for i in all_results['hate speech']]

In [59]:
print(classification_report(all_results['true_label_2'], all_results['flant5_pred_1']))

              precision    recall  f1-score   support

        hate       0.95      0.93      0.94        44
    not hate       0.57      0.67      0.62         6

    accuracy                           0.90        50
   macro avg       0.76      0.80      0.78        50
weighted avg       0.91      0.90      0.90        50



In [60]:
print(classification_report(all_results['true_label_2'], all_results['chatgpt_pred_1']))

              precision    recall  f1-score   support

     1) hate       0.00      0.00      0.00         0
        hate       1.00      0.68      0.81        44
    not hate       0.35      1.00      0.52         6

    accuracy                           0.72        50
   macro avg       0.45      0.56      0.44        50
weighted avg       0.92      0.72      0.78        50





The better performance of both LLMs when we combine 'hate' and 'offensive' suggests that they are not very good at telling these two categories apart.

Try:
- with definitions of 'hate speech' and 'offensive' (from the dictionary, for example) and see if the results change
- look at a few instances yourself to see if you agree or disagree with the LLMs and/or the crowdworkers
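One way to try the first suggestion is to put the definitions in front of an existing prompt string. The sketch below is only an illustration: `prepend_definitions` is a hypothetical helper, the "Definitions:" wording is an arbitrary choice, and the definition texts are placeholders you would replace with wording from a dictionary or from the paper.

```python
def prepend_definitions(prompt, definitions):
    """Return the prompt preceded by one 'term: definition' line per entry."""
    lines = ["%s: %s" % (term, text) for term, text in definitions.items()]
    return "Definitions:\n" + "\n".join(lines) + "\n\n" + prompt


# Placeholder definitions; fill in your own wording.
definitions = {
    "hate speech": "<your definition of hate speech>",
    "offensive": "<your definition of offensive language>",
}
base_prompt = "Given a piece of text, you have to label whether it is hate speech or not."
print(prepend_definitions(base_prompt, definitions))
```

Because the helper works on the finished prompt string, it can wrap both the zero-shot and the few-shot prompts without changing how they are built.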