# Workshop 1 - Question and Answers
In this workshop, you will learn how to write prompts and feed them into LLMs. You
will also learn how to use different prompting techniques to improve the responses
from the LLM.

## Loading and Exploring the Dataset
The workshop uses the [`facebook/ExploreToM`](https://huggingface.co/datasets/facebook/ExploreToM) dataset from [HuggingFace](https://huggingface.co).

In [1]:
# TODO: Load the following libraries: datasets
from datasets import load_dataset

In [2]:
# Dataset name
dataset_name = "facebook/ExploreToM"

In [None]:
# TODO: load and explore the dataset
dataset = load_dataset(dataset_name)

In [7]:
# TODO: number of rows in the dataset
print(dataset.shape)

# TODO: Keys in the dataset
print(dataset.keys())

# TODO: Feature names
print(dataset['train'].features)

# TODO: Display a single row
idx = 30
print(dataset['train'][idx])


{'train': (13309, 18)}
dict_keys(['train'])
{'story_structure': Value('string'), 'infilled_story': Value('string'), 'question': Value('string'), 'expected_answer': Value('string'), 'qprop=params': Value('string'), 'qprop=nth_order': Value('int64'), 'qprop=non_unique_mental_state': Value('bool'), 'sprop=is_false_belief_story_1st': Value('bool'), 'sprop=is_false_belief_story_1st_and_2nd': Value('bool'), 'sprop=story_accuracy_1st_raw': Value('float64'), 'sprop=story_accuracy_1st_infilled': Value('float64'), 'sprop=global_idx': Value('int64'), 'param=story_type': Value('string'), 'param=num_stories_total': Value('int64'), 'param=max_sentences': Value('int64'), 'param=num_people': Value('int64'), 'param=num_moves': Value('int64'), 'param=num_rooms': Value('int64')}
{'story_structure': 'Lachlan entered the production room. Aubrey entered the production room. Lachlan moved the script to the cardboard box, which is also located in the production room. While this action was happening, Aubrey go

In [8]:
# TODO: import pipeline
from transformers import pipeline

## `pipeline`
[`pipeline`](https://huggingface.co/docs/transformers/en/main_classes/pipelines) is an easy-to-use API for running inference. It wraps the task-specific pipelines and hides most of their complexity, so you can focus on the model and the task.

You can use `pipeline` for summarisation, image classification, audio generation, and more. The full list of `pipeline` tasks is [here](https://huggingface.co/docs/transformers/en/main_classes/pipelines#transformers.pipeline.task).

In [9]:
# TODO: Create a question-answering pipeline with the default model
qna = pipeline("question-answering")


No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use cpu


In [34]:
idx = 20
# Use the infilled story as the context; swap the comments to try the story structure instead
#context = dataset['train'][idx]['story_structure']
context = dataset['train'][idx]['infilled_story']
question = dataset['train'][idx]['question']
expected_answer = dataset['train'][idx]['expected_answer']

result = qna(question=question, context=context)
print(expected_answer)
print(result)

cardboard box
{'score': 0.6982207577675581, 'start': 623, 'end': 636, 'answer': 'cardboard box'}
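
The `start` and `end` values in the result are character offsets into `context`: here 636 - 623 = 13, which is the length of `cardboard box`. The sketch below illustrates this with a made-up context string and a hand-built result dictionary, so it does not call the model at all.

```python
# Hypothetical context and result, shaped like the pipeline output above.
context = "Lachlan moved the script to the cardboard box."
answer = "cardboard box"

# Build the offsets by hand; the real pipeline computes them for you.
start = context.index(answer)
result = {"start": start, "end": start + len(answer), "answer": answer}

# Slicing the context with the offsets recovers the answer text.
span = context[result["start"]:result["end"]]
print(span)
```

This is handy when you want to highlight the answer inside the original story rather than just print it.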


## Manual Inference - Question and Answer
In this section, we will look at what `pipeline` does under the hood to perform its inference. This will give us a better understanding of the major steps involved.

In [17]:
# TODO: load tokenizer
from transformers import AutoTokenizer

## DistilBERT base cased distilled SQuAD
DistilBERT is a small, fast, cheap and light Transformer model trained by distilling BERT base. More details [here](https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).

In [18]:
model_name = "distilbert/distilbert-base-cased-distilled-squad"

In [19]:
# TODO: Create a tokenizer
tokenize = AutoTokenizer.from_pretrained(model_name)


In [24]:
# TODO: Encode text
# Alternative sample text; uncomment it and comment out the assignment below to try it
# text = "Kaylee entered the hotel lobby. Kaylee moved the silver letter opener to the wooden desk drawer, which is also located in the hotel lobby. While this action was happening, Liam witnessed this action in secret (and only this action). Kaylee left the hotel lobby. Liam entered the hotel lobby. Kaylee entered the hotel lobby. Liam moved the silver letter opener to the leather briefcase, which is also located in the hotel lobby."
text = """Two households, both alike in dignity
(In fair Verona, where we lay our scene),
From ancient grudge break to new mutiny,
Where civil blood makes civil hands unclean.
From forth the fatal loins of these two foes
A pair of star-crossed lovers take their life;"""

# pt = PyTorch Tensor
enc_text = tokenize(text, return_tensors='pt')
print(enc_text)


{'input_ids': tensor([[  101,  1960,  3065,   117,  1241, 11609,  1107, 14931,   113,  1130,
          4652, 22051,   117,  1187,  1195,  3191,  1412,  2741,   114,   117,
          1622,  2890,   176, 27577,  2549,  1106,  1207,   182, 16065,  3382,
           117,  2777,  2987,  1892,  2228,  2987,  1493,  4906,  1389,   119,
          1622,  5275,  1103, 11874, 25338,  4935,  1104,  1292,  1160, 27579,
          1116,   138,  3111,  1104,  2851,   118,  3809, 12977,  1321,  1147,
          1297,   132,   102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}


In [25]:
print(enc_text.input_ids[0])

tensor([  101,  1960,  3065,   117,  1241, 11609,  1107, 14931,   113,  1130,
         4652, 22051,   117,  1187,  1195,  3191,  1412,  2741,   114,   117,
         1622,  2890,   176, 27577,  2549,  1106,  1207,   182, 16065,  3382,
          117,  2777,  2987,  1892,  2228,  2987,  1493,  4906,  1389,   119,
         1622,  5275,  1103, 11874, 25338,  4935,  1104,  1292,  1160, 27579,
         1116,   138,  3111,  1104,  2851,   118,  3809, 12977,  1321,  1147,
         1297,   132,   102])


In [26]:
dec_text = tokenize.decode(enc_text.input_ids[0])
print(dec_text)

[CLS] Two households, both alike in dignity ( In fair Verona, where we lay our scene ), From ancient grudge break to new mutiny, Where civil blood makes civil hands unclean. From forth the fatal loins of these two foes A pair of star - crossed lovers take their life ; [SEP]


In [27]:
for tok in enc_text.input_ids[0]:
  txt = tokenize.decode(tok)
  print(f'txt={txt}, tok={tok}')


txt=[CLS], tok=101
txt=Two, tok=1960
txt=households, tok=3065
txt=,, tok=117
txt=both, tok=1241
txt=alike, tok=11609
txt=in, tok=1107
txt=dignity, tok=14931
txt=(, tok=113
txt=In, tok=1130
txt=fair, tok=4652
txt=Verona, tok=22051
txt=,, tok=117
txt=where, tok=1187
txt=we, tok=1195
txt=lay, tok=3191
txt=our, tok=1412
txt=scene, tok=2741
txt=), tok=114
txt=,, tok=117
txt=From, tok=1622
txt=ancient, tok=2890
txt=g, tok=176
txt=##rudge, tok=27577
txt=break, tok=2549
txt=to, tok=1106
txt=new, tok=1207
txt=m, tok=182
txt=##uti, tok=16065
txt=##ny, tok=3382
txt=,, tok=117
txt=Where, tok=2777
txt=civil, tok=2987
txt=blood, tok=1892
txt=makes, tok=2228
txt=civil, tok=2987
txt=hands, tok=1493
txt=uncle, tok=4906
txt=##an, tok=1389
txt=., tok=119
txt=From, tok=1622
txt=forth, tok=5275
txt=the, tok=1103
txt=fatal, tok=11874
txt=lo, tok=25338
txt=##ins, tok=4935
txt=of, tok=1104
txt=these, tok=1292
txt=two, tok=1160
txt=foe, tok=27579
txt=##s, tok=1116
txt=A, tok=138
txt=pair, tok=3111
txt=of, tok=
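
Pieces such as `g` + `##rudge` or `uncle` + `##an` come from WordPiece tokenization, where a word missing from the vocabulary is split into known subwords and `##` marks a continuation. The following is a simplified sketch of the greedy longest-match-first idea, using a tiny made-up vocabulary. The real tokenizer also normalises the text and splits on punctuation first, so treat this as an illustration rather than its actual implementation.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary, chosen to reproduce the splits shown above.
toy_vocab = {"g", "##rudge", "uncle", "##an", "m", "##uti", "##ny", "civil"}

for w in ["grudge", "unclean", "mutiny", "civil"]:
    print(w, wordpiece(w, toy_vocab))
```

Note that the split is driven purely by what is in the vocabulary, which is why `unclean` becomes `uncle` + `##an` rather than something more meaningful to a human reader.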

In [29]:
# TODO: Encoding multiple texts
text0 = "Kaylee entered the hotel lobby. Kaylee moved the silver letter opener to the wooden desk drawer, which is also located in the hotel lobby. While this action was happening, Liam witnessed this action in secret (and only this action). Kaylee left the hotel lobby. Liam entered the hotel lobby. Kaylee entered the hotel lobby. Liam moved the silver letter opener to the leather briefcase, which is also located in the hotel lobby."
text1 = """Two households, both alike in dignity
(In fair Verona, where we lay our scene),
From ancient grudge break to new mutiny,
Where civil blood makes civil hands unclean.
From forth the fatal loins of these two foes
A pair of star-crossed lovers take their life;"""

texts = [ text0, text1]
print(texts)

enc_texts = tokenize(texts, return_tensors='pt', padding=True)
print(enc_texts)


['Kaylee entered the hotel lobby. Kaylee moved the silver letter opener to the wooden desk drawer, which is also located in the hotel lobby. While this action was happening, Liam witnessed this action in secret (and only this action). Kaylee left the hotel lobby. Liam entered the hotel lobby. Kaylee entered the hotel lobby. Liam moved the silver letter opener to the leather briefcase, which is also located in the hotel lobby.', 'Two households, both alike in dignity\n(In fair Verona, where we lay our scene),\nFrom ancient grudge break to new mutiny,\nWhere civil blood makes civil hands unclean.\nFrom forth the fatal loins of these two foes\nA pair of star-crossed lovers take their life;']
{'input_ids': tensor([[  101, 11247,  6894,  2242,  1103,  3415,  9722,   119, 11247,  6894,
          1427,  1103,  2878,  2998, 17622,  1106,  1103,  4122,  3917, 13560,
           117,  1134,  1110,  1145,  1388,  1107,  1103,  3415,  9722,   119,
          1799,  1142,  2168,  1108,  5664,   117, 
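
With `padding=True`, the shorter sequence is extended with padding tokens so that both rows have the same length, and the `attention_mask` marks the padded positions with `0` so the model ignores them. Here is a minimal pure-Python sketch of that idea. The helper `pad_batch` is hypothetical, and it assumes right-side padding with `[PAD]` mapped to id `0`.

```python
PAD_ID = 0  # assumption: [PAD] has id 0 in this vocabulary


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad token id lists to a common length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 = real token, 0 = padding that the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two short, made-up sequences of different lengths.
batch = pad_batch([[101, 11247, 6894, 102], [101, 1960, 102]])
print(batch)
```

Without padding, sequences of different lengths cannot be stacked into a single rectangular tensor, which is why `return_tensors='pt'` needs it for a batch.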

In [None]:
# TODO: Decode text


## Working with LLMs
Create an instance of the Large Language Model (LLM). We will then create a simple
prompt, tokenize the prompt and feed the tokenized prompt to the LLM. The response
from the LLM will be decoded to human-readable text.

In [31]:
# TODO: Load libraries
from transformers import AutoModelForQuestionAnswering
import torch

In [32]:
# TODO: Load question answer model
model = AutoModelForQuestionAnswering.from_pretrained(model_name)


In [33]:
print(model)

DistilBertForQuestionAnswering(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): DistilBertSdpaAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
     

In [35]:
# TODO: Encode context and question
idx = 20
context = dataset['train'][idx]['story_structure']
#context = dataset['train'][idx]['infilled_story']
question = dataset['train'][idx]['question']
expected_answer = dataset['train'][idx]['expected_answer']


In [37]:
# TODO: Tokenize the inputs
enc_question = tokenize(question, context, return_tensors='pt', padding=True)
print(enc_question)


{'input_ids': tensor([[  101,  1130,  1134, 12461,  1108,  1103,  5444,  1120,  1103,  2150,
           136,   102, 23929,  8495,  1389,  2242,  1103,  1707,  1395,   119,
         20587,  2242,  1103,  1707,  1395,   119, 23929,  8495,  1389,  1427,
          1103,  5444,  1106,  1103, 22823,  2884,   117,  1134,  1110,  1145,
          1388,  1107,  1103,  1707,  1395,   119,  1799,  1142,  2168,  1108,
          5664,   117, 20587,  1400, 11353,  1105,  1225,  1136,  4663,  1184,
          2171,   117,  1443,  2256, 16927,  1103,  4094,  2960,  1104,  2209,
           117,  1105,  1280,  1171,  1106,  6573,  2209,  2411,  1170,  1103,
          2168,  1108,  1845,   119, 20587,  1427,  1103,  5444,  1106,  1103,
          5439,  2068, 20492,   117,  1134,  1110,  1145,  1388,  1107,  1103,
          1707,  1395,   119,   102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 

In [38]:
dec_question = tokenize.decode(enc_question.input_ids[0])
print(dec_question)

[CLS] In which container was the script at the beginning? [SEP] Lachlan entered the production room. Aubrey entered the production room. Lachlan moved the script to the cardboard box, which is also located in the production room. While this action was happening, Aubrey got distracted and did not realize what happened, without anyone noticing the brief lack of attention, and going back to paying attention immediately after the action was finished. Aubrey moved the script to the leather satchel, which is also located in the production room. [SEP]
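
The decoded text shows how a question and a context are packed into one sequence: `[CLS]`, the question, `[SEP]`, the context, and a final `[SEP]`. The sketch below rebuilds that layout by hand with made-up tokens and records where the context sits, since an extracted answer has to come from that part. `build_pair` is a hypothetical helper, not a library function.

```python
def build_pair(question_tokens, context_tokens):
    """Lay out a question/context pair the way the decoded text above shows."""
    tokens = ["[CLS]"] + question_tokens + ["[SEP]"] + context_tokens + ["[SEP]"]
    # The context begins after [CLS], the question and the first [SEP].
    context_start = len(question_tokens) + 2
    context_end = context_start + len(context_tokens)  # exclusive
    return tokens, (context_start, context_end)


tokens, (c_start, c_end) = build_pair(
    ["Where", "is", "it", "?"],
    ["In", "the", "box", "."],
)
print(tokens)
print(tokens[c_start:c_end])
```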


In [39]:
result = model(input_ids=enc_question.input_ids, attention_mask=enc_question.attention_mask)
print(result)

QuestionAnsweringModelOutput(loss=None, start_logits=tensor([[-2.1147e+00, -3.7442e+00, -3.9679e+00, -3.8657e+00, -6.0034e+00,
         -5.3322e+00, -4.8833e+00, -5.4657e+00, -5.3415e+00, -4.8426e+00,
         -3.4545e+00, -4.0436e+00, -1.5973e+00, -5.2831e+00, -5.2384e+00,
         -4.3936e+00, -3.7877e+00, -2.8501e+00, -4.5548e+00, -3.5823e+00,
         -5.7799e-03, -4.0304e+00, -3.4177e+00, -2.5375e+00, -4.3950e+00,
         -2.0794e+00, -5.4663e-02, -4.7430e+00, -4.7353e+00, -9.0420e-01,
         -4.2497e-02, -6.7598e-02,  3.3091e-02,  7.1442e+00,  9.8639e+00,
          3.0480e+00, -3.1136e+00, -3.6514e+00, -4.6225e+00, -5.0883e+00,
         -3.6220e+00, -3.4869e+00, -2.8248e+00, -2.5199e+00, -4.0495e+00,
         -3.2023e+00, -3.7003e+00, -4.5455e+00, -4.5279e+00, -6.6697e+00,
         -4.9747e+00, -4.9050e+00, -9.4283e-01, -5.3817e+00, -4.3121e+00,
         -6.7488e+00, -4.8868e+00, -5.6436e+00, -5.8123e+00, -5.3503e+00,
         -5.5068e+00, -6.6925e+00, -4.6569e+00, -5.3800e+00

In [41]:
start_ans = torch.argmax(result.start_logits)
end_ans = torch.argmax(result.end_logits) + 1

enc_ans = enc_question.input_ids[0][start_ans:end_ans]
print(enc_ans)

ans = tokenize.decode(enc_ans)
print(ans)

tensor([22823,  2884])
cardboard box
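
Taking the two `argmax` values independently works here, but nothing guarantees that the chosen end comes after the chosen start. One common safeguard is to score every valid span (start at or before end) and keep the best one. The sketch below does this in plain Python on made-up logits. `best_span` and its `max_answer_len` parameter are hypothetical names for illustration, not part of the code above.

```python
def best_span(start_logits, end_logits, max_answer_len=15):
    """Return (start, end) with start <= end that maximises the summed logits."""
    best, best_score = (0, 0), float("-inf")
    for s, s_logit in enumerate(start_logits):
        # Only consider ends at or after the start, within the length limit.
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best


# Made-up tokens and logits, one logit per token position.
tokens = ["[CLS]", "where", "?", "[SEP]", "the", "cardboard", "box", ".", "[SEP]"]
start_logits = [-2.0, -4.0, -3.5, -4.0, 0.1, 9.8, 3.0, -3.0, -4.0]
end_logits = [-1.5, -4.0, -3.0, -4.0, -2.0, 1.0, 9.5, -1.0, -3.5]

start, end = best_span(start_logits, end_logits)
answer_tokens = tokens[start:end + 1]  # the end index is inclusive, hence + 1
print(" ".join(answer_tokens))
```

The `+ 1` mirrors the `torch.argmax(result.end_logits) + 1` above: the model points at the last answer token, while Python slices exclude the end index.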
