## Mistral-7B

## Zero-shot Chain-of-Thought Prompting



Install Necessary Packages

In [1]:
%pip install datasets
%pip install transformers
%pip install evaluate
%pip install torch
%pip install torcheval
%pip install scikit-learn
%pip install nltk
%pip install absl-py
%pip install rouge_score
%pip install accelerate
%pip install langchain
%pip install -U bitsandbytes
%pip install spacy
%pip install langdetect

In [2]:
import warnings
warnings.filterwarnings('ignore')

In [3]:
import numpy as np
import torch as tt
from torcheval.metrics import MulticlassAccuracy
import matplotlib.pyplot as plt
from datasets import load_dataset
from evaluate import load
import evaluate

In [4]:
import torch
from langchain import PromptTemplate, HuggingFacePipeline
from langchain_core.output_parsers import PydanticOutputParser
from langchain_core.pydantic_v1 import BaseModel, Field
from transformers import BitsAndBytesConfig, AutoModelForCausalLM, AutoTokenizer, GenerationConfig, pipeline

In [5]:
import os
hf_token = os.environ.get("HF_TOKEN")  # read the Hugging Face access token from the environment; never hard-code it in a notebook

Evaluation Metrics required for all the tasks

In [6]:
accuracy_metric = load("accuracy")   # accuracy metric
f1_metric = load("f1")     # F1 metric
bleu_metric = load("bleu")     # BLEU metric
meteor_metric = load('meteor') # METEOR metric
rouge_metric = load("rouge")   # ROUGE metric
mult_acc = MulticlassAccuracy()

Loading Mistral-7B model from Hugging Face

In [7]:
MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.2"


quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

# Initialization of a tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, token = hf_token)


# Initialization of a model
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
    quantization_config=quantization_config,
    token = hf_token
)

# Generation settings
generation_config = GenerationConfig.from_pretrained(MODEL_NAME)
generation_config.max_new_tokens = 1024 # maximum number of new tokens the model may generate
generation_config.temperature = 0.6 # softmax temperature; values below 1 make sampling less random
generation_config.top_p = 0.90 # nucleus sampling: sample only from the smallest token set whose cumulative probability reaches 0.90
generation_config.do_sample = True # sample from the distribution instead of decoding greedily
generation_config.repetition_penalty = 1.15 # values above 1 penalise tokens that have already appeared


pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    do_sample=True,
    return_full_text=True,
    generation_config=generation_config
)
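
To make the `temperature` and `top_p` settings concrete, here is a minimal pure-Python sketch of temperature scaling followed by nucleus (top-p) filtering. It is an illustration under simplifying assumptions, not the `transformers` implementation; the function name `top_p_filter` and the example logits are hypothetical.

```python
import math

def top_p_filter(logits, temperature=0.6, top_p=0.9):
    """Return {token_id: probability} for the tokens that survive top-p filtering."""
    # Temperature scaling: dividing by a value below 1 sharpens the distribution
    scaled = [x / temperature for x in logits]
    # Numerically stable softmax
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Walk the tokens from most to least likely until the kept mass reaches top_p
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Renormalise over the kept tokens; sampling would then draw from this dict
    return {i: probs[i] / mass for i in kept}

# Hypothetical logits for a four-token vocabulary
filtered = top_p_filter([2.0, 1.0, 0.0, -5.0])
```

With these example logits only the two most likely tokens survive, which is the sense in which `top_p` trims the low-probability tail before sampling.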

In [8]:
# HuggingFace pipeline
llm = HuggingFacePipeline(pipeline=pipe)

In [9]:
import gc
gc.collect()

212

Loading Dataset for Question-Answering task.

In [10]:
# Load the validation split as test split is not available for public use
qa_dataset = load_dataset("google/boolq", split="validation")

Loading Dataset for Reasoning task.

In [11]:
# Load the validation split
reasoning_dataset = load_dataset("tau/commonsense_qa", split="validation")

Loading Datasets for Translation task

In [12]:
# Load the validation split for english to french translation
french_dataset = load_dataset("iwslt2017","iwslt2017-en-fr" , split="validation")

Loading Dataset for Summarisation task.

In [13]:
# Load the test split for summarisation task
sum_dataset = load_dataset("samsum", split="test")

## Question Answering task

Prompt Formulation

In [14]:
def prompt_output(item):
    """
    Build a zero-shot chain-of-thought prompt from one BoolQ item (passage and
    question) and generate the model output.
    """
    # map() is called with batched=True and batch_size=1, so every field is a one-element list
    passage = item['passage'][0]
    question = item['question'][0]

    # Plain string with placeholders: PromptTemplate fills them in, so braces in the passage cannot break the template
    template = "<s>[INST]\nBased on the passage:'{passage}'\nAnswer True/False to the question: '{question}'.Let's think step by step.[/INST]\nAnswer:"

    prompt = PromptTemplate.from_template(template)

    chain = prompt | llm

    predictions = chain.invoke({'question': question, 'passage': passage})

    # Return a one-element list, matching the one-row batch that was passed in
    output = {'results': [predictions]}
    return output
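
One pitfall with format-style prompt templates is interpolating the data first and parsing the result as a template afterwards: any braces in the data are then read as placeholders. The sketch below shows the difference using plain `str.format` as a stand-in for `PromptTemplate` (an assumption made so the example runs without LangChain); `build_prompt`, `build_prompt_preinterpolated`, and the example passage are hypothetical.

```python
# Hypothetical stand-in: str.format is used here in place of PromptTemplate
TEMPLATE = (
    "<s>[INST]\nBased on the passage:'{passage}'\n"
    "Answer True/False to the question: '{question}'."
    "Let's think step by step.[/INST]\nAnswer:"
)

def build_prompt(passage, question):
    """Fill the placeholders once; the substituted text is never re-parsed."""
    return TEMPLATE.format(passage=passage, question=question)

def build_prompt_preinterpolated(passage, question):
    """Interpolate first, then parse the result as a template (the fragile pattern)."""
    text = f"Based on the passage:'{passage}'\nQuestion: '{question}'"
    return text.format()  # fails if the passage itself contains {braces}

# Example passage that happens to contain braces
passage = "The set {a, b} has two elements."
question = "does the set have two elements"

prompt = build_prompt(passage, question)

try:
    build_prompt_preinterpolated(passage, question)
    preinterpolation_failed = False
except (KeyError, ValueError, IndexError):
    preinterpolation_failed = True
```

Passing the passage and question as template variables keeps the data out of the parsing step, which is why the placeholder form is the safer of the two.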


Processing the question answering dataset

In [None]:
# prompt_output returns a single result per call, so keep batch_size=1; larger batches would require changing the function
results = qa_dataset.map(prompt_output, batched=True, batch_size=1,  num_proc=1)

Extracting the Answer from the generated text

In [19]:
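import re

def extract_answers(text):
    """Return 1 (True) or 0 (False) from the first line that contains 'answer:'."""
    # Convert the text to lowercase
    text = text.lower()
    lines = text.split('\n')
    l1 = ['yes','true']
    l2 = ['no','false']
    for line in lines:
        if "answer:" in line:
            # Compare whole words so that 'no' does not match inside 'not' or 'know'
            words = re.findall(r'[a-z]+', line.replace('answer:', ''))
            for word in l1:
                if word in words:
                    return 1
            for word in l2:
                if word in words:
                    return 0
            return 0
    return 0  # no 'answer:' line found; default to False, like the fallback above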

In [20]:
predictions = []
references = []

for item in results:
    generated_text = item['results']  # 'results' key contains the generated text
    prediction = extract_answers(generated_text)
    answer = 1 if item['answer'] else 0  # BoolQ stores the gold label as a boolean
    predictions.append(prediction)
    references.append(answer)

Computation of Accuracy and F1 score

In [18]:
# predictions and references must be lists of integers (0 or 1)
acc_score = accuracy_metric.compute(predictions=predictions, references=references)
f1_score  = f1_metric.compute(predictions=predictions, references=references)
# Accuracy and F1 score for the Question Answering task
print(acc_score)
print(f1_score)

{'accuracy': 0.8}
{'f1': 0.8571428571428571}
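
For reference, accuracy and binary F1 can be computed by hand from the two label lists. The sketch below is a minimal pure-Python version that treats 1 (True) as the positive class; the function name `accuracy_and_f1` and the five example labels are hypothetical and are not the notebook's data.

```python
def accuracy_and_f1(predictions, references):
    """Accuracy and binary F1 (positive class = 1) for two equal-length lists of 0/1 labels."""
    pairs = list(zip(predictions, references))
    tp = sum(1 for p, r in pairs if p == 1 and r == 1)  # true positives
    fp = sum(1 for p, r in pairs if p == 1 and r == 0)  # false positives
    fn = sum(1 for p, r in pairs if p == 0 and r == 1)  # false negatives
    accuracy = sum(1 for p, r in pairs if p == r) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1

# Hypothetical labels, chosen only to exercise the function
example_preds = [1, 1, 1, 0, 0]
example_refs = [1, 1, 0, 0, 0]
acc, f1 = accuracy_and_f1(example_preds, example_refs)
```

Because F1 only looks at the positive class, it can differ noticeably from accuracy when the True/False labels are imbalanced.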


Qualitative analysis

In [21]:
passage = "Windows Movie Maker (formerly known as Windows Live Movie Maker in Windows 7) is a discontinued video editing software by Microsoft. It is a part of Windows Essentials software suite and offers the ability to create and edit videos as well as to publish them on OneDrive, Facebook, Vimeo, YouTube, and Flickr."
question = "is windows movie maker part of windows essentials"
template_basic = f"<s>[INST]\nBased on the passage:'{passage}'\nAnswer True/False to the question: '{question}'.[/INST]\nAnswer:"
template_zcot = f"<s>[INST]\nBased on the passage:'{passage}'\nAnswer True/False to the question: '{question}'.Let's think step by step.[/INST]\nAnswer:"

prompt_basic = PromptTemplate.from_template(template_basic)
prompt_zcot = PromptTemplate.from_template(template_zcot)
chain_basic = prompt_basic | llm
chain_zcot = prompt_zcot | llm

predictions1 = chain_basic.invoke({'question': question,'passage':passage})
predictions2 = chain_zcot.invoke({'question': question,'passage':passage})


#Answer : True

print(predictions1)
print(predictions2)


<s>[INST]
Based on the passage:'Windows Movie Maker (formerly known as Windows Live Movie Maker in Windows 7) is a discontinued video editing software by Microsoft. It is a part of Windows Essentials software suite and offers the ability to create and edit videos as well as to publish them on OneDrive, Facebook, Vimeo, YouTube, and Flickr.'
Answer True/False to the question: 'is windows movie maker part of windows essentials'.[/INST]
Answer: True.
<s>[INST]
Based on the passage:'Windows Movie Maker (formerly known as Windows Live Movie Maker in Windows 7) is a discontinued video editing software by Microsoft. It is a part of Windows Essentials software suite and offers the ability to create and edit videos as well as to publish them on OneDrive, Facebook, Vimeo, YouTube, and Flickr.'
Answer True/False to the question: 'is windows movie maker part of windows essentials'.Let's think step by step.[/INST]
Answer: True.

Explanation: According to the given passage, "Windows Movie Maker is a

In [22]:
passage = "A shoot-out is usually considered for statistical purposes to be separate from the match which preceded it. In the case of a two-legged fixture, the two matches are still considered either as two draws or as one win and one loss; in the case of a single match, it is still considered as a draw. This contrasts with a fixture won in extra time, where the score at the end of normal time is superseded. Converted shoot-out penalties are not considered as goals scored by a player for the purposes of their individual records, or for ``golden boot'' competitions."
question = "does a penalty shoot out goal count towards the golden boot"
template_basic = f"<s>[INST]\nBased on the passage:'{passage}'\nAnswer True/False to the question: '{question}'.[/INST]\nAnswer:"
template_zcot = f"<s>[INST]\nBased on the passage:'{passage}'\nAnswer True/False to the question: '{question}'.Let's think step by step.[/INST]\nAnswer:"

prompt_basic = PromptTemplate.from_template(template_basic)
prompt_zcot = PromptTemplate.from_template(template_zcot)
chain_basic = prompt_basic | llm
chain_zcot = prompt_zcot | llm

predictions1 = chain_basic.invoke({'question': question,'passage':passage})
predictions2 = chain_zcot.invoke({'question': question,'passage':passage})



#Answer : "False"

print(predictions1)
print(predictions2)

<s>[INST]
Based on the passage:'A shoot-out is usually considered for statistical purposes to be separate from the match which preceded it. In the case of a two-legged fixture, the two matches are still considered either as two draws or as one win and one loss; in the case of a single match, it is still considered as a draw. This contrasts with a fixture won in extra time, where the score at the end of normal time is superseded. Converted shoot-out penalties are not considered as goals scored by a player for the purposes of their individual records, or for ``golden boot'' competitions.'
Answer True/False to the question: 'does a penalty shoot out goal count towards the golden boot'.[/INST]
Answer: False. According to the passage, converted shoot-out penalties are not considered as goals scored by a player for the purposes of their individual records or for "golden boot" competitions.
<s>[INST]
Based on the passage:'A shoot-out is usually considered for statistical purposes to be separa

## Reasoning task

Prompt formulation

In [23]:
def prompt_output_reasoning(item):
    """Build a zero-shot chain-of-thought prompt for one CommonsenseQA item and generate the output."""
    # map() is called with batched=True and batch_size=1, so every field is a one-element list
    question = item['question'][0]        # the question text
    labels = item['choices'][0]['label']  # option labels, e.g. ['A', 'B', 'C', 'D', 'E']
    texts = item['choices'][0]['text']    # option texts, in the same order as the labels

    # One "LABEL. text" line per option
    options = "\n".join(f"{label}. {text}" for label, text in zip(labels, texts))

    # Plain string with placeholders: PromptTemplate fills them in, so braces in the data cannot break the template
    template = "<s>[INST]\nChoose the answer.\n{question}\n{options}\nLet's think step by step.[/INST]\nAnswer:"

    prompt = PromptTemplate.from_template(template)
    chain = prompt | llm
    predictions = chain.invoke({'question': question, 'options': options})

    results = {'results': [predictions]}

    return results

Processing the Reasoning dataset

In [None]:
# prompt_output_reasoning reads only the first row of each batch, so keep batch_size=1
results = reasoning_dataset.map(prompt_output_reasoning, batched=True, batch_size=1,  num_proc=1)

Extracting the Answer from the generated text

In [35]:
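import re

def analyze_text(text):
    """Map the first 'answer:' line of the generated text to an option index (A=0 ... E=4)."""
    # Convert the text to lowercase
    text = text.lower()
    for line in text.split('\n'):
        if "answer:" in line:
            answer_sentence = line.replace('answer:', '').strip()
            # Match the option letter at the start of the answer, e.g. "b. bookstore" or "a, injury".
            # A substring test such as 'a.' in answer_sentence would also fire on any word ending in "a."
            match = re.match(r'([a-e])\b', answer_sentence)
            if match:
                return 'abcde'.index(match.group(1))
            return 0  # no option letter found; fall back to the first option
    return 0  # no 'answer:' line found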


In [36]:
predictions = []
references = []

for item in results:
    prediction = item['results']  # 'results' key contains the generated text
    value = analyze_text(prediction)
    answer = 'ABCDE'.index(item['answerKey'])  # map the gold label A-E to 0-4
    predictions.append(value)
    references.append(answer)

a. bank.
a. complete jobs.
b. bookstore
a. fast food restaurant.
d. farming areas.
c. great britain.
b. mexico.
a. this option is incorrect as animals do not feel pleasure when an enemy is approaching. instead, they typically display behaviors that help them defend themselves or escape from danger. some common responses include hiding, running away, making loud noises, or attacking the enemy. therefore, none of the given options (feeling pleasure, procreating, passing water, listening to each other, or singing) accurately describe what animals do when an enemy is approaching.
a. literacy
e. making music
a. pants
d. make peace.
a. farm house or b. barnyard
e. walked
c. being entertained.
d. people
d. examine thing
a, injury
e. two eyes
d. office or e. kitchen drawer.


Computation of Accuracy

In [37]:
# predictions and references must be lists of integers (0 to 4)
predictions = tt.tensor(predictions)
references = tt.tensor(references)
mult_acc.update(predictions, references)
acc_score = mult_acc.compute()
# Accuracy for the Reasoning task
print(acc_score.numpy())

0.73333335


Qualitative analysis

In [39]:
question = "The sanctions against the school were a punishing blow, and they seemed to what the efforts the school had made to change?"
text1 = "ignore"
text2 = "enforce"
text3 = "authoritarian"
text4 = "yell at"
text5 = "avoid"

template_basic = f"<s>[INST]\nChoose the answer.\n{question}\nA. {text1}\nB. {text2}\nC. {text3}\nD. {text4}\nE. {text5}\n[/INST]\nAnswer:"
template_zcot = f"<s>[INST]\nChoose the answer.\n{question}\nA. {text1}\nB. {text2}\nC. {text3}\nD. {text4}\nE. {text5}\nLet's think step by step.[/INST]\nAnswer:"


prompt_basic = PromptTemplate.from_template(template_basic)
prompt_zcot = PromptTemplate.from_template(template_zcot)
chain_basic = prompt_basic | llm
chain_zcot = prompt_zcot | llm

predictions1 = chain_basic.invoke({'question': question})
predictions2 = chain_zcot.invoke({'question': question})

print(predictions1)
print(predictions2)

#Answer : "A"

<s>[INST]
Choose the answer.
The sanctions against the school were a punishing blow, and they seemed to what the efforts the school had made to change?
A. ignore
B. enforce
C. authoritarian
D. yell at
E. avoid
[/INST]
Answer: A. ignore

Explanation: The phrase "seemed to what" in this context is an idiom that means "had the effect of negating or undermining." Therefore, the correct answer would be the option that best fits with the idea of negating or undermining the efforts the school has made to change. Option A, "ignore," fits this description as it suggests not acknowledging or taking into account the changes that have been made. Options B, C, D, and E do not fit the meaning of "punishing blow" in this context. Enforcing (option B) implies applying rules or regulations strictly; authoritarian (option C) refers to ruling with absolute power; yelling at (option D) suggests expressing anger verbally; and avoiding (option E) means staying away from something.
<s>[INST]
Choose the answe

In [40]:
question = "Sammy wanted to go to where the people were. Where might he go?"
text1 = "race track"
text2 = "populated areas"
text3 = "the desert"
text4 = "apartment"
text5 = "roadblock"


template_basic = f"<s>[INST]\nChoose the answer.\n{question}\nA. {text1}\nB. {text2}\nC. {text3}\nD. {text4}\nE. {text5}\n[/INST]\nAnswer:"
template_zcot = f"<s>[INST]\nChoose the answer.\n{question}\nA. {text1}\nB. {text2}\nC. {text3}\nD. {text4}\nE. {text5}\nLet's think step by step.[/INST]\nAnswer:"


prompt_basic = PromptTemplate.from_template(template_basic)
prompt_zcot = PromptTemplate.from_template(template_zcot)
chain_basic = prompt_basic | llm
chain_zcot = prompt_zcot | llm

predictions1 = chain_basic.invoke({'question': question})
predictions2 = chain_zcot.invoke({'question': question})

print(predictions1)
print(predictions2)

#Answer : "B"

<s>[INST]
Choose the answer.
Sammy wanted to go to where the people were. Where might he go?
A. race track
B. populated areas
C. the desert
D. apartment
E. roadblock
[/INST]
Answer: B. populated areas
<s>[INST]
Choose the answer.
Sammy wanted to go to where the people were. Where might he go?
A. race track
B. populated areas
C. the desert
D. apartment
E. roadblock
Let's think step by step.[/INST]
Answer: B. populated areas.

Explanation:
Given context does not provide any specific information about the type of place where "the people are." However, we can infer that Sammy wants to go to a place with many people based on the phrase "where the people were." Therefore, option B, which is a general term for places with large populations, is the most likely answer. The other options do not fit the given context as well since they describe more specific types of places (race track, desert, apartment, and roadblock).


## Translation task

Prompt Formulation

In [41]:
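def prompt_output_french(item):
    """Build a zero-shot chain-of-thought translation prompt for one IWSLT item and generate the output."""
    # batched=True with batch_size=1: 'translation' is a one-element list of {'en': ..., 'fr': ...} dicts
    eng_text = item['translation'][0]['en']

    # Plain string with a placeholder: PromptTemplate fills it in, so braces in the text cannot break the template
    template = "<s>[INST]\nTranslate '{eng_text}' to french.Let's translate step by step.[/INST]\nFrench:"
    prompt = PromptTemplate.from_template(template)

    chain = prompt | llm
    predictions = chain.invoke({'eng_text': eng_text})

    results = {'results': [predictions]}

    return results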

Processing the Translation dataset

In [None]:
# prompt_output_french reads only the first row of each batch, so keep batch_size=1
results = french_dataset.map(prompt_output_french, batched=True, batch_size=1,  num_proc=1)

Extracting French text from generated text

In [43]:
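def extract_french_text(passage):
    """Return the text after 'French:' on the first line that contains it, or '' if there is none."""
    for line in passage.split('\n'):
        if 'French:' in line:
            return line.replace('French:', '').strip()
    return ''  # no 'French:' line found; return a string so the predictions list never contains None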

In [44]:
predictions = []
references = []

for item in results:
    generated_text = item['results']    # 'results' key contains the generated text
    answer = item['translation']['fr']  # 'translation' key contains the reference translation
    prediction = extract_french_text(generated_text)

    predictions.append(prediction)
    references.append(answer)

Computation of BLEU and METEOR score

In [46]:
# predictions and references must be lists of strings
bleu_score = bleu_metric.compute(predictions=predictions, references=references)
meteor_score = meteor_metric.compute(predictions=predictions, references=references)
print(bleu_score)
print(meteor_score)

{'bleu': 0.13106001504918835, 'precisions': [0.453781512605042, 0.20614035087719298, 0.10550458715596331, 0.04326923076923077], 'brevity_penalty': 0.9117066696581597, 'length_ratio': 0.9153846153846154, 'translation_length': 238, 'reference_length': 260}
{'meteor': 0.39101895001023496}
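
The BLEU value printed above can be recomputed from the other fields in the same output: it is the brevity penalty multiplied by the geometric mean of the four n-gram precisions. The sketch below copies those numbers from the printed dictionary and redoes the arithmetic in plain Python; the variable names are illustrative.

```python
import math

# Values copied from the printed BLEU output
precisions = [0.453781512605042, 0.20614035087719298,
              0.10550458715596331, 0.04326923076923077]
translation_length = 238
reference_length = 260

# Brevity penalty: 1 when the candidate is longer than the reference,
# otherwise exp(1 - reference_length / translation_length)
if translation_length > reference_length:
    brevity_penalty = 1.0
else:
    brevity_penalty = math.exp(1 - reference_length / translation_length)

# Geometric mean of the 1- to 4-gram precisions
geometric_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))

bleu = brevity_penalty * geometric_mean
print(brevity_penalty, bleu)
```

The low 3-gram and 4-gram precisions dominate the geometric mean, which is why the overall score is far below the unigram precision of about 0.45.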


Qualitative analysis

In [47]:
#   French


eng_text = "Several years ago here at TED, Peter Skillman introduced a design challenge called the marshmallow challenge."

template_basic = f"<s>[INST]\nTranslate '{eng_text}' to french.[/INST]\nFrench:"
template_zcot = f"<s>[INST]\nTranslate '{eng_text}' to french.Let's translate step by step.[/INST]\nFrench:"

prompt_basic = PromptTemplate.from_template(template_basic)
prompt_zcot = PromptTemplate.from_template(template_zcot)

chain_basic = prompt_basic | llm
chain_zcot = prompt_zcot | llm

predictions1 = chain_basic.invoke({'eng_text': eng_text})
predictions2 = chain_zcot.invoke({'eng_text': eng_text})

print(predictions1)
print(predictions2)

# french : "Il y a plusieurs années, ici à Ted, Peter Skillman a présenté une épreuve de conception appelée l'épreuve du marshmallow."


<s>[INST]
Translate 'Several years ago here at TED, Peter Skillman introduced a design challenge called the marshmallow challenge.' to french.[/INST]
French: "Il y a plusieurs années ici à TED, Peter Skillman a présenté un défi de conception appelé le défi du gâteau au marshmallow." (Literally: Several years ago here at TED, Peter Skillman presented a design challenge called the marshmallow challenge.)

However, it would be more idiomatic in French to say something like: "Several years ago at TED, Peter Skillman posa un défi de conception intitulé le défi du gâteau au marshmallow" or "Several years ago at TED, Peter Skillman proposa un défi de conception nommé le défi du gâteau au marshmallow". Both mean: "Several years ago at TED, Peter Skillman proposed a design challenge named the marshmallow challenge."
<s>[INST]
Translate 'Several years ago here at TED, Peter Skillman introduced a design challenge called the marshmallow challenge.' to french.Let's translate step by step.[/INST]
Fr

In [48]:
#   French


eng_text = "The marshmallow has to be on top."

template_basic = f"<s>[INST]\nTranslate '{eng_text}' to french.[/INST]\nFrench:"
template_zcot = f"<s>[INST]\nTranslate '{eng_text}' to french.Let's translate step by step.[/INST]\nFrench:"

prompt_basic = PromptTemplate.from_template(template_basic)
prompt_zcot = PromptTemplate.from_template(template_zcot)

chain_basic = prompt_basic | llm
chain_zcot = prompt_zcot | llm

predictions1 = chain_basic.invoke({'eng_text': eng_text})
predictions2 = chain_zcot.invoke({'eng_text': eng_text})

print(predictions1)
print(predictions2)

# french : "Le marshmallow doit être placé au sommet."


<s>[INST]
Translate 'The marshmallow has to be on top.' to french.[/INST]
French: "Le marshmallow doit être en tête."

This translates to "The marshmallow must be on top" in English, but since the original sentence was in the imperative mood (a command or instruction), I assumed that the same structure should be maintained in the translation. Therefore, I translated "have to be" as "doit être" which is the imperative form of "avoir de devoir étre" and "on top" as "en tête," which means "at the head" or "at the top."
<s>[INST]
Translate 'The marshmallow has to be on top.' to french.Let's translate step by step.[/INST]
French: "Le marshallow doit être au-dessus."

Here is the breakdown of the translation:

1. The first word, "The," does not need to be translated as it is a determiner in English and does not have an equivalent in French. In French, articles are implied based on the context.
2. "Marshmallow" translates to "marshallow" in French.
3. "Has to be" can be translated to "doit êt

## Summarisation task

Prompt Formulation

In [49]:
def prompt_output_summary(item):
    # With batched=True and batch_size=1, each column is a one-element list
    dialogue = item['dialogue'][0]

    # Plain (non-f) string, so PromptTemplate fills {dialogue} at invoke time
    template = "<s>[INST]\nSummarise the Dialogue: {dialogue}.Let's summarise step by step.[/INST]\nSummary:"
    prompt = PromptTemplate.from_template(template)

    chain = prompt | llm

    predictions = chain.invoke({"dialogue": dialogue})
    results = {'results': [predictions]}  # one output per input row

    return results
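
The batched mapping pattern can be sketched without a model. With `batched=True`, `datasets` passes each column to the mapped function as a list with one entry per row, and the function must return lists of the same length. The snippet below is a minimal illustration under that assumption: `build_zcot_prompt`, `prompt_batch`, and `fake_generate` are hypothetical names, and `fake_generate` is only a stand-in for the real LLM call.

```python
from typing import Callable, Dict, List

ZCOT_TEMPLATE = (
    "<s>[INST]\nSummarise the Dialogue: {dialogue}."
    "Let's summarise step by step.[/INST]\nSummary:"
)


def build_zcot_prompt(dialogue: str) -> str:
    # str.replace instead of str.format, so braces inside a dialogue are left untouched
    return ZCOT_TEMPLATE.replace("{dialogue}", dialogue)


def prompt_batch(
    batch: Dict[str, List[str]], generate: Callable[[str], str]
) -> Dict[str, List[str]]:
    # Each column arrives as a list; build and run one prompt per row
    outputs = [generate(build_zcot_prompt(d)) for d in batch["dialogue"]]
    # Return exactly one output per input row so the new column lines up
    return {"results": outputs}


def fake_generate(prompt: str) -> str:
    # Hypothetical stand-in for the LLM: echoes the prompt and appends a fixed summary
    return prompt + " A asks B for a number."


batch = {"dialogue": ["A: Do you have the number? B: No."]}
out = prompt_batch(batch, fake_generate)
print(out["results"][0])
```

Iterating over the batch rather than assuming a single row keeps the function valid for any `batch_size`, although generation still runs one prompt at a time.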

Processing the summarization dataset

In [None]:
# prompt_output_summary returns one result per call, so keep batch_size=1
results = sum_dataset.map(prompt_output_summary, batched=True, batch_size=1, num_proc=1)

Map:   0%|          | 0/819 [00:00<?, ? examples/s]

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.

Extracting summary from the generated text

In [51]:
predictions = []
references = []

for item in results:
    # 'results' holds the generated text; keep only what follows the "Summary:" marker
    prediction = item['results'].split("Summary:")[-1].strip()
    answer = item['summary']  # 'summary' holds the reference summary
    predictions.append(prediction)
    references.append(answer)
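
The extraction step relies on the generated text echoing the prompt, which ends in `Summary:`. A minimal sketch of that logic on a made-up string (`extract_summary` is a hypothetical helper, not part of the notebook):

```python
def extract_summary(generated: str, marker: str = "Summary:") -> str:
    # Keep only the text after the last occurrence of the marker
    return generated.split(marker)[-1].strip()


generated = (
    "<s>[INST]\nSummarise the Dialogue: A: Hi B: Hello.[/INST]\n"
    "Summary: A and B greet each other."
)
print(extract_summary(generated))  # A and B greet each other.
```

Two edge cases are worth keeping in mind: if the model writes `Summary:` again inside its answer, only the text after the last occurrence is kept, and if the marker is missing the whole string is returned unchanged.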

Computation of Rouge Score

In [52]:
# predictions and references must both be lists of strings
rouge_score = rouge_metric.compute(predictions=predictions, references=references)
print(rouge_score)

{'rouge1': 0.2362274160127481, 'rouge2': 0.06849910601886755, 'rougeL': 0.16165629365704418, 'rougeLsum': 0.1621127531707014}
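
To make the numbers easier to read, here is a simplified ROUGE-1 F1 in plain Python. It is a sketch only: it lowercases and splits on whitespace, with no stemming or other normalisation, so it will not reproduce the values from the `evaluate` metric exactly. It illustrates one plausible reason for low scores with step-by-step prompting: extra words in the prediction lower precision even when every reference word is covered.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    # Unigram counts after lowercasing and whitespace tokenisation
    pred = Counter(prediction.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both
    overlap = sum((pred & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "hannah needs betty's number"
concise = "hannah needs betty's number"
verbose = concise + " and then many more words follow here to pad it out"

print(rouge1_f1(concise, reference))  # 1.0
print(round(rouge1_f1(verbose, reference), 3))  # 0.421
```

The verbose prediction keeps full recall (4 of 4 reference words) but its precision drops to 4/15, giving an F1 of 8/19.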


Qualitative analysis

In [54]:

dialogue = "Hannah: Hey, do you have Betty's number? Amanda: Lemme check Hannah: <file_gif> Amanda: Sorry, can't find it. Amanda: Ask Larry Amanda: He called her last time we were at the park together Hannah: I don't know him well Hannah: <file_gif> Amanda: Don't be shy, he's very nice Hannah: If you say so.. Hannah: I'd rather you texted him Amanda: Just text him 🙂 Hannah: Urgh.. Alright Hannah: Bye Amanda: Bye bye"

# Plain (non-f) strings, so PromptTemplate fills {dialogue} at invoke time
template_basic = "<s>[INST]\nSummarise the Dialogue: {dialogue}.[/INST]\nSummary:"
template_zcot = "<s>[INST]\nSummarise the Dialogue: {dialogue}.Let's summarise step by step.[/INST]\nSummary:"

prompt_basic = PromptTemplate.from_template(template_basic)
prompt_zcot = PromptTemplate.from_template(template_zcot)

chain_basic = prompt_basic | llm
chain_zcot = prompt_zcot | llm

predictions1 = chain_basic.invoke({'dialogue': dialogue})
predictions2 = chain_zcot.invoke({'dialogue': dialogue})

# Reference summary: "Hannah needs Betty's number but Amanda doesn't have it. She needs to contact Larry."

print(predictions1)
print(predictions2)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<s>[INST]
Summarise the Dialogue: Hannah: Hey, do you have Betty's number? Amanda: Lemme check Hannah: <file_gif> Amanda: Sorry, can't find it. Amanda: Ask Larry Amanda: He called her last time we were at the park together Hannah: I don't know him well Hannah: <file_gif> Amanda: Don't be shy, he's very nice Hannah: If you say so.. Hannah: I'd rather you texted him Amanda: Just text him 🙂 Hannah: Urgh.. Alright Hannah: Bye Amanda: Bye bye.[/INST]
Summary: Hannah asked Amanda if she had Betty's phone number. Amanda checked but couldn't find it. Amanda suggested asking Larry instead, who had contacted Betty during their last visit to the park. Hannah expressed hesitancy as she didn't know Larry well and preferred a text message over a call. Eventually, she agreed to let Amanda text Larry for Betty's contact information. They ended the conversation with goodbyes.
<s>[INST]
Summarise the Dialogue: Hannah: Hey, do you have Betty's number? Amanda: Lemme check Hannah: <file_gif> Amanda: Sorry,

In [55]:

dialogue = "Eric: MACHINE! Rob: That's so gr8! Eric: I know! And shows how Americans see Russian ;) Rob: And it's really funny! Eric: I know! I especially like the train part! Rob: Hahaha! No one talks to the machine like that! Eric: Is this his only stand-up? Rob: Idk. I'll check. Eric: Sure. Rob: Turns out no! There are some of his stand-ups on youtube. Eric: Gr8! I'll watch them now! Rob: Me too! Eric: MACHINE! Rob: MACHINE! Eric: TTYL? Rob: Sure :)"


# Plain (non-f) strings, so PromptTemplate fills {dialogue} at invoke time
template_basic = "<s>[INST]\nSummarise the Dialogue: {dialogue}.[/INST]\nSummary:"
template_zcot = "<s>[INST]\nSummarise the Dialogue: {dialogue}.Let's summarise step by step.[/INST]\nSummary:"

prompt_basic = PromptTemplate.from_template(template_basic)
prompt_zcot = PromptTemplate.from_template(template_zcot)

chain_basic = prompt_basic | llm
chain_zcot = prompt_zcot | llm

predictions1 = chain_basic.invoke({'dialogue': dialogue})
predictions2 = chain_zcot.invoke({'dialogue': dialogue})

# Reference summary: "Eric and Rob are going to watch a stand-up on youtube."

print(predictions1)
print(predictions2)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<s>[INST]
Summarise the Dialogue: Eric: MACHINE! Rob: That's so gr8! Eric: I know! And shows how Americans see Russian ;) Rob: And it's really funny! Eric: I know! I especially like the train part! Rob: Hahaha! No one talks to the machine like that! Eric: Is this his only stand-up? Rob: Idk. I'll check. Eric: Sure. Rob: Turns out no! There are some of his stand-ups on youtube. Eric: Gr8! I'll watch them now! Rob: Me too! Eric: MACHINE! Rob: MACHINE! Eric: TTYL? Rob: Sure :).[/INST]
Summary: Eric and Rob discuss a comedian named "Machine" whose performance they found entertaining, particularly enjoying the train part. They joke about Americans' perception of Russians based on Machine's act. When they discover there are more of Machine's stand-up performances available on YouTube, they both express excitement and plan to watch them. Before ending their conversation, they bid each other goodbye with "MACHINE!" as a shared reference to the comedian.
<s>[INST]
Summarise the Dialogue: Eric: 