# Lab 3: LLMs, Prompting and RAG

## June 13, 2024

Welcome to Lab 3 of our course on Natural Language Processing. Today, we will be diving deep into the fourth and most recent paradigm in NLP teased in the previous Lab, i.e. Pre-train, Prompt and Predict. The core idea behind the paradigm is that once we train a big enough language model (pre-training + instruction tuning), we do not really need to train the model further to solve any specific task. Instead, we can directly prompt the model to solve a task by specifying instructions, task descriptions and in some cases a few examples.

Like last time, we will be working with the [SocialIQA](https://arxiv.org/abs/1904.09728) dataset and demonstrating how to use LLMs to solve such tasks.

Along with the SocialIQA dataset, we will also delve into the fascinating world of Retrieval Augmented Generation (RAG). RAG is a powerful technique that combines the strengths of pre-trained language models and information retrieval systems to generate contextually relevant responses.

For the RAG system, why not build something that might be useful for Plaksha students? We will build a Question Answering system that can answer questions about the Plaksha professors. For this task, the dataset has already been generated by scraping the Plaksha full-time faculty page to get information about various professors, their areas of expertise, research interests, and more. Our goal is to convert this data into embeddings.

Once we have these embeddings, we can use them to retrieve contextually relevant information based on a given query. For instance, if a student wants to know which professor specializes in Natural Language Processing, our RAG system should be able to retrieve the relevant professor details.
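
To make the retrieval step concrete, here is a minimal sketch of similarity-based retrieval. It assumes cosine similarity as the scoring function and uses tiny hand-made vectors in place of real embeddings; the names `cosine_similarity`, `retrieve_top_k` and the toy data are illustrative and not part of the lab code.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve_top_k(query_vec, doc_vecs, k=1):
    """Return the indices of the k documents most similar to the query."""
    scores = [cosine_similarity(query_vec, d) for d in doc_vecs]
    ranked = sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

# Toy 3-dimensional "embeddings" (made up; real ones come from an embedding model).
toy_docs = ["profile about NLP", "profile about robotics", "profile about biology"]
toy_doc_vecs = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.1], [0.0, 0.2, 0.9]]
toy_query_vec = [0.8, 0.2, 0.1]

top = retrieve_top_k(toy_query_vec, toy_doc_vecs, k=2)
print([toy_docs[i] for i in top])
```

The retrieved texts are what would later be passed to the language model as context.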

The final part of our system is a Language Model (LM). Once we have the relevant context from our retrieval system, we pass it to a pre-trained LM. The LM then generates a coherent and contextually appropriate response.

By the end of this lab, you will have a hands-on understanding of how to build a RAG system. You will learn how to convert text data into embeddings, how to retrieve relevant context based on a query, and how to generate responses using a pre-trained LM.

This Lab doesn't require any GPU, since we will be heavily using APIs from various third party sources.  

For the embeddings we will be utilizing [Voyage AI's latest Embedding model](https://docs.voyageai.com/docs/embeddings) via their API. The API provides free access to embeddings for up to 50 million tokens, which is plenty for our assignments and even for your final projects if needed.  
*Note: The Voyage API has very low rate limits when you don't add payment details: 3 RPM (requests per minute).*

For Language Models, we have two options. The first is [groq](https://console.groq.com/docs/models), which offers models like LLaMa 3 8b/70b, Mixtral 8x7b and Gemma 7b that are free to use and have very high throughput.  
The other option is [Open Router](https://openrouter.ai/docs/models), where plenty of free model options are available.  
Both service providers are compatible with the OpenAI Python package, and hence can be used interchangeably by just changing `base_url`, `api_key` and `model_name`.
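
As a rough sketch of what "just changing `base_url`, `api_key` and `model_name`" can look like, the settings can be kept in one dictionary. The base URLs and the Groq model name are the ones used later in this notebook; the `PROVIDERS` dictionary, the `client_kwargs` helper and the OpenRouter model id placeholder are assumptions made for illustration.

```python
# Per-provider settings. The OpenRouter model id is a placeholder, not a real model name.
PROVIDERS = {
    "groq": {
        "base_url": "https://api.groq.com/openai/v1",
        "model_name": "llama3-70b-8192",
    },
    "openrouter": {
        "base_url": "https://openrouter.ai/api/v1",
        "model_name": "<openrouter-model-id>",
    },
}

def client_kwargs(provider, api_key):
    """Keyword arguments to pass to openai.OpenAI(...) for the chosen provider."""
    return {"base_url": PROVIDERS[provider]["base_url"], "api_key": api_key}

kwargs = client_kwargs("groq", api_key="YOUR_KEY")
model_name = PROVIDERS["groq"]["model_name"]
# client = openai.OpenAI(**kwargs)  # uncomment once a real key is set
print(kwargs["base_url"], model_name)
```

Switching providers then means changing a single string rather than editing several cells.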

Learning Outcomes of the Lab:
- **Mastering Prompting Techniques:** Learn how to effectively prompt large language models to solve specific tasks and to work with the SocialIQA dataset, demonstrating the practical use of LLMs in solving real-world NLP tasks.
- **Embedding Creation and Utilization:** Gain hands-on experience in converting text data into embeddings using an embedding model and utilizing these embeddings for information retrieval.
- **Understanding Retrieval-Augmented Generation (RAG):** Learn how to combine pre-trained language models with information retrieval systems to generate contextually relevant responses.

Recommended Reading:
- Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, Graham Neubig. <i>Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing</i>. https://arxiv.org/abs/2107.13586

- Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Meng Wang, Haofen Wang. <i>Retrieval-Augmented Generation for Large Language Models: A Survey</i>. https://arxiv.org/abs/2312.10997




Let's get started!


In [None]:
# !pip install -U voyageai
# !pip install openai

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')
siqa_data_dir = "gdrive/MyDrive/PlakshaNLP2024/Lab3/data/socialiqa-train-dev/"
plaksha_data_dir = "gdrive/MyDrive/PlakshaNLP2024/Lab3/data/"

In [None]:
# We start by importing libraries that we will be making use of in the assignment.
import json
import os
import random
import re
import time

from collections import Counter
from functools import partial
from pprint import pprint

import numpy as np
import pandas as pd
import tqdm

import openai
import voyageai


# Preparation

In [None]:
OPENROUTER_API_KEY = "" # Your OpenRouter API Key: https://openrouter.ai/keys
client = openai.OpenAI(
  base_url="https://openrouter.ai/api/v1",
  api_key=OPENROUTER_API_KEY
)

VOYAGE_API_KEY = "" # Your Voyage API Key: https://dash.voyageai.com/api-keys
vo = voyageai.Client(VOYAGE_API_KEY)

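# NOTE: the assignment below reuses the name `client`, so the Groq client replaces
# the OpenRouter client created above. Keep only the provider you intend to use.
GROQ_API_KEY = "" # Your GROQ API Key: https://console.groq.com/keys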
client = openai.OpenAI(
  base_url="https://api.groq.com/openai/v1",
  api_key=GROQ_API_KEY
)

In [None]:
# Loading the SocialIQA dataset

def load_siqa_data(split):

    # We first load the file containing context, question and answers
    with open(f"data/socialiqa-train-dev/{split}.jsonl") as f:
        data = [json.loads(jline) for jline in f.read().splitlines()]

    # We then load the file containing the correct answer for each question
    with open(f"data/socialiqa-train-dev/{split}-labels.lst") as f:
        labels = f.read().splitlines()

    return data, labels


train_data, train_labels = load_siqa_data("train")
dev_data, dev_labels = load_siqa_data("dev")

print(f"Number of Training Examples: {len(train_data)}")
print(f"Number of Validation Examples: {len(dev_data)}")

In [None]:
train_data[0]

In [None]:
train_labels[0]

## Task 1: Prompting Basics (30 minutes)

In this task, you will learn how to convert standard NLP problems into text prompts that can then be fed to an LLM for prediction. There are two main concepts to understand when creating prompts:
- Prompt Template or Function: a textual string that has two slots: an input slot [X] for the input x and an answer slot [Z] for an intermediate generated answer text z that will later be mapped into y.
- Answer verbalizer: A mapping from the task labels to words or phrases, converting the more artificial-looking labels into natural language that fits the prompt. E.g., for sentiment analysis we can define Z = {“excellent”, “good”, “OK”, “bad”, “horrible”} to represent each of the classes in Y = {++, +, ~, -, --}.

<img src="data/prompting_basics.png" alt="prompting" border="0">

We can also include more interesting stuff like instruction of the task in the template and explanation of the answer in the verbalizer to make more powerful prompts, as we will see a bit later.
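
To illustrate the two concepts on the sentiment example, here is a small sketch. The label set and answer words come from the text above; the movie-review template and the names `SENTIMENT_TEMPLATE`, `SENTIMENT_VERBALIZER` and `fill_sentiment_prompt` are made up for illustration.

```python
# Template with an input slot [X] and an answer slot [Z].
SENTIMENT_TEMPLATE = "Review: [X]\nOverall, the movie was [Z]."

# Verbalizer: artificial-looking labels -> natural-language answer words.
SENTIMENT_VERBALIZER = {"++": "excellent", "+": "good", "~": "OK", "-": "bad", "--": "horrible"}

def fill_sentiment_prompt(x, label=None):
    """Fill the input slot; if a label is given, also fill the answer slot."""
    prompt = SENTIMENT_TEMPLATE.replace("[X]", x)
    if label is not None:
        prompt = prompt.replace("[Z]", SENTIMENT_VERBALIZER[label])
    return prompt

print(fill_sentiment_prompt("I love this movie."))        # answer slot left open
print(fill_sentiment_prompt("I love this movie.", "++"))  # answer slot filled
```

With the answer slot left open, the model is expected to produce the answer word, which the verbalizer then maps back to a label.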

## Task 1.1 Defining prompt function and verbalizer for SocialIQA.

For the purpose of this exercise, we ask you to implement this prompt function:
```
Context: {{context}}
    Question: {{question}}
    Which one of these answers best answers the question according to the context?
    AnswerA: {{answerA}}
    AnswerB: {{answerB}}
    AnswerC: {{answerC}}
```

and verbalizer:

```
{"1": "The answer is A", "2": "The answer is B", "3": "The answer is C"}
```

This prompt was obtained from [PromptSource](https://huggingface.co/spaces/bigscience/promptsource), an awesome resource for finding prompts for hundreds of NLP tasks!

In [153]:
def social_iqa_prompting_fn(siqa_example: dict[str, str]):
    """
    Takes an example from the SocialIQA dataset, fills in the prompt template, and returns the prompt.
    
    Inputs:
        siqa_example: A dictionary containing the context, question and answerA, answerB, answerC for a SocialIQA example.

    Outputs:
        prompt: A string containing the filled-in prompt.
    """
    prompt = None

    # YOUR CODE HERE

    raise NotImplementedError()

    return prompt
    

In [None]:
# Sample Test Case 1
print("Running Sample Test Case 1")
siqa_example = train_data[0]
prompt = social_iqa_prompting_fn(siqa_example)
expected_prompt = """Context: Cameron decided to have a barbecue and gathered her friends together.
    Question: How would Others feel as a result?
    Which one of these answers best answers the question according to the context?
    AnswerA: like attending
    AnswerB: like staying home
    AnswerC: a good friend to have"""
print(f"Input Example:\n{siqa_example}")
print(f"Prompt:\n{prompt}")
print(f"Expected Prompt:\n{expected_prompt}")
assert prompt == expected_prompt

# Sample Test Case 2
print("Running Sample Test Case 2")
siqa_example = train_data[100]
prompt = social_iqa_prompting_fn(siqa_example)
expected_prompt = """Context: Jordan's dog peed on the couch they were selling and Jordan removed the odor as soon as possible.
    Question: How would Jordan feel afterwards?
    Which one of these answers best answers the question according to the context?
    AnswerA: selling a couch
    AnswerB: Disgusted
    AnswerC: Relieved"""
print(f"Input Example:\n{siqa_example}")
print(f"Prompt:\n{prompt}")
print(f"Expected Prompt:\n{expected_prompt}")
assert prompt == expected_prompt


In [154]:
def social_iqa_verbalizer(label: str):
    """
    Takes in the label and converts it into a natural language phrase as specified above

    Inputs:
        label: A string containing the correct answer for a SocialIQA example.
    
    Outputs:
        A string containing the natural language phrase corresponding to the label.
    """

    verbalized_label = None

    # YOUR CODE HERE
    raise NotImplementedError()

    return verbalized_label

In [None]:
# Sample Test Case 1
print("Running Sample Test Case 1")
siqa_example = train_labels[0]
output = social_iqa_verbalizer(siqa_example)
expected_output = """The answer is A"""
print(f"Input Example:\n{siqa_example}")
print(f"output:\n{output}")
print(f"Expected output:\n{expected_output}")
assert output == expected_output

# Sample Test Case 2
print("\nRunning Sample Test Case 2")
siqa_example = train_labels[100]
output = social_iqa_verbalizer(siqa_example)
expected_output = """The answer is B"""
print(f"Input Example:\n{siqa_example}")
print(f"output:\n{output}")
print(f"Expected output:\n{expected_output}")
assert output == expected_output


Let's now obtain the prompts and verbalized labels for each of the examples in the dataset.

In [None]:
train_prompts = None
train_verbalized_labels = None
val_prompts = None
val_verbalized_labels = None

# YOUR CODE HERE
train_prompts = [social_iqa_prompting_fn(example) for example in train_data]
train_verbalized_labels = [social_iqa_verbalizer(label) for label in train_labels]
val_prompts = [social_iqa_prompting_fn(example) for example in dev_data]
val_verbalized_labels = [social_iqa_verbalizer(label) for label in dev_labels]


In [None]:
# Sample Test Case 1
print("Running Sample Test Case 1")
idx = 10
siqa_example = train_data[idx]
prompt = train_prompts[idx]
expected_prompt = """Context: Sydney was a school teacher and made sure their students learned well.
    Question: How would you describe Sydney?
    Which one of these answers best answers the question according to the context?
    AnswerA: As someone that asked for a job
    AnswerB: As someone that takes teaching seriously
    AnswerC: Like a leader"""
print(f"Input Example:\n{siqa_example}")
print(f"Prompt:\n{prompt}")
print(f"Expected Prompt:\n{expected_prompt}")
assert prompt == expected_prompt

# Sample Test Case 2
print("\nRunning Sample Test Case 2")
idx = 10
siqa_label = train_labels[idx]
output = social_iqa_verbalizer(siqa_label)
verbalized_label = "The answer is B"
print(f"Input Example:\n{siqa_label}")
print(f"Verbalized Label:\n{output}")
print(f"Expected Verbalized Label:\n{verbalized_label}")
assert output == verbalized_label

It is often useful to have a reverse verbalizer as well that converts the verbalized labels back to the structured and consistent labels in the dataset. For example, "The answer is A" is mapped back to "1" and so on. 
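
As a hint of how regex-based parsing of free-form model output can work, here is a sketch on a different, made-up task (binary sentiment), so that it does not replace the exercise below. The phrase pattern, the label mapping and the name `parse_sentiment` are assumptions for illustration only.

```python
import re

# Hypothetical mapping from answer words back to dataset labels.
LABEL_OF_WORD = {"positive": "1", "negative": "0"}

# Search anywhere in the text, ignoring case, so extra text before or after is tolerated.
SENTIMENT_PATTERN = re.compile(r"sentiment is (positive|negative)", flags=re.IGNORECASE)

def parse_sentiment(text):
    """Return the label found in the text, or an empty string if there is none."""
    match = SENTIMENT_PATTERN.search(text)
    if match is None:
        return ""
    return LABEL_OF_WORD[match.group(1).lower()]

print(parse_sentiment("The sentiment is Positive."))
print(parse_sentiment("Well, I think the sentiment is negative overall"))
print(repr(parse_sentiment("cannot tell")))
```

Using `search` rather than `match` is what lets the pattern be found in the middle of a longer response.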

In [156]:
def social_iqa_reverse_verbalizer(verbalized_label: str):
    """
    Reverses the verbalized label into the label
    Inputs:
        verbalized_label: A string containing the natural language phrase corresponding to the label.
    Outputs:
        label: A string containing the correct answer for a SocialIQA example.
    
    Important Note: We will be using this function to map the LLM's output to a structured label. The LLM's output may be in a format other than the one we expect.
    For example, it can be "The answer is A" or "The answer is A." or "<some text> answer is A" or "The answer is A <some text>".
    When you reverse the verbalized label, make sure you handle these cases.
    
    Important Note 2: If the resulting text doesn't have the answer, then just return an empty string.

    HINT: use regex pattern matching.
    """

    label = None
    # YOUR CODE HERE
    raise NotImplementedError()

    return label

In [None]:
# Sample Test Case 1
print("Running Sample Test Case 1")
example_verbalized_label = "The answer is C"
output = social_iqa_reverse_verbalizer(example_verbalized_label)
expected_output = "3"
print(f"Input Example:\n{example_verbalized_label}")
print(f"output:\n{output}")
print(f"Expected output:\n{expected_output}")
assert output == expected_output

# Sample Test Case 2
print("\nRunning Sample Test Case 2")
example_verbalized_label = "The answer is B"
output = social_iqa_reverse_verbalizer(example_verbalized_label)
expected_output = "2"
print(f"Input Example:\n{example_verbalized_label}")
print(f"output:\n{output}")
print(f"Expected output:\n{expected_output}")
assert output == expected_output

# Sample Test Case 3
print("\nRunning Sample Test Case 3")
example_verbalized_label = "some explanation before the actual answer, The answer is A"
output = social_iqa_reverse_verbalizer(example_verbalized_label)
expected_output = "1"
print(f"Input Example:\n{example_verbalized_label}")
print(f"output:\n{output}")
print(f"Expected output:\n{expected_output}")
assert output == expected_output

# Sample Test Case 4
print("\nRunning Sample Test Case 4")
example_verbalized_label = "some text here the answer is C, some more text"
output = social_iqa_reverse_verbalizer(example_verbalized_label)
expected_output = "3"
print(f"Input Example:\n{example_verbalized_label}")
print(f"output:\n{output}")
print(f"Expected output:\n{expected_output}")
assert output == expected_output

# Sample Test Case 5
print("\nRunning Sample Test Case 5")
example_verbalized_label = "none of the options is the correct answer"
output = social_iqa_reverse_verbalizer(example_verbalized_label)
expected_output = ""
print(f"Input Example:\n{example_verbalized_label}")
print(f"output:\n{output}")
print(f"Expected output:\n{expected_output}")
assert output == expected_output

## Task 1.2: Choose Few-Shot examples

Often we can get better performance on a task by providing a few examples of the task as part of the prompt. This is also known as in-context learning, where the model learns to solve a task based on the examples provided in the context (and no updates to the model's weights!). One of the easiest ways that works reasonably well in practice is to simply choose `k` examples randomly for each class from the entire training dataset, such that we have n_classes * k few-shot examples, where n_classes = 3 for the SocialIQA dataset. Implement the `choose_few_shot` function below that does that.
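
The idea of sampling `k` examples per class can be sketched on toy data as follows. This is one possible approach under simple assumptions (group by label, then sample without replacement with a seeded generator); the function name `sample_k_per_class` and the toy items are made up for illustration.

```python
import random
from collections import defaultdict

def sample_k_per_class(items, labels, k=1, seed=42):
    """Pick k items per distinct label, reproducibly for a given seed."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for item, label in zip(items, labels):
        by_label[label].append(item)
    chosen = []
    for label in sorted(by_label):  # fixed order keeps the result reproducible
        for item in rng.sample(by_label[label], k):
            chosen.append({"prompt": item, "label": label})
    return chosen

toy_items = [f"example {i}" for i in range(9)]
toy_labels = ["pos", "neg", "neutral"] * 3
picked = sample_k_per_class(toy_items, toy_labels, k=2)
print(len(picked))
```

A local `random.Random(seed)` instance keeps the sampling independent of any other code that uses the global random state.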

In [157]:
def choose_few_shot(train_prompts, train_verbalized_labels, k = 1, seed = 42):
    """
    Randomly chooses k examples per class from the training set for few-shot in-context learning.
    Inputs:
        train_prompts: A list of prompts for the training set.
        train_verbalized_labels: A list of verbalized labels for the training set.
        k: The number of examples per class to choose.
        seed: The random seed to use, to ensure reproducible outputs

    Outputs:
        - List[Dict[str, str]]: A list of 3k examples from the training set, where each example is represented as a dictionary with "prompt" and "label" as keys and corresponding values.

    Example Output: [
        {
            "prompt": <Example Prompt 1>,
            "label": <Example Label_1>
        },
        ...,
        {
            "prompt": <Example Prompt 3k>,
            "label": <Example Label_3k>
        }
    ]
    """

    random.seed(seed)
    np.random.seed(seed)

    fs_examples = []

    # YOUR CODE HERE
    raise NotImplementedError()

    # Shuffle the examples to ensure there is no bias in the order of the examples
    random.shuffle(fs_examples)

    return fs_examples

In [None]:
# Sample Test Case 1
print("Running Sample Test Case 1. Checking if the output length is correct")
k = 1
seed = 42
output = choose_few_shot(train_prompts, train_verbalized_labels, k, seed)
output_len = len(output)
expected_output_len = k * len(set(train_labels))
print(f"k: {k}")
print(f"Output Length:\n{output_len}")
print(f"Expected Output Length:\n{expected_output_len}")
assert output_len == expected_output_len

# Sample Test Case 2
print("\nRunning Sample Test Case 2. Checking if all labels are predicted")
output_labels = sorted(list(set([example["label"] for example in output])))
expected_output_labels = ["The answer is A", "The answer is B", "The answer is C"]
print(f"Output Labels:\n{output_labels}")
print(f"Expected Output Labels:\n{expected_output_labels}")
assert output_labels == expected_output_labels

# Sample Test Case 3
print("\nRunning Sample Test Case 3. Checking if count of labels are correct")
k = 3
output = choose_few_shot(train_prompts, train_verbalized_labels, k, seed)
output_label_counter = Counter(list(([example["label"] for example in output])))
expected_output_counter = {"The answer is A": k, "The answer is B": k, "The answer is C": k}
print(f"For k = {k}")
print(f"Output Label Counter:\n{output_label_counter}")
print(f"Expected Output Label Counter:\n{expected_output_counter}")
assert output_label_counter == expected_output_counter

In [None]:
# Choose 3 few-shot examples from training data
few_shot_examples = choose_few_shot(train_prompts, train_verbalized_labels, k = 1, seed = 42)

### Few-shot examples with explanations

So far, we have constructed the label verbalizer to provide the answer directly. Often it can be useful to prompt the model to first generate an explanation before the answer. For example:
```
"prompt": "Context: Tracy didn't go home that evening and resisted Riley's attacks.
            Question: What does Tracy need to do before this?
            Options: 
            (A) make a new plan 
            (B) Go home and see Riley 
            (C) Find somewhere to go"
"label": "Tracy found somewhere to go and didn't come home because she wanted to resist Riley's attacks. Hence, the answer is C"
```
One way to prompt the model to generate such explanations is to provide the explanations for the few-shot examples, which will ground the model to first generate an explanation and then the answer. This helps both improve the performance of the model as well as have more interpretable outputs from LLM.

Below we provide a few examples with explanations for SocialIQA task obtained from [Super-NaturalInstructions](https://aclanthology.org/2022.emnlp-main.340/), an amazing resource for prompts, instructions and explanations for around 1600 NLP tasks.


In [None]:
fs_examples_w_explanations = [
    {
        "prompt": "Context: Tracy didn't go home that evening and resisted Riley's attacks.\nQuestion: What does Tracy need to do before this?\nWhich one of these answers best answers the question according to the context?\nAnswerA: make a new plan\nAnswerB: Go home and see Riley\nAnswerC: Find somewhere to go",
        "label": "Tracy found somewhere to go and didn't come home because she wanted to resist Riley's attacks. Hence, the correct answer is C."
    },
    {
        "prompt": "Context: Sydney walked past a homeless woman asking for change but did not have any money they could give to her. Sydney felt bad afterwards.\nQuestion: How would you describe Sydney?\nWhich one of these answers best answers the question according to the context?\nAnswerA: sympathetic\nAnswerB: like a person who was unable to help\nAnswerC: incredulous",
        "label": "Sydney is a sympathetic person because she felt bad for someone who needed help, and she couldn't help her. Hence, the correct answer is A."
    },
    {
        "prompt": "Context: Taylor gave help to a friend who was having trouble keeping up with their bills.\nQuestion: What will their friend want to do next?\nWhich one of these answers best answers the question according to the context?\nAnswerA: help the friend find a higher paying job\nAnswerB: thank Taylor for the generosity\nAnswerC: pay some of their late employees",
        "label": "The friend should thank Taylor for the generosity she showed by helping him pay bills. Hence, the correct answer is B."
    }
]

In [None]:
fs_examples_w_explanations

## Task 2: Evaluating ChatGPT (GPT-3.5-Turbo) on SocialIQA (45 minutes)

Today we will be working with OpenAI's GPT family of models. ChatGPT (or GPT-3.5) was built on top of GPT-3, which is a pre-trained Large Language Model (LLM) with 175 Billion parameters, trained on a huge amount of unlabelled data using the language modelling objective (i.e. given k tokens, generate (k+1)th token). While this forms the basis of all GPT family of models, GPT-3.5 and later models are based on [InstructGPT](https://arxiv.org/abs/2203.02155), which further adds an Instruction Tuning step that learns from human feedback to follow provided instructions.

![Instruction Tuning](data/instructgpt.png)
*From the [Ouyang et al. 2022](https://arxiv.org/abs/2203.02155)*

As a consequence of the pre-training with language modeling objective and instruction tuning, we can use GPT-3.5 to complete a given piece of text and provide specific instructions about how to go about completing the text. We achieve this by defining a text prompt which is to be given as the input to the LLM which then generates a completion of the provided text.

In [None]:
response = client.chat.completions.create(
  model="llama3-70b-8192",
  messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"},
        {"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."},
        {"role": "user", "content": "Where was it played?"}
    ],
  max_tokens=20,
  temperature=0.0,
)

Let's try to wrap our head around different parameters to this function call.

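First we have `model`, where we specify which model to use. We have used `"llama3-70b-8192"` here, i.e. LLaMa 3 70B served by Groq through its OpenAI-compatible API; with OpenAI's own endpoint you would pass a name such as `"gpt-3.5-turbo"` instead. You can find the list of Groq models [here](https://console.groq.com/docs/models) and the list of OpenAI models [here](https://platform.openai.com/docs/models).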

Next, we have `messages`, which contains the conversation between the user and assistant that is to be completed. Notice that the first message is what we call a "system prompt", which is used to set the behavior of the assistant. 

`max_tokens` is used to specify the maximum number of response tokens that the model should generate. This can be useful when you know how long the response is typically going to be, and can help reduce cost.

`temperature` helps in controlling the variability in the output. Lower values for temperature result in more consistent outputs, while higher values generate more diverse and creative results. Setting temperature to 0 will make the outputs mostly deterministic, but a small amount of variability will remain.
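
One common way to picture the effect of `temperature` is as a rescaling of the model's scores before they are turned into probabilities. The sketch below shows that simplified picture with made-up scores; it is not a description of how any particular provider implements sampling, and a temperature of exactly 0 is handled as a special case by APIs rather than by this formula (which would divide by zero).

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn scores into probabilities after dividing them by the temperature."""
    scaled = [score / temperature for score in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate the probability on the top-scoring token, while higher temperatures spread it more evenly.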

In [None]:
response

Now let's look at the response. The assistant's reply can be extracted with `response.choices[0].message.content`. Every response will include a `finish_reason`. The possible values for `finish_reason` are:

- stop: API returned complete message, or a message terminated by one of the stop sequences provided via the stop parameter
- length: Incomplete model output due to max_tokens parameter or token limit
- function_call: The model decided to call a function
- content_filter: Omitted content due to a flag from the provider's content filters
- null: API response still in progress or incomplete

Depending on the input parameters, the model response may include different information.
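
A small sketch of how `finish_reason` might be checked alongside the reply is shown below. To keep it runnable without an API key, it uses a hand-built stand-in object with the same attribute layout as the response (`choices[0].finish_reason` and `choices[0].message.content`); the helper name `extract_reply` and the fake response are assumptions for illustration.

```python
from types import SimpleNamespace

def extract_reply(response):
    """Return the reply text and whether it was cut off by the token limit."""
    choice = response.choices[0]
    truncated = choice.finish_reason == "length"
    return choice.message.content, truncated

# Stand-in for an API response; a real one comes from client.chat.completions.create(...).
fake_response = SimpleNamespace(
    choices=[
        SimpleNamespace(
            finish_reason="length",
            message=SimpleNamespace(content="The 2020 World Series was played at"),
        )
    ]
)

text, truncated = extract_reply(fake_response)
print(text)
print(truncated)
```

When `truncated` is true, raising `max_tokens` is usually the first thing to try.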

In [None]:
model_output = response.choices[0].message.content
print(model_output)

## Task 2.1: Using ChatGPT to solve SocialIQA problems

Now that we have an understanding of how to work with the OpenAI API, we can go ahead and call the API with the prompts that we just created and check how well the model performs the task. We prompt the model with the test example for which we want the prediction and provide few-shot examples as part of the context. This can be done by simply providing the example prompts and labels as a user-assistant conversation history and the test example as the most recent query of the user. Implement the function `get_social_iqa_pred_gpt` that receives a test prompt to be answered, few-shot examples, and some API-specific hyperparameters to predict the answer.
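
The conversation-history idea can be sketched on toy data as follows. The helper name `build_few_shot_messages` and the toy examples are made up for illustration; only the `role`/`content` message layout is taken from the chat API call shown earlier.

```python
def build_few_shot_messages(test_prompt, few_shot_examples):
    """Lay out few-shot examples as user/assistant turns, then add the test query."""
    messages = []
    for example in few_shot_examples:
        messages.append({"role": "user", "content": example["prompt"]})
        messages.append({"role": "assistant", "content": example["label"]})
    messages.append({"role": "user", "content": test_prompt})
    return messages

toy_examples = [
    {"prompt": "Review: Great film.", "label": "The sentiment is positive"},
    {"prompt": "Review: Dull and slow.", "label": "The sentiment is negative"},
]
toy_messages = build_few_shot_messages("Review: I enjoyed it.", toy_examples)
for message in toy_messages:
    print(message["role"], "->", message["content"])
```

Because the last message is from the user, the model's next turn is its answer to the test example.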

In [158]:
def get_social_iqa_pred_gpt(
        test_prompt,
        few_shot_examples,
        model_name = "llama3-70b-8192",
        max_tokens = 20,
        temperature = 0.0,
):
    
    """
    Calls the OpenAI API with test_prompt and few-shot examples to generate the answer.
    Inputs:
        test_prompt: The prompt for the test example
        few_shot_examples: A list of few-shot examples
        model_name: The name of the model to use
        max_tokens: The maximum number of tokens to generate
        temperature: The temperature to use for the model

    Outputs:
        model_output: The model's output

    Hint: Your messages to be sent should be in the following format:
        [
            {"role": "user", "content": <fs-example-1-prompt>},
            {"role": "assistant", "content": <fs-example-1-label>},
            ...,
            {"role": "user", "content": <fs-example-3k-prompt>},
            {"role": "assistant", "content": <fs-example-3k-label>},
            {"role": "user", "content": <test-prompt>},
        ]
    """

    messages_prompt = [{
        "role": "user", "content": "You are an expert of Human Social Common Sense. You need to solve the SocialIQA task. In this task, you're given a context, a question, and three options. Your task is to find the correct answer to the question using the given context and options. Also, you may need to use commonsense reasoning about social situations to answer the questions. Classify your answers into 'A', 'B', and 'C'. You must choose the most likely option."
    }]
    model_output = None
    
    while True:
        try:
            # YOUR CODE HERE
            
            raise NotImplementedError()
            break
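        # openai.APITimeoutError is the timeout exception in the v1 client; openai.Timeout is not an exception class there
        except (openai.APIConnectionError, openai.RateLimitError, openai.APITimeoutError, openai.InternalServerError) as e: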
            #Sleep and try again
            print(f"Couldn't get response due to {e}. Trying again!")
            # time.sleep(20)
            continue

    return model_output.choices[0].message.content

In [None]:
test_example = val_prompts[0]
test_example_label = val_verbalized_labels[0]
model_output = get_social_iqa_pred_gpt(test_example, few_shot_examples,
                                       model_name = "llama3-70b-8192",
                                       max_tokens = 20, temperature = 0.0)
print(test_example)
print(f"Model's response: ", model_output)
print(f"Correct answer: ", test_example_label)

As you can see the model didn't quite get the answer right. Let's try providing examples with explanations i.e. `fs_examples_w_explanations` and see the output. Note that we will need to give a higher value of `max_tokens`, since the model is also expected to generate explanation now.

In [None]:
test_example = val_prompts[0]
test_example_label = val_verbalized_labels[0]
model_output = get_social_iqa_pred_gpt(test_example, fs_examples_w_explanations,
                                        model_name = "llama3-70b-8192",
                                        max_tokens = 50, temperature = 0.0)
print(test_example)
print(f"Model's response: ", model_output)
print(f"Correct answer: ", test_example_label)

As you can see the output is correct and the explanation also makes sense.

Let's do a full-fledged evaluation now. Due to API limits, we will evaluate only the first 32 examples of the validation set rather than the whole set, but that should give us some idea of how good our LLM (LLaMa3-70b) is at solving social common-sense reasoning problems.

In [159]:
def get_model_predictions(
        test_prompts,
        few_shot_examples,
        model_name = "llama3-70b-8192",
        max_tokens = 20,
        temperature = 0.0,
):
    """
    Get predictions for all test prompts using the `get_social_iqa_pred_gpt` function

    Inputs:
        test_prompts: A list of test prompts
        few_shot_examples: A list of few-shot examples
        model_name: The name of the model to use
        max_tokens: The maximum number of tokens to generate
        temperature: The temperature to use for the model
    
    Outputs:
        model_preds: A list of model predictions for each test prompt
    """

    model_preds = []
    # YOUR CODE HERE
    raise NotImplementedError
        
        
    return model_preds

def evaluate_model_preds(
        model_preds,
        test_labels
):
    """
    Evaluates the prediction of the model by performing string match between the predictions and labels.

    Inputs:
        model_preds: A list of model predictions for each test prompt
        test_labels: A list of test labels. Note that these are not verbalized

    Outputs:
        accuracy: The accuracy of the model i.e. #correct_predictions / #total_predictions
    """

    accuracy = None
    # YOUR CODE HERE
    raise NotImplementedError
    return accuracy*100

In [None]:
# To test if things are working fine
k = 5
test_prompts = val_prompts[:k]
test_labels = dev_labels[:k]
model_preds = get_model_predictions(test_prompts, few_shot_examples,
                                    model_name = "llama3-70b-8192",
                                    max_tokens = 20, temperature = 0.0)

accuracy = evaluate_model_preds(model_preds, test_labels)
print(f"Accuracy: {accuracy}")

In [None]:
# Evaluate on 32 validation examples
k = 32
test_prompts = val_prompts[:k]
test_labels = dev_labels[:k]
model_preds = get_model_predictions(test_prompts, few_shot_examples,
                                    model_name = "llama3-70b-8192",
                                    max_tokens = 20, temperature = 0.0)

accuracy = evaluate_model_preds(model_preds, test_labels)
print(f"Accuracy: {accuracy}")

In [None]:
# Evaluate on 32 validation examples with explanations
k = 32
test_prompts = val_prompts[:k]
test_labels = dev_labels[:k]
model_preds = get_model_predictions(test_prompts,
                                    fs_examples_w_explanations,
                                    model_name = "llama3-70b-8192",
                                    max_tokens = 150, temperature = 0.0)
accuracy = evaluate_model_preds(model_preds, test_labels)
print(f"Accuracy: {accuracy}")

**Evaluating 128 examples will take around 30 minutes.**

In [None]:
# Evaluate on 128 validation examples with explanations
k = 128
test_prompts = val_prompts[:k]
test_labels = dev_labels[:k]
model_output, model_preds = get_model_predictions(test_prompts, 
                                    fs_examples_w_explanations, 
                                    model_name = "llama3-70b-8192",
                                    max_tokens = 150, temperature = 0.0)
accuracy = evaluate_model_preds(model_preds, test_labels)
print(f"Accuracy: {accuracy}")

As you can see, prompting the model with explanations gives slightly better performance than prompting without them (78.9% vs. 71.875%). We can improve performance further with more prompt engineering and better explanations. Also, there may be instances where the answer was correct but our pattern matching failed to extract it and marked it incorrect.

We hope this has given you some idea of how to use these models to solve NLP tasks like this one. Also, notice that common sense reasoning remains an open problem for the models we have today: even with fairly strong LLMs like LLaMa3 and ChatGPT, the accuracy isn't as high as we would like.
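
To reduce the pattern-matching misses mentioned above, one option is to extract the answer with a slightly more tolerant regular expression before comparing it to the label. The sketch below is only illustrative. It assumes the model is prompted to finish with a phrase like `Answer: B` and that the options are verbalized as the letters A, B and C; `extract_choice` and `string_match_accuracy` are hypothetical helpers, not part of the lab code.

```python
import re

# Hypothetical pattern: the word "answer" (any case), an optional "is",
# an optional ":" or "-", and then an upper-case option letter, optionally in parentheses.
ANSWER_PATTERN = re.compile(r"(?i:answer)\s*(?:is)?\s*[:\-]?\s*\(?([ABC])\)?\b")

def extract_choice(model_output):
    """Return the last option letter found in the output, or None if none is found."""
    matches = ANSWER_PATTERN.findall(model_output)
    return matches[-1] if matches else None

def string_match_accuracy(outputs, gold_letters):
    """Percentage of outputs whose extracted letter equals the gold letter."""
    correct = sum(extract_choice(o) == g for o, g in zip(outputs, gold_letters))
    return 100 * correct / len(gold_letters)

# Made-up model outputs, for illustration only.
outputs = [
    "Answer: B",
    "She was tired, so the answer is (C).",
    "I am not sure.",
]
print([extract_choice(o) for o in outputs])
print(string_match_accuracy(outputs, ["B", "C", "A"]))
```

Taking the last match rather than the first is a design choice: when the model writes an explanation before its final answer, the final answer usually comes last.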

## Task 3: Retrieval Augmented Generation (RAG)

Now, let's move to another task, where our goal is to create a question answering system that lets students ask questions about professors from Plaksha University.

The data for professors has been scraped from the university's full-time faculty webpage and stored in a CSV file. The CSV file has the following columns:
- name: The name of the professor
- expertise: The area of expertise of the professor
- interest: The research interests of the professor
- about: A short bio/about of the professor

In [None]:
df = pd.read_csv("./data/Plaksha.csv")
df.fillna("N/A", inplace=True)
df.head()

### Task 3.1 Creating Embeddings

Create an embedding prompt by combining multiple columns. This prompt will be sent to the embedding model to get the final embedding.

Template: `Professor <professor_name>'s Area of Expertise is in <expertise> and some of the professor's Research Interests are <interest>. Here is a short Bio/About of the Professor: <about>`

In [None]:
def embedding_prompt(x):
    """
    This function takes a row of the dataframe as input and returns a string.
    The string is a combination of the professor's name, expertise, research interests, and bio.
    """
    return f"Professor {x['name']}'s Area of Expertise is in {x['expertise']} and some of the professor's Research Interests are {x['interest']}. Here is a short Bio/About of the Professor:\n\n {x['about']}"

Apply the embedding prompt to the dataframe to create a new column `prompt` which will be used to get the embeddings.

In [None]:
df['prompt'] = df.apply(embedding_prompt, axis=1)
assert 'prompt' in df.columns, "Prompt column not found"


Get the embeddings from the model `voyage-large-2-instruct` and save them as a new column `embedding`.

In [None]:
def get_embedding(x, input_type='document'):
    """
    This function takes a prompt string as input and returns the embedding of the prompt.
    The embedding is obtained using the voyage-large-2-instruct model.
    """
    # time.sleep(25)
    emb_obj = None
    # YOUR CODE HERE
    raise NotImplementedError

    return emb_obj.embeddings[0]

Load the CSV file that contains the precomputed embeddings.

In [None]:
df = pd.read_csv("./data/Plaksha_with_embeddings.csv")
df.fillna("N/A", inplace=True)
df.head()

Convert the embedding column, which stores each list as a string, to a numpy array.

In [None]:
def str2np(x):
    """
    This function takes a string representation of a list as input, 
    removes the brackets, splits the string into a list, converts the elements of the list to float, 
    and finally converts the list to a numpy array.

    HINT: Use the replace and split functions
    HINT: Remove brackets from string, and using split to convert to normal list
    HINT: Convert the elements of list to float
    HINT: Convert list to numpy array
    """
    output = []
    # YOUR CODE HERE
    raise NotImplementedError

    return np.array(output)
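
As a cross-check for your `str2np` implementation, the same conversion can also be done with the standard-library JSON parser instead of `replace` and `split`. This is only a sketch: `str2np_json` is a hypothetical name, and it assumes the embeddings were saved as plain bracketed lists such as `"[0.1, -0.2, 0.3]"`.

```python
import json
import numpy as np

def str2np_json(x):
    """Parse a string such as "[0.1, -0.2, 0.3]" into a float64 numpy array."""
    # json.loads turns the bracketed string into a Python list of numbers.
    return np.array(json.loads(x), dtype=np.float64)

# Made-up short vector, for illustration only.
vec = str2np_json("[0.1, -0.2, 0.3]")
print(vec.shape, vec.dtype)
```

Comparing the output of this helper with your own `str2np` using `np.allclose` on a few rows is a quick way to catch parsing mistakes.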



Apply the `str2np` function to the embedding column

In [None]:
df['embedding'] = df['embedding'].apply(str2np)

assert 'embedding' in df.columns, "Embedding column not found"
assert isinstance(df['embedding'][0], np.ndarray), "Embedding column is not of type numpy array"
assert df['embedding'][0].shape == (1024,), "Embedding column is not of shape (1024,)"
assert df['embedding'][0].dtype == np.float64, "Embedding column is not of type float64"

## Document In-Memory

Create a numpy 2D array from the embedding column

In [None]:
embeddings = np.stack(df['embedding'])
assert embeddings.shape == (37, 1024), "Incorrect Embedding Shape"

Now, we will create a cosine similarity function which will be used to get the similarity between the query and the professors.
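
For reference, the cosine similarity of two vectors $a$ and $b$ of length $n$ is their dot product divided by the product of their Euclidean norms:

```latex
\operatorname{cos\_sim}(a, b)
  = \frac{a \cdot b}{\lVert a \rVert_2 \, \lVert b \rVert_2}
  = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2} \, \sqrt{\sum_{i=1}^{n} b_i^2}}
```

For example, with $a = (1, 2, 3)$ and $b = (4, 5, 6)$, the dot product is $32$ and the norms are $\sqrt{14}$ and $\sqrt{77}$, so the similarity is $32 / \sqrt{1078} \approx 0.9746$.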

In [160]:
def cosine_similarity(a, b):
    """
    This function calculates the cosine similarity between two vectors.

    Parameters:
    a (numpy array): The first vector.
    b (numpy array): The second vector.

    Returns:
    float: The cosine similarity between the two vectors.
    """
    similarity = None
    # YOUR CODE HERE
    raise NotImplementedError
    return similarity

In [None]:
# Test cases for cosine similarity

a = [1, 2, 3]
b = [4, 5, 6]
print("Your cosine similarity is: ", cosine_similarity(a, b))
print("Expected cosine similarity is: ", 0.9746318461970762)
assert np.isclose(cosine_similarity(a, b), 0.9746318461970762, atol=0.00001), "Incorrect Cosine Similarity"

a = [0.1, 0.6, 0.8, 0.6, 0.34, 0.78, 0.65, 0.88, 0.1, 0.98, 0.34, 0.77]
b = [0.8, 0.5, 0.44, 0.67, 0.4, 0.6, 0.7, 0.23, 0.87, 0.45, 0.78, 0.98]
print("\nYour cosine similarity is: ", cosine_similarity(a, b))
print("Expected cosine similarity is: ", 0.7814329877768034)
assert np.isclose(cosine_similarity(a, b), 0.7814329877768034, atol=0.00001), "Incorrect Cosine Similarity"



Getting the N most similar professors for a given query

In [161]:
def similarProfessor(query, df, embeddings_matrix):
    """
    This function returns the N most similar professors for a given query.
    
    Parameters:
    query (str): The query for which similar professors are to be found.
    df (pandas DataFrame): The DataFrame containing information about professors.
    embeddings_matrix (numpy array): The matrix of embeddings for the professors.
    
    Returns:
    pandas DataFrame: A DataFrame containing the N most similar professors.

    # HINT: Get the embedding for the query first, input_type = 'query'
    # HINT: Use the cosine_similarity function to get the similarity between the query and the professors.
    # HINT: Use the np.argsort to get the indices of the most similar professors.
    # IMPORTANT: argsort returns indices in ascending order, but we need descending order.
    """
    result = None
    # YOUR CODE HERE

    raise NotImplementedError

    return result
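
The ranking step described in the hints can be illustrated on toy data. The sketch below is not the lab solution. It uses made-up 3-dimensional vectors instead of Voyage embeddings, a hypothetical helper `top_n_indices`, and a vectorized similarity computation instead of calling `cosine_similarity` row by row.

```python
import numpy as np

def top_n_indices(query_vec, embeddings_matrix, n=2):
    """Return indices of the n rows most similar to query_vec (cosine similarity)."""
    # Normalize the rows and the query so that a dot product equals cosine similarity.
    rows = embeddings_matrix / np.linalg.norm(embeddings_matrix, axis=1, keepdims=True)
    q = query_vec / np.linalg.norm(query_vec)
    scores = rows @ q
    # argsort sorts in ascending order, so negate the scores to get descending order.
    return np.argsort(-scores)[:n]

# Made-up document vectors and query, for illustration only.
docs = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
query = np.array([2.0, 1.0, 0.0])

print(top_n_indices(query, docs, n=2))
```

The returned indices can then be passed to `df.iloc[...]` to select the matching rows of the DataFrame.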

Now, we will create a system prompt for the LLM. The context that accompanies it will be built from the top N most similar professors for a given query, retrieved using cosine similarity against the existing embeddings.

Template: `You are a helpful assistant whose role is to help students with their queries. You will be given a context about one or more professors, followed by a query from the student, such as which professor is the better fit or which topic a particular professor is best at. Given the context, you have to answer the query correctly, and if there is no information in the context regarding the query, you have to answer 'No information available.'`

You can experiment with the system prompt.

In [None]:
system_prompt = "You are a helpful assistant whose role is to help students with their queries. You will be given a context about one or more professors, followed by a query from the student, such as which professor is the better fit or which topic a particular professor is best at. Given the context, you have to answer the query correctly, and if there is no information in the context regarding the query, you have to answer 'No information available.'"

Now, for a given query, let's find the top N professors related to the query and create the context string for the LLM.

In [None]:
def context_template(name, expertise, interest, about):
    return f"""
{name}:
Area of Expertise: {expertise}
The Professor's Research Interests are {interest}
Here is the Bio/About the Professor:
{about}"""

In [None]:
query = "I want to be part of a change in Indian education, working with someone who is doing cutting edge research on education, who will be the best person?"

In [None]:
result_df = similarProfessor(query, df, embeddings)

context = ""
for index, row in result_df.iterrows():
    context += context_template(row['name'], row['expertise'], row['interest'], row['about'])
    context += "\n\n"

print(context)

In [None]:
response = client.chat.completions.create(
    model= "llama3-70b-8192",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Context: \n\n {context}.\n\n Query: {query}"}
    ],
)

In [None]:
print(response.choices[0].message.content)