<a href="https://colab.research.google.com/github/finardi/WatSpeed_LLM_foundation/blob/main/Module3%3A%20Data_augmentation_(enrichment)_with_GPT_3_5_turbo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data augmentation (enrichment) with GPT-3.5-turbo

This notebook showcases the use of GPT-3.5-turbo for data augmentation by adding a new feature to the examples of the [Incomplete Information Reading Comprehension (IIRC)](https://allenai.org/data/iirc) dataset. The task is to provide an explanation of how the provided documents answer the given question, in addition to the original question and context. 

The IIRC dataset is a crowdsourced dataset that contains information-seeking questions, which require models to identify and retrieve necessary information that is missing from the original context. Each context is a paragraph from English Wikipedia and comes with a set of links to other Wikipedia pages, and answering the questions requires following the appropriate links and retrieving relevant information from those linked pages that is missing from the original context. 

The newly added feature will be helpful for using the dataset as few-shot examples that induce chain-of-thought, enabling models to learn to reason and make predictions based on incomplete information.

# Installing required packages

In this example, we need to install the `openai` and `tiktoken` libraries.

**`openai`**:

OpenAI is an artificial intelligence research laboratory consisting of the for-profit corporation OpenAI LP and its parent company, the non-profit OpenAI Inc. The `openai` library is a Python client that provides an easy-to-use interface to the OpenAI API. With this library, users can integrate OpenAI's language models, including GPT-3.5-turbo, into their applications and use them for natural language processing (NLP) tasks such as language generation, classification, and question answering.

**`tiktoken`**:

Tiktoken is an open-source BPE tokeniser developed by OpenAI that is used to split text strings into tokens. It is useful for models like GPT-3 that encode text into tokens. Tiktoken is designed to be highly efficient, capable of handling large amounts of text quickly and accurately.

In [None]:
!pip install openai
!pip install tiktoken

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting openai
  Downloading openai-0.27.6-py3-none-any.whl (71 kB)
Collecting aiohttp (from openai)
  Downloading aiohttp-3.8.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.0 MB)
Collecting multidict<7.0,>=4.5 (from aiohttp->openai)
  Downloading multidict-6.0.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (114 kB)
Collecting async-timeout<5.0,>=4.0.0a3 (from aiohttp->openai)
  Downloading async_timeout-4.0.2-py3-none-any.whl (5.8 kB)
Collecting yarl<2.0,>=1.0 (from aiohttp->openai)
  Downloadin

# Download dataset

To run the data augmentation example using GPT-3.5-turbo, we will use the IIRC (Incomplete Information Reading Comprehension) dataset, which is available from the official website of the Allen Institute for Artificial Intelligence (AI2) at https://allenai.org/data/iirc.

The commands below download the dataset archive and extract it into the current directory.

In [None]:
!wget https://iirc-dataset.s3.us-west-2.amazonaws.com/iirc_train_dev.tgz
!tar zxvf iirc_train_dev.tgz

--2023-05-15 23:38:06--  https://iirc-dataset.s3.us-west-2.amazonaws.com/iirc_train_dev.tgz
Resolving iirc-dataset.s3.us-west-2.amazonaws.com (iirc-dataset.s3.us-west-2.amazonaws.com)... 52.92.152.98, 52.218.181.169, 3.5.82.173, ...
Connecting to iirc-dataset.s3.us-west-2.amazonaws.com (iirc-dataset.s3.us-west-2.amazonaws.com)|52.92.152.98|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5713428 (5.4M) [application/gzip]
Saving to: ‘iirc_train_dev.tgz’


2023-05-15 23:38:06 (37.9 MB/s) - ‘iirc_train_dev.tgz’ saved [5713428/5713428]

._iirc_train_dev
iirc_train_dev/
iirc_train_dev/._dev.json
iirc_train_dev/dev.json
iirc_train_dev/._README
iirc_train_dev/README
iirc_train_dev/._train.json
iirc_train_dev/train.json


## Data preparation

The code below loads the training data of the IIRC dataset, which is stored in the "train.json" file. It then processes the data to extract the questions and their corresponding answers, and stores them in a list.

The script starts by loading the data from the "train.json" file and storing it in the "dev" variable. Despite its name, this variable holds the training split, not the development split. The "dev" variable is a list of dictionaries, where each dictionary represents an item from the dataset. Each item contains a paragraph of text from Wikipedia, along with a set of questions related to the text.

The script then initializes an empty list called "all_questions". It loops through each item in the "dev" variable and through each question in the "questions" field of the item's dictionary. For each question, it copies the "title" field of the item's dictionary into the "title" field of the question's dictionary.

The script then converts the question's answer into a plain string, depending on the "type" field of the answer. For "span", the texts of the answer spans are joined with commas; for "value", the value is followed by its unit; for "binary", the value is used as is; and for "none", the answer becomes "Not enough information." The resulting string replaces the original "answer" field of the question's dictionary.

Finally, the script appends the question's dictionary to the "all_questions" list. Once the loops finish, the last line displays the length of "all_questions", which is the total number of questions in the training data.

In [None]:
import json

# Load the training split (the variable is called `dev`, but the file is train.json).
with open("./iirc_train_dev/train.json") as f:
    dev = json.load(f)
all_questions = []

for item in dev:
    for q in item['questions']:
        # Keep the title of the source Wikipedia page with each question.
        q['title'] = item['title']
        # Flatten the structured answer into a plain string.
        answer = ""
        if q['answer']['type'] == "span":
            answer = ", ".join([a['text'] for a in q['answer']["answer_spans"]])
        elif q['answer']['type'] == "value":
            answer = "{0} {1}".format(q['answer']['answer_value'],q['answer']['answer_unit'])
        elif q['answer']['type'] == "binary":
            answer = q['answer']['answer_value']
        elif q['answer']['type'] == "none":
            answer = "Not enough information."
        q['answer'] = answer
        all_questions.append(q)
len(all_questions)

10839
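
The answer-formatting rules described above can also be written as a small standalone helper. This is only a sketch: `format_answer` is a hypothetical name that does not appear in the notebook, and the toy answer dictionaries merely mimic the field names used in the loop (`type`, `answer_spans`, `answer_value`, `answer_unit`).

```python
def format_answer(answer):
    """Flatten an IIRC-style answer dictionary into a plain string."""
    kind = answer["type"]
    if kind == "span":
        # Several spans are joined with commas.
        return ", ".join(span["text"] for span in answer["answer_spans"])
    if kind == "value":
        # A numeric value is followed by its unit.
        return f"{answer['answer_value']} {answer['answer_unit']}"
    if kind == "binary":
        return answer["answer_value"]
    if kind == "none":
        return "Not enough information."
    return ""  # unknown type: same empty-string fallback as the loop


# Toy answers (made up for illustration, not taken from the dataset).
toy_answers = [
    {"type": "span", "answer_spans": [{"text": "1944"}, {"text": "1945"}]},
    {"type": "value", "answer_value": "3", "answer_unit": "years"},
    {"type": "binary", "answer_value": "yes"},
    {"type": "none"},
]
formatted = [format_answer(a) for a in toy_answers]
print(formatted)
```

Keeping the rules in one function makes it easier to test them on a few hand-written answers before processing the whole file.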

The prepared dataset format is a list of dictionaries, where each dictionary represents a single question-answer pair. The keys in the dictionary include:

- **`context`**: a list of dictionaries, one for each text snippet needed to answer the question. In each dictionary, the 'text' key contains the snippet itself, the 'indices' key gives its start and end offsets within the source passage, and the 'passage' key names that source: 'main' for the original paragraph, or the title of the linked Wikipedia page.
- **`question_links`**: a list of titles of the linked Wikipedia pages that are relevant to answering the question.
- **`answer`**: a string representing the answer to the question.
- **`question`**: a string representing the question itself.
- **`qid`**: a string representing the unique identifier of the question.
- **`title`**: a string representing the title of the Wikipedia page that the context passage is taken from.


In [None]:
all_questions[0]

{'context': [{'text': 'During Operation Market Garden, the attempt to seize a bridgehead across the Rhine in the Netherlands, the 704th dropped supplies to allied troops near Nijmegen.',
   'indices': [494, 655],
   'passage': 'main'},
  {'text': 'Operation Market Garden was a failed World War II military operation fought in the Netherlands from 17 to 25 September 1944.',
   'indices': [0, 124],
   'passage': 'Operation Market Garden'}],
 'question_links': ['Operation Market Garden'],
 'answer': ' from 17 to 25 September 1944',
 'question': 'When did the operation during which the 704th dropped supplies to allied troops near Nijmegen begin?',
 'qid': 'q_0',
 'title': '446th Operations Group'}
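
To make the role of the `passage` key concrete, the sketch below separates the snippets of the main paragraph from those taken from linked pages. The helper name `split_context` and the shortened toy record are assumptions made for illustration; only the key names come from the record shown above.

```python
def split_context(item):
    """Return (main_snippets, linked_snippets_by_page) for one question."""
    main = []
    linked = {}
    for snippet in item["context"]:
        if snippet["passage"] == "main":
            main.append(snippet["text"])
        else:
            # Group snippets by the title of the linked page they come from.
            linked.setdefault(snippet["passage"], []).append(snippet["text"])
    return main, linked


# Toy record with the same layout as all_questions[0] (texts shortened).
toy_item = {
    "context": [
        {"text": "The 704th dropped supplies near Nijmegen.",
         "indices": [494, 655], "passage": "main"},
        {"text": "Operation Market Garden was fought in September 1944.",
         "indices": [0, 124], "passage": "Operation Market Garden"},
    ],
    "question_links": ["Operation Market Garden"],
}

main_snippets, linked_snippets = split_context(toy_item)
print(main_snippets)
print(linked_snippets)
```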

# Using OpenAI API

To use the OpenAI API, we need to import the `openai` module and set our API key. In the code below, we assign the key to the `OPENAI_KEY` variable and then set the `openai.api_key` attribute, which configures the key for the session.

The function `generate_chat` takes in a list of messages and generates a response using the OpenAI Chat API. The `model` parameter specifies which model to use for generating the response. The code below uses the `gpt-3.5-turbo` model by default, but we can also pass `gpt-4`.

**IMPORTANT:** Using the OpenAI API incurs costs, so we need to choose the model and set the parameters carefully to avoid unnecessary expenses.

In [None]:
import os
import openai

OPENAI_KEY = "" # @param set your OpenAI API key here

openai.api_key = OPENAI_KEY

def generate_chat(messages,model="gpt-3.5-turbo"):
  response = openai.ChatCompletion.create(
    model=model,
    messages=messages,
    temperature=0
  )
  return response["choices"][0]['message']['content']
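
API calls can fail temporarily (for example, because of rate limits or network errors), and a failed call in the middle of a long run is easy to recover from with a retry. The sketch below shows one possible retry wrapper with exponential backoff. It is not part of the notebook: `call_with_retries` and `flaky_chat` are hypothetical names, and the wrapper catches every exception for simplicity, whereas real code would catch only the client's transient errors.

```python
import time


def call_with_retries(fn, *args, max_attempts=3, base_delay=1.0,
                      sleep=time.sleep, **kwargs):
    """Call fn(*args, **kwargs), retrying with exponential backoff on errors."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(*args, **kwargs)
        except Exception:
            if attempt == max_attempts:
                raise  # give up after the last attempt
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))


# Stand-in for generate_chat that fails twice before succeeding.
calls = {"n": 0}

def flaky_chat(messages):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("temporary failure")
    return "Explanation: ok"


delays = []  # record the waits instead of actually sleeping
result = call_with_retries(flaky_chat, [], sleep=delays.append)
print(result, delays)
```

Under these assumptions, the real function could be wrapped the same way, for example `call_with_retries(generate_chat, messages)`.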

# Data Augmentation

The aim of this section is to generate explanations on how the documents of each example of the IIRC dataset can be used to answer the associated question. To achieve this, we will use GPT-3.5-turbo for data augmentation.

The method we will use involves asking the model to generate explanations based on a prompt that includes three elements:

1. An instruction on what the model should generate (in this case, an explanation on how to use the documents to answer the question).
2. Few-shot examples (at least one) that show the model how the task is done. Each example includes the documents, the question, the correct answer, and the expected explanation.
3. The target example, which is the example we want to augment. It includes the documents, the question, and the correct answer; the explanation is left for the model to write.

By combining these elements, we can generate new examples that include an explanation on how to use the documents to answer the question. This approach will help us to improve the IIRC dataset by providing more informative examples that induce a chain-of-thought and can be used as few-shot examples for downstream tasks.

## The instruction

The instruction is a fixed text, sent as the system message, that explains the task GPT-3.5-turbo will perform during the data augmentation process. It lists the three inputs that the user will provide for each example and specifies the expected output format:

In [None]:
instruction = """The user will provide you with:
1) some content documents;
2) a question that can be answered by these documents; and
3) the correct answer to the question based on the documents.

Your task is to write a reasoning paragraph that explains how the provided documents answer the question.

Expected output:
Explanation: <>
"""

## One-shot example

The code below presents a one-shot example, which is a single worked example of the task for the model to follow. The example includes a list of four documents, a question that can be answered using the information in the documents, **the correct answer to the question** based on the documents, and an explanation of how the documents can be used to answer the question.

**Additionally, you can add more examples or modify the explanation provided to the model, aiming to induce it to produce the desired result.**


In [None]:
examples = [
    {
      "documents": [
          "\"San Tropez\" is the fourth track from the album Meddle by the band Pink Floyd. This song was one of several to be considered for the band's \"best of\" album, Echoes: The Best of Pink Floyd.",
          "The French Riviera (known in French as the Côte d'Azur [kot daˈzyʁ]; Occitan: Còsta d'Azur [ˈkɔstɔ daˈzyɾ]; literal translation \"Azure Coast\") is the Mediterranean coastline of the southeast corner of France. There is no official boundary, but it is usually considered to extend from Cassis, Toulon or Saint-Tropez on the west to Menton at the France–Italy border in the east, where the Italian Riviera joins. The coast is entirely within the Provence-Alpes-Côte d'Azur (Région Sud) region of France. The Principality of Monaco is a semi-enclave within the region, surrounded on three sides by France and fronting the Mediterranean.",
          "Moon also promised transparency in his presidency, moving the presidential residence from the palatial and isolated Blue House to an existing government complex in downtown Seoul.",
          "Saint-Tropez (US: /ˌsæn troʊˈpeɪ/ SAN-troh-PAY, French: [sɛ̃ tʁɔpe]; Occitan: Sant-Tropetz , pronounced [san(t) tʀuˈpes]) is a town on the French Riviera, 68 kilometres (42 miles) west of Nice and 100 kilometres (62 miles) east of Marseille in the Var department of the Provence-Alpes-Côte d'Azur region of Occitania, Southern France."
      ],
      "question": "Did Pink Floyd have a song about the French Riviera?",
      "answer": "yes",
      "explanation": "According to Document 1, \"San Tropez\" is the fourth track from the album Meddle by the band Pink Floyd. Document 4 states that Saint-Tropez is a town on the French Riviera, which is a part of the Mediterranean coastline in the southeast corner of France, as mentioned in Document 2. Therefore, the song \"San Tropez\" by Pink Floyd is about a location on the French Riviera."
    }
]

## Generate explanation

The next step is to construct the entire prompt by adding the target example, and then run the API call. 

Before that, we need to load the GPT-3.5-turbo tokenizer using the `tiktoken` library. The tokenizer will be useful for estimating how much our implementation will cost when using the OpenAI API. 

In [None]:
import tiktoken

model = "gpt-3.5-turbo"
tokenizer = tiktoken.encoding_for_model(model)

Once we have the tokenizer, we can construct the prompt by combining the previously defined instruction and example with the target example. The target example will include a question, documents, and the correct answer to the question based on the provided documents. Finally, we will execute the API call to generate the explanation paragraph for the given example.

The **`generate_explanation`** function is responsible for generating an explanation for a given item, which consists of a context, a question, and the correct answer. The function takes in the **`item`** parameter, which is a dictionary containing the context, question, and answer for the target example. The function constructs a prompt based on the provided **`instruction`**, the **`examples`**, and the target example. 

The function then sends the constructed prompt to **`generate_chat`** function, which is responsible for generating a response from the GPT-3.5-turbo model. The output from the model is then parsed using a regular expression to extract the explanation. 

If the **`cost_estimation`** parameter is set to **`True`**, the function makes no API call. Instead, it returns an estimate, in US dollars, of what the call would cost, based on the number of tokens in the prompt plus an assumed output of 128 tokens, at a rate of US$ 0.002 per 1,000 tokens. Otherwise, the function returns the generated explanation if it is found, or **`None`** if it is not.

In [None]:
import re

def generate_explanation(item, cost_estimation=False):
  messages = [{"role":"system","content":instruction}]

  # Few-shot examples: each one is a user turn followed by the expected assistant turn.
  for example in examples:
    docs_str = ""
    for i,doc in enumerate(example['documents']):
      docs_str += f"[Document {i+1}]: {doc}\n##\n"
    
    messages.append({"role":"user","content":f"{docs_str}Question: {example['question']}\nAnswer: {example['answer']}"})
    messages.append({"role":"assistant","content":f"Explanation: {example['explanation']}"})

  target_docs_str = ""

  for i, doc in enumerate(item['context']):
    target_docs_str += f"[Document {i+1}]: {doc['text']}\n##\n"
  
  # Note that we include the correct answer as an input to the model, thus making the model's job easier.
  messages.append({"role":"user","content":f"{target_docs_str}Question: {item['question']}\nAnswer: {item['answer']}"})

  if cost_estimation:
    # Estimate only: count the prompt tokens and assume a fixed output size.
    prompt = "\n".join([message["content"] for message in messages])
    tokens = tokenizer.encode(prompt)
    output_size = 128 # assumed number of output tokens
    return ((len(tokens) + output_size) / 1000) * 0.002 # US$ 0.002 per 1,000 tokens
  
  res = generate_chat(messages)

  # Capture the text that follows "Explanation:" (up to the end of that line).
  regex = r"Explanation:\s(.*)"

  match = re.search(regex, res)

  if match:
      return match.group(1)
  return None
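
To see what the parsing step returns, the sketch below applies the same regular expression to a few made-up responses. The sample strings and the `extract` helper are assumptions for illustration. Note that `.` does not match a newline by default, so only the first line after `Explanation:` is captured unless `re.DOTALL` is passed.

```python
import re

regex = r"Explanation:\s(.*)"


def extract(text, flags=0):
    """Return the captured explanation, or None when the pattern is absent."""
    match = re.search(regex, text, flags)
    return match.group(1) if match else None


# Made-up model responses.
single = "Explanation: Document 1 names the song."
multi = "Explanation: Document 1 names the song.\nDocument 4 locates the town."
missing = "The documents do not say."

first_line_only = extract(multi)           # stops at the first newline
whole_text = extract(multi, re.DOTALL)     # also captures the later lines

print(extract(single))
print(first_line_only)
print(whole_text)
print(extract(missing))
```

If the model is expected to write more than one line, passing `re.DOTALL` is one way to keep the full text.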

## Cost Estimation

The code below performs a cost estimation for the generation of explanations for a set of questions. The `limit` variable defines the number of questions to consider.

In [None]:
import numpy as np

limit = 100
costs = []

for question in all_questions[:limit]:
  costs.append(generate_explanation(question, cost_estimation=True))

print(f"Generating explanations for {limit} questions will cost U$ {np.sum(costs):.2f}")

Generating explanations for 100 questions will cost U$ 0.14
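
The arithmetic behind this estimate can be checked by hand. The sketch below repeats the formula used in `generate_explanation` with a hypothetical prompt length of 572 tokens, a number chosen only for illustration. The rate of US$ 0.002 per 1,000 tokens and the assumed output of 128 tokens are the values hard-coded in the notebook and may differ from current pricing.

```python
PRICE_PER_1K_TOKENS = 0.002   # US$, value hard-coded in generate_explanation
ASSUMED_OUTPUT_TOKENS = 128   # fixed output size assumed by the notebook


def estimate_cost(prompt_tokens, output_tokens=ASSUMED_OUTPUT_TOKENS,
                  price_per_1k=PRICE_PER_1K_TOKENS):
    """Estimated cost in US$ of one request."""
    return (prompt_tokens + output_tokens) / 1000 * price_per_1k


per_request = estimate_cost(572)   # (572 + 128) / 1000 * 0.002
total = per_request * 100          # 100 requests of the same size

print(f"Per request: U$ {per_request:.4f}")
print(f"For 100 requests: U$ {total:.2f}")
```

With these hypothetical numbers, each request costs about U$ 0.0014, so 100 requests cost about U$ 0.14.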


## Running

Before generating the explanations, we use the code below to randomly sample a fixed number of examples from the training dataset. This avoids running the explanation generation process on the entire dataset, which would be slow and would increase API costs. The **`random.sample`** function selects examples, without replacement, from the list of all questions **`all_questions`**, and the number of examples to be selected is specified by the variable **`limit`**. After sampling the examples, we define the variable **`explained_dataset`**, which will store the questions together with the explanations generated by the model.

In [None]:
import random

sampled_examples = random.sample(all_questions, k=limit)

explained_dataset = []
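
Because the sample is random, each run selects different questions. If the same subset is needed across runs, one option is to draw the sample from a seeded generator, as sketched below. The seed value and the toy `population` list are assumptions for illustration and are not part of the notebook.

```python
import random

population = [f"q_{i}" for i in range(1000)]  # stand-in for all_questions
seed = 42                                     # hypothetical seed value

# Two generators created with the same seed draw the same sample.
first = random.Random(seed).sample(population, k=5)
second = random.Random(seed).sample(population, k=5)

print(first)
print(first == second)
```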

The code block below is a for loop that iterates over the randomly sampled questions (stored in the **`sampled_examples`** list) and generates an explanation for each of them using the **`generate_explanation()`** function. The **`tqdm()`** function displays a progress bar for the loop. Each generated explanation is stored in the **`explanation`** field of its question dictionary, and the updated dictionary is appended to the **`explained_dataset`** list.

The loop starts at index **`len(explained_dataset)`** of the **`sampled_examples`** list. If the cell is interrupted (for example, by an API error) and run again, it resumes from the first question that has not been explained yet instead of starting over. This avoids spending API calls on explanations that have already been generated.

In [None]:
from tqdm import tqdm

for question in tqdm(sampled_examples[len(explained_dataset):]):
  question["explanation"] = generate_explanation(question)  
  explained_dataset.append(question)

100%|██████████| 100/100 [02:42<00:00,  1.62s/it]
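
The resume behaviour of this loop can be tried out without calling the API. In the sketch below, `fake_explain` is a hypothetical stand-in for `generate_explanation` that fails on one chosen question; running the loop a second time continues from where the first run stopped.

```python
def fake_explain(question, fail_on=None):
    """Stand-in for generate_explanation that can simulate an API error."""
    if question["qid"] == fail_on:
        raise RuntimeError("simulated API error")
    return f"explanation for {question['qid']}"


sampled = [{"qid": f"q_{i}"} for i in range(5)]
explained = []


def run(fail_on=None):
    # Skip the questions that were already explained in a previous run.
    for question in sampled[len(explained):]:
        question["explanation"] = fake_explain(question, fail_on)
        explained.append(question)


try:
    run(fail_on="q_3")            # first run stops at the fourth question
except RuntimeError:
    pass
done_after_failure = len(explained)

run()                             # second run resumes at q_3
print(done_after_failure, [q["qid"] for q in explained])
```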


The following code saves the explained questions with their corresponding explanations to a JSON file. The code uses the **`json`** library to encode the list of questions and their explanations to a JSON object, which is then written to a file named **`explained_dataset.json`**. The **`with open()`** block ensures that the file is properly closed after writing the JSON object to it. 

❗**NOTE**❗: Please download the generated file from the Colab notebook, since it will be needed for the lab assignment notebook.

If you prefer, you can use [this version](https://drive.google.com/file/d/11QOpNF9PoANSli0MAUKTeYq_TfhYE6sg/view?usp=share_link).

In [None]:
import json

with open('explained_dataset.json', 'w') as f:
  json.dump(explained_dataset, f)
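
Before using the saved file elsewhere, it can help to reload it and check for questions whose explanation is missing, since `generate_explanation` returns `None` when no explanation is found in the response. The sketch below runs this check on a toy list written to a temporary directory; the toy records and the temporary path are assumptions for illustration.

```python
import json
import os
import tempfile

# Toy records with the same two fields used by the check.
toy_dataset = [
    {"qid": "q_0", "explanation": "Document 2 gives the dates."},
    {"qid": "q_1", "explanation": None},  # parsing failed for this one
]

path = os.path.join(tempfile.mkdtemp(), "explained_dataset.json")
with open(path, "w") as f:
    json.dump(toy_dataset, f)

with open(path) as f:
    reloaded = json.load(f)

# None is stored as JSON null and comes back as None.
missing = [q["qid"] for q in reloaded if not q.get("explanation")]
print(f"{len(reloaded)} questions, {len(missing)} without explanation: {missing}")
```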