# Create an Instruction Tuned Dataset from Plain Text

by `@sanjivda`

One of the pain points of instruction fine-tuning is preparing a dataset in the form of `Instruction + Context + Ground Truth`. This is usually done manually. Indeed, Databricks prepared the [Dolly](https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm) dataset by crowd-sourcing it among its employees.

This is a lot of work, much of which can be replicated by carefully prompting an LLM. We show that this is effective on a large sample of text transcripts from quarterly earnings calls of companies. We used the Claude 2 LLM from Amazon Bedrock. You will need access to Bedrock, but beyond setting up the permissions no other Bedrock knowledge is required. The code in this notebook takes care of everything else; the docs below are useful for reference in any case.

References:

- https://github.com/aws-samples/amazon-bedrock-workshop
- https://docs.aws.amazon.com/bedrock/latest/userguide/api-methods-run-inference.html#api-inference-examples-meta-llama

The approach is a simple two-step hack.

1. Prompt the LLM to generate a question (`instruction`) based on some text (`context`).
2. In the same prompt, ask the LLM to also provide the answer (`response`) to that question.
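
The two steps can be sketched without any cloud access. In the minimal sketch below, `fake_llm` is a hypothetical stand-in for the real model call, and the tag names and helper names are illustrative assumptions, not the prompt used later in this notebook:

```python
import re


def build_prompt(context: str) -> str:
    """Ask for a question about the context and its answer in a single prompt."""
    return (
        f"<context>{context}</context>\n"
        "Write one question about the context, then answer it.\n"
        "Format: <question>...</question><answer>...</answer>"
    )


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call; always returns a canned reply."""
    return "<question>When was the company founded?</question><answer>1994</answer>"


def make_record(context: str) -> dict:
    """Turn one piece of text into an instruction/context/response record."""
    reply = fake_llm(build_prompt(context))
    match = re.search(
        r"<question>(.*?)</question>\s*<answer>(.*?)</answer>", reply, re.DOTALL
    )
    if match is None:
        raise ValueError("reply did not contain a question/answer pair")
    question, answer = match.groups()
    return {"instruction": question, "context": context, "response": answer}


record = make_record("The company was founded in 1994.")
print(record)
```

Swapping `fake_llm` for a real model call is the only change needed to produce real records.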

Here is an example using Amazon Bedrock with Claude 2. It works remarkably well!


In [1]:
import boto3
import json
import langchain

In [2]:
from langchain.text_splitter import CharacterTextSplitter

# This is a long document we can split up.
with open("../dataset/unsupervised-fine-tuning/train.txt") as f:
    file = f.read()

# chunk_size and chunk_overlap are measured in tiktoken tokens
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=500, chunk_overlap=50
)
texts = text_splitter.split_text(file)

Created a chunk of size 5962, which is longer than the specified 500
Created a chunk of size 1596, which is longer than the specified 500
Created a chunk of size 662, which is longer than the specified 500
Created a chunk of size 557, which is longer than the specified 500
Created a chunk of size 547, which is longer than the specified 500
Created a chunk of size 550, which is longer than the specified 500
Created a chunk of size 553, which is longer than the specified 500
Created a chunk of size 553, which is longer than the specified 500
Created a chunk of size 553, which is longer than the specified 500
Created a chunk of size 538, which is longer than the specified 500
Created a chunk of size 564, which is longer than the specified 500
Created a chunk of size 798, which is longer than the specified 500
Created a chunk of size 745, which is longer than the specified 500
Created a chunk of size 998, which is longer than the specified 500
Created a chunk of size 602, which is longer t
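
The warnings above show that some chunks exceed the requested size. As far as we can tell, `CharacterTextSplitter` splits on a separator (a blank line by default) and does not break up a single piece that is already too long. If smaller chunks are needed, one option is a second pass over the oversized ones. The sketch below is a plain-Python approximation that counts whitespace-separated words, not tiktoken tokens; `resplit_long_chunks`, `max_words`, and `overlap` are hypothetical names, and this step was not part of the run shown in this notebook.

```python
def resplit_long_chunks(chunks, max_words=400, overlap=40):
    """Break chunks longer than max_words into overlapping word windows."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    result = []
    step = max_words - overlap
    for chunk in chunks:
        words = chunk.split()
        if len(words) <= max_words:
            result.append(chunk)  # short enough, keep untouched
            continue
        for start in range(0, len(words), step):
            result.append(" ".join(words[start : start + max_words]))
            if start + max_words >= len(words):
                break  # the last window reached the end of the chunk
    return result


demo = ["short chunk", " ".join(f"w{i}" for i in range(10))]
pieces = resplit_long_chunks(demo, max_words=4, overlap=1)
print(pieces)
```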

In [160]:
len(texts)

2823

In [32]:
import boto3
import botocore
from pprint import pprint

bedrock_client = boto3.client(service_name="bedrock", region_name="us-west-2")
print(bedrock_client.list_foundation_models())

{'ResponseMetadata': {'RequestId': '68ade209-1005-485f-becc-ae627adfc3ed', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Wed, 10 Jan 2024 14:53:03 GMT', 'content-type': 'application/json', 'content-length': '17086', 'connection': 'keep-alive', 'x-amzn-requestid': '68ade209-1005-485f-becc-ae627adfc3ed'}, 'RetryAttempts': 0}, 'modelSummaries': [{'modelArn': 'arn:aws:bedrock:us-west-2::foundation-model/amazon.titan-tg1-large', 'modelId': 'amazon.titan-tg1-large', 'modelName': 'Titan Text Large', 'providerName': 'Amazon', 'inputModalities': ['TEXT'], 'outputModalities': ['TEXT'], 'responseStreamingSupported': True, 'customizationsSupported': [], 'inferenceTypesSupported': ['ON_DEMAND'], 'modelLifecycle': {'status': 'ACTIVE'}}, {'modelArn': 'arn:aws:bedrock:us-west-2::foundation-model/amazon.titan-embed-g1-text-02', 'modelId': 'amazon.titan-embed-g1-text-02', 'modelName': 'Titan Text Embeddings v2', 'providerName': 'Amazon', 'inputModalities': ['TEXT'], 'outputModalities': ['EMBEDDING'], 

In [7]:
import json
import csv
from datetime import datetime

bdrk_runtime = boto3.client(service_name="bedrock-runtime", region_name="us-west-2")


def call_bedrock(bedrock_runtime, modelId, prompt_data):
    if "amazon" in modelId:
        body = json.dumps(
            {
                "inputText": prompt_data,
                "textGenerationConfig": {
                    "maxTokenCount": 4096,
                    "stopSequences": [],
                    "temperature": 0,
                    "topP": 0.9,
                },
            }
        )
        # modelId = 'amazon.titan-tg1-large'
    elif "anthropic" in modelId:
        body = json.dumps(
            {
                "prompt": prompt_data,
                "max_tokens_to_sample": 20000,
                "temperature": 0.5,
                "top_p": 0.9,
                "stop_sequences": ["\n\nHuman:"],
            }
        )
        # modelId = 'anthropic.claude-instant-v1'
    elif "ai21" in modelId:
        body = json.dumps({"prompt": prompt_data, "maxTokens": 4096})
        # modelId = 'ai21.j2-grande-instruct'
    elif "stability" in modelId:
        body = json.dumps({"text_prompts": [{"text": prompt_data}]})
        # modelId = 'stability.stable-diffusion-xl'
    else:
        print("Parameter model must be one of titan, claude, j2, or sd")
        return
    accept = "application/json"
    contentType = "application/json"

    before = datetime.now()
    response = bedrock_runtime.invoke_model(
        body=body, modelId=modelId, accept=accept, contentType=contentType
    )
    latency = datetime.now() - before
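    response_body = json.loads(response.get("body").read())

    # Extract the generated text from the provider-specific response shape
    if "amazon" in modelId:
        output = response_body.get("results")[0].get("outputText")
    elif "anthropic" in modelId:
        output = response_body.get("completion")
    elif "ai21" in modelId:
        output = response_body.get("completions")[0].get("data").get("text")
    else:
        # Stability returns image artifacts, not text; hand back the parsed body
        output = response_body

    return output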

In [8]:
call_bedrock(
    bdrk_runtime,
    "anthropic.claude-v2:1",
    "\n\nHuman:Who was the first president of the United States?\n\nAssistant:",
)

' The first president of the United States was George Washington. He took office in 1789 after leading the Continental Army in the American Revolutionary War and presiding over the Constitutional Convention. Washington served two four-year terms as president from 1789-1797.'
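
The legacy Claude text-completion format used in this call expects the prompt to open with a `\n\nHuman:` turn and close with `\n\nAssistant:`. A small helper can keep that wrapping in one place; `to_claude_prompt` is a hypothetical name introduced here for illustration:

```python
def to_claude_prompt(user_message: str) -> str:
    """Wrap a plain message in the Human/Assistant turn markers."""
    return f"\n\nHuman: {user_message.strip()}\n\nAssistant:"


prompt = to_claude_prompt("Who was the first president of the United States?")
print(repr(prompt))
```

The resulting string could then be passed as `prompt_data` to `call_bedrock`.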

In [17]:
examples = [
    {
        "instruction": "What efforts is Apple making to increase market share in countries like India where market penetration is lower compared to developed countries?",
        "context": "Good day, and welcome to the Apple Q1 Fiscal Year 2021 Earnings Conference Call. Today's call is being recorded. At this time, for opening remarks and introductions, I would like to turn the call over to Tejas Gala, Director of Investor Relations and Corporate Finance. Please go ahead.  We'll go question from Katy Huberty with Morgan Stanley. Please go ahead.  We'll hear next from Wamsi Mohan with Bank of America. Please go ahead.  We'll go ahead and take our next question from Shannon Cross with Cross Research.  We'll hear from Toni Sacconaghi with Bernstein.  We'll hear from Amit Daryanani with Evercore ISI. Please go ahead.  We'll go ahead and take our next question from Samik Chatterjee with JP Morgan.  We'll hear from Krish Sankar with Cowen.  We'll go ahead and take our next question from Chris Caso with Raymond James. Please go ahead.  We'll go ahead and hear from Jim Suva with Citigroup. James Suva It's amazing how your company has pivoted and progressed through this uncertain time in society. A lot of the pushback we get on our view on Apple is that everyone around them or that they know developed countries has an iPhone or Apple product and the market is kind of being saturated some. But when I look at other countries like India, I believe statistically, you are materially below that in market share. So are you doing active efforts there? It seems like there's been some news reports of moving supply chain there. Or you recently opened up an Apple Store. How should we think about that? Because it just seems like you're really not full market share equally around the world.  Once again, that does conclude today's conference. We do appreciate your participation. ",
        "response": "Apple is actively working to increase market share in India through measures like moving supply chain operations there, opening Apple Stores, and recent news reports. India represents a major growth opportunity for Apple since smartphone penetration is much lower compared to developed countries where many consumers already own iPhones.",
    },
    {
        "instruction": "Can you talk a bit about what's going on with services growth, including factors impacting acceleration and deceleration?",
        "context": "Thank you very much. Can you talk a bit about what's going on within services, some of the puts and takes? Luca, you gave us some color in terms of the growth rates in that, but I'm just curious and I know you won't talk about future products, but as you think about the opportunity, you think about what you've got now and in the future and then some of what’s been going on with China and that, is this something that could reaccelerate, or again, the 18% on a constant currency basis is obviously quite strong. But, how are you thinking about it? Luca Maestri Yes. I think it's important to start with that 18% in constant currencies, Shannon. Our reported results are on a normalized basis, removing the one-time item from last year, was 15%. Clearly FX, plays a role around the world, 300 basis points of FX impact during the June quarter. In spite of that, it was an all-time record revenue. Our installed base continues to grow. It's growing in every geography and it's growing across all our major product categories, and that is very, very important for the services business. I would say, I'll give you a bit more color around two offsetting factors around this performance during the June quarter. On one side, the App Store, I mentioned in my prepared remarks that growth accelerated sequentially. We had double-digit growth on the App Store in every geography. In China, we saw significant acceleration. As you know, we tend to monetize in China on the App Store through game titles and the government has approved a few key game titles during the quarter. That has helped our performance there. 
On the other side, AppleCare, I mentioned AppleCare was an all-time record in June, so really strong performance, but our growth has decelerated in AppleCare due to factors that we fully expected, because we are comping this expansion of our coverage for AppleCare that we've had significant success during the last 18 to 24 months in really broadening our coverage of AppleCare around the world with some key partners, carriers and resellers. And obviously, as we go through the year those comps become a bit more difficult. Having said all that, that we've given ourselves a couple of targets and we feel very confident about reaching those targets. The first one is that, we wanted to double the size of the services business from our fiscal 2016 to 2020. We are on our way there. Paid subscriptions is another target, is important to us. It's an important way for us to monetize our ecosystem. We set a target of surpassing 0.5 billion paid subscriptions on the ecosystem during 2020. We’re already at 420 million now. So, we feel confident there. And of course, as you mentioned, we're very excited about the fact that we're going to be launching new services soon. As Tim said, we're starting the rollout of Apple Card in August, and there's two more very important services that we’re going to be adding to our portfolio during the fall. One is Apple Arcade, which is our gaming subscription service, and then of course Apple TV+, which is our video streaming service. So obviously these services will help us carry on with the momentum that we had in services.  Great. Thank you. And this is probably for you too as well, Luca. Can you talk about gross margin? The guidance was pretty solid. Obviously, there are various things that are at play here. I know you mentioned $1 billion worth of top-line impact, I think from, currency last quarter. Then maybe if you can kind of talk about what went into your gross margin guidance. Luca Maestri Yes. 
So of course, Shannon, as you've seen, our guidance for margin is 50 basis points higher than the guidance that we had given for June. I would say on the positive, we expect to benefit from leverage. As you've seen from our revenue guidance, and from cost savings, because as you know the commodity environment is fairly favorable right now. On the negative side, the headwind on gross margins on a year-over-year basis from foreign exchange is about 100 basis points. And so, we need to keep that in mind. But we feel pretty good about the guidance we provided. Nancy Paxton Thanks Shannon. Could we have the next question please. ",
        "response": "Luca Maestri provided color on services growth. The App Store accelerated globally and saw significant acceleration in China as new game titles were approved. AppleCare set an all-time revenue record but growth decelerated as Apple lapped expansion of coverage over the past 18-24 months. New services like Apple Card, Apple Arcade, and Apple TV+ will help continue momentum. Services remains on track to double in size from 2016 to 2020 and surpass 500 million paid subscriptions in 2020.",
    },
    {
        "instruction": "How has Apple been impacted by supply chain disruptions over the past couple of years, and has this led the company to rethink its manufacturing footprint and broader supply chain strategy going forward?",
        "context": "So first question just as it relates to some of the disruption you've seen on the component side, manufacturing and logistics over the past couple of years, are you starting to rethink your broader supply chain strategy or the manufacturing footprint on the back of the significant disruption? Are you happy with the overall geographic exposure that you see in the supply chain today?  Okay. And how are you thinking about the metaverse opportunity and Apple's role in that market? ",
        "response": "Apple has faced component shortages and logistics challenges during the pandemic, but it's unclear if this will lead to major changes in Apple's supply chain strategy or manufacturing footprint. The company seems satisfied with its geographic exposure but may make tweaks to enhance resilience.",
    },
    {
        "instruction": "Tim, four years ago you predicted services would reach $50 billion in revenue by 2020. Do you have an updated prediction for services revenue four years from now?",
        "context": "Hi, thanks for taking my question. I have two of them. First one, Tim, when you look at the services business and in terms of your TV+ content production have the movement restrictions impacted the content production efforts? And along the same path four years ago your premonition on services being a $50 billion business in 2020 came sooner than expected, I don't know if you want to make any such forecast four years out and how you think services revenue is going to be. Then I have a follow-up for Luca.  Yes, Tim, I was trying to see, four years ago you made a great prediction that services is going to be $50 billion by 2020. I wanted to see if you have any update to the prediction four years down the road?  If I could just squeeze in one for Luca. With the strong sales in Mac given the shelter-in-place, do you think the back-to-school season got pulled in by a quarter or do you expect the momentum to still continue? Thank you very much. ",
        "response": "Four years ago, Tim Cook predicted Apple's services revenue would reach $50 billion by 2020, which it did. He has not made any similar prediction looking four years ahead, but services revenue continues to grow rapidly for Apple.",
    },
    {
        "instruction": "Why is Apple's revenue guidance for the upcoming quarter lower than historical trends despite enthusiasm for new iPhone, Apple Watch, and services products? Is Apple being conservative given economic uncertainty, or are other factors at play?",
        "context": "Yes. Thank you. I think this is for Luca, and then I have a follow-up as well. If I look at your guidance, the midpoint of your guidance for revenues on a sequential basis, it's up about 36.5%. Historically, fiscal Q4 to Q1 was up 50% or more, and even last year given that iPhone is a slower growing product, you guided for revenues to be up 45% sequentially. So given the enthusiasm about the iPhone 11 launch and the new Wearables products and the new services, I guess the question is, why is your guidance not stronger for Q1 on the top line? And is that sort of a reflection of conservatism given that there's a lot of uncertainty in the world and we certainly saw that last year? Or are there other forces at work that we should be considered?  Okay. Thank you for that. And then if I could follow-up, just on the bundling of Apple TV+, I guess for you, Tim, this is really the first time we've seen a significant bundling of services offering and hardware offering. And I'm wondering if you view this as kind of a strategic advantage of Apple and whether we might see more hardware plus services offerings - bundled offerings, and ultimately do you - do you ever believe that your hardware itself might be offered as a bundled service. And maybe while we're on that either you or Luca could give us the 30 second tutorial on how we should think about the deferred revenue accounting? Approximately how much of the $60 are you going to be deferring, and what's your expectation for attach rate on that? Thank you. ",
        "response": "Apple's revenue guidance for the upcoming quarter is lower than past years' trends despite new product launches. This could reflect conservatism by Apple given economic uncertainty globally. It may also indicate other factors influencing demand, like longer upgrade cycles for iPhones. Apple may be taking a cautious approach to guidance given unpredictability in the current environment.",
    },
]

sample_question = examples[2]["instruction"]
sample_context = examples[2]["context"]
sample_answer = examples[2]["response"]

In [59]:
## Simple version of the prompt

import re

PromptTemplate = """\n\nHuman:\
Below is a snippet of a financial document.\n

Snippet:\n
<context>
{context}</context>\n\n

Given the context information above generate only questions based on the below query.

You are a Professor teaching finance at University. You had good success with teaching your students how to extract data from earnings transcripts.
There is an example below. Your task is to set up new questions for an upcoming \
quiz/examination based on financial documents like 10-K or 8-K statements. The questions should be diverse in nature and difficulty \
across the database. You can create more difficult questions than the example shown below. The questions should not contain options and should not start with Q1/Q2. \
Try to create different levels of complexity for the questions. Some where they purely need to find some data, others where they need to infer the context. \
Output only the question and the corresponding answers between <NewPair></NewPair> tags! \
Try to create 1 to 3 questions per snippet, try to vary the difficulty!

The questions can simply be who was hired, who leads a division etc. Students should just get familiar with extracting information from the documents. \

Create new questions and answers based on the above query. Do not repeat the same questions.
Remember, only output the question and the corresponding answers between <NewPair></NewPair> tags! I need to parse them afterwards, so any other text is unwanted!

An example of a question and answer pair and its format is shown below based on the earnings call transcripts:\n

    <sample_context>{sample_context}</sample_context>

    <NewPair>
    <question>{sample_question}</question>
    <answer>{sample_answer}</answer>
    </NewPair>

\n\nAssistant:"""


def parse_output(output):
    # Define a regex pattern to capture each question/answer pair
    pattern = r"<NewPair>\s*<question>(.*?)</question>\s*<answer>\s*(.*?)\s*</answer>\s*</NewPair>"
    # Find all matches in the output
    matches = re.findall(pattern, output, re.DOTALL)
    return matches


# Try the prompt on the first chunk only before running the full dataset
for text in texts:
    # Fill the prompt template
    prompt = PromptTemplate.format(
        context=text,
        sample_context=sample_context,
        sample_question=sample_question,
        sample_answer=sample_answer,
    )

    # Generate the questions
    response = call_bedrock(bdrk_runtime, "anthropic.claude-v2:1", prompt)

    # Parse the questions
    questions = parse_output(response)

    break  # stop after one chunk

In [60]:
questions

[("What is the start date for Joseph Galli Jr's employment with Amazon.com?",
  'June 24, 1999'),
 ("How much was Joseph Galli Jr's initial salary at Amazon.com?",
  '$16,666.70 per month ($200,000 annualized)'),
 ('What was the amount of the signing bonus paid to Joseph Galli Jr in 3 installments?',
  '$7,900,000'),
 ('What was the number of shares in the initial stock option grant to Joseph Galli Jr that vests over 10 years?',
  '735,000 shares'),
 ('What was the number of shares in the additional stock option grant to Joseph Galli Jr that vests after 10 years?',
  '1,225,000 shares')]
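
To see what `parse_output` extracts, the same regex can be run on a hand-written reply. The sample string below is made up for illustration and is not model output; the pattern is copied from the cell above.

```python
import re

PAIR_PATTERN = r"<NewPair>\s*<question>(.*?)</question>\s*<answer>\s*(.*?)\s*</answer>\s*</NewPair>"

# Made-up reply: one pair spread over several lines, one pair on a single line,
# plus surrounding text that the regex should ignore.
sample_reply = """Here are the pairs:
<NewPair>
<question>Who signed the letter?</question>
<answer>
Jane Doe
</answer>
</NewPair>
<NewPair><question>What year?</question><answer>1999</answer></NewPair>
This trailing sentence is ignored."""

# re.DOTALL lets "." match newlines, so multi-line answers are captured too
pairs = re.findall(PAIR_PATTERN, sample_reply, re.DOTALL)
print(pairs)
```

Text outside the `<NewPair>` tags is dropped, and whitespace around the answer is stripped by the `\s*` on either side of the capture group.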

### Non-parallelized version


In [45]:
# import re
# import pandas as pd


# def parse_output(output):
#     pattern = r"<NewPair>\s*<question>(.*?)</question>\s*<answer>\s*(.*?)\s*</answer>\s*</NewPair>"
#     matches = re.findall(pattern, output, re.DOTALL)
#     return matches


# # DataFrame to store the results
# df = pd.DataFrame(columns=["Context", "Question", "Answer"])

# for text in texts:
#     prompt = PromptTemplate.format(
#         context=text,
#         sample_context=sample_context,
#         sample_question=sample_question,
#         sample_answer=sample_answer,
#     )

#     # Generate the questions
#     response = call_bedrock(bdrk_runtime, "anthropic.claude-v2:1", prompt)

#     # Parse the output
#     questions = parse_output(response)

#     # Create a DataFrame for the current set of questions and answers
#     new_entries = pd.DataFrame(
#         [{"Context": text, "Question": q, "Answer": a} for q, a in questions]
#     )

#     # Concatenate the new entries with the main DataFrame
#     df = pd.concat([df, new_entries], ignore_index=True)

# # Now df has all the context, questions, and answers
# print(df)

### Parallelized version

In [62]:
# Skip the first 869 chunks, which were already processed in an earlier partial run
texts = texts[869:]
len(texts)

1954

In [63]:
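# first_869_df presumably holds the results for the first 869 chunks from an
# earlier partial run; its creation is not shown in this notebook
df = first_869_df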

In [64]:
import re
import pandas as pd
import json
from concurrent.futures import ThreadPoolExecutor, as_completed
from tqdm import tqdm
from time import sleep


def parse_output(output):
    pattern = r"<NewPair>\s*<question>(.*?)</question>\s*<answer>\s*(.*?)\s*</answer>\s*</NewPair>"
    matches = re.findall(pattern, output, re.DOTALL)
    return matches


def generate_and_parse(text, prompt_template, max_retries=5):
    attempts = 0
    while attempts < max_retries:
        try:
            prompt = prompt_template.format(
                context=text,
                sample_context=sample_context,
                sample_question=sample_question,
                sample_answer=sample_answer,
            )

            # Generate the questions
            response = call_bedrock(bdrk_runtime, "anthropic.claude-v2:1", prompt)

            # Parse the output
            questions = parse_output(response)
            return text, questions

        except Exception as e:  # Catch general exception
            if "ThrottlingException" in str(e):
                sleep_time = 60 * (2**attempts)  # Exponential back-off
                print(f"Throttling, retrying in {sleep_time} seconds...")
                sleep(sleep_time)
            else:
                print(f"Unexpected error, retrying: {e}")
            # Count every failed attempt so a non-throttling error cannot loop forever
            attempts += 1

    return text, []  # Return empty if max retries exceeded


# ThreadPoolExecutor for parallel processing
with ThreadPoolExecutor(max_workers=5) as executor:
    # Dictionary to keep track of futures
    future_to_text = {
        executor.submit(generate_and_parse, text, PromptTemplate): text
        for text in texts
    }

    for future in tqdm(
        as_completed(future_to_text), total=len(future_to_text), desc="Processing texts"
    ):
        text, questions = future.result()

        # Create a DataFrame for the current set of questions and answers
        new_entries = pd.DataFrame(
            [{"Context": text, "Question": q, "Answer": a} for q, a in questions]
        )

        # Concatenate the new entries with the main DataFrame
        df = pd.concat([df, new_entries], ignore_index=True)

# Now df has all the context, questions, and answers
print(df)

Processing texts:   1%|▏         | 26/1954 [00:44<42:23,  1.32s/it]  

Throttling, retrying in 60 seconds...


Processing texts:  11%|█         | 211/1954 [05:30<46:21,  1.60s/it]  

Throttling, retrying in 60 seconds...


Processing texts:  11%|█         | 213/1954 [05:35<1:03:25,  2.19s/it]

Throttling, retrying in 60 seconds...


Processing texts:  12%|█▏        | 233/1954 [06:14<51:12,  1.79s/it]  

Throttling, retrying in 60 seconds...


Processing texts:  13%|█▎        | 251/1954 [06:52<41:39,  1.47s/it]  

Throttling, retrying in 60 seconds...


Processing texts:  14%|█▍        | 279/1954 [07:44<44:35,  1.60s/it]  

Throttling, retrying in 60 seconds...


Processing texts:  15%|█▍        | 289/1954 [08:11<55:36,  2.00s/it]  

Throttling, retrying in 60 seconds...


Processing texts:  16%|█▋        | 318/1954 [09:16<57:52,  2.12s/it]  

Throttling, retrying in 60 seconds...


Processing texts:  17%|█▋        | 328/1954 [09:33<47:27,  1.75s/it]

Throttling, retrying in 60 seconds...


Processing texts:  17%|█▋        | 330/1954 [09:38<55:10,  2.04s/it]  

Throttling, retrying in 60 seconds...


Processing texts:  20%|█▉        | 382/1954 [11:18<32:09,  1.23s/it]  

Throttling, retrying in 60 seconds...


Processing texts:  20%|█▉        | 383/1954 [11:20<34:34,  1.32s/it]

Throttling, retrying in 60 seconds...


Processing texts:  21%|██▏       | 419/1954 [12:28<29:53,  1.17s/it]  

Throttling, retrying in 60 seconds...


Processing texts:  26%|██▋       | 516/1954 [15:12<36:09,  1.51s/it]  

Throttling, retrying in 60 seconds...


Processing texts: 100%|██████████| 1954/1954 [49:23<00:00,  1.52s/it]

                                                Context  \
0     Signature Date Printed Name: Jeffrey A. Wilke ...   
1     Signature Date Printed Name: Jeffrey A. Wilke ...   
2     Item 1. Business\n\nThis Annual Report on Form...   
3     Item 1. Business\n\nThis Annual Report on Form...   
4     Item 1. Business\n\nThis Annual Report on Form...   
...                                                 ...   
7926  (1) The Report fully complies with the require...   
7927  (1) The Report fully complies with the require...   
7928  Exhibit 32.1\n\nCertification Pursuant to 18 U...   
7929  Exhibit 32.1\n\nCertification Pursuant to 18 U...   
7930  Exhibit 32.1\n\nCertification Pursuant to 18 U...   

                                               Question  \
0     What was Richard Dalzell's starting salary whe...   
1     How much of a signing bonus did Amazon.com off...   
2                What year was Amazon.com incorporated?   
3     On what stock exchange is Amazon's common stoc...
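
The retry logic above waits `60 * 2**attempts` seconds after each throttled call. The sketch below prints that schedule for the first five attempts. The cap and the optional random jitter are assumptions added for illustration and were not part of the run shown above; `backoff_delay` is a hypothetical helper.

```python
import random


def backoff_delay(attempt, base=60, cap=900, jitter=0.0, rng=random):
    """Seconds to wait before retry number `attempt` (0-based)."""
    delay = min(cap, base * (2**attempt))
    # Optional jitter spreads out retries coming from parallel workers
    return delay + rng.uniform(0, jitter * delay)


schedule = [backoff_delay(a) for a in range(5)]
print(schedule)
```

Because each chunk keeps its own attempt counter, the log above only ever shows the first delay of 60 seconds.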




In [65]:
df.to_csv("Questions_save.csv", index=False)

In [66]:
df.to_feather("Questions_save.feather")

In [71]:
train_df = df[:5000]
monitor_df = df[5000:]

print(train_df.shape)
print(monitor_df.shape)

(5000, 3)
(2931, 3)


In [72]:
import json

# Save the training split as a Q&A dataset
with open("qa_dataset_train.jsonl", "w") as file:
    for _, row in train_df.iterrows():
        qa_entry = {"question": row["Question"], "answer": row["Answer"]}
        file.write(json.dumps(qa_entry) + "\n")

# Save the monitor split as a Q&A dataset
with open("qa_dataset_monitor.jsonl", "w") as file:
    for _, row in monitor_df.iterrows():
        qa_entry = {"question": row["Question"], "answer": row["Answer"]}
        file.write(json.dumps(qa_entry) + "\n")

In [73]:
# Save the training split as an instruction dataset
with open("instruction_dataset_train.jsonl", "w") as file:
    for _, row in train_df.iterrows():
        instruction_entry = {
            "instruction": row["Question"],
            "context": row["Context"],
            "response": row["Answer"],
        }
        file.write(json.dumps(instruction_entry) + "\n")

# Save the monitor split as an instruction dataset
with open("instruction_dataset_monitor.jsonl", "w") as file:
    for _, row in monitor_df.iterrows():
        instruction_entry = {
            "instruction": row["Question"],
            "context": row["Context"],
            "response": row["Answer"],
        }
        file.write(json.dumps(instruction_entry) + "\n")
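
A quick way to sanity-check files like these is to read them back and confirm that each line is a JSON object with the expected keys. The sketch below uses a made-up record and a temporary file (`instruction_dataset_demo.jsonl` is a hypothetical name), so it does not touch the files written above.

```python
import json
import os
import tempfile

# Made-up record in the same shape as the instruction dataset
records = [
    {
        "instruction": "What year was the company incorporated?",
        "context": "The company was incorporated in 1994.",
        "response": "1994",
    },
]

path = os.path.join(tempfile.mkdtemp(), "instruction_dataset_demo.jsonl")

# Write one JSON object per line
with open(path, "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# Read the file back, skipping blank lines
with open(path) as f:
    loaded = [json.loads(line) for line in f if line.strip()]

# Every record should carry exactly the three expected keys
expected_keys = {"instruction", "context", "response"}
assert all(set(r) == expected_keys for r in loaded)
print(len(loaded), "records loaded")
```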