# Example of generating QAs for IRS PDF
In this example, we will show you how to generate question-answers (QAs) from a pdf using Huggingface's models via `uniflow`'s [OpenAIJsonModelFlow](https://github.com/CambioML/uniflow/blob/main/uniflow/flow/model_flow.py#L125).

For this example, we're using a [IRS 2023](https://investors.nike.com/investors/news-events-and-reports/).

### Before running the code

You will need to `uniflow` conda environment to run this notebook. You can set up the environment following the instruction: https://github.com/CambioML/uniflow/tree/main#installation.

Next, you will need a valid [OpenAI API key](https://platform.openai.com/api-keys) to run the code. Once you have the key, set it as the environment variable `OPENAI_API_KEY` within a `.env` file in the root directory of this repository. For more details, see this [instruction](https://github.com/CambioML/uniflow/tree/main#api-keys)

Finally, we are storing the Nike 10K in the `data\raw_input` directory as "nike-10k-2023.pdf". You can download the file from [here](https://s1.q4cdn.com/806093406/files/doc_downloads/2023/414759-1-_5_Nike-NPS-Combo_Form-10-K_WR.pdf).

### Update system path

In [1]:
%reload_ext autoreload
%autoreload 2

import sys

sys.path.append(".")
sys.path.append("..")
sys.path.append("../..")

### Install helper packages

In [2]:
!{sys.executable} -m pip install langchain pandas pypdf



### Import Dependency

In [3]:
from dotenv import load_dotenv
import os
import pandas as pd
from uniflow.flow.client import TransformClient
from uniflow.flow.config import TransformHuggingFaceConfig
from uniflow.op.model.model_config import HuggingfaceModelConfig
from uniflow.op.prompt_schema import Context, GuidedPrompt

from uniflow.flow.config import ExtractPDFConfig
from uniflow.op.model.model_config import NougatModelConfig

# from uniflow.op.extract.split.markdown_header_splitter import MarkdownHeaderSplitter

from uniflow.op.extract.split.constants import MARKDOWN_HEADER_SPLITTER

from uniflow.pipeline import MultiFlowsPipeline
from uniflow.flow.config import PipelineConfig

load_dotenv()


  from .autonotebook import tqdm as notebook_tqdm


False

### Prepare the input data
First, we need to pre-process the PDF to get text chunks that we can feed into the model. We will use `PyPDFLoader` from langchain.

In [4]:
pdf_file = "IRS_2023.pdf"

##### Set current directory and input data directory.

In [5]:
dir_cur = os.getcwd()
input_file = os.path.join(f"{dir_cur}/data/raw_input/", pdf_file)

#### Load the pdf using Nougat

In [6]:
data = [
    {"filename": input_file},
]

from pprint import pprint
pprint(data)

[{'filename': '/home/ubuntu/uniflow/example/transform/data/raw_input/IRS_2023.pdf'}]


In [7]:
extract_config = ExtractPDFConfig(
    model_config=NougatModelConfig(
        model_name = "0.1.0-small",
        batch_size = 1 # When batch_size>1, nougat will run on CUDA, otherwise it will run on CPU
    ),
    splitter=MARKDOWN_HEADER_SPLITTER,
)


#### Set up prompt

In [9]:
# guided_prompt = GuidedPrompt(
#     instruction="Generate three Q&A pairs based on the context provided.",
#     examples=[
#         Context(
#             context="In 1948, Claude E. Shannon published A Mathematical Theory of Communication (Shannon, 1948) establishing the theory of information. In his article, Shannon introduced the concept of information entropy for the first time. We will begin our journey here.",
#             question="Who published A Mathematical Theory of Communication in 1948?",
#             answer="Claude E. Shannon."
#         ),
#         Context(
#             context="In 1948, Claude E. Shannon published A Mathematical Theory of Communication (Shannon, 1948) establishing the theory of information. In his article, Shannon introduced the concept of information entropy for the first time. We will begin our journey here.",
#             question="What concept did Shannon introduce in his 1948 article?",
#             answer="The concept of information entropy."
#         ),
#         Context(
#             context="In 1948, Claude E. Shannon published A Mathematical Theory of Communication (Shannon, 1948) establishing the theory of information. In his article, Shannon introduced the concept of information entropy for the first time. We will begin our journey here.",
#             question="What field was established by Shannon's 1948 publication?",
#             answer="The theory of information."
#         ),
#     ]
# )

In [8]:
# guided_prompt = GuidedPrompt(
#     instruction="'Generate Q&A based on the context'.",
#     examples=[
#         Context(
#             context="In 1948, Claude E. Shannon published A Mathematical Theory of\nCommunication (Shannon, 1948) establishing the theory of\ninformation. In his article, Shannon introduced the concept of\ninformation entropy for the first time. We will begin our journey here.",
#             question="Who published A Mathematical Theory of Communication in 1948?",
#             answer="Claude E. Shannon.",
#         ),
# ])

In [8]:
# guided_prompt = GuidedPrompt(
#     instruction="Generate 1 question and the corresponding answers based on the context, following the JSON format which Question and Answewr as two necessary keys. Calling output['context'], output['question'] and output['answer'] should return the corresponding context, question, and answer.",
#     examples=[]
# )

In [8]:
# generate questions
def generate_question_text(question_set, number_QAs):
    context = question_set["Context"]
    CQs = question_set["QAs"]

    # Initialize question and answer texts
    question_text = ""
    # answer_text = ""

    for i in range(number_QAs):
        qa = CQs[i % len(CQs)]
        question_text += f"Question {i+1}: {qa['Question']} "
        # answer_text += f"Answer {i+1}: {qa['Answer']} "

    # Remove trailing spaces
    question_text = question_text.strip()
    # answer_text = answer_text.strip()

    # return context, question_text, answer_text
    return context, question_text

In [8]:
# def generate_QA_text(QA_set, number_QAs):
#     context = QA_set["Context"]
#     QAs = QA_set["QAs"]

#     # Initialize question and answer texts
#     question_text = ""
#     answer_text = ""

#     for i in range(number_QAs):
#         qa = QAs[i % len(QAs)]
#         question_text += f"Question {i+1}: {qa['Question']} "
#         answer_text += f"Answer {i+1}: {qa['Answer']} "

#     # Remove trailing spaces
#     question_text = question_text.strip()
#     answer_text = answer_text.strip()

#     return context, question_text, answer_text

In [9]:
number_QAs = 3

In [10]:
# generate questions
from pprint import pprint

sample_instruction = """Generate {} question(s) and based on the context. Following \
the format of the examples below to include context and question in the response.""".format(number_QAs)
# generate as many questions as possible and make sure those questions can cover any question people can think of
# assume .. tax expert, prompt engineering

# generate answers based on the context and question

# no examples
# Define the QA sets
question_set = {
    "Context": "In 1948, Claude E. Shannon published A Mathematical Theory of Communication (Shannon, 1948) establishing the theory of information. In his article, Shannon introduced the concept of information entropy for the first time. We will begin our journey here.",
    "QAs": [
        {"Question": "Who published A Mathematical Theory of Communication in 1948?;"},
        {"Question": "What concept did Shannon introduce in his 1948 article?;"},
        {"Question": "What field was established by Shannon's 1948 publication?;"},
    ],
}

QA_set_C, QA_set_Q = generate_question_text(question_set, number_QAs)

sample_examples = [
        Context(
            context=QA_set_C,
            question=QA_set_Q,
        ),
]

guided_prompt = GuidedPrompt(
    instruction=sample_instruction,
    examples=sample_examples
)

print("Sample_instruction:")
print(sample_instruction, '\n')
print("Sample_examples:")
pprint(sample_examples)

Sample_instruction:
Generate 3 question(s) and based on the context. Following the format of the examples below to include context and question in the response. 

Sample_examples:
[Context(context='In 1948, Claude E. Shannon published A Mathematical Theory of Communication (Shannon, 1948) establishing the theory of information. In his article, Shannon introduced the concept of information entropy for the first time. We will begin our journey here.', question="Question 1: Who published A Mathematical Theory of Communication in 1948?; Question 2: What concept did Shannon introduce in his 1948 article?; Question 3: What field was established by Shannon's 1948 publication?;")]


In [10]:
# from pprint import pprint

# sample_instruction = """Generate {} question(s) and the corresponding answer(s) based on the context. Following \
# the format of the examples below to include context, question, and answer in the response.""".format(number_QAs)
# # generate as many questions as possible and make sure those questions can cover any question people can think of

# # generate answers based on the context and question

# # no examples
# # Define the QA sets
# QA_set = {
#     "Context": "In 1948, Claude E. Shannon published A Mathematical Theory of Communication (Shannon, 1948) establishing the theory of information. In his article, Shannon introduced the concept of information entropy for the first time. We will begin our journey here.",
#     "QAs": [
#         {"Question": "Who published A Mathematical Theory of Communication in 1948?;", "Answer": "Claude E. Shannon;"},
#         {"Question": "What concept did Shannon introduce in his 1948 article?;", "Answer": "The concept of information entropy;"},
#         {"Question": "What field was established by Shannon's 1948 publication?;", "Answer": "The theory of information;"},
#     ],
# }

# QA_set_C, QA_set_Q, QA_set_A = generate_QA_text(QA_set, number_QAs)

# sample_examples = [
#         Context(
#             context=QA_set_C,
#             question=QA_set_Q,
#             answer=QA_set_A
#         ),
# ]

# guided_prompt = GuidedPrompt(
#     instruction=sample_instruction,
#     examples=sample_examples
# )

# print("Sample_instruction:")
# print(sample_instruction, '\n')
# print("Sample_examples:")
# pprint(sample_examples)

Sample_instruction:
Generate 3 question(s) and the corresponding answer(s) based on the context. Following the format of the examples below to include context, question, and answer in the response. 

Sample_examples:
[Context(context='In 1948, Claude E. Shannon published A Mathematical Theory of Communication (Shannon, 1948) establishing the theory of information. In his article, Shannon introduced the concept of information entropy for the first time. We will begin our journey here.', question="Question 1: Who published A Mathematical Theory of Communication in 1948?; Question 2: What concept did Shannon introduce in his 1948 article?; Question 3: What field was established by Shannon's 1948 publication?;", answer='Answer 1: Claude E. Shannon; Answer 2: The concept of information entropy; Answer 3: The theory of information;')]


### Use LLM to generate data

In this example, we will use the [OpenAIModelConfig](https://github.com/CambioML/uniflow/blob/main/uniflow/model/config.py#L17)'s default LLM to generate questions and answers.

Here, we pass in our `guided_prompt` to the `OpenAIConfig` to use our customized instructions and examples, instead of the `uniflow` default ones.

We also want to get the response in the `json` format instead of the `text` default, so we set the `response_format` to `json_object`.

In [11]:
# current_batch_size = 4 if number_QAs == 1 else 2
# print("batch size:", current_batch_size)

batch size: 2


In [11]:
transform_config = TransformHuggingFaceConfig(
    guided_prompt_template=guided_prompt,
    model_config=HuggingfaceModelConfig(),
    # model_config=HuggingfaceModelConfig(batch_size=current_batch_size),
)

In [12]:
p = MultiFlowsPipeline(PipelineConfig(
    extract_config=extract_config,
    transform_config=transform_config,
))
output = p.run(data)

  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
Loading checkpoint shards: 100%|██████████| 2/2 [01:49<00:00, 54.83s/it]
  0%|          | 0/1 [00:00<?, ?it/s]

INFO: likely hallucinated title at the end of the page: ## Costs You Can Deduct or Capitalize Page 27


100%|██████████| 1/1 [08:39<00:00, 519.24s/it]
100%|██████████| 197/197 [33:50<00:00, 10.31s/it] 


In [20]:
#  {'error': 'CUDA out of memory. Tried to allocate 20.41 GiB. GPU 0 has a total capacty of 21.99 GiB of which 14.34 GiB is free. Including non-PyTorch memory, this process has 7.63 GiB memory in use. Of the allocated memory 7.05 GiB is allocated by PyTorch, and 280.89 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF',
# Even when the batch size is 1, the above error may occur (once), which lead to 1 output has only 1 question instead of 3(multiple). 

In [13]:
output

[[{'output': [{'response': ["instruction: Generate 3 question(s) and based on the context. Following the format of the examples below to include context and question in the response.\ncontext: In 1948, Claude E. Shannon published A Mathematical Theory of Communication (Shannon, 1948) establishing the theory of information. In his article, Shannon introduced the concept of information entropy for the first time. We will begin our journey here.\nquestion: Question 1: Who published A Mathematical Theory of Communication in 1948?; Question 2: What concept did Shannon introduce in his 1948 article?; Question 3: What field was established by Shannon's 1948 publication?;\ncontext: **Publication 535**\n**Publication 535**\npublication or the _How To Get Tax Help_ section at the end of this publication, go to the IRS Interactive Tax Assistant page at _IRS.gov_ _Hela/ITA_ where you can find topics by using the search feature or viewing the categories listed.\n_Getting tax forms, instructions, an

Now we call the `run` method on the `client` object to execute the question-answer generation operation on the data shown above.

### Process the output

Let's take a look of the generated output. We need to do a little postprocessing on the raw output.

In [15]:
# print(len(output[0]))

197


In [14]:
# Extracting context, question, and answer into a DataFrame
contexts = []
questions = []
answers = []

In [36]:
# import re

# keywords = ["context:", "question:", "answer:"]
# pattern = '|'.join(map(re.escape, keywords))

# o = output[0]['output'][0]['response'][0] ## we only postprocess the first output
# segments = [segment for segment in re.split(pattern, o) if segment.strip()]
# result = {
#     "context": segments[-3],
#     "question": segments[-2],
#     "answer": segments[-1]
# }


# # pprint(output)
# pprint(output[0]['response'])
# # pprint(result, sort_dicts=False)

In [32]:
# for item in output[0]:
#     for i in item.get('output', []):
#         for response in i.get('response', []):
            
#         break
#     break

In [15]:
for item in output[0]:
    for i in item.get('output', []):
        for response in i.get('response', []):
            parts = response.split('\n')
            response_dict = {}
            last_key = None

            for part in parts:
                if ":" in part:
                    key, value = part.split(":", 1)
                    key = key.strip()
                    value = value.strip()
                    response_dict[key] = value
                    last_key = key
                elif last_key is not None:
                    response_dict[last_key] += " " + part
            
            if any(key not in response_dict for key in ['context', 'question']):
                # print("[WARNING] Missing context, question or answer in response, skipping:\n", response)
                continue
            # if "Claude E. Shannon" in response_dict['answer']:
            #     # print("[WARNING] Used example context, skipping:\n", response_dict["context"])
            #     continue
            contexts.append(response_dict['context'])
            questions.append(response_dict['question'])

pd.set_option('display.max_colwidth', 1000)
pd.set_option('display.width', 1000)

print(len(contexts))
print(len(questions))

df = pd.DataFrame({
    'Context': contexts,
    'Question': questions,
})

196
196


In [16]:
df

Unnamed: 0,Context,Question
0,"**Publication 535** **Publication 535** publication or the _How To Get Tax Help_ section at the end of this publication, go to the IRS Interactive Tax Assistant page at _IRS.gov_ _Hela/ITA_ where you can find topics by using the search feature or viewing the categories listed. _Getting tax forms, instructions, and publications_. Go to _IRS.gov/Forms_ to download current and prior-year forms, instructions, and publications. _Ordering tax forms, instructions, and publications._ Go to _IRS.gov/OrderForms_ to order current forms, instructions, and publications; call 800-829-3676 to order prior-year forms and instructions. The IRS will process your order for forms and publications as soon as possible. **Don't** resmbur requests you already sent us. You can get forms and publications faster online.","Question 1: Where can I find the _How To Get Tax Help_ section at the end of a publication? ; Question 2: How do I access the IRS Interactive Tax Assistant page? ; Question 3: How do I order current forms, instructions, and publications from the IRS website?"
1,"## Future Developments For the latest information about developments related to Pub. 535, such as legislation enacted after it was published, go to _IRS.gov/Pub.535_.",What is the website where you can find the latest information about developments related to Pub. 535?
2,"## What's New for 2022 The following items highlight some changes in the tax law for 2022. **Form 1098-k reporting transition period.** The transition period described in _Notice 2023-10_ delays the reporting of transactions in excess of 5600 to transactions that occur after calendar year 2022. The transition period is intended to facilitate an orderly transition for TPSO tax compliance, as well as individual payge compliance with income tax reporting. A participating payge, in the case of a third-party network transaction, is any person who accepts payment from a third-party settlement organization for a business transaction. **The COVID-19 related credit for qualified sick and family leave wages is limited to leave taken after March 31, 2020, and before October 1, 2021.** Generally, the credit for qualified sick and family leave wages, as enacted under the _Families_ First Coronavirus Response Act (FFCRA) and amended and extended by the COVID-related Tax Relief Act of 2020, for l...",Question 1: Who published A Mathematical Theory of Communication in 1948?; Question 2: What concept did Shannon introduce in his 1948 article?; Question 3: What field was established by Shannon's 1948 publication?;
3,## References * [1] * [2] The following reminders and other items may help you file your tax return. [MISSING_PAGE_POST],Question 1: Who published A Mathematical Theory of Communication in 1948?; Question 2: What concept did Shannon introduce in his 1948 article?; Question 3: What field was established by Shannon's 1948 publication?;
4,"## Capital Expenses You must capitalize, rather than deduct, some costs. These costs are a part of your investment in your business and are called 'capital expenses.' Capital expenses are considered assets in your business. In general, you capitalize three types of costs. * Business startup costs (see _Tip_ below). * Business assets. * Improvements. You can elect to deduct or amortize certain business startup costs. See chapters 7 and 8.",Question 1: When should I capitalize my business startup costs? ; Question 2: Can I deduct certain business startup costs? ; Question 3: What is the purpose of tip provided in this section?
...,...,...
191,"### Reasonable period of time. A reasonable period of time depends on the facts and circumstances. Generally, actions that take place within the times specified in the following list will be treated as taking place within a reasonable period of time. 1. You give an advance within 30 days of the time the employee pays or incurs the expense. 2. Your employees adequately account for their expenses within 60 days after the expenses were paid or incurred. 3. Your employees return any excess reimbursement within 120 days after the expenses were paid or incurred. 4. You give a periodic statement (at least quarterly) to your employees that asks them to either return or adequately account for outstanding advances _and_ they comply within 120 days of the date of the statement. How to deduct You can claim a deduction for travel and non-entertainment-related meals expenses if you reinfunwise your employees for these expenses under an accountable plan. Generally, the amount you can deduct for n...",Question 1: Who published A Mathematical Theory of Communication in 1948?; Question 2: What concept did Shannon introduce in his 1948 article?; Question 3: What field was established by Shannon's 1948 publication?;
192,"## Per Diem and Car Allowances You can reimburse your employees under an accountable plan based on travel days, miles, or some other fixed allowance. In these cases, your employee is considered to have accounted to you for the amount of the expense that doesn't exceed the rates established by the federal government. Your employee must actually substantiate to you the other elements of the expense, such as time, place, and business purpose. Federal rate The federal rate can be figured using any one of the following methods.",Question 1: Who published A Mathematical Theory of Communication in 1948?; Question 2: What concept did Shannon introduce in his 1948 article?; Question 3: What field was established by Shannon's 1948 publication?;
193,"## The Taxpayer Advocate Service (TAS) Is Here To Help You! The Taxpayer Advocate Service (TAS) is a free service provided by the IRS that helps taxpayers resolve their problems with the IRS. TAS assists taxpayers who have experienced significant difficulties or delays when trying to contact the IRS directly. If you are having trouble getting through to an IRS representative, you can call TAS at 1-800-121-7567.",Question 1: What is the purpose of the Taxpayer Advocate Service? ; Question 2: How does one access the Taxpayer Advocate Service? ; Question 3: When should someone use the Taxpayer Advocate Service?
194,### What Is TAS? TAS is an _Independent_ organization within the IRS that helps taxpayers and protects taxpayer rights. Their job is to ensure that every taxpayer is treated fairly and that you know and understand your rights under the _Taxpayer Bill of Rights_. They also help resolve disputes between taxpayers and the IRS.,Question 1: What does TAS do?; Question 2: What is the purpose of TAS?; Question 3: How does TAS protect taxpayer rights?


In [42]:
# contexts

# questions

In [43]:
# # generate answers
# def generate_answer_text(answer_set, number_QAs):
#     context = answer_set["Context"]
#     CQs = answer_set["CQs"]
#     QAs = answer_set["QAs"]

#     # Initialize question and answer texts
#     question_text = ""
#     answer_text = ""

#     for i in range(number_QAs):
#         qa = QAs[i % len(QAs)]
#         # question_text += f"Question {i+1}: {qa['Question']} "
#         answer_text += f"Answer {i+1}: {qa['Answer']} "

#     # Remove trailing spaces
#     # question_text = question_text.strip()
#     answer_text = answer_text.strip()

#     return context, CQs, answer_text
#     # return context, answer_text

In [45]:
# # generate answers
# from pprint import pprint

# sample_instruction = """Generate the answer for the corresponding question based on the context. Following \
# the format of the examples below to include context, question, and answer in the response.""".format(number_QAs)
# # generate as many questions as possible and make sure those questions can cover any question people can think of

# # generate answers based on the context and question

# # no examples
# # Define the QA sets
# question_set = {
#     "Context": "In 1948, Claude E. Shannon published A Mathematical Theory of Communication (Shannon, 1948) establishing the theory of information. In his article, Shannon introduced the concept of information entropy for the first time. We will begin our journey here.",
#     "CQs": [
#         {"Question": "Who published A Mathematical Theory of Communication in 1948?;"},
#         {"Question": "What concept did Shannon introduce in his 1948 article?;"},
#         {"Question": "What field was established by Shannon's 1948 publication?;"},
#     ],
#     "QAs": [
#         {"Answer": "Claude E. Shannon;"},
#         {"Answer": "The concept of information entropy;"},
#         {"Answer": "The theory of information;"},
#     ],
# }

# QA_set_C, QA_set_Q, QA_set_A = generate_answer_text(question_set, number_QAs)

# sample_examples = [
#         Context(
#             context=QA_set_C,
#             question=QA_set_Q,
#             answer=QA_set_A
#         ),
# ]

# guided_answer_prompt = GuidedPrompt(
#     instruction=sample_instruction,
#     examples=sample_examples
# )

# print("Sample_instruction:")
# print(sample_instruction, '\n')
# print("Sample_examples:")
# pprint(sample_examples)

Sample_instruction:
Generate the answer for the corresponding question based on the context. Following the format of the examples below to include context, question, and answer in the response. 

Sample_examples:
[Context(context='In 1948, Claude E. Shannon published A Mathematical Theory of Communication (Shannon, 1948) establishing the theory of information. In his article, Shannon introduced the concept of information entropy for the first time. We will begin our journey here.', question=[{'Question': 'Who published A Mathematical Theory of Communication in 1948?;'}, {'Question': 'What concept did Shannon introduce in his 1948 article?;'}, {'Question': "What field was established by Shannon's 1948 publication?;"}], answer='Answer 1: Claude E. Shannon; Answer 2: The concept of information entropy; Answer 3: The theory of information;')]


In [50]:
# transform_config_answer = TransformHuggingFaceConfig(
#     guided_prompt_template=guided_answer_prompt,
#     model_config=HuggingfaceModelConfig(),
#     # model_config=HuggingfaceModelConfig(batch_size=current_batch_size),
# )

In [51]:
# # call transform client
# from uniflow.flow.client import TransformClient
# TransformClient(transform_config_answer)

Loading checkpoint shards: 100%|██████████| 2/2 [00:04<00:00,  2.18s/it]


ValueError: thread_0/huggingface_model_op already exists.

In [43]:
# # This function formats the context and questions for the model to generate answers.
# def generate_answer_text(context, questions):
#     # Initialize answer text
#     answer_text = ""

#     # Format each question in the required structure
#     for i, question in enumerate(questions):
#         answer_text += f"Question {i+1}: {question['Question']} Answer {i+1}: \n"

#     return context, answer_text

In [34]:
# Sample instruction for generating answers
sample_instruction = """Generate the answer for the corresponding question based on the context. Following \
the format of the examples below to include context, question, and answer in the response."""

# Define the context and questions for which you want to generate answers
# Here you would replace the placeholder with your actual extracted context and generated questions
question_set = {
    "Context": "In 1948, Claude E. Shannon published A Mathematical Theory of Communication (Shannon, 1948) establishing the theory of information. In his article, Shannon introduced the concept of information entropy for the first time. We will begin our journey here.",
    "QAs": [
        {"Question": "Who published A Mathematical Theory of Communication in 1948?"},
        {"Question": "What concept did Shannon introduce in his 1948 article?"},
        # Add more questions as needed
    ]
}

# Format the context and questions for the answer generation model
context, formatted_questions = generate_answer_text(question_set["Context"], question_set["QAs"])


In [35]:
from uniflow.op.prompt_schema import Context, GuidedPrompt

# Create the examples for the guided prompt
sample_examples = [
    Context(
        context=context,
        question=formatted_questions,
        answer=""  # The answer will be generated by the model, so we leave it empty here
    ),
]

# Create the guided prompt for answer generation
guided_answer_prompt = GuidedPrompt(
    instruction=sample_instruction,
    examples=sample_examples
)


In [36]:
transform_config_answer = TransformHuggingFaceConfig(
    guided_prompt_template=guided_answer_prompt,
    model_config=HuggingfaceModelConfig()
)

In [37]:
input_data = []

In [38]:
def generate_answer_text(context, question):
    # Assume the question is a plain string; format it
    formatted_question = f"Question: {question} Answer:"
    # Return the context and the formatted question
    return context, formatted_question

In [39]:
for context, question in zip(contexts, questions):
    formatted_context, formatted_question = generate_answer_text(context, question)
    
    # Append the formatted context and question to the input data
    input_data.append({
        "context": formatted_context,
        "question": formatted_question
    })

In [40]:
input_data

[{'context': "**Publication 535** **Publication 535** publication or the _How To Get Tax Help_ section at the end of this publication, go to the IRS Interactive Tax Assistant page at _IRS.gov_ _Hela/ITA_ where you can find topics by using the search feature or viewing the categories listed. _Getting tax forms, instructions, and publications_. Go to _IRS.gov/Forms_ to download current and prior-year forms, instructions, and publications. _Ordering tax forms, instructions, and publications._ Go to _IRS.gov/OrderForms_ to order current forms, instructions, and publications; call 800-829-3676 to order prior-year forms and instructions. The IRS will process your order for forms and publications as soon as possible. **Don't** resmbur requests you already sent us. You can get forms and publications faster online.",
  'question': 'Question: Question 1: Where can I find the _How To Get Tax Help_ section at the end of a publication? ; Question 2: How do I access the IRS Interactive Tax Assistant

In [41]:
client = TransformClient(transform_config_answer)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]Exception ignored in: <function Op.__del__ at 0x7fcbf38f8940>
Traceback (most recent call last):
  File "/home/ubuntu/uniflow/example/transform/../../uniflow/op/op.py", line 46, in __del__
    utils.OPS_NAME.remove(self._scope_name)
KeyError: 'thread_0/huggingface_model_op'
Loading checkpoint shards: 100%|██████████| 2/2 [00:04<00:00,  2.18s/it]


In [47]:
input_data = [
    Context(
        context=ctx,
        question=qst,
        answer=""  # Leave answer blank as the model will generate this
    )
    for ctx, qst in zip(contexts, questions)
]

In [48]:
input_data

[Context(context="**Publication 535** **Publication 535** publication or the _How To Get Tax Help_ section at the end of this publication, go to the IRS Interactive Tax Assistant page at _IRS.gov_ _Hela/ITA_ where you can find topics by using the search feature or viewing the categories listed. _Getting tax forms, instructions, and publications_. Go to _IRS.gov/Forms_ to download current and prior-year forms, instructions, and publications. _Ordering tax forms, instructions, and publications._ Go to _IRS.gov/OrderForms_ to order current forms, instructions, and publications; call 800-829-3676 to order prior-year forms and instructions. The IRS will process your order for forms and publications as soon as possible. **Don't** resmbur requests you already sent us. You can get forms and publications faster online.", question='Question 1: Where can I find the _How To Get Tax Help_ section at the end of a publication? ; Question 2: How do I access the IRS Interactive Tax Assistant page? ; Qu

In [49]:
output_with_answer = client.run(input_data)

  0%|          | 0/196 [00:00<?, ?it/s]

100%|██████████| 196/196 [19:48<00:00,  6.06s/it]


In [50]:
output_with_answer

[{'output': [{'response': ["instruction: Generate the answer for the corresponding question based on the context. Following the format of the examples below to include context, question, and answer in the response.\ncontext: In 1948, Claude E. Shannon published A Mathematical Theory of Communication (Shannon, 1948) establishing the theory of information. In his article, Shannon introduced the concept of information entropy for the first time. We will begin our journey here.\nquestion: Question: [{'Question': 'Who published A Mathematical Theory of Communication in 1948?'}, {'Question': 'What concept did Shannon introduce in his 1948 article?'}] Answer:\nanswer: \ncontext: **Publication 535** **Publication 535** publication or the _How To Get Tax Help_ section at the end of this publication, go to the IRS Interactive Tax Assistant page at _IRS.gov_ _Hela/ITA_ where you can find topics by using the search feature or viewing the categories listed. _Getting tax forms, instructions, and pub

In [56]:
contexts = []
questions = []
answers = []

In [99]:
# for item in output_with_answer:
#     for response_item in item['output']:
#         for response in i.get('response', []):
            

instruction: Generate 1 question(s) and based on the context. Following the format of the examples below to include context and question in the response.
context: In 1948, Claude E. Shannon published A Mathematical Theory of Communication (Shannon, 1948) establishing the theory of information. In his article, Shannon introduced the concept of information entropy for the first time. We will begin our journey here.
question: Question 1: Who published A Mathematical Theory of Communication in 1948?;
context: ### How Can You Learn About Your Taxpayer Rights?
The Taxpayer Bill of Rights describes 10 basic rights that all taxpayers have when dealing with the IRS. Go to _Taxpayer.ak/.pick/.pick_ to help you understand what these rights mean to you and how they apply. These are _your_ rights. Know them. Use them.
You can find a list of your rights and the IRS's obligations to protect them in _Pub.L.Y. Your Rights as a Taxayer_. It includes the following.
1. **The Right To Be Informed.** Taxaye

In [57]:
for item in output_with_answer:
    for i in item.get('output', []):
        for response in i.get('response', []):
            parts = response.split('\n')
            response_dict = {}
            last_key = None

            for part in parts:
                if ":" in part:
                    key, value = part.split(":", 1)
                    key = key.strip()
                    value = value.strip()
                    response_dict[key] = value
                    last_key = key
                elif last_key is not None:
                    response_dict[last_key] += " " + part
            
            if any(key not in response_dict for key in ['context', 'question', 'answer']):
                # print("[WARNING] Missing context, question or answer in response, skipping:\n", response)
                continue
            # if "Claude E. Shannon" in response_dict['answer']:
            #     # print("[WARNING] Used example context, skipping:\n", response_dict["context"])
            #     continue
            contexts.append(response_dict['context'])
            questions.append(response_dict['question'])
            answers.append(response_dict['answer'])

pd.set_option('display.max_colwidth', 1000)
pd.set_option('display.width', 100)

print(len(contexts))
print(len(questions))

df = pd.DataFrame({
    'Context': contexts,
    'Question': questions,
    'Answer': answers
})

196
196


In [58]:
df

Unnamed: 0,Context,Question,Answer
0,"**Publication 535** **Publication 535** publication or the _How To Get Tax Help_ section at the end of this publication, go to the IRS Interactive Tax Assistant page at _IRS.gov_ _Hela/ITA_ where you can find topics by using the search feature or viewing the categories listed. _Getting tax forms, instructions, and publications_. Go to _IRS.gov/Forms_ to download current and prior-year forms, instructions, and publications. _Ordering tax forms, instructions, and publications._ Go to _IRS.gov/OrderForms_ to order current forms, instructions, and publications; call 800-829-3676 to order prior-year forms and instructions. The IRS will process your order for forms and publications as soon as possible. **Don't** resmbur requests you already sent us. You can get forms and publications faster online.","Question 1: Where can I find the _How To Get Tax Help_ section at the end of a publication? ; Question 2: How do I access the IRS Interactive Tax Assistant page? ; Question 3: How do I order current forms, instructions, and publications from the IRS website?",
1,The following are some of the most common questions asked by people who have recently purchased a new car.,What should I do if my new car won't start?,"If your new car won't start, check the battery connections and make sure they are free from corrosion. If the problem persists, contact a mechanic."
2,"## What's New for 2022 The following items highlight some changes in the tax law for 2022. **Form 1098-k reporting transition period.** The transition period described in _Notice 2023-10_ delays the reporting of transactions in excess of 5600 to transactions that occur after calendar year 2022. The transition period is intended to facilitate an orderly transition for TPSO tax compliance, as well as individual payge compliance with income tax reporting. A participating payge, in the case of a third-party network transaction, is any person who accepts payment from a third-party settlement organization for a business transaction. **The COVID-19 related credit for qualified sick and family leave wages is limited to leave taken after March 31, 2020, and before October 1, 2021.** Generally, the credit for qualified sick and family leave wages, as enacted under the _Families_ First Coronavirus Response Act (FFCRA) and amended and extended by the COVID-related Tax Relief Act of 2020, for l...",Question 1: Who published A Mathematical Theory of Communication in 1948?; Question 2: What concept did Shannon introduce in his 1948 article?; Question 3: What field was established by Shannon's 1948 publication?;,1. Claude E. Shannon; 2. Information entropy; 3. The theory of information
3,"In 1948, Claude E. Shannon published A Mathematical Theory of Communication (Shannon, 1948), which established the theory of information. In his article, Shannon introduced the concept of information entropy for the first time. This work laid the foundation for modern digital communication.",Question: Who published A Mathematical Theory of Communication in 1948?,Claude E. Shannon
4,"## Capital Expenses You must capitalize, rather than deduct, some costs. These costs are a part of your investment in your business and are called 'capital expenses.' Capital expenses are considered assets in your business. In general, you capitalize three types of costs. * Business startup costs (see _Tip_ below). * Business assets. * Improvements. You can elect to deduct or amortize certain business startup costs. See chapters 7 and 8.",Question 1: When should I capitalize my business startup costs? ; Question 2: Can I deduct certain business startup costs? ; Question 3: What is the purpose of tip provided in this section?,
...,...,...,...
191,"### Reasonable period of time. A reasonable period of time depends on the facts and circumstances. Generally, actions that take place within the times specified in the following list will be treated as taking place within a reasonable period of time. 1. You give an advance within 30 days of the time the employee pays or incurs the expense. 2. Your employees adequately account for their expenses within 60 days after the expenses were paid or incurred. 3. Your employees return any excess reimbursement within 120 days after the expenses were paid or incurred. 4. You give a periodic statement (at least quarterly) to your employees that asks them to either return or adequately account for outstanding advances _and_ they comply within 120 days of the date of the statement. How to deduct You can claim a deduction for travel and non-entertainment-related meals expenses if you reinfunwise your employees for these expenses under an accountable plan. Generally, the amount you can deduct for n...",Question 1: Who published A Mathematical Theory of Communication in 1948?; Question 2: What concept did Shannon introduce in his 1948 article?; Question 3: What field was established by Shannon's 1948 publication?;,1. Claude E. Shannon; 2. Information entropy; 3. The theory of information.
192,"## Per Diem and Car Allowances You can reimburse your employees under an accountable plan based on travel days, miles, or some other fixed allowance. In these cases, your employee is considered to have accounted to you for the amount of the expense that doesn't exceed the rates established by the federal government. Your employee must actually substantiate to you the other elements of the expense, such as time, place, and business purpose. Federal rate The federal rate can be figured using any one of the following methods.",Question 1: Who published A Mathematical Theory of Communication in 1948?; Question 2: What concept did Shannon introduce in his 1948 article?; Question 3: What field was established by Shannon's 1948 publication?;,
193,"## The Taxpayer Advocate Service (TAS) Is Here To Help You! The Taxpayer Advocate Service (TAS) is a free service provided by the IRS that helps taxpayers resolve their problems with the IRS. TAS assists taxpayers who have experienced significant difficulties or delays when trying to contact the IRS directly. If you are having trouble getting through to an IRS representative, you can call TAS at 1-800-121-7567.",Question 1: What is the purpose of the Taxpayer Advocate Service? ; Question 2: How does one access the Taxpayer Advocate Service? ; Question 3: When should someone use the Taxpayer Advocate Service?,
194,### What Is TAS? TAS is an _Independent_ organization within the IRS that helps taxpayers and protects taxpayer rights. Their job is to ensure that every taxpayer is treated fairly and that you know and understand your rights under the _Taxpayer Bill of Rights_. They also help resolve disputes between taxpayers and the IRS.,Question 1: What does TAS do?; Question 2: What is the purpose of TAS?; Question 3: How does TAS protect taxpayer rights?,


In [59]:
df_unique = df.drop_duplicates(subset=['Question', 'Answer'])
df_unique

Unnamed: 0,Context,Question,Answer
0,"**Publication 535** **Publication 535** publication or the _How To Get Tax Help_ section at the end of this publication, go to the IRS Interactive Tax Assistant page at _IRS.gov_ _Hela/ITA_ where you can find topics by using the search feature or viewing the categories listed. _Getting tax forms, instructions, and publications_. Go to _IRS.gov/Forms_ to download current and prior-year forms, instructions, and publications. _Ordering tax forms, instructions, and publications._ Go to _IRS.gov/OrderForms_ to order current forms, instructions, and publications; call 800-829-3676 to order prior-year forms and instructions. The IRS will process your order for forms and publications as soon as possible. **Don't** resmbur requests you already sent us. You can get forms and publications faster online.","Question 1: Where can I find the _How To Get Tax Help_ section at the end of a publication? ; Question 2: How do I access the IRS Interactive Tax Assistant page? ; Question 3: How do I order current forms, instructions, and publications from the IRS website?",
1,The following are some of the most common questions asked by people who have recently purchased a new car.,What should I do if my new car won't start?,"If your new car won't start, check the battery connections and make sure they are free from corrosion. If the problem persists, contact a mechanic."
2,"## What's New for 2022 The following items highlight some changes in the tax law for 2022. **Form 1098-k reporting transition period.** The transition period described in _Notice 2023-10_ delays the reporting of transactions in excess of 5600 to transactions that occur after calendar year 2022. The transition period is intended to facilitate an orderly transition for TPSO tax compliance, as well as individual payge compliance with income tax reporting. A participating payge, in the case of a third-party network transaction, is any person who accepts payment from a third-party settlement organization for a business transaction. **The COVID-19 related credit for qualified sick and family leave wages is limited to leave taken after March 31, 2020, and before October 1, 2021.** Generally, the credit for qualified sick and family leave wages, as enacted under the _Families_ First Coronavirus Response Act (FFCRA) and amended and extended by the COVID-related Tax Relief Act of 2020, for l...",Question 1: Who published A Mathematical Theory of Communication in 1948?; Question 2: What concept did Shannon introduce in his 1948 article?; Question 3: What field was established by Shannon's 1948 publication?;,1. Claude E. Shannon; 2. Information entropy; 3. The theory of information
3,"In 1948, Claude E. Shannon published A Mathematical Theory of Communication (Shannon, 1948), which established the theory of information. In his article, Shannon introduced the concept of information entropy for the first time. This work laid the foundation for modern digital communication.",Question: Who published A Mathematical Theory of Communication in 1948?,Claude E. Shannon
4,"## Capital Expenses You must capitalize, rather than deduct, some costs. These costs are a part of your investment in your business and are called 'capital expenses.' Capital expenses are considered assets in your business. In general, you capitalize three types of costs. * Business startup costs (see _Tip_ below). * Business assets. * Improvements. You can elect to deduct or amortize certain business startup costs. See chapters 7 and 8.",Question 1: When should I capitalize my business startup costs? ; Question 2: Can I deduct certain business startup costs? ; Question 3: What is the purpose of tip provided in this section?,
...,...,...,...
189,"### Adequate accounting. Your employees must adequately account to you for their travel and non-entertainment-related meals expenses. They must give you documentary evidence of their travel, mileage, and other employee business expenses. This evidence should include items such as receipts, along with either a statement of expenses, an account book, a day planner, or similar record in which the employee entered each expense at or near the time the expense was incurred. The documentation provided must be accurate, complete, and timely. Failure to provide adequate accounting may result in penalties and disallowance of expenses.",Question 1: When do your employees need to provide you with evidence of their travel and non-entertainment related meals expenses? ; Question 2: What type of evidence is required from your employees to prove their expenses? ; Question 3: What consequences can occur if your employees fail to provide adequate accounting?,
190,"### Excess reimbursement or allowance. An excess reimbursement or allowance is any amount you pay to an employee that is more than the business-related expenses for which the employee adequately accounted. The employee must return any excess reimbursement or other expense allowance to you within a reasonable period of time. If the employee fails to do so, you may treat it as taxable income.",Question 1: What is excess reimbursement or allowance?; Question 2: When does the employee need to return any excess reimbursement or other expense allowance?; Question 3: How can excess reimbursement or allowance be treated if not returned within a reasonable period of time?,
193,"## The Taxpayer Advocate Service (TAS) Is Here To Help You! The Taxpayer Advocate Service (TAS) is a free service provided by the IRS that helps taxpayers resolve their problems with the IRS. TAS assists taxpayers who have experienced significant difficulties or delays when trying to contact the IRS directly. If you are having trouble getting through to an IRS representative, you can call TAS at 1-800-121-7567.",Question 1: What is the purpose of the Taxpayer Advocate Service? ; Question 2: How does one access the Taxpayer Advocate Service? ; Question 3: When should someone use the Taxpayer Advocate Service?,
194,### What Is TAS? TAS is an _Independent_ organization within the IRS that helps taxpayers and protects taxpayer rights. Their job is to ensure that every taxpayer is treated fairly and that you know and understand your rights under the _Taxpayer Bill of Rights_. They also help resolve disputes between taxpayers and the IRS.,Question 1: What does TAS do?; Question 2: What is the purpose of TAS?; Question 3: How does TAS protect taxpayer rights?,


In [60]:
df.to_csv('QnA_output.csv', index=False)

In [49]:
# # Create a TransformClient instance with the answer transform configuration
# client = TransformClient(transform_config_answer)

Loading checkpoint shards: 100%|██████████| 2/2 [00:04<00:00,  2.39s/it]


ValueError: thread_0/huggingface_model_op already exists.

In [34]:
# # input_data = []

# questions

'Q'

In [None]:
# # Create a TransformClient instance with the answer transform configuration
# client = TransformClient(transform_config_answer)

# # Assuming `data` is your input data structured appropriately for the TransformClient
# # You might need to run the client with something like this:
# output_with_answer = client.run(data)


In [17]:
# for item in output[0]:
#     for i in item.get('output', []):
#         for response in i.get('response', []):
#             parts = response.split('\n')
#             response_dict = {}
#             last_key = None

#             for part in parts:
#                 if ":" in part:
#                     # Split on the first colon, regardless of whether there's a space after it
#                     key, value = part.split(":", 1)
#                     key = key.strip()  
#                     value = value.strip()  
#                     response_dict[key] = value
#                     last_key = key
#                 elif last_key is not None:
#                     response_dict[last_key] += " " + part
            
#             if any(key not in response_dict for key in ['context', 'question', 'answer']):
#                 # print("[WARNING] Missing context, question or answer in response, skipping:\n", response)
#                 continue
#             if "Claude E. Shannon" in response_dict['answer']:
#                 # print("[WARNING] Used example context, skipping:\n", response_dict["context"])
#                 continue
#             contexts.append(response_dict['context'])
#             questions.append(response_dict['question'])
#             answers.append(response_dict['answer'])

# pd.set_option('display.max_colwidth', None)
# pd.set_option('display.width', 1000)

# print(len(contexts))
# print(len(questions))
# print(len(answers))

# df = pd.DataFrame({
#     'Context': contexts,
#     'Question': questions,
#     'Answer': answers
# })

0
0
0


In [16]:
# for item in output[0]:
#     for i in item.get('output', []):
#         for response in i.get('response', []):
#             parts = response.split('\n')
#             response_dict = {}
#             last_key = None

#             for part in parts:
#                 if ":" in part:
#                     # Split on the first colon, regardless of whether there's a space after it
#                     key, value = part.split(":", 1)
#                     key = key.strip()  
#                     value = value.strip()  
#                     response_dict[key] = value
#                     last_key = key
#                 elif last_key is not None:
#                     response_dict[last_key] += " " + part
            
#             if any(key not in response_dict for key in ['context', 'question', 'answer']):
#                 # print("[WARNING] Missing context, question or answer in response, skipping:\n", response)
#                 continue
#             if "Claude E. Shannon" in response_dict['answer']:
#                 # print("[WARNING] Used example context, skipping:\n", response_dict["context"])
#                 continue
#             contexts.append(response_dict['context'])
#             questions.append(response_dict['question'])
#             answers.append(response_dict['answer'])

# pd.set_option('display.max_colwidth', None)
# pd.set_option('display.width', 1000)

# print(len(contexts))
# print(len(questions))
# print(len(answers))

# df = pd.DataFrame({
#     'Context': contexts,
#     'Question': questions,
#     'Answer': answers
# })

97
97
97


In [17]:
# df

Unnamed: 0,Context,Question,Answer
0,"**Publication 535** **Publication 535** publication or the _How To Get Tax Help_ section at the end of this publication, go to the IRS Interactive Tax Assistant page at _IRS.gov_ _Hela/ITA_ where you can find topics by using the search feature or viewing the categories listed. _Getting tax forms, instructions, and publications_. Go to _IRS.gov/Forms_ to download current and prior-year forms, instructions, and publications. _Ordering tax forms, instructions, and publications._ Go to _IRS.gov/OrderForms_ to order current forms, instructions, and publications; call 800-829-3676 to order prior-year forms and instructions. The IRS will process your order for forms and publications as soon as possible. **Don't** resmbur requests you already sent us. You can get forms and publications faster online.","Question 1: Where can I find the _How To Get Tax Help_ section at the end of this publication?; Question 2: How do I access the IRS Interactive Tax Assistant page at _IRS.gov_? ; Question 3: How do I download current and prior-year forms, instructions, and publications from the IRS website?;",Answer 1: At the back cover of the publication; Answer 2: By going to _IRS.gov/Hela/ITA_; Answer 3: By going to _IRS.gov/Forms_
1,"## Future Developments For the latest information about developments related to Pub. 535, such as legislation enacted after it was published, go to _IRS.gov/Pub.535_.",Question 1: Where can I find the latest information about developments related to Pub. 535? ;,You can find the latest information about developments related to Pub. 535 at _IRS.gov/Pub.535_.
2,"## Capital Expenses You must capitalize, rather than deduct, some costs. These costs are a part of your investment in your business and are called 'capital expenses.' Capital expenses are considered assets in your business. In general, you capitalize three types of costs. * Business startup costs (see _Tip_ below). * Business assets. * Improvements. You can elect to deduct or amortize certain business startup costs. See chapters 7 and 8.",Question 1: When should I capitalize my business startup costs?; Question 2: Can I deduct certain business startup costs?; Question 3: What is the purpose of tip provided in this section?;,"Answer 1: You should capitalize your business startup costs when it is necessary for your trade or business to operate.; Answer 2: No, you cannot deduct certain business startup costs if they are used primarily for personal purposes.; Answer 3: The purpose of the tip provided in this section is to provide guidance on whether certain business startup costs can be deducted or not."
3,"### Cost recovery Although you generally cannot take a current deduction for a capital expense, you may be able to recover the amount you spend through deprecation, amortization, or depletion. These recovery methods allow you to deduct part of your cost each year. In this way, you are able to recover your capital expense. See _Amortization_ (chapter 8) and _Dealization_ (chapter 9) in this publication. A taxpayer can elect to deduct a portion of the costs of certain depreciable property as a section 179 deduction. A greater portion of these costs can be deducted if the property is qualified disaster assistance property. See Pub. 946 for details.",Question 1: When can a taxpayer not take a current deduction for a capital expense?; Question 2: How does a taxpayer recover their capital expense over time?; Question 3: Can a taxpayer choose to deduct a portion of the costs of certain depreciable property as a section 179 deduction?;,"Answer 1: You generally cannot take a current deduction for a capital expense.; Answer 2: Through deprecation, amortization, or depletion. See Amortization (chapter 8) and Depreciation (chapter 9); Answer 3: Yes, a taxpayer can choose to deduct a portion of the costs of certain depreciable property as a section 179 deduction."
4,"## Going Into Business The costs of getting started in business, before you actually begin business operations, are capital expenses. These costs may include expenses for advertising, travel, or wages for training employees. Capital expenses can be financed through loans from banks or other financial institutions.",Question 1: What are capital expenses? ; Question 2: How can capital expenses be financed? ; Question 3: Can capital expenses be used to purchase inventory?,"Answer 1: Capital expenses are the costs of getting started in business, such as advertising, travel, or wages for training employees. ; Answer 2: Capital expenses can be financed through loans from banks or other financial institutions. ; Answer 3: No, capital expenses cannot be used to purchase inventory."
...,...,...,...
92,### Topics,What are some topics discussed in this chapter?;,"Some topics discussed in this chapter are travel and non-entertainment-related meals, ribres and kickbacks, charitable contributions, education expenses, lobbying expenses, penalties and fines, repayments (claim of right), and other miscellaneous expenses."
93,"### Reimbursers A ""reimbursement or allowance arrangement"" provides for payment of advances, reimbursments, and allowances for travel and non-entertainment-related meals expenses incurred by your employees during the ordinary course of business. If the expenses are substantiated, you can deduct the allowable amount on your tax return. Because of differences between accounting methods and tax law, the amount you can deduct for tax purposes may not be the same as the amount you deduct on your business books and records. For example, you can deduct 100% of the cost of meals on your business books and records. However, only 50% of these costs are allowed by law as a tax deduction.",Question 1: How does one deduct a business expense under a reimbursement or allowance arrangement? ; Question 2: What is the difference between an accountable plan and a nonaccountable plan? ; Question 3: Can one deduct 100% of the cost of meals on their business books and records?,"Answer 1: Deducting a business expense under a reimbursement or allowance arrangement depends on whether you have an accountable plan or a nonaccountable plan. Under an accountable plan, deduct the expenses as travel and non-entertainment-related meals expenses. Under a nonaccountable plan, report the reimbursals as wages on Form W-2 and deduct them as wages on the appropriate line of your tax return. If you make a single payment that includes both wages and an expense reimbursement, specify the amount of the reimbursement and report it accordingly. See Table 11-1."
94,"### Adequate accounting. Your employees must adequately account to you for their travel and non-entertainment-related meals expenses. They must give you documentary evidence of their travel, mileage, and other employee business expenses. This evidence should include items such as receipts, along with either a statement of expenses, an account book, a day planner, or similar record in which the employee entered each expense at or near the time the expense was incurred.",Question 1: When do your employees need to provide documentation for their travel and non-entertainment related meals expenses?; Question 2: What type of evidence is required from your employees?; Question 3: Can your employees use electronic devices to keep track of their expenses?,"Answer 1: Your employees need to provide documentation for their travel and non-entertainment related meals expenses when they are reimbursed.; Answer 2: Receipts, along with either a statement of expenses, an account book, a day planner, or similar record in which the employee entered each expense at or near the time the expense was incurred.; Answer 3: No, electronic devices cannot be used to keep track of expenses."
95,"## The Taxpayer Advocate Service (TAS) Is Here To Help You With Your IRS Issues The Taxpayer Advocate Service (TAS) is a free service provided by the Internal Revenue Service (IRS). TAS helps taxpayers resolve their IRS issues. If you have an issue with the IRS, contact TAS at 1-800-121-7526 or visit www.irs.gov/advocacy.",Question 1: What is the purpose of the Taxpayer Advocate Service? ; Question 2: How can taxpayers contact the Taxpayer Advocate Service? ; Question 3: Where can taxpayers find more information about the Taxpayer Advocate Service?,Answer 1: The purpose of the Taxpayer Advocate Service is to help taxpayers resolve their IRS issues. ; Answer 2: Taxpayers can contact the Taxpayer Advocate Service at 1-800-121-7526. ; Answer 3: Taxpayers can find more information about the Taxpayer Advocate Service at www.irs.gov/advocacy.


In [18]:
# for item in output[0]:
#     for i in item.get('output', []):
#         for response in i.get('response', []):
#             parts = response.split('\n')
#             response_dict = {}
#             last_key = None

#             for part in parts:
#                 if ":" in part:
#                     # Split on the first colon, regardless of whether there's a space after it
#                     key, value = part.split(":", 1)
#                     key = key.strip()  
#                     value = value.strip()  
#                     response_dict[key] = value
#                     last_key = key
#                 elif last_key is not None:
#                     response_dict[last_key] += " " + part
            
#             if any(key not in response_dict for key in ['context', 'question', 'answer']):
#                 # print("[WARNING] Missing context, question or answer in response, skipping:\n", response)
#                 continue
#             if "Claude E. Shannon" in response_dict['answer']:
#                 # print("[WARNING] Used example context, skipping:\n", response_dict["context"])
#                 continue
#             contexts.append(response_dict['context'])
#             questions.append(response_dict['question'])
#             answers.append(response_dict['answer'])

# pd.set_option('display.max_colwidth', None)
# pd.set_option('display.width', 1000)

# print(len(contexts))
# print(len(questions))
# print(len(answers))

# df = pd.DataFrame({
#     'Context': contexts,
#     'Question': questions,
#     'Answer': answers
# })

194
194
194


In [16]:
# df

Unnamed: 0,Context,Question,Answer
0,"**Publication 535** **Publication 535** publication or the _How To Get Tax Help_ section at the end of this publication, go to the IRS Interactive Tax Assistant page at _IRS.gov_ _Hela/ITA_ where you can find topics by using the search feature or viewing the categories listed. _Getting tax forms, instructions, and publications_. Go to _IRS.gov/Forms_ to download current and prior-year forms, instructions, and publications. _Ordering tax forms, instructions, and publications._ Go to _IRS.gov/OrderForms_ to order current forms, instructions, and publications; call 800-829-3676 to order prior-year forms and instructions. The IRS will process your order for forms and publications as soon as possible. **Don't** resmbur requests you already sent us. You can get forms and publications faster online. _Tax help for people with disabilities_. If you have a disability that affects your ability to use the internet or other electronic tools, we offer special services to help you access our website and other resources. For more information, visit _IRS.gov/Accessibility_. _Tax help for seniors_. If you are age 65 or older, you may be eligible for free tax preparation assistance through the Volunteer Income Tax Assistance program (VITA). VITA is available at many locations throughout the United States. To learn more about VITA, visit _IRS.gov/VITA_. _Tax help for military families_. Military families may also be eligible for free tax preparation assistance through the Armed Forces Tax Council (AFTC). AFTC provides tax education and assistance to members of the U.S. armed forces, their spouses, and dependents. To learn more about AFTC, visit _IRS.gov/Military_. _Tax help for low-income individuals_. Low-income individuals may be eligible for free tax preparation assistance through the Tax Counseling for the Elderly (TCE) program. TCE provides tax education and assistance to senior citizens who live alone and have limited income. To learn more about TCE, visit _IRS.gov/TCE_. _Tax help for students_. Students may be eligible for free tax preparation assistance through the College Credit Scholarship Foundation (CCSF). CCSF provides tax education and assistance to college students who need help preparing their taxes. To learn more about CCSF, visit _IRS.gov/CollegeCreditScholarshipFoundation_. _Tax help for small businesses_. Small business owners may be eligible for free tax preparation assistance through the Small Business Administration (SBA). SBA offers various programs and services to help small businesses succeed. To learn more about SBA, visit _IRS.gov/SmallBusinessAdministration_. _Tax help for nonprofits_. Nonprofit organizations may be eligible for free tax preparation assistance through the National Center for Charitable Statistics (NCCS). NCCS provides data and statistics on charitable giving in the United States. To learn more about NCCS, visit _IRS.gov/NationalCenterForCharitableStatistics_. _Tax help for international visitors_. International visitors may be eligible for free tax preparation assistance through the IRS Foreign Account Reporting Compliance Act (FARCA) Education Program. FARCA requires certain foreign financial institutions to report information about accounts held by U.S. residents. To learn more about FARCA, visit _IRS.gov/ForeignAccountReportingComplianceActEducationProgram_. _Tax help for multistate filers_. Multistate filers may be eligible for free tax preparation assistance through the MultiState Filer Project. This project helps taxpayers file state and federal income tax returns. To learn more about the MultiState Filer Project, visit _IRS.gov/MultiStateFilerProject_. _Tax help for individuals with disabilities_. Individuals with disabilities may be eligible for free tax preparation assistance through the IRS Disability Tax Credit Program. This program provides tax credits to individuals with permanent and severe disabilities. To learn more about the Disability Tax Credit Program, visit _IRS.gov/DisabilityTaxCreditProgram_. _Tax help for individuals with language barriers_. Individuals with language barriers may be eligible for free tax preparation assistance through the IRS Language Line. This service connects taxpayers with interpreters who speak their language. To learn more about the Language Line, visit _",What field was established by Shannon's 1948 publication?,The theory of information.
1,"## Future Developments For the latest information about developments related to Pub. 535, such as legislation enacted after it was published, go to _IRS.gov/Pub.535_.",When was A Mathematical Theory of Communication published?,1948.
2,"## What's New for 2022 The following items highlight some changes in the tax law for 2022. **Form 1098-k reporting transition period.** The transition period described in _Notice 2023-10_ delays the reporting of transactions in excess of 5600 to transactions that occur after calendar year 2022. The transition period is intended to facilitate an orderly transition for TPSO tax compliance, as well as individual payge compliance with income tax reporting. A participating payge, in the case of a third-party network transaction, is any person who accepts payment from a third-party settlement organization for a business transaction. **The COVID-19 related credit for qualified sick and family leave wages is limited to leave taken after March 31, 2020, and before October 1, 2021.** Generally, the credit for qualified sick and family leave wages, as enacted under the _Families_ First Coronavirus Response Act (FFCRA) and amended and extended by the COVID-related Tax Relief Act of 2020, for leave taken after March 31, 2020, and before April 1, 2021, and the credit for qualified sick and family leave wages under sections 3131, 3132, and 3133 of the Internal Revenue Code, as enacted under the American Rescue Plan Act of 2021 (the ARP), for leave taken after March 31, 2021, and before October 1, 2021, have expired. However, employers that pay qualified sick and family leave wages in 2022 for leave taken after March 31, 2020, and before October 1, 2021, are eligible to claim a credit for qualified sick and family leave wages in 2022. For more information, see _chapter 2_. **The COVID-19 related employee retention credit has expired.** The employee retention credit enacted under the Coronavirus Aid, Relief, and Economic Security (CARES) Act and amended and extended by the Taxplayer certainty and Disaster Tax Relief Act of 2020 was limited to qualified wages paid after March 12, 2020, and before July 1, 2021. The employee retention credit under section 3134 of the Internal Revenue Code, as enacted by the ARP and amended by the Infrastructure Investment and Jobs Act, was limited to wages paid after June 30, 2021, and before October 1, 2021, unless the employer was a recovery startup business. An employer that was a recovery startup business could also claim the employee retention credit for wages paid after September 30, 2021, and before January 1, 2022. For more information, see _chapter 2_. **Credit for COBRA premium assistance payments is limited to periods of coverage beginning on or after April 1, 2021, through periods of coverage beginning on or before September 30, 2021.** Section 9501 of the ARP provides for COBRA premium assistance in the form of a full reduction in the premium otherwise payable by certain individuals and their families who elect COBRA continuation coverage due to a loss of coverage as the result of a reduction in hours or an involuntary termination of employment (assistance eligible individuals). This COBRA premium assistance is available for periods of coverage beginning on or after April 1, 2021, through periods of coverage beginning on or before September 30, 2021. For more information, see _chapter 2_. **The IRS has issued guidance on how to report virtual currency transactions.** Virtual currencies, such as Bitcoin, are digital assets used as a medium of exchange, a unit of account, or a store of value. The IRS has issued guidance on how to report virtual currency transactions, including virtual currency purchases, sales, exchanges, and transfers. Reporting requirements apply to all U.S. persons who engage in virtual currency transactions, regardless of whether they are residents of the United States or foreign nationals. For more information, see _chapter 2_.",What field was established by Shannon's 1948 publication?,The theory of information.
3,## References * [1] * [2] The following reminders and other items may help you file your tax return. [MISSING_PAGE_POST],What field was established by Shannon's 1948 publication?,The theory of information.
4,"## Capital Expenses You must capitalize, rather than deduct, some costs. These costs are a part of your investment in your business and are called 'capital expenses.' Capital expenses are considered assets in your business. In general, you capitalize three types of costs. * Business startup costs (see _Tip_ below). * Business assets. * Improvements. You can elect to deduct or amortize certain business startup costs. See chapters 7 and 8. ## Tip Business startup costs include any costs that help you get your business off the ground. Examples include rent, utilities, salaries, supplies, equipment, and legal fees. You may be able to deduct these costs if they are necessary for starting your business. However, if you choose to deduct them instead of capitalizing them, you cannot later claim depreciation on those items. If you decide to capitalize these costs, you can write them off over time using bonus depreciation or Section 167(f)(3) property tax treatment.",What field was established by Shannon's 1948 publication?,The theory of information.
...,...,...,...
190,"### Reasonable period of time. A reasonable period of time depends on the facts and circumstances. Generally, actions that take place within the times specified in the following list will be treated as taking place within a reasonable period of time. 1. You give an advance within 30 days of the time the employee pays or incurs the expense. 2. Your employees adequately account for their expenses within 60 days after the expenses were paid or incurred. 3. Your employees return any excess reimbursement within 120 days after the expenses were paid or incurred. 4. You give a periodic statement (at least quarterly) to your employees that asks them to either return or adequately account for outstanding advances _and_ they comply within 120 days of the date of the statement. How to deduct You can claim a deduction for travel and non-entertainment-related meals expenses if you reinfunwise your employees for these expenses under an accountable plan. Generally, the amount you can deduct for non-entertainment-related meals subject to a 50% limit, discussed later. If you are a sole proprieter, or are filing as a single member limited liability company, deduct the travel reimbursement on line 24a and the deductible part of the non-entertainment-related meals reimbursement on line 24b of Schedule C (Form 1040). If you are filing an income tax return for a corporation, include the reimbursement on the _Other deductions_ line of Form 1120. If you are filling any other business income tax return, such as a partnership or S corporation return, deduct the reimbursement on the appropriate line of the return as provided in the instructions for that return. To qualify for this deduction, you must have an accountable plan in place. An accountable plan is a written policy that requires employees to submit receipts for all meal expenses and to report the expenses to you. It also specifies how long the employee has to file the receipt and report the expense. Under an accountable plan, you do not need to pay taxes on the reimbursed amounts. However, you may still be required to keep records of the payments made to employees. What is the purpose of section 7202(c)? The purpose of Section 7202(c) is to provide a mechanism for employers to recover overpayments made to employees under an accountable plan. Employers who make overpayments to employees under an accountable plan can require the employees to repay the overpayments. This provision applies only to overpayments made under an accountable plan. Overpayments made under a nontaxable fringe benefit plan are not subject to recovery under Section 7202(c).",What field was established by Shannon's 1948 publication?,The theory of information.
191,"## Per Diem and Car Allowances You can reimburse your employees under an accountable plan based on travel days, miles, or some other fixed allowance. In these cases, your employee is considered to have accounted to you for the amount of the expense that doesn't exceed the rates established by the federal government. Your employee must actually substantiate to you the other elements of the expense, such as time, place, and business purpose. Federal rate The federal rate can be figured using any one of the following methods.",What field was established by Shannon's 1948 publication?,The theory of information.
192,"## The Taxpayer Advocate Service (TAS) Is Here To Help You With Your IRS Problems The Taxpayer Advocate Service (TAS) is a free service that helps taxpayers with their IRS problems. If you have an issue with your taxes or need help navigating the complexities of the IRS system, TAS can assist you.",Is the Taxpayer Advocate Service a paid service?,"No, the Taxpayer Advocate Service is a free service."
193,### What Is TAS? TAS is an _Independent_ organization within the IRS that helps taxpayers and protects taxpayer rights. Their job is to ensure that every taxpayer is treated fairly and that you know and understand your rights under the _Taxpayer Bill of Rights_. They also help resolve disputes between taxpayers and the IRS.,What does TAS do?,TAS helps taxpayers and protects taxpayer rights.


In [19]:
# df_unique = df.drop_duplicates(subset=['Question', 'Answer'])
# df_unique

Unnamed: 0,Context,Question,Answer
0,"**Publication 535** **Publication 535** publication or the _How To Get Tax Help_ section at the end of this publication, go to the IRS Interactive Tax Assistant page at _IRS.gov_ _Hela/ITA_ where you can find topics by using the search feature or viewing the categories listed. _Getting tax forms, instructions, and publications_. Go to _IRS.gov/Forms_ to download current and prior-year forms, instructions, and publications. _Ordering tax forms, instructions, and publications._ Go to _IRS.gov/OrderForms_ to order current forms, instructions, and publications; call 800-829-3676 to order prior-year forms and instructions. The IRS will process your order for forms and publications as soon as possible. **Don't** resmbur requests you already sent us. You can get forms and publications faster online.","Question 1: Where can I find the _How To Get Tax Help_ section at the end of this publication?; Question 2: How do I access the IRS Interactive Tax Assistant page at _IRS.gov_? ; Question 3: How do I download current and prior-year forms, instructions, and publications from the IRS website?;",Answer 1: At the back cover of the publication; Answer 2: By going to _IRS.gov/Hela/ITA_; Answer 3: By going to _IRS.gov/Forms_
1,"## Future Developments For the latest information about developments related to Pub. 535, such as legislation enacted after it was published, go to _IRS.gov/Pub.535_.",Question 1: Where can I find the latest information about developments related to Pub. 535? ;,You can find the latest information about developments related to Pub. 535 at _IRS.gov/Pub.535_.
2,"## Capital Expenses You must capitalize, rather than deduct, some costs. These costs are a part of your investment in your business and are called 'capital expenses.' Capital expenses are considered assets in your business. In general, you capitalize three types of costs. * Business startup costs (see _Tip_ below). * Business assets. * Improvements. You can elect to deduct or amortize certain business startup costs. See chapters 7 and 8.",Question 1: When should I capitalize my business startup costs?; Question 2: Can I deduct certain business startup costs?; Question 3: What is the purpose of tip provided in this section?;,"Answer 1: You should capitalize your business startup costs when it is necessary for your trade or business to operate.; Answer 2: No, you cannot deduct certain business startup costs if they are used primarily for personal purposes.; Answer 3: The purpose of the tip provided in this section is to provide guidance on whether certain business startup costs can be deducted or not."
3,"### Cost recovery Although you generally cannot take a current deduction for a capital expense, you may be able to recover the amount you spend through deprecation, amortization, or depletion. These recovery methods allow you to deduct part of your cost each year. In this way, you are able to recover your capital expense. See _Amortization_ (chapter 8) and _Dealization_ (chapter 9) in this publication. A taxpayer can elect to deduct a portion of the costs of certain depreciable property as a section 179 deduction. A greater portion of these costs can be deducted if the property is qualified disaster assistance property. See Pub. 946 for details.",Question 1: When can a taxpayer not take a current deduction for a capital expense?; Question 2: How does a taxpayer recover their capital expense over time?; Question 3: Can a taxpayer choose to deduct a portion of the costs of certain depreciable property as a section 179 deduction?;,"Answer 1: You generally cannot take a current deduction for a capital expense.; Answer 2: Through deprecation, amortization, or depletion. See Amortization (chapter 8) and Depreciation (chapter 9); Answer 3: Yes, a taxpayer can choose to deduct a portion of the costs of certain depreciable property as a section 179 deduction."
4,"## Going Into Business The costs of getting started in business, before you actually begin business operations, are capital expenses. These costs may include expenses for advertising, travel, or wages for training employees. Capital expenses can be financed through loans from banks or other financial institutions.",Question 1: What are capital expenses? ; Question 2: How can capital expenses be financed? ; Question 3: Can capital expenses be used to purchase inventory?,"Answer 1: Capital expenses are the costs of getting started in business, such as advertising, travel, or wages for training employees. ; Answer 2: Capital expenses can be financed through loans from banks or other financial institutions. ; Answer 3: No, capital expenses cannot be used to purchase inventory."
...,...,...,...
92,### Topics,What are some topics discussed in this chapter?;,"Some topics discussed in this chapter are travel and non-entertainment-related meals, ribres and kickbacks, charitable contributions, education expenses, lobbying expenses, penalties and fines, repayments (claim of right), and other miscellaneous expenses."
93,"### Reimbursers A ""reimbursement or allowance arrangement"" provides for payment of advances, reimbursments, and allowances for travel and non-entertainment-related meals expenses incurred by your employees during the ordinary course of business. If the expenses are substantiated, you can deduct the allowable amount on your tax return. Because of differences between accounting methods and tax law, the amount you can deduct for tax purposes may not be the same as the amount you deduct on your business books and records. For example, you can deduct 100% of the cost of meals on your business books and records. However, only 50% of these costs are allowed by law as a tax deduction.",Question 1: How does one deduct a business expense under a reimbursement or allowance arrangement? ; Question 2: What is the difference between an accountable plan and a nonaccountable plan? ; Question 3: Can one deduct 100% of the cost of meals on their business books and records?,"Answer 1: Deducting a business expense under a reimbursement or allowance arrangement depends on whether you have an accountable plan or a nonaccountable plan. Under an accountable plan, deduct the expenses as travel and non-entertainment-related meals expenses. Under a nonaccountable plan, report the reimbursals as wages on Form W-2 and deduct them as wages on the appropriate line of your tax return. If you make a single payment that includes both wages and an expense reimbursement, specify the amount of the reimbursement and report it accordingly. See Table 11-1."
94,"### Adequate accounting. Your employees must adequately account to you for their travel and non-entertainment-related meals expenses. They must give you documentary evidence of their travel, mileage, and other employee business expenses. This evidence should include items such as receipts, along with either a statement of expenses, an account book, a day planner, or similar record in which the employee entered each expense at or near the time the expense was incurred.",Question 1: When do your employees need to provide documentation for their travel and non-entertainment related meals expenses?; Question 2: What type of evidence is required from your employees?; Question 3: Can your employees use electronic devices to keep track of their expenses?,"Answer 1: Your employees need to provide documentation for their travel and non-entertainment related meals expenses when they are reimbursed.; Answer 2: Receipts, along with either a statement of expenses, an account book, a day planner, or similar record in which the employee entered each expense at or near the time the expense was incurred.; Answer 3: No, electronic devices cannot be used to keep track of expenses."
95,"## The Taxpayer Advocate Service (TAS) Is Here To Help You With Your IRS Issues The Taxpayer Advocate Service (TAS) is a free service provided by the Internal Revenue Service (IRS). TAS helps taxpayers resolve their IRS issues. If you have an issue with the IRS, contact TAS at 1-800-121-7526 or visit www.irs.gov/advocacy.",Question 1: What is the purpose of the Taxpayer Advocate Service? ; Question 2: How can taxpayers contact the Taxpayer Advocate Service? ; Question 3: Where can taxpayers find more information about the Taxpayer Advocate Service?,Answer 1: The purpose of the Taxpayer Advocate Service is to help taxpayers resolve their IRS issues. ; Answer 2: Taxpayers can contact the Taxpayer Advocate Service at 1-800-121-7526. ; Answer 3: Taxpayers can find more information about the Taxpayer Advocate Service at www.irs.gov/advocacy.


In [35]:
output_df = df_unique[['Question', 'Answer']]

output_dir = 'data/output'

uniflow_output_path = f"{output_dir}/new_irs_QApairs.csv"

if not os.path.exists(output_dir):
    os.makedirs(output_dir)

output_df.to_csv(uniflow_output_path, index=False)