# Example of generating QAs for IRS
In this example, we show how to generate question-answer (QA) pairs from a PDF using Hugging Face models via `uniflow`'s `MultiFlowsPipeline`, which chains a PDF extraction step (`ExtractPDFConfig`) with a QA generation step (`TransformHuggingFaceConfig`).

For this example, we're using IRS Publication 535, stored locally as `IRS_2023.pdf`.

### Before running the code

You will need the `uniflow` conda environment to run this notebook. You can set up the environment by following the installation instructions: https://github.com/CambioML/uniflow/tree/main#installation.

Next, you will need a valid [OpenAI API key](https://platform.openai.com/api-keys) to run the code. Once you have the key, set it as the environment variable `OPENAI_API_KEY` within a `.env` file in the root directory of this repository. For more details, see these [instructions](https://github.com/CambioML/uniflow/tree/main#api-keys).

Finally, we are storing the PDF in the `data/raw_input` directory as `IRS_2023.pdf`, which is the filename the code below expects.
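Both prerequisites can be checked before running any cell. The following is a minimal sketch using only the standard library; `missing_prerequisites` is a hypothetical helper, not part of `uniflow`, and it assumes the key has already been loaded into the process environment (for example by `load_dotenv()`).

```python
import os


def missing_prerequisites(pdf_path, env_var="OPENAI_API_KEY", environ=None):
    """Return a list of human-readable problems; an empty list means all good.

    Hypothetical helper for illustration only, not part of uniflow.
    """
    environ = os.environ if environ is None else environ
    problems = []
    # The key must be present and non-empty in the environment.
    if not environ.get(env_var):
        problems.append(f"environment variable {env_var} is not set")
    # The input PDF must exist at the path the notebook will read from.
    if not os.path.isfile(pdf_path):
        problems.append(f"input file not found: {pdf_path}")
    return problems
```

Calling `missing_prerequisites(path_to_your_pdf)` and printing the returned list gives a quick pre-flight report without loading any model.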

### Update system path

In [1]:
%reload_ext autoreload
%autoreload 2

import sys

sys.path.append(".")
sys.path.append("..")
sys.path.append("../..")

### Install helper packages

In [2]:
!{sys.executable} -m pip install langchain pandas pypdf



### Import Dependency

In [3]:
from dotenv import load_dotenv
import os
import pandas as pd
from uniflow.flow.client import TransformClient
from uniflow.flow.config import TransformHuggingFaceConfig
from uniflow.op.model.model_config import HuggingfaceModelConfig
from langchain.document_loaders import PyPDFLoader
from uniflow.op.prompt_schema import Context, GuidedPrompt

from uniflow.node import Node
from uniflow.op.extract.load.pdf_op import ExtractPDFOp
from uniflow.op.model.llm_preprocessor import LLMDataPreprocessor

from uniflow.flow.config import ExtractPDFConfig
from uniflow.op.model.model_config import NougatModelConfig
from uniflow.flow.client import ExtractClient

from uniflow.op.extract.split.markdown_header_splitter import MarkdownHeaderSplitter

from uniflow.op.extract.split.splitter_factory import SplitterOpsFactory
from uniflow.op.extract.split.constants import MARKDOWN_HEADER_SPLITTER

from uniflow.pipeline import MultiFlowsPipeline
from uniflow.flow.config import PipelineConfig

load_dotenv()




False

### Prepare the input data
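First, we need to pre-process the PDF to get text chunks that we can feed into the model. The pipeline below does this with `ExtractPDFConfig`, using the Nougat model and a markdown header splitter. Earlier attempts with `ExtractPDFOp` and with `PyPDFLoader` from langchain are kept as commented-out cells.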

In [5]:
pdf_file = "IRS_2023.pdf"

##### Set current directory and input data directory.

In [6]:
dir_cur = os.getcwd()
input_file = os.path.join(dir_cur, "data", "raw_input", pdf_file)

##### Load and split the pdf

In [16]:
# initial_node = Node(name="initial_pdf_node", value_dict={"pdf_file_path": input_file})

# extract_pdf_op = ExtractPDFOp(
#             name="extract_pdf_op",
#             model=LLMDataPreprocessor(
#                 model_config = {
#                     "model_name": "mistralai/Mistral-7B-Instruct-v0.1",
#                     "batch_size": 1,
#                     "model_server": "HuggingfaceModelServer",
#                 }
#             ),
#         )

In [17]:
# extracted_nodes = extract_pdf_op([initial_node])

In [18]:
# print("Extracted Nodes:")
# print(extracted_nodes)

In [19]:
# if extracted_nodes:
#     print("\nStructure of the First Extracted Node:")
#     print(extracted_nodes[0].value_dict)
#     # Optionally, you can also loop through all keys to see their names and values
#     for key, value in extracted_nodes[0].value_dict.items():
#         print(f"Key: {key}, Value: {value}")
# else:
#     print("No nodes were extracted.")

In [11]:
# print(extracted_text)




In [20]:
# # Assuming extracted_nodes is a list of Node objects with extracted text

# # Function to print the content of each node
# def examine_extracted_data(nodes):
#     for i, node in enumerate(nodes):
#         # Extract the text content from each node
#         text_content = node.value_dict.get("text", "")

#         # Print the node number and its content
#         print(f"Node {i + 1}:")
#         print(text_content)
#         print("\n" + "-"*50 + "\n")  # Separator for readability

# # Call the function with the extracted nodes
# examine_extracted_data(extracted_nodes)


In [6]:
# SplitterOpsFactory.list()

['ParagraphSplitter', 'MarkdownHeaderSplitter']

In [6]:
data = [
    {"filename": input_file},
]

In [7]:
extract_config = ExtractPDFConfig(
    model_config=NougatModelConfig(
        model_name = "0.1.0-small",
        batch_size = 1 # When batch_size>1, nougat will run on CUDA, otherwise it will run on CPU
    ),
    splitter=MARKDOWN_HEADER_SPLITTER,
)
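To make the effect of a markdown header splitter concrete, here is a minimal, standalone sketch of the general idea: start a new chunk whenever a line begins with a markdown header. It is an illustration under that assumption only; `split_by_markdown_headers` is a hypothetical function and is not `uniflow`'s `MarkdownHeaderSplitter` implementation.

```python
import re


def split_by_markdown_headers(markdown_text):
    """Split text into chunks, starting a new chunk at every markdown header line.

    Illustrative only; the real splitter in uniflow may behave differently.
    """
    chunks, current = [], []
    for line in markdown_text.splitlines():
        # A header is one to six '#' characters followed by whitespace.
        if re.match(r"#{1,6}\s", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    # Drop chunks that ended up empty after stripping.
    return [chunk for chunk in chunks if chunk]


sample = "intro\n## Future Developments\nFor the latest information\n### Details\nmore"
for chunk in split_by_markdown_headers(sample):
    print(repr(chunk))
```

Each resulting chunk keeps its header as the first line, which matches the shape of the extracted chunks shown later in this notebook (for example, a chunk that starts with `## Future Developments`).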


In [9]:
guided_prompt = GuidedPrompt(
    instruction="Generate Q&A based on the context.", 
    examples=[
        Context(
            context="In 1948, Claude E. Shannon published A Mathematical Theory of\nCommunication (Shannon, 1948) establishing the theory of\ninformation. In his article, Shannon introduced the concept of\ninformation entropy for the first time. We will begin our journey here.",
            question="Who published A Mathematical Theory of Communication in 1948?",
            answer="Claude E. Shannon.",
        ),
])
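Judging by the raw responses printed later in this notebook, the prompt sent to the model appears to be laid out as an `instruction:` line, then each example as `context:` / `question:` / `answer:` lines, then the new `context:`. The sketch below reproduces that layout with plain dictionaries. `render_prompt` is a hypothetical function inferred from those outputs, not `uniflow`'s actual rendering code.

```python
def render_prompt(instruction, examples, new_context):
    """Build a few-shot prompt string in the layout seen in the responses.

    `examples` is a list of dicts with 'context', 'question' and 'answer' keys.
    Hypothetical helper; the layout is inferred, not taken from uniflow.
    """
    lines = [f"instruction: {instruction}"]
    for example in examples:
        lines.append(f"context: {example['context']}")
        lines.append(f"question: {example['question']}")
        lines.append(f"answer: {example['answer']}")
    # The new context comes last; the model is expected to continue from here.
    lines.append(f"context: {new_context}")
    return "\n".join(lines)


print(render_prompt(
    "Generate Q&A based on the context.",
    [{"context": "In 1948, Claude E. Shannon published ...",
      "question": "Who published A Mathematical Theory of Communication in 1948?",
      "answer": "Claude E. Shannon."}],
    "## Future Developments ...",
))
```

Seeing the layout this way helps explain why the example's question and answer show up inside every raw response: they are part of the prompt that the model echoes back.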

In [None]:
guided_prompt = GuidedPrompt(
    instruction="Generate Q&A based on the context. The format for the return should be .."
)

In [9]:
# nougat_client = ExtractClient(config)

  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]


In [10]:
# output = nougat_client.run(data)

  0%|          | 0/1 [00:00<?, ?it/s]

INFO: likely hallucinated title at the end of the page: ## Costs You Can Deduct or Capitalize Page 27


100%|██████████| 1/1 [08:45<00:00, 525.40s/it]


In [12]:
# p = output[0]['output'][0]['text'][0]

In [56]:
# output[0]['output'][0]['text'][0]["content"]

"**Publication 535**\n**Publication 535**\npublication or the _How To Get Tax Help_ section at the end of this publication, go to the IRS Interactive Tax Assistant page at _IRS.gov_ _Hela/ITA_ where you can find topics by using the search feature or viewing the categories listed.\n_Getting tax forms, instructions, and publications_. Go to _IRS.gov/Forms_ to download current and prior-year forms, instructions, and publications.\n_Ordering tax forms, instructions, and publications._ Go to _IRS.gov/OrderForms_ to order current forms, instructions, and publications; call 800-829-3676 to order prior-year forms and instructions. The IRS will process your order for forms and publications as soon as possible. **Don't** resmbur requests you already sent us. You can get forms and publications faster online."

In [15]:
# output

[{'output': [{'text': [{'content': "**Publication 535**\n**Publication 535**\npublication or the _How To Get Tax Help_ section at the end of this publication, go to the IRS Interactive Tax Assistant page at _IRS.gov_ _Hela/ITA_ where you can find topics by using the search feature or viewing the categories listed.\n_Getting tax forms, instructions, and publications_. Go to _IRS.gov/Forms_ to download current and prior-year forms, instructions, and publications.\n_Ordering tax forms, instructions, and publications._ Go to _IRS.gov/OrderForms_ to order current forms, instructions, and publications; call 800-829-3676 to order prior-year forms and instructions. The IRS will process your order for forms and publications as soon as possible. **Don't** resmbur requests you already sent us. You can get forms and publications faster online.",
      'metadata': {}},
     {'content': '## Future Developments\nFor the latest information about developments related to Pub. 535, such as legislation en

In [87]:
# print(output[0]['output'][0]['text'])

*


In [80]:
# print(len(p))

1


In [81]:
# print(output)

[{'output': [{'text': '**Publication 535**\n\n**Publication 535**\npublication or the _How To Get Tax Help_ section at the end of this publication, go to the IRS Interactive Tax Assistant page at _IRS.gov_ _Hela/ITA_ where you can find topics by using the search feature or viewing the categories listed.\n\n_Getting tax forms, instructions, and publications_. Go to _IRS.gov/Forms_ to download current and prior-year forms, instructions, and publications.\n\n_Ordering tax forms, instructions, and publications._ Go to _IRS.gov/OrderForms_ to order current forms, instructions, and publications; call 800-829-3676 to order prior-year forms and instructions. The IRS will process your order for forms and publications as soon as possible. **Don\'t** resmbur requests you already sent us. You can get forms and publications faster online.\n\n## Future Developments\n\nFor the latest information about developments related to Pub. 535, such as legislation enacted after it was published, go to _IRS.gov

In [52]:
# extracted_texts = []
# for item in output:
#     for text_dict in item['output']:
#         extracted_texts.append(text_dict['text'])

# nodes = [Node(name=f"Node_{i}", value_dict={"text": "\n".join(text)}) for i, text in enumerate(extracted_texts)]

# splitter = MarkdownHeaderSplitter(name="markdown_splitter")

# split_nodes = splitter(nodes)


In [49]:
# output[0]['output'][0]['text'][0]

# extracted_texts = []
# for item in output:
#     for text_dict in item['output']:
#         extracted_texts.append(text_dict['text'])

In [100]:
# extracted_texts = []
# for info in output[0]['output'][0]['text']:
#     # print(tmp['output'][0]['text'][0]['content'])
#     extracted_texts.append(info)

In [101]:
# print(len(extracted_texts))

197


In [102]:
# extracted_texts

[{'content': "**Publication 535**\n**Publication 535**\npublication or the _How To Get Tax Help_ section at the end of this publication, go to the IRS Interactive Tax Assistant page at _IRS.gov_ _Hela/ITA_ where you can find topics by using the search feature or viewing the categories listed.\n_Getting tax forms, instructions, and publications_. Go to _IRS.gov/Forms_ to download current and prior-year forms, instructions, and publications.\n_Ordering tax forms, instructions, and publications._ Go to _IRS.gov/OrderForms_ to order current forms, instructions, and publications; call 800-829-3676 to order prior-year forms and instructions. The IRS will process your order for forms and publications as soon as possible. **Don't** resmbur requests you already sent us. You can get forms and publications faster online.",
  'metadata': {}},
 {'content': '## Future Developments\nFor the latest information about developments related to Pub. 535, such as legislation enacted after it was published, 
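The nested indexing used in the cells above (`output[0]['output'][0]['text']`) can be wrapped in a small helper. This is a minimal sketch that assumes the structure printed above, where each `'text'` value is a list of `{'content': ..., 'metadata': ...}` dictionaries; `collect_chunks` is a hypothetical name, not a `uniflow` function.

```python
def collect_chunks(extract_output):
    """Flatten extraction output into one list of chunk dictionaries.

    Assumes the shape [{'output': [{'text': [{'content': str, 'metadata': dict}, ...]}]}].
    """
    chunks = []
    for item in extract_output:
        for entry in item.get("output", []):
            chunks.extend(entry.get("text", []))
    return chunks


# Tiny stand-in for the real extraction output.
demo_output = [{"output": [{"text": [
    {"content": "**Publication 535**", "metadata": {}},
    {"content": "## Future Developments", "metadata": {}},
]}]}]
print(len(collect_chunks(demo_output)))
```

If `'text'` were a single string instead (as in one of the earlier printed outputs), this helper would need adjusting, since `extend` would then add individual characters.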

In [85]:
# print(len(extracted_texts))

In [86]:
# for text in extracted_texts:
#     print("Extracted Text:")
#     print(text)
#     print("----")

In [87]:
# print(len(extracted_texts))

In [41]:
# splitter = MarkdownHeaderSplitter(name="my_markdown_splitter")

In [42]:
# nodes = [Node(name=f"Node_{i}", value_dict={"text": text}) for i, text in enumerate(extracted_texts)]

In [43]:
# split_nodes = splitter(nodes)

In [44]:
# print(len(split_nodes))

In [45]:
# for i, node in enumerate(split_nodes):
#     print(f"Section {i} Content:")
#     print(node.value_dict['text'])
#     print("----")

In [46]:
# print(len(split_nodes))

In [47]:
# for node in split_nodes:
#     print("Section Content:")
#     print(node.value_dict['text'])
#     print("----")

In [33]:
# loader = PyPDFLoader(input_file)
# pages = loader.load_and_split()
# page_contents = [page.page_content for page in pages]

In [34]:
# if os.path.exists(input_file):
#     print("File found:", input_file)
# else:
#     print("File not found. Please check the file path.")

# print(len(pages))

In [35]:
# for i, page in enumerate(pages[146:]):  # Adjust the number of pages to preview
#     print(f"Content of Page {i+1}:\n{page.page_content}\n\n---\n")

### Prepare sample prompts

First, we need to give the LLM sample prompts to follow. We do this by passing a list of `Context` examples to the `GuidedPrompt` class.

In [11]:
# guided_prompt = GuidedPrompt(
#     instruction="Generate Q&A based on the context.",
#     examples=[
#         Context(
#             context="In 1948, Claude E. Shannon published A Mathematical Theory of\nCommunication (Shannon, 1948) establishing the theory of\ninformation. In his article, Shannon introduced the concept of\ninformation entropy for the first time. We will begin our journey here.",
#             question="Who published A Mathematical Theory of Communication in 1948?",
#             answer="Claude E. Shannon.",
#         ),
# ])

Next, for the given `page_contents` above, we convert them to the `Context` class to be processed by `uniflow`.

In [11]:
# input_data = [ Context(context=p[:500]) for p in page_contents[6:16] if len(p) > 200]
# input_data = [ Context(context=p[:500]) for p in page_contents[:] if len(p) > 200]
# input_data

[Context(context="Contents\nIntroduction .................. 1\nWhat's New for 2022 ............. 2\nWhat's New for 2023 ............. 2\nReminders ................... 2\nChapter 1. Deducting \nBusiness Expenses .......... 3\nChapter 2. Employees' Pay ........ 8\nChapter 3. Rent Expense ........ 11\nChapter 4. Interest ............ 13\nChapter 5. Taxes ............. 18\nChapter 6. Insurance ........... 21\nChapter 7. Costs You Can Deduct \nor Capitalize .............. 25\nChapter 8. Amortization ......... 29\nChapter 9. Depletio"),
 Context(context='publication or the How To Get Tax Help  section \nat the end of this publication, go to the IRS In-\nteractive Tax Assistant page at IRS.gov/\nHelp/ITA  where you can find topics by using the \nsearch feature or viewing the categories listed.\nGetting tax forms, instructions, and pub-\nlications. Go to IRS.gov/Forms  to download \ncurrent and prior -year forms, instructions, and \npublications.\nOrdering tax forms, instructions, and \npublic

In [124]:
# # input_data = [ Context(context=p[:500]) for p in extracted_texts if len(p) > 200]


# input_data = [ Context(context=pair['content']) for pair in extracted_texts if len(pair['content']) > 50]

# print(len(input_data))
# # for pair in extracted_texts:
# #     print(len(pair['content']))

188
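The length filter in the commented-out cell above (keep chunks longer than 50 characters) reduced the 197 extracted chunks to 188. A minimal standalone sketch of that filter, using plain dictionaries in place of the real extraction output, could look like this; `select_contents` is a hypothetical helper name.

```python
def select_contents(chunks, min_chars=50):
    """Return the 'content' strings that are longer than `min_chars` characters.

    `chunks` is a list of dicts with a 'content' key, as in the extraction output.
    """
    return [chunk["content"] for chunk in chunks if len(chunk["content"]) > min_chars]


demo_chunks = [
    {"content": "## References", "metadata": {}},  # too short, dropped
    {"content": "## Future Developments\n" + "x" * 60, "metadata": {}},  # kept
]
print(len(select_contents(demo_chunks)))
```

Each selected string can then be wrapped as `Context(context=...)`, as in the original cell. The threshold trades coverage for quality: very short chunks are mostly bare headings and give the model little to ask about.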


In [125]:
# input_data = [ Context(context=p[:500]) for p in extracted_texts['content'] if len(p) > 200]

In [127]:
# print(len(input_data))

### Use LLM to generate data

In this example, we use the default LLM of `HuggingfaceModelConfig` to generate questions and answers. An earlier variant based on [OpenAIModelConfig](https://github.com/CambioML/uniflow/blob/main/uniflow/model/config.py#L17) is kept below as a commented-out cell.

We pass our `guided_prompt` to `TransformHuggingFaceConfig` so that our customized instruction and examples are used instead of the `uniflow` defaults.

In the commented-out OpenAI variant, `response_format` is set to `json_object` to get the response in `json` format instead of the `text` default.

In [11]:
# config = TransformOpenAIConfig(
#     guided_prompt_template=guided_prompt,
#     model_config=OpenAIModelConfig(response_format={"type": "json_object"}),
# )
# client = TransformClient(config)

In [12]:
# pip3 install transformers accelerate bitsandbytes scipy
# pip3 install lmqg spacy
# pip3 install accelerate

In [10]:
transform_config = TransformHuggingFaceConfig(
    guided_prompt_template=guided_prompt,
    model_config=HuggingfaceModelConfig(),
)

In [11]:
p = MultiFlowsPipeline(PipelineConfig(
    extract_config=extract_config,
    transform_config=transform_config,
))
output = p.run(data)

  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
Loading checkpoint shards: 100%|██████████| 2/2 [00:04<00:00,  2.19s/it]
  0%|          | 0/1 [00:00<?, ?it/s]

INFO: likely hallucinated title at the end of the page: ## Costs You Can Deduct or Capitalize Page 27


100%|██████████| 1/1 [08:35<00:00, 515.49s/it]
100%|██████████| 197/197 [36:36<00:00, 11.15s/it] 


In [35]:
# output = p.run(data)


In [12]:
output

[[{'output': [{'response': ["instruction: Generate Q&A based on the context.\ncontext: In 1948, Claude E. Shannon published A Mathematical Theory of\nCommunication (Shannon, 1948) establishing the theory of\ninformation. In his article, Shannon introduced the concept of\ninformation entropy for the first time. We will begin our journey here.\nquestion: Who published A Mathematical Theory of Communication in 1948?\nanswer: Claude E. Shannon.\ncontext: **Publication 535**\n**Publication 535**\npublication or the _How To Get Tax Help_ section at the end of this publication, go to the IRS Interactive Tax Assistant page at _IRS.gov_ _Hela/ITA_ where you can find topics by using the search feature or viewing the categories listed.\n_Getting tax forms, instructions, and publications_. Go to _IRS.gov/Forms_ to download current and prior-year forms, instructions, and publications.\n_Ordering tax forms, instructions, and publications._ Go to _IRS.gov/OrderForms_ to order current forms, instructi

In [16]:
output[0]

[{'output': [{'response': ["instruction: Generate Q&A based on the context.\ncontext: In 1948, Claude E. Shannon published A Mathematical Theory of\nCommunication (Shannon, 1948) establishing the theory of\ninformation. In his article, Shannon introduced the concept of\ninformation entropy for the first time. We will begin our journey here.\nquestion: Who published A Mathematical Theory of Communication in 1948?\nanswer: Claude E. Shannon.\ncontext: **Publication 535**\n**Publication 535**\npublication or the _How To Get Tax Help_ section at the end of this publication, go to the IRS Interactive Tax Assistant page at _IRS.gov_ _Hela/ITA_ where you can find topics by using the search feature or viewing the categories listed.\n_Getting tax forms, instructions, and publications_. Go to _IRS.gov/Forms_ to download current and prior-year forms, instructions, and publications.\n_Ordering tax forms, instructions, and publications._ Go to _IRS.gov/OrderForms_ to order current forms, instructio

In [None]:
# client = TransformClient(config)

Now we call the `run` method on the `client` object to execute the question-answer generation operation on the data shown above.

In [129]:
# output = client.run(input_data)

  0%|          | 0/188 [00:00<?, ?it/s]

100%|██████████| 188/188 [2:19:54<00:00, 44.65s/it]  


### Process the output

Let's take a look at the generated output. We need to do a little post-processing on the raw output.

In [131]:
# print(output[:3])

[{'output': [{'response': ['instruction: Generate Q&A based on the context.\ncontext: In 1948, Claude E. Shannon published A Mathematical Theory of\nCommunication (Shannon, 1948) establishing the theory of\ninformation. In his article, Shannon introduced the concept of\ninformation entropy for the first time. We will begin our journey here.\nquestion: Who published A Mathematical Theory of Communication in 1948?\nanswer: Claude E. Shannon.\ncontext: ### How Can You Learn About Your Taxpayer Rights?\nThe Taxpayer Bill of Rights describes 10 basic rights that all taxpayers have when dealing with the IRS. Go to _Taxpayer.ak/.pick/.pick_ to help you understand what these rights mean to you and how they apply. These are _your_ rights. Know them. Use them.\nYou can find a list of your rights and the IRS\'s obligations to protect them in _Pub.L.Y. Your Rights as a Taxayer_. It includes the following.\n1. **The Right To Be Informed.** Taxayers have the right to know what they need to do to com

In [132]:
# print(output[0])

{'output': [{'response': ['instruction: Generate Q&A based on the context.\ncontext: In 1948, Claude E. Shannon published A Mathematical Theory of\nCommunication (Shannon, 1948) establishing the theory of\ninformation. In his article, Shannon introduced the concept of\ninformation entropy for the first time. We will begin our journey here.\nquestion: Who published A Mathematical Theory of Communication in 1948?\nanswer: Claude E. Shannon.\ncontext: ### How Can You Learn About Your Taxpayer Rights?\nThe Taxpayer Bill of Rights describes 10 basic rights that all taxpayers have when dealing with the IRS. Go to _Taxpayer.ak/.pick/.pick_ to help you understand what these rights mean to you and how they apply. These are _your_ rights. Know them. Use them.\nYou can find a list of your rights and the IRS\'s obligations to protect them in _Pub.L.Y. Your Rights as a Taxayer_. It includes the following.\n1. **The Right To Be Informed.** Taxayers have the right to know what they need to do to comp

In [151]:
# print(len(output))

188


In [167]:
# output[0]['output'][0]['response'][0]

# for tmp in output:
#     print(tmp['output'][0]['error'])

'instruction: Generate Q&A based on the context.\ncontext: In 1948, Claude E. Shannon published A Mathematical Theory of\nCommunication (Shannon, 1948) establishing the theory of\ninformation. In his article, Shannon introduced the concept of\ninformation entropy for the first time. We will begin our journey here.\nquestion: Who published A Mathematical Theory of Communication in 1948?\nanswer: Claude E. Shannon.\ncontext: ### How Can You Learn About Your Taxpayer Rights?\nThe Taxpayer Bill of Rights describes 10 basic rights that all taxpayers have when dealing with the IRS. Go to _Taxpayer.ak/.pick/.pick_ to help you understand what these rights mean to you and how they apply. These are _your_ rights. Know them. Use them.\nYou can find a list of your rights and the IRS\'s obligations to protect them in _Pub.L.Y. Your Rights as a Taxayer_. It includes the following.\n1. **The Right To Be Informed.** Taxayers have the right to know what they need to do to comply with the tax laws. They

In [17]:
contexts = []
questions = []
answers = []

In [204]:
for item in output:
    for i in item.get('output', []):
        for response in i.get('response', []):
            print(response)
            parts = response.split('\n')
            # print(parts[1])
    break

instruction: Generate Q&A based on the context.
context: In 1948, Claude E. Shannon published A Mathematical Theory of
Communication (Shannon, 1948) establishing the theory of
information. In his article, Shannon introduced the concept of
information entropy for the first time. We will begin our journey here.
question: Who published A Mathematical Theory of Communication in 1948?
answer: Claude E. Shannon.
context: ### How Can You Learn About Your Taxpayer Rights?
The Taxpayer Bill of Rights describes 10 basic rights that all taxpayers have when dealing with the IRS. Go to _Taxpayer.ak/.pick/.pick_ to help you understand what these rights mean to you and how they apply. These are _your_ rights. Know them. Use them.
You can find a list of your rights and the IRS's obligations to protect them in _Pub.L.Y. Your Rights as a Taxayer_. It includes the following.
1. **The Right To Be Informed.** Taxayers have the right to know what they need to do to comply with the tax laws. They are entitle

In [168]:
# Extracting context, question, and answer into a DataFrame
contexts = []
questions = []
answers = []

for item in output:
    for i in item.get('output', []):
        for response in i.get('response', []):
            parts = response.split('\n')
            # Split each line on the first ': ' only, so values that contain ': ' are kept whole
            response_dict = dict(part.split(': ', 1) for part in parts if ': ' in part)

            if all(key in response_dict for key in ['context', 'question', 'answer']):
                contexts.append(response_dict['context'])
                questions.append(response_dict['question'])
                answers.append(response_dict['answer'])
            else:
                print("Missing context, question or answer in response:", response)

# Set display options
pd.set_option('display.max_colwidth', None)
pd.set_option('display.width', 1000)

df = pd.DataFrame({
    'Context': contexts,
    'Question': questions,
    'Answer': answers
})

styled_df = df.style.set_properties(**{'text-align': 'left'}).set_table_styles([{
    'selector': 'th',
    'props': [('text-align', 'left')]
}])
styled_df

| | Context | Question | Answer |
|---|---|---|---|
| 0 | `### How Can You Learn About Your Taxpayer Rights?` | Who published A Mathematical Theory of Communication in 1948? | Claude E. Shannon. |
| 1 | `### How Can You Learn About Your Taxpayer Rights?` | Who published A Mathematical Theory of Communication in 1948? | Claude E. Shannon. |
| 2 | `### How Can You Learn About Your Taxpayer Rights?` | Who published A Mathematical Theory of Communication in 1948? | Claude E. Shannon. |
| 3 | `### How Can You Learn About Your Taxpayer Rights?` | Who published A Mathematical Theory of Communication in 1948? | Claude E. Shannon. |
| 4 | `### How Can You Learn About Your Taxpayer Rights?` | Who published A Mathematical Theory of Communication in 1948? | Claude E. Shannon. |
| 5 | `### How Can You Learn About Your Taxpayer Rights?` | Who published A Mathematical Theory of Communication in 1948? | Claude E. Shannon. |
| 6 | `### How Can You Learn About Your Taxpayer Rights?` | Who published A Mathematical Theory of Communication in 1948? | Claude E. Shannon. |
| 7 | `### How Can You Learn About Your Taxpayer Rights?` | Who published A Mathematical Theory of Communication in 1948? | Claude E. Shannon. |
| 8 | `### How Can You Learn About Your Taxpayer Rights?` | Who published A Mathematical Theory of Communication in 1948? | Claude E. Shannon. |
| 9 | `### How Can You Learn About Your Taxpayer Rights?` | Who published A Mathematical Theory of Communication in 1948? | Claude E. Shannon. |


In [17]:
# # Extracting context, question, and answer into a DataFrame
# contexts = []
# questions = []
# answers = []

# for item in output:
#     for i in item.get('output', []):
#         for response in i.get('response', []):
#             if any(key not in response for key in ['context', 'question', 'answer']):
#                 print("Missing context, question or answer in response:", response)
#                 continue
#             contexts.append(response['context'])
#             questions.append(response['question'])
#             answers.append(response['answer'])

# # Set display options
# pd.set_option('display.max_colwidth', None)
# pd.set_option('display.width', 1000)

# df = pd.DataFrame({
#     'Context': contexts,
#     'Question': questions,
#     'Answer': answers
# })

# styled_df = df.style.set_properties(**{'text-align': 'left'}).set_table_styles([{
#     'selector': 'th',
#     'props': [('text-align', 'left')]
# }])
# styled_df

In [18]:
contexts = []
questions = []
answers = []

In [21]:
output

[[{'output': [{'response': ["instruction: Generate Q&A based on the context.\ncontext: In 1948, Claude E. Shannon published A Mathematical Theory of\nCommunication (Shannon, 1948) establishing the theory of\ninformation. In his article, Shannon introduced the concept of\ninformation entropy for the first time. We will begin our journey here.\nquestion: Who published A Mathematical Theory of Communication in 1948?\nanswer: Claude E. Shannon.\ncontext: **Publication 535**\n**Publication 535**\npublication or the _How To Get Tax Help_ section at the end of this publication, go to the IRS Interactive Tax Assistant page at _IRS.gov_ _Hela/ITA_ where you can find topics by using the search feature or viewing the categories listed.\n_Getting tax forms, instructions, and publications_. Go to _IRS.gov/Forms_ to download current and prior-year forms, instructions, and publications.\n_Ordering tax forms, instructions, and publications._ Go to _IRS.gov/OrderForms_ to order current forms, instructi

In [22]:
output[0]

[{'output': [{'response': ["instruction: Generate Q&A based on the context.\ncontext: In 1948, Claude E. Shannon published A Mathematical Theory of\nCommunication (Shannon, 1948) establishing the theory of\ninformation. In his article, Shannon introduced the concept of\ninformation entropy for the first time. We will begin our journey here.\nquestion: Who published A Mathematical Theory of Communication in 1948?\nanswer: Claude E. Shannon.\ncontext: **Publication 535**\n**Publication 535**\npublication or the _How To Get Tax Help_ section at the end of this publication, go to the IRS Interactive Tax Assistant page at _IRS.gov_ _Hela/ITA_ where you can find topics by using the search feature or viewing the categories listed.\n_Getting tax forms, instructions, and publications_. Go to _IRS.gov/Forms_ to download current and prior-year forms, instructions, and publications.\n_Ordering tax forms, instructions, and publications._ Go to _IRS.gov/OrderForms_ to order current forms, instructio

In [25]:
# print(output[0][0]['output'])

[{'response': ["instruction: Generate Q&A based on the context.\ncontext: In 1948, Claude E. Shannon published A Mathematical Theory of\nCommunication (Shannon, 1948) establishing the theory of\ninformation. In his article, Shannon introduced the concept of\ninformation entropy for the first time. We will begin our journey here.\nquestion: Who published A Mathematical Theory of Communication in 1948?\nanswer: Claude E. Shannon.\ncontext: **Publication 535**\n**Publication 535**\npublication or the _How To Get Tax Help_ section at the end of this publication, go to the IRS Interactive Tax Assistant page at _IRS.gov_ _Hela/ITA_ where you can find topics by using the search feature or viewing the categories listed.\n_Getting tax forms, instructions, and publications_. Go to _IRS.gov/Forms_ to download current and prior-year forms, instructions, and publications.\n_Ordering tax forms, instructions, and publications._ Go to _IRS.gov/OrderForms_ to order current forms, instructions, and publ

In [29]:
# output = output[0][0]

KeyError: 0

In [31]:
# output

{'output': [{'response': ["instruction: Generate Q&A based on the context.\ncontext: In 1948, Claude E. Shannon published A Mathematical Theory of\nCommunication (Shannon, 1948) establishing the theory of\ninformation. In his article, Shannon introduced the concept of\ninformation entropy for the first time. We will begin our journey here.\nquestion: Who published A Mathematical Theory of Communication in 1948?\nanswer: Claude E. Shannon.\ncontext: **Publication 535**\n**Publication 535**\npublication or the _How To Get Tax Help_ section at the end of this publication, go to the IRS Interactive Tax Assistant page at _IRS.gov_ _Hela/ITA_ where you can find topics by using the search feature or viewing the categories listed.\n_Getting tax forms, instructions, and publications_. Go to _IRS.gov/Forms_ to download current and prior-year forms, instructions, and publications.\n_Ordering tax forms, instructions, and publications._ Go to _IRS.gov/OrderForms_ to order current forms, instruction
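Reassigning `output = output[0][0]` changes the shape of `output`, so running that cell a second time fails with `KeyError: 0`. One way to avoid this is to leave `output` untouched and walk the nesting with a generator. This is a minimal sketch that assumes the pipeline output has the shape printed above (a list of lists of `{'output': [{'response': [...]}]}` dictionaries); `iter_responses` is a hypothetical helper, not part of `uniflow`.

```python
def iter_responses(pipeline_output):
    """Yield every response string from the nested pipeline output.

    Assumes the shape [[{'output': [{'response': [str, ...]}]}, ...], ...].
    """
    for batch in pipeline_output:
        for item in batch:
            for entry in item.get("output", []):
                yield from entry.get("response", [])


# Tiny stand-in for the real pipeline output.
demo_output = [[
    {"output": [{"response": ["first response"]}]},
    {"output": [{"response": ["second response"]}]},
]]
for response in iter_responses(demo_output):
    print(response)
```

Because the helper never mutates its input, the cell can be re-run any number of times with the same result.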

In [24]:
# Extracting context, question, and answer into a DataFrame
contexts = []
questions = []
answers = []

In [28]:
# output
# output[0]


In [34]:


for item in output[0]:
    for i in item.get('output', []):
        for response in i.get('response', []):
            parts = response.split('\n')
            # Split each line on the first ': ' only, so values that contain ': ' are kept whole
            response_dict = dict(part.split(': ', 1) for part in parts if ': ' in part)

            if all(key in response_dict for key in ['context', 'question', 'answer']):
                contexts.append(response_dict['context'])
                questions.append(response_dict['question'])
                answers.append(response_dict['answer'])
            else:
                print("Missing context, question or answer in response:", response)

# Set display options
pd.set_option('display.max_colwidth', None)
pd.set_option('display.width', 1000)

df = pd.DataFrame({
    'Context': contexts,
    'Question': questions,
    'Answer': answers
})

In [35]:
df

| | Context | Question | Answer |
|---|---|---|---|
| 0 | `**Publication 535**` | Who published A Mathematical Theory of Communication in 1948? | Claude E. Shannon. |
| 1 | `## Future Developments` | What is the website where you can find the latest information about developments related to Pub. 535? | The website is `_IRS.gov/Pub.535_`. |
| 2 | `## What's New for 2022` | Who published A Mathematical Theory of Communication in 1948? | Claude E. Shannon. |
| 3 | `## References` | Who published A Mathematical Theory of Communication in 1948? | Claude E. Shannon. |
| 4 | `## Capital Expenses` | What are the three types of costs that businesses typically capitalize? | Business startup costs, business assets, and improvements. |
| ... | ... | ... | ... |
| 191 | `### Reasonable period of time.` | Who published A Mathematical Theory of Communication in 1948? | Claude E. Shannon. |
| 192 | `## Per Diem and Car Allowances` | Who published A Mathematical Theory of Communication in 1948? | Claude E. Shannon. |
| 193 | `## The Taxpayer Advocate Service (TAS) Is Here To Help You With Your IRS Problems` | What is the Taxpayer Advocate Service (TAS)? | The Taxpayer Advocate Service (TAS) is a free service provided by the Internal Revenue Service (IRS) to help taxpayers resolve their IRS problems. |
| 194 | `### What Is TAS?` | What does TAS do? | TAS helps taxpayers and protects taxpayer rights by ensuring fair treatment and helping resolve disputes with the IRS. |


In [36]:
output_df = df[['Question', 'Answer']]

In [37]:
output_dir = 'data/output'

In [39]:
uniflow_output_path = f"{output_dir}/new_irs_QApairs.csv"


In [40]:
os.makedirs(output_dir, exist_ok=True)

In [41]:
output_df.to_csv(uniflow_output_path, index=False)

In [42]:
len(df)

196

In [47]:
question_to_exclude = "Who published A Mathematical Theory of Communication in 1948?"

In [48]:
df_filtered = df[df['Question'] != question_to_exclude]
len(df_filtered)

123
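Rather than hard-coding the question to exclude, one option is to let pandas report which question repeats most often. The sketch below uses a toy DataFrame (made-up rows, not the notebook's output) and `Series.value_counts`, which sorts counts in descending order by default.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({"Question": ["Q1?", "Q2?", "Q1?", "Q3?", "Q1?"]})

# Count how many times each question appears, most frequent first.
counts = toy["Question"].value_counts()
most_common_question = counts.index[0]

# Drop every row that carries the most frequent question.
toy_filtered = toy[toy["Question"] != most_common_question]
print(most_common_question, len(toy_filtered))
```

Inspecting `counts` before filtering is a cheap sanity check: a question that dominates the table is usually a parsing artifact rather than a genuine QA pair.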

In [49]:
df_filtered

Unnamed: 0,Context,Question,Answer
1,## Future Developments,What is the website where you can find the latest information about developments related to Pub. 535?,The website is _IRS.gov/Pub.535_.
4,## Capital Expenses,What are the three types of costs that businesses typically capitalize?,"Business startup costs, business assets, and improvements."
5,### Cost recovery,What are the three ways to recover the amount spent on a capital expense?,"Depreciation, Amortization, and Depletion."
6,## Going Into Business,What are capital expenses?,Costs that businesses incur to get started before they start making money.
7,### If you go into business,What should be treated as capital expenses when starting a business?,"Costs that are necessary to start and run a business should be treated as capital expenses. These may include land, buildings, equipment, furniture, fixtures, and other assets used in the operation of the business. Additionally, any money spent on advertising, marketing, legal fees, and other professional services related to starting the business should also be considered capital expenses."
...,...,...,...
188,### Accountable Plans,What does an accountable plan require from its employees?,"An accountable plan requires each employee to have paid or incurred deductible expenses while performing services as your employee, adequately account to you for these expenses within a reasonable period of time, and return any excess reimbursement or allowance within a reasonable period of time. Additionally, it requires that any advances made to employees be reasonably calculated not to exceed the amount of anticipated expenses and made within a reasonable period of time of their paying or incurring the expense. If expenses reimbursed under this arrangement aren't substantiated or an excess reimbursement isn't returned within a reasonable period of time by an employee, they cannot be treated as reimbursed under an accountable plan."
189,### Adequate accounting.,What is adequate accounting when it comes to employee travel and meal expenses?,"Adequate accounting requires that employees provide documentary evidence of their travel and non-entertainment related meals expenses, including receipts, along with either a statement of expenses, an account book, a day planner, or similar record in which the employee entered each expense at or near the time the expense was incurred. Employees must keep this evidence for at least one year after the date of the expense."
190,### Excess reimbursement or allowance.,What happens if an employee fails to return excess reimbursement or allowance?,You may have to treat it as taxable income and report it on your Form W-2.
193,## The Taxpayer Advocate Service (TAS) Is Here To Help You With Your IRS Problems,What is the Taxpayer Advocate Service (TAS)?,The Taxpayer Advocate Service (TAS) is a free service provided by the Internal Revenue Service (IRS) to help taxpayers resolve their IRS problems.


In [55]:
df_filtered = df_filtered.drop_duplicates()

In [56]:
len(df_filtered)

104
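`drop_duplicates()` with no arguments only removes rows where every column matches, including `Context`. Since only `Question` and `Answer` are exported later, it may be worth deduplicating on those two columns instead. A minimal sketch on toy rows (made up for illustration), comparing both variants:

```python
import pandas as pd

# Toy rows for illustration only; not taken from the notebook's output.
toy = pd.DataFrame({
    "Context": ["## A", "## B", "## B", "## C"],
    "Question": ["Q1?", "Q2?", "Q2?", "Q1?"],
    "Answer": ["A1", "A2", "A2", "A1"],
})

# Removes only rows that are identical in all three columns.
full_dedup = toy.drop_duplicates()

# Removes rows that repeat the same Question/Answer pair, whatever the Context.
qa_dedup = toy.drop_duplicates(subset=["Question", "Answer"])

print(len(full_dedup), len(qa_dedup))
```

Whether the stricter `subset` variant is appropriate depends on whether the same QA pair under different contexts should count as one example.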

In [59]:
df_filtered

Unnamed: 0,Context,Question,Answer
1,## Future Developments,What is the website where you can find the latest information about developments related to Pub. 535?,The website is _IRS.gov/Pub.535_.
4,## Capital Expenses,What are the three types of costs that businesses typically capitalize?,"Business startup costs, business assets, and improvements."
5,### Cost recovery,What are the three ways to recover the amount spent on a capital expense?,"Depreciation, Amortization, and Depletion."
6,## Going Into Business,What are capital expenses?,Costs that businesses incur to get started before they start making money.
7,### If you go into business,What should be treated as capital expenses when starting a business?,"Costs that are necessary to start and run a business should be treated as capital expenses. These may include land, buildings, equipment, furniture, fixtures, and other assets used in the operation of the business. Additionally, any money spent on advertising, marketing, legal fees, and other professional services related to starting the business should also be considered capital expenses."
...,...,...,...
187,### Reimbursers,What is a reimbursement or allowance arrangement?,"A reimbursement or allowance arrangement provides for payment of advances, reimbursements, and allowances for travel and non-entertainment-related meals expenses incurred by your employees during the ordinary course of business."
189,### Adequate accounting.,What is adequate accounting when it comes to employee travel and meal expenses?,"Adequate accounting requires that employees provide documentary evidence of their travel and non-entertainment related meals expenses, including receipts, along with either a statement of expenses, an account book, a day planner, or similar record in which the employee entered each expense at or near the time the expense was incurred. Employees must keep this evidence for at least one year after the date of the expense."
190,### Excess reimbursement or allowance.,What happens if an employee fails to return excess reimbursement or allowance?,You may have to treat it as taxable income and report it on your Form W-2.
193,## The Taxpayer Advocate Service (TAS) Is Here To Help You With Your IRS Problems,What is the Taxpayer Advocate Service (TAS)?,The Taxpayer Advocate Service (TAS) is a free service provided by the Internal Revenue Service (IRS) to help taxpayers resolve their IRS problems.


In [50]:
output_filtered = df_filtered[['Question', 'Answer']]

In [53]:
tmp_path = f"{output_dir}/tmp.csv"

In [54]:
output_filtered.to_csv(tmp_path, index=False)

In [None]:
# styled_df = df.style.set_properties(**{'text-align': 'left'}).set_table_styles([{
#     'selector': 'th',
#     'props': [('text-align', 'left')]
# }])
# styled_df

Finally, we can save the question-answer pairs to a CSV file.

In [21]:
output_df = df[['Question', 'Answer']]

output_dir = 'data/output'

os.makedirs(output_dir, exist_ok=True)

output_df.to_csv(f"{output_dir}/IRS_2023_QApairs.csv", index=False)
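To confirm that the export round-trips cleanly, the file can be read back with `pd.read_csv`. The sketch below writes a toy DataFrame to a temporary directory so it does not touch `data/output`; the file name `toy_QApairs.csv` and the rows are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Toy data for illustration only; the comma checks that quoting survives the round trip.
toy = pd.DataFrame({"Question": ["Q1?"], "Answer": ["A1, with a comma"]})

with tempfile.TemporaryDirectory() as tmp_root:
    out_dir = os.path.join(tmp_root, "data", "output")
    # exist_ok=True makes the call safe to repeat when the directory already exists.
    os.makedirs(out_dir, exist_ok=True)

    csv_path = os.path.join(out_dir, "toy_QApairs.csv")
    toy.to_csv(csv_path, index=False)

    # Read the file back while the temporary directory still exists.
    round_trip = pd.read_csv(csv_path)

print(round_trip)
```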