# Example of generating QAs for IRS
In this example, we will show you how to generate question-answer pairs (QAs) from a PDF using Hugging Face models via `uniflow`'s `MultiFlowsPipeline`, which chains a PDF extraction flow and a Hugging Face transform flow.

For this example, we're using an IRS publication saved as `IRS_2023.pdf` (the extracted text further down identifies it as Publication 535).

### Before running the code

You will need the `uniflow` conda environment to run this notebook. You can set up the environment by following the installation instructions: https://github.com/CambioML/uniflow/tree/main#installation.

Next, you will need a valid [OpenAI API key](https://platform.openai.com/api-keys) to run code that calls OpenAI models. Once you have the key, set it as the environment variable `OPENAI_API_KEY` within a `.env` file in the root directory of this repository. For more details, see these [instructions](https://github.com/CambioML/uniflow/tree/main#api-keys). The pipeline in this notebook uses a Hugging Face model, so the key only matters if you switch to an OpenAI model config.

Finally, we are storing the IRS PDF in the `data/raw_input` directory as `IRS_2023.pdf`.

### Update system path

In [1]:
%reload_ext autoreload
%autoreload 2

import sys

sys.path.append(".")
sys.path.append("..")
sys.path.append("../..")

### Install helper packages

In [2]:
!{sys.executable} -m pip install langchain pandas pypdf



### Import Dependency

In [3]:
from dotenv import load_dotenv
import os
import pandas as pd
from uniflow.flow.client import TransformClient
from uniflow.flow.config import TransformHuggingFaceConfig
from uniflow.op.model.model_config import HuggingfaceModelConfig
from langchain.document_loaders import PyPDFLoader
from uniflow.op.prompt_schema import Context, GuidedPrompt

# from uniflow.node import Node
# from uniflow.op.extract.load.pdf_op import ExtractPDFOp
# from uniflow.op.model.llm_preprocessor import LLMDataPreprocessor

from uniflow.flow.config import ExtractPDFConfig
from uniflow.op.model.model_config import NougatModelConfig
# from uniflow.flow.client import ExtractClient

from uniflow.op.extract.split.markdown_header_splitter import MarkdownHeaderSplitter

# from uniflow.op.extract.split.splitter_factory import SplitterOpsFactory
from uniflow.op.extract.split.constants import MARKDOWN_HEADER_SPLITTER

from uniflow.pipeline import MultiFlowsPipeline
from uniflow.flow.config import PipelineConfig

load_dotenv()




False
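The `False` above is the return value of `load_dotenv()`, which signals that no variables were loaded from a `.env` file. That did not stop the Hugging Face pipeline in this notebook from running, but it would matter for an OpenAI-backed config. A minimal sketch of an explicit check, using only the standard library (`has_api_key` is a hypothetical helper, not part of `uniflow` or `python-dotenv`):

```python
import os


def has_api_key(name: str = "OPENAI_API_KEY") -> bool:
    """Return True if the named environment variable is set and non-empty."""
    return bool(os.environ.get(name, "").strip())


if not has_api_key():
    # Hypothetical message; adjust to your own setup.
    print("OPENAI_API_KEY is not set; only needed for OpenAI-backed configs.")
```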

### Prepare the input data
First, we need to pre-process the PDF to get text chunks that we can feed into the model. We will use Nougat, configured through `ExtractPDFConfig`, to convert the PDF to markdown and then split it on markdown headers.

In [4]:
pdf_file = "IRS_2023.pdf"

##### Set current directory and input data directory.

In [5]:
dir_cur = os.getcwd()
input_file = os.path.join(dir_cur, "data", "raw_input", pdf_file)
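Since the extraction step takes several minutes, it can help to fail early when the PDF is not where the notebook expects it. The following is a minimal sketch of such a check; `resolve_input` is a hypothetical helper introduced here for illustration, and the `data/raw_input` layout is the one used above.

```python
from pathlib import Path


def resolve_input(base_dir: str, file_name: str) -> Path:
    """Build <base_dir>/data/raw_input/<file_name> and fail early if it is missing."""
    path = Path(base_dir) / "data" / "raw_input" / file_name
    if not path.is_file():
        raise FileNotFoundError(f"Expected the PDF at {path}")
    return path
```

Calling `resolve_input(os.getcwd(), "IRS_2023.pdf")` before building `data` would surface a wrong working directory immediately rather than partway through the pipeline.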

#### Load the pdf using Nougat

In [6]:
data = [
    {"filename": input_file},
]

In [7]:
extract_config = ExtractPDFConfig(
    model_config=NougatModelConfig(
        model_name="0.1.0-small",
        # When batch_size > 1, Nougat runs on CUDA; otherwise it runs on CPU.
        batch_size=1,
    ),
    splitter=MARKDOWN_HEADER_SPLITTER,
)
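To give an intuition for what splitting on markdown headers means, here is a small standalone sketch. It is not `uniflow`'s `MarkdownHeaderSplitter` implementation; `split_on_markdown_headers` is a hypothetical function that simply starts a new chunk at every header line, which is consistent with the contexts shown later that begin with `##` or `###`.

```python
import re


def split_on_markdown_headers(markdown: str) -> list[str]:
    """Split text into chunks, starting a new chunk at each markdown header line."""
    chunks, current = [], []
    for line in markdown.splitlines():
        # A header is 1-6 '#' characters followed by whitespace.
        if re.match(r"^#{1,6}\s", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [chunk for chunk in chunks if chunk]


sample = (
    "## Capital Expenses\nYou must capitalize some costs.\n"
    "### Cost recovery\nYou may recover the amount."
)
for chunk in split_on_markdown_headers(sample):
    print(repr(chunk))
```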


#### Set up prompt

In [8]:
guided_prompt = GuidedPrompt(
    instruction="Generate Q&A based on the context.",
    examples=[
        Context(
            context="In 1948, Claude E. Shannon published A Mathematical Theory of\nCommunication (Shannon, 1948) establishing the theory of\ninformation. In his article, Shannon introduced the concept of\ninformation entropy for the first time. We will begin our journey here.",
            question="Who published A Mathematical Theory of Communication in 1948?",
            answer="Claude E. Shannon.",
        ),
])
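The post-processing later in this notebook parses responses as `context:`, `question:`, and `answer:` lines, which suggests the few-shot example is rendered into the prompt in a similar key-value layout. The sketch below illustrates that idea only; `render_prompt` is a hypothetical function and the exact template `uniflow` uses may differ.

```python
def render_prompt(instruction: str, examples: list[dict], context: str) -> str:
    """Join the instruction, worked examples, and the new context into one prompt string."""
    parts = [instruction]
    for example in examples:
        parts.append("\n".join(f"{key}: {value}" for key, value in example.items()))
    # The final block supplies only the context; the model is expected to continue
    # with "question: ..." and "answer: ..." lines.
    parts.append(f"context: {context}")
    return "\n\n".join(parts)


prompt = render_prompt(
    "Generate Q&A based on the context.",
    [{"context": "c1", "question": "q1", "answer": "a1"}],
    "c2",
)
print(prompt)
```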

In [8]:
# guided_prompt = GuidedPrompt(
#     instruction="Generate 1 question and the corresponding answer based on the context, following the JSON format with Question and Answer as two necessary keys. Calling output['context'], output['question'] and output['answer'] should return the corresponding context, question, and answer.",
#     examples=[]
# )

### Use LLM to generate data

In this example, we will use the default LLM of `HuggingfaceModelConfig` to generate questions and answers.

Here, we pass our `guided_prompt` to `TransformHuggingFaceConfig` so that the model uses our customized instruction and examples instead of the `uniflow` defaults.

The model returns plain text rather than JSON, so each response is parsed into `context`, `question`, and `answer` fields in the post-processing step.

In [9]:
transform_config = TransformHuggingFaceConfig(
    guided_prompt_template=guided_prompt,
    model_config=HuggingfaceModelConfig(),
)

In [10]:
p = MultiFlowsPipeline(PipelineConfig(
    extract_config=extract_config,
    transform_config=transform_config,
))
output = p.run(data)

Loading checkpoint shards: 100%|██████████| 2/2 [00:04<00:00,  2.20s/it]
  0%|          | 0/1 [00:00<?, ?it/s]

INFO: likely hallucinated title at the end of the page: ## Costs You Can Deduct or Capitalize Page 27


100%|██████████| 1/1 [08:36<00:00, 516.51s/it]
100%|██████████| 197/197 [36:50<00:00, 11.22s/it] 


In [128]:
# output

Above, we called the `run` method on the pipeline object `p`, which first extracts and splits the PDF and then runs question-answer generation on each resulting chunk.

### Process the output

Let's take a look at the generated output. We need to do a little post-processing on the raw output.

In [104]:
print(len(output[0]))

197
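Judging from the parsing loop below, each of the 197 items holds an `output` list whose entries carry a `response` list of strings. A minimal sketch of walking that nesting on mock data follows; `count_responses` and `mock_output` are hypothetical and only mirror the keys the notebook itself reads.

```python
def count_responses(flow_output: list[dict]) -> int:
    """Count response strings across all items, tolerating missing keys."""
    total = 0
    for item in flow_output:
        for step in item.get("output", []):
            total += len(step.get("response", []))
    return total


# Mock data shaped like the keys accessed in the notebook's parsing loop.
mock_output = [
    {"output": [{"response": ["context: a\nquestion: b\nanswer: c"]}]},
    {"output": [{"response": []}]},
    {},
]
print(count_responses(mock_output))
```

On the real run, `count_responses(output[0])` would show how many raw responses exist before any are skipped by the filters below.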


In [109]:
# Extracting context, question, and answer into a DataFrame
contexts = []
questions = []
answers = []

In [110]:
for item in output[0]:
    for i in item.get('output', []):
        for response in i.get('response', []):
            parts = response.split('\n')
            response_dict = {}
            last_key = None

            for part in parts:
                if ":" in part:
                    # Split on the first colon, regardless of whether there's a space after it
                    key, value = part.split(":", 1)
                    key = key.strip()  
                    value = value.strip()  
                    response_dict[key] = value
                    last_key = key
                elif last_key is not None:
                    response_dict[last_key] += " " + part
            
            if any(key not in response_dict for key in ['context', 'question', 'answer']):
                print("[WARNING] Missing context, question or answer in response, skipping:\n", response)
                continue
            if "Claude E. Shannon" in response_dict['answer']:
                print("[WARNING] Used example context, skipping:\n", response_dict["context"])
                continue
            contexts.append(response_dict['context'])
            questions.append(response_dict['question'])
            answers.append(response_dict['answer'])

pd.set_option('display.max_colwidth', None)
pd.set_option('display.width', 1000)

print(len(contexts))
print(len(questions))
print(len(answers))

df = pd.DataFrame({
    'Context': contexts,
    'Question': questions,
    'Answer': answers
})
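The parsing rule in the loop above can be isolated into a small function, which makes it easier to test on a single response. This is a sketch of the same logic under the same assumptions (one `key: value` pair per line, with colon-free lines treated as continuations); `parse_response` is a name introduced here for illustration.

```python
def parse_response(response: str) -> dict[str, str]:
    """Parse 'key: value' lines; lines without a colon continue the previous value."""
    fields: dict[str, str] = {}
    last_key = None
    for line in response.split("\n"):
        if ":" in line:
            # Split on the first colon only, so values may contain colons.
            key, value = line.split(":", 1)
            last_key = key.strip()
            fields[last_key] = value.strip()
        elif last_key is not None:
            fields[last_key] += " " + line
    return fields


sample = (
    "context: Capital expenses are assets\nin your business.\n"
    "question: What are capital expenses?\n"
    "answer: Assets in your business."
)
parsed = parse_response(sample)
print(parsed["question"])
```

One limitation worth knowing: a continuation line that happens to contain a colon is treated as a new key rather than as part of the previous value, so some multi-line contexts may be truncated.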


In [107]:
df

Index,Context,Question,Answer
0,"## Future Developments For the latest information about developments related to Pub. 535, such as legislation enacted after it was published, go to _IRS.gov/Pub.535_.",What is the website where you can find the latest information about developments related to Pub. 535?,The website is _IRS.gov/Pub.535_.
1,"## Capital Expenses You must capitalize, rather than deduct, some costs. These costs are a part of your investment in your business and are called 'capital expenses.' Capital expenses are considered assets in your business. In general, you capitalize three types of costs. * Business startup costs (see _Tip_ below). * Business assets. * Improvements. You can elect to deduct or amortize certain business startup costs. See chapters 7 and 8.",What are the three types of costs that businesses typically capitalize?,"Business startup costs, business assets, and improvements."
2,"### Cost recovery Although you generally cannot take a current deduction for a capital expense, you may be able to recover the amount you spend through deprecation, amortization, or depletion. These recovery methods allow you to deduct part of your cost each year. In this way, you are able to recover your capital expense. See _Amortization_ (chapter 8) and _Dealization_ (chapter 9) in this publication. A taxpayer can elect to deduct a portion of the costs of certain depreciable property as a section 179 deduction. A greater portion of these costs can be deducted if the property is qualified disaster assistance property. See Pub. 946 for details.",What are the three ways to recover the amount spent on a capital expense?,"Depreciation, Amortization, and Depletion."
3,"## Going Into Business The costs of getting started in business, before you actually begin business operations, are capital expenses. These costs may include expenses for advertising, travel, or wages for training employees. Capital expenses can be financed through loans from banks or other financial institutions.",What are capital expenses?,Costs that businesses incur to get started before they start making money.
4,"### If you go into business When you go into business, treat all costs you had to get your business started as capital expenses. Usually, you recover costs for a particular asset through depreciation. Generally, you cannot recover other costs until you sell the business or otherwise go out of business. However, you can choose to amortize certain costs for setting up your business. See _Starting a Business in chapter 8_ for more information on business startup costs.",What should be treated as capital expenses when starting a business?,"Costs that are necessary to start and run a business should be treated as capital expenses. These may include land, buildings, equipment, furniture, fixtures, and other assets used in the operation of the business. Additionally, any money spent on advertising, marketing, legal fees, and other professional services related to starting the business should also be considered capital expenses."
...,...,...,...
116,### Accountable Plans,What does an accountable plan require from its employees?,"An accountable plan requires each employee to have paid or incurred deductible expenses while performing services as your employee, adequately account to you for these expenses within a reasonable period of time, and return any excess reimbursement or allowance within a reasonable period of time. Additionally, it requires that any advances made to employees be reasonably calculated not to exceed the amount of anticipated expenses and made within a reasonable period of time of their paying or incurring the expense. If expenses reimbursed under this arrangement aren't substantiated or an excess reimbursement isn't returned within a reasonable period of time by an employee, they cannot be treated as reimbursed under an accountable plan."
117,"### Adequate accounting. Your employees must adequately account to you for their travel and non-entertainment-related meals expenses. They must give you documentary evidence of their travel, mileage, and other employee business expenses. This evidence should include items such as receipts, along with either a statement of expenses, an account book, a day planner, or similar record in which the employee entered each expense at or near the time the expense was incurred. The evidence should be kept by your employees for at least one year after the date of the expense.",What is adequate accounting when it comes to employee travel and meal expenses?,"Adequate accounting requires that employees provide documentary evidence of their travel and non-entertainment related meals expenses, including receipts, along with either a statement of expenses, an account book, a day planner, or similar record in which the employee entered each expense at or near the time the expense was incurred. Employees must keep this evidence for at least one year after the date of the expense."
118,"### Excess reimbursement or allowance. An excess reimbursement or allowance is any amount you pay to an employee that is more than the business-related expenses for which the employee adequately accounted. The employee must return any excess reimbursement or other expense allowance to you within a reasonable period of time. If the employee fails to do so, you may have to treat it as taxable income and report it on your Form W-2.",What happens if an employee fails to return excess reimbursement or allowance?,You may have to treat it as taxable income and report it on your Form W-2.
119,## The Taxpayer Advocate Service (TAS) Is Here To Help You With Your IRS Problems,What is the Taxpayer Advocate Service (TAS)?,The Taxpayer Advocate Service (TAS) is a free service provided by the Internal Revenue Service (IRS) to help taxpayers resolve their IRS problems.


In [126]:
df_unique = df.drop_duplicates(subset=['Question', 'Answer'])
df_unique

Index,Context,Question,Answer
0,"## Future Developments For the latest information about developments related to Pub. 535, such as legislation enacted after it was published, go to _IRS.gov/Pub.535_.",What is the website where you can find the latest information about developments related to Pub. 535?,The website is _IRS.gov/Pub.535_.
1,"## Capital Expenses You must capitalize, rather than deduct, some costs. These costs are a part of your investment in your business and are called 'capital expenses.' Capital expenses are considered assets in your business. In general, you capitalize three types of costs. * Business startup costs (see _Tip_ below). * Business assets. * Improvements. You can elect to deduct or amortize certain business startup costs. See chapters 7 and 8.",What are the three types of costs that businesses typically capitalize?,"Business startup costs, business assets, and improvements."
2,"### Cost recovery Although you generally cannot take a current deduction for a capital expense, you may be able to recover the amount you spend through deprecation, amortization, or depletion. These recovery methods allow you to deduct part of your cost each year. In this way, you are able to recover your capital expense. See _Amortization_ (chapter 8) and _Dealization_ (chapter 9) in this publication. A taxpayer can elect to deduct a portion of the costs of certain depreciable property as a section 179 deduction. A greater portion of these costs can be deducted if the property is qualified disaster assistance property. See Pub. 946 for details.",What are the three ways to recover the amount spent on a capital expense?,"Depreciation, Amortization, and Depletion."
3,"## Going Into Business The costs of getting started in business, before you actually begin business operations, are capital expenses. These costs may include expenses for advertising, travel, or wages for training employees. Capital expenses can be financed through loans from banks or other financial institutions.",What are capital expenses?,Costs that businesses incur to get started before they start making money.
4,"### If you go into business When you go into business, treat all costs you had to get your business started as capital expenses. Usually, you recover costs for a particular asset through depreciation. Generally, you cannot recover other costs until you sell the business or otherwise go out of business. However, you can choose to amortize certain costs for setting up your business. See _Starting a Business in chapter 8_ for more information on business startup costs.",What should be treated as capital expenses when starting a business?,"Costs that are necessary to start and run a business should be treated as capital expenses. These may include land, buildings, equipment, furniture, fixtures, and other assets used in the operation of the business. Additionally, any money spent on advertising, marketing, legal fees, and other professional services related to starting the business should also be considered capital expenses."
...,...,...,...
115,"### Reimbursers A ""reimbursement or allowance arrangement"" provides for payment of advances, reimbursments, and allowances for travel and non-entertainment-related meals expenses incurred by your employees during the ordinary course of business. If the expenses are substantiated, you can deduct the allowable amount on your tax return. Because of differences between accounting methods and tax law, the amount you can deduct for tax purposes may not be the same as the amount you deduct on your business books and records. For example, you can deduct 100% of the cost of meals on your business books and records. However, only 50% of these costs are allowed by law as a tax deduction.",What is a reimbursement or allowance arrangement?,"A reimbursement or allowance arrangement provides for payment of advances, reimbursements, and allowances for travel and non-entertainment-related meals expenses incurred by your employees during the ordinary course of business."
117,"### Adequate accounting. Your employees must adequately account to you for their travel and non-entertainment-related meals expenses. They must give you documentary evidence of their travel, mileage, and other employee business expenses. This evidence should include items such as receipts, along with either a statement of expenses, an account book, a day planner, or similar record in which the employee entered each expense at or near the time the expense was incurred. The evidence should be kept by your employees for at least one year after the date of the expense.",What is adequate accounting when it comes to employee travel and meal expenses?,"Adequate accounting requires that employees provide documentary evidence of their travel and non-entertainment related meals expenses, including receipts, along with either a statement of expenses, an account book, a day planner, or similar record in which the employee entered each expense at or near the time the expense was incurred. Employees must keep this evidence for at least one year after the date of the expense."
118,"### Excess reimbursement or allowance. An excess reimbursement or allowance is any amount you pay to an employee that is more than the business-related expenses for which the employee adequately accounted. The employee must return any excess reimbursement or other expense allowance to you within a reasonable period of time. If the employee fails to do so, you may have to treat it as taxable income and report it on your Form W-2.",What happens if an employee fails to return excess reimbursement or allowance?,You may have to treat it as taxable income and report it on your Form W-2.
119,## The Taxpayer Advocate Service (TAS) Is Here To Help You With Your IRS Problems,What is the Taxpayer Advocate Service (TAS)?,The Taxpayer Advocate Service (TAS) is a free service provided by the Internal Revenue Service (IRS) to help taxpayers resolve their IRS problems.
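`drop_duplicates(subset=['Question', 'Answer'])` keeps the first row for each distinct question-answer pair and ignores the `Context` column when comparing. The following pure-Python sketch shows the same rule on a list of dictionaries; `drop_duplicate_pairs` is a hypothetical helper for illustration, not a replacement for the pandas call.

```python
def drop_duplicate_pairs(rows: list[dict]) -> list[dict]:
    """Keep the first row for each (Question, Answer) pair, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        key = (row["Question"], row["Answer"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


rows = [
    {"Context": "chunk 1", "Question": "Q1", "Answer": "A1"},
    {"Context": "chunk 2", "Question": "Q1", "Answer": "A1"},  # same pair, different context
    {"Context": "chunk 3", "Question": "Q1", "Answer": "A2"},  # same question, new answer
]
print(len(drop_duplicate_pairs(rows)))
```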


In [127]:
output_df = df_unique[['Question', 'Answer']]

output_dir = 'data/output'

uniflow_output_path = f"{output_dir}/new_irs_QApairs.csv"

# Create the output directory if it does not exist yet.
os.makedirs(output_dir, exist_ok=True)

output_df.to_csv(uniflow_output_path, index=False)
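To confirm the export, the CSV can be read back with the standard library. The sketch below assumes the file has the `Question` and `Answer` header row written above; `read_qa_pairs` is a hypothetical helper.

```python
import csv


def read_qa_pairs(path: str) -> list[dict]:
    """Load the exported CSV into a list of {'Question': ..., 'Answer': ...} dicts."""
    # newline="" lets the csv module handle quoted fields that contain line breaks.
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))
```

For example, `read_qa_pairs(uniflow_output_path)` should return one dictionary per row of `output_df`.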