# Example of generating QAs for a 10K
In this example, we show how to generate question-answer pairs (QAs) from a PDF using OpenAI's models via `uniflow`'s [OpenAIJsonModelFlow](https://github.com/CambioML/uniflow/blob/main/uniflow/flow/model_flow.py#L125).

For this example, we're using a [10K from Nike](https://investors.nike.com/investors/news-events-and-reports/).

### Before running the code

You will need the `uniflow` conda environment to run this notebook. You can set it up by following the [installation instructions](https://github.com/CambioML/uniflow/tree/main#installation).

Next, you will need a valid [OpenAI API key](https://platform.openai.com/api-keys) to run the code. Once you have the key, set it as the environment variable `OPENAI_API_KEY` within a `.env` file in the root directory of this repository. For more details, see these [instructions](https://github.com/CambioML/uniflow/tree/main#api-keys).

Finally, we store the Nike 10K in the `data/raw_input` directory as `nike-10k-2023.pdf`. You can download the file from [here](https://s1.q4cdn.com/806093406/files/doc_downloads/2023/414759-1-_5_Nike-NPS-Combo_Form-10-K_WR.pdf).
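
Before running the cells below, it can help to confirm both prerequisites. The following is a minimal sketch and not part of `uniflow`: the helper name `missing_prerequisites` is hypothetical, and it assumes the key has already been loaded into the process environment (for example by `load_dotenv()`).

```python
import os


def missing_prerequisites(env_var: str, pdf_path: str) -> list[str]:
    """Return human-readable problems; an empty list means everything is in place."""
    problems = []
    # An unset variable and an empty one both count as missing.
    if not os.environ.get(env_var):
        problems.append(f"environment variable {env_var} is not set")
    if not os.path.isfile(pdf_path):
        problems.append(f"input file not found: {pdf_path}")
    return problems


# Example usage with the names from this notebook.
pdf_path = os.path.join("data", "raw_input", "nike-10k-2023.pdf")
for problem in missing_prerequisites("OPENAI_API_KEY", pdf_path):
    print("Missing prerequisite:", problem)
```

Returning a list rather than raising on the first problem lets you see every missing piece in one pass.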

### Update system path

In [2]:
%reload_ext autoreload
%autoreload 2

import sys

sys.path.append(".")
sys.path.append("..")
sys.path.append("../..")

### Install helper packages

In [3]:
!{sys.executable} -m pip install langchain pandas pypdf



### Import dependencies

In [4]:
from dotenv import load_dotenv
import os
import pandas as pd
from uniflow.flow.client import TransformClient
# from uniflow.flow.config import TransformOpenAIConfig
from uniflow.flow.config import TransformHuggingFaceConfig
# from uniflow.op.model.model_config import OpenAIModelConfig
from uniflow.op.model.model_config import HuggingfaceModelConfig
from langchain.document_loaders import PyPDFLoader
from uniflow.op.prompt_schema import Context, GuidedPrompt

from uniflow.node import Node
from uniflow.op.extract.load.pdf_op import ExtractPDFOp
from uniflow.op.model.llm_preprocessor import LLMDataPreprocessor

load_dotenv()




False

### Prepare the input data
First, we need to pre-process the PDF to get text chunks that we can feed into the model. We will use `PyPDFLoader` from langchain.
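
To illustrate what "text chunks" means here, the following is a minimal plain-Python sketch of fixed-size splitting with overlap. It is not langchain's splitter, and `chunk_text` with its default sizes is a hypothetical helper chosen only for illustration.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters so that a sentence cut at a
    boundary still appears whole in one of the two chunks.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    # Stop early enough that the last chunk is not fully contained in the previous one.
    last_start_bound = max(len(text) - overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start_bound, step)]


chunks = chunk_text("abcdefghijklmnopqrstuvwxy", chunk_size=10, overlap=2)
print(chunks)
```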

In [5]:
# pdf_file = "nike-10k-2023.pdf"
pdf_file = "IRS_2023.pdf"

##### Set current directory and input data directory.

In [6]:
dir_cur = os.getcwd()
input_file = os.path.join(dir_cur, "data", "raw_input", pdf_file)

##### Load and split the pdf

In [6]:
# initial_node = Node(name="initial_pdf_node", value_dict={"pdf_file_path": input_file})

# extract_pdf_op = ExtractPDFOp(
#             name="extract_pdf_op",
#             model=LLMDataPreprocessor(
#                 model_config = {
#                     "model_name": "mistralai/Mistral-7B-Instruct-v0.1",
#                     "batch_size": 1,
#                     "model_server": "HuggingfaceModelServer",
#                 }
#             ),
#         )

Loading checkpoint shards: 100%|██████████| 2/2 [00:04<00:00,  2.36s/it]


In [8]:
# extracted_nodes = extract_pdf_op([initial_node])



In [13]:
# print("Extracted Nodes:")
# print(extracted_nodes)

Extracted Nodes:
[<uniflow.node.Node object at 0x7fabe14a2470>]


In [14]:
# if extracted_nodes:
#     print("\nStructure of the First Extracted Node:")
#     print(extracted_nodes[0].value_dict)
#     # Optionally, you can also loop through all keys to see their names and values
#     for key, value in extracted_nodes[0].value_dict.items():
#         print(f"Key: {key}, Value: {value}")
# else:
#     print("No nodes were extracted.")


Structure of the First Extracted Node:
{'text': '/home/ubuntu/uniflow/example/transform/data/raw_input/IRS_2023.pdf\n\nComment: I\'m not sure what you mean by "I have a pdf file with 15 pages and it is very big" - how big? What does the size of the PDF file tell us about its contents?\n\n## Answer (4)\n\nYou can use `pdftk` to extract text from your PDF files, then pipe that into `grep`. You may need to adjust the options for `pdftk`, depending on whether or not your PDFs are OCR-enabled.\n\nHere\'s an example command:\n\n```\npdftk *.pdf cat output=output.txt | grep \'your search term\'\n```\n\nThis will create a new file called `output.txt` containing all the text in all the PDFs specified in the glob pattern. Then it will filter out any lines that don\'t contain the search term.\n\nIf you want to keep the original PDFs as well, you can add another line at the end like this:\n\n```\npdftk *.pdf output=output.pdf\n```\n\nThis will combine all the PDFs into one document called `output

In [11]:
# print(extracted_text)




In [12]:
# # Assuming extracted_nodes is a list of Node objects with extracted text

# # Function to print the content of each node
# def examine_extracted_data(nodes):
#     for i, node in enumerate(nodes):
#         # Extract the text content from each node
#         text_content = node.value_dict.get("text", "")

#         # Print the node number and its content
#         print(f"Node {i + 1}:")
#         print(text_content)
#         print("\n" + "-"*50 + "\n")  # Separator for readability

# # Call the function with the extracted nodes
# examine_extracted_data(extracted_nodes)


Node 1:
/home/ubuntu/uniflow/example/transform/data/raw_input/IRS_2023.pdf

Comment: I'm not sure what you mean by "I have a pdf file with 15 pages and it is very big" - how big? What does the size of the PDF file tell us about its contents?

## Answer (4)

You can use `pdftk` to extract text from your PDF files, then pipe that into `grep`. You may need to adjust the options for `pdftk`, depending on whether or not your PDFs are OCR-enabled.

Here's an example command:

```
pdftk *.pdf cat output=output.txt | grep 'your search term'
```

This will create a new file called `output.txt` containing all the text in all the PDFs specified in the glob pattern. Then it will filter out any lines that don't contain the search term.

If you want to keep the original PDFs as well, you can add another line at the end like this:

```
pdftk *.pdf output=output.pdf
```

This will combine all the PDFs into one document called `output.pdf`.

--------------------------------------------------



In [7]:
loader = PyPDFLoader(input_file)
pages = loader.load_and_split()
page_contents = [page.page_content for page in pages]

In [8]:
if os.path.exists(input_file):
    print("File found:", input_file)
else:
    print("File not found. Please check the file path.")

print(len(pages))

File found: /home/ubuntu/uniflow/example/transform/data/raw_input/IRS_2023.pdf
148


In [9]:
for i, page in enumerate(pages[146:]):  # Preview from chunk 147 onward (printed numbering restarts at 1); change the slice to preview other chunks
    print(f"Content of Page {i+1}:\n{page.page_content}\n\n---\n")

Content of Page 1:
Intangible drilling costs 26
Intangibles, amortization 31
Interest :
Allocation of 14
Below-market 17
Business expense for 13Capitalized 16
Carrying charge 25
Deductible 15
Forgone 17
Life insurance policies 16
Limitation 15
Not deductible 16
Refunds of 17
When to deduct 17
Internet-related expenses 48
Interview expenses 48
IRS social media 51
IRS Tax Professional 
Partners 51
IRS Video Portal 50
K
Key person 16
Kickbacks 47
L
Leases :
Canceling 11
Cost of getting 12, 31
Improvements by lessee 13
Leveraged 12
Mineral 40
Oil and gas 40
Sales distinguished 11
Taxes on 12
Legal and professional fees 48
Letter 5071C 53
Licenses 32, 48
Life insurance coverage 10
Limit on deductions 7
Line of credit 15
Loans :
Below-market 17
Discounted 17
Loans or advances 10
Lobbying expenses 48
Long-term care insurance 22
Losses 6, 7
At-risk limits 6
Net operating 6
Passive activities 6
Low Income Taxpayer Clinics 
(LITCs) 55
M
Machinery parts 5
Making a tax payment 53
Meals 44
Meals an

### Prepare sample prompts

First, we give the LLM a sample prompt to imitate. We do this by passing an instruction and a list of `Context` examples to the `GuidedPrompt` class.

In [10]:
guided_prompt = GuidedPrompt(
    instruction="Generate Q&A based on the context.",
    examples=[
        Context(
            context="In 1948, Claude E. Shannon published A Mathematical Theory of\nCommunication (Shannon, 1948) establishing the theory of\ninformation. In his article, Shannon introduced the concept of\ninformation entropy for the first time. We will begin our journey here.",
            question="Who published A Mathematical Theory of Communication in 1948?",
            answer="Claude E. Shannon.",
        ),
])
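
The raw output printed later in this notebook suggests that the instruction and examples are serialized as `key: value` lines ahead of the new context. The sketch below imitates that layout with plain dataclasses to show roughly what the model sees. It is an illustration only: `Example` and `render_prompt` are hypothetical names, not `uniflow` APIs.

```python
from dataclasses import dataclass


@dataclass
class Example:
    context: str
    question: str = ""
    answer: str = ""


def render_prompt(instruction: str, examples: list[Example], new_context: str) -> str:
    """Serialize an instruction, few-shot examples, and a new context as text lines."""
    lines = [f"instruction: {instruction}"]
    for ex in examples:
        lines += [
            f"context: {ex.context}",
            f"question: {ex.question}",
            f"answer: {ex.answer}",
        ]
    # The new context has no question or answer; the model is expected to supply them.
    lines.append(f"context: {new_context}")
    return "\n".join(lines)


print(render_prompt(
    "Generate Q&A based on the context.",
    [Example("Shannon published his paper in 1948.", "Who published the paper?", "Shannon.")],
    "Chapter 3 covers rent expense.",
))
```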

Next, we convert each entry of `page_contents` into a `Context` object for `uniflow` to process, keeping only chunks longer than 200 characters and truncating each one to its first 500 characters.

In [11]:
# input_data = [ Context(context=p[:500]) for p in page_contents[6:16] if len(p) > 200]
input_data = [ Context(context=p[:500]) for p in page_contents[:] if len(p) > 200]
input_data

[Context(context="Contents\nIntroduction .................. 1\nWhat's New for 2022 ............. 2\nWhat's New for 2023 ............. 2\nReminders ................... 2\nChapter 1. Deducting \nBusiness Expenses .......... 3\nChapter 2. Employees' Pay ........ 8\nChapter 3. Rent Expense ........ 11\nChapter 4. Interest ............ 13\nChapter 5. Taxes ............. 18\nChapter 6. Insurance ........... 21\nChapter 7. Costs You Can Deduct \nor Capitalize .............. 25\nChapter 8. Amortization ......... 29\nChapter 9. Depletio"),
 Context(context='publication or the How To Get Tax Help  section \nat the end of this publication, go to the IRS In-\nteractive Tax Assistant page at IRS.gov/\nHelp/ITA  where you can find topics by using the \nsearch feature or viewing the categories listed.\nGetting tax forms, instructions, and pub-\nlications. Go to IRS.gov/Forms  to download \ncurrent and prior -year forms, instructions, and \npublications.\nOrdering tax forms, instructions, and \npublic
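
The list comprehension above does two things: it drops chunks of 200 characters or fewer, and it truncates the remaining chunks to their first 500 characters. The same logic as a small standalone function, working on plain strings rather than `Context` objects (`select_contexts` is a hypothetical name):

```python
def select_contexts(chunks: list[str], min_len: int = 200, max_len: int = 500) -> list[str]:
    """Keep chunks longer than min_len and truncate each to max_len characters."""
    return [chunk[:max_len] for chunk in chunks if len(chunk) > min_len]


sample = ["a" * 150, "b" * 201, "c" * 800]
print([len(c) for c in select_contexts(sample)])
```

Truncation keeps the prompt short, at the cost of cutting some chunks mid-sentence, as the outputs above show.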

### Use LLM to generate data

The OpenAI variant of this example uses the default LLM of [OpenAIModelConfig](https://github.com/CambioML/uniflow/blob/main/uniflow/model/config.py#L17) to generate questions and answers. It is kept below, commented out; the cells that actually run here use `TransformHuggingFaceConfig` with the default `HuggingfaceModelConfig` instead.

In both cases, we pass our `guided_prompt` to the config so that our customized instruction and examples are used instead of the `uniflow` defaults.

In the OpenAI variant, we also want the response in `json` format instead of the default `text`, so we set `response_format` to `{"type": "json_object"}`.
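
With a JSON response format, each response can be parsed with `json.loads` instead of string splitting. A minimal sketch, assuming the model returns an object with `question` and `answer` keys (the exact keys depend on the prompt and are an assumption here); `parse_qa_json` is a hypothetical helper:

```python
import json
from typing import Optional


def parse_qa_json(raw: str) -> Optional[dict]:
    """Parse one JSON response; return None if it is malformed or incomplete."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    # Require a JSON object that carries both expected keys.
    if not isinstance(data, dict) or not {"question", "answer"} <= data.keys():
        return None
    return {"question": data["question"], "answer": data["answer"]}


print(parse_qa_json('{"question": "Who published the paper?", "answer": "Shannon."}'))
```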

In [11]:
# config = TransformOpenAIConfig(
#     guided_prompt_template=guided_prompt,
#     model_config=OpenAIModelConfig(response_format={"type": "json_object"}),
# )
# client = TransformClient(config)

In [12]:
# pip3 install transformers accelerate bitsandbytes scipy
# pip3 install lmqg spacy
# pip3 install accelerate

In [12]:
config = TransformHuggingFaceConfig(
    guided_prompt_template=guided_prompt,
    model_config=HuggingfaceModelConfig(),
)
client = TransformClient(config)

Loading checkpoint shards: 100%|██████████| 2/2 [01:49<00:00, 54.82s/it]


Now we call the `run` method on the `client` object to execute the question-answer generation operation on the data shown above.

In [13]:
output = client.run(input_data)

  0%|          | 0/148 [00:00<?, ?it/s]

100%|██████████| 148/148 [18:18<00:00,  7.42s/it]


### Process the output

Let's take a look at the generated output. The raw output needs a little post-processing.

In [14]:
print(output[:3])

[{'output': [{'response': ["instruction: Generate Q&A based on the context.\ncontext: In 1948, Claude E. Shannon published A Mathematical Theory of\nCommunication (Shannon, 1948) establishing the theory of\ninformation. In his article, Shannon introduced the concept of\ninformation entropy for the first time. We will begin our journey here.\nquestion: Who published A Mathematical Theory of Communication in 1948?\nanswer: Claude E. Shannon.\ncontext: Contents\nIntroduction .................. 1\nWhat's New for 2022 ............. 2\nWhat's New for 2023 ............. 2\nReminders ................... 2\nChapter 1. Deducting \nBusiness Expenses .......... 3\nChapter 2. Employees' Pay ........ 8\nChapter 3. Rent Expense ........ 11\nChapter 4. Interest ............ 13\nChapter 5. Taxes ............. 18\nChapter 6. Insurance ........... 21\nChapter 7. Costs You Can Deduct \nor Capitalize .............. 25\nChapter 8. Amortization ......... 29\nChapter 9. Depletio"], 'error': 'No errors.'}], 'r

In [15]:
# Extracting context, question, and answer into a DataFrame
contexts = []
questions = []
answers = []

for item in output:
    for i in item.get('output', []):
        for response in i.get('response', []):
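            parts = response.split('\n')
            # Split each line on the first ': ' only, so values that contain ': ' stay intact.
            # Later keys overwrite earlier ones, so the last 'question'/'answer' lines win.
            response_dict = dict(part.split(': ', 1) for part in parts if ': ' in part)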

            if all(key in response_dict for key in ['context', 'question', 'answer']):
                contexts.append(response_dict['context'])
                questions.append(response_dict['question'])
                answers.append(response_dict['answer'])
            else:
                print("Missing context, question or answer in response:", response)

# Set display options
pd.set_option('display.max_colwidth', None)
pd.set_option('display.width', 1000)

df = pd.DataFrame({
    'Context': contexts,
    'Question': questions,
    'Answer': answers
})

styled_df = df.style.set_properties(**{'text-align': 'left'}).set_table_styles([{
    'selector': 'th',
    'props': [('text-align', 'left')]
}])
styled_df

| | Context | Question | Answer |
|---|---|---|---|
| 0 | Contents | Who published A Mathematical Theory of Communication in 1948? | Claude E. Shannon. |
| 1 | publication or the How To Get Tax Help section | Where can I find information about how to get tax help? | You can find information about how to get tax help at the IRS Interactive Tax Assistant page at IRS.gov/Help/ITA. |
| 2 | In 1948, Claude E. Shannon published A Mathematical Theory of | What did Claude E. Shannon introduce in his article? | The concept of information entropy. |
| 3 | not your employee. See the Instructions for | What is the gig economy? | The gig economy refers to a labor market where people engage in short-term, freelance or contract work rather than traditional full-time employment. This type of work is often facilitated by online platforms such as Uber, TaskRabbit, and Upwork. |
| 4 | The Gig Economy Tax Center streamlines | What does the Gig Economy Tax Center offer? | Tips and resources on various topics related to the tax implications of the gig economy such as filing requirements, making quarterly estimated income tax payments, paying self-employment taxes, paying FICA, Medicare, and Additional Medicare taxes, deductible business expenses, and special rules for 1099 forms. |
| 5 | human trafficking. If warranted, financial institu- | Who published A Mathematical Theory of Communication in 1948? | Claude E. Shannon. |
| 6 | receipts of $27 million or less for the 3 prior tax | Who published A Mathematical Theory of Communication in 1948? | Claude E. Shannon. |
| 7 | pose of them. | Who published A Mathematical Theory of Communication in 1948? | Claude E. Shannon. |
| 8 | made for each member of the consolidated | What happens if I don't have an applicable financial statement and I purchase five laptops for my business in 2022? | If you do not have an applicable financial statement and you purchase five laptops for your business in 2022, you may elect to treat these purchases as Section 537(b)(1) property if you make this election before January 1, 2023. However, if you make this election after January 1, 2023, you must include these purchases in your gross income for the tax year ending December 31, 2022. |
| 9 | records for 2022 in accordance with this policy. | Who published A Mathematical Theory of Communication in 1948? | Claude E. Shannon. |
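
Several rows in this table repeat the few-shot example's question and answer. A likely cause is that the parser keeps the last `question:` and `answer:` lines it finds, and when the model adds nothing after the prompt, those lines are the example's own. One way to flag such rows is to look only at the text that follows the prompt. The sketch below uses a hypothetical helper, `extract_new_qa`, and assumes each response begins with the prompt text, as the raw output above appears to.

```python
from typing import Optional


def extract_new_qa(response: str, prompt: str) -> Optional[tuple[str, str]]:
    """Return the first (question, answer) generated after the prompt, or None."""
    # Only inspect text the model produced beyond the echoed prompt.
    generated = response[len(prompt):] if response.startswith(prompt) else response
    question = answer = None
    for line in generated.split("\n"):
        key, sep, value = line.partition(": ")
        if not sep:
            continue  # Not a "key: value" line.
        if key == "question" and question is None:
            question = value.strip()
        elif key == "answer" and question is not None and answer is None:
            answer = value.strip()
    if question and answer:
        return question, answer
    return None


prompt = "instruction: I\ncontext: c1\nquestion: q1\nanswer: a1\ncontext: c2"
print(extract_new_qa(prompt, prompt))
print(extract_new_qa(prompt + "\nquestion: q2\nanswer: a2", prompt))
```

Responses for which this returns `None` can be dropped or retried instead of being recorded with the example's answer.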


In [17]:
# # Extracting context, question, and answer into a DataFrame
# contexts = []
# questions = []
# answers = []

# for item in output:
#     for i in item.get('output', []):
#         for response in i.get('response', []):
#             if any(key not in response for key in ['context', 'question', 'answer']):
#                 print("Missing context, question or answer in response:", response)
#                 continue
#             contexts.append(response['context'])
#             questions.append(response['question'])
#             answers.append(response['answer'])

# # Set display options
# pd.set_option('display.max_colwidth', None)
# pd.set_option('display.width', 1000)

# df = pd.DataFrame({
#     'Context': contexts,
#     'Question': questions,
#     'Answer': answers
# })

# styled_df = df.style.set_properties(**{'text-align': 'left'}).set_table_styles([{
#     'selector': 'th',
#     'props': [('text-align', 'left')]
# }])
# styled_df

Finally, we save the question-answer pairs to a CSV file.

In [21]:
output_df = df[['Question', 'Answer']]

output_dir = 'data/output'

# exist_ok=True makes this a no-op when the directory already exists
os.makedirs(output_dir, exist_ok=True)

output_df.to_csv(os.path.join(output_dir, "IRS_2023_QApairs.csv"), index=False)
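
To confirm the file was written as expected, it can be read back. A sketch using only the standard library's `csv` module; `count_qa_rows` is a hypothetical helper and assumes the `Question` and `Answer` column names used above.

```python
import csv


def count_qa_rows(csv_path: str) -> int:
    """Count rows whose Question and Answer fields are both non-empty."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return sum(
            1 for row in csv.DictReader(f)
            if row.get("Question") and row.get("Answer")
        )
```

For example, `count_qa_rows("data/output/IRS_2023_QApairs.csv")` should match `len(output_df)` if no row has an empty field.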