# Introduction
In this tutorial, we explore how large language models (LLMs), particularly those developed by OpenAI, can help automate accounting-related workflows. With these models, accountants can extract information from large, unstructured documents such as SEC filings quickly and with little manual effort. Let's dive in.

# Environment Setup
## Setting up the Conda Environment
This command activates a conda environment named `llm`. Conda environments let users manage multiple versions of software packages and their dependencies side by side. Using a dedicated environment ensures that the required libraries and their specific versions are used, which makes the code reproducible.

In [1]:
%conda activate llm


Note: you may need to restart the kernel to use updated packages.
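Activating an environment from inside a notebook may not change the interpreter that an already-running kernel uses, so it can help to confirm which environment the kernel is actually running in. The following is a minimal sketch using only the standard library; the helper name `describe_environment` is hypothetical, and `CONDA_DEFAULT_ENV` is assumed to be set only when conda has activated an environment for this process.

```python
import os
import sys


def describe_environment() -> dict:
    """Report which interpreter and conda environment this process is using."""
    return {
        # Path of the Python executable backing the current kernel or script.
        "python_executable": sys.executable,
        # Root directory of the environment that interpreter belongs to.
        "prefix": sys.prefix,
        # Name of the active conda environment, or None if conda did not set it.
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
    }


info = describe_environment()
for key, value in info.items():
    print(f"{key}: {value}")
```

If the reported path does not point inside the `llm` environment, starting Jupyter from a shell where that environment is already active is one way to fix it.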


## Installing Necessary Libraries

### Installation of Libraries

Here, we are installing several Python libraries:
- `pypdf`: Helps in working with PDF files.
- `chromadb`: A vector database for storing and searching document embeddings.
- `langchain`: A framework for building LLM applications by chaining components such as document loaders, text splitters, embeddings, and models.
- `unstructured`: A library useful for processing unstructured data.
- `openai`: The official library to interface with OpenAI's models.
- `tiktoken`: Helps in counting tokens in a string without making an API call.


In [22]:
%pip install pypdf chromadb langchain unstructured openai tiktoken

Collecting tiktoken
  Obtaining dependency information for tiktoken from https://files.pythonhosted.org/packages/b8/eb/234646d9eefda8a500d0fd88b05bf625a90ed18054124349db26e558276e/tiktoken-0.5.1-cp311-cp311-win_amd64.whl.metadata
  Downloading tiktoken-0.5.1-cp311-cp311-win_amd64.whl.metadata (6.8 kB)
Downloading tiktoken-0.5.1-cp311-cp311-win_amd64.whl (759 kB)
Installing collected packages: tiktoken
Successfully installed tiktoken-0.5.1
Note: you may need to restart the kernel to use updated packages.
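After installing, a quick check that every package is visible to the current kernel can save debugging time later. The sketch below uses the standard library's `importlib.metadata`; the helper name `installed_versions` is hypothetical, and the package list simply mirrors the install command above.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


required = ["pypdf", "chromadb", "langchain", "unstructured", "openai", "tiktoken"]
for name, version in installed_versions(required).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

Recording the printed versions alongside the notebook also helps with reproducibility, since the import paths used later in this tutorial depend on the installed `langchain` version.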


## API Key Configuration

### Setting OpenAI API Key

In this block, we set up our OpenAI API key. This key is essential for authentication when making requests to the OpenAI service.


In [23]:
import os
from getpass import getpass
# Prompt for the key instead of hard-coding it, so the secret is never saved in the notebook.
os.environ["OPENAI_API_KEY"] = getpass("OpenAI API key: ")
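To confirm that the key is available without printing the secret itself, a small check along these lines can be used. This is a sketch under stated assumptions: `mask_secret` and `describe_api_key` are hypothetical helper names, and showing the last four characters is an arbitrary choice.

```python
import os


def mask_secret(value: str, visible: int = 4) -> str:
    """Return the secret with all but the last `visible` characters hidden."""
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]


def describe_api_key(env=os.environ, name: str = "OPENAI_API_KEY") -> str:
    """Describe whether the key is set, without revealing it."""
    key = env.get(name)
    if not key:
        return f"{name} is not set"
    return f"{name} is set ({mask_secret(key)})"


print(describe_api_key())
```

Any key that has been committed or shared in a notebook should be treated as compromised and revoked.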

## Generating Custom Content

### Using the Model for Custom Content Generation

This block demonstrates how versatile LLMs can be. Not only can they process and analyze structured data, but they can also generate creative content. Here, we are using the model to write a rap about forensic accounting in the style of Eminem.


In [24]:
from langchain.chains import RetrievalQA
from langchain.document_loaders import UnstructuredHTMLLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma

In [18]:
llm = OpenAI(temperature = 0.9)

In [20]:
print(llm("Write a rap about forensic accounting in the style of Eminem."))



Verse 1

I'm gonna break it down, get deep in the books

Number crunching and analyzing the financial looks

Pretty numbers and facts all rolling together

Investigating fraud and other financial matters

Skills to unravel the secrets of the shady deals

Forensic accounting, the law of embezzled bills

Verse 2

Accounts receivable, balance sheets and more

Searching for clues that open up the forensic door

Analysis of accounts that's critical to the case

Researching to recover the missing sums and their place

Interviewing witnesses, making sure the facts align

Presenting the evidence that supports the crime's decline

Verse 3

Writing up reports that prove the discrepancies

Discovering cover-ups and audit the anomalies

Uncovering hidden assets, chipping away at lies

Finding the perpetrator who's responsible for the crime

Digging deep into the data to expose the misdeeds

Forensic accounting's the way to reveal the root of the weeds
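
The `temperature` value of 0.9 passed to `OpenAI` above encourages varied, creative output. As a rough illustration of the general idea (not a description of OpenAI's actual implementation), temperature is commonly explained as dividing a model's raw token scores by the temperature before converting them to probabilities. The scores in `toy_logits` are made up for the example.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities, sharpened or flattened by temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (0.2, 0.9):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

A low temperature concentrates probability on the top-scoring token, while a value near 1 leaves more room for less likely tokens, which suits creative tasks like this one better than factual extraction.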


## Loading and Processing Documents

### Extracting Information from an HTML Document

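In this block, we load an HTML document named `extract.html`. The `UnstructuredHTMLLoader` reads the document and extracts its text. The `CharacterTextSplitter` then splits the text on a separator (a blank line by default) and merges the pieces into chunks of roughly 1000 characters, with up to 500 characters of overlap between consecutive chunks to preserve context across chunk boundaries. A piece with no separator inside it is kept whole even if it is longer than 1000 characters, which is why the output below warns about oversized chunks. Finally, `OpenAIEmbeddings` converts each chunk into an embedding vector, `Chroma` stores those vectors for similarity search, and `RetrievalQA` combines the retriever with an OpenAI model; the `"stuff"` chain type places all retrieved chunks into a single prompt.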


In [25]:
loader = UnstructuredHTMLLoader("extract.html")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=500)
texts = text_splitter.split_documents(documents)

embeddings = OpenAIEmbeddings()
docsearch = Chroma.from_documents(texts, embeddings)

qa = RetrievalQA.from_chain_type(llm=OpenAI(), chain_type="stuff", retriever=docsearch.as_retriever())

Created a chunk of size 1253, which is longer than the specified 1000
Created a chunk of size 1479, which is longer than the specified 1000
Created a chunk of size 2262, which is longer than the specified 1000
Created a chunk of size 1047, which is longer than the specified 1000
Created a chunk of size 1001, which is longer than the specified 1000
Created a chunk of size 1059, which is longer than the specified 1000
Created a chunk of size 1001, which is longer than the specified 1000
Created a chunk of size 2477, which is longer than the specified 1000
Created a chunk of size 1438, which is longer than the specified 1000
Created a chunk of size 1084, which is longer than the specified 1000
Created a chunk of size 1872, which is longer than the specified 1000
Created a chunk of size 1174, which is longer than the specified 1000
Created a chunk of size 1047, which is longer than the specified 1000
Created a chunk of size 2401, which is longer than the specified 1000
...
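
The warnings above appear because a separator-based splitter never cuts inside a piece: if a single paragraph is longer than `chunk_size`, it becomes an oversized chunk. The following is a simplified sketch of that greedy split-and-merge idea, written for illustration only; it is not LangChain's code, and the function name `split_text` and the toy text are assumptions.

```python
def split_text(text, chunk_size, chunk_overlap, separator="\n\n"):
    """Greedy separator-based splitter with overlap (simplified sketch)."""
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], []

    def joined_length(parts):
        return len(separator.join(parts))

    for piece in pieces:
        if current and joined_length(current + [piece]) > chunk_size:
            chunks.append(separator.join(current))
            # Carry trailing pieces over as overlap: drop pieces from the front
            # until the carried text fits the overlap budget and still leaves
            # room for the incoming piece.
            while current and (
                joined_length(current) > chunk_overlap
                or joined_length(current + [piece]) > chunk_size
            ):
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


# Toy text: three short paragraphs, one long paragraph, one short paragraph.
toy_text = "\n\n".join(["a" * 30, "b" * 30, "e" * 30, "c" * 120, "d" * 30])
toy_chunks = split_text(toy_text, chunk_size=70, chunk_overlap=35)
for chunk in toy_chunks:
    print(len(chunk), "oversized" if len(chunk) > 70 else "ok")
```

In this toy run the second paragraph appears in both of the first two chunks (the overlap), and the 120-character paragraph is emitted whole. If strict size limits matter, a splitter that falls back to smaller separators is one option to consider.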

## Querying the Document

### Using RetrievalQA to Extract Relevant Information

The `RetrievalQA` chain answers questions about the loaded document: it retrieves the chunks most relevant to each query and passes them to the model. With it, we can extract specific pieces of information, such as summaries, financial metrics, and other data points of interest to accountants. In the examples below, we ask for a summary, six-month net revenue and net income, issuer purchases of equity securities, and the revenue increase in 2023.
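
To make the retrieve-then-answer flow concrete, here is a toy sketch of the two steps: rank chunks by similarity to the question, then place the best ones into a single prompt. Everything in it is an assumption made for illustration: the bag-of-words `embed` function stands in for a real embedding model, the chunk texts are made up, and the prompt wording is not the one the library uses.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, chunks, k=2):
    """Return the k chunks most similar to the question."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_stuff_prompt(question, chunks, k=2):
    """Place the retrieved chunks into one prompt for the model."""
    context = "\n\n".join(retrieve(question, chunks, k))
    return (
        "Use the context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Made-up chunk texts, for illustration only.
toy_chunks = [
    "Net revenue for the six months ended June 30 increased.",
    "The company repurchased shares of Class A common stock.",
    "Recent accounting pronouncements are described in the notes.",
]
question = "What was net revenue for the six months?"
print(build_stuff_prompt(question, toy_chunks, k=1))
```

Because only the retrieved chunks reach the model, an answer can only be as good as the retrieval step, which is one reason to phrase queries with the terms the filing itself uses.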


In [33]:
query = """Summarize key information from this document that might be relevant for an accountant."""
print(qa.run(query))

 This document is an SEC Form 10-Q for Mastercard. It includes a description of recent accounting pronouncements, quantitative and qualitative disclosures about market risk, and notes about acquisitions and organization. It also includes unaudited consolidated financial statements for the three and six months ended June 30, 2023 and 2022 and as of June 30, 2023. The accompanying notes provide more information about the significant accounting policies and the consolidation and basis of presentation of the financial statements.


In [31]:
query = """List the six months net revenue by year from the consolidated statement of operations table"""
print(qa.run(query))


Six Months Ended June 30, 2023: $12,017 million
Six Months Ended June 30, 2022: $10,664 million
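
Because the model returns plain text, a short post-processing step can turn an answer like the one above into numbers for further analysis. The sketch below assumes the answer states amounts in the form `$12,017 million`; the helper names `parse_millions` and `percent_change` are hypothetical.

```python
import re


def parse_millions(answer: str) -> list[float]:
    """Extract dollar amounts stated in millions, e.g. '$12,017 million'."""
    matches = re.findall(r"\$([\d,]+(?:\.\d+)?)\s+million", answer)
    return [float(m.replace(",", "")) for m in matches]


def percent_change(current: float, prior: float) -> float:
    """Change as a percentage of the prior-period value."""
    return (current - prior) / prior * 100


answer = """Six Months Ended June 30, 2023: $12,017 million
Six Months Ended June 30, 2022: $10,664 million"""

revenue_2023, revenue_2022 = parse_millions(answer)
change = percent_change(revenue_2023, revenue_2022)
print(f"Six-month net revenue change: {change:.1f}%")
```

For these two figures the six-month increase works out to about 12.7%. Parsed values are worth comparing against the filing itself, since the model's wording and formatting can vary between runs.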


In [32]:
query = """List the six months net income by year from the consolidated statement of operations table"""
print(qa.run(query))

 2023: $5,206 million; 2022: $4,906 million.


In [35]:
query = """What were the issuer purchases of equity securities? Make a bulleted list."""
print(qa.run(query))

 
• During the second quarter of 2023, we repurchased 6.5 million shares for $2.4 billion at an average price of $373.52 per share of Class A common stock.
• The following table presents our repurchase activity on a cash basis during the second quarter of 2023:
• April 1 - 30: 2,070,540 shares at $367.29 per share
• May 1 - 31: 2,250,153 shares at $379.19 per share
• June 1 - 30: 2,148,089 shares at $373.59 per share
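
Figures extracted by the model can be cross-checked against each other with simple arithmetic. As a sketch, the monthly share counts and prices from the answer above can be totaled and compared with the quarterly summary in the first bullet; the variable names are illustrative.

```python
# (month, shares purchased, average price per share) as listed in the answer above.
monthly_purchases = [
    ("April", 2_070_540, 367.29),
    ("May", 2_250_153, 379.19),
    ("June", 2_148_089, 373.59),
]

total_shares = sum(shares for _, shares, _ in monthly_purchases)
total_cost = sum(shares * price for _, shares, price in monthly_purchases)
average_price = total_cost / total_shares

print(f"Total shares: {total_shares:,}")
print(f"Total cost: ${total_cost / 1e9:.2f} billion")
print(f"Weighted average price: ${average_price:.2f}")
```

The monthly rows sum to 6,468,782 shares at a weighted average of $373.52, which agrees with the quarterly summary of 6.5 million shares for $2.4 billion. Agreement like this does not prove the extraction is correct, but a mismatch is a useful signal to go back to the source document.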


In [39]:
query = """How much did revenue increase in 2023?"""
print(qa.run(query))

 For the three months ended June 30, 2023, net revenue increased 14% versus the comparable period in 2022. Adjusted net revenue increased 14%, or 15% on a currency-neutral basis.


You should now have a clearer understanding of how LLMs can streamline accounting workflows by automating information retrieval from complex documents.