# Build ChatGPT to Answer Questions on Your Financial Data
## Introduction

Welcome to the 'Build ChatGPT to Answer Questions on Your Financial Data' lesson. We'll explore LangChain's benefits for large language models (LLMs) and its use in interpreting financial data. LangChain simplifies development, streamlines text generation from input prompts, and improves accuracy by ensuring consistency.

One of the key features of LangChain is its customizability, offering pre-built prompts and chains for use cases like question answering and summarization. Plus, LangChain is interoperable, integrating easily with tools like Deep Lake, which makes it ideal for handling complex applications.

We'll demonstrate this using Amazon's quarterly financial reports in this lesson. We'll embed the data using OpenAI's API, store it in Deep Lake, and explore it by asking questions through LangChain. This hands-on approach will illustrate how these tools can revolutionize the interpretation of financial data. Let's get started!

### The Workflow

The workflow to explore Amazon's revenue growth using LangChain and Deep Lake involves:

- Install required libraries and set API tokens for OpenAI and Activeloop.
- Download Amazon's financial reports, and split them into smaller chunks using LangChain's Text Splitter utility.
- Create a DeepLake vector store, add the text chunks, and use OpenAIChat as the language model.
- Finally, query the model with questions to obtain relevant insights.
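
The first step of the workflow is not shown in the cells below. As a minimal sketch, assume the libraries are installed with something like `pip install langchain deeplake openai pypdf tiktoken tqdm requests` (the exact package list is an assumption and depends on your environment) and that credentials live in the `OPENAI_API_KEY` and `ACTIVELOOP_TOKEN` environment variables. The helper `ensure_env` is hypothetical and written only for illustration:

```python
import os
from getpass import getpass
from typing import Callable


def ensure_env(name: str, ask: Callable[[str], str] = getpass) -> str:
    """Return the value of environment variable `name`, prompting for it if unset."""
    value = os.environ.get(name)
    if not value:
        # Prompt without echoing the secret, then keep it for this process only
        value = ask(f"Enter {name}: ")
        os.environ[name] = value
    return value
```

Calling `ensure_env("OPENAI_API_KEY")` and `ensure_env("ACTIVELOOP_TOKEN")` once at the top of the notebook keeps the tokens out of the saved cells.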

#### Loading the libraries

In [1]:
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import DeepLake
from langchain.text_splitter import CharacterTextSplitter
from langchain import OpenAI 
from langchain.chains import RetrievalQA
from langchain.llms import OpenAIChat
from langchain.document_loaders import PagedPDFSplitter



First, let’s download financial reports from Amazon and load them into pages:

In [8]:
import requests
import tqdm
import os
from typing import List

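def load_reports(urls: List[str], path: str) -> list:
    """Download each PDF in `urls` into `path` and return its pages as LangChain documents."""
    os.makedirs(path, exist_ok=True)  # the target folder must exist before writing
    pages = []

    for url in tqdm.tqdm(urls):
        r = requests.get(url)
        r.raise_for_status()  # stop early if a download fails
        # Save the PDF under its original file name
        file = os.path.join(path, url.split('/')[-1])
        with open(file, 'wb') as f:
            f.write(r.content)
        # Load the saved PDF as LangChain documents
        loader = PagedPDFSplitter(file)
        local_pages = loader.load_and_split()
        pages.extend(local_pages)
    return pages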



In [9]:
# Amazon's financial reports; any list of PDF URLs can be used instead
urls = ['https://s2.q4cdn.com/299287126/files/doc_financials/Q1_2018_-_8-K_Press_Release_FILED.pdf',
        'https://s2.q4cdn.com/299287126/files/doc_financials/Q2_2018_Earnings_Release.pdf',
        'https://s2.q4cdn.com/299287126/files/doc_news/archive/Q318-Amazon-Earnings-Press-Release.pdf',
        'https://s2.q4cdn.com/299287126/files/doc_news/archive/AMAZON.COM-ANNOUNCES-FOURTH-QUARTER-SALES-UP-20-TO-$72.4-BILLION.pdf',
        'https://s2.q4cdn.com/299287126/files/doc_financials/Q119_Amazon_Earnings_Press_Release_FINAL.pdf',
        'https://s2.q4cdn.com/299287126/files/doc_news/archive/Amazon-Q2-2019-Earnings-Release.pdf',
        'https://s2.q4cdn.com/299287126/files/doc_news/archive/Q3-2019-Amazon-Financial-Results.pdf',
        'https://s2.q4cdn.com/299287126/files/doc_news/archive/Amazon-Q4-2019-Earnings-Release.pdf',
        'https://s2.q4cdn.com/299287126/files/doc_financials/2020/Q1/AMZN-Q1-2020-Earnings-Release.pdf',
        'https://s2.q4cdn.com/299287126/files/doc_financials/2020/q2/Q2-2020-Amazon-Earnings-Release.pdf',
        'https://s2.q4cdn.com/299287126/files/doc_financials/2020/q4/Amazon-Q4-2020-Earnings-Release.pdf',
        'https://s2.q4cdn.com/299287126/files/doc_financials/2021/q1/Amazon-Q1-2021-Earnings-Release.pdf',
        'https://s2.q4cdn.com/299287126/files/doc_financials/2021/q2/AMZN-Q2-2021-Earnings-Release.pdf',
        'https://s2.q4cdn.com/299287126/files/doc_financials/2021/q3/Q3-2021-Earnings-Release.pdf',
        'https://s2.q4cdn.com/299287126/files/doc_financials/2021/q4/business_and_financial_update.pdf',
        'https://s2.q4cdn.com/299287126/files/doc_financials/2022/q1/Q1-2022-Amazon-Earnings-Release.pdf',
        'https://s2.q4cdn.com/299287126/files/doc_financials/2022/q2/Q2-2022-Amazon-Earnings-Release.pdf',
        'https://s2.q4cdn.com/299287126/files/doc_financials/2022/q3/Q3-2022-Amazon-Earnings-Release.pdf',
        'https://s2.q4cdn.com/299287126/files/doc_financials/2022/q4/Q4-2022-Amazon-Earnings-Release.pdf'
        ]
path = os.path.abspath('data')
print(path)
pages = load_reports(urls, path)
print('Pages: ', len(pages))

c:\Users\edumu\Google Drive\Github\langchain-vectordb-basics\data


100%|██████████| 19/19 [01:00<00:00,  3.20s/it]

Pages:  414





We now use LangChain's text splitter to break the pages into chunks of up to 1,000 characters, embed the chunks with OpenAI, and store them in a Deep Lake dataset.
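
To make the role of `chunk_size` more concrete, here is a simplified sketch of character-based splitting. It is not LangChain's actual implementation: the function `split_text` is hypothetical, it ignores overlap (the cell below uses `chunk_overlap=0`), and it assumes pieces are separated by blank lines.

```python
from typing import List


def split_text(text: str, chunk_size: int, separator: str = "\n\n") -> List[str]:
    """Greedily merge separator-delimited pieces into chunks of at most chunk_size characters.

    A single piece longer than chunk_size is kept whole rather than cut.
    """
    chunks: List[str] = []
    current = ""
    for piece in text.split(separator):
        if not piece:
            continue  # skip empty pieces produced by repeated separators
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # Either the piece still fits, or it starts a new (possibly oversized) chunk
            current = candidate
        else:
            # Adding the piece would exceed the limit: close the chunk and start a new one
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks
```

Smaller chunks give the retriever more precise matches but less context per match, so the value of `chunk_size` is a trade-off worth experimenting with.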

In [10]:
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(pages)

embeddings = OpenAIEmbeddings()

#Use your organization id here. (by default, org id is your username)
my_activeloop_org_id = "edumunozsala"
my_activeloop_dataset_name = "amazon_earnings_6"
dataset_path = f"hub://{my_activeloop_org_id}/{my_activeloop_dataset_name}"

db = DeepLake(dataset_path=dataset_path, embedding_function=embeddings, token=os.environ['ACTIVELOOP_TOKEN'])
db.add_documents(texts)



Your Deep Lake dataset has been successfully created!
This dataset can be visualized in Jupyter Notebook by ds.visualize() or at https://app.activeloop.ai/edumunozsala/amazon_earnings_6
hub://edumunozsala/amazon_earnings_6 loaded successfully.


Evaluating ingest: 100%|██████████| 1/1 [00:21<00:00

Dataset(path='hub://edumunozsala/amazon_earnings_6', tensors=['embedding', 'ids', 'metadata', 'text'])

  tensor     htype      shape      dtype  compression
  -------   -------    -------    -------  ------- 
 embedding  generic  (414, 1536)  float32   None   
    ids      text     (414, 1)      str     None   
 metadata    json     (414, 1)      str     None   
   text      text     (414, 1)      str     None   


 


Since the dataset is stored on Activeloop, you can load it later without recomputing the embeddings, which saves both time and compute. LangChain provides a wrapper around Deep Lake that lets you use it as a vector store.
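
Reloading the stored dataset in a later session could look like the sketch below. It assumes the installed LangChain version accepts the `read_only` argument on `DeepLake` and that `ACTIVELOOP_TOKEN` is set. The helper names `dataset_path` and `load_existing_store` are hypothetical.

```python
import os


def dataset_path(org_id: str, dataset_name: str) -> str:
    """Build the Activeloop hub path used when the dataset was created."""
    return f"hub://{org_id}/{dataset_name}"


def load_existing_store(org_id: str, dataset_name: str):
    """Open an existing Deep Lake dataset as a LangChain vector store without re-embedding."""
    # Imported lazily so this module can be imported without LangChain installed
    from langchain.embeddings.openai import OpenAIEmbeddings
    from langchain.vectorstores import DeepLake

    return DeepLake(
        dataset_path=dataset_path(org_id, dataset_name),
        embedding_function=OpenAIEmbeddings(),  # still needed to embed incoming questions
        read_only=True,  # assumption: we only query here and never add documents
        token=os.environ["ACTIVELOOP_TOKEN"],
    )
```

An embedding function is still passed in because each question has to be embedded at query time, even though the document embeddings are reused.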

In [11]:
qa = RetrievalQA.from_chain_type(llm=OpenAIChat(model='gpt-3.5-turbo'), chain_type='stuff', retriever=db.as_retriever())



## How to answer questions using ML?

One effective way to answer questions over your own documents is to combine LangChain with Activeloop Deep Lake. You store the documents in Deep Lake's unified, streamable format and connect it to LangChain, which retrieves the chunks most relevant to a question and passes them to a question-answering model such as GPT or BERT. Because the model sees the retrieved context alongside the question, it can generate a more accurate answer. You can also add techniques such as data augmentation, improved document retrieval, and summarization to enhance system performance.
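
As a rough illustration of what retrieval-based question answering does, the sketch below ranks stored chunks by similarity to a question vector and places the best ones into a prompt. It is not LangChain's or Deep Lake's code: the helper names are hypothetical, the vectors are toy 3-dimensional stand-ins for real embeddings, and cosine similarity is an assumption, since the actual metric depends on how the vector store is configured.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: Sequence[float],
          store: List[Tuple[str, Sequence[float]]],
          k: int = 2) -> List[str]:
    """Return the k chunk texts whose vectors are most similar to the query vector."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question: str, chunks: List[str]) -> str:
    """Place all retrieved chunks into a single prompt, as a 'stuff'-style chain would."""
    context = "\n\n".join(chunks)
    return f"Use the context to answer the question.\n\nContext:\n{context}\n\nQuestion: {question}"


# Toy store: (chunk text, made-up embedding)
store = [
    ("chunk about net sales", [1.0, 0.0, 0.0]),
    ("chunk about AWS", [0.0, 1.0, 0.0]),
    ("chunk about shipping", [0.0, 0.0, 1.0]),
]
retrieved = top_k([0.9, 0.1, 0.0], store, k=2)
prompt = build_prompt("What were net sales?", retrieved)
```

The quality of the final answer depends heavily on this retrieval step: if the wrong chunks are selected, the language model has no way to recover the right figure.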

Let’s now explore Amazon’s revenue change pre- and post-pandemic.

In [12]:
qa.run("Combine total revenue in 2020?")
#Amazon's total revenue in 2020 was $386,064 million.

'The total revenue in 2020 for Amazon.com, Inc. was $386,064 million.'

In this run the figure matches the full-year 2020 net sales that Amazon reported, but queries that require combining numbers across reports are not answered reliably, because the chain has no dedicated method for arithmetic. We will explore further optimization for this use case, employing a chain of agents to get good answers consistently. For now, more straightforward queries can be answered very quickly:

In [13]:
qa.run("What is the revenue in 2021 Q3?")
# Net sales for Q3 2021 were $110.8 billion, rising 15% to $127.1 billion in Q3 2022.

'The revenue in 2021 Q3 was $65.6 billion.'
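
The answer returned above ($65.6 billion) differs from the $110.8 billion given in the cell's comment, so it is worth checking answers against a known figure before relying on them. Below is a minimal sketch of such a check. The helpers `extract_usd` and `matches_reference` are hypothetical, and the reference value is one you supply yourself from the reports.

```python
import re
from typing import Optional

_UNITS = {"million": 1e6, "billion": 1e9}


def extract_usd(answer: str) -> Optional[float]:
    """Return the first dollar amount in `answer` as a float, or None if there is none."""
    match = re.search(r"\$([\d,]+(?:\.\d+)?)\s*(million|billion)", answer, re.IGNORECASE)
    if match is None:
        return None
    number = float(match.group(1).replace(",", ""))
    return number * _UNITS[match.group(2).lower()]


def matches_reference(answer: str, reference_usd: float, rel_tol: float = 0.01) -> bool:
    """True if the amount in `answer` is within rel_tol of the reference value."""
    value = extract_usd(answer)
    return value is not None and abs(value - reference_usd) <= rel_tol * reference_usd
```

For example, `matches_reference('The revenue in 2021 Q3 was $65.6 billion.', 110.8e9)` evaluates to `False`, which flags the answer for manual review.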

In conclusion, LangChain is a robust tool for productizing large language models. It provides a simple workflow that links prompts, utilities, and language models to generate output. Its benefits include streamlined LLM development, improved accuracy, better use-case customization, and easy integration with data sources and other Python libraries, including Activeloop Deep Lake. Deep Lake, in turn, enables rapid LangChain prototyping, because you can access stored data immediately without recomputing the embeddings. If you're working with LLMs and looking to streamline your development process, LangChain, powered by Deep Lake as the data store, is worth exploring.