<a href="https://colab.research.google.com/github/ZongyangYue/LLM_Financial_Doc/blob/main/MainPipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Install Necessary Packages

In [5]:
from IPython.display import clear_output
!pip3 install openai langchain unstructured pdf2image python-dotenv tiktoken pinecone-client
clear_output()

### Mount Google Drive to Access the .env File and PDF Documents

In [6]:
from google.colab import drive
drive.mount("/content/drive/")
%cd 'drive/My Drive/LLM'
%ls

Mounted at /content/drive/
/content/drive/My Drive/LLM
 2019122700117_acc_report.pdf   InfoExtraction.ipynb   TestEnvFile.ipynb
 2023022000301_acc_report.pdf   MainPipeline.ipynb    'Test langchain.ipynb'
 env                            PDFparser.ipynb


### Read Environment Variables

In [8]:
import os
import openai
import tiktoken
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

# print(os.environ['OPENAI_API_KEY'])
# print(os.environ['PINECONE_API_KEY'])
# print(os.environ['PINECONE_API_ENV'])
openai.api_key  = os.environ['OPENAI_API_KEY']
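
The cell above raises a bare `KeyError` if a variable is absent from the `.env` file. A minimal sketch of a fail-fast check is shown below; `require_env` is a hypothetical helper, not part of `python-dotenv` or this notebook, and it assumes `load_dotenv` has already been called.

```python
import os


def require_env(names):
    """Return the requested environment variables as a dict.

    Hypothetical helper: raises one RuntimeError listing every
    missing (or empty) variable instead of failing on the first.
    """
    missing = [name for name in names if not os.environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: os.environ[name] for name in names}


# Possible usage in this notebook (commented out so the sketch runs anywhere):
# config = require_env(["OPENAI_API_KEY", "PINECONE_API_KEY", "PINECONE_API_ENV"])
```

Checking all three keys up front reports a misconfigured `.env` file before any paid API call is made.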

### Read PDF Documents into Document Objects / Texts

In [9]:
from langchain.document_loaders import UnstructuredPDFLoader

In [10]:
loader = UnstructuredPDFLoader("2019122700117_acc_report.pdf")

In [11]:
data = loader.load()

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [12]:
print(f'You have {len(data)} document(s) in your data')
print(f'There are {len(data[0].page_content)} characters in your document')

You have 1 document(s) in your data
There are 131437 characters in your document


### Split Texts into Chunks

In [13]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [14]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(data)

In [15]:
print(f'Now you have {len(texts)} documents')

Now you have 152 documents
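
To make `chunk_size` and `chunk_overlap` concrete, here is a rough sketch of plain fixed-width chunking. This is a simplification and not how `RecursiveCharacterTextSplitter` works internally: that splitter prefers to break on separators such as paragraph and line breaks, which is likely why the cell above reports 152 chunks rather than the 132 that a strict 1000-character cut of 131437 characters would give. `split_fixed` is a hypothetical name.

```python
def split_fixed(text, chunk_size=1000, chunk_overlap=0):
    """Cut text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    Simplified illustration only; it ignores separators entirely.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


# A stand-in string with the same length as the loaded report.
sample = "x" * 131437
print(len(split_fixed(sample)))  # 132
```

A non-zero `chunk_overlap` repeats the tail of each chunk at the head of the next, which can help when an answer straddles a chunk boundary, at the cost of storing more vectors.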


### Create Embeddings and Store Them in Pinecone

In [16]:
import pinecone
from langchain.vectorstores import Pinecone
from langchain.embeddings.openai import OpenAIEmbeddings

In [17]:
OPENAI_API_KEY = os.environ['OPENAI_API_KEY']
PINECONE_API_KEY = os.environ['PINECONE_API_KEY']
PINECONE_API_ENV = os.environ['PINECONE_API_ENV']

In [18]:
embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)

In [22]:
# initialize pinecone
pinecone.init(
    api_key=PINECONE_API_KEY,  # find at app.pinecone.io
    environment=PINECONE_API_ENV  # next to api key in console
)
index_name = "langchain1" # already created using Pinecone web GUI

Load the chunk embeddings into the Pinecone vector store. `Pinecone.from_texts` embeds each chunk with `embeddings` and writes the resulting vectors to the index.

In [23]:
docsearch = Pinecone.from_texts([t.page_content for t in texts], embeddings, index_name=index_name)

Embed the query and retrieve the `k` most similar vectors from the 152 stored in the Pinecone index.


In [25]:
query = "What is the name of the auditor who issued this accountant report, return answer in one or two words"
docs = docsearch.similarity_search(query, k = 10)
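
Conceptually, the similarity search ranks stored vectors by their closeness to the query vector and keeps the best `k`. The sketch below does this exhaustively with cosine similarity on toy 2-dimensional vectors. It is an illustration under assumptions: the metric actually used depends on how the `langchain1` index was configured, and Pinecone does not scan every vector the way this loop does. `cosine_similarity` and `top_k` are hypothetical helper names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k(query_vector, vectors, k=10):
    """Return the indices of the k vectors most similar to query_vector."""
    scored = [(cosine_similarity(query_vector, v), i) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]


# Toy data standing in for real embeddings.
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
print(top_k([1.0, 0.0], toy_vectors, k=2))  # [0, 2]
```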

Import the OpenAI LLM wrapper and the question-answering chain loader from LangChain

In [26]:
from langchain.llms import OpenAI
from langchain.chains.question_answering import load_qa_chain

Set the LLM temperature to 0 and use `chain_type="stuff"`, which places all retrieved chunks into a single prompt

In [27]:
llm = OpenAI(temperature=0, openai_api_key=OPENAI_API_KEY)
chain = load_qa_chain(llm, chain_type="stuff")
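
The idea behind the "stuff" chain type can be sketched as simple string assembly: every retrieved chunk is concatenated into one context block, followed by the question. The template wording below is an assumption made for illustration and is not LangChain's actual prompt; `build_stuff_prompt` is a hypothetical name.

```python
def build_stuff_prompt(chunks, question):
    """Concatenate all chunks into one prompt, followed by the question.

    Illustrative template only; LangChain ships its own prompt text.
    """
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_stuff_prompt(
    ["First retrieved chunk.", "Second retrieved chunk."],
    "Who issued the report?",
)
print(prompt)
```

Because everything goes into one prompt, the combined size of the retrieved chunks plus the question has to fit within the model's context window, which is one reason to keep both `chunk_size` and `k` modest.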

### Run Queries Using the OpenAI LLM with the Top-k Retrieved Chunks as Context

Open questions: changing the query text is a form of prompt engineering.
Can the embeddings also be tuned directly, as a kind of vector-embedding engineering?

In [29]:
chain.run(input_documents=docs, question=query).strip()

'PricewaterhouseCoopers'
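
To compare different phrasings of a query, the retrieval and answering steps can be wrapped in one function. The sketch below assumes objects that expose the same `similarity_search` and `run` methods used in the cells above; `ask` is a hypothetical helper name.

```python
def ask(docsearch, chain, question, k=10):
    """Retrieve the k most similar chunks, then answer the question from them."""
    docs = docsearch.similarity_search(question, k=k)
    return chain.run(input_documents=docs, question=question).strip()


# Possible usage with the notebook's objects (commented out: needs API access):
# for q in [
#     "What is the name of the auditor who issued this accountant report?",
#     "Which firm signed the accountant's report?",
# ]:
#     print(q, "->", ask(docsearch, chain, q))
```

Note that each new phrasing changes both the retrieved chunks and the prompt sent to the LLM, so a different answer may come from either step.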