<a href="https://colab.research.google.com/github/ZongyangYue/LLM_Financial_Doc/blob/main/DeepLake_production.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# install relevant packages
from IPython.display import clear_output

!pip3 install langchain deeplake pypdf openai tiktoken

clear_output()

In [2]:
# import relevant packages and set up API keys and tokens
import os
from getpass import getpass

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import DeepLake
from langchain.text_splitter import CharacterTextSplitter
from langchain.chains import RetrievalQA
from langchain.llms import OpenAIChat
from langchain.document_loaders import PagedPDFSplitter

# never hard-code credentials in a notebook; read them from the environment or prompt for them
for name in ('OPENAI_API_KEY', 'ACTIVELOOP_TOKEN'):
    if not os.environ.get(name):
        os.environ[name] = getpass(f'{name}: ')

In [12]:
# define necessary helper functions
import requests

def load_reports(url):
    """Download the PDF at `url` and return it as a list of page-level documents."""
    r = requests.get(url, timeout=60)
    r.raise_for_status()  # fail early on an HTTP error instead of saving an error page
    path = url.split('/')[-1]  # save under the last path component of the URL
    with open(path, 'wb') as f:
        f.write(r.content)
    loader = PagedPDFSplitter(path)
    return loader.load_and_split()

def run_queries(queries, qa):
    """Run each query through the QA chain and collect the answers in order."""
    return [qa.run(query) for query in queries]

def print_answers(answers):
    """Print one answer per line."""
    for answer in answers:
        print(answer)
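
`load_reports` takes the local file name from the last `/`-separated piece of the URL, which works for the two HKEX links used in this notebook but would keep any `?query` or `#fragment` suffix in the name. A minimal sketch of a more defensive variant follows; `local_filename_from_url` and its `default` argument are hypothetical names introduced here, not part of the original notebook.

```python
import posixpath
from urllib.parse import urlparse, unquote

def local_filename_from_url(url, default="report.pdf"):
    """Hypothetical helper: return a local file name for a download URL."""
    # urlparse separates the path from any ?query or #fragment part
    path = unquote(urlparse(url).path)
    # basename drops the directory components, so the file lands in the working directory
    name = posixpath.basename(path)
    # a URL ending in '/' has no file name, so fall back to the default
    return name or default

print(local_filename_from_url(
    "https://www1.hkexnews.hk/listedco/listconews/sehk/2023/0220/2023022000301.pdf"
))
```

The result could replace `url.split('/')[-1]` inside `load_reports` if URLs with query strings need to be supported.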


In [29]:
# define the main function
def analyze_accountants_report(url, activeloop_dataset_path):
    """Ingest the PDF at `url` into a Deep Lake dataset and print answers to a fixed set of questions."""
    pages = load_reports(url)
    # split the pages into chunks of roughly 1000 characters, with no overlap
    text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    texts = text_splitter.split_documents(pages)
    embeddings = OpenAIEmbeddings()

    # add_documents appends: rerunning against an existing dataset path stores every chunk again
    db = DeepLake(dataset_path=activeloop_dataset_path, embedding_function=embeddings, token=os.environ['ACTIVELOOP_TOKEN'])
    db.add_documents(texts)

    # 'stuff' places all retrieved chunks into a single prompt for the chat model
    qa = RetrievalQA.from_chain_type(llm=OpenAIChat(model='gpt-3.5-turbo'), chain_type='stuff', retriever=db.as_retriever())

    queries = [
        "What is the name of the auditor/accountant company? Look around accountant's report section",
        "What is the date of the accountant's report/circular?",
        "Is the acquisition approved by shareholders?",
        "What is the audit fee or auditor's remuneration? Look for this around (Loss) Profit before Tax section.",
        "What is the name of the company being acquired? Look for its specific name around the Definitions pages",
    ]
    answers = run_queries(queries, qa)
    print_answers(answers)
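
The main function hides two steps behind `db.as_retriever()` and `chain_type='stuff'`: finding the stored chunks closest to the question, and pasting them into one prompt. The sketch below illustrates that idea in plain Python under stated assumptions. It is not LangChain's or Deep Lake's implementation; the function names, the prompt wording, the value of `k`, and the toy 2-dimensional vectors (standing in for real embeddings) are all made up for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k_chunks(query_vec, chunk_vecs, chunks, k=4):
    """Return the k chunks whose vectors are most similar to the query vector."""
    scored = sorted(
        zip(chunks, chunk_vecs),
        key=lambda pair: cosine_similarity(query_vec, pair[1]),
        reverse=True,
    )
    return [chunk for chunk, _ in scored[:k]]

def build_stuff_prompt(question, retrieved_chunks):
    """Concatenate ('stuff') every retrieved chunk into a single prompt string."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )

# toy data: made-up chunks and 2-dimensional vectors in place of real embeddings
chunks = ["chunk about the auditor", "chunk about the EGM", "chunk about the property"]
chunk_vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
query_vec = [1.0, 0.1]

retrieved = top_k_chunks(query_vec, chunk_vecs, chunks, k=2)
prompt = build_stuff_prompt("Who is the auditor?", retrieved)
print(prompt)
```

Because every retrieved chunk is placed in the same prompt, the chunk size and the number of retrieved chunks together bound how much context the model sees per question.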

In [33]:
# first example document
url = 'https://www1.hkexnews.hk/listedco/listconews/sehk/2023/0220/2023022000301.pdf'
activeloop_dataset_path = "hub://yuezongyang210/shunteng_intl2"
analyze_accountants_report(url, activeloop_dataset_path)

This dataset can be visualized in Jupyter Notebook by ds.visualize() or at https://app.activeloop.ai/yuezongyang210/shunteng_intl2
hub://yuezongyang210/shunteng_intl2 loaded successfully.
Deep Lake Dataset in hub://yuezongyang210/shunteng_intl2 already exists, loading from the storage
Dataset(path='hub://yuezongyang210/shunteng_intl2', tensors=['embedding', 'ids', 'metadata', 'text'])

  tensor     htype     shape      dtype  compression
  -------   -------   -------    -------  ------- 
 embedding  generic  (99, 1536)  float32   None   
    ids      text     (99, 1)      str     None   
 metadata    json     (99, 1)      str     None   
   text      text     (99, 1)      str     None   


Evaluating ingest: 100%|██████████| 1/1 [00:28<00:00


Dataset(path='hub://yuezongyang210/shunteng_intl2', tensors=['embedding', 'ids', 'metadata', 'text'])

  tensor     htype      shape      dtype  compression
  -------   -------    -------    -------  ------- 
 embedding  generic  (198, 1536)  float32   None   
    ids      text     (198, 1)      str     None   
 metadata    json     (198, 1)      str     None   
   text      text     (198, 1)      str     None   
The name of the auditor/accountant company is Moore Stephens CPA Limited.
The accountant's report/circular is dated February 20, 2023.
It is not stated in the given context whether the acquisition has been approved by shareholders or not. The circular only provides information about the EGM and the requirements for shareholder approval.
The audit fee or auditor's remuneration is HK$6,000 for all periods listed in the context. It is mentioned in the (Loss) Profit before Tax section under "Auditor's remuneration."
The name of the company being acquired is Shunten International (
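
In the output above the dataset grows from 99 to 198 rows: it already existed, and the same chunks were appended a second time. One possible guard, sketched here under assumptions, is to derive a stable ID from each chunk's source and text and to skip any chunk whose ID is already stored. `chunk_id` and `filter_new_chunks` are hypothetical helpers; how the existing IDs would be read from the Deep Lake dataset is not shown and depends on the library version.

```python
import hashlib

def chunk_id(text, source):
    """Hypothetical helper: derive a stable ID from a chunk's source and text."""
    digest = hashlib.sha256(f"{source}\n{text}".encode("utf-8")).hexdigest()
    return digest[:16]

def filter_new_chunks(chunks, source, existing_ids):
    """Return (ids, texts) for the chunks whose ID is not already in existing_ids."""
    seen = set(existing_ids)
    new_ids, new_texts = [], []
    for text in chunks:
        cid = chunk_id(text, source)
        if cid in seen:
            continue  # already stored, or repeated within this batch
        seen.add(cid)
        new_ids.append(cid)
        new_texts.append(text)
    return new_ids, new_texts

# first run: nothing is stored yet, so both distinct chunks are new
first_ids, first_texts = filter_new_chunks(["a", "b", "a"], "doc.pdf", set())
# second run over the same document: every chunk is already known
second_ids, second_texts = filter_new_chunks(["a", "b", "a"], "doc.pdf", first_ids)
print(len(first_texts), len(second_texts))
```

Only the filtered texts would then be passed on for ingestion, so rerunning the cell leaves the row count unchanged.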

In [31]:
# second example document
url = 'https://www1.hkexnews.hk/listedco/listconews/sehk/2019/1227/2019122700117.pdf'
activeloop_dataset_path = "hub://yuezongyang210/zhuhai_huafa2"
analyze_accountants_report(url, activeloop_dataset_path)

Your Deep Lake dataset has been successfully created!
The dataset is private so make sure you are logged in!
This dataset can be visualized in Jupyter Notebook by ds.visualize() or at https://app.activeloop.ai/yuezongyang210/zhuhai_huafa2
hub://yuezongyang210/zhuhai_huafa2 loaded successfully.


Evaluating ingest: 100%|██████████| 1/1 [00:11<00:00


Dataset(path='hub://yuezongyang210/zhuhai_huafa2', tensors=['embedding', 'ids', 'metadata', 'text'])

  tensor     htype      shape      dtype  compression
  -------   -------    -------    -------  ------- 
 embedding  generic  (312, 1536)  float32   None   
    ids      text     (312, 1)      str     None   
 metadata    json     (312, 1)      str     None   
   text      text     (312, 1)      str     None   
The auditor/accountant company is PricewaterhouseCoopers, Certified Public Accountants.
The accountant's report and circular are dated December 27, 2019.
It is not stated in the given context whether the acquisition has been approved by shareholders or not. The letters and notices discuss the recommendation of the Independent Board Committee and the Board of Directors to vote in favor of the resolution to approve the acquisition and property management services cooperation framework agreement at the SGM. However, the actual outcome of the vote is not mentioned.
The information 