# Complex Retrieval Augmented Generation
This notebook uses LlamaIndex and Unstructured.io to retrieve information from PDFs.  
Machine-readable PDFs (i.e. not PDFs stored as scanned images) are converted to HTML.  
This is reported to give better performance than OCR.  
[See the reference notebook here](https://colab.research.google.com/drive/1ffaV5iFRvhzGIq8YSckf-VJ-y0AZFmA-?usp=sharing#scrollTo=lGM0h5TzUxMU)

Install the dependencies

In [1]:
#!pip install llama-index llama-hub unstructured==0.10.18 lxml cohere -qU

**Import Statements**

**from pydantic import BaseModel**: This imports the BaseModel class from the pydantic library. Pydantic is commonly used for data validation and settings management using Python type annotations. The BaseModel class is used to define data models where you can specify the type of each attribute, and Pydantic will handle validation and error reporting.  

**from unstructured.partition.html import partition_html**: This imports the partition_html function from the unstructured library. This function processes HTML content by dividing it into structured components or elements.  

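**Pandas Configuration**:
The `pd.set_option` calls configure how pandas displays data in the output. Setting `display.max_rows`, `display.max_columns`, and `display.max_colwidth` to `None` removes the corresponding display limits, so DataFrames are shown in full rather than truncated.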

In [1]:
from pydantic import BaseModel
from unstructured.partition.html import partition_html
import pandas as pd

pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)
pd.set_option("display.width", None)
pd.set_option("display.max_colwidth", None)

Normally, asyncio does not let you start a new event loop (or run asynchronous tasks with `asyncio.run`) if an event loop is already running in the same thread. This matters in environments such as Jupyter notebooks, which already run an event loop. The `nest_asyncio.apply()` function patches the event loop to remove this limitation, allowing you to run asynchronous tasks even if an event loop is already running.
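
The restriction can be reproduced with the standard library alone. The following is a minimal sketch, meant to be run as a plain Python script rather than inside Jupyter (where the top-level `asyncio.run` call would itself hit the same restriction). The coroutine names `inner` and `outer` are illustrative and not part of the notebook.

```python
import asyncio


async def inner():
    return 42


async def outer():
    # We are now inside a running event loop. Starting a second one
    # with asyncio.run() is rejected by the standard library.
    coro = inner()
    try:
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

With `nest_asyncio.apply()` called first, the nested call is expected to succeed instead of raising, which is what the notebook relies on.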

In [2]:
import nest_asyncio
nest_asyncio.apply()

##### Get the OpenAI API key

In [3]:
"""# Either type it in
import os
from getpass import getpass

os.environ["OPENAI_API_KEY"] = getpass("OpenAI API Key: ")"""


In [3]:
# Or read in the key as environment variable
import os

# Read API key from environment variable
openai_api_key = os.getenv("OPENAI_API_KEY")
if openai_api_key is None:
    raise ValueError("OpenAI API Key not set in environment variables")
os.environ["OPENAI_API_KEY"] = openai_api_key


#### Convert the PDF file to HTML using poppler

In [4]:
# Using poppler
# You can ignore the warning: Syntax Warning: Bad annotation destination
import subprocess

def convert_pdf_to_html(pdf_path, html_output_path):
    # Construct the command to convert the PDF file to HTML
    command = ['pdftohtml', '-c', '-noframes', pdf_path, html_output_path]

    # Run the command
    subprocess.run(command, check=True)

# Example usage
pdf_file = 'investor_reports/ssw-IR22.pdf'  # Source PDF; change this to your own file
html_output_file = 'investor_reports/output_file'  # Output base name; the result is output_file.html

convert_pdf_to_html(pdf_file, html_output_file)

# This creates a file output_file.html as well as a PNG for each page.
# The PNG files are required to maintain the HTML structure.

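Because the conversion shells out to `pdftohtml`, it fails if poppler is not installed. A small pre-flight check can make that failure easier to diagnose. This is a sketch only: the helper names `build_pdftohtml_command` and `pdftohtml_available` are hypothetical, and the command simply mirrors the flags used in the conversion cell.

```python
import shutil


def build_pdftohtml_command(pdf_path, html_output_path):
    # Same flags as the conversion cell: -c (complex output) and
    # -noframes (a single HTML file instead of a frameset).
    return ["pdftohtml", "-c", "-noframes", str(pdf_path), str(html_output_path)]


def pdftohtml_available(binary="pdftohtml"):
    # shutil.which returns None when the executable is not on PATH.
    return shutil.which(binary) is not None


command = build_pdftohtml_command(
    "investor_reports/ssw-IR22.pdf", "investor_reports/output_file"
)
if pdftohtml_available():
    print("pdftohtml found; command would be:", " ".join(command))
else:
    print("pdftohtml not found on PATH; install poppler first")
```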

In [5]:
# Post-process the HTML file so that it is stored with a known encoding (UTF-8).
from pathlib import Path

def ensure_utf8_encoding(html_file_path):
    # Read the HTML file, decoding it as ISO-8859-1 (the assumed source encoding)
    with open(html_file_path, 'r', encoding='ISO-8859-1') as file:
        content = file.read()

    # Write the content back to the file with UTF-8 encoding
    with open(html_file_path, 'w', encoding='utf-8') as file:
        file.write(content)

# After conversion
html_output_path = 'investor_reports/output_file.html'  # Adjusted to include .html extension
ensure_utf8_encoding(Path(html_output_path))
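
The re-encoding step assumes the generated file is ISO-8859-1. If the file were already UTF-8, decoding it as ISO-8859-1 would turn non-ASCII characters into mojibake. The sketch below shows a more defensive variant that tries UTF-8 first and falls back to ISO-8859-1. The helper `read_text_with_fallback` is hypothetical, not part of the notebook, and the temporary files exist only for the demonstration.

```python
import tempfile
from pathlib import Path


def read_text_with_fallback(path, encodings=("utf-8", "iso-8859-1")):
    """Decode a file with the first encoding that succeeds; return (text, encoding)."""
    raw = Path(path).read_bytes()
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError(f"could not decode {path} with any of {encodings}")


with tempfile.TemporaryDirectory() as tmp:
    latin1_file = Path(tmp) / "latin1.html"
    latin1_file.write_bytes("caf\u00e9".encode("iso-8859-1"))
    utf8_file = Path(tmp) / "utf8.html"
    utf8_file.write_bytes("caf\u00e9".encode("utf-8"))

    latin1_result = read_text_with_fallback(latin1_file)
    utf8_result = read_text_with_fallback(utf8_file)

# What happens when UTF-8 bytes are decoded as ISO-8859-1 instead
mojibake = "\u00e9".encode("utf-8").decode("iso-8859-1")
print(latin1_result, utf8_result, repr(mojibake))
```

Both files decode to the same text, each with its own detected encoding, while the last line shows the two-character garbage that a wrong guess produces.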


#### Read the file into LlamaIndex

In [6]:
from llama_index.readers.file.flat_reader import FlatReader
from pathlib import Path

reader = FlatReader()
html_doc = reader.load_data(Path("investor_reports/output_file.html"))

#### Set up the node parser
The UnstructuredElementNodeParser splits a document into Text Nodes and Index Nodes, where the Index Nodes correspond to embedded objects (e.g. tables).
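
To make the idea of separating tables from running text concrete, here is a toy illustration using only the standard library. It is not how UnstructuredElementNodeParser is implemented; the class `TableTextSplitter` and the sample HTML are invented for this sketch.

```python
from html.parser import HTMLParser


class TableTextSplitter(HTMLParser):
    """Collect text outside tables and the cell text of each top-level table."""

    def __init__(self):
        super().__init__()
        self.table_depth = 0
        self.text_chunks = []
        self.tables = []
        self._current_table = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.table_depth += 1
            if self.table_depth == 1:
                self._current_table = []

    def handle_endtag(self, tag):
        if tag == "table" and self.table_depth > 0:
            self.table_depth -= 1
            if self.table_depth == 0:
                self.tables.append(" | ".join(self._current_table))

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.table_depth > 0:
            self._current_table.append(text)
        else:
            self.text_chunks.append(text)


sample_html = (
    "<p>Revenue grew.</p>"
    "<table><tr><td>Year</td><td>2022</td></tr></table>"
    "<p>Outlook is stable.</p>"
)
splitter = TableTextSplitter()
splitter.feed(sample_html)
splitter.close()
print(splitter.text_chunks)
print(splitter.tables)
```

The real parser goes further: it uses the LLM passed to it to describe each table, so that the description can be indexed and the full table fetched on demand.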

##### Set up the LLM

In [7]:
from llama_index.llms import OpenAI

#llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)
llm = OpenAI(model='gpt-4-1106-preview', temperature=0.1)

In [8]:
from llama_index.node_parser import (
    UnstructuredElementNodeParser,
)

node_parser = UnstructuredElementNodeParser(llm=llm)

#### Get the raw nodes

In [9]:
import pickle

# Check if the nodes file already exists
if os.path.exists("html_doc_nodes.pkl"):
    # If the file exists, load the data from it
    print("loading node data...")
    with open("html_doc_nodes.pkl", "rb") as file:
        html_doc_raw_nodes = pickle.load(file)
else:
    # If the file does not exist, generate the data and save it
    print("generating node data...")
    html_doc_raw_nodes = node_parser.get_nodes_from_documents(html_doc)
    with open("html_doc_nodes.pkl", "wb") as file:
        pickle.dump(html_doc_raw_nodes, file)


loading node data...
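
The load-or-generate pattern used for the node file can be factored into a small reusable helper. This is a sketch under assumptions: `load_or_compute` and `expensive` are hypothetical names, and a temporary directory stands in for the notebook's working directory.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_compute(cache_path, compute):
    """Return the pickled result at cache_path, computing and saving it if absent."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        with cache_path.open("rb") as file:
            return pickle.load(file)
    result = compute()
    with cache_path.open("wb") as file:
        pickle.dump(result, file)
    return result


calls = []


def expensive():
    # Stand-in for the slow node-parsing step.
    calls.append(1)
    return ["node-a", "node-b"]


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "nodes.pkl"
    first = load_or_compute(cache_file, expensive)   # computes and writes the cache
    second = load_or_compute(cache_file, expensive)  # reads the cache

print(first == second, len(calls))
```

Note that a cache like this goes stale when the source PDF or the parser settings change, so the pickle file should be deleted in that case. Only unpickle files you created yourself.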


##### Get the base nodes and the node mappings

In [10]:
base_nodes_html_doc, node_mappings_html_doc = node_parser.get_base_nodes_and_mappings(
    html_doc_raw_nodes
)

#### Set up the vector index, retriever, and query engine

In [11]:
from llama_index.retrievers import RecursiveRetriever
from llama_index.query_engine import RetrieverQueryEngine
from llama_index import VectorStoreIndex

vector_index = VectorStoreIndex(base_nodes_html_doc)
vector_retriever = vector_index.as_retriever(similarity_top_k=3)
vector_query_engine = vector_index.as_query_engine(similarity_top_k=3)
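
The `similarity_top_k=3` argument asks for the three nodes whose embeddings score highest against the query embedding. The sketch below illustrates that selection in pure Python, assuming cosine similarity as the scoring function (a common choice, stated here as an assumption). The two-dimensional vectors and the names `cosine_similarity` and `top_k` are made up for the example.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, node_vecs, k=3):
    """Return the k (node_id, score) pairs with the highest similarity."""
    scored = [
        (node_id, cosine_similarity(query_vec, vec))
        for node_id, vec in node_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "embeddings"; real ones have hundreds or thousands of dimensions.
toy_vectors = {
    "a": [1.0, 0.0],
    "b": [0.0, 1.0],
    "c": [1.0, 1.0],
    "d": [-1.0, 0.0],
}
results = top_k([1.0, 0.1], toy_vectors, k=3)
print([node_id for node_id, _ in results])
```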

#### Set up the recursive retriever and its query engine

In [12]:
recursive_retriever = RecursiveRetriever(
    "vector",
    retriever_dict={"vector": vector_retriever},
    node_dict=node_mappings_html_doc,
)
query_engine = RetrieverQueryEngine.from_args(recursive_retriever)
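
Conceptually, recursive retrieval means that when a retrieved node is only a pointer to another object (for example, a table summary pointing to the full table), the pointer is followed through a mapping before the result is used. The toy sketch below illustrates that idea only. It is not LlamaIndex's implementation, and `TextNode`, `RefNode`, and `resolve` are hypothetical names defined here.

```python
from dataclasses import dataclass


@dataclass
class TextNode:
    node_id: str
    text: str


@dataclass
class RefNode:
    node_id: str
    summary: str
    target_id: str  # id of the object this node points to


def resolve(nodes, node_dict):
    """Replace each reference node with the object it points to, following chains."""
    resolved = []
    for node in nodes:
        seen = set()
        while (
            isinstance(node, RefNode)
            and node.target_id in node_dict
            and node.node_id not in seen  # guard against reference cycles
        ):
            seen.add(node.node_id)
            node = node_dict[node.target_id]
        resolved.append(node)
    return resolved


retrieved = [
    TextNode("t1", "Plain paragraph."),
    RefNode("r1", "Summary of a remuneration table", "table-1"),
]
mapping = {"table-1": TextNode("table-1", "Name | Total\nA | 10")}
final_nodes = resolve(retrieved, mapping)
print([node.text for node in final_nodes])
```

In the notebook, `node_mappings_html_doc` plays the role of `mapping`, and `vector_retriever` supplies the initially retrieved nodes.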

#### Now ask questions!

In [33]:
response = query_engine.query("what was richard stewart's total remuneration in 2022?")
print(str(response))

Richard Stewart's total remuneration in 2022 cannot be determined based on the given context information.


##### Alternatively, query the vector index directly with `vector_query_engine`, which does not use the recursive retriever

In [15]:
response = vector_query_engine.query("is savannah danson a member of the risk committee?")
print(str(response))

Yes, Savannah Danson is a member of the Risk Committee.


In [16]:
response = query_engine.query("how much gold did the company produce in 2022?")
print(str(response))

The company produced 3,612 tonnes of gold in 2022.


In [17]:
response = query_engine.query("What is the outlook for gold for 2023?")
print(str(response))

The outlook for 2023 is that there are compelling arguments for why the prospect of an economic slowdown remains on the table. Although there are risks involved, there is a good case for gold in 2023, driven by elevated geopolitical risk, a developed market economic slowdown, a peak in interest rates, and risks to equity valuations.


In [26]:
response = query_engine.query("what is the number 1 risk to the business in 2022?")
print(str(response))

The number 1 risk to the business in 2022 is the risk of energy availability, including the risk of energy shortages, load shedding in South Africa, and curtailment in Europe.


In [21]:
response = query_engine.query("what are the top risks to sa pgm operations?")
print(str(response))

The top risks to SA PGM operations include inability to meet global governance standards and targets, non-compliance with relevant laws and regulations, mine incidents and accidents, theft of copper and infrastructure, misaligned community expectations, attraction and retention of skills, total power outage/load curtailment, seismicity, illegal mining, labour relations/wage negotiations, supply chain challenges, and inability to execute on the annual business plan.


In [28]:
#Extra question
response = query_engine.query("list the top 10 risks to the business?")
print(str(response))

The top 10 risks to the business are as follows:
1. Energy availability
2. Failure to enable resilient communities
3. Inability to fund expansion
4. Failure to grow in targeted commodities and regions
5. Not generating sufficient returns to deliver on force for good strategy
6. Impact of climate change
7. Diverse stakeholder relations
8. Working in and developing homogenous ecosystems
9. Lack of technical and operating capability
10. Financial impact of a pandemic
