# Task

You will be given a PDF document that contains both textual and graphical data. Your task is to:

* Extract the textual and graphical information from the PDF pages.
* Convert the extracted graphical data (such as charts or graphs) into a structured, queryable format.
* Implement a system where users can ask questions and receive meaningful responses based on the extracted data.


Requirements:
* Document your approach and display your results in a Jupyter notebook (.ipynb)
* Your solution should allow users to query both the extracted text and any data that was derived from the graphical elements (such as tables).
* Provide brief explanations of your approach, choices made, and any challenges you encountered.

---
# Approaches

The notebook below presents two approaches:
1. The short method, which takes the PDF itself as input and answers questions from its contents.
2. The longer method, which extracts the contents of the PDF manually and feeds them to the RAG solution in the same way as the first approach.

### Discussion

Both approaches make good use of the RAG system provided via <a href="https://python.langchain.com/">langchain</a> and <a href="https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2">Mistral-7B-Instruct-v0.2</a>.

The second approach uses the PdfFileManager class that I implemented, which aims to extract all text and graphical data manually. Text extraction was not especially difficult, but extracting the charts and diagrams is a challenge I did not manage to overcome manually. I tried converting the pages of the given PDF into images and passing them to a pretrained <a href = "https://huggingface.co/naver-clova-ix/donut-base-finetuned-docvqa">donut</a> or <a href = "https://huggingface.co/naver-clova-ix/donut-base-finetuned-docvqa">GPT-Vision-1-ft</a> model, in order to retrieve the graphical data and later convert it into a queryable format. Those trials were unsuccessful, so I did not achieve the second point of the task in the expected manner.

A possible way to achieve the task would be to first run a network that segments the graphs, charts, and diagrams (this step is optional). The segmented graphs, or the entire pages containing them, could then be interpreted by a second network, such as GPT-4 Vision, which is able to describe charts.
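
The two-stage idea above could be wired together roughly as follows. This is only a structural sketch: `segment_charts` and `describe_chart` are hypothetical placeholders for a segmentation network and a vision-language model, and `ChartRegion` is an assumed data structure. None of them was implemented in this notebook.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ChartRegion:
    """A chart located on a page (hypothetical structure)."""
    page_number: int
    bbox: tuple  # (left, top, right, bottom) in pixels


def describe_charts(
    pages: List[object],
    segment_charts: Callable[[int, object], List[ChartRegion]],
    describe_chart: Callable[[object, ChartRegion], str],
) -> List[dict]:
    """Run segmentation, then description, and return queryable records."""
    records = []
    for page_number, page in enumerate(pages, start=1):
        # Stage 1: locate the charts on this page (placeholder model)
        for region in segment_charts(page_number, page):
            # Stage 2: turn each chart into a textual description (placeholder model)
            records.append({
                "page": region.page_number,
                "bbox": region.bbox,
                "description": describe_chart(page, region),
            })
    return records
```

The resulting descriptions could then be appended to the extracted text, so that the same RAG pipeline can answer questions about the charts.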

### Conclusions:

The first approach cannot make use of a graph interpreter, because it takes the PDF as a whole for its input. Even so, the statistical data on the given charts, which in this case are text and .svg based, is interpreted better than I expected (see the "Tests for Approach No. 1" section).

Because the graphs in the PDF contain text, that text is scraped along with the rest of the textual data. This leads to surprisingly good results from the second approach as well. So, although I could not extract the graphs manually and create queryable data from them, the task seems to be successfully achieved.
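
As a small illustration of how scraped chart text can become structured data, the sketch below pulls every kWh reading out of a string with a regular expression. The sample string is copied from the scraped text printed further down in this notebook; `extract_kwh_readings` is a hypothetical helper and not part of PdfFileManager. The scraped order does not reliably pair each value with its label, so the sketch deliberately makes no attempt to assign labels.

```python
import re


def extract_kwh_readings(text: str) -> list:
    """Return every '<number> kwh' value found in the text, in order of appearance."""
    pattern = re.compile(r"(\d+(?:\.\d+)?)\s*kwh", flags=re.IGNORECASE)
    return [
        {"value": float(match.group(1)), "unit": "kWh", "offset": match.start()}
        for match in pattern.finditer(text)
    ]


sample = ("18% more than similar nearby homes 125 kwh 103 kwh "
          "similar nearby homes you 49 kwh efficient nearby homes")
readings = extract_kwh_readings(sample)
print([r["value"] for r in readings])  # [125.0, 103.0, 49.0]
```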

In [1]:
import getpass
import os

# Prompt for a personal Hugging Face user access token (read permission) if none is set
if "HUGGINGFACEHUB_API_TOKEN" not in os.environ:
    os.environ["HUGGINGFACEHUB_API_TOKEN"] = getpass.getpass("Hugging Face token: ")

In [2]:
from langchain.document_loaders import UnstructuredPDFLoader
from langchain.indexes import VectorstoreIndexCreator
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.chains.question_answering import load_qa_chain
from langchain import HuggingFaceHub
from langchain_community.llms import HuggingFaceEndpoint


## Approach No.1

In [3]:
pdf_folder_path = "./input"  # folder holding one or more input PDF files
# Create one loader per file so that all of them can be loaded
loaders = [UnstructuredPDFLoader(os.path.join(pdf_folder_path, fn)) for fn in os.listdir(pdf_folder_path)]
# Split the input into chunks and create embeddings so the network can retrieve them
index = VectorstoreIndexCreator(
    embedding=HuggingFaceEmbeddings(),
    text_splitter=CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)).from_loaders(loaders)
llm = HuggingFaceEndpoint(repo_id="mistralai/Mistral-7B-Instruct-v0.2", temperature=0.1, max_length=512)


The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to C:\Users\Dragos\.cache\huggingface\token
Login successful
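
To make the splitter settings used above (`chunk_size=1000`, `chunk_overlap=0`) more concrete, here is a minimal plain-Python sketch of fixed-size chunking. It is not LangChain's implementation: as far as I understand, `CharacterTextSplitter` first splits on a separator and then merges the pieces up to the chunk size, whereas the hypothetical `chunk_text` helper below simply slices the string.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 0) -> list:
    """Slice text into pieces of at most chunk_size characters.

    Consecutive pieces share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


chunks = chunk_text("a" * 2500)
print([len(c) for c in chunks])  # [1000, 1000, 500]
```

With an overlap of 0, as in this notebook, a sentence that straddles a chunk boundary is cut in two, which is one reason a non-zero overlap is sometimes preferred.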


In [4]:
from langchain.chains import RetrievalQA
chain = RetrievalQA.from_chain_type(llm=llm,
                                    chain_type="stuff",
                                    retriever=index.vectorstore.as_retriever(),
                                    input_key="question")
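
The `"stuff"` chain type places all retrieved chunks into a single prompt together with the question. A minimal sketch of that idea is shown below. The prompt wording and the `build_stuff_prompt` name are assumptions made for illustration; they are not LangChain's actual template.

```python
def build_stuff_prompt(question: str, chunks: list) -> str:
    """Concatenate all retrieved chunks into one context block, then add the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_stuff_prompt(
    "Give me a tip for energy saving",
    ["monthly savings tip: do full laundry loads.", "turn over for more savings ideas."],
)
print(prompt)
```

Because every chunk ends up in one prompt, this strategy only works while the retrieved text fits in the model's context window.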

## Tests for Approach No. 1

In [5]:
chain.run('Give me a tip for energy saving')


' One of the biggest energy wasters in your home is drafty windows and doors. Caulking drafty areas is a simple DIY project that can save you money and energy.'

In [6]:
chain.run('How is my electrical use?')

" Based on the information provided in the Home Energy Report, your electrical use is 18% more than similar nearby homes. This means that you are using more electricity than the average household in your area with a similar size and fuel type. To save energy and reduce your electricity bill, consider implementing the energy-saving tips provided in the report, such as caulking windows and doors, upgrading your refrigerator, and adjusting your thermostat settings. Additionally, you may want to evaluate your home's energy efficiency with a Home Energy Audit to identify other areas where you can save energy and money."

In [7]:
chain.run("What can be noted from the way the annual electricity chart looks?")

" The annual electricity chart shows that the person's electricity usage is higher than the average usage of similar-sized homes in their area, but lower than the usage of the most efficient homes. The chart also suggests that the person could save energy and money by pre-heating their home on cold days, setting their smart thermostat to save energy during high-cost hours, and evaluating their home's energy efficiency with a Home Energy Audit. Additionally, the person is encouraged to caulk drafty windows and doors, upgrade their refrigerator, and adjust their thermostat settings to save energy."

In [8]:
chain.run("How many elements are quantified in the annual electricity graph?")

" The annual electricity graph compares the electricity usage of the account holder with similar and efficient homes. It quantifies three elements: the account holder's electricity usage, the electricity usage of similar homes, and the electricity usage of efficient homes."

In [9]:
chain.run("In this document there are a couple of charts. Describe to me the meaning of the first one.")

' The first chart in the document compares the annual electricity use of the household (represented by the blue bar labeled "You") with similar and efficient homes in the area. The chart shows that the household uses 18% more electricity than similar homes and is above the typical use. The chart also includes a breakdown of the electricity use for nearby homes, with efficient homes using the least electricity and similar homes using an intermediate amount. The chart is intended to help the household understand their electricity use in comparison to others in the area and identify opportunities for energy savings.'

## Approach No. 2

In [10]:
import PdfFileManager
from langchain.vectorstores import FAISS
from langchain.document_loaders import TextLoader

In [11]:
input_pdf = PdfFileManager.PdfFileManager("./input/test_info_extract.pdf")
loader = TextLoader('./output/pdf_text.txt')
documents = loader.load()

In [12]:
documents

[Document(metadata={'source': './output/pdf_text.txt'}, page_content="waiting until you have a full load to run your laundry can save up to 6% of your energy use. watch this space for new ways to save energy each month. monthly savings tip: do full laundry loads. turn over for more savings ideas. nearby homes are defined as... nearby homes are based on fuel, distance and size. square footage is collected from public information sources. efficient nearby homes are the top 15 per cent efficient of similar-sized homes nearby. homes within +/- 300 sq. ft. homes within 9 km other homes with electricity dear jill doe, here is your usage analysis for march. 18% more than similar nearby homes 125 kwh 103 kwh similar nearby homes you 49 kwh efficient nearby homes your electric use: above typical use march report account number: 954137 service address: 1627 tulip lane home energy report: electricity find your personalized analysis of your electrical energy use. scan this code or log in to your a

In [13]:
import textwrap

def wrap_text_preserve_newlines(text, width=110):
    # Split the input text into lines based on newline characters
    lines = text.split('\n')

    # Wrap each line individually
    wrapped_lines = [textwrap.fill(line, width=width) for line in lines]

    # Join the wrapped lines back together using newline characters
    wrapped_text = '\n'.join(wrapped_lines)

    return wrapped_text

In [14]:
print(wrap_text_preserve_newlines(str(documents[0])))

page_content='waiting until you have a full load to run your laundry can save up to 6% of your energy use.
watch this space for new ways to save energy each month. monthly savings tip: do full laundry loads. turn over
for more savings ideas. nearby homes are defined as... nearby homes are based on fuel, distance and size.
square footage is collected from public information sources. efficient nearby homes are the top 15 per cent
efficient of similar-sized homes nearby. homes within +/- 300 sq. ft. homes within 9 km other homes with
electricity dear jill doe, here is your usage analysis for march. 18% more than similar nearby homes 125 kwh
103 kwh similar nearby homes you 49 kwh efficient nearby homes your electric use: above typical use march
report account number: 954137 service address: 1627 tulip lane home energy report: electricity find your
personalized analysis of your electrical energy use. scan this code or log in to your account at
franklinenergy.com. seven year savings is the 

In [15]:
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)
embeddings = HuggingFaceEmbeddings()


In [16]:
db = FAISS.from_documents(docs, embeddings)
chain2 = load_qa_chain(llm, chain_type="stuff")
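
The vector store answers a query by returning the chunks whose embeddings are closest to the query embedding. The sketch below shows the idea with cosine similarity over toy two-dimensional vectors. FAISS itself relies on optimized index structures and may use a different distance metric depending on configuration, so this is an illustration only; `cosine_similarity`, `top_k`, and the value of `k` are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=4):
    """Return the indices of the k document vectors most similar to the query."""
    scored = sorted(
        enumerate(doc_vecs),
        key=lambda pair: cosine_similarity(query_vec, pair[1]),
        reverse=True,
    )
    return [index for index, _ in scored[:k]]


toy_vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
print(top_k([1.0, 0.0], toy_vectors, k=2))  # [0, 2]
```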


## Tests for Approach No. 2

In [17]:
query = "Give me a tip for energy saving."
docs = db.similarity_search(query)
chain2.run(input_documents=docs, question=query)

' One energy-saving tip is to caulk drafty windows and doors to prevent heat loss and save money on your energy bill. This is a simple DIY project that can make a big difference. Additionally, setting your smart thermostat to save energy during high-cost hours and pre-heating your home on cold days can also help you save energy and money.'

In [18]:
query = "How is my electrical use?"
docs = db.similarity_search(query)
chain2.run(input_documents=docs, question=query)

' Based on the information provided, your electrical use in March was 125 kWh, which is 18% more than similar nearby homes and 49 kWh above the efficient nearby homes. To save energy and money, consider implementing the energy-saving tips provided, such as caulking drafty areas, setting your smart thermostat, and charging your electric vehicle overnight. Additionally, upgrading to a more energy-efficient refrigerator could also result in significant energy savings.'

In [19]:
query = "What can be noted from the way the annual electricity chart looks?"
docs = db.similarity_search(query)
chain2.run(input_documents=docs, question=query)

" The annual electricity chart shows that the household's electricity usage is higher than that of similar and efficient homes. Specifically, the household's usage is 18% more than similar homes and 49 kWh higher than efficient homes. The chart also shows that the household's usage in March was 125 kWh, while similar homes used 103 kWh and efficient homes used 50 kWh. The top three energy-saving tips provided to the household are caulking drafty areas, setting the smart thermostat to save energy during high-cost hours, and upgrading to an energy-efficient refrigerator."

In [20]:
query = "How many elements are quantified in the annual electricity graph?"
docs = db.similarity_search(query)
chain2.run(input_documents=docs, question=query)

' The annual electricity graph quantifies three elements: your annual electricity use, the electricity use of similar nearby homes, and the electricity use of efficient nearby homes.'

In [21]:
query = "In this document there are a couple of charts. Describe to me the meaning of the first one."
docs = db.similarity_search(query)
chain2.run(input_documents=docs, question=query)

' The first chart in the document is titled "your annual electricity use compared with similar and efficient homes." It shows the annual electricity use of the home in question, represented by the blue bar, compared to the electricity use of similar-sized homes and efficient homes in the area. The x-axis represents the annual electricity use in kilowatt-hours (kWh), and the y-axis represents the number of homes. The chart indicates that the home in question uses more electricity than similar homes, as its blue bar is to the right of the other bars on the chart. Additionally, it uses more electricity than the efficient homes, as its blue bar is to the right of the red line representing the efficient homes.'