# Use PowerPoint documents in your RAG system

## Setup

* Install the required libraries
* Get an [Unstructured API key](https://unstructured.io/api-key-hosted). The free 14-day trial lets you process up to 1000 pages per day.
* Get your Hugging Face token (depending on the model you choose, you may not need one). You can create one in your [profile settings](https://huggingface.co/settings/tokens).

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
!pip install -q unstructured-ingest "unstructured-ingest[pptx]" unstructured langchain langchain-community accelerate bitsandbytes sentence-transformers chromadb


In [None]:
import os

os.environ["UNSTRUCTURED_API_KEY"] = "YOUR_UNSTRUCTURED_API_KEY"
os.environ["UNSTRUCTURED_API_URL"] = "YOUR_UNSTRUCTURED_API_URL"
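Hardcoding keys in a notebook makes them easy to leak when the notebook is shared. A minimal sketch of one alternative, assuming you would rather be prompted for a missing value at run time; `ensure_env` is a hypothetical helper written for this example, not part of any library:

```python
import os
from getpass import getpass


def ensure_env(name, prompt_fn=getpass):
    """Return the value of environment variable `name`, prompting if it is unset.

    `ensure_env` is a hypothetical helper for illustration only.
    """
    value = os.environ.get(name, "")
    # Treat empty values and the "YOUR_..." placeholders as missing.
    if not value or value.startswith("YOUR_"):
        value = prompt_fn(f"Enter {name}: ")
        os.environ[name] = value
    return value
```

Calling `ensure_env("UNSTRUCTURED_API_KEY")` before running the pipeline would then prompt only when the variable still holds a placeholder.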

In [None]:
from huggingface_hub.hf_api import HfFolder

HfFolder.save_token('YOUR_HF_TOKEN')

## Preprocessing a PowerPoint file for RAG with Unstructured



In [None]:
# The example PowerPoint is from https://www.highland-k12.org/site/handlers/filedownload.ashx?moduleinstanceid=251&dataid=735&FileName=plants.ppt
# or upload your own .ppt or .pptx to Colab and save it to the following folder:

path_to_output="/content/drive/MyDrive/content/drive/"
pptx_file="plants.ppt"
path_to_pptx=f"{path_to_output}{pptx_file}"
output_file=f"{path_to_pptx}.json"
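The paths above are built by string concatenation, which only works while `path_to_output` ends with a slash. As a sketch of a more forgiving variant, the hypothetical helper `build_paths` below uses `pathlib` and follows the same naming as the cell above (the JSON file is the input file name with `.json` appended):

```python
from pathlib import PurePosixPath


def build_paths(output_dir, filename):
    """Return (input_path, output_json_path) as strings.

    Mirrors the naming used above: the JSON output sits next to the
    input file, with ".json" appended to the full file name.
    `build_paths` is a hypothetical helper for illustration only.
    """
    base = PurePosixPath(output_dir)
    input_path = base / filename
    output_json = base / f"{filename}.json"
    return str(input_path), str(output_json)


# Works with or without a trailing slash on the directory.
path_to_pptx, output_file = build_paths("/content/drive/MyDrive/content/drive", "plants.ppt")
print(path_to_pptx)
print(output_file)
```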

In [None]:
# Use the Unstructured Ingest Python library to process the document.

import os

from unstructured_ingest.v2.pipeline.pipeline import Pipeline
from unstructured_ingest.v2.interfaces import ProcessorConfig
from unstructured_ingest.v2.processes.connectors.local import (
    LocalIndexerConfig,
    LocalDownloaderConfig,
    LocalConnectionConfig,
    LocalUploaderConfig
)
from unstructured_ingest.v2.processes.partitioner import PartitionerConfig
from unstructured_ingest.v2.processes.chunker import ChunkerConfig

import json
from unstructured.staging.base import dict_to_elements

if __name__ == '__main__':
    Pipeline.from_configs(
        context=ProcessorConfig(),
        indexer_config=LocalIndexerConfig(input_path=path_to_pptx),
        downloader_config=LocalDownloaderConfig(),
        source_connection_config=LocalConnectionConfig(),
        partitioner_config=PartitionerConfig(
            partition_by_api=True,
            api_key=os.getenv("UNSTRUCTURED_API_KEY"),
            partition_endpoint=os.getenv("UNSTRUCTURED_API_URL")
        ),
        chunker_config=ChunkerConfig(chunking_strategy="by_title"),
        uploader_config=LocalUploaderConfig(output_dir=path_to_output)
    ).run()

Overriding of current TracerProvider is not allowed
2024-10-25 18:44:53,107 MainProcess INFO     created index with configs: {"input_path": "/content/drive/MyDrive/content/drive/plants.ppt", "recursive": false}, connection configs: {"access_config": "**********"}
2024-10-25 18:44:53,109 MainProcess INFO     Created download with configs: {"download_dir": null}, connection configs: {"access_config": "**********"}
2024-10-25 18:44:53,111 MainProcess INFO     created partition with configs: {"strategy": "auto", "ocr_languages": null, "encoding": null, "additional_partition_args": null, "skip_infer_table_types": null, "fields_include": ["element_id", "text", "type", "metadata", "embeddings"], "flatten_metadata": false, "metadata_exclude": [], "metadata_include": [], "partition_endpoint": "https://api.unstructuredapp.io/general/v0/general", "partition_by_api": true, "api_key": "*******", "hi_res_model_name": null}
2024-10-25 18:44:53,112 MainProcess INFO     created chunk with configs: {"ch
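The pipeline above asks for `chunking_strategy="by_title"`. As a rough intuition for what that means, the sketch below starts a new chunk whenever a `Title` element appears, and also when a chunk would exceed a size limit. It is a simplified illustration only, not the Unstructured implementation; the function name, the `max_characters` default, and the sample slide text are all assumptions made for this example.

```python
def chunk_by_title(elements, max_characters=500):
    """Group element dicts into text chunks, starting a new chunk at each Title.

    A simplified illustration only; not the Unstructured implementation.
    """
    chunks, current = [], []

    def flush():
        # Close the chunk being built, if it has any text.
        if current:
            chunks.append("\n".join(current))
            current.clear()

    for el in elements:
        text = el["text"]
        starts_section = el["type"] == "Title"
        too_long = sum(len(t) + 1 for t in current) + len(text) > max_characters
        if current and (starts_section or too_long):
            flush()
        current.append(text)
    flush()
    return chunks


# Made-up slide content, used only to exercise the function.
sample = [
    {"type": "Title", "text": "Wood Sorrel"},
    {"type": "NarrativeText", "text": "Leaves are shaped like hearts."},
    {"type": "Title", "text": "Canadian Moonseed"},
    {"type": "NarrativeText", "text": "A vine without tendrils."},
]
for chunk in chunk_by_title(sample):
    print(repr(chunk))
```

Keeping each title together with the text that follows it tends to give the retriever chunks that are about one topic each.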

In [None]:
# Read the processed data into a list of element objects.

with open(output_file, 'r') as f:
    elements = dict_to_elements(json.load(f))
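Before building the retriever it can help to check what the pipeline produced. The sketch below assumes the output file is a JSON list of dicts with the keys named in `fields_include` in the log above (`element_id`, `text`, `type`, `metadata`). The records here are hypothetical stand-ins, and `summarize_elements` is a helper invented for this example.

```python
import json
from collections import Counter


def summarize_elements(records):
    """Count element types and total text length in a list of element dicts."""
    type_counts = Counter(r.get("type", "unknown") for r in records)
    total_chars = sum(len(r.get("text", "")) for r in records)
    return type_counts, total_chars


# Hypothetical records standing in for json.load(f) on the real output file.
records = json.loads("""[
  {"element_id": "a1", "type": "CompositeElement", "text": "First chunk.", "metadata": {}},
  {"element_id": "b2", "type": "CompositeElement", "text": "Second chunk.", "metadata": {}}
]""")
counts, chars = summarize_elements(records)
print(counts, chars)
```

An empty result or a very small character count at this stage usually means the partitioning step did not find the text you expected.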

## Retriever


In [None]:
from langchain_core.documents import Document
from langchain_community.vectorstores import Chroma
from langchain_community.vectorstores import utils as chromautils
from langchain_community.embeddings import HuggingFaceEmbeddings

documents = []
for element in elements:
    metadata = element.metadata.to_dict()
    documents.append(Document(page_content=element.text, metadata=metadata))

# Some metadata is too complex for ChromaDB, so we filter it out.
docs = chromautils.filter_complex_metadata(documents)

db = Chroma.from_documents(docs, HuggingFaceEmbeddings(model_name="BAAI/bge-base-en-v1.5"))
retriever = db.as_retriever(search_type="similarity", search_kwargs={"k": 4})

INFO: Use pytorch device_name: cuda
INFO: Load pretrained SentenceTransformer: BAAI/bge-base-en-v1.5
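Two steps in the retriever cell are worth unpacking: dropping metadata values the vector store cannot hold, and returning the `k` most similar chunks for a query. The sketch below imitates both in plain Python. It is a simplified stand-in, not LangChain's or Chroma's code; cosine similarity is used for illustration, and the vector store may rank by a different distance metric. The function names and toy vectors are assumptions.

```python
import math


def keep_simple_metadata(metadata, allowed=(str, int, float, bool)):
    """Drop metadata entries whose values are not plain scalars."""
    return {k: v for k, v in metadata.items() if isinstance(v, allowed)}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, doc_vecs, k=4):
    """Return the indices of the k vectors most similar to the query."""
    order = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return order[:k]


# Toy data: list-valued metadata is dropped, scalar values are kept.
meta = {"filename": "plants.ppt", "page_number": 3, "languages": ["eng"]}
print(keep_simple_metadata(meta))

# Toy 2-D "embeddings"; real embeddings have hundreds of dimensions.
vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
print(top_k([1.0, 0.0], vectors, k=2))
```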


## Set up your LLM of choice

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# Here we use https://huggingface.co/NousResearch/Hermes-2-Pro-Llama-3-8B
model_name = 'NousResearch/Hermes-2-Pro-Llama-3-8B'

# A 4-bit quantized version of the model can run on the free T4 GPU in Colab
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)
tokenizer = AutoTokenizer.from_pretrained(model_name)
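To see why 4-bit quantization matters on a T4 (16 GB of GPU memory), here is a back-of-the-envelope estimate of the memory needed for the weights alone. It assumes a round figure of 8 billion parameters and ignores activations, the KV cache, and quantization overhead, so treat the numbers as rough lower bounds.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate memory for model weights alone, in gigabytes (1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 8e9  # assumed round figure for an "8B" model
for label, bits in [("16-bit", 16), ("4-bit", 4)]:
    print(f"{label}: ~{weight_memory_gb(N_PARAMS, bits):.0f} GB")
```

At 16 bits per parameter the weights alone would fill the card, while at 4 bits they leave room for the embedding model and generation.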



## Bringing it all together with LangChain

In [None]:
from langchain_community.llms import HuggingFacePipeline
from langchain_core.prompts import PromptTemplate
from transformers import pipeline
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|im_end|>")
]

text_generation_pipeline = pipeline(
    model=model,
    tokenizer=tokenizer,
    task="text-generation",
    temperature=0.2,
    do_sample=True,
    repetition_penalty=1.1,
    return_full_text=False,
    max_new_tokens=250,
    eos_token_id=terminators,
)

llm = HuggingFacePipeline(pipeline=text_generation_pipeline)

prompt_template = """
<|im_start|>system
You are a helpful assistant.
You are given relevant documents for context and a question. Provide a conversational answer.
If you don't know the answer, just say "I do not know." Don't make up an answer.<|im_end|>
<|im_start|>user
Question: {question}
Context: {context}<|im_end|>
<|im_start|>assistant
"""

prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=prompt_template,
)

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
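The first two steps of the chain can be hard to picture: the question is sent to the retriever, the retrieved documents are joined into one string, and both values are substituted into the prompt. The sketch below mimics that data flow without LangChain. `FakeDoc`, `build_prompt`, the shortened template, and the sample texts are all hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass


@dataclass
class FakeDoc:
    """Hypothetical stand-in for a retrieved document."""
    page_content: str


def format_docs(docs):
    # Same joining logic as in the cell above.
    return "\n\n".join(doc.page_content for doc in docs)


def build_prompt(question, docs, template="Question: {question}\nContext: {context}"):
    """Mimic the first two chain steps: gather context, then fill the template."""
    inputs = {"context": format_docs(docs), "question": question}
    return template.format(**inputs)


docs = [
    FakeDoc("Wood Sorrel has heart-shaped leaves."),
    FakeDoc("Moonseed has no tendrils."),
]
print(build_prompt("Which plant has heart-shaped leaves?", docs))
```

In the real chain the resulting string is passed to the LLM, and `StrOutputParser` returns the generated text.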



In [None]:
question = "How can I identify what plant is in front of me?"
rag_chain.invoke(question)

"To identify the plant in front of you, you'll need to compare its characteristics with those listed in the booklet. Look at the shape and size of the leaves, the type of seeds or berries it has, whether it has tendrils, and any unique features such as the stem growth pattern. You may also want to check if the plant has medicinal properties mentioned in the booklet, like the presence of hearts for Wood Sorrel or the absence of tendrils for Canadian Moonseed. Once you find a match, that's the plant you've identified. If you're still unsure after comparing these features, try collecting samples of the plant and bringing them back to the booklet for further comparison. Remember to respect the environment and avoid picking plants unnecessarily."

In [None]:
# Let's see what we are getting from the retriever for the question

retriever.invoke(question)

[Document(metadata={'filename': 'plants.ppt', 'filetype': 'application/vnd.ms-powerpoint', 'orig_elements': 'eJzsu1er9Gy/H/ZVXp6jBCVbveVMmhn13iUnvKj33mX83a372XZsB0IcTCAHe8Ewa6RrrvIvv7LE+nf//q+8y/t82P5ZZ3/9b//4iyZjmsKxDEMQEoHoGEPznESQDEcJPIEw6K//5R9/9fkWZ/EWv+P//V9pvOXluNz/zPJpq95L0Dviz91/ruO+pPnfg97P+T/7MauLOv97GZhEaAKlEAT+l7+n/HvAtIxpvq7/ZQRFwRSN/guMUQhFon/GTfnS1+taj8P6z/+0hX/37/96Z/6zDorCOPIf/o932JKn45L9sxvf3Y3L31uY4r939xeYjsP2nhfMlvrIQfX+/v3+316dunjY1n+Zpu2v//Af3vmKusuHuP+zyF//1b3/dGe7p7/vxNPU1e+K7+7AY8j+pV//12k882Ua6+Hvwe83yz0u8/XPtv/Kh/KvP5ud3iv/HPY+yf/sFP6z3pZf258Z7X056iPu/mH8WfQfYvZu8Y3hv67xD3Yc23f1PzP/5y049dblf71T/F/zmuNxHidxDCExjuc5nCQIHmdoQb+/wWRR/D/nFf63vP4P5fXvK8v/i077rwvBr7vsH7+sTrp8/ddiWP/x35V3NIfzBEHjBC+QAqZxjIpTBE3pdxmsoGDy3/L+/+u8q3lWp/XwnxFg/e/KeVZABZ1mRUYWNJWiMJSnOEHgMYqTOUYi1L9h+P/HOUf+6xRa8Tq9l5c6X//3HYKg5H9Kx67L0+0f8T+2pZ7etf8xFv/o8vjI1//5vyvBVExQKYFneAHlBI0hBAxRSJLk1BvuLEaTf0vw/0CC6/5PLpN4zQns773QDYgxjGm3kWSVDMuYzPvpfWe+IGkygMtXR8bTT8zTp/hVd5VhMLPtNAaAmL9/3pHyo6I