# Use PowerPoint documents in your RAG system

## Setup

* Install the required libraries.
* Get an [Unstructured API key](https://unstructured.io/api-key-free). The free tier is sufficient unless your PowerPoint file has more than 1000 pages.
* Get a Hugging Face token from your [profile's settings](https://huggingface.co/settings/tokens). Depending on the model you choose, you may not need one.

In [None]:
!pip install -q unstructured-client unstructured[all-docs] langchain accelerate bitsandbytes sentence-transformers chromadb

In [None]:
!pip install transformers --upgrade

In [None]:
import os

os.environ["UNSTRUCTURED_API_KEY"] = "YOUR_UNSTRUCTURED_API_KEY"
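A common slip in this step is running the rest of the notebook with the placeholder string still in place. The sketch below is one way to fail early in that case. The helper name `get_required_env`, the `YOUR_` prefix check, and the `DEMO_API_KEY` variable are illustrative assumptions, not part of the Unstructured client.

```python
import os


def get_required_env(name, placeholder_prefix="YOUR_"):
    """Return the value of an environment variable, or raise if it is
    missing or still set to a placeholder such as "YOUR_..."."""
    value = os.environ.get(name, "")
    if not value or value.startswith(placeholder_prefix):
        raise RuntimeError(f"Set a real value for {name} before continuing.")
    return value


# Demo with a made-up variable so the example runs without a real key.
os.environ["DEMO_API_KEY"] = "abc123"
print(get_required_env("DEMO_API_KEY"))
```

In the notebook, the same check could be applied to `UNSTRUCTURED_API_KEY` right before the client is created.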

In [None]:
from huggingface_hub import login

login(token="YOUR_HF_TOKEN")

## Preprocessing a PowerPoint file for RAG with Unstructured



In [None]:
# The example PowerPoint is from https://agsci.oregonstate.edu:
# search for "Plant Identification"; the PowerPoint should be the first result.
# Alternatively, upload your own .ppt or .pptx file to Colab.

path_to_pptx = "/content/Plant Identification handout.ppt"

In [None]:
# set up Unstructured API Client to process the document

from unstructured_client import UnstructuredClient
from unstructured_client.models import shared
from unstructured_client.models.errors import SDKError
from unstructured.staging.base import dict_to_elements


unstructured_api_key = os.environ.get("UNSTRUCTURED_API_KEY")

client = UnstructuredClient(
    api_key_auth=unstructured_api_key,
    # if using hosted API, provide your unique API URL:
    # server_url="YOUR_API_URL",
)

# Read the file into memory; the handle can be closed before the API call
with open(path_to_pptx, "rb") as f:
    files = shared.Files(
        content=f.read(),
        file_name=path_to_pptx,
    )

req = shared.PartitionParameters(
    files=files,
    # By setting the chunking strategy here, we partition AND
    # chunk the document in a single call
    chunking_strategy="by_title",
    max_characters=512,
)

try:
    resp = client.general.partition(req)
except SDKError as e:
    # Without a response there is nothing to convert below, so re-raise
    print(e)
    raise

elements = dict_to_elements(resp.elements)
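Before building the retriever, it can help to check what the chunking step produced. The sketch below summarizes a list of element dictionaries, assuming each one carries `"type"` and `"text"` keys in the shape of the raw API response. The helper name `summarize_chunks` and the two sample elements are made up for illustration.

```python
from collections import Counter


def summarize_chunks(element_dicts, max_characters=512):
    """Count chunks by type and report how long the chunks are."""
    type_counts = Counter(el.get("type", "Unknown") for el in element_dicts)
    lengths = [len(el.get("text", "")) for el in element_dicts]
    return {
        "num_chunks": len(element_dicts),
        "types": dict(type_counts),
        "longest": max(lengths, default=0),
        # Chunks longer than the limit requested from the chunker
        "over_limit": sum(1 for n in lengths if n > max_characters),
    }


# Hypothetical sample data standing in for the real API response.
sample = [
    {"type": "CompositeElement", "text": "Leaf type-simple leaf", "metadata": {}},
    {"type": "CompositeElement", "text": "Plant classification keys", "metadata": {}},
]
summary = summarize_chunks(sample)
print(summary)
```

In the notebook, `summarize_chunks(resp.elements)` would give the same overview for the real document, provided the response has the assumed shape.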

## Retriever


In [None]:
from langchain_core.documents import Document
from langchain_community.vectorstores import Chroma
from langchain_community.vectorstores import utils as chromautils
from langchain_community.embeddings import HuggingFaceEmbeddings

documents = []
for element in elements:
    metadata = element.metadata.to_dict()
    documents.append(Document(page_content=element.text, metadata=metadata))

# Some metadata is too complex for ChromaDB, so we filter it out
docs = chromautils.filter_complex_metadata(documents)

db = Chroma.from_documents(docs, HuggingFaceEmbeddings(model_name="BAAI/bge-base-en-v1.5"))
retriever = db.as_retriever(search_type="similarity", search_kwargs={"k": 4})
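To make the metadata filtering step more concrete: Chroma stores only simple scalar metadata values, so nested values such as lists or dictionaries have to be dropped first. The sketch below approximates that idea in plain Python. It is not LangChain's implementation, and the helper name `keep_simple_metadata` and the example metadata are assumptions for illustration.

```python
# Value types assumed to be safe to store as vector store metadata.
ALLOWED_TYPES = (str, int, float, bool)


def keep_simple_metadata(metadata):
    """Return a copy of the metadata with only scalar values kept."""
    return {
        key: value
        for key, value in metadata.items()
        if isinstance(value, ALLOWED_TYPES)
    }


# Hypothetical element metadata with two nested fields.
example = {
    "filename": "Plant Identification handout.ppt",
    "page_number": 3,
    "languages": ["eng"],
    "coordinates": {"points": []},
}
filtered = keep_simple_metadata(example)
print(filtered)
```

The list and dictionary fields are removed, while the filename and page number survive and stay available for filtering or for citing sources later.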

## Set up your LLM of choice

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# Here we use https://huggingface.co/NousResearch/Hermes-2-Pro-Llama-3-8B
model_name = 'NousResearch/Hermes-2-Pro-Llama-3-8B'

# Quantized version of the model can run on the free T4 in Colab
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)
tokenizer = AutoTokenizer.from_pretrained(model_name)

`low_cpu_mem_usage` was None, now set to True since model is quantized.



Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
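A rough back-of-the-envelope calculation shows why 4-bit quantization matters on a T4 with 16 GB of memory. The sketch below estimates the memory needed for the weights alone, assuming a round figure of 8 billion parameters. It ignores activations, the KV cache, and quantization overhead, so real usage will be higher.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory for model weights in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


num_params = 8e9  # assumed round figure for an 8B model

fp16_gib = weight_memory_gib(num_params, 16)
nf4_gib = weight_memory_gib(num_params, 4)

print(f"16-bit weights: {fp16_gib:.1f} GiB")
print(f"4-bit weights:  {nf4_gib:.1f} GiB")
```

Under these assumptions the 16-bit weights alone come close to filling the GPU, while the 4-bit weights leave room for the embedding model and generation.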


## Bringing it all together with LangChain

In [None]:
from langchain_community.llms import HuggingFacePipeline
from langchain_core.prompts import PromptTemplate
from transformers import pipeline
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|im_end|>")
]

text_generation_pipeline = pipeline(
    model=model,
    tokenizer=tokenizer,
    task="text-generation",
    temperature=0.2,
    do_sample=True,
    repetition_penalty=1.1,
    return_full_text=False,
    max_new_tokens=250,
    eos_token_id=terminators,
)

llm = HuggingFacePipeline(pipeline=text_generation_pipeline)

prompt_template = """<|im_start|>system
You are a helpful assistant.
You are given relevant documents for context and a question. Provide a conversational answer.
If you don't know the answer, just say "I do not know." Don't make up an answer.<|im_end|>
<|im_start|>user
Question: {question}
Context: {context}<|im_end|>
<|im_start|>assistant
"""

prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=prompt_template,
)

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
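To see what the model actually receives, it can be useful to reproduce the first two stages of the chain by hand: joining the retrieved chunks and filling the prompt. The sketch below does this in plain Python, without LangChain. The `FakeDoc` class, the shortened template, and the sample chunks are stand-ins made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class FakeDoc:
    """Stand-in for a retrieved document; only page_content is used."""
    page_content: str


def format_docs(docs):
    # Same joining logic as in the chain: chunks separated by blank lines
    return "\n\n".join(doc.page_content for doc in docs)


# Shortened, hypothetical ChatML-style template
template = (
    "<|im_start|>system\nAnswer using the context.<|im_end|>\n"
    "<|im_start|>user\nQuestion: {question}\nContext: {context}<|im_end|>\n"
)


def build_prompt(question, docs):
    """Fill the template the way the retriever and prompt stages would."""
    return template.format(question=question, context=format_docs(docs))


docs = [FakeDoc("Leaf type-simple leaf"), FakeDoc("Plant classification keys")]
print(build_prompt("How can I identify a plant?", docs))
```

Printing the filled prompt this way is a quick check for templating mistakes, such as a missing variable or a misplaced special token, before spending GPU time on generation.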

In [None]:
question = "How can I identify what plant is in front of me?"
rag_chain.invoke(question)

'To identify the plant in front of you, follow these steps:\n\n1. Visually inspect the plant\'s characteristics such as its overall form, whether it is herbaceous or woody, and if it has simple or compound leaves.\n\n2. Take note of the specific features of the leaves, including their shape, size, arrangement, venation pattern, and margins. This will help narrow down the possibilities based on leaf morphology.\n\n3. Observe other parts of the plant, such as the stem, buds, flowers, and fruits, which may provide additional clues to its identity.\n\n4. Use photographic references and consult plant classification keys, which are available online or in books like "Integrated Approach to Plant Identification" by Amy Jo Detweiler, Jan McNeilan, and Gail Gredler from the Horticulture Department. These resources can guide you through the process of identifying the plant based on its various characteristics.\n\n5. If needed, seek expert advice from local garden centers, horticulturists, or bota

In [None]:
# Let's see what we are getting from the retriever for the question

retriever.invoke(question)

[Document(page_content='Integrated Approach to Plant Identification\n\nVisual inspection of plant characteristics\n\nPhotographic references\n\nPlant classification keys\n\nExpert advice\n\nCollect information about what you see:\n\nHerbaceous, conifer, broadleaved evergreen, deciduous?\n\nCollect information about what you see:\n\nWhat is the overall form of the plant?\n\nCollect information about what you see:\n\nWhat are the characteristics of individual plant parts?\n\nLeaf type-simple leaf\n\nLeaf type-pinnately compound\n\nLeaf type\n\nLeaf type', metadata={'filename': 'Plant Identification handout.ppt', 'filetype': 'application/vnd.ms-powerpoint', 'orig_elements': 'eJztWE1vGzcQ...