## A RAG example using Hugging Face documentation with LangChain

This notebook combines two tutorials:
https://huggingface.co/learn/cookbook/en/advanced_rag and
https://python.langchain.com/v0.2/docs/integrations/document_loaders/url/

It loads text from URLs and splits it into chunks, which become the documents for RAG. It then takes a user query, retrieves the most relevant chunks, formats a prompt with those chunks as context, and uses a Hugging Face pipeline to get an answer from a Llama 3 8B model.




In [1]:
#The tutorial would have you run the following,
#but it takes too long for us to wait,
#  so we'll just use pre-installed folders

if 0:
  !pip install --upgrade huggingface_hub[pytorch,cli] transformers accelerate datasets
  !pip install --upgrade langchain sentence-transformers langchain-community
  !pip install --upgrade bitsandbytes pypdf faiss-gpu pydantic
  !pip install --upgrade langchain-huggingface
  !pip install --upgrade unstructured
  #now show all packages
  !pip list


In [2]:
from langchain_community.document_loaders import UnstructuredURLLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from transformers import AutoTokenizer

#from langchain.vectorstores import FAISS   #Facebook tool
from langchain_community.vectorstores import FAISS
from langchain_community.vectorstores.utils import DistanceStrategy

#from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_huggingface import HuggingFaceEmbeddings
from typing import Optional, List, Tuple
from langchain.docstore.document import Document as LangchainDocument

print('imports done')


imports done


In [3]:
#Functions to help split up document into chunks
#  we'll use text from url in next cell
def split_documents(
    chunk_size: int,
    knowledge_base: List[LangchainDocument],
    tokenizer_name: str, #EMBEDDING_MODEL_NAME
) -> List[LangchainDocument]:
    """
    Split documents into chunks of maximum size `chunk_size` tokens and return a list of documents.
    """
    text_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
        AutoTokenizer.from_pretrained(tokenizer_name),
        chunk_size=chunk_size,
        chunk_overlap=int(chunk_size / 10),
        add_start_index=True,
        strip_whitespace=True,
        separators=MARKDOWN_SEPARATORS,
    )

    docs_processed = []
    for doc in knowledge_base:
        docs_processed += text_splitter.split_documents([doc])

    # Remove duplicates
    unique_texts = {}
    docs_processed_unique = []
    for doc in docs_processed:
        if doc.page_content not in unique_texts:
            unique_texts[doc.page_content] = True
            docs_processed_unique.append(doc)

    return docs_processed_unique
print('split doc function defined')

split doc function defined
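
The two ideas inside `split_documents`, overlapping chunks and order-preserving de-duplication, can be illustrated without LangChain. The sketch below is a simplification, not the splitter's actual implementation: it assumes whitespace-separated words stand in for tokenizer tokens, it ignores separators, and the helper names `sliding_chunks` and `dedupe` are hypothetical.

```python
def sliding_chunks(tokens, chunk_size, overlap):
    """Cut a token list into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


def dedupe(items):
    """Drop repeated items while keeping the first occurrence of each, in order."""
    return list(dict.fromkeys(items))


# Toy usage: words stand in for tokens (an assumption made for illustration only)
words = "slurm is an open source cluster management and job scheduling system".split()
for chunk in sliding_chunks(words, chunk_size=4, overlap=1):
    print(" ".join(chunk))
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one neighbouring chunk, at the cost of storing some text twice.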


In [4]:
#First set up loader and get web pages as the raw documents
urls      = [ "https://slurm.schedmd.com/quickstart.html",
               "https://slurm.schedmd.com/man_index.html"  ]

loader    = UnstructuredURLLoader(urls=urls)
raw_pages = loader.load_and_split()

#raw_pages is a list
print('Num of raw pages after split:',len(raw_pages))

#Second set up a model to split the web pages
EMBEDDING_MODEL_NAME = "thenlper/gte-small"

# We use a hierarchical list of separators specifically tailored for splitting Markdown documents
# This list is taken from LangChain's MarkdownTextSplitter class
MARKDOWN_SEPARATORS = [
    "\n#{1,6} ",
    "```\n",
    "\n\\*\\*\\*+\n",
    "\n---+\n",
    "\n___+\n",
    "\n\n",
    "\n",
    " ",
    "",
    ]

#Now split the documents into chunks of a given size in tokens
#  each chunk will be embedded and stored in a vector database, and
#  each query will be matched to chunks by embedding similarity
#  (a semantic match rather than a literal keyword match);
#  the retrieved chunks are used as context for the prompt
docs_processed = split_documents(
    256,        # chunk size <<<--- try diff size, too big is wasteful, too small useless
    raw_pages,  
    tokenizer_name=EMBEDDING_MODEL_NAME,
)

print('Length of docs:', len(docs_processed))


Num of raw pages after split: 5
Length of docs: 30
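
To see why the separator list is ordered from coarse to fine, here is a minimal sketch of recursive splitting. It is not LangChain's implementation: it assumes plain-string separators (no regular expressions), measures length in characters rather than tokens, and does not merge small pieces back together as the real splitter does. The function name `recursive_split` is hypothetical.

```python
def recursive_split(text, separators, max_len):
    """Split text into pieces of at most max_len characters.

    Separators are tried in order, so coarse boundaries (paragraphs) are
    preferred and finer ones (lines, words) are used only for pieces that
    are still too long.
    """
    if len(text) <= max_len:
        return [text] if text else []
    if not separators:
        # Nothing left to split on: fall back to a hard cut
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    first, remaining = separators[0], separators[1:]
    pieces = []
    for part in text.split(first):
        pieces.extend(recursive_split(part, remaining, max_len))
    return pieces


sample = "alpha beta\n\ngamma delta epsilon"
print(recursive_split(sample, ["\n\n", "\n", " "], max_len=12))
```

In this toy run the first paragraph fits and is kept whole, while the second is too long and falls through to the word-level separator.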


In [5]:
#Third, set up embedding model and create vector database
embedding_model = HuggingFaceEmbeddings(
    model_name=EMBEDDING_MODEL_NAME,
    multi_process=False,  #True,   #this might cause some fork issues?
    model_kwargs={"device": "cpu"},   # "cuda"},
    encode_kwargs={"normalize_embeddings": True},  # Set `True` for cosine similarity
  )

KNOWLEDGE_VECTOR_DATABASE = FAISS.from_documents(
    docs_processed, embedding_model, distance_strategy=DistanceStrategy.COSINE
  )

#Now, embed a user query in the same space, show sample document
user_query = "How to create a slurm job?"
query_vector = embedding_model.embed_query(user_query)

print(f"\nStarting document vector database retrieval for {user_query=}...")
retrieved_docs = KNOWLEDGE_VECTOR_DATABASE.similarity_search(query=user_query, k=5)
print("=========== retrieved docs metadata  =============================")
print(retrieved_docs[0].metadata)





Starting document vector database retrieval for user_query='How to create a slurm job?'...
{'source': 'https://slurm.schedmd.com/quickstart.html', 'start_index': 0}
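
The `normalize_embeddings` setting matters because, for unit-length vectors, the dot product equals the cosine similarity. The sketch below checks this on small hand-made vectors and ranks them against a query. It is an illustration in pure Python, not how FAISS computes the search, and the helper names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def top_k(query, vectors, k):
    """Return the indices of the k vectors most similar to the query."""
    q = normalize(query)
    scored = [(dot(q, normalize(v)), i) for i, v in enumerate(vectors)]
    return [i for _, i in sorted(scored, reverse=True)[:k]]


a, b = [3.0, 4.0], [4.0, 3.0]
# Both expressions give the same similarity once the vectors are normalized
print(cosine(a, b), dot(normalize(a), normalize(b)))
print(top_k([1.0, 0.0], [[0.0, 1.0], [1.0, 0.1], [1.0, 1.0]], k=2))
```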


In [6]:
#Now setup the hugging face pipeline
import huggingface_hub
from transformers import AutoTokenizer
import transformers
import torch

import os

print('hugging face imports done')


hugging face imports done


In [7]:
# You might need to do this one time to save the auth token in
#   ~/.cache/huggingface/token
# You also need a Hugging Face account to create an access token.
# Never paste the token into the notebook; read it from an environment
#   variable instead (here assumed to be named HF_TOKEN).
! ~/.local/bin/huggingface-cli login --token $HF_TOKEN

In [8]:
#Set up model and tokenizer
model="meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model)

print('tokenizer loaded')

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


tokenizer loaded


In [9]:
#Set up prompt template with a place for context information
prompt_in_chat_format = [
    {
        "role": "system",
        "content": """Using the information contained in the context,
give a comprehensive answer to the question.
Respond only to the question asked, response should be concise and relevant to the question.
Provide the number of the source document when relevant.
If the answer cannot be deduced from the context, do not give an answer.""",
    },
    {
        "role": "user",
        "content": """Context:
{context}
---
Now here is the question you need to answer.

Question: {question}""",
    },
]
RAG_PROMPT_TEMPLATE = tokenizer.apply_chat_template(
    prompt_in_chat_format, tokenize=False, add_generation_prompt=True
)
print('RAG_PROMPT_TEMPLATE set up')


RAG_PROMPT_TEMPLATE set up


In [10]:
#Set up the actual prompt, with context consisting of the retrieved docs
retrieved_docs_text = [
    doc.page_content for doc in retrieved_docs
]  # We only need the text of the documents
context = "\nExtracted documents:\n"
context += "".join(
    [f"Document {i}:::\n" + doc for i, doc in enumerate(retrieved_docs_text)]
)

final_prompt = RAG_PROMPT_TEMPLATE.format(
    question=user_query, context=context
)

#Final prompt is a long string
print('Final prompt beginning:')
print(final_prompt[0:150],'  ....... ')
print('Final prompt ending: ')
print(final_prompt[-150:-1])

Final prompt beginning:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Using the information contained in the context,
give a comprehensive answer to the questi   ....... 
Final prompt ending: 
ol for
---
Now here is the question you need to answer.

Question: How to create a slurm job?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
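
The prompt is filled with `str.format`, which only interprets braces in the template, not in the values substituted into it. The sketch below shows this with a shortened stand-in template. `TOY_TEMPLATE`, `build_context`, and `fill_prompt` are hypothetical names, and unlike the cell above the sketch adds a newline after each document.

```python
# Shortened stand-in for RAG_PROMPT_TEMPLATE (an assumption, without chat special tokens)
TOY_TEMPLATE = "Context:\n{context}\n---\nQuestion: {question}"


def build_context(doc_texts):
    """Number each retrieved text so the model can cite its source document."""
    parts = [f"Document {i}:::\n{text}\n" for i, text in enumerate(doc_texts)]
    return "\nExtracted documents:\n" + "".join(parts)


def fill_prompt(template, question, doc_texts):
    return template.format(question=question, context=build_context(doc_texts))


docs = ["sbatch submits a script.", "Use ${SLURM_JOB_ID} in scripts."]
# Braces inside a document are left alone because values are not re-parsed
print(fill_prompt(TOY_TEMPLATE, "How to create a slurm job?", docs))
```

Literal braces in the template itself would need to be doubled (`{{` and `}}`), otherwise `format` treats them as placeholders.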



In [11]:
#set up the function
my_pipe2 = transformers.pipeline(
    #"text-generation",
    model=model,
    #for gpu :
    torch_dtype=torch.float16,
    #torch_dtype=torch.float32,  #for cpu use this
    device_map="auto",
    #device=device2use
)
print('pipeline2 defined')


Loading checkpoint shards: 100%|██████████| 4/4 [00:13<00:00,  3.42s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


pipeline2 defined
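
When the pipeline is called, each result is a dict whose `generated_text` starts with the prompt itself, so the answer has to be separated from it. Below is a minimal sketch of that step, using a made-up result in place of real model output. `extract_answer` is a hypothetical helper. The pipeline call also accepts `return_full_text=False`, which should return only the newly generated text.

```python
def extract_answer(generated_text, prompt):
    """Return only the text generated after the prompt."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].strip()
    # Fall back to the full text if the prompt is not an exact prefix
    return generated_text.strip()


# Made-up stand-ins for final_prompt and the pipeline's return value
toy_prompt = "Question: How to create a slurm job?\nAnswer:"
toy_results = [{"generated_text": toy_prompt + " Write a batch script and submit it.\n"}]

for result in toy_results:
    print(extract_answer(result["generated_text"], toy_prompt))
```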


In [12]:
#now call the function with the prompt as input and other options
results_list = my_pipe2(
    final_prompt,
    do_sample=True,
    top_k=5,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=500, #num new tokens to generate
)

mem_allocated = torch.cuda.memory_allocated()
print('MYINFO mem allocated aft results:', mem_allocated)

for result in results_list:   #result is a python dict object
    print(' ----------------- Generated Text Result --------------------------')
    print(f"Result: {result['generated_text']}")



Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.


MYINFO mem allocated aft results: 16069058560
 ----------------- Generated Text Result --------------------------
Result: <|begin_of_text|><|start_header_id|>system<|end_header_id|>

Using the information contained in the context,
give a comprehensive answer to the question.
Respond only to the question asked, response should be concise and relevant to the question.
Provide the number of the source document when relevant.
If the answer cannot be deduced from the context, do not give an answer.<|eot_id|><|start_header_id|>user<|end_header_id|>

Context:

Extracted documents:
Document 0:::
Slurm Workload Manager

SchedMD

Navigation

Slurm Workload Manager

Version 24.05

About
					
						Overview
						Release Notes

Using
					
						Documentation
						FAQ
						Publications

Installing
					
						Download
						Related Software
						Installation Guide

Getting Help
					
						Mailing Lists
						Support and Training
						Troubleshooting

Quick Start User Guide

Overview

Slurm i