# Llama 3 RAG

Retrieval-augmented generation (RAG) lets a model consult a knowledge base outside its training data before it generates a response. This extends an LLM to a specific domain without retraining it. To build a RAG pipeline with Llama 3, we'll use Hugging Face, so make sure you have an account and that it has been granted access to the gated Llama 3 model on the Hub.
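To make the retrieve-then-generate idea concrete, here is a minimal, dependency-free sketch. It is a toy: the knowledge base, the word-overlap scoring, and the helper names (`tokenize`, `retrieve`, `build_prompt`) are all made up for illustration, and the rest of this notebook uses embeddings and a vector store instead.

```python
import re

# Toy knowledge base standing in for real documents (illustrative only).
KNOWLEDGE_BASE = [
    "The warranty on the X100 drill lasts 24 months.",
    "The X100 drill ships with two batteries.",
    "Our office is closed on public holidays.",
]


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, documents: list[str], k: int = 2) -> list[str]:
    """Rank documents by how many words they share with the question."""
    q_words = tokenize(question)
    ranked = sorted(documents, key=lambda d: len(q_words & tokenize(d)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, context_docs: list[str]) -> str:
    """Place the retrieved passages in front of the question."""
    context = "\n".join(context_docs)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


question = "How long is the warranty on the X100 drill?"
top_docs = retrieve(question, KNOWLEDGE_BASE)
prompt_text = build_prompt(question, top_docs)
print(prompt_text)  # this string is what would be sent to the LLM
```

The model never needs to have seen the warranty text during training; it only needs to read it in the prompt.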

In [32]:
from huggingface_hub import login, notebook_login
from langchain.llms import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig, pipeline, BitsAndBytesConfig, AutoConfig
import torch
from textwrap import fill
from langchain.prompts import PromptTemplate
import locale
from langchain.document_loaders import UnstructuredURLLoader
from langchain.vectorstores.utils import filter_complex_metadata
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

from langchain.chains import RetrievalQA

# Workaround for hosted notebook environments that report a non-UTF-8 locale
locale.getpreferredencoding = lambda: "UTF-8"

notebook_login()


### Loading Llama 3
The loading settings below may need to be adjusted for your device, in particular for the amount of GPU memory you have. Contact us if you'd like help adapting them.
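When deciding how to adjust the settings, a rough weights-only memory estimate is a useful starting point. The sketch below takes the "8B" in the model name at face value and ignores activations, the KV cache, and framework overhead, so treat the numbers as lower bounds rather than exact requirements.

```python
# Rough weights-only memory estimate for an 8B-parameter model.
N_PARAMS = 8e9  # assumption: "8B" taken at face value

# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16/bf16": 2, "int8": 1, "int4": 0.5}


def weight_memory_gib(n_params: float, bytes_per_param: float) -> float:
    """Return the memory needed for the weights alone, in GiB."""
    return n_params * bytes_per_param / 2**30


for dtype, nbytes in BYTES_PER_PARAM.items():
    print(f"{dtype:>9}: ~{weight_memory_gib(N_PARAMS, nbytes):.1f} GiB")
```

If the estimate for your chosen precision is close to or above your GPU memory, quantizing further or offloading some modules to the CPU is likely necessary.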

In [44]:
model_name = "meta-llama/Meta-Llama-3-8B-Instruct"

# Quantize to 8-bit and allow modules placed on the CPU to stay in fp32
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_enable_fp32_cpu_offload=True,
)

# Device mapping for the model: everything is placed on GPU 0 here.
# Set an entry to "cpu" to offload that module if it does not fit on the GPU.
device_map = {
    "model.embed_tokens": 0,
    "model.embed_positions": 0,
    "model.layers": 0,
    "model.norm": 0,
    "lm_head": 0
}

# Load the model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map=device_map,
    quantization_config=quantization_config,
    trust_remote_code=True
)

print(model.hf_device_map)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

# Generation configuration
gen_cfg = GenerationConfig.from_pretrained(model_name)
gen_cfg.max_new_tokens = 512
# Sampling with a near-zero temperature behaves almost like greedy decoding
gen_cfg.do_sample = True
gen_cfg.temperature = 0.0000001
gen_cfg.repetition_penalty = 1.11

# Create the pipeline; return_full_text is a pipeline argument,
# not a GenerationConfig field, so it is passed here
pipe = pipeline(
    task="text-generation",
    model=model,
    tokenizer=tokenizer,
    generation_config=gen_cfg,
    return_full_text=True,
)

# Use the pipeline in HuggingFacePipeline
llm = HuggingFacePipeline(pipeline=pipe)



{'model.embed_tokens': 0, 'model.embed_positions': 0, 'model.layers': 0, 'model.norm': 0, 'lm_head': 0}


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
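The generation configuration above enables sampling but sets the temperature to 0.0000001. The following sketch, using made-up logits for three candidate tokens, illustrates why that combination behaves almost like always picking the highest-scoring token. The function name `softmax_with_temperature` is ours and is not part of any library.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert logits to probabilities after dividing by the temperature."""
    top = max(logits)
    # Subtracting the maximum keeps every exponent <= 0, which avoids overflow.
    exps = [math.exp((x - top) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
for t in (1.0, 0.5, 1e-7):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature shrinks, the probability mass concentrates on the largest logit, so sampling becomes effectively deterministic.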


In [45]:
%%time

prompt_template_llama3 = """
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Use the following context to answer the question at the end. Do not use any other information. If you can't find the relevant information in the context, just say you don't have enough information to answer the question. Don't try to make up an answer.

{context}<|eot_id|><|start_header_id|>user<|end_header_id|>

{question}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""

# The template has two placeholders: the retrieved context and the user question
prompt = PromptTemplate(
    template=prompt_template_llama3,
    input_variables=["context", "question"],
)

CPU times: user 287 μs, sys: 0 ns, total: 287 μs
Wall time: 269 μs
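To see what the model eventually receives, the sketch below fills a shortened copy of the template with plain `str.format`. The passages and question are placeholders, and joining passages with a blank line is an assumption about how the retrieved chunks get combined; the chain used later handles this step for you.

```python
# Shortened copy of the Llama 3 chat template used above (illustrative).
template = (
    "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
    "Use the following context to answer the question at the end.\n\n"
    "{context}<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n"
    "{question}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n"
)

# Placeholder passages standing in for chunks returned by the retriever.
retrieved_chunks = ["First retrieved passage.", "Second retrieved passage."]

filled = template.format(
    context="\n\n".join(retrieved_chunks),  # assumption: blank line between chunks
    question="What do the passages say?",
)
print(filled)
```

The system turn carries the instructions and the retrieved context, the user turn carries the question, and the prompt ends with an open assistant header so the model writes the answer next.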


### Loading your documents
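Save the documents you want the model to retrieve from in your local repo. Note that RAG does not fine-tune the model; the documents are only indexed and looked up at query time. For this example, we'll use the Communist Manifesto and The Wealth of Nations, both placed in the `content` directory.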

Prior to running this cell, in your terminal, run:
```bash
conda install -c conda-forge poppler
```

In [46]:
from langchain.document_loaders import UnstructuredPDFLoader

# add your PDF paths here
loaders = [UnstructuredPDFLoader(fn) for fn in ['content/Communist_Manifesto.pdf',
                                                'content/Wealth of Nations.pdf']]

# One splitter is enough for all documents
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=256)

chunked_pdf_doc = []

for loader in loaders:
    print("Loading raw document..." + loader.file_path)
    pdf_doc = loader.load()
    # Drop metadata values that vector stores cannot handle
    updated_pdf_doc = filter_complex_metadata(pdf_doc)
    print("Splitting text...")
    documents = text_splitter.split_documents(updated_pdf_doc)
    chunked_pdf_doc.extend(documents)

len(chunked_pdf_doc)

Loading raw document...content/Communist_Manifesto.pdf


The PDF <_io.BufferedReader name='content/Wealth of Nations.pdf'> contains a metadata field indicating that it should not allow text extraction. Ignoring this field and proceeding. Use the check_extractable if you want to raise an error in this case


Splitting text...
Loading raw document...content/Wealth of Nations.pdf
Splitting text...


3798
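The two PDFs end up as 3798 chunks because each document is cut into overlapping windows. The sketch below shows only the overlap mechanics with a fixed-size sliding window. It is a simplification: `RecursiveCharacterTextSplitter` additionally prefers to cut at separators such as paragraph and line breaks, and the helper `chunk_text` is hypothetical.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=4)
for c in chunks:
    print(c)
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing and embedding some text twice.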

### Building the vector index

In [47]:
%%time
# Embed every chunk and store the vectors in an in-memory FAISS index
embeddings = HuggingFaceEmbeddings()
db_pdf = FAISS.from_documents(chunked_pdf_doc, embeddings)

CPU times: user 9min 4s, sys: 618 ms, total: 9min 4s
Wall time: 8min 37s
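Most of the time above goes into computing one embedding vector per chunk. Once the vectors exist, a lookup amounts to finding the stored vectors closest to the query vector. The sketch below shows that idea as an exhaustive search with Euclidean distance over hand-made three-dimensional vectors. The vectors and labels are invented, and a real index holds vectors with hundreds of dimensions produced by the embedding model.

```python
import math

# Hand-made 3-dimensional "embeddings" (illustrative only).
index = {
    "chunk about wages": [0.9, 0.1, 0.0],
    "chunk about trade": [0.1, 0.9, 0.1],
    "chunk about class struggle": [0.0, 0.2, 0.9],
}


def l2_distance(a: list[float], b: list[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def search(query_vec: list[float], vectors: dict[str, list[float]], k: int = 2):
    """Return the k stored chunks closest to the query, nearest first."""
    scored = [(l2_distance(query_vec, vec), text) for text, vec in vectors.items()]
    return sorted(scored)[:k]


query_vec = [0.8, 0.2, 0.1]  # pretend embedding of a question about wages
results = search(query_vec, index)
for distance, text in results:
    print(f"{distance:.3f}  {text}")
```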


### Running the RAG

In [None]:
%%time

chain_pdf = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # put all retrieved chunks into a single prompt
    retriever=db_pdf.as_retriever(
        search_type="similarity_score_threshold",
        search_kwargs={"k": 4, "score_threshold": 0.2},
    ),
    chain_type_kwargs={"prompt": prompt},
)

# add and edit queries based on your documents
queries = [
    "Can a free market regulate itself?",
    "What would be the implications if free market principles were applied to a communist society?",
]

for i, query in enumerate(queries):
    if i > 0:
        print('#######################################################################')
    result = chain_pdf.invoke(query)
    print(fill(result['result'].strip(), width=200))
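The retriever settings `k` and `score_threshold` work together: chunks scoring below the threshold are discarded, and at most `k` of the remaining ones are passed on. The sketch below mimics that selection on invented scores. It assumes relevance scores lie in [0, 1] with higher meaning more similar, and `select_chunks` is a hypothetical helper, not a LangChain function.

```python
def select_chunks(
    scored_chunks: list[tuple[float, str]], k: int, score_threshold: float
) -> list[str]:
    """Keep chunks at or above the threshold, then return the k best of them."""
    kept = [(score, text) for score, text in scored_chunks if score >= score_threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in kept[:k]]


# Invented (relevance score, chunk label) pairs for one query.
scored = [
    (0.82, "A"), (0.15, "B"), (0.47, "C"), (0.31, "D"),
    (0.25, "E"), (0.05, "F"), (0.22, "G"),
]
selected = select_chunks(scored, k=4, score_threshold=0.2)
print(selected)
```

Here B and F fall below the threshold, and G clears it but is cut because only four chunks are allowed. If a query returns no chunks at all, lowering `score_threshold` is the first thing to try.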