<a href="https://colab.research.google.com/github/dietmarja/LLM-Elements/blob/main/RAG/rag_01.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

To extend the code so that it becomes an example of Retrieval-Augmented Generation (RAG), we need to incorporate a language model that can generate text based on the retrieved documents. We will use LangChain to facilitate this process.

The extended code follows these steps:

1. Install the necessary packages
2. Import the necessary modules
3. Set up Pinecone and Hugging Face embeddings
4. Extract text from the PDFs and store it in Pinecone
5. Perform RAG with the help of LangChain
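
As a point of reference, the retrieve-then-generate flow described above can be sketched without any external services. The block below is a minimal sketch under stated assumptions, not the notebook's implementation: `embed` is a hypothetical bag-of-words stand-in for a real embedding model, the two passages are made-up examples, and `retrieve` and `build_prompt` are illustrative helper names. In a real pipeline the vectors would come from an embedding model and the resulting prompt would be passed to the generation model.

```python
import math
import re
from collections import Counter


def embed(text):
    """Hypothetical stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, passages, top_k=1):
    """Return the top_k passages most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(passages, key=lambda p: cosine(query_vec, embed(p)), reverse=True)
    return ranked[:top_k]


def build_prompt(query, retrieved):
    """Place the retrieved passages in front of the question."""
    context = "\n".join(retrieved)
    return f"Answer using the context.\ncontext: {context}\nquestion: {query}"


# Made-up passages, for illustration only
passages = [
    "The Transformer relies entirely on attention mechanisms.",
    "LoRA adds trainable low-rank matrices to frozen weights.",
]
query = "Which method uses low-rank matrices?"
top = retrieve(query, passages, top_k=1)
prompt = build_prompt(query, top)
print(prompt)
```

The design point is the separation of concerns: retrieval only ranks passages, and generation only sees the query plus whatever retrieval returned. Swapping the toy `embed` for a real embedding model would leave `retrieve` and `build_prompt` unchanged.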

In [6]:
# Install necessary libraries if not already installed
!pip install -q torch transformers sentence-transformers PyMuPDF langchain-community

# Imports
import fitz  # PyMuPDF
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Hugging Face token
hf_token = "********"

# Function to extract the text of each page from a PDF
def extract_text_from_pdf(pdf_path):
    with fitz.open(pdf_path) as doc:
        # One string per page, in page order
        return [page.get_text() for page in doc]

# Paths to your PDF files
pdf_path_1 = "Attention_Paper.pdf"
pdf_path_2 = "Lora_Paper.pdf"

# Extract text from PDFs
texts_1 = extract_text_from_pdf(pdf_path_1)
texts_2 = extract_text_from_pdf(pdf_path_2)

# Concatenate all text from each document to form a single string
full_text_1 = " ".join(texts_1)
full_text_2 = " ".join(texts_2)

# Load the generation model and tokenizer, passing the access token
generation_model_name = "t5-large"  # Larger checkpoint than t5-small or t5-base
generation_model = AutoModelForSeq2SeqLM.from_pretrained(generation_model_name, token=hf_token)
generation_tokenizer = AutoTokenizer.from_pretrained(generation_model_name, token=hf_token)

# Function to summarize one piece of text; input beyond max_input_tokens tokens is truncated
def generate_summary(text, max_length=200, max_input_tokens=512):
    input_text = f"summarize: {text}"  # T5 task prefix
    inputs = generation_tokenizer(input_text, return_tensors="pt", truncation=True, max_length=max_input_tokens)
    # Pass input_ids together with the attention mask
    outputs = generation_model.generate(**inputs, max_length=max_length)
    summary = generation_tokenizer.decode(outputs[0], skip_special_tokens=True)
    return summary

# Function to summarize long documents in chunks.
# Note: chunk_size counts characters here, while generate_summary truncates its input by tokens.
def summarize_long_text(text, chunk_size=512, max_length=200):
    text_chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunk_summaries = [generate_summary(chunk, max_length=max_length) for chunk in text_chunks]
    combined_summary = " ".join(chunk_summaries)
    # Second pass for coherence; only the first 512 tokens of the combined summary are read
    return generate_summary(combined_summary, max_length=max_length)

# Generate summaries for each document
summary_1 = summarize_long_text(full_text_1)
summary_2 = summarize_long_text(full_text_2)

# Print the summaries
print(f"Summary of Attention Paper:\n{summary_1}\n")
print(f"Summary of Lora Paper:\n{summary_2}\n")


Summary of Attention Paper:
transformer is a new simple network architecture based solely on attention mechanisms. it is superior in quality while being more parallelizable and requiring less time to train. transformer can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.

Summary of Lora Paper:
deploying indepen- dent instances of fine-tuned models, each with 175B parameters is prohibitively expensive. a large-scale, pre-trained language model is usually adapted to multiple down- stream applications. a low rank suffices even when the full rank (i.e., d) is as high as 12,288, making LoRA both storage- and compute-efficient.
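
The summaries above are produced from fixed 512-character slices, which can cut words and sentences in half. One possible alternative, sketched here under stated assumptions, is to split on whitespace and let neighbouring chunks overlap by a few words so that each chunk keeps some context from the previous one. The function name `chunk_words` and its default sizes are illustrative and not part of the notebook, and word counts only approximate T5 token counts.

```python
def chunk_words(text, chunk_size=300, overlap=30):
    """Split text into chunks of at most chunk_size words, overlapping by `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


# Small demonstration on ten placeholder words
sample = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_words(sample, chunk_size=4, overlap=1):
    print(chunk)
```

Each chunk returned by `chunk_words` could be passed to `generate_summary` in place of the character slices. Whether the overlap improves the final summary would need to be checked on the actual papers.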

