<img src="https://drive.google.com/uc?export=view&id=1wYSMgJtARFdvTt5g7E20mE4NmwUFUuog" width="200">

[![Build Fast with AI](https://img.shields.io/badge/BuildFastWithAI-GenAI%20Bootcamp-blue?style=for-the-badge&logo=artificial-intelligence)](https://www.buildfastwithai.com/genai-course)
[![EduChain GitHub](https://img.shields.io/github/stars/satvik314/educhain?style=for-the-badge&logo=github&color=gold)](https://github.com/satvik314/educhain)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1gIQIGURhSn8-z5hS0VZSgye6AJg_1D6e?usp=sharing)
## Master Generative AI in 6 Weeks
**What You'll Learn:**
- Build with Latest LLMs
- Create Custom AI Apps
- Learn from Industry Experts
- Join Innovation Community

Transform your AI ideas into reality through hands-on projects and expert mentorship.

[Start Your Journey](https://www.buildfastwithai.com/genai-course)

*Empowering the Next Generation of AI Innovators*

### 🚀 RAGatouille: Advanced RAG Retrieval 🌟

RAGatouille is a Python library designed to simplify the integration and training of state-of-the-art late-interaction retrieval methods, particularly ColBERT, within Retrieval-Augmented Generation (RAG) pipelines. 🤖💡

It offers a modular and user-friendly interface, enabling developers to efficiently incorporate advanced retrieval techniques into their applications. 🔧⚡
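
To make "late interaction" concrete, here is a minimal, dependency-free sketch of the MaxSim scoring idea that ColBERT-style retrievers are built on: each query token is matched with its most similar document token, and the per-token maxima are summed. The two-dimensional vectors are made up for illustration, and the helper names `cosine` and `maxsim_score` are hypothetical; this is not RAGatouille's implementation, which works on real token embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction score: for every query token, take its best-matching
    document token, then sum those maxima."""
    return sum(max(cosine(q, d) for d in doc_vecs) for q in query_vecs)


# Toy 2-dimensional "token embeddings" (made up for illustration).
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_a_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # covers both query tokens
doc_b_tokens = [[1.0, 0.0], [1.0, 0.0]]              # covers only the first

score_a = maxsim_score(query_tokens, doc_a_tokens)
score_b = maxsim_score(query_tokens, doc_b_tokens)
print(score_a, score_b)  # → 2.0 1.0
```

Because every query token contributes its own best match, a document that covers all parts of the query scores higher than one that matches a single term very well.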


### 🔑 Key Features:

- **Training and Fine-Tuning ColBERT Models:** 🏋️‍♂️
  RAGatouille provides tools to train and fine-tune ColBERT models, facilitating the development of retrieval systems tailored to specific datasets and requirements.

- **Embedding and Indexing Documents:** 📚💻
  The library supports the embedding and indexing of documents, allowing for efficient retrieval operations. It includes functionalities for processing and managing large collections of text data.

- **Retrieving Documents:** 🔍
  RAGatouille enables the retrieval of documents based on queries, integrating seamlessly with generative models to enhance the quality and relevance of generated responses.


### **Setup and Installation**

In [None]:
!pip install ragatouille

In [None]:
from ragatouille import RAGPretrainedModel

RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

### **Wikipedia Page Content Retriever 🌍**

In [None]:
import requests

def get_wikipedia_page(title: str):
    """
    Retrieve the full text content of a Wikipedia page.

    :param title: str - Title of the Wikipedia page.
    :return: str | None - Full text content of the page as a raw string, or None if the page has no extract.
    """
    # Wikipedia API endpoint
    URL = "https://en.wikipedia.org/w/api.php"

    # Parameters for the API request
    params = {
        "action": "query",
        "format": "json",
        "titles": title,
        "prop": "extracts",
        "explaintext": True,
    }

    # Custom User-Agent header to comply with Wikipedia's best practices
    headers = {
        "User-Agent": "RAGatouille_tutorial/0.0.1 (ben@clavie.eu)"
    }

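    # A timeout keeps the call from hanging; raise_for_status surfaces HTTP errors
    response = requests.get(URL, params=params, headers=headers, timeout=30)
    response.raise_for_status()
    data = response.json()

    # Extracting page content
    page = next(iter(data['query']['pages'].values()))
    return page['extract'] if 'extract' in page else None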
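
The function above indexes straight into `data['query']['pages']`, so it helps to see what that structure looks like. The sketch below parses two hand-written payloads offline; they are hypothetical and trimmed to the keys being read, and `extract_page_text` is an illustrative helper, not part of RAGatouille or the MediaWiki API.

```python
def extract_page_text(data: dict):
    """Return the plain-text extract from a MediaWiki `query` response,
    or None when the page is missing or has no extract."""
    pages = data.get("query", {}).get("pages", {})
    if not pages:
        return None
    # The API keys pages by page id; a single-title query yields one entry.
    page = next(iter(pages.values()))
    return page.get("extract")


# Hypothetical payloads, trimmed to the keys the function reads.
found = {"query": {"pages": {"12345": {"title": "Example", "extract": "Some text."}}}}
missing = {"query": {"pages": {"-1": {"title": "No such page", "missing": ""}}}}

print(extract_page_text(found))
print(extract_page_text(missing))
```

Checking for `None` before calling `len()` or indexing the document avoids a confusing `TypeError` when a title is misspelled.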

### **Retrieve Wikipedia Page Content Length 📄🔢**

In [None]:
full_document = get_wikipedia_page("Hayao_Miyazaki")
len(full_document)

68505

### **Indexing Wikipedia Content with RAG 🔍📚**




In [None]:
RAG.index(
    collection=[full_document],
    document_ids=['miyazaki'],
    document_metadatas=[{"entity": "person", "source": "wikipedia"}],
    index_name="Miyazaki",
    max_document_length=180,
    split_documents=True
    )
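
With `split_documents=True` and `max_document_length=180`, the long Wikipedia article is broken into shorter passages before it is indexed. The sketch below illustrates the general idea with whitespace-separated tokens. It is a simplification under stated assumptions: RAGatouille uses its own splitter and tokenizer, and `split_into_passages` is a hypothetical helper.

```python
def split_into_passages(text: str, max_tokens: int = 180) -> list[str]:
    """Greedily group whitespace-separated tokens into passages of at most
    `max_tokens` tokens each."""
    tokens = text.split()
    return [
        " ".join(tokens[start:start + max_tokens])
        for start in range(0, len(tokens), max_tokens)
    ]


# A synthetic 400-token "document" stands in for the Wikipedia page.
sample = " ".join(f"word{i}" for i in range(400))
passages = split_into_passages(sample, max_tokens=180)
print([len(p.split()) for p in passages])  # → [180, 180, 40]
```

Each passage is embedded and retrieved separately, which is why search results carry a `passage_id` in addition to the `document_id`.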

### **Retrieving Relevant Information with RAG 🔎🎥**


In [None]:
k = 3
results = RAG.search(query="What animation studio did Miyazaki found?", k=k)
results

Loading searcher for index Miyazaki for the first time... This may take a few seconds
[Feb 02, 14:08:05] #> Loading codec...
[Feb 02, 14:08:05] #> Loading IVF...
[Feb 02, 14:08:05] #> Loading doclens...


100%|██████████| 1/1 [00:00<00:00, 755.32it/s]

[Feb 02, 14:08:05] #> Loading codes and residuals...



100%|██████████| 1/1 [00:00<00:00, 333.94it/s]


Searcher loaded!

#> QueryTokenizer.tensorize(batch_text[0], batch_background[0], bsize) ==
#> Input: . What animation studio did Miyazaki found?, 		 True, 		 None
#> Output IDs: torch.Size([32]), tensor([  101,     1,  2054,  7284,  2996,  2106,  2771,  3148, 18637,  2179,
         1029,   102,   103,   103,   103,   103,   103,   103,   103,   103,
          103,   103,   103,   103,   103,   103,   103,   103,   103,   103,
          103,   103], device='cuda:0')
#> Output Mask: torch.Size([32]), tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0], device='cuda:0')



[{'content': '=== Studio Ghibli ===\n\n\n==== Early films (1985–1995) ====\nFollowing the success of Nausicaä of the Valley of the Wind, Miyazaki and Takahata founded the animation production company Studio Ghibli on June 15, 1985, as a subsidiary of Tokuma Shoten, with offices in Kichijōji designed by Miyazaki. The studio\'s name had been registered a year earlier; Miyazaki named it after the nickname of the Caproni Ca.309 aircraft, meaning "a hot wind that blows in the desert" in Italian.',
  'score': 25.71875,
  'rank': 1,
  'document_id': 'miyazaki',
  'passage_id': 42,
  'document_metadata': {'entity': 'person', 'source': 'wikipedia'}},
 {'content': 'Hayao Miyazaki (宮崎 駿 or 宮﨑 駿, Miyazaki Hayao, [mijaꜜzaki hajao]; born January 5, 1941) is a Japanese animator, filmmaker, and manga artist. He co-founded Studio Ghibli and serves as its honorary chairman. Over the course of his career, Miyazaki has attained international acclaim as a masterful storyteller and creator of Japanese anima
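
Each result is a plain dictionary, so passing the retrieved passages on to a generative model is mostly string formatting. Here is a small sketch of one way to do that. The sample results are hypothetical and shortened (only the keys match the output above), and `build_context` is an illustrative helper rather than a RAGatouille function.

```python
def build_context(results: list[dict], max_chars: int = 500) -> str:
    """Concatenate retrieved passages, best-ranked first, into one context
    string, labelling each with its rank and score."""
    ordered = sorted(results, key=lambda r: r["rank"])
    blocks = [
        f"[{r['rank']}] (score {r['score']:.2f}) {r['content'][:max_chars]}"
        for r in ordered
    ]
    return "\n\n".join(blocks)


# Hypothetical, shortened results with the same keys as the search output.
sample_results = [
    {"content": "He co-founded Studio Ghibli.", "score": 25.0, "rank": 2,
     "document_id": "miyazaki", "passage_id": 0},
    {"content": "Miyazaki and Takahata founded Studio Ghibli in 1985.",
     "score": 25.72, "rank": 1, "document_id": "miyazaki", "passage_id": 42},
]

context = build_context(sample_results)
print(context)
```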

### **Measuring RAG Search Performance ⏱️🚀**


In [None]:
%%timeit
RAG.search(query="What animation studio did Miyazaki found?")

20.7 ms ± 2.57 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


### **Batch Search in RAG 🔍📚**

In [None]:
all_results = RAG.search(query=["What animation studio did Miyazaki found?", "Miyazaki son name"], k=k)
all_results

2it [00:00, 89.42it/s]


[[{'content': '=== Studio Ghibli ===\n\n\n==== Early films (1985–1995) ====\nFollowing the success of Nausicaä of the Valley of the Wind, Miyazaki and Takahata founded the animation production company Studio Ghibli on June 15, 1985, as a subsidiary of Tokuma Shoten, with offices in Kichijōji designed by Miyazaki. The studio\'s name had been registered a year earlier; Miyazaki named it after the nickname of the Caproni Ca.309 aircraft, meaning "a hot wind that blows in the desert" in Italian.',
   'score': 25.71875,
   'rank': 1,
   'document_id': 'miyazaki',
   'passage_id': 42,
   'document_metadata': {'entity': 'person', 'source': 'wikipedia'}},
  {'content': 'Hayao Miyazaki (宮崎 駿 or 宮﨑 駿, Miyazaki Hayao, [mijaꜜzaki hajao]; born January 5, 1941) is a Japanese animator, filmmaker, and manga artist. He co-founded Studio Ghibli and serves as its honorary chairman. Over the course of his career, Miyazaki has attained international acclaim as a masterful storyteller and creator of Japanes
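
When a list of queries is passed, the output is a list of result lists, one per query and in the same order as the queries. The sketch below pairs each query with its best-ranked passage. The data is hypothetical and abbreviated, and `top_hit_per_query` is an illustrative helper.

```python
def top_hit_per_query(queries: list[str], batch_results: list[list[dict]]) -> dict:
    """Map each query to the content of its best-ranked passage."""
    top_hits = {}
    for query, results in zip(queries, batch_results):
        best = min(results, key=lambda r: r["rank"])  # rank 1 is the best hit
        top_hits[query] = best["content"]
    return top_hits


# Hypothetical batch output: one inner list per query.
sample_queries = ["studio question", "family question"]
sample_batch = [
    [{"content": "passage about the studio", "rank": 1},
     {"content": "biography intro", "rank": 2}],
    [{"content": "biography intro", "rank": 2},
     {"content": "passage about family", "rank": 1}],
]

top_hits = top_hit_per_query(sample_queries, sample_batch)
print(top_hits)
```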

### **Loading the Existing Miyazaki Index from Disk 📂🔍**








In [None]:
path_to_index = ".ragatouille/colbert/indexes/Miyazaki/"
RAG = RAGPretrainedModel.from_index(path_to_index)

### **Adding New Documents to RAG Index 📄➕**








In [None]:
new_documents = get_wikipedia_page("Studio_Ghibli")

RAG.add_to_index([new_documents])

### **Reranking with Custom Retrieval Pipeline 🏁🔍**








In [None]:
from sentence_transformers import SentenceTransformer
from voyager import Index, Space

class MyExistingRetrievalPipeline:
    index: Index
    embedder: SentenceTransformer

    def __init__(self, embedder_name: str = "BAAI/bge-small-en-v1.5"):
        self.embedder = SentenceTransformer(embedder_name)
        self.collection_map = {}
        self.index = Index(
            Space.Cosine,
            num_dimensions=self.embedder.get_sentence_embedding_dimension(),
        )

    def index_documents(self, documents: list[dict]) -> None:
        # Each document is a dict with a 'content' key (the output format of CorpusProcessor).
        # There are very few documents in our example, so we don't bother with batching.
        for document in documents:
            self.collection_map[self.index.add_item(self.embedder.encode(document['content']))] = document['content']

    def query(self, query: str, k: int = 10) -> list[str]:
        query_embedding = self.embedder.encode(query)
        to_return = []
        for idx in self.index.query(query_embedding, k=k)[0]:
            to_return.append(self.collection_map[idx])
        return to_return
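
The class above relies on a simple pattern: the vector index returns integer ids, and `collection_map` translates those ids back to text. The dependency-free sketch below mirrors that pattern with brute-force cosine search over toy vectors. `TinyVectorStore` is hypothetical, and it only imitates the interface: voyager performs approximate nearest-neighbour search, and a real pipeline would use embeddings from the sentence-transformers model.

```python
import math


class TinyVectorStore:
    """Brute-force stand-in for a vector index plus an id -> text map."""

    def __init__(self):
        self.vectors = []         # position in this list acts as the id
        self.collection_map = {}  # id -> original text

    def add_item(self, vector, text):
        idx = len(self.vectors)
        self.vectors.append(vector)
        self.collection_map[idx] = text
        return idx

    @staticmethod
    def _cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.hypot(*u) * math.hypot(*v))

    def query(self, vector, k=1):
        """Return the texts of the k most similar stored vectors."""
        ranked = sorted(
            range(len(self.vectors)),
            key=lambda i: self._cosine(vector, self.vectors[i]),
            reverse=True,
        )
        return [self.collection_map[i] for i in ranked[:k]]


store = TinyVectorStore()
store.add_item([1.0, 0.0], "about studios")
store.add_item([0.0, 1.0], "about aircraft")
store.add_item([0.7, 0.7], "about both")

nearest = store.query([0.9, 0.1], k=2)
print(nearest)  # → ['about studios', 'about both']
```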

### **Initializing Existing Retrieval Pipeline 🔄**








In [None]:
existing_pipeline = MyExistingRetrievalPipeline()

### **Processing Wikipedia Corpus 🧠**








In [None]:
from ragatouille.utils import get_wikipedia_page
from ragatouille.data import CorpusProcessor

corpus_processor = CorpusProcessor()

documents = [get_wikipedia_page("Hayao Miyazaki"), get_wikipedia_page("Studio Ghibli"), get_wikipedia_page("Princess Mononoke"), get_wikipedia_page("Shrek")]
documents = corpus_processor.process_corpus(documents, chunk_size=200)

### **Indexing Documents in Existing Pipeline 🔍**








In [None]:
existing_pipeline.index_documents(documents)


### **Querying the Existing Pipeline 🔍**








In [None]:
query = "What's Ghibli's famous policy?"
raw_results = existing_pipeline.query(query, k=10)
raw_results

['The Boy and the Heron won the 2024 Golden Globe Award for Best Animated Feature Film, BAFTA Award for Best Animated Film, and the 2024 Academy Award for Best Animated Feature.\n\n\n== Name ==\nThe name "Ghibli" was chosen by Miyazaki from the Italian noun ghibli (also used in English), the nickname of Italy\'s Saharan scouting plane Caproni Ca.309, in turn derived from the Italianization of the Libyan Arabic name for a hot desert wind (قبلي qibliyy). The name was chosen by Miyazaki due to his passion for aircraft and also for the idea that the studio would "blow a new wind through the anime industry". Although the Italian word would be more accurately transliterated as "Giburi" (ギブリ), with a hard g sound, the studio\'s name is written in Japanese as Jiburi (ジブリ, [dʑiꜜbɯɾi] ).',
 '=== Feature films ===\n\n\n=== Television ===\n\n\n== Style and themes ==\n\nStudio Ghibli films are mostly hand-drawn using rich watercolor and acrylic paints. The films use traditional methods of making an

### **Reranking Results 🔄**








In [None]:
RAG.rerank(query=query, documents=raw_results, k=5)


100%|██████████| 1/1 [00:00<00:00, 20.74it/s]


[{'content': 'Studio Ghibli, Inc. (Japanese: 株式会社スタジオジブリ, Hepburn: Kabushiki-gaisha Sutajio Jiburi) is a Japanese animation studio based in Koganei, Tokyo. It has a strong presence in the animation industry and has expanded its portfolio to include various media formats, such as short subjects, television commercials, and two television films. Their work has been well received by audiences and recognized with numerous awards. Their mascot and most recognizable symbol, the character Totoro from the 1988 film My Neighbor Totoro, is a giant spirit inspired by raccoon dogs (tanuki) and cats (neko).',
  'score': 12.4609375,
  'rank': 0,
  'result_index': 3},
 {'content': 'The Boy and the Heron won the 2024 Golden Globe Award for Best Animated Feature Film, BAFTA Award for Best Animated Film, and the 2024 Academy Award for Best Animated Feature.\n\n\n== Name ==\nThe name "Ghibli" was chosen by Miyazaki from the Italian noun ghibli (also used in English), the nickname of Italy\'s Saharan scou
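
Reranked entries carry a `result_index` next to `rank`. Assuming `result_index` is the position of the passage in the list handed to `rerank` (which is how the output above appears to be organised), the original items can be reordered as in the sketch below. The sample data is hypothetical and `reorder_by_rerank` is an illustrative helper.

```python
def reorder_by_rerank(original_docs: list[str], reranked: list[dict]) -> list[str]:
    """Return the original documents in reranked order, using each entry's
    `result_index` as a position in `original_docs`."""
    ordered = sorted(reranked, key=lambda r: r["rank"])
    return [original_docs[r["result_index"]] for r in ordered]


# Hypothetical first-stage results and reranker output.
first_stage = ["awards passage", "style passage", "history passage", "overview passage"]
reranked_sample = [
    {"score": 12.46, "rank": 0, "result_index": 3},
    {"score": 11.90, "rank": 1, "result_index": 0},
]

reordered = reorder_by_rerank(first_stage, reranked_sample)
print(reordered)  # → ['overview passage', 'awards passage']
```

This is useful when the first-stage pipeline keeps extra metadata alongside each passage that the reranker never sees.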