RAG component

A production RAG system is split into 3 main components:

- ingestion: clean, chunk, embed, and load your data into a vector DB
- retrieval: query your vector DB for context
- generation: attach the retrieved context to your prompt and pass it to an LLM

The ingestion component sits in the feature pipeline, while the retrieval and generation components are implemented inside the inference pipeline.

You can also use the retrieval and generation components in your training pipeline to fine-tune your LLM further on domain-specific prompts.

You can apply advanced techniques to optimize your RAG system for ingestion, retrieval and generation.

There are 3 main types of advanced RAG techniques:

- Pre-retrieval optimization [ingestion]: tweak how you create the chunks
- Retrieval optimization [retrieval]: improve the queries to your vector DB
- Post-retrieval optimization [retrieval]: process the retrieved chunks to filter out the noise
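To make these stages concrete, here is a minimal end-to-end sketch using Qdrant and sentence-transformers; the collection name, embedding model, and sample chunks are illustrative assumptions, not the course's actual code:

# Minimal RAG sketch: ingestion, retrieval, generation.
# Assumes a local Qdrant instance; names below are examples only.
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model
client = QdrantClient(host="localhost", port=6333)

# Ingestion (feature pipeline): chunk, embed, load.
chunks = ["The LLM Twin course covers production RAG.", "Qdrant stores the embeddings."]
client.recreate_collection(
    collection_name="demo",
    vectors_config=VectorParams(
        size=encoder.get_sentence_embedding_dimension(), distance=Distance.COSINE
    ),
)
client.upsert(
    collection_name="demo",
    points=[
        PointStruct(id=i, vector=encoder.encode(c).tolist(), payload={"text": c})
        for i, c in enumerate(chunks)
    ],
)

# Retrieval (inference pipeline): embed the query and search.
query = "What does the course teach?"
hits = client.search(
    collection_name="demo", query_vector=encoder.encode(query).tolist(), limit=2
)

# Generation: attach the retrieved context to the prompt for your LLM.
context = "\n".join(hit.payload["text"] for hit in hits)
prompt = f"Context:\n{context}\n\nQuestion: {query}"
print(prompt)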

You can learn more about RAG in Decoding ML's LLM Twin Course:

Advanced RAG architecture

Finetuning dataset preparation component

The finetuning dataset preparation module automates the generation of datasets specifically formatted for training and fine-tuning Large Language Models (LLMs). It interfaces with Qdrant, sends structured prompts to LLMs, and manages data with Comet ML for experiment tracking and artifact logging.
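At a high level, the flow can be sketched as below; the collection name, prompt wording, model, and file name are assumptions for illustration, not the module's actual code:

# Illustrative dataset-generation flow; names and prompts are assumptions.
import json
import os

from comet_ml import Artifact, Experiment
from openai import OpenAI
from qdrant_client import QdrantClient

qdrant = QdrantClient(host="localhost", port=6333)
llm = OpenAI()  # reads OPENAI_API_KEY from the environment

# 1. Pull cleaned documents from Qdrant (collection name is an example).
points, _ = qdrant.scroll(collection_name="cleaned_articles", limit=10)

# 2. Ask an LLM to turn each document into an instruction/answer pair.
pairs = []
for point in points:
    text = point.payload["text"]
    response = llm.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": f"Write one instruction/answer pair grounded in:\n{text}",
        }],
    )
    pairs.append({"document": text, "pair": response.choices[0].message.content})

# 3. Version the dataset as a Comet ML artifact for experiment tracking.
experiment = Experiment(
    api_key=os.environ["COMET_API_KEY"],
    project_name=os.environ["COMET_PROJECT"],
    workspace=os.environ["COMET_WORKSPACE"],
)
with open("finetuning_pairs.json", "w") as f:
    json.dump(pairs, f)
artifact = Artifact(name="finetuning-dataset", artifact_type="dataset")
artifact.add("finetuning_pairs.json")
experiment.log_artifact(artifact)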

Why is fine-tuning important?

  1. Model Customization: Tailors the LLM's responses to specific domains or tasks.
  2. Improved Accuracy: Enhances the model's understanding of nuanced language used in specialized fields.
  3. Efficiency: Reduces the need for extensive post-processing by producing more relevant outputs directly.
  4. Adaptability: Allows models to continuously learn from new data, staying relevant as language and contexts evolve.

Finetuning Dataset Preparation Flow

Dependencies

Installation and Setup

To prepare your environment for these components, follow these steps:

  • poetry init: only needed when bootstrapping a new project without an existing pyproject.toml
  • poetry install: installs the dependencies declared in pyproject.toml

Docker Settings

Host Configuration

To ensure that your Docker containers can communicate with each other, you need to update your /etc/hosts file. Add the following entries to map the MongoDB replica-set hostnames to your local machine:

# Docker MongoDB Hosts Configuration
127.0.0.1       mongo1
127.0.0.1       mongo2
127.0.0.1       mongo3

For Windows users, check this article: https://medium.com/workleap/the-only-local-mongodb-replica-set-with-docker-compose-guide-youll-ever-need-2f0b74dd8384

CometML Integration

Overview

CometML is a cloud-based platform that provides tools for tracking, comparing, explaining, and optimizing experiments and models in machine learning. It helps data scientists and teams better manage and collaborate on machine learning experiments.

Why Use CometML?

  • Experiment Tracking: CometML automatically tracks your code, experiments, and results, allowing you to visually compare different runs and configurations.
  • Model Optimization: It offers tools to compare different models side by side, analyze hyperparameters, and track model performance across various metrics.
  • Collaboration and Sharing: Share findings and models with colleagues or the ML community, enhancing team collaboration and knowledge transfer.
  • Reproducibility: By logging every detail of the experiment setup, CometML ensures experiments are reproducible, making it easier to debug and iterate.

CometML Variables

When integrating CometML into your projects, you'll need to set up several environment variables to manage the authentication and configuration:

  • COMET_API_KEY: Your unique API key that authenticates your interactions with the CometML API.
  • COMET_PROJECT: The project name under which your experiments will be logged.
  • COMET_WORKSPACE: The workspace name that organizes various projects and experiments.

Obtaining CometML Variables

To access and set up the necessary CometML variables for your project, follow these steps:

  1. Create an Account or Log In:

    • Visit CometML's website and log in if you already have an account, or sign up if you're a new user.
  2. Create a New Project:

    • Once logged in, navigate to your dashboard. Here, you can create a new project by clicking on "New Project" and entering the relevant details for your project.
  3. Access API Key:

    • After creating your project, you will need to obtain your API key. Navigate to your account settings by clicking on your profile at the top right corner. Select 'API Keys' from the menu, and you'll see an option to generate or copy your existing API key.
  4. Set Environment Variables:

    • Add the obtained COMET_API_KEY to your environment variables, along with the COMET_PROJECT and COMET_WORKSPACE names you have set up.
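Once the variables are in place, a quick way to verify the setup is to create a small experiment from Python; this is a smoke-test sketch, not part of the project's code:

# Smoke test for the CometML variables; not part of the project's code.
import os

from comet_ml import Experiment

experiment = Experiment(
    api_key=os.environ["COMET_API_KEY"],
    project_name=os.environ["COMET_PROJECT"],
    workspace=os.environ["COMET_WORKSPACE"],
)
experiment.log_parameter("smoke_test", True)
experiment.end()  # the run should now appear in your CometML dashboard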

Qdrant Integration

Overview

Qdrant is an open-source vector database designed for storing and searching large volumes of high-dimensional vector data. It is particularly suited for tasks that require searching through large datasets to find items similar to a query item, such as in recommendation systems or image retrieval applications.

Why Use Qdrant?

Qdrant is a robust vector database optimized for handling high-dimensional vector data, making it ideal for machine learning and AI applications. Here are the key reasons for using Qdrant:

  • Efficient Searching: Utilizes advanced indexing mechanisms to deliver fast and accurate search capabilities across high-dimensional datasets.
  • Scalability: Built to accommodate large-scale data sets, which is critical for enterprise-level deployments.
  • Flexibility: Supports a variety of distance metrics and filtering options that allow for precise customization of search results.
  • Integration with ML Pipelines: Seamlessly integrates into machine learning pipelines, enabling essential functions like nearest neighbor searches.

Setting Up Qdrant

Qdrant can be used via its Docker image or through its managed cloud service (https://cloud.qdrant.io/login). This flexibility allows you to choose the environment that best suits your project's needs.

Environment Variables

To configure your environment for Qdrant, set the following variables:

Docker Variables

  • QDRANT_DATABASE_HOST: The hostname or IP address where your Qdrant server is running.
  • QDRANT_DATABASE_PORT: The port on which Qdrant listens, typically 6333 for Docker setups.

Qdrant Cloud Variables

  • QDRANT_CLOUD_URL: The URL for accessing Qdrant Cloud services.
  • QDRANT_APIKEY: The API key for authenticating with Qdrant Cloud.

Please check this article to learn how to obtain these variables: https://qdrant.tech/documentation/cloud/quickstart-cloud/?utm_source=decodingml&utm_medium=referral&utm_campaign=llm-course

Additionally, you can control the connection mode (Cloud or Docker) using a setting in your configuration file. More details can be found in db/qdrant.py:

USE_QDRANT_CLOUD: True  # Set to False to use Docker setup

Before running any commands, make sure your environment variables are correctly set in your .env file so that everything works properly.
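For reference, here is a minimal sketch of such a connection switch, assuming the variable names from the .env example below; it mirrors the idea behind db/qdrant.py, not its exact implementation:

# Illustrative connection switch; mirrors the idea behind db/qdrant.py,
# not its exact implementation.
import os

from qdrant_client import QdrantClient

def build_qdrant_client() -> QdrantClient:
    if os.environ.get("USE_QDRANT_CLOUD", "False") == "True":
        # Cloud mode: authenticate against the managed service.
        return QdrantClient(
            url=os.environ["QDRANT_CLOUD_URL"],
            api_key=os.environ["QDRANT_APIKEY"],
        )
    # Docker mode: connect to the local container.
    return QdrantClient(
        host=os.environ.get("QDRANT_DATABASE_HOST", "localhost"),
        port=int(os.environ.get("QDRANT_DATABASE_PORT", 6333)),
    )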

Environment Configuration

Ensure your .env file includes the following configurations:

MONGO_DATABASE_HOST="mongodb://localhost:30001,localhost:30002,localhost:30003/?replicaSet=my-replica-set"
MONGO_DATABASE_NAME="scrabble"

# QdrantDB config
QDRANT_DATABASE_HOST="localhost"
QDRANT_DATABASE_PORT=6333
QDRANT_APIKEY="your-key"
QDRANT_CLOUD_URL="your-url"
USE_QDRANT_CLOUD=False

# MQ config
RABBITMQ_DEFAULT_USERNAME=guest
RABBITMQ_DEFAULT_PASSWORD=guest
RABBITMQ_HOST=localhost
RABBITMQ_PORT=5673

COMET_API_KEY="your-key"
COMET_WORKSPACE="alexandruvesa"
COMET_PROJECT="decodingml"

OPENAI_API_KEY="your-key"

Supplementary Tools

You need a real dataset to run and test the modules. This section covers additional tools and scripts included in the project that assist with specific tasks, such as data insertion.

Script: insert_data_mongo.py

Overview

The insert_data_mongo.py script is designed to manage the automated downloading and insertion of various types of documents (articles, posts, repositories) into a MongoDB database. It facilitates the initial population of the database with structured data for further processing or analysis.

Features

  • Dataset Downloading: Automatically downloads JSON formatted data files from Google Drive based on predefined file IDs.
  • Dynamic Data Insertion: Inserts different types of documents (articles, posts, repositories) into the MongoDB database, associating each entry with its respective author.

How It Works

  1. Download Data: The script first checks if the specified output_dir directory exists and contains any files. If not, it creates the directory and downloads the data files from Google Drive.
  2. Insert Data: Based on the type specified in the downloaded files, it inserts posts, articles, or repositories into the MongoDB database.
  3. Logging: After each insertion, the script logs the number of items inserted and their associated author ID to help monitor the process.
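A condensed, illustrative version of that flow (the Google Drive file ID, paths, and collection names are placeholders, not the script's real values, and gdown is an assumed download helper):

# Condensed sketch of the download-and-insert flow; IDs, paths, and
# collection names are placeholders, and gdown is an assumed helper.
import json
import os

import gdown
from pymongo import MongoClient

OUTPUT_DIR = "data"
FILE_IDS = {"articles.json": "<google-drive-file-id>"}  # placeholder

# 1. Download the data files only if the output directory is empty or missing.
if not os.path.isdir(OUTPUT_DIR) or not os.listdir(OUTPUT_DIR):
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    for name, file_id in FILE_IDS.items():
        gdown.download(id=file_id, output=os.path.join(OUTPUT_DIR, name))

# 2. Insert each document under the collection matching its type.
client = MongoClient(os.environ["MONGO_DATABASE_HOST"])
db = client[os.environ.get("MONGO_DATABASE_NAME", "scrabble")]
for name in os.listdir(OUTPUT_DIR):
    with open(os.path.join(OUTPUT_DIR, name)) as f:
        documents = json.load(f)
    result = db[name.removesuffix(".json")].insert_many(documents)
    # 3. Log the number of inserted items to monitor the process.
    print(f"Inserted {len(result.inserted_ids)} documents from {name}")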

RAG Component

RAG Module Structure

Query Expansion

  • query_expansion.py: Handles the expansion of a given query into multiple variations using language model-based templates. It integrates the ChatOpenAI class from langchain_openai and a custom QueryExpansionTemplate to generate expanded queries suitable for further processing.
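The core idea can be sketched with langchain_openai directly; the prompt wording below is an assumption, not the project's actual QueryExpansionTemplate:

# Sketch of LLM-based query expansion; the prompt wording is an
# assumption, not the actual QueryExpansionTemplate.
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI

model = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
prompt = PromptTemplate.from_template(
    "Generate {n} different versions of this question, one per line:\n{question}"
)

chain = prompt | model
result = chain.invoke({"n": 3, "question": "How do I deploy an LLM twin?"})
expanded_queries = [line for line in result.content.split("\n") if line.strip()]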

Reranking

  • reranking.py: Manages the reranking of retrieved documents based on relevance to the original query. It uses a RerankingTemplate and the ChatOpenAI model to reorder the documents in order of relevance, making use of language model outputs.
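Conceptually, LLM-based reranking looks like the sketch below; the prompt and output parsing are assumptions, not the actual RerankingTemplate:

# Sketch of LLM-based reranking; prompt and parsing are assumptions.
from langchain_openai import ChatOpenAI

model = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

def rerank(query: str, passages: list[str], keep_top_k: int = 3) -> list[str]:
    numbered = "\n".join(f"{i}. {p}" for i, p in enumerate(passages))
    message = model.invoke(
        f"Rank these passages by relevance to the query '{query}'. "
        f"Answer with the passage numbers only, most relevant first, "
        f"comma-separated:\n{numbered}"
    )
    order = [int(n) for n in message.content.split(",") if n.strip().isdigit()]
    return [passages[i] for i in order[:keep_top_k] if i < len(passages)]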

Retriever

  • retriever.py: Performs vector-based retrieval of documents from a vector database using query expansion and reranking strategies. It utilizes the QueryExpansion and Reranker classes, as well as QdrantClient for database interactions and SentenceTransformer for generating query vectors.
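Putting the pieces together, the retrieval flow can be sketched as below; the collection and embedding model names are examples, and the expansion/reranking steps refer to the sketches above:

# Sketch of the retrieval flow; collection and model names are examples.
from qdrant_client import QdrantClient
from sentence_transformers import SentenceTransformer

client = QdrantClient(host="localhost", port=6333)
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model

def retrieve(expanded_queries: list[str], top_k: int = 10) -> list[str]:
    # Search once per expanded query and merge the deduplicated results.
    seen, passages = set(), []
    for q in expanded_queries:
        hits = client.search(
            collection_name="vector_posts",  # example collection name
            query_vector=encoder.encode(q).tolist(),
            limit=top_k,
        )
        for hit in hits:
            if hit.id not in seen:
                seen.add(hit.id)
                passages.append(hit.payload["text"])
    # The merged passages are then reranked against the original query
    # (see the reranking sketch above).
    return passages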

Self Query

  • self_query.py: Generates metadata attributes related to a query, such as author ID, using a self-query mechanism. It employs a SelfQueryTemplate and the ChatOpenAI model to extract required metadata from the query context.
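The self-query step can be sketched as a metadata-extraction prompt; the wording is an assumption, not the actual SelfQueryTemplate:

# Sketch of self-querying for metadata; the prompt is an assumption.
from langchain_openai import ChatOpenAI

model = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

def extract_author_id(query: str) -> str | None:
    message = model.invoke(
        "Extract the user or author id mentioned in this question. "
        f"Reply with the id only, or 'none' if absent:\n{query}"
    )
    answer = message.content.strip()
    return None if answer.lower() == "none" else answer

# The extracted id can then be used as a Qdrant payload filter to
# restrict retrieval to that author's documents.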

Usage

After you have everything set up (environment variables, a CometML account, and a Qdrant account), it's time to insert data into your environment.

The workflow is straightforward: the project includes a Makefile for easy management of common tasks. Here are the main commands you can use:

  • make help: Displays help for each make command.
  • make local-start: Builds and starts MongoDB, RabbitMQ, and Qdrant.
  • make local-test-github: Inserts data into MongoDB.
  • make local-test-retriever: Tests RAG retrieval.