# RAG Implementation with Google Gemini, LangChain, and FAISS

## Useful Documentation

- [Google AI Generative API Docs](https://ai.google.dev/tutorials/python_quickstart)
- [LangChain Documentation](https://python.langchain.com/docs/get_started/introduction)
- [FAISS GitHub Repository](https://github.com/facebookresearch/faiss)

## Dependencies and Setup

In [1]:
# Install required libraries (langchain-community provides the document loaders and the FAISS wrapper)
!pip install -q google-generativeai langchain langchain-community faiss-cpu pypdf python-dotenv

In [5]:
!pip install --upgrade --quiet langchain-google-genai

In [10]:
import os
import sys
import logging
import warnings
import numpy as np
import google.generativeai as genai
from dotenv import load_dotenv
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Document loaders and vector stores live in langchain_community in current LangChain releases
from langchain_community.document_loaders import (
    PyPDFLoader,
    CSVLoader,
    JSONLoader,
    TextLoader
)
from langchain_community.vectorstores import FAISS
from langchain_google_genai import GoogleGenerativeAI, GoogleGenerativeAIEmbeddings
from langchain.chains import RetrievalQA

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s: %(message)s'
)
logger = logging.getLogger(__name__)

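# Suppress warnings (this also hides LangChain deprecation notices)
warnings.filterwarnings('ignore')

# Load environment variables such as GOOGLE_API_KEY from a local .env file
load_dotenv()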

True

## Document Loading and Preprocessing

Document loading and preprocessing are the first stages of a Retrieval-Augmented Generation (RAG) pipeline built with LangChain. Documents are loaded from their sources, split into pieces small enough to embed and retrieve, and then indexed so that relevant passages can be supplied to the model at query time.

### Document Loading
Document loading is the process of fetching documents from sources such as files, databases, APIs, or web pages. LangChain provides a loader class for each kind of source.

### Example Loaders in LangChain:
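- File loaders: load documents from local files, for example `PyPDFLoader`, `CSVLoader`, `JSONLoader`, and `TextLoader` (the four used in this notebook).
- Web loaders: load documents from web pages, for example `WebBaseLoader`.
- Database loaders: load documents from databases.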
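
Choosing a loader usually comes down to the file extension. The following is a minimal sketch of that dispatch using only the standard library; the mapping and the `pick_loader_name` helper are hypothetical, and the values are simply the names of the loader classes imported at the top of this notebook. Lowercasing the suffix means a file named `report.PDF` is still recognised.

```python
from pathlib import Path

# Loader class names as imported in this notebook, keyed by file suffix.
# Hypothetical mapping for illustration only.
LOADER_BY_SUFFIX = {
    '.pdf': 'PyPDFLoader',
    '.csv': 'CSVLoader',
    '.json': 'JSONLoader',
    '.txt': 'TextLoader',
}


def pick_loader_name(path):
    """Return the loader class name for a path, or None if the type is unsupported."""
    suffix = Path(path).suffix.lower()  # case-insensitive match on the extension
    return LOADER_BY_SUFFIX.get(suffix)


for p in ['report.PDF', 'datasets/menu.json', 'notes.md']:
    print(p, '->', pick_loader_name(p))
```

A lookup table like this keeps the list of supported types in one place, which can be easier to extend than a chain of `if`/`elif` checks.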


### Preprocessing

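Preprocessing cleans and transforms the loaded documents so they are suitable for retrieval and generation. Classic text pipelines may tokenize, remove stop words, stem, or lowercase the text; this notebook does none of these. Its only preprocessing step is splitting each document into overlapping chunks with `RecursiveCharacterTextSplitter` (500 characters per chunk with a 100-character overlap by default), so that each chunk can be embedded and retrieved independently.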
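
The effect of chunk size and overlap can be seen in a simplified sketch. The `chunk_text` function below is a hypothetical fixed-width sliding window written for illustration; it is not how `RecursiveCharacterTextSplitter` works internally, since that splitter also tries to break on separators such as paragraph and line breaks, which this sketch ignores.

```python
def chunk_text(text, chunk_size=500, chunk_overlap=100):
    """Split text into fixed-width chunks that share `chunk_overlap` characters.

    Hypothetical helper for illustration; a simplified stand-in for a real splitter.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window has reached the end of the text
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2))
# → ['abcd', 'cdef', 'efgh', 'ghij']
```

Each chunk repeats the last two characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.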

#### Steps in Document Loading and Preprocessing in RAG using LangChain:

1. Load documents: use the loader that matches each source (`load_documents` below).
2. Preprocess documents: split them into overlapping chunks (`split_documents`).
3. Index documents: embed the chunks and store them in a FAISS vector store (`RAGSystem.create_vectorstore`).
4. Retrieve documents: find the chunks most similar to a query (`RAGSystem.similarity_search`).
5. Generate a response: pass the retrieved chunks and the query to the LLM (`RAGSystem.query_documents`).
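
The index, retrieve, and generate steps can be sketched end to end without any external service. This is a toy illustration under stated assumptions, not how Gemini embeddings or FAISS work internally: word counts stand in for embeddings, a sorted list stands in for the vector index, and the final prompt is only built, not sent to a model. All function names are hypothetical; the sample chunks are taken from the company data shown in the output later in this notebook.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy 'embedding': a bag of lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(query, context_chunks):
    """Place all retrieved chunks into one prompt, in the spirit of a 'stuff' chain."""
    context = "\n\n".join(context_chunks)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"


chunks = [
    "Espresso - A rich and bold coffee experience.",
    "Headquarters: Seattle, USA",
    "Cold Brew - A refreshing and smooth cold coffee.",
]
question = "Where is the headquarters?"
top = retrieve(question, chunks, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

In the real pipeline the embedding model captures meaning rather than exact word overlap, so a query can match a chunk even when the two share no words.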

In [11]:
def load_documents(file_paths):
    """
    Load documents from multiple file types with error handling

    Args:
        file_paths (list): List of file paths to load

    Returns:
        list: Loaded documents
    """
    documents = []
    for path in file_paths:
        try:
            if not os.path.exists(path):
                logger.warning(f"File not found: {path}")
                continue

            if path.endswith('.pdf'):
                loader = PyPDFLoader(path)
            elif path.endswith('.csv'):
                loader = CSVLoader(path)
            elif path.endswith('.json'):
                loader = JSONLoader(path)
            elif path.endswith('.txt'):
                loader = TextLoader(path)
            else:
                logger.error(f"Unsupported file type: {path}")
                continue

            documents.extend(loader.load())

        except Exception as e:
            logger.error(f"Error loading {path}: {e}")
            continue

    if not documents:
        logger.warning("No documents were loaded")

    return documents

def split_documents(documents, chunk_size=500, chunk_overlap=100):
    """
    Split documents into chunks with error handling

    Args:
        documents (list): List of documents to split
        chunk_size (int): Size of text chunks
        chunk_overlap (int): Overlap between chunks

    Returns:
        list: Text chunks
    """
    try:
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap
        )
        chunks = text_splitter.split_documents(documents)

        logger.info(f"Split into {len(chunks)} chunks")
        return chunks
    except Exception as e:
        logger.error(f"Document splitting error: {e}")
        return []

## Embedding and Vector Store

In [None]:
class RAGSystem:
    def __init__(self, model_name='models/embedding-001'):
        """
        Initialize RAG system with embedding and generation models

        Args:
            model_name (str): Gemini embedding model name
        """
        try:
            # Validate API key
            if not os.getenv('GOOGLE_API_KEY'):
                raise ValueError("Google API Key not found. Set GOOGLE_API_KEY in .env")

            genai.configure(api_key=os.getenv('GOOGLE_API_KEY'))

            self.embeddings = GoogleGenerativeAIEmbeddings(
                model=model_name,
                task_type='retrieval_document'
            )
            print("Embeddings initialized successfully!")

            self.llm = GoogleGenerativeAI(
                model='gemini-pro',
                temperature=0.7
            )
            print("LLM Model initialized successfully!")

            self.vectorstore = None
            logger.info("RAG System initialized successfully")

        except Exception as e:
            logger.error(f"Initialization error: {e}")
            raise

    def create_vectorstore(self, documents):
        """
        Create FAISS vector store from documents

        Args:
            documents (list): Preprocessed document chunks
        """
        try:
            if not documents:
                raise ValueError("No documents provided for vector store")

            self.vectorstore = FAISS.from_documents(
                documents,
                self.embeddings
            )
            logger.info(f"Vector store created with {len(documents)} chunks")

        except Exception as e:
            logger.error(f"Vector store creation error: {e}")
            raise

    def similarity_search(self, query, k=5):
        """
        Perform similarity search on vector store

        Args:
            query (str): Search query
            k (int): Number of top results

        Returns:
            list: Top similar document chunks
        """
        try:
            if not self.vectorstore:
                raise ValueError("Vector store not initialized")

            return self.vectorstore.similarity_search(query, k=k)

        except Exception as e:
            logger.error(f"Similarity search error: {e}")
            return []

    def query_documents(self, query, k=5):
        """
        Create retrieval-based QA chain

        Args:
            query (str): User query
            k (int): Number of context documents

        Returns:
            str: Generated response
        """
        try:
            if not self.vectorstore:
                raise ValueError("Vector store not initialized")

            retriever = self.vectorstore.as_retriever(
                search_kwargs={'k': k}
            )

            qa_chain = RetrievalQA.from_chain_type(
                llm=self.llm,
                chain_type='stuff',
                retriever=retriever
            )

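            # invoke() replaces the deprecated run(); RetrievalQA returns its answer under 'result'
            return qa_chain.invoke({'query': query})['result']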

        except Exception as e:
            logger.error(f"Document query error: {e}")
            return "Unable to process query due to an error."

# Example Usage
def main():
    try:
        # File paths for different document types
        file_paths = [
            'datasets/company_info.txt',
            'datasets/menu.json'
        ]

        # Initialize RAG system
        rag_system = RAGSystem()

        # Load and preprocess documents
        documents = load_documents(file_paths)
        chunks = split_documents(documents)

        # Create vector store
        rag_system.create_vectorstore(chunks)

        # Example queries
        query1 = "What is the main operation of Coffee Corp?"
        query2 = "Summarize key products of Coffee Corp"

        # Perform similarity search
        similar_docs = rag_system.similarity_search(query1)
        for doc in similar_docs:
            print(doc.page_content)

        # Generate response using RAG
        response = rag_system.query_documents(query2)
        print("RAG Response:", response)

    except Exception as e:
        logger.error(f"Main execution error: {e}")
        sys.exit(1)

if __name__ == '__main__':
    main()


2025-01-23 11:26:20,980 - INFO: RAG System initialized successfully
2025-01-23 11:26:20,983 - ERROR: Error loading datasets/menu.json: JSONLoader.__init__() missing 1 required positional argument: 'jq_schema'


2025-01-23 11:26:21,208 - INFO: Vector store created with 3 chunks


Contact Information:
Email: contact@coffeecorp.com
Phone: +1-800-555-1234
Website: www.coffeecorp.com
Company Name: Coffee Corp
Founded: 2005
Headquarters: Seattle, USA
Industry: Coffee
Employees: 2000
Mission: To provide the best coffee experience in the world.
Vision: To be the leading coffee company globally.
Values: Quality, Sustainability, Customer Satisfaction, Innovation
Key Products:
1. Espresso - A rich and bold coffee experience.
2. Latte - A smooth blend of espresso and steamed milk.
3. Cappuccino - A perfect balance of espresso, steamed milk, and foam.
4. Cold Brew - A refreshing and smooth cold coffee.

Company Achievements:
- Awarded Best Coffee Company in 2019.
- Recognized for sustainable practices by Green Coffee Magazine.
- Achieved a customer satisfaction rate of 95%.
RAG Response: 1. Espresso
2. Latte
3. Cappuccino
4. Cold Brew
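
## Handling the JSONLoader Error

The log above shows that `datasets/menu.json` was skipped: `JSONLoader` requires a `jq_schema` argument, so only `datasets/company_info.txt` reached the vector store. A minimal sketch of one possible fix follows, assuming the `jq` package is installed (`pip install jq`), which `JSONLoader` depends on. The `loader_kwargs` helper is hypothetical and not part of LangChain. The expression `'.'` selects the whole JSON document, and `text_content=False` lets non-string content be serialized to text. The most suitable jq expression depends on the structure of `menu.json`, which is not shown in this notebook.

```python
def loader_kwargs(path):
    """Return extra keyword arguments required by the loader for this file type.

    Hypothetical helper for illustration; not part of LangChain.
    """
    if path.lower().endswith('.json'):
        # '.' is the jq identity expression: take the whole JSON document.
        # text_content=False allows dict or list content to be serialized as text.
        return {'jq_schema': '.', 'text_content': False}
    return {}


# Inside load_documents, the JSON branch could then become:
#     loader = JSONLoader(path, **loader_kwargs(path))
print(loader_kwargs('datasets/menu.json'))
```

Loading the whole file as one document is the simplest choice; a narrower jq expression that yields one document per menu item would likely give more focused retrieval results.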
