RAG Pipeline and Implementation

In [1]:
import os
import logging
import pandas as pd
import faiss
import numpy as np
from langchain.vectorstores import FAISS
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from langchain.llms import CTransformers
from langchain.chains import RetrievalQA
from langchain.document_loaders import CSVLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.indexes import VectorstoreIndexCreator
from fastapi import FastAPI
from pydantic import BaseModel
from dotenv import load_dotenv

In [3]:
# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

In [2]:
# text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
text_splitter = RecursiveCharacterTextSplitter(chunk_size=256, chunk_overlap=50)

embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
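
To make the chunk_size=256 and chunk_overlap=50 settings concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplification and not the actual algorithm of RecursiveCharacterTextSplitter, which also tries to break on separators such as newlines and spaces. The helper name sliding_chunks and the sample string are hypothetical.

```python
def sliding_chunks(text, chunk_size=256, chunk_overlap=50):
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # with the defaults, each chunk starts 206 chars later
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reached the end of the text
    return chunks


# Invented 600-character sample, only to show the resulting chunk sizes.
sample = "".join(str(i % 10) for i in range(600))
chunks = sliding_chunks(sample)
print([len(c) for c in chunks])
```

The last 50 characters of each chunk reappear at the start of the next one, so a sentence cut at a chunk boundary is still seen whole in at least one chunk.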



In [25]:
# Load data from the directory, split it, embed it, and store it in the vector DB
def load_data_from_directory(directory):
    documents = []
    for filename in os.listdir(directory):
        if filename.endswith(".csv"):
            file_path = os.path.join(directory, filename)
            loader = CSVLoader(file_path=file_path)
            documents.extend(loader.load())  # Load CSV as documents
    return documents


DATA_DIR = 'data/cricket_data'
# Convert CSVs into a text format for embedding
documents = load_data_from_directory(DATA_DIR)

split_documents = text_splitter.split_documents(documents)

logger.info("Creating vector store with FAISS...")
vector_store = FAISS.from_documents(split_documents, embedding_model)
logger.info("Vector store created successfully.")

2025-02-23 11:08:00,917 - INFO - Creating vector store with FAISS...
2025-02-23 16:25:08,298 - INFO - Vector store created successfully.
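
The two log lines above bracket the FAISS build, so the indexing time can be read off them. A small sketch of that arithmetic, assuming the timestamps follow the default asctime layout produced by the logging configuration earlier in the notebook:

```python
from datetime import datetime

# Layout of logging's default asctime: date, time, comma, milliseconds.
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"

started = datetime.strptime("2025-02-23 11:08:00,917", LOG_TIME_FORMAT)
finished = datetime.strptime("2025-02-23 16:25:08,298", LOG_TIME_FORMAT)

elapsed = finished - started
print(elapsed)
```

Embedding and indexing took a little over five hours in this run. Persisting the index and reloading it in later sessions would avoid repeating that step; LangChain's FAISS wrapper has save_local and load_local for this, though the exact arguments should be checked against the installed version.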


# Data retrieval

In [7]:
data_vector_store = vector_store
retriever = data_vector_store.as_retriever(search_type="similarity", search_kwargs={"k": 2})  
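
With search_kwargs={"k": 2}, the retriever returns the two stored chunks whose embeddings are closest to the query embedding. The sketch below shows that idea as a brute-force search over toy vectors. The texts, the 3-dimensional vectors, and the helper top_k are invented for illustration; the real store holds 384-dimensional all-MiniLM-L6-v2 embeddings, and the distance metric FAISS uses depends on how the index is configured.

```python
import math


def top_k(query_vec, indexed, k=2):
    """Return the k (text, distance) pairs closest to query_vec."""
    scored = [(text, math.dist(query_vec, vec)) for text, vec in indexed]
    scored.sort(key=lambda pair: pair[1])  # smallest Euclidean distance first
    return scored[:k]


# Toy "embeddings"; both texts and vectors are made up.
indexed = [
    ("chunk about batting averages", (0.9, 0.1, 0.0)),
    ("chunk about bowling figures", (0.1, 0.9, 0.0)),
    ("chunk about match venues", (0.0, 0.1, 0.9)),
]
query = (1.0, 0.0, 0.0)

results = top_k(query, indexed, k=2)
for text, dist in results:
    print(f"{dist:.3f}  {text}")
```

A small k keeps the prompt short, but it also means the model only ever sees two chunks, so an answer can be no better than those two retrieved rows.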


In [26]:
# LLM and RAG pipeline

In [10]:
from langchain.llms import CTransformers
# llm = CTransformers(model="TheBloke/Llama-2-7B-Chat-GGUF", model_type="llama", config={"context_length": 4096})
llm = CTransformers(model="TheBloke/Mistral-7B-Instruct-v0.1-GGUF", model_type="mistral", config={"max_new_tokens": 512, "context_length": 2048})
# llm = CTransformers(model="TheBloke/Llama-2-13B-Chat-GGUF", model_type="llama")
# llm = CTransformers(model="TheBloke/Mistral-7B-Instruct-v0.1-GGUF", model_type="mistral")
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever, chain_type="stuff")

mistral-7b-instruct-v0.1.Q2_K.gguf:   0%|          | 0.00/3.08G [00:00<?, ?B/s]
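
The chain above uses chain_type="stuff", which places all retrieved chunks into a single prompt and makes one LLM call. A minimal sketch of that idea follows. The helper build_stuff_prompt, the template wording, and the placeholder chunks are hypothetical and are not LangChain's actual default prompt.

```python
def build_stuff_prompt(question, retrieved_chunks):
    """Concatenate all retrieved chunks into one prompt (the "stuff" idea)."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Placeholder chunks; in the notebook they would come from the retriever (k=2).
chunks = ["<retrieved chunk 1>", "<retrieved chunk 2>"]
prompt = build_stuff_prompt("Who has the most sixes in IPL history?", chunks)
print(prompt)
```

With k=2 and chunks of at most about 256 characters, the stuffed context should stay well inside the configured context_length of 2048 tokens.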

In [27]:
# Sample Q&A

In [12]:
answer = qa_chain.run("comparison between Kohli and Rohit in batting stats in IPL.")
print(answer)





Kohli has played more ODI matches than Rohit, but Rohit has a higher average score per match. In terms of runs scored, Kohli has the edge, with a total of 11,298 runs and 64 wickets in 195 matches, while Rohit has 7,279 runs and 30 wickets in 134 matches. However, Rohit's average score per match is 55.89, compared to Kohli's 51.25. Rohit also has a higher strike rate of 91.17%, while Kohli's is 90.28%.


In [11]:
# pip install ctransformers


In [23]:
answer = qa_chain.run("Who has the most sixes in IPL history?")
print(answer)

 Yusuf Pathan, an Indian cricketer, holds the record for the most sixes in IPL history with 236 sixes.


In [24]:
answer = qa_chain.run("Give me a summary of India vs Pakistan head-to-head in ICC events.")
print(answer)



India and Pakistan have played 29 matches against each other in various ICC events, with India winning 18 matches, while Pakistan has won 11 matches. The two teams first met in the 1978-79 World Cup tournament, where Pakistan defeated India in a dramatic final match. Since then, the rivalry between the two nations has only intensified, with both teams participating in several high-profile matches and tournaments against each other, including the 2007 World Cup final match, which India won after chasing down the victory target of 438 runs to win by six wickets, becoming the first team to chase down a score above 400 runs to win a World Cup match. In recent years, the two teams have also faced off in several high-profile ODI and Test matches, with India emerging as the dominant force in ICC events against Pakistan.
