<a href="https://colab.research.google.com/github/fahimku2020/fahimku2020/blob/main/Super_fast_rag_rerank_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Build a simple RAG pipeline in Python without an OpenAI API key: fetch data from Wikipedia, take a user's question, and return answers drawn from different paragraphs grouped by semantic clustering. Rerank the top answers by similarity score, apply semantic text splitting, and optimize for faster execution.
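
As a rough illustration of the "rerank top answers by similarity score" step, the sketch below ranks a few paragraphs by cosine similarity to a query vector. The three-dimensional vectors and the `rerank` helper are made-up stand-ins for real sentence embeddings and are not part of the notebook's code.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rerank(query_vec: Sequence[float],
           paragraphs: List[str],
           vectors: List[Sequence[float]],
           top_k: int = 2) -> List[Tuple[str, float]]:
    """Score every paragraph against the query and keep the top_k best."""
    scored = [(p, cosine(query_vec, v)) for p, v in zip(paragraphs, vectors)]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy data: hypothetical paragraphs with hand-made "embeddings"
paragraphs = ["history of AI", "quantum hardware", "AI techniques"]
vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.8, 0.0, 0.6]]
query_vec = [1.0, 0.0, 0.0]

top = rerank(query_vec, paragraphs, vectors, top_k=2)
for text, score in top:
    print(f"{score:.2f}  {text}")
```

The cells that follow do the same thing with real embeddings from `all-MiniLM-L6-v2`.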

In [None]:
!pip install sentence-transformers
!pip install BeautifulSoup4
!pip install requests
!pip install faiss-cpu
!pip install wikipedia


In [None]:
import wikipedia
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer
from typing import List, Tuple

class WikipediaRAG:
    def __init__(self, model_name='all-MiniLM-L6-v2'):
        """
        Initialize the RAG system with a sentence embedding model

        :param model_name: Sentence transformer model for embeddings
        """
        # Load sentence embedding model
        self.embedding_model = SentenceTransformer(model_name)

        # Initialize Faiss index for semantic search
        self.dimension = self.embedding_model.get_sentence_embedding_dimension()
        self.index = faiss.IndexFlatL2(self.dimension)

        # Storage for text passages and their metadata
        self.passages = []
        self.page_titles = []

    def semantic_text_split(self, text: str, max_length: int = 250) -> List[str]:
        """
        Split text into overlapping, sentence-aligned chunks

        :param text: Input text to split
        :param max_length: Approximate maximum length of each passage, in characters
        :return: List of text passages
        """
        # Split text into sentences
        sentences = text.split('. ')
        passages = []
        current_passage = []
        current_length = 0

        for sentence in sentences:
            # Add sentence to current passage
            current_passage.append(sentence)
            current_length += len(sentence)

            # If passage is too long, create a new passage
            if current_length > max_length:
                passages.append('. '.join(current_passage))
                current_passage = current_passage[-2:]  # Overlap with previous context
                current_length = len('. '.join(current_passage))

        # Add remaining passage
        if current_passage:
            passages.append('. '.join(current_passage))

        return passages

    def fetch_and_process_wikipedia(self, topic: str):
        """
        Fetch Wikipedia page, split into semantic passages, and index

        :param topic: Wikipedia topic to retrieve
        """
        try:
            # Fetch Wikipedia page
            page = wikipedia.page(topic)

            # Semantic text splitting
            passages = self.semantic_text_split(page.content)

            # Generate embeddings for passages
            embeddings = self.embedding_model.encode(passages)

            # Add to Faiss index
            self.index.add(embeddings)

            # Store passages and titles for reference
            self.passages.extend(passages)
            self.page_titles.extend([page.title] * len(passages))

            print(f"Processed {topic}: {len(passages)} passages")

        except wikipedia.exceptions.DisambiguationError as e:
            print(f"Multiple matches for {topic}. Suggestions: {e.options}")
        except wikipedia.exceptions.PageError:
            print(f"No Wikipedia page found for {topic}")

    def retrieve_top_passages(self, query: str, top_k: int = 5) -> List[Tuple[str, float]]:
        """
        Retrieve top passages based on semantic similarity

        :param query: User query
        :param top_k: Number of top passages to retrieve
        :return: List of tuples (passage, similarity_score)
        """
        # Nothing has been indexed yet, so there is nothing to search
        if self.index.ntotal == 0:
            return []

        # Embed query
        query_embedding = self.embedding_model.encode([query])

        # Search in Faiss index (IndexFlatL2 returns squared L2 distances)
        distances, indices = self.index.search(query_embedding, top_k)

        # Faiss pads missing results with index -1, so skip those entries
        results = [
            (self.passages[idx], float(1 / (1 + dist)))  # Convert distance to similarity score
            for idx, dist in zip(indices[0], distances[0])
            if idx >= 0
        ]

        return results

    def generate_answer(self, query: str) -> str:
        """
        Generate an answer by retrieving and synthesizing top passages

        :param query: User query
        :return: Generated answer
        """
        # Retrieve top passages
        top_passages = self.retrieve_top_passages(query)

        # Synthesize answer from top passages
        context = "\n".join([passage for passage, _ in top_passages])

        # Simple answer generation (can be replaced with more advanced LLM)
        answer = f"Based on the context from Wikipedia:\n\n{context}"

        return answer

def main():
    # Create RAG instance
    rag = WikipediaRAG()

    # Fetch and process some initial topics
    topics = ['Artificial Intelligence', 'Machine Learning', 'Python Programming', 'Amitabh bachan']
    for topic in topics:
        rag.fetch_and_process_wikipedia(topic)

    # Interactive loop
    while True:
        query = input("\nEnter your question (or 'exit' to quit): ")

        if query.lower() == 'exit':
            break

        # Generate and print answer
        answer = rag.generate_answer(query)
        print("\nAnswer:", answer)

        # Print source passages with similarity scores
        print("\nTop Relevant Passages:")
        top_passages = rag.retrieve_top_passages(query)
        for passage, score in top_passages:
            print(f"Similarity Score: {score:.2f}")
            print(passage[:300] + "...\n")

if __name__ == "__main__":
    main()



Processed Artificial Intelligence: 383 passages
No Wikipedia page found for Machine Learning
Processed Python Programming: 206 passages
Processed Amitabh bachan: 261 passages


KeyboardInterrupt: Interrupted by user
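
`retrieve_top_passages` turns a Faiss distance into a score with `1 / (1 + dist)`. `IndexFlatL2` reports squared Euclidean distances, and for unit-length vectors the squared distance d and the cosine similarity c satisfy d = 2 - 2c. The sketch below checks that relation on two hand-made unit vectors. It assumes the embeddings are normalized to length 1, which is worth verifying for the model in use, and `distance_to_cosine` is a hypothetical helper, not part of the class above.

```python
import math
from typing import Sequence


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance, the quantity IndexFlatL2 reports."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def distance_to_cosine(dist: float) -> float:
    """Recover cosine similarity from a squared L2 distance (unit vectors only)."""
    return 1.0 - dist / 2.0


a = [1.0, 0.0]
b = [0.6, 0.8]  # also unit length: 0.36 + 0.64 = 1

dist = squared_l2(a, b)
print("squared L2 distance:", dist)
print("cosine from distance:", distance_to_cosine(dist))
print("cosine computed directly:", cosine(a, b))
print("score used in the notebook:", 1 / (1 + dist))
```

Both conversions decrease as the distance grows, so they rank passages identically; only the cosine form has a direct geometric meaning.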

Optimized RAG model

In [None]:
!pip install wikipedia-api



In [None]:
import time
from typing import List, Dict, Any
import numpy as np
import wikipediaapi
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from tqdm import tqdm

class OptimizedWikipediaRAG:
    def __init__(self,
                 chunk_size: int = 100,
                 top_k: int = 5,
                 language: str = 'en'):
        """
        Initialize RAG system with optimization parameters

        Args:
            chunk_size: Number of whitespace-separated words per text chunk
            top_k: Number of top results to return
            language: Wikipedia language edition
        """
        # Lightweight model for fast embedding
        self.model = SentenceTransformer('all-MiniLM-L6-v2')

        # Wikipedia API client
        self.wiki = wikipediaapi.Wikipedia(
            language=language,
            extract_format=wikipediaapi.ExtractFormat.WIKI,
            user_agent='MyWikipediaApp/1.0 (my_email@example.com)'
        )

        # Optimization parameters
        self.chunk_size = chunk_size
        self.top_k = top_k

    def semantic_text_split(self, text: str) -> List[str]:
        """
        Split text into fixed-size chunks of whitespace-separated words

        Args:
            text: Input text to split

        Returns:
            List of text chunks
        """
        # Fixed-size word windows; sentence and topic boundaries are ignored
        words = text.split()
        chunks = [
            ' '.join(words[i:i+self.chunk_size])
            for i in range(0, len(words), self.chunk_size)
        ]
        return chunks

    def fetch_wikipedia_content(self, query: str) -> str:
        """
        Fetch Wikipedia page content

        Args:
            query: Search query

        Returns:
            Extracted page content
        """
        try:
            page = self.wiki.page(query)
            return page.text if page.exists() else ""
        except Exception as e:
            print(f"Error fetching Wikipedia content: {e}")
            return ""

    def embed_chunks(self, chunks: List[str]) -> np.ndarray:
        """
        Generate embeddings for text chunks

        Args:
            chunks: List of text chunks

        Returns:
            Embedding matrix
        """
        # Encode all chunks in one batched call for speed
        return self.model.encode(chunks, show_progress_bar=False)

    def rerank_results(self,
                       query: str,
                       chunks: List[str],
                       embeddings: np.ndarray) -> List[Dict[str, Any]]:
        """
        Rerank results based on semantic similarity

        Args:
            query: User query
            chunks: Text chunks
            embeddings: Chunk embeddings

        Returns:
            Reranked results with scores
        """
        # Embed query
        query_embedding = self.model.encode([query])[0]

        # Compute cosine similarities
        similarities = cosine_similarity([query_embedding], embeddings)[0]

        # Sort and select top results
        ranked_results = sorted(
            [
                {
                    'chunk': chunk,
                    'similarity': sim
                }
                for chunk, sim in zip(chunks, similarities)
            ],
            key=lambda x: x['similarity'],
            reverse=True
        )[:self.top_k]

        return ranked_results

    def answer_query(self, query: str) -> List[Dict[str, Any]]:
        """
        Main method to process query and return results

        Args:
            query: User input query

        Returns:
            Reranked answer chunks
        """
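        # Fetch Wikipedia content. Only the first word of the query is used as
        # the page title, so "solar system" looks up the page "solar".
        content = self.fetch_wikipedia_content(query.split()[0])

        # Semantic text splitting
        chunks = self.semantic_text_split(content)

        # A missing or empty page yields no chunks and nothing to rank
        if not chunks:
            return []

        # Generate embeddings
        embeddings = self.embed_chunks(chunks)

        # Rerank results
        return self.rerank_results(query, chunks, embeddings)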

def main():
    # Example usage
    rag = OptimizedWikipediaRAG(chunk_size=50, top_k=2)

    # Measure execution time
    start_time = time.time()

    # Example queries
    queries = ["computer science ", "solar system", "politics"]

    # Process multiple queries
    for query in tqdm(queries, desc="Processing Queries"):
        results = rag.answer_query(query)

        print(f"\nQuery: {query}")
        for i, result in enumerate(results, 1):
            print(f"Result {i}:")
            print(f"Similarity: {result['similarity']:.4f}")
            print(f"Chunk: {result['chunk'][:1000]}...\n")

    end_time = time.time()
    print(f"Total Execution Time: {end_time - start_time:.2f} seconds")

if __name__ == "__main__":
    main()

Processing Queries:  33%|███▎      | 1/3 [00:09<00:19,  9.58s/it]


Query: computer science 
Result 1:
Similarity: 0.4620
Chunk: any type of computer (netbook, supercomputer, cellular automaton, etc.) is able to perform the same computational tasks, given enough time and storage capacity. Artificial intelligence A computer will solve problems in exactly the way it is programmed to, without regard to efficiency, alternative solutions, possible shortcuts, or possible errors in...

Result 2:
Similarity: 0.4554
Chunk: typical modern definition of a computer is: "A device that computes, especially a programmable [usually] electronic machine that performs high-speed mathematical or logical operations or that assembles, stores, correlates, or otherwise processes information." According to this definition, any device that processes information qualifies as a computer. Future There is active...



Processing Queries:  67%|██████▋   | 2/3 [00:10<00:04,  4.48s/it]


Query: solar system
Result 1:
Similarity: 0.5653
Chunk: eclipse, an eclipse of a sun in which it is obstructed by the moon Solar System, the planetary system made up by the Sun and the objects orbiting it Solar Maximum Mission, a satellite SOLAR (ISS), an observatory on International Space Station Music "Solar" (composition), attributed to Miles Davis Solar...

Result 2:
Similarity: 0.5578
Chunk: Solar may refer to: Astronomy Of or relating to the Sun Solar telescope, a special purpose telescope used to observe the Sun A device that utilizes solar energy (e.g. "solar panels") Solar calendar, a calendar whose dates indicate the position of the Earth on its revolution around the Sun Solar...



Processing Queries: 100%|██████████| 3/3 [00:17<00:00,  5.72s/it]


Query: politics
Result 1:
Similarity: 0.5965
Chunk: Politics (from Ancient Greek πολιτικά (politiká) 'affairs of the cities') is the set of activities that are associated with making decisions in groups, or other forms of power relations among individuals, such as the distribution of status or resources. The branch of social science that studies politics and government is...

Result 2:
Similarity: 0.5221
Chunk: ('rule of thieves'). Insincere politics The words "politics" and "political" are sometimes used as pejoratives to mean political action that is deemed to be overzealous, performative, or insincere. Levels of politics Macropolitics Macropolitics can either describe political issues that affect an entire political system (e.g. the nation state), or refer...

Total Execution Time: 17.34 seconds
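
The results for "solar system" come from the disambiguation page "Solar", and those for "computer science" from the page on computers, because `answer_query` passes only `query.split()[0]` to the page lookup. One possible alternative, sketched below, is to use the whole query with its whitespace normalized as the page title. `title_from_query` is a hypothetical helper, and whether the full query matches an existing page title still depends on the query.

```python
def first_word_title(query: str) -> str:
    """Title as the cell above derives it: the first whitespace-separated word."""
    return query.split()[0]


def title_from_query(query: str) -> str:
    """Hypothetical alternative: the full query with whitespace normalized."""
    return " ".join(query.split())


for q in ["computer science ", "solar system", "politics"]:
    print(f"{q!r}: first word -> {first_word_title(q)!r}, "
          f"full query -> {title_from_query(q)!r}")
```

For a natural-language question rather than a topic name, a search step that maps the question to candidate page titles would likely be needed instead.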





Caching techniques
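
A minimal sketch of disk caching for expensive steps such as page fetches or embedding computation. It uses only the standard library (`pickle`, `hashlib`, `tempfile`) rather than `diskcache`, and `disk_cached`, `slow_fetch`, and the cache directory are hypothetical names chosen for illustration.

```python
import hashlib
import pickle
import tempfile
from functools import wraps
from pathlib import Path

# Hypothetical cache location; a fixed directory would persist across sessions
CACHE_DIR = Path(tempfile.mkdtemp(prefix="rag_cache_"))


def disk_cached(func):
    """Cache func's return value on disk, keyed by its name and arguments."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        raw_key = pickle.dumps((func.__name__, args, sorted(kwargs.items())))
        path = CACHE_DIR / (hashlib.sha256(raw_key).hexdigest() + ".pkl")
        if path.exists():
            # Cache hit: load the stored result instead of calling func
            with path.open("rb") as fh:
                return pickle.load(fh)
        result = func(*args, **kwargs)
        with path.open("wb") as fh:
            pickle.dump(result, fh)
        return result
    return wrapper


call_count = 0


@disk_cached
def slow_fetch(topic: str) -> str:
    """Stand-in for a slow Wikipedia request."""
    global call_count
    call_count += 1
    return f"content for {topic}"


first = slow_fetch("Quantum Computing")
second = slow_fetch("Quantum Computing")  # served from disk; the body is not re-run
print(first == second, call_count)
```

The `diskcache` package installed in the next cell provides a persistent cache with a `memoize` decorator that serves the same purpose.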

In [None]:
! pip install diskcache



In [None]:
import wikipedia
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity
from typing import Any, Dict, List, Tuple

class WikipediaRAG:
    def __init__(self,
                 embedding_model: str = 'all-MiniLM-L6-v2',
                 max_paragraphs: int = 20,
                 clustering_method: str = 'kmeans'):
        """
        Initialize RAG system with semantic embedding and clustering capabilities

        Args:
            embedding_model (str): Sentence transformer model for embeddings
            max_paragraphs (int): Maximum number of paragraphs to process
            clustering_method (str): Clustering approach ('kmeans')
        """
        # Use SentenceTransformer for efficient semantic embeddings
        self.embedding_model = SentenceTransformer(embedding_model)
        self.max_paragraphs = max_paragraphs
        self.clustering_method = clustering_method

    def fetch_wikipedia_content(self, topic: str) -> List[str]:
        """
        Fetch and preprocess Wikipedia content

        Args:
            topic (str): Wikipedia search topic

        Returns:
            List[str]: Preprocessed paragraphs
        """
        try:
            # Fetch Wikipedia page
            page = wikipedia.page(topic)

            # Split content into paragraphs
            paragraphs = page.content.split('\n\n')

            # Filter and preprocess paragraphs
            paragraphs = [
                p.strip() for p in paragraphs
                if p.strip() and len(p.split()) > 10
            ][:self.max_paragraphs]

            return paragraphs

        except Exception as e:
            print(f"Error fetching Wikipedia content: {e}")
            return []

    def semantic_text_splitting(self, text: str, chunk_size: int = 100) -> List[str]:
        """
        Split text into fixed-size word chunks that overlap by half a chunk

        Args:
            text (str): Input text
            chunk_size (int): Number of whitespace-separated words per chunk

        Returns:
            List[str]: Overlapping text chunks
        """
        tokens = text.split()
        chunks = []

        # Advance by half a chunk so consecutive chunks share about 50% of their words
        step = max(1, chunk_size // 2)
        for i in range(0, len(tokens), step):
            chunk = ' '.join(tokens[i:i+chunk_size])
            chunks.append(chunk)

        return chunks

    def embed_paragraphs(self, paragraphs: List[str]) -> np.ndarray:
        """
        Generate embeddings for paragraphs

        Args:
            paragraphs (List[str]): Input paragraphs

        Returns:
            np.ndarray: Paragraph embeddings
        """
        return self.embedding_model.encode(paragraphs, show_progress_bar=False)

    def cluster_paragraphs(self, embeddings: np.ndarray, n_clusters: int = 3) -> np.ndarray:
        """
        Cluster paragraphs based on semantic similarity

        Args:
            embeddings (np.ndarray): Paragraph embeddings
            n_clusters (int): Number of semantic clusters

        Returns:
            np.ndarray: Cluster labels
        """
        # KMeans needs at least as many samples as clusters
        n_clusters = min(n_clusters, len(embeddings))
        kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42)
        return kmeans.fit_predict(embeddings)

    def semantic_search(self, query: str, paragraphs: List[str], embeddings: np.ndarray) -> List[Tuple[str, float]]:
        """
        Perform semantic search and ranking

        Args:
            query (str): User query
            paragraphs (List[str]): Source paragraphs
            embeddings (np.ndarray): Paragraph embeddings

        Returns:
            List[Tuple[str, float]]: Ranked paragraphs with similarity scores
        """
        # Embed query
        query_embedding = self.embedding_model.encode([query])[0]

        # Compute cosine similarities
        similarities = cosine_similarity([query_embedding], embeddings)[0]

        # Create ranked list of paragraphs
        ranked_paragraphs = sorted(
            zip(paragraphs, similarities),
            key=lambda x: x[1],
            reverse=True
        )

        return ranked_paragraphs

    def generate_answer(self,
                        query: str,
                        topic: str,
                        top_k: int = 3) -> Dict[str, Any]:
        """
        Generate comprehensive answer using RAG approach

        Args:
            query (str): User query
            topic (str): Wikipedia topic
            top_k (int): Number of top paragraphs to retrieve

        Returns:
            Dict containing answer details
        """
        # Fetch and preprocess paragraphs
        paragraphs = self.fetch_wikipedia_content(topic)

        if not paragraphs:
            return {"error": "No content found"}

        # Generate embeddings
        embeddings = self.embed_paragraphs(paragraphs)

        # Semantic clustering
        cluster_labels = self.cluster_paragraphs(embeddings)

        # Semantic search and re-ranking
        ranked_paragraphs = self.semantic_search(query, paragraphs, embeddings)

        return {
            "query": query,
            "topic": topic,
            "top_answers": ranked_paragraphs[:top_k],
            "clusters": cluster_labels.tolist()
        }

def main():
    # Example usage
    rag_system = WikipediaRAG()

    # Example queries
    queries = [
        "What is the history of artificial intelligence?",
        "Explain quantum computing basics",
        "Tell me about climate change impact",
        "Filmfare awards"
    ]

    topics = [
        "Artificial Intelligence",
        "Quantum Computing",
        "Climate Change",
        "Amitabh Bachan"
    ]

    for query, topic in zip(queries, topics):
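        result = rag_system.generate_answer(query, topic)

        # generate_answer returns {"error": ...} when no content could be fetched
        if "error" in result:
            print(f"\nNo results for topic {topic!r}: {result['error']}")
            continue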

        print("\n--- Results ---")
        print(f"Query: {result['query']}")
        print(f"Topic: {result['topic']}")

        print("\nTop Answers:")
        for i, (paragraph, score) in enumerate(result['top_answers'], 1):
            print(f"{i}. Score: {score:.4f}")
            print(f"   {paragraph[:300]}...\n")

if __name__ == "__main__":
    main()
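
`generate_answer` returns cluster labels alongside the ranked paragraphs, but the loop above never uses them. One way to use them, sketched here with made-up labels and scores, is to keep the best-scoring paragraph from each cluster so that the answers cover different themes. `best_per_cluster` is a hypothetical helper, not part of the class.

```python
from typing import Dict, List, Tuple


def best_per_cluster(paragraphs: List[str],
                     scores: List[float],
                     labels: List[int]) -> List[Tuple[int, str, float]]:
    """Return the highest-scoring paragraph of each cluster, best first."""
    best: Dict[int, Tuple[str, float]] = {}
    for text, score, label in zip(paragraphs, scores, labels):
        if label not in best or score > best[label][1]:
            best[label] = (text, score)
    ranked = [(label, text, score) for label, (text, score) in best.items()]
    ranked.sort(key=lambda item: item[2], reverse=True)
    return ranked


# Toy data in paragraph order: similarity scores and cluster labels are invented
paragraphs = ["intro", "history", "techniques", "ethics"]
scores = [0.54, 0.55, 0.52, 0.31]
labels = [0, 0, 1, 2]

picked = best_per_cluster(paragraphs, scores, labels)
for label, text, score in picked:
    print(f"cluster {label}: {text} ({score:.2f})")
```

With the real class, `scores` would be the cosine similarities in original paragraph order, since `result['clusters']` is indexed by paragraph position rather than by rank.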


--- Results ---
Query: What is the history of artificial intelligence?
Topic: Artificial Intelligence

Top Answers:
1. Score: 0.5528
   === General intelligence ===
A machine with artificial general intelligence should be able to solve a wide variety of problems with breadth and versatility similar to human intelligence....

2. Score: 0.5370
   Artificial intelligence (AI), in its broadest sense, is intelligence exhibited by machines, particularly computer systems. It is a field of research in computer science that develops and studies methods and software that enable machines to perceive their environment and use learning and intelligence...

3. Score: 0.5228
   == Techniques ==
AI research uses a wide variety of techniques to accomplish the goals above....


--- Results ---
Query: Explain quantum computing basics
Topic: Quantum Computing

Top Answers:
1. Score: 0.7129
   A quantum computer is a computer that exploits quantum mechanical phenomena. On small scales, physical matter exh