<a href="https://colab.research.google.com/github/fahimku2020/fahimku2020/blob/main/RaG_model_with_user_prompt.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install wikipedia
!pip install spacy
!pip instalk sentence-transformers
!pip install transformers
!pip install torch
!python -m spacy download en_core_web_sm

Collecting wikipedia
  Downloading wikipedia-1.4.0.tar.gz (27 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: wikipedia
  Building wheel for wikipedia (setup.py) ... [?25l[?25hdone
  Created wheel for wikipedia: filename=wikipedia-1.4.0-py3-none-any.whl size=11679 sha256=c94a8dc375a282cc48df8f199ac853a757fec2bf8a6fab423292eda3bb531a9b
  Stored in directory: /root/.cache/pip/wheels/5e/b6/c5/93f3dec388ae76edc830cb42901bb0232504dfc0df02fc50de
Successfully built wikipedia
Installing collected packages: wikipedia
Successfully installed wikipedia-1.4.0
ERROR: unknown command "instalk" - maybe you meant "install"
Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m12.8/12.8 MB[0m [31m52.5 MB/s[0m eta [36m0:00:00[0m
[38;5;2m✔ Download and installation s

In [None]:
import wikipedia
import spacy
from sentence_transformers import SentenceTransformer
import numpy as np
import torch

class WikipediaRAG:
    def __init__(self):
        # Load spaCy for text processing
        self.nlp = spacy.load('en_core_web_sm')

        # Load sentence embedding model
        self.model = SentenceTransformer('all-MiniLM-L6-v2')

    def fetch_wikipedia_content(self, topic):
        """
        Fetch Wikipedia content for a given topic

        Args:
            topic (str): The Wikipedia topic to search for

        Returns:
            str: Extracted Wikipedia content
        """
        try:
            # Try to get the summary first
            summary = wikipedia.summary(topic, sentences=3000)

            # Then try to get the full page content
            page = wikipedia.page(topic)
            full_content = page.content

            # Combine summary and full content
            return f"{summary}\n\n{full_content}"
        except wikipedia.exceptions.DisambiguationError as e:
            # Handle disambiguation by returning options
            return f"Multiple results found. Please be more specific. Options: {e.options}"
        except wikipedia.exceptions.PageError:
            return "No Wikipedia page found for this topic."

    def preprocess_text(self, text):
        """
        Preprocess text using spaCy

        Args:
            text (str): Input text to preprocess

        Returns:
            str: Preprocessed text
        """
        # Process text with spaCy
        doc = self.nlp(text)

        # Remove stopwords and lemmatize
        processed_tokens = [token.lemma_.lower() for token in doc
                            if not token.is_stop and token.is_alpha]

        return ' '.join(processed_tokens)

    def generate_embeddings(self, text):
        """
        Generate embeddings for text

        Args:
            text (str): Input text

        Returns:
            numpy.ndarray: Text embeddings
        """
        return self.model.encode(text)

    def calculate_similarity(self, query_embedding, context_embeddings):
        """
        Calculate cosine similarity between query and context embeddings

        Args:
            query_embedding (numpy.ndarray): Query embedding
            context_embeddings (list): List of context embeddings

        Returns:
            list: Similarity scores
        """
        # Compute cosine similarity
        similarities = []
        for context_emb in context_embeddings:
            sim = np.dot(query_embedding, context_emb) / (
                np.linalg.norm(query_embedding) * np.linalg.norm(context_emb)
            )
            similarities.append(sim)

        return similarities

    def extract_answer(self, context, query):
        """
        Extract the most relevant answer from context

        Args:
            context (str): Full context text
            query (str): User query

        Returns:
            str: Most relevant answer snippet
        """
        # Split context into sentences
        doc = self.nlp(context)
        sentences = [sent.text.strip() for sent in doc.sents]

        # Preprocess query and sentences
        query_processed = self.preprocess_text(query)
        sentences_processed = [self.preprocess_text(sent) for sent in sentences]

        # Generate embeddings
        query_embedding = self.generate_embeddings(query_processed)
        sentence_embeddings = [
            self.generate_embeddings(sent) for sent in sentences_processed
        ]

        # Calculate similarities
        similarities = self.calculate_similarity(query_embedding, sentence_embeddings)

        # Get top 3 most similar sentences
        top_indices = sorted(range(len(similarities)), key=lambda i: similarities[i], reverse=True)[:1]

        # Construct answer from top sentences
        answer = ' '.join([sentences[i] for i in top_indices])

        return answer

    def answer_question(self, topic, query):
        """
        Main method to answer user questions about a topic

        Args:
            topic (str): Wikipedia topic to search
            query (str): User's question

        Returns:
            str: Generated answer
        """
        # Fetch Wikipedia content
        content = self.fetch_wikipedia_content(topic)

        # Extract and return answer
        answer = self.extract_answer(content, query)

        return answer

# Example usage
def main():
    # Initialize RAG system
    rag = WikipediaRAG()

    # Interactive loop
    while True:
        topic = input("Enter a Wikipedia topic (or 'quit' to exit): ")
        if topic.lower() == 'quit':
            break

        query = input("Ask a question about the topic: ")

        try:
            # Generate answer
            answer = rag.answer_question(topic, query)
            print("\nAnswer:", answer)
            print("\n" + "-"*50 + "\n")

        except Exception as e:
            print(f"An error occurred: {e}")

if __name__ == "__main__":
    main()

  from tqdm.autonotebook import tqdm, trange
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/10.7k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

Enter a Wikipedia topic (or 'quit' to exit): Amitabh Bachan
Ask a question about the topic: Who is Amitabh Bachan 

Answer: Ek Jeevit Kimvadanti in 2006,
Amitabh: The Making of a Superstar in 2006,
Looking for the Big B: Bollywood, Bachchan and Me in 2007 and
Bachchanalia in 2009. Amitabh Bachchan (pronounced [əmɪˈt̪ɑːbʱ ˈbətːʃən] ; born Amitabh Srivastava; 11 October 1942) is an Indian actor who works in Hindi cinema. Amitabh Bachchan (pronounced [əmɪˈt̪ɑːbʱ ˈbətːʃən] ; born Amitabh Srivastava; 11 October 1942) is an Indian actor who works in Hindi cinema.

--------------------------------------------------

Enter a Wikipedia topic (or 'quit' to exit): Amitabh Bachan 
Ask a question about the topic: Amitabh Bachan films 

Answer: It is with this last name that Amitabh debuted in films and used for all other practical purposes, Bachchan has become the surname for all of his immediate family. Amitabh Bachchan (pronounced [əmɪˈt̪ɑːbʱ ˈbətːʃən] ; born Amitabh Srivastava; 11 October 1942) 