<a href="https://colab.research.google.com/github/ethanelkaim/RAG/blob/main/RAG_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install faiss-cpu sentence-transformers transformers wikipedia-api torch datasets cohere

In [104]:
import wikipediaapi
from sentence_transformers import SentenceTransformer, util
import faiss
import numpy as np
from datasets import load_dataset
from transformers import pipeline
import cohere

In [105]:
# Initialize the sentence transformer model
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

# FAISS Index (for vector-based retrieval)
dimension = 384
index = faiss.IndexFlatL2(dimension)

# Hugging Face NER pipeline for keyword extraction
ner_pipeline = pipeline("ner", model="dbmdz/bert-large-cased-finetuned-conll03-english", grouped_entities=True)

Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [112]:
# Function to extract keywords using Hugging Face's NER pipeline
def extract_keywords(query):
    ner_results = ner_pipeline(query)
    keywords = []
    for entity in ner_results:
        entity_word = entity['word']
        if entity_word not in keywords and not entity_word.startswith("##"):
            keywords.append(entity_word)
    return keywords

# Global dictionary to store content after indexing
wiki_content_map = {}

# Function to fetch content and embeddings from the dataset
def index_dataset(dataset_name, keyword):
    global wiki_content_map
    wiki_content_map.clear()  # Clear the map for each new query

    if dataset_name == 'wikipedia':
        wiki_wiki = wikipediaapi.Wikipedia('english')
        page = wiki_wiki.page(keyword)
        if page.exists():
            paragraphs = page.text.split('\n')
            for idx, paragraph in enumerate(paragraphs):
                if len(paragraph.strip()) > 0:
                    embedding = model.encode(paragraph, convert_to_tensor=False)
                    embedding = np.array([embedding])  # FAISS requires 2D arrays
                    index.add(embedding)
                    wiki_content_map[idx] = paragraph  # Store the paragraph in the content map
            print(f"Indexed page: {keyword}")
        else:
            print(f"Wikipedia page for '{keyword}' does not exist.")
    elif dataset_name == 'natural_questions':
        ds = load_dataset("google-research-datasets/natural_questions", "default")
        for i, example in enumerate(ds['train']):
            if keyword.lower() in example['question'].lower():
                passage = example['document_text']
                embedding = model.encode(passage, convert_to_tensor=False)
                embedding = np.array([embedding])  # FAISS requires 2D arrays
                index.add(embedding)
                wiki_content_map[i] = passage  # Store the passage in the content map
        print(f"Indexed examples from Natural Questions for keyword: {keyword}")

# Function to retrieve the most relevant passages from the indexed content
def retrieve_passages(query, top_k=3):
    query_embedding = model.encode(query, convert_to_tensor=False)
    distances, indices = index.search(np.array([query_embedding]), top_k)

    # Check if retrieved indices have corresponding text passages
    retrieved_passages = []
    for idx in indices[0]:
        if idx in wiki_content_map:
            retrieved_passages.append(wiki_content_map[idx])
        else:
            print(f"Warning: No passage found for index {idx}")

    if not retrieved_passages:
        print("No relevant passages found.")

    return retrieved_passages

In [113]:
# Function to generate a response using GPT-2
def generate_response_gpt2(query):
    generator = pipeline("text-generation", model="gpt2")
    generated_text = generator(query, max_length=5000, num_return_sequences=1)[0]['generated_text']
    return generated_text

# Function to generate a response using Cohere's API
def generate_response_cohere(query, cohere_api_key):
    co = cohere.Client(api_key=cohere_api_key)
    response = co.chat(model='command-r-plus', message=query)
    return response.text

# Function to generate a response using GPT-2 with retrieved context
def generate_response_gpt2_with_context(query, retrieved_passages):
    generator = pipeline("text-generation", model="gpt2")
    context = query + "\n\n" + "\n".join(retrieved_passages)
    generated_text = generator(context, max_length=10000, num_return_sequences=1)[0]['generated_text']
    return generated_text

# Function to generate a response using Cohere with retrieved context
def generate_response_cohere_with_context(query, retrieved_passages, cohere_api_key):
    context = query + "\n\n" + "\n".join(retrieved_passages)
    co = cohere.Client(api_key=cohere_api_key)
    response = co.generate(prompt=context, model="command").generations[0].text
    return response

In [126]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    A token is already saved on your machine. Run `huggingface-cli whoami` to get more information or `huggingface-cli logout` if you want to log out.
    Setting a new token will erase the existing one.
    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) y
Token is valid (permission: fineGr

In [122]:
query = "What is the Leonardo DiCaprio mother's name?"
# query = "How much money did Harry Potter star Daniel Radcliffe have when he was 18?"
dataset_name = "wikipedia"  # or "natural_questions"
llm_choice = "cohere"  # or "gpt2"
cohere_api_key = "LjyWoNgE5Cc1E5qytRY90Nwc2VlD1tMdKrkf13nF"

In [123]:
# Step 1: Extract keywords from the query
keywords = extract_keywords(query)
print(f"Extracted Keywords: {keywords}")

# Step 2: Fetch the data (Wikipedia or Natural Questions)
for keyword in keywords:
    content_map = index_dataset(dataset_name, keyword)

Extracted Keywords: ['Leonardo DiCaprio']
Indexed page: Leonardo DiCaprio


In [124]:
# Step 3: Retrieve the most relevant passages
retrieved_passages = retrieve_passages(query, top_k=3)
print("Retrieved Passages:")
for passage in retrieved_passages:
    print(passage)

# Step 4: Basic LLM response (without retrieved context)
if llm_choice == "gpt2":
    basic_response = generate_response_gpt2(query)
elif llm_choice == "cohere":
    if cohere_api_key is None:
        raise ValueError("Cohere API key is required for Cohere LLM.")
    basic_response = generate_response_cohere(query, cohere_api_key)
else:
    raise ValueError("Invalid LLM choice. Please choose 'gpt2' or 'cohere'.")

print("\nBasic Response (No Retrieval):")
print(basic_response)

Retrieved Passages:
Leonardo Wilhelm DiCaprio was born on November 11, 1974, in Los Angeles, California. He is the only child of Irmelin Indenbirken, a legal secretary, and George DiCaprio, an underground comix artist and distributor. They met while attending college and moved to Los Angeles after graduating. His mother is German and his father is of Italian and German descent. His maternal grandfather, Wilhelm Indenbirken, was German, and his maternal grandmother, Helene Indenbirken, was a Russian immigrant living in Germany. DiCaprio was raised Catholic. Sources have falsely claimed his maternal grandmother was born in Odesa, Ukraine; there is no evidence that DiCaprio has any relatives of Ukrainian birth or heritage.
See also

Basic Response (No Retrieval):
Irmelin DiCaprio


In [125]:
# Step 5: Augmented LLM response (with retrieved context)
if llm_choice == "gpt2":
    augmented_response = generate_response_gpt2_with_context(query, retrieved_passages)
elif llm_choice == "cohere":
    augmented_response = generate_response_cohere_with_context(query, retrieved_passages, cohere_api_key)

print("\nAugmented Response (With Retrieved Context):")
print(augmented_response)


Augmented Response (With Retrieved Context):
 Irmelin Indenbirken is Leonardo DiCaprio's mother's name. 
Would you like help with anything else?  I can also provide more information if you'd like. 
