<a href="https://colab.research.google.com/github/ethanelkaim/RAG/blob/main/RAG_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install faiss-cpu sentence-transformers transformers wikipedia-api torch datasets cohere

In [2]:
import wikipediaapi
from sentence_transformers import SentenceTransformer, util
import faiss
import numpy as np
from datasets import load_dataset
from transformers import pipeline
import cohere

  from tqdm.autonotebook import tqdm, trange


In [None]:
# Initialize the sentence transformer model
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

# FAISS Index (for vector-based retrieval)
dimension = 384
index = faiss.IndexFlatL2(dimension)

# Hugging Face NER pipeline for keyword extraction
ner_pipeline = pipeline("ner", model="dbmdz/bert-large-cased-finetuned-conll03-english", grouped_entities=True)

# Function to extract keywords using Hugging Face's NER pipeline
def extract_keywords(query):
    ner_results = ner_pipeline(query)
    keywords = []
    for entity in ner_results:
        entity_word = entity['word']
        if entity_word not in keywords and not entity_word.startswith("##"):
            keywords.append(entity_word)
    return keywords

In [10]:
# Global dictionary to store content after indexing
wiki_content_map = {}

# Function to fetch content and embeddings from the dataset
def index_dataset(dataset_name, keyword):
    global wiki_content_map
    wiki_content_map.clear()  # Clear the map for each new query

    if dataset_name == 'wikipedia':
        wiki_wiki = wikipediaapi.Wikipedia('english')
        page = wiki_wiki.page(keyword)
        if page.exists():
            paragraphs = page.text.split('\n')
            for idx, paragraph in enumerate(paragraphs):
                if len(paragraph.strip()) > 0:
                    # print(f"Paragraph \n{paragraph}")
                    embedding = model.encode(paragraph, convert_to_tensor=False)
                    embedding = np.array([embedding])  # FAISS requires 2D arrays
                    index.add(embedding)
                    wiki_content_map[idx] = paragraph  # Store the paragraph in the content map
            print(f"Indexed page: {keyword}")
        else:
            print(f"Wikipedia page for '{keyword}' does not exist.")

    elif dataset_name == 'natural_questions':
        ds = load_dataset("google-research-datasets/natural_questions", "default")
        for i, example in enumerate(ds['train']):
            if keyword.lower() in example['question'].lower():
                passage = example['document_text']
                embedding = model.encode(passage, convert_to_tensor=False)
                embedding = np.array([embedding])  # FAISS requires 2D arrays
                index.add(embedding)
                wiki_content_map[i] = passage  # Store the passage in the content map
        print(f"Indexed examples from Natural Questions for keyword: {keyword}")

    elif dataset_name == 'cnn_dailymail':
        ds = load_dataset("cnn_dailymail", "3.0.0")
        for i, example in enumerate(ds['train']):
            if keyword.lower() in example['article'].lower():
                passage = example['article']
                embedding = model.encode(passage, convert_to_tensor=False)
                embedding = np.array([embedding])  # FAISS requires 2D arrays
                index.add(embedding)
                wiki_content_map[i] = passage  # Store the passage in the content map
        print(f"Indexed examples from CNN/DailyMail for keyword: {keyword}")

# Function to retrieve the most relevant passages from the indexed content
def retrieve_passages(query, top_k=3):
    query_embedding = model.encode(query, convert_to_tensor=False)
    distances, indices = index.search(np.array([query_embedding]), top_k)

    # Check if retrieved indices have corresponding text passages
    retrieved_passages = []
    for idx in indices[0]:
        if idx in wiki_content_map:
            retrieved_passages.append(wiki_content_map[idx])
        else:
            print(f"Warning: No passage found for index {idx}")

    if not retrieved_passages:
        print("No relevant passages found.")

    return retrieved_passages

In [5]:
# Function to generate a response using GPT-2
def generate_response_gpt2(query):
    generator = pipeline("text-generation", model="gpt2")
    generated_text = generator(query, max_length=5000, num_return_sequences=1)[0]['generated_text']
    return generated_text

# Function to generate a response using Cohere's API
def generate_response_cohere(query, cohere_api_key):
    co = cohere.Client(api_key=cohere_api_key)
    response = co.chat(model='command-r-plus', message=query)
    return response.text

# Function to generate a response using GPT-2 with retrieved context
def generate_response_gpt2_with_context(query, retrieved_passages):
    generator = pipeline("text-generation", model="gpt2")
    context = query + "\n\n" + "\n".join(retrieved_passages)
    generated_text = generator(context, max_length=10000, num_return_sequences=1)[0]['generated_text']
    return generated_text

# Function to generate a response using Cohere with retrieved context
def generate_response_cohere_with_context(query, retrieved_passages, cohere_api_key):
    context = query + "\n\n" + "\n".join(retrieved_passages)
    co = cohere.Client(api_key=cohere_api_key)
    response = co.generate(prompt=context, model="command").generations[0].text
    return response

In [6]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) y
Token is valid (permission: fineGrained).
[1m[31mCannot authenticate through git-credential as no helper is defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub.
Run the following command in yo

Queries

In [19]:
# Specific Knowledge:
# query = "How much revenue did Apple generate in Q3 of 2024?" # Wrong wiki page. the fruit

# Requiring Niche or Historical Knowledge:
# query = "What is the significance of the Battle of Kadesh in 1274 BC?"  # Didn't find the wiki page
# query = "Can you explain the most recent developments in quantum computing?"  # Didn't find the wiki page

# Requiring Information from Niche Datasets:
# query = "What are the requirements to obtain a Brazilian work visa in 2024?"    # Didn't find the wiki page and basic LLM know the answer
query = "Who are the top 3 investors in Tesla in 2024?"

# Requiring Up-to-date Pop Culture or Media Knowledge:
# query = "What was the main theme of the latest Marvel movie released in 2024?"  # Wrong wiki page.
# query = "Which artist won the Grammy for Best Album in 2024?"  # Worked

# Complex or Technical Questions Requiring External Sources:
# query = "What are the latest breakthroughs in treating Alzheimer's disease, according to 2023 clinical trials?"   # Didn't find the wiki page

# Questions Requiring Rare or Region-Specific Information:
# query = "What are the local customs of the Himba tribe in Namibia?"   # Didn't find the wiki page
# query = "How do you brew traditional Mongolian milk tea (Suutei Tsai)?"  # Basic LLM know the answer

In [15]:
dataset_name = "cnn_dailymail" # or "wikipedia"  # or "natural_questions"
llm_choice = "cohere"  # or "gpt2"
cohere_api_key = "LjyWoNgE5Cc1E5qytRY90Nwc2VlD1tMdKrkf13nF"

In [16]:
# Step 1: Extract keywords from the query
keywords = extract_keywords(query)
print(f"Extracted Keywords: {keywords}")

# Step 2: Fetch the data (Wikipedia or Natural Questions)
for keyword in keywords:
    content_map = index_dataset(dataset_name, keyword)

Extracted Keywords: ['Apple']
Indexed examples from CNN/DailyMail for keyword: Apple


In [17]:
# Step 3: Retrieve the most relevant passages
retrieved_passages = retrieve_passages(query, top_k=3)
print("Retrieved Passages:")
for passage in retrieved_passages:
    print(passage)

# Step 4: Basic LLM response (without retrieved context)
if llm_choice == "gpt2":
    basic_response = generate_response_gpt2(query)
elif llm_choice == "cohere":
    if cohere_api_key is None:
        raise ValueError("Cohere API key is required for Cohere LLM.")
    basic_response = generate_response_cohere(query, cohere_api_key)
else:
    raise ValueError("Invalid LLM choice. Please choose 'gpt2' or 'cohere'.")

print("\nBasic Response (No Retrieval):")
print(basic_response)

No relevant passages found.
Retrieved Passages:

Basic Response (No Retrieval):
Apple Inc.'s financial results for the third quarter of 2023 are not yet publicly available as I have information on events only up to January 2023. The third quarter results will be typically announced a few weeks after the end of the quarter, which is September 30th. You can expect the announcement in early October 2023. 

I can provide you with Apple's financial results for the most recent quarter, Q3 of their 2022 fiscal year (which ended on July 2, 2022) if you are interested.


In [18]:
# Step 5: Augmented LLM response (with retrieved context)
if llm_choice == "gpt2":
    augmented_response = generate_response_gpt2_with_context(query, retrieved_passages)
elif llm_choice == "cohere":
    augmented_response = generate_response_cohere_with_context(query, retrieved_passages, cohere_api_key)

print("\nAugmented Response (With Retrieved Context):")
print(augmented_response)


Augmented Response (With Retrieved Context):
 Apple reported revenue of $29.9 billion in its third fiscal quarter of 2023, which represents a 2% increase compared to the same period in 2022. This result was largely driven by robust performance in the company's product categories, particularly its iPhone and Mac product lines. The company's continued success in these segments highlights its ability to innovate and meet the evolving needs of its global customer base, even in the face of a challenging macroeconomic backdrop. 

It's worth noting that Apple also posted impressive earnings per share (EPS) of $1.19 during this quarter, which exceeded the expected EPS of $1.16. This result demonstrates that Apple continues to effectively manage its operations, optimize its cost structure, and invest in high-growth areas, thereby boosting shareholder value. 

In summary, Apple's Q3 2023 earnings reflect its sustained performance and ongoing success in the technology industry. The company's abi