
In [4]:
!pip install faiss-cpu

Collecting faiss-cpu
  Downloading faiss_cpu-1.10.0-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (4.4 kB)
Downloading faiss_cpu-1.10.0-cp311-cp311-manylinux_2_28_x86_64.whl (30.7 MB)
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.10.0


In [5]:
# ------------------------------
# RAG: Retrieval-Augmented Generation
# ------------------------------

# Step 1: Import required libraries
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np
from transformers import pipeline

# Step 2: Define a sample set of documents
documents = [
    "Google was founded in 1998 and has since become a tech giant.",
    "Machine learning is a field of artificial intelligence that focuses on building systems that learn from data.",
    "Retrieval-Augmented Generation (RAG) systems combine retrieval of relevant documents with generation by language models.",
    "Google's research in AI has led to advances in natural language processing and deep learning.",
    "Large language models like GPT-3 and GPT-4 are powerful tools for many applications."
]

# Step 3: Load a sentence transformer model to create document embeddings
embedder = SentenceTransformer('paraphrase-MiniLM-L6-v2')
doc_embeddings = embedder.encode(documents, convert_to_numpy=True)

# Step 4: Create a FAISS index and add the document embeddings
embedding_dim = doc_embeddings.shape[1]
index = faiss.IndexFlatL2(embedding_dim)
index.add(doc_embeddings)

# Step 5: Define a user query and embed it
query = "Tell me about Google and its AI research."
query_embedding = embedder.encode([query], convert_to_numpy=True)

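# Step 6: Retrieve the top-k relevant documents
# D holds the squared L2 distances (smaller is closer); I holds the matching row indices into `documents`.
k = 3
D, I = index.search(query_embedding, k)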
retrieved_docs = [documents[i] for i in I[0]]
print("Retrieved Documents:")
for doc in retrieved_docs:
    print("- ", doc)

# Step 7: Construct the prompt by combining the query and retrieved documents
prompt = query + "\n\nRelevant Information:\n" + "\n".join(retrieved_docs)
print("\nConstructed Prompt:\n", prompt)

# Step 8: Generate an answer using a text-generation model (here we use GPT-2 as a stand-in)
# max_length counts the prompt tokens plus the generated tokens, so it is set once, at call time.
generator = pipeline('text-generation', model='gpt2')
generated_output = generator(prompt, max_length=150, num_return_sequences=1)
print("\nGenerated Answer:\n", generated_output[0]['generated_text'])


Retrieved Documents:
-  Google's research in AI has led to advances in natural language processing and deep learning.
-  Machine learning is a field of artificial intelligence that focuses on building systems that learn from data.
-  Google was founded in 1998 and has since become a tech giant.

Constructed Prompt:
 Tell me about Google and its AI research.

Relevant Information:
Google's research in AI has led to advances in natural language processing and deep learning.
Machine learning is a field of artificial intelligence that focuses on building systems that learn from data.
Google was founded in 1998 and has since become a tech giant.





Generated Answer:
 Tell me about Google and its AI research.

Relevant Information:
Google's research in AI has led to advances in natural language processing and deep learning.
Machine learning is a field of artificial intelligence that focuses on building systems that learn from data.
Google was founded in 1998 and has since become a tech giant.

Machine learning, or machine learning, is the process of learning from large-scale data sets, rather than from humans, to perform an activity.

For example, computer scientists like to "learn from data" to perform activities that could be done on computer and the ability to identify new ways of doing it or create software that can perform that activity.

In many ways, it's the beginning of
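
To make Step 6 above more concrete, here is a minimal NumPy sketch of brute-force L2 retrieval, the exact-search strategy behind `faiss.IndexFlatL2`. The helper name `l2_top_k` and the toy 2-D vectors are illustrative assumptions and not part of the notebook; real embeddings from `paraphrase-MiniLM-L6-v2` have 384 dimensions.

```python
import numpy as np

def l2_top_k(doc_vectors, query_vector, k):
    """Return (squared distances, indices) of the k nearest rows, closest first."""
    diffs = doc_vectors - query_vector        # broadcast the query across all rows
    sq_dists = np.sum(diffs ** 2, axis=1)     # squared Euclidean distance per row
    order = np.argsort(sq_dists)[:k]          # smallest distance first
    return sq_dists[order], order

# Toy 2-D vectors stand in for real sentence embeddings (hypothetical data).
toy_docs = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [2.0, 2.0]], dtype="float32")
toy_query = np.array([0.9, 0.1], dtype="float32")

toy_dists, toy_ids = l2_top_k(toy_docs, toy_query, k=2)
print(toy_ids.tolist())
```

Like the `D` array returned by `index.search`, the distances here are squared, so a smaller value means a closer match.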

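The generated answer above repeats the whole prompt before the new text, because the `text-generation` pipeline returns the prompt plus the continuation by default. Passing `return_full_text=False` to the generator call is one way to get only the continuation. The pure-Python helper below is a hedged alternative sketch; `strip_prompt` and the example strings are hypothetical names and data, not part of the notebook.

```python
def strip_prompt(generated_text, prompt):
    """Return only the continuation when the output starts with the prompt."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].lstrip()
    return generated_text  # prompt was not echoed, so leave the text unchanged

# Hypothetical strings that mimic the shape of the RAG prompt and model output.
example_prompt = "Tell me about Google.\n\nRelevant Information:\nGoogle was founded in 1998."
example_output = example_prompt + "\n\nGoogle is a technology company."

answer_only = strip_prompt(example_output, example_prompt)
print(answer_only)
```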

In [9]:
# ------------------------------
# CAG: Context-Augmented Generation
# ------------------------------

# Step 1: Import the text-generation pipeline
from transformers import pipeline

# Step 2: Define a large context.
# Here, we simulate a large context by repeating one 10-word sentence 50 times (500 words in total).
large_context = ("Google has been a leader in AI research for decades. " * 50)  # simulate long context

# Define a query that requires this large context.
query = "Explain how Google's AI research has impacted the tech industry."

# Step 3: Simulate chunking of the large context.
# For this demo, we split the text into chunks of 50 words.
words = large_context.split()
chunk_size = 50
chunks = [" ".join(words[i:i+chunk_size]) for i in range(0, len(words), chunk_size)]
print("Total chunks available:", len(chunks))

# Step 4: Select chunks that fit within an assumed context budget.
# Word count is used as a rough proxy for tokens; the 300-word budget is a demo value, well under GPT-2's 1,024-token context window.
max_words = 300
selected_chunks = []
current_word_count = 0
for chunk in chunks:
    chunk_word_count = len(chunk.split())
    if current_word_count + chunk_word_count <= max_words:
        selected_chunks.append(chunk)
        current_word_count += chunk_word_count
    else:
        break

# Step 5: Construct the prompt by injecting the selected context chunks
context_str = "\n".join(selected_chunks)
prompt = query + "\n\nContext:\n" + context_str
print("\nConstructed Prompt:\n", prompt)

# Step 6: Generate an answer using the text-generation model
# Only max_new_tokens is set: it caps the generated tokens and does not conflict with a max_length value.
generator = pipeline('text-generation', model='gpt2')
generated_output = generator(prompt, max_new_tokens=50, num_return_sequences=1)
print("\nGenerated Answer:\n", generated_output[0]['generated_text'])

Total chunks available: 10

Constructed Prompt:
 Explain how Google's AI research has impacted the tech industry.

Context:
Google has been a leader in AI research for decades. Google has been a leader in AI research for decades. Google has been a leader in AI research for decades. Google has been a leader in AI research for decades. Google has been a leader in AI research for decades.
Google has been a leader in AI research for decades. Google has been a leader in AI research for decades. Google has been a leader in AI research for decades. Google has been a leader in AI research for decades. Google has been a leader in AI research for decades.
Google has been a leader in AI research for decades. Google has been a leader in AI research for decades. Google has been a leader in AI research for decades. Google has been a leader in AI research for decades. Google has been a leader in AI research for decades.
Google has been a leader in AI research for decades. Google has been a leader in 




Generated Answer:
 Explain how Google's AI research has impacted the tech industry.

Context:
Google has been a leader in AI research for decades. Google has been a leader in AI research for decades. Google has been a leader in AI research for decades. Google has been a leader in AI research for decades. Google has been a leader in AI research for decades.
Google has been a leader in AI research for decades. Google has been a leader in AI research for decades. Google has been a leader in AI research for decades. Google has been a leader in AI research for decades. Google has been a leader in AI research for decades.
Google has been a leader in AI research for decades. Google has been a leader in AI research for decades. Google has been a leader in AI research for decades. Google has been a leader in AI research for decades. Google has been a leader in AI research for decades.
Google has been a leader in AI research for decades. Google has been a leader in AI research for decades. Goog
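
The chunk selection in Step 4 can be sketched as a small reusable function. This is an illustrative rewrite under assumptions, not the notebook's own code: `select_chunks` and its `count` parameter are hypothetical names, and the default counter measures words, as the demo does. A token-based counter, for example one built on a Hugging Face tokenizer's `encode` method, could be passed in its place.

```python
def select_chunks(chunks, budget, count=lambda text: len(text.split())):
    """Greedily keep leading chunks until the next one would exceed the budget."""
    selected, used = [], 0
    for chunk in chunks:
        cost = count(chunk)
        if used + cost > budget:
            break                      # stop at the first chunk that does not fit
        selected.append(chunk)
        used += cost
    return selected, used

# Rebuild the demo context: one 10-word sentence repeated 50 times.
demo_words = ("Google has been a leader in AI research for decades. " * 50).split()
demo_chunks = [" ".join(demo_words[i:i + 50]) for i in range(0, len(demo_words), 50)]

picked, words_used = select_chunks(demo_chunks, budget=300)
print(len(demo_chunks), len(picked), words_used)
```

With a 300-word budget and 50-word chunks, the first six chunks are kept and the remaining four are dropped.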