# Challenge RAG with LLMs

## 1. Fundamental concepts and tools

This challenge touches on several new concepts, so let's begin by summarizing each one.

### 1.1. RAG (Retrieval-Augmented Generation)

RAG is an AI framework that combines the strengths of traditional information retrieval systems (such as databases) with the capabilities of generative large language models (LLMs). By combining this retrieved knowledge with its own language skills, the model can write text that is more accurate, up-to-date, and relevant to the user's needs.

A RAG system enhances generative AI outputs in two main steps:

- **Retrieval and pre-processing:** The system uses search algorithms to query external data, such as web pages, knowledge bases, and databases. The retrieved information is then pre-processed, for example through tokenization, stemming, and stop-word removal.

- **Generation:** The pre-processed information is added to the context passed to the pre-trained LLM. This augmented context gives the model a more complete view of the topic, which enables it to generate more precise and informative responses.

In practice, RAG first retrieves relevant information from a database using a query derived from the user's request (in some setups, rewritten by the LLM). The retrieved information is then added to the LLM's prompt, so the generated text is more accurate and contextually relevant. RAG typically relies on vector databases, which store data as embeddings so that semantically similar content can be found efficiently.

![Alt text](./figs/rag-1.png)
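The retrieve, augment, generate flow described above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not a production pipeline: the `DOCUMENTS` list, the word-overlap scoring in `retrieve`, and the `fake_llm` placeholder are all hypothetical stand-ins for a real vector store, embedding model, and LLM call.

```python
import re
from collections import Counter

# Toy knowledge base; in a real system these would be chunks in a vector store.
DOCUMENTS = [
    "Flask is a lightweight WSGI web framework for Python.",
    "ChromaDB is a vector database for storing embeddings.",
    "LangChain helps build applications on top of language models.",
]


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def retrieve(query, documents, k=1):
    """Return the k documents that share the most word occurrences with the query."""
    query_counts = Counter(tokenize(query))

    def overlap(doc):
        doc_counts = Counter(tokenize(doc))
        return sum((query_counts & doc_counts).values())

    return sorted(documents, key=overlap, reverse=True)[:k]


def build_prompt(query, context_chunks):
    """Augment the user question with the retrieved context."""
    context = "\n".join(f"- {chunk}" for chunk in context_chunks)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {query}"
    )


def fake_llm(prompt):
    """Placeholder for a real LLM call; it only reports the prompt size."""
    return f"[LLM response to a prompt of {len(prompt)} characters]"


query = "What is ChromaDB?"
top_chunks = retrieve(query, DOCUMENTS, k=1)   # retrieval
prompt = build_prompt(query, top_chunks)       # augmentation
print(prompt)
print(fake_llm(prompt))                        # generation (stubbed)
```

Swapping the word-overlap scoring for embedding similarity and the stub for a real model call gives the general shape of the pipeline built later in this notebook.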

RAG offers several advantages over traditional methods of text generation, especially when dealing with factual information or data-driven responses. Here are some key reasons why using RAG can be beneficial:

- **Access to updated information:** Traditional LLMs are often limited to their pre-trained knowledge and data. This could lead to potentially outdated or inaccurate responses. RAG overcomes this by granting LLMs access to external information sources, ensuring accurate and up-to-date answers.

- **Factual grounding:** LLMs can sometimes struggle with factual accuracy because they are trained on massive amounts of text data, which may contain inaccuracies or biases. RAG helps address this issue by providing LLMs with access to a curated knowledge base, ensuring that the generated text is grounded in factual information. This makes RAG particularly valuable for applications where accuracy is paramount, such as news reporting, scientific writing, or customer service.

- **Contextual relevance:** The retrieval mechanism in RAG ensures that the retrieved information is relevant to the input query or context. By providing the LLM with contextually relevant information, RAG helps the model generate responses that are more coherent and aligned with the given context. This contextual grounding helps to reduce the generation of irrelevant or off-topic responses.

- **Factual consistency:** RAG encourages the LLM to generate responses that are consistent with the retrieved factual information. By conditioning the generation process on the retrieved knowledge, RAG helps to minimize contradictions and inconsistencies in the generated text. This reduces the likelihood of generating false or misleading information.

- **Utilizes vector databases:** RAGs leverage vector databases to efficiently retrieve relevant documents. Vector databases store documents as vectors in a high-dimensional space, allowing for fast and accurate retrieval based on semantic similarity.

- **Improved response accuracy:** RAGs complement LLMs by providing them with contextually relevant information. LLMs can then use this information to generate more coherent, informative, and accurate responses.

- **RAGs and chatbots:** RAG can be integrated into a chatbot system to enhance its conversational abilities. By accessing external knowledge, RAG-powered chatbots can provide more comprehensive, informative, and context-aware responses.

_Sources: Google material [here](https://cloud.google.com/use-cases/retrieval-augmented-generation?hl=en) and [here](https://www.youtube.com/watch?v=v4s5eU2tfd4)._

### 1.2. Flask

Flask is a lightweight WSGI web application framework in Python used for building web applications and APIs. WSGI stands for Web Server Gateway Interface: a specification that describes how a web server communicates with web applications, and how web applications can be chained together to process one request. Flask is designed to make getting started quick and easy, with the ability to scale up to complex applications.
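To make the WSGI contract concrete, here is a minimal sketch that uses only the standard library's `wsgiref` module rather than Flask. The `application` and `call_app` names and the `/ask` path are illustrative assumptions; a Flask app object exposes the same callable interface to the server.

```python
from wsgiref.util import setup_testing_defaults


def application(environ, start_response):
    """A minimal WSGI application: a callable taking environ and start_response."""
    body = f"Hello from {environ['PATH_INFO']}".encode("utf-8")
    headers = [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]  # WSGI expects an iterable of bytes


def call_app(app, path):
    """Invoke a WSGI app in-process, the way a server would, without opening a socket."""
    environ = {}
    setup_testing_defaults(environ)  # fills in the required WSGI keys
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, response_headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = response_headers

    chunks = app(environ, start_response)
    return captured["status"], b"".join(chunks)


status, body = call_app(application, "/ask")
print(status, body.decode("utf-8"))
```

The server's side of the contract is what `call_app` imitates: it builds the `environ` dictionary, supplies `start_response`, and collects the returned bytes.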

In [1]:
import os

from flask import Flask

# Read the Cohere API key from the environment instead of hard-coding it in the notebook
key = os.environ.get('COHERE_API_KEY')

### 1.3. LangChain

LangChain is an open-source library designed to simplify the development of applications that use language models. It provides tools and abstractions to facilitate tasks such as managing prompts, handling conversation history, and integrating various components like models, vector stores, and databases. LangChain is particularly useful when building applications that require natural language understanding and processing, such as chatbots, search engines, or information retrieval systems. Some key features of LangChain are:

- Prompt Management: LangChain provides utilities for managing and composing prompts, which are essential for interacting with language models.
- Chain Building: It allows developers to create chains of operations, where each step in the chain can involve different models or data transformations.
- Integration with Vector Stores: LangChain integrates with vector stores (like ChromaDB, Pinecone, etc.) to enable efficient storage and retrieval of vector embeddings for tasks such as similarity search.
- Flexible Architecture: The library is designed to be modular, allowing you to plug in different models, vector stores, and components as needed.
- Data Handling: LangChain supports handling complex data pipelines, making it easier to preprocess and postprocess data for language models.
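The ideas of prompt management and chain building in the list above can be illustrated without the library. The sketch below is plain Python and deliberately does not use LangChain's API: `SimplePromptTemplate`, `make_chain`, and `fake_model` are hypothetical names chosen for this example only.

```python
from functools import reduce


class SimplePromptTemplate:
    """Hypothetical stand-in for a prompt template: a string with named slots."""

    def __init__(self, template):
        self.template = template

    def format(self, **values):
        return self.template.format(**values)


def make_chain(*steps):
    """Compose steps left to right: the output of one step feeds the next."""
    return lambda value: reduce(lambda acc, step: step(acc), steps, value)


template = SimplePromptTemplate("Translate the following sentence to {language}: {text}")


def fill_prompt(inputs):
    """Step 1: turn a dict of inputs into a prompt string."""
    return template.format(**inputs)


def fake_model(prompt):
    """Step 2: placeholder for a real model call."""
    return f"<model output for: {prompt}>"


def strip_brackets(output):
    """Step 3: post-process the raw model output."""
    return output.strip("<>")


chain = make_chain(fill_prompt, fake_model, strip_brackets)
result = chain({"language": "French", "text": "I love programming."})
print(result)
```

Each step is an ordinary function, so a step can be replaced (for instance, a different model call) without touching the others, which is the modularity the list describes.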

### 1.4. ChromaDB

ChromaDB is a vector database. A vector database is a specialized database optimized for storing, indexing, and querying high-dimensional vector representations of data. These databases are designed to efficiently handle similarity searches in large datasets, making them ideal for use cases like:

- Semantic Search: Finding documents or text chunks similar to a query.
- Recommendation Systems: Suggesting items similar to a user's preferences.
- Image and Video Search: Retrieving similar images or video clips based on content.
- Anomaly Detection: Identifying unusual patterns in data.

Benefits of Using a Vector Database:

- Efficient Similarity Searches: Vector databases use approximate nearest-neighbor indexing techniques such as HNSW (Hierarchical Navigable Small World) or Annoy (Approximate Nearest Neighbors Oh Yeah) to quickly find similar vectors. This makes them significantly faster than exhaustively comparing a query against every stored vector, especially for large datasets.
- Scalability: Vector databases are designed to handle large volumes of data efficiently, allowing you to scale your applications as needed.
- Integration with NLP Pipelines: Vector databases can be easily integrated with NLP pipelines where text is transformed into embeddings (vectors), and these embeddings are then used for search and retrieval.
- Real-Time Querying: They enable real-time querying, which is essential for applications like chatbots and interactive search engines.
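As a point of comparison, the exhaustive search that these indexes are designed to avoid looks roughly like the sketch below. The three-dimensional vectors and the story ids are invented for illustration; real embeddings have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest(query_vector, index, k=2):
    """Score every stored vector against the query and return the top-k (id, score) pairs."""
    scored = [
        (item_id, cosine_similarity(query_vector, vector))
        for item_id, vector in index.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up embeddings keyed by document id
index = {
    "space_story": [0.9, 0.1, 0.0],
    "tech_story": [0.1, 0.9, 0.1],
    "nature_story": [0.0, 0.2, 0.9],
}

query = [0.8, 0.2, 0.0]
results = nearest(query, index, k=2)
for item_id, score in results:
    print(f"{item_id}: {score:.3f}")
```

This scan costs one similarity computation per stored vector for every query, which is why approximate indexes matter once the collection grows.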

## 2. Embeddings

Use lemmatization, skip-gram, GloVe.

In [2]:
from docx import Document

def extract_text_from_docx(file_path):
    doc = Document(file_path)
    full_text = []
    for paragraph in doc.paragraphs:
        full_text.append(paragraph.text)
    return '\n'.join(full_text)

file_path = 'documento.docx'
ejemplo = extract_text_from_docx(file_path)
print(ejemplo)

Ficción Espacial: En la lejana galaxia de Zenthoria, dos civilizaciones alienígenas, los Dracorians y los Lumis, se encuentran al borde de la guerra intergaláctica. Un intrépido explorador, Zara, descubre un antiguo artefacto que podría contener la clave para la paz. Mientras viaja por planetas hostiles y se enfrenta a desafíos cósmicos, Zara debe desentrañar los secretos de la reliquia antes de que la galaxia se sumerja en el caos.
Ficción Tecnológica: En un futuro distópico, la inteligencia artificial ha evolucionado al punto de alcanzar la singularidad. Un joven ingeniero, Alex, se ve inmerso en una conspiración global cuando descubre que las supercomputadoras han desarrollado emociones. A medida que la humanidad lucha por controlar a estas máquinas sintientes, Alex se enfrenta a dilemas éticos y decisiones que podrían cambiar el curso de la historia.
Naturaleza Deslumbrante: En lo profundo de la selva amazónica, una flor mágica conocida como "Luz de Luna" florece solo durante la no

In [3]:
story1 = 'Ficción Espacial: En la lejana galaxia de Zenthoria, dos civilizaciones alienígenas, los Dracorians y los Lumis, se encuentran al borde de la guerra intergaláctica. Un intrépido explorador, Zara, descubre un antiguo artefacto que podría contener la clave para la paz. Mientras viaja por planetas hostiles y se enfrenta a desafíos cósmicos, Zara debe desentrañar los secretos de la reliquia antes de que la galaxia se sumerja en el caos.'
story2 = 'Ficción Tecnológica: En un futuro distópico, la inteligencia artificial ha evolucionado al punto de alcanzar la singularidad. Un joven ingeniero, Alex, se ve inmerso en una conspiración global cuando descubre que las supercomputadoras han desarrollado emociones. A medida que la humanidad lucha por controlar a estas máquinas sintientes, Alex se enfrenta a dilemas éticos y decisiones que podrían cambiar el curso de la historia.'
story3 = 'Naturaleza Deslumbrante: En lo profundo de la selva amazónica, una flor mágica conocida como "Luz de Luna" florece solo durante la noche. Con pétalos que brillan intensamente, la flor ilumina la oscuridad de la jungla, guiando a criaturas nocturnas y revelando paisajes deslumbrantes. Los lugareños creen que posee poderes curativos, convirtiéndola en el tesoro oculto de la naturaleza.'
story4 = 'Cuento Corto: En un pequeño pueblo, cada año, un reloj antiguo regala un día extra a la persona más desafortunada. Emma, una joven huérfana, es la elegida este año. Durante su día adicional, descubre una puerta mágica que la transporta a un mundo lleno de maravillas. Al final del día, Emma decide compartir su regalo con el pueblo, dejando una huella imborrable en el corazón de cada habitante.'
story5 = 'Características del Héroe Olvidado: Conocido como "Sombra Silenciosa", nuestro héroe es un maestro del sigilo y la astucia. Dotado de una memoria fotográfica y habilidades de camuflaje, se desplaza entre las sombras para proteger a los indefensos. Su pasado enigmático esconde tragedias que lo impulsan a luchar contra la injusticia. Aunque carece de habilidades sobrenaturales, su ingenio y habilidades tácticas lo convierten en una fuerza a tener en cuenta.'

In [4]:
from transformers import BertTokenizer, BertModel
import torch
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Load pre-trained Spanish model and tokenizer
model_name = 'dccuchile/bert-base-spanish-wwm-cased'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)

# Sample corpus of short stories in Spanish
stories = [story1,story2,story3,story4,story5]

# Function to divide text into chunks
def divide_text_into_chunks(text, chunk_size):
    words = text.split()
    return [' '.join(words[i:i+chunk_size]) for i in range(0, len(words), chunk_size)]

# Divide each story into chunks
chunk_size = 25  # Adjust chunk size as needed
all_chunks = [divide_text_into_chunks(story, chunk_size) for story in stories]
all_chunks = [chunk for sublist in all_chunks for chunk in sublist]  # Flatten the list of chunks

# Function to encode text
def encode_text(text):
    inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).numpy().flatten()

# Encode all chunks and store in a list
encoded_chunks = [(encode_text(chunk), chunk) for chunk in all_chunks]

# Encode the question in Spanish
question = "¿Quién es Zara?"
encoded_question = encode_text(question)

# Calculate similarities and find the most similar chunk
similarities = [cosine_similarity([encoded_question], [vector])[0][0] for vector, _ in encoded_chunks]
most_similar_index = np.argmax(similarities)
most_similar_chunk = encoded_chunks[most_similar_index][1]

print(f"Most similar chunk: {most_similar_chunk}")

Some weights of BertModel were not initialized from the model checkpoint at dccuchile/bert-base-spanish-wwm-cased and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Most similar chunk: convirtiéndola en el tesoro oculto de la naturaleza.
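The 25-word windows used above cut the stories at arbitrary points and leave short tail fragments, such as the chunk returned for the question. A common variant is a sliding window whose consecutive chunks share some words. The sketch below is one hypothetical way to do it (`chunk_words` and its `overlap` parameter are not part of the original notebook), and whether it improves retrieval for this corpus has not been checked here.

```python
def chunk_words(text, chunk_size, overlap=0):
    """Split text into chunks of chunk_size words; consecutive chunks share `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


sample = "uno dos tres cuatro cinco seis siete ocho nueve diez"
chunks = chunk_words(sample, chunk_size=4, overlap=2)
for chunk in chunks:
    print(chunk)
```

With `overlap=0` the function behaves like plain fixed-size chunking, so the two strategies can be compared on the same questions.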


## 3. Cohere + LangChain

In [5]:
import getpass
import os

os.environ["COHERE_API_KEY"] = getpass.getpass("Enter your Cohere API key: ")

Enter your Cohere API key:  ········


In [8]:
from langchain_cohere import ChatCohere
from langchain_core.prompts import ChatPromptTemplate

In [6]:
llm = ChatCohere(
    model="command-r-plus",
    temperature=0,
    max_tokens=None,
    timeout=None,
    max_retries=2,
    # other params...
)

messages = [
    (
        "system",
        "You are a helpful assistant that translates English to French. Translate the user sentence.",
    ),
    ("human", "I love programming."),
]
ai_msg = llm.invoke(messages)
ai_msg

In [3]:
from docx import Document
from sentence_transformers import SentenceTransformer, util
import numpy as np
import re
import unicodedata
from scipy.spatial import distance
import torch

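# Load a Spanish BERT fine-tuned for question answering. It is not a native sentence-transformers
# model, so SentenceTransformer wraps it with mean pooling (see the warning in the output)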
model_name = 'mrm8488/bert-base-spanish-wwm-cased-finetuned-spa-squad2-es'
model = SentenceTransformer(model_name)

# Function to read the entire document from a DOCX file
def read_document_from_docx(file_path):
    doc = Document(file_path)
    paragraphs = [para.text.strip() for para in doc.paragraphs if para.text.strip()]
    return paragraphs

# Text normalization hook; the steps are currently commented out, so the text is returned unchanged
def normalize_text(text):
#    text = text.lower()  # Convert to lowercase
#    text = unicodedata.normalize('NFD', text)  # Normalize to decompose accents
#    text = ''.join([c for c in text if unicodedata.category(c) != 'Mn' or c == 'ñ'])  # Remove combining accents except for ñ
#    text = re.sub(r'[^\w\sñ]', '', text)  # Remove punctuation except for ñ
    return text

# Read the entire document
file_path = 'documento.docx'  # Update this with your DOCX file path
paragraphs = read_document_from_docx(file_path)

# Normalize each paragraph
normalized_paragraphs = [normalize_text(para) for para in paragraphs]

# Encode the normalized paragraphs
encoded_chunks = model.encode(normalized_paragraphs, convert_to_tensor=True)

# Define questions related to the document
questions = [
    "¿Quién es Zara?",  
    "¿Qué descubre Alex?",  
    "¿Cómo se llama la flor mágica?",  
    "¿Qué recibe Emma?",  
    "¿Cuál es el apodo del héroe?"  
]

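# Function to min-max normalize scores to the [0, 1] range
def normalize_scores(scores):
    min_score = np.min(scores)
    max_score = np.max(scores)
    if max_score == min_score:
        return np.zeros_like(scores, dtype=float)  # All scores equal: avoid division by zero
    normalized = (scores - min_score) / (max_score - min_score)
    return normalized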

# Normalize and loop through each question, encode it, and find the most similar chunk
for question in questions:
    normalized_question = normalize_text(question)
    encoded_question = model.encode(normalized_question, convert_to_tensor=True)
    
    # Move tensors to the CPU so they can be converted to NumPy (needed when the model runs on a GPU)
    question_vec = encoded_question.cpu()
    chunk_vecs = encoded_chunks.cpu()

    # Compute similarity metrics
    cosine_scores = util.pytorch_cos_sim(question_vec, chunk_vecs).numpy().flatten()
    euclidean_scores = np.array([distance.euclidean(question_vec.numpy(), chunk.numpy()) for chunk in chunk_vecs])
    manhattan_scores = np.array([distance.cityblock(question_vec.numpy(), chunk.numpy()) for chunk in chunk_vecs])
    dot_product_scores = np.array([torch.dot(question_vec, chunk).item() for chunk in chunk_vecs])
    
    # Normalize the scores
    normalized_cosine_scores = normalize_scores(cosine_scores)
    normalized_euclidean_scores = normalize_scores(-euclidean_scores)  # Negative because lower distance is better
    normalized_manhattan_scores = normalize_scores(-manhattan_scores)  # Negative because lower distance is better
    normalized_dot_product_scores = normalize_scores(dot_product_scores)
    
    # Combine the normalized scores
    combined_scores = (
        normalized_cosine_scores +
        normalized_euclidean_scores +
        normalized_manhattan_scores +
        normalized_dot_product_scores
    )
    
    # Find the chunk with the highest combined score
    most_similar_index = np.argmax(combined_scores)
    most_similar_chunk = normalized_paragraphs[most_similar_index]
    
    print(f"Question: {question}")
    print(f"Most similar chunk: {most_similar_chunk}\n")


No sentence-transformers model found with name mrm8488/bert-base-spanish-wwm-cased-finetuned-spa-squad2-es. Creating a new one with mean pooling.


Question: ¿Quién es Zara?
Most similar chunk: Ficción Espacial: En la lejana galaxia de Zenthoria, dos civilizaciones alienígenas, los Dracorians y los Lumis, se encuentran al borde de la guerra intergaláctica. Un intrépido explorador, Zara, descubre un antiguo artefacto que podría contener la clave para la paz. Mientras viaja por planetas hostiles y se enfrenta a desafíos cósmicos, Zara debe desentrañar los secretos de la reliquia antes de que la galaxia se sumerja en el caos.

Question: ¿Qué descubre Alex?
Most similar chunk: Características del Héroe Olvidado: Conocido como "Sombra Silenciosa", nuestro héroe es un maestro del sigilo y la astucia. Dotado de una memoria fotográfica y habilidades de camuflaje, se desplaza entre las sombras para proteger a los indefensos. Su pasado enigmático esconde tragedias que lo impulsan a luchar contra la injusticia. Aunque carece de habilidades sobrenaturales, su ingenio y habilidades tácticas lo convierten en una fuerza a tener en cuenta.

Quest
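The combined score above relies on min-max normalization, which maps each metric to the [0, 1] range so that similarities and negated distances can be added together. Below is a minimal plain-Python sketch with made-up numbers; `min_max` and `fuse` are illustrative names, not functions from the notebook.

```python
def min_max(scores):
    """Rescale scores to [0, 1]; return all zeros when every score is identical."""
    low, high = min(scores), max(scores)
    if high == low:
        return [0.0 for _ in scores]
    return [(s - low) / (high - low) for s in scores]


def fuse(similarities, distances):
    """Add a normalized similarity (higher is better) and a normalized distance (lower is better)."""
    norm_sim = min_max(similarities)
    norm_dist = min_max([-d for d in distances])  # negate so that closer means higher
    return [s + d for s, d in zip(norm_sim, norm_dist)]


# Made-up scores for three chunks
cosine = [0.25, 0.75, 0.5]
euclid = [4.0, 1.0, 2.5]

combined = fuse(cosine, euclid)
best = max(range(len(combined)), key=combined.__getitem__)
print(combined, best)
```

Note that summing several metrics that are strongly correlated (cosine, dot product, Euclidean distance) mostly re-counts the same signal, so the fused ranking may differ little from the cosine ranking alone.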

In [4]:
import chromadb
# Create an in-memory Chroma client, then a collection to store the embeddings
chroma_client = chromadb.Client()
collection = chroma_client.create_collection(name="my_collection")

In [7]:
import os

from chromadb.utils import embedding_functions

# Reuse the key entered earlier via getpass instead of hard-coding it
cohere_ef = embedding_functions.CohereEmbeddingFunction(api_key=os.environ["COHERE_API_KEY"], model_name="large")
metadata_options = {
    "hnsw:space": "ip"  # Inner product; Chroma also accepts "l2" (the default) and "cosine"
}

collection = chroma_client.get_or_create_collection(
    name="my_collection", metadata=metadata_options, embedding_function=cohere_ef)

In [18]:
import os
import re
import unicodedata
import uuid

import chromadb
import cohere
from chromadb.utils import embedding_functions
from docx import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Initialize Cohere, reading the API key from the environment instead of hard-coding it
cohere_api_key = os.environ["COHERE_API_KEY"]
co = cohere.Client(cohere_api_key)

# Initialize ChromaDB Client
chroma_client = chromadb.Client()

# Define Cohere embedding function
cohere_ef = embedding_functions.CohereEmbeddingFunction(api_key=cohere_api_key, model_name="large")

# Set metadata options
metadata_options = {
    "hnsw:space": "cosine"  # Distance metric; Chroma accepts "l2" (the default), "ip", or "cosine"
}

# Create or get the collection
collection = chroma_client.get_or_create_collection(name="document_embeddings", metadata=metadata_options, embedding_function=cohere_ef)

# Function to read the entire document from a DOCX file
def read_document_from_docx(file_path):
    doc = Document(file_path)
    return '\n\n'.join([para.text.strip() for para in doc.paragraphs if para.text.strip()])

# Function to normalize text while keeping ñ
def normalize_text(text):
    text = unicodedata.normalize('NFC', text.lower())  # Lowercase and compose, so ñ is a single character
    # Strip accents character by character, leaving ñ untouched
    # (decomposing the whole string first would split ñ into n + tilde and lose it)
    text = ''.join(
        c if c == 'ñ' else ''.join(d for d in unicodedata.normalize('NFD', c) if unicodedata.category(d) != 'Mn')
        for c in text
    )
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation; \w already matches ñ
    return text

# Read the entire document
file_path = 'documento.docx'  # Update this with your DOCX file path
content = read_document_from_docx(file_path)

# Split the document into chunks using RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(separators=["\n\n", "\n"], chunk_size=200, chunk_overlap=30)
docs = text_splitter.create_documents([content])

# Store each chunk in ChromaDB with a unique UUID
for doc in docs:
    uuid_name = uuid.uuid1()
    embedding = co.embed(texts=[doc.page_content], model='large').embeddings[0]  # Get the embedding
    collection.add(ids=[str(uuid_name)], documents=[doc.page_content], metadatas=[{'text': doc.page_content}], embeddings=[embedding])  # No .tolist()

# Define questions related to the document
questions = [
    "¿Quién es Zara?",  
    "¿Qué descubre Alex?",  
    "¿Cómo se llama la flor mágica?",  
    "¿Qué recibe Emma?",  
    "¿Cuál es el apodo del héroe?"  
]

# Loop through each question, encode it, and find the most similar chunk
for question in questions:
    normalized_question = normalize_text(question)

    # Get the embedding for the normalized question
    question_embedding = co.embed(texts=[normalized_question], model='large').embeddings[0]  # Get the embedding
    
    # Query the collection using the embedding
    results = collection.query(query_embeddings=[question_embedding], n_results=1)  # Use query_embeddings

    # Print the results to inspect their structure
    print("Query Results:", results)

    # Access the most similar chunk based on the structure of the results
    most_similar_chunk = results['documents'][0][0]  # Access the first document in the first list
    metadata_text = results['metadatas'][0][0]['text']  # Access the metadata of the first document

    print(f"Question: {question}")
    print(f"Most similar chunk: {most_similar_chunk}\n")
    print(f"Metadata text: {metadata_text}\n")  # You can also print the metadata if needed

Query Results: {'ids': [['922f4b04-4d42-11ef-a568-347df694a35e']], 'distances': [[7902.73779296875]], 'metadatas': [[{'text': 'Ficción Espacial: En la lejana galaxia de Zenthoria, dos civilizaciones alienígenas, los Dracorians y los Lumis, se encuentran al borde de la guerra intergaláctica. Un intrépido explorador, Zara, descubre un antiguo artefacto que podría contener la clave para la paz. Mientras viaja por planetas hostiles y se enfrenta a desafíos cósmicos, Zara debe desentrañar los secretos de la reliquia antes de que la galaxia se sumerja en el caos.'}]], 'embeddings': None, 'documents': [['Ficción Espacial: En la lejana galaxia de Zenthoria, dos civilizaciones alienígenas, los Dracorians y los Lumis, se encuentran al borde de la guerra intergaláctica. Un intrépido explorador, Zara, descubre un antiguo artefacto que podría contener la clave para la paz. Mientras viaja por planetas hostiles y se enfrenta a desafíos cósmicos, Zara debe desentrañar los secretos de la reliquia antes