# RAG with LLMs challenge

In this notebook, I summarize my attempt to solve a challenge problem regarding the development of a RAG framework. I will be showing some of the tests I have performed, along with some information about the tools used. The final version of the program, named app.py, can be found in this folder.

**Warning:** Many of the packages used here are not included in the Dockerfile.

**Warning:** I encountered some issues running Flask in Jupyter Notebooks, so I recommend running the final version of the program elsewhere.

## 1. Fundamental concepts and worktools

Given the variety of new concepts to be discussed (I will assume that the reader doesn't have experience with many of the following tools, just like me :P), let's begin by summarizing some of them.

### 1.1. RAG (Retrieval-Augmented Generation)

RAG is an AI framework that combines the strengths of traditional information retrieval systems (such as databases) with the capabilities of generative large language models (LLMs). By combining retrieved external knowledge with its own language skills, the model can write text that is more accurate, up-to-date, and relevant to your specific needs. RAG enhances generative AI outputs in two main steps:

- **Retrieval and Pre-processing:** RAGs leverage powerful search algorithms to query external data, such as web pages, knowledge bases, and databases. Once retrieved, the relevant information undergoes pre-processing, including tokenization, stemming, and removal of stop words.
    
- **Generation:** The pre-processed information is then added to the prompt given to the pre-trained LLM. This augmented context gives the LLM a more comprehensive understanding of the topic and enables it to generate more precise, informative, and engaging responses.
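
The two steps above can be sketched with a toy example. This is only an illustration: retrieval here is plain word overlap instead of a real search engine or embedding model, there is no stemming, and no LLM is called. The names `preprocess`, `retrieve`, `augment`, and the tiny `STOP_WORDS` set are made up for this sketch.

```python
import re

# Tiny illustrative stop-word list (an assumption, not a real linguistic resource)
STOP_WORDS = {"the", "a", "an", "is", "of", "in", "who", "what"}

def preprocess(text):
    """Lowercase, tokenize on word characters, and drop stop words."""
    tokens = re.findall(r"\w+", text.lower())
    return {token for token in tokens if token not in STOP_WORDS}

def retrieve(question, documents):
    """Retrieval step: pick the document sharing the most words with the question."""
    question_tokens = preprocess(question)
    return max(documents, key=lambda doc: len(question_tokens & preprocess(doc)))

def augment(question, context):
    """Generation step, first half: put the retrieved context into the prompt."""
    return f"Context: {context}\nQuestion: {question}\nAnswer:"

documents = [
    "Zara is an explorer in the Zenthoria galaxy.",
    "Alex is a young engineer.",
]
question = "Who is Zara?"
prompt = augment(question, retrieve(question, documents))
print(prompt)  # This prompt is what would be sent to the LLM
```

In a real system the word-overlap scoring would be replaced by a semantic search, and the prompt would be passed to a language model.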

RAG offers several advantages over traditional methods of text generation, especially when dealing with factual information or data-driven responses. Here are some key reasons why using RAG can be beneficial:

- **Access to updated information:** Traditional LLMs are often limited to their pre-trained knowledge and data. This could lead to potentially outdated or inaccurate responses. RAG overcomes this by granting LLMs access to external information sources, ensuring accurate and up-to-date answers.

- **Factual grounding:** LLMs can sometimes struggle with factual accuracy because they are trained on massive amounts of text data, which may contain inaccuracies or biases. RAG helps address this issue by providing LLMs with access to a curated knowledge base, ensuring that the generated text is grounded in factual information. This makes RAG particularly valuable for applications where accuracy is paramount, such as news reporting, scientific writing, or customer service.

- **Contextual relevance:** The retrieval mechanism in RAG ensures that the retrieved information is relevant to the input query or context. By providing the LLM with contextually relevant information, RAG helps the model generate responses that are more coherent and aligned with the given context. This contextual grounding helps to reduce the generation of irrelevant or off-topic responses.

- **Factual consistency:** RAG encourages the LLM to generate responses that are consistent with the retrieved factual information. By conditioning the generation process on the retrieved knowledge, RAG helps to minimize contradictions and inconsistencies in the generated text. This reduces the likelihood of generating false or misleading information.

- **Utilizes vector databases:** RAGs leverage vector databases to efficiently retrieve relevant documents. Vector databases store documents as vectors in a high-dimensional space, allowing for fast and accurate retrieval based on semantic similarity.

- **Improved response accuracy:** RAGs complement LLMs by providing them with contextually relevant information. LLMs can then use this information to generate more coherent, informative, and accurate responses.

- **RAGs and chatbots:** RAGs can be integrated into a chatbot system to enhance its conversational abilities. By accessing external knowledge, RAG-powered chatbots can provide more comprehensive, informative, and context-aware responses.

_Sources: Some Google stuff [here](https://cloud.google.com/use-cases/retrieval-augmented-generation?hl=en) and [here](https://www.youtube.com/watch?v=v4s5eU2tfd4)._

### 1.2. Flask

Flask is a lightweight WSGI web application framework in Python used for building web applications and APIs. WSGI stands for Web Server Gateway Interface: a specification that describes how a web server communicates with web applications, and how web applications can be chained together to process one request. Flask is designed to make getting started quick and easy, with the ability to scale up to complex applications.
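
To make the WSGI idea concrete, here is a minimal sketch using only the standard library (no Flask). A WSGI application is just a callable that receives the request environment and a `start_response` callback. The names `simple_app` and `call_app` are made up for this example; `call_app` plays the role of the web server.

```python
from wsgiref.util import setup_testing_defaults

def simple_app(environ, start_response):
    """A bare WSGI application: builds a response from the requested path."""
    body = f"Hello from {environ['PATH_INFO']}".encode("utf-8")
    headers = [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]

def call_app(app, path):
    """Stand-in for a web server: builds an environ, calls the app, collects the reply."""
    environ = {}
    setup_testing_defaults(environ)  # Fills in the keys a real server would provide
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = headers

    body = b"".join(app(environ, start_response))
    return captured["status"], body.decode("utf-8")

print(call_app(simple_app, "/ask"))
```

Frameworks like Flask hide this plumbing behind routes and request objects, but the application they expose to the server follows this same interface.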

### 1.3. LangChain

LangChain is an open-source library designed to simplify the development of applications that use language models. It provides tools and abstractions to facilitate tasks such as managing prompts, handling conversation history, and integrating various components like models, vector stores, and databases. LangChain is particularly useful when building applications that require natural language understanding and processing, such as chatbots, search engines, or information retrieval systems. Some key features of LangChain are:

- Prompt Management: LangChain provides utilities for managing and composing prompts, which are essential for interacting with language models.
- Chain Building: It allows developers to create chains of operations, where each step in the chain can involve different models or data transformations.
- Integration with Vector Stores: LangChain integrates with vector stores (like ChromaDB, Pinecone, etc.) to enable efficient storage and retrieval of vector embeddings for tasks such as similarity search.
- Flexible Architecture: The library is designed to be modular, allowing you to plug in different models, vector stores, and components as needed.
- Data Handling: LangChain supports handling complex data pipelines, making it easier to preprocess and postprocess data for language models.
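
The "chain" idea from the list above can be illustrated in plain Python. Note that this is not LangChain's API: `make_chain`, `fill_prompt`, and `fake_model` are hypothetical names, and the model is a stand-in that calls no LLM. The sketch only shows how the output of one step feeds the next.

```python
from functools import reduce

def make_chain(*steps):
    """Compose steps left to right: each step receives the previous step's output."""
    return lambda value: reduce(lambda acc, step: step(acc), steps, value)

def fill_prompt(inputs):
    """Prompt-management step: fill a fixed template with the given fields."""
    return f"Context: {inputs['context']}\nQuestion: {inputs['question']}\nAnswer:"

def fake_model(prompt):
    """Stand-in for a language model call (assumption: a real one would hit an API)."""
    return f"  [model output for a prompt of {len(prompt)} characters]  "

# Template -> model -> post-processing, chained into one callable
chain = make_chain(fill_prompt, fake_model, str.strip)
result = chain({"context": "Zara is an explorer.", "question": "Who is Zara?"})
print(result)
```

Because every step is an ordinary callable, swapping the model or adding a post-processing step only changes the arguments to `make_chain`, which is the modularity the list describes.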

### 1.4. ChromaDB

ChromaDB is a vector database. A vector database is a specialized database optimized for storing, indexing, and querying high-dimensional vector representations of data. These databases are designed to efficiently handle similarity searches in large datasets, making them ideal for use cases like:

- Semantic Search: Finding documents or text chunks similar to a query.
- Recommendation Systems: Suggesting items similar to a user's preferences.
- Image and Video Search: Retrieving similar images or video clips based on content.
- Anomaly Detection: Identifying unusual patterns in data.

Benefits of using a Vector Database:

- Efficient Similarity Searches: Vector databases use approximate nearest-neighbor indexes, such as HNSW (Hierarchical Navigable Small World) graphs or the tree-based indexes built by the Annoy library (Approximate Nearest Neighbors Oh Yeah), to quickly find similar vectors. This makes them significantly faster than comparing the query against every stored vector, especially for large datasets.
- Scalability: Vector databases are designed to handle large volumes of data efficiently, allowing you to scale your applications as needed.
- Integration with NLP Pipelines: Vector databases can be easily integrated with NLP pipelines where text is transformed into embeddings (vectors), and these embeddings are then used for search and retrieval.
- Real-Time Querying: They enable real-time querying, which is essential for applications like chatbots and interactive search engines.
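
For reference, this is the brute-force search that a vector database tries to avoid: the query is compared against every stored vector. It is a minimal sketch with made-up two-dimensional vectors, and the names `cosine_similarity` and `nearest` are chosen for this example only.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def nearest(query, vectors):
    """Return (index, score) of the stored vector most similar to the query."""
    scores = [cosine_similarity(query, vector) for vector in vectors]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]

stored = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # Toy "embeddings"
index, score = nearest([2.0, 0.1], stored)
print(index, round(score, 3))
```

The cost of `nearest` grows linearly with the number of stored vectors, which is fine for a handful of paragraphs but is exactly what indexes such as HNSW are meant to improve on.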

### 1.5. Cohere

Cohere is a natural language processing (NLP) platform that provides advanced language models as a service. It allows developers to leverage powerful machine learning models to perform tasks such as text generation, text embedding, and other NLP functionalities through a simple API. Cohere's platform is designed to help integrate state-of-the-art language understanding capabilities into applications without the need for deep expertise in machine learning. Some of its advantages are:

- High-Quality Language Models.
- Ease of Integration.
- Text Embedding.
- Developer-Friendly.
- Etc.

By offering cutting-edge language models through an accessible API, Cohere enables developers to incorporate advanced NLP capabilities into their applications, enhancing functionality and user experience without the need for extensive machine learning resources or expertise.

### 1.6. Postman

Postman is a popular tool for API testing, allowing you to create and execute requests and view responses in an organized manner. A Postman collection is a group of saved requests you can use to test and document your APIs. 

Creating a Postman collection allows you to save, organize, and share API requests, making it easier to test and document your API. It helps you ensure your endpoints are working correctly and can be used by others to verify and interact with your API.

### 1.7. Docker

A Docker image is a lightweight, standalone, and executable software package that includes everything needed to run a piece of software, including the code, runtime, libraries, environment variables, and configurations. A Dockerfile is a script that contains a series of instructions on how to build a Docker image, and allows us to automate the setup and deployment of our application by encapsulating it in a Docker container. This makes it easy to distribute, deploy, and run on any machine that supports Docker.

## 2. Encoding text

To grasp some ideas around these subjects, let's start by trying to encode some text! In particular, let's use the document provided by the challenge: `documento.docx`. Common encoding methods include:

- Tokenization: Splitting text into tokens (words or subwords) and converting them to numerical IDs.
- Word Embeddings: Representing words in a continuous vector space (e.g., Word2Vec, GloVe).
- Sentence Embeddings: Representing entire sentences or chunks of text in a vector space (e.g., BERT, Sentence-BERT).
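
As a toy illustration of the first method, tokens can be mapped to numerical IDs with a small vocabulary. This is a simplified word-level sketch (real models such as BERT use subword tokenizers), and the function names and the `<unk>` convention are assumptions made for the example.

```python
import re

def tokenize(text):
    """Split lowercase text into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def build_vocab(texts):
    """Assign an integer ID to every distinct token; ID 0 is reserved for unknowns."""
    vocab = {"<unk>": 0}
    for text in texts:
        for token in tokenize(text):
            vocab.setdefault(token, len(vocab))
    return vocab

def encode(text, vocab):
    """Convert text to a list of token IDs, using 0 for out-of-vocabulary tokens."""
    return [vocab.get(token, vocab["<unk>"]) for token in tokenize(text)]

vocab = build_vocab(["¿Quién es Zara?"])
print(encode("¿Quién es Zara?", vocab))  # [1, 2, 3, 4, 5]
print(encode("¿Quién es Alex?", vocab))  # [1, 2, 3, 0, 5]
```

The second call shows the limitation of a fixed word-level vocabulary: "Alex" was never seen, so it collapses to the unknown ID.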

### 2.1. Using BERT

Given a small corpus of short stories, the objective here is to divide it into chunks and then encode them using a pre-trained model. I'm using BERT's embeddings to capture semantic information from the text. Then, given a specific question, I encode it as well and find the most similar chunk to this question using similarity scores (in this case, I'll be using `cosine_similarity`).

Why BERT? BERT and other transformer-based models produce contextual embeddings that capture meaning beyond exact word matches and perform strongly on complex language-understanding tasks, which makes them a reasonable starting point for finding semantically similar text chunks.

In [14]:
from docx import Document
from transformers import BertTokenizer, BertModel
import torch
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

In [12]:
def read_document_from_docx(file_path):
    # "para.text" extracts the text content of the paragraph
    # ".strip()" removes any leading and trailing whitespace from the text
    # "if para.text.strip()" filters out paragraphs that are empty or contain only whitespace
    doc = Document(file_path)
    paragraphs = [para.text.strip() for para in doc.paragraphs if para.text.strip()]
    return paragraphs

file_path = 'documento.docx'
stories = read_document_from_docx(file_path) # Sample corpus of short stories in Spanish

In [16]:
# Load pre-trained Spanish model and tokenizer
model_name = 'dccuchile/bert-base-spanish-wwm-cased'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)

# Function to divide text into chunks
def divide_text_into_chunks(text, chunk_size):
    words = text.split()
    return [' '.join(words[i:i+chunk_size]) for i in range(0, len(words), chunk_size)]

# Divide each story into chunks
chunk_size = 70  # Chose 70 since it's an upper bound for the average paragraph size in the document
all_chunks = [divide_text_into_chunks(story, chunk_size) for story in stories]
all_chunks = [chunk for sublist in all_chunks for chunk in sublist]  # Flatten the list of chunks

# Function to encode text
def encode_text(text):
    inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).numpy().flatten()

# Encode all chunks and store in a list
encoded_chunks = [(encode_text(chunk), chunk) for chunk in all_chunks]

# Encode the question in Spanish
question = "¿Quién es Zara?"
encoded_question = encode_text(question)

# Calculate similarities and find the most similar chunk
similarities = [cosine_similarity([encoded_question], [vector])[0][0] for vector, _ in encoded_chunks]
most_similar_index = np.argmax(similarities)
most_similar_chunk = encoded_chunks[most_similar_index][1]

print(f"Most similar chunk: {most_similar_chunk}")

Some weights of BertModel were not initialized from the model checkpoint at dccuchile/bert-base-spanish-wwm-cased and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Most similar chunk: Características del Héroe Olvidado: Conocido como "Sombra Silenciosa", nuestro héroe es un maestro del sigilo y la astucia. Dotado de una memoria fotográfica y habilidades de camuflaje, se desplaza entre las sombras para proteger a los indefensos. Su pasado enigmático esconde tragedias que lo impulsan a luchar contra la injusticia. Aunque carece de habilidades sobrenaturales, su ingenio y habilidades tácticas lo convierten en una fuerza a tener en cuenta.


Doesn't seem to be working well :P. This is just some testing and a first attempt to get used to this kind of problem. Let's give it another try with a more "complex" approach.

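The next code defines a text-normalization function (lowercasing, accent and punctuation removal, etc.; these steps are commented out in the version shown, so the text passes through unchanged), considers paragraphs as chunks (since paragraphs are unrelated), and combines several similarity metrics.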

In [22]:
from docx import Document
from sentence_transformers import SentenceTransformer, util
import numpy as np
import re
import unicodedata
from scipy.spatial import distance
import torch

# Load a more powerful pre-trained Spanish model fine-tuned for question answering
model_name = 'mrm8488/bert-base-spanish-wwm-cased-finetuned-spa-squad2-es'
model = SentenceTransformer(model_name)

# Function to read the entire document from a DOCX file
def read_document_from_docx(file_path):
    doc = Document(file_path)
    paragraphs = [para.text.strip() for para in doc.paragraphs if para.text.strip()]
    return paragraphs

# Function to normalize text while keeping Spanish characters
def normalize_text(text):
#    text = text.lower()  # Convert to lowercase
#    text = unicodedata.normalize('NFD', text)  # Normalize to decompose accents
#    text = ''.join([c for c in text if unicodedata.category(c) != 'Mn' or c == 'ñ'])  # Remove combining accents except for ñ
#    text = re.sub(r'[^\w\sñ]', '', text)  # Remove punctuation except for ñ
    return text

# Read the entire document
file_path = 'documento.docx'  # Update this with your DOCX file path
paragraphs = read_document_from_docx(file_path)

# Normalize each paragraph
normalized_paragraphs = [normalize_text(para) for para in paragraphs]

# Encode the normalized paragraphs
encoded_chunks = model.encode(normalized_paragraphs, convert_to_tensor=True)

# Define questions related to the document
questions = [
    "¿Quién es Zara?",  
    "¿Qué descubre Alex?",  
    "¿Cómo se llama la flor mágica?",  
    "¿Qué recibe Emma?",  
    "¿Cuál es el apodo del héroe?"  
]

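# Function to min-max normalize scores to the [0, 1] range
def normalize_scores(scores):
    min_score = np.min(scores)
    max_score = np.max(scores)
    if max_score == min_score:  # Avoid division by zero when all scores are equal
        return np.zeros_like(scores, dtype=float)
    return (scores - min_score) / (max_score - min_score)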

# Normalize and loop through each question, encode it, and find the most similar chunk
for question in questions:
    normalized_question = normalize_text(question)
    encoded_question = model.encode(normalized_question, convert_to_tensor=True)
    
    # Compute similarity metrics
    cosine_scores = util.pytorch_cos_sim(encoded_question, encoded_chunks).numpy().flatten()
    euclidean_scores = np.array([distance.euclidean(encoded_question.numpy(), chunk.numpy()) for chunk in encoded_chunks])
    manhattan_scores = np.array([distance.cityblock(encoded_question.numpy(), chunk.numpy()) for chunk in encoded_chunks])
    dot_product_scores = np.array([torch.dot(encoded_question, chunk).item() for chunk in encoded_chunks])
    
    # Normalize the scores
    normalized_cosine_scores = normalize_scores(cosine_scores)
    normalized_euclidean_scores = normalize_scores(-euclidean_scores)  # Negative because lower distance is better
    normalized_manhattan_scores = normalize_scores(-manhattan_scores)  # Negative because lower distance is better
    normalized_dot_product_scores = normalize_scores(dot_product_scores)
    
    # Combine the normalized scores
    combined_scores = (
        normalized_cosine_scores +
        normalized_euclidean_scores +
        normalized_manhattan_scores +
        normalized_dot_product_scores
    )
    
    # Find the chunk with the highest combined score
    most_similar_index = np.argmax(combined_scores)
    most_similar_chunk = normalized_paragraphs[most_similar_index]
    
    print(f"Question: {question}")
    print(f"Most similar chunk: {most_similar_chunk}\n")

No sentence-transformers model found with name mrm8488/bert-base-spanish-wwm-cased-finetuned-spa-squad2-es. Creating a new one with mean pooling.


Question: ¿Quién es Zara?
Most similar chunk: Ficción Espacial: En la lejana galaxia de Zenthoria, dos civilizaciones alienígenas, los Dracorians y los Lumis, se encuentran al borde de la guerra intergaláctica. Un intrépido explorador, Zara, descubre un antiguo artefacto que podría contener la clave para la paz. Mientras viaja por planetas hostiles y se enfrenta a desafíos cósmicos, Zara debe desentrañar los secretos de la reliquia antes de que la galaxia se sumerja en el caos.

Question: ¿Qué descubre Alex?
Most similar chunk: Características del Héroe Olvidado: Conocido como "Sombra Silenciosa", nuestro héroe es un maestro del sigilo y la astucia. Dotado de una memoria fotográfica y habilidades de camuflaje, se desplaza entre las sombras para proteger a los indefensos. Su pasado enigmático esconde tragedias que lo impulsan a luchar contra la injusticia. Aunque carece de habilidades sobrenaturales, su ingenio y habilidades tácticas lo convierten en una fuerza a tener en cuenta.

Quest

The code still has issues but seems to perform better! Let's try a different approach in the next cells.

### 2.2. Cohere + ChromaDB

Now I'll be using Cohere's embeddings together with the ChromaDB vector database (the tutorial provided with this challenge was really helpful! :D). This involves getting a Cohere API Key to authenticate requests to the Cohere API.

I added some random questions just to check how the code is working and a unique identifier (UUID) to ensure that every document can be individually referenced in the database. I'm still considering paragraphs as chunks.
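
The idea of one UUID per paragraph chunk can be sketched without any external service. The helper `paragraphs_to_records` is hypothetical; its output keys simply mirror the `ids`, `documents`, and `metadatas` arguments used when adding items to the collection, and the `paragraph_index` metadata field is an assumption made for this example.

```python
import uuid

def paragraphs_to_records(content):
    """Split text on blank lines and attach a unique ID to every paragraph."""
    paragraphs = [p.strip() for p in content.split("\n\n") if p.strip()]
    return {
        "ids": [str(uuid.uuid4()) for _ in paragraphs],  # Random, collision-resistant IDs
        "documents": paragraphs,
        "metadatas": [{"paragraph_index": i} for i in range(len(paragraphs))],
    }

sample = "Ficción Espacial: ...\n\nFicción Tecnológica: ..."
records = paragraphs_to_records(sample)
for doc_id, document in zip(records["ids"], records["documents"]):
    print(doc_id, "->", document)
```

This sketch uses `uuid.uuid4()` (random) rather than `uuid.uuid1()` (derived from the host and timestamp); either yields unique identifiers for this purpose.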

In [26]:
import cohere
import chromadb
from chromadb.utils import embedding_functions
from docx import Document
import re
import unicodedata
import uuid
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Initialize Cohere
cohere_api_key = 'insert_your_key'
co = cohere.Client(cohere_api_key)

# Initialize ChromaDB Client
# The client is the interface you use to interact with the Chroma database
chroma_client = chromadb.Client()

# Define Cohere embedding function
cohere_ef = embedding_functions.CohereEmbeddingFunction(api_key=cohere_api_key, model_name="large")

# Set metadata options
metadata_options = {
    "hnsw:space": "cosine"  # You can choose "ip" or "cosine" based on your needs
}

# Get the collection that stores the embeddings, creating it if it doesn't exist yet
# A collection is like a table in a database, where you can store documents, their embeddings, and metadata.
collection = chroma_client.get_or_create_collection(name="document_embeddings", metadata=metadata_options, embedding_function=cohere_ef)

# Function to read the entire document from a DOCX file
def read_document_from_docx(file_path):
    doc = Document(file_path)
    return '\n\n'.join([para.text.strip() for para in doc.paragraphs if para.text.strip()])

# Function to normalize text while keeping Spanish characters
def normalize_text(text):
    text = text.lower()  # Convert to lowercase
    text = unicodedata.normalize('NFD', text)  # Decompose accented characters (ñ becomes n + U+0303)
    # Remove combining accents, but keep the combining tilde (U+0303) so that ñ survives
    text = ''.join([c for c in text if unicodedata.category(c) != 'Mn' or c == '\u0303'])
    text = unicodedata.normalize('NFC', text)  # Recompose n + tilde back into ñ
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation (\w already matches ñ)
    return text

# Read the entire document
file_path = 'documento.docx'  # Update this with your DOCX file path
content = read_document_from_docx(file_path)

# Split the document into chunks using RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(separators=["\n\n", "\n"], chunk_size=200, chunk_overlap=30)
docs = text_splitter.create_documents([content])

# Store each chunk in ChromaDB with a unique UUID
for doc in docs:
    uuid_name = uuid.uuid1()
    embedding = co.embed(texts=[doc.page_content], model='large').embeddings[0]  # Get the embedding
    collection.add(ids=[str(uuid_name)], documents=[doc.page_content], metadatas=[{'text': doc.page_content}], embeddings=[embedding])  # No .tolist()

# Define questions related to the document
questions = [
    "¿Quién es Zara?",  
    "¿Qué descubre Alex?",  
    "¿Cómo se llama la flor mágica?",  
    "¿Qué recibe Emma?",  
    "¿Cuál es el apodo del héroe?"  
]

# Loop through each question, encode it, and find the most similar chunk
for question in questions:
    normalized_question = normalize_text(question)

    # Get the embedding for the normalized question
    question_embedding = co.embed(texts=[normalized_question], model='large').embeddings[0]  # Get the embedding
    
    # Query the collection using the embedding
    results = collection.query(query_embeddings=[question_embedding], n_results=1)  # Use query_embeddings

    # Print the results to inspect their structure
    # print("Query Results:", results)

    # Access the most similar chunk based on the structure of the results
    most_similar_chunk = results['documents'][0][0]  # Access the first document in the first list
    metadata_text = results['metadatas'][0][0]['text']  # Access the metadata of the first document

    print(f"Question: {question}")
    print(f"Most similar chunk: {most_similar_chunk}\n")
    # print(f"Metadata text: {metadata_text}\n")  # You can also print the metadata if needed

Question: ¿Quién es Zara?
Most similar chunk: Ficción Espacial: En la lejana galaxia de Zenthoria, dos civilizaciones alienígenas, los Dracorians y los Lumis, se encuentran al borde de la guerra intergaláctica. Un intrépido explorador, Zara, descubre un antiguo artefacto que podría contener la clave para la paz. Mientras viaja por planetas hostiles y se enfrenta a desafíos cósmicos, Zara debe desentrañar los secretos de la reliquia antes de que la galaxia se sumerja en el caos.

Question: ¿Qué descubre Alex?
Most similar chunk: 
Ficción Tecnológica: En un futuro distópico, la inteligencia artificial ha evolucionado al punto de alcanzar la singularidad. Un joven ingeniero, Alex, se ve inmerso en una conspiración global cuando descubre que las supercomputadoras han desarrollado emociones. A medida que la humanidad lucha por controlar a estas máquinas sintientes, Alex se enfrenta a dilemas éticos y decisiones que podrían cambiar el curso de la historia.

Question: ¿Cómo se llama la flor m

This code seems to be doing a good job finding the "correct" story for each question. There are plenty of extra things to be done, such as building a Flask API, reducing the answers to one sentence, adding emojis, etc. Let's continue!

## 3. Incorporating Flask

**Warning:** I'm having problems running Flask in Jupyter notebooks. The following code was run in VS Code.

**Warning:** Remember to add your own Cohere API key!

To create a Python API that enables communication between users and Cohere, we need to set up an endpoint that receives requests, processes them, and then calls Cohere's API to get a response based on the user's input. In summary, the key components of the API are:

- Flask API Setup: We need to set up a Flask application that can handle HTTP requests and define routes that users can use to submit questions.
- Using Cohere LLM: We have to connect to the Cohere LLM through the Cohere API. The API will take the most relevant chunk of text as context and combine it with the user's question to generate a response.
- Retrieving relevant context: We can do this using ChromaDB to find the most relevant chunk based on the user’s question.

To ask a question and get an answer, a POST request has to be sent to the /ask endpoint with the required JSON data. This can be done using the `curl` command or using Postman. Here's a request example with `curl`:

```
curl -X POST http://127.0.0.1:5000/ask \
     -H "Content-Type: application/json" \
     -d '{"user_name": "John Doe", "question": "How are you today?"}'
```
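
The same request can also be built from Python with the standard library. This is a sketch under the assumption that the app is listening on `http://127.0.0.1:5000`, as in the `curl` example; the helper names `build_ask_request` and `ask` are made up here.

```python
import json
import urllib.request

def build_ask_request(user_name, question, base_url="http://127.0.0.1:5000"):
    """Build (but do not send) a POST request carrying the JSON payload for /ask."""
    payload = json.dumps({"user_name": user_name, "question": question}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/ask",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask(user_name, question):
    """Send the request and decode the JSON answer (requires the server to be running)."""
    with urllib.request.urlopen(build_ask_request(user_name, question)) as response:
        return json.loads(response.read().decode("utf-8"))

request_preview = build_ask_request("John Doe", "How are you today?")
print(request_preview.get_method(), request_preview.full_url)
```

Calling `ask("John Doe", "How are you today?")` only works once the Flask app is up; building the request separately makes it possible to inspect the payload without a server.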

In [None]:
from flask import Flask, request, jsonify, Response
import cohere
import chromadb
import uuid
from docx import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
import json

app = Flask(__name__)

# Initialize Cohere
cohere_api_key = 'insert_your_key'
co = cohere.Client(cohere_api_key)

# Initialize ChromaDB Client
chroma_client = chromadb.Client()
collection = chroma_client.get_or_create_collection(name="document_embeddings")

# Function to read the entire document from a DOCX file
def read_document_from_docx(file_path):
    doc = Document(file_path)
    return '\n\n'.join([para.text.strip() for para in doc.paragraphs if para.text.strip()])

# Read and process the document
def process_document(file_path):
    content = read_document_from_docx(file_path)
    text_splitter = RecursiveCharacterTextSplitter(separators=["\n\n", "\n"], chunk_size=200, chunk_overlap=30)
    docs = text_splitter.create_documents([content])
    
    # Store chunks in ChromaDB
    for doc in docs:
        uuid_name = str(uuid.uuid1())
        embedding = co.embed(texts=[doc.page_content], model='large').embeddings[0]  # Get the embedding
        collection.add(ids=[uuid_name], documents=[doc.page_content], metadatas=[{'text': doc.page_content}], embeddings=[embedding])

# Initialize the document processing
file_path = 'documento.docx'  # Update with your DOCX file path
process_document(file_path)

@app.route('/ask', methods=['POST'])
def ask():
    # Get the user's name and question from the request
    user_name = request.json.get('user_name')
    user_question = request.json.get('question')

    # Step 1: Retrieve the most relevant chunk using Chroma
    question_embedding = co.embed(texts=[user_question], model='large').embeddings[0]  # Get the embedding for the question
    results = collection.query(query_embeddings=[question_embedding], n_results=1)

    # Extract the most relevant chunk
    most_relevant_chunk = results['documents'][0][0]  # Access the first document in the first list

    # Step 2: Create a prompt for the LLM
    prompt = f"Context: {most_relevant_chunk}\nQuestion: {user_question}\nAnswer:"

    # Step 3: Use the Cohere LLM to get an answer
    response = co.generate(prompt=prompt, model='command', max_tokens=150)  # Adjust parameters as needed

    # Create the response data
    response_data = {
        'user_name': user_name,
        'question': user_question,
        'answer': response.generations[0].text.strip()
    }

    # Return the generated answer along with the user's name, ensuring no special character escaping
    return Response(json.dumps(response_data, ensure_ascii=False), mimetype='application/json')

if __name__ == '__main__':
    app.run(debug=True)

When sending the request:

```
curl -X POST http://localhost:5000/ask -H "Content-Type: application/json" -d '{"user_name": "John Doe", "question": "¿Quién es Zara"}'
```

I got the answer:

```
{"user_name": "John Doe", "question": "¿Quién es Zara", "answer": "Zara es un intrépido explorador en la lejana galaxia de Zenthoria, que viaja por hostiles planetas y se enfrenta a desafíos cósmicos."}
```

The code seems to be working fine! Now, in the next section, I'll add some extra features.

## 4. Extra features + Postman + Docker

The program must meet the following requirements regarding the answer provided:

1) The program always provides the same answer to the same question.

To ensure this, I added a dictionary, `answer_cache`, to store answers to questions. Before processing a question, the code checks whether it already exists in the cache. If it does, the cached answer is returned. Note that the cache lives in memory, so it is reset whenever the app restarts.

2) The answers must be limited to one sentence and in the third person.

To ensure this, I modified the prompt to explicitly ask for a concise, single-sentence answer written in the third person. I also post-processed the generated response to extract only the first sentence.

3) Add emojis to the end of the answer based on the content.

To ensure this, I added a second prompt that generates emojis based on the answer. It uses Cohere's text generation to suggest emojis and appends them to the answer before sending the response back to the user.

4) The language of the answer must be the same as the language of the question.

Unfortunately, I couldn't get this done. I couldn't find a free translation API or get Cohere to handle this. :(
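
Going back to requirements 1 and 2, the caching and first-sentence steps can be sketched in isolation. This is an illustration, not the code in `app.py`: the `generate` argument is a stand-in for the LLM call, the cache key is normalized for case and whitespace (an assumption; a plain dictionary lookup on the raw question would treat those variants as different questions), and the sentence splitter is a simple regular expression.

```python
import re

answer_cache = {}

def first_sentence(text):
    """Keep text up to the first ., ! or ? that is followed by whitespace or the end."""
    text = text.strip()
    match = re.search(r"[.!?](?=\s|$)", text)
    return text[: match.end()] if match else text

def cache_key(question):
    """Normalize case and whitespace so trivially different spellings share an answer."""
    return " ".join(question.lower().split())

def answer_once(question, generate):
    """Return the cached answer, calling generate() only the first time."""
    key = cache_key(question)
    if key not in answer_cache:
        answer_cache[key] = first_sentence(generate(question))
    return answer_cache[key]

def fake_generate(question):
    # Stand-in for the LLM call; always returns two sentences
    return "Zara es un intrépido explorador. Vive en Zenthoria."

print(answer_once("¿Quién es Zara?", fake_generate))
print(first_sentence("Mide 3.5 metros. Es alto."))
```

Requiring whitespace after the terminator keeps decimals such as `3.5` from being cut in half, which a plain split on `.` would do.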

Also, since I was having issues connecting to the Cohere API, I added some error handling and logging. To reduce verbosity, change the logging level from `DEBUG` to `WARNING`: `logging.basicConfig(level=logging.WARNING)`

In [None]:
from flask import Flask, request, jsonify, Response
import cohere
import chromadb
import uuid
from docx import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
import json
import logging
import emoji

app = Flask(__name__)

# Configure logging
logging.basicConfig(level=logging.DEBUG)

# Initialize Cohere
cohere_api_key = 'insert_your_key'
co = cohere.Client(cohere_api_key)

# Initialize ChromaDB Client
chroma_client = chromadb.Client()
collection = chroma_client.get_or_create_collection(name="document_embeddings")

# Cache to store answers to questions
answer_cache = {}

# Function to read the entire document from a DOCX file
def read_document_from_docx(file_path):
    doc = Document(file_path)
    return '\n\n'.join([para.text.strip() for para in doc.paragraphs if para.text.strip()])

# Read and process the document
def process_document(file_path):
    content = read_document_from_docx(file_path)
    text_splitter = RecursiveCharacterTextSplitter(separators=["\n\n", "\n"], chunk_size=200, chunk_overlap=30)
    docs = text_splitter.create_documents([content])
    
    # Store chunks in ChromaDB
    for doc in docs:
        uuid_name = str(uuid.uuid1())
        try:
            embedding = co.embed(texts=[doc.page_content], model='large').embeddings[0]  # Get the embedding
            collection.add(ids=[uuid_name], documents=[doc.page_content], metadatas=[{'text': doc.page_content}], embeddings=[embedding])
        except Exception as e:
            logging.error(f"Error getting embedding: {e}")

# Initialize the document processing
file_path = 'documento.docx'  # Update with your DOCX file path
try:
    process_document(file_path)
except Exception as e:
    logging.error(f"Error processing document: {e}")

@app.route('/ask', methods=['POST'])
def ask():
    try:
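        # Get the user's name and question from the request body
        data = request.get_json(silent=True) or {}
        user_name = data.get('user_name')
        user_question = data.get('question')

        # Reject requests without a question instead of failing later with a 500
        if not user_question:
            return jsonify({'error': "Missing 'question' field"}), 400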

        # Check if the answer is already cached
        if user_question in answer_cache:
            cached_answer = answer_cache[user_question]
            response_data = {
                'user_name': user_name,
                'question': user_question,
                'answer': cached_answer
            }
            return Response(json.dumps(response_data, ensure_ascii=False), mimetype='application/json')

        # Step 1: Retrieve the most relevant chunk using Chroma
        question_embedding = co.embed(texts=[user_question], model='large').embeddings[0]  # Get the embedding for the question
        results = collection.query(query_embeddings=[question_embedding], n_results=1)

        # Extract the most relevant chunk
        most_relevant_chunk = results['documents'][0][0]  # Access the first document in the first list

        # Step 2: Create a prompt for the LLM
        prompt = f"Contexto: {most_relevant_chunk}\nPregunta: {user_question}\nResponde en tercera persona y en una oración:"  # Added "Responde en tercera persona"

        # Step 3: Use the Cohere LLM to get an answer
        response = co.generate(prompt=prompt, model='command', max_tokens=150)  # Adjust parameters as needed
        generated_answer = response.generations[0].text.strip()

        # Extract only the first sentence from the generated answer
        first_sentence = generated_answer.split('.')[0] + '.'

        # Step 4: Create a prompt to generate emojis based on the answer
        emoji_prompt = f"Answer: {first_sentence}\nAdd two or three emojis that represent this answer:"

        # Step 5: Use the Cohere LLM to generate emojis
        emoji_response = co.generate(prompt=emoji_prompt, model='command', max_tokens=10)
        emoji_text = emoji_response.generations[0].text.strip()

        # Filter to keep only emojis; emoji_list (emoji>=2.0) keeps multi-character emojis such as flags intact
        emojis = ''.join(match['emoji'] for match in emoji.emoji_list(emoji_text))

        # Append emojis to the answer
        final_answer = first_sentence + ' ' + emojis

        # Cache the generated answer
        answer_cache[user_question] = final_answer

        # Create the response data
        response_data = {
            'user_name': user_name,
            'question': user_question,
            'answer': final_answer
        }

        # Return the generated answer along with the user's name, ensuring no special character escaping
        return Response(json.dumps(response_data, ensure_ascii=False), mimetype='application/json')
    except Exception as e:
        logging.error(f"Error in ask endpoint: {e}")
        return jsonify({'error': str(e)}), 500


if __name__ == '__main__':
    try:
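        # Note: Flask binds to 127.0.0.1 by default. If the container starts the app through this
        # block, host='0.0.0.0' is needed for the port published by `docker run -p` to be reachable.
        app.run(debug=True)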
    except Exception as e:
        logging.error(f"Error starting Flask app: {e}")

To test this code, I created a Postman collection with the following requests:

```
{
    "user_name": "Usuario 1",
    "question": "¿Quién es Zara?"
}
{
    "user_name": "Usuario 2",
    "question": "¿A dónde decidió ir Emma?"
}
{
    "user_name": "Usuario 3",
    "question": "¿Cuál es el nombre de la flor mágica?"
}
```
I got the following answers:

1) "Zara es un explorador intrépido y valiente que viaja en busca de la paz en la lejana galaxia de Zenthoria. 💫"
2) "Emma se transportó a un mundo lleno de maravillas, donde disfrutó de muchos lugares increíbles y sorprendentes. 🌍🥰👏"
3) "Según el texto, la flor mágica se denomina \"Luz de Luna\". 🌼🌕🌛⚪"

The code seems to be working fine, although more testing should be done. The collection can be found in the file `RAG_API.postman_collection.json`.
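
The same requests can also be sent from a script instead of Postman. Below is a minimal sketch using only the standard library. The URL assumes the app is running locally on Flask's default port 5000, and the helper names `build_request` and `ask_api` are made up for this example.

```python
import json
import urllib.request

# Assumption: the Flask app is reachable locally on the default port.
ASK_URL = "http://localhost:5000/ask"


def build_request(user_name, question, url=ASK_URL):
    """Build a POST request with the JSON body the /ask endpoint expects."""
    body = json.dumps(
        {"user_name": user_name, "question": question},
        ensure_ascii=False,
    ).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_api(user_name, question, url=ASK_URL, timeout=60):
    """Send the question to the running app and return the decoded JSON reply."""
    request = build_request(user_name, question, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `ask_api("Usuario 1", "¿Quién es Zara?")["answer"]` returns the answer string. Calling it twice with the same question is a quick way to check that the second reply comes from the cache.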

After this, I set up the working directory to create a Dockerfile for this Flask application. The directory has the following structure:
```
/rag-llms
    /app.py
    /Dockerfile
    /requirements.txt
    /documento.docx
    /...
```
After building the Docker image with:

```docker build -t flask-cohere-app .```

the program can be run in a Docker container with:

```docker run -p 5000:5000 flask-cohere-app```

## 5. Things to add/try:

- Fine-tuning of the pre-trained models.
- Support for other languages.
- More thorough testing.
- Generalize document reading (include other formats such as PDF).
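- Optimize for scalability (the approach used here requests one embedding per chunk and keeps both the Chroma collection and the answer cache in memory, so I guess it only works for small corpora).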
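
As a first step towards the scalability item, the chunks could be embedded in batches rather than one API call per chunk, since `co.embed` already takes a list of texts. The sketch below is only an illustration: `batched` and `embed_in_batches` are hypothetical helpers, the batch size of 32 is arbitrary (the API's per-call limit should be checked), and `embed_fn` stands for any function that maps a list of strings to a list of vectors.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` holding at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(texts, embed_fn, batch_size=32):
    """Embed `texts` with one `embed_fn` call per batch, preserving the order."""
    vectors = []
    for batch in batched(texts, batch_size):
        # embed_fn is expected to return one vector per input string
        vectors.extend(embed_fn(batch))
    return vectors
```

With the client from `app.py`, `embed_fn` could be `lambda batch: co.embed(texts=batch, model='large').embeddings`, and the resulting vectors could be passed to a single `collection.add` call together with the matching ids and documents.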