# **Step 2: Knowledge Base Construction**


## **1. Reasons to Choose Elasticsearch**

- **Full-Text Search Capabilities**:
Elasticsearch excels at full-text search, with advanced features such as fuzzy matching (which compares two strings, e.g. ALINA and ALI, and scores how similar they are), relevance scoring, and an expressive query language. These are essential for effectively retrieving text extracted from complex PDFs.

- **Scalability and Performance**:
Elasticsearch is designed to be distributed and scalable, so it can handle large datasets and high query volumes efficiently. This is crucial for a chatbot that must respond quickly to user queries over potentially large PDF documents.

- **Vector Search**:
With built-in support for vector search, Elasticsearch can handle embedding-based retrieval, which is critical for modern RAG systems. This lets the system retrieve semantically similar content, improving the chatbot's ability to provide relevant answers.

- **Integration and Ecosystem**:
Elasticsearch integrates well with machine learning and AI tooling, such as the LlamaIndex framework and a range of embedding models. This makes it easier to build a pipeline that ingests PDF content, converts it into embeddings, and performs efficient search and retrieval.
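
As a rough illustration of the fuzzy-matching and vector-search points above, the following sketch computes the Levenshtein edit distance for the ALINA / ALI pair (edit distance is the measure behind Elasticsearch's `fuzziness` option, which accepts at most two edits) and builds request bodies for a fuzzy `match` query and a kNN search. The field names `content` and `embedding`, the helper names, and the toy query vector are assumptions made for illustration; the bodies are only constructed here, not sent to a cluster.

```python
import json


def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep on match)
            ))
        previous = current
    return previous[-1]


def fuzzy_match_query(field, text, fuzziness="AUTO"):
    """Body for a full-text match query that tolerates small typos."""
    return {"query": {"match": {field: {"query": text, "fuzziness": fuzziness}}}}


def knn_query(field, query_vector, k=5, num_candidates=50):
    """Body for an approximate nearest-neighbour search on a dense_vector field."""
    return {
        "knn": {
            "field": field,
            "query_vector": query_vector,
            "k": k,
            "num_candidates": num_candidates,
        }
    }


print(levenshtein("ALINA", "ALI"))  # 2 (delete "N" and "A")
print(json.dumps(fuzzy_match_query("content", "elasticsarch")))
print(json.dumps(knn_query("embedding", [0.1, 0.2, 0.3])))
```

Either body could be sent as JSON to the index's `_search` endpoint; in practice the query vector would come from the same embedding model used for the stored documents.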

## **2. Store Raw Text in Elasticsearch**

It is generally recommended to store the raw text and perform vectorization at query or processing time.

- **Flexibility**: Storing raw text allows future re-processing with different vectorization techniques without re-ingesting the data. This flexibility matters because better or more efficient vectorization methods may become available later.

- **Space Efficiency**: Raw text often requires less storage space than its vectorized representation. Vector embeddings are high-dimensional (typically hundreds or thousands of floating-point values per chunk) and therefore consume more space.

- **Indexing Efficiency**: Modern databases and search engines, like Elasticsearch, can efficiently handle and index raw text, enabling quick full-text search and retrieval. Once the relevant documents are retrieved, vectorization can be applied to this smaller subset of the data, saving computational resources.
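
To put the space-efficiency point in rough numbers, the sketch below compares the UTF-8 size of a text chunk with the size of a dense float32 embedding. The 1,000-character chunk and the 1,536 dimensions are illustrative assumptions, not values taken from this project, and real index sizes also depend on Elasticsearch's own compression and index structures.

```python
def text_bytes(text):
    """Size of the text when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def embedding_bytes(dimensions, bytes_per_value=4):
    """Size of a dense vector stored as float32 (4 bytes per value)."""
    return dimensions * bytes_per_value


chunk = "a" * 1000  # assumed chunk of 1,000 ASCII characters
dims = 1536         # assumed embedding dimensionality

print(text_bytes(chunk))      # 1000
print(embedding_bytes(dims))  # 6144
```

Under these assumptions the embedding is about six times larger than the chunk it represents.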

## **3. Read the Markdown File and Prepare the Text for Indexing**


In databases and search engines, indexing is the process of organizing data so that it can be retrieved quickly.

### Benefits of Indexing:
- Faster Query Response: Indexing enables quick search responses by reducing the need to scan the entire dataset.
- Improved User Experience: Users get faster and more accurate search results, enhancing their overall experience.
- Scalability: Efficient indexing allows systems to scale and handle large volumes of data and queries effectively.
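
The idea behind a full-text index can be sketched with a toy inverted index, which maps each term to the documents that contain it, so that a query looks up terms instead of scanning every document. This is a simplified illustration with made-up sample documents and a naive whitespace tokenizer; Elasticsearch's actual text analysis and index structures (built on Lucene) are considerably more elaborate.

```python
from collections import defaultdict


def tokenize(text):
    """Naive tokenizer: lowercase, then split on whitespace."""
    return text.lower().split()


def build_inverted_index(documents):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return dict(index)


def search(index, query):
    """Return the IDs of documents containing every term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Made-up sample documents, keyed by ID
documents = {
    1: "Elasticsearch indexes raw text",
    2: "Vector search retrieves similar text",
    3: "Elasticsearch supports vector search",
}

index = build_inverted_index(documents)
print(sorted(search(index, "vector search")))       # [2, 3]
print(sorted(search(index, "elasticsearch text")))  # [1]
```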

```python
import json
import os

import markdown
import requests
from dotenv import load_dotenv

# Load variables from a local .env file, if present, into the environment
load_dotenv()

# Configuration
es_host = 'https://371e52c2ddc94eeda8d2dbeb8acc5645.us-central1.gcp.cloud.es.io:443'  # URL of your Elasticsearch instance
api_key = os.getenv('elastic_host')  # despite its name, this variable must hold the API key
index_name = 'data_for_rag'
document_id = 1  # unique ID for the document
markdown_file_path = 'parsed_result_gpt.md'

# Read the Markdown file
with open(markdown_file_path, 'r', encoding='utf-8') as file:
    markdown_text = file.read()

# Convert the Markdown to HTML
html_text = markdown.markdown(markdown_text)

# Prepare the document for indexing
document = {
    "title": "Document Title",
    "content": html_text,
    "metadata": {
        "author": "Author Name",
        "date": "2024-07-21"
    }
}

# Serialize the document to JSON
document_json = json.dumps(document)

# URL for indexing the document
url = f'{es_host}/{index_name}/_doc/{document_id}'

# Store the document in Elasticsearch through the REST API, authenticating with an API key
headers = {
    "Content-Type": "application/json",
    "Authorization": f"ApiKey {api_key}"
}
response = requests.put(url, headers=headers, data=document_json, timeout=30)

# Check the API response
if response.status_code in [200, 201]:
    print("Document indexed successfully.")
else:
    print(f"Error while indexing: {response.status_code}\n{response.text}")
```

Document indexed successfully.
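
The status check above accepts both 200 and 201 because the index API replies with 201 when it creates a document and with 200 when it overwrites an existing document that has the same ID; the response body also carries a `result` field set to `created` or `updated`. A minimal sketch of a helper that reports this distinction follows. The function name `describe_index_response` and the sample payloads are hypothetical, and the payloads are trimmed to the fields used here.

```python
def describe_index_response(status_code, body):
    """Summarize the outcome of an index request from its status code and JSON body."""
    if status_code not in (200, 201):
        return f"indexing failed with HTTP {status_code}"
    result = body.get("result", "unknown")  # "created" or "updated"
    version = body.get("_version")          # incremented on every overwrite
    return f"document {body.get('_id')} {result} (version {version})"


# Hypothetical, trimmed response bodies
created = {"_index": "data_for_rag", "_id": "1", "_version": 1, "result": "created"}
updated = {"_index": "data_for_rag", "_id": "1", "_version": 2, "result": "updated"}

print(describe_index_response(201, created))  # document 1 created (version 1)
print(describe_index_response(200, updated))  # document 1 updated (version 2)
print(describe_index_response(401, {}))       # indexing failed with HTTP 401
```

With the request shown earlier, the helper could be called as `describe_index_response(response.status_code, response.json())`.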
