# API Documentation Enrichment Tool

This notebook implements a data enrichment pipeline that transforms raw API endpoint documentation into a retrieval-friendly, LLM-consumable format. The enriched documentation is intended for use in RAG (retrieval-augmented generation) systems.

In [1]:
import os
import glob
from pathlib import Path
from langchain_openai import ChatOpenAI

## LLM Configuration

Initialize the OpenAI chat model (`gpt-4o-mini`, accessed through `ChatOpenAI`) that will be used to enrich the documentation. A small helper function makes it easy to call the model with different prompts.

In [2]:
from getpass import getpass

openai_api_key = getpass("Enter your OpenAI API key: ")

llm = ChatOpenAI(
    model_name="gpt-4o-mini",
    openai_api_key=openai_api_key,
    temperature=0.3
)

def call_llm(prompt):
    """
    Call the OpenAI chat model with the enrichment prompt.
    Returns the model's response message; the generated text is in `.content`.
    """
    response = llm.invoke(prompt)
    return response
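
Because `call_llm` hands back the response object from `llm.invoke` rather than a bare string, downstream code has to read the text from `.content`. A minimal sketch of a normalizing helper follows. The name `response_text` and the `SimpleNamespace` stand-in are hypothetical and used only so the example runs without an API call. It also assumes the reply is a text-only message whose `.content` is a plain string.

```python
from types import SimpleNamespace


def response_text(response) -> str:
    """Return the text of a model response.

    Hypothetical helper: assumes a chat message object exposes its text
    in `.content`; plain strings are passed through unchanged.
    """
    if isinstance(response, str):
        return response
    return response.content


# Stand-in for a chat message, used here only to avoid a real API call.
fake_message = SimpleNamespace(content="## Overview\n...")
print(response_text(fake_message))
```

With a helper like this, the code that writes files could accept either a message or a string.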

## Configuration

Set up the input and output directories and define the enrichment prompt template. The template tells the model how to structure and enhance the API documentation.
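
The template carries a single `{endpoint_text}` placeholder that is filled in with `str.format`. The sketch below uses a shortened, hypothetical stand-in (`mini_template`) and an invented `raw_doc` to illustrate one property worth knowing: `str.format` parses only the template, so braces inside the substituted documentation (JSON bodies, path parameters) pass through unchanged. Literal braces added to the template itself would need to be doubled (`{{` and `}}`).

```python
# Hypothetical, shortened stand-in for the enrichment template.
mini_template = """**RAW DOCUMENTATION:**

{endpoint_text}

**Your Output:**
"""

# Invented sample: raw docs often contain JSON bodies or path parameters.
raw_doc = 'POST /policies/{policyId}\nBody: {"name": "string"}'

# Only the template is parsed for placeholders; raw_doc is inserted verbatim.
prompt = mini_template.format(endpoint_text=raw_doc)
print(prompt)
```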

In [3]:
# Directory to read files from
input_dir = "/workspaces/RAG_BOT/ProcessedData"

# Directory to save enriched files
output_dir = "/workspaces/RAG_BOT/EnrichedData"
os.makedirs(output_dir, exist_ok=True)

# Define a consistent enrichment prompt template
enrichment_template = """
You are an expert API documentation assistant and retrieval-augmentation designer.  

Your goal is to transform the following raw endpoint documentation into a *retrieval-friendly*, *LLM-consumable* format that is also keyword-search friendly.  

Your output must be structured into sections.  

For each section, **strictly use only the content from the raw documentation** unless you are generating generic questions or search terms to help retrieval.  

---

**Your task:**

Given the raw documentation, produce these sections:

1. **Overview**  
   - A concise, human-readable summary explaining the purpose and use-case of the endpoint.  
   - Include HTTP method and path in human-friendly terms.  
   - Explain any security/auth requirements.

2. **Key Search Terms**  
   - 5–10 relevant search keywords or phrases that someone might use to find this endpoint in a semantic search.

3. **Example User Questions**  
   - 5–10 natural-language example questions a user might ask when they need this endpoint.

4. **Developer Notes**  
   - Important details for developers (required parameters, request/response structure, error handling, security considerations).
   - List *required fields* clearly.

5. **Detailed Explanation of Available Data, Request and Response Parameters**
   - A detailed breakdown of all request and response parameters, including types, descriptions, and any constraints.
   - Include examples of typical values where applicable.

6. **Raw Endpoint Documentation (Formatted)**  
   - Nicely reformat and preserve the original text exactly as given (but fix any obvious formatting issues).

---

**Instructions for formatting output:**  

- Use clear markdown headings for each section.  
- Use bullet lists or code blocks where appropriate.  
- Be consistent across different endpoints.  

---

**RAW DOCUMENTATION:**  

{endpoint_text}

---

**Your Output:**  
Return the fully structured markdown with all sections completed.
"""

In [4]:
# Get all text files from the input directory
input_files = glob.glob(os.path.join(input_dir, "*.txt"))
print(f"Found {len(input_files)} files to process.")

Found 3 files to process.


## Document Processing

Process each file by:
1. Reading the content
2. Splitting it into individual endpoint chunks
3. Enriching each endpoint using the LLM
4. Saving the enriched documentation to the output directory
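
The steps above can be sketched as a pure function, which makes it possible to dry-run the splitting and file-naming logic without calling the model or touching the disk. This is an illustrative sketch, not the notebook's implementation: `enrich_document`, the `enrich` callable, and the sample text are hypothetical, and the 80-hyphen default separator is an assumption that should be replaced with the exact separator used in your files.

```python
from pathlib import Path
from typing import Callable, Dict

# Assumed default separator; use the exact string from your input files.
SEPARATOR = "-" * 80


def enrich_document(
    full_text: str,
    source_name: str,
    enrich: Callable[[str], str],
    separator: str = SEPARATOR,
) -> Dict[str, str]:
    """Split one document into endpoint chunks and enrich each one.

    Returns a mapping of output filename -> enriched text. Nothing is
    written to disk, so the function can be exercised with a stub.
    """
    # Drop empty chunks (for example, after a trailing separator).
    chunks = [c.strip() for c in full_text.split(separator) if c.strip()]
    stem = Path(source_name).stem
    results = {}
    for idx, chunk in enumerate(chunks, 1):
        results[f"{stem}_endpoint_{idx:03}.txt"] = enrich(chunk)
    return results


# Dry run with a stub (str.upper) in place of the model call.
sample = "GET /policies\n" + SEPARATOR + "\nPOST /policies\n" + SEPARATOR + "\n"
dry_run = enrich_document(sample, "PolicyDemo.txt", enrich=str.upper)
print(sorted(dry_run))
```

Passing the model call in as `enrich` keeps the split and naming logic testable on its own; in a real run it would be a function that sends the formatted prompt to the model and returns the generated text.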

In [5]:
for file_idx, file_path in enumerate(input_files, 1):
    print(f"Processing file {file_idx}/{len(input_files)}: {file_path}")
    
    # Read the file content
    with open(file_path, "r", encoding="utf-8") as f:
        full_text = f.read()
    
    # Split on the known separator (adjust if it's different in your files)
    endpoint_chunks = [chunk.strip() for chunk in full_text.split('--------------------------------------------------------------------------------') if chunk.strip()]
    
    print(f"  - Found {len(endpoint_chunks)} endpoint sections in {os.path.basename(file_path)}")
    
    # Process each endpoint chunk
    for chunk_idx, endpoint_text in enumerate(endpoint_chunks, 1):
        print(f"  - Processing endpoint {chunk_idx}/{len(endpoint_chunks)}...")
        
        # Prepare the prompt
        prompt = enrichment_template.format(endpoint_text=endpoint_text)
        
        # Get the enriched version from the model
        enriched_text = call_llm(prompt)
        
        # Define filename using both the source file name and chunk index
        base_filename = Path(file_path).stem
        filename = f"{base_filename}_endpoint_{chunk_idx:03}.txt"
        
        # Save enriched text
        with open(os.path.join(output_dir, filename), "w", encoding="utf-8") as out_file:
            out_file.write(enriched_text.content)
        
        print(f"    Saved: {filename}")

Processing file 1/3: /workspaces/RAG_BOT/ProcessedData/PolicyMangement.txt
  - Found 9 endpoint sections in PolicyMangement.txt
  - Processing endpoint 1/9...
    Saved: PolicyMangement_endpoint_001.txt
  - Processing endpoint 2/9...
    Saved: PolicyMangement_endpoint_002.txt
  - Processing endpoint 3/9...
    Saved: PolicyMangement_endpoint_003.txt
  - Processing endpoint 4/9...
    Saved: PolicyMangement_endpoint_004.txt
  - Processing endpoint 5/9...
    Saved: PolicyMangement_endpoint_005.txt
  - Processing endpoint 6/9...
    Saved: PolicyMangement_endpoint_006.txt
  - Processing endpoint 7/9...
    Saved: PolicyMangement_endpoint_007.txt
  - Processing endpoint 8/9...
    Saved: PolicyMangement_endpoint_008.txt
  - Processing endpoint 9/9...
    Saved: PolicyMangement_endpoint_009.txt
Processing file 2/3: /workspaces/RAG_BOT/ProcessedData/ApplicationManagement.txt
  - Found 35 endpoint sections in ApplicationManagement.txt
  - Processing endpoint 1/35...
    Saved: ApplicationMa