# Summary of Notebook

This notebook processes OpenAPI specifications into structured text documents for use in a Retrieval Augmented Generation (RAG) system. It follows these steps:

1. **Data Acquisition**:
   - Reads a CSV file containing API names and their corresponding OpenAPI specification URLs
   - Downloads YAML specification files from these URLs
   - Stores the raw YAML files in a local directory

2. **Format Conversion**:
   - Parses the YAML files into Python objects
   - Converts these objects to JSON format
   - Saves the JSON files for further processing

3. **Specification Processing**:
   - Extracts detailed information about each API endpoint:
     - Basic information (path, method, summary)
     - Parameters and their descriptions
     - Request body schemas and examples
     - Response formats and examples
     - Security requirements

4. **RAG-Optimized Formatting**:
   - Structures the extracted information in a consistent format
   - Cleans HTML from descriptions
   - Generates sample JSON for requests and responses
   - Creates comprehensive documentation for each endpoint
   
5. **Output Generation**:
   - Saves processed documentation as text files
   - Organizes outputs by API category
   - Provides summary statistics about the processed endpoints

This processed data can then be used as the knowledge base for a RAG system that can answer questions about these APIs.

# Data Collector for YAML to JSON Conversion

This notebook:
1. Reads URLs from the OG_Dataset.csv file
2. Downloads YAML files from those URLs
3. Converts the YAML files to JSON format
4. Processes OpenAPI specifications into structured text documents

In [1]:
# Import required libraries
import pandas as pd
import requests
import yaml
import json
import os
import re
import logging

# Set up logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

## Step 1: Read the CSV file containing URLs

In [2]:
# Read the CSV file
csv_path = '/workspaces/RAG_BOT/OG_Dataset.csv'
df = pd.read_csv(csv_path)

# Display the CSV contents
print(f"Found {len(df)} entries in the CSV file:")
df

Found 3 entries in the CSV file:


| | Name | URL |
|---|---|---|
| 0 | PolicyMangement | https://api.stoplight.io/projects/cHJqOjIxMDg3... |
| 1 | UserManagement | https://api.stoplight.io/projects/cHJqOjIxMDg3... |
| 2 | ApplicationManagement | https://api.stoplight.io/projects/cHJqOjIxMDg3... |


## Step 2: Download YAML files and convert to JSON

In [3]:
# Create output directories if they don't exist
output_dir_json = '/workspaces/RAG_BOT/Data Processor/json_files'
output_dir_yaml = '/workspaces/RAG_BOT/Data Processor/yaml_files'
os.makedirs(output_dir_json, exist_ok=True)
os.makedirs(output_dir_yaml, exist_ok=True)

# Process each URL
json_files = []
yaml_files = []

for index, row in df.iterrows():
    try:
        # Use .iloc for positional access; row[0] on a label-indexed Series is deprecated
        category = row.iloc[0]  # First column is the category name
        url = row.iloc[1]       # Second column is the URL
        
        logger.info(f"Processing {category} from {url}")
        
        # Download the YAML file
        response = requests.get(url, timeout=30)  # Avoid hanging indefinitely on a stalled server
        response.raise_for_status()  # Raise exception for HTTP errors
        
        # Save YAML to file
        yaml_filename = f"{category}.yaml"
        yaml_path = os.path.join(output_dir_yaml, yaml_filename)
        
        with open(yaml_path, 'w', encoding='utf-8') as yaml_file:
            yaml_file.write(response.text)
        
        logger.info(f"Saved original YAML to {yaml_path}")
        yaml_files.append(yaml_path)
        
        # Parse YAML content
        yaml_content = yaml.safe_load(response.text)
        
        # Convert to JSON
        json_content = json.dumps(yaml_content, indent=2)
        
        # Save JSON to file
        json_filename = f"{category}.json"
        json_path = os.path.join(output_dir_json, json_filename)
        
        with open(json_path, 'w') as json_file:
            json_file.write(json_content)
        
        logger.info(f"Saved JSON to {json_path}")
        json_files.append(json_path)
        
    except Exception as e:
        logger.error(f"Error processing {url}: {str(e)}")

print(f"Successfully processed {len(json_files)} JSON files and {len(yaml_files)} YAML files")

2025-06-11 00:50:59,743 - INFO - Processing PolicyMangement from https://api.stoplight.io/projects/cHJqOjIxMDg3OQ/branches/main/export/openapi/identity_merged_files/PolicyManagement.yaml
2025-06-11 00:51:01,644 - INFO - Saved original YAML to /workspaces/RAG_BOT/Data Processor/yaml_files/PolicyMangement.yaml
2025-06-11 00:51:01,684 - INFO - Saved JSON to /workspaces/RAG_BOT/Data Processor/json_files/PolicyMangement.json
2025-06-11 00:51:01,685 - INFO - Processing UserManagement from https://api.stoplight.io/projects/cHJqOjIxMDg3OQ/branches/main/export/openapi/identity_merged_files/UserManagement.yaml

Successfully processed 3 JSON files and 3 YAML files
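
One caveat with the YAML-to-JSON step: `yaml.safe_load` converts unquoted YAML timestamps (for example `2024-01-31`) into `datetime.date` or `datetime.datetime` objects, and `json.dumps` raises `TypeError` on those. Whether the three specifications above contain such values is not shown here. The sketch below simulates a parsed value using only the standard library and passes `default=str` as a fallback. The helper name `to_json_text` is hypothetical and not part of the notebook.

```python
import datetime
import json

def to_json_text(parsed):
    """Serialize parsed YAML data to JSON text (hypothetical helper).

    default=str is only called for objects the json module cannot
    encode natively, such as dates produced by the YAML parser.
    """
    return json.dumps(parsed, indent=2, default=str)

# Simulates what yaml.safe_load would return for `example: 2024-01-31`
parsed = {"info": {"title": "Demo"}, "example": datetime.date(2024, 1, 31)}

try:
    json.dumps(parsed)
    plain_dumps_ok = True
except TypeError:
    # Plain json.dumps cannot encode datetime.date
    plain_dumps_ok = False

json_text = to_json_text(parsed)
```

With this fallback, dates are written as ISO strings instead of aborting the conversion for the whole file.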


## Step 3: Process OpenAPI Specifications

This section processes the OpenAPI JSON files into structured text documents for use in a RAG (Retrieval Augmented Generation) system.

### 3.1 Utility Functions

These functions help with loading and processing OpenAPI specifications.

In [4]:
def load_openapi_spec(file_path):
    """Load and parse the OpenAPI JSON specification."""
    with open(file_path, 'r') as f:
        return json.load(f)

def clean_html(text):
    """Remove HTML tags and clean up formatting."""
    if not isinstance(text, str):
        return "No description provided"
    
    # Remove HTML tags
    clean_text = re.sub(r'<.*?>', ' ', text)
    # Normalize whitespace
    clean_text = re.sub(r'\s+', ' ', clean_text)
    return clean_text.strip()

def extract_extension_fields(method_info):
    """Extract OpenAPI extension fields starting with 'x-'."""
    extensions = {}
    for key, value in method_info.items():
        if key.startswith('x-'):
            extensions[key] = value
    return extensions

def resolve_schema_reference(ref_path, openapi_spec):
    """Resolve a schema reference to its actual schema definition."""
    if not ref_path.startswith('#/'):
        return None  # External references not supported
    
    # Drop the leading '#/' prefix and split the rest by '/'
    # (str.lstrip('#/') strips a character set, not a prefix)
    path_parts = ref_path[2:].split('/')
    
    # Navigate through the OpenAPI spec to find the referenced schema
    current = openapi_spec
    for part in path_parts:
        if isinstance(current, dict) and part in current:
            current = current[part]
        else:
            return None  # Reference path not found
    
    return current
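
`resolve_schema_reference` splits the reference on `/`, which covers the usual `#/components/schemas/Name` case. References that point at keys containing a slash (such as path templates) use JSON Pointer escapes, `~1` for `/` and `~0` for `~`, as defined in RFC 6901. A minimal sketch of a resolver that decodes those escapes follows. `resolve_json_pointer` is a hypothetical name, and percent-encoded fragments are not handled.

```python
def resolve_json_pointer(ref_path, document):
    """Resolve a local '#/...' reference, decoding JSON Pointer escapes.

    Hypothetical variant of resolve_schema_reference; returns None when
    the reference is external or cannot be followed.
    """
    if not ref_path.startswith('#/'):
        return None
    current = document
    for raw_part in ref_path[2:].split('/'):
        # RFC 6901: decode '~1' to '/' first, then '~0' to '~'
        part = raw_part.replace('~1', '/').replace('~0', '~')
        if isinstance(current, dict) and part in current:
            current = current[part]
        elif isinstance(current, list) and part.isdigit() and int(part) < len(current):
            current = current[int(part)]
        else:
            return None
    return current

# Toy specification used only for illustration
spec = {
    "paths": {"/users/{id}": {"get": {"summary": "Get user"}}},
    "components": {"schemas": {"User": {"type": "object"}}},
}
user_schema = resolve_json_pointer('#/components/schemas/User', spec)
operation = resolve_json_pointer('#/paths/~1users~1{id}/get', spec)
missing = resolve_json_pointer('#/components/schemas/Nope', spec)
```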

### 3.2 Schema Processing Functions

These functions generate sample JSON and format schema properties.

In [5]:
def generate_sample_json(schema, openapi_spec=None, is_root=True):
    """Generate a sample JSON object from an OpenAPI schema."""
    if not schema:
        return None
    
    # Handle $ref references
    if '$ref' in schema and openapi_spec:
        # Resolve the reference to get the actual schema
        resolved_schema = resolve_schema_reference(schema['$ref'], openapi_spec)
        if resolved_schema:
            return generate_sample_json(resolved_schema, openapi_spec, is_root)
        return {"$ref": schema['$ref']}  # Fallback if resolution fails
    
    # Handle different schema types
    schema_type = schema.get('type')
    
    if schema_type == 'object':
        result = {}
        if 'properties' in schema:
            for prop_name, prop_schema in schema['properties'].items():
                result[prop_name] = generate_sample_json(prop_schema, openapi_spec, False)
        return result
    
    elif schema_type == 'array':
        if 'items' in schema:
            return [generate_sample_json(schema['items'], openapi_spec, False)]
        return []
    
    elif schema_type == 'string':
        return "string_value"
    
    elif schema_type == 'number' or schema_type == 'integer':
        return 0
    
    elif schema_type == 'boolean':
        return False
    
    # If no specific type or unsupported type
    return None

def format_property(name, details, indent=""):
    """Format a schema property in a RAG-friendly way."""
    prop_type = details.get('type', 'undefined')
    description = details.get('description', 'No description provided')
    description = clean_html(description)
    
    if description == '{…}':
        description = "Additional nested properties (abbreviated in schema)"
    
    return f"{indent}* {name} ({prop_type}): {description}"

def format_schema_properties(schema, indent_level=0):
    """Recursively format schema properties."""
    lines = []
    indent = "  " * indent_level
    
    if not schema:
        return lines
    
    if '$ref' in schema:
        ref = schema['$ref'].split('/')[-1]
        return [f"{indent}References schema: {ref}"]
    
    if 'properties' in schema:
        for prop_name, prop_details in schema['properties'].items():
            lines.append(format_property(prop_name, prop_details, indent))
            
            # Handle nested objects
            if prop_details.get('type') == 'object' and 'properties' in prop_details:
                lines.append(f"{indent}  Nested properties:")
                for nested_lines in format_schema_properties(prop_details, indent_level + 2):
                    lines.append(nested_lines)
            
            # Handle arrays
            if prop_details.get('type') == 'array' and 'items' in prop_details:
                lines.append(f"{indent}  Array items:")
                for nested_lines in format_schema_properties(prop_details['items'], indent_level + 2):
                    lines.append(nested_lines)
    
    if 'required' in schema and schema['required']:
        lines.append(f"{indent}Required fields: {', '.join(schema['required'])}")
    
    return lines
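
Because `generate_sample_json` follows every `$ref` it meets, a schema that refers back to itself (a tree node whose children are nodes, for instance) would recurse until Python raises `RecursionError`. Whether the three specifications processed here contain such schemas is not shown. The following is a sketch of one way to guard against it, tracking the references currently being expanded. `lookup_ref` and `generate_sample_safe` are hypothetical names, and the placeholder values mirror the ones used above.

```python
def lookup_ref(ref, spec):
    """Follow a local '#/a/b/c' reference; return None if it cannot be resolved."""
    if not ref.startswith('#/'):
        return None
    node = spec
    for part in ref[2:].split('/'):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node

def generate_sample_safe(schema, spec, seen=frozenset()):
    """Generate a sample value, stopping at circular $ref chains."""
    if not schema:
        return None
    if '$ref' in schema:
        ref = schema['$ref']
        if ref in seen:
            return None  # Already expanding this schema: break the cycle
        target = lookup_ref(ref, spec)
        if target is None:
            return {"$ref": ref}  # Same fallback as generate_sample_json
        return generate_sample_safe(target, spec, seen | {ref})
    schema_type = schema.get('type')
    if schema_type == 'object':
        return {name: generate_sample_safe(sub, spec, seen)
                for name, sub in schema.get('properties', {}).items()}
    if schema_type == 'array':
        items = schema.get('items')
        return [generate_sample_safe(items, spec, seen)] if items else []
    scalars = {'string': 'string_value', 'number': 0, 'integer': 0, 'boolean': False}
    return scalars.get(schema_type) if isinstance(schema_type, str) else None

# Toy self-referencing schema used only for illustration
spec = {"components": {"schemas": {"Node": {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "children": {"type": "array", "items": {"$ref": "#/components/schemas/Node"}},
    },
}}}}
sample = generate_sample_safe({"$ref": "#/components/schemas/Node"}, spec)
```

The `seen` set is passed down each branch instead of being shared globally, so the same schema can still appear in two sibling properties without being cut off.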

### 3.3 Endpoint Formatting Functions

These functions format API endpoints into a structured text format for RAG systems.

In [6]:
def format_endpoint(path, method_info, http_method, openapi_spec):
    """Format a single endpoint into RAG-friendly text."""
    lines = []
    
    # Basic endpoint information
    title = method_info.get('summary', 'Unnamed Endpoint')
    lines.append(f"ENDPOINT: {title}")
    lines.append(f"PATH: {path}")
    lines.append(f"METHOD: {http_method.upper()}")
    
    # Tags
    if 'tags' in method_info:
        lines.append(f"TAGS: {', '.join(method_info['tags'])}")
    
    # Description
    if 'description' in method_info:
        description = clean_html(method_info['description'])
        lines.append(f"DESCRIPTION: {description}")
    
    # Extension fields (metadata)
    extensions = extract_extension_fields(method_info)
    if extensions:
        lines.append("METADATA:")
        for ext_key, ext_value in extensions.items():
            lines.append(f"  * {ext_key}: {ext_value}")
    
    # Parameters
    if 'parameters' in method_info and method_info['parameters']:
        lines.append("PARAMETERS:")
        for param in method_info['parameters']:
            param_name = param.get('name', 'unnamed')
            param_in = param.get('in', 'undefined')
            param_required = "Required" if param.get('required', False) else "Optional"
            param_description = clean_html(param.get('description', 'No description provided'))
            lines.append(f"  * {param_name} ({param_in}, {param_required}): {param_description}")
    
    # Request Body
    if 'requestBody' in method_info:
        req_body = method_info['requestBody']
        req_required = "Required" if req_body.get('required', False) else "Optional"
        lines.append(f"REQUEST BODY: {req_required}")
        
        if 'content' in req_body:
            for content_type, content_details in req_body['content'].items():
                lines.append(f"  Content Type: {content_type}")
                
                if 'schema' in content_details:
                    schema = content_details['schema']
                    lines.append("  Schema Properties:")
                    lines.extend(format_schema_properties(schema, 2))
                    
                    # Generate sample request JSON
                    sample_json = generate_sample_json(schema, openapi_spec)
                    if sample_json:
                        lines.append("  Sample Request JSON:")
                        lines.append(f"  ```json\n  {json.dumps(sample_json, indent=2)}\n  ```")
    
    # Responses
    if 'responses' in method_info:
        lines.append("RESPONSES:")
        for status_code, response_info in method_info['responses'].items():
            lines.append(f"  Status Code: {status_code}")
            description = clean_html(response_info.get('description', 'No description provided'))
            lines.append(f"  Description: {description}")
            
            if 'content' in response_info:
                for content_type, content_details in response_info['content'].items():
                    lines.append(f"  Content Type: {content_type}")
                    
                    if 'schema' in content_details:
                        schema = content_details['schema']
                        # In the responses section where it handles $ref
                        if '$ref' in schema:
                            ref_path = schema['$ref']
                            schema_name = ref_path.split('/')[-1]
                            
                            # Get the full schema from components
                            full_schema = None
                            if ref_path.startswith('#/components/schemas/'):
                                schema_name = ref_path.split('/')[-1]
                                if 'components' in openapi_spec and 'schemas' in openapi_spec['components'] and schema_name in openapi_spec['components']['schemas']:
                                    full_schema = openapi_spec['components']['schemas'][schema_name]
                            
                            if full_schema:
                                lines.append(f"  Response Schema: {schema_name}")
                                # Add schema description if available
                                if 'description' in full_schema:
                                    description = clean_html(full_schema.get('description', 'No description provided'))
                                    lines.append(f"  Description: {description}")
                                
                                lines.append("  Response Body Properties:")
                                lines.extend(format_schema_properties(full_schema, 2))
                                
                                # Generate sample response JSON
                                sample_json = generate_sample_json(full_schema, openapi_spec)
                                if sample_json:
                                    lines.append("  Sample Response JSON:")
                                    lines.append(f"  ```json\n  {json.dumps(sample_json, indent=2)}\n  ```")
                            else:
                                # Fallback to just showing the reference if we can't find the schema
                                lines.append(f"  Response Schema: References {schema_name}")
                                resolved_schema = resolve_schema_reference(schema['$ref'], openapi_spec)
                                if resolved_schema:
                                    lines.append("  Response Schema Properties:")
                                    lines.extend(format_schema_properties(resolved_schema, 2))
                                    
                                    # Generate sample response JSON from resolved schema
                                    sample_json = generate_sample_json(resolved_schema, openapi_spec)
                                    if sample_json:
                                        lines.append("  Sample Response JSON:")
                                        lines.append(f"  ```json\n  {json.dumps(sample_json, indent=2)}\n  ```")
    
    # Security
    if 'security' in method_info:
        security_schemes = []
        for security_item in method_info['security']:
            for scheme, scopes in security_item.items():
                security_schemes.append(scheme)
        if security_schemes:
            lines.append(f"SECURITY: {', '.join(security_schemes)}")
    
    return "\n".join(lines)

def format_all_endpoints(openapi_spec):
    """Process all endpoints in the OpenAPI specification."""
    all_endpoints = []
    
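    # HTTP methods that OpenAPI allows as operation keys on a path item
    http_methods = {'get', 'put', 'post', 'delete', 'options', 'head', 'patch', 'trace'}
    
    # Process each path and method
    for path, path_item in openapi_spec.get('paths', {}).items():
        for method, method_info in path_item.items():
            # Skip path-level keys that are not operations
            # (parameters, servers, summary, description, $ref, x-* extensions)
            if method.lower() not in http_methods:
                continue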
                
            endpoint_text = format_endpoint(path, method_info, method, openapi_spec)
            all_endpoints.append(endpoint_text)
            all_endpoints.append("--------" * 10) 
    
    return "\n".join(all_endpoints)
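
Since `format_all_endpoints` writes a line of 80 hyphens after every endpoint, the combined text can later be split back into one chunk per endpoint, which is a common unit for retrieval. The notebook itself does not do this. The sketch below shows one possible way, with `split_into_chunks` as a hypothetical helper and a hand-written toy input.

```python
SEPARATOR = "--------" * 10  # Same 80-hyphen line that format_all_endpoints emits

def split_into_chunks(api_text):
    """Split formatted API text into one chunk per endpoint (hypothetical helper)."""
    chunks, current = [], []
    # A trailing separator is appended so the last endpoint is always flushed
    for line in api_text.split("\n") + [SEPARATOR]:
        if line.strip() == SEPARATOR:
            text = "\n".join(current).strip()
            if text:  # Ignore empty chunks between consecutive separators
                chunks.append(text)
            current = []
        else:
            current.append(line)
    return chunks

# Toy input that imitates the structure of the generated text files
api_text = "\n".join([
    "ENDPOINT: List users", "PATH: /users", "METHOD: GET", SEPARATOR,
    "ENDPOINT: Create user", "PATH: /users", "METHOD: POST", SEPARATOR,
])
chunks = split_into_chunks(api_text)
```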

### 3.4 Main Processing Function

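The `main` function processes every OpenAPI JSON file in the `json_files` directory and writes one text file per API. Its relative paths assume the notebook runs from `/workspaces/RAG_BOT/Data Processor`, where Step 2 saved the JSON files.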

In [7]:
def main():
    """Process all OpenAPI spec files in the json_files directory."""
    # Define directories
    json_dir = "./json_files/"
    output_dir = "../ProcessedData/"

    # Ensure output directory exists
    os.makedirs(output_dir, exist_ok=True)

    # Get list of JSON files
    json_files = [os.path.join(json_dir, f) for f in os.listdir(json_dir) if f.endswith('.json')]

    if not json_files:
        print(f"No JSON files found in {json_dir}")
        return

    # Process each file
    endpoint_count = 0

    for json_file in json_files:
        # Get base filename without extension
        base_name = os.path.basename(json_file)
        file_name_without_ext = os.path.splitext(base_name)[0]
        output_file = os.path.join(output_dir, f"{file_name_without_ext}.txt")
        
        # Process the file
        print(f"Processing {base_name}...")
        api_spec = load_openapi_spec(json_file)
        api_text = format_all_endpoints(api_spec)
        
        # Count endpoints
        endpoint_lines = [line for line in api_text.split('\n') if line.startswith("ENDPOINT:")]
        endpoint_count += len(endpoint_lines)
        
        # Write individual file
        with open(output_file, 'w', encoding='utf-8') as f:
            f.write(api_text)
        
        print(f"Created {output_file} with {len(endpoint_lines)} endpoints")

    print(f"Processing complete. {endpoint_count} total endpoints from {len(json_files)} files.")

## Step 4: Execute the OpenAPI Processing

In [8]:
if __name__ == "__main__":
    main()

Processing UserManagement.json...
Created ../ProcessedData/UserManagement.txt with 33 endpoints
Processing PolicyMangement.json...
Created ../ProcessedData/PolicyMangement.txt with 9 endpoints
Processing ApplicationManagement.json...
Created ../ProcessedData/ApplicationManagement.txt with 35 endpoints
Processing complete. 77 total endpoints from 3 files.
