# Data Collector for YAML to JSON Conversion

This notebook:
1. Reads URLs from the OG_Dataset.csv file
2. Downloads YAML files from those URLs
3. Converts the YAML files to JSON format
4. Commits the JSON files to a private Git repository

In [36]:
# Import required libraries
import pandas as pd
import requests
import yaml
import json
import os
from git import Repo
import logging
import shutil
from urllib.parse import urlparse

# Set up logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

## Step 1: Read the CSV file containing URLs

In [37]:
# Read the CSV file
csv_path = '/workspaces/RAG_BOT/OG_Dataset.csv'
df = pd.read_csv(csv_path)

# Display the CSV contents
print(f"Found {len(df)} entries in the CSV file:")
df

Found 3 entries in the CSV file:


| Unnamed: 0 | Name | URL |
|---|---|---|
| 0 | PolicyMangement | https://api.stoplight.io/projects/cHJqOjIxMDg3... |
| 1 | UserManagement | https://api.stoplight.io/projects/cHJqOjIxMDg3... |
| 2 | ApplicationManagement | https://api.stoplight.io/projects/cHJqOjIxMDg3... |
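
Before downloading anything, it can help to check that each row has a usable name and an absolute HTTP(S) URL. The following is a minimal sketch, assuming the rows are available as (name, url) pairs; `validate_entry` and the sample rows are hypothetical and not part of the notebook.

```python
from urllib.parse import urlparse


def validate_entry(name, url):
    """Return a list of problems found in one (name, url) row; empty if the row looks usable."""
    problems = []
    if not isinstance(name, str) or not name.strip():
        problems.append("missing name")
    # Non-string values (for example NaN from an empty CSV cell) are treated as invalid
    parsed = urlparse(url) if isinstance(url, str) else None
    if parsed is None or parsed.scheme not in ("http", "https") or not parsed.netloc:
        problems.append("URL is not an absolute http(s) address")
    return problems


# Hypothetical rows for illustration only
rows = [
    ("UserManagement", "https://example.com/specs/UserManagement.yaml"),
    ("", "ftp://example.com/spec.yaml"),
]
for name, url in rows:
    print(name or "<blank>", validate_entry(name, url))
```

Rows that return a non-empty list could be logged and skipped before the download loop runs.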


## Step 2: Download YAML files and convert to JSON

In [38]:
# Create output directories if they don't exist
output_dir_json = '/workspaces/RAG_BOT/Data Collector/json_files'
output_dir_yaml = '/workspaces/RAG_BOT/Data Collector/yaml_files'
os.makedirs(output_dir_json, exist_ok=True)
os.makedirs(output_dir_yaml, exist_ok=True)

# Process each URL
json_files = []
yaml_files = []

for index, row in df.iterrows():
    try:
        category = row.iloc[0]  # First column is the category name
        url = row.iloc[1]       # Second column is the URL
        
        logger.info(f"Processing {category} from {url}")
        
        # Download the YAML file
        response = requests.get(url, timeout=30)  # Fail instead of hanging on an unresponsive server
        response.raise_for_status()  # Raise exception for HTTP errors
        
        # Save YAML to file
        yaml_filename = f"{category}.yaml"
        yaml_path = os.path.join(output_dir_yaml, yaml_filename)
        
        with open(yaml_path, 'w') as yaml_file:
            yaml_file.write(response.text)
        
        logger.info(f"Saved original YAML to {yaml_path}")
        yaml_files.append(yaml_path)
        
        # Parse YAML content
        yaml_content = yaml.safe_load(response.text)
        
        # Convert to JSON
        json_content = json.dumps(yaml_content, indent=2)
        
        # Save JSON to file
        json_filename = f"{category}.json"
        json_path = os.path.join(output_dir_json, json_filename)
        
        with open(json_path, 'w') as json_file:
            json_file.write(json_content)
        
        logger.info(f"Saved JSON to {json_path}")
        json_files.append(json_path)
        
    except Exception as e:
        logger.error(f"Error processing {url}: {str(e)}")

print(f"Successfully processed {len(json_files)} JSON files and {len(yaml_files)} YAML files")

2025-06-07 06:31:42,926 - INFO - Processing PolicyMangement from https://api.stoplight.io/projects/cHJqOjIxMDg3OQ/branches/main/export/openapi/identity_merged_files/PolicyManagement.yaml
2025-06-07 06:31:45,482 - INFO - Saved original YAML to /workspaces/RAG_BOT/Data Collector/yaml_files/PolicyMangement.yaml
2025-06-07 06:31:45,527 - INFO - Saved JSON to /workspaces/RAG_BOT/Data Collector/json_files/PolicyMangement.json
2025-06-07 06:31:45,527 - INFO - Processing UserManagement from https://api.stoplight.io/projects/cHJqOjIxMDg3OQ/branches/main/export/openapi/identity_merged_files/UserManagement.yaml
2025-06-07 06:31:48,171 - INFO - Saved original YAML to /workspaces/RAG_BOT/Data Collector/yaml_files/UserManagement.yaml
2025-06-07 06:31:48,477 - INFO - Saved JSON to /workspaces/RAG_BOT/Data Collector/json_files/UserManagement.json
2025-06-07 06:31:48,478 - INFO - Processing Applicat

Successfully processed 3 JSON files and 3 YAML files
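
One thing to watch in the conversion step: `yaml.safe_load` turns unquoted YAML dates and timestamps into `datetime` objects, which `json.dumps` rejects with a `TypeError`. The following is a minimal sketch of one workaround using a `default` handler. The `parsed` dictionary is a stand-in for what a YAML document with a date value would load as, and `to_json` is a hypothetical helper, not part of the notebook.

```python
import datetime
import json


def _json_default(value):
    """Serialize date/datetime values as ISO 8601 strings; reject anything else."""
    if isinstance(value, (datetime.date, datetime.datetime)):
        return value.isoformat()
    raise TypeError(f"Object of type {type(value).__name__} is not JSON serializable")


def to_json(data):
    """Dump parsed YAML data to a JSON string, tolerating date values."""
    return json.dumps(data, indent=2, default=_json_default)


# Stand-in for the result of parsing a YAML document with an unquoted date
parsed = {"info": {"title": "Example API", "released": datetime.date(2025, 6, 7)}}
print(to_json(parsed))
```

Whether ISO strings are the right representation depends on how the JSON files are consumed downstream; this sketch only keeps the conversion from failing.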


## Step 3: Configure Private Git Repository

In [39]:
# Configure private Git repository details
def setup_private_repo(repo_url, auth_token):
    """
    Set up credentials for private Git repository
    
    Args:
        repo_url (str): URL of the private Git repository
        auth_token (str): Authentication token for the private repository
    
    Returns:
        str: The repository URL with embedded authentication token
    """
    # Parse repository URL
    parsed_url = urlparse(repo_url)
    
    # Construct repository URL with authentication
    if parsed_url.scheme == "https":
        # Format: https://{token}@github.com/username/repo.git
        auth_url = f"https://{auth_token}@{parsed_url.netloc}{parsed_url.path}"
    else:
        # If not HTTPS, keep URL as is and rely on other authentication methods
        logger.warning("Non-HTTPS repository URL provided. Token authentication might not work.")
        auth_url = repo_url
        
    return auth_url

# Set your private repository details here
private_repo_url = "https://github.com/yourusername/your-private-repo.git"  # Replace with your repository URL
auth_token = "your-auth-token"  # Replace with your personal access token

# Create authenticated repository URL
authenticated_repo_url = setup_private_repo(private_repo_url, auth_token)
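
Because the token is embedded in the URL, any log line or exception message that includes the URL also exposes the token. The following is a minimal sketch of masking credentials before logging; `redact_url` is a hypothetical helper, not part of the notebook.

```python
from urllib.parse import urlparse, urlunparse


def redact_url(url):
    """Replace any user info (token or user:password) in a URL with '***'."""
    parsed = urlparse(url)
    if "@" not in parsed.netloc:
        return url  # No embedded credentials, nothing to mask
    host = parsed.netloc.rsplit("@", 1)[1]
    return urlunparse(parsed._replace(netloc=f"***@{host}"))


print(redact_url("https://example-token@github.com/yourusername/your-private-repo.git"))
```

A helper like this could be applied to any message that might contain the authenticated URL before it is passed to `logger`.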

In [40]:
def commit_to_git(repo_path, files_to_commit, commit_message):
    """Commit files to a Git repository"""
    try:
        # Open the existing repository at repo_path
        repo = Repo(repo_path)
        
        # Check if repo is dirty (has uncommitted changes)
        if repo.is_dirty(untracked_files=True):
            # Add files
            for file_path in files_to_commit:
                relative_path = os.path.relpath(file_path, repo_path)
                repo.git.add(relative_path)
            
            # Commit changes
            repo.git.commit('-m', commit_message)
            logger.info(f"Committed {len(files_to_commit)} files to repository")
            
            # You could add push here if needed
            # repo.git.push()
            
            return True
        else:
            logger.info("No changes to commit")
            return False
    
    except Exception as e:
        logger.error(f"Git error: {str(e)}")
        return False

In [41]:
def commit_to_private_repo(repo_url, files_to_commit, commit_message):
    """
    Clone private repository, add files, commit and push changes
    
    Args:
        repo_url (str): URL of the private Git repository with authentication token
        files_to_commit (list): List of file paths to commit
        commit_message (str): Commit message
    
    Returns:
        bool: True if successful, False otherwise
    """
    temp_dir = '/tmp/private_repo_clone'
    
    try:
        # Remove temp directory if it exists
        if os.path.exists(temp_dir):
            shutil.rmtree(temp_dir)
        
        # Clone the repository
        logger.info("Cloning private repository...")
        repo = Repo.clone_from(repo_url, temp_dir)
        
        # Create target directory in the cloned repo
        target_dir = os.path.join(temp_dir, 'json_files')
        os.makedirs(target_dir, exist_ok=True)
        
        # Copy files to the target directory
        for file_path in files_to_commit:
            file_name = os.path.basename(file_path)
            target_path = os.path.join(target_dir, file_name)
            shutil.copy2(file_path, target_path)
            logger.info(f"Copied {file_path} to {target_path}")
        
        # Add all files
        repo.git.add(A=True)
        
        # Check if there are changes to commit
        if repo.is_dirty(untracked_files=True):
            # Commit changes
            repo.git.commit('-m', commit_message)
            logger.info(f"Committed {len(files_to_commit)} files to private repository")
            
            # Push changes
            logger.info("Pushing changes to private repository...")
            repo.git.push()
            logger.info("Successfully pushed changes to private repository")
            
            return True
        else:
            logger.info("No changes to commit in private repository")
            return False
            
    except Exception as e:
        logger.error(f"Error with private repository: {str(e)}")
        return False
    finally:
        # Clean up - remove temp directory
        if os.path.exists(temp_dir):
            shutil.rmtree(temp_dir)
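
The fixed path `/tmp/private_repo_clone` can collide if two runs overlap. The following is a minimal sketch of the same clone-and-clean-up pattern using a unique temporary directory from the standard library. `with_temp_clone_dir`, `work`, and `demo` are hypothetical names, and the real cloning step is replaced by a stand-in so the example runs without Git.

```python
import os
import tempfile


def with_temp_clone_dir(work):
    """Run work(path) inside a unique temporary directory that is removed afterwards."""
    with tempfile.TemporaryDirectory(prefix="private_repo_clone_") as temp_dir:
        return work(temp_dir)


def demo(path):
    # Stand-in for Repo.clone_from(repo_url, path) plus the copy/commit/push steps
    target_dir = os.path.join(path, "json_files")
    os.makedirs(target_dir, exist_ok=True)
    return path, os.path.isdir(target_dir)


used_path, created = with_temp_clone_dir(demo)
print(created, os.path.exists(used_path))
```

The context manager removes the directory even when `work` raises, which mirrors the `finally` block above.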

In [42]:
from datetime import datetime

# Commit the files to the repository
commit_message = "Add converted JSON files from YAML sources - " + datetime.now().strftime("%Y-%m-%d %H:%M:%S")

if json_files:
    # # For local workspace repository (original approach)
    # workspace_repo_path = '/workspaces/RAG_BOT'
    # workspace_success = commit_to_git(workspace_repo_path, json_files, commit_message)
    
    # if workspace_success:
    #     print("Successfully committed files to workspace repository")
    # else:
    #     print("Failed to commit files to workspace repository")
    
    # For private repository (new approach)
    # Read the token from the environment; never hard-code a real token in the notebook.
    # GITHUB_TOKEN is an assumed variable name; set it before running this cell.
    auth_token = os.environ["GITHUB_TOKEN"]
    private_repo_url = "https://github.com/Venkata-Thrivedi-WILP/DataStore.git"
    authenticated_repo_url = setup_private_repo(private_repo_url, auth_token)
    private_success = commit_to_private_repo(authenticated_repo_url, json_files, commit_message)
    
    if private_success:
        print("Successfully committed files to private repository")
    else:
        print("Failed to commit files to private repository")
else:
    print("No files to commit")

2025-06-07 06:31:51,015 - INFO - Cloning private repository...
2025-06-07 06:31:52,217 - INFO - Copied /workspaces/RAG_BOT/Data Collector/json_files/PolicyMangement.json to /tmp/private_repo_clone/json_files/PolicyMangement.json
2025-06-07 06:31:52,218 - INFO - Copied /workspaces/RAG_BOT/Data Collector/json_files/UserManagement.json to /tmp/private_repo_clone/json_files/UserManagement.json
2025-06-07 06:31:52,219 - INFO - Copied /workspaces/RAG_BOT/Data Collector/json_files/ApplicationManagement.json to /tmp/private_repo_clone/json_files/ApplicationManagement.json
2025-06-07 06:31:52,245 - INFO - Committed 3 files to private repository
2025-06-07 06:31:52,246 - INFO - Pushing changes to private repository...
2025-06-07 06:31:53,541 - INFO - Successfully pushed changes to private repository


Successfully committed files to private repository
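
A personal access token written into a notebook cell is visible to anyone who can read the file or its history. The following is a minimal sketch of loading it from the environment instead; the variable name `GITHUB_TOKEN` and the helper `load_token` are assumptions for illustration.

```python
import os


def load_token(var_name="GITHUB_TOKEN", environ=None):
    """Return the token stored in the given environment variable, or raise if unset."""
    environ = os.environ if environ is None else environ
    token = environ.get(var_name, "").strip()
    if not token:
        raise RuntimeError(f"Set the {var_name} environment variable before running this cell")
    return token


# Illustration with an explicit mapping instead of the real environment
print(load_token(environ={"GITHUB_TOKEN": "example-token"}))
```

Failing early with a clear message avoids attempting a clone with an empty credential.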


## Summary

The notebook has:
1. Read the CSV file with URLs
2. Downloaded YAML files from those URLs
3. Converted the YAML files to JSON format
4. Saved the original YAML files and the converted JSON files to local output directories
5. Committed and pushed the JSON files to a private Git repository (a helper for committing to the local workspace repository is also defined)

To use the private repository functionality:
1. Set the repository URL and personal access token in the final cell to the values for your own repository
2. Keep the token out of the notebook itself, for example by reading it from an environment variable
3. Run the notebook to process the files and commit them to your private repository