## Agent Configuration  

This section covers the configuration of the agent, including:  
* Defining the data context that the agent will interact with  
* Setting up the routine the agent will follow as a system prompt (embedding the data context)  
* Establishing the list of tools available for the agent to complete its tasks  

In [1]:
def get_namespace_from_service_account():
    """
    Reads the Kubernetes namespace from the service account mount point.
    Returns 'default' if not running in a Kubernetes pod or if the file doesn't exist.
    """
    namespace_file = '/var/run/secrets/kubernetes.io/serviceaccount/namespace'
    try:
        with open(namespace_file, 'r') as f:
            return f.read().strip()
    except IOError:
        return 'default'

# Set NAMESPACE variable
NAMESPACE = get_namespace_from_service_account()

In [2]:
# Configuration (same as above)
NAMESPACE = get_namespace_from_service_account()
PG_SERVICE_NAME = "vince-retail-postgres"
PG_DATABASE_NAME = "vince-retail"
POSTGRES_PASSWORD = "postgres"

def get_db_connection(retries=3, delay=5):
    """Get connection with retries"""
    for attempt in range(retries):
        try:
            conn = psycopg2.connect(
                host=f"{PG_SERVICE_NAME}.{NAMESPACE}.svc.cluster.local",
                database=PG_DATABASE_NAME,
                user="postgres",
                password=POSTGRES_PASSWORD,
                port="5432",
                connect_timeout=5
            )
            return conn
        except psycopg2.OperationalError as e:
            if attempt == retries - 1:
                raise
            print(f"Connection failed (attempt {attempt + 1}), retrying...")
            time.sleep(delay)

In [3]:
from pathlib import Path
from llama_index.core import Settings
from llama_index.llms.nvidia import NVIDIA
from llama_index.embeddings.nvidia import NVIDIAEmbedding
from typing import List, Dict, Any
import pandas as pd
import psycopg2
from datetime import datetime

# 1. Read JWT Token
def get_nvidia_auth_token():
    token_path = Path("/etc/secrets/ezua/.auth_token")
    if token_path.exists():
        with open(token_path, "r") as f:
            return f.read().strip()
    raise ValueError("NVIDIA auth token not found at /etc/secrets/ezua/.auth_token")

nvidia_api_key = get_nvidia_auth_token()

# 2. NVIDIA NIM Setup
llm = NVIDIA(
    base_url="https://llama-3-1-8b-6efc4543-predictor-ezai-services.hpepcai-ingress.pcai.hpecic.net/v1",
    model="meta/llama-3.1-8b-instruct",
    api_key=nvidia_api_key,
    temperature=0.1,
    max_tokens=1024
)
Settings.llm = llm

embed_model = NVIDIAEmbedding(
    model="nvidia/nv-embedqa-e5-v5",
    base_url="https://embedding-v5-8da38070-predictor-ezai-services.hpepcai-ingress.pcai.hpecic.net/v1",
    api_key=nvidia_api_key,
    truncate="END"
)
Settings.embed_model = embed_model

# 3. System Prompt Configuration
DATA_ENGINEER_SYSTEM_PROMPT = """
You are a senior data engineer specializing in retail data cleaning and preparation for analytics. 
Your task is to clean and transform the vince-retail database which contains tables:
- source_catalog (product information)
- source_customers (customer information)
- source_stock (inventory data)
- source_orders (order headers)
- source_order_products (order line items)

Key data quality issues to address:
1. Missing values (NULLs)
2. Inconsistent categories/spelling (e.g., "Toyz" vs "Toys")
3. Invalid dates (future dates or very old dates)
4. Negative quantities
5. Invalid emails
6. Duplicate records
7. Referential integrity issues

Approach:
1. First analyze the data to identify issues
2. Propose cleaning strategies based on data type and business context
3. Execute cleaning only after user confirmation
4. Document all changes made

Available operations:
- Handle missing values (drop, impute, flag)
- Standardize categorical values
- Validate and correct dates
- Fix numeric anomalies
- Validate email formats
- Deduplicate records
- Enforce referential integrity

Always:
- Explain your reasoning
- Show samples before/after changes
- Preserve raw data (create cleaned views/tables)
- Consider business impact of changes
"""


HTTP Request: GET https://llama-3-1-8b-6efc4543-predictor-ezai-services.hpepcai-ingress.pcai.hpecic.net/v1/models "HTTP/1.1 200 OK"


## Agent Runtime
This section covers the code executed while the agent is in action, including:
* Preparing the tools for use by the agent
* The agent's runtime function

In [4]:
class ProactiveDataEngineerAgent:
    def __init__(self, db_tools):
        self.tools = db_tools
        self.cleaning_log = []
        self.auto_approve = False  # Set to True for fully automatic mode
        
    def analyze_and_clean(self, table: str = None, auto_approve: bool = None):
        """Comprehensive analysis and automated cleaning with optional approval"""
        if auto_approve is not None:
            self.auto_approve = auto_approve
            
        print(f"\n{'='*40}\nStarting Data Quality Assessment\n{'='*40}")
        
        # 1. Perform analysis
        analysis = self.tools.get_data_issues_report()
        if table:
            analysis = {table: analysis.get(table, {})}
        
        # 2. Generate cleaning plan
        cleaning_plan = self._generate_cleaning_plan(analysis)
        
        # 3. Execute cleaning with optional approval
        results = self._execute_cleaning_plan(cleaning_plan)
        
        # 4. Generate report
        self._generate_cleaning_report(results)
        
        return results
    
    def _generate_cleaning_plan(self, analysis: Dict) -> List[Dict]:
        """Convert analysis into executable cleaning operations"""
        cleaning_plan = []
        
        for table, issues in analysis.items():
            print(f"\nAnalyzing {table}...")
            
            # Handle null values
            for col, null_count in [(k,v) for k,v in issues.items() if '_nulls' in k]:
                col_name = col.replace('_nulls', '')
                if null_count > 0:
                    op = self._create_null_handling_operation(table, col_name, null_count)
                    cleaning_plan.append(op)
            
            # Handle negative quantities
            if 'negative_quantities' in issues and issues['negative_quantities'] > 0:
                op = self._create_quantity_fix_operation(table, issues['negative_quantities'])
                cleaning_plan.append(op)
            
            # Handle date issues
            for col, future_count in [(k,v) for k,v in issues.items() if '_future_dates' in k]:
                col_name = col.replace('_future_dates', '')
                if future_count > 0:
                    op = self._create_date_fix_operation(table, col_name, future_count)
                    cleaning_plan.append(op)
        
        return cleaning_plan
    
    def _create_null_handling_operation(self, table: str, column: str, count: int) -> Dict:
        """Generate operation for handling null values"""
        schema = self.tools.get_table_schema(table)
        col_type = next((c['type'] for c in schema['columns'] if c['name'] == column), 'text')
        
        # Smart default values based on column type
        if col_type in ['integer', 'bigint', 'numeric']:
            default = 0
            method = "Zero-fill"
        elif col_type == 'boolean':
            default = False
            method = "False-fill"
        elif 'date' in col_type.lower():
            default = '1970-01-01'
            method = "Epoch-date-fill"
        else:
            default = 'Unknown'
            method = "Unknown-fill"
        
        return {
            'id': f"null_fix_{table}_{column}",
            'table': table,
            'column': column,
            'type': 'fill_nulls',
            'value': default,
            'description': f"{method} for {count} null values in {table}.{column}",
            'severity': 'high' if count > 10 else 'medium'
        }
    
    def _create_quantity_fix_operation(self, table: str, count: int) -> Dict:
        """Generate operation for fixing negative quantities"""
        return {
            'id': f"quantity_fix_{table}",
            'table': table,
            'column': 'product_quantity',
            'type': 'fix_numeric_anomalies',
            'description': f"Correct {count} negative quantities in {table}",
            'action': 'abs',
            'severity': 'high'
        }
    
    def _create_date_fix_operation(self, table: str, column: str, count: int) -> Dict:
        """Generate operation for fixing invalid dates"""
        return {
            'id': f"date_fix_{table}_{column}",
            'table': table,
            'column': column,
            'type': 'correct_dates',
            'description': f"Correct {count} invalid future dates in {table}.{column}",
            'correct_value': datetime.now().date(),
            'severity': 'high'
        }
    
    def _execute_cleaning_plan(self, plan: List[Dict]) -> Dict:
        """Execute cleaning operations with optional approval"""
        results = {}
        
        for op in sorted(plan, key=lambda x: x['severity'], reverse=True):
            print(f"\nOperation: {op['description']}")
            print(f"Type: {op['type']}")
            print(f"Severity: {op['severity'].upper()}")
            
            if not self.auto_approve:
                confirm = input("Execute this operation? (y/n/a for all): ").lower()
                if confirm == 'a':
                    self.auto_approve = True
                elif confirm != 'y':
                    results[op['id']] = {'status': 'skipped'}
                    continue
            
            try:
                result = self.tools.clean_data([op])
                results[op['id']] = result[op['id']]
                self.cleaning_log.append({
                    'timestamp': datetime.now().isoformat(),
                    'operation': op,
                    'result': result
                })
                print(f"Result: {result[op['id']]}")
            except Exception as e:
                print(f"Error executing operation: {str(e)}")
                results[op['id']] = {'status': 'error', 'message': str(e)}
        
        return results
    
    def _generate_cleaning_report(self, results: Dict):
        """Generate summary report of cleaning operations"""
        print("\n" + "="*40)
        print("Data Cleaning Report")
        print("="*40)
        
        stats = {
            'executed': 0,
            'skipped': 0,
            'errors': 0,
            'rows_affected': 0
        }
        
        for op_id, result in results.items():
            if result.get('status') == 'success':
                stats['executed'] += 1
                stats['rows_affected'] += result.get('rows_affected', 0)
            elif result.get('status') == 'skipped':
                stats['skipped'] += 1
            else:
                stats['errors'] += 1
        
        print(f"\nOperations Executed: {stats['executed']}")
        print(f"Operations Skipped: {stats['skipped']}")
        print(f"Operations Failed: {stats['errors']}")
        print(f"Total Rows Affected: {stats['rows_affected']}")
        
        if stats['errors'] > 0:
            print("\nFailed Operations:")
            for op_id, result in results.items():
                if result.get('status') not in ('success', 'skipped'):
                    print(f"- {op_id}: {result.get('message', 'Unknown error')}")

In [5]:
import polars as pl
from typing import Dict, List, Any, Optional
import json
from enum import Enum
from datetime import datetime

class DataEngineerTools:
    """Tools for the data engineer agent to interact with the database using Polars"""
    
    def __init__(self):
        self.conn = get_db_connection()
    
    def get_table_schema(self, table_name: str) -> Dict:
        """Get schema information for a table"""
        with self.conn.cursor() as cursor:
            cursor.execute(f"""
                SELECT column_name, data_type, is_nullable 
                FROM information_schema.columns 
                WHERE table_name = %s
            """, (table_name,))
            return {
                "columns": [
                    {"name": row[0], "type": row[1], "nullable": row[2] == 'YES'} 
                    for row in cursor.fetchall()
                ]
            }
    
    def sample_table_data(self, table_name: str, limit: int = 5) -> List[Dict]:
        """Get sample data from a table"""
        with self.conn.cursor() as cursor:
            cursor.execute(f"SELECT * FROM {table_name} LIMIT %s", (limit,))
            columns = [desc[0] for desc in cursor.description]
            return [dict(zip(columns, row)) for row in cursor.fetchall()]
    
    def execute_query(self, query: str) -> pl.DataFrame:
        """Execute a SQL query and return results as Polars DataFrame"""
        with self.conn.cursor() as cursor:
            cursor.execute(query)
            columns = [desc[0] for desc in cursor.description]
            data = cursor.fetchall()
            return pl.DataFrame(data, schema=columns)
    
    def get_data_issues_report(self) -> Dict:
        """Generate a data quality report using Polars"""
        report = {}
        tables = ['source_catalog', 'source_customers', 'source_stock', 
                 'source_orders', 'source_order_products']
        
        for table in tables:
            report[table] = {}
            schema = self.get_table_schema(table)
            
            # Count nulls per column
            for col in schema['columns']:
                null_count = self.execute_query(
                    f"SELECT COUNT(*) FROM {table} WHERE {col['name']} IS NULL"
                ).get_column("count")[0]
                report[table][f"{col['name']}_nulls"] = null_count
                
            # Check for negative quantities where applicable
            if 'quantity' in [c['name'] for c in schema['columns']]:
                neg_count = self.execute_query(
                    f"SELECT COUNT(*) FROM {table} WHERE product_quantity < 0"
                ).get_column("count")[0]
                report[table]['negative_quantities'] = neg_count
                
            # Check date validity for date columns
            date_cols = [c['name'] for c in schema['columns'] if 'date' in c['name'].lower()]
            for col in date_cols:
                future_dates = self.execute_query(
                    f"""SELECT COUNT(*) FROM {table} 
                    WHERE {col} > CURRENT_DATE + INTERVAL '1 year'"""
                ).get_column("count")[0]
                report[table][f"{col}_future_dates"] = future_dates
                
        return report
    
    def clean_data(self, operations: List[Dict]) -> Dict:
        """Execute cleaning operations"""
        results = {}
        with self.conn.cursor() as cursor:
            for op in operations:
                try:
                    if op['type'] == 'fill_nulls':
                        cursor.execute(f"""
                            UPDATE {op['table']} 
                            SET {op['column']} = %s 
                            WHERE {op['column']} IS NULL
                        """, (op['value'],))
                        results[op['id']] = {
                            "status": "success", 
                            "rows_affected": cursor.rowcount
                        }
                        
                    elif op['type'] == 'fix_numeric_anomalies':
                        if op['action'] == 'abs':
                            cursor.execute(f"""
                                UPDATE {op['table']} 
                                SET {op['column']} = ABS({op['column']})
                                WHERE {op['column']} < 0
                            """)
                        elif op['action'] == 'set':
                            cursor.execute(f"""
                                UPDATE {op['table']} 
                                SET {op['column']} = %s 
                                WHERE {op['column']} < 0
                            """, (op['value'],))
                        results[op['id']] = {
                            "status": "success", 
                            "rows_affected": cursor.rowcount
                        }
                    
                    elif op['type'] == 'standardize_category':
                        for wrong, correct in op['mappings'].items():
                            cursor.execute(f"""
                                UPDATE {op['table']} 
                                SET {op['column']} = %s 
                                WHERE {op['column']} = %s
                            """, (correct, wrong))
                        results[op['id']] = {
                            "status": "success", 
                            "rows_affected": cursor.rowcount
                        }
                    
                    elif op['type'] == 'remove_duplicates':
                        cursor.execute(f"""
                            CREATE TEMP TABLE temp_{op['table']} AS 
                            SELECT DISTINCT ON ({','.join(op['key_columns'])}) * 
                            FROM {op['table']}
                        """)
                        cursor.execute(f"TRUNCATE {op['table']}")
                        cursor.execute(f"""
                            INSERT INTO {op['table']} 
                            SELECT * FROM temp_{op['table']}
                        """)
                        cursor.execute(f"DROP TABLE temp_{op['table']}")
                        results[op['id']] = {"status": "success"}
                        
                    elif op['type'] == 'correct_dates':
                        cursor.execute(f"""
                            UPDATE {op['table']} 
                            SET {op['column']} = %s 
                            WHERE {op['column']} > CURRENT_DATE + INTERVAL '1 year'
                                OR {op['column']} < DATE '1900-01-01'
                        """, (op['correct_value'],))
                        results[op['id']] = {
                            "status": "success", 
                            "rows_affected": cursor.rowcount
                        }
                        
                    elif op['type'] == 'validate_emails':
                        cursor.execute(f"""
                            UPDATE {op['table']} 
                            SET {op['column']} = %s 
                            WHERE {op['column']} IS NOT NULL
                            AND {op['column']} !~ '^[A-Za-z0-9._%-]+@[A-Za-z0-9.-]+[.][A-Za-z]+$'
                        """, (op['default_value'],))
                        results[op['id']] = {
                            "status": "success", 
                            "rows_affected": cursor.rowcount
                        }
                        
                    self.conn.commit()
                except Exception as e:
                    self.conn.rollback()
                    results[op['id']] = {"status": "error", "message": str(e)}
        return results

## Running the Agent

In [6]:
def run_proactive_data_engineer(auto_mode=True):
    """Run the proactive data engineer with optional auto-approval"""
    tools = DataEngineerTools()
    agent = ProactiveDataEngineerAgent(tools)
    agent.auto_approve = auto_mode
    
    print("""
    Proactive Data Engineer Agent
    ----------------------------
    This agent will:
    1. Analyze your data for quality issues
    2. Propose automated fixes
    3. Execute fixes with your approval (unless in auto mode)
    4. Generate detailed reports
    
    Available commands:
    - analyze [table]: Analyze specific table or all tables
    - auto on/off: Toggle auto-approval mode
    - history: Show cleaning history
    - exit: End session
    """)
    
    while True:
        try:
            nvidia_api_key = get_nvidia_auth_token()
            command = input("\nCommand: ").strip().lower()
            
            if command in ['exit', 'quit']:
                break
            elif command.startswith('analyze'):
                table = command.split()[1] if len(command.split()) > 1 else None
                agent.analyze_and_clean(table)
            elif command == 'auto on':
                agent.auto_approve = True
                print("Auto-approval mode enabled")
            elif command == 'auto off':
                agent.auto_approve = False
                print("Auto-approval mode disabled")
            elif command == 'history':
                print("\nCleaning History:")
                for log in agent.cleaning_log:
                    print(f"{log['timestamp']}: {log['operation']['description']}")
                    print(f"  Result: {log['result']}")
            elif command == 'help':
                print("Commands: analyze [table], auto on/off, history, exit")
            else:
                print("Invalid command. Type 'help' for options")
                
        except KeyboardInterrupt:
            break
        except Exception as e:
            print(f"Error: {str(e)}")
    
    print("\nSession ended. Final cleaning summary:")
    for log in agent.cleaning_log[-5:]:  # Show last 5 operations
        print(f"- {log['operation']['description']}")
        print(f"  Rows affected: {log['result'].get('rows_affected', 'N/A')}")

if __name__ == "__main__":
    run_proactive_data_engineer(auto_mode=True)  # Set to True for fully automatic


    Proactive Data Engineer Agent
    ----------------------------
    This agent will:
    1. Analyze your data for quality issues
    2. Propose automated fixes
    3. Execute fixes with your approval (unless in auto mode)
    4. Generate detailed reports
    
    Available commands:
    - analyze [table]: Analyze specific table or all tables
    - auto on/off: Toggle auto-approval mode
    - history: Show cleaning history
    - exit: End session
    



Command:  analyze



Starting Data Quality Assessment

Analyzing source_catalog...

Analyzing source_customers...

Analyzing source_stock...

Analyzing source_orders...

Analyzing source_order_products...

Operation: Zero-fill for 3 null values in source_catalog.price_cents
Type: fill_nulls
Severity: MEDIUM
Result: {'status': 'success', 'rows_affected': 3}

Operation: Unknown-fill for 6 null values in source_catalog.product_name
Type: fill_nulls
Severity: MEDIUM
Result: {'status': 'success', 'rows_affected': 6}

Operation: Unknown-fill for 5 null values in source_catalog.product_category
Type: fill_nulls
Severity: MEDIUM
Result: {'status': 'success', 'rows_affected': 5}

Operation: Unknown-fill for 5 null values in source_customers.customer_surname
Type: fill_nulls
Severity: MEDIUM
Result: {'status': 'success', 'rows_affected': 5}

Operation: Unknown-fill for 4 null values in source_customers.customer_email
Type: fill_nulls
Severity: MEDIUM
Result: {'status': 'success', 'rows_affected': 4}

Operation: Zer


Command:  exit



Session ended. Final cleaning summary:
- Zero-fill for 161 null values in source_stock.product_quantity
  Rows affected: N/A
- Zero-fill for 12 null values in source_orders.customer_id
  Rows affected: N/A
- Epoch-date-fill for 11 null values in source_orders.order_date
  Rows affected: N/A
- Correct 3 invalid future dates in source_orders.order_date
  Rows affected: N/A
- Zero-fill for 587 null values in source_order_products.product_quantity
  Rows affected: N/A
