# Spider 2 BigQuery Connection Testing Notebook

This notebook tests the connection to the Spider 2 dataset on BigQuery and demonstrates text-to-SQL functionality.

In [None]:
import sys
import os
import pandas as pd
import json

# Add backend to path
sys.path.append(os.path.join(os.getcwd(), '..', 'backend'))

from dotenv import load_dotenv
load_dotenv('../.env')

# Import our services
from data_connection.bigquery_client import BigQueryClient, connect_to_spider2
from data_connection.spider2_service import Spider2Service

## Setup BigQuery Authentication

**Important**: Before running this notebook, complete these steps:

1. **Google Cloud Project**: Set up a Google Cloud project.
2. **BigQuery API**: Enable the BigQuery API in your project.
3. **Authentication**: Choose one of the methods below.

### Option 1: Service Account (Recommended for production)
```bash
# Download service account JSON key
# Set environment variable:
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/your/service-account-key.json"
```

### Option 2: User Credentials (Good for development)
```bash
# Install gcloud CLI and authenticate
gcloud auth application-default login
```

### Option 3: Environment Variables
Set in your `.env` file:
```
GOOGLE_CLOUD_PROJECT=your-project-id
GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json
# OR
GOOGLE_CREDENTIALS_JSON={"type": "service_account", ...}
```
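
To see which of these options is active before creating a client, a quick look at the environment can help. The following is a minimal sketch using only the standard library. `detect_auth_method` is a hypothetical helper, not part of the project's `BigQueryClient`, and it inspects only the variable names listed above. When neither variable is set, credentials created by `gcloud auth application-default login` may still be picked up, which this helper cannot verify.

```python
import json
import os


def detect_auth_method(env=None):
    """Return a short label for the credential source configured in env.

    Hypothetical helper for illustration; it checks only the variable
    names used in this notebook's .env example.
    """
    env = os.environ if env is None else env

    key_path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if key_path:
        # Path to a service account key file (Option 1 or Option 3)
        if os.path.isfile(key_path):
            return "service_account_file"
        return "service_account_file_missing"

    inline_json = env.get("GOOGLE_CREDENTIALS_JSON")
    if inline_json:
        # Key material stored inline as JSON (Option 3 alternative)
        try:
            info = json.loads(inline_json)
        except json.JSONDecodeError:
            return "inline_json_invalid"
        if isinstance(info, dict) and info.get("type") == "service_account":
            return "inline_service_account"
        return "inline_json_unrecognized"

    # Nothing explicit is set; application-default credentials may still exist
    return "application_default_or_none"


print(f"Credential source: {detect_auth_method()}")
```

Passing a plain dictionary instead of relying on `os.environ` makes the check easy to try with different configurations.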

## Test Basic BigQuery Connection

In [None]:
# Test basic BigQuery connection
try:
    # Initialize BigQuery client
    client = BigQueryClient()
    print(f"✓ Connected to BigQuery project: {client.project_id}")
    
    # Test a simple query
    test_query = "SELECT 'Hello BigQuery!' as message, CURRENT_TIMESTAMP() as timestamp"
    result = client.query(test_query)
    print("✓ Test query successful:")
    print(result)
    
except Exception as e:
    print(f"❌ BigQuery connection failed: {e}")
    print("\nPlease check your authentication setup (see instructions above)")

## Explore Spider 2 Dataset

In [None]:
# Explore Spider 2 dataset
try:
    # List available tables in Spider 2
    spider_tables = client.list_spider2_tables()
    print(f"✓ Found {len(spider_tables)} tables in Spider 2 dataset")
    print("\nFirst 10 tables:")
    for table in spider_tables[:10]:
        print(f"  - {table}")
    
    if len(spider_tables) > 10:
        print(f"  ... and {len(spider_tables) - 10} more")
        
except Exception as e:
    print(f"❌ Failed to access Spider 2 dataset: {e}")
    print("Note: Make sure your project has access to the public Spider 2 dataset")

## Examine Table Schema

In [None]:
# Get schema information for a specific table
if 'spider_tables' in locals() and len(spider_tables) > 0:
    # Use the first available table
    sample_table = spider_tables[0]
    
    try:
        schema_info = client.get_table_schema(sample_table)
        print(f"✓ Schema for table '{sample_table}':")
        print(json.dumps(schema_info, indent=2, default=str))
        
    except Exception as e:
        print(f"❌ Failed to get schema for {sample_table}: {e}")
else:
    print("⚠ No tables available to examine schema")

## Sample Data from Spider 2

In [None]:
# Get sample data from a table
if 'sample_table' in locals():
    try:
        sample_data = client.sample_table_data(sample_table, limit=5)
        print(f"✓ Sample data from '{sample_table}':")
        print(sample_data)
        
    except Exception as e:
        print(f"❌ Failed to get sample data from {sample_table}: {e}")
else:
    print("⚠ No table available for sampling")

## Test Spider 2 Service

In [None]:
# Initialize Spider 2 service
try:
    spider_service = Spider2Service()
    print("✓ Spider 2 service initialized")
    
    # Get available databases
    databases = spider_service.get_available_databases()
    print(f"\n✓ Found {len(databases)} databases in Spider 2:")
    for db in databases[:10]:  # Show first 10
        print(f"  - {db}")
    
    if len(databases) > 10:
        print(f"  ... and {len(databases) - 10} more")
        
except Exception as e:
    print(f"❌ Failed to initialize Spider 2 service: {e}")

## Get Database Schema

In [None]:
# Get schema for a specific database
if 'databases' in locals() and len(databases) > 0:
    # Use the first available database
    sample_db = databases[0]
    
    try:
        db_schema = spider_service.get_database_schema(sample_db)
        print(f"✓ Schema for database '{sample_db}':")
        print(f"  - Database: {db_schema['database_name']}")
        print(f"  - Tables: {db_schema['table_count']}")
        
        print("\n  Table details:")
        for table_name, table_info in list(db_schema['tables'].items())[:3]:  # Show first 3 tables
            print(f"    {table_name}: {len(table_info.get('columns', []))} columns, {table_info.get('num_rows', 'N/A')} rows")
        
        if len(db_schema['tables']) > 3:
            print(f"    ... and {len(db_schema['tables']) - 3} more tables")
            
    except Exception as e:
        print(f"❌ Failed to get schema for database {sample_db}: {e}")
else:
    print("⚠ No databases available to examine")

## Test Text-to-SQL Generation

In [None]:
# Test SQL generation for a database
if 'sample_db' in locals():
    try:
        # Test natural language to SQL conversion
        test_question = "Show me all records from the first table"
        
        print(f"Natural Language: {test_question}")
        print(f"Target Database: {sample_db}")
        
        generated_sql = spider_service.generate_sql_for_database(
            natural_language_query=test_question,
            database_name=sample_db
        )
        
        print(f"\n✓ Generated SQL:")
        print(generated_sql)
        
    except Exception as e:
        print(f"❌ Failed to generate SQL: {e}")
else:
    print("⚠ No database available for SQL generation")

## Execute and Analyze Query

In [None]:
# Test complete workflow: generate SQL and execute it
if 'sample_db' in locals():
    try:
        # Test with a simple counting query
        test_question = "How many records are in each table?"
        
        print(f"Testing complete workflow...")
        print(f"Question: {test_question}")
        print(f"Database: {sample_db}")
        
        # Execute and analyze
        result = spider_service.execute_and_analyze(
            natural_language_query=test_question,
            database_name=sample_db,
            execute_query=True  # Set to False to only generate SQL without executing
        )
        
        print(f"\n✓ Complete workflow result:")
        print(f"Generated SQL: {result['generated_sql']}")
        
        if result['execution_results']:
            if result['execution_results']['success']:
                print(f"\n✓ Execution successful:")
                print(f"  Rows returned: {result['execution_results']['row_count']}")
                print(f"  Columns: {result['execution_results']['columns']}")
                if result['execution_results']['sample_data']:
                    print(f"  Sample data: {result['execution_results']['sample_data'][:2]}")
            else:
                print(f"❌ Execution failed: {result['execution_results']['error']}")
        
        if result['analysis']:
            print(f"\n📊 AI Analysis:")
            print(result['analysis'])
            
    except Exception as e:
        print(f"❌ Complete workflow failed: {e}")
else:
    print("⚠ No database available for complete workflow test")
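
The nested lookups in the cell above raise `KeyError` if the service omits a key. A more tolerant way to format the same information might look like the sketch below. `summarize_workflow_result` is a hypothetical helper, it assumes only the key names already used in that cell, and the `demo` dictionary is made up for illustration.

```python
def summarize_workflow_result(result):
    """Build printable summary lines from an execute_and_analyze-style dict.

    Assumes the key names used in the notebook cell; missing keys are tolerated.
    """
    lines = [f"Generated SQL: {result.get('generated_sql', 'N/A')}"]

    execution = result.get("execution_results") or {}
    if execution.get("success"):
        lines.append(f"Rows returned: {execution.get('row_count', 'N/A')}")
        lines.append(f"Columns: {execution.get('columns', [])}")
        sample = execution.get("sample_data") or []
        if sample:
            # Show at most two rows, as the notebook cell does
            lines.append(f"Sample data: {sample[:2]}")
    elif execution:
        lines.append(f"Execution failed: {execution.get('error', 'unknown error')}")

    if result.get("analysis"):
        lines.append(f"Analysis: {result['analysis']}")
    return lines


# Made-up result for illustration only
demo = {
    "generated_sql": "SELECT 1",
    "execution_results": {
        "success": True,
        "row_count": 1,
        "columns": ["f0_"],
        "sample_data": [{"f0_": 1}],
    },
}
for line in summarize_workflow_result(demo):
    print(line)
```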

## Test Multiple Databases

In [None]:
# Test SQL generation across multiple databases
if 'databases' in locals() and len(databases) >= 2:
    test_databases = databases[:3]  # Test first 3 databases
    test_question = "Show me the structure of the main table"
    
    print(f"Testing SQL generation across multiple databases...")
    print(f"Question: {test_question}\n")
    
    for db in test_databases:
        try:
            print(f"Database: {db}")
            sql = spider_service.generate_sql_for_database(
                natural_language_query=test_question,
                database_name=db
            )
            # Truncate long statements to keep the output readable
            preview = f"{sql[:100]}..." if len(sql) > 100 else sql
            print(f"Generated SQL: {preview}")
            print("-" * 50)
            
        except Exception as e:
            print(f"❌ Failed for {db}: {e}")
            print("-" * 50)
else:
    print("⚠ Need at least 2 databases for multi-database testing")

## Performance and Cost Analysis

In [None]:
# Test query validation and cost estimation
if 'generated_sql' in locals():
    try:
        print("Testing query validation and cost estimation...")
        
        validation_result = client.validate_query(generated_sql)
        
        print("✓ Query validation result:")
        print(f"  Valid: {validation_result['valid']}")
        print(f"  Bytes to process: {validation_result.get('total_bytes_processed', 'N/A')}")
        # total_bytes_billed is a byte count, not a currency amount
        print(f"  Bytes billed: {validation_result.get('total_bytes_billed', 'N/A')}")
        
    except Exception as e:
        print(f"❌ Query validation failed: {e}")
else:
    print("⚠ No SQL query available for validation")
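
The validation result reports bytes rather than a currency amount. A rough conversion can be sketched as below. `estimate_query_cost_usd` is a hypothetical helper, and the price passed in the example is an illustrative placeholder, not a quoted rate. On-demand pricing varies by region and changes over time, so substitute the current figure for your project.

```python
BYTES_PER_TIB = 2 ** 40


def estimate_query_cost_usd(bytes_processed, price_per_tib_usd):
    """Convert a byte count into an approximate on-demand cost in USD.

    price_per_tib_usd must be supplied by the caller; no rate is assumed here.
    Returns None when the byte count is unavailable.
    """
    if bytes_processed is None:
        return None
    return (int(bytes_processed) / BYTES_PER_TIB) * price_per_tib_usd


# Example with a literal byte count (10 GiB) and a placeholder price
cost = estimate_query_cost_usd(10 * 2 ** 30, price_per_tib_usd=5.0)
print(f"Approximate cost: ${cost:.4f}")
```

In the notebook, the byte count would come from `validation_result.get('total_bytes_processed')`, which may be missing, hence the `None` check.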

## Cleanup

In [None]:
# Clean up connections
try:
    if 'client' in locals():
        client.close()
    if 'spider_service' in locals():
        spider_service.close()
    print("✓ Connections closed successfully")
except Exception as e:
    print(f"Warning: Error during cleanup: {e}")

## Summary

This notebook demonstrated:

1. **BigQuery Connection**: Basic connection to Google BigQuery
2. **Spider 2 Access**: Accessing the public Spider 2 dataset
3. **Schema Exploration**: Examining database and table schemas
4. **Text-to-SQL Generation**: Converting natural language to SQL using Gemini
5. **Query Execution**: Running generated SQL queries on Spider 2 data
6. **Result Analysis**: AI-powered analysis of query results
7. **Performance Testing**: Query validation and cost estimation

### Next Steps:

- **Authentication**: Ensure proper BigQuery authentication is set up
- **Permissions**: Verify access to the Spider 2 public dataset
- **Optimization**: Fine-tune SQL generation prompts for better accuracy
- **Evaluation**: Implement benchmarking against known correct SQL queries
- **Error Handling**: Add robust error handling for production use
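
For the evaluation step, one common approach is to compare the rows returned by the generated SQL with the rows returned by a reference query, ignoring row order. The sketch below assumes both results are already available as lists of row tuples. `results_match` and `execution_accuracy` are hypothetical helpers, not part of `Spider2Service`, and the sample rows are made up for illustration.

```python
from collections import Counter


def results_match(predicted_rows, expected_rows):
    """Return True when both results hold the same rows, ignoring order.

    Rows are compared as tuples and duplicates count, so this is a
    multiset comparison. Row values must be hashable.
    """
    return Counter(map(tuple, predicted_rows)) == Counter(map(tuple, expected_rows))


def execution_accuracy(pairs):
    """Fraction of (predicted_rows, expected_rows) pairs that match."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(results_match(p, e) for p, e in pairs) / len(pairs)


# Illustrative data only
examples = [
    ([(1, "a"), (2, "b")], [(2, "b"), (1, "a")]),  # same rows, different order
    ([(1, "a")], [(1, "a"), (1, "a")]),            # one duplicate row missing
]
print(f"Execution accuracy: {execution_accuracy(examples):.2f}")
```

This comparison is strict about column order and exact values. Queries that return floating-point aggregates or reordered columns would need a more lenient comparison.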

### Troubleshooting:

If you encounter authentication errors:
1. Check your Google Cloud project setup
2. Ensure the BigQuery API is enabled
3. Verify your service account has BigQuery permissions
4. Check that environment variables are set correctly