### What This Notebook Accomplishes:

1. **Data Processing**: Loads and processes Great Expectations validation JSON files
2. **Analytics**: Calculates comprehensive data quality metrics and trends
3. **Visualizations**: Creates both static and interactive charts and graphs
4. **AI Analysis**: Uses an Ollama-hosted LLM (gpt-oss:20b on Ollama Cloud, falling back to phi3:mini on a local Ollama server) to generate insights
5. **Report Generation**: Exports professional markdown and PDF reports

### Usage Instructions:

1. **Run All Cells**: Execute the entire notebook to generate the complete analysis
2. **Customize**: Modify the `ValidationAnalyzer` class for different data sources
3. **Schedule**: Set up automated runs for regular data quality monitoring
4. **Share**: Use generated reports for stakeholder communication
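
For the scheduling step, one option is to execute the notebook headlessly with `jupyter nbconvert` from a cron job or CI task. The sketch below only builds and prints the command. The notebook file name `validation_analysis.ipynb` and the output name are hypothetical placeholders, since the notebook's actual file name is not stated here.

```python
import shlex
from pathlib import Path


def build_nbconvert_command(notebook: Path, output_name: str) -> list[str]:
    """Build a command that runs every cell and writes an executed copy."""
    return [
        "jupyter", "nbconvert",
        "--to", "notebook",
        "--execute", str(notebook),
        "--output", output_name,
    ]


# Hypothetical file names, used only for illustration.
command = build_nbconvert_command(
    Path("validation_analysis.ipynb"), "validation_analysis_executed"
)
print(shlex.join(command))
```

A scheduled job could then pass `command` to `subprocess.run(command, check=True)` so that a failing cell causes a non-zero exit.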

### Dependencies:

- **Core**: pandas, numpy, matplotlib, seaborn, plotly
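- **Configuration**: python-dotenv (reads Ollama settings from `.env`)
- **AI**: requests (for the Ollama API)
- **Markdown Tables**: tabulate (required by pandas `DataFrame.to_markdown()`)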
- **PDF Export**: weasyprint, markdown (optional)
- **Data Source**: Great Expectations validation results
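
Before running all cells, it can help to check which of these packages are importable. The following is a minimal sketch using only the standard library. The module lists mirror the imports used in this notebook, plus `tabulate`, which pandas needs for `to_markdown()`. The helper name `missing_modules` is hypothetical.

```python
from importlib.util import find_spec

# Import names (not PyPI names): python-dotenv is imported as "dotenv".
REQUIRED = ["pandas", "numpy", "matplotlib", "seaborn", "plotly",
            "requests", "dotenv", "tabulate"]
# Only needed for the optional PDF export.
OPTIONAL = ["weasyprint", "markdown"]


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


print("Missing required:", missing_modules(REQUIRED))
print("Missing optional:", missing_modules(OPTIONAL))
```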

### Generated Outputs:

- **Interactive Dashboard**: Plotly visualizations for trend analysis
- **Static Charts**: Matplotlib/Seaborn charts for presentations
- **Markdown Report**: Comprehensive analysis report
- **PDF Report**: Professional formatted document (requires weasyprint)

### Key Features:

- **Automated Analysis**: Processes all validation files automatically
- **Trend Detection**: Identifies patterns and trends over time
- **AI Insights**: Leverages Ollama for intelligent recommendations
- **Multiple Formats**: Exports in markdown and PDF formats
- **Professional Quality**: Enterprise-ready reports and visualizations




In [25]:
# Import required libraries
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
from datetime import datetime, timedelta
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
import requests
from dotenv import dotenv_values
import warnings
import time
warnings.filterwarnings('ignore')

# Start timing the entire analysis
start_time = time.time()
analysis_start = datetime.now()

# Set up plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("Libraries imported successfully!")
print("Setting up analysis environment...")
print(f"Analysis started at: {analysis_start.strftime('%Y-%m-%d %H:%M:%S')}")
print("Beginning Great Expectations validation analysis...")


Libraries imported successfully!
Setting up analysis environment...
Analysis started at: 2025-10-07 14:40:24
Beginning Great Expectations validation analysis...


In [26]:
# Configuration and Setup
class ValidationAnalyzer:
    def __init__(self, validation_path="../../BirdiDQ/gx/uncommitted/validations"):
        self.validation_path = Path(validation_path)
        self.results_data = []
        self.analysis_results = {}
        
        # Ollama Configuration - Using your existing .env structure
        env_path = Path("../../.env")
        if env_path.exists():
            env_vars = dotenv_values(env_path)
            # Use Ollama Cloud settings from your .env file
            self.ollama_url = env_vars.get("OLLAMA_CLOUD_BASE_URL", "https://ollama.com")
            self.ollama_model = env_vars.get("OLLAMA_CLOUD_MODEL", "gpt-oss:20b")
            self.ollama_api_key = env_vars.get("OLLAMA_API_KEY", "")
            
            # Fallback to local if cloud is not configured
            if not self.ollama_api_key or self.ollama_api_key == "your_api_key_here":
                print("No API key found, falling back to local Ollama")
                self.ollama_url = env_vars.get("OLLAMA_BASE_URL", "http://localhost:11434")
                self.ollama_model = env_vars.get("OLLAMA_LOCAL_MODEL", "phi3:mini")
                self.ollama_api_key = ""
        else:
            # Default to local Ollama
            self.ollama_url = "http://localhost:11434"
            self.ollama_model = "phi3:mini"
            self.ollama_api_key = ""
        
        print(f"Ollama URL: {self.ollama_url}")
        print(f"Ollama Model: {self.ollama_model}")
        print(f"API Key configured: {'Yes' if self.ollama_api_key and self.ollama_api_key != 'your_api_key_here' else 'No'}")
        print(f"Validation path: {self.validation_path}")
        print(f"Path exists: {self.validation_path.exists()}")
        
    def ollama_infer(self, prompt, model=None, url=None, timeout=120):
        """Send a prompt to Ollama and return the response"""
        model = model or self.ollama_model
        url = url or self.ollama_url
        
        # Prepare headers
        headers = {
            "Content-Type": "application/json"
        }
        
        # Add API key if available (for Ollama Cloud)
        if self.ollama_api_key and self.ollama_api_key != "your_api_key_here":
            headers["Authorization"] = f"Bearer {self.ollama_api_key}"
        
        try:
            response = requests.post(
                f"{url}/api/generate",
                json={"model": model, "prompt": prompt, "stream": False},
                headers=headers,
                timeout=timeout
            )
            response.raise_for_status()
            return response.json()["response"]
        except requests.exceptions.Timeout:
            print(f"Ollama request timed out after {timeout} seconds")
            return None
        except requests.exceptions.RequestException as e:
            print(f"Error calling Ollama API: {e}")
            if hasattr(e, 'response') and e.response is not None:
                print(f"Response status: {e.response.status_code}")
                print(f"Response text: {e.response.text}")
            return None

# Initialize analyzer
analyzer = ValidationAnalyzer()
print("Validation Analyzer initialized!")


Ollama URL: https://ollama.com
Ollama Model: gpt-oss:20b
API Key configured: Yes
Validation path: ../../BirdiDQ/gx/uncommitted/validations
Path exists: True
Validation Analyzer initialized!
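
The cloud-versus-local selection in `ValidationAnalyzer.__init__` can be isolated into a small pure function, which makes the fallback rule easier to test without a `.env` file. This is a sketch under the same defaults and environment variable names as the cell above. The helper names `resolve_ollama_config` and `build_headers` are hypothetical and not part of the notebook.

```python
PLACEHOLDER_KEY = "your_api_key_here"


def resolve_ollama_config(env_vars: dict) -> dict:
    """Pick cloud settings when a real API key is present, otherwise local ones."""
    api_key = env_vars.get("OLLAMA_API_KEY", "")
    if api_key and api_key != PLACEHOLDER_KEY:
        return {
            "url": env_vars.get("OLLAMA_CLOUD_BASE_URL", "https://ollama.com"),
            "model": env_vars.get("OLLAMA_CLOUD_MODEL", "gpt-oss:20b"),
            "api_key": api_key,
        }
    return {
        "url": env_vars.get("OLLAMA_BASE_URL", "http://localhost:11434"),
        "model": env_vars.get("OLLAMA_LOCAL_MODEL", "phi3:mini"),
        "api_key": "",
    }


def build_headers(config: dict) -> dict:
    """Add a bearer token only when an API key is configured."""
    headers = {"Content-Type": "application/json"}
    if config["api_key"]:
        headers["Authorization"] = f"Bearer {config['api_key']}"
    return headers


cloud = resolve_ollama_config({"OLLAMA_API_KEY": "example-key"})
local = resolve_ollama_config({"OLLAMA_API_KEY": PLACEHOLDER_KEY})
print(cloud["url"], cloud["model"])
print(local["url"], local["model"])
```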


In [27]:
# Data Loading and Processing Functions
def load_validation_files(analyzer):
    """Load all validation JSON files from the directory structure"""
    validation_files = []
    
    # Find all JSON files in the validation directory
    for json_file in analyzer.validation_path.rglob("*.json"):
        try:
            with open(json_file, 'r') as f:
                data = json.load(f)
                validation_files.append({
                    'file_path': str(json_file),
                    'data': data,
                    'timestamp': data.get('meta', {}).get('validation_time', ''),
                    'suite_name': data.get('meta', {}).get('expectation_suite_name', ''),
                    'run_id': data.get('meta', {}).get('run_id', {}).get('run_name', ''),
                    'data_asset': data.get('meta', {}).get('active_batch_definition', {}).get('data_asset_name', '')
                })
        except Exception as e:
            print(f"Error loading {json_file}: {e}")
    
    analyzer.results_data = validation_files
    print(f"Loaded {len(validation_files)} validation files")
    return validation_files

def process_validation_results(validation_files):
    """Process validation results into structured data"""
    processed_data = []
    
    for file_info in validation_files:
        data = file_info['data']
        results = data.get('results', [])
        
        for result in results:
            expectation_config = result.get('expectation_config', {})
            exception_info = result.get('exception_info', {})
            
            processed_data.append({
                'file_path': file_info['file_path'],
                'timestamp': file_info['timestamp'],
                'suite_name': file_info['suite_name'],
                'run_id': file_info['run_id'],
                'data_asset': file_info['data_asset'],
                'expectation_type': expectation_config.get('expectation_type', ''),
                'column': expectation_config.get('kwargs', {}).get('column', 'table-level'),
                'success': result.get('success', False),
                'exception_raised': exception_info.get('raised_exception', False),
                'exception_message': exception_info.get('exception_message', ''),
                'result': result.get('result', {}),
                'meta': expectation_config.get('meta', {})
            })
    
    return pd.DataFrame(processed_data)

# Load and process data
print("Loading validation files...")
validation_files = load_validation_files(analyzer)
df = process_validation_results(validation_files)

print(f"Processed {len(df)} individual expectations")
# print(f"Date range: {df['timestamp'].min()} to {df['timestamp'].max()}")
print(f"Unique expectation suites: {df['suite_name'].nunique()}")
print(f"Unique expectation types: {df['expectation_type'].nunique()}")


Loading validation files...
Loaded 1 validation files
Processed 132 individual expectations
Unique expectation suites: 1
Unique expectation types: 15
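
To make the flattening step concrete, the sketch below applies the same key lookups as `process_validation_results` to a hand-written, minimal validation document. The sample content (suite name, column name) is made up for illustration, and the helper `flatten` is hypothetical. Real validation files carry many more fields.

```python
# A hand-written, minimal stand-in for one validation result file.
sample = {
    "meta": {"expectation_suite_name": "demo_suite"},
    "results": [
        {
            "success": True,
            "expectation_config": {
                "expectation_type": "expect_column_values_to_not_be_null",
                "kwargs": {"column": "fare_amount"},
            },
        },
        {
            "success": False,
            "expectation_config": {
                "expectation_type": "expect_table_row_count_to_be_between",
                "kwargs": {"min_value": 1},
            },
        },
    ],
}


def flatten(validation: dict) -> list[dict]:
    """Turn one validation document into one row per expectation result."""
    suite = validation.get("meta", {}).get("expectation_suite_name", "")
    rows = []
    for result in validation.get("results", []):
        config = result.get("expectation_config", {})
        rows.append({
            "suite_name": suite,
            "expectation_type": config.get("expectation_type", ""),
            # Expectations without a "column" kwarg are treated as table-level.
            "column": config.get("kwargs", {}).get("column", "table-level"),
            "success": result.get("success", False),
        })
    return rows


rows = flatten(sample)
for row in rows:
    print(row["column"], row["expectation_type"], row["success"])
```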


In [28]:
# Data Quality Metrics and Analysis
def calculate_quality_metrics(df):
    """Calculate comprehensive data quality metrics"""
    metrics = {}
    
    # Overall success rate
    metrics['overall_success_rate'] = df['success'].mean()
    
    # Success rate by expectation suite
    suite_metrics = df.groupby('suite_name').agg({
        'success': ['count', 'sum', 'mean'],
        'exception_raised': 'sum'
    }).round(3)
    suite_metrics.columns = ['total_expectations', 'successful_expectations', 'success_rate', 'exceptions']
    metrics['suite_metrics'] = suite_metrics
    
    # Success rate by expectation type
    type_metrics = df.groupby('expectation_type').agg({
        'success': ['count', 'sum', 'mean'],
        'exception_raised': 'sum'
    }).round(3)
    type_metrics.columns = ['total_expectations', 'successful_expectations', 'success_rate', 'exceptions']
    metrics['type_metrics'] = type_metrics
    
    # Column-level analysis
    column_metrics = df[df['column'] != 'table-level'].groupby('column').agg({
        'success': ['count', 'sum', 'mean'],
        'exception_raised': 'sum'
    }).round(3)
    column_metrics.columns = ['total_expectations', 'successful_expectations', 'success_rate', 'exceptions']
    metrics['column_metrics'] = column_metrics
    
    # Exception analysis
    exception_df = df[df['exception_raised'] == True]
    metrics['exception_count'] = len(exception_df)
    metrics['exception_rate'] = len(exception_df) / len(df)
    
    return metrics

# Calculate metrics
print("Calculating quality metrics...")
quality_metrics = calculate_quality_metrics(df)

print(f"Overall Success Rate: {quality_metrics['overall_success_rate']:.2%}")
print(f"Exception Rate: {quality_metrics['exception_rate']:.2%}")
print(f"Total Expectations Analyzed: {len(df)}")

# Display top-level metrics
print("\n Suite Performance:")
print(quality_metrics['suite_metrics'].head())

print("\n Expectation Type Performance:")
print(quality_metrics['type_metrics'].head())


Calculating quality metrics...
Overall Success Rate: 96.21%
Exception Rate: 0.00%
Total Expectations Analyzed: 132

 Suite Performance:
                                      total_expectations  \
suite_name                                                 
nyc_taxi_data_onboarding_suite_final                 132   

                                      successful_expectations  success_rate  \
suite_name                                                                    
nyc_taxi_data_onboarding_suite_final                      127         0.962   

                                      exceptions  
suite_name                                        
nyc_taxi_data_onboarding_suite_final           0  

 Expectation Type Performance:
                                                    total_expectations  \
expectation_type                                                         
expect_column_max_to_be_between                                     14   
expect_column_mean_to_be_between      
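
The headline rates printed above can be reproduced from the raw counts (132 expectations, 127 successful, 0 exceptions). The sketch below is a plain-Python cross-check. The helper `rate` is hypothetical, and its guard for an empty input is an addition of this sketch.

```python
def rate(part: int, total: int) -> float:
    """Return part/total, or 0.0 when there is nothing to count."""
    return part / total if total else 0.0


# Counts taken from the suite performance output above.
total, successful, exceptions = 132, 127, 0

print(f"Success rate:   {rate(successful, total):.2%}")
print(f"Failure rate:   {rate(total - successful, total):.2%}")
print(f"Exception rate: {rate(exceptions, total):.2%}")
```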

In [36]:
# AI-Powered Analysis with Ollama Cloud + Fallback
def generate_fallback_analysis(df, quality_metrics):
    """Generate fallback analysis when AI is unavailable"""
    
    fallback_analysis = f"""
## Executive Summary
Based on the analysis of {len(df)} data quality expectations across {df['suite_name'].nunique()} validation suites, the overall data quality success rate is {quality_metrics['overall_success_rate']:.2%}.

## Critical Issues
- **Exception Rate**: {quality_metrics['exception_rate']:.2%} of expectations raised exceptions
- **Lowest Performing Suite**: {quality_metrics['suite_metrics'].nsmallest(1, 'success_rate').index[0]} with {quality_metrics['suite_metrics'].nsmallest(1, 'success_rate')['success_rate'].iloc[0]:.2%} success rate
- **Most Problematic Expectation Type**: {quality_metrics['type_metrics'].nsmallest(1, 'success_rate').index[0]} with {quality_metrics['type_metrics'].nsmallest(1, 'success_rate')['success_rate'].iloc[0]:.2%} success rate

## Trends Analysis
- **Date Range**: {df['timestamp'].min()} to {df['timestamp'].max()}
- **Total Expectations**: {len(df)}
- **Successful Expectations**: {df['success'].sum()}
- **Failed Expectations**: {len(df) - df['success'].sum()}

## Recommendations
1. **Immediate Action**: Address suites with success rates below 80%
2. **Expectation Review**: Review and update failing expectation types
3. **Monitoring**: Implement daily monitoring for critical data assets
4. **Process Improvement**: Establish data quality governance processes

## Risk Assessment
- **High Risk**: Suites with success rates below 70% require immediate attention
- **Medium Risk**: Suites with success rates between 70-85% need monitoring
- **Low Risk**: Suites with success rates above 85% are performing well

## Next Steps
1. Prioritize fixing the lowest performing suite
2. Review expectation configurations for failing types
3. Implement automated monitoring and alerting
4. Schedule regular data quality reviews
"""
    
    return fallback_analysis

def generate_ai_insights(df, quality_metrics, analyzer):
    """Generate AI-powered insights using Ollama Cloud with fallback"""
    
    # Prepare data summary for AI analysis
    data_summary = {
        'total_expectations': len(df),
        'overall_success_rate': quality_metrics['overall_success_rate'],
        'exception_rate': quality_metrics['exception_rate'],
        'suite_count': df['suite_name'].nunique(),
        'expectation_types': df['expectation_type'].nunique(),
        'date_range': f"{df['timestamp'].min()} to {df['timestamp'].max()}",
        'top_failing_suites': quality_metrics['suite_metrics'].nsmallest(3, 'success_rate').to_dict(),
        'top_failing_types': quality_metrics['type_metrics'].nsmallest(3, 'success_rate').to_dict()
    }
    
    prompt = f"""
    You are a data quality expert analyzing Great Expectations validation results. 

    Data Summary:
    - Total Expectations: {data_summary['total_expectations']}
    - Overall Success Rate: {data_summary['overall_success_rate']:.2%}
    - Exception Rate: {data_summary['exception_rate']:.2%}
    - Number of Suites: {data_summary['suite_count']}
    - Number of Expectation Types: {data_summary['expectation_types']}
    - Date Range: {data_summary['date_range']}

    Top Failing Suites:
    {data_summary['top_failing_suites']}

    Top Failing Expectation Types:
    {data_summary['top_failing_types']}

    Please provide:
    1. **Executive Summary**: Key findings and overall data quality assessment
    2. **Critical Issues**: Most important problems that need immediate attention
    3. **Trends Analysis**: Patterns and trends observed in the data
    4. **Recommendations**: Specific actionable recommendations to improve data quality
    5. **Risk Assessment**: Potential risks and their impact
    6. **Next Steps**: Prioritized action items

    Format your response as a professional data quality report with clear sections and actionable insights.
    """
    
    print(" Generating AI insights with Ollama Cloud...")
    ai_response = analyzer.ollama_infer(prompt)
    
    # Use fallback if AI is unavailable
    if ai_response is None:
        print(" Ollama Cloud unavailable, using fallback analysis...")
        ai_response = generate_fallback_analysis(df, quality_metrics)
    
    return ai_response, data_summary

# Generate AI insights
ai_insights, data_summary = generate_ai_insights(df, quality_metrics, analyzer)

print(" AI Analysis Complete!")
print("\n" + "="*80)
print(" AI-POWERED DATA QUALITY INSIGHTS")
print("="*80)
print(ai_insights)


 Generating AI insights with Ollama Cloud...
 AI Analysis Complete!

 AI-POWERED DATA QUALITY INSIGHTS
# Great Expectations Data Quality Report  
**Run ID:** `20251005T180117.592126Z`  
**Date of report:** October 7 2025  

> **Scope** – Validation run on the *nyc_taxi_data_onboarding_suite_final* suite (132 expectation checks) covering a single‑record snapshot of the NYC taxi dataset.  

---

## 1. Executive Summary  

| Metric | Value |
|--------|-------|
| Total Expectations | 132 |
| Successful Expectations | 127 |
| Failure Rate | 3.79 % |
| Success Rate | 96.21 % |
| Exceptions | 0 |
| Suite Count | 1 |

The validation run demonstrates **overall good data quality**. 96 % of expectations pass and there are **no runtime exceptions**, indicating that the underlying schema and ingest process are functioning as expected.  

The sole source of concern is **mean‑value expectations**:

* 12 mean expectations were defined, of which only 7 passed (58 % success).  
* All maximum‑value and m
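
The fallback report names three risk tiers by success rate (below 70%, 70-85%, above 85%). A minimal sketch of that mapping follows. The function name `risk_tier` is hypothetical, and assigning exactly 70% and exactly 85% to the medium tier is an assumption, since the template does not state how boundaries are handled.

```python
def risk_tier(success_rate: float) -> str:
    """Map a success rate in [0, 1] to the tiers named in the fallback report."""
    if success_rate < 0.70:
        return "High Risk"
    if success_rate <= 0.85:
        return "Medium Risk"
    return "Low Risk"


for value in (0.58, 0.80, 0.9621):
    print(f"{value:.2%} -> {risk_tier(value)}")
```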

In [37]:
# Report Generation Functions
def generate_markdown_report(df, quality_metrics, ai_insights, data_summary):
    """Generate comprehensive markdown report"""
    
    report = f"""# Great Expectations Validation Analysis Report

**Generated on:** {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}  
**Analysis Period:** {data_summary['date_range']}

## Executive Summary

This report analyzes {data_summary['total_expectations']} data quality expectations across {data_summary['suite_count']} validation suites.

### Key Metrics
- **Overall Success Rate:** {data_summary['overall_success_rate']:.2%}
- **Exception Rate:** {data_summary['exception_rate']:.2%}
- **Expectation Types Analyzed:** {data_summary['expectation_types']}

## Data Quality Metrics

### Suite Performance
"""
    
    # Add suite metrics table
    suite_table = quality_metrics['suite_metrics'].to_markdown()
    report += f"\n{suite_table}\n"
    
    report += f"""
### Expectation Type Performance
"""
    
    # Add type metrics table
    type_table = quality_metrics['type_metrics'].to_markdown()
    report += f"\n{type_table}\n"
    
    report += f"""
## AI-Powered Analysis

{ai_insights}

## Detailed Analysis

### Top Performing Suites

"""
    
    # Top performing suites (counts are cast to int because iterrows() upcasts each row to float)
    top_suites = quality_metrics['suite_metrics'].nlargest(5, 'success_rate')
    for suite, metrics in top_suites.iterrows():
        report += f"- **{suite}**: {metrics['success_rate']:.2%} success rate ({int(metrics['successful_expectations'])}/{int(metrics['total_expectations'])} expectations)\n"
    
    # Template text stays flush left: Markdown renders lines indented by four spaces as code blocks
    report += f"""
### Areas Requiring Attention

"""
    
    # Bottom performing suites
    bottom_suites = quality_metrics['suite_metrics'].nsmallest(5, 'success_rate')
    for suite, metrics in bottom_suites.iterrows():
        report += f"- **{suite}**: {metrics['success_rate']:.2%} success rate ({int(metrics['successful_expectations'])}/{int(metrics['total_expectations'])} expectations)\n"
    
    report += f"""
## Recommendations

Based on the analysis, the following actions are recommended:

1. **Immediate Actions**: Address suites with success rates below 80%
2. **Monitoring**: Implement daily monitoring for critical data assets
3. **Expectation Review**: Review and update failing expectation types
4. **Process Improvement**: Establish data quality governance processes

## Technical Details

- **Analysis Engine**: Great Expectations v0.18.22
- **AI Analysis**: Ollama LLM (gpt-oss:20b)
- **Data Source**: Validation results from BirdiDQ/gx/uncommitted/validations
- **Report Generated**: {datetime.now().isoformat()}

---
*This report was automatically generated by the Great Expectations Validation Analysis system.*
"""
    
    return report

def save_report(report, filename="validation_analysis_report.md"):
    """Save report to markdown file"""
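    output_path = Path("notebooks/great_expectations") / filename
    # Create the output directory if it does not exist yet
    output_path.parent.mkdir(parents=True, exist_ok=True)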
    
    with open(output_path, 'w', encoding='utf-8') as f:
        f.write(report)
    
    print(f"Report saved to: {output_path}")
    return output_path

# Generate and save report
print("Generating comprehensive report...")
markdown_report = generate_markdown_report(df, quality_metrics, ai_insights, data_summary)
report_path = save_report(markdown_report)

print("Report generation complete!")
print(f"Report location: {report_path}")


Generating comprehensive report...
Report saved to: notebooks/great_expectations/validation_analysis_report.md
Report generation complete!
Report location: notebooks/great_expectations/validation_analysis_report.md
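
Markdown treats lines indented by four or more spaces as code blocks, so report templates written as triple-quoted strings inside an indented function body are worth dedenting before they are written to disk. A minimal sketch using the standard library's `textwrap.dedent` is shown below. The helper `build_section` and its sample text are hypothetical.

```python
import textwrap


def build_section(title: str, items: list[str]) -> str:
    """Render a heading and bullet list with no leading indentation."""
    # The template is indented to match the code; dedent() strips that
    # common leading whitespace so the heading starts at column 0.
    header = textwrap.dedent(f"""\
        ## {title}

        The following actions are recommended:
        """)
    bullets = "\n".join(f"- {item}" for item in items)
    return f"{header}\n{bullets}\n"


section = build_section("Recommendations", ["Review failing expectation types"])
print(section)
```

This keeps the Python source readable while the generated report stays flush left.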


In [38]:
# Data Catalog Generation Functions
def generate_data_catalog(df, quality_metrics, validation_files):
    """Generate comprehensive data catalog from validation results"""
    
    catalog = {
        "metadata": {
            "generated_on": datetime.now().isoformat(),
            "total_validation_files": len(validation_files),
            "analysis_period": f"{df['timestamp'].min()} to {df['timestamp'].max()}",
            "great_expectations_version": "0.18.22"
        },
        "data_assets": {},
        "expectation_suites": {},
        "data_quality_summary": {
            "overall_success_rate": quality_metrics['overall_success_rate'],
            "exception_rate": quality_metrics['exception_rate'],
            "total_expectations": len(df)
        }
    }
    
    # Process each validation file to extract data asset information
    for file_info in validation_files:
        data = file_info['data']
        suite_name = file_info['suite_name']
        data_asset = file_info['data_asset']
        
        # Extract batch definition information
        batch_def = data.get('meta', {}).get('active_batch_definition', {})
        batch_spec = data.get('meta', {}).get('batch_spec', {})
        
        # Initialize data asset entry if not exists
        if data_asset not in catalog["data_assets"]:
            catalog["data_assets"][data_asset] = {
                "name": data_asset,
                "type": batch_spec.get('type', 'unknown'),
                "table_name": batch_spec.get('table_name', ''),
                "schema_name": batch_spec.get('schema_name', ''),
                "datasource": batch_def.get('datasource_name', ''),
                "data_connector": batch_def.get('data_connector_name', ''),
                "validation_runs": [],
                "columns": {},
                "expectation_suites": []
            }
        
        # Add validation run information
        results = data.get('results', [])
        passed = sum(1 for r in results if r.get('success', False))
        run_info = {
            "run_id": file_info['run_id'],
            "timestamp": file_info['timestamp'],
            "suite_name": suite_name,
            "expectation_count": len(results),
            "success_rate": passed / len(results) if results else 0
        }
        
        catalog["data_assets"][data_asset]["validation_runs"].append(run_info)
        
        # Add suite to data asset
        if suite_name not in catalog["data_assets"][data_asset]["expectation_suites"]:
            catalog["data_assets"][data_asset]["expectation_suites"].append(suite_name)
        
        # Extract column information from expectations
        for result in data.get('results', []):
            expectation_config = result.get('expectation_config', {})
            column = expectation_config.get('kwargs', {}).get('column', 'table-level')
            
            if column != 'table-level' and column not in catalog["data_assets"][data_asset]["columns"]:
                catalog["data_assets"][data_asset]["columns"][column] = {
                    "name": column,
                    "expectation_types": [],
                    "quality_metrics": {
                        "total_expectations": 0,
                        "successful_expectations": 0,
                        "success_rate": 0.0,
                        "exceptions": 0
                    }
                }
            
            if column != 'table-level':
                exp_type = expectation_config.get('expectation_type', '')
                if exp_type not in catalog["data_assets"][data_asset]["columns"][column]["expectation_types"]:
                    catalog["data_assets"][data_asset]["columns"][column]["expectation_types"].append(exp_type)
        
        # Initialize expectation suite entry
        if suite_name not in catalog["expectation_suites"]:
            catalog["expectation_suites"][suite_name] = {
                "name": suite_name,
                "data_assets": [],
                "expectation_types": [],
                "quality_metrics": {
                    "total_expectations": 0,
                    "successful_expectations": 0,
                    "success_rate": 0.0,
                    "exceptions": 0
                }
            }
        
        # Add data asset to suite
        if data_asset not in catalog["expectation_suites"][suite_name]["data_assets"]:
            catalog["expectation_suites"][suite_name]["data_assets"].append(data_asset)
    
    # Calculate quality metrics for columns and suites.
    # Values are cast to built-in int/float so json.dump writes them as numbers;
    # with default=str, NumPy integers would otherwise be serialized as strings.
    for data_asset_name, asset_info in catalog["data_assets"].items():
        for column_name, column_info in asset_info["columns"].items():
            column_df = df[(df['data_asset'] == data_asset_name) & (df['column'] == column_name)]
            if not column_df.empty:
                column_info["quality_metrics"] = {
                    "total_expectations": len(column_df),
                    "successful_expectations": int(column_df['success'].sum()),
                    "success_rate": float(column_df['success'].mean()),
                    "exceptions": int(column_df['exception_raised'].sum())
                }
    
    for suite_name, suite_info in catalog["expectation_suites"].items():
        suite_df = df[df['suite_name'] == suite_name]
        if not suite_df.empty:
            suite_info["quality_metrics"] = {
                "total_expectations": len(suite_df),
                "successful_expectations": int(suite_df['success'].sum()),
                "success_rate": float(suite_df['success'].mean()),
                "exceptions": int(suite_df['exception_raised'].sum())
            }
            suite_info["expectation_types"] = suite_df['expectation_type'].unique().tolist()
    
    return catalog

def generate_data_catalog_report(catalog):
    """Generate markdown report for data catalog"""
    
    report = f"""# Data Catalog

**Generated on:** {catalog['metadata']['generated_on']}  
**Analysis Period:** {catalog['metadata']['analysis_period']}  
**Total Validation Files:** {catalog['metadata']['total_validation_files']}

## Data Quality Summary

<div class="summary-box">
- **Overall Success Rate:** {catalog['data_quality_summary']['overall_success_rate']:.2%}
- **Exception Rate:** {catalog['data_quality_summary']['exception_rate']:.2%}
- **Total Expectations:** {catalog['data_quality_summary']['total_expectations']}
</div>

## Data Assets

"""
    
    for asset_name, asset_info in catalog["data_assets"].items():
        report += f"""### {asset_name}

**Asset Type:** {asset_info['type']}  
**Table:** {asset_info['table_name']}  
**Schema:** {asset_info['schema_name']}  
**Datasource:** {asset_info['datasource']}  
**Data Connector:** {asset_info['data_connector']}

**Expectation Suites:** {', '.join(asset_info['expectation_suites'])}  
**Validation Runs:** {len(asset_info['validation_runs'])}

#### Columns ({len(asset_info['columns'])})

| Column Name | Expectations | Success Rate | Exceptions |
|-------------|-------------|--------------|------------|
"""
        
        for col_name, col_info in asset_info["columns"].items():
            metrics = col_info["quality_metrics"]
            report += f"| {col_name} | {metrics['total_expectations']} | {metrics['success_rate']:.2%} | {metrics['exceptions']} |\n"
        
        report += "\n"
    
    report += """## Expectation Suites

"""
    
    for suite_name, suite_info in catalog["expectation_suites"].items():
        metrics = suite_info["quality_metrics"]
        report += f"""### {suite_name}

**Data Assets:** {', '.join(suite_info['data_assets'])}  
**Total Expectations:** {metrics['total_expectations']}  
**Success Rate:** {metrics['success_rate']:.2%}  
**Exceptions:** {metrics['exceptions']}

**Expectation Types:** {', '.join(suite_info['expectation_types'])}

"""
    
    report += """## Validation Run History

<table class="appendix-table">
<thead>
<tr>
<th>Data Asset</th>
<th>Suite</th>
<th>Run ID</th>
<th>Timestamp</th>
<th>Expectations</th>
<th>Success Rate</th>
</tr>
</thead>
<tbody>
"""
    
    for asset_name, asset_info in catalog["data_assets"].items():
        for run in asset_info["validation_runs"]:
            report += f"""<tr>
<td>{asset_name}</td>
<td>{run['suite_name']}</td>
<td>{run['run_id']}</td>
<td>{run['timestamp']}</td>
<td>{run['expectation_count']}</td>
<td>{run['success_rate']:.2%}</td>
</tr>
"""
    
    report += """</tbody>
</table>

---
*This data catalog was automatically generated by the Great Expectations Validation Analysis system.*
"""
    
    return report

# Data Catalog Generation (Run this cell if data_catalog is not already generated)
if 'data_catalog' not in locals():
    print("Generating data catalog...")
    data_catalog = generate_data_catalog(df, quality_metrics, validation_files)
    
    # Save data catalog as JSON
    catalog_json_path = Path("notebooks/great_expectations") / "data_catalog.json"
    with open(catalog_json_path, 'w', encoding='utf-8') as f:
        json.dump(data_catalog, f, indent=2, default=str)
    
    print(f"Data catalog JSON saved to: {catalog_json_path}")
    
    # Generate data catalog report
    catalog_report = generate_data_catalog_report(data_catalog)
    
    # Save data catalog report
    catalog_report_path = Path("notebooks/great_expectations") / "data_catalog_report.md"
    with open(catalog_report_path, 'w', encoding='utf-8') as f:
        f.write(catalog_report)
    
    print(f"Data catalog report saved to: {catalog_report_path}")
    
    print("Data catalog generation complete!")
    print(f"Data assets cataloged: {len(data_catalog['data_assets'])}")
    print(f"Expectation suites cataloged: {len(data_catalog['expectation_suites'])}")
else:
    print("Data catalog already generated!")
    print(f"Data assets cataloged: {len(data_catalog['data_assets'])}")
    print(f"Expectation suites cataloged: {len(data_catalog['expectation_suites'])}")


Data catalog already generated!
Data assets cataloged: 1
Expectation suites cataloged: 1
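
One detail of the JSON export: `json.dump(..., default=str)` converts any value the encoder does not recognize into a string, so a NumPy integer count would be stored as `"127"` instead of `127`. The sketch below shows an alternative fallback that unwraps such scalars via `.item()`, which NumPy scalars provide. `FakeInt64` is a hypothetical stand-in so the example runs without NumPy, and `to_builtin` is an illustrative helper, not part of the notebook.

```python
import json
from datetime import datetime


class FakeInt64:
    """Hypothetical stand-in for a NumPy integer scalar (which also has .item())."""

    def __init__(self, value):
        self._value = value

    def item(self):
        return self._value

    def __str__(self):
        return str(self._value)


def to_builtin(obj):
    """json.dump fallback: unwrap scalar wrappers, stringify everything else."""
    if hasattr(obj, "item"):
        return obj.item()
    if isinstance(obj, datetime):
        return obj.isoformat()
    return str(obj)


payload = {
    "successful_expectations": FakeInt64(127),
    "generated_on": datetime(2025, 10, 7),
}

as_strings = json.dumps(payload, default=str)         # count becomes a string
as_numbers = json.dumps(payload, default=to_builtin)  # count stays a number
print(as_strings)
print(as_numbers)
```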


In [None]:
# Enhanced PDF Export with A4 Formatting and Appendix
def generate_pdf_report(report_content, filename="validation_analysis_report.pdf"):
    """Generate PDF report from markdown content with proper A4 formatting"""
    try:
        from markdown import markdown
        from weasyprint import HTML, CSS
        from weasyprint.text.fonts import FontConfiguration
        
        # Convert markdown to HTML
        html_content = markdown(report_content, extensions=['tables', 'codehilite'])
        
        # Enhanced CSS styling for A4 format with proper margins and table handling
        css_content = """
        @page {
            size: A4;
            margin: 2cm;
            @top-center {
                content: "Great Expectations Validation Analysis Report";
                font-size: 10px;
                color: #666;
            }
            @bottom-center {
                content: "Page " counter(page) " of " counter(pages);
                font-size: 10px;
                color: #666;
            }
        }
        
        body {
            font-family: 'Arial', sans-serif;
            line-height: 1.6;
            margin: 0;
            padding: 0;
            color: #333;
            font-size: 11px;
        }
        
        h1 {
            color: #2c3e50;
            border-bottom: 3px solid #3498db;
            padding-bottom: 10px;
            page-break-after: avoid;
            font-size: 18px;
        }
        
        h2 {
            color: #34495e;
            margin-top: 25px;
            border-bottom: 1px solid #bdc3c7;
            padding-bottom: 5px;
            page-break-after: avoid;
            font-size: 14px;
        }
        
        h3 {
            color: #7f8c8d;
            margin-top: 20px;
            page-break-after: avoid;
            font-size: 12px;
        }
        
        /* Table styling for main content - compact */
        table {
            border-collapse: collapse;
            width: 100%;
            margin: 15px 0;
            font-size: 9px;
            page-break-inside: avoid;
        }
        
        th, td {
            border: 1px solid #ddd;
            padding: 4px 6px;
            text-align: left;
            word-wrap: break-word;
        }
        
        th {
            background-color: #f2f2f2;
            font-weight: bold;
            font-size: 9px;
        }
        
        /* Wide tables go to appendix */
        .appendix-table {
            font-size: 8px;
            margin: 10px 0;
        }
        
        .appendix-table th,
        .appendix-table td {
            padding: 2px 4px;
            font-size: 8px;
        }
        
        code {
            background-color: #f4f4f4;
            padding: 2px 4px;
            border-radius: 3px;
            font-family: 'Courier New', monospace;
            font-size: 9px;
        }
        
        pre {
            background-color: #f4f4f4;
            padding: 10px;
            border-radius: 5px;
            overflow-x: auto;
            font-size: 9px;
            page-break-inside: avoid;
        }
        
        /* Page breaks */
        .page-break {
            page-break-before: always;
        }
        
        /* Summary boxes */
        .summary-box {
            background-color: #f8f9fa;
            border: 1px solid #dee2e6;
            border-radius: 5px;
            padding: 15px;
            margin: 15px 0;
        }
        
        /* Appendix styling */
        .appendix {
            page-break-before: always;
        }
        
        .appendix h2 {
            color: #e74c3c;
            border-bottom: 2px solid #e74c3c;
        }
        """
        
        # Create HTML document with proper structure
        full_html = f"""
        <!DOCTYPE html>
        <html>
        <head>
            <meta charset="utf-8">
            <title>Great Expectations Validation Analysis Report</title>
            <style>{css_content}</style>
        </head>
        <body>
            {html_content}
        </body>
        </html>
        """
        
        # Generate PDF (a relative filename is written to the current working directory)
        output_path = Path(filename)
        HTML(string=full_html).write_pdf(str(output_path))
        
        print(f"📄 PDF report saved to: {output_path}")
        return output_path
        
    except ImportError as e:
        print(f"PDF generation requires additional packages: {e}")
        print("Install with: pip install weasyprint markdown")
        return None
    except Exception as e:
        print(f"Error generating PDF: {e}")
        return None

def generate_enhanced_markdown_report(df, quality_metrics, ai_insights, data_summary, data_catalog=None):
    """Generate enhanced markdown report with proper structure for A4 PDF"""
    
    # Extract key metrics for summary
    overall_success = quality_metrics['overall_success_rate']
    exception_rate = quality_metrics['exception_rate']
    total_expectations = len(df)
    
    # Get top failing expectation types for summary
    failing_types = quality_metrics['type_metrics'].nsmallest(3, 'success_rate')
    
    report = f"""# Great Expectations Validation Analysis Report

**Generated on:** {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}  
**Analysis Period:** {data_summary['date_range']}


## Executive Summary

<div class="summary-box">
<p>This report analyzes <strong>{total_expectations}</strong> data quality expectations across <strong>{data_summary['suite_count']}</strong> validation suites.</p>

<h4>Key Metrics</h4>
<ul>
<li><strong>Overall Success Rate:</strong> {overall_success:.2%}</li>
<li><strong>Exception Rate:</strong> {exception_rate:.2%}</li>
<li><strong>Expectation Types Analyzed:</strong> {data_summary['expectation_types']}</li>
<li><strong>Critical Issues:</strong> {int((quality_metrics['type_metrics']['success_rate'] < 0.8).sum())} expectation types below 80% success rate</li>
</ul>
</div>


## Critical Findings

### Top Issues Requiring Attention
"""
    
    # Add critical issues in a compact format (int() because iterrows() can upcast integer counts to float)
    for idx, (exp_type, metrics) in enumerate(failing_types.iterrows(), 1):
        if metrics['success_rate'] < 0.8:
            report += f"{idx}. **{exp_type}**: {metrics['success_rate']:.1%} success rate ({int(metrics['successful_expectations'])}/{int(metrics['total_expectations'])} expectations)\n"
    
    report += f"""

## AI-Powered Analysis

{ai_insights}

## Data Catalog Summary

<div class="summary-box">
"""
    
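    # Add data catalog summary if available (plain HTML, because Markdown inside an HTML block is not rendered)
    if data_catalog:
        report += f"""<p><strong>Data Assets:</strong> {len(data_catalog['data_assets'])}<br>
<strong>Expectation Suites:</strong> {len(data_catalog['expectation_suites'])}<br>
<strong>Validation Runs:</strong> {sum(len(asset['validation_runs']) for asset in data_catalog['data_assets'].values())}<br>
<strong>Total Columns Monitored:</strong> {sum(len(asset['columns']) for asset in data_catalog['data_assets'].values())}</p>
"""
    else:
        report += "<p>Data catalog not available</p>\n"
    
    # Close the summary box; the f-prefix lets the timestamp under Technical Details interpolate
    report += f"""</div>
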
## Recommendations

Based on the analysis, the following actions are recommended:

1. **Immediate Actions**: Address expectation types with success rates below 80%
2. **Monitoring**: Implement daily monitoring for critical data assets  
3. **Expectation Review**: Review and update failing expectation types
4. **Process Improvement**: Establish data quality governance processes

## Technical Details

- **Analysis Engine**: Great Expectations v0.18.22
- **AI Analysis**: Ollama LLM (gpt-oss:20b)
- **Data Source**: Validation results from BirdiDQ/gx/uncommitted/validations
- **Report Generated**: {datetime.now().isoformat()}

---

<div class="page-break"></div>

## Appendix A: Detailed Suite Performance

| Suite Name | Total Expectations | Successful | Success Rate | Exceptions |
|------------|------------------|------------|--------------|------------|
"""
    
    # Add suite metrics in compact format (int() because iterrows() can upcast integer counts to float)
    for suite, metrics in quality_metrics['suite_metrics'].iterrows():
        report += f"| {suite} | {int(metrics['total_expectations'])} | {int(metrics['successful_expectations'])} | {metrics['success_rate']:.2%} | {int(metrics['exceptions'])} |\n"
    
    report += f"""

## Appendix B: Detailed Expectation Type Performance

<table class="appendix-table">
<thead>
<tr>
<th>Expectation Type</th>
<th>Total</th>
<th>Successful</th>
<th>Success Rate</th>
<th>Exceptions</th>
</tr>
</thead>
<tbody>
"""
    
    # Add expectation type metrics in compact format (int() because iterrows() can upcast integer counts to float)
    for exp_type, metrics in quality_metrics['type_metrics'].iterrows():
        report += f"""<tr>
<td>{exp_type}</td>
<td>{int(metrics['total_expectations'])}</td>
<td>{int(metrics['successful_expectations'])}</td>
<td>{metrics['success_rate']:.2%}</td>
<td>{int(metrics['exceptions'])}</td>
</tr>
"""
    
    report += """</tbody>
</table>

"""
    
    # Add data catalog appendix if available
    if data_catalog:
        report += f"""## Appendix C: Data Catalog

### Data Assets Overview

<table class="appendix-table">
<thead>
<tr>
<th>Data Asset</th>
<th>Type</th>
<th>Table</th>
<th>Schema</th>
<th>Datasource</th>
<th>Columns</th>
<th>Suites</th>
</tr>
</thead>
<tbody>
"""
        
        for asset_name, asset_info in data_catalog["data_assets"].items():
            report += f"""<tr>
<td>{asset_name}</td>
<td>{asset_info['type']}</td>
<td>{asset_info['table_name']}</td>
<td>{asset_info['schema_name']}</td>
<td>{asset_info['datasource']}</td>
<td>{len(asset_info['columns'])}</td>
<td>{len(asset_info['expectation_suites'])}</td>
</tr>
"""
        
        report += """</tbody>
</table>

### Column Quality Summary

<table class="appendix-table">
<thead>
<tr>
<th>Data Asset</th>
<th>Column</th>
<th>Expectations</th>
<th>Success Rate</th>
<th>Exceptions</th>
</tr>
</thead>
<tbody>
"""
        
        for asset_name, asset_info in data_catalog["data_assets"].items():
            for col_name, col_info in asset_info["columns"].items():
                metrics = col_info["quality_metrics"]
                report += f"""<tr>
<td>{asset_name}</td>
<td>{col_name}</td>
<td>{metrics['total_expectations']}</td>
<td>{metrics['success_rate']:.2%}</td>
<td>{metrics['exceptions']}</td>
</tr>
"""
        
        report += """</tbody>
</table>

"""
    
    report += """

---
*This report was automatically generated by the Great Expectations Validation Analysis system.*
"""
    
    return report

# Generate enhanced report with data catalog
print("Generating enhanced report with A4 formatting and data catalog...")

# First generate the data catalog if it doesn't exist
if 'data_catalog' not in locals():
    print("Generating data catalog first...")
    data_catalog = generate_data_catalog(df, quality_metrics, validation_files)
    
    # Save data catalog as JSON
    catalog_json_path = Path("notebooks/great_expectations") / "data_catalog.json"
    with open(catalog_json_path, 'w', encoding='utf-8') as f:
        json.dump(data_catalog, f, indent=2, default=str)
    
    print(f"Data catalog JSON saved to: {catalog_json_path}")

enhanced_report = generate_enhanced_markdown_report(df, quality_metrics, ai_insights, data_summary, data_catalog)

# Save enhanced markdown report
enhanced_report_path = Path("notebooks/great_expectations") / "validation_analysis_report_enhanced.md"
with open(enhanced_report_path, 'w', encoding='utf-8') as f:
    f.write(enhanced_report)

print(f"Enhanced markdown report saved to: {enhanced_report_path}")

# Generate enhanced PDF report
print("Generating enhanced PDF report...")
enhanced_pdf_path = generate_pdf_report(enhanced_report, "validation_analysis_report_enhanced.pdf")

if enhanced_pdf_path:
    print("Enhanced PDF report generated successfully!")
    print(f"Enhanced PDF location: {enhanced_pdf_path}")
else:
    print("Enhanced PDF generation failed - check error messages above")


Generating enhanced report with A4 formatting and data catalog...
Enhanced markdown report saved to: notebooks/great_expectations/validation_analysis_report_enhanced.md
Generating enhanced PDF report...
📄 PDF report saved to: validation_analysis_report_enhanced.pdf
Enhanced PDF report generated successfully!
Enhanced PDF location: validation_analysis_report_enhanced.pdf
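
The appendix tables interpolate names such as `asset_name` and `exp_type` straight into `<td>` cells. If a name ever contained `<`, `>`, or `&`, the generated HTML could break. A minimal sketch of a row builder that escapes cell text with the standard library's `html.escape` follows; the helper name `html_table_row` and the sample values are hypothetical and not part of the notebook.

```python
from html import escape

def html_table_row(cells):
    """Return one <tr> element with every cell value escaped for HTML."""
    tds = "".join(f"<td>{escape(str(cell))}</td>" for cell in cells)
    return f"<tr>{tds}</tr>\n"

# Illustrative values only; real rows would come from the metrics DataFrames.
row = html_table_row(["orders<staging>", 12, "91.67%"])
print(row)
```

A helper like this could replace the hand-written `<tr>` templates, keeping the escaping in one place.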


In [33]:
# # Mermaid Diagram Generation with Ollama AI
# import base64
# import io
# from IPython.display import HTML, display
# import json

# def generate_mermaid_architecture(data_catalog, analyzer):
#     """Generate high-quality Mermaid architecture diagram using Ollama AI"""
    
#     # Prepare detailed context for AI
#     architecture_context = {
#         'data_assets': list(data_catalog['data_assets'].keys()),
#         'datasources': list(set(asset['datasource'] for asset in data_catalog['data_assets'].values() if asset['datasource'])),
#         'schemas': list(set(asset['schema_name'] for asset in data_catalog['data_assets'].values() if asset['schema_name'])),
#         'expectation_suites': list(data_catalog['expectation_suites'].keys()),
#         'total_columns': sum(len(asset['columns']) for asset in data_catalog['data_assets'].values()),
#         'validation_runs': sum(len(asset['validation_runs']) for asset in data_catalog['data_assets'].values()),
#         'asset_details': {name: {
#             'type': info['type'],
#             'table': info['table_name'],
#             'schema': info['schema_name'],
#             'datasource': info['datasource'],
#             'columns': len(info['columns']),
#             'suites': info['expectation_suites']
#         } for name, info in data_catalog['data_assets'].items()}
#     }
    
#     prompt = f"""
#     You are an expert data architecture diagram designer. Create a professional-grade Mermaid flowchart diagram for a Great Expectations data quality system.
    
#     System Context:
#     - Data Assets: {architecture_context['data_assets']}
#     - Data Sources: {architecture_context['datasources']}
#     - Schemas: {architecture_context['schemas']}
#     - Expectation Suites: {architecture_context['expectation_suites']}
#     - Total Columns: {architecture_context['total_columns']}
#     - Validation Runs: {architecture_context['validation_runs']}
    
#     Asset Details:
#     {json.dumps(architecture_context['asset_details'], indent=2)}
    
#     Create a comprehensive, professional Mermaid flowchart that follows this structure:
    
#     1. **People Layer**: Data engineers, analysts, stakeholders interacting with the system
#     2. **Data Ingestion Layer**: File uploads, database connections, API integrations
#     3. **Data Processing Layer**: Data validation, transformation, quality checks
#     4. **Great Expectations Engine**: Expectation suites, validators, checkpoints
#     5. **AI Analysis Layer**: Ollama integration, intelligent insights, recommendations
#     6. **Output Layer**: Reports, dashboards, alerts, documentation
#     7. **Data Persistence**: Storage of expectations, results, documentation
    
#     Requirements:
#     - Use subgraphs to organize logical layers
#     - Include detailed node labels with specific technologies/processes
#     - Show data flow with directional arrows and descriptive labels
#     - Add styling with different colors for each layer
#     - Include storage/persistence components with dotted lines
#     - Make it enterprise-grade and professional
#     - Use specific technology names (Ollama Cloud, gpt-oss:20b, Pandas, etc.)
#     - Include actual component names from the system context
    
#     Return ONLY the Mermaid code, no explanations or markdown formatting.
#     """
    
#     print("Generating high-quality Mermaid architecture diagram with Ollama AI...")
#     ai_response = analyzer.ollama_infer(prompt)
    
#     if ai_response is None:
#         print("Ollama AI unavailable, using enhanced fallback Mermaid diagram...")
#         ai_response = generate_enhanced_fallback_mermaid_architecture(architecture_context)
    
#     return ai_response.strip()

# def generate_enhanced_fallback_mermaid_architecture(architecture_context):
#     """Generate enhanced fallback Mermaid architecture diagram"""
    
#     # Get actual asset names for more realistic diagram
#     assets = architecture_context['data_assets']
#     suites = architecture_context['expectation_suites']
#     datasources = architecture_context['datasources']
    
#     asset1 = assets[0] if assets else "nyc_taxi_data"
#     asset2 = assets[1] if len(assets) > 1 else "users_table"
#     suite1 = suites[0] if suites else "onboarding_suite"
#     suite2 = suites[1] if len(suites) > 1 else "quality_suite"
#     datasource1 = datasources[0] if datasources else "postgres_db"
    
#     fallback_diagram = f"""graph TB
#     subgraph "People Layer"
#         A[Data Engineer] -->|Uploads Data| B[Data Source Selection]
#         A -->|Natural Language Query| C[Query Input Interface]
#         D[Data Analyst] -->|Reviews Results| E[Quality Dashboard]
#     end

#     subgraph "Data Ingestion Layer"
#         B -->|CSV Upload| F[File Upload Handler]
#         B -->|Database Connection| G[Database Browser]
#         F --> H[Data Preview & Schema Detection]
#         G --> H
#         H -->|DataFrame Creation| I[Pandas DataProcessor]
#     end

#     subgraph "Great Expectations Engine"
#         I -->|Create Batch| J[GX Batch Definition]
#         J -->|Load Data| K[Data Validator]
#         K -->|Execute Expectations| L[Expectation Suite Runner]
#         L -->|{suite1}| M[Quality Validation Engine]
#         L -->|{suite2}| M
#     end

#     subgraph "AI Analysis Layer"
#         M -->|Validation Results| N[Ollama Cloud API]
#         N -->|Model: gpt-oss:20b| O[AI Quality Analyzer]
#         O -->|Generate Insights| P[Intelligent Recommendations]
#         P -->|Quality Metrics| Q[AI-Powered Reports]
#     end

#     subgraph "Output Layer"
#         Q -->|Build Reports| R[Data Docs Builder]
#         R -->|HTML Generation| S[Interactive Reports]
#         S -->|file:// URL| T[Browser Display]
#         Q -->|Success/Failure| U[Status Dashboard]
#         U -->|Real-time Metrics| V[Streamlit UI]
#         E -->|Monitor Quality| V
#     end

#     subgraph "Data Persistence"
#         M -.->|Save Suite| W[(Expectations Store)]
#         L -.->|Save Results| X[(Validations Store)]
#         R -.->|Write HTML| Y[(Data Docs Store)]
#         O -.->|Save Analysis| Z[(AI Insights Store)]
#     end

#     subgraph "Monitored Data Assets"
#         AA[{asset1}<br/>Columns: {architecture_context['total_columns']}]
#         BB[{asset2}<br/>Schema: {architecture_context['schemas'][0] if architecture_context['schemas'] else 'public'}]
#         CC[{datasource1}<br/>Database Connection]
#     end

#     CC --> AA
#     CC --> BB
#     AA --> I
#     BB --> I

#     style A fill:#e1f5ff
#     style D fill:#e1f5ff
#     style N fill:#fff3e0
#     style M fill:#f3e5f5
#     style R fill:#e8f5e9
#     style W fill:#fce4ec
#     style X fill:#fce4ec
#     style Y fill:#fce4ec
#     style Z fill:#fce4ec
#     style AA fill:#e8f5e9
#     style BB fill:#e8f5e9
#     style CC fill:#e8f5e9"""
    
#     return fallback_diagram

# def generate_mermaid_data_model(data_catalog, analyzer):
#     """Generate Mermaid data model diagram using Ollama AI"""
    
#     # Prepare data model context
#     data_model_context = {
#         'assets': {name: {
#             'columns': len(info['columns']),
#             'suites': info['expectation_suites'],
#             'type': info['type'],
#             'table': info['table_name']
#         } for name, info in data_catalog['data_assets'].items()},
#         'suites': {name: {
#             'expectations': info['quality_metrics']['total_expectations'],
#             'assets': info['data_assets']
#         } for name, info in data_catalog['expectation_suites'].items()}
#     }
    
#     prompt = f"""
#     You are a data modeling expert. Generate a Mermaid ER diagram for a Great Expectations data quality system.
    
#     Data Model Context:
#     {json.dumps(data_model_context, indent=2)}
    
#     Create a comprehensive Mermaid ER diagram that shows:
#     1. Data assets as entities with their attributes
#     2. Expectation suites as entities
#     3. Relationships between data assets and expectation suites
#     4. Column information for each data asset
#     5. Quality metrics and validation results
#     6. Proper entity-relationship notation
    
#     Use proper Mermaid ER syntax with:
#     - Entity definitions with attributes
#     - Relationship lines with cardinality
#     - Clear labeling
#     - Appropriate styling
#     - Logical grouping
    
#     Return ONLY the Mermaid code, no explanations or markdown formatting.
#     """
    
#     print("Generating Mermaid data model diagram with Ollama AI...")
#     ai_response = analyzer.ollama_infer(prompt)
    
#     if ai_response is None:
#         print("Ollama AI unavailable, using fallback Mermaid data model...")
#         ai_response = generate_fallback_mermaid_data_model(data_model_context)
    
#     return ai_response.strip()

# def generate_fallback_mermaid_data_model(data_model_context):
#     """Generate fallback Mermaid data model diagram"""
    
#     fallback_diagram = f"""erDiagram
#     DATA_ASSET {{
#         string asset_name PK
#         string asset_type
#         string table_name
#         string schema_name
#         string datasource
#         int column_count
#         int suite_count
#     }}
    
#     EXPECTATION_SUITE {{
#         string suite_name PK
#         int total_expectations
#         float success_rate
#         int exceptions
#         string data_assets
#     }}
    
#     COLUMN {{
#         string column_name PK
#         string asset_name FK
#         int expectation_count
#         float success_rate
#         int exceptions
#     }}
    
#     VALIDATION_RUN {{
#         string run_id PK
#         string asset_name FK
#         string suite_name FK
#         datetime timestamp
#         int expectation_count
#         float success_rate
#     }}
    
#     DATA_ASSET ||--o{{ COLUMN : contains
#     DATA_ASSET ||--o{{ VALIDATION_RUN : validates
#     EXPECTATION_SUITE ||--o{{ VALIDATION_RUN : executes
#     DATA_ASSET }}o--o{{ EXPECTATION_SUITE : monitored_by"""
    
#     return fallback_diagram

# def render_mermaid_diagram(mermaid_code, diagram_name="diagram"):
#     """Render Mermaid diagram in Jupyter notebook"""
    
#     # Create HTML with Mermaid.js
#     html_content = """
#     <div class="mermaid">
#     """ + mermaid_code + """
#     </div>
    
#     <script src="https://cdn.jsdelivr.net/npm/mermaid/dist/mermaid.min.js"></script>
#     <script>
#         mermaid.initialize({"startOnLoad": true, "theme": "default"});
#     </script>
#     """
    
#     return HTML(html_content)

# def save_mermaid_diagram(mermaid_code, filename):
#     """Save Mermaid diagram to file"""
    
#     file_path = Path("notebooks/great_expectations") / filename
    
#     with open(file_path, 'w', encoding='utf-8') as f:
#         f.write(mermaid_code)
    
#     print(f"Mermaid diagram saved to: {file_path}")
#     return file_path

# # Generate Mermaid diagrams
# print("Generating Mermaid diagrams with Ollama AI...")

# if 'data_catalog' in locals():
#     # Generate architecture diagram
#     print("Creating architecture diagram...")
#     architecture_mermaid = generate_mermaid_architecture(data_catalog, analyzer)
    
#     # Save architecture diagram
#     arch_file = save_mermaid_diagram(architecture_mermaid, "architecture_diagram.mmd")
    
#     # Render architecture diagram
#     print("Rendering architecture diagram...")
#     # display(render_mermaid_diagram(architecture_mermaid, "architecture"))
    
#     # Generate data model diagram
#     print("Creating data model diagram...")
#     data_model_mermaid = generate_mermaid_data_model(data_catalog, analyzer)
    
#     # Save data model diagram
#     model_file = save_mermaid_diagram(data_model_mermaid, "data_model_diagram.mmd")
    
#     # Render data model diagram
#     print("Rendering data model diagram...")
#     # display(render_mermaid_diagram(data_model_mermaid, "data_model"))
    
#     # Create combined markdown report with diagrams
#     combined_report = f"""# Data Architecture and Model Diagrams

# ## Architecture Diagram

# ```mermaid
# {architecture_mermaid}
# ```

# ## Data Model Diagram

# ```mermaid
# {data_model_mermaid}
# ```

# ---
# *Diagrams generated by Ollama AI and Great Expectations Validation Analysis system*
# """
    
#     # Save combined report
#     combined_report_path = Path("notebooks/great_expectations") / "architecture_diagrams_report.md"
#     with open(combined_report_path, 'w', encoding='utf-8') as f:
#         f.write(combined_report)
    
#     print(f"Combined diagrams report saved to: {combined_report_path}")
    
#     print("Mermaid diagram generation complete!")
#     print(f"Architecture diagram: {arch_file}")
#     print(f"Data model diagram: {model_file}")
#     print(f"Combined report: {combined_report_path}")
    
# else:
#     print("Data catalog not available. Please run the data catalog generation cell first.")
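
The prompts in the cell above ask the model to return only Mermaid code, but LLM responses often arrive wrapped in a Markdown code fence anyway. Since the combined report adds its own `mermaid` fence, a wrapped response would end up double-fenced. A hedged sketch of a cleanup step, assuming the response is a plain string; `strip_code_fence` is a hypothetical helper, not an Ollama or notebook API.

```python
import re

# Matches an opening fence with an optional language tag and a closing fence.
_FENCE = re.compile(r"^\s*```[\w-]*[ \t]*\n(.*?)\n?```\s*$", re.DOTALL)

def strip_code_fence(text):
    """Return text without a surrounding Markdown code fence, if one is present."""
    match = _FENCE.match(text)
    return match.group(1).strip() if match else text.strip()

fenced = "```mermaid\ngraph TB\n    A --> B\n```"
print(strip_code_fence(fenced))
```

Applying it to the model output before saving would leave unfenced responses unchanged.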


In [34]:
# from pathlib import Path
# import shutil

# # Define the outputs directory path
# outputs_dir = Path("notebooks/great_expectations/outputs")
# outputs_dir.mkdir(exist_ok=True)

# # List of files to move into the outputs directory
# files_to_move = [
#     "architecture_diagram.mmd",
#     "data_model_diagram.mmd",
#     "architecture_diagrams_report.md",
#     "validation_analysis_report.md",
#     "validation_analysis_report_enhanced.md",
#     "data_catalog_report.md",
#     "data_catalog.json",
#     "validation_analysis_report_enhanced.pdf"
# ]

# # Move each file to the outputs directory if it exists
# for filename in files_to_move:
#     src = Path("notebooks/great_expectations") / filename
#     dst = outputs_dir / filename
#     if src.exists():
#         shutil.move(str(src), str(dst))
#         print(f"Moved {src} to {dst}")
#     else:
#         print(f"File not found, skipping: {src}")

# print("All specified files have been moved to the outputs directory.")
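
The move loop above checks `src.exists()` before acting on each file. The same check can drive a listing of what a run actually produced, instead of a fixed list of names. A minimal sketch under that assumption; `split_existing` is a hypothetical helper, and the demonstration uses a temporary directory rather than the real output folder.

```python
from pathlib import Path
import tempfile

def split_existing(base_dir, filenames):
    """Partition filenames into those present under base_dir and those missing."""
    base = Path(base_dir)
    found = [name for name in filenames if (base / name).is_file()]
    missing = [name for name in filenames if name not in found]
    return found, missing

# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "data_catalog.json").write_text("{}", encoding="utf-8")
    found, missing = split_existing(tmp, ["data_catalog.json", "architecture_diagram.mmd"])

print("found:", found)
print("missing:", missing)
```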


In [35]:
# Analysis Completion and Timing Summary
end_time = time.time()
analysis_end = datetime.now()
total_duration = end_time - start_time

# Calculate timing breakdown
hours = int(total_duration // 3600)
minutes = int((total_duration % 3600) // 60)
seconds = int(total_duration % 60)

print("=" * 80)
print("GREAT EXPECTATIONS VALIDATION ANALYSIS COMPLETE!")
print("=" * 80)
print(f"Analysis completed at: {analysis_end.strftime('%Y-%m-%d %H:%M:%S')}")
print(f"Total execution time: {hours:02d}:{minutes:02d}:{seconds:02d}")
print(f"Total duration: {total_duration:.2f} seconds")
print("=" * 80)

# Performance summary
if total_duration < 60:
    performance = "Excellent - Very fast execution"
elif total_duration < 300:
    performance = "Good - Fast execution"
elif total_duration < 600:
    performance = "Moderate - Reasonable execution time"
else:
    performance = "Slow - Consider optimization"

print(f"📈 Performance rating: {performance}")
print("=" * 80)

# Summary of outputs generated
print("📄 Generated Reports and Files:")
print("   • validation_analysis_report.md - Basic analysis report")
print("   • validation_analysis_report_enhanced.md - Enhanced report with data catalog")
print("   • validation_analysis_report_enhanced.pdf - Professional PDF report")
print("   • data_catalog_report.md - Data catalog documentation")
print("   • data_catalog.json - Structured data catalog")
print("   • architecture_diagram.mmd - System architecture diagram")
print("   • data_model_diagram.mmd - Data model ER diagram")
print("   • enhanced_data_model_diagram_fixed.mmd - Enhanced data model")
print("   • architecture_diagrams_report.md - Combined diagrams report")
print("   • comprehensive_validation_analysis_report.md - Complete analysis report")

print("=" * 80)
print("All analysis tasks completed successfully!")
print("Ready for stakeholder review and presentation")
print("=" * 80)


GREAT EXPECTATIONS VALIDATION ANALYSIS COMPLETE!
Analysis completed at: 2025-10-07 14:40:32
Total execution time: 00:00:07
Total duration: 7.80 seconds
📈 Performance rating: Excellent - Very fast execution
📄 Generated Reports and Files:
   • validation_analysis_report.md - Basic analysis report
   • validation_analysis_report_enhanced.md - Enhanced report with data catalog
   • validation_analysis_report_enhanced.pdf - Professional PDF report
   • data_catalog_report.md - Data catalog documentation
   • data_catalog.json - Structured data catalog
   • architecture_diagram.mmd - System architecture diagram
   • data_model_diagram.mmd - Data model ER diagram
   • enhanced_data_model_diagram_fixed.mmd - Enhanced data model
   • architecture_diagrams_report.md - Combined diagrams report
   • comprehensive_validation_analysis_report.md - Complete analysis report
All analysis tasks completed successfully!
Ready for stakeholder review and presentation
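
The timing arithmetic and the rating thresholds in the cell above can be wrapped in small functions so they are easy to check in isolation. This is a sketch only: the function names `format_duration` and `rate_performance` are hypothetical, while the thresholds and labels are copied from the cell.

```python
def format_duration(total_seconds):
    """Format a duration in seconds as HH:MM:SS, truncating fractional seconds."""
    minutes, seconds = divmod(int(total_seconds), 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"

def rate_performance(total_seconds):
    """Map a duration to the rating labels used in the summary cell."""
    thresholds = [
        (60, "Excellent - Very fast execution"),
        (300, "Good - Fast execution"),
        (600, "Moderate - Reasonable execution time"),
    ]
    for limit, label in thresholds:
        if total_seconds < limit:
            return label
    return "Slow - Consider optimization"

print(format_duration(7.80), rate_performance(7.80))
```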


In [None]:
# Generate AI-Powered Executive Summary for Professional PDF Report
print("🤖 Generating AI-powered executive summary for professional PDF report...")

def generate_ai_executive_summary(df, quality_metrics, data_summary, analyzer):
    """Generate AI-powered executive summary following professional standards"""
    
    # Extract key metrics for summary
    overall_success = quality_metrics['overall_success_rate']
    exception_rate = quality_metrics['exception_rate']
    total_expectations = len(df)
    
    # Get top failing expectation types for summary
    failing_types = quality_metrics['type_metrics'].nsmallest(3, 'success_rate')
    
    # Generate AI-powered executive summary
    executive_summary_prompt = f"""
    You are a senior data quality consultant writing an executive summary for a Great Expectations validation analysis report.
    
    Data Quality Context:
    - Total Expectations: {total_expectations}
    - Overall Success Rate: {overall_success:.2%}
    - Exception Rate: {exception_rate:.2%}
    - Expectation Types: {data_summary['expectation_types']}
    - Validation Suites: {data_summary['suite_count']}
    - Critical Issues: {int((quality_metrics['type_metrics']['success_rate'] < 0.8).sum())} expectation types below 80% success rate
    
    Top Failing Expectation Types:
    {failing_types[['total_expectations', 'successful_expectations', 'success_rate', 'exceptions']].to_dict(orient='index')}
    
    Write a professional executive summary that follows these guidelines:
    
    1. **Problem Statement**: Define the data quality challenge being addressed
    2. **Solution Approach**: Explain how Great Expectations addresses the problem
    3. **Key Findings**: Present the most critical insights from the analysis
    4. **Business Impact**: Highlight the value and benefits of the data quality program
    5. **Call to Action**: Provide clear next steps for decision-makers
    
    Requirements:
    - Write for C-level executives and decision-makers
    - Use clear, simple language (15-year-old reading level)
    - Keep the summary between 500 and 800 words
    - Focus on business value and strategic importance
    - Include specific metrics and actionable recommendations
    - Avoid technical jargon
    - Create urgency for immediate action
    
    Format as a professional executive summary suitable for a board presentation.
    """
    
    print("🤖 Generating AI-powered executive summary...")
    ai_executive_summary = analyzer.ollama_infer(executive_summary_prompt)
    
    # Fallback executive summary if AI is unavailable
    if ai_executive_summary is None:
        print("⚠️ AI unavailable, using fallback executive summary...")
        # Count over every expectation type, not only the three worst
        critical_count = int((quality_metrics['type_metrics']['success_rate'] < 0.8).sum())
        # Left-aligned and without a heading: Markdown turns text indented by four or more
        # spaces into a code block, and the report template adds its own "## Executive Summary"
        ai_executive_summary = f"""This Great Expectations validation analysis covers {data_summary['suite_count']} validation suites monitoring {total_expectations} data quality expectations.

**Key Findings:**
The overall success rate is {overall_success:.2%} and the exception rate is {exception_rate:.2%}. {critical_count} expectation types have success rates below 80% and require immediate attention.

**Business Impact:**
Failing expectations can reduce the accuracy of downstream analytics and reporting. Remediating them promptly helps maintain trust in the data.

**Recommendation:**
Prioritize fixing expectation types with success rates below 80% to ensure comprehensive data quality coverage and maintain stakeholder confidence in our data assets.
"""
    
    return ai_executive_summary

# Generate AI executive summary
ai_executive_summary = generate_ai_executive_summary(df, quality_metrics, data_summary, analyzer)

# Create professional report with AI executive summary
def create_professional_report_with_ai_summary(df, quality_metrics, ai_insights, data_summary, ai_executive_summary, data_catalog=None):
    """Create professional report with AI-generated executive summary"""
    
    # Extract key metrics
    overall_success = quality_metrics['overall_success_rate']
    exception_rate = quality_metrics['exception_rate']
    total_expectations = len(df)
    failing_types = quality_metrics['type_metrics'].nsmallest(3, 'success_rate')
    
    report = f"""# Great Expectations Validation Analysis Report

**Generated on:** {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}  
**Analysis Period:** {data_summary['date_range']}

## Executive Summary

{ai_executive_summary}

## Critical Findings

### Top Issues Requiring Attention
"""
    
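    # Add critical issues (only expectation types below the 80% threshold)
    critical_types = failing_types[failing_types['success_rate'] < 0.8]
    for idx, (exp_type, metrics) in enumerate(critical_types.iterrows(), 1):
        # iterrows() can upcast integer counts to float, so format them without decimals
        report += f"{idx}. **{exp_type}**: {metrics['success_rate']:.1%} success rate ({metrics['successful_expectations']:.0f}/{metrics['total_expectations']:.0f} expectations)\n"
    if critical_types.empty:
        report += "No expectation types are currently below the 80% success-rate threshold.\n"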
    
    report += f"""
## Data Quality Analysis

### Overall Performance Metrics

| Metric | Value |
|--------|-------|
| Total Expectations | {total_expectations} |
| Overall Success Rate | {overall_success:.2%} |
| Exception Rate | {exception_rate:.2%} |
| Expectation Types | {data_summary['expectation_types']} |
| Validation Suites | {data_summary['suite_count']} |

### Suite Performance

| Suite Name | Expectations | Success Rate | Exceptions |
|------------|-------------|--------------|------------|
"""
    
    # Add suite metrics (counts formatted without decimals, since iterrows() can upcast integers to float)
    for suite, metrics in quality_metrics['suite_metrics'].iterrows():
        report += f"| {suite} | {metrics['total_expectations']:.0f} | {metrics['success_rate']:.2%} | {metrics['exceptions']:.0f} |\n"
    
    report += f"""
### Expectation Type Performance

| Expectation Type | Count | Success Rate | Exceptions |
|------------------|-------|--------------|------------|
"""
    
    # Add type metrics (the count column is named 'total_expectations')
    for exp_type, metrics in quality_metrics['type_metrics'].iterrows():
        report += f"| {exp_type} | {metrics['total_expectations']:.0f} | {metrics['success_rate']:.2%} | {metrics['exceptions']:.0f} |\n"
    
    report += f"""
## AI-Powered Analysis

{ai_insights}
"""
    
    # Add the data catalog section only when a catalog was supplied,
    # so the report never contains table headers with no rows beneath them
    if data_catalog:
        report += """
## Data Catalog Summary

### Data Assets Overview

| Asset Name | Type | Table | Schema | Datasource | Columns | Suites |
|------------|------|-------|--------|------------|---------|--------|
"""
        for asset_name, asset_info in data_catalog['data_assets'].items():
            report += f"| {asset_name} | {asset_info['type']} | {asset_info['table_name']} | {asset_info['schema_name']} | {asset_info['datasource']} | {len(asset_info['columns'])} | {len(asset_info['expectation_suites'])} |\n"
        
        report += """
### Expectation Suites Overview

| Suite Name | Total Expectations | Success Rate | Exceptions | Data Assets |
|------------|-------------------|--------------|------------|-------------|
"""
        for suite_name, suite_info in data_catalog['expectation_suites'].items():
            report += f"| {suite_name} | {suite_info['quality_metrics']['total_expectations']} | {suite_info['quality_metrics']['success_rate']:.2%} | {suite_info['quality_metrics']['exceptions']} | {len(suite_info['data_assets'])} |\n"
    
    report += f"""
## Recommendations

Based on the analysis, the following actions are recommended:

1. **Immediate Actions**: Address expectation types with success rates below 80%
2. **Monitoring**: Implement daily monitoring for critical data assets
3. **Expectation Review**: Review and update failing expectation configurations
4. **Process Improvement**: Establish data quality governance processes

## Technical Details

- **Analysis Engine**: Great Expectations v0.18.22
- **AI Analysis**: Ollama LLM (gpt-oss:20b)
- **Data Source**: Validation results from BirdiDQ/gx/uncommitted/validations
- **Report Generated**: {datetime.now().isoformat()}

---
*This report was automatically generated by the Great Expectations Validation Analysis system.*
"""
    
    return report

# Generate the professional report with AI executive summary
professional_report = create_professional_report_with_ai_summary(df, quality_metrics, ai_insights, data_summary, ai_executive_summary, data_catalog)

# Save professional markdown report (create the output directory if it does not exist)
professional_report_path = Path("notebooks/great_expectations") / "validation_analysis_report_ai_executive.md"
professional_report_path.parent.mkdir(parents=True, exist_ok=True)
professional_report_path.write_text(professional_report, encoding='utf-8')

print(f"📄 Professional markdown report with AI executive summary saved to: {professional_report_path}")

# Generate professional PDF report
print("📄 Generating professional PDF report with AI executive summary...")
professional_pdf_path = generate_pdf_report(professional_report, "validation_analysis_report_ai_executive.pdf")

if professional_pdf_path:
    print("✅ Professional PDF report with AI executive summary generated successfully!")
    print(f"📁 Professional PDF location: {professional_pdf_path}")
    print("\n🎯 This PDF features:")
    print("   • AI-generated executive summary for C-level executives")
    print("   • Clean formatting without grey backgrounds")
    print("   • Professional structure following best practices")
    print("   • Clear call-to-action and business impact focus")
    print("   • Strategic recommendations for decision-makers")
else:
    print("⚠️ Professional PDF generation failed - check error messages above")
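
When a Markdown fragment is written as an indented triple-quoted string inside a function, the leading spaces stay in the text, and Markdown can render lines indented by four or more spaces as a code block. A minimal sketch of stripping that indentation with `textwrap.dedent` follows. The function name `build_fallback_summary` and its parameters are hypothetical stand-ins for values the notebook computes, and the numbers are illustrative only.

```python
import textwrap


def build_fallback_summary(suite_count: int, total_expectations: int,
                           overall_success: float) -> str:
    """Return a Markdown fallback summary with no leading indentation.

    Hypothetical helper; the parameters stand in for values the notebook
    computes from its validation results.
    """
    summary = f"""
        This analysis covers {suite_count} validation suites monitoring
        {total_expectations} data quality expectations.

        **Key Findings:**
        The overall success rate is {overall_success:.2%}.
    """
    # dedent() removes the whitespace common to all non-blank lines;
    # strip() drops the blank first and last lines left by the triple quotes
    return textwrap.dedent(summary).strip()


# Illustrative values only, not taken from real validation results
text = build_fallback_summary(suite_count=4, total_expectations=120,
                              overall_success=0.9125)
print(text)
```

This keeps the source readable at its natural indentation while the rendered report receives flush-left Markdown.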
