# AutoGen Data Analysis Pipeline Overview

This notebook provides an overview of the AutoGen Data Analysis Pipeline with AMP protocol integration.

## Pipeline Architecture

The pipeline consists of six specialized AutoGen agents:

1. **Data Collector Agent** - Ingests data from various sources
2. **Data Cleaner Agent** - Handles data quality and preprocessing
3. **Statistical Analyst Agent** - Performs statistical analysis
4. **ML Analyst Agent** - Builds and evaluates ML models
5. **Visualization Agent** - Creates charts and dashboards
6. **Quality Assurance Agent** - Validates results and ensures accuracy

## Key Features

- **Multi-Agent Collaboration**: Agents work together using AutoGen's conversation framework
- **AMP Protocol Integration**: Standardized communication between agents
- **End-to-End Automation**: Complete data analysis workflow from ingestion to reporting
- **Quality Assurance**: Built-in validation and quality checks
- **Extensible Architecture**: Easy to add new agents and capabilities


In [None]:
# Setup and imports
import sys
import os
from pathlib import Path

# Add pipeline modules to path
notebook_dir = Path.cwd()
project_root = notebook_dir.parent
sys.path.append(str(project_root))
sys.path.append(str(project_root / 'pipelines'))
sys.path.append(str(project_root.parent / 'shared-lib'))

print(f"Notebook directory: {notebook_dir}")
print(f"Project root: {project_root}")

In [None]:
# Import pipeline components
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from pipelines.data_pipeline import DataAnalysisPipeline, PipelineConfig
from amp_types import TransportType

# Configure plotting
plt.style.use('default')
sns.set_palette("husl")
%matplotlib inline

## Pipeline Configuration

Let's create a basic pipeline configuration:

In [None]:
# Create pipeline configuration
config = PipelineConfig(
    pipeline_id="notebook_demo_001",
    pipeline_name="Notebook Demo Pipeline",
    
    # Enable all components for demo
    enable_data_collection=True,
    enable_data_cleaning=True,
    enable_statistical_analysis=True,
    enable_ml_analysis=True,
    enable_visualization=True,
    enable_quality_assurance=True,
    
    # Quality settings
    data_quality_threshold=0.8,
    model_performance_threshold=0.7,
    
    # Output settings
    generate_report=True,
    create_dashboard=True,
    save_artifacts=True,
    
    # LLM configuration (placeholder)
    llm_config={
        "config_list": [
            {
                "model": "gpt-4",
                "api_key": os.environ.get("OPENAI_API_KEY", "demo-key"),
                "api_type": "openai"
            }
        ]
    }
)

print(f"Pipeline ID: {config.pipeline_id}")
enabled_flags = [
    config.enable_data_collection,
    config.enable_data_cleaning,
    config.enable_statistical_analysis,
    config.enable_ml_analysis,
    config.enable_visualization,
    config.enable_quality_assurance,
]
print(f"Components enabled: {sum(enabled_flags)}")

## Sample Data

Let's create some sample data for demonstration:

In [None]:
# Create sample dataset
np.random.seed(42)
n_samples = 500

# Generate synthetic employee data
data = {
    'employee_id': range(1, n_samples + 1),
    'age': np.random.randint(22, 65, n_samples),
    'department': np.random.choice(['Engineering', 'Sales', 'Marketing', 'HR'], n_samples),
    'years_experience': np.random.randint(0, 25, n_samples),
    'education_level': np.random.choice(['Bachelor', 'Master', 'PhD'], n_samples, p=[0.5, 0.35, 0.15]),
    'salary': np.random.normal(75000, 20000, n_samples),
    'satisfaction_score': np.random.uniform(1, 10, n_samples),
    'performance_rating': np.random.choice(['Poor', 'Average', 'Good', 'Excellent'], n_samples, p=[0.1, 0.3, 0.4, 0.2])
}

# Add some correlations
for i in range(n_samples):
    # Salary correlates with experience and education
    if data['education_level'][i] == 'PhD':
        data['salary'][i] = max(data['salary'][i], np.random.normal(90000, 15000))
    elif data['education_level'][i] == 'Master':
        data['salary'][i] = max(data['salary'][i], np.random.normal(80000, 18000))
    
    data['salary'][i] += data['years_experience'][i] * 1200
    
    # Performance correlates with satisfaction
    if data['satisfaction_score'][i] > 8:
        data['performance_rating'][i] = np.random.choice(['Good', 'Excellent'], p=[0.3, 0.7])
    elif data['satisfaction_score'][i] < 4:
        data['performance_rating'][i] = np.random.choice(['Poor', 'Average'], p=[0.6, 0.4])

# Create DataFrame
df = pd.DataFrame(data)

# Introduce some missing values
missing_indices = np.random.choice(n_samples, size=int(n_samples * 0.05), replace=False)
df.loc[missing_indices, 'satisfaction_score'] = np.nan

print(f"Sample dataset created: {df.shape}")
print(f"Missing values: {df.isnull().sum().sum()}")
df.head()

In [None]:
# Basic data exploration
print("Dataset Overview:")
print(f"Shape: {df.shape}")
print("\nData Types:")
print(df.dtypes)
print("\nMissing Values:")
print(df.isnull().sum())
print("\nNumerical Summary:")
df.describe()

In [None]:
# Basic visualizations
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Age distribution
axes[0, 0].hist(df['age'], bins=20, alpha=0.7, edgecolor='black')
axes[0, 0].set_title('Age Distribution')
axes[0, 0].set_xlabel('Age')
axes[0, 0].set_ylabel('Frequency')

# Salary by department
df.boxplot(column='salary', by='department', ax=axes[0, 1])
axes[0, 1].set_title('Salary by Department')
axes[0, 1].set_xlabel('Department')
axes[0, 1].set_ylabel('Salary')
fig.suptitle('')  # pandas adds a "Boxplot grouped by ..." figure title; clear it

# Performance rating distribution
performance_counts = df['performance_rating'].value_counts()
axes[1, 0].bar(performance_counts.index, performance_counts.values)
axes[1, 0].set_title('Performance Rating Distribution')
axes[1, 0].set_xlabel('Performance Rating')
axes[1, 0].set_ylabel('Count')
axes[1, 0].tick_params(axis='x', rotation=45)

# Satisfaction vs Salary scatter (rows with a missing satisfaction score are not drawn)
axes[1, 1].scatter(df['satisfaction_score'], df['salary'], alpha=0.6)
axes[1, 1].set_title('Satisfaction vs Salary')
axes[1, 1].set_xlabel('Satisfaction Score')
axes[1, 1].set_ylabel('Salary')

plt.tight_layout()
plt.show()

## Pipeline Demonstration

**Note**: The following cells demonstrate the pipeline structure. With valid API keys and a configured AMP network, they would execute the full pipeline.

For now, we initialize the pipeline and inspect its structure:

In [None]:
# Initialize the pipeline
pipeline = DataAnalysisPipeline(config)

print(f"Pipeline initialized: {pipeline.pipeline_id}")
print(f"Agents available: {list(pipeline.agents.keys())}")

# Show agent capabilities
for agent_name, agent in pipeline.agents.items():
    print(f"\n{agent_name.upper()}:")
    print(f"  Capabilities: {list(agent.capabilities.keys())}")
    print(f"  Framework: {agent.amp_config.framework}")

In [None]:
# Save sample data for pipeline processing
sample_file = project_root / "data" / "notebook_sample_data.csv"
sample_file.parent.mkdir(parents=True, exist_ok=True)
df.to_csv(sample_file, index=False)

print(f"Sample data saved to: {sample_file}")
print("Ready for pipeline processing!")

## Analysis Request

Define what we want the pipeline to analyze:

In [None]:
analysis_request = """
Analyze the employee dataset to understand:
1. Relationship between age, education, experience and salary
2. Performance patterns across different departments
3. Factors that predict employee satisfaction
4. Build a classification model to predict performance rating
5. Identify any data quality issues and recommendations
"""

context = {
    "business_context": "Employee performance and satisfaction analysis",
    "target_metric": "performance_rating",
    "key_stakeholders": ["HR", "Management"],
    "analysis_type": "classification"
}

print("Analysis Request:")
print(analysis_request)
print(f"\nContext: {context}")

## Pipeline Execution

**Note**: The following cell would run the complete pipeline in a properly configured environment. 

```python
# This would run the complete pipeline
results = await pipeline.run_pipeline(
    data_source=str(sample_file),
    analysis_request=analysis_request,
    context=context
)
```

The pipeline would execute these steps:

1. **Data Collection**: Load and validate the CSV file
2. **Data Cleaning**: Handle missing values, detect outliers, remove duplicates
3. **Statistical Analysis**: Descriptive statistics, correlation analysis, hypothesis testing
4. **ML Analysis**: Feature engineering, model training (Random Forest, Logistic Regression, etc.)
5. **Visualization**: Create plots, dashboards, and reports
6. **Quality Assurance**: Validate data quality and model performance

Each agent would contribute its specialized analysis, and the results would be integrated into a comprehensive report.
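
The `await` call shown earlier in this section relies on the top-level `await` that Jupyter provides. In a plain Python script the same call needs an event loop. The sketch below shows that pattern with a stand-in class. `StubPipeline` and the shape of its return value are assumptions for illustration only, not the real `DataAnalysisPipeline` API.

```python
import asyncio


class StubPipeline:
    """Hypothetical stand-in for DataAnalysisPipeline (illustration only)."""

    # Stage names follow the six steps listed in this section.
    STAGES = [
        "data_collection",
        "data_cleaning",
        "statistical_analysis",
        "ml_analysis",
        "visualization",
        "quality_assurance",
    ]

    async def run_pipeline(self, data_source, analysis_request, context=None):
        results = {}
        for stage in self.STAGES:
            # A real agent would do its work here; the stub only yields control.
            await asyncio.sleep(0)
            results[stage] = {"status": "skipped (stub)", "source": data_source}
        return results


async def main():
    pipeline = StubPipeline()
    return await pipeline.run_pipeline(
        data_source="data/notebook_sample_data.csv",
        analysis_request="Predict employee performance",
        context={"target_metric": "performance_rating"},
    )


# asyncio.run creates the event loop that a notebook would otherwise supply.
# Inside Jupyter, use `await main()` instead.
if __name__ == "__main__":
    results = asyncio.run(main())
    print(list(results))
```

Running the stages one after another, as the stub does, matches the order of the steps listed here. Whether the real pipeline runs any stages concurrently is not stated in this notebook.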

## Expected Pipeline Outputs

The pipeline would generate:

### Data Collection Results
- Dataset metadata (shape, columns, data types)
- Data source validation report
- Sample data preview

### Data Cleaning Results
- Missing value imputation report
- Outlier detection and handling
- Data quality score and recommendations
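
This notebook does not define how the data quality score is computed. One simple possibility, sketched here purely as an assumption, is the fraction of non-missing cells. The hypothetical `completeness_score` helper could then be compared against `data_quality_threshold`.

```python
import numpy as np
import pandas as pd


def completeness_score(frame: pd.DataFrame) -> float:
    """Fraction of cells that are not missing (hypothetical quality metric)."""
    if frame.size == 0:
        return 0.0
    return float(frame.notna().to_numpy().mean())


# Toy data, for illustration only: 6 of the 8 cells are present.
toy = pd.DataFrame({
    "salary": [70000.0, 82000.0, np.nan, 91000.0],
    "satisfaction_score": [7.5, np.nan, 6.1, 8.8],
})

score = completeness_score(toy)
passes = score >= 0.8  # 0.8 mirrors data_quality_threshold in the config cell
print(f"score={score:.2f}, passes={passes}")
```

A real quality score would likely also weigh duplicates, outliers, and type consistency. Completeness is only the easiest component to illustrate.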

### Statistical Analysis Results
- Descriptive statistics for all variables
- Correlation matrix and significant relationships
- Hypothesis test results

### ML Analysis Results
- Feature importance ranking
- Model performance comparison
- Best model selection and metrics
- Predictions and confidence intervals

### Visualizations
- Distribution plots
- Correlation heatmaps
- Model performance charts
- Interactive dashboards

### Quality Assurance Report
- Data validation results
- Model validation against thresholds
- Pipeline audit and compliance check
- Final recommendations
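
As a rough illustration of validation against thresholds, the check below compares measured scores with the two thresholds set in `PipelineConfig` earlier (0.8 for data quality, 0.7 for model performance). The `validate_against_thresholds` function and the measured values are hypothetical. The actual Quality Assurance Agent may apply different or additional rules.

```python
from dataclasses import dataclass


@dataclass
class Thresholds:
    # Values copied from the PipelineConfig cell in this notebook.
    data_quality: float = 0.8
    model_performance: float = 0.7


def validate_against_thresholds(measured: dict, thresholds: Thresholds) -> dict:
    """Return a pass/fail flag per metric (hypothetical QA rule)."""
    limits = {
        "data_quality": thresholds.data_quality,
        "model_performance": thresholds.model_performance,
    }
    report = {}
    for name, limit in limits.items():
        value = measured.get(name)
        # A missing measurement counts as a failure rather than a silent pass.
        report[name] = value is not None and value >= limit
    return report


# Made-up measurements, for illustration only.
report = validate_against_thresholds(
    {"data_quality": 0.95, "model_performance": 0.64}, Thresholds()
)
print(report)
```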

## Next Steps

To run this pipeline in a real environment:

1. **Set up API Keys**: Configure OpenAI API key for LLM access
2. **AMP Network**: Set up AMP protocol network (or use local mode)
3. **Dependencies**: Install all required packages from requirements.txt
4. **Configuration**: Adjust pipeline and agent configurations as needed
5. **Execution**: Run the pipeline using the command line or this notebook
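
For step 1, it can help to fail fast when the key is absent instead of silently using the `"demo-key"` placeholder from the configuration cell. A minimal sketch follows. `require_api_key` is a hypothetical helper and not part of the pipeline.

```python
import os


def require_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the key from the environment or raise a clear error."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    # Treat the notebook's placeholder value as "not configured".
    if not key or key == "demo-key":
        raise RuntimeError(f"{name} is not set; export it before running the pipeline.")
    return key


# Example with an explicit mapping so the check can be tried without a real key.
print(require_api_key({"OPENAI_API_KEY": "sk-example"}))
```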

### Command Line Usage

```bash
# Run with sample data
python run_pipeline.py --sample

# Run with custom data
python run_pipeline.py --data data/notebook_sample_data.csv --request "Predict employee performance"
```
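
This notebook does not show `run_pipeline.py` itself. The following is a hypothetical sketch of how the three flags used in the commands here could be parsed with `argparse`. The actual script's options, defaults, and whether `--sample` and `--data` are mutually exclusive may differ.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Run the data analysis pipeline")
    # Assumption: exactly one data source must be chosen.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--sample", action="store_true", help="use generated sample data")
    source.add_argument("--data", help="path to a CSV file")
    parser.add_argument("--request", default="", help="free-text analysis request")
    return parser


args = build_parser().parse_args(
    ["--data", "data/notebook_sample_data.csv", "--request", "Predict employee performance"]
)
print(args.sample, args.data, args.request)
```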

### Other Notebooks

Explore other notebooks in this directory:
- `02_Agent_Deep_Dive.ipynb` - Detailed exploration of individual agents
- `03_Custom_Analysis.ipynb` - Custom analysis workflows
- `04_Visualization_Examples.ipynb` - Visualization capabilities showcase