# GreenMining Experiment Notebook

This notebook runs a comprehensive experiment that demonstrates the features of the `greenmining` library for Mining Software Repositories (MSR) in Green IT research.

**Experiment setup:**
- **Search experiment**: 10 blockchain repositories fetched via GitHub GraphQL API
- **Selected repositories**: 2 handpicked repos (`pallets/flask`, `psf/requests`)
- **Single repository**: 1 deep analysis repo (`tiangolo/fastapi`)
- **Commits per repo**: 20
- **Minimum stars**: 3
- **Languages**: Top 20 programming languages
- **All features enabled**: Energy tracking, method-level analysis, source code access, carbon reporting, power regression, correlation analysis, version comparison

---
## 1. Installation

Install `greenmining` from PyPI, together with the helper packages `python-dotenv` and `tqdm`.

In [None]:
!pip install greenmining python-dotenv tqdm --quiet

Note: you may need to restart the kernel to use updated packages.


---
## 2. Library Overview

Verify the installation and inspect the library version, available pattern count, and module structure.

In [10]:
import greenmining

print(f"greenmining version: {greenmining.__version__}")
print(f"Total GSF patterns: {len(greenmining.GSF_PATTERNS)}")
print(f"Green keywords count: {len(greenmining.GREEN_KEYWORDS)}")
print(f"\nPublic API: {greenmining.__all__}")

greenmining version: 1.0.8
Total GSF patterns: 122
Green keywords count: 321

Public API: ['Config', 'GSF_PATTERNS', 'GREEN_KEYWORDS', 'is_green_aware', 'get_pattern_by_keywords', 'fetch_repositories', 'analyze_repositories', '__version__']


---
## 3. Pattern Detection

The GSF (Green Software Foundation) pattern catalog contains 122 sustainability patterns across 15 categories. Each pattern has associated keywords, SCI impact classification, and descriptive metadata.

### 3.1 Explore Pattern Categories

In [11]:
from greenmining import GSF_PATTERNS, is_green_aware, get_pattern_by_keywords

# Count patterns per category
categories = {}
for pid, pattern in GSF_PATTERNS.items():
    cat = pattern["category"]
    categories[cat] = categories.get(cat, 0) + 1

print("GSF Pattern Categories:")
print("-" * 40)
for cat, count in sorted(categories.items(), key=lambda x: -x[1]):
    print(f"  {cat:20s} {count:3d} patterns")
print(f"{'':20s} {'---':>3s}")
print(f"  {'TOTAL':20s} {sum(categories.values()):3d}")

GSF Pattern Categories:
----------------------------------------
  cloud                 40 patterns
  ai                    19 patterns
  web                   17 patterns
  general                8 patterns
  network                6 patterns
  database               5 patterns
  code                   4 patterns
  microservices          4 patterns
  infrastructure         4 patterns
  data                   3 patterns
  async                  3 patterns
  monitoring             3 patterns
  networking             2 patterns
  resource               2 patterns
  caching                2 patterns
                     ---
  TOTAL                122


### 3.2 Green Awareness Detection

Test the keyword-based green awareness detection on sample commit messages.

In [12]:
test_messages = [
    "Optimize Redis caching to reduce energy consumption",
    "Add auto-scaling to handle peak traffic efficiently",
    "Fix typo in README",
    "Implement lazy loading for images to reduce bandwidth",
    "Refactor database queries with connection pooling",
    "Update CI pipeline configuration",
    "Switch to gRPC for efficient service communication",
    "Enable model pruning to reduce inference cost",
]

print(f"{'Commit Message':<55} {'Green?':<8} {'Patterns'}")
print("=" * 100)
for msg in test_messages:
    green = is_green_aware(msg)
    patterns = get_pattern_by_keywords(msg) if green else []
    print(f"{msg:<55} {str(green):<8} {patterns}")

Commit Message                                          Green?   Patterns
Optimize Redis caching to reduce energy consumption     True     ['Cache Static Data', 'Optimize Storage Resource Utilization', 'Optimize Average CPU Utilization', 'Reduce Transmitted Data', 'Use Energy Efficient Hardware', 'Keep Request Counts Low']
Add auto-scaling to handle peak traffic efficiently     True     ['Optimize Peak CPU Utilization', 'Use Energy Efficient Hardware', 'Efficient Format for Model Training', 'Energy Efficient Models']
Fix typo in README                                      False    []
Implement lazy loading for images to reduce bandwidth   True     ['Reduce Transmitted Data', 'Scale Infrastructure with User Load', 'Defer Offscreen Images', 'Keep Request Counts Low', 'Properly Sized Images', 'Serve Images in Modern Formats', 'Lazy Loading', 'Pagination & Lazy Loading']
Refactor database queries with connection pooling       True     ['Reduce Transmitted Data', 'Connection Pooling']
[... output truncated ...]
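
The keyword-based detection described above can be sketched in a few lines. This is only an illustration of the general idea, not the library's implementation: `DEMO_PATTERNS`, `demo_match_patterns`, and `demo_is_green` are hypothetical names, the two-entry catalog is a stand-in for `GSF_PATTERNS`, and whole-word matching is an assumption.

```python
import re

# Hypothetical mini-catalog for illustration; the real GSF_PATTERNS is far larger.
DEMO_PATTERNS = {
    "cache_static_data": {
        "name": "Cache Static Data",
        "keywords": ["cache", "caching", "redis"],
    },
    "connection_pooling": {
        "name": "Connection Pooling",
        "keywords": ["connection pool", "pooling"],
    },
}


def demo_match_patterns(message, patterns=DEMO_PATTERNS):
    """Return the names of patterns with at least one keyword in the message."""
    text = message.lower()
    matched = []
    for pattern in patterns.values():
        # Whole-word, case-insensitive match (an assumption of this sketch).
        if any(
            re.search(r"\b" + re.escape(keyword) + r"\b", text)
            for keyword in pattern["keywords"]
        ):
            matched.append(pattern["name"])
    return matched


def demo_is_green(message, patterns=DEMO_PATTERNS):
    """A message counts as green-aware if any pattern matches."""
    return bool(demo_match_patterns(message, patterns))
```

Because a single keyword hit is enough, a broad keyword list tends to attach several patterns to one message, which is consistent with the long pattern lists in the output above.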

### 3.3 Inspect Individual Pattern Details

Examine the structure of a specific pattern including its keywords, SCI impact, and description.

In [13]:
# Show details for the first 5 patterns
for pid in list(GSF_PATTERNS.keys())[:5]:
    p = GSF_PATTERNS[pid]
    print(f"ID: {pid}")
    print(f"  Name:        {p['name']}")
    print(f"  Category:    {p['category']}")
    print(f"  SCI Impact:  {p['sci_impact']}")
    print(f"  Keywords:    {p['keywords'][:5]}...")
    print(f"  Description: {p['description'][:80]}...")
    print()

ID: cache_static_data
  Name:        Cache Static Data
  Category:    cloud
  SCI Impact:  Reduces energy by minimizing redundant compute and network operations
  Keywords:    ['cache', 'caching', 'static', 'cdn', 'redis']...
  Description: Cache static content to reduce server load and network transfers...

ID: choose_region_closest
  Name:        Choose Region Closest to Users
  Category:    cloud
  SCI Impact:  Less energy for network transmission, lower latency
  Keywords:    ['region', 'closest', 'proximity', 'latency', 'location']...
  Description: Deploy in regions closest to users to reduce network distance...

ID: compress_stored_data
  Name:        Compress Stored Data
  Category:    cloud
  SCI Impact:  Lower embodied carbon from reduced storage infrastructure
  Keywords:    ['compress', 'compression', 'stored', 'storage', 'gzip']...
  Description: Compress data at rest to reduce storage footprint...

ID: compress_transmitted_data
  Name:        Compress Transmitted Data
[... output truncated ...]

---
## 4. GitHub Token Configuration

A GitHub personal access token is required for fetching repositories via the GraphQL API. Set your token below.

In [14]:
import os

# Option 1: Set directly (replace with your token)
GITHUB_TOKEN = os.getenv("GITHUB_TOKEN", "your_github_token_here")

# Option 2: Load from .env file
try:
    from dotenv import load_dotenv
    load_dotenv()
    GITHUB_TOKEN = os.getenv("GITHUB_TOKEN", GITHUB_TOKEN)
except ImportError:
    pass

if GITHUB_TOKEN == "your_github_token_here":
    print("WARNING: Set your GITHUB_TOKEN to run the search experiment.")
    print("You can still run URL-based analysis without a token.")
else:
    print(f"GitHub token configured ({GITHUB_TOKEN[:8]}...)")

GitHub token configured (github_p...)


---
## 5. Experiment 1: Search 10 Blockchain Repositories

Use the `fetch_repositories` function to search GitHub for blockchain-related repositories through the GraphQL API v4. The search filters by:
- Keyword: `blockchain`
- Minimum stars: 3
- Languages: Top 20 programming languages
- Creation date: on or after 2020-01-01

### 5.1 Fetch Repositories

In [15]:
from greenmining import fetch_repositories

# Top 20 programming languages
TOP_20_LANGUAGES = [
    "Python", "JavaScript", "TypeScript", "Java", "Go",
    "Rust", "C", "C++", "C#", "Ruby",
    "PHP", "Kotlin", "Swift", "Scala", "R",
    "Dart", "Shell", "Lua", "Perl", "Haskell",
]

blockchain_repos = fetch_repositories(
    github_token=GITHUB_TOKEN,
    max_repos=10,
    min_stars=3,
    keywords="blockchain",
    languages=TOP_20_LANGUAGES,
    created_after="2020-01-01",
)

print(f"Fetched {len(blockchain_repos)} blockchain repositories:\n")
for i, repo in enumerate(blockchain_repos, 1):
    print(f"  {i:2d}. {repo.full_name} ({repo.stars} stars, {repo.language})")

Fetching up to 10 repositories...
   Keywords: blockchain
   Filters: min_stars=3
   Created: 2020-01-01 to any
GraphQL Search Query: blockchain stars:>=3 created:>=2020-01-01
Rate Limit: 4998/5000 (cost: 1)
Error fetching repositories: Repository.__init__() got an unexpected keyword argument 'languages'. Did you mean 'language'?
Fetched 0 repositories using GraphQL
Fetched 0 repositories
   Saved to: data/repositories.json
Fetched 0 blockchain repositories:
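
The log above shows the search string sent to GitHub: `blockchain stars:>=3 created:>=2020-01-01`. A minimal sketch of how such a string could be assembled from the filter arguments follows. `build_search_query` is a hypothetical helper, not part of `greenmining`, and it covers only the three qualifiers visible in the log.

```python
def build_search_query(keywords, min_stars=0, created_after=None, created_before=None):
    """Assemble a GitHub search string from simple filters (illustrative only)."""
    parts = [keywords]
    if min_stars > 0:
        parts.append(f"stars:>={min_stars}")
    if created_after and created_before:
        # GitHub search accepts a closed date range written as start..end.
        parts.append(f"created:{created_after}..{created_before}")
    elif created_after:
        parts.append(f"created:>={created_after}")
    elif created_before:
        parts.append(f"created:<={created_before}")
    return " ".join(parts)


query = build_search_query("blockchain", min_stars=3, created_after="2020-01-01")
print(query)
```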



### 5.2 Extract Commits

Use `CommitExtractor` to extract up to 20 commits per repository, skipping merge and bot commits.

In [None]:
from greenmining.services.commit_extractor import CommitExtractor

extractor = CommitExtractor(
    max_commits=20,
    skip_merges=True,
    days_back=730,
    github_token=GITHUB_TOKEN,
    timeout=60,
)

all_commits = extractor.extract_from_repositories(blockchain_repos)

print(f"\nTotal commits extracted: {len(all_commits)}")

### 5.3 Analyze Commits for Green Patterns

Use `DataAnalyzer` to scan each commit message for GSF sustainability patterns.

In [None]:
from greenmining.services.data_analyzer import DataAnalyzer

analyzer = DataAnalyzer(
    enable_diff_analysis=False,
    batch_size=10,
)

analyzed_commits = analyzer.analyze_commits(all_commits)

green_count = sum(1 for c in analyzed_commits if c.get("green_aware", False))
green_pct = (green_count / len(analyzed_commits) * 100) if analyzed_commits else 0

print(f"Analyzed: {len(analyzed_commits)} commits")
print(f"Green-aware: {green_count} ({green_pct:.1f}%)")
print(f"\nSample green commits:")
for c in analyzed_commits:
    if c.get("green_aware"):
        print(f"  - {c.get('message', '')[:70]}")
        print(f"    Patterns: {c.get('known_pattern', [])}")

### 5.4 Aggregate Results with Statistical and Temporal Analysis

Use `DataAggregator` with statistical correlations and temporal trend analysis enabled.

In [None]:
from greenmining.services.data_aggregator import DataAggregator

aggregator = DataAggregator(
    enable_stats=True,
    enable_temporal=True,
    temporal_granularity="quarter",
)

aggregated = aggregator.aggregate(analyzed_commits, blockchain_repos)

print("Aggregated Statistics:")
print(f"  Total commits:     {aggregated.get('total_commits', 'N/A')}")
print(f"  Green-aware:       {aggregated.get('green_aware_count', 'N/A')}")
print(f"  Green rate:        {aggregated.get('green_aware_percentage', 'N/A')}%")

# Show top patterns
top_patterns = aggregated.get("top_patterns", [])
if top_patterns:
    print(f"\nTop patterns:")
    for p in top_patterns[:5]:
        print(f"  - {p}")

# Show temporal analysis
temporal = aggregated.get("temporal_analysis", {})
if temporal:
    periods = temporal.get("periods", [])
    print(f"\nTemporal analysis ({len(periods)} periods):")
    for period in periods[:5]:
        print(f"  {period.get('period')}: {period.get('commit_count')} commits, "
              f"{period.get('green_awareness_rate', 0):.1%} green")
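
To make the quarterly granularity concrete, here is a rough sketch of how commits could be bucketed by quarter and a green-awareness rate computed per bucket. It is not the `DataAggregator` implementation. The helper names are hypothetical, and the sketch assumes each commit dict carries an ISO-formatted `date` string and a boolean `green_aware` flag.

```python
from collections import defaultdict
from datetime import datetime


def quarter_label(moment):
    """Map a datetime to a label such as '2024-Q1'."""
    return f"{moment.year}-Q{(moment.month - 1) // 3 + 1}"


def green_rate_by_quarter(commits):
    """Group commit dicts by quarter and compute the share flagged green_aware."""
    buckets = defaultdict(lambda: [0, 0])  # label -> [total commits, green commits]
    for commit in commits:
        label = quarter_label(datetime.fromisoformat(commit["date"]))
        buckets[label][0] += 1
        if commit.get("green_aware"):
            buckets[label][1] += 1
    # Labels sort chronologically because the year comes first.
    return [
        {"period": label, "commit_count": total, "green_awareness_rate": green / total}
        for label, (total, green) in sorted(buckets.items())
    ]


demo_commits = [
    {"date": "2024-01-15T10:00:00", "green_aware": True},
    {"date": "2024-02-20T09:30:00", "green_aware": False},
    {"date": "2024-05-02T14:45:00", "green_aware": True},
]
for period in green_rate_by_quarter(demo_commits):
    print(period)
```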

### 5.5 Save Search Results

Export analyzed data to JSON and CSV for further analysis.

In [None]:
import json
from pathlib import Path

output_dir = Path("experiment_output")
output_dir.mkdir(exist_ok=True)

# Save analyzed commits to JSON
with open(output_dir / "blockchain_analyzed.json", "w") as f:
    json.dump(analyzed_commits, f, indent=2, default=str)

# Save aggregated stats
with open(output_dir / "blockchain_stats.json", "w") as f:
    json.dump(aggregated, f, indent=2, default=str)

print(f"Results saved to {output_dir.absolute()}/")
print(f"  blockchain_analyzed.json ({len(analyzed_commits)} commits)")
print(f"  blockchain_stats.json")

---
## 6. Experiment 2: URL-Based Analysis of 2 Selected Repositories

Analyze two handpicked repositories directly from their GitHub URLs using PyDriller. This approach clones the repository locally and performs deep commit-level analysis including:
- GSF pattern matching
- Delta Maintainability Model (DMM) metrics
- Structural complexity metrics (via Lizard)
- Full process metrics (8 PyDriller metrics)

**Selected repositories:**
- `pallets/flask` - Python web microframework
- `psf/requests` - HTTP library for Python

### 6.1 Analyze Flask

In [None]:
from greenmining.services.local_repo_analyzer import LocalRepoAnalyzer

analyzer = LocalRepoAnalyzer(
    max_commits=20,
    days_back=730,
    skip_merges=True,
    compute_process_metrics=True,
    cleanup_after=True,
    method_level_analysis=True,
    include_source_code=True,
    process_metrics="standard",
)

flask_result = analyzer.analyze_repository("https://github.com/pallets/flask")

print(f"Repository: {flask_result.name}")
print(f"Total commits: {flask_result.total_commits}")
print(f"Green commits: {flask_result.green_commits}")
print(f"Green rate: {flask_result.green_commit_rate:.1%}")

### 6.2 Analyze Requests

In [None]:
requests_result = analyzer.analyze_repository("https://github.com/psf/requests")

print(f"Repository: {requests_result.name}")
print(f"Total commits: {requests_result.total_commits}")
print(f"Green commits: {requests_result.green_commits}")
print(f"Green rate: {requests_result.green_commit_rate:.1%}")

### 6.3 Inspect Commit-Level Details

Examine the rich data available for each commit, including DMM metrics, complexity, and green patterns.

In [None]:
# Show detailed commit analysis for Flask
print("Flask - Commit Details:")
print("=" * 80)
for commit in flask_result.commits[:5]:
    d = commit.to_dict()
    print(f"\nCommit: {d['commit_hash'][:8]}")
    print(f"  Author:     {d['author']}")
    print(f"  Date:       {d['date']}")
    print(f"  Message:    {d['message'][:60]}")
    print(f"  Green:      {d['green_aware']}")
    print(f"  Patterns:   {d['gsf_patterns_matched']}")
    print(f"  Confidence: {d['confidence']}")
    print(f"  Files:      {len(d['files_modified'])} modified")
    print(f"  Lines:      +{d['insertions']} / -{d['deletions']}")
    print(f"  DMM Size:   {d['dmm_unit_size']}")
    print(f"  DMM Cmplx:  {d['dmm_unit_complexity']}")
    print(f"  NLOC:       {d['total_nloc']}")
    print(f"  Complexity: {d['total_complexity']}")
    print(f"  Methods:    {d['methods_count']}")

### 6.4 Inspect Method-Level Analysis

Method-level metrics are extracted via Lizard integration, providing per-function complexity data for each modified file in a commit.

In [None]:
# Find commits with method-level data
print("Method-Level Analysis (Flask):")
print("=" * 80)
methods_found = 0
for commit in flask_result.commits:
    if commit.methods:
        print(f"\nCommit {commit.hash[:8]}: {commit.message[:50]}")
        for method in commit.methods[:5]:
            m = method.to_dict()
            print(f"  {m['long_name']}")
            print(f"    File: {m['filename']}, Lines: {m['start_line']}-{m['end_line']}")
            print(f"    NLOC: {m['nloc']}, Complexity: {m['complexity']}, "
                  f"Tokens: {m['token_count']}, Params: {m['parameters']}")
            methods_found += 1
        if methods_found >= 10:
            break

if methods_found == 0:
    print("No method-level data found in analyzed commits.")

### 6.5 Inspect Source Code Changes

Source code before/after is available for each modified file, enabling refactoring detection and diff analysis.

In [None]:
# Show source code changes for the first few commits that carry them
print("Source Code Changes (Flask):")
print("=" * 80)
changes_shown = 0
for commit in flask_result.commits:
    if commit.source_changes:
        print(f"\nCommit {commit.hash[:8]}: {commit.message[:50]}")
        for change in commit.source_changes[:3]:
            c = change.to_dict()
            print(f"  File: {c['filename']} ({c['change_type']})")
            print(f"  Lines: +{c['added_lines']} / -{c['deleted_lines']}")
            if c['source_code_before']:
                lines = c['source_code_before'].split('\n')
                print(f"  Before ({len(lines)} lines): {lines[0][:60]}...")
            if c['source_code_after']:
                lines = c['source_code_after'].split('\n')
                print(f"  After  ({len(lines)} lines): {lines[0][:60]}...")
            changes_shown += 1
        if changes_shown >= 5:
            break

if changes_shown == 0:
    print("No source code changes found in analyzed commits.")

### 6.6 Inspect Process Metrics

PyDriller computes 8 process metrics across the repository: ChangeSet, CodeChurn, CommitsCount, ContributorsCount, ContributorsExperience, HistoryComplexity, HunksCount, and LinesCount.

In [None]:
print("Process Metrics (Flask):")
print("=" * 80)
for metric_name, metric_value in flask_result.process_metrics.items():
    if isinstance(metric_value, dict):
        # Show summary for dict metrics
        print(f"  {metric_name}: {len(metric_value)} entries")
        for k, v in list(metric_value.items())[:3]:
            print(f"    {k}: {v}")
        if len(metric_value) > 3:
            print(f"    ... ({len(metric_value) - 3} more)")
    else:
        print(f"  {metric_name}: {metric_value}")

print(f"\nProcess Metrics (Requests):")
print("=" * 80)
for metric_name, metric_value in requests_result.process_metrics.items():
    if isinstance(metric_value, dict):
        print(f"  {metric_name}: {len(metric_value)} entries")
    else:
        print(f"  {metric_name}: {metric_value}")

---
## 7. Experiment 3: Single Repository Deep Analysis

Perform a deep analysis on `tiangolo/fastapi` with integrated energy tracking. This demonstrates energy measurement during repository mining.

### 7.1 Analyze with Energy Tracking

In [None]:
from greenmining.services.local_repo_analyzer import LocalRepoAnalyzer

deep_analyzer = LocalRepoAnalyzer(
    max_commits=20,
    days_back=730,
    skip_merges=True,
    compute_process_metrics=True,
    cleanup_after=True,
    energy_tracking=True,
    energy_backend="auto",
    method_level_analysis=True,
    include_source_code=True,
    process_metrics="standard",
)

fastapi_result = deep_analyzer.analyze_repository("https://github.com/tiangolo/fastapi")

print(f"Repository: {fastapi_result.name}")
print(f"Total commits: {fastapi_result.total_commits}")
print(f"Green commits: {fastapi_result.green_commits}")
print(f"Green rate: {fastapi_result.green_commit_rate:.1%}")

# Show energy metrics if available
if fastapi_result.energy_metrics:
    em = fastapi_result.energy_metrics
    print(f"\nEnergy Metrics:")
    print(f"  Energy consumed: {em.get('joules', 0):.4f} Joules")
    print(f"  Average power:   {em.get('watts_avg', 0):.4f} Watts")
    print(f"  Duration:        {em.get('duration_seconds', 0):.2f} seconds")
    print(f"  Backend:         {em.get('backend', 'N/A')}")
else:
    print("\nEnergy tracking: No data (backend may not be available)")

### 7.2 Compare All Three Repositories

In [None]:
url_results = [flask_result, requests_result, fastapi_result]

print(f"{'Repository':<25} {'Commits':>8} {'Green':>6} {'Rate':>8} {'Complexity':>11}")
print("=" * 65)
for r in url_results:
    avg_complexity = (
        sum(c.total_complexity for c in r.commits) / len(r.commits)
        if r.commits else 0
    )
    print(f"{r.name:<25} {r.total_commits:>8} {r.green_commits:>6} "
          f"{r.green_commit_rate:>7.1%} {avg_complexity:>11.1f}")

---
## 8. Batch Analysis with Parallelism

Use the top-level `analyze_repositories()` function to analyze the two selected repositories and the deep-analysis repository in parallel, with energy tracking enabled. This is the convenience API that handles cloning, analysis, and cleanup.

In [None]:
from greenmining import analyze_repositories

batch_results = analyze_repositories(
    urls=[
        "https://github.com/pallets/flask",
        "https://github.com/psf/requests",
        "https://github.com/tiangolo/fastapi",
    ],
    max_commits=20,
    parallel_workers=3,
    energy_tracking=True,
    energy_backend="auto",
    method_level_analysis=True,
    include_source_code=True,
)

print(f"Batch analysis complete: {len(batch_results)} repositories\n")
for r in batch_results:
    print(f"{r.name}: {r.total_commits} commits, {r.green_commit_rate:.1%} green")
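
For readers unfamiliar with fan-out over a worker pool, the general shape of a parallel batch run can be sketched with the standard library. How `analyze_repositories` schedules its work internally is not shown in this notebook, so the thread pool, `analyze_stub`, and `analyze_in_parallel` below are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def analyze_stub(url):
    """Hypothetical stand-in for a per-repository analysis step."""
    return {"name": url.rstrip("/").split("/")[-1], "url": url}


def analyze_in_parallel(urls, worker, parallel_workers=3):
    """Run worker(url) for every URL on a thread pool, keeping the input order."""
    with ThreadPoolExecutor(max_workers=parallel_workers) as pool:
        return list(pool.map(worker, urls))


demo_urls = [
    "https://github.com/pallets/flask",
    "https://github.com/psf/requests",
    "https://github.com/tiangolo/fastapi",
]
for entry in analyze_in_parallel(demo_urls, analyze_stub):
    print(entry["name"])
```

Threads suit work that is dominated by network and disk I/O, such as cloning. CPU-heavy analysis might call for processes instead.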

---
## 9. Energy Measurement

GreenMining provides three energy measurement backends. The `auto` backend selects the best available: RAPL on Linux (Intel/AMD), then CPU Meter as fallback.

### 9.1 Standalone Energy Measurement

In [None]:
from greenmining.energy import get_energy_meter, CPUEnergyMeter, RAPLEnergyMeter
import time

# Auto-detect best available backend
meter = get_energy_meter("auto")
print(f"Selected backend: {meter.__class__.__name__}")
print(f"Available: {meter.is_available()}")

# Measure a workload
meter.start()

# Simulate a workload
total = 0
for i in range(500_000):
    total += i * i

result = meter.stop()

print(f"\nMeasurement Results:")
print(f"  Energy:   {result.joules:.4f} Joules")
print(f"  Power:    {result.watts_avg:.4f} Watts (avg)")
print(f"  Peak:     {result.watts_peak:.4f} Watts (peak)")
print(f"  Duration: {result.duration_seconds:.4f} seconds")
print(f"  CPU:      {result.cpu_energy_joules:.4f} Joules")
print(f"  DRAM:     {result.dram_energy_joules} Joules")
print(f"  Backend:  {result.backend}")

### 9.2 CPU Energy Meter (Cross-Platform)

The CPU Energy Meter works on all platforms by estimating power from CPU utilization and TDP (Thermal Design Power).

In [None]:
cpu_meter = CPUEnergyMeter(
    tdp_watts=None,          # Auto-detect TDP
    sample_interval=0.5,     # 500ms sampling
)

print(f"Platform TDP: {cpu_meter.tdp_watts} W")
print(f"Available: {cpu_meter.is_available()}")

cpu_meter.start()
time.sleep(1)  # 1-second measurement window
cpu_result = cpu_meter.stop()

print(f"\nCPU Energy Meter Results (1s window):")
print(f"  Energy:   {cpu_result.joules:.4f} Joules")
print(f"  Power:    {cpu_result.watts_avg:.4f} Watts")
print(f"  Duration: {cpu_result.duration_seconds:.4f} seconds")
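
The estimation idea behind a utilization-based meter can be written down directly: scale TDP by CPU utilization to get an instantaneous power estimate, then integrate over the sampling interval. The sketch below assumes a purely linear model with no idle-power term, which may differ from what `CPUEnergyMeter` does. `estimate_energy_joules` is a hypothetical helper.

```python
def estimate_energy_joules(utilization_samples, tdp_watts, sample_interval):
    """Estimate energy from CPU utilization samples.

    utilization_samples: fractions in [0, 1], one per sampling interval.
    Returns (joules, average_watts).
    """
    # Linear model: power = utilization * TDP (an assumption of this sketch).
    power_samples = [u * tdp_watts for u in utilization_samples]
    # Energy is power integrated over time: watts * seconds = joules.
    joules = sum(p * sample_interval for p in power_samples)
    duration = len(power_samples) * sample_interval
    watts_avg = joules / duration if duration else 0.0
    return joules, watts_avg


# Two 0.5 s samples at 50% and 25% utilization on a hypothetical 100 W part.
joules, watts_avg = estimate_energy_joules([0.5, 0.25], tdp_watts=100, sample_interval=0.5)
print(f"{joules:.1f} J, {watts_avg:.1f} W average")
```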

### 9.3 RAPL Energy Meter (Linux Intel/AMD)

RAPL (Running Average Power Limit) provides hardware-level energy counters on Linux with Intel or AMD processors.

In [None]:
rapl_meter = RAPLEnergyMeter()
print(f"RAPL available: {rapl_meter.is_available()}")

if rapl_meter.is_available():
    rapl_meter.start()
    time.sleep(1)
    rapl_result = rapl_meter.stop()
    print(f"  Energy:   {rapl_result.joules:.4f} Joules")
    print(f"  CPU:      {rapl_result.cpu_energy_joules:.4f} Joules")
    print(f"  DRAM:     {rapl_result.dram_energy_joules:.4f} Joules")
else:
    print("  RAPL is not available on this system (requires Linux with Intel/AMD and powercap access).")
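
On Linux, RAPL counters are exposed through the powercap sysfs interface as cumulative microjoule values (`energy_uj`) that wrap around at `max_energy_range_uj`. A measurement is therefore the difference between two readings, corrected for wraparound. The sketch below illustrates that arithmetic and is not the `RAPLEnergyMeter` source. The domain path `intel-rapl:0` is a typical location for the first CPU package but is an assumption here, and both helper names are hypothetical.

```python
from pathlib import Path

# Typical location of the first package domain; may differ per system (assumption).
RAPL_DOMAIN = Path("/sys/class/powercap/intel-rapl:0")


def read_microjoules(domain=RAPL_DOMAIN):
    """Read the cumulative energy counter; needs read access to powercap."""
    return int((domain / "energy_uj").read_text())


def counter_delta_joules(start_uj, end_uj, max_range_uj):
    """Convert two counter readings to joules, handling a single wraparound."""
    delta = end_uj - start_uj
    if delta < 0:
        # The counter wrapped past its maximum between the two readings.
        delta += max_range_uj
    return delta / 1_000_000


# Pure arithmetic demo with made-up readings; no sysfs access required.
print(counter_delta_joules(1_000_000, 3_500_000, 10_000_000))
print(counter_delta_joules(9_000_000, 500_000, 10_000_000))
```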

---
## 10. Carbon Footprint Reporting

Convert energy measurements to CO2 emissions using regional grid carbon intensity data. Supports 20+ countries and major cloud provider regions (AWS, GCP, Azure).

### 10.1 Generate Carbon Report

In [None]:
from greenmining.energy import CarbonReporter, CarbonReport

reporter = CarbonReporter(
    country_iso="USA",
    cloud_provider="aws",
    region="us-east-1",
)

# Generate report from the energy measurement above
report = reporter.generate_report(total_joules=cpu_result.joules)

print(report.summary())

### 10.2 Compare Carbon Across Regions

Compare the same energy consumption across different countries and cloud regions to see the impact of grid carbon intensity.

In [None]:
# Simulate a workload energy measurement
test_joules = 3600.0  # 1 Wh (1 kWh would be 3,600,000 J)

regions = [
    ("Sweden (SWE)", "SWE", None, None),
    ("France (FRA)", "FRA", None, None),
    ("USA (USA)", "USA", None, None),
    ("Germany (DEU)", "DEU", None, None),
    ("Australia (AUS)", "AUS", None, None),
    ("India (IND)", "IND", None, None),
    ("AWS us-west-2", "USA", "aws", "us-west-2"),
    ("AWS eu-north-1", "SWE", "aws", "eu-north-1"),
    ("GCP europe-north1", "SWE", "gcp", "europe-north1"),
]

print(f"Carbon comparison for {test_joules:.0f} Joules ({test_joules/3_600_000*1000:.1f} Wh):")
print(f"{'Region':<25} {'gCO2/kWh':>10} {'Emissions (g)':>15} {'Tree-months':>12}")
print("=" * 65)

for label, country, cloud, region in regions:
    r = CarbonReporter(country_iso=country, cloud_provider=cloud, region=region)
    report = r.generate_report(total_joules=test_joules)
    d = report.to_dict()
    print(f"{label:<25} {d['carbon_intensity_gco2_kwh']:>10.0f} "
          f"{d['total_emissions_grams']:>15.4f} {d['equivalents']['tree_months']:>12.4f}")
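
The conversion behind these numbers is a unit change followed by a multiplication: joules to kilowatt-hours (1 kWh = 3,600,000 J), then kilowatt-hours times grid intensity in gCO2/kWh. A minimal sketch follows. `emissions_grams` is a hypothetical helper, and the intensity of 400 gCO2/kWh is a placeholder chosen for round arithmetic, not a value taken from the library's data.

```python
JOULES_PER_KWH = 3_600_000  # 1 kWh = 1000 W * 3600 s


def emissions_grams(total_joules, intensity_gco2_per_kwh):
    """Convert an energy measurement to grams of CO2 for a given grid intensity."""
    kwh = total_joules / JOULES_PER_KWH
    return kwh * intensity_gco2_per_kwh


# Placeholder intensity for illustration only.
demo_intensity = 400
print(f"{emissions_grams(3_600_000, demo_intensity):.1f} g for 1 kWh")
print(f"{emissions_grams(3_600, demo_intensity):.4f} g for 1 Wh")
```

Since the relationship is linear, the same workload run in a region with half the grid intensity yields half the emissions.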

### 10.3 Supported Countries and Cloud Regions

In [None]:
print(f"Supported countries: {CarbonReporter.get_supported_countries()}")
print(f"\nAWS regions: {CarbonReporter.get_supported_cloud_regions('aws')}")
print(f"\nGCP regions: {CarbonReporter.get_supported_cloud_regions('gcp')}")
print(f"\nAzure regions: {CarbonReporter.get_supported_cloud_regions('azure')}")

---
## 11. Carbon Report from Analysis Results

Generate a carbon footprint report from the batch analysis results, combining energy data from multiple repositories.

In [None]:
# Collect analysis results with energy data
analysis_dicts = [r.to_dict() for r in batch_results]

# Generate combined carbon report
combined_reporter = CarbonReporter(country_iso="USA")
combined_report = combined_reporter.generate_report(analysis_results=analysis_dicts)

print(combined_report.summary())

print("\nPer-repository breakdown:")
for entry in combined_report.to_dict().get("analysis_results", []):
    print(f"  {entry['name']}: {entry['energy_joules']:.4f} J, {entry['duration_seconds']:.2f}s")

---
## 12. Power Regression Detection

Detect commits that caused power consumption regressions by measuring energy at each commit in a range. The detector runs a test command at each commit and compares energy usage against a configurable threshold.

This requires a local git repository. We demonstrate the API and configuration here.

In [None]:
from greenmining.analyzers import PowerRegressionDetector, PowerRegression

detector = PowerRegressionDetector(
    test_command="pytest tests/ -x --no-header -q",
    energy_backend="auto",
    threshold_percent=5.0,      # Flag regressions above 5%
    iterations=3,               # 3 measurement iterations for accuracy
    warmup_iterations=1,        # 1 warmup run before measuring
)

print("PowerRegressionDetector configured:")
print(f"  Test command:       {detector.test_command}")
print(f"  Energy backend:     {detector.energy_backend}")
print(f"  Threshold:          {detector.threshold_percent}%")
print(f"  Iterations:         {detector.iterations}")
print(f"  Warmup iterations:  {detector.warmup_iterations}")

# To run detection on a local repo:
# regressions = detector.detect(
#     repo_path="/path/to/local/repo",
#     baseline_commit="HEAD~10",
#     target_commit="HEAD",
#     max_commits=50,
# )
# for r in regressions:
#     print(f"Regression at {r.sha[:8]}: +{r.power_increase:.1f}%")

print("\nPowerRegression dataclass fields:")
print(f"  {[f.name for f in PowerRegression.__dataclass_fields__.values()]}")
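
The threshold check at the heart of regression detection can be illustrated without a repository. The sketch below compares each commit's measured energy with that of the preceding commit and flags increases above a percentage threshold. Whether `PowerRegressionDetector` compares against the previous commit or a fixed baseline is not stated in this notebook, so that choice is an assumption, as is the hypothetical `find_regressions` helper.

```python
def find_regressions(energy_by_commit, threshold_percent=5.0):
    """Flag commits whose energy rose more than threshold_percent over the previous one.

    energy_by_commit: list of (sha, joules) tuples in chronological order.
    Returns a list of (sha, percent_increase) tuples.
    """
    regressions = []
    for (_, previous), (sha, current) in zip(energy_by_commit, energy_by_commit[1:]):
        if previous <= 0:
            continue  # Cannot compute a relative change from a non-positive baseline.
        increase = (current - previous) / previous * 100
        if increase > threshold_percent:
            regressions.append((sha, increase))
    return regressions


# Made-up measurements: a 2% rise (below threshold), then roughly a 10% rise.
demo_energy = [("aaa1111", 10.0), ("bbb2222", 10.2), ("ccc3333", 11.22)]
for sha, increase in find_regressions(demo_energy):
    print(f"Regression at {sha}: +{increase:.1f}%")
```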

---
## 13. Metrics-to-Power Correlation

Analyze the statistical relationship between code metrics (complexity, NLOC, churn) and power consumption using Pearson and Spearman correlations.

### 13.1 Build Correlation from Analysis Data

In [None]:
from greenmining.analyzers import MetricsPowerCorrelator, CorrelationResult

# Extract metrics from the FastAPI analysis
commits_data = [c.to_dict() for c in fastapi_result.commits if c.total_nloc > 0]

if len(commits_data) >= 3:
    complexity_values = [c["total_complexity"] for c in commits_data]
    nloc_values = [c["total_nloc"] for c in commits_data]
    insertions_values = [c["insertions"] for c in commits_data]
    deletions_values = [c["deletions"] for c in commits_data]
    methods_values = [c["methods_count"] for c in commits_data]

    # Use code churn (insertions + deletions) as a stand-in for "power" in this demo
    # (in practice, you would use actual energy measurements)
    power_proxy = [c["insertions"] + c["deletions"] for c in commits_data]

    correlator = MetricsPowerCorrelator(significance_level=0.05)
    correlator.fit(
        metrics=["complexity", "nloc", "insertions", "deletions", "methods_count"],
        metrics_values={
            "complexity": complexity_values,
            "nloc": nloc_values,
            "insertions": insertions_values,
            "deletions": deletions_values,
            "methods_count": methods_values,
        },
        power_measurements=power_proxy,
    )

    print("Pearson correlations (linear):")
    for metric, r in correlator.pearson.items():
        print(f"  {metric:<20s} r = {r:+.4f}")

    print(f"\nSpearman correlations (monotonic):")
    for metric, r in correlator.spearman.items():
        print(f"  {metric:<20s} rho = {r:+.4f}")

    print(f"\nFeature importance (normalized):")
    for metric, imp in correlator.feature_importance.items():
        bar = "#" * int(imp * 30)
        print(f"  {metric:<20s} {imp:.3f} {bar}")
else:
    print(f"Insufficient data points ({len(commits_data)}). Need at least 3.")
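
For reference, the two coefficients can be computed with the standard library alone. Pearson measures linear association. Spearman is Pearson applied to ranks, so it captures any monotonic relationship. This is a self-contained sketch for intuition, not the `MetricsPowerCorrelator` code, and it omits p-values. The function names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson on the ranked data."""
    return pearson(ranks(xs), ranks(ys))


# A monotonic but non-linear relationship: Spearman reaches 1, Pearson stays below.
xs = [1, 2, 3, 4, 5]
ys = [x * x for x in xs]
print(f"pearson={pearson(xs, ys):.4f}, spearman={spearman(xs, ys):.4f}")
```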

### 13.2 Inspect Significant Correlations

In [None]:
if len(commits_data) >= 3:
    significant = correlator.get_significant_correlations()
    print(f"Significant correlations (p < 0.05): {len(significant)}")
    for name, result in significant.items():
        d = result.to_dict()
        print(f"  {name}: strength={d['strength']}, "
              f"pearson_r={d['pearson_r']:.4f} (p={d['pearson_p']:.6f}), "
              f"spearman_r={d['spearman_r']:.4f} (p={d['spearman_p']:.6f})")

    all_results = correlator.get_results()
    print(f"\nAll correlations: {len(all_results)}")
    for name, result in all_results.items():
        d = result.to_dict()
        sig = "*" if d["significant"] else " "
        print(f"  {sig} {name:<20s} strength={d['strength']:<12s} "
              f"pearson={d['pearson_r']:+.4f} spearman={d['spearman_r']:+.4f}")

---
## 14. Version Power Analysis

Compare power consumption across different software versions or tags. The analyzer checks out each version, runs a test command, measures energy, and reports trends.

This requires a local git repository with version tags. We demonstrate the API configuration and data structures here.

In [None]:
from greenmining.analyzers import VersionPowerAnalyzer, VersionPowerReport

version_analyzer = VersionPowerAnalyzer(
    test_command="pytest tests/ --no-header -q",
    energy_backend="auto",
    iterations=5,
    warmup_iterations=1,
)

print("VersionPowerAnalyzer configured:")
print(f"  Test command:       {version_analyzer.test_command}")
print(f"  Energy backend:     {version_analyzer.energy_backend}")
print(f"  Iterations:         {version_analyzer.iterations}")
print(f"  Warmup:             {version_analyzer.warmup_iterations}")

# To run on a local repo:
# report = version_analyzer.analyze_versions(
#     repo_path="/path/to/local/repo",
#     versions=["v1.0", "v1.1", "v1.2", "v2.0"],
# )
# print(report.summary())

# Demonstrate the report data structure
from greenmining.analyzers.version_power_analyzer import VersionPowerProfile

demo_report = VersionPowerReport(
    versions=[
        VersionPowerProfile(version="v1.0", commit_sha="abc1234", energy_joules=10.5, power_watts_avg=8.2, duration_seconds=1.28, iterations=5, energy_std=0.3),
        VersionPowerProfile(version="v1.1", commit_sha="def5678", energy_joules=11.2, power_watts_avg=8.5, duration_seconds=1.32, iterations=5, energy_std=0.4),
        VersionPowerProfile(version="v2.0", commit_sha="ghi9012", energy_joules=9.8, power_watts_avg=7.9, duration_seconds=1.24, iterations=5, energy_std=0.2),
    ],
    trend="decreasing",
    total_change_percent=-6.67,
    most_efficient="v2.0",
    least_efficient="v1.1",
)

print(f"\nDemo report structure:")
print(demo_report.summary())
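
The summary fields in the demo report can be derived from the per-version energy values. The sketch below recomputes them for the three demo profiles, assuming that the total change is measured from the first to the last version and that the trend label follows its sign. The library may define the trend differently, and `summarize_versions` is a hypothetical helper.

```python
def summarize_versions(energies):
    """Summarize (version, joules) pairs given in release order."""
    first, last = energies[0][1], energies[-1][1]
    change = (last - first) / first * 100
    if change < 0:
        trend = "decreasing"
    elif change > 0:
        trend = "increasing"
    else:
        trend = "stable"
    return {
        "trend": trend,
        "total_change_percent": round(change, 2),
        "most_efficient": min(energies, key=lambda item: item[1])[0],
        "least_efficient": max(energies, key=lambda item: item[1])[0],
    }


# Energy values copied from the demo profiles above.
demo_energies = [("v1.0", 10.5), ("v1.1", 11.2), ("v2.0", 9.8)]
print(summarize_versions(demo_energies))
```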

---
## 15. Statistical Analyzer

The `StatisticalAnalyzer` computes effect sizes, correlations, and statistical significance tests on the analysis results.

In [None]:
import pandas as pd
from greenmining.analyzers import StatisticalAnalyzer

stat_analyzer = StatisticalAnalyzer()

# Prepare data from URL analysis results
all_analysis_dicts = []
for repo_result in url_results:
    for commit in repo_result.commits:
        d = commit.to_dict()
        d["repository"] = repo_result.name
        all_analysis_dicts.append(d)

# Convert to DataFrame for statistical analysis
commits_df = pd.DataFrame(all_analysis_dicts)

# Pattern correlation analysis
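all_patterns = set()
for patterns_list in commits_df.get("gsf_patterns_matched", []):
    # Missing values arrive as None or NaN, so only accept real collections
    if isinstance(patterns_list, (list, tuple, set)):
        all_patterns.update(patterns_list)

# One 0/1 indicator column per pattern
for pattern in sorted(all_patterns):
    commits_df[f"pattern_{pattern}"] = commits_df["gsf_patterns_matched"].apply(
        lambda x, p=pattern: int(isinstance(x, (list, tuple, set)) and p in x)
    )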

if len(all_patterns) >= 2:
    correlation_results = stat_analyzer.analyze_pattern_correlations(commits_df)
    print("Pattern Correlation Analysis:")
    sig_pairs = correlation_results.get("significant_pairs", [])
    print(f"  Significant pairs: {len(sig_pairs)}")
    for pair in sig_pairs[:10]:
        print(f"    {pair}")
else:
    print(f"Found {len(all_patterns)} pattern(s) - need >= 2 for correlation analysis")

# Temporal trend analysis
if "date" in commits_df.columns and "green_aware" in commits_df.columns:
    if "commit_hash" not in commits_df.columns:
        commits_df["commit_hash"] = commits_df.index.astype(str)
    trend_results = stat_analyzer.temporal_trend_analysis(commits_df)
    trend = trend_results.get("trend", {})
    print(f"\nTemporal Trend:")
    print(f"  Direction: {trend.get('direction', 'N/A')}")
    print(f"  Significant: {trend.get('significant', 'N/A')}")
    print(f"  Correlation: {trend.get('correlation', 'N/A')}")

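# Effect size: green vs non-green commit complexity
# Guard against missing columns, as the temporal block above does
if {"green_aware", "total_complexity"} <= set(commits_df.columns):
    green_complexity = commits_df.loc[commits_df["green_aware"].eq(True), "total_complexity"].dropna().tolist()
    non_green_complexity = commits_df.loc[commits_df["green_aware"].eq(False), "total_complexity"].dropna().tolist()
else:
    green_complexity, non_green_complexity = [], []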

if green_complexity and non_green_complexity:
    effect = stat_analyzer.effect_size_analysis(green_complexity, non_green_complexity)
    print(f"\nEffect Size (Green vs Non-Green Complexity):")
    print(f"  Cohen's d: {effect['cohens_d']:.3f} ({effect['magnitude']})")
    print(f"  Mean difference: {effect['mean_difference']:.2f}")
    print(f"  Significant: {effect['significant']}")
else:
    print("\nInsufficient data for effect size analysis")
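
For readers unfamiliar with the effect size reported above, the following sketch shows the textbook form of Cohen's d with a pooled sample standard deviation, plus the conventional magnitude thresholds (0.2, 0.5, 0.8). It is an assumption that `effect_size_analysis()` follows this exact formula and these labels; the functions and the sample values here are illustrative only.

```python
import math
import statistics


def cohens_d(group_a, group_b):
    """Cohen's d with a pooled sample standard deviation (needs >= 2 values per group)."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd


def magnitude(d):
    """Conventional labels: below 0.2 negligible, 0.5 small, 0.8 medium, else large."""
    size = abs(d)
    if size < 0.2:
        return "negligible"
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "medium"
    return "large"


# Made-up complexity values, for illustration only
green = [2, 4, 6, 8]
non_green = [1, 3, 5, 7]
d = cohens_d(green, non_green)
print(f"d = {d:.3f} ({magnitude(d)})")
```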

---
## 16. Temporal Analyzer

Analyze how green software adoption evolves over time using configurable time granularity.

In [None]:
from greenmining.analyzers import TemporalAnalyzer

temporal = TemporalAnalyzer(granularity="month")

# Prepare analysis results in the expected format
analysis_results_list = []
for d in all_analysis_dicts:
    analysis_results_list.append({
        "commit_sha": d.get("commit_hash", ""),
        "is_green_aware": d.get("green_aware", False),
        "patterns_detected": d.get("gsf_patterns_matched", []),
        "detection_method": "gsf_keyword",
    })

# Run temporal analysis
temporal_results = temporal.analyze_trends(all_analysis_dicts, analysis_results_list)

print("Temporal Analysis (monthly):")
print("=" * 60)
periods = temporal_results.get("periods", [])
for period in periods:
    if isinstance(period, dict):
        print(f"  {period.get('period', 'N/A')}: "
              f"{period.get('commit_count', 0)} commits, "
              f"{period.get('green_awareness_rate', 0):.1%} green")

summary = temporal_results.get("summary", {})
if summary:
    print(f"\nTrend Summary:")
    print(f"  Overall direction: {summary.get('overall_direction', 'N/A')}")
    print(f"  Total periods: {summary.get('total_periods', 0)}")
    print(f"  Peak period: {summary.get('peak_period', 'N/A')}")

evolution = temporal_results.get("pattern_evolution", {})
if evolution:
    print(f"\nPattern Evolution:")
    print(f"  Emerging: {evolution.get('emerging', [])}")
    print(f"  Stable:   {evolution.get('stable', [])}")
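
The core idea behind monthly granularity can be sketched without the library: bucket commits by `YYYY-MM` and compute the share flagged as green-aware in each bucket. The function name, the ISO 8601 date format, and the sample commits below are assumptions for illustration; `TemporalAnalyzer` may parse dates and structure its periods differently.

```python
from collections import defaultdict
from datetime import datetime


def monthly_green_rate(commits):
    """Group commits by YYYY-MM and compute the share flagged green-aware."""
    buckets = defaultdict(lambda: {"commit_count": 0, "green_count": 0})
    for commit in commits:
        period = datetime.fromisoformat(commit["date"]).strftime("%Y-%m")
        buckets[period]["commit_count"] += 1
        buckets[period]["green_count"] += int(bool(commit["green_aware"]))
    return [
        {
            "period": period,
            "commit_count": stats["commit_count"],
            "green_awareness_rate": stats["green_count"] / stats["commit_count"],
        }
        for period, stats in sorted(buckets.items())
    ]


# Made-up commits, for illustration only
sample_commits = [
    {"date": "2024-01-05T10:00:00", "green_aware": True},
    {"date": "2024-01-20T12:30:00", "green_aware": False},
    {"date": "2024-02-02T09:15:00", "green_aware": True},
]
rows = monthly_green_rate(sample_commits)
for row in rows:
    print(f"{row['period']}: {row['commit_count']} commits, "
          f"{row['green_awareness_rate']:.1%} green")
```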

---
## 17. Qualitative Analyzer

The `QualitativeAnalyzer` selects a stratified sample of commits for manual validation, which lets researchers check the accuracy of the automated classification.

In [None]:
from greenmining.analyzers import QualitativeAnalyzer

qual_analyzer = QualitativeAnalyzer(
    sample_size=10,
    stratify_by="pattern",
)

# Generate stratified validation samples
samples = qual_analyzer.generate_validation_samples(
    commits=all_analysis_dicts,
    analysis_results=analysis_results_list,
    include_negatives=True,
)

print("Qualitative Validation Samples:")
print("=" * 60)
print(f"Generated {len(samples)} samples\n")

for sample in samples[:5]:
    print(f"  Commit: {sample.commit_hash[:8]}... | "
          f"Pattern: {sample.pattern} | "
          f"Green: {sample.is_green}")

# Export for manual review (same output directory as the other artifacts)
samples_path = output_dir / "validation_samples.json"
qual_analyzer.export_samples_for_review(str(samples_path))
print(f"\nSamples exported to {samples_path}")
print("After manual review, import with: qual_analyzer.import_validated_samples('validated.json')")
print("Then compute metrics with: qual_analyzer.calculate_metrics()")
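
Stratified sampling by pattern means that rare patterns are not crowded out by frequent ones. The sketch below illustrates one simple way to do this: shuffle each stratum with a fixed seed, then pick round-robin across strata until the quota is met. `stratified_sample` and the sample commits are hypothetical; the sampling strategy inside `QualitativeAnalyzer` is not shown in this notebook and may differ.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(items, key, sample_size, seed=42):
    """Draw roughly equal numbers of items from each stratum, round-robin."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    for group in strata.values():
        rng.shuffle(group)

    picked = []
    # Take one item per stratum in turn until the quota is met or all are empty
    while len(picked) < sample_size and any(strata.values()):
        for name in sorted(strata):
            if strata[name] and len(picked) < sample_size:
                picked.append(strata[name].pop())
    return picked


# Made-up commits: 6 "caching", 2 "lazy_loading", 4 with no pattern
labels = ["caching"] * 6 + ["lazy_loading"] * 2 + ["none"] * 4
commits = [{"commit_hash": f"c{i}", "pattern": p} for i, p in enumerate(labels)]

sample = stratified_sample(commits, key=lambda c: c["pattern"], sample_size=6)
counts = Counter(c["pattern"] for c in sample)
print(dict(counts))
```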

---
## 18. Code Diff Analyzer

The `CodeDiffAnalyzer` detects 15 types of code-level patterns by analyzing the actual code changes in each diff, which gives deeper insight than commit message analysis alone.

In [None]:
from greenmining.analyzers import CodeDiffAnalyzer

diff_analyzer = CodeDiffAnalyzer()

# CodeDiffAnalyzer.analyze_commit_diff() requires a PyDriller Commit object.
# It is automatically integrated into the LocalRepoAnalyzer pipeline.
# Here we show the 15 pattern signatures it detects in code diffs.

print("CodeDiffAnalyzer - Green Pattern Signatures:")
print("=" * 60)
print(f"Total pattern types: {len(diff_analyzer.PATTERN_SIGNATURES)}\n")
for pattern_name, pattern_data in diff_analyzer.PATTERN_SIGNATURES.items():
    print(f"  {pattern_name}:")
    if isinstance(pattern_data, dict):
        for key, value in list(pattern_data.items())[:2]:
            if isinstance(value, list):
                print(f"    {key}: {value[:3]}...")
            else:
                print(f"    {key}: {value}")
    print()

print("Note: To analyze actual diffs, use LocalRepoAnalyzer which calls")
print("analyze_commit_diff() on each PyDriller Commit object automatically.")
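
To give a feel for what signature-based diff analysis involves, here is a minimal sketch that scans only the added lines of a unified diff for regular expressions grouped by pattern name. The `SIGNATURES` dictionary and `detect_patterns` function are invented for this example; they do not reproduce `PATTERN_SIGNATURES` or the logic of `analyze_commit_diff()`.

```python
import re

# Hypothetical signatures for illustration; not greenmining's PATTERN_SIGNATURES
SIGNATURES = {
    "caching": [r"@lru_cache", r"\bfunctools\.cache\b"],
    "async_io": [r"\basync\s+def\b", r"\bawait\b"],
}


def detect_patterns(diff_text):
    """Return the sorted pattern names whose regexes match any added line."""
    added = [
        line[1:]
        for line in diff_text.splitlines()
        # Keep added lines, but skip the "+++ b/file" header
        if line.startswith("+") and not line.startswith("+++")
    ]
    found = set()
    for name, regexes in SIGNATURES.items():
        if any(re.search(rx, line) for rx in regexes for line in added):
            found.add(name)
    return sorted(found)


diff = """\
--- a/app.py
+++ b/app.py
@@ -1,3 +1,5 @@
+from functools import lru_cache
+@lru_cache(maxsize=128)
 def load(key):
-    return slow_lookup(key)
+    return slow_lookup(key)
"""
print(detect_patterns(diff))
```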

---
## 19. Report Generation

Generate a Markdown analysis report from the aggregated results.

In [None]:
from greenmining.services.reports import ReportGenerator

report_gen = ReportGenerator()

# Prepare data in expected format
repos_data = {
    "metadata": {"total_repos": len(url_results)},
    "repositories": [r.to_dict() for r in url_results],
}

analysis_data = {
    "metadata": {"total_commits": len(all_analysis_dicts)},
    "commits": all_analysis_dicts,
}

# Reuse the aggregation from an earlier cell if it exists; otherwise build a minimal fallback
aggregated_data = aggregated if "aggregated" in globals() else {
    "total_commits": len(all_analysis_dicts),
    "green_aware_count": sum(1 for d in all_analysis_dicts if d.get("green_aware")),
}

report_content = report_gen.generate_report(
    aggregated_data=aggregated_data,
    analysis_data=analysis_data,
    repos_data=repos_data,
)

# Save report
report_path = output_dir / "experiment_report.md"
with open(report_path, "w", encoding="utf-8") as f:
    f.write(report_content)

print(f"Report saved to {report_path}")

# Preview the first 30 lines of the report
print("\nReport preview:")
print("=" * 60)
for line in report_content.splitlines()[:30]:
    print(line)
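
As a rough illustration of what a Markdown report boils down to, the sketch below renders two aggregate counts into a small Markdown table. `render_summary_markdown` and the counts are made up for this example; the structure of the report produced by `ReportGenerator` is not assumed to match it.

```python
def render_summary_markdown(total_commits, green_aware_count):
    """Render a tiny Markdown summary table from two aggregate counts."""
    rate = green_aware_count / total_commits if total_commits else 0.0
    lines = [
        "# Experiment Summary",
        "",
        "| Metric | Value |",
        "| --- | --- |",
        f"| Total commits | {total_commits} |",
        f"| Green-aware commits | {green_aware_count} |",
        f"| Green-awareness rate | {rate:.1%} |",
    ]
    return "\n".join(lines) + "\n"


# Made-up counts, for illustration only
markdown = render_summary_markdown(total_commits=40, green_aware_count=6)
print(markdown)
```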

---
## 20. Export Results

Export the URL analysis, batch analysis, and carbon report results as JSON files for further research use.

In [None]:
import json

# Export URL analysis results
url_export = [r.to_dict() for r in url_results]

with open(output_dir / "url_analysis_results.json", "w") as f:
    json.dump(url_export, f, indent=2, default=str)

# Export batch results
batch_export = [r.to_dict() for r in batch_results]
with open(output_dir / "batch_analysis_results.json", "w") as f:
    json.dump(batch_export, f, indent=2, default=str)

# Export carbon report
with open(output_dir / "carbon_report.json", "w") as f:
    json.dump(combined_report.to_dict(), f, indent=2, default=str)

# Summary
print("Exported files:")
for p in sorted(output_dir.glob("*")):
    size = p.stat().st_size
    print(f"  {p.name:<35} {size:>8,} bytes")
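
The cell above writes JSON only. For spreadsheet or R workflows a flat CSV is often handy, and a sketch using only the standard library could look like the following. `export_commits_csv` is a hypothetical helper, not a `greenmining` function, and the sample rows are made up; nested list or dict values are JSON-encoded so that each commit stays on one row.

```python
import csv
import json
import tempfile
from pathlib import Path


def export_commits_csv(commit_dicts, path):
    """Write commit dicts to CSV, JSON-encoding list/dict values."""
    # Union of all keys, so rows with missing fields still fit the header
    fieldnames = sorted({key for row in commit_dicts for key in row})
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        for row in commit_dicts:
            writer.writerow({
                key: json.dumps(value) if isinstance(value, (list, dict)) else value
                for key, value in row.items()
            })
    return fieldnames


# Made-up rows, for illustration only
demo_rows = [
    {"commit_hash": "abc1234", "green_aware": True, "gsf_patterns_matched": ["caching"]},
    {"commit_hash": "def5678", "green_aware": False},
]
csv_path = Path(tempfile.gettempdir()) / "commits_demo.csv"
columns = export_commits_csv(demo_rows, csv_path)
print(csv_path.read_text(encoding="utf-8"))
```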

---
## 21. Private Repository Support

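GreenMining supports analyzing private repositories via two authentication methods: HTTPS with a GitHub token and SSH with a private key. The cell below only configures the analyzers; the `analyze_repository()` calls stay commented out because they need real credentials.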

In [None]:
from greenmining.services.local_repo_analyzer import LocalRepoAnalyzer

# Method 1: HTTPS with GitHub token
# The token is injected into the clone URL for authentication.
https_analyzer = LocalRepoAnalyzer(
    github_token="ghp_your_token_here",  # Replace with actual token
    max_commits=20,
)
print("HTTPS authentication configured (token-based)")
print(f"  Token set: {https_analyzer.github_token is not None}")

# Method 2: SSH with private key
# Sets GIT_SSH_COMMAND to use the specified key.
ssh_analyzer = LocalRepoAnalyzer(
    ssh_key_path="~/.ssh/id_rsa",  # Replace with actual key path
    max_commits=20,
)
print(f"\nSSH authentication configured (key-based)")
print(f"  Key path: {ssh_analyzer.ssh_key_path}")

# Usage (not executed):
# result = https_analyzer.analyze_repository("https://github.com/company/private-repo")
# result = ssh_analyzer.analyze_repository("git@github.com:company/private-repo.git")
print("\nTo analyze a private repo, call analyze_repository() with the repo URL.")
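
The two comments in the cell above (token injected into the clone URL, `GIT_SSH_COMMAND` set for the key) can be illustrated with plain standard-library code. This is a sketch of the general technique, not the code inside `LocalRepoAnalyzer`: `url_with_token`, `ssh_command_for_key`, and the `GITHUB_TOKEN` environment variable name are assumptions. Reading the token from the environment also avoids hard-coding a secret in the notebook.

```python
import os
import shlex
from urllib.parse import quote, urlsplit, urlunsplit


def url_with_token(repo_url, token):
    """Embed a token as the user part of an HTTPS clone URL."""
    parts = urlsplit(repo_url)
    netloc = f"{quote(token, safe='')}@{parts.hostname}"
    if parts.port:
        netloc += f":{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


def ssh_command_for_key(key_path):
    """Build a GIT_SSH_COMMAND value that pins one private key."""
    expanded = os.path.expanduser(key_path)  # resolve a leading "~"
    return f"ssh -i {shlex.quote(expanded)} -o IdentitiesOnly=yes"


# Assumed variable name; falls back to a placeholder when it is not set
token = os.environ.get("GITHUB_TOKEN", "ghp_your_token_here")
authed_url = url_with_token("https://github.com/company/private-repo", token)
# Print only the host so the secret is never echoed
print("Clone URL host:", urlsplit(authed_url).hostname)
print("GIT_SSH_COMMAND:", ssh_command_for_key("~/.ssh/id_rsa"))
```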

---
## 22. Web Dashboard

GreenMining includes a Flask-based web dashboard for interactive visualization of analysis results. The dashboard provides REST API endpoints and an HTML interface.

**Requires:** `pip install greenmining[dashboard]`

In [None]:
try:
    from greenmining.dashboard import create_app, run_dashboard

    print("Dashboard module available.")
    print("\nAPI endpoints:")
    print("  GET /                  - Dashboard UI")
    print("  GET /api/repositories  - List analyzed repositories")
    print("  GET /api/analysis      - Full analysis results")
    print("  GET /api/statistics    - Aggregated statistics")
    print("  GET /api/energy        - Energy measurement data")
    print("  GET /api/summary       - Summary overview")
    print("\nTo launch the dashboard:")
    print("  run_dashboard(data_dir='./experiment_output', host='127.0.0.1', port=5000)")
    print("\nNote: Do not run in a notebook cell (blocks execution). Use a separate terminal.")

except ImportError:
    print("Dashboard not available. Install with: pip install greenmining[dashboard]")
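
Once the dashboard is running in a separate terminal, its endpoints can be queried from the notebook. The sketch below uses only the standard library and assumes that the `/api/...` routes listed above return JSON, which this notebook does not confirm; `endpoint_url` and `fetch_json` are hypothetical helpers. The actual request is left commented out so the cell does not fail when no server is listening.

```python
import json
from urllib.parse import urljoin
from urllib.request import urlopen


def endpoint_url(base_url, path):
    """Join the dashboard base URL and an endpoint path."""
    return urljoin(base_url.rstrip("/") + "/", path.lstrip("/"))


def fetch_json(base_url, path, timeout=5):
    """GET an endpoint and decode the body as JSON (assumed response format)."""
    with urlopen(endpoint_url(base_url, path), timeout=timeout) as response:
        return json.load(response)


summary_url = endpoint_url("http://127.0.0.1:5000", "/api/summary")
print(summary_url)

# With the dashboard running in another terminal:
# summary = fetch_json("http://127.0.0.1:5000", "/api/summary")
# print(summary)
```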

---
## 23. Experiment Summary

Final summary of all experiments and features demonstrated in this notebook.

In [None]:
print("GreenMining Experiment Summary")
print("=" * 60)
print(f"Library version: {greenmining.__version__}")
print(f"GSF patterns:    {len(greenmining.GSF_PATTERNS)}")
print(f"Green keywords:  {len(greenmining.GREEN_KEYWORDS)}")
print()

print("Experiments:")
print(f"  1. Search: {len(blockchain_repos)} blockchain repos, "
      f"{len(analyzed_commits)} commits analyzed")
print("  2. URL analysis: Flask, Requests (2 repos, method-level + source code)")
print("  3. Deep analysis: FastAPI (energy tracking enabled)")
print("  4. Batch: 3 repos analyzed in parallel with energy tracking")
print()

print("Features demonstrated:")
features = [
    "GSF pattern detection (122 patterns, 15 categories)",
    "Green awareness keyword matching (321 keywords)",
    "GitHub GraphQL API repository search",
    "Date, star, and language filters",
    "Commit extraction (merge/bot filtering)",
    "Commit analysis with pattern classification",
    "Data aggregation with temporal trends",
    "URL-based repository analysis (PyDriller)",
    "Batch analysis with parallel workers",
    "Private repository support (HTTPS + SSH)",
    "Energy measurement (RAPL, CPU Meter, Auto)",
    "Carbon footprint reporting (20+ countries, cloud regions)",
    "Power regression detection",
    "Metrics-to-power correlation (Pearson/Spearman)",
    "Version power analysis",
    "Method-level analysis (Lizard)",
    "Source code before/after access",
    "Process metrics (8 PyDriller metrics)",
    "Statistical analysis",
    "Temporal analysis",
    "Qualitative analysis (stratified sampling)",
    "Code diff analysis (15 pattern types)",
    "Report generation (Markdown)",
    "Multi-format export (JSON, CSV)",
    "Web dashboard (Flask)",
]
for i, feature in enumerate(features, 1):
    print(f"  {i:2d}. {feature}")

print(f"\nOutput directory: {output_dir.absolute()}")