# GreenMining Experiment: Unified Repository Analysis Pipeline

This notebook demonstrates a complete analysis pipeline using the `greenmining` library.

## Experiment Setup
- **10 blockchain repositories** found via GraphQL search
- **3 manually selected repositories** (Flask, Requests, FastAPI)
- **Total: 13 repositories** — all analyzed with the same pipeline and ALL features enabled
- **Commits per repository:** 20
- **Min stars:** 3
- **Languages:** 20 programming languages (listed in Step 2)

## Pipeline Structure
1. **Data Gathering** — search + URL-based fetching for all 13 repos
2. **Unified Analysis** — every feature applied to every repo equally

In [1]:
!pip install greenmining[energy,dashboard] --upgrade --quiet

## Step 1: Import Libraries

Import all GreenMining modules needed for the experiment.

In [2]:
import os
import json
import time
import pandas as pd

import greenmining
from greenmining import (
    fetch_repositories,
    analyze_repositories,
    GSF_PATTERNS,
    GREEN_KEYWORDS,
    is_green_aware,
    get_pattern_by_keywords,
)
from greenmining.analyzers import (
    StatisticalAnalyzer,
    TemporalAnalyzer,
    QualitativeAnalyzer,
    CodeDiffAnalyzer,
    PowerRegressionDetector,
    MetricsPowerCorrelator,
    VersionPowerAnalyzer,
)
from greenmining.energy import CarbonReporter, get_energy_meter, CPUEnergyMeter
from greenmining.dashboard import create_app

print(f'GreenMining version: {greenmining.__version__}')
print(f'GSF Patterns: {len(GSF_PATTERNS)}')
print(f'Green Keywords: {len(GREEN_KEYWORDS)}')

GreenMining version: 1.0.9
GSF Patterns: 122
Green Keywords: 321


## Step 2: Configuration

GitHub token and analysis parameters shared across all repositories.

In [3]:
GITHUB_TOKEN = os.environ.get('GITHUB_TOKEN', 'your_github_token_here')

try:
    from dotenv import load_dotenv
    load_dotenv()
    GITHUB_TOKEN = os.environ.get('GITHUB_TOKEN', GITHUB_TOKEN)
except ImportError:
    pass

if GITHUB_TOKEN == 'your_github_token_here':
    print('WARNING: Set GITHUB_TOKEN to run the search step.')
else:
    print(f'GitHub token configured ({GITHUB_TOKEN[:8]}...)')

# Shared analysis parameters
MAX_COMMITS = 20
MIN_STARS = 3
PARALLEL_WORKERS = 2

LANGUAGES = [
    'Python', 'JavaScript', 'TypeScript', 'Java', 'C++',
    'C#', 'Go', 'Rust', 'PHP', 'Ruby',
    'Swift', 'Kotlin', 'Scala', 'R', 'MATLAB',
    'Dart', 'Lua', 'Perl', 'Haskell', 'Elixir',
]

print(f'Max commits per repo: {MAX_COMMITS}')
print(f'Min stars: {MIN_STARS}')
print(f'Languages: {len(LANGUAGES)}')

GitHub token configured (github_p...)
Max commits per repo: 20
Min stars: 3
Languages: 20


---
# Part B: Data Gathering

## Step 3: Search Blockchain Repositories

Use the GraphQL API to find 10 blockchain repositories matching our criteria.

In [4]:
search_repos = fetch_repositories(
    github_token=GITHUB_TOKEN,
    max_repos=10,
    min_stars=MIN_STARS,
    languages=LANGUAGES,
    keywords='blockchain',
    created_after='2020-01-01',
)

print(f'Found {len(search_repos)} blockchain repositories:')
for i, repo in enumerate(search_repos, 1):
    print(f'  {i:2d}. {repo.full_name} ({repo.stars} stars, {repo.language})')

search_urls = [repo.url for repo in search_repos]

Fetching up to 10 repositories...
   Keywords: blockchain
   Filters: min_stars=3
   Created: 2020-01-01 to any
GraphQL Search Query: blockchain stars:>=3 created:>=2020-01-01
Rate Limit: 4985/5000 (cost: 1)
Fetched 10 repositories using GraphQL
Fetched 10 repositories
   Saved to: data/repositories.json
Found 10 blockchain repositories:
   1. calistus-igwilo/nitda-blockchain-scholarship (3083 stars, HTML)
   2. smartcontractkit/full-blockchain-solidity-course-js (13959 stars, None)
   3. smartcontractkit/full-blockchain-solidity-course-py (11234 stars, None)
   4. BytePhoenixCoding/BlockchainTokenSniper (472 stars, None)
   5. FuelLabs/fuel-core (57393 stars, Rust)
   6. slowmist/Blockchain-dark-forest-selfguard-handbook (6739 stars, None)
   7. Eternaldeath/BlockchainHome (990 stars, HTML)
   8. paritytech/polkadot-sdk (2670 stars, Rust)
   9. aptos-labs/aptos-core (6424 stars, Rust)
  10. massalabs/massa (5560 stars, Rust)
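
The logged query above (`blockchain stars:>=3 created:>=2020-01-01`) carries no language qualifier, and several results report `HTML` or `None` as their language. If the language list is meant to restrict the sample, one option is a client-side filter. The sketch below assumes each result exposes a `language` attribute, as the cell above does. `filter_by_language` is a hypothetical helper, not part of `greenmining`, and `SimpleNamespace` objects stand in for real search results.

```python
from types import SimpleNamespace

def filter_by_language(repos, allowed):
    """Keep repos whose primary language is in `allowed` (case-insensitive)."""
    allowed_lower = {lang.lower() for lang in allowed}
    return [r for r in repos if r.language and r.language.lower() in allowed_lower]

# Stand-in objects mimicking three of the search results shown above.
sample = [
    SimpleNamespace(full_name='FuelLabs/fuel-core', language='Rust'),
    SimpleNamespace(full_name='Eternaldeath/BlockchainHome', language='HTML'),
    SimpleNamespace(full_name='BytePhoenixCoding/BlockchainTokenSniper', language=None),
]

kept = filter_by_language(sample, ['Python', 'Rust', 'Go'])
print([r.full_name for r in kept])
```

Filtering after the fetch can leave fewer than 10 repositories, so `max_repos` may need to be raised to compensate.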


## Step 4: Analyze All 13 Repositories

Combine the 10 search results with 3 manually selected repositories, then run the
full analysis pipeline on all of them at once with every feature enabled:
- GSF pattern detection (122 patterns, 321 keywords)
- Process metrics (DMM size, complexity, interfacing)
- Method-level analysis (per-function complexity metrics)
- Source code capture (before/after for each modified file)
- Energy measurement (CPU-based tracking during analysis)

In [5]:
# 3 manually selected repositories
manual_urls = [
    'https://github.com/pallets/flask',
    'https://github.com/psf/requests',
    'https://github.com/tiangolo/fastapi',
]

# Combine all URLs
all_urls = search_urls + manual_urls
print(f'Total repositories: {len(all_urls)}')
print(f'  Search results: {len(search_urls)}')
print(f'  Manual selection: {len(manual_urls)}')
print()

# Analyze ALL repositories with ALL features
results = analyze_repositories(
    urls=all_urls,
    max_commits=MAX_COMMITS,
    parallel_workers=PARALLEL_WORKERS,
    output_format='dict',
    energy_tracking=True,
    energy_backend='auto',
    method_level_analysis=True,
    include_source_code=True,
    github_token=GITHUB_TOKEN,
)

print(f'\nAnalysis complete: {len(results)} repositories')

Total repositories: 13
  Search results: 10
  Manual selection: 3


 Analyzing 13 repositories with 2 workers

 Analyzing repository: calistus-igwilo/nitda-blockchain-scholarship
   Cloning to: /tmp/greenmining_repos/nitda-blockchain-scholarship

 Analyzing repository: smartcontractkit/full-blockchain-solidity-course-js
   Cloning to: /tmp/greenmining_repos/full-blockchain-solidity-course-js
    Analyzed 2 commits
   Computing process metrics...
   Cleaning up: /tmp/greenmining_repos/full-blockchain-solidity-course-js
   Completed: smartcontractkit/full-blockchain-solidity-course-js

 Analyzing repository: smartcontractkit/full-blockchain-solidity-course-py
   Cloning to: /tmp/greenmining_repos/full-blockchain-solidity-course-py
    Analyzed 1 commits
   Computing process metrics...
   Cleaning up: /tmp/greenmining_repos/full-blockchain-solidity-course-py
   Completed: smartcontractkit/full-blockchain-solidity-course-py

 Analyzing repository: BytePhoenixCoding/BlockchainTokenSniper

## Step 5: Results Overview

Summary of the unified analysis across all repositories.

In [6]:
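import dataclasses

def as_dict(obj):
    """Return a plain dict for a result that may be a dict or an object."""
    if isinstance(obj, dict):
        return obj
    if hasattr(obj, 'to_dict'):  # assumption: the class may provide to_dict()
        return obj.to_dict()
    if dataclasses.is_dataclass(obj):
        return dataclasses.asdict(obj)
    return dict(vars(obj))

# Key access on the raw results raised
#   TypeError: 'RepositoryAnalysis' object is not subscriptable
# so normalize each result (and its commits) to a dict first.
results = [as_dict(r) for r in results]
for r in results:
    r['commits'] = [as_dict(c) for c in r.get('commits') or []]

total_commits = sum(r['total_commits'] for r in results)
total_green = sum(r['green_commits'] for r in results)
overall_rate = total_green / total_commits if total_commits > 0 else 0

print('=' * 70)
print('UNIFIED ANALYSIS SUMMARY')
print('=' * 70)
print(f'Repositories analyzed: {len(results)}')
print(f'Total commits: {total_commits}')
print(f'Green-aware commits: {total_green}')
print(f'Overall green rate: {overall_rate:.1%}')
print()
print(f'{"Repository":<40} {"Commits":<10} {"Green":<10} {"Rate":<10}')
print('-' * 70)
for r in results:
    rate = r['green_commit_rate'] if r['total_commits'] > 0 else 0
    print(f'{r["name"]:<40} {r["total_commits"]:<10} {r["green_commits"]:<10} {rate:.1%}')

# Build flat commit list for use in all later analysis steps
all_commits = []
for r in results:
    for c in r['commits']:
        c['repository'] = r['name']
        all_commits.append(c)

print(f'\nFlattened commit pool: {len(all_commits)} commits')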

---
# Part C: Unified Analysis

Every feature is applied to the combined dataset of all 13 repositories.

## Step 6: GSF Pattern Analysis

Examine the Green Software Foundation patterns detected across all repositories.
GreenMining detects 122 patterns across 15 categories with 321 keywords.

In [None]:
# Pattern frequency across all commits
pattern_counts = {}
for commit in all_commits:
    for pattern in commit.get('gsf_patterns_matched', []):
        pattern_counts[pattern] = pattern_counts.get(pattern, 0) + 1

sorted_patterns = sorted(pattern_counts.items(), key=lambda x: x[1], reverse=True)

print(f'Unique patterns detected: {len(sorted_patterns)}')
print(f'\nTop 20 GSF Patterns:')
print(f'{"Pattern":<45} {"Count":<8} {"% of Commits":<12}')
print('-' * 65)
for pattern, count in sorted_patterns[:20]:
    pct = count / len(all_commits) * 100 if all_commits else 0
    print(f'{pattern:<45} {count:<8} {pct:.1f}%')

# Pattern categories
categories = set()
for p in GSF_PATTERNS.values():
    categories.add(p.get('category', 'Unknown'))
print(f'\nGSF Categories ({len(categories)}):')
for cat in sorted(categories):
    count = sum(1 for p in GSF_PATTERNS.values() if p.get('category') == cat)
    print(f'  {cat}: {count} patterns')
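
The frequency count in the cell above can also be written with `collections.Counter`, which handles the tallying and the descending sort. This is a minimal sketch on toy data: the pattern names are placeholders, not real GSF pattern names, and only the `gsf_patterns_matched` key is taken from the notebook.

```python
from collections import Counter

# Toy commits; the pattern names are placeholders.
toy_commits = [
    {'gsf_patterns_matched': ['pattern_a', 'pattern_b']},
    {'gsf_patterns_matched': ['pattern_a']},
    {'gsf_patterns_matched': []},
    {},  # a commit with no pattern field at all
]

# Count every pattern occurrence across all commits.
counts = Counter(
    pattern
    for commit in toy_commits
    for pattern in commit.get('gsf_patterns_matched') or []
)

# most_common() returns (pattern, count) pairs sorted by count, descending.
top = counts.most_common(2)
print(top)
```

`counts.most_common()` with no argument yields the same list of pairs that `sorted_patterns` holds in the cell above.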

## Step 7: Green Awareness Detection

Demonstrate keyword-based green awareness detection on sample commit messages
and the pattern lookup API.

In [None]:
test_messages = [
    'Optimize database queries for energy efficiency',
    'Fix typo in README',
    'Implement lazy loading for images to reduce bandwidth',
    'Add unit tests for login',
    'Reduce memory footprint of cache layer',
    'Refactor to async I/O for better resource utilization',
]

print('Green Awareness Detection:')
for msg in test_messages:
    result = is_green_aware(msg)
    print(f'  [{"GREEN" if result else "-----"}] {msg}')

print('\nPattern Lookup Examples:')
for keyword in ['cache', 'lazy loading', 'compression', 'async']:
    patterns = get_pattern_by_keywords(keyword)
    if patterns:
        names = [p['name'] for p in patterns[:3]]
        print(f'  "{keyword}" -> {names}')
    else:
        print(f'  "{keyword}" -> no matching patterns')
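
To make the idea of keyword-based detection concrete, here is a minimal sketch of one way such a check could work: a case-insensitive whole-phrase match against a keyword list. It is not the `is_green_aware` implementation. `SAMPLE_KEYWORDS` and `looks_green` are hypothetical names, and the four keywords are illustrative rather than taken from `GREEN_KEYWORDS`.

```python
import re

# Hypothetical mini keyword list for illustration only.
SAMPLE_KEYWORDS = ['energy', 'lazy loading', 'memory footprint', 'cache']

def looks_green(message, keywords=SAMPLE_KEYWORDS):
    """Return True if any keyword occurs as a whole word or phrase."""
    text = message.lower()
    return any(re.search(rf'\b{re.escape(k)}\b', text) for k in keywords)

print(looks_green('Optimize database queries for energy efficiency'))
print(looks_green('Fix typo in README'))
```

Matching on word boundaries avoids flagging a message just because a keyword appears inside a longer, unrelated word.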

## Step 8: Process Metrics

Examine the process metrics collected during analysis: DMM (Delta Maintainability Model)
scores for size, complexity, and interfacing, plus structural complexity metrics
and method-level analysis via Lizard integration.

In [None]:
print('Process Metrics Summary')
print('=' * 70)

metrics_keys = [
    'dmm_unit_size', 'dmm_unit_complexity', 'dmm_unit_interfacing',
    'total_nloc', 'total_complexity', 'max_complexity',
    'methods_count', 'files_modified', 'insertions', 'deletions',
]

metrics_data = {k: [] for k in metrics_keys}
for commit in all_commits:
    for key in metrics_keys:
        val = commit.get(key)
        if val is not None:
            metrics_data[key].append(val)

print(f'{"Metric":<25} {"Avg":>10} {"Min":>10} {"Max":>10} {"N":>6}')
print('-' * 65)
for metric, values in metrics_data.items():
    if values:
        avg = sum(values) / len(values)
        print(f'{metric:<25} {avg:>10.2f} {min(values):>10.2f} {max(values):>10.2f} {len(values):>6}')

# Method-level analysis
total_methods = sum(len(c.get('methods', [])) for c in all_commits)
print(f'\nMethod-Level Analysis:')
print(f'  Total methods analyzed: {total_methods}')

for commit in all_commits:
    methods = commit.get('methods', [])
    if methods:
        print(f'  Sample from {commit.get("repository")} ({commit.get("hash", "")[:8]}):')
        for m in methods[:3]:
            print(f'    {m.get("name", "N/A")}: nloc={m.get("nloc", 0)}, '
                  f'complexity={m.get("complexity", 0)}')
        break

# Source code changes
total_src = sum(len(c.get('source_changes', [])) for c in all_commits)
print(f'\nSource code changes captured: {total_src}')

## Step 9: Statistical Analysis

Apply statistical methods to the combined dataset: pattern correlations,
temporal trend significance, and effect sizes between green and non-green commits.

In [None]:
stat_analyzer = StatisticalAnalyzer()

commits_df = pd.DataFrame(all_commits)

# Add binary indicator columns for each pattern
all_pattern_names = list(pattern_counts.keys())
for pattern in all_pattern_names:
    commits_df[f'pattern_{pattern}'] = commits_df['gsf_patterns_matched'].apply(
        lambda x, p=pattern: 1 if p in (x or []) else 0
    )

# Pattern correlations
if len(all_pattern_names) >= 2:
    corr = stat_analyzer.analyze_pattern_correlations(commits_df)
    sig_pairs = corr.get('significant_pairs', [])
    print(f'Pattern Correlation Analysis:')
    print(f'  Significant pairs: {len(sig_pairs)}')
    for pair in sig_pairs[:10]:
        print(f'    {pair}')
else:
    print(f'Found {len(all_pattern_names)} pattern(s) - need >= 2 for correlation')

# Temporal trend
if 'date' in commits_df.columns and 'green_aware' in commits_df.columns:
    if 'commit_hash' not in commits_df.columns:
        commits_df['commit_hash'] = commits_df.get('hash', commits_df.index.astype(str))
    trend_results = stat_analyzer.temporal_trend_analysis(commits_df)
    trend = trend_results.get('trend', {})
    print(f'\nTemporal Trend:')
    print(f'  Direction: {trend.get("direction", "N/A")}')
    print(f'  Significant: {trend.get("significant", "N/A")}')
    print(f'  Correlation: {trend.get("correlation", "N/A")}')

# Effect size: green vs non-green complexity
# Guard against missing columns (e.g. when no commits were collected)
green_cx, non_green_cx = [], []
if {'green_aware', 'total_complexity'} <= set(commits_df.columns):
    green_cx = commits_df.loc[commits_df['green_aware'] == True, 'total_complexity'].dropna().tolist()
    non_green_cx = commits_df.loc[commits_df['green_aware'] == False, 'total_complexity'].dropna().tolist()

if green_cx and non_green_cx:
    effect = stat_analyzer.effect_size_analysis(green_cx, non_green_cx)
    print(f'\nEffect Size (Green vs Non-Green Complexity):')
    print(f'  Cohen\'s d: {effect["cohens_d"]:.3f} ({effect["magnitude"]})')
    print(f'  Mean difference: {effect["mean_difference"]:.2f}')
    print(f'  Significant: {effect["significant"]}')
else:
    print('\nInsufficient data for effect size analysis')
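
For reference, Cohen's d is the difference between two group means divided by their pooled standard deviation. The sketch below computes it with the standard library on made-up numbers. It shows the textbook formula and is not necessarily how `StatisticalAnalyzer.effect_size_analysis` computes its result.

```python
import math
import statistics

def cohens_d(a, b):
    """Cohen's d using the pooled sample standard deviation."""
    n1, n2 = len(a), len(b)
    var1, var2 = statistics.variance(a), statistics.variance(b)
    pooled_sd = math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    return (statistics.mean(a) - statistics.mean(b)) / pooled_sd

# Made-up complexity values for two small groups.
d = cohens_d([2, 4, 6], [1, 3, 5])
print(round(d, 3))  # 0.5
```

By Cohen's conventional benchmarks, values around 0.2, 0.5, and 0.8 are read as small, medium, and large effects.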

## Step 10: Temporal Analysis

Analyze how green software practices evolve over time across all repositories.

In [None]:
temporal = TemporalAnalyzer(granularity='quarter')

# Convert to analyzer's expected format
analysis_results_fmt = []
for c in all_commits:
    analysis_results_fmt.append({
        'commit_sha': c.get('hash', ''),
        'is_green_aware': c.get('green_aware', False),
        'patterns_detected': c.get('gsf_patterns_matched', []),
        'detection_method': 'gsf_keyword',
    })

temporal_results = temporal.analyze_trends(all_commits, analysis_results_fmt)

periods = temporal_results.get('periods', [])
print(f'Temporal Analysis ({len(periods)} periods):')
print(f'{"Period":<20} {"Commits":<10} {"Green":<10} {"Rate":<10} {"Patterns":<10}')
print('-' * 60)
for p in periods:
    rate = p.get('green_awareness_rate', 0)
    print(f'{p.get("period", "N/A"):<20} {p.get("commit_count", 0):<10} '
          f'{p.get("green_commit_count", 0):<10} {rate:.1%}      '
          f'{p.get("unique_patterns", 0)}')

summary = temporal_results.get('summary', {})
print(f'\nTrend: {summary.get("overall_direction", "N/A")}')
print(f'Peak period: {summary.get("peak_period", "N/A")}')
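
Quarterly granularity means each commit is assigned to a year-quarter bucket before rates are computed. A minimal sketch of that grouping follows, on toy commits. The `YYYY-Qn` label format and the helper `quarter_key` are assumptions for illustration, so `TemporalAnalyzer` may label its periods differently.

```python
from collections import defaultdict
from datetime import datetime

def quarter_key(dt):
    """Map a datetime to a label such as '2024-Q2'."""
    return f'{dt.year}-Q{(dt.month - 1) // 3 + 1}'

# Toy commits with made-up dates.
toy = [
    {'date': datetime(2024, 1, 15), 'green_aware': True},
    {'date': datetime(2024, 2, 3), 'green_aware': False},
    {'date': datetime(2024, 5, 20), 'green_aware': True},
]

# Tally total and green commits per quarter.
buckets = defaultdict(lambda: {'commits': 0, 'green': 0})
for c in toy:
    bucket = buckets[quarter_key(c['date'])]
    bucket['commits'] += 1
    bucket['green'] += int(c['green_aware'])

rates = {k: v['green'] / v['commits'] for k, v in sorted(buckets.items())}
print(rates)
```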

## Step 11: Code Diff Pattern Signatures

The CodeDiffAnalyzer detects green patterns directly in code changes.
It is integrated into the analysis pipeline automatically. Here we inspect
the pattern signatures it looks for.

In [None]:
diff_analyzer = CodeDiffAnalyzer()

print(f'Code Diff Pattern Signatures: {len(diff_analyzer.PATTERN_SIGNATURES)} types')
print('=' * 60)
for name, data in diff_analyzer.PATTERN_SIGNATURES.items():
    print(f'  {name}:')
    if isinstance(data, dict):
        for key, val in list(data.items())[:2]:
            if isinstance(val, list):
                print(f'    {key}: {val[:3]}...')
            else:
                print(f'    {key}: {val}')
    print()

## Step 12: Energy Measurement

GreenMining provides multiple energy measurement backends:
- **RAPL** — Linux kernel hardware counters (Intel/AMD, most accurate)
- **CodeCarbon** — cross-platform (requires codecarbon package)
- **CPU Meter** — universal (estimates from CPU utilization and TDP)
- **Auto** — selects the best available backend

In [None]:
# Check available backends
print('Available Energy Backends:')
for backend in ['rapl', 'codecarbon', 'cpu_meter', 'auto']:
    try:
        m = get_energy_meter(backend)
        print(f'  {backend}: available ({type(m).__name__})')
    except Exception as e:
        print(f'  {backend}: not available ({e})')

# Measure a sample workload
meter = CPUEnergyMeter()
print(f'\nCPU Energy Meter: available={meter.is_available()}')

def sample_workload():
    return sum(i ** 2 for i in range(1_000_000))

result, energy = meter.measure(sample_workload)
print(f'\nSample Workload Measurement:')
print(f'  Energy: {energy.joules:.4f} J')
print(f'  Power avg: {energy.watts_avg:.2f} W')
print(f'  Peak: {energy.watts_peak:.2f} W')
print(f'  Duration: {energy.duration_seconds:.3f} s')
print(f'  Backend: {energy.backend}')

# Show energy from the repository analysis
for r in results:
    e = r.get('energy_metrics')
    if e and e.get('joules', 0) > 0:
        print(f'\nAnalysis energy ({r["name"]}):')
        print(f'  Total: {e.get("joules", 0):.4f} J')
        print(f'  Avg power: {e.get("watts_avg", 0):.2f} W')
        break
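
The CPU Meter backend is described above as estimating energy from CPU utilization and TDP. One simple model of that idea treats average power as TDP scaled by utilization and multiplies by the elapsed time. The sketch below uses that linear model. It ignores idle power and is an assumption for illustration, not the formula `CPUEnergyMeter` actually uses, and `estimate_energy_joules` is a hypothetical helper.

```python
def estimate_energy_joules(tdp_watts, avg_utilization, duration_seconds):
    """Estimate energy as TDP x utilization x time (simplified linear model)."""
    if not 0.0 <= avg_utilization <= 1.0:
        raise ValueError('avg_utilization must be between 0 and 1')
    avg_power_watts = tdp_watts * avg_utilization
    return avg_power_watts * duration_seconds

# A 65 W CPU at 50% average utilization for 2 seconds.
print(estimate_energy_joules(65.0, 0.5, 2.0))  # 65.0
```

Estimates of this kind are useful for relative comparisons on the same machine. Hardware counters such as RAPL are the better choice when absolute values matter.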

## Step 13: Power Regression Detection

Detect commits that introduced energy regressions by measuring power before and after
each commit. Requires a local repository with a runnable test suite.

In [None]:
detector = PowerRegressionDetector(
    test_command="python -c 'sum(range(100000))'",
    energy_backend='cpu_meter',
    threshold_percent=5.0,
    iterations=3,
    warmup_iterations=1,
)

print('PowerRegressionDetector configured:')
print(f'  Test command: python -c "sum(range(100000))"')
print(f'  Backend: cpu_meter')
print(f'  Threshold: 5.0%')
print(f'  Iterations: 3, Warmup: 1')
print()
print('Usage on a local repository:')
print('  regressions = detector.detect(')
print('      repo_path="./my-repo",')
print('      baseline_commit="HEAD~10",')
print('      target_commit="HEAD",')
print('  )')
print('  for r in regressions:')
print('      print(f"{r.sha[:8]} | before={r.power_before:.2f}W | '
      'after={r.power_after:.2f}W | regression={r.is_regression}")')
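
The threshold check at the heart of regression detection can be sketched independently of the library: average the repeated power readings on each side, compute the percent change, and flag the commit when the change exceeds the threshold. `is_power_regression` is a hypothetical helper and the wattages are made up. How `PowerRegressionDetector` aggregates its iterations is not shown in this notebook.

```python
import statistics

def is_power_regression(before_watts, after_watts, threshold_percent=5.0):
    """Return (percent_change, flagged) for two lists of power readings."""
    before = statistics.mean(before_watts)
    after = statistics.mean(after_watts)
    change = (after - before) / before * 100.0
    return change, change > threshold_percent

# Three made-up readings before and after a commit.
change, flagged = is_power_regression([10.0, 10.2, 9.8], [10.7, 10.9, 10.8])
print(f'change={change:.1f}% regression={flagged}')
```

Averaging several iterations, as the `iterations` and `warmup_iterations` settings above suggest, reduces the chance that measurement noise alone crosses the threshold.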

## Step 14: Metrics-to-Power Correlation

Analyze correlations between code metrics and energy consumption using
Pearson and Spearman coefficients.

In [None]:
correlator = MetricsPowerCorrelator(significance_level=0.05)

metric_names = ['total_complexity', 'total_nloc', 'files_modified', 'insertions', 'deletions']
metrics_values = {m: [] for m in metric_names}
power_measurements = []

for c in all_commits:
    has_all = all(c.get(m) is not None for m in metric_names)
    energy_val = c.get('energy_watts_avg') or c.get('energy_joules')
    if has_all and energy_val:
        for m in metric_names:
            metrics_values[m].append(float(c[m]))
        power_measurements.append(float(energy_val))

if len(power_measurements) >= 3:
    correlator.fit(metric_names, metrics_values, power_measurements)
    summary = correlator.summary()
    print(f'Metrics-to-Power Correlation:')
    print(f'  Metrics analyzed: {summary["total_metrics"]}')
    print(f'  Significant: {summary["significant_count"]}')
    print()
    for name, result in correlator.get_results().items():
        print(f'  {name}:')
        print(f'    Pearson r={result.pearson_r:.3f}, Spearman rho={result.spearman_rho:.3f}')
        print(f'    Significant: {result.is_significant}')
    print(f'\nFeature Importance:')
    for name, imp in correlator.feature_importance.items():
        bar = '#' * int(imp * 30)
        print(f'  {name:<20} {imp:.3f} {bar}')
else:
    print(f'Insufficient data ({len(power_measurements)} points, need >= 3)')
    print('Enable energy_tracking=True to collect per-commit energy data.')
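
Pearson's r measures linear association, while Spearman's rho is Pearson's r applied to the ranks of the values, so it captures any monotonic relationship. The standard-library sketch below shows the difference on made-up data where power grows quadratically with complexity. It illustrates the two coefficients only and is not the `MetricsPowerCorrelator` implementation.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Average 1-based ranks, so tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman's rho: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

complexity = [1, 2, 3, 4, 5]   # made-up metric values
power = [1, 4, 9, 16, 25]      # made-up power readings (quadratic in complexity)
r = pearson(complexity, power)
rho = spearman(complexity, power)
print(f'Pearson r={r:.3f}, Spearman rho={rho:.3f}')
```

Here rho is 1 because the relationship is perfectly monotonic, while r falls slightly below 1 because it is not linear.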

## Step 15: Version Power Analysis

Compare energy consumption across different software versions by checking out
tags and running a test suite at each version.

In [None]:
version_analyzer = VersionPowerAnalyzer(
    test_command="python -c 'sum(range(100000))'",
    energy_backend='cpu_meter',
    iterations=5,
    warmup_iterations=1,
)

print('VersionPowerAnalyzer configured:')
print(f'  Backend: cpu_meter, Iterations: 5, Warmup: 1')
print()
print('Usage on a local repository with version tags:')
print('  report = version_analyzer.analyze_versions(')
print('      repo_path="./my-repo",')
print('      versions=["v1.0", "v2.0", "v3.0"],')
print('  )')
print('  print(f"Trend: {report.trend}")')
print('  print(f"Total change: {report.total_change_percent:.1f}%")')
print('  print(f"Most efficient: {report.most_efficient}")')
print('  for v in report.versions:')
print('      print(f"{v.version}: {v.power_watts_avg:.2f}W")')
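
The summary fields printed in the usage example (total change and most efficient version) can be derived from per-version average power in a few lines. This is a sketch on made-up wattages. `summarize_versions` is a hypothetical helper, and the report object returned by `analyze_versions` may compute these values differently.

```python
def summarize_versions(power_by_version):
    """Return (total_change_percent, most_efficient) from an ordered mapping."""
    versions = list(power_by_version)
    first = power_by_version[versions[0]]
    last = power_by_version[versions[-1]]
    total_change = (last - first) / first * 100.0
    # The most efficient version is the one with the lowest average power.
    most_efficient = min(power_by_version, key=power_by_version.get)
    return total_change, most_efficient

# Made-up average power (watts) per version, oldest first.
total_change, best = summarize_versions({'v1.0': 20.0, 'v2.0': 18.0, 'v3.0': 22.0})
print(f'Total change: {total_change:.1f}%, most efficient: {best}')
```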

## Step 16: Visualization (matplotlib)

Static charts from the unified analysis data.

In [None]:
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# 1. Green commit rate per repository
repo_names = [r['name'][:20] for r in results]
green_rates = [r['green_commit_rate'] for r in results]
axes[0, 0].barh(repo_names, green_rates, color='green', alpha=0.7)
axes[0, 0].set_xlabel('Green Commit Rate')
axes[0, 0].set_title('Green Awareness by Repository')
axes[0, 0].set_xlim(0, 1)

# 2. Top 10 patterns
if sorted_patterns:
    top = sorted_patterns[:10]
    axes[0, 1].barh([p[0][:30] for p in top], [p[1] for p in top], color='teal', alpha=0.7)
    axes[0, 1].set_xlabel('Count')
    axes[0, 1].set_title('Top 10 GSF Patterns')

# 3. Commits breakdown
commit_counts = [r['total_commits'] for r in results]
green_counts = [r['green_commits'] for r in results]
x = range(len(results))
axes[1, 0].bar(x, commit_counts, label='Total', alpha=0.7)
axes[1, 0].bar(x, green_counts, label='Green', alpha=0.7)
axes[1, 0].set_xticks(list(x))
axes[1, 0].set_xticklabels(repo_names, rotation=45, ha='right', fontsize=7)
axes[1, 0].set_title('Commit Breakdown')
axes[1, 0].legend()

# 4. Complexity distribution
cxs = [c.get('total_complexity', 0) for c in all_commits if c.get('total_complexity')]
if cxs:
    axes[1, 1].hist(cxs, bins=20, color='orange', alpha=0.7, edgecolor='black')
    axes[1, 1].set_xlabel('Total Complexity')
    axes[1, 1].set_title('Complexity Distribution')

plt.tight_layout()
plt.savefig('data/analysis_plots.png', dpi=150, bbox_inches='tight')
plt.show()
print('Saved to data/analysis_plots.png')

## Step 17: Interactive Visualization (Plotly)

Interactive charts for deeper exploration.

In [None]:
import plotly.express as px

# Sunburst: Repository -> Green/Non-Green -> Pattern
sun_data = []
for r in results:
    for c in r.get('commits', []):
        cat = 'Green' if c.get('green_aware') else 'Non-Green'
        patterns = c.get('gsf_patterns_matched', [])
        pat = patterns[0] if patterns else 'None'
        sun_data.append({
            'repository': r['name'][:20], 'category': cat,
            'pattern': pat[:30], 'count': 1,
        })

if sun_data:
    df_sun = pd.DataFrame(sun_data)
    fig = px.sunburst(df_sun, path=['repository', 'category', 'pattern'],
                      values='count', title='Repository Analysis Breakdown')
    fig.show()

# Scatter: Complexity vs NLOC
sc = [{'cx': c['total_complexity'], 'nloc': c['total_nloc'],
       'green': 'Green' if c.get('green_aware') else 'Non-Green',
       'repo': c.get('repository', '')}
      for c in all_commits if c.get('total_complexity') and c.get('total_nloc')]

if sc:
    fig2 = px.scatter(pd.DataFrame(sc), x='nloc', y='cx', color='green',
                      hover_data=['repo'],
                      title='Complexity vs Lines of Code')
    fig2.show()

## Step 18: Export Results

Export the unified analysis to JSON, CSV, and pandas DataFrame.

In [None]:
# JSON
with open('data/analysis_results.json', 'w') as f:
    json.dump(results, f, indent=2, default=str)
print('Exported data/analysis_results.json')

# CSV (flattened commits)
csv_rows = []
for r in results:
    for c in r.get('commits', []):
        csv_rows.append({
            'repository': r['name'],
            'repo_url': r['url'],
            'commit_hash': c.get('hash', ''),
            'author': c.get('author', ''),
            'date': c.get('date', ''),
            'message': str(c.get('message', ''))[:100],
            'green_aware': c.get('green_aware', False),
            'patterns_matched': ', '.join(c.get('gsf_patterns_matched', [])),
            'pattern_count': c.get('pattern_count', 0),
            'confidence': c.get('confidence', ''),
            'files_modified': c.get('files_modified', 0),
            'insertions': c.get('insertions', 0),
            'deletions': c.get('deletions', 0),
            'dmm_unit_size': c.get('dmm_unit_size'),
            'dmm_unit_complexity': c.get('dmm_unit_complexity'),
            'dmm_unit_interfacing': c.get('dmm_unit_interfacing'),
            'total_nloc': c.get('total_nloc'),
            'total_complexity': c.get('total_complexity'),
            'methods_count': c.get('methods_count'),
            'energy_joules': c.get('energy_joules'),
        })

df_export = pd.DataFrame(csv_rows)
df_export.to_csv('data/analysis_results.csv', index=False)
print(f'Exported {len(csv_rows)} commits to data/analysis_results.csv')
print(f'\nDataFrame shape: {df_export.shape}')
df_export.head()

## Step 19: Web Dashboard

GreenMining includes a Flask-based dashboard for interactive exploration.
The dashboard reads analysis data from a directory and exposes REST API endpoints.

In [None]:
app = create_app(data_dir='./data')

print('Dashboard created successfully')
print()
print('API Endpoints:')
print('  GET /              - Dashboard UI')
print('  GET /api/repositories - Repository data')
print('  GET /api/analysis    - Analysis results')
print('  GET /api/statistics  - Aggregated statistics')
print('  GET /api/energy      - Energy report')
print('  GET /api/summary     - Summary metrics')
print()
print('To launch (in a terminal, not here):')
print('  from greenmining.dashboard import run_dashboard')
print('  run_dashboard(data_dir="./data", host="127.0.0.1", port=5000)')

---
# Summary

## Repositories Analyzed
- 10 blockchain repositories (GraphQL search)
- 3 selected repositories (Flask, Requests, FastAPI)
- **Total: 13 repositories** through a single unified pipeline

## Features Applied to All Repositories

| Feature | Status |
|---------|--------|
| GSF Pattern Detection (122 patterns, 15 categories) | Applied |
| Process Metrics (DMM size, complexity, interfacing) | Applied |
| Method-Level Analysis (per-function complexity) | Applied |
| Source Code Capture (before/after) | Applied |
| Energy Measurement (auto-detected backend) | Applied |
| Statistical Analysis (correlations, effect sizes) | Applied |
| Temporal Analysis (quarterly trends) | Applied |
| Code Diff Pattern Signatures | Applied |
| Power Regression Detection | Demonstrated |
| Metrics-to-Power Correlation (Pearson/Spearman) | Applied |
| Version Power Comparison | Demonstrated |
| Visualization (matplotlib + plotly) | Applied |
| Export (JSON, CSV, DataFrame) | Applied |
| Web Dashboard (Flask REST API) | Applied |

## Output Files
- `data/analysis_results.json` — Full analysis data
- `data/analysis_results.csv` — Flattened commit-level data
- `data/analysis_plots.png` — Static visualizations
- `data/validation_samples.json` — Qualitative validation samples