# AstroGraphAnomaly: Colab Full Test (all capabilities)

Goal: run and **validate** as many workflow capabilities as possible while staying **workflow-first** (CLI), without assuming the project is installed as a package.

Coverage:
- offline execution (test CSV) plus synthetic datasets
- multiple engines and multiple threshold strategies (with a fallback when an option is not supported)
- artifact generation (raw/scored/top/GraphML/manifest)
- plot export (the “relevant + good-looking” set)
- explainability (LIME) plus LLM prompts (reading/visualization)
- advanced graph analyses (communities, k-core, approximate betweenness, articulation points, bridges)
- cross-run comparison (Jaccard overlap of top anomalies plus score correlation)

**Gaia** is optional (network/quota). The notebook does not run it by default.


In [None]:
!git clone --depth 1 https://github.com/dalozedidier-dot/AstroGraphAnomaly.git
%cd AstroGraphAnomaly
!python -m pip install -q --upgrade pip
!pip -q install -r requirements.txt


## 0) Entrypoint detection + CLI capabilities

We detect `workflow.py` or `run_workflow.py` and fetch the CLI help text to find out what is supported.
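
As a sanity check of the parsing idea, the sketch below extracts a `{a,b,c}` choice list from argparse-style help text. The `SAMPLE_HELP` string and the `parse_choices` helper are illustrative assumptions, not output or code from the repository. The real help text may be laid out differently, which is why the notebook keeps a hard-coded fallback list.

```python
import re

# Hypothetical help text in the usual argparse layout (not the repo's real output).
SAMPLE_HELP = """
usage: workflow.py csv [-h] --in-csv IN_CSV --out OUT
                       [--engine {isolation_forest,lof,ocsvm}]
                       [--threshold-strategy {top_k,percentile}]
"""

def parse_choices(help_text, flag, default=()):
    """Return the choices listed as `flag {a,b,c}`, or `default` if absent."""
    m = re.search(re.escape(flag) + r"\s+\{([^}]+)\}", help_text)
    if not m:
        return list(default)
    return [c.strip() for c in m.group(1).split(',')]

print(parse_choices(SAMPLE_HELP, '--engine'))
print(parse_choices(SAMPLE_HELP, '--knn-k', default=['n/a']))
```

Returning a caller-supplied default keeps the "best-effort" behavior explicit: a flag that is missing from the help text does not raise, it simply falls back.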

In [None]:
import os, sys, subprocess, re
from pathlib import Path

ENTRYPOINT = None
if Path('workflow.py').exists():
    ENTRYPOINT = 'workflow.py'
elif Path('run_workflow.py').exists():
    ENTRYPOINT = 'run_workflow.py'
else:
    raise FileNotFoundError('No entrypoint found: workflow.py or run_workflow.py')

print('Detected entrypoint:', ENTRYPOINT)

def _help_text():
    try:
        out = subprocess.check_output([sys.executable, ENTRYPOINT, '--help'], stderr=subprocess.STDOUT, text=True)
        return out
    except Exception as e:
        print('WARN: unable to read --help:', e)
        return ''

HELP = _help_text()
print(HELP[:1200])

# Try to infer supported engines/threshold strategies (best-effort)
engines = []
m = re.search(r"--engine\s+\{([^}]+)\}", HELP)
if m:
    engines = [x.strip() for x in m.group(1).split(',')]
else:
    engines = ['isolation_forest', 'lof', 'ocsvm', 'robust_zscore']

thr = []
m2 = re.search(r"--threshold-strategy\s+\{([^}]+)\}", HELP)
if m2:
    thr = [x.strip() for x in m2.group(1).split(',')]
else:
    thr = ['top_k', 'percentile', 'contamination']

print('Candidate engines:', engines)
print('Candidate threshold strategies:', thr)


## 1) Datasets

- `data/sample_gaia_like.csv`: shipped with the repo (offline)
- `data/sample_gaia_like_with_bp_rp.csv`: enriched version (enables the offline color-magnitude diagram, CMD)
- `data/hubble_like.csv`: synthetic dataset (same minimal schema) used to test the `hubble` mode if present, otherwise `csv`.
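
Since every dataset is expected to share a minimal schema, a quick header check before launching the workflow can save a failed run. The sketch below rests on assumptions: the `REQUIRED` column set is inferred from the columns this notebook generates for the synthetic dataset, not from the workflow's own validation, and `missing_columns` is a hypothetical helper.

```python
import csv
import io

# Assumed minimal schema, taken from the synthetic dataset built in this notebook.
REQUIRED = ('source_id', 'ra', 'dec', 'parallax', 'pmra', 'pmdec', 'phot_g_mean_mag')

def missing_columns(csv_file, required=REQUIRED):
    """Return the required columns absent from the CSV header, in order."""
    header = next(csv.reader(csv_file), [])
    present = {h.strip() for h in header}
    return [c for c in required if c not in present]

# In-memory stand-in for a file opened with open(path, newline='').
demo = io.StringIO(
    "source_id,ra,dec,parallax,pmra,phot_g_mean_mag\n"
    "1,10.5,20.5,0.8,-5.0,18.2\n"
)
print(missing_columns(demo))  # ['pmdec']
```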


In [None]:
import pandas as pd
import numpy as np
from pathlib import Path

data_dir = Path('data')
data_dir.mkdir(exist_ok=True)

base_csv = data_dir/'sample_gaia_like.csv'
assert base_csv.exists(), f'Missing {base_csv}'

df = pd.read_csv(base_csv)

# Ensure bp_rp exists for offline CMD (synthetic if not provided)
df_bp = df.copy()
if 'bp_rp' not in df_bp.columns:
    # synthetic color index: normal(1.5, 0.6), mostly within [0.0, 3.5], clipped to [-0.5, 4.5]
    rng = np.random.default_rng(42)
    df_bp['bp_rp'] = np.clip(rng.normal(loc=1.5, scale=0.6, size=len(df_bp)), -0.5, 4.5)

(data_dir/'sample_gaia_like_with_bp_rp.csv').write_text(df_bp.to_csv(index=False), encoding='utf-8')

# Synthetic 'hubble-like' dataset: reusing Gaia-like schema + a couple extra cols
rng = np.random.default_rng(7)
n = 1200
hubble = pd.DataFrame({
    'source_id': np.arange(900000000000000000, 900000000000000000 + n, dtype=np.int64),
    'ra': rng.uniform(10.0, 11.0, size=n),
    'dec': rng.uniform(20.0, 21.0, size=n),
    'parallax': np.abs(rng.normal(0.8, 0.25, size=n)) + 0.05,
    'pmra': rng.normal(-5.0, 2.0, size=n),
    'pmdec': rng.normal(-6.0, 2.2, size=n),
    'phot_g_mean_mag': rng.uniform(16.0, 23.0, size=n),
    'bp_rp': np.clip(rng.normal(1.6, 0.7, size=n), -0.5, 4.5),
})
hubble['distance'] = 1000.0 / hubble['parallax']
# extra cols typical in imaging
hubble['flux'] = rng.lognormal(mean=2.0, sigma=0.6, size=n)
hubble['snr'] = rng.uniform(5, 200, size=n)

(data_dir/'hubble_like.csv').write_text(hubble.to_csv(index=False), encoding='utf-8')

print('Created:', data_dir/'sample_gaia_like_with_bp_rp.csv')
print('Created:', data_dir/'hubble_like.csv')
df_bp.head()


## 2) Robust runner (fallback)

We first try a rich configuration (engine + threshold + extended features + plots + explain).
If the entrypoint rejects a flag (unsupported option), we retry with a minimal set of flags.
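
The try-rich-then-minimal pattern can be sketched in isolation as follows. This is a simplified illustration, not the notebook's runner: `run_with_fallback` is a hypothetical helper, and the two commands are stand-ins that always fail and always succeed.

```python
import subprocess
import sys

def run_with_fallback(rich_cmd, minimal_cmd):
    """Run rich_cmd; if it exits non-zero, run minimal_cmd instead."""
    try:
        subprocess.check_call(rich_cmd)
        return {'status': 'ok', 'cmd': rich_cmd}
    except subprocess.CalledProcessError as err:
        # If the entrypoint uses argparse, an unrecognized flag exits with status 2.
        subprocess.check_call(minimal_cmd)
        return {'status': 'fallback_ok', 'cmd': minimal_cmd, 'rich_error': str(err)}

# Stand-in commands: the first always fails, the second always succeeds.
failing = [sys.executable, '-c', 'import sys; sys.exit(2)']
passing = [sys.executable, '-c', 'pass']

result = run_with_fallback(failing, passing)
print(result['status'])  # fallback_ok
```

Note that a fallback run drops the engine and threshold flags, so its results reflect the entrypoint's defaults. The `status` field is what lets later cells tell the two cases apart.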


In [None]:
import shlex
from typing import Optional

def run_cmd(cmd):
    """Echo a command, then run it (raises CalledProcessError on a non-zero exit)."""
    print('RUN:', shlex.join(cmd))
    subprocess.check_call(cmd)

def _base_cmd(mode, out_dir, in_csv=None, ra=None, dec=None, radius_deg=None, limit=None):
    """Arguments shared by the rich and the minimal invocation."""
    # workflow.py is called with the mode as a subcommand, run_workflow.py with --mode.
    if ENTRYPOINT == 'workflow.py':
        cmd = [sys.executable, ENTRYPOINT, mode]
    else:
        cmd = [sys.executable, ENTRYPOINT, '--mode', mode]
    if mode in ('csv', 'hubble'):
        if in_csv is None:
            raise ValueError(f"mode '{mode}' requires in_csv")
        cmd += ['--in-csv', str(in_csv)]
    if mode == 'gaia':
        cmd += ['--ra', str(ra), '--dec', str(dec), '--radius-deg', str(radius_deg), '--limit', str(limit)]
    return cmd + ['--out', str(out_dir)]

def run_workflow(mode: str, out_dir: str, in_csv: Optional[str] = None, ra=None, dec=None, radius_deg=None, limit=None,
                 engine='isolation_forest', threshold_strategy='top_k', top_k=30, explain_top=10, knn_k=8,
                 features_mode='extended', plots=True):
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    base = _base_cmd(mode, out_dir, in_csv, ra, dec, radius_deg, limit)
    plot_flag = ['--plots'] if plots else []

    # Rich command: every optional flag.
    cmd = base + ['--engine', engine, '--threshold-strategy', threshold_strategy,
                  '--top-k', str(top_k), '--explain-top', str(explain_top),
                  '--knn-k', str(knn_k), '--features-mode', features_mode] + plot_flag
    # Minimal command: advanced flags dropped, only --top-k and --plots kept.
    cmd2 = base + ['--top-k', str(top_k)] + plot_flag

    # Try rich, fall back to minimal if needed
    try:
        run_cmd(cmd)
        return {'status': 'ok', 'cmd': cmd}
    except Exception as e:
        print('WARN: rich run failed, fallback minimal. Error:', e)
        run_cmd(cmd2)
        return {'status': 'fallback_ok', 'cmd': cmd2, 'rich_error': str(e)}


## 3) Offline matrix: engines × thresholds (on the bp_rp-enriched CSV)

We use `sample_gaia_like_with_bp_rp.csv` to enable the offline CMD.
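
To see what the matrix expands to before launching anything, the combinations can be listed up front. A minimal sketch, assuming the fallback engine and strategy lists from section 0; the names `engines_demo`, `strategies_demo` and `plan` are illustrative.

```python
from itertools import product
from pathlib import Path

# Fallback candidates used by this notebook when --help cannot be parsed.
engines_demo = ['isolation_forest', 'lof', 'ocsvm', 'robust_zscore']
strategies_demo = ['top_k', 'percentile', 'contamination']

out_base = Path('results/fulltest_offline')
plan = [(e, t, out_base / f'{e}__{t}') for e, t in product(engines_demo, strategies_demo)]

print(len(plan))              # 12
print(plan[0][2].as_posix())  # results/fulltest_offline/isolation_forest__top_k
```

Each combination gets its own output directory, so a failed or fallback run never overwrites the artifacts of another one.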


In [None]:
from pathlib import Path

IN_CSV = 'data/sample_gaia_like_with_bp_rp.csv'
OUT_BASE = Path('results/fulltest_offline')
OUT_BASE.mkdir(parents=True, exist_ok=True)

RUNS = []
for eng in engines:
    for t in thr:
        out_dir = OUT_BASE / f'{eng}__{t}'
        meta = run_workflow(
            mode='csv',
            in_csv=IN_CSV,
            out_dir=str(out_dir),
            engine=eng,
            threshold_strategy=t,
            top_k=30,
            explain_top=10,
            knn_k=8,
            features_mode='extended',
            plots=True,
        )
        RUNS.append({'engine':eng,'threshold':t,'out_dir':str(out_dir), **meta})

runs_df = pd.DataFrame(RUNS)
runs_df


## 4) Artifact validation

We check that the expected files are present (raw/scored/top/graph/manifest, plus plots and explanations/prompts when available).


In [None]:
EXPECTED = [
  'raw.csv',
  'scored.csv',
  'top_anomalies.csv',
  'graph_full.graphml',
  'graph_topk.graphml',
  'manifest.json',
]

def check_out(out_dir: str):
    """Report which expected artifacts exist in one run directory."""
    out = Path(out_dir)
    missing = [e for e in EXPECTED if not (out/e).exists()]
    plots = sorted([p.name for p in (out/'plots').glob('*.png')]) if (out/'plots').exists() else []
    return {
        'ok': not missing,
        'missing': missing,
        'n_plots': len(plots),
        'has_cmd': (out/'plots'/'cmd_bp_rp_vs_g.png').exists(),
        'has_explanations': (out/'explanations.jsonl').exists(),
        'has_prompts': (out/'llm_prompts.jsonl').exists(),
    }

checks = []
for r in RUNS:
    c = check_out(r['out_dir'])
    checks.append({**r, **c})
checks_df = pd.DataFrame(checks)
checks_df[['engine','threshold','status','ok','n_plots','has_cmd','has_explanations','has_prompts','missing']]


## 5) Cross-run comparison

- Jaccard overlap of the `top_anomalies` sets.
- Correlation of anomaly scores over the intersection of `source_id` values.
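
Both measures are easy to check by hand on a toy case. The sketch below uses made-up ids and scores and pure-Python helpers (`jaccard`, `pearson`) written for illustration. The notebook itself computes the correlation with `np.corrcoef`.

```python
import math

def jaccard(a, b):
    """|A ∩ B| / |A ∪ B|, defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (nan if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = math.sqrt(sum((xi - mx) ** 2 for xi in x))
    sy = math.sqrt(sum((yi - my) ** 2 for yi in y))
    return cov / (sx * sy) if sx and sy else math.nan

top_a = {'101', '102', '103'}
top_b = {'102', '103', '104'}
print(jaccard(top_a, top_b))  # 0.5: two shared ids out of four distinct ids

# Scores are compared only on the source_id values present in both runs.
scores_a = {'101': 0.9, '102': 0.4, '103': 0.1, '104': 0.6}
scores_b = {'102': 0.8, '103': 0.2, '104': 0.5, '105': 0.3}
shared = sorted(scores_a.keys() & scores_b.keys())
r = pearson([scores_a[s] for s in shared], [scores_b[s] for s in shared])
print(shared, round(r, 3))
```

With very few shared ids the correlation is unstable, which is presumably why the notebook only fills a cell when at least 50 rows survive the merge.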


In [None]:
import numpy as np
import matplotlib.pyplot as plt

def load_top(out_dir: str):
    top = pd.read_csv(Path(out_dir)/'top_anomalies.csv')
    return set(top['source_id'].astype(str).tolist())

def load_scores(out_dir: str):
    df = pd.read_csv(Path(out_dir)/'scored.csv')[['source_id','anomaly_score']]
    df['source_id'] = df['source_id'].astype(str)
    return df

labels = [f"{r['engine']}|{r['threshold']}" for r in RUNS]
tops = [load_top(r['out_dir']) for r in RUNS]

J = np.zeros((len(RUNS), len(RUNS)), dtype=float)
for i in range(len(RUNS)):
    for j in range(len(RUNS)):
        inter = len(tops[i].intersection(tops[j]))
        union = len(tops[i].union(tops[j]))
        J[i,j] = inter/union if union else 0.0

plt.figure(figsize=(max(6, len(RUNS)*0.6), max(5, len(RUNS)*0.5)))
plt.imshow(J, aspect='auto')
plt.xticks(range(len(labels)), labels, rotation=90)
plt.yticks(range(len(labels)), labels)
plt.title('Jaccard overlap — top anomalies')
plt.colorbar(label='Jaccard')
plt.tight_layout(); plt.show()


In [None]:
# Score correlation (pairwise) on intersection of source_id
Corr = np.full((len(RUNS), len(RUNS)), np.nan, dtype=float)
score_dfs = [load_scores(r['out_dir']) for r in RUNS]

for i in range(len(RUNS)):
    for j in range(len(RUNS)):
        a = score_dfs[i]
        b = score_dfs[j]
        m = a.merge(b, on='source_id', suffixes=('_a','_b'))
        if len(m) >= 50:
            Corr[i,j] = np.corrcoef(m['anomaly_score_a'].to_numpy(float), m['anomaly_score_b'].to_numpy(float))[0,1]

plt.figure(figsize=(max(6, len(RUNS)*0.6), max(5, len(RUNS)*0.5)))
plt.imshow(Corr, aspect='auto', vmin=-1, vmax=1)
plt.xticks(range(len(labels)), labels, rotation=90)
plt.yticks(range(len(labels)), labels)
plt.title('Score correlation (source_id intersection)')
plt.colorbar(label='Pearson r')
plt.tight_layout(); plt.show()


## 6) Gallery of exported plots (selected run)

Pick a run with `ok=True` and the largest number of plots.


In [None]:
from IPython.display import Image, display

# pick best run = ok with max plots
best = max(checks, key=lambda x: (x['ok'], x['n_plots']))
BEST_DIR = Path(best['out_dir'])
print('BEST:', best['engine'], best['threshold'], 'status=', best['status'], 'plots=', best['n_plots'])

plots_dir = BEST_DIR/'plots'
for p in sorted(plots_dir.glob('*.png')):
    print('PLOT:', p.name)
    display(Image(filename=str(p)))


## 7) Explainability (LIME) + LLM prompts

Read `explanations.jsonl` and `llm_prompts.jsonl` if present.
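
The cell below only inspects the first record. To summarize every explanation, one option is to average the absolute LIME weight per feature across the whole file. This is a sketch under assumptions: the record layout (`lime` → `weights` → `feature`/`weight`) is inferred from the keys the notebook reads, the demo records are made up, and `mean_abs_weights` is a hypothetical helper.

```python
import io
import json
from collections import defaultdict

def mean_abs_weights(lines):
    """Average |weight| per feature over all JSONL explanation records."""
    totals, counts = defaultdict(float), defaultdict(int)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        for item in (record.get('lime') or {}).get('weights', []):
            totals[item['feature']] += abs(item['weight'])
            counts[item['feature']] += 1
    means = {f: totals[f] / counts[f] for f in totals}
    # Most influential features first.
    return sorted(means.items(), key=lambda kv: kv[1], reverse=True)

# Two made-up records in the assumed layout.
demo = io.StringIO(
    '{"source_id": "1", "lime": {"weights": ['
    '{"feature": "pmra", "weight": -0.4}, {"feature": "parallax", "weight": 0.1}]}}\n'
    '{"source_id": "2", "lime": {"weights": [{"feature": "pmra", "weight": 0.2}]}}\n'
)
ranking = mean_abs_weights(demo)
print(ranking)
```

With a real run, the same function could be fed an open file handle, for example `exp_path.open('r', encoding='utf-8')`.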


In [None]:
import json

exp_path = BEST_DIR/'explanations.jsonl'
prm_path = BEST_DIR/'llm_prompts.jsonl'
print('explanations:', exp_path.exists())
print('prompts:', prm_path.exists())

first_exp = None
if exp_path.exists():
    with exp_path.open('r', encoding='utf-8') as f:
        first_line = next(f, '').strip()  # '' when the file is empty
    if first_line:
        first_exp = json.loads(first_line)
        print('source_id:', first_exp.get('source_id'))
        print('features (sample):', (first_exp.get('features') or {}) )
        lime = (first_exp.get('lime') or {})
        weights = lime.get('weights', [])
        print('n weights:', len(weights))

if first_exp and (first_exp.get('lime') or {}).get('weights'):
    w = first_exp['lime']['weights']
    w = sorted(w, key=lambda x: abs(x.get('weight', 0.0)), reverse=True)[:12]
    labels_w = [x['feature'] for x in w]
    vals_w = [x['weight'] for x in w]
    import matplotlib.pyplot as plt
    plt.figure(figsize=(11,5))
    plt.bar(range(len(vals_w)), vals_w)
    plt.xticks(range(len(vals_w)), labels_w, rotation=45, ha='right')
    plt.title('LIME: top weights (first explained anomaly)')
    plt.tight_layout(); plt.show()

if prm_path.exists():
    with prm_path.open('r', encoding='utf-8') as f:
        first_line = next(f, '').strip()  # '' when the file is empty
    if first_line:
        obj = json.loads(first_line)
        print('Prompt for source_id:', obj.get('source_id'))
        print(obj.get('prompt','')[:1800])


## 8) Advanced graph analysis (GraphML)

- communities (Louvain if available, otherwise greedy modularity)
- k-core
- approximate betweenness
- articulation points + bridges
- correlations with anomaly_score
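
For intuition about the k-core measure: a node's core number is the largest k such that the node belongs to a subgraph in which every node has degree at least k. The notebook obtains it from `networkx.core_number`. The pure-Python sketch below only illustrates the underlying peeling idea on a toy graph and is not meant to replace the library call. The graph and the `core_numbers` helper are illustrative.

```python
def core_numbers(adj):
    """Core number per node via repeated removal of a minimum-degree node."""
    degree = {n: len(neigh) for n, neigh in adj.items()}
    remaining = set(adj)
    core, k = {}, 0
    while remaining:
        node = min(remaining, key=degree.get)
        k = max(k, degree[node])  # the peeling level never decreases
        core[node] = k
        remaining.remove(node)
        for neigh in adj[node]:
            if neigh in remaining:
                degree[neigh] -= 1
    return core

# Two triangles joined by the edge (2, 3), plus a pendant node 6 attached to 5.
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5), (5, 6)]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

core = core_numbers(adj)
print(core)  # node 6 gets 1, every other node gets 2
```

In this toy graph the edge (2, 3) is also a bridge and nodes 2, 3 and 5 are articulation points, which shows that a node can sit in a dense core and still be a structural weak point.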


In [None]:
import networkx as nx

G = nx.read_graphml(BEST_DIR/'graph_full.graphml')
print('Graph loaded:', G.number_of_nodes(), 'nodes', G.number_of_edges(), 'edges')

nodes = list(G.nodes())
deg = dict(G.degree())
clust = nx.clustering(G)
core = nx.core_number(G) if G.number_of_nodes() > 0 else {n:0 for n in nodes}

# betweenness approx
k = min(300, len(nodes))
btw = nx.betweenness_centrality(G, k=k, normalized=True, seed=42) if len(nodes) > 1 else {n:0.0 for n in nodes}

# communities
try:
    from networkx.algorithms.community import louvain_communities
    comms = louvain_communities(G, seed=42)
except Exception:
    from networkx.algorithms.community import greedy_modularity_communities
    comms = greedy_modularity_communities(G)

comm_id = {}
for i, cset in enumerate(comms):
    for n in cset:
        comm_id[n] = i

aps = set(nx.articulation_points(G)) if G.number_of_nodes() > 2 else set()
try:
    bridges = list(nx.bridges(G))
    bridge_nodes = set([a for a,b in bridges] + [b for a,b in bridges])
except Exception:
    bridge_nodes = set()

gdf = pd.DataFrame({
    'node': nodes,
    'degree': [deg.get(n,0) for n in nodes],
    'clustering': [clust.get(n,0.0) for n in nodes],
    'kcore': [core.get(n,0) for n in nodes],
    'betweenness': [btw.get(n,0.0) for n in nodes],
    'community': [comm_id.get(n,-1) for n in nodes],
    'is_articulation': [1 if n in aps else 0 for n in nodes],
    'incident_to_bridge': [1 if n in bridge_nodes else 0 for n in nodes],
})
gdf.head()


In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(9,5.5))
plt.hist(gdf['degree'], bins=50, alpha=0.85)
plt.title('Distribution degree')
plt.xlabel('degree'); plt.ylabel('count')
plt.tight_layout(); plt.show()

plt.figure(figsize=(9,5.5))
plt.hist(gdf['kcore'], bins=30, alpha=0.85)
plt.title('Distribution k-core')
plt.xlabel('kcore'); plt.ylabel('count')
plt.tight_layout(); plt.show()

plt.figure(figsize=(9,5.5))
plt.hist(gdf['betweenness'], bins=60, alpha=0.85)
plt.title('Distribution betweenness (approx)')
plt.xlabel('betweenness'); plt.ylabel('count')
plt.tight_layout(); plt.show()


In [None]:
# Correlations with anomaly_score (if mapping exists)
scored = pd.read_csv(BEST_DIR/'scored.csv')
scored['source_id'] = scored['source_id'].astype(str)

# graph nodes are often str; align by str
gdf2 = gdf.copy()
gdf2['source_id'] = gdf2['node'].astype(str)
m = scored[['source_id','anomaly_score']].merge(gdf2[['source_id','degree','kcore','betweenness','community']], on='source_id', how='inner')
print('Joined rows:', len(m))

if len(m) > 50:
    for col in ['degree','kcore','betweenness']:
        plt.figure(figsize=(7,5))
        plt.scatter(m[col].to_numpy(float), m['anomaly_score'].to_numpy(float), s=10, alpha=0.6)
        plt.title(f'anomaly_score vs {col}')
        plt.xlabel(col); plt.ylabel('anomaly_score')
        plt.tight_layout(); plt.show()


## 9) Gaia (optional)

**Disabled** by default (network/quota). To enable it, set the `RUN_GAIA` environment variable to `1`.
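
Because the check reads `os.environ` inside the running kernel, the variable has to be set in the notebook process itself, for example from a Python cell. A small sketch follows; `gaia_enabled` is a hypothetical helper that mirrors the condition used in the cell below.

```python
import os

def gaia_enabled(env=os.environ):
    """Mirror the notebook's check: Gaia runs only when RUN_GAIA is exactly '1'."""
    return env.get('RUN_GAIA', '0') == '1'

print(gaia_enabled({}))                    # False: unset means disabled
print(gaia_enabled({'RUN_GAIA': 'true'}))  # False: only the string '1' enables it

# Enable it for the current kernel before running the Gaia cell.
os.environ['RUN_GAIA'] = '1'
print(gaia_enabled())                      # True
```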


In [None]:
if os.environ.get('RUN_GAIA','0') == '1':
    out_gaia = 'results/fulltest_gaia'
    run_workflow(
        mode='gaia',
        out_dir=out_gaia,
        ra=266.4051,
        dec=-28.936175,
        radius_deg=0.3,
        limit=800,
        engine='isolation_forest',
        threshold_strategy='top_k',
        top_k=30,
        explain_top=10,
        knn_k=8,
        features_mode='extended',
        plots=True,
    )
    print('Gaia run done:', out_gaia)
else:
    print('Gaia skipped (set RUN_GAIA=1 to enable).')


## 10) Hubble (mentioned support)

If the entrypoint exposes a `hubble` mode, we test it with `data/hubble_like.csv`. Otherwise we run the same file through `csv`.


In [None]:
# Crude check: look for 'hubble' among the subcommands listed in the help text.
has_hubble = ENTRYPOINT == 'workflow.py' and 'hubble' in HELP

mode = 'hubble' if has_hubble else 'csv'
out_h = 'results/fulltest_hubble_like'
meta_h = run_workflow(
    mode=mode,
    in_csv='data/hubble_like.csv',
    out_dir=out_h,
    engine='robust_zscore',
    threshold_strategy='percentile',
    top_k=30,
    explain_top=10,
    knn_k=10,
    features_mode='extended',
    plots=True,
)
print('Hubble-like run mode:', mode)
print('status:', meta_h['status'])
print('out:', out_h)


## 11) Summary export

We write a machine-readable summary of the offline runs.

In [None]:
import json, datetime

summary_path = Path('results/fulltest_summary.json')
summary_obj = {
    'timestamp': datetime.datetime.now().isoformat(),
    'entrypoint': ENTRYPOINT,
    'runs': RUNS,
    'checks': checks,
}
summary_path.parent.mkdir(parents=True, exist_ok=True)
summary_path.write_text(json.dumps(summary_obj, indent=2), encoding='utf-8')
print('Wrote:', summary_path)
