chowfi/experiment-copilot

Experiment tracker

This repo includes a skill that runs your experiment script, reads the log, and appends one structured row per run to EXPERIMENTS.md. Each row is timestamped automatically and records the intent behind the run, what was changed, the results, a summary of those results, and next-step recommendations.
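The append step can be pictured with a small Python sketch. It is an illustration only, not the skill's implementation: `append_run` is a hypothetical helper, the column names are copied from the sample table in this README, and the logic that numbers runs by counting existing table rows is an assumption.

```python
from datetime import datetime
from pathlib import Path

# Column names copied from the sample table; everything else is hypothetical.
COLUMNS = ["Run#", "Timestamp", "Action", "Acc", "AUC", "PRF1",
           "Intent", "Summary Analysis", "AI Recommendation"]


def append_run(journal: Path, fields: dict) -> int:
    """Append one table row to the journal file and return its run number."""
    header = "| " + " | ".join(COLUMNS) + " |"
    divider = "|" + " --- |" * len(COLUMNS)

    text = journal.read_text(encoding="utf-8") if journal.exists() else ""
    if text and not text.endswith("\n"):
        text += "\n"
    if header not in text:
        # First run: start the table.
        text += header + "\n" + divider + "\n"

    # Rows already logged = table lines minus the header and divider.
    table_lines = [ln for ln in text.splitlines() if ln.startswith("|")]
    run_number = len(table_lines) - 2 + 1

    row = {**fields,
           "Run#": str(run_number),
           "Timestamp": datetime.now().strftime("%Y-%m-%d %H:%M:%S")}
    # Escape pipes and flatten newlines so each run stays on one row.
    cells = [str(row.get(col, "")).replace("|", "\\|").replace("\n", " ")
             for col in COLUMNS]
    journal.write_text(text + "| " + " | ".join(cells) + " |\n", encoding="utf-8")
    return run_number
```

Calling `append_run(Path("EXPERIMENTS.md"), {"Action": "...", "Intent": "..."})` would add the next numbered row and leave any column not supplied empty.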

The sample below is the output produced while iterating on a Kaggle-style customer churn classification.

Example output:

| Run# | Timestamp | Action | Acc | AUC | P/R/F1 | Intent | Summary Analysis | AI Recommendation |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 2026-04-12 15:23:09 | uv run python customer_churn_random_forest.py; git diff empty vs HEAD; defaults RF_PARAMS/RF_FIXED; preprocess median+StandardScaler + impute+OneHotEncoder | train: 1.000000; test: 0.497313 | train: 1.000000; test: 0.610912 | train: P=1.00 R=1.00 F1=1.00; test: P=0.75 R=0.50 F1=0.35 | Inferred: journal run via /churn-rf-experiment-tracker | Baseline; no prior row. | Mitigate test collapse (class_weight or threshold); then uncomment GridSearchCV block |
| 2 | 2026-04-12 15:29:13 | uv run python customer_churn_random_forest.py; untracked script; RF_PARAMS n_estimators=100, max_depth=6, min_samples_split=50, min_samples_leaf=50, max_features='sqrt' | train: 0.991623; test: 0.473685 | train: 0.999923; test: 0.792463 | train: P=0.99 R=0.99 F1=0.99; test: P=0.22 R=0.47 F1=0.30 | Regularize RF after Run#1 looked overfit (tighter depth/leaves, fewer trees) | Test AUC up; test Acc down; test wF1 down | Test AUC up but acc collapsed to “always churn”; try class_weight='balanced', threshold tuning, or slightly looser min_samples_* |
| 3 | 2026-04-12 15:30:39 | uv run python customer_churn_random_forest.py; git diff empty vs HEAD; RF_PARAMS n_estimators=10, max_depth=6, min_samples_split=50, min_samples_leaf=50, max_features='sqrt' | train: 0.991623; test: 0.473685 | train: 0.999662; test: 0.784007 | train: P=0.99 R=0.99 F1=0.99; test: P=0.22 R=0.47 F1=0.30 | Inferred: journal run via /churn-rf-experiment-tracker | Test AUC down; test Acc flat; test wF1 flat | Same train/test acc gap; tune threshold/class_weight or enable commented GridSearchCV |
| 4 | 2026-04-12 15:37:30 | uv run python customer_churn_random_forest.py; git diff empty vs HEAD; RF_PARAMS n_estimators=5 (was 10 in Run#3), max_depth=6, min_samples_split=50, min_samples_leaf=50, max_features='sqrt' | train: 0.991623; test: 0.473685 | train: 0.999200; test: 0.787055 | train: P=0.99 R=0.99 F1=0.99; test: P=0.22 R=0.47 F1=0.30 | Inferred: Cut n_estimators 10→5 while keeping depth/min_samples_* caps because Run#3 still showed near-perfect train AUC vs test (~0.78), aiming to trim variance/capacity. | Test AUC up slightly; Acc and wF1 flat | Test report still all-positive preds for class 0; add class_weight='balanced', tune threshold on probs, then revisit min_samples_* or GridSearchCV. |
| 5 | 2026-04-12 15:47:31 | uv run python customer_churn_random_forest.py; git diff empty vs HEAD; same RF_PARAMS as Run#4; DECISION_THRESHOLD=0.65, _predict_from_pos_proba, env CHURN_DECISION_THRESHOLD | train: 0.991743; test: 0.474679 | train: 0.999200; test: 0.787055 | train: P=0.99 R=0.99 F1=0.99; test: P=0.75 R=0.47 F1=0.31 | Applied a fixed probability cutoff (DECISION_THRESHOLD=0.65) per Run#4’s suggestion to tune thresholds on scores while test predictions stayed almost entirely positive under a 0.5-style rule. | Test wF1 and Acc up slightly; test AUC flat | Test still no predicted negatives; sweep threshold or add class_weight='balanced', then GridSearchCV |
| 6 | 2026-04-12 15:56:13 | uv run python customer_churn_random_forest.py; git diff empty vs HEAD; same RF_PARAMS/threshold path as Run#5; RF_FIXED adds class_weight='balanced' | train: 0.991639; test: 0.474415 | train: 0.999275; test: 0.763716 | train: P=0.99 R=0.99 F1=0.99; test: P=0.75 R=0.47 F1=0.31 | Added class_weight='balanced' per Run#5’s next step after test still predicted almost exclusively churn under DECISION_THRESHOLD=0.65. | Test AUC down; Acc down slightly; wF1 flat | Test AUC down vs Run#5; all-positive collapse unchanged; sweep threshold on held-out PR, then GridSearchCV |
| 7 | 2026-04-12 16:22:20 | uv run python customer_churn_random_forest.py; git diff empty vs HEAD; same RF_PARAMS/class_weight as Run#6; decision threshold 0.681818 via env (was 0.65 in Run#6 log) | train: 0.991852; test: 0.476264 | train: 0.999275; test: 0.763716 | train: P=0.99 R=0.99 F1=0.99; test: P=0.73 R=0.48 F1=0.31 | Inferred: Raised decision cutoff to ~0.682 following Run#6’s call to sweep threshold while keeping class_weight='balanced', to recover some negative (non-churn) predictions on test. | Test Acc up slightly; test AUC flat; wF1 flat | Class 0 recall still ~1%; grid-search threshold on validation F1/PR and/or enable GridSearchCV on RF + threshold. |
| 8 | 2026-04-12 19:00:32 | uv run python customer_churn_random_forest.py; git diff vs HEAD: RF_PARAMS max_depth 6→4; else as Run#7 (n_estimators=5, min_samples_*, max_features='sqrt', class_weight='balanced', threshold ~0.682) | train: 0.969499; test: 0.528738 | train: 0.999275; test: 0.671049 | train: P=0.97 R=0.97 F1=0.97; test: P=0.75 R=0.53 F1=0.42 | Tightened max_depth 6→4 to regularize further after Run#7’s still-wide train vs test AUC gap, keeping balanced weights and the fixed cutoff. | Test wF1 and Acc up; test AUC down sharply | Test AUC fell vs Run#7 (~0.76→0.67); restore or sweep max_depth, retune threshold, then RUN_GRID_SEARCH=1 on PARAM_GRID. |
| 9 | 2026-04-12 19:20:59 | uv run python customer_churn_random_forest.py; git diff vs HEAD: n_estimators 5→200, max_depth 6→4; same min_samples_*/max_features, class_weight='balanced', threshold ~0.682 | train: 0.990997; test: 0.512645 | train: 0.999901; test: 0.761396 | train: P=0.99 R=0.99 F1=0.99; test: P=0.75 R=0.51 F1=0.39 | Inferred: Scaled n_estimators to 200 while keeping shallow max_depth=4 after Run#8’s test AUC slump (~0.67), expecting ensemble variance reduction to recover ranking quality without deepening trees. | Test AUC up sharply; Acc and wF1 down | Test AUC back near Run#7 but class 0 recall ~8% and acc ~51%; sweep threshold, then RUN_GRID_SEARCH=1 on PARAM_GRID. |
| 10 | 2026-04-12 20:20:46 | uv run python customer_churn_random_forest.py; git diff vs HEAD: concat Kaggle train+test CSVs, train_test_split(..., stratify=Churn, test_size=env TEST_SIZE default 0.2); drop CustomerID; OneHotEncoder(drop="first"); RF_PARAMS n_estimators=100, max_depth=10; remove class_weight; DECISION_THRESHOLD=0.5; tune_churn_threshold.py matches concat+split | train: 0.929702; test: 0.929990 | train: 0.964699; test: 0.953140 | train: P=0.93 R=0.93 F1=0.93; test: P=0.93 R=0.93 F1=0.93 | Concatenated both CSVs and stratified re-split because the publisher train/test files misaligned churn and feature marginals, collapsing test metrics in prior runs. | Test Acc up sharply; test AUC up sharply; train-test gap much smaller | RUN_GRID_SEARCH=1 now that holdout matches train distribution; optional separate eval on raw Kaggle test for drift. |
| 11 | 2026-04-12 20:49:53 | uv run python customer_churn_random_forest.py; env CHURN_DECISION_THRESHOLD=0.146465; no RUN_GRID_SEARCH; same concat+stratified split, RF_PARAMS, preprocessor as Run#10 | train: 0.932955; test: 0.933404 | train: 0.964699; test: 0.953140 | train: P=0.94 R=0.93 F1=0.93; test: P=0.94 R=0.93 F1=0.93 | Lowered probability cutoff via env after threshold-tuning discussion, expecting better class-0 precision/recall tradeoff without refitting the forest. | Test Acc slightly up; test wF1 flat; test AUC flat | Unset CHURN_DECISION_THRESHOLD unless needed; RUN_GRID_SEARCH=1 to tune PARAM_GRID. |
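Several runs above change a decision threshold on the predicted churn probability, sometimes through the `CHURN_DECISION_THRESHOLD` environment variable, and several recommendations call for sweeping it. The sketch below shows the general idea in plain Python. It is not the code from `customer_churn_random_forest.py` or `tune_churn_threshold.py`: the function names `predict_with_threshold` and `sweep_thresholds`, the `>=` comparison, the 0.5 fallback, and the choice of accuracy as the sweep metric are all assumptions made for illustration.

```python
import os


def predict_with_threshold(pos_proba, threshold=None):
    """Label a sample 1 (churn) when its positive-class probability meets the cutoff."""
    if threshold is None:
        # Assumed behaviour: the env var overrides a 0.5 default.
        threshold = float(os.environ.get("CHURN_DECISION_THRESHOLD", "0.5"))
    return [1 if p >= threshold else 0 for p in pos_proba]


def sweep_thresholds(y_true, pos_proba, candidates):
    """Return (threshold, accuracy) for the candidate cutoff with the highest accuracy."""
    best = (None, -1.0)
    for t in candidates:
        preds = predict_with_threshold(pos_proba, t)
        acc = sum(int(p == y) for p, y in zip(preds, y_true)) / len(y_true)
        if acc > best[1]:
            best = (t, acc)
    return best


probs = [0.10, 0.40, 0.60, 0.90]
labels = [0, 0, 1, 1]
print(predict_with_threshold(probs, 0.65))               # [0, 0, 0, 1]
print(sweep_thresholds(labels, probs, [0.3, 0.5, 0.65]))  # (0.5, 1.0)
```

Because the cutoff only changes how probabilities are turned into labels, it moves accuracy and P/R/F1 but leaves AUC untouched, which matches the "test AUC flat" notes on the threshold-only runs.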

The skill lives at .cursor/skills/churn-rf-experiment-tracker/SKILL.md.

About

Experiment tracker that acts as a memory layer for both users and agents.
