Experiment tracker

This repo includes a skill that runs your experiment script, reads the log, and appends one structured row per run to EXPERIMENTS.md. Each row is dated automatically and records the intent of the run, what was changed, the results, a summary of those results, and next-step recommendations.
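
The append step described above can be sketched in a few lines of Python. This is only an illustration, not the skill's actual implementation (that is defined in SKILL.md and is not shown here). It assumes a tab-separated layout with the column names from the sample table, and the helper names `next_run_number` and `append_experiment_row` are hypothetical.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional

# Column names copied from the sample table; the tab-separated layout is an assumption.
COLUMNS = ["Run#", "Timestamp", "Action", "Acc", "AUC", "PRF1",
           "Intent", "Summary Analysis", "AI Recommendation"]


def next_run_number(path: Path) -> int:
    """Return 1 + the largest run number already in the journal (1 if none)."""
    if not path.exists():
        return 1
    runs = []
    for line in path.read_text(encoding="utf-8").splitlines():
        first_cell = line.split("\t", 1)[0].strip()
        if first_cell.isdigit():  # skips the header and any prose lines
            runs.append(int(first_cell))
    return max(runs, default=0) + 1


def append_experiment_row(path: Path, fields: dict,
                          now: Optional[datetime] = None) -> int:
    """Append one row; Run# and Timestamp are filled in automatically."""
    run = next_run_number(path)
    stamp = (now or datetime.now()).strftime("%Y-%m-%d %H:%M:%S")
    row = {**fields, "Run#": str(run), "Timestamp": stamp}
    # Tabs or newlines inside a cell would break the row, so collapse whitespace.
    cells = [" ".join(str(row.get(col, "")).split()) for col in COLUMNS]
    is_new = not path.exists() or path.stat().st_size == 0
    with path.open("a", encoding="utf-8") as fh:
        if is_new:
            fh.write("\t".join(COLUMNS) + "\n")  # write the header once
        fh.write("\t".join(cells) + "\n")
    return run
```

Calling `append_experiment_row(Path("EXPERIMENTS.md"), {"Action": "...", "Intent": "..."})` would add the next numbered row; columns missing from `fields` are left empty.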

The sample below is the output produced while iterating on a Kaggle-style customer churn classification task.

Example Output:

Run#	Timestamp	Action	Acc	AUC	PRF1	Intent	Summary Analysis	AI Recommendation
1	2026-04-12 15:23:09	`uv run python customer_churn_random_forest.py`; `git diff` empty vs HEAD; defaults `RF_PARAMS`/`RF_FIXED`; preprocess median+`StandardScaler` + impute+`OneHotEncoder`	train: 1.000000; test: 0.497313	train: 1.000000; test: 0.610912	train: P=1.00 R=1.00 F1=1.00; test: P=0.75 R=0.50 F1=0.35	Inferred: journal run via `/churn-rf-experiment-tracker`	Baseline; no prior row.	Mitigate test collapse (`class_weight` or threshold); then uncomment GridSearchCV block
2	2026-04-12 15:29:13	`uv run python customer_churn_random_forest.py`; untracked script; `RF_PARAMS` `n_estimators=100`, `max_depth=6`, `min_samples_split=50`, `min_samples_leaf=50`, `max_features='sqrt'`	train: 0.991623; test: 0.473685	train: 0.999923; test: 0.792463	train: P=0.99 R=0.99 F1=0.99; test: P=0.22 R=0.47 F1=0.30	Regularize RF after Run#1 looked overfit (tighter depth/leaves, fewer trees)	Test AUC up; test Acc down; test wF1 down	Test AUC up but acc collapsed to “always churn”; try `class_weight='balanced'`, threshold tuning, or slightly looser `min_samples_*`
3	2026-04-12 15:30:39	`uv run python customer_churn_random_forest.py`; `git diff` empty vs HEAD; `RF_PARAMS` `n_estimators=10`, `max_depth=6`, `min_samples_split=50`, `min_samples_leaf=50`, `max_features='sqrt'`	train: 0.991623; test: 0.473685	train: 0.999662; test: 0.784007	train: P=0.99 R=0.99 F1=0.99; test: P=0.22 R=0.47 F1=0.30	Inferred: journal run via `/churn-rf-experiment-tracker`	Test AUC down; test Acc flat; test wF1 flat	Same train/test acc gap; tune threshold/`class_weight` or enable commented `GridSearchCV`
4	2026-04-12 15:37:30	`uv run python customer_churn_random_forest.py`; `git diff` empty vs HEAD; `RF_PARAMS` `n_estimators=5` (was 10 in Run#3), `max_depth=6`, `min_samples_split=50`, `min_samples_leaf=50`, `max_features='sqrt'`	train: 0.991623; test: 0.473685	train: 0.999200; test: 0.787055	train: P=0.99 R=0.99 F1=0.99; test: P=0.22 R=0.47 F1=0.30	Inferred: Cut `n_estimators` 10→5 while keeping depth/`min_samples_*` caps because Run#3 still showed near-perfect train AUC vs test (~0.78), aiming to trim variance/capacity.	Test AUC up slightly; Acc and wF1 flat	Test report still all-positive preds for class 0; add `class_weight='balanced'`, tune threshold on probs, then revisit `min_samples_*` or `GridSearchCV`.
5	2026-04-12 15:47:31	`uv run python customer_churn_random_forest.py`; `git diff` empty vs HEAD; same `RF_PARAMS` as Run#4; `DECISION_THRESHOLD=0.65`, `_predict_from_pos_proba`, env `CHURN_DECISION_THRESHOLD`	train: 0.991743; test: 0.474679	train: 0.999200; test: 0.787055	train: P=0.99 R=0.99 F1=0.99; test: P=0.75 R=0.47 F1=0.31	Applied a fixed probability cutoff (`DECISION_THRESHOLD=0.65`) per Run#4’s suggestion to tune thresholds on scores while test predictions stayed almost entirely positive under a 0.5-style rule.	Test wF1 and Acc up slightly; test AUC flat	Test still no predicted negatives; sweep threshold or add `class_weight='balanced'`, then `GridSearchCV`
6	2026-04-12 15:56:13	`uv run python customer_churn_random_forest.py`; `git diff` empty vs HEAD; same `RF_PARAMS`/threshold path as Run#5; `RF_FIXED` adds `class_weight='balanced'`	train: 0.991639; test: 0.474415	train: 0.999275; test: 0.763716	train: P=0.99 R=0.99 F1=0.99; test: P=0.75 R=0.47 F1=0.31	Added `class_weight='balanced'` per Run#5’s next step after test still predicted almost exclusively churn under `DECISION_THRESHOLD=0.65`.	Test AUC down; Acc down slightly; wF1 flat	Test AUC down vs Run#5; all-positive collapse unchanged—sweep threshold on held-out PR, then `GridSearchCV`
7	2026-04-12 16:22:20	`uv run python customer_churn_random_forest.py`; `git diff` empty vs HEAD; same `RF_PARAMS`/`class_weight` as Run#6; decision threshold `0.681818` via env (was `0.65` in Run#6 log)	train: 0.991852; test: 0.476264	train: 0.999275; test: 0.763716	train: P=0.99 R=0.99 F1=0.99; test: P=0.73 R=0.48 F1=0.31	Inferred: Raised decision cutoff to ~0.682 following Run#6’s call to sweep threshold while keeping `class_weight='balanced'`, to recover some negative (non-churn) predictions on test.	Test Acc up slightly; test AUC flat; wF1 flat	Class 0 recall still ~1%; grid-search threshold on validation F1/PR and/or enable `GridSearchCV` on RF + threshold.
8	2026-04-12 19:00:32	`uv run python customer_churn_random_forest.py`; `git diff` vs HEAD: `RF_PARAMS` `max_depth` 6→4; else as Run#7 (`n_estimators=5`, `min_samples_*`, `max_features='sqrt'`, `class_weight='balanced'`, threshold ~0.682)	train: 0.969499; test: 0.528738	train: 0.999275; test: 0.671049	train: P=0.97 R=0.97 F1=0.97; test: P=0.75 R=0.53 F1=0.42	Tightened `max_depth` 6→4 to regularize further after Run#7’s still-wide train vs test AUC gap, keeping balanced weights and the fixed cutoff.	Test wF1 and Acc up; test AUC down sharply	Test AUC fell vs Run#7 (~0.76→0.67); restore or sweep `max_depth`, retune threshold, then `RUN_GRID_SEARCH=1` on `PARAM_GRID`.
9	2026-04-12 19:20:59	`uv run python customer_churn_random_forest.py`; `git diff` vs HEAD: `n_estimators` 5→200, `max_depth` 6→4; same `min_samples_*`/`max_features`, `class_weight='balanced'`, threshold ~0.682	train: 0.990997; test: 0.512645	train: 0.999901; test: 0.761396	train: P=0.99 R=0.99 F1=0.99; test: P=0.75 R=0.51 F1=0.39	Inferred: Scaled `n_estimators` to 200 while keeping shallow `max_depth=4` after Run#8’s test AUC slump (~0.67), expecting ensemble variance reduction to recover ranking quality without deepening trees.	Test AUC up sharply; Acc and wF1 down	Test AUC back near Run#7 but class 0 recall ~8% and acc ~51%—sweep threshold, then `RUN_GRID_SEARCH=1` on `PARAM_GRID`.
10	2026-04-12 20:20:46	`uv run python customer_churn_random_forest.py`; `git diff` vs HEAD: concat Kaggle train+test CSVs, `train_test_split(..., stratify=Churn, test_size=env TEST_SIZE default 0.2)`; drop `CustomerID`; `OneHotEncoder(drop="first")`; `RF_PARAMS` `n_estimators=100`, `max_depth=10`; remove `class_weight`; `DECISION_THRESHOLD=0.5`; `tune_churn_threshold.py` matches concat+split	train: 0.929702; test: 0.929990	train: 0.964699; test: 0.953140	train: P=0.93 R=0.93 F1=0.93; test: P=0.93 R=0.93 F1=0.93	Concatenated both CSVs and stratified re-split because the publisher train/test files misaligned churn and feature marginals, collapsing test metrics in prior runs.	Test Acc up sharply; test AUC up sharply; train-test gap much smaller	`RUN_GRID_SEARCH=1` now that holdout matches train distribution; optional separate eval on raw Kaggle test for drift.
11	2026-04-12 20:49:53	`uv run python customer_churn_random_forest.py`; env `CHURN_DECISION_THRESHOLD=0.146465`; no `RUN_GRID_SEARCH`; same concat+stratified split, `RF_PARAMS`, preprocessor as Run#10	train: 0.932955; test: 0.933404	train: 0.964699; test: 0.953140	train: P=0.94 R=0.93 F1=0.93; test: P=0.94 R=0.93 F1=0.93	Lowered probability cutoff via env after threshold-tuning discussion, expecting better class-0 precision/recall tradeoff without refitting the forest.	Test Acc slightly up; test wF1 flat; test AUC flat	Unset `CHURN_DECISION_THRESHOLD` unless needed; `RUN_GRID_SEARCH=1` to tune `PARAM_GRID`.
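
Runs 5, 7, and 11 in the table change only the decision threshold, set in code as `DECISION_THRESHOLD` or overridden through the `CHURN_DECISION_THRESHOLD` environment variable. The sketch below shows one way such a cutoff could be applied to positive-class (churn) probabilities. The script's own `_predict_from_pos_proba` is not shown in this README, so `resolve_threshold` and `predict_from_pos_proba` here are hypothetical stand-ins, and the `>=` comparison and the 0.5 fallback are assumptions.

```python
import os
from typing import Iterable, List, Mapping, Optional

DEFAULT_THRESHOLD = 0.5  # assumed fallback; Run#10 records DECISION_THRESHOLD=0.5


def resolve_threshold(env: Optional[Mapping[str, str]] = None) -> float:
    """Read the cutoff from CHURN_DECISION_THRESHOLD, else use the default."""
    env = os.environ if env is None else env
    raw = env.get("CHURN_DECISION_THRESHOLD")
    if raw is None or raw.strip() == "":
        return DEFAULT_THRESHOLD
    value = float(raw)
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"threshold must be in [0, 1], got {value}")
    return value


def predict_from_pos_proba(pos_proba: Iterable[float], threshold: float) -> List[int]:
    """Label a sample 1 (churn) when its churn probability reaches the cutoff."""
    # Assumption: ties go to the positive class (>=); the real script may differ.
    return [1 if p >= threshold else 0 for p in pos_proba]


# Illustrative probabilities, not taken from any run in the table.
probs = [0.10, 0.40, 0.66, 0.90]
labels_default = predict_from_pos_proba(probs, resolve_threshold({}))
labels_raised = predict_from_pos_proba(
    probs, resolve_threshold({"CHURN_DECISION_THRESHOLD": "0.681818"})
)
```

Moving the cutoff changes which samples are labeled churn, so accuracy, precision, recall, and F1 can shift. AUC is computed from the probabilities themselves, which is why the table shows it flat whenever only the threshold changes.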

The skill lives at `.cursor/skills/churn-rf-experiment-tracker/SKILL.md`.
