    print(f"[OK] Wrote: {OUT_FP}")
    print("[INFO] Performing clustering...")
    best_k, best_sil = choose_best_k(X)
    elbow_scores = compute_elbow_scores(X)
can we add a guard here so that the elbow requires at least 3 teams? k starts at 2 and k_max is currently capped to n-1, so with fewer than 3 teams the k range is empty and we get back an empty curve.
maybe something like this:
    elbow_scores = None
    if X.shape[0] >= 3:
        elbow_scores = compute_elbow_scores(X)
    else:
        print("[INFO] Skipping elbow scores (need >= 3 teams).")

    # Save elbow scores
    elbow_fp = os.path.join(data_dir, f"elbow_scores_{cluster_suffix}.csv")
    elbow_df = pd.DataFrame(elbow_scores)
    elbow_df.to_csv(elbow_fp, index=False)
if compute_elbow_scores() returns an empty k list, this line still writes an empty csv. can we check that elbow_scores["k"] is non-empty and skip writing the csv otherwise?
something like this:
    if elbow_scores and elbow_scores["k"]:
        elbow_df = pd.DataFrame(elbow_scores)
        elbow_df.to_csv(elbow_fp, index=False)
        print(f"[OK] Wrote: {elbow_fp}")
    else:
        print("[INFO] Skipping elbow CSV (no valid k range).")
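For reference, the two guards suggested in the inline comments can be sketched together using only the standard library. This is a minimal illustration, not the PR's code: `elbow_k_values`, `write_elbow_csv`, the default `k_max=10`, and the `inertia` column are all assumptions made for the example. The only details taken from the thread are that k starts at 2, that k_max is capped to n-1, and that the scores dict has a `"k"` key.

```python
import csv


def elbow_k_values(n_samples, k_min=2, k_max=10):
    """Candidate k values, with k_max capped to n_samples - 1.

    k_max=10 is an assumed default; the cap to n - 1 follows the review comment.
    """
    upper = min(k_max, n_samples - 1)
    return list(range(k_min, upper + 1))  # empty when n_samples < 3


def write_elbow_csv(elbow_scores, elbow_fp):
    """Write the CSV only when at least one k was evaluated.

    Returns True if a file was written, False if the write was skipped.
    """
    if not elbow_scores or not elbow_scores.get("k"):
        print("[INFO] Skipping elbow CSV (no valid k range).")
        return False
    columns = list(elbow_scores.keys())
    with open(elbow_fp, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(columns)
        # Turn the column lists into rows, one row per k.
        writer.writerows(zip(*(elbow_scores[c] for c in columns)))
    print(f"[OK] Wrote: {elbow_fp}")
    return True
```

With two teams, `elbow_k_values(2)` yields an empty list, which is the case both guards are meant to catch; with three teams it yields `[2]`.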
AdaraPutri left a comment
hey Manu, great job with this PR! I’ve added two inline comments around small guard / edge-case handling. one additional thought: since the elbow score is mainly used to visually inspect the curve, could you also add a simple matplotlib visualization alongside the csv?
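A rough sketch of what such a plot could look like is below. It is not the code that was eventually added to the PR: `plot_elbow`, the `inertia` key, and the output path are hypothetical, and the example assumes matplotlib is installed and that the scores dict holds parallel `k` and `inertia` lists.

```python
import os

import matplotlib

matplotlib.use("Agg")  # headless backend, so the script also runs without a display
import matplotlib.pyplot as plt


def plot_elbow(elbow_scores, out_fp):
    """Save a simple elbow curve; return False if there is nothing to plot."""
    ks = elbow_scores["k"]
    inertias = elbow_scores["inertia"]  # assumed column name
    if not ks:
        print("[INFO] Skipping elbow plot (no valid k range).")
        return False
    os.makedirs(os.path.dirname(out_fp) or ".", exist_ok=True)
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.plot(ks, inertias, marker="o")
    ax.set_xlabel("Number of clusters (k)")
    ax.set_ylabel("Inertia (within-cluster sum of squares)")
    ax.set_title("Elbow curve")
    ax.set_xticks(ks)
    fig.tight_layout()
    fig.savefig(out_fp, dpi=150)
    plt.close(fig)  # release the figure when plotting many datasets in a loop
    print(f"[OK] Wrote: {out_fp}")
    return True
```

The empty-range check mirrors the CSV guard, so the plot and the CSV are either both produced or both skipped.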
Hey, just made the changes!
AdaraPutri left a comment
thanks for adding the plot Manu, it looks good now! approved.
d2r3v left a comment
Looks Good. Ready to merge.

Summary
This PR adds elbow method analysis alongside the existing silhouette-based cluster selection to provide complementary validation metrics for K-means clustering decisions.
Changes
New Functionality:
Implementation Details:
New Outputs
Per dataset, a new CSV is generated:
CSV Schema:
Why This Matters:
The elbow method complements silhouette analysis by providing an alternative perspective on cluster quality:
Researchers can now cross-validate clustering decisions using both metrics and identify cases where the methods agree or disagree about optimal k.
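As an illustration of that cross-check, one possible way to compare the two methods programmatically is sketched below. The PR itself appears to rely on visual inspection of the curve, so the elbow heuristic here (the k with the largest second difference in inertia) is a swapped-in assumption, and `elbow_k` and `compare_methods` are hypothetical names. The heuristic also assumes consecutive, evenly spaced k values.

```python
def elbow_k(ks, inertias):
    """Pick the k where the inertia curve bends most sharply.

    Uses the discrete second difference; needs at least 3 points.
    Returns None when no interior point exists.
    """
    if len(ks) < 3:
        return None
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(ks) - 1):
        bend = inertias[i - 1] - 2 * inertias[i] + inertias[i + 1]
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k


def compare_methods(best_k_silhouette, ks, inertias):
    """Report whether the silhouette choice and the elbow heuristic agree."""
    k_elbow = elbow_k(ks, inertias)
    return {
        "silhouette_k": best_k_silhouette,
        "elbow_k": k_elbow,
        "agree": k_elbow is not None and k_elbow == best_k_silhouette,
    }


# Example with made-up inertia values: the sharpest bend is at k=3.
result = compare_methods(3, [2, 3, 4, 5, 6], [100.0, 40.0, 30.0, 25.0, 22.0])
print(result)
```

A disagreement flagged this way is only a prompt to look at the curve and the silhouette scores more closely, not a verdict on which k is correct.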