
Feature/adding elbow score #65

Merged
Mahatav merged 5 commits into dev from feature/adding_elbow_score
Mar 2, 2026

Conversation


@Mahatav Mahatav commented Feb 2, 2026

Summary

This PR adds elbow method analysis alongside the existing silhouette-based cluster selection to provide complementary validation metrics for K-means clustering decisions.

Changes

New Functionality:

  • Added a compute_elbow_scores() function that calculates the within-cluster sum of squares (inertia) for each k in the k-range
  • Integrated the elbow score computation into the main clustering pipeline
  • Added CSV export of the elbow scores for both the branching and PR datasets

Implementation Details:

  • Computes inertia for k values from 2 to 10 (matching silhouette range)
  • Reuses existing KMeans parameters (n_init=25, random_state=42) for consistency
  • Processes both datasets within the existing loop
  • Follows established naming convention: elbow_scores_{suffix}.csv
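For illustration, the behaviour described above could be sketched roughly as follows. This is not the code from the PR: the signature, the default arguments, and the synthetic demo data are assumptions, while the k range (2 to 10), n_init=25, random_state=42, the cap of k at n-1, and the returned "k" / "inertia" keys follow what the description and the review comments mention.

```python
import numpy as np
from sklearn.cluster import KMeans


def compute_elbow_scores(X, k_min=2, k_max=10, n_init=25, random_state=42):
    """Return {"k": [...], "inertia": [...]} for every k that can be fitted.

    Hypothetical signature; only the parameter values come from the PR text.
    """
    n_samples = X.shape[0]
    # Cap k at n_samples - 1 (as the review describes); this may leave no valid k.
    upper = min(k_max, n_samples - 1)
    scores = {"k": [], "inertia": []}
    for k in range(k_min, upper + 1):
        model = KMeans(n_clusters=k, n_init=n_init, random_state=random_state)
        model.fit(X)
        scores["k"].append(k)
        # inertia_ is the within-cluster sum of squared distances after fitting.
        scores["inertia"].append(float(model.inertia_))
    return scores


# Synthetic stand-in data: 12 rows ("teams") with 3 features each.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(12, 3))
demo_scores = compute_elbow_scores(X_demo)
```

Returning a dict of two equal-length lists keeps the result directly usable with pd.DataFrame(elbow_scores), which is how the pipeline snippet in the review thread consumes it.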

New Outputs

Per dataset, a new CSV is generated:

  • data/outputs/branching/elbow_scores_branching.csv
  • data/outputs/pr/elbow_scores_pr.csv

CSV Schema:

  • k - Number of clusters tested (2 to 10)
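  • inertia - Within-cluster sum of squares (lower indicates tighter clusters; it generally keeps decreasing as k grows, so read the shape of the curve rather than a single value)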
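To make the inertia column concrete, here is a small pure-Python sketch of the within-cluster sum of squares for a fixed assignment of points to clusters. The function name and the toy points are made up for illustration; in the pipeline the value would presumably come from the fitted KMeans model rather than from a helper like this.

```python
def inertia(points, labels):
    """Sum of squared distances from each point to the mean of its cluster."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    total = 0.0
    for members in clusters.values():
        dim = len(members[0])
        # Centroid = coordinate-wise mean of the cluster's members.
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        for p in members:
            total += sum((p[d] - centroid[d]) ** 2 for d in range(dim))
    return total


# Toy data: two tight pairs of points that sit far apart from each other.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
one_cluster = inertia(points, [0, 0, 0, 0])   # 104.0
two_clusters = inertia(points, [0, 0, 1, 1])  # 4.0
```

Splitting the four points into their two natural groups drops the inertia from 104.0 to 4.0, which is the kind of sharp decrease the elbow curve is meant to expose.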

Why This Matters:

The elbow method complements silhouette analysis by providing an alternative perspective on cluster quality:

  • Silhouette: Measures how well each point fits its own cluster compared with the nearest other cluster (higher = better-separated clusters)
  • Elbow: Measures compactness via inertia (look for the "elbow point" where adding more clusters starts to give diminishing reductions in inertia)

Researchers can now cross-validate clustering decisions using both metrics and identify cases where the methods agree or disagree about optimal k.
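The PR itself leaves the elbow to visual inspection. As a rough aid when comparing against the silhouette-selected k, one common heuristic is to pick the k where the curve bends most sharply, measured by the discrete second difference of the inertia values. The helper below is a hypothetical sketch of that heuristic, not part of this PR; it assumes evenly spaced k values, and the inertia numbers are invented.

```python
def elbow_by_second_difference(ks, inertias):
    """Return the k with the largest second difference in inertia, or None.

    Needs at least three points; the first and last k can never be selected.
    """
    if len(ks) != len(inertias):
        raise ValueError("ks and inertias must have the same length")
    if len(ks) < 3:
        return None

    best_k, best_bend = None, float("-inf")
    for i in range(1, len(ks) - 1):
        # Large positive values mean the curve flattens right after ks[i].
        bend = inertias[i - 1] - 2 * inertias[i] + inertias[i + 1]
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k


# Invented inertia values with a visible bend at k = 3.
demo_ks = [2, 3, 4, 5, 6]
demo_inertias = [100.0, 40.0, 30.0, 25.0, 22.0]
demo_elbow = elbow_by_second_difference(demo_ks, demo_inertias)  # 3
```

A heuristic like this is sensitive to noise in the curve, so it is best treated as a hint to check against the plot and the silhouette result rather than as a decision rule.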

@Mahatav Mahatav requested a review from AdaraPutri February 2, 2026 15:58
@Mahatav Mahatav added the enhancement New feature or request label Feb 3, 2026
@Mahatav Mahatav self-assigned this Feb 3, 2026
Comment thread process_model/clustering.py Outdated
print(f"[OK] Wrote: {OUT_FP}")
print("[INFO] Performing clustering...")
best_k, best_sil = choose_best_k(X)
elbow_scores = compute_elbow_scores(X)

can we add a guard here so that the elbow computation requires at least 3 teams? k starts at 2 and k_max is currently capped at n-1, so with fewer than 3 teams the k range is empty and we'd get an empty curve.

maybe something like this:

elbow_scores = None
if X.shape[0] >= 3:
    elbow_scores = compute_elbow_scores(X)
else:
    print("[INFO] Skipping elbow scores (need >= 3 teams).")
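The arithmetic behind the "at least 3 teams" threshold can be checked with a tiny sketch. The helper name and its defaults are hypothetical; only the rule itself (k starts at 2 and is capped at n-1, with an upper limit of 10) comes from the PR description and the comment above.

```python
def valid_k_values(n_samples, k_min=2, k_max=10):
    """List the k values that remain once k is capped at n_samples - 1."""
    return list(range(k_min, min(k_max, n_samples - 1) + 1))


two_teams = valid_k_values(2)     # [] since the cap is 1, below k_min
three_teams = valid_k_values(3)   # [2]
fifty_teams = valid_k_values(50)  # [2, 3, ..., 10]
```

With two teams the cap falls below the starting k, so the range is empty, which is why the guard uses X.shape[0] >= 3.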

Comment thread process_model/clustering.py Outdated
# Save elbow scores
elbow_fp = os.path.join(data_dir, f"elbow_scores_{cluster_suffix}.csv")
elbow_df = pd.DataFrame(elbow_scores)
elbow_df.to_csv(elbow_fp, index=False)

if compute_elbow_scores() returns an empty k list, this line still writes an empty csv. can we check that elbow_scores["k"] is non-empty, and otherwise skip writing the csv?

something like this:

if elbow_scores and elbow_scores["k"]:
    elbow_df = pd.DataFrame(elbow_scores)
    elbow_df.to_csv(elbow_fp, index=False)
    print(f"[OK] Wrote: {elbow_fp}")
else:
    print("[INFO] Skipping elbow CSV (no valid k range).")


@AdaraPutri AdaraPutri left a comment

hey Manu, great job with this PR! I’ve added two inline comments around small guard / edge-case handling. one additional thought: since the elbow score is mainly used to visually inspect the curve, could you also add a simple matplotlib visualization alongside the csv?

@Mahatav Mahatav requested a review from AdaraPutri February 8, 2026 12:03
Mahatav commented Feb 8, 2026

Hey, just made the changes!

@AdaraPutri AdaraPutri left a comment

thanks for adding the guard Manu, the edge cases are protected now. could you still add the matplotlib graphing? it would have the number of clusters as the x-axis and the inertia value as the y-axis. something that would look like this graph:

[Image: example elbow curve, with the number of clusters on the x-axis and inertia on the y-axis]
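A minimal sketch of the requested plot, assuming elbow_scores is a dict with "k" and "inertia" lists as in the snippets above. The function name, the figure size, the output path, and the demo numbers are all hypothetical, and the plot that was eventually merged may look different.

```python
import os
import tempfile

import matplotlib

matplotlib.use("Agg")  # headless backend so the script also runs without a display
import matplotlib.pyplot as plt


def plot_elbow(elbow_scores, out_fp):
    """Save a line plot of inertia against the number of clusters."""
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.plot(elbow_scores["k"], elbow_scores["inertia"], marker="o")
    ax.set_xlabel("Number of clusters (k)")
    ax.set_ylabel("Inertia (within-cluster sum of squares)")
    ax.set_title("Elbow curve")
    ax.set_xticks(elbow_scores["k"])
    fig.tight_layout()
    fig.savefig(out_fp)
    plt.close(fig)  # release the figure so repeated calls do not accumulate memory
    return out_fp


# Invented values, written to a temporary directory for the demo.
demo = {"k": [2, 3, 4, 5, 6], "inertia": [100.0, 40.0, 30.0, 25.0, 22.0]}
demo_fp = plot_elbow(demo, os.path.join(tempfile.mkdtemp(), "elbow_plot_demo.png"))
```

Calling a function like this inside the same non-empty check that guards the CSV export would keep the plot and the CSV in step for small datasets.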

@Mahatav Mahatav requested a review from AdaraPutri February 15, 2026 17:03
@AdaraPutri AdaraPutri left a comment

thanks for adding the plot Manu, it looks good now! approved.

@d2r3v d2r3v left a comment

Looks Good. Ready to merge.

@Mahatav Mahatav merged commit 62eee9c into dev Mar 2, 2026