Conversation
…runs everything from extraction to analysis to graphs. I also created a file that runs only the analysis and graphing scripts, since sometimes we don't need to re-gather data.
… error handling in main execution flow
hey Manu, thanks for working on this PR. it's a big task so I appreciate your work!
I noticed that the readme file is quite long (600+ lines). there are some details that we don't need for setup. consider modifying these sections:
- remove "Key Design Decisions" (149 - 165)
- remove "Utility Modules" (437 - 490)
- remove any mentions of the process model variables "FILE_SOURCE" and "FOLDER_SOURCE" as they are no longer required (413 - 433)
- remove line 84: "That's it! No FILE_SOURCE or FOLDER_SOURCE needed — the pipeline handles both datasets automatically." --> new users don't need to know that
- remove "Understanding the Outputs" and "Interpreting Clusters" (588 - 627) --> or move them to a separate documentation file if needed
- remove section "Quick Reference" (317 - 391) --> instructions are redundant with setup section
- move section "LLM Setup and AI_MODE Toggle" to be under step 4. Configure Environment
- add a step 5 for running main.py, then move section "The Modular Way: Run Steps Individually" right under it
as I have to head out now, I will continue testing this branch manually later today, comparing the produced CSVs and graphs with the older versions to see if they are as expected. I will add more comments if further changes are needed
    print(f"  ⚠️ Clustering error: {e}\n")

    # Run process model analysis (BOTH datasets automatically)
    print("\nStep 3: Process Model Analysis (Both Datasets)")
not sure why Step 3 is done again here. please remove the re-calls to run_transition_edges(), run_zscore(), and run_clustering(), as it duplicates the process
    try:
        print("  • Generating graphs...")
        run_graphing()
move this up; run_graphing() was not yet called in the previous Step 3
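To make the requested flow concrete, here is a hedged sketch of the single-pass Step 3 with graphing moved up to follow it. Only the run_* names come from the diff above; their bodies here are placeholder stubs.

```python
# Placeholder stubs; only the names come from the diff above.
def run_transition_edges(): print("  • Transition edges")
def run_zscore():           print("  • Z-scores")
def run_clustering():       print("  • Clustering")
def run_graphing():         print("  • Generating graphs...")

def main() -> list:
    """Run Step 3 exactly once, then graph its outputs."""
    order = []
    print("\nStep 3: Process Model Analysis (Both Datasets)")
    for step in (run_transition_edges, run_zscore, run_clustering):
        step()
        order.append(step.__name__)
    print("\nStep 4: Graphing")
    run_graphing()  # moved up: runs right after the single Step 3
    order.append("run_graphing")
    return order
```

This way the analysis steps execute once per run and graphing always sees their freshest outputs.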
there is a typo in the file name, but I also think we should rename it to something like process_model_only.py, as we already have an existing analysis.py file that does very different things from this script (one might be mistaken for the other)
    IN_FP = os.path.join(DATA_DIR, "team_transition_edges_avg_session_zscores.csv")
    OUT_FP = os.path.join(DATA_DIR, f"behavior_clusters_{CLUSTER_SUFFIX}.csv")
    # Load .env
    load_dotenv(dotenv_path=Path(__file__).parent.parent / '.env')
remove this line, we no longer read from .env for any process model steps
    OUT_CLUSTERS_DIR = os.path.join(PR_OUT_DIR, "clusters")

    # Load .env
    load_dotenv(dotenv_path=Path(__file__).parent.parent / '.env')
    "example": "CLEAN_year-long-project-team-7_labels_branching_and_structure.csv",
    "output_folder": os.path.join(ROOT, "data", "outputs", "branching")
    # Load .env (for other config if needed)
    load_dotenv(dotenv_path=Path(__file__).parent.parent / '.env')
    script_path = Path(__file__).resolve()
    print(f"[DEBUG] Script location: {script_path}")
    # Load .env
    load_dotenv(dotenv_path=Path(__file__).parent.parent / '.env')
    "example": "CLEAN_year-long-project-team-7_labels_branching_and_structure.csv",
    "output_folder": os.path.join(ROOT, "data", "outputs", "branching")
    },
    "pr_labels": {
can we change this to pr so it matches all the other config keys in the rest of the process model steps? it outputs to the pr folder anyway. but lmk if there's a reason to keep this as pr_labels!
    except Exception as e:
        print(f"  Transition edges error: {e}\n")

    try:
can you add a conditional statement/guard here that only runs run_zscore() and run_clustering() if data for at least 3 teams is available? otherwise clustering would be inaccurate/useless
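A minimal sketch of such a guard, assuming the distinct team count can be derived from the loaded data. MIN_TEAMS and the helper name are illustrative; only the run_* names come from the diff (their bodies here are stubs).

```python
MIN_TEAMS = 3  # threshold from this comment; name is illustrative

# Placeholder stubs for the pipeline's functions (names from the diff)
def run_zscore():     print("  • Z-scores")
def run_clustering(): print("  • Clustering")

def maybe_run_cross_team_steps(team_ids) -> bool:
    """Run z-score and clustering only when >= MIN_TEAMS distinct teams exist."""
    n_teams = len(set(team_ids))
    if n_teams < MIN_TEAMS:
        print(f"  [SKIP] Only {n_teams} team(s); "
              f"need >= {MIN_TEAMS} for meaningful clustering")
        return False
    run_zscore()
    run_clustering()
    return True
```

Counting distinct team IDs (rather than rows) avoids one team with many sessions slipping past the threshold.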
    except Exception as e:
        print(f"  ⚠️ Transition edges error: {e}\n")

    try:
please also add a guard here requiring data for at least 3 teams before running run_zscore() and run_clustering()
hey Manu, the output CSVs and graphs from running …
…treamline pipeline execution, and improve error handling for team analysis
Hey, I worked on the recommended changes. I have left the comments up so you can still check whether I completed them.
…raphing modules; update documentation and tests for clarity and accuracy
AdaraPutri
left a comment
awesome job Manu! the refactoring makes the codebase more robust and the readme a lot cleaner. i think this pr is ready to merge 🙌
d2r3v
left a comment
Seems to work well based on testing. Ready to merge.
aliyahnurdafika
left a comment
Overall, this PR update was great. Good work! I left a few comments/notes about what I learned from it. Thank you!
    - def _pick_timestamp(row: pd.Series) -> str | None:
    + def _pick_timestamp(row: pd.Series) -> Optional[str]:
Nice update for using Optional[str] instead of str | None, since that syntax is more universal (the | union syntax in annotations requires Python 3.10+)
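For context, a minimal illustration of why this matters on Python 3.9. The function here is a simplified stand-in, not the real _pick_timestamp (which takes a pd.Series).

```python
from typing import Optional

# On Python 3.9, an annotation like "str | None" is evaluated at function
# definition time and raises TypeError (the | operator on types arrived in
# 3.10, per PEP 604). typing.Optional[str] expresses the same type and
# works on 3.7+.
def first_nonempty(values: list) -> Optional[str]:
    """Return the first non-empty string, or None (simplified stand-in)."""
    for v in values:
        if isinstance(v, str) and v:
            return v
    return None
```

Alternatively, `from __future__ import annotations` at the top of the file also makes `str | None` safe on 3.9, since annotations then stay unevaluated strings.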
    CONFIGS = {
        "branching": {
            "output_folder": os.path.join(ROOT, "data", "outputs", "branching"),
            "cluster_suffix": "branching"
        },
        "pr": {
            "output_folder": os.path.join(ROOT, "data", "outputs", "pr"),
            "cluster_suffix": "pr"
        }
    }
Nice refactor moving from conditional logic into configuration; it makes it easier to add new dataset types.
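As a small illustration of the pattern: the CONFIGS entries mirror the diff above, while the path-deriving helper and the ROOT value are hypothetical. Adding a new dataset becomes a one-entry change with no new branches.

```python
import os

ROOT = "."  # stand-in for the project root used in the diff

# Mirrors the CONFIGS mapping from the diff above
CONFIGS = {
    "branching": {
        "output_folder": os.path.join(ROOT, "data", "outputs", "branching"),
        "cluster_suffix": "branching",
    },
    "pr": {
        "output_folder": os.path.join(ROOT, "data", "outputs", "pr"),
        "cluster_suffix": "pr",
    },
}

def cluster_output_paths() -> dict:
    """Hypothetical helper: derive one cluster-output path per dataset,
    driven entirely by the configuration mapping."""
    return {
        name: os.path.join(cfg["output_folder"],
                           f"behavior_clusters_{cfg['cluster_suffix']}.csv")
        for name, cfg in CONFIGS.items()
    }
```

Any loop written over `CONFIGS.items()` automatically picks up a third dataset the moment its entry is added.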
    - def _pick_timestamp(row: pd.Series) -> str | None:
    + def _pick_timestamp(row: pd.Series) -> Optional[str]:
Nice update for using Optional[T] instead of T | None, since that syntax is more universal
    if not os.path.exists(in_fp):
        print(f"[SKIP] Missing input: {in_fp}")
        print(f"  Run zscore_calculation.py first")
        continue
Nice refactor to keep the pipeline running for other datasets instead of stopping completely.
    matrix_df.to_csv(matrix_out_fp)
    print(f"[OK] Wrote transition matrix: {matrix_out_fp}")

    if X.shape[0] < 2:
Nice implementation of a guard clause that handles edge cases: the if condition on very small datasets prevents clustering failures when too few teams remain after filtering.
Summary
Removed the toggle system for file sourcing (environment variables FOLDER_SOURCE and FILE_SOURCE) and created unified main files that automatically process both branching and PR datasets. Added main.py for full pipeline execution (extraction + analysis) and main_Only_Analysis.py for analysis-only workflows. Fixed Python 3.9 compatibility issue in LLM integration by making newly added parameters optional.
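The Python 3.9 compatibility fix mentioned above can be sketched as follows. The function and parameter names here are hypothetical; the point is that newly added parameters receive defaults, so pre-existing call sites that pass only the original arguments keep working unchanged.

```python
from typing import Optional

# Hypothetical LLM-call signature: the new "model" and "retries" parameters
# get defaults, so older callers that pass only "prompt" still work.
def query_llm(prompt: str,
              model: Optional[str] = None,
              retries: int = 1) -> str:
    """Stand-in for the LLM call; returns a summary of the effective config."""
    effective_model = model if model is not None else "default-model"
    return f"{effective_model}|retries={retries}|{prompt}"
```

Combined with `Optional[...]` annotations instead of `|` unions, this keeps the module importable and backward compatible on Python 3.9.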
What's included?
What's not included?
Current status
Ready for review. All existing functionality is preserved with backward compatibility. Both datasets now process automatically in a single run without manual configuration.
Testing
Closes #43