Conversation
…runs everything from extraction to analysis to graphs. I also created a file that runs only the analysis and graphing scripts, since sometimes we don't need to re-gather data.
… error handling in main execution flow
hey Manu, thanks for working on this PR. it's a big task so I appreciate your work!
I noticed that the readme file is quite long (600+ lines). there are some details that we don't need for setup. consider modifying these sections:
- remove "Key Design Decisions" (149 - 165)
- remove "Utility Modules" (437 - 490)
- remove any mentions of the process model variables "FILE_SOURCE" and "FOLDER_SOURCE" as they are no longer required (413 - 433)
- remove line 84: "That's it! No FILE_SOURCE or FOLDER_SOURCE needed — the pipeline handles both datasets automatically." --> new users don't need to know that
- remove "Understanding the Outputs" and "Interpreting Clusters" (588 - 627) --> or move them to a separate documentation file if needed
- remove section "Quick Reference" (317 - 391) --> instructions are redundant with setup section
- move section "LLM Setup and AI_MODE Toggle" to be under step 4. Configure Environment
- add a step 5 for running main.py, then move section "The Modular Way: Run Steps Individually" right under it
as I have to head out now, I will continue testing this branch manually later today, comparing the produced CSVs and graphs with the older versions to see if they are as expected. I will add more comments if further changes are needed
    print(f"  ⚠️ Clustering error: {e}\n")

    # Run process model analysis (BOTH datasets automatically)
    print("\nStep 3: Process Model Analysis (Both Datasets)")
not sure why Step 3 is done again here. please remove the re-calls to run_transition_edges(), run_zscore(), and run_clustering(), as it duplicates the process
    try:
        print("  • Generating graphs...")
        run_graphing()
move this up; run_graphing() was not yet called in the previous Step 3
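To make the requested flow concrete, here is a hedged sketch of the single-pass Step 3 with graphing moved up to follow it. Only the run_* names come from the diff above; their bodies here are placeholder stubs.

```python
# Placeholder stubs; only the names come from the diff above.
def run_transition_edges(): print("  • Transition edges")
def run_zscore():           print("  • Z-scores")
def run_clustering():       print("  • Clustering")
def run_graphing():         print("  • Generating graphs...")

def main() -> list:
    """Run Step 3 exactly once, then graph its outputs."""
    order = []
    print("\nStep 3: Process Model Analysis (Both Datasets)")
    for step in (run_transition_edges, run_zscore, run_clustering):
        step()
        order.append(step.__name__)
    print("\nStep 4: Graphing")
    run_graphing()  # moved up: runs right after the single Step 3
    order.append("run_graphing")
    return order
```

This way the analysis steps execute once per run and graphing always sees their freshest outputs.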
there is a typo in the file name, but I also think we should rename it to something like process_model_only.py, as we already have an existing analysis.py file that does very different things from this script (one might be mistaken for the other)
    IN_FP = os.path.join(DATA_DIR, "team_transition_edges_avg_session_zscores.csv")
    OUT_FP = os.path.join(DATA_DIR, f"behavior_clusters_{CLUSTER_SUFFIX}.csv")
    # Load .env
    load_dotenv(dotenv_path=Path(__file__).parent.parent / '.env')
remove this line, we no longer read from .env for any process model steps
    OUT_CLUSTERS_DIR = os.path.join(PR_OUT_DIR, "clusters")

    # Load .env
    load_dotenv(dotenv_path=Path(__file__).parent.parent / '.env')
    "example": "CLEAN_year-long-project-team-7_labels_branching_and_structure.csv",
    "output_folder": os.path.join(ROOT, "data", "outputs", "branching")
    # Load .env (for other config if needed)
    load_dotenv(dotenv_path=Path(__file__).parent.parent / '.env')
    script_path = Path(__file__).resolve()
    print(f"[DEBUG] Script location: {script_path}")
    # Load .env
    load_dotenv(dotenv_path=Path(__file__).parent.parent / '.env')
    "example": "CLEAN_year-long-project-team-7_labels_branching_and_structure.csv",
    "output_folder": os.path.join(ROOT, "data", "outputs", "branching")
    },
    "pr_labels": {
can we change this to pr so it matches all the other config keys in the rest of the process model steps? it outputs to the pr folder anyway. but lmk if there's a reason to keep this as pr_labels!
    except Exception as e:
        print(f"  Transition edges error: {e}\n")

    try:
can you add a conditional statement/guard here that only runs run_zscore() and run_clustering() if data for at least 3 teams is available? otherwise clustering would be inaccurate/useless
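A minimal sketch of such a guard, assuming the distinct team count can be derived from the loaded data. MIN_TEAMS and the helper name are illustrative; only the run_* names come from the diff (their bodies here are stubs).

```python
MIN_TEAMS = 3  # threshold from this comment; name is illustrative

# Placeholder stubs for the pipeline's functions (names from the diff)
def run_zscore():     print("  • Z-scores")
def run_clustering(): print("  • Clustering")

def maybe_run_cross_team_steps(team_ids) -> bool:
    """Run z-score and clustering only when >= MIN_TEAMS distinct teams exist."""
    n_teams = len(set(team_ids))
    if n_teams < MIN_TEAMS:
        print(f"  [SKIP] Only {n_teams} team(s); "
              f"need >= {MIN_TEAMS} for meaningful clustering")
        return False
    run_zscore()
    run_clustering()
    return True
```

Counting distinct team IDs (rather than rows) avoids one team with many sessions slipping past the threshold.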
    except Exception as e:
        print(f"  ⚠️ Transition edges error: {e}\n")

    try:
please also add a guard here requiring data for at least 3 teams before running run_zscore() and run_clustering()
hey Manu, the output CSVs and graphs from running …
…treamline pipeline execution, and improve error handling for team analysis
Hey, I worked on the recommended changes. I have left the comments up so you can still check whether I completed them.
…raphing modules; update documentation and tests for clarity and accuracy
AdaraPutri
left a comment
awesome job Manu! the refactoring makes the codebase more robust and the readme a lot cleaner. i think this pr is ready to merge 🙌
d2r3v
left a comment
Seems to work well based on testing. Ready to merge.
aliyahnurdafika
left a comment
Overall, this PR update was great. Good work! I left a few comments/notes about what I learned from it. Thank you!
    - def _pick_timestamp(row: pd.Series) -> str | None:
    + def _pick_timestamp(row: pd.Series) -> Optional[str]:
Nice update for using Optional[str] instead of str | None, since that syntax is more universal (the | union syntax in annotations requires Python 3.10+)
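For context, a minimal illustration of why this matters on Python 3.9. The function here is a simplified stand-in, not the real _pick_timestamp (which takes a pd.Series).

```python
from typing import Optional

# On Python 3.9, an annotation like "str | None" is evaluated at function
# definition time and raises TypeError (the | operator on types arrived in
# 3.10, per PEP 604). typing.Optional[str] expresses the same type and
# works on 3.7+.
def first_nonempty(values: list) -> Optional[str]:
    """Return the first non-empty string, or None (simplified stand-in)."""
    for v in values:
        if isinstance(v, str) and v:
            return v
    return None
```

Alternatively, `from __future__ import annotations` at the top of the file also makes `str | None` safe on 3.9, since annotations then stay unevaluated strings.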
    CONFIGS = {
        "branching": {
            "output_folder": os.path.join(ROOT, "data", "outputs", "branching"),
            "cluster_suffix": "branching"
        },
        "pr": {
            "output_folder": os.path.join(ROOT, "data", "outputs", "pr"),
            "cluster_suffix": "pr"
        }
    }
Nice refactor moving from conditional logic into configuration; it makes it easier to add new dataset types.
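As a small illustration of the pattern: the CONFIGS entries mirror the diff above, while the path-deriving helper and the ROOT value are hypothetical. Adding a new dataset becomes a one-entry change with no new branches.

```python
import os

ROOT = "."  # stand-in for the project root used in the diff

# Mirrors the CONFIGS mapping from the diff above
CONFIGS = {
    "branching": {
        "output_folder": os.path.join(ROOT, "data", "outputs", "branching"),
        "cluster_suffix": "branching",
    },
    "pr": {
        "output_folder": os.path.join(ROOT, "data", "outputs", "pr"),
        "cluster_suffix": "pr",
    },
}

def cluster_output_paths() -> dict:
    """Hypothetical helper: derive one cluster-output path per dataset,
    driven entirely by the configuration mapping."""
    return {
        name: os.path.join(cfg["output_folder"],
                           f"behavior_clusters_{cfg['cluster_suffix']}.csv")
        for name, cfg in CONFIGS.items()
    }
```

Any loop written over `CONFIGS.items()` automatically picks up a third dataset the moment its entry is added.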
    - def _pick_timestamp(row: pd.Series) -> str | None:
    + def _pick_timestamp(row: pd.Series) -> Optional[str]:
Nice update for using Optional[T] instead of T | None, since that syntax is more universal
    if not os.path.exists(in_fp):
        print(f"[SKIP] Missing input: {in_fp}")
        print(f"  Run zscore_calculation.py first")
        continue
Nice refactor to keep the pipeline running for other datasets instead of stopping completely.
    matrix_df.to_csv(matrix_out_fp)
    print(f"[OK] Wrote transition matrix: {matrix_out_fp}")

    if X.shape[0] < 2:
Nice implementation of a guard clause that handles edge cases: the if condition on very small datasets prevents clustering failures when too few teams remain after filtering.
Summary
Removed the toggle system for file sourcing (environment variables FOLDER_SOURCE and FILE_SOURCE) and created unified main files that automatically process both branching and PR datasets. Added main.py for full pipeline execution (extraction + analysis) and main_Only_Analysis.py for analysis-only workflows. Fixed Python 3.9 compatibility issue in LLM integration by making newly added parameters optional.
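The Python 3.9 compatibility fix mentioned above can be sketched as follows. The function and parameter names here are hypothetical; the point is that newly added parameters receive defaults, so pre-existing call sites that pass only the original arguments keep working unchanged.

```python
from typing import Optional

# Hypothetical LLM-call signature: the new "model" and "retries" parameters
# get defaults, so older callers that pass only "prompt" still work.
def query_llm(prompt: str,
              model: Optional[str] = None,
              retries: int = 1) -> str:
    """Stand-in for the LLM call; returns a summary of the effective config."""
    effective_model = model if model is not None else "default-model"
    return f"{effective_model}|retries={retries}|{prompt}"
```

Combined with `Optional[...]` annotations instead of `|` unions, this keeps the module importable and backward compatible on Python 3.9.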
What's included?
What's not included?
Current status
Ready for review. All existing functionality is preserved with backward compatibility. Both datasets now process automatically in a single run without manual configuration.
Testing
Closes #43