Feature/adding main run it all file #55

Merged
Mahatav merged 7 commits into dev from feature/adding-main-run-it-all-file on Feb 10, 2026

Conversation

Collaborator

@Mahatav Mahatav commented Jan 26, 2026

Summary

Removed the toggle system for file sourcing (environment variables FOLDER_SOURCE and FILE_SOURCE) and created unified main files that automatically process both branching and PR datasets. Added main.py for full pipeline execution (extraction + analysis) and main_Only_Analysis.py for analysis-only workflows. Fixed Python 3.9 compatibility issue in LLM integration by making newly added parameters optional.

What's included?

  • main.py: Complete pipeline runner (extraction → labelling → process model analysis)
    • Processes all 22 teams sequentially
    • Automatic dual-dataset processing (branching + PR)
    • Comprehensive error handling with progress indicators
  • main_Only_Analysis.py: Analysis-only pipeline (skips extraction)
    • Useful for re-running analysis on existing data
  • Dual-dataset automation in process model scripts:
    • transition_edges.py
    • zscore_calculation.py
    • clustering.py
    • graphing.py
  • Modular extraction: Refactored app.py to export run_batch_extraction() for reuse
  • Python 3.9 compatibility: Made optional LLM parameters backward-compatible
  • Module structure: Added scripts/__init__.py
  • Typo fix: Renamed clean_lable.py → clean_lables.py
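The Python 3.9 fix mentioned above follows a standard pattern; a minimal sketch, with hypothetical function and parameter names (not the PR's actual signature): on 3.9, `str | None` in an annotation raises a TypeError at definition time unless `from __future__ import annotations` is used, so `typing.Optional` plus a `None` default keeps newly added parameters backward-compatible.

```python
from typing import Optional

def query_llm(prompt: str,
              model: Optional[str] = None,
              temperature: Optional[float] = None) -> str:
    """Hypothetical LLM call: the newly added parameters are optional so that
    existing call sites (which pass only `prompt`) keep working unchanged."""
    model = model or "default-model"                    # fall back when omitted
    temperature = 0.0 if temperature is None else temperature
    return f"{model}@{temperature}: {prompt}"

# Old call sites still work:
old_style = query_llm("hello")
# New call sites can opt in to the added parameters:
new_style = query_llm("hello", model="some-model", temperature=0.7)
```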

What's not included?

Current status

Ready for review. All existing functionality is preserved with backward compatibility. Both datasets now process automatically in a single run without manual configuration.

Testing

  • Verified main.py successfully runs the full pipeline for Team 15
  • Verified dual-dataset processing generates outputs in both data/outputs/branching/ and data/outputs/pr/
  • Verified main_Only_Analysis.py skips extraction and processes existing data
  • Verified Python 3.9 compatibility with optional LLM parameters
  • Verified backward compatibility: existing workflows unchanged
  • Confirmed error handling continues processing when one dataset fails
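The continue-on-failure behaviour verified above could look roughly like this (a sketch; `process_dataset` is a hypothetical stand-in — the PR wraps each real pipeline step in its own try/except):

```python
# Per-dataset error isolation: a failure while processing one dataset is
# reported but does not abort processing of the other dataset.
def process_dataset(name: str) -> str:
    if name == "pr":
        raise ValueError("simulated failure")  # stand-in for a real error
    return f"processed {name}"

results = {}
for dataset in ("branching", "pr"):
    try:
        results[dataset] = process_dataset(dataset)
    except Exception as e:
        print(f"  [WARN] {dataset} failed: {e}")
        results[dataset] = None  # record the failure and keep going
```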

Closes #43

…runs everything from extraction to analysis to graphs. I also created a file that just runs the analysis and graphing scripts, given that sometimes we don't need to re-gather data.
@Mahatav Mahatav requested review from AdaraPutri and d2r3v January 26, 2026 18:27
@Mahatav Mahatav added the enhancement New feature or request label Jan 26, 2026
@Mahatav Mahatav self-assigned this Feb 3, 2026
Collaborator

@AdaraPutri AdaraPutri left a comment

hey Manu, thanks for working on this PR. it's a big task so I appreciate your work!

I noticed that the readme file is quite long (600+ lines). there are some details that we don't need for setup. consider modifying these sections:

  • remove "Key Design Decisions" (149 - 165)
  • remove "Utility Modules" (437 - 490)
  • remove any mentions of the process model variables "FILE_SOURCE" and "FOLDER_SOURCE" as they are no longer required (413 - 433)
  • remove line 84: "That's it! No FILE_SOURCE or FOLDER_SOURCE needed — the pipeline handles both datasets automatically." --> new users don't need to know that
  • remove "Understanding the Outputs" and "Interpreting Clusters" (588 - 627) --> or can move to another documentation doc if needed
  • remove section "Quick Reference" (317 - 391) --> instructions are redundant with setup section
  • move section "LLM Setup and AI_MODE Toggle" to be under step 4. Configure Environment
  • add step 5. which is running main.py, then move section "The Modular Way: Run Steps Individually" right under it

as I have to head out now, I will continue testing this branch manually later today and comparing the produced csvs and graphs with the older versions to see if they are as expected. I will add more comments if there are further changes to be made

Comment thread main_Only_Anylysis.py Outdated
print(f" ⚠️ Clustering error: {e}\n")

# Run process model analysis (BOTH datasets automatically)
print("\nStep 3: Process Model Analysis (Both Datasets)")
Collaborator

@AdaraPutri AdaraPutri Feb 6, 2026

not sure why Step 3 is done again here. please remove the re-calling of run_transition_edges(), run_zscore(), and run_clustering(), as it duplicates the process

Comment thread process_model_only.py

try:
print(" • Generating graphs...")
run_graphing()
Collaborator

move this up, run_graphing() was not called yet in the previous Step 3

Comment thread process_model_only.py
Collaborator

there is a typo in the file name, but also I think we should rename it to something like process_model_only.py instead, since we already have an existing analysis.py file which does very different things than this script (one might be mistaken for the other)

Comment thread process_model/clustering.py Outdated
IN_FP = os.path.join(DATA_DIR, "team_transition_edges_avg_session_zscores.csv")
OUT_FP = os.path.join(DATA_DIR, f"behavior_clusters_{CLUSTER_SUFFIX}.csv")
# Load .env
load_dotenv(dotenv_path=Path(__file__).parent.parent / '.env')
Collaborator

remove this line, we no longer read from .env for any process model steps

Comment thread process_model/graphing.py Outdated
OUT_CLUSTERS_DIR = os.path.join(PR_OUT_DIR, "clusters")

# Load .env
load_dotenv(dotenv_path=Path(__file__).parent.parent / '.env')
Collaborator

remove

Comment thread process_model/transition_edges.py Outdated
"example": "CLEAN_year-long-project-team-7_labels_branching_and_structure.csv",
"output_folder": os.path.join(ROOT, "data", "outputs", "branching")
# Load .env (for other config if needed)
load_dotenv(dotenv_path=Path(__file__).parent.parent / '.env')
Collaborator

remove

Comment thread process_model/zscore_calculation.py Outdated
script_path = Path(__file__).resolve()
print(f"[DEBUG] Script location: {script_path}")
# Load .env
load_dotenv(dotenv_path=Path(__file__).parent.parent / '.env')
Collaborator

remove

Comment thread process_model/transition_edges.py Outdated
"example": "CLEAN_year-long-project-team-7_labels_branching_and_structure.csv",
"output_folder": os.path.join(ROOT, "data", "outputs", "branching")
},
"pr_labels": {
Collaborator

@AdaraPutri AdaraPutri Feb 6, 2026

can we change this to pr so it matches all the other config keys in the rest of the process model steps? also since it outputs to the pr folder anyway. but lmk if there's a reason to keep this as pr_labels!

Comment thread main.py Outdated
except Exception as e:
print(f" Transition edges error: {e}\n")

try:
Collaborator

can you add a conditional/guard here that only runs run_zscore() and run_clustering() if data for at least 3 teams is available? otherwise clustering would be inaccurate/useless
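A guard along the requested lines might look like this (a sketch only: `run_zscore`/`run_clustering` here are stand-ins for the real functions in the process model scripts, and the per-team CSV naming is assumed):

```python
import os

MIN_TEAMS = 3  # below this, z-scores and clusters are not meaningful

def run_zscore():       # stand-in: the real function lives in zscore_calculation.py
    print("running z-score calculation")

def run_clustering():   # stand-in: the real function lives in clustering.py
    print("running clustering")

def count_team_files(folder: str) -> int:
    """Count per-team CSVs in an output folder (file-naming pattern assumed)."""
    if not os.path.isdir(folder):
        return 0
    return sum(1 for f in os.listdir(folder) if f.endswith(".csv"))

def maybe_run_stats(folder: str) -> bool:
    """Run z-score + clustering only when enough team data is present."""
    n = count_team_files(folder)
    if n < MIN_TEAMS:
        print(f"[SKIP] only {n} team file(s) in {folder}; need {MIN_TEAMS}")
        return False
    run_zscore()
    run_clustering()
    return True
```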

Comment thread main_Only_Anylysis.py Outdated
except Exception as e:
print(f" ⚠️ Transition edges error: {e}\n")

try:
Collaborator

@AdaraPutri AdaraPutri Feb 6, 2026

please also add a guard here for a minimum of 3 teams' data before running run_zscore() and run_clustering()

@AdaraPutri
Collaborator

AdaraPutri commented Feb 6, 2026

hey Manu, the output csvs and graphs from running main.py produce the same results as running the current flow on dev which is awesome! I left some inline comments (sorry there's a couple of them 🥲) but lmk if you disagree with some of the changes. also could you modify the .env.example to not include FILE_SOURCE and FOLDER_SOURCE? thanks!

…treamline pipeline execution, and improve error handling for team analysis
@Mahatav Mahatav requested a review from AdaraPutri February 8, 2026 12:14
@Mahatav
Collaborator Author

Mahatav commented Feb 8, 2026

Hey, I worked on the recommended changes. I left the comments unresolved so you can still check whether I completed them.

Collaborator

@AdaraPutri AdaraPutri left a comment

awesome job Manu! the refactoring makes the codebase more robust and the readme a lot cleaner. i think this pr is ready to merge 🙌

Collaborator

@d2r3v d2r3v left a comment

Seems to work well based on testing. Ready to merge.

@Mahatav Mahatav merged commit ed42e07 into dev Feb 10, 2026
Collaborator

@aliyahnurdafika aliyahnurdafika left a comment

Overall, this PR update was great. Good work! I left a few comments/notes about what I learned from it. Thank you!



- def _pick_timestamp(row: pd.Series) -> str | None:
+ def _pick_timestamp(row: pd.Series) -> Optional[str]:
Collaborator

Nice update for using Optional[str] instead of str | None, since that syntax is more universal

Comment on lines +11 to +20
CONFIGS = {
"branching": {
"output_folder": os.path.join(ROOT, "data", "outputs", "branching"),
"cluster_suffix": "branching"
},
"pr": {
"output_folder": os.path.join(ROOT, "data", "outputs", "pr"),
"cluster_suffix": "pr"
}
}
Collaborator

Nice refactor for moving from conditional logic into configuration, so it makes it easier to add new dataset types.
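The configuration-over-conditionals idea generalizes naturally; a sketch of a driving loop over the CONFIGS dict shown above (the `run_all`/`process` callable is hypothetical, and the project root is assumed to be the working directory here, whereas the PR derives it from `__file__`):

```python
import os

ROOT = os.getcwd()  # assumption for this sketch; PR scripts use __file__

# Same shape as the CONFIGS dict shown above: adding a new dataset type means
# adding one entry here rather than another if/else branch in every step.
CONFIGS = {
    "branching": {
        "output_folder": os.path.join(ROOT, "data", "outputs", "branching"),
        "cluster_suffix": "branching",
    },
    "pr": {
        "output_folder": os.path.join(ROOT, "data", "outputs", "pr"),
        "cluster_suffix": "pr",
    },
}

def run_all(process):
    """Apply one processing step (a callable) to every configured dataset."""
    return {
        name: process(name, cfg["output_folder"], cfg["cluster_suffix"])
        for name, cfg in CONFIGS.items()
    }
```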



- def _pick_timestamp(row: pd.Series) -> str | None:
+ def _pick_timestamp(row: pd.Series) -> Optional[str]:
Collaborator

Nice update for using Optional[T] instead of T | None, since that syntax is more universal

Comment on lines +96 to +99
if not os.path.exists(in_fp):
print(f"[SKIP] Missing input: {in_fp}")
print(f" Run zscore_calculation.py first")
continue
Collaborator

Nice refactor to keep the pipeline running for other datasets instead of stopping completely.

matrix_df.to_csv(matrix_out_fp)
print(f"[OK] Wrote transition matrix: {matrix_out_fp}")

if X.shape[0] < 2:
Collaborator

Nice implementation of a guard clause that handles edge cases. The if condition catches very small datasets, preventing clustering failures when too few teams remain after filtering.
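The guard the comment refers to can be sketched like this (numpy only; the real clustering call is replaced by a trivial stand-in, since the `X.shape[0]` check itself is the point):

```python
import numpy as np

def cluster_or_skip(X: np.ndarray, k: int = 2):
    """Return cluster labels, or None when there are too few rows to cluster.

    Sketch only: the real code presumably feeds X into an actual clustering
    step; here a trivial assignment stands in for it.
    """
    if X.shape[0] < 2:
        print(f"[SKIP] only {X.shape[0]} team(s) after filtering; need >= 2")
        return None
    return np.arange(X.shape[0]) % k  # stand-in for the real clustering step

labels = cluster_or_skip(np.ones((1, 4)))  # too few rows, so no clustering
```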


Labels

enhancement New feature or request


4 participants