
feat: Add Topic Modeling database schema tables #3397

Merged
sgoggins merged 4 commits into chaoss:main from xiaoha-cloud:topic-modeling-schema-only
Nov 13, 2025

@xiaoha-cloud
Contributor

Description

This PR adds:

  1. Migration 35: topic_model_meta table

    • Stores metadata for each trained topic model
    • 21 fields including model_id (UUID PK), repo_id (FK), training parameters, quality metrics
    • Enables model versioning, comparison, and intelligent retraining
  2. Migration 36: topic_model_event table

    • Audit log for topic modeling operations
    • Tracks training lifecycle events for observability
  3. TopicModelMeta ORM model

    • SQLAlchemy model definition with relationships
    • Added to models/__init__.py exports
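
As a rough illustration of what an ORM model along these lines could look like, here is a minimal sketch. Only `model_id`, `repo_id`, `coherence_score`, `topic_diversity`, `training_start_time`, `training_end_time`, `tool_source`, and `data_collection_date` are named in this PR. The `Repo` stub, every column type, the `num_topics` parameter, and the use of timezone-aware timestamps from the start are assumptions for illustration, not the merged definition.

```python
import uuid

import sqlalchemy as sa
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()


class Repo(Base):
    """Stand-in for Augur's existing repo table (assumed shape)."""

    __tablename__ = "repo"
    repo_id = sa.Column(sa.BigInteger, primary_key=True)


class TopicModelMeta(Base):
    """Illustrative subset of the 21 fields; types are assumptions."""

    __tablename__ = "topic_model_meta"

    # UUID primary key, stored as text here to stay dialect-neutral.
    model_id = sa.Column(
        sa.String(36), primary_key=True, default=lambda: str(uuid.uuid4())
    )
    repo_id = sa.Column(sa.BigInteger, sa.ForeignKey("repo.repo_id"), nullable=False)

    num_topics = sa.Column(sa.Integer)  # hypothetical training parameter
    coherence_score = sa.Column(sa.Float)  # quality metric named in the PR
    topic_diversity = sa.Column(sa.Float)  # quality metric named in the PR
    training_start_time = sa.Column(sa.TIMESTAMP(timezone=True))
    training_end_time = sa.Column(sa.TIMESTAMP(timezone=True))

    # Augur-style provenance columns mentioned in the reviewer notes.
    tool_source = sa.Column(sa.String)
    data_collection_date = sa.Column(
        sa.TIMESTAMP(timezone=True), server_default=sa.func.now()
    )

    repo = relationship("Repo")
```

Storing the UUID as a 36-character string is a simplification chosen so the sketch does not depend on a PostgreSQL-specific type; a real migration might well use a native UUID column instead.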

Why split into two PRs?

  • Maintainer requested schema-only PR first to avoid frequent rebases during review
  • Allows schema to be merged and stable before feature code review
  • Follows Augur's pattern of separating schema migrations from feature implementation

Related: #3207

Notes for Reviewers

  • Only schema changes, no business logic
  • Migrations are sequential (35, 36) following existing pattern
  • ORM models follow Augur conventions (tool_source, data_collection_date, etc.)

Signed commits

  • Yes, I signed my commits.

@xiaoha-cloud xiaoha-cloud force-pushed the topic-modeling-schema-only branch from 33e66cc to 0592997 on November 12, 2025 01:30
Add two new tables and ORM models for Topic Modeling versioning system:

1. topic_model_meta table (Migration 35):
   - Stores metadata for each trained topic model
   - 21 fields including model_id (UUID PK), repo_id (FK), training parameters,
     quality metrics (coherence_score, topic_diversity), and visualization data
   - Enables model versioning, comparison, and intelligent retraining

2. topic_model_event table (Migration 36):
   - Audit log for topic modeling events
   - Tracks training lifecycle: started, completed, retrain triggered, etc.
   - Provides observability for automated and manual training operations

3. TopicModelMeta ORM model:
   - SQLAlchemy model definition for topic_model_meta table
   - Relationships and field mappings for application layer

These schema changes support the Topic Modeling feature that enables:
- Automated NMF-based topic extraction from repository messages
- Model version management and comparison
- Intelligent retraining based on data/quality changes
- Storage optimization via REPLACE strategy for automatic runs

Related: chaoss#3207
Signed-off-by: Xiaoha <blairjade183@gmail.com>
@xiaoha-cloud xiaoha-cloud force-pushed the topic-modeling-schema-only branch from 0592997 to d20c672 on November 12, 2025 01:32
- All JSON/JSONB fields in Augur have NO indexes
- Verified: repo_badging.data (JSONB), chaoss_metric_status.cm_info (JSON), etc.
- payload is used for display, not filtering
- Query performance relies on ix_tme_repo_ts and ix_tme_event indexes

Signed-off-by: Xiaoha <blairjade183@gmail.com>
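
The indexing rationale in the commit message above can be sketched with a small, self-contained example. It uses SQLite from the Python standard library rather than PostgreSQL, so it only illustrates the idea. The index names `ix_tme_repo_ts` and `ix_tme_event` and the `ts` and `payload` columns come from this PR; the indexed column lists, the `event_id` key, the `event` column name, and the `training_started` event value are assumptions.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE topic_model_event (
        event_id INTEGER PRIMARY KEY,   -- hypothetical surrogate key
        ts       TEXT NOT NULL,         -- ISO-8601 timestamp with offset
        repo_id  INTEGER,
        event    TEXT NOT NULL,
        payload  TEXT                   -- JSON document, deliberately unindexed
    );
    CREATE INDEX ix_tme_repo_ts ON topic_model_event (repo_id, ts);
    CREATE INDEX ix_tme_event   ON topic_model_event (event);
    """
)
conn.execute(
    "INSERT INTO topic_model_event (ts, repo_id, event, payload) VALUES (?, ?, ?, ?)",
    ("2025-11-12T01:30:00+00:00", 1, "training_started", json.dumps({"num_topics": 8})),
)

# Filtering and ordering touch only indexed columns; payload is read for display.
query = "SELECT event, payload FROM topic_model_event WHERE repo_id = ? ORDER BY ts DESC"
plan = conn.execute("EXPLAIN QUERY PLAN " + query, (1,)).fetchall()
plan_text = " ".join(row[-1] for row in plan)
rows = conn.execute(query, (1,)).fetchall()

print(plan_text)
print(rows)
```

Because the query never looks inside `payload`, an index on the JSON column would add write cost without helping this access pattern, which matches the reasoning given for leaving it unindexed.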
@sgoggins sgoggins added add-feature Adds new features metric CHAOSS Issues that relate directly to our goal of being a good reference implementation of CHAOSS metrics labels Nov 12, 2025
@sgoggins sgoggins self-assigned this Nov 12, 2025
sgoggins previously approved these changes Nov 12, 2025
Member

@sgoggins sgoggins left a comment

This looks like the right way to add things to the schema. @MoralCode ?

Thank you @xiaoha-cloud !!!

Contributor

@MoralCode MoralCode left a comment

Other than making sure you are using timezone-aware columns for all the timestamps, I don't really see a reason not to merge this. It's going to create new unused tables, but that's okay since it's part one of the topic modeling contribution, and merging it sooner is better so other database changes can be made without impacting the pending merge.
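
To show why timezone-aware timestamps matter, here is a small standard-library sketch. It is not taken from the PR; the dates and the UTC-6 offset are arbitrary values chosen for illustration.

```python
from datetime import datetime, timedelta, timezone

# Naive timestamp: no offset, so its meaning depends on whoever reads it.
naive_start = datetime(2025, 11, 12, 1, 30)

# Aware timestamps: the same instant expressed in two different offsets.
utc_start = datetime(2025, 11, 12, 1, 30, tzinfo=timezone.utc)
local_start = utc_start.astimezone(timezone(timedelta(hours=-6)))

# Aware values compare as instants, regardless of the offset they carry.
same_instant = utc_start == local_start

try:
    naive_start < utc_start
    mixed_comparison_ok = True
except TypeError:
    # Python refuses to order a naive value against an aware one.
    mixed_comparison_ok = False

print(same_instant, mixed_comparison_ok, local_start.isoformat())
```

A column that stores the offset avoids the ambiguity of the naive value, which is why training and event timestamps written by workers in different environments remain comparable.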

@MoralCode MoralCode added the database Related to Augur's unified data model label Nov 12, 2025
- set training_start_time/end_time/data_collection_date to TIMESTAMPTZ
- update TopicModelMeta ORM to use timezone-aware columns
- align topic_model_event ts column with TIMESTAMPTZ requirement
- satisfies maintainer request for timezone data storage

Signed-off-by: Xiaoha <blairjade183@gmail.com>
- switch Alembic migrations to use sa.TIMESTAMP(timezone=True)
- keeps timezone support while avoiding Postgres-specific type import

Signed-off-by: Xiaoha <blairjade183@gmail.com>
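
The effect of using the generic `sa.TIMESTAMP(timezone=True)` type can be checked without a database by compiling the DDL against the PostgreSQL dialect. This is a minimal sketch, not the migration from this PR: the two-column table and the `event_id` key are hypothetical, and only the `ts` column name is taken from the commits above.

```python
import sqlalchemy as sa
from sqlalchemy.dialects import postgresql
from sqlalchemy.schema import CreateTable

metadata = sa.MetaData()

# Hypothetical cut-down table; only the "ts" column name comes from this PR.
events = sa.Table(
    "topic_model_event",
    metadata,
    sa.Column("event_id", sa.BigInteger, primary_key=True),
    sa.Column("ts", sa.TIMESTAMP(timezone=True), nullable=False),
)

# Render the CREATE TABLE statement for PostgreSQL without connecting to it.
ddl = str(CreateTable(events).compile(dialect=postgresql.dialect()))
print(ddl)
```

The generic type renders as `TIMESTAMP WITH TIME ZONE` on PostgreSQL, so the migration keeps timezone support while importing nothing from `sqlalchemy.dialects.postgresql` itself.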
@MoralCode MoralCode requested a review from sgoggins November 13, 2025 15:59
Member

@sgoggins sgoggins left a comment

LGTM!

@sgoggins sgoggins merged commit 6b48ab6 into chaoss:main Nov 13, 2025
10 checks passed

Labels

  • add-feature: Adds new features
  • CHAOSS: Issues that relate directly to our goal of being a good reference implementation of CHAOSS metrics
  • database: Related to Augur's unified data model

3 participants