Conversation

@VEDA95 (Collaborator) commented Sep 10, 2025

Bug fix for Issue #201

Pull Request type

This pull request fixes the error shown in the screenshot attached to issue #201.
Binaries for testing purposes can be found at https://github.com/civictechdc/mango-tango-cli/actions/runs/17604801072

Please check the type of change your PR introduces:

  • Bugfix
  • Feature
  • Code style update (formatting, renaming)
  • Refactoring (no functional changes, no API changes)
  • Build-related changes
  • Documentation content changes
  • Other (please describe):

What is the current behavior?

In some instances the uvicorn web server throws an error when trying to shut down gracefully after Ctrl+C is pressed. This happens because connections to the server are closed prematurely by the client (in this case the browser), so uvicorn ends up cleaning up those half-closed connections at the same time the shutdown is supposed to occur. Since it cannot handle cleanup and shutdown simultaneously, the server raises the error.

Issue Number: 201

What is the new behavior?

The uvicorn server implementation now cleanly exits every uvicorn-related task when the shutdown process occurs, which eliminates the error message entirely.
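For reference, here is a minimal sketch of the general idea (not the actual code in this PR; the real fix wires up a custom shutdown handler and signal, and `serve_dashboard`, the port, and the log level below are illustrative):

```python
# Minimal sketch (illustrative only): run uvicorn programmatically and treat
# the CancelledError raised during Ctrl+C teardown as part of a normal shutdown.
import asyncio

import uvicorn


async def serve_dashboard(app) -> None:
    config = uvicorn.Config(app, host="127.0.0.1", port=8050, log_level="warning")
    server = uvicorn.Server(config)
    try:
        # uvicorn installs its own SIGINT handler and sets server.should_exit
        # when Ctrl+C is pressed, which starts its graceful shutdown.
        await server.serve()
    except asyncio.CancelledError:
        # Connections dropped by the browser can surface as a cancellation while
        # shutdown is already in progress; swallow it so the CLI exits cleanly.
        pass
    finally:
        server.should_exit = True  # make sure no uvicorn task lingers
```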

Does this introduce a breaking change?

  • Yes
  • No

github-actions bot commented Sep 10, 2025

PR Preview Action v1.6.2
Preview removed because the pull request was closed.
2025-09-12 20:17 UTC

@VEDA95 VEDA95 added the domain: core (Affects the app's core architecture) and bugfix (Inconsistencies or issues which will cause a problem for users or implementors) labels Sep 10, 2025
@VEDA95 VEDA95 changed the base branch from main to develop September 10, 2025 06:41
@VEDA95 VEDA95 requested a review from JoeKarow September 10, 2025 06:50
@KristijanArmeni KristijanArmeni (Collaborator) left a comment

Thanks @VEDA95. I tried the executable you shared and it seems to work. I can't comment on the implementation itself, since it's not my domain, but the justification is clear to me.

A small thing: after launching a dashboard, the localhost link did not seem to be clickable. Not sure if that's related at all; it might just be the terminal app used to launch the .exe. If so, just ignore.

Shall we merge this in before #200 (which renames executables etc.)?

@KristijanArmeni KristijanArmeni merged commit 40b00b0 into develop Sep 12, 2025
6 checks passed
@KristijanArmeni KristijanArmeni deleted the issue-201-cli-asyncio-error-bugfix branch September 12, 2025 20:16
@VEDA95 (Collaborator, Author) commented Sep 12, 2025

A small thing: after launching a dashboard, the localhost link did not seem to be clickable. Not sure if that's related at all; it might just be the terminal app used to launch the .exe. If so, just ignore.

As a matter of fact, I have run into this issue too, @KristijanArmeni. However, I've only seen it in the macOS Terminal app; when I use Ghostty I don't run into the problem. Clicking the link also works when I run the project on my Linux desktop.

KristijanArmeni added a commit that referenced this pull request Sep 23, 2025
* Reapply changes pertaining to analyzer parametrization (#146)

* [ENH] Hashtag update and add tests (#121)

* [ENH] update interface to rely on post column, add users col

* [ENH] update gini() to operate on pl.Series

* [ENH] wrap analysis code into hashtag_analyzer()

* [ENH] check for time column dtype

* [ENH] fix column naming, set default window to 12h

* [ENH] add data and tests for hashtag_analyzer

* [ENH] add tests for gini()

* [MAINT] explicitly specify return_dtype

* [MAINT] cleanup

* fix: hide unsupported hashtag analysis export formats

* test tutorial comment cleanup

* fix: Capitalization typo

---------

Co-authored-by: DeanEby <ebyd21@gmail.com>

* feat: analyzer parameterization (#126)

This change introduces analyzer parameterization. Parameters can be defined
using `AnalyzerParam` and queried via the analyzer/web presenter context
objects. The CLI is updated to include a step where parameters are chosen,
and they are reflected back when the analysis is viewed. The test helpers
are also updated to include parameter provision.

The example analyzer, which performs simple character count, is updated
to include a single parameter, namely `fudge_factor` that offsets the
character count.

These parameter types are supported:
- `IntegerParam` (produces `int` value)
- `TimeBinningParam` (produces `TimeBinningValue`, essentially an object
  with attributes `unit` and `amount`).

The `TimeBinningParam` is currently unused, but it is defined in anticipation
that it will be used by the hashtag tests. As such, a convenient method that
produces a polars `truncate` expression is added to `TimeBinningValue` so
it can just be dropped into the analyzer later.
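A rough sketch of how such a value object can feed polars (illustrative only; the real `TimeBinningValue` may differ, and `truncate_spec` is a made-up helper name):

```python
# Illustrative sketch of the described TimeBinningValue; the real class lives
# in the analyzer parameterization code and may differ in details.
from dataclasses import dataclass

import polars as pl


@dataclass
class TimeBinningValue:
    unit: str    # polars duration suffix, e.g. "h" or "d"
    amount: int  # e.g. 12 for a 12-hour bin

    def truncate_spec(self) -> str:
        # "12h", "1d", ... -- the string polars' dt.truncate() expects
        return f"{self.amount}{self.unit}"


# Usage: bin timestamps into 12-hour windows before aggregating.
value = TimeBinningValue(unit="h", amount=12)
df = pl.DataFrame({"time": ["2025-01-01 03:15:00", "2025-01-01 14:45:00"]})
df = df.with_columns(
    pl.col("time").str.to_datetime().dt.truncate(value.truncate_spec()).alias("time_bin")
)
```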

* [FEAT] Add time window parameter selection to hashtags test (#129)

* add time binning parameter to hashtag analyzer
* add default parameter for tester
* de-indent help text, remove reference to polars expr

---------

Co-authored-by: DeanEby <ebyd21@gmail.com>
Co-authored-by: soul-codes <40335030+soul-codes@users.noreply.github.com>

* Multi Tenancy Support (#150)

* Implemented middleware needed to support multi tenancy with the waitress server

* Updating waitress to resolve two CVE reports from dependabot. (https://github.com/civictechdc/mango-tango-cli/security/dependabot/1), (https://github.com/civictechdc/mango-tango-cli/security/dependabot/2)

* ran black and isort

* Add macOS code signing and build improvements (#156)

* add .env files to ignore

* use VIRTUAL_ENV if set, otherwise fallback to `./venv/`

* update workflow for codesigning

* allow manual run

* cleanup workflows

* update pyinstaller.spec for macOS codesigning and add os import; update requirements-dev.txt to ensure pyinstaller is included

* add mango.entitlements for macOS codesigning requirements

* don't upload artifacts from test builds

* update actions versions and improve dependency caching in test workflow

* [FEAT] Add dashboard prototype (Shiny for Python) for hashtag analyzer  (#157)

* initial commit, hacky factory and interface modules

* initial commit, __init__.py

* initial commit, supporting data science modules for app

* initial commit, shiny app module

* add hashtag dashboard to the suite

* add code to load raw dataset

* [ENH] use group_by_dynamic, add rolling mean computation

* [MAINT] adding 'gini_smooth' output column to output test data

* [MAINT] remove unused import

* [MAINT] add shiny, shinywidgets to requirements.txt

* set launch_browser in run_app to True

* add ShinyServerConfig in factory.py

* [FEAT] add `._start_shiny_server` method to `AnalysisWebServerContext` class

* move shiny app instance declaration from `factory.py` to `app/analysis_webserver_context`

* Multi Tenancy Support (got it working this time!) (#163)

* feat: Add comprehensive AI development assistant integration and documentation (#164)

* Implementing configuration to get artifacts back when running workflow...

* WIP including hashtag analysis as test in CLI app builds

* WIP working on implementing fix for file not found errors in the executable for shiny dashboards

* WIP still working on production build fixes for shiny dashboards.

* WIP still working on production build fixes for shiny dashboards.

* WIP still working on production build fixes for shiny dashboards.

* WIP working on implementing fix for file not found errors in the executable for shiny dashboards

* WIP working on implementing fix for file not found errors in the executable for shiny dashboards

* WIP working on implementing fix for file not found errors in the executable for shiny dashboards

* WIP working on implementing fix for file not found errors in the executable for shiny dashboards

* WIP working on implementing fix for file not found errors in the executable for shiny dashboards

* WIP working on implementing fix for file not found errors in the executable for shiny dashboards

* Added except statement to silence CancelledError error that occurs when shutting down webserver with CTRL + C

* WIP fixing bug that causes shiny dashboard to be displayed via  instead of ...

* Cleaning up code...

* [FEAT] Add tests for the ngram analyzer and reorganize the folders (#168)

* test: initial commit, add test_ngrams.py

* test: initial commit add .csv and .parquet data for testing

* test: add ParquetTestData class

* feat: sort the output of n-gram statistics, change print feedback

* test: add __init__.py

* refactor: move ngram analyzers to a single ngrams folder

* refactor: update import statements

* refactor: move and rename base test

* test: add parquet data for ngram_stats test

* initial commit, __init__.py

* chore: pl.count() deprecated use pl.len()

* Update hashtag dashboard: make lineplot clickable and some UX/esthetics (#165)

* remove accordion, remove filters from dataframe

* remove vertical line, remove unused dependencies in plots.py

* remove date picker, add code to select date by clicking on line plot

* fix conflicts, move color declarations to plots.py

* format hover labels

* [MAINT] move code for placeholder figures to plots.py

* [MAINT] move `clicked_data` higher up

* [MAINT] rename bar plot ids to be more specific

* [FEAT] use place holder fig in hashtag_bar_plot()

* add line break in placeholder figure text

* update placeholder text

* [FEAT] update code to work when secondary_analysis() returns None

* improve time window info display

* fix tooltip text

* remove selected point from hover information in line plot

* update text labels in figures

* update analyzer short description, indent long description

* main: shorten the description of temporal analyzer

* use ProgressReporter

* [FEAT] Add N-gram analysis dashboard in Shiny (#173)

* feat: initial commit, app.py

* chore: use ngram_string variable

* feat: update factory.py and __init__.py for Shiny dashboard

* fix: sort stats df, to correctly show top 100 ngrams, use constants for col names

* refactor: use `np.random.default_rng()` instead of `np.random.seed()`

* Update build_exe.yml

* Application-Wide Logging System (#177)

* [DOCS] Redo(#149): Add config, documentation, and github workflow for mkdocs (#161)

* add mkdocs config and populate /docs folder

* add mkdostrings and deps to dev dependencies

* update README.md, add badges and logo

* create guides folder

* Add github workflow for docs build on pull request

* use a separate, smaller requirements file for docs and use it for the workflow

* update readme esthetics a little bit

* [fix] add requirements-mdocs.txt and remove cmake dependency

* fix links

* Wrote, "Getting Started" guide for technical docs.

* Added a link to the contributor workflow section under "Next Steps" in getting started section.

* add scoped token permission

* use separate build / deploy actions

* use deploy key instead of github_token

* use separate build & deploy actions. only deploy to prod from main -- others get a preview build

* update workflow to use pull_request_target for better security and context

* add manual trigger

* change trigger back to pull_request

* Add macOS code signing and build improvements (#156)

* add .env files to ignore

* use VIRTUAL_ENV if set, otherwise fallback to `./venv/`

* update workflow for codesigning

* allow manual run

* cleanup workflows

* update pyinstaller.spec for macOS codesigning and add os import; update requirements-dev.txt to ensure pyinstaller is included

* add mango.entitlements for macOS codesigning requirements

* don't upload artifacts from test builds

* update actions versions and improve dependency caching in test workflow

* reset permissions

* Revised setup docs with the help of the ai-docs Joe generated. Also wrote Getting Started guide that is based on the AI docs.

* Split architecture docs into separate sections

* Added next steps to contribution section

* Updated mkdoc nav config to reflect how the architecture docs are now split up

* Update build_exe.yml

* Update build_exe.yml

* Finished writing initial revisions to the analyzers documentation...

* WIP writing docs for shiny and react dashboards...

* Wrote initial draft for the technical docs for the react and shiny dashboards...

* Added Next Steps subsection for each domain section...

* Transferred logging docs from branch JoeKarow/logging-system

* Added additional docs to the mkdocs navigation...

* Revised navigation links to be more coherent in structure...

* Implemented mermaid plugin for mkdocs...

* Shifted docs around and corrected link warning when running mkdocs locally...

* Moved testing docs to get-started folder...

* Revised overview and testing links to clear link warnings...

* Implemented reference docs FactoryOutputContext and related context objects

* Applied black and isort formatting analyzer interface context...

* Rebuilt mkdoc site assets...

* WIP troubleshooting why workflow is not deploying updated mkdoc assets...

* WIP troubleshooting why workflow is not deploying updated mkdoc assets...

* WIP troubleshooting why updated mkdoc assets are not being deployed to github-pages...

* Implemented changes recommended in the latest review conducted by Kristijan

* Cleaning up unnecessary files...

* Application-Wide Logging System (#177)

* Resolving merge conflicts with dev branch...

* create guides folder

* Breaking out of rebase hell (hopefully)...

* Rebuilt mkdoc site assets...

* Cleaning up unnecessary files...

* Removing everything that was re-added to overview while trying to break out of rebase hell...

* Removing .idea directory that got added while breaking out of rebase hell...

* Attempting to fix action error that occurs when deploying mkdocs site...

* Included link to technical docs site in project README...

* feat, fix: add logo to the home page, fix home page links

* add /site to gitignore

* docs: use About for homepage title

---------

Co-authored-by: Kristijan Armeni <kristijan.armeni@gmail.com>
Co-authored-by: VEDA95 <snetterfield1@gmail.com>

* [FIX, FEAT] Add support for native datetime formats, strip timezone information when importing string-based date columns (#194)

* feat: add `native_date`, `native_datetime` SeriesSemantic instances, and `parse_datetime_with_tz()` helper

* fix: add helper method to more consistently load and transform input data

* test: add tests for main `series_semantic` instances

* feat: improve datetime handling in hashtag dashboard with midpoint default

* feat: add option to parse timezones coded as offset strings

* test: add tests for parsing offset-coded tz and user warning

* fix: remove heterogeneous timezones test_parse_datetime_with_tz()

* Use Rich for printing tables, prompts and panels in welcome screen (replaces: PR #185) (#195)

* feat: add `print_data_frame` using `rich.table` for printing data frames

* use print_data_frame in new_analysis

* feat: add `print_dialog_section_title()`

* feat: use print_data_frame and print_dialogue_section_title

* feat: update ascii art welcome page, use rich.Panel

* chore: add credits

* fix: try moving script comment

* fix: reapply formatting

* chore: add rich to requirements

* feat: add option to color-code cols based on datatype and smarter col width setting

* feat: add smart_print_data_frame to only show summary for large data_frames

* feat: fix no-coloring mode, simplify and remove column width computation

* feat: use smart dataframe printing in project selection

* refactor: revert to using print for section titles

* feat: use table captions in smart_print_data_frame

* chore: use print, remove unused imports

* feat: add three different logo sizes

* feat: add three different logo sizes

* chore: add docstring to smart_print_data_frame

* feat: limit column width for cols with a lot of characters

* [bugfix] Add `pythonjsonlogger` to hidden imports (#198)

fix: add `pythonjsonlogger` and `jsonlogger` to hidden imports

* Issue 201 cli asyncio error bugfix (#203)

* Implemented exception catch for CancelledError

* WIP still troubleshooting CancelledError exception bug

* WIP still troubleshooting CancelledError exception bug

* WIP still troubleshooting CancelledError exception bug

* Implemented custom shutdown handler and signal for gracefully shutting down uvicorn server

* WIP still troubleshooting CancelledError exception bug

* Removed unnecessary print statement

* Ran black/isort

* Rename executable to `cibmangotree`, print feedback message during startup (#200)

* chore: rename executable and main script to cibmangotree

* chore: rename .exe to cibmangotree

* chore: rename executable to cibmangotree

* rename .exe to cibmangotree

* redoing commit, renaming

* chore: use the new executable name in workflows

* feat: Shared Unicode tokenizer service for analyzer ecosystem (#204)

* [RELEASE] v0.8.1-beta.1 (hotfix) (#199)

* Phase 1: Create tokenizer service directory structure

- Add services/ package for modular service architecture
- Create services/tokenizer/ with plugin-ready structure
- Add core/ directory for abstract base classes and types
- Add basic/ directory for BasicTokenizer implementation
- Set up __init__.py files with placeholder documentation

This establishes the foundation for Unicode-aware tokenization
service with extensible plugin architecture.

* Phase 2: Implement core tokenizer architecture

- Add TokenizerConfig dataclass with comprehensive configuration options
- Implement AbstractTokenizer base class with plugin architecture
- Define TokenType, LanguageFamily, SpaceType enums for type safety
- Create clean interfaces for tokenization operations
- Set up dependency injection pattern for configuration
- Add preprocessing and postprocessing pipeline hooks

Core architecture provides extensible foundation for different
tokenization strategies while maintaining clean interfaces.
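A hedged sketch of the described structure (the real classes live under `services/tokenizer/`; the config fields and the `WhitespaceTokenizer` below are assumptions for illustration):

```python
# Rough sketch of the described plugin architecture; the project's actual
# classes may differ in fields and method names.
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class TokenizerConfig:
    lowercase: bool = True
    include_hashtags: bool = True
    include_mentions: bool = True
    include_urls: bool = True


class AbstractTokenizer(ABC):
    def __init__(self, config: TokenizerConfig | None = None) -> None:
        # Dependency injection: behaviour is driven entirely by the config object.
        self.config = config or TokenizerConfig()

    @abstractmethod
    def tokenize(self, text: str) -> list[str]:
        """Split text into tokens according to the configured strategy."""


class WhitespaceTokenizer(AbstractTokenizer):
    def tokenize(self, text: str) -> list[str]:
        if self.config.lowercase:
            text = text.lower()
        return text.split()
```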

* Phase 3: Implement Unicode-aware BasicTokenizer

- Add language_detection.py with Unicode script classification
- Create patterns.py with comprehensive social media regex patterns
- Implement BasicTokenizer with AbstractTokenizer interface
- Add social media entity preservation (@mentions, #hashtags, URLs)
- Support mixed-script text handling (e.g., "iPhone用户" → ["iPhone", "用户"])
- Enable configurable tokenization with emoji and entity extraction
- Set up tokenizer service exports and factory functions

Extracted and simplified core tokenization logic from app/utils.py
while maintaining Unicode handling and entity preservation.
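A simplified stand-in for the behaviour described above (not the project's actual patterns; the regex below only illustrates entity preservation and mixed-script splitting):

```python
# Simplified stand-in for the described patterns -- not the real patterns.py.
# It keeps URLs, @mentions and #hashtags whole, emits CJK runs as their own
# tokens, and groups Latin runs into words.
import re

TOKEN_PATTERN = re.compile(
    r"https?://\S+"          # URLs
    r"|[@#]\w+"              # @mentions and #hashtags
    r"|[\u4e00-\u9fff]+"     # CJK ideograph runs
    r"|[A-Za-z0-9']+"        # Latin words and numbers
)


def tokenize(text: str) -> list[str]:
    return TOKEN_PATTERN.findall(text)


print(tokenize("iPhone用户 check #hashtag https://example.com"))
# ['iPhone', '用户', 'check', '#hashtag', 'https://example.com']
```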

* Phase 4: Integrate tokenizer service with n-gram analyzer

- Add configurable min_n and max_n parameters to ngram interface
- Update analyzer description to mention multilingual support
- Replace complex tokenization logic with new tokenizer service
- Integrate Unicode-aware tokenization with social media entities
- Maintain backward compatibility with existing test data
- Preserve progress reporting and output format structures
- Enable user-configurable n-gram length parameters (1-15 range)

N-gram analyzer now leverages enhanced tokenization capabilities
while maintaining full compatibility with existing analyses.
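For illustration, configurable n-gram expansion over a token list can look like this (a generic sketch; `min_n`/`max_n` are taken from the commit above, but the helper itself is not the analyzer's code):

```python
# Illustrative n-gram expansion with configurable bounds; the analyzer's real
# implementation (and its polars-based counting) will differ.
def ngrams(tokens: list[str], min_n: int = 1, max_n: int = 3) -> list[tuple[str, ...]]:
    grams: list[tuple[str, ...]] = []
    for n in range(min_n, max_n + 1):
        grams.extend(tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1))
    return grams


print(ngrams(["hello", "@user", "check", "#hashtag"], min_n=1, max_n=2))
# [('hello',), ('@user',), ('check',), ('#hashtag',),
#  ('hello', '@user'), ('@user', 'check'), ('check', '#hashtag')]
```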

* Phase 5: Comprehensive testing and validation

- Remove backward compatibility tokenize() function from ngram analyzer
- Update existing ngram tests to use new tokenizer service exclusively
- Create comprehensive test suite for tokenizer service (86 tests total):
  - services/tokenizer/test_service.py (38 tests)
  - services/tokenizer/core/test_types.py (15 tests)
  - services/tokenizer/basic/test_basic_tokenizer.py (33 tests)
- Test multilingual support (Latin, Chinese, Japanese, Arabic, mixed scripts)
- Test social media entities (@mentions, #hashtags, URLs, emojis)
- Test configurable parameters and edge cases
- Validate backward compatibility and performance

All tests pass, ensuring reliable tokenization for social media analytics.

* Refactor: Reduce test redundancy between service and implementation tests

- Streamline test_service.py from 38 to 19 focused tests
- Remove redundant multilingual and social media tests from service layer
- Keep comprehensive testing in test_basic_tokenizer.py (38 tests)
- Maintain clear separation: API/integration vs implementation details
- All 57 total tests pass with better organization and no coverage gaps

Eliminates ~19 redundant tests while preserving comprehensive validation.

* Phase 6: Add comprehensive tokenizer service documentation

- Update .ai-context/README.md with tokenizer service in tech stack
- Enhance .ai-context/architecture-overview.md with services layer
- Add tokenizer symbols to .ai-context/symbol-reference.md
- Create comprehensive services/tokenizer/README.md with:
  * Unicode-aware tokenization capabilities
  * Social media entity preservation
  * Plugin architecture documentation
  * Configuration options and API reference
  * Integration patterns with analyzers
  * Performance and testing information

All documentation follows established patterns and markdown standards.

* Add Serena memory: Tokenizer service implementation documentation

Comprehensive memory documenting the Unicode-aware tokenizer service:
- Architecture overview with core components and plugin system
- Multilingual support (Latin, CJK, Arabic, mixed scripts)
- Social media entity preservation (@mentions, #hashtags, URLs)
- Configuration options and integration patterns
- Testing coverage and performance characteristics
- Development patterns and future extension points
- Migration notes and backward compatibility

Provides complete reference for tokenizer service implementation.

* add `regex` dependency

* Fix ngrams analyzer test with updated tokenizer service

- Update context.parameters() to context.params in main.py
- Regenerate test data files to match current tokenizer output
- Fix timestamp format in message_authors test data to match semantic preprocessing
- All ngrams tests now pass with updated tokenizer service

* Fix tokenizer token order preservation

Implement position-based tokenization algorithm that preserves the exact order
of tokens as they appear in input text. Previously, social media entities
(hashtags, mentions, URLs) were extracted first, then remaining text was
processed, destroying the original token order.

Key changes:
- Replace _extract_tokens() with position-based _extract_tokens_ordered()
- Add comprehensive test suite with 14 order preservation test cases
- Fix existing tests that had incorrect order expectations
- Maintain all multilingual and configuration functionality
- Add validation tests for performance and compatibility

Example fix:
- Input: "Hello @user check #hashtag"
- Before: ["@user", "#hashtag", "hello", "check"] (wrong order)
- After: ["hello", "@user", "check", "#hashtag"] (preserves order)

All 72 tests pass with no regressions.

* Phase 1: Combined Regex Optimization - 18% performance improvement

- Replace multiple regex passes with single combined pattern
- Reduce algorithmic complexity from O(n×m) to O(n)
- Achieve 17-26% performance improvement across social media workloads
- Maintain all 72 existing tests passing
- Preserve exact token order and entity detection functionality

Technical changes:
- Add combined_social_entities_pattern with named groups
- Optimize _find_social_entities_with_positions() with single finditer()
- Use match.lastgroup for efficient entity type detection
- No breaking changes to public API

Performance results:
- Medium Social: +26% (178k → 224k tokens/sec)
- Entity Heavy: +17% (234k → 275k tokens/sec)
- Large 10K: +11% (219k → 244k tokens/sec)
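The single-pass, named-group idea can be sketched like this (an assumed pattern, not the project's `combined_social_entities_pattern`):

```python
# Sketch of the single-pass approach described above: one combined pattern with
# named groups, classified via match.lastgroup.
import re

COMBINED_SOCIAL_ENTITIES = re.compile(
    r"(?P<url>https?://\S+)"
    r"|(?P<mention>@\w+)"
    r"|(?P<hashtag>#\w+)"
)


def find_social_entities(text: str) -> list[tuple[str, str, int]]:
    # Returns (entity_type, matched_text, start_position) in document order.
    return [(m.lastgroup, m.group(), m.start()) for m in COMBINED_SOCIAL_ENTITIES.finditer(text)]


print(find_social_entities("Hello @user check #hashtag at https://example.com"))
# [('mention', '@user', 6), ('hashtag', '#hashtag', 18), ('url', 'https://example.com', 30)]
```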

* Phase 2 REPLACEMENT: Comprehensive Regex Tokenizer - 45.6% performance improvement

Replace failed index-based processing with comprehensive regex approach:
- Single regex pattern finds ALL tokens in document order using findall()
- Eliminate entire segmentation system (O(n×segments) → O(n))
- Remove 8 complex tokenization methods and reassembly logic
- Massive code simplification while preserving functionality

Performance results:
- Execution time: 4.39ms → 2.39ms (45.6% faster)
- Throughput: 211k → 285k tokens/sec (34.7% higher)
- Test compatibility: 96.6% (56/58 tests pass)

Technical changes:
- Replace _extract_tokens_ordered() with single comprehensive regex
- Use patterns.get_comprehensive_pattern(config).findall(text)
- Remove segmentation methods: _create_ordered_segments(), etc.
- Enhanced comprehensive pattern with Thai script support
- Maintain configuration-based token filtering

This replaces the minimal-impact Phase 2 index optimization with
a fundamental architectural improvement that delivers major gains.

* Phase 3: Configuration and Code Cleanup

Remove unused configuration options for cleaner, more maintainable code:
- Remove unused SpaceType enum and space_type configuration
- Remove custom_spaces (never used)
- Remove include_whitespace (unused functionality)
- Streamline TokenizerConfig to focus on actively used options

Benefits:
- Cleaner configuration interface
- Reduced code complexity
- Easier maintenance and understanding
- No breaking changes to core functionality

All essential configuration options preserved:
- Social media entity extraction controls
- Token filtering (punctuation, numeric, emoji)
- Text preprocessing (case handling, unicode normalization)
- Language detection settings

* Documentation Update: Align tokenizer docs with optimization implementation

Update all tokenizer service documentation to reflect the completed 3-phase optimization:

Phase 1 & 2 Optimization Results:
- 45.6% performance improvement (285k tokens/sec)
- Comprehensive regex approach replacing segmentation
- Configuration simplified (removed SpaceType, custom_spaces, include_whitespace)

Documentation Changes:
- services/tokenizer/README.md: Remove deleted config options, update examples, architecture, and performance metrics
- .ai-context/symbol-reference.md: Fix TokenizerConfig section, add get_comprehensive_pattern() method
- .ai-context/architecture-overview.md: Add performance improvements and comprehensive regex notes

Validation:
- All configuration examples tested and working
- Import statements corrected
- Performance claims updated to reflect 285k tokens/sec capability
- No references to deleted functionality remain

The tokenizer documentation now accurately reflects the optimized implementation
with proper API references and current performance characteristics.

* Final Cleanup: Remove vestigial language detection methods from AbstractTokenizer

Complete the tokenizer optimization by removing architectural debt:
- Remove detect_language_family() and is_space_separated() methods from AbstractTokenizer
- These methods only returned static fallback values after comprehensive regex optimization
- Simplify BasicTokenizer implementation by removing unused method calls
- Remove 2 obsolete tests that tested only fallback behavior
- Preserve all 56 functional tokenization tests

Benefits:
- Cleaner, honest API that accurately represents tokenizer capabilities
- Eliminate confusion from fake "language detection" methods
- Remove dead code and unused parameters
- Maintain all performance optimizations and functionality

Validation:
- All 56 tokenization tests pass
- Multilingual support (CJK, Arabic, Latin) preserved
- Social media entity extraction maintained
- Analyzer integration confirmed (n-grams, hashtags)
- No functional regressions

This completes the tokenizer service modernization project with a
clean, simplified interface that reflects the efficient comprehensive
regex architecture achieving 45.6% performance improvement.

* Documentation: Make tokenizer README objective and instructional

Clean up promotional language from tokenizer service documentation:
- Remove marketing terms like "sophisticated", "optimal performance"
- Remove repeated "45.6% performance improvement" claims
- Change "Performance Characteristics" to "Implementation Details"
- Focus on functionality and usage rather than optimization claims
- Maintain technical accuracy while using objective language
- Keep all configuration examples and usage patterns

The documentation now serves as clear technical reference focused on
what the tokenizer does and how to use it effectively, rather than
promotional material about performance improvements.

Also updated symbol-reference.md and architecture-overview.md to remove
references to deleted language detection methods.

* Complete tokenizer service optimization and cleanup

Final changes to complete the tokenizer service modernization:
- Remove language_detection.py module (no longer needed after comprehensive regex optimization)
- Update imports and exports to remove language detection references
- Clean up test files and type definitions
- Update patterns.py with final optimizations
- Modernize service integration tests

This completes the 3-phase tokenizer optimization project:
- Phase 1: Combined regex optimization (18% improvement)
- Phase 2: Comprehensive regex tokenizer (45.6% improvement)
- Phase 3: Configuration cleanup and language detection removal
- Final cleanup: Remove vestigial interface methods and documentation

The tokenizer service now provides efficient, regex-based tokenization
with simplified architecture, honest documentation, and significant
performance improvements while maintaining full multilingual and
social media entity support.

* update import/exports

* add docs section, update mkdocs config

* Tokenizer Service Cleanup: Remove over-engineered API surface

Phase 1-3: Complete cleanup of tokenizer service over-engineering
- Remove SpaceType enum (completely unused)
- Remove update_config() method and related tests
- Remove detect_language field from TokenizerConfig and ngrams usage
- Remove tokenize_with_types() method and TokenizedResult type
- Remove _extract_and_classify_tokens() supporting method
- Remove _is_emoji_only() duplicate method
- Remove individual pattern methods (get_social_media_pattern, etc.)
- Make preprocess_text and postprocess_tokens private methods
- Clean up imports and exports throughout service

Critical: Preserves all core tokenization behavior including CJK
character-level tokenization for backward compatibility.

Benefits:
- Reduced API complexity (removed 4 unnecessary methods/features)
- Cleaner codebase with 200+ lines of over-engineering removed
- Maintained all essential functionality and performance
- All 86 tokenizer tests pass
- Simplified service interface aligned with actual usage patterns

* formatting

* Documentation Update: Improve tokenizer service documentation

- Complete rewrite of services/tokenizer/README.md for better developer experience
- Fix inaccurate configuration defaults and method signatures
- Remove performance claims and make documentation objective
- Update usage patterns to be less opinionated about analysis approaches
- Fix symbol reference documentation to match actual implementation
- Change extract_emails default to True for better social media analytics

* Update tokenizer documentation to reflect current API

- Fix Pattern Matching section: correct get_patterns() signature and add actual pattern classes
- Update Service API exports: include missing TokenList and CaseHandling types
- Remove references to non-existent pattern functions (get_pattern, get_comprehensive_pattern)

* update test to reflect new default

* Fix Thai character-level tokenization tests and weak assertions

- Modified Thai tokenizer pattern from `[\u0e00-\u0e7f]+` to `[\u0e00-\u0e7f]` for character-level tokenization
- Fixed `test_thai_text_tokenization()` to test specific character-level expected output
- Fixed `test_mixed_script_multilingual()` Thai assertion to validate individual characters
- Enhanced tokenizer test suite with comprehensive Phase 1 improvements:
  - Split emoji tests into specific enabled/disabled scenarios
  - Added comprehensive negative testing for disabled features
  - Fixed weak assertion patterns with specific behavior validation
  - Documented implementation bugs for email and decimal numeric exclusion
- Eliminated uncertain test language and vague assertions
- All tests now validate exact expected tokenization behavior
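A small illustration of the pattern change (the example string is chosen here, not taken from the test suite):

```python
# Effect of the described pattern change, run-level vs character-level Thai.
import re

run_level = re.compile(r"[\u0e00-\u0e7f]+")   # old: one token per Thai run
char_level = re.compile(r"[\u0e00-\u0e7f]")   # new: one token per Thai character

text = "สวัสดี"
print(run_level.findall(text))   # ['สวัสดี']
print(char_level.findall(text))  # ['ส', 'ว', 'ั', 'ส', 'ด', 'ี']
```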

* fix Thai handling

* config cleanup

* formatting

* doc updates

* Fix tokenizer service URL/email preservation and test quality

## Phase 1: Critical Bug Fixes
- Fix URL/email preservation: rename extract_* to include_* (default True)
- Implement single-pass regex exclusion (33% performance improvement)
- Fix URL punctuation cleaning and entity classification overlap
- URLs/emails now either preserved whole or excluded entirely (no fragmentation)

## Phase 2: Test Quality Improvements
- Replace weak type-only assertions with specific expected results
- Eliminate all non-deterministic "may"/"might" language from tests
- Fix sloppy assertions with precise validation
- All tests now validate actual tokenization behavior

## Phase 3: Test Coverage & Organization
- Add comprehensive mixed entity interaction test (@mentions + #hashtags)
- Remove duplicate test scenarios while preserving coverage
- Improve test documentation and logical organization
- Update all tests to use new include_* config field names

## Results
- All 98 tokenizer service tests passing
- Clean n-gram analysis input (no fragmented social media entities)
- Deterministic, enterprise-level test quality
- Backward compatibility maintained with improved semantics

* documentation update

* Coderabbit review

Signed-off-by: Joe Karow <58997957+JoeKarow@users.noreply.github.com>

* fix failing test

* refactor: convert tokenizer dataclasses to Pydantic models

- Convert TokenizerConfig from dataclass to Pydantic BaseModel
- Update imports to use Pydantic instead of dataclasses
- Update documentation references from dataclasses to Pydantic models
- Maintain backward compatibility - all existing tests pass
- Provides consistency with rest of codebase that uses Pydantic

Addresses review comment at services/tokenizer/core/types.py:8
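In general terms, the conversion looks like this (field names are assumptions based on the `include_*` options mentioned earlier, not the actual model):

```python
# Dataclass-to-Pydantic conversion in general terms; fields are illustrative.
from pydantic import BaseModel


class TokenizerConfig(BaseModel):
    lowercase: bool = True
    include_hashtags: bool = True
    include_mentions: bool = True
    include_urls: bool = True
    include_emails: bool = True


# Unlike a plain dataclass, Pydantic validates and coerces on construction
# (Pydantic v2 API shown).
config = TokenizerConfig(include_urls=False)
print(config.model_dump())
```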

* style: rename pattern variables to UPPER_CASE constants

- Move all pattern variables from _compile_patterns method to module-level constants
- Rename all pattern variables to UPPER_CASE following Python naming conventions
- Update all references to use the new constant names
- Maintain exact same functionality while improving code organization
- All 98 tokenizer tests pass

Addresses code review comment requesting pattern variables be constants.

* docs: fix tokenizer documentation hierarchy in mkdocs.yml

* formatting

Signed-off-by: Joe Karow <58997957+JoeKarow@users.noreply.github.com>

---------

Signed-off-by: Joe Karow <58997957+JoeKarow@users.noreply.github.com>
Co-authored-by: Kristijan Armeni <kristijan.armeni@gmail.com>
Co-authored-by: DeanEby <ebyd21@gmail.com>
Co-authored-by: soul-codes <40335030+soul-codes@users.noreply.github.com>
Co-authored-by: VEDA95 <snetterfield1@gmail.com>

* maint: Reorganize and rename subfolders in `/analyzers` for consistency (#207)

* chore: rename folder hashtags --> hashtags_base

* chore: move *_base and *_web into hashtags folder

* chore: move test_data one level up

* chore: update all hashtags imports

* chore: remove temporal.ipynb

* chore: rename temporal --> temporal_base

* chore: rename temporal_barplot --> temporal_web for consistency

* move _base and _web into /analyzers/temporal

* chore: add temporal/__init__.py

* chore: update temporal imports

* chore: rename test_hashtags_analyzers.py

* chore: remove unused and commented code

* Updated naming conventions in CLI (#205)

* Updated naming conventions in CLI
* Update tests

* bugfix: update `hashtags_web/app.py` to not use hard-coded column names (#210)

fix: use column constants for renaming

* feat: Add method and option to detect number of rows to skip in csv files (#211)

* feat: add _detect_skip_rows()

* test: add a few test cases and test logic

* chore: remove comparisons to booleans

* doc: add comment

* ux: add informative hint for skip_rows

* chore: remove unused import

* ux: use smart_print_data_frame to print options

* remove table title

* feat: use smart_print_data_frame to display detected columns

* ux: rename to "Has header" parameter

* feat: add `print_message` utility instead of base `print`

* ux: update the printed message to use consistent name

* feat: open file once for skip_rows and dialect detection

* test: update tests

* fix: make nr. trailing commas consistent

* chore: rename variable

* feat: add more checks for manual skip_rows selection

* ux: add feedback to manual skip_rows selection

* feat: add validation to not skip max_lines

---------

Signed-off-by: Joe Karow <58997957+JoeKarow@users.noreply.github.com>
Co-authored-by: DeanEby <ebyd21@gmail.com>
Co-authored-by: soul-codes <40335030+soul-codes@users.noreply.github.com>
Co-authored-by: VEDA95 <snetterfield1@gmail.com>
Co-authored-by: Joe Karow <58997957+JoeKarow@users.noreply.github.com>
Co-authored-by: JMCulhane <62526923+JMCulhane@users.noreply.github.com>

Labels

bugfix (Inconsistencies or issues which will cause a problem for users or implementors), domain: core (Affects the app's core architecture)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug: When stopping dashboard server on CLI with ctrl+c, you get scary "error" message

4 participants