Enhance n-gram analyzer: configurable parameters, advanced tokenization, and large dataset support #166
Conversation
…-gram analysis Signed-off-by: Joe Karow <58997957+JoeKarow@users.noreply.github.com>
… improve non-spaced tokenization Signed-off-by: Joe Karow <58997957+JoeKarow@users.noreply.github.com>
@JoeKarow This is wonderful! Sorry to mess up here, but I just added a PR with ngram tests and a small refactor for ngram analyzers in #168. Since this PR is quite light in terms of commits, would it make sense to merge #168 first and then refactor + rebase this on top (+ add tests)? Also happy to do it the other way around, whichever is easiest!
If #168 is ready to go, go ahead and merge and then I'll correct this PR. I've also been messing around with optimizations for running the n-gram analysis on large datasets. I've been using a dataset that Cameron provided with ~5m rows.
Ok, thanks, will do shortly!
You're welcome to open an issue for it as well. I'm curious about it!
…gram sizes
- Add configurable min_ngram_size and max_ngram_size parameters to ngrams_base analyzer
- Implement custom tokenizer that generates n-grams within specified size range
- Update ngram_stats analyzer to handle variable n-gram sizes
- Enhance interface definitions with new tokenizer parameters
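As a rough sketch of how size-ranged n-gram generation can work (illustrative only; the parameter names follow the commit above, but this is not the analyzer's actual implementation):

```python
# Illustrative n-gram generation over a configurable size range; the
# min_ngram_size/max_ngram_size names mirror the commit, not real code.
from typing import Iterator


def generate_ngrams(
    tokens: list[str], min_ngram_size: int = 2, max_ngram_size: int = 3
) -> Iterator[tuple[str, ...]]:
    """Yield every n-gram whose length falls within the configured range."""
    for n in range(min_ngram_size, max_ngram_size + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i : i + n])


print(list(generate_ngrams("the quick brown fox".split(), 2, 3)))
# [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox'),
#  ('the', 'quick', 'brown'), ('quick', 'brown', 'fox')]
```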
…andling
- Add RowCountComparer for validating parquet file row counts
- Enhance PrimaryAnalyzerTester with better error handling and progress tracking
- Improve test data validation and comparison utilities
- Add support for multiple test parameter configurations
…le tokenizer configurations
- Add parametrized tests for different min/max n-gram size configurations
- Include test data for default, min1_max3, min2_max4, and min4_max6 scenarios
- Test both ngrams_base and ngram_stats analyzers with new tokenizer parameters
- Update existing test data files to reflect new tokenizer behavior
…acking system
- Add parquet_row_count function for efficient row counting of large parquet files
- Enhance terminal_tools progress tracking with better formatting and display
- Add comprehensive tests for utility functions and progress system
- Improve progress callback handling for long-running operations
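For context on the row-counting piece: parquet files store their row count in the footer metadata, so a helper can report it without scanning any data. A minimal pyarrow-based sketch (the repository's actual parquet_row_count may differ):

```python
# Count rows by reading only the parquet footer metadata (no data scan).
# Sketch only; the real utility's signature and error handling may differ.
import pyarrow.parquet as pq


def parquet_row_count(path: str) -> int:
    return pq.ParquetFile(path).metadata.num_rows
```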
- Update requirements.txt with new dependencies needed for tokenizer implementation
- Update .gitignore to exclude temporary files and test artifacts
Update .ai-context documentation to reflect recent architectural changes:
- Add performance optimization components to symbol reference
  - Memory management strategies (ExternalSortUniqueExtractor)
  - Fallback processors for disk-based processing
  - Performance testing infrastructure documentation
- Add performance architecture section to architecture overview
  - Memory-aware processing with adaptive allocation
  - Tiered processing strategy and system-specific scaling
  - Chunk size optimization patterns
- Update setup guide with pytest-benchmark dependency
- Fix markdown formatting issues (MD032, MD009, MD031)

All cross-references validated against current codebase state. Maintains documentation accuracy for AI assistant effectiveness.
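The ExternalSortUniqueExtractor itself isn't shown in this thread; as background, disk-based unique extraction is commonly built on external sorting: sort bounded chunks, spill them to temporary files, then k-way merge and drop adjacent duplicates. A generic sketch of that technique (not the PR's code):

```python
# Generic external-sort unique extraction: sort fixed-size chunks, spill them
# to disk, k-way merge the sorted spills, and emit each value once.
# Assumes values contain no newlines; this is not the PR's implementation.
import heapq
import tempfile
from typing import Iterable, Iterator


def _spill(sorted_chunk: list[str]) -> str:
    handle = tempfile.NamedTemporaryFile("w", delete=False, encoding="utf-8")
    handle.writelines(f"{value}\n" for value in sorted_chunk)
    handle.close()
    return handle.name


def unique_external(values: Iterable[str], chunk_size: int = 100_000) -> Iterator[str]:
    spill_paths: list[str] = []
    chunk: list[str] = []
    for value in values:
        chunk.append(value)
        if len(chunk) >= chunk_size:
            spill_paths.append(_spill(sorted(chunk)))
            chunk = []
    if chunk:
        spill_paths.append(_spill(sorted(chunk)))

    previous = None
    merged = heapq.merge(*(open(path, encoding="utf-8") for path in spill_paths))
    for line in merged:
        value = line.rstrip("\n")
        if value != previous:  # merged stream is sorted, so duplicates are adjacent
            previous = value
            yield value
```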
…ramework - Update architecture overview with Infrastructure Layer (logging, memory management) - Update symbol reference with MemoryManager auto-detection capabilities - Update setup guide with new dependencies and performance testing instructions - Add comprehensive performance testing framework memory - Update suggested commands with performance testing and benchmarking workflows All documentation verified against actual codebase implementation for accuracy.
I'm curious, how long does it take for you to run the new ngram analysis @JoeKarow and @KristijanArmeni? I tried running the new analysis on my laptop last night and it took 30+ minutes to get to 50% in the processing stage (it also caused my laptop to heat up, which is out of the norm). The old version of this test took less than 30 seconds to process. This may be due to me using a smaller dataset for testing, but if the new version of the analysis is taking 30+ minutes to process a smaller dataset, that tells me that performance still needs work. I'll take a further look at the code later today and see if I can come up with any ideas for ironing this problem out.
On another note though, I do really like the new Rich UI that's implemented for this, @JoeKarow. Definitely feels like a big facelift compared to before.
Before I forget, the documentation you added in docs/dev-guide.md should be in its own separate markdown file. With how the docs are structured in the dev branch, I would relocate the parts about progress reporting to something like docs/guides/contributing/progress-reporting.md, and revise the testing and performance sections to include what you wrote for those two points. After that, you should also be able to remove dev-guide entirely to resolve the merge conflicts.
I fetched this branch locally and ran it. I agree with the progress reporter, pretty nice. Though perhaps even too obsessive; not sure the users need as much (but perhaps once analyses run longer it's a different story). I'll have a look and review this PR shortly!

Screen.Recording.2025-08-11.at.4.29.20.PM.mov
Got it @KristijanArmeni. Those speeds are way faster than what I'm achieving at the moment. If it helps @JoeKarow and @KristijanArmeni, I'm using the Russian trolls dataset. That being said, I think I've identified a few operations that can be optimized for better performance.
Huh, interesting. Can you share the config (i.e. column mapping selection and n-gram parameters) that you used to run the test on russian trolls?
Sure @KristijanArmeni. Here are the settings and mappings I used when running the test. There's something I'm probably doing wrong here, but the selected column mappings worked just fine when using the old version of the ngram test.
I've been using the PCTC dataset. The original size is ~5m rows; I've been testing with a subset of ~1m-2.5m rows for time's sake. I had some downtime while I was out of town this weekend and had some ideas to try and improve it further. Some might have to wait (like looking at implementing a Database Storage adapter with DuckDB). @VEDA95 & @KristijanArmeni - what are your machine specs? Processor, RAM, OS version, exact Python version, SSD or spinning hard drive? Also - where can I find the datasets that you've tried?
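For reference, the DuckDB idea would push aggregation into the database engine, which can scan parquet files directly instead of holding intermediate results in Python memory. A hypothetical sketch (the file name and ngram column are made up for illustration):

```python
# Hypothetical DuckDB aggregation over an analyzer's parquet output;
# 'ngrams_base.parquet' and the 'ngram' column are illustrative names.
import duckdb

con = duckdb.connect()  # in-memory database
top_ngrams = con.execute(
    """
    SELECT ngram, count(*) AS freq
    FROM read_parquet('ngrams_base.parquet')
    GROUP BY ngram
    ORDER BY freq DESC
    LIMIT 20
    """
).fetchall()
print(top_ngrams)
```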
@JoeKarow so I've been playing around with the code and managed to shave a few minutes off by switching to Feather files instead of CSV files for temporary storage. It still seems to spend a lot of time storing those chunks in temp files, though. If you're going to be at the meeting this Wednesday, perhaps we can have a look at this issue together and try to figure out how to improve performance for the batch processing substep.
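For anyone following along, the CSV-to-Feather switch is essentially a change of serialization format for the temporary chunks: Feather is a binary columnar format that round-trips pandas dtypes without re-parsing text. A rough illustration (file and column names are placeholders, not the PR's chunking code):

```python
# Placeholder example of writing one temporary chunk as CSV vs. Feather.
# Feather (via pyarrow) preserves dtypes and avoids text parsing on reload.
import pandas as pd

chunk = pd.DataFrame({"ngram": ["new york", "climate change"], "count": [42, 17]})

chunk.to_csv("chunk_0001.csv", index=False)    # text: re-parsed on every read
restored_csv = pd.read_csv("chunk_0001.csv")

chunk.to_feather("chunk_0001.feather")         # binary columnar: fast round trip
restored_feather = pd.read_feather("chunk_0001.feather")
```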
Also, here are the machine specs you requested for my setup. Laptop (2021 16-inch MacBook Pro):
Desktop (Custom):
This commit implements a genuine Textual+Rich hybrid progress manager that eliminates code duplication and provides true 60fps updates through proper Textual integration.

Key architectural improvements:
- Genuine Textual App integration with Static widgets containing Rich renderables
- True 60fps updates via Textual set_interval (not Rich Live configuration)
- Eliminated ~300 lines of code duplication through ProgressStateManager
- Strategy pattern with ProgressBackend abstraction
- Maintained complete API backward compatibility

Technical details:
- TextualProgressApp uses textual.app.App with daemon threading for CLI compatibility
- ProgressStateManager extracts shared logic from duplicated implementations
- ProgressBackend abstract base class with TextualProgressBackend and RichProgressBackend
- Added ChecklistProgressManager backward compatibility alias
- Updated all test imports and mock specifications

Performance optimizations:
- Memory-aware progress reporting with pressure detection
- Positional insertion API for dynamic step ordering
- Context manager protocol for proper resource management

Testing:
- All 99 tests passing (98 passed, 1 skipped)
- Updated analyzer test files to use ProgressManager imports
- Fixed mock specifications and backward compatibility

Dependencies:
- Added textual==5.3.0 for genuine Textual integration
- Updated rich==14.1.0 for compatibility

This implementation resolves SOLID principle violations and provides a robust foundation for future progress reporting enhancements.
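The classes named in that commit aren't quoted here, but the general Textual pattern it describes (a Static widget holding a Rich renderable, refreshed by a Textual timer rather than Rich Live) looks roughly like this sketch:

```python
# Hedged sketch of the Textual+Rich pattern: a Static widget renders a Rich
# table and a Textual interval timer refreshes it ~60 times per second.
# This is not the PR's TextualProgressApp/ProgressStateManager code.
from rich.table import Table
from textual.app import App, ComposeResult
from textual.widgets import Static


class ProgressDemo(App):
    def compose(self) -> ComposeResult:
        yield Static(id="progress")

    def on_mount(self) -> None:
        self.done = 0
        self.set_interval(1 / 60, self.refresh_progress)

    def refresh_progress(self) -> None:
        self.done = min(self.done + 1, 100)
        table = Table(title="Analysis progress")
        table.add_column("Step")
        table.add_column("Status")
        table.add_row("tokenize", f"{self.done}%")
        self.query_one("#progress", Static).update(table)


if __name__ == "__main__":
    ProgressDemo().run()
```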
…s manager
- Extract main analysis logic to _run_ngram_analysis_with_progress_manager()
- Add context.progress_manager integration for shared progress tracking
- Maintain backward compatibility with standalone progress manager
- Enhance memory-aware processing patterns and fallback strategies
- Improve tokenization and n-gram generation with adaptive chunking
- Add comprehensive error handling and memory pressure detection
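The memory-pressure detection mentioned in that commit isn't shown in the thread; one common approach is to poll system memory with psutil and shrink the batch size when usage crosses a threshold. A hypothetical helper along those lines (names and thresholds are illustrative, not the analyzer's MemoryManager):

```python
# Hypothetical adaptive chunk sizing based on system memory pressure.
import psutil


def adaptive_chunk_size(base: int = 50_000, pressure_threshold: float = 80.0) -> int:
    """Halve the chunk size whenever system memory usage exceeds the threshold."""
    if psutil.virtual_memory().percent > pressure_threshold:
        return max(base // 2, 1_000)
    return base
```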
Just ran this on the small 10k-row Twitter dataset and it looks good on my end.
It's a really big PR, so I won't get a chance to review this before next week, I'm afraid. Does that work? In general, and given the size, if there are no critical issues here I'd just consider merging once reviewed and work on any efficiency improvements (spotted by @VEDA95) for the next release or so.

ngram-rich.mov
This PR got a bit too unwieldy - I'll be stripping this branch for parts and resubmitting smaller, more focused PRs.



Pull Request type
Please check the type of change your PR introduces:
What is the current behavior?
The n-gram analyzer had some limitations that made it challenging to use with larger datasets and different text types:
Issue Number: #151, #171, #181
What is the new behavior?
This branch ended up being much bigger than originally planned! Here's what's new:
Performance & Memory Improvements
Enhanced User Experience
Advanced Tokenization System
- Handles hashtags (#climate), mentions (@user), and URLs as single tokens instead of splitting them
- Tokenizer moved into app/utils.py so other analyzers can use it
- Handles punctuation sequences such as ... and !!!

Testing & Documentation
Does this introduce a breaking change?
All existing functionality works exactly the same. The improvements are mostly "under the hood" - better performance, more reliable operation, and better feedback. The tokenization improvements are backwards compatible.
Other information
Scope Creep Confession
This started as a simple tokenizer fix but turned into a comprehensive performance and infrastructure overhaul. While working on memory issues, it became clear the whole system needed modernization.
What This Means for Users
Technical Highlights
Testing Coverage
All existing tests pass, plus extensive new test coverage for:
Performance Improvements
Users should see:
Tokenization Examples
The enhanced tokenization system now properly handles:
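As one hedged illustration (a simple regex approach, not the PR's actual tokenizer in app/utils.py), keeping hashtags, mentions, and URLs intact as single tokens can look like:

```python
# Illustrative regex tokenizer that keeps URLs, hashtags, and mentions whole;
# the PR's app/utils.py implementation may differ.
import re

TOKEN_PATTERN = re.compile(
    r"""
    https?://\S+        # URLs
    | \#\w+             # hashtags such as #climate
    | @\w+              # mentions such as @user
    | \w+(?:'\w+)?      # ordinary words, keeping contractions together
    """,
    re.VERBOSE,
)


def tokenize(text: str) -> list[str]:
    return TOKEN_PATTERN.findall(text.lower())


print(tokenize("Read https://example.com about #climate with @user, don't split!"))
# ['read', 'https://example.com', 'about', '#climate', 'with', '@user', "don't", 'split']
```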
This ended up being a foundational upgrade that will make future development much easier and more reliable. The abstracted tokenization system and performance framework provide a solid base for additional analyzers and improvements.
Related Work
This contributes to the broader Rich library integration effort: