Conversation

@JoeKarow
Collaborator

@JoeKarow JoeKarow commented Jul 10, 2025

Pull Request type

Please check the type of change your PR introduces:

  • Bugfix
  • Feature
  • Code style update (formatting, renaming)
  • Refactoring (no functional changes, no API changes)
  • Build-related changes
  • Documentation content changes
  • Other (please describe):

What is the current behavior?

The n-gram analyzer had some limitations that made it challenging to use with larger datasets and different text types:

  • Would run out of memory on datasets over 1-2GB
  • No progress feedback during long-running analyses (users couldn't tell if it was working or stuck)
  • Fixed memory limits regardless of system capabilities
  • Basic tokenization that didn't handle punctuation well
  • Limited error handling and logging
  • Tokenization logic was embedded in the analyzer, making it hard to reuse

Issue Number: #151, #171, #181

What is the new behavior?

This branch ended up being much bigger than originally planned! Here's what's new:

Performance & Memory Improvements

  • Smart memory handling: The analyzer now automatically detects your system's RAM and adjusts accordingly. High-memory systems get better performance, low-memory systems stay safe.
  • Handles large datasets: Processes multi-gigabyte files without running out of memory using streaming techniques
  • Adaptive chunking: Processing chunks are sized based on your system's capabilities and the dataset size for optimal performance (see the sketch below)
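
For illustration, here is a minimal sketch of memory-aware chunk sizing, assuming psutil for RAM detection; the branch's actual MemoryManager implementation may differ:

```python
import psutil  # assumed dependency for system memory detection


def adaptive_chunk_size(
    row_size_bytes: int = 512,   # assumed average row footprint
    memory_fraction: float = 0.25,
    min_rows: int = 10_000,
    max_rows: int = 1_000_000,
) -> int:
    """Size processing chunks from the RAM currently available.

    High-memory systems get larger (faster) chunks; low-memory
    systems get smaller chunks and stay safe.
    """
    available = psutil.virtual_memory().available
    budget = int(available * memory_fraction)
    return max(min_rows, min(budget // row_size_bytes, max_rows))


# Usage: stream a large CSV in memory-sized chunks with pandas.
# import pandas as pd
# for chunk in pd.read_csv("big.csv", chunksize=adaptive_chunk_size()):
#     process(chunk)
```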

Enhanced User Experience

  • Real-time progress reporting: You can see exactly what's happening during analysis with detailed progress bars, sub-steps, and time estimates
  • Better error messages: Clear, actionable error messages with context
  • Comprehensive logging: All operations are logged with structured JSON logs that rotate automatically. Great for debugging issues (see the sketch below).
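
As a rough illustration of the logging setup described above, structured JSON logs with rotation can be built from the standard library alone; the field names here are assumptions, not the branch's actual schema:

```python
import json
import logging
from logging.handlers import RotatingFileHandler


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })


handler = RotatingFileHandler("analysis.log", maxBytes=5_000_000, backupCount=3)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("ngram_analyzer")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("chunk processed")  # rotates automatically past 5MB
```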

Advanced Tokenization System

  • Smart entity preservation: Keeps hashtags (#climate), mentions (@user), and URLs as single tokens instead of splitting them
  • Multi-language support: Automatic detection and proper handling of CJK (Chinese, Japanese, Korean) characters alongside Latin text
  • Mixed script handling: Handles text that combines different writing systems (e.g., "iPhone用户" keeps "iPhone" as one token, processes Chinese characters appropriately)
  • Abstracted tokenization engine: Moved tokenization logic to app/utils.py so other analyzers can use it (a simplified sketch follows this list)
  • Configurable parameters: Min/max n-gram sizes are now user-configurable (1-15 range)
  • Punctuation filtering: Filters out pure-punctuation n-grams like "..." and "!!!"
  • Social media aware: Designed specifically for social media text with proper handling of platform-specific elements
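
A simplified, regex-based sketch of the tokenizer behavior described in this list; the real implementation in app/utils.py handles more cases (possessives, punctuation filtering, etc.):

```python
import re

# Alternation order matters: URLs first so they aren't split, then
# mentions/hashtags, then Latin word runs, then single CJK characters.
TOKEN_PATTERN = re.compile(
    r"https?://\S+"        # URLs kept as single tokens
    r"|[@#]\w+"            # @mentions and #hashtags kept whole
    r"|[A-Za-z0-9]+"       # Latin words and numbers
    r"|[\u4e00-\u9fff\u3040-\u30ff\uac00-\ud7af]"  # one token per CJK char
)


def tokenize(text: str) -> list[str]:
    return TOKEN_PATTERN.findall(text)


def ngrams(tokens: list[str], min_n: int = 1, max_n: int = 3) -> list[tuple[str, ...]]:
    """All n-grams with sizes in the user-configured range."""
    return [
        tuple(tokens[i : i + n])
        for n in range(min_n, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


print(tokenize("#climate change! iPhone用户"))
# ['#climate', 'change', 'iPhone', '用', '户']
```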

Testing & Documentation

  • Performance testing framework: Added extensive benchmarking and performance tests to prevent regressions and validate improvements
  • Enhanced documentation: Updated all docs, guides, and AI assistant contexts to reflect the new capabilities
  • Memory testing: Comprehensive tests for different system configurations

Does this introduce a breaking change?

  • Yes
  • No

All existing functionality works exactly the same. The improvements are mostly "under the hood" - better performance, more reliable operation, and better feedback. The tokenization improvements are backwards compatible.

Other information

Scope Creep Confession

This started as a simple tokenizer fix but turned into a comprehensive performance and infrastructure overhaul. While working on memory issues, it became clear the whole system needed modernization.

What This Means for Users

  • Analyses that used to fail on large datasets now work reliably
  • No more wondering if the program is stuck - you get real-time feedback with progress bars
  • Better performance on high-memory systems, safer operation on low-memory systems
  • Much better debugging when things go wrong with structured logging
  • More reliable operation overall with comprehensive error handling
  • Better results from social media text with proper handling of hashtags, mentions, and mixed languages

Technical Highlights

  • 94 commits across 73 files
  • 16,000+ lines of new code (performance systems, testing, documentation)
  • Added comprehensive performance testing suite with benchmarking
  • Implemented memory-aware processing with automatic system detection
  • Enhanced progress reporting with hierarchical display and sub-steps
  • Application-wide structured JSON logging system
  • Abstracted tokenization engine for reuse across analyzers
  • Updated all documentation and AI assistant contexts

Testing Coverage

All existing tests pass, plus extensive new test coverage for:

  • Performance benchmarking and memory management
  • Progress reporting reliability (54+ test methods)
  • Tokenization improvements and edge cases (hashtags, mentions, URLs, CJK text)
  • Memory detection across different system configurations
  • Integration testing for the complete pipeline

Performance Improvements

Users should see:

  • 2-4x faster processing on medium datasets
  • 5-10x reduction in I/O operations
  • Better memory utilization that scales with system capabilities
  • No more memory crashes on large datasets

Tokenization Examples

The enhanced tokenization system now properly handles:

Input: "Check out @user's post about #climate change! https://example.com/article iPhone用户很喜欢"

Old tokenization (simple whitespace split):
["Check", "out", "@user's", "post", "about", "#climate", "change!", "https://example.com/article", "iPhone用户很喜欢"]

New tokenization (smart entity and script handling):
["Check", "out", "@user", "post", "about", "#climate", "change", "https://example.com/article", "iPhone", "用", "户", "很", "喜", "欢"]

This ended up being a foundational upgrade that will make future development much easier and more reliable. The abstracted tokenization system and performance framework provide a solid base for additional analyzers and improvements.

Related Work

This contributes to the broader Rich library integration effort:

…-gram analysis

Signed-off-by: Joe Karow <58997957+JoeKarow@users.noreply.github.com>
@JoeKarow JoeKarow changed the base branch from main to develop July 10, 2025 17:28
@KristijanArmeni
Collaborator

@JoeKarow This is wonderful! Sorry to mess up here, but I just added a PR with ngram tests and a small refactor for ngram analyzers in #168. Since this PR is quite light in terms of commits, would it make sense to merge #168 first and then refactor + rebase this on top (+ add tests)?

Also happy to do the other way around, whichever is easiest!

@JoeKarow
Collaborator Author

@JoeKarow This is wonderful! Sorry to mess up here, but I just added a PR with ngram tests and a small refactor for ngram analyzers in #168. Since this PR is quite light in terms of commits, would it make sense to merge #168 first and then refactor + rebase this on top (+ add tests)?

Also happy to do the other way around, whichever is easiest!

If #168 is ready to go, go ahead and merge and then I'll correct this PR. I've also been messing around with optimizations for running the n-gram analysis on large datasets. I've been using a dataset that Cameron provided with ~5m rows.

@KristijanArmeni
Collaborator

If #168 is ready to go, go ahead and merge and then I'll correct this PR.

Ok, thanks, will do shortly!

I've also been messing around with optimizations for running the n-gram analysis on large datasets.

You're welcome to open an issue for it as well. I'm curious about it!

@JoeKarow JoeKarow self-assigned this Jul 23, 2025
@JoeKarow JoeKarow marked this pull request as draft July 23, 2025 19:07
JoeKarow added 6 commits July 28, 2025 16:38
…gram sizes

- Add configurable min_ngram_size and max_ngram_size parameters to ngrams_base analyzer
- Implement custom tokenizer that generates n-grams within specified size range
- Update ngram_stats analyzer to handle variable n-gram sizes
- Enhance interface definitions with new tokenizer parameters
…andling

- Add RowCountComparer for validating parquet file row counts
- Enhance PrimaryAnalyzerTester with better error handling and progress tracking
- Improve test data validation and comparison utilities
- Add support for multiple test parameter configurations
…le tokenizer configurations

- Add parametrized tests for different min/max n-gram size configurations
- Include test data for default, min1_max3, min2_max4, and min4_max6 scenarios
- Test both ngrams_base and ngram_stats analyzers with new tokenizer parameters
- Update existing test data files to reflect new tokenizer behavior
…acking system

- Add parquet_row_count function for efficient row counting of large parquet files (sketched below)
- Enhance terminal_tools progress tracking with better formatting and display
- Add comprehensive tests for utility functions and progress system
- Improve progress callback handling for long-running operations
- Update requirements.txt with new dependencies needed for tokenizer implementation
- Update .gitignore to exclude temporary files and test artifacts
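
For context, counting parquet rows can be done from file metadata alone, without reading any column data; a minimal sketch of that approach with pyarrow (the branch's actual parquet_row_count may differ):

```python
import pyarrow.parquet as pq


def parquet_row_count(path: str) -> int:
    """Row count from the parquet footer; no data pages are read."""
    return pq.ParquetFile(path).metadata.num_rows
```
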
@VEDA95 VEDA95 requested a review from KristijanArmeni August 7, 2025 05:06
Update .ai-context documentation to reflect recent architectural changes:

- Add performance optimization components to symbol reference
  - Memory management strategies (ExternalSortUniqueExtractor)
  - Fallback processors for disk-based processing
  - Performance testing infrastructure documentation
- Add performance architecture section to architecture overview
  - Memory-aware processing with adaptive allocation
  - Tiered processing strategy and system-specific scaling
  - Chunk size optimization patterns
- Update setup guide with pytest-benchmark dependency
- Fix markdown formatting issues (MD032, MD009, MD031)

All cross-references validated against current codebase state.
Maintains documentation accuracy for AI assistant effectiveness.
Signed-off-by: Joe Karow <58997957+JoeKarow@users.noreply.github.com>
…ramework

- Update architecture overview with Infrastructure Layer (logging, memory management)
- Update symbol reference with MemoryManager auto-detection capabilities
- Update setup guide with new dependencies and performance testing instructions
- Add comprehensive performance testing framework memory
- Update suggested commands with performance testing and benchmarking workflows

All documentation verified against actual codebase implementation for accuracy.
Signed-off-by: Joe Karow <58997957+JoeKarow@users.noreply.github.com>
Signed-off-by: Joe Karow <58997957+JoeKarow@users.noreply.github.com>
Collaborator

@VEDA95 VEDA95 left a comment


I'm curious, how long does it take for you to run the new ngram analysis @JoeKarow and @KristijanArmeni? I tried running the new analysis on my laptop last night and it took 30+ minutes to reach 50% in the processing stage (it also caused my laptop to heat up, which is out of the norm). The old version of this analysis took less than 30 seconds to process. This may be because I'm using a smaller dataset to test, but if the new version of the analysis takes 30+ minutes to process a smaller dataset, that tells me performance still needs work. I'll take a further look at the code later today and see if I can come up with any ideas for ironing this problem out.

@VEDA95
Collaborator

VEDA95 commented Aug 10, 2025

On another note, though, I do really like the new Rich UI that's implemented for this @JoeKarow. It definitely feels like a big facelift compared to before.

Collaborator

@VEDA95 VEDA95 left a comment


Before I forget: the documentation you added in docs/dev-guide.md should be in its own separate markdown file. With how the docs are structured in the dev branch, I would relocate the parts about progress reporting to something like docs/guides/contributing/progress-reporting.md, and revise the testing and performance sections to include what you wrote for those two points. After that you should also be able to remove dev-guide.md entirely to resolve the merge conflicts.

@KristijanArmeni
Collaborator

KristijanArmeni commented Aug 11, 2025

I'm curious, how long does it take for you to run the new ngram analysis @JoeKarow and @KristijanArmeni?

I fetched this branch locally and ran it on the 24004_Twitter10kSample, and it runs fine, definitely within what's expected (see video below for params etc.).

I agree about the progress reporter, pretty nice. Though perhaps it's even too obsessive; I'm not sure users need as much (but perhaps once analyses run longer it's a different story).

I'll have a look and review this PR shortly!

Screen.Recording.2025-08-11.at.4.29.20.PM.mov

@VEDA95
Collaborator

VEDA95 commented Aug 11, 2025

Got it @KristijanArmeni. Those speeds are way faster than what I'm achieving at the moment. If it helps @JoeKarow and @KristijanArmeni, I'm using the confirmed_russia_trolls_tweets dataset. Furthermore, I haven't even tried the much bigger test datasets we have, so this is something that needs to be optimized before releasing into the wild. I'm currently running the test on my desktop to see if the performance is comparable to what I'm seeing on my laptop. So far they match up pretty closely.

That being said, I think I've identified a few operations that can be optimized for better performance.

@KristijanArmeni
Collaborator

and @KristijanArmeni, I'm using the confirmed_russia_trolls_tweets dataset

Huh, interesting. Can you share the config (i.e., column mapping selection and n-gram parameters) that you used to run the test on the Russian trolls dataset?

@VEDA95
Collaborator

VEDA95 commented Aug 11, 2025

The stage I've been stuck at is the Processing data batches subsection. I can do another review pointing out the pieces of code that could be optimized, but the one thing that stands out is storing the temporary chunks as CSV files. It feels like it would be more efficient to store these chunks as binary data.

screenshot_11082025_170920

@VEDA95
Collaborator

VEDA95 commented Aug 11, 2025

Sure @KristijanArmeni. Here are the settings and mappings I used when running the test. There's probably something I'm doing wrong here, but the selected column mappings worked just fine with the old version of the ngram analysis.

screenshot_11082025_172200 screenshot_11082025_172237

@JoeKarow
Collaborator Author

I've been using the PCTC dataset. The original size is ~5m rows. I've been testing with a subset of ~1m-2.5m rows for time's sake. I had some downtime while I was out of town this weekend and had some ideas to try to improve it further. Some might have to wait (like looking at implementing a database storage adapter with DuckDB).

@VEDA95 & @KristijanArmeni - what are your machine specs? Processor, RAM, OS version, exact Python version, SSD or spinny hard drive?

Also - where can I find the datasets that you've tried?

@VEDA95
Collaborator

VEDA95 commented Aug 12, 2025

@JoeKarow so I've been playing around with the code and managed to shave a few minutes off by switching to Feather files instead of CSV files for temporary storage. It still seems to spend a lot of time storing those chunks in temp files, though. If you're going to be at the meeting this Wednesday, perhaps we can have a look at this issue together and try to figure out how to improve performance for the batch processing substep.
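
For reference, the CSV-to-Feather change amounts to swapping the temp-chunk serialization, roughly like this with pandas (the file names here are illustrative):

```python
import pandas as pd

chunk = pd.DataFrame({"token": ["#climate", "change"], "count": [3, 1]})

# Before: text serialization - every value parsed/formatted on round trip.
chunk.to_csv("chunk_0001.csv", index=False)
restored = pd.read_csv("chunk_0001.csv")

# After: Feather (Arrow IPC) - binary and columnar, much cheaper I/O.
chunk.to_feather("chunk_0001.feather")  # requires pyarrow
restored = pd.read_feather("chunk_0001.feather")
```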

@VEDA95
Collaborator

VEDA95 commented Aug 12, 2025

Also, here are the machine specs you requested for my setup.

Laptop (2021 16-inch Macbook Pro):

  • CPU: Apple M1 Max (8 performance, 2 efficiency cores)
  • GPU: 24 GPU cores
  • RAM: 32GB
  • Storage: 1TB SSD
  • OS: macOS Sequoia 15.6

Desktop (Custom):

  • CPU: Intel Core i7-8700K (6 cores, 12 threads)
  • GPU: AMD Radeon 7900XT
  • RAM: 32GB DDR4-2133
  • Storage: 256GB NVMe SSD (for OS and apps), plus 2TB and 1TB 7200RPM HDDs for everything else
  • OS: Fedora Workstation 42

Signed-off-by: Joe Karow <58997957+JoeKarow@users.noreply.github.com>
This commit implements a genuine Textual+Rich hybrid progress manager that
eliminates code duplication and provides true 60fps updates through proper
Textual integration.

Key architectural improvements:
- Genuine Textual App integration with Static widgets containing Rich renderables
- True 60fps updates via Textual set_interval (not Rich Live configuration)
- Eliminated ~300 lines of code duplication through ProgressStateManager
- Strategy pattern with ProgressBackend abstraction (sketched below)
- Maintained complete API backward compatibility

Technical details:
- TextualProgressApp uses textual.app.App with daemon threading for CLI compatibility
- ProgressStateManager extracts shared logic from duplicated implementations
- ProgressBackend abstract base class with TextualProgressBackend and RichProgressBackend
- Added ChecklistProgressManager backward compatibility alias
- Updated all test imports and mock specifications

Performance optimizations:
- Memory-aware progress reporting with pressure detection
- Positional insertion API for dynamic step ordering
- Context manager protocol for proper resource management

Testing:
- All 99 tests passing (98 passed, 1 skipped)
- Updated analyzer test files to use ProgressManager imports
- Fixed mock specifications and backward compatibility

Dependencies:
- Added textual==5.3.0 for genuine Textual integration
- Updated rich==14.1.0 for compatibility

This implementation resolves SOLID principle violations and provides a robust
foundation for future progress reporting enhancements.
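
A minimal sketch of the strategy pattern this commit names, reusing its class names; the real interfaces in the branch are certainly richer:

```python
from abc import ABC, abstractmethod


class ProgressBackend(ABC):
    """Strategy interface: how progress updates are rendered."""

    @abstractmethod
    def update(self, step: str, completed: int, total: int) -> None: ...


class RichProgressBackend(ProgressBackend):
    def update(self, step: str, completed: int, total: int) -> None:
        print(f"[rich] {step}: {completed}/{total}")  # stand-in for Rich rendering


class TextualProgressBackend(ProgressBackend):
    def update(self, step: str, completed: int, total: int) -> None:
        print(f"[textual] {step}: {completed}/{total}")  # stand-in for a Textual widget


class ProgressStateManager:
    """Shared progress state/logic; rendering delegated to the backend."""

    def __init__(self, backend: ProgressBackend) -> None:
        self.backend = backend

    def advance(self, step: str, completed: int, total: int) -> None:
        self.backend.update(step, completed, total)


ProgressStateManager(RichProgressBackend()).advance("Processing data batches", 5, 10)
```
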
@github-actions

github-actions bot commented Aug 13, 2025

PR Preview Action v1.6.2

🚀 View preview at
https://civictechdc.github.io/mango-tango-cli/pr-preview/pr-166/

Built to branch gh-pages at 2025-08-14 14:48 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

Signed-off-by: Joe Karow <58997957+JoeKarow@users.noreply.github.com>
…s manager

- Extract main analysis logic to _run_ngram_analysis_with_progress_manager()
- Add context.progress_manager integration for shared progress tracking
- Maintain backward compatibility with standalone progress manager
- Enhance memory-aware processing patterns and fallback strategies
- Improve tokenization and n-gram generation with adaptive chunking
- Add comprehensive error handling and memory pressure detection
@KristijanArmeni
Collaborator

Just ran this on the small 10k-row Twitter dataset and it looks good on my end.

  • Performance: It seems to run in roughly the same time as the current develop version, but I guess this difference only shows up with very large datasets? (I don't have the PCTC dataset locally.) I'd love to better understand the chunking strategy though!
  • Rich progress: I still suspect it's maybe even too verbose for a typical use case, but definitely better to have it report what it's doing under the hood!

It's a really big PR, so I won't get a chance to review this before next week, I'm afraid. Does that work? In general, and given the size, if there are no critical issues here I'd just consider merging once reviewed and work on any efficiency improvements (spotted by @VEDA95) for the next release or so.

Preview

ngram-rich.mov

@JoeKarow JoeKarow added this to the v0.9.0 milestone Aug 14, 2025
@JoeKarow JoeKarow removed this from the v0.9.0 milestone Sep 10, 2025
@JoeKarow
Collaborator Author

This PR got a bit too unwieldy - I'll be stripping this branch for parts and resubmitting smaller, more focused PRs.

@JoeKarow JoeKarow closed this Sep 11, 2025