
Create CharsetDetector #148

Merged
viktorbeck98 merged 14 commits into development from feature/charset-detector
May 19, 2026

Conversation

@ernstleierzopf
Contributor

Fixes #59

@ernstleierzopf ernstleierzopf changed the base branch from main to development May 13, 2026 06:05
Comment thread src/detectmatelibrary/detectors/charset_detector.py Dismissed
Comment thread src/detectmatelibrary/detectors/charset_detector.py Dismissed
ernstleierzopf and others added 13 commits May 13, 2026 08:10
Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>

Branch-free dispatch: either set.update or set.add is bound once, at
construction. This enables character-set accumulation for charset-style
detectors without a per-call isinstance check.
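
A minimal sketch of the bind-at-construction idea described in this commit message. The class name `CharAccumulator` and the method `add_value` are hypothetical stand-ins and are not taken from the library; only `expand_value`, `unique_set`, `set.update` and `set.add` appear in this thread.

```python
from typing import Set


class CharAccumulator:
    """Hypothetical stand-in for a tracker that owns a unique-value set."""

    def __init__(self, expand_value: bool = False) -> None:
        self.expand_value = expand_value
        self.unique_set: Set[str] = set()
        # Pick the accumulation method once. set.update(str) adds every
        # character of the string; set.add(str) adds the string as one item.
        self._accumulate = (
            self.unique_set.update if expand_value else self.unique_set.add
        )

    def add_value(self, value: str) -> None:
        # No branch and no isinstance check on the hot path.
        self._accumulate(value)


whole = CharAccumulator()
whole.add_value("abc")

chars = CharAccumulator(expand_value=True)
chars.add_value("abc")
```

One consequence of this design is that the bound method refers to the specific set object that existed at construction, so code that replaces `unique_set` wholesale would also need to rebind `_accumulate`.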

to_state/from_state preserve the expand_value flag so reloaded trackers
continue to use the same accumulation strategy. Old snapshots without the
key default to False (backward compatible).
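
The backward-compatible default can be sketched as follows. The class `TrackerSnapshotDemo` and the snapshot layout (a dict with `unique_set` and `expand_value` keys) are assumptions made for illustration; the real snapshot format is not shown in this thread.

```python
from typing import Any, Dict, Set


class TrackerSnapshotDemo:
    """Hypothetical tracker with a serialisable state."""

    def __init__(self, expand_value: bool = False) -> None:
        self.expand_value = expand_value
        self.unique_set: Set[str] = set()

    def to_state(self) -> Dict[str, Any]:
        # Store the flag next to the data so a reload can restore it.
        return {
            "unique_set": sorted(self.unique_set),
            "expand_value": self.expand_value,
        }

    @classmethod
    def from_state(cls, state: Dict[str, Any]) -> "TrackerSnapshotDemo":
        # Snapshots written before the flag existed lack the key,
        # so .get falls back to False.
        tracker = cls(expand_value=state.get("expand_value", False))
        tracker.unique_set = set(state["unique_set"])
        return tracker


restored = TrackerSnapshotDemo.from_state(
    TrackerSnapshotDemo(expand_value=True).to_state()
)
legacy = TrackerSnapshotDemo.from_state({"unique_set": ["abc"]})
```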

A closure factory threads the flag down to each per-variable
SingleStabilityTracker without touching MultiTracker or EventTracker.
The flag is an explicit named parameter at the public boundary.
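
A rough illustration of how a closure can carry the flag to per-variable trackers while the container stays unaware of it. All class and function names ending in `Sketch`, plus `make_tracker_factory`, are hypothetical; the real MultiTracker and SingleStabilityTracker interfaces are not shown in this thread.

```python
from typing import Callable, Dict, Set


class SingleStabilityTrackerSketch:
    """Hypothetical per-variable tracker."""

    def __init__(self, expand_value: bool = False) -> None:
        self.expand_value = expand_value
        self.unique_set: Set[str] = set()

    def add_value(self, value: str) -> None:
        if self.expand_value:
            self.unique_set.update(value)  # accumulate characters
        else:
            self.unique_set.add(value)  # accumulate whole values


def make_tracker_factory(
    expand_value: bool = False,
) -> Callable[[], SingleStabilityTrackerSketch]:
    # The inner function closes over expand_value, so callers only ever
    # see a zero-argument factory.
    def make_tracker() -> SingleStabilityTrackerSketch:
        return SingleStabilityTrackerSketch(expand_value=expand_value)

    return make_tracker


class MultiTrackerSketch:
    """Hypothetical container; it never learns about expand_value."""

    def __init__(self, factory: Callable[[], SingleStabilityTrackerSketch]) -> None:
        self._factory = factory
        self.trackers: Dict[str, SingleStabilityTrackerSketch] = {}

    def add(self, name: str, value: str) -> None:
        if name not in self.trackers:
            self.trackers[name] = self._factory()
        self.trackers[name].add_value(value)


multi = MultiTrackerSketch(make_tracker_factory(expand_value=True))
multi.add("user", "ab")
multi.add("user", "bc")
```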

EventTracker.load previously called EventTracker.__init__ directly,
bypassing subclass closures (e.g. EventStabilityTracker's expand_value
factory). Newly encountered variables after a reload therefore silently
used default semantics. Route subclasses through cls(**kwargs) so the
factory is rebuilt; the base class path is unchanged.

The cls-is-EventTracker branch is the legacy reconstruction path;
subclasses go through cls(**kwargs) so closure-based factories
(like EventStabilityTracker's expand_value) can be rebuilt.
Subclasses must therefore accept event_data_kwargs and must not require
positional arguments in __init__.
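
The load-path change described in the two commit messages above might look roughly like this. Everything here is a simplified assumption: the `Sketch` classes, the state layout, the string-returning factory, and the `expand_value=True` default on the subclass exist only to make the difference observable.

```python
from typing import Any, Callable, Dict, Optional


class EventTrackerSketch:
    """Hypothetical base tracker; the real EventTracker may differ."""

    def __init__(
        self,
        tracker_factory: Optional[Callable[[], str]] = None,
        event_data_kwargs: Optional[Dict[str, Any]] = None,
    ) -> None:
        # The factory returns a label so its semantics are easy to inspect.
        self.tracker_factory = tracker_factory or (lambda: "default")
        self.event_data_kwargs = event_data_kwargs or {}

    @classmethod
    def load(cls, state: Dict[str, Any]) -> "EventTrackerSketch":
        kwargs = {"event_data_kwargs": state.get("event_data_kwargs")}
        if cls is EventTrackerSketch:
            # Legacy reconstruction path for the base class.
            obj = cls.__new__(cls)
            EventTrackerSketch.__init__(obj, **kwargs)
            return obj
        # Subclass path: running cls(**kwargs) rebuilds the closure factory.
        return cls(**kwargs)


class EventStabilityTrackerSketch(EventTrackerSketch):
    def __init__(
        self,
        expand_value: bool = True,
        event_data_kwargs: Optional[Dict[str, Any]] = None,
    ) -> None:
        super().__init__(
            tracker_factory=lambda: f"expand_value={expand_value}",
            event_data_kwargs=event_data_kwargs,
        )


def load_bypassing_subclass(cls: type, state: Dict[str, Any]) -> Any:
    """Earlier behaviour: only the base __init__ runs, so the closure is lost."""
    obj = cls.__new__(cls)
    EventTrackerSketch.__init__(obj, event_data_kwargs=state.get("event_data_kwargs"))
    return obj


state = {"event_data_kwargs": {"source": "demo"}}
old = load_bypassing_subclass(EventStabilityTrackerSketch, state)
new = EventStabilityTrackerSketch.load(state)
```

In this sketch `old.tracker_factory()` yields the default label while `new.tracker_factory()` yields the subclass label, which mirrors the silent fallback the commit message describes.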

Wires expand_value=True for the main persistency so unique_set accumulates
chars. auto_conf_persistency keeps the default (whole-value stability) for
variable selection. Also adds the missing _register_persistency call so
the trained state participates in persist/load.
- Replace nested any(c in x for x in unique_set) loop with set(v) - unique_set
- Combine all unknown chars per variable into a single alert message
- Remove ignore_non_string_val config field and sys.exit guard (let
  upstream type errors surface naturally if a non-string slips through)
- Strip ignore_non_string_val from test configs and pipeline_config_default.yaml
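
The first two bullets above can be sketched with a plain set difference. The helper names `unknown_chars` and `build_alerts`, the `row`/`known` shapes, and the alert wording are assumptions for illustration; only `set(v) - unique_set` comes from the commit message.

```python
from typing import Dict, List, Set


def unknown_chars(value: str, unique_set: Set[str]) -> Set[str]:
    # One set difference replaces a nested per-character membership loop.
    return set(value) - unique_set


def build_alerts(row: Dict[str, str], known: Dict[str, Set[str]]) -> List[str]:
    """Emit at most one alert per variable, listing all unknown characters."""
    alerts: List[str] = []
    for name, value in row.items():
        unseen = unknown_chars(value, known.get(name, set()))
        if unseen:
            # sorted() makes the message deterministic.
            alerts.append(f"{name}: unknown characters {sorted(unseen)}")
    return alerts


known = {"user": set("abc"), "host": set("xyz")}
alerts = build_alerts({"user": "ab$c!", "host": "zyx"}, known)
```

Because the difference is computed on `set(value)`, a character that repeats in the value is reported once, and a non-string value would fail at `set(value)` or at the difference rather than being silently skipped.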
from_dict replaces the entire config object during auto-config rebuild,
which previously wiped any persist setting set by an earlier config
load. Save and restore old_persist, matching the pattern in
NewValueDetector.set_configuration().
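
A small sketch of the save-and-restore pattern this commit message refers to. `DetectorConfigSketch`, `DetectorSketch`, the `variables` field and the string-valued `persist` setting are hypothetical; the real config class and the type of its persist setting are not shown in this thread.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional, Tuple


@dataclass
class DetectorConfigSketch:
    """Hypothetical config object."""

    variables: Tuple[str, ...] = ()
    persist: Optional[str] = None

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "DetectorConfigSketch":
        # Builds a brand-new object; anything absent from data is reset.
        return cls(
            variables=tuple(data.get("variables", ())),
            persist=data.get("persist"),
        )


class DetectorSketch:
    def __init__(self, config: DetectorConfigSketch) -> None:
        self.config = config

    def set_configuration(self, data: Dict[str, Any]) -> None:
        # Save the earlier persist setting, rebuild, then restore it.
        old_persist = self.config.persist
        self.config = DetectorConfigSketch.from_dict(data)
        self.config.persist = old_persist


detector = DetectorSketch(DetectorConfigSketch(persist="state.json"))
detector.set_configuration({"variables": ["user"]})
```

This sketch restores the old value unconditionally; whether the real code does so, or prefers a persist value present in the new dict, is not stated here.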
Type[SingleTracker] only accepts class objects, but closure factories
(e.g. EventStabilityTracker.make_tracker) are equally valid producers.
Widening to Callable[[], SingleTracker] removes the type: ignore at the
EventStabilityTracker call site and documents the real contract.
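
The typing change can be illustrated with a toy example. The `SingleTracker` class below is a local stand-in, and `make_factory`, `build_from_type` and `build_from_callable` are hypothetical helpers. The point is that a class with a zero-argument constructor and a closure both satisfy `Callable[[], SingleTracker]`, whereas only the class satisfies `Type[SingleTracker]`.

```python
from typing import Callable, Type


class SingleTracker:
    """Local stand-in for the library's tracker base class."""

    def __init__(self, expand_value: bool = False) -> None:
        self.expand_value = expand_value


def make_factory(expand_value: bool) -> Callable[[], SingleTracker]:
    # Closure factory: a function object, not a class object.
    def make_tracker() -> SingleTracker:
        return SingleTracker(expand_value=expand_value)

    return make_tracker


def build_from_type(tracker_type: Type[SingleTracker]) -> SingleTracker:
    # A type checker would reject a closure passed here.
    return tracker_type()


def build_from_callable(factory: Callable[[], SingleTracker]) -> SingleTracker:
    # Accepts classes and closures alike.
    return factory()


from_class = build_from_callable(SingleTracker)
from_closure = build_from_callable(make_factory(expand_value=True))
```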

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@viktorbeck98 viktorbeck98 merged commit 5a6b70b into development May 19, 2026
4 checks passed
@viktorbeck98 viktorbeck98 deleted the feature/charset-detector branch May 19, 2026 15:20
@viktorbeck98 viktorbeck98 restored the feature/charset-detector branch May 20, 2026 14:07
@viktorbeck98 viktorbeck98 deleted the feature/charset-detector branch May 20, 2026 14:08


Development

Successfully merging this pull request may close these issues.

Add CharsetDetector

2 participants