badwords: move to ./scripts for easier local execution #20869
🤖 Augment PR Summary: Moves the “badwords” checker into
@vszakats I trust you don't mind this change? It makes the data format of … I figure a next step would be to move more of the specific command-line logic into the script itself, to make it easier to run locally stand-alone.
Not at all. The speed boost is very nice, and making it easier to run removes a pain point.
- 'badwords' is now a target in Makefile.am
- change badwords.txt to specify plain "words" instead of regexes so the script can build single regexes when scanning, which makes the script perform much faster (~6 times faster)

Closes #20869
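The second point above, building one regex out of a list of plain words, can be sketched roughly as follows. The real checker is a Perl script; this is only a Python illustration of the idea, and the name `build_matcher`, the sample words, and the use of `\b` word boundaries are assumptions rather than details taken from `scripts/badwords`.

```python
import re

def build_matcher(words, ignore_case=True):
    """Combine plain words into a single compiled alternation regex."""
    # Longest entries first so a longer phrase is preferred over its prefix.
    ordered = sorted(words, key=len, reverse=True)
    # Escape each word: entries are plain text, not regexes.
    alternation = "|".join(re.escape(w) for w in ordered)
    flags = re.IGNORECASE if ignore_case else 0
    return re.compile(r"\b(" + alternation + r")\b", flags)

# Hypothetical entries; the real list lives in scripts/badwords.txt.
matcher = build_matcher(["back-end", "thread safe"])
hit = matcher.search("This function is Thread Safe.")
print(hit.group(1) if hit else None)
```

With a single compiled pattern, each input line is scanned once per category instead of once per rule, which is presumably where most of the reported speedup comes from.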
Mostly sentences starting with bad words
Pull request overview
This PR relocates the “badwords” checker into ./scripts for easier local use and refactors it to precompile/aggregate patterns for faster execution, alongside documentation wording cleanups driven by the ruleset.
Changes:
- Move/standardize badwords tooling under `scripts/` and update CI workflows to call it.
- Refactor `scripts/badwords` to build combined regexes and add new “ignore” patterns (URLs, bold text).
- Adjust `scripts/badwords.txt` rules and update docs to satisfy the updated checker output.
Reviewed changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| scripts/badwords | Refactors matching to combined regexes + adds new ignore patterns. |
| scripts/badwords.txt | Rewrites a number of rules from regex-style entries into plain phrases/words. |
| scripts/badwords.ok | Updates whitelist entries to match the new matcher’s behavior. |
| scripts/Makefile.am | Ensures badwords assets are included in release tarballs. |
| Makefile.am | Adds a badwords make target and updates source-check invocation. |
| .github/workflows/checksrc.yml | Switches CI source checks from the old badwords script to ./scripts/badwords. |
| .github/workflows/checkdocs.yml | Switches CI docs checks to ./scripts/badwords. |
| docs/libcurl/libcurl-tutorial.md | Formatting tweaks to avoid badwords hits / improve readability. |
| docs/libcurl/curl_global_init_mem.md | Updates “thread safe” wording to “thread-safe”. |
| docs/libcurl/curl_easy_getinfo.md | Minor prose tweaks. |
| docs/internals/TIME-KEEPING.md | Minor prose tweaks. |
| docs/internals/MULTI-EV.md | Heading tweak. |
| docs/internals/CONNECTION-FILTERS.md | Minor prose tweaks. |
| docs/cmdline-opts/config.md | Minor prose tweaks. |
| docs/VERSIONS.md | Minor prose tweaks. |
| docs/TheArtOfHttpScripting.md | Changes “url-encode” phrasing to “URL encode”. |
| docs/HTTP3.md | Minor prose tweak. |
Comments suppressed due to low confidence (5)
scripts/badwords:130
- `@whitelist` entries are plain strings but are later used as regexes in `s/$p//g`, which means they can be recompiled repeatedly inside the per-line loop. Since the comment says “pre-compiled”, it would be more accurate (and likely faster) to store these as `qr//` regexes up-front (including those added via `---...` lines), and then apply them without recompilation.
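The compile-once structure suggested here can be illustrated in Python, where `re.compile` plays roughly the role of Perl's `qr//`. This is a sketch only: the two patterns (URLs and bold spans) and the names `IGNORE` and `strip_ignored` are hypothetical and are not copied from the script.

```python
import re

# Hypothetical ignore patterns, compiled once at start-up.
IGNORE = [re.compile(p) for p in (r"https?://\S+", r"\*\*[^*]*\*\*")]

def strip_ignored(line):
    """Remove every ignored span before the line is scanned for bad words."""
    for rx in IGNORE:
        line = rx.sub("", line)
    return line

print(strip_ignored("see https://example.com/back-end for details"))
```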
scripts/badwords:27
- `%exactcase` is now unused (the code was switched to `@exact` + `$re_cs`). It should be removed to avoid dead state that suggests case handling is still done via this hash.
scripts/badwords.txt:62
- The `url` rule currently has leading/trailing spaces (`" url= URL"`). Because the script treats everything before/after `=` as significant, this will make the bad-word pattern include a leading space and the suggested replacement include a leading space, which is likely unintended and can prevent matching in common cases. Consider changing it to `url=URL` (or otherwise removing the incidental whitespace).
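As an alternative to fixing the data file by hand, a parser could trim incidental whitespace itself. The sketch below assumes the `bad=replacement` line format implied by the comment; `parse_rule` is a hypothetical helper, and trimming would also rule out entries that need significant leading or trailing spaces.

```python
def parse_rule(line):
    """Split a 'bad=replacement' rule, trimming whitespace around both parts."""
    bad, sep, good = line.rstrip("\n").partition("=")
    if not sep:
        raise ValueError(f"no '=' in rule: {line!r}")
    return bad.strip(), good.strip()

print(parse_rule(" url= URL"))
```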
scripts/badwords:110
- For case-insensitive rules, `$2` is the matched text from the input line (preserving the original casing). Using that as the key breaks both the suggestion lookup (`$alt{$w}`) and whitelist matching (`$wl{"$f:$l:$w"}` / `$wl{"$f::$w"}`) when the input casing differs from the canonical entry in `badwords.txt` (e.g. `Back-end` won’t find `$alt{"back-end"}`). Consider normalizing case-insensitive matches to a canonical key (for example store CI rules keyed by `lc($bad)` and look up via `lc($match)`), while keeping exact-case rules unchanged.
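The normalization proposed in this comment can be sketched in Python as follows. The rule `back-end` with the suggestion `backend` is a made-up example, and `ALT`, `CI_RE`, and `suggestion_for` are hypothetical names; only the idea of keying by the lowercased word and looking up with the lowercased match comes from the review.

```python
import re

# Case-insensitive rules keyed by their lowercase form (hypothetical data).
ALT = {"back-end": "backend"}
CI_RE = re.compile(r"\b(" + "|".join(re.escape(w) for w in ALT) + r")\b",
                   re.IGNORECASE)

def suggestion_for(line):
    """Return (matched text, suggested replacement) for the first hit, or None."""
    m = CI_RE.search(line)
    if m is None:
        return None
    matched = m.group(1)                 # keeps the casing found in the input
    return matched, ALT[matched.lower()]  # lookup uses the canonical key

print(suggestion_for("The Back-end server"))
```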
scripts/badwords:143
- The new scanning logic stops after the first match in each category because the `while (...) { ...; last; }` construct effectively runs at most once. Previously the script could report multiple different bad words on the same line; now it will report at most one case-insensitive and one exact-case match per line, potentially hiding additional issues. If the intent is to keep prior behavior, iterate over all matches (or repeatedly remove/advance past the found match) instead of `last`ing immediately.
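Iterating over every match on a line, rather than stopping at the first, might look like this in Python. The pattern and the helper `all_hits` are illustrative assumptions; in Perl the comparable approach would be a `while (/.../g)` loop without the early `last`.

```python
import re

# Hypothetical combined pattern for two case-insensitive rules.
BAD_RE = re.compile(r"\b(back-end|thread safe)\b", re.IGNORECASE)

def all_hits(line):
    """Return (column, matched text) for every bad word on the line."""
    return [(m.start(1), m.group(1)) for m in BAD_RE.finditer(line)]

print(all_hits("A thread safe back-end"))
```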
and make it pre-compile regexes to execute much faster