badwords: move to ./scripts for easier local execution #20869

Closed
bagder wants to merge 4 commits into master from bagder/badwords

@bagder
Member

@bagder bagder commented Mar 9, 2026

and make it pre-compile regexes to execute much faster

@bagder bagder added CI Continuous Integration script labels Mar 9, 2026
@bagder
Member Author

bagder commented Mar 9, 2026

augment review

@augmentcode

augmentcode bot commented Mar 9, 2026

🤖 Augment PR Summary

Summary: Moves the “badwords” checker into scripts/ for easier local execution and refactors it to run faster.
Changes:

  • Update GitHub workflows to call ./scripts/badwords and read ./scripts/badwords.{txt,ok}
  • Add a top-level make badwords convenience target to run the docs and source scans
  • Ship the badwords script and rule/whitelist files via scripts/Makefile.am (EXTRA_DIST)
  • Optimize the checker by precompiling whitelist regexes and building combined match regexes
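
The last point, building one combined regex instead of testing each rule separately, can be sketched as follows. This is a minimal Python illustration and not the actual `scripts/badwords` code: the function names and the sample words are assumptions made for the example.

```python
import re

def build_matcher(words, ignore_case=True):
    """Compile one alternation regex covering every plain word in `words`."""
    if not words:
        return None
    # Longest entries first so a longer phrase wins over a shorter overlap.
    ordered = sorted(words, key=len, reverse=True)
    # re.escape keeps the entries literal: they are words, not regexes.
    alternation = "|".join(re.escape(w) for w in ordered)
    flags = re.IGNORECASE if ignore_case else 0
    # The lookarounds stop matches inside longer words.
    return re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)", flags)

def find_bad_words(line, matcher):
    """Return every matched word on the line, in order of appearance."""
    if matcher is None:
        return []
    return [m.group(0) for m in matcher.finditer(line)]

# Hypothetical sample rules; compiled once, reused for every scanned line.
CI_MATCHER = build_matcher(["back-end", "url-encode"])
CS_MATCHER = build_matcher(["Url"], ignore_case=False)
```

Each line is then scanned with at most two regex passes (one case-insensitive, one exact-case) regardless of how many words the rule file lists.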


@augmentcode augmentcode bot left a comment

Review completed. 2 suggestions posted.

Comment augment review to trigger a new review at any time.

@bagder
Member Author

bagder commented Mar 9, 2026

@vszakats I trust you don't mind this change? It makes the data format of badwords.txt a little simpler, with the benefit that execution is several times faster: on my machine this completes in 6 seconds now, compared to 48 before this PR.

I figure a next step would be to move more of the specific command line logic into the script itself to make it easier to run locally stand-alone.

@vszakats
Member

vszakats commented Mar 9, 2026

> @vszakats I trust you don't mind this change? It makes the data format of badwords.txt a little simpler, with the benefit that execution is several times faster: on my machine this completes in 6 seconds now, compared to 48 before this PR.
>
> I figure a next step would be to move more of the specific command line logic into the script itself to make it easier to run locally stand-alone.

Not at all. The speed boost is very nice, and making it easier to run removes a pain point.
Agreed on integrating more into the script itself, perhaps also to avoid having two separate
kinds of invocations for separate parts of the repo (and/or to remove exceptions, especially those
made because of false positives; I'll need to recheck).

bagder added 2 commits March 9, 2026 22:12
- 'badwords' is now a target in Makefile.am

- change badwords.txt to specify plain "words" instead of regexes so the
  script can build single regexes when scanning, which makes the script
  perform much faster (~6 times faster)

Closes #20869
Mostly sentences starting with bad words
@bagder bagder force-pushed the bagder/badwords branch from 1ed4864 to dd7426a March 9, 2026 21:15
@bagder bagder marked this pull request as ready for review March 9, 2026 21:15
@bagder bagder requested a review from Copilot March 9, 2026 21:15

Copilot AI left a comment

Pull request overview

This PR relocates the “badwords” checker into ./scripts for easier local use and refactors it to precompile/aggregate patterns for faster execution, alongside documentation wording cleanups driven by the ruleset.

Changes:

  • Move/standardize badwords tooling under scripts/ and update CI workflows to call it.
  • Refactor scripts/badwords to build combined regexes and add new “ignore” patterns (URLs, bold text).
  • Adjust scripts/badwords.txt rules and update docs to satisfy the updated checker output.
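
The "ignore" patterns mentioned in the second point can be pictured as a cleanup pass that runs before the word scan. The sketch below is in Python and is only an illustration: the two regexes are assumed stand-ins for "URLs" and "bold text", not the patterns the script actually uses, and `strip_ignored` is a hypothetical name.

```python
import re

# Assumed patterns, compiled once at load time rather than per line.
IGNORE_PATTERNS = [
    re.compile(r"https?://\S+"),    # URLs, up to the next whitespace
    re.compile(r"\*\*[^*]+\*\*"),   # markdown bold: **text**
]

def strip_ignored(line):
    """Remove ignored spans so the bad-word scan never sees them."""
    for pattern in IGNORE_PATTERNS:
        line = pattern.sub("", line)
    return line
```

Removing these spans first means a flagged word inside a URL or an emphasized term does not produce a false positive.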

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| scripts/badwords | Refactors matching to combined regexes + adds new ignore patterns. |
| scripts/badwords.txt | Rewrites a number of rules from regex-style entries into plain phrases/words. |
| scripts/badwords.ok | Updates whitelist entries to match the new matcher’s behavior. |
| scripts/Makefile.am | Ensures badwords assets are included in release tarballs. |
| Makefile.am | Adds a badwords make target and updates source-check invocation. |
| .github/workflows/checksrc.yml | Switches CI source checks from the old badwords script to ./scripts/badwords. |
| .github/workflows/checkdocs.yml | Switches CI docs checks to ./scripts/badwords. |
| docs/libcurl/libcurl-tutorial.md | Formatting tweaks to avoid badwords hits / improve readability. |
| docs/libcurl/curl_global_init_mem.md | Updates “thread safe” wording to “thread-safe”. |
| docs/libcurl/curl_easy_getinfo.md | Minor prose tweaks. |
| docs/internals/TIME-KEEPING.md | Minor prose tweaks. |
| docs/internals/MULTI-EV.md | Heading tweak. |
| docs/internals/CONNECTION-FILTERS.md | Minor prose tweaks. |
| docs/cmdline-opts/config.md | Minor prose tweaks. |
| docs/VERSIONS.md | Minor prose tweaks. |
| docs/TheArtOfHttpScripting.md | Changes “url-encode” phrasing to “URL encode”. |
| docs/HTTP3.md | Minor prose tweak. |
Comments suppressed due to low confidence (5)

scripts/badwords:130

  • @whitelist entries are plain strings but later used as regexes in s/$p//g, which means they can be recompiled repeatedly inside the per-line loop. Since the comment says “pre-compiled”, it would be more accurate (and likely faster) to store these as qr// regexes up-front (including those added via ---... lines), and then apply them without recompilation.

scripts/badwords:27

  • %exactcase is now unused (the code was switched to @exact + $re_cs). It should be removed to avoid dead state that suggests case handling is still done via this hash.

scripts/badwords.txt:62

  • The url rule currently has leading/trailing spaces (" url= URL"). Because the script treats everything before/after = as significant, this will make the bad-word pattern include a leading space and the suggested replacement include a leading space, which is likely unintended and can prevent matching in common cases. Consider changing it to url=URL (or otherwise removing the incidental whitespace).

scripts/badwords:110

  • For case-insensitive rules, $2 is the matched text from the input line (preserving the original casing). Using that as the key breaks both the suggestion lookup ($alt{$w}) and whitelist matching ($wl{"$f:$l:$w"} / $wl{"$f::$w"}) when the input casing differs from the canonical entry in badwords.txt (e.g. Back-end won’t find $alt{"back-end"}). Consider normalizing case-insensitive matches to a canonical key (for example store CI rules keyed by lc($bad) and look up via lc($match)), while keeping exact-case rules unchanged.

scripts/badwords:143

  • The new scanning logic stops after the first match in each category because the while (...) { ...; last; } construct effectively runs at most once. Previously the script could report multiple different bad words on the same line; now it will report at most one case-insensitive and one exact-case match per line, potentially hiding additional issues. If the intent is to keep prior behavior, iterate over all matches (or repeatedly remove/advance past the found match) instead of calling last immediately.
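
Three of the suggestions above (trimming incidental whitespace around `=`, looking up case-insensitive matches through a canonical lowercase key, and reporting every match on a line) can be sketched together. This is a hedged Python illustration of the suggestions, not the Perl script itself: the function names are hypothetical, and treating `#` lines as comments is an assumption about the rule file.

```python
import re

def parse_rules(lines):
    """Parse 'bad=alternative' lines into {lowercased bad word: alternative}."""
    rules = {}
    for raw in lines:
        line = raw.strip()
        # Assumption: blank lines and '#' lines carry no rule.
        if not line or line.startswith("#"):
            continue
        bad, sep, alt = line.partition("=")
        if not sep:
            continue
        # Trim whitespace on both sides so " url= URL" behaves like "url=URL".
        rules[bad.strip().lower()] = alt.strip()
    return rules

def compile_rules(rules):
    """Build one case-insensitive regex for all rules, compiled once."""
    if not rules:
        return None
    words = sorted(rules, key=len, reverse=True)
    body = "|".join(re.escape(w) for w in words)
    return re.compile(r"(?<!\w)(?:" + body + r")(?!\w)", re.IGNORECASE)

def find_all(line, pattern, rules):
    """Return (matched text, suggestion) for every match, not only the first."""
    if pattern is None:
        return []
    # Lowercasing the match gives the canonical key for the lookup.
    return [(m.group(0), rules[m.group(0).lower()])
            for m in pattern.finditer(line)]
```

With this shape, "Back-end" in the input still finds the suggestion stored under "back-end", and a line containing two different flagged words yields two reports.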

@bagder bagder closed this in 7132871 Mar 9, 2026
@bagder bagder deleted the bagder/badwords branch March 9, 2026 21:47