badwords: refactored for comments and whitelisting #20909

Closed
bagder wants to merge 3 commits into master from bagder/badwords-code

@bagder bagder commented Mar 12, 2026

  • when scanning source code, this now only checks source code comments and double-quote strings. No more finding bad words as part of code
  • this allows the full scan to be done in a single invocation
  • detects source code or markdown by file name extension
  • moved the whitelist words config into the single badwords.txt file, no more having them separately (see top of file for syntax)
  • all whitelisted words are checked case insensitively now
  • removed support for whitelisting words on a specific line number. We did not use it and it is too fragile

Excluding the actual code from the scan made the script take an additional 0.5 seconds on my machine.

Scanning 1525 files now takes a little under 1.7 seconds for me.

@bagder bagder marked this pull request as ready for review March 12, 2026 15:24
@bagder bagder requested a review from Copilot March 12, 2026 15:24

Copilot AI left a comment

Pull request overview

Refactors the scripts/badwords scanner to focus on source-code comments and double-quoted strings, consolidate scanning into a single run, and move whitelist configuration into scripts/badwords.txt.

Changes:

  • Add a C comment/string extraction path in scripts/badwords and switch to extension-based handling.
  • Move whitelist entries into scripts/badwords.txt and remove scripts/badwords.ok.
  • Consolidate badwords-all to a single invocation over sources and markdown.
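
As a rough illustration of extension-based handling, the sketch below picks a scanning mode from the file name alone. The function name `classify` and the extension sets are assumptions made for this example; the lists used by scripts/badwords may differ.

```python
import os

# Hypothetical extension sets, chosen for illustration only.
SOURCE_EXTS = {".c", ".h"}
DOC_EXTS = {".md"}


def classify(path: str) -> str:
    """Return 'source', 'markdown' or 'other' based on the file name extension."""
    ext = os.path.splitext(path)[1].lower()
    if ext in SOURCE_EXTS:
        return "source"
    if ext in DOC_EXTS:
        return "markdown"
    return "other"
```

With a classifier like this, a single invocation can take a mixed list of files and choose the right mode for each one.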

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| scripts/badwords.txt | Documents/hosts new whitelist configuration entries alongside existing badword rules. |
| scripts/badwords.ok | Removed in favor of in-file whitelist syntax in badwords.txt. |
| scripts/badwords-all | Consolidates scanning into one command invocation. |
| scripts/badwords | Implements new source-code scanning mode + new whitelist parsing behavior. |
| scripts/Makefile.am | Stops distributing badwords.ok (consistent with whitelist move). |

@bagder bagder force-pushed the bagder/badwords-code branch from 3b8ed09 to c4af243 Compare March 12, 2026 15:57

bagder commented Mar 12, 2026

augment review

augmentcode bot commented Mar 12, 2026

🤖 Augment PR Summary

Summary: Refactors the scripts/badwords scanner to focus on human-facing text in source files and to consolidate configuration/whitelisting.

Changes:

  • Adds a state-machine preprocessor (srcline) so C sources are scanned primarily in comments and double-quoted strings (rather than arbitrary code tokens).
  • Enables doing a full scan via a single badwords invocation by updating scripts/badwords-all.
  • Classifies inputs as "source" vs "markdown/document" based on filename extension, and optionally skips indented blocks for markdown-like inputs.
  • Moves whitelist configuration into scripts/badwords.txt, and removes the separate scripts/badwords.ok file from the distribution.
  • Makes whitelisted word matching case-insensitive and removes support for fragile line-number-specific whitelisting.
  • Updates the whitelist entries in scripts/badwords.txt to reflect the new single-file configuration approach.
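
To make the state-machine idea concrete, here is a minimal sketch of a preprocessor that blanks out C code and keeps only comment bodies and double-quoted string contents. It is not the `srcline` implementation from the PR: the function name `keep_comments_and_strings`, the state names, and the choice to replace code with spaces are all assumptions for this example.

```python
def keep_comments_and_strings(src: str) -> str:
    """Blank out C code, keeping comment bodies and double-quoted string contents.

    Every consumed character is replaced by exactly one output character and
    newlines are preserved, so line and column numbers still match the input.
    """
    CODE, BLOCK, LINE, STRING, CHAR = range(5)  # hypothetical state names

    def blank(ch: str) -> str:
        return "\n" if ch == "\n" else " "

    state = CODE
    out = []
    i, n = 0, len(src)
    while i < n:
        c = src[i]
        nxt = src[i + 1] if i + 1 < n else ""
        two = c + nxt
        if state == CODE and two in ("/*", "//"):
            state = BLOCK if two == "/*" else LINE
            out.append("  ")
            i += 2
            continue
        if state == BLOCK and two == "*/":
            state = CODE
            out.append("  ")
            i += 2
            continue
        if state in (STRING, CHAR) and c == "\\" and nxt:
            # Skip escape sequences so \" and \' do not end the literal.
            out.append(" " + blank(nxt))
            i += 2
            continue
        if state == CODE:
            if c == '"':
                state = STRING
            elif c == "'":
                state = CHAR
            out.append(blank(c))
        elif state == BLOCK:
            out.append(c)
        elif state == LINE:
            if c == "\n":
                state = CODE
            out.append(c)
        elif state == STRING:
            if c == '"':
                state = CODE
                out.append(" ")
            else:
                out.append(c)
        else:  # CHAR: character literals are treated as code and blanked
            if c == "'":
                state = CODE
            out.append(blank(c))
        i += 1
    return "".join(out)
```

Tracking character literals separately keeps a `'"'` in the code from being mistaken for the start of a string. Replacing code with spaces rather than deleting it is a design choice of this sketch, made so that any reported position still points at the right place in the original file.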

Technical Notes: The scanner enumerates targets via git ls-files, then applies whitelist-removal patterns and combined regex matching for case-insensitive vs exact-case bad-word rules.
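
The matching step described in the technical notes could look roughly like the sketch below: whitelisted phrases are removed from a line first (case-insensitively), then one combined regex per rule set is applied. The function names and the sample words are assumptions for illustration; the sketch does not parse `badwords.txt` and does not claim to reproduce the script's actual patterns.

```python
import re


def build_matcher(words, ignore_case):
    """Combine words into one alternation so a line is scanned once per rule set."""
    if not words:
        return None
    pattern = r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b"
    return re.compile(pattern, re.IGNORECASE if ignore_case else 0)


def scan_line(line, whitelist, nocase_words, exact_words):
    """Return the bad words found in line after whitelisted phrases are removed."""
    # Whitelisted phrases are stripped case-insensitively before matching.
    for phrase in whitelist:
        line = re.sub(re.escape(phrase), "", line, flags=re.IGNORECASE)
    hits = []
    for matcher in (build_matcher(nocase_words, True),
                    build_matcher(exact_words, False)):
        if matcher:
            hits.extend(m.group(0) for m in matcher.finditer(line))
    return hits
```

For example, with the hypothetical rules `nocase_words=["simply"]` and `exact_words=["url"]`, the line "Pass the URL" produces no hits, while "see the url-encoded form" is reported unless a phrase such as "URL-encoded" is whitelisted.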

@augmentcode augmentcode bot left a comment

Review completed. 3 suggestions posted.

@vszakats

  • removed support for whitelisting words on a specific line number. We did not use it and it is too fragile

This is also used in curl-www, in two CVE .md files. Not a showstopper, just to note:
https://github.com/curl/curl-www/blob/313c28596fbab0d8229c961db49c38f735a3e0e3/.github/scripts/badwords.ok#L3-L4

@bagder bagder closed this in 6870803 Mar 13, 2026
@bagder bagder deleted the bagder/badwords-code branch March 13, 2026 07:54