badwords: rework exceptions, fix many of them#20886

Closed
vszakats wants to merge 43 commits into curl:master from vszakats:badwmerge

@vszakats
Member

@vszakats vszakats commented Mar 11, 2026

Also:

  • support per-directory and per-upper-directory whitelist entries.
  • convert badlist input grep tweak into the above format.
    (except for 'And' which had just a few hits.)
  • fix many code exceptions, but do not enforce.
    (there also remain about 350 'will' uses in lib)
  • fix badwords in example code, drop exceptions.
  • badwords-all: convert to Perl.
    To make it usable from CMake.
  • FAQ: reword to not use 'will'. Drop exception.
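
A per-directory and per-upper-directory whitelist lookup, as described in the list above, could work roughly as follows. This is a Python sketch for illustration only and not the project's script; `WHITELIST`, `is_whitelisted` and the entries are hypothetical.

```python
from pathlib import PurePosixPath

# Hypothetical whitelist: directory prefix -> words allowed below it.
# An entry for "lib" also covers every subdirectory of lib/.
WHITELIST = {
    "lib": {"will"},
    "lib/vtls": {"url"},
}


def is_whitelisted(word: str, path: str) -> bool:
    """Return True if `word` is allowed for `path` via its directory or any parent."""
    word = word.lower()
    parts = PurePosixPath(path).parent.parts
    # Walk from the file's own directory up to the repository root ("").
    for depth in range(len(parts), -1, -1):
        prefix = "/".join(parts[:depth])
        if word in WHITELIST.get(prefix, set()):
            return True
    return False
```

With these assumed entries, 'will' is accepted anywhere below `lib/` while still being reported in `src/` or in top-level files.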

@bagder
Member

bagder commented Mar 11, 2026

I think you can do this differently. I think banning url from source code is untenable for example. It makes zero sense to ban that from code. It could perhaps work if we really made sure the scanner only cared about comments.

Having a special set of whitelisted words could be handled within the script, the same way it now ignores markdown-indented content only for some files.

@bagder
Member

bagder commented Mar 11, 2026

For example, when building the large set of files to scan, we could give each file a "qualifier" saying whether it is a document or source code, and then we could apply a special set of whitelisted words to the different categories. Without needing two invocations.
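
The qualifier idea can be sketched as a single pass that tags each file and picks the whitelist for its category. This is an illustrative Python sketch under stated assumptions: the extension list, the category names and the whitelist contents are all hypothetical, and files are passed in as a path-to-text mapping to keep the example self-contained.

```python
import os
import re

# Hypothetical per-category whitelists; the real word lists are not shown here.
CATEGORY_WHITELIST = {
    "source": {"url"},
    "document": set(),
}

# Assumed set of extensions that count as source code.
SOURCE_EXTS = {".c", ".h", ".pl", ".py", ".cmake"}


def qualify(path: str) -> str:
    """Tag a file as 'source' or 'document' based on its extension."""
    ext = os.path.splitext(path)[1].lower()
    return "source" if ext in SOURCE_EXTS else "document"


def scan(files: dict, badwords: set) -> list:
    """Scan all files in one invocation; return (path, line number, word) hits."""
    hits = []
    for path, text in files.items():
        allowed = CATEGORY_WHITELIST[qualify(path)]
        for lineno, line in enumerate(text.splitlines(), 1):
            for word in re.findall(r"[A-Za-z']+", line):
                w = word.lower()
                if w in badwords and w not in allowed:
                    hits.append((path, lineno, w))
    return hits
```

In this sketch, `url` in a `.c` file is accepted while the same word in a markdown file is reported, without running the scanner twice.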

@bagder
Member

bagder commented Mar 11, 2026

I would not be upset if we just skipped scanning the source code, because I think that actually does not need the same attention to language.

@bagder
Member

bagder commented Mar 11, 2026

Doing separate whitelisting could then also make the bold and backtick whitelisting not apply to source code...
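
One way to picture this: the exemption for backtick and bold spans is applied only when a file is categorized as a document. The Python sketch below is illustrative; the category labels and the `text_to_check` helper are hypothetical, and the regular expression assumes plain markdown `` `code` `` and `**bold**` spans that do not cross line breaks.

```python
import re

# Spans treated as exempt in documents: `code` and **bold** (assumed conventions).
EXEMPT_SPANS = re.compile(r"`[^`\n]*`|\*\*[^*\n]*\*\*")


def text_to_check(line: str, category: str) -> str:
    """Blank out exempt markdown spans for documents; leave source lines untouched."""
    if category == "document":
        return EXEMPT_SPANS.sub(" ", line)
    return line
```

A source line containing backticks is then checked in full, while a document line has its quoted and bold words removed before the word check.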

@vszakats
Member Author

vszakats commented Mar 11, 2026

url is easy to whitelist now on a per-subdir basis. But it's really
only one hit ATM.

Though what I noticed is not all badwords are caught for some
reason. Not sure if by intent or accident. [edit: my mistake!]

I'd still vouch for scanning sources and reducing exceptions. What
remains now is 'will' in lib. With per-lib whitelisted words, it can be
relaxed as necessary.

@vszakats
Member Author

vszakats commented Mar 11, 2026

ah, my mistake, -a enabled scanning indented lines, which we want for src.
backtracking.

@bagder
Member

bagder commented Mar 11, 2026

I have a pending script that extracts all the comments from source code, which should help to ignore code.

Do we want it to check double-quoted strings as well?

@vszakats
Member Author

vszakats commented Mar 11, 2026

> I have a pending script that extracts all the comments from source code, which should help to ignore code.
>
> Do we want it to check double-quoted strings as well?

That would be nice, yes. I expect a couple of exceptions due to it, but manageable.
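
Extracting only comments and double-quoted strings from C source can be sketched with a small scanner. This is an illustrative Python sketch, not the pending script mentioned above; the function name `comments_and_strings` is hypothetical, and the scanner deliberately ignores corner cases such as line continuations and preprocessor tricks.

```python
def comments_and_strings(src: str) -> list:
    """Return the text of C comments and double-quoted string literals, in order."""
    out, i, n = [], 0, len(src)
    while i < n:
        two = src[i:i + 2]
        if two == "/*":                       # block comment
            end = src.find("*/", i + 2)
            end = n if end < 0 else end
            out.append(src[i + 2:end])
            i = end + 2
        elif two == "//":                     # line comment
            end = src.find("\n", i)
            end = n if end < 0 else end
            out.append(src[i + 2:end])
            i = end
        elif src[i] == '"':                   # string literal
            j = i + 1
            while j < n and src[j] != '"':
                j += 2 if src[j] == "\\" else 1   # step over escaped characters
            out.append(src[i + 1:j])
            i = j + 1
        elif src[i] == "'":                   # char literal: skip so '"' is not misread
            j = i + 1
            while j < n and src[j] != "'":
                j += 2 if src[j] == "\\" else 1
            i = j + 1
        else:                                 # ordinary code: ignored
            i += 1
    return out
```

Feeding only the returned pieces to the word check would leave identifiers such as a variable named `url` out of scope, while wording in comments and messages is still checked.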

Comment thread docs/examples/http2-upload.c (Outdated)

     static int setup(struct input *t, int num, const char *upload)
     {
    -  char url[256];
    +  char urlup[256];

Member

Shouldn't url be whitelisted here?

Member Author

It seemed simpler to avoid the hit by renaming. There is probably
a better name, though. I was wondering why it is only caught
here and not in other examples where it is declared as char *url.

Contributor

Uhm, we do not want to forbid meaningful variable names now, do we?

Member Author

@vszakats vszakats Mar 11, 2026

With 2 hits in total, one in examples, the other in lib, whitelisted,
it seems fine to me. Can be whitelisted wider, if causing issues.

Or, using @bagder's comment-filter.

Member

@bagder bagder Mar 11, 2026

But we don't want warnings for this when the next example is added using url as a variable name...

Although I will fix this in the next step, which will make the script ignore the code and only check comments and strings.

@vszakats vszakats changed the title from "badwords: reduce to single invocation, rework and fix exceptions" to "badwords: rework exceptions and fix some of them" Mar 11, 2026
@vszakats vszakats changed the title from "badwords: rework exceptions and fix some of them" to "badwords: rework exceptions, fix most of them" Mar 11, 2026
@vszakats vszakats changed the title from "badwords: rework exceptions, fix most of them" to "badwords: rework exceptions, fix many of them" Mar 11, 2026
@vszakats
Member Author

I updated FAQ.md to avoid 'will', while here.

@vszakats vszakats closed this in 435eabe Mar 12, 2026
@vszakats vszakats deleted the badwmerge branch March 12, 2026 00:02
3 participants