NonPrintableCheck sniff #103

Open
wants to merge 3 commits into
from

Projects

None yet

3 participants

@selul
selul commented Jan 6, 2017

Non printable chars sniffs for css,js and php files, implementing the themecheck plugin rule.

Reference #94

@jrfnl

Hi @selul Thank you ever so much for this! Looks great!

I've left some feedback and questions inline.

One additional question: this is quite a generic sniff and very re-usable by other (WP) standards.
I'm wondering if placing it in the Theme subset is correct or if it should be better placed in the Files subset.

I'd also like to suggest you may want to ask upstream (PHPCS) if they'd be interested in this sniff as I got a feeling it would be well placed in the Generic ruleset.

+ */
+
+/**
+ * Restricts the use of non-printable characters in themes
@jrfnl
jrfnl Jan 6, 2017

in themes can be removed. The sniff can be used on any file, not just WP theme files.

+ * @return array
+ */
+ public function register() {
+ return ( array_merge( PHP_CodeSniffer_Tokens::$commentTokens, PHP_CodeSniffer_Tokens::$emptyTokens, PHP_CodeSniffer_Tokens::$stringTokens, array(
@jrfnl
jrfnl Jan 6, 2017 edited

I'd be very interested to hear the rationale of how you decided which tokens to look at for these type of characters.
Will bad characters always be in these tokens or is there - even a remote chance - that they could be tokenized differently ?

Some additional suggestions:

  • PHP_CodeSniffer_Tokens::$heredocTokens
  • T_BAD_CHARACTER - #94 (comment)

Two side-notes:

  • PHP_CodeSniffer_Tokens::$commentTokens is a sub-set of PHP_CodeSniffer_Tokens::$emptyTokens, so you only need the last one.
  • Layout-wise I'd make this a multi-line function call with each parameter on a new line for readability.
@selul
selul Jan 7, 2017

@jrfnl Yes, my assumption was that the non-printable chars can be sneaked in those kind of tokens without breaking the linting, so my best guess was the selected categories.
I have also added the T_BAD_CHARCTER but i saw that is breaking the build on PHP7.

@jrfnl
jrfnl Jan 7, 2017 edited

I'm just wondering if there are still more tokens in which they could be sneaked in...

Regarding the build break for PHP7, if you change the code within the register() method to the below, that should fix it:

	$targets = array_merge(
		PHP_CodeSniffer_Tokens::$emptyTokens,
		PHP_CodeSniffer_Tokens::$heredocTokens,
		PHP_CodeSniffer_Tokens::$stringTokens
	);
	$targets[T_STRING]      = T_STRING;
	$targets[T_INLINE_HTML] = T_INLINE_HTML;

	if ( defined( 'T_BAD_CHARACTER' ) ) {
		$targets[T_BAD_CHARACTER] = T_BAD_CHARACTER;
	}

	return $targets;

Note: as we're defining a variable anyway, I've moved the extra tokens out of the array_merge(). array_merge() is expensive, while straight array assignments are not.
Also - for the extra ones, I've followed the pattern which the PHP_CodeSniffer_Tokens properties already use of having the token as both the key as well as the value in the array.

@selul
selul Jan 8, 2017

@jrfnl thanks for help and feedback, i have pushed the changes for this

+ $token = $tokens[ $stackPtr ];
+ $filename = $phpcsFile->getFilename();
+ if ( preg_match( '/[\x00-\x08\x0B-\x0C\x0E-\x1F\x80-\xFF]/', $token['content'], $matches ) ) {
+ $warning = 'Non-printable characters were found in the ' . $filename . ' file. You may want to check this file for errors.';
@jrfnl
jrfnl Jan 6, 2017

PHPCS natively supports sprintf() type of functionality in the addWarning/Errror() methods by passing a $data array as the fourth parameter. Please consider using this instead, i.e. use a placeholder for the variable and pass the variable to the addWarning() method as the fourth parameter array( $filename ).

@grappler
grappler Jan 6, 2017 edited

On the other hand there is not much need to mention in which file the error is as the line number and file is mention in the PHPCs output.

+ $filename = $phpcsFile->getFilename();
+ if ( preg_match( '/[\x00-\x08\x0B-\x0C\x0E-\x1F\x80-\xFF]/', $token['content'], $matches ) ) {
+ $warning = 'Non-printable characters were found in the ' . $filename . ' file. You may want to check this file for errors.';
+ $phpcsFile->addWarning( $warning, $stackPtr, 'NonPrintableCheck' );
@jrfnl
jrfnl Jan 6, 2017

Should this be a warning or an error ?

@grappler
grappler Jan 6, 2017

This should be an error.

+ $filename = $phpcsFile->getFilename();
+ if ( preg_match( '/[\x00-\x08\x0B-\x0C\x0E-\x1F\x80-\xFF]/', $token['content'], $matches ) ) {
+ $warning = 'Non-printable characters were found in the ' . $filename . ' file. You may want to check this file for errors.';
+ $phpcsFile->addWarning( $warning, $stackPtr, 'NonPrintableCheck' );
@jrfnl
jrfnl Jan 6, 2017 edited

I'm also wondering if - when PHPCS finds a bad character - we should even bother checking the tokens in the rest of the file or if we should just skip to the end of the file using return $phpcsFile->numTokens; ?

This might give issues with the unit tests, but we can work around that with a public $skip_to_end = true; class property combined with a // @codingStandardsChangeSetting WordPress.Theme.NonPrintableCheck skip_to_end false annotation if it would be decided this is the way to go.

@grappler
grappler Jan 6, 2017

I would be for checking the whole file as it otherwise frustrating finding new issues after fixing the first issue.

@@ -0,0 +1,8 @@
+/*
@jrfnl
jrfnl Jan 6, 2017 edited

For ease of understanding what you're testing for on each line, it would be great if you would add some indication of the character "hidden" in each line. Something like Bad - ASCII 07.
This goes for all three test files.

Also, please mind punctuation in the test files. The Good/Bad comments should generally start with a capital and end with a . period.

+ 6 => 1,
+ 7 => 1,
+ );
+ break;
@jrfnl
jrfnl Jan 6, 2017

No need to break if you're returning in the line above as that effectively breaks you out of the switch (and the whole function) already. (same below)

@selul selul Fixed issues after first review.
Changed testfiles to match the comments standards and added the ascii codes
Changed report from warning to error
Improved register tokens
6e239ca
@selul
selul commented Jan 7, 2017

@grappler @jrfnl I have added the requested changes. The build fails on PHP 7 as the T_BAD_CHARACTER is no longer available.

@jrfnl
jrfnl commented Jan 7, 2017

@selul Thanks for the update. I'll try and do another review run through tomorrow.

@grappler

While I was testing this sniffs on a few existing themes I found this example. In this case °, and are non-printable characters. I am not sure if it is valid code but seems to make sense.

$str = abs($lat['deg']) . '°' . $lat['min'] . '′' . round($lat['sec']) . '″';

+ * @return array
+ */
+ public function register() {
+ $targets = array_merge(
@grappler
grappler Jan 13, 2017

I would only have one space between the variable and the equal sign but wait till we get @jrfnl's feedback on it.

+ $tokens = $phpcsFile->getTokens();
+ $token = $tokens[ $stackPtr ];
+ if ( preg_match( '/[\x00-\x08\x0B-\x0C\x0E-\x1F\x80-\xFF]/', $token['content'], $matches ) ) {
+ $phpcsFile->addError( 'Non-printable characters were found in the file. You may want to check this file for errors.', $stackPtr, 'NonPrintableCheck' );
@grappler
grappler Jan 13, 2017

While I was testing the sniff, I realized there is no way to work out in which position the non-printable character is. So what about changing the message to:

			$phpcsFile->addError(
				'Non-printable character was found in column %s.',
				$stackPtr,
				'NonPrintableCheck',
				array(
					$token['column']
				)
			);
@grappler
grappler commented Jan 19, 2017 edited

@jrfnl I checked with the Theme Check plugin and I am getting the same result which makes sense as the regex is the same but the in the theme check this is only a info, so it is not required.

It would be good to do some research on the subject. @selul would you be willing to help out.

Do symbols like °, and count as non-printable characters and what negative affect can they have?

+ );
+ default:
+ return array();
+ } // End switch().
@grappler
grappler Jan 20, 2017

There are also multiple spaces between the closing bracket and the comment.

@grappler

I spoke to Otto and the reason this was added in the Theme Check was to check for Byte Order Mark. Explanation. This is now being checked with Generic.Files.ByteOrderMark

The check was made a bit looser to try to catch other issues too.

I did a bit of research, non printable characters are things like line breaks. In order for this sniff to work we would need to think of places where it does not make sense to have non printable characters. I am not sure how many places there are.

More of the ASCII codes can be found in the links 1-31 and 127-159 are the invisible characters.
http://ascii.cl/htmlcodes.htm
http://www.ascii-code.com/

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment