[javascript] Javascript tokenizer now ignores comment tokens. #35

Merged
merged 1 commit into from Jun 12, 2016


@tiobe
Contributor
tiobe commented Jun 6, 2016

One of our customers complained that the CPD JavaScript tokenizer doesn't ignore comments when detecting duplicated code. This pull request changes the JavaScript tokenizer so that it ignores comments completely.

Since the AbstractTokenizer supports neither multi-line comments nor single-line comments that start with more than one character (JavaScript uses '//' for single-line comments), I decided to use a different tokenizer. Adapting the AbstractTokenizer to support JavaScript comments would require a drastic change and could affect other languages as well.
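To illustrate why a one-character comment marker is not enough for JavaScript, here is a minimal sketch of a scanner that uses one character of lookahead to recognise '//' and '/* */' comments. It is not the JavaCC grammar used in this pull request and not part of PMD; the class name `CommentStripper` is hypothetical, and the sketch deliberately ignores regex literals and template literals, which a real grammar has to handle.

```java
// Hypothetical illustration only; not PMD code and not the es5.jj grammar.
final class CommentStripper {
    private CommentStripper() {}

    /** Returns src without // and block comments, keeping every newline. */
    static String strip(String src) {
        StringBuilder out = new StringBuilder(src.length());
        int n = src.length();
        int i = 0;
        while (i < n) {
            char c = src.charAt(i);
            char next = i + 1 < n ? src.charAt(i + 1) : '\0';
            if (c == '/' && next == '/') {
                // Single-line comment: skip up to (not including) the newline.
                while (i < n && src.charAt(i) != '\n') {
                    i++;
                }
            } else if (c == '/' && next == '*') {
                // Multi-line comment: skip to the closing "*/".
                i += 2;
                while (i < n && !(src.charAt(i) == '*' && i + 1 < n && src.charAt(i + 1) == '/')) {
                    if (src.charAt(i) == '\n') {
                        out.append('\n'); // keep newlines so line numbers stay stable
                    }
                    i++;
                }
                i = Math.min(i + 2, n);
            } else if (c == '"' || c == '\'') {
                // String literal: copy verbatim so "//" inside it is not a comment.
                out.append(c);
                i++;
                while (i < n) {
                    char s = src.charAt(i);
                    out.append(s);
                    i++;
                    if (s == '\\' && i < n) {
                        out.append(src.charAt(i)); // copy the escaped character
                        i++;
                    } else if (s == c) {
                        break;
                    }
                }
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }
}
```

Newlines inside a multi-line comment are kept so that tokens after the comment still report their original line numbers.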

At first I planned to use the tokenizer of Rhino, but unfortunately it is not exposed in Rhino's external interface. In the end, I found the following JavaCC ECMAScript 5 grammar on GitHub: https://github.com/DigiArea/es5-model/blob/master/com.digiarea.es5/src/com/digiarea/es5/parser/es5.jj

It seems to do its job very well, since I successfully checked some test projects for duplicated code. One of the test projects I used is: Vue.js

To support line continuations in JavaScript, I took a different approach than the CppTokenizer. I didn't use the ContinuationReader, because it causes problems with the line numbers: after a line continuation is removed by the ContinuationReader, the line number is off by one. The JavaCC tokenizer only increments the line number when it sees a newline character, but the ContinuationReader removes the continuation character and the newline character completely from the input stream. My solution is to post-process the tokens in the JavaScript CPD tokenizer to remove the continuation characters.
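The post-processing idea can be sketched roughly as follows. This is not the code from the commit: the class name `LineContinuations` is hypothetical, and the sketch assumes a continuation is a backslash immediately followed by a line terminator inside a token's text. Because the lexer has already seen the newline by the time the token text is cleaned, the line counter is not affected.

```java
// Hypothetical helper; an illustration of the approach, not the merged code.
final class LineContinuations {
    private LineContinuations() {}

    /** Removes every backslash + line terminator pair from a token image. */
    static String strip(String image) {
        StringBuilder out = new StringBuilder(image.length());
        int n = image.length();
        int i = 0;
        while (i < n) {
            char c = image.charAt(i);
            if (c == '\\' && i + 1 < n) {
                char next = image.charAt(i + 1);
                if (next == '\n') {
                    i += 2; // drop "\" + LF
                    continue;
                }
                if (next == '\r') {
                    i += 2; // drop "\" + CR
                    if (i < n && image.charAt(i) == '\n') {
                        i++; // and the LF of a CRLF pair
                    }
                    continue;
                }
                // Any other escape (\n, \", \\ ...): copy both characters, so an
                // escaped backslash before a newline is not misread.
                out.append(c).append(next);
                i += 2;
                continue;
            }
            out.append(c);
            i++;
        }
        return out.toString();
    }
}
```

With this, a string literal written across two source lines produces the same token text as the same literal on one line, so both variants compare equal during duplicate detection.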

@adangel
Owner
adangel commented Jun 12, 2016

Thanks!

@adangel adangel merged commit 63293f8 into adangel:pmd/5.3.x Jun 12, 2016
@adangel adangel changed the title from Javascript tokenizer now ignores comment tokens. to [javascript] Javascript tokenizer now ignores comment tokens. Jun 25, 2016