New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Checker Cache and TranslationCheck conflict #3539

Open
romani opened this Issue Nov 10, 2016 · 9 comments

Comments

Projects
None yet
2 participants
@romani
Member

romani commented Nov 10, 2016

TranslationCheck looks at multiple files, and only does validation ONLY after all the files are given to it.
If some file was in cache and so skipped by Checker(non of Checks is called to consider it), it creates and situation that TranslationCheck does not see some files - so report violation.

Processing 1 file and skipping others will cause it to falsely believe that files are missing and report excess violations, even when no violations on a non-cache run exist. This is because we don't guess were the files are, we use the files reported to the check to know where they are and if they exist.

Steps to reproduce on CS locally:

  1. Clean target directory of any cache file.
  2. Run checkstyle's ant-phase-verify with cache on. No violations are produced but cache file is created
  3. Add '.' to any value at \src\main\resources\com\puppycrawl\tools\checkstyle\checks\naming\messages_es.properties.
  4. Run checkstyle again and this time 8 violations are reported, even though no files were modified and we got no violations on our first run in step 2.
    Below is the output:
Starting audit...
[ERROR] ...\checks\naming:0: Properties file 'messages.properties' is missing. [Translation]
[ERROR] ...\checks\naming:0: Properties file 'messages_de.properties' is missing. [Translation]
[ERROR] ...\checks\naming:0: Properties file 'messages_fi.properties' is missing. [Translation]
[ERROR] ...\checks\naming:0: Properties file 'messages_pt.properties' is missing. [Translation]
[ERROR] ...\checks\naming:0: Properties file 'messages_ja.properties' is missing. [Translation]
[ERROR] ...\checks\naming:0: Properties file 'messages_fr.properties' is missing. [Translation]
[ERROR] ...\checks\naming:0: Properties file 'messages_tr.properties' is missing. [Translation]
[ERROR] ...\checks\naming:0: Properties file 'messages_zh.properties' is missing. [Translation]
Audit done.
Checkstyle ends with 8 errors.

Expected: no violations.

Commit 441d2d3 need to be reverted as fix for issue is provided.

Original problem was reported and discussed at: #3493

@romani

This comment has been minimized.

Show comment
Hide comment
@romani

romani Nov 10, 2016

Member

This is unfortunate mis-design of TranslationCheck, we can not fix it easily.
Users need to be aware of it if they use caching (all IDE plugins use caching).

There are no problems with validation logic of this Check if validation is launched on just cloned sources (no existing cache scenario).

This issue could be fixed only after "Support multi-file validation Checks in Checkstyle" - #3540 .

Member

romani commented Nov 10, 2016

This is unfortunate mis-design of TranslationCheck, we can not fix it easily.
Users need to be aware of it if they use caching (all IDE plugins use caching).

There are no problems with validation logic of this Check if validation is launched on just cloned sources (no existing cache scenario).

This issue could be fixed only after "Support multi-file validation Checks in Checkstyle" - #3540 .

@romani

This comment has been minimized.

Show comment
Hide comment
@romani

romani Jun 17, 2018

Member

Proposal:

AS all Checks will have annotations (stateless, filestateful, global), all multi-file Checks have to be "global"
so we can run them on files even file is an cache. We just need to change a bit Checker::processFiles()
to consider annotations on Checks too.
Composite module TreeWalker will have new method isGlobal() that will return result base on all modules in it (value will be precalculated on initialization phase).
Checker should have a Map(any other quick collection) of Global Checks and quickly recheck if required to call proceesFiltered on Check or not.
All Checks that do not have annotation, should be treated as Global. So all existing Checks in the world will not participate is caching. The same principle as in Multithreading, all Checks without annotation - Global, and work as before same instance for all threads.

Small summary: Global Checks never skipped by cache. If some Checks existed they most-likely working withtout cache, so it will not be problem for them. The only drawback for vast majority of thirparty Checks as they will stop benefit from cache. This will be unlikely noticeable by user quickly so update to have annotations will take a lot of time and most of Checks never upgrade.

@rnveach , does it make sense ?

Member

romani commented Jun 17, 2018

Proposal:

AS all Checks will have annotations (stateless, filestateful, global), all multi-file Checks have to be "global"
so we can run them on files even file is an cache. We just need to change a bit Checker::processFiles()
to consider annotations on Checks too.
Composite module TreeWalker will have new method isGlobal() that will return result base on all modules in it (value will be precalculated on initialization phase).
Checker should have a Map(any other quick collection) of Global Checks and quickly recheck if required to call proceesFiltered on Check or not.
All Checks that do not have annotation, should be treated as Global. So all existing Checks in the world will not participate is caching. The same principle as in Multithreading, all Checks without annotation - Global, and work as before same instance for all threads.

Small summary: Global Checks never skipped by cache. If some Checks existed they most-likely working withtout cache, so it will not be problem for them. The only drawback for vast majority of thirparty Checks as they will stop benefit from cache. This will be unlikely noticeable by user quickly so update to have annotations will take a lot of time and most of Checks never upgrade.

@rnveach , does it make sense ?

@rnveach

This comment has been minimized.

Show comment
Hide comment
@rnveach

rnveach Jun 18, 2018

Member

I understand what you are saying but I don't agree with it.

Global Checks never skipped by cache

This is no different then your MT mode suggestion by separating checks into a different, unsupported group. It is just another workaround and not real support. Java checks will definitely see a difference as now every run will be the exact same as running without the cache. Placing annotation on check will not help, especially for multi-file checks which we are now starting to support, because the annotation they must have will always skip cache by your suggestion.

My proposed implementation can be seen at https://groups.google.com/forum/#!topic/checkstyle-devel/a7S2PePHG0I . It supports multi-file Java checks with cache support. To do this, every multi-file check must be cache aware and maintain its own cache for files it is skipping over.
https://github.com/rnveach/checkstyle/blob/c954d4df2d42ec6b52e1cd2d1b6b8b11c1011a67/src/main/java/com/puppycrawl/tools/checkstyle/CacheAware.java

Even translation check would need something like this to avoid reading the file twice.

try (InputStream inStream = Files.newInputStream(file.toPath())) {

Member

rnveach commented Jun 18, 2018

I understand what you are saying but I don't agree with it.

Global Checks never skipped by cache

This is no different then your MT mode suggestion by separating checks into a different, unsupported group. It is just another workaround and not real support. Java checks will definitely see a difference as now every run will be the exact same as running without the cache. Placing annotation on check will not help, especially for multi-file checks which we are now starting to support, because the annotation they must have will always skip cache by your suggestion.

My proposed implementation can be seen at https://groups.google.com/forum/#!topic/checkstyle-devel/a7S2PePHG0I . It supports multi-file Java checks with cache support. To do this, every multi-file check must be cache aware and maintain its own cache for files it is skipping over.
https://github.com/rnveach/checkstyle/blob/c954d4df2d42ec6b52e1cd2d1b6b8b11c1011a67/src/main/java/com/puppycrawl/tools/checkstyle/CacheAware.java

Even translation check would need something like this to avoid reading the file twice.

try (InputStream inStream = Files.newInputStream(file.toPath())) {

@romani

This comment has been minimized.

Show comment
Hide comment
@romani

romani Jun 26, 2018

Member

This is no different then your MT mode suggestion by separating checks into a different, unsupported group. It is just another workaround and not real support

we are extending to completely new sector of validations, and it is very clear that we will benefit from it that much, do change our API to seriously.

run will be the exact same as running without the cache.

standard simple Checks will continue benefit from cache. Who want to benefit more should be updated to have annotation.

Placing annotation on check will not help, especially for multi-file checks which we are now starting to support, because the annotation they must have will always skip cache by your suggestion.

placing annotation will help for stateless, filestateful Checks. For Global and multi-file no benefit.
With Global and multi-file Check have to make cache himself if it so critical. We do not know what global checks do for validation, they might validated whole folder of sources, or compare some metric between folders/packages, or just count sum of all lines in all files. Our cache now was and is for filestate Checks only. Lets keep it as is for now. If Global Checks so desperately need caching they can create their own cache, we do not limit them.

To do this, every multi-file check must be cache aware and maintain its own cache for files it is skipping over.

Custom Checks can do whatever they want to resolve speed problems. We do not know what they might need for caching and what matter for them.

Even translation check would need something like this to avoid reading the file twice

I tried to get your idea .... not sure ... Looks like your major concern is that Checker is actually reading while file and give content to Checks(here). I already showed that this penalty is close to nothing. But yes, it is might be not ideal. Especially then Check do know better what to do with file and how to read it. I am ok to extend API of AbstractFileSetCheck to have method like canWorkOnPrereadFileContent(default implementation will return true) and Checker will do

// we can make code to skip reading file if only globals Checks are used
final FileText theText = new FileText(file.getAbsoluteFile(), charset);
for (final FileSetCheck fsc : fileSetChecks) {
     if (fsc.canWorkOnPrereadFileContent()) 
         fileMessages.addAll(fsc.process(file, theText));
     else
         fileMessages.addAll(fsc.process(file));
}

So TranslationCheck can override canWorkOnPrereadFileContent to return false, and do what ever he needs. BUT we already stuck with preread of file and as soon as one non global Check exists, we will do this and there is no reason differentiate Checks, as reading was done already for other 99% Checks from configuration.

We have have small penalty for all, some java Checks do not need parsing, and could quickly be done by own read much faster, but we put them after java parsing just to be easier for us to maintain and easy to extend in future. http://checkstyle.sourceforge.net/config_sizes.html#OuterTypeNumber could be done enormously quicker in custom parsing, but we do not do this. http://checkstyle.sourceforge.net/config_sizes.html#FileLength is also not ideal for performance.

Member

romani commented Jun 26, 2018

This is no different then your MT mode suggestion by separating checks into a different, unsupported group. It is just another workaround and not real support

we are extending to completely new sector of validations, and it is very clear that we will benefit from it that much, do change our API to seriously.

run will be the exact same as running without the cache.

standard simple Checks will continue benefit from cache. Who want to benefit more should be updated to have annotation.

Placing annotation on check will not help, especially for multi-file checks which we are now starting to support, because the annotation they must have will always skip cache by your suggestion.

placing annotation will help for stateless, filestateful Checks. For Global and multi-file no benefit.
With Global and multi-file Check have to make cache himself if it so critical. We do not know what global checks do for validation, they might validated whole folder of sources, or compare some metric between folders/packages, or just count sum of all lines in all files. Our cache now was and is for filestate Checks only. Lets keep it as is for now. If Global Checks so desperately need caching they can create their own cache, we do not limit them.

To do this, every multi-file check must be cache aware and maintain its own cache for files it is skipping over.

Custom Checks can do whatever they want to resolve speed problems. We do not know what they might need for caching and what matter for them.

Even translation check would need something like this to avoid reading the file twice

I tried to get your idea .... not sure ... Looks like your major concern is that Checker is actually reading while file and give content to Checks(here). I already showed that this penalty is close to nothing. But yes, it is might be not ideal. Especially then Check do know better what to do with file and how to read it. I am ok to extend API of AbstractFileSetCheck to have method like canWorkOnPrereadFileContent(default implementation will return true) and Checker will do

// we can make code to skip reading file if only globals Checks are used
final FileText theText = new FileText(file.getAbsoluteFile(), charset);
for (final FileSetCheck fsc : fileSetChecks) {
     if (fsc.canWorkOnPrereadFileContent()) 
         fileMessages.addAll(fsc.process(file, theText));
     else
         fileMessages.addAll(fsc.process(file));
}

So TranslationCheck can override canWorkOnPrereadFileContent to return false, and do what ever he needs. BUT we already stuck with preread of file and as soon as one non global Check exists, we will do this and there is no reason differentiate Checks, as reading was done already for other 99% Checks from configuration.

We have have small penalty for all, some java Checks do not need parsing, and could quickly be done by own read much faster, but we put them after java parsing just to be easier for us to maintain and easy to extend in future. http://checkstyle.sourceforge.net/config_sizes.html#OuterTypeNumber could be done enormously quicker in custom parsing, but we do not do this. http://checkstyle.sourceforge.net/config_sizes.html#FileLength is also not ideal for performance.

@rnveach rnveach added the hard label Jun 29, 2018

@rnveach

This comment has been minimized.

Show comment
Hide comment
@rnveach

rnveach Jun 29, 2018

Member

If Global Checks so desperately need caching they can create their own cache, we do not limit them.

They cannot because they do not control TreeWalker. Checks are not the most time consuming process. It is TreeWalker and it's parsing. Checks are only ever contacted after TreeWalker parses the file and starts scrolling through the nodes.
Also Checks are not cache aware. If an external file changes, they are not notified about it. They can't erase their internal data to re-process the old files. They also need to be notified when a file is skipped to be able to pull in information from their cache. They can't tell the difference between a skipped file and a file that was not included in the run.
So we are limiting what they can do. Any and all events the cache has now, Checks would need to receive those same events for their own cache system. This is why I built CacheAware interface.

Global and multi-file Check have to make cache himself if it so critical
We do not know what they might need for caching and what matter for them.

Yes this is correct. This is why my example code had CacheAware and Check specific caching.

I already showed that this penalty is close to nothing.

You are only looking at penalties for FileSets. Not AbstractChecks. My main concern is them, not FileSets, as they are where we take the most time with.

tried to get your idea .... not sure
I am ok to extend API of AbstractFileSetCheck to have method like canWorkOnPrereadFileContent

There shouldn't be any reason Checks need to re-read file. TranslationCheck breaks that at https://github.com/checkstyle/checkstyle/blob/master/src/main/java/com/puppycrawl/tools/checkstyle/checks/TranslationCheck.java#L482 because of the lack of true multi-file support.

some java Checks do not need parsing

This is because they were incorrectly made as AbstractChecks and should be FileSet checks. We have another issue to break compatibility of one such check. There shouldn't be any AbstractChecks that doesn't examine the Java code in some way,

OuterTypeNumber could be done enormously quicker in custom parsing

No, it is examining the Java tree so it should be using the Java parser. If the parser is slow, then we should examine what can be done to make it faster.

FileLength is also not ideal for performance.

It is not a java check and it's code looks fine.

Our cache now was and is for filestate Checks only. Lets keep it as is for now.

I do not support this. This is again picking Checks to support and ignoring others. Java checks was the reason I wanted cache enabled in the project as it was taking minutes for a local run for me on my laptop.
It also still doesn't change that FileSets will still need CacheAware properties to be able to implement their own cache for full multi-file and cache support.

Member

rnveach commented Jun 29, 2018

If Global Checks so desperately need caching they can create their own cache, we do not limit them.

They cannot because they do not control TreeWalker. Checks are not the most time consuming process. It is TreeWalker and it's parsing. Checks are only ever contacted after TreeWalker parses the file and starts scrolling through the nodes.
Also Checks are not cache aware. If an external file changes, they are not notified about it. They can't erase their internal data to re-process the old files. They also need to be notified when a file is skipped to be able to pull in information from their cache. They can't tell the difference between a skipped file and a file that was not included in the run.
So we are limiting what they can do. Any and all events the cache has now, Checks would need to receive those same events for their own cache system. This is why I built CacheAware interface.

Global and multi-file Check have to make cache himself if it so critical
We do not know what they might need for caching and what matter for them.

Yes this is correct. This is why my example code had CacheAware and Check specific caching.

I already showed that this penalty is close to nothing.

You are only looking at penalties for FileSets. Not AbstractChecks. My main concern is them, not FileSets, as they are where we take the most time with.

tried to get your idea .... not sure
I am ok to extend API of AbstractFileSetCheck to have method like canWorkOnPrereadFileContent

There shouldn't be any reason Checks need to re-read file. TranslationCheck breaks that at https://github.com/checkstyle/checkstyle/blob/master/src/main/java/com/puppycrawl/tools/checkstyle/checks/TranslationCheck.java#L482 because of the lack of true multi-file support.

some java Checks do not need parsing

This is because they were incorrectly made as AbstractChecks and should be FileSet checks. We have another issue to break compatibility of one such check. There shouldn't be any AbstractChecks that doesn't examine the Java code in some way,

OuterTypeNumber could be done enormously quicker in custom parsing

No, it is examining the Java tree so it should be using the Java parser. If the parser is slow, then we should examine what can be done to make it faster.

FileLength is also not ideal for performance.

It is not a java check and it's code looks fine.

Our cache now was and is for filestate Checks only. Lets keep it as is for now.

I do not support this. This is again picking Checks to support and ignoring others. Java checks was the reason I wanted cache enabled in the project as it was taking minutes for a local run for me on my laptop.
It also still doesn't change that FileSets will still need CacheAware properties to be able to implement their own cache for full multi-file and cache support.

@rnveach

This comment has been minimized.

Show comment
Hide comment
@rnveach

rnveach Jun 29, 2018

Member

OuterTypeNumber
FileLength

If you want to know the worst offenders use https://github.com/rnveach/checkstyle/commits/more_audits to see real numbers.

Using our own project on my work laptop, full run time was 48.3 seconds.

The worst FileSet is TreeWalker with 27.3 seconds. Runners up are RegexpSinglelineCheck (0.47 seconds), JavadocPackageCheck (0.47 seconds), RegexpMultilineCheck (0.38 seconds).

The worst AbstractCheck is UnusedImportsCheck with 4.3 seconds. Runners up are IndentationCheck (0.56 seconds), CommentsIndentationCheck (0.55 seconds), ParenPadCheck (0.29 seconds).

Java parser took 2.7 seconds total. Javadoc parser took 13.5 seconds total.

Member

rnveach commented Jun 29, 2018

OuterTypeNumber
FileLength

If you want to know the worst offenders use https://github.com/rnveach/checkstyle/commits/more_audits to see real numbers.

Using our own project on my work laptop, full run time was 48.3 seconds.

The worst FileSet is TreeWalker with 27.3 seconds. Runners up are RegexpSinglelineCheck (0.47 seconds), JavadocPackageCheck (0.47 seconds), RegexpMultilineCheck (0.38 seconds).

The worst AbstractCheck is UnusedImportsCheck with 4.3 seconds. Runners up are IndentationCheck (0.56 seconds), CommentsIndentationCheck (0.55 seconds), ParenPadCheck (0.29 seconds).

Java parser took 2.7 seconds total. Javadoc parser took 13.5 seconds total.

@romani

This comment has been minimized.

Show comment
Hide comment
@romani

romani Jun 30, 2018

Member

How to multi-file(MF) Java Check is going to work while cache is activated?
MF Check should try to minimize/avoid re-parsing by mean of TreeWalker.

so as soon it is MF Java Check, it is submodule of TreeWalker where parsing is actually done.
ALL we have right now in Checkstyle is single file Checks.
So we need to imagine some Check that is required to be MF.

New MF Check: make sure that in all classes *Domain.java, there is serialVersionUID field
and it has the same value.

MF Checks can do validation only at the end of execution.
During processing file by file, all they could is just caching some statistics or pathes
(or even whole ASTs cloning) to files they will need to validate at the end.

In other words, MF Checks are Checks that are working on some set of files.
They have to get all of them to make a division where is violation (in what file)
or even violated whole set and let user decide what should be changed.
It mean that MF Check need content of all files, that are changed and which are not changed.
Even more, fact that some file did not have violations and did not change does not mean that
it is not required for Check to review.

So, as summary, MF Check need whole set of files to make a decision.
The same as single file(file stateful) Check need whole file to make decision.

By limitation of Checker we cannot give Check whole set of files in one call.

Cache.
Ideal cache is cache that is not visible at all for actual "workers"(Checks) and manager of them (Checker)
by knowlegde of how workers do their work can skip request for processing if manager sure that result
will be the same. Workers/Checks shoud be unaware of cache.
that is why we have cache in Checker as it knows nature of checks, and make predict response from Check
by tree factors: file not changed, there were no violations, if Check did not change - result will be the same.

ATTENTION: we already ignore situation that some Check might change between execution !!!
If user have no violation on whole code, he just update version of checkstyle from 6.19 to 8.11 ,
do not changes to config, and probably by chance config is not affected by breaking changes.
He launch validation one more time - no violations, as all is skipped by cache. BUT there might be
bunch of false-negatives are fixed in Checks, so user will start to get violations only later on when
file is changed and new logic of Check is executed.

Lets come back to caching for MF.....
To make cache works for MF, we need to change cache structure and cache whole group, and if whole group
did not change, it is ok to skip Check. But we do not know what is group of Check (message*.properties or *Domain.java).
BUT even we will make API to communicate Cache vs Check.
Check will need all parsing again on whole set/group of files to make decision.
So MF Checks are not satisfied by statistics of single file cache that we have now.
Creation of better cache is blocked by current Checker implementation (processing of file one by one, randomly, no order).

MF implementations are complicated by design and we can not predict what they will need.
MF might reuse our single file cache statistic but, they need to read it them self and work with it
in read-only mode, if they need.

MF Check needs calls on each file, and should do decision them-self on skip
or caching till final stage of validation.
MF Java Check as submodules of TreeWalker will demand it to parse all files the time and completely
eliminate cache usage for all others.

WE SHOULD NOT SUPPORT MF Check in TreeWalker, at least for now.

WE SHOULD LET USERS to write MF FileSet Checks, and user want he can reuse Parser in his MF FileSet Check
and do parsing only when it is required (if it will work on certain set of java files).
If it is going to parse all :), it is possible, but will work a bit slow, but I hope he will benefit
by result of validation.

==========

I am not closing the door for MF Java Checks, I just want to make first step in MF support - for now
only for FileSet. When we start to have requests for optimization of MF java processing, we will start
to think about caching of ASTs in Checker between modules. I do not want to spend now time on stuff that nobody
will use now. We will have time to collect use cases and requirements from users on what will be good to have for MF processing.
I know answer :) - global cache of all ASTs during run will be awesome and satisfy all MF Checks, but it cost :) a lot for memory.
So right now I do not want to make it as part of standard library. But user can do this himself by implementation of just another Holder that is child of TreeWalker.

Member

romani commented Jun 30, 2018

How to multi-file(MF) Java Check is going to work while cache is activated?
MF Check should try to minimize/avoid re-parsing by mean of TreeWalker.

so as soon it is MF Java Check, it is submodule of TreeWalker where parsing is actually done.
ALL we have right now in Checkstyle is single file Checks.
So we need to imagine some Check that is required to be MF.

New MF Check: make sure that in all classes *Domain.java, there is serialVersionUID field
and it has the same value.

MF Checks can do validation only at the end of execution.
During processing file by file, all they could is just caching some statistics or pathes
(or even whole ASTs cloning) to files they will need to validate at the end.

In other words, MF Checks are Checks that are working on some set of files.
They have to get all of them to make a division where is violation (in what file)
or even violated whole set and let user decide what should be changed.
It mean that MF Check need content of all files, that are changed and which are not changed.
Even more, fact that some file did not have violations and did not change does not mean that
it is not required for Check to review.

So, as summary, MF Check need whole set of files to make a decision.
The same as single file(file stateful) Check need whole file to make decision.

By limitation of Checker we cannot give Check whole set of files in one call.

Cache.
Ideal cache is cache that is not visible at all for actual "workers"(Checks) and manager of them (Checker)
by knowlegde of how workers do their work can skip request for processing if manager sure that result
will be the same. Workers/Checks shoud be unaware of cache.
that is why we have cache in Checker as it knows nature of checks, and make predict response from Check
by tree factors: file not changed, there were no violations, if Check did not change - result will be the same.

ATTENTION: we already ignore situation that some Check might change between execution !!!
If user have no violation on whole code, he just update version of checkstyle from 6.19 to 8.11 ,
do not changes to config, and probably by chance config is not affected by breaking changes.
He launch validation one more time - no violations, as all is skipped by cache. BUT there might be
bunch of false-negatives are fixed in Checks, so user will start to get violations only later on when
file is changed and new logic of Check is executed.

Lets come back to caching for MF.....
To make cache works for MF, we need to change cache structure and cache whole group, and if whole group
did not change, it is ok to skip Check. But we do not know what is group of Check (message*.properties or *Domain.java).
BUT even we will make API to communicate Cache vs Check.
Check will need all parsing again on whole set/group of files to make decision.
So MF Checks are not satisfied by statistics of single file cache that we have now.
Creation of better cache is blocked by current Checker implementation (processing of file one by one, randomly, no order).

MF implementations are complicated by design and we can not predict what they will need.
MF might reuse our single file cache statistic but, they need to read it them self and work with it
in read-only mode, if they need.

MF Check needs calls on each file, and should do decision them-self on skip
or caching till final stage of validation.
MF Java Check as submodules of TreeWalker will demand it to parse all files the time and completely
eliminate cache usage for all others.

WE SHOULD NOT SUPPORT MF Check in TreeWalker, at least for now.

WE SHOULD LET USERS to write MF FileSet Checks, and user want he can reuse Parser in his MF FileSet Check
and do parsing only when it is required (if it will work on certain set of java files).
If it is going to parse all :), it is possible, but will work a bit slow, but I hope he will benefit
by result of validation.

==========

I am not closing the door for MF Java Checks, I just want to make first step in MF support - for now
only for FileSet. When we start to have requests for optimization of MF java processing, we will start
to think about caching of ASTs in Checker between modules. I do not want to spend now time on stuff that nobody
will use now. We will have time to collect use cases and requirements from users on what will be good to have for MF processing.
I know answer :) - global cache of all ASTs during run will be awesome and satisfy all MF Checks, but it cost :) a lot for memory.
So right now I do not want to make it as part of standard library. But user can do this himself by implementation of just another Holder that is child of TreeWalker.

@rnveach

This comment has been minimized.

Show comment
Hide comment
@rnveach

rnveach Jul 6, 2018

Member

WE SHOULD NOT SUPPORT MF Check in TreeWalker, at least for now.

Once FileSets support it, all TreeWalker has to do is chain it to the checks, the same way Checker chains things to the FileSets.

caching of ASTs in Checker between modules
global cache of all ASTs during run will be awesome and satisfy all MF Checks

I'm not sure what you mean by the first part. MF AbstractChecks can cache ASTs however they want, Checker doesn't need to be involved in that and we don't have to cache everything. All checker needs to supply is what files is cached, if the cache is erased, and what is skipped.

Creation of better cache is blocked by current Checker implementation (processing of file one by one, randomly, no order).

Checker's cache and file order don't need to change. The cache just needs to talk with the modules more before deciding if a file is skipped or not.

I do not want to spend now time on stuff that nobody will use now.

I have been using my own MF AbstractChecks. My most recent check was ensuring all setter calls for a class are defined in the same order as they are defined in the class. My project uses alot of holders (POJOs) and I prefer they be created in the order of precedence which is what the fields are ordered by.

Nothing that was written specified how the cache will be made 'better'.
Did you look over my implementation of MF and are ok with going in that direction?

Member

rnveach commented Jul 6, 2018

WE SHOULD NOT SUPPORT MF Check in TreeWalker, at least for now.

Once FileSets support it, all TreeWalker has to do is chain it to the checks, the same way Checker chains things to the FileSets.

caching of ASTs in Checker between modules
global cache of all ASTs during run will be awesome and satisfy all MF Checks

I'm not sure what you mean by the first part. MF AbstractChecks can cache ASTs however they want, Checker doesn't need to be involved in that and we don't have to cache everything. All checker needs to supply is what files is cached, if the cache is erased, and what is skipped.

Creation of better cache is blocked by current Checker implementation (processing of file one by one, randomly, no order).

Checker's cache and file order don't need to change. The cache just needs to talk with the modules more before deciding if a file is skipped or not.

I do not want to spend now time on stuff that nobody will use now.

I have been using my own MF AbstractChecks. My most recent check was ensuring all setter calls for a class are defined in the same order as they are defined in the class. My project uses alot of holders (POJOs) and I prefer they be created in the order of precedence which is what the fields are ordered by.

Nothing that was written specified how the cache will be made 'better'.
Did you look over my implementation of MF and are ok with going in that direction?

@romani

This comment has been minimized.

Show comment
Hide comment
@romani

romani Jul 9, 2018

Member

MF files implementation:
https://github.com/rnveach/checkstyle/blob/c954d4df2d42ec6b52e1cd2d1b6b8b11c1011a67/src/main/java/com/puppycrawl/tools/checkstyle/checks/coding/ActualFinalClassCheck.java
It is possible only due to updated Checker: https://github.com/rnveach/checkstyle/blob/c954d4df2d42ec6b52e1cd2d1b6b8b11c1011a67/src/main/java/com/puppycrawl/tools/checkstyle/Checker.java#L287

While debate is hot .... few more nuances:

  1. Non of existing multi-file static analysis tool use cache
  2. Checkstyle is well know for single file validation, so it is heavily used in on commit hooks to validate only changed files quickly. We need to keep this advantage, so we just need Checker/CLI mode to skip MF checks completely.
  3. As we start to speak about MF Checks ..... and implementations like ActualFinalClassCheck demand awareness of cache to avoid double parsing of file. We forget about more complicated MF requirements.
    Examples: "recheck that all properties in i resources are in use in java files", "recheck that classes referenced in XML resources exists and have special method", "all resources are in use main sources", .... "detect some pattern make a decision and search in all sources". So single pass validation does not work well for MF validations. Even in existing single file checks we have Checks that do non-single pass over tree. So as we become MF, multi pass traversing of file might be unavoidable.

We just need to have global cache of ASTs, all MF Checks get all calls on each file(not affected by single file cache), all MF Checks should be FileSets with ability to connect somehow to global cache of ASTs, if AST is not present file is parsed, stay in cache and returned to module. In this case we will cover all cases that MF might come up. It will consume a lot of memory .... but it is the only way out for all cases MF support.

Just did a testing overs checkstyle src/main folder, commit.

✔ ~/java/github/romani/checkstyle [test-ast-global-cache L|✔] 
find src/main -type f -name *.java | wc -l
364

mvn -e -Passembly package
...

/usr/bin/time -v java -jar target/checkstyle-8.12-SNAPSHOT-all.jar -c config/checkstyle_checks.xml src/main/
...

result in extra ~580M for 364 files:

keep all ASTs:Maximum resident set size (kbytes): 1 585 680
usual process:Maximum resident set size (kbytes): 1 005 712

So user will always have a choice to speedup execution when MF are used: use MT mode, allocate more memory to execution, split MF Checks to separate heavy execution (for non-on-commit validation).

Member

romani commented Jul 9, 2018

MF files implementation:
https://github.com/rnveach/checkstyle/blob/c954d4df2d42ec6b52e1cd2d1b6b8b11c1011a67/src/main/java/com/puppycrawl/tools/checkstyle/checks/coding/ActualFinalClassCheck.java
It is possible only due to updated Checker: https://github.com/rnveach/checkstyle/blob/c954d4df2d42ec6b52e1cd2d1b6b8b11c1011a67/src/main/java/com/puppycrawl/tools/checkstyle/Checker.java#L287

While debate is hot .... few more nuances:

  1. Non of existing multi-file static analysis tool use cache
  2. Checkstyle is well know for single file validation, so it is heavily used in on commit hooks to validate only changed files quickly. We need to keep this advantage, so we just need Checker/CLI mode to skip MF checks completely.
  3. As we start to speak about MF Checks ..... and implementations like ActualFinalClassCheck demand awareness of cache to avoid double parsing of file. We forget about more complicated MF requirements.
    Examples: "recheck that all properties in i resources are in use in java files", "recheck that classes referenced in XML resources exists and have special method", "all resources are in use main sources", .... "detect some pattern make a decision and search in all sources". So single pass validation does not work well for MF validations. Even in existing single file checks we have Checks that do non-single pass over tree. So as we become MF, multi pass traversing of file might be unavoidable.

We just need to have global cache of ASTs, all MF Checks get all calls on each file(not affected by single file cache), all MF Checks should be FileSets with ability to connect somehow to global cache of ASTs, if AST is not present file is parsed, stay in cache and returned to module. In this case we will cover all cases that MF might come up. It will consume a lot of memory .... but it is the only way out for all cases MF support.

Just did a testing overs checkstyle src/main folder, commit.

✔ ~/java/github/romani/checkstyle [test-ast-global-cache L|✔] 
find src/main -type f -name *.java | wc -l
364

mvn -e -Passembly package
...

/usr/bin/time -v java -jar target/checkstyle-8.12-SNAPSHOT-all.jar -c config/checkstyle_checks.xml src/main/
...

result in extra ~580M for 364 files:

keep all ASTs:Maximum resident set size (kbytes): 1 585 680
usual process:Maximum resident set size (kbytes): 1 005 712

So user will always have a choice to speedup execution when MF are used: use MT mode, allocate more memory to execution, split MF Checks to separate heavy execution (for non-on-commit validation).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment