RFC: quick and dirty parallel lint #32
Conversation
The lint check is by far the lengthiest static code check in my workflow. Let's make it run in parallel and save some precious time. This is a quick and dirty version, just to prove the point that we can and should do it. I reckon this could and should be done at a different level, and also made available for other checks (such as style, indent, license, etc).

Signed-off-by: Cleber Rosa <crosa@redhat.com>
pool = multiprocessing.Pool()
opts = self.get_opts()
args = [(filename, opts) for filename in filenames]
failed_paths = pool.map(run_pylint_file, args)
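In isolation, the pattern in this hunk can be sketched as a runnable toy. Note that `run_pylint_file` below is a stand-in, not the PR's actual worker: it fakes a lint failure based on the filename, whereas the real one would invoke pylint.

```python
import multiprocessing

def run_pylint_file(arg):
    # Stand-in worker: unpacks a (filename, opts) tuple and returns the
    # filename on failure, None on success -- the same contract the PR's
    # pool.map() call relies on.
    filename, opts = arg
    return filename if 'fail' in filename else None

if __name__ == '__main__':
    pool = multiprocessing.Pool()
    args = [(name, {}) for name in ['ok.py', 'fail_a.py', 'ok2.py']]
    failed_paths = pool.map(run_pylint_file, args)
    pool.close()
    pool.join()
    # failed_paths == [None, 'fail_a.py', None]
```

Because `pool.map` preserves input order, passing files reduce to `None` entries that a later filter can drop.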
An iterator yielding the paths would probably be better (then we can deal with lots of paths). Also, have you thought about bundling multiple paths into a single check? There is a certain overhead when initializing each worker, so batching would be even better. (Anyway, this is still better than no parallelism and it's pretty readable; just asking whether you considered this option.)
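The batching idea this comment raises could look roughly like the sketch below. Both `bundle` and the `run_pylint_batch` worker it would feed are hypothetical names for illustration, not part of this PR:

```python
from itertools import islice

def bundle(paths, size):
    # Group an iterable of paths into lists of at most `size` items, so
    # each worker process amortizes its start-up cost over several files
    # instead of paying it once per file.
    it = iter(paths)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

# With multiprocessing, the map call would then become something like:
#     failed = pool.map(run_pylint_batch, bundle(filenames, 10))
```

Since `bundle` consumes any iterable lazily, it also covers the "iterator yielding the paths" half of the suggestion.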
As the title says, this is a "quick and dirty" version.

Anyway, as far as alternative approaches are concerned, I'd focus more on delivering a more "universal" solution to inspektor (one that covers other types of checks and that would work when multiple files are given -- this version is limited to directory contents). And, at least initially, I'd try to stick to the multiprocessing utilities and paradigms of the Python standard library: less complexity to deal with, while still delivering undeniable benefits.
args = [(filename, opts) for filename in filenames]
failed_paths = pool.map(run_pylint_file, args)
failed_paths = [path for path in failed_paths if path is not None]
if any(failed_paths):
I think you combined two approaches here. Either you always evaluate `failed_paths` and then check `if failed_paths:`, or you use `if any(failed_paths):` without processing the results (usually they should be `[None, None, None, ...]`, so `False`) and only filter `failed_paths` in case of failure. I'd go with the second approach...
I mean:

+        if any(failed_paths):
             failed_paths = [path for path in failed_paths if path is not None]
-        if any(failed_paths):
-            self.failed_paths += failed_paths
-        return not failed_paths
+            self.failed_paths += failed_paths
+            return False
+        return True
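For reference, the control flow this diff suggests can be written as a self-contained function. The names mirror the snippet, but this is only a sketch, not inspektor's code:

```python
def record_failures(results, failed_paths):
    # `results` is what pool.map returned: None for files that passed,
    # the file's path for files that failed.  Only in the failure case
    # do we bother filtering out the None entries.
    if any(results):
        failures = [path for path in results if path is not None]
        failed_paths += failures
        return False
    return True
```

In the common all-passed case `results` is all `None`, so `any()` is `False` and the list comprehension is skipped entirely, which is the point of the second approach.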
Hello @clebergnu, it seems to work well and I really like it. I tried batching per 10 files, but the results were only insignificantly faster, so I don't think it's worth the effort. There is a minor style issue in postprocessing the results; once addressed we can merge it...

PS: Occasionally I get multiple lines logged on one line, but IMO that's not a big issue, and the speed improvement outweighs the slightly corrupted output...

Actually, one more request: can you please make it configurable? Some people might want to disable this feature (I'd support something like
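A minimal sketch of the requested knob, assuming an argparse-based CLI. The option name `--parallel` is an invention for illustration here, not inspektor's actual interface:

```python
import argparse
import multiprocessing

# Hypothetical command-line option: defaults to the machine's CPU count,
# and setting it to 1 would let the checker skip the Pool entirely and
# fall back to the sequential path.
parser = argparse.ArgumentParser()
parser.add_argument('--parallel', type=int,
                    default=multiprocessing.cpu_count(),
                    help='number of lint worker processes; '
                         '1 disables parallel execution')

args = parser.parse_args(['--parallel', '1'])
```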
I thought about this again: why are we not simply filtering the paths and running the linter with all files as a list/iterator? It supports parallel execution and it'd avoid the multiple initializations. I did a very dirty check and it works well, and if we want per-file-granularity results we can simply forward a custom reporter:

class OurTextReporter(TextReporter):
    def __init__(self, *args, **kwargs):
        super(OurTextReporter, self).__init__(*args, **kwargs)
        self.__failed_modules = []

    def handle_message(self, msg):
        if msg.module not in self._modules:
            if msg.module:
                self.__failed_modules.append(msg.module)
        return super(OurTextReporter, self).handle_message(msg)
.. to be precise, I used this:

def should_be_checked(path):
    checker = PathChecker(path=path, args=parsed_args, label='Lint',
                          logger=self.log)
    if checker.check_attributes('text', 'python', 'not_empty'):
        return True
    else:
        return False

paths = []
not_file_or_dirs = []
for file_or_dir in parsed_args.path:
    if os.path.isdir(file_or_dir):
        for root, dirs, files in os.walk(file_or_dir):
            for filename in files:
                path = os.path.join(root, filename)
                if should_be_checked(path):
                    paths.append(path)
    elif os.path.isfile(file_or_dir):
        if should_be_checked(file_or_dir):
            paths.append(file_or_dir)
    else:
        not_file_or_dirs.append(file_or_dir)

to walk/filter the paths, and then simply ran:

runner = QuietLintRun(self.get_opts() + ["-j", "8"] + paths, exit=False)

to run with all paths in parallel. Anyway, let me know if you prefer the original or this version.
The numbers I get when running lint for Avocado are (before):

And after: