
RFC: quick and dirty parallel lint #32

Closed

Conversation

clebergnu
Contributor

The lint check is by far the lengthiest static code check in my
workflow. Let's make it run in parallel and save some precious time.

This is a quick and dirty version, just to prove the point that we
can and should do it. I reckon this could and should be done at
a different level, and also made available for other checks (such as
style, indent, license, etc.).

Signed-off-by: Cleber Rosa <crosa@redhat.com>


The numbers I get when running lint for Avocado are (before):

$ /usr/bin/time ./selftests/checkall 

Running 'inspekt lint --exclude=.git --enable W0102,W0611'
Pylint disabled: W,R,C,E1002,E1101,E1103,E1120,F0401,I0011
Pylint enabled : W0102,W0611
Syntax check PASS
lint PASSED

All checks PASSED
60.21user 1.17system 1:01.58elapsed 99%CPU (0avgtext+0avgdata 531472maxresident)k
0inputs+2056outputs (0major+190536minor)pagefaults 0swaps

And after:

$ /usr/bin/time ./selftests/checkall 

Running 'inspekt lint --exclude=.git --enable W0102,W0611'
Pylint disabled: W,R,C,E1002,E1101,E1103,E1120,F0401,I0011
Pylint enabled : W0102,W0611
Syntax check PASS
lint PASSED

All checks PASSED
152.42user 2.74system 0:24.31elapsed 638%CPU (0avgtext+0avgdata 239108maxresident)k
0inputs+2040outputs (0major+435095minor)pagefaults 0swaps

@clebergnu
Contributor Author

pool = multiprocessing.Pool()
opts = self.get_opts()
args = [(filename, opts) for filename in filenames]
failed_paths = pool.map(run_pylint_file, args)
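
For context, a self-contained sketch of the pattern in this hunk. The worker below is only a stand-in that simulates a lint failure (the real `run_pylint_file` runs pylint on a single file and returns its path on failure, `None` on success):

```python
import multiprocessing

# Stand-in for the real run_pylint_file: returns the path on failure,
# None on success. "Failure" is simulated here by a marker in the name.
def run_pylint_file(args):
    filename, opts = args
    return filename if "bad" in filename else None

def lint_in_parallel(filenames, opts=()):
    # Pool() defaults to one worker process per CPU core.
    pool = multiprocessing.Pool()
    try:
        results = pool.map(run_pylint_file, [(f, opts) for f in filenames])
    finally:
        pool.close()
        pool.join()
    # Drop the Nones so only failing paths remain.
    return [path for path in results if path is not None]

if __name__ == "__main__":
    print(lint_in_parallel(["a.py", "bad_b.py", "c.py"]))
```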
Contributor

An iterator yielding the paths would probably be better (we may have to deal with lots of paths). Also, have you thought about bundling multiple paths into a single check? There is a certain overhead when initializing each worker, so batching the files would be even better. Anyway, this is still better than no parallelism and it's pretty readable; just asking whether you considered this option.
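
The batching idea could look roughly like this (a sketch; `chunked`, `run_pylint_batch`, and `lint_in_batches` are hypothetical helpers, not part of the patch, and the worker only simulates lint failures):

```python
import multiprocessing

def chunked(items, size):
    """Split items into lists of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]

# Hypothetical batch worker: lints several files per task to amortize
# per-worker startup cost; returns the paths that failed. The "bad"
# marker stands in for a real pylint invocation.
def run_pylint_batch(paths):
    return [p for p in paths if "bad" in p]

def lint_in_batches(paths, batch_size=10):
    pool = multiprocessing.Pool()
    try:
        per_batch = pool.map(run_pylint_batch, chunked(paths, batch_size))
    finally:
        pool.close()
        pool.join()
    # Flatten the per-batch failure lists into one list of failed paths.
    return [p for batch in per_batch for p in batch]
```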

Contributor Author

As the title says, this is a "quick and dirty" version.

Anyway, as far as alternative approaches are concerned, I'd focus on delivering a more "universal" solution for inspektor (one that covers other types of checks and that works when multiple files are given -- this version is limited to directory contents). And, at least initially, I'd try to stick to the multiprocessing utilities and paradigms of the Python standard library: less complexity to deal with, and the benefits are already undeniable.

args = [(filename, opts) for filename in filenames]
failed_paths = pool.map(run_pylint_file, args)
failed_paths = [path for path in failed_paths if path is not None]
if any(failed_paths):
Contributor

I think you combined two approaches here. Either you always evaluate `failed_paths` and then check `if failed_paths:`, or you use `if any(failed_paths):` without post-processing (usually the result is `[None, None, None, ...]`, hence `False`) and only evaluate `failed_paths` in case of failure. I'd go with the second approach...

Contributor

I mean

+        if any(failed_paths):
             failed_paths = [path for path in failed_paths if path is not None]
-            if any(failed_paths):
-                self.failed_paths += failed_paths
-        return not failed_paths
+            self.failed_paths += failed_paths
+            return False
+        return True

Contributor

@ldoktor left a comment

Hello @clebergnu, it seems to work well and I really like it. I tried batching 10 files per worker, but the results were only insignificantly faster, so I don't think it's worth the effort. There is a minor style issue in the post-processing of the results; once that is addressed, we can merge it...

PS: Occasionally I get multiple log lines merged into one, but IMO that's not a big issue, and the speed improvement outweighs the slightly corrupted output...

@ldoktor
Contributor

ldoktor commented Feb 2, 2018

Actually, one more request: can you please make this configurable? Some people might want to disable the feature. I'd support something like `--parallel N`, where the default is 0 (meaning `None`, i.e. let multiprocessing decide) and 1+ means the number of worker processes.
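
The suggested option could be wired up roughly like this (a sketch; the flag name and the 0-means-default semantics follow the suggestion above, the parser setup itself is hypothetical):

```python
import argparse
import multiprocessing

parser = argparse.ArgumentParser()
parser.add_argument("--parallel", type=int, default=0,
                    help="number of worker processes; 0 means one per CPU")

def make_pool(parallel):
    # Pool(None) lets multiprocessing pick one worker per CPU core,
    # so a default of 0 maps naturally onto None; --parallel 1
    # effectively serializes the run.
    return multiprocessing.Pool(parallel or None)

args = parser.parse_args(["--parallel", "4"])
print(args.parallel)
```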

@ldoktor
Contributor

ldoktor commented Feb 2, 2018

I thought about this again: why not simply filter the paths ourselves and run the linter once with all files as a list? Pylint supports parallel execution natively, and that avoids the repeated initialization. I did a very dirty check and it works well; if we want per-file granularity in the results, we can simply pass in a custom reporter:

from pylint.reporters.text import TextReporter

class OurTextReporter(TextReporter):
    def __init__(self, *args, **kwargs):
        super(OurTextReporter, self).__init__(*args, **kwargs)
        self.__failed_modules = []

    def handle_message(self, msg):
        if msg.module not in self._modules:
            if msg.module:
                self.__failed_modules.append(msg.module)
        return super(OurTextReporter, self).handle_message(msg)

@ldoktor
Contributor

ldoktor commented Feb 2, 2018

To be precise, I used this:

        def should_be_checked(path):
            checker = PathChecker(path=path, args=parsed_args, label='Lint',
                                  logger=self.log)
            return bool(checker.check_attributes('text', 'python',
                                                 'not_empty'))
        paths = []
        not_file_or_dirs = []
        for file_or_dir in parsed_args.path:
            if os.path.isdir(file_or_dir):
                for root, dirs, files in os.walk(file_or_dir):
                    for filename in files:
                        path = os.path.join(root, filename)
                        if should_be_checked(path):
                            paths.append(path)
            elif os.path.isfile(file_or_dir):
                if should_be_checked(file_or_dir):
                    paths.append(file_or_dir)
            else:
                not_file_or_dirs.append(file_or_dir)

to walk/filter the paths and then simply run:

runner = QuietLintRun(self.get_opts() + ["-j", "8"] + paths, exit=False)

to run with all paths in parallel. Anyway, let me know whether you prefer the original or this version.

@clebergnu clebergnu mentioned this pull request Apr 5, 2018
@clebergnu clebergnu closed this Apr 5, 2018