
RFC: quick and dirty parallel lint #32

Closed

Conversation

clebergnu
Contributor

The lint check is by far the lengthiest static code check in my
workflow. Let's make it run in parallel and save some precious time.

This is a quick and dirty version, just to prove the point that we
can and should do it. I reckon this could and should be done at
a different level, and also made available for other checks (such as
style, indent, license, etc.).

Signed-off-by: Cleber Rosa <crosa@redhat.com>


The numbers I get when running lint for Avocado are (before):

$ /usr/bin/time ./selftests/checkall 

Running 'inspekt lint --exclude=.git --enable W0102,W0611'
Pylint disabled: W,R,C,E1002,E1101,E1103,E1120,F0401,I0011
Pylint enabled : W0102,W0611
Syntax check PASS
lint PASSED

All checks PASSED
60.21user 1.17system 1:01.58elapsed 99%CPU (0avgtext+0avgdata 531472maxresident)k
0inputs+2056outputs (0major+190536minor)pagefaults 0swaps

And after:

$ /usr/bin/time ./selftests/checkall 

Running 'inspekt lint --exclude=.git --enable W0102,W0611'
Pylint disabled: W,R,C,E1002,E1101,E1103,E1120,F0401,I0011
Pylint enabled : W0102,W0611
Syntax check PASS
lint PASSED

All checks PASSED
152.42user 2.74system 0:24.31elapsed 638%CPU (0avgtext+0avgdata 239108maxresident)k
0inputs+2040outputs (0major+435095minor)pagefaults 0swaps

@clebergnu
Contributor Author

pool = multiprocessing.Pool()
opts = self.get_opts()
args = [(filename, opts) for filename in filenames]
failed_paths = pool.map(run_pylint_file, args)
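
For context, a self-contained sketch of the pattern in this hunk. The worker below is only a stand-in that simulates a lint failure (the real `run_pylint_file` runs pylint on a single file and returns its path on failure, `None` on success):

```python
import multiprocessing

# Stand-in for the real run_pylint_file: returns the path on failure,
# None on success. "Failure" is simulated here by a marker in the name.
def run_pylint_file(args):
    filename, opts = args
    return filename if "bad" in filename else None

def lint_in_parallel(filenames, opts=()):
    # Pool() defaults to one worker process per CPU core.
    pool = multiprocessing.Pool()
    try:
        results = pool.map(run_pylint_file, [(f, opts) for f in filenames])
    finally:
        pool.close()
        pool.join()
    # Drop the Nones so only failing paths remain.
    return [path for path in results if path is not None]

if __name__ == "__main__":
    print(lint_in_parallel(["a.py", "bad_b.py", "c.py"]))
```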
Contributor

An iterator yielding the paths would probably be better (we may have to deal with lots of paths). Also, have you thought about bundling multiple paths into a single check? There is a certain overhead when initializing each worker, so batching the files would be even better. Anyway, this is still better than no parallelism and it's pretty readable; just asking whether you considered this option.
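
The batching idea could look roughly like this (a sketch; `chunked`, `run_pylint_batch`, and `lint_in_batches` are hypothetical helpers, not part of the patch, and the worker only simulates lint failures):

```python
import multiprocessing

def chunked(items, size):
    """Split items into lists of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]

# Hypothetical batch worker: lints several files per task to amortize
# per-worker startup cost; returns the paths that failed. The "bad"
# marker stands in for a real pylint invocation.
def run_pylint_batch(paths):
    return [p for p in paths if "bad" in p]

def lint_in_batches(paths, batch_size=10):
    pool = multiprocessing.Pool()
    try:
        per_batch = pool.map(run_pylint_batch, chunked(paths, batch_size))
    finally:
        pool.close()
        pool.join()
    # Flatten the per-batch failure lists into one list of failed paths.
    return [p for batch in per_batch for p in batch]
```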

Contributor Author

As the title says, this is a "quick and dirty" version.

Anyway, as far as alternative approaches are concerned, I'd focus on delivering a more "universal" solution for inspektor (one that covers other types of checks and that works when multiple files are given -- this version is limited to directory contents). And, at least initially, I'd try to stick to the multiprocessing utilities and paradigms of the Python standard library: less complexity to deal with, and the benefits are already undeniable.

args = [(filename, opts) for filename in filenames]
failed_paths = pool.map(run_pylint_file, args)
failed_paths = [path for path in failed_paths if path is not None]
if any(failed_paths):
Contributor

I think you combined two approaches here. Either you always evaluate `failed_paths` and then check `if failed_paths:`, or you use `if any(failed_paths):` without post-processing (usually the result is `[None, None, None, ...]`, hence `False`) and only evaluate `failed_paths` in case of failure. I'd go with the second approach...

Contributor

I mean

+        if any(failed_paths):
             failed_paths = [path for path in failed_paths if path is not None]
-            if any(failed_paths):
-                self.failed_paths += failed_paths
-        return not failed_paths
+            self.failed_paths += failed_paths
+            return False
+        return True

Contributor

@ldoktor left a comment

Hello @clebergnu, it seems to work well and I really like it. I tried batching 10 files per worker, but the results were only insignificantly faster, so I don't think it's worth the effort. There is a minor style issue in the post-processing of the results; once that is addressed, we can merge it...

PS: Occasionally I get multiple log lines merged into one, but IMO that's not a big issue, and the speed improvement outweighs the slightly corrupted output...

@ldoktor
Contributor

ldoktor commented Feb 2, 2018

Actually, one more request: can you please make this configurable? Some people might want to disable the feature. I'd support something like `--parallel N`, where the default is 0 (meaning `None`, i.e. let multiprocessing decide) and 1+ means the number of worker processes.
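
The suggested option could be wired up roughly like this (a sketch; the flag name and the 0-means-default semantics follow the suggestion above, the parser setup itself is hypothetical):

```python
import argparse
import multiprocessing

parser = argparse.ArgumentParser()
parser.add_argument("--parallel", type=int, default=0,
                    help="number of worker processes; 0 means one per CPU")

def make_pool(parallel):
    # Pool(None) lets multiprocessing pick one worker per CPU core,
    # so a default of 0 maps naturally onto None; --parallel 1
    # effectively serializes the run.
    return multiprocessing.Pool(parallel or None)

args = parser.parse_args(["--parallel", "4"])
print(args.parallel)
```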

@ldoktor
Contributor

ldoktor commented Feb 2, 2018

I thought about this again: why not simply filter the paths ourselves and run the linter once with all files as a list? Pylint supports parallel execution natively, and that avoids the repeated initialization. I did a very dirty check and it works well; if we want per-file granularity in the results, we can simply pass in a custom reporter:

from pylint.reporters.text import TextReporter

class OurTextReporter(TextReporter):
    def __init__(self, *args, **kwargs):
        super(OurTextReporter, self).__init__(*args, **kwargs)
        self.__failed_modules = []

    def handle_message(self, msg):
        if msg.module not in self._modules:
            if msg.module:
                self.__failed_modules.append(msg.module)
        return super(OurTextReporter, self).handle_message(msg)

@ldoktor
Contributor

ldoktor commented Feb 2, 2018

To be precise, I used this:

        def should_be_checked(path):
            checker = PathChecker(path=path, args=parsed_args, label='Lint',
                                  logger=self.log)
            return bool(checker.check_attributes('text', 'python',
                                                 'not_empty'))
        paths = []
        not_file_or_dirs = []
        for file_or_dir in parsed_args.path:
            if os.path.isdir(file_or_dir):
                for root, dirs, files in os.walk(file_or_dir):
                    for filename in files:
                        path = os.path.join(root, filename)
                        if should_be_checked(path):
                            paths.append(path)
            elif os.path.isfile(file_or_dir):
                if should_be_checked(file_or_dir):
                    paths.append(file_or_dir)
            else:
                not_file_or_dirs.append(file_or_dir)

to walk/filter the paths and then simply run:

runner = QuietLintRun(self.get_opts() + ["-j", "8"] + paths, exit=False)

to run with all paths in parallel. Anyway, let me know whether you prefer the original or this version.

@clebergnu clebergnu mentioned this pull request Apr 5, 2018
@clebergnu clebergnu closed this Apr 5, 2018