Dimi Shahbaz edited this page Nov 10, 2020 · 1 revision

Updating bazel-mypy-integration for caching and performance

Background

mypy is a project that does static type checking on Python code, according to type hints in the code (see PEP 484).

bazel-mypy-integration is a project that enables mypy type checking for Python targets in bazel.

Current limitations

1. mypy metadata is not cached between targets

bazel-mypy-integration understands the python dependency graph represented in bazel. Essentially, it creates a mypy ... invocation for a given set of .py files, as well as the dependencies of those files.

However, the mypy invocation does not handle caching of dependencies at all. For example, given:

py_library(
  name = "lib1",
  srcs = [
    "lib1_a.py",
    "lib1_b.py",
    "lib1_c.py",
  ]
)

py_library(
  name = "lib2",
  srcs = [
    "lib2.py",
  ],
  deps = [
    ":lib1",
  ]
)

mypy_test(  # This is a concrete mypy test for lib1
  name = "lib1_mypy_test",
  deps = [":lib1"]
)

mypy_test(
  name = "lib2_mypy_test",
  deps = [":lib2"]
)

Running e.g. bazel test :lib1_mypy_test :lib2_mypy_test will result in two mypy invocations, both of which will parse (and potentially check, depending on your mypy.ini) all the source files involved, transitively: e.g. mypy ... -- lib1_a.py lib1_b.py lib1_c.py and mypy ... -- lib1_a.py lib1_b.py lib1_c.py lib2.py. No state is shared between these invocations, so the effort of parsing the files they have in common is duplicated.

Worse still, mypy ships with type stubs for the entire Python standard library (via typeshed). If lib*.py imports anything from the standard library, mypy also parses the relevant stub files, again without caching anything.

2. Confusion around py_library targets with imports

Currently, bazel-mypy-integration passes complete paths to source files, relative to the workspace root. This introduces a problem with how bazel py_library targets can operate. Here is an example that illustrates the problem:

py_library(
  name = "lib1",
  srcs = [
    "srcs/lib1.py",
    "srcs/internal/lib1_internal.py",
  ],
  imports = [
    "srcs",
  ]
)

The above example adds the workspace-relative path srcs to the PYTHONPATH of all dependents of lib1. Dependents can do import lib1 and everything works. Furthermore, lib1.py can do from internal import lib1_internal, and that works fine as well. Now let's examine the mypy command line produced by bazel-mypy-integration for this library:

MYPYPATH=$PWD:srcs/
mypy ...args... -- srcs/lib1.py srcs/internal/lib1_internal.py

This leads to an exception in mypy (versions >= 0.780) which mentions "Source file found twice" (see issue #20). The issue stems from how mypy treats the source file arguments. It expects that if a source file is listed as srcs/internal/lib1_internal.py then it must be a module with the same path: srcs.internal.lib1_internal. Since in the above example srcs/lib1.py does from internal import lib1_internal, and since mypy checks the uniqueness of the modules it imports, it fails when it encounters the same file under two module names: srcs.internal.lib1_internal (from the command line) and internal.lib1_internal (from an import statement).
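The failure mode can be sketched abstractly. The helper below is an illustration only, not mypy's actual implementation: the error fires when two different module names resolve to the same source file.

```python
def find_duplicate(module_to_file):
    """Return a pair of module names that map to the same file, if any.

    Illustration only: mypy's real check lives in its build machinery, but
    the principle is the same -- one file reachable under two module names
    (srcs.internal.lib1_internal vs. internal.lib1_internal) is an error.
    """
    seen = {}  # file path -> first module name that claimed it
    for mod, path in module_to_file.items():
        if path in seen:
            return (seen[path], mod)
        seen[path] = mod
    return None
```

Feeding it the module/file pairs from the example above yields exactly the conflicting pair mypy complains about.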

This means that bazel-mypy-integration's mypy version is pinned to 0.750 to avoid this problem. This is not a bug in mypy (see discussion in https://github.com/python/mypy/issues/8944).

Improvements

Improvement 1: Propagate dependency metadata cache

mypy natively supports caching of type data, through the use of --cache-dir and a hidden option called --cache-map. bazel-mypy-integration takes advantage of fixed cache locations for each module using argument triples of the form --cache-map path/to/lib1_a.py <path to mypy's lib1_a.meta.json> <path to mypy's lib1_a.data.json>. These paths represent both where mypy should generate the metadata for a parsed source file, and also where to find existing generated metadata (for dependencies, for example). In our examples above, this would translate into --cache-map triples for each of the 3 source files in lib1 and 4 source files for lib2 (its own, and its dependencies' source files).

By capturing the generated .meta.json/.data.json pairs as part of the rule invocation, we can propagate mypy's generated metadata to dependents. This means that mypy does not have to regenerate the metadata by re-parsing the dependent source files, which results in markedly faster runs, even in shallow dependency trees. The mypy docs also note improved performance when caching is enabled.
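As a rough sketch, the triples could be assembled like this. The helper and cache path layout are hypothetical; the actual rule implementation in bazel-mypy-integration may structure this differently:

```python
def cache_map_args(srcs, cache_dir):
    """Build the flat argument list that follows --cache-map:
    src1 meta1 data1 src2 meta2 data2 ...

    Hypothetical sketch; the cache paths here are illustrative, not the
    rule's real output layout.
    """
    args = []
    for src in srcs:
        stem = src[:-3] if src.endswith(".py") else src
        args += [src,
                 "%s/%s.meta.json" % (cache_dir, stem),
                 "%s/%s.data.json" % (cache_dir, stem)]
    return args
```

For lib2 this would be called with all four transitive sources, yielding 12 arguments after --cache-map.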

Improvement 2: Propagate typeshed metadata cache

Similar to the above, typeshed stub parsing can be sped up by propagating the same cache triples and mypy .meta.json/.data.json pairs for all of the stdlib (which can be a large number of files).

In order to accomplish this, we need to treat the mypy stubs a little differently. typeshed is an implicit, internal dependency of mypy: the typeshed pip package is not represented in bazel at all, and is only available by way of the mypy pip package. To tease out the typeshed stubs from requirement("mypy"), we need to trawl through the files in that package and filter out the typeshed ones by path.
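A minimal sketch of that filtering step follows; the exact directory layout of typeshed inside the mypy pip package is an assumption here:

```python
def typeshed_stubs(package_files):
    """Select typeshed stub files (.pyi) from the mypy package's file list.

    Assumes the stubs live under a 'typeshed' directory inside the package,
    which may not match the real wheel layout exactly.
    """
    return [f for f in package_files
            if "/typeshed/" in f and f.endswith(".pyi")]
```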

For this purpose, a new rule is introduced called mypy_stdlib_cache_library, which is similar to mypy_aspect and mypy_test, but has an implementation that is used to deal with typeshed stubs only. There needs to be a singleton instance of this target, defined as part of bazel-mypy-integration. This singleton is a dependency of all mypy targets.

Improvement 3: imports-path-aware mypy invocations

To solve for the problem described in 2., and to make it possible in general for bazel-mypy-integration to work with py_* targets that specify an imports attribute, we can change the above invocation to operate on modules rather than directly on source files:

MYPYPATH=$PWD:srcs/
mypy ...args... -- -m lib1 -m internal.lib1_internal

Now the import lines mypy encounters match exactly the -m arguments it received on the command line, so there are no duplicate source file errors. In addition, the metadata generated by mypy will refer to the correct module paths (i.e. the generated metadata records internal.lib1_internal, not srcs.internal.lib1_internal). This is correct for dependents as well.
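Deriving the -m module names from srcs plus the imports attribute could look roughly like this (a sketch under the assumptions above, not the integration's actual code):

```python
def module_name(src, import_roots):
    """Map a workspace-relative source path to the module path mypy should
    see, stripping the first matching imports root. Sketch only."""
    path = src
    for root in import_roots:
        prefix = root.rstrip("/") + "/"
        if src.startswith(prefix):
            path = src[len(prefix):]
            break  # first matching root wins
    mod = path[:-3] if path.endswith(".py") else path
    mod = mod.replace("/", ".")
    if mod.endswith(".__init__"):  # a package's __init__.py is the package itself
        mod = mod[:-len(".__init__")]
    return mod
```

For the lib1 example, this turns srcs/lib1.py into -m lib1 and srcs/internal/lib1_internal.py into -m internal.lib1_internal.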

Caveat 1: multiple py_library targets referring to the same sources

In bazel, there is nothing stopping the same source file(s) from being part of multiple py_library targets. Consider for example:

py_library(
  name = "a",
  srcs = ["a.py"]
)

py_library(
  name = "a_prime",
  srcs = ["a.py"]
)

py_binary(
  name = "bin",
  srcs = ["bin.py"],
  deps = [
    ":a",
    ":a_prime",
  ]
)

This is perfectly valid in terms of bazel dependencies and Python rules: the runfiles will contain a.py and everything works.

However, mypy expects a --cache-map argument for each Python source file, and it rejects duplicate --cache-map arguments pointing at the same a.py (it exits with an error in this case; all --cache-map source files must be unique). Since each source file must have a unique --cache-map argument, we have a dilemma: both a and a_prime specify the same source. If we were to operate as usual, both sets of --cache-map arguments would end up in the transitive set of --cache-map triples, which fails.

The solution is to pick just one cache-map argument (the first one encountered). This is not a perfect solution, though, since it is possible for the same source file, at the exact same location, to produce a different set of mypy metadata (for example, if the Python target's imports path was different). But this seems pathological enough not to worry about. Picking just one cache-map argument is fine because:

  • if bin.py imports a, and there is no difference in cached metadata, then the cached metadata will be used correctly.
  • if bin.py imports a, and there is a difference in cached metadata (for example, a difference in module name), then mypy will just regenerate the metadata for a.py.
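The pick-first deduplication can be sketched as follows (a hypothetical helper, operating on (source, meta, data) triples):

```python
def dedupe_cache_maps(triples):
    """Keep only the first --cache-map triple seen for each source file,
    dropping later duplicates (e.g. from a and a_prime both owning a.py)."""
    seen = set()
    unique = []
    for src, meta, data in triples:
        if src not in seen:
            seen.add(src)
            unique.append((src, meta, data))
    return unique
```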

Caveat 2: multiple imports per py_library

The above scenario breaks down when there are multiple imports specified in a single py_library:

py_library(
  name = "lib1",
  srcs = [
    "srcs/lib1.py",
    "srcs/internal/lib1_internal.py",
  ],
  imports = [
    "srcs",
    "srcs/internal",
  ]
)

If the above looks odd, it kind of is: it specifies that all modules under both srcs and srcs/internal can be imported directly. This is not a convention in Python, and it seems like an edge case.

In any case, handling it in bazel-mypy-integration does not seem straightforward, for the following reason. Recall that the --cache-map arguments to mypy take a source file path, a .meta.json path, and a .data.json path, and that the module path is encoded in the generated metadata. If the same source file can lead to multiple module paths (internal.lib1_internal and lib1_internal), that is more information than can be represented in a --cache-map argument, because it keys on a single source file (this seems like an oversight in mypy's design). In other words, each --cache-map argument implies one specific combination of file and module path.

In this pathological case of multiple imports, the best we can do is pick the first import path and refer to the source file through a single module path (e.g. -m internal.lib1_internal).