Skip to content

Conversation

@progval
Copy link

@progval progval commented Sep 1, 2022

This avoids allocating a potentially long list.

While working on the next version of Software Heritage's license dataset, I noticed calls to scancode.api.get_license() calls spent a lot of time in this function.

Avoiding the list materialization seems to make scancode.api.get_license() take 40 to 50% less time overall

Signed-off-by: Valentin Lorentz vlorentz@softwareheritage.org

Tasks

  • Reviewed contribution guidelines
  • PR is descriptively titled 📑
  • Tests pass -- look for a green checkbox ✔️ a few minutes after opening your PR
    Run tests locally to check for errors.
  • Commits are in uniquely-named feature branch and has no merge conflicts 📁

@progval progval changed the title Replace list comprehension with a generation Replace list comprehension with a generator Sep 1, 2022
This avoids allocating a potentially long list.

Signed-off-by: Valentin Lorentz <vlorentz@softwareheritage.org>
@progval
Copy link
Author

progval commented Sep 1, 2022

Uh, looks like it does fail some of the tests; I didn't see it locally. Sorry for the noise.

@progval progval closed this Sep 1, 2022
@pombredanne
Copy link
Member

@progval Hey :)

Uh, looks like it does fail some of the tests; I didn't see it locally. Sorry for the noise.

that's OK.... we can likely make this work out OK.
Note that in anycase the license detection engine returns a list of matches in the first place and since there are several filters that run over these matches and look in many cases at the whole sequence, it had to be a list somehow.
Do you have a reproducible case where you had too many license detection items created at once? The only cases that could happen AFAIK is where there is a large number of licenses in a single file.

@progval
Copy link
Author

progval commented Sep 1, 2022

Note that in anycase the license detection engine returns a list of matches in the first place and since there are several filters that run over these matches and look in many cases at the whole sequence, it had to be a list somehow.

Ah, I assumed intbitset's constructor consumed the iterable. Looks like the improvement was cheating then, if it was actually removing items from further runs...

@pombredanne
Copy link
Member

OK, wrong context... I had not looked at your patch first!
intbitset could be made to work on a generator alright and I maintain the library at https://github.com/inveniosoftware/intbitset too.

@pombredanne pombredanne reopened this Sep 1, 2022
@progval
Copy link
Author

progval commented Sep 1, 2022

I doubt it will be much of an improvement; allocating lists is rarely an issue, and I didn't notice any that was particularly large while profiling

@pombredanne
Copy link
Member

OK, closing then!
Please keep me abreast of what you are doing with ScanCode and SWH: that's super interesting to me!

@pombredanne pombredanne closed this Sep 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants