Performance improvement? #38

Open
karajan1001 opened this issue May 31, 2020 · 10 comments

Comments

@karajan1001

Hello.

We are users of pathspec in some other project. I have a performance question.

With a long list of rules (dozens) matched against a large number of files (hundreds of thousands), match_file takes a long time. Is there any way to improve its performance?
For example, by using one big regex instead of multiple small ones.

@cpburnz
Owner

cpburnz commented Jun 8, 2020

Can you provide an example of how you're specifically performing the matches? About how long is a long time? Is it on the order of minutes, hours, or days? This will help me look into the performance issue.

@karajan1001
Author

[three images attached]

For example, it takes about 20 μs per file, which adds up to 2 seconds for 100k files. If we use one big regex and an if expression to skip path normalization on UNIX systems, it could drop to about 100 ms (maybe several hundred ms for Windows users). This would greatly improve the user experience of interactive tools that rely on path specifications.
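A minimal sketch of the two ideas above, using only the standard library rather than pathspec itself. The regexes in RULE_REGEXES are hand-written stand-ins for compiled ignore rules, and the names normalize, match_many, and match_big are hypothetical; this is not how pathspec is implemented.

```python
import os
import re

# Hypothetical, hand-written regexes standing in for compiled ignore rules.
RULE_REGEXES = [
    r"(?:^|.+/)[^/]*\.pyc$",
    r"(?:^|.+/)build(?:/.*)?$",
    r"(?:^|.+/)\.cache(?:/.*)?$",
]

# One compiled regex per rule (the slow path: one search per rule per file).
SMALL = [re.compile(source) for source in RULE_REGEXES]

# A single alternation of all rules (one search per file).
BIG = re.compile("|".join(f"(?:{source})" for source in RULE_REGEXES))


def normalize(path, sep=os.sep):
    """Convert separators to "/" only when the platform separator differs."""
    if sep == "/":
        return path  # nothing to do on UNIX-like systems
    return path.replace(sep, "/")


def match_many(path):
    """Try each small regex in turn."""
    path = normalize(path)
    return any(regex.search(path) for regex in SMALL)


def match_big(path):
    """Try the single combined regex."""
    return BIG.search(normalize(path)) is not None
```

Both functions should agree on every path; any speedup from the combined regex depends on the rules and the regex engine, so it is worth measuring on real data.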

@excitoon

excitoon commented Aug 28, 2022

I checked pathspec against gitignorefile on this branch: https://github.com/excitoon/3/tree/pathspec . On a big project (16188 directories, 204718 files) it is still faster:

real	0m43.853s

vs

real	0m25.885s

I'll check if I can fix it.

@excitoon

excitoon commented Aug 28, 2022

I made it to:

real	0m28.939s

so far. The thing is, gitignorefile's results are more precise, and if I could afford wrong results, it would be much faster.

@excitoon

Instead of ^(?:.+/)? at the start of a pattern, (?:^|.+/) is a slightly better regex. @cpburnz, check that out.
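A small sketch for comparing the two prefixes, assuming a made-up pattern body (BODY) and sample paths. It checks that both forms accept the same paths and offers a timeit helper; no timing result is claimed here, since it will vary by Python version and input.

```python
import re
import timeit

# Hypothetical pattern body: any file name ending in ".pyc".
BODY = r"[^/]*\.pyc$"

OLD = re.compile(r"^(?:.+/)?" + BODY)  # optional directory prefix
NEW = re.compile(r"(?:^|.+/)" + BODY)  # start of string OR directory prefix

# Made-up sample paths for illustration only.
PATHS = ["a.pyc", "x/y/a.pyc", "a.py", "a.pyc/b", "deep/er/path/mod.pyc"]


def same_results(paths):
    """Check that both prefixes accept exactly the same paths."""
    return all(bool(OLD.match(p)) == bool(NEW.match(p)) for p in paths)


def time_prefix(regex, paths, number=10_000):
    """Return the seconds taken to match every path `number` times."""
    return timeit.timeit(lambda: [regex.match(p) for p in paths], number=number)
```

Calling time_prefix(OLD, PATHS) and time_prefix(NEW, PATHS) on a realistic path list gives a rough comparison of the two forms.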

@bollwyvl

bollwyvl commented Sep 2, 2022

Is it worth adding an actual benchmark with e.g. pytest-benchmark or asv?

@excitoon

excitoon commented Sep 2, 2022 via email

@Dobatymo

It would be great if there was a way to combine multiple patterns from different lines into larger regexes automatically.

@bkarstens

bkarstens commented Mar 12, 2023

> It would be great if there was a way to combine multiple patterns from different lines into larger regexes automatically.

👀 It is possible, from my experimentation:

  • for multiple normal lines, I can just OR them together: pattern1|pattern2
  • for negation lines, I can wrap everything so far in a negative lookahead: (?!negation_regex)(?:previous_regex).

Then you end up with one long pattern like (?!negation5)(?:(?!negation3)(?:pattern1|pattern2)|pattern4)
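The folding described above can be sketched as follows. This is only an illustration under stated assumptions: each rule is a hypothetical (regex_source, is_negation) pair, every source matches a whole path from its start (so it ends in $ and is used with re.match), and the function name combine is made up. It is not bkarstens's or pathspec's actual implementation.

```python
import re


def combine(rules):
    """Fold ordered (regex_source, is_negation) rules into one regex.

    Normal rules are ORed onto the accumulated pattern. A negation rule
    wraps everything accumulated so far in a negative lookahead, so later
    normal rules can still re-include paths.
    """
    combined = None
    for source, is_negation in rules:
        if is_negation:
            if combined is not None:  # a leading negation has nothing to undo
                combined = f"(?!{source})(?:{combined})"
        elif combined is None:
            combined = f"(?:{source})"
        else:
            combined = f"{combined}|(?:{source})"
    # "(?!)" never matches; used when there are no normal rules at all.
    return re.compile(combined if combined is not None else r"(?!)")


# Hypothetical rules, roughly: "*.log", "!keep.log", "debug/".
RULES = [
    (r"(?:.+/)?[^/]*\.log$", False),
    (r"(?:.+/)?keep\.log$", True),
    (r"(?:.+/)?debug/.*$", False),
]

IGNORED = combine(RULES)
```

With these rules, IGNORED.match("app.log") succeeds, IGNORED.match("keep.log") fails because of the lookahead, and IGNORED.match("debug/keep.log") succeeds again because the later rule is outside the lookahead.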

But I have a completely different implementation so idk how hard that would be for this project.

I actually have 2 patterns: one that's used if the path is a directory, and one that's used if the path doesn't exist or is a file. That lets me flatten all the patterns into one. But since checking if it's a dir is comparatively slow, I also have a setting to not check and assume everything passed in is a file such that foo/ matches foo/bar but not foo even when foo is a folder.

I'm still working on fixing #74

@karajan1001
Author

> for multiple normal lines, I can just OR them together: pattern1|pattern2
> for negation lines, I can do this (?!negation_regex)(?:previous_regex).

I only used method 1 in another project and got a significant performance improvement.

Method 2 is something I didn't think of. In my case, I split the patterns into several groups, and only patterns of the same type can be joined together.
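One possible reading of this grouping approach, sketched under assumptions: "same type" is taken here to mean consecutive rules with the same negation flag, rules are hypothetical (regex_source, is_negation) pairs, and the function names are made up. The other project's actual code may differ.

```python
import re
from itertools import groupby


def group_rules(rules):
    """Join each run of consecutive same-type rules into one compiled regex."""
    groups = []
    for is_negation, run in groupby(rules, key=lambda rule: rule[1]):
        joined = "|".join(f"(?:{source})" for source, _ in run)
        groups.append((re.compile(joined), is_negation))
    return groups


def is_ignored(path, groups):
    """Last matching group wins, so scan the groups from the end."""
    for regex, is_negation in reversed(groups):
        if regex.match(path):
            return not is_negation
    return False


# Hypothetical rules, roughly: "*.log", "*.tmp", "!keep.log", "debug/".
RULES = [
    (r"(?:.+/)?[^/]*\.log$", False),
    (r"(?:.+/)?[^/]*\.tmp$", False),
    (r"(?:.+/)?keep\.log$", True),
    (r"(?:.+/)?debug/.*$", False),
]

GROUPS = group_rules(RULES)  # three groups instead of four separate rules
```

This keeps rule order intact, so the number of regex searches per path drops from the number of rules to the number of groups.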
