Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Seek for ignored files only in dirs that their content has changed. #5429

Closed
wants to merge 3 commits into from

Conversation

jbialobr
Copy link
Member

Eagerly loading all the ignored files slows down the regular status query.
When you have tens of thousands of ignored files it can take up to a minute to get git status completed
when run with ignored files included. Since we base the working dir and commit index info on this, it is a poor experience to have to wait that long for seeing changes to be reflected in the revisions grid.

Changes proposed in this pull request:

  • Lazily seek for ignored files only in dirs that their content has changed, in a separate query.
  • Remember tracked files to not treat them as candidates for ignored file.
    When not touching ignored files, gitext will not have to load the ignored files list at all.

What did I do to test the code and ensure quality:

  • Manual tests.

Has been tested on (remove any that don't apply):

  • GIT 2.18
  • Windows 10

Eagerly loading all the ignored files slows down the regular status query.
Remember tracked files to not treat them as a candidate for ignored file.
Copy link
Member

@gerhardol gerhardol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This solution will fire off a series of "git-status -- 'ignore-candidates'" which in total could add up to more than listing all files in one go. In the repos I have at home, the time for git-status with and without candidates is very close. It is probably an improvement, but it is hard to determine. The complexity increases with this change too.

Another approach to minimize the first time until update which is annoying could have been to fire an update without ignored files the first time, next time with ignored (then wait minutes until next as before).

A way to limit the candidate checking and memory consumption (the list can be huge) is to use --ignore=matching when listing and filter on directories. That requires Git 2.16 or so but can be backwards compatible. That is more important if listing all ignored, but could benefit this approach too. See #4256 for an issue of the ignore files.

I would like to test this by use some time before approving. Well implemented though, not really any comments on the code.

@gerhardol
Copy link
Member

@jbialobr
If git-status --ignored takes 1 min, how long for git-status --ignored=matching and git-status --ignored ?
This PR can cause git-status --ignored to be constantly cancelled, since git-status can run more often than git-status --ignored can, so ignore files are never updated if I read the code correctly. Have you seen this scenario?

Proposal:

  • Always run git-status --ignored without candidates. This can simplify the implementation
  • Throttle updates.
  • As long as normal git-status is not cancelled and --ignored runs, do not cancel
  • Wait between invocations of --ignored, similar to how normal git-status is started

A simple alternative is to just skip --ignored for "requested" invocations so the first update is quicker.

@jbialobr
Copy link
Member Author

If git-status --ignored takes 1 min, how long for git-status --ignored=matching and git-status --ignored ?

They run really fast, but they do not return individual files. We would have to change the logic that verifies if a file is an ignored one.

This PR can cause git-status --ignored to be constantly cancelled, since git-status can run more often than git-status --ignored can, so ignore files are never updated if I read the code correctly. Have you seen this scenario?

If you are talking about:

if (IsAboutToUpdate())
                {
                    return;
                }

then yes, it is there on purpose. The regular use case is that you work with none ignored files. In most cases the git status will return the same files that are candidates for ignored and there will be no need to call git status --ignored command.

Always run git-status --ignored without candidates. This can simplify the implementation

You have to implement your own logic to check if the file matches a dir returned by the command. It will be much slower than checking if a file is in hashmap. It will not simplify the implementation but has a chance to work fast.

Throttle updates.

I prefer to focus on making it always fast.

As long as normal git-status is not cancelled and --ignored runs, do not cancel

I am not sure about what cancellation you are talking.

Wait between invocations of --ignored, similar to how normal git-status is started

It is already the case. And the --ignored command only is called when after the regular git status command there are candidates left which should not be too often.

A simple alternative is to just skip --ignored for "requested" invocations so the first update is quicker.

I am not satisfied with that, as I am experiencing lags after the initial load as well.

Copy link
Member

@RussKie RussKie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbialobr, I've taken a liberty to refactor your proposal extracting (what I perceive as) caching related implementations into own class. This way it is modular and can be unit tested.
Passing IsAboutToUpdate as an arguments feels very dirty, but it is EOD here and I'm quite tired to think clearly.

private HashSet<string> _ignoredFiles = new HashSet<string>();
private ConcurrentDictionary<string, object> _ignoredFiles = new ConcurrentDictionary<string, object>();
private ConcurrentDictionary<string, object> _ignoredCandidates = new ConcurrentDictionary<string, object>();
private ConcurrentDictionary<string, object> _trackedSeen = new ConcurrentDictionary<string, object>();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these 3 can be made readonly

@jbialobr
Copy link
Member Author

@RussKie Thanks, it is for sure a right direction. The UpdateCandidatesAsync method can be split into two methods: one for updating from the git status result, and the other for verifying candidates. This way there will be no need to pass IsAboutToUpdate to the cache. What we should be passing, is the GitModule captured inside the Update method. We have to keep the module we use in sync with _statusCancellationToken.

@RussKie
Copy link
Member

RussKie commented Sep 16, 2018 via email

@jbialobr
Copy link
Member Author

Perhaps GitModule should be passed to the constructor by value and the cache should be recreated after GitModule has changed?

@gerhardol
Copy link
Member

gerhardol commented Sep 16, 2018

For --ignored==matching: That should be required as all ignored files could require GB of storage. See #4256

In situations where a build is running, there will be ignored files updated constantly. The detection of ignored files do not have to be perfect, but will limit how often git-status runs. Since 2.51, the update is throttled dynamically, to detect changes after manual changes quickly and not load too much at massive updates like when building.

Note that WorkTreeChanged() will return directly if an update is already scheduled. In the current implementation, some further updates from FileSystemWatcher will slip through, but that should be changed. To just check "the first directories" is normally OK, we just want to limit the rates. Previously, the complete list were updated anyway, not so with this change. The change is not necessarily a problem, but the current implementation has some drawbacks.

The previous implementation, starting with no ignored files:
git-status --ignored (no ignored seen)
When building:
git-status running every 3rd second (or more if git-status requires more time)

The previous implementation, starting with no ignored files:
git-status --ignored (all ignored seen)
When building:
git-status running when new ignored are seen only.
Every 5th min, update the ignored files

This PR, starting with no ignored files:
When building the following will run constantly:
git-status --ignored (no ignored seen)
git-status --ignored -- few files
The ignored files will increase over time, will require many iterations -> the list of ignored files is not meaningful and could as well be removed as it will not save much.

Possibilities:

  1. Use git-status --ignored without candidates to get a full ignored list quicker. --ignored will only run when there are new ignored files
  2. Remove the ignored list, as it it is not used anyway
  3. Simple hardcoded .gitignore parsing to filter obvious files and directories (rather not do this)
  4. Remove early return in WorkTreeChanged() if an update is detected (will have effect on CPU load to some extent)

Will submit a PR with a few adjustments to this PR (I get an exception)

gerhardol added a commit to gerhardol/gitextensions that referenced this pull request Sep 16, 2018
@gerhardol
Copy link
Member

Removing the ignored list is not a bad option, that decreases the load for me compared to me (still bringing the benefits). #4256 is resolved too.

gerhardol added a commit to gerhardol/gitextensions that referenced this pull request Sep 16, 2018
@jbialobr
Copy link
Member Author

@gerhardol I am leaning towards removing the ignored list at all as you proposed.

@jbialobr jbialobr closed this Sep 17, 2018
gerhardol added a commit to gerhardol/gitextensions that referenced this pull request Sep 17, 2018
gerhardol added a commit to gerhardol/gitextensions that referenced this pull request Sep 23, 2018
gerhardol added a commit to gerhardol/gitextensions that referenced this pull request Sep 24, 2018
gerhardol added a commit to gerhardol/gitextensions that referenced this pull request Sep 26, 2018
fixes gitextensions#5439
fixes gitextensions#4256

partly based on gitextensions#5429
The handling was improved in gitextensions#4327, gitextensions#4328, fixing the regression
Those PRs limited the max load for git-status

Changes:
React faster on changing directory (1.0s->0.2s)
Some tweaks to the checks, do not check if update is required if an update is already scheduled

Do not read ignored files due to two problems:
 * Reading ignored takes longer times. Since this was done the first time, the first update was delayed.
   (Reading ignored could be done later, but this will still delay the next update).
 * Saving all seen ignored files require substantial memory, see gitextensions#4256.
   The list could be limited (see the issue), but that will increase the test time for ignored files

As ignored files will now always trigger git-status may cause some more runs.
However, the ignored file list was never complete, the list was better over time,
so GE must be able to handle this load anyway.
gerhardol added a commit to gerhardol/gitextensions that referenced this pull request Sep 26, 2018
fixes gitextensions#5439
fixes gitextensions#4256

partly based on gitextensions#5429
The handling was improved in gitextensions#4327, gitextensions#4328, fixing the regression
Those PRs limited the max load for git-status

Changes:
React faster on changing directory (1.0s->0.2s)
Some tweaks to the checks, do not check if update is required if an update is already scheduled

Do not read ignored files due to two problems:
 * Reading ignored can takes significant time. Since this was done the first time, the first update was delayed.
   (Reading ignored could be done later, but this will still delay the next update).
 * Saving all seen ignored files require substantial memory, see gitextensions#4256.
   The list could be limited (see the issue), but that will increase the test time for ignored files

As ignored files will now always trigger git-status may cause some more runs.
However, the ignored file list was never complete, the list was better over time,
so GE must be able to handle this load anyway.
gerhardol added a commit to gerhardol/gitextensions that referenced this pull request Sep 26, 2018
gerhardol added a commit to gerhardol/gitextensions that referenced this pull request Sep 27, 2018
fixes gitextensions#5439
fixes gitextensions#4256

partly based on gitextensions#5429
The handling was improved in gitextensions#4327, gitextensions#4328, fixing the regression
Those PRs limited the max load for git-status

Changes:
React faster on changing directory (1.0s->0.2s)
Some tweaks to the checks, do not check if update is required if an update is already scheduled

Do not read ignored files due to two problems:
 * Reading ignored can takes significant time. Since this was done the first time, the first update was delayed.
   (Reading ignored could be done later, but this will still delay the next update).
 * Saving all seen ignored files require substantial memory, see gitextensions#4256.
   The list could be limited (see the issue), but that will increase the test time for ignored files

As ignored files will now always trigger git-status may cause some more runs.
However, the ignored file list was never complete, the list was better over time,
so GE must be able to handle this load anyway.
gerhardol added a commit to gerhardol/gitextensions that referenced this pull request Sep 30, 2018
fixes gitextensions#5439
fixes gitextensions#4256

partly based on gitextensions#5429
The handling was improved in gitextensions#4327, gitextensions#4328, fixing the regression
Those PRs limited the max load for git-status

Changes:
React faster on changing directory (1.0s->0.2s)
Some tweaks to the checks, do not check if update is required if an update is already scheduled

Do not read ignored files due to two problems:
 * Reading ignored can takes significant time. Since this was done the first time, the first update was delayed.
   (Reading ignored could be done later, but this will still delay the next update).
 * Saving all seen ignored files require substantial memory, see gitextensions#4256.
   The list could be limited (see the issue), but that will increase the test time for ignored files

As ignored files will now always trigger git-status may cause some more runs.
However, the ignored file list was never complete, the list was better over time,
so GE must be able to handle this load anyway.
gerhardol added a commit to gerhardol/gitextensions that referenced this pull request Oct 3, 2018
fixes gitextensions#5439
fixes gitextensions#4256

partly based on gitextensions#5429
The handling was improved in gitextensions#4327, gitextensions#4328, fixing the regression
Those PRs limited the max load for git-status

Changes:
React faster on changing directory (1.0s->0.2s)
Some tweaks to the checks, do not check if update is required if an update is already scheduled

Do not read ignored files due to two problems:
 * Reading ignored can takes significant time. Since this was done the first time, the first update was delayed.
   (Reading ignored could be done later, but this will still delay the next update).
 * Saving all seen ignored files require substantial memory, see gitextensions#4256.
   The list could be limited (see the issue), but that will increase the test time for ignored files

As ignored files will now always trigger git-status may cause some more runs.
However, the ignored file list was never complete, the list was better over time,
so GE must be able to handle this load anyway.
gerhardol added a commit to gerhardol/gitextensions that referenced this pull request Oct 5, 2018
fixes gitextensions#5439
fixes gitextensions#4256

partly based on gitextensions#5429
The handling was improved in gitextensions#4327, gitextensions#4328, fixing the regression
Those PRs limited the max load for git-status

Changes:
React faster on changing directory (1.0s->0.2s)
Some tweaks to the checks, disable file system monitoring temporary if an update is already scheduled

Do not read ignored files due to two problems:
 * Reading ignored can takes significant time. Since this was done the first time, the first update was delayed.
   (Reading ignored could be done later, but this will still delay the next update).
 * Saving all seen ignored files require substantial memory, see gitextensions#4256.
   The list could be limited (see the issue), but that will increase the test time for ignored files

As ignored files will now always trigger git-status may cause some more runs.
However, the ignored file list was never complete, the list was better over time,
so GE must be able to handle this load anyway.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants