
3.0.3-b2 'Rebuild' with large number of files #6

Open · glenn-slayden opened this issue Dec 24, 2017 · 3 comments

glenn-slayden commented Dec 24, 2017

There does indeed seem to be a problem in 3.0.3 beta 2 with the manual indexing rebuild function, or at least with the UI reporting while it's running. The number of files being processed that the UI displays seems to be clamped to a 16-bit value somewhere: the left side of the "Completed 31342/41423" text seems to wrap at 2^16, and the right side displayed 65539 (0x10000 + 3) for a long time, for some reason.

The "Failed" counter also reporting 32768 for a long time which seems too suspicious ( = 0x8000) to be a coincidence. Is it possible that these counter values are "wrapping around" when they overflow Int16.MaxValue also?

Anyway, below is what I see when the 'rebuild' operation is complete. The problem is that the one-and-only subdirectory I selected for indexing in "Locations" has over 150,000 files below it, a number which seems to have no relation to the numbers shown. Furthermore, although the UI seems to indicate that 74,195 files "failed", clicking on the details shows only a single one.

[screenshot: icaros-2]

I'm not sure if those are UI problems only... or whether files are actually being skipped/missed. But the fact is that for indexing around 150,000 files, the counters never go above ~0x10000 (65536).
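To illustrate the overflow behavior I'm suspecting (this is speculation about the internals, not Icaros' actual code):

```c
#include <stdio.h>

int main(void)
{
    /* If a progress counter is held in a 16-bit variable, it silently
       wraps modulo 65536 -- which would pin the displayed values near
       0x8000/0x10000 during a 150,000-file scan. */
    unsigned short processed = 0;
    long i;
    for (i = 0; i < 150000; i++)
        processed++;                 /* wraps at 65536 */
    printf("%u\n", processed);       /* prints 18928, not 150000 */
    return 0;
}
```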

I should probably open a separate issue for the next point, but I'll just mention that the directory with the 150,000 files used for the indexing operation contains a number of NTFS directory junction folders. Icaros does not seem to be descending into directory junctions (NTFS "reparse points"). When the WIN32_FIND_DATA structure from your Win32 enumeration has FILE_ATTRIBUTE_REPARSE_POINT set in dwFileAttributes, you need to open that entry using CreateFileW(... FILE_FLAG_BACKUP_SEMANTICS) and then descend into it like a directory.
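Roughly what I mean, as a sketch in plain Win32 C (illustrative only, not Icaros' actual enumeration code; error handling and buffer sizes kept minimal):

```c
#include <windows.h>
#include <wchar.h>
#include <stdio.h>

/* Recursive scan that descends into directory junctions. A junction has
   both FILE_ATTRIBUTE_DIRECTORY and FILE_ATTRIBUTE_REPARSE_POINT set. */
static void Scan(const wchar_t* dir)
{
    wchar_t pattern[MAX_PATH];
    swprintf(pattern, MAX_PATH, L"%s\\*", dir);

    WIN32_FIND_DATAW fd;
    HANDLE find = FindFirstFileW(pattern, &fd);
    if (find == INVALID_HANDLE_VALUE)
        return;

    do {
        if (!wcscmp(fd.cFileName, L".") || !wcscmp(fd.cFileName, L".."))
            continue;

        wchar_t path[MAX_PATH];
        swprintf(path, MAX_PATH, L"%s\\%s", dir, fd.cFileName);

        if (fd.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY) {
            if (fd.dwFileAttributes & FILE_ATTRIBUTE_REPARSE_POINT) {
                /* Opening WITHOUT FILE_FLAG_OPEN_REPARSE_POINT follows the
                   junction to its target; FILE_FLAG_BACKUP_SEMANTICS is
                   required to get a handle on a directory at all. */
                HANDLE h = CreateFileW(path, 0,
                    FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
                    NULL, OPEN_EXISTING, FILE_FLAG_BACKUP_SEMANTICS, NULL);
                if (h == INVALID_HANDLE_VALUE)
                    continue;               /* dangling junction; skip it */
                CloseHandle(h);
            }
            Scan(path);                     /* descend like a directory   */
        } else {
            wprintf(L"%s\n", path);         /* index the file here        */
        }
    } while (FindNextFileW(find, &fd));

    FindClose(find);
}
```

(A real implementation would also need loop detection; see below.)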

Caveat: I say this having recently learned that IShellItem has big problems traversing directory junctions, especially when they point to another disk volume, since IThumbnailCache distinguishes the source for a thumbnail by a hash code that is built using the volume GUID.

So the problem with directory junctions may not be Icaros' fault. Worse, there may be nothing you can provide to work around it. Basically, the whole IShellItem design seems to require that the directory structure be a proper tree (i.e., the scope for any idlist is defined by--and restricted to--its parent item). Obviously, the presence of junctions (and symlinks) turns the file system into a graph, such that it's no longer a proper tree.

glenn-slayden changed the title from 'Rebuild' with large number of files to 3.0.3-b2 'Rebuild' with large number of files on Dec 24, 2017
Xanashi (Owner) commented Apr 27, 2018

Hi Glenn Slayden,

Thanks for the very well written report and analysis! And sorry for the very late reply.
Icaros v.3.1.0 beta 1 should fix the first issue you reported. Could you give it a try?

The failed files issue may still be there though. Let me know if it is.

As for the issue with directory junctions: aren't they typically avoided during recursive directory traversal, to prevent possible infinite directory loops, e.g. a reparse point that points to a parent folder containing said reparse point?

Xanashi self-assigned this on Apr 27, 2018
Xanashi added the bug label on Apr 27, 2018
glenn-slayden (Author) commented Jun 24, 2018

Sorry, haven't had a chance to try your new build yet.

More info on the directory junctions issue: I now believe that the whole PIDL system in the Windows Shell is based on one-to-many parent/child relationships, which (as I mentioned above) entails a proper tree, regardless of junctions (or other mechanisms) providing multiple paths to the same file system object.

For the shell thumbnail cache, what this means is that (again, this is my best guess) the cache can contain "duplicate" thumbnails for the same file, if that file was visited via more than one path. In other words, I don't think the PIDL mechanism ever attempts to conflate the physical identities of the final target object.

Reparse loops are a bane of the Windows Shell, exacerbated by a misguided--er, fully demented--use of junctions by the core OS itself in the user profiles directory structure (i.e., C:\users...). Maintaining backwards compatibility with XP and earlier across the multiple redesigns here is partly to blame, but as far as I can tell, the arrangement has led to at least one atrocious security flaw (escalation by modifying the contents, and thus the execution side effects, of the singular desktop.ini file that is junction-shared between the Guest and Administrator desktops in default Windows 7/8/10 installations).

Not to mention the disastrous consequences of deleting an archived Windows installation that was previously copied to another drive or volume, say X:\windows-old-install. Seems like a safe thing to do, but the tangled directory junctions established by Windows in there still point to C:\users, C:\AppData, etc., so you'll be horrified to discover that the innocuous cleanup operation you thought you were doing has irreparably deleted your current, active desktop and user profile.

But I digress. Yes, it is the responsibility of the client app to detect loops, if any. Ideally apps would be savvy enough to conflate multiple paths to the same file system object. It's not too hard to program, and there are a few effective approaches. Useful info is here. Setting aside approaches which attempt to uniquely identify the targets themselves, most cases can probably be handled by pre-checking all paths through GetFinalPathNameByHandle with FILE_NAME_NORMALIZED | VOLUME_NAME_GUID and then conflating those results.
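A minimal sketch of that pre-check (my own illustration; the fixed buffer and error handling are simplified):

```c
#include <windows.h>
#include <stdio.h>

/* Fills `out` with the final path in \\?\Volume{GUID}\... form. Two
   different junction paths that resolve to the same object should yield
   the same string, so it can serve as a deduplication key. */
static BOOL CanonicalPath(const wchar_t* path, wchar_t* out, DWORD cchOut)
{
    HANDLE h = CreateFileW(path, 0,
        FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
        NULL, OPEN_EXISTING,
        FILE_FLAG_BACKUP_SEMANTICS,      /* needed to open directories */
        NULL);
    if (h == INVALID_HANDLE_VALUE)
        return FALSE;

    DWORD n = GetFinalPathNameByHandleW(h, out, cchOut,
        FILE_NAME_NORMALIZED | VOLUME_NAME_GUID);
    CloseHandle(h);
    return n > 0 && n < cchOut;          /* n >= cchOut: buffer too small */
}

int wmain(void)
{
    static wchar_t buf[32768];
    if (CanonicalPath(L"C:\\Users\\Public", buf, 32768))
        wprintf(L"%s\n", buf);           /* e.g. \\?\Volume{...}\Users\Public */
    return 0;
}
```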

That technique improves cross-drive junction behavior for non-looped cases (which is all my use case requires), and I think all apps that are serious about the file system should adopt it rigorously. As for looped directory structures, I consider them an error and something that should never be created, but unfortunately that is only a matter of convention.

github-account1111 commented Mar 11, 2022

Not sure if this belongs here, so please let me know if I need to open a new issue, but I am having a similar problem:

[screenshot]

It's similar in that it didn't complete, but mine doesn't say it completed successfully (which is correct) and gives a 0x20 code (Undetermined result).
It stopped at 61% and processed 68k out of 108k.
Sounds about right to me, but why did it stop?

Xanashi added the fixed label on Apr 8, 2023