FSCrawler not indexing all the files #690

Open
Ganesh-96 opened this issue Feb 28, 2019 · 56 comments

@Ganesh-96

We indexed 2 million documents into Elasticsearch using FSCrawler, but the file count in Elasticsearch doesn't match the number of files in the share path. Is there a way to identify which files were not indexed?

@dadoonet
Owner

There is no easy way to do this. But I believe that some documents are not indexed because they have been rejected, so you should see that in the FSCrawler logs.
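If you want to at least narrow down which files are missing, one rough approach is to compare the list of files on the share with the paths stored in the index. A sketch only, not a tested recipe: the share path, index name and the path.real source field are assumptions, and with 2 million documents you would need the scroll API rather than a single capped search.

# list files on the share (path is an assumption)
find /mnt/share -type f | sort > files_on_share.txt
# dump indexed paths from Elasticsearch (needs jq; a plain search is capped at 10000 hits,
# so for millions of documents use the scroll API instead)
curl -s 'http://localhost:9200/my_job/_search?size=10000&_source=path.real' \
  | jq -r '.hits.hits[]._source.path.real' | sort > files_indexed.txt
# files present on the share but missing from the index
# (the two lists must use the same path format for the comparison to be meaningful)
comm -23 files_on_share.txt files_indexed.txt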

@Ganesh-96
Author

I did not find any file names in the log files, so I tried debug mode as well, but the log file is around 12 GB. Is there any sort of pattern for the rejected files / error messages in the debug logs?

@dadoonet
Owner

You should see a WARN or ERROR message, I think. It should not really be part of the DEBUG level, unless you specified to ignore errors?

@Ganesh-96
Author

Yes, in the settings file I have set "continue_on_error" to true. I do see some error messages in the logs but couldn't tell which files these errors relate to.

11:02:18,821 WARN [o.a.p.p.f.PDTrueTypeFont] Could not read embedded TTF for font IOFIMH+Arial
java.io.IOException: Error:TTF.loca unknown offset format.
11:02:18,839 ERROR [o.a.p.c.PDFStreamEngine] Operator Tf has too few operands: [COSName{TT19}]
11:02:19,273 WARN [o.a.p.p.f.PDTrueTypeFont] Could not read embedded TTF for font AIRSWE+Arial,Bold
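
For context, the relevant part of a job settings file with that option looks roughly like this (job name and path are placeholders, shown in the YAML form documented for FSCrawler):

name: "my_job"
fs:
  url: "/path/to/share"
  continue_on_error: true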

@dadoonet
Owner

Ha, I see... Something I should maybe fix, if it is not already fixed in the latest SNAPSHOT. I can't recall off the top of my head.

Because of the continue_on_error setting, it probably does not WARN about that. I need to check the code and see if I can be smarter here.

@Ganesh-96
Author

Okay, we are using version 2.6 currently.
So is there a way to tie the actual object (file/folder) name to the error messages?

@dadoonet
Owner

I think that #675 fixes that.
If you download the latest 2.7-SNAPSHOT, that should be part of it. See https://fscrawler.readthedocs.io/en/latest/installation.html

@Ganesh-96
Author

Okay, I will see if we can upgrade to the latest version.
I have one other query. As mentioned in the documentation, I manually downloaded the jai-imageio-core-1.3.0.jar and jai-imageio-jpeg2000-1.3.0.jar files and added them to the lib directory. But I keep getting the same warning every time: J2KImageReader not loaded.

09:20:22,316 WARN [o.a.t.p.PDFParser] J2KImageReader not loaded. JPEG2000 files will not be processed. See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io for optional dependencies.
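
For reference, the two jars sit directly alongside the other FSCrawler jars (the home directory is shown as a placeholder):

<FSCRAWLER_HOME>/lib/jai-imageio-core-1.3.0.jar
<FSCRAWLER_HOME>/lib/jai-imageio-jpeg2000-1.3.0.jar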

@dadoonet
Owner

Could you share a file that produces that warning?

@Ganesh-96
Author

This message comes up the moment any index job is started.

@dadoonet
Owner

Ha right. I'll try it then.

@Ganesh-96
Author

Okay, then I will wait for an update from you.
One quick question: with 2.7, does it print the path of the file along with the name? There are cases where multiple files have the same name but are in different paths.

@dadoonet
Owner

From what I recall it's the full path. But give it a try. I'd love to get feedback on 2.7.

@Ganesh-96
Author

Sure, will give it a try.

@Ganesh-96
Author

I did a test run using the 2.7 snapshot version. Now I can see the error along with the file name in the log file, but it shows only the file name, not the path.

@dadoonet
Owner

dadoonet commented Mar 2, 2019

Thanks. I will change that if possible. Stay tuned.

@dadoonet
Owner

dadoonet commented Mar 3, 2019

@Ganesh2409 I merged #694 which will give you the full filename. It should be available in the oss Sonatype snapshot repository in less than one hour if you'd like to try it.

@Ganesh-96
Author

All the error messages (WARN) with file names that I have seen in the log files are related to parsing errors. I can see these files in the indexes though. All these files have no content in the indexes, but the file properties are indexed.
I also see some ERROR messages in the logs, but these errors don't mention the file names.
Error Messages.txt

@dadoonet
Owner

dadoonet commented Mar 4, 2019

All these files have no content in the indexes, but the file properties are indexed.

That's the effect of continue_on_error.

I also see some ERROR messages in the logs, but these errors don't mention the file names.

Hmmm... You don't have any other messages than that?

I mean between those 2 lines:

Line 111196: 23:03:40,800 ERROR [o.a.p.c.PDFStreamEngine] Operator cm has too few operands: [COSInt{0}, COSInt{0}]
Line 307748: 04:04:56,817 ERROR [o.a.p.c.PDFStreamEngine] Operator Tf has too few operands: [COSFloat{6.48}]

@Ganesh-96
Author

Ganesh-96 commented Mar 4, 2019

All the others are WARN and INFO messages only. Below are a couple of lines before the ERROR messages.

04:04:56,744 WARN  [o.a.p.p.f.PDTrueTypeFont] Using fallback font 'TimesNewRomanPSMT' for 'LRMSER+63shhiibsqwrsad'
04:04:56,756 WARN  [o.a.p.p.f.PDTrueTypeFont] Using fallback font 'TimesNewRomanPSMT' for 'FORKEV+63shhiibsqwrsad'
04:04:56,817 ERROR [o.a.p.c.PDFStreamEngine] Operator Tf has too few operands: [COSFloat{6.48}]
04:04:56,817 ERROR [o.a.p.c.PDFStreamEngine] Operator Tf has too few operands: [COSFloat{2903.48}]

@dadoonet
Owner

dadoonet commented Mar 4, 2019

Would it be possible to share a document that reproduces it, even if you remove most of its content?

@Ganesh-96
Author

The problem is I am not sure which document these errors are coming from. Based on the document and the data it contains, I can try and see whether I can share that document or not.

@dadoonet
Owner

dadoonet commented Mar 4, 2019

Ha, I see... The only way at this time would be to activate the --debug option and see... Not ideal, for sure...

@Ganesh-96
Author

I checked with my team and we cannot share any of the files, as they contain sensitive information.

@Ganesh-96
Author

Is there any way we can ignore the warnings during parsing and index the content?

@dadoonet
Owner

dadoonet commented Mar 4, 2019

I believe the warning means that Tika is not able to extract the content. So I'd assume that it's not possible.

@Ganesh-96
Author

Oh okay, so now we have come across a new scenario altogether (content not being indexed), but our original issue is still the same: the errors we have seen relate to content parsing only, and the file is present in the index. So we still have the main issue.

@Ganesh-96
Author

Is it possible to print the file names for the ERROR messages also? Right now I can see the file names for WARN messages only.

@dadoonet
Owner

dadoonet commented Mar 5, 2019

It could probably be possible, but I'd need to be able to reproduce the problem if I want to fix it. Without a document which generates this error, it is hard to guess where I should put the code.
Especially as the error seems to be printed by Tika code and not by FSCrawler code, I'm unsure I can catch something which is not thrown.
I think I could test if the content is null and add a warn_on_null_content option, maybe...

@Ganesh-96
Author

Unfortunately I cannot share the documents.

@Ganesh-96
Author

One more issue we are seeing is that a couple of jobs are getting stuck for a couple of days. There is no change in the document count in the indexes and no logs are being printed. Probably not an issue, but it would be helpful to know.

@Ganesh-96
Author

Ganesh-96 commented Mar 7, 2019

Any inputs on the above issue? Currently the job has been stuck for more than a day. The file where it got stuck is 2 GB with a gz extension. This issue occurs only when "indexed_chars": "-1" is set.

@Bhanuji95

We are facing similar issues as well; FSCrawler is getting stuck while indexing some documents which are around 4 GB in size.

@dadoonet
Owner

dadoonet commented Mar 9, 2019

Any inputs on the above issue? Currently the job has been stuck for more than a day. The file where it got stuck is 2 GB with a gz extension. This issue occurs only when "indexed_chars": "-1" is set.

@Ganesh2409 How much memory did you assign to FSCrawler? I mean that it will probably require a lot of memory to unzip and parse every single piece of content.
Ideally you should unzip the files in your directory and let FSCrawler index smaller files.
One of the features I may implement would be to unzip files in a tmp dir, index that content, then remove the dir... An optional setting of course, like unzip_above: 100mb for example.
WDYT? Would that help? It requires a bit of thinking, introducing new settings like an fscrawler_tmp dir... Probably not that quick to implement.
Another workaround would be to exclude big files with the ignore_above setting.
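A rough sketch of that workaround in the job settings, assuming the fs.ignore_above option from the FSCrawler docs (the value is only an example):

fs:
  indexed_chars: "-1"
  ignore_above: "100mb"   # skip files larger than 100 MB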

@Bhanuji95 What kind of file is it?

@Ganesh-96
Author

The server has 32 GB of memory and it is only used for FSCrawler. I haven't configured any memory specifically for FSCrawler.

@Ganesh-96
Author

Can we have some timeout setting to skip the current file and continue with indexing if there are no updates / it's not able to index that file, instead of waiting for it to finish? Currently I have to stop the job, as it is in a hung state.

@Bhanuji95

It is a .7z file

@dadoonet
Owner

@Ganesh2409 Read https://fscrawler.readthedocs.io/en/latest/admin/jvm-settings.html and add much more memory to FSCrawler, like 16 GB maybe. I'll be happy to hear if this makes things better.
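For example, something along these lines (the FS_JAVA_OPTS variable is the one described on that page; the job name and the 16g value are just placeholders):

FS_JAVA_OPTS="-Xmx16g" bin/fscrawler my_job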

Can we have some timeout setting to skip the current file and continue with indexing if there are no updates / it's not able to index that file, instead of waiting for it to finish? Currently I have to stop the job, as it is in a hung state.

Good question. I don't know yet. It would require making all of that run in separate threads and having a timeout for each thread. That's something I have in mind for the future (running in an async mode), but it's not there yet.

Would you mind opening a separate feature request like "Add extraction timeout" or something like this?

@dadoonet
Owner

@Bhanuji95 so the same answer I gave in #690 (comment) applies.

@dadoonet
Owner

One of the features I may implement would be to unzip files in a tmp dir, index that content, then remove the dir... An optional setting of course, like unzip_above: 100mb for example.

Hmmm. I looked at the Tika source code and it seems that Tika is actually using a tmp dir to extract data.

See https://github.com/apache/tika/blob/master/tika-core/src/main/java/org/apache/tika/io/TemporaryResources.java

@tballison Could you confirm that?

@Ganesh-96
Author

Ganesh-96 commented Mar 11, 2019

@Ganesh2409 Read https://fscrawler.readthedocs.io/en/latest/admin/jvm-settings.html and add much more memory to FSCrawler, like 16 GB maybe. I'll be happy to hear if this makes things better.

Sure, I will give it a try.

Can we have some timeout setting to skip the current file and continue with indexing if there are no updates / it's not able to index that file, instead of waiting for it to finish? Currently I have to stop the job, as it is in a hung state.

Good question. I don't know yet. It would require making all of that run in separate threads and having a timeout for each thread. That's something I have in mind for the future (running in an async mode), but it's not there yet.

Would you mind opening a separate feature request like "Add extraction timeout" or something like this?

Sure, I can do this, but I haven't done it before.

@Ganesh-96
Author

I can see the file properties when there are some parsing errors, but for large files it is getting stuck.
So if a file's content can't be indexed, can we get just the file properties indexed?

@Ganesh-96
Author

For folder indexes, we are getting only the path details in the indexes. Are there any options to get the last modified date as well, as we get in the files index?

@dadoonet
Owner

@Ganesh2409 It does not exist. I don't think I'd like to support it, as the way I'm designing the next version will remove the folder index altogether.

@Ganesh-96
Author

So will there be no information about the folders indexed in the future release, or will we have those details in the files index?

@Ganesh-96
Author

Ganesh-96 commented Mar 13, 2019

Got a new error while trying to index the full content ("indexed_chars": "-1") of the files.

20:17:44,580 WARN  [f.p.e.c.f.FsParserAbstract] Error while crawling \\servername\folder: integer overflow

and it stopped even though continue_on_error is set to true.

@dadoonet
Owner

it stopped

You mean that the FSCrawler process exited?

@Ganesh-96
Author

Yes.

@dadoonet
Owner

It'd be great if you could share the document that makes that happen in a new issue, so I can look at it.

@tballison

tballison commented Mar 13, 2019

Hmmm. I looked at the Tika source code and it seems that Tika is actually using a tmp dir to extract data.

Yes, various parsers create tmp files quite often.

@tballison

tballison commented Mar 13, 2019

Good question. I don't know yet. It would require making all of that run in separate threads and having a timeout for each thread.

Sadly, no. That won't be robust against an infinite loop. You can't kill a thread, you can only ask it to stop politely and hope for the best. The only way to "timeout" an infinite loop is to kill the process.

  • We have the ForkParser, which spawns a separate child process and has the notion of a "timeout".
  • tika-batch will run robustly (OOM and timeout) against a directory of documents in batch mode.
  • tika-server in --spawnChild mode spawns a child and will kill/restart it on timeout, OOM, etc.

Happy to discuss if you have questions... See https://issues.apache.org/jira/browse/TIKA-456

@tballison

We are facing similar issues as well; FSCrawler is getting stuck while indexing some documents which are around 4 GB in size.

Tika really doesn't work well with files of this size. Tika was originally designed to be streaming, but some file formats simply don't allow that. The best solution is the one you've already come to, which is to uncompress/unpack large container files: gz, zip, etc. as well as, e.g. pst/mbox...
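
A minimal sketch of that pre-extraction step, assuming a Unix-like host and a gzip build that supports -k (keep the original archive); the path is a placeholder:

# decompress .gz files in place, keeping the original archives
find /mnt/share -type f -name '*.gz' -exec gunzip -k {} \;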

@Ganesh-96
Author

It'd be great if you could share the document that makes that happen in a new issue, so I can look at it.

The main problem with sharing the document is that I can't really identify the document that is generating these issues. If I run the job in debug mode, it just creates a huge log file, which makes it very hard to find issues.

@dadoonet
Owner

dadoonet commented Apr 7, 2019

@Ganesh2409 If you run it in debug mode, I think that close to the WARN line:

20:17:44,580 WARN  [f.p.e.c.f.FsParserAbstract] Error while crawling \\servername\folder: integer overflow

You should also have a stack trace. It would help if you could share that one.

Maybe you could enable debug logging just for the FsParserAbstract class. See https://fscrawler.readthedocs.io/en/latest/admin/logger.html?highlight=logger
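
A sketch of what that could look like in FSCrawler's log4j2 configuration, assuming the file layout described on that logger page (the abbreviated logger name f.p.e.c.f in the log lines expands to the package used below):

<!-- inside the <Loggers> section: raise only this class to debug instead of running the whole job with --debug -->
<Logger name="fr.pilato.elasticsearch.crawler.fs.FsParserAbstract" level="debug"/>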

@dineshrana87

Sir,
It does not search Hindi image and PDF documents.
I have also set the language like below:
ocr:
language: "eng+hi"
enabled: true
path: "C:/Program Files/Tesseract-OCR"
data_path: "C:/Program Files/Tesseract-OCR/tessdata"
pdf_strategy: "ocr_and_text"
follow_symlinks: false

Kindly tell me
Thanks and Regards
Dinesh Rana
India

@sahin52
Contributor

sahin52 commented Nov 23, 2021

You may have to change the language to "eng+hin", since the code for Hindi is hin. Look it up here: https://www.loc.gov/standards/iso639-2/php/code_list.php @dineshrana87
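
With that change, the ocr block from the earlier comment would look roughly like this (only the language value differs; "hin" is also the name of the Tesseract traineddata file):

ocr:
  language: "eng+hin"
  enabled: true
  path: "C:/Program Files/Tesseract-OCR"
  data_path: "C:/Program Files/Tesseract-OCR/tessdata"
  pdf_strategy: "ocr_and_text"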
