FSCrawler not indexing all the files #690

Open
Ganesh-96 opened this issue Feb 28, 2019 · 56 comments

@Ganesh-96

We indexed 2 million documents into Elasticsearch using FSCrawler, but the file count in Elasticsearch doesn't match the number of files in the share path. Is there a way to identify which files were not indexed?

@dadoonet
Owner

There is no easy way to do this. But I believe that some documents are not indexed because they have been rejected, so you should see that in the FSCrawler logs.
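If you want to at least narrow down which files are missing, one rough approach is to compare the list of files on the share with the paths stored in the index. A sketch only, not a tested recipe: the share path, index name and the path.real source field are assumptions, and with 2 million documents you would need the scroll API rather than a single capped search.

# list files on the share (path is an assumption)
find /mnt/share -type f | sort > files_on_share.txt
# dump indexed paths from Elasticsearch (needs jq; a plain search is capped at 10000 hits,
# so for millions of documents use the scroll API instead)
curl -s 'http://localhost:9200/my_job/_search?size=10000&_source=path.real' \
  | jq -r '.hits.hits[]._source.path.real' | sort > files_indexed.txt
# files present on the share but missing from the index
# (the two lists must use the same path format for the comparison to be meaningful)
comm -23 files_on_share.txt files_indexed.txt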

@Ganesh-96
Author

I did not find any file names in the log files, so I tried debug mode as well, but the log file is around 12 GB. Is there any sort of pattern for the rejected files / error messages in the debug logs?

@dadoonet
Owner

You should see a WARN or ERROR message, I think. It should not really be part of the DEBUG level, unless you specified to ignore errors?

@Ganesh-96
Author

Yes, in the settings file I have set "continue_on_error" to true. I do see some error messages in the logs but couldn't tell which files these errors relate to.

11:02:18,821 WARN [o.a.p.p.f.PDTrueTypeFont] Could not read embedded TTF for font IOFIMH+Arial
java.io.IOException: Error:TTF.loca unknown offset format.
11:02:18,839 ERROR [o.a.p.c.PDFStreamEngine] Operator Tf has too few operands: [COSName{TT19}]
11:02:19,273 WARN [o.a.p.p.f.PDTrueTypeFont] Could not read embedded TTF for font AIRSWE+Arial,Bold
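
For context, the relevant part of a job settings file with that option looks roughly like this (job name and path are placeholders, shown in the YAML form documented for FSCrawler):

name: "my_job"
fs:
  url: "/path/to/share"
  continue_on_error: true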

@dadoonet
Owner

Ha, I see... Something I should maybe fix, if it is not already fixed in the latest SNAPSHOT. I can't recall off the top of my head.

Because of the continue_on_error setting, it probably does not WARN about that. I need to check the code and see if I can be smarter here.

@Ganesh-96
Author

Okay, we are using version 2.6 currently.
So is there a way to tie the actual object (file/folder) name to the error messages?

@dadoonet
Owner

I think that #675 fixes that.
If you download the latest 2.7-SNAPSHOT, that should be part of it. See https://fscrawler.readthedocs.io/en/latest/installation.html

@Ganesh-96
Author

Okay, I will see if we can upgrade to the latest version.
I have one other query. As mentioned in the documentation, I manually downloaded the jai-imageio-core-1.3.0.jar and jai-imageio-jpeg2000-1.3.0.jar files and added them to the lib directory. But I keep getting the same warning every time: J2KImageReader not loaded.

09:20:22,316 WARN [o.a.t.p.PDFParser] J2KImageReader not loaded. JPEG2000 files will not be processed. See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io for optional dependencies.
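
For reference, the two jars sit directly alongside the other FSCrawler jars (the home directory is shown as a placeholder):

<FSCRAWLER_HOME>/lib/jai-imageio-core-1.3.0.jar
<FSCRAWLER_HOME>/lib/jai-imageio-jpeg2000-1.3.0.jar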

@dadoonet
Owner

Could you share a file that produces that warning?

@Ganesh-96
Author

This message comes up the moment any index job is started.

@dadoonet
Owner

Ha right. I'll try it then.

@Ganesh-96
Author

Okay, then I will wait for an update from you.
One quick question: with 2.7, does it print the path of the file along with the name? There are cases where multiple files have the same name but are in different paths.

@dadoonet
Owner

From what I recall it's the full path. But give it a try. I'd love to get feedback on 2.7.

@Ganesh-96
Author

Sure, will give it a try.

@Ganesh-96
Author

I did a test run using the 2.7 snapshot version. Now I can see the error along with the file name in the log file, but it shows only the file name, not the path.

@dadoonet
Owner

dadoonet commented Mar 2, 2019

Thanks. I will change that if possible. Stay tuned.

@dadoonet
Owner

dadoonet commented Mar 3, 2019

@Ganesh2409 I merged #694 which will give you the full filename. It should be available in the oss Sonatype snapshot repository in less than one hour if you'd like to try it.

@Ganesh-96
Author

All the error messages (WARN) with file names that I have seen in the log files are related to parsing errors. I can see these files in the indexes though. All these files have no content in the indexes, but the file properties are indexed.
I also see some ERROR messages in the logs, but these errors don't mention the file names.
Error Messages.txt

@dadoonet
Owner

dadoonet commented Mar 4, 2019

All these files have no content in the indexes, but the file properties are indexed.

That's the effect of continue_on_error.

I also see some ERROR messages in the logs, but these errors don't mention the file names.

Hmmm... You don't have any other messages than that?

I mean between those 2 lines:

Line 111196: 23:03:40,800 ERROR [o.a.p.c.PDFStreamEngine] Operator cm has too few operands: [COSInt{0}, COSInt{0}]
Line 307748: 04:04:56,817 ERROR [o.a.p.c.PDFStreamEngine] Operator Tf has too few operands: [COSFloat{6.48}]

@Ganesh-96
Author

Ganesh-96 commented Mar 4, 2019

All the others are WARN and INFO messages only. Below are a couple of lines before the ERROR messages.

04:04:56,744 WARN  [o.a.p.p.f.PDTrueTypeFont] Using fallback font 'TimesNewRomanPSMT' for 'LRMSER+63shhiibsqwrsad'
04:04:56,756 WARN  [o.a.p.p.f.PDTrueTypeFont] Using fallback font 'TimesNewRomanPSMT' for 'FORKEV+63shhiibsqwrsad'
04:04:56,817 ERROR [o.a.p.c.PDFStreamEngine] Operator Tf has too few operands: [COSFloat{6.48}]
04:04:56,817 ERROR [o.a.p.c.PDFStreamEngine] Operator Tf has too few operands: [COSFloat{2903.48}]

@dadoonet
Owner

dadoonet commented Mar 4, 2019

Would it be possible to share a document that reproduces it, even if you remove most of its content?

@Ganesh-96
Author

The problem is I am not sure which document these errors are coming from. Based on the document and the data it contains, I can try and see whether I can share that document or not.

@dadoonet
Owner

dadoonet commented Mar 4, 2019

Ha, I see... The only way at this time would be to activate the --debug option and see... Not ideal, for sure...

@Ganesh-96
Author

I checked with my team and we cannot share any of the files, as they contain sensitive information.

@Ganesh-96
Author

Is there any way we can ignore the warnings during parsing and index the content?

@dadoonet
Owner

dadoonet commented Mar 4, 2019

I believe the warning means that Tika is not able to extract the content. So I'd assume that it's not possible.

@Ganesh-96
Author

Oh okay, so now we have come across a new scenario altogether (content not being indexed), but our original issue is still the same: the errors we have seen relate to content parsing only, and the file is present in the index. So we still have the main issue.

@Ganesh-96
Author

Is it possible to print the file names for the ERROR messages also? Right now I can see the file names for WARN messages only.

@dadoonet
Owner

dadoonet commented Mar 5, 2019

It could probably be possible, but I'd need to be able to reproduce the problem if I want to fix it. Without a document which generates this error, it is hard to guess where I should put the code.
Especially as the error seems to be printed by Tika code and not by FSCrawler code, I'm unsure I can catch something which is not thrown.
I think I could test if the content is null and add a warn_on_null_content option, maybe...

@Ganesh-96
Author

Unfortunately I cannot share the documents.

@Ganesh-96
Author

One more issue we are seeing is that a couple of jobs are getting stuck for a couple of days. There is no change in the document count in the indexes and no logs are being printed. Probably not an issue, but it would be helpful to know.

@Ganesh-96
Author

Ganesh-96 commented Mar 7, 2019

Any inputs on the above issue? Currently the job has been stuck for more than a day. The file where it got stuck is 2 GB with a gz extension. This issue occurs only when "indexed_chars": "-1" is set.

@Bhanuji95

We are facing similar issues as well; FSCrawler is getting stuck while indexing some documents which are around 4 GB in size.

@dadoonet
Owner

dadoonet commented Mar 9, 2019

Any inputs on the above issue? Currently the job has been stuck for more than a day. The file where it got stuck is 2 GB with a gz extension. This issue occurs only when "indexed_chars": "-1" is set.

@Ganesh2409 How much memory did you assign to FSCrawler? I mean that it will probably require a lot of memory to unzip and parse every single piece of content.
Ideally you should unzip the files in your directory and let FSCrawler index smaller files.
One of the features I may implement would be to unzip files in a tmp dir, index that content, then remove the dir... An optional setting of course, like unzip_above: 100mb for example.
WDYT? Would that help? It requires a bit of thinking, introducing new settings like an fscrawler_tmp dir... Probably not that quick to implement.
Another workaround would be to exclude big files with the ignore_above setting.
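A rough sketch of that workaround in the job settings, assuming the fs.ignore_above option from the FSCrawler docs (the value is only an example):

fs:
  indexed_chars: "-1"
  ignore_above: "100mb"   # skip files larger than 100 MB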

@Bhanuji95 What kind of file is it?

@Ganesh-96
Author

The server has 32 GB of memory and it is only used for FSCrawler. I haven't configured any memory specifically for FSCrawler.

@Ganesh-96
Author

Can we have some timeout setting to skip the current file and continue with indexing if there are no updates / it's not able to index that file, instead of waiting for it to finish? Currently I have to stop the job, as it is in a hung state.

@Bhanuji95

It is a .7z file

@dadoonet
Owner

@Ganesh2409 Read https://fscrawler.readthedocs.io/en/latest/admin/jvm-settings.html and add much more memory to FSCrawler, like 16 GB maybe. I'll be happy to hear if this makes things better.
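For example, something along these lines (the FS_JAVA_OPTS variable is the one described on that page; the job name and the 16g value are just placeholders):

FS_JAVA_OPTS="-Xmx16g" bin/fscrawler my_job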

Can we have some timeout setting to skip the current file and continue with indexing if there are no updates / it's not able to index that file, instead of waiting for it to finish? Currently I have to stop the job, as it is in a hung state.

Good question. I don't know yet. It would require making all of that run in separate threads and having a timeout for each thread. That's something I have in mind for the future (running in an async mode), but it's not there yet.

Would you mind opening a separate feature request like "Add extraction timeout" or something like this?

@dadoonet
Owner

@Bhanuji95 so the same answer I gave in #690 (comment) applies.

@dadoonet
Owner

One of the features I may implement would be to unzip files in a tmp dir, index that content, then remove the dir... An optional setting of course, like unzip_above: 100mb for example.

Hmmm. I looked at the Tika source code and it seems that Tika is actually using a tmp dir to extract data.

See https://github.com/apache/tika/blob/master/tika-core/src/main/java/org/apache/tika/io/TemporaryResources.java

@tballison Could you confirm that?

@Ganesh-96
Author

Ganesh-96 commented Mar 11, 2019

@Ganesh2409 Read https://fscrawler.readthedocs.io/en/latest/admin/jvm-settings.html and add much more memory to FSCrawler, like 16 GB maybe. I'll be happy to hear if this makes things better.

Sure, I will give it a try.

Can we have some timeout setting to skip the current file and continue with indexing if there are no updates / it's not able to index that file, instead of waiting for it to finish? Currently I have to stop the job, as it is in a hung state.

Good question. I don't know yet. It would require making all of that run in separate threads and having a timeout for each thread. That's something I have in mind for the future (running in an async mode), but it's not there yet.

Would you mind opening a separate feature request like "Add extraction timeout" or something like this?

Sure, I can do this, but I haven't done it before.

@Ganesh-96
Author

I can see the file properties when there are some parsing errors, but for large files it is getting stuck.
So if a file's content can't be indexed, can we get just the file properties indexed?

@Ganesh-96
Author

For folder indexes, we are getting only the path details in the indexes. Are there any options to get the last modified date as well, as we get in the files index?

@dadoonet
Owner

@Ganesh2409 It does not exist. I don't think I'd like to support it, as the way I'm designing the next version will remove the folder index altogether.

@Ganesh-96
Author

So will there be no information about the folders indexed in the future release, or will we have those details in the files index?

@Ganesh-96
Author

Ganesh-96 commented Mar 13, 2019

Got a new error while trying to index the full content ("indexed_chars": "-1") of the files.

20:17:44,580 WARN  [f.p.e.c.f.FsParserAbstract] Error while crawling \\servername\folder: integer overflow

and it stopped even though continue_on_error is set to true.

@dadoonet
Owner

it stopped

You mean that the FSCrawler process exited?

@Ganesh-96
Author

Yes.

@dadoonet
Owner

It'd be great if you could share the document that makes that happen in a new issue, so I can look at it.

@tballison

tballison commented Mar 13, 2019

Hmmm. I looked at the Tika source code and it seems that Tika is actually using a tmp dir to extract data.

Yes, various parsers create tmp files quite often.

@tballison

tballison commented Mar 13, 2019

Good question. I don't know yet. It would require making all of that run in separate threads and having a timeout for each thread.

Sadly, no. That won't be robust against an infinite loop. You can't kill a thread, you can only ask it to stop politely and hope for the best. The only way to "timeout" an infinite loop is to kill the process.

  • We have the ForkParser, which spawns a separate child process and has the notion of a "timeout".
  • tika-batch will run robustly (OOM and timeout) against a directory of documents in batch mode.
  • tika-server in --spawnChild mode spawns a child and will kill/restart it on timeout, OOM, etc.

Happy to discuss if you have questions... See https://issues.apache.org/jira/browse/TIKA-456

@tballison

We are facing similar issues as well; FSCrawler is getting stuck while indexing some documents which are around 4 GB in size.

Tika really doesn't work well with files of this size. Tika was originally designed to be streaming, but some file formats simply don't allow that. The best solution is the one you've already come to, which is to uncompress/unpack large container files: gz, zip, etc. as well as, e.g. pst/mbox...
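
A minimal sketch of that pre-extraction step, assuming a Unix-like host and a gzip build that supports -k (keep the original archive); the path is a placeholder:

# decompress .gz files in place, keeping the original archives
find /mnt/share -type f -name '*.gz' -exec gunzip -k {} \;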

@Ganesh-96
Author

It'd be great if you could share the document that makes that happen in a new issue, so I can look at it.

The main problem with sharing the document is that I can't really identify the document that is generating these issues. If I run the job in debug mode, it just creates a huge log file, which makes it very hard to find issues.

@dadoonet
Owner

dadoonet commented Apr 7, 2019

@Ganesh2409 If you run it in debug mode, I think that close to the WARN line:

20:17:44,580 WARN  [f.p.e.c.f.FsParserAbstract] Error while crawling \\servername\folder: integer overflow

You should also have a stack trace. It would help if you could share that one.

Maybe you could enable debug logging just for the FsParserAbstract class. See https://fscrawler.readthedocs.io/en/latest/admin/logger.html?highlight=logger
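
A sketch of what that could look like in FSCrawler's log4j2 configuration, assuming the file layout described on that logger page (the abbreviated logger name f.p.e.c.f in the log lines expands to the package used below):

<!-- inside the <Loggers> section: raise only this class to debug instead of running the whole job with --debug -->
<Logger name="fr.pilato.elasticsearch.crawler.fs.FsParserAbstract" level="debug"/>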

@dineshrana87

Sir,
It does not search Hindi image and PDF documents.
I have also set the language like below:
ocr:
language: "eng+hi"
enabled: true
path: "C:/Program Files/Tesseract-OCR"
data_path: "C:/Program Files/Tesseract-OCR/tessdata"
pdf_strategy: "ocr_and_text"
follow_symlinks: false

Kindly tell me
Thanks and Regards
Dinesh Rana
India

@sahin52
Contributor

sahin52 commented Nov 23, 2021

You may have to change the language to "eng+hin", since the code for Hindi is hin. Look it up here: https://www.loc.gov/standards/iso639-2/php/code_list.php @dineshrana87
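
With that change, the ocr block from the earlier comment would look roughly like this (only the language value differs; "hin" is also the name of the Tesseract traineddata file):

ocr:
  language: "eng+hin"
  enabled: true
  path: "C:/Program Files/Tesseract-OCR"
  data_path: "C:/Program Files/Tesseract-OCR/tessdata"
  pdf_strategy: "ocr_and_text"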
