Add unzip_above option #696

Closed
dadoonet opened this issue Mar 9, 2019 · 1 comment
Labels
new For new features or options

Comments

@dadoonet
Owner

dadoonet commented Mar 9, 2019

Based on the discussion in #690 (comment), we could implement a new unzip_above: 100mb option. Archives above that size would be extracted into a tmp dir (defined by a FSCRAWLER_TMP system option, defaulting to the system tmp dir), and FSCrawler would then run on that dir.

For each generated doc, we would then rewrite the URL so that it appears to come from the ZIP/GZ file rather than from the tmp dir.

After the run, we would remove the tmp dir and its files. This cleanup could be made optional with a setting like keep_tmp_files: true.
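
To make the proposed flow concrete, here is a rough sketch in Python (not FSCrawler code). Everything in it is an assumption for illustration: the helper name `crawl_zip_via_tmp`, reading `FSCRAWLER_TMP` from an environment variable, a threshold given in bytes, and the `!/` separator used to build the URL that points back to the archive.

```python
import os
import shutil
import tempfile
import zipfile


def crawl_zip_via_tmp(zip_path, unzip_above=0, keep_tmp_files=False, tmp_root=None):
    """Hypothetical helper: extract a large ZIP to a tmp dir and list its docs.

    Returns (docs, tmp_dir). tmp_dir is None when the archive is not above
    the unzip_above threshold (in bytes) and nothing was extracted.
    """
    # Assumption: FSCRAWLER_TMP is read from the environment, falling back
    # to the system tmp dir when it is not set.
    tmp_root = tmp_root or os.environ.get("FSCRAWLER_TMP") or tempfile.gettempdir()

    # Only archives strictly above the threshold take this path.
    if os.path.getsize(zip_path) <= unzip_above:
        return [], None

    tmp_dir = tempfile.mkdtemp(prefix="fscrawler-", dir=tmp_root)
    docs = []
    try:
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(tmp_dir)

        # "Crawl" the extracted tree and rewrite each URL so that it points
        # at the archive instead of the tmp dir.
        for root, _dirs, files in os.walk(tmp_dir):
            for name in sorted(files):
                extracted = os.path.join(root, name)
                rel = os.path.relpath(extracted, tmp_dir).replace(os.sep, "/")
                docs.append({
                    "url": f"{zip_path}!/{rel}",  # assumed URL convention
                    "size": os.path.getsize(extracted),
                })
    finally:
        # Cleanup is skipped when keep_tmp_files is set.
        if not keep_tmp_files:
            shutil.rmtree(tmp_dir, ignore_errors=True)
    return docs, tmp_dir
```

Calling `crawl_zip_via_tmp("/data/big.zip", unzip_above=100 * 1024 * 1024)` would extract only if the file is larger than 100 MB, and the tmp dir would be gone once the call returns unless `keep_tmp_files=True` is passed.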

@dadoonet dadoonet added the new For new features or options label Mar 9, 2019
@dadoonet
Owner Author

So this is already happening behind the scenes in Tika. No need for this option then.
