Add unzip_above option #696

dadoonet · 2019-03-09T13:03:56Z

Based on the discussion in #690 (comment), we could implement a new unzip_above: 100mb option that extracts archives above that size into a tmp dir (defined by a FSCRAWLER_TMP system option, defaulting to the system tmp dir) and then runs FSCrawler on that dir.

For each generated Doc, we would then rewrite the URL so that it appears to come from the ZIP/GZ file rather than from the tmp dir.

After the run, we would remove the tmp dir and its files. This could be made optional with a setting like keep_tmp_files: true.
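To make the proposed flow concrete, here is a minimal sketch of the three steps (extract above a threshold, rewrite the URL, clean up). It is written in Python for brevity and is not FSCrawler code: the function name `crawl_zip_if_large`, its parameters, the `tmp_root` argument standing in for FSCRAWLER_TMP, and the `!/` separator used in the rewritten URL are all assumptions made for illustration. It only handles ZIP, not GZ.

```python
import os
import shutil
import tempfile
import zipfile


def crawl_zip_if_large(zip_path, unzip_above_bytes, keep_tmp_files=False, tmp_root=None):
    """Hypothetical sketch: extract a large ZIP to a tmp dir, list its files,
    and report each one as if it came from the archive itself."""
    if os.path.getsize(zip_path) <= unzip_above_bytes:
        return []  # at or below the threshold: nothing is extracted

    # tmp_root stands in for the proposed FSCRAWLER_TMP option (assumption);
    # None falls back to the system tmp dir.
    tmp_dir = tempfile.mkdtemp(prefix="fscrawler_", dir=tmp_root)
    docs = []
    try:
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(tmp_dir)
        for root, _dirs, files in os.walk(tmp_dir):
            for name in sorted(files):
                extracted = os.path.join(root, name)
                relative = os.path.relpath(extracted, tmp_dir).replace(os.sep, "/")
                # Rewrite the location: archive path plus entry path,
                # not the tmp path. The "!/" separator is an assumption.
                docs.append({
                    "url": f"{zip_path}!/{relative}",
                    "size": os.path.getsize(extracted),
                })
    finally:
        # Mirrors the proposed keep_tmp_files setting.
        if not keep_tmp_files:
            shutil.rmtree(tmp_dir, ignore_errors=True)
    return docs
```

Calling `crawl_zip_if_large("/data/big.zip", 100 * 1024 * 1024)` would return one entry per extracted file, each with a URL rooted at the archive path, and would leave no tmp dir behind unless `keep_tmp_files=True` is passed.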


dadoonet · 2019-03-13T18:43:15Z

So this is already happening behind the scenes in Tika. No need for this option, then.

dadoonet added the "new" (For new features or options) label Mar 9, 2019

dadoonet closed this as completed Mar 13, 2019
