Add streaming support for .gz and .bz2 format input / output files #34
Comments
Are you using the command-line utility, or serd as a library? For code, it should be relatively easy to hook up serd to whatever compression library using custom read and write functions. Since serd is a lightweight library with no external dependencies, I don't think it's appropriate to add dependencies for this (and don't want to step on the feature creep treadmill of whatever archive format somebody wants this week). If the API makes this too difficult (there's a ton of archive libraries I have never tried), the reasons why should be addressed. I have already revised that heavily in the upcoming major version ( That said, I'd be more open to adding it to the command-line utilities since that should be easy to make optional and doesn't add a dependency to the library itself. That code would also serve as an example to steal for other programs/libraries that want to do it. On the other hand, there you could just set up a pipeline... |
Just using serdi at the command line to convert to N-Triples right now, so I can examine the output of serd. Agreed; I was thinking a compile-time option to add support to serdi might be reasonable. I understand the goal of staying lightweight with no external dependencies (I like that too). My next step is to try using the library (and I could add decompression support myself, understood). Agreed, examples are very helpful; it looks like the code for serdi itself might be the best example to start with? I hope to read any one of the supported formats (but compressed), do some minimal processing/filtering of the triples as they fly by, and then store subsets in a few separate files. Simple use, but it needs to be performant and streaming, hence my interest in this library. |
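The filter-and-split workflow described above can be sketched with Python's standard `gzip` and `bz2` modules. This is only an illustration, not serd code: it assumes the input is already N-Triples (for example, produced by serdi), where subject and predicate contain no whitespace, so a naive split is enough. The names `open_text` and `split_by_predicate` are hypothetical helpers invented for this sketch.

```python
import bz2
import gzip


def open_text(path, mode="rt"):
    """Open a plain, .gz, or .bz2 file as UTF-8 text, chosen by extension."""
    if path.endswith(".gz"):
        return gzip.open(path, mode, encoding="utf-8")
    if path.endswith(".bz2"):
        return bz2.open(path, mode, encoding="utf-8")
    return open(path, mode, encoding="utf-8")


def split_by_predicate(src, routes):
    """Stream N-Triples lines from src into per-predicate output files.

    routes maps a predicate IRI (including angle brackets) to an output
    path. Lines with any other predicate are dropped. Returns the number
    of lines written for each predicate.
    """
    outputs = {pred: open_text(path, "wt") for pred, path in routes.items()}
    counts = {pred: 0 for pred in routes}
    try:
        with open_text(src) as lines:
            for line in lines:
                # Split into subject, predicate, and the rest of the line.
                parts = line.split(None, 2)
                if len(parts) == 3 and parts[1] in outputs:
                    outputs[parts[1]].write(line)
                    counts[parts[1]] += 1
    finally:
        for out in outputs.values():
            out.close()
    return counts
```

Only one line is held in memory at a time, so the input can be arbitrarily large. A real filter would want a proper parser (such as serd itself) rather than whitespace splitting.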
I see. For things like that, if you want to dig into the code, you might want to start with the aforementioned If you're a Python fan, I'm working on Python bindings in the As for the issue at hand, we can ponder whether built-in support is worth it for convenience, but you can always just throw some UNIX at the problem, e.g.
|
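As a rough illustration of the pipeline idea (not the example from the original comment), the sketch below decompresses a file, streams it through an external command, and compresses that command's output. The helper names `opener` and `convert` are hypothetical. `SERDI_CMD` is an assumption about a typical serdi invocation (`-i` and `-o` select the syntaxes, `-` reads stdin); check it against your installed version.

```python
import bz2
import gzip
import shutil
import subprocess
import threading

# Assumed invocation: read Turtle on stdin, write N-Triples on stdout.
SERDI_CMD = ["serdi", "-i", "turtle", "-o", "ntriples", "-"]


def opener(path):
    """Pick an open() function from the file extension."""
    if path.endswith(".gz"):
        return gzip.open
    if path.endswith(".bz2"):
        return bz2.open
    return open


def convert(src, dst, command=SERDI_CMD):
    """Stream src through command and write its output to dst.

    Both files are (de)compressed on the fly according to their
    extension. Returns the exit status of the command.
    """
    proc = subprocess.Popen(
        command, stdin=subprocess.PIPE, stdout=subprocess.PIPE
    )

    def feed():
        # Feed from a separate thread so that reading the command's
        # output below cannot deadlock against a full stdin pipe.
        try:
            with opener(src)(src, "rb") as raw:
                shutil.copyfileobj(raw, proc.stdin)
        finally:
            proc.stdin.close()

    feeder = threading.Thread(target=feed)
    feeder.start()
    with opener(dst)(dst, "wb") as out:
        shutil.copyfileobj(proc.stdout, out)
    feeder.join()
    proc.stdout.close()
    return proc.wait()
```

For example, `convert("data.ttl.bz2", "data.nt.gz")` would read bzip2-compressed Turtle and write gzip-compressed N-Triples, assuming serdi is on the `PATH`.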
I'm fine with C (likely faster), but Python is fine if it is easier to use and nearly as fast. I use any language as needed; if I had to pick one, I'd identify as a K&R C fan. Yes, I am interested to hear when serd1 is done. One question (maybe this is a can of worms?): why do you use a dash for stdin, instead of just reading stdin when no file is given, as common file commands do (e.g. cat, sort, uniq, cut)? If I'm not mistaken, using a dash is a niche behavior used only by a select few apps. It feels unnatural to need to add a dash parameter when piping data to serdi... I keep forgetting serdi needs it... |
Okay, I was just guessing from the Python libraries comment. The |
You can close this ticket. Thank you for the notes, agree this is not core, and there are more important things to work on |
Okay. I will keep it around for now as a reminder, since I would like to make sure that at least, for example, it's easy to wire up libarchive to the read/write APIs. I probably won't add support to the tools themselves for initial release (I'm really struggling to finally get this out, so non-API-affecting feature creep in general is out), but it should be easy enough to add as a feature in a minor release. |
I'm finding that very large Turtle/triple datasets may be best kept in compressed form, if only to avoid being throttled by maximum disk I/O speeds while reading/writing them with a blazing fast library. Two common compression formats I'm running into with triple store data are gz and bzip2. Would you consider adding the ability to stream out of and back into compressed forms? There are libraries for Python that make this easy for both formats, so I'm hoping the same might be true here?