Add streaming support for .gz and .bz2 format input / output files #34

Open
joelduerksen opened this issue Dec 29, 2021 · 7 comments

Comments

@joelduerksen

I'm finding very large turtle/triple datasets may be best kept in compressed form, if only to not be throttled by disk I/O max speeds while reading/writing them with a blazing fast library. Two common compressions I'm running into with triples store data are gz and bzip2. Would you consider adding the ability to stream out and back into compressed forms? There are libraries for Python to make it easy for both formats, so I'm hoping the same might be true here?
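For reference, the Python convenience mentioned above can be sketched with the standard library alone. This is a minimal illustration, not anything serd provides: the helper name `open_maybe_compressed` and the choice to pick a codec from the file suffix are assumptions made for the example.

```python
import bz2
import gzip
from pathlib import Path

# Map file suffixes to the stdlib opener that handles them.
# Anything else falls back to the built-in open().
_OPENERS = {".gz": gzip.open, ".bz2": bz2.open}


def open_maybe_compressed(path, write=False):
    """Open a plain, .gz, or .bz2 file as a UTF-8 text stream.

    The codec is chosen from the final suffix only, so "dump.ttl.gz"
    is treated as gzip and "dump.ttl" as uncompressed text.
    """
    opener = _OPENERS.get(Path(path).suffix.lower(), open)
    # gzip.open and bz2.open default to binary mode, so text mode
    # ("rt" / "wt") has to be requested explicitly.
    mode = "wt" if write else "rt"
    return opener(path, mode, encoding="utf-8")
```

The returned object is an ordinary text file object in every case, so it can be iterated line by line without loading the whole dataset into memory.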

@drobilla
Owner

drobilla commented Dec 29, 2021

Are you using the command-line utility, or serd as a library?

For code, it should be relatively easy to hook up serd to whatever compression library using custom read and write functions. Since serd is a lightweight library with no external dependencies, I don't think it's appropriate to add dependencies for this (and don't want to step on the feature creep treadmill of whatever archive format somebody wants this week). If the API makes this too difficult (there's a ton of archive libraries I have never tried), the reasons why should be addressed. I have already revised that heavily in the upcoming major version (serd1 branch) and imagine it should hook up nicely to more or less anything, but it'd be good to double-check various popular libraries before committing to the API.

That said, I'd be more open to adding it to the command-line utilities since that should be easy to make optional and doesn't add a dependency to the library itself. That code would also serve as an example to steal for other programs/libraries that want to do it. On the other hand, there you could just set up a pipeline...

@joelduerksen
Author

Just using serdi at the command line to convert to N-Triples right now, so I can examine the output of serd. Agreed, I was thinking a compile-time option to add support to serdi might be reasonable. I understand the goal of staying lightweight with no external dependencies (I like that too).

My next step is to try using the library (and I could add decompression support myself, understood). Agreed that examples are very helpful; it looks like the code for serdi itself might be the best example to start with? I hope to read any one of the supported formats (but compressed), do some minimal processing/filtering of the triples as they fly by, and then store subsets in a few separate files. Simple use, but it needs to be performant and streaming, hence my interest in this library.
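The read-filter-split workflow described above could look roughly like the following sketch. It does not use serd at all: it assumes the input has already been converted to gzip-compressed N-Triples (for example by serdi) with whitespace between terms, and it splits lines naively instead of parsing them. The function names and the routing table are hypothetical.

```python
import gzip
from contextlib import ExitStack


def predicate_of(line):
    """Return the predicate term of an N-Triples line, or None.

    Naive split, not a parser: it relies on the subject and predicate
    containing no whitespace and being separated by whitespace.
    """
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return None  # blank line or comment
    parts = stripped.split(None, 2)
    return parts[1] if len(parts) == 3 else None


def split_by_predicate(src_path, routes):
    """Stream a gzip N-Triples file and copy each triple whose predicate
    appears in `routes` to the gzip output file mapped from it.

    `routes` maps predicate terms (e.g. "<http://example.org/p>") to
    output paths. Returns the number of triples written per output.
    """
    counts = {out: 0 for out in routes.values()}
    with ExitStack() as stack:
        # Open every output once; ExitStack closes them all on exit.
        outputs = {
            out: stack.enter_context(gzip.open(out, "wt", encoding="utf-8"))
            for out in counts
        }
        src = stack.enter_context(gzip.open(src_path, "rt", encoding="utf-8"))
        for line in src:  # one line at a time, never the whole file
            out = routes.get(predicate_of(line))
            if out is not None:
                outputs[out].write(line)
                counts[out] += 1
    return counts
```

Only one line is held in memory at a time, so the approach scales to large dumps, though a C program built on the library would likely be faster.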

@drobilla
Owner

I see. For things like that, if you want to dig into the code, you might want to start with the aforementioned serd1 branch, even though it's not out yet. There is a lot more there around processing streams (including a utility specifically for filtering) and the API is quite a bit friendlier and more polished. You can do it with the current stable branch too, though; there have always been facilities for custom functions. Unfortunately there's not much example-based documentation (niche within a niche here, never seemed worth my time), but you should be able to figure it out from serdi or just poking through serd.h.

If you're a Python fan, I'm working on Python bindings in the serd1 branch as well. They're not quite done yet though (I think the current tip doesn't even build, bit of a mess right now). Earlier WIP of the documentation here, for example: https://drobilla.net/files/pyserd_docs/ . I hope to finish this stuff up shortly, but have a lot of balls in the air right now... if you're interested in this I can ping this issue when they're ready(ish), feedback would be helpful.

As for the issue at hand, we can ponder whether built-in support is worth it for convenience, but you can always just throw some UNIX at the problem, e.g.

zcat mydbdump.ttl.gz | serdi -

@joelduerksen
Author

I'm fine with C (likely faster), but Python is fine if it's easier to use and nearly as fast. I use any language as needed; if I had to pick a language, I'd identify as a K&R C fan. Yes, I am interested to hear when serd1 is done. One question (maybe this is a can of worms?): why do you use a dash for stdin, instead of just reading stdin when no file is given like common file commands do, e.g. cat, sort, uniq, cut, etc.? If I'm not mistaken, using a dash is a niche behavior used only by a select few apps. It feels unnatural to need to add a dash parameter when piping data to serdi... I keep forgetting serdi needs it...

@drobilla
Owner

drobilla commented Jan 3, 2022

Okay, I was just guessing from the Python libraries comment.

The - thing is a pretty universal convention for tools that are usually used with file inputs (which are friendlier in this case because then a base URI and syntax can be determined), but I suppose it could perhaps work without. In any case, please open separate tickets for unrelated issues to keep the tracker on point.

@joelduerksen
Author

You can close this ticket. Thank you for the notes. Agreed, this is not core, and there are more important things to work on.

@drobilla
Owner

drobilla commented Jan 6, 2022

Okay. I will keep it around for now as a reminder, since I would like to make sure that at least, for example, it's easy to wire up libarchive to the read/write APIs.

I probably won't add support to the tools themselves for initial release (I'm really struggling to finally get this out, so non-API-affecting feature creep in general is out), but it should be easy enough to add as a feature in a minor release.
