Random prefixes for blank nodes #8

Closed

pietercolpaert

Problem

I want to use raptor to concatenate RDF files. At the moment, the only reason this is not possible is that blank nodes are given the same identifiers when different files are converted.

Suggested solution

Give the blank nodes a random prefix per file. That way we can use something like

for file in *.rdf; do rapper -o ntriples "$file"; done > dump.n3

to concatenate a directory of RDF files into one dump.

@smileygingerbread

Any progress on this issue? I'm having the same problem. Right now, it's impossible to use rapper to process multiple files for the same graph. This is actually a serious bug, because many datasets are not published as one big monolithic file but as a collection of smaller files.

@smileygingerbread

This doesn't seem to work very well. It looks like I'm still getting some bnodes with the same (random) names.

@smileygingerbread

@pietercolpaert did you eventually manage to solve this issue? How?

@smileygingerbread

@pietercolpaert @dajobe another option could be to use a --genid=... command-line argument instead. It doesn't require implementing any new function, only parsing an additional argument. It would work by replacing the string "genid" with the (random) one defined by the user.

What do you guys think?

@pietercolpaert
Author

Now looking at my code 4 years later, it looks like random bnode names would of course not always return unique bnode names, so merging this to master would not solve any problems. Maybe it would help to concatenate the random string with the current Unix time at sub-microsecond resolution?
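
The idea of combining a timestamp with randomness can be sketched as follows. This is only an illustration in Python, not code from the patch or from raptor: the function name `unique_bnode_prefix` and the label layout are hypothetical, and the clock here has nanosecond rather than sub-microsecond guarantees that depend on the platform.

```python
import os
import secrets
import time


def unique_bnode_prefix() -> str:
    """Build a blank node prefix that is very unlikely to repeat across runs.

    Hypothetical helper: combines the current time in nanoseconds, the
    process id, and 64 bits from the OS random source. The leading "b"
    keeps the label starting with a letter.
    """
    timestamp = time.time_ns()          # wall-clock time, nanoseconds
    pid = os.getpid()                   # distinguishes concurrent processes
    entropy = secrets.token_hex(8)      # 16 hex characters of OS randomness
    return f"b{timestamp:x}p{pid:x}r{entropy}"


if __name__ == "__main__":
    print(unique_bnode_prefix())
```

The random part comes from the operating system rather than a seeded PRNG, so two runs started at the same instant still get different prefixes with overwhelming probability.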

An extra parameter also sounds like a good idea.

I suggest I close this PR and you can open a new issue referencing this PR.

@smileygingerbread

smileygingerbread commented Apr 29, 2017

@pietercolpaert

> it looks like random bnode names would of course not always return unique bnode names

Right now bnodes are generated as _:genIdN, where N is a progressive number within each parsed file, so for example _:genId1, _:genId2, _:genId3, ...
Your patch seemed to work because it replaces the "genId" string with a random one (one per file), so the new names become _:rnd-string1, _:rnd-string2, _:rnd-string3, ... which is fine.
The problem, at least in my tests, is that parsing different files with your patch sometimes generates the same random string. I don't know whether this is a problem with the PRNG seed or something else.
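
One possible explanation, which is an assumption and not something confirmed in this thread, is a PRNG seeded from a clock with one-second resolution: two runs started within the same second would then produce the same "random" string. The Python sketch below only illustrates that effect; `random_prefix` and the seed value are hypothetical and do not reproduce the patch.

```python
import random
import string


def random_prefix(seed: int, length: int = 8) -> str:
    """Return a pseudo-random lowercase string derived from the given seed.

    Hypothetical stand-in for a prefix generator whose PRNG is seeded
    explicitly, for example from the current time in whole seconds.
    """
    rng = random.Random(seed)
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


if __name__ == "__main__":
    same_second = 1493424000  # hypothetical timestamp shared by two runs
    first_file = random_prefix(same_second)
    second_file = random_prefix(same_second)
    # Identical seeds give identical prefixes, so the bnode names collide.
    print(first_file == second_file)
```

If this is what happens, seeding from a finer-grained clock or from the operating system's random source would avoid the repeated strings.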

> An extra parameter also sounds like a good idea.

Yeah, I like this idea as well. Basically a --genid=... parameter that replaces the default "genId" string, so that the bnodes are named _:new-string1, _:new-string2, _:new-string3, ...
I'd submit a patch for this, but I'm completely unfamiliar with the raptor source code. I'm willing to help, comment, test, and even write some code for this, but somebody who knows the code would need to guide me through it. Can anybody work on this? In theory it shouldn't be too much work: just add a getopt option that replaces the default "genId" value with the one passed on the command line.
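
Until such an option exists, the same effect can be approximated outside rapper by rewriting the blank node labels in its N-Triples output. The following is a minimal Python sketch under stated assumptions: `prefix_bnodes` is a hypothetical helper, the input is assumed to be well-formed N-Triples with one triple per line, and the prefix is assumed to start with a letter or digit.

```python
def prefix_bnodes(ntriples: str, prefix: str) -> str:
    """Prepend `prefix` to every blank node label in N-Triples text.

    Only the subject and object positions are inspected, so text such as
    "_:genid1" inside a literal is left untouched.
    """
    out = []
    for line in ntriples.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            out.append(line)  # keep blank lines and comments as they are
            continue
        # subject and predicate contain no whitespace; the rest is the object
        subject, predicate, rest = stripped.split(None, 2)
        obj = rest.rstrip()
        if obj.endswith("."):
            obj = obj[:-1].rstrip()  # drop the terminating " ."
        if subject.startswith("_:"):
            subject = "_:" + prefix + subject[2:]
        if obj.startswith("_:"):
            obj = "_:" + prefix + obj[2:]
        out.append(f"{subject} {predicate} {obj} .")
    return "".join(item + "\n" for item in out)


if __name__ == "__main__":
    sample = '_:genid1 <http://example.org/p> _:genid2 .\n'
    print(prefix_bnodes(sample, "f1"), end="")
```

Applied to the output of each file with a different prefix per file (for instance a file counter), the concatenated dump keeps blank nodes from different files distinct.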

> you can open a new issue referencing this PR

Can't open issues on this repository

@smileygingerbread

rapper also has an -f argument which looks like the default way to set parser/serialization options.

-f OPTION(=VALUE), --feature OPTION(=VALUE)  
                          Set parser or serializer options
                          Use `-f help' for a list of valid options

So, instead of defining a new --genid=... argument, it should be possible (and probably more appropriate) to add a new option to this -f argument. This solution might even be simpler.

Is there any developer or maintainer reading these comments at all???

@sharpaper

Any progress on this?!

@sharpaper

sharpaper commented Oct 21, 2017

This is a serious bug, because it makes rapper completely useless for any batch processing. This PR is already 4 years old... is rapper/librdf dead or unmaintained?

@dajobe
Owner

dajobe commented Sep 30, 2020

Not landing. rapper is not a stream processor for RDF graph merging; use redland and its rdfproc tool for that.

dajobe closed this Sep 30, 2020
@pietercolpaert
Author

As rapper is a command line tool that can be installed in, among others, Debian repositories, I still think this would have been a nice feature hidden behind a flag, or even as a separate rapper-concat command, that would help dataset maintainers without having to open up a software development environment just to bring some triples together.
