Json serializer #64

guenthermi · 2014-04-22T07:19:00Z

This provides classes to build a pipeline for serializing JSON dumps. In addition, it adds a new method to the TestObjectFactory for creating ItemDocument test objects.

add EntityDocumentSerializer JsonSerializer JsonProcessor createItemDocument method in TestObjectFactory JsonSerialisationExample

- name MWDumpFileProcessorImpl to MWRevisionDumpFileProcessor
- remove unused imports

mkroetzsch · 2014-04-23T08:09:37Z

Something needs to be fixed here. The tests are now failing.

mkroetzsch · 2014-04-24T08:03:39Z

Could you modify the example to write to a bzip2-compressed file instead? This should be easy to do with the bzip2 library we use elsewhere (mainly in utils), and it will save a lot of disk space.

mkroetzsch · 2014-04-24T08:26:55Z

The architecture could be improved. As it is now, two steps are required: (1) create a JsonProcessor, (2) create a JsonSerializer. The first step is not really needed. The JsonSerializer can simply create its JsonProcessor internally (in fact it could even be the same class, and be its own JsonSerializer instead of using another internal object). For this to work, the JsonSerializer will need to get the output stream in its own constructor, but this would be better anyway (since the JsonSerializer needs to know the output stream too; it just takes it from the JsonProcessor now).
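The single-step design suggested above could look roughly like the following sketch. The class name `SingleStepSerializer`, its method names, and the choice to wrap the documents in a JSON array are assumptions made for illustration; they are not the Wikidata Toolkit API. The only point is that the serializer receives the output stream in its constructor and needs no separately created processor.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-in for a serializer that owns its output stream. */
final class SingleStepSerializer implements AutoCloseable {
    private final OutputStream out;
    private boolean first = true;

    /** One step only: the caller hands over the stream, nothing else to set up. */
    SingleStepSerializer(OutputStream out) {
        this.out = out;
    }

    /** Opens the enclosing JSON array (an assumed output layout). */
    void start() {
        write("[");
    }

    /** Appends one already-serialized document, comma-separated. */
    void processDocument(String json) {
        write(first ? json : "," + json);
        first = false;
    }

    /** Closes the JSON array and the underlying stream. */
    @Override
    public void close() {
        write("]");
        try {
            out.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private void write(String s) {
        try {
            out.write(s.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            // Surface the failure instead of silently swallowing it.
            throw new UncheckedIOException(e);
        }
    }

    /** Serializes two made-up documents into an in-memory buffer. */
    static String demo() {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (SingleStepSerializer serializer = new SingleStepSerializer(buffer)) {
            serializer.start();
            serializer.processDocument("{\"id\":\"Q1\"}");
            serializer.processDocument("{\"id\":\"Q2\"}");
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [{"id":"Q1"},{"id":"Q2"}]
    }
}
```

Because the serializer owns the stream, the caller can swap in a file or a compressing stream without touching any other class.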

- remove JsonProcessor and integrate this functionality into JsonSerializer
- change the interface for EntityDocumentSerializer
- compressed output in the example

guenthermi · 2014-04-24T22:39:45Z

OK, I made some changes. EntityDocumentSerializer now extends EntityDocumentProcessor. The output in the example is sent to a bzip2-compressing stream.

* Fixed Java warnings
* Fixed spelling errors in names and comments
* Simplified code in some places

mkroetzsch · 2014-04-25T09:00:10Z

I did some smaller fixes (please watch out for Java warnings; we use American spelling in code, hence "serializer" not "serialiser").

Where in your code do you set the output encoding? The JSON file should be UTF-8-encoded, but I don't see this being defined anywhere. Also, it would be good to have some line breaks, at least after each entity, since otherwise the whole dump is one line and very hard to navigate with text tools. Finally, this code should be shared between the processing of item and property documents (currently, much code is copied there).
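A minimal sketch of the three points above (explicit UTF-8, one line break per entity, one shared write method), using only JDK classes. `LineWriterSketch` and its methods are hypothetical names, the documents are passed in as ready-made JSON strings, and the sample data is made up; the real serializer works on document objects.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

/** Hypothetical writer: explicit charset, one entity per line. */
final class LineWriterSketch {
    private final Writer writer;

    LineWriterSketch(OutputStream out) {
        // Name the charset explicitly instead of relying on the platform default.
        this.writer = new OutputStreamWriter(out, StandardCharsets.UTF_8);
    }

    void writeItem(String json) {
        writeEntity(json);
    }

    void writeProperty(String json) {
        writeEntity(json);
    }

    /** Shared by items and properties, so the logic is not copied. */
    private void writeEntity(String json) {
        try {
            writer.write(json);
            writer.write('\n'); // line break after each entity
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Pushes buffered characters through the encoder to the stream. */
    void flush() {
        try {
            writer.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes one made-up item and one made-up property to memory. */
    static byte[] demo() {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        LineWriterSketch out = new LineWriterSketch(buffer);
        out.writeItem("{\"id\":\"Q1\",\"label\":\"caf\u00e9\"}");
        out.writeProperty("{\"id\":\"P1\"}");
        out.flush();
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        System.out.print(new String(demo(), StandardCharsets.UTF_8));
    }
}
```

With one entity per line, tools such as `head` or `grep` can work on the dump without parsing the whole file.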

writeEntityDocument methode to reduce code redundance linebreaks after each document

mkroetzsch · 2014-04-25T19:06:19Z

I ran the JSON export now on today's dumps. This took almost exactly 5h (quite a long time). The resulting bzip2 file was 1.6G (so not very big: I/O should not be the cause of the slowdown). I was able to extract 20.1G data from this file; then it ended unexpectedly. Maybe this is because the output stream is never closed ...

I will commit fixes in a minute.

* Close stream after use
* Use StandardCharsets to get UTF-8 charset
* Throw something when catching an IOException during serialization
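To illustrate why a missing close can leave a compressed file that ends unexpectedly, here is a small sketch. It uses gzip from `java.util.zip` as a stand-in, because the JDK ships no bzip2 stream; the class and method names are hypothetical. The effect is of the same kind: a compressing stream buffers data and writes its trailer only when it is finished or closed.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical demo: gzip stands in for bzip2 here. */
final class CloseStreamSketch {

    /** try-with-resources closes the stream, so the output is complete. */
    static byte[] compressAndClose(String text) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (OutputStream out = new GZIPOutputStream(sink)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            // Rethrow so that a failed serialization is not silently ignored.
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    /** Deliberately never closes the stream, to show the truncation. */
    static byte[] compressWithoutClosing(String text) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try {
            OutputStream out = new GZIPOutputStream(sink);
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    /** Decompresses the whole input or throws if it is truncated. */
    static String decompress(byte[] data) {
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(data))) {
            ByteArrayOutputStream result = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                result.write(buffer, 0, n);
            }
            return new String(result.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** True if the data can be decompressed to the end without an error. */
    static boolean isReadable(byte[] data) {
        try {
            decompress(data);
            return true;
        } catch (UncheckedIOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String line = "{\"id\":\"Q1\"}\n";
        System.out.println(isReadable(compressAndClose(line)));
        System.out.println(isReadable(compressWithoutClosing(line)));
    }
}
```

The same try-with-resources pattern applies to any compressing `OutputStream`, including a bzip2 one from a third-party library.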

mkroetzsch · 2014-04-25T19:43:00Z

Ok, this is looking good now. I checked the output of the most recent code and it is valid bzip2. I did not check if it is valid JSON inside (looks like JSON ...).

Final todo: please add this as a new feature to the release notes. Then this can be merged (feel free to merge it yourself when you are ready).

P.S. To test such code, it makes sense to process only a small daily, not all of the large dump. I now used this code to do this:

```java
MwDumpFile dump = dumpFileManager.findMostRecentDump(DumpContentType.DAILY);
dumpFileProcessor.processDumpFileContents(dump.getDumpFileStream(), dump);
```

You could also create a dumpFileManager without Web access to avoid downloading a new daily every day while testing.

guenthermi · 2014-04-26T10:20:13Z

Thank you for reviewing and the fixes.

"P.S. To test such code, it makes sense to process only a small daily, not all of the large dump. I now used this code to do this:"

Do you mean that this should be done only one time or is it performant enough to do it in a JUnitTest?

mkroetzsch · 2014-04-26T10:26:52Z

"Do you mean that this should be done only one time or is it performant enough to do it in a JUnitTest?"

No, I just mean for your local testing, since you said that you had never actually tried your code. It would take far too long for a unit test.

Json serializer

guenthermi added 5 commits April 16, 2014 19:43

implement JsonSerializer

c1fe6a3

add EntityDocumentSerializer JsonSerializer JsonProcessor createItemDocument method in TestObjectFactory JsonSerialisationExample

add licence header

c1002d2

add javadoc to the example

0ea5a77

Merge remote-tracking branch 'origin/master' into json-serializer

9c54291

fix confilict

acf014e

- name MWDumpFileProcessorImpl to MWRevisionDumpFileProcessor
- remove unused imports

guenthermi added 3 commits April 23, 2014 10:43

update sitelinks in ItemDocumentEntry.txt

820c935

Merge remote-tracking branch 'origin/master' into json-serializer

604706a

add qualifiers-order attribute to ItemDocumentEntry.txt

28c8ffd

restructure JsonSerializer

4d162a1

- remove JsonProcessor and integrate this functionality into JsonSerializer
- change the interface for EntityDocumentSerializer
- compressed output in the example

Various smaller fixes

02a7661

* Fixed Java warnings
* Fixed spelling errors in names and comments
* Simplified code in some places

guenthermi added 2 commits April 25, 2014 12:14

define UTF-8 encoding

3796f67

writeEntityDocument methode to reduce code redundance linebreaks after each document

change testcode

54a9455

Small fixes

459bafc

* Close stream after use
* Use StandardCharsets to get UTF-8 charset
* Throw something when catching an IOException during serialization

Update RELEASE-NOTES.md

dfc39bc

guenthermi added a commit that referenced this pull request Apr 27, 2014

Merge pull request #64 from Wikidata/json-serializer

d9c2af2

Json serializer

guenthermi merged commit d9c2af2 into master Apr 27, 2014

mkroetzsch deleted the json-serializer branch April 29, 2014 14:20
