Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Json serializer #64

Merged
merged 14 commits into from Apr 27, 2014
Merged

Json serializer #64

merged 14 commits into from Apr 27, 2014

Conversation

guenthermi
Copy link
Member

This provides classes to build up a pipeline for serialize json dumps. In Addition it adds a new method in the TestObjectFactory to create ItemDocument test objects.

add EntityDocumentSerializer
	JsonSerializer
	JsonProcessor
	createItemDocument method in TestObjectFactory
	JsonSerialisationExample
	- name MWDumpFileProcessorImpl to MWRevisionDumpFileProcessor
	- remove unused imports
@mkroetzsch
Copy link
Member

Something needs to be fixed here. The tests are now failing.

@mkroetzsch
Copy link
Member

Could you modify the example to write to a b2zipped file instead? This should be easy to do by using the bzip2 library we use elsewhere (mainly in utils) and it will safe a lot of disk space.

@mkroetzsch
Copy link
Member

The architecture could be improved. As it is now, you require two steps (1) create a JsonProcessor, (2) create a JsonSerializer. The first step is not really needed. The JsonSerializer can simply create its JsonProcessor internally (in fact it could even be the same class, and be its own JsonSerializer instead of using another internal object). For this to work, the JsonSerializer will need to get the output stream in its own constructor, but this would be better anyway (since the JsonSerializer needs to know the output stream too; it just takes it from the JsonProcessor now).

	- remove JsonProcessor and integrate this functionality into
JsonSerializer
	- change the interface for EntityDocumentSerializer
	- compressed output in the example
@guenthermi
Copy link
Member Author

Ok I made some changes. EntityDocumentSerializer extends from EntityDocumentProcessor now. The Output in the example is send to an bzip2 compressing stream.

* Fixed Java warnings
* Fixed spelling errors in names and comments
* Simplified code in some places
@mkroetzsch
Copy link
Member

I did some smaller fixes (please watch out for Java warnings; we use American spelling in code, hence "serializer" not "serialiser").

Where in your code do you fix the output encoding? The JSON file should be in UTF8-encoded, but I don't see this being defined anywhere. Also, it would be good to have some linebreaks, at least after each entity, since otherwise the whole dump is one line and very hard to navigate with text tools. This code should be shared for processing item and property documents (currently, much code is copied there).

writeEntityDocument methode to reduce code redundance
linebreaks after each document
@mkroetzsch
Copy link
Member

I ran the JSON export now on today's dumps. This took almost exactly 5h (quite a long time). The resulting bzip2 file was 1.6G (so not very big: I/O should not be the cause of the slowdown). I was able to extract 20.1G data from this file; then it ended unexpectedly. Maybe this is because the output stream is never closed ...

I will commit fixes in a minute.

* Close stream after use
* Use StandardCharsets to get UTF-8 charset
* Throw something when catching an IOException during serialization
@mkroetzsch
Copy link
Member

Ok, this is looking good now. I checked the output of the most recent code and it is valid bzip2. I did not check if it is valid JSON inside (looks like JSON ...).

Final todo: please add this as a new feature to the release notes. Then this can be merged (feel free to merge it yourself when you are ready).

P.S. To test such code, it makes sense to process only a small daily, not all of the large dump. I not used this code to do this:

MwDumpFile dump = dumpFileManager.findMostRecentDump(DumpContentType.DAILY);
dumpFileProcessor.processDumpFileContents(dump.getDumpFileStream(), dump);

You could also create a dumpFileManager without Web access to avoid downloading a new daily every day while testing.

@guenthermi
Copy link
Member Author

Thank you for reviewing and the fixes.

"P.S. To test such code, it makes sense to process only a small daily, not all of the large dump. I not used this code to do this:"

Do you mean that this should be done only one time or is it performant enough to do it in a JUnitTest?

@mkroetzsch
Copy link
Member

Do you mean that this should be done only one time or is it performant enough to do it in a JUnitTest?

No, I just mean for your local testing, since you said that you had never actually tried your code. It would take far too long for a unit test.

guenthermi added a commit that referenced this pull request Apr 27, 2014
@guenthermi guenthermi merged commit d9c2af2 into master Apr 27, 2014
@mkroetzsch mkroetzsch deleted the json-serializer branch April 29, 2014 14:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants