New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Json serializer #64
Json serializer #64
Conversation
add EntityDocumentSerializer JsonSerializer JsonProcessor createItemDocument method in TestObjectFactory JsonSerialisationExample
- name MWDumpFileProcessorImpl to MWRevisionDumpFileProcessor - remove unused imports
Something needs to be fixed here. The tests are now failing. |
Could you modify the example to write to a b2zipped file instead? This should be easy to do by using the bzip2 library we use elsewhere (mainly in utils) and it will safe a lot of disk space. |
The architecture could be improved. As it is now, you require two steps (1) create a JsonProcessor, (2) create a JsonSerializer. The first step is not really needed. The JsonSerializer can simply create its JsonProcessor internally (in fact it could even be the same class, and be its own JsonSerializer instead of using another internal object). For this to work, the JsonSerializer will need to get the output stream in its own constructor, but this would be better anyway (since the JsonSerializer needs to know the output stream too; it just takes it from the JsonProcessor now). |
- remove JsonProcessor and integrate this functionality into JsonSerializer - change the interface for EntityDocumentSerializer - compressed output in the example
Ok I made some changes. EntityDocumentSerializer extends from EntityDocumentProcessor now. The Output in the example is send to an bzip2 compressing stream. |
* Fixed Java warnings * Fixed spelling errors in names and comments * Simplified code in some places
I did some smaller fixes (please watch out for Java warnings; we use American spelling in code, hence "serializer" not "serialiser"). Where in your code do you fix the output encoding? The JSON file should be in UTF8-encoded, but I don't see this being defined anywhere. Also, it would be good to have some linebreaks, at least after each entity, since otherwise the whole dump is one line and very hard to navigate with text tools. This code should be shared for processing item and property documents (currently, much code is copied there). |
writeEntityDocument methode to reduce code redundance linebreaks after each document
I ran the JSON export now on today's dumps. This took almost exactly 5h (quite a long time). The resulting bzip2 file was 1.6G (so not very big: I/O should not be the cause of the slowdown). I was able to extract 20.1G data from this file; then it ended unexpectedly. Maybe this is because the output stream is never closed ... I will commit fixes in a minute. |
* Close stream after use * Use StandardCharsets to get UTF-8 charset * Throw something when catching an IOException during serialization
Ok, this is looking good now. I checked the output of the most recent code and it is valid bzip2. I did not check if it is valid JSON inside (looks like JSON ...). Final todo: please add this as a new feature to the release notes. Then this can be merged (feel free to merge it yourself when you are ready). P.S. To test such code, it makes sense to process only a small daily, not all of the large dump. I not used this code to do this:
You could also create a dumpFileManager without Web access to avoid downloading a new daily every day while testing. |
Thank you for reviewing and the fixes.
Do you mean that this should be done only one time or is it performant enough to do it in a JUnitTest? |
No, I just mean for your local testing, since you said that you had never actually tried your code. It would take far too long for a unit test. |
This provides classes to build up a pipeline for serialize json dumps. In Addition it adds a new method in the TestObjectFactory to create ItemDocument test objects.