
Add DataKernel serializer (with ASCII optimization) #55

Closed
wants to merge 4 commits

Conversation

@vmykh (Contributor) commented Sep 25, 2015

Hi, I want to add the DataKernel serializer.

It's an extremely fast and space-efficient serializer, crafted using bytecode engineering.

As you can see from the benchmark, it is ~1.5 times faster than its closest competitor (considering total serialization + deserialization time).

Here you can find more info and examples. The source code can be found here. The DataKernel serializer is also available on Maven Central.

To create the jar file, the Maven Shade Plugin was used, which bundles all dependencies (such as ObjectWeb ASM) into the jar and avoids conflicts if you are using the same dependency at a different version. This kind of jar is called an uber-jar. More details.

@myshzzx commented Sep 26, 2015

The DataKernel serializer is quite limited and can be used in only a few situations:

  • you need to add an annotation like "@Serialize(order = 0)" to the getter of each field you want to serialize
  • before deserializing, you need to give it a fixed byte buffer (no stream support) and tell it which type/class to expect (no type recognition); only then can it do its job
  • before serializing, a large enough byte buffer is needed, or an "ArrayIndexOutOfBoundsException" will be thrown

That's all I know from its tutorial. I haven't tested its performance, but being fast is far from enough; it has a long way to go before it becomes a production-ready component. Tell me if anything here is wrong.
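A minimal sketch of this usage pattern, assuming the "@Serialize(order = ...)" getter annotation described above. The annotation is declared locally as a stand-in so the example compiles on its own, and the serialize/deserialize calls in the trailing comment are assumed names for illustration, not the verified DataKernel API:

```java
// Minimal sketch, assuming the constraints listed above. The @Serialize annotation is
// declared locally as a stand-in; in practice it would come from the DataKernel jar.
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface Serialize {
    int order();
}

class ExamplePerson {
    private final String name;
    private final int age;

    ExamplePerson(String name, int age) {
        this.name = name;
        this.age = age;
    }

    @Serialize(order = 0)   // each serialized field needs an annotated getter with an explicit order
    public String getName() { return name; }

    @Serialize(order = 1)
    public int getAge() { return age; }
}

// Assumed usage, paraphrasing the constraints above (method names are placeholders):
//   byte[] buffer = new byte[1024];                        // must be allocated large enough up front,
//                                                          // or ArrayIndexOutOfBoundsException is thrown
//   int length = serializer.serialize(buffer, 0, person);  // writes into the fixed buffer, no streams
//   ExamplePerson copy = serializer.deserialize(buffer, 0);   // the target class was given when the
//                                                             // serializer was built; no type recognition
```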

@vmykh (Contributor, Author) commented Sep 26, 2015

myshzzx, thank you for the feedback.
Here are some reasons why this approach was chosen:

  • The explicit ‘order’ attribute ensures a stable binary format: the ordering survives code refactorings and rearrangement of fields, and it also supports versioning and backward compatibility (see the sketch after this list). In addition, some JVMs do not guarantee retrieving fields in any particular order at runtime, which matters for runtime generation of the serializer.
  • The emphasis of this serializer is on speed: it is based on byte-buffer operations, which do not incur the overhead of regular Java streams.
  • Using (and re-using) a fixed buffer gives maximum performance, enforces a maximum message size, and plays nicely with the GC.
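A hypothetical sketch of what that stability looks like, re-using the stand-in @Serialize annotation from the earlier sketch; the field added in version 2 is purely illustrative:

```java
// Version 1: the explicit order numbers, not the source order of the getters, define the
// wire layout, so rearranging the getters in the source does not change the binary format.
class RecordV1 {
    private final long id;
    private final String title;

    RecordV1(long id, String title) {
        this.id = id;
        this.title = title;
    }

    @Serialize(order = 1)
    public String getTitle() { return title; }

    @Serialize(order = 0)
    public long getId() { return id; }
}

// Version 2: a new field is appended with the next free order number; existing numbers are
// never renumbered, so data written for fields 0 and 1 keeps the same layout across versions.
class RecordV2 {
    private final long id;
    private final String title;
    private final String description;   // hypothetical new field

    RecordV2(long id, String title, String description) {
        this.id = id;
        this.title = title;
        this.description = description;
    }

    @Serialize(order = 0)
    public long getId() { return id; }

    @Serialize(order = 1)
    public String getTitle() { return title; }

    @Serialize(order = 2)
    public String getDescription() { return description; }
}
```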

@myshzzx commented Sep 27, 2015

That's a big sacrifice for extreme speed. As far as I know, it fills a gap in JVM serialization.

@cowtowncoder (Collaborator)

I think it is fine to discuss the various trade-offs of codecs on the mailing list, but I do not think there are any specific criteria for inclusion, whether regarding performance, implementation, or even limitations.
That is, I don't think anyone objects to inclusion, as long as the test case is fair and the codec works well enough to pass verification and so forth.

@vmykh (Contributor, Author) commented Sep 28, 2015

Well, actually, the DataKernel serializer supports UTF-8 strings by default, plus ASCII and UTF-16 via annotations, which lets you specify the best encoding for the data being serialized. According to the jvm-serializers wiki, there is a subcategory of manually optimized serializers, so it is allowed to use parameters that give the best result.

The ASCII optimization is quite similar to the ‘Nullable’ optimization, which is common in other serializers.

Nevertheless, running the benchmarks without the ASCII annotation gives slightly slower, but still the fastest, ‘total’ times for the DataKernel serializer compared to other serializers, according to our measurements.
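To make the "using annotations" part concrete, a per-field encoding choice presumably looks something like the sketch below; the @SerializeStringFormat annotation and the StringFormat enum are assumed names declared locally for illustration, not the verified DataKernel API, and the @Serialize(order = ...) annotations are omitted for brevity:

```java
// Hypothetical sketch only: the names below are stand-ins for whatever per-field
// string-encoding annotation the library actually provides.
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

enum StringFormat { UTF8, UTF16, ASCII }

@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface SerializeStringFormat {
    StringFormat value();
}

class MediaContentLike {
    private final String uri;
    private final String title;

    MediaContentLike(String uri, String title) {
        this.uri = uri;
        this.title = title;
    }

    @SerializeStringFormat(StringFormat.ASCII)   // ASCII-only by specification, e.g. a URI
    public String getUri() { return uri; }

    // no annotation: the default UTF-8 encoding handles arbitrary Unicode titles
    public String getTitle() { return title; }
}
```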

@cakoose (Collaborator) commented Sep 29, 2015

A tangent regarding the ASCII optimization: basically, I think we shouldn't allow that optimization in the benchmark.

I think it's helpful to differentiate between the schema definition and the actual test value. Even though the test values happen to use only ASCII, the schemas are meant (in my opinion) to allow Unicode fields.

I think this reflects a common real-world scenario: most string fields are defined to be Unicode, even if the data is usually ASCII-only. So while it's fair to have a fast path for ASCII data (many serializers do), I think it's unfair if the serializer fails on Unicode.

Similarly, we only allow the non-null optimization if the schema specifies that a field is non-null. I don't think it's fair to apply the optimization to a nullable field just because the test value happens to never be null.

@cowtowncoder (Collaborator)

I fully agree with @cakoose. I don't see a problem with a codec optimizing for likely cases (of, say, ASCII), but I think it must be capable of handling non-ASCII. And this is one reason why perhaps we should force the use of a couple of non-ASCII characters -- I know there are 4 different test files, some with them.
But maybe all of them should have a few sprinkled around, to make sure codecs do not take unrealistic shortcuts; a codec that was ASCII-only would be of quite limited value in my opinion. Even if the majority of content were ASCII, there's always "that one document" where this is not true and things break.

@vmykh (Contributor, Author) commented Sep 30, 2015

I agree that Unicode support is usually needed (and the DataKernel serializer still gives great results with Unicode, so we are fine with UTF-8).

But I think in many cases the ASCII optimization can be really useful, for example for:

  • data that can only be ASCII according to the relevant standards, namely fields like "uri" or "format" (MIME / MediaType); for example, in a URI all non-ASCII characters must be percent-encoded (link) -- see the example at the end of this comment
  • various string IDs and tokens
  • hashes, passwords, randomly generated keys
  • database fields with a non-Unicode encoding

Maybe the ASCII optimization should be allowed for certain kinds of fields like "uri" and "format", and disallowed for the remaining fields, which could potentially contain Unicode? Wouldn't that give a more balanced and realistic benchmark?
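As a concrete illustration of the percent-encoding point in the first bullet, here is a small example using the standard java.net.URLEncoder (strictly a form encoder, but the way non-ASCII bytes become percent escapes is the same idea):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class PercentEncodingDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // The non-ASCII 'é' is encoded as its two UTF-8 bytes C3 A9, each written as a
        // percent escape, so the encoded form consists of ASCII characters only.
        String encoded = URLEncoder.encode("café", "UTF-8");
        System.out.println(encoded); // prints "caf%C3%A9"
    }
}
```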

@cowtowncoder (Collaborator)

@vmykh If the schema (whatever its source: external, or based on a Java class definition) indicates that the datatype can NOT contain non-ASCII content, sure. I have no specific objection to denoting that for the fields you mention and leaving others like "title" and "description" as full Unicode.

@cakoose (Collaborator) commented Sep 30, 2015

I think the ASCII optimization is a useful one to make. Many users will see real-world benefits from it. But the goal of this benchmark, I believe, is to be as useful as possible to as many people as possible, and I'm not sure that incorporating the ASCII-only optimization will aid in that goal.

Long story (again, sorry):

Having been around this project for a long time has taught me how difficult it is to make good benchmarks in general and, specifically, how misleading ours might be. One of the biggest issues is that we only have a single test value. It's something Eishay Smith came up with in 2008 when he wanted to do a quick comparison of Thrift vs Protobuf.

We've since made changes where we thought the benefit was unambiguous. For example, the test strings used to be "a" and "b" and the test numbers were 1 and 2 -- we switched to longer strings and bigger numbers.

But the fact remains that our test value is still very biased. For example, there are real-world use cases that involve serializing tons of numbers, and those use cases aren't represented in our benchmark. Changing the test value to some other reasonable value could end up changing the results by maybe 25%. Given that, it seems silly to me that we report results with the number of significant figures that we do! (I always worry about people taking our results too seriously.)

One way to improve this project is to have multiple test values that try to cover different use cases (ASCII-only, lots of numbers, short strings, long strings, etc.) [1]. This would allow a user to say "hmm, my values have lots of ASCII and lots of numbers" and pay more attention to those particular result sets.

But as it stands now, we're only using a single test schema/value. To me, it's not clear whether changing that schema/value to have ASCII-only strings will help towards the end goal of making the results more useful for more people.


@vmykh changed the title from "Add DataKernel serializer" to "Add DataKernel serializer (with ASCII optimization)" on Oct 2, 2015
@vmykh (Contributor, Author) commented Oct 2, 2015

Okay, I've updated this pull request so that the ASCII optimization is used only for the "uri" and "format" fields. Alternatively, I've also created another pull request, where the serializer doesn't use the ASCII optimization at all. So you can decide which one better conforms to the requirements and merge it.

@pascaldekloe (Collaborator)
Favoured #56 over ASCII optimizations.
