
Add DataKernel serializer (with ASCII optimization) #55

Closed
wants to merge 4 commits

Conversation

@vmykh (Contributor) commented Sep 25, 2015

Hi, I want to add the DataKernel serializer.

It's an extremely fast and space-efficient serializer, crafted using bytecode engineering.

As you can see from the benchmark, it is ~1.5 times faster than its closest competitor (considering total serialization + deserialization time).

Here you can find more info and examples. The source code can be found here. The DataKernel serializer is also available on Maven Central.

To create the jar file, the Maven Shade Plugin was used, which bundles all dependencies (such as ObjectWeb ASM) into the jar and avoids conflicts if you are using the same dependency at a different version. This kind of jar is called an uber-jar. More details.

@myshzzx commented Sep 26, 2015

The DataKernel serializer is quite limited and can be used in only a few situations:

  • you need to add an annotation like "@Serialize(order = 0)" to the getter of each field you want to serialize
  • before deserializing, you need to give it a fixed byte buffer (no stream support) and tell it which type/class to expect (no type recognition); only then can it do its job
  • before serializing, a large enough byte buffer is needed, or an "ArrayIndexOutOfBoundsException" will be thrown

That's all I know from its tutorial. I haven't tested its performance, but being fast is far from enough; it has a long way to go before it becomes a production-ready component. Tell me if anything here is wrong.
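A minimal sketch of this usage pattern, assuming the "@Serialize(order = ...)" getter annotation described above. The annotation is declared locally as a stand-in so the example compiles on its own, and the serialize/deserialize calls in the trailing comment are assumed names for illustration, not the verified DataKernel API:

```java
// Minimal sketch, assuming the constraints listed above. The @Serialize annotation is
// declared locally as a stand-in; in practice it would come from the DataKernel jar.
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface Serialize {
    int order();
}

class ExamplePerson {
    private final String name;
    private final int age;

    ExamplePerson(String name, int age) {
        this.name = name;
        this.age = age;
    }

    @Serialize(order = 0)   // each serialized field needs an annotated getter with an explicit order
    public String getName() { return name; }

    @Serialize(order = 1)
    public int getAge() { return age; }
}

// Assumed usage, paraphrasing the constraints above (method names are placeholders):
//   byte[] buffer = new byte[1024];                        // must be allocated large enough up front,
//                                                          // or ArrayIndexOutOfBoundsException is thrown
//   int length = serializer.serialize(buffer, 0, person);  // writes into the fixed buffer, no streams
//   ExamplePerson copy = serializer.deserialize(buffer, 0);   // the target class was given when the
//                                                             // serializer was built; no type recognition
```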

@vmykh (Contributor, Author) commented Sep 26, 2015

myshzzx, thank you for the feedback.
Here are some reasons why this approach was chosen:

  • The explicit ‘order’ attribute ensures a stable binary format: the ordering survives code refactorings and rearrangement of fields, and it also supports versioning and backward compatibility (see the sketch after this list). In addition, some JVMs do not guarantee retrieving fields in any particular order at runtime, which matters for runtime generation of the serializer.
  • The emphasis of this serializer is on speed: it is based on byte-buffer operations, which do not incur the overhead of regular Java streams.
  • Using (and re-using) a fixed buffer gives maximum performance, enforces a maximum message size, and plays nicely with the GC.
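A hypothetical sketch of what that stability looks like, re-using the stand-in @Serialize annotation from the earlier sketch; the field added in version 2 is purely illustrative:

```java
// Version 1: the explicit order numbers, not the source order of the getters, define the
// wire layout, so rearranging the getters in the source does not change the binary format.
class RecordV1 {
    private final long id;
    private final String title;

    RecordV1(long id, String title) {
        this.id = id;
        this.title = title;
    }

    @Serialize(order = 1)
    public String getTitle() { return title; }

    @Serialize(order = 0)
    public long getId() { return id; }
}

// Version 2: a new field is appended with the next free order number; existing numbers are
// never renumbered, so data written for fields 0 and 1 keeps the same layout across versions.
class RecordV2 {
    private final long id;
    private final String title;
    private final String description;   // hypothetical new field

    RecordV2(long id, String title, String description) {
        this.id = id;
        this.title = title;
        this.description = description;
    }

    @Serialize(order = 0)
    public long getId() { return id; }

    @Serialize(order = 1)
    public String getTitle() { return title; }

    @Serialize(order = 2)
    public String getDescription() { return description; }
}
```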

@myshzzx commented Sep 27, 2015

That's a big sacrifice for extreme speed. As far as I know, it fills a gap in JVM serialization.

@cowtowncoder (Collaborator)

I think it is fine to discuss the various trade-offs of codecs on the mailing list, but I do not think there are any specific criteria for inclusion, whether regarding performance, implementation, or even limitations.
That is, I don't think anyone objects to inclusion, as long as the test case is fair and the codec works well enough to pass verification and so forth.

@vmykh (Contributor, Author) commented Sep 28, 2015

Well, actually, the DataKernel serializer supports UTF-8 strings by default, plus ASCII and UTF-16 via annotations, which lets you specify the best encoding for the data being serialized. According to the jvm-serializers wiki, there is a subcategory of manually optimized serializers, so it is allowed to use parameters that give the best result.

The ASCII optimization is quite similar to the ‘Nullable’ optimization, which is common in other serializers.

Nevertheless, running the benchmarks without the ASCII annotation gives slightly slower, but still the fastest, ‘total’ times for the DataKernel serializer compared to other serializers, according to our measurements.
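To make the "using annotations" part concrete, a per-field encoding choice presumably looks something like the sketch below; the @SerializeStringFormat annotation and the StringFormat enum are assumed names declared locally for illustration, not the verified DataKernel API, and the @Serialize(order = ...) annotations are omitted for brevity:

```java
// Hypothetical sketch only: the names below are stand-ins for whatever per-field
// string-encoding annotation the library actually provides.
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

enum StringFormat { UTF8, UTF16, ASCII }

@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface SerializeStringFormat {
    StringFormat value();
}

class MediaContentLike {
    private final String uri;
    private final String title;

    MediaContentLike(String uri, String title) {
        this.uri = uri;
        this.title = title;
    }

    @SerializeStringFormat(StringFormat.ASCII)   // ASCII-only by specification, e.g. a URI
    public String getUri() { return uri; }

    // no annotation: the default UTF-8 encoding handles arbitrary Unicode titles
    public String getTitle() { return title; }
}
```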

@cakoose (Collaborator) commented Sep 29, 2015

A tangent regarding the ASCII optimization: basically, I think we shouldn't allow that optimization in the benchmark.

I think it's helpful to differentiate between the schema definition and the actual test value. Even though the test values happen to use only ASCII, the schemas are meant (in my opinion) to allow Unicode fields.

I think this reflects a common real-world scenario: most string fields are defined to be Unicode, even if the data is usually ASCII-only. So while it's fair to have a fast path for ASCII data (many serializers do), I think it's unfair if the serializer fails on Unicode.

Similarly, we only allow the non-null optimization if the schema specifies that a field is non-null. I don't think it's fair to apply the optimization to a nullable field just because the test value happens to never be null.

@cowtowncoder (Collaborator)

I fully agree with @cakoose. I don't see a problem with a codec optimizing for likely cases (of, say, ASCII), but I think it must be capable of handling non-ASCII. And this is one reason why perhaps we should force the use of a couple of non-ASCII characters -- I know there are 4 different test files, some with them.
But maybe all of them should have a few sprinkled around, to make sure codecs do not take unrealistic shortcuts; a codec that was ASCII-only would be of quite limited value in my opinion. Even if the majority of content were ASCII, there's always "that one document" where this is not true and things break.

@vmykh (Contributor, Author) commented Sep 30, 2015

I agree that Unicode support is usually needed (and the DataKernel serializer still gives great results with Unicode, so we are fine with UTF-8).

But I think in many cases the ASCII optimization can be really useful, for example for:

  • data that can only be ASCII according to the relevant standards, namely fields like "uri" or "format" (MIME / MediaType); for example, in a URI all non-ASCII characters must be percent-encoded (link) -- see the example at the end of this comment
  • various string IDs and tokens
  • hashes, passwords, randomly generated keys
  • database fields with a non-Unicode encoding

Maybe the ASCII optimization should be allowed for certain kinds of fields like "uri" and "format", and disallowed for the remaining fields, which could potentially contain Unicode? Wouldn't that give a more balanced and realistic benchmark?
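As a concrete illustration of the percent-encoding point in the first bullet, here is a small example using the standard java.net.URLEncoder (strictly a form encoder, but the way non-ASCII bytes become percent escapes is the same idea):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class PercentEncodingDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // The non-ASCII 'é' is encoded as its two UTF-8 bytes C3 A9, each written as a
        // percent escape, so the encoded form consists of ASCII characters only.
        String encoded = URLEncoder.encode("café", "UTF-8");
        System.out.println(encoded); // prints "caf%C3%A9"
    }
}
```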

@cowtowncoder (Collaborator)

@vmykh If the schema (whatever its source: external, or based on a Java class definition) indicates that the datatype can NOT contain non-ASCII content, sure. I have no specific objection to denoting that for the fields you mention and leaving others like "title" and "description" as full Unicode.

@cakoose (Collaborator) commented Sep 30, 2015

I think the ASCII optimization is a useful one to make. Many users will see real-world benefits from it. But the goal of this benchmark, I believe, is to be as useful as possible to as many people as possible, and I'm not sure that incorporating the ASCII-only optimization will aid in that goal.

Long story (again, sorry):

Having been around this project for a long time has taught me how difficult it is to make good benchmarks in general and, specifically, how misleading ours might be. One of the biggest issues is that we only have a single test value. It's something Eishay Smith came up with in 2008 when he wanted to do a quick comparison of Thrift vs Protobuf.

We've since made changes where we thought the benefit was unambiguous. For example, the test strings used to be "a" and "b" and the test numbers were 1 and 2 -- we switched to longer strings and bigger numbers.

But the fact remains that our test value is still very biased. For example, there are real-world use cases that involve serializing tons of numbers, and those use cases aren't represented in our benchmark. Changing the test value to some other reasonable value could end up changing the results by maybe 25%. Given that, it seems silly to me that we report results with the number of significant figures that we do! (I always worry about people taking our results too seriously.)

One way to improve this project is to have multiple test values that try to cover different use cases (ASCII-only, lots of numbers, short strings, long strings, etc.) [1]. This would allow a user to say "hmm, my values have lots of ASCII and lots of numbers" and pay more attention to those particular result sets.

But as it stands now, we're only using a single test schema/value. To me, it's not clear whether changing that schema/value to have ASCII-only strings will help towards the end goal of making the results more useful for more people.


@vmykh changed the title from "Add DataKernel serializer" to "Add DataKernel serializer (with ASCII optimization)" on Oct 2, 2015
@vmykh (Contributor, Author) commented Oct 2, 2015

Okay, I've updated this pull request so that the ASCII optimization is used only for the "uri" and "format" fields. Alternatively, I've also created another pull request, where the serializer doesn't use the ASCII optimization at all. So you can decide which one better conforms to the requirements and merge it.

@pascaldekloe (Collaborator)
Favoured #56 over ASCII optimizations.
