New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CC-3741 - Remove use of IdentityHashMap #1009
Conversation
How is this related to #827? |
@cyrusv , it's indirectly related to #827 in the sense that with this fix, we will be less likely to overpopulate the local schema cache, thus having an LRU policy on the cache is less critical. However, we can still add an eviction policy to the cache in the future, but the cache does not need to be identity (reference) based. So in that sense this fix is orthogonal to an eviction policy on the cache. |
@YuvalItzchakov, here is a fix for #894 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rayokota Thanks for fixing this long-standing problem with SR clients. Wondering if it might be useful to have some sort of jmh benchmark to assert that there is no performance degradation while using the same schema instance. Also, I believe the new change with version is potentially not backward compatible. I really like the new approach but would that need to be coupled with this PR? Can the migration to HashMap go without those changes?
if (schema.getType().equals(Schema.Type.RECORD)) { | ||
return result; | ||
return new GenericContainerWithVersion((GenericContainer) result, version); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rayokota I believe this is potentially an incompatible change. If there are connectors that are looking for the version property, they could potentially break with this change?
Thanks @mageshn ! Yes, I can do some microbenchmarking to verify performance is acceptable. Regarding not setting the version property, I believe it is required to move to
The I looked through all our connectors and did not find a reference to the I can do a further scan of third-party connectors and preview connectors to ensure that no connectors are referring to this property if it would make you feel more comfortable. |
@mageshn , I added a SR client performance test and used it to test the difference between IdentityHashMap and HashMap. There was no performance degradation. IdentityHashMap: HashMap: The HashMap was a little faster (on average), perhaps because of difference in collision resolution (linear probing vs. chaining) and the nature of the data used during the test. |
@rayokota, I'm also concerned about this change being exposed to clients. Have you considered just changing how the client caches the entries, perhaps using a wrapper class for keys to achieve the precise comparison semantics we want? |
@rhauch @mageshn I did some more investigation into whether it's safe to not set the
To summarize:
Also, I did try a wrapper approach at one point (by trying to manipulate both |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few more comments/questions.
} | ||
|
||
public SchemaAndValue toConnectData(org.apache.avro.Schema avroSchema, Object value, | ||
Integer version) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We would need JavaDoc for this new public method, with a clear definition of the behavior if/when version
is null.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I'll add.
return false; | ||
} | ||
AvroSchemaAndVersion that = (AvroSchemaAndVersion) o; | ||
return Objects.equals(schema, that.schema) && Objects.equals(version, that.version); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
version
can be null, so this only is a match if there is no version or the versions match. Is that sufficient?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that is the desired behavior.
} else if (version != null) { | ||
versionInt = version.intValue(); | ||
} | ||
if (versionInt >= 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to not assume that -1
will never be used in a schema version? This seems brittle.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, version must be a positive integer. The assumption is documented elsewhere. I'll add a comment.
import java.util.Objects; | ||
|
||
/** | ||
* Wrapper for GenericContainer along with a version number. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably need to be much more explicit about whether a version is required, when it is expected, and what the behavior is when the version is null. If it is not expected to be null, then we should consider using int
rather than Integer
, or add asserts or checks accordingly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I'll add a comment that it can be null.
@@ -0,0 +1,60 @@ | |||
/* | |||
* Copyright 2018 Confluent Inc. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: year
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, @rayokota for driving this change. This is a super important change and will be appreciated by many users who hit the cache limit issue.
newbie here, can anyone tell how do i take these changes into my existing confluent-5.1.1 setup? I am hitting this issue consistently. I am running kafka, connect, zk, schema registry |
Hello group: 'io.confluent', name: 'kafka-avro-serializer', version: '5.3.0' and still running into too many schema objects created issue. any leads on how to resolve this ? |
There are 2 reasons why we currently use IdentityHashMap:
Regarding #1 (performance), the real gain is by having the client use the same object reference with the CachedSchemaRegistryClient. If a client does not use the same object reference, then a new Map entry is created, which can eventually lead to "Too many schema objects". Our clients use the same object reference, but customers often do not.
If we replace the IdentityHashMap with HashMap, clients that use the same object reference will still be performant. This is because the Avro schema class caches its hashcode, and its equals method will first check for reference equality.
If we replace the IdentityHashMap with HashMap, clients that do NOT use the same object reference will not be as performant, however, they will correctly get the same lookup from the Map and will not continue adding new entries to the Map (thus more likely averting "Too many schema objects").
Regarding #2 (mutable keys), the fact that we store the version on the schema is somewhat of a hack. The version can vary by subject, so there is not really a unique version per schema. A cleaner solution would be to not store the version on the schema, but to pass it along with the schema so that the schema does not have to be mutated.