Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deserialization in parallel causes invalid objects #4791

Closed
HarelM opened this issue Jun 18, 2020 · 25 comments
Closed

Deserialization in parallel causes invalid objects #4791

HarelM opened this issue Jun 18, 2020 · 25 comments

Comments

@HarelM
Copy link

HarelM commented Jun 18, 2020

NEST/Elasticsearch.Net version: 5.6.6

Elasticsearch version: 5.6.3

Description of the problem including expected versus actual behavior:
Actual:
When deserializing a GeoJson object using NTS the properties data is duplicated from one entity to the other.
Expected: Data should not "leak".

Steps to reproduce:

  1. Query the database using search in parallel
  2. Test the results
    It doesn't happen all the time, it happens from time to time which I believe is related to timing and multithread stuff...
    Code can be seen here:
    https://github.com/IsraelHikingMap/Site/blob/master/IsraelHiking.DataAccess/ElasticSearchGateway.cs#L381L401

Expected behavior
There shouldn't be a "memory override"

Provide ConnectionSettings (if relevant):
https://github.com/IsraelHikingMap/Site/blob/master/IsraelHiking.DataAccess/ElasticSearchGateway.cs#L66L75
Most of the info can be found in the following thread, I don't know if the problem is here or in the deserialization code in the NTS library... :-(
The comment below and afterwards is the relevant discussion.
There are also elastic dump there.
NetTopologySuite/NetTopologySuite.IO.GeoJSON#46 (comment)

Provide DebugInformation (if relevant):
Can be found in the above issue.

Any help would be greatly appreciated. Running on .net core 3.1 windows server 2012 R2.

@DGuidi
Copy link

DGuidi commented Jun 18, 2020

I've checked a bit the code and looks to me (but probably I'm wrong) that ConnectionSettingsAwareSerializerBase initialize a list of json converters that uses during deserialization process.
This is a problem if this converters store some state: in this specific issue the "state" maybe can be the Context: see this code. Looks (from @HarelM tests) that the same serializer is called from multiple threads, and the context is not correctly handled in this case.

@HarelM
Copy link
Author

HarelM commented Jun 19, 2020

See my commit here to fix this issue:
IsraelHikingMap/Site@c4be675
I basically took the 5.6.6 serializer version and modified it to create a JsonSerializer every time instead of storing it, and I stored settings instead (I also removed a lot of the protected code since I don't need to derive from it).
This bug seems to be very old but I recently started to see it which made me find it here.
Let me know if you need more info.

@russcam
Copy link
Contributor

russcam commented Jun 22, 2020

Hi @HarelM, Please note that NEST 5.x is no longer supported, so we are no longer pushing out new releases of 5.x. Can you replicate the issue in 6.x or 7.x, which are supported?

@HarelM
Copy link
Author

HarelM commented Jun 23, 2020

Thanks for getting back to me!
I know that 5.x is no longer supported but the code in 6.x and 7.x is almost the same (as far as I've seen - serializers are stored in memory and not created) so I expect this bug to exist these as well.
Unfortunately, I was unable to migrate to 7.8 when I tried due to breaking changes in NEST and the fact that some geospatial queries stopped working.
All my logic around elastic can be seen in the following file (including the workaround I did to solve this issue)
https://github.com/IsraelHikingMap/Site/blob/f708d68e854decd881a87f1cbb9156501ca58311/IsraelHiking.DataAccess/ElasticSearchGateway.cs

Since this issue doesn't always reproduce, and it reproduced mainly on my production environment I wouldn't like to try and reproduce it on later versions there.

Having said that, I could use your help in migrating to later version of elastic in my dev environment and see if I can reproduce it there. I followed the migration documents but was unable to get the results I did with 5.x. If you can direct me to other documents explaining how to easily migrate indices that were created in old versions it would be great as I haven't found such a document (the documents mainly describe which version are backwards compatible and not how to easily migrate and index).
Also I'll need to know how to change the index creation methods I have and the geospatial queries that stopped working (see the link to the code above).

@russcam
Copy link
Contributor

russcam commented Jun 23, 2020

I'm not 100% sure that this bug is coming from the client; it sounds like it originates from a JsonConverter of NetTopologySuite? A small reproducible example would help to narrow the scope down.

There are quite a few serialization changes between 5.x, 6.x and 7.x:

  • internalized Json.NET serializer in 6.x
  • replacement of Json.NET with forked Utf8Json in 7.x

so it's possible that you may not see the same behaviour even just in upgrading the client.

If you can direct me to other documents explaining how to easily migrate indices that were created in old versions it would be great as I haven't found such a document (the documents mainly describe which version are backwards compatible and not how to easily migrate and index).

Take a look at the upgrading Elasticsearch documentation that has a table of the recommended upgrade paths from 5.x to 7.8.0. On Elasticsearch 5.6.3, you should be able to do a rolling upgrade to 6.8, then a rolling upgrade from 6.8 to 7.8.0. Going to 6.8 as an intermediate step will update indices.

I have and the geospatial queries that stopped working (see the link to the code above).

This might be related to the move to using BKD tree data structure for geo_shapes. I think the gaps in moving to BKD trees have now been closed in Elasticsearch 7.7.0+, e.g.

@HarelM
Copy link
Author

HarelM commented Jun 23, 2020

Unfortunately, I can't seem to be able to create a small reproducible example as mentioned in the NTS thread as the cause of the issue seems to be timing and multi threading, and it doesn't happen all the time, it happens sometimes, which is the worst kind of bugs. I was able to reproduce it on my dev env for a short while but then it stopped happening without me changing any code...

You and @DGuidi seem to think the bug is no in your code but in the other party's, which is unfortunate as the bug exists and no one want to take care of it. :-(
As far as I understand, the bottom line is that NTS is using the serializer's context when desalinizing (I don't know if this is a bug, bad design or neither) and the fact that ES is not creating a new serializer every time is what I believe is causing this issue.

While there were changes in later versions of NEST the fact is that JsonNetSerializer's serializer is still initialized only once as can be seen here hasn't changed.
https://github.com/elastic/elasticsearch-net/blob/master/src/Nest.JsonNetSerializer/ConnectionSettingsAwareSerializerBase.Customization.cs#L23L25

It might not be the right solution, but moving from storing the serializer(s) to storing only the settings/options object and creating a serializer every time the Deserialize method is called has solved the issue for me (it might introduce performance problems, which I'm not aware of, but I prefer performance issue rather than data corruption issues)

As to upgrading, I followed the instructions there and failed like a noob. I also did a migration to docker at the same time, so I might have had too much noise.
I'll be trying to upgrade in the near future as I don't want to keep a non supported version in my production environment, but this will not happen instantly so I wouldn't count on it.
Nevertheless, Thanks for the references :-)

@russcam
Copy link
Contributor

russcam commented Jun 23, 2020

@HarelM Stating that no one wants to take care of it is a mischaracterisation. At the moment, there is not a clear reproducible example to demonstrate the issue, requiring an investment in time and effort to come up with one and investigate. In addition, the issue is for a version no longer supported and it's not clear that the issue is with the client. This is why I think a minimal reproducible example is so important; it removes all extraneous variables, reducing the problem space to one that can expedite investigation 😃

The fact that ES is not creating a new serializer every time is what I believe is causing this issue.

Json.NET's JsonSerializer is thread safe as far as I know. The use of StreamingContext makes assumptions about how JsonSerializer is consumed, which may not hold for a JsonSerializer outside of the control of where such assumptions are made. For example, in the case of MultiSearch where a given search response is tied to a document type T, a new serializer may be created to handle that

IElasticsearchSerializer IConnectionSettingsValues.StatefulSerializer(JsonConverter converter) =>
_serializerFactory.CreateStateful(this, converter);

which may break the assumptions that the usage of StreamingContext is relying on. A simple reproducible example on a supported version would help to better understand what is at play.

@DGuidi
Copy link

DGuidi commented Jun 23, 2020

It's hard to understand, at least for me, how the elastic code uses the json serializers during the deserialization process. It's also hard to talk about "a bug", maybe there is some assumptions involved.
Anyway: from @HarelM debugging looks that the same jsonconverter instance is used across multiple parallel deserializations. If this is the case, the streamingcontext, that is used inside a converter to store some information about the state of the deserialization process itself, is invalid because refers to a different object respect to the object that is deserializing.
@HarelM have you tried to play with StatefulSerializer overload?

@HarelM
Copy link
Author

HarelM commented Jun 23, 2020

Thank you both for your fast responses. :-)
As I said, I couldn't create a minimal reproduction due to the complexity of the issue. I'm basically doing the same query to the database, sometimes it fails and sometimes it doesn't.
When I say fail it means that I check for unique ID in the properties of the geojson I get back after desirialization and sometime the answer is that it's not unique, while I'm 100% sure the database is returning a unique response (I dumped the json text to file and saw that the json is valid and doesn't have uniqueness issues).
As far as I understand, In order to use the latest NEST I need to use the latest (relevant) ES library, which is currently not an easy task (I might be wrong but I have already failed upgrading once so it's obviously not straight forward).

It might be solved in latest version and it might not be, I don't know and I don't want to find it the hard way when deploying to server). I would try and take my findings and see if you can reproduce it in 5.6 and if so see if you can upgrade to 7.8 and make sure this is resolved. I'm not an expert with ES and so I think that you (@russcam) should be able to do it better than I do...

I'd be happy to provide all the required data and help to try and reproduce it.
I can probably write some code to show how it should fail when it fails some times, but I don't know if you can categorize it as minimal reproduction. If so, I can probably write and send it tonight. let me know...

@HarelM
Copy link
Author

HarelM commented Jun 23, 2020

have you tried to play with StatefulSerializer overload?

I did not, I'm not sure how to as I'm not deriving from ConnectionSettingsBase anywhere in my code... :-(

@Mpdreamz
Copy link
Member

Mpdreamz commented Sep 17, 2020

Hi @HarelM

I noticed in your linked ticket you are getting close to upgrading Elasticsearch and the client, we would be very keen to hear if the problem persists although I suspect not.

Since we moved away from Json.NET as default serializer and you rely on Json.NET contract resolvers you can tell the client to use Json.NET as documented here:

https://www.elastic.co/guide/en/elasticsearch/client/net-api/current/custom-serialization.html#_jsonnetserializer

This will use Json.NETonly for the user defined values, the rest of the request and response (de)serialization still flows through our internal one.

I saw that the fix you did to call JsonSerializer.Create always fixed the issue for you, this goes against everything I know from JsonSerializer, since most state actually lives on the ContractResolver which is part of SerializerSettings. An instance of JsonSeralizer should be thread safe.

Our Nest.JsonNetSerializer does exactly the same as it did in 5.x:

https://github.com/elastic/elasticsearch-net/blob/master/src/Nest.JsonNetSerializer/ConnectionSettingsAwareSerializerBase.Customization.cs

So this might resurface the issue your facing.

I went hunting a bit and wonder if these thread safety issues on Json.Net are related:

JamesNK/Newtonsoft.Json#1393
or possibly: JamesNK/Newtonsoft.Json#1163

The referenceresolver bug might explain the duplication.

If benchmarking using JsonSerializer.Create() adds no significant overhead this might be a quick win for us.

@HarelM
Copy link
Author

HarelM commented Sep 17, 2020

@Mpdreamz thanks for taking the time to review this issue and what I wrote.
I have implemented everything that is related to the upgrade of 7.x (code-wise). I'm still not sure how to properly do it in production as everything I read about upgrading did not made me fully understand what is possible and what I shouldn't do (I have a single node currently - I wanted to attach another node with 6,8 version and let it replicate the data, then re-index the data, but then the code I'm using i.e. NEST is 5.x and not 6,x and the indices will be 6,x so I'm not sure this will work, after that I need to do the same process to migrate from 6.x to 7,x which holds the same dilemma. The reason I did not upgrade to 6,x was the breaking changes related to geospatial usage, which was not clear to me how to solve, and when I tried it locally on my machine the queries I did stopped working).
I'm guessing that I'll simply shut down the server for a few hours to do the upgrade unfortunately despite the fact that I would like my users not to suffer, I don't think I can think of an alternative right now...
I'm aware of the possibility of this bug to resurface and I have taken the optimistic approach and removed the code that I wrote to fix the bug. I'll surely report back here with the results of the upgrade and let you guys know if the issue still exists.
I hope it won't, but in case it does, I have a solution (custom serializer) I guess...

@Mpdreamz
Copy link
Member

On the client side we offer versioned packages:

https://www.elastic.co/blog/nest-and-elasticsearch-net-upgrading-your-codebase

To aid with the upgrade period where both versions exists, though still a non trivial excercise.

Long term Elasticsearch server will introduce API versioning to mitigate this issue: elastic/elasticsearch#51816 (comment) (which will allow 7.x clients to talk to 8.x).

See also @russcam's commment here: #4967 (comment)

@HarelM
Copy link
Author

HarelM commented Sep 21, 2020

The server was updated with version 7.9.1. The upgrade experience was a nightmare and I ended up shutting down the site for several hours using elasticdump and not all the complex stuff offered by elastic as they were too confusing and hard to follow.
I'll close this issue in a few days if I don't see the issue manifest again in my site's logs.

@HarelM
Copy link
Author

HarelM commented Sep 21, 2020

Sorry to disappoint, the issue exists in 7.9.1 as well. I have just seen it in the logs :-(
I'll re-write the workaround I had on my side to keep production stable. Let me know if you need anything else.

@Mpdreamz
Copy link
Member

Mpdreamz commented Sep 21, 2020

Speaking out of turn because it was @russcam who noticed this but NetTopologySuite.IO.GeoJSON attaches state to serializer

https://github.com/NetTopologySuite/NetTopologySuite.IO.GeoJSON/blob/master/src/NetTopologySuite.IO.GeoJSON/Converters/FeatureConverter.cs#L176-L179

Which on initial quick scan could explain concurrency/reuse issues.

This would explain why creating a new serializer everytime fixes the issue for you.

@HarelM
Copy link
Author

HarelM commented Sep 21, 2020

@DGuidi any chance you can take a look at it?

@DGuidi
Copy link

DGuidi commented Sep 21, 2020

@HarelM and all, indeed GeoJSON serializer uses to deserialize features.
I think we started this issue exactly for this reason.
This code works in a single-threaded "world", but looks (to me and your logs, actually) that the same instance of a serializer is used at the same time from two threads in elastic.

@HarelM
Copy link
Author

HarelM commented Sep 21, 2020

@DGuidi can you please explain why using the same serializer instance from multiple threads at the same time is considered incorrect?
Performance-wise this makes sense. I too would assume the serializer is not stateful if I write some code around it, in theory...

@DGuidi
Copy link

DGuidi commented Sep 21, 2020

simply speaking, context is used to keep track of which feature is currently deserialized, so proper "complex attributes" (like a table or a child object used as a field of a parent object) can be assigned. Using the same deserializer from different threads at the same can assign child objects to te wrong parent.
Probably the entire deserialization strategy can be rewritten to handle "stateless deserialization", but the context property is from newtonsoft.json so basically it's here to be used...

russcam added a commit to russcam/NetTopologySuite.IO.GeoJSON that referenced this issue Sep 22, 2020
Relates:elastic/elasticsearch-net#4791

This commit updates the AttributesTableConverter and FeatureConverter
to not use the serializer context when deserializing AttributesTable.
The serializer context is not safe to use multithreaded, which can be
the case when JsonConverters from NetTopologySuite.IO.GeoJson are added
to an singleton instance of JsonSerializer not under the control
of NetTopologySuite.IO.GeoJson.
@russcam
Copy link
Contributor

russcam commented Sep 22, 2020

I've opened NetTopologySuite/NetTopologySuite.IO.GeoJSON#59 to remove the use of serializer context; its usage isn't integral to NetTopologySuite's deserialization approach, and can be rewritten to not use it.

@airbreather
Copy link

I've opened NetTopologySuite/NetTopologySuite.IO.GeoJSON#59 to remove the use of serializer context; its usage isn't integral to NetTopologySuite's deserialization approach, and can be rewritten to not use it.

NetTopologySuite.IO.GeoJSON v2.0.4 includes this fix.

@HarelM
Copy link
Author

HarelM commented Oct 16, 2020

@airbreather Thanks! I'll update the package, remove my workaround and report back here.

@HarelM
Copy link
Author

HarelM commented Oct 29, 2020

I have updated the server with the package, removed my workaround and looks like this issue is resolved.
Let me know if you would like me to close it, or do anything else.
Thanks everyone for the time and effort in helping me address this issue.
Truly amazing team work! :-)

@russcam
Copy link
Contributor

russcam commented Oct 29, 2020

Great to hear that the issue is now resolved, @HarelM. Closing this issue.

@russcam russcam closed this as completed Oct 29, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants