[SPARK-6677] [SQL] [PySpark] fix cached classes #5445

davies · 2015-04-10T00:57:49Z

It's possible to have two DataType object with same id (memory address) at different time, we should check the cached classes to verify that it's generated by given datatype.

This PR also change __FIELDS__ and __DATATYPE__ to lower case to match Python code style.

SparkQA · 2015-04-10T02:06:41Z

Test build #29993 has finished for PR 5445 at commit 583f5ab.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.
This patch does not change any dependencies.

SparkQA · 2015-04-10T07:00:09Z

Test build #30010 has finished for PR 5445 at commit 47bdede.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.
This patch does not change any dependencies.

SparkQA · 2015-04-10T18:30:50Z

Test build #30039 has finished for PR 5445 at commit 63b3238.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.
This patch does not change any dependencies.

davies · 2015-04-10T20:35:50Z

cc @JoshRosen

JoshRosen · 2015-04-11T23:35:04Z

To recap my understanding of this patch:

The _cached_cls dictionary maps from either DataType object ids or DataType objects to the generated Row classes for those data types.
Using an object id as a dictionary key will be safe as long as that that id refers to the same object for the lifetime of the dictionary. As long as DataType instance isn't garbage-collected, its object id will not be re-used by a different DataType object.
The problem here seems to be that we weren't guaranteed to retain a strong reference to the DataType instance. Although the DataType itself was used as a dictionary key for the _cached_cls dictionary, that dictionary is a WeakValueDictionary, so its reference to the DataType key would be removed if that type's Row class was garbage collected.
The solution implemented in this patch addresses this issue via two mechanisms:
- When storing a Row class in the _cached_cls map, store a reference in the Row class that points to the DataType. This avoids the problem that we have now where the Row class can remain in the map even though its DataType has been garbage-collected. EDIT: this reference seems to just providing us a cheap way to check the validity of the entry returned from the map, not for GC reasons; see discussions below.
- Add a check that tests whether the Row class returned from the _cached_cls map has the expected DataType. As far as I understand it, this acts more as an assertion / sanity check / error-handler, so we expect this check to succeed most (all?) of the time. EDIT: this is necessary; see discussion below.

JoshRosen · 2015-04-11T23:40:27Z

python/pyspark/sql/types.py

Due to name-mangling, this field will no longer be accessible outside of the Row class itself, but I don't think that's a problem based on how we use it: it looks like the only place where we read the old __DATATYPE__ field was in __reduce__.

It seems that Row will have two reference to dataType, via _Row__dataType and __dataType, so I'd like to change it to __dataType__ to avoid double references.

It looks like Row held a reference to its DataType even in the old code, so I guess the only new references here are the ones that we're adding to the functions / arrays / etc. returned from the other branches of _create_cls. I suppose that we could set those objects' __dataType__ fields inside of the _create_cls function instead of setting them in _restore_object if you think that would be easier to understand. Not a huge deal to set it outside, though.

_restore_object is only used for Row, so the dataType should be StructType. The reason it will crash while Row has a reference to dataType is that there will be multiple id for a Row class in _cached_cls, all of them are the same StructType, but from different batches (see another comment).

So we should not add __dataType__ for other datatypes

Ah, gotcha. Do you want to go ahead with the renaming here to eliminate the double-reference?

JoshRosen · 2015-04-11T23:49:59Z

A bit of time on GitHub code search suggests that renaming __FIELD__ to __field__ shouldn't cause any major problems because several other libraries also use this field name and it doesn't appear to be a Python reserved field / method.

@davies, could you take a look at my summary above and let me know if it's accurate? Just want to double-check my understanding before merging. Thanks!

JoshRosen · 2015-04-12T04:28:38Z

This is fairly complicated, but the solution here makes sense to me: we should be guaranteed safety because we now always check that the cache returns a row class for the expected data type. This looks good to me, but I'll wait to see if @davies wants to do a field renaming proposed upthread. If not, I'll merge this.

davies · 2015-04-12T05:20:02Z

@JoshRosen I think it's fine to go without renaming. The comments are very valuable, thank you!

JoshRosen · 2015-04-12T05:33:02Z

Great! I'm going to merge this into master (1.4.0) and branch-1.3 (1.3.2). Thanks!

It's possible to have two DataType object with same id (memory address) at different time, we should check the cached classes to verify that it's generated by given datatype. This PR also change `__FIELDS__` and `__DATATYPE__` to lower case to match Python code style. Author: Davies Liu <davies@databricks.com> Closes #5445 from davies/fix_type_cache and squashes the following commits: 63b3238 [Davies Liu] typo 47bdede [Davies Liu] fix cached classes (cherry picked from commit 5d8f7b9) Signed-off-by: Josh Rosen <joshrosen@databricks.com>

fix cached classes

47bdede

davies force-pushed the fix_type_cache branch from 583f5ab to 47bdede Compare April 10, 2015 05:32

typo

63b3238

JoshRosen reviewed Apr 11, 2015
View reviewed changes

asfgit closed this in 5d8f7b9 Apr 12, 2015

[SPARK-6677] [SQL] [PySpark] fix cached classes #5445

[SPARK-6677] [SQL] [PySpark] fix cached classes #5445

Uh oh!

Conversation

davies commented Apr 10, 2015

Uh oh!

SparkQA commented Apr 10, 2015

Uh oh!

SparkQA commented Apr 10, 2015

Uh oh!

SparkQA commented Apr 10, 2015

Uh oh!

davies commented Apr 10, 2015

Uh oh!

JoshRosen commented Apr 11, 2015

Uh oh!

JoshRosen Apr 11, 2015

Choose a reason for hiding this comment

Uh oh!

davies Apr 12, 2015

Choose a reason for hiding this comment

Uh oh!

JoshRosen Apr 12, 2015

Choose a reason for hiding this comment

Uh oh!

davies Apr 12, 2015

Choose a reason for hiding this comment

Uh oh!

JoshRosen Apr 12, 2015

Choose a reason for hiding this comment

Uh oh!

JoshRosen commented Apr 11, 2015

Uh oh!

JoshRosen commented Apr 12, 2015

Uh oh!

davies commented Apr 12, 2015

Uh oh!

JoshRosen commented Apr 12, 2015

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants