Skip to content

Conversation

@davies
Copy link
Contributor

@davies davies commented Apr 10, 2015

It's possible to have two DataType object with same id (memory address) at different time, we should check the cached classes to verify that it's generated by given datatype.

This PR also change __FIELDS__ and __DATATYPE__ to lower case to match Python code style.

@SparkQA
Copy link

SparkQA commented Apr 10, 2015

Test build #29993 has finished for PR 5445 at commit 583f5ab.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@SparkQA
Copy link

SparkQA commented Apr 10, 2015

Test build #30010 has finished for PR 5445 at commit 47bdede.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@SparkQA
Copy link

SparkQA commented Apr 10, 2015

Test build #30039 has finished for PR 5445 at commit 63b3238.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@davies
Copy link
Contributor Author

davies commented Apr 10, 2015

cc @JoshRosen

@JoshRosen
Copy link
Contributor

To recap my understanding of this patch:

  • The _cached_cls dictionary maps from either DataType object ids or DataType objects to the generated Row classes for those data types.
  • Using an object id as a dictionary key will be safe as long as that that id refers to the same object for the lifetime of the dictionary. As long as DataType instance isn't garbage-collected, its object id will not be re-used by a different DataType object.
  • The problem here seems to be that we weren't guaranteed to retain a strong reference to the DataType instance. Although the DataType itself was used as a dictionary key for the _cached_cls dictionary, that dictionary is a WeakValueDictionary, so its reference to the DataType key would be removed if that type's Row class was garbage collected.
  • The solution implemented in this patch addresses this issue via two mechanisms:
    • When storing a Row class in the _cached_cls map, store a reference in the Row class that points to the DataType. This avoids the problem that we have now where the Row class can remain in the map even though its DataType has been garbage-collected. EDIT: this reference seems to just providing us a cheap way to check the validity of the entry returned from the map, not for GC reasons; see discussions below.
    • Add a check that tests whether the Row class returned from the _cached_cls map has the expected DataType. As far as I understand it, this acts more as an assertion / sanity check / error-handler, so we expect this check to succeed most (all?) of the time. EDIT: this is necessary; see discussion below.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Due to name-mangling, this field will no longer be accessible outside of the Row class itself, but I don't think that's a problem based on how we use it: it looks like the only place where we read the old __DATATYPE__ field was in __reduce__.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that Row will have two reference to dataType, via _Row__dataType and __dataType, so I'd like to change it to __dataType__ to avoid double references.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like Row held a reference to its DataType even in the old code, so I guess the only new references here are the ones that we're adding to the functions / arrays / etc. returned from the other branches of _create_cls. I suppose that we could set those objects' __dataType__ fields inside of the _create_cls function instead of setting them in _restore_object if you think that would be easier to understand. Not a huge deal to set it outside, though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_restore_object is only used for Row, so the dataType should be StructType. The reason it will crash while Row has a reference to dataType is that there will be multiple id for a Row class in _cached_cls, all of them are the same StructType, but from different batches (see another comment).

So we should not add __dataType__ for other datatypes

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, gotcha. Do you want to go ahead with the renaming here to eliminate the double-reference?

@JoshRosen
Copy link
Contributor

A bit of time on GitHub code search suggests that renaming __FIELD__ to __field__ shouldn't cause any major problems because several other libraries also use this field name and it doesn't appear to be a Python reserved field / method.

@davies, could you take a look at my summary above and let me know if it's accurate? Just want to double-check my understanding before merging. Thanks!

@JoshRosen
Copy link
Contributor

This is fairly complicated, but the solution here makes sense to me: we should be guaranteed safety because we now always check that the cache returns a row class for the expected data type. This looks good to me, but I'll wait to see if @davies wants to do a field renaming proposed upthread. If not, I'll merge this.

@davies
Copy link
Contributor Author

davies commented Apr 12, 2015

@JoshRosen I think it's fine to go without renaming. The comments are very valuable, thank you!

@JoshRosen
Copy link
Contributor

Great! I'm going to merge this into master (1.4.0) and branch-1.3 (1.3.2). Thanks!

asfgit pushed a commit that referenced this pull request Apr 12, 2015
It's possible to have two DataType object with same id (memory address) at different time, we should check the cached classes to verify that it's generated by given datatype.

This PR also change `__FIELDS__` and `__DATATYPE__` to lower case to match Python code style.

Author: Davies Liu <davies@databricks.com>

Closes #5445 from davies/fix_type_cache and squashes the following commits:

63b3238 [Davies Liu] typo
47bdede [Davies Liu] fix cached classes

(cherry picked from commit 5d8f7b9)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
@asfgit asfgit closed this in 5d8f7b9 Apr 12, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants