[SPARK-22029][PySpark] Add lru_cache to _parse_datatype_json_string #19255
maver1ck wants to merge 2 commits into apache:master
Conversation
python/pyspark/sql/types.py (outdated diff)

    import re
    import base64
    from array import array
    from functools import lru_cache
Any ideas for Python 2.7?
I think we should disable it in Python 2.x for now if we are going ahead with this.
Or use the backported library:
https://pypi.python.org/pypi/functools32
I don't think the backport is the same as the builtin one. I'd rather not add it as a dependency for now.
I added support for Python < 3.3.
What do you think?
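For reference, a minimal sketch of what "support for Python < 3.3" could look like; the no-op fallback below is an assumption for illustration, not necessarily what the commit actually does:

    import sys

    if sys.version_info >= (3, 3):
        from functools import lru_cache
    else:
        # Hypothetical no-op fallback: keeps the decorator syntax working on
        # Python < 3.3 without adding functools32 as a dependency, at the
        # cost of doing no caching on old interpreters.
        def lru_cache(maxsize=None):
            def decorator(func):
                return func
            return decorator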
Test build #81846 has finished for PR 19255 at commit
Could you mark [PySpark] in the title? cc @ueshin
This PR also needs perf tests and the results in the description.
@HyukjinKwon |
Jenkins, retest this please. |
Test build #82976 has finished for PR 19255 at commit
Test build #82977 has finished for PR 19255 at commit
Ping @HyukjinKwon |
I actually reviewed this one several times. I think I am hesitant about this one because ...

- I am aware of a few issues related to it, for example https://bugs.python.org/issue28969 - this one at least can be reproduced in Python 3.5.0.
- The improvement does not look significant enough to outweigh this concern (roughly ~6% in a specific case). The example in the description is actually quite extreme.
- The number of calls to _parse_datatype_json_string does not actually look frequent. Looking at the profile from your benchmark in the PR description:

    ============================================================
    Profile of RDD<id=8>
    ============================================================
    241698152 function calls (221630936 primitive calls) in 378.367 seconds

    Ordered by: internal time, cumulative time

           ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    41000000/21000000  84.290  0.000  256.553  0.000 types.py:623(fromInternal)
              336   66.444    0.198  369.218    1.099 {cPickle.loads}
         21000000   59.575    0.000  108.619    0.000 types.py:1421(_create_row)
         11000000   25.702    0.000   25.702    0.000 {zip}
         21000000   23.727    0.000  280.280    0.000 types.py:1418(<lambda>)
         21000000   20.740    0.000   20.740    0.000 types.py:1417(_create_row_inbound_converter)
         41191856   19.036    0.000   19.036    0.000 {isinstance}
         20000000   19.010    0.000   38.526    0.000 types.py:440(fromInternal)
         21000000   18.286    0.000   33.532    0.000 types.py:1469(__new__)
         21000000   15.512    0.000   15.512    0.000 types.py:1553(__setattr__)
         21000000   15.246    0.000   15.246    0.000 {built-in method __new__ of type object at 0x10535f428}
          1000008    6.572    0.000  378.076    0.000 rdd.py:1040(<genexpr>)
              680    2.062    0.003    2.062    0.003 {method 'read' of 'file' objects}
             7056    0.455    0.000    0.455    0.000 decoder.py:372(raw_decode)
               16    0.289    0.018  378.365   23.648 {sum}
       44016/7056    0.224    0.000    1.153    0.000 types.py:906(_parse_datatype_json_value)
          1000000    0.212    0.000    0.212    0.000 <stdin>:1(<lambda>)
            36960    0.186    0.000    0.302    0.000 types.py:396(__init__)
      36960/16800    0.161    0.000    0.962    0.000 types.py:427(fromJson)
            17136    0.138    0.000    0.295    0.000 types.py:466(__init__)
       17136/7056    0.095    0.000    1.130    0.000 types.py:574(fromJson)
             7056    0.059    0.000    0.547    0.000 decoder.py:361(decode)
            36960    0.054    0.000    0.054    0.000 {method 'encode' of 'unicode' objects}
             7056    0.043    0.000    1.754    0.000 types.py:857(_parse_datatype_json_string)
            54096    0.042    0.000    0.060    0.000 types.py:484(<genexpr>)
            26880    0.040    0.000    0.040    0.000 types.py:103(__call__)
            17136    0.030    0.000    0.034    0.000 types.py:538(__iter__)
    ...

- The gain is actually smaller than this because it only applies to Python 3; Python 2 requires the external backport.
- A last nit: maxsize is None (this one is easily fixable though). The cache grows without bound, and I am not sure that is safe (a bounded variant is sketched after this comment).

Please correct me if I am wrong. So, to me, it's -0. I think you should rather try to persuade @ueshin or @JoshRosen. Otherwise, closing it could be an option too.
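On the maxsize nit, a minimal sketch of the bounded variant; the bound of 128 and the simplified function body are illustrative assumptions, not the PR's code:

    import json
    from functools import lru_cache

    @lru_cache(maxsize=128)  # illustrative bound; the PR used maxsize=None
    def _parse_datatype_json_string(json_string):
        # Simplified stand-in body: the real function in pyspark/sql/types.py
        # builds a DataType from the decoded JSON rather than returning it raw.
        return json.loads(json_string)

With an explicit bound, repeated schema strings are still memoized, but memory stays capped even if a long-running process sees many distinct schemas.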
Ditto: if you'd like to leave it as is, I think we should close it.
I'd say we should close this, too. |
OK. Let's close it. |
What changes were proposed in this pull request?
_parse_datatype_json_string is called many times for the same datatypes.
By caching its result we can speed up PySpark internals.
The speedup is larger for complicated SQL schemas.
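As a rough illustration of the mechanism (hypothetical names, not code from this PR), memoizing the parse means repeated schema strings hit the cache and come back as the same object:

    import json
    from functools import lru_cache

    @lru_cache(maxsize=None)
    def parse_schema(json_string):  # hypothetical stand-in for _parse_datatype_json_string
        return json.loads(json_string)

    s = '{"type": "struct", "fields": []}'
    first = parse_schema(s)
    second = parse_schema(s)          # cache hit: the same object is returned
    assert first is second
    print(parse_schema.cache_info())  # CacheInfo(hits=1, misses=1, maxsize=None, currsize=1)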
Test
Before:
After:
How was this patch tested?
Existing tests.
Performance benchmark.