
[SPARK-22029][PySpark] Add lru_cache to _parse_datatype_json_string #19255

Closed
maver1ck wants to merge 2 commits into apache:master from maver1ck:spark_22029


Conversation

maver1ck (Contributor) commented Sep 16, 2017:

What changes were proposed in this pull request?

_parse_datatype_json_string is called many times for the same datatypes.
By caching its result we can speed up PySpark internals.
The speedup is larger for complicated SQL schemas.
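
Concretely, the change amounts to decorating the parser with functools.lru_cache. A minimal sketch, assuming Python 3.3+ and the existing _parse_datatype_json_value helper in python/pyspark/sql/types.py (the actual hunk in the PR may differ):

import json
from functools import lru_cache

@lru_cache(maxsize=None)  # unbounded cache; the bound is discussed in the review below
def _parse_datatype_json_string(json_string):
    """Parse a data type JSON string, memoizing results for repeated schemas."""
    # _parse_datatype_json_value is the existing private helper in types.py
    return _parse_datatype_json_value(json.loads(json_string))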

Test

import time
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

df = spark.range(1000000).selectExpr(
    *["struct(struct(id, id), id) as id%d" % i for i in range(10)]
).cache()
df.count()
for i in range(10):
  start = time.time()
  df.rdd.map(lambda x: x).count()
  print(i, time.time() - start)

Before:

0 12.123183250427246
1 11.405049562454224
2 11.22128415107727
3 11.24196171760559
4 11.492472410202026
5 11.242795705795288
6 11.167386531829834
7 11.119175672531128
8 11.130866765975952
9 11.13940691947937

After:

0 11.548425674438477
1 10.786834239959717
2 10.491389751434326
3 10.409717082977295
4 10.756452083587646
5 10.595868825912476
6 10.374756097793579
7 10.407423257827759
8 10.379964113235474
9 10.382996797561646

How was this patch tested?

Existing tests.
Performance benchmark.

Review thread on the changed import block (in python/pyspark/sql/types.py, per the profile later in this thread):

import re
import base64
from array import array
from functools import lru_cache
maver1ck (Contributor, Author):

Any ideas for Python 2.7?

HyukjinKwon (Member) commented Sep 17, 2017:

I think we should disable it in Python 2.x for now if we are going ahead with this.

maver1ck (Contributor, Author):

Or use the backported library: https://pypi.python.org/pypi/functools32
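
A hedged sketch of how the backport could be wired in (this would make functools32 an external dependency on Python 2, which is the objection below):

try:
    from functools import lru_cache      # built in since Python 3.3
except ImportError:
    from functools32 import lru_cache    # PyPI backport for Python 2.7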

HyukjinKwon (Member) commented Sep 17, 2017:

The backport isn't the built-in one, though. I'd rather not add it as a dependency for now.

maver1ck (Contributor, Author):

I added support for Python < 3.3.
What do you think?
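
Presumably that support looks something like the following fallback (a sketch; the actual commit may differ): when functools.lru_cache is unavailable, substitute an identity decorator so older interpreters simply skip caching.

try:
    from functools import lru_cache
except ImportError:
    # Python < 3.3: no built-in lru_cache; degrade gracefully to no caching.
    def lru_cache(maxsize=None):
        def decorator(func):
            return func
        return decorator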


SparkQA commented Sep 16, 2017

Test build #81846 has finished for PR 19255 at commit c903860.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

gatorsmile (Member):

Could you add [PySpark] to the title? cc @ueshin

HyukjinKwon (Member):

This PR also needs perf tests and their results in the description.

maver1ck changed the title: [WIP][SPARK-22029] Add lru_cache to _parse_datatype_json_string → [WIP][SPARK-22029][PySpark] Add lru_cache to _parse_datatype_json_string (Sep 17, 2017)
maver1ck (Contributor, Author):

@HyukjinKwon
I added perf tests.

maver1ck changed the title: [WIP][SPARK-22029][PySpark] Add lru_cache to _parse_datatype_json_string → [SPARK-22029][PySpark] Add lru_cache to _parse_datatype_json_string (Oct 23, 2017)
maver1ck (Contributor, Author):

Jenkins, retest this please.


SparkQA commented Oct 23, 2017

Test build #82976 has finished for PR 19255 at commit 679141d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Oct 23, 2017

Test build #82977 has finished for PR 19255 at commit 679141d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

maver1ck (Contributor, Author):

Ping @HyukjinKwon

HyukjinKwon (Member) left a review comment:

I actually reviewed this one several times. I am hesitant about it because:

  • I am aware of a few issues related to it - for example, https://bugs.python.org/issue28969, which can be reproduced at least in Python 3.5.0.

  • The improvement does not look significant enough to outweigh that concern (roughly ~6% in a specific case), and the example in the description is quite extreme.

  • _parse_datatype_json_string does not actually appear to be called very frequently.

    I am looking at the profile from your benchmark in the PR description:

    ============================================================
    Profile of RDD<id=8>
    ============================================================
             241698152 function calls (221630936 primitive calls) in 378.367 seconds
    
       Ordered by: internal time, cumulative time
    
       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    41000000/21000000   84.290    0.000  256.553    0.000 types.py:623(fromInternal)
          336   66.444    0.198  369.218    1.099 {cPickle.loads}
     21000000   59.575    0.000  108.619    0.000 types.py:1421(_create_row)
     11000000   25.702    0.000   25.702    0.000 {zip}
     21000000   23.727    0.000  280.280    0.000 types.py:1418(<lambda>)
     21000000   20.740    0.000   20.740    0.000 types.py:1417(_create_row_inbound_converter)
     41191856   19.036    0.000   19.036    0.000 {isinstance}
     20000000   19.010    0.000   38.526    0.000 types.py:440(fromInternal)
     21000000   18.286    0.000   33.532    0.000 types.py:1469(__new__)
     21000000   15.512    0.000   15.512    0.000 types.py:1553(__setattr__)
     21000000   15.246    0.000   15.246    0.000 {built-in method __new__ of type object at 0x10535f428}
      1000008    6.572    0.000  378.076    0.000 rdd.py:1040(<genexpr>)
          680    2.062    0.003    2.062    0.003 {method 'read' of 'file' objects}
         7056    0.455    0.000    0.455    0.000 decoder.py:372(raw_decode)
           16    0.289    0.018  378.365   23.648 {sum}
    44016/7056    0.224    0.000    1.153    0.000 types.py:906(_parse_datatype_json_value)
      1000000    0.212    0.000    0.212    0.000 <stdin>:1(<lambda>)
        36960    0.186    0.000    0.302    0.000 types.py:396(__init__)
    36960/16800    0.161    0.000    0.962    0.000 types.py:427(fromJson)
        17136    0.138    0.000    0.295    0.000 types.py:466(__init__)
    17136/7056    0.095    0.000    1.130    0.000 types.py:574(fromJson)
         7056    0.059    0.000    0.547    0.000 decoder.py:361(decode)
        36960    0.054    0.000    0.054    0.000 {method 'encode' of 'unicode' objects}
         7056    0.043    0.000    1.754    0.000 types.py:857(_parse_datatype_json_string)
        54096    0.042    0.000    0.060    0.000 types.py:484(<genexpr>)
        26880    0.040    0.000    0.040    0.000 types.py:103(__call__)
        17136    0.030    0.000    0.034    0.000 types.py:538(__iter__)
    ...
    
  • The gain is actually smaller than this because it only applies to Python 3. Python 2 requires the external backport.

  • One last nit: maxsize is None, so the cache grows without bound; I am not sure that is safe (this is easily fixable though - see the sketch after this list).
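
For illustration only, a hedged sketch of the bounded alternative that nit points at (128 is an arbitrary size, not a value from the PR; _parse_datatype_json_value is the existing helper in types.py):

import json
from functools import lru_cache

@lru_cache(maxsize=128)  # bounded: evicts least-recently-used entries instead of growing forever
def _parse_datatype_json_string(json_string):
    return _parse_datatype_json_value(json.loads(json_string))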

Please correct me if I am wrong. So, to me, it's -0. I think you should instead try to persuade @ueshin or @JoshRosen; otherwise, closing it could be an option too.

HyukjinKwon (Member):

Ditto: if you'd like to leave it as is, I think we had better close it.

ueshin (Member) commented Nov 7, 2017:

I'd say we should close this, too.

maver1ck (Contributor, Author) commented Nov 7, 2017:

OK. Let's close it.

maver1ck closed this Nov 7, 2017
