[SPARK-22029][PySpark] Add lru_cache to _parse_datatype_json_string #19255
maver1ck wants to merge 2 commits into apache:master
Conversation
python/pyspark/sql/types.py (outdated diff)

    import re
    import base64
    from array import array
    from functools import lru_cache
Any ideas for Python 2.7?
I think we should disable it in Python 2.x for now if we are going ahead with this.
Or use the backported library:
https://pypi.python.org/pypi/functools32
I don't think the backport is the same as the builtin one. I'd rather not add it as a dependency for now.
I added support for Python < 3.3.
What do you think?
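For reference, a minimal sketch of what "support for Python < 3.3" could look like; the no-op fallback below is an assumption for illustration, not necessarily what the commit actually does:

    import sys

    if sys.version_info >= (3, 3):
        from functools import lru_cache
    else:
        # Hypothetical no-op fallback: keeps the decorator syntax working on
        # Python < 3.3 without adding functools32 as a dependency, at the
        # cost of doing no caching on old interpreters.
        def lru_cache(maxsize=None):
            def decorator(func):
                return func
            return decorator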
Test build #81846 has finished for PR 19255 at commit
Could you mark [PySpark] in the title? cc @ueshin
This PR also needs perf tests and the results in the description.
@HyukjinKwon |
Jenkins, retest this please. |
Test build #82976 has finished for PR 19255 at commit
Test build #82977 has finished for PR 19255 at commit
Ping @HyukjinKwon |
I actually reviewed this one several times. I think I am hesitant about this one because ...

- I am aware of a few issues related to it, for example https://bugs.python.org/issue28969 - this one at least can be reproduced in Python 3.5.0.
- The improvement does not look significant enough to outweigh this concern (roughly ~6% in a specific case). The example in the description is actually quite extreme.
- The number of calls to _parse_datatype_json_string does not actually look frequent. Looking at the profile from your benchmark in the PR description:

    ============================================================
    Profile of RDD<id=8>
    ============================================================
    241698152 function calls (221630936 primitive calls) in 378.367 seconds

    Ordered by: internal time, cumulative time

           ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    41000000/21000000  84.290  0.000  256.553  0.000 types.py:623(fromInternal)
              336   66.444    0.198  369.218    1.099 {cPickle.loads}
         21000000   59.575    0.000  108.619    0.000 types.py:1421(_create_row)
         11000000   25.702    0.000   25.702    0.000 {zip}
         21000000   23.727    0.000  280.280    0.000 types.py:1418(<lambda>)
         21000000   20.740    0.000   20.740    0.000 types.py:1417(_create_row_inbound_converter)
         41191856   19.036    0.000   19.036    0.000 {isinstance}
         20000000   19.010    0.000   38.526    0.000 types.py:440(fromInternal)
         21000000   18.286    0.000   33.532    0.000 types.py:1469(__new__)
         21000000   15.512    0.000   15.512    0.000 types.py:1553(__setattr__)
         21000000   15.246    0.000   15.246    0.000 {built-in method __new__ of type object at 0x10535f428}
          1000008    6.572    0.000  378.076    0.000 rdd.py:1040(<genexpr>)
              680    2.062    0.003    2.062    0.003 {method 'read' of 'file' objects}
             7056    0.455    0.000    0.455    0.000 decoder.py:372(raw_decode)
               16    0.289    0.018  378.365   23.648 {sum}
       44016/7056    0.224    0.000    1.153    0.000 types.py:906(_parse_datatype_json_value)
          1000000    0.212    0.000    0.212    0.000 <stdin>:1(<lambda>)
            36960    0.186    0.000    0.302    0.000 types.py:396(__init__)
      36960/16800    0.161    0.000    0.962    0.000 types.py:427(fromJson)
            17136    0.138    0.000    0.295    0.000 types.py:466(__init__)
       17136/7056    0.095    0.000    1.130    0.000 types.py:574(fromJson)
             7056    0.059    0.000    0.547    0.000 decoder.py:361(decode)
            36960    0.054    0.000    0.054    0.000 {method 'encode' of 'unicode' objects}
             7056    0.043    0.000    1.754    0.000 types.py:857(_parse_datatype_json_string)
            54096    0.042    0.000    0.060    0.000 types.py:484(<genexpr>)
            26880    0.040    0.000    0.040    0.000 types.py:103(__call__)
            17136    0.030    0.000    0.034    0.000 types.py:538(__iter__)
    ...

- The gain is actually smaller than this because it only applies to Python 3; Python 2 requires the external backport.
- A last nit: maxsize is None (this one is easily fixable though). The cache grows without bound, and I am not sure that is safe (a bounded variant is sketched after this comment).

Please correct me if I am wrong. So, to me, it's -0. I think you should rather try to persuade @ueshin or @JoshRosen. Otherwise, closing it could be an option too.
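On the maxsize nit, a minimal sketch of the bounded variant; the bound of 128 and the simplified function body are illustrative assumptions, not the PR's code:

    import json
    from functools import lru_cache

    @lru_cache(maxsize=128)  # illustrative bound; the PR used maxsize=None
    def _parse_datatype_json_string(json_string):
        # Simplified stand-in body: the real function in pyspark/sql/types.py
        # builds a DataType from the decoded JSON rather than returning it raw.
        return json.loads(json_string)

With an explicit bound, repeated schema strings are still memoized, but memory stays capped even if a long-running process sees many distinct schemas.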
Ditto: if you'd like to leave it as is, I think we should close it.
I'd say we should close this, too. |
OK. Let's close it. |
What changes were proposed in this pull request?
_parse_datatype_json_string is called many times for the same datatypes.
By caching its result we can speed up PySpark internals.
The speedup is larger for complicated SQL schemas.
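As a rough illustration of the mechanism (hypothetical names, not code from this PR), memoizing the parse means repeated schema strings hit the cache and come back as the same object:

    import json
    from functools import lru_cache

    @lru_cache(maxsize=None)
    def parse_schema(json_string):  # hypothetical stand-in for _parse_datatype_json_string
        return json.loads(json_string)

    s = '{"type": "struct", "fields": []}'
    first = parse_schema(s)
    second = parse_schema(s)          # cache hit: the same object is returned
    assert first is second
    print(parse_schema.cache_info())  # CacheInfo(hits=1, misses=1, maxsize=None, currsize=1)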
Test
Before:
After:
How was this patch tested?
Existing tests.
Performance benchmark.