
[SPARK-50881][PYTHON] Use cached schema where possible in connect dataframe.py #49749

Closed
garlandz-db wants to merge 1 commit into apache:master from garlandz-db:SPARK-50881

Conversation

@garlandz-db
Contributor

@garlandz-db garlandz-db commented Jan 31, 2025

What changes were proposed in this pull request?

  • The schema property returns a deepcopy every time so that callers cannot mutate the cached schema. However, this creates a performance degradation for internal use in dataframe.py. We make the following changes:
  1. columns returns a copy of the list of names, matching the classic implementation.
  2. All internal uses of the schema in dataframe.py now call the cached schema, avoiding a deepcopy (see the sketch below).
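
Roughly, a minimal sketch of the pattern under discussion (the _fetch_schema helper is hypothetical; exact names in the merged change may differ):

import copy
from typing import List, Optional
from pyspark.sql.types import StructType

class DataFrame:
    def __init__(self) -> None:
        self._cached_schema: Optional[StructType] = None

    @property
    def schema(self) -> StructType:
        # Public API: still returns a deepcopy so callers cannot mutate the cache.
        return copy.deepcopy(self._schema)

    @property
    def _schema(self) -> StructType:
        # Internal accessor: fetch the schema once, then reuse the cached StructType.
        if self._cached_schema is None:
            self._cached_schema = self._fetch_schema()  # hypothetical call to the server
        return self._cached_schema

    @property
    def columns(self) -> List[str]:
        # Builds a fresh list of names from the cached schema, so mutating the
        # returned list cannot affect the cache and no deepcopy is needed.
        return [field.name for field in self._schema.fields]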

Why are the changes needed?

  • The deepcopy does not scale well when these methods are called thousands of times, for example the columns method used by pivot.

Does this PR introduce any user-facing change?

No

How was this patch tested?

  • existing tests

Benchmarking shows the improvement: the profiled loop below completes in roughly a third of the time (about 8.9s vs 23.2s).

import cProfile, pstats
import copy
import numpy as np
import pandas as pd

# Profile one million reads of schema.names after the schema has been fetched once.
cProfile.run("""
x = pd.DataFrame(zip(np.random.rand(1000000), np.random.randint(1, 3000, 10000000), list(range(1000)) * 100000), columns=['x', 'y', 'z'])
df = spark.createDataFrame(x)
schema = df.schema
for i in range(1_000_000):
  [name for name in schema.names]
""")
# Load stats previously saved to the "profile_results" file and print the
# top 10% of entries by cumulative time.
p = pstats.Stats("profile_results")
p.sort_stats("cumtime").print_stats(.1)
         17000003 function calls in 8.886 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.931    0.931    8.886    8.886 <string>:1(<module>)
  1000000    0.391    0.000    0.391    0.000 <string>:3(<listcomp>)
  1000000    0.933    0.000    5.516    0.000 DatasetInfo.py:22(gather_imported_dataframes)
  1000000    0.948    0.000    6.669    0.000 DatasetInfo.py:75(_maybe_handle_dataframe_assignment)
  1000000    0.895    0.000    7.564    0.000 DatasetInfo.py:90(__setitem__)
  3000000    2.853    0.000    4.583    0.000 utils.py:54(retrieve_imported_type)
        1    0.000    0.000    8.886    8.886 {built-in method builtins.exec}
  3000000    0.667    0.000    0.667    0.000 {built-in method builtins.getattr}
  1000000    0.204    0.000    0.204    0.000 {built-in method builtins.isinstance}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
  3000000    0.473    0.000    0.473    0.000 {method 'get' of 'dict' objects}
  3000000    0.590    0.000    0.590    0.000 {method 'rsplit' of 'str' objects}


Thu Jan 16 20:13:47 2025    profile_results

         3 function calls in 0.000 seconds

vs

         55000003 function calls (50000003 primitive calls) in 23.181 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.987    0.987   23.181   23.181 <string>:1(<module>)
  1000000    1.060    0.000    5.750    0.000 DatasetInfo.py:22(gather_imported_dataframes)
  1000000    0.956    0.000    6.907    0.000 DatasetInfo.py:75(_maybe_handle_dataframe_assignment)
  1000000    0.930    0.000    7.837    0.000 DatasetInfo.py:90(__setitem__)
6000000/1000000    7.420    0.000   14.357    0.000 copy.py:128(deepcopy)
  5000000    0.494    0.000    0.494    0.000 copy.py:182(_deepcopy_atomic)
  1000000    2.734    0.000   11.015    0.000 copy.py:201(_deepcopy_list)
  1000000    0.951    0.000    1.160    0.000 copy.py:243(_keep_alive)
  3000000    2.946    0.000    4.690    0.000 utils.py:54(retrieve_imported_type)
        1    0.000    0.000   23.181   23.181 {built-in method builtins.exec}
  3000000    0.686    0.000    0.686    0.000 {built-in method builtins.getattr}
  9000000    0.976    0.000    0.976    0.000 {built-in method builtins.id}
  1000000    0.201    0.000    0.201    0.000 {built-in method builtins.isinstance}
  5000000    0.560    0.000    0.560    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
 15000000    1.673    0.000    1.673    0.000 {method 'get' of 'dict' objects}
  3000000    0.607    0.000    0.607    0.000 {method 'rsplit' of 'str' objects}


Thu Jan 16 20:13:47 2025    profile_results

         3 function calls in 0.000 seconds
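
Note: as written, cProfile.run prints its stats to stdout, while pstats.Stats("profile_results") loads a previously saved profile file (hence the trailing "3 function calls in 0.000 seconds" lines). A minimal sketch of saving and reloading the profile in one go, assuming schema has already been created at the top level as in the setup above:

import cProfile, pstats

# Passing a filename as the second argument writes the stats to disk
# so pstats can load them afterwards.
cProfile.run("for i in range(1_000_000): [name for name in schema.names]", "profile_results")

# Load the saved stats and print the top 10% of entries by cumulative time.
pstats.Stats("profile_results").sort_stats("cumtime").print_stats(0.1)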

Was this patch authored or co-authored using generative AI tooling?

 @property
 def columns(self) -> List[str]:
-    return self.schema.names
+    return [field.name for field in self._original_schema.fields]
Contributor Author


Now it's the same as the classic dataframe.py implementation.
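
For reference, the classic pyspark/sql/dataframe.py property is essentially:

    @property
    def columns(self) -> List[str]:
        return [f.name for f in self.schema.fields]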

@HyukjinKwon HyukjinKwon changed the title [SPARK-50881] Use cached schema where possible in connect dataframe.py to [SPARK-50881][PYTHON] Use cached schema where possible in connect dataframe.py Feb 3, 2025
@HyukjinKwon
Member

cc @zhengruifeng

def _original_schema(self) -> StructType:
    if self._cached_schema:
        return self._cached_schema
    return self.schema
Contributor


I think it should be a property.
And since this is only for internal usage, I think we can avoid calling self.schema, which performs a deepcopy.

e.g.

    @property
    def _schema(self) -> StructType:
        if self._cached_schema is None:
            query = self._plan.to_proto(self._session.client)
            self._cached_schema = self._session.client.schema(query)
        return self._cached_schema

    @property
    def schema(self) -> StructType:
        return copy.deepcopy(self._schema)
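As a follow-on illustration (not part of the suggestion above), an internal caller could then read the cached schema through self._schema and skip the per-call deepcopy; dtypes is used here purely as an example:

    @property
    def dtypes(self) -> List[Tuple[str, str]]:
        # Reads the cached StructType via self._schema instead of self.schema,
        # so no deepcopy is performed on each call.
        return [(f.name, f.dataType.simpleString()) for f in self._schema.fields]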

@garlandz-db
Contributor Author

Failed tests don't seem relevant.

@zhengruifeng
Contributor

Failed tests don't seem relevant.

Please rebase this PR onto the latest master to make sure CI is green.

@garlandz-db garlandz-db force-pushed the SPARK-50881 branch 2 times, most recently from 63a88e3 to f338b47 Compare February 6, 2025 10:32
zhengruifeng pushed a commit that referenced this pull request Feb 10, 2025
…aframe.py

Closes #49749 from garlandz-db/SPARK-50881.

Authored-by: Garland Zhang <garland.zhang@databricks.com>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
(cherry picked from commit 9f86647)
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
@zhengruifeng
Contributor

merged to master/4.0

zifeif2 pushed a commit to zifeif2/spark that referenced this pull request Nov 14, 2025
…aframe.py

Closes apache#49749 from garlandz-db/SPARK-50881.

Authored-by: Garland Zhang <garland.zhang@databricks.com>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
(cherry picked from commit 7d560ba)
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
