TPC-H: Dask vs PySpark, and PySpark, Polars, and DuckDB single-node #1044
Conversation
This reverts commit a654c8e.
Was shutting down without tasks after 20 minutes.
Ah, you are indeed correct. I found from the 8- and 20-node clusters that Dask was faster in 3 out of the 7 queries (1, 2, and 6).
Also, while I'm asking for things: I'll also be very curious how this changes at different scale levels. Common wisdom today says "Dask is OK at medium scales, but for large scale you really have to use Spark". I'm curious how true this is today.
I was looking at this from the other direction. We will have to update them when we eventually want to use them (and, more importantly, we have to actually remember to do this when adding new queries), so better to copy as needed.
I don't have a strong preference.
Just remove the
These queries change regularly in the polars repo, so I wanted to avoid keeping stuff in sync that we are not using anyway.
I think that Patrick is saying that by the time we implement these queries, it's decently likely that the upstream implementations will also have changed, so we'll have to go back and look at them anyway. If so, then I guess this depends on how much work there is to modify the upstream queries, and how frequently they change.
I'm curious if anyone has thoughts on moving the entire
I don't understand; then don't update them? They're being skipped anyway. Anyhow, it appears you feel strongly about this, so I'll remove them.
It's likely that folks would just add our implementation without looking at the upstream changes. I want to avoid the cognitive load here since copy-paste should be quick at a later point.
No preference one way or the other.
Anything stopping us from merging? I'm sensitive to having this PR stay around over the weekend.
Nothing from my end stopping a merge. 👍
I have some suggestions for follow-up work, but I would like to not do that in this PR. There is one question regarding configuration of the Dask TPC-H benchmarks and queuing that we should address. This PR should not change any other benchmarks, IMO.
If that is a no-op, we're good to go from my POV.
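(A quick way to check whether it is a no-op, as a sketch; assumes the same dask/distributed versions the benchmarks pin:)

import dask
import distributed  # noqa: F401  (importing registers distributed's config defaults)

# If this already prints "inf", the explicit override in the tpch_cluster
# fixture below is a no-op; recent distributed releases default to 1.1 instead.
print(dask.config.get("distributed.scheduler.worker-saturation"))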
skip_benchmarks = pytest.mark.skip(reason="need --tpch-non-dask option to run")
for item in items:
    if not config.getoption("--tpch-non-dask") and not (
        str(item.path).startswith(
            str(TEST_DIR / "benchmarks" / "tpch" / "test_dask")
        )
    ):
        item.add_marker(skip_benchmarks)
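(For context, the snippet above runs inside pytest's collection hook. A minimal self-contained sketch of how it might be wired up in conftest.py; the TEST_DIR layout and the option registration are assumptions, while the option name and skip reason come from the snippet:)

from pathlib import Path

import pytest

TEST_DIR = Path(__file__).parent  # assumed: conftest.py sits at the test root


def pytest_addoption(parser):
    # Opt-in flag so the non-Dask TPC-H benchmarks are skipped by default.
    parser.addoption("--tpch-non-dask", action="store_true", default=False)


def pytest_collection_modifyitems(config, items):
    skip_benchmarks = pytest.mark.skip(reason="need --tpch-non-dask option to run")
    dask_tests = str(TEST_DIR / "benchmarks" / "tpch" / "test_dask")
    for item in items:
        # Skip everything outside benchmarks/tpch/test_dask* unless the
        # flag was passed on the command line.
        if not config.getoption("--tpch-non-dask") and not str(item.path).startswith(
            dask_tests
        ):
            item.add_marker(skip_benchmarks)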
Please open a follow-up ticket for this. We don't need it for this PR, but I want this to run somewhat regularly (every commit, once a day, etc.).
def _():
    spark = get_or_create_spark(f"query{module.__name__}")

    # scale1000 stored as timestamp[ns] which spark parquet
    # can't use natively.
    if ENABLED_DATASET == "scale 1000":
        module.query = fix_timestamp_ns_columns(module.query)

    module.setup(spark)  # read spark tables query will select from
    if hasattr(module, "ddl"):
        spark.sql(module.ddl)  # might create temp view
    q_final = spark.sql(module.query)  # benchmark query
    try:
        # trigger materialization of df
        return q_final.toJSON().collect()
    finally:
        spark.catalog.clearCache()
        spark.sparkContext.stop()
        spark.stop()

return await asyncio.to_thread(_)

if not is_local:
    rows = tpch_pyspark_client.run_on_scheduler(_run_tpch)
else:
    rows = asyncio.run(_run_tpch(None))  # running locally
print(f"Received {len(rows)} rows")


class SparkMaster(SchedulerPlugin):
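(The excerpt cuts off at the SparkMaster declaration. For orientation, a stripped-down sketch of the pattern: a distributed SchedulerPlugin that launches a standalone Spark master next to the Dask scheduler. The script path is hypothetical, not the PR's actual code:)

import subprocess

from distributed.diagnostics.plugin import SchedulerPlugin


class SparkMaster(SchedulerPlugin):
    """Run a standalone Spark master on the same host as the Dask scheduler."""

    async def start(self, scheduler):
        # Assumes a Spark distribution is installed at this hypothetical path.
        self.proc = subprocess.Popen(["/opt/spark/sbin/start-master.sh"])

    async def close(self):
        self.proc.terminate()
        self.proc.wait()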
I would also suggest as a follow-up to:
- Move the plugins into a spark.py submodule or something like this. That makes it easier to import if one runs a couple of ad-hoc tests, at least until we have this in coiled.
- Refactor the run_tpch_pyspark function slightly to allow it to be used as run_spark_query(dask_client, query), for similar reasons. (See the sketch below.)
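(A rough sketch of what that refactor might look like; run_spark_query is the name from the comment, and the body is an assumption based on the test code above, reusing the PR's get_or_create_spark helper:)

def run_spark_query(dask_client, query):
    """Run one Spark SQL query on the cluster, or locally if no client is given.

    Hypothetical helper extracted from run_tpch_pyspark so ad-hoc tests
    can reuse the same machinery.
    """

    def _run():
        # get_or_create_spark is the helper used in the test code above.
        spark = get_or_create_spark("adhoc-query")
        try:
            return spark.sql(query).toJSON().collect()
        finally:
            spark.stop()

    if dask_client is not None:
        # The Spark driver runs next to the Dask scheduler in this setup.
        return dask_client.run_on_scheduler(_run)
    return _run()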
I think that for PySpark we merge it as-is, fix up Spark outside of this repo, and then make this file like the others, where the query is just defined in the test.
Future work regardless though.
@@ -526,8 +585,9 @@ def tpch_cluster(request, dask_env_variables, cluster_kwargs, github_cluster_tags):
         **cluster_kwargs["tpch"],
     )
     dump_cluster_kwargs(kwargs, f"tpch.{module}")
-    with Cluster(**kwargs) as cluster:
-        yield cluster
+    with dask.config.set({"distributed.scheduler.worker-saturation": "inf"}):
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit confused. This should be on main already, shouldn't it?
Ok, I think this is a rendering issue:
Lines 518 to 531 in b8f0f0e
@pytest.fixture(scope="module") | |
def tpch_cluster(request, dask_env_variables, cluster_kwargs, github_cluster_tags): | |
module = os.path.basename(request.fspath).split(".")[0] | |
module = module.replace("test_", "") | |
kwargs = dict( | |
name=f"{module}-{uuid.uuid4().hex[:8]}", | |
environ=dask_env_variables, | |
tags=github_cluster_tags, | |
**cluster_kwargs["tpch"], | |
) | |
dump_cluster_kwargs(kwargs, f"tpch.{module}") | |
with dask.config.set({"distributed.scheduler.worker-saturation": "inf"}): | |
with Cluster(**kwargs) as cluster: | |
yield cluster |
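(Setting worker-saturation to "inf" disables the scheduler-side root-task queuing introduced in newer distributed releases. For ad-hoc runs outside this fixture, the same override can be expressed through Dask's environment-variable config mapping; a sketch:)

import os

# Dask translates DASK_-prefixed env vars with "__" separators into config
# keys, so this is equivalent to the dask.config.set call in the fixture,
# provided it is set before the scheduler process starts.
os.environ["DASK_DISTRIBUTED__SCHEDULER__WORKER_SATURATION"] = "inf"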
def fix_timestamp_ns_columns(query):
    """
    scale100 stores l_shipdate/o_orderdate as timestamp[us].
    scale1000 stores l_shipdate/o_orderdate as timestamp[ns], which gives:
        Illegal Parquet type: INT64 (TIMESTAMP(NANOS,true))
    so we set spark.sql.legacy.parquet.nanosAsLong and then convert to timestamp.
    """
    for name in ("l_shipdate", "o_orderdate"):
        query = re.sub(rf"\b{name}\b", f"to_timestamp(cast({name} as string))", query)
    return query
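(A small round-trip to illustrate what the rewrite does; the sample query is made up. Note the regex substitutes every bare occurrence of the column name, including in the WHERE clause:)

import re  # required by fix_timestamp_ns_columns above

query = "select l_shipdate from lineitem where l_shipdate >= date '1994-01-01'"
print(fix_timestamp_ns_columns(query))
# select to_timestamp(cast(l_shipdate as string)) from lineitem
# where to_timestamp(cast(l_shipdate as string)) >= date '1994-01-01'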
@milesgranger please also open an issue for this. I don't feel comfortable with this string casting when we're comparing ourselves to Spark. If that means we have to regenerate the dataset, that's unfortunate, but I wouldn't want us to bias against Spark too severely. At the very least, we should confirm on smaller / other data what the impact here is and whether it can be ignored.
Merged #971 and #1027 together and added tests comparing each per TPC-H query; right now, the first 7 queries as implemented in #971.
First run: https://cloud.coiled.io/clusters/284888/information?account=dask-engineering&tab=Metrics
Note that Dask runs first, then PySpark for the same query.
One can pretty easily determine the start/stop of each based on tasks being present in the first chart.
Initial impression is that PySpark spills to disk less / uses less memory, and Dask is generally faster.
Dashboard artifact:
https://github.com/coiled/benchmarks/suites/16815018862/artifacts/960398346