Datafusion upstream merge (#576)
* Add basic predicate-pushdown optimization (#433)

* basic predicate-pushdown support

* remove explicit Dispatch class

* use _Frame.fillna

* cleanup comments

* test coverage

* improve test coverage

* add xfail test for dt accessor in predicate and fix test_show.py

* fix some naming issues

* add config and use assert_eq

* add logging events when predicate-pushdown bails

* move bail logic earlier in function

* address easier code review comments

* typo fix

* fix creation_info access bug

* convert any expression to DNF

* csv test coverage

* include IN coverage

* improve test rigor

* address code review

* skip parquet tests when deps are not installed

* fix bug

* add pyarrow dep to cluster workers

* roll back test skipping changes

Co-authored-by: Charles Blackmon-Luca <20627856+charlesbluca@users.noreply.github.com>
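
The predicate-pushdown work above rewrites filter expressions into the disjunctive normal form (DNF) that Dask's parquet reader already accepts. A minimal sketch of the idea, using the public dd.read_parquet filters argument rather than the dask-sql optimizer code:

    import dask.dataframe as dd

    # Without pushdown, every row group is read and filtered afterwards:
    #   df = dd.read_parquet("data/*.parquet")
    #   df = df[(df.x > 5) & (df.y == "a")]

    # With pushdown, the predicate is rewritten into DNF and handed to the
    # reader, so non-matching row groups are skipped at the I/O layer.
    # Outer list = OR of conjunctions, inner lists = ANDed terms; "in" is
    # also supported, which is what the IN coverage above exercises.
    filters = [[("x", ">", 5), ("y", "in", ("a", "b"))]]
    df = dd.read_parquet("data/*.parquet", filters=filters)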

* Add workflow to keep datafusion dev branch up to date (#440)

* Update gpuCI `RAPIDS_VER` to `22.06` (#434)

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* Bump black to 22.3.0 (#443)

* Check for ucx-py nightlies when updating gpuCI (#441)

* Simplify gpuCI updating workflow

* Add check for cuML nightly version

* Add handling for newer `prompt_toolkit` versions in cmd tests (#447)

* Add handling for newer prompt-toolkit version

* Place compatibility code in _compat
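
A version-gated shim of the sort placed in _compat typically looks like the sketch below; the flag name and version cutoff here are assumptions, not the real dask-sql code:

    from packaging.version import parse
    import prompt_toolkit

    # Hypothetical cutoff: compute the flag once at import time so the cmd
    # tests can branch on it instead of hard-coding one prompt-toolkit API.
    PROMPT_TOOLKIT_GTE_3_0_29 = (
        parse(prompt_toolkit.__version__) >= parse("3.0.29")
    )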

* Fix version for gha-find-replace (#446)

* Update versions of Java dependencies (#445)

* Update versions for java dependencies with cves

* Rerun tests

* Update jackson databind version (#449)

* Update versions for java dependencies with cves

* Rerun tests

* update jackson-databind dependency

* Disable SQL server functionality (#448)

* Disable SQL server functionality

* Update docs/source/server.rst

Co-authored-by: Ayush Dattagupta <ayushdg95@gmail.com>

* Disable server at lowest possible level

* Skip all server tests

* Add tests to ensure server is disabled

* Fix CVE fix test

Co-authored-by: Ayush Dattagupta <ayushdg95@gmail.com>

* Update dask pinnings for release (#450)

* Add Java source code to source distribution (#451)

* Bump `httpclient` dependency (#453)

* Revert "Disable SQL server functionality (#448)"

This reverts commit 37a3a61.

* Bump httpclient version

* Unpin Dask/distributed versions (#452)

* Unpin dask/distributed post release

* Remove dask/distributed version ceiling

* Add jsonschema to ci testing (#454)

* Add jsonschema to ci env

* Fix typo in config schema

* Switch tests from `pd.testing.assert_frame_equal` to `dd.assert_eq` (#365)

* Start moving tests to dd.assert_eq

* Use assert_eq in datetime filter test

* Resolve most resulting test failures

* Resolve remaining test failures

* Convert over tests

* Convert more tests

* Consolidate select limit cpu/gpu test

* Remove remaining assert_series_equal

* Remove explicit cudf imports from many tests

* Resolve rex test failures

* Remove some additional compute calls

* Consolidate sorting tests with getfixturevalue

* Fix failed join test

* Remove breakpoint

* Use custom assert_eq function for tests

* Resolve test failures / seg faults

* Remove unnecessary testing utils

* Resolve local test failures

* Generalize RAND test

* Avoid closing client if using independent cluster

* Fix failures on Windows

* Resolve black failures

* Make random test variables more clear
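
dd.assert_eq accepts any mix of pandas and Dask objects, computes lazy results, and checks metadata such as dtypes and divisions along the way, which is what makes it preferable to pd.testing.assert_frame_equal here. A small usage sketch:

    import pandas as pd
    import dask.dataframe as dd
    from dask.dataframe.utils import assert_eq

    pdf = pd.DataFrame({"a": [1, 2, 3]})
    ddf = dd.from_pandas(pdf, npartitions=2)

    assert_eq(ddf, pdf)                         # computes ddf, compares values and dtypes
    assert_eq(ddf.a, pdf.a, check_dtype=False)  # relax the dtype check where needed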

* Set max pin on antlr4-python-runtime (#456)

* Set max pin on antlr4-python-runtime due to incompatibilities with fugue_sql

* update comment on antlr max pin version

* Move / minimize number of cudf / dask-cudf imports (#480)

* Move / minimize number of cudf / dask-cudf imports

* Add tests for GPU-related errors

* Fix unbound local error

* Fix ddf value error
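
Keeping cudf/dask-cudf imports inside the GPU code paths means CPU-only installs never import (or fail on) them. The snippet below is an illustrative pattern, not the actual dask-sql function:

    def to_gpu_frame(df):
        try:
            import dask_cudf  # deferred: only touched on the GPU path
        except ImportError:
            # this is the kind of GPU-related error the new tests cover
            raise ModuleNotFoundError(
                "GPU support requires dask_cudf to be installed"
            )
        return dask_cudf.from_dask_dataframe(df)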

* Use `map_partitions` to compute LIMIT / OFFSET (#517)

* Use map_partitions to compute limit / offset

* Use partition_info to extract partition_index
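
The approach is roughly: know how many rows precede each partition, then slice every partition independently with map_partitions, using the partition_info keyword to learn which partition is being processed. A sketch under those assumptions (not the dask-sql source):

    import pandas as pd
    import dask.dataframe as dd

    def apply_limit(df, lengths, limit, offset, partition_info=None):
        # Dask supplies partition_info with this partition's global index
        i = (partition_info or {}).get("number", 0)
        start = sum(lengths[:i])                 # rows before this partition
        lo = max(offset - start, 0)
        hi = max(min(offset + limit - start, len(df)), lo)
        return df.iloc[lo:hi]

    pdf = pd.DataFrame({"a": range(10)})
    ddf = dd.from_pandas(pdf, npartitions=3)
    lengths = ddf.map_partitions(len).compute().tolist()
    out = ddf.map_partitions(apply_limit, lengths, 4, 3, meta=ddf._meta)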

* Use `dev` images for independent cluster testing (#518)

* Switch to dask dev images

* Use mamba for conda installs in images

* Remove sleep call for installation

* Use timeout / until to wait for cluster to be initialized

* Add documentation for FugueSQL integrations (#523)

* Add documentation for FugueSQL integrations

* Minor nitpick around autodoc obj -> class

* Timestampdiff support (#495)

* added timestampdiff

* initial work for timestampdiff

* Added test cases for timestampdiff

* Update interval month dtype mapping

* Add datetimesubOperator

* Uncomment timestampdiff literal tests

* Update logic for handling interval_months for pandas/cudf series and scalars

* Add negative diff testcases, and gpu tests

* Update reinterpret and timedelta to explicitly cast to int64 instead of int

* Simplify cast_column_to_type mapping logic

* Add scalar handling to castOperation and reuse it for reinterpret

Co-authored-by: rajagurnath <gurunathrajagopal@gmail.com>
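
Setting the operator plumbing aside, the arithmetic splits into calendar-based month intervals and fixed-width units reinterpreted as int64 (a plain int would be platform-dependent). A small pandas sketch of both cases:

    import pandas as pd

    start = pd.Series(pd.to_datetime(["2021-01-15", "2021-06-01"]))
    end = pd.Series(pd.to_datetime(["2021-03-15", "2021-05-01"]))

    # TIMESTAMPDIFF(MONTH, ...) needs calendar math, not a fixed Timedelta;
    # negative diffs (second row) fall out naturally.
    months = (end.dt.year - start.dt.year) * 12 + (end.dt.month - start.dt.month)

    # Fixed-width units come from the raw timedelta, cast explicitly to int64.
    seconds = (end - start).dt.total_seconds().astype("int64")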

* Relax jsonschema testing dependency (#546)

* Update upstream testing workflows (#536)

* Use dask nightly conda packages for upstream testing

* Add independent cluster testing to nightly upstream CI [test-upstream]

* Remove unnecessary dask install [test-upstream]

* Remove strict channel policy to allow nightly dask installs

* Use nightly Dask packages in independent cluster test [test-upstream]

* Use channels argument to install Dask conda nightlies [test-upstream]

* Fix channel expression

* [test-upstream]

* Need to add mamba update command to get dask conda nightlies

* Use conda nightlies for dask-sql import test

* Add import test to upstream nightly tests

* [test-upstream]

* Make sure we have nightly Dask for import tests [test-upstream]

* Fix pyarrow / cloudpickle failures in cluster testing (#553)

* Explicitly install libstdcxx-ng in clusters

* Make pyarrow dependency consistent across testing

* Make libstdcxx-ng dep a min version

* Add cloudpickle to cluster dependencies

* cloudpickle must be in the scheduler environment

* Bump cloudpickle version

* Move cloudpickle install to workers

* Fix pyarrow constraint in cluster spec

* Use bash -l as default entrypoint for all jobs (#552)

* Constrain dask/distributed for release (#563)

* Unpin dask/distributed for development (#564)

* Unpin dask/distributed post release

* Remove dask/distributed version ceiling

* update dask-sphinx-theme (#567)

* Introduce subquery.py to handle subquery expressions

* update ordering

* Make sure scheduler has Dask nightlies in upstream cluster testing (#573)

* Make sure scheduler has Dask nightlies in upstream cluster testing

* empty commit to [test-upstream]

* Update gpuCI `RAPIDS_VER` to `22.08` (#565)

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* updates

* Remove startswith function merged by mistake

* [REVIEW] - Remove instances that are meant for the currently removed timestampdiff

* Modify test environment pinnings to cover minimum versions (#555)

* Remove black/isort deps as we prefer pre-commit

* Unpin all non python/jdk dependencies

* Minor package corrections for py3.9 jdk11 env

* Set min version constraints for all non-testing dependencies

* Pin all non-test deps for 3.8 testing

* Bump sklearn min version to 1.0.0

* Bump pyarrow min version to 1.0.1

* Fix pip notation for fugue

* Use unpinned deps for cluster testing for now

* Add fugue deps to environments, bump pandas to 1.0.2

* Add back antlr4 version ceiling

* Explicitly mark all fugue dependencies

* Alter test_analyze to avoid rtol

* Bump pandas to 1.0.5 to fix upstream numpy issues

* Alter datetime casting util to dodge pandas casting failures

* Bump pandas to 1.1.0 for groupby dropna support

* Simplify string dtype check for get_supported_aggregations

* Add check_dtype=False back to test_group_by_nan

* Bump cluster to python 3.9

* Bump fastapi to 0.69.0, resolve remaining JDBC failures

* Typo - correct pandas version

* Generalize test_multi_case_when's dtype check

* Bump pandas to 1.1.1 to resolve flaky test failures

* Constrain mlflow for windows python 3.8 testing

* Selectors don't work for conda env files

* Problems seem to persist in 1.1.1, bump to 1.1.2

* Remove accidental debug changes

* [test-upstream]

* Use python 3.9 for upstream cluster testing [test-upstream]

* Updated missed pandas pinning

* Unconstrain mlflow to see if Windows failures persist

* Add min version for protobuf

* Bump pyarrow min version to allow for newer protobuf versions

* Don't move jar to local mvn repo (#579)

* Add tests for intersection

* Add tests for intersection

* Add another intersection test, even more simple but for testing raw intersection

* Use Timedelta when doing ReduceOperation(s) against datetime64 dtypes

* Cleanup

* Use an either/or strategy for converting to Timedelta objects

* Support more than 2 operands for Timedelta conversions
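
The conversion strategy reads roughly as: if any operand in the reduction is datetime-like, wrap every non-datetime operand as a Timedelta before folding, which also covers the more-than-two-operand case. An illustrative sketch (treating bare integers as nanosecond offsets is an assumption):

    import pandas as pd

    def coerce_operands(operands):
        is_dt = lambda o: pd.api.types.is_datetime64_any_dtype(
            getattr(o, "dtype", None)
        )
        if not any(is_dt(o) for o in operands):
            return operands
        # wrap everything that is not datetime-like; pd.Timedelta(int)
        # interprets the value as nanoseconds
        return [o if is_dt(o) else pd.Timedelta(o) for o in operands]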

* fix merge issues, is_frame() function of call.py was removed accidentally before

* Remove pytest that was testing Calcite exception messages. Calcite is no longer used so no need for this test

* comment out gpu tests, will be enabled in datafusion-filter PR

* Don't check dtype for failing test

Co-authored-by: Richard (Rick) Zamora <rzamora217@gmail.com>
Co-authored-by: Charles Blackmon-Luca <20627856+charlesbluca@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Ayush Dattagupta <ayushdg95@gmail.com>
Co-authored-by: rajagurnath <gurunathrajagopal@gmail.com>
Co-authored-by: Sarah Charlotte Johnson <scharlottej13@gmail.com>
Co-authored-by: ksonj <ksonj@users.noreply.github.com>
8 people committed Jun 17, 2022
1 parent 230d726 commit 453249e
Showing 42 changed files with 571 additions and 333 deletions.
21 changes: 21 additions & 0 deletions .github/cluster-upstream.yml
@@ -0,0 +1,21 @@
+# Docker-compose setup used during tests
+version: '3'
+services:
+  dask-scheduler:
+    container_name: dask-scheduler
+    image: daskdev/dask:dev-py3.9
+    command: dask-scheduler
+    environment:
+      USE_MAMBA: "true"
+      EXTRA_CONDA_PACKAGES: "dask/label/dev::dask cloudpickle>=2.1.0"
+    ports:
+      - "8786:8786"
+  dask-worker:
+    container_name: dask-worker
+    image: daskdev/dask:dev-py3.9
+    command: dask-worker dask-scheduler:8786
+    environment:
+      USE_MAMBA: "true"
+      EXTRA_CONDA_PACKAGES: "dask/label/dev::dask cloudpickle>=2.1.0 pyarrow>=3.0.0 libstdcxx-ng>=12.1.0"
+    volumes:
+      - /tmp:/tmp
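
Once this compose file is up, the scheduler listens on the mapped port 8786, the same address the workflows below pass to pytest via DASK_SQL_TEST_SCHEDULER. A quick connectivity check from Python:

    from dask.distributed import Client

    # address assumes the default port mapping in the compose file above
    client = Client("tcp://127.0.0.1:8786")
    print(client.scheduler_info()["workers"])
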
6 changes: 3 additions & 3 deletions .github/docker-compose.yaml → .github/cluster.yml
@@ -3,7 +3,7 @@ version: '3'
 services:
   dask-scheduler:
     container_name: dask-scheduler
-    image: daskdev/dask:dev
+    image: daskdev/dask:dev-py3.9
     command: dask-scheduler
     environment:
       USE_MAMBA: "true"
@@ -12,10 +12,10 @@ services:
       - "8786:8786"
   dask-worker:
     container_name: dask-worker
-    image: daskdev/dask:dev
+    image: daskdev/dask:dev-py3.9
     command: dask-worker dask-scheduler:8786
     environment:
       USE_MAMBA: "true"
-      EXTRA_CONDA_PACKAGES: "pyarrow>=4.0.0" # required for parquet IO
+      EXTRA_CONDA_PACKAGES: "cloudpickle>=2.1.0 pyarrow>=3.0.0 libstdcxx-ng>=12.1.0"
     volumes:
       - /tmp:/tmp
30 changes: 30 additions & 0 deletions .github/workflows/datafusion-sync.yml
@@ -0,0 +1,30 @@
+name: Keep datafusion branch up to date
+on:
+  push:
+    branches:
+      - main
+
+# When this workflow is queued, automatically cancel any previous running
+# or pending jobs
+concurrency:
+  group: datafusion-sync
+  cancel-in-progress: true
+
+jobs:
+  sync-branches:
+    runs-on: ubuntu-latest
+    if: github.repository == 'dask-contrib/dask-sql'
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v2
+      - name: Set up Node
+        uses: actions/setup-node@v2
+        with:
+          node-version: 12
+      - name: Opening pull request
+        id: pull
+        uses: tretuna/sync-branches@1.4.0
+        with:
+          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+          FROM_BRANCH: main
+          TO_BRANCH: datafusion-sql-planner
107 changes: 100 additions & 7 deletions .github/workflows/test-upstream.yml
@@ -4,6 +4,11 @@ on:
     - cron: "0 0 * * *" # Daily “At 00:00” UTC
   workflow_dispatch: # allows you to trigger the workflow run manually
 
+# Required shell entrypoint to have properly activated conda environments
+defaults:
+  run:
+    shell: bash -l {0}
+
 jobs:
   test-dev:
     name: "Test upstream dev (${{ matrix.os }}, python: ${{ matrix.python }})"
@@ -29,6 +34,7 @@ jobs:
           use-mamba: true
           python-version: ${{ matrix.python }}
           channel-priority: strict
+          channels: dask/label/dev,conda-forge,nodefaults
           activate-environment: dask-sql
           environment-file: ${{ env.CONDA_FILE }}
       - name: Install hive testing dependencies for Linux
@@ -39,23 +45,110 @@ jobs:
           docker pull bde2020/hive-metastore-postgresql:2.3.0
       - name: Install upstream dev Dask / dask-ml
         run: |
-          python -m pip install --no-deps git+https://github.com/dask/dask
-          python -m pip install --no-deps git+https://github.com/dask/distributed
+          mamba update dask
           python -m pip install --no-deps git+https://github.com/dask/dask-ml
       - name: Test with pytest
         run: |
           pytest --junitxml=junit/test-results.xml --cov-report=xml -n auto tests --dist loadfile
+  cluster-dev:
+    name: "Test upstream dev in a dask cluster"
+    needs: build
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v2
+      - name: Cache local Maven repository
+        uses: actions/cache@v2
+        with:
+          path: ~/.m2/repository
+          key: ${{ runner.os }}-maven-v1-jdk11-${{ hashFiles('**/pom.xml') }}
+      - name: Set up Python
+        uses: conda-incubator/setup-miniconda@v2
+        with:
+          miniforge-variant: Mambaforge
+          use-mamba: true
+          python-version: "3.9"
+          channel-priority: strict
+          channels: dask/label/dev,conda-forge,nodefaults
+          activate-environment: dask-sql
+          environment-file: continuous_integration/environment-3.9-jdk11-dev.yaml
+      - name: Download the pre-build jar
+        uses: actions/download-artifact@v1
+        with:
+          name: jar
+          path: dask_sql/jar/
+      - name: Install cluster dependencies
+        run: |
+          mamba install python-blosc lz4 -c conda-forge
+          which python
+          pip list
+          mamba list
+      - name: Install upstream dev dask-ml
+        run: |
+          mamba update dask
+          python -m pip install --no-deps git+https://github.com/dask/dask-ml
+      - name: run a dask cluster
+        run: |
+          docker-compose -f .github/cluster-upstream.yml up -d
+          # periodically ping logs until a connection has been established; assume failure after 2 minutes
+          timeout 2m bash -c 'until docker logs dask-worker 2>&1 | grep -q "Starting established connection"; do sleep 1; done'
+          docker logs dask-scheduler
+          docker logs dask-worker
+      - name: Test with pytest while running an independent dask cluster
+        run: |
+          DASK_SQL_TEST_SCHEDULER="tcp://127.0.0.1:8786" pytest --junitxml=junit/test-cluster-results.xml --cov-report=xml -n auto tests --dist loadfile
+  import-dev:
+    name: "Test importing with bare requirements and upstream dev"
+    needs: build
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v2
+      - name: Cache local Maven repository
+        uses: actions/cache@v2
+        with:
+          path: ~/.m2/repository
+          key: ${{ runner.os }}-maven-v1-jdk11-${{ hashFiles('**/pom.xml') }}
+      - name: Set up Python
+        uses: conda-incubator/setup-miniconda@v2
+        with:
+          python-version: "3.8"
+          mamba-version: "*"
+          channels: dask/label/dev,conda-forge,nodefaults
+          channel-priority: strict
+      - name: Download the pre-build jar
+        uses: actions/download-artifact@v1
+        with:
+          name: jar
+          path: dask_sql/jar/
+      - name: Install upstream dev Dask / dask-ml
+        if: needs.detect-ci-trigger.outputs.triggered == 'true'
+        run: |
+          mamba update dask
+          python -m pip install --no-deps git+https://github.com/dask/dask-ml
+      - name: Install dependencies and nothing else
+        run: |
+          pip install -e .
+          which python
+          pip list
+          mamba list
+      - name: Try to import dask-sql
+        run: |
+          python -c "import dask_sql; print('ok')"
   report-failures:
     name: Open issue for upstream dev failures
-    needs: test-dev
+    needs: [test-dev, cluster-dev]
     if: |
       always()
-      && needs.test-dev.result == 'failure'
+      && (
+        needs.test-dev.result == 'failure' || needs.cluster-dev.result == 'failure'
+      )
     runs-on: ubuntu-latest
+    defaults:
+      run:
+        shell: bash
     steps:
       - uses: actions/checkout@v2
       - name: Report failures
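
On the test side, a scheduler address supplied through DASK_SQL_TEST_SCHEDULER is typically consumed along these lines; this is a sketch, and the helper name and local fallback are assumptions rather than dask-sql's actual fixture:

    import os
    from dask.distributed import Client

    def get_test_client():
        address = os.environ.get("DASK_SQL_TEST_SCHEDULER")
        # fall back to a local, in-process cluster when no external
        # scheduler was provided by the CI workflow
        return Client(address) if address else Client(processes=False)
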
34 changes: 17 additions & 17 deletions .github/workflows/test.yml
@@ -57,6 +57,7 @@ jobs:
           use-mamba: true
           python-version: ${{ matrix.python }}
           channel-priority: strict
+          channels: ${{ needs.detect-ci-trigger.outputs.triggered == 'true' && 'dask/label/dev,conda-forge,nodefaults' || 'conda-forge,nodefaults' }}
           activate-environment: dask-sql
           environment-file: ${{ env.CONDA_FILE }}
       - name: Setup Rust Toolchain
@@ -77,8 +78,7 @@ jobs:
       - name: Optionally install upstream dev Dask / dask-ml
         if: needs.detect-ci-trigger.outputs.triggered == 'true'
         run: |
-          python -m pip install --no-deps git+https://github.com/dask/dask
-          python -m pip install --no-deps git+https://github.com/dask/distributed
+          mamba update dask
           python -m pip install --no-deps git+https://github.com/dask/dask-ml
       - name: Test with pytest
         run: |
@@ -107,10 +107,11 @@ jobs:
         with:
           miniforge-variant: Mambaforge
           use-mamba: true
-          python-version: "3.8"
+          python-version: "3.9"
           channel-priority: strict
+          channels: ${{ needs.detect-ci-trigger.outputs.triggered == 'true' && 'dask/label/dev,conda-forge,nodefaults' || 'conda-forge,nodefaults' }}
           activate-environment: dask-sql
-          environment-file: continuous_integration/environment-3.8-dev.yaml
+          environment-file: continuous_integration/environment-3.9-dev.yaml
       - name: Setup Rust Toolchain
         uses: actions-rs/toolchain@v1
         id: rust-toolchain
@@ -127,18 +128,23 @@ jobs:
           which python
           pip list
           mamba list
-      - name: Optionally install upstream dev Dask / dask-ml
+      - name: Optionally install upstream dev dask-ml
         if: needs.detect-ci-trigger.outputs.triggered == 'true'
         run: |
-          python -m pip install --no-deps git+https://github.com/dask/dask
-          python -m pip install --no-deps git+https://github.com/dask/distributed
+          mamba update dask
           python -m pip install --no-deps git+https://github.com/dask/dask-ml
       - name: run a dask cluster
+        env:
+          UPSTREAM: ${{ needs.detect-ci-trigger.outputs.triggered }}
         run: |
-          docker-compose -f .github/docker-compose.yaml up -d
-          # Wait for installation
-          sleep 40
+          if [[ $UPSTREAM == "true" ]]; then
+            docker-compose -f .github/cluster-upstream.yml up -d
+          else
+            docker-compose -f .github/cluster.yml up -d
+          fi
+          # periodically ping logs until a connection has been established; assume failure after 2 minutes
+          timeout 2m bash -c 'until docker logs dask-worker 2>&1 | grep -q "Starting established connection"; do sleep 1; done'
           docker logs dask-scheduler
           docker logs dask-worker
@@ -157,7 +163,7 @@ jobs:
         with:
           python-version: "3.8"
           mamba-version: "*"
-          channels: conda-forge,defaults
+          channels: ${{ needs.detect-ci-trigger.outputs.triggered == 'true' && 'dask/label/dev,conda-forge,nodefaults' || 'conda-forge,nodefaults' }}
           channel-priority: strict
       - name: Install dependencies and nothing else
         run: |
@@ -167,12 +173,6 @@ jobs:
           which python
           pip list
           mamba list
-      - name: Optionally install upstream dev Dask / dask-ml
-        if: needs.detect-ci-trigger.outputs.triggered == 'true'
-        run: |
-          python -m pip install --no-deps git+https://github.com/dask/dask
-          python -m pip install --no-deps git+https://github.com/dask/distributed
-          python -m pip install --no-deps git+https://github.com/dask/dask-ml
       - name: Try to import dask-sql
         run: |
           python -c "import dask_sql; print('ok')"
57 changes: 29 additions & 28 deletions continuous_integration/environment-3.10-dev.yaml
@@ -3,40 +3,41 @@ channels:
   - conda-forge
   - nodefaults
 dependencies:
-  - adagio>=0.2.3
-  - antlr4-python3-runtime>=4.9.2, <4.10.0 # Remove max pin after qpd (fugue dependency) updates their conda recipe
-  - black=22.3.0
-  - ciso8601>=2.2.0
   - dask-ml>=2022.1.22
   - dask>=2022.3.0
-  - fastapi>=0.61.1
-  - fs>=2.4.11
+  - fastapi>=0.69.0
   - intake>=0.6.0
-  - isort=5.7.0
-  - jsonschema>=4.4.0
-  - lightgbm>=3.2.1
-  - mlflow>=1.19.0
-  - mock>=4.0.3
-  - nest-asyncio>=1.4.3
-  - pandas>=1.0.0 # below 1.0, there were no nullable ext. types
-  - pip=20.2.4
-  - pre-commit>=2.11.1
-  - prompt_toolkit>=3.0.8
-  - psycopg2>=2.9.1
-  - pygments>=2.7.1
-  - pyhive>=0.6.4
-  - pytest-cov>=2.10.1
+  - jsonschema
+  - lightgbm
+  - maturin>=0.12.8
+  - mlflow
+  - mock
+  - nest-asyncio
+  - pandas>=1.1.2
+  - pre-commit
+  - prompt_toolkit
+  - psycopg2
+  - pyarrow>=3.0.0
+  - pygments
+  - pyhive
+  - pytest-cov
   - pytest-xdist
-  - pytest>=6.0.1
+  - pytest
   - python=3.10
-  - scikit-learn>=0.24.2
-  - sphinx>=3.2.1
-  - tpot>=0.11.7
-  - triad>=0.5.4
+  - rust>=1.60.0
+  - scikit-learn>=1.0.0
+  - setuptools-rust>=1.1.2
+  - sphinx
+  - tpot
   - tzlocal>=2.1
   - uvicorn>=0.11.3
-  - maturin>=0.12.8
-  - setuptools-rust>=1.1.2
-  - rust>=1.60.0
+  # fugue dependencies; remove when we conda install fugue
+  - adagio
+  - antlr4-python3-runtime<4.10
+  - ciso8601
+  - fs
+  - pip
+  - qpd
+  - triad
   - pip:
     - fugue[sql]>=0.5.3
