feat: Add Python Virtual Environment Support: Installing User Defined Packages by SarahAsad23 · Pull Request #4630 · apache/texera

SarahAsad23 · 2026-05-02T00:11:59Z

What changes were proposed in this PR?

This PR is an extension of PR #4484. Previously, we introduced support for creating Python Virtual Environments (PVEs) with system-level dependencies preinstalled. This PR builds on that foundation by enabling users to install custom Python packages within a PVE.
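
As a rough illustration of what installing a user-defined package into a PVE involves, the sketch below builds the pip command that targets a virtual environment's own interpreter. It is not the implementation in this PR: the helper names `venv_python` and `pip_install_command`, and the example directory, are hypothetical. Only the standard `venv` directory layout and the `python -m pip install name==version` form are assumed.

```python
import os
import sys


def venv_python(venv_dir: str, platform: str = sys.platform) -> str:
    """Return the interpreter path inside a virtual environment (standard venv layout)."""
    if platform.startswith("win"):
        return os.path.join(venv_dir, "Scripts", "python.exe")
    return os.path.join(venv_dir, "bin", "python")


def pip_install_command(venv_dir, package, version=None, platform=sys.platform):
    """Build (but do not run) the pip command for one user-requested package."""
    # Pin the version only when the user supplied one.
    requirement = f"{package}=={version}" if version else package
    # Running pip through the venv's interpreter keeps the install isolated to that venv.
    return [venv_python(venv_dir, platform), "-m", "pip", "install", requirement]


if __name__ == "__main__":
    # Hypothetical environment directory; the command is printed, not executed.
    print(pip_install_command("/tmp/pve-demo", "numpy", "2.1.0", platform="linux"))
```

The resulting list could be handed to `subprocess.run`, with its output streamed back as the pip logs mentioned in the test steps.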

Any related issues, documentation, discussions?

This change is part of ongoing efforts to support environment isolation and reproducibility within Texera. Related issue: #4296. This PR closes sub-issue #4465.

How was this PR tested?

Tested manually, and the PveResourceSpec test file was updated.

To test:

On the CU, click the "+" for Python Environments.
Input environment name.
Input package name and version.
Click "OK" and wait for pip logs.

Was this PR authored or co-authored using generative AI tooling?

Co-authored using: ChatGPT (OpenAI)

…#4036)  ### What changes were proposed in this PR?  This PR improves two shell scripts: `build-images.sh` and `merge-image-tags` by enabling them to accept command line args. This can be useful when later we introduce the CI to automate the image building ### Any related issues, documentation, discussions?  No. This is a small improvement so I think there is no need to raise an issue ### How was this PR tested?  By executing the scripts with different args. ### Was this PR authored or co-authored using generative AI tooling?  No Co-authored-by: Chen Li <chenli@gmail.com>

…pache#4038)  ### What changes were proposed in this PR?  This PR fixes an issue where the UDF editor window does not respond to browser window resizing. ### Any related issues, documentation, discussions?  Fixes apache#4029 ### How was this PR tested?  Manually tested https://github.com/user-attachments/assets/0e8d99d9-9cc2-42f7-859a-b91fa9a50f82 ### Was this PR authored or co-authored using generative AI tooling?  No --------- Co-authored-by: ali risheh <ali.risheh876@gmail.com> Co-authored-by: Chen Li <chenli@gmail.com>

…he#4057)  ### What changes were proposed in this PR?  This PR fixes the non-deterministic sorting behavior in the admin user dashboard (apache#4044), where sorting by a column (e.g., name, role) could shuffle the order of users with equal values in that column. For all sortable columns in `AdminUserComponent`, we now break ties using the user id. `sortByActive` also uses the user id to break ties. ### Any related issues, documentation, discussions?  Fixes apache#4044 ### How was this PR tested?  Manually tested ### Was this PR authored or co-authored using generative AI tooling?  No Co-authored-by: ali risheh <ali.risheh876@gmail.com>
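
The tie-breaking idea can be sketched in a few lines. This is a language-neutral illustration in Python, not the Angular code from the PR; the field names `uid`, `name`, and `role` and the function `sort_users` are assumptions made for the example.

```python
def sort_users(users, column, descending=False):
    """Sort by `column`, breaking ties by ascending user id so the order is deterministic."""
    # Pass 1: order by the tie-breaker.
    by_uid = sorted(users, key=lambda u: u["uid"])
    # Pass 2: stable sort by the requested column; users with equal values keep their uid order,
    # even when reverse=True, because Python's sort preserves stability in both directions.
    return sorted(by_uid, key=lambda u: u[column], reverse=descending)


if __name__ == "__main__":
    users = [
        {"uid": 2, "name": "bob", "role": "ADMIN"},
        {"uid": 3, "name": "alice", "role": "REGULAR"},
        {"uid": 1, "name": "alice", "role": "ADMIN"},
    ]
    print([u["uid"] for u in sort_users(users, "name")])
    print([u["uid"] for u in sort_users(users, "name", descending=True)])
```

Without the first pass, the relative order of the two "alice" rows would depend on the input order, which is the shuffling the PR describes.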

…e registry (apache#4055)  ### What changes were proposed in this PR?  This PR adds a GitHub Actions workflow to build and push images to a remote registry on Docker Hub. This is useful for regular nightly builds and releases. <img width="300" height="500" alt="Screenshot 2025-11-13 at 3 38 26 PM" src="https://github.com/user-attachments/assets/d43e4110-fb30-498b-afa9-6ae07ac66e35" /> Committers can manually trigger this CI to build and push images with different options. ### Any related issues, documentation, discussions?  Related to apache#4046 ### How was this PR tested?  The PR was tested using https://github.com/bobbai00/texera, the main branch of my personal fork. ### Was this PR authored or co-authored using generative AI tooling?  No

…e#4065)  ### What changes were proposed in this PR?  This PR fixes a bug where editing user data on admin dashboard would result in user data jumping around. This issue is caused by the part where it fetches the user list again after editing. The original implementation was to call `ngOnInit` after editing to re-fetch the whole user list from the backend, causing the changed data to be out of order. The new implementation does the following thing: - Creates a new `User` instance with the affected user's data along with the updated attribute - After backend successfully updates the updated user in the database, the frontend uses the helper function `replaceOneImmutable` to update `userList` and `listOfDisplayUser` in the frontend to reflect the changes in frontend. This allows the user data to be changed in place without fetching the whole list after every update. ### Any related issues, documentation, discussions?  Closes apache#4064 ### Before Change video https://github.com/user-attachments/assets/6769e32f-d7a4-4817-956d-773e97fae57e ### Proposed Change video https://github.com/user-attachments/assets/01b4a0b1-3f56-437f-9b29-637854e3dd79 ### How was this PR tested?  None. ### Was this PR authored or co-authored using generative AI tooling?  No. --------- Co-authored-by: ali risheh <ali.risheh876@gmail.com>
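
The in-place update described above can be sketched as follows. This is a Python illustration of the idea behind `replaceOneImmutable`, not the TypeScript helper itself; the signature and the `uid` field are assumptions.

```python
def replace_one_immutable(items, updated, key=lambda u: u["uid"]):
    """Return a new list where the item matching `updated` is swapped in at the same position."""
    # A new list is built, so the original list (and its ordering) is never mutated.
    return [updated if key(item) == key(updated) else item for item in items]


if __name__ == "__main__":
    user_list = [
        {"uid": 1, "name": "alice"},
        {"uid": 2, "name": "bob"},
        {"uid": 3, "name": "carol"},
    ]
    new_list = replace_one_immutable(user_list, {"uid": 2, "name": "robert"})
    print(new_list)
```

Because the edited row keeps its index, the table no longer reorders itself after an edit, and no full re-fetch is needed.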

### What changes were proposed in this PR?

1. **Centralize and extend `AttributeType` operations**

   Move and refactor the existing attribute-type helpers into `AttributeTypeUtils`:
   * `compare`, `add`, `zeroValue`, `minValue`, `maxValue`.
   * Unify null-handling semantics across these operations (using match-case instead of if + match).

   Extend support to additional types:
   * Add comparison/aggregation support for `BOOLEAN`, `STRING`, and `BINARY`.

   Change numeric coercion strategy:
   * Coerce numeric values to `Number` instead of a specific primitive type (e.g., `Double`) to reduce `ClassCastException`s when the input is not strictly schema-validated.
   * Preserve existing comparison semantics for doubles by delegating to `java.lang.Double.compare` (including handling of ±∞ and `NaN`).

   Introduce “identity” helpers:
   * `zeroValue` returns an additive identity for numeric/timestamp types, and `Array.emptyByteArray` for `BINARY` as a safe, non-throwing identity.
   * `minValue` / `maxValue` provide lower/upper bounds for supported numeric and timestamp types.

2. **Refactor operators to reuse `AttributeTypeUtils`**
   * `AggregationOperation`: implement `SUM` / `MIN` / `MAX` using the centralized helpers instead of custom per-operator logic.
   * `StableMergeSortOpExec`: reuse the typed compare logic from `AttributeTypeUtils`.
   * `SortPartitionsOpExec`: simplify to use a one-liner comparator based on `AttributeTypeUtils.compare` (or a thin wrapper) for clarity and reuse.

3. **Add tests**
   * workflow-core/src/test/scala/org/apache/amber/core/tuple/AttributeTypeUtilsSpec.scala
     * **compare**: Verifies correct null-handling and ordering for INTEGER, BOOLEAN, TIMESTAMP, STRING, and BINARY values.
     * **add**: Ensures `null` acts as identity and confirms correct addition for INTEGER, LONG, DOUBLE, and TIMESTAMP.
     * **zeroValue**: Checks that numeric/timestamp zero identities and an empty binary array for BINARY are returned, and that unsupported types (e.g., STRING) throw.
     * **minValue / maxValue**: Validate correct numeric and timestamp bounds, BINARY minimum, and exceptions for unsupported types (e.g., BOOLEAN, STRING).
   * workflow-operator/src/test/scala/org/apache/amber/operator/aggregate/AggregateOpSpec.scala
     * Verifies `getAggregationAttribute` chooses the correct result type for different functions (SUM keeps input type, COUNT → INTEGER, CONCAT → STRING).
     * Checks `getAggFunc` SUM behavior for INTEGER and DOUBLE columns, ensuring correct totals and preserved fractional values.
     * Tests COUNT, CONCAT, MIN, MAX, and AVERAGE aggregations, including correct handling of `null` values and edge cases like “no rows”.
     * Confirms `getFinal` rewrites COUNT into a SUM on the intermediate count column and rewires attributes correctly for non-COUNT functions.
     * Exercises `AggregateOpExec` end-to-end: SUM grouped by a key (city) and combined global SUM+COUNT with no group-by keys, validating the produced tuples.

4. **Scope / non-goals / Extras**
   * No change to external APIs.
   * Main behavior changes are localized to `AttributeType` operations and the operators that consume them.

---

**Any related issues, documentation, discussions?**

* Closes: apache#3923

**How was this PR tested?**

Workflow Image:

<img width="1684" height="859" alt="image" src="https://github.com/user-attachments/assets/2682ebdc-0f45-40c6-b304-0cea0b76b44f" />

Workflow file: [agg_test_1.json](https://github.com/user-attachments/files/23540242/agg_test_1.json)

Python benchmark:

```
import pandas as pd

df = pd.read_csv("/mnt/data/test.csv")

# Limit BEFORE sorting
df_limited = df.head(1000)

# Now sort ascending
df_sorted = df_limited.sort_values("rna_umis", ascending=True)

# Group by pass_all_filters with aggregations
agg = df_sorted.groupby("pass_all_filters")["rna_umis"].agg(
    min="min", max="max", count="count", avg="mean", sum="sum"
).reset_index()

agg
```

Python Result:

<img width="928" height="188" alt="image" src="https://github.com/user-attachments/assets/69da33cd-ada4-4b05-a3f9-ae139f8575b9" />

Texera Result (Avg):

| pass_all_filters | min | max | count | avg | sum |
| -- | -- | -- | -- | -- | -- |
| False | 0 | 80926 | 240 | 15987.68 | 3837043 |
| True | 11893 | 102559 | 760 | 35557.93 | 27024027 |

For timestamps test:

- 1970-01-01T00:00:00Z
- 2000-02-29T12:00:00Z
- 2024-12-31T23:59:59Z

1. Avg:
   - New version: 909835199750
   - Previous version: 909835199750
2. Sum:
   - New version: 2055-03-01T05:59:59.000Z (UTC)
   - Previous version: 2055-03-01T11:59:59.000Z (UTC-6; Mexico City Time)

**Was this PR authored or co-authored using generative AI tooling?**

* Co-authored with ChatGPT.
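
The statement that `getFinal` rewrites COUNT into a SUM over the intermediate count column reflects the usual two-phase aggregation pattern: each worker counts its own partition, and the final stage adds the partial counts. The sketch below illustrates that pattern in Python under stated assumptions; `partial_count` and `final_count` are hypothetical names, not functions from `AggregationOperation`, and the rule that `None` values are not counted is an assumption about null handling.

```python
from collections import defaultdict


def partial_count(rows, group_key, attribute):
    """Phase 1 (per worker): count non-null values of `attribute` for each group."""
    counts = defaultdict(int)
    for row in rows:
        if row[attribute] is not None:  # assumption: nulls are not counted
            counts[row[group_key]] += 1
    return dict(counts)


def final_count(partials):
    """Phase 2: the final COUNT is a SUM over the intermediate count column."""
    totals = defaultdict(int)
    for partial in partials:
        for group, count in partial.items():
            totals[group] += count
    return dict(totals)


if __name__ == "__main__":
    worker_a = [{"city": "Irvine", "sales": 3}, {"city": "Irvine", "sales": None}]
    worker_b = [{"city": "Irvine", "sales": 5}, {"city": "Austin", "sales": 1}]
    partials = [partial_count(w, "city", "sales") for w in (worker_a, worker_b)]
    print(final_count(partials))
```

Counting the partial results again in the final stage would give the number of workers per group rather than the number of rows, which is why the rewrite to SUM is needed.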

### What changes were proposed in this PR? This PR updates all Texera service images in the single-node `docker-compose.yml` to use the Apache registry with `latest` tags, aligning with the naming convention established in the CI/CD workflow (apache#4055). The following image references have been updated: - `texera/file-service:single-node-release-1-0-0` → `apache/texera-file-service:latest` - `texera/workflow-compiling-service:single-node-release-1-0-0` → `apache/texera-workflow-compiling-service:latest` - `texera/computing-unit-master:single-node-release-1-0-0` → `apache/texera-workflow-execution-coordinator:latest` - `texera/texera-web-application:single-node-release-1-0-0` → `apache/texera-dashboard-service:latest` - `texera/texera-example-data-loader:single-node-release-1-0-0` → `apache/texera-example-data-loader:latest` This change ensures that the docker-compose configuration uses the correct image names and registry that are now being built and pushed by the GitHub Actions workflow. ### Any related issues, documentation, discussions? Related to apache#4055 which introduced the GitHub Actions workflow for building and pushing images to the Apache registry. ### How was this PR tested? This PR only updates image references in the docker-compose.yml configuration file. No code changes were made. ### Was this PR authored or co-authored using generative AI tooling? Generated-by: Claude (Anthropic) Co-authored-by: Claude <noreply@anthropic.com>

…apache#4067)

### What changes were proposed in this PR?

This PR introduces a new attribute type, `big_object`, that lets Java operators pass data larger than 2 GB to downstream operators. Instead of storing large data directly in the tuple, the data is uploaded to MinIO, and the tuple stores a pointer to that object. Future PRs will add support for Python and R UDF operators.

#### Main changes:

1. MinIO
   - Added a new bucket: `texera-big-objects`.
   - Implemented multipart upload (separate from LakeFS) to efficiently handle large uploads.
2. BigObjectManager (Internal Java API)
   - `create()` → Generates a unique S3 URI, registers it in the database, and returns the URI string.
   - `deleteAllObjects()` → Deletes all big objects from S3 (please check the Note section below).
3. Streaming I/O Classes
   - `BigObjectOutputStream`: Streams data to S3 using background multipart upload.
   - `BigObjectInputStream`: Lazily streams data from S3 when reading.
4. Iceberg Integration
   - BigObject pointers are stored as strings in Iceberg.
   - A magic suffix is added to attribute names to differentiate them from normal strings.

#### User API

##### Creating and Writing a BigObject:

```java
// In an OperatorExecutor
BigObject bigObject = new BigObject();
try (BigObjectOutputStream out = new BigObjectOutputStream(bigObject)) {
    out.write(myLargeDataBytes);
    // or: out.write(byteArray, offset, length);
}
// bigObject is now ready to be added to tuples
```

##### Reading a BigObject:

```java
// Option 1: Read all data at once
try (BigObjectInputStream in = new BigObjectInputStream(bigObject)) {
    byte[] allData = in.readAllBytes();
    // ... process data
}

// Option 2: Read a specific amount
try (BigObjectInputStream in = new BigObjectInputStream(bigObject)) {
    byte[] chunk = in.readNBytes(1024); // Read 1KB
    // ... process chunk
}

// Option 3: Use as a standard InputStream
try (BigObjectInputStream in = new BigObjectInputStream(bigObject)) {
    int bytesRead = in.read(buffer, offset, length);
    // ... process data
}
```

#### Note

This PR does NOT handle lifecycle management for big objects. For now, when a workflow or workflow execution is deleted, all related big objects in S3 are deleted immediately. We will add proper lifecycle management in a future update.

#### System Diagram

<img width="3444" height="2684" alt="BigObject-Page-1 drawio (4)" src="https://github.com/user-attachments/assets/98eded06-03b2-41be-b50b-0520a654ddca" />

### Any related issues, documentation, discussions?

Related to apache#3787.

### How was this PR tested?

Tested by running this workflow multiple times and checking the MinIO dashboard to see whether three big objects are created and deleted. Set the file scan operator's property to use any file bigger than 2 GB.

[Big Object Java UDF.json](https://github.com/user-attachments/files/23666312/Big.Object.Java.UDF.json)

### Was this PR authored or co-authored using generative AI tooling?

Yes.

--------- Signed-off-by: Chris <143021053+kunwp1@users.noreply.github.com>

Please see this [wiki page](https://github.com/apache/texera/wiki/Guide-to-enable-the-LLM%E2%80%90based-Texera-copilot) to learn how to enable this feature.

### What changes were proposed in this PR?

This PR introduces the LLM agent management & chat panel on the workflow workspace to help users with their workflows.

#### Demo

1. Manage agent using the panel

   ![2025-11-08 14 59 31](https://github.com/user-attachments/assets/75baf11d-e351-47b8-b676-b59e0e3b0db0)

2. Ask agent questions regarding available Texera operators

   ![2025-11-08 15 00 38](https://github.com/user-attachments/assets/4875efd2-4c87-42c8-91e0-5bb3a23c190a)

3. Ask agent about users' current workflow

   ![2025-11-08 15 02 05](https://github.com/user-attachments/assets/c8e57bbb-e93f-445e-951b-266e8ff7f3b0)

#### Architecture Diagram

See apache#4034

#### Major Changes

1. Frontend: introduce the agent management & chat panel.
2. Backend:
   - A new microservice, `litellm`, is introduced. It is an open-source service that manages the communication between the app and LLM APIs.
   - `AccessControlService` is modified to add the logic for routing `litellm`-related requests.

### Any related issues, documentation, discussions?

Related to apache#4034

#### Current PR limitation and future PR plans

In the current PR, the agent can only act in a "read-only" way, meaning it can only answer questions regarding operators and cannot change the user's workflow. In future PRs:

- The agent will be able to edit the user's workflow.
- The agent feature will be added to the k8s deployment architecture.

### How was this PR tested?

Frontend unit test cases are added. To test the PR e2e:

1. Launch litellm by following the instructions in `bin/litellm-config.yaml`.
2. Launch `AccessControlService`.
3. All set! You can now test the agent in the workflow workspace.

### Was this PR authored or co-authored using generative AI tooling?

The code content is co-authored with Claude Code. This PR is not generated by generative AI.
--------- Co-authored-by: Xinyuan Lin <xinyual3@uci.edu> Co-authored-by: Claude <noreply@anthropic.com>

…ry (apache#4072) ### What changes were proposed in this PR? This PR updates all Texera service images in the Kubernetes Helm chart (`bin/k8s/values.yaml`) to use the Apache registry with `latest` tags, aligning with the naming convention established in the CI/CD workflow (apache#4055). The following image references have been updated: - `texera/texera-example-data-loader:cluster` → `apache/texera-example-data-loader:latest` - `texera/texera-web-application:cluster` → `apache/texera-dashboard-service:latest` - `texera/workflow-computing-unit-managing-service:cluster` → `apache/texera-workflow-computing-unit-managing-service:latest` - `texera/workflow-compiling-service:cluster` → `apache/texera-workflow-compiling-service:latest` - `texera/file-service:cluster` → `apache/texera-file-service:latest` - `texera/config-service:cluster` → `apache/texera-config-service:latest` - `texera/access-control-service:cluster` → `apache/texera-access-control-service:latest` - `texera/computing-unit-master:cluster` → `apache/texera-workflow-execution-coordinator:latest` This ensures that the Kubernetes Helm chart uses the correct image names and registry that are now being built and pushed by the GitHub Actions workflow. ### Any related issues, documentation, discussions? Related to apache#4055 which introduced the GitHub Actions workflow for building and pushing images to the Apache registry. ### How was this PR tested? This PR only updates image references in the Kubernetes Helm chart configuration file. No code changes were made. ### Was this PR authored or co-authored using generative AI tooling? Generated-by: Claude (Anthropic) --------- Co-authored-by: Claude <noreply@anthropic.com> Co-authored-by: Chen Li <chenli@gmail.com>

### What changes were proposed in this PR? Move dependency `transformers` from `requirements.txt` to `operator-requirements.txt`. ### Any related issues, documentation, discussions? The dependency was introduced in apache#2600 to support Hugging Face operators. It should not have been a dependency of pyamber, only of the specific operator. - apache#2600 This blocks apache#4088 ### How was this PR tested? Existing tests. ### Was this PR authored or co-authored using generative AI tooling? No

### What changes were proposed in this PR? Pin external GitHub Actions ### Any related issues, documentation, discussions? Per https://infra.apache.org/github-actions-policy.html ### Was this PR authored or co-authored using generative AI tooling? No

### What changes were proposed in this PR?  Bump `transformers` from 4.53.0 to 4.57.3 to support Hugging Face operators. ### Any related issues, documentation, discussions?  Resolves apache#4091 by updating the `transformers` dependency to support Hugging Face operators. ### How was this PR tested?  Tested by running the Hugging Face operators in Texera and verifying that the models load and run successfully (see screenshot below). <img width="453" height="295" alt="image" src="https://github.com/user-attachments/assets/208d9721-24a2-4da9-9488-81da5ad3219a" /> ### Was this PR authored or co-authored using generative AI tooling?  No.

### What changes were proposed in this PR? Bump `pandas` version to 2.2.3 to be [compatible with Python 3.13](https://pandas.pydata.org/pandas-docs/stable/whatsnew/v2.2.3.html#pandas-2-2-3-is-now-compatible-with-python-3-13). ### Any related issues, documentation, discussions? Resolves apache#4095 ### How was this PR tested? CI ### Was this PR authored or co-authored using generative AI tooling? No Signed-off-by: Yicong Huang <17627829+Yicong-Huang@users.noreply.github.com>

### What changes were proposed in this PR? Bump numpy version to 2.1.0 to be [compatible with Python 3.13](https://numpy.org/news/#numpy-210-released). ### Any related issues, documentation, discussions? Closes apache#4097 ### How was this PR tested? CI ### Was this PR authored or co-authored using generative AI tooling? No --------- Signed-off-by: Yicong Huang <17627829+Yicong-Huang@users.noreply.github.com>

…artifacts (apache#4076)

### What changes were proposed in this PR?

This PR adds a CI file for uploading the release artifacts to [dist.apache.org](https://dist.apache.org/repos/dist/dev/incubator/texera/).

The following secrets need to be set:

| Secret | Purpose |
|-----------------|-----------------------------------------------|
| GPG_PRIVATE_KEY | The GPG private key used to sign the release tarball. Imported via `gpg --import` to create the `.asc` signature file. |
| GPG_PASSPHRASE | Passphrase for the GPG private key. Used with `--passphrase-fd` to unlock the key during signing. |
| SVN_USERNAME | Apache SVN username for committing artifacts to dist.apache.org. Used to authenticate with the ASF distribution repository. |
| SVN_PASSWORD | Apache SVN password. Paired with SVN_USERNAME to push release artifacts to the staging directory (`dist/dev/incubator/texera/`). |

### Any related issues, documentation, discussions?

Closes apache#4081

### How was this PR tested?

This PR was tested manually using GitHub Actions on my own fork. See: https://github.com/bobbai00/texera/actions/runs/19608186790

### Was this PR authored or co-authored using generative AI tooling?

Yes, co-authored with Claude Code

--------- Co-authored-by: Claude <noreply@anthropic.com>

### What changes were proposed in this PR?

This PR refactors the package structure by moving all Amber engine code from `org.apache.amber` to `org.apache.texera.amber`. This aligns the package naming with the Texera project organization and ensures all components are properly namespaced under the Apache Texera organization.

**Key Changes:**

1. **Directory Structure Migration** - Moved all source directories:
   - Scala/Java sources: 8 modules moved
   - Protobuf definitions: 14 files moved
   - Python proto generated code: moved under new namespace
   - Frontend TypeScript proto: moved under new namespace
2. **Code Updates** - Updated across 707 files:
   - Package declarations in 576 Scala/Java files
   - Import statements across all Scala/Java files
   - 57 Python files updated for new proto imports
   - 14 Protobuf files updated with new Java package
   - 2 TypeScript files updated with new import paths
   - Configuration files (cluster.conf)
   - String literals containing class names for reflection/dynamic loading
3. **Package Namespace Changes:**

```diff
- org.apache.amber.engine.common
- org.apache.amber.operator.*
- org.apache.amber.core.*
- org.apache.amber.compiler.*
+ org.apache.texera.amber.engine.common
+ org.apache.texera.amber.operator.*
+ org.apache.texera.amber.core.*
+ org.apache.texera.amber.compiler.*
```

### Any related issues, documentation, discussions?

Closes apache#4003

### How was this PR tested?

CI

### Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Sonnet 4.5 (Cursor IDE)
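
A namespace migration of this kind is essentially a guarded text rewrite. The sketch below shows one way such a rewrite could look; it is an illustration only, not the tooling actually used for this PR, and `migrate_package` is a hypothetical helper. The word boundaries keep the pattern from touching unrelated identifiers, and the rewrite is idempotent because the new name does not contain the old one.

```python
import re

# Match the old root package only as a whole dotted prefix.
_OLD_PACKAGE = re.compile(r"\borg\.apache\.amber\b")
_NEW_PACKAGE = "org.apache.texera.amber"


def migrate_package(text: str) -> str:
    """Rewrite references to the old root package; running it twice changes nothing more."""
    return _OLD_PACKAGE.sub(_NEW_PACKAGE, text)


if __name__ == "__main__":
    source = "package org.apache.amber.engine.common\nimport org.apache.amber.core.tuple.Tuple\n"
    print(migrate_package(source))
```

String literals used for reflection, which the PR lists separately, are covered by the same pattern because the match does not depend on surrounding syntax.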

…4087)  ### What changes were proposed in this PR?  This PR adds pre-configured IntelliJ run configurations for: - launching all 8 backend microservices, - the frontend service, - and lakeFS via Docker Compose. With these changes, developers can now launch the backend services, lakeFS, and frontend directly from IntelliJ’s run menu, eliminating the need to manually locate and configure each relevant class or compose file. This leverages IntelliJ’s built-in Compound and individual run configurations, so no additional plugins are required. https://github.com/user-attachments/assets/9ef8fb13-2dc3-4598-ba44-0540d37202db ### Any related issues, documentation, discussions?  Fixes apache#4045 ### How was this PR tested?  Verified on a local IntelliJ IDEA environment. The Compound run config cleanly launches all backend microservices in parallel. ### Was this PR authored or co-authored using generative AI tooling?  No --------- Co-authored-by: Xinyuan Lin <xinyual3@uci.edu> Co-authored-by: Chen Li <chenli@gmail.com>

…est architecture (apache#4077) ### What changes were proposed in this PR? This PR improves the single-node docker-compose configuration with the following changes: 1. **Added microservices**: - `config-service` (port 9094): Provides endpoints for configuration management - `access-control-service` (port 9096): Handles user permissions and access control - `workflow-computing-unit-managing-service` (port 8888): Provides endpoints for managing computing units - All services are added with proper health checks and dependencies on postgres - Nginx reverse proxy routes are configured for `/api/config` and `/api/computing-unit` 2. **Removed outdated environment variables** from `.env`: - `USER_SYS_ENABLED=true` - `STORAGE_ICEBERG_CATALOG_TYPE=postgres` 3. **Removed unused example data loader**: the example data will be loaded by other means rather than by a dedicated container. ### Any related issues, documentation, discussions? Closes apache#4083 ### How was this PR tested? docker-compose tested locally. ### Was this PR authored or co-authored using generative AI tooling? Generated-by: Claude Code (claude-opus-4-5-20250101) --------- Co-authored-by: Claude <noreply@anthropic.com>

Bumps [pg8000](https://github.com/tlocke/pg8000) from 1.31.2 to 1.31.5. See full diff in [compare view](https://github.com/tlocke/pg8000/commits). Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Xiaozhen Liu <xiaozl3@uci.edu>

### What changes were proposed in this PR? Add a configuration option to automatically shorten file paths for Windows users when the original path exceeds the system’s maximum length. After this PR, Windows users should not see this error anymore. <img width="612" height="157" alt="image" src="https://github.com/user-attachments/assets/73a23ef2-0fad-4f2f-bc99-c7f2e576a4d9" /> ### Any related issues, documentation, discussions? Follow-up of PR apache#4087 ### How was this PR tested? Tested manually. ### Was this PR authored or co-authored using generative AI tooling? No

### What changes were proposed in this PR? Removed official support for R-UDF. The frontend is not changed, but during execution the user will receive an error stating that R-UDF is not officially supported. We plan to move the R-UDF to a third-party hosted repo, so users can install R-UDF support as a plugin. ### Any related issues, documentation, discussions? This change was made because the R-UDF runtime requires `rpy2`, which is not Apache-license friendly. Resolves apache#4084 ### How was this PR tested? Added test suite `TestExecutorManager`. ### Was this PR authored or co-authored using generative AI tooling? Tests generated by Cursor. --------- Co-authored-by: Yicong Huang <yicong.huang+data@databricks.com> Co-authored-by: Chen Li <chenli@gmail.com>

### What changes were proposed in this PR?

1. Replace flake8 and black with Ruff in CI.
2. Format existing code using Ruff.

Basic Ruff commands:

Change to `amber/src/main/python`:

```
cd amber/src/main/python
```

Run Ruff's formatter in check (dry-run) mode:

```
ruff format --check .
```

Run Ruff's formatter:

```
ruff format .
```

Run Ruff's linter:

```
ruff check .
```

### Any related issues, documentation, discussions?

Closes apache#4078

### How was this PR tested?

I created a PR on my own fork to ensure CI is working.

### Was this PR authored or co-authored using generative AI tooling?

No

--------- Co-authored-by: Xinyuan Lin <xinyual3@uci.edu>

### What changes were proposed in this PR? This PR bumps the project version from `1.0.0` to `1.1.0-incubating` across all relevant configuration files: - **`build.sbt`**: Updated `version := "1.0.0"` to `version := "1.1.0-incubating"` - **`bin/single-node/docker-compose.yml`**: - Updated project name from `texera-single-node-release-1-0-0` to `texera-single-node-release-1-1-0-incubating` - Updated network name from `texera-single-node-release-1-0-0` to `texera-single-node-release-1-1-0-incubating` - Updated all 7 Texera service image tags from `:latest` to `:1.1.0-incubating` - Updated the R operator comment reference - **`bin/k8s/values.yaml`**: Updated all 8 Texera service image tags from `:latest` to `:1.1.0-incubating` ### Any related issues, documentation, discussions? Closes apache#4082 ### How was this PR tested? This is a configuration-only change. ### Was this PR authored or co-authored using generative AI tooling? Generated-by: Claude Code (Claude Opus 4.5) Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>

…n suffix (apache#4116)

### What changes were proposed in this PR?  This PR renames the `BigObject` type to `LargeBinary`. The original feature was introduced in apache#4067, but we decided to adopt the `LargeBinary` terminology to align with naming conventions used in other systems (e.g., Arrow). This change is purely a renaming/terminology update and does not modify the underlying functionality. ### Any related issues, documentation, discussions?  apache#4100 (comment) ### How was this PR tested?  Run this workflow and check if the workflow runs successfully and see if three objects are created in MinIO console. [Java UDF.json](https://github.com/user-attachments/files/23976766/Java.UDF.json) ### Was this PR authored or co-authored using generative AI tooling?  No. --------- Signed-off-by: Chris <143021053+kunwp1@users.noreply.github.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

…4124) ### What changes were proposed in this PR? This PR removes the `WITH_R_SUPPORT` build argument and all R-related installation logic from the Docker build configuration: 1. **Dockerfiles** (`computing-unit-master.dockerfile` and `computing-unit-worker.dockerfile`): - Removed `ARG WITH_R_SUPPORT` build argument - Removed conditional R runtime dependencies installation - Removed R compilation and installation steps (R 4.3.3) - Removed R packages installation (arrow, coro, dplyr) - Removed `LD_LIBRARY_PATH` environment variable for R libraries - Removed `r-requirements.txt` copy in worker dockerfile - Simplified to Python-only dependencies 2. **GitHub Actions Workflow** (`.github/workflows/build-and-push-images.yml`): - Removed `with_r_support` workflow input parameter - Removed `with_r_support` from job outputs and parameter passing - Removed `WITH_R_SUPPORT` build args from both AMD64 and ARM64 build steps - Removed R Support from build summary ### Any related issues, documentation, discussions? Related to apache#4090 ### How was this PR tested? Verified Dockerfile & CI yml syntax are valid ### Was this PR authored or co-authored using generative AI tooling? Generated-by: Claude Sonnet 4.5 (claude-sonnet-4-5-20250929) via Claude Code CLI

### What changes were proposed in this PR?

This PR introduces Python support for the `large_binary` attribute type, enabling Python UDF operators to process data larger than 2 GB. Data is offloaded to MinIO (S3), and the tuple retains only a pointer (URI). This mirrors the existing Java LargeBinary implementation, ensuring cross-language compatibility. (See apache#4067 for system diagram and apache#4111 for renaming)

## Key Features

### 1. MinIO/S3 Integration

- Utilizes the shared `texera-large-binaries` bucket.
- Implements lazy initialization of S3 clients and automatic bucket creation.

### 2. Streaming I/O

- **`LargeBinaryOutputStream`:** Writes data to S3 using multipart uploads (64KB chunks) to prevent blocking the main execution.
- **`LargeBinaryInputStream`:** Lazily downloads data only when the read operation begins. Implements standard Python `io.IOBase`.

### 3. Tuple & Iceberg Compatibility

- `largebinary` instances are automatically serialized to URI strings for Iceberg storage and Arrow tables.
- Uses a magic suffix (`__texera_large_binary_ptr`) to distinguish pointers from standard strings.

### 4. Serialization

- Pointers are stored as strings with metadata (`texera_type: LARGE_BINARY`). Auto-conversion ensures UDFs always see `largebinary` instances, not raw strings.

## User API Usage

### 1. Creating & Writing (Output)

Use `LargeBinaryOutputStream` to stream large data into a new object.

```python
from pytexera import largebinary, LargeBinaryOutputStream

# Create a new handle
large_binary = largebinary()

# Stream data to S3
with LargeBinaryOutputStream(large_binary) as out:
    out.write(my_large_data_bytes)  # Supports bytearray, bytes, etc.
```

### 2. Reading (Input)

Use `LargeBinaryInputStream` to read data back. It supports all standard Python stream methods.

```python
from pytexera import LargeBinaryInputStream

with LargeBinaryInputStream(large_binary) as stream:
    # Option A: Read everything
    all_data = stream.read()

    # Option B: Chunked reading
    chunk = stream.read(1024)

    # Option C: Iteration
    for line in stream:
        process(line)
```

## Dependencies

- `boto3`: Required for S3 interactions.
- `StorageConfig`: Uses existing configuration for endpoints/credentials.

## Future Direction

- Support for R UDF Operators
- Check apache#4123

### Any related issues, documentation, discussions?

Design: apache#3787

### How was this PR tested?

Tested by running this workflow multiple times and check MinIO dashboard to see whether six objects are created and deleted. Specify the file scan operator's property to use any file bigger than 2GB.
[Large Binary Python.json](https://github.com/user-attachments/files/24062982/Large.Binary.Python.json)

### Was this PR authored or co-authored using generative AI tooling?

No.

---------

Signed-off-by: Chris <143021053+kunwp1@users.noreply.github.com>
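
The lazy-download behavior described for `LargeBinaryInputStream` can be illustrated with a minimal sketch. This is not the pytexera implementation: `LazyInputStream` and its `opener` callable are hypothetical names, and an in-memory `io.BytesIO` stands in for the S3 object. It only shows how an `io.RawIOBase` subclass can defer opening its source until the first read while still getting `read()`, `read(n)`, and line iteration from the standard base class.

```python
import io
from typing import BinaryIO, Callable, Optional


class LazyInputStream(io.RawIOBase):
    """Hypothetical stand-in: opens the underlying source only on first read."""

    def __init__(self, opener: Callable[[], BinaryIO]):
        super().__init__()
        self._opener = opener          # assumption: returns a readable binary stream
        self._source: Optional[BinaryIO] = None

    def readable(self) -> bool:
        return True

    def readinto(self, buffer) -> int:
        # Nothing is fetched until a read actually happens.
        if self._source is None:
            self._source = self._opener()
        data = self._source.read(len(buffer))
        buffer[: len(data)] = data
        return len(data)               # 0 signals end of stream

    def close(self) -> None:
        if self._source is not None:
            self._source.close()
        super().close()


# Usage: the opener here wraps in-memory bytes instead of a remote object.
with LazyInputStream(lambda: io.BytesIO(b"row1\nrow2\n")) as stream:
    first = stream.read(4)             # triggers the open
    rest = list(stream)                # remaining data, split on newlines
```

Because `io.RawIOBase` derives `read`, `readall`, and `readline` from `readinto`, only that one method needs to know about the deferred open.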

…condition (apache#4615)

### What changes were proposed in this PR?

This PR fixes a race in `SyncExecutionResource.allTargetsCompleted` that causes the sync execution API (`POST /api/execution/{wid}/{cuid}/run`) to terminate before a HashJoin's probe phase produces output, returning an empty result.

**Root cause.** `HashJoinOpDesc.getPhysicalPlan` produces two PhysicalOps (`build`, `probe`) sharing one logical id, separated by a blocking edge. The scheduler places them in two regions and runs them sequentially. `WorkflowExecution.getAllRegionExecutionsStats` aggregates per-logical-op state by `groupBy(_._1.logicalOpId.id)` over only the *registered* `RegionExecution`s. Between "build region completed" and "probe region instantiated," only the build PhysicalOp is registered, so `aggregateStates(Iterable(COMPLETED))` returns `COMPLETED`. The sync resource then takes the `TargetResultsReady` branch, calls `killExecution`, and reads the probe's still-empty Iceberg output.

The same shape applies to any logical operator whose physical plan contains multiple PhysicalOps separated by a blocking edge (e.g., `Aggregate`). It does not surface in the regular WebSocket-driven frontend execution because the frontend waits for full workflow termination.

**Fix.** Strengthen `allTargetsCompleted` to require, in addition to `operatorState == COMPLETED`, that every declared external input port of the target is already present in `OperatorMetrics.operatorStatistics.inputMetrics`. Port-1 metrics only appear after the probe actually consumes data, which closes the race window. Internal ports (e.g., HashJoin's build→probe internal edge) are filtered out on both sides of the comparison so the predicate matches what `aggregateMetrics` already exposes. Source operators (zero declared inputs) and single-input operators are unaffected; for empty-input edge cases, `terminalStateObservable` continues to provide the fallback signal.

```scala
val targetExpectedExternalInputs: Map[String, Int] =
  effectiveLogicalPlan.operators
    .filter(op => request.targetOperatorIds.contains(op.operatorIdentifier.id))
    .map(op =>
      op.operatorIdentifier.id -> op.operatorInfo.inputPorts.count(!_.id.internal)
    )
    .toMap

def allTargetsCompleted(stats: ExecutionStatsStore): Boolean = {
  request.targetOperatorIds.nonEmpty &&
  request.targetOperatorIds.forall { opId =>
    stats.operatorInfo.get(opId).exists { metrics =>
      val externalInputPortsReporting =
        metrics.operatorStatistics.inputMetrics.count(!_.portId.internal)
      val expectedExternalInputs =
        targetExpectedExternalInputs.getOrElse(opId, 0)
      metrics.operatorState == COMPLETED &&
      externalInputPortsReporting >= expectedExternalInputs
    }
  }
}
```

### Any related issues, documentation, discussions?

Closes apache#4576

### How was this PR tested?

Manually reproduced and verified end-to-end against `ComputingUnitMaster` on port 8085 with a 3-operator DAG (CSVFileScan movies + CSVFileScan ratings → HashJoin on `movieId`) executed via `POST /api/execution/{wid}/{cuid}/run` with `targetOperatorIds = [HashJoinId]`. Inputs: `movies.csv` (1000 rows) and `ratings.csv` (10 311 rows).

Steps to reproduce / verify:

```
# 1. Start the master
sbt "project WorkflowExecutionService" compile
java ... org.apache.texera.web.ComputingUnitMaster   # listens on :8085

# 2. Get a JWT
curl -s -X POST http://localhost:8080/api/auth/login \
  -H "Content-Type: application/json" \
  -d '{"username":"<user>","password":"<pw>"}'

# 3. POST the request (CSV → CSV → HashJoin, target = HashJoin)
curl -s -X POST http://localhost:8085/api/execution/<wid>/<cuid>/run \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <token>" \
  --data @sync-exec-request.json
```

Existing tests pass (`sbt "project WorkflowExecutionService" compile` succeeds).

No new unit test was added because the failure is a timing race in the controller's region-registration sequence relative to the sync resource's observable; reproducing it deterministically in a unit test would require either mocking `ExecutionStatsStore` to emit a build-only snapshot followed by a build+probe snapshot, or driving the full controller actor system, both of which are out of scope for this targeted fix. Manual reproduction is reliable on every run because the race window is several hundred milliseconds wide and `Observable.amb` consistently selects the (incorrect) target-completion signal first prior to this fix.

### Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Claude Opus 4.7)

---------

Co-authored-by: Xinyuan Lin <xinyual3@uci.edu>
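
To make the race window concrete, the strengthened predicate can be modeled outside the engine. The sketch below is a simplified Python stand-in, not Texera code: `PortMetrics`, `OperatorMetrics`, and `all_targets_completed` are hypothetical names, and the stats store is reduced to a plain dictionary. It replays the two snapshots the commit message describes, a build-only snapshot followed by a build+probe snapshot.

```python
from dataclasses import dataclass, field
from typing import Dict, List

COMPLETED = "COMPLETED"


@dataclass
class PortMetrics:
    port_id: int
    internal: bool = False


@dataclass
class OperatorMetrics:
    state: str
    input_metrics: List[PortMetrics] = field(default_factory=list)


def all_targets_completed(
    target_ids: List[str],
    expected_external_inputs: Dict[str, int],
    stats: Dict[str, OperatorMetrics],
) -> bool:
    """True only if every target is COMPLETED and all external ports have reported."""
    if not target_ids:
        return False
    for op_id in target_ids:
        metrics = stats.get(op_id)
        if metrics is None:
            return False
        # Internal ports are ignored on both sides of the comparison.
        reporting = sum(1 for p in metrics.input_metrics if not p.internal)
        expected = expected_external_inputs.get(op_id, 0)
        if metrics.state != COMPLETED or reporting < expected:
            return False
    return True


# A HashJoin target declares two external inputs (build side and probe side).
expected = {"join": 2}

# Snapshot 1: only the build region is registered; its state aggregates to COMPLETED.
build_only = {"join": OperatorMetrics(COMPLETED, [PortMetrics(0)])}

# Snapshot 2: the probe has consumed data, so port 1 metrics now exist.
build_and_probe = {"join": OperatorMetrics(COMPLETED, [PortMetrics(0), PortMetrics(1)])}

early = all_targets_completed(["join"], expected, build_only)
late = all_targets_completed(["join"], expected, build_and_probe)
```

Under the state-only check, the first snapshot would already count as complete; with the port-count condition it does not, which is the window the fix closes.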

apache#4624)

### What changes were proposed in this PR?

The Build matrix jobs (`frontend`, `scala`, `python`, `agent-service`) were duplicated between `github-action-build.yml` and `reusable-build.yml`, and the two had drifted — `reusable-build.yml` was missing the recent license-check additions (npm bundle check, pip-licenses manifest, bundled-jar diff against LICENSE-binary, agent-service license manifest).

Net change: **+238 / −416** lines.

- Rename `reusable-build.yml` → `build.yml` (workflow name `Build`). It is now the single source of truth for the matrix steps, with the license-check additions ported in.
- Rename `github-action-build.yml` → `required-checks.yml` (workflow name `Required Checks`). Replace the four inline matrix jobs with a single `build:` caller that `uses: ./.github/workflows/build.yml`. The `backport:` caller is unchanged; the `Required Checks` aggregator job's `needs:` shrinks from `[precheck, frontend, scala, python, agent-service, backport]` to `[precheck, build, backport]`.
- Update `direct-backport-push.yml`'s `workflow_id` reference to the new filename.

`.asf.yaml` continues to require only `Required Checks`, so the display-name change (matrix children gain a `build /` prefix) does not affect branch protection.

### Any related issues, documentation, discussions?

Closes apache#4623

### How was this PR tested?

YAML parses locally for all three modified workflow files. Step parity between the new `build.yml` and the previous inline `github-action-build.yml` matrix jobs verified by side-by-side diff. The job will be exercised on this PR itself; matrix children appear under `Required Checks / build / …` and the `backport:` matrix continues to appear under `Required Checks / backport (...) / …`.

### Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Opus 4.7)

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

### What changes were proposed in this PR?

Gate the Build workflow's main stacks on PR labels. `required-checks.yml`'s `precheck` job now waits for the Pull Request Labeler workflow to finish (polls the `labeler` check on the PR head SHA, up to 5 min) and reads the resulting labels to decide which stacks run:

| PR labels | frontend | scala | python | agent-service |
|---|---|---|---|---|
| only `docs` and/or `dev` | skip | skip | skip | skip |
| no `frontend` label | skip | run | run | run |
| includes `frontend` (or any non-skip label) | run | run | run | run |
| `push` / `workflow_dispatch` (no PR) | run | run | run | run |

`.github/labeler.yml`: rename the existing `build` label to `dev` so the name matches the role precheck reads. The labeler applies it for `bin/**` changes (the previous `deployment/**` glob is dropped because that directory no longer exists).

The backport matrix inherits the same `run_*` decisions: each `release/*` target only re-validates the stacks selected by the table above. A docs-only PR with a `release/*` label still spawns a backport run, but every stack inside it is skipped.

### Any related issues, documentation, discussions?

Closes apache#4621. Picks up the idea from the closed prior attempt apache#3642.

`.asf.yaml` ruleset's required check names are static; skipped stacks now report `skipped` rather than `success`. The `Required Checks` aggregator added in apache#4624 already treats `skipped` as a pass, so branch protection stays green.

### How was this PR tested?

Self-test on this PR: it touches `.github/workflows/**` and `.github/labeler.yml`, so labeler should add `ci` (and not `frontend`). Expected precheck output: `run_frontend=false`, others `true`. Adding/removing the `frontend` label should flip the frontend stack on/off; replacing all labels with `docs` only (or `dev` only) should skip every stack.

### Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Opus 4.7
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
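
The label-gating table can be read as a small decision function. The sketch below is one possible reading, not the workflow's actual script: `decide_stacks`, `STACKS`, and `SKIP_ONLY_LABELS` are hypothetical names. Two assumptions are made where the table is silent or ambiguous: a PR with no labels at all is treated like "no `frontend` label", and the frontend stack runs only when the `frontend` label is present, which matches the self-test expectation (`ci` alone yields `run_frontend=false`).

```python
from typing import Dict, Iterable, Optional

STACKS = ("frontend", "scala", "python", "agent-service")
SKIP_ONLY_LABELS = {"docs", "dev"}


def decide_stacks(event: str, labels: Optional[Iterable[str]] = None) -> Dict[str, bool]:
    """Return a run/skip decision per stack (True means run)."""
    # push / workflow_dispatch have no PR labels: run everything.
    if event != "pull_request":
        return {stack: True for stack in STACKS}

    label_set = set(labels or ())

    # Only docs and/or dev labels: skip every stack.
    # Assumption: an empty label set does not count as "only docs/dev".
    if label_set and label_set <= SKIP_ONLY_LABELS:
        return {stack: False for stack in STACKS}

    # Otherwise the backend stacks run; frontend is gated on its own label.
    decision = {stack: True for stack in STACKS}
    decision["frontend"] = "frontend" in label_set
    return decision


# The self-test case from the commit message: labeler adds `ci` but not `frontend`.
ci_only = decide_stacks("pull_request", ["ci"])
```

Since the backport matrix inherits the same `run_*` decisions, the same function result would apply per `release/*` target under this reading.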

bobbai00 and others added 30 commits November 18, 2025 22:11

chore(deps): bump protobuf from 3.20.3 to 4.25.8 in /amber (apache#4101)

2d5ee9d

fix: hide Regions on MiniMap (apache#4112)

e8c3f41

fix: correct IntelliJ run configurations by removing unnecessary .main suffix (apache#4116)

4f25d8e

bobbai00 and others added 4 commits May 1, 2026 22:55

added frontend user packages

f457843

SarahAsad23 marked this pull request as draft May 2, 2026 00:12

github-actions Bot added the frontend label May 2, 2026

github-actions Bot assigned SarahAsad23 May 2, 2026

SarahAsad23 added 7 commits May 1, 2026 17:39

Merge branch 'main' into pve-user-packages

69511dd

ws for create and install

2e60e97

added install to backend

f0200f9

styling

33788b4

user pacakges on +

5dd4cdb

refresh user packages

78660a0

formatting

4238c35

github-actions Bot added engine common and removed common labels May 4, 2026

SarahAsad23 force-pushed the pve-user-packages branch from 574c3c5 to 4ed8293 Compare May 4, 2026 05:18

github-actions Bot added the dependencies, ddl-change, python, ci, docs, dev, common, platform, and agent-service labels May 4, 2026

SarahAsad23 force-pushed the pve-user-packages branch from 4ed8293 to 4238c35 Compare May 4, 2026 05:22

SarahAsad23 closed this May 4, 2026

SarahAsad23 removed their assignment May 4, 2026

SarahAsad23 wants to merge 3982 commits into apache:main from SarahAsad23:pve-user-packages

20 participants
