feat: Add Python Virtual Environment Support: Installing User Defined Packages #4630

Closed
SarahAsad23 wants to merge 3982 commits into apache:main from SarahAsad23:pve-user-packages

SarahAsad23 commented May 2, 2026

What changes were proposed in this PR?

This PR is an extension of PR #4484. Previously, we introduced support for creating Python Virtual Environments (PVEs) with system-level dependencies preinstalled. This PR builds on that foundation by enabling users to install custom Python packages within a PVE.

Any related issues, documentation, discussions?

This change is part of ongoing efforts to support environment isolation and reproducibility within Texera. It is related to issue #4296 and closes sub-issue #4465.

How was this PR tested?

Tested manually; the PveResourceSpec test file was also updated.

To test:

  1. On the CU, click "+" Python Environments.
  2. Input environment name.
  3. Input package name and version.
  4. Click "OK" and wait for pip logs.
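
As a rough illustration of what steps 3 and 4 lead to, the sketch below builds a `pip install` command for a package pinned to a version inside a virtual environment. The helper name `pip_install_command`, the POSIX venv layout (`<env_dir>/bin/python`), and the example path are assumptions made for illustration; they are not taken from this PR's implementation.

```python
from pathlib import PurePosixPath
from typing import List, Optional


def pip_install_command(env_dir: str, package: str, version: Optional[str] = None) -> List[str]:
    """Build the command that installs `package` into the venv at `env_dir`.

    Hypothetical helper: assumes a POSIX venv layout (<env_dir>/bin/python).
    """
    python = PurePosixPath(env_dir) / "bin" / "python"
    # Pin the version with pip's `name==version` requirement syntax when given.
    requirement = f"{package}=={version}" if version else package
    # Invoking pip through the venv's own interpreter keeps the install isolated.
    return [str(python), "-m", "pip", "install", requirement]
```

Passing the returned list to `subprocess.run(cmd, capture_output=True, text=True)` would run the install and capture pip's output, which is one way the pip logs mentioned in step 4 could be collected.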

Was this PR authored or co-authored using generative AI tooling?

Co-authored using: ChatGPT (OpenAI)

bobbai00 and others added 30 commits November 18, 2025 22:11
…#4036)

<!--
Thanks for sending a pull request (PR)! Here are some tips for you:
1. If this is your first time, please read our contributor guidelines:
[Contributing to
Texera](https://github.com/apache/texera/blob/main/CONTRIBUTING.md)
  2. Ensure you have added or run the appropriate tests for your PR
  3. If the PR is work in progress, mark it a draft on GitHub.
  4. Please write your PR title to summarize what this PR proposes, we 
    are following Conventional Commits style for PR titles as well.
  5. Be sure to keep the PR description updated to reflect all changes.
-->

### What changes were proposed in this PR?
<!--
Please clarify what changes you are proposing. The purpose of this
section
is to outline the changes. Here are some tips for you:
  1. If you propose a new API, clarify the use case for a new API.
  2. If you fix a bug, you can clarify why it is a bug.
  3. If it is a refactoring, clarify what has been changed.
  3. It would be helpful to include a before-and-after comparison using 
     screenshots or GIFs.
  4. Please consider writing useful notes for better and faster reviews.
-->
This PR improves two shell scripts, `build-images.sh` and
`merge-image-tags`, by enabling them to accept command-line arguments.
This will be useful later, when we introduce CI to automate image
building.

### Any related issues, documentation, discussions?
<!--
Please use this section to link other resources if not mentioned
already.
1. If this PR fixes an issue, please include `Fixes apache#1234`, `Resolves
apache#1234`
or `Closes apache#1234`. If it is only related, simply mention the issue
number.
  2. If there is design documentation, please add the link.
  3. If there is a discussion in the mailing list, please add the link.
-->
No. This is a small improvement, so I think there is no need to raise an
issue.

### How was this PR tested?
<!--
If tests were added, say they were added here. Or simply mention that if
the PR
is tested with existing test cases. Make sure to include/update test
cases that
check the changes thoroughly including negative and positive cases if
possible.
If it was tested in a way different from regular unit tests, please
clarify how
you tested step by step, ideally copy and paste-able, so that other
reviewers can
test and check, and descendants can verify in the future. If tests were
not added,
please describe why they were not added and/or why it was difficult to
add.
-->
By executing the scripts with different args.

### Was this PR authored or co-authored using generative AI tooling?
<!--
If generative AI tooling has been used in the process of authoring this
PR,
please include the phrase: 'Generated-by: ' followed by the name of the
tool
and its version. If no, write 'No'. 
Please refer to the [ASF Generative Tooling
Guidance](https://www.apache.org/legal/generative-tooling.html) for
details.
-->
No

Co-authored-by: Chen Li <chenli@gmail.com>
…pache#4038)

### What changes were proposed in this PR?
This PR fixes an issue where the UDF editor window does not respond to
browser window resizing.

### Any related issues, documentation, discussions?
Fixes apache#4029 

### How was this PR tested?
Manually tested


https://github.com/user-attachments/assets/0e8d99d9-9cc2-42f7-859a-b91fa9a50f82



### Was this PR authored or co-authored using generative AI tooling?
No

---------

Co-authored-by: ali risheh <ali.risheh876@gmail.com>
Co-authored-by: Chen Li <chenli@gmail.com>
…he#4057)

### What changes were proposed in this PR?
This PR fixes the non-deterministic sorting behavior in the admin user
dashboard (apache#4044), where sorting by a column (e.g., name, role) could
shuffle the order of users with equal values in that column.
For all sortable columns in `AdminUserComponent`, we now break ties using
the user id.
`sortByActive` also uses the user id to break ties.
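
The tie-breaking idea can be sketched as follows. This is a Python illustration of the technique, not the component's TypeScript code; the `User` fields and the `sort_by_name` helper are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class User:
    uid: int
    name: str
    role: str


def sort_by_name(users: List[User], descending: bool = False) -> List[User]:
    """Sort by name, breaking ties by ascending uid so the order is deterministic."""
    # Sort by the tie-breaker first, then by the main key. Python's sort is
    # stable (also with reverse=True), so users with equal names keep their
    # ascending-uid order regardless of the input order.
    by_uid = sorted(users, key=lambda u: u.uid)
    return sorted(by_uid, key=lambda u: u.name, reverse=descending)
```

Because the result no longer depends on the incoming order, repeated sorts of the same data always produce the same rows in the same positions.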

### Any related issues, documentation, discussions?
Fixes apache#4044

### How was this PR tested?
Manually tested

### Was this PR authored or co-authored using generative AI tooling?
No

Co-authored-by: ali risheh <ali.risheh876@gmail.com>
…e registry (apache#4055)

### What changes were proposed in this PR?
This PR adds a GitHub Actions workflow to build and push images to a
remote registry on Docker Hub. This is useful for regular nightly builds
and releases.

<img width="300" height="500" alt="Screenshot 2025-11-13 at 3 38 26 PM"
src="https://github.com/user-attachments/assets/d43e4110-fb30-498b-afa9-6ae07ac66e35"
/>

Committers can manually trigger this CI to build and push images with
different options.


### Any related issues, documentation, discussions?
Related to apache#4046

### How was this PR tested?
The PR is tested using https://github.com/bobbai00/texera, the main
branch of my personal fork.

### Was this PR authored or co-authored using generative AI tooling?
No
…e#4065)

### What changes were proposed in this PR?
This PR fixes a bug where editing user data on the admin dashboard caused
rows to jump around. The cause was the re-fetch after editing: the
original implementation called `ngOnInit` after each edit to re-fetch the
whole user list from the backend, which returned the changed data out of
order.

The new implementation does the following:
- Creates a new `User` instance with the affected user's data along with
the updated attribute.
- After the backend successfully updates the user in the database, the
frontend uses the helper function `replaceOneImmutable` to update
`userList` and `listOfDisplayUser` so that they reflect the change.

This allows the user data to be changed in place without fetching the
whole list after every update.
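
A minimal sketch of an immutable single-item replace is shown below, in Python rather than the frontend's TypeScript. The signature of the real `replaceOneImmutable` is not shown in this PR description, so the parameters here (a predicate plus a replacement) are an assumption.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def replace_one_immutable(items: List[T], matches: Callable[[T], bool], replacement: T) -> List[T]:
    """Return a new list with the first matching item replaced; the input is untouched."""
    for index, item in enumerate(items):
        if matches(item):
            # Copy around the match so the item keeps its position in the list.
            return items[:index] + [replacement] + items[index + 1:]
    # No match: return a shallow copy so callers always get a fresh list.
    return list(items)
```

Since the replaced item stays at the same index, the displayed row does not move, and returning a new list (instead of mutating) lets change detection notice the update.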

### Any related issues, documentation, discussions?
Closes apache#4064 

### Before Change video


https://github.com/user-attachments/assets/6769e32f-d7a4-4817-956d-773e97fae57e




### Proposed Change video



https://github.com/user-attachments/assets/01b4a0b1-3f56-437f-9b29-637854e3dd79



### How was this PR tested?
None.


### Was this PR authored or co-authored using generative AI tooling?
No.

---------

Co-authored-by: ali risheh <ali.risheh876@gmail.com>
### What changes were proposed in this PR?

1. **Centralize and extend `AttributeType` operations**

   Move and refactor the existing attribute-type helpers into
   `AttributeTypeUtils`:

   * `compare`, `add`, `zeroValue`, `minValue`, `maxValue`.
   * Unify null-handling semantics across these operations (using
     match-case instead of if + match).

   Extend support to additional types:

   * Add comparison/aggregation support for `BOOLEAN`, `STRING`, and
     `BINARY`.

   Change numeric coercion strategy:

   * Coerce numeric values to `Number` instead of a specific primitive type
     (e.g., `Double`) to reduce `ClassCastException`s when the input is not
     strictly schema-validated.
   * Preserve existing comparison semantics for doubles by delegating to
     `java.lang.Double.compare` (including handling of ±∞ and `NaN`).

   Introduce "identity" helpers:

   * `zeroValue` returns an additive identity for numeric/timestamp types,
     and `Array.emptyByteArray` for `BINARY` as a safe, non-throwing
     identity.
   * `minValue` / `maxValue` provide lower/upper bounds for supported
     numeric and timestamp types.

2. **Refactor operators to reuse `AttributeTypeUtils`**

* `AggregationOperation`: implement `SUM` / `MIN` / `MAX` using the
centralized helpers instead of custom per-operator logic.
* `StableMergeSortOpExec`: reuse the typed compare logic from
`AttributeTypeUtils`.
* `SortPartitionsOpExec`: simplify to use a one-liner comparator based
on `AttributeTypeUtils.compare` (or a thin wrapper) for clarity and
reuse.

3. **Add tests**

   * `workflow-core/src/test/scala/org/apache/amber/core/tuple/AttributeTypeUtilsSpec.scala`
     * **compare**: Verifies correct null-handling and ordering for INTEGER,
       BOOLEAN, TIMESTAMP, STRING, and BINARY values.
     * **add**: Ensures `null` acts as identity and confirms correct addition
       for INTEGER, LONG, DOUBLE, and TIMESTAMP.
     * **zeroValue**: Checks that numeric/timestamp zero identities and the
       empty binary array for BINARY are returned, and that unsupported types
       (e.g., STRING) throw.
     * **minValue / maxValue**: Validates correct numeric and timestamp
       bounds, the BINARY minimum, and exceptions for unsupported types (e.g.,
       BOOLEAN, STRING).
   * `workflow-operator/src/test/scala/org/apache/amber/operator/aggregate/AggregateOpSpec.scala`
     * Verifies `getAggregationAttribute` chooses the correct result type for
       different functions (SUM keeps input type, COUNT → INTEGER, CONCAT →
       STRING).
     * Checks `getAggFunc` SUM behavior for INTEGER and DOUBLE columns,
       ensuring correct totals and preserved fractional values.
     * Tests COUNT, CONCAT, MIN, MAX, and AVERAGE aggregations, including
       correct handling of `null` values and edge cases like "no rows".
     * Confirms `getFinal` rewrites COUNT into a SUM on the intermediate
       count column and rewires attributes correctly for non-COUNT functions.
     * Exercises `AggregateOpExec` end-to-end: SUM grouped by a key (city)
       and combined global SUM+COUNT with no group-by keys, validating the
       produced tuples.


4. **Scope / non-goals / Extras**

   * No change to external APIs.
   * Main behavior changes are localized to `AttributeType` operations and
     the operators that consume them.
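
The null-handling rules for `compare` and `add` described above can be sketched as follows. This is a Python illustration, not the Scala code in `AttributeTypeUtils`. The description does not say where nulls sort, so placing `None` before every value is an assumption; the NaN handling follows the documented behavior of `java.lang.Double.compare`.

```python
import math
from typing import Optional


def compare(left: Optional[float], right: Optional[float]) -> int:
    """Three-way compare. Assumption: None sorts before any value."""
    if left is None and right is None:
        return 0
    if left is None:
        return -1
    if right is None:
        return 1
    # Mirror java.lang.Double.compare: NaN equals NaN and is greater than
    # everything else, including +infinity. (The -0.0 < 0.0 distinction that
    # Double.compare also makes is omitted in this sketch.)
    left_nan, right_nan = math.isnan(left), math.isnan(right)
    if left_nan or right_nan:
        return int(left_nan) - int(right_nan)
    return (left > right) - (left < right)


def add(left: Optional[float], right: Optional[float]) -> Optional[float]:
    """Add two values, treating None as the identity element."""
    if left is None:
        return right
    if right is None:
        return left
    return left + right
```

Treating `None` as the identity in `add` means a SUM over a column with missing values simply skips them, and a SUM over only missing values stays `None`.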

---

**Any related issues, documentation, discussions?**

* Closes: apache#3923

**How was this PR tested?**

Workflow Image:
<img width="1684" height="859" alt="image"
src="https://github.com/user-attachments/assets/2682ebdc-0f45-40c6-b304-0cea0b76b44f"
/>

Workflow file: 

[agg_test_1.json](https://github.com/user-attachments/files/23540242/agg_test_1.json)

Python benchmark:

```python
import pandas as pd

df = pd.read_csv("/mnt/data/test.csv")

# Limit BEFORE sorting
df_limited = df.head(1000)

# Now sort ascending
df_sorted = df_limited.sort_values("rna_umis", ascending=True)

# Group by pass_all_filters with aggregations
agg = df_sorted.groupby("pass_all_filters")["rna_umis"].agg(
    min="min", max="max", count="count", avg="mean", sum="sum"
).reset_index()

agg

```
Python Result:
<img width="928" height="188" alt="image"
src="https://github.com/user-attachments/assets/69da33cd-ada4-4b05-a3f9-ae139f8575b9"
/>

Texera Result (Avg):

| pass_all_filters | min | max | count | avg | sum |
| -- | -- | -- | -- | -- | -- |
| False | 0 | 80926 | 240 | 15987.68 | 3837043 |
| True | 11893 | 102559 | 760 | 35557.93 | 27024027 |

For timestamps test:
- 1970-01-01T00:00:00Z
- 2000-02-29T12:00:00Z
- 2024-12-31T23:59:59Z


1. Avg:

- New version: 909835199750
- Previous version: 909835199750

2. Sum:

- New version: 2055-03-01T05:59:59.000Z (UTC)
- Previous version: 2055-03-01T11:59:59.000Z (UTC-6; Mexico City Time)

**Was this PR authored or co-authored using generative AI tooling?**

* Co-authored with ChatGPT.
### What changes were proposed in this PR?

This PR updates all Texera service images in the single-node
`docker-compose.yml` to use the Apache registry with `latest` tags,
aligning with the naming convention established in the CI/CD workflow
(apache#4055).

The following image references have been updated:
- `texera/file-service:single-node-release-1-0-0` →
`apache/texera-file-service:latest`
- `texera/workflow-compiling-service:single-node-release-1-0-0` →
`apache/texera-workflow-compiling-service:latest`
- `texera/computing-unit-master:single-node-release-1-0-0` →
`apache/texera-workflow-execution-coordinator:latest`
- `texera/texera-web-application:single-node-release-1-0-0` →
`apache/texera-dashboard-service:latest`
- `texera/texera-example-data-loader:single-node-release-1-0-0` →
`apache/texera-example-data-loader:latest`

This change ensures that the docker-compose configuration uses the
correct image names and registry that are now being built and pushed by
the GitHub Actions workflow.

### Any related issues, documentation, discussions?

Related to apache#4055 which introduced the GitHub Actions workflow for
building and pushing images to the Apache registry.

### How was this PR tested?

This PR only updates image references in the docker-compose.yml
configuration file. No code changes were made.

### Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude (Anthropic)

Co-authored-by: Claude <noreply@anthropic.com>
…apache#4067)

### What changes were proposed in this PR?
This PR introduces a new attribute type, `big_object`, that lets Java
operators pass data larger than 2 GB to downstream operators. Instead of
storing large data directly in the tuple, the data is uploaded to MinIO,
and the tuple stores a pointer to that object. Future PRs will add
support for Python and R UDF operators.

#### Main changes:
1. MinIO
- Added a new bucket: `texera-big-objects`.
- Implemented multipart upload (separate from LakeFS) to efficiently
handle large uploads
2. BigObjectManager (Internal Java API)
- `create()` → Generates a unique S3 URI, registers it in the database,
and returns the URI string
- `deleteAllObjects()` → Deletes all big objects from S3 (Please check
the Note section below)
3. Streaming I/O Classes
- `BigObjectOutputStream`: Streams data to S3 using background multipart
upload
- `BigObjectInputStream`: Lazily streams data from S3 when reading
4. Iceberg Integration
- BigObject pointers are stored as strings in Iceberg
- A magic suffix is added to attribute names to differentiate them from
normal strings

####  User API
##### Creating and Writing a BigObject:
```java
// In an OperatorExecutor
BigObject bigObject = new BigObject();
try (BigObjectOutputStream out = new BigObjectOutputStream(bigObject)) {
    out.write(myLargeDataBytes);
    // or: out.write(byteArray, offset, length);
}
// bigObject is now ready to be added to tuples
```

##### Reading a BigObject:
```java
// Option 1: Read all data at once
try (BigObjectInputStream in = new BigObjectInputStream(bigObject)) {
    byte[] allData = in.readAllBytes();
    // ... process data
}

// Option 2: Read a specific amount
try (BigObjectInputStream in = new BigObjectInputStream(bigObject)) {
    byte[] chunk = in.readNBytes(1024); // Read 1KB
    // ... process chunk
}

// Option 3: Use as a standard InputStream
try (BigObjectInputStream in = new BigObjectInputStream(bigObject)) {
    int bytesRead = in.read(buffer, offset, length);
    // ... process data
}
```

#### Note
This PR does NOT handle lifecycle management for big objects. For now,
when a workflow or workflow execution is deleted, all related big
objects in S3 are deleted immediately. We will add proper lifecycle
management in a future update.

#### System Diagram
<img width="3444" height="2684" alt="BigObject-Page-1 drawio (4)"
src="https://github.com/user-attachments/assets/98eded06-03b2-41be-b50b-0520a654ddca"
/>


### Any related issues, documentation, discussions?
Related to apache#3787. 


### How was this PR tested?

Tested by running this workflow multiple times and checking the MinIO
dashboard to confirm that three big objects are created and deleted. Set
the file scan operator's property to use any file larger than 2 GB.
[Big Object Java
UDF.json](https://github.com/user-attachments/files/23666312/Big.Object.Java.UDF.json)


### Was this PR authored or co-authored using generative AI tooling?
Yes.

---------

Signed-off-by: Chris <143021053+kunwp1@users.noreply.github.com>
Please see this
[wiki page](https://github.com/apache/texera/wiki/Guide-to-enable-the-LLM%E2%80%90based-Texera-copilot)
to learn how to enable this feature.

### What changes were proposed in this PR?

This PR introduces the LLM agent management & chat panel on the workflow
workspace to help users with their workflows.

#### Demo
1. Manage agent using the panel
![2025-11-08 14 59
31](https://github.com/user-attachments/assets/75baf11d-e351-47b8-b676-b59e0e3b0db0)

2. Ask agent questions regarding available Texera operators
![2025-11-08 15 00
38](https://github.com/user-attachments/assets/4875efd2-4c87-42c8-91e0-5bb3a23c190a)

3. Ask agent about users' current workflow

![2025-11-08 15 02
05](https://github.com/user-attachments/assets/c8e57bbb-e93f-445e-951b-266e8ff7f3b0)


#### Architecture Diagram
See apache#4034 

#### Major Changes
1. Frontend: introduce the agent management & chat panel

2. Backend:
- A new microservice, `litellm`, is introduced. It is an open-source
service that manages the communication between the app and LLM APIs.
- `AccessControlService` is modified to add the logic for routing
`litellm`-related requests.

### Any related issues, documentation, discussions?
Related to apache#4034 

#### Current PR limitation and future PR plans
In the current PR, the agent can only act in a "read-only" way: it can
answer questions regarding operators, but it cannot change the user's
workflow.

In future PRs:
- The agent will be able to edit the user's workflow.
- The agent feature will be added to the k8s deployment architecture.

### How was this PR tested?
Frontend unit test cases are added.

To test the PR e2e:
1. Launch litellm by following the instructions in
`bin/litellm-config.yaml`.
2. Launch `AccessControlService`.
3. All set! You can now test the agent in the workflow workspace.

### Was this PR authored or co-authored using generative AI tooling?

The code content is co-authored with Claude code. This PR is not
generated by generative AI.

---------

Co-authored-by: Xinyuan Lin <xinyual3@uci.edu>
Co-authored-by: Claude <noreply@anthropic.com>
…ry (apache#4072)

### What changes were proposed in this PR?

This PR updates all Texera service images in the Kubernetes Helm chart
(`bin/k8s/values.yaml`) to use the Apache registry with `latest` tags,
aligning with the naming convention established in the CI/CD workflow
(apache#4055).

The following image references have been updated:
- `texera/texera-example-data-loader:cluster` →
`apache/texera-example-data-loader:latest`
- `texera/texera-web-application:cluster` →
`apache/texera-dashboard-service:latest`
- `texera/workflow-computing-unit-managing-service:cluster` →
`apache/texera-workflow-computing-unit-managing-service:latest`
- `texera/workflow-compiling-service:cluster` →
`apache/texera-workflow-compiling-service:latest`
- `texera/file-service:cluster` → `apache/texera-file-service:latest`
- `texera/config-service:cluster` →
`apache/texera-config-service:latest`
- `texera/access-control-service:cluster` →
`apache/texera-access-control-service:latest`
- `texera/computing-unit-master:cluster` →
`apache/texera-workflow-execution-coordinator:latest`

This ensures that the Kubernetes Helm chart uses the correct image names
and registry that are now being built and pushed by the GitHub Actions
workflow.

### Any related issues, documentation, discussions?

Related to apache#4055 which introduced the GitHub Actions workflow for
building and pushing images to the Apache registry.

### How was this PR tested?

This PR only updates image references in the Kubernetes Helm chart
configuration file. No code changes were made.

### Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude (Anthropic)

---------

Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Chen Li <chenli@gmail.com>
### What changes were proposed in this PR?
Move the `transformers` dependency from `requirements.txt` to
`operator-requirements.txt`.

### Any related issues, documentation, discussions?
The dependency was introduced in apache#2600 to support Hugging Face
operators. It should have been a dependency of the specific operator,
not of pyamber.
- apache#2600

This blocks apache#4088 

### How was this PR tested?
Existing tests.

### Was this PR authored or co-authored using generative AI tooling?
No
### What changes were proposed in this PR?

Pin external GitHub Actions

### Any related issues, documentation, discussions?

Per https://infra.apache.org/github-actions-policy.html

### Was this PR authored or co-authored using generative AI tooling?

No

### What changes were proposed in this PR?
Bump `transformers` from 4.53.0 to 4.57.3 to support Hugging Face
operators.

### Any related issues, documentation, discussions?
Resolves apache#4091 by updating the `transformers` dependency to support
Hugging Face operators.

### How was this PR tested?
Tested by running the Hugging Face operators in Texera and verifying
that the models load and run successfully (see screenshot below).
<img width="453" height="295" alt="image"
src="https://github.com/user-attachments/assets/208d9721-24a2-4da9-9488-81da5ad3219a"
/>


### Was this PR authored or co-authored using generative AI tooling?
No.
### What changes were proposed in this PR?
Bump `pandas` version to 2.2.3 to be [compatible with Python
3.13](https://pandas.pydata.org/pandas-docs/stable/whatsnew/v2.2.3.html#pandas-2-2-3-is-now-compatible-with-python-3-13).

### Any related issues, documentation, discussions?

Resolves apache#4095 

### How was this PR tested?
CI

### Was this PR authored or co-authored using generative AI tooling?
No

Signed-off-by: Yicong Huang <17627829+Yicong-Huang@users.noreply.github.com>
### What changes were proposed in this PR?
Bump numpy version to 2.1.0 to be [compatible with Python
3.13](https://numpy.org/news/#numpy-210-released).

### Any related issues, documentation, discussions?
Closes apache#4097 

### How was this PR tested?
CI

### Was this PR authored or co-authored using generative AI tooling?
No

---------

Signed-off-by: Yicong Huang <17627829+Yicong-Huang@users.noreply.github.com>
…artifacts (apache#4076)


### What changes were proposed in this PR?
This PR adds a CI file for uploading the release artifacts to
[dist.apache.org](https://dist.apache.org/repos/dist/dev/incubator/texera/).

Here are the secrets needed to be set:

| Secret | Purpose |
|-----------------|-----------------------------------------------|
| GPG_PRIVATE_KEY | The GPG private key used to sign the release tarball. Imported via `gpg --import` to create the `.asc` signature file. |
| GPG_PASSPHRASE | Passphrase for the GPG private key. Used with `--passphrase-fd` to unlock the key during signing. |
| SVN_USERNAME | Apache SVN username for committing artifacts to dist.apache.org. Used to authenticate with the ASF distribution repository. |
| SVN_PASSWORD | Apache SVN password. Paired with SVN_USERNAME to push release artifacts to the staging directory (dist/dev/incubator/texera/). |

### Any related issues, documentation, discussions?
Closes apache#4081


### How was this PR tested?

This PR was tested manually using GitHub Actions on my own fork. See:
https://github.com/bobbai00/texera/actions/runs/19608186790


### Was this PR authored or co-authored using generative AI tooling?
Yes, co-authored with Claude code

---------

Co-authored-by: Claude <noreply@anthropic.com>
### What changes were proposed in this PR?

This PR refactors the package structure by moving all Amber engine code
from `org.apache.amber` to `org.apache.texera.amber`. This aligns the
package naming with the Texera project organization and ensures all
components are properly namespaced under the Apache Texera organization.

**Key Changes:**

1. **Directory Structure Migration** - Moved all source directories:
   - Scala/Java sources: 8 modules moved
   - Protobuf definitions: 14 files moved
   - Python proto generated code: moved under new namespace
   - Frontend TypeScript proto: moved under new namespace

2. **Code Updates** - Updated across 707 files:
   - Package declarations in 576 Scala/Java files
   - Import statements across all Scala/Java files
   - 57 Python files updated for new proto imports
   - 14 Protobuf files updated with new Java package
   - 2 TypeScript files updated with new import paths
   - Configuration files (cluster.conf)
   - String literals containing class names for reflection/dynamic loading

3. **Package Namespace Changes:**
   ```diff
   - org.apache.amber.engine.common
   - org.apache.amber.operator.*
   - org.apache.amber.core.*
   - org.apache.amber.compiler.*
   
   + org.apache.texera.amber.engine.common
   + org.apache.texera.amber.operator.*
   + org.apache.texera.amber.core.*
   + org.apache.texera.amber.compiler.*
   ```
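
The namespace change above amounts to a prefix rewrite. As a rough sketch only (this is not the script used for the PR, and the helper name and sample lines are made up for illustration), the rewrite could be expressed in Python like this:

```python
import re

# Matches the old root package as a whole dotted prefix.
# OLD_PACKAGE, NEW_PACKAGE and the helper below are hypothetical names.
OLD_PACKAGE = re.compile(r"\borg\.apache\.amber\b")
NEW_PACKAGE = "org.apache.texera.amber"


def migrate_package_references(source: str) -> str:
    """Rewrite old-namespace references; already-migrated text is left unchanged."""
    return OLD_PACKAGE.sub(NEW_PACKAGE, source)


# Sample input lines are illustrative, not taken from the code base.
before = (
    "package org.apache.amber.engine.common\n"
    "import org.apache.amber.operator.SomeOperator\n"
)
after = migrate_package_references(before)
print(after)
```

Because `org.apache.texera.amber` does not contain the old prefix, running the helper a second time leaves the text unchanged, which makes a rewrite of this kind safe to repeat.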

### Any related issues, documentation, discussions?

Closes apache#4003

### How was this PR tested?
CI

### Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Sonnet 4.5 (Cursor IDE)
…4087)


### What changes were proposed in this PR?
This PR adds pre-configured IntelliJ run configurations for:
- launching all 8 backend microservices,
- the frontend service,
- and lakeFS via Docker Compose.

With these changes, developers can now launch the backend services,
lakeFS, and frontend directly from IntelliJ’s run menu, eliminating the
need to manually locate and configure each relevant class or compose
file. This leverages IntelliJ’s built-in Compound and individual run
configurations, so no additional plugins are required.


https://github.com/user-attachments/assets/9ef8fb13-2dc3-4598-ba44-0540d37202db



### Any related issues, documentation, discussions?
Fixes apache#4045

### How was this PR tested?
Verified on a local IntelliJ IDEA environment. The Compound run config
cleanly launches all backend microservices in parallel.

### Was this PR authored or co-authored using generative AI tooling?
No

---------

Co-authored-by: Xinyuan Lin <xinyual3@uci.edu>
Co-authored-by: Chen Li <chenli@gmail.com>
…est architecture (apache#4077)

### What changes were proposed in this PR?

This PR improves the single-node docker-compose configuration with the
following changes:


1. **Added microservices**:
   - `config-service` (port 9094): Provides endpoints for configuration
     management
   - `access-control-service` (port 9096): Handles user permissions and
     access control
   - `workflow-computing-unit-managing-service` (port 8888): Provides
     endpoints for managing computing units
   - All services are added with proper health checks and dependencies on
     postgres
   - Nginx reverse proxy routes are configured for `/api/config` and
     `/api/computing-unit`

2. **Removed outdated environment variables** from `.env`:
   - `USER_SYS_ENABLED=true`
   - `STORAGE_ICEBERG_CATALOG_TYPE=postgres`

3. **Removed unused example data loader**: the example data will be
loaded by other means and no longer through a container.

### Any related issues, documentation, discussions?

Closes apache#4083 

### How was this PR tested?

docker-compose tested locally.

### Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Code (claude-opus-4-5-20250101)

---------

Co-authored-by: Claude <noreply@anthropic.com>
Bumps [pg8000](https://github.com/tlocke/pg8000) from 1.31.2 to 1.31.5.
<details>
<summary>Commits</summary>
<ul>
<li>See full diff in <a
href="https://github.com/tlocke/pg8000/commits">compare view</a></li>
</ul>
</details>
<br />


[![Dependabot compatibility
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=pg8000&package-manager=pip&previous-version=1.31.2&new-version=1.31.5)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Xiaozhen Liu <xiaozl3@uci.edu>
### What changes were proposed in this PR?
Add a configuration option to automatically shorten file paths for
Windows users when the original path exceeds the system’s maximum
length.

After this PR, Windows users should not see this error anymore.

<img width="612" height="157" alt="image"
src="https://github.com/user-attachments/assets/73a23ef2-0fad-4f2f-bc99-c7f2e576a4d9"
/>


### Any related issues, documentation, discussions?
Follow-up of PR apache#4087


### How was this PR tested?
Tested manually.


### Was this PR authored or co-authored using generative AI tooling?
No
### What changes were proposed in this PR?
Removed official support for R-UDF. The frontend is unchanged, but
during execution the user will receive an error stating that R-UDF is
not officially supported. We plan to move R-UDF to a third-party hosted
repo so that users can install R-UDF support as a plugin.

### Any related issues, documentation, discussions?
This change was made because the R-UDF runtime requires `rpy2`, which is
not Apache-license friendly.
Resolves apache#4084

### How was this PR tested?
Added test suite `TestExecutorManager`.

### Was this PR authored or co-authored using generative AI tooling?
Tests generated by Cursor.

---------

Co-authored-by: Yicong Huang <yicong.huang+data@databricks.com>
Co-authored-by: Chen Li <chenli@gmail.com>

### What changes were proposed in this PR?
1. Replace flake8 and black with Ruff in CI.
2. Format existing code using Ruff

Basic Ruff commands (run from `amber/src/main/python`):
- Change to the Python source directory: `cd amber/src/main/python`
- Run Ruff’s formatter in dry mode: `ruff format --check .`
- Run Ruff’s formatter: `ruff format .`
- Run Ruff’s linter: `ruff check .`

### Any related issues, documentation, discussions?
Closes apache#4078

### How was this PR tested?
I created a PR on my own fork to ensure CI is working.

### Was this PR authored or co-authored using generative AI tooling?
No

---------

Co-authored-by: Xinyuan Lin <xinyual3@uci.edu>
### What changes were proposed in this PR?

This PR bumps the project version from `1.0.0` to `1.1.0-incubating`
across all relevant configuration files:

- **`build.sbt`**: Updated `version := "1.0.0"` to `version :=
"1.1.0-incubating"`
- **`bin/single-node/docker-compose.yml`**:
  - Updated project name from `texera-single-node-release-1-0-0` to
    `texera-single-node-release-1-1-0-incubating`
  - Updated network name from `texera-single-node-release-1-0-0` to
    `texera-single-node-release-1-1-0-incubating`
  - Updated all 7 Texera service image tags from `:latest` to
    `:1.1.0-incubating`
  - Updated the R operator comment reference
- **`bin/k8s/values.yaml`**: Updated all 8 Texera service image tags
from `:latest` to `:1.1.0-incubating`

### Any related issues, documentation, discussions?

Closes apache#4082

### How was this PR tested?

This is a configuration-only change.

### Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Claude Opus 4.5)

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>

### What changes were proposed in this PR?

This PR renames the `BigObject` type to `LargeBinary`. The original
feature was introduced in apache#4067, but we decided to adopt the
`LargeBinary` terminology to align with naming conventions used in other
systems (e.g., Arrow).

This change is purely a renaming/terminology update and does not modify
the underlying functionality.


### Any related issues, documentation, discussions?
apache#4100 (comment)


### How was this PR tested?
Run this workflow, check that it completes successfully, and verify that
three objects are created in the MinIO console.
[Java
UDF.json](https://github.com/user-attachments/files/23976766/Java.UDF.json)



### Was this PR authored or co-authored using generative AI tooling?

No.

---------

Signed-off-by: Chris <143021053+kunwp1@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…4124)

### What changes were proposed in this PR?

This PR removes the `WITH_R_SUPPORT` build argument and all R-related
installation logic from the Docker build configuration:

1. **Dockerfiles** (`computing-unit-master.dockerfile` and
`computing-unit-worker.dockerfile`):
   - Removed `ARG WITH_R_SUPPORT` build argument
   - Removed conditional R runtime dependencies installation
   - Removed R compilation and installation steps (R 4.3.3)
   - Removed R packages installation (arrow, coro, dplyr)
   - Removed `LD_LIBRARY_PATH` environment variable for R libraries
   - Removed `r-requirements.txt` copy in worker dockerfile
   - Simplified to Python-only dependencies

2. **GitHub Actions Workflow**
(`.github/workflows/build-and-push-images.yml`):
   - Removed `with_r_support` workflow input parameter
   - Removed `with_r_support` from job outputs and parameter passing
   - Removed `WITH_R_SUPPORT` build args from both AMD64 and ARM64 build
     steps
   - Removed R Support from build summary

### Any related issues, documentation, discussions?

Related to apache#4090

### How was this PR tested?

Verified that the Dockerfile and CI YAML syntax is valid.

### Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Sonnet 4.5 (claude-sonnet-4-5-20250929) via Claude
Code CLI

### What changes were proposed in this PR?
This PR introduces Python support for the `large_binary` attribute type,
enabling Python UDF operators to process data larger than 2 GB. Data is
offloaded to MinIO (S3), and the tuple retains only a pointer (URI).
This mirrors the existing Java LargeBinary implementation, ensuring
cross-language compatibility. (See apache#4067 for system diagram and apache#4111
for renaming)

## Key Features

### 1. MinIO/S3 Integration
- Utilizes the shared `texera-large-binaries` bucket.
- Implements lazy initialization of S3 clients and automatic bucket
creation.

### 2. Streaming I/O
- **`LargeBinaryOutputStream`:** Writes data to S3 using multipart
uploads (64KB chunks) to prevent blocking the main execution.
- **`LargeBinaryInputStream`:** Lazily downloads data only when the read
operation begins. Implements standard Python `io.IOBase`.
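
The chunked write path can be pictured with a small stand-in. This is only a sketch under stated assumptions, not the actual `LargeBinaryOutputStream`: the class name `ChunkedUploadStream` and the `upload_part` callback are hypothetical, and an in-memory list replaces the S3 multipart upload so the example runs without `boto3`.

```python
import io

CHUNK_SIZE = 64 * 1024  # 64KB, the chunk size mentioned above


class ChunkedUploadStream(io.RawIOBase):
    """Hypothetical stand-in: buffers writes and emits fixed-size parts."""

    def __init__(self, upload_part, chunk_size=CHUNK_SIZE):
        super().__init__()
        self._upload_part = upload_part  # callable that receives each part as bytes
        self._chunk_size = chunk_size
        self._buffer = bytearray()

    def writable(self):
        return True

    def write(self, data):
        if self.closed:
            raise ValueError("write to closed stream")
        self._buffer += data
        # Emit every full chunk and keep the remainder buffered.
        while len(self._buffer) >= self._chunk_size:
            self._upload_part(bytes(self._buffer[: self._chunk_size]))
            del self._buffer[: self._chunk_size]
        return len(data)

    def close(self):
        if not self.closed:
            # Flush the final, possibly short, part before closing.
            if self._buffer:
                self._upload_part(bytes(self._buffer))
                self._buffer.clear()
            super().close()


parts = []  # stands in for the remote object store
with ChunkedUploadStream(parts.append) as out:
    out.write(b"a" * (150 * 1024))
print([len(p) for p in parts])
```

In this sketch the final short part is only emitted on `close()`, which is why the stream is used as a context manager.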

### 3. Tuple & Iceberg Compatibility
- `largebinary` instances are automatically serialized to URI strings
for Iceberg storage and Arrow tables.
- Uses a magic suffix (`__texera_large_binary_ptr`) to distinguish
pointers from standard strings.

### 4. Serialization
- Pointers are stored as strings with metadata (`texera_type:
LARGE_BINARY`). Auto-conversion ensures UDFs always see `largebinary`
instances, not raw strings.
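
One way to picture the pointer round trip described above is the sketch below. It is an illustration under assumptions, not the pytexera implementation: `BinaryPointer`, `serialize`, `deserialize`, and the object path in the URI are hypothetical; only the metadata entry `texera_type: LARGE_BINARY` and the bucket name come from this description.

```python
from dataclasses import dataclass

# Metadata key and value quoted from the description above.
TYPE_KEY = "texera_type"
LARGE_BINARY = "LARGE_BINARY"


@dataclass(frozen=True)
class BinaryPointer:
    """Hypothetical stand-in for a `largebinary` handle: it only holds a URI."""

    uri: str


def serialize(value):
    """Return (stored_value, field_metadata) for one field value."""
    if isinstance(value, BinaryPointer):
        return value.uri, {TYPE_KEY: LARGE_BINARY}
    return value, {}


def deserialize(stored, metadata):
    """Rebuild a pointer when the metadata marks the string as one."""
    if metadata.get(TYPE_KEY) == LARGE_BINARY:
        return BinaryPointer(stored)
    return stored


# The object path below is made up for the example.
pointer = BinaryPointer("s3://texera-large-binaries/objects/example")
stored, meta = serialize(pointer)
restored = deserialize(stored, meta)
plain = deserialize(*serialize("just a string"))
print(stored, meta, restored, plain)
```

Keeping the type marker in field metadata rather than in the string itself means an ordinary string that happens to look like a URI is never converted by mistake.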

## User API Usage

### 1. Creating & Writing (Output)
Use `LargeBinaryOutputStream` to stream large data into a new object.

```python
from pytexera import largebinary, LargeBinaryOutputStream

# Create a new handle
large_binary = largebinary()

# Stream data to S3
with LargeBinaryOutputStream(large_binary) as out:
    out.write(my_large_data_bytes)
    # Supports bytearray, bytes, etc.
```

### 2. Reading (Input)
Use `LargeBinaryInputStream` to read data back. It supports all standard
Python stream methods.

```python
from pytexera import LargeBinaryInputStream

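# `large_binary` is an existing largebinary handle (e.g. the one written above).
with LargeBinaryInputStream(large_binary) as stream:
    # The options below are alternatives: each one consumes the stream,
    # so pick the one that fits rather than running them in sequence.

    # Option A: Read everything
    all_data = stream.read()

    # Option B: Chunked reading
    chunk = stream.read(1024)

    # Option C: Iteration (`process` is a placeholder for user code)
    for line in stream:
        process(line)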
```

## Dependencies
- `boto3`: Required for S3 interactions.
- `StorageConfig`: Uses existing configuration for
endpoints/credentials.

## Future Direction
- Support for R UDF Operators
- Check apache#4123


### Any related issues, documentation, discussions?
Design: apache#3787

### How was this PR tested?
Tested by running this workflow multiple times and checking the MinIO
dashboard to confirm that six objects are created and deleted. Set the
file scan operator's file property to any file larger than 2 GB.
[Large Binary
Python.json](https://github.com/user-attachments/files/24062982/Large.Binary.Python.json)

### Was this PR authored or co-authored using generative AI tooling?
No.

---------

Signed-off-by: Chris <143021053+kunwp1@users.noreply.github.com>
bobbai00 and others added 4 commits May 1, 2026 22:55
…condition (apache#4615)

### What changes were proposed in this PR?

This PR fixes a race in `SyncExecutionResource.allTargetsCompleted` that
causes the sync execution API (`POST /api/execution/{wid}/{cuid}/run`)
to terminate before a HashJoin's probe phase produces output, returning
an empty result.

**Root cause.** `HashJoinOpDesc.getPhysicalPlan` produces two
PhysicalOps (`build`, `probe`) sharing one logical id, separated by a
blocking edge. The scheduler places them in two regions and runs them
sequentially. `WorkflowExecution.getAllRegionExecutionsStats` aggregates
per-logical-op state by `groupBy(_._1.logicalOpId.id)` over only the
*registered* `RegionExecution`s. Between "build region completed" and
"probe region instantiated," only the build PhysicalOp is registered, so
`aggregateStates(Iterable(COMPLETED))` returns `COMPLETED`. The sync
resource then takes the `TargetResultsReady` branch, calls
`killExecution`, and reads the probe's still-empty Iceberg output. The
same shape applies to any logical operator whose physical plan contains
multiple PhysicalOps separated by a blocking edge (e.g., `Aggregate`).
It does not surface in the regular WebSocket-driven frontend execution
because the frontend waits for full workflow termination.
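
The race described above can be reproduced in miniature. The sketch below is a simplified model, not Texera code: `aggregate_states` is a stand-in that reports `COMPLETED` only when every observed state is `COMPLETED`, and the tuples are hypothetical `(logical id, physical op, state)` records for registered regions only.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

COMPLETED = "COMPLETED"
RUNNING = "RUNNING"


def aggregate_states(states: List[str]) -> str:
    # Simplified stand-in: COMPLETED only if every *observed* state is COMPLETED.
    return COMPLETED if all(s == COMPLETED for s in states) else RUNNING


def logical_op_states(registered: Iterable[Tuple[str, str, str]]) -> Dict[str, str]:
    """Group registered physical ops by logical id and aggregate their states."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for logical_id, _physical_op, state in registered:
        grouped[logical_id].append(state)
    return {lid: aggregate_states(states) for lid, states in grouped.items()}


# Snapshot taken after the build region finished but before the probe
# region was instantiated: only the build op is visible.
between_regions = [("HashJoin", "build", COMPLETED)]
# Snapshot once the probe region has been registered.
probe_registered = between_regions + [("HashJoin", "probe", RUNNING)]
```

In this model `logical_op_states(between_regions)` reports the join as `COMPLETED` even though the probe has not started, which is the window the sync resource falls into.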

**Fix.** Strengthen `allTargetsCompleted` to require, in addition to
`operatorState == COMPLETED`, that every declared external input port of
the target is already present in
`OperatorMetrics.operatorStatistics.inputMetrics`. Port-1 metrics only
appear after the probe actually consumes data, which closes the race
window. Internal ports (e.g., HashJoin's build→probe internal edge) are
filtered out on both sides of the comparison so the predicate matches
what `aggregateMetrics` already exposes. Source operators (zero declared
inputs) and single-input operators are unaffected; for empty-input edge
cases, `terminalStateObservable` continues to provide the fallback
signal.

```scala
val targetExpectedExternalInputs: Map[String, Int] = effectiveLogicalPlan.operators
  .filter(op => request.targetOperatorIds.contains(op.operatorIdentifier.id))
  .map(op =>
    op.operatorIdentifier.id -> op.operatorInfo.inputPorts.count(!_.id.internal)
  )
  .toMap

def allTargetsCompleted(stats: ExecutionStatsStore): Boolean = {
  request.targetOperatorIds.nonEmpty && request.targetOperatorIds.forall { opId =>
    stats.operatorInfo.get(opId).exists { metrics =>
      val externalInputPortsReporting =
        metrics.operatorStatistics.inputMetrics.count(!_.portId.internal)
      val expectedExternalInputs = targetExpectedExternalInputs.getOrElse(opId, 0)
      metrics.operatorState == COMPLETED &&
      externalInputPortsReporting >= expectedExternalInputs
    }
  }
}
```

### Any related issues, documentation, discussions?

Closes apache#4576

### How was this PR tested?

Manually reproduced and verified end-to-end against
`ComputingUnitMaster` on port 8085 with a 3-operator DAG (CSVFileScan
movies + CSVFileScan ratings → HashJoin on `movieId`) executed via `POST
/api/execution/{wid}/{cuid}/run` with `targetOperatorIds =
[HashJoinId]`. Inputs: `movies.csv` (1000 rows) and `ratings.csv` (10
311 rows).

Steps to reproduce / verify:

```
# 1. Start the master
sbt "project WorkflowExecutionService" compile
java ... org.apache.texera.web.ComputingUnitMaster   # listens on :8085

# 2. Get a JWT
curl -s -X POST http://localhost:8080/api/auth/login \
  -H "Content-Type: application/json" \
  -d '{"username":"<user>","password":"<pw>"}'

# 3. POST the request (CSV → CSV → HashJoin, target = HashJoin)
curl -s -X POST http://localhost:8085/api/execution/<wid>/<cuid>/run \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <token>" \
  --data @sync-exec-request.json
```

Existing tests pass (`sbt "project WorkflowExecutionService" compile`
succeeds). No new unit test was added because the failure is a timing
race in the controller's region-registration sequence relative to the
sync resource's observable; reproducing it deterministically in a unit
test would require either mocking `ExecutionStatsStore` to emit a
build-only snapshot followed by a build+probe snapshot, or driving the
full controller actor system, both of which are out of scope for this
targeted fix. Manual reproduction is reliable on every run because the
race window is several hundred milliseconds wide and `Observable.amb`
consistently selects the (incorrect) target-completion signal first
prior to this fix.

### Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Claude Opus 4.7)

---------

Co-authored-by: Xinyuan Lin <xinyual3@uci.edu>
apache#4624)

### What changes were proposed in this PR?

The Build matrix jobs (`frontend`, `scala`, `python`, `agent-service`)
were duplicated between `github-action-build.yml` and
`reusable-build.yml`, and the two had drifted — `reusable-build.yml` was
missing the recent license-check additions (npm bundle check,
pip-licenses manifest, bundled-jar diff against LICENSE-binary,
agent-service license manifest). Net change: **+238 / −416** lines.

- Rename `reusable-build.yml` → `build.yml` (workflow name `Build`). It
is now the single source of truth for the matrix steps, with the
license-check additions ported in.
- Rename `github-action-build.yml` → `required-checks.yml` (workflow
name `Required Checks`). Replace the four inline matrix jobs with a
single `build:` caller that `uses: ./.github/workflows/build.yml`. The
`backport:` caller is unchanged; the `Required Checks` aggregator job's
`needs:` shrinks from `[precheck, frontend, scala, python,
agent-service, backport]` to `[precheck, build, backport]`.
- Update `direct-backport-push.yml`'s `workflow_id` reference to the new
filename.

`.asf.yaml` continues to require only `Required Checks`, so the
display-name change (matrix children gain a `build /` prefix) does not
affect branch protection.

### Any related issues, documentation, discussions?

Closes apache#4623

### How was this PR tested?

YAML parses locally for all three modified workflow files. Step parity
between the new `build.yml` and the previous inline
`github-action-build.yml` matrix jobs verified by side-by-side diff. The
job will be exercised on this PR itself; matrix children appear under
`Required Checks / build / …` and the `backport:` matrix continues to
appear under `Required Checks / backport (...) / …`.

### Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Opus 4.7)

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
### What changes were proposed in this PR?

Gate the Build workflow's main stacks on PR labels.

`required-checks.yml`'s `precheck` job now waits for the Pull Request
Labeler workflow to finish (polls the `labeler` check on the PR head
SHA, up to 5 min) and reads the resulting labels to decide which stacks
run:

| PR labels | frontend | scala | python | agent-service |
|---|---|---|---|---|
| only `docs` and/or `dev` | skip | skip | skip | skip |
| no `frontend` label | skip | run | run | run |
| includes `frontend` (or any non-skip label) | run | run | run | run |
| `push` / `workflow_dispatch` (no PR) | run | run | run | run |
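
One way to read the table is as the decision function sketched below. It illustrates the rules rather than the workflow's actual script. `decide_stacks` is a hypothetical name, and the handling of a PR with no labels at all (frontend skipped, everything else run) is an assumption the table does not cover.

```python
from typing import Dict, Iterable, Optional

STACKS = ("frontend", "scala", "python", "agent-service")
SKIP_ONLY_LABELS = {"docs", "dev"}


def decide_stacks(event: str, labels: Optional[Iterable[str]] = None) -> Dict[str, bool]:
    """Return, per stack, whether it should run for this event and label set."""
    if event != "pull_request":
        # `push` / `workflow_dispatch`: no PR labels to read, run everything.
        return {stack: True for stack in STACKS}
    label_set = set(labels or ())
    if label_set and label_set <= SKIP_ONLY_LABELS:
        # Only `docs` and/or `dev`: skip every stack.
        return {stack: False for stack in STACKS}
    decision = {stack: True for stack in STACKS}
    # The frontend stack runs only when the `frontend` label is present.
    decision["frontend"] = "frontend" in label_set
    return decision
```

For example, `decide_stacks("pull_request", ["ci"])` skips only the frontend stack, matching the self-test expectation of `run_frontend=false` with the other stacks enabled.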

`.github/labeler.yml`: rename the existing `build` label to `dev` so the
name matches the role precheck reads. The labeler applies it for
`bin/**` changes (the previous `deployment/**` glob is dropped because
that directory no longer exists).

The backport matrix inherits the same `run_*` decisions: each
`release/*` target only re-validates the stacks selected by the table
above. A docs-only PR with a `release/*` label still spawns a backport
run, but every stack inside it is skipped.

### Any related issues, documentation, discussions?

Closes apache#4621. Picks up the idea from the closed prior attempt apache#3642.

`.asf.yaml` ruleset's required check names are static; skipped stacks
now report `skipped` rather than `success`. The `Required Checks`
aggregator added in apache#4624 already treats `skipped` as a pass, so branch
protection stays green.
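
The aggregator rule amounts to accepting two job results instead of one. The sketch below assumes the standard GitHub Actions result values (`success`, `failure`, `cancelled`, `skipped`). The function name is hypothetical, and the real check lives in workflow YAML rather than Python.

```python
from typing import Dict

PASSING_RESULTS = {"success", "skipped"}


def required_checks_pass(results: Dict[str, str]) -> bool:
    """True when every needed job either succeeded or was skipped."""
    return all(result in PASSING_RESULTS for result in results.values())
```

With this rule, a docs-only PR whose `build` job reports `skipped` still satisfies the single required check.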

### How was this PR tested?

Self-test on this PR: it touches `.github/workflows/**` and
`.github/labeler.yml`, so labeler should add `ci` (and not `frontend`).
Expected precheck output: `run_frontend=false`, others `true`.
Adding/removing the `frontend` label should flip the frontend stack
on/off; replacing all labels with `docs` only (or `dev` only) should
skip every stack.

### Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Opus 4.7

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@SarahAsad23 SarahAsad23 marked this pull request as draft May 2, 2026 00:12
@github-actions github-actions Bot added the frontend Changes related to the frontend GUI label May 2, 2026
@SarahAsad23 SarahAsad23 force-pushed the pve-user-packages branch from 574c3c5 to 4ed8293 Compare May 4, 2026 05:18
@github-actions github-actions Bot added dependencies Pull requests that update a dependency file ddl-change Changes to the TexeraDB DDL python ci changes related to CI docs Changes related to documentations dev common platform Non-amber Scala service paths agent-service labels May 4, 2026
@SarahAsad23 SarahAsad23 force-pushed the pve-user-packages branch from 4ed8293 to 4238c35 Compare May 4, 2026 05:22
@SarahAsad23 SarahAsad23 closed this May 4, 2026
@SarahAsad23 SarahAsad23 removed their assignment May 4, 2026
Labels

agent-service ci changes related to CI common ddl-change Changes to the TexeraDB DDL dependencies Pull requests that update a dependency file dev docs Changes related to documentations engine frontend Changes related to the frontend GUI platform Non-amber Scala service paths python
