Running bacalhau get fails to merge results when downloading #3677

Open
frrist opened this issue Mar 21, 2024 · 7 comments

Labels
type/bug Type: Something is not working as expected

Comments

@frrist
Member

frrist commented Mar 21, 2024

Steps to reproduce:

  1. Run this job:
Name: Docker Job with Output
Type: batch
Namespace: default
Count: 2
Tasks:
  - Name: main
    Engine:
      Type: docker
      Params:
        Image: ubuntu:latest
        Entrypoint:
          - /bin/bash
        Parameters:
          - -c
          - dd if=/dev/urandom of=/output_custom/output.txt bs=1M count=1

    Publisher:
      Type: local
    ResultPaths:
      - Name: output_custom
        Path: /output_custom
  2. Run bacalhau get:
bacalhau get j-1cbabf4e-3ac9-4496-adb5-34d43248a308
Fetching results of job 'j-1cbabf4e-3ac9-4496-adb5-34d43248a308'...
error downloading job: cannot merge results as output already exists: /home/frrist/workspace/src/github.com/bacalhau-project/bacalhau/job-j-1cbabf4e/output_custom/output.txt. Try --raw to download raw results instead of merging them
  3. Observe the resulting download:
tree job-j-1cbabf4e
job-j-1cbabf4e
├── exitCode
├── output_custom
│   └── output.txt
├── raw
│   ├── 34.150.138.100_6001_e-a57e7662-07fd-41bd-9fa4-c0d0b5d14f94
│   │   ├── exitCode
│   │   ├── output_custom
│   │   │   └── output.txt
│   │   ├── stderr
│   │   └── stdout
│   ├── 34.150.138.100_6001_e-a57e7662-07fd-41bd-9fa4-c0d0b5d14f94.tgz
│   ├── 35.245.247.85_6001_e-65d12831-115f-4660-8d4f-36d16a162de4
│   │   ├── exitCode
│   │   ├── output_custom
│   │   ├── stderr
│   │   └── stdout
│   └── 35.245.247.85_6001_e-65d12831-115f-4660-8d4f-36d16a162de4.tgz
├── stderr
└── stdout

6 directories, 13 files
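
As the error message suggests, the merge step can be skipped by downloading the raw per-execution results instead:
bacalhau get --raw j-1cbabf4e-3ac9-4496-adb5-34d43248a308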
@frrist frrist added the type/bug Type: Something is not working as expected label Mar 21, 2024
@wdbaruni
Member

Yes, this is expected behaviour, as both executions are generating conflicting output.txt files under the output_custom path, and we don't know how to merge the output directory when we get conflicting file names. The error is telling the user that we can't merge results from the multiple executions, but that they can still download the raw results.
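
To illustrate the rule, here is a minimal Go sketch of the behaviour described above (not the actual bacalhau implementation; the directory names are hypothetical): copy each execution's files into one merged directory, and fail on the first relative path that more than one execution produced.

package main

import (
	"fmt"
	"io"
	"os"
	"path/filepath"
)

// mergeResults copies every file from each execution's result directory
// into mergedDir, failing on the first relative path that more than one
// execution produced.
func mergeResults(executionDirs []string, mergedDir string) error {
	for _, src := range executionDirs {
		err := filepath.Walk(src, func(path string, info os.FileInfo, walkErr error) error {
			if walkErr != nil || info.IsDir() {
				return walkErr
			}
			rel, err := filepath.Rel(src, path)
			if err != nil {
				return err
			}
			dst := filepath.Join(mergedDir, rel)
			// Conflict: another execution already wrote this file.
			if _, err := os.Stat(dst); err == nil {
				return fmt.Errorf("cannot merge results as output already exists: %s", dst)
			}
			if err := os.MkdirAll(filepath.Dir(dst), 0o755); err != nil {
				return err
			}
			in, err := os.Open(path)
			if err != nil {
				return err
			}
			defer in.Close()
			out, err := os.Create(dst)
			if err != nil {
				return err
			}
			defer out.Close()
			_, err = io.Copy(out, in)
			return err
		})
		if err != nil {
			return err
		}
	}
	return nil
}

func main() {
	// Hypothetical per-execution result directories, like those under raw/ above.
	if err := mergeResults([]string{"raw/execution-a", "raw/execution-b"}, "merged"); err != nil {
		fmt.Println(err)
	}
}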

This job works, as both executions generate randomly named files under output_custom.

Name: Docker Job with Output
Type: batch
Namespace: default
Count: 2
Tasks:
  - Name: main
    Engine:
      Type: docker
      Params:
        Image: ubuntu:latest
        Entrypoint:
          - /bin/bash
        Parameters:
          - -c
          - >
            dd if=/dev/urandom of=/output_custom/$(tr -dc A-Za-z0-9 </dev/urandom | head -c 13 ; echo '').txt bs=1M count=1
    Publisher:
      Type: local
    ResultPaths:
      - Name: output_custom
        Path: /output_custom

The result directory structure looks like:

→ tree job-j-518c8f47
job-j-518c8f47
├── exitCode
├── output_custom
│   ├── 4xpHSwz4corBo.txt
│   └── F2LWEK6Urmzin.txt
├── stderr
└── stdout

@frrist
Member Author

frrist commented Mar 21, 2024

Yes, I understand this behavior is expected given the current implementation of the code, but from a UX perspective it is unexpected, IMO. We're basically saying "you're holding it wrong" for a relatively simple task, and implying we expect users to write jobs that name all their output files differently when run on multiple nodes.

Perhaps we could make the merge behavior something users opt into rather than out of?

@aronchick
Collaborator

aronchick commented Mar 22, 2024

Agreed: we should save both simultaneously and tell the user that. E.g.

/<output name>-<short node ID>

for all nodes.
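
For the job above, that might look something like this (the short node IDs are hypothetical):

job-j-1cbabf4e/
├── output_custom-n-abc/
│   └── output.txt
└── output_custom-n-def/
    └── output.txt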

@frrist
Member Author

frrist commented Mar 22, 2024

Would something like:

job-j-1234/
  node-n-abc/
    exitcode
    stderr
    stdout
    output-name/
      ...
  node-n-def/
    exitcode
    stderr
    stdout
    output-name/
      ...

be acceptable?

@wdbaruni
Member

The problem is you won't get deterministic outputs or a deterministic structure for the result directory. But at the same time, I hate this kind of magic that works most of the time but fails at other times. Also, we are not really doing a good job of merging stdout, stderr, or exit codes.

I also wouldn't group results by node ID, as the same node might run multiple executions for the same job, such as when we have a fat node in the network. We don't do this today, but we should.

So the options could be:

  1. Group by an execution counter. If the job had count=3, then we would have 000, 001 and 002 sub-directories, and each would hold the results and logs for a specific execution (see the sketch after this list). This gives deterministic behaviour, but it's not simple to implement today, as we don't have this execution counter logic yet (though we should).
  2. Group by execution ID: much easier to implement, and it can work nicely if we allow users to download results for specific executions.
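
With option 1 and count=3, the download might look something like this (hypothetical layout):

job-j-1234/
├── 000/
│   ├── exitCode
│   ├── output_custom/
│   ├── stderr
│   └── stdout
├── 001/
│   └── ...
└── 002/
    └── ...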

@frrist
Member Author

frrist commented Apr 16, 2024

the problem is you won't get deterministic outputs or structure of the result directory

Is determinism in output still something we care about? I'm not sure I see this as a problem; it seems more like a side effect of the system.

I hate this kind of magic that works most of the time, but fails on others. Also we are not really doing a good job in merging stdout, stderr or exit code.

I hate the magic too, because it's the furthest thing from magic when it doesn't work. I don't think we should be merging outputs; it's almost always not what I want when I download the results of a job.

also I wouldn't group results by node-id as the same node might run multiple executions for the same job

Hmm, yeah, this is a fair point. On the other hand, I like seeing which result came from which node. So...

Group by executionID: much easier to implement and can work nicely if we allow users to download results for specific executions

What if we group by node and execution, that is:

Job 
  Node-1
    Execution-1
    Execution-2
  Node-2
    Execution-1
  Node-3
    Execution-1
    Execution-2
    Execution-3

I am not sure what this should look like when there is more than one task. Thoughts?

@wdbaruni
Member

We need to balance correctness and a great UX.

Is determinism in output still something we care about? I'm not sure I see this as a problem, rather a side effect of the system.

It is not a great user experience if users want to pipe the download command or do some sort of automation when the download path is not predictable. I understand they can play with wildcards, but it's still not a nice UX.

On the other hand I like seeing which result came from which node

Why do users care about the node ID when reading results? I understand it might be useful for debugging, which users can find by calling bacalhau job executions <job_id>, but it's not necessarily useful in the result directory structure.

What if we group by node and execution

Too many levels is not a great experience. We can always add a metadata file inside each path that holds some metadata about the execution, such as its full ID, node ID, start time, end time, etc.
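
For example, a hypothetical sketch of that metadata file in Go (the type and field names are illustrative, not an existing bacalhau API):

package main

import (
	"encoding/json"
	"os"
	"time"
)

// ExecutionMetadata is a hypothetical shape for the per-execution metadata
// file suggested above; the fields are illustrative.
type ExecutionMetadata struct {
	ExecutionID string    `json:"executionID"`
	NodeID      string    `json:"nodeID"`
	StartTime   time.Time `json:"startTime"`
	EndTime     time.Time `json:"endTime"`
	ExitCode    int       `json:"exitCode"`
}

func main() {
	meta := ExecutionMetadata{
		ExecutionID: "e-65d12831", // shortened example ID from this thread
		NodeID:      "n-abc",      // hypothetical node ID
		StartTime:   time.Now().Add(-time.Minute),
		EndTime:     time.Now(),
		ExitCode:    0,
	}
	data, err := json.MarshalIndent(meta, "", "  ")
	if err != nil {
		panic(err)
	}
	// Written inside each execution's sub-directory in the download.
	if err := os.WriteFile("metadata.json", data, 0o644); err != nil {
		panic(err)
	}
}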
