Delocalizing-files is not working if path ended with whitespace #259

Closed
faruqsandi opened this issue Mar 17, 2023 · 4 comments

Comments

@faruqsandi

Hello.

I ran dsub with these parameters:

dsub \
	--provider google-cls-v2 \
	--project myproject  \
	--zones "us-central1-*" \
	--image ubuntu:22.04 \
	--logging "gs://mybucket/faruq/fastp_pe_manual/logging" \
	--input INPUT1="gs://mybucket/folder/fastq/4/EXB0E0BPU4_S2_R1_001.fastq.gz" \
	--input INPUT2="gs://mybucket/folder/fastq/4/EXB0E0BPU4_S2_R2_001.fastq.gz" \
	--output-recursive "OUTPUT_DIR=gs://mybucket/faruq/fastp_pe_manual " \
	--script ./dsub_fastp.sh \
	--wait

Note that the --output-recursive path ends with a space, and of course it does not work.
It is the user's responsibility not to add extra whitespace after a path when using single or double quotes, but I think this little mistake is too easy to miss. The error is also not visible until the delocalization stage, which is when you think you are done!

It would probably be better if dsub always trimmed leading and trailing whitespace when the path is quoted.

Thank you.
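
Until something like this exists in dsub, the trimming can be done on the caller's side before the value is passed to --output-recursive. The following is only a minimal sketch: trim_path is a hypothetical helper name, not part of dsub, and it assumes a shell that supports POSIX character classes in parameter expansion patterns.

```shell
# trim_path is a hypothetical helper (not part of dsub).
# It prints its first argument with leading and trailing whitespace removed.
trim_path() {
  trimmed="$1"
  # Strip leading whitespace: remove the prefix that precedes the first non-space character.
  trimmed="${trimmed#"${trimmed%%[![:space:]]*}"}"
  # Strip trailing whitespace: remove the suffix that follows the last non-space character.
  trimmed="${trimmed%"${trimmed##*[![:space:]]}"}"
  printf '%s' "$trimmed"
}

# The path from the command above, including the accidental trailing space.
OUTPUT_DIR="$(trim_path "gs://mybucket/faruq/fastp_pe_manual ")"
echo "[$OUTPUT_DIR]"   # prints [gs://mybucket/faruq/fastp_pe_manual]
```

The cleaned value could then be passed as --output-recursive "OUTPUT_DIR=$OUTPUT_DIR". Note that this would also silently change a path whose trailing space is intentional.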

@mbookman
Contributor

Hi @faruqsandi !

Thank you for reporting the problem you ran into. When you say "it is not working", can you clarify what you observed? Was there a job failure?

We have a test for recursive outputs that does include spaces. It doesn't include spaces at the end of the path, as your test case did. However, I just tested this scenario and did get outputs. I would expect you to see outputs in your target directory if you double-quote your path (and include a trailing slash):

gsutil ls "gs://mybucket/faruq/fastp_pe_manual /"

or you could use the extended wildcard support:

gsutil ls gs://mybucket/faruq/fastp_pe_manual**

dsub is technically doing the "right" thing here in supporting passing through strings that can have whitespace (uncommon on *nix systems, more common on Windows). That said, it would be very reasonable for dsub to try to detect "problematic" characters (such as spaces) in paths and emit a warning to users.

Please let us know what you find out.

Thanks!
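
The kind of warning suggested above can be approximated in a wrapper script in the meantime. This is a sketch under stated assumptions: warn_if_whitespace is a hypothetical helper, not an existing dsub feature, and treating any whitespace as suspicious is an illustrative choice.

```shell
# warn_if_whitespace is a hypothetical helper (not an existing dsub feature).
# It returns 1 and prints a warning to stderr if the path contains any whitespace.
warn_if_whitespace() {
  case "$1" in
    *[[:space:]]*)
      # Brackets make leading or trailing whitespace visible in the message.
      echo "WARNING: path contains whitespace: [$1]" >&2
      return 1
      ;;
  esac
  return 0
}

# The path from the original report, with its trailing space.
warn_if_whitespace "gs://mybucket/faruq/fastp_pe_manual " || echo "check the quoting of this path"
```

Because the check only warns, paths that contain whitespace on purpose still pass through unchanged.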

@faruqsandi
Author

faruqsandi commented Mar 18, 2023

Hello!
Thanks for the reply!

Yes, I agree that a warning would be good, because this error (a typo) will probably happen only once in a blue moon. I also just realized that we can actually create a folder with trailing spaces on *nix. For example, mkdir "hello! there!" is possible, and we need to use cd hello\!\ \ \ \ \ \ \ \ \ \ \ \ there\!/ to get inside it.

So, while bash can handle a path that contains trailing whitespace, it is probably not possible in a GCP bucket. Anyway, I tried something that might be worth looking into.


OK, let me show you the success scenario with this whitespace_example.sh:

touch example.txt
mv example.txt $OUTPUT_DIR

using this dsub command (no double quotes, so trailing whitespace doesn't matter):

dsub \
	 --provider google-cls-v2 \
	--project xna-labs-uvaca-labs-workspace  \
	--zones "us-central1-*" \
	--logging "gs://mybucket/faruqsandi_new/dsub_test/logging" \
	--output-recursive OUTPUT_DIR=gs://mybucket/faruqsandi_new/dsub_test  \
	--script ./whitespace_example.sh \
	--wait

the output of

gsutil ls -R  gs://mybucket/faruqsandi_new/

is

gs://mybucket/faruqsandi_new/:

gs://mybucket/faruqsandi_new/dsub_test/:
gs://mybucket/faruqsandi_new/dsub_test/example.txt

gs://mybucket/faruqsandi_new/dsub_test/logging/:
gs://mybucket/faruqsandi_new/dsub_test/logging/whitespace--bit--230318-125920-93-stderr.log
gs://mybucket/faruqsandi_new/dsub_test/logging/whitespace--bit--230318-125920-93-stdout.log
gs://mybucket/faruqsandi_new/dsub_test/logging/whitespace--bit--230318-125920-93.log

There is example.txt, which is what we expected. I believe using this parameter (double quotes, no trailing whitespace)

	--output-recursive OUTPUT_DIR="gs://mybucket/faruqsandi_new/dsub_test"  \

will yield the same thing.


Let's move on to the failed scenario, using this dsub command (double quotes and trailing whitespace):

dsub \
	 --provider google-cls-v2 \
	--project xna-labs-uvaca-labs-workspace  \
	--zones "us-central1-*" \
	--logging "gs://mybucket/faruqsandi/dsub_test/logging" \
	--output-recursive "OUTPUT_DIR=gs://mybucket/faruqsandi/dsub_test " \
	--script ./whitespace_example.sh \
	--wait

The log says success:

[
  {
    "job-name": "whitespace-example",
    "last-update": "2023-03-18 12:54:03.834733",
    "status-message": "Success",
    "job-id": "whitespace--bit--230318-125228-99",
    "user-id": "bit",
    "status": "SUCCESS",
    "status-detail": "Success",
    "create-time": "2023-03-18 12:52:32.074038",
    "start-time": "2023-03-18 12:52:48.026275",
    "end-time": "2023-03-18 12:54:03.834733",
    "internal-id": "projects/asdasdas/locations/us-central1/operations/adsadsa",
    "logging": "gs://mybucket/faruqsandi/dsub_test/logging/whitespace--bit--230318-125228-99.log",
    "labels": {},
    "envs": {},
    "inputs": {},
    "input-recursives": {},
    "outputs": {},
    "output-recursives": {
      "OUTPUT_DIR": "gs://mybucket/faruqsandi/dsub_test "
    },
    "mounts": {},
    "provider": "google-cls-v2",
    "provider-attributes": {
      "ssh": false,
      "block-external-network": null,
      "instance-name": "google-pipelines-worker-sdasdas",
      "zone": "us-central1-f",
      "regions": [],
      "zones": [
        "us-central1-a",
        "us-central1-b",
        "us-central1-c",
        "us-central1-f"
      ],
      "machine-type": "n1-standard-1",
      "preemptible": false,
      "boot-disk-size": 10,
      "network": "",
      "subnetwork": "",
      "use_private_address": false,
      "cpu_platform": "",
      "accelerators": [],
      "enable-stackdriver-monitoring": false,
      "service-account": "default",
      "disk-size": 200,
      "disk-type": "pd-standard",
      "volumes": []
    },
    "events": [
      {
        "name": "start",
        "start-time": "2023-03-18 05:52:48.026275+00:00"
      },
      {
        "name": "pulling-image",
        "start-time": "2023-03-18 05:53:42.370788+00:00"
      },
      {
        "name": "localizing-files",
        "start-time": "2023-03-18 05:53:53.424937+00:00"
      },
      {
        "name": "running-docker",
        "start-time": "2023-03-18 05:53:55.086473+00:00"
      },
      {
        "name": "delocalizing-files",
        "start-time": "2023-03-18 05:53:56.370611+00:00"
      },
      {
        "name": "ok",
        "start-time": "2023-03-18 05:54:03.834733+00:00"
      }
    ],
    "dsub-version": "v0-4-8",
    "script-name": "whitespace_example.sh",
    "script": "\ntouch example.txt\nmv example.txt $OUTPUT_DIR"
  }
]

However, when I run this command to see what is in the OUTPUT_DIR:

gsutil ls -R  gs://mybucket/faruqsandi/

the output is:

gs://mybucket/faruqsandi/dsub_test/:

gs://mybucket/faruqsandi/dsub_test/logging/:
gs://mybucket/faruqsandi/dsub_test/logging/whitespace--bit--230318-125228-99-stderr.log
gs://mybucket/faruqsandi/dsub_test/logging/whitespace--bit--230318-125228-99-stdout.log
gs://mybucket/faruqsandi/dsub_test/logging/whitespace--bit--230318-125228-99.log

There is no example.txt. Both stdout.log and stderr.log are empty. The content of the log is:

2023-03-18 05:53:53 INFO: gsutil -h Content-Type:text/plain  -mq cp /tmp/continuous_logging_action/output gs://mybucket/faruqsandi/dsub_test/logging/whitespace--bit--230318-125228-99.log
2023-03-18 05:53:53 INFO: mkdir -m 777 -p /mnt/data/output/gs/mybucket/faruqsandi/dsub_test /
2023-03-18 05:53:57 INFO: Delocalizing OUTPUT_DIR
2023-03-18 05:53:57 INFO: gsutil  -mq rsync -r /mnt/data/output/gs/mybucket/faruqsandi/dsub_test / gs://mybucket/faruqsandi/dsub_test /

The clue is probably in the last line of that log.

@mbookman
Contributor

GCS does support whitespace in object paths.

The issue here is actually with the test whitespace_example.sh.

Rather than:

touch example.txt
mv example.txt $OUTPUT_DIR 

This should be:

touch example.txt
mv example.txt "$OUTPUT_DIR"

Otherwise, the "mv" command becomes:

mv example.txt /mnt/data/output/gs/mybucket/faruqsandi/dsub_test /

instead of:

mv example.txt "/mnt/data/output/gs/mybucket/faruqsandi/dsub_test /"

and so when dsub goes to rsync the output directory, that directory is empty.

With the output directory quoted, I do see example.txt showing up in my bucket:

$ gsutil ls -R "gs://mybucket/faruqsandi/dsub_test "
gs://mybucket/faruqsandi/dsub_test /:
gs://mybucket/faruqsandi/dsub_test /example.txt
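
The word splitting described above can be reproduced locally without dsub or GCS. In this sketch, show_args is a hypothetical helper that prints each argument it receives in brackets, and the OUTPUT_DIR value is assumed to match the local path shown in the log; the default IFS is also assumed.

```shell
# Assumed value, modeled on the local output path in the log above.
OUTPUT_DIR="/mnt/data/output/gs/mybucket/faruqsandi/dsub_test /"

# show_args is a hypothetical helper: it prints each argument on its own line, in brackets.
show_args() { printf '[%s]\n' "$@"; }

# Unquoted: the shell splits the value at the space into two separate arguments.
unquoted="$(show_args $OUTPUT_DIR)"
# Quoted: the value is passed through as a single argument, space included.
quoted="$(show_args "$OUTPUT_DIR")"

echo "unquoted:"
echo "$unquoted"
# [/mnt/data/output/gs/mybucket/faruqsandi/dsub_test]
# [/]
echo "quoted:"
echo "$quoted"
# [/mnt/data/output/gs/mybucket/faruqsandi/dsub_test /]
```

In the unquoted case, mv receives / as its final argument, so nothing lands in the directory that is later synced to the bucket.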

@faruqsandi
Author

faruqsandi commented Mar 20, 2023 via email
