Investigation of errors that occur when multiple Workflows run concurrently #104

Closed
itutu-tienday opened this issue Aug 31, 2023 · 5 comments · Fixed by #127
@itutu-tienday
Collaborator

itutu-tienday commented Aug 31, 2023

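Investigate and fix the following issue.

Symptom

  • When Workflows are run concurrently, several of them sometimes abort with errors. The cause is unknown.

  • Test pattern

    • Open a separate browser tab for each of five Workspaces (1 to 5) and run the Workflow (RUN ALL) in each tab at almost the same time.
    • Execution environment: Mac (macOS 13)
  • Result

    • Intermittently, the Workflows in some or all tabs abort with errors.

Error case details

  • Case 1) Errors occurred in the Workflows of all tabs (1 to 5)

    • Target Workflow: "caiman_mc → caiman_cnmf"
    • caiman_cnmf failed in every tab (1 to 5).
    • Error log
      • .snakemake/log contains the records below. (Errors only in tabs 3 and 5?)
      • No other logs remain for this case: the Workflows were run again afterward, which overwrote them.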
        [Wed Aug 30 17:27:25 2023]
        Error in rule 2:
            jobid: 1
            input: /Volumes/workspace/optinist-sv-storedir/output/5/e23f5698/caiman_mc_qfusonsrx3/caiman_mc.pkl
            output: /Volumes/workspace/optinist-sv-storedir/output/5/e23f5698/caiman_cnmf_fux2p3gk0h/caiman_cnmf.pkl
            conda-env: /Volumes/workspace/optinist-for-server/.snakemake/conda/f286fa37fa6660cd11b27453490b9fad_
        
        [Wed Aug 30 17:27:25 2023]
        Error in rule 2:
            jobid: 1
            input: /Volumes/workspace/optinist-sv-storedir/output/3/96c67ad9/caiman_mc_oyjgd3wwys/caiman_mc.pkl
            output: /Volumes/workspace/optinist-sv-storedir/output/3/96c67ad9/caiman_cnmf_pri9m2zneq/caiman_cnmf.pkl
            conda-env: /Volumes/workspace/optinist-for-server/.snakemake/conda/f286fa37fa6660cd11b27453490b9fad_
        
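  • Case 2) Errors occurred in tabs 1 and 2.

    • Error details
      • Tab 1: for some reason, the tab 1 process logged an entry that references the output_file path of the tab 2 process.
      • Tab 2: same as the tab 1 error log, shown below.
        • error.log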
          line 937, in update\n
          raise AmbiguousRuleException(file, producer, ambiguities[0])\n
          snakemake.exceptions.AmbiguousRuleException: Rules 6 and 5 are ambiguous for the file
          /Volumes/workspace/optinist-sv-storedir/output/2/0fa2aee5/input_0/mouse2p_2_donotouse.pkl.\n
          Consider starting rule output with a unique prefix, constrain your wildcards, or use the ruleorder directive.\n
          Wildcards:\n \t6: \n \t5: \n
          Expected input files:\n \t6:
          /Volumes/workspace/optinist-sv-storedir/input/2/mouse2p_2_donotouse.tiff\n \t5:
          /Volumes/workspace/optinist-sv-storedir/input/2/mouse2p_2_donotouse.tiff\n
          Expected output files:\n \t6:
          /Volumes/workspace/optinist-sv-storedir/output/2/0fa2aee5/input_0/mouse2p_2_donotouse.pkl\n \t5:
          /Volumes/workspace/optinist-sv-storedir/output/2/0fa2aee5/input_0/mouse2p_2_donotouse.pkl\n
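
The exception above reports that rules 5 and 6 both declare the same output file. As a rough diagnostic sketch, duplicates of this kind could be detected before the DAG is built, assuming the generated rules are available as a plain mapping from rule name to output paths. The `find_duplicate_outputs` helper, the mapping shape, and the shortened paths are hypothetical and are not part of Snakemake or optinist:

```python
from collections import defaultdict
from typing import Dict, List


def find_duplicate_outputs(rules: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Return {output_path: [rule names]} for outputs declared by 2+ rules."""
    producers = defaultdict(list)
    for rule_name, outputs in rules.items():
        for path in outputs:
            producers[path].append(rule_name)
    return {path: names for path, names in producers.items() if len(names) > 1}


# Hypothetical rule table mirroring the log: rules "5" and "6" share one output.
rules = {
    "5": ["output/2/0fa2aee5/input_0/mouse2p_2_donotouse.pkl"],
    "6": ["output/2/0fa2aee5/input_0/mouse2p_2_donotouse.pkl"],
    "7": ["output/2/0fa2aee5/caiman_mc/caiman_mc.pkl"],
}
duplicates = find_duplicate_outputs(rules)
```

If a check like this flags a duplicate only under concurrent runs, that would suggest the rule set of one Workflow is leaking into another, which matches the cross-tab path seen in the tab 1 log.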
          

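Verification

  • The following pattern was tested and behaved correctly.
    • Pattern: temporarily add code that forcibly terminates one of the five Workflows (raise Error), then check the behavior.
    • Result: every Workflow except the one in which the error was raised completed successfully. (Expected behavior)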
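
The verification pattern above can be sketched in isolation as follows. This is a stand-in only: `run_workflow` and `run_all` are hypothetical names, the real Workflows are replaced by a trivial function, and threads are assumed as the concurrency mechanism, which the thread does not confirm:

```python
from concurrent.futures import ThreadPoolExecutor


def run_workflow(workspace_id: int, fail_id: int) -> str:
    """Stand-in for one Workflow run; raises for the workspace chosen to fail."""
    if workspace_id == fail_id:
        raise RuntimeError(f"forced failure in workspace {workspace_id}")
    return f"workspace {workspace_id}: success"


def run_all(workspace_ids, fail_id):
    """Run all stand-in Workflows concurrently and collect per-workspace results."""
    results = {}
    with ThreadPoolExecutor(max_workers=len(workspace_ids)) as pool:
        futures = {ws: pool.submit(run_workflow, ws, fail_id) for ws in workspace_ids}
        for ws, future in futures.items():
            try:
                results[ws] = future.result()
            except RuntimeError as exc:
                # A failure in one workspace must not affect the others.
                results[ws] = f"error: {exc}"
    return results


results = run_all([1, 2, 3, 4, 5], fail_id=3)
```

Here only workspace 3 should report an error, which is the isolation property the test was checking.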

Possible causes

Other

@quanpython
Collaborator

@itutu-tienday
I tried to reproduce this with the test pattern and environment above (macOS 13). I did not get the errors you mentioned, but I got the following errors instead:

File "/Users/pro/Documents/optinist-for-server/.snakemake/conda/a5230297924c2d6bffa9bd1c72d68d8f_/lib/python3.8/site-packages/numpy/core/memmap.py", line 228, in __new__
    f_ctx = open(os_fspath(filename), ('r' if mode == 'c' else mode)+'b')
FileNotFoundError: [Errno 2] No such file or directory: 'data/input/1/memmap_d1_128_d2_128_d3_1_order_C_frames_1000.mmap'

File "/Users/pro/Documents/optinist-for-server/.snakemake/conda/a5230297924c2d6bffa9bd1c72d68d8f_/lib/python3.8/site-packages/numpy/core/memmap.py", line 267, in __new__
    mm = mmap.mmap(fid.fileno(), bytes, access=acc, offset=start)
ValueError: mmap length is greater than file size

2023-09-06 22:44:35,501 : ERROR - logger.py - {'level': 'debug', 'msg': 'Full Traceback (most recent call last):\n  File "/Users/pro/opt/anaconda3/envs/optinistfs/lib/python3.8/site-packages/snakemake/__init__.py", line 771, in snakemake\n    success = workflow.execute(\n  File "/Users/pro/opt/anaconda3/envs/optinistfs/lib/python3.8/site-packages/snakemake/workflow.py", line 741, in execute\n    dag.init()\n  File "/Users/pro/opt/anaconda3/envs/optinistfs/lib/python3.8/site-packages/snakemake/dag.py", line 205, in init\n    job = self.update([job], progress=progress, create_inventory=True)\n  File "/Users/pro/opt/anaconda3/envs/optinistfs/lib/python3.8/site-packages/snakemake/dag.py", line 917, in update\n    raise exceptions[0]\n  File "/Users/pro/opt/anaconda3/envs/optinistfs/lib/python3.8/site-packages/snakemake/dag.py", line 875, in update\n    self.update_(\n  File "/Users/pro/opt/anaconda3/envs/optinistfs/lib/python3.8/site-packages/snakemake/dag.py", line 1008, in update_\n    raise ex\n  File "/Users/pro/opt/anaconda3/envs/optinistfs/lib/python3.8/site-packages/snakemake/dag.py", line 990, in update_\n    selected_job = self.update(\n  File "/Users/pro/opt/anaconda3/envs/optinistfs/lib/python3.8/site-packages/snakemake/dag.py", line 917, in update\n    raise exceptions[0]\n  File "/Users/pro/opt/anaconda3/envs/optinistfs/lib/python3.8/site-packages/snakemake/dag.py", line 875, in update\n    self.update_(\n  File "/Users/pro/opt/anaconda3/envs/optinistfs/lib/python3.8/site-packages/snakemake/dag.py", line 1008, in update_\n    raise ex\n  File "/Users/pro/opt/anaconda3/envs/optinistfs/lib/python3.8/site-packages/snakemake/dag.py", line 990, in update_\n    selected_job = self.update(\n  File "/Users/pro/opt/anaconda3/envs/optinistfs/lib/python3.8/site-packages/snakemake/dag.py", line 917, in update\n    raise exceptions[0]\n  File "/Users/pro/opt/anaconda3/envs/optinistfs/lib/python3.8/site-packages/snakemake/dag.py", line 875, in update\n    
self.update_(\n  File "/Users/pro/opt/anaconda3/envs/optinistfs/lib/python3.8/site-packages/snakemake/dag.py", line 1036, in update_\n    raise MissingInputException(job, missing_input)\nsnakemake.exceptions.MissingInputException: Missing input files for rule 3:\n    output: data/output/1/2a8d4596/caiman_mc_2bq24wuuam/caiman_mc.pkl\n    affected files:\n        data/output/1/2a8d4596/input_0/mouse2p_2_donotouse.pkl\n', 'timestamp': 1694015075.50097}

=> Are there differences in your environment that haven't been mentioned?
I also tested with the suite2p and lccd workflows but did not see any errors, so I guess these errors only appear on the caiman_mc node.

@itutu-tienday
Collaborator Author

@quanpython

=> Are there differences in your environment that haven't been mentioned?

No, I have no additional information at this time.

I also tested with the suite2p and lccd workflows but did not see any errors,

So in your environment, concurrent execution did not cause any errors?

  • Please share the evidence (logs of the successful concurrent runs), just to be sure.

Also, if possible, we would like you to confirm the following:

  • Run suite2p and lccd concurrently with up to 10 tabs and monitor the processing status

so I guess these errors only appear on the caiman_mc node.

I see. Could you now look into the caiman_mc error a bit more?

@quanpython
Collaborator

@itutu-tienday

So in your environment, concurrent execution did not cause any errors?

Yes.
Here are the log files from the 5-tab runs.

lccd-5.log
suite2p-5.log

Run suite2p and lccd concurrently with up to 10 tabs and monitor the processing status

In this case, only a few tabs succeeded. Some workflow runs never responded.
Most tabs returned an error immediately after pressing Run All:

CreateRuleException in file /Users/pro/Documents/optinist-for-server/studio/app/Snakefile, line 16:
The name all is already used by another rule
  File "/Users/pro/Documents/optinist-for-server/studio/app/Snakefile", line 16, in <module>

Log:
lccd-10.log
suite2p-10.log

I see. Could you now look into the caiman_mc error a bit more?

If multiple workflows use the same input tif file, they also use the same memmap file. I think giving the memmap file a unique base name may solve the error above.

# studio/app/optinist/wrappers/caiman/motion_correction.py
fname_new = save_memmap(
    mc.mmap_file, base_name="memmap_", order="C", border_to_0=border_to_0
)
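
One way the unique base name suggested above might be generated is sketched below. This is only an illustration: `unique_memmap_base_name` is a hypothetical helper, the use of a random UUID is an assumption (a workflow or node ID would serve equally well), and the actual fix adopted in the project may differ:

```python
import uuid


def unique_memmap_base_name(prefix: str = "memmap_") -> str:
    """Build a base name that is unique per call by appending a random UUID."""
    return f"{prefix}{uuid.uuid4().hex}_"


base_a = unique_memmap_base_name()
base_b = unique_memmap_base_name()

# Hypothetical use in motion_correction.py (not executed here):
# fname_new = save_memmap(
#     mc.mmap_file,
#     base_name=unique_memmap_base_name(),
#     order="C",
#     border_to_0=border_to_0,
# )
```

With a name like this, two workflows that read the same input tif file would no longer write to the same .mmap path.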

@itutu-tienday
Collaborator Author

@quanpython
Thanks for the investigation.

In this case, only a few tabs succeeded. Some workflow runs never responded.
Most tabs returned an error immediately after pressing Run All:

Can you identify the cause of the above behavior and a remedy for it?

If multiple workflows use the same input tif file, they also use the same memmap file. I think giving the memmap file a unique base name may solve the error above.

I see. Is it possible that this problem occurs only with caiman?

@itutu-tienday itutu-tienday transferred this issue from arayabrain/optinist-for-server Oct 3, 2023
@itutu-tienday itutu-tienday added this to the v1.1.0 milestone Oct 26, 2023
@ReiHashimoto ReiHashimoto linked a pull request Oct 27, 2023 that will close this issue