Sub process logger #482

Closed
ophirazulai opened this issue Oct 20, 2021 · 21 comments

@ophirazulai

Hi,

Here is the description of the problem:

I have a process A that creates a logger using this command:
task = Task.init(project_name=args.clearml_proj_base + "/training", task_name=args.clearml_task,
                 tags=[args.loss, 'patch size' + str(args.patch_size),
                       str(args.num_features) + '_' + str(args.max_features) + '_' + str(args.unet_depth),
                       'two channels', 'lr_default'],
                 continue_last_task=False)
logger = task.get_logger()

Then this process submits a new job, process B (for inference), to our cluster, and this job runs on a different computer.
The new job creates its logger using one of the following:
task = Task.init(project_name=project_name, task_name=task_name)
or
task = Task.init(project_name=project_name, task_name=task_name, continue_last_task=False, reuse_last_task_id=False)
or
task = Task.init(project_name=project_name, task_name=task_name, continue_last_task=False)

Process B uses a different project_name and task_name from process A.

The problem is that all of B's logs are created in A's task.
Also, no entry is created for the new project_name/task_name.

Thanks,
Ophir Azulai
IBM Research AI

@jkhenning
Member

Hi @ophirazulai,

Then this process submit new job/process B (for inference) in our cluster

Can you share the code that does that?

and this job runs on a different computer.

Using ClearML Agent?

@ophirazulai
Author

ophirazulai commented Oct 20, 2021

Hi @jkhenning

I can only share snippets

Process A (Training):

  • Init the ClearML task:
    task = Task.init(project_name=args.clearml_proj_base + "/training", task_name=args.clearml_task,
                     tags=[args.loss, 'patch size' + str(args.patch_size),
                           str(args.num_features) + '_' + str(args.max_features) + '_' + str(args.unet_depth),
                           'two channels', 'lr_default'],
                     continue_last_task=False)
    logger = task.get_logger()

  • Now process A starts process B in the cluster for inference on the checkpoint:
    job_id, jbsub_output = submit_job(command_to_run, out_file=log_file,
                                      err_file=log_file,
                                      interactive=False, machine_type='x86',
                                      duration='1h', num_nodes=1, num_cores=4, num_gpus=0, num_processes=1,
                                      mem='8g', project_name=inference_clearml_task)

Process B (Inference) - no ClearML agent

from clearml import Task
task = Task.init(project_name=project_name, task_name=task_name, continue_last_task=continue_last_task,

Could it be that something is passed through the system variables? Maybe a quick fix would be to override it.

@jkhenning
Member

OK, but what does submit_job do? How do you create the new task?

@ophirazulai
Author

It is an internal IBM job scheduler that simply starts a new process on a different computer.

Could it be that something is passed through the system variables? Maybe a quick fix would be to override it.

@jkhenning
Member

What I'm not sure about is how you start a new process on the other machine: what do you provide to the ClearML Task.init(), how do you provide the clearml.conf configuration, etc.? What is being copied and/or reused from the current machine?

@ophirazulai
Author

The IBM job scheduler starts a new process on a different computer.
I don't know how it is done.

The job is submitted under my name, so I guess clearml.conf is taken from my home directory.

In process B, I call Task.init() with a new project name and task name, but all logs still go to task A's entry.

@jkhenning
Member

It's possible the new task reuses your current task - you can try passing reuse_last_task_id=False in the Task.init call of process B

@ophirazulai
Author

Tried that already, didn't help

@jkhenning
Member

Well, then, it's important to understand exactly how the new process is created: is it a fork, or is something copied (and if so, what command line is executed)?

@ophirazulai
Author

It is not my code, I don't know.

Does your code use any environment variables?
https://linuxize.com/post/how-to-set-and-list-environment-variables-in-linux/

Those I can override; this seems to be the only way process B can know about process A.

@jkhenning
Member

ClearML can use environment variables, but the question remains how process B is executed...

How are you running process A? Did you manually start the task with some IDE or python command?
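One way to see what a submitted job inherits is to list the ClearML-related environment variables at the very top of the script. The sketch below is only illustrative: the helper name clearml_related_vars is hypothetical, and filtering on the CLEARML_ and TRAINS_ name prefixes is an assumption about how such variables are named. Only names are printed, since values may contain credentials.

```python
import os

# Assumed prefixes: CLEARML_ for current variables, TRAINS_ for legacy ones.
PREFIXES = ("CLEARML_", "TRAINS_")


def clearml_related_vars(environ=None):
    """Return the environment variables whose names start with one of PREFIXES."""
    environ = os.environ if environ is None else environ
    return {k: v for k, v in environ.items() if k.startswith(PREFIXES)}


# Print names only; values may hold API keys or other secrets.
for name in sorted(clearml_related_vars()):
    print(name)
```

Running this in both process A and process B shows which variables travel with the environment handed to the scheduler.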

@ophirazulai
Author

I run task A exactly like task B: I submit a new job to the IBM internal job scheduler.
I give the scheduler the full python command line plus my environment, and every node in the cluster can see my storage.

Is there an environment variable that, if it exists, makes the logs continue into the session it names?

@jkhenning
Member

Can you try to unset the CLEARML_PROC_MASTER_ID environment variable?

@ophirazulai
Author

Yes! :-) will update

@ophirazulai
Author

Tried to unset CLEARML_PROC_MASTER_ID and it didn't help.

Tried two ways: in process A before starting process B, and in process B.

Here is example code:

import sys
import os

import copy

print("os.environ = ")
print(os.environ)

# Remove the variable from this process's environment before clearml is imported
if 'CLEARML_PROC_MASTER_ID' in os.environ:
    print("deleting CLEARML_PROC_MASTER_ID")
    del os.environ['CLEARML_PROC_MASTER_ID']
    try:
        os.unsetenv('CLEARML_PROC_MASTER_ID')
    except Exception:
        print("Exception unsetting CLEARML_PROC_MASTER_ID")
else:
    print("CLEARML_PROC_MASTER_ID is not found")

@jkhenning
Member

Hi @ophirazulai,

In your example code, when do you import clearml? Before or after this code?

@ophirazulai
Author

These are the first lines in the script, before importing clearml.

@ophirazulai
Author

Also tried to close the task before sending the job and then re-open it.
Still didn't work.

Here is the code:
task.close()

job_id, jbsub_output = submit_job(command_to_run, out_file=log_file,
                                  err_file=log_file,
                                  interactive=False, machine_type='x86',
                                  duration='1h', num_nodes=1, num_cores=4, num_gpus=0, num_processes=1,
                                  mem='8g', project_name=inference_clearml_task)

# init the clearml task
task = Task.init(project_name=args.clearml_proj_base + "/training", task_name=args.clearml_task,
                 tags=[args.loss, 'patch size' + str(args.patch_size),
                       str(args.num_features) + '_' + str(args.max_features) + '_' + str(args.unet_depth),
                       'two channels', 'lr_default'],
                 continue_last_task=True)
logger = task.get_logger()

@jkhenning
Member

Can you try to unset CLEARML_PROC_MASTER_ID as well as TRAINS_PROC_MASTER_ID just to be sure?
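A minimal sketch of this suggestion, assuming it is placed at the very top of process B's script, before clearml is imported (as discussed above). The helper name clear_master_id_vars is hypothetical; the two variable names are the ones mentioned in this thread.

```python
import os

# The two variables named in this thread.
MASTER_ID_VARS = ("CLEARML_PROC_MASTER_ID", "TRAINS_PROC_MASTER_ID")


def clear_master_id_vars(environ=None):
    """Remove the master-ID variables from the mapping and return the names removed."""
    environ = os.environ if environ is None else environ
    removed = []
    for name in MASTER_ID_VARS:
        # pop() on os.environ also unsets the variable for this process
        if environ.pop(name, None) is not None:
            removed.append(name)
    return removed


print("removed:", clear_master_id_vars())

# Only after this point import clearml and create the task, for example:
# from clearml import Task
# task = Task.init(project_name=project_name, task_name=task_name)
```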

@ophirazulai
Author

This solved the problem, thanks a lot for your support
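For the other approach tried earlier in the thread (clearing the variables in process A before starting process B), a rough sketch is to hand the child a copy of the environment without the two variables. Here subprocess.run is only a stand-in for the internal submit_job scheduler call, whose behavior is not shown in this thread; the helper names env_without_master_id and run_child are hypothetical.

```python
import os
import subprocess
import sys

# The two variables named in this thread.
MASTER_ID_VARS = ("CLEARML_PROC_MASTER_ID", "TRAINS_PROC_MASTER_ID")


def env_without_master_id(environ=None):
    """Return a copy of the environment with the master-ID variables left out."""
    environ = os.environ if environ is None else environ
    return {k: v for k, v in environ.items() if k not in MASTER_ID_VARS}


def run_child(code):
    """Run a Python snippet in a child process that gets the scrubbed environment."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env_without_master_id(),  # the parent's own environment is left untouched
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# The child reports whether either variable is still visible to it.
child_code = (
    "import os; "
    "print([k for k in ('CLEARML_PROC_MASTER_ID', 'TRAINS_PROC_MASTER_ID') if k in os.environ])"
)
print(run_child(child_code))
```

Passing a scrubbed copy keeps process A's own task state intact while preventing the child from inheriting it, provided the scheduler forwards exactly the environment it is given.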

@jkhenning
Member

Great 🙂

I'm closing the issue, please ping here if required.
