Run very long on Training pipeline #12

Closed
truongnmt opened this issue Apr 30, 2018 · 4 comments
truongnmt commented Apr 30, 2018

I'm following this tutorial for detecting atrial fibrillation, but when I run the training pipeline it takes a very long time.

I'm using a Tesla K80 and left it running all night, more than 7 hours, but it's still running.
This block runs 1000 epochs:

template_train_ppl = (
    ds.Pipeline()
      .init_model("dynamic", DirichletModel, name="dirichlet", config=model_config)
      .init_variable("loss_history", init_on_each_run=list)
      .load(components=["signal", "meta"], fmt="wfdb")
      .load(components="target", fmt="csv", src=LABELS_PATH)
      .drop_labels(["~"])
      .rename_labels({"N": "NO", "O": "NO"})
      .flip_signals()
      .random_resample_signals("normal", loc=300, scale=10)
      .random_split_signals(2048, {"A": 9, "NO": 3})
      .binarize_labels()
      .train_model("dirichlet", make_data=concatenate_ecg_batch,
                   fetches="loss", save_to=V("loss_history"), mode="a")
      .run(batch_size=BATCH_SIZE, shuffle=True, drop_last=True, n_epochs=N_EPOCH, lazy=True)
)

train_ppl = (eds.train >> template_train_ppl).run()

Do you think something is not right here?
Or does the framework have a way to indicate that it's running, e.g. print the number of the current epoch?
Also, FYI, when I run it I see this in the terminal:

2018-04-30 16:19:57.630351: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.71GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
roman-kh (Member) commented May 1, 2018

You might add a line

.call(lambda _, v: print(v[-1]), v=V('loss_history'))

before run(batch_size=BATCH_SIZE,...).

It will print the loss function value at each iteration.
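
For reference, here is a sketch of where that line would slot into the template from the issue (last few actions only, same names as the original snippet):

template_train_ppl = (
    ds.Pipeline()
      ...
      .train_model("dirichlet", make_data=concatenate_ecg_batch,
                   fetches="loss", save_to=V("loss_history"), mode="a")
      # print the most recent loss value after every training iteration
      .call(lambda _, v: print(v[-1]), v=V('loss_history'))
      .run(batch_size=BATCH_SIZE, shuffle=True, drop_last=True, n_epochs=N_EPOCH, lazy=True)
)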

truongnmt (Author) commented
Thanks a lot, it worked!!!
And btw, it has just finished 1000 epochs 🔥 🔥 🔥

emadahmed97 commented
@truongnmt How long did it take you?

truongnmt (Author) commented May 14, 2018

@emadahmed97 It took me about 8 or 9 hours, dude.
