R pkg fit() call finishes but subprocess doesn't terminate #65

Closed
sheffe opened this issue Apr 16, 2020 · 4 comments

Comments

@sheffe
Contributor

sheffe commented Apr 16, 2020

This model consistently feels like a magic trick, thanks for contributing!

Bug
I'm running the ivis R package (v1.7.1) (more system details below). I can get model$fit() and model$transform() working just fine and producing substantive results. However, when the R process finishes and returns the fitted model, I'm seeing continued sky-high system usage. The R process calling ivis has definitely completed and is back at a command prompt, but in htop I can see the RStudio GUI process (parent of the rsession process) occupying at least 2 full cores. Some process further down is not stopping when the R process gets the returned value. (Restarting the R session does kill it.)

I don't understand enough of the ivis-through-reticulate toolchain to provide more helpful diagnostics in this first report, but happy to run experiments and document further.

Environment

  • ivis R package (v1.7.1), installed from GitHub (56a8479) 14 Apr 2020
  • reticulate (v1.15), 2020-04-02 CRAN (R 3.6.2)
  • R 3.6.2 on macOS 10.14.6 (18G4032)
platform       x86_64-apple-darwin15.6.0   
arch           x86_64                      
os             darwin15.6.0                
system         x86_64, darwin15.6.0        
status                                     
major          3                           
minor          6.2                         
year           2019                        
month          12                          
day            12                          
svn rev        77560                       
language       R                           
version.string R version 3.6.2 (2019-12-12)
nickname       Dark and Stormy Night  
@idroz
Collaborator

idroz commented Apr 16, 2020

Hi - glad you're enjoying ivis!

We've been noticing some weird threading issues that relate to the latest tensorflow 2.1. @Szubie is working on a larger fix to address this via data generators.

Meanwhile, could you try downgrading tensorflow to version 1.15 and let us know if that solves your hanging thread issue? Since the R wrapper for ivis installs it into a virtualenv, the steps to downgrade would be:

  1. From the R REPL, find out where your local virtual environments are stored:
> reticulate::virtualenv_root()
  2. Assuming that your venvs are in ~/.virtualenvs, activate the ivis environment from the command line:
$ source ~/.virtualenvs/ivis/bin/activate
  3. Finally, downgrade tensorflow using pip:
$ python -m pip install tensorflow==1.15

After this restart R and reload the ivis package.
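
The steps above could also be wrapped into a small shell function. This is only a sketch under stated assumptions: the function name downgrade_ivis_tensorflow is made up for illustration, the environment is assumed to be called ivis as in the commands above, and the default venv location is assumed to be $WORKON_HOME or ~/.virtualenvs (check the output of reticulate::virtualenv_root() to be sure).

```shell
# Hypothetical helper (not part of ivis or reticulate): downgrade TensorFlow
# inside the virtualenv that the ivis R package uses.
downgrade_ivis_tensorflow() {
  # Assumption: virtualenvs live under $WORKON_HOME, else ~/.virtualenvs;
  # confirm with reticulate::virtualenv_root() in R.
  local venv_root="${1:-${WORKON_HOME:-$HOME/.virtualenvs}}"
  local tf_version="${2:-1.15}"
  local venv_python="$venv_root/ivis/bin/python"

  if [ ! -x "$venv_python" ]; then
    echo "No Python interpreter at $venv_python" >&2
    return 1
  fi

  # Calling the venv's interpreter directly avoids having to activate it.
  "$venv_python" -m pip install "tensorflow==$tf_version" || return 1

  # Print the version that Python now imports, as a sanity check.
  "$venv_python" -c 'import tensorflow as tf; print(tf.__version__)'
}
```

Calling downgrade_ivis_tensorflow with no arguments uses the defaults; a different venv root or TensorFlow version can be passed as the first and second argument. R still needs to be restarted afterwards.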

@sheffe
Contributor Author

sheffe commented Apr 16, 2020

@idroz thanks for the fix and clear instructions here. It's now working as expected -- I've run a few training iterations and don't see any ongoing weird behavior now.

I'm not sure if this is helpful for further diagnosis, but the extent of ongoing resource usage seemed correlated to the overall complexity of the fit() call -- increasing data size or increasing ntrees/search_k/n_epochs_without_progress tended to boost the level of after-termination resource consumption. I don't have any good metrics for that correlation, but it seemed consistent over ~200 training sessions. All CPU, no GPU.

Another observation: before downgrading tensorflow to 1.15 per your suggestion, I'd get this error with every training epoch:

W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled

That message has disappeared with the downgrade. Comparing embedding results before/after the downgrade, I can't see any material effect on the output or metrics like training speed.

(You've solved my problem (thanks!) but I'll leave the decision to close the issue to you, since it sounds like it ties to more systematic changes.)

@idroz
Collaborator

idroz commented Apr 16, 2020

Glad it worked.

The issue you describe most certainly relates to a documented tensorflow problem: tensorflow/tensorflow#35100.

Nightly TF build seems to have fixed it, so hoping that the next stable TF release will be able to sort it out.

Will ping here when either data generator solves the problem or TF team pushes a working update!

@Szubie
Collaborator

Szubie commented May 13, 2020

Hi, we have pushed a new update to ivis here: ecaf4cc

This update stops TensorFlow 2.1 from spawning new Threads without closing them. This takes care of the warning error seen on every epoch of training, and should also fix the issue you've been seeing in R with the subprocess not terminating correctly.

Please give it a go, hopefully it solves the issues you've been encountering!

@Szubie Szubie closed this as completed Jun 11, 2020
3 participants