Training and prediction of the "CREMI" data set. PART 2 #13

Open
ravil-mobile opened this issue Sep 7, 2017 · 5 comments

@ravil-mobile

Hello again. This time I decided to be more precise and have written down the exact steps I took. The problem I described in the previous issue still remains, but I hope this information will be useful for debugging the code.

We took the following steps:

  1. All diluvian and tensorflow packages were removed from the system. All files
    were also removed from /user/.keras/dataset.

  2. Following the instructions, diluvian was installed as follows:
    pip install diluvian

    ...
    Installing collected packages: tensorflow, diluvian
    Successfully installed diluvian-0.0.3 tensorflow-1.1.0
  3. At the next step, tensorflow-gpu was installed:
    pip install 'tensorflow-gpu==1.2.1'

    ...
    Installing collected packages: tensorflow-gpu
    Successfully installed tensorflow-gpu-1.2.1
  4. I ran diluvian with the default arguments to check whether everything worked OK:

    diluvian train

    The "CREMI" data set was downloaded from the website and diluvian ran for
    just two epochs.

    During the trial run I got the following warning:
    "/user/anaconda2/lib/python2.7/site-packages/keras/callbacks.py:120:
    UserWarning: Method on_batch_end() is slow compared to the batch update (1.893568).
    Check your callbacks."

  5. I generated the config file to change the number of epochs and run the training for a longer period:

    diluvian check-config > myconfig.toml

    <myconfig.toml>
    ...
    total_epochs = 1000
    ...
    num_gpus = 2
    ...
    

    diluvian train -c ./myconfig.toml

    By default, the u-net architecture was chosen:

    <myconfig.toml>
    ...
    factory = "diluvian.network.make_flood_fill_unet"
    ...
    

    The training was terminated automatically at the 76th epoch;
    the values of loss and validation loss were about the same (0.63).

  6. At the next step I ran prediction to estimate how good the training was:

    diluvian fill -c ./myconfig.toml -m ./model-output.hdf5 ./file-{volume}

    where file-{volume} was just a dummy file that contained nothing. As far as
    I understood, diluvian uses this file name to name the output files (*.hdf5).

  7. Results:
    At the end of the filling procedure I got three files as output.
    The problem is that all of them contain only zeros, which basically means
    that the output mask was not predicted.
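
    A minimal way to double-check this, assuming nothing about the dataset
    names inside the output files (the file names below are placeholders):

    # Count the non-zero voxels across every dataset in an HDF5 output file.
    import h5py
    import numpy as np

    def nonzero_voxels(path):
        total = [0]

        def visit(name, obj):
            # visititems() calls this for every group/dataset in the file.
            if isinstance(obj, h5py.Dataset):
                total[0] += int(np.count_nonzero(obj[...]))

        with h5py.File(path, "r") as f:
            f.visititems(visit)
        return total[0]

    for path in ("fill_0.hdf5", "fill_1.hdf5", "fill_2.hdf5"):  # placeholder names
        print("%s: %d non-zero voxels" % (path, nonzero_voxels(path)))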

IMPORTANT:
1) Even reducing the learning rate during training doesn't
help to improve the results of either training or prediction.

2) The ffn architecture produced the same result as the u-net did.

3) We tried running diluvian with a different data set, but the result was about the same.

Could you please tell us where the bug might be, or what we are doing wrong?

Thanks in advance

Best regards,
Ravil

@aschampion
Owner

Hi Ravil,

My guess as to what's happening is this:

the values of loss and validation loss were about the same (0.63)

For a near-default configuration this loss is high, and it's unlikely the network is generating high-probability, structured output yet. Hence the filling volumes are empty: the network is not yet trained to the point where it outputs probabilities high enough to cause FOV moves. You can tell this is happening if, during the filling step, the lower progress bar just flashes by quickly and shows 1 / 1 at the end (this is the progress bar for filling each region). Another way to check: if you increase the output verbosity during training (-l INFO), the training generators will show the average number of moves in the training subvolumes. In normal training this should increase to near or above 20 once the network is learning to move.

These networks take a very long time to train to have performance competitive with other approaches -- days on 8 GPUs. However, on 2 GPUs you should be able to train the network to a point where the output starts to look like neuronal segmentation in less than 8 hours.

Some things to try for fast training to good results:

  • switching from SGD to Adam (a minimal Keras-level sketch follows this list)
  • orders of magnitude larger training and validation sizes than the defaults
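
For reference, the first bullet is just the usual optimizer swap at the Keras level; the snippet below only sketches that mechanism with a stand-in model and loss, not diluvian's actual network or configuration:

    # Sketch of the SGD -> Adam swap at the Keras level, using a stand-in
    # model and loss. Adam usually converges much faster than plain SGD here.
    from keras.models import Sequential
    from keras.layers import Dense
    from keras.optimizers import SGD, Adam

    model = Sequential([Dense(1, input_shape=(8,), activation="sigmoid")])

    # Default-style setup:
    model.compile(optimizer=SGD(lr=0.01), loss="binary_crossentropy")

    # Suggested setup:
    model.compile(optimizer=Adam(lr=0.001), loss="binary_crossentropy")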

@sflender

sflender commented Dec 4, 2017

Hi Ravil, Andrew, do you understand why training was terminated automatically after the 76th epoch, even though you did not configure it to stop early? I observed similar behavior.

Best,
-Samuel.

@aschampion
Owner

@sflender Like most network training paradigms, diluvian terminates training early if the validation loss (or, in this case, validation_metric, which is by default the F_0.5 score of masks of single-body validation volumes) has not improved in several epochs. The number of epochs is controlled by the CONFIG.training.patience parameter (see docs). To effectively disable this, just set patience to be larger than total_epochs.
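
If it helps to see the mechanism concretely, here is a minimal sketch of patience-based early stopping using Keras' stock callback and a stand-in model; diluvian's actual training monitors its own validation metric rather than val_loss:

    # Patience-based early stopping in Keras with a stand-in model.
    # Setting patience larger than the number of epochs effectively disables it.
    import numpy as np
    from keras.models import Sequential
    from keras.layers import Dense
    from keras.callbacks import EarlyStopping

    model = Sequential([Dense(1, input_shape=(4,), activation="sigmoid")])
    model.compile(optimizer="sgd", loss="binary_crossentropy")

    x = np.random.rand(256, 4)
    y = (x.sum(axis=1) > 2).astype("float32")

    total_epochs = 50
    # Stop if the monitored quantity has not improved for `patience` epochs.
    stopper = EarlyStopping(monitor="val_loss", patience=total_epochs + 1, mode="min")

    model.fit(x, y, validation_split=0.25, epochs=total_epochs,
              callbacks=[stopper], verbose=0)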

@Jingliu1994

@aschampion I did the things that you suggested, including changing the optimizer and enlarging the training and validation sizes, but I still can't get good results. The training loss stops falling once it drops to about 0.3, and the validation loss is about 0.5. The number of GPUs is 1 and the batch size is 12. I changed the learning rate, but got no progress. What should I do next? Thanks.

@aschampion
Owner

@Jingliu1994 You can continue to sweep the parameter space, but there are several things I would suggest first:

  • If you're using the CREMI data, the official data still has many ground truth errors and quality problems (e.g., random labels in blank sections). The MALA v2 submission on the CREMI front page has realigned ground truth volumes with much better labels. The ground truth I was using when working on FFNs was based on this, but for various reasons I don't distribute it automatically with diluvian.
  • Ignore the validation loss (which is often meaningless because of the FFN training process
    -- this is why in Google's paper they validate with skeleton metrics instead) and pay
    attention only to the F_beta validation metric; a minimal sketch of that metric follows
    this list. Even if the training loss improvement is minuscule, the validation metric may
    still be improving. An example pulled at random from my logs ("val subvolumes" is the
    validation metric):
    [attached training log excerpt: model_4_97_33u__adam__janaug_64k__14_nov]
  • Apply the network at higher resolution, 8nm or 4nm. (Will greatly increase inference time)
  • Use a larger input FOV. (Will greatly increase training and inference time)
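
To make the metric concrete, here is a minimal NumPy sketch of an F_beta score on binary masks with beta = 0.5 (which weights precision over recall); this is only an illustration, not diluvian's implementation:

    # F_beta score on binary masks; beta = 0.5 favors precision over recall.
    import numpy as np

    def f_beta(pred_mask, true_mask, beta=0.5, eps=1e-8):
        pred = pred_mask.astype(bool)
        true = true_mask.astype(bool)
        tp = np.logical_and(pred, true).sum()
        precision = tp / float(pred.sum() + eps)
        recall = tp / float(true.sum() + eps)
        b2 = beta ** 2
        return (1 + b2) * precision * recall / (b2 * precision + recall + eps)

    # Example: a prediction that covers the ground truth plus some extra voxels.
    true = np.zeros((32, 32, 32), dtype=bool)
    true[8:24, 8:24, 8:24] = True
    pred = np.zeros_like(true)
    pred[8:24, 8:24, 4:28] = True
    print("F_0.5 = %.3f" % f_beta(pred, true))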

You should also be aware that Google released their implementation a few weeks ago. If you just want good results and aren't that concerned with having a simple sandbox to experiment with FFN-like architectures, it's probably a better choice than diluvian for you. Multi-segmentation consensus and FFN-based merging are both crucial to the quality of FFN results reported in Google's original paper; diluvian doesn't implement either of these.
