
Fine tuning this model #56

Open
Meight opened this issue Nov 6, 2018 · 43 comments

Comments

@Meight
Contributor

Meight commented Nov 6, 2018

Has anyone been able to successfully fine-tune this model at all, say, starting from an Xception backbone pretrained only on ImageNet?

After three weeks of tweaking and exploring, a good dozen different loss functions and many more runs with a wide range of hyperparameters (including values around those of the original paper), I still can't get the model to even overfit on a small batch from the raw Pascal VOC dataset. Consequently, I haven't been able to reproduce the original paper's results by fine-tuning this repo's model so far.

I triple-checked and unit-tested my preprocessing pipeline, which in turn is just copy/pasted from the original repo, and here are the kinds of results I get during the training phase:

[Image: feature maps during training]

(The bottom-right picture is just the argmax over all classes.)

The model does converge toward the same loss value when using pixelwise cross-entropy with logits (I tried all the possible variations of that, whether by adding a softmax activation in the model or by using TF's native function tf.nn.softmax_cross_entropy_with_logits_v2) with different hyperparameters, but it doesn't even begin to perform proper segmentation. I've also tried @bonlime 's cost function as shared in this reply, and several variations of soft Dice loss, but the results aren't any better.
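
For reference, the pixelwise cross-entropy on logits described above can be written as the following minimal sketch, assuming TF 1.x tensors of shape (batch, height, width, num_classes) for both the one-hot labels and the raw logits:

import tensorflow as tf

def pixelwise_softmax_ce(onehot_labels, logits):
    # Per-pixel cross-entropy on raw logits; softmax is applied internally.
    per_pixel = tf.nn.softmax_cross_entropy_with_logits_v2(labels=onehot_labels,
                                                           logits=logits)
    # Average over every pixel of every image in the batch.
    return tf.reduce_mean(per_pixel)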

Plotting the different feature maps shows I've successfully loaded the weights of Xception pretrained on ImageNet (the model can totally discriminate objects across images), so this is not a problem.

I'm starting to seriously doubt this model is actually trainable or tunable as is, so I'd be curious to hear whether anyone managed to train it before I dive into its detailed implementation.

@bonlime
Owner

bonlime commented Nov 8, 2018

@Meight
You raised a very good point. After implementing this model I also tried very hard to fine-tune it, but the results were unsatisfyingly bad. I stopped trying at the beginning of the summer.
Are you aware of the Keras problem with fine-tuning? Maybe that's the reason it's impossible to tune this model: http://blog.datumbox.com/the-batch-normalization-layer-of-keras-is-broken/

@bonlime
Owner

bonlime commented Nov 8, 2018

I've managed to successfully fine-tune models from this repo: https://github.com/qubvel/segmentation_models, maybe you can use them as well.

@Meight
Contributor Author

Meight commented Nov 9, 2018

Thank you for the reply! Although I spent so much time on this for no useful result, I'm kind of glad to learn it's not just a stupid mistake I kept missing.

I should have said in my initial post that I came across that story of broken batch normalization (which is kind of crazy to be honest, but that's another debate), but I wasn't so sure, as this issue hadn't occurred in other Keras models I've tried to fine-tune in the past. That could definitely be at least one of this model's problems, though.

I discovered the repository you linked only a few days ago and I still have to adapt it to our workflow. I'm glad to learn you managed to fine-tune these models. On a side note not related to the current repo, I noticed the models are implemented using keras instead of tf.keras. Have you tried/been able to run these models on multiple GPUs?

I would suggest updating the README of this repository to tell people that, as far as we know, the proposed implementation can't be trained or fine-tuned and is only valid for inference for now. Hopefully that will spare people a lot of wasted time if they're not willing to troubleshoot it themselves. I'll submit a pull request for that, if you like.

Thank you again for the reply!

Meight added a commit to Meight/keras-deeplab-v3-plus that referenced this issue Nov 9, 2018
@pluniak

pluniak commented Dec 29, 2018

@Meight @bonlime
Have you tried fine-tuning the whole model or just the last couple of layers? In the link posted by bonlime, they say the problem stems from the fact that frozen batch normalization layers in Keras are not really frozen. If that really is the reason, fine-tuning should work when no layers are frozen. This actually matches my experience with fine-tuning Inception V3 for classification in Keras: poor results when fine-tuning only the last layers; great results when fine-tuning all layers. Of course, this works only if enough training data is available for fine-tuning.
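
To make the two strategies concrete, here is a minimal sketch, assuming this repo's Deeplabv3 constructor in model.py; the "last few layers" cut-off is arbitrary, and which variant behaves correctly with frozen BatchNormalization layers is exactly what the linked post is about:

from model import Deeplabv3  # this repo's model definition

model = Deeplabv3(weights='pascal_voc', backbone='xception')

# Variant A: freeze everything except the last few layers. With the Keras
# versions discussed above, frozen BatchNormalization layers may still use
# mini-batch statistics during training, which is the suspected problem.
for layer in model.layers[:-5]:
    layer.trainable = False

# Variant B: leave every layer trainable, i.e. no frozen layers at all.
for layer in model.layers:
    layer.trainable = True

# Changes to `trainable` only take effect after (re)compiling the model.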

@Meight
Contributor Author

Meight commented Dec 30, 2018

@pluniak I tried both cases, and each time the results were ridiculously poor. I grabbed a native TensorFlow version of DeepLab v3+, used the exact same preprocessing, and quickly got results close to those of the paper. My conclusion is that there was definitely something wrong with the model in this repo, but I stopped wasting time investigating it as soon as @bonlime confirmed he had been having similar issues.

Besides, the state of the art for semantic segmentation has evolved quite significantly since this model was published, and other alternatives perform about as well. There was virtually no reason for my research to invest more time in this.

@pluniak

pluniak commented Jan 1, 2019

@Meight
Many thanks for pointing this out. Surely you saved me a lot of time!

May I ask which other models have emerged since then that performed equally well for you? I'm especially interested in models available in Keras or TF.
Based on the Pascal VOC leaderboard, DeepLab V3+-based models still seem to be state of the art:

http://host.robots.ox.ac.uk:8080/leaderboard/displaylb.php?cls=mean&challengeid=11&compid=6&submid=6103#KEY_FCN-8s-heavy

@rauldiaz
Contributor

rauldiaz commented Jan 4, 2019

Hi,

I was able to fine-tune this network from pre-trained weights a few months ago. I did nothing special, just loaded the model with the pre-trained pascal voc weights and hit train. The only thing in my case is that the number of classes is 120, so the last layer is definitely different. Other than that, the network trains and smoothly converges with great performance.

@trungpham2606

@rdiazgar can you show me some of your results? I am intending to fine-tune this repo's model but became hesitant after reading the author's README.

@rauldiaz
Contributor

rauldiaz commented Jan 8, 2019

@trungpham2606, sorry but I'm afraid I can't show you any results, as they are currently submitted to a conference and hence I must keep them confidential.

What I meant to say in my post is that I certainly had no problems loading this network with pretrained weights and fine-tuning it on a different dataset (KITTI). In my case, I just loaded the DeepLab model with the 'pascal_voc' weights and a different number of categories to classify (120 labels). Then I simply followed standard Keras training with a custom data generator to feed the network, and opted for a small learning rate (1e-3), except for the last layer, which had a learning rate 10x larger (1e-2). This was my fine-tuning strategy and it has worked without any problems so far.
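
A minimal sketch of that strategy, assuming this repo's Deeplabv3 constructor and plain Keras; train_gen is a hypothetical keras.utils.Sequence yielding (images, one-hot masks), and since the model's last layer outputs raw logits (see the discussion further down), the cross-entropy is computed from logits:

from keras import backend as K
from keras.optimizers import SGD
from model import Deeplabv3  # this repo's model.py

def ce_from_logits(y_true, y_pred):
    # The network's last layer has no softmax, so compute the loss from logits.
    return K.categorical_crossentropy(y_true, y_pred, from_logits=True)

# Pascal VOC weights, but a different number of classes (120 here), so the
# final classification layer is trained from scratch.
model = Deeplabv3(weights='pascal_voc', input_shape=(512, 512, 3),
                  classes=120, backbone='xception')

model.compile(optimizer=SGD(lr=1e-3, momentum=0.9), loss=ce_from_logits)

# train_gen is assumed to yield masks of shape (batch, 512, 512, 120).
# The 10x larger learning rate on the last layer described above is not
# expressible with a stock Keras optimizer and is omitted here.
model.fit_generator(train_gen, epochs=30)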

I was also surprised to recently see the README section claiming it can't be fine-tuned. Perhaps they refer to other strategies for fine-tuning, like freezing all but the last layers. I can only say that in my experience, I have not encountered any problems using this network, either training from scratch or fine-tuning from pre-trained weights.

Raul

@trungpham2606

@rdiazgar Oh. First, thank you for your quick response. I will try to fine-tune this model following your fine-tuning pipeline.
Best
Trungpham

@duchengyao

Downgrading from TensorFlow 1.11/1.12 to 1.10 might solve the problem, or not using tf.keras.

@hfurkanbozkurt

hfurkanbozkurt commented Jan 30, 2019

@Meight I am having the same problem. I can tune it a little bit but the accuracy is very bad (less than 0.5) even after a good amount of training time. Did you manage to get at least more than 0.5 accuracy?

@Meight
Contributor Author

Meight commented Feb 2, 2019

@hfurkanbozkurt That is about the range I was able to reach too (~0.47-0.48). When fine-tuning the pure TF implementation I have now, I was able to reach results close to those of the paper. I have no clue what was wrong when I used this Keras implementation, since the same pipeline works flawlessly with the TF implementation without any modification whatsoever.

Seeing some people comment here that they could fine-tune it successfully baffles me, since there also seem to be many people who haven't been able to, and I spent about three weeks on this and probably checked every single line of code 10 times. This will remain a mystery as far as I'm concerned... Good luck if you keep working on it!

@kritiyer

kritiyer commented Feb 12, 2019

I was successfully able to retrain on my custom dataset from the pre-loaded weights (I haven't tried fine-tuning the decoder only).

After combing through the issues on here, here is a list of changes I made:

  1. labels must have shape (image_size, image_size, num_classes), unlike the TF implementation, where labels are (image_size, image_size)
  2. use preprocess_input() from the model module to scale input images to the range [-1, 1] (see the sketch below)
  3. add a sigmoid activation after the last layer of the model (I have a binary segmentation problem, but I think softmax should work too?)
  4. don't use any data augmentation from ImageDataGenerator()

I hope this helps someone!
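
For concreteness, a minimal sketch of steps 1-3 above for a binary problem, assuming the repo's model.py exposes Deeplabv3 and preprocess_input as the comment implies; `images` and `masks` are hypothetical numpy arrays of shape (N, 512, 512, 3) and (N, 512, 512), and whether the VOC checkpoint's last layer is skipped when `classes` differs depends on the repo's weight-loading logic:

import numpy as np
from keras.layers import Activation
from keras.models import Model
from model import Deeplabv3, preprocess_input

# Step 3: append a sigmoid to the logits output for binary segmentation.
base = Deeplabv3(weights='pascal_voc', input_shape=(512, 512, 3), classes=1)
model = Model(inputs=base.input, outputs=Activation('sigmoid')(base.output))

# Step 2: scale images to [-1, 1]; Step 1: labels shaped (H, W, num_classes).
x = preprocess_input(images.astype('float32'))
y = masks.astype('float32')[..., np.newaxis]

model.compile(optimizer='adam', loss='binary_crossentropy')
model.fit(x, y, batch_size=4, epochs=10)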

@trungpham2606

@kritiyer can you provide some of the result images you got? :3

@kritiyer

@trungpham2606 I'm working with medical image data, so I'm not comfortable posting the images here, but I promise it's working! I did have to use the datumbox Keras fork for frozen BatchNormalization layers (trainable=False) to work properly and give decent results: https://github.com/datumbox/keras/tree/fork/keras2.2.4

@wave-transmitter

@bonlime @Meight Hello, just to make it totally clear: is it possible to train a model end-to-end (without any frozen layers) starting from the VOC weights? If not, do you have any idea why this happens?

@kritiyer @rdiazgar Can you please share some results in terms of mIoU, and elaborate a bit more on the steps you followed to train the model? E.g. why should one not use ImageDataGenerator()?

@rauldiaz
Contributor

rauldiaz commented Feb 27, 2019

Hi @wave-transmitter ,

Yes, it is possible to train this model end-to-end without any frozen layers. I have successfully used it with both the mobilenetv2 and xception backbones, from scratch, from the pascal_voc weights, and even from the cityscapes weights (see #67). The dataset I used to train my model is not Pascal VOC, but KITTI.

Unfortunately, I cannot share any results as of now because my work is under a conference confidentiality policy. I will certainly post some results when the conference proceedings become public.

In my personal case, I instantiated the model with or without the pre-trained weights, never froze a layer, and trained the model via a custom image data generator that feeds the images (normalized by 1./255) and their corresponding ground truth values. I did not use the ImageDataGenerator available in Keras, but I see no reason why this should be the problem.
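
A minimal sketch of such a custom generator, assuming keras.utils.Sequence, hypothetical path lists, and a user-supplied load_fn that reads a file into a numpy array:

import numpy as np
from keras.utils import Sequence

class SegmentationSequence(Sequence):
    def __init__(self, image_paths, label_paths, batch_size, load_fn):
        self.image_paths = image_paths
        self.label_paths = label_paths
        self.batch_size = batch_size
        self.load_fn = load_fn  # reads a file and returns a numpy array

    def __len__(self):
        return int(np.ceil(len(self.image_paths) / float(self.batch_size)))

    def __getitem__(self, idx):
        sl = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        images = np.stack([self.load_fn(p) for p in self.image_paths[sl]])
        labels = np.stack([self.load_fn(p) for p in self.label_paths[sl]])
        return images / 255.0, labels  # inputs normalized by 1./255, as above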

Best,
Raul

@kritiyer

kritiyer commented Feb 27, 2019

@wave-transmitter Hello, I also successfully trained using both Mobilenet and Xception (from the Pascal weights), and was able to fine-tune the decoder as well as train from scratch with frozen batch normalization layers (I don't have enough GPU memory to train the batch normalization layers). So far the best Dice score I got for a binary classification problem is 0.97.

I used an ImageDataGenerator to feed in my data because it was too large to load in memory, but if I used any of the data augmentation arguments (rotate, shear, flip, etc) I got garbage results and I'm not sure why. I listed the steps I took to train in my comments above. I'm using tensorflow-gpu 1.10 and keras 2.2.4 (datumbox fork, linked above).
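
A rough sketch of feeding image/mask pairs with plain ImageDataGenerator instances and no augmentation arguments; the directory layout is hypothetical, preprocess_input is assumed to come from the repo's model.py as mentioned above, and the shared seed keeps images and masks aligned:

from keras.preprocessing.image import ImageDataGenerator
from model import preprocess_input

# No augmentation arguments at all (no rotation, shear, flips, ...).
image_datagen = ImageDataGenerator(preprocessing_function=preprocess_input)
mask_datagen = ImageDataGenerator()

seed = 1  # identical seed so both generators yield files in the same order
# Note: flow_from_directory expects each directory to contain one subfolder
# holding the actual files.
image_gen = image_datagen.flow_from_directory(
    'data/train/images', class_mode=None, target_size=(512, 512),
    batch_size=4, seed=seed)
mask_gen = mask_datagen.flow_from_directory(
    'data/train/masks', class_mode=None, color_mode='grayscale',
    target_size=(512, 512), batch_size=4, seed=seed)

# Yields (images, masks) batches; pass to fit_generator with an explicit
# steps_per_epoch, since zip() has no length.
train_generator = zip(image_gen, mask_gen)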

@Licini

Licini commented Feb 27, 2019

Hi @rdiazgar,

Would you mind also sharing which optimizer and loss function you were using? Thanks in advance!

@rauldiaz
Contributor

Hi @Licini,

Sure. I simply used SGD with momentum=0.9 and a learning rate of 0.001. The loss is cross-entropy.

@wave-transmitter

wave-transmitter commented Feb 28, 2019

Thank you both for your detailed answers.

@rdiazgar Is it possible to share your model's accuracy in terms of IoU? No need to share inference results. Also, for how many epochs did you train the model end-to-end, and what batch size did you use?

@kritiyer Can you please also let us know about your choices regarding the optimizer, the learning rate and the batch size? Similarly, for how many epochs did you train your model?

@rauldiaz
Contributor

Hi @wave-transmitter,

Truth be told, I am not using this model for semantic segmentation, so I don't have any quantitative measure of intersection over union. I am training this model for monocular depth estimation.

I trained the model for about 30 epochs with a batch size of 4, which is about 300k iterations for the KITTI training set. The input images are random crops of 375x513 pixels.

Raul

@Licini

Licini commented Mar 1, 2019

@rdiazgar Thanks for sharing! I was able to retrain a simple two-class version using mobilenetv2 with no frozen layers, and it worked pretty well. For anyone who's interested: I was using binary cross-entropy with one object class and one background class. My dataset was about 8k images without any augmentation, trained for 10 epochs with a batch size of 8. I don't have any IoU measurements yet, but it at least looks right to my eyes.

@pluniak

pluniak commented Mar 11, 2019

@kritiyer @rdiazgar @Licini
Thanks for your input!
Can you please tell us which versions of TF/Keras you were using?

Philipp

@Licini

Licini commented Mar 11, 2019

sure @pluniak , I was using keras 2.2.4 with tensorflow-gpu 1.8.0

@rauldiaz
Contributor

@pluniak

I used Keras 2.2.4, with tensorflow-gpu 1.9.0 on one machine and 1.12.0 on another.

@pluniak

pluniak commented Mar 22, 2019

I have also successfully fine-tuned this model. I did nothing special: TF 1.13.1 (GPU), Keras 2.2.4, binary_crossentropy, Adamax (default params), labels of shape (height, width, num_classes). Keras' ImageDataGenerator and class_weights also work. I passed in numpy arrays. It converges quickly with reasonable performance.

@kritiyer @rdiazgar @Licini
One thing that surprised me, though, is that there is no sigmoid or softmax activation in the last layer, so output values range from -40 to +10 in my case (1 class only). Thresholding these raw values at 0.5 gives me better results than adding a sigmoid activation after the last layer, because hardly any output values exceed 0.5 after the sigmoid. Did anybody else experience the same tendency towards negative output values? Where does this come from? Training longer on my limited training set doesn't help. I'm dividing pixel values by 255.
It is also interesting that IoU/Jaccard on validation data is at the same level as on training data. The model converges quickly but doesn't overfit at all. Any explanations for this? Is it possibly a model bias problem?
I'd be glad for some comments ... :-)

@rauldiaz
Contributor

rauldiaz commented Mar 22, 2019

Hi @pluniak ,

Regarding the lack of activation in the last layer, I believe this is just for convenience. If you want to classify your pixels via a softmax function, all the softmax does is turn the raw output logits into a probability distribution. That is only useful from a training point of view, because those probabilities are used for computing the loss function (e.g., cross-entropy). At test time, you only care about which output logit has the highest value (argmax), and you don't need to apply the softmax to find that. Plus, by not using softmax at test time, you save some computation, because exponentials and logarithms are relatively expensive operations.

If you check Keras' docs and code, you'll see that most of the loss functions have an optional parameter named 'from_logits' that accounts for exactly that: when True, the loss applies a softmax before computing the cross-entropy; when False, it assumes the network's last layer already includes a softmax.
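
For illustration, a small sketch assuming a recent tf.keras (TF 1.14+/2.x); `logits` and `y_true_onehot` are hypothetical tensors standing for the network's raw, activation-free output and the one-hot ground truth:

import tensorflow as tf

# Training: let the loss apply the softmax internally, on raw logits.
loss_fn = tf.keras.losses.CategoricalCrossentropy(from_logits=True)
loss = loss_fn(y_true_onehot, logits)

# Inference: argmax over the logits gives the class map directly;
# applying a softmax first would not change the argmax.
pred_classes = tf.argmax(logits, axis=-1)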

Best
Raul

@pluniak

pluniak commented Apr 5, 2019

@rdiazgar Thanks. Makes sense :-)

@pluniak pluniak mentioned this issue Apr 5, 2019
@trungpham2606

@rdiazgar Hello bro, I want to ask you about the preprocessing part. Did you normalize the images to [-1, 1] or to some other range?
I tested normalizing images to the range [-1, 1], and the results I got were very poor.

@rauldiaz
Contributor

Hi,

The range [0, 1] worked best for me. You're right, [-1, 1] gave me worse results.

@trungpham2606

@rdiazgar
Hello bro,
I saw that in train.py (old version) the author provided code to load .npy weights. I don't know what the difference is between setting weights='pascal_voc' and using that code?

@rauldiaz
Contributor

I never used the train.py script of this repo. I have my own training script and I simply instantiate the DeepLabv3+ model. The 'pascal_voc' weights are simply a model checkpoint that has learnt to segment images from the Pascal VOC dataset. You can also use weights='cityscapes' to start your training from a pre-trained checkpoint oriented toward autonomous driving.

@trungpham2606

@rdiazgar Oh, thanks rdiazgar. I will try.

@FreedomGu

Hi @rauldiaz,
I ran into a problem with the labels: should I pass the labels to model.fit() with shape (number_of_images, image_height, image_width, classes)?
I am confused because the result I get from .predict() is all zeros, yet the reported accuracy is still very high.

@rauldiaz
Contributor

rauldiaz commented May 8, 2019

If I understand your question correctly, you're asking what shape your ground-truth labels should be, right?

That depends on what the loss function needs. For instance, sparse_categorical_crossentropy expects the labels to be simply the number associated with each class, while categorical_crossentropy expects the labels to be one-hot coded vectors for each class.

In a segmentation scenario like this, if you are using categorical_crossentropy as a loss function, the shape of your labels should be (batch_size, image_height, image_width, classes). If you choose the sparse loss version, the shape should be (batch_size, image_height, image_width, 1).
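
As a concrete illustration of those two label layouts (shapes only, with dummy arrays):

import numpy as np

batch_size, height, width, num_classes = 4, 512, 512, 21

# categorical_crossentropy: one-hot encoded labels.
y_onehot = np.zeros((batch_size, height, width, num_classes), dtype='float32')

# sparse_categorical_crossentropy: integer class ids per pixel.
y_sparse = np.zeros((batch_size, height, width, 1), dtype='int32')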

@pissw2016

pissw2016 commented Jul 3, 2019

Hi, has anyone managed to fine-tune on VOC successfully?
I am trying to reproduce the result of DeepLabv3+ without the decoder; according to the paper it is 81.34 (train output stride 16, eval output stride 8).
I just froze the whole encoder and got:

end (Dropout)             (None, 64, 64, 256)    0       activation_76[0][0]
conv_upsample (Conv2D)    (None, 64, 64, 21)     5397    end[0][0]
lambda_1 (Lambda)         (None, 512, 512, 21)   0       conv_upsample[0][0]
reshape_1 (Reshape)       (None, 262144, 21)     0       lambda_1[0][0]
pred_mask (Activation)    (None, 262144, 21)     0       reshape_1[0][0]

Total params: 41,093,045
Trainable params: 5,397
Non-trainable params: 41,087,648

First epoch result:

1464/1464 [==============================] - 1594s 1s/step - loss: 1.4667 - Jaccard: 0.4825 - sparse_accuracy_ignoring_last_label: 0.7597 - val_loss: 0.4599 - val_Jaccard: 0.7528 - val_sparse_accuracy_ignoring_last_label: 0.9455
which is really weird.
Jaccard is roughly equivalent to mIoU (details from Golbstein).

BN depends on the data, so when the data distribution changes, the BN parameters should change too. Whether the layers are frozen is not the real issue; if the data changes, as in transfer learning, I believe the BN layers should not be frozen, so they can capture the distribution of the new data.

So I think fine-tuning might require data and data augmentation exactly the same as in the first-stage training.
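
A small sketch of that idea (BatchNormalization left unfrozen while the rest of the encoder is frozen), assuming this repo's Deeplabv3 constructor:

from keras.layers import BatchNormalization
from model import Deeplabv3  # this repo's model.py

model = Deeplabv3(weights='pascal_voc', backbone='xception')

for layer in model.layers:
    # Keep BatchNormalization trainable so its parameters and statistics can
    # adapt to the new data distribution; freeze everything else.
    layer.trainable = isinstance(layer, BatchNormalization)

# Recompile after changing `trainable`, otherwise the flags have no effect.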

@zhangbo2008

May I ask which paper you want to reproduce?

@lauraset

lauraset commented Nov 8, 2019

Hi @rauldiaz
I successfully fine-tuned this model on my own dataset. But when I checked the detailed network structure, I found obvious differences between this model's backbone and the original Xception (in keras.applications). They are as follows:

  1. in the entry flow, all max-pooling layers are replaced with separable convolutions
  2. the number of middle-flow blocks is changed to 16, while the original Xception has 8
  3. in the exit flow, the average pooling is removed.

I am not sure what effect these changes have on the final results.

@rauldiaz
Contributor

rauldiaz commented Nov 21, 2019

Hi @lauraset ,

Your question seems more targeted to the original author of this repository (@bonlime), rather than to me.

@pimonteiro

@rauldiaz Hello! Sorry for digging up such an old thread, but I'm getting really bad results with the mobilenetv2 version and the cityscapes weights. The Xception version works amazingly well, but mobilenetv2 returns a very blurry segmentation. The dataset I'm using is KITTI-360.

The only thing I modified was line 172 of the model:

#in_channels = inputs.shape[-1].value # inputs._keras_shape[-1]
in_channels = inputs.shape.as_list()[-1]

because it was causing an error while creating the model (the change was suggested on issue #125 ).

Did you go through something similar?

@Thunder003

Hey @Meight, would you please share the steps you followed in fine-tuning the official DeepLabv3+ repo? I was tuning it for two classes (background + one foreground), but after some iterations all of my image pixels start to take a single value (1 in my case).
Also, please let me know which TensorFlow version you used.
