Issue with TFImageNet Example #100
Comments
Thanks Rahul! I suspect the skipped tasks are not a problem, but I'm not positive (I think when a task is "skipped", its result has already been cached, so the result is reused). The MNIST example doesn't use all of the GPUs right now; TensorFlow should be able to do this, and right now the way to do this is probably to modify

Also, we merged the TensorFlow version of ImageNet prematurely. I'll fix it up soon.
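Skipped stages in the Spark UI generally mean exactly that: the stage's output was already available, e.g. from a cached RDD or from shuffle files left by an earlier job, so Spark did not recompute it. A minimal standalone Scala sketch, not SparkNet code, that produces skipped stages:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Standalone sketch (not SparkNet code): once a stage's output exists --
// here, the shuffle output behind a cached reduceByKey -- later jobs that
// need it show the stage as "skipped" in the web UI.
object SkippedStagesDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SkippedStagesDemo"))
    val counts = sc.parallelize(1 to 1000000)
      .map(x => (x % 100, 1))
      .reduceByKey(_ + _)
      .cache()
    counts.count() // first action runs every stage
    counts.count() // second action reuses the cached/shuffled output; earlier stages show as skipped
    sc.stop()
  }
}
```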
This is running correctly, so closing this. But I want to remark that it is extremely slow compared to SparkNet with Caffe.

Caffe:

```
root@ip-172-31-30-96:~/SparkNet# cat imagenet5.txt | grep accuracy
```

TensorFlow:

```
root@ip-172-31-30-96:~/SparkNet# cat tensorflowImageNet5.txt | grep accuracy
```
Hmm... those numbers look too good. I can replicate it, and it seems like there's a bug with the ImageNet example.
The bug is fixed by #115. GPUs are enabled again for Caffe.
Hello Robert,

```
230.758, i = 0: 0.09% accuracy
```

For a 3-slave g2.8xlarge cluster with Caffe, do you think this is reasonable? Also, should I expect a speedup with more slaves, or do you feel the current number of training images is too small to achieve good accuracy? If it is too small, what is a reasonable number of images that would provide good results while not causing memory issues in the system?

Thanks,
The default settings in the app were aimed at being simple to run but not optimal for training. I'd suggest setting

```scala
var trainRDD = loader.apply(sc, "ILSVRC2012_img_train/train.000", "train.txt", fullHeight, fullWidth)
```

that is, remove a 0 from the path for `trainRDD`.
Ok, there was another oversight. We weren't caching the training RDD.
Hello Robert,

Thank you for the suggestions. I have started a run with them and will leave it overnight; I'll report the findings tomorrow. I will also try again tomorrow with the new caching.

Thanks,
Ok, sounds good. The caching is actually pretty important; recomputing everything on each pass could become much more costly with more data.
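To illustrate, a sketch that assumes the `trainRDD` from the snippet above; the `StorageLevel` choice is an assumption, not SparkNet's actual setting:

```scala
import org.apache.spark.storage.StorageLevel

// Sketch, assuming the trainRDD defined earlier: persist the preprocessed
// training data so each pass reuses it instead of re-reading and
// re-preprocessing every image.
val cachedTrain = trainRDD.persist(StorageLevel.MEMORY_AND_DISK)
cachedTrain.count() // materialize the cache once, before training starts
```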
Here's the log for a run I did overnight with two g2.8xlarge workers. It looks like it gets to 10% accuracy at around 60 iterations (roughly 7500 s).
Hello Robert,

Thank you for sharing the logs. Given that we are facing memory issues when running on the entire ImageNet data, it will be hard to get above this 10% accuracy. Also, your earlier observation about the severity of not caching did play out that way: with more data, not caching resulted in extremely slow training. While in your case you were able to complete 60 iterations in around 7500 s, without caching I was only able to complete 20 iterations.

Thanks,
Hello Robert,

I ran the CIFAR-10 example for a longer time, and I see that the accuracy is stuck at around 65%. Is this expected? syncInterval is 50, as per your previous suggestion, on a 3-slave GPU cluster.

```
54.481, i = 0: 11.47% accuracy
```
Hey Rahul, we are currently not subtracting the mean image (doing so would get you to similar performance as the cifar_quick model from here: http://caffe.berkeleyvision.org/gathered/examples/cifar10.html), and we are also not flipping the images during training, which should give a further boost. See the ImageNet example for how to subtract the mean; for flipping, I'd recommend either augmenting the training set or implementing your own preprocessor. Let us know if you need any help! -- Philipp.
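For reference, a rough sketch of both preprocessing steps; the flat H x W x C `Float` layout and the helper names here are assumptions for illustration, not SparkNet's actual preprocessor API:

```scala
import scala.util.Random

// Subtract a precomputed mean image, element-wise. Assumes both arrays
// use the same flat layout and length.
def subtractMean(image: Array[Float], mean: Array[Float]): Array[Float] = {
  require(image.length == mean.length, "image and mean image must have the same size")
  image.zip(mean).map { case (pixel, m) => pixel - m }
}

// Randomly mirror an H x W x C image horizontally with probability 0.5,
// a common training-time augmentation.
def maybeFlipHorizontal(image: Array[Float], h: Int, w: Int, c: Int): Array[Float] = {
  require(image.length == h * w * c, "image size must match h * w * c")
  if (Random.nextBoolean()) {
    val out = new Array[Float](image.length)
    for (y <- 0 until h; x <- 0 until w; ch <- 0 until c) {
      out((y * w + (w - 1 - x)) * c + ch) = image((y * w + x) * c + ch)
    }
    out
  } else image
}
```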
Hello,
Nice to see the integration with TensorFlow and GPUs back 👍
I set up the cluster with the new AMI and was able to run the MNIST example. It ran successfully, but in the Spark web UI I could see a lot of skipped jobs.
I hope that is not a problem. Also, can the MNIST example use all the GPUs?
Further, for the TFImageNetApp, I ran into the following error. My ImageNetApp (Caffe) used to work correctly with my S3 bucket.
Command
```
/root/spark/bin/spark-submit --class apps.TFImageNetApp /root/SparkNet/target/scala-2.10/sparknet-assembly-0.1-SNAPSHOT.jar 2 sparknetdivideo
```
Error
```
java.lang.IllegalArgumentException: The data and shape arguments are not compatible, data.length = 196608 and shape = Array(227, 256, 256).
	at libs.NDArray$.apply(NDArray.scala:55)
	at libs.ImageNetTensorFlowPreprocessor$$anonfun$convert$16.apply(Preprocessor.scala:131)
	at libs.ImageNetTensorFlowPreprocessor$$anonfun$convert$16.apply(Preprocessor.scala:122)
	at libs.TensorFlowNet$$anonfun$loadFrom$1.apply$mcVI$sp(TensorFlowNet.scala:64)
	at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
	at libs.TensorFlowNet.loadFrom(TensorFlowNet.scala:63)
	at libs.TensorFlowNet.forward(TensorFlowNet.scala:74)
	at apps.TFImageNetApp$$anonfun$7$$anonfun$apply$2.apply$mcVI$sp(TFImageNetApp.scala:106)
	at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
	at apps.TFImageNetApp$$anonfun$7.apply(TFImageNetApp.scala:105)
	at apps.TFImageNetApp$$anonfun$7.apply(TFImageNetApp.scala:102)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:706)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:706)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
	at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
	at org.apache.spark.scheduler.Task.run(Task.scala:88)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
```
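For what it's worth, the mismatch is visible from the numbers alone: 196608 = 3 x 256 x 256, i.e. one 256 x 256 RGB image, while the declared shape Array(227, 256, 256) would require 227 x 256 x 256 = 14,876,672 elements. A minimal sketch of the kind of consistency check that throws here (the NDArray internals are assumed, not taken from SparkNet's source):

```scala
// Hypothetical illustration of the failing check: a flat data buffer
// must contain exactly shape.product elements.
val data = new Array[Float](3 * 256 * 256) // a 256 x 256 RGB image: 196608 floats
val shape = Array(227, 256, 256)           // product = 14876672, which != 196608
require(data.length == shape.product,
  s"The data and shape arguments are not compatible, " +
  s"data.length = ${data.length} and shape = Array(${shape.mkString(", ")}).")
```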