Some Extractors fail with ConcurrentModificationException #9
This is a common problem with Spark, because closures (the anonymous functions passed into RDD.map(...)) are serialized by Spark before being distributed to the workers. Hadoop does not have this problem because it binary-serializes the whole .jar and copies it over the network. Spark uses Java serialization by default, but it is very slow compared to, say, Kryo. So we serialize the closures with Kryo through a wrapper (Spark doesn't support Kryo serde for closures, not yet).
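The wrapper idea can be sketched roughly like this. This is a minimal, hypothetical sketch rather than the project's actual code: `ClosureBox` is an invented name, and plain Java serialization stands in for Kryo on the inside (the real point of the wrapper is to route the closure's bytes through a faster serializer than the one the outer system applies):

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Holds the wrapped closure pre-serialized as a byte array, so whatever
// serializer Spark later applies to the task only ever sees plain bytes.
// In the real project the bytes would come from Kryo; Java serialization
// stands in for it here.
class ClosureBox[T <: AnyRef](@transient private var value: T) extends Serializable {

  // Eagerly encode the wrapped value at construction time.
  private val bytes: Array[Byte] = {
    val buf = new ByteArrayOutputStream
    val out = new ObjectOutputStream(buf)   // Kryo in the real implementation
    out.writeObject(value)
    out.close()
    buf.toByteArray
  }

  // Decode lazily on first access; after deserialization on a worker the
  // @transient field is null, so we rebuild it from the byte array.
  def get: T = {
    if (value == null) {
      val in = new ObjectInputStream(new ByteArrayInputStream(bytes))
      value = in.readObject().asInstanceOf[T]
    }
    value
  }
}
```

Because the closure travels as an opaque `Array[Byte]`, the serializer that ships the task only needs to understand byte arrays.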
Phew. I just found the bug. If you look carefully, this problem occurs in every class that keeps a reference to the context object inside the mapper closure: context.language is being called inside extract. Now, the extract method is what acts as the mapper for Spark, so it needs to be serialized along with everything it references, which means Spark attempts to serialize the whole DistributedExtractionContext object, and fails because it contains complex, non-serializable members.
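Here is a small self-contained demonstration of that failure mode, with hypothetical stand-in classes (`Context`, `HeavyMember`) in place of the real DistributedExtractionContext, and Java serialization in place of Spark's closure serializer:

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

class HeavyMember                      // not Serializable: stands in for the complex members
class Context(val language: String) {  // stands in for DistributedExtractionContext
  val heavy = new HeavyMember
}

object CaptureDemo {
  // True iff `obj` survives Java serialization: effectively the check
  // Spark performs on every closure it ships to the workers.
  def isSerializable(obj: AnyRef): Boolean =
    try {
      new ObjectOutputStream(new ByteArrayOutputStream).writeObject(obj)
      true
    } catch { case _: NotSerializableException => false }

  // Returns (bad closure serializable?, good closure serializable?)
  def demo(): (Boolean, Boolean) = {
    val context = new Context("en")

    // BAD: referencing context.language inside the function captures the
    // whole Context, dragging HeavyMember into serialization.
    val bad: String => String = page => context.language + ":" + page

    // GOOD: copy the one needed field into a local val first; the closure
    // then captures only that String.
    val language = context.language
    val good: String => String = page => language + ":" + page

    (isSerializable(bad), isSerializable(good))
  }

  def main(args: Array[String]): Unit =
    println(demo())
}
```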
I propose the following options to solve this:

(1) Instead of passing the context into each extractor, we could have each extractor instantiate what it needs directly. But I think instantiating the classes could become a huge problem? It won't be as streamlined as just passing in the context as a parameter.

(2) Simply adding vals for the specific context fields each extractor uses.

@jimkont @sangv Let me know what you think and I'll make a pull request to extraction-framework master.
I am more comfortable with solution (2) at the moment, but I understand your doubts. I think you can add some comments to the source code explaining why those fields were added, for future reference.
(1) would cause big issues across the whole framework, so I vote for (2) too. But let's create vals only where needed for the moment, e.g. in the extractors that actually use them. Not all extractors need the whole context, so I wouldn't force that on all of them.

Kontokostas Dimitris
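A rough sketch of what option (2) with "vals only where needed" could look like, using invented stand-in types (`ExtractionContext`, `Language`) rather than the framework's real ones:

```scala
case class Language(wikiCode: String)  // case classes are Serializable by default

trait ExtractionContext {
  def language: Language
  def ontology: AnyRef                 // imagine something big and non-serializable
}

class PageIdExtractor(context: ExtractionContext) extends Serializable {
  // `context` is a plain constructor parameter, not a val. Since it is only
  // used to initialize the val below, the compiler keeps no field pointing
  // at it, so serializing the extractor touches nothing but `language`.
  private val language = context.language

  def extract(title: String): String =
    s"${language.wikiCode}:$title"
}
```

If `extract` later referenced `context` directly, the parameter would silently be retained as a field again and the serialization problem would return, which is why a source comment explaining these vals (as suggested above) seems worthwhile.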
Thanks Andrea. Yes, going with (1) would cause ugly problems.
OK, I'll send a pull request for this after testing all the extractors.
I ran into the same issue when running extractors on English language dumps using the nildev2 branch (probably more serializers need to be registered).
@sangv can you post the entire error message (it'll probably be big)? That's weird. I just tested nildev2 with the liwiki dump and it worked fine. I can't test the English language right now because I haven't downloaded the huge dump yet. Also, where are you running this? On your PC or on a server?
BTW, you did not include InfoboxExtractor, right? That one doesn't work yet (the only broken one) because it maintains mutable state. Apart from the state, it also references objects that cannot be serialized.
```
en: extracted 9620000 pages in 76:51.307s (per page: 0.479345841995842 ms)
14/06/20 16:05:08 WARN storage.BlockManager: Putting block rdd_5_3611 failed
14/06/20 16:05:08 ERROR executor.Executor: Exception in task ID 13245
com.esotericsoftware.kryo.KryoException: java.util.ConcurrentModificationException
Serialization trace:
classes (sun.misc.Launcher$AppClassLoader)
classLoader (org.apache.hadoop.conf.Configuration)
hadoopConf (org.dbpedia.extraction.dump.extract.DistConfig)
distConfig (org.dbpedia.extraction.dump.extract.DistConfigLoader)
org$dbpedia$extraction$util$Finder$$wrap finder$1 (org.dbpedia.extraction.dump.extract.DistConfigLoader$$anon$1)
	at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:523)
	... (several more writeObject frames)
	at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:599)
	at org.apache.spark.storage.DiskStore.putValues(DiskStore.scala:69)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
	at org.apache.spark.storage.MemoryStore.tryToPut(MemoryStore.scala:157)
	at org.apache.spark.storage.MemoryStore.putValues(MemoryStore.scala:75)
	at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:557)
	at org.apache.spark.storage.BlockManager.put(BlockManager.scala:482)
	at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:76)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
	at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:109)
	at org.apache.spark.scheduler.Task.run(Task.scala:53)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
	at java.lang.Thread.run(Thread.java:722)
Caused by: java.util.ConcurrentModificationException
	at java.util.Vector$Itr.checkForComodification(Vector.java:1156)
	at java.util.Vector$Itr.next(Vector.java:1133)
	at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:523)
	... 48 more
14/06/20 16:05:08 WARN scheduler.TaskSetManager: Lost TID 13245 (task ...)
14/06/20 16:05:08 WARN scheduler.TaskSetManager: Loss was due to com.esotericsoftware.kryo.KryoException
	(same serialization trace and stack as above)
14/06/20 16:05:08 ERROR scheduler.TaskSetManager: Task 3613.0:0 failed 1
java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.lang.reflect.Method.invoke(Method.java:601)
	at scala_maven_executions.MainHelper.runMain(MainHelper.java:164)
Caused by: org.apache.spark.SparkException: Job aborted: Task 3613.0:0
	(same serialization trace as above)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
	at org.apache.spark.scheduler.DAGScheduler...
	at scala.Option.foreach(Option.scala:236)
	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
	at akka.actor.ActorCell.invoke(ActorCell.scala:456)
	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
	at akka.dispatch.Mailbox.run(Mailbox.scala:219)
	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
```
@sangv I just made a new pull request. Can you run the same test on the milestone3 branch and confirm whether you can still reproduce this error? Thanks.
I will try it later this week, but you should just try pulling down the English dump.
OK, I'll try that; it will probably take a couple of days to download the whole 10.8 GB dump. But this kind of error should be reproducible over any language dump, which is a puzzle. And here's something else: the excerpt `distConfig (org.dbpedia.extraction.dump.extract.DistConfigLoader)` followed by `finder$1 (org.dbpedia.extraction.dump.extract.DistConfigLoader$$anon$1)` in your error tells what's happening. Spark tries to serialize a Finder instance inside DistConfigLoader, and recursively tries to serialize the DistConfig instance. This should not be happening with the way the code is right now, at least not in milestone3. I could easily add a serializer for Hadoop's Configuration.
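For reference, `@transient` is one way to keep a field like this out of serialization entirely. The following is a minimal sketch with made-up classes that merely borrow the real names (this is not the project's actual DistConfigLoader), showing both the benefit and the cost: the field comes back as null after deserialization and has to be rebuilt on the worker:

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

class DistConfig(val mirror: String)  // pretend this holds non-serializable Hadoop state

class DistConfigLoader(@transient val config: DistConfig) extends Serializable {
  // After deserialization on a worker, `config` is null; a real loader
  // would have to reconstruct it there.
  def mirrorOrDefault: String =
    Option(config).map(_.mirror).getOrElse("<must be rebuilt on worker>")
}

object TransientDemo {
  // Serialize and deserialize with plain Java serialization.
  def roundTrip[T <: AnyRef](obj: T): T = {
    val buf = new ByteArrayOutputStream
    val out = new ObjectOutputStream(buf)
    out.writeObject(obj)
    out.close()
    new ObjectInputStream(new ByteArrayInputStream(buf.toByteArray))
      .readObject().asInstanceOf[T]
  }

  def main(args: Array[String]): Unit = {
    val loader = new DistConfigLoader(new DistConfig("http://dumps.wikimedia.org"))
    // Round-tripping succeeds even though DistConfig itself is not
    // Serializable, because the @transient field is skipped entirely.
    val copy = roundTrip(loader)
    println(copy.mirrorOrDefault)
  }
}
```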
@jimkont @sangv is it possible to get temporary remote access to a machine with high bandwidth? Something with 2-4 cores and 6-8 GB RAM? We can think about clusters later, but I think having a single test server would be nice: downloading dumps and ad-hoc testing would be less cumbersome. :-/ At home I have download caps on high-speed internet, and it's costly.
I have tested the following extractors so far (it does not matter whether they run together in a pipeline or separately; the results are the same):

Of the above, only RedirectExtractor and PageIdExtractor work properly, and their outputs match those of the original extraction-framework. Adding any of the other three into the pipeline causes this: