
Makes ArchiveRecordImpl serializable #316

Merged — 1 commit merged into master from serialize-ArchiveRecordImpl on Apr 22, 2019
Conversation

@jrwiebe (Contributor) commented Apr 18, 2019

What does this Pull Request do?

Makes class ArchiveRecordImpl serializable by removing non-serializable ARCRecord and WARCRecord variables. Also removes unused headerResponseFormat variable.
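For readers unfamiliar with the pattern, a minimal sketch of the general idea follows; it is not the actual aut diff, and the class name is an illustrative assumption. The trick is to copy the values needed downstream out of the non-serializable record while the object is being constructed, and never to keep the record itself as a member:

// Illustrative sketch only; names are assumptions, not the aut implementation.
// The non-serializable record is a constructor parameter that is read only
// during initialization, so it never becomes a field and never needs to be
// serialized by Spark.
class RecordSnapshot(rec: org.archive.io.ArchiveRecord) extends Serializable {
  val url: String = rec.getHeader.getUrl
  val crawlDate: String = rec.getHeader.getDate
  // rec is not referenced after construction, so instances of RecordSnapshot
  // can be written to disk without a NotSerializableException.
}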

How should this be tested?

Prior to this commit, the following code would fail with a NotSerializableException; with this change it works:

import io.archivesunleashed._
import io.archivesunleashed.app._
import io.archivesunleashed.matchbox._
import org.apache.spark.storage.StorageLevel._

sc.setLogLevel("DEBUG")

val validPages = RecordLoader
                  .loadArchives("/path/to/warcs/*.warc.gz", sc)
                  .keepValidPages()
                  .persist(DISK_ONLY) // crucial line
                  .map(r => ExtractDomain(r.getUrl))
                  .countItems()
                  .saveAsTextFile("/writable/path/all-domains/output")

Additional Notes:

Caching RDDs to disk may be useful, but it is not a solution to out-of-memory issues discussed by @ruebot in #aut on Slack.
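For context, here is a standalone illustration of the Spark storage levels involved; it is generic Spark usage, not aut-specific code, and the RDD is a toy stand-in for the output of keepValidPages(). DISK_ONLY serializes every partition and writes it to local disk, while MEMORY_AND_DISK keeps what fits in memory and spills the rest; in both cases records are serialized when they are written out, which is why the record class must be Serializable and why very large records can still exhaust the heap during the write.

import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel._

// In spark-shell, sc already exists; it is created here only so the snippet
// stands on its own.
val sc = new SparkContext("local[*]", "storage-level-demo")
val records = sc.parallelize(Seq("record 1", "record 2", "record 3"))

// An RDD can only be assigned one storage level, so these are alternatives:
records.persist(DISK_ONLY)            // what the example above uses
// records.persist(MEMORY_AND_DISK)   // spill to disk only when memory fills
println(records.count())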

Interested parties

@ruebot

Commit: Makes class ArchiveRecordImpl serializable by removing non-serializable ARCRecord and WARCRecord variables. Also removes unused headerResponseFormat variable.
@ruebot self-requested a review April 18, 2019 21:02
@codecov-io commented Apr 18, 2019

Codecov Report

Merging #316 into master will increase coverage by 0.11%.
The diff coverage is 78.26%.


@@            Coverage Diff             @@
##           master     #316      +/-   ##
==========================================
+ Coverage   75.84%   75.95%   +0.11%     
==========================================
  Files          41       41              
  Lines        1151     1148       -3     
  Branches      202      200       -2     
==========================================
- Hits          873      872       -1     
  Misses        209      209              
+ Partials       69       67       -2
Impacted Files Coverage Δ
...ain/scala/io/archivesunleashed/ArchiveRecord.scala 84.9% <78.26%> (+2.76%) ⬆️


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 8504190...01b8696.

@ruebot (Member) commented Apr 18, 2019

Still getting the heap space error.

Cleared out my ~/.m2, and built the serialize-ArchiveRecordImpl branch on tuna.

Ran the following:

/home/ruestn/spark-2.4.1-bin-hadoop2.7/bin/spark-shell --master local[30] --driver-memory 105g --conf spark.network.timeout=10000000 --conf spark.executor.heartbeatInterval=600s --conf spark.driver.maxResultSize=4g --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.shuffle.compress=true --conf spark.rdd.compress=true -Djava.io.tmpdir=/tuna1/scratch/nruest/tmp --jars /home/ruestn/aut/target/aut-0.17.1-SNAPSHOT-fatjar.jar -i /tuna1/scratch/nruest/auk_collection_testing/10689/133/spark_jobs/10689-cache-issue-316.scala 2>&1 | tee /tuna1/scratch/nruest/auk_collection_testing/10689/133/spark_jobs/10689.scala-tuna-pr-test.log

10689-cache-issue-316.scala is:

import io.archivesunleashed._
import io.archivesunleashed.app._
import io.archivesunleashed.matchbox._
import org.apache.spark.storage.StorageLevel._

sc.setLogLevel("DEBUG")

val validPages = RecordLoader
                  .loadArchives("/tuna1/scratch/nruest/auk_collection_testing/10689/warcs/*.gz", sc)
                  .keepValidPages()
                  .persist(DISK_ONLY)

validPages
  .map(r => ExtractDomain(r.getUrl))
  .countItems()
  .saveAsTextFile("/tuna1/scratch/nruest/auk_collection_testing/10689/133/derivatives/all-domains/output")

validPages
  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(RemoveHttpHeader(r.getContentString))))
  .saveAsTextFile("/tuna1/scratch/nruest/auk_collection_testing/10689/133/derivatives/all-text/output")

val links = validPages
              .map(r => (r.getCrawlDate, ExtractLinks(r.getUrl, r.getContentString)))
              .flatMap(r => r._2.map(f => (r._1, ExtractDomain(f._1)
              .replaceAll("^\\s*www\\.", ""), ExtractDomain(f._2)
              .replaceAll("^\\s*www\\.", ""))))
              .filter(r => r._2 != "" && r._3 != "")
              .countItems()
              .filter(r => r._2 > 5)

WriteGraphML(links, "/tuna1/scratch/nruest/auk_collection_testing/10689/133/derivatives/gephi/10689-gephi.graphml")

sys.exit

Error:

19/04/18 18:02:34 ERROR Executor: Exception in task 3.0 in stage 0.0 (TID 3)
java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:3332)
        at java.lang.StringCoding.safeTrim(StringCoding.java:89)
        at java.lang.StringCoding.access$100(StringCoding.java:50)
        at java.lang.StringCoding$StringDecoder.decode(StringCoding.java:154)
        at java.lang.StringCoding.decode(StringCoding.java:193)
        at java.lang.StringCoding.decode(StringCoding.java:254)
        at java.lang.String.<init>(String.java:546)
        at java.lang.String.<init>(String.java:566)
        at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecord.scala:117)
        at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:69)
        at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:69)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
        at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463)
        at org.apache.spark.serializer.SerializationStream.writeAll(Serializer.scala:139)
        at org.apache.spark.serializer.SerializerManager.dataSerializeStream(SerializerManager.scala:174)
        at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1$$anonfun$apply$10.apply(BlockManager.scala:1203)
        at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1$$anonfun$apply$10.apply(BlockManager.scala:1201)
        at org.apache.spark.storage.DiskStore.put(DiskStore.scala:69)
        at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1201)
        at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
        at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
        at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
        at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
        at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)

If you want to check out the full log on tuna: /tuna1/scratch/nruest/auk_collection_testing/10689/133/spark_jobs/10689.scala-tuna-pr-test.log

@ruebot (Member) commented Apr 18, 2019

> Caching RDDs to disk may be useful, but it is not a solution to out-of-memory issues discussed by @ruebot in #aut on Slack.

🤦‍♂️

I should have read the PR closer. Sorry for the giant verbose dump above @jrwiebe

@jrwiebe (Contributor, Author) commented Apr 19, 2019

@ruebot You can lead a horse to water ...

@ruebot requested a review from lintool April 19, 2019 21:24
@lintool (Member) commented Apr 22, 2019

lgtm

@ruebot merged commit 5cb05f7 into master Apr 22, 2019
@ruebot deleted the serialize-ArchiveRecordImpl branch April 22, 2019 15:14