
Update Apache Spark to 2.3.0; resolves #218 #219

Merged
ianmilligan1 merged 2 commits into master from issue-218
May 14, 2018

ruebot
Member

@ruebot ruebot commented May 12, 2018

GitHub issue(s): #218

What does this Pull Request do?

How should this be tested?

TravisCI should take care of things, but a smoke test with a directory of WARCs and some basic tweet analysis would be good.

@lintool @TitusAn

- Update tests to use workaround for SPARK-2243
- Comment out ExtractGraph test as per https://github.com/archivesunleashed/aut/pull/204/files#diff-4541b9834513985c360b64093fd45073
- Align Hadoop version with Apache Spark pom.xml https://github.com/apache/spark/blob/branch-2.3/pom.xml#L120
@ruebot ruebot requested a review from lintool May 12, 2018 23:44
@codecov

codecov bot commented May 12, 2018

Codecov Report

Merging #219 into master will decrease coverage by 4.96%.
The diff coverage is n/a.

@@            Coverage Diff            @@
##           master    #219      +/-   ##
=========================================
- Coverage   66.16%   61.2%   -4.97%     
=========================================
  Files          34      34              
  Lines         665     665              
  Branches      124     124              
=========================================
- Hits          440     407      -33     
- Misses        184     217      +33     
  Partials       41      41
Impacted Files Coverage Δ
.../scala/io/archivesunleashed/app/ExtractGraph.scala 0% <0%> (-94.29%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b8a8a97...b6eb72c. Read the comment docs.
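
The coverage figures in the report above can be reproduced from the hit, miss, and partial counts. The following is a minimal sketch, assuming coverage is computed as hits divided by total lines (hits + misses + partials); the object and function names are hypothetical, and the last digit may differ from the report depending on how Codecov rounds.

```scala
object CoverageSketch {
  // Assumption: coverage = hits / (hits + misses + partials), as a percentage.
  def coverage(hits: Int, misses: Int, partials: Int): Double =
    100.0 * hits / (hits + misses + partials)

  def main(args: Array[String]): Unit = {
    // Counts taken from the coverage diff above (665 lines in both cases).
    val before = coverage(hits = 440, misses = 184, partials = 41)
    val after  = coverage(hits = 407, misses = 217, partials = 41)

    println(f"before: $before%.2f%%")
    println(f"after:  $after%.2f%%")
    println(f"change: ${after - before}%.2f points")
  }
}
```

The 33 lines that move from hits to misses correspond to the drop in ExtractGraph.scala, whose test is commented out in this PR.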

@ruebot
Member Author

ruebot commented May 13, 2018

io.archivesunleashed.WarcTest: Only one SparkContext may be running in this JVM (see SPARK-2243). To ignore this error, set spark.driver.allowMultipleContexts = true. The currently running SparkContext was created at:(..)

That's the error you were asking about at the datathon. It requires this hack, conf.set("spark.driver.allowMultipleContexts", "true");, in all the tests. There's probably a much better way to take care of it, which we can implement later.
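
For reference, the workaround described above might look like this in a test setup. This is a minimal sketch, assuming Spark 2.3.x is on the classpath; the object name, master URL, and app name are hypothetical and are not taken from this PR's tests.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical standalone example; the real tests in this repository may differ.
object AllowMultipleContextsSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[2]")        // assumption: local mode for tests
      .setAppName("example-test")   // assumption: placeholder app name
      // Workaround for SPARK-2243: do not fail if another SparkContext
      // is already running in this JVM.
      .set("spark.driver.allowMultipleContexts", "true")

    val sc = new SparkContext(conf)
    try {
      // Trivial job to confirm the context works.
      val n = sc.parallelize(1 to 10).count()
      assert(n == 10L)
    } finally {
      // Always stop the context so later tests start from a clean state.
      sc.stop()
    }
  }
}
```

Stopping the context in a `finally` block (or a test teardown hook) is one possible alternative to relying on the flag, since the error only occurs while an earlier context is still running.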

Member

@lintool lintool left a comment

lgtm

Member

@ianmilligan1 ianmilligan1 left a comment

Tested on a tranche of about 200GB of WARCs and a 17GB Twitter JSON file; all worked nicely.

@ruebot
Member Author

ruebot commented May 14, 2018

@ianmilligan1 thanks!! We should be good to go after TravisCI turns green one last time.

@ianmilligan1 ianmilligan1 merged commit fc8f4bf into master May 14, 2018
@ianmilligan1 ianmilligan1 deleted the issue-218 branch May 14, 2018 16:49
3 participants