
Upgrade to Hadoop 3.x #329

Closed
jrwiebe opened this issue Jul 24, 2019 · 7 comments

@jrwiebe
Contributor

jrwiebe commented Jul 24, 2019

AUT currently uses Hadoop 2.6.5. Though it is stable, at three years old it is beginning to show its age. I discovered this when testing S3 access (#319): hadoop-aws 2.6.5 cannot authenticate with temporary security credentials (probably an edge case) or with endpoints that require Signature Version 4 (many do). Upgrading to a current branch of Hadoop should mostly be a matter of bringing other dependencies up to date, though that might not be simple.
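For context, this is roughly what the S3A configuration looks like against a newer hadoop-aws (2.8+/3.x); the property names come from the S3A documentation, the key values and endpoint are placeholders, and none of this is wired into AUT yet:

// Sketch only: assumes a newer hadoop-aws on the classpath; 2.6.5 lacks session-token support.
sc.hadoopConfiguration.set("fs.s3a.aws.credentials.provider",
  "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
sc.hadoopConfiguration.set("fs.s3a.access.key", "XXXXXXXXXXXXX")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX")
sc.hadoopConfiguration.set("fs.s3a.session.token", "XXXXXXXXXXXXX")
// Signature Version 4-only regions additionally need a region-specific endpoint, e.g.:
sc.hadoopConfiguration.set("fs.s3a.endpoint", "s3.ca-central-1.amazonaws.com")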

At present I would suggest going with version 3.1.2, the latest 3.1.x release. Its docs say: "This release is generally available (GA), meaning that it represents a point of API stability and quality that we consider production-ready." Alternatively, 3.0.0 -- all the Cloudera CDH 6 releases use that version, which is an indication of its stability and wide use.

I'm not sure of the implications of using a distribution of Spark built with an older Hadoop to run our code that depends on Hadoop 3 (Spark 2.4.3 uses Hadoop 2.6.5). I wonder how the version conflicts would be resolved if we included the Hadoop 3 dependencies in our fatjar (we currently exclude them) and ran it on Spark with Hadoop 2.6.5. I imagine it should work if we include Hadoop in the fatjar and instruct people to use the version of Spark built without Hadoop. I think it's unreasonable to expect people to build Spark themselves, though.

@ruebot
Member

ruebot commented Jul 24, 2019

We'll have to sort out FileUtil.copyMerge, since it was deprecated and then removed in the Hadoop 3 line. Luckily it is only used here.

Preliminary Stack Overflow searching says we can implement our own version in Scala. Is that something we'd want to do, or should we look for a better solution for combining all the part files?
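Something along these lines, a minimal sketch against the Hadoop 3 FileSystem API adapted from the Stack Overflow approach (not AUT's actual code; the helper name and the sort-by-filename ordering of part files are assumptions):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.IOUtils

// Concatenate the part files under srcDir into a single dstFile.
def copyMerge(srcFS: FileSystem, srcDir: Path,
              dstFS: FileSystem, dstFile: Path,
              deleteSource: Boolean, conf: Configuration): Boolean = {
  if (dstFS.exists(dstFile)) {
    throw new java.io.IOException(s"Target $dstFile already exists")
  }
  if (!srcFS.getFileStatus(srcDir).isDirectory) return false
  val out = dstFS.create(dstFile)
  try {
    srcFS.listStatus(srcDir)
      .filter(_.isFile)
      .sortBy(_.getPath.getName)
      .foreach { part =>
        val in = srcFS.open(part.getPath)
        try { IOUtils.copyBytes(in, out, conf, false) }
        finally { in.close() }
      }
  } finally {
    out.close()
  }
  if (deleteSource) srcFS.delete(srcDir, true) else true
}

The old FileUtil.copyMerge also took an optional string to insert between files; that's dropped here for simplicity.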

...and maybe there is a way to pull off an hdfs cat 🤷‍♂️

Anyway, I'll keep digging and check out the Hadoop 3.1.2 API docs.

@jrwiebe
Contributor Author

jrwiebe commented Jul 24, 2019

The Scala re-implementation looks good. I'd use it.

@greebie
Contributor

greebie commented Jul 24, 2019

Just dropping in to confirm that I have played around with the Scala re-implementation on Hadoop 3.1.1 and it works fine.

@jrwiebe
Contributor Author

jrwiebe commented Aug 1, 2019

Regarding the last paragraph of my initial comment:

I'm not sure of the implications of using a distribution of Spark built with an older Hadoop to run our code that depends on Hadoop 3 (Spark 2.4.3 uses Hadoop 2.6.5). I wonder how the version conflicts would be resolved if we included the Hadoop 3 dependencies in our fatjar (we currently exclude them) and ran it on Spark with Hadoop 2.6.5. I imagine it should work if we include Hadoop in the fatjar and instruct people to use the version of Spark built without Hadoop. I think it's unreasonable to expect people to build Spark themselves, though.

Spark 2.x is not going to let us take advantage of Hadoop 3, since the official distribution of Spark is built with Hadoop 2. To confirm, I tried removing the Hadoop exclusions from our fatjar (i.e., including Hadoop 3 in the fatjar), merged my S3-enabling changes into the issue-329 branch, and got this:

scala> :paste
// Entering paste mode (ctrl-D to finish)

import io.archivesunleashed._
import io.archivesunleashed.matchbox._

sc.hadoopConfiguration.set("fs.s3a.access.key", "XXXXXXXXXXXXX")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX")


val r = RecordLoader.loadArchives("s3a://jrwiebe-aut/*.gz", sc).keepValidPages().map(r => ExtractDomain(r.getUrl)).countItems().take(10)

// Exiting paste mode, now interpreting.

java.lang.NoSuchMethodError: org.apache.hadoop.conf.Configuration.reloadExistingConfigurations()V
  at org.apache.hadoop.fs.s3a.S3AFileSystem.addDeprecatedKeys(S3AFileSystem.java:218)
  at org.apache.hadoop.fs.s3a.S3AFileSystem.<clinit>(S3AFileSystem.java:222)
  at java.lang.Class.forName0(Native Method)
  at java.lang.Class.forName(Class.java:348)
  at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2134)
  at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2099)
  at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
  at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
  at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
  at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
  at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
  at io.archivesunleashed.package$RecordLoader$.loadArchives(package.scala:64)
  ... 49 elided

The problem is that org.apache.hadoop.conf.Configuration.reloadExistingConfigurations() is a method that exists in recent Hadoop but not in the Hadoop 2.6.5 classes that ship with Spark. It doesn't matter that we bundled the Hadoop 3 libraries in the fatjar: Spark's own Hadoop classes sit earlier on the classpath, so its Hadoop 2.6.5 Configuration is the one that gets loaded.
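You can see the shadowing directly with a bit of standard JVM introspection in spark-shell (nothing AUT-specific):

classOf[org.apache.hadoop.conf.Configuration]
  .getProtectionDomain.getCodeSource.getLocation
// Expect this to point at Spark's bundled hadoop-common jar, not our fatjar.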

This is the easiest way to get Spark + Hadoop 3:

  • Download Spark built without Hadoop (on the download page choose package type "Pre-built with user-provided Apache Hadoop").
  • Download Hadoop 3.1.2
  • Extract both in your home directory
  • $ export SPARK_DIST_CLASSPATH=$(/path/to/hadoop-3.1.2/bin/hadoop classpath)
  • $ /path/to/spark-2.4.3-bin-without-hadoop/bin/spark-shell --jars /path/to/aut/target/aut-0.17.1-SNAPSHOT-fatjar.jar

Running spark-shell this way, the above test script works.
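(As a sanity check on the classpath, the standard Hadoop VersionInfo API confirms which build the shell picked up; this isn't part of the test script above:)

// Should report 3.1.2 when SPARK_DIST_CLASSPATH points at the Hadoop 3 install.
org.apache.hadoop.util.VersionInfo.getVersion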

This isn't that difficult, but it adds complexity to the installation process, which is a turn-off. Now we're asking people to download two packages, and it's easy enough to make a mistake setting the environment variable that prevents Spark from starting. Maybe the Hadoop 3 upgrade isn't worth the hassle. (I say, after @ruebot does most of the work.)

@ruebot
Member

ruebot commented Aug 1, 2019

Funnily enough, I was going down the same path last night with the Hadoop-less Spark. It does complicate things significantly, and it might be worth talking this out on the next standup call we do (@ianmilligan1 @lintool sound good?)

Worst case, we have this branch, which we can keep up to date from time to time, and maybe in the next year or so we'll see a Spark 3 with Hadoop 3 support. Based on following the Spark mailing list, that sounds reasonable, since Spark 2.4.4 should be the last release before 3.0.0. From the dev list:

"It would be great if we can have Spark 2.4.4 before we are going to get busier for 3.0.0. If it's okay, I'd like to volunteer for an 2.4.4 release manager to roll it next Monday. (15th July). How do you think about this?"

@ianmilligan1
Member

Sounds good to chat about this, @ruebot. FWIW, I don't think adding another download/export step is a deal breaker if we think the benefits outweigh the disadvantages, as installing Spark is already fairly complex. But I think it's worth further discussion for sure.

@ruebot
Member

ruebot commented May 17, 2022

I don't think we'll need this after we wrap up #494, and we really only use Hadoop for getFiles:

      /** Gets all non-empty archive files.
        *
        * @param dir the path to the directory containing archive files
        * @param fs the filesystem to resolve the path against
        * @return a comma-separated String of all non-empty archive file paths.
        */
      def getFiles(dir: Path, fs: FileSystem): String = {
        val statuses = fs.globStatus(dir)
        val files = statuses
          .filter(f => fs.getContentSummary(f.getPath).getLength > 0)
          .map(f => f.getPath)
        files.mkString(",")
      }
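For illustration, a hedged sketch of how getFiles would be invoked from spark-shell, assuming it is in scope; the glob and variable names are made up:

import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(sc.hadoopConfiguration)
// Expand the glob and keep only non-empty archives, as a comma-separated string.
val paths = getFiles(new Path("/path/to/warcs/*.warc.gz"), fs)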

So I'm going to close this and the related PR #491. Happy to re-open if we decide that we need to do this.
