

Add support for Hadoop paths #5

Merged 2 commits on Mar 23, 2017

Conversation

shoffing (Contributor)

Trying to load an Excel file from a Hadoop path (such as hdfs://localhost/users/shoffing/test_excel.xlsx) wasn't working. This change adds support for loading Excel files from Hadoop and HDFS.

Also fixes a tiny typo in an error message.

@@ -11,6 +11,7 @@ import org.apache.spark.rdd.RDD
 import org.apache.spark.sql._
 import org.apache.spark.sql.sources._
 import org.apache.spark.sql.types._
+import org.apache.hadoop.fs.{FileSystem, Path}
Collaborator

Is this package already in one of the configured dependencies of Spark (e.g. hadoop-client)?
I did some quick research but didn't find a definite answer...

Contributor (Author)

FileSystem is in hadoop-common:

$ jar -tvf hadoop-common-2.6.0.jar | grep 'org/apache/hadoop/fs/FileSystem.class'
 48255 Thu Nov 13 13:09:30 EST 2014 org/apache/hadoop/fs/FileSystem.class

as is Path:

$ jar -tvf hadoop-common-2.6.0.jar | grep 'org/apache/hadoop/fs/Path.class'
 10611 Thu Nov 13 13:09:22 EST 2014 org/apache/hadoop/fs/Path.class

hadoop-client has a dependency on hadoop-common.

https://hadoop.apache.org/docs/r2.6.0/hadoop-project-dist/hadoop-client/dependency-analysis.html

val path = new Path(location)
val inputStream = FileSystem.get(path.toUri, sqlContext.sparkContext.hadoopConfiguration).open(path)
val workbook = WorkbookFactory.create(inputStream)
inputStream.close()
Collaborator
Oh, I didn't even think about closing the stream 😉
Did you make sure that WorkbookFactory.create reads all data eagerly?
Otherwise we'd get into trouble later when it tries to read stuff from the already closed stream.

@shoffing (Contributor, Author) Mar 23, 2017

So it turns out I didn't need to close the stream at all: Apache POI automatically closes the input stream after reading it entirely. 😄

WorkbookFactory.create(inputStream) defers to WorkbookFactory.create(inputStream, null):
https://apache.googlesource.com/poi/+/refs/heads/trunk/src/ooxml/java/org/apache/poi/ss/usermodel/WorkbookFactory.java#166

which in turn either creates an NPOIFSFileSystem around the inputStream (which reads it eagerly then closes the underlying stream)...
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.poi/poi/3.8/org/apache/poi/poifs/filesystem/NPOIFSFileSystem.java#262

...or it creates an XSSFWorkbook around it, which wraps it in OPCPackage.open:
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.poi/poi-ooxml/3.8-beta1/org/apache/poi/openxml4j/opc/OPCPackage.java#218

which wraps it in a ZipInputStream and then immediately in a ZipInputStreamZipEntrySource (which ultimately closes the input stream as well):
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.poi/poi-ooxml/3.8-beta1/org/apache/poi/openxml4j/util/ZipInputStreamZipEntrySource.java#46

So as far as I can tell, it looks like the input stream closing is handled in both cases.

Line 32 can be safely removed.
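The eager-read-then-close behaviour described above can be illustrated with a small stdlib-only Scala sketch. `CloseTrackingInputStream` and `consumeEagerly` are hypothetical stand-ins for POI's internals (`NPOIFSFileSystem` / `ZipInputStreamZipEntrySource`), not part of any real API:

```scala
import java.io.{ByteArrayInputStream, FilterInputStream, InputStream}

// Wrapper that records whether close() was called on the underlying stream.
class CloseTrackingInputStream(in: InputStream) extends FilterInputStream(in) {
  var closed = false
  override def close(): Unit = { closed = true; super.close() }
}

// Hypothetical stand-in for WorkbookFactory.create: reads the stream fully,
// then closes it, mirroring the eager behaviour described above.
def consumeEagerly(in: InputStream): Array[Byte] = {
  try in.readAllBytes() // requires Java 9+
  finally in.close()
}

val tracked = new CloseTrackingInputStream(
  new ByteArrayInputStream("xlsx bytes".getBytes))
val data = consumeEagerly(tracked)
assert(tracked.closed, "stream should already be closed by the eager consumer")
println(s"read ${data.length} bytes; closed = ${tracked.closed}")
```

If POI behaves as the links above suggest, a caller's extra `inputStream.close()` after `WorkbookFactory.create` is redundant (a second close on an already-closed stream is a no-op for most `InputStream` implementations).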

@@ -25,7 +26,10 @@ case class ExcelRelation(
 )
 (@transient val sqlContext: SQLContext)
 extends BaseRelation with TableScan with PrunedScan {
-val workbook = WorkbookFactory.create(new FileInputStream(location))
+val path = new Path(location)
+val inputStream = FileSystem.get(path.toUri, sqlContext.sparkContext.hadoopConfiguration).open(path)
Collaborator

I assume this also works with local files?

Contributor (Author)

FileSystem.get(...) looks up a filesystem from the URI scheme:
https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileSystem.java#L455

If there isn't one, then it defaults back to the basic get(config):
https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileSystem.java#L224

Which uses the standard Java URI class:
https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileSystem.java#L233

So it should work with any URI scheme for which a FileSystem implementation is available. In theory that also includes FTP URIs and more, but I've only tested it with local files and HDFS files.
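That scheme-based lookup can be sketched with plain java.net.URI. The registry map below is a hypothetical stand-in for Hadoop's actual per-scheme FileSystem configuration, not the real mechanism:

```scala
import java.net.URI

// Hypothetical scheme -> implementation table, standing in for Hadoop's
// FileSystem registry; the class names here are illustrative.
val registry = Map(
  "hdfs" -> "DistributedFileSystem",
  "file" -> "LocalFileSystem",
  "ftp"  -> "FTPFileSystem"
)

def resolveFileSystem(location: String,
                      default: String = "LocalFileSystem"): String = {
  val scheme = Option(new URI(location).getScheme)
  // A bare local path has no scheme, so we fall back to the default,
  // analogous to FileSystem.get(path.toUri, conf) falling back to get(conf).
  scheme.flatMap(registry.get).getOrElse(default)
}

println(resolveFileSystem("hdfs://localhost/users/shoffing/test_excel.xlsx"))
println(resolveFileSystem("/tmp/test_excel.xlsx"))
```

Under these assumptions, the first call resolves via the "hdfs" scheme while the second, scheme-less path falls back to the local default, which is why the same code path handles both HDFS and local files.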

@nightscape (Collaborator)

@shoffing Great detective work, thanks for digging all that stuff out!
If you can amend the removal of inputStream.close() I'll happily merge this 😄

@nightscape nightscape merged commit c7cc437 into crealytics:master Mar 23, 2017
@nightscape (Collaborator)

I'm waiting for #6 to be mergeable, then I'll do a release.

virtualramblas added a commit to virtualramblas/spark-excel that referenced this pull request Mar 23, 2017