

Add support for Hadoop paths #5

Merged 2 commits on Mar 23, 2017

Conversation

shoffing (Contributor)

Trying to load an Excel file from a Hadoop path (such as hdfs://localhost/users/shoffing/test_excel.xlsx) wasn't working. This change adds support for loading Excel files from Hadoop and HDFS.

Also fixes a tiny typo in an error message.

@@ -11,6 +11,7 @@ import org.apache.spark.rdd.RDD
 import org.apache.spark.sql._
 import org.apache.spark.sql.sources._
 import org.apache.spark.sql.types._
+import org.apache.hadoop.fs.{FileSystem, Path}
Collaborator

Is this package already in one of the configured dependencies of Spark (e.g. hadoop-client)?
I did some quick research but didn't find a definite answer...

Contributor (Author)

FileSystem is in hadoop-common:

$ jar -tvf hadoop-common-2.6.0.jar | grep 'org/apache/hadoop/fs/FileSystem.class'
 48255 Thu Nov 13 13:09:30 EST 2014 org/apache/hadoop/fs/FileSystem.class

as is Path:

$ jar -tvf hadoop-common-2.6.0.jar | grep 'org/apache/hadoop/fs/Path.class'
 10611 Thu Nov 13 13:09:22 EST 2014 org/apache/hadoop/fs/Path.class

hadoop-client has a dependency on hadoop-common.

https://hadoop.apache.org/docs/r2.6.0/hadoop-project-dist/hadoop-client/dependency-analysis.html

val path = new Path(location)
val inputStream = FileSystem.get(path.toUri, sqlContext.sparkContext.hadoopConfiguration).open(path)
val workbook = WorkbookFactory.create(inputStream)
inputStream.close()
Collaborator
Oh, I didn't even think about closing the stream 😉
Did you make sure that WorkbookFactory.create reads all data eagerly?
Otherwise we'd get into trouble later when it tries to read stuff from the already closed stream.

@shoffing (Contributor, Author) Mar 23, 2017

So it turns out I didn't need to close the stream at all: Apache POI automatically closes the input stream after reading it entirely. 😄

WorkbookFactory.create(inputStream) defers to WorkbookFactory.create(inputStream, null):
https://apache.googlesource.com/poi/+/refs/heads/trunk/src/ooxml/java/org/apache/poi/ss/usermodel/WorkbookFactory.java#166

which in turn either creates an NPOIFSFileSystem around the inputStream (which reads it eagerly then closes the underlying stream)...
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.poi/poi/3.8/org/apache/poi/poifs/filesystem/NPOIFSFileSystem.java#262

...or it creates an XSSFWorkbook around it, which wraps it in OPCPackage.open:
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.poi/poi-ooxml/3.8-beta1/org/apache/poi/openxml4j/opc/OPCPackage.java#218

which wraps it in a ZipInputStream and then immediately in a ZipInputStreamZipEntrySource (which ultimately closes the input stream as well):
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.poi/poi-ooxml/3.8-beta1/org/apache/poi/openxml4j/util/ZipInputStreamZipEntrySource.java#46

So as far as I can tell, it looks like the input stream closing is handled in both cases.

Line 32 can be safely removed.
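The eager-read-then-close behaviour described above can be illustrated with a small stdlib-only Scala sketch. `CloseTrackingInputStream` and `consumeEagerly` are hypothetical stand-ins for POI's internals (`NPOIFSFileSystem` / `ZipInputStreamZipEntrySource`), not part of any real API:

```scala
import java.io.{ByteArrayInputStream, FilterInputStream, InputStream}

// Wrapper that records whether close() was called on the underlying stream.
class CloseTrackingInputStream(in: InputStream) extends FilterInputStream(in) {
  var closed = false
  override def close(): Unit = { closed = true; super.close() }
}

// Hypothetical stand-in for WorkbookFactory.create: reads the stream fully,
// then closes it, mirroring the eager behaviour described above.
def consumeEagerly(in: InputStream): Array[Byte] = {
  try in.readAllBytes() // requires Java 9+
  finally in.close()
}

val tracked = new CloseTrackingInputStream(
  new ByteArrayInputStream("xlsx bytes".getBytes))
val data = consumeEagerly(tracked)
assert(tracked.closed, "stream should already be closed by the eager consumer")
println(s"read ${data.length} bytes; closed = ${tracked.closed}")
```

If POI behaves as the links above suggest, a caller's extra `inputStream.close()` after `WorkbookFactory.create` is redundant (a second close on an already-closed stream is a no-op for most `InputStream` implementations).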

@@ -25,7 +26,10 @@ case class ExcelRelation(
 )
 (@transient val sqlContext: SQLContext)
 extends BaseRelation with TableScan with PrunedScan {
-val workbook = WorkbookFactory.create(new FileInputStream(location))
+val path = new Path(location)
+val inputStream = FileSystem.get(path.toUri, sqlContext.sparkContext.hadoopConfiguration).open(path)
Collaborator

I assume this also works with local files?

Contributor (Author)

FileSystem.get(...) looks up a filesystem from the URI scheme:
https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileSystem.java#L455

If there isn't one, then it defaults back to the basic get(config):
https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileSystem.java#L224

Which uses the standard Java URI class:
https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileSystem.java#L233

So it should work with any URI scheme for which a FileSystem implementation is available. In theory that also includes FTP URIs and more, but I've only tested it with local files and HDFS files.
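That scheme-based lookup can be sketched with plain java.net.URI. The registry map below is a hypothetical stand-in for Hadoop's actual per-scheme FileSystem configuration, not the real mechanism:

```scala
import java.net.URI

// Hypothetical scheme -> implementation table, standing in for Hadoop's
// FileSystem registry; the class names here are illustrative.
val registry = Map(
  "hdfs" -> "DistributedFileSystem",
  "file" -> "LocalFileSystem",
  "ftp"  -> "FTPFileSystem"
)

def resolveFileSystem(location: String,
                      default: String = "LocalFileSystem"): String = {
  val scheme = Option(new URI(location).getScheme)
  // A bare local path has no scheme, so we fall back to the default,
  // analogous to FileSystem.get(path.toUri, conf) falling back to get(conf).
  scheme.flatMap(registry.get).getOrElse(default)
}

println(resolveFileSystem("hdfs://localhost/users/shoffing/test_excel.xlsx"))
println(resolveFileSystem("/tmp/test_excel.xlsx"))
```

Under these assumptions, the first call resolves via the "hdfs" scheme while the second, scheme-less path falls back to the local default, which is why the same code path handles both HDFS and local files.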

@nightscape (Collaborator)

@shoffing Great detective work, thanks for digging all that stuff out!
If you can amend the removal of inputStream.close() I'll happily merge this 😄

@nightscape nightscape merged commit c7cc437 into crealytics:master Mar 23, 2017
@nightscape (Collaborator)

I'm waiting for #6 to be mergeable, then I'll do a release.

virtualramblas added a commit to virtualramblas/spark-excel that referenced this pull request Mar 23, 2017