Add support for Hadoop paths #5
Conversation
Trying to load an Excel file from a Hadoop path (such as hdfs://localhost/users/shoffing/test_excel.xlsx) wasn't working. This change adds support for loading Excel files from Hadoop paths, including HDFS.
@@ -11,6 +11,7 @@ import org.apache.spark.rdd.RDD
import org.apache.spark.sql._
import org.apache.spark.sql.sources._
import org.apache.spark.sql.types._
import org.apache.hadoop.fs.{FileSystem, Path}
Is this package already included in one of the configured dependencies of Spark (e.g. hadoop-client)?
I did some quick research but didn't find a definite answer...
FileSystem is in hadoop-common:
$ jar -tvf hadoop-common-2.6.0.jar | grep 'org/apache/hadoop/fs/FileSystem.class'
48255 Thu Nov 13 13:09:30 EST 2014 org/apache/hadoop/fs/FileSystem.class
as is Path:
$ jar -tvf hadoop-common-2.6.0.jar | grep 'org/apache/hadoop/fs/Path.class'
10611 Thu Nov 13 13:09:22 EST 2014 org/apache/hadoop/fs/Path.class
hadoop-client has a dependency on hadoop-common.
https://hadoop.apache.org/docs/r2.6.0/hadoop-project-dist/hadoop-client/dependency-analysis.html
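For reference, a minimal build configuration sketch (the version number is an assumption, not taken from this PR; Spark distributions typically already ship these jars, hence the "provided" scope):

```scala
// build.sbt fragment (illustrative only).
// hadoop-client transitively pulls in hadoop-common, which contains
// org.apache.hadoop.fs.FileSystem and org.apache.hadoop.fs.Path.
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.6.0" % "provided"
```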
val path = new Path(location)
val inputStream = FileSystem.get(path.toUri, sqlContext.sparkContext.hadoopConfiguration).open(path)
val workbook = WorkbookFactory.create(inputStream)
inputStream.close()
Oh, I didn't even think about closing the stream 😉
Did you make sure that WorkbookFactory.create reads all data eagerly?
Otherwise we'd get into trouble later when it tries to read from the already-closed stream.
So it turns out that I didn't need to close the stream at all; Apache POI automatically closes the input stream after reading it entirely. 😄
WorkbookFactory.create(inputStream) defers to WorkbookFactory.create(inputStream, null):
https://apache.googlesource.com/poi/+/refs/heads/trunk/src/ooxml/java/org/apache/poi/ss/usermodel/WorkbookFactory.java#166
which in turn either creates an NPOIFSFileSystem around the inputStream (which reads it eagerly then closes the underlying stream)...
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.poi/poi/3.8/org/apache/poi/poifs/filesystem/NPOIFSFileSystem.java#262
...or it creates an XSSFWorkbook around it, which wraps it in OPCPackage.open:
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.poi/poi-ooxml/3.8-beta1/org/apache/poi/openxml4j/opc/OPCPackage.java#218
which wraps it in a ZipInputStream and then immediately in a ZipInputStreamZipEntrySource (which ultimately closes the input stream as well):
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.poi/poi-ooxml/3.8-beta1/org/apache/poi/openxml4j/util/ZipInputStreamZipEntrySource.java#46
So as far as I can tell, the closing of the input stream is handled in both cases.
Line 32 can safely be removed.
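If we wanted to be defensive about the stream regardless of POI's internal behavior, a try/finally version would look roughly like this (a hypothetical refactoring sketch, not code from this PR; it assumes the Spark/Hadoop/POI classpath discussed above):

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.poi.ss.usermodel.{Workbook, WorkbookFactory}
import org.apache.spark.sql.SQLContext

// Hypothetical helper: opens the workbook and guarantees the stream is
// closed even if WorkbookFactory.create throws. On the success path POI
// has already closed the stream, so the extra close() is a no-op
// (InputStream.close is documented to be safe to call more than once).
def openWorkbook(location: String, sqlContext: SQLContext): Workbook = {
  val path = new Path(location)
  val fs = FileSystem.get(path.toUri, sqlContext.sparkContext.hadoopConfiguration)
  val inputStream = fs.open(path)
  try WorkbookFactory.create(inputStream)
  finally inputStream.close()
}
```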
@@ -25,7 +26,10 @@ case class ExcelRelation(
)
(@transient val sqlContext: SQLContext)
extends BaseRelation with TableScan with PrunedScan {
val workbook = WorkbookFactory.create(new FileInputStream(location))
val path = new Path(location)
val inputStream = FileSystem.get(path.toUri, sqlContext.sparkContext.hadoopConfiguration).open(path)
I assume this also works with local files?
FileSystem.get(...) looks up a filesystem from the URI scheme:
https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileSystem.java#L455
If there isn't a scheme, it defaults back to the basic get(config):
https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileSystem.java#L224
Which uses the standard Java URI class:
https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileSystem.java#L233
So it should work with any URI scheme for which a FileSystem implementation is available. In theory that includes FTP URIs and more, but I've only tested it with local files and HDFS files.
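To illustrate the scheme-based lookup described above, a small sketch (the hostnames and paths are placeholders; the HDFS call needs a reachable namenode):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = new Configuration()

// "file" scheme resolves to the local filesystem implementation.
val localPath = new Path("file:///tmp/test_excel.xlsx")
val localFs = FileSystem.get(localPath.toUri, conf)

// "hdfs" scheme resolves to the distributed filesystem implementation.
val hdfsPath = new Path("hdfs://localhost/users/shoffing/test_excel.xlsx")
val hdfsFs = FileSystem.get(hdfsPath.toUri, conf)

// A scheme-less path falls back to fs.defaultFS from the configuration.
val defaultFs = FileSystem.get(new Path("/tmp/test_excel.xlsx").toUri, conf)
```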
@shoffing Great detective work, thanks for digging all that stuff out!
I'm waiting for #6 to be mergeable, then I'll do a release.
Add support for Hadoop paths (crealytics#5)
Trying to load an Excel file from a Hadoop path (such as hdfs://localhost/users/shoffing/test_excel.xlsx) wasn't working. This change adds support for loading Excel files from Hadoop paths, including HDFS.
Also fix a tiny typo in an error message.