Read Excel document using Spark 1.x

Jörn Franke edited this page Jan 7, 2017 · 1 revision

This is a Spark 1.x application that demonstrates some of the capabilities of the hadoopoffice library. It takes a set of Excel files as input and prints the cell address and the formatted content of each cell. It has been tested successfully with the HDP Sandbox VM 2.5, but other Hadoop distributions should work equally well, provided they support Spark. You need at least Spark 1.5.
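To check the minimum Spark version before going further, a small helper along these lines may be useful. It is only a sketch: the function name `version_at_least` is hypothetical, the version number passed to it is an example value that you would replace with the one reported by `spark-submit --version`, and it assumes a `sort` that supports `-V` (version sort), as GNU coreutils does.

```shell
# Hypothetical helper: succeeds when version $1 is greater than or equal to version $2.
# Assumes a sort implementation with -V (version sort), e.g. GNU coreutils.
version_at_least() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Example value only: replace 1.6.2 with the version your spark-submit reports.
if version_at_least "1.6.2" "1.5"; then
  echo "Spark version is sufficient"
else
  echo "Spark version is too old"
fi
```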

Getting an example Excel

You can create an Excel file yourself in LibreOffice or Microsoft Excel. Alternatively, you can download an Excel file that is used for unit testing of the hadoopoffice library by executing the following command:

wget --no-check-certificate https://github.com/ZuInnoTe/hadoopoffice/raw/master/fileformat/src/test/resources/excel2013test.xlsx
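Because the download skips certificate checks, it can be worth confirming that the result is a real workbook and not an error page. An .xlsx file is a ZIP container, so it starts with the two bytes `PK`. The following is a minimal sketch of such a check; the function name `looks_like_xlsx` is hypothetical, and the check only inspects the file signature, not the workbook content.

```shell
# Hypothetical helper: succeeds when the file starts with the ZIP signature "PK".
# This only checks the container signature, not the workbook itself.
looks_like_xlsx() {
  [ "$(head -c 2 "$1" 2>/dev/null)" = "PK" ]
}

if looks_like_xlsx ./excel2013test.xlsx; then
  echo "excel2013test.xlsx looks like a ZIP-based .xlsx file"
else
  echo "excel2013test.xlsx is missing or is not a ZIP-based .xlsx file"
fi
```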

You can put it on your HDFS cluster by executing the following commands:

hadoop fs -mkdir -p /user/spark/office/excel/input

hadoop fs -put ./excel2013test.xlsx /user/spark/office/excel/input

After it has been copied, you are ready to use the example.
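If you want to confirm the upload first, listing the input directory is usually enough. This sketch assumes the same path as above and that the `hadoop` client is on your PATH; the variable name `INPUT_DIR` is only illustrative.

```shell
# Same HDFS directory as used in the upload commands above.
INPUT_DIR=/user/spark/office/excel/input

# List the directory only when a hadoop client is available on this machine.
if command -v hadoop >/dev/null 2>&1; then
  hadoop fs -ls "$INPUT_DIR"
else
  echo "hadoop command not found on PATH"
fi
```

The listing should include excel2013test.xlsx if the copy succeeded.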

Building the example

Note that the HadoopOffice library is available on Maven Central.

Clone the repository by executing:

git clone https://github.com/ZuInnoTe/hadoopoffice.git hadoopoffice

You can build the application by changing to the directory hadoopoffice/examples/scala-spark-excelinput and using the following command:

sbt clean +assembly
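After the build finishes, you need the path of the assembled jar for the next step. The sketch below simply searches the build output for jar files. It assumes that sbt writes its artifacts below a `target` directory inside the example directory, which is the usual sbt layout but is not stated on this page; the function name `list_built_jars` is hypothetical.

```shell
# Hypothetical helper: prints all jar files below <dir>/target.
# Assumes the usual sbt layout, where build output goes under "target".
list_built_jars() {
  if [ -d "$1/target" ]; then
    find "$1/target" -type f -name '*.jar'
  else
    echo "no target directory under $1; run the build first" >&2
    return 1
  fi
}

# Run from hadoopoffice/examples/scala-spark-excelinput after the build.
list_built_jars . || true
```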

Running the example

Execute the following command (make sure that you use the spark-submit of Spark 1.x):

spark-submit --class org.zuinnote.spark.office.example.excel.SparkScalaExcelIn ./example-ho-spark-scala-excelin.jar /user/spark/office/excel/input
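Since the application prints its result, you may want to keep a copy of the output. The following sketch runs the same command and additionally writes the output to a file with `tee`. The log file name `excel-cells.log` and the variable names are illustrative, and the sketch assumes that the job output appears on the standard output of the machine where you call spark-submit, which depends on how your cluster runs the driver.

```shell
# Values taken from the command above; the variable names are illustrative.
JAR=./example-ho-spark-scala-excelin.jar
MAIN_CLASS=org.zuinnote.spark.office.example.excel.SparkScalaExcelIn
INPUT_DIR=/user/spark/office/excel/input

# Run the job only when spark-submit and the jar are available,
# and keep a copy of the printed output in excel-cells.log (illustrative name).
if command -v spark-submit >/dev/null 2>&1 && [ -f "$JAR" ]; then
  spark-submit --class "$MAIN_CLASS" "$JAR" "$INPUT_DIR" | tee excel-cells.log
else
  echo "spark-submit or $JAR not found; nothing was run"
fi
```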

After the Spark 1.x job has completed, the output contains the cell address and the formatted content of each cell in the Excel file.