Convert the rows in an Excel document to CSV

Jörn Franke edited this page Jan 21, 2017 · 12 revisions

This is a MapReduce application demonstrating some of the capabilities of the hadoopoffice library. It takes as input a set of files in Excel format (.xls,.xlsx) on HDFS containing. As output it converts them to a CSV file / sheet. It has successfully been tested with the Cloudera Quickstart VM 5.5, but other Hadoop distributions should work equally well.

Getting an example Excel document

You can create yourself an Excel document in LibreOffice or Microsoft Excel. Alternatively, you can download an Excel document that is used for unit testing of hadoopoffice library by executing the following command:

wget --no-check-certificate https://github.com/ZuInnoTe/hadoopoffice/raw/master/fileformat/src/test/resources/excel2013test.xlsx

You can put it on your HDFS cluster by executing the following commands:

hadoop fs -mkdir -p /user/cloudera/office/excel/input

hadoop fs -put ./excel2013test.xlsx /user/cloudera/office/excel/input

After it has been copied you are ready to use the example.

Building the example

Note the fileformat is available on Maven Central and you do NOT need to build and publish the Hadoop File Format anymore.

Execute

git clone https://github.com/ZuInnoTe/hadoopoffice.git hadoopoffice

You can build the application by changing to the directory hadoopoffice/examples/mapreduce-excelinput and using the following command:

../gradlew clean build

Running the example

Make sure that the output directory is clean:

hadoop fs -rm -R /user/cloudera/office/output

Execute the following command

hadoop jar ./build/libs/example-ho-mr-excelinput-0.1.0.jar org.zuinnote.hadoop.office.example.driver.Excel2CSVDriver /user/cloudera/office/excel/input /user/cloudera/office/output

After the map/reduce job has completed, you find the result (ie the Excel input file converted into a CSV output file) in /user/cloudera/office/output. You can display it using the following command:

hadoop fs -cat /user/cloudera/office/output/part-r-00000

Keep in mind that CSV requires that each row should have all the columns, which is not the case for Excel. Hence, you need to have an Excel sheet with all columns filled for each row or you need to make sure in your application that each row has all the columns. The example can be easily extended to fulfil this requirement.