Convert the rows in an Excel document to CSV
This is a MapReduce application demonstrating some of the capabilities of the hadoopoffice library. It takes as input a set of files in Excel format (.xls, .xlsx) stored on HDFS and converts their rows to CSV. It has been tested successfully with the Cloudera Quickstart VM 5.5, but other Hadoop distributions should work equally well.
Getting an example Excel document
You can create an Excel document yourself in LibreOffice or Microsoft Excel. Alternatively, you can download an Excel document that is used for unit testing of the hadoopoffice library by executing the following command:
wget --no-check-certificate https://github.com/ZuInnoTe/hadoopoffice/raw/master/fileformat/src/test/resources/excel2013test.xlsx
You can put it on your HDFS cluster by executing the following commands:
hadoop fs -mkdir -p /user/cloudera/office/excel/input
hadoop fs -put ./excel2013test.xlsx /user/cloudera/office/excel/input
After the file has been copied, you are ready to use the example.
Building the example
Clone the hadoopoffice repository by executing the following command:
git clone https://github.com/ZuInnoTe/hadoopoffice.git hadoopoffice
You can build the application by changing to the directory hadoopoffice/examples/mapreduce-excelinput and using the following command:
../gradlew clean build
Running the example
Make sure that the output directory is clean:
hadoop fs -rm -R /user/cloudera/office/output
Execute the following command:
hadoop jar ./build/libs/example-ho-mr-excelinput-0.1.0.jar org.zuinnote.hadoop.office.example.driver.Excel2CSVDriver /user/cloudera/office/excel/input /user/cloudera/office/output
After the MapReduce job has completed, you will find the result (i.e. the Excel input file converted into a CSV output file) in /user/cloudera/office/output. You can display it using the following command:
hadoop fs -cat /user/cloudera/office/output/part-r-00000
Keep in mind that CSV expects every row to contain all the columns, which is not necessarily the case for Excel, where a row can have cells left empty. Hence, you either need an Excel sheet in which all columns are filled for each row, or you need to make sure in your application that each row has all the columns. The example can easily be extended to fulfil this requirement.
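One possible way to make sure each row has all the columns is to pad every row to a fixed number of fields before writing it out. The following is only a minimal sketch, not code from the example itself: the class RowPadder and its method toCsvLine are hypothetical names, and it assumes that the cell values of a row are already available as an array of strings in which empty cells may be null and trailing cells may be missing. It does not quote or escape values that contain the separator, and it drops any cells beyond the requested column count.

```java
import java.util.StringJoiner;

/** Hypothetical helper (not part of hadoopoffice): pads a row to a fixed column count. */
public class RowPadder {

    /**
     * Joins the cell values of one row into a line with exactly columnCount fields.
     * Null cells and cells beyond the end of the array become empty fields.
     * Cells at index columnCount or higher are ignored.
     * Assumption: values do not contain the separator, so no quoting is done.
     */
    public static String toCsvLine(String[] cells, int columnCount, String separator) {
        StringJoiner line = new StringJoiner(separator);
        for (int i = 0; i < columnCount; i++) {
            boolean present = cells != null && i < cells.length && cells[i] != null;
            line.add(present ? cells[i] : "");
        }
        return line.toString();
    }

    public static void main(String[] args) {
        // A row whose second cell is empty and whose fourth cell is missing
        String[] sparseRow = {"1", null, "3"};
        System.out.println(toCsvLine(sparseRow, 4, ",")); // prints 1,,3,
    }
}
```

A helper of this kind could be called at the point where the job turns a row into an output line, with the column count taken from whatever your application knows about the sheet, for example the width of its header row.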