Convert rows in CSV to an Excel document

Jörn Franke edited this page Jan 21, 2017 · 5 revisions

This is a MapReduce application demonstrating some of the capabilities of the hadoopoffice library. It takes as input a set of (simple CSV) files on HDFS . As output it converts them to an Excel file (.xlsx format). It has successfully been tested with the Cloudera Quickstart VM 5.5, but other Hadoop distributions should work equally well.

Getting an example CSV

You can create yourself a CSV file in LibreOffice or Microsoft Excel. Alternatively, you can download a CSV file that is used for unit testing of hadoopoffice library by executing the following command:

wget --no-check-certificate https://raw.githubusercontent.com/ZuInnoTe/hadoopoffice/master/fileformat/src/test/resources/simplecsv.csv

You can put it on your HDFS cluster by executing the following commands:

hadoop fs -mkdir -p /user/cloudera/office/csv/input

hadoop fs -put ./simplecsv.csv /user/cloudera/office/csv/input

After it has been copied you are ready to use the example.

Building the example

Note the fileformat is available on Maven Central and you do NOT need to build and publish the Hadoop File Format anymore.

Execute

git clone https://github.com/ZuInnoTe/hadoopoffice.git hadoopoffice

You can build the application by changing to the directory hadoopoffice/examples/mapreduce-exceloutput and using the following command:

../gradlew clean build

Running the example

Make sure that the output directory is clean:

hadoop fs -rm -R /user/cloudera/office/output

Execute the following command

hadoop jar ./build/libs/example-ho-mr-exceloutput-0.1.0.jar org.zuinnote.hadoop.office.example.driver.CSV2ExcelDriver /user/cloudera/office/csv/input /user/cloudera/office/output

After the map/reduce job has completed, you find the result (ie the CSV input file converted into a Excel output file) in /user/cloudera/office/output. You can download it from HDFS as follows and open it in Excel afterwards:

hadoop fs -copyToLocal /user/cloudera/office/output/part-r-00000.xlsx

This is of course just a simple example. It is recommended for professional use to use an open source library to parse the CSV files. Furthermore, for the sake of simplicity we do not ensure the same order of rows as in the CSV, but this can be easily implemented in this example (hint: sort by the line numbers generated by the TextInputFormat).