Convert rows in CSV to an Excel document
Clone this wiki locally
This is a MapReduce application demonstrating some of the capabilities of the hadoopoffice library. It takes as input a set of (simple CSV) files on HDFS . As output it converts them to an Excel file (.xlsx format). It has successfully been tested with the Cloudera Quickstart VM 5.5, but other Hadoop distributions should work equally well.
Getting an example CSV
You can create yourself a CSV file in LibreOffice or Microsoft Excel. Alternatively, you can download a CSV file that is used for unit testing of hadoopoffice library by executing the following command:
wget --no-check-certificate https://raw.githubusercontent.com/ZuInnoTe/hadoopoffice/master/fileformat/src/test/resources/simplecsv.csv
You can put it on your HDFS cluster by executing the following commands:
hadoop fs -mkdir -p /user/cloudera/office/csv/input
hadoop fs -put ./simplecsv.csv /user/cloudera/office/csv/input
After it has been copied you are ready to use the example.
Building the example
git clone https://github.com/ZuInnoTe/hadoopoffice.git hadoopoffice
You can build the application by changing to the directory hadoopoffice/examples/mapreduce-exceloutput and using the following command:
../gradlew clean build
Running the example
Make sure that the output directory is clean:
hadoop fs -rm -R /user/cloudera/office/output
Execute the following command
hadoop jar ./build/libs/example-ho-mr-exceloutput-0.1.0.jar org.zuinnote.hadoop.office.example.driver.CSV2ExcelDriver /user/cloudera/office/csv/input /user/cloudera/office/output
After the map/reduce job has completed, you find the result (ie the CSV input file converted into a Excel output file) in /user/cloudera/office/output. You can download it from HDFS as follows and open it in Excel afterwards:
hadoop fs -copyToLocal /user/cloudera/office/output/part-r-00000.xlsx
This is of course just a simple example. It is recommended for professional use to use an open source library to parse the CSV files. Furthermore, for the sake of simplicity we do not ensure the same order of rows as in the CSV, but this can be easily implemented in this example (hint: sort by the line numbers generated by the TextInputFormat).