Welcome to the hadoopoffice wiki!
hadoopoffice is a library for processing Office documents, such as MS Excel spreadsheets, on Hadoop and its ecosystem components (e.g. Spark/Hive). The data in the documents can be combined with any other data you have on Hadoop.
It contains the following components:
- Hadoop File Format to enable any MapReduce/Tez/Spark application to process rows containing data from Office documents stored in files in HDFS. This format supports the original MapReduce API (mapred.*) and the alternative MapReduce API (mapreduce.*)
- Spark Datasource to use the HadoopOffice library via the Spark DataSource API
In the future we will add further components:
- Hive Serde for making rows containing data from Office documents available in Hive to be queried using SQL
- Hive UDF for making additional metadata of Office documents, such as comments, available in Hive
The following formats are supported:
- MS Excel (*.xls, *.xlsx), based on the parsers of Apache POI
The following examples are available:
- MapReduce: Convert the rows in an Excel document to CSV
- MapReduce: Convert rows in CSV to an Excel document
- Spark: Read Excel document using Spark 1.x
- Spark2 Datasource: Read an Excel document using the Spark2 datasource API
- Spark2 Datasource: Write an Excel document using the Spark2 datasource API
- Improve performance for processing a lot of small Office files with Hadoop Archives (HAR)
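To give an idea of what the Excel-to-CSV conversion listed above involves per row, here is a minimal sketch in plain Java. It assumes the cell values of one row have already been obtained as strings; the class `RowToCsv` and its methods are hypothetical names for illustration only and are not part of the hadoopoffice API. Quoting follows the common RFC 4180 convention.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of hadoopoffice: turns the cell values
// of one spreadsheet row into a single CSV line.
class RowToCsv {

    // Quote a cell if it contains the separator, a double quote, or a line break.
    // Embedded double quotes are doubled. Empty (null) cells become empty fields.
    static String escape(String cell, char separator) {
        if (cell == null) {
            return "";
        }
        boolean needsQuotes = cell.indexOf(separator) >= 0
                || cell.indexOf('"') >= 0
                || cell.indexOf('\n') >= 0
                || cell.indexOf('\r') >= 0;
        String escaped = cell.replace("\"", "\"\"");
        return needsQuotes ? "\"" + escaped + "\"" : escaped;
    }

    // Join all cells of a row with the given separator.
    static String toCsvLine(List<String> cells, char separator) {
        return cells.stream()
                .map(cell -> escape(cell, separator))
                .collect(Collectors.joining(String.valueOf(separator)));
    }

    public static void main(String[] args) {
        // One row with a plain value, a value containing the separator,
        // an empty cell, and a value containing double quotes.
        List<String> row = Arrays.asList("1", "Smith, John", null, "say \"hi\"");
        System.out.println(toCsvLine(row, ','));
    }
}
```

In a MapReduce job, logic of this kind would typically sit in the map function, which receives one row at a time and emits one CSV line per row.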
Find here the status from the continuous integration (CI) platform:
Find here the status from the static code analyzer platform:
- Sonarqube: https://sonarqube.com/dashboard?id=ZuInnoTe%3Ahadoopoffice
- Codacy (also covers Scala): https://www.codacy.com/app/jornfranke/hadoopoffice
Find here the latest Javadocs.
Find here the OpenHub report.
Find here some release notes.
Join us on Gitter.im