Jörn Franke edited this page Feb 20, 2017 · 25 revisions

Welcome to the hadoopoffice wiki!

hadoopoffice is a library for processing Office documents, such as MS Excel spreadsheets, on Hadoop and its ecosystem components (e.g. Spark/Hive). The data in the documents can be combined with any other data you have on Hadoop.

It contains the following components:

  • Hadoop File Format that enables any MapReduce/Tez/Spark application to read rows containing data from Office documents stored in files on HDFS. This format supports both the original MapReduce API (mapred.*) and the newer MapReduce API (mapreduce.*).
  • Spark Datasource to use the HadoopOffice library via the Spark DataSource API

In the future we will add further components:

  • Hive SerDe that makes rows containing data from Office documents available in Hive, so they can be queried using SQL
  • Hive UDF that exposes additional metadata from Office documents, such as comments, in Hive

Supported formats:

  • MS Excel (*.xls, *.xlsx), based on the Apache POI parsers

How-To Guides:

Find here the status from the continuous integration (CI) platform:

Find here the status from the static code analyzer platform:

Find here the latest test results.

Find here the latest Javadocs.

Find here the OpenHub report.

Find here some release notes.


Join us on Gitter.im