Home

Welcome to the hadoopoffice wiki!

hadoopoffice is a library for processing and writing Office documents, such as MS Excel spreadsheets, on Hadoop and its ecosystem components (e.g. Spark/Hive). The data in these documents can be combined with any other data you have on Hadoop.

It contains the following components:

Hadoop File Format to enable any MapReduce/Tez/Spark application to read rows of data from Office documents stored in HDFS (or any other supported filesystem). It also supports writing Office documents to HDFS (or any other supported filesystem). This format supports both the original MapReduce API (mapred.*) and the newer MapReduce API (mapreduce.*).
- JUnit5 test results
- Latest Javadocs
Hive Serde to use Hive and SQL queries to read/write Office documents from files in HDFS (or any other supported filesystem)
- JUnit5 test results
- Latest Javadocs
Flink DataSource/DataSink to use Apache Flink to read/write Office documents (recommended if you want to have more control, e.g. defining formulas in spreadsheets, comments etc.)
- JUnit5 test results
- Latest Javadocs
Spark DataSource to read/write Office documents with the HadoopOffice library via the Spark DataSource API
DEPRECATED Flink TableSource/TableSink to use the Table API of Apache Flink to read/write Office documents using SQL

Supported formats:

MS Excel (*.xls, *.xlsx) based on parsers and writers of Apache POI

How-To Guides:

MapReduce: Convert the rows in an Excel document to CSV
MapReduce: Convert rows in CSV to an Excel document
Spark: Read Excel document using Spark 1.x
Spark: Write Excel document using Spark 1.x
Spark2 Datasource: Read an Excel document using the Spark2 datasource API
Spark2 Datasource: Write an Excel document using the Spark2 datasource API
Hive: Query and Write Excel documents using SQL in Apache Hive
Flink: Using Apache Flink to read/write Excel documents
Working with templates to include advanced Excel features, such as diagrams
Saving CPU/memory resources with low footprint mode
Encrypting your files for improved security
Digitally signing your files to enable non-repudiation
Improving performance when processing many small Office files with Hadoop Archives (HAR)
Convert spreadsheet cells to simple datatypes and vice versa (e.g. convert Excel cells to data as String, byte, short, BigDecimal and many more)
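
To illustrate the idea behind the last guide, here is a minimal sketch of mapping the text of a spreadsheet cell to the narrowest simple datatype that can hold it. The class `CellTypeSketch` and the method `toSimpleType` are hypothetical names chosen for this example. They are not part of the HadoopOffice API, and the library's own converter (described in the linked guide) may follow different rules, for example for locale-specific number formats, which this sketch ignores.

```java
import java.math.BigDecimal;

// Hypothetical illustration only: NOT the HadoopOffice converter.
public class CellTypeSketch {

    /**
     * Maps cell text to Boolean, Byte, Short, Integer, Long, BigDecimal or String.
     * Returns null for null or blank input (treated here as an empty cell).
     */
    public static Object toSimpleType(String cellText) {
        if (cellText == null) {
            return null;
        }
        String text = cellText.trim();
        if (text.isEmpty()) {
            return null;
        }
        if (text.equalsIgnoreCase("true") || text.equalsIgnoreCase("false")) {
            return Boolean.valueOf(text);
        }
        try {
            // Whole numbers: pick the narrowest integer type that fits.
            long value = Long.parseLong(text);
            if (value >= Byte.MIN_VALUE && value <= Byte.MAX_VALUE) {
                return Byte.valueOf((byte) value);
            }
            if (value >= Short.MIN_VALUE && value <= Short.MAX_VALUE) {
                return Short.valueOf((short) value);
            }
            if (value >= Integer.MIN_VALUE && value <= Integer.MAX_VALUE) {
                return Integer.valueOf((int) value);
            }
            return Long.valueOf(value);
        } catch (NumberFormatException notAWholeNumber) {
            // Fall through and try a decimal number next.
        }
        try {
            // Decimal numbers: BigDecimal avoids binary floating-point rounding.
            return new BigDecimal(text);
        } catch (NumberFormatException notANumber) {
            // Anything else stays a String.
            return text;
        }
    }

    public static void main(String[] args) {
        String[] samples = {"42", "300", "70000", "5000000000", "1.50", "TRUE", "hello"};
        for (String sample : samples) {
            Object converted = toSimpleType(sample);
            System.out.println(sample + " -> " + converted.getClass().getSimpleName());
        }
    }
}
```

Converting in the other direction (simple datatype to cell) can be as simple as calling `String.valueOf` on the value in a sketch like this one. A real spreadsheet writer also has to choose cell types and number formats.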

The build status is available on the continuous integration (CI) platform:

https://travis-ci.org/ZuInnoTe/hadoopoffice/

The code quality status is available on the static code analysis platforms:

Sonarqube: https://sonarcloud.io/dashboard?id=ZuInnoTe%3Ahadoopoffice
Codacy (also covers Scala): https://www.codacy.com/app/jornfranke/hadoopoffice

An OpenHub report is also available.

Join us on Gitter.im
