
Convert spreadsheet cells to simple datatypes and vice versa

Jörn Franke edited this page Jul 22, 2018 · 4 revisions

The HadoopOffice library internally uses SpreadSheetCellDAO (for single cells) and SpreadSheetCellDAOArrayWritable (for addressing rows containing several cells). These DAOs allow the user to flexibly define the text, formula, comment, sheet and address of a cell.

Nevertheless, for many use cases these DAOs need to be converted to simple datatypes, such as String, Integer, Short, Decimal etc. Furthermore, once users have finished their analysis, they need to convert simple datatypes back to DAOs to store them in spreadsheet formats.

All this functionality is provided by the converter class for Excel, available since version 1.1.0. It offers the following functionality:

  • Derive the simple datatypes automatically by providing one or more samples of data (you can repeatedly call this function with a new "row" in the form of an array of SpreadSheetCellDAO): updateSpreadSheetCellRowToInferSchemaInformation. Provide as many rows as necessary to derive a correct schema (or all rows if you are unsure)
  • Get the schema underlying this converter: getSchemaRow
    • Look here for the supported datatypes
  • Set the schema (if you do not want to or cannot use the updateSpreadSheetCellRowToInferSchemaInformation method): setSchemaRow
  • Convert an existing "row" (array of SpreadSheetCellDAO) to an array of Objects corresponding to simple datatypes (note: you need to either set the schema manually beforehand using setSchemaRow, or infer it from a set of sample rows using updateSpreadSheetCellRowToInferSchemaInformation): getDataAccordingToSchema
    • Note: This may return null objects in case the cell does not exist.
  • Convert a "row" of objects of simple datatypes to a row for Excel, i.e. an array of SpreadSheetCellDAO: getSpreadSheetCellDAOfromSimpleDataType.
    • Note: You need to specify the sheet name and the row in Excel where these objects should be stored. The cell position is implicitly derived from the position of the object in the array.
    • You may specify null entries in the input row to be converted.
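The infer-then-convert workflow above can be illustrated with a simplified, self-contained sketch. It mimics the method names of the converter but is plain Java, not the library's actual API: the real converter operates on arrays of SpreadSheetCellDAO and supports more datatypes, while here plain String rows stand in for cells and null marks a missing cell.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified illustration of the converter's schema-inference workflow
// (hypothetical stand-in, NOT the HadoopOffice API).
public class SchemaInferenceSketch {
    private final List<Class<?>> schema = new ArrayList<>();

    // Analogous to updateSpreadSheetCellRowToInferSchemaInformation:
    // call once per sample row; each call can widen a column's type.
    public void updateRow(String[] row) {
        for (int i = 0; i < row.length; i++) {
            if (schema.size() <= i) schema.add(null);
            schema.set(i, widen(schema.get(i), inferType(row[i])));
        }
    }

    // Analogous to getSchemaRow.
    public List<Class<?>> getSchema() { return schema; }

    // Analogous to getDataAccordingToSchema: missing cells become null.
    public Object[] convertRow(String[] row) {
        Object[] out = new Object[schema.size()];
        for (int i = 0; i < schema.size(); i++) {
            String cell = i < row.length ? row[i] : null;
            if (cell == null) { out[i] = null; continue; }
            Class<?> t = schema.get(i);
            if (t == Integer.class) out[i] = Integer.valueOf(cell);
            else if (t == Double.class) out[i] = Double.valueOf(cell);
            else out[i] = cell;
        }
        return out;
    }

    private static Class<?> inferType(String cell) {
        if (cell == null) return null;
        try { Integer.parseInt(cell); return Integer.class; } catch (NumberFormatException ignored) { }
        try { Double.parseDouble(cell); return Double.class; } catch (NumberFormatException ignored) { }
        return String.class;
    }

    // Widening rules: null -> anything, Integer + Double -> Double,
    // anything + String -> String.
    private static Class<?> widen(Class<?> current, Class<?> inferred) {
        if (current == null) return inferred;
        if (inferred == null || current == inferred) return current;
        if (current == String.class || inferred == String.class) return String.class;
        return Double.class;
    }
}
```

The key point mirrored here is that inference is incremental: each sample row can only widen a column's type, which is why feeding more rows yields a more reliable schema.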

When you create the converter, you have to specify the date locale to be used for conversion. Most of the time you need to specify Locale.US, because even Excel versions for non-US markets store dates internally in US format. However, dates are sometimes stored as Strings, in which case you may need other date locales. Another important aspect is that you need to specify the locale for converting decimal numbers: depending on the country, the decimal separator may be a comma instead of a point. Hence, to parse decimal numbers correctly you need to specify the right locale. Finally, a dateTimeLocale needs to be specified for converting date/time stamps. If no dateTimeLocale is specified, the converter tries to identify them based on the java.sql.Timestamp format.
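The effect of these locale choices can be demonstrated with the standard java.text classes (a self-contained illustration of why the locale matters; the converter's own parsing may differ in detail):

```java
import java.text.NumberFormat;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

// Demonstrates why the decimal and date locales matter for conversion.
public class LocaleDemo {
    // "3,14" is the decimal 3.14 in a German locale, but under Locale.US
    // the comma is a grouping separator, so the same text parses as 314.
    public static double parseDecimal(String text, Locale locale) throws ParseException {
        NumberFormat nf = NumberFormat.getInstance(locale);
        return nf.parse(text).doubleValue();
    }

    // Excel typically stores dates in US order (month/day/year), which is
    // why Locale.US is usually the right choice for the date locale.
    public static Date parseUsDate(String text) throws ParseException {
        SimpleDateFormat sdf = new SimpleDateFormat("MM/dd/yyyy", Locale.US);
        return sdf.parse(text);
    }
}
```

For example, parsing "3,14" with Locale.GERMANY yields 3.14, while the same input with Locale.US yields 314.0, which is exactly the kind of silent corruption the decimal locale setting prevents.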

The converter is used by the Hive Serde, the Flink TableSource/TableSink, the Flink DataSource/DataSink and the Spark-HadoopOffice DataSource/DataSink.