Improve performance for processing a lot of small Office files with Hadoop Archives (HAR)
Office files are usually small, i.e. several orders of magnitude smaller than the default HDFS block size (128 MB). Processing many of them in a cluster can therefore incur performance penalties because of the number of requests to the NameNode. Fortunately, Hadoop has offered a solution to this problem for several versions: Hadoop Archives (HAR). These archives do not compress files, but combine them into a single archive.
The files within a Hadoop Archive can be accessed transparently by any Hadoop application (including Spark, Tez or Hive).
Instead of providing the HDFS URL, such as hdfs:/user/office/, you simply provide a HAR URL, such as har:/user/office/MyOfficeDocuments.har/
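The switch from an HDFS URL to a HAR URL can be sketched with a small helper. This is only an illustration: the function names `har_url` and `archive_command` and the source directory `/user/office/raw` are hypothetical and not part of this project. The command list follows the standard `hadoop archive -archiveName <name> -p <parent> <dest>` form; check it against the documentation of your Hadoop version before running it.

```python
import posixpath


def har_url(parent_dir: str, archive_name: str) -> str:
    """Build the HAR URL for an archive stored in parent_dir on the default file system."""
    if not archive_name.endswith(".har"):
        raise ValueError("Hadoop archive names must end in .har")
    # Normalise leading/trailing slashes so "/user/office/" and "user/office" behave the same.
    path = posixpath.join("/", parent_dir.strip("/"), archive_name)
    return "har:" + path + "/"


def archive_command(source_dir: str, archive_name: str, dest_dir: str) -> list:
    """Return the argument list that would archive source_dir into dest_dir/archive_name."""
    return ["hadoop", "archive", "-archiveName", archive_name, "-p", source_dir, dest_dir]


if __name__ == "__main__":
    # Hypothetical layout: the small Office files live in /user/office/raw.
    print(" ".join(archive_command("/user/office/raw", "MyOfficeDocuments.har", "/user/office")))
    # The resulting URL is what you pass to Spark, Tez or Hive instead of the hdfs: URL.
    print(har_url("/user/office", "MyOfficeDocuments.har"))
```

The helper only assembles strings; creating the archive itself still requires running the printed command on a machine with a configured Hadoop client.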