Star Schema Benchmark using the Hive / Druid Integration

Prerequisites for running this:

  • A functional Druid cluster.
  • A version of Hive that supports the Druid Storage Handler. This includes Apache Hive version 2.2 or later, or Hortonworks Data Platform (HDP) 2.6 or later.
  • Apache Maven and gcc, which are needed to build the data generator (a quick version check is shown below).
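
A quick way to confirm the build tools are available; both commands only print version information:

mvn -version
gcc --version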

Before continuing, identify these things (example values are sketched after the list):

  • Your desired data scale, in gigabytes. For example, a scale of 1000 equals about 1 TB of data.
  • Your HiveServer2 host:port
  • The Druid overlord host
  • The username and password for your Druid metadata database
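
For reference, one hypothetical set of values is shown below; the hostnames and credentials are placeholders, not defaults:

SCALE=1000                       # scale 1000 is roughly 1 TB of data
HS2=hive.example.com:10500       # HiveServer2 host:port
OVERLORD=druid.example.com       # Druid overlord host
DRUID_USER=druid                 # Druid metadata database username
DRUID_PASS=password              # Druid metadata database password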

Process:

  • Build the data generator (native code)
  • Package the data generator in a JAR file capable of being run as a MapReduce job to generate data within a Hadoop cluster
  • Run a MapReduce job to generate "CSV" data within HDFS
  • Run a Hive job to convert this "CSV" data into Hive tables
  • Run a Hive job to push pre-aggregated data into Druid. This step may require you to create additional HDFS directories and set permissions if you're not using HDP (see the sketch after this list).
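
If your cluster needs the extra HDFS setup mentioned in the last step, a minimal sketch looks like the following. The directory and ownership here are examples only; the real segment path is whatever hive.druid.storage.storageDirectory points to in your Hive configuration:

# Example path only -- match it to hive.druid.storage.storageDirectory
hdfs dfs -mkdir -p /apps/druid/warehouse
hdfs dfs -chown -R druid:hadoop /apps/druid/warehouse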

If all goes well, you only need to run three commands:

  1. sh 00datagen.sh [scale] [hiveserver2:port]
  2. sh 00load.sh [scale] [hiveserver2:port] [overlord] [username] [password]
  3. sh 00run.sh [hiveserver2:port]

For example, to run at scale 100:

sh 00datagen.sh 100 hive.example.com:10500
sh 00load.sh 100 hive.example.com:10500 druid.example.com druid password
sh 00run.sh hive.example.com:10500
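
To sanity-check the load, one option is to connect with beeline and list the tables Hive created; the connection string below reuses the example host and is only an illustration:

beeline -u "jdbc:hive2://hive.example.com:10500/default" -e "SHOW TABLES;"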