MessagePack-Hadoop Integration
========================================

This package contains the bridge layer between MessagePack (http://msgpack.org)
and the Hadoop (http://hadoop.apache.org/) family of projects.

It enables you to run MapReduce jobs directly on MessagePack-formatted data, and
to issue Hive queries over that data.
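
For the MapReduce side, a job driver only needs to select the MessagePack input
format. The following is a minimal sketch, not code taken from this repository:
the package path of MessagePackInputFormat and the record types it emits are
assumptions, so check the classes actually shipped in msgpack-hadoop-$version.jar
before using it.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.LongSumReducer;
// Assumed package -- verify the actual location inside msgpack-hadoop-$version.jar.
import org.msgpack.hadoop.mapreduce.input.MessagePackInputFormat;

public class MessagePackRecordCount {

    // The concrete key/value classes emitted by MessagePackInputFormat are not
    // documented here, so the mapper stays type-agnostic and only counts records.
    public static class RecordCountMapper
            extends Mapper<Object, Object, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1L);

        @Override
        protected void map(Object key, Object value, Context ctx)
                throws IOException, InterruptedException {
            // value.toString() is a placeholder; replace it with real field
            // extraction once you know the record type the input format emits.
            ctx.write(new Text(value.toString()), ONE);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "msgpack-record-count");
        job.setJarByClass(MessagePackRecordCount.class);

        // Read MessagePack records instead of plain text lines.
        job.setInputFormatClass(MessagePackInputFormat.class);
        FileInputFormat.addInputPath(job, new Path("/path/to/hdfs/"));
        FileOutputFormat.setOutputPath(job, new Path("/path/to/output/"));

        job.setMapperClass(RecordCountMapper.class);
        job.setReducerClass(LongSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}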

The MessagePack-Hive adapter enables SQL-based ad-hoc queries over *nested*,
*unstructured* data (like JSON, but binary-encoded). The queries are, of course,
executed on the MapReduce framework!

Here is a sample MessagePack-Hive query, which counts unique users per URL.

> CREATE EXTERNAL TABLE IF NOT EXISTS mpbin (v string) \
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '@' LINES TERMINATED BY '\n' \
  LOCATION '/path/to/hdfs/';

> SELECT url, COUNT(DISTINCT user) \
  FROM mpbin LATERAL VIEW msgpack_map(v, 'user', 'url') m AS user, url \
  GROUP BY url;
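
Each row's v column carries one binary-encoded record. For reference, this is a
minimal sketch of how a record equivalent to {"user": ..., "url": ...} is
produced. It uses the current msgpack-java API (org.msgpack.core), which is newer
than the msgpack jar bundled with this project, and the field values are made up.

import org.msgpack.core.MessageBufferPacker;
import org.msgpack.core.MessagePack;

public class EncodeAccessRecord {
    public static void main(String[] args) throws Exception {
        // One record is one MessagePack map -- the binary analogue of
        // {"user": "alice", "url": "/index.html"} in JSON.
        MessageBufferPacker packer = MessagePack.newDefaultBufferPacker();
        packer.packMapHeader(2);
        packer.packString("user");
        packer.packString("alice");
        packer.packString("url");
        packer.packString("/index.html");
        packer.close();

        byte[] record = packer.toByteArray();  // the bytes msgpack_map() unpacks
        System.out.println(record.length + " bytes");
    }
}

msgpack_map() in the queries above extracts the named entries ('user', 'url')
from records of this shape.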

Required Setup
========================================

Please set up a Hadoop + Hive system. Local, Pseudo-Distributed, and Distributed
environments all work.

Hive Getting Started
========================================

1. locate jars

Put these jars into the $HIVE_HOME/lib/ directory.

* msgpack-hadoop-$version.jar
* msgpack-$version.jar
* javassist-$version.jar

2. exec hive shell

Please execute the following command.

$ hive --auxpath $HIVE_HOME/lib/msgpack-hadoop-$version.jar,$HIVE_HOME/lib/msgpack-$version.jar,$HIVE_HOME/lib/javassist-$version.jar

You can skip the --auxpath option once you add the following to your hive-site.xml.

<property>
  <name>hive.aux.jars.path</name>
  <value>$HIVE_HOME/lib/msgpack-hadoop-$version.jar,$HIVE_HOME/lib/msgpack-$version.jar,$HIVE_HOME/lib/javassist-$version.jar</value>
</property>

3. add jar and load custom UDTF function

These commands are required once per Hive session.

hive> add jar $HIVE_HOME/lib/msgpack-hadoop-$version.jar;
hive> add jar $HIVE_HOME/lib/msgpack-$version.jar;
hive> add jar $HIVE_HOME/lib/javassist-$version.jar;
hive> CREATE TEMPORARY FUNCTION msgpack_map AS 'org.msgpack.hadoop.hive.udf.GenericUDTFMessagePackMap';

4. create external table

Create an external table that points to the data directory on HDFS.

hive> CREATE EXTERNAL TABLE IF NOT EXISTS mp_table (v string) \
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '@'  LINES TERMINATED BY '\n' \
      LOCATION '/path/to/hdfs/';
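
The table only points at a directory; the MessagePack files have to be placed
under that LOCATION yourself. Below is a minimal sketch using Hadoop's FileSystem
API. The local file name access_log.mp is hypothetical and must already contain
MessagePack records.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UploadMessagePackData {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // "access_log.mp" is a hypothetical local file of MessagePack records;
        // the target must match the table's LOCATION clause.
        fs.copyFromLocalFile(new Path("access_log.mp"), new Path("/path/to/hdfs/"));
        fs.close();
    }
}

From the command line, hadoop fs -put access_log.mp /path/to/hdfs/ does the same.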

5. execute the query

Finally, execute a SELECT query over the input data.

The input MessagePack data is unstructured, nested data, so its structure has to
be mapped to Hive field names. The msgpack_map() UDTF extracts the requested keys
from each record, and the "AS" clause names the resulting columns.

hive> SELECT url, COUNT(DISTINCT user) \
      FROM mp_table LATERAL VIEW msgpack_map(v, 'user', 'url') m AS user, url \
      GROUP BY url;

Caveats
========================================

Currently, MessagePackInputFormat is not splittable, so each input file is
processed by a single map task. Therefore, you need to manually *shred* the data
into smaller files to get parallelism.
