hivemeta

A Ruby API for accessing a Hive metastore running under MySQL.

Useful for referencing columns by name in Hadoop map/reduce applications.
Normally, a developer needs to handle both the splitting of incoming data
and the assignment of numerically indexed fields to friendly variables,
like so:

fields  = line.chomp.split(/\t/)  # split the raw input line on tabs
item_id = fields[0]               # field positions tracked by hand
inv_cnt = fields[7].to_i
puts "#{item_id}\t#{inv_cnt}"

This is not overly traumatic, but it is susceptible to errors creeping in
when the file format changes: insert one new column and fields[7] silently
reads the wrong data.  Ongoing maintenance can easily become a burden when
many map/reduce programs read the same changed data files, and code size
grows with the column count.

With hivemeta, the process is streamlined.  That same task is now:

row = inv_table.process_row line
puts "#{row.item_id}\t#{row.inv_cnt.to_i}"

The row object automagically knows its column names, and they can be
referenced in any of the following ways (in descending order of
performance and coolness):

row.col_name
row[:col_name]
row['col_name']
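
All three forms resolve to the same column.  Here is a minimal sketch,
reusing the sample_inventory table from the examples below (connection
setup is covered under API Usage):

require 'hivemeta'

h = HiveMeta::Connection.new # credentials from env variables, see below
inv_table = h.table 'sample_inventory'

row = inv_table.process_row STDIN.gets # parse one raw input line
puts row.item_id == row[:item_id]  # => true
puts row.item_id == row['item_id'] # => true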

Also included is a demo application, hivemeta_query.rb, which prints
table information from the command line via table name search or by the
table's location in HDFS.

---

Installation

gem install hivemeta

or (for JRuby)

jgem install hivemeta

---

API Usage

streaming map/reduce code snippet (abstracted processing loop):

require 'hivemeta'

h = HiveMeta::Connection.new # see below for detail

h.table('sample_inventory').process do |row|
  item_id = row.item_id # can access by method or [:sym] or ['str']
  count   = row.inv_cnt.to_i
  puts "#{item_id}\t#{count}" if count >= 1000
end

streaming map/reduce code snippet (normal STDIN processing loop):

require 'hivemeta'

h = HiveMeta::Connection.new # see below for detail
inv_table = h.table 'sample_inventory'

STDIN.each_line do |line|
  begin
    row = inv_table.process_row line
  rescue HiveMeta::FieldCountError
    STDERR.puts "reporter:counter:bad_data:row_size,1"
    next
  end
  item_id = row.item_id # can access by method or [:sym] or ['str']
  count   = row.inv_cnt.to_i
  puts "#{item_id}\t#{count}" if count >= 1000
end
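
For completeness, the reduce side of such a streaming job needs no table
metadata at all, since it reads the mapper's own output.  A minimal
summing reducer sketch, assuming the mapper above emits item_id and count
pairs separated by a tab:

# streaming delivers mapper output sorted by key, so totals can be
# flushed whenever the item_id changes
prev_id = nil
total   = 0

STDIN.each_line do |line|
  id, cnt = line.chomp.split("\t")
  if prev_id && id != prev_id
    puts "#{prev_id}\t#{total}"
    total = 0
  end
  prev_id = id
  total  += cnt.to_i
end
puts "#{prev_id}\t#{total}" if prev_id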

establishing a connection (in ruby code):

db_user    = 'hive'
db_pass    = 'hivepasshere'
db_host    = 'localhost'
db_name    = 'hivemeta'

h = HiveMeta::Connection.new(db_name, db_host, db_user, db_pass)

establishing a connection (environment variables):

# when no arguments are passed, the following env variables will be used:
#
#  hivemeta_db_host
#  hivemeta_db_name
#  hivemeta_db_user
#  hivemeta_db_pass
#
# to set these in a streaming map/reduce job, use -D arguments like so:
#
#  -D hivemeta.db_host=mydbhost \
#  -D hivemeta.db_name=hivemeta \
#  -D hivemeta.db_user=hive \
#  -D hivemeta.db_pass=mydbpass \

# the connection will be made with those env variables without any other code
h = HiveMeta::Connection.new
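
Putting it together, a full streaming invocation might look like the
following sketch.  The streaming jar path, HDFS paths, and the mapper
file name inv_mapper.rb are placeholders for your own setup:

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
  -D hivemeta.db_host=mydbhost \
  -D hivemeta.db_name=hivemeta \
  -D hivemeta.db_user=hive \
  -D hivemeta.db_pass=mydbpass \
  -input /tmp/sample_inventory \
  -output /tmp/inv_report \
  -mapper inv_mapper.rb \
  -file inv_mapper.rb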

---

hivemeta_query.rb Usage

# query by table names
$ hivemeta_query.rb join_test_name
join_test_name
hdfs://namenode/tmp/join_test_name
delimiter: "\t" (ASCII 9)
0   userid             # userid
1   name               # username

# query by table name wildcards
$ hivemeta_query.rb join_test%
join_test_address
hdfs://namenode/tmp/join_test_address
delimiter: "," (ASCII 44)
0   userid             # uid
1   address
2   city
3   state

join_test_name
hdfs://namenode/tmp/join_test_name
delimiter: "\t" (ASCII 9)
0   userid             # userid
1   name               # username

# list the tables using /tmp in HDFS
$ hivemeta_query.rb -l /tmp
join_test_address
join_test_work
my_test_table

# view usage information
$ hivemeta_query.rb -h
usage: ./hivemeta_query.rb [options] table_name|hdfs_path
  -h, --help
  -c, --comments       # display comments along with field detail (default)
  -C, --no-comments    # do not display comments with the field detail
  -l, --list-tables    # list matching tables but no detail
  -f, --list-file-path # list the table HDFS file locations
  -w, --fit-width      # fit the text to the width of the screen (default)
  -W, --no-fit-width   # do not fit the text to the width of the screen
  -u, --db-user=arg    # hive metastore db user (requires read access)
  -p, --db-pass=arg    # hive metastore db password
  -H, --db-host=arg    # host running the hive meta db (default: localhost)
  -d, --db-name=arg    # hive meta db name (default: hivemeta)