
Pig_a9a binary dataset

daijyc edited this page Mar 1, 2015 · 5 revisions

a9a

http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#a9a


Preparation

Convert both files with conv_pig.awk, then map the labels +1/-1 to 1/0:

awk -f conv_pig.awk a9a | sed -e "s/+1/1/" | sed -e "s/-1/0/" > a9a.train
awk -f conv_pig.awk a9a.t | sed -e "s/+1/1/" | sed -e "s/-1/0/" > a9a.test
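
conv_pig.awk itself is not reproduced on this page, so the following is only a rough sketch of what such a conversion could look like, not the actual script. It assumes the target layout is the one the load statements further down expect: a row id, the label, and a Pig bag of feature pairs, separated by tabs. The sample lines are made up, and the label rewrite is done on the second field with awk instead of sed, which is a substitution chosen here so that feature text can never be touched.

```shell
# Two made-up rows in LIBSVM format: a label followed by index:value pairs.
converted=$(printf '%s\n' '-1 3:1 11:1 14:1' '+1 5:1 7:1' | awk '
{
    # Wrap every index:value pair in a tuple and join the tuples into a bag.
    bag = ""
    sep = ""
    for (i = 2; i <= NF; i++) {
        bag = bag sep "(" $i ")"
        sep = ","
    }
    # Emit rowid (the input line number), label and bag, tab-separated.
    print NR "\t" $1 "\t{" bag "}"
}' | awk '
BEGIN { FS = OFS = "\t" }
{
    # Map +1 to 1 and -1 to 0, looking only at the label column.
    $2 = ($2 + 0 > 0) ? 1 : 0
    print
}')
printf '%s\n' "$converted"
```

For the two sample rows this prints `1`, `0`, `{(3:1),(11:1),(14:1)}` on the first line and `2`, `1`, `{(5:1),(7:1)}` on the second, each separated by tabs.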

Putting data on HDFS

hadoop fs -copyFromLocal a9a.train .
hadoop fs -copyFromLocal a9a.test .

Training/test data preparation

register hivemall-0.3-with-dependencies.jar

define addBias HiveUDF('hivemall.ftvec.AddBiasUDF');

rmf a9a.train.exploded

a = load 'a9a.train' as (rowid:int, label:float, features:{(featurepair:chararray)});
b = foreach a generate rowid, label, addBias(features) as features;
c = foreach b generate rowid, label, flatten(features) as featurepair:chararray;
d = foreach c generate rowid, label, flatten(STRSPLIT(featurepair, ':'));
store d into 'a9a.train.exploded';

rmf a9a.test.exploded

a = load 'a9a.test' as (rowid:int, label:float, features:{(featurepair:chararray)});
b = foreach a generate rowid, label, addBias(features) as features;
c = foreach b generate rowid, label, flatten(features) as featurepair:chararray;
d = foreach c generate rowid, label, flatten(STRSPLIT(featurepair, ':'));
store d into 'a9a.test.exploded';
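
To make the shape of the exploded output easier to picture, here is a small local emulation in awk of what the Pig statements above do to a single row. It is an illustration only and does not run Pig or Hivemall. The sample row is made up, and the bias feature `0:1.0` is an assumption about what addBias appends. The label is passed through as written, whereas Pig may render a float label differently.

```shell
# One made-up row in the bag layout that the load statement expects.
row=$(printf '1\t0\t{(3:1),(11:1)}')

exploded=$(printf '%s\n' "$row" | awk '
BEGIN { FS = OFS = "\t" }
{
    # Drop the two leading and two trailing delimiter characters of the bag.
    bag = substr($3, 3, length($3) - 4)
    # One array element per feature pair (emulates flatten on the bag).
    n = split(bag, pairs, /\),\(/)
    # Assumed bias term, standing in for addBias.
    pairs[++n] = "0:1.0"
    for (i = 1; i <= n; i++) {
        # Emulates STRSPLIT(featurepair, ":") followed by flatten.
        split(pairs[i], kv, ":")
        print $1, $2, kv[1], kv[2]
    }
}')
printf '%s\n' "$exploded"
```

Each input row becomes one output row per feature, with the columns rowid, label, feature index and feature value. The sample row yields three rows, the last of which carries the assumed bias feature.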