forked from myui/hivemall
Pig_MovieLens Dataset
daijyc edited this page Mar 2, 2015
First, download the MovieLens dataset from the following site.
See the README for details about the dataset.
The archive contains three .dat files:
movies.dat, ratings.dat, users.dat.
conv.awk
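BEGIN { FS = "#" }
{
  rowid = $1;
  name = $2;
  # Build a Pig bag literal such as {(Action),(Comedy)} from "Action|Comedy".
  features = "{"
  n = split($3, feature, "|")
  for (i = 1; i <= n; i++)
  {
    if (i != 1)
      features = features ","
    features = features "(" feature[i] ")";
  }
  features = features "}"
  # Emit the movie ID unchanged so it still joins with the movieid in ratings.t.
  print rowid "#" name "#" features;
}
END {}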
Change the column separator from :: to #, and convert the genre list of each movie with conv.awk, as follows:
sed 's/::/#/g' movies.dat > movies1.t
awk -f conv.awk movies1.t > movies.t
sed 's/::/#/g' ratings.dat > ratings.t
sed 's/::/#/g' users.dat > users.t
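To see what these commands do to a single record, the sketch below runs the same sed substitution and an inlined copy of the conv.awk logic on one line. The `to_bag` function name and the simplified sample line are assumptions for illustration; they are not part of the dataset or of Hivemall.

```shell
#!/bin/sh
# Hypothetical helper: reads "id::title::g1|g2" lines on stdin and writes
# "id#title#{(g1),(g2)}" lines, mirroring the sed + conv.awk steps.
to_bag() {
  sed 's/::/#/g' | awk -F'#' '{
    n = split($3, g, "|")
    bag = "{"
    for (i = 1; i <= n; i++) {
      sep = (i > 1) ? "," : ""
      bag = bag sep "(" g[i] ")"
    }
    print $1 "#" $2 "#" bag "}"
  }'
}

# Simplified sample line in the movies.dat layout (MovieID::Title::Genres).
printf '%s\n' '1::Toy Story (1995)::Animation|Comedy' | to_bag
# prints: 1#Toy Story (1995)#{(Animation),(Comedy)}
```

The third field is written as a bag literal so that PigStorage can load it into a bag-typed column.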
Create a file named occupations.t with the following contents:
0#other/not specified
1#academic/educator
2#artist
3#clerical/admin
4#college/grad student
5#customer service
6#doctor/health care
7#executive/managerial
8#farmer
9#homemaker
10#K-12 student
11#lawyer
12#programmer
13#retired
14#sales/marketing
15#scientist
16#self-employed
17#technician/engineer
18#tradesman/craftsman
19#unemployed
20#writer
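Before uploading, it can help to confirm that every line of each .t file has the expected number of #-separated fields. The following is a minimal sketch, assuming the files are in the current directory. `count_bad_lines` is a hypothetical helper name, and the expected field counts are taken from the load statements of the Pig script (3 for movies.t, 4 for ratings.t, 5 for users.t) and from the two-column layout of occupations.t.

```shell
#!/bin/sh
# Hypothetical helper: prints how many lines of FILE do not have exactly
# N fields when split on '#'.
# Usage: count_bad_lines FILE N
count_bad_lines() {
  awk -F'#' -v n="$2" 'NF != n { bad++ } END { print bad + 0 }' "$1"
}

# Check each file only if it exists in the current directory.
for spec in movies.t:3 ratings.t:4 users.t:5 occupations.t:2; do
  file=${spec%%:*}
  fields=${spec##*:}
  if [ -f "$file" ]; then
    echo "$file: $(count_bad_lines "$file" "$fields") bad line(s)"
  fi
done
```

A non-zero count usually means a separator was missed or a value itself contains a # character.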
Upload the files to your HDFS home directory:
hadoop fs -put movies.t .
hadoop fs -put ratings.t .
hadoop fs -put users.t .
hadoop fs -put occupations.t .
Run the following Pig script to create the training, testing, and fold10 outputs:
%default seed 31
%default kfold 10
define concat_ws HiveUDF('concat_ws');
define sort_array HiveUDF('sort_array');
define floor HiveUDF('floor');
define rand HiveUDF('rand');
rmf training
rmf testing
rmf fold10
ratings = load 'ratings.t' using PigStorage('#') as (userid:int, movieid:int, rating:int, tstamp:chararray);
movies = load 'movies.t' using PigStorage('#') as (movieid:int, title:chararray, genres:{(genre:chararray)});
users = load 'users.t' using PigStorage('#') as (userid:int, gender:chararray, age:int, occupation:int, zipcode:chararray);
ratingmovie = join ratings by movieid left outer, movies by movieid;
ratingmovie = foreach ratingmovie generate ratings::userid as userid, ratings::movieid as movieid, ratings::rating as rating, ratings::tstamp as tstamp,
movies::title as m_title, sort_array(movies::genres) as m_genres:{(genre:chararray)};
ratingmovie = foreach ratingmovie generate userid, movieid, rating, tstamp, m_title, concat_ws('|', m_genres) as m_genres;
joined = join ratingmovie by userid, users by userid;
randjoined = foreach joined generate rand($seed) as rand, ratingmovie::userid as userid, ratingmovie::movieid as movieid, ratingmovie::rating as rating;
sorted = order randjoined by rand;
training = limit sorted 800000;
training = foreach training generate userid, movieid, rating;
store training into 'training';
sorteddesc = order randjoined by rand desc;
testing = limit sorteddesc 200209;
testing = foreach testing generate userid, movieid, rating;
store testing into 'testing';
fold10 = foreach joined generate floor(RANDOM(${seed})*${kfold}) as gid, RANDOM($seed) as rand, ratingmovie::userid as userid, ratingmovie::movieid as movieid, ratingmovie::rating as rating;
sorted = order fold10 by rand;
sorted = foreach sorted generate gid, userid, movieid, rating;
store sorted into 'fold10';
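The fold assignment in the last step can be previewed locally on a small sample. The sketch below is an awk approximation and not the Pig job itself: it gives each `userid#movieid#rating` line a group id computed as floor(rand * kfold), using awk's own `srand` and `rand`, so the ids will differ from those Pig produces. The `assign_folds` name and the sample lines are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: prefixes each stdin line with a fold id in [0, KFOLD).
# Usage: assign_folds SEED KFOLD < input
assign_folds() {
  awk -v seed="$1" -v kfold="$2" '
    BEGIN { srand(seed) }
    { print int(rand() * kfold) "#" $0 }
  '
}

# Three made-up ratings (userid#movieid#rating); seed 31 and 10 folds
# match the %default values of the script.
printf '%s\n' '1#1193#5' '1#661#3' '2#1357#5' | assign_folds 31 10
```

Because rand() returns a value in [0, 1), the group id always falls between 0 and kfold - 1, which is the same range the `gid` column of the fold10 output is expected to cover.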