Pig_MovieLens Dataset

daijyc edited this page Mar 2, 2015 · 3 revisions

Data preparation

First, download the MovieLens dataset (ml-1m) from the following site:

http://www.grouplens.org/system/files/ml-1m.zip

Details about the dataset are given in the README:

http://files.grouplens.org/papers/ml-1m-README.txt

The archive contains three .dat files:

movies.dat, ratings.dat, users.dat.

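Save the following script as conv.awk. It rewrites the "|"-separated genre list in the third column as a Pig bag literal, e.g. {(Animation),(Comedy)}:

# conv.awk
BEGIN{ FS="#" }
{
    rowid=$1;
    name=$2;
    # Build a Pig bag literal: {(genre1),(genre2),...}
    features="{"
    n=split($3,feature,"|")
    for(i=1;i<=n;i++)
    {
        if (i!=1)
            features = features ","
        features = features "(" feature[i] ")";
    }
    features = features "}"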
    print rowid "#" name "#" features;
}
END{}

Change the column separator from "::" to "#", and convert the genre column of movies.dat with conv.awk, as follows:

sed 's/::/#/g' movies.dat > movies1.t
awk -f conv.awk movies1.t > movies.t
sed 's/::/#/g' ratings.dat > ratings.t
sed 's/::/#/g' users.dat > users.t
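
To make the effect of the conversion concrete, here is a minimal sketch that applies the same two steps to a single record. The function name `convert_movie_line` is hypothetical, and the awk program is an inlined, condensed variant of conv.awk that assumes the movie id is meant to pass through unchanged.

```shell
# Sample record in the original movies.dat layout (MovieID::Title::Genres).
sample="1::Toy Story (1995)::Animation|Children's|Comedy"

# convert_movie_line LINE (hypothetical helper)
# Step 1: replace '::' with '#'.
# Step 2: wrap each '|'-separated genre in parentheses inside braces,
#         which is the text form Pig reads as a bag of tuples.
convert_movie_line() {
    printf '%s\n' "$1" | sed 's/::/#/g' | awk -F'#' '{
        n = split($3, genre, "|")
        bag = "{"
        for (i = 1; i <= n; i++) {
            sep = (i > 1) ? "," : ""
            bag = bag sep "(" genre[i] ")"
        }
        print $1 "#" $2 "#" bag "}"
    }'
}

convert_movie_line "$sample"
# prints: 1#Toy Story (1995)#{(Animation),(Children's),(Comedy)}
```

Running `head -n 3 movies.t` after the real conversion should show lines of the same shape.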

Create a file named occupations.t with the following contents:

0#other/not specified
1#academic/educator
2#artist
3#clerical/admin
4#college/grad student
5#customer service
6#doctor/health care
7#executive/managerial
8#farmer
9#homemaker
10#K-12 student
11#lawyer
12#programmer
13#retired
14#sales/marketing
15#scientist
16#self-employed
17#technician/engineer
18#tradesman/craftsman
19#unemployed
20#writer
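
Since occupations.t is typed by hand, a quick sanity check can catch a missing or misnumbered row. The following is only a sketch; `check_occupations` is a hypothetical helper name, and the check assumes exactly the 21 rows listed above.

```shell
# check_occupations FILE (hypothetical helper)
# Succeeds only if FILE has exactly 21 rows, each with two '#'-separated
# fields, whose first field runs 0, 1, ..., 20 and whose label is non-empty.
check_occupations() {
    awk -F'#' '
        NF != 2 || $1 != NR - 1 || $2 == "" { bad = 1 }
        END { exit (NR == 21 && !bad) ? 0 : 1 }
    ' "$1"
}
```

For example: `check_occupations occupations.t && echo "occupations.t looks well-formed"`.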

Put data into HDFS

hadoop fs -put movies.t .
hadoop fs -put ratings.t .
hadoop fs -put users.t .
hadoop fs -put occupations.t .

Creating training/testing data

%default seed 31
%default kfold 10

define concat_ws HiveUDF('concat_ws');
define sort_array HiveUDF('sort_array');
define floor HiveUDF('floor');
define rand HiveUDF('rand');

rmf training
rmf testing
rmf fold10

ratings = load 'ratings.t' using PigStorage('#') as (userid:int, movieid:int, rating:int, tstamp:chararray);
movies = load 'movies.t' using PigStorage('#') as (movieid:int, title:chararray, genres:{(genre:chararray)});
users = load 'users.t' using PigStorage('#') as (userid:int, gender:chararray, age:int, occupation:int, zipcode:chararray);
ratingmovie = join ratings by movieid left outer, movies by movieid;
ratingmovie = foreach ratingmovie generate ratings::userid as userid, ratings::movieid as movieid, ratings::rating as rating, ratings::tstamp as tstamp,
  movies::title as m_title, sort_array(movies::genres) as m_genres:{(genre:chararray)};
ratingmovie = foreach ratingmovie generate userid, movieid, rating, tstamp, m_title, concat_ws('|', m_genres) as m_genres;
joined = join ratingmovie by userid, users by userid;
randjoined = foreach joined generate rand($seed) as rand, ratingmovie::userid as userid, ratingmovie::movieid as movieid, ratingmovie::rating as rating;

-- Training set: the 800,000 rows with the smallest random keys.
sorted = order randjoined by rand;
training = limit sorted 800000;
training = foreach training generate userid, movieid, rating;
store training into 'training';

-- Testing set: the remaining 200,209 rows (ml-1m has 1,000,209 ratings in total).
sorteddesc = order randjoined by rand desc;
testing = limit sorteddesc 200209;
testing = foreach testing generate userid, movieid, rating;
store testing into 'testing';

-- 10-fold data: gid (0 .. kfold-1) assigns each rating to one fold.
fold10 = foreach joined generate floor(RANDOM(${seed})*${kfold}) as gid, RANDOM($seed) as rand, ratingmovie::userid as userid, ratingmovie::movieid as movieid, ratingmovie::rating as rating;
sorted = order fold10 by rand;
sorted = foreach sorted generate gid, userid, movieid, rating;
store sorted into 'fold10';
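
The script above shuffles the ratings by a random key, takes the head as training data and the tail as testing data, and separately tags every rating with a fold id. The same idea can be tried on a small local file before running the Pig job. This is only a rough imitation, not the Pig script's exact behavior: `split_local`, `fold_local`, and the `*.local` file names are hypothetical, and awk's `rand()` will not reproduce the random numbers generated on the cluster.

```shell
# split_local FILE TRAIN_SIZE SEED (hypothetical helper)
# Prefix each line with a seeded pseudo-random key, sort on that key, then
# cut the shuffled list into a head (training) and the remaining tail (testing).
split_local() {
    awk -v seed="$3" 'BEGIN { srand(seed) } { printf "%.10f\t%s\n", rand(), $0 }' "$1" |
        LC_ALL=C sort -k1,1n | cut -f2- > shuffled.local
    head -n "$2" shuffled.local > training.local
    tail -n +"$(( $2 + 1 ))" shuffled.local > testing.local
    rm -f shuffled.local
}

# fold_local FILE KFOLD SEED (hypothetical helper)
# Prepend a fold id drawn from 0 .. KFOLD-1 to every line, '#'-separated.
fold_local() {
    awk -v seed="$3" -v k="$2" 'BEGIN { srand(seed) } { print int(rand() * k) "#" $0 }' "$1"
}
```

For example, `split_local ratings.t 800000 31` writes training.local and testing.local into the current directory, and `fold_local ratings.t 10 31 > fold10.local` writes the fold-tagged rows.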