# Week 12 Problem 3

If you are not using the `Assignments` tab on the course JupyterHub server to read this notebook, read [Activating the assignments tab](https://github.com/lcdm-uiuc/info490-sp17/blob/master/help/act_assign_tab.md).

A few things you should keep in mind when working on assignments:

1. Make sure you fill in every place that says `YOUR CODE HERE`. Do **not** write your answer anywhere other than where it says `YOUR CODE HERE`. Anything you write elsewhere will be removed or overwritten by the autograder.

2. Before you submit your assignment, make sure everything runs as expected. In the menubar, select _Kernel_, then restart the kernel and run all cells (_Restart & Run all_).

3. Do not change the title (i.e. file name) of this notebook.

4. Make sure that you save your work (in the menubar, select _File_ → _Save and Checkpoint_).

5. You are allowed to submit an assignment multiple times, but only the most recent submission will be graded.

-----
# Problem 12.3. Apache Pig

In this problem, we will run Pig to compute the average rating for each book in the book-crossing data set.

In [1]:
from nose.tools import assert_equal, assert_almost_equal

---
### Raw Data Preview

First, let's look at the data, in case you don't remember it from the w6p1 assignment:

In [2]:
!head -5 $HOME/data/book-crossing/BX-Book-Ratings.csv

"User-ID";"ISBN";"Book-Rating"
"276725";"034545104X";"0"
"276726";"0155061224";"5"
"276727";"0446520802";"0"
"276729";"052165615X";"3"


In [3]:
!head -5 $HOME/data/book-crossing/BX-Books.csv

"ISBN";"Book-Title";"Book-Author";"Year-Of-Publication";"Publisher";"Image-URL-S";"Image-URL-M";"Image-URL-L"
"0195153448";"Classical Mythology";"Mark P. O. Morford";"2002";"Oxford University Press";"http://images.amazon.com/images/P/0195153448.01.THUMBZZZ.jpg";"http://images.amazon.com/images/P/0195153448.01.MZZZZZZZ.jpg";"http://images.amazon.com/images/P/0195153448.01.LZZZZZZZ.jpg"
"0002005018";"Clara Callan";"Richard Bruce Wright";"2001";"HarperFlamingo Canada";"http://images.amazon.com/images/P/0002005018.01.THUMBZZZ.jpg";"http://images.amazon.com/images/P/0002005018.01.MZZZZZZZ.jpg";"http://images.amazon.com/images/P/0002005018.01.LZZZZZZZ.jpg"
"0060973129";"Decision in Normandy";"Carlo D'Este";"1991";"HarperPerennial";"http://images.amazon.com/images/P/0060973129.01.THUMBZZZ.jpg";"http://images.amazon.com/images/P/0060973129.01.MZZZZZZZ.jpg";"http://images.amazon.com/images/P/0060973129.01.LZZZZZZZ.jpg"
"0374157065";"Flu: The Story of the Great Influenza Pandemic of 1918

---
### Data Preprocessing
To make the messy data easier to process later, the bash script below does the following:
- Removes the header line;
- Removes the quotation marks around the data in each field (otherwise there might be problems with numbers);
- Cuts the last four columns of `BX-Books.csv` (the publisher and the three image URLs), which we don't need for this problem;
- Saves the output as `ratings.csv` and `books.csv` in the current directory (the same directory as this notebook);
- Displays the first 10 lines of each output csv file.

The columns in the processed files are:
- Columns of ratings.csv: **User-ID; ISBN; Book-Rating**
- Columns of books.csv: **ISBN; Book-Title; Book-Author; Year-Of-Publication**
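
For comparison, the same three transformations can be sketched in plain Python on a few in-memory lines. This is only an illustration, not part of the assignment: the function name `preprocess` and its `keep_fields` parameter are hypothetical, and the sample lines are copied from the raw data preview above.

```python
def preprocess(lines, keep_fields=None):
    """Drop the header, strip double quotes, and optionally keep the first N fields."""
    cleaned = []
    for line in lines[1:]:                      # like sed '1d': skip the header line
        line = line.rstrip('\n').replace('"', '')   # like sed 's/"//g'
        if keep_fields is not None:
            # like cut -d';' -f -N: keep only the first N semicolon-separated fields
            line = ';'.join(line.split(';')[:keep_fields])
        cleaned.append(line)
    return cleaned


raw_ratings = [
    '"User-ID";"ISBN";"Book-Rating"',
    '"276725";"034545104X";"0"',
    '"276726";"0155061224";"5"',
]
raw_books = [
    '"ISBN";"Book-Title";"Book-Author";"Year-Of-Publication";"Publisher";"Image-URL-S";"Image-URL-M";"Image-URL-L"',
    '"0195153448";"Classical Mythology";"Mark P. O. Morford";"2002";"Oxford University Press";"http://images.amazon.com/images/P/0195153448.01.THUMBZZZ.jpg";"http://images.amazon.com/images/P/0195153448.01.MZZZZZZZ.jpg";"http://images.amazon.com/images/P/0195153448.01.LZZZZZZZ.jpg"',
]

print(preprocess(raw_ratings))
# ['276725;034545104X;0', '276726;0155061224;5']
print(preprocess(raw_books, keep_fields=4))
# ['0195153448;Classical Mythology;Mark P. O. Morford;2002']
```

Keeping the first four fields is what drops the publisher and the three image URL columns.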


In [4]:
%%bash

sed 's/"//g' $HOME/data/book-crossing/BX-Book-Ratings.csv | sed '1d' > ratings.csv
sed 's/"//g' $HOME/data/book-crossing/BX-Books.csv | cut -d';' -f -4 | sed '1d' > books.csv

echo
echo '***** Ratings File *****'
head ratings.csv

echo
echo '***** Books File *****'
head books.csv


***** Ratings File *****
276725;034545104X;0
276726;0155061224;5
276727;0446520802;0
276729;052165615X;3
276729;0521795028;6
276733;2080674722;0
276736;3257224281;8
276737;0600570967;6
276744;038550120X;7
276745;342310538;10

***** Books File *****
0195153448;Classical Mythology;Mark P. O. Morford;2002
0002005018;Clara Callan;Richard Bruce Wright;2001
0060973129;Decision in Normandy;Carlo D'Este;1991
0374157065;Flu: The Story of the Great Influenza Pandemic of 1918 and the Search for the Virus That Caused It;Gina Bari Kolata;1999
0393045218;The Mummies of Urumchi;E. J. W. Barber;1999
0399135782;The Kitchen God's Wife;Amy Tan;1991
0425176428;What If?: The World's Foremost Military Historians Imagine What Might Have Been;Robert Cowley;2000
0671870432;PLEADING GUILTY;Scott Turow;1993
0679425608;Under the Black Flag: The Romance and the Reality of Life Among the Pirates;David Cordingly;1996
074322678X;Where You'll Find Me: And Other Stories;Ann Beattie;2002


-----
### Pig Latin: Average

Write a Pig script that:

- Imports `ratings.csv` and `books.csv` (note that the fields in these two files are separated by semicolons),
- Groups all reviews by ISBN and uses [AVG](https://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#AVG) to compute the average rating for each book,
- Joins the averaged rating dataset and the book dataset on the ISBN column, 
- Sorts the joined dataset by book title using default ascending string order, and
- Uses the DUMP command to display the first 10 rows.

The resulting schema should contain six columns:

```
(ISBN and average rating computed from ratings.csv; ISBN, book title, book author, and publication year from books.csv)
```

For example, the second line should be (the first line is harder to read since its title has commas):

```
(0964147726,0.0,0964147726, Always Have Popsicles,Rebecca Harvin,1994)
```
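
To make the six-column shape concrete, the steps listed above (group, average, join, sort, limit) can be sketched in plain Python. This is a rough analogy rather than how Pig executes the script. The function name `average_and_join` and the toy rows are hypothetical and are not taken from the Book-Crossing files, and the sketch assumes each ISBN appears at most once in the book list.

```python
from collections import defaultdict


def average_and_join(ratings, books, limit=10):
    """Mimic GROUP/AVG, JOIN, ORDER BY title, and LIMIT on in-memory tuples."""
    # GROUP ratings BY ISBN
    grouped = defaultdict(list)
    for _user_id, isbn, rating in ratings:
        grouped[isbn].append(rating)

    # FOREACH group GENERATE group, AVG(rating)
    averages = {isbn: sum(vals) / len(vals) for isbn, vals in grouped.items()}

    # Inner JOIN on ISBN: only ISBNs present on both sides survive.
    joined = [(book[0], averages[book[0]]) + tuple(book)
              for book in books if book[0] in averages]

    # ORDER BY title (index 3 of the joined row), then LIMIT.
    return sorted(joined, key=lambda row: row[3])[:limit]


# Hypothetical toy data: (userID, ISBN, rating) and (ISBN, title, author, year).
ratings = [(1, 'B2', 4.0), (2, 'A1', 10.0), (3, 'B2', 8.0), (4, 'C3', 7.0)]
books = [('A1', 'Zebra Tales', 'Ann Author', 2001),
         ('B2', 'Apple Pie', 'Bob Writer', 1999)]

for row in average_and_join(ratings, books):
    print(row)
# ('B2', 6.0, 'B2', 'Apple Pie', 'Bob Writer', 1999)
# ('A1', 10.0, 'A1', 'Zebra Tales', 'Ann Author', 2001)
```

The ISBN appears twice in each row because the join keeps the key column from both inputs, which matches the expected output format above. The rating for `C3` disappears because no book with that ISBN exists in the toy book list.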

Some hints for debugging:

- Don't rush to the end; do and check one step at a time.
- Use operations that display output wisely, e.g. DESCRIBE, ILLUSTRATE.
- Before you use DUMP, make sure that you are trying to display a small number of rows, instead of all rows at a time. Otherwise your notebook might crash.
- Take advantage of the debugging cell provided before the assertion cell.

In [5]:
%%writefile average.pig

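-- Load both semicolon-delimited files with explicit schemas.
ratings = LOAD 'ratings.csv' USING PigStorage(';')
    AS (userID: int, ISBN: chararray, rating: double);

books = LOAD 'books.csv' USING PigStorage(';')
    AS (ISBN: chararray, title: chararray, author: chararray, year: int);

-- Group the ratings by ISBN and average the ratings within each group.
book_group = GROUP ratings BY ISBN;
average_ratings = FOREACH book_group GENERATE group AS ISBN, AVG(ratings.rating);

-- Join the averages with the book metadata on ISBN.
book_ratings = JOIN average_ratings BY ISBN, books BY ISBN;

-- Sort by title (ascending by default) and keep only the first 10 rows.
ordered_books = ORDER book_ratings BY title;
top_rows = LIMIT ordered_books 10;
DUMP top_rows;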

Writing average.pig


In [None]:
average_ratings = !pig -x local -f average.pig 2> pig_stderr.log
print('\n'.join(average_ratings))

To debug, uncomment and run the following code cell.

In [7]:
#!cat pig_stderr.log

----
### Tests

In [8]:
answer = [
    '(0590567330,2.25,0590567330, A Light in the Storm: The Civil War Diary of Amelia Martin, Fenwick Island, Delaware, 1861 (Dear America),Karen Hesse,1999)',
    '(0964147726,0.0,0964147726, Always Have Popsicles,Rebecca Harvin,1994)',
    '(0942320093,0.0,0942320093, Apple Magic (The Collector\'s series),Martina Boudreau,1984)',
    '(0310232546,8.0,0310232546, Ask Lily (Young Women of Faith: Lily Series, Book 5),Nancy N. Rue,2001)',
    '(0962295701,0.0,0962295701, Beyond IBM: Leadership Marketing and Finance for the 1990s,Lou Mobley,1989)',
    '(0439188970,0.0,0439188970, Clifford Visita El Hospital (Clifford El Gran Perro Colorado),Norman Bridwell,2000)',
    '(0399151788,10.0,0399151788, Dark Justice,Jack Higgins,2004)',
    '(0786000015,0.0,0786000015, Deceived,Carla Simpson,1994)',
    '(006250746X,5.0,006250746X, Earth Prayers From around the World: 365 Prayers, Poems, and Invocations for Honoring the Earth,Elizabeth Roberts,1991)',
    '(1566869250,5.0,1566869250, Final Fantasy Anthology: Official Strategy Guide (Brady Games),David Cassady,1999)'
    ]

a1 = [a.split(',') for a in answer]
a2 = [a.split(',') for a in average_ratings]

for irow, row in enumerate(answer):
    for icol in [0, 2, 3]:
        assert_equal(a1[irow][icol], a2[irow][icol])
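    # Column 1 is a float, so compare it approximately.
    assert_almost_equal(float(a1[irow][1]), float(a2[irow][1]))
    # Titles in rows 0, 3, and 8 contain commas, so splitting on commas
    # misaligns their author and year columns; skip those two columns.
    if irow in [0, 3, 8]:
        continue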
    for icol in [4, 5]:
        assert_equal(a1[irow][icol], a2[irow][icol])
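
The test above skips the author and year columns for rows 0, 3, and 8 because their titles contain commas. One possible way to parse such lines is to split from both ends and treat whatever is left in the middle as the title. The sketch below is not part of the grader. The helper name `parse_dump_line` is hypothetical, and it assumes that the ISBN, average, author, and year fields never contain commas.

```python
def parse_dump_line(line):
    """Split one DUMP line into six fields, tolerating commas in the title."""
    body = line.strip()
    # Remove only the outer parentheses; the title may contain its own.
    if body.startswith('(') and body.endswith(')'):
        body = body[1:-1]
    # The first three fields (ISBN, average, ISBN) are taken from the left.
    isbn_left, avg, isbn_right, rest = body.split(',', 3)
    # The last two fields (author, year) are taken from the right;
    # the remainder is the title, commas included.
    title, author, year = rest.rsplit(',', 2)
    return [isbn_left, float(avg), isbn_right, title, author, year]


sample = ('(0590567330,2.25,0590567330, A Light in the Storm: The Civil War Diary of '
          'Amelia Martin, Fenwick Island, Delaware, 1861 (Dear America),Karen Hesse,1999)')
fields = parse_dump_line(sample)
print(fields[4], fields[5])
# Karen Hesse 1999
```

The title keeps its leading space, exactly as it appears in the DUMP output.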

---
### Cleanup

In [9]:
%%bash
# Remove pig log files
rm -f pig*.log

# Remove our pig scripts
rm -f *.pig

# Remove csv files
rm -f books.csv ratings.csv