<DIV ALIGN=CENTER>

# Introduction to Pig
## Professor Robert J. Brunner
  
</DIV>  
-----

## Introduction


In this Jupyter Notebook, we introduce Pig, a tool that simplifies
writing MapReduce programs. We will run Pig from within the Notebook,
but Pig can also be run at the Unix command line when Hadoop has been
properly installed. In this course, we have a Docker container that has
Hadoop installed as a single-node system. You can open a terminal window
from within the JupyterHub environment to execute Pig commands at the
Unix prompt, or simply use a Jupyter Notebook as demonstrated below.

-----


## Pig

Pig is a tool in the Hadoop ecosystem that simplifies the creation of
MapReduce programs. These programs are written in a language called
_Pig Latin_, which is processed by the Pig platform. We can invoke the
Pig platform at the command line, or in this case within a Jupyter cell,
and display the platform usage with the `-help` flag. Pig can run in
local mode or in Hadoop mode; testing a script in local mode before
running it in Hadoop mode simplifies development. In the rest of this
Notebook, we will develop two Pig applications, running Pig in both
local and Hadoop modes.

-----

In [1]:
!pig -help


Apache Pig version 0.15.0 (r1682971) 
compiled Jun 01 2015, 11:44:35

USAGE: Pig [options] [-] : Run interactively in grunt shell.
       Pig [options] -e[xecute] cmd [cmd ...] : Run cmd(s).
       Pig [options] [-f[ile]] file : Run cmds found in file.
  options include:
    -4, -log4jconf - Log4j configuration file, overrides log conf
    -b, -brief - Brief logging (no timestamps)
    -c, -check - Syntax check
    -d, -debug - Debug level, INFO is default
    -e, -execute - Commands to execute (within quotes)
    -f, -file - Path to the script to execute
    -g, -embedded - ScriptEngine classname or keyword for the ScriptEngine
    -h, -help - Display this message. You can specify topic to get help for that topic.
        properties is the only topic currently supported: -h properties.
    -i, -version - Display version information
    -l, -logfile - Path to client side log file; default is current working directory.
    -m, -param_file - Path to the parameter file

-----

### Local Pig Processing

We can now use Pig to repeat the Word Count example. In this simple example, we will need to load the data, tokenize the data, group the tokens, and count the occurrences of each token. For simplicity, we will display only the top occurring tokens, which means we need to order the data set by the number of occurrences (in descending order) and limit the output in some manner. Finally, we save the ordered result to a file and display only the top occurring tokens. To complete these steps, we will employ the following Pig operations:

- `LOAD`: to load a file into a Pig data set.

- `FOREACH`: to iterate through each item in a data set. We first step through each line in the file, before we step through each item in a group of tokens.

- `TOKENIZE`: to generate tokens from each line in the file.

- `GENERATE FLATTEN`: to un-nest the tokens from each line, so that each word becomes its own row.

- `GROUP ... BY`: to group all identical tokens together.

- `COUNT`: to compute the number of times each token occurs in the grouped token data set.

- `ORDER ... BY ... DESC`: to order the data set in descending order by a column. In this case, we use `$1` to refer to the second column (positions start at `$0`), which holds the count for each token.

- `LIMIT`: to limit the number of items in the data set.

- `STORE ... INTO`: to save the results into a named file.

- `DUMP`: to display the data to the screen.

Once we have created this Pig script, we run Pig in local mode (via the `-x local` flag) and send the STDERR to `/dev/null`, which hides all informational messages displayed by Pig (or Hadoop). This results in a cleaner display, but will potentially hide useful messages, so use with care.
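
To make the data flow concrete before reading the Pig script, the steps listed above can be sketched in plain Python. This is only an illustration of the logic, not how Pig executes it; the delimiter set is an assumed approximation of the default `TOKENIZE` behavior, and `sample_lines` is made-up input.

```python
import re
from collections import Counter

# Assumption: approximate Pig's default TOKENIZE delimiters
# (space, double quote, comma, parentheses, asterisk).
PIG_DELIMITERS = re.compile(r'[ ",()*]+')


def tokenize(line):
    """Split one line into non-empty tokens."""
    return [tok for tok in PIG_DELIMITERS.split(line) if tok]


def top_words(lines, limit=10):
    # FOREACH ... GENERATE FLATTEN(TOKENIZE(Line)): one flat stream of words
    words = (word for line in lines for word in tokenize(line))
    # GROUP ... BY Word, followed by COUNT on each group
    counts = Counter(words)
    # ORDER ... BY $1 DESC, followed by LIMIT
    return counts.most_common(limit)


# Made-up input standing in for the lines of book.txt
sample_lines = ["the cat and the hat", "the cat"]
print(top_words(sample_lines, limit=2))
```

Each Python step corresponds to one Pig relation (`Words`, `Groups`, `Counts`, `Results`, `Top_Results`) in the script that follows.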

-----

In [2]:
%%writefile /home/data_scientist/hadoop/wordcount-local.pig

Lines = LOAD 'book.txt' AS (Line:chararray) ;
Words = FOREACH Lines GENERATE FLATTEN (TOKENIZE (Line)) AS Word ;
Groups = GROUP Words BY Word ;
Counts = FOREACH Groups GENERATE group, COUNT (Words) ;
Results = ORDER Counts BY $1 DESC ;
Top_Results = LIMIT Results 10 ;
STORE Results INTO 'top_words' ;
DUMP Top_Results ;

Writing /home/data_scientist/hadoop/wordcount-local.pig


In [3]:
%%bash

cd $HOME/hadoop

# We run locally, and send pig/hadoop messages to nowhere
pig -x local -f wordcount-local.pig 2> /dev/null

(the,13634)
(of,8141)
(and,6691)
(a,5868)
(to,4819)
(in,4663)
(his,3051)
(he,2791)
(I,2450)
(with,2399)


-----

For reference, the output of our Python-based MapReduce was as follows:

| Word | Count | | Word | Count |
| :----: | ----: | --- | :----: | ----: |
| the | 13600 | | in | 4606 |
| of | 8127 | | his | 3035 |
| and | 6542 | | he | 2712 |
| a | 5842 | | I | 2432 |
| to | 4787 | | with | 2391 |

As seen in the output of the previous cell, the counts are different.
The reason is simply that Pig tokenizes the text differently from our
original approach, which split only on white space. By default, the Pig
`TOKENIZE` function also splits on double quotes, commas, parentheses,
and asterisks, so a token such as `(the` is counted as `the`. (If you
did the student assignment in the Map/Reduce Notebook, you will have
seen the effect of removing punctuation during the tokenization process.)
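
The effect of the tokenization rule on a count can be seen in a small Python sketch. The sample sentence is made up, and the regular expression is an assumed approximation of the Pig default delimiters rather than the Pig implementation itself.

```python
import re
from collections import Counter

# Made-up line containing punctuation attached to words
line = 'the cat, (the dog) and "the" bird'

# Whitespace-only tokenization, as in a simple Python mapper
whitespace_counts = Counter(line.split())

# Assumed approximation of Pig's default TOKENIZE delimiters
pig_like_tokens = [tok for tok in re.split(r'[ ",()*]+', line) if tok]
pig_like_counts = Counter(pig_like_tokens)

print("whitespace:", whitespace_counts["the"])  # 1
print("pig-like:  ", pig_like_counts["the"])    # 3
```

With whitespace splitting, `(the` and `"the"` are distinct tokens, so fewer occurrences are attributed to `the`; this is consistent with the Pig counts being somewhat higher than the earlier Python counts.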

-----

In [4]:
%%bash

cd $HOME/hadoop

# Show the contents of the local output
ls -la top_words

# Now display top words from output
echo
echo 'Top Words'
head -10 top_words/part-r-00000

total 468
drwxr-xr-x 2 data_scientist users   4096 Apr  1 21:09 .
drwxr-xr-x 3 data_scientist users   4096 Apr  1 21:09 ..
-rw-r--r-- 1 data_scientist users 459046 Apr  1 21:09 part-r-00000
-rw-r--r-- 1 data_scientist users   3596 Apr  1 21:09 .part-r-00000.crc
-rw-r--r-- 1 data_scientist users      0 Apr  1 21:09 _SUCCESS
-rw-r--r-- 1 data_scientist users      8 Apr  1 21:09 ._SUCCESS.crc

Top Words
the	13634
of	8141
and	6691
a	5868
to	4819
in	4663
his	3051
he	2791
I	2450
with	2399


-----

## Setup Local Hadoop Environment


-----

In [5]:
!$HADOOP_PREFIX/etc/hadoop/hadoop-env.sh

!echo
!echo '========================================'
!echo

# Restart namenode and datanodes
!$HADOOP_PREFIX/sbin/start-dfs.sh
!$HADOOP_PREFIX/sbin/start-yarn.sh

!echo
!echo '========================================'
!echo

# Sometimes when the namenode is restarted, it enters Safe Mode, 
# not allowing any changes to the file system. 
# We do want to make changes, so we manually leave Safe Mode.

!$HADOOP_PREFIX/bin/hdfs dfsadmin -safemode leave



Starting namenodes on [eeacbe10081d]
eeacbe10081d: starting namenode, logging to /usr/local/hadoop/logs/hadoop-data_scientist-namenode-eeacbe10081d.out
localhost: starting datanode, logging to /usr/local/hadoop/logs/hadoop-data_scientist-datanode-eeacbe10081d.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop/logs/hadoop-data_scientist-secondarynamenode-eeacbe10081d.out
starting yarn daemons
starting resourcemanager, logging to /usr/local/hadoop/logs/yarn--resourcemanager-eeacbe10081d.out
localhost: starting nodemanager, logging to /usr/local/hadoop/logs/yarn-data_scientist-nodemanager-eeacbe10081d.out


Safe mode is OFF


-----

### Hadoop Pig

We can easily run our Pig script in Hadoop. The main change is that we
need to specify the location of our input and output data sets within
our Hadoop HDFS filesystem. In this case, we change the location of the
_book.txt_ file to be `wc/in/book.txt`, and we now use the Pig `STORE`
operation to save the ordered Pig data set into the `wc/out/top_words`
directory. Once we have made these changes to our Pig script, we can
simply run Pig in Hadoop mode (the default) to process the book and
compute the top words, ranked by total count, in the selected text.


-----

In [6]:
%%writefile /home/data_scientist/hadoop/wordcount.pig

Lines = LOAD 'wc/in/book.txt' AS (Line:chararray) ;
Words = FOREACH Lines GENERATE FLATTEN (TOKENIZE (Line)) AS Word ;
Groups = GROUP Words BY Word ;
Counts = FOREACH Groups GENERATE group, COUNT (Words) ;
Results = ORDER Counts BY $1 DESC ;
Top_Results = LIMIT Results 10 ;
STORE Results INTO 'wc/out/top_words' ;
DUMP Top_Results ;

Writing /home/data_scientist/hadoop/wordcount.pig


In [7]:
%%bash

# We remove old output if it exists and create output directory
$HADOOP_PREFIX/bin/hdfs dfs -rm -r -f wc/out 2> /dev/null
$HADOOP_PREFIX/bin/hdfs dfs -mkdir -p wc/out 2> /dev/null

cd $HOME/hadoop

# We run in Hadoop mode (the default), and send pig/hadoop messages to nowhere
pig -f wordcount.pig 2> /dev/null

Deleted wc/out
(the,13634)
(of,8141)
(and,6691)
(a,5868)
(to,4819)
(in,4663)
(his,3051)
(he,2791)
(I,2450)
(with,2399)


-----

We can use HDFS commands to show the contents of our HDFS file system
related to this Pig task, before we display the output of our Pig
script. In this case, we have the `book.txt` file in the input
directory, while the output directory holds the `_SUCCESS` file, which
indicates that the job completed, and the `part-r-00000` file, which
holds the output. We display the first ten lines of the `part-r-00000`
file to show the _top words_ in the selected book. The final
`cat: Unable to write to output stream.` message is harmless: `head`
closes the pipe after reading ten lines while `hdfs dfs -cat` is still
writing.

-----

In [8]:
%%bash

cd $HADOOP_PREFIX
# Display directory contents

echo
echo 'Input Directory'
bin/hdfs dfs -ls wc/in

echo
echo 'Output Directory'
bin/hdfs dfs -ls wc/out/top_words

# Write output

echo
echo 'Top Words'

bin/hdfs dfs -cat wc/out/top_words/part-r-00000 | head -10


Input Directory
Found 1 items
-rw-r--r--   1 data_scientist supergroup    1580927 2017-04-01 21:07 wc/in/book.txt

Output Directory
Found 2 items
-rw-r--r--   1 data_scientist supergroup          0 2017-04-01 21:11 wc/out/top_words/_SUCCESS
-rw-r--r--   1 data_scientist supergroup     459046 2017-04-01 21:11 wc/out/top_words/part-r-00000

Top Words
the	13634
of	8141
and	6691
a	5868
to	4819
in	4663
his	3051
he	2791
I	2450
with	2399


cat: Unable to write to output stream.


-----

### MovieLens Data Analysis

Earlier in this course, we downloaded the _small_ MovieLens data set to
learn about recommender systems. In the rest of this Notebook, we will
revisit that data to provide a second sample data set to learn about
using Pig. In particular, we will learn how to join data sets and how to
develop summary results by grouping similar rows. In the following
several code cells, we first name our new data directory, after which we
download and expand the MovieLens data. Finally, we display the first
few rows of one of the data files.

-----

In [9]:
# Name of the directory holding the Small MovieLens data
data_dir = '/home/data_scientist/hadoop'

In [10]:
# Download and unpack the small MovieLens data set
!wget --output-document=$data_dir/ml-latest-small.zip \
    http://files.grouplens.org/datasets/movielens/ml-latest-small.zip

!unzip -o $data_dir/ml-latest-small.zip -d $data_dir

--2017-04-01 21:12:52--  http://files.grouplens.org/datasets/movielens/ml-latest-small.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.34.146
Connecting to files.grouplens.org (files.grouplens.org)|128.101.34.146|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 918269 (897K) [application/zip]
Saving to: ‘/home/data_scientist/hadoop/ml-latest-small.zip’


2017-04-01 21:12:53 (2.62 MB/s) - ‘/home/data_scientist/hadoop/ml-latest-small.zip’ saved [918269/918269]

Archive:  /home/data_scientist/hadoop/ml-latest-small.zip
   creating: /home/data_scientist/hadoop/ml-latest-small/
  inflating: /home/data_scientist/hadoop/ml-latest-small/links.csv  
  inflating: /home/data_scientist/hadoop/ml-latest-small/movies.csv  
  inflating: /home/data_scientist/hadoop/ml-latest-small/ratings.csv  
  inflating: /home/data_scientist/hadoop/ml-latest-small/README.txt  
  inflating: /home/data_scientist/hadoop/ml-latest-small/tags.csv  


In [11]:
!head -10 $data_dir/ml-latest-small/ratings.csv

userId,movieId,rating,timestamp
1,31,2.5,1260759144
1,1029,3.0,1260759179
1,1061,3.0,1260759182
1,1129,2.0,1260759185
1,1172,4.0,1260759205
1,1263,2.0,1260759151
1,1287,2.0,1260759187
1,1293,2.0,1260759148
1,1339,3.5,1260759125


-----

For our first Pig task, we will load the `ratings.csv` file into Pig and
display the first ten lines, in order to compare with the output of the
Unix `head` command. For this, we will use the Pig `STREAM` operator,
which sends the data through an external program; for simplicity, we
use the Unix `head` command. In the next two code cells, we first save
these Pig commands into a script file, after which we execute this Pig
script in local mode and display the results. The output should match
the `head` output above, except that each row is now displayed as a Pig
tuple. Note that Pig treats the header row as a normal data row.

-----

In [12]:
%%writefile /home/data_scientist/hadoop/head.pig

ratings = LOAD 'ml-latest-small/ratings.csv' USING PigStorage(',') ;
tr = STREAM ratings THROUGH `head -10` AS (userID, movieID, rating, timestamp) ;
DUMP tr ;

Writing /home/data_scientist/hadoop/head.pig


In [13]:
%%bash

cd $HOME/hadoop

pig -x local -b -f head.pig 2> /dev/null

(userId,movieId,rating,timestamp)
(1,31,2.5,1260759144)
(1,1029,3.0,1260759179)
(1,1061,3.0,1260759182)
(1,1129,2.0,1260759185)
(1,1172,4.0,1260759205)
(1,1263,2.0,1260759151)
(1,1287,2.0,1260759187)
(1,1293,2.0,1260759148)
(1,1339,3.5,1260759125)


-----

Pig provides powerful mechanisms for processing data within a Hadoop
framework. However, one task that is not simple in the standard Hadoop
installation is removing header information. Since each data file in the
MovieLens data set includes a header row, we employ the following Bash
script to make copies of the two data files we will be using and to
remove the first line from each file. We then display the first two
lines of these two data files to demonstrate that the files now have no
header information. Note the use of `sed -i '1d'`, which deletes the
first line of the named file _in-place_.
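
For readers without GNU `sed`, the same in-place header removal can be sketched in Python. The helper name `drop_first_line` is hypothetical, and the demonstration runs on a throwaway temporary file rather than on the real MovieLens files.

```python
import os
import tempfile


def drop_first_line(path):
    """Rewrite the file at `path` without its first line."""
    with open(path, "r", encoding="utf-8") as src:
        src.readline()          # skip the header row
        remainder = src.read()  # acceptable for small files such as these
    with open(path, "w", encoding="utf-8") as dst:
        dst.write(remainder)


# Demonstration on a temporary copy, not the real data
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "ratings.csv")
    with open(demo, "w", encoding="utf-8") as f:
        f.write("userId,movieId,rating,timestamp\n1,31,2.5,1260759144\n")
    drop_first_line(demo)
    with open(demo, encoding="utf-8") as f:
        result = f.read()

print(result)
```

As with `sed -i`, the original content is overwritten, which is why the notebook copies each file to a new name first.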

-----

In [14]:
%%bash

cd $HOME/hadoop

# Copy original files to new name
cp ml-latest-small/ratings.csv ml-latest-small/original-ratings.csv
cp ml-latest-small/movies.csv ml-latest-small/original-movies.csv

# GNU SED allows inline editing, here we delete the first line from the file
sed -i '1d' ml-latest-small/ratings.csv
sed -i '1d' ml-latest-small/movies.csv

# List CSV files
ls -la ml-latest-small/*.csv

echo
echo '***** Ratings File *****'
head -2 ml-latest-small/ratings.csv

echo
echo '***** Movies File *****'
head -2 ml-latest-small/movies.csv

-rw-r--r-- 1 data_scientist users  183372 Oct 17 11:12 ml-latest-small/links.csv
-rw-r--r-- 1 data_scientist users  458368 Apr  1 21:12 ml-latest-small/movies.csv
-rw-r--r-- 1 data_scientist users  458390 Apr  1 21:12 ml-latest-small/original-movies.csv
-rw-r--r-- 1 data_scientist users 2438266 Apr  1 21:12 ml-latest-small/original-ratings.csv
-rw-r--r-- 1 data_scientist users 2438233 Apr  1 21:12 ml-latest-small/ratings.csv
-rw-r--r-- 1 data_scientist users   41902 Oct 17 11:12 ml-latest-small/tags.csv

***** Ratings File *****
1,31,2.5,1260759144
1,1029,3.0,1260759179

***** Movies File *****
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy


-----

### Pig: Data Processing

In the rest of this Notebook, we will use Pig to process the MovieLens
data set. First, we will read the data into Pig with a defined schema.
This includes naming the columns and specifying the actual data types
for each column. This first example will use the `DESCRIBE` and
`ILLUSTRATE` Pig commands to display the schema for the new Pig data set
and a tabular illustration of the schema with one sample row.
Finally, we use the `LIMIT` and `DUMP` Pig commands to display the
values for only a limited number of rows in the new Pig data set.

In the following two code cells, we first create the Pig script to
perform these operations, after which we run this script by using Pig in
local mode. Note that we could also employ the Pig `SAMPLE` command,
which randomly selects a fraction of the rows, to display a small subset
of the data in a similar manner.
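
The difference between a typed schema, `LIMIT`, and `SAMPLE` can be sketched in Python. The function names below are hypothetical stand-ins for the Pig operators, and the three input lines are copied from the `head` output shown earlier.

```python
import itertools
import random

# Column names and types mirroring a schema given in an AS clause
SCHEMA = [("userID", int), ("movieID", int), ("rating", float), ("timestamp", int)]


def parse_row(line):
    """Split a comma-delimited line and cast each field to its declared type."""
    fields = line.strip().split(",")
    return {name: cast(value) for (name, cast), value in zip(SCHEMA, fields)}


def limit(rows, n):
    """Like LIMIT: keep at most n rows."""
    return list(itertools.islice(rows, n))


def sample(rows, fraction, seed=None):
    """Like SAMPLE: keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


lines = ["1,31,2.5,1260759144", "1,1029,3.0,1260759179", "1,1061,3.0,1260759182"]
rows = [parse_row(line) for line in lines]

print(limit(rows, 2))
print(sample(rows, 0.5, seed=0))
```

`limit` always returns a fixed number of rows, while `sample` returns a random subset whose size varies from run to run unless a seed is fixed.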

-----

In [15]:
%%writefile /home/data_scientist/hadoop/ratings.pig

ratings = LOAD 'ml-latest-small/ratings.csv' USING PigStorage(',')
    AS (userID:int, movieID:int, rating:double, timestamp:int) ;
DESCRIBE ratings ;
ILLUSTRATE ratings ;
top_rows = LIMIT ratings 10 ;
DUMP top_rows ;

Writing /home/data_scientist/hadoop/ratings.pig


In [16]:
%%bash

cd $HOME/hadoop

pig -x local -f ratings.pig 2> /dev/null

ratings: {userID: int,movieID: int,rating: double,timestamp: int}
(5,30707,4.5,1163374340)
----------------------------------------------------------------------------------
| ratings     | userID:int   | movieID:int   | rating:double   | timestamp:int   | 
----------------------------------------------------------------------------------
|             | 5            | 30707         | 4.5             | 1163374340      | 
----------------------------------------------------------------------------------

(1,31,2.5,1260759144)
(1,1029,3.0,1260759179)
(1,1061,3.0,1260759182)
(1,1129,2.0,1260759185)
(1,1172,4.0,1260759205)
(1,1263,2.0,1260759151)
(1,1287,2.0,1260759187)
(1,1293,2.0,1260759148)
(1,1339,3.5,1260759125)
(1,1343,2.0,1260759131)


-----

### Pig: Join

We can use a _join_ operation with Pig to combine rows from multiple
tables, in the same manner as a SQL join. While the full Pig join syntax
can be complicated, especially if outer joins are involved, we
demonstrate a simple inner join in the following two code cells. The
_join_ operation syntax is straightforward: we specify the first data
set and its join column, followed by the second data set and its join
column. For example, to join two data sets on the _movie id_ column, we
employ the following Pig command:

```pig
movie_ratings = JOIN ratings by movieID, movies by movieID ;
```

In the following two code cells, we first create the Pig script to
perform this join operation, after which we run this script by using Pig
in local mode. We also use the `DESCRIBE` and `DUMP` commands to display
the schema for the new joined data set, as well as the first few rows of
the joined data.
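
The semantics of an inner join on a common column can be sketched in Python as a simple hash join. This illustrates the result only, not how Pig plans the join; the rows are small illustrative samples, and the rating with movie id 99 is a hypothetical row with no matching movie.

```python
from collections import defaultdict

ratings = [  # (userID, movieID, rating); illustrative rows
    (136, 1, 4.5),
    (138, 1, 2.0),
    (7, 2, 3.0),
    (9, 99, 5.0),  # hypothetical movieID with no match in movies
]
movies = [  # (movieID, title)
    (1, "Toy Story (1995)"),
    (2, "Jumanji (1995)"),
]


def inner_join(left, left_key, right, right_key):
    """Hash join: index the right side by key, then probe with each left row."""
    index = defaultdict(list)
    for row in right:
        index[row[right_key]].append(row)
    # Rows without a match on the other side are dropped (inner join)
    return [l + r for l in left for r in index.get(l[left_key], [])]


movie_ratings = inner_join(ratings, 1, movies, 0)
for row in movie_ratings:
    print(row)
```

As in the Pig output, each joined row carries the columns of both inputs, so the movie id appears twice.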

-----

In [17]:
%%writefile /home/data_scientist/hadoop/join.pig

ratings = LOAD 'ml-latest-small/ratings.csv' USING PigStorage(',')
    AS (userID:int, movieID:int, rating:double, timestamp:int) ;

movies = LOAD 'ml-latest-small/movies.csv' USING PigStorage(',')
    AS (movieID:int, title:chararray, genre:chararray) ;

movie_ratings = JOIN ratings by movieID, movies by movieID ;

DESCRIBE movie_ratings ;
top_rows = LIMIT movie_ratings 10 ;
DUMP top_rows ;

Writing /home/data_scientist/hadoop/join.pig


In [18]:
%%bash

cd $HOME/hadoop

pig -x local -b -f join.pig 2> /dev/null

movie_ratings: {ratings::userID: int,ratings::movieID: int,ratings::rating: double,ratings::timestamp: int,movies::movieID: int,movies::title: chararray,movies::genre: chararray}
(136,1,4.5,1405977265,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy)
(138,1,2.0,1440379001,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy)
(185,1,5.0,1003523268,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy)
(272,1,4.0,1453587590,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy)
(273,1,4.5,1466945564,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy)
(275,1,5.0,1350254117,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy)
(447,1,3.0,832492847,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy)
(520,1,3.0,1142028487,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy)
(538,1,5.0,831835670,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy)
(562,1,4.5,1167426662,1,Toy Story (1995),Adventur

-----

### Pig: Group By

Pig allows us to perform a _group by_ operation, which creates a summary
data set. In this example, we first keep only the ratings above three,
then use `GROUP ... BY` to group all remaining ratings for a specific
movie, before we join with the movie titles. The syntax of this command
is simple: we group a data set by a specific column. For example:

```pig
hr_group = GROUP high_ratings BY movieID ;
```

creates a new data set in which the high ratings are grouped by movie id.
In the following two code cells, we first create the Pig script to
perform this full operation, after which we run this script by using Pig
in local mode.
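
The full sequence of filter, group, count, join, order, and limit can be sketched in Python on a handful of made-up ratings. The variable names follow the Pig relations used in this section, but the data values are illustrative only.

```python
from collections import Counter

ratings = [  # (userID, movieID, rating); made-up rows
    (1, 527, 5.0), (2, 527, 4.0), (3, 527, 2.0),
    (1, 356, 4.5), (2, 356, 3.0),
    (3, 296, 3.5), (4, 296, 4.0), (5, 296, 5.0),
]
titles = {
    527: "Schindler's List (1993)",
    356: "Forrest Gump (1994)",
    296: "Pulp Fiction (1994)",
}

# FILTER ratings BY rating > 3
high_ratings = [r for r in ratings if r[2] > 3]

# GROUP high_ratings BY movieID, then COUNT the rows in each group
hr_count = Counter(movie_id for _, movie_id, _ in high_ratings)

# JOIN hr_count BY mvID, movies BY movieID
movie_ratings = [
    (movie_id, cnt, titles[movie_id])
    for movie_id, cnt in hr_count.items()
    if movie_id in titles
]

# ORDER movie_ratings BY cnt DESC, then LIMIT to the top two
top_movies = sorted(movie_ratings, key=lambda row: row[1], reverse=True)[:2]
print(top_movies)
```

Counting before joining keeps the joined data set small: one row per movie rather than one row per rating.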

-----

In [19]:
%%writefile /home/data_scientist/hadoop/join-group.pig

ratings = LOAD 'ml-latest-small/ratings.csv' USING PigStorage(',')
    AS (userID:int, movieID:int, rating:double, timestamp:int) ;

high_ratings = FILTER ratings BY rating > 3 ;

hr_group = GROUP high_ratings BY movieID ;

hr_count = FOREACH hr_group GENERATE group AS mvID, COUNT(high_ratings) AS cnt ;

movies = LOAD 'ml-latest-small/movies.csv' USING PigStorage(',')
    AS (movieID:int, title:chararray, genre:chararray) ;

movie_ratings = JOIN hr_count by mvID, movies by movieID ;

ordered_movies = ORDER movie_ratings BY cnt DESC ;

top_movies = LIMIT ordered_movies 10 ;

DESCRIBE top_movies ;

DUMP top_movies ;

Writing /home/data_scientist/hadoop/join-group.pig


In [20]:
%%bash

cd $HOME/hadoop

pig -x local -b -f join-group.pig 2> /dev/null

top_movies: {hr_count::mvID: int,hr_count::cnt: long,movies::movieID: int,movies::title: chararray,movies::genre: chararray}
(318,286,318,"Shawshank Redemption, The (1994)")
(356,273,356,Forrest Gump (1994),Comedy|Drama|Romance|War)
(296,269,296,Pulp Fiction (1994),Comedy|Crime|Drama|Thriller)
(593,259,593,"Silence of the Lambs, The (1991)")
(260,247,260,Star Wars: Episode IV - A New Hope (1977),Action|Adventure|Sci-Fi)
(2571,219,2571,"Matrix, The (1999)")
(527,214,527,Schindler's List (1993),Drama|War)
(1196,201,1196,Star Wars: Episode V - The Empire Strikes Back (1980),Action|Adventure|Sci-Fi)
(1198,198,1198,Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981),Action|Adventure)
(608,195,608,Fargo (1996),Comedy|Crime|Drama|Thriller)
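
Several rows in this output, such as the one for movie 318, show a title that is split at its embedded comma and no genre field. This is consistent with `PigStorage(',')` splitting on every comma, including those inside quoted fields. The Python sketch below contrasts a naive split with a quote-aware CSV parser; the genre value in the sample line is an illustrative placeholder. Within Pig, a quote-aware loader such as the Piggybank `CSVExcelStorage` class may avoid the problem, although that is an assumption not tested in this Notebook.

```python
import csv

# The genre value here is an illustrative placeholder
line = '318,"Shawshank Redemption, The (1994)",Crime|Drama'

# Naive split on every comma, ignoring the quotes
naive_fields = line.split(",")

# Quote-aware parsing keeps the title intact
csv_fields = next(csv.reader([line]))

print(naive_fields[:3])
print(csv_fields)
```

With the naive split, the first three fields match what Pig displays: the title is cut in two and the real genre falls outside the three-column schema.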


-----

### Pig Cleanup

After running these examples, you will have a number of files that
should be cleaned up, especially if you wish to rerun this Notebook.
These files include Pig log files, our Pig scripts, the MovieLens
data, and finally, the Pig output. The following code cell deletes all
of these files; you can modify this cell to leave some or all of these
files if you wish to analyze anything in more detail.

-----

In [21]:
%%bash

# Clean up the working directory (Don't run to save data for later analysis)
cd $HOME/hadoop

# Remove pig log files
rm -f pig*.log

# Remove our pig scripts
rm -f *.pig

# Remove the MovieLens data
rm -f ml-latest-small.zip
rm -rf ml-latest-small

# Remove our output file.
rm -rf top_words

# Display cleaned directory contents
ls -la

total 1560
drwxr-xr-x 2 data_scientist users    4096 Apr  1 21:13 .
drwxr-xr-x 1 data_scientist users    4096 Apr  1 21:05 ..
-rw-r--r-- 1 data_scientist users 1580927 Nov 21 11:09 book.txt
-rwxr--r-- 1 data_scientist users     849 Apr  1 21:06 mapper.py
-rwxr--r-- 1 data_scientist users    1496 Apr  1 21:06 reducer.py


In [22]:
# Having the namenode and datanodes running in the background consumes quite a bit of memory. 
# So we should shut down the nodes at the end of the notebook.

!$HADOOP_PREFIX/sbin/stop-dfs.sh
!$HADOOP_PREFIX/sbin/stop-yarn.sh

Stopping namenodes on [eeacbe10081d]
eeacbe10081d: stopping namenode
localhost: stopping datanode
Stopping secondary namenodes [0.0.0.0]
0.0.0.0: stopping secondarynamenode
stopping yarn daemons
stopping resourcemanager
localhost: stopping nodemanager
localhost: nodemanager did not stop gracefully after 5 seconds: killing with kill -9
no proxyserver to stop


-----

### Student Activity

In the preceding cells, we introduced Pig. Now that you have run the
Notebook, go back and make the following changes. Notice how the
results change.

1. Change the first Pig example to compute bi-grams as opposed to
unigrams.
2. Change the MovieLens example to compute the average rating for each
movie.
3. Generate a recommendation system (similar to our example in a
previous Notebook) by using Pig.

-----