If you are not using the `Assignments` tab on the course JupyterHub server to read this notebook, read [Activating the assignments tab](https://github.com/lcdm-uiuc/info490-sp17/blob/master/help/act_assign_tab.md).

A few things you should keep in mind when working on assignments:

1. Make sure you fill in any place that says `YOUR CODE HERE`. Do **not** write your answer anywhere other than where it says `YOUR CODE HERE`. Anything you write elsewhere will be removed or overwritten by the autograder.

2. Before you submit your assignment, make sure everything runs as expected. In the menubar, select _Kernel_ → _Restart & Run All_ to restart the kernel and run all cells.

3. Do not change the title (i.e. file name) of this notebook.

4. Make sure that you save your work (in the menubar, select _File_ → _Save and Checkpoint_).

5. You are allowed to submit an assignment multiple times, but only the most recent submission will be graded.

# Problem 12.1. Intro to Hadoop.
In this problem set, you will be doing simple exercises using Hadoop.  Before you start, however, you should be aware of the following: __you MUST delete YOUR CODE HERE in order for your code to work (comments beginning with # are NOT kosher for command-line statements!!)__. 

When you comment your code for this assignment, please put the comments on their own lines, above or below any command-line statements (lines starting with `!`), never on the same line.

In [1]:
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt

from nose.tools import assert_equal, assert_true, assert_is_instance
from numpy.testing import assert_array_almost_equal, assert_almost_equal

First, we stop any running namenodes and datanodes, format the namenode, and start everything again. Along the way we clear out `/tmp` to get rid of any spurious files that might gunk up our system.

In [2]:
!$HADOOP_PREFIX/sbin/stop-dfs.sh
!$HADOOP_PREFIX/sbin/stop-yarn.sh
!rm -rf /tmp/*
!echo "Y" | $HADOOP_PREFIX/bin/hdfs namenode -format 2> /dev/null
!$HADOOP_PREFIX/etc/hadoop/hadoop-env.sh
!$HADOOP_PREFIX/sbin/start-dfs.sh
!$HADOOP_PREFIX/sbin/start-yarn.sh
!$HADOOP_PREFIX/bin/hdfs dfsadmin -safemode leave
!$HADOOP_PREFIX/bin/hdfs dfs -mkdir -p /user/$NB_USER

Stopping namenodes on [info490rb.studentspace.cs.illinois.edu]
info490rb.studentspace.cs.illinois.edu: no namenode to stop
localhost: no datanode to stop
Stopping secondary namenodes [0.0.0.0]
0.0.0.0: no secondarynamenode to stop
stopping yarn daemons
no resourcemanager to stop
localhost: no nodemanager to stop
no proxyserver to stop
rm: cannot remove ‘/tmp/hsperfdata_root’: Operation not permitted
Formatting using clusterid: CID-8f55f22f-4240-49e8-9c56-78282c1e75fc
Starting namenodes on [info490rb.studentspace.cs.illinois.edu]
info490rb.studentspace.cs.illinois.edu: starting namenode, logging to /usr/local/hadoop/logs/hadoop-data_scientist-namenode-info490rb.studentspace.cs.illinois.edu.out
localhost: starting datanode, logging to /usr/local/hadoop/logs/hadoop-data_scientist-datanode-info490rb.studentspace.cs.illinois.edu.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop/logs/hadoop-data_scientist-secondarynamenode-info490rb.

## Part 1: Exploring files and the system
First, let's start out by listing the contents of the directory `/user/` in HDFS. When you do this, redirect the output into a file called `temp1.txt` so that you can pass the assertion tests below (the easiest way to do this is to append `> temp1.txt` to your command-line statement).

In [3]:
# Display HDFS directory
!$HADOOP_PREFIX/bin/hdfs dfs -ls /user/ > temp1.txt
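
The `>` redirection used above sends only standard output to the file; anything the command writes to standard error still shows up in the notebook. A rough Python equivalent of "run a command, save its stdout, read the lines back" is sketched below. The helper name `run_to_file` is hypothetical, and the command is a stand-in Python one-liner (not Hadoop) so that the sketch runs anywhere.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_to_file(cmd, out_path):
    """Run cmd, write its stdout to out_path, and return the lines."""
    with open(out_path, "w") as fh:
        # stderr is left alone, matching what `>` does in the shell.
        subprocess.run(cmd, stdout=fh, check=True)
    return Path(out_path).read_text().splitlines()


# Stand-in command (not Hadoop): prints two lines to stdout.
demo_cmd = [sys.executable, "-c", "print('Found 1 items'); print('drwxr-xr-x')"]

with tempfile.TemporaryDirectory() as tmp:
    lines = run_to_file(demo_cmd, Path(tmp) / "temp1.txt")

print(lines)
```

The returned list of strings plays the same role as `res1 = !cat temp1.txt` in the next cell.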

In [4]:
res1 = !cat temp1.txt

assert_is_instance(res1, list)
assert_is_instance(res1[0], str)
assert_is_instance(res1[1], str)
# `hdfs dfs -ls` prints a summary line first, then one line per entry.
assert_equal(res1[0], "Found 1 items")
assert_equal(res1[1][:40], "drwxr-xr-x   - data_scientist supergroup")
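
Each entry line printed by `hdfs dfs -ls` consists of whitespace-separated fields: permissions, replication factor, owner, group, size, modification date, modification time, and path. The sketch below splits one such line into a dictionary. The sample line is illustrative (only its first 40 characters come from the test above; the size, timestamp, and path are assumed), and `parse_ls_line` is a hypothetical helper.

```python
def parse_ls_line(line):
    """Split one `hdfs dfs -ls` entry line into named fields."""
    # maxsplit=7 keeps a path that contains spaces in one piece.
    perms, repl, owner, group, size, date, time, path = line.split(None, 7)
    return {
        "is_dir": perms.startswith("d"),
        "permissions": perms,
        "replication": repl,  # "-" for directories
        "owner": owner,
        "group": group,
        "size": int(size),
        "modified": f"{date} {time}",
        "path": path,
    }


# Illustrative sample line, not captured from a real run.
sample = (
    "drwxr-xr-x   - data_scientist supergroup"
    "          0 2017-04-13 08:44 /user/data_scientist"
)
entry = parse_ls_line(sample)
print(entry["owner"], entry["path"])
```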

## Part 2: Free space
Now, let's issue a Hadoop command that shows the free space available to us, making sure the output is human readable. Like before, redirect the output into a file called `temp2.txt` so that you can pass the assertion tests below (append `> temp2.txt` to your command-line statement).

In [5]:
# Free Space
!$HADOOP_PREFIX/bin/hdfs dfs -df -h > temp2.txt

In [6]:
res2 = !cat temp2.txt
assert_is_instance(res2, list)
assert_is_instance(res2[0], str)
assert_is_instance(res2[1], str)
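assert_equal(len(res2), 2)
# Compare column names only, since the padding depends on the hostname length.
assert_equal(res2[0].split(), ["Filesystem", "Size", "Used", "Available", "Use%"])
assert_equal(res2[1][:46], "hdfs://info490rb.studentspace.cs.illinois.edu:")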

## Part 3: Version
Next, let's get the version information for the Hadoop installation we are running, making sure to redirect the output into a file called `vers.txt`.

In [7]:
# version
!$HADOOP_PREFIX/bin/hdfs version > vers.txt

In [8]:
vers = !cat vers.txt
assert_true(all(isinstance(w, str) for w in vers))
assert_equal(vers[0], 'Hadoop 2.7.2')
assert_equal(vers[3], 'Compiled with protoc 2.5.0')
assert_equal(len(vers), 6)
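
If you ever need to compare versions rather than match the exact string, the first line of the output can be turned into a tuple of integers. This is a minimal sketch, assuming a first line of the form `Hadoop X.Y.Z` as in the test above; `parse_hadoop_version` is a hypothetical helper and would need extra handling for suffixes such as `-SNAPSHOT`.

```python
def parse_hadoop_version(first_line):
    """Turn a line such as 'Hadoop 2.7.2' into a tuple of ints."""
    name, version = first_line.split()
    if name != "Hadoop":
        raise ValueError(f"unexpected version line: {first_line!r}")
    return tuple(int(part) for part in version.split("."))


version = parse_hadoop_version("Hadoop 2.7.2")
print(version >= (2, 7))
```

Tuples compare element by element, so `(2, 7, 2) >= (2, 7)` behaves as a version check would be expected to.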

## Cleaning up files
Run this cell before restarting and rerunning your code!

In [9]:
!rm temp1.txt
!rm temp2.txt
!rm vers.txt

## New directory for Hadoop
Here, I'm creating a new directory for Hadoop so that we are ready for the next two coding parts.

In [10]:
%%bash
#!/usr/bin/env bash

DIR=$HOME/hadoop_assign

# Delete if exists
if [ -d "$DIR" ]; then
    rm -rf "$DIR"
fi

# Now make the directory
mkdir "$DIR"

ls -la "$DIR"

total 8
drwxr-xr-x  2 data_scientist users 4096 Apr 13 08:44 .
drwxr-xr-x 22 data_scientist users 4096 Apr 13 08:44 ..
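
For comparison, the same "delete if it exists, then recreate" pattern can be written in Python. This is only a sketch: `reset_dir` is a hypothetical helper, and the demonstration runs inside a temporary parent directory instead of `$HOME` so that it does not touch your real files.

```python
import shutil
import tempfile
from pathlib import Path


def reset_dir(path):
    """Delete path if it exists, then create it empty."""
    path = Path(path)
    if path.is_dir():
        shutil.rmtree(path)
    path.mkdir()
    return path


# Demonstrate inside a throwaway parent directory rather than $HOME.
with tempfile.TemporaryDirectory() as parent:
    target = Path(parent) / "hadoop_assign"
    reset_dir(target)
    (target / "stale.txt").write_text("left over from an earlier run")
    reset_dir(target)  # the stale file disappears with the old directory
    remaining = sorted(p.name for p in target.iterdir())

print(remaining)
```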


## Part 4: Copying a book into a directory
For these final two coding sections, we will be dealing with the script for Monty Python and the Holy Grail.

For this section, you must copy the file grail.txt from here:  

`/home/data_scientist/data/nltk_data/corpora/webtext/` 

into the `hadoop_assign` directory that you just created in your `$HOME` directory. Please use `cp` and do not use Hadoop commands here, or else you might fail the assertion tests.

In [11]:
# place the text in the hadoop subdirectory of our home directory
!cp /home/data_scientist/data/nltk_data/corpora/webtext/grail.txt $HOME/hadoop_assign/grail.txt

In [12]:
!ls $HOME/hadoop_assign >copy.txt
copy = !cat copy.txt
assert_is_instance(copy[0], str)
assert_is_instance(copy, list)
assert_equal(len(copy), 1)
assert_equal(copy[0], 'grail.txt')

## Part 5: Making a new directory and copying a book in Hadoop
Finally, we will do two things: we will create a new directory called `grail/in` using Hadoop, and then we will put grail.txt (located in `$HOME/hadoop_assign/`) into `grail/in`, once again using Hadoop.

In [13]:
# create a new directory
!$HADOOP_PREFIX/bin/hdfs dfs -mkdir -p grail/in

# copy data
!$HADOOP_PREFIX/bin/hdfs dfs -put $HOME/hadoop_assign/grail.txt grail/in/grail.txt
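
Neither `grail/in` nor `grail/in/grail.txt` starts with `/`, so HDFS resolves them relative to the user's home directory, which by default is `/user/<username>` (the directory created during setup with `-mkdir -p /user/$NB_USER`). The sketch below mimics that resolution. `resolve_hdfs_path` is a hypothetical helper, and the username `data_scientist` is an assumption taken from the directory listing earlier in this notebook.

```python
import posixpath


def resolve_hdfs_path(path, user):
    """Resolve an HDFS path the way the shell treats relative paths."""
    if path.startswith("/"):
        return path  # absolute paths are used as given
    # Assumes the default home directory layout, /user/<username>.
    return posixpath.join("/user", user, path)


print(resolve_hdfs_path("grail/in", "data_scientist"))
```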

In [14]:
had_grail = !$HADOOP_PREFIX/bin/hdfs dfs -count -h grail/in/*
had_grail = had_grail[0].split()
assert_is_instance(had_grail, list)
assert_true(all(isinstance(w, str) for w in had_grail))
assert_equal(had_grail, ['0', '1', '63.5', 'K', 'grail/in/grail.txt'])
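
The columns of `hdfs dfs -count` are directory count, file count, content size, and path name. With `-h` the size is printed as a number followed by a unit, which is why `split()` produces five tokens instead of four. The sketch below turns such a line into a dictionary. `parse_count_line` is a hypothetical helper; it assumes the path contains no spaces and that the unit suffixes are binary multiples (K = 1024), and the byte figure is approximate because the printed size is rounded.

```python
UNIT_FACTORS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_count_line(line):
    """Parse one line of `hdfs dfs -count [-h]` output."""
    tokens = line.split()
    size_tokens = tokens[2:-1]  # ['63.5', 'K'] with -h, ['65024'] without
    size = float(size_tokens[0])
    if len(size_tokens) == 2:
        size *= UNIT_FACTORS[size_tokens[1]]
    return {
        "dir_count": int(tokens[0]),
        "file_count": int(tokens[1]),
        "size_bytes": size,
        "path": tokens[-1],
    }


info = parse_count_line("0 1 63.5 K grail/in/grail.txt")
print(info["file_count"], info["size_bytes"], info["path"])
```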

### Clean up 
Please run this before you restart and run your assignment!

In [15]:
!rm copy.txt
!$HADOOP_PREFIX/bin/hdfs dfs -rm -r -f grail
!rm -rf $HOME/hadoop_assign
!$HADOOP_PREFIX/sbin/stop-dfs.sh
!$HADOOP_PREFIX/sbin/stop-yarn.sh

17/04/13 08:44:55 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted grail
Stopping namenodes on [info490rb.studentspace.cs.illinois.edu]
info490rb.studentspace.cs.illinois.edu: stopping namenode
localhost: stopping datanode
Stopping secondary namenodes [0.0.0.0]
0.0.0.0: stopping secondarynamenode
stopping yarn daemons
stopping resourcemanager
localhost: stopping nodemanager
no proxyserver to stop
