This Colab notebook is adapted from https://github.com/anjalysam/Hadoop/blob/master/hadoop_mapreduce.ipynb and uses mapper.py and reducer.py from the same repo. You will need to upload both files to your filespace within this notebook; they are also provided in the Blackboard folder containing the notebook.

Please make your own copy of this notebook and follow the steps below to install Hadoop MapReduce in Colab and use it to count the number of occurrences of words in a subset of news posts. As you go through the notebook, you will find five questions. Please provide your answers to these in a separate file, uploaded as a PDF to the turn-in link by Sunday, May 1st.

First, download and install Hadoop, following the steps below.

In [None]:
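# If this URL returns a 404 error, older releases are kept under https://archive.apache.org/dist/hadoop/common/hadoop-3.3.2/
!wget https://downloads.apache.org/hadoop/common/hadoop-3.3.2/hadoop-3.3.2.tar.gz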


In [None]:
!tar -xzvf hadoop-3.3.2.tar.gz

In [None]:
# Copy the unpacked Hadoop directory to /usr/local
!cp -r hadoop-3.3.2/ /usr/local/

In [None]:
# Find the default Java path
!readlink -f /usr/bin/java | sed "s:bin/java::"

Next, edit /usr/local/hadoop-3.3.2/etc/hadoop/hadoop-env.sh: uncomment the line '# export JAVA_HOME=' and set JAVA_HOME to the path printed by the cell above. After editing, the line should look like this, for example:

export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64/

To edit a file in your Colab directory, click on the file icon in the left-hand panel, navigate to the file in the tree that is shown, and double-click the file to open an editing pane on the right. Alternatively, you can just click on the file name in the first paragraph of this cell. When you have finished your edit, save the file with Ctrl-S.
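If you would rather make this edit from a code cell than through the file browser, the change can be sketched in Python. This is only a sketch and not part of the original notebook: the helper name `set_java_home` is hypothetical, and it assumes hadoop-env.sh contains a commented line beginning `# export JAVA_HOME=`, as described above. The demonstration runs on a small stand-in string rather than the real file.

```python
import re

def set_java_home(env_text: str, java_home: str) -> str:
    """Return env_text with the first commented-out JAVA_HOME line activated."""
    pattern = re.compile(r"^#\s*export JAVA_HOME=.*$", flags=re.MULTILINE)
    # A function replacement avoids any backslash-escape handling in java_home.
    new_text, replaced = pattern.subn(
        lambda match: f"export JAVA_HOME={java_home}", env_text, count=1
    )
    if replaced == 0:
        # Assumption not met (no commented line found): append the setting instead.
        new_text = env_text.rstrip("\n") + f"\nexport JAVA_HOME={java_home}\n"
    return new_text

# Stand-in for the contents of hadoop-env.sh (hypothetical, shortened).
sample = "# Some other comment\n# export JAVA_HOME=\nexport OTHER_SETTING=1\n"
updated = set_java_home(sample, "/usr/lib/jvm/java-11-openjdk-amd64/")
print(updated)
```

To apply it to the real file, you would read /usr/local/hadoop-3.3.2/etc/hadoop/hadoop-env.sh into a string, pass it through the helper with the path printed by the `readlink` cell, and write the result back.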

In [None]:
# Run Hadoop with no arguments; this step only tests that it is able to run
!/usr/local/hadoop-3.3.2/bin/hadoop

In [None]:
!mkdir -p ~/input
!cp /usr/local/hadoop-3.3.2/etc/hadoop/*.xml ~/input

In [None]:
!ls ~/input

In [None]:
!/usr/local/hadoop-3.3.2/bin/hadoop jar /usr/local/hadoop-3.3.2/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.2.jar grep ~/input ~/grep_example 'allowed[.]*'

In [None]:
!cat ~/grep_example/*

Next, download the 20 Newsgroups dataset from the URL below and unpack it.

In [None]:
!wget http://qwone.com/~jason/20Newsgroups/20news-18828.tar.gz

!tar -xzvf 20news-18828.tar.gz

In [None]:
!find / -name 'hadoop-streaming*.jar'

Next, you must upload the files mapper.py and reducer.py into the 'content' directory in your filespace. To do that, first download them from Blackboard to your own device. Then, using the same file navigator you used to edit hadoop-env.sh, select the directory 'content' in your root directory. If you don't see it, click the up-arrow file icon above (moving up a directory) until you do. Click on the three vertical dots that appear to the right of the folder when you move the mouse over it, and select 'upload'. Then navigate to your copies of mapper.py and reducer.py and upload them.

Recall that when you run a job with Hadoop, it splits the inputs across several instances of the map code in mapper.py, each of which emits a set of key-value pairs. Hadoop then sorts all the pairs by key and sends them to a number of instances of reducer.py, which process them to produce the final output. In this case, mapper.py emits one pair for each word seen, with the word as the key and '1' as the value. For each distinct word, reducer.py sums the values of all the pairs that have that word as the key.
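The flow described above can be imitated in a few lines of plain Python. This is a minimal in-memory sketch, not the code in mapper.py or reducer.py (Hadoop Streaming scripts read lines from standard input and write to standard output). The function names `map_words` and `reduce_counts` and the two input lines are made up for illustration, and the call to `sorted` stands in for Hadoop's sort-by-key step.

```python
from itertools import groupby

def map_words(lines):
    """Mapper step: emit a (word, 1) pair for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield word, 1

def reduce_counts(sorted_pairs):
    """Reducer step: sum the values of consecutive pairs that share a key."""
    for word, group in groupby(sorted_pairs, key=lambda pair: pair[0]):
        yield word, sum(value for _, value in group)

posts = ["the cat sat", "the dog sat down"]  # made-up input lines
pairs = sorted(map_words(posts))             # stands in for Hadoop's sort by key
counts = dict(reduce_counts(pairs))
print(counts)
```

The reducer relies on the pairs arriving sorted by key, which is why it only needs to compare each pair with the ones next to it.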

Next, run the cell below to make sure both scripts are readable and executable, since Hadoop will launch many instances of your mapper and reducer. Then run the cell after that, which calls Hadoop on the indicated posts.

In [None]:
!chmod u+rwx /content/mapper.py
!chmod u+rwx /content/reducer.py

In [None]:
pr = "/content/20news-18828"
pra = pr + "/alt.atheism"
prg = pr + "/comp.graphics"
prm = pr + "/rec.motorcycles"
# Remove the output folder if it exists (i.e. if running this cell for a second time); -f avoids an error when it does not
!rm -rf /content/alt-atheism-output
!/usr/local/hadoop-3.3.2/bin/hadoop jar /usr/local/hadoop-3.3.2/share/hadoop/tools/lib/hadoop-streaming-3.3.2.jar -input {pra}/5112* -output /content/alt-atheism-output -file /content/mapper.py -file /content/reducer.py -mapper 'python mapper.py' -reducer 'python reducer.py'

After running the cell above, scroll to the bottom of its output so that you can see the line containing the phrase 'reduce task executor complete.' and the lines below it.
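If you find it hard to pick individual counter lines out of the long output, a small parser can help. The sketch below assumes you have the job output available as a string (for example, pasted from the cell output) and that counters appear as 'Name=number' lines. The helper name `parse_counters` is hypothetical, and the numbers in `sample_log` are invented purely for illustration; they are not results from this notebook.

```python
import re

def parse_counters(log_text: str) -> dict:
    """Collect 'Name=number' counter lines from Hadoop job output."""
    counters = {}
    for line in log_text.splitlines():
        match = re.match(r"^\s*([A-Za-z][^=]*)=(\d+)\s*$", line)
        if match:
            counters[match.group(1).strip()] = int(match.group(2))
    return counters

# Invented example text; real output contains many more lines and other values.
sample_log = """\
        Map output records=12
        Reduce input groups=7
        Reduce input records=12
"""
parsed = parse_counters(sample_log)
print(parsed["Map output records"])
```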

**Question 1**  See the line 'Map output records=' after the line indicated above. This shows the number of key-value pairs emitted by all map tasks during the run.

Recalling what the mapper.py code does, how many words were counted in total, including duplicates?

**Question 2** This total matches the total in the line 'Reduce input records=', since every pair emitted by a mapper is consumed by a reducer. Why do you think the line 'Reduce input groups=' shows a smaller number than the number of records output by the mappers? How many distinct words (not including duplicates) were produced? (You can open the output file /content/alt-atheism-output/part-00000 to check this, but you don't need to.)

**Question 3** These results were produced by counting the words in 9 posts from alt.atheism. If instead we used 3 posts from alt.atheism, 3 from comp.graphics and 3 from rec.motorcycles, do you think it would produce more or fewer distinct words? Why?

**Question 4** Test your hunch by modifying the call to Hadoop in the cell below. To get you started, the call currently specifies as input 3 posts from alt.atheism and one each from comp.graphics and rec.motorcycles, using the variables created in the cell above. You can add as many posts as you like after the -input keyword (each post is a file in the file system). This call saves the output in a different folder, mixed-output, so you can compare the results, or you can read them from the output of this cell.

Please describe the results you saw in your answer. Was your hunch from question 3 correct?
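If you want to compare two output files programmatically rather than by eye, one possible approach is sketched below. It assumes each output line has the form word, tab, count; check your own part-00000 file to confirm this before relying on it. The helper names `parse_counts` and `compare_vocab` are hypothetical, and the two short strings are made-up stand-ins for file contents.

```python
def parse_counts(text: str) -> dict:
    """Parse 'word<TAB>count' lines into a dict, skipping malformed lines."""
    counts = {}
    for line in text.splitlines():
        word, sep, count = line.rpartition("\t")
        if sep and count.strip().isdigit():
            counts[word] = int(count)
    return counts

def compare_vocab(first: dict, second: dict) -> dict:
    """List the words found only in one output and the words found in both."""
    a, b = set(first), set(second)
    return {
        "only_first": sorted(a - b),
        "only_second": sorted(b - a),
        "shared": sorted(a & b),
    }

# Made-up stand-ins for the text of two part-00000 files.
first = parse_counts("apple\t3\nthe\t10\n")
second = parse_counts("pear\t2\nthe\t7\n")
result = compare_vocab(first, second)
print(result)
```

With real data you would read each file's text with `open(path).read()` and pass it to `parse_counts`.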

In [None]:
!/usr/local/hadoop-3.3.2/bin/hadoop jar /usr/local/hadoop-3.3.2/share/hadoop/tools/lib/hadoop-streaming-3.3.2.jar -input {pra}/51120 {pra}/51121 {pra}/51122 {prg}/37921 {prm}/103130 -output /content/mixed-output -file /content/mapper.py  -file /content/reducer.py  -mapper 'python mapper.py'  -reducer 'python reducer.py'

**Question 5** Modify either the mapper.py or reducer.py file so that the final output only includes words that are seen at least 5 times. Explain the change(s) you made. (You can modify the files, which are /content/mapper.py and /content/reducer.py, in the same way that you edited the hadoop-env.sh file above, by clicking on them in this sentence.) Re-run either of the calls above and compare your output. Note that Hadoop will not write to an output folder that already exists, so you will find it easier to change the output folder name that comes after the -output keyword.

In [None]:
# Your repeated call goes here, to be run after modifying the mapper and/or reducer code.
