# Lab 3 - Hadoop

Hadoop is an open-source framework that allows for the distributed storage and processing of Big Data! It works in three modes:

1. Standalone
2. Pseudo-distributed
3. Fully-distributed

To save us a bit of time with setup and system wiring, we'll use the pseudo-distributed mode in this lab. Each of the daemons responsible for distributed storage and processing will run as a separate process on a single host, localhost.

Main daemons:
* NameNode
* Resource Manager
* Secondary NameNode

Worker daemons:
* DataNode
* Node Manager

We'll still need to do a little bit of configuration, but this should be a nice solution using just our one machine! If you'd like a more verbose version of this process and/or a breakdown of the standalone mode, [this source](https://github.com/LMAPcoder/Hadoop-on-Colab) provides a nicely detailed tutorial.

## Section 1 - Installing SSH

SSH is a cryptographic network protocol for operating securely over an unsecured network. You'll likely have used this before for remotely logging onto the csserver! If you would like a quick overview of SSH and how it works, you can take a look [here](https://www.youtube.com/watch?v=Atbl7D_yPug&t=69s).

Hadoop uses passphraseless SSH for communication between nodes, so we'll use SSH to give the main node remote access to every node in our cluster. The best way to do this is to set up passwordless login for the Hadoop user by generating a public-private key pair. We'll run through this step by step below.

**Note:** You may get a warning when you run this cell - you can ignore this. We'll just be using SSH for communication between our nodes.

In [3]:
# Installing openssh-server
!apt-get update
!apt-get install openssh-server -qq > /dev/null

# Starting our server
!service ssh start

Get:1 http://ports.ubuntu.com/ubuntu-ports jammy InRelease [270 kB]
Get:2 http://ports.ubuntu.com/ubuntu-ports jammy-updates InRelease [119 kB]
Get:3 http://ports.ubuntu.com/ubuntu-ports jammy-backports InRelease [109 kB]
Get:4 http://ports.ubuntu.com/ubuntu-ports jammy-security InRelease [110 kB]
Get:5 http://ports.ubuntu.com/ubuntu-ports jammy/universe arm64 Packages [17.2 MB]
Get:6 http://ports.ubuntu.com/ubuntu-ports jammy/multiverse arm64 Packages [224 kB]
Get:7 http://ports.ubuntu.com/ubuntu-ports jammy/restricted arm64 Packages [24.2 kB]
Get:8 http://ports.ubuntu.com/ubuntu-ports jammy/main arm64 Packages [1758 kB]
Get:9 http://ports.ubuntu.com/ubuntu-ports jammy-updates/universe arm64 Packages [1265 kB]
Get:10 http://ports.ubuntu.com/ubuntu-ports jammy-updates/main arm64 Packages [1494 kB]
Get:11 http://ports.ubuntu.com/ubuntu-ports jammy-updates/restricted arm64 Packages [1231 kB]
Get:12 http://ports.ubuntu.com/ubuntu-ports jammy-updates/multiverse arm64 Packages [28.4 kB]
Get

In [4]:
# Creating a new rsa key pair with empty password
!ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa

Generating public/private rsa key pair.
Created directory '/root/.ssh'.
Your identification has been saved in /root/.ssh/id_rsa
Your public key has been saved in /root/.ssh/id_rsa.pub
The key fingerprint is:
SHA256:y8NXvvSQw77hZGQOPuC6jgIducCdJ3d2ReyHtTiNZfw root@56cc797895e3
The key's randomart image is:
+---[RSA 3072]----+
|          o..    |
|           o =   |
|. ...     o O o  |
|..o+ o o . * + E |
| o o+ o S . *    |
|. o    + + O .   |
| .      * + %    |
|  .  . . o B *   |
|   ...+.    =..  |
+----[SHA256]-----+


In [5]:
# Copying the public key we just generated to authorised keys
!cat $HOME/.ssh/id_rsa.pub>>$HOME/.ssh/authorized_keys

# Changing the permissions on the key
# Hint: Check "man chmod" for information on this command, and remember that it changes
# file permissions, so be careful with it.
!chmod 0600 ~/.ssh/authorized_keys

In [6]:
# Connecting to our local machine (essentially, we are just connecting to our own machine "as if" it were a remote server).
# uptime will just tell us how long our system has been running.
!ssh -o StrictHostKeyChecking=no localhost uptime

 20:02:32 up 17 min,  0 users,  load average: 0.17, 0.44, 0.40
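If the `ssh localhost` step ever asks for a password, the usual causes are a key that never made it into `authorized_keys` or file modes that are too permissive. The following is a minimal sketch of a check for both, assuming the default key file names used above. The helper name `check_passwordless_setup` is hypothetical and not part of any library.

```python
import stat
from pathlib import Path


def check_passwordless_setup(ssh_dir):
    """Return a list of problems found in ssh_dir (an empty list means it looks fine)."""
    ssh_dir = Path(ssh_dir)
    pub_key = ssh_dir / "id_rsa.pub"
    auth_keys = ssh_dir / "authorized_keys"
    problems = []

    if not pub_key.is_file():
        problems.append("missing id_rsa.pub")
    if not auth_keys.is_file():
        problems.append("missing authorized_keys")
    if problems:
        return problems

    # The public key must appear as one of the lines of authorized_keys.
    key_line = pub_key.read_text().strip()
    authorised = [line.strip() for line in auth_keys.read_text().splitlines()]
    if key_line not in authorised:
        problems.append("public key not listed in authorized_keys")

    # sshd typically refuses key files that group or others can write to.
    mode = stat.S_IMODE(auth_keys.stat().st_mode)
    if mode & 0o022:
        problems.append("authorized_keys is group/world writable (mode %o)" % mode)

    return problems
```

In the notebook this could be called as `check_passwordless_setup(Path.home() / ".ssh")`; an empty list suggests the key setup from the cells above is in place.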


## Section 2 - Installing and Configuring Hadoop

In [7]:
import os

In [8]:
# Installing Hadoop and configuring JAVA_HOME:
# Downloading Hadoop
# Unzipping
# Copying hadoop into our /usr/local folder
# Removing the unused original copy
# Removing the compressed (tar.gz) file, as we're not using it anymore.
# Adding a variable called "JAVA_HOME" to hadoop's environment script which tells it where Java is on our system.

!if [ ! -d /usr/local/hadoop-3.3.5/ ]; then \
wget -4 https://dlcdn.apache.org/hadoop/common/hadoop-3.3.5/hadoop-3.3.5.tar.gz; \
tar -xzf hadoop-3.3.5.tar.gz; \
cp -r hadoop-3.3.5/ /usr/local/; \
rm -rf hadoop-3.3.5/; \
rm hadoop-3.3.5.tar.gz; \
echo "export JAVA_HOME=$(dirname $(dirname $(realpath $(which java))))" >> /usr/local/hadoop-3.3.5/etc/hadoop/hadoop-env.sh; \
fi

--2024-02-14 20:02:32--  https://dlcdn.apache.org/hadoop/common/hadoop-3.3.5/hadoop-3.3.5.tar.gz
Resolving dlcdn.apache.org (dlcdn.apache.org)... 151.101.2.132
Connecting to dlcdn.apache.org (dlcdn.apache.org)|151.101.2.132|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 706533213 (674M) [application/x-gzip]
Saving to: ‘hadoop-3.3.5.tar.gz’


2024-02-14 20:03:37 (10.4 MB/s) - ‘hadoop-3.3.5.tar.gz’ saved [706533213/706533213]



In [9]:
# Setting up some of our environmental variables:
# Here we add Hadoop's location to our path (in Python) and tell our system where hadoop is located.
os.environ['PATH'] = "/usr/local/hadoop-3.3.5/bin/:" + os.environ['PATH']
os.environ["HADOOP_HOME"] = "/usr/local/hadoop-3.3.5"

As mentioned in the intro, we'll need to do a small bit of configuration to get the pseudo-distributed mode running as it should. To do this, we'll need to set some properties in the site XML configuration files. This way we can tell Hadoop which machines are in the cluster and where we want to run the daemons.

You can find the specifics of this [here](https://hadoop.apache.org/docs/r3.3.3/hadoop-project-dist/hadoop-common/SingleCluster.html), but I have included a script you can run to add the correct properties to the correct files (`lab3_config.sh`). The files we'll be editing can be found at `$HADOOP_HOME/etc/hadoop`.
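The contents of `lab3_config.sh` are not reproduced here, so the following is only a sketch of the kind of property a pseudo-distributed setup needs, based on the Apache single-node guide linked above: `fs.defaultFS` in `core-site.xml` and `dfs.replication` in `hdfs-site.xml`. The helper `build_site_xml` is a hypothetical name, and the values are the guide's defaults, which may differ from what the lab script actually writes.

```python
import xml.etree.ElementTree as ET


def build_site_xml(properties):
    """Build a Hadoop-style <configuration> document from a dict of name -> value."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


# Assumed values, taken from the Apache single-node setup guide.
core_site = build_site_xml({"fs.defaultFS": "hdfs://localhost:9000"})
hdfs_site = build_site_xml({"dfs.replication": 1})

print(core_site)
print(hdfs_site)
```

Each site file is just a flat list of `<property>` entries with a `<name>` and a `<value>`, which is why a small script can do all of the configuration for us.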

In [10]:
# from google.colab import drive
# drive.mount('/content/drive')

In [11]:
# Here, we've uploaded lab3_config.sh to our Google Drive. Note that this path will depend on **YOUR** Google Drive.
# We are then copying this to our "content" folder.
# This does a lot of the annoying configuration for us.
# !cp /content/drive/MyDrive/Temp/lab3_config.sh /content/
# !cat lab3_config.sh

In [12]:
# Running our config script
# Note: remember to check you have the correct filepath
!bash lab3_config.sh

Before HDFS can be used, the file system must be formatted. The formatting process creates an empty file system by creating the storage directories and the initial versions of the NameNode's persistent data structures:


In [13]:
!$HADOOP_HOME/bin/hdfs namenode -format

2024-02-14 20:03:52,714 INFO namenode.NameNode: STARTUP_MSG: 
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = 56cc797895e3/172.17.0.2
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 3.3.5
STARTUP_MSG:   classpath = /usr/local/hadoop-3.3.5/etc/hadoop:/usr/local/hadoop-3.3.5/share/hadoop/common/lib/netty-codec-4.1.77.Final.jar:/usr/local/hadoop-3.3.5/share/hadoop/common/lib/woodstox-core-5.4.0.jar:/usr/local/hadoop-3.3.5/share/hadoop/common/lib/failureaccess-1.0.jar:/usr/local/hadoop-3.3.5/share/hadoop/common/lib/jackson-core-asl-1.9.13.jar:/usr/local/hadoop-3.3.5/share/hadoop/common/lib/kerb-common-1.0.1.jar:/usr/local/hadoop-3.3.5/share/hadoop/common/lib/jackson-mapper-asl-1.9.13.jar:/usr/local/hadoop-3.3.5/share/hadoop/common/lib/netty-all-4.1.77.Final.jar:/usr/local/hadoop-3.3.5/share/hadoop/common/lib/re2j-1.1.jar:/usr/local/hadoop-3.3.5/share/hadoop/common/lib/zookeeper-3.5.6.jar:/usr/local/hadoop-3.3.5/sha

In [14]:
# Creating our HDFS environment variables:
os.environ["HDFS_NAMENODE_USER"] = "root"
os.environ["HDFS_DATANODE_USER"] = "root"
os.environ["HDFS_SECONDARYNAMENODE_USER"] = "root"
os.environ["YARN_RESOURCEMANAGER_USER"] = "root"
os.environ["YARN_NODEMANAGER_USER"] = "root"

Hadoop comes with scripts for running commands, and starting and stopping daemons across the whole cluster. These scripts can be found in the `bin` and `sbin` directories.


In [15]:
# Launching hdfs daemons
!$HADOOP_HOME/sbin/start-dfs.sh

Starting namenodes on [localhost]
Starting datanodes
Starting secondary namenodes [56cc797895e3]
2024-02-14 20:04:02,912 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [16]:
# Listing the running daemons
!jps

1008 DataNode
1196 SecondaryNameNode
1324 Jps
861 NameNode


In [17]:
# Launching our yarn daemons
# nohup causes a process to ignore a "hang-up" signal
!nohup $HADOOP_HOME/sbin/start-yarn.sh

nohup: ignoring input and appending output to 'nohup.out'


In [18]:
# Listing the running daemons (i.e. programs doing some work in the background for us)
# Note the NodeManager and ResourceManager
# More info on JPS here: https://www.oreilly.com/library/view/apache-hadoop-3/9781788999830/81815d09-e275-44e2-88a8-4f63ab943b92.xhtml
!jps

1008 DataNode
1557 NodeManager
1929 Jps
1436 ResourceManager
1196 SecondaryNameNode
861 NameNode
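Reading the `jps` listing by eye works for five daemons, but the same check can be scripted. This is a minimal sketch, assuming the five daemon names shown in the output above; `missing_daemons` is a hypothetical helper name, and the sample text is the listing from the previous cell.

```python
EXPECTED_DAEMONS = {
    "NameNode",
    "DataNode",
    "SecondaryNameNode",
    "ResourceManager",
    "NodeManager",
}


def missing_daemons(jps_output, expected=EXPECTED_DAEMONS):
    """Return the expected daemon names that do not appear in the jps output."""
    running = set()
    for line in jps_output.splitlines():
        parts = line.split()
        # Each jps line is "<pid> <class name>"; skip anything else.
        if len(parts) >= 2 and parts[0].isdigit():
            running.add(parts[1])
    return sorted(set(expected) - running)


sample = """1008 DataNode
1557 NodeManager
1929 Jps
1436 ResourceManager
1196 SecondaryNameNode
861 NameNode"""

print(missing_daemons(sample))  # prints []
```

In a notebook, `out = !jps` captures the command's output lines, so `missing_daemons("\n".join(out))` would run the check against the live system.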


In [19]:
# Report the basic file system information and statistics to make sure everything is set up as it should be:
!$HADOOP_HOME/bin/hdfs dfsadmin -report

2024-02-14 20:04:09,882 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Configured Capacity: 62671097856 (58.37 GB)
Present Capacity: 17729007616 (16.51 GB)
DFS Remaining: 17728983040 (16.51 GB)
DFS Used: 24576 (24 KB)
DFS Used%: 0.00%
Replicated Blocks:
	Under replicated blocks: 0
	Blocks with corrupt replicas: 0
	Missing blocks: 0
	Missing blocks (with replication factor 1): 0
	Low redundancy blocks with highest priority to recover: 0
	Pending deletion blocks: 0
Erasure Coded Block Groups: 
	Low redundancy block groups: 0
	Block groups with corrupt internal blocks: 0
	Missing block groups: 0
	Low redundancy blocks with highest priority to recover: 0
	Pending deletion blocks: 0

-------------------------------------------------
Live datanodes (1):

Name: 127.0.0.1:9866 (localhost)
Hostname: 56cc797895e3
Decommission Status : Normal
Configured Capacity: 62671097856 (58.37 GB)
DFS Used: 24576 (24 KB)
Non D

## Section 3 - Running a program

The default installation comes with several MapReduce examples already installed. We can see below that this includes a basic wordcount program:

In [20]:
!$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.5.jar

An example program must be given as the first argument.
Valid program names are:
  aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
  aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
  bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.
  dbcount: An example job that count the pageview counts from a database.
  distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.
  grep: A map/reduce program that counts the matches of a regex in the input.
  join: A job that effects a join over sorted, equally partitioned datasets
  multifilewc: A job that counts words from several files.
  pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
  pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
  randomtextwriter: A map/reduce program that writes 10GB of r

We'll use this sample program to start. First, let's make some input for our program "locally" before adding it to the HDFS:

In [21]:
!mkdir input
!echo "Hello world from COMP30770" > input/f1.txt
!echo "Hello there from UCD" > input/f2.txt

In [22]:
# First we'll need to make the corresponding input dir on our HDFS.
# Don't panic!! Let's break this down.
# 1. We run "hdfs"
# 2. We pass the argument "dfs", which tells Hadoop that we are going to use a file system command.
# 3. We then make a directory on our HDFS.
!$HADOOP_HOME/bin/hdfs dfs -mkdir /word_count

# We can add files to the HDFS in two ways...
# 1. Using "copyFromLocal":
!$HADOOP_HOME/bin/hdfs dfs -copyFromLocal input/f1.txt /word_count

# 2. Using "put":
!$HADOOP_HOME/bin/hdfs dfs -put input/f2.txt /word_count

2024-02-14 20:04:11,964 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2024-02-14 20:04:12,980 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2024-02-14 20:04:14,683 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [23]:
# We can see our files now on the HDFS:
!$HADOOP_HOME/bin/hdfs dfs -ls /word_count

2024-02-14 20:04:15,774 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 2 items
-rw-r--r--   1 root supergroup         27 2024-02-14 20:04 /word_count/f1.txt
-rw-r--r--   1 root supergroup         21 2024-02-14 20:04 /word_count/f2.txt


In [24]:
# Executing our first mapreduce wordcount program
# We are using the "wordcount" program here with the files we created as the argument.
!$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.5.jar wordcount /word_count/*.txt /word_count/output/

2024-02-14 20:04:16,849 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2024-02-14 20:04:17,293 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at localhost/127.0.0.1:8032
2024-02-14 20:04:17,550 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1707941046409_0001
2024-02-14 20:04:17,720 INFO input.FileInputFormat: Total input files to process : 2
2024-02-14 20:04:18,181 INFO mapreduce.JobSubmitter: number of splits:2
2024-02-14 20:04:18,331 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1707941046409_0001
2024-02-14 20:04:18,331 INFO mapreduce.JobSubmitter: Executing with tokens: []
2024-02-14 20:04:18,459 INFO conf.Configuration: resource-types.xml not found
2024-02-14 20:04:18,459 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2024-02-14 20:04:18,619 INFO impl.YarnClientImpl: Su

In [25]:
# Checking out our new output directory:
!$HADOOP_HOME/bin/hdfs dfs -ls /word_count/output

2024-02-14 20:04:36,744 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 2 items
-rw-r--r--   1 root supergroup          0 2024-02-14 20:04 /word_count/output/_SUCCESS
-rw-r--r--   1 root supergroup         49 2024-02-14 20:04 /word_count/output/part-r-00000


In [26]:
# part-r-00000 contains the actual output:
!$HADOOP_HOME/bin/hdfs dfs -cat /word_count/output/part-r-00000

2024-02-14 20:04:37,817 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
COMP30770	1
Hello	2
UCD	1
from	2
there	1
world	1


We can also make use of Hadoop's streaming feature to write MapReduce programs in other languages, like Python! The feature will create the job for us, submit that job to the appropriate cluster and monitor its progress until finished.

We'll need to write mapper and reducer scripts in Python as below. Both of these scripts read their input from STDIN (line by line) and send output via STDOUT, so we'll make use of the `sys` module here.

In [27]:
%%writefile mapper.py
#!/usr/bin/env python

import sys

for line in sys.stdin:
  line = line.strip()  # removes whitespace either side of our line
  words = line.split()  # splitting our line into a list of words

  for word in words:
    print('%s\t%s' % (word, 1))  # writing our results to STDOUT (this is the input for reducer.py)

Writing mapper.py


In [28]:
%%writefile reducer.py
#!/usr/bin/env python

import sys

current_word = None
current_count = 0
word = None

for line in sys.stdin:
  line = line.strip()
  word, count = line.split('\t', 1)  # splitting the data on the basis of tab (see mapper.py)

  try:
    count = int(count)  # convert count (currently a string) to int
  except ValueError:
    continue  # silently ignore line if count is not a number

  # this IF-switch only works because Hadoop sorts map output
  # by key (here: word) before it is passed to the reducer
  if current_word == word:
    current_count += count
  else:
    if current_word: # to avoid None values
      print('%s\t%s' % (current_word, current_count))
    current_count = count
    current_word = word

# do not forget to output the last word (skipped only when there was no input at all)
if current_word is not None:
  print('%s\t%s' % (current_word, current_count))

Writing reducer.py
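Before submitting anything to Hadoop, it can help to see the whole map, sort and reduce flow in one place. The sketch below mimics it in a single Python process, with `sorted` standing in for the sort-by-key step that Hadoop performs between the mapper and the reducer. The function names are illustrative only, and this is a simplification of the idea rather than how Hadoop streaming is implemented.

```python
from itertools import groupby


def map_words(lines):
    """Mapper: emit a (word, 1) pair for every whitespace-separated word."""
    for line in lines:
        for word in line.strip().split():
            yield word, 1


def reduce_counts(sorted_pairs):
    """Reducer: sum the counts of consecutive pairs that share the same key."""
    for word, group in groupby(sorted_pairs, key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


def run_local(lines):
    """Mimic map -> sort -> reduce in a single process."""
    pairs = sorted(map_words(lines))  # stands in for Hadoop's sort by key
    return list(reduce_counts(pairs))


lines = ["Hello world from COMP30770", "Hello there from UCD"]
for word, count in run_local(lines):
    print("%s\t%s" % (word, count))
```

Like `reducer.py`, `reduce_counts` relies on equal keys arriving next to each other, which is exactly what the sort step guarantees. The two script files can be checked the same way from a shell with `cat input/*.txt | python3 mapper.py | sort | python3 reducer.py`.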


In [29]:
!pwd


/workdir/lab3


In [30]:
# Making these new files executable:
!chmod u+x /workdir/lab3/mapper.py /workdir/lab3/reducer.py

In [32]:
#Running MapReduce programs

!$HADOOP_HOME/bin/hdfs dfs -rm -r /word_count/python_output

!$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-3.3.5.jar \
  -input /word_count/*.txt \
  -output /word_count/python_output \
  -mapper "python3 /workdir/lab3/mapper.py" \
  -reducer "python3 /workdir/lab3/reducer.py"

2024-02-14 20:05:14,863 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Deleted /word_count/python_output
2024-02-14 20:05:17,015 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
packageJobJar: [/tmp/hadoop-unjar4563505953634249597/] [] /tmp/streamjob9671179864342969534.jar tmpDir=null
2024-02-14 20:05:17,540 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at localhost/127.0.0.1:8032
2024-02-14 20:05:17,641 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at localhost/127.0.0.1:8032
2024-02-14 20:05:17,768 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1707941046409_0003
2024-02-14 20:05:17,955 INFO mapred.FileInputFormat: Total input files to process : 2
2024-02-14 20:05:17,994 INFO mapreduce.JobSubmi

In [33]:
# Checking out our new python_output directory:
!$HADOOP_HOME/bin/hdfs dfs -ls /word_count/python_output

2024-02-14 20:05:57,545 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 2 items
-rw-r--r--   1 root supergroup          0 2024-02-14 20:05 /word_count/python_output/_SUCCESS
-rw-r--r--   1 root supergroup         49 2024-02-14 20:05 /word_count/python_output/part-00000


In [34]:
# part-00000 contains the output this time:
!$HADOOP_HOME/bin/hdfs dfs -cat /word_count/python_output/part-00000

2024-02-14 20:05:58,992 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
COMP30770	1
Hello	2
UCD	1
from	2
there	1
world	1


## Section 4 - Modifying the WordCount program in Java

The exercises in Section 5 can be completed in either Java or Python - whichever you prefer. We want our final output to perform the same task as above - the aim here is to become comfortable making small changes to the WordCount example.

If you are completing this task in Java, you will need to recompile and repackage the code after editing it, or you will run the un-edited version of the code. We'll run through an example below before coming to the exercises.


**Java Example:**

First download the Java source code of the WordCount example:

In [35]:
!wget --no-check-certificate csserver.ucd.ie/~aventresque/COMP30770/2020/WordCount.java

--2024-02-14 20:06:20--  http://csserver.ucd.ie/~aventresque/COMP30770/2020/WordCount.java
Resolving csserver.ucd.ie (csserver.ucd.ie)... 193.1.133.60
Connecting to csserver.ucd.ie (csserver.ucd.ie)|193.1.133.60|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://csserver.ucd.ie/~aventresque/COMP30770/2020/WordCount.java [following]
--2024-02-14 20:06:20--  https://csserver.ucd.ie/~aventresque/COMP30770/2020/WordCount.java
Connecting to csserver.ucd.ie (csserver.ucd.ie)|193.1.133.60|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2089 (2.0K) [text/x-java]
Saving to: ‘WordCount.java’


2024-02-14 20:06:21 (67.8 MB/s) - ‘WordCount.java’ saved [2089/2089]



Now set Hadoop’s classpath and compile the file (NB - the following steps are quite precise, so make sure you do them all in order each time you want to edit the WordCount file!):

In [32]:
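# Setting Hadoop's classpath.
# Note: "!export" would only affect a throwaway subshell, so we set the variable from Python instead.
# Adjust the path to match the JDK installed on your system.
os.environ["HADOOP_CLASSPATH"] = "/usr/lib/jvm/java-1.8.0-openjdk-amd64/lib/tools.jar"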

# Recompiling:
!$HADOOP_HOME/bin/hadoop com.sun.tools.javac.Main WordCount.java

**Note:** You can ignore any bad substitution warnings you might receive!

In [33]:
# Repackaging:
!jar cf wc.jar WordCount*.class

In [34]:
# Running our "new" WordCount program on our basic example:
!$HADOOP_HOME/bin/hadoop jar wc.jar WordCount /word_count/*.txt /word_count/output2

2024-01-31 14:41:47,204 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at localhost/127.0.0.1:8032
2024-01-31 14:41:47,718 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
2024-01-31 14:41:47,762 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1706711979035_0003
2024-01-31 14:41:48,130 INFO input.FileInputFormat: Total input files to process : 2
2024-01-31 14:41:48,639 INFO mapreduce.JobSubmitter: number of splits:2
2024-01-31 14:41:49,009 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1706711979035_0003
2024-01-31 14:41:49,010 INFO mapreduce.JobSubmitter: Executing with tokens: []
2024-01-31 14:41:49,254 INFO conf.Configuration: resource-types.xml not found
2024-01-31 14:41:49,256 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2024-0

In [35]:
!$HADOOP_HOME/bin/hdfs dfs -ls /word_count/output2

Found 2 items
-rw-r--r--   1 root supergroup          0 2024-01-31 14:42 /word_count/output2/_SUCCESS
-rw-r--r--   1 root supergroup         49 2024-01-31 14:42 /word_count/output2/part-r-00000


In [36]:
# Again, part-r-00000 contains the actual output:
!$HADOOP_HOME/bin/hdfs dfs -cat /word_count/output2/part-r-00000

COMP30770	1
Hello	2
UCD	1
from	2
there	1
world	1


## Section 5 - Exercises

First, download the following 5 books using `wget`:

* http://www.gutenberg.org/files/1524/1524-0.txt
* http://www.gutenberg.org/files/1112/1112-0.txt
* http://www.gutenberg.org/files/2267/2267.txt
* http://www.gutenberg.org/files/1513/1513-0.txt
* http://www.gutenberg.org/files/1121/1121-0.txt



In [37]:
# TODO:



Upload the 5 books to your HDFS. Modify the WordCount example (in either language) to answer some of the following questions. They will require you to make just a few changes, or even no changes.

Note: do not forget to recompile (hadoop com.sun.tools.javac.Main ClassName.java) and repackage (jar cf wc.jar ClassName*.class) the Java code after editing it, or you will run the un-edited version of the code.

In [38]:
# TODO:



1. Run the WordCount example. How many unique words does the corpus contain?
Hint: the mapreduce program gives us a lot of outputs - is there a line that tells us how many records were handed to the reduce function?

2. Make some changes to the map function to reduce the number of unique terms found (e.g. remove non-alphanumeric terms, make strings lowercase) and run the job again. Which changes did you make and how do they influence the number of unique terms found?

  **Java Hint:** Look through your code to find `word.set(itr.nextToken());` - we want to make some changes here in the format of `word.set(itr.nextToken().DOSOMETHING!);`. You can find some help with this [here](https://www.w3schools.com/java/java_ref_string.asp).

  **Python Hint:** You could take a look at the Python RegEx module [here](https://www.w3schools.com/python/python_regex.asp). You might also want to take a look at some string methods [here](https://www.w3schools.com/python/python_ref_string.asp).

In [39]:
%%writefile mapper.py
#!/usr/bin/env python

import sys
import re

for line in sys.stdin:
  line = line.strip()  # removes whitespace either side of our line
  words = line.split()  # splitting our line into a list of words

  for word in words:

    word = re.sub('[^0-9a-zA-Z]', '', word.lower())

    if word == '':
      continue
    else:
      print('%s\t%s' % (word, 1))  # writing our results to STDOUT (this is the input for reducer.py)


Overwriting mapper.py
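To get a feel for how this kind of normalisation affects the number of unique terms before running the full job, the same regular expression can be tried on a handful of tokens. This is a small illustrative sketch; the sample sentence is made up for the purpose, and `normalise` is just a local helper name.

```python
import re


def normalise(word):
    """Lowercase a token and strip every character that is not a letter or digit."""
    return re.sub(r"[^0-9a-zA-Z]", "", word.lower())


tokens = "Hello, hello! HELLO world -- World.".split()

raw_terms = set(tokens)
# Tokens made up purely of punctuation normalise to "", so we drop that entry.
clean_terms = {normalise(t) for t in tokens} - {""}

print(len(raw_terms), sorted(raw_terms))
print(len(clean_terms), sorted(clean_terms))
```

Variants that differ only in case or attached punctuation collapse into a single term, so the cleaned set is never larger than the raw one.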


In [40]:
%%writefile reducer.py
#!/usr/bin/env python

import sys

current_word = None
current_count = 0
word = None

for line in sys.stdin:
  line = line.strip()
  word, count = line.split('\t', 1)  # splitting the data on the basis of tab (see mapper.py)

  try:
    count = int(count)  # convert count (currently a string) to int
  except ValueError:
    continue  # silently ignore line if count is not a number

  # this IF-switch only works because Hadoop sorts map output
  # by key (here: word) before it is passed to the reducer
  if current_word == word:
    current_count += count
  else:
    if current_word: # to avoid None values
      print('%s\t%s' % (current_word, current_count))
    current_count = count
    current_word = word

# do not forget to output the last word (skipped only when there was no input at all)
if current_word is not None:
  print('%s\t%s' % (current_word, current_count))

Overwriting reducer.py


In [41]:
!chmod u+x /content/mapper.py /content/reducer.py

In [42]:
#Running MapReduce programs
!$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-3.3.5.jar \
  -input /exercise_folder/*.txt \
  -output /exercise_folder/python_output \
  -mapper "python /content/mapper.py" \
  -reducer "python /content/reducer.py"

packageJobJar: [/tmp/hadoop-unjar6415954454821354174/] [] /tmp/streamjob1516436789155777311.jar tmpDir=null
2024-01-31 14:42:30,056 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at localhost/127.0.0.1:8032
2024-01-31 14:42:30,296 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at localhost/127.0.0.1:8032
2024-01-31 14:42:30,742 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1706711979035_0004
2024-01-31 14:42:31,144 INFO mapreduce.JobSubmitter: Cleaning up the staging area /tmp/hadoop-yarn/staging/root/.staging/job_1706711979035_0004
2024-01-31 14:42:31,157 ERROR streaming.StreamJob: Error Launching job : Input Pattern hdfs://localhost:9000/exercise_folder/*.txt matches 0 files
Streaming Command Failed!


In [43]:
!$HADOOP_HOME/bin/hdfs dfs -cat /exercise_folder/python_output/part-00000

cat: `/exercise_folder/python_output/part-00000': No such file or directory


3. How many words appear fewer than 4 times?

  **Hint:** You can either perform this check manually (by checking outputs) or if you’re feeling adventurous, you could experiment with adding some code to the reducer!
  
  **Java Hint:** Your results will not be an int but an "IntWritable" - we can use `.get()` to grab a value that we can compare!



4. What are the most frequent words that end in -ing?

  **Hint:** Check out those string methods again!
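If you prefer to answer these questions by checking outputs rather than changing the reducer, one option is to post-process the job's output file in Python. The sketch below assumes the tab-separated `word<TAB>count` format seen in `part-r-00000` and uses the small two-file result from Section 3 as sample data. The function names are hypothetical, and the threshold and suffix in the demo are chosen only to suit that tiny sample.

```python
def parse_counts(text):
    """Parse 'word<TAB>count' lines (the format of part-r-00000) into a dict."""
    counts = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        word, _, count = line.rpartition("\t")
        counts[word] = int(count)
    return counts


def rare_words(counts, threshold):
    """Words that appear fewer than `threshold` times."""
    return sorted(w for w, c in counts.items() if c < threshold)


def top_with_suffix(counts, suffix, n):
    """The n most frequent words ending in `suffix`, most frequent first."""
    matches = [(w, c) for w, c in counts.items() if w.endswith(suffix)]
    return sorted(matches, key=lambda wc: (-wc[1], wc[0]))[:n]


# The output of the small two-file example from Section 3.
sample = "COMP30770\t1\nHello\t2\nUCD\t1\nfrom\t2\nthere\t1\nworld\t1\n"
counts = parse_counts(sample)

print(len(counts))                      # number of unique words
print(rare_words(counts, 2))            # words seen fewer than 2 times
print(top_with_suffix(counts, "o", 3))  # most frequent words ending in "o"
```

For the exercises, the text would come from your own output file (for example, fetched with `hdfs dfs -cat` or `hdfs dfs -get`), with the threshold set to 4 and the suffix set to `"ing"`.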