# Introduction to Hadoop


## The building blocks of Hadoop

- NameNode
- DataNode
- Secondary NameNode
- JobTracker
- TaskTracker

## NameNode

- The NameNode is the master of HDFS that directs the worker DataNode daemons to perform the low-level I/O tasks.

- The NameNode keeps track of how your files are broken down into file blocks, which nodes store those blocks, and the overall health of the distributed file system.

- The NameNode is a single point of failure for your Hadoop cluster.
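
As a rough mental model, the NameNode's bookkeeping can be pictured as two lookup tables: one from each file to its blocks, and one from each block to the DataNodes that hold a replica. The sketch below is a toy in-memory illustration only. The class name `ToyNameNode`, the tiny block size, the round-robin placement and the node names are all assumptions made for the example and are not part of Hadoop's API.

```python
from collections import defaultdict


class ToyNameNode:
    """Toy model of NameNode metadata (hypothetical, not Hadoop's API)."""

    def __init__(self, block_size=128, replication=3):
        self.block_size = block_size
        self.replication = replication
        self.file_to_blocks = {}                # file path -> list of block ids
        self.block_to_nodes = defaultdict(set)  # block id -> DataNode names

    def add_file(self, path, size, datanodes):
        """Split a file of `size` bytes into blocks and place replicas round-robin."""
        n_blocks = max(1, -(-size // self.block_size))  # ceiling division
        blocks = []
        for i in range(n_blocks):
            block_id = f"{path}#blk{i}"
            for r in range(min(self.replication, len(datanodes))):
                node = datanodes[(i + r) % len(datanodes)]
                self.block_to_nodes[block_id].add(node)
            blocks.append(block_id)
        self.file_to_blocks[path] = blocks
        return blocks

    def locate(self, path):
        """Return, for each block of the file, the sorted list of nodes storing it."""
        return [sorted(self.block_to_nodes[b]) for b in self.file_to_blocks[path]]


nn = ToyNameNode(block_size=128, replication=2)
nn.add_file("/data/log.txt", size=300, datanodes=["dn1", "dn2", "dn3"])
print(nn.locate("/data/log.txt"))
```

A 300-byte file with a 128-byte block size is split into three blocks, each stored on two of the three toy nodes.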

## Secondary NameNode

- The Secondary NameNode is an assistant daemon that monitors the state of the cluster's HDFS.

- Each cluster has one Secondary NameNode.

- The Secondary NameNode's snapshots help minimize downtime and data loss when the NameNode fails.

## DataNode

- Each worker machine in your cluster will host a DataNode daemon to perform the grunt work of the distributed file system -- reading and writing HDFS blocks to actual files on the local file system.

- DataNodes constantly report to the NameNode.

- Each of the DataNodes informs the NameNode of the blocks it’s currently storing. After this mapping is complete, the DataNodes continually poll the NameNode to provide information regarding local changes as well as receive instructions to create, move, or delete blocks from the local disk.
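
The reporting loop described above can be sketched as a function that folds one DataNode's block report into the NameNode's block map and returns instructions for that node. This is a simplified illustration under stated assumptions: the function name `process_block_report`, the block ids, and the rule "delete anything the NameNode does not know" are hypothetical and much cruder than the real protocol.

```python
def process_block_report(block_map, node, reported, known_blocks):
    """Update the block -> nodes map from one DataNode's report.

    Returns the blocks the node should delete because the NameNode
    does not know them (a simplified stand-in for real instructions).
    """
    # Forget what we previously believed this node stored.
    for nodes in block_map.values():
        nodes.discard(node)
    to_delete = []
    for block in reported:
        if block in known_blocks:
            block_map.setdefault(block, set()).add(node)
        else:
            to_delete.append(block)
    return to_delete


block_map = {}
known = {"blk_1", "blk_2"}
print(process_block_report(block_map, "dn1", ["blk_1", "blk_9"], known))
print(process_block_report(block_map, "dn2", ["blk_1", "blk_2"], known))
print({b: sorted(n) for b, n in sorted(block_map.items())})
```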

## JobTracker

- There is only one JobTracker daemon per Hadoop cluster. It typically runs on a server acting as a master node of the cluster.

- The JobTracker determines the execution plan: it decides which files to process, assigns nodes to different tasks, and monitors all tasks as they’re running. Should a task fail, the JobTracker will automatically relaunch the task, possibly on a different node, up to a predefined limit of retries.
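
The relaunch behaviour can be illustrated with a small retry loop. This is a sketch of the idea rather than the JobTracker's actual scheduling logic: `run_with_retries` and `flaky_task` are hypothetical names, the limit of four attempts is only an illustrative value, and cycling through the node list stands in for "possibly on a different node".

```python
def run_with_retries(task, nodes, max_attempts=4):
    """Run `task(node)` on successive nodes until it succeeds or attempts run out.

    Returns (result, attempts_used). Raises RuntimeError after max_attempts failures.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        node = nodes[(attempt - 1) % len(nodes)]  # try the next node in the list
        try:
            return task(node), attempt
        except Exception as exc:  # a failed task attempt
            last_error = exc
    raise RuntimeError(f"task failed after {max_attempts} attempts") from last_error


def flaky_task(node):
    """Hypothetical task that only succeeds on node 'n3'."""
    if node != "n3":
        raise IOError(f"disk error on {node}")
    return f"done on {node}"


print(run_with_retries(flaky_task, ["n1", "n2", "n3"]))
```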

## TaskTracker


- Each TaskTracker is responsible for executing the individual tasks that the JobTracker assigns.

- If the JobTracker fails to receive a heartbeat from a TaskTracker within a specified amount of time, it will assume the TaskTracker has crashed and will resubmit the corresponding tasks to other nodes in the cluster.
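
A minimal sketch of the heartbeat check, assuming the JobTracker keeps the time of the last heartbeat per TaskTracker. The function names, the 30-second timeout and the round-robin reassignment are assumptions for illustration, not Hadoop's implementation.

```python
def find_dead_trackers(last_heartbeat, now, timeout):
    """Return the trackers whose last heartbeat is older than `timeout` seconds."""
    return sorted(t for t, seen in last_heartbeat.items() if now - seen > timeout)


def reassign_tasks(assignments, dead, alive):
    """Move tasks from dead trackers to alive ones, round-robin."""
    new_assignments = {t: list(tasks) for t, tasks in assignments.items() if t not in dead}
    orphaned = [task for t in dead for task in assignments.get(t, [])]
    for i, task in enumerate(orphaned):
        target = alive[i % len(alive)]
        new_assignments.setdefault(target, []).append(task)
    return new_assignments


heartbeats = {"tt1": 100.0, "tt2": 40.0, "tt3": 95.0}
assignments = {"tt1": ["map_0"], "tt2": ["map_1", "map_2"], "tt3": []}
dead = find_dead_trackers(heartbeats, now=105.0, timeout=30.0)
alive = sorted(set(heartbeats) - set(dead))
print(dead)
print(reassign_tasks(assignments, dead, alive))
```

Here `tt2` has been silent for 65 seconds, so its two tasks are handed to the remaining trackers.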

## Setting up Hadoop 

- Basic requirements

    - Linux server 
    - SSH 
    - Set up the environment variables `JAVA_HOME` and `HADOOP_HOME`

- Cluster setup ref: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/ClusterSetup.html

- Core Hadoop configuration files

    - `core-site.xml`
    - `mapred-site.xml`
    - `hdfs-site.xml`
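
Each of these files is a list of `<property>` entries inside a `<configuration>` element. As a hedged illustration, the snippet below generates a minimal `core-site.xml` body from a dictionary. The helper `build_site_xml` is a hypothetical name, and the `hdfs://localhost:9000` value for `fs.defaultFS` is a placeholder of the kind used for a single-node setup; adjust it to your own cluster.

```python
import xml.etree.ElementTree as ET


def build_site_xml(properties):
    """Return Hadoop-style <configuration> XML for a dict of name -> value."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    if hasattr(ET, "indent"):  # pretty-printing is available from Python 3.9
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


# fs.defaultFS selects the default file system; host and port are placeholders.
core_site = build_site_xml({"fs.defaultFS": "hdfs://localhost:9000"})
print(core_site)
```
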

In this notebook, however, we only look at how to run Hadoop inside Google Colab.

#### First, download Hadoop into the working session

In [2]:
!wget https://downloads.apache.org/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz

--2021-08-06 15:22:54--  https://downloads.apache.org/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz
Resolving downloads.apache.org (downloads.apache.org)... 88.99.95.219, 135.181.209.10, 135.181.214.104, ...
Connecting to downloads.apache.org (downloads.apache.org)|88.99.95.219|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 500749234 (478M) [application/x-gzip]
Saving to: ‘hadoop-3.3.0.tar.gz’


2021-08-06 15:23:13 (25.6 MB/s) - ‘hadoop-3.3.0.tar.gz’ saved [500749234/500749234]



In [3]:
!tar -xzvf hadoop-3.3.0.tar.gz

Streaming output truncated to the last 5000 lines.
hadoop-3.3.0/share/doc/hadoop/hadoop-project-dist/hadoop-common/build/source/hadoop-common-project/hadoop-common/target/api/org/apache/hadoop/fs/FSDataOutputStream.html
hadoop-3.3.0/share/doc/hadoop/hadoop-project-dist/hadoop-common/build/source/hadoop-common-project/hadoop-common/target/api/org/apache/hadoop/fs/TrashPolicyDefault.Emptier.html
hadoop-3.3.0/share/doc/hadoop/hadoop-project-dist/hadoop-common/build/source/hadoop-common-project/hadoop-common/target/api/org/apache/hadoop/fs/HarFileSystem.html
hadoop-3.3.0/share/doc/hadoop/hadoop-project-dist/hadoop-common/build/source/hadoop-common-project/hadoop-common/target/api/org/apache/hadoop/fs/PathExistsException.html
hadoop-3.3.0/share/doc/hadoop/hadoop-project-dist/hadoop-common/build/source/hadoop-common-project/hadoop-common/target/api/org/apache/hadoop/fs/XAttrSetFlag.html
hadoop-3.3.0/share/doc/hadoop/hadoop-project-dist/hadoop-common/build/source/hadoop-common-p

In [4]:
!ls

hadoop-3.3.0  hadoop-3.3.0.tar.gz  sample_data


#### Find the Java installation path

In [5]:
# To find the default Java path
!readlink -f /usr/bin/java | sed "s:bin/java::"

/usr/lib/jvm/java-11-openjdk-amd64/
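
The same lookup can be done from Python, which is convenient when the result should go straight into `os.environ`. This is a sketch under the assumption that the `java` launcher lives in a `bin/` directory below the JDK root; the helper name `java_home_from_binary` is hypothetical. Unlike the `sed` expression, it only strips a trailing `bin/java`.

```python
import os


def java_home_from_binary(java_bin="/usr/bin/java"):
    """Resolve symlinks and strip the trailing bin/java, like readlink -f plus sed."""
    real = os.path.realpath(java_bin)
    suffix = os.path.join("bin", "java")
    if real.endswith(suffix):
        real = real[: -len(suffix)]
    return real.rstrip(os.sep)


print(java_home_from_binary())
```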


#### Set the Java path and Hadoop home

In [6]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"
os.environ["HADOOP_HOME"] = "/content/hadoop-3.3.0"

In [7]:
!echo $JAVA_HOME && echo $HADOOP_HOME

/usr/lib/jvm/java-11-openjdk-amd64
/content/hadoop-3.3.0


In [None]:
!cat ./hadoop-3.3.0/etc/hadoop/hadoop-env.sh 

#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Set Hadoop-specific environment variables here.

##
## THIS FILE ACTS AS THE MASTER FILE FOR ALL HADOOP PROJECTS.
## SETTINGS HERE WILL BE READ BY ALL HADOOP COMMANDS.  THEREFORE,
## ONE CAN USE THIS FILE TO SET

In [None]:
# Running Hadoop
!./hadoop-3.3.0/bin/hadoop

Usage: hadoop [OPTIONS] SUBCOMMAND [SUBCOMMAND OPTIONS]
 or    hadoop [OPTIONS] CLASSNAME [CLASSNAME OPTIONS]
  where CLASSNAME is a user-provided Java class

  OPTIONS is none or any of:

buildpaths                       attempt to add class files from build tree
--config dir                     Hadoop config directory
--debug                          turn on shell script debug mode
--help                           usage information
hostnames list[,of,host,names]   hosts to use in slave mode
hosts filename                   list of hosts to use in slave mode
loglevel level                   set the log4j level for this command
workers                          turn on worker mode

  SUBCOMMAND is one of:


    Admin Commands:

daemonlog     get/set the log level for each daemon

    Client Commands:

archive       create a Hadoop archive
checknative   check native Hadoop and compression libraries availability
classpath     prints the class path needed to get the Hadoop jar and the
    

### Download data for later examples

In [None]:
from google.colab import files
uploaded = files.upload()

Saving mapper_stock.py to mapper_stock.py
Saving stocks.txt to stocks.txt
Saving reducer_stock.py to reducer_stock.py


In [None]:
!wget http://qwone.com/~jason/20Newsgroups/20news-18828.tar.gz
!tar -xzvf 20news-18828.tar.gz

Streaming output truncated to the last 5000 lines.
20news-18828/sci.space/61079
20news-18828/sci.space/60809
20news-18828/sci.space/61497
20news-18828/sci.space/60156
20news-18828/sci.space/61475
20news-18828/sci.space/59873
20news-18828/sci.space/61541
20news-18828/sci.space/61489
20news-18828/sci.space/61133
20news-18828/sci.space/62418
20news-18828/sci.space/61289
20news-18828/sci.space/60864
20news-18828/sci.space/61443
20news-18828/sci.space/60189
20news-18828/sci.space/60251
20news-18828/sci.space/61356
20news-18828/sci.space/61255
20news-18828/sci.space/61345
20news-18828/sci.space/61382
20news-18828/sci.space/60847
20news-18828/sci.space/62481
20news-18828/sci.space/60783
20news-18828/sci.space/60854
20news-18828/sci.space/62131
20news-18828/sci.space/60921
20news-18828/sci.space/62317
20news-18828/sci.space/61372
20news-18828/sci.space/61406
20news-18828/sci.space/60247
20news-18828/sci.space/62114
20news-18828/sci.space/60791
20news-18828/sci.space/61362
20news-

In [None]:
!ls

20news-18828	     hadoop-3.3.0.tar.gz  output	    sample_data
20news-18828.tar.gz  mapper.py		  reducer.py	    stocks.txt
hadoop-3.3.0	     mapper_stock.py	  reducer_stock.py


## Work with Hadoop File System

In [12]:
%cd ./hadoop-3.3.0/bin
!./hadoop

/content/hadoop-3.3.0/bin
Usage: hadoop [OPTIONS] SUBCOMMAND [SUBCOMMAND OPTIONS]
 or    hadoop [OPTIONS] CLASSNAME [CLASSNAME OPTIONS]
  where CLASSNAME is a user-provided Java class

  OPTIONS is none or any of:

buildpaths                       attempt to add class files from build tree
--config dir                     Hadoop config directory
--debug                          turn on shell script debug mode
--help                           usage information
hostnames list[,of,host,names]   hosts to use in slave mode
hosts filename                   list of hosts to use in slave mode
loglevel level                   set the log4j level for this command
workers                          turn on worker mode

  SUBCOMMAND is one of:


    Admin Commands:

daemonlog     get/set the log level for each daemon

    Client Commands:

archive       create a Hadoop archive
checknative   check native Hadoop and compression libraries availability
classpath     prints the class path needed to get t

In [None]:
!./hadoop version

Hadoop 3.3.0
Source code repository https://gitbox.apache.org/repos/asf/hadoop.git -r aa96f1871bfd858f9bac59cf2a81ec470da649af
Compiled by brahma on 2020-07-06T18:44Z
Compiled with protoc 3.7.1
From source with checksum 5dc29b802d6ccd77b262ef9d04d19c4
This command was run using /content/hadoop-3.3.0/share/hadoop/common/hadoop-common-3.3.0.jar


In [None]:
!echo $JAVA_HOME

/usr/lib/jvm/java-11-openjdk-amd64


In [None]:
!echo $HADOOP_HOME

/content/hadoop-3.3.0


In [None]:
!./hadoop fs

Usage: hadoop fs [generic options]
	[-appendToFile <localsrc> ... <dst>]
	[-cat [-ignoreCrc] <src> ...]
	[-checksum [-v] <src> ...]
	[-chgrp [-R] GROUP PATH...]
	[-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
	[-chown [-R] [OWNER][:[GROUP]] PATH...]
	[-copyFromLocal [-f] [-p] [-l] [-d] [-t <thread count>] <localsrc> ... <dst>]
	[-copyToLocal [-f] [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
	[-count [-q] [-h] [-v] [-t [<storage type>]] [-u] [-x] [-e] <path> ...]
	[-cp [-f] [-p | -p[topax]] [-d] <src> ... <dst>]
	[-createSnapshot <snapshotDir> [<snapshotName>]]
	[-deleteSnapshot <snapshotDir> <snapshotName>]
	[-df [-h] [<path> ...]]
	[-du [-s] [-h] [-v] [-x] <path> ...]
	[-expunge [-immediate] [-fs <path>]]
	[-find <path> ... <expression> ...]
	[-get [-f] [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
	[-getfacl [-R] <path>]
	[-getfattr [-R] {-n name | -d} [-e en] <path>]
	[-getmerge [-nl] [-skip-empty-file] <src> <localdst>]
	[-head <file>]
	[-help [cmd ...]]
	[-ls [-C] [-d] [-h]

In [None]:
!./hadoop fs -ls /

Found 25 items
-rwxr-xr-x   1 root root          0 2021-08-04 09:20 /.dockerenv
drwxr-xr-x   - root root       4096 2021-07-15 13:30 /bin
drwxr-xr-x   - root root       4096 2018-04-24 08:34 /boot
drwxr-xr-x   - root root       4096 2021-08-04 09:23 /content
drwxr-xr-x   - root root       4096 2021-07-29 13:16 /datalab
drwxr-xr-x   - root root        360 2021-08-04 09:21 /dev
drwxr-xr-x   - root root       4096 2021-08-04 09:20 /etc
drwxr-xr-x   - root root       4096 2018-04-24 08:34 /home
drwxr-xr-x   - root root       4096 2021-07-15 13:31 /lib
drwxr-xr-x   - root root       4096 2021-07-15 13:21 /lib32
drwxr-xr-x   - root root       4096 2021-07-15 13:21 /lib64
drwxr-xr-x   - root root       4096 2020-09-21 17:14 /media
drwxr-xr-x   - root root       4096 2020-09-21 17:14 /mnt
drwxr-xr-x   - root root       4096 2021-07-15 13:33 /opt
dr-xr-xr-x   - root root          0 2021-08-04 09:21 /proc
drwx------   - root root       4096 2021-08-04 09:21 /root
drwxr-xr-x   - root root       4

In [None]:
!./hadoop fs -ls /usr

Found 10 items
drwxr-xr-x   - root root       4096 2021-07-16 13:18 /usr/bin
drwxr-xr-x   - root root       4096 2018-04-24 08:34 /usr/games
drwxr-xr-x   - root root       4096 2021-07-15 13:33 /usr/grte
drwxr-xr-x   - root root      12288 2021-07-15 13:32 /usr/include
drwxr-xr-x   - root root       4096 2021-07-15 13:31 /usr/lib
drwxr-xr-x   - root root       4096 2021-07-15 13:21 /usr/lib32
drwxr-xr-x   - root root       4096 2021-07-16 13:42 /usr/local
drwxr-xr-x   - root root       4096 2021-07-15 13:30 /usr/sbin
drwxr-xr-x   - root root       4096 2021-07-16 13:18 /usr/share
drwxr-xr-x   - root root       4096 2021-07-15 13:31 /usr/src


In [None]:
!pwd

/content/hadoop-3.3.0/bin


In [None]:
!mkdir ~/input
!cp /content/hadoop-3.3.0/etc/hadoop/*.xml ~/input

In [None]:
!ls ~/input

capacity-scheduler.xml	hdfs-rbf-site.xml  kms-acls.xml     yarn-site.xml
core-site.xml		hdfs-site.xml	   kms-site.xml
hadoop-policy.xml	httpfs-site.xml    mapred-site.xml


In [None]:
!./hadoop jar /content/hadoop-3.3.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.0.jar grep ~/input ~/grep_example 'allowed[.]*'

2021-08-04 09:36:33,335 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2021-08-04 09:36:33,580 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2021-08-04 09:36:33,581 INFO impl.MetricsSystemImpl: JobTracker metrics system started
2021-08-04 09:36:33,860 INFO input.FileInputFormat: Total input files to process : 10
2021-08-04 09:36:33,901 INFO mapreduce.JobSubmitter: number of splits:10
2021-08-04 09:36:34,279 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local212156713_0001
2021-08-04 09:36:34,279 INFO mapreduce.JobSubmitter: Executing with tokens: []
2021-08-04 09:36:34,520 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
2021-08-04 09:36:34,521 INFO mapreduce.Job: Running job: job_local212156713_0001
2021-08-04 09:36:34,530 INFO mapred.LocalJobRunner: OutputCommitter set in config null
2021-08-04 09:36:34,538 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 2
2021-0

In [None]:
!ls ~

grep_example  input


In [None]:
!cat ~/grep_example/*

22	allowed.
1	allowed
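
To make the result easier to read: the example job counts every string that matches the regular expression and lists the matches by frequency. The snippet below imitates that behaviour in plain Python on a few made-up lines. It is an illustrative sketch, not the code of the bundled example, and `grep_count` is a hypothetical name.

```python
import re
from collections import Counter


def grep_count(lines, pattern):
    """Count every regex match across the lines, most frequent first."""
    regex = re.compile(pattern)
    counts = Counter(m.group(0) for line in lines for m in regex.finditer(line))
    # Sort by descending count, then by the matched text for a stable order.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


sample = [
    "access is allowed.",
    "not allowed here",
    "users allowed. groups allowed.",
]
for text, n in grep_count(sample, r"allowed[.]*"):
    print(f"{n}\t{text}")
```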


### One more counting example

In [None]:
from google.colab import files
uploaded = files.upload()

Saving alice.txt to alice.txt


In [None]:
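# Copy the file into the Hadoop file system. No HDFS daemons run in this standalone
# setup, so `hadoop fs` works on the local disk and reports that the file already exists.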
!./hadoop fs -copyFromLocal alice.txt

copyFromLocal: `alice.txt': File exists


In [None]:
!ls /content/hadoop-3.3.0/share/hadoop/mapreduce

hadoop-mapreduce-client-app-3.3.0.jar
hadoop-mapreduce-client-common-3.3.0.jar
hadoop-mapreduce-client-core-3.3.0.jar
hadoop-mapreduce-client-hs-3.3.0.jar
hadoop-mapreduce-client-hs-plugins-3.3.0.jar
hadoop-mapreduce-client-jobclient-3.3.0.jar
hadoop-mapreduce-client-jobclient-3.3.0-tests.jar
hadoop-mapreduce-client-nativetask-3.3.0.jar
hadoop-mapreduce-client-shuffle-3.3.0.jar
hadoop-mapreduce-client-uploader-3.3.0.jar
hadoop-mapreduce-examples-3.3.0.jar
jdiff
lib-examples
sources


In [None]:
# Run the bundled `wordcount` example on alice.txt. While the job executes,
# Hadoop prints the Map and Reduce progress; when the job is complete,
# both reach 100%.
!./hadoop jar /content/hadoop-3.3.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.0.jar wordcount alice.txt count

2021-08-04 12:07:46,236 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2021-08-04 12:07:46,405 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2021-08-04 12:07:46,405 INFO impl.MetricsSystemImpl: JobTracker metrics system started
2021-08-04 12:07:46,570 INFO input.FileInputFormat: Total input files to process : 1
2021-08-04 12:07:46,601 INFO mapreduce.JobSubmitter: number of splits:1
2021-08-04 12:07:46,855 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local245519796_0001
2021-08-04 12:07:46,855 INFO mapreduce.JobSubmitter: Executing with tokens: []
2021-08-04 12:07:47,149 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
2021-08-04 12:07:47,150 INFO mapreduce.Job: Running job: job_local245519796_0001
2021-08-04 12:07:47,160 INFO mapred.LocalJobRunner: OutputCommitter set in config null
2021-08-04 12:07:47,169 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 2
2021-08-
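
For intuition, the counting done by the job can be imitated in a few lines of plain Python. The sketch assumes whitespace tokenization, which is consistent with the results shown further down (punctuation stays attached to the words); `word_count` and the sample sentences are made up for the illustration.

```python
from collections import Counter


def word_count(lines):
    """Count whitespace-separated tokens and return (word, count) sorted by word."""
    counts = Counter(word for line in lines for word in line.split())
    return sorted(counts.items())


text = ["Alice was beginning", "to get very tired", "Alice was tired"]
for word, n in word_count(text):
    print(f"{word}\t{n}")
```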

In [None]:
# Copy WordCount results to local file system. Here, the file part-r-00000 
# contains the results from WordCount.
!./hadoop fs -copyToLocal count/part-r-00000 count.txt

In [None]:
# View the WordCount results.
!more count/part-r-00000 count.txt

::::::::::::::
count/part-r-00000
::::::::::::::
"'TIS	1
"--SAID	1
"Come	1
"Coming	1
"Defects,"	1
"Edwin	1
"French,	1
"HOW	1
"He's	1
"How	1
"I	7
"I'll	2
"Information	1
"Keep	1
"Let	1
"Plain	2
"Project	5
"Right	1
"Such	1
"THEY	1
"There	2
"There's	1
"Too	1
"Turtle	1
"Twinkle,	1
"Uglification,"'	1
"Up	1
"What	2
"Who	1
"William	1
"With	1
"YOU	1
"You	2
"come	1
"it"	2
"much	1
"poison"	1
"purpose"?'	1
#11]	1
$5,000)	1
'"--found	1
'"Miss	1
'"WE	1
'"What	1
'"Will	1
''Tis	2
'--I	1
'--Mystery,	1
'--and	2
'--as	1
'--but	1
'--change	1
'--for	1
'--it	1
'--likely	1
'--or	1
'--so	2
'--that	1
'--well	1
'--where's	1
'--yes,	1
'--you	1
'A	9
'ARE	1
'AS-IS'	1
'After	1
'Ah!	2
'Ah,	2
'Ahem!'	1
'Alice!'	1
'All	2
'An	1
'And	17
'Anything	1
'Are	4
'As	3
'At	1
'Back	1
'Beautiful	2
'Begin	1

### Another example: median word length

In [None]:
from google.colab import files
uploaded = files.upload()

Saving shakespeare.txt to shakespeare.txt


In [None]:
!./hadoop fs -copyFromLocal shakespeare.txt

!./hadoop jar /content/hadoop-3.3.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.0.jar wordmedian shakespeare.txt wordmedian

copyFromLocal: `shakespeare.txt': File exists
2021-08-04 12:26:39,924 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2021-08-04 12:26:40,074 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2021-08-04 12:26:40,075 INFO impl.MetricsSystemImpl: JobTracker metrics system started
2021-08-04 12:26:40,236 INFO input.FileInputFormat: Total input files to process : 1
2021-08-04 12:26:40,264 INFO mapreduce.JobSubmitter: number of splits:1
2021-08-04 12:26:40,561 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local77825482_0001
2021-08-04 12:26:40,561 INFO mapreduce.JobSubmitter: Executing with tokens: []
2021-08-04 12:26:40,840 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
2021-08-04 12:26:40,841 INFO mapreduce.Job: Running job: job_local77825482_0001
2021-08-04 12:26:40,849 INFO mapred.LocalJobRunner: OutputCommitter set in config null
2021-08-04 12:26:40,864 INFO output.FileOutputCommitter: File Outp

In [None]:
!./hadoop fs -copyToLocal wordmedian/part-r-00000 wordmedian.txt

In [None]:
!cat wordmedian/part-r-00000 

1	547
2	2871
3	3215
4	4012
5	2715
6	1744
7	1075
8	692
9	394
10	190
11	70
12	31
13	15
14	13
15	2
16	1
17	1
18	1
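
The file lists, for each word length, how many words have that length. The median can be read off such a histogram by walking the cumulative counts, as the sketch below does with the numbers from the output above. `median_from_histogram` is a hypothetical helper written for this illustration and is not the code used by the bundled example.

```python
def median_from_histogram(hist):
    """Return the median key of a {value: count} histogram.

    For an even total, the two middle values are averaged.
    """
    total = sum(hist.values())
    if total == 0:
        raise ValueError("empty histogram")
    lo_pos, hi_pos = (total - 1) // 2, total // 2  # 0-based middle positions
    lo = hi = None
    seen = 0
    for value in sorted(hist):
        seen += hist[value]
        if lo is None and seen > lo_pos:
            lo = value
        if seen > hi_pos:
            hi = value
            break
    return (lo + hi) / 2


# Word-length counts copied from the job output above.
lengths = {1: 547, 2: 2871, 3: 3215, 4: 4012, 5: 2715, 6: 1744, 7: 1075,
           8: 692, 9: 394, 10: 190, 11: 70, 12: 31, 13: 15, 14: 13,
           15: 2, 16: 1, 17: 1, 18: 1}
print(median_from_histogram(lengths))  # 4.0
```

The histogram holds 17589 words in total; the middle one falls among the four-letter words because the cumulative count first exceeds half the total at length 4.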


## Our own Hadoop MapReduce program - Word count

In [None]:
!find / -name 'hadoop-streaming*.jar'

/content/hadoop-3.3.0/share/hadoop/tools/sources/hadoop-streaming-3.3.0-test-sources.jar
/content/hadoop-3.3.0/share/hadoop/tools/sources/hadoop-streaming-3.3.0-sources.jar
/content/hadoop-3.3.0/share/hadoop/tools/lib/hadoop-streaming-3.3.0.jar


In [None]:
!ls ../../

hadoop-3.3.0  hadoop-3.3.0.tar.gz  mapper.py  reducer.py  sample_data


In [None]:
!chmod u+rwx /content/mapper.py
!chmod u+rwx /content/reducer.py

In [None]:
!./hadoop jar /content/hadoop-3.3.0/share/hadoop/tools/lib/hadoop-streaming-3.3.0.jar -input /content/20news-18828/alt.atheism/49960 -output /content/output -file /content/mapper.py  -file /content/reducer.py  -mapper 'python mapper.py'  -reducer 'python reducer.py'

2021-08-04 10:00:21,993 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [/content/mapper.py, /content/reducer.py] [] /tmp/streamjob5541988181897270773.jar tmpDir=null
2021-08-04 10:00:22,802 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2021-08-04 10:00:23,001 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2021-08-04 10:00:23,002 INFO impl.MetricsSystemImpl: JobTracker metrics system started
2021-08-04 10:00:23,025 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!
2021-08-04 10:00:23,178 INFO mapred.FileInputFormat: Total input files to process : 1
2021-08-04 10:00:23,205 INFO mapreduce.JobSubmitter: number of splits:1
2021-08-04 10:00:23,470 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local851318730_0001
2021-08-04 10:00:23,470 INFO mapreduce.JobSubmitter: Executing with tokens: []
2021-08-04 10:00:23,868 INFO mapred.Loca

In [None]:
!ls ../../output

part-00000  _SUCCESS


In [None]:
!cat ../../output/part-00000

034529887x	1
0511211216	1
071	5
080182494x	1
0801834074	1
0877226423	1
0877227675	1
0908	1
0910309264	1
1	1
10	1
11	1
1266	1
1271	1
14	1
140195	1
14215	1
14226	1
142282197	1
17701900	1
1881	1
1977	1
1981	1
1986	1
1988	1
1989	1
1990	1
1992	1
20th	1
226	1
24hour	1
2568900	1
272	1
273	1
2nd	1
3005	1
316	1
372	1
3d	1
3nl	1
4	1
41	2
430	2
4581244	1
4679525	1
490	1
495	2
4rh	1
4rl	1
512	2
53701	1
541	1
59	1
608	1
664	1
700	1
702	1
7119	1
716	1
7215	1
7251	1
750	1
7723	1
787140195	1
787522973	1
831	1
8372475	1
88	1
880	2
8964079	1
8ew	1
91605	1
aah	1
aap	2
abortions	1
absurdities	1
accompanied	1
accounts	1
address	1
addresses	2
adulteries	1
aesthetics	1
african	3
africanamericans	1
agnostic	1
al	1
alien	1
allen	2
also	3
altatheism	1
altatheismarchivename	1
altatheismmoderated	1
alternate	1
alternative	1
although	2
america	1
american	5
americans	2
amherst	1
amongst	1
amusing	1
ancient	1
andor	1
another	1
anselm	1
anthology	3
anyone	1
appendix	2
approachable	1
archive	1
archivename	1
archives	1
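
The notebook does not show `mapper.py` and `reducer.py`, so the following is only a guess at what such a pair could look like, written as functions so that it can be tried without Hadoop. It assumes, based on the output above, that words are lowercased and stripped of punctuation; any further filtering the real scripts may apply (for example stop-word removal) is not reproduced. All names here are hypothetical.

```python
import string
from itertools import groupby


def mapper(lines):
    """Yield (word, 1) for each lowercased, punctuation-free token."""
    strip_punct = str.maketrans("", "", string.punctuation)
    for line in lines:
        for token in line.lower().translate(strip_punct).split():
            yield token, 1


def reducer(sorted_pairs):
    """Sum the counts of consecutive pairs that share a key."""
    for word, group in groupby(sorted_pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)


def run_job(lines):
    """Simulate map -> shuffle/sort -> reduce in a single process."""
    return list(reducer(sorted(mapper(lines))))


for word, n in run_job(["The archive-name is alt.atheism", "the Archive"]):
    print(f"{word}\t{n}")
```

In a real streaming job the mapper and reducer would read from `sys.stdin` and print tab-separated lines, and Hadoop would perform the sort between the two phases.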

In [None]:
!./hadoop jar /content/hadoop-3.3.0/share/hadoop/tools/lib/hadoop-streaming-3.3.0.jar -input /content/stocks.txt -output /content/output_stock -file /content/mapper_stock.py -file /content/reducer_stock.py -mapper 'python mapper_stock.py' -reducer 'python reducer_stock.py'

2021-08-04 11:48:20,430 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [/content/mapper_stock.py, /content/reducer_stock.py] [] /tmp/streamjob10901118934950135143.jar tmpDir=null
2021-08-04 11:48:21,147 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2021-08-04 11:48:21,268 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2021-08-04 11:48:21,268 INFO impl.MetricsSystemImpl: JobTracker metrics system started
2021-08-04 11:48:21,289 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!
2021-08-04 11:48:21,420 INFO mapred.FileInputFormat: Total input files to process : 1
2021-08-04 11:48:21,444 INFO mapreduce.JobSubmitter: number of splits:1
2021-08-04 11:48:21,739 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1180319525_0001
2021-08-04 11:48:21,739 INFO mapreduce.JobSubmitter: Executing with tokens: []
2021-08-04 11:48:22,174 IN

In [None]:
!ls ../../output_stock

part-00000  _SUCCESS


In [None]:
!cat ../../output_stock/part-00000

AAPL	10
CSCO	10
GOOG	5
MSFT	10
YHOO	10
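
The job log warns that `-file` is deprecated in favour of the generic `-files` option. One way to assemble the equivalent command is sketched below. It has not been run in this notebook, so treat it as an assumption-laden starting point: `streaming_command` is a hypothetical helper, and the paths simply repeat the ones used above. Generic options such as `-files` take a comma-separated list and have to precede the streaming-specific options.

```python
import shlex


def streaming_command(hadoop_bin, streaming_jar, input_path, output_path,
                      mapper, reducer, python="python"):
    """Build the argv list for a streaming job that ships scripts with -files."""
    return [
        hadoop_bin, "jar", streaming_jar,
        # Generic options such as -files must come before the streaming options.
        "-files", f"{mapper},{reducer}",
        "-input", input_path,
        "-output", output_path,
        "-mapper", f"{python} {mapper.rsplit('/', 1)[-1]}",
        "-reducer", f"{python} {reducer.rsplit('/', 1)[-1]}",
    ]


cmd = streaming_command(
    "./hadoop",
    "/content/hadoop-3.3.0/share/hadoop/tools/lib/hadoop-streaming-3.3.0.jar",
    "/content/stocks.txt",
    "/content/output_stock",
    "/content/mapper_stock.py",
    "/content/reducer_stock.py",
)
print(shlex.join(cmd))
```

The printed line can be pasted after a `!` in a notebook cell, or the list can be passed to `subprocess.run`.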


## Another way - write the Python mapper/reducer directly in Colab

In [8]:
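%%writefile stock_mapper.py
# -*- coding: utf-8 -*-
"""Mapper: emit "<symbol> 1" for every CSV record read from stdin."""

import sys

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue  # skip blank lines
    part = line.split(',')
    # The first CSV field is the key; the reducer splits key and count on a space.
    print(part[0], 1)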

Writing stock_mapper.py


In [9]:
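%%writefile stock_reducer.py
# -*- coding: utf-8 -*-
"""Reducer: sum the counts per key and print them sorted by key."""

import sys
from operator import itemgetter

wordcount = {}

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue  # skip blank lines
    # Each mapper line looks like "<key> <count>"; split on the last space.
    word, count = line.rsplit(' ', 1)
    wordcount[word] = wordcount.get(word, 0) + int(count)

# Sort by key so that the output order is deterministic.
sorted_wordcount = sorted(wordcount.items(), key=itemgetter(0))

for word, count in sorted_wordcount:
    print('%s\t%s' % (word, count))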

Writing stock_reducer.py


In [10]:
from google.colab import files
uploaded = files.upload()

Saving stocks.txt to stocks.txt


In [13]:
!ls ../../
!pwd

hadoop-3.3.0	     sample_data      stock_reducer.py
hadoop-3.3.0.tar.gz  stock_mapper.py  stocks.txt
/content/hadoop-3.3.0/bin


In [None]:
%%bash

./hadoop jar /content/hadoop-3.3.0/share/hadoop/tools/lib/hadoop-streaming-3.3.0.jar \
        -input /content/stocks.txt \
        -output /content/new_output \
        -file /content/stock_mapper.py \
        -file /content/stock_reducer.py \
        -mapper 'python stock_mapper.py' \
        -reducer 'python stock_reducer.py'

packageJobJar: [/content/stock_mapper.py, /content/stock_reducer.py] [] /tmp/streamjob2169261634108359524.jar tmpDir=null


2021-08-06 11:57:40,261 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
2021-08-06 11:57:41,131 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2021-08-06 11:57:41,274 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2021-08-06 11:57:41,274 INFO impl.MetricsSystemImpl: JobTracker metrics system started
2021-08-06 11:57:41,297 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!
2021-08-06 11:57:41,494 INFO mapred.FileInputFormat: Total input files to process : 1
2021-08-06 11:57:41,523 INFO mapreduce.JobSubmitter: number of splits:1
2021-08-06 11:57:41,833 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1465626538_0001
2021-08-06 11:57:41,833 INFO mapreduce.JobSubmitter: Executing with tokens: []
2021-08-06 11:57:42,305 INFO mapred.LocalDistributedCacheManager: Localized file:/content/stock_mapper.py as file:/tmp/hadoop-root/mapred/local/job_l

#### Alternatively, save the command to a shell script and run it

In [14]:
# A backslash before a newline in a regular (non-raw) string joins the lines,
# so the script below is written to disk as one long command.
sh = """
./hadoop jar /content/hadoop-3.3.0/share/hadoop/tools/lib/hadoop-streaming-3.3.0.jar \
        -input /content/stocks.txt \
        -output /content/new_output \
        -file /content/stock_mapper.py \
        -file /content/stock_reducer.py \
        -mapper 'python stock_mapper.py' \
        -reducer 'python stock_reducer.py'
"""
with open('run_stock_script.sh', 'w') as f:
    f.write(sh)

In [15]:
!ls 
!bash run_stock_script.sh

container-executor  hdfs      mapred.cmd	   test-container-executor
hadoop		    hdfs.cmd  oom-listener	   yarn
hadoop.cmd	    mapred    run_stock_script.sh  yarn.cmd
2021-08-06 15:28:04,194 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [/content/stock_mapper.py, /content/stock_reducer.py] [] /tmp/streamjob11789842932365313397.jar tmpDir=null
2021-08-06 15:28:05,202 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2021-08-06 15:28:05,466 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2021-08-06 15:28:05,466 INFO impl.MetricsSystemImpl: JobTracker metrics system started
2021-08-06 15:28:05,489 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!
2021-08-06 15:28:05,763 INFO mapred.FileInputFormat: Total input files to process : 1
2021-08-06 15:28:05,796 INFO mapreduce.JobSubmitter: number of splits:1
2021-08-06 15:28:06,288 INFO mapreduce.JobSu

In [16]:
!ls ../../new_output
!cat ../../new_output/part-00000

part-00000  _SUCCESS
AAPL	10
CSCO	10
GOOG	5
MSFT	10
YHOO	10
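
Before submitting a streaming job, it can help to dry-run the mapper and reducer locally, which is roughly what `cat input | mapper | sort | reducer` does in a shell. The sketch below does this with `subprocess`. To stay self-contained it uses two inline scripts that mirror the logic of `stock_mapper.py` and `stock_reducer.py`; the helper `dry_run` and the sample rows are made up for the illustration.

```python
import subprocess
import sys

# Inline stand-ins for stock_mapper.py and stock_reducer.py.
MAPPER = "import sys\nfor l in sys.stdin:\n    print(l.split(',')[0], 1)\n"
REDUCER = (
    "import sys\n"
    "c = {}\n"
    "for l in sys.stdin:\n"
    "    k, v = l.split(' ')\n"
    "    c[k] = c.get(k, 0) + int(v)\n"
    "for k in sorted(c):\n"
    "    print('%s\\t%s' % (k, c[k]))\n"
)


def dry_run(mapper_argv, reducer_argv, text):
    """Pipe text through the mapper, sort the lines, then pipe them through the reducer."""
    mapped = subprocess.run(mapper_argv, input=text, capture_output=True,
                            text=True, check=True).stdout
    shuffled = "".join(sorted(mapped.splitlines(keepends=True)))
    return subprocess.run(reducer_argv, input=shuffled, capture_output=True,
                          text=True, check=True).stdout


# Made-up sample rows; only the first comma-separated field matters here.
sample = "AAPL,day1,1.0\nGOOG,day1,2.0\nAAPL,day2,3.0\n"
result = dry_run([sys.executable, "-c", MAPPER], [sys.executable, "-c", REDUCER], sample)
print(result)
```

In the notebook, the two argument lists could instead point at the saved files, for example `["python", "/content/stock_mapper.py"]`.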
