# 01 Implementing a Hadoop MapReduce Program in Python

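**In this chapter**

We write a simple MapReduce program in Python, modeled on WordCount. The example reads text files and counts how often each word occurs. The result is also written as text: each line contains a word and its count, separated by a tab character.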

## 1. Prerequisites

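Before writing this program you need a running Hadoop cluster, so that the examples in this chapter can be tested and run. If you have not set one up yet, see the [Docker + Hadoop cluster](Docker_Hadoop_Cluster.ipynb) notes.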

## 2. How it works

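The key to writing MapReduce code in Python is that Hadoop Streaming passes data between the Map and Reduce phases through STDIN (standard input) and STDOUT (standard output). A Python script can therefore read its input from sys.stdin and write its output to sys.stdout, and Hadoop Streaming takes care of the rest.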
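
The data flow described above can be sketched in-process. This is a minimal illustration, not part of the original notebook: the function names `map_lines`, `reduce_lines`, and `simulate_streaming` are hypothetical, and the call to `sorted` stands in for the sort that Hadoop performs between the two phases.

```python
def map_lines(lines):
    """Map step: emit one 'word<TAB>1' record per word."""
    for line in lines:
        for word in line.split():
            yield '{0}\t1'.format(word)


def reduce_lines(sorted_records):
    """Reduce step: sum the counts of adjacent records with the same key."""
    current_word, current_count = None, 0
    for record in sorted_records:
        word, count = record.split('\t', 1)
        if word != current_word:
            if current_word is not None:
                yield '{0}\t{1}'.format(current_word, current_count)
            current_word, current_count = word, 0
        current_count += int(count)
    if current_word is not None:
        yield '{0}\t{1}'.format(current_word, current_count)


def simulate_streaming(lines):
    """Chain map -> sort -> reduce, assuming records are sorted by key."""
    # sorted() plays the role of Hadoop's sort between map and reduce.
    return list(reduce_lines(sorted(map_lines(lines))))


for record in simulate_streaming(['foo foo quux labs foo bar quux']):
    print(record)
```

Because both steps only consume and produce lines of text, the same logic works unchanged when the lines come from `sys.stdin` and go to `sys.stdout`.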

## 3. Python code

### Mapper

Save the following code as word_count_mapper.py. The program reads data from STDIN, splits it into words, and emits each word together with a count of 1:

In [2]:
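#!/usr/bin/python
import sys

# Input comes from STDIN (standard input)
for line in sys.stdin:
    # Strip leading and trailing whitespace
    line = line.strip()
    # Split the line into words
    words = line.split()
    # Emit a count of 1 for every word
    for word in words:
        # Write the result to STDOUT (standard output);
        # this output becomes the input of the Reduce phase,
        # i.e. of word_count_reducer.py.
        # The word and its count are separated by a tab
        print('{0}\t{1}'.format(word, 1))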

This script does not compute the total number of occurrences of each word; the summing is left to the Reduce step that follows.

### Reducer

Save the following code as word_count_reducer.py. The script reads the output of word_count_mapper.py from STDIN, sums the occurrences of each word, and writes the result to STDOUT.

In [3]:
#!/usr/bin/python
import sys

current_word = None
current_count = 0

# Input comes from STDIN
for line in sys.stdin:
    # Strip leading and trailing whitespace
    line = line.strip()

    # Parse the word and count emitted by the mapper;
    # skip lines that lack a tab or whose count is not an integer
    try:
        word, count = line.split('\t', 1)
        count = int(count)
    except ValueError:
        continue

    # This works only because Hadoop sorts the map output by key
    # before passing it to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word is not None:
            # Write the finished word to STDOUT
            print('{0}\t{1}'.format(current_word, current_count))
        current_count = count
        current_word = word

# Emit the last word, if any input was seen
if current_word is not None:
    print('{0}\t{1}'.format(current_word, current_count))
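
The running-total logic can also be expressed with `itertools.groupby`, which groups adjacent records that share a key. The following is a sketch of that alternative, not the notebook's own reducer; the helper names `parse` and `reduce_counts` are hypothetical, and it relies on the same assumption that the input is already sorted by word.

```python
from itertools import groupby
from operator import itemgetter


def parse(lines):
    """Yield (word, count) pairs, skipping malformed lines."""
    for line in lines:
        word, sep, count = line.rstrip('\n').partition('\t')
        if not sep:
            continue  # no tab separator on this line
        try:
            value = int(count)
        except ValueError:
            continue  # count is not an integer
        yield word, value


def reduce_counts(lines):
    """Yield (word, total) for each run of adjacent identical words."""
    for word, group in groupby(parse(lines), key=itemgetter(0)):
        yield word, sum(count for _, count in group)


# In-memory sample standing in for sorted mapper output.
sample = ['bar\t1\n', 'foo\t1\n', 'foo\t1\n', 'foo\t1\n']
for word, total in reduce_counts(sample):
    print('{0}\t{1}'.format(word, total))
```

To use this as a Streaming reducer, pass `sys.stdin` to `reduce_counts` instead of `sample`.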


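### Local testing

Before running the MapReduce job on the Hadoop cluster, it is best to test word_count_mapper.py and word_count_reducer.py by hand on the local machine, so that bugs can be found and fixed early.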

In [3]:
# Make the script executable so that it can be run directly
!chmod +x ./word_count_mapper.py

In [7]:
words = "foo foo quux labs foo bar quux"
!echo $words | ./word_count_mapper.py

foo	1
foo	1
quux	1
labs	1
foo	1
bar	1
quux	1


In [9]:
# Make the script executable so that it can be run directly
!chmod +x ./word_count_reducer.py

In [12]:
words = "foo foo quux labs foo bar quux"
!echo $words |./word_count_mapper.py | sort |./word_count_reducer.py

bar	1
foo	3
labs	1
quux	2


## 4. Running the Python scripts on the Hadoop cluster

### Preparing the data

In [2]:
!ls -l ./data

total 16024
-rw-r--r--@ 1 xiaobai  staff  2377193  8 11 21:42 gone_with_wind.txt
-rw-r--r--@ 1 xiaobai  staff   674570  8 11 18:19 pg20417.txt
-rw-r--r--@ 1 xiaobai  staff  1586393  8 11 18:20 pg4300.txt
-rw-r--r--@ 1 xiaobai  staff  1428841  8 11 18:20 pg5000.txt


In [1]:
input_file = './data/gone_with_wind.txt'
mapper_script = './word_count_mapper.py'
reducer_script = './word_count_reducer.py'
!cat $input_file | head -n 5| $mapper_script | sort | $reducer_script

cat: stdout: Broken pipe
CHAPTER	1
GONE	1
I	1
One	1
Part	1
THE	1
WIND	1
WITH	1


### Starting the Hadoop cluster

In [75]:
!docker-compose up -d


Compose does not use swarm mode to deploy services to multiple nodes in a swarm. All containers will be scheduled on the current node.

To deploy your application across the swarm, use `docker stack deploy`.

Starting namenode ... 
Starting datanode1 ... done
Recreating master  ... 
Starting datanode2 ... 
Starting resourcemanager ... 
Starting nodemanager     ... 
Starting historyserver   ... 
Recreating worker2       ... done
Recreating worker1       ... done

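### Copying the sample data to HDFS

Before running the MapReduce job on the Hadoop cluster, the local data must first be copied into HDFS:

1. Use the `docker cp` command to copy the data and script files into the `master` Docker container: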

In [24]:
!docker cp --help


Usage:	docker cp [OPTIONS] CONTAINER:SRC_PATH DEST_PATH|-
	docker cp [OPTIONS] SRC_PATH|- CONTAINER:DEST_PATH

Copy files/folders between a container and the local filesystem

Use '-' as the source to read a tar archive from stdin
and extract it to a directory destination in a container.
Use '-' as the destination to stream a tar archive of a
container source to stdout.

Options:
  -a, --archive       Archive mode (copy all uid/gid information)
  -L, --follow-link   Always follow symbol link in SRC_PATH


In [19]:
!docker cp ./data/gone_with_wind.txt master:/root/

In [20]:
!docker cp ./word_count_mapper.py master:/root/

In [21]:
!docker cp ./word_count_reducer.py master:/root/

In [25]:
!docker exec master ls /root/

gone_with_wind.txt
jars
word_count_mapper.py
word_count_reducer.py


2. Make the scripts in the `master` container executable:

In [26]:
!docker exec master chmod +x /root/word_count_mapper.py /root/word_count_reducer.py

In [27]:
!docker exec master ls /root/ -l

total 2324
-rw-r--r-- 1  501 dialout 2368496 Aug 12 05:39 gone_with_wind.txt
drwxr-xr-x 2 root root         64 Aug 11 12:19 jars
-rwxr-xr-x 1  501 dialout     603 Aug 11 09:39 word_count_mapper.py
-rwxr-xr-x 1  501 dialout    1144 Aug 12 04:08 word_count_reducer.py


3. Use `docker exec master` to run `hadoop fs` commands inside the Docker container and upload the data to HDFS:

In [83]:
!docker exec master hadoop fs -mkdir /input

In [28]:
!docker exec master hadoop fs -put /root/gone_with_wind.txt /input/

In [29]:
!docker exec master hadoop fs -ls /input/

Found 2 items
-rw-r--r--   1 root root    2368496 2019-08-12 06:17 /input/gone_with_wind.txt
-rw-r--r--   1 root root     674570 2019-08-11 12:33 /input/pg20417.txt


In [87]:
!docker exec master hadoop fs -mkdir /output

mkdir: `/output': File exists
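
The copy and upload steps above can be collected into one helper. This is a hedged sketch rather than part of the original notebook: `build_upload_commands` and its parameters are hypothetical names, and the defaults (`master`, `/root`, `/input`) are simply the values used in this chapter. It only builds the command lines; it does not run them. The `-p` flag of `hadoop fs -mkdir` makes the command succeed when the directory already exists, which avoids the `File exists` message shown above.

```python
import posixpath
import shlex


def build_upload_commands(local_path, container='master',
                          staging_dir='/root', hdfs_dir='/input'):
    """Return the docker commands that stage a file and put it into HDFS."""
    name = posixpath.basename(local_path)
    staged = posixpath.join(staging_dir, name)
    return [
        # Copy the file from the host into the container.
        ['docker', 'cp', local_path, '{0}:{1}/'.format(container, staging_dir)],
        # -p: create the directory if needed, succeed if it already exists.
        ['docker', 'exec', container, 'hadoop', 'fs', '-mkdir', '-p', hdfs_dir],
        # Upload the staged copy into the target directory.
        ['docker', 'exec', container, 'hadoop', 'fs', '-put', staged, hdfs_dir + '/'],
    ]


for argv in build_upload_commands('./data/gone_with_wind.txt'):
    print(' '.join(shlex.quote(arg) for arg in argv))
```

Each returned list can be passed to `subprocess.run(argv, check=True)` to execute it; keeping the arguments as a list avoids shell quoting problems with unusual file names.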


In [71]:
!docker exec master hadoop fs -ls /

Found 24 items
-rwxr-xr-x   1 root root          0 2019-08-11 08:40 /.dockerenv
drwxr-xr-x   - root root       4096 2018-03-12 00:00 /bin
drwxr-xr-x   - root root       4096 2017-11-19 15:32 /boot
drwxr-xr-x   - root root         64 2019-08-10 15:56 /conf
-rw-------   1 root root     380928 2018-03-26 21:40 /core
drwxr-xr-x   - root root        340 2019-08-11 08:40 /dev
drwxr-xr-x   - root root       4096 2019-08-11 08:40 /etc
drwxr-xr-x   - root root       4096 2017-11-19 15:32 /home
drwxr-xr-x   - root root       4096 2019-08-11 11:17 /input
drwxr-xr-x   - root root       4096 2018-03-26 21:39 /lib
drwxr-xr-x   - root root       4096 2018-03-12 00:00 /lib64
drwxr-xr-x   - root root       4096 2018-03-12 00:00 /media
drwxr-xr-x   - root root       4096 2018-03-12 00:00 /mnt
drwxr-xr-x   - root root       4096 2018-03-12 00:00 /opt
drwxr-xr-x   - root root       4096 2019-08-11 11:48 /output
dr-xr-xr-x   - root root          0 2019-08-11 08:40 /proc
drwx------   - root root       4096 

### Running the MapReduce job

Everything is now in place to run the Python MapReduce job on the Hadoop cluster. As noted earlier, the Hadoop Streaming API passes data between the Map and Reduce code through STDIN and STDOUT.

In [88]:
# Locate the hadoop-streaming JAR
!docker exec master ls /usr/hadoop-2.8.3/share/hadoop/tools/lib/ | grep 'stream'

hadoop-streaming-2.8.3.jar


In [32]:
# Build the path to the hadoop-streaming JAR
jarPath = '/usr/hadoop-2.8.3/share/hadoop/tools/lib/'
jarPath += 'hadoop-streaming-2.8.3.jar'
jarArgs = '-files /root/word_count_mapper.py,/root/word_count_reducer.py \
-mapper /root/word_count_mapper.py \
-reducer /root/word_count_reducer.py \
-input /input/gone_with_wind.txt \
-output /output/gone-with-wind-output'
!docker exec master hadoop jar $jarPath $jarArgs

19/08/12 06:46:16 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [/root/word_count_mapper.py, /root/word_count_reducer.py] [] /tmp/streamjob7236624025099789117.jar tmpDir=null
19/08/12 06:46:18 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
19/08/12 06:46:18 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
19/08/12 06:46:18 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
19/08/12 06:46:18 INFO mapred.FileInputFormat: Total input files to process : 1
19/08/12 06:46:19 INFO mapreduce.JobSubmitter: number of splits:1
19/08/12 06:46:19 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1555473399_0001
19/08/12 06:46:20 INFO mapred.LocalDistributedCacheManager: Localized file:/root/word_count_mapper.py as file:/tmp/hadoop-root/mapred/local/1565592380070/word_count_map

19/08/12 06:46:31 INFO mapred.LocalJobRunner: 1 / 1 copied.
19/08/12 06:46:31 INFO reduce.MergeManagerImpl: finalMerge called with 1 in-memory map-outputs and 0 on-disk map-outputs
19/08/12 06:46:31 INFO mapred.Merger: Merging 1 sorted segments
19/08/12 06:46:31 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 4048447 bytes
19/08/12 06:46:31 INFO reduce.MergeManagerImpl: Merged 1 segments, 4048456 bytes to disk to satisfy reduce memory limit
19/08/12 06:46:31 INFO reduce.MergeManagerImpl: Merging 1 files, 4048460 bytes from disk
19/08/12 06:46:31 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes from memory into reduce
19/08/12 06:46:31 INFO mapred.Merger: Merging 1 sorted segments
19/08/12 06:46:31 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 4048447 bytes
19/08/12 06:46:31 INFO mapred.LocalJobRunner: 1 / 1 copied.
19/08/12 06:46:31 INFO streaming.PipeMapRed: PipeMapRed exec [/usr/spark-2.3.0/./word_co
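
The job submission above can also be assembled as an argument list instead of a single string. The sketch below is an illustration only: `build_streaming_command` is a hypothetical helper, and the paths passed to it are the ones used in this chapter. It builds the command without running it.

```python
import shlex


def build_streaming_command(jar_path, mapper, reducer, input_path, output_path,
                            container='master'):
    """Return the argv list for a Hadoop Streaming job run via docker exec."""
    return [
        'docker', 'exec', container,
        'hadoop', 'jar', jar_path,
        # Generic options such as -files go before the streaming options.
        '-files', '{0},{1}'.format(mapper, reducer),
        '-mapper', mapper,
        '-reducer', reducer,
        '-input', input_path,
        '-output', output_path,
    ]


command = build_streaming_command(
    jar_path='/usr/hadoop-2.8.3/share/hadoop/tools/lib/hadoop-streaming-2.8.3.jar',
    mapper='/root/word_count_mapper.py',
    reducer='/root/word_count_reducer.py',
    input_path='/input/gone_with_wind.txt',
    output_path='/output/gone-with-wind-output',
)
print(' '.join(shlex.quote(arg) for arg in command))
```

Note that Hadoop refuses to start a job whose output directory already exists, so a rerun needs either a new `output_path` or the old directory removed first.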

In [33]:
!docker exec master hadoop fs -ls /output/gone-with-wind-output

Found 2 items
-rw-r--r--   1 root root          0 2019-08-12 06:46 /output/gone-with-wind-output/_SUCCESS
-rw-r--r--   1 root root     391550 2019-08-12 06:46 /output/gone-with-wind-output/part-00000


In [34]:
!docker exec master hadoop fs -cat /output/gone-with-wind-output/part-00000

(Ellen	1
(Good	1
(If	1
(Oh,	2
(Skip	1
(Surely	1
(Swing	1
(The	1
(although	1
(and	1
(as	1
(formerly	1
(he	2
(of	1
(so	1
(that	1
(we	1
-	1
---	2
-and	1
...	110
...’	1
...”	20
1791,	1
1812,	1
1836,	1
1847.’	1
1849	1
1849,	1
1861	2
1861,	3
1862	4
1862,	2
1862.	2
1863	2
1863,	2
1863.	1
1864	3
1864,	2
1864.	1
1865	1
1866,	4
1871,	1
?	2
?”	2
A	161
A-stealin’	1
ABCs	1
AFTER	2
AFTERNOON	3
AGAIN	1
AGAIN,	1
ALL,	1
AN	1
AND	1
APRIL	1
ARMY,	1
AS	2
AT	2
Abandoned	1
Abe	3
Abel	6
Abel’s,”	1
Abolitionist	1
Abolitionist,	1
Abolitionists	1
About	4
Above	4
Abraham	1
Abruptly	3
Abruptly,	1
Academy	2
Academy,	1
Academy.	1
Accept,	1
Accepting	1
Accompanying	1
Accustomed	2
Across	3
Actually	2
Adairsville,	1
Adam	1
Added	2
Admirable	1
Admiral	1
Admire	1
Adorned	1
Adventurers	1
Advice	1
Affikun	1
Affikun.	1
African	3
African,	1
After	96
Afternoon	2
Afterward	1
Afterwards,	1
Again	7
Again,	1
Against	2
Aged	1
Ages.	1
Ah	342
Ah!	2
Ah,	5
Ahead	1
Ah—Ah	1
Ah—Ah—Miss	1
Ah—”	2
Ah’	1
Ah’d	7
Ah’ll	9
Ah’m	3
Ah’s	26
Aid	1


directed.	1
directing	2
direction	10
direction.	1
directions	1
directions,	1
directly	7
directly,	2
directness	1
direful	1
dirt	10
dirt,	3
dirt-cheap	1
dirt.	2
dirt.”	1
dirthy	1
dirtier,	1
dirty	29
dirty,	10
dirty-backed	1
dirty-minded	1
dirty.	2
dirty—”	1
dis	27
dis,	2
dis.”	1
disabuse	1
disadvantage	2
disadvantage.	2
disagreeable	2
disagreeable.	1
disagreement	1
disagreements	1
disappear	1
disappearance	1
disappeared	14
disappeared.	1
disappearing	3
disappearings	1
disappoint	1
disappointed	7
disappointed,	1
disappointed.	3
disappointing,	1
disappointment	18
disappointment,	2
disappointment.	2
disappointments.	1
disapproval	13
disapproval,	1
disapproval.	5
disapprove	2
disapprove.”	1
disapproved	8
disapproved,	2
disapproving	3
disapprovingly	2
disapprovingly.	1
disarranged,	1
disaster	10
disaster.	2
disbelief	1
disbelief.	3
discarded	3
discern	1
discerned	1
discharge	3
discharge.	1
discharged	2
discharging	2
discip

prayed—”	1
prayer	10
prayer,	2
prayer.	6
prayers	12
prayers.	3
prayers?”	1
praying	6
praying,	1
praying.	1
praying:	1
prayin’	3
prayin’.”	1
pre-occupation	1
preached	2
preacher	1
preacher.	1
preachers	2
precarious	2
precariously	1
precautions.	1
precede	1
preceded	2
preceding	3
precepts	1
precinct	1
precious	28
precious!”	1
precipitate	2
precipitous	1
precipitously	1
precise	1
precisely	1
predatory	1
predatory,	1
predicament	2
predicament,	1
predicament.	1
predicted	1
predicted.	1
prediction	1
predictions	1
preempted	1
preen	1
preened	1
preening	1
preenings	1
preface	1
prefacing	1
prefer	6
preferable	6
preference	1
preferred	15
preferrin’	1
prefers	1
pregnancy	10
pregnancy,	5
pregnancy.	1
pregnancy?	1
pregnant	7
pregnant,	2
pregnant.	3
pregnant.”	1
prejudices	1
prejudices,”	1
preliminary	2
premature	2
premeditated	1
premium	1
premonitions	1
preoccupation	1
preoccupied	5
preoccupied,	1
preoccupied.	1
preparation	6

“She	62
“Sherman	1
“Sherman’s	1
“Shet	1
“She—never	1
“She’d	1
“She’ll	3
“She’s	22
“Sho	1
“Show	1
“Shucks,	1
“Shut	5
“Since	1
“Sing	2
“Sir,	1
“Sir,”	4
“Sit	5
“Slatterys?”	1
“Slip	1
“Smell	1
“Snakes,	1
“So	31
“So,	4
“So,”	1
“Soap!	1
“Soldiers,”	1
“Some	8
“Somebody	2
“Somebody’s	4
“Somehow,”	1
“Someone	1
“Something	1
“Something’s	1
“Sometimes	4
“Soon	1
“Soon’s	2
“Sorry,	1
“Sorry—for	2
“Soun’	1
“Spare	1
“Spared	1
“Speak	1
“Speaking	1
“Spec	1
“Speculator!”	1
“Spell	1
“Spying,	1
“Starving	1
“Starving’s	1
“States’	1
“Stay	1
“Steady	1
“Stentor	1
“Still	2
“Stop	4
“Stop,	3
“Stop,”	1
“Stop—please,	1
“Stuff	1
“Such	1
“Suellen	1
“Suellen,	1
“Suellen?”	1
“Sugar,	7
“Suh?”	1
“Suit	1
“Supper	1
“Suppose	1
“Sure,	1
“Surely	1
“Surely,	1
“Surround	1
“Susie,	1
“Sweet	1
“Sweet,”	1
“Sweet.”	1
“Sweetheart”	1
“Sword	1
“S’me,	1
“Take	7
“Talk	1
“Tara	2
“Tara?	1
“Tarleton—Brenton,	1
“Tarleton—Stuart,	1
“Tarleton—Thomas,	1
“Tarle
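
The output above counts forms such as `dirt`, `dirt,` and `dirt.` separately, and treats `After` and `AFTER` as different words, because the mapper splits on whitespace only. One possible refinement, which is not part of the original notebook, is to normalize tokens in the mapper. The sketch below assumes that lowercasing and stripping surrounding punctuation is acceptable for the analysis; `tokenize` and `map_line` are hypothetical names.

```python
import re

# Runs of letters, digits, or apostrophes (straight or curly) count as a word.
WORD_RE = re.compile(r"[0-9A-Za-z'’]+")


def tokenize(line):
    """Return lowercased words with surrounding punctuation removed."""
    words = (match.strip("'’") for match in WORD_RE.findall(line.lower()))
    return [word for word in words if word]


def map_line(line):
    """Return the 'word<TAB>1' records for one input line."""
    return ['{0}\t1'.format(word) for word in tokenize(line)]


for record in map_line('“Stop,” (Oh, dirt. Ah’ll dirt'):
    print(record)
```

With this pattern, hyphenated forms such as `dirt-cheap` are split into two words; whether that is desirable depends on the analysis.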

[Exercise] WordCount in practice with spark-shell

Run the following command at the command line to start a spark-shell session:
```
docker exec -it master spark-shell --executor-memory 512M --total-executor-cores 2
```
Then run the Scala version of WordCount with the following command:
```
sc.textFile("hdfs://master:8020/input/GoneWiththeWind.txt").flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).sortBy(_._2,false).take(10).foreach(println)
```
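
For comparison, the same top-10 computation can be sketched in plain Python on a single machine. This is not Spark code and is not part of the original notebook; `top_words` is a hypothetical helper that roughly mirrors the stages of the Scala pipeline using `collections.Counter`.

```python
from collections import Counter


def top_words(lines, n=10):
    """Return the n most frequent space-separated tokens as (word, count)."""
    counts = Counter()
    for line in lines:
        # Roughly mirrors flatMap + map + reduceByKey: split on single spaces
        # and add one to the count of every token.
        counts.update(line.split(' '))
    # Roughly mirrors sortBy(_._2, false).take(n): highest counts first.
    return counts.most_common(n)


for word, count in top_words(['foo foo quux', 'labs foo bar quux'], n=2):
    print('{0}\t{1}'.format(word, count))
```

Unlike the Spark version, this keeps all counts in one process, so it only suits inputs that fit in memory.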