## Word count with hdfs3, distributed, and dask

In this example, we count the total number of words in the text files of the Enron email dataset (6.4 GB) stored in HDFS.

We work through three library combinations, from the lowest-level API to the highest-level:

* hdfs3
* hdfs3 + distributed
* hdfs3 + distributed + dask

Setup:

```
conda install hdfs3 -c blaze
pip install dask distributed
```

TODO: implement proper count by word instead of total word count
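
A per-word count could be sketched with `collections.Counter`. This is a minimal sketch rather than the notebook's implementation: the helper name `count_by_word` is hypothetical, and it accepts any iterable of lines (bytes or str) so it can be tried without an HDFS cluster.

```python
from collections import Counter


def count_by_word(lines):
    """Return a Counter mapping each whitespace-separated word to its frequency.

    `lines` is any iterable of bytes or str lines. Hypothetical helper,
    not part of hdfs3, distributed, or dask.
    """
    counts = Counter()
    for line in lines:
        # split() with no arguments splits on runs of whitespace
        counts.update(line.split())
    return counts


# Small in-memory sample standing in for lines read from a file
sample = [b"to be or not to be", b"that is the question"]
counts = count_by_word(sample)
print(counts.most_common(2))
```

Counters from separate files can be merged with `+` or `Counter.update`, so a per-file version of this function should fit the same loop-over-filenames pattern used for the total count.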

### Example 1) Word count with hdfs3

In [1]:
import hdfs3

In [2]:
hdfs = hdfs3.HDFileSystem('ip-172-31-56-96', port=8020)

List the paths of all files in the subdirectories of /tmp/enron

In [3]:
filenames = hdfs.glob('/tmp/enron/*/*')

In [4]:
filenames[:5]

[b'/tmp/enron/edrm-enron-v2_saibi-e_xml.zip/merged.txt',
 b'/tmp/enron/edrm-enron-v2_baughman-d_xml.zip/merged.txt',
 b'/tmp/enron/edrm-enron-v2_lavorato-j_xml.zip/merged.txt',
 b'/tmp/enron/edrm-enron-v2_derrick-j_xml.zip/merged.txt',
 b'/tmp/enron/edrm-enron-v2_crandall-s_xml.zip/merged.txt']

Try to print the first 500 bytes of one file (this fails because the installed hdfs3 version has no `head` method)

In [7]:
hdfs.head('/tmp/enron/edrm-enron-v2_allen-p_xml.zip/merged.txt', 500)

AttributeError: 'HDFileSystem' object has no attribute 'head'
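
Where `head` is unavailable, the same preview can be approximated by opening the file and reading a fixed number of bytes. The sketch below is hedged: `head` here is a hypothetical helper, and it only assumes that `fs.open(path, 'rb')` returns a file-like object usable in a `with` block, as `hdfs.open` is used in `count_words`. `FakeFS` is an in-memory stand-in so the example runs without a cluster.

```python
import io


def head(fs, path, nbytes=500):
    """Return the first `nbytes` bytes of `path`.

    Assumes `fs.open(path, 'rb')` returns a file-like object that works
    as a context manager and has a `read(n)` method (hypothetical helper).
    """
    with fs.open(path, 'rb') as f:
        return f.read(nbytes)


class FakeFS:
    """In-memory stand-in for a filesystem object, for demonstration only."""

    def __init__(self, files):
        self.files = files

    def open(self, path, mode='rb'):
        return io.BytesIO(self.files[path])


fake = FakeFS({'/tmp/example.txt': b'Subject: meeting\r\nBody text here\r\n'})
preview = head(fake, '/tmp/example.txt', 16)
print(preview)
```

Against the real cluster, the call would presumably be `head(hdfs, filenames[0])`.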

In [8]:
def count_words(filename):
    with hdfs.open(filename) as f:
        count = 0
        for line in f.readlines():
            words = line.split()
            count += len(words)
    return count

Count words in first file

In [9]:
%%time
count_words(filenames[0])

CPU times: user 302 ms, sys: 44.3 ms, total: 346 ms
Wall time: 351 ms


741115

Count words in all (readable) files

In [10]:
%%time
wordcounts = {}
for filename in filenames:
    wordcounts[filename] = count_words(filename)

CPU times: user 3min 53s, sys: 14.1 s, total: 4min 7s
Wall time: 4min 8s


In [11]:
wordcounts.values()

dict_values([741115, 9568549, 6141444, 8900991, 3271346, 2423343, 639765, 17209969, 1593956, 19767915, 1302676, 0, 1458698, 1778714, 48181070, 1298290, 166803, 3487509, 0, 25683387, 416411, 2675053, 2197147, 8807611, 1502013, 983098, 646054, 551252, 5442101, 5803387, 999658, 1965265, 135598, 0, 1910755, 9991092, 2553089, 1297129, 1297348, 1941178, 0, 309968, 4179474, 6195075, 2697321, 10529060, 4227674, 1599893, 16318646, 712714, 1241517, 2499543, 7490795, 1458891, 748537, 15967056, 640160, 2631204, 3390948, 602964, 17587152, 2993344, 6218398, 0, 3525664, 6742487, 22469610, 2605953, 940679, 1675549, 822513, 10179772, 34052870, 4136274, 3212032, 8287836, 823519, 4819396, 13099510, 4583501, 3798862, 942434, 5416725, 859444, 236457, 1502412, 7843359, 4746897, 2416830, 1287526, 454763, 16250124, 3707412, 31537933, 7120847, 31577075, 1742278, 1077756, 15837004, 8222338, 1082188, 1482969, 31577075, 826188, 2010409, 6019558, 7071963, 2308908, 1386230, 653546, 290147, 0, 3785987, 826083, 44147

In [12]:
sum(wordcounts.values())

911246888

Notes and missing API functionality?

* ~~wanted to list only filenames (and not other HDFS file info)~~
* ~~wanted to load all text files in subdirs (glob like /tmp/enron/*/*.txt)~~
* ~~wanted to set encoding in .open() method~~
* ~~wanted to easily read .head() of large text file~~
* ~~why are the encoding errors happening?~~
* why are some word counts zero?
* glob returns an unordered list
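
For the two open notes, sorting the paths gives a reproducible order, and filtering the `wordcounts` dict shows which files produced a zero count so they can be inspected by hand. This is a minimal sketch with a hypothetical helper name and made-up sample data; it does not explain why those counts are zero.

```python
def summarize_counts(wordcounts):
    """Return (sorted paths, sorted paths whose word count is zero).

    `wordcounts` maps a path to its word count, like the dict built above.
    Hypothetical helper for inspection only.
    """
    ordered = sorted(wordcounts)
    empty = [path for path in ordered if wordcounts[path] == 0]
    return ordered, empty


# Made-up sample; the real dict is keyed by the paths returned from glob
sample_counts = {
    b'/tmp/enron/b/merged.txt': 0,
    b'/tmp/enron/a/merged.txt': 12,
    b'/tmp/enron/c/merged.txt': 0,
}
ordered, empty = summarize_counts(sample_counts)
print(ordered[0])
print(empty)
```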

### Example 2) Word count with hdfs3 + distributed

Start `dscheduler` on the head node and a `dworker` on each compute node:

head node: `dscheduler`  
compute nodes: `dworker 172.31.56.96:8786`

In [13]:
from distributed import Executor, progress, wait
from distributed.hdfs import read_bytes

In [14]:
e = Executor('172.31.56.96:8786')

Count words in first file

In [15]:
future = e.submit(count_words, filenames[0])

In [16]:
future.result()

741115

Count words in all (readable) files

In [17]:
futures = e.map(count_words, filenames)

In [18]:
futures[:5]

[<Future: status: finished, key: count_words-cb83eead7442c056080b2d380cbf6157>,
 <Future: status: pending, key: count_words-50f8866b61cb3204e85e4538579bb1c6>,
 <Future: status: pending, key: count_words-a47a689665cb2d8438240c9a237398ef>,
 <Future: status: pending, key: count_words-0e7e10b60233cb232b6c4c8b75b70e15>,
 <Future: status: pending, key: count_words-d04d4f2512cab8d9a72c0d79e68aceac>]

In [19]:
%%time
wait(futures);

CPU times: user 201 ms, sys: 5.52 ms, total: 207 ms
Wall time: 1min 30s


DoneAndNotDoneFutures(done={<Future: status: finished, key: count_words-fd055f1276df7859f2da0a5b8051912e>, <Future: status: finished, key: count_words-88fd1d6bee18f82e70fa28d1d06c907b>, <Future: status: finished, key: count_words-d6be01eb81dfb48f1b970c43189dce27>, <Future: status: finished, key: count_words-93ae83824d768da7d304d9822472e318>, <Future: status: finished, key: count_words-57c6d64cb1d60a0e3e7bb11a86067529>, <Future: status: finished, key: count_words-67c5aa9ab2ef95e14fd481717cc74adf>, <Future: status: finished, key: count_words-a4c86790c2ec50cebbbe7cbef79dfe44>, <Future: status: finished, key: count_words-b5127efa1de1509389c31f6f531f5111>, <Future: status: finished, key: count_words-129ffaf7075e51112f4a5be50640611e>, <Future: status: finished, key: count_words-9a7e7ed825b9ff11e83cbb4525ec10f6>, <Future: status: finished, key: count_words-203bead19fa3a1018aba20de0abe106f>, <Future: status: finished, key: count_words-aa37105a87e69397f4a078a31cf84160>, <Future: status: finishe

In [20]:
futures[:5]

[<Future: status: finished, key: count_words-cb83eead7442c056080b2d380cbf6157>,
 <Future: status: finished, key: count_words-50f8866b61cb3204e85e4538579bb1c6>,
 <Future: status: finished, key: count_words-a47a689665cb2d8438240c9a237398ef>,
 <Future: status: finished, key: count_words-0e7e10b60233cb232b6c4c8b75b70e15>,
 <Future: status: finished, key: count_words-d04d4f2512cab8d9a72c0d79e68aceac>]

In [21]:
futures[0].result()

741115

In [22]:
sum(e.gather(futures))

911246888

Notes and missing API functionality?

* ~~how to view number of workers/cores?~~
* ~~wanted to easily read .head() of large text file~~
* ~~can I pass arguments to functions in e.map(func, input)?~~
* ~~can I get the results of futures without a list comprehension, similar to an RDD?~~
* ~~why are the encoding errors happening?~~
* ~~should blocking futures be default with an option to wait()?~~
* ~~when to use read_bytes, BytesIO, etc.~~
* ~~are these errors from dscheduler important? https://gist.github.com/koverholt/6c0f9c10b23152c3f0c4~~
* option to drop failed futures?
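
One way to drop failed futures by hand is to filter on each future's status before gathering. The sketch below is an assumption-laden illustration: `split_by_status` is a hypothetical helper, it assumes each future exposes a `status` attribute (as the `status: finished` reprs above suggest), and it assumes a failed task reports the status string `'error'`. Plain namespace objects stand in for real futures so the example runs without a scheduler.

```python
from types import SimpleNamespace


def split_by_status(futures):
    """Split futures into (finished, failed) lists based on `.status`.

    Assumes a `status` attribute with values such as 'finished' and
    'error'; the exact strings are an assumption, not confirmed here.
    """
    finished = [f for f in futures if f.status == 'finished']
    failed = [f for f in futures if f.status == 'error']
    return finished, failed


# Stand-ins for futures; real ones would come from e.map(...)
fake_futures = [
    SimpleNamespace(key='count_words-1', status='finished'),
    SimpleNamespace(key='count_words-2', status='error'),
    SimpleNamespace(key='count_words-3', status='finished'),
]
finished, failed = split_by_status(fake_futures)
print(len(finished), len(failed))
```

With real futures, only the `finished` list would then be passed to `e.gather`.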

### Example 3) Word count with hdfs3 + distributed + dask

In [23]:
from distributed.hdfs import read_bytes
from distributed.collections import futures_to_collection

In [24]:
bytes = read_bytes('/tmp/enron/*/*', hdfs=hdfs, delimiter=b'\r\n')

In [25]:
bytes[:5]

[<Future: status: pending, key: read_block-e069a953fedbf56df977a77171e205e5>,
 <Future: status: pending, key: read_block-b90671f599480b883ed1f4967927bdc6>,
 <Future: status: pending, key: read_block-4baea76d420ed00c8dc86698cf71161f>,
 <Future: status: pending, key: read_block-6cd548dbda5a00717e1f40cf22558d75>,
 <Future: status: pending, key: read_block-cf643bb93a1fa4b1a3b8572625bbe469>]

In [26]:
def bytes_to_lines(b):
    return b.decode().split('\n')

In [27]:
lists = e.map(bytes_to_lines, bytes)

In [28]:
%%time
db = futures_to_collection(lists)

Setting global dask scheduler to use distributed
CPU times: user 77.7 ms, sys: 7.08 ms, total: 84.8 ms
Wall time: 21.4 s


In [29]:
def count_words_in_bytes(data):
    # Despite the name, each element is a decoded line (str) from bytes_to_lines
    words = data.split()
    count = len(words)
    return count

In [30]:
word_counts = db.map(count_words_in_bytes)

In [31]:
%%time
word_counts.sum().compute()

CPU times: user 292 ms, sys: 14.6 ms, total: 307 ms
Wall time: 1min 41s


913806111

Notes and missing API functionality?

* does `read_bytes` read everything into memory, or does it execute lazily?
* wanted to easily read .head() of large text file
* word count (913806111) differs from hdfs3 + distributed (911246888), perhaps due to line splitting
* solution was a bit more complex than hdfs3 + distributed

### PySpark API

```
>>> rdd = sc.textFile('/tmp/enron/*/*.txt')
>>> counts = rdd.flatMap(lambda line: line.split()).count()
...
16/02/02 07:01:47 INFO DAGScheduler: ResultStage 0 (count at <stdin>:1) finished in 76.110 s
16/02/02 07:01:47 INFO YarnScheduler: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/02/02 07:01:47 INFO DAGScheduler: Job 0 finished: count at <stdin>:1, took 76.307768 s
>>> counts
913806131
```

### Results

Time to count words in all files:

* hdfs3 - 4 min 8 s
* hdfs3 + distributed - 1 min 30 s
* hdfs3 + distributed + dask - 2 min 2 s (21.4 s to build the collection + 1 min 41 s to compute)
* PySpark - 1 min 16 s
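
The timings above can be turned into speedups relative to the single-process hdfs3 run with a few lines of arithmetic. This is only a convenience sketch over the wall times listed here, rounded to whole seconds; the dictionary and variable names are illustrative.

```python
# Wall times from the results list, converted to seconds
timings_s = {
    'hdfs3': 4 * 60 + 8,
    'hdfs3 + distributed': 1 * 60 + 30,
    'hdfs3 + distributed + dask': 2 * 60 + 2,
    'PySpark': 1 * 60 + 16,
}

baseline = timings_s['hdfs3']

# Speedup = baseline time / time of each approach
speedups = {name: round(baseline / t, 2) for name, t in timings_s.items()}

for name, s in speedups.items():
    print(f'{name}: {s}x')
```

Relative to plain hdfs3, this gives roughly 2.8x for hdfs3 + distributed, 2.0x for hdfs3 + distributed + dask, and 3.3x for PySpark.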