## Word count with hdfs3, distributed, and dask

This example counts the words in a set of text files (the Enron email dataset, 6.4 GB) stored in HDFS.

We use the libraries in order of increasing API level, from low-level to high-level:

* hdfs3
* hdfs3 + distributed
* hdfs3 + distributed + dask

Setup:

```
conda install hdfs3 -c blaze
pip install dask distributed
```

### Example 1) Word count with hdfs3

In [1]:
import hdfs3

In [2]:
hdfs = hdfs3.HDFileSystem('ip-172-31-56-96', port=8020)

Generate lists of directory names and file names under `/tmp/enron`

In [3]:
dirnames = [x['name'] for x in hdfs.ls('/tmp/enron')]

In [4]:
dirnames[:5]

[b'/tmp/enron/edrm-enron-v2_allen-p_xml.zip',
 b'/tmp/enron/edrm-enron-v2_arnold-j_xml.zip',
 b'/tmp/enron/edrm-enron-v2_arora-h_xml.zip',
 b'/tmp/enron/edrm-enron-v2_badeer-r_xml.zip',
 b'/tmp/enron/edrm-enron-v2_bailey-s_xml.zip']

In [5]:
filenames = [x.decode('utf-8') + '/merged.txt' for x in dirnames]

In [6]:
filenames[:5]

['/tmp/enron/edrm-enron-v2_allen-p_xml.zip/merged.txt',
 '/tmp/enron/edrm-enron-v2_arnold-j_xml.zip/merged.txt',
 '/tmp/enron/edrm-enron-v2_arora-h_xml.zip/merged.txt',
 '/tmp/enron/edrm-enron-v2_badeer-r_xml.zip/merged.txt',
 '/tmp/enron/edrm-enron-v2_bailey-s_xml.zip/merged.txt']

Print the first 10 lines of the first file

In [7]:
with hdfs.open(filenames[0]) as f:
    f.encoding = 'utf-8'
    # A plain loop avoids building a throwaway list just for the print side effect
    for _ in range(10):
        print(f.readline())

Date: Tue, 26 Sep 2000 09:26:00 -0700 (PDT)
From: Phillip K Allen
To: pallen70@hotmail.com
Subject: Investment Structure
X-SDOC: 948896
X-ZLID: zl-edrm-enron-v2-allen-p-1713.eml

---------------------- Forwarded by Phillip K Allen/HOU/ECT on 09/26/2000 
04:26 PM ---------------------------



In [8]:
def count_words(filename, encoding='utf-8'):
    """Return the number of whitespace-separated words in an HDFS text file."""
    with hdfs.open(filename) as f:
        f.encoding = encoding
        count = 0
        # readlines() loads the whole file into memory before counting
        for line in f.readlines():
            count += len(line.split())
    return count

Count words in first file

In [9]:
%%time
count_words('/tmp/enron/edrm-enron-v2_allen-p_xml.zip/merged.txt')

CPU times: user 3.3 s, sys: 183 ms, total: 3.48 s
Wall time: 3.49 s


13099980

Count words in all (readable) files

In [10]:
%%time
wordcounts = {}
for filename in filenames:
    try:
        wordcounts[filename] = count_words(filename)
    except UnicodeDecodeError:
        wordcounts[filename] = 'Encoding error'

CPU times: user 1min 53s, sys: 8.42 s, total: 2min 1s
Wall time: 2min 2s


In [11]:
wordcounts.values()

dict_values([15836636, 'Encoding error', 'Encoding error', 826188, 3487505, 7071894, 1482517, 'Encoding error', 1296974, 'Encoding error', 1241517, 34052622, 'Encoding error', 1298290, 1901115, 551252, 'Encoding error', 3073065, 'Encoding error', 'Encoding error', 646054, 2993343, 'Encoding error', 3212038, 4227672, 'Encoding error', 'Encoding error', 1502415, 1778714, 349008, 0, 236457, 748549, 2674950, 'Encoding error', 4004560, 'Encoding error', 'Encoding error', 7490781, 'Encoding error', 2605960, 859418, 'Encoding error', 1675517, 5445575, 17209358, 'Encoding error', 1082150, 25683415, 454763, 0, 'Encoding error', 'Encoding error', 'Encoding error', 5803416, 3390924, 416404, 8287835, 'Encoding error', 'Encoding error', 'Encoding error', 2010407, 'Encoding error', 653546, 222069, 5814482, 'Encoding error', 602964, 9978422, 732104, 940670, 5736422, 1405998, 6019540, 327602, 'Encoding error', 942422, 741113, 1302679, 0, 7247270, 'Encoding error', 13099980, 'Encoding error', 13706115,

In [12]:
sum([x for x in wordcounts.values() if isinstance(x, int)])

363695216

Notes and missing API functionality?

* wanted to list only filenames (and not other HDFS file info)
* wanted to load all text files in subdirs (glob like /tmp/enron/*/*.txt)
* wanted to set encoding in .open() method
* wanted to easily read .head() of large text file
* why are the encoding errors happening?
* why are some word counts zero?
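
One plausible cause of the encoding errors, not verified against the Enron files, is that some files contain bytes that are not valid UTF-8 (for example, Latin-1 encoded text). The sketch below illustrates the failure and one possible workaround: decode each line with `errors='replace'` so a bad byte no longer aborts the whole file. The helper `count_words_tolerant` is hypothetical, and an in-memory `io.BytesIO` stands in for an HDFS file handle.

```python
import io


def count_words_tolerant(fileobj, encoding='utf-8'):
    """Count whitespace-separated words, replacing undecodable bytes.

    Hypothetical helper: fileobj is any binary file-like object that
    yields lines of bytes when iterated.
    """
    count = 0
    for raw_line in fileobj:  # iterate line by line instead of readlines()
        line = raw_line.decode(encoding, errors='replace')
        count += len(line.split())
    return count


# Stand-in for an HDFS file: Latin-1 text whose 0xE9 byte is invalid UTF-8 here
sample = io.BytesIO('Subject: caf\u00e9 menu\nsee attached\n'.encode('latin-1'))

# Strict decoding fails the same way count_words does above
strict_failed = False
try:
    sample.getvalue().decode('utf-8')
except UnicodeDecodeError:
    strict_failed = True

total = count_words_tolerant(sample)
print(strict_failed, total)
```

The trade-off is that replaced characters still count as part of a word, so the totals stay approximate for files in other encodings.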

### Example 2) Word count with hdfs3 + distributed

Start dscheduler and dworkers on nodes:

head node: `dscheduler`  
compute nodes: `dworker 172.31.56.96:8786`

In [13]:
from distributed import Executor, progress, wait
from distributed.hdfs import read_bytes

In [14]:
e = Executor('172.31.56.96:8786')

Count words in first file

In [15]:
future = e.submit(count_words, filenames[0])

In [16]:
future.result()

13099980

Count words in all (readable) files

In [17]:
futures = e.map(count_words, filenames)

In [18]:
futures[:5]

[<Future: status: finished, key: count_words-bd731296fd7be686e9f2aa39d4e55b13>,
 <Future: status: pending, key: count_words-092e034e73e859cdd6b3318853257e1c>,
 <Future: status: pending, key: count_words-d57a35bd921b22c3d7dc5b7557a0b9ff>,
 <Future: status: pending, key: count_words-35c292d5690de86d25da24b4828ef306>,
 <Future: status: pending, key: count_words-afc75889580ce64dd0bafef7e4bd9d78>]

In [19]:
%%time
wait(futures);

CPU times: user 136 ms, sys: 13.9 ms, total: 150 ms
Wall time: 42.4 s


DoneAndNotDoneFutures(done={<Future: status: finished, key: count_words-c2e14215bfa7ae8d5a8585777e71f64c>, <Future: status: finished, key: count_words-5471ef08686e7a526ec9f644d47047ad>, <Future: status: error, key: count_words-5294eafbc4a49823a9c74bcf7c90fa97>, <Future: status: finished, key: count_words-cb852603ce94b8c4eeb7e7b667e0283a>, <Future: status: finished, key: count_words-db42ce3e0b8fea60612053c00c090a11>, <Future: status: error, key: count_words-e5cacef389b5881b14680045bfa1b89a>, <Future: status: error, key: count_words-81dfa05b33ee5ae1fc05af2b7b77e62f>, <Future: status: finished, key: count_words-2fc64e386ff2d6b88baf5e084cb260d6>, <Future: status: finished, key: count_words-0d7de3b16b6ef3dd26b51632d4300883>, <Future: status: error, key: count_words-56d455d3e683ebeab55a473ac132e7c7>, <Future: status: error, key: count_words-4fe87556fde72271f709f1ce9b6c5b37>, <Future: status: error, key: count_words-f14ed610161016676958a2e762fe6741>, <Future: status: finished, key: count_word

In [20]:
# Some tasks fail with UTF-8 decoding errors; keep only the tasks that finished
futures_success = [x for x in futures if x.status == 'finished']

In [21]:
futures_success[:5]

[<Future: status: finished, key: count_words-bd731296fd7be686e9f2aa39d4e55b13>,
 <Future: status: finished, key: count_words-092e034e73e859cdd6b3318853257e1c>,
 <Future: status: finished, key: count_words-afc75889580ce64dd0bafef7e4bd9d78>,
 <Future: status: finished, key: count_words-a4ae70581908099cdf5d466a0e0dd3b9>,
 <Future: status: finished, key: count_words-dba948a37282190f54c4f50e5038d9d0>]

In [22]:
futures_success[0].result()

13099980

In [23]:
sum(e.gather(futures_success))

363695216

Notes and missing API functionality?

* how to view number of workers/cores?
* wanted to easily read .head() of large text file
* can I pass arguments to functions in e.map(func, input)?
* can I get the results of futures without a list comprehension, similar to an RDD?
* why are the encoding errors happening?
* should blocking futures be default with an option to wait()?
* are these errors from dscheduler important? https://gist.github.com/koverholt/6c0f9c10b23152c3f0c4
* when to use `read_bytes`, `BytesIO`, etc.?
* does `read_bytes` read everything into memory, or is it lazy?
* option to drop failed futures?
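
Two of the questions above (passing extra arguments to a mapped function, and dropping failed futures) can be sketched with the standard library alone. This example uses `concurrent.futures.ThreadPoolExecutor` as a stand-in, on the assumption that the distributed `Executor` behaves similarly for `submit` and futures; it is not the distributed API itself. `functools.partial` binds the extra argument, and `future.exception()` identifies failed tasks. The function `count_words_in_bytes` and the sample payloads are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, wait
from functools import partial


def count_words_in_bytes(data, encoding='utf-8'):
    """Hypothetical stand-in for count_words that works on in-memory bytes."""
    return len(data.decode(encoding).split())


# The second payload is Latin-1 text and fails strict UTF-8 decoding
payloads = [
    b'one two three',
    'caf\u00e9 au lait'.encode('latin-1'),
    b'four five',
]

with ThreadPoolExecutor(max_workers=2) as pool:
    # partial() fixes the keyword argument, so the pool only sees one input
    count_utf8 = partial(count_words_in_bytes, encoding='utf-8')
    futures = [pool.submit(count_utf8, p) for p in payloads]
    wait(futures)

    # Keep only the futures that completed without raising
    succeeded = [f for f in futures if f.exception() is None]
    total = sum(f.result() for f in succeeded)

print(len(succeeded), total)
```

Filtering on `exception()` plays the same role as the `status == 'finished'` check in In [20].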

### Example 3) Word count with hdfs3 + distributed + dask

In [24]:
import dask

Notes and missing API functionality?

* wanted to easily read .head() of large text file
* need a futures_to_bag
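
The recurring `.head()` wish in these notes could be covered by a small helper built on `itertools.islice`, which stops reading after the requested number of lines. This is a minimal sketch, not part of hdfs3: the `head` function is hypothetical, it assumes a binary file-like object that yields lines when iterated, and `io.BytesIO` stands in for an HDFS file handle with sample lines modeled on the output above.

```python
import io
from itertools import islice


def head(fileobj, n=10, encoding='utf-8'):
    """Return the first n lines of a binary file-like object as strings."""
    return [
        line.decode(encoding, errors='replace').rstrip('\n')
        for line in islice(fileobj, n)  # reads at most n lines
    ]


# Stand-in for an HDFS file handle
demo = io.BytesIO(
    b'Date: Tue, 26 Sep 2000\n'
    b'From: Phillip K Allen\n'
    b'To: pallen70@hotmail.com\n'
)

first_two = head(demo, n=2)
print(first_two)
```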

### Results

Time to count words in all files:

* hdfs3 - 120 seconds
* hdfs3 + distributed - about 42 seconds (wall time of `wait(futures)` in In [19])
* hdfs3 + distributed + dask - not measured