## Word count with hdfs3, distributed, and dask

In this example, we count the words in a set of text files stored in HDFS: the Enron email dataset (6.4 GB).

We use the libraries in order of increasing API level, from low-level to high-level:

* hdfs3
* hdfs3 + distributed
* hdfs3 + distributed + dask

Setup:

Copy data from S3 into HDFS:

```
$ hadoop distcp 
```

Install dependencies:

```
$ conda install hdfs3 -c blaze
$ pip install dask distributed
```

TODO: implement proper count by word instead of total word count

### Example 1) Word count with hdfs3

In [1]:
import hdfs3
from collections import defaultdict

In [2]:
hdfs = hdfs3.HDFileSystem('ip-172-31-56-96', port=8020)

Generate list of file paths in the subfolders of /tmp/enron

In [3]:
filenames = hdfs.glob('/tmp/enron/*/*')

In [4]:
filenames[:5]

[b'/tmp/enron/edrm-enron-v2_shapiro-r_xml.zip/merged.txt',
 b'/tmp/enron/edrm-enron-v2_skilling-j_xml.zip/merged.txt',
 b'/tmp/enron/edrm-enron-v2_mclaughlin-e_xml.zip/merged.txt',
 b'/tmp/enron/edrm-enron-v2_germany-c_xml.zip/merged.txt',
 b'/tmp/enron/edrm-enron-v2_cash-m_xml.zip/merged.txt']

Print the first bytes of the first file

In [17]:
hdfs.head(filenames[0])

b'Date: Sat, 7 Jul 2001 07:20:00 -0700 (PDT)\r\nFrom: Jo Ann Hill\r\nTo: Janine Migden\r\nCc: Richard Shapiro\r\nSubject: PRC Ratings - Non-Exempt (Migden)\r\nX-SDOC: 738886\r\nX-ZLID: zl-edrm-enron-v2-shapiro-r-9869.eml\r\n\r\nJanine -- \r\n\r\nEarlier this week, you received an email from the PEP team which included the \r\nPerformance Evaluations of your employees, both Exempt and Non-Exempt.  \r\nPlease remember that the final Performance Evaluations should be completed \r\nwith your employees, forms signed and returned to me at EB1665B no later than \r\nFriday, August 17th.  I will be preparing a report for your use noting the \r\nfinal ratings of your Exempt employees from the PRC meetings and distribute \r\nthat within the next week.\r\n\r\nFor your Non-Exempt employees, I have listed below all eligible employees for \r\nwhom you should complete a Performance Evaluation.  However, I need the \r\nrating for each employee so it may be input into the PEP system.  The \r\ndescriptio

In [9]:
def count_words(filename):
    word_counts = defaultdict(int)
    with hdfs.open(filename) as f:
        for line in f:
            for word in line.split():
                word_counts[word] += 1
    return word_counts

Count words in first file

In [10]:
%%time
word_counts_single = count_words(filenames[0])

CPU times: user 5.73 s, sys: 90.4 ms, total: 5.82 s
Wall time: 5.83 s


In [11]:
sorted(word_counts_single.items(), key=lambda k_v: k_v[1], reverse=True)[:10]

[(b'the', 361079),
 (b'to', 228363),
 (b'of', 206534),
 (b'and', 182790),
 (b'in', 126461),
 (b'a', 114219),
 (b'for', 82769),
 (b'is', 76657),
 (b'that', 74224),
 (b'by', 54229)]
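
The sort-and-slice in the cell above can also be expressed with `collections.Counter`, whose `most_common` method returns the highest counts directly. The following is a minimal sketch that can be tried without a cluster: `count_words_in_stream` is a hypothetical helper (not part of hdfs3) that accepts any binary file-like object, and `io.BytesIO` stands in for the handle returned by `hdfs.open`.

```python
import io
from collections import Counter


def count_words_in_stream(stream):
    """Count whitespace-separated tokens in a binary file-like object."""
    word_counts = Counter()
    for line in stream:
        # bytes.split() with no arguments splits on runs of ASCII whitespace
        word_counts.update(line.split())
    return word_counts


# Small in-memory sample standing in for an HDFS file (illustrative data only)
sample = io.BytesIO(b"the cat and the cat\r\nthe end\r\n")
sample_counts = count_words_in_stream(sample)

# most_common(n) returns the n highest counts as (word, count) pairs
top_two = sample_counts.most_common(2)
print(top_two)
```

Since `Counter` is a `dict` subclass, the result can be used anywhere the `defaultdict` returned by `count_words` is used.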

Count words in the first five files (the loop below is limited to filenames[:5])

In [12]:
%%time
wordcounts = {}
for filename in filenames[:5]:
    wordcounts[filename] = count_words(filename)

CPU times: user 22.3 s, sys: 456 ms, total: 22.7 s
Wall time: 22.8 s


In [13]:
def merge_dicts(dict_args):
    '''
    Given any number of dicts, shallow copy and merge into a new dict,
    precedence goes to key value pairs in latter dicts.
    '''
    result = {}
    for dictionary in dict_args:
        result.update(dictionary)
    return result

In [14]:
all_dicts = list(wordcounts.values())
merged_dicts = merge_dicts(all_dicts)

In [15]:
len(merged_dicts)

748877

In [16]:
sorted(merged_dicts.items(), key=lambda k_v: k_v[1], reverse=True)[:10]

[(b'0', 4734110),
 (b'-', 126643),
 (b'Phy', 96008),
 (b'TAGG/', 89524),
 (b'01-09-2001', 85333),
 (b'I', 82468),
 (b'JARNOLD', 81540),
 (b'TCO', 77891),
 (b'&', 75190),
 (b'to', 71976)]
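
Because `merge_dicts` relies on `dict.update`, a word that appears in several files keeps only the count from the last dictionary processed, so the values above are not sums across files. Below is a minimal sketch of a merge that adds counts instead, assuming per-file results shaped like the output of `count_words`. The name `merge_counts` is hypothetical and the sample dictionaries are illustrative.

```python
from collections import Counter


def merge_counts(dicts):
    """Sum word counts across any number of per-file dictionaries."""
    total = Counter()
    for d in dicts:
        # Counter.update adds counts, unlike dict.update, which overwrites them
        total.update(d)
    return total


# Two made-up per-file results that share the word b"the"
per_file = [{b"the": 3, b"to": 1}, {b"the": 2, b"of": 4}]
merged = merge_counts(per_file)
print(merged.most_common(3))
```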

Notes and missing API functionality?

* ~~wanted to list only filenames (and not other HDFS file info)~~
* ~~wanted to load all text files in subdirs (glob like /tmp/enron/*/*.txt)~~
* ~~wanted to set encoding in .open() method~~
* ~~wanted to easily read .head() of large text file~~
* ~~why are the encoding errors happening?~~
* why are some word counts zero?
* glob returns unordered list

### Example 2) Word count with hdfs3 + distributed

Start dscheduler and dworkers on nodes:

head node: `dscheduler`  
compute nodes: `dworker 172.31.56.96:8786`

In [None]:
from distributed import Executor, progress, wait
from distributed.hdfs import read_bytes

In [None]:
e = Executor('172.31.56.96:8786')

Count words in first file

In [None]:
future = e.submit(count_words, filenames[0])

In [None]:
future.result()

Count words in all (readable) files

In [None]:
futures = e.map(count_words, filenames)

In [None]:
futures[:5]

In [None]:
%%time
wait(futures);

In [None]:
futures[:5]

In [None]:
futures[0].result()

In [None]:
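# count_words returns one dict per file, so add up the per-file totals
sum(sum(counts.values()) for counts in e.gather(futures))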

Notes and missing API functionality?

* ~~how to view number of workers/cores?~~
* ~~wanted to easily read .head() of large text file~~
* ~~can I pass arguments to functions in e.map(func, input)?~~
* ~~can I get the results of futures without a list comprehension, similar to an RDD?~~
* ~~why are the encoding errors happening?~~
* ~~should blocking futures be default with an option to wait()?~~
* ~~when to use read_bytes, BytesIO, etc.~~
* ~~are these errors from dscheduler important? https://gist.github.com/koverholt/6c0f9c10b23152c3f0c4~~
* option to drop failed futures?
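
On the last open question, one way to drop failed futures is to check each future for an exception before collecting its result. The sketch below uses the standard library's `concurrent.futures.ThreadPoolExecutor` as a local stand-in for the `distributed` Executor, so it runs without a cluster. Whether the same calls behave identically on `distributed` futures is an assumption to verify. `count_tokens` and the sample inputs are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def count_tokens(data):
    """Count whitespace-separated tokens; raises on non-bytes input."""
    if not isinstance(data, bytes):
        raise TypeError("expected bytes")
    return len(data.split())


# None simulates a file that cannot be read (illustrative data only)
inputs = [b"one two three", None, b"four five"]

with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(count_tokens, item) for item in inputs]
    wait(futures)  # block until every future has finished or failed
    # Keep only the futures that completed without raising
    succeeded = [f for f in futures if f.exception() is None]
    total = sum(f.result() for f in succeeded)

print(len(succeeded), total)
```

Filtering before calling `result()` avoids re-raising the worker's exception in the calling process, at the cost of silently skipping the failed inputs.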

### Example 3) Word count with hdfs3 + distributed + dask

In [None]:
from distributed.hdfs import read_bytes
from distributed.collections import futures_to_collection, futures_to_dask_bag

In [None]:
%%time
# named data_blocks to avoid shadowing the built-in bytes type
data_blocks = read_bytes('/tmp/enron/*/*', hdfs=hdfs, delimiter=b'\r\n')

In [None]:
data_blocks[:5]

In [None]:
def bytes_to_lines(b):
    return b.decode().split('\n')

In [None]:
lists = e.map(bytes_to_lines, data_blocks)

In [None]:
%%time
db = futures_to_dask_bag(lists)

In [None]:
def count_words_in_bytes(data):
    words = data.split()
    count = len(words)
    return count

In [None]:
word_counts = db.map(count_words_in_bytes)

In [None]:
%%time
word_counts.sum().compute()

Notes and missing API functionality?

* does read_bytes read everything into memory rather than executing lazily?
* wanted to easily read .head() of large text file
* different word count than hdfs3 + distributed, perhaps due to line splitting
* solution was a bit more complex than hdfs3 + distributed
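
One possible contributor to the differing totals, offered as a hypothesis rather than something the runs above confirm: Example 1 splits raw `bytes`, while `bytes_to_lines` decodes to `str` before the words are split. `bytes.split()` treats only ASCII whitespace as a separator, whereas `str.split()` also splits on Unicode whitespace such as the non-breaking space. A small illustration with made-up data:

```python
# A non-breaking space (U+00A0) between the first two words, encoded as UTF-8
raw = "performance\u00a0evaluation due friday".encode("utf-8")

# bytes.split() does not treat the encoded non-breaking space as whitespace
tokens_from_bytes = raw.split()

# str.split() treats U+00A0 as whitespace, so it yields one more token
tokens_from_str = raw.decode("utf-8").split()

print(len(tokens_from_bytes), len(tokens_from_str))
```

Comparing the two approaches on the same file would show whether this accounts for any of the gap.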

In [None]:
future.exception()

### PySpark API

```
>>> rdd = sc.textFile('/tmp/enron/*/*.txt')
>>> counts = rdd.flatMap(lambda line: line.split()).count()
...
16/02/02 07:01:47 INFO DAGScheduler: ResultStage 0 (count at <stdin>:1) finished in 76.110 s
16/02/02 07:01:47 INFO YarnScheduler: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/02/02 07:01:47 INFO DAGScheduler: Job 0 finished: count at <stdin>:1, took 76.307768 s
>>> counts
913806131
```

### Results

Time to count words in all files:

* hdfs3 - 4 min 8 s
* hdfs3 + distributed - 1 min 30 s
* hdfs3 + distributed + dask - 2 min 2 s
* PySpark - 1 min 16 s