## Word count with hdfs3, distributed, and dask

This example counts the words in a set of text files stored in HDFS: the Enron email dataset, 6.4 GB in total.

In [2]:
import dask
import distributed
import hdfs3

### Example 1) Word count with hdfs3

In [3]:
hdfs = hdfs3.HDFileSystem('ip-172-31-56-96', port=8020)

Generate a list of the directory names in /tmp/enron, then build the path to the merged.txt file inside each one

In [4]:
dirnames = [x['name'] for x in hdfs.ls('/tmp/enron')]

In [18]:
dirnames[:5]

[b'/tmp/enron/edrm-enron-v2_allen-p_xml.zip',
 b'/tmp/enron/edrm-enron-v2_arnold-j_xml.zip',
 b'/tmp/enron/edrm-enron-v2_arora-h_xml.zip',
 b'/tmp/enron/edrm-enron-v2_badeer-r_xml.zip',
 b'/tmp/enron/edrm-enron-v2_bailey-s_xml.zip']

In [21]:
filenames = [x.decode('utf-8') + '/merged.txt' for x in dirnames]

In [23]:
filenames[:5]

['/tmp/enron/edrm-enron-v2_allen-p_xml.zip/merged.txt',
 '/tmp/enron/edrm-enron-v2_arnold-j_xml.zip/merged.txt',
 '/tmp/enron/edrm-enron-v2_arora-h_xml.zip/merged.txt',
 '/tmp/enron/edrm-enron-v2_badeer-r_xml.zip/merged.txt',
 '/tmp/enron/edrm-enron-v2_bailey-s_xml.zip/merged.txt']

Print the first 10 lines of the first file

In [88]:
with hdfs.open(filenames[0]) as f:
    f.encoding = 'utf-8'
    # A plain loop avoids building a throwaway list just for its side effects
    for _ in range(10):
        print(f.readline())

Date: Tue, 26 Sep 2000 09:26:00 -0700 (PDT)
From: Phillip K Allen
To: pallen70@hotmail.com
Subject: Investment Structure
X-SDOC: 948896
X-ZLID: zl-edrm-enron-v2-allen-p-1713.eml

---------------------- Forwarded by Phillip K Allen/HOU/ECT on 09/26/2000 
04:26 PM ---------------------------



In [55]:
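def count_words(filename, encoding):
    """Return the number of whitespace-separated words in an HDFS text file."""
    count = 0
    with hdfs.open(filename) as f:  # uses the `hdfs` connection created above
        f.encoding = encoding
        # readlines() loads the whole file into memory before counting
        for line in f.readlines():
            count += len(line.split())
    return count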
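
The counting logic itself does not depend on HDFS, so it can be sketched and checked against an in-memory file. The version below is a minimal sketch, not the notebook's method: it iterates over the file object line by line instead of calling `readlines()`, which keeps only one line in memory at a time. The name `count_words_in_stream` is hypothetical, and whether the object returned by `hdfs.open()` supports this kind of iteration is an assumption that would need checking.

```python
import io


def count_words_in_stream(f):
    """Count whitespace-separated words in a text file-like object, line by line."""
    count = 0
    for line in f:  # assumes f yields decoded text lines
        count += len(line.split())
    return count


# In-memory stand-in for a file; the text mimics the email headers shown above
sample = io.StringIO(
    "Date: Tue, 26 Sep 2000\n"
    "From: Phillip K Allen\n"
    "\n"
    "Subject: Investment Structure\n"
)
print(count_words_in_stream(sample))
```

Blank lines contribute zero words because `str.split()` with no arguments returns an empty list for whitespace-only input.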

Count the words in the first file

In [86]:
%%time
count_words('/tmp/enron/edrm-enron-v2_allen-p_xml.zip/merged.txt', 'utf-8')

CPU times: user 3.27 s, sys: 123 ms, total: 3.39 s
Wall time: 3.4 s


13099980

Count the words in every file, recording 'Encoding error' for any file that cannot be decoded as UTF-8

In [None]:
%%time
wordcounts = {}
for filename in filenames:
    try:
        wordcounts[filename] = count_words(filename, 'utf-8')
    except UnicodeDecodeError:
        wordcounts[filename] = 'Encoding error'

In [85]:
wordcounts.values()

dict_values(['Encoding error', 4004560, 5814482, 13099980, 25683415, 416404, 'Encoding error', 7247270, 290147, 4179444, 6141434, 5803416, 1077750, 'Encoding error', 'Encoding error', 'Encoding error', 4746877, 'Encoding error', 7490781, 602964, 3390924, 4414609, 13706115, 1965278, 'Encoding error', 2423343, 'Encoding error', 1297348, 'Encoding error', 551252, 3212038, 1901115, 646054, 'Encoding error', 'Encoding error', 'Encoding error', 'Encoding error', 'Encoding error', 'Encoding error', 0, 'Encoding error', 0, 0, 'Encoding error', 822377, 'Encoding error', 'Encoding error', 'Encoding error', 'Encoding error', 940670, 10179741, 1941178, 'Encoding error', 0, 732104, 6019540, 327602, 741113, 833143, 'Encoding error', 859418, 1214718, 380978, 'Encoding error', 349008, 'Encoding error', 1241517, 'Encoding error', 653546, 'Encoding error', 2309593, 'Encoding error', 'Encoding error', 1287530, 'Encoding error', 135598, 712714, 'Encoding error', 2010407, 1296974, 2176585, 5445575, 1720935

In [80]:
sum(x for x in wordcounts.values() if isinstance(x, int))

363695216
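
Because `wordcounts` mixes integer counts with the string 'Encoding error', it can help to separate the two before reporting. The following is a minimal sketch with a hypothetical helper name (`summarize`) and made-up sample values; it is not output from the Enron run.

```python
def summarize(wordcounts):
    """Split a {filename: count-or-error-string} dict into a total and the failed files."""
    total = sum(v for v in wordcounts.values() if isinstance(v, int))
    failed = [name for name, v in wordcounts.items() if not isinstance(v, int)]
    return total, failed


# Made-up sample values for illustration only
sample = {
    'a/merged.txt': 120,
    'b/merged.txt': 'Encoding error',
    'c/merged.txt': 0,
}
total, failed = summarize(sample)
print(total, failed)
```

Keeping the list of failed files makes it straightforward to retry them later with a different encoding.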

Missing API functionality?

* List only filenames, without the other HDFS file info that hdfs.ls() returns
* Load all text files in subdirectories with a glob such as /tmp/enron/*/*.txt
* Set the encoding directly in the .open() method
* Easily read the head of a large text file
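
Until such features exist, three of the items above can be approximated with small helpers written on the client side. This is a sketch under stated assumptions: the helper names (`names_only`, `glob_filter`, `head`) are hypothetical and are not part of hdfs3, the `'size'` key in the sample entries is invented for illustration, and the helpers are exercised here on plain Python objects rather than a live HDFS connection.

```python
import fnmatch
import io
from itertools import islice


def names_only(entries):
    """Return just the 'name' field of ls-style dict entries, decoded to str."""
    names = []
    for entry in entries:
        name = entry['name']
        names.append(name.decode('utf-8') if isinstance(name, bytes) else name)
    return names


def glob_filter(paths, pattern):
    """Keep paths matching a shell-style pattern.

    Note: fnmatch lets '*' match '/' as well, so this is looser than a real glob.
    """
    return [p for p in paths if fnmatch.fnmatchcase(p, pattern)]


def head(f, n=10):
    """Return the first n lines of an iterable file-like object without reading the rest."""
    return list(islice(f, n))


# Stand-in data; the 'size' key is an invented example of "other file info"
entries = [
    {'name': b'/tmp/enron/a.zip', 'size': 1},
    {'name': b'/tmp/enron/b.zip', 'size': 2},
]
paths = ['/tmp/enron/a.zip/merged.txt', '/tmp/enron/a.zip/notes.xml']

print(names_only(entries))
print(glob_filter(paths, '/tmp/enron/*/*.txt'))
print(head(io.StringIO("1\n2\n3\n"), 2))
```

Filtering paths on the client still requires listing every directory first, which is why a server-side glob would be preferable.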

### Example 2) Word count with hdfs3 + distributed

Missing API functionality?

* Is there a .head() in distributed or dask?
* Need a futures_to_bag function
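
The per-file loop from Example 1 is a natural candidate for a submit-then-collect futures pattern. The sketch below uses the standard library's `concurrent.futures` as a plainly labeled stand-in, not distributed itself, and works on hypothetical in-memory byte strings rather than HDFS files; how closely it maps onto distributed's interface is an assumption. The helper `safe_count` is a hypothetical name.

```python
from concurrent.futures import ThreadPoolExecutor


def safe_count(raw, encoding='utf-8'):
    """Decode bytes and count words; return 'Encoding error' on failure, as in Example 1."""
    try:
        return len(raw.decode(encoding).split())
    except UnicodeDecodeError:
        return 'Encoding error'


# Hypothetical in-memory stand-ins for file contents
blobs = {
    'a/merged.txt': b'one two three',
    'b/merged.txt': b'\xff\xfe bad bytes',  # not valid UTF-8
    'c/merged.txt': b'',
}

with ThreadPoolExecutor(max_workers=2) as pool:
    # Submit one task per file, then collect each result by filename
    futures = {name: pool.submit(safe_count, raw) for name, raw in blobs.items()}
    results = {name: fut.result() for name, fut in futures.items()}

print(results)
```

Catching the decode error inside the task keeps one unreadable file from failing the whole batch.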