# parallel demo

Filling a pipeline is one good way to speed up processing, but we can also get a speedup from data-parallel methods.  If we have hundreds of similar files to process with one script on a machine with several cores, for example, we can process several files at once and use more than one core at a time.
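
Before parallelizing anything, it helps to know how many cores the machine reports. A minimal sketch, assuming that either `nproc` (GNU coreutils) or `getconf _NPROCESSORS_ONLN` is available; the variable name `cores` is only for illustration:

```shell
# Ask the system how many cores are online.
# nproc ships with GNU coreutils; getconf is the fallback
# (assumption: at least one of the two exists on your system).
cores=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)
echo "cores available: $cores"
```

With only one core reported, the data-parallel approach below has little to offer.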

Let's look at a simple example.  Suppose we have five texts from Project Gutenberg and want to count the words in them.

First, let's look at the data:

In [1]:
ls *.txt

alice.txt  douglass.txt  frankenstein.txt  pride.txt  wuthering.txt


In [2]:
wc *.txt

   3735   29461  167518 alice.txt
   4104   43789  248369 douglass.txt
   7653   77986  448689 frankenstein.txt
  13426  124588  717575 pride.txt
  12486  118899  681641 wuthering.txt
  41404  394723 2263792 total


Okay, we've got five texts, with a total of nearly 400,000 words combined.  That's a good start.

Remember our previous pipeline for counting words?  It was something like this:

In [6]:
grep -oE '\w{2,}' douglass.txt | tr '[:upper:]' '[:lower:]' \
 | sort | uniq -c | sort -rn | head -25

   2435 the
   1669 of
   1574 to
   1445 and
    794 in
    742 was
    530 he
    457 my
    433 it
    426 with
    422 that
    388 his
    365 as
    359 for
    336 me
    295 this
    292 at
    285 be
    255 had
    253 by
    237 not
    208 or
    206 but
    201 him
    200 is
sort: write failed: standard output: Broken pipe
sort: write error
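
To see what each stage contributes, here is the same pipeline run on a tiny made-up text, one stage at a time. The file `sample.txt`, its contents, and the intermediate file names are invented for illustration and are not part of the Gutenberg data. The "Broken pipe" lines at the end of the output above are harmless: `head` exits after 25 lines while `sort` is still writing.

```shell
# Scratch directory so the demo cannot touch real files
workdir=$(mktemp -d) && cd "$workdir"

# Invented two-line sample text
printf 'The cat saw the dog.\nA dog saw the cat!\n' > sample.txt

# Stage 1: one word per line, keeping only runs of 2+ word characters
# (so the one-letter word "A" is dropped)
grep -oE '\w{2,}' sample.txt > stage1.txt

# Stage 2: fold to lower case so "The" and "the" count as the same word
tr '[:upper:]' '[:lower:]' < stage1.txt > stage2.txt

# Stage 3: sort so identical words are adjacent, count each run,
# then order by count, largest first
sort stage2.txt | uniq -c | sort -rn > counts.txt

# Most frequent word: prints a count of 3 next to "the"
head -n 1 counts.txt
```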


We can time the whole pipeline by putting ```time``` in front of it:

In [8]:
time grep -oE '\w{2,}' douglass.txt | tr '[:upper:]' '[:lower:]' \
 | sort | uniq -c | sort -rn | head -25

   2435 the
   1669 of
   1574 to
   1445 and
    794 in
    742 was
    530 he
    457 my
    433 it
    426 with
    422 that
    388 his
    365 as
    359 for
    336 me
    295 this
    292 at
    285 be
    255 had
    253 by
    237 not
    208 or
    206 but
    201 him
    200 is
sort: write failed: standard output: Broken pipe
sort: write error

real	0m0.137s
user	0m0.115s
sys	0m0.023s


What would happen if we just used a filename wildcard to process all five files at once?

In [10]:
time grep -oE '\w{2,}' *.txt | tr '[:upper:]' '[:lower:]' \
 | sort | uniq -c | sort -rn | head -25

   4816 wuthering.txt:and
   4750 wuthering.txt:the
   4507 pride.txt:the
   4371 frankenstein.txt:the
   4242 pride.txt:to
   3729 pride.txt:of
   3658 pride.txt:and
   3616 wuthering.txt:to
   3046 frankenstein.txt:and
   2760 frankenstein.txt:of
   2435 douglass.txt:the
   2340 wuthering.txt:of
   2203 pride.txt:her
   2174 frankenstein.txt:to
   2124 wuthering.txt:he
   1944 wuthering.txt:you
   1937 pride.txt:in
   1844 pride.txt:was
   1818 alice.txt:the
   1776 frankenstein.txt:my
   1695 pride.txt:she
   1669 douglass.txt:of
   1574 douglass.txt:to
   1556 pride.txt:that
   1550 pride.txt:it
sort: write failed: standard output: Broken pipe
sort: write error

real	0m3.313s
user	0m3.169s
sys	0m0.378s


## What just happened?
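
Two things changed when the wildcard went in. With more than one input file, `grep` prefixes every match with the name of the file it came from, so `uniq -c` treated `pride.txt:the` and `alice.txt:the` as different "words" and we got per-file counts rather than combined ones. The run also took far longer: about 3.3 seconds, versus 0.14 seconds for one file. A small sketch of the prefixing on two invented files (`a-demo.txt`, `b-demo.txt` and the `.out` names are made up for illustration); the `-h` flag tells `grep` to leave the filename off:

```shell
workdir=$(mktemp -d) && cd "$workdir"
printf 'the cat\n' > a-demo.txt
printf 'the dog\n' > b-demo.txt

# Several input files: each match is prefixed with "filename:"
grep -oE '\w{2,}' a-demo.txt b-demo.txt > prefixed.out
cat prefixed.out
# a-demo.txt:the
# a-demo.txt:cat
# b-demo.txt:the
# b-demo.txt:dog

# -h suppresses the prefix, so "the" from both files is counted together
grep -ohE '\w{2,}' a-demo.txt b-demo.txt | sort | uniq -c | sort -rn > merged.out
head -n 1 merged.out
```

This fixes the counts, but the whole job still runs through one pipeline.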

## A different approach - data parallel

We can use GNU `parallel` to spread the pipeline across multiple cores, as many as we have available (`-j+0` runs one job per core).  We'll need to break the task up first: each job just writes a raw word list for one file, and the lists are then easy to combine.

In [11]:
time ls *.txt | parallel -j+0 "grep -oE '\w{2,}' {} | tr '[:upper:]' '[:lower:]' > {}-words.txt"


real	0m0.442s
user	0m0.297s
sys	0m0.222s
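
Here is the same fan-out on scratch files, as a sketch rather than a drop-in replacement. In the `parallel` command above, `{}` stands for one input line (here a filename). The sample files, the `words/` output directory and the `extract_words.sh` helper are all invented for this example; writing the lists into a separate directory keeps them from matching `*.txt` on a later run. If GNU parallel is not installed, the sketch falls back to `xargs -P`, a different tool that can also run several jobs at once.

```shell
workdir=$(mktemp -d) && cd "$workdir"
mkdir words
printf 'The cat saw the dog.\n' > one.txt
printf 'A dog saw the cat!\n'  > two.txt

# Hypothetical helper script: builds the word list for ONE file in words/
cat > extract_words.sh <<'EOF'
#!/bin/sh
grep -oE '\w{2,}' "$1" | tr '[:upper:]' '[:lower:]' > "words/$1.words"
EOF
chmod +x extract_words.sh

if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
    # GNU parallel: one job per core, {} is replaced by each filename
    ls *.txt | parallel -j+0 ./extract_words.sh {}
else
    # Fallback (assumes an xargs that supports -P): up to 4 jobs at once
    ls *.txt | xargs -P 4 -n 1 ./extract_words.sh
fi

# One word list per input file
wc -l words/*.words
```

Each job reads one file and writes one file, so the jobs never need to coordinate with each other.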


In [13]:
wc *-words.txt

  28625   28625  149885 alice.txt-words.txt
  42277   42277  231365 douglass.txt-words.txt
  73902   73902  417234 frankenstein.txt-words.txt
 121166  121166  669063 pride.txt-words.txt
 114101  114101  619516 wuthering.txt-words.txt
 380071  380071 2087063 total


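Looks about right.  The totals are a little lower than the `wc` word counts above, mainly because whitespace and punctuation are gone and `\w{2,}` skips one-letter words.  Did you see how fast that went?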

Now we can combine them and apply the rest of our pipeline:

In [17]:
time cat *-words.txt >> combined.txt


real	0m0.004s
user	0m0.002s
sys	0m0.003s
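
One caution about the `>>` above: it appends, so running the cell a second time would add every word list to `combined.txt` again and double the counts. A small sketch of the difference on invented word lists (all file names here are made up for the demo); `>` truncates the target first, so rerunning is safe:

```shell
workdir=$(mktemp -d) && cd "$workdir"
printf 'the\ncat\n' > a-words.txt
printf 'the\ndog\n' > b-words.txt

# Appending twice: the second run adds the same four lines again
cat *-words.txt >> appended-demo.out
cat *-words.txt >> appended-demo.out

# Overwriting twice: the file is rebuilt from scratch each time
cat *-words.txt > combined-demo.out
cat *-words.txt > combined-demo.out

# appended-demo.out has 8 lines, combined-demo.out has 4
wc -l appended-demo.out combined-demo.out
```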


In [16]:
time < combined.txt sort | uniq -c | sort -rn | head -25

  17881 the
  13905 and
  12415 to
  11129 of
   5889 in
   5090 was
   4720 he
   4611 you
   4605 that
   4561 it
   4518 her
   4090 my
   3926 she
   3693 his
   3332 not
   3320 with
   3316 as
   2982 for
   2981 had
   2836 be
   2772 but
   2761 me
   2452 at
   2422 on
   2230 is
sort: write failed: standard output: Broken pipe
sort: write error

real	0m0.506s
user	0m0.810s
sys	0m0.070s


Faster, right?  If a gain of only a couple of seconds doesn't seem like much, imagine we had hundreds of texts.  Or thousands...  (See [how to get ebook files](https://www.gutenberg.org/wiki/Gutenberg:Information_About_Robot_Access_to_our_Pages) for details on downloading them.)

In [18]:
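# Work in a separate directory so the mirror doesn't mix with our five texts
mkdir pg-text
cd pg-text
# -w 2 waits two seconds between requests, -m mirrors recursively,
# -H lets wget follow links to other hosts
wget -w 2 -m -H "http://www.gutenberg.org/robot/harvest?filetypes[]=txt&langs[]=en"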




(380 text files later...)

In [27]:
ls *.txt

10001.txt    10062.txt    10119-8.txt  10234-8.txt
10002-8.txt  10063.txt    10119.txt    10234.txt
10002.txt    10064-8.txt  10120-8.txt  12370-8.txt
10003.txt    10064.txt    10120.txt    12370.txt
10004-8.txt  10065-8.txt  10121-8.txt  12372-8.txt
10004.txt    10065.txt    10121.txt    12372.txt
10005-8.txt  10066-8.txt  10122-8.txt  12373-8.txt
10005.txt    10066.txt    10122.txt    12373.txt
10006-8.txt  10067-8.txt  10123.txt    12374-8.txt
10006.txt    10067.txt    10124-8.txt  12374.txt
10007-8.txt  10068-8.txt  10124.txt    12375-8.txt
10007.txt    10068.txt    10125-8.txt  12375.txt
10008-8.txt  10069-8.txt  10125.txt    12376-8.txt
10008.txt    10069.txt    10126.txt    12376.txt
10009.txt    10070.txt    10127-8.txt  12377.txt
10010.txt    10071-8.txt  10127.txt    12378-8.txt
10011-8.txt  10071.txt    10128-8.txt  12378.txt
10011.txt    10072.txt    10128.txt    12380-8.txt
10012-8.txt  10073-8.txt  10129-8.txt  12380.txt
10012.txt    10073.txt    10129.

In [28]:
wc *.txt

      958      8807     52510 10001.txt
     5690     54201    306901 10002-8.txt
     5690     54201    306892 10002.txt
     6327     64594    380817 10003.txt
     5361     51299    302753 10004-8.txt
     5361     51300    302750 10004.txt
     7313     73655    434769 10005-8.txt
     7313     73656    434760 10005.txt
     1582     16512     95836 10006-8.txt
     1582     16512     95831 10006.txt
     3695     31295    180138 10007-8.txt
     3695     31295    180129 10007.txt
     9154     69542    407280 10008-8.txt
     9154     69542    407271 10008.txt
     8502     90180    504214 10009.txt
     1451     13938     86336 10010.txt
     3881     26555    155576 10011-8.txt
     3881     26555    155567 10011.txt
     9175     93985    561045 10012-8.txt
     9175     94007    561124 10012.txt
     2535     18202    136659 10013-8.txt
     2535     18202    136650 10013.txt
     2858     19257    142501 10014-8.txt
     2858     19257    142492 10014.t

Okay, so now we have 23.7 million words.  Let's count 'em!

In [29]:
# File by file - should be slow
time grep -oE '\w{2,}' *.txt | tr '[:upper:]' '[:lower:]' \
 | sort | uniq -c | sort -rn | head -25

  37463 10136.txt:the
  37463 10136-8.txt:the
  22092 10136.txt:of
  22092 10136-8.txt:of
  19746 10136.txt:and
  19746 10136-8.txt:and
  19113 10147.txt:the
  19113 10147-8.txt:the
  14295 10128.txt:the
  14295 10128-8.txt:the
  14264 10151.txt:the
  14264 10151-8.txt:the
  14051 1jcfs10.txt:the
  13631 10103.txt:the
  13631 10103-8.txt:the
  13411 10114.txt:the
  13411 10114-8.txt:the
  12642 10136.txt:to
  12642 10136-8.txt:to
  11808 10038.txt:the
  11808 10038-8.txt:the
  11126 10165.txt:the
  11126 10165-8.txt:the
  10856 10062.txt:the
  10856 10062-8.txt:the
sort: write failed: standard output: Broken pipe
sort: write error

real	3m52.574s
user	3m45.721s
sys	0m22.869s
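
Notice that the rows above come in identical pairs: `10136.txt` and `10136-8.txt` give the same counts. That suggests the `-8` files are alternate encodings of the same books (an assumption based on the listing, not something verified here), in which case each such book is counted twice. One hedged way to build a file list that skips a `-8` file whenever its plain counterpart exists, shown on empty stand-in files (`unique-files.list` is an invented name):

```shell
workdir=$(mktemp -d) && cd "$workdir"

# Empty stand-ins that mimic the naming pattern in the listing
touch 10002.txt 10002-8.txt 10003.txt 10005-8.txt

for f in *.txt; do
    case "$f" in
        *-8.txt)
            # Name of the plain counterpart, e.g. 10002-8.txt -> 10002.txt
            plain="${f%-8.txt}.txt"
            # Skip the -8 copy only when the plain file is present
            if [ -e "$plain" ]; then
                continue
            fi
            ;;
    esac
    printf '%s\n' "$f"
done > unique-files.list

cat unique-files.list
# 10002.txt
# 10003.txt
# 10005-8.txt
```

A list like this could then be piped into the pipeline in place of `ls *.txt`.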


In [30]:
# In parallel - should be about twice as fast
time ls *.txt | parallel -j+0 "grep -oE '\w{2,}' {} | tr '[:upper:]' '[:lower:]' > {}-words.txt"


real	0m14.181s
user	0m14.433s
sys	0m7.436s


In [31]:
time cat *-words.txt >> combined.txt


real	0m0.363s
user	0m0.001s
sys	0m0.303s


In [32]:
time < combined.txt sort | uniq -c | sort -rn | head -25

1554064 the
 824506 of
 799253 and
 629027 to
 429747 in
 266622 that
 238989 it
 237853 was
 237255 he
 200894 with
 195603 his
 188414 is
 186255 for
 175814 as
 164763 you
 140903 had
 137538 on
 135345 but
 135036 not
 131652 be
 131261 at
 126337 by
 118609 this
 113214 her
 108945 or
sort: write failed: standard output: Broken pipe
sort: write error

real	0m40.816s
user	1m7.956s
sys	0m2.530s
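
Adding up the three steps of the parallel version gives the number to compare against the file-by-file run. The arithmetic below only uses the `real` times printed above. Treat the comparison as rough: the file-by-file pipeline was also sorting filename-prefixed lines, so the two versions did not do identical work. In both parallel runs the final single `sort` step takes the largest share of the remaining time.

```shell
# Speedup = file-by-file time / (parallel extract + cat + sort),
# using the "real" times reported in the cells above
report=$(awk 'BEGIN {
    small_serial   = 3.313
    small_parallel = 0.442 + 0.004 + 0.506
    large_serial   = 3 * 60 + 52.574
    large_parallel = 14.181 + 0.363 + 40.816
    printf "5 files:   %.2f s vs %.2f s, speedup %.1fx\n", small_serial, small_parallel, small_serial / small_parallel
    printf "380 files: %.1f s vs %.1f s, speedup %.1fx\n", large_serial, large_parallel, large_serial / large_parallel
}')
echo "$report"
# 5 files:   3.31 s vs 0.95 s, speedup 3.5x
# 380 files: 232.6 s vs 55.4 s, speedup 4.2x
```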
