#### CLI: Ch. 4 Creating Reusable Command-Line Tools

Suppose we crafted the following one-liner:

In [4]:
! curl -s http://www.gutenberg.org/cache/epub/76/pg76.txt | gunzip -c | tr '[:upper:]' '[:lower:]' | grep -oE '\w+' | sort | uniq -c | sort -nr | head -n 10

   5577 and
   4187 the
   2592 a
   2510 to
   2416 i
   1969 it
   1560 t
   1531 was
   1518 of
   1334 he
sort: write failed: 'standard output': Broken pipe
sort: write error


This is something we might want to use every now and then. We can do several things with it. The first is to turn it into a bash file that we can run from the command line. That is easy enough: we just have to line up the steps we worked out in a text file:

    curl -s http://www.gutenberg.org/cache/epub/76/pg76.txt | # download the zipped ebook
    gunzip -c |                                               # unzip the zipped file
    tr '[:upper:]' '[:lower:]' |                              # convert the entire text to lowercase
    grep -oE '\w+' |                                          # extract all words, one per line
    sort |                                                    # sort them in alphabetical order
    uniq -c |                                                 # remove duplicates, count frequency of words
    sort -nr |                                                # sort by count in reverse order
    head -n 10                                                # show first 10 results

Then we put the shebang on top: #!/usr/bin/env bash (in lowercase, because command names are case-sensitive) and changed the file permissions to make the file executable with: chmod u+x [file]
    

Now, if we use this script more often, some things will prove to be a bit cumbersome:

    - if we want to process a different text, we have to open the script and put in the new URL
    - we might want to show a different number of results
    
Both need only small changes to the script: we use variables whose values we pass in on the command line when calling the pipeline. For example, for the number of results:

    #!/usr/bin/env bash
    NUM_WORDS="$1"
    ...
    uniq -c | sort -nr | head -n "$NUM_WORDS"
    
Larger changes will become necessary when you do a lot of text analysis. When you run your script on another book, you will probably find that "and", "the", "a", and "to" are the most used words in that text too. We can of course filter out these so-called stopwords; often a file aptly called "stopwords" is used for that purpose.
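A minimal sketch of such a stopword filter, written in Python rather than in the shell. Everything named here is an assumption made for illustration: the function name `top_words` is hypothetical, and the small `STOPWORDS` set simply reuses the ten words from the output above. In practice the words would more likely be loaded from a stopwords file.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stopword set; a real list would be
# read from a "stopwords" file, one word per line.
STOPWORDS = {"and", "the", "a", "to", "i", "it", "t", "was", "of", "he"}


def top_words(text, num_words, stopwords=STOPWORDS):
    """Return the num_words most frequent words that are not stopwords."""
    # Lowercase first, then extract runs of word characters
    words = re.findall(r"\w+", text.lower())
    counts = Counter(word for word in words if word not in stopwords)
    return counts.most_common(num_words)


if __name__ == "__main__":
    sample = "The raft and the river: Huck took the raft to the river."
    for word, count in top_words(sample, 2):
        print("%7d %s" % (count, word))
```

Passing the stopwords in as a parameter keeps the counting logic unchanged when the list grows or is replaced by one read from a file.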

Then you will probably formulate another question: "Which frequently used words are most typical for this text?" Or any other question you might have. NB: the question itself is not what matters here; what matters is that your programs will keep evolving over time:

    - you will refactor stuff, abstract things out, move code to separate modules, etc.
    - you will define new questions you want to answer, making changes in the code necessary
    - you will have other ideas that make you pull in other kinds of data that need to be processed in other ways.


You can do a lot with the command line, especially in the context of (data) analysis: What is there? What does it look like? What is missing here? And so on. But at a certain point there is a trade-off between ease of use and speed on the one hand, and ease of re-use and thoroughness on the other. At that point, porting your solutions to another context, such as a programming language like Python or R, might be a good road to follow. This is especially so because, as we will see later on, these languages come with libraries ("batteries included") that have already gone through the process sketched above (refactoring, abstracting out, etc.), which has prepared them to work for others too.

So here is our shell script in Python:

In [None]:
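import re
import sys
from collections import Counter

# Number of words to show, passed as the first command-line argument
num_words = int(sys.argv[1])
# Read all of standard input and convert it to lowercase
text = sys.stdin.read().lower()
# Extract runs of word characters, like grep -oE '\w+' in the shell version
words = re.findall(r'\w+', text)
cnt = Counter(words)
# Print the most frequent words, highest count first
for word, count in cnt.most_common(num_words):
    print("%7d %s" % (count, word))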
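The two annoyances listed earlier (a hard-coded input and a fixed number of results) could both be handled with command-line options. The following is only a sketch under stated assumptions: the option names `-n`/`--num-words`, the optional `infile` argument, and the helper names `count_words`, `build_parser` and `main` are hypothetical and not part of the original script.

```python
import argparse
import re
import sys
from collections import Counter


def count_words(text, num_words):
    """Return the num_words most common lowercase words in text."""
    words = re.findall(r"\w+", text.lower())
    return Counter(words).most_common(num_words)


def build_parser():
    """Build the command-line parser (option names are illustrative)."""
    parser = argparse.ArgumentParser(description="Print the most frequent words.")
    parser.add_argument("-n", "--num-words", type=int, default=10,
                        help="number of words to show (default: 10)")
    parser.add_argument("infile", nargs="?", default=None,
                        help="text file to read (default: standard input)")
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    if args.infile is None:
        # No file given: behave like the original and read standard input
        text = sys.stdin.read()
    else:
        with open(args.infile, encoding="utf-8") as handle:
            text = handle.read()
    for word, count in count_words(text, args.num_words):
        print("%7d %s" % (count, word))

# To use this as a script, call main() at the bottom of the file.
```

Keeping the counting in `count_words` separate from the argument handling means the same function can be reused from a notebook cell, where there is no command line to parse.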

And if you are really into text analysis, Python offers libraries such as NLTK and PyTextRank. If your interest lies with data analysis, Python offers pandas. We will dive into these topics later on.