<a href="https://colab.research.google.com/github/georgepar/lt-asrtts/blob/main/Bash_For_Text_Processing_Introduction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Shell scripting tutorial

Shell scripting gives you quick and easy ways to handle much of the routine work in a machine learning or text processing project.

Some of the things you can achieve with shell scripting:

- Install project dependencies
- Build project dependencies from source
- Download, organize and clean data
- Extract data statistics (e.g. word / character counts)
- Perform text processing
- Train and evaluate models using existing frameworks that provide a command line interface (Kaldi, OpenFst, fastText, fairseq, etc.)



First of all, print the current working directory and list its contents

In [1]:
%%bash

pwd

/content


In [2]:
%%bash

ls -lah

total 16K
drwxr-xr-x 1 root root 4.0K Apr  4 13:24 .
drwxr-xr-x 1 root root 4.0K Apr  8 12:08 ..
drwxr-xr-x 4 root root 4.0K Apr  4 13:24 .config
drwxr-xr-x 1 root root 4.0K Apr  4 13:24 sample_data


## Installing project dependencies and building a project from source

Let's say we need to build an n-gram language model for some corpus. One commonly used tool for this is KenLM. Let's download it and build it from source.

Clone KenLM from its Git repository

In [3]:
!git clone https://github.com/kpu/kenlm

Cloning into 'kenlm'...
remote: Enumerating objects: 14165, done.
remote: Counting objects: 100% (478/478), done.
remote: Compressing objects: 100% (332/332), done.
remote: Total 14165 (delta 163), reused 409 (delta 132), pack-reused 13687
Receiving objects: 100% (14165/14165), 5.91 MiB | 13.95 MiB/s, done.
Resolving deltas: 100% (8043/8043), done.


Install necessary dependencies for building KenLM (follow docs: https://kheafield.com/code/kenlm/dependencies/)

In [5]:
!sudo apt-get update
!sudo apt-get install -y build-essential libboost-all-dev cmake zlib1g-dev libbz2-dev liblzma-dev libeigen3-dev

Hit:1 http://archive.ubuntu.com/ubuntu jammy InRelease
Hit:2 http://archive.ubuntu.com/ubuntu jammy-updates InRelease
Hit:3 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease
Hit:4 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Hit:5 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Hit:6 http://security.ubuntu.com/ubuntu jammy-security InRe
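
Before building, it can help to confirm that the tools the build relies on are actually on the PATH. The following is a minimal sketch, not part of the KenLM instructions: `require_cmds` is a hypothetical helper name, and the list of commands passed to it is an assumption based on the steps used in this section.

```shell
# require_cmds is a hypothetical helper name used only in this sketch.
# It reports every missing command on stderr and returns non-zero if any is missing.
require_cmds() {
  missing=0
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "missing: $cmd" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Tools used by the clone and build steps in this section (assumed list).
require_cmds git cmake make g++ || echo "install the missing tools before building"
```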

List files in current directory

In [6]:
%%bash

ls -lah

total 20K
drwxr-xr-x 1 root root 4.0K Apr  8 12:09 .
drwxr-xr-x 1 root root 4.0K Apr  8 12:08 ..
drwxr-xr-x 4 root root 4.0K Apr  4 13:24 .config
drwxr-xr-x 8 root root 4.0K Apr  8 12:09 kenlm
drwxr-xr-x 1 root root 4.0K Apr  4 13:24 sample_data


In [7]:
%%bash

ls -lah /content/kenlm

total 220K
drwxr-xr-x 8 root root 4.0K Apr  8 12:09 .
drwxr-xr-x 1 root root 4.0K Apr  8 12:09 ..
-rw-r--r-- 1 root root  696 Apr  8 12:09 BUILDING
-rwxr-xr-x 1 root root   81 Apr  8 12:09 clean_query_only.sh
drwxr-xr-x 3 root root 4.0K Apr  8 12:09 cmake
-rw-r--r-- 1 root root 4.7K Apr  8 12:09 CMakeLists.txt
-rwxr-xr-x 1 root root 1.2K Apr  8 12:09 compile_query_only.sh
-rw-r--r-- 1 root root  26K Apr  8 12:09 COPYING
-rw-r--r-- 1 root root  35K Apr  8 12:09 COPYING.3
-rw-r--r-- 1 root root 7.5K Apr  8 12:09 COPYING.LESSER.3
-rw-r--r-- 1 root root  63K Apr  8 12:09 Doxyfile
drwxr-xr-x 8 root root 4.0K Apr  8 12:09 .git
drwxr-xr-x 3 root root 4.0K Apr  8 12:09 .github
-rw-r--r-- 1 root root  261 Apr  8 12:09 .gitignore
-rw-r--r-- 1 root root 1.2K Apr  8 12:09 LICENSE
drwxr-xr-x 7 root root 4.0K Apr  8 12:09 lm
-rw-r--r-- 1 root root  220 Apr  8 12:09 MANIFEST.in
-rw-r--r-- 1 root root   59 Apr  8 12:09 pyproject.toml
drwxr-xr-x 2 root root 4.0K Apr  8 12:09 python
-rw-r--r-- 1 root ro

Create a build directory inside kenlm

In [8]:
%%bash

mkdir -p /content/kenlm/build

In [9]:
%%bash

ls -lah /content/kenlm

total 224K
drwxr-xr-x 9 root root 4.0K Apr  8 12:10 .
drwxr-xr-x 1 root root 4.0K Apr  8 12:09 ..
drwxr-xr-x 2 root root 4.0K Apr  8 12:10 build
-rw-r--r-- 1 root root  696 Apr  8 12:09 BUILDING
-rwxr-xr-x 1 root root   81 Apr  8 12:09 clean_query_only.sh
drwxr-xr-x 3 root root 4.0K Apr  8 12:09 cmake
-rw-r--r-- 1 root root 4.7K Apr  8 12:09 CMakeLists.txt
-rwxr-xr-x 1 root root 1.2K Apr  8 12:09 compile_query_only.sh
-rw-r--r-- 1 root root  26K Apr  8 12:09 COPYING
-rw-r--r-- 1 root root  35K Apr  8 12:09 COPYING.3
-rw-r--r-- 1 root root 7.5K Apr  8 12:09 COPYING.LESSER.3
-rw-r--r-- 1 root root  63K Apr  8 12:09 Doxyfile
drwxr-xr-x 8 root root 4.0K Apr  8 12:09 .git
drwxr-xr-x 3 root root 4.0K Apr  8 12:09 .github
-rw-r--r-- 1 root root  261 Apr  8 12:09 .gitignore
-rw-r--r-- 1 root root 1.2K Apr  8 12:09 LICENSE
drwxr-xr-x 7 root root 4.0K Apr  8 12:09 lm
-rw-r--r-- 1 root root  220 Apr  8 12:09 MANIFEST.in
-rw-r--r-- 1 root root   59 Apr  8 12:09 pyproject.toml
drwxr-xr-x 2 root roo

Configure the project with cmake and compile it with make, running both from the build directory (follow the instructions: https://github.com/kpu/kenlm)

In [10]:
!cd /content/kenlm/build && cmake .. && make -j4

  Compatibility with CMake < 3.5 will be removed from a future version of
  CMake.

  Update the VERSION argument <min> value or use a ...<max> suffix to tell
  CMake that the project does not need compatibility with older versions.

-- The C compiler identification is GNU 11.4.0
-- The CXX compiler identification is GNU 11.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Boost: /usr/lib/x86_64-linux-gnu/cmake/Boost-1.74.0/BoostConfig.cmake (found suitable version "1.74.0", minimum required is "1.41.0") found components: program_options system thread unit_test_framework 
-- Found Threads: TRUE  
-- Found ZLIB: /usr

In [11]:
%%bash

ls -lah /content/kenlm/build/bin

total 8.2M
drwxr-xr-x 2 root root 4.0K Apr  8 12:13 .
drwxr-xr-x 7 root root 4.0K Apr  8 12:10 ..
-rwxr-xr-x 1 root root 728K Apr  8 12:11 build_binary
-rwxr-xr-x 1 root root 612K Apr  8 12:13 count_ngrams
-rwxr-xr-x 1 root root 749K Apr  8 12:12 filter
-rwxr-xr-x 1 root root 695K Apr  8 12:11 fragment
-rwxr-xr-x 1 root root 1.2M Apr  8 12:13 interpolate
-rwxr-xr-x 1 root root 1.2M Apr  8 12:12 kenlm_benchmark
-rwxr-xr-x 1 root root 1.5M Apr  8 12:13 lmplz
-rwxr-xr-x 1 root root 199K Apr  8 12:12 phrase_table_vocab
-rwxr-xr-x 1 root root 306K Apr  8 12:11 probing_hash_table_benchmark
-rwxr-xr-x 1 root root 747K Apr  8 12:12 query
-rwxr-xr-x 1 root root 505K Apr  8 12:13 streaming_example


The previous commands built the KenLM binaries inside the build/bin folder. Let's install them system-wide with make install so that they can be called from any directory.

In [12]:
!cd /content/kenlm/build && sudo make install

[ 32%] Built target kenlm_util
[ 34%] Built target probing_hash_table_benchmark
[ 55%] Built target kenlm
[ 57%] Built target query
[ 59%] Built target fragment
[ 61%] Built target build_binary
[ 63%] Built target kenlm_benchmark
[ 70%] Built target kenlm_builder
[ 72%] Built target lmplz
[ 75%] Built target count_ngrams
[ 79%] Built target kenlm_filter
[ 81%] Built target filter
[ 83%] Built target phrase_table_vocab
[ 95%] Built target kenlm_interpolate
[ 97%] Built target interpolate
[100%] Built target streaming_example
Install the project...
-- Install configuration: "Release"
-- Installing: /usr/local/share/kenlm/cmake/kenlmTargets.cmake
-- Installing: /usr/local/share/kenlm/cmake/kenlmTargets-release.cmake
-- Installing: /usr/local/include/kenlm/util/bit_packing.hh
-- Installing: /usr/local/include/kenlm/util/ersatz_progress.hh
-- Installing: /usr/local/include/kenlm/util/exception.hh
-- Installing: /usr/local/include/kenlm/util/fake_ostream.hh
-- Installing: /usr/local

In [13]:
!lmplz --help

Builds unpruned language models with modified Kneser-Ney smoothing.

Please cite:
@inproceedings{Heafield-estimate,
  author = {Kenneth Heafield and Ivan Pouzyrevsky and Jonathan H. Clark and Philipp Koehn},
  title = {Scalable Modified {Kneser-Ney} Language Model Estimation},
  year = {2013},
  month = {8},
  booktitle = {Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics},
  address = {Sofia, Bulgaria},
  url = {http://kheafield.com/professional/edinburgh/estimate\_paper.pdf},
}

Provide the corpus on stdin.  The ARPA file will be written to stdout.  Order of
the model (-o) is the only mandatory option.  As this is an on-disk program,
setting the temporary file location (-T) and sorting memory (-S) is recommended.

Memory sizes are specified like GNU sort: a number followed by a unit character.
Valid units are % for percentage of memory (supported platforms only) and (in
increasing powers of 1024): b, K, M, G, T, P, E, Z, Y.  Default is K (*1024).

## Downloading and preprocessing the training corpus

Let's get a book from Project Gutenberg and clean it up using bash

In [14]:
%%bash

mkdir -p data
wget -O data/dracula.txt http://www.gutenberg.org/cache/epub/345/pg345.txt

--2024-04-08 12:29:09--  http://www.gutenberg.org/cache/epub/345/pg345.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://www.gutenberg.org/cache/epub/345/pg345.txt [following]
--2024-04-08 12:29:09--  https://www.gutenberg.org/cache/epub/345/pg345.txt
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 890394 (870K) [text/plain]
Saving to: ‘data/dracula.txt’

     0K .......... .......... .......... .......... ..........  5% 3.03M 0s
    50K .......... .......... .......... .......... .......... 11% 3.17M 0s
   100K .......... .......... .......... .......... .......... 17% 98.8M 0s
   150K .......... .......... .......... .......... .......... 23% 3.31M 0s
   200K .......... .......... .......... .

In [15]:
%%bash

ls -lah data

total 880K
drwxr-xr-x 2 root root 4.0K Apr  8 12:29 .
drwxr-xr-x 1 root root 4.0K Apr  8 12:29 ..
-rw-r--r-- 1 root root 870K Apr  1 09:06 dracula.txt


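Count the number of lines, words and characters using wc. Note that wc -c counts bytes, which equals the character count only for single-byte text; wc -m counts characters.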

In [16]:
%%bash

wc -l data/dracula.txt

15851 data/dracula.txt


We can reorder and label the columns of the wc output using awk

In [17]:
%%bash

wc -l data/dracula.txt | awk '{printf "%s contains %s lines\n", $2, $1}'

data/dracula.txt contains 15851 lines


In [18]:
%%bash

wc -w data/dracula.txt | awk '{printf "%s contains %s words\n", $2, $1}'

data/dracula.txt contains 164351 words


In [19]:
%%bash

wc -c data/dracula.txt | awk '{printf "%s contains %s characters\n", $2, $1}'

data/dracula.txt contains 890394 characters
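
The figure above matches the download size in bytes, because `wc -c` counts bytes. The sketch below shows the difference on a single word containing a typographic apostrophe. `count_bytes` and `count_chars` are hypothetical helper names, and the result of `wc -m` depends on the current locale, so no output is claimed for it.

```shell
# count_bytes and count_chars are hypothetical helper names for this sketch.
count_bytes() { wc -c | tr -d ' '; }   # bytes, independent of locale
count_chars() { wc -m | tr -d ' '; }   # characters, as decoded by the current locale

# "harker’s" followed by a newline. The apostrophe U+2019 is written as its
# three UTF-8 bytes in octal so the example does not depend on file encoding.
printf 'harker\342\200\231s\n' | count_bytes   # 11
printf 'harker\342\200\231s\n' | count_chars   # smaller than 11 only in a UTF-8 locale
```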


Inspect the first 250 lines using head

In [20]:
%%bash

head -250 data/dracula.txt

﻿The Project Gutenberg eBook of Dracula
    
This ebook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this ebook or online
at www.gutenberg.org. If you are not located in the United States,
you will have to check the laws of the country where you are located
before using this eBook.

Title: Dracula

Author: Bram Stoker

Release date: October 1, 1995 [eBook #345]
                Most recently updated: November 12, 2023

Language: English

Credits: Chuck Greif and the Online Distributed Proofreading Team


*** START OF THE PROJECT GUTENBERG EBOOK DRACULA ***




                                DRACULA

                                  _by_

                              Bram Stoker

                        [Illustration: colophon]

                    

We see that the first 105 lines contain Project Gutenberg specific text and the table of contents. We can remove these using sed. Then we inspect the new file using head and wc.

In [21]:
%%bash

sed -e "1,105d" data/dracula.txt > data/dracula1.txt
head -50 data/dracula1.txt


DRACULA




CHAPTER I

JONATHAN HARKER’S JOURNAL

(_Kept in shorthand._)


_3 May. Bistritz._--Left Munich at 8:35 P. M., on 1st May, arriving at
Vienna early next morning; should have arrived at 6:46, but train was an
hour late. Buda-Pesth seems a wonderful place, from the glimpse which I
got of it from the train and the little I could walk through the
streets. I feared to go very far from the station, as we had arrived
late and would start as near the correct time as possible. The
impression I had was that we were leaving the West and entering the
East; the most western of splendid bridges over the Danube, which is
here of noble width and depth, took us among the traditions of Turkish
rule.

We left in pretty good time, and came after nightfall to Klausenburgh.
Here I stopped for the night at the Hotel Royale. I had for dinner, or
rather supper, a chicken done up some way with red pepper, which was
very good but thirsty. (_Mem._, get recipe for Mina.) I as

In [22]:
%%bash

wc data/dracula1.txt | awk '{printf "%s contains %s lines, %s words and %s characters\n", $4, $1, $2, $3}'

data/dracula1.txt contains 15746 lines, 163955 words and 887239 characters
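
Deleting a fixed line range works for this copy of the file but breaks if the header length changes. A possible alternative is to cut at the marker lines instead. This is only a sketch: `strip_gutenberg` is a hypothetical helper name, the `*** START OF` marker is taken from the head output above, and the matching `*** END OF` marker is an assumption about the file's footer. Unlike the fixed range, this keeps the title page and table of contents that follow the start marker.

```shell
# strip_gutenberg is a hypothetical helper name; it reads a book on stdin.
# It deletes everything up to and including the START marker line, and
# everything from the END marker line (assumed to exist) to the end of input.
strip_gutenberg() {
  sed -e '1,/^\*\*\* START OF/d' -e '/^\*\*\* END OF/,$d'
}

# Tiny stand-in for a real file, to show the effect.
printf '%s\n' \
  'Title: Example' \
  '*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***' \
  'first line of the book' \
  'last line of the book' \
  '*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***' \
  'license text' | strip_gutenberg
```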


We can also remove all empty and whitespace-only lines using sed

In [23]:
%%bash

sed -r '/^\s*$/d' data/dracula1.txt > data/dracula2.txt
head -50 data/dracula2.txt

DRACULA
CHAPTER I
JONATHAN HARKER’S JOURNAL
(_Kept in shorthand._)
_3 May. Bistritz._--Left Munich at 8:35 P. M., on 1st May, arriving at
Vienna early next morning; should have arrived at 6:46, but train was an
hour late. Buda-Pesth seems a wonderful place, from the glimpse which I
got of it from the train and the little I could walk through the
streets. I feared to go very far from the station, as we had arrived
late and would start as near the correct time as possible. The
impression I had was that we were leaving the West and entering the
East; the most western of splendid bridges over the Danube, which is
here of noble width and depth, took us among the traditions of Turkish
rule.
We left in pretty good time, and came after nightfall to Klausenburgh.
Here I stopped for the night at the Hotel Royale. I had for dinner, or
rather supper, a chicken done up some way with red pepper, which was
very good but thirsty. (_Mem._, get recipe for Mina.) I asked the
waiter, and

In [24]:
%%bash

wc data/dracula2.txt | awk '{printf "%s contains %s lines, %s words and %s characters\n", $4, $1, $2, $3}'

data/dracula2.txt contains 13307 lines, 163955 words and 882339 characters


Convert all characters to lowercase using tr

In [25]:
%%bash

tr A-Z a-z <data/dracula2.txt >data/dracula3.txt
head -50 data/dracula3.txt

dracula
chapter i
jonathan harker’s journal
(_kept in shorthand._)
_3 may. bistritz._--left munich at 8:35 p. m., on 1st may, arriving at
vienna early next morning; should have arrived at 6:46, but train was an
hour late. buda-pesth seems a wonderful place, from the glimpse which i
got of it from the train and the little i could walk through the
streets. i feared to go very far from the station, as we had arrived
late and would start as near the correct time as possible. the
impression i had was that we were leaving the west and entering the
east; the most western of splendid bridges over the danube, which is
here of noble width and depth, took us among the traditions of turkish
rule.
we left in pretty good time, and came after nightfall to klausenburgh.
here i stopped for the night at the hotel royale. i had for dinner, or
rather supper, a chicken done up some way with red pepper, which was
very good but thirsty. (_mem._, get recipe for mina.) i asked the
waiter, and

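Next, remove punctuation and digits. Note that tr works on bytes and [:punct:] covers ASCII punctuation only, so the typographic apostrophe in harker’s survives, and words joined by a dash run together (bistritzleft).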

In [26]:
%%bash

tr -d '[:punct:][:digit:]' < data/dracula3.txt > data/dracula4.txt
head data/dracula4.txt

dracula
chapter i
jonathan harker’s journal
kept in shorthand
 may bistritzleft munich at  p m on st may arriving at
vienna early next morning should have arrived at  but train was an
hour late budapesth seems a wonderful place from the glimpse which i
got of it from the train and the little i could walk through the
streets i feared to go very far from the station as we had arrived
late and would start as near the correct time as possible the
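
The cleaning steps so far each write an intermediate file. As a rough sketch, they can also be chained into one filter that reads stdin and writes stdout. `clean_text` is a hypothetical helper name, and the character classes are the POSIX equivalents of the patterns used above. Ending each line with `|` rather than a backslash lets every stage carry its own comment.

```shell
# clean_text is a hypothetical helper name; it filters stdin to stdout.
clean_text() {
  sed -E '/^[[:space:]]*$/d' |      # drop empty and whitespace-only lines
    tr '[:upper:]' '[:lower:]' |    # lowercase ASCII letters
    tr -d '[:punct:][:digit:]'      # delete ASCII punctuation and digits
}

# Two lines taken from the book text shown earlier.
printf 'CHAPTER I\n\n_3 May. Bistritz._--Left Munich at 8:35 P. M.\n' | clean_text
```

Header removal is left out on purpose, since it depends on the specific file.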


Now we can perform a word frequency analysis using sort and uniq.
First we replace spaces with newlines so that each word sits on its own line, then we sort the lines to group identical words together. uniq -c then counts consecutive identical lines, which gives the word frequencies. Finally we sort numerically in reverse to print the most frequent words first.

In [27]:
%%bash

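# Annotated version of the pipeline below. It is kept commented out because a
# comment placed after a line-continuation backslash breaks the command.
#
# sed -r 's/\s+/\n/g' data/dracula4.txt | \  # replace runs of whitespace with newlines
#     awk 'NF' | \  # drop empty lines
#     sort | \  # alphabetical sort, groups identical words
#     uniq -c | \ # count consecutive identical lines
#     sort -nr | \ # reverse numeric sort, most frequent first
#     awk '{$1=$1; print}' | \  # strip leading and trailing whitespace
#     awk 'BEGIN { OFS=" " } {print $2,$1}' > data/wordcount.txt  # swap columns to "word count"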


sed -r 's/\s+/\n/g' data/dracula4.txt | \
   awk 'NF' | \
   sort | \
   uniq -c | \
   sort -nr | \
   awk '{$1=$1; print}' | \
   awk 'BEGIN { OFS=" " } {print $2,$1}' > data/wordcount.txt

In [28]:
%%bash

head -50 data/wordcount.txt

the 7983
and 5834
i 4532
to 4527
of 3726
a 2940
in 2533
he 2527
that 2424
it 2062
was 1872
as 1574
for 1519
is 1507
we 1500
his 1464
me 1394
not 1385
you 1364
with 1316
my 1213
all 1145
be 1117
at 1082
on 1074
so 1066
have 1055
her 1044
had 1038
but 1027
him 927
she 800
when 759
there 748
which 656
this 646
if 639
from 630
are 592
said 569
were 546
by 525
or 519
then 514
could 493
one 485
do 458
them 457
us 452
they 452
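
The same counting idea can be written as a small reusable filter. This is a sketch under the assumption that the input is already cleaned text. `word_freq` is a hypothetical helper name, and `tr -s` stands in for the sed substitution used above.

```shell
# word_freq is a hypothetical helper name; it reads text on stdin and prints
# "word count" lines, most frequent first.
word_freq() {
  tr -s '[:space:]' '\n' |   # one word per line, squeezing runs of whitespace
    awk 'NF' |               # drop the empty line left by leading whitespace
    sort |                   # group identical words together
    uniq -c |                # prefix each distinct word with its count
    sort -nr |               # highest count first
    awk '{print $2, $1}'     # reorder columns to "word count"
}

printf 'the count and the castle\nthe count\n' | word_freq
```

Words with equal counts come out in an order that depends on the sort locale, so only the ranking by count should be relied on.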



We can even create a histogram of word counts using a simple Python script.

In [29]:
%%bash

function histogram {
python3 -c 'import sys
for line in sys.stdin:
  data, width = line.split()
  print("{:<15}{:=<{width}}".format(data, "", width=int(int(width) / 75)))' # each = corresponds to a count of 75

}
export -f histogram

cat data/wordcount.txt  | histogram > data/histogram.txt

In [30]:
%%bash

head -250 data/histogram.txt

no             =====
will           =====
must           =====
up             =====
some           =====
what           =====
shall          =====
would          =====
out            =====
our            =====
may            =====
been           =====
know           =====
see            =====
can            =====
now            ====
more           ====
time           ====
has            ====
am             ====
over           ====
any            ====
van            ====
came           ====
come           ====
your           ====
went           ===
an             ===
helsing        ===
into           ===
only           ===
who            ===
very           ===
before         ===
did            ===
like           ===
go             ===
back           ===
down           ===
here           ===
seemed         ===
again          ===
about          ===
even           ===
such           ===
took           ==
than           ==
way            ==
their          ==
saw            ==
though        
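
For comparison, the same bar layout can be produced without Python. The sketch below is one possible awk equivalent, not a replacement for the function above. `histogram_awk` is a hypothetical helper name, and its optional argument is the count represented by each `=` (defaulting to 75, as in the Python version).

```shell
# histogram_awk is a hypothetical helper name; it reads "word count" lines.
# The optional first argument is the count represented by one "=".
histogram_awk() {
  awk -v scale="${1:-75}" '{
    bar = ""
    for (i = 0; i < int($2 / scale); i++) bar = bar "="
    printf "%-15s%s\n", $1, bar
  }'
}

# Three counts taken from data/wordcount.txt above, one "=" per 1000 occurrences.
printf 'the 7983\nand 5834\nthey 452\n' | histogram_awk 1000
```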

## Training an n-gram language model

Now we can use KenLM to train a 3-gram language model on our preprocessed corpus

In [31]:
!lmplz -o 3 <data/dracula4.txt > data/dracula.lm.arpa

=== 1/5 Counting and sorting n-grams ===
Reading /content/data/dracula4.txt
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Unigram tokens 163034 types 11707
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:140484 2:3786929152 3:7100492288
Statistics:
1 11707 D1=0.65254 D2=1.02232 D3+=1.40256
2 74732 D1=0.777838 D2=1.13233 D3+=1.3484
3 134905 D1=0.885782 D2=1.25363 D3+=1.40872
Memory estimate for binary LM:
type      kB
probing 4420 assuming -p 1.5
probing 4903 assuming -r models -p 1.5
trie    1882 without quantization
trie    1077 assuming -q 8 -b 8 quantization 
trie    1789 assuming -a 22 array pointer compression
trie     984 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
=== 3/5 Calculating and sorting initial probabilities ===
Chain sizes: 1:140484 2:1195712 3:2698100
----5---10---15---20-

We can see some 1-gram, 2-gram and 3-gram scores using grep

In [32]:
%%bash

grep -E -A10 "1-grams|2-grams|3-grams" data/dracula.lm.arpa

\1-grams:
-4.905576	<unk>	0
0	<s>	-0.5739316
-1.4011121	</s>	0
-3.7756853	dracula	-0.2081592
-4.7675614	chapter	-0.109111056
-1.8003287	i	-0.7904093
-2.9896963	jonathan	-0.30539653
-3.9969275	harker’s	-0.14176422
-3.742329	journal	-0.20828262
-3.388833	kept	-0.20666325
--
\2-grams:
-1.6933726	<s> </s>	0
-0.92340153	dracula </s>	0
-1.219743	i </s>	0
-0.9685342	jonathan </s>	0
-1.425547	harker’s </s>	0
-0.5972116	journal </s>	0
-0.96158737	kept </s>	0
-1.1766673	in </s>	0
-1.2551455	shorthand </s>	0
-1.1146483	may </s>	0
--
\3-grams:
-0.6580499	<s> dracula </s>
-0.84243387	castle dracula </s>
-0.6578116	this dracula </s>
-0.6578116	ebook dracula </s>
-0.7756664	chapter i </s>
-0.95658225	i i </s>
-1.1548808	jonathan i </s>
-1.2445455	may i </s>
-0.7756664	p i </s>
-1.0863998	on i </s>
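
Each n-gram line in the ARPA file is tab-separated: a log10 probability, the n-gram itself, and (for all but the highest order) a backoff weight. To read the scores as plain probabilities, one option is to raise 10 to the first column. This is a minimal sketch, and `arpa_to_prob` is a hypothetical helper name.

```shell
# arpa_to_prob is a hypothetical helper name; it expects tab-separated ARPA
# n-gram lines (log10 probability, n-gram, optional backoff) on stdin and
# skips section headers and separator lines.
arpa_to_prob() {
  awk -F'\t' 'NF >= 2 && $1 ~ /^-?[0-9]/ { printf "%s\t%.6f\n", $2, 10 ^ $1 }'
}

# Two unigram lines copied from the output above.
printf '%s\t%s\t%s\n' -1.8003287 i -0.7904093 -3.7756853 dracula -0.2081592 | arpa_to_prob
```

With the real model, the same filter could be fed from `grep -A10 "1-grams" data/dracula.lm.arpa`.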


The query binary scores text with the trained language model and reports its perplexity.
Lower perplexity means the model considers the text more probable.

Let's have the model score two possible endings.

In [33]:
!query -h

query: invalid option -- 'h'
KenLM was compiled with maximum order 6.
Usage: query [-b] [-n] [-w] [-s] lm_file
-b: Do not buffer output.
-n: Do not wrap the input in <s> and </s>.
-v summary|sentence|word: Print statistics at this level.
   Can be used multiple times: -v summary -v sentence -v word
-l lazy|populate|read|parallel: Load lazily, with populate, or malloc+read
The default loading method is populate on Linux and read on others.

Each word in the output is formatted as:
  word=vocab_id ngram_length log10(p(word|context))
where ngram_length is the length of n-gram matched.  A vocab_id of 0 indicates
the unknown word. Sentence-level output includes log10 probability of the
sentence and OOV count.


In [34]:
%%bash

echo "harker and mina die a horrible death" > data/bad_ending
echo "harker and mina live happily ever after" > data/good_ending

In [35]:
%%bash

cat data/bad_ending

harker and mina die a horrible death


In [36]:
!query data/dracula.lm.arpa < data/bad_ending 2>&1| grep "Perplexity" | head -1

Perplexity including OOVs:	461.44070875334245


In [37]:
%%bash

cat data/good_ending

harker and mina live happily ever after


In [38]:
!query data/dracula.lm.arpa < data/good_ending 2>&1| grep "Perplexity" | head -1

Perplexity including OOVs:	926.7947324237933
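
In this run the "bad" ending scores about 461 against about 927 for the "good" ending, so the model trained on Dracula finds the bad ending the more probable one. Perplexity in the usual sense is 10 raised to the negative mean log10 probability per token. The sketch below computes that from a list of log10 probabilities. `perplexity` is a hypothetical helper name, the input values are made up, and how KenLM counts tokens (for example the end-of-sentence token) is an assumption to check against the output of `query`.

```shell
# perplexity is a hypothetical helper name; it reads one log10 probability
# per line (one per scored token) and prints 10^(-mean log10 probability).
perplexity() {
  awk '{ total += $1; n++ } END { if (n > 0) printf "%.4f\n", 10 ^ (-total / n) }'
}

# Made-up log10 probabilities for a three-token sentence: mean is -3.
printf '%s\n' -2 -3 -4 | perplexity   # 1000.0000
```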
