<a href="https://colab.research.google.com/github/beinghorizontal/wav2vec2/blob/main/create_n_grams.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## **3. Build an *n-gram* with KenLM**



Great, let's see step-by-step how to build an *n-gram*. We will use the popular [KenLM library](https://github.com/kpu/kenlm) to do so. Let's start by installing the Ubuntu library prerequisites:

In [2]:
!sudo apt install build-essential cmake libboost-system-dev libboost-thread-dev libboost-program-options-dev libboost-test-dev libeigen3-dev zlib1g-dev libbz2-dev liblzma-dev

Reading package lists... Done
Building dependency tree       
Reading state information... Done
build-essential is already the newest version (12.4ubuntu1).
libboost-program-options-dev is already the newest version (1.65.1.0ubuntu1).
libboost-program-options-dev set to manually installed.
libboost-system-dev is already the newest version (1.65.1.0ubuntu1).
libboost-system-dev set to manually installed.
libboost-thread-dev is already the newest version (1.65.1.0ubuntu1).
libboost-thread-dev set to manually installed.
libboost-test-dev is already the newest version (1.65.1.0ubuntu1).
libboost-test-dev set to manually installed.
cmake is already the newest version (3.10.2-1ubuntu2.18.04.2).
libbz2-dev is already the newest version (1.0.6-8.1ubuntu0.2).
libbz2-dev set to manually installed.
liblzma-dev is already the newest version (5.2.2-1.3ubuntu0.1).
liblzma-dev set to manually installed.
zlib1g-dev is already the newest version (1:1.2.11.dfsg-0ubuntu2.1).
zlib1g-dev set to manually in

Next, we download and unpack the KenLM repo.

In [3]:
!wget -O - https://kheafield.com/code/kenlm.tar.gz | tar xz

--2022-08-14 17:20:14--  https://kheafield.com/code/kenlm.tar.gz
Resolving kheafield.com (kheafield.com)... 35.196.63.85
Connecting to kheafield.com (kheafield.com)|35.196.63.85|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 491888 (480K) [application/x-gzip]
Saving to: ‘STDOUT’


2022-08-14 17:20:15 (971 KB/s) - written to stdout [491888/491888]



KenLM is written in C++, so we'll make use of `cmake` to build the binaries.

In [4]:
!mkdir kenlm/build && cd kenlm/build && cmake .. && make -j2
!ls kenlm/build/bin

-- The C compiler identification is GNU 7.5.0
-- The CXX compiler identification is GNU 7.5.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE  
-- Found Boost: /usr/include (found suitable version "1.65.1", minimum required is "1.41.0") found components: program_options system

Great, the KenLM executables are now built under `kenlm/build/bin/`.

KenLM by default computes an *n-gram* with [Kneser-Ney smoothing](https://en.wikipedia.org/wiki/Kneser%E2%80%93Ney_smoothing). All text data used to create the *n-gram* is expected to be stored in a text file.
Here, the dataset is already saved as a `.txt` file on the mounted Google Drive, and `-o 5` sets the *n-gram* order to 5.
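
To make the term concrete, here is a minimal sketch of what counting *n-grams* over a text file amounts to. It is plain Python for illustration only, not how KenLM is implemented; the helper name `count_ngrams` and the toy sentences are made up for this example, and padding each sentence with one `<s>` and one `</s>` is an assumption chosen to mirror the boundary tokens that show up in the ARPA file later.

```python
from collections import Counter

def count_ngrams(sentences, order):
    """Count all n-grams of length `order` in whitespace-tokenized sentences."""
    counts = Counter()
    for sentence in sentences:
        # Pad with sentence-boundary tokens (illustrative choice).
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        # Slide a window of `order` tokens over the padded sentence.
        for i in range(len(tokens) - order + 1):
            counts[tuple(tokens[i:i + order])] += 1
    return counts

# Hypothetical two-line corpus, one sentence per line.
toy_corpus = ["google speech recognition", "speech recognition with google"]
bigrams = count_ngrams(toy_corpus, order=2)
print(bigrams[("speech", "recognition")])  # prints 2
```

A smoothed model such as the one `lmplz` estimates starts from counts like these and then redistributes probability mass to unseen *n-grams*.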

In [5]:

!kenlm/build/bin/lmplz -o 5 <"/content/drive/MyDrive/textfile.txt" > "5gram.arpa"

=== 1/5 Counting and sorting n-grams ===
Reading /content/drive/MyDrive/textfile.txt
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
tcmalloc: large alloc 1918066688 bytes == 0x55888bede000 @  0x7f42597a51e7 0x55888a6357e2 0x55888a5d04fe 0x55888a5af2eb 0x55888a59b066 0x7f425793ec87 0x55888a59cbaa
tcmalloc: large alloc 8950972416 bytes == 0x5588fe414000 @  0x7f42597a51e7 0x55888a6357e2 0x55888a62480a 0x55888a625248 0x55888a5af308 0x55888a59b066 0x7f425793ec87 0x55888a59cbaa
****************************************************************************************************
Unigram tokens 432283 types 44014
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:528168 2:1062796864 3:1992744320 4:3188390656 5:4649736704
tcmalloc: large alloc 4649738240 bytes == 0x55888bede000 @  0x7f42597a51e7 0x55888a6357e2 0x55888a62480a 0x55888a625248 0x55888a5af8d7 0x55888a59b066 0x7f425793ec87 0x55888a59cbaa
tcmalloc: large alloc 199274

Great, we have built a *5-gram* LM! Let's inspect the first couple of lines.

In [6]:
!head -20 5gram.arpa

\data\
ngram 1=44014
ngram 2=224034
ngram 3=347340
ngram 4=370997
ngram 5=362076

\1-grams:
-5.3902674	<unk>	0
0	<s>	-0.7231836
-1.3045018	</s>	0
-2.7780101	can	-0.36739397
-5.264799	volcro	-0.08811892
-3.880325	stick	-0.5280876
-2.3268511	with	-0.36656973
-5.264799	cotton	-0.08811892
-5.0772634	cloth	-0.08811892
-3.799947	google	-0.14392525
-3.9268823	speech	-0.2904494
-4.2891884	recognition	-0.11017436
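
Each entry in the ARPA file is tab-separated: a log10 probability, the *n-gram* itself, and (for all but the highest order) a backoff weight. The following sketch splits one such line into its parts; the function name `parse_arpa_entry` is hypothetical, and the sample line is copied from the output above.

```python
def parse_arpa_entry(line):
    """Split one ARPA n-gram line into (log10_prob, ngram, backoff)."""
    fields = line.rstrip("\n").split("\t")
    log10_prob = float(fields[0])
    ngram = fields[1]
    # The backoff weight is absent for the highest-order n-grams.
    backoff = float(fields[2]) if len(fields) > 2 else None
    return log10_prob, ngram, backoff

log10_prob, ngram, backoff = parse_arpa_entry("-2.7780101\tcan\t-0.36739397\n")
# Convert the log10 probability back to a plain probability.
print(ngram, 10 ** log10_prob, backoff)
```

For the unigram `can`, the stored value `-2.7780101` corresponds to a probability of roughly 0.0017.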


There is a small problem that 🤗 Transformers may not be happy about later on.
The *5-gram* needs an "Unknown" token `<unk>`, a *begin-of-sentence* token `<s>`, and an *end-of-sentence* token `</s>`. If `</s>` is missing from the `\1-grams:` section, it has to be added after the build. In the output above, `</s>` is in fact already listed (`-1.3045018	</s>	0`), so check your own file before applying the fix.

We can add the *end-of-sentence* token by copying the `<s>` line, with `<s>` replaced by `</s>`, directly below the *begin-of-sentence* token and increasing the `ngram 1` count by 1. The code below streams through the file line by line; this file has about 1.35 million *n-gram* entries (the sum of the counts in the `\data\` header).

In [7]:
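with open("5gram.arpa", "r", encoding="utf-8") as read_file, open("5gram_correct.arpa", "w", encoding="utf-8") as write_file:
    # Note: this does not check whether `</s>` already exists in the file.
    has_added_eos = False
    for line in read_file:
        if not has_added_eos and line.startswith("ngram 1="):
            # Bump the unigram count in the header to account for the new `</s>` entry.
            count = int(line.strip().split("=")[-1])
            write_file.write(f"ngram 1={count + 1}\n")
        elif not has_added_eos and "<s>" in line:
            # Copy the `<s>` line, then write it again with `</s>` in its place.
            write_file.write(line)
            write_file.write(line.replace("<s>", "</s>"))
            has_added_eos = True
        else:
            write_file.write(line)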
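
A more defensive variant of the cell above could first check whether the unigram section already contains `</s>` and only patch the file when it does not. This is a sketch under that assumption, not part of KenLM or 🤗 Transformers; the function name `add_eos_if_missing` is hypothetical.

```python
def add_eos_if_missing(src_path, dst_path):
    """Copy an ARPA file, adding a `</s>` unigram only if none exists.

    Returns True if a `</s>` line was added, False otherwise.
    """
    # First pass: look for an existing `</s>` entry in the unigram section.
    has_eos = False
    in_unigrams = False
    with open(src_path, "r", encoding="utf-8") as src:
        for line in src:
            stripped = line.strip()
            if stripped == "\\1-grams:":
                in_unigrams = True
            elif in_unigrams and stripped.startswith("\\"):
                break  # reached the next section header
            elif in_unigrams and stripped:
                fields = stripped.split("\t")
                if len(fields) > 1 and fields[1] == "</s>":
                    has_eos = True
                    break

    # Second pass: copy the file, patching it only when needed.
    added = False
    with open(src_path, "r", encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            fields = line.rstrip("\n").split("\t")
            is_bos = len(fields) > 1 and fields[1] == "<s>"
            if not has_eos and not added and line.startswith("ngram 1="):
                # Bump the unigram count for the entry added below.
                count = int(line.strip().split("=")[-1])
                dst.write(f"ngram 1={count + 1}\n")
            elif not has_eos and not added and is_bos:
                # Duplicate the `<s>` line with `</s>` in its place.
                dst.write(line)
                dst.write(line.replace("<s>", "</s>"))
                added = True
            else:
                dst.write(line)
    return added
```

Calling `add_eos_if_missing("5gram.arpa", "5gram_correct.arpa")` would then leave a file that already has `</s>` unchanged, at the cost of reading the start of the file twice.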

Let's now inspect the corrected *5-gram*.

In [8]:
!head -20 5gram_correct.arpa

\data\
ngram 1=44015
ngram 2=224034
ngram 3=347340
ngram 4=370997
ngram 5=362076

\1-grams:
-5.3902674	<unk>	0
0	<s>	-0.7231836
0	</s>	-0.7231836
-1.3045018	</s>	0
-2.7780101	can	-0.36739397
-5.264799	volcro	-0.08811892
-3.880325	stick	-0.5280876
-2.3268511	with	-0.36656973
-5.264799	cotton	-0.08811892
-5.0772634	cloth	-0.08811892
-3.799947	google	-0.14392525
-3.9268823	speech	-0.2904494


The `ngram 1` count is now 44015 and a `</s>` line follows `<s>`. Since the original file already had a `</s>` unigram, the token now appears twice in this output. All that is left to do is to integrate the *n-gram* with [`pyctcdecode`](https://github.com/kensho-technologies/pyctcdecode) and 🤗 Transformers.
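
As a sanity check on a patched ARPA file, one could scan the `\1-grams:` section for tokens that are listed more than once. The sketch below is illustrative; the function name `duplicate_unigrams` is hypothetical, and the sample lines are copied from the `head` output above. Whether downstream tools tolerate a repeated unigram is not verified here.

```python
from collections import Counter

def duplicate_unigrams(arpa_lines):
    """Return the tokens that occur more than once in the unigram section."""
    counts = Counter()
    in_unigrams = False
    for line in arpa_lines:
        stripped = line.strip()
        if stripped == "\\1-grams:":
            in_unigrams = True
        elif stripped.startswith("\\"):
            in_unigrams = False  # another section header
        elif in_unigrams and stripped:
            fields = stripped.split("\t")
            if len(fields) > 1:
                counts[fields[1]] += 1
    return [token for token, n in counts.items() if n > 1]

sample = [
    "\\1-grams:",
    "-5.3902674\t<unk>\t0",
    "0\t<s>\t-0.7231836",
    "0\t</s>\t-0.7231836",
    "-1.3045018\t</s>\t0",
    "-2.7780101\tcan\t-0.36739397",
]
print(duplicate_unigrams(sample))  # prints ['</s>']
```

The function accepts any iterable of lines, so an open file handle for `5gram_correct.arpa` can be passed in directly.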

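### Compress to binary

Finally, we convert the corrected ARPA file to KenLM's binary format with `build_binary`, which is faster to load.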

In [None]:
!kenlm/build/bin/build_binary /content/5gram_correct.arpa /content/5gram.bin