<a href="https://colab.research.google.com/github/beinghorizontal/wav2vec2/blob/main/quickcreate_3n_grams.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
from google.colab import output
output.enable_custom_widget_manager()

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## **Part 1. Build an *n-gram* with KenLM** and upload the binary to Drive



Great, let's see step-by-step how to build an *n-gram*. We will use the popular [KenLM library](https://github.com/kpu/kenlm) to do so. Let's start by installing the Ubuntu library prerequisites:

In [None]:
!sudo apt install build-essential cmake libboost-system-dev libboost-thread-dev libboost-program-options-dev libboost-test-dev libeigen3-dev zlib1g-dev libbz2-dev liblzma-dev

Next, we download and unpack the KenLM source archive.

In [5]:
!wget -O - https://kheafield.com/code/kenlm.tar.gz | tar xz

--2022-08-21 18:54:20--  https://kheafield.com/code/kenlm.tar.gz
Resolving kheafield.com (kheafield.com)... 35.196.63.85
Connecting to kheafield.com (kheafield.com)|35.196.63.85|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 491888 (480K) [application/x-gzip]
Saving to: ‘STDOUT’


2022-08-21 18:54:23 (410 KB/s) - written to stdout [491888/491888]



KenLM is written in C++, so we'll make use of `cmake` to build the binaries.

In [None]:
!mkdir kenlm/build && cd kenlm/build && cmake .. && make -j2
!ls kenlm/build/bin

In [7]:
import shutil
import os

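def make_archive(source, destination):
    """Archive the `source` directory and move the archive to `destination`."""
    base = os.path.basename(destination)
    # Split e.g. "kenlm.zip" into the archive name and its format ("zip").
    name, ext = os.path.splitext(base)
    archive_format = ext.lstrip('.')
    archive_from = os.path.dirname(source)
    archive_to = os.path.basename(source.rstrip(os.sep))
    shutil.make_archive(name, archive_format, archive_from, archive_to)
    shutil.move('%s.%s' % (name, archive_format), destination)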

make_archive('/content/kenlm', '/content/drive/MyDrive/kenlm.zip')
  

As the `ls` output shows, the executables have been built successfully under `kenlm/build/bin/`. The cell above additionally archives the whole `kenlm` directory to `/content/drive/MyDrive/kenlm.zip`.

KenLM by default computes an *n-gram* with [Kneser-Ney smoothing](https://en.wikipedia.org/wiki/Kneser%E2%80%93Ney_smoothing). All text data used to create the *n-gram* is expected to be stored in a text file.
Here the corpus is read from `textfile_ngram.txt` on the mounted Drive.
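
As a rough illustration of what such a text file might contain, the sketch below normalizes raw sentences into one lowercase sentence per line. The helper names `normalize_line` and `build_corpus`, the English-only character set, and the sample sentences are assumptions for illustration and not part of the original notebook; the right normalization depends on the vocabulary of the acoustic model.

```python
import re

# Assumption: the target alphabet is lowercase English letters, apostrophes and spaces.
_UNSUPPORTED = re.compile(r"[^a-z' ]+")

def normalize_line(text):
    """Lowercase, replace unsupported characters with spaces, collapse whitespace."""
    text = _UNSUPPORTED.sub(" ", text.lower())
    return " ".join(text.split())

def build_corpus(sentences):
    """Return the corpus text: one normalized, non-empty sentence per line."""
    lines = [normalize_line(s) for s in sentences]
    return "\n".join(line for line in lines if line) + "\n"

# Hypothetical sample input.
sample = ["Hello, World!", "  ", "It's 5 o'clock."]
corpus_text = build_corpus(sample)
print(corpus_text, end="")
```

The resulting string could then be written to a `.txt` file and passed to `lmplz` on standard input.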

In [None]:
# Try without the --discount_fallback flag first (see the commented-out command below).
!kenlm/build/bin/lmplz -o 3 <"/content/drive/MyDrive/textfile_ngram.txt" > "3gram.arpa"

#!kenlm/build/bin/lmplz -o 5 <"/content/drive/MyDrive/textfile_ngram.txt" > "5gram.arpa" --discount_fallback


Great, we have built a *3-gram* LM! Let's inspect the first couple of lines.

In [9]:
!head -20 3gram.arpa



There is a small problem that 🤗 Transformers will not be happy about later on.
The *3-gram* correctly includes an "Unknown" token, `<unk>`, as well as a *begin-of-sentence* token, `<s>`, but no *end-of-sentence* token, `</s>`.
Unfortunately, this currently has to be corrected after the build.

We can add the *end-of-sentence* token by inserting a line such as `0 </s>  -0.11831701` below the *begin-of-sentence* token and increasing the `ngram 1` count by 1. The code below does this by copying the `<s>` line with `<s>` replaced by `</s>`. On a large file (roughly 100 million lines), this takes *ca.* 2 minutes.

In [10]:
with open("3gram.arpa", "r") as read_file, open("3gram_correct.arpa", "w") as write_file:
    has_added_eos = False
    for line in read_file:
        if not has_added_eos and "ngram 1=" in line:
            # Bump the unigram count in the header to account for the added </s>.
            count = line.strip().split("=")[-1]
            write_file.write(line.replace(f"={count}", f"={int(count) + 1}"))
        elif not has_added_eos and "<s>" in line:
            # Keep the <s> entry and add a copy of it as the </s> entry.
            write_file.write(line)
            write_file.write(line.replace("<s>", "</s>"))
            has_added_eos = True
        else:
            write_file.write(line)

Let's now inspect the corrected *3-gram*.

In [None]:
!head -20 3gram_correct.arpa
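
Besides eyeballing the `head` output, the fix can be checked programmatically. The following is a minimal sketch, assuming the standard ARPA layout (a header with `ngram N=count` lines, followed by a `\1-grams:` section with whitespace-separated fields); the function name `summarize_arpa_unigrams` and the tiny sample model are made up for illustration.

```python
def summarize_arpa_unigrams(lines):
    """Return (declared unigram count, special tokens found among the unigrams)."""
    declared = None
    found = set()
    in_unigrams = False
    for raw in lines:
        line = raw.strip()
        if line.startswith("ngram 1="):
            declared = int(line.split("=")[-1])
        elif line == "\\1-grams:":
            in_unigrams = True
        elif in_unigrams:
            if line.startswith("\\"):
                break  # reached the next section or the end marker
            fields = line.split()
            # Unigram entries look like: log-probability, token, optional backoff.
            if len(fields) >= 2 and fields[1] in ("<unk>", "<s>", "</s>"):
                found.add(fields[1])
    return declared, found

# Hypothetical miniature ARPA file, for illustration only.
sample = [
    "\\data\\",
    "ngram 1=4",
    "ngram 2=1",
    "",
    "\\1-grams:",
    "-1.0\t<unk>\t0",
    "0\t<s>\t-0.3",
    "0\t</s>\t-0.3",
    "-0.5\thello\t-0.2",
    "",
    "\\2-grams:",
    "-0.1\t<s> hello",
    "",
    "\\end\\",
]
declared, found = summarize_arpa_unigrams(sample)
print(declared, sorted(found))
```

On the real file, the same function can be given an open file handle, e.g. `with open("3gram_correct.arpa") as f: summarize_arpa_unigrams(f)`; the returned set should then contain `</s>`.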

Great, this looks better! The language model itself is now done; all that is left is to integrate the *n-gram* with [`pyctcdecode`](https://github.com/kensho-technologies/pyctcdecode) and 🤗 Transformers.

### Compress to binary

In [13]:
!kenlm/build/bin/build_binary /content/3gram_correct.arpa /content/3gram.bin

Reading /content/3gram_correct.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
SUCCESS
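
The section title mentions uploading the binary to Drive, but no cell above performs that copy. A minimal sketch of that step follows, assuming Drive is still mounted at `/content/drive` as at the top of the notebook; the helper name `copy_to_dir` and the destination folder are illustrative assumptions.

```python
import os
import shutil

def copy_to_dir(src_path, dest_dir):
    """Copy src_path into dest_dir (created if missing) and return the new path."""
    os.makedirs(dest_dir, exist_ok=True)
    return shutil.copy2(src_path, dest_dir)

# In this notebook the call might look like this (paths are assumptions):
# copy_to_dir("/content/3gram.bin", "/content/drive/MyDrive")
```

Copying the `.bin` file rather than the `.arpa` file keeps the upload small, since the binary is the compressed form of the model.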


## Important
### Edit `unigram.txt` to remove any Unicode characters, then save it to Drive and close the Colab session
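
The notebook does not show how `unigram.txt` is produced or which characters need to be removed. As one possible reading, the sketch below assumes the file holds one token per line and that the cleanup means dropping tokens containing non-ASCII characters; both are assumptions, and the helper name `keep_ascii_tokens` is hypothetical. For a non-English model this filter would be the wrong choice.

```python
def keep_ascii_tokens(tokens):
    """Return the tokens that are non-empty and consist only of ASCII characters."""
    cleaned = []
    for token in tokens:
        token = token.strip()
        if token and token.isascii():
            cleaned.append(token)
    return cleaned

# Hypothetical sample lines, as they might be read from unigram.txt.
sample_tokens = ["hello", "caf\u00e9", "", "world\n"]
ascii_tokens = keep_ascii_tokens(sample_tokens)
print(ascii_tokens)
```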