<a href="https://colab.research.google.com/github/beinghorizontal/wav2vec2/blob/main/create_n_grams.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
from google.colab import drive
drive.mount('/content/drive')

## **Part 1. Build an *n-gram* with KenLM** and upload the binary to Drive



Great, let's see step-by-step how to build an *n-gram*. We will use the popular [KenLM library](https://github.com/kpu/kenlm) to do so. Let's start by installing the Ubuntu library prerequisites:

In [None]:
!sudo apt install -y build-essential cmake libboost-system-dev libboost-thread-dev libboost-program-options-dev libboost-test-dev libeigen3-dev zlib1g-dev libbz2-dev liblzma-dev

Next, we download and unpack the KenLM repo.

In [None]:
!wget -O - https://kheafield.com/code/kenlm.tar.gz | tar xz

KenLM is written in C++, so we'll make use of `cmake` to build the binaries.

In [None]:
!mkdir kenlm/build && cd kenlm/build && cmake .. && make -j2
!ls kenlm/build/bin

Great, as we can see, the executables have been built successfully under `kenlm/build/bin/`.

KenLM by default computes an *n-gram* with [Kneser-Ney smoothing](https://en.wikipedia.org/wiki/Kneser%E2%80%93Ney_smoothing). All text data used to create the *n-gram* is expected to be stored in a text file.
Here the corpus is read from Google Drive as `textfile_ngram.txt`.

In [None]:
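# Try without --discount_fallback first. If lmplz aborts because it cannot
# estimate the Kneser-Ney discounts (typical for small corpora), use the
# commented command below instead.
!kenlm/build/bin/lmplz -o 5 <"/content/drive/MyDrive/textfile_ngram.txt" > "5gram.arpa"

#!kenlm/build/bin/lmplz -o 5 --discount_fallback <"/content/drive/MyDrive/textfile_ngram.txt" > "5gram.arpa"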
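
To make concrete what `lmplz -o 5` consumes, here is a minimal pure-Python sketch that counts *n-grams* over a corpus with one sentence per line, padding each sentence with `<s>` and `</s>`. It only counts; KenLM additionally applies smoothing and writes the ARPA file. The function name `count_ngrams` and the toy corpus are illustrative assumptions, not part of KenLM.

```python
from collections import Counter


def count_ngrams(lines, order):
    """Count all n-grams of length `order` over whitespace-tokenized lines."""
    counts = Counter()
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue  # skip empty lines
        # Each line is treated as one sentence with boundary tokens
        padded = ["<s>"] + tokens + ["</s>"]
        for i in range(len(padded) - order + 1):
            counts[tuple(padded[i:i + order])] += 1
    return counts


# Toy corpus (hypothetical data, one sentence per line)
toy_corpus = ["the cat sat", "the cat ran"]
bigrams = count_ngrams(toy_corpus, order=2)
print(bigrams[("the", "cat")])  # 2
print(bigrams[("<s>", "the")])  # 2
```

A real corpus file can be passed directly, since an open file handle iterates over its lines.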


Great, we have built a *5-gram* LM! Let's inspect the first couple of lines.

In [None]:
!head -20 5gram.arpa
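
Instead of reading the header by eye, the per-order counts can also be parsed programmatically. The sketch below assumes the standard ARPA layout, in which a `\data\` line is followed by `ngram N=count` lines; `read_arpa_counts` and the toy header values are hypothetical.

```python
import re


def read_arpa_counts(lines):
    """Return {order: count} parsed from the header of an ARPA file."""
    counts = {}
    in_header = False
    for line in lines:
        line = line.strip()
        if line == "\\data\\":
            in_header = True
        elif in_header:
            match = re.fullmatch(r"ngram (\d+)=(\d+)", line)
            if match:
                counts[int(match.group(1))] = int(match.group(2))
            elif line:
                # First non-empty line that is not a count ends the header
                break
    return counts


# Toy header (made-up counts) in the same layout as the real file
toy_header = ["\\data\\", "ngram 1=4", "ngram 2=3", "", "\\1-grams:"]
print(read_arpa_counts(toy_header))  # {1: 4, 2: 3}
```

For the real file, `with open("5gram.arpa") as f: read_arpa_counts(f)` stops reading as soon as the header ends, so it stays fast even for very large models.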

There is a small problem that 🤗 Transformers will not be happy about later on.
The *5-gram* correctly includes an "Unknown" token, `<unk>`, as well as a *begin-of-sentence* token, `<s>`, but no *end-of-sentence* token, `</s>`.
Sadly, this currently has to be corrected after the build.

We can add the *end-of-sentence* token by inserting a line such as `0 </s>  -0.11831701` directly below the *begin-of-sentence* token and increasing the `ngram 1` count by 1. The code below does this by copying the `<s>` line with `<s>` replaced by `</s>`. Because the file has roughly 100 million lines, this command will take *ca.* 2 minutes.

In [None]:
with open("5gram.arpa", "r") as read_file, open("5gram_correct.arpa", "w") as write_file:
  has_added_eos = False
  for line in read_file:
    if not has_added_eos and "ngram 1=" in line:
      # Bump the unigram count in the header to account for the new </s> entry.
      # Writing the line explicitly avoids str.replace touching the "1" in "ngram 1".
      count = int(line.strip().split("=")[-1])
      write_file.write(f"ngram 1={count + 1}\n")
    elif not has_added_eos and "<s>" in line:
      # Keep the <s> unigram line and add a matching </s> line right below it
      write_file.write(line)
      write_file.write(line.replace("<s>", "</s>"))
      has_added_eos = True
    else:
      write_file.write(line)
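
To check the effect of this correction without waiting on the full file, the same logic can be run on a tiny in-memory ARPA snippet. This is a sketch: `add_eos_token` is a hypothetical helper, and the probabilities in `toy_arpa` are made up.

```python
def add_eos_token(lines):
    """Return ARPA lines with a </s> unigram added and `ngram 1` incremented."""
    fixed = []
    has_added_eos = False
    for line in lines:
        if not has_added_eos and "ngram 1=" in line:
            # Header: one more unigram after the fix
            count = int(line.strip().split("=")[-1])
            fixed.append(f"ngram 1={count + 1}\n")
        elif not has_added_eos and "<s>" in line:
            # First <s> line is the unigram entry: duplicate it as </s>
            fixed.append(line)
            fixed.append(line.replace("<s>", "</s>"))
            has_added_eos = True
        else:
            fixed.append(line)
    return fixed


# Toy ARPA content with made-up log-probabilities
toy_arpa = [
    "\\data\\\n",
    "ngram 1=2\n",
    "ngram 2=1\n",
    "\n",
    "\\1-grams:\n",
    "-1.0\t<unk>\t0\n",
    "0\t<s>\t-0.5\n",
    "\n",
    "\\2-grams:\n",
    "-0.3\t<s> <unk>\n",
]
fixed = add_eos_token(toy_arpa)
print("".join(fixed))
```

Only the first `<s>` line is duplicated; later lines that contain `<s>` inside higher-order *n-grams* are left untouched because `has_added_eos` is already set.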

Let's now inspect the corrected *5-gram*.

In [None]:
!head -20 5gram_correct.arpa

Great, this looks better! The *n-gram* itself is done at this point. What is left is to compress it and to integrate it correctly with [`pyctcdecode`](https://github.com/kensho-technologies/pyctcdecode) and 🤗 Transformers.

### Compress to binary

In [None]:
!kenlm/build/bin/build_binary /content/5gram_correct.arpa /content/5gram.bin

## **Part 2. Combine an *n-gram* with Wav2Vec2**

In [None]:
!pip install transformers==4.18.0


In [None]:
# from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

# processor = Wav2Vec2Processor.from_pretrained("crossdelenna/wav2vec2-base-en-in")
# model = Wav2Vec2ForCTC.from_pretrained("crossdelenna/wav2vec2-base-en-in")

In [None]:
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("crossdelenna/wav2vec2-base-en-in")

In [None]:
vocab_dict = processor.tokenizer.get_vocab()
sorted_vocab_dict = {k.lower(): v for k, v in sorted(vocab_dict.items(), key=lambda item: item[1])}
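
The ordering matters because the decoder receives only a list of labels, so position `i` in that list has to correspond to column `i` of the model's logits. The same dict comprehension applied to a toy vocabulary (hypothetical tokens and ids) shows the result:

```python
# Toy vocabulary in arbitrary order (hypothetical tokens and ids)
toy_vocab = {"B": 2, "<pad>": 0, "A": 1, "|": 3}

# Sort by token id, then lowercase the tokens, as in the cell above
sorted_toy = {k.lower(): v for k, v in sorted(toy_vocab.items(), key=lambda item: item[1])}

# The decoder labels are the keys in id order
labels = list(sorted_toy.keys())
print(labels)  # ['<pad>', 'a', 'b', '|']
```

One caveat: if a vocabulary contained both upper- and lowercase versions of a token, lowercasing would merge the two keys and shift the label positions, so it is worth checking that `len(labels)` still equals the vocabulary size.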

In [None]:
!pip install https://github.com/kpu/kenlm/archive/master.zip
!pip install pyctcdecode==0.3.0


In [None]:
from pyctcdecode import build_ctcdecoder

decoder = build_ctcdecoder(
    labels=list(sorted_vocab_dict.keys()),
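    # Assumes 5gram_correct.arpa was copied to Drive; otherwise point this at "/content/5gram_correct.arpa"
    kenlm_model_path="/content/drive/MyDrive/5gram_correct.arpa",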
)

We can safely ignore the warning. All that is left to do now is to wrap the newly created `decoder`, together with the processor's `tokenizer` and `feature_extractor`, in a `Wav2Vec2ProcessorWithLM` class.

In [None]:
from transformers import Wav2Vec2ProcessorWithLM

processor_with_lm = Wav2Vec2ProcessorWithLM(
    feature_extractor=processor.feature_extractor,
    tokenizer=processor.tokenizer,
    decoder=decoder
)

## Before uploading the LM, evaluate with and without it

In [None]:
!pip install datasets==2.0.0
import datasets
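# Run `huggingface-cli login` first; use_auth_token=True then reuses the saved token.
# Never hard-code an access token in a notebook.
timit = datasets.load_dataset("crossdelenna/en_in", use_auth_token=True)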


In [None]:
timit

### Check a random audio sample

In [None]:
import IPython.display as ipd
audio_sample = timit['test'][3]['input_values']
#print(audio_sample["labels"].lower())
ipd.Audio(data=audio_sample, autoplay=True, rate=16000)

In [None]:
from transformers import AutoModelForCTC, Wav2Vec2Processor

model = AutoModelForCTC.from_pretrained("crossdelenna/wav2vec2-base-en-in")
processor = Wav2Vec2Processor.from_pretrained("crossdelenna/wav2vec2-base-en-in")

In [None]:
from transformers import Wav2Vec2ForCTC
model = Wav2Vec2ForCTC.from_pretrained("crossdelenna/wav2vec2-base-en-in").cuda()

In [None]:
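import torch

def map_to_result(batch):
  # Forward pass without gradient tracking; unsqueeze adds the batch dimension
  with torch.no_grad():
    input_values = torch.tensor(batch["input_values"], device="cuda").unsqueeze(0)
    logits = model(input_values).logits

  # Greedy decoding without the LM: take the most likely token for each frame
  pred_ids = torch.argmax(logits, dim=-1)
  batch["pred_str"] = processor.batch_decode(pred_ids)[0]
  # The labels are target token ids, so repeated tokens must not be collapsed
  batch["text"] = processor.decode(batch["labels"], group_tokens=False)

  return batch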
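
The argmax-then-decode step above is greedy CTC decoding: take the best token per frame, collapse consecutive repeats, and drop the blank token. A minimal pure-Python sketch follows, assuming the blank is the `<pad>` token at index 0 and `|` is the word delimiter; `greedy_ctc_decode` and the toy scores are hypothetical.

```python
from itertools import groupby


def greedy_ctc_decode(frame_scores, labels, blank_id=0, word_delimiter="|"):
    """Greedy CTC decoding over a list of per-frame score lists."""
    # 1. Most likely label index for each frame
    pred_ids = [max(range(len(frame)), key=frame.__getitem__) for frame in frame_scores]
    # 2. Collapse consecutive repeats, 3. drop blanks
    collapsed = [idx for idx, _ in groupby(pred_ids)]
    text = "".join(labels[idx] for idx in collapsed if idx != blank_id)
    # 4. Turn the word delimiter into a space
    return text.replace(word_delimiter, " ")


labels = ["<pad>", "a", "b", "|"]
# Toy scores whose per-frame argmax is [1, 1, 0, 1, 2, 2, 3, 1]
frames = [
    [0.1, 0.8, 0.05, 0.05],
    [0.1, 0.7, 0.1, 0.1],
    [0.9, 0.05, 0.03, 0.02],
    [0.1, 0.6, 0.2, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.0, 0.2, 0.8, 0.0],
    [0.1, 0.1, 0.1, 0.7],
    [0.2, 0.6, 0.1, 0.1],
]
print(greedy_ctc_decode(frames, labels))  # aab a
```

The blank between the first two runs of `a` is what lets the output contain a double letter. It is also why the reference labels are decoded with `group_tokens=False`: they contain no blanks, so collapsing repeats would delete genuine double letters.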

In [None]:
results = timit["test"].map(map_to_result, remove_columns=timit["test"].column_names)

In [None]:
!pip install jiwer
from datasets import load_metric

wer_metric = load_metric("wer")

In [None]:
print("Test WER: {:.3f}".format(wer_metric.compute(predictions=results["pred_str"], references=results["text"])))
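
For reference, the word error rate is the number of word-level substitutions, deletions, and insertions divided by the number of reference words, summed over all utterances. The sketch below computes it with a word-level edit distance; `word_error_rate` is a hypothetical helper for illustration and not the implementation behind the `wer` metric, and the sentences are toy data.

```python
def word_error_rate(references, predictions):
    """Total word-level edit distance divided by total reference words."""
    total_errors = 0
    total_words = 0
    for ref, pred in zip(references, predictions):
        ref_words, pred_words = ref.split(), pred.split()
        # Levenshtein distance over words, keeping one row at a time
        prev = list(range(len(pred_words) + 1))
        for i, ref_word in enumerate(ref_words, 1):
            cur = [i]
            for j, pred_word in enumerate(pred_words, 1):
                cur.append(min(
                    prev[j] + 1,                           # deletion
                    cur[j - 1] + 1,                        # insertion
                    prev[j - 1] + (ref_word != pred_word)  # substitution or match
                ))
            prev = cur
        total_errors += prev[-1]
        total_words += len(ref_words)
    return total_errors / total_words


# Toy data: one substitution and one deletion over five reference words
toy_refs = ["the cat sat", "hello world"]
toy_preds = ["the cat sit", "hello"]
wer = word_error_rate(toy_refs, toy_preds)
print(f"{wer:.3f}")  # 0.400
```

The function assumes at least one non-empty reference; with none, the division by `total_words` would raise `ZeroDivisionError`.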

## Push to repo

In [None]:
!sudo apt-get install -y git-lfs tree

In [None]:
!huggingface-cli login

In [None]:
from huggingface_hub import Repository

repo = Repository(local_dir="wav2vec2-base-en-in", clone_from="crossdelenna/wav2vec2-base-en-in")

## Save LM processor

In [None]:
processor_with_lm.save_pretrained("wav2vec2-base-en-in")

In [None]:
!tree -h wav2vec2-base-en-in/

## Convert arpa to bin to reduce size

In [None]:
!kenlm/build/bin/build_binary wav2vec2-base-en-in/language_model/5gram_correct.arpa wav2vec2-base-en-in/language_model/5gram.bin

Great, it worked! Let's remove the `.arpa` file and check the size of the binary *5-gram* LM.

In [None]:
!rm wav2vec2-base-en-in/language_model/5gram_correct.arpa && tree -h wav2vec2-base-en-in/

## Push repo with LM to hub

In [None]:
#!cd wav2vec2-base-en-in
#!cd \content


In [None]:
!transformers-cli login
#transformers-cli repo create your-model-name

In [None]:
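# HF_TOKEN is a placeholder for your own access token (set it as a Python or environment variable first).
# Never commit a real token to a notebook.
!git clone https://crossdelenna:$HF_TOKEN@huggingface.co/crossdelenna/wav2vec2-base-en-in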


In [None]:
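# `!cd` runs in a throwaway subshell and has no lasting effect; `%cd` changes the notebook's working directory
%cd /content/wav2vec2-base-en-in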
!git lfs install
#!git config --global user.email "nifty.emini@gmail.com"
#!git config --global user.name "crossdelenna"


In [None]:
!git add .
!git commit -m "LM decoder"
!git push

In [None]:
repo.git_pull()

In [None]:
repo.push_to_hub(commit_message="Upload lm-boosted decoder")