# AI4Bharat ASR to HF Compatible Format

This notebook converts the AI4Bharat ASR models to the Hugging Face compatible "transformers" format. Once converted, the models can be used with the "automatic-speech-recognition" pipeline from transformers to transcribe speech.

This notebook focuses on converting the [indicwav2vec-kannada](https://github.com/AI4Bharat/IndicWav2Vec?tab=readme-ov-file#download-models) model to the Hugging Face compatible format. The same steps can be followed for other AI4Bharat ASR models.

You can run this on Google Colab or any other environment with a GPU.

## Installation and Setup

In [1]:
! sudo apt-get install -y build-essential libboost-all-dev cmake zlib1g-dev libbz2-dev liblzma-dev
! sudo add-apt-repository ppa:savoury1/ffmpeg4 -y && sudo apt-get update && sudo apt-get install -y ffmpeg

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
build-essential is already the newest version (12.9ubuntu3).
libbz2-dev is already the newest version (1.0.8-5build1).
libbz2-dev set to manually installed.
liblzma-dev is already the newest version (5.2.5-2ubuntu1).
liblzma-dev set to manually installed.
libboost-all-dev is already the newest version (1.74.0.3ubuntu7).
cmake is already the newest version (3.22.1-1ubuntu1.22.04.2).
zlib1g-dev is already the newest version (1:1.2.11.dfsg-2ubuntu9.2).
zlib1g-dev set to manually installed.
0 upgraded, 0 newly installed, 0 to remove and 49 not upgraded.
Repository: 'deb https://ppa.launchpadcontent.net/savoury1/ffmpeg4/ubuntu/ jammy main'
Description:
FFmpeg 4.4.5 builds (& associated multimedia packages) for Xenial & newer.

*** Anyone interested in full builds of FFmpeg 4.4.x including all "bells and whistles" needs to have donated, after which access to the new private PPA can be requested. 

In [2]:
!sudo apt install -y liblzma-dev libbz2-dev libzstd-dev libsndfile1-dev libopenblas-dev libfftw3-dev libgflags-dev libgoogle-glog-dev build-essential cmake libboost-system-dev libboost-thread-dev libboost-program-options-dev libboost-test-dev libeigen3-dev zlib1g-dev

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
build-essential is already the newest version (12.9ubuntu3).
libboost-program-options-dev is already the newest version (1.74.0.3ubuntu7).
libboost-program-options-dev set to manually installed.
libboost-system-dev is already the newest version (1.74.0.3ubuntu7).
libboost-system-dev set to manually installed.
libboost-thread-dev is already the newest version (1.74.0.3ubuntu7).
libboost-thread-dev set to manually installed.
libbz2-dev is already the newest version (1.0.8-5build1).
liblzma-dev is already the newest version (5.2.5-2ubuntu1).
libzstd-dev is already the newest version (1.4.8+dfsg-3build1).
libzstd-dev set to manually installed.
libboost-test-dev is already the newest version (1.74.0.3ubuntu7).
libboost-test-dev set to manually installed.
libopenblas-dev is already the newest version (0.3.20+ds-1).
cmake is already the newest version (3.22.1-1ubuntu1.22.04.2).
libsndfile1-dev is 

In [3]:
! pip install transformers datasets pyctcdecode soundfile
! pip install https://github.com/kpu/kenlm/archive/master.zip

Collecting datasets
  Downloading datasets-3.0.0-py3-none-any.whl.metadata (19 kB)
Collecting pyctcdecode
  Downloading pyctcdecode-0.5.0-py2.py3-none-any.whl.metadata (20 kB)
Collecting pyarrow>=15.0.0 (from datasets)
  Downloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (3.3 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting pygtrie<3.0,>=2.1 (from pyctcdecode)
  Downloading pygtrie-2.5.0-py3-none-any.whl.metadata (7.5 kB)
Collecting hypothesis<7,>=6.14 (from pyctcdecode)
  Downloading hypothesis-6.112.0-py3-none-any.whl.metadata (6.2 kB)
Downloading datasets-3.0.0-py3-none-any.whl (474 kB)

In [4]:
%cd /content

/content


In [5]:
!rm -rf IndicWav2Vec fairseq kenlm flashlight
!git clone https://github.com/AI4Bharat/IndicWav2Vec.git
!git clone https://github.com/pytorch/fairseq.git
!git clone https://github.com/kpu/kenlm.git
!git clone https://github.com/flashlight/flashlight.git

Cloning into 'IndicWav2Vec'...
remote: Enumerating objects: 1943, done.
remote: Counting objects: 100% (228/228), done.
remote: Compressing objects: 100% (85/85), done.
remote: Total 1943 (delta 158), reused 204 (delta 142), pack-reused 1715 (from 1)
Receiving objects: 100% (1943/1943), 139.60 MiB | 17.45 MiB/s, done.
Resolving deltas: 100% (360/360), done.
Updating files: 100% (759/759), done.
Cloning into 'fairseq'...
remote: Enumerating objects: 35337, done.
remote: Counting objects: 100% (254/254), done.
remote: Compressing objects: 100% (160/160), done.
remote: Total 35337 (delta 104), reused 215 (delta 87), pack-reused 35083 (from 1)
Receiving objects: 100% (35337/35337), 25.32 MiB | 12.56 MiB/s, done.
Resolving deltas: 100% (25594/25594), done.
Cloning into 'kenlm'...
remote: Enumerating objects: 14170, done.
remote: Counting objects: 100% (483/483), done.
remote: Compressing objects: 100% (337/337), done.
remote: Total 14170 (delta 166), reused 

In [None]:
!pip install "numpy<1.24"

In [1]:
!python -m pip install pip==24.0

Collecting pip==24.0
  Using cached pip-24.0-py3-none-any.whl.metadata (3.6 kB)
Using cached pip-24.0-py3-none-any.whl (2.1 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 24.1.2
    Uninstalling pip-24.1.2:
      Successfully uninstalled pip-24.1.2
Successfully installed pip-24.0


#### Build IndicWav2Vec

🔥 Remove `numpy==1.20.0` from the `/content/IndicWav2Vec/w2v_inference/requirements.txt` file before running the next cell 🔥
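
The manual edit above can also be scripted. The following is a minimal sketch, assuming the pin sits on its own line exactly as `numpy==1.20.0`; `drop_requirement` is a hypothetical helper name and is not part of IndicWav2Vec.

```python
from pathlib import Path


def drop_requirement(text: str, pinned: str) -> str:
    """Return the requirements text without any line equal to `pinned`."""
    kept = [line for line in text.splitlines() if line.strip() != pinned]
    return "\n".join(kept) + "\n"


# Path used by the install cell in this notebook.
req = Path("/content/IndicWav2Vec/w2v_inference/requirements.txt")
if req.exists():
    # Rewrite the file in place only when it is actually present.
    req.write_text(drop_requirement(req.read_text(), "numpy==1.20.0"))
```

Matching the full line keeps other packages whose names merely contain `numpy` untouched.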

In [None]:
%cd /content/IndicWav2Vec
!pip install packaging soundfile swifter -r w2v_inference/requirements.txt
%cd ..

#### Build Fairseq

In [1]:
%cd /content/fairseq
!git checkout cf8ff8c3c5242e6e71e8feb40de45dd699f3cc08
!pip install ./
%cd /content

/content/fairseq
Note: switching to 'cf8ff8c3c5242e6e71e8feb40de45dd699f3cc08'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at cf8ff8c3 Add unittests for jitting EMA model
Processing /content/fairseq
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Collecting hydra-core<1.1,>=1.0.7 (from fairseq==1.0.0a0+cf8ff8c)
  Downloading hydra_cor

## Build Model

Download model

In [2]:
!wget https://indic-asr-public.objectstore.e2enetworks.net/indic-superb/models/acoustic/kannada.pt -O /content/kn.pt

--2024-09-12 13:17:03--  https://indic-asr-public.objectstore.e2enetworks.net/indic-superb/models/acoustic/kannada.pt
Resolving indic-asr-public.objectstore.e2enetworks.net (indic-asr-public.objectstore.e2enetworks.net)... 164.52.210.97, 101.53.152.30, 164.52.206.154, ...
Connecting to indic-asr-public.objectstore.e2enetworks.net (indic-asr-public.objectstore.e2enetworks.net)|164.52.210.97|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3786635828 (3.5G) [application/zip]
Saving to: ‘/content/kn.pt’


2024-09-12 13:20:53 (15.9 MB/s) - ‘/content/kn.pt’ saved [3786635828/3786635828]



Install git-lfs

In [3]:
!curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | bash
!apt-get install git-lfs
!git lfs install

# Put in your details
!git config --global user.email "mishraaditya6991@gmail.com"
!git config --global user.name "adimyth"

Detected operating system as Ubuntu/jammy.
Checking for curl...
Detected curl...
Checking for gpg...
Detected gpg...
Detected apt version as 2.4.13
Running apt-get update... done.
Installing apt-transport-https... done.
Installing /etc/apt/sources.list.d/github_git-lfs.list...done.
Importing packagecloud gpg key... Packagecloud gpg key imported to /etc/apt/keyrings/github_git-lfs-archive-keyring.gpg
done.
Running apt-get update... done.

The repository is setup! You can now install packages.
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following packages were automatically installed and are no longer required:
  libcodec2-1.0 libmfx1 libsrt1.4-gnutls libva-drm2 libva-x11-2 libva2 libvpx7 libx264-163
Use 'apt autoremove' to remove them.
The following packages will be upgraded:
  git-lfs
1 upgraded, 0 newly installed, 0 to remove and 111 not upgraded.
Need to get 7,420 kB of archives.
After this operation, 6,051 kB of additional dis

Log in to the Hugging Face Hub

In [9]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) Y
Token is valid (permission: write).
Cannot authenticate through git-credential as no helper is defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub.
Run the following command in your ter

In [4]:
!pip install transformers==4.29.2

Collecting transformers==4.29.2
  Downloading transformers-4.29.2-py3-none-any.whl.metadata (112 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers==4.29.2)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Downloading transformers-4.29.2-py3-none-any.whl (7.1 MB)
Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
DEPRECATION: omegaconf 2.0.6 has a non-standard dependency specifier PyYAML>=5.1.*. pip 24.1 will enforce this behaviour change. A possible replacement is to upgrade to a newer version of 

In [5]:
import transformers

transformers.__version__

'4.29.2'

In [6]:
from transformers import Wav2Vec2Config
from huggingface_hub import create_repo, Repository

from transformers import pipeline, AutoModelForCTC, Wav2Vec2Processor, Wav2Vec2ProcessorWithLM

2024-09-12 13:23:07.642641: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-09-12 13:23:08.019013: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-09-12 13:23:08.109697: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


### Export models to Hugging Face

Create and Initialize Repo

In [7]:
repo_url = create_repo("indicwav2vec-kannada", private=True)

Clone the repo locally, then save config.json from a "similar" architecture on Hugging Face

In [8]:
repo = Repository(local_dir="indicwav2vec-kannada", clone_from=repo_url)

For more details, please read https://huggingface.co/docs/huggingface_hub/concepts/git_vs_http.
Cloning https://huggingface.co/adimyth/indicwav2vec-kannada into local empty directory.


In [9]:
config = Wav2Vec2Config.from_pretrained('facebook/wav2vec2-large-960h-lv60-self')
config.save_pretrained('indicwav2vec-kannada')



config.json:   0%|          | 0.00/1.61k [00:00<?, ?B/s]

In [10]:
# using the indicwav2vec-hindi config.json for indicwav2vec-kannada
import json

data = {
  "_name_or_path": "facebook/wav2vec2-large-960h-lv60-self",
  "activation_dropout": 0.1,
  "adapter_kernel_size": 3,
  "adapter_stride": 2,
  "add_adapter": False,
  "apply_spec_augment": True,
  "architectures": [
    "Wav2Vec2ForCTC"
  ],
  "attention_dropout": 0.1,
  "bos_token_id": 1,
  "classifier_proj_size": 256,
  "codevector_dim": 256,
  "contrastive_logits_temperature": 0.1,
  "conv_bias": True,
  "conv_dim": [
    512,
    512,
    512,
    512,
    512,
    512,
    512
  ],
  "conv_kernel": [
    10,
    3,
    3,
    3,
    3,
    2,
    2
  ],
  "conv_stride": [
    5,
    2,
    2,
    2,
    2,
    2,
    2
  ],
  "ctc_loss_reduction": "sum",
  "ctc_zero_infinity": False,
  "diversity_loss_weight": 0.1,
  "do_stable_layer_norm": True,
  "eos_token_id": 2,
  "feat_extract_activation": "gelu",
  "feat_extract_dropout": 0.0,
  "feat_extract_norm": "layer",
  "feat_proj_dropout": 0.1,
  "feat_quantizer_dropout": 0.0,
  "final_dropout": 0.1,
  "gradient_checkpointing": False,
  "hidden_act": "gelu",
  "hidden_dropout": 0.1,
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "layer_norm_eps": 1e-05,
  "layerdrop": 0.1,
  "mask_feature_length": 10,
  "mask_feature_min_masks": 0,
  "mask_feature_prob": 0.0,
  "mask_time_length": 10,
  "mask_time_min_masks": 2,
  "mask_time_prob": 0.05,
  "model_type": "wav2vec2",
  "num_adapter_layers": 3,
  "num_attention_heads": 16,
  "num_codevector_groups": 2,
  "num_codevectors_per_group": 320,
  "num_conv_pos_embedding_groups": 16,
  "num_conv_pos_embeddings": 128,
  "num_feat_extract_layers": 7,
  "num_hidden_layers": 24,
  "num_negatives": 100,
  "output_hidden_size": 1024,
  "pad_token_id": 0,
  "proj_codevector_dim": 256,
  "tdnn_dilation": [
    1,
    2,
    3,
    1,
    1
  ],
  "tdnn_dim": [
    512,
    512,
    512,
    512,
    1500
  ],
  "tdnn_kernel": [
    5,
    3,
    3,
    1,
    1
  ],
  "torch_dtype": "float32",
  "transformers_version": "4.19.2",
  "use_weighted_layer_sum": False,
  "vocab_size": 68,
  "xvector_output_dim": 512
}

with open('/content/config.json', 'w') as f:
    json.dump(data, f, indent=2)
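
The `conv_kernel` and `conv_stride` lists in this config determine how many feature frames the model produces per second of audio. The sketch below works this out with the usual unpadded 1-D convolution length formula. The 16 kHz sampling rate is an assumption (it is the rate wav2vec 2.0 models are normally trained on and is not stated in this notebook), and `num_frames` is a hypothetical helper name.

```python
import math

# Values copied from the config above.
conv_kernel = [10, 3, 3, 3, 3, 2, 2]
conv_stride = [5, 2, 2, 2, 2, 2, 2]


def num_frames(num_samples: int) -> int:
    """Number of frames the conv feature extractor emits for `num_samples` samples."""
    length = num_samples
    for kernel, stride in zip(conv_kernel, conv_stride):
        # Unpadded 1-D convolution: floor((L - kernel) / stride) + 1
        length = (length - kernel) // stride + 1
    return length


hop = math.prod(conv_stride)  # total downsampling factor, in input samples
print(hop)                # 320 samples, i.e. 20 ms at an assumed 16 kHz
print(num_frames(16000))  # 49 frames for one second of assumed 16 kHz audio
```

Each frame yields one row of CTC logits over the `vocab_size` output symbols.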

In [14]:
# downloading the dictionary as per the github repo
!wget https://indic-asr-public.objectstore.e2enetworks.net/indic-superb/models/acoustic/kannada.dict.txt -O /content/dict.ltr.txt

--2024-09-12 13:24:17--  https://indic-asr-public.objectstore.e2enetworks.net/indic-superb/models/acoustic/kannada.dict.txt
Resolving indic-asr-public.objectstore.e2enetworks.net (indic-asr-public.objectstore.e2enetworks.net)... 164.52.210.97, 101.53.152.30, 164.52.206.154, ...
Connecting to indic-asr-public.objectstore.e2enetworks.net (indic-asr-public.objectstore.e2enetworks.net)|164.52.210.97|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 457 [text/plain]
Saving to: ‘/content/dict.ltr.txt’


2024-09-12 13:24:19 (298 MB/s) - ‘/content/dict.ltr.txt’ saved [457/457]
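
A quick sanity check is to compare the downloaded dictionary against the `vocab_size` in the config above. This is a minimal sketch, assuming the file follows the usual fairseq layout of one `symbol count` pair per line and that fairseq adds its four default special symbols (`<s>`, `<pad>`, `</s>`, `<unk>`); `fairseq_vocab_size` is a hypothetical helper name.

```python
from pathlib import Path


def fairseq_vocab_size(dict_text: str, n_special: int = 4) -> int:
    """Count the symbols in a fairseq dictionary file plus the special symbols."""
    # Each non-empty line is "<symbol> <count>"; only the symbol matters here.
    symbols = [line.split()[0] for line in dict_text.splitlines() if line.strip()]
    return n_special + len(symbols)


dict_path = Path("/content/dict.ltr.txt")
if dict_path.exists():
    # Compare this number with "vocab_size" in /content/config.json.
    print(fairseq_vocab_size(dict_path.read_text(encoding="utf-8")))
```

A mismatch between the printed value and the config is worth investigating before running the conversion.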



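Convert the ASR model to the Hugging Face format

Before converting, override the `merge` function in `/usr/local/lib/python3.10/dist-packages/omegaconf/omegaconf.py` with the version below. It temporarily disables struct mode on `eval_wer_config` and sets a default `eos_token` there before merging, and it prints the configs being merged for debugging: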

```python
    def merge(
        *others: Union[BaseContainer, Dict[str, Any], List[Any], Tuple[Any, ...], Any]
    ) -> Union[ListConfig, DictConfig]:
        """Merge a list of previously created configs into a single one"""
        assert len(others) > 0
        target = copy.deepcopy(others[0])
        target = _ensure_container(target)
        assert isinstance(target, (DictConfig, ListConfig))


        print("="*90)
        print(f"Target: {type(target)}")
        print(target)
        print("="*90)

        print("="*90)
        print(f"Others: {[type(o) for o in others[1:]]}")  # type(*args) fails unless exactly one config is passed
        print(*others[1:])
        print("="*90)

        # with flag_override(target, "readonly", False):
        #     target.merge_with(*others[1:])
        #     turned_readonly = target._get_flag("readonly") is True

        with flag_override(target, "readonly", False):
            if 'eval_wer_config' in target:
              OmegaConf.set_struct(target.eval_wer_config, False)  # Allow adding new keys
              
              if 'eos_token' not in target.eval_wer_config:
                  target.eval_wer_config.eos_token = "</s>"  # or whatever default you want
            
            # Proceed with the merge
            target.merge_with(*others[1:])
            

            # Optionally, re-enable struct to lock the structure
            if 'eval_wer_config' in target:
              OmegaConf.set_struct(target.eval_wer_config, True)


            turned_readonly = target._get_flag("readonly") is True

        if turned_readonly:
            OmegaConf.set_readonly(target, True)

        return target
```
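
The `dist-packages` path above is specific to the Colab image used here. In another environment, the file to patch can be located from Python. This is a minimal sketch using only the standard library; `module_file` is a hypothetical helper name.

```python
import importlib.util
from pathlib import Path


def module_file(package: str, filename: str):
    """Return the path of `filename` inside an installed package, or None."""
    spec = importlib.util.find_spec(package)
    if spec is None or spec.origin is None:
        return None  # package is not installed in this environment
    # spec.origin points at the package's __init__.py; the target sits beside it.
    return Path(spec.origin).parent / filename


# Prints the location of the file that holds OmegaConf.merge, if installed.
print(module_file("omegaconf", "omegaconf.py"))
```

Since the patch edits an installed package, reinstalling or upgrading omegaconf will undo it.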

In [None]:
%cd "/content/IndicWav2Vec"
!python workshop-2022/utils/convert_wav2vec2_original_pytorch_checkpoint_to_pytorch.py \
    --pytorch_dump_folder /content/indicwav2vec-kannada \
    --checkpoint_path /content/kn.pt \
    --config_path /content/config.json \
    --dict_path /content/dict.ltr.txt
%cd /content

/content/IndicWav2Vec
2024-09-12 14:00:46.753223: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-09-12 14:00:46.783555: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-09-12 14:00:46.792464: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
loading configuration file /content/config.json
Model config Wav2Vec2Config {
  "_name_or_path": "facebook/wav2vec2-large-960h-lv60-self",
  "activation_dropout": 0.1,
  "adapter_kernel_size": 3,
  "adapter_stride": 2,
  "add_adapter": false,
  "apply_spec_augment": true,
  "architectures": [
    "Wav2Vec2ForCTC"
  ],
  "attention_dropout": 

Push to Huggingface Model Hub
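
Before pushing, it can help to confirm that the conversion step wrote everything the pipeline needs. The sketch below checks for the six files that appear in the push log of this notebook; treat the list as an assumption for other checkpoints, and `missing_files` as a hypothetical helper name.

```python
from pathlib import Path

# File names taken from the commit output of this notebook.
EXPECTED_FILES = [
    "config.json",
    "preprocessor_config.json",
    "pytorch_model.bin",
    "special_tokens_map.json",
    "tokenizer_config.json",
    "vocab.json",
]


def missing_files(folder, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from `folder`."""
    folder = Path(folder)
    return [name for name in expected if not (folder / name).is_file()]


# An empty list means every expected file is present.
print(missing_files("/content/indicwav2vec-kannada"))
```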

In [26]:
%cd "/content/indicwav2vec-kannada"
!huggingface-cli lfs-enable-largefiles .
!git lfs track "*.binary"
!git add .
!git commit -m "added language model"
!git push origin main
%cd /content

/content/indicwav2vec-kannada
Local repo set up for largefiles
Tracking "*.binary"
[main 610171e] added language model
 7 files changed, 215 insertions(+)
 create mode 100644 config.json
 create mode 100644 preprocessor_config.json
 create mode 100644 pytorch_model.bin
 create mode 100644 special_tokens_map.json
 create mode 100644 tokenizer_config.json
 create mode 100644 vocab.json
Uploading LFS objects: 100% (1/1), 1.3 GB | 39 MB/s, done.
Enumerating objects: 11, done.
Counting objects: 100% (11/11), done.
Delta compression using up to 2 threads
Compressing objects: 100% (9/9), done.
Writing objects: 100% (9/9), 2.14 KiB | 2.14 MiB/s, done.
Total 9 (delta 1), reused 0 (delta 0), pack-reused 0
To https://huggingface.co/adimyth/indicwav2vec-kannada
   7b70ce4..610171e  main -> main
/content
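
Once the push has finished, the converted checkpoint can be used through the "automatic-speech-recognition" pipeline. This is a minimal sketch rather than a tested recipe: the repo id is taken from the push log above, the audio file name is hypothetical, and because the repo was created as private it assumes you are still logged in with `huggingface-cli login`.

```python
def transcribe(audio_path: str, model_id: str = "adimyth/indicwav2vec-kannada") -> str:
    """Transcribe one audio file with the converted checkpoint."""
    # Imported lazily so that defining the helper does not load transformers.
    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model=model_id)
    # For a file path, the pipeline decodes the audio with ffmpeg at the
    # sampling rate stored in preprocessor_config.json.
    return asr(audio_path)["text"]


# Example call (hypothetical file name):
# print(transcribe("/content/sample_kn.wav"))
```

Passing the local folder `/content/indicwav2vec-kannada` as `model_id` should work the same way without contacting the Hub.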
