# MMS: Scaling Speech Technology to 1000+ languages with OpenVINO™

The Massively Multilingual Speech (MMS) project expands speech technology from about 100 languages to over 1,000 by building a single multilingual speech recognition model supporting over 1,100 languages (more than 10 times as many as before), language identification models able to identify over 4,000 languages (40 times more than before), pretrained models supporting over 1,400 languages, and text-to-speech models for over 1,100 languages.
The MMS model was proposed in [Scaling Speech Technology to 1,000+ Languages](https://arxiv.org/abs/2305.13516). The models and code were originally released [here](https://github.com/facebookresearch/fairseq/tree/main/examples/mms).
The MMS project open-sources several kinds of models: Automatic Speech Recognition (ASR), Language Identification (LID), and Speech Synthesis (TTS). This example covers ASR and LID.

<a id="0"></a>
### Table of contents:
- [Install prerequisites](#1)
- [Automatic Speech Recognition (ASR)](#2)
  - [Download pretrained model and processor](#3)
  - [Prepare an example audio](#4)
  - [Make inference with the original model](#5)
  - [Convert to OpenVINO IR model and make inference](#6)
- [Language Identification (LID)](#7)
  - [Download pretrained model and processor](#8)
  - [Make inference with the original model](#9)
  - [Convert to OpenVINO IR model and make inference](#10)

<a name='1'></a>
## Install prerequisites

In [None]:
!pip install -q --upgrade pip 
!pip install -q datasets transformers accelerate "openvino==2023.1.0.dev20230811" torch soundfile

In [None]:
from pathlib import Path

import torch

import openvino

<a name='2'></a>
## Automatic Speech Recognition (ASR)

<a name='3'></a>
### Download pretrained model and processor
Download the pretrained model and processor. By default, MMS loads adapter weights for English. To load adapter weights for another language, pass `target_lang=<your-chosen-target-lang>` together with `ignore_mismatched_sizes=True`. The `ignore_mismatched_sizes=True` keyword allows the language model head to be resized according to the vocabulary of the specified language. The processor should be loaded with the same target language.
It is also possible to switch to another supported language later.

In [None]:
from transformers import Wav2Vec2ForCTC, AutoProcessor
model_id = "facebook/mms-1b-all"

asr_processor = AutoProcessor.from_pretrained(model_id)
asr_model = Wav2Vec2ForCTC.from_pretrained(model_id)
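
As an alternative to the default English adapter, the keywords described above can be passed at load time. The following is a minimal sketch, not part of the original notebook: `load_asr_for_language` is a hypothetical helper name, and `"fra"` (French) is only an illustrative language code.

```python
def load_asr_for_language(model_id: str = "facebook/mms-1b-all", target_lang: str = "fra"):
    """Load the MMS ASR processor and model with a non-default language adapter.

    Hypothetical helper for illustration; the language code is an assumption.
    """
    # Imported lazily so that defining the helper does not require transformers.
    from transformers import AutoProcessor, Wav2Vec2ForCTC

    # The processor should be loaded with the same target language as the model.
    processor = AutoProcessor.from_pretrained(model_id, target_lang=target_lang)

    # ignore_mismatched_sizes=True allows the language model head to be
    # resized according to the vocabulary of the chosen language.
    model = Wav2Vec2ForCTC.from_pretrained(
        model_id,
        target_lang=target_lang,
        ignore_mismatched_sizes=True,
    )
    return processor, model
```

Calling `load_asr_for_language()` downloads the checkpoint, so it is left uncalled here.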

You can look at all supported languages:

In [None]:
asr_processor.tokenizer.vocab.keys()

<a name='4'></a>
### Prepare an example audio
Read an audio file and process the audio data. Make sure that the audio data is sampled at 16,000 Hz (16 kHz).
For this example we will use [a streamable version of the Multilingual LibriSpeech (MLS) dataset](https://huggingface.co/datasets/multilingual_librispeech). It contains examples in 7 languages: `'german', 'dutch', 'french', 'spanish', 'italian', 'portuguese', 'polish'`.
Let's use `'german'`. Specify `streaming=True` to avoid downloading the entire dataset.

In [None]:
from datasets import load_dataset

mls = load_dataset("facebook/multilingual_librispeech", "german", split="test", streaming=True)
mls = iter(mls)  # turn the dataset into an iterator

example = next(mls)  # get one example

The example is a dictionary. It contains the audio data and a text transcription.

In [None]:
print(example)  # look at structure
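
Since the model expects 16 kHz input, it can help to verify the sampling rate before running inference. The sketch below assumes the audio entry is a dictionary with `"array"` and `"sampling_rate"` keys (check the printed structure to confirm); `check_audio` is a hypothetical helper, and the toy dictionary stands in for `example['audio']`.

```python
def check_audio(audio: dict, expected_rate: int = 16_000) -> float:
    """Validate the sampling rate and return the clip duration in seconds."""
    rate = audio["sampling_rate"]
    if rate != expected_rate:
        raise ValueError(
            f"Expected {expected_rate} Hz audio, got {rate} Hz; resample first."
        )
    # Duration is the number of samples divided by samples per second.
    return len(audio["array"]) / rate


# Toy stand-in for example['audio']: 2 seconds of silence at 16 kHz.
toy_audio = {"array": [0.0] * 32_000, "sampling_rate": 16_000}
print(check_audio(toy_audio))  # 2.0
```

With the real data this would be called as `check_audio(example['audio'])`.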

Switch out the language adapters by calling the `load_adapter()` function for the model and `set_target_lang()` for the tokenizer. Pass the target language as an input - `"deu"` for German.

In [None]:
asr_processor.tokenizer.set_target_lang("deu")
asr_model.load_adapter("deu")

<a name='5'></a>
### Make inference with the original model

In [None]:
inputs = asr_processor(example['audio']['array'], sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    outputs = asr_model(**inputs).logits

ids = torch.argmax(outputs, dim=-1)[0]
transcription = asr_processor.decode(ids)
print(transcription)
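
The `argmax` step yields one token id per audio frame, so the raw sequence contains repeats and blank frames; decoding a CTC output merges consecutive duplicates and then drops the blanks. The following pure-Python sketch illustrates that idea on toy ids. It is not the tokenizer's actual implementation, and `blank_id=0` is an assumption made for the example.

```python
from itertools import groupby


def ctc_greedy_collapse(frame_ids, blank_id=0):
    """Collapse per-frame CTC predictions into a token sequence."""
    # groupby merges runs of identical consecutive ids into one entry;
    # blank entries are then dropped. A blank between two equal ids
    # keeps them as two separate tokens.
    return [token for token, _ in groupby(frame_ids) if token != blank_id]


# Toy per-frame ids: 0 is the (assumed) blank id.
frame_ids = [0, 3, 3, 0, 3, 5, 5, 0, 0, 2]
print(ctc_greedy_collapse(frame_ids))  # [3, 3, 5, 2]
```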

<a name='6'></a>
### Convert to OpenVINO IR model and make inference
Convert the model to the OpenVINO IR format directly with the `openvino.convert_model` function, then use the `openvino.save_model` function to serialize the result of the conversion.

In [None]:
MAX_SEQ_LENGTH = 30480

input_values = torch.zeros([1, MAX_SEQ_LENGTH], dtype=torch.float)
attention_mask = torch.ones([1, MAX_SEQ_LENGTH], dtype=torch.int32)
asr_model_xml_path = Path('models/ov_asr_model.xml')

if not asr_model_xml_path.exists():
    asr_model_xml_path.parent.mkdir(parents=True, exist_ok=True)
    converted_model = openvino.convert_model(asr_model, example_input={'input_values': input_values})
    openvino.save_model(converted_model, asr_model_xml_path)

Compile the model.

In [None]:
core = openvino.Core()

compiled_asr_model = core.compile_model(asr_model_xml_path, device_name='CPU')

Run inference.

In [None]:
inputs = asr_processor(example['audio']['array'], sampling_rate=16_000, return_tensors="pt")
outputs = compiled_asr_model(inputs['input_values'])[0]

ids = torch.argmax(torch.from_numpy(outputs), dim=-1)[0]
transcription = asr_processor.decode(ids)
print(transcription)
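
One rough way to judge a transcription is to compare it with the reference text that comes with the example, using the word error rate (WER): the word-level edit distance divided by the number of reference words. This check is not part of the original notebook; `word_error_rate` is a hypothetical helper and the strings are toy data.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of four reference words.
print(word_error_rate("das ist ein test", "das ist kein test"))  # 0.25
```

The reference text and the model output may differ in casing and punctuation, so normalizing both strings first is usually sensible.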

<a name='7'></a>
## Language Identification (LID) 

<a name='8'></a>
### Download pretrained model and processor
Different LID models are available, depending on the number of languages they can recognize: 126, 256, 512, 1024, 2048, and 4017. We will use the 126-language model.

In [None]:
from transformers import Wav2Vec2ForSequenceClassification, AutoFeatureExtractor

model_id = "facebook/mms-lid-126"

lid_processor = AutoFeatureExtractor.from_pretrained(model_id)
lid_model = Wav2Vec2ForSequenceClassification.from_pretrained(model_id)

<a name='9'></a>
### Make inference with the original model

In [None]:
inputs = lid_processor(example['audio']['array'], sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    outputs = lid_model(**inputs).logits

lang_id = torch.argmax(outputs, dim=-1)[0].item()
detected_lang = lid_model.config.id2label[lang_id]
print(detected_lang)
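
The `argmax` above returns only the single most likely language. To see how confident the model is, the logits can be turned into probabilities with a softmax and the top candidates listed. The sketch below uses toy logits and a toy label mapping; `top_k_languages` is a hypothetical helper, not part of the original notebook.

```python
import math


def top_k_languages(logits, id2label, k=3):
    """Return the k most probable (label, probability) pairs."""
    # Numerically stable softmax: subtract the maximum before exponentiating.
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank class indices by probability, highest first.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]


# Toy values standing in for the model's logits and label mapping.
toy_logits = [2.0, 0.5, -1.0]
toy_id2label = {0: "deu", 1: "nld", 2: "fra"}
for label, prob in top_k_languages(toy_logits, toy_id2label, k=2):
    print(f"{label}: {prob:.3f}")
```

With the real model, the inputs would be `outputs[0].tolist()` and `lid_model.config.id2label`.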

<a name='10'></a>
### Convert to OpenVINO IR model and make inference

In [None]:
MAX_SEQ_LENGTH = 30480

input_values = torch.zeros([1, MAX_SEQ_LENGTH], dtype=torch.float)
attention_mask = torch.ones([1, MAX_SEQ_LENGTH], dtype=torch.int32)
lid_model_xml_path = Path('models/ov_lid_model.xml')

if not lid_model_xml_path.exists():
    lid_model_xml_path.parent.mkdir(parents=True, exist_ok=True)
    converted_model = openvino.convert_model(lid_model, example_input={'input_values': input_values})
    openvino.save_model(converted_model, lid_model_xml_path)

Compile the model.

In [None]:
core = openvino.Core()

compiled_lid_model = core.compile_model(lid_model_xml_path, device_name='CPU')

Now you can run inference.

In [None]:
def detect_lang(audio_data):
    inputs = lid_processor(audio_data, sampling_rate=16_000, return_tensors="pt")
    
    outputs = compiled_lid_model(inputs['input_values'])[0]
    
    lang_id = torch.argmax(torch.from_numpy(outputs), dim=-1)[0].item()
    detected_lang = lid_model.config.id2label[lang_id]
    
    return detected_lang

In [None]:
detect_lang(example['audio']['array'])

Let's check another language.

In [None]:
mls = load_dataset("facebook/multilingual_librispeech", "french", split="test", streaming=True)
mls = iter(mls)

example = next(mls)
print(example['text'])

In [None]:
detect_lang(example['audio']['array'])