# Section 1. Getting Started

Disclaimer: Unfortunately, the notebook for this practical course cannot be executed as-is. It cannot be run on Google Colab because of its memory requirements. You can, however, use it as a guide for setting up your own implementation on bwUniCluster.

## 1.1. Install SALMONN

First, clone the repository:

In [None]:
# !git clone https://github.com/bytedance/SALMONN.git
# %cd SALMONN

Install the necessary packages. You should mostly follow the instructions from the [SALMONN repository](https://github.com/bytedance/SALMONN/); however, a few modifications are required, as described below.

First, create a Python `3.9.17` environment, for example with:

In [None]:
# !conda create -n salmonn python=3.9.17
# !conda activate salmonn

Contrary to the instructions in the SALMONN repository, install an **updated version** of `torch` and `torchaudio` following the [PyTorch website](https://pytorch.org/get-started/locally/), for example:

In [None]:
# !conda install pytorch torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia

Install the rest of the packages according to the `requirements.txt` files provided in the SALMONN repository.

In [None]:
# !pip install peft==0.3.0
# !pip install soundfile
# !pip install librosa
# !pip install transformers==4.28.0
# !pip install sentencepiece==0.1.97
# !pip install accelerate==0.20.3
# !pip install bitsandbytes==0.35.0
# !pip install gradio==3.23.0

Install other necessary packages:

In [None]:
# !pip install omegaconf
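Before moving on, it can help to confirm that the environment actually contains everything installed above. The following is a minimal sketch that only checks whether each package can be found by Python; it does not verify version numbers or GPU support. The list `REQUIRED` simply repeats the package names from the install cells, and the helper name `missing_packages` is made up for this example.

```python
import importlib.util

# Package names taken from the install cells above (import name == pip name here).
REQUIRED = [
    "torch", "torchaudio", "peft", "soundfile", "librosa", "transformers",
    "sentencepiece", "accelerate", "bitsandbytes", "gradio", "omegaconf",
]


def missing_packages(names):
    """Return the names that cannot be located in the active environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
print("missing packages:", missing if missing else "none")
```

If the list is not empty, the `salmonn` environment is probably not the active one, or one of the install steps failed.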

## 1.2. Download models

To reduce the computational cost, we will work with the smaller version of SALMONN, which has 7B parameters. We need to download the required pre-trained models. Most of them are available on Hugging Face:

In [None]:
# !huggingface-cli download openai/whisper-large-v2
# !huggingface-cli download lmsys/vicuna-7b-v1.5
# !huggingface-cli download tsinghua-ee/SALMONN-7B

The `BEATs` model can be downloaded from this [link](https://1drv.ms/u/s!AqeByhGUtINrgcpj8ujXH1YUtxooEg?e=E9Ncea).

# Section 2. Data Preparation

Our data should be put into a standard format. An example is provided by SALMONN, which can be found in `data/example_data.json`. Let's have a look:

In [None]:
import json
with open("data/example_data.json", 'r') as f:
    data = json.load(f)
data

{'annotation': [{'path': '/data/LibriSpeech/train-clean-100/103/1240/103-1240-0000.flac',
   'text': 'Chapter one missus rachel lynde is surprised missus rachel lynde lived just where the avonlea main road dipped down into a little hollow fringed with alders and ladies eardrops and traversed by a brook',
   'task': 'asr'},
  {'path': '/data/LibriSpeech/train-clean-100/103/1240/103-1240-0001.flac',
   'text': "That had its source away back in the woods of the old cuthbert place it was reputed to be an intricate headlong brook in its earlier course through those woods with dark secrets of pool and cascade but by the time it reached lynde's hollow it was a quiet well conducted little stream",
   'task': 'asr'}]}

As can be seen, we should format our data such that each sample consists of:
- Path to the audio
- Target text (the translation in our case)
- Task name. The task name should match one of the keys in the prompt files, which can be found in the `prompts/` directory.
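A file in this layout could be produced as sketched below. This is only an illustration: the helper names `build_annotation` and `write_annotation` are made up for this example, and the single sample uses a path and translation from the CoVoST2 subset used in this notebook. Replace them with your own data.

```python
import json
from pathlib import Path


def build_annotation(samples, task):
    """Turn (audio_path, target_text) pairs into the annotation layout shown above."""
    return {
        "annotation": [
            {"path": str(path), "text": text, "task": task}
            for path, text in samples
        ]
    }


def write_annotation(annotation, out_path):
    """Write the annotation dict as UTF-8 JSON, keeping umlauts human-readable."""
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", encoding="utf-8") as f:
        json.dump(annotation, f, ensure_ascii=False, indent=2)


# Illustrative sample; in practice, collect one pair per audio file.
samples = [
    ("data/covost2-test-en-de_proccessed/SRCAUDIO.en/audio_30.wav",
     "Was raten Sie mir, mein Herr?"),
]
annotation = build_annotation(samples, task="translation_en2de")
print(json.dumps(annotation, ensure_ascii=False, indent=2))
# write_annotation(annotation, "data/train_data.json")  # uncomment to save
```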


I have created a minimal example of the data required for speech translation fine-tuning. The data is a subset of [CoVoST2](https://github.com/facebookresearch/covost), English-to-German translation. Note that this is a dummy dataset, with only 30 samples for training, 10 for dev and 10 for testing. It is only used for demonstration, not to develop actual working ST systems.

You can download it from [here](https://drive.google.com/file/d/19QTZy63Y1oejH_7g1ziCpkrEuyJARQZD/view?usp=sharing) for inspiration to format your own dataset.

# Section 3. Translation with raw SALMONN

SALMONN is expected to be able to translate already, given the large amount of data it was pre-trained on. Let's see how SALMONN performs on translating CoVoST English audio to German text, without any fine-tuning.

To use SALMONN for inference, we should make use of the `cli_inference.py` script and change the config file at `configs/decode_config.yaml` according to our use case. Specifically, in `configs/decode_config.yaml`, we should change:
- `llama_path`: path to the Vicuna model, which should be `lmsys/vicuna-7b-v1.5`
- `whisper_path`: path to the Whisper model, which should be `openai/whisper-large-v2`
- `beats_path`: path to the BEATs model downloaded above
- `ckpt`: path to the SALMONN model checkpoint (the `*.pth` file, not the outer directory)

To lower the computational cost, we can also set:
- `low_resource: True`
- `lora_alpha: 28`
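The changes listed above can also be applied programmatically instead of by hand. The sketch below works on a plain dictionary and sets each key wherever it already exists, so it does not depend on how the keys are nested. The nesting of the example `cfg`, its placeholder values, the helper name `override_keys`, and the `/path/to/...` entries are all hypothetical; check them against the real `configs/decode_config.yaml`.

```python
def override_keys(cfg, overrides):
    """Set every key from `overrides` wherever it already exists in the nested dict `cfg`.

    Returns the set of keys that were found and updated.
    """
    updated = set()
    for key, value in list(cfg.items()):
        if key in overrides:
            cfg[key] = overrides[key]
            updated.add(key)
        elif isinstance(value, dict):
            updated |= override_keys(value, overrides)
    return updated


# Hypothetical stand-in for the loaded YAML file; the real nesting may differ.
cfg = {
    "model": {
        "llama_path": "", "whisper_path": "", "beats_path": "", "ckpt": "",
        "low_resource": False, "lora_alpha": 32,
    },
    "generate": {"max_new_tokens": 200},
}

overrides = {
    "llama_path": "lmsys/vicuna-7b-v1.5",
    "whisper_path": "openai/whisper-large-v2",
    "beats_path": "/path/to/BEATs.pt",      # placeholder: your downloaded BEATs file
    "ckpt": "/path/to/salmonn_7b.pth",      # placeholder: the *.pth checkpoint file
    "low_resource": True,
    "lora_alpha": 28,
}

updated = override_keys(cfg, overrides)
missing = set(overrides) - updated
print("updated:", sorted(updated))
print("not found in config:", sorted(missing))
```

With the real file, the dictionary could be obtained with `OmegaConf.to_container(OmegaConf.load(path))`, since `omegaconf` is already installed. Any key reported as not found should be added manually after checking the file.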

After that, we can run the inference script:

In [None]:
!python3 cli_inference.py --cfg-path configs/decode_config.yaml

Loading checkpoint shards: 100%|██████████████████| 2/2 [00:12<00:00,  6.47s/it]
trainable params: 4194304 || all params: 6742618112 || trainable%: 0.06220586618327525
Loading training prompts done!
Your Wav Path:
data/covost2-test-en-de_proccessed/SRCAUDIO.en/audio_30.wav
Your Prompt:
Listen to the speech and translate it into German.
Output:
<s> Was haben Sie mir empfohlen, Herr?</s>


Let's have a look at the sample we just provided to the model:

In [None]:
with open("data/dev_data.json", 'r') as f:
    dev_data = json.load(f)
dev_data['annotation'][0]

{'path': 'data/covost2-test-en-de_proccessed/SRCAUDIO.en/audio_30.wav',
 'text': 'Was raten Sie mir, mein Herr?',
 'task': 'translation_en2de'}

In [None]:
from IPython.display import Audio

# Play back the audio sample that was passed to the model
Audio("data/covost2-test-en-de_proccessed/SRCAUDIO.en/audio_30.wav")

As can be seen, the model already produces a plausible English-to-German translation for this one sample, although its wording differs from the reference.
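To compare the model output with the reference directly, the `<s>` and `</s>` markers printed by the script have to be removed first. The snippet below is a small sketch using the two strings shown above; the helper name `strip_special_tokens` is made up for this example.

```python
def strip_special_tokens(text, tokens=("<s>", "</s>")):
    """Remove sentence-boundary markers and surrounding whitespace from model output."""
    for token in tokens:
        text = text.replace(token, "")
    return text.strip()


hypothesis = strip_special_tokens("<s> Was haben Sie mir empfohlen, Herr?</s>")
reference = "Was raten Sie mir, mein Herr?"

print("HYP:", hypothesis)
print("REF:", reference)
print("exact match:", hypothesis == reference)
```

The two sentences are not identical even though both are reasonable translations, which is why a single sample and exact matching say little about translation quality.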

# Section 4. Fine-tuning

Fine-tuning can potentially help improve the performance of SALMONN on the translation task, e.g., when the translation domain is rare. In this section, we will try to fine-tune SALMONN on the tiny subset of CoVoST.

To fine-tune SALMONN, we should make use of the `train.py` script and change the config file at `configs/config.yaml` according to our use case.

### Changes in `configs/config.yaml`
- `llama_path`, `whisper_path`, `beats_path`, `ckpt`: same as in Section 3
- `train_ann_path`, `valid_ann_path`, `test_ann_path`: paths to the formatted data from Section 2
- `output_dir`: path to output directory

To lower the computational cost, we can also set:
- `low_resource: True`
- `lora_alpha: 28`

You can have a closer look into the config files to change the hyperparameters according to your needs.
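When choosing hyperparameters, it is worth checking how they relate to the size of the dataset. The sketch below does this arithmetic for the values that appear in the run log later in this section, together with the 30 training samples of the dummy dataset. It assumes that one "epoch" corresponds to `iters_per_epoch` optimizer steps and that training runs for `max_epoch` such epochs; verify this against `train.py` before relying on it.

```python
# Values copied from the "Running Parameters" log of this notebook's training run.
train_samples = 30       # size of the dummy training set
batch_size_train = 4
iters_per_epoch = 100
max_epoch = 30
warmup_steps = 3000

# Assumption: total optimizer steps = iterations per epoch * number of epochs.
total_iters = iters_per_epoch * max_epoch
samples_per_epoch = iters_per_epoch * batch_size_train
passes_per_epoch = samples_per_epoch / train_samples

print("total iterations:", total_iters)                        # 3000
print("samples drawn per epoch:", samples_per_epoch)           # 400
print("passes over the data per epoch:", round(passes_per_epoch, 1))  # 13.3
print("warmup covers the whole run:", warmup_steps >= total_iters)    # True
```

Under these assumptions, the warmup phase spans the entire run and every "epoch" revisits each sample about 13 times, so `warmup_steps` and `iters_per_epoch` are natural candidates to reduce for a dataset this small.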


### Add prompts to our task

We should add the prompts for training to the files in `prompts/`, under the same key as the task name in our training data. For example:
```
"translation_en2de": [
    "<Speech><SpeechHere></Speech> Can you translate the speech into German?",
    "<Speech><SpeechHere></Speech> Listen to the speech and translate it into German.",
    "<Speech><SpeechHere></Speech> Bitte übersetzen Sie den Inhalt dieser Aufnahme ins Deutsche.",
],
```
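Adding such an entry can be scripted, which avoids JSON syntax mistakes when editing by hand. The sketch below assumes that a prompt file is a JSON object mapping task names to lists of templates, as in the example above, and that every template must contain the `<SpeechHere>` placeholder, as all three example templates do. The helper name `add_task_prompts` and the existing `asr` entry are made up for illustration.

```python
import json


def add_task_prompts(prompts, task, templates):
    """Return a copy of `prompts` with `templates` registered under `task`.

    Templates without the <SpeechHere> placeholder are rejected (assumption:
    the placeholder marks where the audio is inserted into the prompt).
    """
    for template in templates:
        if "<SpeechHere>" not in template:
            raise ValueError(f"template lacks <SpeechHere>: {template!r}")
    merged = dict(prompts)
    existing = list(merged.get(task, []))
    merged[task] = existing + [t for t in templates if t not in existing]
    return merged


# Hypothetical existing content of a prompt file.
prompts = {"asr": ["<Speech><SpeechHere></Speech> Recognize the speech."]}

merged = add_task_prompts(prompts, "translation_en2de", [
    "<Speech><SpeechHere></Speech> Can you translate the speech into German?",
    "<Speech><SpeechHere></Speech> Listen to the speech and translate it into German.",
    "<Speech><SpeechHere></Speech> Bitte übersetzen Sie den Inhalt dieser Aufnahme ins Deutsche.",
])
print(json.dumps(merged, ensure_ascii=False, indent=2))
```

To apply this to a real file, load it with `json.load`, call the helper, and write the result back with `json.dump(..., ensure_ascii=False, indent=2)`.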

### Run the training script

In [None]:
!python3 train.py --cfg-path configs/config.yaml

Not using distributed mode
2024-11-26 21:43:44,835 [INFO] 
=====  Running Parameters    =====
2024-11-26 21:43:44,836 [INFO] {
    "accum_grad_iters": 1,
    "amp": true,
    "batch_size_eval": 4,
    "batch_size_train": 4,
    "device": "cuda",
    "dist_url": "env://",
    "epoch_based": false,
    "evaluate": false,
    "iters_per_epoch": 100,
    "log_freq": 5,
    "num_workers": 8,
    "optims": {
        "beta2": 0.999,
        "init_lr": 3e-05,
        "max_epoch": 30,
        "min_lr": 1e-05,
        "warmup_start_lr": 1e-06,
        "warmup_steps": 3000,
        "weight_decay": 0.05
    },
    "output_dir": "out",
    "seed": 42,
    "use_distributed": false,
    "world_size": 1
}
2024-11-26 21:43:44,836 [INFO] 
2024-11-26 21:43:44,836 [INFO] {
    "test_ann_path": "data/test_data.json",
    "train_ann_path": "data/train_data.json",
    "valid_ann_path": "data/dev_data.json",
    "whisper_path": "openai/whisper-large-v2"
}
2024-11-26 21:43:44,836 [INFO] 
2024-11-26 21:43:44,83

Let's try decoding with our newly fine-tuned model. This works the same way as in Section 3, except that the SALMONN checkpoint path in the decoding config file should point to the fine-tuned checkpoint.

In [None]:
!python3 cli_inference.py --cfg-path configs/decode_config_ft.yaml

Loading checkpoint shards: 100%|██████████████████| 2/2 [00:13<00:00,  6.80s/it]
trainable params: 4194304 || all params: 6742618112 || trainable%: 0.06220586618327525
Loading training prompts done!
Your Wav Path:
data/covost2-test-en-de_proccessed/SRCAUDIO.en/audio_30.wav
Your Prompt:
Listen to the speech and translate it into German.
Output:
<s> Was haben Sie mir empfohlen, Herr?</s>


# Section 5. Action items

- Modify the `cli_inference.py` script to handle multiple samples from a test set
- Evaluate SALMONN on your test set before and after fine-tuning. Did the performance improve? Provide your insights.
- **One page** should be enough to cover the important information
- Mail to Tu Anh Dinh: tu.dinh@kit.edu
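As a starting point for the first two items, the loop over a test set could be structured as sketched below. The model call is deliberately left abstract: `translate_fn` is a hypothetical callable that takes a wav path and a prompt and returns the raw output string, and `dummy_translate` only stands in for it so that the sketch runs without a GPU. How to obtain such a callable from `cli_inference.py` is part of the task.

```python
def translate_test_set(annotations, translate_fn, prompt):
    """Run `translate_fn` on every sample and pair each hypothesis with its reference."""
    results = []
    for sample in annotations:
        raw = translate_fn(sample["path"], prompt)
        # Drop the sentence-boundary markers that appear in the script output.
        hyp = raw.replace("<s>", "").replace("</s>", "").strip()
        results.append({"path": sample["path"], "hyp": hyp, "ref": sample["text"]})
    return results


def dummy_translate(wav_path, prompt):
    """Stand-in for the real model call; always returns the output seen in Section 3."""
    return "<s> Was haben Sie mir empfohlen, Herr?</s>"


test_annotations = [
    {"path": "data/covost2-test-en-de_proccessed/SRCAUDIO.en/audio_30.wav",
     "text": "Was raten Sie mir, mein Herr?",
     "task": "translation_en2de"},
]

results = translate_test_set(
    test_annotations,
    dummy_translate,
    prompt="Listen to the speech and translate it into German.",
)
for r in results:
    print(r["hyp"], "|", r["ref"])
```

The collected hypotheses and references can then be scored with a corpus-level translation metric such as BLEU, once for the raw checkpoint and once for the fine-tuned one.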