## This notebook is tested and intended to be used in a Kaggle notebook.

https://www.kaggle.com/code/alexandrmaximenko/audiollm-seminar

In [None]:
import torch
import librosa
from transformers import AutoModel, AutoProcessor
from transformers.generation import GenerationConfig
import transformers
import pandas as pd
import json
from tqdm import tqdm
import IPython
import gc
import os

# There might be errors like
# "AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'"
# They are harmless and won't affect notebook execution

In [None]:
## Downloading illustrations

! gdown https://drive.google.com/uc?id=1pT7Fq-igxFG-8a1c3L26zSMHs2DHLM36
! gdown https://drive.google.com/uc?id=188nBhpSwhxP99UflMK-zxo-V1yNIObR1
! gdown https://drive.google.com/uc?id=1tU8OQ6wzuqK_Epq6VkIJ0nMISXu1szcT
! gdown https://drive.google.com/uc?id=1pmK-nbP4qcQIAfoTZQJ2-mQgQKJKPXMj

### In this notebook we will try out the UltraVox AudioLLM

**UltraVox-v0.5 is a family of audio-conditioned LLMs in several model sizes:** [huggingface_link](https://huggingface.co/fixie-ai/ultravox-v0_5-llama-3_2-1b), [ultravox team blogpost](https://www.ultravox.ai/blog/ultravox-v0-5-taking-the-lead-in-speech-understanding)

**We are going to use the smallest model of the UltraVox-v0.5 family, which consists of:**
* **WhisperAudioEncoder with about 600M parameters**
* **Audio projector with about 40M parameters**
* **LLaMA 3.2 1B decoder, with about 1.2B parameters**

![](image_0.png)



### Let's try UltraVox with the user-friendly pipeline interface

----
**To use UltraVox you have to log in to Hugging Face and accept the usage rules.**


In [None]:
from huggingface_hub import login

# Authenticate with Hugging Face
# You can get your token from https://huggingface.co/settings/tokens
login()

**Let's download the audio files that we will process with the model**

In [None]:
# Downloading the audio files we will use in the examples

! gdown https://drive.google.com/uc?id=18-9If-0WZH0cPZ9MpQEwCrABA3BrTKBh
! gdown https://drive.google.com/uc?id=1Yv6B3BEMjjmkcK4ogcutyf6gR1QOl3kl

In [None]:
import IPython
IPython.display.Audio('gagarin_eng.mp3')
# When did Gagarin fly into space?

In [None]:
import IPython
IPython.display.Audio('gagarin.mp3')
# В каком году Гагарин полетел в космос?

In [None]:
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'

# Pipeline initialization
pipe = transformers.pipeline(
    model='fixie-ai/ultravox-v0_5-llama-3_2-1b', 
    trust_remote_code=True, 
    device=DEVICE, 
    dtype=torch.bfloat16
)

**Let's use our pipeline to answer our spoken questions:**

In [None]:
path = "gagarin_eng.mp3"  # When did Gagarin fly into space?
audio, sr = librosa.load(path, sr=16000)


turns = [
  {
    "role": "system",
    "content": "You are a friendly and helpful character. You love to answer questions for people."
  },
]
result_eng = pipe({'audio': audio, 'turns': turns, 'sampling_rate': sr}, max_new_tokens=256)

In [None]:
print(result_eng)

In [None]:
path = "gagarin.mp3"  # В каком году Гагарин полетел в космос?
audio, sr = librosa.load(path, sr=16000)


result_ru = pipe({'audio': audio, 'turns': turns, 'sampling_rate': sr}, max_new_tokens=256)

In [None]:
print(result_ru)


**As we can see, the model handles English speech well enough.**

**But we can't say the same for Russian speech.**


In [None]:
# We won't use pipe again, so delete it and call the garbage collector to free memory

print("Before pipe deletion:")
print("allocated:", torch.cuda.memory_allocated() / 1024**2, "MB")
print("reserved: ", torch.cuda.memory_reserved()  / 1024**2, "MB")
print("="*50)
print("Pipe Deletion..")
pipe.model.to('cpu')
del pipe.model
del pipe
gc.collect()
torch.cuda.empty_cache()

print("After pipe deletion:")
print("allocated:", torch.cuda.memory_allocated() / 1024**2, "MB")
print("reserved: ", torch.cuda.memory_reserved()  / 1024**2, "MB")

### Let's try a slightly more manual way: calling the Processor and the Model directly

----

In [None]:
# Initializing model and processor
model = AutoModel.from_pretrained('fixie-ai/ultravox-v0_5-llama-3_2-1b', trust_remote_code=True)
processor = AutoProcessor.from_pretrained('fixie-ai/ultravox-v0_5-llama-3_2-1b', trust_remote_code=True)

# Cast the model to bfloat16 and move it to the GPU if available
model = model.to(torch.bfloat16).to(DEVICE)

**Let's take a look at the model and its modules.**

In [None]:
model

In [None]:
print('Model parameters num: ', sum(p.numel() for p in model.parameters()))
print('Model encoder parameters num: ', sum(p.numel() for p in model.audio_tower.parameters()))
print('Model modality projector parameters num: ', sum(p.numel() for p in model.multi_modal_projector.parameters()))
print('Model text decoder parameters num: ', sum(p.numel() for p in model.language_model.parameters()))


**The model uses the** `<|audio|>` **token to mark where audio embeddings are inserted into the text embeddings.**

**We will create a dialogue template using this token and then use the model's tokenizer to turn it into the correct text representation.**

**Important note: we pass the** `add_generation_prompt=True` **argument to the** `apply_chat_template` **method to append the start of the assistant turn to our prompt.**
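
To make the effect of `add_generation_prompt=True` concrete, here is a simplified, hand-written imitation of a Llama-3-style chat template. The `render_chat` helper is hypothetical and is not the tokenizer's real template (the real one may add extra system text); it only illustrates that the flag appends an open assistant header for the model to continue.

```python
# Simplified imitation of a Llama-3-style chat template (illustrative only;
# the real template lives in the tokenizer and may add extra system text).
AUDIO_TOKEN = "<|audio|>"


def render_chat(turns: list[dict], add_generation_prompt: bool) -> str:
    """Render each turn as header + content + end-of-turn marker."""
    text = "<|begin_of_text|>"
    for turn in turns:
        text += f"<|start_header_id|>{turn['role']}<|end_header_id|>\n\n"
        text += f"{turn['content']}<|eot_id|>"
    if add_generation_prompt:
        # Open an assistant turn and leave it unfinished for the model.
        text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text


demo_turns = [
    {"role": "system", "content": "You are the helpful and smart audio assistant"},
    {"role": "user", "content": f"Answer the question: {AUDIO_TOKEN}"},
]
with_prompt = render_chat(demo_turns, add_generation_prompt=True)
without_prompt = render_chat(demo_turns, add_generation_prompt=False)
print(with_prompt)
```

Without the flag the text ends right after the user turn, so the model has no assistant header to continue from.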


In [None]:
path = "gagarin_eng.mp3"
audio, sr = librosa.load(path, sr=16000)
AUDIO_TOKEN = '<|audio|>'

turns = [
    {
        "role": "system",
        "content": "You are the helpful and smart audio assistant"
    }, 
    {
        "role": "user",
        "content": f"Answer the question: {AUDIO_TOKEN}"
    }, 
]

text = processor.tokenizer.apply_chat_template(
    turns, 
    add_generation_prompt=True, 
    tokenize=False
)
print(text)

**Next, we will use the model's processor to process the audio and create appropriate input for the model.**

In [None]:
processor_out = processor(text=text, audio=audio, sampling_rate=16000)
inputs = {k: v.to(DEVICE) if isinstance(v, torch.Tensor) else v for k, v in processor_out.items()}

**Let's take a look at the processor output, i.e. what the model will take as input.**

In [None]:
print(processor_out)

In [None]:
print('Audio values shape: ', processor_out['audio_values'].shape)
print('='*100)
print('Audio lens: ', processor_out['audio_lens'])
print('='*100)
print('Audio token len: ', processor_out['audio_token_len'])
print('='*100)
print('Audio token start idx: ', processor_out['audio_token_start_idx'])
print('='*100)
print('Input ids: ', processor_out['input_ids'])
print('='*100)
print('Decoded Input ids: ', processor.tokenizer.decode(processor_out['input_ids'][0]))

#### Model Input Values Description

##### 1. **Audio Values**
- **Shape**: `torch.Size([1, 128, 233])`
- **Description**: Processed audio features for the audio encoder (Whisper)
  - Dimension 0: Batch size (1 audio file)
  - Dimension 1: Feature dimension (128 mel-frequency bins)
  - Dimension 2: Time frames (233 frames)

##### 2. **Audio Lens**
- **Value**: `tensor([233])`
- **Description**: Actual length of the audio in frames (before any padding). Used to track the true audio length for each sample in the batch.

##### 3. **Audio Token Len**
- **Value**: `tensor([15])`
- **Description**: Number of audio tokens that will be inserted into the language model sequence. It is calculated as `⌈audio_lens / (encoder_ds_factor × stack_factor)⌉`, because the audio features are downsampled and stacked to reduce the sequence length. In our case this factor is 4 × 4 = 16, so `⌈233 / 16⌉ = 15`.

##### 4. **Audio Token Start Idx**
- **Value**: `tensor([43])`
- **Description**: The position in the input_ids sequence where the audio tokens begin. In this case, the audio tokens are inserted at index 43, replacing the `<|audio|>` placeholder with 15 consecutive audio token embeddings.

##### 5. **Input IDs**
- **Shape**: `torch.Size([1, 59])`
- **Description**: Tokenized text sequence that includes:
  - System prompt with conversation template (Llama 3.2 format)
  - User instruction: "Answer the question:"
  - 15 `<|eot_id|>` tokens (token ID: 128009) as placeholders for audio embeddings
  - Assistant response header to prompt generation
  
**Decoded version** shows the chat template structure with special tokens for the system/user/assistant roles. The repeated `<|eot_id|>` tokens will be replaced by the actual audio embeddings during the model's forward pass.
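
The arithmetic behind `audio_token_len` and the placeholder span can be checked with a few lines of plain Python. This is a sketch under the assumptions stated above: the two factors default to the values quoted in this description and are not read from the processor, and both helper functions are hypothetical names introduced for illustration.

```python
import math


def audio_token_len(num_frames: int, encoder_ds_factor: int = 4, stack_factor: int = 4) -> int:
    """Number of audio embeddings for `num_frames` mel frames (ceil division)."""
    return math.ceil(num_frames / (encoder_ds_factor * stack_factor))


def audio_span(start_idx: int, token_len: int) -> range:
    """Positions in input_ids that the audio embeddings occupy."""
    return range(start_idx, start_idx + token_len)


n_tokens = audio_token_len(233)   # 233 frames, as in audio_lens above
span = audio_span(43, n_tokens)   # 43 is audio_token_start_idx above
print(n_tokens, span.start, span.stop - 1)  # prints: 15 43 57
```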

![](image_1.png)
![](image_2.png)

In [None]:
res = model.generate(**inputs, max_new_tokens=300)

**As the model's** `generate` **method returns token ids, we need to decode them into text.**

In [None]:
print(processor.tokenizer.decode(res[0]))

**As we can see, the output contains both our input text and the generated text (in the assistant turn).**
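
If only the answer is needed, one common approach is to drop the prompt's token ids before decoding. Below is a minimal sketch on plain Python lists with toy ids; `split_generation` is a hypothetical helper, and whether the same slicing applies to `res[0]` depends on the output really starting with the prompt ids, as the decoded text above suggests.

```python
def split_generation(generated_ids: list[int], prompt_len: int) -> tuple[list[int], list[int]]:
    """Split a generated sequence into (prompt part, newly generated part)."""
    return generated_ids[:prompt_len], generated_ids[prompt_len:]


# Toy ids: the first four stand for the prompt, the rest for the answer.
prompt_ids = [101, 7, 8, 9]
generated = prompt_ids + [42, 43, 2]
prompt_part, answer_part = split_generation(generated, len(prompt_ids))
print(answer_part)  # prints: [42, 43, 2]
```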

In [None]:
output = model.forward(**inputs)

In [None]:
output.logits.shape

In [None]:
# Free the model's GPU memory, as we're going to use another LLM in the next section

print("Before model deletion:")
print("allocated:", torch.cuda.memory_allocated() / 1024**2, "MB")
print("reserved: ", torch.cuda.memory_reserved()  / 1024**2, "MB")
print("="*50)
print("Model Deletion..")
# The model is loaded again before training, so it is safe to delete it here
model.to('cpu')
del model
del res
del output
gc.collect()
torch.cuda.empty_cache()

print("After model deletion:")
print("allocated:", torch.cuda.memory_allocated() / 1024**2, "MB")
print("reserved: ", torch.cuda.memory_reserved()  / 1024**2, "MB")

## Let's check how we can generate training data

Consider one popular approach - [AudioChatLLaMa](https://arxiv.org/pdf/2311.06753v2)-like alignment.

![](image_4.png)

### Data Generation Steps:

**Input**: Standard ASR dataset with `(audio, transcript)` pairs

**Step 1: Use Transcript to Generate LLM Response**

Given an ASR pair, the transcript is used to prompt an LLM to generate a response:
```
prompt = "<s>[INST] <<SYS>>\n{{system_prompt}}\n<</SYS>>\n\n{{user_prompt}} [/INST]"
```

Where:
- `{{system_prompt}}` = empty (not used)
- `{{user_prompt}}` = the transcript from ASR data

**Step 2: Create Audio-Response Pairs**

The original `(audio, transcript)` pairs are transformed into `(audio, llm_response)` pairs:
- **Input to model**: Audio
- **Output from model**: LLM-generated response (not the transcript)


### Key Insight:

Instead of training the model to transcribe audio → text, it's trained to respond to audio content as if it understood the spoken input, making it behave more like a conversational assistant.
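
The two steps above can be sketched in a few lines. This is only an illustration: `fake_llm` is a hypothetical stand-in for a real LLM call, the file names are made up, and the template string is the one quoted in Step 1 with an empty system prompt.

```python
from typing import Callable

# Prompt layout quoted in Step 1; the system prompt stays empty.
PROMPT_TEMPLATE = "<s>[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n{user_prompt} [/INST]"


def build_prompt(transcript: str, system_prompt: str = "") -> str:
    """Step 1: the transcript becomes the user prompt."""
    return PROMPT_TEMPLATE.format(system_prompt=system_prompt, user_prompt=transcript)


def to_audio_response_pairs(asr_pairs, llm: Callable[[str], str]):
    """Step 2: turn (audio, transcript) pairs into (audio, llm_response) pairs."""
    return [(audio, llm(build_prompt(transcript))) for audio, transcript in asr_pairs]


def fake_llm(prompt: str) -> str:
    # Hypothetical stand-in for a real LLM call; it just reports the prompt size.
    return f"response to a prompt of {len(prompt)} characters"


asr_pairs = [("a.flac", "HOW ARE YOU"), ("b.flac", "WHAT TIME IS IT")]
pairs = to_audio_response_pairs(asr_pairs, fake_llm)
print(pairs[0])
```

Note that the transcript itself never appears in the resulting pairs: the audio is the only input the trained model will see.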

---

Let's implement this pipeline with an example.

We will use:
* LLaMA-3.2-1B-Instruct as the LLM
* LibriSpeech dev-clean subset as the ASR dataset

**First, let's write code to parse the dev-clean part of the LibriSpeech dataset, which we download into the Kaggle working directory**

In [None]:
# Download librispeech dev-clean data
! wget https://openslr.elda.org/resources/12/dev-clean.tar.gz \
  && tar -xf dev-clean.tar.gz

In [None]:
from pathlib import Path
import json

DEV_CLEAN_ROOT = Path("/kaggle/working/LibriSpeech/dev-clean")


# Parse all transcript files into a dict: utt_id -> transcription
def load_transcripts(root: Path) -> dict:
    transcripts = {}
    for trans_file in root.rglob("*.trans.txt"):
        with open(trans_file, "r", encoding="utf-8") as f:
            for line in f:
                parts = line.strip().split()
                if not parts:
                    continue
                utt_id = parts[0]              # e.g. "84-121123-0000"
                text = " ".join(parts[1:])
                transcripts[utt_id] = text
    return transcripts

transcripts = load_transcripts(DEV_CLEAN_ROOT)
print("Loaded transcripts:", len(transcripts))

# Create a JSON file with one entry per audio file
output_json = Path("/kaggle/working/dev-clean.json")
librispeech_data = []
for flac_path in sorted(DEV_CLEAN_ROOT.rglob("*.flac")):
    utt_id = flac_path.stem  # filename without .flac
    text = transcripts.get(utt_id)
    record = {
        "audio_path": str(flac_path),    # full path inside Kaggle FS
        "transcription": text,
    }
    librispeech_data.append(record)
    
with output_json.open("w", encoding="utf-8") as f:
    json.dump(librispeech_data, f)
        

print("Wrote JSON to:", output_json)

In [None]:
# Example of data sample
index = torch.randint(len(librispeech_data), (1,)).item()  # plain Python int for list indexing
print('Data sample transcription:\n', librispeech_data[index]['transcription'])
IPython.display.Audio(librispeech_data[index]['audio_path'])

**Next, we need to write the code for LLM response generation.**

**Here I'll provide code for transformers-based generation, but**

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

In [None]:
# End-of-header token; we'll use it to find the start of the last assistant turn
REPLIC_END_TOKEN = processor.tokenizer.encode('<|end_header_id|>', add_special_tokens=False)[0]

class InstructDataGeneratorHF:
    def __init__(self, model_path: str = "meta-llama/Llama-3.2-1B-Instruct"):
        print(f"Loading tokenizer: {model_path}")
        self.tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

        print(f"Loading transformers model: {model_path}")
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model = AutoModelForCausalLM.from_pretrained(
            model_path,
            torch_dtype=torch.bfloat16 if self.device.type == "cuda" else torch.float32,
            trust_remote_code=True,
        ).to(self.device)
        self.model.eval()

        self.generation_kwargs = dict(
            temperature=0.8,
            top_p=0.9,
            max_new_tokens=256,
            do_sample=True,
        )

        self.system_message = (
            "You are a helpful assistant. "
            "Keep your responses concise - maximum 50 words or 2-3 sentences."
        )

    def create_chat_prompt(self, transcription: str) -> str:
        """Create chat prompt using Direct Chat template."""
        messages = [
            {"role": "system", "content": self.system_message},
            {"role": "user", "content": transcription},
        ]

        prompt = self.tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True,
        )
        return prompt

    @torch.inference_mode()
    def generate_single(self, transcription: str) -> str:
        """Generate response for a single transcription."""
        prompt = self.create_chat_prompt(transcription)

        enc = self.tokenizer(
            prompt,
            return_tensors="pt",
            truncation=True,
        )
        input_ids = enc["input_ids"].to(self.device)
        attention_mask = enc["attention_mask"].to(self.device)

        generated = self.model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            **self.generation_kwargs,
        )

        # Find the start of the last assistant turn and decode from there
        last_assistant_replic_start = torch.where(generated[0] == REPLIC_END_TOKEN)[0][-1]
        output_ids = generated[0][last_assistant_replic_start:]

        text = self.tokenizer.decode(
            output_ids,
            skip_special_tokens=True,
            clean_up_tokenization_spaces=True,
        ).strip()

        return text

    def generate_batch(self, transcriptions: list[str]) -> list[str]:
        """Generate responses for a list of transcriptions with progress bar."""
        responses = []
        for transcription in tqdm(transcriptions, desc="Generating responses"):
            response = self.generate_single(transcription)
            responses.append(response)
        
        return responses

In [None]:
generator = InstructDataGeneratorHF("meta-llama/Llama-3.2-1B-Instruct")

In [None]:
NUM_SAMPLES = 100

librispeech_data_sample = librispeech_data[:NUM_SAMPLES]

transcriptions = [sample['transcription'] for sample in librispeech_data_sample]

generated_responses = generator.generate_batch(transcriptions)

In [None]:
# Check our instruct data samples
index = torch.randint(NUM_SAMPLES, (1,)).item()  # plain Python int for list indexing
print('Transcription example: ', librispeech_data_sample[index]['transcription'])
print('Response example: ', generated_responses[index])

In [None]:
## Add simple filtering
generated_samples = []
rejected_samples = []
rejecting_strings = [
    "I can't create content",
    "It seems like you",
    "I can't help",
    "I can't provide",
    "I can’t support",
]
for data_sample, response in zip(librispeech_data_sample, generated_responses):
    # Attach the response in both cases so rejected samples can be inspected too
    data_sample.update({'response': response})
    if any(sub in response for sub in rejecting_strings):
        rejected_samples.append(data_sample)
    else:
        generated_samples.append(data_sample)

print('Number of samples before filtering: ', NUM_SAMPLES)
print('Number of samples after filtering: ', len(generated_samples))
with open('instruct_data.json', 'w') as file:
    json.dump(generated_samples, file)

In [None]:
# Check rejected samples (skipped when nothing was rejected)
if rejected_samples:
    index = torch.randint(len(rejected_samples), (1,)).item()
    print('Transcription example: ', rejected_samples[index]['transcription'])
    print('Response example: ', rejected_samples[index]['response'])
else:
    print('No samples were rejected')

In [None]:
print("allocated:", torch.cuda.memory_allocated() / 1024**2, "MB")
print("reserved: ", torch.cuda.memory_reserved()  / 1024**2, "MB")

print('Removing generator...')
print('='*100)
generator.model.to('cpu')
del generator.model
del generator
gc.collect()
torch.cuda.empty_cache()
print("allocated:", torch.cuda.memory_allocated() / 1024**2, "MB")
print("reserved: ", torch.cuda.memory_reserved()  / 1024**2, "MB")

## Model training (overfitting) on Instruct data

**Now we want to write a simple example of AudioLLM training.**

**But we don't have enough resources for real training, so we'll write a simple training pipeline and make our model overfit on a small batch.**

**Overfitting on a small batch is a useful and commonly used sanity check for the correctness of a training pipeline.**

**So, we're going to implement a dataset and make our UltraVox model overfit on it using the transformers `Trainer` class**

### AudioLLM dataset && utils

In [None]:
BASE_SYSTEM_MESSAGE = (
    "You are a helpful assistant. "
    "Keep your responses concise — maximum 50 words or 2–3 sentences."
)

# End-of-header token (the content of a turn starts right after it)
REPLIC_END_TOKEN = processor.tokenizer.encode('<|end_header_id|>', add_special_tokens=False)[0]


import torch

def create_labels(
    tokens: torch.Tensor,
    special_token: int = REPLIC_END_TOKEN,
    ignore_index: int = -100
) -> torch.Tensor:
    """
    tokens: LongTensor of shape [batch_size, T]
    Returns a clone where, for each row, all positions BEFORE the last
    occurrence of `special_token` are set to `ignore_index`.

    If a row does not contain `special_token`, that row is left unchanged.
    """
    assert tokens.dim() == 2, "tokens must be 2D [batch, T]"
    B, T = tokens.shape
    device = tokens.device

    is_special = tokens.eq(special_token)

    # Indices along sequence dimension
    idx = torch.arange(T, device=device).unsqueeze(0)

    # For non-special positions put -1, for special positions put their index
    masked_idx = torch.where(is_special, idx, torch.full_like(idx, -1))
    
    last_pos = masked_idx.max(dim=1).values                  # [B]

    pos = idx
    mask = pos < last_pos.unsqueeze(1)

    out = tokens.clone()
    out[mask] = ignore_index
    return out
    

class UltraVoxAudioInstructDataset(torch.utils.data.Dataset):
    def __init__(
        self,
        json_path: str,
        processor: AutoProcessor,
        max_length: int = 256,
        num_samples: int | None = None
    ):
        with open(json_path, 'r') as f:
            self.data = json.load(f)

        if num_samples is not None:
            self.data = self.data[:num_samples]
        self.processor = processor
        self.max_length = max_length

        print(f"Loaded {len(self.data)} samples from {json_path}")

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx) -> dict:
        sample = self.data[idx]

        # Load audio from file path, resampled to 16 kHz on load
        # (without sr=..., librosa.load would first resample to its 22050 Hz default)
        audio_path = sample["audio_path"]
        audio, sample_rate = librosa.load(audio_path, sr=16000)
                                    
        turns = [
            {
                "role": "system",
                "content": BASE_SYSTEM_MESSAGE
            }, 
            {
                "role": "user",
                "content": f"{AUDIO_TOKEN}"
            }, 
            {
                "role": "assistant",
                "content": sample["response"]
            }
        ]

        # Use the processor passed to the dataset rather than the global one
        text = self.processor.tokenizer.apply_chat_template(
            turns,
            add_generation_prompt=False,
            tokenize=False
        )

        processor_out = self.processor(text=text, audio=audio, sampling_rate=16000)
        # processor_out['audio_values'] = processor_out['audio_values'].squeeze(0)

        labels = create_labels(processor_out['input_ids'])
        processor_out.update({'labels': labels})
        inputs = {k: v[0] for k, v in processor_out.items()}

        return inputs

In [None]:
# Initialize our dataset with only 10 samples
dataset = UltraVoxAudioInstructDataset('instruct_data.json', processor, num_samples=10)

In [None]:
# Let's look at the input ids and labels of a dataset sample
# The labels must contain only the assistant turn; all other tokens must be masked with -100
idx = torch.randint(len(dataset), (1,)).item()
print('INPUT_IDS Decoded:')
print(processor.tokenizer.decode(dataset[idx]['input_ids'], skip_special_tokens=False))
print('='*150)
print('LABELS Decoded:')
print(processor.tokenizer.decode(torch.where(dataset[idx]['labels'] != -100, dataset[idx]['labels'], processor.tokenizer.pad_token_id), skip_special_tokens=True))
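
The masking rule that `create_labels` applies to each row can also be written in plain Python for a single sequence, which makes it easy to check by hand. This is a sketch with toy token ids; `mask_before_last` is a hypothetical helper, not part of the pipeline above.

```python
IGNORE_INDEX = -100


def mask_before_last(tokens: list[int], special_token: int) -> list[int]:
    """Replace every position before the last `special_token` with IGNORE_INDEX."""
    if special_token not in tokens:
        return list(tokens)  # nothing to anchor on: leave the row unchanged
    last_pos = len(tokens) - 1 - tokens[::-1].index(special_token)
    return [IGNORE_INDEX] * last_pos + tokens[last_pos:]


# Toy ids: 9 plays the role of <|end_header_id|>.
row = [1, 9, 2, 3, 9, 5, 6]
print(mask_before_last(row, special_token=9))
# prints: [-100, -100, -100, -100, 9, 5, 6]
```

As in `create_labels`, the special token itself stays unmasked; only the positions strictly before its last occurrence are ignored by the loss.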

### Model training cycle

To implement the training loop we will use `Trainer` from transformers.

To avoid CUDA OOM we will freeze almost all parameters and train only the modality adapter.

Recall that our goal here is only to check that the loss goes down, as we don't have enough resources and time for a proper fine-tuning run.
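
A back-of-the-envelope estimate shows how small the trainable part is. The numbers below are the approximate component sizes quoted at the top of this notebook, not exact parameter counts, so the result is only a rough figure.

```python
# Approximate component sizes quoted at the top of this notebook.
approx_params = {
    "audio_encoder": 600e6,
    "projector": 40e6,
    "text_decoder": 1.2e9,
}

total = sum(approx_params.values())
trainable_share = approx_params["projector"] / total
print(f"total ≈ {total / 1e9:.2f}B, trainable ≈ {100 * trainable_share:.2f}%")
```

With only about 2% of the weights receiving gradients, the optimizer state and gradient buffers shrink accordingly.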

In [None]:
from transformers import Trainer, TrainingArguments

In [None]:
# Initializing model and processor
model = AutoModel.from_pretrained('fixie-ai/ultravox-v0_5-llama-3_2-1b', trust_remote_code=True)
processor = AutoProcessor.from_pretrained('fixie-ai/ultravox-v0_5-llama-3_2-1b', trust_remote_code=True)

# Cast the model to bfloat16 and move it to the GPU if available
model = model.to(torch.bfloat16).to(DEVICE)

In [None]:
# freezing model params except modality adapter
def freeze_model_except_projector(model):
    """
    Freeze all parameters in the Ultravox model except the multimodal projector.
    
    Args:
        model: Ultravox model instance
    """
    # First, freeze everything
    for param in model.parameters():
        param.requires_grad = False
    
    # Then, unfreeze only the projector
    if hasattr(model, 'multi_modal_projector'):
        for param in model.multi_modal_projector.parameters():
            param.requires_grad = True
        print("✓ Unfroze multi_modal_projector parameters")
    else:
        print("⚠ Warning: 'multi_modal_projector' not found in model")
    
    # Print summary of trainable parameters
    total_params = sum(p.numel() for p in model.parameters())
    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    
    print(f"\nParameter Summary:")
    print(f"  Total parameters: {total_params:,}")
    print(f"  Trainable parameters: {trainable_params:,}")
    print(f"  Frozen parameters: {total_params - trainable_params:,}")
    print(f"  Trainable percentage: {100 * trainable_params / total_params:.2f}%")

# Usage:
freeze_model_except_projector(model)

In [None]:
experiment_name = 'overfit_exp'
num_steps = 200
lr = 1e-3
warmup_steps = 20
output_dir = f'experiments/{experiment_name}'
logging_steps = 10
save_steps = 100000
eval_steps = 100000

training_arguments = TrainingArguments(
    output_dir=output_dir,
    max_steps=num_steps,
    per_device_train_batch_size=1,
    learning_rate=lr,
    warmup_steps=warmup_steps,
    logging_dir=os.path.join(output_dir, "tensorboard_logs"),
    logging_steps=logging_steps,
    save_steps=save_steps,
    eval_steps=eval_steps,
    eval_strategy="no",
    bf16=True,
    dataloader_num_workers=1,
    remove_unused_columns=False,
    report_to=["tensorboard"],
    save_safetensors=False,
    gradient_checkpointing=False,
    ddp_find_unused_parameters=False
)
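
These arguments do not set a learning-rate scheduler type. Assuming the default linear schedule applies (linear warmup, then linear decay to zero), the learning rate at each step can be sketched as follows; `linear_warmup_decay_lr` is a hypothetical helper written for illustration, not a transformers function.

```python
def linear_warmup_decay_lr(step: int, base_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Learning rate at `step` for linear warmup followed by linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Values mirror the cell above: lr=1e-3, 20 warmup steps, 200 steps in total.
for step in (0, 10, 20, 110, 200):
    print(step, linear_warmup_decay_lr(step, 1e-3, 20, 200))
```

The rate peaks at step 20 and is already back to half of its peak value by step 110, which is worth keeping in mind when reading the loss curve.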

In [None]:
class AudioLLMTrainer(Trainer):
    """Custom Trainer for AudioLLM that handles device placement correctly."""

    def compute_loss(self, model, inputs, return_outputs=False, num_items_in_batch=None):
        """Override to handle audio inputs properly."""
        outputs = model(**inputs)
        loss = outputs.loss
        return (loss, outputs) if return_outputs else loss

trainer = AudioLLMTrainer(
    model=model,
    args=training_arguments,
    train_dataset=dataset,
    # data_collator=data_collator,
)

In [None]:
trainer.train()

In [None]:
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator
import matplotlib.pyplot as plt

logdir = "experiments/overfit_exp/tensorboard_logs"  # directory with event files
ea = EventAccumulator(logdir)
ea.Reload()

# list available tags
print(ea.Tags()["scalars"])

# pick one scalar
tag = "train/loss"  # for example
events = ea.Scalars(tag)

steps = [e.step for e in events]
values = [e.value for e in events]

plt.plot(steps, values)
plt.xlabel("step")
plt.ylabel(tag)
plt.grid(True)
plt.show()

**As we can see, training went well and the loss dropped almost to 0**