<img src="https://drive.google.com/uc?export=view&id=1wYSMgJtARFdvTt5g7E20mE4NmwUFUuog" width="200">

[![Gen AI Experiments](https://img.shields.io/badge/Gen%20AI%20Experiments-GenAI%20Bootcamp-blue?style=for-the-badge&logo=artificial-intelligence)](https://github.com/buildfastwithai/gen-ai-experiments)
[![Gen AI Experiments GitHub](https://img.shields.io/github/stars/buildfastwithai/gen-ai-experiments?style=for-the-badge&logo=github&color=gold)](http://github.com/buildfastwithai/gen-ai-experiments)


[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1bBT0Gio0goR3NfU80bttnnXP7cCC1hSH?usp=sharing)








# Testing Voxtral Model Using DeepInfra

This notebook shows how to use Mistral's Voxtral model through DeepInfra's OpenAI-compatible API with the OpenAI Python SDK. It covers basic setup, transcription, multilingual audio, and the optional transcription parameters.

**Note:** You will need an API key from [DeepInfra](https://deepinfra.com/) to run the examples.
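
The cells later in this notebook read the key from Colab secrets under the name `DEEPINFRA_API_KEY`. If you run the notebook outside Colab, that module is not available. Here is a minimal sketch of a fallback, assuming the key is stored either as a Colab secret or as an environment variable with the same name. The helper `load_api_key` is hypothetical and not part of any SDK.

```python
import os


def load_api_key(name: str = "DEEPINFRA_API_KEY") -> str:
    """Return the key from Colab secrets if possible, else from the environment."""
    try:
        # google.colab is only importable inside a Colab runtime.
        from google.colab import userdata
        key = userdata.get(name)
    except Exception:
        # Not running in Colab, or the secret is missing or not shared.
        key = None

    key = key or os.environ.get(name)
    if not key:
        raise RuntimeError(f"Set {name} as a Colab secret or an environment variable.")
    return key
```

With this in place, `api_key = load_api_key()` can stand in for the direct `userdata.get(...)` call.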

### Installation

First, let's install the necessary Python libraries.

In [None]:
!pip install openai --quiet

### Basic Usage of Voxtral with DeepInfra

- First, let's download a short test audio file and open it for upload. The `OpenAI` client is pointed at DeepInfra in the transcription step that follows.

In [None]:
# Let's download our test audio file
!wget https://raw.githubusercontent.com/pratik-gond/temp_files/main/build_fast_with_ai.mp3 -O audio.mp3
audio_file_path = "audio.mp3"
audio_file = open(audio_file_path, "rb")

# We can also play the downloaded file in the notebook
from IPython.display import Audio
Audio(audio_file_path)

--2025-07-23 12:57:44--  https://raw.githubusercontent.com/pratik-gond/temp_files/main/build_fast_with_ai.mp3
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25389 (25K) [audio/mpeg]
Saving to: ‘audio.mp3’


2025-07-23 12:57:44 (8.83 MB/s) - ‘audio.mp3’ saved [25389/25389]



### Transcription

In [None]:
from openai import OpenAI
from google.colab import userdata

# Read the DeepInfra API key from Colab secrets
api_key = userdata.get('DEEPINFRA_API_KEY')

# Point the OpenAI SDK at DeepInfra's OpenAI-compatible endpoint
client = OpenAI(
    api_key=api_key,
    base_url="https://api.deepinfra.com/v1/openai",
)

# Send the opened audio file for transcription
response = client.audio.transcriptions.create(
  model="mistralai/Voxtral-Small-24B-2507",  # we can also use mistralai/Voxtral-Mini-3B-2507
  file=audio_file
)

In [None]:
print(response.text)

This is the tutorial made by Build Fast with AI.
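
The cells above open the audio file once and leave the handle open. A handle that has already been read will most likely sit at end-of-file if it is reused for a second request. Here is a small sketch of a wrapper that opens the file fresh for each call and closes it afterwards. `transcribe_file` is a hypothetical helper, and it assumes `client` is the `OpenAI` instance created above.

```python
from pathlib import Path


def transcribe_file(client, path, model="mistralai/Voxtral-Small-24B-2507", **options):
    """Transcribe one audio file, closing the file handle when the call returns."""
    with Path(path).open("rb") as audio:
        # Extra keyword arguments (language, response_format, ...) are passed through as-is.
        return client.audio.transcriptions.create(model=model, file=audio, **options)
```

Usage would look like `response = transcribe_file(client, "audio.mp3")`, followed by `print(response.text)` as before.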


### Multilingual Capabilities

Voxtral models handle audio in multiple languages. Let's test this by transcribing audio clips in Hindi and Spanish.

In [None]:
# Let's download our Hindi test audio file
!wget https://raw.githubusercontent.com/pratik-gond/temp_files/main/hindi_poem_mean.mp3 -O audio_hindi.mp3
audio_file_path_hi = "audio_hindi.mp3"
audio_file_hi = open(audio_file_path_hi, "rb")
Audio(audio_file_path_hi)

--2025-07-23 12:59:50--  https://raw.githubusercontent.com/pratik-gond/temp_files/main/hindi_poem_mean.mp3
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 156525 (153K) [audio/mpeg]
Saving to: ‘audio_hindi.mp3’


2025-07-23 12:59:50 (4.41 MB/s) - ‘audio_hindi.mp3’ saved [156525/156525]



### Transcription

In [None]:
response = client.audio.transcriptions.create(
  model="mistralai/Voxtral-Small-24B-2507",  # we can also use mistralai/Voxtral-Mini-3B-2507
  file=audio_file_hi,
  response_format="text"  # ask for plain text instead of the default JSON
)

# With response_format="text" the result is a plain string, so there is no .text attribute
response

'आज पानी गिर रहा है, बहुत पानी गिर रहा है, रात भर गिरता रहा है, प्राण मन घिरता रहा है, अब सवेरा हो गया है, कब सवेरा हो गया है, ठीक से मैंने न जाना, बहुत सोकर सिर्फ माना, क्योंकि बादल की अंधेरी है अभी तक भी घन्वी, इन पंक्तियों से लेखक का क्या तात्पर्य है?'
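
As the output above shows, `response_format="text"` gives back a plain string, while the earlier call returned an object with a `.text` attribute. Here is a small sketch of a helper that hides this difference. The name `transcript_text` is hypothetical, and stripping surrounding whitespace is our own choice rather than anything the API requires.

```python
def transcript_text(response) -> str:
    """Return the transcript from either a plain string or an object with .text."""
    if isinstance(response, str):
        return response.strip()
    return response.text.strip()
```

Calling `print(transcript_text(response))` then works with either response format used in this notebook.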

In [None]:
!wget https://raw.githubusercontent.com/pratik-gond/temp_files/main/what_is_AI_spanish.mp3 -O audio_spanish.mp3
audio_file_path_sp = "audio_spanish.mp3"
audio_file_sp = open(audio_file_path_sp, "rb")
Audio(audio_file_path_sp)

--2025-07-23 13:00:47--  https://raw.githubusercontent.com/pratik-gond/temp_files/main/what_is_AI_spanish.mp3
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13485 (13K) [audio/mpeg]
Saving to: ‘audio_spanish.mp3’


2025-07-23 13:00:47 (16.6 MB/s) - ‘audio_spanish.mp3’ saved [13485/13485]



### Transcription

In [None]:
response = client.audio.transcriptions.create(
  model="mistralai/Voxtral-Small-24B-2507",  # we can also use mistralai/Voxtral-Mini-3B-2507
  file=audio_file_sp
)
print(response.text)

que es la inteligencia artificial.


### Advanced Audio Transcription Parameters

- **`chunking_strategy`**: `"auto"` or `object` *(Optional)*  
  Determines how the audio is split into chunks.  
  - `"auto"`: Normalizes loudness and applies voice activity detection (VAD) to determine chunk boundaries.  
  - `object`: Provide a `server_vad` configuration to manually control VAD parameters.  
  - If unset, the audio is processed as a single block.

- **`language`**: `string` *(Optional)*  
  The input language in ISO-639-1 format (e.g., `"en"` for English).  
  Providing this improves transcription accuracy and reduces latency.

- **`prompt`**: `string` *(Optional)*  
  A text snippet to guide the model’s transcription style or continue from a previous segment.  
  Must match the language of the input audio.

- **`response_format`**: `string` *(Optional, default = "json")*  
  Specifies the format of the transcription output.  
  Options: `"json"`, `"text"`, `"srt"`, `"verbose_json"`, `"vtt"`  
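
To make the list above concrete, here is a sketch of a helper that builds the optional arguments and leaves out anything that is unset. `transcription_options` is a hypothetical name. The validation only mirrors the parameter list above, and this notebook does not confirm that DeepInfra honors every option for Voxtral.

```python
ALLOWED_FORMATS = {"json", "text", "srt", "verbose_json", "vtt"}


def transcription_options(language=None, prompt=None,
                          response_format="json", chunking_strategy=None):
    """Build keyword arguments for a transcription call, skipping unset values."""
    if response_format not in ALLOWED_FORMATS:
        raise ValueError(f"response_format must be one of {sorted(ALLOWED_FORMATS)}")
    options = {"response_format": response_format}

    if language:
        # ISO-639-1 codes are two letters, e.g. "en" or "hi".
        if len(language) != 2 or not language.isalpha():
            raise ValueError("language must be a two-letter ISO-639-1 code")
        options["language"] = language.lower()

    if prompt:
        # An empty prompt carries no guidance, so it is omitted.
        options["prompt"] = prompt

    if chunking_strategy is not None:
        # Either the string "auto" or a dict holding a server_vad configuration.
        options["chunking_strategy"] = chunking_strategy

    return options
```

The result can be unpacked into the call, for example `client.audio.transcriptions.create(model=..., file=..., **transcription_options(language="en"))`.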


In [None]:
!wget https://raw.githubusercontent.com/pratik-gond/temp_files/main/theory_of_relativity.mp3 -O audio_theory.mp3
audio_file_path_th = "audio_theory.mp3"
audio_file_th = open(audio_file_path_th, "rb")
Audio(audio_file_path_th)

--2025-07-23 10:35:53--  https://raw.githubusercontent.com/pratik-gond/temp_files/main/theory_of_relativity.mp3
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25197 (25K) [audio/mpeg]
Saving to: ‘audio_theory.mp3’


2025-07-23 10:35:53 (2.29 MB/s) - ‘audio_theory.mp3’ saved [25197/25197]



In [None]:
response = client.audio.transcriptions.create(
  model="mistralai/Voxtral-Small-24B-2507",  # we can also use mistralai/Voxtral-Mini-3B-2507
  file=audio_file_th,
  chunking_strategy="auto",
  language="en",
  prompt="",  # empty prompt: no style guidance
  response_format="json"
)
print(response.text)  # with response_format="json" the transcript is in the .text field

Explain the theory of relativity in simple terms.
