# Text Encoder Input Preparation

Prepares inputs for `libtext_encoder_htp.so` on QCS6490.

| Input | ONNX Shape | QNN Shape | dtype | Quantization |
|-------|------------|-----------|-------|-------------|
| text_ids | (1, 128) | (1, 128) | INT32 | None |
| style_ttl | (1, 50, 256) | (1, 256, 50) | UINT8 | scale=0.00541, offset=-172 |
| text_mask | (1, 1, 128) | (1, 128, 1) | UINT8 | scale=0.00392, offset=0 |
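
As a quick consistency check, the size of each raw file follows directly from the QNN shape and dtype in the table above. The sketch below is illustrative only: `BYTES_PER_ELEMENT`, `TEXT_ENCODER_INPUTS` and `expected_bytes` are names introduced here and are not part of the original notebook.

```python
from math import prod

# Bytes per element for the dtypes listed in the table (standard widths assumed).
BYTES_PER_ELEMENT = {"INT32": 4, "UINT8": 1, "FLOAT32": 4}

# QNN shapes and dtypes copied from the table above.
TEXT_ENCODER_INPUTS = {
    "text_ids": ((1, 128), "INT32"),
    "style_ttl": ((1, 256, 50), "UINT8"),
    "text_mask": ((1, 128, 1), "UINT8"),
}

def expected_bytes(shape, dtype):
    """Return the size in bytes of a raw tensor file with this shape and dtype."""
    return prod(shape) * BYTES_PER_ELEMENT[dtype]

sizes = {name: expected_bytes(shape, dtype)
         for name, (shape, dtype) in TEXT_ENCODER_INPUTS.items()}
for name, size in sizes.items():
    print(f"{name}: {size} bytes")
```

Comparing these numbers with `os.path.getsize(...)` after each cell catches a wrong dtype or a missing quantization step early.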

In [2]:
import numpy as np
import os

os.makedirs('./inputs', exist_ok=True)

def quantize(data, scale, offset):
    """Quantize float32 to uint8: q = round(x / scale) - offset, clipped to [0, 255].

    Inverts the QNN scale/offset encoding x = (q + offset) * scale.
    """
    return np.clip(np.round(data / scale) - offset, 0, 255).astype(np.uint8)
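
To see what the quantization costs in precision, a round trip through a matching `dequantize` can be measured. This is a minimal sketch, assuming the encoding `x = (q + offset) * scale`; `dequantize` is a hypothetical helper that does not appear in the original notebook, and the test signal is synthetic.

```python
import numpy as np

def quantize(data, scale, offset):
    # Same computation as the notebook's quantize().
    return np.clip(np.round(data / scale) - offset, 0, 255).astype(np.uint8)

def dequantize(q, scale, offset):
    # Hypothetical inverse, assuming x = (q + offset) * scale.
    return (q.astype(np.float32) + offset) * scale

# style_ttl parameters from the table; they cover roughly [-0.93, 0.45].
scale, offset = 0.0054120589047670, -172

# Synthetic values that stay inside the representable range.
x = np.linspace(-0.9, 0.4, 1001, dtype=np.float32)
x_hat = dequantize(quantize(x, scale, offset), scale, offset)

max_err = float(np.abs(x - x_hat).max())
print(f"max round-trip error: {max_err:.6f} (half a step is {scale / 2:.6f})")
```

For in-range values the error should not exceed about half a quantization step. A larger error suggests clipping or mismatched parameters.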

In [3]:
# 1. text_ids - INT32, no transpose needed
text_ids = np.fromfile('qnn_calibration/text_encoder/text_ids.raw', dtype=np.int32)
text_ids.tofile('./inputs/text_ids.raw')
print(f'text_ids: {text_ids.size} elements, {os.path.getsize("./inputs/text_ids.raw")} bytes')

text_ids: 128 elements, 512 bytes


In [4]:
# 2. style_ttl - transpose (1,50,256) -> (1,256,50), then quantize
style_ttl = np.fromfile('qnn_calibration/text_encoder/style_ttl.raw', dtype=np.float32)
style_ttl = style_ttl.reshape(1, 50, 256).transpose(0, 2, 1)
style_ttl_q = quantize(style_ttl, scale=0.0054120589047670, offset=-172)
style_ttl_q.tofile('./inputs/style_ttl.raw')
print(f'style_ttl: {style_ttl_q.shape}, {os.path.getsize("./inputs/style_ttl.raw")} bytes')

style_ttl: (1, 256, 50), 12800 bytes
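
The transpose above returns a view rather than a rearranged copy, so it is worth confirming that the file on disk really is in the new layout. NumPy's `tofile` always writes in C order of the array it is called on, which the small sketch below illustrates with a toy tensor; the 3x4 shape and the temporary file name are stand-ins chosen for readability.

```python
import os
import tempfile
import numpy as np

# Small stand-in for the (1, 50, 256) tensor: 3 frames by 4 channels.
x = np.arange(12, dtype=np.float32).reshape(1, 3, 4)
x_t = x.transpose(0, 2, 1)  # view with shape (1, 4, 3); no data is copied

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.raw")
    x_t.tofile(path)  # written in C order of the transposed view
    on_disk = np.fromfile(path, dtype=np.float32)

# The file is channel-major: all frames of channel 0, then channel 1, and so on.
print(on_disk.reshape(4, 3))
```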


In [5]:
# 3. text_mask - transpose (1,1,128) -> (1,128,1), then quantize
text_mask = np.fromfile('qnn_calibration/text_encoder/text_mask.raw', dtype=np.float32)
text_mask = text_mask.reshape(1, 1, 128).transpose(0, 2, 1)
text_mask_q = quantize(text_mask, scale=0.0039215688593686, offset=0)
text_mask_q.tofile('./inputs/text_mask.raw')
print(f'text_mask: {text_mask_q.shape}, {os.path.getsize("./inputs/text_mask.raw")} bytes')

text_mask: (1, 128, 1), 128 bytes


## Run Inference
```bash
qnn-net-run --model ./libtext_encoder_htp.so \
            --backend libQnnHtp.so \
            --input_list text_enc_input.txt \
            --output_dir text_output \
            --use_native_input_files
```

**Note:** `--use_native_input_files` is required because the inputs are already quantized. Without it, `qnn-net-run` parses every input file as float32.
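
The command above reads `text_enc_input.txt`, whose contents are not shown. A minimal sketch for generating it follows, assuming the same layout as the `vec_est_input.txt` comment later in this notebook (one line per inference, paths separated by spaces) and assuming the inputs are listed in table order. `write_input_list` is a hypothetical helper.

```python
from pathlib import Path

def write_input_list(list_path, raw_files):
    """Write a single-inference input list: all file paths on one line."""
    line = " ".join(raw_files)
    Path(list_path).write_text(line + "\n")
    return line

# Assumed order: the same order as the rows of the input table.
line = write_input_list(
    "text_enc_input.txt",
    ["./text_ids.raw", "./style_ttl.raw", "./text_mask.raw"],
)
print(line)
```

If the compiled model expects a different input order, the paths need to be rearranged to match.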

## Output
| Output | QNN Shape | dtype | Size |
|--------|-----------|-------|------|
| text_emb | (1, 128, 256) | FLOAT32 | 131,072 bytes |

**Important:** The QNN output has shape (1, 128, 256), while the calibration data is stored as (1, 256, 128). Transpose the output before comparing:
```python
output = np.fromfile('text_emb.raw', dtype=np.float32).reshape(1, 128, 256)
output_t = output.transpose(0, 2, 1)  # -> (1, 256, 128) to match calibration
```

# Vector Estimator Input Preparation

Prepares inputs for `libvector_estimator_htp.so` on QCS6490.

| Input | ONNX Shape | QNN Shape | dtype | Quantization |
|-------|------------|-----------|-------|--------------|
| noisy_latent | (1, 144, 256) | (1, 256, 144) | UINT8 | scale=0.03227, offset=-127 |
| text_emb | (1, 128, 256) (text encoder output) | (1, 128, 256) | UINT8 | scale=0.02788, offset=-128 |
| style_ttl | (1, 50, 256) | (1, 256, 50) | UINT8 | scale=0.00541, offset=-172 |
| latent_mask | (1, 1, 256) | (1, 256, 1) | UINT8 | scale=0.00392, offset=0 |
| text_mask | (1, 1, 128) | (1, 128, 1) | UINT8 | scale=0.00392, offset=0 |
| current_step | (1,) | (1,) | UINT8 | scale=3.92e-7, offset=0 |
| total_step | (1,) | (1,) | UINT8 | scale=0.01961, offset=0 |

In [None]:
# 1. noisy_latent - transpose (1,144,256) -> (1,256,144), then quantize
noisy_latent = np.fromfile('qnn_calibration/vector_estimator/noisy_latent.raw', dtype=np.float32)
noisy_latent = noisy_latent.reshape(1, 144, 256).transpose(0, 2, 1)
noisy_latent_q = quantize(noisy_latent, scale=0.0322749949991703, offset=-127)
noisy_latent_q.tofile('./inputs/noisy_latent.raw')
print(f'noisy_latent: {noisy_latent_q.shape}, {os.path.getsize("./inputs/noisy_latent.raw")} bytes')

In [None]:
# 2. text_emb - from text_encoder output (board_output), quantize to UINT8
# Note: text_encoder outputs (1,128,256), vector_estimator expects (1,128,256) - no transpose needed
text_emb = np.fromfile('./board_output/text_emb.raw', dtype=np.float32)
text_emb_q = quantize(text_emb, scale=0.0278750918805599, offset=-128)
text_emb_q.tofile('./inputs/text_emb.raw')
print(f'text_emb: {text_emb_q.size} elements, {os.path.getsize("./inputs/text_emb.raw")} bytes')

In [None]:
# 3. style_ttl - already prepared above (same quantization params)
# 4. latent_mask - transpose (1,1,256) -> (1,256,1), then quantize
latent_mask = np.fromfile('qnn_calibration/vector_estimator/latent_mask.raw', dtype=np.float32)
latent_mask = latent_mask.reshape(1, 1, 256).transpose(0, 2, 1)
latent_mask_q = quantize(latent_mask, scale=0.0039215688593686, offset=0)
latent_mask_q.tofile('./inputs/latent_mask.raw')
print(f'latent_mask: {latent_mask_q.shape}, {os.path.getsize("./inputs/latent_mask.raw")} bytes')

# 5. text_mask - already prepared above (same quantization params)

In [None]:
# 6. current_step and 7. total_step (diffusion loop, step 0 of 10)
current_step = np.array([0], dtype=np.float32)
current_step_q = quantize(current_step, scale=0.0000003921568634, offset=0)
current_step_q.tofile('./inputs/current_step.raw')

# NOTE: 10 / 0.0196 is about 510, above the uint8 maximum, so quantize() clips
# this to 255, which dequantizes to about 5.0 rather than 10.
total_step = np.array([10], dtype=np.float32)
total_step_q = quantize(total_step, scale=0.0196078438311815, offset=0)
total_step_q.tofile('./inputs/total_step.raw')

print(f'current_step: {current_step_q}, {os.path.getsize("./inputs/current_step.raw")} bytes')
print(f'total_step: {total_step_q}, {os.path.getsize("./inputs/total_step.raw")} bytes')
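
Because the step inputs are scalars, it is easy to check whether a given value is representable at all under its quantization parameters. The sketch below assumes the encoding `x = (q + offset) * scale`; `representable_range` and the two dictionaries are names introduced for illustration.

```python
def representable_range(scale, offset, qmin=0, qmax=255):
    """Float interval covered by uint8 codes under x = (q + offset) * scale."""
    return (qmin + offset) * scale, (qmax + offset) * scale

# Quantization parameters copied from the vector estimator input table.
STEP_PARAMS = {
    "current_step": (0.0000003921568634, 0),
    "total_step": (0.0196078438311815, 0),
}
# Values used in the cell above.
values = {"current_step": 0.0, "total_step": 10.0}

in_range = {}
for name, (scale, offset) in STEP_PARAMS.items():
    lo, hi = representable_range(scale, offset)
    in_range[name] = lo <= values[name] <= hi
    print(f"{name}: value={values[name]}, range=[{lo:.6g}, {hi:.6g}], fits={in_range[name]}")
```

The same check shows that `current_step` only covers values up to about 1e-4, so any step index above 0 would also saturate. It may be worth confirming which step values the calibration run actually used.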

## Run Inference
```bash
# Input list (vec_est_input.txt):
# ./noisy_latent.raw ./text_emb.raw ./style_ttl.raw ./latent_mask.raw ./text_mask.raw ./current_step.raw ./total_step.raw

qnn-net-run --model ./libvector_estimator_htp.so \
            --backend libQnnHtp.so \
            --input_list vec_est_input.txt \
            --output_dir vec_output \
            --use_native_input_files
```

## Output
| Output | QNN Shape | dtype | Size |
|--------|-----------|-------|------|
| denoised_latent | (1, 256, 144) | FLOAT32 | 147,456 bytes |

In [9]:
# Verify vector estimator output against vocoder calibration input
def cosine_similarity(a, b):
    return np.dot(a.flatten(), b.flatten()) / (np.linalg.norm(a) * np.linalg.norm(b))

output = np.fromfile('./board_output/denoised_latent.raw', dtype=np.float32)
calib = np.fromfile('qnn_calibration/vocoder/latent.raw', dtype=np.float32)

print(f'Output: {output.size} elements, range [{output.min():.4f}, {output.max():.4f}]')
print(f'Calib:  {calib.size} elements, range [{calib.min():.4f}, {calib.max():.4f}]')

# Output is (1, 256, 144), calib is (1, 144, 256) - need transpose
output_t = output.reshape(1, 256, 144).transpose(0, 2, 1).flatten()
print(f'\nCosine similarity (transposed): {cosine_similarity(output_t, calib):.4f}')
# Note: Low similarity (~0.19) expected with only 1 diffusion step instead of 10

Output: 36864 elements, range [-3.1223, 3.0408]
Calib:  36864 elements, range [-3.5043, 3.4468]

Cosine similarity (transposed): 0.1880
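
A low score can come from the model or from a layout mix-up, so computing the similarity in both layouts helps separate the two causes. The following is a sketch on synthetic data, not on the board output; `compare_layouts` is a hypothetical helper and the random tensor merely stands in for the calibration file.

```python
import numpy as np

def cosine_similarity(a, b):
    a, b = a.ravel(), b.ravel()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def compare_layouts(output_flat, calib_flat, qnn_shape=(1, 256, 144)):
    """Cosine similarity without and with undoing the QNN transpose."""
    direct = cosine_similarity(output_flat, calib_flat)
    transposed = cosine_similarity(
        output_flat.reshape(qnn_shape).transpose(0, 2, 1), calib_flat)
    return direct, transposed

# Synthetic stand-in: a random "calibration" tensor and its QNN-layout copy.
rng = np.random.default_rng(0)
calib = rng.standard_normal((1, 144, 256)).astype(np.float32)
output = calib.transpose(0, 2, 1).copy().ravel()  # layout the board would write

direct, transposed = compare_layouts(output, calib.ravel())
print(f"direct: {direct:.4f}, transposed: {transposed:.4f}")
```

When both scores are low on real data, as with the 0.19 above, the layout is unlikely to be the only cause.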


# Vocoder Input Preparation

Prepares inputs for `libvocoder_htp.so` on QCS6490.

| Input | ONNX Shape | QNN Shape | dtype | Quantization |
|-------|------------|-----------|-------|--------------|
| latent | (1, 144, 256) | (1, 256, 144) | UINT8 | scale=0.02726, offset=-129 |

**Note:** The vector estimator output is already in QNN layout `(1, 256, 144)`, so no transpose is needed.

In [None]:
# 1. latent - from vector estimator output, no transpose needed (already QNN format)
latent = np.fromfile('./board_output/denoised_latent.raw', dtype=np.float32)
latent_q = quantize(latent, scale=0.0272593162953854, offset=-129)
latent_q.tofile('./inputs/latent.raw')
print(f'latent: {latent_q.size} elements, {os.path.getsize("./inputs/latent.raw")} bytes')
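
Before feeding a float tensor from one stage into the next, it can help to know how many values will be clipped by the quantizer. The sketch below counts them for the vocoder parameters from the table; `saturation_fraction` is a hypothetical helper and the five demo values are made up.

```python
import numpy as np

def saturation_fraction(data, scale, offset, qmin=0, qmax=255):
    """Fraction of values whose code falls outside [qmin, qmax] before clipping."""
    codes = np.round(np.asarray(data, dtype=np.float64) / scale) - offset
    return float(np.mean((codes < qmin) | (codes > qmax)))

# Vocoder latent parameters from the table; they cover roughly [-3.52, 3.43].
scale, offset = 0.0272593162953854, -129

# Made-up values: the first and last lie outside the representable range.
demo = np.array([-4.0, -3.5, 0.0, 3.4, 4.0], dtype=np.float32)
frac = saturation_fraction(demo, scale, offset)
print(f"{frac:.0%} of demo values would be clipped")
```

Calling `saturation_fraction(latent, scale, offset)` on the real latent gives a quick indication of whether the calibration range still fits the data.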

## Run Inference
```bash
# Input list (vocoder_input.txt):
# ./latent.raw

qnn-net-run --model ./libvocoder_htp.so \
            --backend libQnnHtp.so \
            --input_list vocoder_input.txt \
            --output_dir vocoder_output \
            --use_native_input_files
```

## Output
| Output | QNN Shape | dtype | Size |
|--------|-----------|-------|------|
| wav_tts | (1, 786432) | FLOAT32 | 3,145,728 bytes |

In [None]:
# Convert vocoder output to WAV
from scipy.io import wavfile

wav_data = np.fromfile('./board_output/wav_tts.raw', dtype=np.float32)
print(f'Audio: {wav_data.size} samples ({wav_data.size/44100:.2f} sec at 44.1kHz)')
print(f'Range: [{wav_data.min():.4f}, {wav_data.max():.4f}]')

wavfile.write('./board_output/output.wav', 44100, wav_data)
print('Saved: board_output/output.wav')
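
`scipy.io.wavfile.write` stores float32 samples as an IEEE-float WAV, which some players may not open. As a hedged alternative, the samples can be converted to 16-bit PCM first. The sketch assumes the vocoder output is nominally within [-1, 1]; `float_to_pcm16` is a hypothetical helper.

```python
import numpy as np

def float_to_pcm16(samples):
    """Convert float audio in [-1, 1] to 16-bit PCM, clipping anything outside."""
    clipped = np.clip(np.asarray(samples, dtype=np.float32), -1.0, 1.0)
    return np.round(clipped * 32767.0).astype(np.int16)

# Made-up samples, including one value outside [-1, 1].
demo = np.array([-1.5, -1.0, 0.0, 0.5, 1.0], dtype=np.float32)
pcm = float_to_pcm16(demo)
print(pcm.dtype, pcm.min(), pcm.max())
```

The result can be passed to `wavfile.write` in place of `wav_data`, for example with an illustrative file name such as `./board_output/output_pcm16.wav`.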