[Bug] Discrepancy in Evaluation Metrics Using Paper Hyperparameters for ZS-TTS (X-TTS) #4176

### Describe the bug

Hello,

I am working on improving the ZS-TTS (X-TTS) model with an architecture change. However, the evaluation metrics I obtain when generating audio with the paper’s recommended hyperparameters differ significantly from those reported in your paper.

Observed Metrics with Official Git Audio:
- English: CER 0.5425, UTMOS 4.007 ± 0.25, Duration 0.6423 sec
- Spanish: CER 1.4606, Duration 0.5371 sec
- French: CER 1.4937, Duration 0.4799 sec

Metrics with Paper Hyperparameters (temperature = 0.75, length penalty = 1.0, repetition penalty = 10.0, top-k = 50, top-p = 0.85):
- English: CER 0.8989, UTMOS 4.0242 ± 0.25, Duration 0.6143 sec
- Spanish: CER 2.3393, Duration [value?] sec
- French: CER 3.6471, Duration 0.5367 sec
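
For reference, a minimal sketch of how the settings above could be passed at generation time. It assumes the model object behaves like Coqui's `Xtts` class, that is, it offers `get_conditioning_latents(audio_path=...)` and an `inference(...)` method accepting these keyword names. `PAPER_DECODING` and `synthesize` are hypothetical names used only for illustration, and the model is passed in rather than loaded here.

```python
from typing import Any, Dict

# Decoding settings quoted from the list above.
PAPER_DECODING: Dict[str, Any] = {
    "temperature": 0.75,
    "length_penalty": 1.0,
    "repetition_penalty": 10.0,
    "top_k": 50,
    "top_p": 0.85,
}


def synthesize(model: Any, text: str, language: str, reference_wav: str) -> Any:
    """Run one zero-shot synthesis with the settings in PAPER_DECODING.

    Assumption: `model` exposes the same two methods as Coqui's `Xtts`
    (`get_conditioning_latents` and `inference`); nothing is loaded here.
    """
    # Speaker conditioning is computed from the reference recording.
    gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
        audio_path=[reference_wav]
    )
    # Every decoding setting is passed explicitly so that library defaults
    # cannot silently differ from the values listed in the issue.
    return model.inference(
        text, language, gpt_cond_latent, speaker_embedding, **PAPER_DECODING
    )
```

Passing every value explicitly, rather than relying on defaults, makes it easier to rule out a mismatch between the installed version's defaults and the paper's settings.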

I have tried multiple random seeds, but the metrics of the generated audio consistently differ from those of the provided audio. My modified model outperforms the audio I generate with the baseline under each hyperparameter setting, yet it remains worse than the provided results.
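
To make the seed comparison concrete, here is a small sketch of how per-seed CER values could be summarized and compared against a reference value. The function names and the two-standard-deviation threshold are assumptions chosen for illustration, not part of the evaluation repository.

```python
from statistics import mean, stdev
from typing import Dict, List, Tuple


def summarize_by_seed(cer_per_seed: Dict[int, float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) of CER across seeds."""
    values: List[float] = list(cer_per_seed.values())
    if len(values) < 2:
        raise ValueError("need at least two seeds to estimate the spread")
    return mean(values), stdev(values)


def within_spread(reference: float, avg: float, spread: float, k: float = 2.0) -> bool:
    """True if `reference` lies within k standard deviations of the mean."""
    return abs(reference - avg) <= k * spread
```

If the reference CER falls outside the seed-to-seed spread, sampling noise alone is an unlikely explanation for the gap, which points toward a difference in the pipeline itself.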

I suspect there might be critical implementation details or processing steps (e.g., in decoding or preprocessing) that are influencing these results. I would appreciate any guidance on the following:
- Are there any additional preprocessing or postprocessing steps that are crucial for achieving the reported performance?
- Could there be differences in the decoding process or hyperparameter nuances that are not explicitly documented?
- Any other factors that you believe might be key to reproducing the original results?
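
One example of a preprocessing step that can shift CER is text normalization applied before scoring. The sketch below is a self-contained illustration using a plain Levenshtein distance; it is not the evaluation repository's implementation, the `normalize` rules are only one possible choice, and reporting CER as a percentage is an assumption about the scale of the numbers above.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace (one possible choice)."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = re.sub(r"[^\w\s]", "", text)  # accented letters count as \w and are kept
    return re.sub(r"\s+", " ", text).strip()


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str, normalized: bool = True) -> float:
    """Character error rate in percent: 100 * edits / reference length."""
    if normalized:
        ref, hyp = normalize(ref), normalize(hyp)
    if not ref:
        raise ValueError("reference is empty after normalization")
    return 100.0 * edit_distance(ref, hyp) / len(ref)
```

Scoring the same transcript with `normalized=True` and `normalized=False` shows how much of a CER gap can come from casing and punctuation alone, independent of the audio.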

I am using the [Coqui-AI/TTS](https://github.com/coqui-ai/TTS) repository for generation and [Edresson/ZS-TTS-Evaluation](https://github.com/Edresson/ZS-TTS-Evaluation) for evaluation.

Thank you for your help!

### To Reproduce

1. Generate audio for the test set from [Edresson/ZS-TTS-Evaluation](https://github.com/Edresson/ZS-TTS-Evaluation) using XTTS v2.0.2 from this repository.
2. Evaluate the generated audio with the same evaluation repository.

### Expected behavior

The generated audio should reproduce the metrics observed with the official Git audio:
- English: CER 0.5425, UTMOS 4.007 ± 0.25, Duration 0.6423 sec
- Spanish: CER 1.4606, Duration 0.5371 sec
- French: CER 1.4937, Duration 0.4799 sec

### Logs

```shell

```

### Environment

```shell
TTS Version - 1.3.0
PyTorch version - 1.9.0+cu111
Python version - 3.11.8
os - linux
cuda version
gpu - v100 32gb
installed via pip
```

### Additional context

_No response_
