<a href="https://colab.research.google.com/github/catundchat/STT_TTS_Report/blob/main/code/STT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Speech recognition code example. Usage:
1. Choose an appropriate pretrained model and tokenizer
2. Load an audio file
3. Run the demo to generate a transcription
4. Compare the transcription with the ground truth

In [5]:
!pip install transformers
!pip install librosa
!pip install jiwer


In [2]:
!nvidia-smi

Wed May 10 10:36:09 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P8     9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               

In [3]:
import librosa
import torch
import numpy as np
import re
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
from jiwer import wer  # jiwer's cer is not imported because a custom cer() is defined below

# Levenshtein edit distance between two sequences (dynamic programming)
def edit_distance(s1, s2):
    m, n = len(s1), len(s2)
    dp = np.zeros((m+1, n+1), dtype=int)
    for i in range(m+1):
        dp[i][0] = i
    for j in range(n+1):
        dp[0][j] = j
    for i in range(1, m+1):
        for j in range(1, n+1):
            if s1[i-1] == s2[j-1]:
                dp[i][j] = dp[i-1][j-1]
            else:
                dp[i][j] = min(dp[i-1][j-1], dp[i-1][j], dp[i][j-1]) + 1
    return dp[m][n]

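# Character error rate (CER): edit distance divided by the reference length
def cer(ground_truth, transcription):
    distance = edit_distance(ground_truth, transcription)
    return distance / len(ground_truth)

# Position-wise accuracy: the fraction of reference characters matched at the same index.
# Note: one inserted or deleted character shifts every later position, so this can be far below 1 - CER.
def accuracy(ground_truth, transcription):
    correct_chars = sum(1 for gt_char, tr_char in zip(ground_truth, transcription) if gt_char == tr_char)
    return correct_chars / len(ground_truth)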

# Remove punctuation from Chinese text (CJK and full-width symbols, plus common ASCII marks)
def remove_punctuation(text):
    pattern = re.compile(r"[\u3000-\u303f\uff00-\uffef]|[.,!?;]")
    return re.sub(pattern, "", text)

try:
    # Download a pretrained Wav2Vec2 model fine-tuned on Chinese; change the model as needed (wbbbbb/wav2vec2-large-chinese-zh-cn is recommended here)
    processor = Wav2Vec2Processor.from_pretrained("wbbbbb/wav2vec2-large-chinese-zh-cn")
    model = Wav2Vec2ForCTC.from_pretrained("wbbbbb/wav2vec2-large-chinese-zh-cn")

    # Load the audio file; change the path as needed!
    audio_file = "/content/drive/MyDrive/Colab Notebooks/Juxue_Tech/data/P290_convert.wav"
    speech, _ = librosa.load(audio_file, sr=16000, mono=True)

    # Preprocess the audio; this model requires a 16 kHz sampling rate!
    input_values = processor(speech, return_tensors="pt", padding=True, sampling_rate=16000).input_values

    # Run inference with the Wav2Vec2 model
    with torch.no_grad():
        logits = model(input_values).logits

    # Decode the predicted text (greedy CTC decoding)
    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = processor.decode(predicted_ids[0])

    # print("Transcription:", transcription)
    print(f"Transcription: {transcription}")
  
    # The actual text spoken in the audio (ground truth)
    ground_truth = "信息技术部门中机器学习的主要应用之一是向潜在用户或客户推荐项目。这可以分为两种主要的应用：在线广告和项目建议（通常这些建议的目的仍是为了销售产品）。两者都依赖于预测用户和项目的关联，一旦向该用户展示了广告或推荐了该产品，推荐系统要么预测一些行为的概率。"
    ground_truth_punc = remove_punctuation(ground_truth)
    # print(f"groundtruth without punctuation: {ground_truth_punc}")

    # Compute accuracy and CER; for Chinese text, the character error rate (CER) better reflects recognition quality
    acc = accuracy(ground_truth, transcription)
    acc_punc = accuracy(ground_truth_punc, transcription)
    CER = cer(ground_truth, transcription)
    CER_punc = cer(ground_truth_punc, transcription)

    # Print the results
    print(f"Accuracy without punctuation: {acc_punc:.4f}, Accuracy: {acc:.4f}")
    print(f"CER without punctuation: {CER_punc:.4f}, CER: {CER:.4f}")


except Exception as e:
    print("An error occurred:", e)


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.



Transcription: 信息技术部门中机器学习的主要运用之一是向潜在用户或客户推荐项目这可以分为两种主要的应用在线广告和项目建议通常这些建议的目的仍然是为了销售产品两者都依赖于预测用户和项目之间的观联一旦向该用户展示的广告和推荐的该产品推荐系统要么预测一些行为的概率
Accuracy without punctuation: 0.5169, Accuracy: 0.2381 
CER without punctuation: 0.0678, CER: 0.1270
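
The accuracy above (0.5169 without punctuation) is much lower than the CER (0.0678) would suggest. A likely reason is that `accuracy` compares characters position by position, and the transcription contains inserted characters (for example 仍然是 where the reference has 仍是), which shifts everything after them. The sketch below illustrates the effect on a short, hypothetical example in which the first character is dropped; `positional_accuracy` and this compact `edit_distance` are illustrative stand-ins, not the notebook's own functions.

```python
def positional_accuracy(ref, hyp):
    """Fraction of reference characters matched at the same index."""
    return sum(r == h for r, h in zip(ref, hyp)) / len(ref)

def edit_distance(ref, hyp):
    """Levenshtein distance computed with a rolling one-row table."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # match or substitution
        prev = curr
    return prev[-1]

ref = "机器学习的主要应用"
hyp = "器学习的主要应用"  # hypothetical output: only the first character is missing

acc = positional_accuracy(ref, hyp)
char_error_rate = edit_distance(ref, hyp) / len(ref)
print(f"positional accuracy: {acc:.4f}, CER: {char_error_rate:.4f}")
```

A single deletion costs one edit in the CER but misaligns every later character for the positional metric, so the CER is the more reliable of the two numbers here.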


1. To launch a web interface, run the following code

In [None]:
!pip install gradio

In [10]:
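import gradio as gr

# Transcribe an uploaded audio file with the processor and model loaded in the cell above
def transcribe(audio_file_path):
    speech, _ = librosa.load(audio_file_path, sr=16000, mono=True)
    input_values = processor(speech, return_tensors="pt", padding=True, sampling_rate=16000).input_values
    with torch.no_grad():
        logits = model(input_values).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    return processor.decode(predicted_ids[0])

# Create a Gradio interface; type="filepath" passes the upload to transcribe() as a file path
iface = gr.Interface(
    fn=transcribe,
    inputs=gr.Audio(type="filepath", label="Please upload an audio file"),
    outputs=gr.Textbox(label="Transcription result"))

# Launch the interface
iface.launch()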




Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Note: opening Chrome Inspector may crash demo inside Colab notebooks.

To create a public link, set `share=True` in `launch()`.





2. To expose an API, run the following code

In [None]:
!pip install flask flask-ngrok

In [None]:
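import tempfile
from flask import Flask, request
from flask_ngrok import run_with_ngrok

app = Flask(__name__)
run_with_ngrok(app)   # starts ngrok when the app is run

@app.route('/transcribe', methods=['POST'])
def transcribe_audio():
    # Assumption: the client uploads the audio as a multipart form field named "file"
    uploaded = request.files.get("file")
    if uploaded is None:
        return "No audio file uploaded", 400
    with tempfile.NamedTemporaryFile(suffix=".wav") as tmp:
        uploaded.save(tmp.name)
        speech, _ = librosa.load(tmp.name, sr=16000, mono=True)
    # Reuses librosa, torch, processor and model from the recognition cell above
    input_values = processor(speech, return_tensors="pt", padding=True, sampling_rate=16000).input_values
    with torch.no_grad():
        logits = model(input_values).logits
    return processor.decode(torch.argmax(logits, dim=-1)[0])

if __name__ == '__main__':
    app.run()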

Model 1: wbbbbb/wav2vec2-large-chinese-zh-cn

Transcription: 信息技术部门中机器学习的主要运用之一是向潜在用户或客户推荐项目这可以分为两种主要的应用在线广告和项目建议通常这些建议的目的仍然是为了销售产品两者都依赖于预测用户和项目之间的观联一旦向该用户展示的广告和推荐的该产品推荐系统要么预测一些行为的概率

Accuracy without punctuation: 0.5169, Accuracy: 0.2381 

CER without punctuation: 0.0678, CER: 0.1270

Model 2: jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn

Transcription: 新息技术部门中技器学习的主要运用之一是项潜在用户或客户推荐项目这可以分为两种主要的应用在先广告和项目建议通常这些建议的目的仍然是为了销售产品两者都依赖于预测用户和项目之金内观联一但项该用户展示的广告和推荐的该产品推荐系统要木预测些行为的概率

Accuracy without punctuation: 0.4831, Accuracy: 0.2143 

CER without punctuation: 0.1441, CER: 0.1984
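
To see which characters each model gets wrong, a transcription can be aligned against the reference text. The sketch below uses Python's standard `difflib` on a short prefix of the reference and of Model 2's output; `list_errors` is a hypothetical helper, not part of the notebook, and `difflib`'s alignment is a heuristic that is not guaranteed to match the minimum edit distance used for the CER.

```python
from difflib import SequenceMatcher

def list_errors(reference, hypothesis):
    """Return (operation, reference_span, hypothesis_span) for every non-matching span."""
    matcher = SequenceMatcher(None, reference, hypothesis, autojunk=False)
    return [
        (tag, reference[i1:i2], hypothesis[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]

reference = "信息技术部门中机器学习"
hypothesis = "新息技术部门中技器学习"  # prefix of Model 2's transcription above

errors = list_errors(reference, hypothesis)
for tag, ref_span, hyp_span in errors:
    print(tag, repr(ref_span), "->", repr(hyp_span))
```

Running the same helper on the full transcriptions (with punctuation removed from the reference) gives a per-model list of substitutions, insertions and deletions, which is often more informative than the aggregate scores alone.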