## MINDS14 Dataset
- https://huggingface.co/datasets/PolyAI/minds14

### 1. Download the dataset

In [1]:
from datasets import load_dataset, Audio

In [2]:
# ds = load_dataset("PolyAI/minds14", name="en-AU", split="train", trust_remote_code=True)
ds = load_dataset("PolyAI/minds14", name="en-AU", trust_remote_code=True)

In [3]:
ds

DatasetDict({
    train: Dataset({
        features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'],
        num_rows: 654
    })
})

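The `en-AU` configuration has only a `train` split, so passing `split="train"` loads it directly as a `Dataset` instead of a `DatasetDict`: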

```python
ds = load_dataset("PolyAI/minds14", name="en-AU", split="train", trust_remote_code=True)
```

In [4]:
ds_train = ds["train"]

In [5]:
ds_train

Dataset({
    features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'],
    num_rows: 654
})

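> The `en-AU` subset contains 654 audio files. The `audio` column holds the raw audio, and each example also comes with a transcription, its English translation, and an `intent_class` label describing the caller's intent.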

### 2. Inspect the dataset fields

In [6]:
ds_train.features

{'path': Value(dtype='string', id=None),
 'audio': Audio(sampling_rate=8000, mono=True, decode=True, id=None),
 'transcription': Value(dtype='string', id=None),
 'english_transcription': Value(dtype='string', id=None),
 'intent_class': ClassLabel(names=['abroad', 'address', 'app_error', 'atm_limit', 'balance', 'business_loan', 'card_issues', 'cash_deposit', 'direct_debit', 'freeze', 'high_value_payment', 'joint_account', 'latest_transactions', 'pay_bill'], id=None),
 'lang_id': ClassLabel(names=['cs-CZ', 'de-DE', 'en-AU', 'en-GB', 'en-US', 'es-ES', 'fr-FR', 'it-IT', 'ko-KR', 'nl-NL', 'pl-PL', 'pt-PT', 'ru-RU', 'zh-CN'], id=None)}

In [7]:
ds_train[0]

{'path': '/Users/alex.zhou/.cache/huggingface/datasets/downloads/extracted/3add4619499d88fceaa892890675f9167cef4df8c5ba7b59926c152405349aad/en-AU~PAY_BILL/response_4.wav',
 'audio': {'path': '/Users/alex.zhou/.cache/huggingface/datasets/downloads/extracted/3add4619499d88fceaa892890675f9167cef4df8c5ba7b59926c152405349aad/en-AU~PAY_BILL/response_4.wav',
  'array': array([ 0.        ,  0.00024414, -0.00024414, ..., -0.00024414,
          0.00024414,  0.0012207 ]),
  'sampling_rate': 8000},
 'transcription': 'I would like to pay my electricity bill using my card can you please assist',
 'english_transcription': 'I would like to pay my electricity bill using my card can you please assist',
 'intent_class': 13,
 'lang_id': 2}

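Field descriptions:
- `path`: path to the audio file
- `audio`: the audio data, a dict with its own `path`, `array`, and `sampling_rate` keys
- `array` (inside `audio`): the decoded audio, represented as a one-dimensional `NumPy` array
- `sampling_rate` (inside `audio`): the sampling rate of the audio (8,000 Hz for `ds_train[0]`)
- `transcription`: the transcription in the spoken language
- `english_transcription`: the English translation of the transcription
- `intent_class`: the intent category, stored as an integer. Use `ds_train.features["intent_class"].int2str` to convert it to its name
- `lang_id`: the language label, also stored as an integer index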
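Since `array` holds one value per sample and `sampling_rate` gives the number of samples per second, the length of a clip follows from dividing one by the other. A minimal sketch, assuming an `audio` dict shaped like the one in the output above; `duration_seconds` and `fake_audio` are hypothetical names, and a plain list stands in for the NumPy array:

```python
def duration_seconds(audio):
    """Return the clip length in seconds for an `audio` dict."""
    # number of samples / samples per second = seconds
    return len(audio["array"]) / audio["sampling_rate"]

# Hypothetical stand-in for ds_train[0]["audio"]: 16,000 samples at 8 kHz
fake_audio = {"array": [0.0] * 16_000, "sampling_rate": 8_000}
print(duration_seconds(fake_audio))  # 2.0
```

With the real dataset, the same function should work on `ds_train[0]["audio"]`, because `len()` behaves the same way on a one-dimensional NumPy array.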

In [8]:
ds_train.features["intent_class"].int2str(ds_train[0]["intent_class"])

'pay_bill'

In [9]:
def int2label(intent_class):
    return ds_train.features["intent_class"].int2str(intent_class)

In [10]:
# Test the int2label function
int2label(ds_train[0]["intent_class"])

'pay_bill'
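Conceptually, a `ClassLabel` is a lookup between integer ids and the names in its `names` list; `ClassLabel` also provides `str2int` for the reverse direction. The sketch below illustrates the idea with a plain Python list copied from the `features` output above. It is not how `datasets` implements it, and `INTENT_NAMES`, `id_to_name`, and `name_to_id` are hypothetical helpers:

```python
# Names copied from the `intent_class` ClassLabel shown in ds_train.features
INTENT_NAMES = [
    "abroad", "address", "app_error", "atm_limit", "balance",
    "business_loan", "card_issues", "cash_deposit", "direct_debit",
    "freeze", "high_value_payment", "joint_account",
    "latest_transactions", "pay_bill",
]

def id_to_name(label_id):
    """Integer id -> class name (same direction as ClassLabel.int2str)."""
    return INTENT_NAMES[label_id]

def name_to_id(name):
    """Class name -> integer id (same direction as ClassLabel.str2int)."""
    return INTENT_NAMES.index(name)

print(id_to_name(13))          # pay_bill
print(name_to_id("pay_bill"))  # 13
```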

In [11]:
ds_train[0]["english_transcription"]

'I would like to pay my electricity bill using my card can you please assist'

**Resampling the audio:** resample the samples from 8,000 Hz to 16,000 Hz.

In [12]:
ds_train = ds_train.cast_column("audio", Audio(sampling_rate=16_000))

In [13]:
ds_train[0]

{'path': '/Users/alex.zhou/.cache/huggingface/datasets/downloads/extracted/3add4619499d88fceaa892890675f9167cef4df8c5ba7b59926c152405349aad/en-AU~PAY_BILL/response_4.wav',
 'audio': {'path': '/Users/alex.zhou/.cache/huggingface/datasets/downloads/extracted/3add4619499d88fceaa892890675f9167cef4df8c5ba7b59926c152405349aad/en-AU~PAY_BILL/response_4.wav',
  'array': array([2.36116466e-05, 1.92324675e-04, 2.19285779e-04, ...,
         9.40909609e-04, 1.16613088e-03, 7.20880926e-04]),
  'sampling_rate': 16000},
 'transcription': 'I would like to pay my electricity bill using my card can you please assist',
 'english_transcription': 'I would like to pay my electricity bill using my card can you please assist',
 'intent_class': 13,
 'lang_id': 2}
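In the output above, `sampling_rate` is now 16000 and the array values differ from the 8 kHz version, because new samples are computed between the original ones. `cast_column` does not rewrite the files; the resampling is applied when an example is accessed. To show what upsampling does to an array, here is a naive linear-interpolation sketch on a synthetic tone. This is an illustration only: `naive_resample` is a hypothetical helper, and `datasets` relies on a proper resampling backend rather than this method.

```python
import numpy as np

def naive_resample(signal, orig_sr, target_sr):
    """Resample a 1-D signal by linear interpolation (illustration only)."""
    duration = len(signal) / orig_sr
    n_target = int(round(duration * target_sr))
    # Time stamps (in seconds) of the original and the target samples
    t_orig = np.arange(len(signal)) / orig_sr
    t_target = np.arange(n_target) / target_sr
    return np.interp(t_target, t_orig, signal)

# Synthetic stand-in for an audio array: one second of a 440 Hz tone at 8 kHz
sr = 8_000
tone = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
upsampled = naive_resample(tone, orig_sr=8_000, target_sr=16_000)

# Doubling the sampling rate doubles the number of samples; the duration is unchanged
print(len(tone), len(upsampled))  # 8000 16000
```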

### 3. View dataset samples with Gradio

In [14]:
import gradio as gr

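def get_audio(idx):
    # gr.Number may pass the value as a float, so coerce it to an integer index
    idx = int(idx)

    # Fetch the example
    if idx > len(ds_train) - 1:
        # If the index is out of range, pick a random example instead
        item = ds_train.shuffle()[0]
    else:
        item = ds_train[idx]

    # Get the audio data of the example
    audio = item["audio"]

    # gr.Audio accepts a (sampling_rate, array) tuple
    audio_info = (
        audio["sampling_rate"],
        audio["array"],
    )

    # Return the audio tuple, the English transcription, and the intent label
    return audio_info, item["english_transcription"], int2label(item["intent_class"])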
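The index handling in `get_audio` can be checked on its own, without loading the dataset or starting the UI. The sketch below isolates that logic under two assumptions that differ from the cell above: it draws a random row number with `random.randrange` instead of shuffling the whole dataset, and it also treats negative values as out of range. `resolve_index` is a hypothetical helper, not part of the notebook.

```python
import random

def resolve_index(idx, n_rows):
    """Map a UI-provided index onto a valid row number."""
    idx = int(idx)  # a numeric widget may deliver a float such as 3.0
    if 0 <= idx < n_rows:
        return idx
    # Out of range: fall back to a random valid row
    return random.randrange(n_rows)

print(resolve_index(3.0, 654))  # 3
```

Drawing a single random index avoids building a shuffled copy of the index for all 654 rows just to read one example.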

In [15]:
# Build the UI
with gr.Blocks() as demo:
    index_input = gr.Number(label="Index", value=0, minimum=0, maximum=len(ds_train) - 1)
    btn = gr.Button(value="Show audio")
    
    audio_output = gr.Audio(label="Audio")
    audio_text = gr.Label(label="Text")
    audio_label = gr.Label(label="Label")

    # Wire the button to the handler function
    btn.click(get_audio, inputs=index_input, outputs=[audio_output, audio_text, audio_label])

# Launch the app
demo.launch()

Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.

In [16]:
# gr.close_all()