---

[1] Let's start with speech recognition
===================

In [1]:
# [1] Install the Azure Cognitive Services Speech SDK
!pip install azure-cognitiveservices-speech




In [2]:
# [2] Import the packages used in this notebook
import azure.cognitiveservices.speech as speech_sdk


In [3]:
# [3] Set the key and region
# Security is ignored here for simplicity, but these values are sensitive, so handle them carefully in production

COG_SERVICE_KEY="your_cognitive_services_key"
COG_SERVICE_REGION="your_cognitive_services_location"
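Because the key is sensitive, one common approach is to read it from environment variables instead of hardcoding it in the notebook. The following is a minimal sketch of that idea. The helper `load_speech_credentials` and the environment variable names `COG_SERVICE_KEY` and `COG_SERVICE_REGION` are assumptions made for illustration; they are not part of the SDK.

```python
import os


def load_speech_credentials(env=None):
    """Return (key, region) read from environment variables.

    The variable names are hypothetical; use whatever your environment defines.
    """
    env = os.environ if env is None else env
    key = env.get("COG_SERVICE_KEY")
    region = env.get("COG_SERVICE_REGION")
    # Collect the names of any variables that are unset or empty.
    missing = [name for name, value in
               (("COG_SERVICE_KEY", key), ("COG_SERVICE_REGION", region))
               if not value]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return key, region
```

With both variables set, `COG_SERVICE_KEY, COG_SERVICE_REGION = load_speech_credentials()` would replace the hardcoded assignments.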


In [4]:
# [4] Configure SpeechConfig
# Security is ignored here for simplicity, but the key is sensitive, so handle it carefully in production
speech_config = speech_sdk.SpeechConfig(COG_SERVICE_KEY, COG_SERVICE_REGION)
print('Speech service region is set to:', speech_config.region)


Speech service region is set to: southcentralus


In [5]:
# [5] Set the audio file to use
audioFile = '/content/time.wav'
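Before handing the file to the SDK, it can help to confirm that it is a readable PCM WAV file and to look at its format. The sketch below uses only the standard `wave` module; the helper name `describe_wav` is hypothetical. As far as I understand, the service expects 16-bit mono PCM (typically 16 kHz) for plain WAV input by default, but please check the current documentation rather than relying on this note.

```python
import wave


def describe_wav(path):
    """Return basic format information read from a WAV file header."""
    # wave.open raises wave.Error if the file is not a PCM WAV file.
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": rate,
            "duration_sec": frames / float(rate),
        }
```

Calling `describe_wav(audioFile)` returns a small dictionary that can be inspected before running recognition.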


In [6]:
# [6] Configure AudioConfig
audio_config = speech_sdk.AudioConfig(filename=audioFile)


In [7]:
# [7] Configure SpeechRecognizer (created from the SpeechConfig and AudioConfig)
speech_recognizer = speech_sdk.SpeechRecognizer(speech_config, audio_config)


In [8]:
# [8] Run speech recognition with SpeechRecognizer
speech_result = speech_recognizer.recognize_once_async().get()
print("Recognized speech:", speech_result.text)


Recognized speech: What time is it?
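The cell above prints `speech_result.text` directly, which is empty when nothing was recognized or the request was canceled (for example, because of a wrong key or region). A rough sketch of checking the outcome first is shown below. The helper name `summarize_recognition` is hypothetical. It compares the enum member name of `result.reason` so that it can be tried without the SDK installed; with the SDK you would more typically compare against `speech_sdk.ResultReason` members directly.

```python
def summarize_recognition(result):
    """Return a short message describing a recognition result."""
    # result.reason is an enum in the SDK; fall back to str() for other objects.
    reason = getattr(result.reason, "name", str(result.reason))
    if reason == "RecognizedSpeech":
        return "Recognized: " + result.text
    if reason == "NoMatch":
        return "No speech could be recognized."
    if reason == "Canceled":
        details = result.cancellation_details
        return "Canceled: {} ({})".format(details.reason, details.error_details)
    return "Unexpected result reason: " + reason
```

In the notebook this would be used as `print(summarize_recognition(speech_result))`.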


---

[2] Pronunciation assessment
===================

In [9]:
# [1] Prepare the reference script (the words being spoken)
script = 'What time is it'


[2] Configure PronunciationAssessmentConfig

Reference article

https://docs.microsoft.com/ja-jp/azure/cognitive-services/speech-service/how-to-pronunciation-assessment?pivots=programming-language-python

SDK reference: azure-cognitiveservices-speech Package

https://docs.microsoft.com/ja-jp/python/api/azure-cognitiveservices-speech/?view=azure-python

Sample code

https://github.com/Azure-Samples/cognitive-services-speech-sdk/blob/4f9ee79c2287a5a00dcd1a50112cd43694aa7286/samples/python/console/speech_sample.py#L707


In [10]:
# [2] Configure PronunciationAssessmentConfig
# Details:
# https://docs.microsoft.com/ja-jp/python/api/azure-cognitiveservices-speech/azure.cognitiveservices.speech.pronunciationassessmentconfig?view=azure-python

pronunciation_config = speech_sdk.PronunciationAssessmentConfig(reference_text=script,
                                                                   grading_system=speech_sdk.PronunciationAssessmentGradingSystem.HundredMark,
                                                                   granularity=speech_sdk.PronunciationAssessmentGranularity.Word)


In [11]:
# [3] Configure SpeechRecognizer (created from the SpeechConfig and AudioConfig)
speech_recognizer = speech_sdk.SpeechRecognizer(speech_config, audio_config)
# Same as the recognizer configured in [7] above


In [12]:
# [4] Run the pronunciation assessment
pronunciation_config.apply_to(speech_recognizer)
result = speech_recognizer.recognize_once()


In [13]:
# [5] Create an object that summarizes the pronunciation assessment result
pronunciation_result = speech_sdk.PronunciationAssessmentResult(result)


In [14]:
# [6] Show the pronunciation assessment result (for the whole utterance)
print('Accuracy score: {}, fluency score: {}, completeness score: {}, pronunciation score: {}'.format(
            pronunciation_result.accuracy_score, pronunciation_result.fluency_score,
            pronunciation_result.completeness_score, pronunciation_result.pronunciation_score
        ))

Accuracy score: 100.0, fluency score: 100.0, completeness score: 100.0, pronunciation score: 100.0


In [15]:
# [7] Show the pronunciation assessment result (per word)
for word_result in pronunciation_result.words:
    print('Word: {}, Accuracy score: {}'.format(word_result.word, word_result.accuracy_score))

Word: What, Accuracy score: 100.0
Word: time, Accuracy score: 100.0
Word: is, Accuracy score: 100.0
Word: it, Accuracy score: 100.0
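For longer scripts, it may be more useful to list only the words that scored poorly. The following sketch filters any sequence of objects that expose `word` and `accuracy_score`, as the entries of `pronunciation_result.words` do in the loop above. The helper name `words_below` and the threshold of 80 are arbitrary choices for illustration.

```python
def words_below(words, threshold=80.0):
    """Return (word, accuracy_score) pairs whose score is below the threshold."""
    return [(w.word, w.accuracy_score)
            for w in words
            if w.accuracy_score < threshold]
```

With the result above, `words_below(pronunciation_result.words)` would return an empty list, since every word scored 100.0.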


In [17]:
# [8] Retrieve and display the full result as JSON
import json

json_result = result.properties.get(speech_sdk.PropertyId.SpeechServiceResponse_JsonResult)
jo = json.loads(json_result)
print(json.dumps(jo, indent=2))

{
  "Id": "b4cac8d5693a4cd9a08126fae9c90c28",
  "RecognitionStatus": "Success",
  "Offset": 5000000,
  "Duration": 11100000,
  "DisplayText": "What time is it?",
  "SNR": 27.69381,
  "NBest": [
    {
      "Confidence": 0.9356122,
      "Lexical": "What time is it",
      "ITN": "What time is it",
      "MaskedITN": "what time is it",
      "Display": "What time is it?",
      "PronunciationAssessment": {
        "AccuracyScore": 100.0,
        "FluencyScore": 100.0,
        "CompletenessScore": 100.0,
        "PronScore": 100.0
      },
      "Words": [
        {
          "Word": "What",
          "Offset": 5000000,
          "Duration": 3500000,
          "PronunciationAssessment": {
            "AccuracyScore": 100.0,
            "ErrorType": "None"
          }
        },
        {
          "Word": "time",
          "Offset": 8600000,
          "Duration": 2900000,
          "PronunciationAssessment": {
            "AccuracyScore": 100.0,
            "ErrorType": "None"
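
The `Offset` and `Duration` values in this JSON are expressed in 100-nanosecond ticks, so dividing by 10,000,000 gives seconds. As a minimal sketch, the helper below (hypothetical name `word_timings`) pulls per-word timing and accuracy out of the first `NBest` entry. The sample input is limited to the fields and the two words visible in the output above; in the notebook you would pass `jo` instead.

```python
TICKS_PER_SECOND = 10_000_000  # Offset and Duration are in 100-nanosecond ticks


def word_timings(response):
    """Return per-word timing (in seconds) and accuracy from the first NBest entry."""
    best = response["NBest"][0]
    rows = []
    for word in best["Words"]:
        rows.append({
            "word": word["Word"],
            "start_sec": word["Offset"] / TICKS_PER_SECOND,
            "duration_sec": word["Duration"] / TICKS_PER_SECOND,
            "accuracy": word["PronunciationAssessment"]["AccuracyScore"],
        })
    return rows


# Sample input restricted to values shown in the notebook output.
sample = {
    "NBest": [{
        "Words": [
            {"Word": "What", "Offset": 5000000, "Duration": 3500000,
             "PronunciationAssessment": {"AccuracyScore": 100.0, "ErrorType": "None"}},
            {"Word": "time", "Offset": 8600000, "Duration": 2900000,
             "PronunciationAssessment": {"AccuracyScore": 100.0, "ErrorType": "None"}},
        ]
    }]
}

for row in word_timings(sample):
    print(row)
```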
          

That's all.