DSI BOS 11 (May 2020) Project 5

Alex Golden, Jungmoon Ham, Luke Podsiadlo, Zach Tretter

Workbook 4 - Speech to Text Transcription

----------

## Speech Recognition

#### Workflow Steps

1. Import audio segments (.wav files)

2. Transcribe via Google's Cloud Speech-to-Text API

3. Export results as dataframe

#### Requirements

* Key for the Google API
* Input path for audio files to be processed
* Output path for csv file (the dataframe)
* Name for said csv file

Core code adapted from
* DSI-SF-9 [(Grant Wilson, J. Hall, Gabriel Perez Prieto)](https://github.com/GWilson97/san_francisco_dispatch_audio_mapping/blob/master/code/03a_speech_to_text.ipynb)
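
The requirements listed above can be checked before any API call is made. The following is a minimal sketch, not part of the adapted code: the helper name `check_requirements` and its arguments are hypothetical, and it only verifies that the paths exist and that the output name looks like a csv file.

```python
import os


def check_requirements(key_file, input_path, output_path, file_name):
    """Return a list of problems found; an empty list means every check passed.

    Hypothetical helper for illustration, not part of the original notebook.
    """
    problems = []
    # The API key should be an existing JSON file on disk
    if not os.path.isfile(key_file):
        problems.append(f"API key file not found: {key_file}")
    # Input and output locations should be existing folders
    if not os.path.isdir(input_path):
        problems.append(f"Input path is not a directory: {input_path}")
    if not os.path.isdir(output_path):
        problems.append(f"Output path is not a directory: {output_path}")
    # The exported dataframe is written as a csv file
    if not file_name.lower().endswith(".csv"):
        problems.append(f"Output file name should end in .csv: {file_name}")
    return problems
```

Calling `check_requirements(...)` with the four values used later in this workbook and printing the returned list is one way to catch a mistyped path before the transcription loop starts.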


In [None]:
# !pip install --upgrade google-cloud-speech
import os
import io
import pandas as pd
import time

from google.cloud import speech_v1p1beta1 as speech
from google.cloud.speech_v1p1beta1 import enums

### Establish Credentials for Google Application

In [1]:
path_to_key = '/Users/alex/ga-dsi-11/'

key_name = 'police-scanner-speech-to-text-c09b11750e4e.json'

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = path_to_key + key_name

client = speech.SpeechClient()


### Establish Input Path (Specific to Your Machine)

In [None]:
input_path = '<><><><><><><><><><>'

os.listdir(input_path)
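
`os.listdir` returns every entry in the folder, in arbitrary order, while the transcription loop later keeps only names ending in `.wav`. As a rough sketch, the hypothetical helper `list_wav_files` below previews exactly which files would be processed, sorted so the order is repeatable. It uses the same case-sensitive `.wav` check as the loop.

```python
import os


def list_wav_files(folder):
    """Return the sorted names of the .wav files in `folder`.

    Hypothetical helper for illustration; mirrors the case-sensitive
    endswith('.wav') filter used in the transcription loop.
    """
    return sorted(name for name in os.listdir(folder) if name.endswith(".wav"))
```

For example, `list_wav_files(input_path)` could replace the bare `os.listdir(input_path)` call when the folder also holds non-audio files.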

### Establish Output Path (Specific to Your Machine)

In [None]:
output_path = '<><><><><><><><><><>'

file_name = '<><><><><><><><><><>'

os.listdir(output_path)

### Transcribe to Dataframe

In [None]:
start_time = time.time()

df = pd.DataFrame()

for sample_audio in os.listdir(input_path):
    loop_time = time.time()
    
    # Only process .wav files
    if sample_audio.endswith('.wav'):
        
        # Open the file; os.path.join works whether or not input_path ends in a slash
        with io.open(os.path.join(input_path, sample_audio),'rb') as audio_to_transcribe:
            content = audio_to_transcribe.read()
            audio = speech.types.RecognitionAudio(content = content)
            
        # Declare speech recognition parameters
        config = speech.types.RecognitionConfig(
            encoding = enums.RecognitionConfig.AudioEncoding.LINEAR16,
            sample_rate_hertz = 22050,
            language_code = 'en-US',
            audio_channel_count = 1,
            enable_separate_recognition_per_channel = True,
            use_enhanced = True,
            model = 'phone_call',
            speech_contexts = [{'boost':20.0}]
        )
        
        # This model's equivalent of fit/predict
        response = client.recognize(config=config, audio=audio)
        
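        # Time spent on this file (set here so it is defined even if no results come back)
        processing_time = round(time.time() - loop_time,1)
        
        # Build a dictionary per result and add it to the dataframe as a new row
        for result in response.results:
            d = {}
            d['transcript'] = result.alternatives[0].transcript
            d['confidence'] = result.alternatives[0].confidence
            d['file_name'] = sample_audio
            # Takes the number after the last "-" in the file name and divides by 1,000
            d['audio_length'] = round(int(sample_audio.split("-")[-1].split(".")[0])/1_000,1)
            d['transcribe_time'] = processing_time
            # DataFrame.append was removed in pandas 2.0; pd.concat works across versions
            df = pd.concat([df, pd.DataFrame([d])], ignore_index=True)
                    
        print(f"File {sample_audio} transcribed to df in {round(processing_time,0)} secs")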

print(f'Total time: {round(time.time() - start_time,1)} secs')
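
The `audio_length` value in the loop above is parsed from the file name: the code takes the number after the last `-` and divides it by 1,000. This appears to assume names ending in `-<milliseconds>.wav`. As an illustration only, the hypothetical helper `audio_length_seconds` below isolates that step and returns `None` instead of raising an error when a name does not follow the pattern.

```python
def audio_length_seconds(file_name):
    """Parse a trailing '-<milliseconds>' field from a file name into seconds.

    Hypothetical helper for illustration. Assumes names such as
    'dispatch-0001-12345.wav', where the last dash-separated field
    is the clip length in milliseconds. Returns None otherwise.
    """
    stem = file_name.rsplit(".", 1)[0]   # drop the extension
    last_field = stem.split("-")[-1]     # text after the last dash
    if not last_field.isdigit():
        return None
    return round(int(last_field) / 1_000, 1)
```

With this sketch, `audio_length_seconds("dispatch-0001-12345.wav")` gives `12.3`, and a name such as `"clip.wav"` gives `None` rather than stopping the loop.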

### View Dataframe

In [None]:
pd.set_option('display.max_columns',None)  

df = df[['file_name',
         'audio_length',
         'transcribe_time',
         'confidence',
         'transcript']]
df

### Export CSV

In [None]:
# index_label=False writes the index values without a header field for that column
df.to_csv(os.path.join(output_path, file_name),
          index_label = False)