Sometimes the audio data comes without a transcript. Feel free to use any library, or your own module, for automatic speech recognition (ASR). In this notebook I'll rely on [ffmpeg](https://ffmpeg.org) to extract the audio and [cmusphinx](https://cmusphinx.github.io) to transcribe the speech, both driven from Python through the `ffmpy` and `speech_recognition` packages imported below. See the official websites for installation and configuration instructions.

In [None]:
from ffmpy import FFmpeg

def extract_audio(mediafile):
    '''Extract the audio track from a media file supported by ffmpeg and save it as a mono PCM WAV file'''
    wavefile = mediafile + '.wav'
    # '-ac 1' downmixes to a single (mono) channel; ffmpeg writes PCM to .wav by default
    ff = FFmpeg(inputs={mediafile: None}, outputs={wavefile: '-ac 1'})
    ff.run()
    return wavefile
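
Before handing the extracted file to the recognizer, it can help to confirm that it really is mono PCM. The following is a minimal sketch that uses only the standard-library `wave` module; the helper name `describe_wave` is my own and is not part of ffmpy or ffmpeg.

```python
import wave

def describe_wave(wavefile):
    """Return (channels, sample_rate, duration_seconds) read from a PCM WAV header."""
    # wave.open raises wave.Error if the file is not a WAV file it can read
    with wave.open(wavefile, 'rb') as wf:
        channels = wf.getnchannels()
        rate = wf.getframerate()
        duration = wf.getnframes() / float(rate)
    return channels, rate, duration
```

For a file produced by `extract_audio`, the first value should be `1`, since `-ac 1` requests a single channel.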

In [None]:
import speech_recognition as sr

def speech_recognition(wavefile, lang):
    '''Transcribe the speech in a WAV file, using the Sphinx model for the given language (e.g. "en-US")'''
    reco = sr.Recognizer()
    with sr.AudioFile(wavefile) as source:
        audio = reco.record(source)  # read the whole file into memory
    return reco.recognize_sphinx(audio, language=lang)

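Whether a long recording can be split into smaller pieces during recognition depends on the ASR library. The split can also be done before ASR, for example with a tool like [auditok](https://auditok.readthedocs.io/en/latest/). The command below writes one SRT entry, with placeholder text, for each detected audio region.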

```
auditok -e 55 -i input.wav -m 10 --printf "{id}\n{start} --> {end}\nFake text here...\n" --time-format "%h:%m:%s,%i" > output.srt
```
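
To make the timing information concrete, here is a minimal sketch that reads the start time and duration of each segment from SRT text. It uses only the standard library and assumes timing lines of the form `00:00:01,500 --> 00:00:04,250`, which is what the `--time-format` above is meant to produce. The names `TIMING`, `to_seconds` and `read_segments` are hypothetical helpers, not part of auditok.

```python
import re

# Matches an SRT timing line such as "00:00:01,500 --> 00:00:04,250".
TIMING = re.compile(
    r'(\d+):(\d+):(\d+)[,.](\d+)\s*-->\s*(\d+):(\d+):(\d+)[,.](\d+)'
)

def to_seconds(h, m, s, ms):
    """Convert hour, minute, second and millisecond fields to seconds."""
    # The last field is treated as an integer number of milliseconds.
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000.0

def read_segments(srt_text):
    """Return a list of (start, duration) pairs in seconds, one per timing line."""
    segments = []
    for match in TIMING.finditer(srt_text):
        fields = match.groups()
        start = to_seconds(*fields[:4])
        end = to_seconds(*fields[4:])
        segments.append((start, end - start))
    return segments
```

Each pair maps directly onto ffmpeg's `-ss` (start offset) and `-t` (duration) options.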

The time information in `output.srt` can then be used to cut the audio file into segments.

In [None]:
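import os
import srt
from ffmpy import FFmpeg

os.makedirs('split', exist_ok=True)  # ffmpeg will not create the output directory

with open('output.srt') as f:
    subs = list(srt.parse(f.read()))

for sub in subs:
    ss = sub.start.total_seconds()
    t = (sub.end - sub.start).total_seconds()
    ff = FFmpeg(
        # seek to the segment start and read only its duration (both in seconds)
        inputs={'input.wav': '-ss {:.3f} -t {:.3f}'.format(ss, t)},
        # name each file <start>-<duration>.wav, avoiding ':' in file names
        outputs={'split/{:.3f}-{:.3f}.wav'.format(ss, t): '-vn -acodec pcm_s16le'}
    )
    ff.run()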
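
Once each segment has been transcribed, for example with the `speech_recognition` helper defined earlier, the placeholder text can be replaced by writing a fresh SRT file. The following is a minimal, dependency-free sketch of that last step. The `segments` list of `(start, end, text)` tuples and the helper names `format_timestamp` and `to_srt` are assumptions made for illustration.

```python
def format_timestamp(seconds):
    """Format a number of seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3600 * 1000)
    minutes, rest = divmod(rest, 60 * 1000)
    secs, ms = divmod(rest, 1000)
    return '{:02d}:{:02d}:{:02d},{:03d}'.format(hours, minutes, secs, ms)

def to_srt(segments):
    """Build SRT text from (start, end, text) tuples, numbering blocks from 1."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, 1):
        blocks.append('{}\n{} --> {}\n{}\n'.format(
            index, format_timestamp(start), format_timestamp(end), text))
    # Blocks are separated by a blank line, as the SRT format expects.
    return '\n'.join(blocks)
```

The `srt` package imported above can produce the same output with `srt.compose`; the hand-written version is shown only to keep the sketch free of extra dependencies.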