Incorporate WebRTC VAD code #278

Merged: 84 commits, Aug 21, 2022
Commits
4cb5845
feat: import (not yet compiling) WebRTC VAD code
dhdaines Aug 12, 2022
27be956
feat: make it compile
dhdaines Aug 12, 2022
da2a5a8
fix: actually make VAD compile and remove unused (but useful) code
dhdaines Aug 12, 2022
16eead3
fix: shut up compiler
dhdaines Aug 12, 2022
252906f
fix: define things if stdint.h not available
dhdaines Aug 12, 2022
5db9fb0
Merge branch 'master' into webrtc_vad
dhdaines Aug 12, 2022
48476d9
fix: add missing limits for Win32
dhdaines Aug 12, 2022
c9fd1bf
fix: fix compilation for real (config.h sucks)
dhdaines Aug 12, 2022
d54f95f
fix: put webrtc_vad.h in a package specific location
dhdaines Aug 12, 2022
dbecaec
fix: avoid warnings with python headers
dhdaines Aug 12, 2022
677f25d
feat: incorporate VAD to Python module
dhdaines Aug 12, 2022
db84027
feat: switch to built-in VAD
dhdaines Aug 12, 2022
3246926
test: add module required by vad test
dhdaines Aug 12, 2022
4234dfe
fix: remove unused code to fix Windows build
dhdaines Aug 12, 2022
7a1f3e1
fix: spl_inl.c was needed on macosx and probably elsewhere
dhdaines Aug 12, 2022
ca281d4
Merge branch 'master' into webrtc_vad
dhdaines Aug 16, 2022
c81d0f9
Merge branch 'master' into webrtc_vad
dhdaines Aug 16, 2022
681673d
refactor: remove public webrtc_vad.h
dhdaines Aug 17, 2022
6f3a189
feat: sugar up the Python Vad binding a bit
dhdaines Aug 17, 2022
6a305b5
docs: add VAD example
dhdaines Aug 17, 2022
6415e21
docs: add shebang
dhdaines Aug 17, 2022
4568f04
docs: make it simpler
dhdaines Aug 17, 2022
ac52ee6
feat: default Vad to 16kHz to avoid surprises
dhdaines Aug 17, 2022
bcfa4cc
fix: check malloc()
dhdaines Aug 17, 2022
b758c4d
docs: note compatibility function
dhdaines Aug 17, 2022
7cd7561
feat: minimally wrap WebRTC VAD
dhdaines Aug 17, 2022
90c37f4
feat: bad example of VAD, will improve
dhdaines Aug 17, 2022
6b4c7c4
fix: change vad example to just do vad
dhdaines Aug 19, 2022
ab243e0
fix: update year on copyright
dhdaines Aug 19, 2022
87fb671
feat: update pocketsphinx5.Vad to use ps_vad.h
dhdaines Aug 19, 2022
52a76cb
docs: examples of live recognition
dhdaines Aug 19, 2022
5f1562a
docs: improve docstrings
dhdaines Aug 19, 2022
2967557
fix: check fread(
dhdaines Aug 19, 2022
864bb41
fix: remove unused constant
dhdaines Aug 19, 2022
42a5478
feat: prototype of ps_endpointer_t (not the final Python API)
dhdaines Aug 19, 2022
ec33619
feat: prototype of segmentation
dhdaines Aug 19, 2022
2e09c1e
fix: add ratio
dhdaines Aug 19, 2022
bd315e8
fix: oops
dhdaines Aug 19, 2022
4d5670c
fix: update WebRTC VAD test for new API
dhdaines Aug 19, 2022
c69652f
test: rename VAD test
dhdaines Aug 19, 2022
114723e
docs: slim down copyright notice
dhdaines Aug 19, 2022
7ceb7f1
docs: reinstate disclaimer
dhdaines Aug 19, 2022
aa5b2dc
docs: update license comment
dhdaines Aug 19, 2022
62d42bf
feat: initial endpointer API
dhdaines Aug 20, 2022
4ba3ae9
fix: get end times right
dhdaines Aug 20, 2022
5a9927e
test: improve endpointer prototype here
dhdaines Aug 20, 2022
372732f
test: add asserts
dhdaines Aug 20, 2022
bd16d21
ci: add sox for python test
dhdaines Aug 20, 2022
7580e51
ci: sudo
dhdaines Aug 20, 2022
8db057d
refactor: track timestamps externally as C code will do
dhdaines Aug 20, 2022
04ea458
refactor: refactor out queue
dhdaines Aug 20, 2022
6702d71
refactor: pull out eof
dhdaines Aug 20, 2022
7d95937
refactor: separate is_speech and pcm
dhdaines Aug 20, 2022
c217c40
refactor: change queue to a sort of ring buffer
dhdaines Aug 20, 2022
d5f03f4
refactor: use less confusing names
dhdaines Aug 20, 2022
4dc2374
feat: initial C implementation of endpointer
dhdaines Aug 20, 2022
932c34f
fix: oops do not always fail
dhdaines Aug 20, 2022
b4f2a8e
feat: report errors somewhat
dhdaines Aug 20, 2022
2ebf1bb
fix: pointer arithmetic is hard
dhdaines Aug 20, 2022
a737ec8
fix: free decoder
dhdaines Aug 20, 2022
2198852
fix: clear after eof
dhdaines Aug 20, 2022
266e8fd
feat: end of stream handling
dhdaines Aug 20, 2022
075a7f0
feat: complete and mostly working endpointer
dhdaines Aug 20, 2022
1b18876
feat: Python interface to endpointer
dhdaines Aug 20, 2022
386ea98
test: synchronize prototype for unit testing
dhdaines Aug 20, 2022
b5df4fd
fix: align ratio computation and eof with prototype
dhdaines Aug 20, 2022
b41cb77
fix: return correct number of end samples
dhdaines Aug 20, 2022
e988446
test: verify equivalency of python and c endpointers
dhdaines Aug 20, 2022
b9c0db8
fix: handle end of stream in live example
dhdaines Aug 20, 2022
6a8fd52
feat: move segmenter into its own class
dhdaines Aug 20, 2022
a127f59
fix: fix examples a bit
dhdaines Aug 20, 2022
2ba4225
docs: remove asyncio example for the moment
dhdaines Aug 21, 2022
93dc809
fix: refcounting
dhdaines Aug 21, 2022
076dfac
test: basic unit tests vad/endpointer
dhdaines Aug 21, 2022
ca1623c
test: run vad at various sampling rates
dhdaines Aug 21, 2022
93d02d5
test: reject unreasonable rates
dhdaines Aug 21, 2022
70d848d
feat: accept approximate rates
dhdaines Aug 21, 2022
e7b35dd
feat: accept approximate sampling rates and document
dhdaines Aug 21, 2022
cafbb23
test: test at other sampling rates
dhdaines Aug 21, 2022
3056ba1
test: test endpointer (loosely) at various sample rates
dhdaines Aug 21, 2022
3f33803
fix: use doubles not floats and check impossible ratios
dhdaines Aug 21, 2022
ec3e54b
fix: add extra checking as in C code
dhdaines Aug 21, 2022
de14964
test: improve endpointer test
dhdaines Aug 21, 2022
4934603
test: install sox
dhdaines Aug 21, 2022
5 changes: 4 additions & 1 deletion .github/workflows/tests.yml
@@ -8,12 +8,14 @@ jobs:
    steps:
      - name: Checkout
        uses: actions/checkout@v3
      - name: Install
        run: |
          sudo apt-get install sox
      - name: Build
        run: |
          mkdir build
          (cd build && cmake -DCMAKE_BUILD_TYPE=Debug -DCMAKE_INSTALL_PREFIX=install ..)
          (cd build && make)

      - name: Run tests
        run: |
          (cd build && make check)
@@ -24,6 +26,7 @@ jobs:
        uses: actions/checkout@v3
      - name: Install
        run: |
          sudo apt-get install sox
          python -m pip install --upgrade pip
          pip install -r requirements.dev.txt
          pip install .
1 change: 1 addition & 0 deletions CMakeLists.txt
@@ -31,6 +31,7 @@ endif()
CHECK_INCLUDE_FILE(unistd.h HAVE_UNISTD_H)
CHECK_INCLUDE_FILE(sys/types.h HAVE_SYS_TYPES_H)
CHECK_INCLUDE_FILE(sys/stat.h HAVE_SYS_STAT_H)
CHECK_INCLUDE_FILE(stdint.h HAVE_STDINT_H)
CHECK_SYMBOL_EXISTS(snprintf stdio.h HAVE_SNPRINTF)
CHECK_SYMBOL_EXISTS(popen stdio.h HAVE_POPEN)
CHECK_TYPE_SIZE(long LONG)
57 changes: 56 additions & 1 deletion LICENSE
@@ -28,4 +28,59 @@ DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.


WebRTC VAD code (in src/vad):

Copyright (c) 2011, The WebRTC project authors. All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:

* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.

* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in
the documentation and/or other materials provided with the
distribution.

* Neither the name of Google nor the names of its contributors may
be used to endorse or promote products derived from this software
without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Python WebRTC VAD code and test files (in cython and test/data/vad):

The MIT License (MIT)

Copyright (c) 2016 John Wiseman

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
6 changes: 6 additions & 0 deletions config.h.in
@@ -24,6 +24,12 @@
/* Define if you have the <unistd.h> header file. */
#cmakedefine HAVE_UNISTD_H

/* Define if you have the <inttypes.h> header file. */
#cmakedefine HAVE_INTTYPES_H

/* Define if you have the <stdint.h> header file. */
#cmakedefine HAVE_STDINT_H

/* The size of `long', as computed by sizeof. */
#cmakedefine SIZEOF_LONG @SIZEOF_LONG@

50 changes: 50 additions & 0 deletions cython/_pocketsphinx.pxd
@@ -20,6 +20,7 @@ cdef extern from "sphinxbase/err.h":
    ctypedef err_e err_lvl_t
    ctypedef void (*err_cb_f)(void* user_data, err_lvl_t lvl, const char *msg)
    void err_set_callback(err_cb_f callback, void *user_data)
    const char *err_set_loglevel_str(const char *lvl)


cdef extern from "sphinxbase/logmath.h":
@@ -405,3 +406,52 @@ cdef extern from "pocketsphinx/ps_search.h":
    int ps_set_allphone(ps_decoder_t *ps, const char *name, ngram_model_t *lm)
    int ps_set_allphone_file(ps_decoder_t *ps, const char *name, const char *path)
    int ps_set_align(ps_decoder_t *ps, const char *name, const char *words)

cdef extern from "pocketsphinx/ps_vad.h":
    ctypedef struct ps_vad_t:
        pass
    cdef enum ps_vad_mode_e:
        PS_VAD_LOOSE,
        PS_VAD_MEDIUM_LOOSE,
        PS_VAD_MEDIUM_STRICT,
        PS_VAD_STRICT
    ctypedef ps_vad_mode_e ps_vad_mode_t
    cdef enum ps_vad_class_e:
        PS_VAD_ERROR,
        PS_VAD_NOT_SPEECH,
        PS_VAD_SPEECH
    ctypedef ps_vad_class_e ps_vad_class_t
    cdef int PS_VAD_DEFAULT_SAMPLE_RATE
    cdef double PS_VAD_DEFAULT_FRAME_LENGTH

    ps_vad_t *ps_vad_init(ps_vad_mode_t mode, int sample_rate, double frame_length)
    int ps_vad_free(ps_vad_t *vad)
    int ps_vad_set_input_params(ps_vad_t *vad, int sample_rate, double frame_length)
    int ps_vad_sample_rate(ps_vad_t *vad)
    size_t ps_vad_frame_size(ps_vad_t *vad)
    double ps_vad_frame_length(ps_vad_t *vad)
    ps_vad_class_t ps_vad_classify(ps_vad_t *vad, const short *frame)

cdef extern from "pocketsphinx/ps_endpointer.h":
    ctypedef struct ps_endpointer_t:
        pass
    cdef double PS_ENDPOINTER_DEFAULT_WINDOW
    cdef double PS_ENDPOINTER_DEFAULT_RATIO
    ps_endpointer_t *ps_endpointer_init(double window,
                                        double ratio,
                                        ps_vad_mode_t mode,
                                        int sample_rate, double frame_length)
    ps_endpointer_t *ps_endpointer_retain(ps_endpointer_t *ep)
    int ps_endpointer_free(ps_endpointer_t *ep)
    ps_vad_t *ps_endpointer_vad(ps_endpointer_t *ep)
    size_t ps_endpointer_frame_size(ps_endpointer_t *ep)
    int ps_endpointer_sample_rate(ps_endpointer_t *ep)
    const short *ps_endpointer_process(ps_endpointer_t *ep,
                                       const short *frame)
    const short *ps_endpointer_end_stream(ps_endpointer_t *ep,
                                          const short *frame,
                                          size_t nsamp,
                                          size_t *out_nsamp)
    int ps_endpointer_in_speech(ps_endpointer_t *ep)
    double ps_endpointer_speech_start(ps_endpointer_t *ep)
    double ps_endpointer_speech_end(ps_endpointer_t *ep)
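The `ps_vad_mode_e` declaration above lists four modes in order, and the `Vad` docstring describes the mode as an integer from 0 to 3. A minimal Python sketch of that mapping follows. It assumes the C header leaves the enumerators at their default values (0 through 3), which this diff does not show; `VadMode` and `parse_mode` are hypothetical names used only for illustration and are not part of the module.

```python
from enum import IntEnum


class VadMode(IntEnum):
    """Hypothetical mirror of ps_vad_mode_e (values assumed to start at 0)."""
    LOOSE = 0
    MEDIUM_LOOSE = 1
    MEDIUM_STRICT = 2
    STRICT = 3


def parse_mode(value):
    """Return the VadMode for an integer in 0-3, or raise ValueError."""
    try:
        return VadMode(value)
    except ValueError:
        raise ValueError("VAD mode must be between 0 and 3, got %r" % (value,)) from None


if __name__ == "__main__":
    for mode in VadMode:
        print(int(mode), mode.name)
```

Validating the mode on the Python side like this would give a clearer error message than the generic "Invalid VAD parameters" raised when initialization fails.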
177 changes: 170 additions & 7 deletions cython/_pocketsphinx.pyx
@@ -109,7 +109,7 @@ cdef class Config:
        if config == NULL:
            return None
        return Config.create_from_ptr(config)

    def __dealloc__(self):
        cmd_ln_free_r(self.cmd_ln)

@@ -662,7 +662,7 @@ cdef class Jsgf:
    """JSGF parser.
    """
    cdef jsgf_t *jsgf

    def __init__(self, str path, Jsgf parent=None):
        cdef jsgf_t *cparent
        cpath = path.encode()
@@ -710,12 +710,12 @@ cdef class Lattice:
    def __dealloc__(self):
        if self.dag != NULL:
            ps_lattice_free(self.dag)

    def write(self, str path):
        rv = ps_lattice_write(self.dag, path.encode("utf-8"))
        if rv < 0:
            raise RuntimeError("Failed to write lattice to %s" % path)

    def write_htk(self, str path):
        rv = ps_lattice_write_htk(self.dag, path.encode("utf-8"))
        if rv < 0:
@@ -1124,7 +1124,7 @@ cdef class Decoder:
        if fsg == NULL:
            return None
        return FsgModel.create_from_ptr(fsg_model_retain(fsg))

    def set_fsg(self, str name, FsgModel fsg):
        """Create a search module from an FSG.

@@ -1218,7 +1218,7 @@ cdef class Decoder:
        if rv < 0:
            raise RuntimeError("Failed to set keyword search %s from phrase %s"
                               % (name, keyphrase))

    def set_allphone_file(self, str name, str lmfile = None):
        """Create a phoneme recognition search module.

@@ -1285,7 +1285,7 @@ cdef class Decoder:
            Config: Configuration parsed from `path`.
        """
        return Config.parse_file(path)

    def load_dict(self, str dict_path, str fdict_path = None, str _format = None):
        """Load dictionary (and possibly noise dictionary) from a file.

@@ -1453,3 +1453,166 @@ cdef class Decoder:
        """
        return ps_get_n_frames(self.ps)

cdef class Vad:
    """Voice activity detection class.

    Args:
        mode(int): Aggressiveness of voice activity detection (0-3)
        sample_rate(int): Sampling rate of input, default is 16000.
                          Rates other than 8000, 16000, 32000, 48000
                          are only approximately supported, see note
                          in `frame_length`. Outlandish sampling
                          rates like 3924 and 115200 will raise a
                          `ValueError`.
        frame_length(float): Desired input frame length in seconds,
                             default is 0.03. The *actual* frame
                             length may be different if an
                             approximately supported sampling rate is
                             requested. You must *always* use the
                             `frame_bytes` and `frame_length`
                             attributes to determine the input size.

    Attributes:
        sample_rate(int): Sampling rate of input (default is 16000)
        frame_bytes(int): Number of bytes in a frame accepted by `is_speech`.
        frame_length(float): Length of a frame (*may be different from
                             the one requested in the constructor*!)

    Raises:
        ValueError: Invalid input parameter (see above).
    """
    cdef ps_vad_t *_vad
    LOOSE = PS_VAD_LOOSE
    MEDIUM_LOOSE = PS_VAD_MEDIUM_LOOSE
    MEDIUM_STRICT = PS_VAD_MEDIUM_STRICT
    STRICT = PS_VAD_STRICT
    DEFAULT_SAMPLE_RATE = PS_VAD_DEFAULT_SAMPLE_RATE
    DEFAULT_FRAME_LENGTH = PS_VAD_DEFAULT_FRAME_LENGTH

    def __init__(self, mode=PS_VAD_LOOSE,
                 sample_rate=PS_VAD_DEFAULT_SAMPLE_RATE,
                 frame_length=PS_VAD_DEFAULT_FRAME_LENGTH):
        self._vad = ps_vad_init(mode, sample_rate, frame_length)
        if self._vad == NULL:
            raise ValueError("Invalid VAD parameters")

    def __dealloc__(self):
        ps_vad_free(self._vad)

    @property
    def frame_bytes(self):
        return ps_vad_frame_size(self._vad) * 2

    @property
    def frame_length(self):
        return ps_vad_frame_length(self._vad)

    @property
    def sample_rate(self):
        return ps_vad_sample_rate(self._vad)

    def is_speech(self, frame, sample_rate=None):
        """Classify a frame as speech or not.

        Args:
            frame(bytes): Buffer containing speech data (16-bit signed
                          integers). Must be of length `frame_bytes`
                          (in bytes).
        Returns:
            (boolean) Classification as speech or not speech.
        Raises:
            IndexError: `frame` is of invalid size.
            ValueError: Other internal VAD error.
        """
        cdef const unsigned char[:] cframe = frame
        cdef Py_ssize_t n_samples = len(cframe) // 2
        if len(cframe) != self.frame_bytes:
            raise IndexError("Frame size must be %d bytes" % self.frame_bytes)
        rv = ps_vad_classify(self._vad, <const short *>&cframe[0])
        if rv < 0:
            raise ValueError("VAD classification failed")
        return rv == PS_VAD_SPEECH
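Since `is_speech` rejects any buffer whose length is not exactly `frame_bytes`, callers have to cut their audio into fixed-size frames first. The sketch below shows the arithmetic in plain Python, without importing the module. `frame_bytes_for` and `split_frames` are hypothetical helpers, and the size computation assumes an exactly supported sampling rate; with an approximately supported rate the real `frame_bytes` attribute may differ and should be used instead.

```python
def frame_bytes_for(sample_rate=16000, frame_length=0.03, sample_width=2):
    """Bytes per frame for 16-bit PCM: samples per frame times sample width.

    Assumption: the rate is exactly supported, so the frame length is not
    adjusted. Real code should read `frame_bytes` from the Vad object.
    """
    n_samples = int(round(sample_rate * frame_length))
    return n_samples * sample_width


def split_frames(pcm, frame_bytes):
    """Split raw PCM bytes into full frames plus a trailing remainder."""
    n_full = len(pcm) // frame_bytes
    frames = [pcm[i * frame_bytes:(i + 1) * frame_bytes] for i in range(n_full)]
    remainder = pcm[n_full * frame_bytes:]
    return frames, remainder


if __name__ == "__main__":
    size = frame_bytes_for()  # 480 samples * 2 bytes at 16 kHz, 30 ms
    frames, rest = split_frames(b"\x00" * 2000, size)
    print(size, len(frames), len(rest))
```

Each full frame could then be passed to `is_speech`; the short remainder cannot, because its length does not match `frame_bytes`.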

cdef class Endpointer:
    """Simple endpointer using voice activity detection.
    """
    cdef ps_endpointer_t *_ep
    DEFAULT_WINDOW = PS_ENDPOINTER_DEFAULT_WINDOW
    DEFAULT_RATIO = PS_ENDPOINTER_DEFAULT_RATIO
    def __init__(
        self,
        window=0.3,
        ratio=0.9,
        vad_mode=Vad.LOOSE,
        sample_rate=Vad.DEFAULT_SAMPLE_RATE,
        frame_length=Vad.DEFAULT_FRAME_LENGTH,
    ):
        self._ep = ps_endpointer_init(window, ratio,
                                      vad_mode, sample_rate, frame_length)
        if (self._ep == NULL):
            raise ValueError("Invalid endpointer or VAD parameters")

    @property
    def frame_bytes(self):
        return ps_endpointer_frame_size(self._ep) * 2

    @property
    def sample_rate(self):
        return ps_endpointer_sample_rate(self._ep)

    @property
    def in_speech(self):
        return ps_endpointer_in_speech(self._ep)

    @property
    def speech_start(self):
        return ps_endpointer_speech_start(self._ep)

    @property
    def speech_end(self):
        return ps_endpointer_speech_end(self._ep)

    def process(self, frame):
        """Read a frame of data and return speech if detected.

        Args:
            frame(bytes): Buffer containing speech data (16-bit signed
                          integers). Must be of length `frame_bytes`
                          (in bytes).
        Returns:
            (bytes) Frame of speech data, or None if none detected.
        Raises:
            IndexError: `frame` is of invalid size.
            ValueError: Other internal VAD error.
        """
        cdef const unsigned char[:] cframe = frame
        cdef Py_ssize_t n_samples = len(cframe) // 2
        cdef const short *outframe
        if len(cframe) != self.frame_bytes:
            raise IndexError("Frame size must be %d bytes" % self.frame_bytes)
        outframe = ps_endpointer_process(self._ep,
                                         <const short *>&cframe[0])
        if outframe == NULL:
            return None
        return (<const unsigned char *>&outframe[0])[:n_samples * 2]

    def end_stream(self, frame):
        """Read a final frame of at most `frame_bytes` bytes.

        Returns:
            (bytes) Remaining speech data, or None if there is none.
        Raises:
            IndexError: `frame` is longer than `frame_bytes`.
        """
        cdef const unsigned char[:] cframe = frame
        cdef Py_ssize_t n_samples = len(cframe) // 2
        cdef const short *outbuf
        cdef size_t out_n_samples
        if len(cframe) > self.frame_bytes:
            raise IndexError("Frame size must be %d bytes or less" % self.frame_bytes)
        outbuf = ps_endpointer_end_stream(self._ep,
                                          <const short *>&cframe[0],
                                          n_samples,
                                          &out_n_samples)
        if outbuf == NULL:
            return None
        return (<const unsigned char *>&outbuf[0])[:out_n_samples * 2]
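The C endpointer itself is not part of this excerpt, so the role of `window` and `ratio` can only be illustrated under assumptions. The toy class below is one plausible reading: keep the VAD decisions for the last `window` seconds, enter speech when more than `ratio` of them are speech, and leave speech when fewer than `1 - ratio` are. `WindowEndpointer` is a hypothetical name, the exit threshold is a guess, and the sketch works on booleans rather than audio.

```python
from collections import deque


class WindowEndpointer:
    """Toy endpointer over per-frame VAD decisions (illustrative only)."""

    def __init__(self, window=0.3, ratio=0.9, frame_length=0.03):
        # Number of frames covered by the decision window (10 by default).
        self.maxlen = max(1, int(round(window / frame_length)))
        self.ratio = ratio
        self.decisions = deque(maxlen=self.maxlen)
        self.in_speech = False

    def process(self, is_speech):
        """Record one VAD decision and return the updated speech state."""
        self.decisions.append(bool(is_speech))
        speech_count = sum(self.decisions)
        if self.in_speech:
            # Assumed exit rule: almost no speech left in the window.
            if speech_count < (1.0 - self.ratio) * self.maxlen:
                self.in_speech = False
        else:
            # Assumed entry rule: the window is almost entirely speech.
            if speech_count > self.ratio * self.maxlen:
                self.in_speech = True
        return self.in_speech


if __name__ == "__main__":
    ep = WindowEndpointer()
    states = [ep.process(True) for _ in range(10)]
    print(states)
```

Using two different thresholds gives hysteresis, so a single misclassified frame does not toggle the speech state.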

def set_loglevel(level):
    """Set the logging level by name.

    Raises:
        ValueError: `level` is not a valid log level name.
    """
    cdef const char *prev_level
    prev_level = err_set_loglevel_str(level.encode('utf-8'))
    if prev_level == NULL:
        raise ValueError("Invalid log level %s" % level)