keyword spotting data generator #100

Merged
merged 4 commits on Jan 15, 2020
3 changes: 0 additions & 3 deletions .gitmodules

This file was deleted.

11 changes: 11 additions & 0 deletions README.md
@@ -7,6 +7,7 @@
Honk is a PyTorch reimplementation of Google's TensorFlow convolutional neural networks for keyword spotting, which accompanies the recent release of their [Speech Commands Dataset](https://research.googleblog.com/2017/08/launching-speech-commands-dataset.html). For more details, please consult our writeup:

+ Raphael Tang, Jimmy Lin. [Honk: A PyTorch Reimplementation of Convolutional Neural Networks for Keyword Spotting.](https://arxiv.org/abs/1710.06554) _arXiv:1710.06554_, October 2017.
+ Raphael Tang, Jimmy Lin. [Deep Residual Learning for Small-Footprint Keyword Spotting.](https://arxiv.org/abs/1710.10361) _Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing_, pp. 5479-5483.

Honk is useful for building on-device speech recognition capabilities for interactive intelligent agents. Our code can be used to identify simple commands (e.g., "stop" and "go") and be adapted to detect custom "command triggers" (e.g., "Hey Siri!").

@@ -117,6 +118,16 @@ There are command options available:
| `--unknown_prob` | [0.0, 1.0] | 0.1 | the probability of picking an unknown word |
| `--wanted_words` | string1 string2 ... stringn | command random | the desired target words |

### JavaScript-based Keyword Spotting

[Honkling](https://github.com/castorini/honkling) is a JavaScript implementation of Honk.
With Honkling, it is possible to implement various web applications with in-browser keyword spotting functionality.

### Keyword Spotting Data Generator

In order to improve the flexibility of Honk and Honkling, we provide a program that constructs a dataset from YouTube videos.
Details can be found in the `keyword_spotting_data_generator` folder.

### Recording audio

You may do the following to record sequential audio and save it in the same format as the Speech Commands dataset:
101 changes: 16 additions & 85 deletions keyword_spotting_data_generator/README.md
@@ -1,101 +1,32 @@
# Keyword Spotting Data Generator
---
In order to add flexibility to keyword spotting, we are working on a dataset generator that uses YouTube videos. The key idea is to decrease the search space by utilizing subtitles.

This is still in development, but it is already possible to generate a dataset.
Note that the current version has a precision of ~0.5.
In order to improve the flexibility of [Honk](https://github.com/castorini/honk) and [Honkling](https://github.com/castorini/honkling), we provide a program that constructs a dataset from YouTube videos.
The key idea is to decrease the search space by utilizing subtitles and to extract the target audio using [PocketSphinx](https://github.com/cmusphinx/pocketsphinx).
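
For illustration, here is a minimal sketch of how PocketSphinx's keyword-search mode can flag occurrences of a target word inside an audio clip. It uses the classic `pocketsphinx` Python bindings; the model paths, the `clip.raw` file name, the keyword "google", and the 16 kHz raw mono input are illustrative assumptions, and the extraction logic in this repository may differ.

```
import os
from pocketsphinx import Decoder, get_model_path

# Configure a keyword-search decoder for a single keyphrase.
model_path = get_model_path()
config = Decoder.default_config()
config.set_string('-hmm', os.path.join(model_path, 'en-us'))
config.set_string('-dict', os.path.join(model_path, 'cmudict-en-us.dict'))
config.set_string('-keyphrase', 'google')      # hypothetical target keyword
config.set_float('-kws_threshold', 1e-20)      # lower values detect more aggressively
config.set_string('-logfn', os.devnull)        # silence decoder logging

decoder = Decoder(config)
decoder.start_utt()
with open('clip.raw', 'rb') as f:              # hypothetical 16 kHz, 16-bit mono raw audio
    while True:
        buf = f.read(2048)
        if not buf:
            break
        decoder.process_raw(buf, False, False)
        if decoder.hyp() is not None:          # keyword detected in the stream so far
            print('detected:', decoder.hyp().hypstr)
            decoder.end_utt()
            decoder.start_utt()                # reset and keep scanning
decoder.end_utt()
```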

## < Preparation >
___
1. The current version is implemented with a technique called [forced alignment](https://github.com/pettarin/forced-alignment-tools#definition-of-forced-alignment); a minimal alignment sketch follows this list. Install [Aeneas](https://github.com/readbeyond/aeneas#system-requirements-supported-platforms-and-installation). If it complains about `/usr/bin/ld: cannot find -lespeak`, this [page](https://github.com/readbeyond/aeneas/issues/189) may help.
2. Install the necessary packages by running `pip install -r requirements.txt`
3. [Obtain a Google API key](https://support.google.com/googleapi/answer/6158862?hl=en), and set `API_KEY = google_api_key` in `search.py`
- The necessary Python packages can be installed with `pip install -r requirements.txt`
- [ffmpeg](https://www.ffmpeg.org/) and [SoX](http://sox.sourceforge.net/) must be available as well.
- YouTube Data API key - follow [these instructions](https://developers.google.com/youtube/v3/getting-started) to obtain a new API key
- [Words API Key](https://www.wordsapi.com/)
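
To illustrate the forced-alignment step from item 1, below is a minimal sketch using the standard Aeneas task API; the file paths and the plain-text transcript format are assumptions, not the exact configuration this project uses.

```
from aeneas.executetask import ExecuteTask
from aeneas.task import Task

# Align a plain-text transcript (one fragment per line) against its audio track.
config_string = u"task_language=eng|is_text_type=plain|os_task_file_format=json"
task = Task(config_string=config_string)
task.audio_file_path_absolute = u"/tmp/clip.wav"         # hypothetical audio file
task.text_file_path_absolute = u"/tmp/transcript.txt"    # hypothetical transcript
task.sync_map_file_path_absolute = u"/tmp/syncmap.json"  # start/end time per fragment goes here

# Run the alignment and write the sync map.
ExecuteTask(task).execute()
task.output_sync_map_file()
```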

## < Usage >
___
### Generating Dataset

```
python keyword_data_generator.py -k < keywords to search > -s < size of keyword >
python keyword_data_generator.py
-y < youtube data v3 API key >
-w < words API key >
-k < list of keywords to search >
-s < number of samples to collect per keyword (default: 10) >
-o < output path (default: "./generated_keyword_audios") >
```

### Filtering Correct Audios
By running the `drop_audio.py` script, the user can manually drop false-positive audio files. This script plays each audio file in the folder and asks whether it contains the target keyword.

example:
```
python3 drop_audio.py < folder_name >
python keyword_data_generator.py -y $YOUTUBE_API_KEY -w $WORDS_API_KEY -k google slack -s 20 -o ./generated
```
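
As a rough illustration of the filtering step, the sketch below plays each `.wav` file in a folder and deletes it on a negative answer. It assumes the `sounddevice` and `soundfile` packages and is not necessarily how `drop_audio.py` is implemented.

```
import os
import sys

import sounddevice as sd
import soundfile as sf

# Play every .wav file in the given folder and ask whether it contains the keyword.
folder = sys.argv[1]
for name in sorted(os.listdir(folder)):
    if not name.endswith('.wav'):
        continue
    path = os.path.join(folder, name)
    data, sample_rate = sf.read(path)
    sd.play(data, sample_rate)
    sd.wait()                                  # block until playback finishes
    answer = input(f'{name} - does it contain the keyword? [y/n] ')
    if answer.strip().lower() == 'n':
        os.remove(path)                        # drop the false positive
```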

## < Improvements >
___
- filtering non-English videos
- ffmpeg handling more dynamic video types: mov, mp4, m4a, 3gp, 3g2, mj2
- if the video contains any of the target words, generate a block
- adjust the ffmpeg command to handle different types of video: mov, mp4, m4a, 3gp, 3g2, mj2
- dynamic handling of long videos (currently a simple filter)
- increase the number of YouTube videos retrieved from search (e.g., searching similar words)
- increase the rate of finding the target term by stemming words (see the sketch below)
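
As an example of the last item, here is a hedged sketch of counting subtitle words that match a target term after stemming, using NLTK's Porter stemmer (one possible choice, not necessarily the one that will be adopted):

```
from nltk.stem import PorterStemmer

# Count subtitle words that share a stem with the target keyword,
# so inflected forms still count as matches.
stemmer = PorterStemmer()

def keyword_matches(subtitle_text, keyword):
    target = stemmer.stem(keyword.lower())
    return sum(1 for word in subtitle_text.lower().split()
               if stemmer.stem(word) == target)

print(keyword_matches("he runs while she keeps running", "run"))  # -> 2
```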

## Evaluation of Improvements
In order to quantify the improvements, we are working on an evaluation framework that measures the quality of the selected audio. We hope this will help us develop a robust keyword spotting data generator.

The evaluation process involves the following steps:

1. `url_file_generator.py` : collect URLs whose subtitles contain the target keyword and store them in a single .txt file (url file)
2. `evaluation_data_generator.py` : for each audio block containing the target keyword, record how many times the target keyword actually appears; a csv file summarizing the details of each audio block is generated (summary file)
3. `evaluation_audio_generator.py` : generate an audio dataset from the summary file
4. `evaluate.py` : measure the quality of the specified similar audio extraction algorithm on a given summary file

##### Setting up the Experiment
After cloning this repo, run the following command to clone the submodule [kws-gen-data](https://github.com/castorini/kws-gen-data):
`git submodule update --init --recursive`

##### `url_file_generator.py`
Collect URLs of videos whose subtitles contain the target keywords

```
python url_file_generator.py
-a < youtube data v3 API key >
-k < keywords to search >
-s < number of urls >
```
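
For instance, a hypothetical invocation (argument values are illustrative; flag names follow the usage above):

```
python url_file_generator.py -a $YOUTUBE_API_KEY -k google -s 100
```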

##### `evaluation_data_generator.py`
For each audio block with the keyword, allow the user to record how many times the target keyword actually appears. This is the ground truth for measuring quality.
The generated csv file is called a summary file, where the columns are `url`, `start_ms`, `end_ms`, `cc_count`, `audio_count`:
- url - unique id of the YouTube video
- start_ms - start time of the given subtitle section
- end_ms - end time of the given subtitle section
- cc_count - how many times the keyword appeared in the subtitle
- audio_count - how many times the keyword appeared in the audio (user input)

```
python evaluation_data_generator.py
-a < youtube data v3 API key >
-k < keywords to search >
-s < number of urls >
-f < url file name (when unspecified, directly search YouTube) >
-c < url in url file to start from >
-l < maximum length of a video (s) >
-o < output csv file to append output to >
```
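
To make the summary-file format concrete, here is a minimal sketch that reads such a file with Python's `csv` module; it assumes a plain CSV with a header row containing the column names listed above, which may differ from what the script actually writes.

```
import csv

# Print each labeled audio block from a summary file
# (assumed columns: url, start_ms, end_ms, cc_count, audio_count).
with open("summary.csv", newline="") as f:
    for row in csv.DictReader(f):
        start_s = int(row["start_ms"]) / 1000.0
        end_s = int(row["end_ms"]) / 1000.0
        print(f'{row["url"]} [{start_s:.1f}s - {end_s:.1f}s]: '
              f'subtitle count = {row["cc_count"]}, audio count = {row["audio_count"]}')
```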

##### `evaluation_audio_generator.py`
Generate a set of `.wav` files from the provided summary file

```
python evaluation_audio_generator.py
-a < youtube data v3 API key >
-k < keywords to search >
-f < summary file >
```

##### `evaluate.py`
Measure the quality of the specified similar audio retrieval process on a given summary file

```
python evaluate.py
-k < keywords to search >
-f < summary file >
-r < type of extraction algorithm to use >
-th < threshold for retrieving a window >
```
- improve throughput by parallelizing the process
42 changes: 0 additions & 42 deletions keyword_spotting_data_generator/drop_audio.py

This file was deleted.

114 changes: 0 additions & 114 deletions keyword_spotting_data_generator/evaluation/evaluate.py

This file was deleted.
