Keyword Spotting Data Generator


To make Honk and Honkling more flexible, we provide a program that constructs a keyword spotting dataset from YouTube videos. The key idea is to narrow the search space using subtitles, then extract the target audio using PocketSphinx.
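
As a rough illustration of the subtitle step, the sketch below picks out the time windows whose caption text contains the keyword, so that only those windows would need to be passed to a speech recognizer. It is not the code used by `keyword_data_generator.py`: the `Caption` type, the `candidate_windows` function, and the `padding` parameter are hypothetical names introduced here, and the caption format (start time, duration, text) is an assumption.

```python
import re
from typing import List, NamedTuple, Tuple


class Caption(NamedTuple):
    """One subtitle entry (hypothetical format, assumed for this sketch)."""
    start: float     # start time in seconds
    duration: float  # length of the caption in seconds
    text: str        # caption text


def candidate_windows(captions: List[Caption], keyword: str,
                      padding: float = 0.5) -> List[Tuple[float, float]]:
    """Return (begin, end) windows whose caption contains the keyword as a whole word."""
    keyword = keyword.lower()
    windows = []
    for cap in captions:
        # Split the caption into lowercase words so "googled" does not match "google".
        words = re.findall(r"[a-z']+", cap.text.lower())
        if keyword in words:
            begin = max(0.0, cap.start - padding)
            end = cap.start + cap.duration + padding
            windows.append((begin, end))
    return windows
```

Only the audio inside the returned windows would then need to be searched for the exact position of the keyword, which is far cheaper than scanning the whole video.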

< Preparation >

  • Necessary Python packages can be installed with pip install -r requirements.txt
  • ffmpeg and SoX must be available as well.
  • YouTube Data API key: follow the official instructions to obtain a new API key
  • Words API Key
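
Before running the generator, it may help to confirm that the external tools are on the PATH. The following is a minimal sketch, assuming the executables are named `ffmpeg` and `sox`; the `missing_tools` helper is a hypothetical name and is not part of this repository.

```python
import shutil
from typing import Iterable, List


def missing_tools(tools: Iterable[str] = ("ffmpeg", "sox")) -> List[str]:
    """Return the names of the tools that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing required tools: " + ", ".join(missing))
    else:
        print("ffmpeg and SoX are available.")
```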

< Usage >

python keyword_data_generator.py
	-y < youtube data v3 API key >
	-w < words API key >
	-k < list of keywords to search >
	-s < number of samples to collect per keyword (default: 10) >
	-o < output path (default: "./generated_keyword_audios") >

example:

python keyword_data_generator.py -y $YOUTUBE_API_KEY -w $WORDS_API_KEY -k google slack -s 20 -o ./generated
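
For reference, the options above could be declared with `argparse` roughly as follows. This is a sketch of the documented interface, not the script's actual source: the short flags and defaults come from the usage text, while the long option names, the `dest` names, and the choice to make the keys and keywords required are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented flags (long names are assumed)."""
    parser = argparse.ArgumentParser(prog="keyword_data_generator.py")
    parser.add_argument("-y", "--youtube_api_key", required=True,
                        help="YouTube Data v3 API key")
    parser.add_argument("-w", "--words_api_key", required=True,
                        help="Words API key")
    # nargs="+" accepts one or more keywords, e.g. "-k google slack".
    parser.add_argument("-k", "--keywords", nargs="+", required=True,
                        help="list of keywords to search")
    parser.add_argument("-s", "--samples", type=int, default=10,
                        help="number of samples to collect per keyword")
    parser.add_argument("-o", "--output", default="./generated_keyword_audios",
                        help="output path")
    return parser
```

With this sketch, the example command line parses to `keywords == ["google", "slack"]`, `samples == 20`, and `output == "./generated"`.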

< Improvements >


  • filter out non-English videos
  • adjust the ffmpeg command to handle different video formats: mov, mp4, m4a, 3gp, 3g2, mj2
  • handle long videos dynamically (currently a simple filter)
  • improve throughput by parallelizing the process
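
One possible shape for the parallelization item is to process keywords concurrently with a thread pool, since most of the time is likely spent waiting on downloads and external tools. The sketch below is only an illustration under that assumption: `collect_for_keyword` is a hypothetical placeholder for the per-keyword pipeline, and `collect_all` and `max_workers` are names introduced here.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Tuple


def collect_for_keyword(keyword: str) -> Tuple[str, int]:
    """Hypothetical placeholder: run the pipeline for one keyword.

    Returns the keyword and the number of samples collected (always 0 here).
    """
    return keyword, 0


def collect_all(keywords: Iterable[str],
                worker: Callable[[str], Tuple[str, int]] = collect_for_keyword,
                max_workers: int = 4) -> Dict[str, int]:
    """Run the worker for every keyword in parallel and gather the counts."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields (keyword, count) pairs in input order.
        return dict(pool.map(worker, keywords))
```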