
Kathbath asr #5369

Merged
40 commits merged on Aug 16, 2023
Changes from 3 commits
Commits
40 commits
758418b
start kathbath recipe
bloodraven66 Jul 24, 2023
bff440f
complete download and untar
bloodraven66 Jul 24, 2023
4071d99
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jul 24, 2023
dbfbbc9
complete dataprep
bloodraven66 Jul 25, 2023
4c8aeda
fix merge
bloodraven66 Jul 25, 2023
94bfe89
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jul 25, 2023
e45afca
minor changes
bloodraven66 Jul 28, 2023
bab4b1c
Merge branch 'kathbath_asr' of https://github.com/bloodraven66/espnet…
bloodraven66 Jul 28, 2023
0c24b0e
run and conf files
bloodraven66 Jul 28, 2023
b48dbe9
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jul 28, 2023
7c63552
Merge branch 'master' into kathbath_asr
sw005320 Aug 3, 2023
907d1f9
Update egs2/TEMPLATE/asr1/db.sh
sw005320 Aug 3, 2023
182a46f
fix ci errors
bloodraven66 Aug 4, 2023
3da1d51
fix ci errors
bloodraven66 Aug 4, 2023
87dd63a
add dataset info in readme
bloodraven66 Aug 4, 2023
c2ed81f
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 4, 2023
4acf4f0
Merge branch 'kathbath_asr' of https://github.com/bloodraven66/espnet…
bloodraven66 Aug 4, 2023
df74487
fix ci warning
bloodraven66 Aug 4, 2023
15bc40a
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 4, 2023
18a1124
fix ci error
bloodraven66 Aug 4, 2023
db2651e
Merge branch 'kathbath_asr' of https://github.com/bloodraven66/espnet…
bloodraven66 Aug 4, 2023
edb4d68
add readme with results
bloodraven66 Aug 4, 2023
7c360da
Update README.md with huggingface links
bloodraven66 Aug 5, 2023
d91d032
Update db.sh
bloodraven66 Aug 5, 2023
df89984
Update data.sh
bloodraven66 Aug 6, 2023
1e7c4be
Update README.md with mr results
bloodraven66 Aug 7, 2023
2df1158
Update README.md
bloodraven66 Aug 7, 2023
65a3239
Update README.md
bloodraven66 Aug 7, 2023
3777148
Update README.md
bloodraven66 Aug 7, 2023
dbe5443
Update README.md with marathi model link
bloodraven66 Aug 7, 2023
b39c1ef
replace find-while-ffmpeg with find-exec-ffmpeg
bloodraven66 Aug 7, 2023
8faa86f
Update egs2/kathbath/asr1/README.md
bloodraven66 Aug 8, 2023
3af28a8
update db.sh
bloodraven66 Aug 8, 2023
0315b89
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 8, 2023
b0c2919
Update README.md
bloodraven66 Aug 10, 2023
ee2b66a
Update README.md
bloodraven66 Aug 10, 2023
6536e6c
Update README.md
bloodraven66 Aug 10, 2023
0f860fc
Update README.md
bloodraven66 Aug 10, 2023
48aa29c
Merge branch 'master' into kathbath_asr
bloodraven66 Aug 10, 2023
8cb9055
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 10, 2023
1 change: 1 addition & 0 deletions egs2/kathbath/asr1/asr.sh
110 changes: 110 additions & 0 deletions egs2/kathbath/asr1/cmd.sh
@@ -0,0 +1,110 @@
# ====== About run.pl, queue.pl, slurm.pl, and ssh.pl ======
# Usage: <cmd>.pl [options] JOB=1:<nj> <log> <command...>
# e.g.
# run.pl --mem 4G JOB=1:10 echo.JOB.log echo JOB
#
# Options:
# --time <time>: Limit the maximum time to execute.
# --mem <mem>: Limit the maximum memory usage.
# --max-jobs-run <njob>: Limit the number of parallel jobs. This is ignored for non-array jobs.
# --num-threads <nthreads>: Specify the number of CPU cores.
# --gpu <ngpu>: Specify the number of GPU devices.
# --config: Change the configuration file from default.
#
# "JOB=1:10" is used for "array jobs" and it can control the number of parallel jobs.
# The string left of "=", i.e. "JOB", is replaced by the job index <N> in both the command and the log file name,
# e.g. "echo JOB" becomes "echo 3" for the 3rd job and "echo 8" for the 8th job.
# Note that the number must start with a positive number, so you can't use "JOB=0:10" for example.
#
# run.pl, queue.pl, slurm.pl, and ssh.pl share a unified interface that does not depend on the backend.
# These options are mapped to backend-specific options, which are configured
# by "conf/queue.conf" and "conf/slurm.conf" by default.
# If jobs fail, your configuration might be wrong for your environment.
#
#
# The official documentation for run.pl, queue.pl, slurm.pl, and ssh.pl:
# "Parallelization in Kaldi": http://kaldi-asr.org/doc/queue.html
# =========================================================
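The `JOB=1:<nj>` substitution described in the header above can be emulated in plain bash. A hypothetical illustration of the expansion idea only (this is not run.pl itself, and `cmd` is just a local variable):

```shell
# Emulate how "JOB=1:3" runs a command once per array index:
# every occurrence of the literal string JOB is replaced by the index.
cmd='echo job JOB'
for n in 1 2 3; do
    eval "${cmd//JOB/$n}"   # run.pl would also substitute JOB in the log file name
done
```

With run.pl itself, the same substitution additionally applies to the log path, so each array task writes its own log.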


# Select the backend used by run.sh from "local", "stdout", "sge", "slurm", or "ssh"
cmd_backend='local'

# Local machine, without any Job scheduling system
if [ "${cmd_backend}" = local ]; then

# The other usage
export train_cmd="run.pl"
# Used for "*_train.py": "--gpu" is appended optionally by run.sh
export cuda_cmd="run.pl"
# Used for "*_recog.py"
export decode_cmd="run.pl"

# Local machine logging to stdout and log file, without any Job scheduling system
elif [ "${cmd_backend}" = stdout ]; then

# The other usage
export train_cmd="stdout.pl"
# Used for "*_train.py": "--gpu" is appended optionally by run.sh
export cuda_cmd="stdout.pl"
# Used for "*_recog.py"
export decode_cmd="stdout.pl"


# "qsub" (Sun Grid Engine, or derivation of it)
elif [ "${cmd_backend}" = sge ]; then
# The default setting is written in conf/queue.conf.
# You must change "-q g.q" to match the queue in your environment.
# To list the queue names, type "qhost -q".
# Note that to use "--gpu *", you have to set up "complex_value" for the system scheduler.

export train_cmd="queue.pl"
export cuda_cmd="queue.pl"
export decode_cmd="queue.pl"


# "qsub" (Torque/PBS.)
elif [ "${cmd_backend}" = pbs ]; then
# The default setting is written in conf/pbs.conf.

export train_cmd="pbs.pl"
export cuda_cmd="pbs.pl"
export decode_cmd="pbs.pl"


# "sbatch" (Slurm)
elif [ "${cmd_backend}" = slurm ]; then
# The default setting is written in conf/slurm.conf.
# You must change "-p cpu" and "-p gpu" to match the partitions in your environment.
# To list the partition names, type "sinfo".
# You can use "--gpu *" by default for slurm and it is interpreted as "--gres gpu:*".
# The devices are allocated exclusively using "${CUDA_VISIBLE_DEVICES}".

export train_cmd="slurm.pl"
export cuda_cmd="slurm.pl"
export decode_cmd="slurm.pl"

elif [ "${cmd_backend}" = ssh ]; then
# You have to create ".queue/machines" to specify the host to execute jobs.
# e.g. .queue/machines
# host1
# host2
# host3
# Assuming you can log in to them without a password, i.e., you have to set up SSH keys.

export train_cmd="ssh.pl"
export cuda_cmd="ssh.pl"
export decode_cmd="ssh.pl"

# This is an example of specifying several unique options in the JHU CLSP cluster setup.
# Users can modify/add their own command options according to their cluster environments.
elif [ "${cmd_backend}" = jhu ]; then

export train_cmd="queue.pl --mem 2G"
export cuda_cmd="queue-freegpu.pl --mem 2G --gpu 1 --config conf/queue.conf"
export decode_cmd="queue.pl --mem 4G"

else
echo "$0: Error: Unknown cmd_backend=${cmd_backend}" 1>&2
return 1
fi
1 change: 1 addition & 0 deletions egs2/kathbath/asr1/db.sh
110 changes: 110 additions & 0 deletions egs2/kathbath/asr1/local/data.sh
@@ -0,0 +1,110 @@
#!/usr/bin/env bash
# Set bash to 'debug' mode; it will exit on:
# -e 'error', -u 'undefined variable', -o pipefail 'error in pipeline'
set -e
set -u
set -o pipefail

log() {
local fname=${BASH_SOURCE[1]##*/}
echo -e "$(date '+%Y-%m-%dT%H:%M:%S') (${fname}:${BASH_LINENO[0]}:${FUNCNAME[1]}) $*"
}
SECONDS=0


stage=1
stop_stage=100000

train_data_url="https://indic-asr-public.objectstore.e2enetworks.net/indic-superb/kathbath/clean/train_audio.tar"
valid_data_url="https://indic-asr-public.objectstore.e2enetworks.net/indic-superb/kathbath/clean/valid_audio.tar"
clean_test_known_data_url="https://indic-asr-public.objectstore.e2enetworks.net/indic-superb/kathbath/clean/testkn_audio.tar"
clean_test_unknown_data_url="https://indic-asr-public.objectstore.e2enetworks.net/indic-superb/kathbath/clean/testunk_audio.tar"
noisy_test_known_data_url="https://indic-asr-public.objectstore.e2enetworks.net/indic-superb/kathbath/noisy/testkn_audio.tar"
noisy_test_unknown_data_url="https://indic-asr-public.objectstore.e2enetworks.net/indic-superb/kathbath/noisy/testunk_audio.tar"
transcript_clean_url="https://indic-asr-public.objectstore.e2enetworks.net/indic-superb/kathbath/clean/transcripts_n2w.tar"
transcript_noisy_url="https://indic-asr-public.objectstore.e2enetworks.net/indic-superb/kathbath/noisy/transcripts_n2w.tar"

log "$0 $*"
. utils/parse_options.sh

. ./db.sh
. ./path.sh
. ./cmd.sh


if [ $# -ne 0 ]; then
log "Error: No positional arguments are required."
exit 2
fi

if [ -z "${KATHBATH}" ]; then
log "Fill the value of 'KATHBATH' of db.sh"
exit 1
fi
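Each stage in this script is gated by a pair of comparisons on `stage` and `stop_stage`. A standalone sketch of that guard (`run_stage` is a hypothetical helper for illustration, not part of the recipe):

```shell
stage=2
stop_stage=2

# A stage N executes iff stage <= N <= stop_stage.
run_stage() { [ "${stage}" -le "$1" ] && [ "${stop_stage}" -ge "$1" ]; }

for n in 1 2 3; do
    if run_stage "$n"; then echo "running stage $n"; fi
done
```

With the defaults (`stage=1`, `stop_stage=100000`), every stage runs; passing e.g. `--stage 2 --stop_stage 2` via parse_options restricts the run to a single stage.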

download_data="${KATHBATH}"

if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
if [ ! -e "${KATHBATH}/download_done" ]; then
echo "stage 1: Data Download to ${KATHBATH}"

for data_url in $noisy_test_known_data_url $noisy_test_unknown_data_url $transcript_noisy_url; do
if ! wget -P $download_data --no-check-certificate $data_url; then
echo "$0: error executing wget $data_url"
exit 1
fi
fname=${data_url##*/}
if ! tar -C $download_data -xf $download_data/$fname; then
echo "$0: error un-tarring archive $download_data/$fname"
exit 1
fi
rm $download_data/$fname
done

for data_url in $clean_test_known_data_url $clean_test_unknown_data_url $train_data_url $valid_data_url $transcript_clean_url; do
if ! wget -P $download_data --no-check-certificate $data_url; then
echo "$0: error executing wget $data_url"
exit 1
fi
fname=${data_url##*/}
if ! tar -C $download_data -xf $download_data/$fname; then
echo "$0: error un-tarring archive $download_data/$fname"
exit 1
fi
rm $download_data/$fname
done


touch "${KATHBATH}/download_done"
else
log "stage 1: ${KATHBATH}/download_done already exists. Skipping data download."
fi
fi

mkdir -p data
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
if [ ! -e "data/dataprep_done" ]; then
log "stage 2: Data Preparation"


for lang in $download_data/"kb_data_clean_m4a"/* ; do
log "Processing $lang"
for split in $lang/* ; do

#https://github.com/AI4Bharat/IndicWav2Vec/blob/main/data_prep_scripts/ft_scripts/normalize_sr.sh
path=$split/"audio"
ext="wav"
for f in $(find "$path" -type f -name "*.$ext")
do
ffmpeg -loglevel warning -hide_banner -stats -i "$f" -ar 16000 -ac 1 "$f.$ext" && mv "$f.$ext" "$f" &
done
wait

done
done


touch "data/dataprep_done"
else
log "stage 2: data/dataprep_done already exists. Skipping data preparation."
fi
fi
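The resampling loop above iterates over `$(find ...)`, which word-splits paths containing spaces; a later commit in this PR (b39c1ef, "replace find-while-ffmpeg with find-exec-ffmpeg") reworks it. A minimal sketch of the `find -exec` pattern, with `echo` standing in for the ffmpeg call and a purely illustrative demo directory:

```shell
# Set up a tiny demo tree (illustrative; not part of the recipe).
mkdir -p demo_audio
touch demo_audio/a.wav "demo_audio/b c.wav"

# find -exec hands each path to the inner shell as an argv entry, so
# names with spaces survive intact; echo stands in for the ffmpeg resample.
find demo_audio -type f -name "*.wav" -exec sh -c '
    for f; do
        echo "resampling: $f"
    done' sh {} + | sort

rm -r demo_audio
```

Unlike the `for f in $(find ...)` form, this needs no `IFS` tweaking and batches many paths per inner-shell invocation via `{} +`.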
Empty file.
1 change: 1 addition & 0 deletions egs2/kathbath/asr1/path.sh
1 change: 1 addition & 0 deletions egs2/kathbath/asr1/pyscripts
1 change: 1 addition & 0 deletions egs2/kathbath/asr1/scripts
1 change: 1 addition & 0 deletions egs2/kathbath/asr1/steps
1 change: 1 addition & 0 deletions egs2/kathbath/asr1/utils