[WIP] Lhotse/K2 example #45

Open
freewym wants to merge 119 commits into main from lhotse

Conversation


freewym (Owner) commented Nov 4, 2020

freewym force-pushed the lhotse branch 4 times, most recently from 2b230b6 to f801434 on November 4, 2020 at 18:46
self.tgt_sizes = np.array(
    [
        round(
            cut.supervisions[0].trim(cut.duration).duration / cut.frame_shift

You were previously using "cut.num_samples" if features were not available - it won't work here, as in that case frame_shift will be None


Also, I thought tgt_size was used to denote "the number of output tokens" in a different context; here it seems to represent "the number of feature frames covered by a supervision".


It seems like your intention here (supervisions[0]) is to support only single-supervision cuts; maybe it makes sense to add an assertion?
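
A minimal sketch of such an assertion, assuming the dataset is constructed from a Lhotse CutSet (`cuts` here is a hypothetical name for it):

    # Assumed invariant: every cut carries exactly one supervision.
    assert all(len(cut.supervisions) == 1 for cut in cuts), \
        "this dataset currently supports only single-supervision cuts"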

freewym (Owner, Author)

tgt_sizes is used for determining batch sizes, and possibly affects the loss value. I am not sure whether in the future we would add on-the-fly feature extraction in this class when only the recordings are available, and, if we do, whether the frame_shift field will be populated. How about making tgt_sizes always the same as src_sizes?


I thought you'd rather want to use the number of tokens in supervision.text - unless I misunderstood the meaning of "target" in this context.

freewym (Owner, Author)

Yes, I have done exactly the same thing as what we did in Kaldi, i.e. making the length distribution of positives and negatives the same for training. This is done in local/data_prep.py in this PR.

freewym (Owner, Author)

@danpovey I need to test whether all of the data prep works as expected (additive noise from MUSAN is still to be done). In the meantime, maybe we can start thinking about implementing the LF-MMI loss using k2?


Regarding the data prep and normalizing the sizes: I'm concerned that others who pick up the data from Lhotse may not do this and may get bad results. But I don't know whether it would be natural to do that within Lhotse. Will comment in a second about implementing the LF-MMI loss in k2.


Regarding implementing LF-MMI in k2:

you need to turn the nnet outputs into a DenseFsaVec. The nnet outputs will be of shape
(B,T,F) where B is the batch size, T is num-frames and F is number of features. Feature zero
will probably correspond to epsilon/blank. [If you do want an epsilon/blank then you should
probably call AddEpsilonSelfLoops() on your graphs before calling IntersectDensePruned() / intersect_dense_pruned(), since
IntersectDensePruned() does not treat epsilon specially. Caution: AddEpsilonSelfLoops() still
needs to be wrapped for Python; I created an issue on k2 for this.]

Anyway, the first step is to construct the DenseFsaVec from your nnet output. DenseFsaVec
supports different-sized supervision segments, and you have the choice here to omit any
padding frames from the frames you construct the DenseFsaVec from. Do git grep dense_fsa
in k2 and you'll find where the code is.

The next step is to construct the denominator graph as an Fsa (this is a k2 python type, although
there is also a C++ typedef of the same name; I refer here to the python type). You can probably create
it without epsilons and then add epsilon self-loops to it. If you want it can just be, effectively,
the union of 2 graphs, one for the numerator and one for the denominator. I expect you will use your
experience of what does and does not work, here.

The numerator graphs can start off as two Fsas, one for the positive and one for the negative examples.
Look at type Fsa in k2 (at the python level), because it does support being (really) a vector of Fsas.
Currently I don't know of a super efficient way to create the minibatch from the num and den fsas and
(say) a vector of bools, but this is doable; please consult @csukuangfj on this and maybe he can
create something.

The objective function will be like num_score - den_score, where each score comes from one call
to intersect_dense_pruned and then putting the output into get_tot_scores with log_semiring = true,
and summing the output tensor.

Sorry, I have to go somewhere, but hopefully you can get started with this info and ask @csukuangfj for help.
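
A rough sketch of the objective described above, assuming k2's Python-level API (k2.DenseFsaVec, k2.intersect_dense_pruned, Fsa.get_tot_scores); the graph arguments, the supervision_segments tensor and the pruning values are placeholders, not code from this PR:

    import torch
    import k2

    def lfmmi_objective(nnet_output: torch.Tensor,
                        supervision_segments: torch.Tensor,
                        num_graphs: k2.Fsa,
                        den_graph: k2.Fsa) -> torch.Tensor:
        # nnet_output: (B, T, F) log-probs; feature 0 is epsilon/blank, so the
        # graphs are assumed to be arc-sorted and to already contain epsilon
        # self-loops (AddEpsilonSelfLoops).
        dense_fsa_vec = k2.DenseFsaVec(nnet_output, supervision_segments)

        # Numerator lattices: num_graphs is an FsaVec with one graph per
        # supervision segment (e.g. built with k2.create_fsa_vec()).
        num_lats = k2.intersect_dense_pruned(
            num_graphs, dense_fsa_vec,
            search_beam=20.0, output_beam=10.0,
            min_active_states=30, max_active_states=10000)

        # Denominator lattices: depending on the k2 version, the single den
        # graph may need to be replicated to match the batch size.
        den_lats = k2.intersect_dense_pruned(
            den_graph, dense_fsa_vec,
            search_beam=20.0, output_beam=10.0,
            min_active_states=30, max_active_states=10000)

        num_score = num_lats.get_tot_scores(
            log_semiring=True, use_double_scores=True).sum()
        den_score = den_lats.get_tot_scores(
            log_semiring=True, use_double_scores=True).sum()

        # The objective is num_score - den_score; negate it for a loss to minimize.
        return -(num_score - den_score)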


AddEpsilonSelfLoops is added by k2-fsa/k2#313

espresso/data/asr_k2_dataset.py (review comments resolved)
espresso/tools/Makefile (review comments resolved)

freewym commented Nov 6, 2020

@pzelasko I just drafted a data prep script in examples/mobvoihotwords/local/data_prep.py. I would like to double-check with you whether I did everything correctly and efficiently.

Basically I want to augment the original training data with 1.1x/0.9x speed perturbation and with reverberation, separately, and then combine them into a single CutSet. I did that by first extracting the augmented features and dumping them to disk separately, and then merging their respective CutSets while modifying their ids (by prefixing) to differentiate utterances derived from the same underlying original one.

Also, I am not sure whether the way I did speed perturbation is correct (in terms of both the use of the pitch function and the value of the pitch shift passed to it).

Thanks
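
For reference, a minimal sketch of the merging step just described, assuming the installed Lhotse version exposes CutSet.modify_ids and a top-level combine() helper; the cut set variable names and id prefixes are placeholders:

    from lhotse import combine

    # Each CutSet below already has its (augmented) features extracted and stored
    # separately; prefix the ids so copies of the same original utterance stay distinct.
    cuts_train = combine(
        cuts_orig.modify_ids(lambda cut_id: f"orig-{cut_id}"),
        cuts_sp0_9.modify_ids(lambda cut_id: f"sp0.9-{cut_id}"),
        cuts_sp1_1.modify_ids(lambda cut_id: f"sp1.1-{cut_id}"),
        cuts_reverb.modify_ids(lambda cut_id: f"reverb-{cut_id}"),
    )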

pzelasko left a comment

Looks great! I left you some comments.

examples/mobvoihotwords/local/data_prep.py (review comments resolved)

if "train" in partition:
with LilcomFilesWriter(f"{output_dir}/feats_{partition}_orig") as storage:
cut_set_orig = cut_set.compute_and_store_features(
extractor=Mfcc(config=mfcc_hires_config),

it should be okay to instantiate Mfcc(config=...) once and re-use it for all calls (although it won't make a difference in performance, just code terseness)
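
Concretely, the suggestion amounts to something like the sketch below (it follows the names used in the snippet above, and the exact compute_and_store_features() keyword arguments may differ across Lhotse versions):

    # Create the extractor once...
    extractor = Mfcc(config=mfcc_hires_config)

    # ...and reuse it for every partition / augmented copy.
    with LilcomFilesWriter(f"{output_dir}/feats_{partition}_orig") as storage:
        cut_set_orig = cut_set.compute_and_store_features(
            extractor=extractor,
            storage=storage,
        )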


pzelasko commented Nov 6, 2020

BTW this is in a very experimental stage, but some time ago I was able to run Lhotse feature extraction distributed on our CLSP grid with these steps (admittedly not tested with data augmentation yet):

  1. pip install dask distributed dask_jobqueue - Dask, a library that handles distributed computation in Python
  2. pip install git+https://github.com/pzelasko/plz - my wrapper around Dask, dedicated to the CLSP grid
  3. from dask.distributed import Client
  4. from plz import setup_cluster
  5. with setup_cluster() as cluster, Client(cluster) as ex: <- a drop-in replacement for a process pool executor
  6. cluster.scale(num_jobs)

If you'd like, you can try it; otherwise I will try it sometime, probably using your recipe, as it'll be a great testing ground for this.
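
Put together, the steps above look roughly like this sketch (num_jobs is a placeholder, and whether the Lhotse version in use accepts a Dask client as its executor is an assumption):

    from dask.distributed import Client
    from plz import setup_cluster  # pip install git+https://github.com/pzelasko/plz

    num_jobs = 20  # hypothetical number of CLSP grid jobs

    with setup_cluster() as cluster, Client(cluster) as ex:
        cluster.scale(num_jobs)
        # `ex` acts as a drop-in replacement for a process pool executor, e.g. it
        # could be passed as the executor argument of
        # CutSet.compute_and_store_features(), assuming the installed Lhotse
        # version accepts a Dask client there.
        ...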


freewym commented Nov 6, 2020

Thanks for the helpful comments! There are still additional data preprocessing steps to be done before feature extraction (additive noise and splitting the recordings). I will try the distributed extraction once they are done.

freewym force-pushed the master branch 2 times, most recently from 705bb32 to 82dfb45 on November 8, 2020 at 08:07
freewym force-pushed the lhotse branch 5 times, most recently from 3f8f008 to 2fd59b9 on November 9, 2020 at 00:15