questions about the datasets. #2

zyj008 · 2018-08-16T03:56:28Z

Hello! I have some questions about the BZ dataset. Did you apply any preprocessing to the BZ dataset before training the model, such as breaking long sentences into smaller segments? Some sentences in the BZ dataset are much longer than those in LJ.

acetylSv · 2018-08-16T05:33:27Z

Hi, I plotted the character length of each line in the transcription as a histogram.
Based on that plot, I decided to discard sentences whose character length is > 300.
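The length check described above can be sketched roughly as follows. This is a minimal illustration, not the script actually used for the repository: the function names `length_histogram` and `drop_long_lines`, the bin width of 50, and the toy transcriptions are all assumptions made for the example. Only the 300-character cutoff comes from the reply.

```python
from collections import Counter


def length_histogram(lines, bin_width=50):
    """Count lines per character-length bin; keys are the bin lower bounds."""
    counts = Counter((len(line) // bin_width) * bin_width for line in lines)
    return dict(sorted(counts.items()))


def drop_long_lines(lines, max_chars=300):
    """Keep only lines whose character length is at most max_chars."""
    return [line for line in lines if len(line) <= max_chars]


# Toy transcriptions standing in for the real transcript file (hypothetical data).
transcripts = ["a" * 40, "b" * 120, "c" * 299, "d" * 300, "e" * 301, "f" * 900]

hist = length_histogram(transcripts)   # text histogram of line lengths
kept = drop_long_lines(transcripts)    # discards lines longer than 300 characters

for lower_bound, count in hist.items():
    print(f"{lower_bound:4d}-{lower_bound + 49:4d}: {'#' * count}")
print(f"kept {len(kept)} of {len(transcripts)} lines")
```

With a real corpus, the lines would be read from the transcription file, and the matching audio files would need to be dropped together with their text.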

zyj008 · 2018-08-16T06:03:45Z

Thanks for your reply! I have a few more questions. How many hours of the BZ dataset did you use for training? I find it hard to get the model to converge well when training on about 100 hours of data. The alignment is usually a fuzzy slash.

This image shows the alignment at 140k steps, with batch_size=24 and num_gpu=4.
Do you have any ideas or advice for me?
Could you share the tfevents file for your pretrained model?
Thank you!

acetylSv · 2018-08-16T14:26:35Z

I used only the segmented part of the Blizzard-2013 dataset, which contains 9742 files totaling about 20 hrs, so I'm not sure what will happen when switching to the bigger one.
In my experience, the attention plot will somehow "suddenly" learn to align well at 40K steps (batch_size=32). Maybe the maximum length of your training pairs is set too long?
