questions about the datasets. #2

Open
zyj008 opened this issue Aug 16, 2018 · 3 comments

Comments

@zyj008

zyj008 commented Aug 16, 2018

Hello! I have some questions about the BZ dataset. Did you preprocess the BZ dataset before training the model, for example by breaking long sentences into smaller segments? Some sentences in the BZ dataset are much longer than those in LJ.

@acetylSv
Owner

Hi, I plotted the character length of each line in the transcription as a histogram and got this plot.
Based on it, I decided to discard sentences whose character length is > 300.
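A minimal sketch of the length check described above, assuming each transcript entry is a plain string of text. The helper names `length_histogram` and `drop_long_lines`, the bin width of 50, and the toy data are illustrative assumptions, not code from this repository; only the 300-character cutoff comes from the comment.

```python
from collections import Counter


def length_histogram(lines, bin_width=50):
    """Count lines per character-length bin, keyed by the bin's lower edge."""
    counts = Counter((len(line) // bin_width) * bin_width for line in lines)
    return dict(sorted(counts.items()))


def drop_long_lines(lines, max_chars=300):
    """Keep only lines with at most max_chars characters."""
    return [line for line in lines if len(line) <= max_chars]


# Toy strings standing in for the real transcription lines (hypothetical data).
transcripts = ["a" * 40, "b" * 120, "c" * 299, "d" * 300, "e" * 301, "f" * 750]

hist = length_histogram(transcripts)
kept = drop_long_lines(transcripts)

print(hist)
print(len(kept), "of", len(transcripts), "lines kept")
```

Printing the binned counts is a rough stand-in for the plot; the same lengths could be passed to any plotting library to draw the actual histogram.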

@zyj008
Author

zyj008 commented Aug 16, 2018

Thanks for your reply! I have a few more questions. How many hours of the BZ dataset did you use for training? I find it hard to get good convergence when training on about 100 hours of data. The alignment is usually just a fuzzy slash.
[image: attention alignment plot]
This image shows the alignment at 140k steps, with batch_size=24 and num_gpu=4.
Do you have any ideas or advice for me?
Could you share the tfevents file for your pretrained model?
Thank you!

@acetylSv
Owner

I used only the segmented part of the Blizzard-2013 dataset, which contains 9742 files totaling about 20 hrs, so I'm not sure what will happen when switching to the bigger one.
In my experience, the attention plot somehow "suddenly" learns to align well at 40K steps (batch_size=32). Maybe the maximum length of your training pairs is set too long?
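One way to test the maximum-length idea is to cap the training pairs before batching. The sketch below assumes each pair is a `(text, n_frames)` tuple; the function name `cap_training_pairs` and the limits of 200 characters and 1000 frames are hypothetical values chosen for illustration, not settings from this repository.

```python
def cap_training_pairs(pairs, max_text_len, max_frames):
    """Keep (text, n_frames) pairs that fit within both length limits."""
    return [
        (text, n_frames)
        for text, n_frames in pairs
        if len(text) <= max_text_len and n_frames <= max_frames
    ]


# Hypothetical pairs: transcript text and number of spectrogram frames.
pairs = [
    ("short line", 120),
    ("x" * 180, 900),
    ("y" * 400, 2500),   # text too long
    ("z" * 50, 1600),    # audio too long
]

capped = cap_training_pairs(pairs, max_text_len=200, max_frames=1000)
print(len(capped), "of", len(pairs), "pairs kept")
```

Lowering the limits and checking whether the alignment sharpens earlier would be a cheap experiment before training on the full 100 hours.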
