Multiple issues in the dataset. #9

Closed
saxenarohit opened this issue Feb 11, 2019 · 1 comment

saxenarohit commented Feb 11, 2019

  1. Audio

Several clips contain an audio disturbance, which would have affected the extracted audio features.

A few examples:
dia793_utt0.mp4
dia164_utt5.mp4
dia682_utt1.mp4
dia529_utt2.mp4
dia1029_utt1.mp4
dia1008_utt1.mp4

This affects almost all videos larger than 2.5 MB (around 200 videos in train_set).
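
For anyone who wants to reproduce the list of affected clips, here is a minimal sketch that collects every .mp4 above the size threshold in a folder. The function name `find_large_clips` and the reading of "2.5 MB" as 2.5 × 1024 × 1024 bytes are assumptions made for illustration; neither comes from the dataset or its tooling.

```python
from pathlib import Path

# Assumption: "2.5 MB" is read as 2.5 * 1024 * 1024 bytes.
SIZE_THRESHOLD_BYTES = int(2.5 * 1024 * 1024)


def find_large_clips(video_dir, threshold=SIZE_THRESHOLD_BYTES):
    """Return the sorted names of .mp4 files in video_dir larger than threshold bytes.

    Hypothetical helper, not part of the dataset's own code.
    """
    return sorted(
        path.name
        for path in Path(video_dir).glob("*.mp4")
        if path.is_file() and path.stat().st_size > threshold
    )
```

Calling `find_large_clips` on the folder that holds the train_set videos gives a candidate list to check by ear; file size alone does not prove that a clip has the disturbance.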

  2. Video and text do not match.

For example:

a) Dialogue 241: the sync between the text and the video breaks at utterance 1. Utterance 2 in the text is "I asked him." while the video dia241_utt2.mp4 contains only the word "now", and the sync issue continues from there.

b) Dialogue 757: utterance 7 is also not synced with the text.

c) Dialogue 485: utterance 0 in the text is "Hey, this- Heyy..." but the video is a long clip.

There are many more video-text sync issues.
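
One rough way to surface more candidates like the ones above is to compare the number of words in each utterance with the duration of its clip and flag implausible speaking rates. The sketch below assumes the caller already has each clip's duration in seconds; the function name `flag_suspect_alignments` and the thresholds of 0.5 and 6.0 words per second are illustrative assumptions, not values taken from the dataset.

```python
def flag_suspect_alignments(utterances, min_wps=0.5, max_wps=6.0):
    """Return clip names whose words-per-second rate looks implausible.

    utterances: iterable of (clip_name, text, duration_seconds) tuples.
    min_wps and max_wps are assumed thresholds; tune them on known-good clips.
    """
    suspects = []
    for clip_name, text, duration in utterances:
        if duration <= 0:
            # A clip with no duration cannot match any text.
            suspects.append(clip_name)
            continue
        words_per_second = len(text.split()) / duration
        if words_per_second < min_wps or words_per_second > max_wps:
            suspects.append(clip_name)
    return suspects
```

This is only a heuristic: a short utterance paired with a long clip or a long utterance paired with a very short clip gets flagged, while a misaligned clip of plausible length passes, so flagged clips still need manual review.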

Is this dataset usable?
Please help me with this.

@soujanyaporia
Collaborator

As discussed over email, there are some alignment issues caused by Gentle, the auto-aligner that we used. We have to fix such issues manually and plan to do so in the near future. Rest assured that such videos are very few in number and do not affect the overall quality of the dataset. Thanks!
