TEDx Talk with ID=D4TE28-L7FI is not available anymore #7

Closed
david-gimeno opened this issue Mar 27, 2023 · 5 comments

@david-gimeno

I was downloading the MuAViC database for the Spanish language when an error message suddenly appeared while segmenting videos. It seems that the video with ID=D4TE28-L7FI is no longer available. Do you have a backup of the database for these cases? In addition, the script was interrupted, which I think should not happen when a single video is missing.

Best regards,

David.

@Anwarvic
Contributor

Hi @david-gimeno,

Thanks for raising this issue. Could you post the full trace of the error? Thanks!

@Anwarvic Anwarvic self-assigned this Mar 27, 2023
@david-gimeno
Author

You are right, I should have shared the full error trace. I have run the script twice, and this is what I got both times:

`Downloading mtedx_es.tgz from https://www.openslr.org/resources/100/mtedx_es.tgz
Extracting mtedx_es.tgz: 100%|██████████| 2058/2058 [01:44<00:00, 19.64it/s]
Downloading mtedx_es-en.tgz from https://www.openslr.org/resources/100/mtedx_es-en.tgz
Extracting mtedx_es-en.tgz: 100%|██████████| 842/842 [00:40<00:00, 20.61it/s]

Downloading es videos from YouTube
[download] 40.2% of 317.89MiB at 229.22KiB/s ETA 14:08ERROR: [youtube] D4TE28-L7FI: Video unavailable
Downloading es/train Videos: 100%|██████████| 988/988 [14:24<00:00, 1.14it/s]
Downloading es/valid Videos: 100%|██████████| 16/16 [00:08<00:00, 1.99it/s]
Downloading es/test Videos: 100%|██████████| 12/12 [00:09<00:00, 1.27it/s]

Segmenting es audio files
Preprocessing es/train Audios: 100%|██████████| 102171/102171 [02:30<00:00, 676.90it/s]
Preprocessing es/valid Audios: 100%|██████████| 905/905 [00:01<00:00, 584.80it/s]
Preprocessing es/test Audios: 100%|██████████| 1012/1012 [00:01<00:00, 669.18it/s]
Downloading 20words_mean_face.npy from https://dl.fbaipublicfiles.com/muavic/metadata/20words_mean_face.npy
MB
Segmenting es videos files (It takes a few hours to complete)
0%| | 0/988 [00:00<?, ?it/s]Downloading es_metadata.tgz from https://dl.fbaipublicfiles.com/muavic/metadata/es_metadata.tgz
Extracting es_metadata.tgz: 100%|██████████| 1019/1019 [00:22<00:00, 46.00it/s]
21%|██ | 203/988 [6:28:13<26:44:13, 122.62s/it][ WARN:0@26410.813] global loadsave.cpp:244 findDecoder imread_('/tmp/tmprpv_y3yz/11837.png'): can't open/read file: check file path/integrity
21%|██ | 203/988 [6:30:03<25:08:21, 115.29s/it]
concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
File "/home/david/anaconda3/envs/muavic/lib/python3.8/concurrent/futures/process.py", line 239, in _process_worker
r = call_item.fn(*call_item.args, **call_item.kwargs)
File "/home/david/anaconda3/envs/muavic/lib/python3.8/concurrent/futures/process.py", line 198, in _process_chunk
return [fn(*args) for args in chunk]
File "/home/david/anaconda3/envs/muavic/lib/python3.8/concurrent/futures/process.py", line 198, in <listcomp>
return [fn(*args) for args in chunk]
File "/home/david/phd/muavic/mtedx_utils.py", line 144, in segment_normalize_video
frames = resize_frames(video_frames, new_size=(96, 96))
File "/home/david/phd/muavic/utils.py", line 151, in resize_frames
return [cv2.resize(frame, new_size) for frame in input_frames]
File "/home/david/phd/muavic/utils.py", line 151, in <listcomp>
return [cv2.resize(frame, new_size) for frame in input_frames]
cv2.error: OpenCV(4.7.0) /io/opencv/modules/imgproc/src/resize.cpp:4062: error: (-215:Assertion failed) !ssize.empty() in function 'resize'

"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "get_data.py", line 107, in <module>
main(args)
File "get_data.py", line 76, in main
prepare_mtedx(args)
File "get_data.py", line 26, in prepare_mtedx
preprocess_mtedx_video(
File "/home/david/phd/muavic/mtedx_utils.py", line 208, in preprocess_mtedx_video
process_map(
File "/home/david/anaconda3/envs/muavic/lib/python3.8/site-packages/tqdm/contrib/concurrent.py", line 105, in process_map
return _executor_map(ProcessPoolExecutor, fn, *iterables, **tqdm_kwargs)
File "/home/david/anaconda3/envs/muavic/lib/python3.8/site-packages/tqdm/contrib/concurrent.py", line 51, in _executor_map
return list(tqdm_class(ex.map(fn, *iterables, chunksize=chunksize), **kwargs))
File "/home/david/anaconda3/envs/muavic/lib/python3.8/site-packages/tqdm/std.py", line 1166, in __iter__
for obj in iterable:
File "/home/david/anaconda3/envs/muavic/lib/python3.8/concurrent/futures/process.py", line 484, in _chain_from_iterable_of_lists
for element in iterable:
File "/home/david/anaconda3/envs/muavic/lib/python3.8/concurrent/futures/_base.py", line 619, in result_iterator
yield fs.pop().result()
File "/home/david/anaconda3/envs/muavic/lib/python3.8/concurrent/futures/_base.py", line 437, in result
return self.__get_result()
File "/home/david/anaconda3/envs/muavic/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
raise self._exception
cv2.error: OpenCV(4.7.0) /io/opencv/modules/imgproc/src/resize.cpp:4062: error: (-215:Assertion failed) !ssize.empty() in function 'resize'
`
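
In the trace above, a single worker's `cv2.error` propagates through `process_map` and stops the whole run. Below is a minimal sketch of one way to keep a batch going when individual items fail. It is not how MuAViC handles this: `run_tolerant` and `fake_segment` are hypothetical names, and a thread pool stands in for the process pool seen in the trace so that the example stays self-contained.

```python
from concurrent.futures import ThreadPoolExecutor


def run_tolerant(fn, items, max_workers=4):
    """Apply fn to every item and collect failures instead of aborting.

    Returns (results, failures): two dicts keyed by item. Items must be
    hashable (for example, video ID strings).
    """
    def guarded(item):
        # Catch the exception inside the worker so that one bad item
        # cannot stop the iteration over the remaining ones.
        try:
            return item, fn(item), None
        except Exception as exc:
            return item, None, exc

    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        for item, value, exc in executor.map(guarded, items):
            if exc is None:
                results[item] = value
            else:
                failures[item] = repr(exc)
    return results, failures


def fake_segment(video_id):
    # Hypothetical stand-in for a per-video segmentation step.
    if video_id == "broken":
        raise ValueError("empty frame")
    return len(video_id)


if __name__ == "__main__":
    done, failed = run_tolerant(fake_segment, ["abc", "broken", "de"])
    print("processed:", done)
    print("failed:", failed)
```

Collecting failures this way lets the caller log the problematic IDs at the end and decide whether to retry them, rather than losing hours of completed work.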

Another curious aspect is that, although the video D4TE28-L7FI is unavailable (i.e., it was not downloaded), there are audio segments for this sample. How is this possible?

Thanks in advance,

David.

@Anwarvic
Contributor

Anwarvic commented Mar 28, 2023

Hi @david-gimeno ,

Thanks for posting the full error trace!

This issue occurred because of a mismatch between the actual video frames and the downloaded metadata at /home/david/phd/muavic/metadata/es/train/*.pkl. The bug has been fixed in our recent updates. Please run:

```bash
# update your source code
$ git pull

# re-run your script
$ python get_data.py --root-path /home/david/phd --src-lang es  # should resume where it stopped
```

This issue has nothing to do with the file D4TE28-L7FI. The message ERROR: [youtube] D4TE28-L7FI: Video unavailable only warns you that our download script could not download this TED talk (https://www.youtube.com/watch?v=D4TE28-L7FI). All videos that failed to download are recorded in /home/david/phd/muavic/mtedx/not_found_videos.txt.
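
To check whether a given talk is among the failed downloads, something like the following sketch could be used. The layout of `not_found_videos.txt` is not described in this thread, so the code assumes one video ID per line; `load_missing_ids` is a hypothetical helper, not part of the MuAViC code.

```python
from pathlib import Path


def load_missing_ids(list_path):
    """Return the set of non-empty lines in the file, stripped of whitespace.

    Assumption: the file holds one YouTube video ID per line.
    Returns an empty set if the file does not exist.
    """
    path = Path(list_path)
    if not path.exists():
        return set()
    lines = path.read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


if __name__ == "__main__":
    missing = load_missing_ids("/home/david/phd/muavic/mtedx/not_found_videos.txt")
    print("D4TE28-L7FI" in missing)
```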

Also, the audio files /home/david/phd/muavic/es/audio/train/D4TE28-L7FI/D4TE28-L7FI_xxxx.wav exist because they were segmented from the mTEDx dataset, which had already been downloaded in full.
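
Since audio comes from the mTEDx archive and video from YouTube, some talks can end up with audio segments but no video. A rough sketch for listing them follows. It assumes that video segments are stored in per-talk directories mirroring the audio layout (for example `es/video/train/<ID>/`), which this thread does not confirm; `ids_with_audio_only` is a hypothetical helper.

```python
from pathlib import Path


def ids_with_audio_only(audio_split_dir, video_split_dir):
    """Return sorted talk IDs that have an audio directory but no video one.

    Assumption: each talk has its own subdirectory named after its ID
    under both the audio and the video split directories.
    """
    audio_ids = {p.name for p in Path(audio_split_dir).iterdir() if p.is_dir()}
    video_dir = Path(video_split_dir)
    if video_dir.is_dir():
        video_ids = {p.name for p in video_dir.iterdir() if p.is_dir()}
    else:
        video_ids = set()
    return sorted(audio_ids - video_ids)


if __name__ == "__main__":
    root = Path("/home/david/phd/muavic/es")
    # The video path below is an assumed mirror of the audio path.
    print(ids_with_audio_only(root / "audio" / "train", root / "video" / "train"))
```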

Hope that fixes your issue!

@Anwarvic Anwarvic added the bug Something isn't working label Mar 28, 2023
@david-gimeno
Author

david-gimeno commented Mar 30, 2023

Thanks so much! The entire Spanish MuAViC database has been processed :) Just one more question:

Regarding the transcripts in muavic/es/train_avsr.es, do they follow the same order as the entries in muavic/es/train_avsr.tsv?

I would also like to make a suggestion. In my experience, instead of saving the video samples as .mp4, saving them as compressed .npz files (using the numpy library) is very efficient in terms of storage and convenient when creating data loaders for training models.

np.savez_compressed(dst_path+"/"+sampleID+".npz", data=rois),

where rois is a numpy array holding the sequence of regions of interest (96x96 pixels) of one sample.
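
A minimal round trip for this suggestion might look like the sketch below. The helper names `save_rois` and `load_rois`, the sample ID, and the array shape (25 grayscale frames) are illustrative assumptions; only `np.savez_compressed` with the `data` key comes from the comment above.

```python
import os
import tempfile

import numpy as np


def save_rois(dst_path, sample_id, rois):
    """Save one sample's ROI sequence as a compressed .npz file."""
    path = os.path.join(dst_path, sample_id + ".npz")
    np.savez_compressed(path, data=rois)
    return path


def load_rois(path):
    """Load the ROI sequence stored under the 'data' key."""
    with np.load(path) as archive:
        return archive["data"]


if __name__ == "__main__":
    # Hypothetical sample: 25 frames of 96x96 pixels, 8-bit grayscale.
    rois = np.zeros((25, 96, 96), dtype=np.uint8)
    with tempfile.TemporaryDirectory() as tmp:
        saved = save_rois(tmp, "sample_0000", rois)
        restored = load_rois(saved)
        print(restored.shape, restored.dtype)
```

Unlike .mp4, this format stores the pixel values losslessly, so actual file sizes depend on the content and are worth measuring on real samples before switching.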

Anyway, thank you again for your time. Best regards,

David.

@Anwarvic
Contributor

Hi David,

Glad that everything is working now!

Regarding your question, the answer is "Yes". Transcripts follow the same order as manifest files for AVSR and AVST. And thank you for your suggestion, my team and I will definitely take it into consideration.
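
Because the order matches, a manifest and its transcript file can be paired line by line. The sketch below shows the idea under two assumptions that are not stated in this thread: the .tsv file starts with one header line, and the first tab-separated field of each following row is the sample ID. The function name `pair_manifest_with_transcripts` and the `header_lines` parameter are hypothetical; adjust `header_lines` to match the actual files.

```python
from pathlib import Path


def pair_manifest_with_transcripts(tsv_path, transcript_path, header_lines=1):
    """Pair manifest row i with transcript line i.

    Returns a list of (sample_id, transcript) tuples. Raises ValueError
    if the two files do not describe the same number of samples.
    """
    rows = Path(tsv_path).read_text(encoding="utf-8").splitlines()[header_lines:]
    texts = Path(transcript_path).read_text(encoding="utf-8").splitlines()
    if len(rows) != len(texts):
        raise ValueError(
            f"{len(rows)} manifest rows but {len(texts)} transcript lines"
        )
    # Assumption: the sample ID is the first tab-separated field.
    return [(row.split("\t")[0], text) for row, text in zip(rows, texts)]


if __name__ == "__main__":
    pairs = pair_manifest_with_transcripts(
        "muavic/es/train_avsr.tsv", "muavic/es/train_avsr.es"
    )
    print(pairs[:3])
```

The length check is a cheap guard: if the counts differ, the line-by-line pairing cannot be trusted.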

I'm gonna close this issue for now if you don't mind. Thanks!
