train error with Google Drive data #32

Closed
MrDotOne opened this issue May 3, 2023 · 6 comments
Labels
enhancement New feature or request

Comments

@MrDotOne

MrDotOne commented May 3, 2023

I am trying to test this out before turning it over to the researchers, and I have been going over the various steps. I was able to successfully run

(medsam) [root@lri-uapps-1 MedSAM]# python utils/precompute_img_embed.py -i /data/train -o /data/Tr_emb

however, the actual training seems to be failing due to too many open files:

(medsam) [root@lri-uapps-1 MedSAM]# python train.py -i /data/Tr_emb --task_name SAM-ViT-B --num_epochs 1000 --batch_size 8 --lr 1e-5
Traceback (most recent call last):
  File "/usr/local/MedSAM/train.py", line 83, in <module>
    train_dataset = NpzDataset(args.npz_tr_path)
  File "/usr/local/MedSAM/train.py", line 24, in __init__
    self.npz_data = [np.load(join(data_root, f)) for f in self.npz_files]
  File "/usr/local/MedSAM/train.py", line 24, in <listcomp>
    self.npz_data = [np.load(join(data_root, f)) for f in self.npz_files]
  File "/usr/local/anaconda3/envs/medsam/lib/python3.10/site-packages/numpy/lib/npyio.py", line 405, in load
    fid = stack.enter_context(open(os_fspath(file), "rb"))
OSError: [Errno 24] Too many open files: '/data/Tr_emb/Tr_000000990.npz'

(medsam) [root@lri-uapps-1 MedSAM]# ls /data/Tr_emb/ | wc -l
161857

Could this be a numpy error perhaps?
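For context on where this error comes from: np.load on an .npz archive returns a lazy NpzFile object that keeps its file handle open until it is closed, so the list comprehension in train.py ends up holding one descriptor per file, roughly 161k of them, far beyond the usual default soft limit of 1024. A minimal illustration, using the path from the traceback above:

import numpy as np

arrs = np.load("/data/Tr_emb/Tr_000000990.npz")  # returns an NpzFile; its file handle stays open
print(arrs.files)                                # arrays are read lazily from the still-open archive
arrs.close()                                     # only now is the file descriptor released

So this looks less like a numpy bug than the per-process open-file limit being exhausted by loading every archive eagerly.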

@JunMa11
Collaborator

JunMa11 commented May 3, 2023

Could you please try this?

https://stackoverflow.com/questions/16526783/python-subprocess-too-many-open-files

Loading all files at once may not be a suitable solution for low-RAM settings. I will try to change it to an npy dataloader this weekend.
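For reference, the workaround from that Stack Overflow thread can also be applied inside the training script by raising the soft open-file limit before the dataset is built. A sketch of the standard resource-module idiom, roughly equivalent to running ulimit -n in the shell:

import resource

# Query the current soft/hard limits for open file descriptors.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limit: soft={soft}, hard={hard}")

# Raise the soft limit up to the hard limit (the per-process ceiling);
# going beyond the hard limit requires root privileges.
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))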

@MrDotOne
Author

MrDotOne commented May 3, 2023 via email

@JunMa11
Collaborator

JunMa11 commented May 3, 2023

This is weird. Our node has 1 TB of RAM and it works there. I will check it again.

@MrDotOne
Author

MrDotOne commented May 3, 2023

Having set ulimit above the number of files (i.e., >161k), I do get it to run; however, it eats all the memory available on the node. Hope you can see this screenshot.

[screenshot: memory usage on the node]

@JunMa11
Collaborator

JunMa11 commented May 4, 2023

Yes. Keeping everything in RAM brings faster training speed. There is always a trade-off.

I will provide a script for batch loading by the end of this week.

@JunMa11 added the enhancement (New feature or request) label May 5, 2023
@JunMa11
Collaborator

JunMa11 commented May 7, 2023

Hi @MrDotOne,

I have changed the dataloader to an npy dataloader. Data loading should cost less RAM now (at the cost of using more space on the hard drive).
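For anyone following the thread, the general idea behind an npy dataloader is to store each sample as its own small .npy file and load it on demand (optionally memory-mapped), so neither open file handles nor RAM scale with the dataset size. A rough sketch of that pattern, not the exact class committed to the repository; the imgs/gts folder layout here is an assumption:

import os
import numpy as np
import torch
from torch.utils.data import Dataset

class NpyDataset(Dataset):
    def __init__(self, data_root):
        # Assumed layout: data_root/imgs/<name>.npy and data_root/gts/<name>.npy,
        # one embedding/ground-truth pair per sample.
        self.img_dir = os.path.join(data_root, "imgs")
        self.gt_dir = os.path.join(data_root, "gts")
        self.names = sorted(os.listdir(self.img_dir))

    def __len__(self):
        return len(self.names)

    def __getitem__(self, idx):
        name = self.names[idx]
        # mmap_mode="r" leaves the data on disk until it is actually indexed.
        img_embed = np.load(os.path.join(self.img_dir, name), mmap_mode="r")
        gt = np.load(os.path.join(self.gt_dir, name), mmap_mode="r")
        # Copy into regular in-memory arrays before handing them to torch.
        return (torch.from_numpy(np.array(img_embed)).float(),
                torch.from_numpy(np.array(gt)).long())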
