This repository has been archived by the owner on Dec 21, 2023. It is now read-only.

Training – IOError: Fail to write. Disk may be full.: iostream error #102

Closed

teaglin opened this issue Dec 15, 2017 · 7 comments

@teaglin commented Dec 15, 2017

Hi,

I am using a fairly large dataset (75GB) that is stored on a separate server. During training, 250GB is written to the local hard drive, completely filling it up. RAM usage is also extremely high: it uses 80-85% of the 128GB of RAM. Training ends up failing with: IOError: Fail to write. Disk may be full.: iostream error

Something seems very inefficient about loading remote training data via an SFrame. In TensorFlow, loading training data from a local server works flawlessly. Is there some setting that needs to be enabled so the local hard drive is not filled up?

@srikris (Contributor) commented Dec 15, 2017

Sorry you are having trouble with this.

Can you tell us some more about how you created this dataset? By any chance, was it created by appending many small SFrames? It would also help to know what the contents are (how many rows, how many columns, and what the column types are).

@teaglin (Author) commented Dec 15, 2017

Thanks for the quick response!

I am doing object detection, but I created the dataset using existing code I had for TensorFlow. All the image paths + classes are stored in a db and the images are stored in a directory. I do some additional processing to each image, so images are loaded individually via worker queues and then a single worker writes them out sequentially to a training dataset file.

This worked well for larger datasets because TensorFlow allows writing out an individual training example; I wasn't able to find anything like that in the SFrame docs. I initially tried appending separate SFrames, but I could never get that to work. So I store everything in a single list, and when all the processing is done I create the SFrame and write it out to a file. This is only possible because I have enough RAM.

My SFrame structure is exactly like in the object detection documentation: 2 columns, but well over 100k rows.

+------------------------+-------------------------------+
| image | annotations |
+------------------------+-------------------------------+
| Height: 375 Width: 500 | [{'coordinates': {'y': 204... |
| Height: 375 Width: 500 | [{'coordinates': {'y': 148... |
| Height: 334 Width: 500 | [{'coordinates': {'y': 146... |
| Height: 500 Width: 345 | [{'coordinates': {'y': 321... |
| Height: 480 Width: 500 | [{'coordinates': {'y': 301... |
| Height: 375 Width: 500 | [{'coordinates': {'y': 121... |
| Height: 335 Width: 500 | [{'coordinates': {'y': 119... |
| Height: 335 Width: 500 | [{'coordinates': {'y': 150... |
| Height: 500 Width: 333 | [{'coordinates': {'y': 235... |
| Height: 333 Width: 500 | [{'coordinates': {'y': 120... |
+------------------------+-------------------------------+

@srikris (Contributor) commented Dec 15, 2017

This is well within the limits of the SFrame and should not cause any trouble. Can you share the snippet of code you are using to create and write out the SFrame to a file? That could help us better identify why this might be happening.

@teaglin (Author) commented Dec 15, 2017

This is what I ended up having to do to get it to work, as I mentioned in my previous post.

from threading import Thread
from queue import Queue   # "from Queue import Queue" on Python 2

import turicreate as tc


class WriteWorker(Thread):
    def __init__(self, savePath):
        Thread.__init__(self)
        self.images = []
        self.annotations = []
        self.queue = Queue()
        self.daemon = True
        self.savePath = savePath
        self.start()

    def run(self):
        # Drain the queue, accumulating each image and its annotations
        # in plain Python lists held entirely in memory.
        while True:
            r = self.queue.get()
            self.images.append(r['image'])
            self.annotations.append(r['annotations'])
            self.queue.task_done()

    def wait(self):
        # Block until every queued item has been processed, then build the
        # SFrame from the in-memory lists and save it to disk in one shot.
        self.queue.join()
        sf = tc.SFrame({'image': self.images, 'annotations': self.annotations})
        sf.save(self.savePath)

@gustavla (Collaborator) commented

Hi @teaglin, the way you are currently creating the SFrame, by first building Python lists and then feeding them to the SFrame constructor, means Python needs to keep all your data in memory before it is even handed off to the SFrame. I suspect this could be why your RAM usage is so high.

Instead, you could try using the SFrameBuilder, a helper class created exactly for this purpose: building up an SFrame row by row.
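
Roughly like this (an untested sketch; my_rows() below is just a placeholder for however you iterate over your image paths and annotation lists):

import turicreate as tc
from turicreate import SFrameBuilder

# Two columns matching the object detection format: an image and a
# list of annotation dictionaries.
builder = SFrameBuilder([tc.Image, list], column_names=['image', 'annotations'])

for image_path, annotations in my_rows():  # placeholder for your own loader
    # Rows are appended one at a time, so the whole dataset never has to
    # sit in a Python list in memory.
    builder.append([tc.Image(image_path), annotations])

sf = builder.close()  # close() returns the finished SFrame
sf.save('training_data.sframe')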

Let us know how it goes! I'm also curious to hear about your experience with the object detector, so feel free to drop us a line here when you get some results.

teaglin closed this as completed Dec 18, 2017
teaglin reopened this Dec 18, 2017
@teaglin (Author) commented Dec 18, 2017

@gustavla I tried out that method and it did help, but the real trick was setting the cache config.

tc.config.set_runtime_config('TURI_CACHE_FILE_LOCATIONS', network_location)

My only criticism is that this seems like a very important piece for building large SFrames, but the API is very obscure. Also, doing it this way is much, much slower than directly writing out to a file. For example, in TensorFlow you can directly write out a single tfrecord, whereas with SFrame it's basically all built in memory. That memory then gets partially cached to disk, and after all that it gets written back out to a saved SFrame.
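
For anyone hitting the same thing, the combination that worked ended up looking roughly like this (sketch; the cache path and the row iterator are placeholders):

import turicreate as tc
from turicreate import SFrameBuilder

# Redirect the SFrame's scratch/cache files to a volume with enough
# free space ('/mnt/scratch' is a placeholder for your own location).
tc.config.set_runtime_config('TURI_CACHE_FILE_LOCATIONS', '/mnt/scratch')

builder = SFrameBuilder([tc.Image, list], column_names=['image', 'annotations'])
for image_path, annotations in my_rows():  # placeholder iterator over the dataset
    builder.append([tc.Image(image_path), annotations])

builder.close().save('training_data.sframe')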

As far as the object detection goes I will give you guys an update once I get some results. Thanks again for the help!

@srikris (Contributor) commented Dec 18, 2017

I'll close this and add another issue that points to the specific concern.
