This repository has been archived by the owner on Dec 21, 2023. It is now read-only.

Training – IOError: Fail to write. Disk may be full.: iostream error #102

Closed

teaglin opened this issue Dec 15, 2017 · 7 comments

@teaglin commented Dec 15, 2017

Hi,

I am using a fairly large dataset (75GB) that is stored on a separate server. During training, 250GB is written to the local hard drive, completely filling it up. RAM usage is also extremely high: it uses 80-85% of the 128GB of RAM. Training ends up failing with: IOError: Fail to write. Disk may be full.: iostream error

Something seems very inefficient about loading remote training data via an SFrame. In TensorFlow, loading training data from a local server works flawlessly. Is there some setting that needs to be enabled so the local hard drive is not filled up?

@srikris (Contributor) commented Dec 15, 2017

Sorry you are having trouble with this.

Can you tell us some more about how you created this dataset? By any chance, was it created by appending many small SFrames? It would also help to know what the contents are (how many rows, how many columns, and what the column types are).

@teaglin (Author) commented Dec 15, 2017

Thanks for the quick response!

I am doing object detection, but I created the dataset using existing code I had for TensorFlow. All the image paths + classes are stored in a db and the images are stored in a directory. I do some additional processing to each image, so images are loaded individually via worker queues and then a single worker writes them out sequentially to a training dataset file.

This worked well for larger datasets because TensorFlow allows writing out an individual training example; I wasn't able to find anything like that in the SFrame docs. I initially tried appending separate SFrames, but I could never get that to work. So I store everything in a single list, and when all the processing is done I create the SFrame and write it out to a file. This is only possible because I have enough RAM.

My SFrame structure is exactly like in the object detection documentation: 2 columns, but well over 100k rows.

+------------------------+-------------------------------+
| image | annotations |
+------------------------+-------------------------------+
| Height: 375 Width: 500 | [{'coordinates': {'y': 204... |
| Height: 375 Width: 500 | [{'coordinates': {'y': 148... |
| Height: 334 Width: 500 | [{'coordinates': {'y': 146... |
| Height: 500 Width: 345 | [{'coordinates': {'y': 321... |
| Height: 480 Width: 500 | [{'coordinates': {'y': 301... |
| Height: 375 Width: 500 | [{'coordinates': {'y': 121... |
| Height: 335 Width: 500 | [{'coordinates': {'y': 119... |
| Height: 335 Width: 500 | [{'coordinates': {'y': 150... |
| Height: 500 Width: 333 | [{'coordinates': {'y': 235... |
| Height: 333 Width: 500 | [{'coordinates': {'y': 120... |
+------------------------+-------------------------------+

@srikris (Contributor) commented Dec 15, 2017

This is well within the limits of the SFrame and should not cause any trouble. Can you share the snippet of code you are using to create and write out the SFrame to a file? That could help us better identify why this might be happening.

@teaglin (Author) commented Dec 15, 2017

This is what I ended up having to do to get it to work, as I mentioned in my previous post.

from threading import Thread
from queue import Queue   # "from Queue import Queue" on Python 2

import turicreate as tc


class WriteWorker(Thread):
    def __init__(self, savePath):
        Thread.__init__(self)
        self.images = []
        self.annotations = []
        self.queue = Queue()
        self.daemon = True
        self.savePath = savePath
        self.start()

    def run(self):
        # Drain the queue, accumulating each image and its annotations
        # in plain Python lists held entirely in memory.
        while True:
            r = self.queue.get()
            self.images.append(r['image'])
            self.annotations.append(r['annotations'])
            self.queue.task_done()

    def wait(self):
        # Block until every queued item has been processed, then build the
        # SFrame from the in-memory lists and save it to disk in one shot.
        self.queue.join()
        sf = tc.SFrame({'image': self.images, 'annotations': self.annotations})
        sf.save(self.savePath)

@gustavla (Collaborator) commented

Hi @teaglin, the way you are currently creating the SFrame, by first building Python lists and then feeding them to the SFrame constructor, means Python needs to keep all your data in memory before it is even handed off to the SFrame. I suspect this could be why your RAM usage is so high.

Instead, you could try using the SFrameBuilder, a helper class created exactly for this purpose: building up an SFrame row by row.
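
Roughly like this (an untested sketch; my_rows() below is just a placeholder for however you iterate over your image paths and annotation lists):

import turicreate as tc
from turicreate import SFrameBuilder

# Two columns matching the object detection format: an image and a
# list of annotation dictionaries.
builder = SFrameBuilder([tc.Image, list], column_names=['image', 'annotations'])

for image_path, annotations in my_rows():  # placeholder for your own loader
    # Rows are appended one at a time, so the whole dataset never has to
    # sit in a Python list in memory.
    builder.append([tc.Image(image_path), annotations])

sf = builder.close()  # close() returns the finished SFrame
sf.save('training_data.sframe')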

Let us know how it goes! I'm also curious to hear about your experience with the object detector, so feel free to drop us a line here when you get some results.

teaglin closed this as completed Dec 18, 2017
teaglin reopened this Dec 18, 2017
@teaglin (Author) commented Dec 18, 2017

@gustavla I tried out that method and it did help, but the real trick was setting the cache config.

tc.config.set_runtime_config('TURI_CACHE_FILE_LOCATIONS', network_location)

My only criticism is that this seems like a very important piece for building large SFrames, but the API is very obscure. Also, doing it this way is much, much slower than directly writing out to a file. For example, in TensorFlow you can directly write out a single tfrecord, whereas with SFrame it's basically all built in memory. That memory then gets partially cached to disk, and after all that it gets written back out to a saved SFrame.
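
For anyone hitting the same thing, the combination that worked ended up looking roughly like this (sketch; the cache path and the row iterator are placeholders):

import turicreate as tc
from turicreate import SFrameBuilder

# Redirect the SFrame's scratch/cache files to a volume with enough
# free space ('/mnt/scratch' is a placeholder for your own location).
tc.config.set_runtime_config('TURI_CACHE_FILE_LOCATIONS', '/mnt/scratch')

builder = SFrameBuilder([tc.Image, list], column_names=['image', 'annotations'])
for image_path, annotations in my_rows():  # placeholder iterator over the dataset
    builder.append([tc.Image(image_path), annotations])

builder.close().save('training_data.sframe')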

As far as the object detection goes I will give you guys an update once I get some results. Thanks again for the help!

@srikris (Contributor) commented Dec 18, 2017

I'll close this and add another issue that points to the specific concern.
