Growing memory when parsing huge file #3623

Closed
BramVanroy opened this issue Apr 21, 2019 · 5 comments
Labels
perf / memory Performance: memory use

Comments

@BramVanroy
Contributor

BramVanroy commented Apr 21, 2019

Cross-post from Stack Overflow.

I am not entirely sure whether there is an error in my multiprocessing implementation, or whether spaCy is leaking memory, or whether this is expected behaviour.

In the code below, you'll see that for very large files (in my case 36GB), the memory usage keeps rising over time. We have a machine with 384GB of RAM, so it can take a lot. Still, after 14M sentences the RAM usage is already at 70% and rising.

In issue #3618 it was mentioned that it is to be expected that new strings will increase memory usage. But that alone doesn't seem to explain the continuous increase in memory consumption: after a while you'd expect almost all strings to be 'known', with only rare exceptions still being added.

The one thing that I can think of is that each subprocess uses its own spaCy instance (is that true?), and as such the string memory/vocabulary is specific to each process. That would mean the whole vocabulary is basically duplicated across all subprocesses. Is that true? If so, is there a way to use only one 'lookup table'/Vocab instance across multiple spaCy instances? If this is not the issue, do you have any other idea what may be wrong?
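One way to test that hypothesis would be to log the size of each worker's StringStore next to the memory reading. This is only a sketch (log_vocab_growth is a hypothetical helper, not part of the script below):

import logging

import psutil
import spacy

NLP = spacy.load('en_core_web_sm', disable=['ner', 'textcat'])

def log_vocab_growth(batch_idx):
    # If this count keeps climbing in every worker, each process is
    # building up (and duplicating) its own string table.
    n_strings = len(NLP.vocab.strings)
    rss_mb = psutil.Process().memory_info().rss / 1024 ** 2
    logging.info(f"Batch #{batch_idx:,}: {n_strings:,} strings in StringStore,"
                 f" {rss_mb:,.0f} MiB RSS")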

import logging
import multiprocessing as mp
from os import cpu_count

from spacy.util import minibatch
import spacy

import psutil

logging.basicConfig(datefmt='%d-%b %H:%M:%S',
                    format='%(asctime)s - [%(levelname)s]: %(message)s',
                    level=logging.INFO,
                    handlers=[
                        logging.FileHandler('progress.log'),
                        logging.StreamHandler()
                    ])

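# Note: every worker process ends up with its own copy of NLP (and its own
# StringStore), so memory can grow independently in each subprocess.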
NLP = spacy.load('en_core_web_sm', disable=['ner', 'textcat'])
N_WORKERS = cpu_count()-1 or 1

""" Process a large input file. A separate reader process puts batches of lines in a queue,
    picked up by workers who in turn process these lines. They put the results in a new queue and return some
    diagnostics information after finishing. The results are read by a separate writer process that writes the
    results to a new file.
"""


def reader(src, work_q, batch_size=1000):
    with open(src, encoding='utf-8') as fhin:
        lines = (line.strip() for line in fhin)
        # minibatch is a generator, and as such this approach
        # should be memory-lenient
        for batch_idx, batch in enumerate(minibatch(lines, batch_size), 1):
            work_q.put((batch, batch_idx))

    # Notify all workers that work is done
    for _ in range(N_WORKERS):
        work_q.put('done')

    logging.info('Done reading!')

def writer(results_q):
    with open('out.txt', 'w', encoding='utf-8') as fhout:
        while True:
            # Get values from results queue; write those to a file
            m = results_q.get()
            if m == 'done':
                logging.info('Done writing everything to file!')
                break

            fhout.write('\n'.join(m) + '\n')
            fhout.flush()

    logging.info('Done writing!')

def spacy_process(texts):
    docs = list(NLP.pipe(texts))
    sents = [sent.text for doc in docs for sent in doc.sents]

    return sents, len(sents)

def _start_worker(work_q, results_q):
    # Keep track of some values, e.g. lines processed
    lines_processed = 0
    while True:
        m = work_q.get()
        if m == 'done':
            logging.info('Done reading from file!')
            break

        batch, batch_idx = m
        result, n_lines = spacy_process(batch)
        results_q.put(result)
        lines_processed += n_lines

        if batch_idx == 1 or batch_idx % 25 == 0:
            logging.info(f"Memory usage (batch #{batch_idx:,}):"
                         f" {psutil.virtual_memory().percent}%")

    logging.info('Worker is done working!')

    return lines_processed


if __name__ == '__main__':
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument('fin', help='input file.')
    args = parser.parse_args()

    with mp.Manager() as manager, mp.Pool(N_WORKERS+2) as pool:
        logging.info(f"Started a pool with {N_WORKERS} workers")
        results_queue = manager.Queue(maxsize=N_WORKERS*10)
        work_queue = manager.Queue(maxsize=N_WORKERS*10)

        _ = pool.apply_async(writer, (results_queue, ))
        _ = pool.apply_async(reader, (args.fin, work_queue))

        worker_jobs = []
        for _ in range(N_WORKERS):
            job = pool.apply_async(_start_worker, (work_queue, results_queue))
            worker_jobs.append(job)

        # When a worker has finished its job, get its information back
        total_n_sentences = 0
        for job in worker_jobs:
            n_sentences = job.get()
            total_n_sentences += n_sentences

        # Notify the writer that we're done
        results_queue.put('done')

    logging.info(f"Finished processing {args.fin}. Processed {total_n_sentences} lines.")

Info about spaCy

  • spaCy version: 2.1.3
  • Platform: Linux-4.15.0-46-generic-x86_64-with-debian-buster-sid
  • Python version: 3.6.4
  • Models: en
@honnibal
Member

I agree that the observed behaviour isn't very satisfying, but it's currently difficult to be sure whether it's spaCy leaking memory, or just Python not giving memory back to the operating system. If you're running in multiprocessing anyway, can you just have shorter running processes?

The one thing that I can think of, is that each subprocess uses its own spaCy instance (is that true?)

Yes, that's true.

If so, is there a way to make use of only one 'lookup table'/Voc instance across multiple spaCy instances?

I'd love to have that, and it should be possible to have the Vocab in shared memory. I've just never gotten this implemented, as it takes some work to get it right, since the vocab and several of its components are written in Cython. If you know how to do this, I'd love to have the pull request.
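On the shorter-running processes: one low-effort option is to let multiprocessing recycle each worker after a fixed number of batches via Pool(maxtasksperchild=...). A minimal sketch (the batch source and the numbers are placeholders, and it restructures the queue-based script into a plain pool):

import multiprocessing as mp

import spacy

NLP = None  # set per worker by the initializer

def init_worker():
    # Each worker loads its own model; when the pool recycles the process
    # (see maxtasksperchild below), whatever memory it held goes back to the OS.
    global NLP
    NLP = spacy.load('en_core_web_sm', disable=['ner', 'textcat'])

def process_batch(texts):
    return [sent.text for doc in NLP.pipe(texts) for sent in doc.sents]

if __name__ == '__main__':
    # Toy batches; in practice these would come from minibatch() over the input file.
    batches = [['This is a sentence. And another one.'] * 100] * 200

    # maxtasksperchild=50 kills and replaces each worker after 50 batches,
    # discarding whatever memory it accumulated (including its StringStore).
    with mp.Pool(processes=3, initializer=init_worker, maxtasksperchild=50) as pool:
        for sents in pool.imap_unordered(process_batch, batches):
            pass  # write sents to the output file here

The trade-off is that every recycled worker has to reload the model.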

@ines added the more-info-needed (This issue needs more information) and perf / memory (Performance: memory use) labels on Apr 22, 2019
@BramVanroy
Contributor Author

BramVanroy commented Apr 22, 2019

It is indeed not easy to debug this. I'll try to run this while monitoring the objects' sizes in memory and see what causes the growth. But if I understood you correctly, it might just be Python not giving memory back to the operating system, rather than spaCy itself leaking?

You asked whether I could have shorter-running processes. Do you mean prematurely ending jobs and restarting workers, in the hope that all memory for that worker is released? I'll first try monitoring spaCy's memory usage. Maybe reinitialising only spaCy works as well.

I have no experience in Cython so I fear I cannot help on that. It does look like a very useful feature.
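For the monitoring step, the standard-library tracemalloc module can show which call sites grow between batches. A minimal sketch, separate from the script above:

import tracemalloc

import spacy

tracemalloc.start()
nlp = spacy.load('en_core_web_sm', disable=['ner', 'textcat'])
baseline = tracemalloc.take_snapshot()

# Process a number of batches so any growth has time to show up
for _ in range(100):
    list(nlp.pipe(['This is a sentence. Here is another one.'] * 100))

snapshot = tracemalloc.take_snapshot()
# The call sites whose Python-level allocations grew the most since the baseline
for stat in snapshot.compare_to(baseline, 'lineno')[:10]:
    print(stat)

One caveat: tracemalloc only tracks Python-level allocations, so growth inside spaCy's Cython/C structures won't show up here; pairing it with the psutil RSS numbers from the script helps tell the two apart.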

The no-response bot removed the more-info-needed (This issue needs more information) label on Apr 22, 2019
@BramVanroy
Contributor Author

Most discussion is taking place at #3618. For my use case (parsing one 36GB file with multiple workers) the only working solution was to split up the file into smaller files and manually run the script over these files.
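For reference, the split itself can be done with GNU split -l or with a few lines of Python. A sketch (chunk size and file naming are arbitrary):

import itertools

def split_file(src, lines_per_chunk=1_000_000):
    # Write consecutive chunks of `lines_per_chunk` lines to numbered part files,
    # which can then be fed to the processing script one at a time.
    with open(src, encoding='utf-8') as fhin:
        for idx in itertools.count(1):
            chunk = list(itertools.islice(fhin, lines_per_chunk))
            if not chunk:
                break
            with open(f'{src}.part{idx:04d}', 'w', encoding='utf-8') as fhout:
                fhout.writelines(chunk)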

@BramVanroy
Contributor Author

Closing this in favour of #3618. Please direct memory issues there.

@lock

lock bot commented May 30, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

The lock bot locked this issue as resolved and limited the conversation to collaborators on May 30, 2019