Growing memory when parsing huge file #3623

Closed
BramVanroy opened this issue Apr 21, 2019 · 5 comments
Labels
perf / memory Performance: memory use

Comments

@BramVanroy
Contributor

BramVanroy commented Apr 21, 2019

Cross-post from Stack Overflow.

I am not entirely sure whether there is an error in my multiprocessing implementation, or whether spaCy is leaking memory, or whether this is expected behaviour.

In the code below, you'll see that for very large files (in my case 36GB), the memory usage keeps rising over time. We have a machine with 384GB of RAM, so it can take a lot. Still, after 14M sentences the RAM usage is already at 70% and rising.

In issue #3618 it was mentioned that it is to be expected that new strings will increase memory usage. But that alone doesn't seem to explain the continuous increase in memory consumption: after a while you'd expect almost all strings to be 'known', with only rare exceptions still being added.

The one thing that I can think of is that each subprocess uses its own spaCy instance (is that true?), and as such the string memory/vocabulary is specific to each process. That would mean the whole vocabulary is basically duplicated across all subprocesses. Is that true? If so, is there a way to use only one 'lookup table'/Vocab instance across multiple spaCy instances? If this is not the issue, do you have any other idea what may be wrong?
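One way to test that hypothesis would be to log the size of each worker's StringStore next to the memory reading. This is only a sketch (log_vocab_growth is a hypothetical helper, not part of the script below):

import logging

import psutil
import spacy

NLP = spacy.load('en_core_web_sm', disable=['ner', 'textcat'])

def log_vocab_growth(batch_idx):
    # If this count keeps climbing in every worker, each process is
    # building up (and duplicating) its own string table.
    n_strings = len(NLP.vocab.strings)
    rss_mb = psutil.Process().memory_info().rss / 1024 ** 2
    logging.info(f"Batch #{batch_idx:,}: {n_strings:,} strings in StringStore,"
                 f" {rss_mb:,.0f} MiB RSS")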

import logging
import multiprocessing as mp
from os import cpu_count

from spacy.util import minibatch
import spacy

import psutil

logging.basicConfig(datefmt='%d-%b %H:%M:%S',
                    format='%(asctime)s - [%(levelname)s]: %(message)s',
                    level=logging.INFO,
                    handlers=[
                        logging.FileHandler('progress.log'),
                        logging.StreamHandler()
                    ])

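# Note: every worker process ends up with its own copy of NLP (and its own
# StringStore), so memory can grow independently in each subprocess.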
NLP = spacy.load('en_core_web_sm', disable=['ner', 'textcat'])
N_WORKERS = cpu_count()-1 or 1

""" Process a large input file. A separate reader process puts batches of lines in a queue,
    picked up by workers who in turn process these lines. They put the results in a new queue and return some
    diagnostics information after finishing. The results are read by a separate writer process that writes the
    results to a new file.
"""


def reader(src, work_q, batch_size=1000):
    with open(src, encoding='utf-8') as fhin:
        lines = (line.strip() for line in fhin)
        # minibatch is a generator, and as such this approach
        # should be memory-lenient
        for batch_idx, batch in enumerate(minibatch(lines, batch_size), 1):
            work_q.put((batch, batch_idx))

    # Notify all workers that work is done
    for _ in range(N_WORKERS):
        work_q.put('done')

    logging.info('Done reading!')

def writer(results_q):
    with open('out.txt', 'w', encoding='utf-8') as fhout:
        while True:
            # Get values from results queue; write those to a file
            m = results_q.get()
            if m == 'done':
                logging.info('Done writing everything to file!')
                break

            fhout.write('\n'.join(m) + '\n')
            fhout.flush()

    logging.info('Done writing!')

def spacy_process(texts):
    docs = list(NLP.pipe(texts))
    sents = [sent.text for doc in docs for sent in doc.sents]

    return sents, len(sents)

def _start_worker(work_q, results_q):
    # Keep track of some values, e.g. lines processed
    lines_processed = 0
    while True:
        m = work_q.get()
        if m == 'done':
            logging.info('Done reading from file!')
            break

        batch, batch_idx = m
        result, n_lines = spacy_process(batch)
        results_q.put(result)
        lines_processed += n_lines

        if batch_idx == 1 or batch_idx % 25 == 0:
            logging.info(f"Memory usage (batch #{batch_idx:,}):"
                         f" {psutil.virtual_memory().percent}%")

    logging.info('Worker is done working!')

    return lines_processed


if __name__ == '__main__':
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument('fin', help='input file.')
    args = parser.parse_args()

    with mp.Manager() as manager, mp.Pool(N_WORKERS+2) as pool:
        logging.info(f"Started a pool with {N_WORKERS} workers")
        results_queue = manager.Queue(maxsize=N_WORKERS*10)
        work_queue = manager.Queue(maxsize=N_WORKERS*10)

        _ = pool.apply_async(writer, (results_queue, ))
        _ = pool.apply_async(reader, (args.fin, work_queue))

        worker_jobs = []
        for _ in range(N_WORKERS):
            job = pool.apply_async(_start_worker, (work_queue, results_queue))
            worker_jobs.append(job)

        # When a worker has finished its job, get its information back
        total_n_sentences = 0
        for job in worker_jobs:
            n_sentences = job.get()
            total_n_sentences += n_sentences

        # Notify the writer that we're done
        results_queue.put('done')

    logging.info(f"Finished processing {args.fin}. Processed {total_n_sentences} lines.")

Info about spaCy

  • spaCy version: 2.1.3
  • Platform: Linux-4.15.0-46-generic-x86_64-with-debian-buster-sid
  • Python version: 3.6.4
  • Models: en
@honnibal
Member

I agree that the observed behaviour isn't very satisfying, but it's currently difficult to be sure whether it's spaCy leaking memory, or just Python not giving memory back to the operating system. If you're running in multiprocessing anyway, can you just have shorter running processes?

The one thing that I can think of, is that each subprocess uses its own spaCy instance (is that true?)

Yes, that's true.

If so, is there a way to make use of only one 'lookup table'/Voc instance across multiple spaCy instances?

I'd love to have that, and it should be possible to have the Vocab in shared memory. I've just never gotten this implemented, as it takes some work to get it right, since the vocab and several of its components are written in Cython. If you know how to do this, I'd love to have the pull request.
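On the shorter-running processes: one low-effort option is to let multiprocessing recycle each worker after a fixed number of batches via Pool(maxtasksperchild=...). A minimal sketch (the batch source and the numbers are placeholders, and it restructures the queue-based script into a plain pool):

import multiprocessing as mp

import spacy

NLP = None  # set per worker by the initializer

def init_worker():
    # Each worker loads its own model; when the pool recycles the process
    # (see maxtasksperchild below), whatever memory it held goes back to the OS.
    global NLP
    NLP = spacy.load('en_core_web_sm', disable=['ner', 'textcat'])

def process_batch(texts):
    return [sent.text for doc in NLP.pipe(texts) for sent in doc.sents]

if __name__ == '__main__':
    # Toy batches; in practice these would come from minibatch() over the input file.
    batches = [['This is a sentence. And another one.'] * 100] * 200

    # maxtasksperchild=50 kills and replaces each worker after 50 batches,
    # discarding whatever memory it accumulated (including its StringStore).
    with mp.Pool(processes=3, initializer=init_worker, maxtasksperchild=50) as pool:
        for sents in pool.imap_unordered(process_batch, batches):
            pass  # write sents to the output file here

The trade-off is that every recycled worker has to reload the model.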

@ines added the more-info-needed (This issue needs more information) and perf / memory (Performance: memory use) labels on Apr 22, 2019
@BramVanroy
Contributor Author

BramVanroy commented Apr 22, 2019

It is indeed not easy to debug this. I'll try to run this while monitoring the objects' sizes in memory and see what causes the growth. But if I understood you correctly, it might just be Python not giving memory back to the operating system, rather than spaCy itself leaking?

You asked whether I could have shorter-running processes. Do you mean prematurely ending jobs and restarting workers, in the hope that all memory for that worker is released? I'll first try monitoring spaCy's memory usage. Maybe reinitialising only spaCy works as well.

I have no experience in Cython so I fear I cannot help on that. It does look like a very useful feature.
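For the monitoring step, the standard-library tracemalloc module can show which call sites grow between batches. A minimal sketch, separate from the script above:

import tracemalloc

import spacy

tracemalloc.start()
nlp = spacy.load('en_core_web_sm', disable=['ner', 'textcat'])
baseline = tracemalloc.take_snapshot()

# Process a number of batches so any growth has time to show up
for _ in range(100):
    list(nlp.pipe(['This is a sentence. Here is another one.'] * 100))

snapshot = tracemalloc.take_snapshot()
# The call sites whose Python-level allocations grew the most since the baseline
for stat in snapshot.compare_to(baseline, 'lineno')[:10]:
    print(stat)

One caveat: tracemalloc only tracks Python-level allocations, so growth inside spaCy's Cython/C structures won't show up here; pairing it with the psutil RSS numbers from the script helps tell the two apart.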

The no-response bot removed the more-info-needed (This issue needs more information) label on Apr 22, 2019
@BramVanroy
Contributor Author

Most discussion is taking place at #3618. For my use case (parsing one 36GB file with multiple workers) the only working solution was to split up the file into smaller files and manually run the script over these files.
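For reference, the split itself can be done with GNU split -l or with a few lines of Python. A sketch (chunk size and file naming are arbitrary):

import itertools

def split_file(src, lines_per_chunk=1_000_000):
    # Write consecutive chunks of `lines_per_chunk` lines to numbered part files,
    # which can then be fed to the processing script one at a time.
    with open(src, encoding='utf-8') as fhin:
        for idx in itertools.count(1):
            chunk = list(itertools.islice(fhin, lines_per_chunk))
            if not chunk:
                break
            with open(f'{src}.part{idx:04d}', 'w', encoding='utf-8') as fhout:
                fhout.writelines(chunk)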

@BramVanroy
Contributor Author

Closing this in favour of #3618. Please direct memory issues there.

@lock

lock bot commented May 30, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

The lock bot locked this issue as resolved and limited the conversation to collaborators on May 30, 2019