Multi-threaded generator GPU usage is very low #3026

Closed
harsham05 opened this issue Dec 8, 2018 · 4 comments
Labels: gpu (Using spaCy on GPU) · usage (General spaCy usage)

Comments

@harsham05

How to reproduce the behaviour

I am trying to tokenize documents using the multi-threaded generator, but I am not seeing any speedup on the GPU compared to the CPU.

The GPU utilization seems to be only 6-7% when monitoring with watch -d -n 0.5 nvidia-smi

I have installed spaCy with pip install -U spacy[cuda92]

import spacy
spacy.prefer_gpu()
nlp = spacy.load('en_core_web_sm')

docs = [doc for doc in nlp.pipe(docs, batch_size=10000, n_threads=12)]

Your Environment

  • Operating System: Ubuntu 16.04.5
  • Python Version Used: Python 3.6.5
  • spaCy Version Used: 2.0.18
  • Environment Information: Nvidia GTX 1080, Cuda 9.2
@honnibal
Member

honnibal commented Dec 8, 2018

If you're just trying to tokenize, do [nlp.make_doc(text) for text in texts]. You don't need to run the statistical models at all for that.
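
For illustration only, a minimal sketch of that tokenizer-only approach (the texts list here is placeholder data, not from the thread):

import spacy

nlp = spacy.load('en_core_web_sm')
texts = ['This is a sentence.', 'Here is another one.']  # placeholder input

# make_doc only runs the tokenizer, so no tagger/parser/NER work is done
docs = [nlp.make_doc(text) for text in texts]
tokens_per_doc = [[token.text for token in doc] for doc in docs]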

If you do need to run the models, you could also try using multiprocessing to feed the GPU with more tasks. Here's an example script that reads from stdin, predicts head indices in multiple processes, and prints the results to stdout in order.

import sys
import ujson
from cytoolz import partition_all
import plac
import spacy
from joblib import Parallel, delayed


@plac.annotations(
    model=("Model name (needs tagger)", "positional", None, str),
    n_jobs=("Number of workers", "option", "n", int),
    batch_size=("Batch-size for each process", "option", "b", int),
)
def main(model, n_jobs=5, batch_size=2000):
    texts = (ujson.loads(line)['text'] for line in sys.stdin)
    # Split the input stream into outer-batches
    for partition in partition_all(batch_size * 10, texts):
        # Parallelise within the outer-batch, and collate the results
        executor = Parallel(n_jobs=n_jobs)
        do = delayed(predict_heads)
        tasks = (do(model, subpart) for subpart in partition_all(batch_size//10, partition))
        results = executor(tasks)
        # Print when complete
        for result_group in results:
            for record in result_group:
                print(record)


def predict_heads(model_name, texts):
    nlp = spacy.load(model_name)
    nlp.disable_pipes('tagger', 'ner')
    output = []
    for doc in nlp.pipe(texts):
        heads = [token.head.i - token.i for token in doc]
        tokens = [token.text for token in doc]
        output.append(ujson.dumps({'text': doc.text, 'heads': heads, 'tokens': tokens}))
    return output


if __name__ == '__main__':
    import socket
    try:
        BrokenPipeError
    except NameError:
        BrokenPipeError = socket.error
    try:
        plac.call(main)
    except BrokenPipeError:
        import os
        # Python flushes standard streams on exit; redirect remaining output
        # to devnull to avoid another BrokenPipeError at shutdown
        devnull = os.open(os.devnull, os.O_WRONLY)
        os.dup2(devnull, sys.stdout.fileno())
        sys.exit(1)  # Python exits with error code 1 on EPIPE
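
If you saved this script as, say, parse_heads.py (the filename is just an example), you could feed it newline-delimited JSON with a "text" field, e.g. cat texts.jsonl | python parse_heads.py en_core_web_sm > heads.jsonl.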

@ines added the usage (General spaCy usage) and gpu (Using spaCy on GPU) labels on Dec 8, 2018
@harsham05
Author

Thank you @honnibal

I was hoping to bypass the CPU restriction of #1508 by parallelizing nlp.pipe on a GPU, so that the BLAS operations get delegated to CUDA through CuPy and Thinc.

It seems like nlp.pipe doesn't translate well onto the GPU, hence the low GPU utilization?

Is the n_threads parameter of nlp.pipe not applicable on the GPU in spaCy 2.0.18?

I am actually extracting tokens & POS tags at the sentence level. I disabled 'ner' & 'parser' to get some speedup.

nlp = spacy.load('en_core_web_sm', disable=['ner', "parser"])
nlp.add_pipe(nlp.create_pipe('sentencizer'))

docs_tokens = []
docs_pos_tags = []

for doc in nlp.pipe(docs, batch_size=50000, n_threads=24):
    for sent in doc.sents:
        docs_tokens.append([tok.text.lower() for tok in sent])
        docs_pos_tags.append([tok.pos_ for tok in sent])

@honnibal
Member

honnibal commented Dec 20, 2018

@harsham05 The n_threads argument was disabled in v2.0.x because the matrix multiplications were being performed by numpy, and we couldn't release the GIL around the model. numpy also calls into system libraries that aren't always thread-safe, making it difficult to launch multiple processes.

In v2.1.x (spacy-nightly), we're now using our own package of the Blis linear algebra routines, compiled for single-threaded execution. This makes them safe to use with multiprocessing. This sets us up for launching the processes transparently ourselves, finally solving the problem.

In summary, for now:

  • Try out spacy-nightly
  • Try using multiprocessing to get better speed-ups (a minimal sketch follows below).
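
As an illustration only (not part of the original comment), here is a minimal multiprocessing sketch along those lines; the helper name tag_texts and the chunk size are made up, and each worker loads its own copy of the pipeline:

import multiprocessing as mp

import spacy
from cytoolz import partition_all


def tag_texts(texts):
    # Each worker process loads its own pipeline copy
    nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
    return [[(tok.text, tok.pos_) for tok in doc] for doc in nlp.pipe(texts)]


if __name__ == '__main__':
    texts = ['First document.', 'Second document.'] * 1000  # placeholder data
    chunks = list(partition_all(500, texts))
    # Use spawned workers to avoid sharing state that may not be fork-safe
    with mp.get_context('spawn').Pool(processes=4) as pool:
        results = pool.map(tag_texts, chunks)
    tagged = [sent for group in results for sent in group]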

Closing now to merge discussion with #2075

@lock

lock bot commented Jan 19, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked as resolved and limited conversation to collaborators on Jan 19, 2019