Multi-threaded generator GPU usage is very low #3026

Closed
harsham05 opened this issue Dec 8, 2018 · 4 comments
Labels: gpu (Using spaCy on GPU) · usage (General spaCy usage)

Comments

@harsham05

How to reproduce the behaviour

I am trying to tokenize documents using the multi-threaded generator, but I am not seeing any speedup on the GPU compared to the CPU.

The GPU utilization seems to be only 6-7% when monitoring with watch -d -n 0.5 nvidia-smi

I have installed spaCy with pip install -U spacy[cuda92]

import spacy
spacy.prefer_gpu()
nlp = spacy.load('en_core_web_sm')

docs = [doc for doc in nlp.pipe(docs, batch_size=10000, n_threads=12)]

Your Environment

  • Operating System: Ubuntu 16.04.5
  • Python Version Used: Python 3.6.5
  • spaCy Version Used: 2.0.18
  • Environment Information: Nvidia GTX 1080, Cuda 9.2
@honnibal
Member

honnibal commented Dec 8, 2018

If you're just trying to tokenize, do [nlp.make_doc(text) for text in texts]. You don't need to run the statistical models at all for that.
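
For illustration only, a minimal sketch of that tokenizer-only approach (the texts list here is placeholder data, not from the thread):

import spacy

nlp = spacy.load('en_core_web_sm')
texts = ['This is a sentence.', 'Here is another one.']  # placeholder input

# make_doc only runs the tokenizer, so no tagger/parser/NER work is done
docs = [nlp.make_doc(text) for text in texts]
tokens_per_doc = [[token.text for token in doc] for doc in docs]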

If you do need to run the models, you could also try using multiprocessing to feed the GPU with more tasks. Here's an example script that reads from stdin, predicts head indices in multiple processes, and prints the results to stdout in order.

import sys
import ujson
from cytoolz import partition_all
import plac
import spacy
from joblib import Parallel, delayed


@plac.annotations(
    model=("Model name (needs tagger)", "positional", None, str),
    n_jobs=("Number of workers", "option", "n", int),
    batch_size=("Batch-size for each process", "option", "b", int),
)
def main(model, n_jobs=5, batch_size=2000):
    texts = (ujson.loads(line)['text'] for line in sys.stdin)
    # Split the input stream into outer-batches
    for partition in partition_all(batch_size * 10, texts):
        # Parallelise within the outer-batch, and collate the results
        executor = Parallel(n_jobs=n_jobs)
        do = delayed(predict_heads)
        tasks = (do(model, subpart) for subpart in partition_all(batch_size//10, partition))
        results = executor(tasks)
        # Print when complete
        for result_group in results:
            for record in result_group:
                print(record)


def predict_heads(model_name, texts):
    nlp = spacy.load(model_name)
    nlp.disable_pipes('tagger', 'ner')
    output = []
    for doc in nlp.pipe(texts):
        heads = [token.head.i - token.i for token in doc]
        tokens = [token.text for token in doc]
        output.append(ujson.dumps({'text': doc.text, 'heads': heads, 'tokens': tokens}))
    return output


if __name__ == '__main__':
    import socket
    try:
        BrokenPipeError
    except NameError:
        BrokenPipeError = socket.error
    try:
        plac.call(main)
    except BrokenPipeError:
        import os
        # Python flushes standard streams on exit; redirect remaining output
        # to devnull to avoid another BrokenPipeError at shutdown
        devnull = os.open(os.devnull, os.O_WRONLY)
        os.dup2(devnull, sys.stdout.fileno())
        sys.exit(1)  # Python exits with error code 1 on EPIPE
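
If you saved this script as, say, parse_heads.py (the filename is just an example), you could feed it newline-delimited JSON with a "text" field, e.g. cat texts.jsonl | python parse_heads.py en_core_web_sm > heads.jsonl.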

@ines added the usage (General spaCy usage) and gpu (Using spaCy on GPU) labels on Dec 8, 2018
@harsham05
Author

Thank you @honnibal

I was hoping to bypass the CPU restriction of #1508 by parallelizing nlp.pipe on a GPU, so that the BLAS operations get delegated to CUDA through CuPy and Thinc.

It seems like nlp.pipe doesn't translate well onto the GPU, hence the low GPU utilization?

Is the n_threads parameter of nlp.pipe not applicable on the GPU in spaCy 2.0.18?

I am actually extracting tokens & POS tags at the sentence level. I disabled 'ner' & 'parser' to get some speedup.

nlp = spacy.load('en_core_web_sm', disable=['ner', "parser"])
nlp.add_pipe(nlp.create_pipe('sentencizer'))

docs_tokens = []
docs_pos_tags = []

for doc in nlp.pipe(docs, batch_size=50000, n_threads=24):
    for sent in doc.sents:
        docs_tokens.append([tok.text.lower() for tok in sent])
        docs_pos_tags.append([tok.pos_ for tok in sent])

@honnibal
Member

honnibal commented Dec 20, 2018

@harsham05 The n_threads argument was disabled in v2.0.x because the matrix multiplications were being performed by numpy, and we couldn't release the GIL around the model. numpy also calls into system libraries that aren't always thread-safe, making it difficult to launch multiple processes.

In v2.1.x (spacy-nightly), we're now using our own package of the Blis linear algebra routines, compiled for single-threaded execution. This makes them safe to use with multiprocessing. This sets us up for launching the processes transparently ourselves, finally solving the problem.

In summary, for now:

  • Try out spacy-nightly
  • Try using multiprocessing to get better speed-ups (a minimal sketch follows below).
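
As an illustration only (not part of the original comment), here is a minimal multiprocessing sketch along those lines; the helper name tag_texts and the chunk size are made up, and each worker loads its own copy of the pipeline:

import multiprocessing as mp

import spacy
from cytoolz import partition_all


def tag_texts(texts):
    # Each worker process loads its own pipeline copy
    nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
    return [[(tok.text, tok.pos_) for tok in doc] for doc in nlp.pipe(texts)]


if __name__ == '__main__':
    texts = ['First document.', 'Second document.'] * 1000  # placeholder data
    chunks = list(partition_all(500, texts))
    # Use spawned workers to avoid sharing state that may not be fork-safe
    with mp.get_context('spawn').Pool(processes=4) as pool:
        results = pool.map(tag_texts, chunks)
    tagged = [sent for group in results for sent in group]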

Closing now to merge discussion with #2075

@lock

lock bot commented Jan 19, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked as resolved and limited conversation to collaborators on Jan 19, 2019