v2 standard pipeline running 10x slower #1508

Closed
hodsonjames opened this Issue Nov 8, 2017 · 56 comments

hodsonjames commented Nov 8, 2017

Your Environment

Info about spaCy

  • Python version: 2.7.13
  • Platform: Linux-4.10.0-38-generic-x86_64-with-debian-stretch-sid
  • spaCy version: 2.0.0
  • Models: en

I just updated to v2.0. Not sure what changed, but the exact same pipeline of documents, called in the standard nlp = spacy.load('en'); nlp(u"string") way, is now 10x slower.

honnibal commented Nov 8, 2017

If you switch to nlp.pipe(), you should get between 8,000 and 10,000 words per second. In comparison, the v1 models would get 15k-30k, but with much higher memory usage.

Overall the v2 models are cheaper to run, because you can use them on smaller instances. However, it's true that the speed is much worse if you call it document-by-document. To make using nlp.pipe() easier, there's now an as_tuples keyword argument that lets you pass in an iterator of (text, context) pairs, so you can get back an iterator of (doc, context) tuples.
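The (text, context) round-trip can be illustrated without loading a model by standing a dummy generator in for nlp.pipe() --- fake_pipe below is purely an illustrative stub, not spaCy API:

```python
# Sketch of the as_tuples contract of nlp.pipe(): pass (text, context)
# pairs in, get (doc, context) pairs back out. fake_pipe stands in for
# nlp.pipe so the pattern runs without a model; a real pipeline would
# yield Doc objects instead of upper-cased strings.
def fake_pipe(items, as_tuples=False):
    for item in items:
        if as_tuples:
            text, context = item
            yield (text.upper(), context)  # stand-in for the processed Doc
        else:
            yield item.upper()

records = [("First doc.", {"id": 1}), ("Second doc.", {"id": 2})]
for doc, context in fake_pipe(records, as_tuples=True):
    print(context["id"], doc)
```

With the real API the consuming loop is identical; only fake_pipe is replaced by nlp.pipe.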

ines added a commit that referenced this issue Nov 8, 2017

hodsonjames commented Nov 8, 2017

I just ran an experiment with nlp.pipe(), with batches of 10k documents, and 8/16 threads (I'm running a 32Gb/8 core setup). I'm seeing 700 documents per second, and each document has an average of 4.5 words. This translates to roughly 3k words per second.

I am now really missing the previous version... Is it different data structures, or changes in parallelization? It would be great to have the option of the faster, more memory-intensive processing.

For reference, I dropped back to 1.9 and tried the same nlp.pipe() approach. I am seeing 50% more memory usage in v1.9, but a definitive 10x slow-down in v2.

I was curious as to whether this might have something to do with short documents. So, I tried passing in documents that were 10x longer (50 words). I see a big improvement in the v2 processing (twice as fast), but still 7-8x slower than v1.9. Memory ratios seem approximately equivalent under these scenarios.

honnibal commented Nov 8, 2017

4-5 words per document is probably a problem --- I hadn't thought about that use-case.

There are a couple of potential problems here. Could you disable the tagger and NER to make sure it's the parser that's slower? It probably is, but just in case. The most convenient way to do that is nlp.disable_pipes('tagger', 'ner').

Your parse times are really bad though. Do you know which BLAS your numpy is linked against? If you're on a Skylake processor, you might be using a version of OpenBLAS that fails to detect your architecture correctly --- there's a bug in the version that ships with the most recent versions of Ubuntu and Debian. You can avoid this bug by getting numpy via conda, which is linked against MKL. Unfortunately we're still waiting to publish v2 on conda --- we rely on CI servers to build the wheels, and there's currently a backlog :(.

honnibal commented Nov 8, 2017

The best way to tell which BLAS your numpy is linked against is to run:

ldd `python -c "import numpy.core; print(numpy.core.multiarray.__file__)"`

Example output:

ldd `python -c "import numpy.core; print(numpy.core.multiarray.__file__)"`
    linux-vdso.so.1 =>  (0x00007ffdb0d21000)
    libmkl_rt.so => /home/matt/.pyenv/versions/miniconda-latest/lib/python2.7/site-packages/numpy/core/../../../../libmkl_rt.so (0x00007fbd0ad10000)
    libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fbd0aad3000)
    libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fbd0a7ca000)
    libpython2.7.so.1.0 => /home/matt/.pyenv/versions/miniconda-latest/lib/python2.7/site-packages/numpy/core/../../../../libpython2.7.so.1.0 (0x00007fbd0a3ed000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fbd0a026000)
    /lib64/ld-linux-x86-64.so.2 (0x000055cfc12ea000)
    libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fbd09e20000)
    libutil.so.1 => /lib/x86_64-linux-gnu/libutil.so.1 (0x00007fbd09c1d000)

That installation has numpy linked against MKL. When numpy is linked against OpenBLAS, it looks like this:

ldd `python -c "import numpy.core; print(numpy.core.multiarray.__file__)"`
    linux-vdso.so.1 =>  (0x00007fff395b8000)
    libopenblasp-r0-39a31c03.2.18.so => /home/matt/repos/spacy-v2/.env/lib/python2.7/site-packages/numpy/core/../.libs/libopenblasp-r0-39a31c03.2.18.so (0x00007f339ec1e000)
    libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f339e8f6000)
    libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f339e6d8000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f339e311000)
    /lib64/ld-linux-x86-64.so.2 (0x000056361421d000)
    libgfortran-ed201abd.so.3.0.0 => /home/matt/repos/spacy-v2/.env/lib/python2.7/site-packages/numpy/core/../.libs/libgfortran-ed201abd.so.3.0.0 (0x00007f339e017000)

Of note here is that the OpenBLAS version is 2.18, which is the version all my machines get by default... but this version has a significant bug that causes quite poor performance :(.

To check whether your numpy's BLAS is basically decent, you can benchmark it against a wrapper I made for the Blis linear algebra library:

pip install blis
python -m blis.benchmark

When I'm linked against the bad OpenBLAS, I get this:

Setting up data nO=384 nI=384 batch_size=2000
Blis...
11032014.6484
7.62 seconds
Numpy...
11032016.6016
18.82 seconds

When I'm linked against MKL, I get this:

Setting up data nO=384 nI=384 batch_size=2000
Blis...
11032014.6484
7.17 seconds
Numpy...
11032011.7188
4.01 seconds

Correctly compiled versions of OpenBLAS give similar performance --- but unfortunately, compiling it correctly is a total pain in the ass. Running conda is often hugely inconvenient too! I've still not managed to get my servers automatically deploying into conda environments.

You might be able to tell I've spent a lot of time on this...I wish I could say I've come to a better conclusion. The good news is that it won't be long before the bug in OpenBLAS 2.18 rolls past, and we're all getting better CPU neural network performance by default. I believe that this unfortunate problem has contributed to a lot of NLP researchers greatly over-estimating the performance advantage of GPUs for their use-case.

For now, my best advice is as follows:

  1. Run the Blis benchmark, and see if your numpy performance looks okay.
  2. If not, try to get numpy through conda, to get it linked to MKL. Then install spacy via pip in the same environment.
  3. Try to use multi-processing to work your cores, instead of multi-threading. I want to switch nlp.pipe() to multi-processing by default --- it should give much better performance. Multi-threading is currently only used during the matrix multiplications, which only works if the matrices are large.
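Step 3 above can be sketched with the standard library; count_words here is a whitespace-splitting stub standing in for a per-process spacy.load() + nlp.pipe() call, so the example runs without spaCy installed:

```python
# Each worker process would load its own copy of the model and run
# nlp.pipe() over its chunk of texts; here count_words is a stub.
from multiprocessing import Pool

def count_words(texts):
    # Real per-process version would be roughly:
    #   nlp = spacy.load('en')
    #   return sum(len(doc) for doc in nlp.pipe(texts))
    return sum(len(t.split()) for t in texts)

def partition(seq, chunk_size):
    return [seq[i:i + chunk_size] for i in range(0, len(seq), chunk_size)]

if __name__ == '__main__':
    texts = ["a short doc", "another short doc"] * 1000
    with Pool(4) as pool:
        total = sum(pool.map(count_words, partition(texts, 500)))
    print(total)  # 2000 docs x 3 words = 6000
```

Each process pays the cost of loading its own copy of the model, which the smaller v2 models make affordable.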

hodsonjames commented Nov 8, 2017

Awesome --- thanks for the detailed response. I had no idea about the issue with OpenBLAS 2.18. I quickly checked and found that v1.9 was running in a conda env with MKL, and v2.0 was running with OpenBLAS 2.18. Indeed. I thought this would be the answer, so...

Updated numpy to use OpenBLAS 2.19.

v2.0 with MKL Benchmark:

python -m blis.benchmark
Setting up data nO=384 nI=384 batch_size=2000
Blis...
11032014.6484
7.88 seconds
Numpy...
11032011.7188
4.28 seconds

v2.0 with OpenBLAS 2.19:

python -m blis.benchmark
Setting up data nO=384 nI=384 batch_size=2000
Blis...
11032014.6484
7.94 seconds
Numpy...
11032015.625
5.27 seconds

Not a huge difference. Timings are still the same as before for the respective pipelines.

Disabled the entire pipeline for v2.0 (nlp.pipeline yields []). This is still 2x slower than v1.9 with the entire pipeline switched on (!).

honnibal commented Nov 8, 2017

What was the performance with OpenBLAS 2.18 on your machine?

But! It sounds like the problem is much more likely to be this:

Disabled entire pipeline for v2.0 (nlp.pipeline yields []). This yields performance 2x slower still than v1.9 with the entire pipeline switched on (!).

Wat. Okay... Next investigation. Could you install linux-tools so that you can run perf top? This shows you which functions are occupying CPU time. If you process a bunch of text and watch sudo perf top, what do you see?

honnibal commented Nov 8, 2017

Hmm. I wonder whether the culprit is my fix to this: #285

The fix only kicks in after many documents have been processed. I didn't take much care to make the process cheap, because I thought if it only runs every N documents, there's plenty of time to amortize the cost. But if the documents are short...maybe not!

Here's the relevant bit of code: https://github.com/explosion/spaCy/blob/master/spacy/language.py#L544

You could reach inside your copy of spacy and comment out those lines to see if that's the problem. You can find the path to the file with python -c "import spacy.language; import os.path; print(os.path.abspath(spacy.language.__file__))"

hodsonjames commented Nov 8, 2017

Reducing the language.py pipe to only yielding documents after line 544 does not improve the speed. The vocabulary grows at the expected rate (eventually around 10% of documents contain a new noun). Not yielding the documents (i.e. removing everything after line 544) keeps the vocabulary at a constant 78,930 and runs through a million documents in 3 seconds, but does not provide any output. Will continue investigating tomorrow.

The OpenBLAS2.18 results were only slightly worse than 2.19, but getting them now would require reinstalling everything.

ashaffer commented Nov 8, 2017

Just to note, I'm experiencing the exact same issue. Large numbers of very small documents passed to nlp.pipe are processed much (more than 10x) slower than those same documents passed to nlp individually.

honnibal commented Nov 8, 2017

Thanks, looking into it.

azar923 commented Nov 8, 2017

Hello all,
Thanks to the spaCy dev team for the exciting improvements and features in 2.0. Unfortunately, I am having the same issue described above, and it is a huge blocker for using 2.0, since our system works in a streaming fashion. However, we can afford to batch-process ~50-100 documents at once, but the results are still much worse than in 1.6, which we use now:

Python version: 2.7.13
Platform: Ubuntu 16.04
spaCy version: 2.0.0
Models: en

numpy is linked to OpenBLAS 2.18; I will investigate that further. But here are some stats. As a benchmark we use the Enron email dataset (measuring the processing time of ~1,200 emails). In both cases the pipeline runs without the syntactic parser (NER and tagger only).

1.6
One-by-one mode:
18 sec - Tagger + NER
9 sec - Tagger

Batch mode:
Not measured since one-by-one mode is fast enough

2.0
One-by-one mode:
156 sec - Tagger + NER
88 sec - Tagger

Batch mode (batch size 50, increasing leads to no improvements)
109 sec - Tagger + NER
63 sec - Tagger

There are improvements in batch mode; however, it is still a ~5x decrease in comparison with 1.6.
Hope these results help to establish whether there is a problem causing the decrease or whether these results are expected.
I will also investigate the options that were provided above, thanks.

UPD Benchmark with OpenBLAS 2.18 leads to:

Blis...
11032014.6484
9.18 seconds
Numpy...
11032015.625
5.66 seconds

So I guess it's not the issue

ashaffer commented Nov 8, 2017

Apologies if this is a little off track, but I just ran your Blis benchmark and got the following results:

Setting up data nO=384 nI=384 batch_size=2000
Blis...
11032016.6016
97.67 seconds
Numpy...
11032010.7422
2.59 seconds

Which of course looks great for numpy... but the Blis numbers I'm seeing seem crazy relative to others in this thread. Any idea what might be going on there?

honnibal commented Nov 8, 2017

Ran some small benchmarks (script attached). I still suspect the problem is coming in from the change to nlp.pipe() introduced to address the streaming data memory growth. These benchmark figures don't trigger that logic, because not many documents are being processed. I'll look into that next.

Here's the summary:

  • With 50 character docs, I'm getting 8k words per second for tagger + NER

  • 8k wps is achieved using nlp.pipe() and numpy linked against MKL

  • Using nlp() decreases speed to 2.8k wps

  • With numpy linked against a poorly configured OpenBLAS, I get 6k wps when using nlp.pipe()

  • If I have bad Blas and don't use pipe, performance drops to 2,000 wps

Script attached below.

# With bad blas
$ python bin/benchmark_small_docs.py 
components= ['tagger', 'parser', 'ner'] disabled= [] use_pipe= 0 wps= 1182.9 words= 60307 seconds= 50.98254521697527

components= ['tagger', 'parser', 'ner'] disabled= [] use_pipe= 1 wps= 3444.612476996155 words= 60307 seconds= 17.50762978498824

$ python bin/benchmark_small_docs.py tagger -p 
components= ['tagger'] disabled= ['parser', 'ner'] use_pipe= 1 wps= 16969.68281227715 words= 60307 seconds= 3.5538083220017143

$ python bin/benchmark_small_docs.py tagger   
components= ['tagger'] disabled= ['parser', 'ner'] use_pipe= 0 wps= 3746.429137875223 words= 60307 seconds= 16.097194896952715

# So: Tagger is 5x slower without .pipe()

$ python bin/benchmark_small_docs.py tagger,ner
components= ['tagger', 'ner'] disabled= ['parser'] use_pipe= 0 wps= 1927.4320057496022 words= 60307 seconds= 31.288782079005614

$ python bin/benchmark_small_docs.py tagger,ner -p
components= ['tagger', 'ner'] disabled= ['parser'] use_pipe= 1 wps= 6041.756207845903 words= 60307 seconds= 9.98170034098439

# NER is 3x slower without .pipe()

# Now the numbers with numpy linked against MKL

$ python bin/benchmark_small_docs.py tagger,ner -p
('components=', ['tagger', 'ner'], 'disabled=', [u'parser'], 'use_pipe=', 1, 'wps=', 8058.4194224254325, 'words=', 55889, 'seconds=', 6.935479164123535)

('components=', ['tagger', 'ner'], 'disabled=', [u'parser'], 'use_pipe=', 0, 'wps=', 2843.937503013295, 'words=', 55889, 'seconds=', 19.651978969573975)
from timeit import default_timer as timer
import spacy
import thinc.extra.datasets
import plac


def iter_phrases(texts):
    for text in texts:
        for i in range(0, len(text), 50):
            yield text[i : i + 50]

@plac.annotations(
    components=("Pipeline components", "positional"),
    use_pipe=("Whether to use nlp.pipe()", "flag", "p")
)
def main(components='tagger,parser,ner', use_pipe=False):
    components = components.split(',')
    use_pipe = int(use_pipe)
    train, dev = thinc.extra.datasets.imdb()
    texts, labels = zip(*train)
    texts = texts[:200]
    nlp = spacy.load('en_core_web_sm')
    start = timer()
    disabled = [name for name in nlp.pipe_names if name not in components]
    with nlp.disable_pipes(*disabled):
        n_words = 0
        if use_pipe:
            for doc in nlp.pipe(iter_phrases(texts)):
                n_words += len(doc)
        else:
            for phrase in iter_phrases(texts):
                doc = nlp(phrase)
                n_words += len(doc)
    seconds = timer() - start
    wps = n_words / seconds
    print('components=', components, 'disabled=', disabled, 'use_pipe=', use_pipe, 'wps=', wps,
          'words=', n_words, 'seconds=', seconds)


if __name__ == '__main__':
    plac.call(main)

honnibal commented Nov 8, 2017

@ashaffer What sort of CPU do you have?

ashaffer commented Nov 8, 2017

@honnibal Intel Core i7, 3.1 GHz (MacBook Pro, 2017).

honnibal commented Nov 8, 2017

While it's running, does it spawn some crazy number of threads?

Most likely I stuffed up the compilation settings for clang.

ashaffer commented Nov 8, 2017

No, actually. Seems to have just one thread.

honnibal commented Nov 8, 2017

Btw...For some reassurance here: If we can't get this sorted out, I could update the linear model code and just train new linear models for v2. It would just be a different parser class. So even if the problem proves difficult, the worst-case solution isn't so bad.

I'd much rather figure out why the neural networks are getting uneven performance though. Too much choice is bad. If we add more options, people will have to run experiments to figure out what to use. It's also really hard to make good trade-offs if everyone's using different stuff.

Probably the best solution will involve writing more of the forward-pass in Cython, and training models with narrower hidden layers. At the moment the hidden layer sizes have been configured with the current overheads from making a bunch of Python calls in mind. If we eliminate those overheads, performing less computation in the matrix multiplications starts to become better value.

hodsonjames commented Nov 9, 2017

I have run the benchmark on my system with both v1.9 and v2.0, both with the changes in language.py and without. It doesn't look like the StringStore changes are to blame (nor MKL vs. OpenBLAS, though that is certainly a good point to note for future experiments):

v1.9

All components (no pipe)

('version=', '1.9', 'components=', ['tagger', 'parser', 'ner'], 'disabled=', [], 'use_pipe=', 0, 'wps=', 19865.129453002002, 'words=', 64177, 'seconds=', 3.23063588142395)

All components (pipe)

('version=', '1.9', 'components=', ['tagger', 'parser', 'ner'], 'disabled=', [], 'use_pipe=', 1, 'wps=', 33373.595977682424, 'words=', 61751, 'seconds=', 1.850295066833496)

Only Tagger (no pipe)

('version=', '1.9', 'components=', ['tagger'], 'disabled=', ['ner', 'parser'], 'use_pipe=', 0, 'wps=', 67369.0306107956, 'words=', 58240, 'seconds=', 0.8644921779632568)

Only Tagger (pipe)

('version=', '1.9', 'components=', ['tagger'], 'disabled=', ['ner', 'parser'], 'use_pipe=', 1, 'wps=', 69439.76097805913, 'words=', 59184, 'seconds=', 0.8523070812225342)

v2.0

Without stringstore changes

All components (no pipe)

('components=', ['tagger', 'parser', 'ner'], 'disabled=', [], 'use_pipe=', 0, 'wps=', 889.6924891513431, 'words=', 59553, 'seconds=', 66.93661093711853)

Only Tagger (no pipe)

('components=', ['tagger'], 'disabled=', [u'parser', u'ner'], 'use_pipe=', 0, 'wps=', 3144.8852607352537, 'words=', 59553, 'seconds=', 18.936461925506592)

All components (pipe)

('version=', '2.0', 'components=', ['tagger', 'parser', 'ner'], 'disabled=', [], 'use_pipe=', 1, 'wps=', 3740.9975863119626, 'words=', 59553, 'seconds=', 15.919015884399414)

Only Tagger (pipe)

('version=', '2.0', 'components=', ['tagger'], 'disabled=', [u'parser', u'ner'], 'use_pipe=', 1, 'wps=', 16177.477181439559, 'words=', 59553, 'seconds=', 3.6812291145324707)

With changes

All components (no pipe)

('version=', '2.0', 'components=', ['tagger', 'parser', 'ner'], 'disabled=', [], 'use_pipe=', 0, 'wps=', 928.1122493334815, 'words=', 59553, 'seconds=', 64.16572999954224)

Only Tagger (no pipe)

('version=', '2.0', 'components=', ['tagger'], 'disabled=', [u'parser', u'ner'], 'use_pipe=', 0, 'wps=', 3197.4621024591324, 'words=', 59553, 'seconds=', 18.625083923339844)

All components (pipe)

('version=', '2.0', 'components=', ['tagger', 'parser', 'ner'], 'disabled=', [], 'use_pipe=', 1, 'wps=', 3977.071598419103, 'words=', 59553, 'seconds=', 14.974082946777344)

Only Tagger (pipe)

('version=', '2.0', 'components=', ['tagger'], 'disabled=', [u'parser', u'ner'], 'use_pipe=', 1, 'wps=', 15237.57534115711, 'words=', 59553, 'seconds=', 3.908298969268799)

Modified Benchmark Script for both v1.9 and v2.0 (hacky):

from timeit import default_timer as timer
import spacy
import thinc.extra.datasets
import plac


def iter_phrases(texts):
	for text in texts:
		for i in range(0, len(text), 50):
			yield text[i : i + 50]

@plac.annotations(
	components=("Pipeline components", "positional"),
	use_pipe=("Whether to use nlp.pipe()", "flag", "p"),
	version=("Is this version 1.9?","flag","v")
)

def main(components='tagger,parser,ner', use_pipe=False, version=False):
	components = components.split(',')
	use_pipe = int(use_pipe)
	train, dev = thinc.extra.datasets.imdb()
	texts, labels = zip(*train)
	texts = texts[:200]
	nlp = spacy.load('en_core_web_sm')
	start = timer()
	
	n_words = 0
	
	v = "2.0"
	
	if version:
	
		v = "1.9"
		
		disabled = [name for name in ['tagger','ner','parser'] 
					if name not in components]
	
		nlp.pipeline = []
		if 'tagger' in components:
			nlp.pipeline.append(nlp.tagger)
		if 'ner' in components:
			nlp.pipeline.append(nlp.entity)
		if 'parser' in components:
			nlp.pipeline.append(nlp.parser)
			
	else:
		disabled = [name for name in nlp.pipe_names if name not in components]
		nlp.disable_pipes(*disabled)
			
	if use_pipe:
		for doc in nlp.pipe(iter_phrases(texts)):
			n_words += len(doc)
	else:
		for phrase in iter_phrases(texts):
			doc = nlp(phrase)
			n_words += len(doc)

	seconds = timer() - start
	wps = n_words / seconds
	print('version=', v, 'components=', components, 'disabled=', disabled,
		  'use_pipe=', use_pipe, 'wps=', wps, 'words=', n_words,
		  'seconds=', seconds)


if __name__ == '__main__':
	plac.call(main)

hodsonjames commented Nov 9, 2017

Ah, I don't know how I missed the fact that you switched all the models over from averaged perceptron to neural networks. In that case, yes, I'd be happy with an extra set of linear models for en, de and fr that can be loaded. This would very likely solve the speed issue.

What was the relative performance boost that you achieved with the neural models? Any results you can post? I'm using this as part of a feature selection framework, so not very sensitive to F1 unless the models are very noisy in my domains.

honnibal commented Nov 9, 2017

Thanks! If we do a few more iterations of this, we'll get to the bottom of it.

Here's the accuracy comparison: https://spacy.io/usage/facts-figures#spacy-models

honnibal commented Nov 9, 2017

Could you run this benchmark next? It exercises the machine learning library directly. I think this is a good next step in the binary search: if this is slower, we're pretty close to our answer. More likely the problem will be in spaCy, but there aren't many more steps to search through.

With MKL I get:

$ time python bin/mnist_bench.py 128 
python bin/mnist_bench.py 128  4.63s user 0.14s system 98% cpu 4.837 total                                               
(miniconda-latest) 
# matt @ matt-XPS-13-9360 in ~/repos/spacy-v2 on git:master x [2:05:18] 
$ time python bin/mnist_bench.py 2  
python bin/mnist_bench.py 2  46.46s user 0.22s system 99% cpu 46.737 total

This shows the huge difference that batch size can make for the neural network: with batch size 2 it takes 46s; with 128 it takes 4.8s. spaCy uses all the words in the batch, so the effective batch size is higher. But with short documents, the batch size can still be quite low.

from __future__ import print_function
import plac
from thinc.v2v import Model, ReLu, Softmax
from thinc.api import clone, chain
from thinc.neural.util import to_categorical

from thinc.extra import datasets

@plac.annotations(
    batch_size=("Batch size", "positional", None, int)
)
def main(batch_size=128, width=512, depth=2):
    with Model.define_operators({'**': clone, '>>': chain}):
        model = ReLu(width) >> ReLu(width) >> Softmax()

    train_data, dev_data, _ = datasets.mnist()
    train_X, train_y = model.ops.unzip(train_data)
    dev_X, dev_y = model.ops.unzip(dev_data)
    dev_y = to_categorical(dev_y)
    train_y = to_categorical(train_y)
    with model.begin_training(train_X[:100], train_y[:100]) as (trainer, _):
        trainer.nb_epoch = 3
        trainer.batch_size = batch_size
        for X, y in trainer.iterate(train_X, train_y):
            model.begin_update(X)

if __name__ == '__main__':
    plac.call(main)

hodsonjames commented Nov 9, 2017

Timings for batch size = 2

1m27.127s

Timings for batch size = 128

7.018s

Seems similar to what you were getting.

honnibal commented Nov 9, 2017

I had a closer look at your numbers, and yeah, your tagger performance is similar to mine. So okay, it must be in the parser. Could you attach the nn_parser.so object that clang has compiled for you? I want to check that it's using AVX2.

hodsonjames commented Nov 10, 2017

Sure. I'm using GCC 5.4.0.

nn_parser.so.zip

honnibal commented Nov 12, 2017

Can't find anything wrong with the linking or compilation here. Hmm!

funnydevnull commented Nov 13, 2017

Hi, I'm having a similar issue, though I haven't yet tried all the variations of OpenBLAS, etc. Unlike the OP, my documents are not that short. I'm running on ~42k documents, each ~1 page long (~3M words total), and have run benchmarks with different thread/batch sizes on v1.8.

My best results for v1.8 (on a 4-core skylake i7) are:

avg batch words/sents: 56224.77, 2734.49
batchsize 800, nthreads 8
	wc, time, rate (word/s), rate (sent/s):  3025161 39.7191619873 76163.7670243 3703.67834163

The v2.0 performance by comparison is 10x worse:

avg batch words/sents: 13747.65, 621.05
batchsize 200, nthreads 4
	wc, time, rate (word/s), rate (sent/s):  274953 41.4248280525 6637.39628928 299.84433452

I am running all of this in Docker with a standard numpy/OpenBLAS install (I checked that it's 2.18) on a Skylake processor, but your blis module seems to generate good numbers. Here's the blis output:

Setting up data nO=384 nI=384 batch_size=2000
Blis...
11032014.6484
5.83 seconds
Numpy...
11032015.625
2.81 seconds

honnibal commented Mar 12, 2018

Okay new plan:

  • Bring our own BLAS (or specifically, gemm)
  • Our gemm is single-threaded.
  • Implement for low latency per process (i.e. don't assume large batches)
  • Rely on multi-processing for parallelisation

The models fit well in memory anyway, so one copy of the model per process isn't so bad. And now that wheels and conda are more standard, it's easier to bring our own gemm. If necessary we can just statically link OpenBLAS. It's pretty small when compiled single-threaded, without LAPACK or Fortran.

I think this approach has the best chance of always doing sensible things. It means the same story applies to all the units in the library, rather than just the ones we happen to multi-thread.

Multi-threading is also only really attractive when the threads would otherwise sit idle. In production settings that's really not the case. Even if your machine is large, you'd rather just run more containers.

honnibal commented Mar 12, 2018

Here's a wheel for Thinc that statically links single-threaded OpenBLAS. On my 4-core laptop, the single process is training about 70% as fast as an MKL-linked numpy that uses all cores. It's also much faster than an unoptimised BLAS.

thinc-6.10.2-cp36-cp36m-linux_x86_64.whl.rename.zip

Try it out and see how you go when you parse with multiple processes on your machine, e.g. using joblib or just starting two processes running. The wheel is only compatible with Linux Python3.6, using a reasonably modern Intel processor (e.g. Haswell, Sandy Bridge, Ivy Bridge, etc). To try it, rename it to thinc-6.10.2-cp36-cp36m-linux_x86_64.whl, run pip uninstall thinc within your spaCy environment, and then do pip install thinc-6.10.2-cp36-cp36m-linux_x86_64.whl

It's currently taking about 2 minutes per epoch to train the parser on the English Universal Dependencies corpus. We would be able to train the parser, POS tagger and entity recognizer on different processes given the current architecture. Asynchronous SGD should also work, for larger training jobs.

kootenpv commented Mar 12, 2018

Initially 10x slower was the problem... I'm curious how much closer we are again to v1! Looking forward to seeing some benchmarks.

honnibal commented Mar 12, 2018

I've attached a benchmark script for running the pipeline on some IMDB text using multi-processing with joblib. With en_core_web_sm using the single-threaded Thinc wheel I posted, I'm seeing:

j=1   6,500 wps
j=2  10,103 wps
j=3  11,133 wps

My machine is allegedly 4 core, but as you can see performance is pretty terrible with more than two active processes. I wonder whether I have power management throttling me or something. I've confirmed the same behaviour applies if I simply run the command twice...

import plac
from timeit import default_timer as timer
import spacy
import joblib
from toolz import partition_all

def count_words(model_loc, texts):
    nlp = spacy.load(model_loc)
    n_words = 0
    for doc in nlp.pipe(texts):
        n_words += len(doc)
    return n_words

@plac.annotations(
    model=("spaCy model to load (path or name)", "positional", None, str),
    n_jobs=("Number of processes", "option", "j", int)
)
def main(model, n_jobs=2):
    import thinc.extra.datasets  # not imported above; imdb() comes from here
    train, test = thinc.extra.datasets.imdb()
    texts = [text for (text, label) in train[:10000]]
    chunk_size = int(len(texts) / n_jobs)
    begin = timer()
    partitions = partition_all(chunk_size, texts)
    executor = joblib.Parallel(n_jobs=n_jobs)
    do = joblib.delayed(count_words)
    tasks = (do(model, batch) for batch in partitions)
    word_counts = executor(tasks)
    end = timer()
    n_words = sum(word_counts)
    print(n_words, end-begin, n_words/(end-begin))

if __name__ == '__main__':
    plac.call(main)

saroufimc1 commented May 11, 2018

Any update on this? It looks like nlp.pipe is not doing a good job of parallelizing. Using the batch size and thread options is not improving speed at all.

wrathagom commented May 20, 2018

I tried to modify the Rasa NLU code to use the pipe command rather than passing in docs one at a time, and I am getting much worse performance. Even with the number of threads set to 8, it appears to be trying to use all the cores on a 16-core VM...

I'm still working through all of the above trying to see if I can benefit from any of the actions specifically. Is the thinc install the recommended fix?

sudarshan85 commented May 23, 2018

I believe I'm having this issue as well. Running 100,000 documents with an average length of about 8,500 words took 1h 28m. The notes are stored in a pandas DataFrame; I just iterate through them and yield them to nlp.pipe(). I disabled the tagger and NER but kept the parser, since I wanted to detect sentences. I ran this on a 64-core instance with n_threads=-1. Given this, does that sound like reasonable performance?

I was also writing the parsed notes to disk, so I'm curious whether that I/O contributed to the slow performance.

EDIT:
I concatenated about 1,000 documents into a single string and ran that through nlp directly, and the Jupyter kernel keeps restarting without reporting why. I ran htop and it's only using about half the RAM (16G out of 31G).

EDIT-2:
I ran my documents through spaCy 1.9, as many people here have done, and I get about a 9x speed-up:

| Number of documents | spaCy 2.0 (s) | spaCy 1.9 (s) |
| --- | --- | --- |
| 100 | 7 | 3 |
| 1,000 | 59 | 6 |
| 10,000 | 548 | 63 |
| 100,000 | 5,250 | 596 |

This doesn't provide a great incentive to use spaCy 2.0, which is unfortunate because I'd really like to use the neural models that are present in spaCy 2.0. Is there any fix for this yet?

@honnibal (Member) commented May 30, 2018

> I believe I'm having this issue as well. Running 100,000 documents with an average length of about 8,500 words took 1h 28m. The notes are stored in a pandas DataFrame; I just iterate through them and yield them to nlp.pipe(). I disabled the tagger and NER but kept the parser, since I wanted to detect sentences. I ran this on a 64-core instance with n_threads=-1. Given this, does that sound like reasonable performance?

Try running it on a single core instance and comparing the speed. You may actually find the performance improves, because the default numpy configuration is pretty terrible at figuring out how many threads to launch to do the matrix multiplications. If you install via conda you'll get numpy linked against MKL, which is better at detecting this.

The strategy going forward is to use multi-processing rather than multi-threading. The current performance target is 10,000 words per second per CPU core for the full pipeline. If you only run the parser, you should be seeing around 8k wps/core with v2.0.11. On v2.1.0, that should increase to around 12k wps/core.

@sudarshan85 commented May 30, 2018

> Try running it on a single core instance and comparing the speed.

I'm guessing that this is as easy as setting `n_threads=1`?

> The strategy going forward is to use multi-processing rather than multi-threading.

Does it mean that we don't need to worry about releasing the GIL?

@honnibal (Member) commented Jul 6, 2018

> I'm guessing that this is as easy as setting `n_threads=1`?

At the moment, not quite. You'll need to set the environment variables for the BLAS library that numpy is calling into. Unfortunately, because we delegate to numpy, and numpy multi-threads the matrix multiplications by default (by delegating them to a BLAS library), the `n_threads` argument doesn't actually control the threading in v2.

> Does it mean that we don't need to worry about releasing the GIL?

If we use multi-processing, we won't need to worry about releasing the GIL.
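For reference, a minimal sketch of pinning the BLAS thread count via environment variables. The variable names below cover the common backends (OpenBLAS, MKL, plain OpenMP); which one actually applies depends on how your numpy was built, and the variables must be set before numpy is first imported:

```python
import os

# Must be set before numpy (and hence the BLAS library) is imported.
# Which variable takes effect depends on the BLAS backend numpy links against.
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["OMP_NUM_THREADS"] = "1"

import numpy as np  # matrix multiplications now stay single-threaded

result = float(np.dot(np.ones(3), np.ones(3)))  # -> 3.0
```

With the BLAS layer single-threaded, one process uses one core, and you can scale out with multiple processes instead.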

@sadovnychyi (Contributor) commented Jul 6, 2018

While some models are small and it's OK to keep a copy in memory for each process, what about models with word vectors? Those are big, and duplicating them in memory sounds expensive. Maybe we could store the vectors in a memmap or something, to share them across processes?
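A rough sketch of that memmap idea using plain numpy (the file path and the tiny vector table are made up for illustration): save the vectors once, then open the file with `mmap_mode` in each worker, so the OS page cache backs every process with a single physical copy:

```python
import os
import tempfile

import numpy as np

# Hypothetical vector table standing in for a model's word vectors.
vectors = np.arange(12, dtype="float32").reshape(4, 3)

path = os.path.join(tempfile.mkdtemp(), "vectors.npy")
np.save(path, vectors)

# Each worker process would open the same file memory-mapped, read-only;
# pages are faulted in on demand and shared between processes.
shared = np.load(path, mmap_mode="r")
row = np.asarray(shared[2])  # [6.0, 7.0, 8.0]
```

The same trick doesn't cover the rest of a model's weights, but for a multi-gigabyte vector table it's where most of the savings would come from.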

@buhrmann commented Aug 1, 2018

FWIW, we have been running Spacy v1 on a server behind Uwsgi/Flask, and on our (Linux/Mac) systems, copy-on-write seems to avoid the multiplying of memory when Uwsgi forks the app (including spacy models). I.e. we run 2 or 4 forked processes but only incur the memory cost of one.

We haven't done any detailed profiling, but I would expect that memory use eventually does increase somewhat, because of the growing vocabulary (from what I understand, this means affected memory pages will be duplicated across processes).

But it would be nice if this were a "feature" of spaCy v2 as well (we haven't tested it yet, but this thread suggests it may not be that easy).

@ines added the perf / speed label and removed the performance label on Aug 15, 2018

Ljferrer referenced this issue in usc-isi-i2/dig-text-similarity-search Sep 15, 2018

@cwharland commented Oct 6, 2018

Previously, @honnibal suggested that one solution to closing the gap between v2 and v1.9 speeds would be retraining the linear models (#1508 (comment)).

Is this still a viable option? Would we lose a good chunk of accuracy? I've been running some of the 2.1.0a nightlies and it looks like we're not likely to close much of the gap to 1.9, but perhaps I'm missing something.

@honnibal (Member) commented Oct 21, 2018

@cwharland It's not terribly difficult. I did some work a while ago on that, and got it running in a hacky way. I've just tried to find the branch and pushed the state I found, which doesn't seem to be working. I guess I started trying to unhack it.

If you're on Windows, I think I've tracked down another source of slowness. It seems the Windows Subsystem for Linux doesn't expose AVX2 instructions, which forces the BLAS libraries to fall back to older kernels. This makes things about 2x slower on my new laptop.

The issue about the AVX2 instructions can be found here: Microsoft/WSL#2234

@cwharland commented Oct 21, 2018

Gotcha, so it seems we should just make sure that the BLAS libraries in whatever environment we run (for us, usually Ubuntu-based Docker images) are linked correctly.

@honnibal (Member) commented Dec 20, 2018

I think we can finally close this 🎉

spacy-nightly now ships with matrix multiplications provided by our Blis package, in single-thread mode. I've also performed extensive hyper-parameter search to make the models smaller, and made sure that Pickle is working again.

You should be seeing between 7000 and 9000 words per second per thread. With the pipeline properly single-threaded, you should now find it easy to use multi-processing to speed up your execution.

@honnibal closed this on Dec 20, 2018

lock bot commented Jan 19, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked as resolved and limited conversation to collaborators on Jan 19, 2019
