πŸ’« spaCy v2.0.0 alpha – details, feedback & questions (plus stickers!) #1105

Closed
ines opened this Issue Jun 5, 2017 · 109 comments

ines commented Jun 5, 2017

We're very excited to finally publish the first alpha pre-release of spaCy v2.0. It's still an early release and (obviously) not intended for production use. You might come across a NotImplementedError – see the release notes for the implementation details that are still missing.

This thread is intended for general discussion, feedback and all questions related to v2.0. If you come across more complex bugs, feel free to open a separate issue.

Quickstart & overview

The most important new features

  • New neural network models for English (15 MB) and multi-language NER (12 MB), plus GPU support via Chainer's CuPy.
  • Strings mapped to hash values instead of integer IDs. This means they will always match – even across models.
  • Improved saving and loading, consistent serialization API across objects, plus Pickle support.
  • Built-in displaCy visualizers with Jupyter notebook support.
  • Improved language data with support for lazy loading and multi-language models. Alpha tokenization for Norwegian BokmΓ₯l, Japanese, Danish and Polish. Lookup-based lemmatization for English, German, French, Spanish, Italian, Hungarian, Portuguese and Swedish.
  • Revised API for Matcher and language processing pipelines.
  • Trainable document vectors and contextual similarity via convolutional neural networks.
  • Various bug fixes and almost completely re-written documentation.
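
The hash-ID idea can be illustrated with a toy store (a simplified sketch using Python's hashlib for determinism; spaCy's actual StringStore uses MurmurHash):

```python
import hashlib

def string_to_hash(s):
    # Deterministic 64-bit ID derived from the string's bytes.
    # (Toy stand-in for the MurmurHash spaCy actually uses.)
    return int.from_bytes(hashlib.sha1(s.encode("utf8")).digest()[:8], "big")

class ToyStringStore:
    """Maps strings to stable hash IDs and remembers the reverse mapping."""
    def __init__(self):
        self._map = {}

    def add(self, s):
        key = string_to_hash(s)
        self._map[key] = s
        return key

    def __getitem__(self, key):
        return self._map[key]

store_a = ToyStringStore()  # e.g. inside one model
store_b = ToyStringStore()  # e.g. inside another model
# The same string hashes to the same ID in both stores - no shared
# integer counter is needed, so annotations always match across models.
assert store_a.add("ORG") == store_b.add("ORG")
```

This is why hash IDs "always match – even across models": the ID is a pure function of the string, not of the order in which strings were seen.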

Installation

spaCy v2.0.0-alpha is available on pip as spacy-nightly. If you want to test the new version, we recommend setting up a clean environment first. To install the new model, you'll have to download it with its full name, using the --direct flag.

pip install spacy-nightly
python -m spacy download en_core_web_sm-2.0.0-alpha --direct   # English
python -m spacy download xx_ent_wiki_sm-2.0.0-alpha --direct   # Multi-language NER

Once downloaded, the model can be loaded via spacy.load(), or imported directly as a module:

import spacy
nlp = spacy.load('en_core_web_sm')

import en_core_web_sm
nlp = en_core_web_sm.load()

Alpha models for German, French and Spanish are coming soon!

Now on to the fun part – stickers!

We just got our first delivery of spaCy stickers and want to share them with you! There's only one small favour we'd like to ask. The part we're currently behind on is testing – this includes our test suite as well as in-depth testing of the new features and usage examples. So here's the idea:

  • Find something that's currently not covered in the test suite and doesn't require the models, and write a test for it - for example, language-specific tokenization tests.
  • Alternatively, find examples from the docs that haven't been added to the tests yet and add them. Plus points if the examples don't actually work – this means you've either discovered a bug in spaCy, or a bug in the docs! πŸŽ‰

Submit a PR with your test to the develop branch – if the test covers a bug and currently fails, mark it with @pytest.mark.xfail. For more info, see the test suite docs. Once your pull request is accepted, send us your address via email or private message on Gitter and we'll mail you stickers.

If you can't find anything, don't have time or can't be bothered, that's fine too. Posting your feedback on spaCy v2.0 here counts as well. To be honest, we really just want to mail out stickers πŸ˜‰
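
A sketch of what such a submission might look like (the en_tokenizer fixture name follows spaCy's test-suite conventions; the example text and expected token count are hypothetical):

```python
import pytest

# Hypothetical sketch of a test submission: `en_tokenizer` is a fixture
# provided by spaCy's test suite; the text and expected count here are
# made up for illustration.
@pytest.mark.xfail
def test_en_tokenizer_handles_abbreviations(en_tokenizer):
    # If this currently fails, xfail documents the bug without
    # breaking the build - exactly what the post above asks for.
    tokens = en_tokenizer("i.e. this is a test.")
    assert len(tokens) == 6
```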

kootenpv commented Jun 5, 2017

I honestly just wanted it to work :-)

I had spaCy 1.5 installed on my other machine; I removed it and installed spacy-nightly v2.0.0a0.
It imports fine, but then I tried to download with both:

python -m spacy download en
python -m spacy download en_core_web_sm

Both give:

Compatibility error
No compatible models found for v2.0.0a0 of spaCy.

Most likely I'm missing something?

EDIT: Indeed, on the release page it says: en_core_web_sm-2.0.0-alpha. You also need to give the --direct flag.

python -m spacy download en_core_web_sm-2.0.0-alpha --direct

Perhaps it is possible to temporarily update the docs for it?

Otherwise: it works!

ines commented Jun 5, 2017

Sorry about that! We originally decided against adding the alpha models to the compatibility table and shortcuts just yet to avoid confusion – but maybe it actually ended up causing more confusion. Just added the models and shortcuts, so in about 5 minutes (which is roughly how long it takes GitHub to clear its cache for raw files), the following commands should work as well:

python -m spacy download en
python -m spacy download xx

kootenpv commented Jun 5, 2017

Another update: I tried parsing 16k headlines. I can parse all of them* and access some common attributes of each, including vectors :)

I did notice that on an empty string (one of the headlines*), it now throws an exception; this was not the case in v1.8.2. Probably better to fix that :)

I wanted to do a benchmark against v1.8.2, but the machines are not comparable :( It did feel a lot slower though...

honnibal commented Jun 5, 2017

Thanks!

Try doc.similarity() if you have a use-case for it? I'm not sure how well this works yet. It's using the tensors learned for the parser, NER and tagger (but no external data). It seems to have some interesting context sensitivity, and in theory it might give useful results --- but it hasn't been optimised for that. So, I'm curious to hear how it does.

http://alpha.spacy.io/docs/usage/word-vectors-similarities
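
Similarity scores like this are conventionally the cosine between averaged vector rows; a minimal pure-Python sketch of that computation (an illustration of the convention, not spaCy's actual code):

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def doc_vector(token_rows):
    # Average the per-token rows into one document vector, mirroring
    # how doc-level similarity is commonly computed from token vectors.
    n = len(token_rows)
    return [sum(row[i] for row in token_rows) / n
            for i in range(len(token_rows[0]))]

doc1 = [[1.0, 0.0], [0.0, 1.0]]   # two token rows
doc2 = [[0.5, 0.5]]               # one token row
sim = cosine(doc_vector(doc1), doc_vector(doc2))  # 1.0: same direction
```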

buhrmann commented Jun 6, 2017

Hi, very excited about the new, better, smaller and potentially faster(?) spaCy 2.0. I hope to give it a try in the next few days. Just one question: according to the (new) docs, the embeddings seem to work just as they did before, i.e. external word vectors and their averages for spans and docs. But you also mention the use of tensors for similarity calculations. Is it correct that the vectors are essentially the same, but are not used as such in the similarity calculations anymore? Or are they somehow combined with the internal tensor representations of the documents?

In any case, thanks for the great work, and I hope to be able to give some useful feedback soon about the Spanish model etc.

honnibal commented Jun 6, 2017

@buhrmann This is inherently a bit confusing, because there are two types of vector representations:

  1. You can import word vectors, as before. The assumption is you'll want to leave these static, with perhaps a trainable projection layer to reduce dimensionality.

  2. The parser, NER, tagger etc. learn a small embedding table and a depth-4 convolutional layer, to assign the document a tensor with a row for each token in context.

We're calling type 1 "vectors", and type 2 "tensors". I've designed the neural network models to use a very small embedding table, shared between the parser, tagger and NER. I've also avoided using pre-trained vectors as features. I didn't want the models to depend on, say, the GloVe vectors, because I want to make sure you can load in any arbitrary word vectors without messing up the pipeline.
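
A toy sketch of the distinction (illustration only; the real model uses a trained embedding table and a depth-4 CNN, not this simple window average):

```python
# Type 1: a static word-vector table - one fixed row per word *type*.
word_vectors = {
    "bank":  [1.0, 0.0],
    "river": [0.0, 1.0],
    "money": [0.5, 0.5],
}

def doc_tensor(tokens, window=1):
    # Type 2 (toy): one row per *token*, built from its context.
    # Here we just average a +/-1 token window; the real model applies
    # a depth-4 convolutional layer over learned embeddings.
    rows = []
    for i in range(len(tokens)):
        ctx = tokens[max(0, i - window): i + window + 1]
        vecs = [word_vectors[t] for t in ctx]
        rows.append([sum(v[j] for v in vecs) / len(vecs)
                     for j in range(2)])
    return rows

# "bank" gets a different tensor row depending on its neighbours:
row_a = doc_tensor(["river", "bank"])[1]
row_b = doc_tensor(["money", "bank"])[1]
assert row_a != row_b   # context-sensitive, unlike the static table
```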

buhrmann commented Jun 6, 2017

Thanks, that's clear now. I still had doubts about how the (type 1) vectors and (type 2) tensors are used in similarity calculations, since you mention above the tensors could have interesting properties in this context (something I'm keen to try). I've cleared this up looking at the code and it seems that the tensors for now are only used in similarity calculations when there are no word vectors available (which of course could easily be changed with user hooks).

eldor4do commented Jun 6, 2017

Hi,

I wanted to do a quick benchmark between spaCy v1.8.2 and v2.0.0. First of all, memory usage is dramatically lower in the new version! The old version's model took approximately 1 GB of memory, while the new one takes about 200 MB.

However, I noticed that the latest release is using all 8 cores of my machine (100% usage), but it is remarkably slow!

I made two separate virtualenvs to make sure the installation was clean.
This is a small script I wrote to test its speed -

import time
import spacy
nlp = spacy.load('en')

def do_lemma(text):
	doc = nlp(text.decode('utf-8'))
	lemma = []
	for token in doc:
		lemma.append(token.lemma_)
	return ' '.join(lemma)

def time_lemma():
	text = 'mangoes bought were nice this time'  # just a stupid sentence
	start = time.time()
	for i in range(1000):
		do_lemma(text)
	end = time.time()
	print end - start

time_lemma()

And for the latest release, the same code with only the model load changed -

import time
import spacy
nlp = spacy.load('en_core_web_sm')

def do_lemma(text):
	doc = nlp(text.decode('utf-8'))
	lemma = []
	for token in doc:
		lemma.append(token.lemma_)
	return ' '.join(lemma)

def time_lemma():
	text = 'mangoes bought were nice this time'  # just a stupid sentence
	start = time.time()
	for i in range(1000):
		do_lemma(text)
	end = time.time()
	print end - start

time_lemma()

The first (v1.8.2) runs in 0.15 seconds while the latest (v2.0.0) took 11.77 seconds to run!
Is there something I'm doing wrong in the way I'm using the new model?

honnibal commented Jun 6, 2017

Hm! That's a lot worse than my tests, but in my tests I used the .pipe() method, which lets the model minibatch. This helps to mask the Python overhead a bit. I still think the result you're seeing is much slower than I expect though.

A thought: Could you try setting export OPENBLAS_NUM_THREADS=1 and trying again? If your machine has lots of cores, it could be that the stupid thing tries to load up like 40 threads to do this tiny amount of work per document, and that kills the performance.
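
If exporting the variable in the shell is inconvenient, it can also be set from Python, provided this runs before numpy first loads its BLAS backend; this is a sketch of a workaround, and on a Mac linked against Accelerate the variable has no effect:

```python
import os

# OpenBLAS reads this variable once, when the library is first loaded,
# so it must be set before numpy (and hence spaCy) is imported anywhere
# in the process.
os.environ["OPENBLAS_NUM_THREADS"] = "1"

# ...only now import the libraries that pull in BLAS:
# import numpy
# import spacy
```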

eldor4do commented Jun 6, 2017

@honnibal Hi, setting export OPENBLAS_NUM_THREADS=1 surely helped! It avoided that 100% usage but it is still slower than the old guy. Now it takes about 4 seconds to run, way faster than before but still slow.

slavaGanzin Jun 6, 2017

I just finished reading the documentation for v2.0 and it's way better than for v1.*.

But this export OPENBLAS_NUM_THREADS=1 is new to me. I thought BLAS was used by numpy only to train vectors.
Could this be documented?

honnibal commented Jun 6, 2017

@slavaGanzin The neural network model makes lots of calls to numpy.tensordot, which uses blas -- both for training and runtime. I'd like to have set this within the code --- even for my own usage I don't want to micromanage this stupid environment variable! The behaviour of "Spin up 40 threads to compute this tiny matrix multiplication" is one that nobody could want. So, we should figure out how to stop it from happening.

@eldor4do What happens if you use .pipe() as well?
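
Why .pipe() helps can be sketched with a toy cost model (the numbers are hypothetical, purely to show how minibatching amortises per-call overhead):

```python
FIXED_OVERHEAD = 5   # hypothetical per-call cost (Python dispatch etc.)
PER_DOC_COST = 1     # hypothetical per-document cost (the actual work)

def process_one_by_one(docs):
    # One call per document: the fixed overhead is paid every time.
    return len(docs) * (FIXED_OVERHEAD + PER_DOC_COST)

def process_batched(docs, batch_size):
    # One call per batch: overhead is amortised across the batch,
    # which is roughly what nlp.pipe() does with its minibatches.
    n_batches = -(-len(docs) // batch_size)  # ceiling division
    return n_batches * FIXED_OVERHEAD + len(docs) * PER_DOC_COST

docs = ["headline"] * 1000
assert process_batched(docs, 50) < process_one_by_one(docs)
```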

eldor4do commented Jun 6, 2017

@honnibal I'll try with .pipe() once, but in my actual use case I won't be able to use pipe(), it would be more like repeated calls.

honnibal commented Jun 6, 2017

Hopefully the ability to hold more workers in memory compensates a bit.

Btw, the changes to the StringStore are also very useful for multi-processing. The annotations from each worker are now easy to reconcile, because they're stored as hash IDs -- so the annotation encoding no longer depends on the worker's state.

eldor4do commented Jun 6, 2017

@honnibal Yes, that is a plus. Also, I tested the OPENBLAS setting on two machines: on one it was able to reduce the threads; on the other, a 4-core machine, it failed to do so – still all cores at 100% usage. Any idea what could be the problem?

honnibal commented Jun 6, 2017

@eldor4do That's annoying. I think it depends on what BLAS numpy is linked to. Is the second machine a mac? If so the relevant library will be Accelerate, not openblas. Maybe there's a numpy API for limiting the thread count?

Appreciate the feedback -- this is good alpha testing :)

kootenpv commented Jun 6, 2017

@honnibal What are the expected differences for your test cases?

notnami commented Jun 6, 2017

#1021 is still an issue with this alpha release -- all of the sentences I gave as examples fail to be parsed correctly.

alfonsomhc Jun 8, 2017

I get some errors when I run this example from the documentation (https://alpha.spacy.io/docs/usage/lightning-tour#examples-tokens-sentences):

doc = nlp(u"Peach emoji is where it has always been. Peach is the superior "
          u"emoji. It's outranking eggplant πŸ‘ ")

assert doc[0].text == u'Peach'
assert doc[1].text == u'emoji'
assert doc[-1].text == u'πŸ‘'
assert doc[17:19].text == u'outranking eggplant'
assert doc.noun_chunks[0].text == u'Peach emoji'

sentences = list(doc.sents)
assert len(sentences) == 3
assert sentences[0].text == u'Peach is the superior emoji.'

There are two problems:

  1. This expression:
    doc.noun_chunks[0].text
    raises the error:
    TypeError: 'generator' object is not subscriptable

  2. This expression:
    sentences[0].text
    returns:
    'Peach emoji is where it has always been.'
    and therefore the last assertion fails.

I'm using Python 3.6 (and spaCy 2.0 alpha)
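
The first failure is generic Python behaviour rather than a spaCy bug: generators don't support indexing, so the docs example needs next() or list(). A minimal illustration with a plain generator standing in for doc.noun_chunks:

```python
def chunks():
    # Stand-in for doc.noun_chunks, which is also a generator.
    yield "Peach emoji"
    yield "the superior emoji"

try:
    chunks()[0]
except TypeError as e:
    print(e)  # 'generator' object is not subscriptable

first = next(chunks())        # take just the first item
all_chunks = list(chunks())   # or materialise them all
assert first == "Peach emoji"
assert all_chunks[1] == "the superior emoji"
```

So the docs example would work as list(doc.noun_chunks)[0].text or next(doc.noun_chunks).text.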

alfonsomhc Jun 8, 2017

I also have a problem with the example: https://alpha.spacy.io/docs/usage/lightning-tour#examples-pos-tags
This statement fails:
assert [apple.pos_, apple.pos] == [u'PROPN', 17049293600679659579]
because
[apple.pos_, apple.pos]
returns
['PROPN', 95]
The rest of the assertions are fine.

alfonsomhc commented Jun 8, 2017

I also have a problem with the example https://alpha.spacy.io/docs/usage/lightning-tour#displacy
The line
displacy.serve(doc_ent, style='ent')
gives the error:
OSError: [Errno 98] Address already in use

I'm running it from Jupyter. I have read the documentation (https://alpha.spacy.io/docs/usage/visualizers#jupyter) and I understand that Jupyter mode should be detected automatically. I tried setting
jupyter=True
but I got the same error.

If it helps, I'm using Jupyter 5.0

v3t3a commented Jun 8, 2017

Hi,

I don't know if you consider the following a bug or not, but there is a difference between v2 and v1 when creating a matcher:

With v1

import spacy

nlp = spacy.load('en_core_web_sm')
matcher = spacy.matcher.Matcher(nlp.vocab)

With v2 and the same code, I get this error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: module 'spacy' has no attribute 'matcher'

Following the new 101 docs, I changed my code to:

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)

I think it's a bug, because I should be able to access Matcher with just spacy imported. What do you think?

alfonsomhc Jun 8, 2017

I also have a problem with the example https://alpha.spacy.io/docs/usage/lightning-tour#examples-word-vectors
The first assertion fails because it evaluates to false.
The second assert line is itself a syntax error:

File "<ipython-input-5-4d61871a144f>", line 9
    assert apple.has_vector, banana.has_vector, pasta.has_vector, hippo.has_vector
                                              ^
SyntaxError: invalid syntax

In any case, if I run this:
apple.has_vector, banana.has_vector, pasta.has_vector, hippo.has_vector
I get
(False, False, False, False)
This makes me think that the model has no word vectors, and therefore the similarities are wrong? I have installed the model described on this page, i.e. with this command:
python -m spacy download en_core_web_sm-2.0.0-alpha --direct
In the documentation you say:

The default English model installs 300-dimensional vectors trained on the Common Crawl corpus.
so I assume I should have word vectors?

alfonsomhc Jun 8, 2017

Another problem in https://alpha.spacy.io/docs/usage/lightning-tour#multi-threaded
When I run the example I get this error:

...
/home/user/anaconda3/lib/python3.6/site-packages/spacy/_ml.py in forward(docs, drop)
    248         feats = []
    249         for doc in docs:
--> 250             feats.append(doc.to_array(cols))
    251         return feats, None
    252     model = layerize(forward)

AttributeError: 'str' object has no attribute 'to_array'

v3t3a Jun 8, 2017

Another one – but in this case it's a difference between the docs and the actual behaviour (which also makes it a difference between v1 and v2, though it may be a deliberate choice rather than a bug).

This code

import spacy
from spacy.matcher import Matcher
nlp = spacy.load('en_core_web_sm')
matcher = spacy.matcher.Matcher(nlp.vocab)
pattern = [{'LOWER': 'hello'}, {'IS_PUNCT': True}, {'LOWER': 'world'}]
matcher = spacy.matcher.Matcher(nlp.vocab, pattern)

This gives me the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "spacy/matcher.pyx", line 184, in spacy.matcher.Matcher.__init__ (spacy/matcher.cpp:5534)
TypeError: __init__() takes exactly 1 positional argument (2 given)

In v1, and in the docs too, it is specified that Matcher.__init__ accepts two arguments: vocab and (optionally) patterns.

Relevant code (Matcher.__init__):

V1

def __init__(self, vocab, patterns={}):
        """
        Create the Matcher.

        Arguments:
            vocab (Vocab):
                The vocabulary object, which must be shared with the documents
                the matcher will operate on.
            patterns (dict): Patterns to add to the matcher.
        Returns:
            The newly constructed object.
        """
        self._patterns = {}
        self._entities = {}
        self._acceptors = {}
        self._callbacks = {}
        self.vocab = vocab
        self.mem = Pool()
        for entity_key, (etype, attrs, specs) in sorted(patterns.items()):
            self.add_entity(entity_key, attrs)
            for spec in specs:
                self.add_pattern(entity_key, spec, label=etype)

V2

def __init__(self, vocab):
        """Create the Matcher.

        vocab (Vocab): The vocabulary object, which must be shared with the
            documents the matcher will operate on.
        RETURNS (Matcher): The newly constructed object.
        """
        self._patterns = {}
        self._entities = {}
        self._acceptors = {}
        self._callbacks = {}
        self.vocab = vocab
        self.mem = Pool()

ines (Member) commented Jun 8, 2017

@alfonsomhc Thanks for the detailed analysis – and sorry, so many stupid typos! Just fixing the peach emoji example.

About displaCy: If you're using displaCy from within a notebook, you should call displacy.render() – after all, you're already running a web server (the notebook server). I'm not actually sure what the expected behaviour of displacy.serve() in Jupyter would be... your error message mostly looks like you're already running something else on displaCy's default port 5000.

Either way, this should probably be more clear in the docs. And maybe displacy.serve should at least print a warning in Jupyter mode that tells the user that starting the webserver is not actually necessary.

About the vectors: This is currently a bit messy, sorry – the vectors that are supposed to be attached to the vocab aren't wired up correctly yet. Right now, there are only tensors (Doc.tensor), which power the document similarity. Still working on implementing the vectors again – they'll definitely be available in the final release.

Thanks again for your time, this was super valuable!


ines added a commit that referenced this issue Jun 8, 2017

nikeqiang commented Jun 8, 2017

@ines The displacy visualizer is working great!

In case it's not already updated, a minor fix below to the sentence collection example in the docs, to account for the fact that render_ents(self, text, spans, title) takes character offsets (rather than token indices):

original:

match_ents = [{'start': span.start-sent.start, 'end': span.end-sent.start,
                   'label': 'MATCH'}]

revision:

match_ents = [{'start': span.start_char-sent.start_char, 'end': span.end_char-sent.start_char,
                   'label': 'MATCH'}]

screen shot 2017-06-08 at 5 44 45 pm

screen shot 2017-06-08 at 5 51 44 pm

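(Related tip, offered as an assumption about the v2 displaCy API rather than anything confirmed above: hand-built spans like match_ents can also be rendered directly in "manual" mode, without constructing a Doc at all, as long as start/end are character offsets. The example text and label here are made up.)

```python
from spacy import displacy

# Hand-built input in displaCy's 'ent' format: character offsets, not token indices
ex = {'text': 'But Google is starting from behind.',
      'ents': [{'start': 4, 'end': 10, 'label': 'ORG'}],
      'title': None}
# manual=True tells displaCy to render the dict as-is, skipping Doc processing
html = displacy.render(ex, style='ent', manual=True, jupyter=False)
```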

testphys commented Jun 9, 2017

It seems to me that the IS_OOV flag is always True so far.


benhachey commented Sep 25, 2017

Hi guys - Very excited about spaCy v2. Can you share a rough prediction for a stable release?


vishnunekkanti (Contributor) commented Sep 25, 2017

@ines, @honnibal
I don't see any support for the xx NER model after a9. Does this mean it's going to be discontinued?


christian-storm commented Sep 25, 2017

There was a bug in the Break oracle that was fixed pretty recently. So if you tested before 2.0.0a4 the results may be different.

Fair enough, I gave it another whirl and report the more current results.

Wikipedia's not a great test, because it's pretty reliably edited. That's not so true of a lot of web text.
I agree and disagree. Wikipedia is a great test for reasonably edited but difficult to segment text because, as Wikipedia itself puts it:

...sentence boundary identification is challenging because punctuation marks are often ambiguous. For example, a period may denote an abbreviation, decimal point, an ellipsis, or an email address – not the end of a sentence. About 47% of the periods in the Wall Street Journal corpus denote abbreviations. As well, question marks and exclamation marks may appear in embedded quotations, emoticons, computer code, and slang.
Wikipedia is rife with citation marks[4], (parentheticals), unbalanced quotes, quotes within quotes, etc. A lot of these situations are codified in the pragmatic segmenter's Golden Rules test set. These tend to be the cases where the various segmenters disagree.

By "edited" I'm not sure if you are referring to text that is grammatical, is comprised of complete sentences (aka not lists, twittereese, table of contents, etc.), properly sentence cased, and/or accurately punctuated. I'm sure spacy would be superior in certain circumstances and inferior in others. The point I was trying to make is that for a given corpus and task there will likely be one sbd algorithm that is the Right Tool For The Job. Sometimes even the notion of a "sentence" isn't well defined-
She turned to him and said, "This is great (thinking to herself just the opposite)." and then walked away. or The results were state of the art. [1] As depicted in ... Should a segmenter break up all spans the have a subject and a predicate or honor quotes and allow for sentences within sentences?
Is [1] part of the first sentence or a sentence onto itself? The notion of what constitutes sentence is very task dependent.

I'd like to lean towards a design where you load up a pipeline, and that pipeline is fully configured with few switches or toggles. To get different behaviors, you load a different pipeline.

Hah! I was going to mention this last time but didn't want to muddy the waters any further. I hear where you are coming from...it is clean that way. As I've been building these I've had the need to pass parameters though. For example, pass a param that toggles whether a token is split when sbd finds a boundary mid-token, e.g., "... from reference.[1] The..." Spacy treats reference.[1] as one token but sbd says the boundary is after the period. I suppose one can create two wrappers with one passing T and the other F. However, what happens when you have a number of parameters...things get ugly quick. The other solution that isn't allowed right now is to run the sbd before tokenization which would be nice so that the doc doesn't have to be retokenized when a token needs splitting (I wish there was a doc.split to complement the doc.merge). That being said, I think you might say I have to tweak the tokenizer so that the citation is cleaved off as its own token and that would be a fair argument. Tokenization and sbd are very intertwined...for instance spacy will break up "...test.Please" but not "...test.please".

We can make the sentence boundaries mutable after parsing. We just need some logic to cut the tree.
Cool! There would need to be merge logic as well, no?

we'd like to have these trained on different text types. This has always been the plan, but we found we really wanted to do annotations to train some of these models, so we decided to get Prodigy finished first.

Makes total sense!

What I don't think we want is to have two parses shipped as part of a single pipeline, and then decide between them at runtime based on the document state.

If you do really want this switching strategy, I think a pretty good way to implement it would be to write a component that wrapped N parsers, and delegated to one of them based on whatever logic. The switcher component would be added to the pipeline.

I hadn't thought of that approach and agree that is the better way to go. Without the ability to pass params for switching, I guess one would have to rely on storing these variables in user_data?

It will take a bit of thinking to get the oracle for the incorrectly segmented text correct. But once we have this we can train parsers that condition on pre-processed text, which should be helpful.

This would be very powerful if I'm understanding it correctly.

Thanks for passing along your article. I need to spend a little time with it to really grok it.


honnibal (Member) commented Sep 26, 2017

@benhachey
You can see the current TODOs here: https://github.com/explosion/spaCy/projects . The "Stable" board is the easiest to look at, because it has the per-class items.

The most difficult ticket is the Matcher operators one. We might drop that ticket from the release, because the same bug exists in 1.x, and I don't think it changes the API.

The sentence boundary stuff being discussed here is another tricky ticket.

@vishnunekkanti It's not being discontinued. We just didn't have enough automation around the model training, and the nightlies were often breaking model compatibility, so we fell behind on training all the models.

@christian-storm

The point I was trying to make is that for a given corpus and task there will likely be one sbd algorithm that is the Right Tool For The Job

Fair enough. You don't need to resell me on the need for the SBD component :). I'm on board with this.

For example, pass a param that toggles whether a token is split when sbd finds a boundary mid-token, e.g., "... from reference.[1] The..." Spacy treats reference.[1] as one token but sbd says the boundary is after the period. I suppose one can create two wrappers with one passing T and the other F. However, what happens when you have a number of parameters...things get ugly quick.

Why pass these on a per-document basis, though? If the parameters are per component there's no problem: you assemble a pipeline, save it to disk fully configured, and wrap it as a pip package. If you need to have code to assemble the pieces, you can put the logic in the package's load() function, along with any necessary parameters. Then you can do nlp = my_pipeline.load().

If we need to pass a lot of per-document arguments, I think we should have Language.pipe() and Language.__call__ take a pipe_kwargs argument, keyed by the component name. This would let us namespace the settings for each component. I think this is important because passing a flat ball of params to each component will surely get ugly. When adding another component, you'll have to care about how all the other parameters are named, and you'll find "all the good names are taken".

I hadn't thought of that approach and agree that is the better way to go. Without the ability to pass params for switching I guess one would have to rely storing these variables in user_data?

If you want one pipe to send a message to a future pipe, then yes you could set the flag in .user_data. If it were me, I'd sleep better if I could base the downstream logic on the actual annotations --- but that's up to you.

Thanks for passing along your article. I need to spend a little time with it to really grok it.

It's unlikely to be a good use of your time :). I just attached it for completeness.

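(To make the switcher-component idea from this exchange concrete: below is a pure-Python illustration with stand-in "parsers" and a toy routing rule. None of these names come from spaCy itself; a real version would wrap actual pipeline components and route on real document features.)

```python
class ParserSwitcher(object):
    """Pipeline component that wraps N parsers and delegates to exactly
    one of them per document, based on arbitrary logic."""

    def __init__(self, parsers, choose):
        self.parsers = parsers  # dict: name -> callable component
        self.choose = choose    # callable: doc -> parser name

    def __call__(self, doc):
        # Pick one wrapped parser for this document and run it
        return self.parsers[self.choose(doc)](doc)


# Stand-in "parsers" that just record which one ran
def newswire_parser(doc):
    doc['parsed_by'] = 'newswire'
    return doc

def web_parser(doc):
    doc['parsed_by'] = 'web'
    return doc

# Toy routing logic: long "documents" go to the web parser
switcher = ParserSwitcher(
    {'newswire': newswire_parser, 'web': web_parser},
    choose=lambda doc: 'web' if len(doc['text']) > 20 else 'newswire')

doc = switcher({'text': 'short text'})
```

The switcher is itself just another component, so it can be added to a pipeline like any other, and the routing decision stays encapsulated in one place instead of leaking per-document flags into every downstream component.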

christian-storm commented Sep 26, 2017

Apologies if I came across as overselling sbd ;) I guess I wanted to air some of my thinking on the matter.

Why pass these on a per-document basis, though? If the parameters are per component there's no problem: you assemble a pipeline, save it to disk fully configured, and wrap it as a pip package. If you need to have code to assemble the pieces, you can put the logic in the package's load() function, along with any necessary parameters. Then you can do nlp = my_pipeline.load().

I agree that would work. It does seem like a lot of effort if, in the hypothetical, the only difference is one boolean flag. Not very DRY. For scripts this might be acceptable but not for servers. An industrial perspective might be helpful here.

My reaction is based in part on what I encountered while CTO of Turnitin, a company I co-founded. We processed 10-100's of millions of documents a day, ranging from web pages, periodicals, books, and student papers, ... in 30+ languages. An architectural choice I often had to face was choosing between stables of task-specific NLP servers with an intelligent request router in front of them, or a pool of generalized servers that can handle any request. The former is easier to tune and optimize/provision for, at the expense of complexity and being brittle. The latter is a lot easier to scale, manage, and maintain high availability for in order to meet the almighty SLA (uptime, request turnaround time, etc.). Unless there are serious budget constraints, over-provisioning solves for the latter and more than makes up for itself in reduced sysadmin/devops costs and increased uptime.
As a golden rule, once a server is spun up it should never have to touch disk, e.g., load a new model.
As of right now, with spaCy, one already has to load a new model for each language. So that would already require 30+ unique instances. Now multiply that by the variations in sbd, tokenization, etc. The combinatorics are scary.

Not to oversell or anything :)

If we need to pass a lot of per-document arguments, I think we should have Language.pipe() and Language.call take a pipe_kwargs argument, keyed by the component name. This would let us namespace the settings for each component. I think this is important because passing a flat ball of params to each component will surely get ugly. When adding another component, you'll have to care about how all the other parameters are named, and you'll find "all the good names are taken".

A big fat yes to all that.

If you want one pipe to send a message to a future pipe, then yes you could set the flag in .user_data. If it were me, I'd sleep better if I could base the downstream logic on the actual annotations --- but that's up to you.

As usual, you are absolutely correct. There should only be one method of maintaining state in the pipeline and that should be the doc annotations.

It's unlikely to be a good use of your time :). I just attached it for completeness.

The more I know the more I (hopefully) can be useful. I'm really keen on spacy, respect all the work you guys have done, and hope to contribute to its success.


christian-storm commented Sep 27, 2017

I've been thinking about the pipeline/factories/etc. a lot and wanted to throw an outsiders idea at you to see what you think. I tried my best to draw from my experience of having used oodles of terrible to fantastic 3rd party libraries/services/etc. and architecting and using a team to build, configure, deploy, and maintain a large distributed NLP system.

In short, I was wondering if you considered making the pipeline explicitly pluggable? It seems the current system is half way there but, I feel, suffers from growing out of making v1 more extensible rather than starting from a fresh redesign. The good news is that I think most of the ingredients are already in place but need to be further codified around one pipeline factory pattern. I'm sure I'm missing some bits and pieces, overlooked some things, and have baked in a few inconsistencies but this is the general idea and motivation for it.

The high level idea is that each factory would explicitly define what it expects as input in order to perform its task, and defines the output it'll generate- the makings of a well defined API! Factories are coded to explicitly state the exact variables (and variable types for good measure) they require from the other process(es). A pipeline is then instantiated as an unordered list of factories that is compiled to make a runnable pipeline that takes text as its input. A sort of makefile, if you will. By design the 'computation graph' needn't be linear (branches/separate cores to speed up processing) but that's getting ahead of ourselves. First let me whet your whistle with what I think a great pluggable system looks like in python- Luigi. Just s/workflows/pipelines/ and s/tasks/factories/. A lot more grokable for a newbie like me to get up and running.

An object, e.g., doc, would still be the lingua franca of the pipeline. Although I'd open up the pipeline before tokenization to include pre-processing, i.e., text -> text, so that factories could precede tokenization. What a pipeline expects as its object type would be defined, i.e., bytes, text, doc object, ... To allow for factories that run pre-tokenization, I'd associate the vocab with a run of the pipeline and reference it from the doc. Similar to the current case but semantically different. The same is true when a factory has a parameterized model. It too would be associated with a run of the pipeline and referenced by the doc. If a new vocab and/or factory model is required for a new style, content area, language and dialect of writing, speech-to-text output, etc., it would be lazy-loaded on first request/cache miss and put on an LRU so memory can be managed. To appease the industrial titans of the world who shudder to think that any request would suffer from a lazy load (what about their SLA!), there would be an option to pre-warm the cache. Document-level annotations/data would be stored/accessed from the factory namespace, and a namespaced pipe_kwargs could be passed to a pipeline run to alter its default behavior, switch out models, vocabs, flip some switches, etc.

Each factory would have a variable namespace that is public and private to delineate the two. I'd further delineate the public into input and output to but that may be me being too anal. The _init_ would require the initial vocab and model path (if parameter driven model) but I could see those being defined a run time as well, requires would define all the factory.vars that are required, __call__/__run__ would run the factory on the input, and output would define the public output variables and return the object (doc, text, ...).

There are a number of ways to list and document the factory requirements. Luigi does it one way but seemingly annotations world work as well. Doc strings would naturally live there as well and would .

Upon defining a pipeline it would be compiled at run time to figure out the ordering of components and if all the requirements are met. If ill-defined it would Fail Fast instead of dying somewhere down the line. Well, it could still be ill-defined, e.g., using a previous factory's private variable that was removed with a new version instead of asking the factory's maintainer that they need that variable added to the public interface. However, at least there is a clear contract and way to define and refine the documentation of that contract. To ferret out the rule benders, I suppose one could get draconian in 'test' mode and clear or rename all but the public variables to ensure compliance.

It would make the code a lot easier to understand (fewer questions and more contributors), properly silo each component, and allow for shareable factories. Getting a little ahead of myself but one could imagine the output method routing requests to different machines via protocol buffers/json. Spacy would be able to get deployed in a microservices based architecture like this but run just as easily as a monolithic application on a developers laptop with a flip of the devel_mode switch.

I hope this was somewhat intelligible and helpful.

Thanks for listening.

I've been thinking about the pipeline/factories/etc. a lot and wanted to throw an outsider's idea at you to see what you think. I tried my best to draw from my experience of having used oodles of terrible-to-fantastic 3rd-party libraries/services/etc., and of architecting, building, configuring, deploying, and maintaining a large distributed NLP system with a team.

In short, I was wondering if you have considered making the pipeline explicitly pluggable? It seems the current system is halfway there but, I feel, suffers from having grown out of making v1 more extensible rather than starting from a fresh redesign. The good news is that I think most of the ingredients are already in place, but they need to be further codified around one pipeline factory pattern. I'm sure I'm missing some bits and pieces, have overlooked some things, and have baked in a few inconsistencies, but this is the general idea and the motivation for it.

The high-level idea is that each factory would explicitly define what it expects as input in order to perform its task, and declare the output it'll generate: the makings of a well-defined API! Factories are coded to explicitly state the exact variables (and variable types, for good measure) they require from the other process(es). A pipeline is then instantiated as an unordered list of factories that is compiled into a runnable pipeline that takes text as its input. A sort of Makefile, if you will. By design the 'computation graph' needn't be linear (branches/separate cores to speed up processing), but that's getting ahead of ourselves. First let me whet your appetite with what I think a great pluggable system looks like in Python: Luigi. Just s/workflows/pipelines/ and s/tasks/factories/. A lot more grokkable for a newbie like me to get up and running.
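To make that concrete, here's a toy sketch of the factory contract I have in mind. All names (`Factory`, `requires`, `provides`, the dict-based doc) are hypothetical illustrations, not spaCy's actual API:

```python
# Toy sketch: each factory declares the doc attributes it requires and
# provides, so a pipeline can check compatibility before anything runs.

class Factory:
    requires = ()  # attribute names this factory expects on the doc
    provides = ()  # attribute names this factory adds to the doc

    def __call__(self, doc):
        raise NotImplementedError


class WhitespaceTokenizer(Factory):
    requires = ("text",)
    provides = ("tokens",)

    def __call__(self, doc):
        doc["tokens"] = doc["text"].split()
        return doc


class DummyTagger(Factory):
    requires = ("tokens",)
    provides = ("tags",)

    def __call__(self, doc):
        # Stand-in for a real statistical tagger.
        doc["tags"] = ["X"] * len(doc["tokens"])
        return doc


def run_pipeline(factories, text):
    doc = {"text": text}
    for factory in factories:
        missing = set(factory.requires) - set(doc)
        if missing:
            raise ValueError("missing inputs: %s" % sorted(missing))
        doc = factory(doc)
    return doc
```

The point isn't the toy implementations; it's that the `requires`/`provides` declarations give the runner enough information to refuse a mis-ordered pipeline up front.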

An object, e.g., doc, would still be the lingua franca of the pipeline, although I'd open up the pipeline before tokenization to include pre-processing, i.e., text -> text, so that factories could precede tokenization. What a pipeline expects as its object type would be defined, i.e., bytes, text, doc object, ... To allow for factories that run pre-tokenization, I'd associate the vocab with a run of the pipeline and have the doc reference it. Similar to the current case, but semantically different. The same is true when a factory has a parameterized model: it too would be associated with a run of the pipeline and referenced by the doc. If a new vocab and/or factory model is required for a new style, content area, language or dialect of writing, speech-to-text output, etc., it would be lazily loaded on first request/cache miss and put on an LRU cache for later use, so memory can be managed. To appease the industrial titans of the world who shudder to think that any request would suffer a lazy load (what about their SLA!), there would be an option to pre-warm the cache. Document-level annotations/data would be stored in and accessed from the factory namespace, and a namespaced pipe_kwargs could be passed to a pipeline run to alter its default behavior, switch out models or vocabs, flip some switches, etc.
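The lazy-load-plus-LRU idea needs nothing exotic; Python's standard library already covers it. A minimal sketch (the `load_vocab` loader and its return value are made-up stand-ins):

```python
from functools import lru_cache

# Expensive resources (vocabs, factory models) are loaded lazily on first
# request and kept on a bounded LRU cache so memory stays managed.

@lru_cache(maxsize=4)
def load_vocab(name):
    # Stand-in for an expensive load from disk.
    return {"name": name, "entries": {}}

def prewarm(names):
    """Optionally pre-warm the cache for latency-sensitive deployments."""
    for name in names:
        load_vocab(name)
```

A second request for the same name is a cache hit and returns the already-loaded object; `prewarm` is the "appease the SLA" option.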

Each factory would have a variable namespace, split into public and private to delineate the two. I'd further delineate the public namespace into input and output too, but that may be me being too anal. __init__ would require the initial vocab and model path (if it's a parameter-driven model), though I could see those being defined at run time as well; requires would define all the factory.vars that are required; __call__/__run__ would run the factory on the input; and output would define the public output variables and return the object (doc, text, ...).

There are a number of ways to list and document the factory requirements. Luigi does it one way, but seemingly annotations would work as well. Docstrings would naturally live there too.

Upon defining a pipeline, it would be compiled at run time to figure out the ordering of components and whether all the requirements are met. If ill-defined, it would fail fast instead of dying somewhere down the line. Well, it could still be ill-defined, e.g., using a previous factory's private variable that was removed in a new version, instead of asking the factory's maintainer to add that variable to the public interface. However, at least there is a clear contract, and a way to define and refine the documentation of that contract. To ferret out the rule-benders, I suppose one could get draconian in 'test' mode and clear or rename all but the public variables to ensure compliance.
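That compile-and-fail-fast step could be sketched as a small dependency resolution over the declared requirements; again, every name here is hypothetical:

```python
# Hypothetical "compile" step: order an unordered list of factories by
# their declared requires/provides, and fail fast if anything is unmet.

class Step:
    def __init__(self, name, requires, provides):
        self.name = name
        self.requires = set(requires)
        self.provides = set(provides)

def compile_pipeline(steps, given=("text",)):
    available = set(given)
    remaining = list(steps)
    ordered = []
    while remaining:
        ready = [s for s in remaining if s.requires <= available]
        if not ready:
            # No step can run: report exactly what's missing, up front.
            missing = set.union(*(s.requires for s in remaining)) - available
            raise ValueError("unsatisfied requirements: %s" % sorted(missing))
        for step in ready:
            ordered.append(step)
            available |= step.provides
            remaining.remove(step)
    return [s.name for s in ordered]
```

Handing this `[tagger, tokenizer]` in the wrong order would still yield `["tokenizer", "tagger"]`, and a pipeline missing its tokenizer would raise at compile time rather than dying mid-run.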

It would make the code a lot easier to understand (fewer questions and more contributors), properly silo each component, and allow for shareable factories. Getting a little ahead of myself, but one could imagine the output method routing requests to different machines via protocol buffers/JSON. spaCy could then be deployed in a microservices-based architecture like this, yet run just as easily as a monolithic application on a developer's laptop with the flip of a devel_mode switch.

I hope this was somewhat intelligible and helpful.

Thanks for listening.

@honnibal

honnibal Sep 28, 2017

Member

I think this is interesting. A quick question.

What would you do about the "soft dependency" you get from adapting a machine learning model to its input? If I train the dependency parser on tokenized and tagged text, I don't just want "tokenized and tagged text". I ideally want exactly the output of exactly that tokenizer and tagger. I might settle for something similar, but in other cases I might not.

@kootenpv

kootenpv Sep 28, 2017

I believe it would warrant a new issue.

@christian-storm

christian-storm Sep 28, 2017

I probably haven't had enough coffee yet, but I'm not 100% sure I get the issue you're raising. When you say "adapting a machine learning model to its input", do you mean further tuning an established model to adapt it to a new data set/domain, thereby creating a new model? When it comes to not getting exactly the output you were expecting, I'm failing to conjure up a situation where that would happen.

@honnibal

honnibal Sep 28, 2017

Member

Well, let's take the case of the dependency parser in v1. If we were to list out the input state it wants naively, it'd be something like:

  • An array of LexemeC pointers
  • An array of POS tag ID integers
  • A length

Now, we can obviously ask for the POS tags to reference a particular scheme, lexemes from the right vocabulary, etc. But actually our needs are much more specific. If I go and train the parser with tags produced by process A, and then send you the weights, and you go and produce tags using process B, you might get unexpectedly bad parsing results, even though you gave it inputs that met the specifications the component declared. It could even be that your Process B tagger was much more accurate than the Process A tagger the parser was trained with. So long as Process B is just different, we could get train/test skew that makes the model perform poorly.

Another example, maybe simpler: Let's say you've got a movie review analysis model, that tries to assign star ratings to individual components, e.g. acting, plot, etc. You use NER labels as features, and the model ends up learning that reviews which mention people a lot are more likely talking about the acting. You see your NER model is making lots of mistakes, so you replace it with a better one --- but the better NER model produces really terrible analysis results. Why? Well, the better NER model might be detecting twice as many person entities in the reviews. You can't just plug the new entity recogniser into the pipeline without retraining everything downstream of it --- if you do, your results will be terrible.

Another example: If you want two tokenizers, one which has "don't", "isn't", etc as one token and another which has it as two tokens, you probably want two copies of the pipeline. If the downstream models haven't been trained with "don't" as one word, well, that word will be completely unseen --- so the model will probably guess it's a proper noun. All the next steps will go badly from there.

The problem is more acute with neural networks, if you're composing models that should communicate by tensor. If you train a tagger with the GloVe common crawl vectors, and then you swap out those vectors for some other set of vectors, your results will probably be around the random chance baseline.

So if you chain together pretrained statistical models, there's not really any way to declare the "required input" of some step in a way that gives you a pluggable architecture. The "required input" is "The result of applying exactly these components, and no others".

That's also why we've been trying to get this update() workflow across, and trying to explain the catastrophic forgetting problem etc. The pipeline doesn't have to be entirely static, but you might have to make updates after modifying the pipeline. For instance, it could be okay to change the tokenization of "don't" --- but only if you fine-tune the pipeline after doing so.

@christian-storm

christian-storm Sep 29, 2017

I appreciate the examples; they surfaced some very valid issues. The issues you raise fall into a few camps: preventing folks from shooting themselves in the foot, defining expected behavior/model selection, and explicitly capturing static dependencies. Let's see if I can motivate some solutions.

What I defined before was an API between components, much like how a bunch of microservices would all agree on which protobuf definitions and versions they'll use to communicate with each other. It is a communication contract, full stop. You are absolutely correct: it does nothing to define the expected behavior of the computational graph in turning inputs into outputs, nor does it define what dependencies are required by each component. We have the technology. We can rebuild him.

What's the difference between pure software packages and ML-based packages?
In programming it is incumbent on the developer to pick the appropriate package for a given task by looking at the documented behavior of that component. Furthermore, unit and integration tests are written to ensure the expected behavior, or a representative sample of it, remains intact as package versions are bumped and underlying code is modified. If you are publishing a package like spaCy, you ensure proper behavior for the user by explicitly listing each required package and its version number(s) in requirements.txt.

Riffing off your example, if a developer is selecting an appropriate tokenizer from, say, the available NLTK tokenizers, they could look at the docs to see whether it splits contractions or not. Even if that was unspecified, they could figure out which one is best through some testing, or by looking at the source if it is open source. If a developer chooses the wrong tokenizer for their task and doesn't have a test suite to alert them to this fact, I would have to say the bug is on them, no? What if the tokenizer is closed source? Isn't this essentially the same black box as an ML-based tokenizer? Well, not exactly. As an example, when I used SAP's NLP software, the docs detailed the rule set used for tokenization. If the tokenization is learned, the rules are inferred from the data and can't be written down. With the "don't" tokenization example, how would one know that "don't" is going to be properly handled without explicitly testing it?

Expected behavior
So how does one fully specify the expected behavior of a machine learning component? As you well know I don't think anyone has a good answer for this. In academia one details the algorithm, releases the code, specifies the hyper-parameters, the data set used to train and validate, and the summary metric scores found with the test set. This information allows one to intuit how well it may do on another data set but there is no substitute for trying it out.

Imagine if one had access to an inventory of trained models. To select the best model for a given data set/task, one would compare the summary statistics of each model run against the test set. Likely one might even have a human inspect individual predictions to ensure the right model is being selected (for example). If the model seems like it would benefit from domain adaptation, further training in a manner that avoids catastrophic forgetting might prove effective.

As alluded to by your examples, what if the developer doesn't have a labeled test set to aid in model selection? My knee-jerk reaction is that they are ill-equipped to stray from the default setup. They should use Prodigy to create a test set first. To me it is equivalent to someone picking NLTK's Moses tokenizer over its casual tokenizer for Twitter data without running tests to see which is better. This may be a bit far afield, but a solution could be to ship a model with a set of sufficient statistics that describes the distribution of the training corpus, a program to generate the statistics for a new corpus, and a way of comparing the statistics for fit/transferability (KL divergence and outliers?). For tokenization and tagging, a first approximation would be the distributions of tokens and POS tags. So if the training set didn't have "didn't" but the user's corpus does, it would alert them to that fact, and they could build a test to make sure it behaves as expected, and possibly gain a motivation to further train the model. It might prevent some from shooting themselves in the foot by aiding them in the model selection process.
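A first approximation of that fit/transferability check could be as simple as comparing smoothed token distributions. This is a toy sketch of the idea, not a shipped spaCy feature:

```python
import math
from collections import Counter

# Toy "transferability check": compare a user corpus against the training
# corpus via KL divergence over add-one-smoothed token distributions, and
# flag tokens the model has never seen.

def smoothed_dist(tokens, vocab):
    counts = Counter(tokens)
    total = len(tokens) + len(vocab)  # add-one smoothing over shared vocab
    return {w: (counts[w] + 1) / total for w in vocab}

def compare_corpora(train_tokens, user_tokens):
    vocab = set(train_tokens) | set(user_tokens)
    p = smoothed_dist(user_tokens, vocab)   # user corpus
    q = smoothed_dist(train_tokens, vocab)  # training corpus
    kl = sum(p[w] * math.log(p[w] / q[w]) for w in vocab)
    unseen = sorted(set(user_tokens) - set(train_tokens))
    return kl, unseen
```

A high KL or a long `unseen` list (e.g. "didn't" appearing only in the user's corpus) would be the signal to test more carefully or fine-tune.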

Versioning and dependencies
In devops one has to specify the libraries, packages, configuration files, OS services, etc. required to turn a bare-metal box into the working environment needed to run a certain piece of software. This is notoriously hard to do, as evidenced by the sheer number of configuration tools that exist (Puppet, CFEngine, Chef, etc.) and the next generation of tools (Docker, VMs, ...) that give up on trying to turn the full configuration of an environment into source code. I've been in dependency hell and it sucks.

So how do we ensure that each ML model is reproducible? Much in the same way that versions of spaCy depend on certain versions of the models, as defined in compatibility.json. A model would specify which GloVe vectors were used, the corpus it was trained on (or a pointer to it, e.g., OntoNotes 5.0), the hyper-parameter settings, etc. Anything and everything needed to recreate that particular version of the model from scratch; something along the lines of dataversioncontrol. To cordon off model-specific data in spaCy, the data would be stored in a private namespace for that model/factory. Better yet, to allow for data shared among models, like GloVe vectors, vocab, etc., the shared data would live in a private global namespace, with the model instances holding pointers to it from their private namespaces, much like how doc.vocab points to a global vocab instance. The difference being that everything would be versioned (a hash over the data, with a human-readable version number for good measure).
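The "hash over the data" bit is cheap to sketch: hash a canonical serialization of a model manifest so that any change to the vectors, corpus pointer, or hyper-parameters yields a new version identifier. The manifest fields below are illustrative, not an actual spaCy format:

```python
import hashlib
import json

# Sketch of content-addressed model versioning: hash everything needed to
# recreate the model, so any change produces a new version identifier.

def resource_version(manifest):
    # sort_keys makes the serialization canonical, hence deterministic.
    blob = json.dumps(manifest, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:12]

manifest = {
    "vectors": "glove.840B.300d",      # pointer to the exact vectors used
    "corpus": "ontonotes-5.0",         # pointer, not the data itself
    "hyperparams": {"dropout": 0.2, "iterations": 30},
}
```

Two pipelines could then be checked for compatibility by comparing version strings instead of trusting that "the inputs look similar".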

Now let me walk through each of your examples to see how this further refined concept might address each situation.

Now, we can obviously ask for the POS tags to reference a particular scheme.

Yes, exactly. It would point to a versioned file.

But actually our needs are much more specific. If I go and train the parser with tags produced by process A, and then send you the weights, and you go and produce tags using process B, you might get unexpectedly bad results. It doesn't have to be a simple story of "process A was more accurate than process B".

I'm not entirely sure what you mean by process A and B. Corpus A and corpus B? You shouldn't ever pass weights around. You'd load model A's weights, tags, etc. and never change them. If you did, it would be a new model.

Another example: If you want two tokenizers, one which has "don't", "isn't", etc as one token and another which has it as two tokens, you probably want two copies of the pipeline.

Maybe I'm playing antics with semantics, but I wouldn't say "copies" here. One may have two separate tokenizers, or possibly one with a switch if it is rule-based. With learned tokenizers, the run of the pipeline would specify which tokenizer to use. The pipeline is defined, compiled, and run. If a different pipeline is required, e.g., with a different tokenizer, the pipeline is redefined and recompiled before running. The components would be cached so it would be fast and, to your point, perhaps there is a cache of predefined pipelines as well if compilation proves expensive.

If the models haven't been trained with "don't" as one word, well, that word will be completely unseen --- so the model will probably guess it's a proper noun. All the next steps will go badly from there.

Agreed. That is why model selection is so important and needs to be surfaced as a step in the development process. The situation you describe is applicable to current spaCy users. Without access to OntoNotes, how does one know if it is close enough to their domain to be effective at, say, parsing? Even if one did have access to OntoNotes, how does one judge how transferable the models are? One could compare vocabulary, tag, and dependency overlap and their frequencies. But nothing trumps a run against the test set, right?

The problem is more acute with neural networks, if you're composing models that should communicate by tensor. If you train a tagger with the GloVe common crawl vectors, and then you swap out those vectors for some other set of vectors, your results will probably be around the random chance baseline.

Yes, that would be disastrous and shouldn't be allowed or, within the limitations of Python, should at least be discouraged. The vectors are part of the model definition and, when loaded, would reside in a private namespace. Of course, nothing can be made truly private in Python, so someone could blow through a few stop signs and still shoot themselves in the foot.

So if you chain together pretrained statistical models, there's not really any way to declare the "required input" of some step in a way that gives you a pluggable architecture. The "expected input" is "What I saw during training, as exactly as possible".

I believe what you are saying is that there is no way to ensure each model is trained on the same data set, no? In other words, to get the reported results, the "expected input" needs to be distributionally similar to the training data. If this is what you mean, one could have an optional check in the compilation step that makes sure the data sources are the same across the pipeline. This would prevent some noob, looking only at reported accuracies when creating a pipeline, from chaining together a Twitter-based model with a model trained on arXiv.

That's also why we've been trying to get this update() workflow across, and trying to explain the catastrophic forgetting problem etc. The pipeline doesn't have to be entirely static, but you might have to make updates after modifying the pipeline. For instance, it could be okay to change the tokenization of "don't" --- but only if you fine-tune the pipeline after doing so.

Agreed. However, if you decide to domain-adapt the model, i.e., online learning with Prodigy, this should produce a new version of the model, with a model definition that points to the new parameters and a new data-source listing that includes both the original data source and the new one.

Despite the length of this response, what I'm talking about really isn't that complicated in concept, and from what I can tell it's not too far afield from where spaCy 2.0 is now. I'd be willing to chip in if that would be helpful. It'll be much more difficult once the ship leaves port.

I'm curious to hear what you think.

I appreciate the examples, they surfaced some very valid issues. The issues you raise fall into a few camps: Preventing folks from shooting themselves in the foot, defining expected behavior/model selection, and explicitly capturing static dependencies. Let's see if can motivate some solutions.

What I defined before was an api between components. Much like how a bunch of microservices would all agree on what protobuf definitions and versions they'll use to communicate with each other. It is a communication contract full stop. You are absolutely correct, it does nothing to define the expected behavior of the computational graph in turning inputs to outputs. Nor does it define what dependencies are required by each component. We have the technology. We can rebuild him.

What's the diff between pure software packages and ML based packages?
In programming it is incumbent on the developer to pick the appropriate package for a given task by looking at the documented behavior of that component. Furthermore, unit and integration tests are written to ensure the expected behavior, or a representative sample of it, remains intact as package versions are bumped and underlying code is modified. If you are publishing a package like spacy you ensure proper behavior for the user by explicitly listing each required package and version number(s) in requirements.txt.

Riffing off your example, if a developer is selecting an appropriate tokenizer from, say, the available NTLK tokenizers they could look at the docs to see if it splits contractions or not. Even if it was unspecified they could figure out which one is best through some testing or by looking at the source if it is open source. If a developer chooses the wrong tokenizer for their task and doesn't have a test suite to alert them to this fact, I would have to say the bug is on them, no? What if the tokenizer is closed source? Isn't this essentially the same black box as a ML based tokenizer? Well not exactly. As an example, when I used SAP's NLP software the docs detailed the rule set used for tokenization. If the tokenization is learned, the rules are inferred from the data and can't be written down. With the "don't" tokenization example, how would one know that "don't" is going to be properly handled without explicitly testing it?

Expected behavior
So how does one fully specify the expected behavior of a machine learning component? As you well know I don't think anyone has a good answer for this. In academia one details the algorithm, releases the code, specifies the hyper-parameters, the data set used to train and validate, and the summary metric scores found with the test set. This information allows one to intuit how well it may do on another data set but there is no substitute for trying it out.

Imagine if one had access to an inventory of trained models. To select the best model for a given data set/task, one would compare the summary statistics of each model run against the test set. Likely one might even have a human inspect individual predictions to see ensure the right model is being selected (for example). If the model seems like it would benefit from domain adaptation, further training in a manner that avoids catastrophic forgetting might prove effective.

As alluded to by your examples, what if the developer doesn't have a labeled test set to aid in model selection? My knee jerk reaction is that they are ill equipped to stray from the default setup. They should use Prodigy to create a test set first. To me it is equivalent to someone picking NLTK's moses over casual tokenizer for twitter data without running tests to see which is better. This may be a bit far afield but a solution could be to ship a model with a set of sufficient statistics that describes the distribution of the training corpus, a program to generate the statistics for a new corpus, and a way of comparing the statistics for fit/transferability (KL divergence and outliers?). For tokenization and tagging, a first approximation would be the distribution of token and POS. So if the training set didn't have "didn't" but the user's corpus does, it would alert them to that fact and they could build a test to make sure it behaves as expected and possibly give them a motivation to further train the model. It might prevent some from shooting themselves in the foot by aiding them in the model selection process.

Versioning and dependencies
In devops one has to specify the required libraries, packages, configuration files, os services, etc. required to turn a bare metal box into the working environment needed to run a certain piece of software. This is notoriously hard to do as evidenced by the sheer number of configuration tools that exist (Puppet, cfengine, chef, etc.) and next generation tools (Docker VE, VM, ...) that give up on trying to turn the full configuration of an environment into source code. I've been in dependency hell and it sucks.

So how do we ensure that each ML model is reproducible? Much in the same way versions of spacy depend on certain versions of the model as defined in compatibility.json. A model would specify what Glove vectors were used, the corpus it was trained on (or a pointer to it, e.g., ontonotes 5.0), the hyper-parameter settings, etc. Anything and everything needed to recreate that particular version of the model from scratch. Something along the lines of dataversioncontrol. To cordon off model specific data in spacy, the data would be stored in a private namespace for that model/factory. Better yet, to allow for shared data amongst models, like Glove vectors, vocab, etc., the data would be stored in a private global namespace with the model instances having pointers from its private namespace. Much like how a doc.vocab points to a gloabl vocab instances. The difference being that everything would be versioned (hash over data with a human readable version number for good measure).

Now let me walk through each of your examples to see how this further refined concept might address each situation.

Now, we can obviously ask for the POS tags to reference a particular scheme.

Yes, exactly. It would point to a versioned file.

But actually our needs are much more specific. If I go and train the parser with tags produced by process A, and then send you the weights, and you go and produce tags using process B, you might get unexpectedly bad results. It doesn't have to be a simple story of "process A was more accurate than process B".

I'm not entirely sure what you mean by process A and B. Corpus A and corpus B? You shouldn't ever pass weights around. You'd load model A's weights, tags, etc. and never change them. If you did, it would be a new model.

Another example: If you want two tokenizers, one which has "don't", "isn't", etc as one token and another which has it as two tokens, you probably want two copies of the pipeline.

Maybe I'm playing antics with semantics, but I wouldn't say "copies" here. One may have two separate tokenizers, or possibly one with a switch if it is rule-based. With learned tokenizers, the run of the pipeline would specify which tokenizer to use. The pipeline is defined, compiled, and run. If a different pipeline is required, e.g., with a different tokenizer, the pipeline is redefined and recompiled before running. The components would be cached, so it would be fast, and, to your point, perhaps there is a cache of predefined pipelines as well if compilation proves expensive.

If the models haven't been trained with "don't" as one word, well, that word will be completely unseen --- so the model will probably guess it's a proper noun. All the next steps will go badly from there.

Agreed. That is why model selection is so important and needs to be surfaced as a step in the development process. The situation you describe is applicable to current spacy users: without access to OntoNotes, how does one know if it is close enough to their domain to be effective at, say, parsing? Even if one did have access to OntoNotes, how does one judge how transferable the models are? One could compare vocabulary, tag, and dependency overlap and their frequencies. But nothing trumps a run against the test set, right?

The problem is more acute with neural networks, if you're composing models that should communicate by tensor. If you train a tagger with the GloVe common crawl vectors, and then you swap out those vectors for some other set of vectors, your results will probably be around the random chance baseline.

Yes, that would be disastrous and shouldn't be allowed or, with the limitations of python, discouraged. The vectors are part of the model definition and when loaded would reside in a private namespace. Of course, nothing can be made private in python so someone could blow through a few stop signs and still shoot themselves in the foot.

So if you chain together pretrained statistical models, there's not really any way to declare the "required input" of some step in a way that gives you a pluggable architecture. The "expected input" is "What I saw during training, as exactly as possible".

I believe what you are saying is that there is no way to ensure each model is trained on the same data set, no? In other words, to get the reported results, the "expected input" needs to be distributionally similar to the training data. If this is what you mean, one could have an optional check in the compilation step that verifies the data sources are the same across the pipeline. This would prevent some noob, looking only at reported accuracies when creating a pipeline, from chaining together a Twitter-based model with a model trained on arXiv.
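That compile-time check might look something like this (a hypothetical sketch; spaCy has no such API, and the field names are made up):

```python
def check_pipeline_sources(components):
    # Refuse to assemble a pipeline whose components were trained
    # on different corpora
    sources = {c['name']: c['trained_on'] for c in components}
    if len(set(sources.values())) > 1:
        raise ValueError('mismatched training sources: %r' % sources)

# same training source: passes silently
check_pipeline_sources([
    {'name': 'tagger', 'trained_on': 'ontonotes-5.0'},
    {'name': 'parser', 'trained_on': 'ontonotes-5.0'},
])

# a Twitter model chained with an arXiv model: rejected
try:
    check_pipeline_sources([
        {'name': 'tagger', 'trained_on': 'twitter'},
        {'name': 'parser', 'trained_on': 'arxiv'},
    ])
except ValueError as e:
    print('rejected:', e)
```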

That's also why we've been trying to get this update() workflow across, and trying to explain the catastrophic forgetting problem etc. The pipeline doesn't have to be entirely static, but you might have to make updates after modifying the pipeline. For instance, it could be okay to change the tokenization of "don't" --- but only if you fine-tune the pipeline after doing so.

Agreed. However, if you decide to domain-adapt the model, i.e., online learning with Prodigy, this should produce a new version of the model, with a model definition that points to the new parameters and a new data source listing that includes both the original data source and the new one.

Despite the length of this response, what I'm talking about really isn't that complicated in concept and from what I can tell not too far afield from where spacy 2.0 is now. I'd be willing to chip in if that is helpful. It'll be much more difficult once the ship leaves port.

I'm curious to hear what you think.

@honnibal (Member) commented Sep 29, 2017

What's the diff between pure software packages and ML based packages?

The difference I'm pointing to is there's no API abstraction possible with ML. We're in a continuous space of better/closer, instead of a discrete space of match/no match.

If you imagine each component as versioned, there's no room for a range of versions --- you have to specify an exact version to get the right results, every time. Once the weights are trained nothing is interchangeable and ideally nothing should be reconfigured.

This also means you can't really usefully cache and compose the pipeline components. There's no point in registering a component like "spaCy parsing model v1.0.0a4" on its own. The minimum versionable unit is something like "spaCy pipeline v1.0.0a4", because to get the dependency parse, you should run exactly the fully-configured tokenizer and tagger used during training, with no change of configuration whatsoever.

We can version and release a component that provides a single function nlp(text) -> doc with tags and deps. We can also version and release a component that provides a function train(examples, config) -> nlp. But we can't version and release a component that provides functions like parse(doc_with_tags) -> doc_with_deps.

I've been trying to keep the pipelines shorter in v2 to mitigate this issue, so things are more composable. The v2 parser doesn't use POS tag features anymore, and the next release will also do away with the multi-task CNN, instead giving each component its own CNN pre-process. This might all change though. If results become better with longer pipelines, maybe we want longer pipelines.


@christian-storm commented Sep 30, 2017

Keeping the conversation going... I really hope this isn't coming across as adversarial or grating in any way. I actually think we are getting somewhere and agree on most things.

The difference I'm pointing to is there's no API abstraction possible with ML. We're in a continuous space of better/closer, instead of a discrete space of match/no match.

Well put, and I agree in principle, with the caveat that code is rarely so boolean. Take a complex signal processing algorithm where the function F is either learned or programmed with an analytic/closed-form solution. How is the test-and-verify process of either really any different? Sure, each component of the latter can be tested individually. That certainly makes it easier to debug when things go south. However, a test set of Y = F(X) is as important in either case, right?

If you imagine each component as versioned, there's no room for a range of versions --- you have to specify an exact version to get the right results, every time. Once the weights are trained nothing is interchangeable and ideally nothing should be reconfigured.

Once again, we're on the same page, although I think there is another way to look at it. The key is defining exactly what the "right results" are. In building an ML model one uses the validation set to make sure the model is learning without overfitting the training set. Then the test set is used as the ultimate check that the model is transferable. If one were to pull two models off the shelf and plug them together as I've been suggesting, you'd judge the effectiveness of each using a task-specific test set, and the two together using a test set that encompasses the whole pipeline, no? This happens all the time in ML, e.g., a speech-to-text system that uses spectrograms, KenLM, and DL. Even though the first two aren't learned (though they could be), there are a bunch of hyper-parameters that need to be "learned."
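For instance (toy string functions standing in for real components, nothing spaCy-specific): each frozen off-the-shelf stage gets its own test set, and the composition gets an end-to-end one.

```python
def accuracy(predict, examples):
    # fraction of (input, expected) pairs the function gets right
    return sum(predict(x) == y for x, y in examples) / len(examples)

# two frozen off-the-shelf "models" (really just string functions)
lowercase = str.lower
strip_punct = lambda s: s.rstrip('.!?')

# per-component test sets
assert accuracy(lowercase, [('Hello!', 'hello!')]) == 1.0
assert accuracy(strip_punct, [('bye.', 'bye')]) == 1.0

# the composed pipeline is judged on its own end-to-end test set
pipeline = lambda s: strip_punct(lowercase(s))
end_to_end = [('Hello!', 'hello'), ('Bye.', 'bye')]
print(accuracy(pipeline, end_to_end))  # 1.0
```

The point being that passing the per-component tests does not by itself guarantee the composition passes the end-to-end test; both harnesses are needed.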

This also means you can't really usefully cache and compose the pipeline components. There's no point in registering a component like "spaCy parsing model v1.0.0a4" on its own. The minimum versionable unit is something like "spaCy pipeline v1.0.0a4", because to get the dependency parse, you should run exactly the fully-configured tokenizer and tagger used during training, with no change of configuration whatsoever.

I would agree that training end to end and freezing the models in the pipeline afterwards leads to the most reproducible results. If this is the intended design, one will only ever be able to disable or append purely additive components, e.g., sentiment.

Just to play a little devil's advocate here: spacy promotes the ability to swap out the tokenizer that feeds the pipeline without a warning or any mention that one should retrain. Isn't this contrary to your end-to-end abstraction? What if someone blindly decides to implement the whitespace tokenizer described in the docs? To use your example, spacy might start labeling "don't" as a proper noun, no? The same could be said about adding special cases for tokenization. You are performing operations that weren't performed on the training data!

If the dependencies remain constant across the pipeline, I still think plugging trained models into the pipeline makes sense if one knows what they are doing: an appropriate test harness at each step of the pipeline. On the other hand, I agree it is easy to go off the rails when components are tightly coupled, e.g., setting sent_start and making the trained parser obey it even though it wasn't trained with those sentence boundaries. However, there are many valid cases where it makes sense, e.g., training an SBD, freezing it, and then training the remaining pipeline.

Another idea
With the pipeline versioning idea in mind, why not at least allow for pluggable un-trained models that, once trained, get frozen into a versioned pipeline? Ultimately, I'm looking for a tool that plays well with experimentation, e.g., a new parser design from the literature, and devops. The difference is spacy being part of the NLP pipeline versus running the entire pipeline.

We can version and release a component that provides a single function nlp(text) -> doc with tags and deps. We can also version and release a component that provides a function train(examples, config) -> nlp. But we can't version and release a component that provides functions like parse(doc_with_tags) -> doc_with_deps.

Okay, I got it :), too much configurability can lead to bad things. But, really, why can't one version and release a component like parse(doc_with_tags) -> doc_with_deps? How is it any different from training each stage of the pipeline, freezing it, and then training the next stage using the same dependencies: data, tag sets, GloVe vectors, etc.? If trained end-to-end with errors back-propagated from component to component, then yes, I would agree: these tightly coupled components should be thought of as one unit and domain-adapted as one unit.

Right now, if I wanted to change anything beyond the tokenizer in the pipeline it is non-trivial. However, I'm starting to realize that I may be barking up the wrong tree here. Looking at Prodigy and the usage docs for spacy, only downstream classification models (sentiment, intent, ...) are ever referenced. What if I want to add a semantic role labeler that requires a constituency parse? Or better yet, what if someone publishes a parser that is much more accurate and I really could use that extra accuracy? I guess I'm back to building my own NLP pipeline.

I've been trying to keep the pipelines shorter in v2 to mitigate this issue, so things are more composable.

Yes! That is the word- composable. "A highly composable system provides components that can be selected and assembled in various combinations to satisfy specific user requirements."

That's it! I would love a world where I can truly compose a NLP pipeline. Analogous to how Keras allows you easily build, train, and use a NN; just one level of abstraction higher.

I don't see how "shorter" pipelines are more composable though. Forgive me if I'm wrong but I don't really see any composability in spacy at the moment. Maybe configurability? Though, one gets the impression by reading the docs, "you can mix and match pipeline components," that the vision is to be able to compose pipelines that deliver different behaviors (specific user requirements).

The v2 parser doesn't use POS tag features anymore, and the next release will also do away with the multi-task CNN, instead giving each component its own CNN pre-process. This might all change though. If results become better with longer pipelines, maybe we want longer pipelines.

I wish I knew the code better to react.

I'm already in a fairly precarious position needing different tokenizers and sentence boundary detectors and there isn't a clear way to add these components. With your previously proposed solution of breaking and merging the dependency tree to allow for new sentence boundaries, what would that do to accuracy? Isn't this the exact tinkering of a trained model you are trying to avoid?

Once again, thanks for engaging Matthew.


@honnibal (Member) commented Sep 30, 2017

Keeping the conversation going...I really hope this isn't coming across as adversarial or grating in anyway. I actually think we are getting somewhere and agree on most things.

No, not at all -- I hope I'm not coming across as intransigent :)

Just to play a little devil's advocate here: spacy promotes the ability to swap out the tokenizer that feeds the pipeline without a warning or any mention that one should retrain. Isn't this contrary to your end-to-end abstraction? What if someone blindly decides to implement the whitespace tokenizer described in the docs? To use your example, spacy might start labeling "don't" as a proper noun, no? The same could be said about adding special cases for tokenization. You are performing operations that weren't performed on the training data!

I do think this is a potential problem, and maybe we should be clearer about it in the docs. The trade-off is sort of like having internals prefixed with an underscore in Python: it can be useful to play with these things, but you don't really get safety guarantees.

Right now, if I wanted to change anything beyond the tokenizer in the pipeline it is non-trivial. However, I'm starting to realize that I may be barking up the wrong tree here. Looking at Prodigy and the usage docs for spacy, only downstream classification models (sentiment, intent, ...) are ever referenced. What if I want to add a semantic role labeler that requires a constituency parse?

We don't really have a data structure for constituency parses at the moment, or for semantic roles. You could add the data into user_data. More generally though:

Or better yet what if someone publishes a parser that is much more accurate and I really could use that extra accuracy? I guess I'm back to building my own NLP pipeline.

Well, not really? You could subclass NeuralDependencyParser and overwrite the predict or set_annotations methods. Or you could do neither, and add some function like this to the pipeline:

from spacy.attrs import HEAD, DEP

def my_dependency_parser(doc):
    parse = doc.to_array([HEAD, DEP])
    # Set every word to depend on the next word. HEAD values in the
    # array are relative offsets, so 1 means "the next token"; the
    # last token points to itself (offset 0) so it stays in range.
    for i in range(len(doc)):
        parse[i, 0] = 1 if i < len(doc) - 1 else 0
    doc.from_array([HEAD, DEP], parse)
    return doc

nlp.pipeline is literally just a list. Currently the only assumption is that the list entries are callable. You can set up your own list however you like, with any or all of your own components. You could have a pipeline component that predicts the sentence boundaries, creates a sequence of Doc objects using slices of the Doc.c pointer for the sentences, and parses each sentence independently:

# From within Cython

class SentenceParser(object):
    def __init__(self, segmenter, parser):
        self.segment = segmenter
        self.parse = parser

    def __call__(self, doc):
        sentences = self.segment(doc)
        cdef Doc subdoc
        for sent in sentences:
            subdoc = Doc(doc.vocab)
            subdoc.c = &doc.c[sent.start]
            subdoc.length = sent.end-sent.start
            self.parse(subdoc)
        return doc

I haven't tested this, but in theory it should work?
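The list-of-callables contract can also be sketched without spaCy at all (FakeDoc below is a hypothetical stand-in for spacy.tokens.Doc, just enough to show the shape of a component):

```python
class FakeDoc:
    # stand-in for spacy.tokens.Doc: components receive it,
    # annotate it, and return it
    def __init__(self, words):
        self.words = words
        self.user_data = {}

def sentence_marker(doc):
    # toy component: a naive sentence starts at index 0 and
    # after every "." token
    doc.user_data['sent_starts'] = [
        i for i, _ in enumerate(doc.words)
        if i == 0 or doc.words[i - 1] == '.'
    ]
    return doc

pipeline = [sentence_marker]   # the pipeline really is just a list

doc = FakeDoc("I saw it . It was good .".split())
for component in pipeline:
    doc = component(doc)
print(doc.user_data['sent_starts'])  # [0, 4]
```

Any callable with the doc -> doc signature can be appended to the list, which is the whole extension contract described above.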


@christian-storm commented Sep 30, 2017

nlp.pipeline is literally just a list. Currently the only assumption is that the list entries are callable. You can set up your own list however you like, with any or all of your own components.

I totally get how pipelines work under the hood now. But it isn't as simple as that, right? Which brings me back to what started all this for me. If it were that easy, set_factory would be as trivial as adding a callable function to the pipeline list (#1357) and I would be able to set sentence boundaries without new ones "magically" being created.

I appreciate you sharing recipes for how you would do it. However, this is exactly what I was trying to avoid. As part of this exercise I am now more familiar with the code, so it is a more tenable solution. I fear you are going to leave behind a lot of talented people who could contribute to spacy, and box out people who find spacy unfit for their task. Most researchers won't crack the hood open and take the time to learn Cython and the inner workings of the spacy engine just so they can add or modify a part. I think there is an opportunity for spacy to create an ecosystem much like scikit-learn's, which currently has 932 contributors and a clear path to becoming one.

At any rate, I'll get off my soapbox now. I'm anxiously awaiting how, or if, you'll solve the SBD issue. As of right now I'm dead in the water with spacy because of it, trying to decide whether to move on or hang tight.


@honnibal (Member) commented Sep 30, 2017

Well, I think there's a mix of a couple of issues here. One is that the SBD stuff is legit broken at the moment --- it's one of the tickets blocking spaCy 2 stable. Similarly the set_factory thing doesn't work as advertised at the moment either.

But the more interesting things are these deeper design questions: how the pipeline works, to what extent we should expect components to be "hot swappable", how versioning should work, whether we can have a pluggable architecture, etc.

I agree that having me suggest Cython code isn't a scalable approach to community development :p. On the other hand, some of the problems aren't scalable/general here --- there are specific bugs, for which I'm trying to give specific mitigations.

About the more general questions: I think we should probably switch to using entry points to give a more explicit plugin infrastructure, for both the languages and the components. We also plan to have wrapper components for the common machine learning libraries, to make it easy to write a model with say PyTorch and use it to power a POS tagger. The next release of the spaCy 2 docs will also have more details about the Pipe abstract base class.

I don't think I want something like the declarative approach to pipelines that you mentioned above, though. I think if you want that sort of workflow, the best thing to do would be to wrap each spaCy component you're interested in as a pip package, and then use Luigi or Airflow as the data pipeline layer.

The components you wrap this way can take a Doc object instead of text if you like --- you just have to supply a different tokenizer or make_doc function. So, you don't need to repeat any work this way. You can make the steps you're presenting as spaCy pipelines as small or as big as you like. I think this will be better than designing our own pipeline management solution.
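A pure-Python sketch (no spaCy dependency, all names illustrative) of the idea above: a pipeline object whose make_doc step is swappable, so wrapped components can accept an already-built Doc instead of raw text and avoid repeating tokenization:

```python
class FakeDoc:
    """Stand-in for spaCy's Doc."""
    def __init__(self, words):
        self.words = words

def tokenize(text):
    # Stand-in tokenizer: text -> Doc.
    return FakeDoc(text.split())

class MiniLanguage:
    """Toy Language-like object: make_doc is a swappable first step."""
    def __init__(self, make_doc, pipeline=None):
        self.make_doc = make_doc      # text (or Doc) -> Doc
        self.pipeline = pipeline or []

    def __call__(self, text_or_doc):
        doc = self.make_doc(text_or_doc)
        for component in self.pipeline:
            doc = component(doc)
        return doc

def count_words(doc):
    doc.n_words = len(doc.words)
    return doc

# Normal use: make_doc tokenizes raw text.
nlp = MiniLanguage(make_doc=tokenize, pipeline=[count_words])
assert nlp("this is a test").n_words == 4

# Wrapped use: a pass-through make_doc lets callers hand in a Doc directly,
# so earlier stages of a larger data pipeline don't repeat tokenization.
nlp2 = MiniLanguage(make_doc=lambda doc: doc, pipeline=[count_words])
doc = tokenize("already tokenized elsewhere")
assert nlp2(doc).n_words == 3
```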

@honnibal

honnibal Sep 30, 2017

Member

There's also some relevant discussion about extensibility in #1085 that might be interesting.


@christian-storm

christian-storm Sep 30, 2017

Yeah, I had read #1085 as part of my due diligence trying to wrap my head around all this.

I'm heartened to hear SBD is on the radar and that some thought is being given to entry points, a pluggable architecture, and a Pipe abstract class. It's hard to arrive at the right abstraction, but it'll be well worth it in the long run. I'm on the same page with respect to the vision and using the right tool for the job, e.g. pipeline management. I'll stop bugging you so you and I can get back to being productive. :)

@cbrew

cbrew Oct 9, 2017

I think this little fragment ought to work, but it doesn't. Something seems to be wrong with the saving of the added pipeline component.

I have spacy 2.0.0a16 installed in a fresh conda environment with python 3.6.2 from conda-forge.

import spacy
import spacy.lang.en
from spacy.pipeline import TextCategorizer

nlp = spacy.lang.en.English()
tokenizer = nlp.tokenizer
textcat = TextCategorizer(tokenizer.vocab, labels=['ENTITY', 'ACTION', 'MODIFIER'])
nlp.pipeline.append(textcat)
nlp.to_disk('matter')

error is

Traceback (most recent call last):
  File "loadsave.py", line 10, in <module>
    nlp.to_disk('matter')
  File "/Users/cbrew/anaconda3/envs/spacy2/lib/python3.6/site-packages/spacy/language.py", line 507, in to_disk
    util.to_disk(path, serializers, {p: False for p in disable})
  File "/Users/cbrew/anaconda3/envs/spacy2/lib/python3.6/site-packages/spacy/util.py", line 478, in to_disk
    writer(path / key)
  File "/Users/cbrew/anaconda3/envs/spacy2/lib/python3.6/site-packages/spacy/language.py", line 505, in <lambda>
    serializers[proc.name] = lambda p, proc=proc: proc.to_disk(p, vocab=False)
  File "pipeline.pyx", line 190, in spacy.pipeline.BaseThincComponent.to_disk
  File "/Users/cbrew/anaconda3/envs/spacy2/lib/python3.6/site-packages/spacy/util.py", line 478, in to_disk
    writer(path / key)
  File "pipeline.pyx", line 188, in spacy.pipeline.BaseThincComponent.to_disk.lambda7
TypeError: Required argument 'length' (pos 1) not found

@honnibal

honnibal Oct 10, 2017

Member

@cbrew Thanks. Seems to be a bug in .to_bytes() --- the same happens even without adding the model to the pipeline.

Edit: Okay I think I see the issue. After __init__() the component's .model attribute won't be created yet. It's added in a second step, after you call either begin_training() or load with from_bytes() or from_disk().

I think this is leading to incorrect behaviour when you immediately try to serialize the class.

Edit2:

>>> a = True
>>> a.to_bytes()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: Required argument 'length' (pos 1) not found

So to_bytes() happens to clash with a method on the bool type. Sometimes dynamic typing feels like a terrible bad no good idea...
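The clash above can be reproduced without spaCy at all. A small sketch of why attribute-based duck typing misfires here, plus one way to guard against it (the SerializableMixin name is hypothetical, not spaCy's actual base class):

```python
# Python 3 ints (and therefore bools) already carry a to_bytes method,
# so checking "does this object have to_bytes?" picks up the wrong method.
value = True
assert hasattr(value, "to_bytes")  # looks serializable, but it's int.to_bytes

# A safer pattern: dispatch on an explicit base class rather than on
# attribute presence, so unrelated builtins are rejected cleanly.
class SerializableMixin:
    def to_bytes(self):
        return b"custom-serialized"

def serialize(obj):
    if isinstance(obj, SerializableMixin):
        return obj.to_bytes()
    raise TypeError("don't know how to serialize %s" % type(obj).__name__)

assert serialize(SerializableMixin()) == b"custom-serialized"
try:
    serialize(True)
except TypeError:
    pass  # bool is rejected instead of hitting int.to_bytes by accident
```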


@jamesrharwood

jamesrharwood Oct 11, 2017

Is it possible to run spaCy functions on a Redis-backed worker? I'm finding that my jobs disappear as soon as they reach the nlp() call. For instance:

#### worker.py

import redis
from rq import Worker, Queue, Connection

conn = redis.from_url("redis://localhost:6379")
with Connection(conn):
    worker = Worker(list(map(Queue, ['default'])))
    worker.work()
#### test.py

import spacy
nlp=spacy.load('en_core_web_sm')

def test_nlp():
    print "before NLP call"
    r = nlp(u"this is a test")
    print "after NLP call"
    return r

Running python worker.py and then the following:

from rq import Queue
from worker import conn
from test import test_nlp

q = Queue(connection=conn)
q.enqueue(test_nlp)

Results in the worker printing:

21:58:45 *** Listening on default...
21:59:07 default: test_nlp() (61b65f56-4a02-42b2-bdfd-07a5bc7bceb6)
before NLP call
21:59:08
21:59:08 *** Listening on default...

The second print statement never appears, and if I query the job status it confirms that it's started, but not finished and not failed.

Am I missing something obvious?

spacy-nightly: 2.0.0a16
rq: 0.6.0
redis: 2.10.5

UPDATE USING CELERY INSTEAD OF RQ

Using Celery instead of RQ, I now get this error:

[2017-10-12 11:12:18,412: INFO/MainProcess] Received task: test_nlp[1a48e949-b1ba-4820-aed9-b7b44a1fed1f]
[2017-10-12 11:12:18,417: WARNING/ForkPoolWorker-3] before NLP call
[2017-10-12 11:12:19,701: ERROR/MainProcess] Process 'ForkPoolWorker-3' pid:11820 exited with 'signal 11 (SIGSEGV)'
[2017-10-12 11:12:19,718: ERROR/MainProcess] Task handler raised error: WorkerLostError('Worker exited prematurely: signal 11 (SIGSEGV).',)

This Celery thread suggests it may be a problem with spaCy not being fork-safe:
celery/celery#2964 (comment)

I tried the workaround suggested in the linked comment (importing the spacy model inside the function) but the import causes the same error.

PROBLEM SOLVED?

I tried pip install eventlet and then running the celery worker with -P eventlet -c 1000 and now the task runs successfully!

I'm not sure whether this means it's a bug within prefork or Spacy, so I'm leaving this comment here in the hope that it helps someone!


@nathanathan

nathanathan Oct 12, 2017

Contributor

Sentence span similarity isn't working for me in spacy-nightly 2.0.0a16:

import en_core_web_sm as spacy_model
spacy_nlp = spacy_model.load()
sent_list = list(spacy_nlp(u'I saw a duck at the park. Duck under the limbo stick.').sents)
sent_list[0].similarity(sent_list[1])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "span.pyx", line 134, in spacy.tokens.span.Span.similarity
  File "span.pyx", line 231, in spacy.tokens.span.Span.vector_norm.__get__
  File "span.pyx", line 216, in spacy.tokens.span.Span.vector.__get__
  File "span.pyx", line 112, in __iter__
  File "token.pyx", line 259, in spacy.tokens.token.Token.vector.__get__
IndexError: index 0 is out of bounds for axis 0 with size 0
@jesushd12

jesushd12 Oct 25, 2017

Hi @nathanathan, were you able to resolve this? I'm getting the same problem with the similarity function; I'm using the Spanish model.

@ines

ines Oct 27, 2017

Member

@nathanathan @jesushd12 Sorry about that – we're still finalising the vector support on the current models (see #1457). We're currently training a new family of models for the next version, which includes a lot of fixes and updates currently on develop. (Unless there are new bugs or significant problems, this is likely also going to be the version we're promoting to the release candidate πŸŽ‰ )

@chaturv3di

chaturv3di Oct 30, 2017

I'm trying to install spaCy 2.0 alpha in a new conda environment, and I'm getting an undefined symbol: PyFPE_jbuf error. AFAIK, this is due to two versions of numpy. However, I have made sure that my packages numpy, scipy, msgpack-numpy, and Cython are all installed solely via pip. In fact, I even tried the flavour where all of these packages are installed solely via conda. No luck.

Would anyone be able to offer any advice?

@honnibal

honnibal Oct 30, 2017

Member

@chaturv3di That error tends to occur when pip uses a cached binary package. I find this happens a lot for me with the cytoolz package --- somehow its metadata is incorrect and pip thinks it can be compatible across both Python 2 and 3.

Try pip uninstall cytoolz && pip install cytoolz --no-cache-dir

@chaturv3di

chaturv3di Oct 30, 2017

Thanks @honnibal.

For the record, after following your advice, I got the same error again, but this time from the preshed package. I did the same with it, i.e. pip uninstall preshed && pip install preshed --no-cache-dir, and it worked.

@chaturv3di

chaturv3di Oct 30, 2017

Hi All,

This is related to dependency parsing. Where can I find the exact logic for merging Spans when the "merge phrases" option is chosen on https://demos.explosion.ai?

Thanks in advance.

@ines

ines Nov 3, 2017

Member

@chaturv3di See here in the spacy-services:

if collapse_phrases:
    for np in list(self.doc.noun_chunks):
        np.merge(np.root.tag_, np.root.lemma_, np.root.ent_type_)

Essentially, all you need to do is iterate over the noun phrases in doc.noun_chunks, merge them and make sure to re-assign the tags and labels. In spaCy v2.0, you can also specify the arguments as keyword arguments, e.g. span.merge(tag=tag, lemma=lemma, ent_type=ent_type).

@ines

ines Nov 8, 2017

Member

Thanks everyone for your feedback! πŸ’™
spaCy v2.0 is now live: https://github.com/explosion/spaCy/releases/tag/v2.0.0

@lock

lock bot May 8, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators May 8, 2018
