IndexError on printing token.orth_ / incorrect token printing token.lower_ ...but only in some cases #27

Closed
NSchrading opened this issue Feb 15, 2015 · 7 comments


@NSchrading

System specs: Linux Mint 16, 64-bit, Python 3.4, using Anaconda

Stack trace:

Traceback (most recent call last):
  File "/blah/blah/test.py", line 324, in <module>
    induceFail()
  File "/blah/blah/test.py", line 319, in induceFail
    print(l.orth_)
  File "spacy/tokens.pyx", line 427, in spacy.tokens.Token.orth_.__get__ (spacy/tokens.cpp:8080)
  File "spacy/strings.pyx", line 71, in spacy.strings.StringStore.__getitem__ (spacy/strings.cpp:1631)
IndexError: 3481370184

Alright, this is a weird one. It involves passing tokens around in different lists, and only certain configurations reliably cause the error. It cropped up in a real dataset where I wanted to pass the token objects around in certain cases; I worked around it by begrudgingly passing the string representation of each token instead. Anyway, I've put together a small piece of code that reliably induces the error on my machine, so you can hopefully debug it:

from spacy.en import English

def blah(toks):
    ''' Take the tokens from nlp(), append them to a list, return the list '''
    lst = []
    for tok in toks:
        lst.append(tok)
        # printing the tok.orth_ at this point works just fine
        print(tok.orth_)
    return lst

def induceFail():
    nlp = English()
    samples = ["a", "test blah wat okay"]
    lst = []
    for sample in samples:
        # Go through all the samples, call nlp() on each to get tokens,
        # pass those tokens to the blah() function, get a list back and put all results in another list
        lst.extend(blah(nlp(sample)))
    # go through the list of all tokens and try to print orth_
    for l in lst:
        # here is where the error is thrown
        print(l.orth_)

induceFail()

Now, if you replace samples with the following, it works just fine!

samples = ["will this break", "test blah wat okay"]

And note that printing other attributes like pos_, dep_, and lower_ works without causing an IndexError, BUT doesn't print the correct thing, which leads me to believe some funny pointer-dereferencing bug is behind this (the address is incorrect for certain symbols or something). Only the orth_ attribute seems to throw an error. For example, when I switched to printing lower_ I got the following output:

a
test
blah
wat
okay
a
test
blah
neighbour       <-------------- Wait...what? Neighbour isn't in any of the samples...
okay

Notice neighbour? Where the heck did it get that? This is in an entirely new Python process. I'm not doing any multiprocessing or multithreading, and I'm not using multiple instances of the nlp() object. So somehow an address got messed up; it's reading memory it's not supposed to, but the address is valid enough that it prints whatever happens to be there. I have no idea.

So then I ran it again. And this time, instead of "neighbour" it was "Milk".

I hope the error is reproducible on your end, and I apologize if it isn't. I swear I'm not crazy; you might just need to change the values in the samples list.

@honnibal
Member

This should be fixed in version 0.63. Are you on that version yet? I ran your code on 0.63 and it works, but I can reproduce your error on the previous version.

Here's what's going on.

Before the latest version, Token was supposed to be a weak reference. The underlying data is a C array of structs. This data is owned by the Tokens object. Once the Tokens object passes out of scope, the underlying C data is due for collection. If you're still holding references to Token objects, and you access a property that's proxied to that C data, all bets are off.
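
Roughly, in Python terms (a loose sketch I'm adding for illustration, not the actual Cython code), the hazard looks like a lightweight view object indexing into storage owned by its parent:

class Tokens:
    """Owns the underlying storage (stands in for the C array of structs)."""
    def __init__(self, words):
        self._data = list(words)

    def __iter__(self):
        for i in range(len(self._data)):
            yield Token(self, i)

class Token:
    """A lightweight view over one entry of the owner's storage.
    Here `owner` is a normal Python reference, so it keeps the data
    alive; in the buggy versions the equivalent was a raw C pointer,
    so once the Tokens object was collected, the pointer dangled."""
    def __init__(self, owner, i):
        self.owner = owner
        self.i = i

    @property
    def orth_(self):
        return self.owner._data[self.i]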

The recent update tries to check whether any Token objects will outlive the Tokens object. If so, it gives them a copy of the C data. I didn't document any of this properly, and I'm sorry that this seems to have wasted some of your time. I hope to have a less broken release process soon.

But, aside from this: what you're doing is supposed to be a rare edge-case. My intention is for users to maintain a reference to the Tokens object, and access the Token objects through it. Is there a reason you want to create your own list? It might indicate a weakness in my API.
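
For reference, the pattern I intend is something like this sketch (holding the Tokens objects in a list is just one way to keep them alive):

from spacy.en import English

nlp = English()
samples = ["a", "test blah wat okay"]

# Hold on to the Tokens objects themselves; the Token views borrow
# their data from the owning Tokens object.
docs = [nlp(sample) for sample in samples]

for doc in docs:
    for tok in doc:
        print(tok.orth_)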

@NSchrading
Author

I am running this on version 0.63, sorry I didn't mention that. Strange that it works on your machine but not mine. But now that I understand you have to keep a reference to the Tokens object, I can do what I wanted to earlier without the error. It might be good to look into this a bit more to make sure you really are preventing those tokens from losing the necessary data; I'm positive I'm on 0.63 (I even checked tokens.pxd and it has this change: cdef int take_ownership_of_c_data(self) except -1).

The reason I was doing that is that I have a large set of tweets, each of which I was turning into a Tokens object with nlp(tweet_text). I didn't know I needed to keep a reference to it, so I was just passing it into a function that does the analysis I want: in this case, using the dependency parse to extract Subject, Verb, Object structures. In that function I was finding the subject, verb, and object tokens and appending them to another list that I return at the end, so that down the line I could examine the returned tokens. I figured it might be useful to keep all of the information, like tok.pos_, tok.dep_, etc., which is why I was adding the Token objects to a list. To work around the error, I simply appended the tok.lower_ string representation instead. So far that's fine, but I might want more of that information later.
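
In rough outline, the extraction looks something like this simplified sketch (the 'nsubj'/'dobj' labels and the traversal via head and rights are illustrative, not my exact code):

def extract_svo(doc):
    """Collect (subject, verb, object) token triples from a parsed
    document, keyed off spaCy-style dependency labels."""
    triples = []
    for tok in doc:
        if tok.dep_ == 'nsubj':
            verb = tok.head
            # look for a direct object among the verb's right children
            for right in verb.rights:
                if right.dep_ == 'dobj':
                    triples.append((tok, verb, right))
    return triples

The returned triples hold Token objects, which is exactly where the lifetime problem bit me: nothing in that list kept the parent Tokens object alive.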

Don't worry about the bugs; I understand this is an early stage of development. I do want to commend you on your work so far, though. The dependency parsing works pretty darn well, even on informal text from the internet, and it is by far the fastest and easiest parser to use from Python. I've tried TurboParser and TweeboParser and both are very difficult to work with in Python. It's so simple to traverse the tree using rights and lefts as well. Really well done!

@honnibal
Member

Working on this. I think I have it replicating on my server, but not on my laptop. Memory errors like this are difficult, because tests can pass accidentally, depending on whether the memory happens to have been overwritten.

@honnibal
Member

Okay, try v0.65. This is working on my laptop, server, and on Travis.

Please watch out for, and report, memory leaks in the new implementation. There's a reference cycle between Tokens and Token. I've run the parser over lots of documents, but the problems might only arise when the Token objects are used in some non-trivial way.
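
If you want to keep an eye on it, a crude check (generic Python, nothing spaCy-specific) is to compare live object counts before and after a run:

import gc

def count_live_objects():
    gc.collect()  # clear anything that's already collectable
    return len(gc.get_objects())

before = count_live_objects()
# ... run the parser over many documents here ...
after = count_live_objects()
print('live object delta:', after - before)

A steadily growing delta across repeated runs would suggest the cycle isn't being collected.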

@NSchrading
Author

It seems to be working on my end now with v0.65!

@honnibal
Member

Great!

@lock

lock bot commented May 9, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators May 9, 2018