IndexError on printing token.orth_ / incorrect token printing token.lower_ ...but only in some cases #27
This should be fixed in version 0.63. Are you on that version yet? I ran your code on 0.63 and it works, but I can reproduce your error on the previous version.

Here's what's going on. Before the latest version, Token was supposed to be a weak reference. The underlying data is a C array of structs, owned by the Tokens object. Once the Tokens object passes out of scope, the underlying C data is due for collection. If you're still holding references to Token objects and you access a property that's proxied to that C data, all bets are off. The recent update tries to check whether any Token objects will outlive the Tokens object; if so, it gives them a copy of the C data.

I didn't document any of this properly, and I'm sorry that it seems to have wasted some of your time. I hope to have a less broken release process soon.

But, aside from this: what you're doing is supposed to be a rare edge case. My intention is for users to maintain a reference to the Tokens object and access the Token objects through it. Is there a reason you want to create your own list? It might indicate a weakness in my API.
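The ownership scheme described above can be modeled in a few lines of pure Python. This is an illustrative sketch, not spaCy's actual Cython code: the class names mirror the issue, but `take_ownership_of_c_data` here is a simplified stand-in for the real method, and a plain list stands in for the C struct array.

```python
class Tokens:
    """Owns the underlying data (stands in for the C array of structs)."""
    def __init__(self, words):
        self._data = list(words)          # owned storage

    def __getitem__(self, i):
        return Token(self, i)             # Token proxies into our storage

    def take_ownership_of_c_data(self, token):
        # Sketch of the fix described above: give a surviving Token its
        # own copy, so it no longer depends on this object's lifetime.
        token._own_copy = self._data[token.i]


class Token:
    """A view into the parent Tokens object's storage."""
    def __init__(self, parent, i):
        self._parent = parent             # holding the parent keeps the data alive
        self.i = i
        self._own_copy = None

    @property
    def orth_(self):
        if self._own_copy is not None:
            return self._own_copy
        return self._parent._data[self.i]


doc = Tokens(["I", "like", "spaCy"])
tok = doc[2]
del doc            # only the name goes away; tok still references the parent
print(tok.orth_)   # -> "spaCy"
```

In the real C version there is no such safety net: once the owning object is freed, the proxy reads freed memory, which is why "all bets are off."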
I am running this on version 0.63, sorry I didn't mention that. Strange that it works on your machine but not mine. Now that I understand you have to keep a reference to the Tokens object, though, I can do what I wanted earlier without the error. It might be good to look into this a bit more to make sure you really are preventing those tokens from losing the necessary data; I'm positive I'm on 0.63. I even checked tokens.pxd, and it has this change: "cdef int take_ownership_of_c_data(self) except -1".

The reason I was doing that: I have a large set of tweets, each of which I was turning into a Tokens object by doing nlp(tweet_text). I didn't know you needed to keep a reference to it, so I was just passing it into a function that does the analysis I want. In this case I was using the dependency parse to extract Subject, Verb, Object structures. In that function, I found the subject, verb, and object tokens and appended them to another list that I return at the end, so that down the line I could examine the tokens that were returned. I figured it might be useful to have all of the information, like tok.pos_, tok.dep_, etc., which is why I was adding the token itself to a list. To work around the error, I simply appended the tok.lower_ string representation instead. So far that's fine, but maybe I'd want more info later.

Don't worry about the bugs; I understand this is the early stage of development. I do want to commend you on your work so far, though. The dependency parsing works pretty darn well, even on informal text from the internet, and it is by far the fastest and easiest to use in Python. I've tried TurboParser and TweeboParser, and both of those are very difficult to work with in Python. It's so simple to traverse the tree using rights and lefts as well. Really well done!
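The workflow described above can be sketched as follows. Everything here is hypothetical: `fake_nlp`, the dependency labels, and the SVO logic are placeholders for spaCy and the actual analysis code, which are not shown in the issue. The point is the pattern: return the parent document together with the extracted tokens, so the tokens' backing data stays alive.

```python
def fake_nlp(text):
    # Stand-in for spaCy's nlp(): tag each word with a toy dependency label.
    deps = ["nsubj", "ROOT", "dobj"]
    return [{"lower_": w.lower(), "dep_": d}
            for w, d in zip(text.split(), deps)]


def extract_svo(text):
    doc = fake_nlp(text)
    svo = [tok for tok in doc if tok["dep_"] in ("nsubj", "ROOT", "dobj")]
    # Return the doc alongside the extracted tokens: as long as the caller
    # holds this pair, the doc (the owner of the data) cannot be collected.
    return doc, svo


doc, svo = extract_svo("Cats chase mice")
print([t["lower_"] for t in svo])   # -> ['cats', 'chase', 'mice']
```

Returning only `svo` and dropping `doc`, as in the original code, is exactly the pattern that exposed the bug.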
Working on this. I think I have it replicating on my server, but not on my laptop. Memory errors like this are difficult, because tests can pass accidentally, depending on whether the memory was over-written.
Okay, try v0.65. This is working on my laptop, server, and on Travis. Please watch out for, and report, memory leaks in the new implementation. There's a reference cycle between Tokens and Token. I've run the parser over lots of documents, but the problems might only arise when the Token objects are used in some non-trivial way. |
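On the reference cycle mentioned above: CPython's cycle collector can reclaim a Tokens/Token pair that reference each other, so a cycle does not automatically mean a leak. A toy demonstration (pure Python, with made-up `Parent`/`Child` classes standing in for Tokens/Token):

```python
import gc
import weakref


class Parent:
    def __init__(self):
        self.children = []


class Child:
    def __init__(self, parent):
        self.parent = parent           # back-reference creates the cycle
        parent.children.append(self)


p = Parent()
c = Child(p)
probe = weakref.ref(p)                 # watch whether p gets collected
del p, c                               # drop all external references
gc.collect()                           # cycle collector reclaims the pair
print(probe() is None)                 # -> True
```

The caveat for extension types is that the cycle collector must be able to traverse the objects; if it can't, a cycle like this really does leak, which is presumably why the author asks users to watch for leaks.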
It seems to be working on my end now with v0.65! |
Great! |
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs. |
System specs: Linux Mint 16, 64-bit, Python 3.4, using Anaconda
stack trace:
Alright, this is a weird one. It involves passing tokens around in different lists, and only certain configurations will reliably cause the error. This error cropped up in a dataset I am actually using, where I wanted to pass the token objects around but only in certain cases. I got around this error by begrudgingly just passing the string representation of the token object. Anyway, I've created a small set of code that reliably induces the error on my machine so you can hopefully debug it:
Now, if you replace samples with the following sample, it works just fine!
And note that printing other attributes like pos_, dep_, and lower_ works without causing an IndexError, BUT they don't print the correct thing, which leads me to believe some funny pointer-dereferencing bug is causing this (the address is incorrect for certain symbols or something). It seems to only throw an error on the orth_ attribute. For example, when I changed it to print lower_, I got the following output:
Notice "neighbour"? Where the heck did it get that? This is in an entirely new Python process. I'm not doing any multiprocessing or multithreading, and I'm not using multiple instances of the nlp() object. So somehow an address got messed up: it's reaching into memory it's not supposed to, but it's a valid address, and it prints whatever is there? I have no idea.
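The symptom described above is exactly what a dangling pointer into reused memory looks like. As a rough pure-Python analogy (not spaCy's actual code): imagine a token that keeps only an index into a shared buffer. If that storage is later reused for unrelated text, the index is still "valid" and happily returns whatever now lives there, so you get a wrong word instead of a clean error.

```python
buffer = ["I", "like", "science"]


class DanglingToken:
    """Keeps only an index into shared storage, like a raw C pointer."""
    def __init__(self, buf, i):
        self.buf, self.i = buf, i

    @property
    def lower_(self):
        return self.buf[self.i].lower()


tok = DanglingToken(buffer, 2)
print(tok.lower_)          # -> "science"

# Simulate the allocator reusing the same storage for unrelated text:
buffer[:] = ["the", "friendly", "neighbour", "waved"]
print(tok.lower_)          # -> "neighbour"
```

In C there is no bounds check at all, so whether you get an IndexError-style crash or a plausible-looking wrong string depends purely on what happens to occupy that address.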
So then I ran it again. And this time, instead of "neighbour" it was "Milk".
I hope the error is reproducible on your end. I apologize if it isn't. I swear I'm not crazy, you might just need to change the values in the samples list.