
error with spacy #38

Open · openmotion opened this issue Jun 24, 2016 · 22 comments

@openmotion commented Jun 24, 2016

Hello, I have this error in hacker_news/data:
python preprocess.py
Traceback (most recent call last):
  File "preprocess.py", line 47, in <module>
    merge=True)
  File "build/bdist.linux-x86_64/egg/lda2vec/preprocess.py", line 76, in tokenize
    author_name = authors.categories
  File "spacy/tokens/doc.pyx", line 250, in noun_chunks (spacy/tokens/doc.cpp:8013)
  File "spacy/syntax/iterators.pyx", line 11, in english_noun_chunks (spacy/syntax/iterators.cpp:1559)
  File "spacy/tokens/doc.pyx", line 100, in spacy.tokens.doc.Doc.__getitem__ (spacy/tokens/doc.cpp:4890)
IndexError: list index out of range

@tokestermw

Probably this error? If so, try updating spaCy:

explosion/spaCy#375

@rchari51 commented Aug 8, 2016

The problem is indeed in spaCy. As suggested in the linked issue, the workaround is to write

for phrase in list(doc.noun_chunks):

instead of

for phrase in doc.noun_chunks:

The in-place merge() invalidates the iterator.
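
In context, a minimal sketch of that workaround inside lda2vec's merge loop (using the spaCy 1.x Span.merge() call quoted later in this thread):

# Materialize the generator up front so the in-place merges
# cannot invalidate the noun-chunk iterator mid-loop.
for phrase in list(doc.noun_chunks):
    if len(phrase) > 1:
        # spaCy 1.x signature: merge(tag, lemma, ent_type)
        phrase.merge(phrase.root.tag_, phrase.text,
                     phrase.root.ent_type_)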

@crawfordcomeaux commented Sep 22, 2016

@rchari51 This is what I get after manually making the changes to spacy/tokens/doc.pyx and lda2vec/preprocess.py:

Traceback (most recent call last):
  File "data/preprocess.py", line 47, in <module>
    merge=True)
  File "/home/ubuntu/lda2vec/lda2vec/preprocess.py", line 78, in tokenize
    while len(phrase) > 1 and phrase[0].dep_ not in bad_deps:
  File "spacy/tokens/span.pyx", line 54, in spacy.tokens.span.Span.__len__ (spacy/tokens/span.cpp:3817)
  File "spacy/tokens/span.pyx", line 97, in spacy.tokens.span.Span._recalculate_indices (spacy/tokens/span.cpp:4975)
IndexError: Error calculating span: Can't find end

@saravp commented Oct 19, 2016

@crawfordcomeaux - Were you able to resolve your issue? I ran into the same thing.

@crawfordcomeaux

@saravp Only with merge=False, which doesn't really fit my use case.
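
For what it's worth, that workaround is just a matter of the call site (argument names are taken from the tokenize signature quoted later in this thread; the max_length value is illustrative):

# Skipping noun-phrase merging sidesteps the Span errors entirely,
# at the cost of losing multi-word tokens like "gateway_drug".
data, vocab = tokenize(texts, max_length=10000, merge=False)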

@crawfordcomeaux

@saravp I just took a look at spaCy's issues to see if anything related to this stood out, and they just shipped version 1.0. Does anything change if you update spaCy?

@grivescorbett

I'm seeing this error as well, using spaCy from master (commit d8db648ebf70e4bddfe21cad50a34891e4b75154):

File "data/preprocess.py", line 47, in <module>
merge=True)
File "/Users/grivescorbett/projects/lda2vec/lda2vec/preprocess.py", line 78, in tokenize
while len(phrase) > 1 and phrase[0].dep_ not in bad_deps:
File "spacy/tokens/span.pyx", line 65, in spacy.tokens.span.Span.__len__ (spacy/tokens/span.cpp:4142)
File "spacy/tokens/span.pyx", line 130, in spacy.tokens.span.Span._recalculate_indices (spacy/tokens/span.cpp:5339)

@NilsRethmeier commented Dec 20, 2016

@grivescorbett @crawfordcomeaux @openmotion @saravp
@cemoody Could this be an indentation issue?
I think the code at https://github.com/cemoody/lda2vec/blob/master/lda2vec/preprocess.py#L85-L88 would make more sense unindented by one level; then this error disappears. I think the inner loop is messing with the spans, which is why people get this error.
Honnibal's current code does seem to have the unindent (see https://github.com/explosion/sense2vec/blob/master/bin/merge_text.py#L95-L96).

To reproduce the error:

s = u"""Marijuana is not the gateway drug alcohol is. I was introduced to alcohol at age of ten. I was introduced to marijuana at age of 14 . I was introduced to cocaine and crack at the age 17 & 18 . upon being introduced to crack I became addicted to crack & left marijuana alone."""
tokenize([s], max_length, skip=-2, attr=LOWER, merge=True)

This works for me now :)

import numpy as np
from spacy.en import English
from spacy.attrs import LOWER, LIKE_EMAIL, LIKE_URL


def tokenize(texts, max_length, skip=-2, attr=LOWER, merge=False, nlp=None,
             **kwargs):
    """Convert texts into fixed-width rows of token IDs, optionally
    merging noun phrases and named entities into single tokens."""
    if nlp is None:
        nlp = English()
    data = np.zeros((len(texts), max_length), dtype='int32')
    data[:] = skip
    bad_deps = ('amod', 'compound')
    for row, doc in enumerate(nlp.pipe(texts, **kwargs)):
        if merge:
            # from the spaCy blog, an example on how to merge
            # noun phrases into single tokens
            for phrase in doc.noun_chunks:
                # Only keep adjectives and nouns, e.g. "good ideas"
                while len(phrase) > 1 and phrase[0].dep_ not in bad_deps:
                    phrase = phrase[1:]
                if len(phrase) > 1:
                    # Merge the tokens, e.g. good_ideas
                    phrase.merge(phrase.root.tag_, phrase.text,
                                 phrase.root.ent_type_)
            # Iterate over named entities
            for ent in doc.ents:
                if len(ent) > 1:
                    # Merge them into single tokens
                    ent.merge(ent.root.tag_, ent.text, ent.label_)
        dat = doc.to_array([attr, LIKE_EMAIL, LIKE_URL]).astype('int32')
        if len(dat) > 0:
            dat = dat.astype('int32')
            msg = "Negative indices reserved for special tokens"
            assert dat.min() >= 0, msg
            # Replace email and URL tokens
            idx = (dat[:, 1] > 0) | (dat[:, 2] > 0)
            dat[idx] = skip
            length = min(len(dat), max_length)
            data[row, :length] = dat[:length, 0].ravel()
    uniques = np.unique(data)
    vocab = {v: nlp.vocab[v].lower_ for v in uniques if v != skip}
    vocab[skip] = '<SKIP>'
    return data, vocab

@spnichol commented Dec 23, 2017

I was having this issue in Python 2.7 and tried the above fixes; unfortunately, none of them solved the problem. I ended up trying it in Python 3.5 and it worked. It's definitely an issue with the conversion of those huge uint64 values to int32.
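
A minimal sketch of why that down-cast bites (the specific ID below is made up for illustration): spaCy stores lexeme IDs as unsigned 64-bit integers, and any ID above 2**31 - 1 wraps around when cast to int32, producing exactly the negative indices that trip the assertion in the tokenize code above.

import numpy as np

# Hypothetical lexeme ID larger than an int32 can hold (> 2**31 - 1).
big_id = np.array([8566208034543834098], dtype='uint64')

# The down-cast keeps only the low 32 bits, which can come out negative,
# violating "Negative indices reserved for special tokens".
print(big_id.astype('int32'))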

@bountrisv

Can you explain what you tried to make it work in Python 3? I'm still getting negative indices with both fixes.

@hirenaum97

File "/home/aum/PycharmProjects/learn_p/venv/src/lda2vec/lda2vec/preprocess.py", line 35, in tokenize
assert dat.min() >= 0, msg
AssertionError: Negative indices reserved for special tokens

what should i do?

@gracegcy

@hirenaum97 Hi, were you able to resolve the error? I got a similar one. Thanks.

@hirenaum97

Just change your spaCy version to 1.9.

@CoreJa commented May 6, 2018

Thanks to @hirenaum97, I've changed my spaCy version to 1.9 and taken @NilsRethmeier's advice, unindenting the code at https://github.com/cemoody/lda2vec/blob/master/lda2vec/preprocess.py#L85-L88 by one level. That solved the problem on Python 2.
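
For reference, a sketch of that pin (assuming pip; the model download command exists from spaCy 1.7 onward):

pip install spacy==1.9.0
python -m spacy download en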

@lovedatatiff commented May 11, 2018

I get the error

numpy.core._internal.AxisError: axis -1 is out of bounds for array of dimension 0

caused by this line in corpus.py:

specials = np.sort(self.specials.values())

which leads to the error in this line of corpus.py:

self.keys_loose, self.keys_counts, n_keys = self._loose_keys_ordered()

which then surfaces when I run preprocess.py (in the data folder) at the line

corpus.finalize()

Does anyone have any idea how to solve this? Thanks a lot!
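
This looks like a Python 3 dict-view problem rather than anything lda2vec-specific; the sketch below is a guess on that assumption (the dict contents are illustrative). np.sort() cannot iterate a dict_values view, so NumPy wraps it in a 0-d object array and then fails to find axis -1:

import numpy as np

specials = {'skip': -2, 'out_of_vocabulary': -1}

# Fails under Python 3: dict.values() is a view, which NumPy wraps
# in a 0-dimensional object array before trying to sort it.
# np.sort(specials.values())  # AxisError: axis -1 is out of bounds ...

# Materializing the view first gives np.sort a real 1-D array to work on:
print(np.sort(list(specials.values())))  # [-2 -1]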

@lovedatatiff

@Core00077 did you manage to run run.py successfully? I'm trying to reproduce the run for twentynews and am hitting quite a few issues here.

@CoreJa commented May 14, 2018

@lovedatatiff Sorry, I meant I've run preprocess.py successfully. Actually, my work PC doesn't have an Nvidia GPU, so I couldn't run the whole project, but I think I did it the right way. Since this project hasn't been maintained for a while, it's better to use lower-version dependencies like spacy==1.9.0.

If possible, I'll message you later once I've run it successfully.

@lovedatatiff commented May 14, 2018

@Core00077 Hey, that'd be amazing, thank you!
I don't have an Nvidia GPU either, and I'm not quite sure how to work around this since I'm quite new to this. Would you mind sharing your code with me once you've run it successfully? Looking forward to your reply! :)

@CoreJa commented May 14, 2018

@lovedatatiff Sure thing. But an Nvidia GPU is NECESSARY if you want to run it, since the project imports cupy. I have a notebook with an Nvidia GPU, but I don't have time right now; I'm working on my exams. I'll let you know when I've made some progress on this.

@lovedatatiff commented May 14, 2018 via email

@CoreJa commented May 14, 2018

Sorry, my notebook doesn't have a public IP address, and I'm not sure whether an AWS GPU would work or not, but basically it should.

@LizaKoz commented Jun 27, 2018

Hi everyone,

Did anybody solve this issue?
I have spacy==2.0.5 and I'm still getting this problem.
