
error with spacy #38

Open · openmotion opened this issue Jun 24, 2016 · 22 comments

@openmotion commented Jun 24, 2016

Hello, I have this error in hacker_news/data:
python preprocess.py
Traceback (most recent call last):
  File "preprocess.py", line 47, in <module>
    merge=True)
  File "build/bdist.linux-x86_64/egg/lda2vec/preprocess.py", line 76, in tokenize
    author_name = authors.categories
  File "spacy/tokens/doc.pyx", line 250, in noun_chunks (spacy/tokens/doc.cpp:8013)
  File "spacy/syntax/iterators.pyx", line 11, in english_noun_chunks (spacy/syntax/iterators.cpp:1559)
  File "spacy/tokens/doc.pyx", line 100, in spacy.tokens.doc.Doc.__getitem__ (spacy/tokens/doc.cpp:4890)
IndexError: list index out of range

@tokestermw

Probably this error? If so, try updating spaCy:

explosion/spaCy#375

@rchari51 commented Aug 8, 2016

The problem is indeed in spaCy. As suggested in the linked issue, the workaround is to write

for phrase in list(doc.noun_chunks):

instead of

for phrase in doc.noun_chunks:

The in-place merge() invalidates the iterator.
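
In context, a minimal sketch of that workaround inside lda2vec's merge loop (using the spaCy 1.x Span.merge() call quoted later in this thread):

# Materialize the generator up front so the in-place merges
# cannot invalidate the noun-chunk iterator mid-loop.
for phrase in list(doc.noun_chunks):
    if len(phrase) > 1:
        # spaCy 1.x signature: merge(tag, lemma, ent_type)
        phrase.merge(phrase.root.tag_, phrase.text,
                     phrase.root.ent_type_)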

@crawfordcomeaux commented Sep 22, 2016

@rchari51 This is what I get after manually making the changes to spacy/tokens/doc.pyx and lda2vec/preprocess.py:

Traceback (most recent call last):
  File "data/preprocess.py", line 47, in <module>
    merge=True)
  File "/home/ubuntu/lda2vec/lda2vec/preprocess.py", line 78, in tokenize
    while len(phrase) > 1 and phrase[0].dep_ not in bad_deps:
  File "spacy/tokens/span.pyx", line 54, in spacy.tokens.span.Span.__len__ (spacy/tokens/span.cpp:3817)
  File "spacy/tokens/span.pyx", line 97, in spacy.tokens.span.Span._recalculate_indices (spacy/tokens/span.cpp:4975)
IndexError: Error calculating span: Can't find end

@saravp commented Oct 19, 2016

@crawfordcomeaux - Were you able to resolve your issue? I ran into the same thing.

@crawfordcomeaux

@saravp Only with merge=False, which doesn't really fit my use case.
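
For what it's worth, that workaround is just a matter of the call site (argument names are taken from the tokenize signature quoted later in this thread; the max_length value is illustrative):

# Skipping noun-phrase merging sidesteps the Span errors entirely,
# at the cost of losing multi-word tokens like "gateway_drug".
data, vocab = tokenize(texts, max_length=10000, merge=False)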

@crawfordcomeaux

@saravp I just took a look at spaCy's issues to see if anything related to this stood out, and they just shipped version 1.0. Does anything change if you update spaCy?

@grivescorbett

I'm seeing this error as well, using spaCy from master (commit d8db648ebf70e4bddfe21cad50a34891e4b75154):

File "data/preprocess.py", line 47, in <module>
merge=True)
File "/Users/grivescorbett/projects/lda2vec/lda2vec/preprocess.py", line 78, in tokenize
while len(phrase) > 1 and phrase[0].dep_ not in bad_deps:
File "spacy/tokens/span.pyx", line 65, in spacy.tokens.span.Span.__len__ (spacy/tokens/span.cpp:4142)
File "spacy/tokens/span.pyx", line 130, in spacy.tokens.span.Span._recalculate_indices (spacy/tokens/span.cpp:5339)

@NilsRethmeier commented Dec 20, 2016

@grivescorbett @crawfordcomeaux @openmotion @saravp
@cemoody Could this be an indentation issue?
I think the code at https://github.com/cemoody/lda2vec/blob/master/lda2vec/preprocess.py#L85-L88 would make more sense unindented by one level; then this error disappears. I think the inner loop is messing with the spans, which is why people get this error.
Honnibal's current code does seem to have the unindent (see https://github.com/explosion/sense2vec/blob/master/bin/merge_text.py#L95-L96).

To reproduce the error:

s = u"""Marijuana is not the gateway drug alcohol is. I was introduced to alcohol at age of ten. I was introduced to marijuana at age of 14 . I was introduced to cocaine and crack at the age 17 & 18 . upon being introduced to crack I became addicted to crack & left marijuana alone."""
tokenize([s], max_length, skip=-2, attr=LOWER, merge=True)

This works for me now :)

import numpy as np
from spacy.en import English
from spacy.attrs import LOWER, LIKE_EMAIL, LIKE_URL


def tokenize(texts, max_length, skip=-2, attr=LOWER, merge=False, nlp=None,
             **kwargs):
    """Convert texts into fixed-width rows of token IDs, optionally
    merging noun phrases and named entities into single tokens."""
    if nlp is None:
        nlp = English()
    data = np.zeros((len(texts), max_length), dtype='int32')
    data[:] = skip
    bad_deps = ('amod', 'compound')
    for row, doc in enumerate(nlp.pipe(texts, **kwargs)):
        if merge:
            # from the spaCy blog, an example on how to merge
            # noun phrases into single tokens
            for phrase in doc.noun_chunks:
                # Only keep adjectives and nouns, e.g. "good ideas"
                while len(phrase) > 1 and phrase[0].dep_ not in bad_deps:
                    phrase = phrase[1:]
                if len(phrase) > 1:
                    # Merge the tokens, e.g. good_ideas
                    phrase.merge(phrase.root.tag_, phrase.text,
                                 phrase.root.ent_type_)
            # Iterate over named entities
            for ent in doc.ents:
                if len(ent) > 1:
                    # Merge them into single tokens
                    ent.merge(ent.root.tag_, ent.text, ent.label_)
        dat = doc.to_array([attr, LIKE_EMAIL, LIKE_URL]).astype('int32')
        if len(dat) > 0:
            dat = dat.astype('int32')
            msg = "Negative indices reserved for special tokens"
            assert dat.min() >= 0, msg
            # Replace email and URL tokens
            idx = (dat[:, 1] > 0) | (dat[:, 2] > 0)
            dat[idx] = skip
            length = min(len(dat), max_length)
            data[row, :length] = dat[:length, 0].ravel()
    uniques = np.unique(data)
    vocab = {v: nlp.vocab[v].lower_ for v in uniques if v != skip}
    vocab[skip] = '<SKIP>'
    return data, vocab

@spnichol commented Dec 23, 2017

I was having this issue in Python 2.7 and tried the above fixes; unfortunately, none of them solved the problem. I ended up trying it in Python 3.5 and it worked. It's definitely an issue with the conversion of those huge uint64 values to int32.
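
A minimal sketch of why that down-cast bites (the specific ID below is made up for illustration): spaCy stores lexeme IDs as unsigned 64-bit integers, and any ID above 2**31 - 1 wraps around when cast to int32, producing exactly the negative indices that trip the assertion in the tokenize code above.

import numpy as np

# Hypothetical lexeme ID larger than an int32 can hold (> 2**31 - 1).
big_id = np.array([8566208034543834098], dtype='uint64')

# The down-cast keeps only the low 32 bits, which can come out negative,
# violating "Negative indices reserved for special tokens".
print(big_id.astype('int32'))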

@bountrisv

Can you explain what you tried to make it work in Python 3? I'm still getting negative indices with both fixes.

@hirenaum97

File "/home/aum/PycharmProjects/learn_p/venv/src/lda2vec/lda2vec/preprocess.py", line 35, in tokenize
assert dat.min() >= 0, msg
AssertionError: Negative indices reserved for special tokens

what should i do?

@gracegcy

@hirenaum97 Hi, were you able to resolve the error? I got a similar one. Thanks.

@hirenaum97

Just change your spaCy version to 1.9.

@CoreJa commented May 6, 2018

Thanks to @hirenaum97, I've changed my spaCy version to 1.9 and taken @NilsRethmeier's advice, unindenting the code at https://github.com/cemoody/lda2vec/blob/master/lda2vec/preprocess.py#L85-L88 by one level. That solved the problem on Python 2.
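
For reference, a sketch of that pin (assuming pip; the model download command exists from spaCy 1.7 onward):

pip install spacy==1.9.0
python -m spacy download en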

@lovedatatiff commented May 11, 2018

I get the error

numpy.core._internal.AxisError: axis -1 is out of bounds for array of dimension 0

caused by this line in corpus.py:

specials = np.sort(self.specials.values())

which leads to the error in this line of corpus.py:

self.keys_loose, self.keys_counts, n_keys = self._loose_keys_ordered()

which then surfaces when I run preprocess.py (in the data folder) at the line

corpus.finalize()

Does anyone have any idea how to solve this? Thanks a lot!
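
This looks like a Python 3 dict-view problem rather than anything lda2vec-specific; the sketch below is a guess on that assumption (the dict contents are illustrative). np.sort() cannot iterate a dict_values view, so NumPy wraps it in a 0-d object array and then fails to find axis -1:

import numpy as np

specials = {'skip': -2, 'out_of_vocabulary': -1}

# Fails under Python 3: dict.values() is a view, which NumPy wraps
# in a 0-dimensional object array before trying to sort it.
# np.sort(specials.values())  # AxisError: axis -1 is out of bounds ...

# Materializing the view first gives np.sort a real 1-D array to work on:
print(np.sort(list(specials.values())))  # [-2 -1]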

@lovedatatiff

@Core00077 did you manage to run run.py successfully? I'm trying to reproduce the run for twentynews and am hitting quite a few issues here.

@CoreJa commented May 14, 2018

@lovedatatiff Sorry, I meant I've run preprocess.py successfully. Actually, my work PC doesn't have an Nvidia GPU, so I couldn't run the whole project, but I think I did it the right way. Since this project hasn't been maintained for a while, it's better to use lower-version dependencies like spacy==1.9.0.

If possible, I'll message you later once I've run it successfully.

@lovedatatiff commented May 14, 2018

@Core00077 Hey, that'd be amazing, thank you!
I don't have an Nvidia GPU either, and I'm not quite sure how to work around this since I'm quite new to this. Would you mind sharing your code with me once you've run it successfully? Looking forward to your reply! :)

@CoreJa commented May 14, 2018

@lovedatatiff Sure thing. But an Nvidia GPU is NECESSARY if you want to run it, since the project imports cupy. I have a notebook with an Nvidia GPU, but I don't have time right now; I'm working on my exams. I'll let you know when I've made some progress on this.

@lovedatatiff commented May 14, 2018 via email

@CoreJa commented May 14, 2018

Sorry, my notebook doesn't have a public IP address, and I'm not sure whether an AWS GPU would work or not, but basically it should.

@LizaKoz commented Jun 27, 2018

Hi everyone,

Did anybody solve this issue?
I have spacy==2.0.5 and I'm still getting this problem.
