error with spacy #38
Comments
Probably this error? So I think you should update spaCy.
The problem is indeed in spaCy: merge() in place invalidates the iterator. As suggested there, the workaround is to exhaust the iterator before merging.
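The exact snippet did not survive in this comment, but a minimal sketch of that workaround (assuming the usual fix of materializing the noun chunks before mutating the doc, placed inside the merge branch of tokenize()) looks like this:

    # Collect the noun chunks first, since merge() mutates the doc and would
    # otherwise invalidate the generator being iterated over.
    for phrase in list(doc.noun_chunks):
        while len(phrase) > 1 and phrase[0].dep_ not in bad_deps:
            phrase = phrase[1:]
        if len(phrase) > 1:
            phrase.merge(phrase.root.tag_, phrase.text, phrase.root.ent_type_)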
@rchari51 This is what I get after manually making the changes to
@crawfordcomeaux Were you able to resolve your issue? I ran into the same thing.
@saravp Only with
@saravp I just took a look at spaCy's issues to see if anything related to this stuck out, and they just shipped version 1.0. Does anything change if you update spaCy?
I'm seeing this error as well, using spacy from master. Commit d8db648ebf70e4bddfe21cad50a34891e4b75154
@grivescorbett @crawfordcomeaux @openmotion @saravp To reproduce the error:

s = u"""Marijuana is not the gateway drug alcohol is. I was introduced to alcohol at age of ten. I was introduced to marijuana at age of 14 . I was introduced to cocaine and crack at the age 17 & 18 . upon being introduced to crack I became addicted to crack & left marijuana alone."""
tokenize([s], max_length, skip=-2, attr=LOWER, merge=True)

This works for me now :)

# Imports as in lda2vec/preprocess.py (spaCy 1.x)
from spacy.en import English
from spacy.attrs import LOWER, LIKE_EMAIL, LIKE_URL
import numpy as np


def tokenize(texts, max_length, skip=-2, attr=LOWER, merge=False, nlp=None,
             **kwargs):
    """Convert a list of texts into a (n_texts, max_length) array of token
    ids, optionally merging noun phrases and named entities into single
    tokens."""
    if nlp is None:
        nlp = English()
    data = np.zeros((len(texts), max_length), dtype='int32')
    data[:] = skip
    bad_deps = ('amod', 'compound')
    for row, doc in enumerate(nlp.pipe(texts, **kwargs)):
        if merge:
            # from the spaCy blog, an example on how to merge
            # noun phrases into single tokens
            for phrase in doc.noun_chunks:
                # Only keep adjectives and nouns, e.g. "good ideas"
                while len(phrase) > 1 and phrase[0].dep_ not in bad_deps:
                    phrase = phrase[1:]
                if len(phrase) > 1:
                    # Merge the tokens, e.g. good_ideas
                    phrase.merge(phrase.root.tag_, phrase.text,
                                 phrase.root.ent_type_)
            # Iterate over named entities
            for ent in doc.ents:
                if len(ent) > 1:
                    # Merge them into single tokens
                    ent.merge(ent.root.tag_, ent.text, ent.label_)
        dat = doc.to_array([attr, LIKE_EMAIL, LIKE_URL]).astype('int32')
        if len(dat) > 0:
            msg = "Negative indices reserved for special tokens"
            assert dat.min() >= 0, msg
            # Replace email and URL tokens
            idx = (dat[:, 1] > 0) | (dat[:, 2] > 0)
            dat[idx] = skip
            length = min(len(dat), max_length)
            data[row, :length] = dat[:length, 0].ravel()
    uniques = np.unique(data)
    vocab = {v: nlp.vocab[v].lower_ for v in uniques if v != skip}
    vocab[skip] = '<SKIP>'
    return data, vocab
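For completeness, a small usage sketch (assuming the imports above; max_length is an arbitrary value here, not one taken from the project):

    max_length = 100
    data, vocab = tokenize([s], max_length, merge=True)
    print(data.shape)   # (1, 100); unused positions are padded with the skip value
    print(vocab[-2])    # '<SKIP>', since skip defaults to -2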
I was having this issue in Python 2.7 and tried the above fixes. Unfortunately, none of them solved the problem. I ended up trying it in Python 3.5 and it worked. Definitely an issue with the conversion of those huge uint64 values to int32.
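To make that concrete, here is a small illustration (the hash values are made-up examples, not taken from the project) of why the int32 cast can produce the negative indices that trip the assertion in tokenize():

    import numpy as np

    # Newer spaCy returns 64-bit hash values for string attributes; truncating
    # them to 32 bits frequently yields negative numbers, which the
    # "Negative indices reserved for special tokens" assertion rejects.
    hashes = np.array([5097672513440128799, 12510949447758279278], dtype=np.uint64)
    print(hashes.astype('int32'))  # only the low 32 bits survive; values may be negative

    # One workaround is to remap the raw 64-bit ids to small consecutive ints
    # before any int32 cast:
    compact = {h: i for i, h in enumerate(np.unique(hashes))}
    print([compact[h] for h in hashes])  # [0, 1]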
Can you explain what you tried to make it work in Python 3? I am still getting negative indices with both fixes.
File "/home/aum/PycharmProjects/learn_p/venv/src/lda2vec/lda2vec/preprocess.py", line 35, in tokenize what should i do? |
@hirenaum97 Hi, were you able to resolve the error? I got a similar one. Thanks.
Just change the spaCy version to 1.9.
Thanks to @hirenaum97, I've changed my spaCy version to 1.9 and followed @NilsRethmeier's advice to unindent the code at https://github.com/cemoody/lda2vec/blob/master/lda2vec/preprocess.py#L85-L88 by one level. That solved the problem on Python 2.
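For anyone landing here later, a sketch of what that unindent amounts to (assuming those lines are the named-entity merge loop, as in the function posted above): the loop over doc.ents becomes a sibling of the noun-chunk loop instead of being nested inside it, so it runs once per document:

    if merge:
        for phrase in doc.noun_chunks:
            while len(phrase) > 1 and phrase[0].dep_ not in bad_deps:
                phrase = phrase[1:]
            if len(phrase) > 1:
                phrase.merge(phrase.root.tag_, phrase.text,
                             phrase.root.ent_type_)
        # After unindenting by one level, this runs once per document,
        # after all noun chunks have been merged.
        for ent in doc.ents:
            if len(ent) > 1:
                ent.merge(ent.root.tag_, ent.text, ent.label_)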
I have the error
caused by the line in corpus.py,
which leads to the error in this line in corpus.py,
which then leads to the error stated when I'm running the preprocessing.py (in the data folder), in line
Does anyone have any idea how to solve this? Thanks a lot!
@Core00077 Did you manage to successfully run run.py? I'm trying to duplicate the run for twentynews and am experiencing quite a few issues here.
@lovedatatiff Sorry, I meant I've run preprocess.py successfully. My work PC doesn't have an NVIDIA GPU, so I couldn't run the whole project, but I think I did it the right way. Since this project hasn't been maintained for a while, it's better to use lower-version dependencies like spacy==1.9.0 if possible. I'll message you when I've successfully run it.
@Core00077 Hey, that'd be amazing! Thank you!
@lovedatatiff Sure thing. But an NVIDIA GPU is NECESSARY if you want to run it, since it imports cupy. I have a notebook with an NVIDIA GPU but I don't have time right now; I am working on my exams. I will let you know when I've made some progress on this.
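As a quick sanity check before attempting a full run, something like this (a sketch; it assumes cupy is already installed) confirms whether a CUDA device is visible:

    import cupy

    # getDeviceCount() returns the number of CUDA devices cupy can see;
    # anything greater than zero means the GPU-dependent code can run.
    print(cupy.cuda.runtime.getDeviceCount())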
That's amazing! I have an AWS deep learning GPU instance running Ubuntu; would that work? Would you also mind sharing your notebook with me, and I'll see if I can make it work?
Sorry, my notebook doesn't have a public IP address. I'm not sure whether the AWS GPU would work or not, but basically it should work.
Hi everyone, did anybody solve this issue?
Hello, I have this error in hacker_news/data when running python preprocess.py:
Traceback (most recent call last):
File "preprocess.py", line 47, in
merge=True)
File "build/bdist.linux-x86_64/egg/lda2vec/preprocess.py", line 76, in tokenize
author_name = authors.categories
File "spacy/tokens/doc.pyx", line 250, in noun_chunks (spacy/tokens/doc.cpp:8013)
File "spacy/syntax/iterators.pyx", line 11, in english_noun_chunks (spacy/syntax/iterators.cpp:1559)
File "spacy/tokens/doc.pyx", line 100, in spacy.tokens.doc.Doc.getitem (spacy/tokens/doc.cpp:4890)
IndexError: list index out of range