
extractor.load_document (Spacy) limitation of 1000000 characters #231

Open
MainakMaitra opened this issue Oct 2, 2023 · 0 comments

While using extractor.load_document(), I am encountering this error:

ValueError: [E088] Text of length 1717453 exceeds maximum of 1000000. The parser and NER models require roughly 1GB of temporary memory per 100,000 characters in the input. This means long texts may cause memory allocation errors. If you're not using the parser or NER, it's probably safe to increase the nlp.max_length limit. The limit is in number of characters, so you can check whether your inputs are too long by checking len(text).

I referred to the related issue: #68

Code used:

def pke_topicrank(text):
    # initialize keyphrase extraction model, here TopicRank
    extractor = pke.unsupervised.TopicRank()

    # load the content of the document; here the document is expected to be a
    # simple text string and preprocessing is carried out using spacy
    
    #docs = list(nlp.pipe(text, batch_size=1000))
    extractor.load_document(input=text, language="en", \
                            normalization=None)

    # keyphrase candidate selection, in the case of TopicRank: sequences of nouns
    # and adjectives (i.e. `(Noun|Adj)*`)
    pos = {'NOUN', 'PROPN', 'ADJ'}
    extractor.candidate_selection(pos=pos)
    #extractor.candidate_selection()
    
    #grammar selection
    extractor.grammar_selection(grammar="NP: {<ADJ>*<NOUN|PROPN>+}")

    # candidate weighting, in the case of TopicRank: using a random walk algorithm
    extractor.candidate_weighting(threshold=0.74, method='average')

    # N-best selection, keyphrases contains the 10 highest scored candidates as
    # (keyphrase, score) tuples
    keyphrases = extractor.get_n_best(n=10, redundancy_removal=True, stemming=True)
    keyphrases = ', '.join(set([candidate for candidate, weight in keyphrases]))
    return keyphrases
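
For reference, a chunking workaround would probably sidestep the limit altogether. Below is a rough, untested sketch built on the pke_topicrank() function above; the pke_topicrank_chunked name and the chunk_size value are only illustrative, and the fixed-size split ignores sentence boundaries. Note that chunking likely changes the results, since TopicRank clusters candidates over the whole input.

def pke_topicrank_chunked(text, chunk_size=900_000):
    # split the text into chunks below spacy's default 1,000,000-character
    # limit and merge the per-chunk keyphrases
    all_keyphrases = set()
    for start in range(0, len(text), chunk_size):
        chunk = text[start:start + chunk_size]
        # pke_topicrank() returns a comma-separated string (see above)
        keyphrases = pke_topicrank(chunk)
        all_keyphrases.update(k.strip() for k in keyphrases.split(',') if k.strip())
    return ', '.join(all_keyphrases)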

Solutions tried:

  • Increasing nlp.max_length to a higher value manually while loading the spacy pre-trained model. I have installed spacy following the GPU steps listed on the website: https://spacy.io/usage#gpu

# Install spacy
pip install -U pip setuptools wheel
pip install -U 'spacy[cuda-autodetect]'
python -m spacy download en_core_web_sm
import spacy
activated = spacy.prefer_gpu()
nlp = spacy.load('en_core_web_sm',exclude=['parser', 'tagger','ner'])
# nlp.add_pipe(nlp.create_pipe('sentencizer'))
nlp.max_length = 2000000
  • Passing the input text through the loaded nlp model:
extractor = pke.unsupervised.TopicRank()
# nlp.add_pipe('sentencizer')
extractor.load_document(input=nlp(text), language="en", \
                        normalization='none')
pos = {'NOUN', 'PROPN', 'ADJ'}
extractor.candidate_selection(pos=pos)
extractor.candidate_weighting( threshold=0.74, method='average', heuristic='none')
keyphrases = extractor.get_n_best(n=10, redundancy_removal=True, stemming=False)
keyphrases = ', '.join(set([candidate for candidate, weight in keyphrases]))

resulting in this error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[18], line 3
      1 extractor = pke.unsupervised.TopicRank()
      2 # nlp.add_pipe('sentencizer')
----> 3 extractor.load_document(input=nlp(text), language="en", \
      4                         normalization='none')
      5 pos = {'NOUN', 'PROPN', 'ADJ'}
      6 extractor.candidate_selection(pos=pos)

File ~/.conda/envs/mainak_multi_intent/lib/python3.9/site-packages/pke/base.py:94, in LoadFile.load_document(self, input, language, stoplist, normalization, spacy_model)
     92 if isinstance(input, spacy.tokens.doc.Doc):
     93     parser = SpacyDocReader()
---> 94     sents = parser.read(spacy_doc=input)
     95 # check whether input is a string
     96 elif isinstance(input, str):

File ~/.conda/envs/mainak_multi_intent/lib/python3.9/site-packages/pke/readers.py:124, in SpacyDocReader.read(self, spacy_doc)
    122 def read(self, spacy_doc):
    123     sentences = []
--> 124     for sentence_id, sentence in enumerate(spacy_doc.sents):
    125         sentences.append(Sentence(
    126             words=[token.text for token in sentence],
    127             pos=[token.pos_ or token.tag_ for token in sentence],
   (...)
    132             }
    133         ))
    134     return sentences

File ~/.conda/envs/mainak_multi_intent/lib/python3.9/site-packages/spacy/tokens/doc.pyx:923, in sents()

ValueError: [E030] Sentence boundaries unset. You can add the 'sentencizer' component to the pipeline with: `nlp.add_pipe('sentencizer')`. Alternatively, add the dependency parser or sentence recognizer, or set sentence boundaries by setting `doc[i].is_sent_start`.

When I add the sentencizer, load_document() no longer raises, but the extractor returns no keyphrases.
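
One possible explanation for the empty result: the model above was loaded with exclude=['parser', 'tagger', 'ner'], and with the tagger excluded token.pos_ and token.tag_ stay empty, so candidate_selection(pos={'NOUN', 'PROPN', 'ADJ'}) has nothing to match even once the sentencizer fixes the sentence-boundary error. Below is a rough, untested sketch that keeps the POS tagger and drops only the memory-heavy parser and NER; the spacy_model keyword is an assumption based on the load_document signature visible in the traceback above.

import spacy
import pke

# keep the tagger/attribute_ruler so token.pos_ is populated; exclude only the
# parser and NER, which are the components the E088 message warns about
nlp = spacy.load('en_core_web_sm', exclude=['parser', 'ner'])
nlp.add_pipe('sentencizer')      # sentence boundaries, since the parser is excluded
nlp.max_length = 2_000_000       # above len(text) == 1,717,453

extractor = pke.unsupervised.TopicRank()

# either pass a pre-parsed Doc ...
extractor.load_document(input=nlp(text), language='en', normalization=None)
# ... or, possibly, hand the customised pipeline to pke instead:
# extractor.load_document(input=text, language='en', normalization=None, spacy_model=nlp)

extractor.candidate_selection(pos={'NOUN', 'PROPN', 'ADJ'})
extractor.candidate_weighting(threshold=0.74, method='average')
keyphrases = extractor.get_n_best(n=10, redundancy_removal=True, stemming=False)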
