disable unnecessary spacy pipeline components #121

Merged
merged 2 commits into from
Apr 28, 2020


Contributor

@sp1thas sp1thas commented Apr 4, 2020

Optimize pke.readers.RawTextReader by disabling unnecessary components (ner, textcat) in the spaCy pipeline.

related issue: #118

@sp1thas sp1thas changed the title disable unnecessery spacy pipeline components disable unnecessary spacy pipeline components Apr 4, 2020
Collaborator

ygorg commented Apr 6, 2020

Thanks, though the parser is still there. According to spacy/pipeline, the parser is needed for sentence tokenisation, but it also computes dependencies (which pke does not use), which seems to be a slow process.
I ran experiments to measure the speed gain across three configurations: loading spaCy as is, disabling (ner, textcat), and disabling (ner, textcat, parser) while re-enabling sentence tokenisation with the sentencizer.

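import spacy
# 1. Full pipeline
nlp = spacy.load('fr')
# 2. Without ner and textcat
nlp = spacy.load('fr', disable=['ner', 'textcat'])
# 3. Without ner, textcat and parser; the sentencizer restores sentence splitting
nlp = spacy.load('fr', disable=['ner', 'textcat', 'parser'])
nlp.add_pipe(nlp.create_pipe('sentencizer'))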

The text processed is http://abu.cnam.fr/cgi-bin/donner?nddp1 (only the first 999999 characters, to stay within spaCy's length limit). The times reported are the mean of 5 runs, in seconds.

| Command | Loading spaCy (s) | Processing document (s) |
|---|---|---|
| spacy.load('fr') | 4.40 | 33.48 |
| + disable=['ner', 'textcat'] | 4.36 | 23.96 |
| + disable=['ner', 'textcat', 'parser'] + enable('sentencizer') | 4.31 | 12.08 |
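
The script used for these measurements is not shown in this thread. As a rough sketch of how such timings could be collected, the helper below averages load and processing time over several runs and truncates the input to 999999 characters. The names `benchmark`, `SPACY_MAX_CHARS`, `load_light` and the file name `nddp1.txt` are illustrative assumptions, and the commented usage assumes the spaCy 2.x API used above.

```python
import time
from statistics import mean

# Assumption: stay just under spaCy's default nlp.max_length of 1,000,000.
SPACY_MAX_CHARS = 999_999


def benchmark(load, text, runs=5):
    """Return (mean load time, mean processing time) in seconds.

    `load` is a zero-argument callable returning a pipeline; the pipeline
    is then called on the text, the way a spaCy `nlp` object would be.
    """
    text = text[:SPACY_MAX_CHARS]
    load_times, process_times = [], []
    for _ in range(runs):
        # Time the pipeline construction on its own.
        start = time.perf_counter()
        nlp = load()
        load_times.append(time.perf_counter() - start)

        # Time the document processing separately.
        start = time.perf_counter()
        nlp(text)
        process_times.append(time.perf_counter() - start)
    return mean(load_times), mean(process_times)


# Hypothetical usage with spaCy 2.x (not executed here):
#
# import spacy
#
# def load_light():
#     nlp = spacy.load('fr', disable=['ner', 'textcat', 'parser'])
#     nlp.add_pipe(nlp.create_pipe('sentencizer'))
#     return nlp
#
# with open('nddp1.txt', encoding='utf-8') as f:  # hypothetical local copy
#     print(benchmark(load_light, f.read()))
```

Timing the load and the processing separately matches the two columns of the table, and reloading the pipeline on every run keeps the runs independent of each other.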

Contributor Author

sp1thas commented Apr 6, 2020

Well, this is a huge improvement.

When I tried to remove the parser (without enabling the sentencizer), the tests failed, so I thought the parser was required.

Therefore, your recommendation is totally right.

@ygorg ygorg linked an issue Apr 6, 2020 that may be closed by this pull request
Collaborator

ygorg commented Apr 28, 2020

Just ran pytest and it passes, so I'm merging. Thanks!

@ygorg ygorg merged commit 7eca434 into boudinfl:master Apr 28, 2020
Development

Successfully merging this pull request may close these issues.

Optimize pke.readers.RawTextReader