spaCy + StanfordNLP
This package wraps the StanfordNLP library, so you can use Stanford's models as a spaCy pipeline. The Stanford models achieved top accuracy in the CoNLL 2017 and 2018 shared tasks, which involve tokenization, part-of-speech tagging, morphological analysis, lemmatization and labelled dependency parsing in 58 languages.
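Using this wrapper, you'll be able to use the following annotations, computed by your pre-trained StanfordNLP model:
- Statistical tokenization (reflected in the `Doc` and its tokens)
- Lemmatization (`token.lemma` and `token.lemma_`)
- Part-of-speech tagging (`token.tag`, `token.tag_`, `token.pos`, `token.pos_`)
- Dependency parsing (`token.dep`, `token.dep_`, `token.head`)
- Sentence segmentation (`doc.sents`)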
```bash
pip install spacy-stanfordnlp
```
Make sure to also install one of the pre-trained StanfordNLP models. It's recommended to run StanfordNLP on Python 3.6.8+ or Python 3.7.2+.
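If you want to check the interpreter before installing, the version recommendation above can be expressed as a small test. This is only a sketch: `is_recommended_python` is a hypothetical helper name, and treating releases newer than 3.7 as acceptable is an assumption, not something the recommendation states.

```python
import sys


def is_recommended_python(version_info=sys.version_info):
    """Return True for Python 3.6.8+ in the 3.6 series or 3.7.2+ in the 3.7 series."""
    major, minor, micro = version_info[:3]
    if major != 3:
        return False
    if minor == 6:
        return micro >= 8
    if minor == 7:
        return micro >= 2
    # Assumption: anything newer than the 3.7 series also counts as recommended
    return minor > 7


if __name__ == "__main__":
    print("Recommended Python version:", is_recommended_python())
```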
📖 Usage & Examples
```python
import stanfordnlp
from spacy_stanfordnlp import StanfordNLPLanguage

snlp = stanfordnlp.Pipeline(lang="en")
nlp = StanfordNLPLanguage(snlp)

doc = nlp("Barack Obama was born in Hawaii. He was elected president in 2008.")
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.dep_)
```
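Sentence boundaries and dependency heads are set on the `Doc` as well, so they can be read in the usual spaCy way. The following is a minimal sketch, not part of the package: `dependency_rows` is a hypothetical helper that relies only on `doc.sents`, `token.text`, `token.dep_` and `token.head`.

```python
def dependency_rows(doc):
    """Collect (sentence index, token text, dependency label, head text) rows.

    Works on any object that exposes spaCy's `doc.sents`, `token.text`,
    `token.dep_` and `token.head` attributes.
    """
    rows = []
    for sent_index, sent in enumerate(doc.sents):
        for token in sent:
            rows.append((sent_index, token.text, token.dep_, token.head.text))
    return rows
```

With the `nlp` object from the example above, `for row in dependency_rows(nlp("Some text.")): print(row)` would print one row per token, grouped by sentence.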
If language data for the given language is available in spaCy, the respective language class will be used as the base for the `nlp` object, for example `English()`. This lets you use spaCy's lexical attributes like `is_stop` or `like_num`. The `nlp` object follows the same API as any other spaCy `Language` class, so you can visualize the `Doc` objects with displaCy, add custom components to the pipeline, use the rule-based matcher and do pretty much anything else you'd normally do in spaCy.
```python
# Access spaCy's lexical attributes
print([token.is_stop for token in doc])
print([token.like_num for token in doc])

# Visualize dependencies
from spacy import displacy
displacy.serve(doc)  # or displacy.render if you're in a Jupyter notebook

# Efficient processing with nlp.pipe
for doc in nlp.pipe(["Lots of texts", "Even more texts", "..."]):
    print(doc.text)

# Combine with your own custom pipeline components
def custom_component(doc):
    # Do something to the doc here
    return doc

nlp.add_pipe(custom_component)

# Serialize it to a numpy array
np_array = doc.to_array(['ORTH', 'LEMMA', 'POS'])
```
Experimental: Mixing and matching pipeline components
By default, the `nlp` object's pipeline will be empty, because all attributes are computed once and set in the custom `Tokenizer`. But since it's a regular `nlp` object, you can add your own components to the pipeline.
For example, the entity recognizer from one of spaCy's pre-trained models:
```python
import spacy
import stanfordnlp
from spacy_stanfordnlp import StanfordNLPLanguage

snlp = stanfordnlp.Pipeline(lang="en", models_dir="./models")
nlp = StanfordNLPLanguage(snlp)

# Load spaCy's pre-trained en_core_web_sm model, get the entity recognizer and
# add it to the StanfordNLP model's pipeline
spacy_model = spacy.load("en_core_web_sm")
ner = spacy_model.get_pipe("ner")
nlp.add_pipe(ner)

doc = nlp("Barack Obama was born in Hawaii. He was elected president in 2008.")
print([(ent.text, ent.label_) for ent in doc.ents])
# [('Barack Obama', 'PERSON'), ('Hawaii', 'GPE'), ('2008', 'DATE')]
```
You could also add and train your own custom text classification component.
Advanced: serialization and entry points
The spaCy `nlp` object created by `StanfordNLPLanguage` exposes its language as `stanfordnlp` plus the language code, e.g. `stanfordnlp_en`.
```python
from spacy.util import get_lang_class

lang_cls = get_lang_class("stanfordnlp_en")
```
Normally, the above would fail because spaCy doesn't include a language class `stanfordnlp_en`. But because this package exposes a `spacy_languages` entry point in its `setup.py` that points to `StanfordNLPLanguage`, spaCy knows how to initialize it.
This means that saving to and loading from disk works:
```python
import stanfordnlp
from spacy_stanfordnlp import StanfordNLPLanguage

snlp = stanfordnlp.Pipeline(lang="en")
nlp = StanfordNLPLanguage(snlp)
nlp.to_disk("./stanfordnlp-spacy-model")
```
Additional arguments on `spacy.load` are automatically passed down to the language class and pipeline components. So when loading the saved model, you can pass in the `snlp` argument:
```python
import spacy
import stanfordnlp

snlp = stanfordnlp.Pipeline(lang="en")
nlp = spacy.load("./stanfordnlp-spacy-model", snlp=snlp)
```
Note that this will not save any model data by default. The StanfordNLP models are very large, so for now, this package expects that you load them separately.
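To see which language names are registered through the `spacy_languages` entry point group in your environment, you can list them with the standard library. This is a rough sketch rather than anything this package provides: `registered_languages` is a hypothetical helper, it assumes Python 3.8+ (where `importlib.metadata` is available), and the names it prints depend on what is installed.

```python
from importlib.metadata import entry_points


def registered_languages(group="spacy_languages"):
    """Map entry point names in `group` to the targets they point to (as strings)."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+ returns an object with a select() method
        selected = eps.select(group=group)
    else:
        # Python 3.8 and 3.9 return a plain dict keyed by group name
        selected = eps.get(group, [])
    return {ep.name: ep.value for ep in selected}


if __name__ == "__main__":
    for name, target in sorted(registered_languages().items()):
        print(name, "->", target)
```

If nothing is registered under the group, the helper returns an empty dict.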