*This notebook is released under a Creative Commons Attribution 4.0 International license ([https://creativecommons.org/licenses/by/4.0/](https://creativecommons.org/licenses/by/4.0/)).*

# LAK 2022 spaCy demo

[spaCy](https://spacy.io) is a general-purpose, opinionated, high-performance, and very modular NLP toolkit for Python.  It has various bindings in other languages, including R (via [spacyr](https://spacyr.quanteda.io/articles/using_spacyr.html)) and Julia (via the still-experimental [spaCy.jl](https://spacy.io/universe/project/spaCy.jl)).  This notebook contains a general demo of some of its most useful parts.

**This notebook should be run inside a dedicated Anaconda environment.**  It will use the `conda` command-line tool to install all needed dependencies.

## Notebook/environment setup

Let's get the important stuff installed and set up.

In [1]:
import sys

spaCy has support for GPU acceleration.  Run this cell if you have a CUDA-compatible NVIDIA GPU in your system.  If you're not sure, skip this.

In [None]:
!conda install --yes --prefix {sys.prefix} \
    pytorch torchvision torchaudio cudatoolkit=11.3 conda-forge::cupy \
    -c pytorch

Install the other libraries required for running this notebook (spaCy itself is installed in the Quickstart section below):

In [None]:
!conda install --yes --prefix {sys.prefix} -c conda-forge scikit-learn tqdm gensim

Some setup for things we'll be using throughout the notebook.

In [2]:
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from tqdm.notebook import tqdm as _tqdm

def tqdm(*args, **kwargs):
    return _tqdm(*args, ncols=1000, **kwargs)

def fit_random_forest(x, y):
    # Generate a train-test split, fit a Random Forest to the train split
    # with default parameters, and evaluate its score on the test set.
    train_x, test_x, train_y, test_y = train_test_split(
        x, y,
        train_size=0.8,
        stratify=y,
        random_state=0,
    )
    clf = RandomForestClassifier(n_jobs=2).fit(train_x, train_y)
    print(f"{type(clf).__name__} accuracy on test set: {clf.score(test_x, test_y):.2%}")

np.set_printoptions(threshold=100, linewidth=100)

# spaCy Quickstart

Installing spaCy is easy:

In [None]:
# Through conda--recommended
!conda install --yes --prefix {sys.prefix} -c conda-forge spacy

# Through pip--running on CPU
# !{sys.executable} -m pip install -U spacy

The core of spaCy is the *model:* a processing and annotation pipeline for text.

spaCy has [a lot of pre-trained models](https://spacy.io/models) available in several languages that you can easily download and start using.  All the models are drop-in replacements for one another: just download a new one, change what model is being loaded, and that's it.

In [None]:
# Download a spaCy model.
# Replace en_core_web_lg with whatever model you want.
!{sys.executable} -m spacy download en_core_web_lg

In [3]:
# Load the model for use.
import spacy
nlp = spacy.load("en_core_web_lg")



## Running the Pipeline

To run a piece of text through the model, just call the model (like a function) on the string.

In [4]:
text = (
    "Grace Brewster Murray Hopper "
    "was an American computer scientist and United States Navy rear admiral. One "
    "of the first programmers of the Harvard Mark I computer, she was a pioneer of "
    "computer programming who invented one of the first linkers. Hopper was the first "
    "to devise the theory of machine-independent programming languages, and the "
    "FLOW-MATIC programming language she created using this theory was later extended "
    "to create COBOL, an early high-level programming language still in use today."
)
doc = nlp(text)

The `doc` object now contains a spaCy `Doc`, which is a tokenized and annotated version of the text.  We can interact with this object just like a Python list to access the token-level annotations.  The model is a pipeline of text transformation and annotation steps:

In [5]:
for step_name, step_fn in nlp.pipeline:
    print(step_name)

tok2vec
tagger
parser
attribute_ruler
lemmatizer
ner
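
To make the "just like a Python list" point concrete, here's a minimal sketch.  It assumes only that spaCy is installed: `spacy.blank("en")` builds a tokenizer-only English pipeline, so no model download is needed.  The names `blank_nlp` and `sample_doc` are made up for this example.

```python
import spacy

# A blank English pipeline contains only the tokenizer (no tagger, parser, or NER).
blank_nlp = spacy.blank("en")
sample_doc = blank_nlp("Grace Hopper invented one of the first linkers.")

print(len(sample_doc))        # number of tokens in the Doc
print(sample_doc[0].text)     # indexing returns a single Token
print(sample_doc[1:3].text)   # slicing returns a Span
print([tok.text for tok in sample_doc if tok.is_alpha])  # iterate like a list
```

Indexing gives back `Token` objects and slicing gives back `Span` objects, both of which keep a reference to the parent `Doc`.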


### Some of the available token-level annotations

In [6]:
print(f"{'TOKEN':<15}{'LEMMA':<15}{'PART OF SPEECH':<16}{'STOPWORD?':<10}"
      f"{'SYNTACTIC ROLE':<16}{'SYNTACTIC HEAD':<16}{'MORPHOLOGY'}")
for tok in doc[:10]:
    print(f"{tok.text:<15}{tok.lemma_:<15}{tok.pos_:<16}{tok.is_stop!s:<10}"
          f"{tok.dep_:<16}{tok.head.text:<16}{tok.morph}")

TOKEN          LEMMA          PART OF SPEECH  STOPWORD? SYNTACTIC ROLE  SYNTACTIC HEAD  MORPHOLOGY
Grace          Grace          PROPN           False     compound        Hopper          Number=Sing
Brewster       Brewster       PROPN           False     compound        Hopper          Number=Sing
Murray         Murray         PROPN           False     compound        Hopper          Number=Sing
Hopper         Hopper         PROPN           False     nsubj           was             Number=Sing
was            be             AUX             True      ROOT            was             Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin
an             an             DET             True      det             scientist       Definite=Ind|PronType=Art
American       american       ADJ             False     amod            scientist       Degree=Pos
computer       computer       NOUN            False     compound        scientist       Number=Sing
scientist      scientist      NOUN            

### Named entity recognition

In [7]:
print(f"{'ENTITY':<30}ENTITY TYPE")
for ent in doc.ents:
    print(f"{ent.text:<30}{ent.label_} ({spacy.explain(ent.label_)})")

ENTITY                        ENTITY TYPE
Grace Brewster                ORG (Companies, agencies, institutions, etc.)
Murray Hopper                 PERSON (People, including fictional)
American                      NORP (Nationalities or religious or political groups)
United States Navy            ORG (Companies, agencies, institutions, etc.)
One                           CARDINAL (Numerals that do not fall under another type)
one                           CARDINAL (Numerals that do not fall under another type)
first                         ORDINAL ("first", "second", etc.)
Hopper                        ORG (Companies, agencies, institutions, etc.)
first                         ORDINAL ("first", "second", etc.)
FLOW-MATIC                    ORG (Companies, agencies, institutions, etc.)
COBOL                         ORG (Companies, agencies, institutions, etc.)
today                         DATE (Absolute or relative dates or periods)


### Sentence tokenization

In [8]:
for sent in doc.sents:
    print(sent)
    print()

Grace Brewster Murray Hopper was an American computer scientist and United States Navy rear admiral.

One of the first programmers of the Harvard Mark I computer, she was a pioneer of computer programming who invented one of the first linkers.

Hopper was the first to devise the theory of machine-independent programming languages, and the FLOW-MATIC programming language she created using this theory was later extended to create COBOL, an early high-level programming language still in use today.



### Similarity queries

Similarities between `Doc` objects typically fall between 0 and 1, with higher values indicating greater similarity.  The similarity is based on the cosine similarity between the vectors of the two texts.

In [9]:
dog = nlp("dog")
cat = nlp("cat")
story = nlp("My pet won't stop clawing the couch!")

print(dog.similarity(cat))
print(story.similarity(dog))
print(story.similarity(cat))

0.8016854705531046
0.6165512506011797
0.600524898806014


### Text vectorization

spaCy models assign vectors to individual tokens, spans of tokens, and whole documents, and store them in the `.vector` attribute.  Different models take different approaches to vectorization: `en_core_web_lg` uses static GloVe vectors trained on Common Crawl; `*_trf` models use contextual vectors generated by Transformer neural networks; `*_sm` and `*_md` use token

(The similarity queries we just saw are simply the cosine similarities between the vectors of the two pieces of text.)

In [10]:
print(f"{doc[0]}: {doc[0].vector}")
print(f"{doc[1]}: {doc[1].vector}")
print("...")
print(f"Whole document: ", doc.vector)

Grace: [-0.25481  0.4372   0.21204 ...  0.18271 -0.45479 -0.18673]
Brewster: [-0.56768  -0.26426   0.089199 ...  0.086851 -0.77606   0.24845 ]
...
Whole document:  [-0.01057081  0.12500678  0.02788484 ...  0.04178689 -0.10531644  0.08284231]
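
Here's a rough sketch of what happens under the hood, using plain NumPy and made-up three-dimensional vectors instead of the model's 300-dimensional ones.  With static vectors, a `Doc`'s vector defaults to the average of its token vectors, and similarity is the cosine of the angle between two vectors.  The `cosine_similarity` helper is written for this example; it isn't a spaCy function.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0:
        return 0.0
    return float(a @ b / denom)

# Toy "token vectors" for a two-token document (made-up numbers).
toy_token_vectors = np.array([
    [1.0, 0.0, 2.0],
    [3.0, 0.0, 0.0],
])
# The document vector is the element-wise mean of the token vectors.
toy_doc_vector = toy_token_vectors.mean(axis=0)

print(toy_doc_vector)
print(cosine_similarity(toy_doc_vector, toy_token_vectors[0]))
print(cosine_similarity([1, 0], [0, 1]))   # orthogonal vectors
print(cosine_similarity([1, 0], [-1, 0]))  # opposite vectors
```

Note that cosine similarity can be negative in principle, which is why scores only *usually* land between 0 and 1.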


## Processing lots of text at once

`nlp.pipe()` is the easiest way to process lots of texts at once.  It supports parallelism and lazy-loading of the texts, and it returns a generator--so it's perfect for processing more texts than you can fit in RAM at once.  We can also disable processing steps that we don't want to apply to our texts, which can speed things up a lot for large document collections.  We can also specify a batch size to control how many documents are processed at once (this is more important when the model is running on a GPU, not so much when running on CPU).

Note: when using multiple processes, there'll be a pretty big spike in memory use, and a small delay before things start running.  The model needs to be copied into each worker process that gets spawned, which can take some time.

In [11]:
my_texts = [
    "I'm a document!",
    "I, too, am a document!",
    "Look at that, another document.",
    "Who keeps putting all these documents here?"
]

my_processed_texts = nlp.pipe(
    my_texts,
    n_process=1,           # single worker process
    batch_size=256,        # buffer 256 documents per worker
    disable=[              # Processing steps to disable/skip--for speed
        "tok2vec",         # token vectorization
        "parser",          # syntax parser
        "tagger",          # part-of-speech tagger
        "ner",             # named entity recognition
        "attribute_ruler", # various rules-based transformations
        "lemmatizer",      # lemmatization
    ]
)

# .pipe() returns a generator...
print(my_processed_texts)

# ...so we have to call list() or explicitly iterate through it
# to get processed documents back.
print(list(my_processed_texts))

<generator object Language.pipe at 0x000001F959D88C80>
[I'm a document!, I, too, am a document!, Look at that, another document., Who keeps putting all these documents here?]
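
Because `nlp.pipe()` accepts any iterable, we can feed it a generator and never hold the full collection in memory.  The sketch below also uses the `as_tuples=True` option, which lets each text carry along some context (an ID, a label, etc.) that comes back paired with its `Doc`.  The `stream_reviews` function and its records are hypothetical stand-ins for reading from a large file, and `blank_nlp` is a tokenizer-only pipeline so the example runs without a downloaded model.

```python
import spacy

def stream_reviews():
    # Hypothetical stand-in for lazily reading (text, metadata) records from disk.
    records = [
        ("Great game, loved it!", {"id": 1}),
        ("Meh.", {"id": 2}),
        ("Would not buy again.", {"id": 3}),
    ]
    for text, meta in records:
        yield text, meta

blank_nlp = spacy.blank("en")

token_counts = []
# as_tuples=True: input is (text, context) pairs, output is (doc, context) pairs.
for streamed_doc, meta in blank_nlp.pipe(stream_reviews(), as_tuples=True, batch_size=2):
    token_counts.append((meta["id"], len(streamed_doc)))

print(token_counts)
```

The documents come back in the same order they went in, so the context stays matched to the right text.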


(To keep the code simple, the rest of the examples won't disable pipeline components or use multiple processes, but feel free to experiment with changing these settings!)

# Text Classification

spaCy is very useful for a wide range of text classification workflows.  We'll try three different approaches to the same binary classification task: predict whether an Amazon review was >3 or <=3 stars, based just on the review text.

## Getting the data

In [12]:
# Download the data.  You'll need to do this manually if you don't have wget installed.
# !wget --no-clobber http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Video_Games_5.json.gz

# Now, process the reviews.
import gzip
import json
import random

# Extract text and create our classification target: True for >3 stars, False for <=3
reviews = [
    (i["reviewText"], i["overall"] > 3)
    for i in map(
        json.loads,
        tqdm(gzip.open("reviews_Video_Games_5.json.gz", "rt"), total=231780)
    )
]

# Resample to a smaller subset for the sake of this demo.
positive = [t for t in reviews if t[1]]
negative = [t for t in reviews if not t[1]]
resampled_reviews = positive[:2_000] + negative[:2_000]
random.shuffle(resampled_reviews)
review_texts = [i[0] for i in resampled_reviews]
review_scores = [i[1] for i in resampled_reviews]

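
The cell above takes the *first* 2,000 reviews of each class, which is fine for a demo.  If you'd rather draw a random balanced subset, a sketch like the following would do it.  The `balanced_sample` helper and the toy data are made up for illustration; it assumes each item is a `(text, label)` pair with a boolean label, like our `reviews` list.

```python
import random

def balanced_sample(pairs, n_per_class, seed=0):
    """Randomly draw n_per_class positive and n_per_class negative (text, label) pairs."""
    rng = random.Random(seed)  # local RNG so the draw is reproducible
    pos = [p for p in pairs if p[1]]
    neg = [p for p in pairs if not p[1]]
    # rng.sample raises ValueError if a class has fewer than n_per_class items.
    sample = rng.sample(pos, n_per_class) + rng.sample(neg, n_per_class)
    rng.shuffle(sample)
    return sample

# Toy data: 10 positive and 6 negative "reviews".
toy_reviews = [(f"good review {i}", True) for i in range(10)]
toy_reviews += [(f"bad review {i}", False) for i in range(6)]

toy_subset = balanced_sample(toy_reviews, n_per_class=4)
print(len(toy_subset), sum(label for _, label in toy_subset))
```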

## Using spaCy's Text Vectorization

One of the easiest approaches is to use spaCy's text vectors as features for a predictive model.  Since they're just numpy arrays, we can use them with `scikit-learn`, `keras`, `pytorch`, or any other library we want.

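`nlp.make_doc(text)` is probably the fastest way to get these vectors.  It runs only the tokenizer; with a model that ships static word vectors (like `en_core_web_lg`), the document's `.vector` is then the average of its token vectors.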

In [13]:
# nlp.make_doc is the fastest way to get just the vector representation
# of a document.  It runs only the tokenizer; the vectors are looked up
# from the model's static vector table.
text_vectors = np.array([
    i.vector
    for i in map(nlp.make_doc, tqdm(review_texts))
])
print(text_vectors)
print(text_vectors.shape)
fit_random_forest(text_vectors, review_scores)


[[ 0.04479684  0.19694032 -0.13956733 ... -0.06383649 -0.00293173  0.03228827]
 [ 0.00834184  0.13011977 -0.07146262 ...  0.03144804  0.0060417   0.013466  ]
 [ 0.01021749  0.12195702 -0.07611904 ... -0.05591595 -0.00627276  0.0879147 ]
 ...
 [-0.04679825  0.18046853 -0.10175125 ... -0.08377992  0.00120563  0.08296575]
 [ 0.01141285  0.15055837 -0.14503597 ... -0.06690759  0.0610111   0.09047323]
 [-0.02274511  0.20885104 -0.1764832  ... -0.02571562  0.07243696  0.1297923 ]]
(4000, 300)
RandomForestClassifier accuracy on test set: 77.62%
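
Our `fit_random_forest` helper sticks with default hyperparameters.  Since the document vectors are just a NumPy array, we could also tune the classifier with scikit-learn's `GridSearchCV`.  Here's a minimal sketch on synthetic features (generated with `make_classification`, standing in for `text_vectors`); the parameter grid is an arbitrary choice for illustration, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for (text_vectors, review_scores).
toy_features, toy_targets = make_classification(
    n_samples=200, n_features=20, random_state=0
)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [25, 50], "max_depth": [None, 5]},
    cv=3,                 # 3-fold cross-validation for each combination
    scoring="accuracy",
)
search.fit(toy_features, toy_targets)

print(search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.2%}")
```

To try it on the real data, swap in `text_vectors` and `review_scores` for the synthetic arrays.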


## Using spaCy for Text Preprocessing

Sometimes you need a bag-of-words model (e.g. for interpretability).  We can use spaCy to do some pretty fine-grained text preprocessing.  Let's lemmatize all our reviews and remove stopwords and punctuation tokens.

In [14]:
# Lemmatize and remove stopwords+punctuation.  This requires
# running most of the spaCy model's pipelines to get these annotations.
cleaned_texts = []
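# Caution: in en_core_web_lg the tagger shares the tok2vec component, so
# disabling "tok2vec" can degrade the tags the lemmatizer relies on.
# Disable only "ner" if lemma quality matters more than speed.
to_disable=["tok2vec", "ner"]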
processed_documents = nlp.pipe(tqdm(review_texts), disable=to_disable)
for doc in processed_documents:
    cleaned = [
        tok.lemma_.lower()
        for tok in doc
        if not (tok.is_stop or tok.is_punct)
    ]
    cleaned = " ".join(cleaned)
    cleaned_texts.append(cleaned)
    
print(f"Before cleaning: {review_texts[0][:100]}...")
print(f"After cleaning:  {cleaned_texts[0][:100]}...")

# Now we feed this through some scikit-learn text preprocessing tools.
from sklearn.feature_extraction.text import CountVectorizer
bow_texts = CountVectorizer().fit_transform(cleaned_texts)
fit_random_forest(bow_texts, review_scores)


Before cleaning: I enjoyed Riven more than Myst. Riven was more difficult than Myst, for me, but I still only needed ...
After cleaning:  enjoyed riven myst riven difficult myst needed hints complete riven highly recommend riven adventure...
RandomForestClassifier accuracy on test set: 76.25%


We could also use the cleaned texts as inputs for topic models like LDA or LSA.

In [15]:
from gensim.corpora import Dictionary
from gensim.models.ldamodel import LdaModel
corpus = [i.split() for i in cleaned_texts]
id2word = Dictionary(corpus)
corpus = [id2word.doc2bow(i) for i in corpus]

model = LdaModel(corpus, num_topics=5, id2word=id2word)
for topic_id, words in model.show_topics(formatted=False):
    print(f"Topic {topic_id}: {', '.join(word for word, _ in words)}")

Topic 0: game, time, like, games, mario, play, super, good, version, characters
Topic 1: game, like, games, play, best, graphics, great, level, time, good
Topic 2: game, games, like, play, mario, time, good, 2, graphics, best
Topic 3: game, games, great, play, like, fun, graphics, time, good, best
Topic 4: game, like, games, good, graphics, time, play, fun, way, better


(Gensim is another excellent library for a wide range of NLP tasks, more focused on unsupervised learning tasks like topic modeling and building word embedding models, but we don't have time to go into it today.)

## Training a spaCy Model for Text Classification

We can also tweak the spaCy models to do the classification themselves, rather than just using them for feature extraction.  The steps required to do this:
1. Generate a config file containing all of our training settings.
2. Convert our data into one of spaCy's file formats.
3. Train + evaluate the model.

The config file makes it very easy to re-train our model in exactly the same way, and it has a *lot* of settings we could tweak.  Once the training is done, we get a spaCy model that we can `spacy.load()` and use like any other model to annotate texts with the categories we trained it on.

### Generate the config files
We'll generate a pretty basic config file, and let spaCy fill in sensible defaults for all the settings.

In [16]:
# If you're running this at home: try changing `-o efficiency` to `-o accuracy`.
# The model will be much larger and slower to train, but should be more accurate
# (though it may not make much difference for how little data we're using).
!{sys.executable} -m spacy init config \
    -F \
    -p textcat \
    -l en \
    -o efficiency \
    default_config.cfg

# This does not always change the config, but usually it'll fill in sensible defaults.
!{sys.executable} -m spacy init fill-config default_config.cfg config.cfg

print()
print("Config file contents:")
print(open("config.cfg").read())

[i] Generated config template specific for your use case
- Language: en
- Pipeline: textcat
- Optimize for: efficiency
- Hardware: CPU
- Transformer: None
[+] Auto-filled config with all values
[+] Saved config
default_config.cfg
You can now add your data and train your pipeline:
python -m spacy train default_config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy




[!] Nothing to auto-fill: base config is already complete
[+] Saved config
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy

Config file contents:
[paths]
train = null
dev = null
vectors = null
init_tok2vec = null

[system]
gpu_allocator = null
seed = 0

[nlp]
lang = "en"
pipeline = ["textcat"]
batch_size = 1000
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

[components]

[components.textcat]
factory = "textcat"
scorer = {"@scorers":"spacy.textcat_scorer.v1"}
threshold = 0.5

[components.textcat.model]
@architectures = "spacy.TextCatBOW.v2"
exclusive_classes = true
ngram_size = 1
no_output_layer = false
nO = null

[corpora]

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[corpora.train]
@readers = "spacy.Corpus.v1"
path



### Convert documents to the right format
We need to:
1. Convert our texts to spaCy `Doc`s, but we only need to run tokenization (not any other steps).  `nlp.make_doc()` is a very fast way to do this.
2. Add a `.cats` attribute, which should be a dictionary in the form `{category1: True, category2: False, ...}`, where the category this document belongs to is labeled `True` and all others are `False`.
3. Put all our documents into a `DocBin` so we can save it to disk.  (this is the format spaCy expects for training its models).

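The spaCy models are much more data hungry, so they benefit from a bigger set of reviews than the previous examples used.  To keep this demo fast, the cell below reuses the same 4,000 resampled reviews; uncomment the resampling lines to train on 10k positive and 10k negative examples instead.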

In [17]:
import spacy

# Optional: uncomment to use 10k positive and 10k negative examples instead.
# resampled_reviews = positive[:10_000] + negative[:10_000]
# random.shuffle(resampled_reviews)

# Create the Document objects and set their .cats attributes
docs = []
for (text, category) in tqdm(resampled_reviews):
    doc = nlp.make_doc(text)
    doc.cats = {"Positive": category, "Negative": not category}
    docs.append(doc)

# train-dev-test split of the 4,000 reviews: 3,200 train, 400 each for dev/test
# Create them as DocBin objects so we can easily save them to file.
random.shuffle(docs)
train = spacy.tokens.DocBin(docs=docs[:3200])
dev   = spacy.tokens.DocBin(docs=docs[3200:3600])
test  = spacy.tokens.DocBin(docs=docs[3600:])

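# Save the splits to file (creating the output folder first if needed)
from pathlib import Path
Path("data").mkdir(exist_ok=True)
train.to_disk("data/train.spacy")
dev.to_disk("data/dev.spacy")
test.to_disk("data/test.spacy")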

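
If you want to sanity-check what ends up inside a `DocBin`, here's a small round-trip sketch.  It serializes to bytes rather than to disk, and uses a tokenizer-only `spacy.blank("en")` pipeline plus two made-up examples, so it runs without the large model or the review data.

```python
import spacy
from spacy.tokens import DocBin

blank_nlp = spacy.blank("en")
toy_examples = [("Loved it", True), ("Hated it", False)]

toy_bin = DocBin()
for text, is_positive in toy_examples:
    toy_doc = blank_nlp.make_doc(text)  # tokenization only
    toy_doc.cats = {"Positive": is_positive, "Negative": not is_positive}
    toy_bin.add(toy_doc)

# Round trip: serialize, then rebuild the Docs against the same vocab.
payload = toy_bin.to_bytes()
restored = list(DocBin().from_bytes(payload).get_docs(blank_nlp.vocab))

for restored_doc in restored:
    print(restored_doc.text, restored_doc.cats)
```

The same `get_docs(nlp.vocab)` call works after `DocBin().from_disk(...)`, which is handy for inspecting the `.spacy` files we just saved.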

### Train + evaluate the model
Now we can train the model.  We'll pass some arguments to override what might be in the config file.  (Drop `--gpu-id 0` if you're training on CPU.)

In [18]:
!{sys.executable} -m spacy train \
    config.cfg \
    --paths.train ./data/train.spacy \
    --paths.dev ./data/dev.spacy \
    --output ./text_categorization \
    --gpu-id 0 \
    --nlp.batch_size 1024

[i] Saving to output directory: text_categorization
[i] Using GPU: 0
[+] Initialized pipeline
[i] Pipeline: ['textcat']
[i] Initial learn rate: 0.001
E    #       LOSS TEXTCAT  CATS_SCORE  SCORE 
---  ------  ------------  ----------  ------
  0       0          0.12       31.15    0.31
  0     200        102.42       68.80    0.69
  0     400         81.19       76.38    0.76
  0     600         68.26       80.04    0.80
  0     800         76.99       76.66    0.77
  0    1000         50.70       81.20    0.81
  0    1200         62.01       80.71    0.81
  0    1400         58.68       78.42    0.78
  0    1600         43.77       79.73    0.80
  0    1800         45.28       82.20    0.82
  0    2000         40.09       80.99    0.81
  0    2200         29.04       84.73    0.85
  1    2400         37.40       85.12    0.85
  1    2600          3.83       83.85    0.84
  1    2800         15.59       83.49    0.83
  1    3000          6.96       84.18    0.84
  1    3200 

[2022-03-21 15:40:36,750] [INFO] Set up nlp object from config
[2022-03-21 15:40:36,759] [INFO] Pipeline: ['textcat']
[2022-03-21 15:40:36,763] [INFO] Created vocabulary
[2022-03-21 15:40:36,763] [INFO] Finished initializing nlp object
[2022-03-21 15:40:48,604] [INFO] Initialized pipeline components: ['textcat']
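
A note on how those `--paths.train` / `--paths.dev` arguments work: the config file refers to `${paths.dev}` and friends as variables, and dotted overrides replace the `null` placeholders before the variables are filled in.  The sketch below imitates this with thinc's `Config` class (thinc is installed alongside spaCy) on a heavily trimmed-down, hypothetical config string.  It's meant to illustrate the idea, not to reproduce exactly what the `spacy train` command does internally.

```python
from thinc.api import Config

# A trimmed-down, hypothetical config with the same variable pattern as config.cfg.
TOY_CONFIG = """
[paths]
train = null
dev = null

[corpora]

[corpora.dev]
path = ${paths.dev}

[corpora.train]
path = ${paths.train}
"""

# Dotted keys mirror the command-line flags --paths.train and --paths.dev.
toy_overrides = {
    "paths.train": "./data/train.spacy",
    "paths.dev": "./data/dev.spacy",
}
toy_config = Config().from_str(TOY_CONFIG, overrides=toy_overrides)

print(toy_config["corpora"]["train"]["path"])
print(toy_config["corpora"]["dev"]["path"])
```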


Evaluate the model's performance on the testing set, and save the results to a JSON file for later reference:

In [19]:
!{sys.executable} -m spacy evaluate \
    ./text_categorization/model-best \
    ./data/test.spacy \
    --gpu-id 0 \
    -o TestSetEvaluation.json

from pprint import pprint
print()
print("Contents of the evaluation data file:")
pprint(json.load(open("TestSetEvaluation.json")))


Contents of the evaluation data file:
[i] Using GPU: 0

TOK                 100.00
TEXTCAT (macro F)   82.24 
SPEED               232736


               P       R       F
Positive   85.28   80.00   82.56
Negative   79.31   84.74   81.93


           ROC AUC
Positive      0.89
Negative      0.89

[+] Saved results to TestSetEvaluation.json
{'cats_auc_per_type': {'Negative': 0.8933333333, 'Positive': 0.8932581454},
 'cats_f_per_type': {'Negative': {'f': 0.8193384224,
                                  'p': 0.7931034483,
                                  'r': 0.8473684211},
                     'Positive': {'f': 0.8255528256,
                                  'p': 0.8527918782,
                                  'r': 0.8}},
 'cats_macro_auc': 0.8932957393,
 'cats_macro_f': 0.822445624,
 'cats_macro_p': 0.8229476632,
 'cats_macro_r': 0.8236842105,
 'cats_micro_f': 0.8225,
 'cats_micro_p': 0.8225,
 'cats_micro_r': 0.8225,
 'cats_score': 0.822445624,
 'cats_score_desc': 'macro



Now we can load the `model-best` directory from the output folder, just like any other spaCy model.

In [20]:
my_model = spacy.load("./text_categorization/model-best")

negative_example = (
    "I hate this terrible, awful, garbage game.  "
    "This is the worst thing I've ever played.  "
    "The developers should be absolutely ashamed of themselves."
)
positive_example = (
    "Wow, this might be my favorite game ever!  "
    "I can't believe how much fun I'm having.  This game rules.  "
    "Definitely a contender for Game of the Year for me."
)

print(my_model(negative_example).cats)
print(my_model(positive_example).cats)

{'Positive': 0.07472464442253113, 'Negative': 0.9252753257751465}
{'Positive': 0.5814142823219299, 'Negative': 0.41858571767807007}
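
The `.cats` attribute is just a dictionary of scores, so turning it into a single predicted label is up to us.  One simple option, sketched below, is to take the highest-scoring category and optionally refuse to answer when the model isn't confident.  The `predicted_label` helper and the 0.8 threshold are made up for this example; the scores are rounded from the output above.

```python
def predicted_label(cats, threshold=0.5):
    """Return the highest-scoring category, or None if its score is below threshold."""
    best = max(cats, key=cats.get)
    return best if cats[best] >= threshold else None

negative_scores = {"Positive": 0.0747, "Negative": 0.9253}
positive_scores = {"Positive": 0.5814, "Negative": 0.4186}

print(predicted_label(negative_scores))                 # Negative
print(predicted_label(positive_scores))                 # Positive
print(predicted_label(positive_scores, threshold=0.8))  # None
```

The positive example only scores about 0.58, so a stricter threshold would flag it as uncertain rather than positive.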


## The spaCy universe

It's fairly easy to write custom pipeline components that do all sorts of things to documents, like adding customized annotations and functionality.  There is a long, but non-exhaustive, list of them on [the spaCy site](https://spacy.io/universe).

One that I use a lot is PyTextRank, which is an implementation of the TextRank algorithm for automatic text summarization.  It's very easy to use (but it's not available through `conda`, so we have to `pip install` it):

In [None]:
!{sys.executable} -m pip install --user pytextrank

In [21]:
import spacy
import pytextrank
nlp_textrank = spacy.load("en_core_web_lg")
nlp_textrank.add_pipe("textrank")

<pytextrank.base.BaseTextRankFactory at 0x1f959d7fbe0>

And that's it!  Now we just run a document through the `nlp_textrank` model and access the special methods and attributes that get added.  By convention, spaCy extensions all add attributes and methods to `doc._.something`.

In [22]:
to_summarize = "\n".join([i[0] for i in positive[:100]])

textrank = nlp_textrank(to_summarize)._.textrank
summary = textrank.summary(limit_phrases=10, limit_sentences=10)
for (i, sent) in enumerate(summary):
    print(f"SENTENCE {i+1}:")
    print(sent.text.strip())
    print()

SENTENCE 1:
I started playing games on my laptop and bought a few new games to build my collection.

SENTENCE 2:
Game plays well and looks gorgeous when image spanned across three monitors.

SENTENCE 3:
My lady dug out her kids old gamecube and games and started playing her old favorite Harvest Moon titles.

SENTENCE 4:
My living room is fairly large, so these chords make it nice to play games which require the old controllers.

SENTENCE 5:
Innovation is necessary, and appreciated, but games like BOF3 need never disappear.

SENTENCE 6:
it good play game I guess well

SENTENCE 7:
With all the complaints aside, Breath of Fire 3 is fun game to play through.

SENTENCE 8:
Now  I still play SNES, playing such good games as Super Godzilla, Breath of  Fire I, and the Final Fantasy series.

SENTENCE 9:
If you like racing games you should check this out.

SENTENCE 10:
If you enjoy racing games, this is not one you should ignore.



# Some things not covered

spaCy is a surprisingly deep library, and there's a lot I didn't cover.  A few big things:
- Not all models have exactly the same capabilities, and some prioritise speed over accuracy.  The website has pretty thorough documentation on model specs, so you can choose the best pre-trained model for your work.
- You can use different word vectors if you want.  You can provide static word vectors or pre-train your model to get better starting values for the embedding layers.
- spaCy Projects give nice ways to package up an entire workflow and make it easy to redistribute.
- You can train your own part-of-speech tagger, syntactic analyzer, named entity recognition, etc. components, as long as you have enough annotated data.
- You can write your own pipeline components.
- You can easily interface with various transformer models from Hugging Face's `transformers` library by installing `spacy-transformers`.
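
As a small taste of the custom-component point above, here's a minimal sketch of a component that counts exclamation marks and stores the result under `doc._`.  The component name `exclamation_counter` and the extension name `n_exclamations` are made up for this example, and it uses a tokenizer-only `spacy.blank("en")` pipeline so it runs without a downloaded model.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

# Register a custom attribute; it will be available as doc._.n_exclamations.
# force=True lets the cell be re-run without an "already registered" error.
Doc.set_extension("n_exclamations", default=0, force=True)

@Language.component("exclamation_counter")
def exclamation_counter(doc):
    # A component receives a Doc, annotates it, and returns it.
    doc._.n_exclamations = sum(tok.text == "!" for tok in doc)
    return doc

custom_nlp = spacy.blank("en")
custom_nlp.add_pipe("exclamation_counter")

excited = custom_nlp("Wow! This game rules!")
print(custom_nlp.pipe_names)
print(excited._.n_exclamations)
```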

# Contact

Henry Anderson ([henry.anderson@uta.edu](mailto:henry.anderson@uta.edu))