# What's spaCy?
spaCy is a **free, open-source library** for advanced **Natural Language Processing** (NLP) in Python.

If you're working with a lot of text, you'll eventually want to know more about it. For example, what's it about? What do the words mean in context? Who is doing what to whom? What products and companies are mentioned in the text? Which texts are similar to each other?

spaCy is designed specifically for **production use** and helps you build applications that process and "understand" large volumes of text. It can be used to build **information extraction** or **natural language processing** systems, or to pre-process text for **deep learning**.

## What spaCy isn't

* First, **spaCy isn't a platform or an "API"**. Unlike a platform, spaCy doesn't provide software as a service or a web application. It's an open-source library designed to help you build NLP applications, not a consumable service.

* Second, **spaCy is not an out-of-the-box chat bot engine**. While spaCy can be used to power conversational applications, it’s not designed specifically for chatbots and only provides the underlying text processing capabilities.

* Third, **spaCy is not research software**. It’s built on the latest research, but it’s designed to get things done. This leads to fairly different design decisions than NLTK or CoreNLP, which were created as platforms for teaching and research. The main difference is that spaCy is integrated and opinionated. spaCy tries to avoid asking the user to choose between multiple algorithms that deliver equivalent functionality. Keeping the menu small lets spaCy deliver generally better performance and developer experience.

* Fourth, **spaCy is not a company**. It’s an open-source library. The company publishing spaCy and other software is called **Explosion AI**.

## Installation

spaCy runs on **Unix/Linux, macOS/OS X** and **Windows** and requires a **64-bit** Python interpreter. spaCy v2 supported Python 2.7 and 3.5+, while spaCy v3 requires Python 3. The latest version of spaCy is available over pip and conda.

* **Installation with pip** on Linux, Windows, and macOS/OS X (the `-U` flag upgrades an existing installation):
    
    pip install -U spacy
    
* **Installation with conda** on Linux, Windows, and macOS/OS X:
    
    conda install -c conda-forge spacy
    
## Features
This section introduces spaCy's main features and capabilities.

### Statistical models
Some of spaCy's features work independently, while others require statistical models to be loaded, which enable spaCy to predict linguistic annotations, for example whether a word is a verb or a noun. spaCy currently offers statistical models for a variety of languages, which can be installed as individual Python modules. Models can differ in size, speed, memory usage, accuracy, and the data they include. The model you choose depends on your use case and the texts you're working with. For a general use case, the small, default models are a good place to start. They typically include the following components:
* **Binary weights** for the part-of-speech tagger, dependency parser, and named entity recognizer to predict those annotations in context.
* **Lexical entries** in the vocabulary, i.e. words and their context-independent attributes like the shape or spelling.
* **Data files** like lemmatization rules and lookup tables.
* **Word vectors**, i.e. multi-dimensional meaning representations of words that let you determine how similar they are to each other.
* **Configuration options**, like the language and processing pipeline settings, to put spaCy in the correct state when you load in the model.

## Linguistic annotations

spaCy provides a variety of linguistic annotations to give you **insights into a text’s grammatical structure**. This includes the word types, like the parts of speech, and how the words are related to each other. For example, if you’re analyzing text, it makes a huge difference whether a noun is the subject of a sentence, or the object — or whether "google" is used as a verb, or refers to the website or company in a specific context.

Once you’ve **downloaded and installed** a model, you can load it via `spacy.load()`. This returns a `Language` object containing all components and data needed to process text, which is usually called `nlp`. Calling the `nlp` object on a string of text returns a processed `Doc`:


In [1]:
!pip install numpy==2.0



In [2]:
!pip install -U spacy



In [3]:
!python -m spacy download en_core_web_sm


A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.0.0 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

Traceback (most recent call last):  File "<frozen runpy>", line 189, in _run_module_as_main
  File "<frozen runpy>", line 148, in _get_module_details
  File "<frozen runpy>", line 112, in _get_module_details
  File "/Users/anil/anaconda3/lib/python3.11/site-packages/spacy/__init__.py", line 6, in <module>
  File "/Users/anil/anaconda3/lib/python3.11/site-packages/spacy/errors.py", line 3, in <module>
    from .compat import Literal
  File "/Users/anil/anaconda3/lib/python3.11/site-packages/spacy/compat.py", line 4, in <module>
    from thinc.util import co

In [4]:
# https://spacy.io/usage/linguistic-features
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for token in doc:
    # Print the token text, its coarse part-of-speech tag, and its dependency label.
    print(token.text, token.pos_, token.dep_)

# Token attributes used in this notebook:
# text: The original word text.
# lemma_: The base form of the word.
# pos_: The simple part-of-speech tag.
# tag_: The detailed part-of-speech tag.
# dep_: Syntactic dependency, i.e. the relation between tokens.
# shape_: The word shape - capitalization, punctuation, digits.
# is_alpha: Does the token consist of alphabetic characters?
# is_stop: Is the token part of a stop list, i.e. the most common words of the language?



Apple PROPN nsubj
is AUX aux
looking VERB ROOT
at ADP prep
buying VERB pcomp
U.K. PROPN nsubj
startup VERB ccomp
for ADP prep
$ SYM quantmod
1 NUM compound
billion NUM pobj


Even though a `Doc` is processed (e.g. split into individual words and annotated), it still holds all information of the original text, such as whitespace characters. You can always get the offset of a token into the original string, or reconstruct the original by joining the tokens and their trailing whitespace. This way, you'll never lose any information when processing text with spaCy.
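
The idea of non-destructive tokenization can be sketched without spaCy at all. In the minimal example below, the list of `(text, trailing whitespace)` pairs is a hand-written stand-in for a processed `Doc`, not something spaCy produced. In spaCy itself, the corresponding information is exposed through attributes such as `token.text_with_ws` and `token.idx`.

```python
# Toy stand-in for a processed Doc: each token keeps its text and the
# whitespace that followed it in the original string.
text = "Apple is looking at buying U.K. startup for $1 billion"
tokens = [
    ("Apple", " "), ("is", " "), ("looking", " "), ("at", " "),
    ("buying", " "), ("U.K.", " "), ("startup", " "), ("for", " "),
    ("$", ""), ("1", " "), ("billion", ""),
]

# Joining every token with its trailing whitespace restores the input exactly.
reconstructed = "".join(word + space for word, space in tokens)

# Character offset of each token in the original string.
offsets = []
position = 0
for word, space in tokens:
    offsets.append(position)
    position += len(word) + len(space)

print(reconstructed == text)
print(offsets)
```

Because the offsets are derived from the same pairs, slicing the original string at any offset gives back the token that starts there.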

## Tokenization
During processing, spaCy first tokenizes the text, i.e., segments it into words, punctuation, and so on. This is done by applying rules specific to each language. For example, punctuation at the end of a sentence should be split off, whereas "U.K." should remain one token. Each `Doc` consists of individual tokens, and we can iterate over them:

In [5]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for token in doc:
    print(token.text)

Apple
is
looking
at
buying
U.K.
startup
for
$
1
billion


First, the raw text is split on whitespace characters, similar to `text.split(' ')`. Then, the tokenizer processes the text from left to right. On each substring, it performs two checks:

**1. Does the substring match a tokenizer exception rule?** For example, "don't" does not contain whitespace, but should be split into two tokens, "do" and "n't", while "U.K." should always remain one token.

**2. Can a prefix, suffix, or infix be split off?** For example, punctuation like commas, periods, hyphens, or quotes.

If there's a match, the rule is applied and the tokenizer continues its loop, starting with the newly split substrings. This way, spaCy can split **complex, nested tokens** like combinations of abbreviations and multiple punctuation marks.

While punctuation rules are usually pretty general, tokenizer exceptions strongly depend on the specifics of the individual language. This is why each available language has its own subclass, like `English` or `German`, that loads in lists of hard-coded data and exception rules.
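
The loop described above can be approximated in a few lines of plain Python. This is only a sketch: the rule tables `EXCEPTIONS`, `PREFIXES` and `SUFFIXES` are hypothetical and tiny, infix handling is left out, and spaCy's real tokenizer is considerably more involved.

```python
# Hypothetical, heavily simplified rule tables (not spaCy's real data).
EXCEPTIONS = {"don't": ["do", "n't"], "U.K.": ["U.K."]}
PREFIXES = ('"', "(", "$")
SUFFIXES = ('"', ")", ",", ".", "!", "?")


def tokenize(text):
    tokens = []
    for substring in text.split(" "):
        suffixes = []
        while substring:
            # Check 1: a tokenizer exception wins over every other rule.
            if substring in EXCEPTIONS:
                tokens.extend(EXCEPTIONS[substring])
                substring = ""
            # Check 2: otherwise try to split off a prefix or a suffix.
            elif substring.startswith(PREFIXES):
                tokens.append(substring[0])
                substring = substring[1:]
            elif substring.endswith(SUFFIXES):
                suffixes.insert(0, substring[-1])
                substring = substring[:-1]
            else:
                tokens.append(substring)
                substring = ""
        # Suffixes were collected right to left, so they are already in order.
        tokens.extend(suffixes)
    return tokens


print(tokenize('"We don\'t ship to the U.K.!", she said.'))
```

Note how `U.K.!",` is peeled from the right until the remainder matches the exception for "U.K.", which is what lets nested punctuation and abbreviations coexist.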

## Part-of-speech (POS) tags and dependencies

After tokenization, spaCy can parse and tag a given Doc. This is where the statistical model comes in, which enables spaCy to make a prediction of which tag or label most likely applies in this context. A model consists of binary data and is produced by showing a system enough examples for it to make predictions that generalize across the language – for example, a word following "the" in English is most likely a noun.
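
To make "predicting the most likely tag in context" concrete, here is a deliberately tiny counting model. It is a toy under stated assumptions: the training pairs are invented, the only context is the previous word, and it has nothing in common with the models spaCy actually ships beyond the general idea of learning from labelled examples.

```python
from collections import Counter, defaultdict

# Hypothetical hand-labelled examples: (previous word, tag of the next word).
training_examples = [
    ("the", "NOUN"), ("the", "NOUN"), ("the", "ADJ"), ("the", "NOUN"),
    ("to", "VERB"), ("to", "VERB"), ("to", "NOUN"),
]

# "Training" here is just counting which tag follows each context word.
tag_counts = defaultdict(Counter)
for previous_word, tag in training_examples:
    tag_counts[previous_word][tag] += 1


def predict_tag(previous_word):
    """Return the tag seen most often after previous_word, or None if unseen."""
    counts = tag_counts.get(previous_word)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


print(predict_tag("the"))  # NOUN
print(predict_tag("to"))   # VERB
```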

Linguistic annotations are available as `Token` attributes. Like many NLP libraries, spaCy encodes all strings to hash values to reduce memory usage and improve efficiency. So to get the readable string representation of an attribute, we need to add an underscore `_` to its name:

In [6]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Coronavirus: Delhi resident tests positive for coronavirus, total 31 people infected in India")

for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
          token.shape_, token.is_alpha, token.is_stop)

Coronavirus Coronavirus PROPN NNP nsubj Xxxxx True False
: : PUNCT : punct : False False
Delhi Delhi PROPN NNP compound Xxxxx True False
resident resident NOUN NN nsubj xxxx True False
tests test VERB VBZ appos xxxx True False
positive positive ADJ JJ amod xxxx True False
for for ADP IN prep xxx True True
coronavirus coronavirus NOUN NN pobj xxxx True False
, , PUNCT , punct , False False
total total ADJ JJ ROOT xxxx True False
31 31 NUM CD nummod dd False False
people people NOUN NNS dobj xxxx True False
infected infect VERB VBN acl xxxx True False
in in ADP IN prep xx True True
India India PROPN NNP pobj Xxxxx True False
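
The relationship between a hashed attribute and its underscore counterpart (for example `token.lemma` versus `token.lemma_`) can be pictured with a small lookup table. The class below is a hypothetical illustration, and the hash function is an assumption: it uses truncated SHA-256 from the standard library, not whatever spaCy uses internally.

```python
import hashlib


class ToyStringStore:
    """Maps strings to integer IDs and back (illustration only)."""

    def __init__(self):
        self._by_hash = {}

    def add(self, string):
        # Assumption: SHA-256 truncated to 64 bits stands in for the
        # library's real hash function.
        digest = hashlib.sha256(string.encode("utf8")).digest()
        key = int.from_bytes(digest[:8], "big")
        self._by_hash[key] = string
        return key

    def __getitem__(self, key):
        return self._by_hash[key]


store = ToyStringStore()
lemma_id = store.add("infect")      # compact integer, in the role of token.lemma
print(isinstance(lemma_id, int))
print(store[lemma_id])              # readable string, in the role of token.lemma_
```

Storing each distinct string once and passing integers around is what keeps per-token memory small. In spaCy, the shared table lives on the vocabulary as `nlp.vocab.strings`.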


Using spaCy's built-in displaCy visualizer, here's what our example sentence and its dependencies look like:

In [11]:
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Google, Apple crack down on fake coronavirus apps")
displacy.serve(doc, style="dep", port=4000)




Using the 'dep' visualizer
Serving on http://0.0.0.0:4000 ...





Shutting down server on port 4000.


## Named Entities

A named entity is a "real-world object" that’s assigned a name – for example, a person, a country, a product, or a book title. spaCy can recognize various types of named entities in a document, by asking the model for a prediction. Because models are statistical and strongly depend on the examples they were trained on, this doesn’t always work perfectly and might need some tuning later, depending on your use case.

Named entities are available as the `ents` property of a `Doc`:

In [12]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Coronavirus: Delhi resident tests positive for coronavirus, total 31 people infected in India")

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

31 66 68 CARDINAL
India 88 93 GPE
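
The two numbers printed for each entity are character offsets into the original string, so they can be checked with ordinary slicing. The short sketch below needs no model: the offsets and labels are simply copied from the output above.

```python
text = "Coronavirus: Delhi resident tests positive for coronavirus, total 31 people infected in India"

# (start_char, end_char, label) triples copied from the output above.
entities = [(66, 68, "CARDINAL"), (88, 93, "GPE")]

for start_char, end_char, label in entities:
    # end_char is exclusive, so an ordinary slice returns the entity text.
    print(text[start_char:end_char], label)
```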


## Visualizing the Named Entity Recognizer

The entity visualizer, `ent`, highlights named entities and their labels in the text.

In [13]:
import spacy
from spacy import displacy

text = "Coronavirus: Delhi resident tests positive for coronavirus, total 31 people infected in India"

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
displacy.serve(doc, style="ent", auto_select_port=True)
# https://spacy.io/api/annotation#named-entities




Using the 'ent' visualizer
Serving on http://0.0.0.0:5001 ...





Shutting down server on port 5001.


## Word Vectors and Similarity
Similarity is determined by comparing word vectors or “word embeddings”, multi-dimensional meaning representations of a word. Word vectors can be generated using an algorithm like word2vec; each vector is a fixed-length array of floating-point numbers.

**Important note:**
To make them compact and fast, spaCy’s small models (all packages whose names end in `sm`) don’t ship with word vectors, and only include context-sensitive tensors. This means you can still use the `similarity()` methods to compare documents, tokens, and spans, but the results won’t be as good, and individual tokens won’t have any vectors assigned. So, in order to use real word vectors, you need to download a larger model:

    python -m spacy download en_core_web_md

Models that come with built-in word vectors make them available as the `Token.vector` attribute. `Doc.vector` and `Span.vector` default to an average of their token vectors. You can also check if a token has a vector assigned, and get the L2 norm, which can be used to normalize vectors.

In [14]:
# !python -m spacy download en_core_web_md
import spacy.cli
spacy.cli.download("en_core_web_md")

import en_core_web_md
nlp = en_core_web_md.load()

Collecting en-core-web-md==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.8.0/en_core_web_md-3.8.0-py3-none-any.whl (33.5 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [15]:
import spacy

nlp = spacy.load("en_core_web_md")
tokens = nlp("lion bear apple banana fadsfdshds")

for token in tokens:
    print(token.text, token.has_vector, token.vector_norm, token.is_oov)
    
# Vector norm: The L2 norm of the token’s vector (the square root of the sum of the values squared)
# has vector: Does the token have a vector representation?
# OOV: Out-of-vocabulary


lion True 6.6788154 False
bear True 7.2436275 False
apple True 6.895898 False
banana True 6.895898 False
fadsfdshds False 0.0 True


The words “lion”, “bear”, “apple” and “banana” are all pretty common in English, so they’re part of the model’s vocabulary and come with a vector. The word “fadsfdshds”, on the other hand, is a lot less common and out-of-vocabulary, so its vector representation consists of 300 dimensions of 0, which means it’s practically nonexistent. If your application will benefit from a large vocabulary with more vectors, you should consider using one of the larger models or loading in a full vector package, for example, `en_vectors_web_lg`, which includes over 1 million unique vectors.
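
The two quantities discussed here, the L2 norm of a vector and the averaged vector used for a whole `Doc` or `Span`, are simple to compute by hand. The sketch below uses made-up 3-dimensional vectors rather than real model output, purely to show the arithmetic.

```python
import math

# Made-up 3-dimensional vectors; real models use hundreds of dimensions.
token_vectors = {
    "lion": [3.0, 4.0, 0.0],
    "bear": [1.0, 2.0, 2.0],
    "fadsfdshds": [0.0, 0.0, 0.0],  # out-of-vocabulary: all zeros
}


def l2_norm(vector):
    """Square root of the sum of the squared values."""
    return math.sqrt(sum(value * value for value in vector))


def average(vectors):
    """Element-wise mean of several equally long vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


for word, vector in token_vectors.items():
    print(word, l2_norm(vector))

doc_vector = average([token_vectors["lion"], token_vectors["bear"]])
print(doc_vector)
```

An all-zero vector has norm 0.0, which matches the `0.0` printed for the out-of-vocabulary token in the output above.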


Each Doc, Span, and Token comes with a .similarity() method that lets you compare it with another object, and determine the similarity. Of course, similarity is always subjective — whether “dog” and “cat” are similar really depends on how you’re looking at it. spaCy’s similarity model usually assumes a pretty general-purpose definition of similarity.

In [16]:
import spacy

nlp = spacy.load("en_core_web_md")  # make sure to use larger model!
tokens = nlp("lion bear cow apple mango spinach")

for token1 in tokens:
    for token2 in tokens:
        print(token1.text, token2.text, token1.similarity(token2))

lion lion 1.0
lion bear 0.4252593219280243
lion cow 0.5135680437088013
lion apple 0.25588130950927734
lion mango 0.3112117350101471
lion spinach 0.2844249904155731
bear lion 0.4252593219280243
bear bear 1.0
bear cow 0.5596494078636169
bear apple 0.2901766002178192
bear mango 0.18352438509464264
bear spinach 0.11630470305681229
cow lion 0.5135680437088013
cow bear 0.5596494078636169
cow cow 1.0
cow apple 0.3741353154182434
cow mango 0.359701007604599
cow spinach 0.31815198063850403
apple lion 0.25588130950927734
apple bear 0.2901766002178192
apple cow 0.3741353154182434
apple apple 1.0
apple mango 0.5986488461494446
apple spinach 0.6040376424789429
mango lion 0.3112117350101471
mango bear 0.18352438509464264
mango cow 0.359701007604599
mango apple 0.5986488461494446
mango mango 1.0
mango spinach 0.7843544483184814
spinach lion 0.2844249904155731
spinach bear 0.11630470305681229
spinach cow 0.31815198063850403
spinach apple 0.6040376424789429
spinach mango 0.7843544483184814
spinach spinach 1.0
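
By default, spaCy's similarity scores are based on the cosine similarity of the underlying vectors, which explains two patterns in the output above: every word scores 1.0 against itself, and the score for a pair is the same in both directions. A minimal sketch of the measure, using made-up 2-dimensional vectors instead of real embeddings:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up 2-dimensional vectors, for illustration only.
apple = [1.0, 0.0]
mango = [1.0, 1.0]
lion = [0.0, 1.0]

print(cosine_similarity(apple, apple))
print(cosine_similarity(apple, lion))
print(cosine_similarity(apple, mango) == cosine_similarity(mango, apple))
```

Returning 0.0 for zero vectors is a choice made for this sketch so that out-of-vocabulary words do not cause a division by zero.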

## Pipelines
When you call `nlp` on a text, spaCy first tokenizes the text to produce a `Doc` object. The `Doc` is then processed in several different steps, which is also referred to as the processing pipeline. The pipeline used by the default models consists of a tagger, a parser, and an entity recognizer. Each pipeline component returns the processed `Doc`, which is then passed on to the next component.

| **NAME**  | **COMPONENT**     | **CREATES**                                         | **DESCRIPTION**                                  |
|-----------|-------------------|-----------------------------------------------------|--------------------------------------------------|
| tokenizer | Tokenizer         | Doc                                                 | Segment text into tokens.                        |
| tagger    | Tagger            | Doc[i].tag                                          | Assign part-of-speech tags.                      |
| parser    | DependencyParser  | Doc[i].head, Doc[i].dep, Doc.sents, Doc.noun_chunks | Assign dependency labels.                        |
| ner       | EntityRecognizer  | Doc.ents, Doc[i].ent_iob, Doc[i].ent_type           | Detect and label named entities.                 |
| textcat   | TextCategorizer   | Doc.cats                                            | Assign document labels.                          |
| ...       | custom components | Doc._.xxx, Token._.xxx, Span._.xxx                  | Assign custom attributes, methods or properties. |
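
The hand-off between components can be sketched with plain functions. Everything below is hypothetical: a dictionary stands in for the `Doc`, and the `toy_*` functions only record that they ran. What it shows is the control flow, where the tokenizer creates the document and each later component receives it, annotates it, and returns it.

```python
# A stand-in for Doc: a dict that components annotate in place.
def toy_tokenizer(text):
    return {"tokens": text.split(), "annotations": []}


def toy_tagger(doc):
    doc["annotations"].append("tagger")
    return doc


def toy_parser(doc):
    doc["annotations"].append("parser")
    return doc


def toy_ner(doc):
    doc["annotations"].append("ner")
    return doc


# Hypothetical pipeline: (name, component) pairs applied in order.
pipeline = [("tagger", toy_tagger), ("parser", toy_parser), ("ner", toy_ner)]


def process(text):
    doc = toy_tokenizer(text)      # the tokenizer creates the Doc
    for name, component in pipeline:
        doc = component(doc)       # each component returns the Doc
    return doc


doc = process("Apple is looking at buying U.K. startup")
print(doc["annotations"])
```

With a real model loaded, the names of the active components can be listed via `nlp.pipe_names`.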


The processing pipeline always depends on the statistical model and its capabilities. For example, a pipeline can only include an entity recognizer component if the model includes data to make predictions of entity labels. This is why each model specifies the pipeline to use in its metadata, as a simple list containing the component names.
    
    "pipeline": ["tagger", "parser", "ner"]