# Floret Embeddings

In this document we'll explore some properties of Floret embeddings. You can also learn more about them [by reading our blog post]() or by checking out the [GitHub repository](https://github.com/explosion/floret).

## Getting Started 

In order to demonstrate Floret, we will need to install it first. So let's install it along with spaCy.

In [None]:
%pip install floret spacy

Let's start by downloading some vectors. These vectors can be found on [GitHub](https://github.com/explosion/spacy-vectors-builder/releases/tag/en-3.4.0). The code below downloads and unzips the vectors. 

In [None]:
! wget https://github.com/explosion/spacy-vectors-builder/releases/download/en-3.4.0/en_vectors_floret_md.floret.gz
! gzip -d en_vectors_floret_md.floret.gz

Now that these vectors are downloaded, let's create a spaCy pipeline that has these vectors loaded. 

In [None]:
! spacy init vectors en en_vectors_floret_md.floret en_core_floret_md --mode floret

We now have a pipeline on disk called `en_core_floret_md`. Let's make some comparisons with the `en_core_web_md` model on disk. In order to do that we will first need to download it.

In [None]:
! spacy download en_core_web_md

Great, let's now load both pipelines.

In [None]:
import spacy

# This is the standard spaCy pipeline
nlp_md = spacy.load("en_core_web_md")

# This is the spaCy pipeline with floret vectors
nlp_fl = spacy.load("en_core_floret_md")

As the blog post explains, floret embeddings work a bit differently. Floret stores subword embeddings instead of just word embeddings. It uses a hashing trick under the hood to keep things lightweight, and having these subwords at your disposal means that you can expect different behavior.
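
To make the hashing trick a little more concrete, here is a minimal sketch of how a subword string could be mapped to rows of a fixed-size table. Everything in it is an assumption for illustration: the helper name `bucket_rows`, the table size, the number of hashes, and the use of `zlib.crc32` as a stand-in hash function. It is not floret's actual implementation.

```python
import zlib

def bucket_rows(subword, n_buckets=50_000, n_hashes=2):
    """Map one subword to `n_hashes` row indices in a table with `n_buckets` rows."""
    data = subword.encode("utf8")
    # crc32 with a different starting value per hash; a stand-in, not floret's hash
    return [zlib.crc32(data, seed) % n_buckets for seed in range(n_hashes)]

# Any string maps to rows, even one that was never seen during training.
rows = {subword: bucket_rows(subword) for subword in ["unive", "ities", "zergl"]}
for subword, indices in rows.items():
    print(subword, indices)
```

Because the table has a fixed number of rows, its size doesn't grow with the vocabulary. The price is that unrelated subwords can end up sharing a row.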

To highlight this, let's consider what might happen when there's a spelling mistake. We could type "univercities" (wrong spelling) instead of "universities" (correct spelling). Can we expect the associated vectors to be similar in the normal pipeline? 

In [None]:
token_1 = nlp_md("univercities")[0]
token_2 = nlp_md("universities")[0]

token_1.similarity(token_2)

It turns out we can't! The reason is that "univercities" doesn't appear in the vocabulary of the `nlp_md` pipeline, which means that we get a vector of zeros.

In [None]:
token_1.is_oov

In [None]:
token_1.vector
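
To see why a vector of zeros can't be similar to anything, here is a small pure-Python sketch of cosine similarity. The function name `cosine_similarity`, the zero-guard, and the example numbers are assumptions made for illustration; this is not a copy of spaCy's internals.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity, returning 0.0 when either vector has zero length."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # The angle to a zero vector is undefined; treat it as "no similarity".
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)

oov_vector = [0.0, 0.0, 0.0]      # stands in for an out-of-vocabulary token
known_vector = [0.2, -0.5, 0.1]   # made-up numbers for illustration

zero_score = cosine_similarity(oov_vector, known_vector)
self_score = cosine_similarity(known_vector, known_vector)
print(zero_score, self_score)
```

Whatever the other vector looks like, the score against a zero vector carries no information.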

The pipeline with the floret embeddings gives different behavior.

In [None]:
token_1 = nlp_fl("univercities")[0]
token_2 = nlp_fl("universities")[0]

token_1.similarity(token_2)

These two tokens _are_ similar. It's because even though there's no entry for "univercities", there are entries for the subwords. Specifically, Floret uses the following subtokens: 

```
'<univ', 'unive', 'niver', 'iverc', 'verci', 'ercit', 'rciti', 'citie', 'ities', 'ties>'
```

Note: the `<` and `>` characters indicate the start and end of a token. 
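
The list above can be reproduced with a few lines of Python. This is a sketch that assumes 5-character n-grams, which is what the list suggests; the helper name `char_ngrams` is made up for this example, and the n-gram lengths a real floret model uses are chosen at training time.

```python
def char_ngrams(word, n=5):
    """Split a word into character n-grams, with < and > marking its boundaries."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

subwords = char_ngrams("univercities")
print(subwords)
```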

## Consequences of Subword Embeddings 

Floret is more robust against spelling errors, but it also introduces some caveats. 

In [None]:
token_1 = nlp_fl("universities")[0]
token_2 = nlp_fl("cities")[0]

token_1.similarity(token_2)

Notice how the "ities" and "ties>" subtokens appear in both "universities" and "cities". A direct consequence of this is that these words will be similar. Even if the meanings of both words are unrelated, the presence of overlapping subwords will cause a vector similarity!
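
The overlap is easy to check by hand. The sketch below again assumes 5-character n-grams and uses a made-up helper called `char_ngrams`; it only looks at the subword strings, not at the vectors themselves.

```python
def char_ngrams(word, n=5):
    """Split a word into character n-grams, with < and > marking its boundaries."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def shared_subwords(word_a, word_b):
    """Return the sorted subwords that two words have in common."""
    return sorted(set(char_ngrams(word_a)) & set(char_ngrams(word_b)))

shared_correct = shared_subwords("universities", "cities")
shared_typo = shared_subwords("univercities", "cities")
print(shared_correct)  # ['ities', 'ties>']
print(shared_typo)     # ['citie', 'ities', 'ties>']
```

The misspelled word shares even more subwords with "cities" than the correct spelling does.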

## Another Use Case

Spelling errors can cause tokens to no longer match an entry in a vector table, but there are many other situations that can cause this. What if somebody compounds long words together?

In [None]:
token_1 = nlp_md("zerglingmatosis")[0]
token_2 = nlp_md("neurofibromatosis")[0]

token_1.similarity(token_2)

In [None]:
token_1 = nlp_fl("zerglingmatosis")[0]
token_2 = nlp_fl("neurofibromatosis")[0]
token_3 = nlp_fl("osteochondrosis")[0]

token_1.similarity(token_2), token_1.similarity(token_3)

Even though the word "zerglingmatosis" does not exist, the "matosis" part of the word does imply that it might be about a disease. 

## Another Caveat 

spaCy applies attributes to tokens, some of which are based on the [Vocab](https://spacy.io/api/vocab) object attached to the pipeline. One of these is the `is_oov` property, which flags tokens as out of vocabulary. In models with normal word embeddings, this property is `True` if the token has no entry in the embedding table.

This definition of "out of vocabulary" only works if you have an actual word embedding table. So what happens when we use floret with subwords? 

In [None]:
token_md = nlp_md("superlongwordthatdoesnotexist")[0]
token_fl = nlp_fl("superlongwordthatdoesnotexist")[0]

In [None]:
token_md.is_oov, token_fl.is_oov

The `.is_oov` property will always be `False` when you have a pipeline with Floret embeddings. Any string can be split into subwords that map to rows in the table, so every token gets a vector and the property can no longer be used as a proxy for being out of vocabulary.

There are a few other operations that aren't supported with subword tables either. One of these is the `.vocab.vectors.most_similar` method.

In [None]:
import numpy as np

query_vec = np.array([nlp_fl.vocab["zerglingmatosis"].vector])

# This line is commented out because it will throw an error.
# nlp_fl.vocab.vectors.most_similar(query_vec, n=10)
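
If you still want nearest neighbors in this setting, one option is to rank a list of candidate words yourself. The sketch below does a brute-force cosine ranking in pure Python. The helper names `cosine` and `nearest`, and the two-dimensional toy vectors, are made up for illustration; they are not real floret vectors.

```python
import math

def cosine(a, b):
    """Cosine similarity, returning 0.0 when either vector has zero length."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)

def nearest(query_vec, candidates, n=3):
    """Rank candidate words by cosine similarity to the query, highest first."""
    scored = [(cosine(query_vec, vec), word) for word, vec in candidates.items()]
    return sorted(scored, reverse=True)[:n]

# Toy vectors with made-up numbers, only to show the ranking.
toy_vectors = {
    "fibromatosis": [0.9, 0.1],
    "osteochondrosis": [0.7, 0.3],
    "zergling": [0.0, 1.0],
}
result = nearest([1.0, 0.0], toy_vectors, n=2)
print(result)
```

In this notebook you could build the candidate dictionary from the pipeline, for example with `{word: nlp_fl.vocab[word].vector for word in my_words}`, where `my_words` is a hypothetical list of words that you supply.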

## Another Demo 

If you'd like to dive a bit deeper into the Floret subword embeddings, you can also download the `.bin` files and use the `floret` library directly. These `.bin` files are fairly large, but having them locally allows you to explore the embeddings in more detail.

In [None]:
! wget https://github.com/explosion/spacy-vectors-builder/releases/download/en-3.4.0/en_vectors_floret_md.bin

In [None]:
import floret 

model_fl = floret.load_model("en_vectors_floret_md.bin")

Now that we have a floret model loaded, we can explore the subwords.

In [None]:
strings, indices = model_fl.get_subwords("univercities")
strings

It also provides us with a `.get_nearest_neighbors` method that allows us to fetch nearby tokens.

In [None]:
model_fl.get_nearest_neighbors("univercities")

You can see how "cities" is considered similar to "univercities" here. 

## Compare

Let's now compare the nearest neighbors from the floret model and the normal medium spaCy pipeline. The block of code below can run comparisons on your behalf.

In [None]:
import numpy as np
import pandas as pd 

query = 'Zerglingmatosis'

# Get the spaCy tokens (the standard pipeline was loaded above as `nlp_md`)
query_vec = np.array([nlp_md.vocab[query].vector])
keys, best_rows, scores = nlp_md.vocab.vectors.most_similar(query_vec, n=10)
df_spacy = pd.DataFrame({"token_spacy": [nlp_md.vocab[k].text for k in keys[0]], "dist_spacy": scores[0]})

# Get the floret tokens
similar = model_fl.get_nearest_neighbors(query, k=10)
df_floret = pd.DataFrame({"token_floret": [t[1] for t in similar], "dist_token_floret": [t[0] for t in similar]})

# Print the results
pd.concat([df_spacy, df_floret], axis=1)

Whenever the `en_core_web_md` pipeline doesn't have an entry for a query, the nearest neighbors are going to be random. This is because the associated query vector will be an array of zeros. 

Feel free to play around! It can be very helpful to understand the difference in what the vectors represent. You'll notice that the spaCy pipeline does a pretty good job at capturing some popular misspellings of words, but not all of them.

Here are some words that might give some insights: 

- univercity
- hobbitshire
- superduper

## When does this make a difference? 

As of spaCy v3.3 we've started shipping some language models that support these floret embeddings. At the time of writing this notebook the `fi_core_news_md`, `fi_core_news_lg`, `ko_core_news_md` and `ko_core_news_lg` pipelines all carry these embeddings.

There's a reason why these embeddings might be more impactful in these languages. Taking Finnish as an example, consider the following sentences.

> **Talo** on helppo sana. -> **House** is an easy word.
>
> En pidä tämän **talon** väristä. -> I don't like this **house's** colour.
>
> Asun **talossa**. -> I live **in the house**.
>
> Nähdään **talolla**! -> See you **at the house**!
>
> On vaikeaa elää **talotta**. -> It's difficult to live **without a house**.

Notice how, in English, you might use extra words to describe where the house is. But in Finnish, you would change the word "house" depending on how it's used in the sentence. Now imagine that it's not just the word "house" that has this property, but every noun! Imagine how big the vocabulary table would need to be if we wanted to support all of these words!
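
Subwords help here because the inflected forms still share part of their spelling. The sketch below lists which 5-character n-grams the five forms of "talo" have in common. The helper name `char_ngrams` and the n-gram length are assumptions for illustration; the Finnish pipelines may use different subword settings.

```python
def char_ngrams(word, n=5):
    """Split a word into character n-grams, with < and > marking its boundaries."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

forms = ["talo", "talon", "talossa", "talolla", "talotta"]

# Keep only the subwords that occur in every inflected form.
common = set(char_ngrams(forms[0]))
for form in forms[1:]:
    common &= set(char_ngrams(form))

print(sorted(common))  # ['<talo']
```

A word-level table would need a separate row for each of these forms, while a subword table can reuse what it learned about the shared stem.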

And it's not just nouns either. Let's have a look at some of the verbs. 

> minä **kalastan** -> I **fish**
>
> sinä **kalastat** -> you **fish**
>
> me **kalastamme** -> we **fish**
>
> he **kalastavat** -> they **fish**


Verb conjugations are typically much more elaborate outside of the English language, which again motivates the use of subword embeddings. 

## Should I use Floret in English? 

You can certainly give these vectors a try. If your domain is online text, you might benefit from being more robust against spelling errors. At the same time, you might not need them: English has relatively simple morphology, which is also why we didn't add official Floret pipelines for English. That said, [feedback is appreciated](https://github.com/explosion/spaCy/discussions) if you have any!