# Backprop Core Example: Text vectorisation

Text vectorisation turns variable-sized text into a fixed-size vector. This is useful because you can perform mathematical operations, such as comparison, on vectors. Popular use cases include semantic search and customer intent detection.

## Why and how does it work?

Everything inside a model is numbers. Say a model has to classify text. It does that by processing the text to build an internal numerical representation of it, which is then used to compute probabilities for the available categories.
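
As a rough illustration of that last step, here is a minimal sketch of turning an internal representation into category probabilities. The representation, the category names and the weights are all made-up values for illustration; a real model learns them, and this is not how any particular Backprop model is implemented.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into probabilities that sum to 1."""
    highest = max(scores)
    # Subtracting the maximum keeps the exponentials numerically stable
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical internal representation of a piece of text (made-up numbers)
representation = [0.2, -1.0, 0.7]

# Hypothetical learned weights, one row per category
category_weights = {
    "complaint": [1.0, 0.5, -0.3],
    "question": [-0.4, 0.1, 1.2],
    "praise": [0.3, -0.8, 0.2],
}

# One score per category: dot product of its weights with the representation
scores = [
    sum(w * x for w, x in zip(weights, representation))
    for weights in category_weights.values()
]
probabilities = dict(zip(category_weights, softmax(scores)))
print(probabilities)
```

The point of the sketch is only that the text has already become a vector (`representation`) before any category is chosen.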

In a way, every natural language model does text vectorisation; the vector is just not visible as output. This task is about using models that have been explicitly trained to output vectors.

## What do you do with a vector?

A vector on its own is pretty much useless. It becomes useful when you have at least two vectors that have been computed by the same model.

Once you have at least two vectors you can compare them. A common approach is to use cosine similarity, which calculates a value between -1 and 1. Our models have been trained to have cosine similarities mostly between 0 and 1, which makes it easy to score how similar a piece of text is to another.
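
To make the comparison concrete, here is a minimal hand-written sketch of cosine similarity in plain Python. The helper name `cosine` and the toy vectors are assumptions for illustration; it is a stand-in for the idea, not the implementation behind the library's own utility.

```python
import math


def cosine(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors chosen to show the extremes of the range
same_direction = cosine([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # close to 1
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])                # 0
opposite = cosine([1.0, 2.0], [-1.0, -2.0])                # close to -1
print(same_direction, orthogonal, opposite)
```

Because the result depends only on direction and not on length, scaling a vector does not change its score.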

Let's see how one might use it for semantic search.

In [1]:
import backprop

In [2]:
# Set your API key to do inference on Backprop's platform
# Leave as None to run locally
api_key = None

See what models are available.

In [3]:
backprop.TextVectorisation.list_models(display=True)

Name                     msmarco-distilroberta-base-v2
Description              This English model is a standard distilroberta-base model from the Sentence Transformers repo, which has been trained on the MS MARCO dataset.
Supported tasks          ['text-vectorisation']
Finetunable tasks        ['text-vectorisation']
----------
Name                     distiluse-base-multilingual-cased-v2
Description              This model is based off Sentence-Transformer's distiluse-base-multilingual-cased multilingual model that has been extended to understand sentence embeddings in 50+ languages.
Supported tasks          ['text-vectorisation']
Finetunable tasks        ['text-vectorisation']
----------
Name                     clip-vit-b32
Alias                    clip
Description              OpenAI's recently released CLIP model — when supplied with a list of labels and an image, CLIP can accurately predict which labels best fit the provided image.
Supported tasks          ['image-classification'

In [4]:
tv = backprop.TextVectorisation(api_key=api_key)

Start the task with `backprop.TextVectorisation("multilingual")` to compute vectors with a model that understands 100+ languages.

In [5]:
# A context paragraph about the ISS, segments taken from Wikipedia.
context = """The International Space Station (ISS) is a modular space station (habitable artificial satellite) in low Earth orbit. It is a multinational collaborative project involving five participating space agencies: NASA (United States), Roscosmos (Russia), JAXA (Japan), ESA (Europe), and CSA (Canada).
The station serves as a microgravity and space environment research laboratory in which scientific research is conducted in astrobiology, astronomy, meteorology, physics, and other fields.
The station is divided into two sections: the Russian Orbital Segment (ROS), operated by Russia; and the United States Orbital Segment (USOS), which is shared by many nations.
The first ISS component was launched in 1998, and the first long-term residents arrived on 2 November 2000.
The Dragon spacecraft allows the return of pressurised cargo to Earth, which is used, for example, to repatriate scientific experiments for further analysis. As of September 2019, 239 astronauts, cosmonauts, and space tourists from 19 different nations have visited the space station, many of them multiple times; this includes 151 Americans, 47 Russians, nine Japanese, eight Canadians, and five Italians."""

In [6]:
sentences = context.split("\n")
sentences

['The International Space Station (ISS) is a modular space station (habitable artificial satellite) in low Earth orbit. It is a multinational collaborative project involving five participating space agencies: NASA (United States), Roscosmos (Russia), JAXA (Japan), ESA (Europe), and CSA (Canada).',
 'The station serves as a microgravity and space environment research laboratory in which scientific research is conducted in astrobiology, astronomy, meteorology, physics, and other fields.',
 'The station is divided into two sections: the Russian Orbital Segment (ROS), operated by Russia; and the United States Orbital Segment (USOS), which is shared by many nations.',
 'The first ISS component was launched in 1998, and the first long-term residents arrived on 2 November 2000.',
 'The Dragon spacecraft allows the return of pressurised cargo to Earth, which is used, for example, to repatriate scientific experiments for further analysis. As of September 2019, 239 astronauts, cosmonauts, and sp

We have taken a paragraph about the ISS and split it on newlines into a list of roughly sentence-sized segments.

Next, we will calculate a vector for each sentence.

In [7]:
sentence_vectors = tv(sentences)

In [8]:
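import operator


def search_sentences(question):
    # Vectorise the question with the same model used for the sentences
    question_vector = tv(question)
    # Built-in utility function: one similarity score per sentence vector
    similarities = backprop.cosine_similarity(question_vector, sentence_vectors)
    # Pick the index and score of the best-matching sentence
    index, value = max(enumerate(similarities), key=operator.itemgetter(1))

    print(sentences[index])
    print("Score:", value)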

This function calculates a vector for our question and compares it with the sentence vectors to find the one with the highest score.
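
The function above reports only the single best match. For search it is often handy to see several candidates, so here is a minimal sketch of a top-k variant. It is self-contained: `cosine`, `top_k` and the three-dimensional toy vectors are hypothetical stand-ins for illustration, not part of the library.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, candidate_vectors, k=3):
    """Return (index, score) pairs for the k most similar candidates."""
    scores = [cosine(query_vector, v) for v in candidate_vectors]
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Made-up vectors standing in for real model output
toy_sentence_vectors = [
    [1.0, 0.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 0.0, 1.0],
]
toy_question_vector = [0.8, 0.6, 0.0]

results = top_k(toy_question_vector, toy_sentence_vectors, k=2)
for index, score in results:
    print(index, round(score, 2))
```

Assuming the vectors returned by the task behave like plain lists of floats, the same helper could be applied to the question vector and `sentence_vectors`, using each returned index to look up the text in `sentences`.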

In [9]:
search_sentences("When was the first piece of the ISS launched?")

The first ISS component was launched in 1998, and the first long-term residents arrived on 2 November 2000.
Score: 0.8193097114562988


In [10]:
search_sentences("How many space agencies operate the ISS?")

The International Space Station (ISS) is a modular space station (habitable artificial satellite) in low Earth orbit. It is a multinational collaborative project involving five participating space agencies: NASA (United States), Roscosmos (Russia), JAXA (Japan), ESA (Europe), and CSA (Canada).
Score: 0.6959230899810791
