# 01. Basic methods of text encoding

There are plenty of simple methods for text vectorization. They encode some of the meaning of the given text but, not surprisingly, miss most of the contextual information. This notebook is a short recap of the basic ones.

## Bag-of-words (BoW)

Bag of Words is probably the simplest vectorization method one can imagine. During the training phase, it collects all the words used in the provided corpus and builds a dictionary from them, so that each word gets its own position in the output vector. A given text is then vectorized by counting the occurrences of each dictionary word in it.
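Before turning to a library, the two phases described above can be sketched in plain Python. This is a minimal illustration, not how any particular library implements it: the `tokenize`, `fit_vocabulary` and `vectorize` helpers and the toy corpus are made up for this example, and the tokenizer is a simplifying assumption.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase the text and keep runs of 2+ word characters.
    return re.findall(r"\b\w\w+\b", text.lower())


def fit_vocabulary(corpus):
    """Map every distinct word in the corpus to a fixed vector position."""
    words = sorted({word for doc in corpus for word in tokenize(doc)})
    return {word: index for index, word in enumerate(words)}


def vectorize(text, vocabulary):
    """Count occurrences of each known word; unknown words are ignored."""
    counts = Counter(tokenize(text))
    vector = [0] * len(vocabulary)
    for word, count in counts.items():
        if word in vocabulary:
            vector[vocabulary[word]] = count
    return vector


toy_corpus = ["the cat sat", "the dog sat on the mat"]
vocabulary = fit_vocabulary(toy_corpus)
print(vocabulary)
print(vectorize("the cat chased the dog", vocabulary))
```

Note that "chased" does not occur in the toy corpus, so it has no position in the vector and simply disappears from the encoding.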

Let's consider a simple example, written with scikit-learn.

In [1]:
from sklearn.feature_extraction.text import CountVectorizer

We need a language corpus. Let's use some quotes by Milan Kundera (source: https://www.goodreads.com/author/quotes/6343.Milan_Kundera)

In [2]:
CORPUS = (
    "Two people in love, alone, isolated from the world, that's beautiful.",
    "You can't measure the mutual affection of two human beings by the number of words they exchange.",
    "Anyone whose goal is 'something higher' must expect someday to suffer vertigo. What is vertigo? "
    "Fear of falling? No, Vertigo is something other than fear of falling. It is the voice of the emptiness "
    "below us which tempts and lures us, it is the desire to fall, against which, terrified, we defend ourselves.",
    "When the heart speaks, the mind finds it indecent to object.",
    "Dogs are our link to paradise. They don't know evil or jealousy or discontent. To sit with a dog on a hillside "
    "on a glorious afternoon is to be back in Eden, where doing nothing was not boring - it was peace.",
    "Making love with a woman and sleeping with a woman are two separate passions, not merely different but opposite. "
    "Love does not make itself felt in the desire for copulation (a desire that extends to an infinite number of women) "
    "but in the desire for shared sleep (a desire limited to one woman).",
    "Love is the longing for the half of ourselves we have lost.",
    "for there is nothing heavier than compassion. Not even one's own pain weighs so heavy as the pain one feels with "
    "someone, for someone, a pain intensified by the imagination and prolonged by a hundred echoes.",
    "But when the strong were too weak to hurt the weak, the weak had to be strong enough to leave."
)

For the vectorization example, it is better to use a sentence from outside the corpus. We'll consider another quote by Kundera:

In [3]:
SENTENCE = "A person who longs to leave the place where he lives is an unhappy person."

From a human perspective, it makes no difference whether a word is capitalized. For vectorization it can make a huge difference: if we took the data without any preprocessing, the words "Anyone" and "anyone" would be assigned different positions. (`CountVectorizer` lowercases its input by default, so this particular case is handled for us here.) There are also other issues, such as how to merge different forms of a word into the same position ("speaks" and "speak" carry the same information, but would be encoded as if they were different words). We'll consider that later on. First of all, let's fit our vectorizer.

In [4]:
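count_vectorizer = CountVectorizer()
# fit() only learns the vocabulary; the transformed corpus is not needed here
count_vectorizer.fit(CORPUS)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(count_vectorizer.get_feature_names_out().tolist())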

['affection', 'afternoon', 'against', 'alone', 'an', 'and', 'anyone', 'are', 'as', 'back', 'be', 'beautiful', 'beings', 'below', 'boring', 'but', 'by', 'can', 'compassion', 'copulation', 'defend', 'desire', 'different', 'discontent', 'does', 'dog', 'dogs', 'doing', 'don', 'echoes', 'eden', 'emptiness', 'enough', 'even', 'evil', 'exchange', 'expect', 'extends', 'fall', 'falling', 'fear', 'feels', 'felt', 'finds', 'for', 'from', 'glorious', 'goal', 'had', 'half', 'have', 'heart', 'heavier', 'heavy', 'higher', 'hillside', 'human', 'hundred', 'hurt', 'imagination', 'in', 'indecent', 'infinite', 'intensified', 'is', 'isolated', 'it', 'itself', 'jealousy', 'know', 'leave', 'limited', 'link', 'longing', 'lost', 'love', 'lures', 'make', 'making', 'measure', 'merely', 'mind', 'must', 'mutual', 'no', 'not', 'nothing', 'number', 'object', 'of', 'on', 'one', 'opposite', 'or', 'other', 'our', 'ourselves', 'own', 'pain', 'paradise', 'passions', 'peace', 'people', 'prolonged', 'separate', 'shared', '
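All feature names above are lowercase, and fragments such as 'can' and 'don' appear in place of "can't" and "don't". That follows from the vectorizer's default preprocessing. The sketch below imitates it with the standard `re` module, assuming the documented `CountVectorizer` defaults (`lowercase=True`, `token_pattern=r"(?u)\b\w\w+\b"`). The `default_like_tokens` helper and the sample text are hypothetical, and this is not the library's own code.

```python
import re

# Assumption: mirrors CountVectorizer's documented default token pattern,
# which keeps runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def default_like_tokens(text, lowercase=True):
    """Tokenize roughly the way the default vectorizer settings would."""
    if lowercase:
        text = text.lower()
    return TOKEN_PATTERN.findall(text)


sample = "Anyone can't see a dog. Does anyone?"
print(default_like_tokens(sample))
print(default_like_tokens(sample, lowercase=False))
```

With lowercasing switched off, "Anyone" and "anyone" stay distinct tokens and would occupy separate positions in the vector. Single-character tokens such as "a" and the trailing "t" of "can't" are dropped in both cases.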

As we have already trained the vectorizer, we can simply use it to check how it encodes a new sentence.

In [5]:
count_vectorizer.transform((SENTENCE, )).todense()

matrix([[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]])

In [6]:
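def get_word_scores(vectorizer, text):
    """Return (word, score) pairs for the vocabulary words present in text."""
    X = vectorizer.transform((text, )).todense()
    # get_feature_names_out() replaces get_feature_names(), removed in scikit-learn 1.2
    mapping = zip(vectorizer.get_feature_names_out().tolist(), X.tolist()[0])
    return list(filter(lambda pair: pair[1] > 0.0, mapping))

get_word_scores(count_vectorizer, SENTENCE)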

[('an', 1), ('is', 1), ('leave', 1), ('the', 1), ('to', 1), ('where', 1)]

We have lost most of the information from the encoded sentence: words that never occurred in the corpus, such as "person", "longs" or "unhappy", have no position in the vector and are silently dropped. What's more, any permutation of the original sentence produces exactly the same vector. The bag-of-words method doesn't care about the context of words; it only encodes their presence in the given text. It may be good enough for some simple cases, but typically it is not.
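The permutation claim is easy to check without any library. The sketch below shuffles the words of the example sentence and compares the word counts. The `bag_of_words` helper and its tokenizer are simplifying assumptions made for this illustration.

```python
import random
import re
from collections import Counter


def bag_of_words(text):
    # Simplified tokenizer (an assumption): lowercase words of 2+ characters.
    return Counter(re.findall(r"\b\w\w+\b", text.lower()))


sentence = "A person who longs to leave the place where he lives is an unhappy person."
words = sentence.rstrip(".").split()
random.Random(0).shuffle(words)  # fixed seed so the shuffle is repeatable
shuffled = " ".join(words) + "."

print(shuffled)
# The word order differs, but the bags of words are identical.
print(bag_of_words(sentence) == bag_of_words(shuffled))
```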

## TF-IDF

Bag of words has another drawback - non-informative words get exactly the same weight as the ones carrying the essential pieces of information. So-called stop words may be removed at the very beginning, but there are also words which are common in our corpus without being common in the language as a whole. That's where TF-IDF comes to the rescue!
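To make the distinction concrete, here is a small sketch with a made-up corpus and a deliberately tiny, hypothetical stop-word list. It looks for words that occur in every document and then checks which of them a generic stop-word list would catch.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; real lists are much longer.
STOP_WORDS = {"the", "of", "to", "is", "in", "and"}


def tokenize(text):
    # Simplified tokenizer (an assumption): lowercase words of 2+ characters.
    return re.findall(r"\b\w\w+\b", text.lower())


toy_corpus = [
    "the patient reports pain in the knee",
    "the patient is allergic to penicillin",
    "the patient recovered and left the clinic",
]

# Document frequency: in how many documents does each word occur?
document_frequency = Counter(
    word for doc in toy_corpus for word in set(tokenize(doc))
)
common_everywhere = sorted(
    word for word, df in document_frequency.items() if df == len(toy_corpus)
)
print(common_everywhere)
# Words that are common in this corpus but missed by the stop-word list.
print([word for word in common_everywhere if word not in STOP_WORDS])
```

In this toy corpus "patient" appears in every document, so it says little about any single one of them, yet no general-purpose stop-word list would remove it.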

TF-IDF stands for Term Frequency - Inverse Document Frequency. In simple words - for each word we calculate two different terms and multiply them:
- **TF** - term frequency, a popularity score for a particular word within a single document - the higher the value, the more common the word is in the provided text document
- **IDF** - inverse document frequency, an inverted popularity score for a particular word across all the corpus documents - the higher the value, the less common the word is in the whole corpus, so the more informative it is (if a word is very common, it typically doesn't affect the meaning)
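The two terms can be sketched in a few lines of plain Python. The version below uses one common textbook formulation (relative frequency for TF, `log(N / df)` for IDF). This is an assumption made for illustration: libraries, including scikit-learn, use smoothed or otherwise adjusted variants, so the exact numbers will differ. All helper names and the toy corpus are made up.

```python
import math
import re


def tokenize(text):
    # Simplified tokenizer (an assumption): lowercase words of 2+ characters.
    return re.findall(r"\b\w\w+\b", text.lower())


def term_frequency(word, document):
    """Share of the document's tokens that are equal to `word`."""
    tokens = tokenize(document)
    return tokens.count(word) / len(tokens)


def inverse_document_frequency(word, corpus):
    """log(N / df); assumes the word occurs in at least one document."""
    df = sum(1 for doc in corpus if word in tokenize(doc))
    return math.log(len(corpus) / df)


def tf_idf(word, document, corpus):
    return term_frequency(word, document) * inverse_document_frequency(word, corpus)


toy_corpus = ["the cat sat", "the dog barked", "the cat purred"]
for word in ("the", "cat", "sat"):
    print(word, round(tf_idf(word, toy_corpus[0], toy_corpus), 3))
```

All three words have the same term frequency in the first document, so the ranking comes entirely from IDF: "the" occurs in every document and scores zero, while "sat", which occurs in only one, scores highest.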

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer

We're going to consider the same corpus and the same sentence as before.

In [8]:
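tfidf_vectorizer = TfidfVectorizer()
# fit() learns the vocabulary and the IDF weights from the corpus
tfidf_vectorizer.fit(CORPUS)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(tfidf_vectorizer.get_feature_names_out().tolist())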

['affection', 'afternoon', 'against', 'alone', 'an', 'and', 'anyone', 'are', 'as', 'back', 'be', 'beautiful', 'beings', 'below', 'boring', 'but', 'by', 'can', 'compassion', 'copulation', 'defend', 'desire', 'different', 'discontent', 'does', 'dog', 'dogs', 'doing', 'don', 'echoes', 'eden', 'emptiness', 'enough', 'even', 'evil', 'exchange', 'expect', 'extends', 'fall', 'falling', 'fear', 'feels', 'felt', 'finds', 'for', 'from', 'glorious', 'goal', 'had', 'half', 'have', 'heart', 'heavier', 'heavy', 'higher', 'hillside', 'human', 'hundred', 'hurt', 'imagination', 'in', 'indecent', 'infinite', 'intensified', 'is', 'isolated', 'it', 'itself', 'jealousy', 'know', 'leave', 'limited', 'link', 'longing', 'lost', 'love', 'lures', 'make', 'making', 'measure', 'merely', 'mind', 'must', 'mutual', 'no', 'not', 'nothing', 'number', 'object', 'of', 'on', 'one', 'opposite', 'or', 'other', 'our', 'ourselves', 'own', 'pain', 'paradise', 'passions', 'peace', 'people', 'prolonged', 'separate', 'shared', '

In [9]:
get_word_scores(tfidf_vectorizer, SENTENCE)

[('an', 0.504069491273459),
 ('is', 0.32706807618956785),
 ('leave', 0.504069491273459),
 ('the', 0.2135243418310226),
 ('to', 0.2918487157505274),
 ('where', 0.504069491273459)]
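The scores above can be reproduced by hand, which shows where the numbers come from. The sketch assumes the `TfidfVectorizer` defaults as documented by scikit-learn: raw term counts, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of the resulting vector. The document frequencies below were counted manually from the nine quotes in `CORPUS`, so they are hard-coded assumptions, not values read from the library.

```python
import math

# Counted by hand from the corpus above: in how many of the 9 quotes
# does each word occur (assuming the default tokenization)?
N_DOCS = 9
DOCUMENT_FREQUENCY = {"an": 1, "is": 4, "leave": 1, "the": 8, "to": 5, "where": 1}
# Each of these words occurs exactly once in SENTENCE.
TERM_COUNT = {word: 1 for word in DOCUMENT_FREQUENCY}


def smoothed_idf(df, n_docs):
    # Smoothed variant (smooth_idf=True): ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + df)) + 1


raw = {
    word: TERM_COUNT[word] * smoothed_idf(df, N_DOCS)
    for word, df in DOCUMENT_FREQUENCY.items()
}
norm = math.sqrt(sum(value * value for value in raw.values()))  # L2 norm
scores = {word: value / norm for word, value in raw.items()}

for word, score in scores.items():
    print(word, round(score, 6))
```

Words that occur in a single quote ("an", "leave", "where") get the highest weight, while "the", which is present in eight of the nine quotes, gets the lowest. That is exactly the down-weighting of common words that TF-IDF is meant to provide.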