# Text Representation Basics

In this notebook we go through the basics of text representation. We show how to implement two popular methods: Bag of Words and TF-IDF.

### Bag of Words

Many machine learning methods cannot use strings as features, so we have to encode them as numbers.

We can easily do this using the __Bag of Words (BoW)__ technique and the marvelous __sklearn__ library:

In [39]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

corpus = [
     'Bag Of Words is based on counting',
     'words occurences throughout multiple documents.',
     'This is the third document.',
     'As you can see most of the words occur only once.',
     'This gives us a pretty sparse matrix, see below. Really, see below',
]

X = vectorizer.fit_transform(corpus)
print(X.toarray())

[[0 1 1 0 0 1 0 0 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 0]
 [0 0 0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0]
 [0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0]
 [1 0 0 0 1 0 0 0 0 0 0 1 0 1 0 1 0 1 1 0 0 1 0 1 0 0 0 0 1 1]
 [0 0 0 2 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 1 1 2 1 0 0 1 0 1 0 0]]


Each document is represented by a row. Each value, ranging from 0 to N, is the number of times a word occurred in that document. You can see which word corresponds to which column by calling get_feature_names() on the vectorizer object (in scikit-learn 1.0 and later the method is get_feature_names_out(); get_feature_names() was removed in 1.2).
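The column-to-word mapping can also be read from the fitted vectorizer's vocabulary_ attribute, which exists regardless of the scikit-learn version. The snippet below is a minimal sketch on a tiny two-sentence corpus chosen for illustration; the names docs, column_to_word and words are assumptions made for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Tiny illustrative corpus (an assumption for this sketch).
docs = ['Tom has cat', 'Tom has fish']

vec = CountVectorizer()
counts = vec.fit_transform(docs).toarray()

# vocabulary_ maps each word to its column index; invert it to label columns.
column_to_word = {index: word for word, index in vec.vocabulary_.items()}
words = [column_to_word[i] for i in range(len(column_to_word))]

# Show each document as a {word: count} dictionary.
for doc, row in zip(docs, counts):
    print(doc, '->', dict(zip(words, row.tolist())))
```

Columns are ordered alphabetically by word, which is why 'cat' comes first and 'tom' last.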

#### Exercise 1: Build your own Bag of Words implementation using the tokenizer created before

You need to implement the following methods:

- fit_transform: takes a list of strings and returns a matrix with its BoW representation
- get_feature_names: returns the list of words corresponding to the columns of the BoW matrix

In [None]:
import numpy as np
import spacy

class BagOfWords:
    """Basic BoW implementation."""
    
    __nlp = spacy.load("en_core_web_sm")
    __bow_list = []
    
    
    def fit_transform(self, corpus: list):
        """Transform list of strings into BoW array.

        Parameters
        ----------
        corpus: List[str]
                Corpus of texts to be transformed

        Returns
        -------
        np.array
                Matrix representation of BoW

        """        
        return None
     
    def get_feature_names(self) -> list:
        """Return words corresponding to columns of matrix.

        Returns
        -------
        List[str]
                Words being transformed by fit function

        """     
        return None

vectorizer = BagOfWords()

X = vectorizer.fit_transform(corpus)
print(X)

vectorizer.get_feature_names()
len(vectorizer.get_feature_names())

In [40]:
vectorizer.get_feature_names()

['as',
 'bag',
 'based',
 'below',
 'can',
 'counting',
 'document',
 'documents',
 'gives',
 'is',
 'matrix',
 'most',
 'multiple',
 'occur',
 'occurences',
 'of',
 'on',
 'once',
 'only',
 'pretty',
 'really',
 'see',
 'sparse',
 'the',
 'third',
 'this',
 'throughout',
 'us',
 'words',
 'you']

This approach lets us easily describe the whole corpus, but it lacks information that is crucial for solving some tasks.

### TF-IDF

Look at the word "is". It is used in most documents, often many times, yet it does not tell us anything about them. Now think about sentiment analysis: if words like "great" or "awesome" occur more frequently in a document than in other documents, it may suggest a positive attitude.

TF-IDF is one way to encode this information, and I'll walk you through it step by step.

The first part of TF-IDF is, you guessed it, TF, which stands for Term Frequency. It can be calculated as:

\begin{equation}
tf_{ij}=\frac{n_{ij}}{\sum_{k} n_{kj}},
\end{equation}
where $n_{ij}$ is the number of occurrences of word $i$ in document $j$, and the sum in the denominator runs over all words $k$ in document $j$.

In [47]:
import numpy as np

corpus = [
'Tom has cat',
'Tom has fish',
'Tom is german',
]

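def tf(corpus):
    vec = CountVectorizer()
    bow_representation = vec.fit_transform(corpus)
    # Total number of words in each document, as a column vector (one value per row).
    words_per_document = np.asarray(bow_representation.sum(axis=1)).reshape(-1, 1)
    # Divide each row of counts by that document's word total.
    return bow_representation.toarray() / words_per_document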


For each document we count how many times each word occurs in it (the BoW representation) and divide by the total number of words in that document.
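To make the division concrete, here is a small self-contained sketch on the three-sentence "Tom" corpus. Every document has exactly three words, so each word that is present gets a term frequency of 1/3. The names small_corpus, counts and term_frequency are illustrative assumptions, not part of the notebook's code.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

small_corpus = ['Tom has cat', 'Tom has fish', 'Tom is german']

# Raw word counts: one row per document, one column per word.
counts = CountVectorizer().fit_transform(small_corpus).toarray()

# Divide each row by the total number of words in that document.
term_frequency = counts / counts.sum(axis=1, keepdims=True)

print(term_frequency.round(3))
```

Each row of the result sums to 1, since the term frequencies of a document are shares of its total word count.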

The next part is IDF, which stands for Inverse Document Frequency:

\begin{equation}
idf_{t}=\log\left(\frac{N}{df_{t}}\right),
\end{equation}
where $N$ is the total number of documents and $df_{t}$ is the number of documents containing term $t$.

#### Exercise 2: Fill out the idf part of tf-idf

We need the number of documents $N$ and, for each term, the number of documents that contain it. Use CountVectorizer.

In [48]:
def idf(corpus):
    document_count = len(corpus)
    bow_representation = CountVectorizer().fit_transform(corpus)
    return None

First we calculate the number of documents in the corpus (the number of rows in our case). Next, for each word, we count the documents that contain it at least once.

Taking the logarithm dampens the effect of idf. For example, the difference between a term occurring in 40 out of 50 documents and one occurring in 45 out of 50 documents will be smaller than the difference between 1/50 and 5/50. This puts a bigger emphasis on rarely occurring terms, as they are more informative.
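The numbers from this example can be checked with a few lines of plain Python. This is only a sketch: it assumes the natural logarithm, and idf_value is a hypothetical helper introduced for this illustration.

```python
import math

N = 50  # total number of documents, as in the example above


def idf_value(document_frequency, total=N):
    """Natural-log IDF for a term found in `document_frequency` documents."""
    return math.log(total / document_frequency)


# Two common terms: present in 40 and in 45 of the 50 documents.
common_gap = idf_value(40) - idf_value(45)
# Two rare terms: present in 1 and in 5 of the 50 documents.
rare_gap = idf_value(1) - idf_value(5)

print(f'common terms differ by {common_gap:.3f}')
print(f'rare terms differ by   {rare_gap:.3f}')
```

The gap between the two rare terms comes out more than ten times larger than the gap between the two common ones.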

Finally, for the whole thing to work, we simply multiply both:

In [50]:
def tf_idf(corpus):
    return tf(corpus) * idf(corpus)
#tf_idf(corpus)

Let's calculate it:

In [51]:
corpus = [
     'Bag Of Words is based on counting',
     'words occurences throughout multiple documents.',
     'This is the third document.',
     'As you can see most of the words occur only once.',
     'This gives us a pretty sparse matrix, see below. Really, see below',
]

tfidf_result = tf_idf(corpus)

print(tfidf_result.shape)

(5, 30)


In Jupyter it's easier to display results with pandas:

In [52]:
import pandas as pd
pd.DataFrame(tfidf_result).head()

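| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.22992 | 0.22992 | 0.0 | 0.0 | 0.22992 | 0.0 | 0.0 | 0.0 | 0.130899 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.072975 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.321888 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.321888 | 0.0 | 0.102165 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.321888 | 0.0 | 0.0 | 0.183258 | ... | 0.0 | 0.0 | 0.0 | 0.183258 | 0.321888 | 0.183258 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.146313 | 0.0 | 0.0 | 0.0 | 0.146313 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.083299 | 0.0 | 0.083299 | 0.0 | 0.0 | 0.0 | 0.0 | 0.046439 | 0.146313 |
| 4 | 0.0 | 0.0 | 0.0 | 0.292625 | 0.0 | 0.0 | 0.0 | 0.0 | 0.146313 | 0.0 | ... | 0.146313 | 0.166598 | 0.146313 | 0.0 | 0.0 | 0.083299 | 0.0 | 0.146313 | 0.0 | 0.0 |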


There are many versions of tf-idf: some use different smoothing, some apply an additional logarithm to the tf part, and so on. Each transforms the corpus a little differently, and the appropriate one should be chosen based on the effect we would like to obtain.
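As one illustration of such variants, scikit-learn's TfidfVectorizer exposes switches like smooth_idf and sublinear_tf. Its documented idf is $\ln(N/df_t)+1$ without smoothing and $\ln((1+N)/(1+df_t))+1$ with smoothing, so it differs from the plain formula used in this notebook. The sketch below compares the two settings on the small "Tom" corpus; the variable names are assumptions made for this example.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

small_corpus = ['Tom has cat', 'Tom has fish', 'Tom is german']

# norm=None turns off row normalisation so the raw idf effect stays visible.
plain = TfidfVectorizer(smooth_idf=False, norm=None).fit(small_corpus)
smoothed = TfidfVectorizer(smooth_idf=True, norm=None).fit(small_corpus)

# Column index of a word that occurs in exactly one of the three documents.
cat = plain.vocabulary_['cat']

print('unsmoothed idf for "cat":', round(float(plain.idf_[cat]), 3))
print('smoothed idf for "cat":  ', round(float(smoothed.idf_[cat]), 3))
```

Smoothing acts as if one extra document containing every term had been added, which lowers the weight of rare words slightly and avoids division by zero.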
