# Lab2.3 Feature representation

Copyright: Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

A critical component of almost any machine learning approach is **feature representation**. 
This is not strange since we need to somehow convert a textual unit, e.g., word, sentence, tweet, or document, into something meaningful that can not only be interpreted by a computer, but is also useful for the type of learning we want to do. 

In this notebook, we show two basic feature representations used in machine learning: **bag of words** and **TF-IDF**.

**At the end of this notebook, you will be able to:**
* build a bag of words representation
* build a TF-IDF-based model

**If you want to learn more: (information from these blogs was used in this notebook)**
* [bag of words introduction](http://www.insightsbot.com/blog/R8fu5/bag-of-words-algorithm-in-python-introduction)
* [TF-IDF introduction](https://medium.freecodecamp.org/how-to-process-textual-data-using-tf-idf-in-python-cd2bbc0a94a3)
* [another TF-IDF introduction](https://machinelearningmastery.com/prepare-text-data-machine-learning-scikit-learn/)

In [2]:
import numpy
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
import nltk

## Bag of words
The bag of words approach consists of two main steps:

* 1 we extract all the unique words from a collection of textual units, e.g., documents
* 2 we compute the frequency of each word in each document.

Let's try this for the following three sentences.

In [3]:
sents = ['A rose is a rose',
         'A rose is a flower',
         "A book is nice"]

Our text structure is now a list of strings, where each string contains separate words or tokens. We can generate such a structure from any text by calling sentence splitting and tokenisation functions. Here, each string represents the full text of a document.

Sklearn can handle such representations and create a vector for each unit (here, each element of the list) based on the complete vocabulary of words that occur across all the units.

Such a vector resembles a one-hot encoding of the document, except that each position holds a count rather than just a 0 or 1: each document is scored for the words from the total vocabulary that are present in it. You could also see the resulting matrix as a word-to-document index.
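To make the idea concrete, here is a minimal sketch of the two bag of words steps in plain Python, without sklearn. The helper `simple_tokenize` is a hypothetical stand-in for a real tokenizer (it only lowercases and splits on whitespace), and sorting the vocabulary alphabetically is an assumption made to get a stable column order.

```python
from collections import Counter

sents = ['A rose is a rose',
         'A rose is a flower',
         'A book is nice']

# Hypothetical stand-in for a real tokenizer: lowercase and split on whitespace.
def simple_tokenize(text):
    return text.lower().split()

tokenized = [simple_tokenize(sent) for sent in sents]

# Step 1: collect the unique words across all documents (sorted for a stable order).
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# Step 2: count how often each vocabulary word occurs in each document.
bow_vectors = []
for tokens in tokenized:
    counts = Counter(tokens)
    bow_vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
for vector in bow_vectors:
    print(vector)
```

Every document ends up with a vector of the same length, one position per vocabulary word, which is the structure that sklearn builds for us.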

We will use the **CountVectorizer** to create the bag of words representation:

In [7]:
vectorizer = CountVectorizer(min_df=1, # in how many documents the term minimally occurs
                             tokenizer=nltk.word_tokenize) # we use the nltk tokenizer
sents_counts = vectorizer.fit_transform(sents)

The shape of the result shows us that we have 3 documents and 6 unique words:

In [8]:
# sents_counts has a dimension of 3 (document count) by 6 (# of unique words)
print(sents_counts.shape)
print('unique words:', list(vectorizer.vocabulary_.keys()))

(3, 6)
unique words: ['a', 'rose', 'is', 'flower', 'book', 'nice']


The bag of words representation looks like this:

In [9]:
# this vector is small enough to view in full! 
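# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() instead
print(vectorizer.get_feature_names_out().tolist())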
sents_counts.toarray()

['a', 'book', 'flower', 'is', 'nice', 'rose']


array([[2, 0, 0, 1, 0, 2],
       [2, 0, 1, 1, 0, 1],
       [1, 1, 0, 1, 1, 0]], dtype=int64)

For example, this means that:
* the word **a** occurs twice in each of the first two sentences, and only once in the third.
* the word **book** only occurs in the third sentence.
* ...

It is important to realise that each position in the vector represents a specific word. This means that every document is represented by a vector with the same layout: vector positions and length must be the same across all data representations.
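A short sketch of what a shared vector layout means in practice: a vectorizer that was fitted on the three sentences maps a new sentence onto the same six positions, and words it has never seen are ignored. The new sentence is made up for illustration. To avoid depending on NLTK, this sketch swaps in a `token_pattern` that keeps one-letter words such as *a*, which is an assumption and not the setup used elsewhere in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

sents = ['A rose is a rose',
         'A rose is a flower',
         'A book is nice']

# Assumption: this token_pattern keeps one-letter tokens such as "a",
# so the NLTK tokenizer is not needed for this sketch.
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
vectorizer.fit(sents)

# Words ordered by their column index in the vectors.
feature_names = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# A new, made-up document: "garden" is not in the vocabulary and is dropped.
new_counts = vectorizer.transform(['A nice rose garden'])

print(feature_names)
print(new_counts.toarray())
```

The new vector has the same length as the original ones, so it can be compared with them position by position.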

This is the most basic feature representation for machine learning. It is easy to see that we can now compare the documents in terms of similarity by simply comparing the counts of the words in the vectorized representations. 

The similarity of the documents is defined by the degree to which the same words occur equally frequently. Also note that the vectors will become very large and sparse when we vectorize large data collections.

The formal way to calculate the similarity between two vectors is the normalised dot product (NDP), also known as cosine similarity. The NDP of two vectors is the sum of the products of their values at each position, divided by the product of the lengths (Euclidean norms) of the two vectors.
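As an illustration, the normalised dot product (also known as cosine similarity) can be sketched in plain Python by dividing the dot product of two vectors by the product of their Euclidean lengths. The function name `normalised_dot_product` is our own; the vectors are the count vectors of the three example sentences.

```python
import math

# Count vectors of the three example sentences
# (columns: a, book, flower, is, nice, rose).
doc_vectors = [[2, 0, 0, 1, 0, 2],
               [2, 0, 1, 1, 0, 1],
               [1, 1, 0, 1, 1, 0]]

def normalised_dot_product(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(u, v))
    length_u = math.sqrt(sum(x * x for x in u))
    length_v = math.sqrt(sum(y * y for y in v))
    return dot / (length_u * length_v)

sim_1_2 = normalised_dot_product(doc_vectors[0], doc_vectors[1])
sim_1_3 = normalised_dot_product(doc_vectors[0], doc_vectors[2])
sim_2_3 = normalised_dot_product(doc_vectors[1], doc_vectors[2])

print(round(sim_1_2, 2), round(sim_1_3, 2), round(sim_2_3, 2))
```

The two sentences about roses come out as the most similar pair, while both are less similar to the sentence about the book.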

## TF-IDF
One big problem of the bag of words approach is that it treats all words equally. Why is that a disadvantage? It means that words that occur in many documents, such as *a*, contribute as much to the decision making of the machine learning approach as words that are much more informative, e.g., *rose*. 
TF-IDF addresses this problem by assigning less weight to words that occur in many documents.
You can read a nice introduction to TF-IDF [here](https://medium.freecodecamp.org/how-to-process-textual-data-using-tf-idf-in-python-cd2bbc0a94a3).

This is how you can do it in Python:

In [10]:
tfidf_transformer = TfidfTransformer()
sents_tfidf = tfidf_transformer.fit_transform(sents_counts)

In [11]:
tf_idf_array = sents_tfidf.toarray()
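# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() instead
print(vectorizer.get_feature_names_out().tolist())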
print(numpy.round(tf_idf_array, decimals=1))

['a', 'book', 'flower', 'is', 'nice', 'rose']
[[0.6 0.  0.  0.3 0.  0.8]
 [0.6 0.  0.5 0.3 0.  0.4]
 [0.4 0.6 0.  0.4 0.6 0. ]]


This is a good result! In the bag of words approach, the words *a* and *book* both had a frequency of 1 in the third sentence. Now that we have applied TF-IDF, we see that *book* has a higher weight (0.6) than *a* (0.4), since *a* occurs in all three sentences and *book* in only one, which might indicate that *book* is more informative.
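To see where these weights come from, the sketch below recomputes them by hand in plain Python. It assumes the default settings of `TfidfTransformer`: a smoothed inverse document frequency, `idf = ln((1 + n) / (1 + df)) + 1`, followed by scaling each document vector to unit Euclidean length. With other settings the numbers would differ.

```python
import math

# Count vectors of the three example sentences
# (columns: a, book, flower, is, nice, rose).
counts = [[2, 0, 0, 1, 0, 2],
          [2, 0, 1, 1, 0, 1],
          [1, 1, 0, 1, 1, 0]]

n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: in how many documents does each word occur?
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed IDF (assumed default): words that occur in every document get weight 1.
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]

tfidf = []
for row in counts:
    # Weight each count by the IDF of its word.
    weighted = [tf * w for tf, w in zip(row, idf)]
    # Scale the document vector to unit Euclidean length.
    length = math.sqrt(sum(x * x for x in weighted))
    tfidf.append([x / length for x in weighted])

for row in tfidf:
    print([round(x, 1) for x in row])
```

Because *a* occurs in all three sentences, its IDF is exactly 1, while *book* occurs in only one sentence and gets an IDF of about 1.69. This is why *book* ends up with the higher weight in the third sentence.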

## End of this notebook