# Language understanding methods


NLP tasks rely on machine learning methods. These methods operate on numbers, so text must first be converted into a numeric representation. Of the many available methods, this section implements and explains Bag of Words, Tf-idf and Word2vec.

## Text vectorization

### Bag of Words

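Bag of Words (BoW) is one of the simplest methods for converting text into a vector: each text is described by how many times every word of the vocabulary occurs in it. A BoW vectorizer typically provides two methods:

- ``fit_transform`` - takes a list of strings and returns a matrix with their BoW representation,
- ``get_feature_names_out`` (``get_feature_names`` in scikit-learn versions before 1.0) - returns the list of words corresponding to the columns of the BoW matrix.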


![timeline](images/bow.png)

It can be easily implemented with ``CountVectorizer`` from the scikit-learn library.

In [9]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

vectorizer = CountVectorizer()

corpus = [
'This is a research paper on natural language processing',
'Dow Jones is a fintech company',
'We analyze news with natural language processing',
]

X = vectorizer.fit_transform(corpus)

# get_feature_names() was replaced by get_feature_names_out() in scikit-learn 1.0
bag = vectorizer.get_feature_names_out().tolist()

Each sentence is represented as an array whose length equals the size of the bag.

In [10]:
pd.DataFrame(X.toarray())

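|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|----|----|----|----|----|----|
| 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 |
| 1 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 |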


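The bag contains every word of the corpus exactly once. Note that ``CountVectorizer`` lowercases the text and, by default, ignores single-character tokens, which is why "a" is missing from the bag.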

In [11]:
print("Bag: "+str(bag))

Bag: ['analyze', 'company', 'dow', 'fintech', 'is', 'jones', 'language', 'natural', 'news', 'on', 'paper', 'processing', 'research', 'this', 'we', 'with']
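To make the two methods less of a black box, the sketch below builds the same representation with the standard library only. The class name ``SimpleBagOfWords`` and its tokenization rule are assumptions made for illustration; this is not how scikit-learn implements ``CountVectorizer``, it only mimics its default behaviour on this corpus.

```python
import re
from collections import Counter


class SimpleBagOfWords:
    """Hypothetical, minimal BoW vectorizer (illustration only)."""

    def __init__(self):
        self.vocabulary = []

    def _tokenize(self, text):
        # Lowercase and keep tokens of two or more word characters,
        # mirroring the default behaviour of CountVectorizer.
        return re.findall(r"\b\w\w+\b", text.lower())

    def fit_transform(self, corpus):
        tokenized = [self._tokenize(text) for text in corpus]
        # The bag: every distinct word, sorted alphabetically.
        self.vocabulary = sorted({tok for tokens in tokenized for tok in tokens})
        rows = []
        for tokens in tokenized:
            counts = Counter(tokens)
            # One column per word of the bag; missing words count as 0.
            rows.append([counts[word] for word in self.vocabulary])
        return rows

    def get_feature_names(self):
        return list(self.vocabulary)


corpus = [
    'This is a research paper on natural language processing',
    'Dow Jones is a fintech company',
    'We analyze news with natural language processing',
]

bow = SimpleBagOfWords()
matrix = bow.fit_transform(corpus)
print(bow.get_feature_names())
print(matrix[1])
```

On this corpus the sketch should reproduce the bag and the rows shown above.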


### Tf-idf

Tf-idf stands for term frequency-inverse document frequency. This method extends BoW by weighting each count by how informative the word is. The term frequency (tf) measures how often a word occurs in a given sentence, and the inverse document frequency (idf) measures how rare the word is across all sentences. A word that occurs in many sentences gets a low weight, while a word that occurs in only a few sentences gets a higher weight in the sentences where it appears.
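The implementation in this section corresponds to the unsmoothed definitions below. The symbols $N$ (number of sentences), $\mathrm{df}(w)$ (number of sentences containing word $w$) and $\mathrm{count}(w, s)$ (occurrences of $w$ in sentence $s$) are notation introduced here for illustration.

$$
\mathrm{tf}(w, s) = \frac{\mathrm{count}(w, s)}{\sum_{w'} \mathrm{count}(w', s)}, \qquad
\mathrm{idf}(w) = \ln \frac{N}{\mathrm{df}(w)}, \qquad
\text{tf-idf}(w, s) = \mathrm{tf}(w, s) \cdot \mathrm{idf}(w)
$$

Library implementations often use variants of these formulas. For example, scikit-learn's ``TfidfVectorizer`` smooths the idf term and normalizes each row by default, so its values differ from the ones computed here.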

In [14]:
import numpy as np


corpus = [
'This is a research paper on natural language processing',
'Dow Jones is a fintech company',
'We analyze news with natural language processing',
]

def tf(corpus):
    # Term frequency: word count divided by the number of words in the sentence.
    counts = CountVectorizer().fit_transform(corpus).toarray()
    words_per_sentence = counts.sum(axis=1, keepdims=True)
    return counts / words_per_sentence

def idf(corpus):
    # Inverse document frequency: log(N / number of sentences containing the word).
    counts = CountVectorizer().fit_transform(corpus).toarray()
    sentences_with_word = np.count_nonzero(counts, axis=0)
    return np.log(len(corpus) / sentences_with_word)

def tf_idf(corpus):
    # Rows are sentences, columns are the words of the bag.
    return tf(corpus) * idf(corpus)

tfidf = tf_idf(corpus)

The resulting representation reflects how important each word is: the higher the value, the more characteristic the word is of its sentence relative to the whole corpus.

In [15]:
pd.DataFrame(tfidf)

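|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|----|----|----|----|----|----|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.050683 | 0.0 | 0.050683 | 0.050683 | 0.0 | 0.137327 | 0.137327 | 0.050683 | 0.137327 | 0.137327 | 0.0 | 0.0 |
| 1 | 0.0 | 0.219722 | 0.219722 | 0.219722 | 0.081093 | 0.219722 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.156945 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.057924 | 0.057924 | 0.156945 | 0.0 | 0.0 | 0.057924 | 0.0 | 0.0 | 0.156945 | 0.156945 |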
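As a sanity check, single entries of the result can be recomputed by hand. The sketch below does this for the words "company" and "is" in the second sentence (row 1), using only the standard library. The helper ``tf_idf_value`` and the hard-coded counts are assumptions for illustration, read off the BoW matrix shown earlier.

```python
import math

# Tokens of the second sentence as CountVectorizer sees them
# (lowercased, single-character "a" dropped).
sentence = ["dow", "jones", "is", "fintech", "company"]
document_count = 3

# Number of sentences that contain each word.
sentences_with_word = {"company": 1, "is": 2}


def tf_idf_value(word):
    tf = sentence.count(word) / len(sentence)
    idf = math.log(document_count / sentences_with_word[word])
    return tf * idf


print(round(tf_idf_value("company"), 6))  # 0.219722
print(round(tf_idf_value("is"), 6))       # 0.081093
```

The word "is" appears in two of the three sentences, so its weight is lower than that of "company", which appears in only one.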


### Word2Vec - negative sampling

Word2vec is, in fact, a family of log-linear models. These simple neural networks take a text corpus as input and produce continuous, multidimensional word representations. Such embeddings preserve similarities and relations between words. Thanks to their simplicity, the models can be trained on extensive datasets in a reasonable amount of time.

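There are two major architectures: Continuous Bag-of-Words (CBOW) and Skip-gram. Both process text with a fixed-size sliding window, and each training example consists of a central word and its context. CBOW predicts the central word from its context, while Skip-gram predicts the context words from the central word.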

![timeline](images/context.png)

CBOW is faster to train than Skip-gram and gives better representations of frequent words. Skip-gram, on the other hand, works well with small datasets and represents even rare words well.

![timeline](images/skipgram.png)
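The difference between the two architectures is easiest to see in the training examples they consume. The following sketch generates both kinds of examples from a toy sentence with a window of 2. The function names are hypothetical, and the windows are simply truncated at the sentence boundaries, which is an assumption of this sketch (the training code later in this section skips the boundary words instead).

```python
def skipgram_examples(tokens, window=2):
    """One (central word, context word) pair per context position."""
    examples = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                examples.append((center, tokens[j]))
    return examples


def cbow_examples(tokens, window=2):
    """One (context words, central word) example per position."""
    examples = []
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        examples.append((context, center))
    return examples


tokens = ["we", "analyze", "news", "with", "natural"]

print(cbow_examples(tokens)[2])
# (['we', 'analyze', 'with', 'natural'], 'news')
print(skipgram_examples(tokens)[:2])
# [('we', 'analyze'), ('we', 'news')]
```

CBOW yields one example per position with the whole context as input, while Skip-gram yields one example per (word, context word) pair.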

This example uses the NLTK treebank corpus, so we first import the required libraries and download the corpus data.

In [None]:
import nltk
import numpy as np
import pandas as pd
from collections import namedtuple

nltk.download('treebank')
nltk.download('punkt')
nltk.download('punkt_tab')  # tokenizer data that recent NLTK releases look up instead of 'punkt'

Three variables are important for the training: ``train_dict``, ``train_tokens`` and ``train_set``. The first contains all unique words used in the corpus. The second is an array that stores, for each word of the raw text, its index in the dictionary.

In [None]:
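# Take the first 50,000 characters of the raw corpus and strip markers and punctuation.
raw_set = (nltk.corpus.treebank_raw.raw()[0:50000]
           .replace('.START', ' ').replace("\n", "")
           .replace(".", " ").replace(",", " "))
# Keep alphabetic tokens only.
tokens = [token for token in nltk.word_tokenize(raw_set) if token.isalpha()]
# Unique words in order of first appearance.
train_dict = pd.Series(tokens).unique().tolist()
# Dictionary index of every token; the lookup table avoids repeated list searches.
word_to_idx = {word: idx for idx, word in enumerate(train_dict)}
train_tokens = np.array([word_to_idx[token] for token in tokens])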

The last variable, ``train_set``, is a list of index pairs. Each pair holds the index of the current word and the index of one word from its context. The context is determined by the window size, which is set to 2 in this example: for every word we take the two words before it and the two words after it, which gives four training pairs per word.

In [None]:
train_set = []
for i in range(2, len(tokens) - 2):
    # Pair the current word with the two words before and the two words after it.
    # train_tokens already holds the dictionary index of every token.
    for offset in (-1, -2, 1, 2):
        train_set.append([train_tokens[i], train_tokens[i + offset]])

train_set = np.random.permutation(np.array(train_set))

The next step is to set the training configuration. We use 10 negative samples per training pair and a vector size of 100. The learning rate starts at 0.1 and is multiplied by a decay factor of 0.995 every 10,000 updates. Training runs for 8,000,000 updates, and the average cost is logged every 10,000 updates.

In [None]:
Config = namedtuple("Config", ["dict_size", "vect_size", "neg_samples", "updates", "learning_rate",
                               "learning_rate_decay", "decay_period", "log_period"])
conf = Config(
    dict_size=len(train_dict),
    vect_size=100,
    neg_samples=10,
    updates=8000000,
    learning_rate=0.1,
    learning_rate_decay=0.995,
    decay_period=10000,
    log_period=10000)

A few helper functions that are used later in the code are defined below.

In [None]:
def zeros(*dims):
    return np.zeros(shape=tuple(dims), dtype=np.float32)

def ones(*dims):
    return np.ones(shape=tuple(dims), dtype=np.float32)

def rand(*dims):
    return np.random.rand(*dims).astype(np.float32)

def randn(*dims):
    return np.random.randn(*dims).astype(np.float32)

def sigmoid(batch):
    return 1.0 / (1.0 + np.exp(-batch))

def as_matrix(vector):
    return np.reshape(vector, (-1, 1))

We loop over ``updates`` and, in each iteration, take a word and one of its context words from the training set. The negative context consists of ``neg_samples`` words drawn at random from ``train_tokens``, so frequent words are drawn more often. We then look up the word, context and negative sample vectors, calculate the cost and the corresponding gradients, and update the vectors with a gradient descent step.
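The cost and gradients computed in the training loop can be written out as follows. The notation is introduced here for illustration: $v_w$ is the word vector (a row of ``Vp``), $u_c$ the context vector and $u_{n_k}$ the $k$-th negative sample vector (rows of ``Vo``), $K$ the number of negative samples and $\sigma$ the sigmoid function.

$$
J = -\log \sigma\left(v_w^\top u_c\right) - \sum_{k=1}^{K} \log \sigma\left(-v_w^\top u_{n_k}\right)
$$

$$
\frac{\partial J}{\partial v_w} = -\left(1 - \sigma(v_w^\top u_c)\right) u_c + \sum_{k=1}^{K} \sigma(v_w^\top u_{n_k})\, u_{n_k}, \qquad
\frac{\partial J}{\partial u_c} = -\left(1 - \sigma(v_w^\top u_c)\right) v_w, \qquad
\frac{\partial J}{\partial u_{n_k}} = \sigma(v_w^\top u_{n_k})\, v_w
$$

Minimizing $J$ pushes the score of the true (word, context) pair up and the scores of the randomly drawn pairs down.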

In [None]:
def neg_sample(conf, train_set, train_tokens):
    Vp = randn(conf.dict_size, conf.vect_size)
    Vo = randn(conf.dict_size, conf.vect_size)

    J = 0.0
    final_cost = None  # stays None if fewer than log_period updates are run
    learning_rate = conf.learning_rate
    for i in range(conf.updates):
        idx = i % len(train_set)

        word = train_set[idx, 0]
        context = train_set[idx, 1]

        neg_context = np.random.randint(0, len(train_tokens), conf.neg_samples)
        neg_context = train_tokens[neg_context]

        word_vect = Vp[word, :]  # word vector
        context_vect = Vo[context, :]  # context vector
        negative_vects = Vo[neg_context, :]  # sampled negative vectors

        # Cost and gradient calculation starts here
        score_pos = word_vect @ context_vect.T
        score_neg = word_vect @ negative_vects.T

        J -= np.log(sigmoid(score_pos)) + np.sum(np.log(sigmoid(-score_neg)))
        if (i + 1) % conf.log_period == 0:
            print('Update {0}\tcost: {1:>2.2f}'.format(i + 1, J / conf.log_period))
            final_cost = J / conf.log_period
            J = 0.0

        pos_g = 1.0 - sigmoid(score_pos)
        neg_g = sigmoid(score_neg)

        word_grad = -pos_g * context_vect + np.sum(as_matrix(neg_g) * negative_vects, axis=0)
        context_grad = -pos_g * word_vect
        neg_context_grad = as_matrix(neg_g) * as_matrix(word_vect).T

        Vp[word, :] -= learning_rate * word_grad
        Vo[context, :] -= learning_rate * context_grad
        Vo[neg_context, :] -= learning_rate * neg_context_grad

        if i % conf.decay_period == 0:
            learning_rate = learning_rate * conf.learning_rate_decay

    return Vp, Vo, final_cost

Vp, Vo, J = neg_sample(conf, train_set, train_tokens)
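Hand-derived gradients are easy to get wrong, so it can be worth comparing them with a finite-difference approximation. The sketch below does this for the gradient with respect to the word vector, using plain Python lists and small random vectors. The functions ``loss`` and ``word_gradient`` are hypothetical helpers that restate the cost and ``word_grad`` from the training loop; they are not part of the training code.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def loss(word, context, negatives):
    # Negative-sampling cost for a single (word, context) pair.
    cost = -math.log(sigmoid(dot(word, context)))
    for neg in negatives:
        cost -= math.log(sigmoid(-dot(word, neg)))
    return cost


def word_gradient(word, context, negatives):
    # Analytic gradient of the cost with respect to the word vector.
    pos_g = 1.0 - sigmoid(dot(word, context))
    grad = [-pos_g * c for c in context]
    for neg in negatives:
        neg_g = sigmoid(dot(word, neg))
        grad = [g + neg_g * n for g, n in zip(grad, neg)]
    return grad


random.seed(0)
dim = 5
word = [random.gauss(0, 1) for _ in range(dim)]
context = [random.gauss(0, 1) for _ in range(dim)]
negatives = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(3)]

analytic = word_gradient(word, context, negatives)

# Central finite differences, one coordinate at a time.
eps = 1e-6
numeric = []
for k in range(dim):
    plus = list(word)
    minus = list(word)
    plus[k] += eps
    minus[k] -= eps
    diff = loss(plus, context, negatives) - loss(minus, context, negatives)
    numeric.append(diff / (2 * eps))

max_diff = max(abs(a - n) for a, n in zip(analytic, numeric))
print(max_diff < 1e-5)
```

If the analytic and numeric gradients agree up to a small tolerance, the update rule is consistent with the cost being minimized.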

The ``similar_words`` function returns the ``hits`` words whose embeddings have the highest dot product with the embedding of ``word``.

In [None]:
def lookup_word_idx(word, word_dict):
    try:
        return np.argwhere(np.array(word_dict) == word)[0][0]
    except IndexError:
        raise Exception("No such word in dict: {}".format(word))

def similar_words(embeddings, word, word_dict, hits):
    word_idx = lookup_word_idx(word, word_dict)
    similarity_scores = embeddings @ embeddings[word_idx]
    similar_word_idxs = np.argsort(-similarity_scores)
    return [word_dict[i] for i in similar_word_idxs[:hits]]

Some example relationships for the current corpus are:
- stock: currencies, issuing, vowed, expects, fully, resistant
- money: assets, purchases, red, stretching
- economists: prospects, protecting, thought, purchases
- law: convertible, diplomacy, resilient, combined
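
The lookup above ranks words by the raw dot product, which favours vectors with a large norm. A common alternative is cosine similarity, which compares directions only. The sketch below is a swapped-in variant and not the method used to produce the lists above; ``similar_words_cosine`` and the toy embeddings are made up for illustration.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms


def similar_words_cosine(embeddings, word, word_dict, hits):
    word_idx = word_dict.index(word)
    # Score every other word against the query word.
    scores = [
        (cosine(vector, embeddings[word_idx]), i)
        for i, vector in enumerate(embeddings)
        if i != word_idx
    ]
    scores.sort(reverse=True)
    return [word_dict[i] for _, i in scores[:hits]]


# Toy two-dimensional embeddings, invented for this example.
toy_dict = ["stock", "share", "law"]
toy_embeddings = [[1.0, 0.1], [2.0, 0.3], [-0.2, 1.0]]

print(similar_words_cosine(toy_embeddings, "stock", toy_dict, 1))
# ['share']
```

Unlike ``similar_words``, this variant also excludes the query word itself from the results.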