# Working with the Bag-of-Words representation

The [bow module](api.rst#tmtoolkit-bow) in tmtoolkit contains several functions for working with Bag-of-Words (BoW) representations of documents. It's divided into two sub-modules: [bow.bow_stats](api.rst#module-tmtoolkit.bow.bow_stats) and [bow.dtm](api.rst#module-tmtoolkit.bow.dtm). The former implements several statistics and transformations for BoW representations; the latter contains functions to create and convert sparse or dense document-term matrices (DTMs).

Most of the functions in both sub-modules accept and/or return DTMs. The [previous chapter](preprocessing.ipynb) contained a section about what *sparse* DTMs are and [how they can be generated with tmtoolkit](preprocessing.ipynb#Generating-a-sparse-document-term-matrix-(DTM)).

## An example document-term matrix

Before we start with the [bow.dtm](api.rst#module-tmtoolkit.bow.dtm) module, we will generate a sparse DTM from a small example corpus.

In [1]:
import random
random.seed(20191113)   # to make the sampling reproducible

from tmtoolkit.corpus import Corpus

corpus = Corpus.from_builtin_corpus('english-NewsArticles').sample(5)

Let's have a look at a sample document:

In [2]:
print(corpus['NewsArticles-2058'][:227])

Merkel: 'Only if Europe is doing well, will Germany be doing well'

Ahead of meeting her fellow European leaders at a summit in Brussels, German Chancellor Angela Merkel has reiterated her government's call for unity in the EU.


We employ a preprocessing pipeline that removes a lot of information from our original data in order to obtain a very condensed DTM.

In [3]:
from tmtoolkit.preprocess import TMPreproc

preproc = TMPreproc(corpus)
preproc.pos_tag() \
    .lemmatize() \
    .filter_for_pos('N') \
    .tokens_to_lowercase() \
    .remove_special_chars_in_tokens() \
    .clean_tokens(remove_shorter_than=2) \
    .remove_common_tokens(5, absolute=True) # remove tokens that occur in all documents
preproc.tokens_datatable

|   | doc | position | token | meta_pos |
|---|---|---|---|---|
| 0 | NewsArticles-119 | 0 | nhs | NNP |
| 1 | NewsArticles-119 | 1 | nhs | NNP |
| 2 | NewsArticles-119 | 2 | pledge | NN |
| 3 | NewsArticles-119 | 3 | prime | NNP |
| 4 | NewsArticles-119 | 4 | minister | NNP |
| 5 | NewsArticles-119 | 5 | david | NNP |
| 6 | NewsArticles-119 | 6 | cameron | NNP |
| 7 | NewsArticles-119 | 7 | theresa | NNP |
| 8 | NewsArticles-119 | 8 | may | NNP |
| 9 | NewsArticles-119 | 9 | government | NN |


In [4]:
preproc.n_docs, len(preproc.vocabulary)

(5, 530)

We fetch the document labels and vocabulary and convert them to NumPy arrays, because such arrays allow advanced indexing methods such as boolean indexing.

In [5]:
import numpy as np

doc_labels = np.array(preproc.doc_labels)
doc_labels

array(['NewsArticles-119', 'NewsArticles-1206', 'NewsArticles-2058',
       'NewsArticles-3016', 'NewsArticles-3665'], dtype='<U17')

In [6]:
vocab = np.array(preproc.vocabulary)
vocab[:10]  # only showing the first 10 tokens here

array(['abuse', 'access', 'accession', 'accusation', 'act', 'addition',
       'address', 'addressing', 'administration', 'affiliation'],
      dtype='<U20')

Finally, we fetch the sparse DTM:

In [7]:
dtm = preproc.dtm
dtm

<5x530 sparse matrix of type '<class 'numpy.int32'>'
	with 596 stored elements in Compressed Sparse Row format>

We now have a sparse DTM `dtm`, an array of document labels `doc_labels` that represent the rows of the DTM and an array of vocabulary tokens `vocab` that represent the columns of the DTM. We will use this data for the remainder of the chapter.
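To make the alignment between rows, columns and the two label arrays concrete, here is a minimal sketch with a small dense toy DTM. All names and values (`toy_dtm`, `toy_doc_labels`, `toy_vocab`) are made up for illustration and are not taken from the example corpus. It also shows the kind of boolean indexing that motivated converting the labels and vocabulary to NumPy arrays.

```python
import numpy as np

# Hypothetical toy data, not taken from the example corpus.
toy_doc_labels = np.array(['doc_a', 'doc_b', 'doc_c'])
toy_vocab = np.array(['economy', 'election', 'minister', 'refugee'])

# Rows are aligned with toy_doc_labels, columns with toy_vocab.
toy_dtm = np.array([
    [2, 0, 1, 0],
    [0, 3, 1, 0],
    [0, 0, 0, 4],
])

# Boolean indexing on the rows: which tokens occur in 'doc_b'?
row = toy_dtm[toy_doc_labels == 'doc_b'][0]
tokens_in_doc_b = toy_vocab[row > 0]

# Boolean indexing on the columns: which documents contain 'minister'?
col = toy_dtm[:, toy_vocab == 'minister'][:, 0]
docs_with_minister = toy_doc_labels[col > 0]

print(tokens_in_doc_b)      # tokens 'election' and 'minister'
print(docs_with_minister)   # documents 'doc_a' and 'doc_b'
```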

## The `bow.dtm` module

This module is quite small. There are two functions to convert a DTM to a [datatable](https://github.com/h2oai/datatable/) or [DataFrame](https://pandas.pydata.org/): [dtm_to_datatable()](api.rst#tmtoolkit.bow.dtm.dtm_to_datatable) and [dtm_to_dataframe()](api.rst#tmtoolkit.bow.dtm.dtm_to_dataframe). Note that the generated datatable or DataFrame is *dense*, i.e. it uses up (much) more memory than the input DTM.
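The memory difference between a sparse and a dense representation can be illustrated with a rough sketch. It assumes SciPy is installed and uses a randomly generated matrix with the same shape as our DTM, not the DTM itself, so the exact sparse size will differ from the real data.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Synthetic stand-in for a DTM: 5 documents, 530 vocabulary tokens,
# roughly 20% of the entries non-zero (an assumption for this sketch).
rng = np.random.default_rng(0)
dense = (rng.random((5, 530)) < 0.2).astype(np.int32)
sparse = csr_matrix(dense)

# A dense array stores every cell, including all zeros.
dense_bytes = dense.nbytes

# A CSR matrix stores only the non-zero values plus two index arrays.
sparse_bytes = sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes

print(dense_bytes, sparse_bytes)
```

The gap widens with larger vocabularies, because most tokens do not occur in most documents.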

Let's generate a datatable via [dtm_to_datatable()](api.rst#tmtoolkit.bow.dtm.dtm_to_datatable) from our DTM, the document labels and the vocabulary:

In [8]:
from tmtoolkit.bow.dtm import dtm_to_datatable

dtm_to_datatable(dtm, doc_labels, vocab)

|   | _doc | abuse | access | accession | accusation | act | addition | address | addressing | administration | … | year | yes | york | yucel | yucellast |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NewsArticles-119 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 0 |
| 1 | NewsArticles-1206 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 0 |
| 2 | NewsArticles-2058 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | … | 2 | 1 | 0 | 1 | 1 |
| 3 | NewsArticles-3016 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 1 | 0 | 0 |
| 4 | NewsArticles-3665 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | … | 1 | 0 | 0 | 0 | 0 |


We can see that a column `_doc` with the document labels was created and that the vocabulary tokens become the names of the remaining columns. [dtm_to_dataframe()](api.rst#tmtoolkit.bow.dtm.dtm_to_dataframe) works the same way.

You can combine tmtoolkit with [Gensim](https://radimrehurek.com/gensim/). The `bow.dtm` module provides several functions to convert data between both packages:

- [dtm_and_vocab_to_gensim_corpus_and_dict()](api.rst#tmtoolkit.bow.dtm.dtm_and_vocab_to_gensim_corpus_and_dict): converts a (sparse) DTM and a vocabulary list to a *Gensim Corpus* and *Gensim Dictionary*
- [dtm_to_gensim_corpus()](api.rst#tmtoolkit.bow.dtm.dtm_to_gensim_corpus): converts only a (sparse) DTM to a *Gensim Corpus*
- [gensim_corpus_to_dtm()](api.rst#tmtoolkit.bow.dtm.gensim_corpus_to_dtm): converts a *Gensim Corpus* object to a sparse DTM in COO format
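To give an idea of what such a conversion involves, the following sketch builds the plain data layout that Gensim uses for a bag-of-words corpus: one list per document, containing `(token_id, count)` tuples for the non-zero entries. The helper `dense_dtm_to_bow_lists()` and the toy data are hypothetical and only illustrate the format. This is not how tmtoolkit implements its conversion functions, and the sketch does not import Gensim.

```python
import numpy as np

# Hypothetical toy data: 2 documents, 3 vocabulary tokens.
toy_vocab = ['economy', 'election', 'minister']
toy_dtm = np.array([
    [2, 0, 1],
    [0, 3, 0],
])

def dense_dtm_to_bow_lists(dtm):
    """Hypothetical helper: turn each DTM row into a list of
    (token_id, count) tuples, skipping zero entries."""
    corpus = []
    for row in dtm:
        token_ids = np.flatnonzero(row)
        corpus.append([(int(i), int(row[i])) for i in token_ids])
    return corpus

bow_corpus = dense_dtm_to_bow_lists(toy_dtm)

# A plain dict standing in for the id-to-token mapping of a dictionary.
id2token = dict(enumerate(toy_vocab))

print(bow_corpus)   # [[(0, 2), (2, 1)], [(1, 3)]]
print(id2token)     # {0: 'economy', 1: 'election', 2: 'minister'}
```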

## The `bow.bow_stats` module

This module provides several statistics and transformations for sparse or dense DTMs.

### Document lengths, document and term frequencies, token co-occurrences

Let's start with the [doc_lengths()](api.rst#tmtoolkit.bow.bow_stats.doc_lengths) function, which simply gives the number of tokens per document:

In [9]:
from tmtoolkit.bow.bow_stats import doc_lengths

doc_lengths(dtm)

array([ 36,  37, 338, 164, 349])

The returned array is aligned to the document labels `doc_labels`, so we can see that the last document, "NewsArticles-3665", is the one with the most tokens. We can also determine this programmatically:

In [10]:
doc_labels[doc_lengths(dtm).argmax()]

'NewsArticles-3665'
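Conceptually, a document length is just the sum of a DTM row. The sketch below reproduces this with plain NumPy on a dense toy DTM. The data is made up, and this is only an illustration of the idea, not tmtoolkit's implementation.

```python
import numpy as np

# Hypothetical toy data, not taken from the example corpus.
toy_doc_labels = np.array(['doc_a', 'doc_b', 'doc_c'])
toy_dtm = np.array([
    [2, 0, 1, 0],
    [0, 3, 1, 0],
    [0, 0, 1, 4],
])

# Summing over the columns (axis=1) gives the token count per document.
toy_doc_lengths = toy_dtm.sum(axis=1)

# argmax() returns the row index of the longest document,
# which we use to look up its label.
longest_doc = toy_doc_labels[toy_doc_lengths.argmax()]

print(toy_doc_lengths)   # [3 4 5]
print(longest_doc)       # doc_c
```

Note that for SciPy sparse matrices, `sum(axis=1)` may return a two-dimensional result that needs flattening before it can be used this way.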

The function [doc_frequencies()](api.rst#tmtoolkit.bow.bow_stats.doc_frequencies) returns, for each token in the vocabulary, the number of documents in which that token occurs at least *n* times. You can control *n* via the parameter `min_val`, which is set to `1` by default. The returned array is aligned with the vocabulary. Here, we calculate the document frequency with `min_val=1`, extract the maximum document frequency and see which tokens in the `vocab` array reach it:

In [11]:
from tmtoolkit.bow.bow_stats import doc_frequencies

df = doc_frequencies(dtm)
max_df = df.max()
max_df, vocab[df == max_df]

(4, array(['minister'], dtype='<U20'))

It turns out that the maximum document frequency is 4 and only the token "minister" reaches that document frequency. This means only "minister" is mentioned across 4 documents at least once (because `min_val` is `1`). Remember that during preprocessing, we removed all tokens that occur across *all* five documents, hence there can't be a vocabulary token with a document frequency of 5.

Let's see which vocabulary tokens occur within a single document at least 10 times:

In [12]:
df = doc_frequencies(dtm, min_val=10)
vocab[df > 0]

array(['candidate', 'eu', 'macron', 'medium', 'merkel', 'refugee'],
      dtype='<U20')
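The effect of `min_val` can be sketched with plain NumPy: compare every cell of the DTM against the threshold and count the matches per column. The toy data and the variable names below are made up, and the code only illustrates the idea on a dense array rather than showing tmtoolkit's implementation.

```python
import numpy as np

# Hypothetical toy data, not taken from the example corpus.
toy_vocab = np.array(['economy', 'election', 'minister', 'refugee'])
toy_dtm = np.array([
    [2, 0, 1, 0],
    [0, 3, 1, 0],
    [0, 0, 0, 4],
])

# Threshold 1: in how many documents does each token occur at least once?
toy_df_1 = (toy_dtm >= 1).sum(axis=0)

# Threshold 3: in how many documents does each token occur at least 3 times?
toy_df_3 = (toy_dtm >= 3).sum(axis=0)

print(toy_df_1)                  # [1 1 2 1]
print(toy_df_3)                  # [0 1 0 1]
print(toy_vocab[toy_df_3 > 0])   # tokens 'election' and 'refugee'
```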

Further functions in this module:

- `codoc_frequencies`
- `term_frequencies`
- `sorted_terms`
- `sorted_terms_data_table`
- `tfidf`
- `tf_proportions`
- `idf`