# Text preprocessing and basic text mining

During text preprocessing, a corpus of documents is tokenized (i.e. the document strings are split into individual words, punctuation, numbers, etc.) and these tokens can then be transformed, filtered or annotated. The goal is to prepare the raw texts so that later analysis steps become easier, e.g. by reducing noise in the dataset. tmtoolkit provides a rich set of tools for this purpose, implemented as *corpus functions* in the [tmtoolkit.corpus](api.rst#tmtoolkit-corpus) module.

<div class="alert alert-info">

### Reminder: Corpus functions

All *corpus functions* accept a [`Corpus`](api.rst#TODO) object as first argument and operate on it. A corpus function may retrieve information from a corpus and/or modify the corpus object.

</div>


## Optional: enabling logging output

By default, tmtoolkit does not expose any internal logging messages. Sometimes it's helpful to enable logging output, for example for diagnostics during debugging or to see the progress of long-running operations. For that, you can use the [`enable_logging`](api.rst#TODO) function. By default, it enables logging to the console at the `INFO` level.

In [1]:
from tmtoolkit.utils import enable_logging

enable_logging()
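
If you want to see what such a setup amounts to, the following is a minimal sketch that uses only Python's standard `logging` module. It assumes that tmtoolkit logs through a logger named `tmtoolkit` (the log lines shown further below suggest this) and only approximates their format; it is not the implementation of `enable_logging`.

```python
import logging
import sys

# Assumption: tmtoolkit emits its messages via a logger named "tmtoolkit".
logger = logging.getLogger('tmtoolkit')
logger.setLevel(logging.INFO)

# Send records to the console; the format approximates the log lines in this chapter.
handler = logging.StreamHandler(sys.stderr)
handler.setFormatter(logging.Formatter('%(asctime)s:%(levelname)s:%(name)s:%(message)s'))
logger.addHandler(handler)

logger.info('logging enabled')       # shown, because INFO >= INFO
logger.debug('detailed diagnostics') # suppressed at the INFO level
```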

## Loading example data

Let's load a sample of ten documents from the built-in *NewsArticles* dataset. We use only a small number of documents here to keep the output easy to inspect; we can use a larger sample later. To apply sampling right when loading the data, we pass the `sample=10` parameter to the [`from_builtin_corpus`](api.rst#TODO) class method. We also use [`print_summary`](api.rst#TODO) as shown in the previous chapter.

In [2]:
import random
random.seed(20220119)   # to make the sampling reproducible

from tmtoolkit.corpus import Corpus, print_summary

corpus_small = Corpus.from_builtin_corpus('en-NewsArticles', sample=10)
print_summary(corpus_small)

2022-01-19 15:03:27,051:INFO:tmtoolkit:creating Corpus instance with no documents
2022-01-19 15:03:27,052:INFO:tmtoolkit:using serial processing
2022-01-19 15:03:27,557:INFO:tmtoolkit:sampling 10 documents(s) out of 3824
2022-01-19 15:03:27,560:INFO:tmtoolkit:adding text from 10 documents(s)
2022-01-19 15:03:28,451:INFO:tmtoolkit:generating document texts


Corpus with 10 documents in English
> NewsArticles-2433 (842 tokens): DOJ : 2 Russian spies indicted in Yahoo hack    Wa...
> NewsArticles-2225 (539 tokens): Rutte and Wilders face - off in Dutch general elec...
> NewsArticles-2487 (1015 tokens): Dutch election : High turnout in key national vote...
> NewsArticles-49 (1112 tokens): Trump vs. America : The fight for democracy    Fri...
> NewsArticles-2766 (700 tokens): Depeche Mode releases ' Spirit , ' an unusually po...
> NewsArticles-2712 (571 tokens): Grieving families speak out as police hunt for kil...
> NewsArticles-2301 (464 tokens): DOJ seeks more time on Trump wiretapping inquiry  ...
> NewsArticles-1377 (774 tokens): Turkey - backed rebels in ' near full control ' of...
> NewsArticles-3428 (776 tokens): In Breakthrough Discovery , Scientists Mass - Prod...
total number of tokens: 7191 / vocabulary size: 2149


## Accessing tokens and token attributes

We start by accessing the documents' tokens and their *token attributes* using [`doc_tokens`](api.rst#TODO) and [`tokens_table`](api.rst#TODO). Token attributes are meta information attached to each token. These can be linguistic features, such as the Part of Speech (POS) tag, indicators for stopwords or punctuation, etc. The default attributes are a subset of [spaCy's token attributes](https://spacy.io/api/token#attributes). You can configure which of these attributes are stored using the `spacy_token_attrs` parameter of the [`Corpus`](api.rst#TODO) constructor. You can also add your own token attributes. This will be shown later on.

First, we load the tokens along with their attributes via `doc_tokens`, which gives us a dictionary mapping document labels to document data. Each document's data is another dictionary that contains the tokens and their attributes. We start by checking which token attributes are loaded by default for a document (here, we use `'NewsArticles-2433'`):

In [3]:
from tmtoolkit.corpus import doc_tokens, tokens_table

# with_attr=True adds default set of token attributes
tok = doc_tokens(corpus_small, with_attr=True)
tok['NewsArticles-2433'].keys()

dict_keys(['token', 'is_punct', 'is_stop', 'like_num', 'tag', 'pos', 'lemma'])

So each document's data can be accessed as in the example above, and it contains the seven entries listed there. The `'token'` entry gives the actual tokens of the document. Let's show the first five tokens of a document:

In [4]:
tok['NewsArticles-2433']['token'][:5]

['DOJ', ':', '2', 'Russian', 'spies']

The other entries are the attributes corresponding to each token. Here, we display the first five lemmata for the same document and the first five punctuation indicator values. The colon is correctly identified as a punctuation character.

In [5]:
tok['NewsArticles-2433']['lemma'][:5]

['doj', ':', '2', 'russian', 'spy']

In [6]:
tok['NewsArticles-2433']['is_punct'][:5]

[False, True, False, False, False]
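
Because the token list and all attribute lists of a document are aligned by position, they can be combined element-wise. The following sketch uses a hand-built dictionary as a stand-in for one document entry; its values are copied from the outputs above, and the variable names are only illustrative.

```python
# Hand-built stand-in for one entry of the dictionary returned by doc_tokens();
# the values are copied from the outputs above.
doc = {
    'token': ['DOJ', ':', '2', 'Russian', 'spies'],
    'lemma': ['doj', ':', '2', 'russian', 'spy'],
    'is_punct': [False, True, False, False, False],
}

# All lists are aligned by token position, so zip() pairs each token with its attributes.
for token, lemma, is_punct in zip(doc['token'], doc['lemma'], doc['is_punct']):
    print(f'{token!r}: lemma={lemma!r}, is_punct={is_punct}')

# Keep the lemmata of all tokens that are not punctuation.
lemmata_no_punct = [l for l, p in zip(doc['lemma'], doc['is_punct']) if not p]
print(lemmata_no_punct)
```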

If your NLP pipeline performs sentence recognition, you can pass the parameter `sentences=True`, which adds another level to the output that represents sentences. This means that for each item like `'token'`, `'lemma'`, etc. we get a list of sentences. For example, the following prints the tokens of the eighth sentence (index 7):

In [7]:
tok_sents = doc_tokens(corpus_small, sentences=True, with_attr=True)
tok_sents['NewsArticles-2433']['token'][7]   # index 7 means 8th sentence

['A',
 'Justice',
 'Department',
 'official',
 'said',
 'the',
 'agency',
 'has',
 'not',
 'confirmed',
 'it',
 'is',
 'the',
 'same',
 'person',
 'and',
 'declined',
 'further',
 'comment',
 'to',
 ...]

For a more compact overview, it's better to use the [`tokens_table`](api.rst#TODO) function. It generates a [pandas DataFrame](https://pandas.pydata.org/) from the documents in the corpus and by default includes all token attributes, along with a column for the document label (`doc`) and the token position inside the document (`position`).

In [8]:
tbl = tokens_table(corpus_small)
tbl

Unnamed: 0,doc,position,token,is_punct,is_stop,lemma,like_num,pos,tag
0,NewsArticles-1377,0,Turkey,False,False,Turkey,False,PROPN,NNP
1,NewsArticles-1377,1,-,True,False,-,False,PUNCT,HYPH
2,NewsArticles-1377,2,backed,False,False,back,False,VERB,VBN
3,NewsArticles-1377,3,rebels,False,False,rebel,False,NOUN,NNS
4,NewsArticles-1377,4,in,False,True,in,False,ADP,IN
...,...,...,...,...,...,...,...,...,...
7186,NewsArticles-49,1107,fight,False,False,fight,False,VERB,VB
7187,NewsArticles-49,1108,to,False,True,to,False,PART,TO
7188,NewsArticles-49,1109,defend,False,False,defend,False,VERB,VB
7189,NewsArticles-49,1110,it,False,True,it,False,PRON,PRP


You can use all sorts of filtering operations on this dataframe. See the [pandas documentation](https://pandas.pydata.org/docs/user_guide/indexing.html) for details. Here, we select all tokens that were identified as "number-like":

In [9]:
tbl[tbl.like_num]

Unnamed: 0,doc,position,token,is_punct,is_stop,lemma,like_num,pos,tag
73,NewsArticles-1377,73,25,False,False,25,True,NUM,CD
88,NewsArticles-1377,88,three,False,True,three,True,NUM,CD
115,NewsArticles-1377,115,2014,False,False,2014,True,NUM,CD
292,NewsArticles-1377,292,13,False,False,13,True,NUM,CD
522,NewsArticles-1377,522,first,False,True,first,True,ADV,RB
...,...,...,...,...,...,...,...,...,...
6644,NewsArticles-49,565,one,False,True,one,True,NUM,CD
6795,NewsArticles-49,716,2011,False,False,2011,True,NUM,CD
6826,NewsArticles-49,747,1999,False,False,1999,True,NUM,CD
7011,NewsArticles-49,932,seven,False,False,seven,True,NUM,CD


This however only filters the table output. We will later see how to filter corpus documents and tokens.

If you want to generate the table only for a subset of documents, you can use the `select` parameter and provide one or more document labels. Similar to that, you can use the `with_attr` parameter to list only a subset of the token attributes.

In [10]:
# select a single document and only show the "pos" attribute (coarse POS tag)
tokens_table(corpus_small, select='NewsArticles-2433', sentences=True, with_attr='pos')

Unnamed: 0,doc,sent,position,token,pos
0,NewsArticles-2433,0,0,DOJ,NOUN
1,NewsArticles-2433,0,1,:,PUNCT
2,NewsArticles-2433,0,2,2,NUM
3,NewsArticles-2433,0,3,Russian,ADJ
4,NewsArticles-2433,0,4,spies,NOUN
...,...,...,...,...,...
837,NewsArticles-2433,27,837,to,PART
838,NewsArticles-2433,27,838,reflect,VERB
839,NewsArticles-2433,27,839,new,ADJ
840,NewsArticles-2433,27,840,developments,NOUN


In [11]:
# select two documents and only show the "pos" and "tag" attributes (coarse and detailed POS tags)
tokens_table(corpus_small, select=['NewsArticles-2433', 'NewsArticles-49'], with_attr=['pos', 'tag'])

Unnamed: 0,doc,position,token,pos,tag
0,NewsArticles-2433,0,DOJ,NOUN,NN
1,NewsArticles-2433,1,:,PUNCT,:
2,NewsArticles-2433,2,2,NUM,CD
3,NewsArticles-2433,3,Russian,ADJ,JJ
4,NewsArticles-2433,4,spies,NOUN,NNS
...,...,...,...,...,...
1949,NewsArticles-49,1107,fight,VERB,VB
1950,NewsArticles-49,1108,to,PART,TO
1951,NewsArticles-49,1109,defend,VERB,VB
1952,NewsArticles-49,1110,it,PRON,PRP


<div class="alert alert-info">

### Side note: Common corpus function parameters
    
Many corpus functions share the same parameter names and, when they do, these parameters imply the same behavior. As already explained, all corpus functions accept a `Corpus` object as first parameter. Beyond that, many corpus functions also accept a `select` parameter, which can always be used to specify a subset of the documents to which the respective function is applied. We have also already encountered the `sentences` parameter, which some corpus functions accept in order to represent the sentence structure of a document in their output.
    
To know which functions accept which parameter, check their documentation.

</div>

## Corpus vocabulary

The corpus *vocabulary* is the set of unique tokens in a corpus. We can get that set via [`vocabulary`](api.rst#TODO).

In [12]:
from tmtoolkit.corpus import vocabulary

vocabulary(corpus_small)

{'safe',
 'France',
 'You',
 'movement',
 'spokesperson',
 'between',
 'as',
 'Attorney',
 '200',
 'four',
 'me',
 'outstanding',
 'Thursday',
 'MORE',
 'legal',
 'requires',
 'selling',
 'growing',
 'Election',
 'restaurant',
 ...}

This corpus function also accepts a `select` parameter. We can also sort the vocabulary via `sort=True`, which returns a list instead of a set. To get the sorted vocabulary for document "NewsArticles-2433", we can write:

In [13]:
vocabulary(corpus_small, select='NewsArticles-2433', sort=True)

['\n\n',
 '"',
 "'s",
 '(',
 ')',
 ',',
 '-',
 '--',
 '.',
 '2',
 '2014',
 '22',
 '29',
 '33',
 '43',
 '500',
 ':',
 'A',
 'Akehmet',
 'Aleksandrovich',
 ...]

To get the number of unique tokens in the corpus, i.e. the vocabulary size, we can use [`vocabulary_size`](api.rst#vocabulary_size), which is basically a shortcut for `len(vocabulary(<Corpus object>))`:

In [14]:
from tmtoolkit.corpus import vocabulary_size

vocabulary_size(corpus_small)

2149

The corpus function [`vocabulary_counts`](api.rst#vocabulary_counts) is useful to find out how often each token in the vocabulary occurs in the corpus:

In [15]:
from tmtoolkit.corpus import vocabulary_counts

vocabulary_counts(corpus_small)

{'Where': 2,
 'types.-': 1,
 'Reflection': 1,
 'approach': 1,
 'average': 1,
 'cell': 4,
 'conduct': 1,
 'angry': 1,
 'alone': 1,
 'veterans': 2,
 'tell': 1,
 'Bill': 1,
 'cyberexperts': 1,
 'surge': 1,
 'lead': 2,
 'countering': 1,
 'basically': 1,
 'cleared': 3,
 'Health': 2,
 'boils': 1,
 ...}

If you don't want to obtain absolute counts, you can use the `proportions` parameter. Setting it to `1` gives you ordinary proportions (i.e. $\frac{x_i}{\sum_j x_j}$) and `2` gives you proportions on a log10 scale ($\log_{10} \frac{x_i}{\sum_j x_j}$).

In [18]:
vocab_proportions = vocabulary_counts(corpus_small, proportions=1)
vocab_proportions   # will reuse that later

{'Where': 0.0002781254345709915,
 'types.-': 0.00013906271728549575,
 'Reflection': 0.00013906271728549575,
 'approach': 0.00013906271728549575,
 'average': 0.00013906271728549575,
 'cell': 0.000556250869141983,
 'conduct': 0.00013906271728549575,
 'angry': 0.00013906271728549575,
 'alone': 0.00013906271728549575,
 'veterans': 0.0002781254345709915,
 'tell': 0.00013906271728549575,
 'Bill': 0.00013906271728549575,
 'cyberexperts': 0.00013906271728549575,
 'surge': 0.00013906271728549575,
 'lead': 0.0002781254345709915,
 'countering': 0.00013906271728549575,
 'basically': 0.00013906271728549575,
 'cleared': 0.00041718815185648727,
 'Health': 0.0002781254345709915,
 'boils': 0.00013906271728549575,
 ...}
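
To make the two formulas concrete, here is a minimal sketch that computes counts, ordinary proportions and log10 proportions with the standard library only. The toy documents are hypothetical and the code is not how `vocabulary_counts` is implemented; it only mirrors the arithmetic.

```python
import math
from collections import Counter

# Toy documents (hypothetical data, not taken from the NewsArticles corpus).
docs = {
    'doc1': ['the', 'cat', 'sat', 'on', 'the', 'mat'],
    'doc2': ['the', 'dog', 'sat'],
}

# Absolute counts x_i across all documents.
counts = Counter(tok for tokens in docs.values() for tok in tokens)
n_tokens = sum(counts.values())

# Ordinary proportions: x_i / sum_j x_j
props = {tok: n / n_tokens for tok, n in counts.items()}

# Proportions on a log10 scale: log10(x_i / sum_j x_j)
log_props = {tok: math.log10(p) for tok, p in props.items()}

print(counts.most_common(2))   # most frequent tokens first
```

The ordinary proportions sum to 1, whereas the log10 values are all zero or negative, since every proportion is at most 1.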

Tabular output is often more convenient for displaying results. You can set the `as_table` parameter to `True` to get a dataframe of tokens and their frequencies. To sort the dataframe, pass the name of the column to sort by as the `as_table` argument instead. By default, this sorts in ascending order; prefixing the column name with "-" gives a descending sort order. Here, we get a table of tokens with their frequencies in descending order:

In [22]:
vocabulary_counts(corpus_small, as_table='-freq')

Unnamed: 0,token,freq
849,the,321
290,",",307
1461,.,250
444,to,187
1854,"""",169
...,...,...
1198,Kay,1
1197,rhetoric,1
1196,series,1
1195,answer,1


We can see that "the" and "to" are top-ranking tokens, along with some punctuation characters. We can check the share of tokens for "the":

In [19]:
vocab_proportions['the']

0.04463913224864414

So the token "the" accounts for more than 4% of all tokens in the corpus.
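
As a quick check, the same share follows directly from numbers shown earlier: 321 occurrences of "the" (from the frequency table) divided by 7191 tokens in total (from the corpus summary). The variable names below are only illustrative.

```python
count_the = 321    # frequency of "the" from the frequency table
n_tokens = 7191    # total number of tokens from the corpus summary

share = count_the / n_tokens
print(f'{share:.2%}')
```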

### Part-of-Speech (POS) tagging

Part-of-speech (POS) tagging finds the grammatical word category for each token in a document. The method [pos_tag()](api.rst#tmtoolkit.preprocess.TMPreproc.pos_tag) applies this to the whole corpus. The resulting POS tags are added as metadata to each token. These tags conform to a specific *tagset* which is explained in the [spaCy documentation](https://spacy.io/api/annotation#pos-tagging). The POS tags can be used to annotate and filter the documents. Let's apply POS tagging:

In [None]:
preproc.pos_tag()

We can now see a new column `pos` with the found POS tag for each token:

In [None]:
preproc.tokens_datatable

### Aside: TMPreproc as "state machine"

Before continuing, we should clarify that a TMPreproc instance is a "state machine", i.e. its contents (the documents) and behavior can change when you call a method. An example:


```python
corpus = {
    "doc1": "Hello world!",
    "doc2": "Another example"
}

preproc = TMPreproc(corpus)     # documents are directly tokenized
preproc.tokens

# Out:
# {
#   'doc1': ['Hello', 'world', '!'],
#   'doc2': ['Another', 'example']
# }

preproc.tokens_to_lowercase()   # this changes the documents
preproc.tokens

# Out:
# {
#   'doc1': ['hello', 'world', '!'],
#   'doc2': ['another', 'example']
# }
```

As you can see, the tokens "inside" `preproc` are changed *in place*. It's important to see that after calling the method `tokens_to_lowercase()`, the tokens in `preproc` were transformed and the original tokens from before calling this method are no longer available. In Python, assigning a *mutable* object to a variable only binds the same object to another name; it doesn't copy it. Since a `TMPreproc` object is mutable (you can change its state by calling its methods), simply assigning such an object to a different variable (say `preproc_upper`) gives us two names for the same object. Calling a method via one of these names changes the values seen through *both* names.
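
The aliasing behavior is not specific to `TMPreproc`. The following sketch reproduces it with a small, hypothetical class (`TokenStore` is made up for this illustration and is not part of tmtoolkit):

```python
class TokenStore:
    """Hypothetical stand-in for a mutable object that holds tokenized documents."""

    def __init__(self, docs):
        self.docs = docs

    def to_lowercase(self):
        # replaces the stored tokens, i.e. changes the state of this object
        self.docs = {label: [t.lower() for t in toks]
                     for label, toks in self.docs.items()}


store = TokenStore({'doc1': ['Hello', 'world', '!']})
alias = store            # a second name for the *same* object, not a copy

store.to_lowercase()     # called via the first name ...

print(alias is store)        # True: both names refer to one object
print(alias.docs['doc1'])    # ... but the change is visible via the second name too
```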

#### Copying `TMPreproc` objects

What can we do about that? We need to *copy* the object, which can be done with the [TMPreproc.copy()](api.rst#tmtoolkit.preprocess.TMPreproc.copy) method. This way, we create another variable `preproc_upper` that points to a separate `TMPreproc` object.

In [None]:
preproc_upper = preproc.copy()

In [None]:
# the IDs confirm that we have two different objects
id(preproc_upper), id(preproc)

In [None]:
preproc_upper.transform_tokens(str.upper)

# the transformation now only applied to "preproc_upper"
preproc.vocabulary == preproc_upper.vocabulary

In [None]:
# show a sample
preproc_upper.tokens['NewsArticles-1880'][:10]

In [None]:
# the original "preproc" still holds the same data
preproc.tokens['NewsArticles-1880'][:10]

Note that this now uses twice as much memory. You therefore shouldn't create copies too often, and you should release unused memory by using `del`:

In [None]:
# removing the objects again
del preproc_upper

### Lemmatization and term normalization

Before we start with token normalization, we will create a copy of the original `TMPreproc` object and its data, so that we can later use it for comparison:

In [None]:
preproc_orig = preproc.copy()

Lemmatization brings a token, if it is a word, to its base form. The lemma is already determined during the tokenization process and is available in the `lemma` metadata column. However, when you want to further process the tokens on the basis of the lemmata, you should use the [lemmatize()](api.rst#tmtoolkit.preprocess.TMPreproc.lemmatize) method. This method sets the lemmata as tokens and all further processing will happen using the lemmatized tokens:

In [None]:
preproc.lemmatize()
preproc.tokens_datatable

As we can see, the `lemma` column was copied over to the `token` column.

<div class="alert alert-info">

Stemming
    
tmtoolkit doesn't support stemming directly, since lemmatization is generally accepted as a better approach to bring different word forms of one word to a common base form. However, you may install [NLTK](https://www.nltk.org/) and apply stemming by using the [transform_tokens()](api.rst#tmtoolkit.preprocess.TMPreproc.transform_tokens) method together with the [stem()](api.rst#tmtoolkit.preprocess.stem) function.
    
</div>

Depending on how you further want to analyze the data, it may be necessary to "clean" or "normalize" your tokens in different ways in order to remove noise from the corpus, such as punctuation tokens or numbers, upper/lowercase forms of the same word, etc. Note that this is usually not necessary when you work with more modern approaches such as word embeddings (word vectors).   

If you want to remove certain characters in *all* tokens in your documents, you can use [remove_chars_in_tokens()](api.rst#tmtoolkit.preprocess.TMPreproc.remove_chars_in_tokens) and pass it a sequence of characters to remove. There is also a shortcut [remove_special_chars_in_tokens()](api.rst#tmtoolkit.preprocess.TMPreproc.remove_special_chars_in_tokens) which will remove all "special characters" (all characters in [string.punctuation](https://docs.python.org/3/library/string.html#string.punctuation) by default).

In [None]:
preproc.remove_chars_in_tokens(['-'])  # remove only "-"
preproc.print_summary()

In [None]:
# remove all punctuation
preproc.remove_special_chars_in_tokens()
preproc.print_summary()   # the "?" also vanishes
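
What removing characters *inside* tokens means can be sketched with plain Python strings. The token list below is hypothetical and the code only illustrates the idea; it does not show how the two methods are implemented.

```python
import string

tokens = ['U.S.', 'non-recyclable', 'why?', ':']

# Remove only "-" from each token.
no_hyphen = [t.replace('-', '') for t in tokens]

# Translation table that deletes every character in string.punctuation.
delete_punct = str.maketrans('', '', string.punctuation)
no_punct = [t.translate(delete_punct) for t in tokens]

print(no_hyphen)
print(no_punct)
```

Note that a token consisting only of punctuation, such as `':'`, becomes an empty string in this sketch. Empty tokens are one of the things that a later cleaning step can drop.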

A common (but harsh) practice is to transform all tokens to lowercase forms, which can be done with [tokens_to_lowercase()](api.rst#tmtoolkit.preprocess.TMPreproc.tokens_to_lowercase):

In [None]:
preproc.tokens_to_lowercase()
preproc.print_summary()

The method [clean_tokens()](api.rst#tmtoolkit.preprocess.TMPreproc.clean_tokens) finally applies several steps that remove tokens that meet certain criteria. This includes removing:

- punctuation tokens
- stopwords (very common words for the given language)
- empty tokens (i.e. `''`)
- tokens that are longer or shorter than a certain number of characters
- numbers  

Note that this is a language-dependent method, because the default stopword list is determined per language. This method has many parameters to tweak, so it's recommended to check out the documentation.

In [None]:
# remove punct., stopwords, empty tokens (this is the default)
# plus tokens shorter than 2 characters and numeric tokens like "2019"
preproc.clean_tokens(remove_numbers=True, remove_shorter_than=2)
preproc.print_summary()
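
The criteria listed above can be sketched as a plain filter function. Everything in this sketch is an assumption made for illustration: the helper `clean`, its tiny stopword set and the exact order of the checks are not taken from tmtoolkit, only the parameter names `remove_shorter_than` and `remove_numbers` are borrowed from the call above.

```python
import string

# Hypothetical, tiny stopword list; real lists are language-specific and much longer.
STOPWORDS = {'the', 'in', 'of', 'a'}


def clean(tokens, remove_shorter_than=2, remove_numbers=True):
    """Drop empty, punctuation-only, stopword, too short and (optionally) numeric tokens."""
    kept = []
    for t in tokens:
        if not t or all(c in string.punctuation for c in t):
            continue    # empty or punctuation-only token
        if t.lower() in STOPWORDS:
            continue    # stopword
        if len(t) < remove_shorter_than:
            continue    # too short
        if remove_numbers and t.isnumeric():
            continue    # numeric token like "2019"
        kept.append(t)
    return kept


result = clean(['the', 'election', 'of', '2019', ',', '', 'x', 'results'])
print(result)
```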

Due to the removal of several tokens in the previous step, the document lengths for the processed corpus are much smaller than for the original corpus:

In [None]:
preproc.doc_lengths, preproc_orig.doc_lengths

We can also observe that the vocabulary got smaller after the processing steps, which, for large corpora, is also important in terms of computation time and memory consumption for later analyses:

In [None]:
len(preproc.vocabulary), len(preproc_orig.vocabulary)

You can also apply custom token transform functions by using [transform_tokens()](api.rst#tmtoolkit.preprocess.TMPreproc.transform_tokens) and passing it a function that should be applied to each token in each document (hence it must accept one string argument).

First let's define such a function. Here we create a simple function that should return a token's "shape" in terms of the case of its characters:

In [None]:
def token_shape(t):
    return ''.join(['X' if str.isupper(c) else 'x' for c in t])

token_shape('EU'), token_shape('CamelCase'), token_shape('lower')

We can now apply this function to our documents (we will use the original documents here, because they were not transformed to lower case):

In [None]:
preproc = preproc_orig.copy() # swap instances for later

preproc_orig.transform_tokens(token_shape)   # apply function
preproc_orig.print_summary()

# remove instance
del preproc_orig

#### Expanding compound words and joining tokens

Compound words like "US-Student" or "non-recyclable" can be expanded to separate tokens like "US", "Student" and "non", "recyclable" using [expand_compound_tokens()](api.rst#tmtoolkit.preprocess.TMPreproc.expand_compound_tokens). However, depending on the language model, most of these compounds will already be separated on initial tokenization.

In [None]:
orig_vocab = preproc.vocabulary
preproc.expand_compound_tokens()

# create set difference to show vocabulary tokens
# that were expanded
set(orig_vocab) - set(preproc.vocabulary)
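
The expansion itself is a simple string operation. Here is a minimal stand-alone sketch; the helper `expand_compounds` and its `split_chars` parameter are hypothetical names for this illustration, not tmtoolkit's API.

```python
def expand_compounds(tokens, split_chars='-'):
    """Split each token at any character in split_chars and drop empty parts."""
    expanded = []
    for t in tokens:
        parts = [t]
        for ch in split_chars:
            # split every part obtained so far at the current character
            parts = [p for part in parts for p in part.split(ch)]
        expanded.extend(p for p in parts if p)
    return expanded


expanded = expand_compounds(['US-Student', 'non-recyclable', 'plain'])
print(expanded)
```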

It's also possible to join together certain *subsequent* occurrences of tokens or token patterns. This means you can for example transform all of the subsequent tokens "White" and "House" to single tokens "White_House". In case you don't use n-grams (described in a separate section), this is very helpful when you want to capture a named entity that is made up of several tokens, such as persons, institutions or concepts like "Climate Change", as a single token. The method to use for this is [glue_tokens()](api.rst#tmtoolkit.preprocess.TMPreproc.glue_tokens). It accepts the following parameters:

- a `patterns` sequence of length *N* that is used to match the subsequent *N* tokens;
- a `glue` string that is used to join the matched subsequent tokens (by default: `"_"`).

Along with that, you can adjust the token matching with the [common token matching parameters](#Common-parameters-for-pattern-matching-functions) described below.

Let's "glue" all subsequent occurrences of "White" and "House". The `glue_tokens()` method will return a set of glued tokens that matched the provided pattern:

In [None]:
preproc_orig = preproc.copy()  # make a copy of full orig. data for later use
preproc.glue_tokens(['White', 'House'])

In [None]:
preproc.tokens['NewsArticles-1880'][:20]

In [None]:
del preproc
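
The gluing logic can be sketched for a single token list with exact matching only. The helper `glue` below is a hypothetical illustration and leaves out the pattern matching options that `glue_tokens()` supports.

```python
def glue(tokens, pattern, glue_str='_'):
    """Join each run of subsequent tokens that equals `pattern` into a single token."""
    pattern = list(pattern)
    n = len(pattern)
    if n == 0:
        return list(tokens), set()   # nothing to match

    result, glued = [], set()
    i = 0
    while i < len(tokens):
        if tokens[i:i + n] == pattern:
            joined = glue_str.join(pattern)
            result.append(joined)
            glued.add(joined)
            i += n                   # skip the tokens that were merged
        else:
            result.append(tokens[i])
            i += 1
    return result, glued


new_tokens, glued = glue(['The', 'White', 'House', 'said', 'white', 'house'],
                         ['White', 'House'])
print(new_tokens)
print(glued)
```

Since this sketch matches exactly, the lowercase pair "white", "house" stays untouched.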

### Keywords-in-context (KWIC) and general filtering methods

*Keywords-in-context (KWIC)* allow you to quickly investigate certain keywords and their neighborhood of tokens, i.e. the tokens that appear right before and after this keyword.

`TMPreproc` provides three methods for this purpose:

- [get_kwic()](api.rst#tmtoolkit.preprocess.TMPreproc.get_kwic) is the base method accepting a search pattern and several options that control how the search pattern is matched (more on that below); use this function when you want to further process the output of a KWIC search;
- [get_kwic_table()](api.rst#tmtoolkit.preprocess.TMPreproc.get_kwic_table) is the more user-friendly version of the above method, as it produces a datatable with the highlighted keyword by default;
- [filter_tokens_with_kwic()](api.rst#tmtoolkit.preprocess.TMPreproc.filter_tokens_with_kwic) works similarly to the above methods but applies the result by filtering the documents; it is explained in the [section on filtering](#Filtering-tokens-and-documents).

Let's see the KWIC methods in action:

In [None]:
preproc = preproc_orig.copy()  # use orig. full data
preproc.get_kwic('house', ignore_case=True)

The method returns a dictionary that maps document labels to the KWIC results. Each document contains a list of "contexts", i.e. a list of tokens that surround a keyword, here `"house"`. This keyword stands in the middle and is surrounded by its "context tokens", which by default means two tokens to the left and two tokens to the right (there may be fewer when the keyword is near the start or the end of a document).

We can see that `NewsArticles-1880` contains four contexts, `NewsArticles-99` one context and `NewsArticles-3350` none.

With `get_kwic_table()`, we get back a datatable which provides a better formatting for quick investigation. See how the matched tokens are highlighted as `*house*` and empty results are removed:

In [None]:
preproc.get_kwic_table('house', ignore_case=True)

An important parameter is `context_size`. It determines the number of tokens to display left and right to the found keyword. You can either pass a single integer for a symmetric context or a tuple with integers `(<left>, <right>)`:

In [None]:
preproc.get_kwic_table('house', ignore_case=True, context_size=4)

In [None]:
preproc.get_kwic_table('house', ignore_case=True, context_size=(1, 4))
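
To make the role of `context_size` explicit, here is a minimal KWIC sketch for a single token list with exact matching. The helper `kwic` and the example tokens are hypothetical; only the parameter names `context_size` and `ignore_case` are borrowed from the methods above.

```python
def kwic(tokens, keyword, context_size=2, ignore_case=False):
    """Return one list of context tokens per occurrence of `keyword`."""
    # accept a single int (symmetric context) or a (left, right) tuple
    if isinstance(context_size, int):
        left, right = context_size, context_size
    else:
        left, right = context_size

    norm = str.lower if ignore_case else (lambda s: s)
    contexts = []
    for i, t in enumerate(tokens):
        if norm(t) == norm(keyword):
            # max() keeps the window inside the document near its start;
            # slicing clips it automatically near the end
            contexts.append(tokens[max(0, i - left):i + right + 1])
    return contexts


tokens = ['The', 'White', 'House', 'said', 'the', 'house', 'was', 'full']
symmetric = kwic(tokens, 'house', ignore_case=True)
asymmetric = kwic(tokens, 'house', context_size=(1, 4), ignore_case=True)
print(symmetric)
print(asymmetric)
```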

The KWIC functions become really powerful when using the pattern matching options. So far, we were looking for *exact* (but case insensitive) matches between the corpus tokens and our keyword `"house"`. However, it is also possible to match patterns like `"new*"` (matches any word starting with "new") or `"agenc(y|ies)"` (a regular expression matching "agency" and "agencies"). The next section gives an introduction to the different options for pattern matching.

#### Common parameters for pattern matching functions

Several functions and methods in tmtoolkit support pattern matching, including the already mentioned KWIC functions but also functions for filtering tokens or documents as you will see later. They all share similar function signatures, i.e. similar parameters:

- `search_token` or `search_tokens`: lets you specify one or more patterns as strings
- `match_type`: sets the matching type and can be one of the following options:
  - `'exact'` (default): exact string matching (optionally ignoring character case), i.e. no pattern matching
  - `'regex'` uses [regular expression](https://docs.python.org/3/library/re.html) matching
  - `'glob'` uses "glob patterns" like `"politic*"` which matches for example "politic", "politics" or "politician" (see [globre package](https://pypi.org/project/globre/))
- `ignore_case`: ignore character case (applies to all three match types)
- `glob_method`: if `match_type` is 'glob', use this glob method. Must be `'match'` or `'search'` (similar behavior as Python's [re.match](https://docs.python.org/3/library/re.html#re.match) or [re.search](https://docs.python.org/3/library/re.html#re.search))
- `inverse`: invert the match results, i.e. if matching for "hello", return all results that do *not* match "hello"

Let's try out some of these options with `get_kwic_table()`:

In [None]:
# using a regular expression, ignoring case
preproc.get_kwic_table(r'agenc(y|ies)', match_type='regex', ignore_case=True)

In [None]:
# using a glob, ignoring case
preproc.get_kwic_table('pol*', match_type='glob', ignore_case=True)

In [None]:
# using a glob, ignoring case
preproc.get_kwic_table('*sol*', match_type='glob', ignore_case=True)

In [None]:
# using a regex that matches all tokens with at least one vowel and
# inverting these matches, i.e. all tokens *without* any vowels
preproc.get_kwic_table(r'[AEIOUaeiou]', match_type='regex', inverse=True)
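
How the matching options interact can be sketched with the standard library. The helper `match_tokens` below is hypothetical and rests on several assumptions: it uses `fnmatch` as a stand-in for the globre package, requires a glob to match the whole token, and uses `re.search` semantics for regular expressions. tmtoolkit's own behavior may differ in these details.

```python
import fnmatch
import re


def match_tokens(pattern, tokens, match_type='exact', ignore_case=False, inverse=False):
    """Return one boolean per token (illustrative helper, standard library only)."""
    flags = re.IGNORECASE if ignore_case else 0
    if match_type == 'exact':
        regex = re.compile(re.escape(pattern), flags)
        hits = [regex.fullmatch(t) is not None for t in tokens]
    elif match_type == 'regex':
        regex = re.compile(pattern, flags)
        hits = [regex.search(t) is not None for t in tokens]
    elif match_type == 'glob':
        # fnmatch.translate() turns a glob into a regex that must match the whole token
        regex = re.compile(fnmatch.translate(pattern), flags)
        hits = [regex.match(t) is not None for t in tokens]
    else:
        raise ValueError(f'unknown match_type: {match_type}')
    # inverse=True flips every result
    return [not h for h in hits] if inverse else hits


tokens = ['Agency', 'agencies', 'police', 'Politics', 'xyz']
print(match_tokens(r'agenc(y|ies)', tokens, match_type='regex', ignore_case=True))
print(match_tokens('pol*', tokens, match_type='glob', ignore_case=True))
print(match_tokens(r'[AEIOUaeiou]', tokens, match_type='regex', inverse=True))
```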

#### Filtering tokens and documents

We can use the pattern matching parameters in numerous filtering methods. The heart of many of these methods is [token_match()](api.rst#tmtoolkit.preprocess.token_match). Given a search pattern, a list of tokens and optionally some pattern matching parameters, it returns a boolean NumPy array of the same length as the input tokens. Each occurrence of `True` in this array signals a match.

In [None]:
from tmtoolkit.preprocess import token_match

# first 10 tokens of document "NewsArticles-1880"
doc_snippet = preproc.tokens['NewsArticles-1880'][:10]
# get all tokens that match "to*"
matches = token_match('to*', doc_snippet, match_type='glob')

# iterate through tokens and matches, show pair-wise results
for tok, match in zip(doc_snippet, matches):
    print(tok, ':', match)

The `token_match()` function is a rather low-level function that you may use for pattern matching against any list/array of strings, e.g. a list of tokens, file names, etc.

The following methods cover common use cases for filtering during text preprocessing. Many of these methods start with either `filter_...()` or `remove_...()`, and these pairs of filter and remove methods are complements. A filter method will always *retain* the matched elements whereas a remove method will always *drop* the matched elements. We can observe that with the first pair of methods, [filter_tokens()](api.rst#tmtoolkit.preprocess.TMPreproc.filter_tokens) and [remove_tokens()](api.rst#tmtoolkit.preprocess.TMPreproc.remove_tokens):

<div class="alert alert-info">

So much `.copy()`
    
Note that the following code snippets make a lot of use of the `copy()` method. This is because we want to show how the different methods work with the *same original data* (remember that a `TMPreproc` instance behaves like a state machine) and also want to "clean up" the temporary instances. Under normal circumstances, you wouldn't use `copy()` so excessively.
    
</div>

In [None]:
# retain only the tokens that match the pattern in each document
preproc.filter_tokens('*house*', match_type='glob', ignore_case=True)
preproc.print_summary()

del preproc

In [None]:
preproc = preproc_orig.copy()  # make a copy from full data

preproc.remove_tokens('*house*', match_type='glob', ignore_case=True)
preproc.print_summary()

del preproc
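
That filter and remove are complements can be seen in a small stand-alone sketch. It uses a hypothetical token list and `fnmatch` from the standard library as a stand-in for tmtoolkit's glob matching.

```python
import fnmatch

tokens = ['The', 'White', 'House', 'and', 'the', 'household']

# One boolean per token: does it match the glob pattern "*house*" (case ignored)?
mask = [fnmatch.fnmatchcase(t.lower(), '*house*') for t in tokens]

filtered = [t for t, m in zip(tokens, mask) if m]       # "filter": retain the matches
removed = [t for t, m in zip(tokens, mask) if not m]    # "remove": drop the matches

print(filtered)
print(removed)
```

Every token ends up in exactly one of the two results.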

The pair [filter_documents()](api.rst#tmtoolkit.preprocess.TMPreproc.filter_documents) and [remove_documents()](api.rst#tmtoolkit.preprocess.TMPreproc.remove_documents) works similarly, but retains or drops whole documents according to the supplied match criteria. Both accept the standard pattern matching parameters but also a parameter `matches_threshold` with default value `1`. When this number of matching tokens is reached, the document will be part of the result set (`filter_documents()`) or removed from the result set (`remove_documents()`). This way, we can for example retain only those documents that contain certain token patterns.

Let's try these methods out in practice:

In [None]:
preproc = preproc_orig.copy()  # make a copy from full data

preproc.filter_documents('*house*', match_type='glob', ignore_case=True)
preproc.print_summary()

del preproc

We can see that two out of three documents contained the pattern `'*house*'` and hence were retained.

We can also adjust `matches_threshold` to set the minimum number of token matches for filtering:

In [None]:
preproc = preproc_orig.copy()  # make a copy from full data

preproc.filter_documents('*house*', match_type='glob', ignore_case=True,
                         matches_threshold=4)
preproc.print_summary()

del preproc

In [None]:
preproc = preproc_orig.copy()  # make a copy from full data

preproc.remove_documents('*house*', match_type='glob', ignore_case=True)
preproc.print_summary()

del preproc

When we use `remove_documents()` we get only the documents that did *not* contain the specified pattern.

Another useful pair of methods is [filter_documents_by_name()](api.rst#tmtoolkit.preprocess.TMPreproc.filter_documents_by_name) and [remove_documents_by_name()](api.rst#tmtoolkit.preprocess.TMPreproc.remove_documents_by_name). Both methods again accept the same pattern matching parameters but they only apply them to the document names, i.e. document *labels*:

In [None]:
preproc = preproc_orig.copy()  # make a copy from full data

preproc.filter_documents_by_name(r'-\d{4}$', match_type='regex')
preproc.print_summary()

del preproc

In the above example, we retain only the documents whose labels end with a hyphen followed by exactly four digits, like "...-1234". Hence, we only get "NewsArticles-1880" and "NewsArticles-3350" but not "NewsArticles-99". Again, `remove_documents_by_name()` will do the exact opposite.
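
The effect of this regular expression on the document labels can be checked with Python's `re` module alone. This is only a sketch of the matching logic; that the pattern is searched anywhere in the label (as `re.search()` does) is an assumption, and the label `'NewsArticles-12345'` is made up for illustration.

```python
import re

doc_labels = ['NewsArticles-1880', 'NewsArticles-3350', 'NewsArticles-99',
              'NewsArticles-12345']   # the last label is hypothetical
pattern = re.compile(r'-\d{4}$')

# "$" anchors the match at the end of the label, so the hyphen must be
# followed by exactly four digits and nothing else
retained = [lbl for lbl in doc_labels if pattern.search(lbl)]
removed = [lbl for lbl in doc_labels if not pattern.search(lbl)]

print(retained)   # ['NewsArticles-1880', 'NewsArticles-3350']
print(removed)    # ['NewsArticles-99', 'NewsArticles-12345']
```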

You may also use [Keywords-in-context (KWIC)](#Keywords-in-context-(KWIC)-and-general-filtering-methods) to retain only the tokens in the neighborhood of certain keyword pattern(s). The method for that is called [filter_tokens_with_kwic()](api.rst#tmtoolkit.preprocess.TMPreproc.filter_tokens_with_kwic). It works very similarly to [get_kwic()](api.rst#tmtoolkit.preprocess.TMPreproc.get_kwic), but it filters the documents in the `TMPreproc` instance itself, so you can continue working with that instance as usual. Here, we filter the tokens in each document to get the tokens directly before and after each match of the glob pattern `'*house*'` (`context_size=1`):

In [None]:
preproc = preproc_orig.copy()  # make a copy from full data

preproc.filter_tokens_with_kwic('*house*', context_size=1,
                                match_type='glob', ignore_case=True)
preproc.tokens_datatable
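
The idea behind this kind of context filtering can be sketched in a few lines of plain Python. The function `filter_tokens_with_context()` below is a hypothetical helper written for this illustration, not part of tmtoolkit, and it assumes that the matched token itself is kept along with its neighbors.

```python
from fnmatch import fnmatchcase


def filter_tokens_with_context(tokens, pattern, context_size=1):
    """Keep each matching token plus `context_size` neighbors on either side."""
    keep = [False] * len(tokens)
    for i, tok in enumerate(tokens):
        if fnmatchcase(tok.lower(), pattern.lower()):
            # mark the window around the match, clipped to the document bounds
            start = max(0, i - context_size)
            end = min(len(tokens), i + context_size + 1)
            for j in range(start, end):
                keep[j] = True
    return [tok for tok, k in zip(tokens, keep) if k]


tokens = ['The', 'White', 'House', 'said', 'on', 'Monday',
          'that', 'household', 'costs', 'rose']
result = filter_tokens_with_context(tokens, '*house*', context_size=1)
print(result)   # ['White', 'House', 'said', 'that', 'household', 'costs']
```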

Once you have annotated your documents' tokens with Part-of-Speech (POS) tags, you can also filter them using [filter_for_pos()](api.rst#tmtoolkit.preprocess.TMPreproc.filter_for_pos):

In [None]:
del preproc 

preproc = preproc_orig.copy()  # make a copy from full data

# apply POS tagging and retain only nouns
preproc.pos_tag().filter_for_pos('N').tokens_datatable

In [None]:
del preproc

In this example we filtered for tokens that were identified as nouns by passing the *simplified POS tag* `'N'` (for more on simplified tags, see the method documentation). We can also filter for more than one tag, e.g. nouns or verbs by passing a list of required POS tags.

`filter_for_pos()` has no `remove_...()` counterpart, but you can set the `inverse` parameter to `True` to achieve the same effect.

Finally, there are two methods for removing tokens based on their [document frequency](#Accessing-tokens,-vocabulary-and-other-important-properties): [remove_common_tokens()](api.rst#tmtoolkit.preprocess.TMPreproc.remove_common_tokens) and [remove_uncommon_tokens()](api.rst#tmtoolkit.preprocess.TMPreproc.remove_uncommon_tokens). The former removes all tokens whose document frequency is greater than or equal to a threshold defined by the parameter `df_threshold`. The latter removes all tokens whose document frequency is lower than or equal to `df_threshold`. The threshold can be given either as a relative frequency (the default) or as an absolute count (by setting `absolute=True`).
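
As a rough illustration of these two thresholds, the following plain-Python sketch computes document frequencies for a made-up toy corpus and applies both rules. The helper names `remove_common()` and `remove_uncommon()` are hypothetical; they only follow the description above and are not the tmtoolkit implementation.

```python
from collections import Counter

# Toy corpus (made-up data): document label -> list of tokens
docs = {
    'doc-a': ['the', 'house', 'is', 'big', 'the', 'end'],
    'doc-b': ['the', 'senate', 'is', 'in', 'session'],
    'doc-c': ['the', 'house', 'voted'],
}

# document frequency: in how many documents does each token occur?
# (set() makes sure a token is counted only once per document)
doc_freq = Counter(tok for toks in docs.values() for tok in set(toks))
n_docs = len(docs)


def remove_common(docs, df_threshold, absolute=False):
    """Drop tokens whose document frequency is >= the threshold."""
    limit = df_threshold if absolute else df_threshold * n_docs
    return {lbl: [t for t in toks if doc_freq[t] < limit]
            for lbl, toks in docs.items()}


def remove_uncommon(docs, df_threshold, absolute=False):
    """Drop tokens whose document frequency is <= the threshold."""
    limit = df_threshold if absolute else df_threshold * n_docs
    return {lbl: [t for t in toks if doc_freq[t] > limit]
            for lbl, toks in docs.items()}


without_common = remove_common(docs, 0.9)
without_rare = remove_uncommon(docs, 1, absolute=True)

print(without_common['doc-c'])   # ['house', 'voted']
print(without_rare['doc-a'])     # ['the', 'house', 'is', 'the']
```

Here, only "the" occurs in all three documents (relative document frequency 1.0) and is therefore removed as a common token, while tokens such as "big" and "end" occur in a single document and are removed as uncommon tokens.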

Before applying the method, let's have a look at the number of tokens per document again, to later see how many we will remove. We will also store the vocabulary in `orig_vocab` for later comparison:

In [None]:
preproc = preproc_orig.copy()  # make a copy from full data
orig_vocab = preproc.vocabulary
preproc.doc_lengths

In [None]:
preproc.remove_common_tokens(df_threshold=0.9).doc_lengths

By removing all tokens with a relative document frequency of at least 0.9, we removed quite a number of tokens in each document. Let's investigate the vocabulary in order to see which tokens were removed:

In [None]:
# set difference gives removed vocabulary tokens
set(orig_vocab) - set(preproc.vocabulary)

In [None]:
del preproc

`remove_uncommon_tokens()` works similarly. This time, let's use an absolute number as threshold:

In [None]:
preproc = preproc_orig.copy()  # make a copy from full data

preproc.remove_uncommon_tokens(df_threshold=1, absolute=True)

# set difference gives removed vocabulary tokens
# this time, show only the first 10 tokens that were removed
sorted(set(orig_vocab) - set(preproc.vocabulary))[:10]

The above means that we remove all tokens that appear in exactly one document.

In [None]:
del preproc

### Working with token metadata

`TMPreproc` allows you to attach arbitrary metadata to each token in each document. Such token "annotations" are very useful. For example, you may add metadata that records a token's length or whether it consists only of uppercase letters, and later use that for filtering or in further analyses. One method to add such metadata is [add_metadata_per_doc()](api.rst#tmtoolkit.preprocess.TMPreproc.add_metadata_per_doc). This method expects a dict that maps document labels to the respective token metadata list. The list's length must match the number of tokens in the respective document. First, we need to create such a metadata dict. Let's start with the tokens' lengths:

In [None]:
preproc = preproc_orig.copy()  # make a copy from full data

meta_tok_lengths = {doc_label: list(map(len, doc_tokens))
                    for doc_label, doc_tokens in preproc.tokens.items()}

# show first 10 tokens and their string length for a sample document
list(zip(preproc.tokens['NewsArticles-1880'][:10],
         meta_tok_lengths['NewsArticles-1880'][:10]))

We can now add these metadata via [add_metadata_per_doc()](api.rst#tmtoolkit.preprocess.TMPreproc.add_metadata_per_doc). We pass the metadata key (a label for the metadata) and the previously generated metadata dict:

In [None]:
preproc.add_metadata_per_doc('length', meta_tok_lengths)
del meta_tok_lengths  # we don't need that object anymore

The property `.tokens_datatable` now shows an additional column `meta_length` (the metadata key is always prefixed with `meta_`):

In [None]:
preproc.tokens_datatable

Let's add a boolean indicator for whether the given token is all uppercase:

In [None]:
meta_tok_upper = {doc_label: list(map(str.isupper, doc_tokens))
                  for doc_label, doc_tokens in preproc.tokens.items()}

preproc.add_metadata_per_doc('upper', meta_tok_upper)
del meta_tok_upper

preproc.tokens_datatable

You can now use these newly added columns, for example, to filter the datatable:

In [None]:
import datatable as dt

preproc.tokens_datatable[dt.f.meta_upper == 1, :]

To see which metadata keys were already created, you can use [get_available_metadata_keys()](api.rst#tmtoolkit.preprocess.TMPreproc.get_available_metadata_keys):

In [None]:
preproc.get_available_metadata_keys()

Token metadata can be removed with [remove_metadata()](api.rst#tmtoolkit.preprocess.TMPreproc.remove_metadata):

In [None]:
preproc.remove_metadata('upper')
preproc.get_available_metadata_keys()

In [None]:
preproc.tokens_datatable

We can tell [filter_tokens()](api.rst#tmtoolkit.preprocess.TMPreproc.filter_tokens) and similar methods to use metadata instead of the tokens for matching. For example, we can use the metadata `meta_length`, which we created before, to filter for tokens of a certain length:

In [None]:
preproc_meta_example = preproc.copy()
preproc_meta_example.filter_tokens(3, by_meta='length')
preproc_meta_example.tokens_datatable

In [None]:
del preproc_meta_example

Note that all matching options then apply to the metadata column, in this case to the `meta_length` column which contains integers. Since `filter_tokens()` by default employs exact matching, we get all tokens where `meta_length` equals the first argument, `3`. If we used regular expression or glob matching instead, this method would fail because you can only use that for string data.
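
Conceptually, matching by metadata means that the comparison is made against the per-token metadata values, while the tokens at the same positions are the ones being kept or dropped. The following sketch shows this with plain lists; `filter_by_meta()` is a hypothetical helper for illustration only and uses exact matching.

```python
tokens = ['The', 'White', 'House', 'said', 'yes']

# per-token metadata: one value per token, same length as `tokens`
meta_length = [len(t) for t in tokens]


def filter_by_meta(tokens, meta, value):
    """Keep the tokens whose metadata value equals `value` (exact matching)."""
    return [tok for tok, m in zip(tokens, meta) if m == value]


three_letter = filter_by_meta(tokens, meta_length, 3)
print(three_letter)   # ['The', 'yes']
```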

If you want to use more complex filter queries, you should create a "filter mask" and pass it to [filter_tokens_by_mask()](api.rst#tmtoolkit.preprocess.TMPreproc.filter_tokens_by_mask). A filter mask is a dictionary that maps a document label to a sequence of booleans. For all occurrences of `True`, the respective token in the document will be retained, all others will be removed. Let's try that out with a small sample:

In [None]:
preproc.pos_tag().tokens_datatable

We now generate the filter mask, which means for each document we create a boolean list or array that for each token in that document indicates whether that token should be kept or removed.

We will iterate through the [tokens_with_metadata](api.rst#tmtoolkit.preprocess.TMPreproc.tokens_with_metadata) property, which is a dict that for each document contains a datatable with its tokens and metadata. Let's have a look at the first document's datatable:

In [None]:
next(iter(preproc.tokens_with_metadata.values()))

Now we can create the filter mask:

In [None]:
import numpy as np

filter_mask = {}
for doc_label, doc_data in preproc.tokens_with_metadata.items():
    # extract the columns "meta_length" and "pos"
    # and convert them to NumPy arrays
    doc_data_subset = doc_data[:, [dt.f.meta_length, dt.f.pos]]
    tok_lengths, tok_pos = map(np.array, doc_data_subset.to_list())
    
    # create a boolean array for nouns with a token length of at most 5
    filter_mask[doc_label] = (tok_lengths <= 5) & np.isin(tok_pos, ['NOUN', 'PROPN'])

# it's not necessary to add the filter mask as metadata
# but it's a good way to check the mask
preproc.add_metadata_per_doc('small_nouns', filter_mask)
preproc.tokens_datatable

Finally, we can pass the mask dict to [filter_tokens_by_mask()](api.rst#tmtoolkit.preprocess.TMPreproc.filter_tokens_by_mask):

In [None]:
preproc.filter_tokens_by_mask(filter_mask)
preproc.tokens_datatable
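
Stripped of the datatable and NumPy machinery, a filter mask is simply one boolean per token. The sketch below builds and applies such a mask for a made-up toy corpus of `(token, POS tag)` pairs. The helper `apply_mask()` is hypothetical and only mirrors the behavior described above.

```python
# Toy corpus (made-up data): document label -> list of (token, POS tag) pairs
docs = {
    'doc-a': [('House', 'PROPN'), ('approves', 'VERB'),
              ('budget', 'NOUN'), ('bill', 'NOUN')],
    'doc-b': [('Markets', 'NOUN'), ('fell', 'VERB')],
}

# one boolean per token: True for nouns / proper nouns with at most 5 characters
filter_mask = {
    label: [len(tok) <= 5 and pos in ('NOUN', 'PROPN') for tok, pos in toks]
    for label, toks in docs.items()
}


def apply_mask(docs, mask):
    """Keep the tokens whose mask entry is True; mask lengths must match."""
    result = {}
    for label, toks in docs.items():
        if len(mask[label]) != len(toks):
            raise ValueError(f'mask length mismatch for {label}')
        result[label] = [tok for (tok, _), keep in zip(toks, mask[label]) if keep]
    return result


filtered = apply_mask(docs, filter_mask)
print(filter_mask['doc-a'])   # [True, False, False, True]
print(filtered)               # {'doc-a': ['House', 'bill'], 'doc-b': []}
```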

### Generating n-grams

So far, we worked with *unigrams*, i.e. each document consisted of a sequence of discrete tokens. We can also generate *n-grams* from our corpus where each document consists of a sequence of *n* subsequent tokens. An example would be:

Document: "This is a simple example."

**n=1 (unigrams):**

    ['This', 'is', 'a', 'simple', 'example', '.']

**n=2 (bigrams):**

    ['This is', 'is a', 'a simple', 'simple example', 'example .']

**n=3 (trigrams):**

    ['This is a', 'is a simple', 'a simple example', 'simple example .']
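
The lists above can be reproduced with a short plain-Python sketch. The function `ngrams()` is a hypothetical helper for illustration; joining the tokens of each n-gram with a space is an assumption made to match the example output.

```python
def ngrams(tokens, n, join_str=' '):
    """Return all runs of n consecutive tokens, each joined into one string."""
    return [join_str.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = ['This', 'is', 'a', 'simple', 'example', '.']
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)    # ['This is', 'is a', 'a simple', 'simple example', 'example .']
print(trigrams)   # ['This is a', 'is a simple', 'a simple example', 'simple example .']
```

A document with *k* tokens yields *k - n + 1* n-grams, so a document shorter than *n* tokens yields none.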

The method [generate_ngrams()](api.rst#tmtoolkit.preprocess.TMPreproc.generate_ngrams) allows us to generate n-grams from tokenized documents. We can then get the results with the `ngrams` property:

In [None]:
del preproc

preproc = preproc_orig.copy()  # make a copy from full data

preproc.generate_ngrams(2)  # generate bigrams
preproc.ngrams['NewsArticles-1880'][:10]  # show first 10 bigrams of this document

Afterwards, you may use [join_ngrams()](api.rst#tmtoolkit.preprocess.TMPreproc.join_ngrams) to merge the generated n-grams into joined tokens and use these as the new tokens in this `TMPreproc` instance:

In [None]:
preproc.join_ngrams()
preproc.tokens_datatable

In [None]:
del preproc

### Generating a sparse document-term matrix (DTM)

If you're working with a bag-of-words representation of your data, you usually convert the preprocessed documents to a document-term matrix (DTM), which represents the number of occurrences of each term (i.e. vocabulary token) in each document. This is an *N* rows by *M* columns matrix, where *N* is the number of documents and *M* is the vocabulary size (i.e. the number of unique tokens in the corpus).

Not all tokens from the vocabulary occur in all documents. In fact, many tokens will occur only in a small subset of the documents if you're dealing with a "real world" dataset. This means that most entries in such a DTM will be zero. Almost all functions in tmtoolkit therefore generate and work with *sparse* matrices, where only non-zero values are stored in computer memory.
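
To illustrate why sparse storage pays off, the following dependency-free sketch builds a DTM for a made-up toy corpus and stores only the non-zero counts in a dictionary keyed by `(row, column)`. This is a simplified stand-in for illustration: it is not the SciPy sparse matrix format that the generated DTMs use, and sorting labels and vocabulary to fix the row and column order is an assumption of this sketch.

```python
from collections import Counter

# Toy corpus (made-up data): document label -> list of tokens
docs = {
    'doc-a': ['the', 'house', 'is', 'big'],
    'doc-b': ['the', 'senate', 'is', 'in', 'session'],
    'doc-c': ['the', 'house', 'voted', 'the', 'bill'],
}

doc_labels = sorted(docs)                                     # row order
vocab = sorted({t for toks in docs.values() for t in toks})   # column order
col_index = {term: j for j, term in enumerate(vocab)}

# sparse storage: only non-zero counts, keyed by (row, column)
dtm = {}
for i, label in enumerate(doc_labels):
    for term, count in Counter(docs[label]).items():
        dtm[(i, col_index[term])] = count

n_rows, n_cols = len(doc_labels), len(vocab)
density = len(dtm) / (n_rows * n_cols)   # share of non-zero entries

# look up single entries; missing keys are implicit zeros
the_in_doc_c = dtm.get((doc_labels.index('doc-c'), col_index['the']), 0)
senate_in_doc_a = dtm.get((doc_labels.index('doc-a'), col_index['senate']), 0)

print(n_rows, n_cols, len(dtm))        # 3 9 13
print(the_in_doc_c, senate_in_doc_a)   # 2 0
```

Even in this tiny example, fewer than half of the 27 matrix entries are non-zero. With realistic vocabularies the share of non-zero entries is usually far smaller.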

For this example, we'll generate a DTM from the `preproc_orig` instance. First, let's check the number of documents and the vocabulary size:

In [None]:
preproc_orig.n_docs, preproc_orig.vocabulary_size

We can use the [dtm](api.rst#tmtoolkit.preprocess.TMPreproc.dtm) property to generate a sparse DTM from the current instance:

In [None]:
preproc_orig.dtm

We can see that a sparse matrix with 3 rows (corresponding to the number of documents) and 683 columns (corresponding to the vocabulary size) was generated. 816 elements in this matrix are non-zero.

We can convert this matrix to a non-sparse, i.e. *dense*, representation and see parts of its elements:

In [None]:
preproc_orig.dtm.todense()

However, note that you should only convert a sparse matrix to a dense representation when you're either dealing with a small amount of data (as in this example) or using only a part of the full matrix. Otherwise, the dense representation can easily exceed the available memory.

There exist different "formats" for sparse matrices, which have different advantages and disadvantages (see for example the [SciPy "sparse" module documentation](https://docs.scipy.org/doc/scipy/reference/sparse.html#usage-information)). **Not all formats support all operations that you can usually apply to an ordinary, dense matrix.** By default, the generated DTM is in *Compressed Sparse Row (CSR)* format. This format allows indexing and is especially optimized for fast row access. You may convert it to any other sparse matrix format; see the mentioned SciPy documentation for this.

The rows of the DTM are aligned to the sequence of the document labels and its columns are aligned to the vocabulary. For example, let's find the frequency of the term "House" in the document "NewsArticles-1880". To do this, we find out the indices into the matrix:

In [None]:
preproc_orig.doc_labels.index('NewsArticles-1880')

In [None]:
preproc_orig.vocabulary.index('House')

This means the frequency of the term "House" in the document "NewsArticles-1880" is located in row 0 and column 67 of the DTM:

In [None]:
preproc_orig.dtm[0, 67]

See also the following example of finding out the index for "administration" and then getting an array that represents the number of occurrences of this token across all three documents:

In [None]:
vocab_admin_ix = preproc_orig.vocabulary.index('administration')
preproc_orig.dtm[:, vocab_admin_ix].todense()

Apart from the [dtm](api.rst#tmtoolkit.preprocess.TMPreproc.dtm) property, there's also the [get_dtm()](api.rst#tmtoolkit.preprocess.TMPreproc.get_dtm) method, which can also return the result as a datatable or pandas DataFrame. Note that these representations are not sparse and hence can consume a lot of memory.

In [None]:
preproc_orig.get_dtm(as_datatable=True)

### Serialization: Saving and loading `TMPreproc` objects

The current state of a `TMPreproc` object can also be stored to a file on disk so that you (or someone else who has tmtoolkit installed) can later restore it using that file. The methods for that are [save_state()](api.rst#tmtoolkit.preprocess.TMPreproc.save_state) and [load_state()](api.rst#tmtoolkit.preprocess.TMPreproc.load_state) / [from_state()](api.rst#tmtoolkit.preprocess.TMPreproc.from_state).

Let's store the current state of the `preproc_orig` instance:

In [None]:
preproc_orig.print_summary()
preproc_orig.save_state('data/preproc_state.pickle')

Let's change the object by retaining only the documents that contain a token matching the pattern `'*house*'` (see the reduced number of documents):

In [None]:
preproc_orig.filter_documents('*house*', match_type='glob', ignore_case=True)
preproc_orig.print_summary()

We can restore the saved data using [from_state()](api.rst#tmtoolkit.preprocess.TMPreproc.from_state):

In [None]:
preproc_restored = TMPreproc.from_state('data/preproc_state.pickle')
preproc_restored.print_summary()

You can see that the full dataset with three documents was restored.

This is especially useful when you have a large amount of data and run time-consuming operations, e.g. POS tagging. When these operations have finished, you can store the current state to disk and later retrieve it without the need to re-run them.
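
The general save-and-restore pattern can be sketched with Python's standard `pickle` module. This is only an illustration of the idea: the `state` dict is a hypothetical stand-in for a preprocessing object, and nothing here should be read as a description of the file format that `save_state()` actually writes.

```python
import os
import pickle
import tempfile

# hypothetical stand-in for the state of a preprocessing object
state = {
    'doc_labels': ['doc-a', 'doc-b'],
    'tokens': {'doc-a': ['the', 'house'], 'doc-b': ['the', 'senate']},
}

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'state.pickle')

    # save the (expensive to compute) state to disk
    with open(path, 'wb') as f:
        pickle.dump(state, f)

    # modify the in-memory object afterwards
    state['doc_labels'].remove('doc-b')
    del state['tokens']['doc-b']

    # restore the state as it was at the time of saving
    with open(path, 'rb') as f:
        restored = pickle.load(f)

print(state['doc_labels'])      # ['doc-a']
print(restored['doc_labels'])   # ['doc-a', 'doc-b']
```

As with any pickle-based approach, only load files from sources you trust, since unpickling can execute arbitrary code.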

## Functional API

The `TMPreproc` class provides a convenient object-oriented interface for parallel text processing and analysis. There is also a *functional API* provided in the [tmtoolkit.preprocess](api.rst#tmtoolkit-preprocess) module. Most of these functions accept a list of spaCy documents along with additional parameters. You may use these functions for quick prototyping, but it is generally much more convenient to use `TMPreproc`. The functional API does not provide parallel processing.

To initialize the functional API for a certain language, you need to start with [init_for_language()](api.rst#tmtoolkit.preprocess.init_for_language) and may then tokenize your raw text documents via [tokenize()](api.rst#tmtoolkit.preprocess.tokenize), which will generate a list of spaCy documents. Most other functions in this API accept such a list of spaCy documents as input.

```
from tmtoolkit.preprocess import init_for_language, tokenize

init_for_language('en')
docs = tokenize(['Hello this is a test.', 'And here comes another one.'])
```

---

The final result after applying preprocessing steps and hence transforming the text data is often a document-term matrix (DTM). The [bow module](api.rst#tmtoolkit-bow) contains several functions to work with DTMs, e.g. apply transformations such as *tf-idf* or compute some important summary statistics. The [next chapter](bow.ipynb) will introduce some of these functions.