# Text preprocessing

During text preprocessing, a corpus of documents is tokenized (i.e. the document strings are split into individual words, punctuation, numbers, etc.), and these tokens can then be transformed, filtered or annotated. The goal is to prepare the raw texts so that later analysis steps become easier, e.g. by reducing noise in the dataset. tmtoolkit provides a rich set of tools for this purpose in the [tmtoolkit.preprocess](api.rst#tmtoolkit-preprocess) module.
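To make the three steps concrete, here is a minimal sketch in plain Python. It does not use tmtoolkit: `simple_tokenize` and `simple_preprocess` are hypothetical helper names, and the regular expression is only a rough stand-in for a real tokenizer.

```python
import re

# Stand-in tokenizer for illustration only (not tmtoolkit's implementation):
# a token is either a run of word characters or a single punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def simple_tokenize(text):
    """Split a document string into word, number and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


def simple_preprocess(texts):
    """Tokenize, then transform (lowercase), then filter (drop punctuation)."""
    docs = [simple_tokenize(text) for text in texts]
    docs = [[tok.lower() for tok in doc] for doc in docs]
    docs = [[tok for tok in doc if tok.isalnum()] for doc in docs]
    return docs


simple_preprocess(["Hello world!", "Another example, with 2 numbers."])
# Out: [['hello', 'world'],
#       ['another', 'example', 'with', '2', 'numbers']]
```

Real tokenizers handle many cases this pattern ignores, such as contractions, hyphenated words or URLs.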

## Two approaches: functional API and TMPreproc class

There are two ways to apply text preprocessing methods to your documents: First, there is the [functional API](api.rst#module-tmtoolkit.preprocess) which consists of a set of Python functions that accept a list of (tokenized) documents. An example might be:

```python
from tmtoolkit.preprocess import tokenize, to_lowercase

corpus = [
    "Hello world!",    # document 1
    "Another example"  # document 2
]

docs = tokenize(corpus)
to_lowercase(docs)
# Out: [['hello', 'world', '!'],
#       ['another', 'example']]
```


The advantage of this approach is that it is straightforward and flexible. However, you must manage any metadata associated with the documents (e.g. document labels or token metadata) on your own. Furthermore, the processing is not done in parallel.

Second, there is the [TMPreproc class](api.rst#tmpreproc-class-for-parallel-text-preprocessing) which addresses these limitations. You can create an instance of this class from your (labelled) documents and then apply preprocessing methods to it. This instance is a "state machine", i.e. its contents (the documents) can change when you call a method. An example:

```python
from tmtoolkit.preprocess import TMPreproc

corpus = {
    "doc1": "Hello world!",
    "doc2": "Another example"
}

preproc = TMPreproc(corpus)     # documents are directly tokenized
preproc.tokens_to_lowercase()   # this changes the documents
preproc.tokens                  # one of many ways to access the tokens

# Out:
# {
#   'doc1': ['hello', 'world', '!'],
#   'doc2': ['another', 'example']
# }
```
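The difference between the two styles can be sketched without tmtoolkit. In the following illustration, `lowercase_copy` and `MiniPreproc` are hypothetical names invented for this example. They only mimic the behavior described above and say nothing about how `TMPreproc` is implemented.

```python
def lowercase_copy(docs):
    """Functional style: return new token lists, leave the input untouched."""
    return [[tok.lower() for tok in doc] for doc in docs]


class MiniPreproc:
    """Stateful style: the instance owns the documents and changes them.

    Hypothetical stand-in for illustration, not the TMPreproc implementation.
    """

    def __init__(self, docs):
        # docs maps a document label to its list of tokens
        self.tokens = {label: list(toks) for label, toks in docs.items()}

    def tokens_to_lowercase(self):
        # replaces the stored documents with their lowercased version
        self.tokens = {
            label: [tok.lower() for tok in toks]
            for label, toks in self.tokens.items()
        }


docs = [["Hello", "world", "!"]]
lowercase_copy(docs)   # returns [['hello', 'world', '!']]
docs                   # still [['Hello', 'world', '!']]

preproc = MiniPreproc({"doc1": ["Hello", "world", "!"]})
preproc.tokens_to_lowercase()
preproc.tokens         # now {'doc1': ['hello', 'world', '!']}
```

With the stateful style, the order of method calls matters, because each call works on the result of the previous one.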

The most important advantage is that `TMPreproc` employs parallel processing, i.e. it uses all available processors on your machine to do the computations necessary during preprocessing. For large text corpora, this can lead to a substantial speedup.

Both approaches offer mostly the same features in terms of available preprocessing methods. `TMPreproc` has some more methods to export the data to dataframes or datatables. In general, the functional API is mostly used for quick prototyping and when using a small amount of data. For projects with large amounts of data, it's recommended to use `TMPreproc`.

This chapter starts with a few examples using the functional API and then turns to `TMPreproc`.

## Functional API

The functions in the preprocessing module make up the [functional API](api.rst#module-tmtoolkit.preprocess) for text preprocessing. We will explore some of the available functions. Most of them require at least a list of tokenized documents as input. In order to tokenize raw text documents (for example from a [Corpus](text_corpora.ipynb) object), we can use [tokenize()](tmtoolkit.preprocess.tokenize).

Let's load a sample of three documents from the built-in *NewsArticles* dataset. We'll save the document labels in `doc_labels` since the functional API only works with lists of documents (not with dicts):

```python
from tmtoolkit.corpus import Corpus
from tmtoolkit.preprocess import tokenize

corpus = Corpus.from_builtin_corpus('english-NewsArticles').sample(3)
doc_labels = corpus.keys()
doc_labels

# Out: dict_keys(['NewsArticles-774', 'NewsArticles-1647', 'NewsArticles-1396'])
```

We can now tokenize these documents. We use `corpus.values()` to pass a list of documents. We get a list of tokenized documents back (i.e. a list of lists). We peek into the documents by showing at most the first 10 tokens of each.

```python
docs = tokenize(corpus.values())
[doc[:10] for doc in docs]

# Out:
# [['Russian', 'adventurer', 'Konyukhov', "'s", 'balloon',
#   'lands', 'after', '55-hour', 'nonstop', 'flight'],
#  ['Fallen', 'Navy', 'SEAL', "'s", 'widow',
#   'receives', 'standing', 'ovation', 'during', 'Trump'],
#  ['Troops', 'advance', 'towards', 'Ghazlani', 'base',
#   'near', 'Mosul', 'airport', 'Iraqi', 'troops']]
```
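Because the labels were saved separately, they can be paired with the tokenized documents again whenever needed. The following sketch uses plain Python and shortened stand-in data instead of the real sample. It assumes that the labels and the documents are in the same order, as is the case for the keys and values of a Python dict.

```python
# Shortened stand-in data; in the example above these come from
# corpus.keys() and tokenize(corpus.values()).
doc_labels = ['NewsArticles-774', 'NewsArticles-1647', 'NewsArticles-1396']
docs = [
    ['Russian', 'adventurer', 'Konyukhov'],
    ['Fallen', 'Navy', 'SEAL'],
    ['Troops', 'advance', 'towards'],
]

# Pair each label with its tokenized document (both are in the same order).
labelled_docs = dict(zip(doc_labels, docs))
labelled_docs['NewsArticles-1647']
# Out: ['Fallen', 'Navy', 'SEAL']
```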

