# Working with text corpora

Your text data usually comes in the form of (long) plain text strings that are stored in one or several files on disk. We can load and transform this data into a [`Corpus`](api.rst#TODO) object so that we can perform all kinds of operations that are implemented as *corpus functions* in tmtoolkit. The [`Corpus`](api.rst#TODO) class itself resembles a [Python dictionary](https://docs.python.org/3/tutorial/datastructures.html#dictionaries) with some additional functionality.

Let's import the `Corpus` class first:

In [1]:
from tmtoolkit.corpus import Corpus

## Loading text data

Several methods are implemented to load text data from different sources:

- load built-in datasets
- load plain text files (".txt files")
- load folder(s) with plain text files
- load a tabular (i.e. CSV or Excel) file containing document IDs and texts
- load a ZIP file containing plain text or tabular files

We can create a `Corpus` object and load a dataset in a single step using one of the `Corpus.from_...` methods. This is what we did in the [previous chapter](getting_started.ipynb) with `corp = Corpus.from_builtin_corpus('en-News100')`.

Let's load a folder with example documents. Make sure that the path is relative to the current working directory. The data for these examples can be downloaded from [GitHub](https://github.com/WZBSocialScienceCenter/tmtoolkit/tree/master/doc/source/data). 


<div class="alert alert-info">

### Note: Rich text documents

If you want to work with "rich text documents", i.e. formatted, non-plain text sources such as PDFs, Word documents or HTML files, you must convert them to one of the supported formats first. For example, you can use the [pdftotext](https://www.mankier.com/1/pdftotext) command from the Linux package `poppler-utils` to convert PDFs to plain text files, or [pandoc](https://pandoc.org/) to convert Word or HTML documents to plain text.

</div>

In [2]:
corp = Corpus.from_folder('data/corpus_example', language='en')
corp

<Corpus [3 documents  / language "en"]>

Again, we can have a look at which document labels were created:

In [3]:
corp.doc_labels

['sample1', 'sample2', 'sample3']

The [`corpus_summary`](api.rst#TODO) and [`print_summary`](api.rst#TODO) functions are very helpful to get a first overview of a loaded corpus:

In [4]:
from tmtoolkit.corpus import print_summary

print_summary(corp)

Corpus with 3 documents in English
> sample1 (8 tokens): This is the first example file . ☺
> sample2 (20 tokens): Here comes the second example .    This one contai...
> sample3 (36 tokens): And here we go with the third and final example fi...
total number of tokens: 64 / vocabulary size: 38


<div class="alert alert-info">

### Side note: Corpus functions

The [`corpus_summary`](api.rst#TODO) and [`print_summary`](api.rst#TODO) functions are examples of *corpus functions*. All corpus functions accept a [`Corpus`](api.rst#TODO) object as first argument and operate on it. A corpus function may retrieve information from a corpus and/or modify it. Most functions in the `tmtoolkit.corpus` module are corpus functions.

</div>

Another option is to create a `Corpus` object and add further documents using the `corpus_add_...` functions. Here we create an empty `Corpus` and then add documents via [`corpus_add_files`](api.rst#TODO), which is another example of a corpus function (one that modifies a `Corpus` object). It takes a `Corpus` object and one or more paths to raw text files.

In [5]:
corp = Corpus(language='en')
print_summary(corp)

Corpus with 0 document in English
total number of tokens: 0 / vocabulary size: 0


In [6]:
from tmtoolkit.corpus import corpus_add_files

corpus_add_files(corp, 'data/corpus_example/sample1.txt')
print_summary(corp)

Corpus with 1 document in English
> data_corpus_example-sample1 (8 tokens): This is the first example file . ☺
total number of tokens: 8 / vocabulary size: 8


Note that this time the document label is different. It's prefixed by a normalized version of the path to the document. We can alter the `doc_label_fmt` argument of [`corpus_add_files`](api.rst#TODO) in order to control how document labels are generated. But first, let's remove the previously loaded document from the corpus. Since a `Corpus` instance behaves like a Python `dict`, we can use `del`:

In [7]:
del corp['data_corpus_example-sample1']
print_summary(corp)

Corpus with 0 document in English
total number of tokens: 0 / vocabulary size: 0
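To illustrate what "behaves like a Python `dict`" means in practice, here is a minimal sketch built on `collections.abc.MutableMapping` from the standard library. The `ToyCorpus` class is a hypothetical stand-in that stores plain strings. It is not how tmtoolkit implements `Corpus`; it only shows why `del`, `len()` and `in` work on such an object.

```python
from collections.abc import MutableMapping


class ToyCorpus(MutableMapping):
    """Hypothetical dict-like container mapping document labels to texts."""

    def __init__(self, language):
        self.language = language
        self._docs = {}          # internal storage: label -> raw text

    def __getitem__(self, label):
        return self._docs[label]

    def __setitem__(self, label, text):
        self._docs[label] = text

    def __delitem__(self, label):
        # this is what makes `del toy[label]` work
        del self._docs[label]

    def __iter__(self):
        return iter(self._docs)

    def __len__(self):
        return len(self._docs)


toy = ToyCorpus(language='en')
toy['sample1'] = 'This is the first example file.'
n_before = len(toy)               # one document stored
del toy['sample1']                # remove it again by its label
n_after = len(toy)                # container is empty again
has_sample1 = 'sample1' in toy    # membership test comes for free
```

Implementing the five methods above is enough; `MutableMapping` derives the remaining dictionary methods such as `keys()`, `items()` and `pop()` from them.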


Now we use a modified `doc_label_fmt` parameter value to generate document labels only from the file name and not from the full path to the document. We also load three files now:

In [8]:
corpus_add_files(corp, ['data/corpus_example/sample1.txt',
                        'data/corpus_example/sample2.txt',
                        'data/corpus_example/sample3.txt'],
                 doc_label_fmt='{basename}')
print_summary(corp)


Corpus with 3 documents in English
> sample1 (8 tokens): This is the first example file . ☺
> sample2 (20 tokens): Here comes the second example .    This one contai...
> sample3 (36 tokens): And here we go with the third and final example fi...
total number of tokens: 64 / vocabulary size: 38
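The effect of `doc_label_fmt` can be mimicked with a few lines of plain Python. The following is only a sketch under assumptions: `make_doc_label` is a hypothetical helper, and the default format `'{path}-{basename}'` as well as the `{path}` and `{ext}` placeholders are inferred from the labels shown above rather than taken from the tmtoolkit source.

```python
import os


def make_doc_label(path, doc_label_fmt='{path}-{basename}'):
    """Hypothetical helper: build a document label from a file path."""
    dirname, filename = os.path.split(os.path.normpath(path))
    basename, ext = os.path.splitext(filename)
    # normalize the directory part: path separators become underscores
    norm_dir = '_'.join(part for part in dirname.split(os.sep) if part)
    return doc_label_fmt.format(path=norm_dir,
                                basename=basename,
                                ext=ext.lstrip('.'))


# assumed default format: normalized path plus file name
full_label = make_doc_label('data/corpus_example/sample1.txt')
# file name only, as in the cell above
short_label = make_doc_label('data/corpus_example/sample1.txt',
                             doc_label_fmt='{basename}')
```

With these assumptions, `full_label` is `'data_corpus_example-sample1'` and `short_label` is `'sample1'`, matching the labels in the two summaries.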


As noted at the beginning, there are more `corpus_add_...` functions and `Corpus.from_...` methods to load text data from different sources. See the [corpus module API](api.rst#TODO) for details.

<div class="alert alert-info">

### Note

Please be aware of the difference between the `corpus_add_...` functions and the `Corpus.from_...` methods: the former *modify* a given `Corpus` object, whereas the latter *create* a new `Corpus` object.

</div>
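The distinction between modifying and creating can be sketched with a toy example. `ToyDocs`, `from_texts` and `toy_add_texts` are hypothetical names used only for illustration; they merely mirror the naming pattern of the `Corpus.from_...` methods and `corpus_add_...` functions.

```python
class ToyDocs(dict):
    """Hypothetical stand-in for a corpus: maps document labels to texts."""

    @classmethod
    def from_texts(cls, texts):
        # "from_..." style: build and return a *new* object
        return cls(texts)


def toy_add_texts(docs, texts):
    # "corpus_add_..." style: change the object that was passed in
    docs.update(texts)


first = ToyDocs.from_texts({'a': 'First text.'})
second = ToyDocs.from_texts({'b': 'Second text.'})   # a separate object
toy_add_texts(first, {'c': 'Third text.'})           # `first` itself grows
```

After running this, `first` holds the documents `a` and `c`, while `second` still holds only `b`.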

## Corpus properties

A `Corpus` object provides several helpful properties that summarize the text data and several methods to manage the documents.

Let's start with the number of documents in the corpus. There are two ways to obtain this value: 


In [9]:
len(corp)

3

In [10]:
corp.n_docs

3

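Further properties provide the document labels (`doc_labels`), the corpus language (`language`), the name of the language model (`language_model`), whether sentence information is available (`has_sents`) and the number of worker processes used for parallel processing (`max_workers`):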

In [11]:
corp.doc_labels

['sample1', 'sample2', 'sample3']

In [13]:
corp.language

'en'

In [14]:
corp.language_model

'en_core_web_sm'

In [17]:
corp.has_sents

True

In [18]:
corp.max_workers

1

## Splitting by paragraphs

Another helpful method is [Corpus.split_by_paragraphs()](api.rst#tmtoolkit.corpus.Corpus.split_by_paragraphs). This allows splitting each document of the corpus by paragraph.

Again, let's have a look at our current corpus' documents:

In [16]:
print_corpus(corpus)

sample1 :
This is the first example file. 
---

sample2 :
Here comes the second example.

This one contains three lines of plain text which means two paragraphs.
---

sample3 :
And here we go with the third and final example file.
Another line of text.

§2.
This is the second paragraph.

The third and final paragraph.
---



As we can see, `sample1` contains one paragraph, `sample2` two and `sample3` three paragraphs. Now we can split those and get the expected number of documents (each paragraph is then an individual document):

In [17]:
corpus.split_by_paragraphs()
corpus

<Corpus [6 documents]>

Our newly created six documents:

In [18]:
print_corpus(corpus)

sample1-1 :
This is the first example file. 
---

sample2-1 :
Here comes the second example.
---

sample2-2 :
This one contains three lines of plain text which means two paragraphs.
---

sample3-1 :
And here we go with the third and final example file. Another line of text.
---

sample3-2 :
§2. This is the second paragraph.
---

sample3-3 :
The third and final paragraph.
---



You can further customize the splitting process by tweaking the parameters, e.g. the minimum number of line breaks used to detect paragraphs (default is two line breaks).
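To make the splitting rule concrete, here is a minimal sketch in plain Python that operates on a dictionary of raw texts. The function `split_docs_by_paragraphs` and its parameters `min_linebreaks` and `join_lines` are hypothetical names chosen for this illustration; the actual method may use different parameter names and handle more edge cases.

```python
import re


def split_docs_by_paragraphs(docs, min_linebreaks=2, join_lines=' '):
    """Split each text at runs of at least `min_linebreaks` line breaks."""
    pattern = re.compile(r'(?:\r?\n){%d,}' % min_linebreaks)
    result = {}
    for label, text in docs.items():
        # drop empty chunks that may occur at the start or end of a text
        paragraphs = [p for p in pattern.split(text.strip()) if p.strip()]
        for i, par in enumerate(paragraphs, start=1):
            # single line breaks inside a paragraph are replaced by `join_lines`
            lines = [ln.strip() for ln in par.splitlines() if ln.strip()]
            result[f'{label}-{i}'] = join_lines.join(lines)
    return result


docs = {
    'sample2': 'Here comes the second example.\n\n'
               'This one contains three lines of plain text '
               'which means two paragraphs.',
    'sample3': 'And here we go with the third and final example file.\n'
               'Another line of text.\n\n'
               '§2.\nThis is the second paragraph.\n\n'
               'The third and final paragraph.',
}

split_docs = split_docs_by_paragraphs(docs)
# requiring three line breaks finds no paragraph borders in these texts
unsplit_docs = split_docs_by_paragraphs(docs, min_linebreaks=3)
```

With the default of two line breaks, `split_docs` contains five documents labelled `sample2-1` to `sample3-3`. With `min_linebreaks=3`, each text stays a single document.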

## Sampling a corpus

Finally, you can sample the documents in a corpus using [Corpus.sample()](api.rst#tmtoolkit.corpus.Corpus.sample). To get a random sample of three documents from our corpus:

In [19]:
corpus.sample(3)

<Corpus [3 documents]>

In [20]:
corpus.doc_labels

['sample1-1', 'sample2-1', 'sample2-2', 'sample3-1', 'sample3-2', 'sample3-3']

Note that this returns a new `Corpus` instance by default and leaves the original corpus unchanged, as the full list of six document labels above shows. You can pass `as_corpus=False` if you only need a Python dict.
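The same idea, drawing documents without replacement and returning them as a new object, can be sketched with the standard library. `sample_docs` and its `seed` parameter are hypothetical and only illustrate the behaviour described above on a plain dictionary.

```python
import random


def sample_docs(docs, n, seed=None):
    """Return a new dict with `n` randomly chosen documents from `docs`."""
    rng = random.Random(seed)              # optional seed for reproducibility
    labels = rng.sample(sorted(docs), n)   # draw labels without replacement
    return {label: docs[label] for label in labels}


docs = {
    'sample1-1': 'This is the first example file.',
    'sample2-1': 'Here comes the second example.',
    'sample2-2': 'This one contains three lines of plain text '
                 'which means two paragraphs.',
    'sample3-1': 'And here we go with the third and final example file. '
                 'Another line of text.',
    'sample3-2': '§2. This is the second paragraph.',
    'sample3-3': 'The third and final paragraph.',
}

sampled = sample_docs(docs, 3, seed=1)
```

`sampled` holds three of the six documents, while `docs` itself still contains all six.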

The [next chapter](preprocessing.ipynb) will show how to apply several text preprocessing functions to a corpus.