# Getting started

This is only a quick overview for getting started. Corpus loading, text preprocessing, etc. are explained in depth in the respective chapters.

## Loading a built-in text corpus

Once you have installed tmtoolkit, you can start by loading a built-in dataset. Let's import the [Corpus](api.rst#tmtoolkit-corpus) class first and see which datasets are available:

In [1]:
from tmtoolkit.corpus import Corpus

Corpus.builtin_corpora()

['english-NewsArticles', 'german-bt18_speeches_sample']

Let's load one of these corpora, the [News Articles dataset from Harvard Dataverse](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/GMFCTR):

In [2]:
corpus = Corpus.from_builtin_corpus('english-NewsArticles')
corpus

<Corpus [3824 documents]>

We can check which documents were loaded (showing only the first ten document labels):

In [3]:
corpus.doc_labels[:10]

['NewsArticles-1',
 'NewsArticles-2',
 'NewsArticles-3',
 'NewsArticles-4',
 'NewsArticles-5',
 'NewsArticles-6',
 'NewsArticles-7',
 'NewsArticles-8',
 'NewsArticles-9',
 'NewsArticles-10']

The first 100 characters from the document `NewsArticles-1`:

In [4]:
corpus['NewsArticles-1'][:100]

'Betsy DeVos Confirmed as Education Secretary, With Pence Casting Historic Tie-Breaking Vote\n\nMichiga'

The [Corpus](api.rst#tmtoolkit-corpus) class is for loading and managing *plain text* corpora, i.e. a set of documents with a label and their content as text strings. It resembles a [Python dictionary](https://docs.python.org/3/tutorial/datastructures.html#dictionaries). See [working with text corpora](text_corpora.ipynb) for more information.
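
To illustrate the dictionary-like access pattern without loading any data, here is a minimal sketch that uses a plain Python `dict` as a stand-in for a corpus. The name `toy_corpus` and its labels and texts are made up for this example; it only mirrors the label-to-text mapping described above and says nothing about how `Corpus` is implemented.

```python
# A plain dict standing in for a corpus: document label -> raw text.
# Labels and texts are invented for illustration only.
toy_corpus = {
    'doc-1': 'A first short document.',
    'doc-2': 'Another document, slightly longer than the first.',
}

labels = list(toy_corpus.keys())       # document labels, in insertion order
first_chars = toy_corpus['doc-1'][:7]  # slice the text of one document
n_docs = len(toy_corpus)               # number of documents
```

Looking up a document by its label and slicing the resulting string is the same pattern used with `corpus['NewsArticles-1'][:100]` above.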


## Tokenizing a corpus

For quantitative text analysis, you usually work with words in documents as units of interest. This means the plain text strings in the corpus' documents need to be split up into individual *tokens* (words, punctuation, etc.). For a quick start, we can do so by using [tokenize](api.rst#tmtoolkit.preprocess.tokenize):

In [5]:
from tmtoolkit.preprocess import tokenize

doc_labels = corpus.doc_labels   # save the document labels as list for later use

docs = tokenize(corpus.values())

The function `tokenize()` takes a sequence of text strings, tokenizes them and returns a list of tokenized documents:

In [6]:
type(docs)

list

Each document in `docs` is in turn a list of token strings (words, punctuation). Let's peek into the first document (index 0) and return the first ten tokens from it:

In [7]:
docs[0][:10]

['Betsy',
 'DeVos',
 'Confirmed',
 'as',
 'Education',
 'Secretary',
 ',',
 'With',
 'Pence',
 'Casting']
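
To get an intuition for what splitting a text into tokens means, the following sketch uses a simple regular expression from the standard library. The helper `naive_tokenize` is hypothetical and is not how `tokenize()` works internally; a real tokenizer handles many more cases (hyphenated words, abbreviations, etc.), so its results may differ.

```python
import re

def naive_tokenize(text):
    """Split text into runs of word characters and single punctuation marks.

    Hypothetical helper for illustration; not part of tmtoolkit.
    """
    return re.findall(r"\w+|[^\w\s]", text)

tokens = naive_tokenize('Betsy DeVos Confirmed as Education Secretary, With Pence')
```

Here, `tokens` is a flat list of strings in which the comma becomes a token of its own, similar in shape to a single document in `docs`.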

`docs` and `doc_labels` are aligned, i.e. the first element in `doc_labels` is the label of the first tokenized document in `docs`:

In [8]:
doc_labels[0]

'NewsArticles-1'
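
Because the two lists are aligned, they can be paired up again whenever you need to look up tokens by document label. The following sketch uses small made-up lists instead of the real corpus; the variable name `tokens_by_label` is only chosen for this example.

```python
# Made-up stand-ins for the aligned lists from above.
doc_labels = ['doc-1', 'doc-2']
docs = [['A', 'first', 'document', '.'], ['Another', 'one', '.']]

# Pair each label with the tokenized document at the same position.
tokens_by_label = dict(zip(doc_labels, docs))
```

Note that `zip()` silently stops at the shorter input, so this only gives a complete mapping when both lists have the same length.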

Tokenization is part of text preprocessing, which also includes several transformations that you can apply to the tokens (e.g. transform all to lower case). The [chapter on text preprocessing](preprocessing.ipynb) explains this in much more detail.
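
As a taste of such a transformation, the sketch below lower-cases all tokens using plain Python list comprehensions. It operates on a small made-up list of tokenized documents and deliberately avoids tmtoolkit's own preprocessing functions, which are covered in the preprocessing chapter.

```python
# Made-up tokenized documents: a list of lists of token strings.
docs = [['Betsy', 'DeVos', ','], ['With', 'Pence']]

# Lower-case every token in every document; the original list is left unchanged.
docs_lower = [[token.lower() for token in doc] for doc in docs]
```

The result keeps the same nested structure and document order, so it stays aligned with the list of document labels.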