# Getting started

This is only a quick overview for getting started. Corpus loading, text preprocessing, etc. are explained in depth in the respective chapters.

## Loading a built-in text corpus

Once you have installed tmtoolkit, you can start by loading a built-in dataset. Note that you must have installed tmtoolkit with the ``[recommended]`` or ``[textproc]`` option for this to work. See the [installation instructions](install.rst) for details.

Let's import the [`builtin_corpora_info`](api.rst#TODO) function first and have a look at which datasets are available:

In [1]:
from tmtoolkit.corpus import builtin_corpora_info

builtin_corpora_info()

['de-parlspeech-v2-sample-bundestag',
 'en-NewsArticles',
 'en-NewsArticles-sample100',
 'en-parlspeech-v2-sample-houseofcommons',
 'es-parlspeech-v2-sample-congreso',
 'nl-parlspeech-v2-sample-tweedekamer']
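The names in this list appear to start with a two-letter language code (the `en-` corpus loaded below is reported with language "en"). As a minimal sketch under that assumption, the list can be narrowed down to the corpora of one language with plain Python. The variable names `available` and `english` are only illustrative and not part of tmtoolkit:

```python
# Names copied from the output of builtin_corpora_info() shown above.
available = [
    'de-parlspeech-v2-sample-bundestag',
    'en-NewsArticles',
    'en-NewsArticles-sample100',
    'en-parlspeech-v2-sample-houseofcommons',
    'es-parlspeech-v2-sample-congreso',
    'nl-parlspeech-v2-sample-tweedekamer',
]

# Assumption: the part before the first hyphen is the language code.
english = [name for name in available if name.split('-', 1)[0] == 'en']
print(english)
```

In a live session, `available` could be set to the return value of `builtin_corpora_info()` instead of the hard-coded list.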

Let's load one of these corpora, a sample of 100 articles from the [News Articles dataset from Harvard Dataverse](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/GMFCTR). For this, we import the [`Corpus`](api.rst#TODO) class and use [`Corpus.from_builtin_corpus`](api.rst#TODO). The raw text data will then be processed by an [NLP pipeline](https://spacy.io/usage/spacy-101#pipelines) with [spaCy](https://spacy.io). That is, it will be tokenized and analyzed for the grammatical structure of each sentence and the linguistic attributes of each token, among other things. Since this step is computationally intensive, it takes quite some time for large text corpora (it can be sped up by enabling parallel processing as explained later).

In [2]:
from tmtoolkit.corpus import Corpus

corp = Corpus.from_builtin_corpus('en-NewsArticles-sample100')
corp

<Corpus [100 documents  / language "en"]>

We can have a look at which documents were loaded (showing only the first ten document labels):

In [4]:
corp.doc_labels[:10]

['NewsArticles-sample100-2338',
 'NewsArticles-sample100-3228',
 'NewsArticles-sample100-1253',
 'NewsArticles-sample100-1615',
 'NewsArticles-sample100-3334',
 'NewsArticles-sample100-92',
 'NewsArticles-sample100-869',
 'NewsArticles-sample100-3092',
 'NewsArticles-sample100-3088',
 'NewsArticles-sample100-1173']

We can now access each document in this corpus via its document label:

In [6]:
corp['NewsArticles-sample100-2338']

Document "NewsArticles-sample100-2338" (680 tokens, 9 token attributes, 2 document attributes)

By accessing the corpus in this way, we get a [`Document`](api.rst#TODO) object. We can query a document for its contents again using the square brackets syntax. Here, we access its tokens and show only the first ten:

In [12]:
corp['NewsArticles-sample100-2338']['token'][:10]

["'",
 'This',
 'Is',
 'Us',
 "'",
 'Makes',
 'Surprising',
 'Reveal',
 'About',
 'Jack']

Most of the time, you won't need to access the `Document` objects of a corpus directly. You would rather use functions that provide a convenient interface to a corpus' contents, e.g. the [`doc_tokens`](api.rst#TODO) function, which lets you retrieve all documents' tokens along with additional token attributes like Part-of-Speech (POS) tags, token lemmas, etc.

TODO: cont. here

## Tokenizing a corpus

For quantitative text analysis, you usually work with words in documents as units of interest. This means the plain text strings in the corpus' documents need to be split up into individual *tokens* (words, punctuation, etc.). For a quick start, we can do so by using [tokenize](api.rst#tmtoolkit.preprocess.tokenize) *after* we have specified the language that is used via [init_for_language](api.rst#tmtoolkit.preprocess.init_for_language).

In [5]:
from tmtoolkit.preprocess import init_for_language, tokenize

doc_labels = corpus.doc_labels   # save the document labels as list for later use

init_for_language('en')   # we use an English corpus
docs = tokenize(list(corpus.values()))

The function `tokenize()` takes a sequence of text strings, tokenizes them and returns a list of tokenized [spaCy documents](https://spacy.io/api/doc/):

In [6]:
type(docs)

list

In [7]:
type(docs[0])

spacy.tokens.doc.Doc

Each document in `docs` is in turn a spaCy `Doc` object, which behaves like a sequence of tokens (words, punctuation). Let's peek into the first document (index 0) and take the first ten tokens from it. Slicing a `Doc` returns a span, which is displayed as text:

In [8]:
docs[0][:10]

Betsy DeVos Confirmed as Education Secretary, With Pence Casting

`docs` and `doc_labels` are aligned, i.e. the first element in `doc_labels` is the label of the first tokenized document in `docs`:

In [9]:
doc_labels[0]

'NewsArticles-1'
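Because `docs` and `doc_labels` are aligned, they can be paired up position by position with `zip`. The following is a minimal sketch with made-up placeholder data standing in for the real lists. The name `docs_by_label`, the second label and the token lists are hypothetical and only serve as illustration:

```python
# Placeholder data standing in for the real `doc_labels` and `docs`;
# the second label and the token lists are made up for illustration.
doc_labels = ['NewsArticles-1', 'NewsArticles-2']
docs = [['Betsy', 'DeVos', 'Confirmed'], ['Another', 'headline']]

# Pair each label with the document at the same position.
docs_by_label = dict(zip(doc_labels, docs))

# Look up a document by its label instead of its list index.
print(docs_by_label['NewsArticles-1'])
```

A dictionary like this makes it possible to look up a tokenized document by its label rather than by its position in the list.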

Tokenization is part of text preprocessing, which also includes several transformations that you can apply to the tokens (e.g. transform all to lower case). The [chapter on text preprocessing](preprocessing.ipynb) explains this in much more detail. Next, we proceed with [working with text corpora](text_corpora.ipynb).