# Tutorial: Create your own NLP Pipeline

This tutorial uses the file ``dewiki_simple_one_line.txt`` (see the folder [../sample_data/](../sample_data)) as its source of input text. The file contains one text per line and was generated with the Consileon NLP Framework (the code is provided in the Python notebook convert_wiki_simple.ipynb). Each text is extracted from a German Wikipedia article, and its line breaks are removed so that each text is represented as a one-line text (or "whole sentence"). This makes the textual data easier to access. Apart from that, the texts are not processed any further, so that the tutorial can demonstrate some real NLP tasks.

(See [./convert_wiki_simple.ipynb](convert_wiki_simple.ipynb) for details.)

Now let's get started! First, we import the necessary package:

In [25]:
import consileon.nlp.pipeline as nlp

### Generation of an Input stream: 
Here we create an _input stream_ **`s`**. An input stream is an _iterator_ that provides one _text item_ in each iteration.

These text items are later "piped" into further processing steps.

In [8]:
s = nlp.LineSourceIterator("../sample_data/dewiki_simple_one_line.txt")
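
To make the iterator idea concrete, here is a minimal sketch of a line-based input stream in plain Python. The class name `SimpleLineSource` and the temporary sample file are hypothetical and only serve as illustration; this is not the implementation of `nlp.LineSourceIterator`, merely one way such a stream could behave.

```python
import os
import tempfile


class SimpleLineSource:
    """Hypothetical stand-in for a line-based input stream (not the Consileon class)."""

    def __init__(self, path, encoding="utf-8"):
        self._file = open(path, encoding=encoding)

    def __iter__(self):
        return self

    def __next__(self):
        line = self._file.readline()
        if not line:  # an empty string signals the end of the file
            self._file.close()
            raise StopIteration
        return line.rstrip("\n")


# Write a tiny two-line sample file so that the sketch runs on its own.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("Erster Text.\nZweiter Text.\n")

source = SimpleLineSource(tmp.name)
first = next(source)       # each call to next() advances the stream by one line
second = next(source)
remaining = list(source)   # the stream is exhausted after two items
os.remove(tmp.name)
```

Because the object is consumed as it is read, an item that has been fetched with `next()` is not delivered again.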

Get the first text provided by the input stream using the function **`next()`**.

Notice that each call to this function returns the _next_ text item in the sequence. Try it out by running the cell several times.

In [29]:
text = next(s)
print(text)

Ang Lee wurde 1954 in Taiwan geboren. Seine Eltern, Emigranten aus China, lernten sich in Taiwan kennen, Lee ist ihr ältester Sohn. Die Großeltern väterlicher- und mütterlicherseits sind im Zuge der kommunistischen Revolution in China ums Leben gekommen. Da sein Vater als Lehrer häufiger die Arbeitsstelle wechselte, wuchs Ang Lee in verschiedenen Städten Taiwans auf. Entgegen den Wünschen seiner Eltern, wie sein Vater eine klassische akademische Laufbahn einzuschlagen, interessierte sich Lee für das Schauspiel und absolvierte mit ihrem Einverständnis zunächst ein Theater- und Filmstudium in Taipeh. Im Anschluss daran ging er 1978 in die USA, um an der Universität von Illinois in Urbana-Champaign Theaterwissenschaft und -regie zu studieren. Nach dem Erwerb seines B.A. in Illinois verlegte er sich ganz auf das Studium der Film- und Theaterproduktion an der Universität von New York, das er 1985 mit einem Master abschloss. Danach entschloss er sich, mit seiner ebenfalls aus Taiwan stammend

### Tokenization, Lemmatization and Lower-casing:

For NLP algorithms to process textual data, the text needs to be split into tokens. This pre-processing step is called tokenization. <br> In the following code we create a tokenizer **`t`** with the help of the function **`TokenizeText()`** and apply it to the example string **`sample_text`** to see what this pre-processing step is all about:

In [10]:
t = nlp.TokenizeText()
sample_text = "Der Frühling läßt sein blaues Band wieder flattern durch die Lüfte. Süße, wohlbekannte Düfte streifen ahnungsvoll das Land."

print( t(sample_text) )

['Der', 'Frühling', 'läßt', 'sein', 'blaues', 'Band', 'wieder', 'flattern', 'durch', 'die', 'Lüfte.', 'Süße', ',', 'wohlbekannte', 'Düfte', 'streifen', 'ahnungsvoll', 'das', 'Land', '.']


Notice that punctuation characters generally become tokens of their own, although in this output the period after "Lüfte" stays attached to the word.
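
As a rough illustration of what a tokenizer does, the following sketch splits a string with a regular expression. The function `simple_tokenize` is hypothetical and is not the algorithm behind `TokenizeText()`; unlike the output above, it separates every punctuation character, including a period that directly follows a word.

```python
import re

# One token per run of word characters, or per single character that is
# neither a word character nor whitespace (i.e. punctuation).
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def simple_tokenize(text):
    """Hypothetical regex tokenizer, for illustration only."""
    return TOKEN_PATTERN.findall(text)


tokens = simple_tokenize("Süße, wohlbekannte Düfte streifen ahnungsvoll das Land.")
print(tokens)
```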

In languages that frequently use conjugation and declension (e.g. German), it is often necessary to reduce the number of different tokens via lemmatization. Lemmatization reduces words to their basic (dictionary) form, e.g. "Lüfte" to "Luft" or "written" to "write".

In contrast to tokenization, lemmatization depends on the natural language. In the Consileon framework, the default language is currently German.

To apply tokenization and lemmatization at once we use the function **`LemmaTokenizeText()`**:

In [12]:
l = nlp.LemmaTokenizeText()

print( l(sample_text) )

['der', 'Frühling', 'lassen', 'mein', 'blau', 'Band', 'wieder', 'flattern', 'durch', 'der', 'Luft', '.', 'süßen', ',', 'wohlbekannte', 'Duft', 'streifen', 'ahnungsvoll', 'der', 'Land', '.']


As seen above, lemmatization is not perfect and, in particular, not context-sensitive: for example, "sein" is mapped to "mein" and "Süße" to "süßen".
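
One way to picture a lemmatizer that ignores context is a plain dictionary lookup, sketched below. The table `LEMMA_TABLE` and the function `lookup_lemmatize` are hypothetical toy constructs; nothing here is taken from `LemmaTokenizeText()`, whose internals are not described in this tutorial.

```python
# Toy lookup table; a real lemmatizer relies on a large dictionary or a model.
LEMMA_TABLE = {
    "läßt": "lassen",
    "blaues": "blau",
    "Lüfte": "Luft",
    "Düfte": "Duft",
}


def lookup_lemmatize(tokens):
    """Hypothetical lemmatizer: replace each token by its table entry, if any."""
    return [LEMMA_TABLE.get(token, token) for token in tokens]


lemmas = lookup_lemmatize(["Der", "Frühling", "läßt", "sein", "blaues", "Band"])
print(lemmas)
```

Since each token is looked up in isolation, such an approach cannot use the neighbouring words to choose between several possible base forms.
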
<br>
Tokenizers and lemmatizers are examples of modifiers. A modifier takes one input object and transforms it into an output object whose type may differ from that of the input. In the case seen above, the input object is of type string and the output object is of type list.

Another modifier is **`Lower()`** which transforms the tokens within a list to lower case:

In [13]:
lc = nlp.Lower()

lc(["Hallo Welt"])

['hallo welt']

### Composition of multiple modifiers via the operator * :

Modifiers can be composed (as in mathematics) using the operator `*`, so that `lc * t` applies `t` first and then `lc`. In the following code, the tokenizer **`t`** and the modifier **`lc`** are composed and applied to the string **`sample_text`**:

In [15]:
m1 = lc * t
print( m1(sample_text), (lc * t)(sample_text), (lc * nlp.LemmaTokenizeText())("Der Ball ist rund.") )

['der', 'frühling', 'läßt', 'sein', 'blaues', 'band', 'wieder', 'flattern', 'durch', 'die', 'lüfte.', 'süße', ',', 'wohlbekannte', 'düfte', 'streifen', 'ahnungsvoll', 'das', 'land', '.'] ['der', 'frühling', 'läßt', 'sein', 'blaues', 'band', 'wieder', 'flattern', 'durch', 'die', 'lüfte.', 'süße', ',', 'wohlbekannte', 'düfte', 'streifen', 'ahnungsvoll', 'das', 'land', '.'] ['der', 'ball', 'sein', 'rund', '.']
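
The sketch below shows how composition with `*` could be implemented through Python's `__mul__` method. The class `Modifier` and the helpers `split` and `lower` are hypothetical names chosen for this illustration; the actual Consileon classes may be built differently.

```python
class Modifier:
    """Hypothetical base class: wraps a function and supports composition with *."""

    def __init__(self, func):
        self.func = func

    def __call__(self, value):
        return self.func(value)

    def __mul__(self, other):
        # (f * g)(x) == f(g(x)): the right operand runs first, as in mathematics.
        return Modifier(lambda value: self(other(value)))


split = Modifier(str.split)                                      # str -> list of str
lower = Modifier(lambda tokens: [tok.lower() for tok in tokens])  # list -> list

lower_after_split = lower * split
result = lower_after_split("Der Ball ist rund.")
print(result)
```

The order of the operands matters: here `split` must run first because `lower` expects a list of tokens.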


One very important feature of modifiers is that they can easily be applied element-wise to an input stream. In this way, you can create a text pre-processing pipeline that is fed with the input texts one at a time. This is done using the operator `**`. Now let's see how the modifier composition **`m1`** handles the input stream **`s`**:

In [18]:
p = m1 ** s
tkns = next(p)

print( tkns )

['alternative', 'schreibweisen', 'sind', 'unter', 'anderem', 'die', 'ursprungsvariante', '``', 'allen', 'smithee', "''", 'sowie', '``', 'alan', 'smythee', "''", 'und', '``', 'adam', 'smithee', "''", '.', 'auch', 'zwei', 'teilweise', 'asiatisch', 'anmutende', 'schreibweisen', '``', 'alan', 'smi', 'thee', "''", 'und', '``', 'sumishii', 'aran', "''", 'gehören', '–', 'so', 'die', 'internet', 'movie', 'database', '–', 'dazu.', 'das', 'pseudonym', 'entstand', '1968', 'infolge', 'der', 'arbeiten', 'am', 'western-film', '``', 'death', 'of', 'a', 'gunfighter', "''", 'deutscher', 'titel', '``', 'frank', 'patch', '–', 'deine', 'stunden', 'sind', 'gezählt', "''", '.', 'regisseur', 'robert', 'totten', 'und', 'hauptdarsteller', 'richard', 'widmark', 'gerieten', 'in', 'einen', 'streit', ',', 'woraufhin', 'don', 'siegel', 'als', 'neuer', 'regisseur', 'eingesetzt', 'wurde.', 'der', 'film', 'trug', 'nach', 'abschluss', 'der', 'arbeiten', 'noch', 'deutlich', 'tottens', 'handschrift', ',', 'der', 'auch', 

With the same results as before, the preprocessing pipeline **`p`** can be implemented as follows: 

In [None]:
p = nlp.Lower() ** nlp.TokenizeText() ** s
tkns = next(p)
print( tkns )

The former structure applies the modifier **`m1 = lc * t`** to the elements of the stream **`s`**. In the latter structure, applying **`nlp.TokenizeText()`** to the elements of **`s`** results in a **new stream**. The pipeline `p` is then obtained by applying the modifier **`nlp.Lower()`** to the items of the stream **`nlp.TokenizeText() ** s`**.
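
The chained form relies on the fact that Python's `**` operator is right-associative, so `a ** b ** c` is evaluated as `a ** (b ** c)`. The following sketch shows how element-wise application could be implemented with `__pow__`; the class `Modifier` and the sample texts are hypothetical and are not taken from the Consileon code.

```python
class Modifier:
    """Hypothetical modifier supporting element-wise stream application with **."""

    def __init__(self, func):
        self.func = func

    def __call__(self, value):
        return self.func(value)

    def __pow__(self, stream):
        # Lazily apply this modifier to each item; nothing runs until next() is called.
        return map(self, stream)


split = Modifier(str.split)
lower = Modifier(lambda tokens: [tok.lower() for tok in tokens])

texts = iter(["Der Ball ist rund.", "Das Spiel dauert 90 Minuten."])

# Right-associativity: this is evaluated as lower ** (split ** texts).
pipeline = lower ** split ** texts
first = next(pipeline)
second = next(pipeline)
print(first, second)
```

In this sketch each `**` returns a new lazy stream, so a text is tokenized and lower-cased only when it is requested from the pipeline.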