# Tutorial: Create your own NLP Pipeline

For this tutorial, the file ``dewiki_simple_one_line.txt`` (see folder [../sample_data/](../sample_data)) is used as the source of input text data. The file contains one text per line and was generated with the Consileon NLP Framework (the code is provided in the Python notebook convert_wiki_simple.ipynb). Each text is extracted from a German Wikipedia article, and line breaks are removed so that each text is represented as a one-line text ("whole sentences"). This makes the textual data easier to access. Apart from that, the texts are not processed any further, so that the tutorial can demonstrate some real NLP tasks.

(See [./convert_wiki_simple.ipynb](convert_wiki_simple.ipynb) for details.)

Now let's get started! First, we import the necessary package:

In [10]:
import consileon.nlp.pipeline as nlp

### Generation of an input stream:
Here we create an input stream **``s``** from a text file.

Within the Consileon NLP Framework, a "stream" is a _Python iterator_
that will provide a text or other object for each iteration.

In [2]:
s = nlp.LineSourceIterator("../sample_data/dewiki_simple_one_line.txt")

Get the first text provided by the input stream using the function **`next()`**. <br>
Note that each call to this function returns the next text in the sequence.

Try it out by running the following cell several times.

In [3]:
text = next(s)
print(text)

Alternative Schreibweisen sind unter anderem die Ursprungsvariante ''Allen Smithee'' sowie ''Alan Smythee'' und ''Adam Smithee''. Auch zwei teilweise asiatisch anmutende Schreibweisen ''Alan Smi Thee'' und ''Sumishii Aran'' gehören – so die Internet Movie Database – dazu. Das Pseudonym entstand 1968 infolge der Arbeiten am Western-Film ''Death of a Gunfighter'' (deutscher Titel ''Frank Patch – Deine Stunden sind gezählt''). Regisseur Robert Totten und Hauptdarsteller Richard Widmark gerieten in einen Streit, woraufhin Don Siegel als neuer Regisseur eingesetzt wurde. Der Film trug nach Abschluss der Arbeiten noch deutlich Tottens Handschrift, der auch mehr Drehtage als Siegel daran gearbeitet hatte, weshalb dieser die Nennung seines Namens als Regisseur ablehnte. Totten selbst lehnte aber ebenfalls ab. Als Lösung wurde  ''Allen Smithee'' als ein möglichst einzigartiger Name gewählt (bei der späteren Variante ''Alan Smithee'' war das Anagramm ''The Alias Men'' vermutlich kein Entstehungs

To _reset_ the stream (i.e. to start it from the beginning), call:

In [None]:
s.__iter__()
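
To make the idea of a resettable stream concrete, here is a minimal sketch in plain Python. The class `ListSource` is a hypothetical stand-in written for illustration and is not part of the Consileon NLP Framework; it assumes only what is described above, namely that the stream returns one text per `next()` call and that calling `__iter__()` restarts it from the beginning.

```python
class ListSource:
    """Hypothetical stand-in for a line-based stream (not the framework's class)."""

    def __init__(self, texts):
        self._texts = list(texts)
        self._pos = 0

    def __iter__(self):
        # Assumption: (re)starting iteration rewinds to the first text.
        self._pos = 0
        return self

    def __next__(self):
        if self._pos >= len(self._texts):
            raise StopIteration
        text = self._texts[self._pos]
        self._pos += 1
        return text


demo = ListSource(["first text", "second text", "third text"])
first = next(demo)         # "first text"
second = next(demo)        # "second text"
iter(demo)                 # same as demo.__iter__(): rewind the stream
after_reset = next(demo)   # "first text" again
print(first, second, after_reset, sep=" | ")
```

Because a `for` loop also calls `__iter__()`, looping over such an object would always start from the first text under this assumption.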

### Tokenization, Lemmatization and Lower-casing:

In order for NLP algorithms to process textual data, the text needs to be split into tokens.
This pre-processing step is called tokenization. <br>
In the following code we create a tokenizer **`t`** by calling **`TokenizeText()`**
and apply it to the sample string **`sample_text`** to see what this pre-processing step does:

In [4]:
t = nlp.TokenizeText()
sample_text = "Der Frühling läßt sein blaues Band wieder flattern durch die Lüfte. Süße, wohlbekannte Düfte streifen ahnungsvoll das Land."

print( t(sample_text) )

['Der', 'Frühling', 'läßt', 'sein', 'blaues', 'Band', 'wieder', 'flattern', 'durch', 'die', 'Lüfte.', 'Süße', ',', 'wohlbekannte', 'Düfte', 'streifen', 'ahnungsvoll', 'das', 'Land', '.']


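Notice that punctuation characters mostly become tokens of their own: in the output above, the comma and the final period are separate tokens, whereas the period in 'Lüfte.' stays attached to the word.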
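
As an illustration of what a tokenizer does, the following sketch splits a string into word and punctuation tokens using a regular expression. This is a simplified assumption for illustration only: `simple_tokenize` is a hypothetical helper, and the framework's `TokenizeText()` may follow different rules and produce different tokens.

```python
import re


def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    # \w+ matches words (including umlauts and ß in Python 3);
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)


tokens = simple_tokenize("Süße, wohlbekannte Düfte streifen das Land.")
print(tokens)
# ['Süße', ',', 'wohlbekannte', 'Düfte', 'streifen', 'das', 'Land', '.']
```

Unlike the output shown above, this sketch always separates a period from the preceding word, and it would also split hyphenated words at the hyphen.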

In languages that make frequent use of conjugation and declension (e.g. German), it is often necessary to reduce
the number of different tokens via _lemmatization_. Lemmatization reduces words to their _basic/dictionary form_,
e.g. "Lüfte" to "Luft" or "written" to "write".

Unlike tokenization, lemmatization depends on the natural language.
In the Consileon NLP Framework, the default language is currently German.
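
The idea of lemmatization can be sketched with a small lookup table. The dictionary `TOY_LEMMAS` and the function `toy_lemmatize` are hypothetical and cover only a few words from the sample sentence; real lemmatizers rely on much larger language-specific resources, which is why this step depends on the language.

```python
# Hypothetical toy dictionary: inflected form -> dictionary form.
TOY_LEMMAS = {
    "Lüfte": "Luft",
    "Düfte": "Duft",
    "läßt": "lassen",
    "blaues": "blau",
}


def toy_lemmatize(tokens):
    """Replace each token by its dictionary form if known, else keep it unchanged."""
    return [TOY_LEMMAS.get(tok, tok) for tok in tokens]


lemmas = toy_lemmatize(["Frühling", "läßt", "blaues", "Band", "Lüfte"])
print(lemmas)  # ['Frühling', 'lassen', 'blau', 'Band', 'Luft']
```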

To apply tokenization and lemmatization at once we use the function **`LemmaTokenizeText()`**:

In [5]:
l = nlp.LemmaTokenizeText()

print( l(sample_text) )

['der', 'Frühling', 'lassen', 'mein', 'blau', 'Band', 'wieder', 'flattern', 'durch', 'der', 'Luft', '.', 'süßen', ',', 'wohlbekannte', 'Duft', 'streifen', 'ahnungsvoll', 'der', 'Land', '.']


As seen above, lemmatization is not perfect; in particular, it is not context-sensitive (for example, the adjective "Süße" is mapped to "süßen", and "wohlbekannte" is left unchanged).
<br>
Tokenizers and lemmatizers are examples of _modifiers_.
A modifier takes one input object and transforms it into an output object whose type may differ from that of the input. In the cases seen above, the input object is of type string and the output object is of type list.

Another modifier is **`Lower()`**, which transforms the tokens within a list to lower case:

In [6]:
lc = nlp.Lower()

lc(["Hallo Welt"])

['hallo welt']
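
Conceptually, a modifier is a callable object that maps one input to one output. The sketch below shows this with a hypothetical class `LowerTokens`; it is an assumption about the general shape of a modifier and not the framework's implementation of `Lower()`.

```python
class LowerTokens:
    """Hypothetical modifier: list of strings in, list of lower-cased strings out."""

    def __call__(self, tokens):
        return [tok.lower() for tok in tokens]


lower_tokens = LowerTokens()
result = lower_tokens(["Hallo Welt", "Der", "Frühling"])
print(result)  # ['hallo welt', 'der', 'frühling']
```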

### Composition of multiple modifiers via the operator * :

Modifiers can be composed, like functions in mathematics, using the operator `*`: `(lc * t)(x)` first applies `t` and then applies `lc` to the result. In the following code, the tokenizer **`t`** and the modifier **`lc`** are composed and applied to the string **`sample_text`**:

In [7]:
m1 = lc * t
print( m1(sample_text), (lc * t)(sample_text), (lc * nlp.LemmaTokenizeText())("Der Ball ist rund.") )

['der', 'frühling', 'läßt', 'sein', 'blaues', 'band', 'wieder', 'flattern', 'durch', 'die', 'lüfte.', 'süße', ',', 'wohlbekannte', 'düfte', 'streifen', 'ahnungsvoll', 'das', 'land', '.'] ['der', 'frühling', 'läßt', 'sein', 'blaues', 'band', 'wieder', 'flattern', 'durch', 'die', 'lüfte.', 'süße', ',', 'wohlbekannte', 'düfte', 'streifen', 'ahnungsvoll', 'das', 'land', '.'] ['der', 'ball', 'sein', 'rund', '.']
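
One way such a composition operator could be realized in Python is by overloading `__mul__`. The class `Mod` below is a hypothetical sketch, not the framework's code; it wraps an ordinary function and defines `*` so that `(f * g)(x)` equals `f(g(x))`.

```python
class Mod:
    """Hypothetical modifier wrapper supporting composition with `*`."""

    def __init__(self, func):
        self.func = func

    def __call__(self, x):
        return self.func(x)

    def __mul__(self, other):
        # (self * other)(x) == self(other(x)): `other` runs first.
        return Mod(lambda x: self(other(x)))


split = Mod(str.split)                               # string -> list of tokens
lower = Mod(lambda toks: [t.lower() for t in toks])  # list -> list

composed = lower * split
out = composed("Der Ball ist rund")
print(out)  # ['der', 'ball', 'ist', 'rund']
```

The order matters: `split * lower` would fail in this sketch, because `lower` expects a list of tokens rather than a string.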


One very important feature of modifiers is that they can easily be applied element-wise to an input stream. In this way, you can build a text pre-processing pipeline that processes the input texts one at a time. This is done using the operator `**`. Now let's see how the composed modifier **`m1`** handles the input stream **`s`**:

In [8]:
p = m1 ** s
tkns = next(p)

print( tkns )

['alternative', 'schreibweisen', 'sind', 'unter', 'anderem', 'die', 'ursprungsvariante', '``', 'allen', 'smithee', "''", 'sowie', '``', 'alan', 'smythee', "''", 'und', '``', 'adam', 'smithee', "''", '.', 'auch', 'zwei', 'teilweise', 'asiatisch', 'anmutende', 'schreibweisen', '``', 'alan', 'smi', 'thee', "''", 'und', '``', 'sumishii', 'aran', "''", 'gehören', '–', 'so', 'die', 'internet', 'movie', 'database', '–', 'dazu.', 'das', 'pseudonym', 'entstand', '1968', 'infolge', 'der', 'arbeiten', 'am', 'western-film', '``', 'death', 'of', 'a', 'gunfighter', "''", 'deutscher', 'titel', '``', 'frank', 'patch', '–', 'deine', 'stunden', 'sind', 'gezählt', "''", '.', 'regisseur', 'robert', 'totten', 'und', 'hauptdarsteller', 'richard', 'widmark', 'gerieten', 'in', 'einen', 'streit', ',', 'woraufhin', 'don', 'siegel', 'als', 'neuer', 'regisseur', 'eingesetzt', 'wurde.', 'der', 'film', 'trug', 'nach', 'abschluss', 'der', 'arbeiten', 'noch', 'deutlich', 'tottens', 'handschrift', ',', 'der', 'auch', 

The pre-processing pipeline **`p`** can also be implemented as follows, with the same results as before:

In [9]:
p = nlp.Lower() ** nlp.TokenizeText() ** s
tkns = next(p)
print( tkns )

['alternative', 'schreibweisen', 'sind', 'unter', 'anderem', 'die', 'ursprungsvariante', '``', 'allen', 'smithee', "''", 'sowie', '``', 'alan', 'smythee', "''", 'und', '``', 'adam', 'smithee', "''", '.', 'auch', 'zwei', 'teilweise', 'asiatisch', 'anmutende', 'schreibweisen', '``', 'alan', 'smi', 'thee', "''", 'und', '``', 'sumishii', 'aran', "''", 'gehören', '–', 'so', 'die', 'internet', 'movie', 'database', '–', 'dazu.', 'das', 'pseudonym', 'entstand', '1968', 'infolge', 'der', 'arbeiten', 'am', 'western-film', '``', 'death', 'of', 'a', 'gunfighter', "''", 'deutscher', 'titel', '``', 'frank', 'patch', '–', 'deine', 'stunden', 'sind', 'gezählt', "''", '.', 'regisseur', 'robert', 'totten', 'und', 'hauptdarsteller', 'richard', 'widmark', 'gerieten', 'in', 'einen', 'streit', ',', 'woraufhin', 'don', 'siegel', 'als', 'neuer', 'regisseur', 'eingesetzt', 'wurde.', 'der', 'film', 'trug', 'nach', 'abschluss', 'der', 'arbeiten', 'noch', 'deutlich', 'tottens', 'handschrift', ',', 'der', 'auch', 

The former structure applies the modifier **`m1 = lc * t`** to the elements of the stream **`s`**.
In the latter structure, applying **`nlp.TokenizeText()`** to the elements of **`s`** results in a **new stream**. The pipeline **`p`** is then obtained by applying the modifier **`nlp.Lower()`** to the items of the stream **`nlp.TokenizeText() ** s`**.
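
The equivalence of the two structures can be sketched by overloading `__pow__` for element-wise application. Python evaluates `**` from right to left, so `a ** b ** s` is parsed as `a ** (b ** s)`, which matches the description above. The class `Mod` is a hypothetical stand-in and not the framework's code; the use of `map` for lazy application is likewise an assumption.

```python
class Mod:
    """Hypothetical modifier supporting `*` (composition) and `**` (apply to a stream)."""

    def __init__(self, func):
        self.func = func

    def __call__(self, x):
        return self.func(x)

    def __mul__(self, other):
        # Composition: `other` runs first, then `self`.
        return Mod(lambda x: self(other(x)))

    def __pow__(self, stream):
        # Lazily apply this modifier to every element of the stream.
        return map(self, stream)


split = Mod(str.split)
lower = Mod(lambda toks: [t.lower() for t in toks])
texts = ["Der Ball ist rund", "Das Spiel dauert 90 Minuten"]

composed_first = list((lower * split) ** iter(texts))
chained = list(lower ** split ** iter(texts))  # parsed as lower ** (split ** ...)
print(composed_first == chained)  # True
```

In the chained form, `split ** iter(texts)` already yields a new stream of token lists, and `lower` is then applied to each item of that stream.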
