# Create your first NLP pipeline
For documentation of the framework, see the [Project Readme](../../README.md).

## The first Pipeline

Necessary imports:

In [1]:
import consileon.nlp.pipeline as nlp


### The Source
In this case, the source is an input stream (an "iterator") of texts taken from a sample input file
that contains one text per line.

These texts were prepared from the German Wikipedia using the Consileon NLP framework. Only "whole sentences"
have been taken from Wikipedia articles, and line breaks have been removed for ease of access to the subject
(a Python notebook which did exactly this can be found here: [convert_wiki_simple.ipynb](./convert_wiki_simple.ipynb)).

Apart from that, the texts have been left "as is" to show some real-world NLP tasks.

**Generation of the source stream:**

In [2]:
s = nlp.LineSourceIterator("../sample_data/dewiki_simple_one_line.txt")
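
For intuition, a line-based source can be pictured as a plain Python generator that yields one line at a time. The following is a minimal sketch under that assumption, not the implementation of ``nlp.LineSourceIterator``; the function name ``line_source`` and the temporary demo file are hypothetical.

```python
import os
import tempfile
from typing import Iterator


def line_source(path: str, encoding: str = "utf-8") -> Iterator[str]:
    """Yield one text per line, lazily, without loading the whole file."""
    with open(path, encoding=encoding) as handle:
        for line in handle:
            text = line.rstrip("\n")
            if text:  # skip empty lines
                yield text


# Demo with a small temporary file standing in for the sample data.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("Erster Text.\nZweiter Text.\n")
    demo_path = tmp.name

source = line_source(demo_path)
first_text = next(source)   # each call to next() advances the stream by one line
second_text = next(source)
source.close()              # closing the generator releases the file handle
os.remove(demo_path)
```

Because the stream is consumed lazily, a large input file never has to fit into memory at once.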

**Get first text of source stream:**

(To get the following texts, simply execute the cell again.)

In [3]:
text = next(s)
text

"Alternative Schreibweisen sind unter anderem die Ursprungsvariante ''Allen Smithee'' sowie ''Alan Smythee'' und ''Adam Smithee''. Auch zwei teilweise asiatisch anmutende Schreibweisen ''Alan Smi Thee'' und ''Sumishii Aran'' gehören – so die Internet Movie Database – dazu. Das Pseudonym entstand 1968 infolge der Arbeiten am Western-Film ''Death of a Gunfighter'' (deutscher Titel ''Frank Patch – Deine Stunden sind gezählt''). Regisseur Robert Totten und Hauptdarsteller Richard Widmark gerieten in einen Streit, woraufhin Don Siegel als neuer Regisseur eingesetzt wurde. Der Film trug nach Abschluss der Arbeiten noch deutlich Tottens Handschrift, der auch mehr Drehtage als Siegel daran gearbeitet hatte, weshalb dieser die Nennung seines Namens als Regisseur ablehnte. Totten selbst lehnte aber ebenfalls ab. Als Lösung wurde  ''Allen Smithee'' als ein möglichst einzigartiger Name gewählt (bei der späteren Variante ''Alan Smithee'' war das Anagramm ''The Alias Men'' vermutlich kein Entstehung

## Tokenization, lemmatization, lower case

For many NLP algorithms, the text has to be split into tokens.
The tokenizer ``t`` does this. It can be applied to a (string) text to see how it works.

In [6]:
t = nlp.TokenizeText()
sample_text = "Der Frühling läßt sein blaues Band wieder flattern durch die Lüfte. Süße, wohlbekannte Düfte streifen ahnungsvoll das Land."
t(sample_text)

['Der',
 'Frühling',
 'läßt',
 'sein',
 'blaues',
 'Band',
 'wieder',
 'flattern',
 'durch',
 'die',
 'Lüfte.',
 'Süße',
 ',',
 'wohlbekannte',
 'Düfte',
 'streifen',
 'ahnungsvoll',
 'das',
 'Land',
 '.']

Note that punctuation characters generally become separate tokens. In the output above, the period after "Lüfte" is an exception: it stays attached to the word.
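
To make the idea of "one token per punctuation character" concrete, here is a minimal regex-based sketch. It is a stand-in for illustration only and does not reproduce ``nlp.TokenizeText``; the names ``TOKEN_PATTERN`` and ``simple_tokenize`` are hypothetical.

```python
import re
from typing import List

# \w+ matches runs of letters and digits (Unicode-aware for str patterns in
# Python 3, so umlauts and ß are covered); [^\w\s] matches any single
# character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def simple_tokenize(text: str) -> List[str]:
    """Split text into word tokens and single-character punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


tokens = simple_tokenize("Süße, wohlbekannte Düfte streifen das Land.")
print(tokens)
```

Unlike the output shown above, this sketch splits off every period, and it would also break a hyphenated word such as "Western-Film" into three tokens.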

In languages that make heavy use of conjugation and declension, as German does, it is often necessary to reduce the
number of distinct tokens by lemmatization. Lemmatization reduces words to their base form, e.g. "Lüfte" to
"Luft".

Unlike tokenization, lemmatization depends on the natural language. For Consileon, the default language (currently)
is German.

In [12]:
l = nlp.LemmaTokenizeText()
l(sample_text)

['der',
 'Frühling',
 'lassen',
 'mein',
 'blau',
 'Band',
 'wieder',
 'flattern',
 'durch',
 'der',
 'Luft',
 '.',
 'süßen',
 ',',
 'wohlbekannte',
 'Duft',
 'streifen',
 'ahnungsvoll',
 'der',
 'Land',
 '.']

(As you can see, lemmatization is not perfect here; in particular, it is not context-sensitive. For example, "sein" is mapped to "mein" and "Süße" to "süßen".)
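
One simple way to picture a context-insensitive lemmatizer is a per-token table lookup. The sketch below assumes exactly that; it is not how ``nlp.LemmaTokenizeText`` is known to work, and the table ``LEMMAS`` and the function ``lookup_lemmatize`` are hypothetical.

```python
from typing import Dict, List

# Toy lemma table (hypothetical, for illustration only).
LEMMAS: Dict[str, str] = {
    "läßt": "lassen",
    "blaues": "blau",
    "Lüfte": "Luft",
    "Düfte": "Duft",
    "die": "der",
    "das": "der",
}


def lookup_lemmatize(tokens: List[str]) -> List[str]:
    """Replace each token by its table entry; unknown tokens pass through."""
    return [LEMMAS.get(token, token) for token in tokens]


lemmas = lookup_lemmatize(["die", "Lüfte", "und", "Düfte"])
print(lemmas)
```

Since each token is looked up in isolation, such an approach cannot use the surrounding words to pick between several possible base forms.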

Tokenizers and lemmatizers are examples of _modifiers_. A modifier takes one object and transforms it into exactly one
object. The types of the input and output objects may differ, as in this case, where the input object is of type ``string``
and the output object is of type ``list``.

Another modifier is ``Lower`` which transforms the _tokens_ within a list to lower case:

In [13]:
lc = nlp.Lower()
lc(["Hallo Welt"])

['hallo welt']

Modifiers can be _composed_ (as in mathematics) using the operator ``*``:

In [19]:
m1 = lc * t
m1(sample_text), (lc * t)(sample_text), (lc * nlp.LemmaTokenizeText())("Der Ball ist rund.")

(['der',
  'frühling',
  'läßt',
  'sein',
  'blaues',
  'band',
  'wieder',
  'flattern',
  'durch',
  'die',
  'lüfte.',
  'süße',
  ',',
  'wohlbekannte',
  'düfte',
  'streifen',
  'ahnungsvoll',
  'das',
  'land',
  '.'],
 ['der',
  'frühling',
  'läßt',
  'sein',
  'blaues',
  'band',
  'wieder',
  'flattern',
  'durch',
  'die',
  'lüfte.',
  'süße',
  ',',
  'wohlbekannte',
  'düfte',
  'streifen',
  'ahnungsvoll',
  'das',
  'land',
  '.'],
 ['der', 'ball', 'sein', 'rund', '.'])
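
The composition ``lc * t`` behaves like function composition: the right operand runs first, so ``(lc * t)(x)`` equals ``lc(t(x))``. A minimal sketch of how such an operator could be built with Python's ``__mul__`` hook follows; the class ``Modifier`` here is a hypothetical stand-in, not the framework's class.

```python
from typing import Any, Callable


class Modifier:
    """Wraps a one-argument function; `a * b` builds the modifier x -> a(b(x))."""

    def __init__(self, func: Callable[[Any], Any]) -> None:
        self.func = func

    def __call__(self, value: Any) -> Any:
        return self.func(value)

    def __mul__(self, other: "Modifier") -> "Modifier":
        # The right operand runs first, as in (f ∘ g)(x) = f(g(x)).
        return Modifier(lambda value: self(other(value)))


tokenize = Modifier(lambda text: text.split())                      # str -> list
lower = Modifier(lambda tokens: [tok.lower() for tok in tokens])    # list -> list

composed = lower * tokenize
result = composed("Der Ball ist rund")
print(result)
```

The order matters: ``tokenize * lower`` would fail in this sketch, because ``lower`` expects a list of tokens, not a string.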

A very important feature is that modifiers can easily be _applied element-wise to an input stream_ using the operator ``**``.
This gives you your first "pipeline". You can then iterate through the input and see how your modifiers handle it:

In [20]:
p = m1 ** s
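
Element-wise application can be thought of as a lazy ``map`` over the stream. The sketch below assumes that behaviour and uses a hypothetical stand-in class ``Modifier`` with a ``__pow__`` hook; it is not the framework's implementation.

```python
from typing import Any, Callable, Iterable, Iterator


class Modifier:
    """Minimal stand-in: `m ** stream` lazily applies m to every stream item."""

    def __init__(self, func: Callable[[Any], Any]) -> None:
        self.func = func

    def __call__(self, value: Any) -> Any:
        return self.func(value)

    def __pow__(self, stream: Iterable[Any]) -> Iterator[Any]:
        # map() is lazy: an item is only transformed when next() asks for it.
        return map(self, stream)


to_lower_tokens = Modifier(lambda text: text.lower().split())
texts = iter(["Erster Text", "Zweiter Text"])

pipeline = to_lower_tokens ** texts
first = next(pipeline)    # consumes exactly one item from `texts`
remaining = list(texts)   # the second text is still unread at this point
```

Nothing is processed until ``next`` is called, which is why a pipeline over a large file can be built instantly.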

In [22]:
tkns = next(p)
tkns

['ang',
 'lee',
 'wurde',
 '1954',
 'in',
 'taiwan',
 'geboren.',
 'seine',
 'eltern',
 ',',
 'emigranten',
 'aus',
 'china',
 ',',
 'lernten',
 'sich',
 'in',
 'taiwan',
 'kennen',
 ',',
 'lee',
 'ist',
 'ihr',
 'ältester',
 'sohn.',
 'die',
 'großeltern',
 'väterlicher-',
 'und',
 'mütterlicherseits',
 'sind',
 'im',
 'zuge',
 'der',
 'kommunistischen',
 'revolution',
 'in',
 'china',
 'ums',
 'leben',
 'gekommen.',
 'da',
 'sein',
 'vater',
 'als',
 'lehrer',
 'häufiger',
 'die',
 'arbeitsstelle',
 'wechselte',
 ',',
 'wuchs',
 'ang',
 'lee',
 'in',
 'verschiedenen',
 'städten',
 'taiwans',
 'auf.',
 'entgegen',
 'den',
 'wünschen',
 'seiner',
 'eltern',
 ',',
 'wie',
 'sein',
 'vater',
 'eine',
 'klassische',
 'akademische',
 'laufbahn',
 'einzuschlagen',
 ',',
 'interessierte',
 'sich',
 'lee',
 'für',
 'das',
 'schauspiel',
 'und',
 'absolvierte',
 'mit',
 'ihrem',
 'einverständnis',
 'zunächst',
 'ein',
 'theater-',
 'und',
 'filmstudium',
 'in',
 'taipeh.',
 'im',
 'anschluss',

You can achieve the same result in the following way, but _structurally_ it is different:

In the variant above, ``lc * t`` is a single modifier which is applied to the elements of the stream ``s``.
In the variant below, a new stream ``p_1 = nlp.TokenizeText() ** s`` is first created by applying ``nlp.TokenizeText()`` to the items (or elements)
of ``s``. The stream ``p`` then results from ``p_1`` by applying the modifier ``nlp.Lower()`` to its items.

In [23]:
p = nlp.Lower() ** nlp.TokenizeText() ** s
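
No parentheses are needed in the chained form because Python's ``**`` operator groups from right to left, so the expression is read as ``nlp.Lower() ** (nlp.TokenizeText() ** s)``. The following sketch compares both structures using a hypothetical stand-in class ``Modifier``; it illustrates the operator mechanics only and makes no claim about the framework's internals.

```python
class Modifier:
    """Stand-in supporting composition (*) and element-wise application (**)."""

    def __init__(self, func):
        self.func = func

    def __call__(self, value):
        return self.func(value)

    def __mul__(self, other):
        # Composition: the right operand runs first.
        return Modifier(lambda value: self(other(value)))

    def __pow__(self, stream):
        # Element-wise, lazy application to a stream.
        return map(self, stream)


tokenize = Modifier(lambda text: text.split())
lower = Modifier(lambda tokens: [tok.lower() for tok in tokens])
texts = ["Ang Lee", "Alan Smithee"]

# Variant 1: one composed modifier applied to one stream.
composed_stream = (lower * tokenize) ** iter(texts)

# Variant 2: two chained streams; ** is right-associative, so this is
# lower ** (tokenize ** iter(texts)).
chained_stream = lower ** tokenize ** iter(texts)

composed_result = list(composed_stream)
chained_result = list(chained_stream)
```

Both variants yield the same items; they differ only in whether the intermediate token stream exists as a separate object.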

In [24]:
tkns = next(p)
tkns


['alternative',
 'schreibweisen',
 'sind',
 'unter',
 'anderem',
 'die',
 'ursprungsvariante',
 '``',
 'allen',
 'smithee',
 "''",
 'sowie',
 '``',
 'alan',
 'smythee',
 "''",
 'und',
 '``',
 'adam',
 'smithee',
 "''",
 '.',
 'auch',
 'zwei',
 'teilweise',
 'asiatisch',
 'anmutende',
 'schreibweisen',
 '``',
 'alan',
 'smi',
 'thee',
 "''",
 'und',
 '``',
 'sumishii',
 'aran',
 "''",
 'gehören',
 '–',
 'so',
 'die',
 'internet',
 'movie',
 'database',
 '–',
 'dazu.',
 'das',
 'pseudonym',
 'entstand',
 '1968',
 'infolge',
 'der',
 'arbeiten',
 'am',
 'western-film',
 '``',
 'death',
 'of',
 'a',
 'gunfighter',
 "''",
 'deutscher',
 'titel',
 '``',
 'frank',
 'patch',
 '–',
 'deine',
 'stunden',
 'sind',
 'gezählt',
 "''",
 '.',
 'regisseur',
 'robert',
 'totten',
 'und',
 'hauptdarsteller',
 'richard',
 'widmark',
 'gerieten',
 'in',
 'einen',
 'streit',
 ',',
 'woraufhin',
 'don',
 'siegel',
 'als',
 'neuer',
 'regisseur',
 'eingesetzt',
 'wurde.',
 'der',
 'film',
 'trug',
 'nach',
 