# Gertrude Stein on NLP

1. **prepositions** are usually wrong
2. **articles** are delicate and varied items
3. **adjectives** are not interesting
4. **nouns** are not interesting
5. **verbs** are in motion
6. & **adverbs** move with them
7. **pronouns** are moving in a very large space of possibility
8. **names** do not
9. **upper and lower case spelling** is fun to play with 
10. **question marks** are uninteresting
11. **exclamation marks and inverted commas** are unnecessary and ugly
12. **commas** are useless
13. the **dot** leads the text to its own life

### Tokenization
#### 1. Set variables (strings)
Let us now read a poem by Gertrude Stein: chapter XII of [»Before the Flowers of Friendship Faded Friendship Faded«](https://www.poetrynook.com/poem/flowers-friendship-faded-friendship-faded).



<pre>
I am very hungry when I drink
I need to leave it when I have it held,
They will be white with which they know they see, that darker makes it be a color white for me, white is not shown when I am dark indeed with red despair who comes who has to care that they will let me a little lie like now I like to lie I like to live I like to die I like to lie and live and die and live and die and by and by I like to live and die and by and by they need to sew, the difference is that sewing makes it bleed and such with them in all the way of seed and seeding and repine and they will which is mine and not all mine who can be thought curious of this of all of that made it and come lead it and done weigh it and mourn and sit upon it know it for ripeness without deserting all of it of which without which it has not been born. Oh no not to be thirsty with the thirst of hunger not alone to know that they plainly and ate or wishes. Any little one will kill himself for milk.
</pre>

We read the poem into the machine as a string and output it again *in the same way* with the `print` function:

In [9]:
XII = """I am very hungry when I drink
I need to leave it when I have it held,
They will be white with which they know they see, that darker makes it be a color white for me, white is not shown when I am dark indeed with red despair who comes who has to care that they will let me a little lie like now I like to lie I like to live I like to die I like to lie and live and die and live and die and by and by I like to live and die and by and by they need to sew, the difference is that sewing makes it bleed and such with them in all the way of seed and seeding and repine and they will which is mine and not all mine who can be thought curious of this of all of that made it and come lead it and done weigh it and mourn and sit upon it know it for ripeness without deserting all of it of which without which it has not been born. Oh no not to be thirsty with the thirst of hunger not alone to know that they plainly and ate or wishes. Any little one will kill himself for milk."""
print(XII)

I am very hungry when I drink
I need to leave it when I have it held,
They will be white with which they know they see, that darker makes it be a color white for me, white is not shown when I am dark indeed with red despair who comes who has to care that they will let me a little lie like now I like to lie I like to live I like to die I like to lie and live and die and live and die and by and by I like to live and die and by and by they need to sew, the difference is that sewing makes it bleed and such with them in all the way of seed and seeding and repine and they will which is mine and not all mine who can be thought curious of this of all of that made it and come lead it and done weigh it and mourn and sit upon it know it for ripeness without deserting all of it of which without which it has not been born. Oh no not to be thirsty with the thirst of hunger not alone to know that they plainly and ate or wishes. Any little one will kill himself for milk.
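
Why the output looks exactly like the input can be checked on a shorter string. The following is a minimal sketch using only the first two lines of the poem; the variable names `excerpt` and `lines` are chosen for illustration and do not appear elsewhere in this notebook.

```python
# A triple-quoted string keeps the line break between the two verses.
excerpt = """I am very hungry when I drink
I need to leave it when I have it held,"""

# repr() makes the stored line break visible as \n.
print(repr(excerpt))

# print() writes the string as it is, so the line break is reproduced.
print(excerpt)

# splitlines() recovers the individual lines as a list.
lines = excerpt.splitlines()
print(len(lines))
```

The string itself is never changed by `print`; all later steps work on copies derived from it.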


#### 2. Import necessary libraries
Next we import a Python library called `NLTK` (Natural Language Toolkit), which was created especially for NLP purposes, and download the data packages we will work with:

In [10]:
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('tagsets')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package stopwords to /home/whoami/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/whoami/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/whoami/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package tagsets to /home/whoami/nltk_data...
[nltk_data]   Package tagsets is already up-to-date!
[nltk_data] Downloading package punkt to /home/whoami/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /home/whoami/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


True

#### 3. Tokenisation & punctuation
From NLTK we now import the word tokenizer `word_tokenize()`, an algorithm that, put simply, breaks the text down into individual strings, so-called *tokens*, which represent words and punctuation marks.

Then we filter out all tokens that are not alphabetic. In our case, that is all stand-alone punctuation.

Python strings have the method `isalpha()`, which can be used for that.

In [11]:
from nltk.tokenize import word_tokenize

tokenized = word_tokenize(XII)
# remove all tokens that are not alphabetic
tokenized_word = [word for word in tokenized if word.isalpha()]
print(tokenized_word)

['I', 'am', 'very', 'hungry', 'when', 'I', 'drink', 'I', 'need', 'to', 'leave', 'it', 'when', 'I', 'have', 'it', 'held', 'They', 'will', 'be', 'white', 'with', 'which', 'they', 'know', 'they', 'see', 'that', 'darker', 'makes', 'it', 'be', 'a', 'color', 'white', 'for', 'me', 'white', 'is', 'not', 'shown', 'when', 'I', 'am', 'dark', 'indeed', 'with', 'red', 'despair', 'who', 'comes', 'who', 'has', 'to', 'care', 'that', 'they', 'will', 'let', 'me', 'a', 'little', 'lie', 'like', 'now', 'I', 'like', 'to', 'lie', 'I', 'like', 'to', 'live', 'I', 'like', 'to', 'die', 'I', 'like', 'to', 'lie', 'and', 'live', 'and', 'die', 'and', 'live', 'and', 'die', 'and', 'by', 'and', 'by', 'I', 'like', 'to', 'live', 'and', 'die', 'and', 'by', 'and', 'by', 'they', 'need', 'to', 'sew', 'the', 'difference', 'is', 'that', 'sewing', 'makes', 'it', 'bleed', 'and', 'such', 'with', 'them', 'in', 'all', 'the', 'way', 'of', 'seed', 'and', 'seeding', 'and', 'repine', 'and', 'they', 'will', 'which', 'is', 'mine', 'and',
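
To see the filtering step on a smaller scale, here is a minimal sketch that uses only the standard library. The regular expression is a rough stand-in for `word_tokenize()`, assumed purely for illustration; it is not NLTK's actual algorithm. It is applied to the second line of the poem.

```python
import re

line = "I need to leave it when I have it held,"

# Rough stand-in for word_tokenize(): runs of word characters,
# or single punctuation marks as separate tokens.
tokens = re.findall(r"\w+|[^\w\s]", line)
print(tokens)

# str.isalpha() is True only if every character is a letter,
# so the stand-alone comma is dropped.
words = [t for t in tokens if t.isalpha()]
print(words)

# Caveat: tokens that mix letters with other characters are dropped too.
print("1925".isalpha(), "o'clock".isalpha())
```

The last line shows the price of this simple filter: numbers and words containing an apostrophe would disappear as well, which does not matter for this poem but may matter for other texts.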

#### 4. Upper and lower case

Now we convert all words to a single case, namely *lower case*.

This reduces the vocabulary, but some distinctions are lost as well (a commonly used example is “Apple” the company vs. “apple” the fruit).

To do this, we call the string method `lower()` on each word.


In [12]:
# convert to lower case
## note to self: if not everything is lower-cased, the "I", for example, is still in there after lemmatizing and stemming...
tokenized_word = [w.lower() for w in tokenized_word]
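
How lower-casing shrinks the vocabulary can be sketched with ten tokens from the third line of the poem. This is an illustration only; `vocab_cased` and `vocab_lower` are names introduced here.

```python
tokens = ["They", "will", "be", "white", "with", "which",
          "they", "know", "they", "see"]

# The vocabulary is the set of distinct token types.
vocab_cased = set(tokens)
vocab_lower = {t.lower() for t in tokens}

# "They" and "they" count as two types before lower-casing, as one after.
print(len(vocab_cased), sorted(vocab_cased))
print(len(vocab_lower), sorted(vocab_lower))
```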

#### 5. Stop words
**Stop words** are regarded as noise in the text, for example words like *is, am, are, this, a, an, the*.

These words are among the most common words in English texts. Intuitively, it seems strange to count words like "the" and "and" among the "most common" words of a poem, because they are presumably common across all texts, not just this text in particular.

To remove stop words with NLTK, we first create a list of stop words (commonly occurring English words that should not be counted for the purpose of word frequency) and then filter the tokens on this list out of the text.

1. Create a list of stop words:

In [13]:
from nltk.corpus import stopwords
stop_words=set(stopwords.words("english"))
print(stop_words)

{'both', 'are', 'yourselves', 'to', 'then', 'wasn', 'you', 'those', 'shouldn', 'that', 'out', "should've", 'mustn', 'your', 'in', 'itself', 'these', 'am', 'only', 'yours', 'which', 'all', 're', 'just', "mustn't", 'mightn', 'few', 'an', 'have', 'again', 'can', 'ain', 'we', "weren't", 'now', 'before', "wouldn't", 'it', "don't", 'but', 'm', 't', 'me', 'was', 'were', "didn't", 'most', 'own', 'should', 'themselves', 'than', 'shan', 'no', 'won', "hasn't", 'whom', 'here', 'hadn', 'such', "shouldn't", 've', 'what', "hadn't", 'some', "needn't", 'himself', 'doesn', 'below', 'about', 'his', 'y', 'has', 'as', 'further', "shan't", 'off', 'our', 'how', 'through', "that'll", 'my', 'once', 'ma', 'did', 'other', "wasn't", 'into', 'above', "couldn't", 'at', "won't", 'had', "it's", 'its', "you've", 'doing', 'will', 'hasn', 'haven', 'too', 'so', "she's", 'a', "doesn't", 'hers', 'against', 'who', 'any', 'of', 'by', 'because', 'during', 'down', "you'd", 'does', 'over', 'couldn', 'they', 'why', 'aren', 'be',

2. Filter out the stop words:

In [14]:
filtered_sent=[]

for w in tokenized_word:
    if w not in stop_words:
        filtered_sent.append(w)

print("Tokenized Sentence:", tokenized_word, "\n")
print("Filtered Sentence:", filtered_sent)

Tokenized Sentence: ['i', 'am', 'very', 'hungry', 'when', 'i', 'drink', 'i', 'need', 'to', 'leave', 'it', 'when', 'i', 'have', 'it', 'held', 'they', 'will', 'be', 'white', 'with', 'which', 'they', 'know', 'they', 'see', 'that', 'darker', 'makes', 'it', 'be', 'a', 'color', 'white', 'for', 'me', 'white', 'is', 'not', 'shown', 'when', 'i', 'am', 'dark', 'indeed', 'with', 'red', 'despair', 'who', 'comes', 'who', 'has', 'to', 'care', 'that', 'they', 'will', 'let', 'me', 'a', 'little', 'lie', 'like', 'now', 'i', 'like', 'to', 'lie', 'i', 'like', 'to', 'live', 'i', 'like', 'to', 'die', 'i', 'like', 'to', 'lie', 'and', 'live', 'and', 'die', 'and', 'live', 'and', 'die', 'and', 'by', 'and', 'by', 'i', 'like', 'to', 'live', 'and', 'die', 'and', 'by', 'and', 'by', 'they', 'need', 'to', 'sew', 'the', 'difference', 'is', 'that', 'sewing', 'makes', 'it', 'bleed', 'and', 'such', 'with', 'them', 'in', 'all', 'the', 'way', 'of', 'seed', 'and', 'seeding', 'and', 'repine', 'and', 'they', 'will', 'which', 
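
The effect of stop words on a frequency count can be sketched with the standard library alone. The excerpt below is taken from the lower-cased poem; `toy_stop_words` is a hand-picked subset assumed for illustration, not NLTK's full English list.

```python
from collections import Counter

# Excerpt of the lower-cased tokens from the poem.
tokens = ("i like to lie i like to live i like to die i like to lie "
          "and live and die and live and die and by and by").split()

# Hand-picked subset for illustration only; NLTK's list is much longer.
toy_stop_words = {"i", "to", "and", "by"}

before = Counter(tokens)
after = Counter(t for t in tokens if t not in toy_stop_words)

# Without filtering, a function word tops the count.
print(before.most_common(1))

# After filtering, only the content words remain.
print(after.most_common())
```

Before filtering, *and* is the most frequent token of the excerpt; afterwards the count is led by *like*, which says more about this particular text.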

#### 6. Lemmatization
Another important method of text preparation, and one of the most common reduction methods in NLP, is *lemmatization*. It reduces words to their base form, the linguistically correct *lemma*. The word *better*, for example, has *good* as its lemma. Unlike *stemming* (a cruder process of linguistic normalisation that only strips affixes), lemmatization draws on a vocabulary and on the word class. Note that `WordNetLemmatizer.lemmatize()` treats every word as a noun unless a `pos` argument is passed, which is why verb forms such as *held* or *made* remain unchanged in the output of the following cell:

In [15]:
from nltk.stem.wordnet import WordNetLemmatizer

#ps = PorterStemmer()
lem = WordNetLemmatizer()
lemmatized_words=[]

for w in filtered_sent:
    lemmatized_words.append(lem.lemmatize(w))

#print("Filtered Sentence:",filtered_sent)
print("Lemmatized Sentence:",lemmatized_words)

Lemmatized Sentence: ['hungry', 'drink', 'need', 'leave', 'held', 'white', 'know', 'see', 'darker', 'make', 'color', 'white', 'white', 'shown', 'dark', 'indeed', 'red', 'despair', 'come', 'care', 'let', 'little', 'lie', 'like', 'like', 'lie', 'like', 'live', 'like', 'die', 'like', 'lie', 'live', 'die', 'live', 'die', 'like', 'live', 'die', 'need', 'sew', 'difference', 'sewing', 'make', 'bleed', 'way', 'seed', 'seeding', 'repine', 'mine', 'mine', 'thought', 'curious', 'made', 'come', 'lead', 'done', 'weigh', 'mourn', 'sit', 'upon', 'know', 'ripeness', 'without', 'deserting', 'without', 'born', 'oh', 'thirsty', 'thirst', 'hunger', 'alone', 'know', 'plainly', 'ate', 'wish', 'little', 'one', 'kill', 'milk']
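
The difference between stemming and lemmatization can be sketched with two toy functions. Both are deliberate simplifications assumed for illustration: `toy_stem` is not the Porter stemmer, and `TOY_LEMMAS` is a hypothetical four-entry dictionary standing in for WordNet.

```python
# Hypothetical mini-dictionary: surface form -> lemma.
TOY_LEMMAS = {"better": "good", "made": "make", "makes": "make", "held": "hold"}

def toy_stem(word):
    """Strip a few common suffixes without consulting any dictionary."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up; fall back to the word itself if it is unknown."""
    return TOY_LEMMAS.get(word, word)

for w in ["seeding", "makes", "made", "better"]:
    print(w, "-> stem:", toy_stem(w), "| lemma:", toy_lemmatize(w))
```

The stemmer can return strings that are not words (*makes* becomes *mak*) and cannot relate *better* to *good*, whereas the dictionary lookup returns valid base forms but only for words it knows.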


#### 7. POS tagging
This is now our text, which we can feed into further NLP algorithms or use to create, for example, *word embeddings* that represent meaning in the form of vectors.

Let us recall Gertrude Stein's descriptions of verbs, adjectives, nouns etc. at the beginning. This grammatical classification is also needed in NLP. For our example text, we determine the grammatical class of each word using *part-of-speech (POS) tagging*, a method which looks at relationships within the sentence and assigns each word a corresponding *tag*. In the common [Penn Treebank tag set](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html), for example, *VB* stands for the base form of a verb and *NN* for a singular noun.
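
To relate the fine-grained Penn Treebank tags back to Stein's coarser word classes, the tags can be grouped by their prefix. The following is a minimal sketch; the (token, tag) pairs are copied from the tagger output shown further down in this section, while `COARSE` and `coarse_class` are names introduced here for illustration.

```python
from collections import defaultdict

# A few (token, tag) pairs taken from the tagger output for the poem.
tagged = [("am", "VBP"), ("very", "RB"), ("hungry", "JJ"), ("drink", "VBP"),
          ("leave", "VB"), ("it", "PRP"), ("color", "NN"), ("indeed", "RB")]

# Coarse word classes keyed by tag prefix (an illustrative grouping).
COARSE = {"VB": "verb", "NN": "noun", "JJ": "adjective",
          "RB": "adverb", "PRP": "pronoun"}

def coarse_class(tag):
    """Map a fine-grained tag such as VBP or NNS to a coarse class."""
    for prefix, name in COARSE.items():
        if tag.startswith(prefix):
            return name
    return "other"

# Collect the tokens of each coarse class in order of appearance.
groups = defaultdict(list)
for token, tag in tagged:
    groups[coarse_class(tag)].append(token)

for name, members in groups.items():
    print(name, members)
```

Grouped this way, the verbs (*am*, *drink*, *leave*) and the adverbs that "move with them" (*very*, *indeed*) can be picked out of the text separately from nouns and adjectives.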

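We pass the lower-cased *tokens* from step 4, i.e. before stop-word removal, to the POS tagger. In the output, the lower-cased *i* ends up tagged as a noun (*NN*, *NNS*) rather than as a personal pronoun (*PRP*), presumably because the tagger no longer recognises it without the capital letter: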

In [16]:
nltk.pos_tag(tokenized_word)

[('i', 'NN'),
 ('am', 'VBP'),
 ('very', 'RB'),
 ('hungry', 'JJ'),
 ('when', 'WRB'),
 ('i', 'NN'),
 ('drink', 'VBP'),
 ('i', 'NNS'),
 ('need', 'VBP'),
 ('to', 'TO'),
 ('leave', 'VB'),
 ('it', 'PRP'),
 ('when', 'WRB'),
 ('i', 'NN'),
 ('have', 'VBP'),
 ('it', 'PRP'),
 ('held', 'VBD'),
 ('they', 'PRP'),
 ('will', 'MD'),
 ('be', 'VB'),
 ('white', 'JJ'),
 ('with', 'IN'),
 ('which', 'WDT'),
 ('they', 'PRP'),
 ('know', 'VBP'),
 ('they', 'PRP'),
 ('see', 'VBP'),
 ('that', 'IN'),
 ('darker', 'NN'),
 ('makes', 'VBZ'),
 ('it', 'PRP'),
 ('be', 'VB'),
 ('a', 'DT'),
 ('color', 'NN'),
 ('white', 'JJ'),
 ('for', 'IN'),
 ('me', 'PRP'),
 ('white', 'JJ'),
 ('is', 'VBZ'),
 ('not', 'RB'),
 ('shown', 'VBN'),
 ('when', 'WRB'),
 ('i', 'NN'),
 ('am', 'VBP'),
 ('dark', 'JJ'),
 ('indeed', 'RB'),
 ('with', 'IN'),
 ('red', 'JJ'),
 ('despair', 'NN'),
 ('who', 'WP'),
 ('comes', 'VBZ'),
 ('who', 'WP'),
 ('has', 'VBZ'),
 ('to', 'TO'),
 ('care', 'VB'),
 ('that', 'IN'),
 ('they', 'PRP'),
 ('will', 'MD'),
 ('let', 'VB'),
