# NLP for Beginners using NLTK and spaCy

NLTK is a powerful Python package that provides a diverse set of natural language processing algorithms. It is free, open source, easy to use, well documented, and has a large community. NLTK covers the most common NLP tasks, such as tokenizing, part-of-speech tagging, stemming, sentiment analysis, topic segmentation, and named entity recognition. It helps the computer analyse, preprocess, and understand written text.

You can install it by running the following command:

In [None]:
!pip install nltk

In [None]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

## 1. Tokenizing

When we deal with text, we need to break it down into smaller pieces for analysis. This is
where tokenization comes into the picture. It is the process of dividing the input text into a
set of pieces like words or sentences. These pieces are called tokens.

In [None]:
from nltk.tokenize import sent_tokenize, word_tokenize

# define input text
input_text = "Do you know how tokenization works? It's actually quite interesting! Let's analyze a couple of sentences and figure it out."

# sentence tokenizer
print("\nSentence tokenizer:")
print(sent_tokenize(input_text))

# word tokenizer
print("\nWord tokenizer:")
print(word_tokenize(input_text))
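
To get a feel for what a tokenizer does under the hood, here is a minimal sketch that uses only regular expressions. The function names `simple_sentence_split` and `simple_word_split` are made up for this illustration, and the rules are far cruder than NLTK's: for example, this sketch keeps *It's* as one token, whereas `word_tokenize` splits it into *It* and *'s*.

```python
import re

def simple_sentence_split(text):
    # Split after '.', '!' or '?' when followed by whitespace.
    # Naive on purpose: abbreviations such as "Dr." would be split too.
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def simple_word_split(text):
    # A token is either a word (optionally with an internal apostrophe)
    # or a single punctuation character.
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

sample = "Do you know how tokenization works? It's actually quite interesting!"
print(simple_sentence_split(sample))
print(simple_word_split(sample))
```

For real text, the NLTK tokenizers above are the better choice, because they handle abbreviations, contractions, and other edge cases.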

### Stopwords

Stopwords are common words that are considered noise in the text, such as *is, am, are, this, a, an, the,* etc.

To remove them, you first need a list of stopwords. NLTK provides one for English, which can be loaded as follows:

In [None]:
# create stopwords
from nltk.corpus import stopwords
stop_words=set(stopwords.words("english"))
print(stop_words)

### Print all Dutch stopwords - Exercise

Write a little program that prints all Dutch stopwords.

In [None]:
# print stopwords in dutch


### Remove stopwords and punctuation - Exercise

Now write a function `words()` that takes a string as input and returns all the words in that string, lowercased, without stopwords and without punctuation. For the `input_text` above, the output should be:

```
Words without stopwords:  ['know', 'tokenization', 'works', 'actually', 'quite', 'interesting', 'let', 'analyze', 'couple', 'sentences', 'figure']
```

In [None]:
from nltk.tokenize import RegexpTokenizer

def words(input_text):
    # \w+ keeps runs of word characters, which drops punctuation
    tokenizer = RegexpTokenizer(r'\w+')
    output = []
    for word in tokenizer.tokenize(input_text):
        # compare in lowercase, because the stopword list is lowercase
        if word.lower() not in stop_words:
            output.append(word.lower())
    return output

print("Words without stopwords: ", words(input_text))

## 2. Stemming

When working with text, we have to deal with different forms of the same word. For example, the word *sing* can appear in many forms such as *sang, singer, singing, sings,* and so on. When we analyze text, it's useful to reduce the different forms of a word to a common base form. This enables us to extract useful statistics from the input text.

Stemming is one way to achieve this. It is basically a process that cuts off the ends of words to extract their base forms. Let's see how to do it using NLTK.

In [None]:
from nltk.stem.porter import PorterStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem.snowball import SnowballStemmer

input_words = ['writing', 'connections', 'connected', 'connecting', 'horse', 'randomize', 'possibly', 'provision', 'hospital', 'kept', 'scratchy', 'calves']

# create various stemmer objects
porter = PorterStemmer()
lancaster = LancasterStemmer()
snowball = SnowballStemmer('english')

# create a list of stemmer names for display
stemmer_names = ['PORTER', 'SNOWBALL', 'LANCASTER']
formatted_text = '{:>16}' * (len(stemmer_names) + 1)
print('\n', formatted_text.format('INPUT WORD', *stemmer_names), '\n', '='*68)

# stem each word and display the output
for word in input_words:
    output = [word, porter.stem(word), snowball.stem(word), lancaster.stem(word)]
    print(formatted_text.format(*output))

The difference between the three stemmers above is the level of strictness that's used to arrive at the base form. The Porter stemmer is the least strict ("possibly" becomes "possibl") and Lancaster is the strictest ("possibly" becomes "poss").

Note that the result might not be an actual word. All three stemmers reduced "calves" to "calv", which is not a real word.

On the other hand, all three stemmers reduced "connections", "connected", and "connecting" to the correct common word "connect".
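
To see why cutting off word endings can produce non-words like "calv", here is a deliberately naive suffix stripper. It is only a sketch: `naive_stem` and its suffix list are invented for this illustration and are much simpler than the rules of the Porter, Snowball, or Lancaster algorithms.

```python
def naive_stem(word, suffixes=("ions", "ing", "ion", "ed", "es", "s")):
    # Strip the first matching suffix, as long as at least 3 letters remain.
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["connections", "connected", "connecting", "calves", "horse"]:
    print(f"{w:>12} -> {naive_stem(w)}")
```

The sketch knows nothing about vocabulary, so it cannot tell that "calves" belongs to "calf". It simply removes "es".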

## 3. Lemmatization - Exercise

Lemmatization is another way of reducing words to their base form. The lemmatization process uses a vocabulary and morphological analysis of words. It obtains the base form by removing inflectional endings such as *-ing* or *-ed*. This
base form of a word is known as a lemma. If you lemmatize the word "calves", you
should get "calf" as the output. One thing to note is that the output depends on whether the word is treated as a verb or a noun.

Before using lemmatization, we have to download WordNet, a large lexical database of English.

In [None]:
import nltk
nltk.download('wordnet')


Now write a little program to lemmatize the same `input_words` as above. Use the `lemmatize()` method of the `WordNetLemmatizer` class. This method takes two parameters: the word to be lemmatized and the part of speech to assume (`pos='n'` for a noun lemma, `pos='v'` for a verb lemma). The output should look something like this:

```
               INPUT WORD         NOUN LEMMATIZER         VERB LEMMATIZER 
 ===========================================================================
                 writing                 writing                   write
             connections              connection             connections
               connected               connected                 connect
```

In [None]:
from nltk.stem import WordNetLemmatizer

input_words = ['writing', 'connections', 'connected', 'connecting', 'horse', 'randomize', 'possibly', 'provision', 'hospital', 'kept', 'scratchy', 'calves']

# Create lemmatizer object
lemmatizer = WordNetLemmatizer()

# Create a list of lemmatizer names for display
lemmatizer_names = ['NOUN LEMMATIZER', 'VERB LEMMATIZER']
formatted_text = '{:>24}' * (len(lemmatizer_names) + 1)
print('\n', formatted_text.format('INPUT WORD', *lemmatizer_names), '\n', '='*75)

# Lemmatize each word and display the output
for word in input_words:
    output = [word, lemmatizer.lemmatize(word, pos='n'), lemmatizer.lemmatize(word, pos='v')]
    print(formatted_text.format(*output))

We can see that the noun lemmatizer works differently than the verb lemmatizer when it
comes to words like writing or calves. If you compare these outputs to stemmer outputs, you
will see that there are differences too. The lemmatizer outputs are all meaningful whereas
stemmer outputs may or may not be meaningful.
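
The idea of a vocabulary lookup that depends on the part of speech can be sketched with a plain dictionary. Everything here is hypothetical: `TOY_LEXICON` and `toy_lemmatize` are made-up names, and the handful of entries stands in for the much larger WordNet database that `WordNetLemmatizer` consults.

```python
# A tiny hand-written lexicon keyed by (word, part of speech).
TOY_LEXICON = {
    ("calves", "n"): "calf",
    ("writing", "v"): "write",
    ("kept", "v"): "keep",
    ("connections", "n"): "connection",
}

def toy_lemmatize(word, pos="n"):
    # Look the word up for the given part of speech;
    # unknown combinations are returned unchanged.
    return TOY_LEXICON.get((word, pos), word)

for w in ["writing", "connections", "calves"]:
    print(f"{w:>12} {toy_lemmatize(w, 'n'):>12} {toy_lemmatize(w, 'v'):>12}")
```

Because the result comes from a lookup rather than from chopping off characters, every output is a real word, and the same input can yield different lemmas for different parts of speech.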

## 4. POS Tagging

The goal of Part-of-Speech (POS) Tagging is to identify the grammatical category of a given word (noun, pronoun, adjective, verb, adverb, etc.) based on its context. POS Tagging looks for relationships within the sentence and assigns a corresponding tag to each word.

We will use spaCy, a different Python library for NLP, because it gives better results than NLTK for POS Tagging and Named Entity Recognition. First, install spaCy and download the English and Dutch models.

In [None]:
!pip install spacy

In [None]:
!python -m spacy download en_core_web_sm
!python -m spacy download nl_core_news_sm

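The cell above downloads the English and Dutch models; restart the kernel afterwards so that spaCy can find them. The commented-out command below is an alternative that installs a specific release of the English model directly. Note that a model release must be compatible with your installed spaCy version.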

In [None]:
#pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.0/en_core_web_sm-2.2.0.tar.gz

Import the core spaCy English model and create a spaCy document that we will be using to perform Part-of-Speech tagging.

In [None]:
import spacy
sp = spacy.load('en_core_web_sm')

sen = sp("I like to play football. I hated it in my childhood though.")

The spaCy document object has several attributes that can be used to perform a variety of tasks. For instance, the `text` attribute returns the text of the document.

In [None]:
print(sen.text)

Similarly, the `pos_` attribute of a token returns its coarse-grained POS tag, and the `tag_` attribute returns its fine-grained tag. Finally, to get an explanation of a tag, pass the tag name to the `spacy.explain()` function.

In [None]:
print(sen[7])
print(sen[7].pos_)
print(spacy.explain(sen[7].tag_))

We can print all the POS tags. To improve readability, the text is padded to a width of 12 characters and the POS tag to a width of 10 characters, so that the columns line up.

In [None]:
for word in sen:
    print(f'{word.text:{12}} {word.pos_:{10}} {spacy.explain(word.tag_)}')
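
The `{12}` and `{10}` inside the f-string are nested fields that set the column width. The following small sketch, which uses plain strings instead of spaCy tokens, shows that `:{12}` behaves the same as `:12` and what happens when the text is longer than the width. The variable names are only for illustration.

```python
word, tag = "football", "NOUN"

# The nested {12} is evaluated first, so this is the same as :12
row_nested = f'{word:{12}} {tag:{10}}|'
row_plain = f'{word:12} {tag:10}|'
print(row_nested)
print(row_nested == row_plain)

# Text longer than the width is not truncated, so the next column shifts right
print(f'{"broeikasgassen":12} ADP')
```

Strings are left-aligned by default, so the padding is added on the right of each value.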

Another cool thing about spaCy is that you can use the displaCy dependency visualizer to show Part-of-Speech tags and syntactic dependencies. Try visualizing some other sentences as well.

In [None]:
import spacy
from spacy import displacy

sp = spacy.load("en_core_web_sm")
sen = sp("I like to play football. I hated it in my childhood though.")
displacy.render(sen, style="dep", jupyter=True)

### POS Tagging in Dutch - Exercise

POS Tagging can be done in Dutch as well. You will probably have to install the Dutch model.

Use this sentence as input: "De concentratie broeikasgassen die bijdragen aan de verandering van het klimaat, heeft opnieuw een recordhoogte bereikt." The output should be as follows:

```
De           DET        Art|bep|zijdofmv|neut__Definite=Def|PronType=Art
concentratie NOUN       N|soort|ev|neut__Number=Sing
broeikasgassen ADP        Prep|voor__AdpType=Prep
die          PRON       Pron|aanw|neut|attr__PronType=Dem
bijdragen    NOUN       N|soort|mv|neut__Number=Plur
aan          ADP        Prep|voor__AdpType=Prep
de           DET        Art|bep|zijdofmv|neut__Definite=Def|PronType=Art
verandering  NOUN       N|soort|ev|neut__Number=Sing
van          ADP        Prep|voor__AdpType=Prep
het          DET        Art|bep|onzijd|neut__Definite=Def|Gender=Neut|PronType=Art
klimaat      NOUN       N|soort|ev|neut__Number=Sing
,            PUNCT      Punc|komma__PunctType=Comm
heeft        VERB       V|hulp|ott|3|ev__Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
opnieuw      ADV        Adv|gew|geenfunc|stell|onverv__Degree=Pos
een          DET        Art|onbep|zijdofonzijd|neut__Definite=Ind|Number=Sing|PronType=Art
recordhoogte NOUN       N|soort|ev|neut__Number=Sing
bereikt      VERB       V|trans|verldw|onverv__Subcat=Tran|Tense=Past|VerbForm=Part
.            PUNCT      Punc|punt__PunctType=Peri

```

`spacy.explain()` does not provide descriptions for the Dutch tags, so just print `tag_` in the third column.

In [None]:
#pip install https://github.com/explosion/spacy-models/releases/download/nl_core_news_sm-2.2.0/nl_core_news_sm-2.2.0.tar.gz

In [None]:
import spacy
sp = spacy.load('nl_core_news_sm')

sen = sp('De concentratie broeikasgassen die bijdragen aan de verandering van het klimaat, heeft opnieuw een recordhoogte bereikt.')

In [None]:
for word in sen:
    print(f'{word.text:{12}} {word.pos_:{10}} {word.tag_}')

## 5. Named Entity Recognition

Named Entity Recognition refers to identifying words or phrases in a sentence as entities, e.g. the name of a person, place, or organization. Let's see how the spaCy library performs Named Entity Recognition. Look at the following script:

In [None]:
import spacy
sp = spacy.load('en_core_web_sm')

sen = sp('Manchester United is looking to sign Harry Kane for $90 million.')

print(sen.ents)

You can see that three named entities were identified. To see the details of each named entity, use its `text` and `label_` attributes together with the `spacy.explain()` function, which takes the label string as a parameter.

In [None]:
for entity in sen.ents:
    print(entity.text + ' - ' + entity.label_ + ' - ' + str(spacy.explain(entity.label_)))
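
Once entities are extracted, it is often handy to group them by label. The following is a minimal sketch under a few assumptions: `group_entities_by_label` is a hypothetical helper, and the stand-in objects with made-up names only mimic the `text` and `label_` attributes of spaCy entities so that the snippet runs without a model.

```python
from collections import defaultdict
from types import SimpleNamespace

def group_entities_by_label(entities):
    """Map each entity label to the list of entity texts that carry it."""
    grouped = defaultdict(list)
    for entity in entities:
        grouped[entity.label_].append(entity.text)
    return dict(grouped)

# Made-up stand-ins for spaCy entities (not real model output)
fake_ents = [
    SimpleNamespace(text="Alice", label_="PERSON"),
    SimpleNamespace(text="Acme Corp", label_="ORG"),
    SimpleNamespace(text="Bob", label_="PERSON"),
]
print(group_entities_by_label(fake_ents))
```

With a real document you would call `group_entities_by_label(sen.ents)` instead of passing the stand-in list.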

As with the POS tags, we can also visualize the named entities with displaCy.

In [None]:
from spacy import displacy

sen = sp('Manchester United is looking to sign Harry Kane for $90 million. David wants 100 Million Dollars.')
displacy.render(sen, style='ent', jupyter=True)