# 1. Natural Language Toolkit

### What is Natural Language Toolkit?

The Natural Language Toolkit (NLTK for short) is a suite of Python libraries and programs for symbolic and statistical natural language processing.

As mentioned previously, data for an NLP model has to be preprocessed before training. Such preprocessing can include converting string data to numerical data, performing semantic analysis, and so on. Many of these operations can be implemented with the NLTK library.
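
To illustrate what converting string data to numerical data can look like, here is a minimal pure-Python sketch that maps each distinct token to an integer id. The helper name `build_vocabulary`, the example sentence, and the whitespace split are assumptions made for illustration; they are not part of NLTK.

```python
def build_vocabulary(tokens):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocabulary = {}
    for token in tokens:
        if token not in vocabulary:
            vocabulary[token] = len(vocabulary)
    return vocabulary


# Hypothetical example sentence, split on whitespace for simplicity
tokens = "the cat sat on the mat".split()

vocabulary = build_vocabulary(tokens)
# Replace every token with its id; repeated words share the same id
encoded = [vocabulary[token] for token in tokens]

print(vocabulary)
print(encoded)
```

Real pipelines usually build the vocabulary on the training data only and reserve an id for unknown words, which this sketch leaves out.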

In the following section, we will look at the most relevant functions.

In [None]:
#Setting up
!pip install nltk

In [None]:
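import nltk

# Download only the resources used in this notebook;
# a bare nltk.download() opens an interactive downloader instead.
for resource in ['punkt', 'punkt_tab', 'wordnet', 'omw-1.4', 'stopwords']:
    nltk.download(resource)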

### Tokenization

![tokenization](https://th.bing.com/th/id/OIP.mp2GAfOG8L4JxUv54-364gHaB1?pid=ImgDet&rs=1)

Before processing textual data, we should first tokenize it. In other words, we split it into smaller parts (paragraphs into sentences, sentences into words), so that later steps can work on individual units.

#### Sentence tokenization

As the name suggests, sentence tokenization splits a paragraph or a longer text into individual sentences.

In [None]:
from nltk.tokenize import sent_tokenize

Text = "Natural language processing (NLP) refers to the branch of computer science. To be more specific, the branch of artificial intelligence. It is concerned with giving computers the ability to understand text and spoken words in much the same way human beings can."

# Store the result under a new name so the sent_tokenize function is not overwritten
sentences = sent_tokenize(Text)

sentences

#### Word tokenization

In contrast to sentence tokenization, the goal of word tokenization is to divide textual data into individual words.

In [None]:
# Punctuation marks are returned as separate tokens
tokens = nltk.word_tokenize(Text)

tokens
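
To make the idea of word tokenization concrete, the following sketch compares a plain whitespace split with a rough regular-expression tokenizer. The function `simple_word_tokenize` and the sample sentence are hypothetical and written only for illustration; this is not the algorithm NLTK uses, which handles many more cases (contractions, abbreviations, and so on).

```python
import re


def simple_word_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


sample = "Tokenization splits text, even with punctuation."

# Whitespace splitting leaves punctuation attached to the words
whitespace_tokens = sample.split()
# The regex version separates punctuation into its own tokens
regex_tokens = simple_word_tokenize(sample)

print(whitespace_tokens)
print(regex_tokens)
```

The difference matters later on: with a whitespace split, 'text,' and 'text' would be treated as two different tokens.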

### Stemming

![stemming](https://th.bing.com/th/id/OIP.vzQ5E_6loS0hz8fJbQQbXAHaFj?pid=ImgDet&rs=1)

Stemming is the process of reducing morphological variants of a word to a common base form (the stem). Stemming algorithms allow multiple variations of a word to share the *same* meaning / attribute (for instance, 'fish' and 'fishing' are both reduced to the base word 'fish').

In many cases, a stemming algorithm cuts off the end of the word until a base form is reached. This works in most cases, although the resulting stem is not always a valid word.
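
The suffix-cutting idea can be sketched in a few lines of plain Python. The function `naive_stem` and its suffix list are assumptions made for illustration only; real stemmers apply a much larger, ordered set of rules.

```python
def naive_stem(word, suffixes=("ing", "es", "s")):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


words = ['fishing', 'believes', 'writes', 'loving', 'cats']

# Collect the stems so they can be inspected or reused
stems = [naive_stem(word) for word in words]

for word, stem in zip(words, stems):
    print(word + '-----' + stem)
```

Stems such as 'believ' and 'lov' are not dictionary words, which is typical of stemming: the goal is a shared key for related words, not a readable form.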

Let's look at one of the most common stemmer implementations, the **Porter Stemmer**.

In [None]:
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

words = ['fishing', 'believes', 'writes', 'loving', 'cats']

for word in words:
    print(word + '-----' + stemmer.stem(word))

### Lemmatization

In contrast to stemming, lemmatization does not simply cut off word endings; it uses a full vocabulary and a morphological analysis of each word. Simple examples of such a reduction are interpreting 'was' as a form of 'be', or 'mice' as 'mouse'.

As a result, lemmatization is considered more informative than simple stemming, but it is also slower.
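
The difference from stemming can be sketched with a tiny lookup table: irregular forms are mapped to their dictionary form instead of having their endings cut. The table `LEMMA_TABLE` and the function `lookup_lemma` are hypothetical and only illustrate the idea; a real lemmatizer relies on a full vocabulary and morphological rules.

```python
# Tiny hand-written lookup table; a real lemmatizer covers a full vocabulary
LEMMA_TABLE = {
    "was": "be",
    "were": "be",
    "mice": "mouse",
    "geese": "goose",
}


def lookup_lemma(word):
    """Return the dictionary form of a word, or the word itself if it is unknown."""
    return LEMMA_TABLE.get(word.lower(), word)


words = ["was", "mice", "cats"]
lemmas = [lookup_lemma(word) for word in words]

for word, lemma in zip(words, lemmas):
    print(word + '-----' + lemma)
```

'cats' is returned unchanged because it is missing from the toy table, which shows why lemmatization quality depends on the vocabulary behind it.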

In [None]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

words = ['believes', 'lives', 'mice']

for word in words:
    print(word + '-----' + lemmatizer.lemmatize(word))

However, the same word can have different lemmas depending on its part of speech. By default, the lemmatizer treats every word as a noun; to interpret a word correctly, we can specify the ```pos``` argument of the ```lemmatize``` function.

In [None]:
#crossing as adjective
print(lemmatizer.lemmatize('crossing', pos = 'a'))

#crossing as verb
print(lemmatizer.lemmatize('crossing', pos = 'v'))

### Stopwords

Stopwords are words that add little meaning to a sentence and can therefore often be ignored without losing accuracy. Examples in English include 'is', 'the', and 'at'; however, each NLP tool has its own list of stopwords.

In [None]:
from nltk.corpus import stopwords

print(stopwords.words('english'))

As these words do not carry much information, they can often be removed; however, this depends on the task.

Generally, NLP models for text classification, sentiment analysis, spam classification (and so on) benefit from the removal of stopwords.

On the other hand, when dealing with translation models, stopwords should be kept, as they might provide contextual information.
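
To see why the decision depends on the task, the sketch below filters a short review with a small hand-picked stop-word set. The set `STOP_WORDS`, the helper `remove_stop_words`, and the review text are assumptions for illustration, not NLTK objects. NLTK's English list also contains negation words such as 'not', so it can be worth checking whether they should be kept.

```python
# Small hand-picked subset for illustration; NLTK's English list is much longer
STOP_WORDS = {"the", "was", "is", "a", "not"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not stop words (case-insensitive)."""
    return [token for token in tokens if token.lower() not in stop_words]


review = "The movie was not good".split()

# Removing every stop word also removes the negation
without_stop_words = remove_stop_words(review)
# Excluding 'not' from the stop-word set keeps the negation
keeping_negation = remove_stop_words(review, STOP_WORDS - {"not"})

print(without_stop_words)
print(keeping_negation)
```

In the first result the review reads as positive, while the second keeps the word that reverses its meaning.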

Let's remove stopwords from the previously analyzed NLP definition.

In [None]:
from nltk.corpus import stopwords

# Use a new name so the imported stopwords module is not overwritten
stop_words = set(stopwords.words('english'))

Text = "Natural language processing (NLP) refers to the branch of computer science. To be more specific, the branch of artificial intelligence. It is concerned with giving computers the ability to understand text and spoken words in much the same way human beings can."

words = nltk.word_tokenize(Text)

filtered_words = []

for word in words:
    # The NLTK stopword list is lowercase, so compare in lowercase
    if word.lower() not in stop_words:
        filtered_words.append(word)

print(filtered_words)