# Tokenization

In [1]:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

In [2]:
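# Download the Punkt sentence tokenizer models used by sent_tokenize and word_tokenize.
# Newer NLTK releases may also require nltk.download('punkt_tab').
nltk.download('punkt')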

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

In [3]:
sent = sent_tokenize(" Before the extensive use of deep learning, many problems are solved using machine learning but did not get the results as expected. With the boom in technology specially in Graphical Processing Units(GPUs), deep learning comes into picture and solved many complicated problems with the help of artificial neural networks. One of the most important aspect of deep learning is that no feature extraction is required. The model itself manages the weight and extract the best features and then training is done whereas in machine learning, feature extraction is always done prior model training. ")
print(sent)

[' Before the extensive use of deep learning, many problems are solved using machine learning but did not get the results as expected.', 'With the boom in technology specially in Graphical Processing Units(GPUs), deep learning comes into picture and solved many complicated problems with the help of artificial neural networks.', 'One of the most important aspect of deep learning is that no feature extraction is required.', 'The model itself manages the weight and extract the best features and then training is done whereas in machine learning, feature extraction is always done prior model training.']


In [4]:
len(sent)

4

In [5]:
word = word_tokenize('Many problems are solved using machine learning')

print(word)

['Many', 'problems', 'are', 'solved', 'using', 'machine', 'learning']


In [6]:
len(word)

7

In [7]:
from nltk.tokenize.punkt import PunktSentenceTokenizer

In [8]:
pst = PunktSentenceTokenizer()

In [9]:
punkt_sentence = pst.tokenize("Before the extensive use of deep learning, many problems are solved using machine learning but did not get the results as expected. With the boom in technology specially in Graphical Processing Units(GPUs), deep learning comes into picture and solved many complicated problems with the help of artificial neural networks. One of the most important aspect of deep learning is that no feature extraction is required. The model itself manages the weight and extract the best features and then training is done whereas in machine learning, feature extraction is always done prior model training. ")
print(punkt_sentence)

['Before the extensive use of deep learning, many problems are solved using machine learning but did not get the results as expected.', 'With the boom in technology specially in Graphical Processing Units(GPUs), deep learning comes into picture and solved many complicated problems with the help of artificial neural networks.', 'One of the most important aspect of deep learning is that no feature extraction is required.', 'The model itself manages the weight and extract the best features and then training is done whereas in machine learning, feature extraction is always done prior model training.']


In [10]:
len(punkt_sentence)

4

In [11]:
span_sentence = pst.span_tokenize("Before the extensive use of deep learning, many problems are solved using machine learning but did not get the results as expected. With the boom in technology specially in Graphical Processing Units(GPUs), deep learning comes into picture and solved many complicated problems with the help of artificial neural networks. One of the most important aspect of deep learning is that no feature extraction is required. The model itself manages the weight and extract the best features and then training is done whereas in machine learning, feature extraction is always done prior model training. ")
print(list(span_sentence))

[(0, 131), (132, 321), (322, 414), (415, 591)]
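Each `(start, end)` pair returned by `span_tokenize` is a character offset into the original string, so slicing the string with a span should give back the corresponding sentence. The following is a minimal sketch of that relationship; the short `text` is a made-up example, not the paragraph used above.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

# Illustrative text (an assumption, not the paragraph used above).
text = "Deep learning needs data. Machine learning needs features."

pst = PunktSentenceTokenizer()
spans = list(pst.span_tokenize(text))

# Slicing the original string with each span recovers the sentence text.
recovered = [text[start:end] for start, end in spans]

for (start, end), sentence in zip(spans, recovered):
    print(start, end, repr(sentence))
```

Spans are useful when sentence positions need to be kept, for example to highlight a sentence in the source text.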


In [12]:
sentences = pst.sentences_from_tokens(word)
list(sentences)

[['Many', 'problems', 'are', 'solved', 'using', 'machine', 'learning']]

In [13]:
from nltk.probability import FreqDist

In [14]:
freq = FreqDist(word)
freq.most_common()

[('Many', 1),
 ('problems', 1),
 ('are', 1),
 ('solved', 1),
 ('using', 1),
 ('machine', 1),
 ('learning', 1)]
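With seven distinct tokens, every count above is 1. A frequency distribution becomes more informative once tokens repeat. The sketch below uses a made-up sentence and plain whitespace splitting (both assumptions for illustration) so that it does not depend on the punkt download.

```python
from nltk.probability import FreqDist

# Illustrative sentence (an assumption); split on whitespace to keep the
# example independent of the punkt models.
tokens = "deep learning and machine learning both rely on learning from data".split()

freq = FreqDist(tokens)

print(freq.most_common(2))   # the two most frequent tokens with their counts
print(freq["learning"])      # count for a single token
print(freq.N())              # total number of tokens counted
```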

# Stemming

Often when searching text for a certain keyword, it helps if the search returns variations of the word. For instance, searching for "boat" might also return "boats" and "boating". Here, "boat" would be the stem for [boat, boater, boating, boats].
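As a rough sketch of stem-based search, the snippet below matches a query against documents by comparing stems rather than surface forms. It uses the "run" word family and NLTK's `PorterStemmer`, both introduced later in this notebook; the `documents` list and the `matches` helper are hypothetical names made up for this illustration.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical mini-corpus used only for this illustration.
documents = [
    "She runs every morning",
    "Running is good exercise",
    "He ran a marathon last year",
    "The car is red",
]

def matches(query, text):
    """Return True if any whitespace-separated word in text shares a stem with query."""
    query_stem = stemmer.stem(query.lower())
    return any(stemmer.stem(token.lower()) == query_stem for token in text.split())

hits = [doc for doc in documents if matches("run", doc)]
print(hits)
```

The sentence containing "ran" is not matched, because a stemmer only strips suffixes and does not relate irregular forms to their base word.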

Stemming is a somewhat crude method for cataloging related words; it essentially chops off letters from the end until the stem is reached. This works fairly well in most cases, but unfortunately English has many exceptions where a more sophisticated process is required. In fact, spaCy doesn't include a stemmer, opting instead to rely entirely on lemmatization. For those interested, there's some background on this decision here. We discuss the virtues of lemmatization in the next section.
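To make the "chopping" idea concrete, here is a deliberately naive suffix stripper. It is a toy sketch, not how any real stemmer works: the suffix list and the `crude_stem` helper are assumptions invented for this example.

```python
# Toy suffix list chosen for illustration; real stemmers use far richer rules.
SUFFIXES = ("ing", "ers", "er", "s")

def crude_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for w in ["boat", "boats", "boating", "boater", "running", "ran"]:
    print(w, "-->", crude_stem(w))
```

The boat family collapses to "boat" as hoped, but "running" becomes "runn" and "ran" is left untouched, which illustrates the kind of exceptions that call for a more sophisticated process.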

# Porter Stemmer
One of the most common - and effective - stemming tools is Porter's algorithm, developed by Martin Porter in 1980. The algorithm reduces a word in five phases, each with its own set of suffix-mapping rules.
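To give a feel for what a set of mapping rules looks like, the sketch below implements only step 1a of the first phase as described in Porter's paper (plural handling). It is not the full algorithm, and NLTK's implementation may differ in details; the names `STEP_1A_RULES` and `porter_step_1a` are made up for this illustration.

```python
# Step 1a suffix rules from Porter's algorithm, tried in order
# (longest suffix first); only the first matching rule is applied.
STEP_1A_RULES = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ss", "ss"),
    ("s", ""),
]

def porter_step_1a(word):
    """Apply the first matching step 1a rule; return the word unchanged if none match."""
    for suffix, replacement in STEP_1A_RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "-->", porter_step_1a(w))
```

Later phases apply similar rule tables, but most of them add conditions on the remaining stem before a suffix may be removed.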

In [15]:
import nltk
from nltk.stem.porter import PorterStemmer

In [16]:
p_stemmer = PorterStemmer()

In [17]:
words = ['run', 'runner', 'running', 'ran', 'runs', 'easily', 'fairly', 'accordingly', 'consolingly']

In [18]:
# Use a separate loop variable so the token list `word` from In [5] is not overwritten.
for w in words:
    print(f'{w} -->  {p_stemmer.stem(w)}')

run -->  run
runner -->  runner
running -->  run
ran -->  ran
runs -->  run
easily -->  easili
fairly -->  fairli
accordingly -->  accordingli
consolingly -->  consolingli


# Snowball Stemmer

The name is something of a misnomer, as Snowball is actually a stemming language developed by Martin Porter. The algorithm used here is more accurately called the "English Stemmer" or "Porter2 Stemmer". It offers a slight improvement over the original Porter stemmer, both in logic and speed. Since nltk uses the name SnowballStemmer, we'll use it here.

In [19]:
from nltk.stem.snowball import SnowballStemmer
s_stemmer = SnowballStemmer(language='english')

In [20]:
words = ['run', 'runner', 'running', 'ran', 'runs', 'easily', 'fairly', 'accordingly', 'consolingly']

In [21]:
# Use a separate loop variable so the token list `word` from In [5] is not overwritten.
for w in words:
    print(f'{w} -->  {s_stemmer.stem(w)}')

run -->  run
runner -->  runner
running -->  run
ran -->  ran
runs -->  run
easily -->  easili
fairly -->  fair
accordingly -->  accord
consolingly -->  consol
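To see where the two stemmers disagree, the outputs can be compared side by side. This is a small sketch over the same word list; the `differences` dictionary is just a name chosen for this example.

```python
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer(language='english')

words = ['run', 'runner', 'running', 'ran', 'runs', 'easily', 'fairly', 'accordingly', 'consolingly']

# Collect only the words on which the two stemmers produce different stems.
differences = {
    w: (porter.stem(w), snowball.stem(w))
    for w in words
    if porter.stem(w) != snowball.stem(w)
}

for w, (p, s) in differences.items():
    print(f"{w}: porter={p}, snowball={s}")
```

On this list the stemmers differ only on the "-ly" adverbs, where Snowball strips the suffix more cleanly; neither maps "ran" or "runner" to "run".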
