Tokenizing means splitting a body of text into smaller units: first into sentences, then into words.

In [2]:
from nltk.tokenize import sent_tokenize, word_tokenize

In [3]:
data = "All work and no play makes jack a dull boy."


In [4]:
phrases = sent_tokenize(data)
words = word_tokenize(data)
print(phrases)

['All work and no play makes jack a dull boy.']


In [5]:
print(words)

['All', 'work', 'and', 'no', 'play', 'makes', 'jack', 'a', 'dull', 'boy', '.']
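
To make the idea concrete, here is a rough sketch of what a tokenizer does, using only the standard library's `re` module. The functions `simple_sent_tokenize` and `simple_word_tokenize` are hypothetical helpers written for this illustration. They are not NLTK's implementation, which handles abbreviations, contractions and many other cases that these regular expressions ignore.

```python
import re

def simple_sent_tokenize(text):
    # Split after '.', '!' or '?' when it is followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def simple_word_tokenize(text):
    # A token is either a run of letters/digits/apostrophes
    # or a single punctuation character.
    return re.findall(r"[A-Za-z0-9']+|[^\sA-Za-z0-9']", text)

print(simple_sent_tokenize("All work and no play. It makes jack a dull boy!"))
# ['All work and no play.', 'It makes jack a dull boy!']

print(simple_word_tokenize("All work and no play makes jack a dull boy."))
# ['All', 'work', 'and', 'no', 'play', 'makes', 'jack', 'a', 'dull', 'boy', '.']
```

For real text, `sent_tokenize` and `word_tokenize` remain the better choice; the sketch only shows the kind of rules involved.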


The process of converting data into something a computer can work with is called pre-processing. One major form of pre-processing is filtering out useless data. In natural language processing, words that carry little useful information are referred to as stop words.

#### What are Stop words?

Stop Words: A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that a search engine has been programmed to ignore, both when indexing entries for searching and when retrieving them as the result of a search query.

We would not want these words taking up space in our database or using valuable processing time. We can remove them easily by storing a list of the words we consider to be stop words.

#### The stop word list was not already installed, so I had to download it using this code.

In [22]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/auro2677/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [6]:
from nltk.corpus import stopwords
# Use a separate name so the imported stopwords corpus is not overwritten
stop_words = set(stopwords.words('english'))
stop_words

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 'r

###### We modify our code to filter out the stop words:

In [7]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

data = "All work and no play makes jack a dull boy."
stopWords = set(stopwords.words('english'))
words = word_tokenize(data)

# Keep only the tokens that are not stop words.
# The comparison is case-sensitive, so 'All' is kept even though 'all' is a stop word.
wordsFiltered = [w for w in words if w not in stopWords]

print(wordsFiltered)

['All', 'work', 'play', 'makes', 'jack', 'dull', 'boy', '.']
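
Notice that 'All' survives in the output above: the stop word list is lowercase, and the membership test is case-sensitive. A minimal sketch of a case-insensitive filter follows. The helper `remove_stop_words` is a hypothetical name, and `sample_stop_words` is a tiny hand-picked subset of the English list, hard-coded here only so the example runs without downloading the corpus.

```python
def remove_stop_words(tokens, stop_words):
    """Drop every token whose lowercase form is in stop_words."""
    return [t for t in tokens if t.lower() not in stop_words]

# Assumption: a small subset of NLTK's English stop words, for illustration only.
sample_stop_words = {"all", "and", "no", "a"}

tokens = ['All', 'work', 'and', 'no', 'play', 'makes', 'jack', 'a', 'dull', 'boy', '.']

print(remove_stop_words(tokens, sample_stop_words))
# ['work', 'play', 'makes', 'jack', 'dull', 'boy', '.']
```

With NLTK installed, the same function can be called with `set(stopwords.words('english'))` as the second argument.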


##### Languages for which stop words are available

In [8]:
print (stopwords.fileids())

['arabic', 'azerbaijani', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'greek', 'hungarian', 'indonesian', 'italian', 'kazakh', 'nepali', 'norwegian', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish', 'turkish']


English Stopwords

In [9]:
print (stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [11]:
print(len(stopwords.words('english')))

179


#### STEMMING
Stemming a word means reducing it to its stem. A single word can have several inflected forms, but all of those forms share a single stem (base). The stem does not have to be identical to the morphological root of the word, and it does not even have to be a valid word.

Example:

The word "work" is the stem of "working", "worked", and "works".
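
As a rough illustration of the idea, the toy function below strips a few common suffixes. It is a deliberately naive sketch, not the Porter algorithm: `naive_stem`, its suffix list and its minimum stem length of 3 are all assumptions made for this example.

```python
def naive_stem(word, suffixes=("ing", "ed", "s")):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["working", "worked", "works", "work"]:
    print(w, "->", naive_stem(w))
# working -> work
# worked -> work
# works -> work
# work -> work

# The result is not always a real word:
print(naive_stem("gaming"))
# gam
```

Real stemmers apply many more rules in a fixed order, which is why they produce better stems than this sketch.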



#### Loading Stemmer Module

There are many stemming algorithms. The Porter stemming algorithm is one of the most widely used.

In [12]:
from nltk.stem import PorterStemmer

In [13]:
stemmer = PorterStemmer()
 
print (stemmer.stem('working')) # output: work
print (stemmer.stem('works')) # output: work
print (stemmer.stem('worked')) # output: work

work
work
work


In [14]:
text = "A quick brown fox jumps over the lazy dog."

In [15]:
# Normalize text
# Tokens are compared case-sensitively,
# so Fox and fox would be treated as two different words.
# Hence, we convert all letters of our text to lowercase.
text = text.lower()

In [16]:
# tokenize text 
words = word_tokenize(text)
 
print (words)

['a', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']


In [17]:
stemmer = PorterStemmer()
 
words_stem = [stemmer.stem(word) for word in words]

In [18]:
print (words_stem)

['a', 'quick', 'brown', 'fox', 'jump', 'over', 'the', 'lazi', 'dog', '.']


We can also stem every word of a sentence:

In [19]:
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
 
ps = PorterStemmer()
 
sentence = "gaming, the gamers play games"
words = word_tokenize(sentence)
 
for word in words:
    print(word + ":" + ps.stem(word))

gaming:game
,:,
the:the
gamers:gamer
play:play
games:game


### Let’s knock out some quick vocabulary:
Corpus : A body of text. Corpora is the plural.

Lexicon : Words and their meanings.

Token : Each “entity” produced when text is split up according to rules. For example, each word is a token when a sentence is tokenized into words.

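In corpus linguistics, part-of-speech tagging (POS tagging or PoS tagging or POST), also called grammatical tagging or word-category disambiguation, is the process of marking each word in a text with its part of speech, based on both its definition and its context.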

In [26]:

import nltk
nltk.download('averaged_perceptron_tagger')
from nltk.tokenize import PunktSentenceTokenizer

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/auro2677/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [27]:
document = 'Everything is all about money.'
sentences = nltk.sent_tokenize(document)   
for sent in sentences:
    print(nltk.pos_tag(nltk.word_tokenize(sent)))

[('Everything', 'NN'), ('is', 'VBZ'), ('all', 'DT'), ('about', 'IN'), ('money', 'NN'), ('.', '.')]
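
Since `nltk.pos_tag` returns a list of `(word, tag)` tuples, ordinary Python tools are enough to summarise it. A minimal sketch, with the tagged list copied by hand from the output above so that it runs without NLTK:

```python
from collections import Counter, defaultdict

# Copied from the pos_tag output above.
tagged = [('Everything', 'NN'), ('is', 'VBZ'), ('all', 'DT'),
          ('about', 'IN'), ('money', 'NN'), ('.', '.')]

# How often does each tag occur?
tag_counts = Counter(tag for _, tag in tagged)

# Which words received each tag?
words_by_tag = defaultdict(list)
for word, tag in tagged:
    words_by_tag[tag].append(word)

print(tag_counts['NN'])
# 2
print(words_by_tag['NN'])
# ['Everything', 'money']
```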


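NLTK's default English tagger uses the Penn Treebank tag set. For example, NN is a singular noun, VBZ a third-person singular present-tense verb, DT a determiner, IN a preposition, and NNP a singular proper noun. The next example keeps only the proper nouns: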

In [28]:
import nltk

In [29]:
document = 'Today the Netherlands celebrates King\'s Day. To honor this tradition, the Dutch embassy in San Francisco invited me to'
sentences = nltk.sent_tokenize(document)   
 
data = []
for sent in sentences:
    data = data + nltk.pos_tag(nltk.word_tokenize(sent))
 
for word in data: 
    if 'NNP' in word[1]: 
        print(word)

('Netherlands', 'NNP')
('King', 'NNP')
('Day', 'NNP')
('San', 'NNP')
('Francisco', 'NNP')
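
In the output above, 'San' and 'Francisco' are printed separately although they form one name. One possible way to join adjacent proper nouns is sketched below with `itertools.groupby`. The helper `proper_noun_phrases` is a hypothetical name, and the `sample` list is hand-written in the format that `nltk.pos_tag` returns, not actual tagger output.

```python
from itertools import groupby

def proper_noun_phrases(tagged_tokens):
    """Join runs of consecutive NNP/NNPS tokens into single phrases."""
    phrases = []
    # groupby collects consecutive tokens that share the same key value.
    for is_proper, group in groupby(tagged_tokens,
                                    key=lambda pair: pair[1].startswith("NNP")):
        if is_proper:
            phrases.append(" ".join(word for word, _ in group))
    return phrases

# Assumption: hand-written sample, not produced by running the tagger.
sample = [("embassy", "NN"), ("in", "IN"), ("San", "NNP"),
          ("Francisco", "NNP"), ("invited", "VBD"), ("me", "PRP")]

print(proper_noun_phrases(sample))
# ['San Francisco']
```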


Credits:
https://pythonspot.com/nltk-stemming/
https://www.geeksforgeeks.org/part-speech-tagging-stop-words-using-nltk-python/
http://blog.chapagain.com.np/python-nltk-stemming-lemmatization-natural-language-processing-nlp/
http://blog.chapagain.com.np/python-nltk-stop-words-natural-language-processing-nlp/