# Natural Language Toolkit 

NLTK (Natural Language Toolkit) is a toolkit built for working with NLP in Python. It provides various text processing libraries along with many test datasets. A variety of tasks can be performed using NLTK, such as tokenization and parse tree visualization. In this article, we will go through how to set up NLTK on our system and use it to perform various NLP tasks during the text processing step.

# Corpora

In [24]:
import os
import nltk
import nltk.corpus

In [25]:
nltk.download('inaugural')

[nltk_data] Downloading package inaugural to
[nltk_data]     C:\Users\divak\AppData\Roaming\nltk_data...
[nltk_data]   Package inaugural is already up-to-date!


True

In [26]:
from nltk.corpus import inaugural

In [27]:
nltk.corpus.inaugural.fileids()

['1789-Washington.txt',
 '1793-Washington.txt',
 '1797-Adams.txt',
 '1801-Jefferson.txt',
 '1805-Jefferson.txt',
 '1809-Madison.txt',
 '1813-Madison.txt',
 '1817-Monroe.txt',
 '1821-Monroe.txt',
 '1825-Adams.txt',
 '1829-Jackson.txt',
 '1833-Jackson.txt',
 '1837-VanBuren.txt',
 '1841-Harrison.txt',
 '1845-Polk.txt',
 '1849-Taylor.txt',
 '1853-Pierce.txt',
 '1857-Buchanan.txt',
 '1861-Lincoln.txt',
 '1865-Lincoln.txt',
 '1869-Grant.txt',
 '1873-Grant.txt',
 '1877-Hayes.txt',
 '1881-Garfield.txt',
 '1885-Cleveland.txt',
 '1889-Harrison.txt',
 '1893-Cleveland.txt',
 '1897-McKinley.txt',
 '1901-McKinley.txt',
 '1905-Roosevelt.txt',
 '1909-Taft.txt',
 '1913-Wilson.txt',
 '1917-Wilson.txt',
 '1921-Harding.txt',
 '1925-Coolidge.txt',
 '1929-Hoover.txt',
 '1933-Roosevelt.txt',
 '1937-Roosevelt.txt',
 '1941-Roosevelt.txt',
 '1945-Roosevelt.txt',
 '1949-Truman.txt',
 '1953-Eisenhower.txt',
 '1957-Eisenhower.txt',
 '1961-Kennedy.txt',
 '1965-Johnson.txt',
 '1969-Nixon.txt',
 '1973-Nixon.txt',
 '1

In [28]:
words = inaugural.words('2009-Obama.txt')
words

['My', 'fellow', 'citizens', ':', 'I', 'stand', 'here', ...]

In [29]:
l1 = len(inaugural.words('2009-Obama.txt'))
l1

2726

In [30]:
for word in words[:500]:
    print(word, sep =" ", end = " ")

My fellow citizens : I stand here today humbled by the task before us , grateful for the trust you have bestowed , mindful of the sacrifices borne by our ancestors . I thank President Bush for his service to our nation , as well as the generosity and cooperation he has shown throughout this transition . Forty - four Americans have now taken the presidential oath . The words have been spoken during rising tides of prosperity and the still waters of peace . Yet , every so often the oath is taken amidst gathering clouds and raging storms . At these moments , America has carried on not simply because of the skill or vision of those in high office , but because We the People have remained faithful to the ideals of our forbearers , and true to our founding documents . So it has been . So it must be with this generation of Americans . That we are in the midst of crisis is now well understood . Our nation is at war , against a far - reaching network of violence and hatred . Our economy is badl

In [31]:
nltk.download('twitter_samples')

[nltk_data] Downloading package twitter_samples to
[nltk_data]     C:\Users\divak\AppData\Roaming\nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!


True

In [32]:
from nltk.corpus import twitter_samples

In [33]:
nltk.corpus.twitter_samples.fileids()

['negative_tweets.json', 'positive_tweets.json', 'tweets.20150430-223406.json']

# Tokenization

In [34]:
import nltk

In [35]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\divak\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [36]:
# Read the demo file; the with-block closes the file handle automatically
with open("demo_text.txt", 'r') as text_file:
    my_text = text_file.read()
print(my_text)

Hi Learner! Welcome to NLP (Natural Language Processing) with Python. Here you will learn text mining and processing on natural language data. Are you aware about the Python basics? 



## 1. Word Tokenization

In [37]:
from nltk.tokenize import word_tokenize

word_tokens = word_tokenize(my_text)
print(word_tokens) # print function requires Python 3

['Hi', 'Learner', '!', 'Welcome', 'to', 'NLP', '(', 'Natural', 'Language', 'Processing', ')', 'with', 'Python', '.', 'Here', 'you', 'will', 'learn', 'text', 'mining', 'and', 'processing', 'on', 'natural', 'language', 'data', '.', 'Are', 'you', 'aware', 'about', 'the', 'Python', 'basics', '?']


## 2. Sentence Tokenization

In [38]:
from nltk.tokenize import sent_tokenize

sent_tokens = sent_tokenize(my_text)
print(sent_tokens) # print function requires Python 3

['Hi Learner!', 'Welcome to NLP (Natural Language Processing) with Python.', 'Here you will learn text mining and processing on natural language data.', 'Are you aware about the Python basics?']


## 3. Tokenization (N-Grams)
### Creating Bigrams

In [39]:
from nltk.util import ngrams

my_words = word_tokenize(my_text) # This is the list of all words
twograms = list(ngrams(my_words,2)) # This is for two-word combos, but can pick any n
print(twograms)

[('Hi', 'Learner'), ('Learner', '!'), ('!', 'Welcome'), ('Welcome', 'to'), ('to', 'NLP'), ('NLP', '('), ('(', 'Natural'), ('Natural', 'Language'), ('Language', 'Processing'), ('Processing', ')'), (')', 'with'), ('with', 'Python'), ('Python', '.'), ('.', 'Here'), ('Here', 'you'), ('you', 'will'), ('will', 'learn'), ('learn', 'text'), ('text', 'mining'), ('mining', 'and'), ('and', 'processing'), ('processing', 'on'), ('on', 'natural'), ('natural', 'language'), ('language', 'data'), ('data', '.'), ('.', 'Are'), ('Are', 'you'), ('you', 'aware'), ('aware', 'about'), ('about', 'the'), ('the', 'Python'), ('Python', 'basics'), ('basics', '?')]


In [40]:
bigrams = list(nltk.bigrams(my_words))
print(bigrams)

[('Hi', 'Learner'), ('Learner', '!'), ('!', 'Welcome'), ('Welcome', 'to'), ('to', 'NLP'), ('NLP', '('), ('(', 'Natural'), ('Natural', 'Language'), ('Language', 'Processing'), ('Processing', ')'), (')', 'with'), ('with', 'Python'), ('Python', '.'), ('.', 'Here'), ('Here', 'you'), ('you', 'will'), ('will', 'learn'), ('learn', 'text'), ('text', 'mining'), ('mining', 'and'), ('and', 'processing'), ('processing', 'on'), ('on', 'natural'), ('natural', 'language'), ('language', 'data'), ('data', '.'), ('.', 'Are'), ('Are', 'you'), ('you', 'aware'), ('aware', 'about'), ('about', 'the'), ('the', 'Python'), ('Python', 'basics'), ('basics', '?')]


### Creating Trigrams

In [41]:
threegrams = list(ngrams(my_words,3))
print(threegrams)

[('Hi', 'Learner', '!'), ('Learner', '!', 'Welcome'), ('!', 'Welcome', 'to'), ('Welcome', 'to', 'NLP'), ('to', 'NLP', '('), ('NLP', '(', 'Natural'), ('(', 'Natural', 'Language'), ('Natural', 'Language', 'Processing'), ('Language', 'Processing', ')'), ('Processing', ')', 'with'), (')', 'with', 'Python'), ('with', 'Python', '.'), ('Python', '.', 'Here'), ('.', 'Here', 'you'), ('Here', 'you', 'will'), ('you', 'will', 'learn'), ('will', 'learn', 'text'), ('learn', 'text', 'mining'), ('text', 'mining', 'and'), ('mining', 'and', 'processing'), ('and', 'processing', 'on'), ('processing', 'on', 'natural'), ('on', 'natural', 'language'), ('natural', 'language', 'data'), ('language', 'data', '.'), ('data', '.', 'Are'), ('.', 'Are', 'you'), ('Are', 'you', 'aware'), ('you', 'aware', 'about'), ('aware', 'about', 'the'), ('about', 'the', 'Python'), ('the', 'Python', 'basics'), ('Python', 'basics', '?')]


In [42]:
trigrams = list(nltk.trigrams(my_words))
print(trigrams)

[('Hi', 'Learner', '!'), ('Learner', '!', 'Welcome'), ('!', 'Welcome', 'to'), ('Welcome', 'to', 'NLP'), ('to', 'NLP', '('), ('NLP', '(', 'Natural'), ('(', 'Natural', 'Language'), ('Natural', 'Language', 'Processing'), ('Language', 'Processing', ')'), ('Processing', ')', 'with'), (')', 'with', 'Python'), ('with', 'Python', '.'), ('Python', '.', 'Here'), ('.', 'Here', 'you'), ('Here', 'you', 'will'), ('you', 'will', 'learn'), ('will', 'learn', 'text'), ('learn', 'text', 'mining'), ('text', 'mining', 'and'), ('mining', 'and', 'processing'), ('and', 'processing', 'on'), ('processing', 'on', 'natural'), ('on', 'natural', 'language'), ('natural', 'language', 'data'), ('language', 'data', '.'), ('data', '.', 'Are'), ('.', 'Are', 'you'), ('Are', 'you', 'aware'), ('you', 'aware', 'about'), ('aware', 'about', 'the'), ('about', 'the', 'Python'), ('the', 'Python', 'basics'), ('Python', 'basics', '?')]
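As a rough sketch of what the n-gram helpers compute, the sliding window can be reproduced in plain Python. The function name `sliding_ngrams` and the short `sample_tokens` list are illustrative assumptions and are not part of NLTK; `nltk.util.ngrams` also accepts padding options that this sketch leaves out.

```python
def sliding_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple (hypothetical helper)."""
    # zip over n staggered views of the list: tokens[0:], tokens[1:], ...
    # zip stops at the shortest view, so incomplete windows are dropped
    return list(zip(*(tokens[i:] for i in range(n))))

sample_tokens = ['Hi', 'Learner', '!', 'Welcome', 'to', 'NLP']
print(sliding_ngrams(sample_tokens, 2))  # bigrams
print(sliding_ngrams(sample_tokens, 3))  # trigrams
```

A list of k tokens yields k - n + 1 n-grams, which is why the trigram list is always one entry shorter than the bigram list for the same text.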


# 1. Tokenization using Python’s split() function
Let’s start with the split() method, as it is the most basic one. It returns a list of strings after breaking the given string at the specified separator. By default, split() breaks a string at any run of whitespace (spaces, tabs, and newlines) and discards empty strings. We can change the separator to any other string. Let’s check it out.

### Word Tokenization

In [44]:
split_word_tokens = my_text.split()
print(split_word_tokens)

['Hi', 'Learner!', 'Welcome', 'to', 'NLP', '(Natural', 'Language', 'Processing)', 'with', 'Python.', 'Here', 'you', 'will', 'learn', 'text', 'mining', 'and', 'processing', 'on', 'natural', 'language', 'data.', 'Are', 'you', 'aware', 'about', 'the', 'Python', 'basics?']


### Sentence Tokenization

In [45]:
split_sent_tokens = my_text.split('. ')
print(split_sent_tokens)

['Hi Learner! Welcome to NLP (Natural Language Processing) with Python', 'Here you will learn text mining and processing on natural language data', 'Are you aware about the Python basics? \n']
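The effect of the separator argument is easiest to see on a small string. The sketch below uses a made-up `sample` string (an assumption for illustration, not the contents of `demo_text.txt`) to contrast the default call with explicit separators.

```python
sample = "Hi  Learner! Welcome to NLP. "  # note the double space and the trailing space

# No argument: split on runs of whitespace and drop empty strings
default_tokens = sample.split()

# Explicit single space: every space is a boundary, so empty strings appear
space_tokens = sample.split(' ')

# Multi-character separator: the separator itself is removed from the pieces
sentence_pieces = sample.split('. ')

print(default_tokens)
print(space_tokens)
print(sentence_pieces)
```

This behaviour explains the sentence split above: the full stops are consumed by the separator, and no break happens after `!` or `?` because only `'. '` counts as a boundary.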


# 2. Tokenization using Regular Expressions (RegEx)
First, let’s understand what a regular expression is. It is basically a special character sequence that helps you match or find other strings or sets of strings using that sequence as a pattern.

We can use the re module in Python to work with regular expressions. It is part of the Python standard library, so no separate installation is needed.

Now, let’s perform word tokenization and sentence tokenization keeping RegEx in mind.
### Word Tokenization

In [46]:
import re
from nltk.tokenize import RegexpTokenizer

# RegEx Tokenizer with whitespace delimiter
whitespace_tokenizer = RegexpTokenizer(r"\s+", gaps = True)  # raw string avoids invalid-escape warnings

whitespace_tokens = whitespace_tokenizer.tokenize(my_text)
print(whitespace_tokens)

['Hi', 'Learner!', 'Welcome', 'to', 'NLP', '(Natural', 'Language', 'Processing)', 'with', 'Python.', 'Here', 'you', 'will', 'learn', 'text', 'mining', 'and', 'processing', 'on', 'natural', 'language', 'data.', 'Are', 'you', 'aware', 'about', 'the', 'Python', 'basics?']


In [47]:
re_word_tokens = re.findall(r"[\w']+", my_text)  # runs of word characters and apostrophes
print(re_word_tokens)

['Hi', 'Learner', 'Welcome', 'to', 'NLP', 'Natural', 'Language', 'Processing', 'with', 'Python', 'Here', 'you', 'will', 'learn', 'text', 'mining', 'and', 'processing', 'on', 'natural', 'language', 'data', 'Are', 'you', 'aware', 'about', 'the', 'Python', 'basics']


In [48]:
# RegexpTokenizer to match only words that start with a capital letter
# (the pattern needs at least two characters, so a lone "I" would not match)
cap_tokenizer = RegexpTokenizer(r"[A-Z]['\w]+")
print(cap_tokenizer.tokenize(my_text))

['Hi', 'Learner', 'Welcome', 'NLP', 'Natural', 'Language', 'Processing', 'Python', 'Here', 'Are', 'Python']
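The `re.findall` pattern above drops punctuation, and the whitespace tokenizer leaves it attached to the neighbouring word. If punctuation should survive as separate tokens, one possible pattern is an alternation between runs of word characters and single non-space symbols. This is a sketch on a hypothetical `sample` string, and it only approximates what `word_tokenize` does (contractions, for instance, are handled differently).

```python
import re

sample = "Hi Learner! Welcome to NLP (Natural Language Processing) with Python."

# \w+      : a run of letters, digits, or underscores
# [^\w\s]  : any single character that is neither a word character nor whitespace
punct_aware_tokens = re.findall(r"\w+|[^\w\s]", sample)
print(punct_aware_tokens)
```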


### Sentence Tokenization

In [49]:
re_sentence_tokens = re.compile('[.!?] ').split(my_text)
print(re_sentence_tokens)

['Hi Learner', 'Welcome to NLP (Natural Language Processing) with Python', 'Here you will learn text mining and processing on natural language data', 'Are you aware about the Python basics', '\n']
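Splitting on `'[.!?] '` consumes the punctuation and leaves a stray `'\n'` entry at the end. One way around this, sketched here on a hypothetical `sample` string, is a lookbehind that splits on the whitespace following a terminator, so the terminator stays with its sentence. Abbreviations such as "Dr." would still be split incorrectly, which is the kind of case a trained sentence tokenizer aims to handle.

```python
import re

sample = "Hi Learner! Welcome to NLP with Python. Are you aware of the basics? \n"

# (?<=[.!?]) asserts that the previous character is a sentence terminator;
# \s+ then consumes the whitespace that follows it.
pieces = re.split(r"(?<=[.!?])\s+", sample)

# Drop the empty string left behind by the trailing whitespace
sentences = [piece for piece in pieces if piece]
print(sentences)
```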


# 3. Tokenization using NLTK
NLTK contains a module called tokenize, which provides two main kinds of tokenizers:

- Word tokenize: We use the word_tokenize() function to split a sentence into tokens or words.
- Sentence tokenize: We use the sent_tokenize() function to split a document or paragraph into sentences.

### Word Tokenization

In [50]:
from nltk.tokenize import word_tokenize

nltk_word_tokens = word_tokenize(my_text)
print(nltk_word_tokens)

['Hi', 'Learner', '!', 'Welcome', 'to', 'NLP', '(', 'Natural', 'Language', 'Processing', ')', 'with', 'Python', '.', 'Here', 'you', 'will', 'learn', 'text', 'mining', 'and', 'processing', 'on', 'natural', 'language', 'data', '.', 'Are', 'you', 'aware', 'about', 'the', 'Python', 'basics', '?']


### Sentence Tokenization

In [51]:
from nltk.tokenize import sent_tokenize

nltk_sent_tokens = sent_tokenize(my_text)
print(nltk_sent_tokens)

['Hi Learner!', 'Welcome to NLP (Natural Language Processing) with Python.', 'Here you will learn text mining and processing on natural language data.', 'Are you aware about the Python basics?']


# 4. Tokenization using Keras
Keras is a widely used open-source deep learning library for Python. It is designed to be easy to use and can run on top of TensorFlow.

In the NLP context, we can use Keras for cleaning the unstructured text data that we typically collect.

### Word Tokenization

In [57]:
from keras.preprocessing.text import text_to_word_sequence

In [58]:
keras_word_tokens = text_to_word_sequence(my_text)
print(keras_word_tokens)

['hi', 'learner', 'welcome', 'to', 'nlp', 'natural', 'language', 'processing', 'with', 'python', 'here', 'you', 'will', 'learn', 'text', 'mining', 'and', 'processing', 'on', 'natural', 'language', 'data', 'are', 'you', 'aware', 'about', 'the', 'python', 'basics']
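The output above is lowercased and stripped of punctuation. The plain-Python sketch below approximates that behaviour without Keras. The helper name `simple_word_sequence` is hypothetical, and the `FILTERS` string is an assumption modelled on the Keras defaults, so treat it as an illustration of the idea and not as the library's implementation.

```python
# Characters treated as separators; an assumption modelled on the Keras defaults
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def simple_word_sequence(text, filters=FILTERS, lower=True, split=" "):
    """Approximate text_to_word_sequence in plain Python (hypothetical helper)."""
    if lower:
        text = text.lower()
    # Replace every filtered character with the split character
    text = text.translate(str.maketrans({ch: split for ch in filters}))
    # Split and discard the empty strings produced by adjacent separators
    return [token for token in text.split(split) if token]

sample = "Hi Learner! Welcome to NLP (Natural Language Processing) with Python."
print(simple_word_sequence(sample))
```

Passing `lower=False` keeps the original casing, which can matter when capitalization carries meaning, as with the acronym NLP.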
