# Natural Language Processing

The essence of Natural Language Processing (NLP) lies in making computers understand natural language. That is not an easy task. Computers handle structured data such as spreadsheets and database tables well, but human language, whether text or speech, is unstructured data that is hard for a computer to interpret. That gap is what Natural Language Processing addresses.

# Importing Libraries

In [1]:
import os
import nltk
import nltk.corpus

In [2]:
print(os.listdir(nltk.data.find("corpora")))

['abc', 'abc.zip', 'alpino', 'alpino.zip', 'biocreative_ppi', 'biocreative_ppi.zip', 'brown', 'brown.zip', 'brown_tei', 'brown_tei.zip', 'cess_cat', 'cess_cat.zip', 'cess_esp', 'cess_esp.zip', 'chat80', 'chat80.zip', 'city_database', 'city_database.zip', 'cmudict', 'cmudict.zip', 'comparative_sentences', 'comparative_sentences.zip', 'comtrans.zip', 'conll2000', 'conll2000.zip', 'conll2002', 'conll2002.zip', 'conll2007.zip', 'crubadan', 'crubadan.zip', 'dependency_treebank', 'dependency_treebank.zip', 'dolch', 'dolch.zip', 'europarl_raw', 'europarl_raw.zip', 'floresta', 'floresta.zip', 'framenet_v15', 'framenet_v15.zip', 'framenet_v17', 'framenet_v17.zip', 'gazetteers', 'gazetteers.zip', 'genesis', 'genesis.zip', 'gutenberg', 'gutenberg.zip', 'ieer', 'ieer.zip', 'inaugural', 'inaugural.zip', 'indian', 'indian.zip', 'jeita.zip', 'kimmo', 'kimmo.zip', 'knbc.zip', 'lin_thesaurus', 'lin_thesaurus.zip', 'machado.zip', 'mac_morpho', 'mac_morpho.zip', 'masc_tagged.zip', 'movie_reviews', 'movie

In [3]:
from nltk.corpus import brown
brown.words()

['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]

In [4]:
nltk.corpus.gutenberg.fileids()

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

In [5]:
hamlet = nltk.corpus.gutenberg.words('shakespeare-hamlet.txt')
hamlet

['[', 'The', 'Tragedie', 'of', 'Hamlet', 'by', ...]

In [6]:
for word in hamlet[:500]:
    print(word, end=' ')

[ The Tragedie of Hamlet by William Shakespeare 1599 ] Actus Primus . Scoena Prima . Enter Barnardo and Francisco two Centinels . Barnardo . Who ' s there ? Fran . Nay answer me : Stand & vnfold your selfe Bar . Long liue the King Fran . Barnardo ? Bar . He Fran . You come most carefully vpon your houre Bar . ' Tis now strook twelue , get thee to bed Francisco Fran . For this releefe much thankes : ' Tis bitter cold , And I am sicke at heart Barn . Haue you had quiet Guard ? Fran . Not a Mouse stirring Barn . Well , goodnight . If you do meet Horatio and Marcellus , the Riuals of my Watch , bid them make hast . Enter Horatio and Marcellus . Fran . I thinke I heare them . Stand : who ' s there ? Hor . Friends to this ground Mar . And Leige - men to the Dane Fran . Giue you good night Mar . O farwel honest Soldier , who hath relieu ' d you ? Fra . Barnardo ha ' s my place : giue you goodnight . Exit Fran . Mar . Holla Barnardo Bar . Say , what is Horatio there ? Hor . A peece of him Bar 

# Tokenization

Tokenization is one of the most common tasks when it comes to working with text data. But what does the term ‘tokenization’ actually mean?

Tokenization is essentially splitting a phrase, sentence, paragraph, or an entire text document into smaller units, such as individual words or terms. Each of these smaller units is called a token.
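To make the idea concrete, here is a minimal sketch of a tokenizer built on a regular expression. It is not how NLTK's `word_tokenize` works internally; the function name `simple_tokenize` and the pattern are assumptions chosen for illustration.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+(?:-\w+)* keeps hyphenated words such as "history-changing" together;
    # [^\w\s] matches any single character that is neither a word character
    # nor whitespace, so each punctuation mark becomes its own token.
    return re.findall(r"\w+(?:-\w+)*|[^\w\s]", text)

tokens = simple_tokenize("Can machines think? A history-changing question.")
print(tokens)
# ['Can', 'machines', 'think', '?', 'A', 'history-changing', 'question', '.']
```

A real tokenizer handles many more cases (contractions, abbreviations, quotes), which is why the notebook uses NLTK instead.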

# Importing the library for word tokenization

In [7]:
from nltk.tokenize import word_tokenize

In [8]:
AI ="""Alan Turing, a brilliant mathematician, who broke the Nazi encryption machine Enigma, came up with a history-changing question, “Can machines think?” in 1950. The actual research began in 1956, at a conference held at Dartmouth College (a lot of the inventions have come into the picture, thanks to the Ivy League). A couple of attendees at the conference were the ones who came up with the idea and also the name “Artificial Intelligence”. But since the whole idea was new, people didn’t buy the idea and funding for further research was pulled off. This period, the 1950s – 1980s was called “AI Winter”. In the early 1980s however, the Japanese government saw a future in AI and started funding the field again. As this was interconnected to the electronics and computer science fields, there was a sudden spike in those as well. The first AI machine was introduced to the world in 1997; IBM’s Deep Blue became the first computer to beat a chess champion when it defeated Russian grandmaster Garry Kasparov. And that, my dear readers, was the advent of a massive field called “AI”."""

In [9]:
type(AI)

str

# Performing word tokenization

In [10]:
AI_tokens = word_tokenize(AI)
AI_tokens 

['Alan',
 'Turing',
 ',',
 'a',
 'brilliant',
 'mathematician',
 ',',
 'who',
 'broke',
 'the',
 'Nazi',
 'encryption',
 'machine',
 'Enigma',
 ',',
 'came',
 'up',
 'with',
 'a',
 'history-changing',
 'question',
 ',',
 '“',
 'Can',
 'machines',
 'think',
 '?',
 '”',
 'in',
 '1950',
 '.',
 'The',
 'actual',
 'research',
 'began',
 'in',
 '1956',
 ',',
 'at',
 'a',
 'conference',
 'held',
 'at',
 'Dartmouth',
 'College',
 '(',
 'a',
 'lot',
 'of',
 'the',
 'inventions',
 'have',
 'come',
 'into',
 'the',
 'picture',
 ',',
 'thanks',
 'to',
 'the',
 'Ivy',
 'League',
 ')',
 '.',
 'A',
 'couple',
 'of',
 'attendees',
 'at',
 'the',
 'conference',
 'were',
 'the',
 'ones',
 'who',
 'came',
 'up',
 'with',
 'the',
 'idea',
 'and',
 'also',
 'the',
 'name',
 '“',
 'Artificial',
 'Intelligence',
 '”',
 '.',
 'But',
 'since',
 'the',
 'whole',
 'idea',
 'was',
 'new',
 ',',
 'people',
 'didn',
 '’',
 't',
 'buy',
 'the',
 'idea',
 'and',
 'funding',
 'for',
 'further',
 'research',
 'was',
 '

In [11]:
len(AI_tokens)

223

To count how often each word occurs regardless of case, lowercase every token before counting.

In [12]:
from nltk.probability import FreqDist
fdist = FreqDist()

In [13]:
for word in AI_tokens:
    fdist[word.lower()]+=1
fdist

FreqDist({'the': 20, ',': 12, 'a': 9, '.': 9, 'was': 7, 'in': 6, 'and': 5, '“': 4, '”': 4, 'to': 4, ...})

In [14]:
fdist['the']

20

In [15]:
len(fdist)

128

In [16]:
fdist_top5 = fdist.most_common(5)
fdist_top5

[('the', 20), (',', 12), ('a', 9), ('.', 9), ('was', 7)]
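The same kind of case-insensitive counting can be sketched with the standard library's `collections.Counter`, which `FreqDist` builds on. The token list `sample_tokens` below is a made-up example, not the `AI_tokens` list from this notebook.

```python
from collections import Counter

# Hypothetical token list used only for this illustration.
sample_tokens = ["The", "first", "AI", "machine", "beat", "the",
                 "champion", ".", "The", "end", "."]

# Lowercase each token so that "The" and "the" are counted together.
counts = Counter(token.lower() for token in sample_tokens)

print(counts["the"])          # 3
print(counts.most_common(2))  # [('the', 3), ('.', 2)]
print(len(counts))            # 8 distinct lowercase tokens
```

Note that punctuation marks are counted as tokens too, which is why `','` and `'.'` rank so high in the output above.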

In [17]:
from nltk.tokenize import blankline_tokenize
AI_blank = blankline_tokenize(AI)
AI_blank

['Alan Turing, a brilliant mathematician, who broke the Nazi encryption machine Enigma, came up with a history-changing question, “Can machines think?” in 1950. The actual research began in 1956, at a conference held at Dartmouth College (a lot of the inventions have come into the picture, thanks to the Ivy League). A couple of attendees at the conference were the ones who came up with the idea and also the name “Artificial Intelligence”. But since the whole idea was new, people didn’t buy the idea and funding for further research was pulled off. This period, the 1950s – 1980s was called “AI Winter”. In the early 1980s however, the Japanese government saw a future in AI and started funding the field again. As this was interconnected to the electronics and computer science fields, there was a sudden spike in those as well. The first AI machine was introduced to the world in 1997; IBM’s Deep Blue became the first computer to beat a chess champion when it defeated Russian grandmaster Garr

In [18]:
len(AI_blank)

1
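The result has length 1 because `blankline_tokenize` splits text at blank lines, and the `AI` string contains none. A rough standard-library approximation of that behaviour is sketched below; the helper name `split_on_blank_lines` and the sample strings are assumptions for illustration, not NLTK code.

```python
import re

def split_on_blank_lines(text):
    """Split text wherever a blank line occurs, dropping empty pieces."""
    pieces = re.split(r"\s*\n\s*\n\s*", text)
    return [piece for piece in pieces if piece]

one_paragraph = "Machines learn the same way. They observe."
two_paragraphs = "Machines learn.\n\nThey observe."

print(len(split_on_blank_lines(one_paragraph)))  # 1
print(split_on_blank_lines(two_paragraphs))      # ['Machines learn.', 'They observe.']
```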

# Ngrams, Bigrams and Trigrams

1) A sequence of n consecutive tokens (words) is called an n-gram.

2) A sequence of two consecutive tokens is called a bigram.

3) A sequence of three consecutive tokens is called a trigram.
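The definitions above can be sketched in a few lines of plain Python by sliding a window of size n over the token list. The helper name `make_ngrams` and the sample sentence are assumptions for illustration; the notebook itself uses NLTK's functions.

```python
def make_ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    # Build n views of the list, each shifted one position further,
    # then zip them; zip stops at the shortest view.
    return list(zip(*(tokens[i:] for i in range(n))))

words = ["They", "observe", "and", "learn", "."]

print(make_ngrams(words, 2))
# [('They', 'observe'), ('observe', 'and'), ('and', 'learn'), ('learn', '.')]
print(make_ngrams(words, 3))
# [('They', 'observe', 'and'), ('observe', 'and', 'learn'), ('and', 'learn', '.')]
```

A list of k tokens yields k - n + 1 n-grams, and none at all when n is larger than k.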

In [19]:
from nltk.util import bigrams, trigrams, ngrams

In [20]:
string = "The basic function of the algorithms of AI is data analysis. Let me put it this way. How do you think human beings learn new things? They observe. They observe and that’s how they learn. Machines learn the same way. "
quotes_tokens = nltk.word_tokenize(string)
quotes_tokens

['The',
 'basic',
 'function',
 'of',
 'the',
 'algorithms',
 'of',
 'AI',
 'is',
 'data',
 'analysis',
 '.',
 'Let',
 'me',
 'put',
 'it',
 'this',
 'way',
 '.',
 'How',
 'do',
 'you',
 'think',
 'human',
 'beings',
 'learn',
 'new',
 'things',
 '?',
 'They',
 'observe',
 '.',
 'They',
 'observe',
 'and',
 'that',
 '’',
 's',
 'how',
 'they',
 'learn',
 '.',
 'Machines',
 'learn',
 'the',
 'same',
 'way',
 '.']

In [21]:
quotes_bigrams = list(nltk.bigrams(quotes_tokens))
quotes_bigrams

[('The', 'basic'),
 ('basic', 'function'),
 ('function', 'of'),
 ('of', 'the'),
 ('the', 'algorithms'),
 ('algorithms', 'of'),
 ('of', 'AI'),
 ('AI', 'is'),
 ('is', 'data'),
 ('data', 'analysis'),
 ('analysis', '.'),
 ('.', 'Let'),
 ('Let', 'me'),
 ('me', 'put'),
 ('put', 'it'),
 ('it', 'this'),
 ('this', 'way'),
 ('way', '.'),
 ('.', 'How'),
 ('How', 'do'),
 ('do', 'you'),
 ('you', 'think'),
 ('think', 'human'),
 ('human', 'beings'),
 ('beings', 'learn'),
 ('learn', 'new'),
 ('new', 'things'),
 ('things', '?'),
 ('?', 'They'),
 ('They', 'observe'),
 ('observe', '.'),
 ('.', 'They'),
 ('They', 'observe'),
 ('observe', 'and'),
 ('and', 'that'),
 ('that', '’'),
 ('’', 's'),
 ('s', 'how'),
 ('how', 'they'),
 ('they', 'learn'),
 ('learn', '.'),
 ('.', 'Machines'),
 ('Machines', 'learn'),
 ('learn', 'the'),
 ('the', 'same'),
 ('same', 'way'),
 ('way', '.')]

In [22]:
quotes_trigrams = list(nltk.trigrams(quotes_tokens))
quotes_trigrams

[('The', 'basic', 'function'),
 ('basic', 'function', 'of'),
 ('function', 'of', 'the'),
 ('of', 'the', 'algorithms'),
 ('the', 'algorithms', 'of'),
 ('algorithms', 'of', 'AI'),
 ('of', 'AI', 'is'),
 ('AI', 'is', 'data'),
 ('is', 'data', 'analysis'),
 ('data', 'analysis', '.'),
 ('analysis', '.', 'Let'),
 ('.', 'Let', 'me'),
 ('Let', 'me', 'put'),
 ('me', 'put', 'it'),
 ('put', 'it', 'this'),
 ('it', 'this', 'way'),
 ('this', 'way', '.'),
 ('way', '.', 'How'),
 ('.', 'How', 'do'),
 ('How', 'do', 'you'),
 ('do', 'you', 'think'),
 ('you', 'think', 'human'),
 ('think', 'human', 'beings'),
 ('human', 'beings', 'learn'),
 ('beings', 'learn', 'new'),
 ('learn', 'new', 'things'),
 ('new', 'things', '?'),
 ('things', '?', 'They'),
 ('?', 'They', 'observe'),
 ('They', 'observe', '.'),
 ('observe', '.', 'They'),
 ('.', 'They', 'observe'),
 ('They', 'observe', 'and'),
 ('observe', 'and', 'that'),
 ('and', 'that', '’'),
 ('that', '’', 's'),
 ('’', 's', 'how'),
 ('s', 'how', 'they'),
 ('how', 'they

# Stemming

Stemming involves normalizing a word into its base or root form.

Ex: Consider the words: Affect, Affection, Affected, Affecting.
The base or root form of the above words is "Affect".

Note: NLTK provides three main stemmers:

1) PorterStemmer

2) LancasterStemmer

3) SnowballStemmer
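To give a feel for what a stemmer does, here is a deliberately naive suffix-stripping sketch. It is not the Porter, Lancaster, or Snowball algorithm; the function name `naive_stem` and its three-suffix rule list are assumptions made up for this example.

```python
def naive_stem(word):
    """Lowercase the word and strip one common suffix if enough letters remain."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters would be left.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["Affecting", "Affected", "Affects", "Affect"]:
    print(w, "->", naive_stem(w))
# Affecting -> affect
# Affected -> affect
# Affects -> affect
# Affect -> affect
```

This toy version leaves "Affection" untouched and turns "Loving" into "lov", which hints at why real stemmers need a much larger set of rules.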

In [23]:
from nltk.stem import PorterStemmer
pst = PorterStemmer()

In [24]:
pst.stem("Loving")

'love'

In [25]:
words_to_stem = ['Killing', 'Saving', 'Protecting', 'Served', 'loved']

for words in words_to_stem:
    print(words + ":" + pst.stem(words))

Killing:kill
Saving:save
Protecting:protect
Served:serv
loved:love


In [26]:
from nltk.stem import LancasterStemmer
lst = LancasterStemmer()

for words in words_to_stem:
    print(words + ":" + lst.stem(words))


Killing:kil
Saving:sav
Protecting:protect
Served:serv
loved:lov


In [27]:
from nltk.stem import SnowballStemmer
sbst = SnowballStemmer('english') 



In [28]:
for words in words_to_stem:
    print(words + ":" + sbst.stem(words))

Killing:kill
Saving:save
Protecting:protect
Served:serv
loved:love


# Lemmatization

1) Groups together the different inflected forms of a word under a single base form, called the lemma.

2) Similar to stemming in that it maps related words to one common root.

3) Unlike stemming, the outcome of lemmatization is a proper word.

For example, the words "going" and "gone", when lemmatized as verbs, should return "go" as the result.

Lemmatization typically relies on a lexical database such as WordNet.
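The lookup idea behind lemmatization can be sketched with a small dictionary keyed by word and part of speech. The table `TOY_LEMMAS` and the function `toy_lemmatize` are hypothetical stand-ins for a real lexical database; they only illustrate why the part of speech matters.

```python
# Hypothetical lookup table: (inflected form, part of speech) -> lemma.
TOY_LEMMAS = {
    ("going", "v"): "go",
    ("gone", "v"): "go",
    ("went", "v"): "go",
    ("corpora", "n"): "corpus",
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos) if known, otherwise the word unchanged."""
    return TOY_LEMMAS.get((word.lower(), pos), word)

print(toy_lemmatize("gone", pos="v"))  # go
print(toy_lemmatize("gone"))           # gone (no noun entry, so unchanged)
print(toy_lemmatize("corpora"))        # corpus
```

NLTK's `WordNetLemmatizer.lemmatize` also defaults to treating its input as a noun, so verb forms usually need an explicit `pos` argument to be reduced.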

In [29]:
from nltk.stem import wordnet
from nltk.stem import WordNetLemmatizer

word_lem = WordNetLemmatizer()

In [30]:
word_lem.lemmatize('corpora')

'corpus'

In [31]:
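# lemmatize() treats each word as a noun unless a pos argument is given,
# so these verb forms come back unchanged.
for words in words_to_stem:
    print(words + ":" + word_lem.lemmatize(words))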

Killing:Killing
Saving:Saving
Protecting:Protecting
Served:Served
loved:loved


In [32]:
from nltk.corpus import stopwords

stopwords.words('english')

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [33]:
len(stopwords.words('english'))

179

In [34]:
fdist_top5

[('the', 20), (',', 12), ('a', 9), ('.', 9), ('was', 7)]

In [36]:
import re 
punctuation = re.compile(r'[-.?!:;()|0-9]')
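One way the stop word list and the `punctuation` pattern could be combined to clean a token list is sketched below. The small set `toy_stopwords` stands in for `stopwords.words('english')`, and `sample_tokens` is a made-up list; both are assumptions for illustration.

```python
import re

# Same pattern as in the cell above: hyphen, period, ?, !, :, ;, parentheses, | and digits.
punctuation = re.compile(r'[-.?!:;()|0-9]')

# Assumption: a tiny hand-picked set instead of NLTK's full English list.
toy_stopwords = {"the", "a", "was", "in", "to", "and"}

sample_tokens = ["The", "research", "began", "in", "1956", ",",
                 "at", "a", "conference", "."]

cleaned = []
for token in sample_tokens:
    stripped = punctuation.sub("", token)  # drop characters matched by the pattern
    # Keep the token only if something is left and it is not a stop word.
    if stripped and stripped.lower() not in toy_stopwords:
        cleaned.append(stripped)

print(cleaned)
# ['research', 'began', ',', 'at', 'conference']
```

The comma survives because the pattern as written does not include it, and "at" survives only because it is missing from the toy set; NLTK's English list does contain it.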

# Parts of Speech Tagging

In [39]:
sentence = "Mama is a natural when it comes to kicking ass."
sent_tokens = word_tokenize(sentence)
len(sent_tokens)
sent_tokens

['Mama',
 'is',
 'a',
 'natural',
 'when',
 'it',
 'comes',
 'to',
 'kicking',
 'ass',
 '.']

In [40]:
len(sent_tokens)

11

In [41]:
# Each token is passed to pos_tag on its own, so the tagger sees no sentence context.
for token in sent_tokens:
    print(nltk.pos_tag([token]))

[('Mama', 'NN')]
[('is', 'VBZ')]
[('a', 'DT')]
[('natural', 'JJ')]
[('when', 'WRB')]
[('it', 'PRP')]
[('comes', 'VBZ')]
[('to', 'TO')]
[('kicking', 'VBG')]
[('ass', 'NN')]
[('.', '.')]


In [42]:
sent2 = "Elise kicked Arno in the balls."
sen2_tokens = word_tokenize(sent2)

sen2_tokens

['Elise', 'kicked', 'Arno', 'in', 'the', 'balls', '.']

In [43]:
# Each token is passed to pos_tag on its own, so the tagger sees no sentence context.
for token in sen2_tokens:
    print(nltk.pos_tag([token]))

[('Elise', 'NN')]
[('kicked', 'VBN')]
[('Arno', 'NN')]
[('in', 'IN')]
[('the', 'DT')]
[('balls', 'NNS')]
[('.', '.')]


# Named Entity Recognition

Named Entity Recognition (NER) is the detection of named entities in text, such as a person, an organization, a location, a monetary value, or a movie title.

We import ne_chunk to perform Named Entity Recognition.

In [44]:
from nltk import ne_chunk

In [48]:
line = "The US President resides in the White House."

line_token = word_tokenize(line)

line_tag = nltk.pos_tag(line_token)

In [49]:
line_ner = ne_chunk(line_tag)
line_ner

Tree('S', [('The', 'DT'), Tree('ORGANIZATION', [('US', 'NNP')]), ('President', 'NNP'), ('resides', 'VBZ'), ('in', 'IN'), ('the', 'DT'), Tree('FACILITY', [('White', 'NNP'), ('House', 'NNP')]), ('.', '.')])

# Syntax Tree

A syntax tree is a tree representation of the syntactic structure of a sentence or string.
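As a rough illustration, the chunk tree printed in the Named Entity Recognition section can be mirrored with plain nested tuples and walked recursively. This sketch does not use NLTK's `Tree` class; the tuple layout and the helper names `is_leaf`, `leaves`, and `labeled_chunks` are assumptions chosen for the example.

```python
# Each subtree is (label, [children]); each leaf is a (word, tag) pair of strings.
tree = ("S", [
    ("The", "DT"),
    ("ORGANIZATION", [("US", "NNP")]),
    ("President", "NNP"),
    ("resides", "VBZ"),
    ("in", "IN"),
    ("the", "DT"),
    ("FACILITY", [("White", "NNP"), ("House", "NNP")]),
    (".", "."),
])

def is_leaf(node):
    # A leaf's second element is a tag string; a subtree's is a list of children.
    return isinstance(node[1], str)

def leaves(node):
    """Collect the words under a node, left to right."""
    if is_leaf(node):
        return [node[0]]
    words = []
    for child in node[1]:
        words.extend(leaves(child))
    return words

def labeled_chunks(node):
    """Return (label, phrase) for every subtree below the given node."""
    found = []
    for child in node[1]:
        if not is_leaf(child):
            found.append((child[0], " ".join(leaves(child))))
            found.extend(labeled_chunks(child))
    return found

print(leaves(tree))
# ['The', 'US', 'President', 'resides', 'in', 'the', 'White', 'House', '.']
print(labeled_chunks(tree))
# [('ORGANIZATION', 'US'), ('FACILITY', 'White House')]
```

Reading the leaves left to right recovers the original sentence, while the labeled inner nodes carry the structure, here the named-entity chunks.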

Notebook author - Sathvik.