# Natural Language Processing Tasks

This notebook walks through six common natural language processing (NLP) tasks:
1. Tokenization
2. Stopword Removal
3. N-Grams
4. Word Sense Disambiguation
5. Part-of-Speech Tagging
6. Stemming

1. Tokenization: Breaking a text down into sentences and words.

In [66]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/Administrator/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/Administrator/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/Administrator/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/Administrator/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [70]:
text="Mary had a little lamb. Her fleece was white as snow"
from nltk.tokenize import word_tokenize, sent_tokenize
sents=sent_tokenize(text)
print(sents)

['Mary had a little lamb.', 'Her fleece was white as snow']


In [15]:
words=[word_tokenize(sent) for sent in sents]
print(words)

[['Mary', 'had', 'a', 'little', 'lamb', '.'], ['Her', 'fleece', 'was', 'white', 'as', 'snow']]
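
To make the idea concrete, here is a minimal sketch of tokenization using only the standard library. It is a rough approximation and not the Punkt model behind `sent_tokenize`: the helper names `simple_sent_tokenize` and `simple_word_tokenize` are hypothetical, and the regular expressions assume plain English text without abbreviations such as "Dr.".

```python
import re

def simple_sent_tokenize(text):
    # Split after '.', '!' or '?' when followed by whitespace.
    # Naive assumption: every such mark ends a sentence (abbreviations will break it).
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def simple_word_tokenize(sentence):
    # A token is either a run of word characters or one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", sentence)

text = "Mary had a little lamb. Her fleece was white as snow"
sents = simple_sent_tokenize(text)
words = [simple_word_tokenize(sent) for sent in sents]
print(sents)
print(words)
```

On this short text the sketch yields the same sentences and tokens as the NLTK calls above; on real-world text the trained tokenizers are likely to be far more reliable.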


2. Stopword removal: Filtering out very common words (such as "a", "had", "was") and punctuation, which carry little meaning on their own.

In [16]:
from nltk.corpus import stopwords
from string import punctuation
customStopWords=set(stopwords.words('english')+list(punctuation))

In [17]:
wordsWOStopwords=[word for word in word_tokenize(text) if word not in customStopWords]
print(wordsWOStopwords)

['Mary', 'little', 'lamb', 'Her', 'fleece', 'white', 'snow']
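
Note that 'Her' survives in the output above: the membership test is case-sensitive and NLTK's English stopword list is lowercase. The sketch below shows one possible fix, lowercasing each token only for the comparison. To stay self-contained it uses a small hand-picked set, `SMALL_STOPWORDS`, which is an assumption for illustration and not the NLTK corpus.

```python
from string import punctuation

# Hypothetical, hand-picked subset of English stopwords (all lowercase).
SMALL_STOPWORDS = {"a", "as", "had", "her", "was"}
stop = SMALL_STOPWORDS | set(punctuation)

tokens = ['Mary', 'had', 'a', 'little', 'lamb', '.',
          'Her', 'fleece', 'was', 'white', 'as', 'snow']

# Compare in lowercase, but keep the original spelling in the result.
filtered = [tok for tok in tokens if tok.lower() not in stop]
print(filtered)
```

With the full NLTK list, the same change would be `word.lower() not in customStopWords` in the list comprehension above.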


3. N-Grams (Bigrams): Identifying commonly occurring groups of words.

In [18]:
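# Import the two names explicitly instead of using a wildcard import.
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
# Association measures (e.g. PMI) for ranking bigrams; not used in this cell.
bigram_measures=BigramAssocMeasures()
# Count every pair of adjacent tokens in the stopword-filtered list.
finder=BigramCollocationFinder.from_words(wordsWOStopwords)
sorted(finder.ngram_fd.items())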

[(('Her', 'fleece'), 1),
 (('Mary', 'little'), 1),
 (('fleece', 'white'), 1),
 (('lamb', 'Her'), 1),
 (('little', 'lamb'), 1),
 (('white', 'snow'), 1)]
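
The same counts can be obtained without NLTK, which may help show what the finder is doing. This is a minimal sketch using `collections.Counter`; the variable names are illustrative, and the token list is copied from the stopword-removal output above.

```python
from collections import Counter

tokens = ['Mary', 'little', 'lamb', 'Her', 'fleece', 'white', 'snow']

# Pair each token with its successor, then count how often each pair occurs.
bigram_counts = Counter(zip(tokens, tokens[1:]))

for pair, count in sorted(bigram_counts.items()):
    print(pair, count)
```

Every count is 1 because the text is so short. Pairs such as ('lamb', 'Her') also cross the sentence boundary, since the stopwords and the full stop were removed before the tokens were paired.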

4. Word sense disambiguation: Identifying the meaning of a word in the context where it occurs.

In [42]:
from nltk.corpus import wordnet as wn
for ss in wn.synsets('white'):
    print(ss,ss.definition())

Synset('white.n.01') a member of the Caucasoid race
Synset('white.n.02') the quality or state of the achromatic color of greatest lightness (bearing the least resemblance to black)
Synset('white.n.03') United States jurist appointed chief justice of the United States Supreme Court in 1910 by President Taft; noted for his work on antitrust legislation (1845-1921)
Synset('white.n.04') Australian writer (1912-1990)
Synset('white.n.05') United States political journalist (1915-1986)
Synset('white.n.06') United States architect (1853-1906)
Synset('white.n.07') United States writer noted for his humorous essays (1899-1985)
Synset('white.n.08') United States educator who in 1865 (with Ezra Cornell) founded Cornell University and served as its first president (1832-1918)
Synset('white.n.09') a tributary of the Mississippi River that flows southeastward through northern Arkansas and southern Missouri
Synset('egg_white.n.01') the white part of an egg; the nutritive and protective gelatinous subs

In [64]:
from nltk.wsd import lesk
sense1=lesk(word_tokenize("dressed in white"),'white')
print(sense1,sense1.definition())

Synset('flannel.n.03') (usually in the plural) trousers made of flannel or gabardine or tweed or white cloth
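
The surprising choice of `flannel.n.03` is consistent with how gloss-overlap methods in the spirit of Lesk behave: a sense is scored by how many context words also appear in its definition. The sketch below is a simplified illustration under that assumption, not NLTK's implementation. The function `simplified_lesk` is hypothetical, and the two glosses are copied from the WordNet output shown in this section.

```python
def simplified_lesk(context_tokens, glosses):
    # Score each sense by the number of distinct context tokens
    # that also occur in its whitespace-split gloss.
    context = set(context_tokens)
    scores = {sense: len(context & set(gloss.split()))
              for sense, gloss in glosses.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Two candidate senses of 'white', with glosses taken from the output above.
glosses = {
    "white.n.02": "the quality or state of the achromatic color of greatest "
                  "lightness (bearing the least resemblance to black)",
    "flannel.n.03": "(usually in the plural) trousers made of flannel or "
                    "gabardine or tweed or white cloth",
}

best, scores = simplified_lesk("dressed in white".split(), glosses)
print(best, scores)
```

Here the flannel gloss shares both "in" and "white" with the context, while the color gloss shares nothing, so the clothing sense wins. With such a short context, function words like "in" can dominate the score.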


5. Part-of-speech tagging: Identifying nouns, verbs, and other parts of speech in the text.


In [67]:
nltk.pos_tag(word_tokenize(text))

[('Mary', 'NNP'),
 ('had', 'VBD'),
 ('a', 'DT'),
 ('little', 'JJ'),
 ('lamb', 'NN'),
 ('.', '.'),
 ('Her', 'PRP$'),
 ('fleece', 'NN'),
 ('was', 'VBD'),
 ('white', 'JJ'),
 ('as', 'IN'),
 ('snow', 'NN')]
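
The tags follow the Penn Treebank convention, in which every noun tag starts with `NN` (`NN`, `NNS`, `NNP`, `NNPS`). As a small usage sketch, the tagged pairs can be grouped by tag or filtered down to nouns. The list below is copied from the output above so the example runs without the tagger; the names `by_tag` and `nouns` are illustrative.

```python
from collections import defaultdict

# (word, tag) pairs copied from the pos_tag output above.
tagged = [('Mary', 'NNP'), ('had', 'VBD'), ('a', 'DT'), ('little', 'JJ'),
          ('lamb', 'NN'), ('.', '.'), ('Her', 'PRP$'), ('fleece', 'NN'),
          ('was', 'VBD'), ('white', 'JJ'), ('as', 'IN'), ('snow', 'NN')]

# Group the words under each tag.
by_tag = defaultdict(list)
for word, tag in tagged:
    by_tag[tag].append(word)

# Keep only nouns: all Penn Treebank noun tags begin with 'NN'.
nouns = [word for word, tag in tagged if tag.startswith('NN')]
print(nouns)
print(dict(by_tag))
```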

6. Stemming: Reducing words to a common stem by stripping their endings.

In [71]:
from nltk.stem.lancaster import LancasterStemmer
st=LancasterStemmer()
stemmedWords=[st.stem(word) for word in word_tokenize(text)]
print(stemmedWords)

['mary', 'had', 'a', 'littl', 'lamb', '.', 'her', 'fleec', 'was', 'whit', 'as', 'snow']


In [72]:
text2="Mary closed on closing night when she was in the mood to close."
stemmedWords=[st.stem(word) for word in word_tokenize(text2)]
print(stemmedWords)

['mary', 'clos', 'on', 'clos', 'night', 'when', 'she', 'was', 'in', 'the', 'mood', 'to', 'clos', '.']
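
To illustrate the idea of suffix stripping, here is a toy stemmer. It is emphatically not the Lancaster algorithm, which applies a much larger rule table. The `SUFFIXES` tuple, the `toy_stem` function and the `min_stem` guard are all assumptions made for this sketch.

```python
# Hypothetical rule list: try the longer suffixes first.
SUFFIXES = ("ing", "ed", "e")

def toy_stem(word, min_stem=3):
    # Lowercase, then strip the first matching suffix, but only if
    # at least `min_stem` characters would remain (protects 'the', 'she').
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word

tokens = ['Mary', 'closed', 'on', 'closing', 'night', 'when', 'she',
          'was', 'in', 'the', 'mood', 'to', 'close', '.']
stems = [toy_stem(tok) for tok in tokens]
print(stems)
```

On this one sentence the toy rules map "closed", "closing" and "close" to the same stem "clos", as the Lancaster stemmer does above. On other text the two will generally diverge.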
