# Topics
#### Task

1. Tokenization  
2. Stopword Removal  
3. N-Grams  
4. Stemming  
5. Word Sense Disambiguation  


## 1. Tokenization
Tokenization splits a text, or a collection of texts, into smaller units called tokens, typically individual words or sentences.
<img src="Image\token.JPG" width=300>

- Word Tokenization
- Sentence Tokenization

In [3]:
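#Tokenization
import nltk
#nltk.download('punkt')  #uncomment on first run; newer NLTK versions may ask for 'punkt_tab' instead
from nltk.tokenize import sent_tokenize, word_tokenize

text = "You are ready to learn and do your best. but you are also nervous."
#split the text into sentences
sents = sent_tokenize(text)
print(sents)
#split the whole text into word and punctuation tokens
print(word_tokenize(text))
#tokenize each sentence separately, giving one token list per sentence
words = [word_tokenize(sent) for sent in sents]
print(words)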

['You are ready to learn and do your best.', 'but you are also nervous.']
['You', 'are', 'ready', 'to', 'learn', 'and', 'do', 'your', 'best', '.', 'but', 'you', 'are', 'also', 'nervous', '.']
[['You', 'are', 'ready', 'to', 'learn', 'and', 'do', 'your', 'best', '.'], ['but', 'you', 'are', 'also', 'nervous', '.']]
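
To make the idea concrete without NLTK, here is a minimal sketch of word and sentence tokenization using regular expressions. The function names `simple_word_tokenize` and `simple_sent_tokenize` are made up for this example, and the rules are far cruder than what `word_tokenize` and `sent_tokenize` do (no handling of abbreviations, contractions, etc.).

```python
import re

def simple_word_tokenize(text):
    # \w+ matches a run of letters, digits or underscores;
    # [^\w\s] matches a single punctuation character
    return re.findall(r"\w+|[^\w\s]", text)

def simple_sent_tokenize(text):
    # split on whitespace that follows '.', '!' or '?'
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

text = "You are ready to learn and do your best. but you are also nervous."
sentences = simple_sent_tokenize(text)
tokens = simple_word_tokenize(text)
print(sentences)
print(tokens)
```

On this sentence the sketch happens to give the same split as the NLTK output, but it would break on inputs such as "Dr. Smith" or "don't".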


## 2. Stopword Removal
A stop word is a very common word (such as “the”, “a”, “an”, “in”) that carries little meaning on its own, so search engines and NLP pipelines are often configured to ignore it.
<img src="Image\stop.jpg" width=400>


In [5]:
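#Removing Stopwords
import nltk
#nltk.download('stopwords')  #uncomment on first run to fetch the stopword lists
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from string import punctuation
text = "You are ready to learn and do your best. but you are also nervous."
#build a set of English stopwords plus punctuation characters
customstopwords = set(stopwords.words('english') + list(punctuation))
#keep only the tokens that are not in the set; the match is case-sensitive,
#so the capitalized 'You' is kept even though 'you' is a stopword
wordslist = [word for word in word_tokenize(text) if word not in customstopwords]
print(wordslist)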

['You', 'ready', 'learn', 'best', 'also', 'nervous']
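
Note that 'You' survives in the output because the stopword list is lowercase and the comparison is case-sensitive. A minimal sketch of a case-insensitive filter is shown below. `SMALL_STOPWORDS` is a hypothetical, deliberately tiny list used only so the example runs without NLTK data; it is not NLTK's English list.

```python
from string import punctuation

# Hypothetical mini stopword list (assumption for this sketch only)
SMALL_STOPWORDS = {"you", "are", "to", "and", "do", "your", "but"}
PUNCT = set(punctuation)

def remove_stopwords(tokens, stopwords=SMALL_STOPWORDS):
    # lowercase each token before the lookup so 'You' and 'you' are treated alike
    return [t for t in tokens if t.lower() not in stopwords and t not in PUNCT]

tokens = ['You', 'are', 'ready', 'to', 'learn', 'and', 'do', 'your', 'best', '.',
          'but', 'you', 'are', 'also', 'nervous', '.']
filtered = remove_stopwords(tokens)
print(filtered)
```

Lowercasing only for the lookup keeps the original casing of the tokens that remain.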


## 3. N-Grams
An n-gram is a contiguous sequence of n items from a given sample of text or speech.

<img src="Image\n-grams.jpg" width=300>

- Example use: the next-word suggestions shown while typing can be built from n-gram counts


In [6]:
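#N-grams
import nltk
from nltk.collocations import BigramCollocationFinder
bigram_measures = nltk.collocations.BigramAssocMeasures()
#trigram_measures = nltk.collocations.TrigramAssocMeasures()
#count the bigrams in wordslist, the stopword-filtered tokens from the previous cell
finder = BigramCollocationFinder.from_words(wordslist)
#list every bigram with its count, sorted alphabetically (not by importance)
sorted(finder.ngram_fd.items())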

[(('You', 'ready'), 1),
 (('also', 'nervous'), 1),
 (('best', 'also'), 1),
 (('learn', 'best'), 1),
 (('ready', 'learn'), 1)]
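
As a rough illustration of what an n-gram is and how n-gram counts can drive typing suggestions, here is a plain-Python sketch. The helper names `ngrams` and `build_suggester` and the toy `corpus` are assumptions made for this example; real suggestion systems are trained on far more text and use smoothing.

```python
from collections import Counter, defaultdict

def ngrams(tokens, n):
    # zip n shifted copies of the list to get every run of n adjacent tokens
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ['You', 'ready', 'learn', 'best', 'also', 'nervous']
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams)
print(trigrams)

def build_suggester(token_list):
    # for each word, count which words follow it
    following = defaultdict(Counter)
    for first, second in ngrams(token_list, 2):
        following[first][second] += 1
    return following

# Toy corpus (assumption): 'like' is followed by 'tea' twice and 'coffee' once
corpus = "i like tea . i like coffee . i like tea with milk".split()
following = build_suggester(corpus)
suggestion = following["like"].most_common(1)[0][0]
print(suggestion)
```

The suggestion is simply the word that most often followed the typed word in the corpus.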

## 4. Stemming
Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form. The stem does not have to be a valid dictionary word.

<img src="Image\stem.jpg" width=300>

In [7]:
#Stemming
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem.lancaster import LancasterStemmer
#the Lancaster stemmer is an aggressive rule-based stemmer
st = LancasterStemmer()
new_text = "It is important to by very pythonly while you are pythoning with python. All pythoners have pythoned poorly at least once."
#stem every token; the results are lowercased and are not always real words
stemmedwords = [st.stem(word) for word in word_tokenize(new_text)]
print(stemmedwords)

['it', 'is', 'import', 'to', 'by', 'very', 'python', 'whil', 'you', 'ar', 'python', 'with', 'python', '.', 'al', 'python', 'hav', 'python', 'poor', 'at', 'least', 'ont', '.']
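
To show the basic idea behind rule-based stemming, here is a naive suffix-stripping sketch. It is not the Lancaster algorithm: the `SUFFIXES` tuple, the `naive_stem` function and the `min_stem` parameter are hypothetical and only illustrate the principle of cutting known endings off a word.

```python
# Hypothetical suffix list, checked in this order (longer endings first)
SUFFIXES = ("ing", "ers", "ed", "ly", "er", "s")

def naive_stem(word, min_stem=3):
    word = word.lower()
    for suffix in SUFFIXES:
        # strip the suffix only if at least min_stem characters remain
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word

words = ["pythoning", "pythoners", "pythoned", "pythonly", "python", "is"]
stems = [naive_stem(w) for w in words]
print(stems)
```

The `min_stem` guard is what keeps short words such as "is" from being cut down to a single letter.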


In [8]:
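#Part-of-speech tagging: label each token with its Penn Treebank tag
#nltk.download('averaged_perceptron_tagger')  #uncomment on first run to fetch the tagger model
#new_text and word_tokenize come from the stemming cell above
nltk.pos_tag(word_tokenize(new_text))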

[('It', 'PRP'),
 ('is', 'VBZ'),
 ('important', 'JJ'),
 ('to', 'TO'),
 ('by', 'IN'),
 ('very', 'RB'),
 ('pythonly', 'RB'),
 ('while', 'IN'),
 ('you', 'PRP'),
 ('are', 'VBP'),
 ('pythoning', 'VBG'),
 ('with', 'IN'),
 ('python', 'NN'),
 ('.', '.'),
 ('All', 'DT'),
 ('pythoners', 'NNS'),
 ('have', 'VBP'),
 ('pythoned', 'VBN'),
 ('poorly', 'RB'),
 ('at', 'IN'),
 ('least', 'JJS'),
 ('once', 'RB'),
 ('.', '.')]
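
The tags follow the Penn Treebank convention, where verb tags start with `VB`, noun tags with `NN` and adverb tags with `RB`. As a small sketch of how tagged output might be used, the snippet below filters words by tag prefix. The `tagged` list is copied from the second sentence of the output above, and `words_with_tag` is a hypothetical helper name.

```python
# (word, tag) pairs copied from the tagger output for the second sentence
tagged = [('All', 'DT'), ('pythoners', 'NNS'), ('have', 'VBP'), ('pythoned', 'VBN'),
          ('poorly', 'RB'), ('at', 'IN'), ('least', 'JJS'), ('once', 'RB'), ('.', '.')]

def words_with_tag(pairs, prefix):
    # keep the words whose tag starts with the given prefix, e.g. 'VB' for verbs
    return [word for word, tag in pairs if tag.startswith(prefix)]

verbs = words_with_tag(tagged, 'VB')
nouns = words_with_tag(tagged, 'NN')
adverbs = words_with_tag(tagged, 'RB')
print(verbs, nouns, adverbs)
```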

## 5. Word Sense Disambiguation
Word sense disambiguation (WSD) identifies which sense (i.e. meaning) of a word is used in a sentence when the word has multiple meanings. 
<img src="Image\wordsense.jpg" width=300>

In [9]:
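#Word Sense Disambiguation
import nltk
from nltk.corpus import wordnet as wn
#nltk.download('wordnet')  #uncomment on first run to fetch WordNet
#list every WordNet sense (synset) of 'mouse' together with its definition
for ss in wn.synsets('mouse'):
    print(ss, ss.definition())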

Synset('mouse.n.01') any of numerous small rodents typically resembling diminutive rats having pointed snouts and small ears on elongated bodies with slender usually hairless tails
Synset('shiner.n.01') a swollen bruise caused by a blow to the eye
Synset('mouse.n.03') person who is quiet or timid
Synset('mouse.n.04') a hand-operated electronic device that controls the coordinates of a cursor on your computer screen as you move it around on a pad; on the bottom of the device is a ball that rolls on the surface of the pad
Synset('sneak.v.01') to go stealthily or furtively
Synset('mouse.v.02') manipulate the mouse of a computer


In [13]:
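from nltk.wsd import lesk
from nltk.tokenize import sent_tokenize, word_tokenize
#lesk() picks the sense whose definition shares the most words with the context,
#so with a short context it can return an unexpected sense (see the third result)
sense1 = lesk(word_tokenize("Sing in a lower tone, along with the bass"), 'bass')
print(sense1, sense1.definition())

sense2 = lesk(word_tokenize("The sea bass really very hard to catch"), 'bass')
print(sense2, sense2.definition())

#passing pos='n' to lesk() would restrict the candidates to noun senses
sense3 = lesk(word_tokenize("Cat is chasing the mouse"), 'mouse')
print(sense3, sense3.definition())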


Synset('bass.n.07') the member with the lowest range of a family of musical instruments
Synset('sea_bass.n.01') the lean flesh of a saltwater fish of the family Serranidae
Synset('mouse.v.02') manipulate the mouse of a computer
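
The overlap idea behind Lesk can be sketched in a few lines of plain Python. Everything here is an assumption made for illustration: the `SENSES` inventory with its hand-written glosses, the `STOP` set and the `simplified_lesk` function are not taken from WordNet or NLTK, and the glosses were chosen so that the example contexts overlap with them.

```python
# Hypothetical mini sense inventory: sense name -> short hand-written gloss
SENSES = {
    "mouse": {
        "mouse.animal": "small rodent with a pointed snout often hunted by a cat",
        "mouse.device": "hand-operated device that moves the cursor on a computer screen",
    }
}
# Hypothetical stopword set so that function words do not count as overlap
STOP = {"a", "the", "is", "that", "on", "with", "by", "often", "to"}

def simplified_lesk(context_tokens, word, senses=SENSES):
    context = {t.lower() for t in context_tokens} - STOP
    best_sense, best_overlap = None, -1
    for sense, gloss in senses[word].items():
        # score a sense by how many content words its gloss shares with the context
        overlap = len(context & (set(gloss.split()) - STOP))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

animal = simplified_lesk("Cat is chasing the mouse".split(), "mouse")
device = simplified_lesk("Click the mouse to move the cursor".split(), "mouse")
print(animal, device)
```

With only one overlapping word deciding each case, the result is fragile, which is also why the NLTK call above can return the computer sense for a sentence about a cat.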
