# Topics
#### Task

1. Tokenization  
2. Stopword Removal  
3. N-Grams  
4. Stemming  
5. Word Sense Disambiguation  
6. CountVectorizer
7. TF-IDF (TfidfVectorizer)
8. HashingVectorizer

## 1. Tokenization
Tokenization breaks a text, or a collection of texts, into smaller units called tokens, typically words or sentences.
<img src="Image/token.JPG" width="300" />

- Word Tokenization
- Sentence Tokenization

In [3]:
#Tokenization
import nltk
#nltk.download('punkt')
from nltk.tokenize import sent_tokenize, word_tokenize

text = "You are ready to learn and do your best. but you are also nervous."
sents = (sent_tokenize(text)) 
print(sents)
print(word_tokenize(text))
words = [word_tokenize(sent) for sent in sents]
print(words)

['You are ready to learn and do your best.', 'but you are also nervous.']
['You', 'are', 'ready', 'to', 'learn', 'and', 'do', 'your', 'best', '.', 'but', 'you', 'are', 'also', 'nervous', '.']
[['You', 'are', 'ready', 'to', 'learn', 'and', 'do', 'your', 'best', '.'], ['but', 'you', 'are', 'also', 'nervous', '.']]
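
To see roughly what a word tokenizer does, here is a minimal regex-based sketch that uses only the standard library. It is not the algorithm NLTK uses, just an approximation that separates runs of word characters from punctuation; `simple_word_tokenize` is a hypothetical helper name.

```python
import re

def simple_word_tokenize(text):
    # Hypothetical helper: a token is either a run of word characters
    # or a single non-space, non-word character (e.g. punctuation).
    return re.findall(r"\w+|[^\w\s]", text)

text = "You are ready to learn and do your best. but you are also nervous."
tokens = simple_word_tokenize(text)
print(tokens)
```

For this sentence the sketch should give the same tokens as `word_tokenize`, but it will differ on harder cases such as contractions or abbreviations.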


## 2. Stopword Removal
A stop word is a very common word (such as “the”, “a”, “an”, “in”) that carries little meaning on its own, so search engines and text-processing pipelines are often configured to ignore it.
<img src="Image/stop.jpg">


In [5]:
#Removing Stopwords
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from string import punctuation
text = "You are ready to learn and do your best. but you are also nervous."
#build a set of stop words and punctuation marks to filter out
customstopwords=set(stopwords.words('english')+list(punctuation))
wordslist=[word for word in word_tokenize(text) if word not in customstopwords]
print(wordslist)

['You', 'ready', 'learn', 'best', 'also', 'nervous']
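
Note that 'You' survives in the output above: the NLTK stop word list is lowercase and the membership test is case-sensitive. A minimal sketch of a case-insensitive filter follows. The small `stop_words` set is a hand-picked stand-in for `stopwords.words('english')`, assumed here so that the example runs without downloading NLTK data.

```python
from string import punctuation

# Assumption: a tiny hand-picked stop word set standing in for
# stopwords.words('english'), so the sketch runs without NLTK data.
stop_words = {"you", "are", "to", "and", "do", "your", "but"}
ignore = stop_words | set(punctuation)

tokens = ['You', 'are', 'ready', 'to', 'learn', 'and', 'do', 'your',
          'best', '.', 'but', 'you', 'are', 'also', 'nervous', '.']

# Compare the lowercased token so that 'You' is treated like 'you'.
filtered = [tok for tok in tokens if tok.lower() not in ignore]
print(filtered)
```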


## 3. N-Grams
An n-gram is a contiguous sequence of n items from a given sample of text or speech.

<img src="Image/n-grams.jpg" width="300" />

- Example: the next-word suggestions shown while typing can be built from n-gram counts.


In [6]:
#N-grams
import nltk
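from nltk.collocations import BigramCollocationFinder
#association measures (e.g. PMI) for ranking bigrams; not used in this cell
bigram_measures = nltk.collocations.BigramAssocMeasures()
#trigram_measures = nltk.collocations.TrigramAssocMeasures()
#wordslist is the stop-word-filtered token list from the previous cell
finder = BigramCollocationFinder.from_words(wordslist)
#bigrams with their counts, sorted alphabetically (not by importance)
sorted(finder.ngram_fd.items())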

[(('You', 'ready'), 1),
 (('also', 'nervous'), 1),
 (('best', 'also'), 1),
 (('learn', 'best'), 1),
 (('ready', 'learn'), 1)]
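
The same bigrams can be produced without NLTK. The sketch below slides a window of size n over a token list; `ngrams` is a hypothetical helper name, and the token list is copied from the stop word removal output.

```python
def ngrams(tokens, n):
    # Slide a window of size n over the token list by zipping
    # the list with copies of itself shifted by 1, 2, ..., n-1.
    return list(zip(*(tokens[i:] for i in range(n))))

wordslist = ['You', 'ready', 'learn', 'best', 'also', 'nervous']
bigrams = ngrams(wordslist, 2)
trigrams = ngrams(wordslist, 3)
print(bigrams)
print(trigrams)
```

Because the stop words were removed first, pairs such as ('You', 'ready') are not adjacent in the original sentence. Run the function on the unfiltered tokens if true contiguity matters.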

## 4. Stemming
Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form. The stem does not have to be a valid word, as outputs such as 'whil' and 'ar' below show.

<img src="Image/stem.jpg" width="300" />

In [7]:
#Stemming
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem.lancaster import LancasterStemmer
st = LancasterStemmer()
new_text = "It is important to by very pythonly while you are pythoning with python. All pythoners have pythoned poorly at least once."
stemmedwords=[st.stem(word) for word in word_tokenize(new_text)]
print(stemmedwords)

['it', 'is', 'import', 'to', 'by', 'very', 'python', 'whil', 'you', 'ar', 'python', 'with', 'python', '.', 'al', 'python', 'hav', 'python', 'poor', 'at', 'least', 'ont', '.']
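
To illustrate the idea of suffix stripping, here is a toy stemmer. It is not the Lancaster algorithm (which applies a much larger rule table); the suffix list and the `toy_stem` name are assumptions made for this sketch.

```python
# Assumption: a hand-picked suffix list, longer suffixes checked first.
SUFFIXES = ("ers", "ing", "ly", "ed", "er", "s")

def toy_stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["pythonly", "pythoning", "python", "pythoners", "pythoned"]
stems = [toy_stem(w) for w in words]
print(stems)
```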


In [8]:
#Part-of-speech (POS) tagging: label each token with its grammatical category
#nltk.download('averaged_perceptron_tagger')
nltk.pos_tag(word_tokenize(new_text))

[('It', 'PRP'),
 ('is', 'VBZ'),
 ('important', 'JJ'),
 ('to', 'TO'),
 ('by', 'IN'),
 ('very', 'RB'),
 ('pythonly', 'RB'),
 ('while', 'IN'),
 ('you', 'PRP'),
 ('are', 'VBP'),
 ('pythoning', 'VBG'),
 ('with', 'IN'),
 ('python', 'NN'),
 ('.', '.'),
 ('All', 'DT'),
 ('pythoners', 'NNS'),
 ('have', 'VBP'),
 ('pythoned', 'VBN'),
 ('poorly', 'RB'),
 ('at', 'IN'),
 ('least', 'JJS'),
 ('once', 'RB'),
 ('.', '.')]
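
One common use of these tags is filtering tokens by word class. The sketch below does that in plain Python, using a few (token, tag) pairs copied from the tagger output above.

```python
# A few (token, tag) pairs copied from the tagger output above.
tagged = [('It', 'PRP'), ('is', 'VBZ'), ('important', 'JJ'),
          ('pythoning', 'VBG'), ('python', 'NN'),
          ('pythoners', 'NNS'), ('pythoned', 'VBN')]

# Penn Treebank noun tags start with 'NN', verb tags with 'VB'.
nouns = [tok for tok, tag in tagged if tag.startswith('NN')]
verbs = [tok for tok, tag in tagged if tag.startswith('VB')]
print(nouns)
print(verbs)
```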

## 5. Word Sense Disambiguation
Word sense disambiguation (WSD) identifies which sense (i.e. meaning) of a word is used in a sentence when the word has multiple meanings.
<img src="Image/wordsense.jpg" width="300" />

In [9]:
#Word Sense Disambiguation
import nltk
from nltk.corpus import wordnet as wn
#nltk.download('wordnet')
for ss in wn.synsets('mouse'):
    print (ss, ss.definition())

Synset('mouse.n.01') any of numerous small rodents typically resembling diminutive rats having pointed snouts and small ears on elongated bodies with slender usually hairless tails
Synset('shiner.n.01') a swollen bruise caused by a blow to the eye
Synset('mouse.n.03') person who is quiet or timid
Synset('mouse.n.04') a hand-operated electronic device that controls the coordinates of a cursor on your computer screen as you move it around on a pad; on the bottom of the device is a ball that rolls on the surface of the pad
Synset('sneak.v.01') to go stealthily or furtively
Synset('mouse.v.02') manipulate the mouse of a computer


In [13]:
from nltk.wsd import lesk
from nltk.tokenize import sent_tokenize, word_tokenize
#lesk picks the synset whose definition overlaps most with the context words
sense1 = lesk(word_tokenize("Sing in a lower tone, along with the bass"), 'bass')
print(sense1, sense1.definition())

sense2 = lesk(word_tokenize("The sea bass really very hard to catch"), 'bass')
print(sense2, sense2.definition())

#the result below is a verb sense; passing pos='n' restricts lesk to noun senses
sense3 = lesk(word_tokenize("Cat is chasing the mouse"), 'mouse')
print(sense3, sense3.definition())


Synset('bass.n.07') the member with the lowest range of a family of musical instruments
Synset('sea_bass.n.01') the lean flesh of a saltwater fish of the family Serranidae
Synset('mouse.v.02') manipulate the mouse of a computer
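
The core idea behind Lesk is word overlap: choose the sense whose dictionary definition shares the most words with the context. The sketch below is a simplified version of that idea. The two-entry `SENSES` inventory is written by hand for illustration and is not WordNet data, and `simple_lesk` is a hypothetical helper name.

```python
# Assumption: a hand-written two-sense inventory, not real WordNet data.
SENSES = {
    "bass (music)": "the lowest part in music or the lowest tone range",
    "bass (fish)": "a fish of the sea or fresh water caught for food",
}

def simple_lesk(context_sentence, senses):
    context = set(context_sentence.lower().split())
    # Score each sense by how many gloss words also occur in the context.
    scores = {name: len(context & set(gloss.split()))
              for name, gloss in senses.items()}
    return max(scores, key=scores.get)

sense_a = simple_lesk("Sing in a lower tone along with the bass", SENSES)
sense_b = simple_lesk("The sea bass is really very hard to catch", SENSES)
print(sense_a)
print(sense_b)
```

Since the score counts every shared word, including function words like "the", short contexts can easily select an unexpected sense, which is what happens with 'mouse' in the output above.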


## 6. CountVectorizer
- Tokenizes a collection of text documents, builds a vocabulary of known words, and encodes new documents using that vocabulary.
- The same vectorizer can be applied to documents containing words outside the vocabulary; those words are ignored and contribute no count to the resulting vector.
- Issue: very common words such as “the” appear in almost every document and dominate the counts.
- Each column represents one word, and each value is the number of times that word occurs in the document.
- Word order is not preserved.

In [28]:
import pandas as pd
corpus = [
     'This is the first document from heaven',
     'but the second document is from mars',
     'And this is the third one from nowhere',
     'Is this the first document from nowhere?',
]
df = pd.DataFrame({'Text':corpus})
df

                                       Text
0    This is the first document from heaven
1      but the second document is from mars
2    And this is the third one from nowhere
3  Is this the first document from nowhere?


In [29]:
from sklearn.feature_extraction.text import CountVectorizer
count_v = CountVectorizer()
X = count_v.fit_transform(df.Text).toarray()
#get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(count_v.get_feature_names_out().tolist())

['and', 'but', 'document', 'first', 'from', 'heaven', 'is', 'mars', 'nowhere', 'one', 'second', 'the', 'third', 'this']


In [30]:
print(X)
print(count_v.vocabulary_)

[[0 0 1 1 1 1 1 0 0 0 0 1 0 1]
 [0 1 1 0 1 0 1 1 0 0 1 1 0 0]
 [1 0 0 0 1 0 1 0 1 1 0 1 1 1]
 [0 0 1 1 1 0 1 0 1 0 0 1 0 1]]
{'this': 13, 'is': 6, 'the': 11, 'first': 3, 'document': 2, 'from': 4, 'heaven': 5, 'but': 1, 'second': 10, 'mars': 7, 'and': 0, 'third': 12, 'one': 9, 'nowhere': 8}
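
The count matrix above can be reproduced by hand, which makes the encoding explicit. This is a minimal sketch under the assumption that the defaults are in effect: lowercasing, tokens of two or more word characters, and an alphabetically sorted vocabulary. `tokenize` is a hypothetical helper name.

```python
import re
from collections import Counter

corpus = [
    'This is the first document from heaven',
    'but the second document is from mars',
    'And this is the third one from nowhere',
    'Is this the first document from nowhere?',
]

def tokenize(doc):
    # Lowercase and keep tokens of two or more word characters,
    # which mirrors CountVectorizer's default token pattern.
    return re.findall(r"\b\w\w+\b", doc.lower())

# Vocabulary: every distinct token, in alphabetical order.
vocab = sorted({tok for doc in corpus for tok in tokenize(doc)})

# One row per document, one column per vocabulary word.
matrix = []
for doc in corpus:
    counts = Counter(tokenize(doc))
    matrix.append([counts[word] for word in vocab])

print(vocab)
for row in matrix:
    print(row)
```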


In [31]:
#Removing Stopwords
count_v = CountVectorizer(stop_words=['the','is'])
print(count_v.fit_transform(df.Text).toarray())
print(count_v.vocabulary_)

[[0 0 1 1 1 1 0 0 0 0 0 1]
 [0 1 1 0 1 0 1 0 0 1 0 0]
 [1 0 0 0 1 0 0 1 1 0 1 1]
 [0 0 1 1 1 0 0 1 0 0 0 1]]
{'this': 11, 'first': 3, 'document': 2, 'from': 4, 'heaven': 5, 'but': 1, 'second': 9, 'mars': 6, 'and': 0, 'third': 10, 'one': 8, 'nowhere': 7}


In [33]:
count_v = CountVectorizer(vocabulary=['heaven','mars','nowhere'])
count_v.fit_transform(df.Text).toarray()

array([[1, 0, 0],
       [0, 1, 0],
       [0, 0, 1],
       [0, 0, 1]], dtype=int64)

In [40]:
#n-grams
#ngram_range must be a tuple (min_n, max_n); (1, 2) keeps unigrams and bigrams
count_v = CountVectorizer(ngram_range=(1, 2))
print(count_v.fit_transform(df.Text).toarray())
print(count_v.vocabulary_)

[[0 0 0 0 1 1 0 1 1 1 1 0 0 1 1 0 1 0 0 0 0 0 0 0 1 1 0 0 0 0 1 1 0]
 [0 0 1 1 1 0 1 0 0 1 0 1 0 0 1 1 0 0 1 0 0 0 1 1 1 0 1 0 0 0 0 0 0]
 [1 1 0 0 0 0 0 0 0 1 0 0 1 0 1 0 1 0 0 1 1 1 0 0 1 0 0 1 1 1 1 1 0]
 [0 0 0 0 1 1 0 1 1 1 0 0 1 0 1 0 0 1 0 1 0 0 0 0 1 1 0 0 0 0 1 0 1]]
{'this': 30, 'is': 14, 'the': 24, 'first': 7, 'document': 4, 'from': 9, 'heaven': 13, 'this is': 31, 'is the': 16, 'the first': 25, 'first document': 8, 'document from': 5, 'from heaven': 10, 'but': 2, 'second': 22, 'mars': 18, 'but the': 3, 'the second': 26, 'second document': 23, 'document is': 6, 'is from': 15, 'from mars': 11, 'and': 0, 'third': 28, 'one': 20, 'nowhere': 19, 'and this': 1, 'the third': 27, 'third one': 29, 'one from': 21, 'from nowhere': 12, 'is this': 17, 'this the': 32}
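
To see where the 33 columns come from, the sketch below rebuilds the unigram-plus-bigram vocabulary in plain Python. It assumes the default tokenization (lowercasing, tokens of two or more word characters); `tokenize` and `word_ngrams` are hypothetical helper names.

```python
import re

corpus = [
    'This is the first document from heaven',
    'but the second document is from mars',
    'And this is the third one from nowhere',
    'Is this the first document from nowhere?',
]

def tokenize(doc):
    # Lowercase and keep tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", doc.lower())

def word_ngrams(tokens, low, high):
    # All n-grams for n = low..high, each joined with a single space.
    return [" ".join(tokens[i:i + n])
            for n in range(low, high + 1)
            for i in range(len(tokens) - n + 1)]

# Distinct unigrams and bigrams across the corpus, sorted alphabetically.
features = sorted({gram for doc in corpus
                   for gram in word_ngrams(tokenize(doc), 1, 2)})
print(len(features))
print(features[:5])
```

The position of each entry in `features` should match the index stored in `vocabulary_` above, for example 'and this' at index 1.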


## 7. TF-IDF (TfidfVectorizer)
TF-IDF scores are word frequency scores that try to highlight the more interesting words, e.g. words that are frequent in one document but rare across documents.
- With the default normalization, each score lies between 0 and 1.

<b>Term Frequency:</b> This summarizes how often a given word appears within a document.  
<b>Inverse Document Frequency:</b> This downscales words that appear a lot across documents.
  
Advantages:  
- Feature vector much more tractable in size  
- Both frequency and relevance are captured  

Disadvantages:  
- Context (word order and meaning) is still not captured  



In [41]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
corpus = [
     'This is the first document from heaven',
     'but the second document is from mars',
     'And this is the third one from nowhere',
     'Is this the first document from nowhere?',
]
df = pd.DataFrame({'Text':corpus})
tfidf_v = TfidfVectorizer(stop_words='english')
tfidf_v.fit_transform(df.Text).toarray()

array([[0.53802897, 0.84292635, 0.        , 0.        ],
       [0.41137791, 0.        , 0.64450299, 0.64450299],
       [0.        , 0.        , 0.        , 0.        ],
       [1.        , 0.        , 0.        , 0.        ]])

In [42]:
#get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
tfidf_v.get_feature_names_out().tolist()

['document', 'heaven', 'mars', 'second']
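
The scores above can be recomputed by hand. This sketch assumes the default settings of `TfidfVectorizer`: raw term counts, a smoothed IDF of ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row. The token lists are the documents after lowercasing and stop word removal, written out manually; `idf` and `tfidf_row` are hypothetical helper names.

```python
import math

# Documents after lowercasing and English stop word removal,
# matching the four features reported above.
docs = [
    ['document', 'heaven'],
    ['second', 'document', 'mars'],
    [],
    ['document'],
]
vocab = ['document', 'heaven', 'mars', 'second']
n_docs = len(docs)

def idf(term):
    # Smoothed IDF (assumed to match TfidfVectorizer's defaults).
    df = sum(term in doc for doc in docs)
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf_row(doc):
    raw = [doc.count(term) * idf(term) for term in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    # L2-normalize so every row has unit length (empty rows stay zero).
    return [v / norm if norm else 0.0 for v in raw]

rows = [tfidf_row(doc) for doc in docs]
for row in rows:
    print([round(v, 8) for v in row])
```

The third document contains only stop words, so its row is all zeros. 'document' occurs in three of the four documents, so it gets a lower weight than 'heaven', which occurs in only one.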

## 8. HashingVectorizer
- Issue with counts and frequencies: the vocabulary can become very large.
- A workaround is to use a one-way hash of each word to convert it to an integer column index.
- No vocabulary is required, and you can choose a fixed-length vector of any size.
- Downside: there is no way to convert the encoding back to words.

Step 1:
<img src="Image/hash-1.jpg" width="400" />
Step 2:
<img src="Image/hash-2.jpg" width="400" />
Step 3:
<img src="Image/hash-3.jpg" width="400" />

In [44]:
from sklearn.feature_extraction.text import HashingVectorizer
import pandas as pd
corpus = [
     'This is the first document from heaven',
     'but the second document is from mars',
     'And this is the third one from nowhere',
     'Is this the first document from nowhere?',
]
df = pd.DataFrame({'Text':corpus})
hash_v = HashingVectorizer(n_features=5)
hash_v.fit_transform(df.Text).toarray()

array([[ 0.        , -1.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        , -1.        ,  0.        ],
       [-0.35355339,  0.35355339,  0.70710678, -0.35355339, -0.35355339],
       [ 0.        , -0.57735027,  0.57735027, -0.57735027,  0.        ]])
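
The hashing trick itself fits in a few lines. The sketch below hashes each token to a column index and counts collisions into a fixed-length vector. It uses MD5 from the standard library as a stand-in hash, which is an assumption made for portability; scikit-learn uses a different hash function, so the column positions will not match the output above. `hashed_counts` is a hypothetical helper name.

```python
import hashlib

def hashed_counts(text, n_features=5):
    # Fixed-length vector: no vocabulary is stored anywhere.
    vector = [0] * n_features
    for token in text.lower().split():
        # Assumption: MD5 as a stable stand-in hash function.
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        index = int(digest, 16) % n_features
        vector[index] += 1
    return vector

vec = hashed_counts("This is the first document from heaven")
print(vec)
```

With only 5 columns and 7 tokens, several words must share a column, which is why the encoding cannot be reversed. The negative and fractional values in the library output above come from the default `alternate_sign=True` and `norm='l2'` settings of `HashingVectorizer`, which this sketch leaves out.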