### Topics for today's discussion
* Understanding NLTK
* Tokenizing
* Stop-words
* Stemming
* Lemmatizing

<hr>

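* ML algorithms cannot process raw text directly
* Text must first be pre-processed into numeric features
* The feature space should then be reduced (for example by removing stop-words, stemming, or lemmatizing)
* NLTK is a foundational library that provides tools for all of these steps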
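
Stop-word removal is one of the feature-reduction steps in the topic list. A minimal sketch of the idea, assuming a tiny hand-written stop list: `TOY_STOP_WORDS` and `remove_stop_words` are hypothetical names used only for illustration. NLTK ships a full list through `nltk.corpus.stopwords.words('english')` once the `stopwords` corpus has been downloaded.

```python
# Hypothetical, deliberately tiny stop list for illustration only.
TOY_STOP_WORDS = {"the", "is", "a", "an", "of", "and", "are", "all"}

def remove_stop_words(tokens):
    """Drop tokens whose lowercase form is in the toy stop list."""
    return [t for t in tokens if t.lower() not in TOY_STOP_WORDS]

tokens = ["Hope", "the", "lockdown", "solves", "all", "the", "issues"]
filtered = remove_stop_words(tokens)
print(filtered)
```

The words that remain carry most of the meaning, and every dropped word is one fewer feature for the model to handle.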

In [10]:
import nltk

In [6]:
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

#### Tokenization

In [11]:
from nltk.tokenize import sent_tokenize, word_tokenize

In [16]:
my_txt = "Hello Mr. Learners, how is learning going on? Hope things are fine. Hope the lockdown solves all the issues."

In [17]:
sent_tokenize(my_txt)

['Hello Mr. Learners, how is learning going on?',
 'Hope things are fine.',
 'Hope the lockdown solves all the issues.']

In [18]:
word_tokenize(my_txt)

['Hello',
 'Mr.',
 'Learners',
 ',',
 'how',
 'is',
 'learning',
 'going',
 'on',
 '?',
 'Hope',
 'things',
 'are',
 'fine',
 '.',
 'Hope',
 'the',
 'lockdown',
 'solves',
 'all',
 'the',
 'issues',
 '.']

### Stemming
* Many inflected variations of a word (e.g. 'runs', 'running') carry the same core meaning, differing only in tense or number
* The objective is to reduce the dimensionality of the data
* Curse of dimensionality: many algorithms perform poorly when there are too many dimensions
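
A minimal sketch of the dimension-reduction effect, assuming NLTK is installed (the Porter stemmer needs no extra data download). The word list is an illustrative assumption, not taken from the dataset used later.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Five surface forms of the same root (illustrative word list).
forms = ["connect", "connected", "connecting", "connection", "connections"]
stems = [stemmer.stem(w) for w in forms]

# Each distinct string would become its own feature (column) in a count matrix.
print(len(set(forms)), "features before stemming")
print(len(set(stems)), "feature(s) after stemming")
```

Fewer distinct strings means fewer columns once the text is vectorized.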

In [19]:
from nltk.stem import PorterStemmer

In [20]:
ps = PorterStemmer()

In [31]:
words = ['runs','runner','running','run']

In [32]:
for word in words:
    print(ps.stem(word))

run
runner
run
run


In [103]:
text_data = ['I runs verying is fast','I was very running fast veries veried']

In [104]:
import pandas as pd

In [105]:
df = pd.DataFrame({'Text':text_data})

In [106]:
from sklearn.feature_extraction.text import CountVectorizer

In [107]:
cv = CountVectorizer()

In [108]:
def f(r):
    # Stem every token in the row and rebuild it as a space-separated string
    return ' '.join(ps.stem(word) for word in word_tokenize(r))

# Replace each text with its stemmed version
df['Text'] = df['Text'].map(f)

In [109]:
df.Text

0              I run veri is fast
1    I wa veri run fast veri veri
Name: Text, dtype: object

In [110]:
cv.fit_transform(df.Text)

<2x5 sparse matrix of type '<class 'numpy.int64'>'
	with 8 stored elements in Compressed Sparse Row format>

In [111]:
cv.vocabulary_

{'run': 2, 'veri': 3, 'is': 1, 'fast': 0, 'wa': 4}

In [112]:
cv.fit_transform(df.Text).toarray()

array([[1, 1, 1, 1, 0],
       [1, 0, 1, 3, 1]], dtype=int64)
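
To see where these counts come from, here is a minimal pure-Python sketch of the counting step. It assumes lowercasing plus the default `token_pattern` that `CountVectorizer` reports (`(?u)\b\w\w+\b`, i.e. tokens of two or more word characters), which is why the single-letter token 'I' never reaches the vocabulary. `count_matrix` is a hypothetical helper, not part of scikit-learn.

```python
import re
from collections import Counter

# Same default token pattern that CountVectorizer reports in its repr.
TOKEN_RE = re.compile(r"(?u)\b\w\w+\b")

def count_matrix(docs):
    """Return (vocabulary, rows) with one row of term counts per document."""
    tokenized = [TOKEN_RE.findall(d.lower()) for d in docs]
    # Columns are ordered alphabetically, matching the indices in cv.vocabulary_.
    vocab = sorted({tok for toks in tokenized for tok in toks})
    rows = []
    for toks in tokenized:
        counts = Counter(toks)  # missing terms count as 0
        rows.append([counts[term] for term in vocab])
    return vocab, rows

docs = ["I run veri is fast", "I wa veri run fast veri veri"]
vocab, rows = count_matrix(docs)
print(vocab)
print(rows)
```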

### Lemmatizing
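* Similar to stemming: both reduce a word to a base form
* Stemming strips suffixes by rule, so it also handles misspelled or made-up words, but the result may not be a real word (e.g. 'veri')
* Lemmatizing looks words up in a dictionary (WordNet), so it returns actual words (e.g. 'geese' becomes 'goose')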
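
The difference can be sketched with two toy functions. Both are hypothetical stand-ins written for illustration: `toy_lemmatize` uses a four-entry lookup table in place of WordNet, and `toy_stem` is crude suffix stripping, not the Porter algorithm.

```python
# Hypothetical toy lookup table standing in for a real dictionary such as WordNet.
TOY_LEMMAS = {
    ("geese", "n"): "goose",
    ("cats", "n"): "cat",
    ("better", "a"): "good",
    ("running", "v"): "run",
}

def toy_lemmatize(word, pos="n"):
    """Return the dictionary form if (word, pos) is known, else the word unchanged."""
    return TOY_LEMMAS.get((word, pos), word)

def toy_stem(word):
    """Crude rule-based suffix stripping; NOT the Porter algorithm."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print(toy_lemmatize("geese"))        # irregular plural found in the table
print(toy_lemmatize("better", "a"))  # needs the adjective tag to match
print(toy_lemmatize("better"))       # not in the table as a noun, returned unchanged
print(toy_stem("geese"))             # no suffix rule applies
print(toy_stem("verying"))           # made-up word is still stripped by rule
```

The lookup handles irregular forms but only for words it knows, and the part-of-speech tag changes the answer. The rule-based stripper accepts any string but cannot handle irregular forms.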

In [113]:
from nltk.stem import WordNetLemmatizer

In [114]:
wl = WordNetLemmatizer()

In [115]:
wl.lemmatize('cats')

'cat'

In [116]:
wl.lemmatize('runs')

'run'

In [119]:
wl.lemmatize('goose')

'goose'

In [120]:
wl.lemmatize('geese')

'goose'

In [125]:
wl.lemmatize('better',pos="a")

'good'

In [127]:
wl.lemmatize('good',pos="a")

'good'

In [128]:
ps.stem('paying')

'pay'

In [129]:
ps.stem('pays')

'pay'

In [130]:
ps.stem('payed')

'pay'

In [131]:
from nltk.stem import LancasterStemmer

In [132]:
ls = LancasterStemmer()

In [133]:
ls.stem('trouble')

'troubl'

In [134]:
ls.stem('troubling')

'troubl'

In [136]:
text = 'He was running and eating at the, same time. He also has a very bad habbit of playing in the Sun after having food?'

In [137]:
punctuations = ',.?'

In [142]:
text = text.translate(str.maketrans('', '', punctuations))

In [143]:
words = word_tokenize(text)

In [144]:
words

['He',
 'was',
 'running',
 'and',
 'eating',
 'at',
 'the',
 'same',
 'time',
 'He',
 'also',
 'has',
 'a',
 'very',
 'bad',
 'habbit',
 'of',
 'playing',
 'in',
 'the',
 'Sun',
 'after',
 'having',
 'food']

In [146]:
for word in words:
    print(wl.lemmatize(word,pos='v'))

He
be
run
and
eat
at
the
same
time
He
also
have
a
very
bad
habbit
of
play
in
the
Sun
after
have
food


In [188]:
horror_data = pd.read_csv('https://raw.githubusercontent.com/edyoda/data-science-complete-tutorial/master/Data/horror-train.csv')

In [149]:
horror_data.columns

Index(['id', 'text', 'author'], dtype='object')

In [155]:
horror_data = horror_data[['text']]

In [151]:
horror_data['text'][:5]

0    This process, however, afforded me no means of...
1    It never once occurred to me that the fumbling...
2    In his left hand was a gold snuff box, from wh...
3    How lovely is spring As we looked from Windsor...
4    Finding nothing else, not even gold, the Super...
Name: text, dtype: object

* Using NearestNeighbors with cosine distance as the metric, we will find similar texts
* We can use regex to remove punctuation
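
A minimal sketch of the regex approach, using only the standard-library `re` module. `strip_punctuation` and the sample sentence are hypothetical, chosen for illustration.

```python
import re

def strip_punctuation(text):
    """Remove every character that is neither a word character nor whitespace."""
    return re.sub(r"[^\w\s]", "", text)

sample = "Finding nothing else, not even gold; was it there?"
print(strip_punctuation(sample))
```

One pattern covers semicolons, quotes, and other symbols that chained `str.replace` calls would each need to list. It also removes apostrophes and hyphens, which may or may not be wanted.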

In [157]:
def f(t):
    # Strip commas, question marks, and periods before tokenizing
    return t.replace(',','').replace('?','').replace('.','')
horror_data['new_text'] = horror_data.text.map(f)

In [159]:
def stem_func(r):
    words = word_tokenize(r)
    sent = []
    for word in words:
        sent.append(ps.stem(word))
    return ' '.join(sent)

horror_data['stem_words'] = horror_data.new_text.map(stem_func)

In [161]:
cv = CountVectorizer(stop_words='english')

In [162]:
cv.fit(horror_data.text)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words='english',
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

In [164]:
len(cv.vocabulary_)

24764

In [165]:
cv.fit(horror_data.stem_words)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words='english',
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

In [169]:
out = cv.transform(horror_data.stem_words)

In [166]:
len(cv.vocabulary_)

15355

In [167]:
from sklearn.neighbors import NearestNeighbors

In [168]:
nn = NearestNeighbors(metric='cosine')

In [170]:
nn.fit(out)

NearestNeighbors(algorithm='auto', leaf_size=30, metric='cosine',
                 metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                 radius=1.0)

In [180]:
nn.kneighbors(out[4:5])

(array([[1.11022302e-16, 4.57917835e-01, 4.79516561e-01, 4.96637990e-01,
         4.98449609e-01]]), array([[    4, 15457,  7409, 18122, 13440]]))
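
The first array holds cosine distances and the second holds the matching row indices. The nearest hit is the query row itself (index 4) at a distance of about 1.1e-16, which is zero up to floating-point rounding. A minimal sketch of the distance being computed, in plain Python rather than scikit-learn's own implementation; `cosine_distance` is a hypothetical helper, and the two vectors are the count rows from the small stemming example earlier, not rows from the horror dataset.

```python
import math

def cosine_distance(u, v):
    """1 - cos(angle between u and v); assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Count rows from the two-sentence stemming example.
a = [1, 1, 1, 1, 0]
b = [1, 0, 1, 3, 1]

print(cosine_distance(a, a))  # a vector compared with itself
print(cosine_distance(a, b))
```

Because the measure depends only on the angle between the vectors, a long text and a short text with the same word proportions end up close together.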

In [174]:
horror_data[:1].text[0]

'This process, however, afforded me no means of ascertaining the dimensions of my dungeon; as I might make its circuit, and return to the point whence I set out, without being aware of the fact; so perfectly uniform seemed the wall.'

In [181]:
horror_data.loc[4].text

'Finding nothing else, not even gold, the Superintendent abandoned his attempts; but a perplexed look occasionally steals over his countenance as he sits thinking at his desk.'

In [185]:
horror_data.loc[15457].text

'His countenance was rough but intelligent his ample brow and quick grey eyes seemed to look out, over his own plans, and the opposition of his enemies.'

In [187]:
horror_data.loc[18122].text

'The smile of triumph shone on his countenance; determined to pursue his object to the uttermost, his manner and expression seem ominous of the accomplishment of his wishes.'

In [189]:
horror_data.loc[18122]

id                                                  id10251
text      The smile of triumph shone on his countenance;...
author                                                  MWS
Name: 18122, dtype: object

In [190]:
horror_data.loc[15457]

id                                                  id26034
text      His countenance was rough but intelligent his ...
author                                                  MWS
Name: 15457, dtype: object

In [191]:
horror_data.loc[4]

id                                                  id12958
text      Finding nothing else, not even gold, the Super...
author                                                  HPL
Name: 4, dtype: object