# <center>`Stemming` VS `Lemmatization`</center>
### <justify>Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of reducing related word forms to a common base correctly most of the time, and it often includes the removal of derivational affixes. Lemmatization usually refers to doing this properly, using a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma. If confronted with the token "saw", stemming might return just "s", whereas lemmatization would attempt to return either "see" or "saw", depending on whether the token was used as a verb or a noun. The two may also differ in that stemming most commonly collapses derivationally related words, whereas lemmatization commonly collapses only the different inflectional forms of a lemma. Linguistic processing for stemming or lemmatization is often done by an additional plug-in component to the indexing process, and a number of such components exist, both commercial and open-source.</justify>
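
To make the contrast concrete, here is a minimal sketch in plain Python. Both `toy_stem` and `toy_lemmatize` are hypothetical helpers written only for illustration: they are not the Porter stemmer or the WordNet lemmatizer, and the `SUFFIXES` list and the tiny `LEMMAS` table are assumptions standing in for real rules and a real vocabulary.

```python
# Toy illustration only: neither function is a real NLTK API.

SUFFIXES = ("ing", "ed", "es", "s")  # assumed suffix list, checked in this order


def toy_stem(word):
    """Chop the first matching suffix off the word; no vocabulary is consulted."""
    for suffix in SUFFIXES:
        # Require a few remaining characters so very short words are left alone.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


# Assumed mini-vocabulary: (word, part of speech) -> lemma.
LEMMAS = {
    ("saw", "v"): "see",
    ("saw", "n"): "saw",
    ("dances", "v"): "dance",
    ("dances", "n"): "dance",
    ("dancing", "v"): "dance",
    ("danced", "v"): "dance",
}


def toy_lemmatize(word, pos):
    """Look the word up for the given part of speech; fall back to the word itself."""
    return LEMMAS.get((word, pos), word)


for w in ["dances", "dancing", "danced"]:
    print(w, "-> stem:", toy_stem(w), "| lemma (verb):", toy_lemmatize(w, "v"))

print(toy_lemmatize("saw", "v"), toy_lemmatize("saw", "n"))
```

In this sketch the three inflected forms share the stem `danc`, which is not a dictionary word, while the vocabulary lookup returns `dance`. The lookup also needs the part of speech to decide between `see` and `saw`.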


In [2]:
from nltk.stem import WordNetLemmatizer
wn = WordNetLemmatizer()

In [11]:
word_list = ["dance", "dances", "dancing", "danced"]
word_list

['dance', 'dances', 'dancing', 'danced']

# NLTK tagset

In [8]:
import nltk
nltk.download('tagsets')
#nltk.help.upenn_tagset()

[nltk_data] Downloading package tagsets to
[nltk_data]     C:\Users\amaN\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping help\tagsets.zip.


True

In [9]:
nltk.help.upenn_tagset()

$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or

In [12]:
for w in word_list:
    print(wn.lemmatize(w, pos="v"))

LookupError: 
**********************************************************************
  Resource wordnet not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('wordnet')

  For more information see: https://www.nltk.org/data.html

  Attempted to load corpora/wordnet

  Searched in:
    - 'C:\\Users\\amaN/nltk_data'
    - 'C:\\Users\\amaN\\anaconda3\\nltk_data'
    - 'C:\\Users\\amaN\\anaconda3\\share\\nltk_data'
    - 'C:\\Users\\amaN\\anaconda3\\lib\\nltk_data'
    - 'C:\\Users\\amaN\\AppData\\Roaming\\nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
**********************************************************************


In [13]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\amaN\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\wordnet.zip.


True
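
The `LookupError` above can be avoided by checking for a resource before using it. The following is a minimal sketch, not part of NLTK: `ensure_resource` is a hypothetical helper, and the lookup and download functions are passed in as arguments so the sketch runs even without NLTK installed. With NLTK, the assumption is that `nltk.data.find` and `nltk.download` would be passed in those positions.

```python
def ensure_resource(resource_path, package, find, download):
    """Download `package` only if `find(resource_path)` raises LookupError.

    Returns True if a download was triggered, False if the resource was found.
    """
    try:
        find(resource_path)
        return False
    except LookupError:
        download(package)
        return True


# Stand-in callables (assumptions for the demo) so the sketch runs without NLTK.
installed = set()


def fake_find(path):
    if path not in installed:
        raise LookupError(path)


def fake_download(package):
    installed.add("corpora/" + package)


print(ensure_resource("corpora/wordnet", "wordnet", fake_find, fake_download))  # first call downloads
print(ensure_resource("corpora/wordnet", "wordnet", fake_find, fake_download))  # second call finds it

# Intended use with NLTK (assumes nltk is installed and the network is reachable):
#   import nltk
#   ensure_resource("corpora/wordnet", "wordnet", nltk.data.find, nltk.download)
```

The resource path `corpora/wordnet` is the one reported in the error message; resource names may differ between NLTK versions, so treat the strings as placeholders to check against your own traceback.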

In [14]:
for w in word_list:
    print(wn.lemmatize(w, pos="v"))

dance
dance
dance
dance


In [15]:
for w in word_list:
    print(wn.lemmatize(w, pos="n"))

dance
dance
dancing
danced


In [33]:
sentance = "Text can be added to Jupyter Notebooks using Markdown cells. You can change the cell type to Markdown by using the Cell menu, the toolbar  Markdown in Jupyter Notebook In this tutorial."
sentance

'Text can be added to Jupyter Notebooks using Markdown cells. You can change the cell type to Markdown by using the Cell menu, the toolbar  Markdown in Jupyter Notebook In this tutorial.'

In [34]:
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

In [37]:
word_tokens = word_tokenize(sentance)
print(word_tokens)

['Text', 'can', 'be', 'added', 'to', 'Jupyter', 'Notebooks', 'using', 'Markdown', 'cells', '.', 'You', 'can', 'change', 'the', 'cell', 'type', 'to', 'Markdown', 'by', 'using', 'the', 'Cell', 'menu', ',', 'the', 'toolbar', 'Markdown', 'in', 'Jupyter', 'Notebook', 'In', 'this', 'tutorial', '.']


# See the parts of speech of the words in the example sentence

In [38]:
nltk.pos_tag(word_tokens)

LookupError: 
**********************************************************************
  Resource averaged_perceptron_tagger not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('averaged_perceptron_tagger')

  For more information see: https://www.nltk.org/data.html

  Attempted to load taggers/averaged_perceptron_tagger/averaged_perceptron_tagger.pickle

  Searched in:
    - 'C:\\Users\\amaN/nltk_data'
    - 'C:\\Users\\amaN\\anaconda3\\nltk_data'
    - 'C:\\Users\\amaN\\anaconda3\\share\\nltk_data'
    - 'C:\\Users\\amaN\\anaconda3\\lib\\nltk_data'
    - 'C:\\Users\\amaN\\AppData\\Roaming\\nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
**********************************************************************


In [39]:
import nltk
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\amaN\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.


True

In [40]:
nltk.pos_tag(word_tokens)

[('Text', 'NN'),
 ('can', 'MD'),
 ('be', 'VB'),
 ('added', 'VBN'),
 ('to', 'TO'),
 ('Jupyter', 'NNP'),
 ('Notebooks', 'NNP'),
 ('using', 'VBG'),
 ('Markdown', 'NNP'),
 ('cells', 'NNS'),
 ('.', '.'),
 ('You', 'PRP'),
 ('can', 'MD'),
 ('change', 'VB'),
 ('the', 'DT'),
 ('cell', 'NN'),
 ('type', 'NN'),
 ('to', 'TO'),
 ('Markdown', 'NNP'),
 ('by', 'IN'),
 ('using', 'VBG'),
 ('the', 'DT'),
 ('Cell', 'NNP'),
 ('menu', 'NN'),
 (',', ','),
 ('the', 'DT'),
 ('toolbar', 'NN'),
 ('Markdown', 'NNP'),
 ('in', 'IN'),
 ('Jupyter', 'NNP'),
 ('Notebook', 'NNP'),
 ('In', 'IN'),
 ('this', 'DT'),
 ('tutorial', 'NN'),
 ('.', '.')]
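
The `pos` argument passed to `wn.lemmatize` earlier takes WordNet categories such as `"n"` and `"v"`, whereas `nltk.pos_tag` returns Penn Treebank tags such as `VBG` and `NNS`. A minimal sketch of bridging the two follows. `penn_to_wordnet` is a hypothetical helper, not an NLTK function, and falling back to noun for every other tag is an assumption chosen here because noun is also the lemmatizer's default.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (assumed default: noun)."""
    if tag.startswith("JJ"):   # JJ, JJR, JJS -> adjective
        return "a"
    if tag.startswith("VB"):   # VB, VBD, VBG, VBN, VBP, VBZ -> verb
        return "v"
    if tag.startswith("RB"):   # RB, RBR, RBS -> adverb
        return "r"
    return "n"                 # everything else is treated as a noun


# A few (word, tag) pairs copied from the tagger output above.
tagged = [("added", "VBN"), ("using", "VBG"), ("cells", "NNS"), ("Notebooks", "NNP")]
for word, tag in tagged:
    print(word, tag, "->", penn_to_wordnet(tag))

# With NLTK available, each pair could then be lemmatized with
#   wn.lemmatize(word, pos=penn_to_wordnet(tag))
```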

In [44]:
import string
punc = string.punctuation
punc

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [45]:
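# Collect every token that is not a punctuation mark.
# Note: `punc` is one string, so `val not in punc` is a substring test.
empty_list = []
for val in word_tokens:
    if val not in punc:
        empty_list.append(val)

print(empty_list)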

['Text', 'can', 'be', 'added', 'to', 'Jupyter', 'Notebooks', 'using', 'Markdown', 'cells', 'You', 'can', 'change', 'the', 'cell', 'type', 'to', 'Markdown', 'by', 'using', 'the', 'Cell', 'menu', 'the', 'toolbar', 'Markdown', 'in', 'Jupyter', 'Notebook', 'In', 'this', 'tutorial']
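
Because `punc` is a single string, `val not in punc` is a substring test. It works for the single-character tokens in this sentence, but multi-character punctuation tokens such as `...` or `--` are not substrings of `string.punctuation` and would be kept. A minimal sketch of a character-wise check, using a hypothetical helper `is_punctuation` and a made-up token list:

```python
import string

punc = string.punctuation


def is_punctuation(token):
    """True when the token is non-empty and every character is punctuation."""
    return token != "" and all(ch in punc for ch in token)


# Made-up tokens chosen to include multi-character punctuation.
tokens = ["Markdown", "cells", ".", "...", "--", ","]

substring_filter = [t for t in tokens if t not in punc]
charwise_filter = [t for t in tokens if not is_punctuation(t)]

print(substring_filter)  # ['Markdown', 'cells', '...', '--']
print(charwise_filter)   # ['Markdown', 'cells']
```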


In [48]:
word_tokens2 = empty_list
print(word_tokens2)

['Text', 'can', 'be', 'added', 'to', 'Jupyter', 'Notebooks', 'using', 'Markdown', 'cells', 'You', 'can', 'change', 'the', 'cell', 'type', 'to', 'Markdown', 'by', 'using', 'the', 'Cell', 'menu', 'the', 'toolbar', 'Markdown', 'in', 'Jupyter', 'Notebook', 'In', 'this', 'tutorial']


In [49]:
nltk.pos_tag(word_tokens2)

[('Text', 'NN'),
 ('can', 'MD'),
 ('be', 'VB'),
 ('added', 'VBN'),
 ('to', 'TO'),
 ('Jupyter', 'NNP'),
 ('Notebooks', 'NNP'),
 ('using', 'VBG'),
 ('Markdown', 'NNP'),
 ('cells', 'NNS'),
 ('You', 'PRP'),
 ('can', 'MD'),
 ('change', 'VB'),
 ('the', 'DT'),
 ('cell', 'NN'),
 ('type', 'NN'),
 ('to', 'TO'),
 ('Markdown', 'NNP'),
 ('by', 'IN'),
 ('using', 'VBG'),
 ('the', 'DT'),
 ('Cell', 'NNP'),
 ('menu', 'VBZ'),
 ('the', 'DT'),
 ('toolbar', 'NN'),
 ('Markdown', 'NNP'),
 ('in', 'IN'),
 ('Jupyter', 'NNP'),
 ('Notebook', 'NNP'),
 ('In', 'IN'),
 ('this', 'DT'),
 ('tutorial', 'NN')]
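
Comparing the two tagger outputs shows that removing punctuation before tagging is not neutral: `menu` was tagged `NN` with the comma present and `VBZ` without it. The sketch below finds such differences automatically. `changed_tags` is a hypothetical helper, and the two short lists are excerpts copied from the outputs above rather than fresh tagger runs.

```python
import string


def changed_tags(tagged_full, tagged_stripped):
    """Return (word, tag_before, tag_after) for words whose tag differs."""
    # Drop punctuation-only tokens so both lists line up word for word.
    words_only = [
        (w, t) for w, t in tagged_full
        if not all(ch in string.punctuation for ch in w)
    ]
    return [
        (w1, t1, t2)
        for (w1, t1), (w2, t2) in zip(words_only, tagged_stripped)
        if w1 == w2 and t1 != t2
    ]


# Excerpt tagged with punctuation still in place.
with_punct = [("the", "DT"), ("Cell", "NNP"), ("menu", "NN"),
              (",", ","), ("the", "DT"), ("toolbar", "NN")]
# The same excerpt tagged after punctuation was removed.
without_punct = [("the", "DT"), ("Cell", "NNP"), ("menu", "VBZ"),
                 ("the", "DT"), ("toolbar", "NN")]

print(changed_tags(with_punct, without_punct))  # [('menu', 'NN', 'VBZ')]
```

One reading of this result is that the tagger relies on punctuation as context, so it may be safer to tag the full token list first and filter out punctuation afterwards.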