# Practice for Stemming and Lemmatization

Use the following passage for the tasks below. It is stored in the variable named `str` (note that this name shadows Python's built-in `str` type for the rest of the notebook).

In [1]:
str = (
    "It was July 21, 1969, and Neil Armstrong awoke with a start. "
    "It was the day he would become the first human being to ever walk on the moon. "
    "The journey had begun several days earlier, when on July 16th, the Apollo 11 "
    "launched from Earth headed into outer space. On board with Neil Armstrong were "
    "Michael Collins and Buzz Aldrin. The crew landed on the moon in the Sea of "
    "Tranquility a day before the actual walk. Upon Neil’s first step onto the moon’s "
    "surface, he declared, “That’s one small step for man, one giant leap for mankind.” It sure was!"
)

**Task 1.** Use the `nltk` package's `word_tokenize()` function to tokenize the string `str`. Print out the first ten (10) tokens.

In [2]:
# Your code goes here
#---------------------

from nltk import word_tokenize
tokens = word_tokenize(str)
print(tokens[:10])

['It', 'was', 'July', '21', ',', '1969', ',', 'and', 'Neil', 'Armstrong']


**Task 2.** The output above shows that punctuation marks are also returned as tokens. You can check whether a token is purely alphabetic with the string method `isalpha()`: if the token is stored in the variable `word`, then `word.isalpha()` is `False` for punctuation (and also for numbers such as `21` or `1969`). Filter out the non-alphabetic tokens and print out the first ten (10) remaining words.

In [3]:
# Your code goes here
#---------------------

tokens = [word for word in tokens if word.isalpha()]
print(tokens[:10])

['It', 'was', 'July', 'and', 'Neil', 'Armstrong', 'awoke', 'with', 'a', 'start']
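
Note that the numeric tokens `21` and `1969` disappeared along with the commas, because `isalpha()` is `True` only when every character in the string is a letter. The short sketch below uses a hand-picked token list (an assumption for illustration, not the full `word_tokenize()` output) to show which kinds of tokens survive the filter.

```python
# Hand-picked sample tokens (hypothetical; not the full word_tokenize() output).
sample_tokens = ["July", "21", ",", "1969", "16th", "Armstrong", "."]

# isalpha() is True only if the string is non-empty and every character is a letter.
alpha_flags = {token: token.isalpha() for token in sample_tokens}
kept = [token for token in sample_tokens if token.isalpha()]

for token, flag in alpha_flags.items():
    print(f"{token!r:12} isalpha -> {flag}")
print(kept)
```

Mixed tokens such as `16th` are dropped too, which may or may not be what you want in a real pipeline.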


**Task 3.** Convert all of the words to lowercase. Print out the first ten (10) words from it.

In [4]:
# Your code goes here
#---------------------

tokens = [word.lower() for word in tokens]
print(tokens[:10])

['it', 'was', 'july', 'and', 'neil', 'armstrong', 'awoke', 'with', 'a', 'start']


**Task 4.** Remove the stopwords from the words list. The English stopwords are already loaded into the variable `stop_words` for you. Print out the first ten (10) words from the filtered list.

In [11]:
## Stopwords already pulled for convenience

import nltk
from nltk.corpus import stopwords
nltk.download("stopwords")

stop_words = stopwords.words("english")

[nltk_data] Downloading package stopwords to /home/dcphw2/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [6]:
# Your code goes here
#---------------------

tokens = [word for word in tokens if word not in stop_words]
print(tokens[:10])

['july', 'neil', 'armstrong', 'awoke', 'start', 'day', 'would', 'become', 'first', 'human']
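
`stopwords.words("english")` returns a plain Python list, so every `word not in stop_words` check scans that list. For longer texts it can help to convert the list to a `set` first. The sketch below uses a small hand-written stopword subset (an assumption, standing in for the real NLTK list) so that it runs without any downloads.

```python
# Small stand-in for the NLTK English stopword list (hypothetical subset).
demo_stop_words = ["it", "was", "and", "with", "a", "the", "he", "to", "on"]
stop_set = set(demo_stop_words)  # set membership is O(1) on average

demo_tokens = ["it", "was", "july", "and", "neil", "armstrong",
               "awoke", "with", "a", "start"]

# Keep only the tokens that are not in the stopword set.
filtered = [word for word in demo_tokens if word not in stop_set]
print(filtered)
```

In the notebook itself, the equivalent change would be `stop_words = set(stopwords.words("english"))`; the filtering line stays the same.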


**Task 5.** Use `PorterStemmer` to find the stem (or root word) of each word in the words list. Print out the first ten (10) stems.

In [7]:
# Your code goes here
#---------------------

from nltk.stem import PorterStemmer
porter = PorterStemmer()

stems = [porter.stem(word) for word in tokens]
print(stems[:10])

['juli', 'neil', 'armstrong', 'awok', 'start', 'day', 'would', 'becom', 'first', 'human']
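
As the output shows, a stem is not always a dictionary word (`juli`, `awok`, `becom`), since the Porter algorithm strips suffixes by rule rather than by lookup. A minimal sketch that pairs each word with its stem, using four words from the list above and assuming `nltk` is installed (the stemmer itself needs no corpus download):

```python
from nltk.stem import PorterStemmer

porter = PorterStemmer()

# Words taken from the token list above.
words = ["july", "awoke", "become", "start"]

# Map each word to its Porter stem so the changes are easy to compare.
word_to_stem = {word: porter.stem(word) for word in words}

for word, stem in word_to_stem.items():
    marker = "" if word == stem else "  (changed)"
    print(f"{word} -> {stem}{marker}")
```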


**Task 6.** Use `WordNetLemmatizer` to find the lemma (or root word) of each word in the words list. Treat every word as a `Verb`. Print out the first ten (10) lemmas.

In [8]:
# Your code goes here
#---------------------

from nltk.stem import WordNetLemmatizer
wordnet = WordNetLemmatizer()

lemmas = [wordnet.lemmatize(word, pos="v") for word in tokens]
print(lemmas[:10])

['july', 'neil', 'armstrong', 'awake', 'start', 'day', 'would', 'become', 'first', 'human']


**Task 7.** Use the `pos_tag()` function from the `nltk` package to tag each word with its part of speech, then pass that tag to `WordNetLemmatizer` when lemmatizing.

*A helper function is provided for convenience: it takes the second element (the tag) of each tuple returned by `pos_tag()` and returns a WordNet tag that can be passed as the `pos` argument of the `lemmatize()` function.*

Print out the first ten (10) words from it.

In [9]:
## Function given for convenience
## Return value from this function can be passed as the value for the "pos" option in the "lemmatize()" function

from nltk.corpus import wordnet as wrdnet

def nltk_tag_to_wordnet_tag(nltk_tag):
    if nltk_tag.startswith('J'):
        return wrdnet.ADJ
    elif nltk_tag.startswith('V'):
        return wrdnet.VERB
    elif nltk_tag.startswith('R'):
        return wrdnet.ADV
    else:          
        return wrdnet.NOUN
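
`pos_tag()` returns Penn Treebank tags such as `JJ`, `VBD`, `RB`, and `NN`, and the helper above maps them by their first letter. The sketch below mirrors that logic with a hypothetical function `penn_to_wordnet` that returns the one-letter codes directly (`"a"`, `"v"`, `"r"`, `"n"`, which are the values of `wordnet.ADJ`, `wordnet.VERB`, `wordnet.ADV`, and `wordnet.NOUN`), so it runs without the WordNet corpus. The tagged pairs are hand-written for illustration, not real `pos_tag()` output.

```python
def penn_to_wordnet(penn_tag):
    """Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun."""
    first_letter_to_pos = {"J": "a", "V": "v", "R": "r"}
    return first_letter_to_pos.get(penn_tag[:1], "n")

# Hand-written (word, tag) pairs; real tags come from nltk.pos_tag().
sample_tagged = [("small", "JJ"), ("awoke", "VBD"), ("earlier", "RBR"), ("moon", "NN")]

mapped = [(word, penn_to_wordnet(tag)) for word, tag in sample_tagged]
print(mapped)
```

Any tag that does not start with `J`, `V`, or `R` (for example `CD` for numbers) falls through to the noun default, matching the `else` branch of the helper.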

In [10]:
# Your code goes here
#---------------------

import nltk
tagged_words = nltk.pos_tag(tokens)

lemmas = [wordnet.lemmatize(word, pos=nltk_tag_to_wordnet_tag(tag)) for word, tag in tagged_words]
print(lemmas[:10])

['july', 'neil', 'armstrong', 'awoke', 'start', 'day', 'would', 'become', 'first', 'human']
