### Codio Activity 18.3: Stemming and Lemmatization

**Expected Time = 60 minutes**

**Total Points = 25**

In this activity, you will normalize text by stemming and lemmatizing it. You will first review the stemmer and lemmatizer on a basic list of words and then turn to data in a DataFrame, writing a function to apply stemming to a column of text data. The data is the WhatsApp status dataset from Kaggle, and you will focus on the `content` feature.

- [Problem 1](#-Problem-1)
- [Problem 2](#-Problem-2)
- [Problem 3](#-Problem-3)
- [Problem 4](#-Problem-4)
- [Problem 5](#-Problem-5)

In [29]:
import pandas as pd
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk import word_tokenize
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt')

[nltk_data] Downloading package wordnet to /home/codio/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/codio/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt to /home/codio/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

#### The Data

The text data again comes from [Kaggle](https://www.kaggle.com/datasets/sankha1998/emotion?select=Emotion%28sad%29.csv) and is related to classifying WhatsApp statuses. Only the "angry" sentiment is loaded below.


In [30]:
angry = pd.read_csv('data/Emotion(angry).csv')

In [31]:
angry.head()

Unnamed: 0,content,sentiment
0,"Sometimes I’m not angry, I’m hurt and there’s ...",angry
1,Not available for busy people☺,angry
2,I do not exist to impress the world. I exist t...,angry
3,Everything is getting expensive except some pe...,angry
4,My phone screen is brighter than my future 🙁,angry


[Back to top](#-Index)

### Problem 1

#### Stemming a list of words

**5 Points**

Use `PorterStemmer` to stem the different variations on the word "compute" in the list `C` below. Assign your results as a list to `stemmed_words`.

In [32]:
C = ['computer', 'computing', 'computed', 'computes', 'computation', 'compute']

In [33]:
### GRADED
stemmer = ''
stemmed_words = ''

    
### BEGIN SOLUTION
stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(w) for w in C]
### END SOLUTION

### ANSWER CHECK
print(type(stemmed_words))
print(stemmed_words)

<class 'list'>
['comput', 'comput', 'comput', 'comput', 'comput', 'comput']


In [34]:
### BEGIN HIDDEN TESTS
stemmer_ = PorterStemmer()
stemmed_words_ = [stemmer_.stem(w) for w in C]
#
#
#
#
assert stemmed_words == stemmed_words_
### END HIDDEN TESTS

[Back to top](#-Index)

### Problem 2

#### Lemmatizing a list of words

**5 Points**

Use `WordNetLemmatizer` to lemmatize the different variations on the word "compute" in the list `C` defined above. Assign your results as a list to `lemmatized_words`.

In [35]:
### GRADED
lemma = ''
lemmatized_words = ''

    
### BEGIN SOLUTION
lemma = WordNetLemmatizer()
lemmatized_words = [lemma.lemmatize(w) for w in C]
### END SOLUTION

### ANSWER CHECK
print(type(lemmatized_words))
print(lemmatized_words)

<class 'list'>
['computer', 'computing', 'computed', 'computes', 'computation', 'compute']


In [36]:
### BEGIN HIDDEN TESTS
lemma_ = WordNetLemmatizer()
lemmatized_words_ = [lemma_.lemmatize(w) for w in C]
#
#
#
#
assert lemmatized_words == lemmatized_words_
### END HIDDEN TESTS

[Back to top](#-Index)

### Problem 3

#### Which performed better?

**5 Points**

Assuming we wanted all the words in `C` to be normalized to the same word, which worked better toward this goal: stemming or lemmatizing? Assign your response as a string, either `stem` or `lemmatize`, to `ans3` below.

In [37]:
### GRADED
ans3 = ''

    
### BEGIN SOLUTION
ans3 = 'stem'
### END SOLUTION

### ANSWER CHECK
print(ans3)

stem


In [38]:
### BEGIN HIDDEN TESTS
ans3_ = 'stem'
#
#
#
#
assert ans3 == ans3_
### END HIDDEN TESTS

[Back to top](#-Index)

### Problem 4

#### A function for stemming

**5 Points**

Use `PorterStemmer` to complete the function `stemmer` below. This function should take in a string of text and return a string of stemmed text. Note that you will need to tokenize the text before stemming and should return a single string.  

Hint: use the `join` method

In [39]:
### GRADED
def stemmer(text):
    '''
    This function takes in a string of text and returns
    a string of stemmed text.
    
    Arguments
    ---------
    text: str
        string of text to be stemmed
        
    Returns
    -------
    str
       string of stemmed words from the text input
    '''
    return ''
    
### BEGIN SOLUTION
def stemmer(text):
    stem = PorterStemmer()
    return ' '.join([stem.stem(w) for w in word_tokenize(text)])
### END SOLUTION

### ANSWER CHECK
text = 'The computer did not compute the answers correctly.'
print(text)
print(stemmer(text))  # should return --> the comput did not comput the answer correctli .

The computer did not compute the answers correctly.
the comput did not comput the answer correctli .


In [40]:
### BEGIN HIDDEN TESTS
def stemmer_(text):
    stem = PorterStemmer()
    return ' '.join([stem.stem(w) for w in word_tokenize(text)])
text = 'The computer did not compute the answers correctly.'
stu = stemmer(text)
soln = stemmer_(text)
#
#
#
#
assert stu == soln
### END HIDDEN TESTS

[Back to top](#-Index)

### Problem 5

#### Using the stemmer on a DataFrame

**5 Points**

Apply your function `stemmer` to the `content` feature of the DataFrame `angry`. Assign the resulting Series to `stemmed_content` below.

Hint: use the `.apply` method

In [42]:
### GRADED
stemmed_content = ''

    
### BEGIN SOLUTION
stemmed_content = angry['content'].apply(stemmer)
### END SOLUTION

### ANSWER CHECK
print(type(stemmed_content))
print(stemmed_content.head())

<class 'pandas.core.series.Series'>
0    sometim i ’ m not angri , i ’ m hurt and there...
1                           not avail for busi people☺
2    i do not exist to impress the world . i exist ...
3    everyth is get expens except some peopl , they...
4          my phone screen is brighter than my futur 🙁
Name: content, dtype: object


In [43]:
### BEGIN HIDDEN TESTS
stemmed_content_ = angry['content'].apply(stemmer)
#
#
#
#
pd.testing.assert_series_equal(stemmed_content, stemmed_content_)
### END HIDDEN TESTS
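
The stemmed output above contains tokens such as `angri`, `busi`, and `futur`, which are not dictionary words. This is expected: the Porter stemmer strips suffixes by rule rather than looking words up. As a small illustration outside the graded cells, the sketch below stems a few individual words taken from the `content` preview; the variable names `porter`, `words`, and `stems` are chosen only for this example.

```python
from nltk.stem import PorterStemmer

# A separate stemmer instance so the notebook's `stemmer` function is not overwritten.
porter = PorterStemmer()

# Words that appear in the first rows of the `content` column.
words = ['angry', 'busy', 'future', 'expensive', 'available']

# Stem each word on its own; no tokenizer data is needed for single words.
stems = [porter.stem(w) for w in words]

for word, stem in zip(words, stems):
    print(f'{word} -> {stem}')
```

If readable output matters more than collapsing word variants, lemmatization may be the better fit, since it returns dictionary forms.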