[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Humboldt-WI/adams/blob/master/demos/nlp/nlp_foundations.ipynb)

# Fundamentals of natural language processing (NLP)
The notebook revisits the lecture on the foundations of natural language processing. We examine common tasks in the preparation of textual data for analysis. Several Python libraries, including `scikit-learn` and `Keras`, offer functionality for text data preparation. In this notebook, we will use `NLTK` (the Natural Language Toolkit). It has a clear and easy-to-understand syntax and is well suited to demonstrating standard NLP operations. Although not the focus of this tutorial, we also introduce a library called `Beautiful Soup`, which has gained a lot of popularity in web scraping. Make sure to have these libraries installed before running the code below. 

Also note that the demo draws inspiration from a [Kaggle kernel](https://www.kaggle.com/code/sudalairajkumar/getting-started-with-text-preprocessing/notebook#Conversion-of-Emoji-to-Words). The kernel demonstrates even more functionality, so check it out if you are interested.


Here is the agenda of the session:

1. Preparing text for analysis: the standard NLP pipeline
2. Use case: the IMDB movie review data set

In [35]:
import numpy as np
# Library for standard NLP workflow
import nltk  
# When running this notebook for the first time, you have to download some NLTK packages. To do so, simply uncomment the next lines
#nltk.download('punkt')
#nltk.download('stopwords')
#nltk.download('wordnet')
#nltk.download('averaged_perceptron_tagger')

### 1. Preparing text for analysis: the standard NLP pipeline
To illustrate standard NLP preprocessing operations, we need some demo text. Below is an extract from a famous book; no need to quote I guess 😉

In [7]:
text_raw = """ 
            I wonder if I have been changed in the night. Let me think. Was I the same when I got up this morning? 
            I almost can remember feeling a little different. But if I am not the same, the next question is 'Who in the world am I?' 
            Ah, that is the great puzzle!
           """
print(text_raw)

 
            I wonder if I have been changed in the night. Let me think. Was I the same when I got up this morning? 
            I almost can remember feeling a little different. But if I am not the same, the next question is 'Who in the world am I?' 
            Ah, that is the great puzzle!
           


In the following parts, we incrementally build the functionality for a full preprocessing chain. To be able to nicely put everything together in the end, we will wrap up every piece of functionality that we build in a custom function.

#### Remove Whitespace

In [8]:
def remove_whitespace(text):
    """ Function to remove whitespace (tabs, newlines). """
    return ' '.join(text.split())

text_processed = remove_whitespace(text_raw)
print(text_processed)

I wonder if I have been changed in the night. Let me think. Was I the same when I got up this morning? I almost can remember feeling a little different. But if I am not the same, the next question is 'Who in the world am I?' Ah, that is the great puzzle!
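Why does this one-liner work? Called without arguments, `str.split()` splits on any run of whitespace (spaces, tabs, newlines) and drops empty strings, so joining the pieces with a single space normalizes the text. As a small sketch for comparison, here is a regular-expression variant; the name `remove_whitespace_re` is made up for this illustration.

```python
import re

def remove_whitespace(text):
    """Collapse runs of whitespace into single spaces (same as above)."""
    return ' '.join(text.split())

def remove_whitespace_re(text):
    """Hypothetical regex-based variant, shown only for comparison."""
    return re.sub(r'\s+', ' ', text).strip()

sample = "  Let me\tthink.\n\n  Was I the same?  "
print(remove_whitespace(sample))
print(remove_whitespace(sample) == remove_whitespace_re(sample))
```

Both variants should return the same string for this sample; the `split`/`join` version simply needs no extra import.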


Hm, but the punctuation is still there. Is it noise or is it useful? Let's try removing it for now (there are several ways to do this). Additionally, we will drop tokens that contain non-alphabetic characters and convert everything to lower case.

#### Punctuation, Whitespace and Casing


In [9]:
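def remove_punctuation_and_casing(text):
    """
    Function to remove the punctuation, upper casing and words that include
    non-alphabetic characters.
    """
    # Punctuation to replace by spaces; the backslash is escaped explicitly.
    # The apostrophe is not in this list, so tokens that contain one
    # (e.g., "'Who") fail the isalpha() check below and are dropped.
    chars = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~'
    text = text.translate(str.maketrans(chars, ' ' * len(chars)))
    return ' '.join([word.lower() for word in text.split() if word.isalpha()])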

text_processed = remove_punctuation_and_casing(text_processed)
print(text_processed)

i wonder if i have been changed in the night let me think was i the same when i got up this morning i almost can remember feeling a little different but if i am not the same the next question is in the world am i ah that is the great puzzle
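Notice that the word *who* is missing from the output. The token `'Who` keeps its leading apostrophe, which is not among the replaced characters, so it fails the `isalpha()` test and is dropped. If that is not what you want, one option is to replace every ASCII punctuation mark. The sketch below uses `string.punctuation`, which includes the apostrophe; the function name is hypothetical and not part of the notebook.

```python
import string

def remove_punctuation_all(text):
    """Hypothetical variant: replace every ASCII punctuation mark with a space."""
    table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
    return ' '.join(w.lower() for w in text.translate(table).split() if w.isalpha())

quote = "the next question is 'Who in the world am I?'"
print(remove_punctuation_all(quote))
```

The price of this variant is that contractions are split: *don't* becomes the two tokens *don* and *t*. Which behavior is preferable depends on the task.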


This is starting to look like a bag of words, right? There are some more issues we want to address, though. One of them is 'stop words': words such as "the", "a", or "and" carry little meaning on their own and mainly serve to hold sentences together, so they add noise. NLTK provides its own list of stop words.

#### Stopwords

In [10]:
from nltk.corpus import stopwords

english_stopwords = stopwords.words('english')
english_stopwords[0:9]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you']

In [11]:
len(english_stopwords)

179

The list of stop words looks comprehensive. However, say you miss a 'stop word' that you would also like to filter. You can extend the above list easily. After all, it is just a list.

In [12]:
t = type(english_stopwords)
print('Data type of stopwords is:', t )

Data type of stopwords is: <class 'list'>


In [14]:
# Add some custom stopwords
english_stopwords.append('some_word_you_dont_like')  # you can do everything that is allowed with a <list>
english_stopwords[-1]  # for example get the element at a certain position

'some_word_you_dont_like'

Finally, let's remove the stopwords from our processed sample text.

In [15]:
def remove_stopwords(text):
    """ Function to remove stopwords. """
    return ' '.join([word for word in str(text).split() if word not in english_stopwords])

text_processed = remove_stopwords(text_processed)
print(text_processed)

wonder changed night let think got morning almost remember feeling little different next question world ah great puzzle
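A practical aside: testing membership in a list scans the list element by element, which adds up when cleaning many long documents. Converting the stop words to a `set` makes each lookup much cheaper. The following is a minimal sketch under the assumption of a tiny hand-picked stop word list (`demo_stopwords`), so that it runs without NLTK; the function name is hypothetical as well.

```python
# A tiny hand-picked subset for illustration; the notebook uses NLTK's full list.
demo_stopwords = ['i', 'if', 'have', 'been', 'in', 'the', 'me']
stopword_set = set(demo_stopwords)  # set membership is O(1) on average

def remove_stopwords_fast(text, stop_set=stopword_set):
    """Hypothetical variant of remove_stopwords that looks tokens up in a set."""
    return ' '.join(word for word in text.split() if word not in stop_set)

print(remove_stopwords_fast('i wonder if i have been changed in the night let me think'))
```

In the notebook, the equivalent would be to build the set once from `english_stopwords` after any custom words have been appended.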


Filtering stopwords reduces the number of tokens quite a bit, doesn't it? Recall that the number of unique tokens plays a key role in the bag-of-words model. For example, when representing text in the form of a document-term matrix (DTM), the number of columns of the DTM equals the number of distinct tokens. So let's try to reduce it even further.
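As a brief aside, the link between distinct tokens and the width of a DTM can be made concrete with a toy corpus. The sketch below builds a count-based DTM by hand with `collections.Counter`; the three documents are made up for illustration, and real projects would typically rely on a library vectorizer instead.

```python
from collections import Counter

docs = ['great movie great cast', 'boring movie', 'great puzzle']  # toy corpus

# The vocabulary is the sorted set of distinct tokens; it defines the DTM columns.
vocabulary = sorted({token for doc in docs for token in doc.split()})

# One row per document, one column per vocabulary entry, cells hold token counts.
dtm = []
for doc in docs:
    counts = Counter(doc.split())
    dtm.append([counts[token] for token in vocabulary])

print(vocabulary)
for row in dtm:
    print(row)
```

Every additional distinct token adds one column to every row, which is why shrinking the vocabulary pays off quickly for large corpora.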

#### Lemmatization and Stemming
You might have already thought of the issue: what if a word is used in different forms? Each form will be treated as a different word, right? That is where **stemming** and **lemmatization** come into play. The former approach is simpler and consists mainly of 'cutting off the end' of words. The latter reduces a word to its dictionary form. To that end, we need to have a dictionary available. Let's first illustrate simple stemming.

In [16]:
from nltk.stem import PorterStemmer  # Other stemmers are supported as well
stemmer = PorterStemmer()

def stem_words(text):
    """ Function to stem words. """
    return ' '.join([stemmer.stem(word) for word in text.split()])

text_processed = stem_words(text_processed)
print(text_processed)

wonder chang night let think got morn almost rememb feel littl differ next question world ah great puzzl


Simple, isn't it? With just one example sentence, it is hard to appreciate the benefits of stemming. The idea is that, in a large corpus, many words will appear multiple times in different grammatical forms. Still, the meaning that these words carry is roughly identical. Running, run, ran, runner, etc.: all of these words indicate that the text has something to do with running. Assuming that this is all we need to know (yes, that is a bold assumption), stemming makes sense as it can greatly reduce the number of distinct words in a corpus. This number of distinct words, also called **vocabulary size**, is very important. It affects the efficiency of NLP operations and may also have a big impact on the accuracy of text classification. <br>
Let's now take a look at lemmatization. Here, things are a little more complicated. While NLTK offers a ready-to-use function, we need to tell it the grammatical form of the word that we want to lemmatize. Consider this example:

In [18]:
# NLTK lemmatization
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

# You need to choose the type of word:
print(lemmatizer.lemmatize("stripes", 'n'))  # here we claim stripes is a noun
print(lemmatizer.lemmatize("stripes", 'v'))  # what happens if we claim it is a verb? 

stripe
strip


How would we know that grammatical form? In fact, determining this form is an NLP task in its own right. It is called **POS tagging**. Much research has been done on coming up with clever ways to determine POS (part-of-speech) tags. We will not go into details. A simple POS tagger is available as part of the `NLTK` library. It can be used like this:

In [19]:
nltk.pos_tag(["She", "earned", "her", "stripes", "with", "great", "performance"])

[('She', 'PRP'),
 ('earned', 'VBD'),
 ('her', 'PRP'),
 ('stripes', 'NNS'),
 ('with', 'IN'),
 ('great', 'JJ'),
 ('performance', 'NN')]

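We make use of the above POS tagger later. For now, let's simply call the lemmatizer without a POS tag, in which case it treats every word as a noun. Note that `text_processed` has already been stemmed at this point. Stems such as *chang* or *morn* are not dictionary words, so the lemmatizer returns them unchanged and the output below is identical to the stemmed text.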

In [20]:
def lemmatize_words(text, **kwargs):
    """ Function to lemmatize words. """
    return ' '.join([lemmatizer.lemmatize(word, **kwargs) for word in text.split()])

text_processed = lemmatize_words(text_processed)
print(text_processed)

wonder chang night let think got morn almost rememb feel littl differ next question world ah great puzzl


#### Cleaning HTML

For a more sophisticated cleaning of text, you might want to consider **regular expressions**. In a nutshell, regular expressions are a family of text processing techniques for searching and replacing text. Their capability to match expressions in a text, for example an email, is quite powerful. A quick read through the corresponding [Wikipedia page](https://en.wikipedia.org/wiki/Regular_expression) would be useful. Also, here is a [nice playground](https://regexr.com/). 

Let's at least illustrate regular expressions briefly. To that end, we need some new demo text, which includes HTML. Here we go:<br>
This text includes the email address of Stefan, which is <stefan.lessmann@hu-berlin.de>. 
Also, we use <em>html</em> to <b>emphasize</b> parts and include breaks <br> to separate lines.

In [23]:
# Another piece of demo text illustrating some common issues
re_demo = """
            This text includes the email address of Stefan, which is <stefan.lessmann@hu-berlin.de>. 
            Also, we use <em>html</em> to <b>emphasize</b> parts and include breaks <br> to separate lines.
          """

Finding or filtering email addresses is a common use case when processing text.

In [24]:
# Finding emails using RE
import re  # Python library for regular expressions

# Simple pattern to match email addresses
pat = r'([\w\.-]+@[\w\.-]+\.[\w]+)+'  # raw string, so the backslashes reach the regex engine unchanged

# Extracting email addresses
email = re.findall(pat, re_demo)
print('Found: ', email)

# Filter sub-strings
re.sub(pat, '', remove_whitespace(re_demo))

Found:  ['stefan.lessmann@hu-berlin.de']


'This text includes the email address of Stefan, which is <>. Also, we use <em>html</em> to <b>emphasize</b> parts and include breaks <br> to separate lines.'
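Note that the substitution above leaves an empty pair of angle brackets behind. A common alternative is to replace each address with a placeholder token, so that downstream steps still know an address was there. Here is a minimal sketch of that idea; `EMAIL_PATTERN`, `mask_emails`, the placeholder `EMAIL`, and the second address are assumptions made for this example, and the pattern is a simple heuristic rather than a full e-mail validator.

```python
import re

# Same idea as the pattern above, written as a raw string; the optional
# '<' and '>' around the address are consumed as well.
EMAIL_PATTERN = re.compile(r'<?[\w.-]+@[\w.-]+\.\w+>?')

def mask_emails(text, placeholder='EMAIL'):
    """Replace every match of EMAIL_PATTERN with a placeholder token."""
    return EMAIL_PATTERN.sub(placeholder, text)

demo = 'Write to <stefan.lessmann@hu-berlin.de> or to someone@example.org.'
print(EMAIL_PATTERN.findall(demo))
print(mask_emails(demo))
```

Compiling the pattern once with `re.compile` is convenient when the same expression is applied to many documents.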

So we can filter e-mail addresses. Nice. But creating such regular expressions for all sorts of HTML tags we might want to filter will prove challenging. Luckily, we do not have to worry. Enter `Beautiful Soup`. <br>With no more than two lines of code, our text is nice and clean.

In [26]:
# Library beautifulsoup4 handles html
from bs4 import BeautifulSoup

# Remove html content; naming the parser explicitly avoids a warning
remove_whitespace(BeautifulSoup(re_demo, 'html.parser').get_text())

'This text includes the email address of Stefan, which is . Also, we use html to emphasize parts and include breaks to separate lines.'
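For readers who cannot install Beautiful Soup, the same idea can be approximated with the standard library. The sketch below is not what Beautiful Soup does internally. It is a simplified stand-in based on `html.parser.HTMLParser` that collects the text between tags. The names `_TextExtractor` and `strip_html` are hypothetical.

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects the text between tags and ignores the tags themselves."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_html(text):
    """Hypothetical helper: return the text content with tags removed."""
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()
    # Joining with a space keeps words apart that were separated only by a tag
    # such as <br />; it would also split a word with inline markup inside it.
    return ' '.join(' '.join(parser.parts).split())

print(strip_html('A wonderful little production.<br /><br />The filming is <b>great</b> overall.'))
```

Inserting a space at tag boundaries is a design choice: it prevents sentences from being glued together when a line break tag is the only separator between them.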

#### Conversion of Emojis and Emoticons

Emoticons are short sequences of ASCII characters, such as `:-)`, and emojis are Unicode pictographs, such as 🚀. Both express moods or feelings in written communication. In use cases like sentiment analysis, emoticons and emojis can provide valuable information.

One way to make use of the information they convey is to convert emoticons and emojis into text that reflects their meaning. To that end, we will use a condensed copy of the `emot` library by Neel Shah [(GitHub)](https://github.com/NeelShah18/emot/blob/master/emot/core.py). When running this notebook, make sure that the file `emot_dictionary.py` is in the same directory as the notebook. Note that, when using Google Colab, it is simpler to just install the package by running `!pip install emoji`.

In [28]:
import emot_dictionary as emot

emo_demo = """
            The movie was fantastic :o :-)) 🚀 👏
           """

Let's first convert the emoticons.

In [29]:
def convert_emoticons(text):
    """ Function to convert emoticons into a text that reflects their meaning. """
    EMOTICONS = emot.EMOTICONS()
    for i in EMOTICONS:
        text = text.replace(i, EMOTICONS[i])
    return text

emo_demo = convert_emoticons(remove_whitespace(emo_demo))
emo_demo

'The movie was fantastic Surprise Very happy 🚀 👏'
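One subtlety of replacing emoticons with plain string replacement is ordering: if `:-)` is replaced before `:-))`, the longer emoticon turns into the shorter one's text followed by a stray `)`. Whether this matters depends on the order of entries in the dictionary. A defensive sketch is to try longer emoticons first. The three-entry dictionary and the function name below are made up for illustration.

```python
# Tiny hypothetical dictionary; the notebook loads a much larger one from emot_dictionary.
DEMO_EMOTICONS = {':-)': 'Happy face', ':-))': 'Very happy', ':o': 'Surprise'}

def convert_emoticons_longest_first(text, emoticons=DEMO_EMOTICONS):
    """Replace emoticons by their meaning, trying longer emoticons first."""
    for emoticon in sorted(emoticons, key=len, reverse=True):
        text = text.replace(emoticon, emoticons[emoticon])
    return text

print(convert_emoticons_longest_first('The movie was fantastic :o :-))'))
```

Sorting by length is cheap and makes the result independent of how the dictionary happens to be ordered.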

And now we also convert the emojis.

In [30]:
def convert_emojis(text):
    """ Function to convert emojis into a text that reflects their meaning. """
    EMOJIS = emot.EMOJIS_UNICODE()
    for i in EMOJIS:
        text = text.replace(EMOJIS[i], i.translate(str.maketrans('', '', ':')).replace(r'_', r' '))
    return text

convert_emojis(emo_demo)

'The movie was fantastic Surprise Very happy rocket clapping hands'
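If no emoji dictionary is at hand, the official Unicode character names offer a rough substitute. The sketch below uses the standard-library module `unicodedata` and is an alternative to the dictionary approach above, not a reimplementation of it. The function name is hypothetical, and the names it produces can differ from those in the `emot` dictionary (for example, the Unicode name of 👏 is "clapping hands sign").

```python
import unicodedata

def convert_emojis_stdlib(text):
    """Hypothetical variant: replace symbol characters by their Unicode names."""
    out = []
    for char in text:
        # 'So' (Symbol, other) is the Unicode category of most emojis.
        if unicodedata.category(char) == 'So':
            out.append(unicodedata.name(char, '').lower())
        else:
            out.append(char)
    return ''.join(out)

print(convert_emojis_stdlib('The movie was fantastic 🚀'))
```

This simple loop works character by character, so emojis composed of several code points (such as those with skin-tone modifiers) are not handled as a unit.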

#### Wrapping up
Albeit simple, the above demos provide a glimpse of text cleaning. While you could do a lot more, tasks like stop word removal will come up in many NLP projects. We conclude this part by putting all of the above steps into a helper function, which we will use later to clean a data set of online movie reviews. Our helper function will use lemmatization instead of stemming because it is likely to give better results in downstream tasks (e.g., text classification). The lemmatizer needs the part of speech of each word, so we first define a small function that looks up the POS tag of a word and translates it into the code that the lemmatizer expects.

In [36]:
# Lemmatize with POS Tag
from nltk.corpus import wordnet

def get_wordnet_pos(word):
    """Helper function that calls the POS tagger for an input word and returns a code that can be used for lemmatization."""
    # Extract the first letter of the POS tag (see the above example to understand the output coming from pos_tag)
    tag = nltk.pos_tag([word])[0][1][0].upper()  
    # Dictionary to map these letters to wordnet codes that the lemmatizer understands
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

In [32]:
# Test the helper function
[get_wordnet_pos(x) for x in ["She", "earned", "her", "stripes", "with", "great", "performance"]]

['n', 'v', 'n', 'n', 'n', 'a', 'n']
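Keep in mind that the helper tags each word in isolation, whereas a POS tagger generally benefits from seeing the whole sentence. A possible refinement is to tag the sentence once and map the resulting tags afterwards. The sketch below illustrates only the mapping step: it reuses the tagged sentence from the `pos_tag` output shown earlier and hard-codes the single-letter WordNet codes, so it runs without NLTK. `TAG_TO_WORDNET` and `to_wordnet_pos` are hypothetical names.

```python
# Tagged sentence copied from the pos_tag output shown earlier in this notebook.
tagged = [('She', 'PRP'), ('earned', 'VBD'), ('her', 'PRP'), ('stripes', 'NNS'),
          ('with', 'IN'), ('great', 'JJ'), ('performance', 'NN')]

# First letter of the Penn Treebank tag -> WordNet POS code
# (the letters correspond to wordnet.ADJ, NOUN, VERB, and ADV).
TAG_TO_WORDNET = {'J': 'a', 'N': 'n', 'V': 'v', 'R': 'r'}

def to_wordnet_pos(penn_tag, default='n'):
    """Map a Penn Treebank tag to a WordNet POS code, defaulting to noun."""
    return TAG_TO_WORDNET.get(penn_tag[:1].upper(), default)

print([(word, to_wordnet_pos(tag)) for word, tag in tagged])
```

In the notebook, each `(word, code)` pair could then be passed to `lemmatizer.lemmatize(word, code)`.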

And here is the real helper function for text cleaning. We will make use of it right after introducing our data set for subsequent parts. Since that data is stored in the form of a data frame, we refrain from making our helper function more general and simply assume that incoming text is a Pandas Series object (i.e., one column of a data frame).

In [37]:
def text_cleaning(documents):
    """
    Function for standard NLP pre-processing including removal of html tags,
    whitespaces, non-alphanumeric characters, and stopwords. Emoticons are
    converted to text that reflects their meaning. Words are subject to
    lemmatization using their POS tags.
    """
    cleaned_text = []  # our output will be a list of documents
    lemmatizer = WordNetLemmatizer()
    
    print('Processing input array with {} elements...'.format(documents.shape[0]))
    counter = 0
    
    for doc in documents:
        text = BeautifulSoup(doc, 'html.parser').get_text() # remove html content (explicit parser avoids a warning)
        text = remove_whitespace(text) # remove whitespaces
        text = convert_emoticons(text) # convert emoticons to text
        text = remove_punctuation_and_casing(text) # remove punctuation and casing
        text = remove_stopwords(text) # remove stopwords
        text = ' '.join([lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in text.split()]) # lemmatize each word
        
        cleaned_text.append(text)

        if (counter > 0 and counter % 50 == 0):
            print('Processed {} documents'.format(counter))
            
        counter += 1
        
    return cleaned_text

## 2. Use case: the IMDB movie review data set
We use a popular NLP data set consisting of movie reviews posted at [IMDB](https://www.imdb.com/). The data is available in different sizes and shapes (cleaned, raw, ...) on the web. We use a version from Kaggle, which includes 50K reviews and binary labels indicating whether a review is positive or negative. The labels are useful for sentiment analysis, which we will do in a future demo. Here, we simply prepare the data for subsequent uses and, in doing so, further elaborate on the NLP operations introduced in the previous part. You can download the raw data from Kaggle: https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews/data. A version is also available in our [GitHub repository](https://github.com/Humboldt-WI/adams/tree/master/demos/nlp). 

### Load the data

In [49]:
# Remember to adjust the path so that it matches your environment
import pandas as pd

imdb_data = pd.read_csv("IMDB-50K-Movie-Review.zip", sep=",", encoding="ISO-8859-1")
imdb_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB


In [50]:
imdb_data.head()

                                              review  sentiment
0  One of the other reviewers has mentioned that ...   positive
1  A wonderful little production. <br /><br />The...   positive
2  I thought this was a wonderful way to spend ti...   positive
3  Basically there's a family where a little boy ...   negative
4  Petter Mattei's "Love in the Time of Money" is...   positive


The data is really simple; just two columns, one for the binary sentiment and one for the text of the review. Apparently, some of the reviews include HTML. We already added functionality to handle HTML into our text cleaning function. So this should not cause us any trouble. Let's look at an arbitrary review to get a better understanding of the text.

In [53]:
ix = 8
imdb_data.loc[ix, 'review']

"Encouraged by the positive comments about this film on here I was looking forward to watching this film. Bad mistake. I've seen 950+ films and this is truly one of the worst of them - it's awful in almost every way: editing, pacing, storyline, 'acting,' soundtrack (the film's only song - a lame country tune - is played no less than four times). The film looks cheap and nasty and is boring in the extreme. Rarely have I been so happy to see the end credits of a film. <br /><br />The only thing that prevents me giving this a 1-score is Harvey Keitel - while this is far from his best performance he at least seems to be making a bit of an effort. One for Keitel obsessives only."

In [54]:
imdb_data.loc[ix, 'sentiment']

'negative'

### Sampling
Working with the full data set of 50K reviews is time-consuming. When experimenting with the notebook, you might want to draw a random sample to speed up computations. On a modern computer, a sample size of 5000 should be feasible without increasing the runtime too much. For the sake of our demo, we use only 500 reviews to save time. 
Note that the results of processing the full data set are available in our course folder.

In [56]:
# Draw a random sample to save time
sample_size = 500
np.random.seed(888)
# Note: randint draws row indices with replacement, so a review may be selected more than once
idx = np.random.randint(low=0, high=imdb_data.shape[0], size=sample_size)
imdb_data = imdb_data.loc[idx,:]

imdb_data.reset_index(inplace=True, drop=True)  # dropping the index prevents re-identifying the cases in the original data frame
imdb_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     500 non-null    object
 1   sentiment  500 non-null    object
dtypes: object(2)
memory usage: 7.9+ KB
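A note on the sampling itself: drawing indices with a random-integer generator samples with replacement, so the same review can appear more than once in the sample. If distinct reviews are required, sample without replacement instead. The sketch below contrasts the two using only the standard library; the variable names are made up, and in the notebook one could alternatively use `DataFrame.sample(n=sample_size, random_state=888)`, which draws without replacement by default.

```python
import random

n_rows, sample_size = 50000, 500
rng = random.Random(888)  # fixed seed for reproducibility

# With replacement: the same row index may be drawn more than once.
idx_with = [rng.randrange(n_rows) for _ in range(sample_size)]

# Without replacement: every index appears at most once.
idx_without = rng.sample(range(n_rows), sample_size)

print(len(set(idx_with)), len(set(idx_without)))
```

The second number printed is always 500, whereas the first can be smaller whenever an index was drawn repeatedly.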


### Data cleaning
Thanks to our careful preparation, cleaning the reviews should be easy. All it takes is applying our cleaning function to the data.

In [57]:
# Do the cleaning
# CAUTION: depending on your data set size, the processing might take a while 
import time  # To keep an eye on runtimes
start = time.time()
imdb_data['review_clean'] = text_cleaning(imdb_data.review)
print('Duration: {:.0f} sec'.format(time.time()-start))

Processing input array with 500 elements...
Processed 50 documents
Processed 100 documents
Processed 150 documents
Processed 200 documents
Processed 250 documents
Processed 300 documents
Processed 350 documents
Processed 400 documents
Processed 450 documents
Duration: 20 sec


In [61]:
# Check all is well
ix = 0  # just one example; try other indices to further examine the effect of our cleaning
print('Original Review:\n' + imdb_data.review[ix])  
print('\nCleaned Review:\n' + imdb_data.review_clean[ix])

Original Review:
i, too, loved this series when i was a kid. In 1952 i was 5 and my family always watched this show. My favorite character was the one played by Marion Lorne as a rather stuttering, bumbling and very lovable "aunt" type person. i can still recall her "ubba bubba um um" type comments as she would try and say something important. And then when she came back and played Aunt Clara in Bewitched it was great casting! <br /><br />It was the first time that i can remember seeing Walter Matthau whose career i followed as a fan for many many years.<br /><br />i have a question if anyone can verify: was the title or end credits music the "Swedish Rhapsody" by Hugo Alfven? Every time i hear it played on my classical radio station here in Southern California it brings back memories of the image of Mr. Peepers walking away with his back to the camera. i'm not even certain if this image in my mind's eye is correct.

Cleaned Review:
love series kid family always watch show favorite cha

In [62]:
imdb_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   review        500 non-null    object
 1   sentiment     500 non-null    object
 2   review_clean  500 non-null    object
dtypes: object(3)
memory usage: 11.8+ KB


In [63]:
imdb_data.head(8)

                                              review  sentiment                                       review_clean
0  i, too, loved this series when i was a kid. In...   positive  love series kid family always watch show favor...
1  I saw this film on the same night I saw 6 othe...   positive  saw film night saw short one leap bound ahead ...
2  My first thoughts on this film were of using s...   negative  first thought film use science fiction bad way...
3  I couldn't tell if "The Screaming Skull" was t...   negative  tell scream skull try hitchcock rip modernize ...
4  I found The FBI Story considerably entertainin...   positive  found fbi story considerably entertain suitabl...
5  Firstly, the title has no relevance whatsoever...   negative  firstly title relevance whatsoever movie start...
6  The acronymic "F.P.1" stands for "Floating Pla...   negative  acronymic f p stand float platform film porten...
7  This short is one of the best of all time and ...   positive  short one best time proof like charlie work so...


Looks like the cleaning has fulfilled its purpose.

### File input and output
Should you have used the full data set in the above cleaning, you will want to store your results. The following code exemplifies the use of `pickle`, a module from the Python standard library that Pandas supports natively, to store data sets in a binary format. Compared to CSV, the advantage of a binary format is that data types and Python objects are restored exactly as they were saved, without any parsing. Since `pickle` ships with Python, there is nothing to install. 

In [64]:
# Saving objects to disk using pickle
import pickle

imdb_data.to_pickle('your_file_name.pkl')
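To see what storing and restoring with `pickle` involves, here is a minimal round-trip sketch on a plain Python object. It writes to a temporary directory so that nothing in your working folder is touched; the file name `demo.pkl` and the example record are assumptions made for this illustration.

```python
import os
import pickle
import tempfile

records = [{'review_clean': 'love series kid', 'n_tokens': 3, 'positive': True}]

path = os.path.join(tempfile.mkdtemp(), 'demo.pkl')  # hypothetical file name
with open(path, 'wb') as f:
    pickle.dump(records, f)

with open(path, 'rb') as f:
    restored = pickle.load(f)

# The restored object equals the original, including int and bool types.
print(restored == records, type(restored[0]['n_tokens']).__name__)
```

For a data frame written with `to_pickle`, the matching Pandas call for loading is `pd.read_pickle`.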

### A bird's eye view on the data
Let's have a quick look at what folks talk about in this data set. Using the class *Counter* from the *collections* module, we can easily count word occurrences and query the most common words. We can also check the number of occurrences of specific words. We do not really need the *word_counter* here and only use it to get a feeling for the data set. Of course, these types of checks make more sense when using the full data set. 

In [94]:
# Here is a bit of code to load the full data set of the cleaned reviews, make sure to download it first
import pickle
with open('imdb_clean_full_v2.pkl','rb') as path_name:
    clean_reviews = pickle.load(path_name)
len(clean_reviews)

50000

In [92]:
# Loop through the reviews and update a counter keeping track of word counts
import collections

word_counter = collections.Counter()
for r in clean_reviews["review_clean"]:
    word_counter.update(r.split())  # update() accepts an iterable of tokens

In [95]:
# Query the top most frequent words
top_n = 10
word_counter.most_common(top_n)

[('movie', 102248),
 ('film', 93765),
 ('one', 54830),
 ('make', 46061),
 ('like', 44268),
 ('see', 41548),
 ('get', 34802),
 ('well', 32800),
 ('time', 31451),
 ('good', 29700)]

The above results hint at some more challenges when working with text data. Among the top ten most frequent words, none is really surprising or appears interesting. Well, what is interesting depends on the task. For example, words like *like* and *good* have meaning in a sentiment analysis setting. However, words like *movie* or *film* will naturally appear in a data set of movie reviews and will likely not contribute useful information to any downstream task. This indicates that, in addition to filtering stop words, there could be other 'normal' words (i.e., not stop words) that we might want to filter. Again, preparing text data can be rather laborious...<br>
Let's check if people also talk about something more relevant.

In [96]:
# Check frequency of some target word
word_counter["spielberg"]

198
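Coming back to corpus-specific words such as *movie* or *film*: one common heuristic for finding them is the document frequency, that is, the share of documents in which a token occurs. Tokens that show up in nearly every document carry little discriminating information. The sketch below applies this idea to a toy corpus; the documents and the threshold `max_share` are made up for illustration.

```python
from collections import Counter

docs = ['great movie great cast', 'boring movie', 'movie puzzle great']  # toy corpus

# Document frequency: in how many documents does each token occur?
doc_freq = Counter()
for doc in docs:
    doc_freq.update(set(doc.split()))

max_share = 0.9  # hypothetical threshold: drop tokens found in more than 90% of documents
too_common = {tok for tok, df in doc_freq.items() if df / len(docs) > max_share}

filtered = [' '.join(t for t in doc.split() if t not in too_common) for doc in docs]
print(too_common)
print(filtered)
```

Library vectorizers often offer this filter directly. In scikit-learn, for instance, `CountVectorizer` has a `max_df` parameter for this purpose.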

## Summary
Well done, you have reached the end of yet another ADAMS demo, which exposed you to the fundamentals of text preprocessing and NLP. Our next demo will bring us back to the IMDB movie review data set and revisit a famous algorithm for learning distributed representations of textual data called **Word2Vec**. So stay tuned.