# Tutorial 4 - Introduction to Natural Language Processing (NLP)

You are already familiar with building predictive models on tabular data. There, you have a feature matrix `X` and a target vector `Y`, and you can apply ML algorithms to learn the relationship between the two. In this notebook, you will work with a dataset of movie reviews, each labeled with either a negative or a positive sentiment. In contrast to numerical tabular data, textual data cannot be fed directly into ML algorithms. You therefore need to preprocess each text sample to obtain the required feature matrix `X`. This processing is what we call the "NLP pipeline". The dimensionality and the quality of the features that can be extracted from textual data depend on the preprocessing steps implemented in the pipeline. In this notebook, we will mainly focus on NLP preprocessing and frequency-based feature creation.<br>

Here is the outline of today's notebook:
*   NLP Processing Pipeline (Demo).
*   Preprocessing of Movie Reviews Data (Exercise 1).
*   Frequency-based Feature Creation (Bag-of-Words) for Text Classification (Exercise 2).

In [1]:
# required packages
import pandas as pd
import nltk #Natural Language Toolkit
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from bs4 import BeautifulSoup
import re
from sklearn.model_selection import train_test_split
from sklearn import metrics
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from sklearn.metrics import recall_score,precision_score,roc_auc_score
import numpy as np

## **1. NLP Processing Pipeline** (DEMO)

In this section, we will cover key data preprocessing steps of an NLP pipeline. First, let's revisit the pipeline with the visualization below:<br>

<img src="https://github.com/Humboldt-WI/demopy/raw/main/NLP_Pipeline_Overview.PNG" width="950" height="520" alt="NLP Pipeline">

We begin by gathering samples for our textual data corpus, e.g., by going through the archives of a company and pulling out Word files, PowerPoint presentations, etc. Since the machine learning models we have covered so far, such as neural networks, decision trees, and random forests, do not natively process text, we need to go through specific preprocessing and cleaning steps to convert textual data into a format that our algorithms can work with. Let's assume we have the following movie review:<br>

In [2]:
movie_review='<br>Last week, I went to the Berlin movie theater Zoo Palast, https://zoopalast.premiumkino.de, to watch the new action movie with Tom Cruise, which turned out to be very enjoyable experience. While there was a long queue at the entrance of the movie theater and I had to wait around 10 - 15 min. to go in, the movie was really worth it!<br>'
print(movie_review)

<br>Last week, I went to the Berlin movie theater Zoo Palast, https://zoopalast.premiumkino.de, to watch the new action movie with Tom Cruise, which turned out to be very enjoyable experience. While there was a long queue at the entrance of the movie theater and I had to wait around 10 - 15 min. to go in, the movie was really worth it!<br>


If our goal is to determine the sentiment of the movie review, then the HTML content, i.e., the `<br>` tags indicating a line break, and the URL do not provide us with any useful information. Therefore, the first two preprocessing steps are to filter out the URL and to remove the HTML content from our textual sample:<br>

In [3]:
# Remove the URL and the HTML content
print('1. Step: Removing URL Link')
clean_movie_review = re.sub(r"http\S+", "", movie_review)  # raw string avoids an invalid-escape warning
print(clean_movie_review,'\n')

print('2. Step: Removing HTML Content')
clean_movie_review = BeautifulSoup(clean_movie_review, "html.parser").get_text()  # name the parser explicitly
print(clean_movie_review)


1. Step: Removing URL Link
<br>Last week, I went to the Berlin movie theater Zoo Palast,  to watch the new action movie with Tom Cruise, which turned out to be very enjoyable experience. While there was a long queue at the entrance of the movie theater and I had to wait around 10 - 15 min. to go in, the movie was really worth it!<br> 

2. Step: Removing HTML Content
Last week, I went to the Berlin movie theater Zoo Palast,  to watch the new action movie with Tom Cruise, which turned out to be very enjoyable experience. While there was a long queue at the entrance of the movie theater and I had to wait around 10 - 15 min. to go in, the movie was really worth it!


In the previous example, we make use of *regular expressions* to clean the text. Regular expressions are a powerful mechanism for text processing. However, a comprehensive discussion of what you can do with regular expressions and how to use them is beyond the scope of this notebook. If you'd like to learn more about regular expressions, check out the [Wikipedia page](https://en.wikipedia.org/wiki/Regular_expression) as a starting point. The [W3Schools](https://www.w3schools.com/python/python_regex.asp) website and the [Regex101](https://regex101.com/) website provide easily accessible playgrounds to start working with regular expressions.
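Besides the online playgrounds, a pattern can also be inspected locally before it is used for substitution. The following is only a small sketch: the sample sentence and the URL `https://example.org/tickets` are made up for illustration, while the pattern `http\S+` ("http" followed by one or more non-whitespace characters) is the one used in this demo.

```python
import re

# Made-up sample sentence; the URL is a placeholder and not part of the dataset
text = "Tickets at https://example.org/tickets, doors open at 8."

# findall shows exactly which substrings the pattern matches
matches = re.findall(r"http\S+", text)
print(matches)

# sub replaces every match with the empty string
cleaned = re.sub(r"http\S+", "", text)
print(cleaned)
```

Inspecting the matches first reveals that the comma directly after the URL is part of the match as well, since `\S+` only stops at whitespace.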

In our demo, we use Python's package for regular expressions, `re`, to remove unnecessary characters. To this end, we specify the pattern that we would like to remove, i.e., `http\S+`, and its replacement, i.e., an empty string. `\S+` matches one or more consecutive non-whitespace characters. In combination with `http`, we essentially look for a substring that starts with "http" and is followed by one or more non-whitespace characters. Put differently, the pattern finds everything from "http" up to the next whitespace and removes it. Additionally, the package `BeautifulSoup` facilitates the extraction of information from web pages. When we feed text with HTML content to `BeautifulSoup`, we can use the function `get_text()` to retrieve the tag-free version of our textual sample. Besides HTML and URL content, we would also like to remove any non-alphabetic characters, as they usually carry little semantic information:

In [4]:
print('3. Step: Removing non-alphabetic Characters')
clean_movie_review = re.sub("[^a-zA-Z]", " ",clean_movie_review)
print(clean_movie_review)

3. Step: Removing non-alphabetic Characters
Last week  I went to the Berlin movie theater Zoo Palast   to watch the new action movie with Tom Cruise  which turned out to be very enjoyable experience  While there was a long queue at the entrance of the movie theater and I had to wait around         min  to go in  the movie was really worth it 


The bracket expression `[^...]` matches any single character that is not (`^`) contained in the specified set (`...`). Thus, with `[^a-zA-Z]`, we find every character that is neither a lowercase nor an uppercase letter and replace it with a space. Note that this also removes digits, which is why "10 - 15" no longer appears in the output. Since letter casing usually carries little information about the sentiment of a text, we will transform the whole movie review to lower case:

In [5]:
print('4. Step: Transforming all Characters to lower Case')
clean_movie_review = clean_movie_review.lower()
print(clean_movie_review)

4. Step: Transforming all Characters to lower Case
last week  i went to the berlin movie theater zoo palast   to watch the new action movie with tom cruise  which turned out to be very enjoyable experience  while there was a long queue at the entrance of the movie theater and i had to wait around         min  to go in  the movie was really worth it 
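The two string-level steps above (non-alphabetic filtering and lowercasing) can be tried on other inputs with the standard library alone. The sketch below is an illustration under assumptions: the helper name `clean_string` and the sample sentence are made up, and the final whitespace collapsing is an addition that the demo does not perform (there, the tokenizer takes care of repeated spaces).

```python
import re

def clean_string(text: str) -> str:
    """Replace non-letters with spaces, lowercase, and collapse whitespace."""
    text = re.sub(r"[^a-zA-Z]", " ", text)   # digits and punctuation become spaces
    text = text.lower()
    return " ".join(text.split())            # extra step: collapse runs of spaces

# Made-up sample sentence
sample = "Waited 10 - 15 min., but it's REALLY worth it!"
result = clean_string(sample)
print(result)
```

Two side effects are worth keeping in mind: numbers disappear entirely, and the apostrophe splits "it's" into the two fragments "it" and "s".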


Next, before we proceed to splitting our movie review into a list of words using the package `nltk`, we will revisit the concept of tokenization with the visualization below:<br>
<img src="https://github.com/Humboldt-WI/demopy/raw/main/Tokenization.PNG" width="950" height="440" alt="Tokenization">

In [6]:
import nltk
nltk.download('punkt_tab')
print('5. Step: Performing Tokenization')
clean_movie_review=nltk.word_tokenize(clean_movie_review)
print(clean_movie_review[:10])

5. Step: Performing Tokenization
['last', 'week', 'i', 'went', 'to', 'the', 'berlin', 'movie', 'theater', 'zoo']


[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


The resulting list of tokens directly impacts the feature extraction process. This is why we removed some unnecessary content before splitting the movie review into individual words. Still, across all reviews, the number of distinct tokens can easily become large. To further reduce this number, we eliminate the so-called stopwords. These usually do not help the model decide whether a movie review has a negative or a positive sentiment:<br>
<img src="https://github.com/Humboldt-WI/demopy/raw/main/Removal_Stopwords.PNG" width="950" height="530" alt="Stopwords">


In [12]:
nltk.download('stopwords') ## to download stopwords

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [13]:
print('Examples for Stopwords from NLTK: ',stopwords.words("english")[:10])

Examples for Stopwords from NLTK:  ['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an']


The stopwords from `nltk` represent a list of tokens that we will exclude from our textual sample. If we have reason to believe that some of the tokens in the stopword list carry meaning for our task, we would remove these tokens from the list before eliminating stopwords. We can also adjust the list in the opposite direction and extend it with specific tokens that are not covered by `nltk` but occur in our textual data. As shown in the previous cell, `stopwords.words("english")` returns the stopwords as a Python `list`. You can manipulate this list like any other `list` using its built-in methods such as `append()`, `remove()`, etc. For example, to add the token "yet", first store the list in a variable, e.g., `stop_words = stopwords.words("english")`, and then call `stop_words.append("yet")`. Calling `append()` on the `stopwords` object itself would fail, because it is a corpus reader and not a list. In our demo, we will use the default stopword list provided by `nltk`.

In [14]:
all_tokens=len(clean_movie_review)
print('6. Step: Stopwords Removal')
stop_words = set(stopwords.words("english"))  # build the set once for fast membership tests
clean_movie_review = [w for w in clean_movie_review if w not in stop_words]
print('- Number of Tokens before Stopwords Removal: ', all_tokens)
print('- Number of Tokens after Stopwords Removal: ', len(clean_movie_review))

6. Step: Stopwords Removal
- Number of Tokens before Stopwords Removal:  57
- Number of Tokens after Stopwords Removal:  29
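The adjustments of the stopword list described above can be sketched without downloading anything. Note the assumptions: `base_stopwords` is a hand-written miniature list standing in for `stopwords.words("english")`, and the choice to keep "not" and to drop "movie" is only an example of a task-specific decision for sentiment analysis.

```python
# Hypothetical mini stopword list; the real NLTK list is much longer
base_stopwords = ["the", "was", "not", "a", "to", "it"]

# A set makes the membership test in the list comprehension fast
custom_stopwords = set(base_stopwords)
custom_stopwords.discard("not")   # keep negations, they can flip the sentiment
custom_stopwords.add("movie")     # domain-specific token that occurs in most reviews

tokens = ["the", "movie", "was", "not", "worth", "it"]
filtered = [t for t in tokens if t not in custom_stopwords]
print(filtered)
```

With the unmodified list, "not" would be removed and the remaining tokens would suggest the opposite sentiment.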


In [None]:
#Lemmatization is usually the better choice: stemming can cut off the ending of a word and create a
#word that does not exist, whereas lemmatization looks up the dictionary form of the word

In addition to the elimination of stopwords, we can further reduce the number of distinct tokens with stemming and lemmatization:<br>
<img src="https://github.com/Humboldt-WI/demopy/raw/main/Stemming_Lemmatization.PNG" width="950" height="530" alt="Stemming_Lemmatization">

For example, to 'understand' text algorithmically, it might be enough to know that *someone performs the action of going somewhere*, whereas it might matter less whether they are going now, went in the past, or will go in the future. While stemming is a crude heuristic to remove the end or the beginning of a word based on pre-defined suffixes and prefixes, respectively, lemmatization normalizes tokens by reducing words to their dictionary form. Since stemming can produce non-existent words, lemmatization is the better choice for ensuring consistency in how we represent text. Determining the grammatical role, i.e., a token's part-of-speech tag, can improve the results obtained with lemmatization. Compared to stemming, however, lemmatization is a more complex step and requires, for example, a POS tagger:<br>
<img src="https://github.com/Humboldt-WI/demopy/raw/main/POS.PNG" width="950" height="530" alt="POS">
<br>
The following function illustrates POS tagging with the help of the `nltk` package.

In [15]:
# Lemmatize with POS tags (part-of-speech tagging)
# Download the WordNet lexicon, which maps words to their lemmas
nltk.download('wordnet')
def get_wordnet_pos(word:str)->str:
    """Map the POS tag of a word to the format expected by the WordNet lemmatizer.

    Parameters:
    -----------
    word: str
        A single token to be tagged.

    Returns:
    --------
    pos: str
        The WordNet part-of-speech constant (adjective, noun, verb, or adverb);
        defaults to noun if the tag is not covered.
    """

    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    pos=tag_dict.get(tag,wordnet.NOUN)

    return pos

[nltk_data] Downloading package wordnet to /root/nltk_data...
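The mapping inside `get_wordnet_pos` can be examined without running the tagger. The sketch below assumes that the tagger returns Penn Treebank tags (such as `JJ`, `NNS`, `VBD`, `RB`) and writes out the string values behind `wordnet.ADJ`, `wordnet.NOUN`, `wordnet.VERB`, and `wordnet.ADV` by hand, so that it runs without `nltk`. The helper name `penn_to_wordnet` is made up for this illustration.

```python
# String values of wordnet.ADJ, wordnet.NOUN, wordnet.VERB, wordnet.ADV,
# written out by hand so that the sketch runs without nltk
WORDNET_ADJ, WORDNET_NOUN, WORDNET_VERB, WORDNET_ADV = "a", "n", "v", "r"

TAG_DICT = {"J": WORDNET_ADJ, "N": WORDNET_NOUN, "V": WORDNET_VERB, "R": WORDNET_ADV}

def penn_to_wordnet(penn_tag: str) -> str:
    """Use the first letter of a Penn Treebank tag; fall back to noun."""
    return TAG_DICT.get(penn_tag[0].upper(), WORDNET_NOUN)

# Adjective, plural noun, past-tense verb, adverb, preposition (not covered)
for tag in ["JJ", "NNS", "VBD", "RB", "IN"]:
    print(tag, "->", penn_to_wordnet(tag))
```

Only the first letter of the tag matters, and every tag outside the four groups is treated as a noun. Keep in mind that `get_wordnet_pos` tags each token in isolation, so the tagger cannot use the surrounding words as context.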


In [16]:
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


True

In [17]:
print('7. Step: Lemmatization with POS Tagging:')
print('- First three Tokens before Lemmatization: ',clean_movie_review[:3])
lemmatizer = WordNetLemmatizer()
clean_movie_review=[lemmatizer.lemmatize(i, get_wordnet_pos(i)) for i in clean_movie_review]
print('- First three Tokens after Lemmatization: ',clean_movie_review[:3])

7. Step: Lemmatization with POS Tagging:
- First three Tokens before Lemmatization:  ['last', 'week', 'went']
- First three Tokens after Lemmatization:  ['last', 'week', 'go']


The function `lemmatize` from the class `WordNetLemmatizer` retrieves the dictionary form of the input token from the WordNet database. The latter is a large English lexical database that has been continuously extended over the years. WordNet groups words together based on their meaning, forming sets of synonyms, also known as synsets. Most of the relations in WordNet connect words from the same POS. For more details on WordNet, we refer the reader to https://wordnet.princeton.edu/.
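To make the earlier contrast between stemming and lemmatization concrete, here is a deliberately simplified sketch. Both parts are toy stand-ins and not the algorithms used by `nltk`: `crude_stem` strips a few hard-coded suffixes, and `toy_lemmatize` looks tokens up in a tiny hand-written dictionary that plays the role of WordNet.

```python
SUFFIXES = ("ing", "ed", "es", "s")

def crude_stem(token: str) -> str:
    """Strip the first matching suffix if a stem of at least 3 letters remains."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

# Hypothetical mini dictionary standing in for a real lexical database
TOY_LEMMAS = {"went": "go", "studies": "study", "running": "run", "movies": "movie"}

def toy_lemmatize(token: str) -> str:
    """Return the dictionary form if known, otherwise the token itself."""
    return TOY_LEMMAS.get(token, token)

for token in ["went", "studies", "running", "movies"]:
    print(token, "| stem:", crude_stem(token), "| lemma:", toy_lemmatize(token))
```

The suffix heuristic produces non-words such as "studi" and "runn" and cannot relate "went" to "go", whereas the dictionary lookup returns valid base forms at the cost of needing a lexicon.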

**Demo Summary**:<br>
- the first five steps of our NLP preprocessing pipeline involve the elimination of HTML content, URLs, non-alphabetic characters, converting all tokens to lower case, and tokenization.
- once the text is split into individual tokens, we apply two techniques to further reduce the number of words: stopword removal and lemmatization. The latter reduces the number of distinct words per data sample by replacing different forms of the same word with its dictionary form. This is essential for frequency-based feature extraction techniques, as each textual sample is represented by a numerical vector whose dimensionality is determined by the number of distinct tokens in the entire vocabulary.
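The link between normalization and dimensionality can be checked on a toy corpus. In this sketch, the two documents and the lookup table `TOY_LEMMAS` are made up; the lookup merely imitates what a lemmatizer would do.

```python
# Toy corpus of already tokenized documents (made up for illustration)
docs = [["go", "went", "movies"], ["going", "movie", "gone"]]

# Hypothetical lookup that imitates a lemmatizer
TOY_LEMMAS = {"went": "go", "going": "go", "gone": "go", "movies": "movie"}

def vocabulary(documents):
    """Sorted list of distinct tokens across all documents."""
    return sorted({tok for doc in documents for tok in doc})

vocab_raw = vocabulary(docs)
docs_lemmatized = [[TOY_LEMMAS.get(tok, tok) for tok in doc] for doc in docs]
vocab_lemma = vocabulary(docs_lemmatized)

print(len(vocab_raw), vocab_raw)
print(len(vocab_lemma), vocab_lemma)
```

Each distinct token becomes one column of the feature matrix, so the six-dimensional representation shrinks to two dimensions after normalization.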

## **2. Preprocessing of Movie Reviews Data** (Exercise 1)<br>
- put the individual preprocessing steps from the demo in a well-documented function. It should take a single textual sample as input.
- preprocess the first 1,000 movie reviews from the IMDB dataset. For this purpose, make use of the `apply()` function in `pandas` to transform each movie review using the NLP preprocessing pipeline function.
- split your data into 80% train and 20% test subsets using the function `train_test_split()`.

In [18]:
def NLP_preprocessing_pipeline(textual_sample:str)->list:
    '''
    Implements 7 steps of an NLP preprocessing pipeline.

    Parameters:
    -----------
    textual_sample:str
        The input text that requires preprocessing

    Returns:
    --------
    preprocessed_textual_sample:list
        The textual sample after each of the 7 preprocessing steps has been applied.

    '''
    lemmatizer = WordNetLemmatizer()

    #Preprocessing:

    #Removing of URLs (raw string avoids an invalid-escape warning):
    preprocessed_textual_sample = re.sub(r"http\S+","",textual_sample)

    #Removing of HTML tags (parser named explicitly):
    preprocessed_textual_sample = BeautifulSoup(preprocessed_textual_sample, "html.parser").get_text()

    #Removing of non-alphabetic characters:
    preprocessed_textual_sample = re.sub("[^a-zA-Z]"," ",preprocessed_textual_sample)

    #Changing all tokens to lower case:
    preprocessed_textual_sample = preprocessed_textual_sample.lower()

    #Tokenization:
    preprocessed_textual_sample = nltk.word_tokenize(preprocessed_textual_sample)

    #Stopwords removal (build the set once instead of reloading the list for every token):
    stop_words = set(stopwords.words("english"))
    preprocessed_textual_sample = [w for w in preprocessed_textual_sample if w not in stop_words]

    #Lemmatization:
    preprocessed_textual_sample = [lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in preprocessed_textual_sample]

    return preprocessed_textual_sample

In [20]:
df = pd.read_csv("IMDB-50K-Movie-Review.zip", sep=",", encoding="ISO-8859-1").iloc[:1000,:]
df.head(10)

|   | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |
| 5 | Probably my all-time favorite movie, a story o... | positive |
| 6 | I sure would like to see a resurrection of a u... | positive |
| 7 | This show was an amazing, fresh & innovative i... | negative |
| 8 | Encouraged by the positive comments about this... | negative |
| 9 | If you like original gut wrenching laughter yo... | positive |


In [21]:
#Calling our preprocessing function to clean the reviews (caution, this step can take some time)
X= df['review'].apply(NLP_preprocessing_pipeline)

In [23]:
#Map y to integer format:
y = df['sentiment'].map({'positive':1,'negative':0}).values   #map text-based class labels to numbers
Xclean_train, Xclean_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=5)  #data partitioning

## **3. Frequency-based Feature Creation (Bag-of-Words) for Text Classification** <br>(Exercise 2)<br>
- extract frequency-based features from your training and test sets using two alternative techniques: `TfidfVectorizer` (https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) and `CountVectorizer` (https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer). Both classes are available in the `sklearn` library. Similar to data preprocessing for tabular datasets, you will call the function `fit_transform()` to extract frequency features from the training set, whereas you will apply the function `transform()` to your test set. Also, since you have already applied your custom preprocessing function to the movie reviews dataset, you can pass a dummy function that returns its input unchanged to the parameters `tokenizer` and `preprocessor` of `TfidfVectorizer` and `CountVectorizer`. In this way, the vectorizers do not clean the data, but only extract frequency-based features.
- train and test the following algorithms on the two resulting feature spaces: Logistic Regression and XGBClassifier
- evaluate the predictions on the test set in terms of the AUC, recall, and precision, and store all results in a single pandas dataframe.
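Before turning to the `sklearn` classes, it may help to see what "fit on the training set, transform the test set" means for bag-of-words features. The sketch below uses only the standard library and made-up toy documents; the helper names are hypothetical. The idf weight follows the smoothed formula that, to the best of our knowledge, the `sklearn` documentation gives for the default settings, `ln((1 + n) / (1 + df)) + 1`. The row-wise L2 normalization that `TfidfVectorizer` applies by default is left out to keep the example short.

```python
import math
from collections import Counter

# Toy, already tokenized documents (made up for illustration)
train_docs = [["good", "movie"], ["bad", "movie"], ["good", "good", "plot"]]
test_docs = [["good", "acting"]]

# "fit": the vocabulary is learned from the training documents only
vocabulary = sorted({tok for doc in train_docs for tok in doc})

def count_vector(doc):
    """"transform": count the known tokens; unseen tokens are ignored."""
    counts = Counter(doc)
    return [counts[tok] for tok in vocabulary]

train_counts = [count_vector(doc) for doc in train_docs]
test_counts = [count_vector(doc) for doc in test_docs]

# Smoothed inverse document frequency, computed on the training set only
n_docs = len(train_docs)
doc_freq = Counter(tok for doc in train_docs for tok in set(doc))
idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1 for tok in vocabulary}

def tfidf_vector(doc):
    """Reweight the raw counts by idf (no L2 normalization in this sketch)."""
    return [count * idf[tok] for count, tok in zip(count_vector(doc), vocabulary)]

print(vocabulary)
print(train_counts)
print(test_counts)
print([round(v, 3) for v in tfidf_vector(test_docs[0])])
```

The token "acting" occurs only in the test document and therefore has no column. Tokens that appear in fewer training documents, such as "bad", receive a larger idf weight than frequent ones, such as "good".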

In [40]:
def dummy_fun(doc):
    return doc

tfidf_vectorizer = TfidfVectorizer(
    analyzer = 'word',
    tokenizer = dummy_fun,
    preprocessor = dummy_fun,
    token_pattern = None)

#TFIDF Feature Extraction:
reviews_clean_tfidf_tr = tfidf_vectorizer.fit_transform(Xclean_train)
reviews_clean_tfidf_ts = tfidf_vectorizer.transform(Xclean_test)

#CountVectorizer Feature Extraction:
count_vectorizer = CountVectorizer(
    analyzer = 'word',
    tokenizer= dummy_fun,
    preprocessor=dummy_fun,
    token_pattern = None
)

reviews_clean_count_tr = count_vectorizer.fit_transform(Xclean_train)
reviews_clean_count_ts = count_vectorizer.transform(Xclean_test)

#An alternative way of preprocessing the data with TfidfVectorizer and CountVectorizer
#would be to pass the custom preprocessing function to the vectorizers directly.
#Since the function already performs tokenization, you would pass NLP_preprocessing_pipeline only to
#the preprocessor parameter and keep tokenizer = dummy_fun. This would essentially do the same
#as before: the custom NLP pipeline function would be applied to each movie review.
#In that case, you would first split the raw reviews into train and test subsets and then call
#fit_transform() on the raw training reviews and transform() on the raw test reviews.

In [None]:
predictions_results_frame=[]

xgbc=XGBClassifier()
#TFIDF features:
...

#Count features:
...


logit=LogisticRegression()
#TFIDF features:
...

#Count features:
...

#Append results to predictions_results_frame:

#Put multidimensional list in pandas df:
results_overview=pd.DataFrame(np.around(np.array(predictions_results_frame),3),columns=[...],index=[...])

In [None]:
results_overview
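The three metrics requested in the exercise can be verified by hand on a small example. The labels and scores below are made up, and the threshold of 0.5 is an assumption; in the exercise itself you would use `precision_score`, `recall_score`, and `roc_auc_score` from `sklearn`. The AUC is computed here as the share of positive-negative pairs in which the positive sample receives the higher score, with ties counted as one half.

```python
# Made-up true labels and predicted scores for six samples
y_true = [1, 0, 1, 1, 0, 0]
y_score = [0.9, 0.4, 0.65, 0.55, 0.2, 0.7]

# Hard class predictions with an assumed threshold of 0.5
threshold = 0.5
y_pred = [1 if score >= threshold else 0 for score in y_score]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that are found

# AUC via pairwise comparison of positive and negative scores
pos_scores = [s for t, s in zip(y_true, y_score) if t == 1]
neg_scores = [s for t, s in zip(y_true, y_score) if t == 0]
wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
           for p in pos_scores for n in neg_scores)
auc = wins / (len(pos_scores) * len(neg_scores))

print(round(precision, 3), round(recall, 3), round(auc, 3))
```

Precision and recall depend on the chosen threshold, whereas the AUC is computed from the scores themselves. For the AUC, it is therefore advisable to pass predicted probabilities, e.g., the positive-class column returned by `predict_proba()`, rather than hard class labels.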