# __10 Statistical Natural Language Processing for Sentiment Analysis__

Generally, sentiment analysis is performed based on the processing of natural
language, the analysis of text and computational linguistics. Although data can come
from different data sources, in this chapter we will analyze sentiment in text data,
using two particular text data examples: one from film critics, where the text is highly
structured and maintains text semantics; and another example coming from social
networks (tweets in this case), where the text can show a lack of structure and users
may use (and abuse!) text abbreviations.

In the following sections, we will review some basic mechanisms required to
perform sentiment analysis. In particular, we will analyze the steps required for
data cleaning (that is, removing irrelevant text items not associated with sentiment
information), producing a general representation of the text, and performing some
statistical inference on the represented text to determine positive and negative
sentiments.

Although sentiment analysis covers many aspects, in this chapter, for simplicity,
we restrict ourselves to binary sentiment categorization problems. We will thus
learn to classify positive against negative opinions from text data. The full scope
of sentiment analysis is broader, and it includes many aspects that make the
analysis of sentiments a challenging task. Some interesting open issues in this
topic are as follows:

+ Identification of sarcasm: sometimes, without knowing the personality of the person,
    you do not know whether “bad” means bad or good.
+ Lack of text structure: in the case of Twitter, for example, the text may contain abbreviations,
    and there may be a lack of capitals, poor spelling, poor punctuation, and
    poor grammar, all of which make it difficult to analyze the text.
+ Many possible sentiment categories and degrees: distinguishing positive from negative is a simple
    analysis; one would also like to identify how much hate there is in the opinion,
    how much happiness, how much sadness, etc.
+ Identification of the object of analysis: many concepts can appear in text, and how
    to detect the object that the opinion is positive for and the object that the opinion is
    negative for is an open issue. For example, if you say “She won him!”, this means
    a positive sentiment for her and a negative sentiment for him, at the same time.
+ Subjective text: another open challenge is how to analyze very subjective sentences
    or paragraphs. Sometimes, even for humans it is very hard to agree on the sentiment
    of these highly subjective texts.

## __Data Cleaning__

The main task of data cleaning is to remove
those characters considered as noise in the data mining process, for instance, comma
or colon characters. Of course, in each particular data mining problem different
characters can be considered as noise, depending on the final objective of the analysis. In
our case, we are going to consider that all punctuation characters should be removed,
including other non-conventional symbols.

In [1]:
raw_docs = [
    'Here are some very simple basic sentences.',
    'They won`t be very interesting, I`m afraid.',
    'The point of these examples i to _learn how basoc text\ cleaning works_ on *very simple* data. '
]

The first step consists of defining a list with all word-vectors (lists of tokens) in the text.
`NLTK` makes it easy to convert documents-as-strings into word-vectors, a process
called tokenizing.

In [2]:
from nltk.tokenize import word_tokenize

tokenized_docs = [word_tokenize(doc) for doc in raw_docs]

In [3]:
print(tokenized_docs)

[['Here', 'are', 'some', 'very', 'simple', 'basic', 'sentences', '.'], ['They', 'won', '`', 't', 'be', 'very', 'interesting', ',', 'I', '`', 'm', 'afraid', '.'], ['The', 'point', 'of', 'these', 'examples', 'i', 'to', '_learn', 'how', 'basoc', 'text\\', 'cleaning', 'works_', 'on', '*', 'very', 'simple', '*', 'data', '.']]


Thus, for each line of text in raw_docs, the word_tokenize function builds
the corresponding word-vector. Now we can search the list for punctuation symbols, for
instance, and remove them. There are many ways to perform this step. Let us see
one possible solution using the `string` module of the standard library.

In [4]:
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [5]:
import re
import string

regex = re.compile(f'[{re.escape(string.punctuation)}]')
tokenized_docs_no_punctuation = []

for review in tokenized_docs:
    new_review = []
    for token in review:
        new_token = regex.sub(u'', token)
        if not new_token == u'':
            new_review.append(new_token)
    tokenized_docs_no_punctuation.append(new_review)

print(tokenized_docs_no_punctuation)

[['Here', 'are', 'some', 'very', 'simple', 'basic', 'sentences'], ['They', 'won', 't', 'be', 'very', 'interesting', 'I', 'm', 'afraid'], ['The', 'point', 'of', 'these', 'examples', 'i', 'to', 'learn', 'how', 'basoc', 'text', 'cleaning', 'works', 'on', 'very', 'simple', 'data']]
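
As noted above, there are many ways to perform this step. The following is a minimal alternative sketch that uses only the standard library: `str.maketrans` builds a table that deletes every character in `string.punctuation`, and `str.translate` applies it. The names `strip_punct`, `remove_punctuation`, and `example_tokens` are ours and are used only for illustration; the tokens are copied from the second tokenized document above.

```python
import string

# Translation table that deletes every character listed in string.punctuation.
strip_punct = str.maketrans('', '', string.punctuation)

def remove_punctuation(tokens):
    """Strip punctuation from each token and drop tokens that become empty."""
    cleaned = (token.translate(strip_punct) for token in tokens)
    return [token for token in cleaned if token]

# Tokens of the second example document, as produced by word_tokenize above.
example_tokens = ['They', 'won', '`', 't', 'be', 'very', 'interesting',
                  ',', 'I', '`', 'm', 'afraid', '.']

print(remove_punctuation(example_tokens))
```

For these tokens the result should coincide with the regular-expression approach: tokens consisting only of punctuation disappear, and punctuation attached to a word is stripped.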


Another important step in many data mining systems for text analysis consists of
stemming and lemmatizing. Morphology is the notion that words have a root form.
If you want to get to the basic meaning of a word, you can try applying
a stemmer or a lemmatizer. This step is useful to reduce the dictionary size and, with it,
the dimensionality and sparsity of the feature space built later. NLTK provides different ways
of performing this procedure. The output of the porter.stem(word)
approach is shown next.

In [6]:
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer

porter = PorterStemmer()
snowball = SnowballStemmer('english')
wordnet = WordNetLemmatizer()

preprocessed_docs = []
for doc in tokenized_docs_no_punctuation:
    final_doc = []
    for word in doc:
        final_doc.append(porter.stem(word))
    preprocessed_docs.append(final_doc)
    
print(tokenized_docs_no_punctuation)
print(preprocessed_docs)

[['Here', 'are', 'some', 'very', 'simple', 'basic', 'sentences'], ['They', 'won', 't', 'be', 'very', 'interesting', 'I', 'm', 'afraid'], ['The', 'point', 'of', 'these', 'examples', 'i', 'to', 'learn', 'how', 'basoc', 'text', 'cleaning', 'works', 'on', 'very', 'simple', 'data']]
[['here', 'are', 'some', 'veri', 'simpl', 'basic', 'sentenc'], ['they', 'won', 't', 'be', 'veri', 'interest', 'I', 'm', 'afraid'], ['the', 'point', 'of', 'these', 'exampl', 'i', 'to', 'learn', 'how', 'basoc', 'text', 'clean', 'work', 'on', 'veri', 'simpl', 'data']]


These approaches are very useful for reducing the number of distinct word forms
that share the same meaning and for matching similar texts. Words
such as “interest” and “interesting” will be converted into the same word “interest”,
making the comparison of texts easier, as we will see later.


## __Text Representation__

In the previous section we analyzed different techniques for cleaning the data, stemming and lemmatizing, and filtering the text to remove unnecessary items before
further text analysis. In order to analyze sentiment from text, the next step consists
of building a representation of the text that has been cleaned.
Although different representations of text exist, the most common ones are variants of Bag of Words (BoW) models.

The basic idea is to think about word frequencies. If we can define a
dictionary of possible different words, the number of different existing words will
define the length of a feature space to represent each text.



Next, we will see a particular case of bag of words, the Vector Space Model of
text: TF–IDF (term frequency–inverse document frequency). First, we need to count
the terms per document, which gives the term frequency vector. See the code example
below.

In [7]:
mydoclist = [
    'Mireia loves me more than Hector loves me',
    'Sergio likes me more than Mireia loves me',
    'HE likes basketball more than football'
]

from collections import Counter

for doc in mydoclist:
    tf = Counter()
    for word in doc.split():
        tf[word] += 1
    print(tf.items())

dict_items([('Mireia', 1), ('loves', 2), ('me', 2), ('more', 1), ('than', 1), ('Hector', 1)])
dict_items([('Sergio', 1), ('likes', 1), ('me', 2), ('more', 1), ('than', 1), ('Mireia', 1), ('loves', 1)])
dict_items([('HE', 1), ('likes', 1), ('basketball', 1), ('more', 1), ('than', 1), ('football', 1)])


Let us call this a first stab at representing documents quantitatively, just by their
word counts (also thinking that we may have previously filtered and cleaned the text
using previous approaches). Here we show an example for computing the feature
vector based on word frequencies.

In [8]:
def build_lexicon(corpus):
    '''Define a set with all possible words included in all the
    sentences or corpus.
    '''
    lexicon = set()
    for doc in corpus:
        lexicon.update([word for word in doc.split()])
    return lexicon

def freq(term, document):
    return document.split().count(term)

def tf(term, document):
    return freq(term, document)


In [9]:
vocabulary = build_lexicon(mydoclist)
doc_term_matrix = []

print('our vocabulary vector is [' + ', '.join(list(vocabulary)) + ']')

for doc in mydoclist:
    print('the doc is "' + doc + '"')
    tf_vector = [tf(word, doc) for word in vocabulary]
    tf_vector_string = ', '.join(format(count, 'd') for count in tf_vector)
    print(f'the tf vector for Document {mydoclist.index(doc) + 1} is [{tf_vector_string}]')
    doc_term_matrix.append(tf_vector)

print('all combined, here is our master document term matrix: ')
print(doc_term_matrix)

our vocabulary vector is [Mireia, football, HE, me, loves, more, likes, than, basketball, Hector, Sergio]
the doc is "Mireia loves me more than Hector loves me"
the tf vector for Document 1 is [1, 0, 0, 2, 2, 1, 0, 1, 0, 1, 0]
the doc is "Sergio likes me more than Mireia loves me"
the tf vector for Document 2 is [1, 0, 0, 2, 1, 1, 1, 1, 0, 0, 1]
the doc is "HE likes basketball more than football"
the tf vector for Document 3 is [0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0]
all combined, here is our master document term matrix: 
[[1, 0, 0, 2, 2, 1, 0, 1, 0, 1, 0], [1, 0, 0, 2, 1, 1, 1, 1, 0, 0, 1], [0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0]]


Now, every document is in the same feature space, meaning that we can represent
the entire corpus in the same dimensional space. Once we have the data in the
same feature space, we can start applying some machine learning methods: learning,
classifying, clustering, and so on. But actually, we have a few problems. Words are
not all equally informative. If words appear too frequently in a single document,
they are going to muck up our analysis. We want to perform some weighting of these
term frequency vectors into something a bit more representative. That is, we need to
do some vector normalizing. One possibility is to ensure that the L2 norm of each
vector is equal to 1.

In [17]:
import math
import numpy as np

def l2_normalizer(vec):
    denom = np.sum([el ** 2 for el in vec])
    return [(el / math.sqrt(denom)) for el in vec]

doc_term_matrix_l2 = []
for vec in doc_term_matrix:
    doc_term_matrix_l2.append(l2_normalizer(vec))
    
print('a regular old document term matrix: ')
print(np.matrix(doc_term_matrix))

print('\nA document term matrix with row-wise L2 norm: ')
print(np.matrix(doc_term_matrix_l2))

a regular old document term matrix: 
[[1 0 0 2 2 1 0 1 0 1 0]
 [1 0 0 2 1 1 1 1 0 0 1]
 [0 1 1 0 0 1 1 1 1 0 0]]

A document term matrix with row-wise L2 norm: 
[[0.28867513 0.         0.         0.57735027 0.57735027 0.28867513
  0.         0.28867513 0.         0.28867513 0.        ]
 [0.31622777 0.         0.         0.63245553 0.31622777 0.31622777
  0.31622777 0.31622777 0.         0.         0.31622777]
 [0.         0.40824829 0.40824829 0.         0.         0.40824829
  0.40824829 0.40824829 0.40824829 0.         0.        ]]


You can see that we have scaled down the vectors so that each element lies in the
interval [0, 1] and each row has unit L2 norm. This keeps a word that is used massively
in a particular document from dominating the representation of that document simply
because of its raw count.
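
The effect of the normalization can be checked directly. The sketch below uses only the standard library and the term frequency vector of Document 1 printed above; the helper names `l2_norm` and `l2_normalize` are ours, and the guard for an all-zero vector is an addition that the original `l2_normalizer` does not have.

```python
import math

def l2_norm(vec):
    """Euclidean (L2) length of a vector."""
    return math.sqrt(sum(el ** 2 for el in vec))

def l2_normalize(vec):
    """Divide every element by the L2 norm; an all-zero vector is returned unchanged."""
    norm = l2_norm(vec)
    if norm == 0:
        return list(vec)
    return [el / norm for el in vec]

# Term frequency vector of Document 1, copied from the output above.
tf_doc1 = [1, 0, 0, 2, 2, 1, 0, 1, 0, 1, 0]
unit_doc1 = l2_normalize(tf_doc1)

print(l2_norm(tf_doc1))    # length before normalizing: sqrt(12)
print(l2_norm(unit_doc1))  # length after normalizing: 1, up to rounding
```

The squared counts add up to 12, so every element is divided by sqrt(12), which is how a count of 1 becomes the value 0.28867513 shown in the matrix.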

Finally, we have a final task to perform. Just as not all words are equally valuable
within a document, not all words are valuable across all documents. We can try
reweighting every word by its inverse document frequency.

In [13]:
def num_docs_containing(word, doclist):
    doccount = 0
    for doc in doclist:
        if freq(word, doc) > 0:
            doccount += 1
    return doccount

def idf(word, doclist):
    n_samples = len(doclist)
    df = num_docs_containing(word, doclist)
    return np.log(n_samples / float(df))


my_idf_vector = [idf(word, mydoclist) for word in vocabulary]

print('Our vocabulary vector is [' + ', '.join(list(vocabulary)) + ']')
print('The inverse document frequency vector is [' + ', '.join(format(value, 'f') for value in my_idf_vector) + ']')

Our vocabulary vector is [Mireia, football, HE, me, loves, more, likes, than, basketball, Hector, Sergio]
The inverse document frequency vector is [0.405465, 1.098612, 1.098612, 0.405465, 0.405465, 0.000000, 0.405465, 0.000000, 1.098612, 1.098612, 1.098612]


Now we have a general sense of information values per term in our vocabulary,
accounting for their relative frequency across the entire corpus.
Note that this is an inverse: the more documents a term appears in, the lower its weight.
To get TF-IDF weighted word-vectors, we have to perform
the simple calculation of the term frequencies multiplied by the inverse frequency
values.
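
In symbols, and using the natural logarithm as the `idf` function above does, the weighting can be written as follows. Here N is the number of documents in the corpus and df(t) is the number of documents that contain term t; these symbol names are ours and are used only for illustration.

```latex
\mathrm{idf}(t) = \ln\frac{N}{\mathrm{df}(t)},
\qquad
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d)\,\cdot\,\mathrm{idf}(t)
```

For instance, in the three-document corpus above, "Mireia" appears in two documents, so its weight is ln(3/2) ≈ 0.405465, whereas "more" appears in all three documents and gets ln(1) = 0, in agreement with the printed vector.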

In [14]:
def build_idf_matrix(idf_vector):
    idf_mat = np.zeros((len(idf_vector), len(idf_vector)))
    np.fill_diagonal(idf_mat, idf_vector)
    return idf_mat

my_idf_matrix = build_idf_matrix(my_idf_vector)

print(my_idf_matrix)

[[0.40546511 0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.        ]
 [0.         1.09861229 0.         0.         0.         0.
  0.         0.         0.         0.         0.        ]
 [0.         0.         1.09861229 0.         0.         0.
  0.         0.         0.         0.         0.        ]
 [0.         0.         0.         0.40546511 0.         0.
  0.         0.         0.         0.         0.        ]
 [0.         0.         0.         0.         0.40546511 0.
  0.         0.         0.         0.         0.        ]
 [0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.        ]
 [0.         0.         0.         0.         0.         0.
  0.40546511 0.         0.         0.         0.        ]
 [0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.        ]
 [0.         0.         0.         0.         0.        

That means we can now multiply every term frequency vector by the inverse
document frequency matrix. Then, to make sure we are also accounting for words
that appear too frequently within documents, we will normalize each document using
the L2 norm.

In [16]:
doc_term_matrix_tfidf = []

# performing tf-idf matrix multiplication
for tf_vector in doc_term_matrix:
    doc_term_matrix_tfidf.append(np.dot(tf_vector, my_idf_matrix))
    
# normalizing
doc_term_matrix_tfidf_l2 = []
for tf_vector in doc_term_matrix_tfidf:
    doc_term_matrix_tfidf_l2.append(l2_normalizer(tf_vector))
    
print(vocabulary)

print(np.matrix(doc_term_matrix_tfidf_l2))

{'Mireia', 'football', 'HE', 'me', 'loves', 'more', 'likes', 'than', 'basketball', 'Hector', 'Sergio'}
[[0.24737436 0.         0.         0.49474872 0.49474872 0.
  0.         0.         0.         0.67026363 0.        ]
 [0.2640605  0.         0.         0.52812101 0.2640605  0.
  0.2640605  0.         0.         0.         0.71547492]
 [0.         0.56467328 0.56467328 0.         0.         0.
  0.20840411 0.         0.56467328 0.         0.        ]]
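
Because the inverse document frequency matrix is diagonal, multiplying by it amounts to an element-wise product of each term frequency vector with the IDF vector. The following standard-library sketch reproduces the first row above under that observation. The document frequencies in `doc_freq` are counted by hand from `mydoclist`, in the vocabulary order printed above, and the variable names are ours.

```python
import math

n_docs = 3
# Vocabulary order: Mireia, football, HE, me, loves, more, likes, than,
# basketball, Hector, Sergio (as printed above).
tf_doc1 = [1, 0, 0, 2, 2, 1, 0, 1, 0, 1, 0]   # term counts of Document 1
doc_freq = [2, 1, 1, 2, 2, 3, 2, 3, 1, 1, 1]  # documents containing each term

# Inverse document frequency, same formula as idf() above.
idf_vector = [math.log(n_docs / df) for df in doc_freq]

# The element-wise product replaces the multiplication by the diagonal matrix.
tfidf_doc1 = [count * weight for count, weight in zip(tf_doc1, idf_vector)]

# Row-wise L2 normalization.
norm = math.sqrt(sum(value ** 2 for value in tfidf_doc1))
tfidf_doc1_l2 = [value / norm for value in tfidf_doc1]

print([round(value, 8) for value in tfidf_doc1_l2])
```

For a large vocabulary the element-wise form avoids building a square matrix whose size grows with the square of the number of terms.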


## __Bi-Grams and n-Grams__

It is sometimes useful to include significant bi-grams in the BoW-based model.
Note that this idea can be extended to n-grams. In the fields of computational
linguistics and probability, an n-gram is a contiguous sequence of n items from
a given sequence of text or speech. The items can be phonemes, syllables, letters,
words, or base pairs according to the application. The n-grams are typically collected
from a text or speech corpus.

An n-gram of size 1 is referred to as a “uni-gram”; size 2 is a “bi-gram” (or, less
commonly, a “digram”); size 3 is a “tri-gram”. Larger sizes are sometimes referred
to by the value of n, e.g., “four-gram”, “five-gram”, and so on. These n-grams can
be introduced within the BoW model just by considering each different n-gram as a
new position within the feature vector representation.
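
As a minimal sketch of this idea, the function below extracts word n-grams from a whitespace-tokenized sentence and counts them, so that each distinct n-gram can act as one more position of the feature vector. The function name `word_ngrams` is ours, and the sentence is the first document of the earlier example.

```python
from collections import Counter

def word_ngrams(text, n):
    """Return the list of contiguous word n-grams of a whitespace-split text."""
    tokens = text.split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = 'Mireia loves me more than Hector loves me'

bigrams = word_ngrams(sentence, 2)
bigram_counts = Counter(bigrams)

for bigram, count in bigram_counts.items():
    print(' '.join(bigram), count)
```

Here the bi-gram "loves me" occurs twice, while the other five occur once. When using scikit-learn, the `ngram_range` parameter of `TfidfVectorizer` (for example `ngram_range=(1, 2)`) should have a similar effect without any manual extraction.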

## __Practical Cases__

Python packages provide useful tools for analyzing text. The reader is referred to
the NLTK and TextBlob package documentation for further details. Here, we will
perform all the previously presented procedures for data cleaning, stemming, and
representation, and introduce some binary learning schemes that learn from the text
representations in the feature space. The binary learning schemes will receive examples
of positive and negative sentiment texts for training, and we will apply them later to
unseen examples from a test set.

We will apply the whole sentiment analysis process in two examples. The first
corresponds to the Large Movie Review Dataset. This is one of the largest publicly
available data sets for sentiment analysis, which includes more than 50,000 texts
from movie reviews including the ground truth annotation related to positive and
negative movie reviews. As a proof of concept, for this example we use a subset of
the dataset consisting of about 30% of the data.

The code reuses part of the previous examples for data cleaning and reads the training
and testing data from the folders as provided by the authors of the dataset. Then,
TF–IDF is computed, which performs all steps mentioned previously for computing
the feature space, normalization, and feature weights. Note that at the end of the script we
perform training and testing based on two different state-of-the-art machine learning
approaches: Naive Bayes and Support Vector Machines. It is beyond the scope of
this chapter to give details of the methods and parameters. The important point here
is that the documents are represented in feature spaces that can be used by different
data mining tools.
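
One detail is worth keeping in mind when comparing the hand-written functions with a library implementation: the weights need not be numerically identical. As far as we can tell from the scikit-learn documentation, `TfidfVectorizer` with its default `smooth_idf=True` setting adds one to the document counts and one to the logarithm. The sketch below compares the two formulas using only the standard library; the function names are ours, and the smoothed formula is stated here as an assumption about the library's default rather than taken from its source code.

```python
import math

def idf_plain(n_docs, df):
    """IDF as defined earlier in this chapter: ln(N / df)."""
    return math.log(n_docs / df)

def idf_smoothed(n_docs, df):
    """Smoothed variant (assumed scikit-learn default): ln((1 + N) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + df)) + 1

n_docs = 3
for df in (1, 2, 3):
    print(df, round(idf_plain(n_docs, df), 6), round(idf_smoothed(n_docs, df), 6))
```

With the smoothed variant, a term that appears in every document keeps a weight of 1 instead of 0, so it is down-weighted rather than discarded.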

The dataset is described in http://www.aclweb.org/anthology/P11-1015. The following cell moves a small sample of review files from each of the original `aclImdb` folders into the folders read by the rest of the example. With `files = 5`, the loop visits six entries per folder, which is consistent with the 12 training points and 12 test points reported in the output at the end of this section.

In [31]:
import os

files = 5  # the loop stops once count exceeds this value, so six entries are visited

def move_sample(src_dir, dst_dir, limit):
    '''Move the .txt files among the first entries of src_dir into dst_dir.'''
    os.makedirs(dst_dir, exist_ok=True)  # os.rename fails if the target folder is missing
    count = 0
    for file in os.listdir(src_dir):
        if count > limit:
            break
        if file.endswith('.txt'):
            os.rename(os.path.join(src_dir, file), os.path.join(dst_dir, file))
        count = count + 1

for split in ('train', 'test'):
    for label in ('pos', 'neg'):
        move_sample(f'files/ch10/aclImdb/{split}/{label}/',
                    f'files/ch10/{split}/{label}2/', files)


In [43]:
import re
import string
import time
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from nltk.classify import NaiveBayesClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB
from sklearn import svm
from unidecode import unidecode

In [44]:
def BoW():
    '''Bag-of-words preprocessing for a corpus.
    Reads the documents from the global list `text` and applies tokenization,
    punctuation removal, and Porter stemming.
    
    Returns a list with one preprocessed string per document in the corpus.
    '''
    # tokenizing text
    text_tokenized = [word_tokenize(doc) for doc in text]
    
    # removing punctuation
    regex = re.compile(f'[{re.escape(string.punctuation)}]')
    tokenized_docs_no_punctuation = []
    for review in text_tokenized:
        new_review = []
        for token in review:
            new_token = regex.sub(u'', token)
            if not new_token == u'':
                new_review.append(new_token)
        tokenized_docs_no_punctuation.append(new_review)
    
    # stemming and lemmatizing
    porter = PorterStemmer()
    preprocessed_docs = []
    for doc in tokenized_docs_no_punctuation:
        final_doc = ''
        for word in doc:
            final_doc = final_doc + ' ' + porter.stem(word)
        preprocessed_docs.append(final_doc)
    return preprocessed_docs



In [48]:
print('Reading the training data positive')
text = []
for file in os.listdir('files/ch10/train/pos2/'):
    if file.endswith('.txt'):
        infile = open('files/ch10/train/pos2/' + file, 'br')
        text.append(unidecode(infile.read().decode('utf-8')))
        infile.close()
num_pos_train = len(text)

print('Reading the training data negative')
for file in os.listdir('files/ch10/train/neg2'):
    if file.endswith('.txt'):
        infile = open('files/ch10/train/neg2/' + file, 'br')
        text.append(unidecode(infile.read().decode('utf-8')))
        infile.close()
num_train = len(text)

print('Defining dictionaries')
preprocessed_docs = BoW()

# computing TF-IDF word space
tfidf_vectorizer = TfidfVectorizer(min_df = 1)
train_data = tfidf_vectorizer.fit_transform(preprocessed_docs)


# reading the test data
print('Reading the test data positive')
text = []
for file in os.listdir('files/ch10/test/pos2/'):
    if file.endswith('.txt'):
        infile = open('files/ch10/test/pos2/' + file, 'br')
        text.append(unidecode(infile.read().decode('utf-8')))
        infile.close()
num_pos_test = len(text)

print('Reading the training data negative')
for file in os.listdir('files/ch10/test/neg2'):
    if file.endswith('.txt'):
        infile = open('files/ch10/test/neg2/' + file, 'br')
        text.append(unidecode(infile.read().decode('utf-8')))
        infile.close()
num_test = len(text)


print('Computing test feature vectors')
start_time = time.time()

preprocessed_docs = BoW()
test_data = tfidf_vectorizer.transform(preprocessed_docs)

target_train = []
for i in range(0, num_pos_train):
    target_train.append(0)

for i in range(0, num_train - num_pos_train):
    target_train.append(1)

target_test = []
for i in range(0, num_pos_test):
    target_test.append(0)
for i in range(0, num_test - num_pos_test):
    target_test.append(1)


print('Training and testing on training Naive Bayes')
start_time = time.time()

# GaussianNB needs dense input. toarray() returns a plain ndarray, whereas
# todense() returns np.matrix, which recent scikit-learn versions no longer accept.
train_dense = train_data.toarray()
test_dense = test_data.toarray()

gnb = GaussianNB()
gnb.fit(train_dense, target_train)
y_pred = gnb.predict(train_dense)
print("Number of mislabeled training points out of a total %d points : %d" % (train_data.shape[0],(target_train != y_pred).sum()))

print('Training and testing on test Naive Bayes')

# the classifier is already fitted, so we only predict on the test set
y_pred = gnb.predict(test_dense)
print("Number of mislabeled test points out of a total %d points : %d" % (test_data.shape[0],(target_test != y_pred).sum()))

print('Training and testing on train with SVM')
clf = svm.SVC()
clf.fit(train_dense, target_train)
y_pred = clf.predict(train_dense)
print("Number of mislabeled training points out of a total %d points : %d" % (train_data.shape[0],(target_train != y_pred).sum()))

print('Testing on test with already trained SVM')
y_pred = clf.predict(test_dense)
print("Number of mislabeled test points out of a total %d points : %d" % (test_data.shape[0],(target_test != y_pred).sum()))


Reading the training data positive
Reading the training data negative
Defining dictionaries
Reading the test data positive
Reading the training data negative
Computing test feature vectors
Training and testing on training Naive Bayes
Number of mislabeled training points out of a total 12 points : 0
Training and testing on test Naive Bayes
Number of mislabeled test points out of a total 12 points : 5
Training and testing on train with SVM
Number of mislabeled training points out of a total 12 points : 0
Testing on test with already trained SVM
Number of mislabeled test points out of a total 12 points : 2
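
To read these counts as accuracies, the short sketch below converts the mislabeled counts printed above into the fraction of correctly classified points. The helper `accuracy` and the dictionary `results` are ours; the numbers are copied from the output.

```python
def accuracy(n_total, n_mislabeled):
    """Fraction of correctly classified points."""
    return 1.0 - n_mislabeled / n_total

# (total points, mislabeled points), copied from the printed output above.
results = {
    'Naive Bayes, training set': (12, 0),
    'Naive Bayes, test set': (12, 5),
    'SVM, training set': (12, 0),
    'SVM, test set': (12, 2),
}

for name, (n_total, n_mislabeled) in results.items():
    print(f'{name}: {accuracy(n_total, n_mislabeled):.1%}')
```

Both classifiers fit the 12 training points perfectly, while on the test set Naive Bayes gets 7 of 12 right and the SVM 10 of 12. With so few points these figures only illustrate the workflow and should not be read as a reliable comparison of the two methods.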
