# 10.5 Regular Expression and Sentiment Analysis

Sentiment analysis is the use of natural language processing to quantify subjective information. Our goal in this section is to use machine learning to identify whether a piece of text expresses positive, negative, or neutral emotion. Sentiment analysis has become more prevalent through its application in algorithmic trading, recommendation systems, and market research.

## Sentiment Analysis

Consider the following sentences:

1. "I am so happy to be here right now!"
2. "I'm pretty sad about this whole thing."

Most people would agree that the first sentence exhibits positive emotion and the second sentence exhibits negative emotion. We perceive it to be this way because the first sentence has the word happy and the second sentence has the word sad. With this very simple idea in mind, we can build a very naïve classifier that determines if a sentence exhibits positive, negative, or neutral emotion. 

The output is a score between -1 and 1: sentences scoring towards -1 have negative sentiment and sentences scoring towards +1 have positive sentiment.
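Written as a formula, with $P$ and $N$ as our own labels for the number of positive and negative words found in a sentence, one way to express such a score is:

$$\text{score} = \frac{P - N}{P + N}, \qquad -1 \le \text{score} \le 1$$

The ratio is undefined when a sentence contains no words from either list ($P + N = 0$), so an implementation has to pick a value for that case; treating the sentence as neutral (score 0) is a natural convention.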


In [1]:
positive_words = {"happy", "great", "fantastic", "love", "appreciate", "grateful"}
negative_words = {"sad", "gross", "disturbing", "bitter", "sorry", "pathetic"}

def sentiment_analyzer_v2(sentence):
    # Lowercase and split on spaces; punctuation must already be separated
    words = sentence.lower().split(" ")

    pos_word_cnt = 0
    neg_word_cnt = 0

    for word in words:
        if word in positive_words:
            pos_word_cnt += 1
        elif word in negative_words:
            neg_word_cnt += 1

    total = pos_word_cnt + neg_word_cnt
    if total == 0:
        # No sentiment words found: treat the sentence as neutral
        return 0.0
    return (pos_word_cnt - neg_word_cnt) / total

print(sentiment_analyzer_v2("This is making me happy !"))
print(sentiment_analyzer_v2("This is making me sad !"))
print(sentiment_analyzer_v2("I am neither happy nor sad ."))

1.0
-1.0
0.0


Given a big enough dictionary of positive and negative words, this algorithm can work pretty well. But it takes a lot of work to figure out which words are positive or negative and type them into a list; it's just not efficient. So, being the clever data scientists that we are, let's create a machine learning algorithm.

Here is the plan: let's get a list of texts that are labelled as positive, negative, or neutral. We use a count vector to record which words occurred in each text and how many times each word occurred. We then train a machine learning algorithm on this count vector along with the sentiment label. In essence, the algorithm is saying "if these words occurred $n$ times in a text, then the text is likely to have a specific sentiment".
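To make the idea of a count vector concrete, here is a minimal sketch using only the standard library. The three labelled texts are made up for illustration, and `count_vector` is a hypothetical helper rather than part of any library.

```python
from collections import Counter

# Toy labelled texts (made up for illustration, not from the tweet dataset)
texts = [
    ("i love this phone", 1),
    ("i hate this phone", -1),
    ("love love love it", 1),
]

# The vocabulary is every distinct word in the corpus, in sorted order
vocabulary = sorted({word for text, _ in texts for word in text.split()})

def count_vector(text):
    """Return how many times each vocabulary word occurs in `text`."""
    counts = Counter(text.split())
    return [counts[word] for word in vocabulary]

print(vocabulary)
for text, label in texts:
    print(label, count_vector(text))
```

Each text becomes a row of counts with one column per vocabulary word, and the label is the value a model learns to predict from that row.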

Download a set of tweets from (link) and you will see that there are two features that we care about: polarity and text. Polarity tells us what the sentiment is (0 for negative, 2 for neutral, 4 for positive). To keep things consistent with our naive algorithm above, let's map the polarity such that positive is +1, neutral is 0, and negative is -1. With this encoding, the closer our prediction is to 1, the more positive the sentiment is and the closer our prediction is to -1, the more negative the sentiment is. 

In [2]:
import pandas as pd

df_tweets = pd.read_csv("https://raw.githubusercontent.com/bfkwong/data/master/twitter_sentiment.csv")
df_tweets["polarity"] = df_tweets["polarity"].map({4:1, 2:0, 0:-1})
df_tweets.head()

Unnamed: 0,polarity,tweet_id,tweet_date,query,user,text
0,1,3,Mon May 11 03:17:40 UTC 2009,kindle2,tpryan,@stellargirl I loooooooovvvvvveee my Kindle2. ...
1,1,4,Mon May 11 03:18:03 UTC 2009,kindle2,vcu451,Reading my kindle2... Love it... Lee childs i...
2,1,5,Mon May 11 03:18:54 UTC 2009,kindle2,chadfu,"Ok, first assesment of the #kindle2 ...it fuck..."
3,1,6,Mon May 11 03:19:04 UTC 2009,kindle2,SIX15,@kenburbary You'll love your Kindle2. I've had...
4,1,7,Mon May 11 03:21:41 UTC 2009,kindle2,yamarama,@mikefish Fair enough. But i have the Kindle2...


### Text Normalization

The goal of normalizing is to remove excess noise so that the algorithm only has to focus on what is important. Think of this process as the text version of `StandardScaler`.

**Lemmatization** is one popular normalization technique. During lemmatization, the words `studies` and `studying` both get lemmatized to `study`. In essence, lemmatization turns different forms of the same word (e.g. studies, studying) into the same base lemma (study). This helps reduce the noise in our dataset.

**Stop word removal** is another way to normalize our text in order to reduce noise. Stop words such as "there", "how", "then", and "we" offer few additional clues for deciding what the sentiment of a sentence is. Thus, it makes sense for us to remove these words before training our algorithm.

Lemmatization and stop word removal are both tedious to do by hand, which is why NLTK provides a lemmatizer and a stop word list that do the work for us.
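Before turning to NLTK, the two steps can be sketched in plain Python. The stop word set and lemma table below are tiny hand-written stand-ins (assumptions for illustration only); a real pipeline would rely on NLTK's resources instead.

```python
# Hand-written stand-ins for illustration only; a real pipeline would use
# NLTK's stop word list and WordNetLemmatizer instead of these tiny tables.
TOY_STOP_WORDS = {"there", "how", "then", "we", "the", "are", "for"}
TOY_LEMMAS = {"studies": "study", "studying": "study", "exams": "exam"}

def normalize(text):
    """Lowercase, drop stop words, and map each remaining word to its lemma."""
    tokens = text.lower().split()
    kept = [t for t in tokens if t not in TOY_STOP_WORDS]
    return [TOY_LEMMAS.get(t, t) for t in kept]

print(normalize("We are studying for the exams"))
print(normalize("How she studies"))
```

Both inputs end up sharing the token `study`, which is the point of normalizing: different surface forms collapse into one feature.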

In [3]:
import nltk

# One-time downloads of the tokenizer model, stop word list, and WordNet data
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

stop_words = set(stopwords.words("english"))
lem = WordNetLemmatizer()

# Tokenize each tweet and drop stop words (the check is case-sensitive,
# so capitalized words such as "I" are kept)
tweets = [[x for x in word_tokenize(tweet) if x not in stop_words]
          for tweet in df_tweets["text"]]

# Replace every remaining token with its lemma
tweets = [[lem.lemmatize(x) for x in tweet] for tweet in tweets]

df_tweets["processed_text"] = [" ".join(x) for x in tweets]
df_tweets.head()

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Unnamed: 0,polarity,tweet_id,tweet_date,query,user,text,processed_text
0,1,3,Mon May 11 03:17:40 UTC 2009,kindle2,tpryan,@stellargirl I loooooooovvvvvveee my Kindle2. ...,@ stellargirl I loooooooovvvvvveee Kindle2 . N...
1,1,4,Mon May 11 03:18:03 UTC 2009,kindle2,vcu451,Reading my kindle2... Love it... Lee childs i...,Reading kindle2 ... Love ... Lee child good re...
2,1,5,Mon May 11 03:18:54 UTC 2009,kindle2,chadfu,"Ok, first assesment of the #kindle2 ...it fuck...","Ok , first assesment # kindle2 ... fucking roc..."
3,1,6,Mon May 11 03:19:04 UTC 2009,kindle2,SIX15,@kenburbary You'll love your Kindle2. I've had...,@ kenburbary You 'll love Kindle2 . I 've mine...
4,1,7,Mon May 11 03:21:41 UTC 2009,kindle2,yamarama,@mikefish Fair enough. But i have the Kindle2...,@ mikefish Fair enough . But Kindle2 I think '...


With this data, let's use the `CountVectorizer` to turn the tweets into a collection of words. To make sure we exclude any hashtags and @ symbols, we will also pass a regular expression tokenizer through the `tokenizer` parameter so that only alphabetic characters are kept.
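To see what the pattern `[a-zA-Z]+` keeps and discards, here is a small sketch that uses the standard library's `re.findall` as a stand-in for the NLTK tokenizer; for this pattern the two should behave the same way. The helper name `alphabetic_tokens` and the sample text are ours.

```python
import re

# Same pattern as the tokenizer used in this section: runs of ASCII letters
WORD_PATTERN = re.compile(r"[a-zA-Z]+")

def alphabetic_tokens(text):
    """Return every maximal run of letters; digits and symbols act as separators."""
    return WORD_PATTERN.findall(text)

print(alphabetic_tokens("@stellargirl I love my #Kindle2 ... it's great!"))
# ['stellargirl', 'I', 'love', 'my', 'Kindle', 'it', 's', 'great']
```

Note the side effects: the digit in `Kindle2` is dropped and `it's` is split into `it` and `s`, because anything that is not a letter ends the current token.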

Additionally, we want to specify `ngram_range = (1,2)` so that the features include bigrams as well as unigrams, which gives the model some context around each word. Consider the double negative `I do not dislike`, which carries positive sentiment. Split into unigrams, it yields `not` and `dislike`, both of which look negative on their own. With bigrams, the algorithm can learn that `not dislike` is actually a positive term, which gives it a chance at handling hard-to-decipher sentiments such as double negatives and, to a lesser degree, sarcasm. This only helps when `not` survives preprocessing; common English stop word lists tend to include it, so it is worth checking which words were removed.
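The unigram and bigram features for the double negative can be generated by hand. This is a minimal sketch with a hypothetical `ngrams` helper; it skips stop word removal, which a vectorizer would apply before building n-grams.

```python
def ngrams(tokens, n):
    """Return all runs of `n` consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "i do not dislike school".split()
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)

print(unigrams)
print(bigrams)
```

The bigram list contains `not dislike` as a single feature, so a model can assign it a weight that differs from the weights of `not` and `dislike` taken separately.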

In [0]:
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import RegexpTokenizer

token = RegexpTokenizer(r'[a-zA-Z]+')
cv = CountVectorizer(lowercase=True,stop_words='english',ngram_range = (1,2),tokenizer = token.tokenize)
tweet_text_cv = cv.fit_transform(df_tweets['processed_text'])

Why are we doing this? Given that we have a label for whether each piece of text exhibits positive, negative, or neutral emotion, we can use the count vector to see which words tend to occur in positive texts, which in negative texts, and so on.

In [40]:
# Order the vocabulary by column index so the names line up with the matrix columns
vocab = sorted(cv.vocabulary_, key=cv.vocabulary_.get)

df_twitter_cv = pd.DataFrame(tweet_text_cv.toarray(), columns=vocab)
df_twitter_cv["polarity"] = df_tweets["polarity"]

df_twitter_cv.head()

Unnamed: 0,aapl,aapl es,abortion,abortion zealot,absolutely,absolutely blow,absolutely hilarious,accannis,accannis edog,access,access damn,access throttle,accident,accident guess,accident location,according,according create,accosts,accosts roger,account,account request,acg,acg custom,aching,acia,acia pills,actually,actually quite,ad,ad adobe,ad w,adam,adam lambert,add,add people,addiction,addiction thank,addictive,adidas,adidas billups,...,years great,yeeeee,yeezy,yeezy khaki,yema,yes,yes gm,yes lol,yes m,yes video,yesterday,yesterday cbs,yk,yo,yo teach,york,york times,youtube,youtube adobe,yr,yr old,ytz,yuan,yuan invested,yummmmmy,zealot,zealot n,zero,zero desire,zet,zet o,zic,zlff,zomg,zomg g,zoom,zoom lebron,zydrunas,zydrunas awesome,polarity
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1


If a certain word occurs very frequently in texts that are labeled as positive texts, then we can make the assumption that the word is positive. So if in the future we encounter a sentence with this word, we should classify the sentence as positive. 

With this idea in mind, let's train a model to predict whether a sentence exhibits positive, negative, or neutral emotions. 

In [41]:
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression
from sklearn import metrics

X_train, X_test, y_train, y_test = train_test_split(tweet_text_cv, 
                                                    df_tweets['polarity'], 
                                                    test_size=0.3, 
                                                    random_state=1)

sentiment_analyzer = LinearRegression().fit(X_train, y_train)
predicted = sentiment_analyzer.predict(X_test)
print("LinearRegression R^2:\t", sentiment_analyzer.score(X_test, y_test))
print("LinearRegression MSE:\t",metrics.mean_squared_error(y_test, predicted))

LinearRegression R^2:	 0.3904061882859056
LinearRegression MSE:	 0.43156532563528044
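Both reported numbers can be reproduced by hand. The sketch below uses made-up values, and the function names `mse` and `r_squared` are ours rather than scikit-learn's.

```python
def mse(y_true, y_pred):
    """Mean of the squared differences between labels and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    """1 minus (residual sum of squares / total sum of squares around the mean)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up labels and predictions, not taken from the tweet data
y_true = [1, -1, 0, 1]
y_pred = [0.5, -0.5, 0.0, 1.0]

print("MSE:", mse(y_true, y_pred))
print("R^2:", r_squared(y_true, y_pred))
```

An $R^2$ of 1 means the predictions match the labels exactly, while 0 means the model does no better than always predicting the mean label.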


Let's see this bad boy in action on the following sentences. In each case the model's prediction leans in the expected direction: positive for the first, negative for the second, and slightly positive for the double negative.

In [47]:
# Clearly positive sentence
test = cv.transform(["I love being here!"])  # predict accepts the sparse matrix directly
sentiment_analyzer.predict(test)

array([0.23681886])

In [48]:
# Clearly negative sentence
test = cv.transform(["I hate this."])  # predict accepts the sparse matrix directly
sentiment_analyzer.predict(test)

array([-0.41152944])

In [9]:
# Ambiguous positive sentence with double negative
test = cv.transform(["I do not dislike school."])  # predict accepts the sparse matrix directly
sentiment_analyzer.predict(test)

array([0.05733004])

Since we used a linear model, we can inspect the coefficient of each variable to see what contributes most to positive and negative sentiment. Recall that each variable represents an n-gram (a unigram or a bigram): the n-gram with the largest coefficient is the `most positive` and the n-gram with the smallest coefficient is the `most negative`.

In [10]:
ngrams = [x for x in df_twitter_cv.columns if x != "polarity"]

df_word_coef = pd.DataFrame([ngrams,sentiment_analyzer.coef_], index=["word", "coef"]).T.set_index("word")
df_word_coef.sort_values("coef", ascending=False)

word,coef
loves twitter,0.441781
loves,0.441781
g,0.404112
cool,0.387307
loved,0.38585
...,...
gm,-0.389355
comcast,-0.390205
fighting,-0.406231
fighting latex,-0.406231


As expected, words like `hate` and `fighting` have strongly negative coefficients while words like `loves` and `cool` have strongly positive ones. Another thing we can look at is the intercept, which is the score the model predicts for a text that contains none of the n-grams in its vocabulary, and so gives a rough sense of the baseline sentiment of the training corpus. As you can see below, that baseline is close to neutral.

In [11]:
sentiment_analyzer.intercept_

0.01035837109566326
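To see how the coefficients and the intercept combine into a prediction, here is a minimal sketch. It copies the intercept and three coefficients from the outputs above; the real model has one coefficient per n-gram in the vocabulary, so treating every other n-gram as absent is a simplification, and `linear_score` is a hypothetical helper.

```python
# Values copied from the outputs above; any n-gram not listed here is
# treated as absent. The function name and dict are illustrative only.
INTERCEPT = 0.01035837109566326
COEFS = {"loves": 0.441781, "cool": 0.387307, "comcast": -0.390205}

def linear_score(ngram_counts):
    """Intercept plus the sum of coefficient * count over the n-grams present."""
    return INTERCEPT + sum(COEFS.get(g, 0.0) * c for g, c in ngram_counts.items())

print(round(linear_score({}), 4))                          # no known n-grams: just the intercept
print(round(linear_score({"cool": 1}), 4))                 # one positive n-gram
print(round(linear_score({"cool": 1, "comcast": 1}), 4))   # positive and negative nearly cancel
```

A text with no known n-grams scores exactly the intercept, which is why the intercept can be read as the model's baseline.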

## NLTK Implementation 

This is a lot of tedious work, and whenever there is a lot of tedious work, you can bet that there's a library for that. The following is the NLTK sentiment analysis algorithm using a very similar technique as we implemented above. 

NLTK is a different library from scikit-learn, so it requires us to do our preprocessing a bit differently. The remainder of the section will walk you through the differences.

### Text Preprocessing

NLTK requires training examples in the form of a list of 2-element tuples, where the first element is a list of word tokens and the second element is the class. An example would be:

```
[(["I", "love", "Pepsi"], 1), (["I", "am", "not", "a", "fan", "of", "Coke"], -1), ...]
```

The following code creates this encoding as well as creating a train test split:

In [28]:
from nltk.tokenize import TweetTokenizer
from nltk.classify import NaiveBayesClassifier
from nltk.sentiment import SentimentAnalyzer
from nltk.sentiment.util import extract_unigram_feats

tweet_tknize = TweetTokenizer()
df_tweetsnltk = df_tweets[["polarity", "processed_text"]].copy()

polarity_score = df_tweetsnltk.polarity
tokenized_tweets = list(df_tweetsnltk.processed_text.apply(tweet_tknize.tokenize))

# Pair each token list with its label: [(tokens, polarity), ...]
tweets_formatted = list(zip(tokenized_tweets, polarity_score))

# First 70% of the rows for training, the rest for testing (no shuffling)
split_at = int(0.70 * len(tweets_formatted))
training_tweets = tweets_formatted[:split_at]
testing_tweets = tweets_formatted[split_at:]

tweets[0:2]

[['@',
  'stellargirl',
  'I',
  'loooooooovvvvvveee',
  'Kindle2',
  '.',
  'Not',
  'DX',
  'cool',
  ',',
  '2',
  'fantastic',
  'right',
  '.'],
 ['Reading',
  'kindle2',
  '...',
  'Love',
  '...',
  'Lee',
  'child',
  'good',
  'read',
  '.']]

### Feature Extraction

Now that we have the tokenized tweets and their labels, let's create our `SentimentAnalyzer` object and extract unigram features to prepare the tweets for training:

In [0]:
# Create our SentimentAnalyzer
sentim_analyzer = SentimentAnalyzer()

# Get all words/tokens that are in our training set
# This formats our data in the way that the next function requires it
all_words = sentim_analyzer.all_words(training_tweets)

# Get the formatted all_words list and create unigrams out of it
unigram_feats = sentim_analyzer.unigram_word_feats(all_words, min_freq=4)

# We add this feature to our SentimentAnalyzer object
sentim_analyzer.add_feat_extractor(extract_unigram_feats, unigrams=unigram_feats)

### Training our Sentiment Analyzer

We will use a `NaiveBayesClassifier` here. NLTK's `SentimentAnalyzer` is built around classifiers, so the prediction is a class label (-1, 0, or 1) rather than a continuous score.
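For intuition about what a naive Bayes classifier does with word-presence features, here is a small from-scratch sketch of a Bernoulli naive Bayes with add-one smoothing. It is an illustration under our own assumptions (toy data, hypothetical function names), not NLTK's implementation, which differs in details such as how it smooths probabilities.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """examples: list of (tokens, label). Returns priors, smoothed per-label
    word-presence probabilities, and the vocabulary."""
    label_counts = Counter(label for _, label in examples)
    vocab = {tok for tokens, _ in examples for tok in tokens}
    present = defaultdict(Counter)  # present[label][word] = texts with that label containing word
    for tokens, label in examples:
        present[label].update(set(tokens))
    priors = {lab: n / len(examples) for lab, n in label_counts.items()}
    # Add-one smoothing keeps every probability strictly between 0 and 1
    likelihood = {
        lab: {w: (present[lab][w] + 1) / (n + 2) for w in vocab}
        for lab, n in label_counts.items()
    }
    return priors, likelihood, vocab

def classify(tokens, priors, likelihood, vocab):
    """Pick the label with the highest log posterior, treating words as independent."""
    seen = set(tokens)
    scores = {}
    for lab, prior in priors.items():
        score = math.log(prior)
        for w in vocab:
            p = likelihood[lab][w]
            score += math.log(p if w in seen else 1 - p)
        scores[lab] = score
    return max(scores, key=scores.get)

# Toy training data in the (tokens, label) format described above
toy_training = [
    (["i", "love", "pepsi"], 1),
    (["love", "this", "phone"], 1),
    (["i", "hate", "coke"], -1),
    (["hate", "this", "weather"], -1),
]
model = train_naive_bayes(toy_training)
print(classify(["i", "love", "this"], *model))
print(classify(["hate", "it"], *model))
```

The "naive" part is the independence assumption: each word's presence or absence contributes its own factor to the score, regardless of which other words appear.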

In [57]:
training_set = sentim_analyzer.apply_features(training_tweets)
trainer = NaiveBayesClassifier.train
classifier = sentim_analyzer.train(trainer, training_set)

Training classifier


### Testing our Sentiment Analyzer

In [58]:
testing_set = sentim_analyzer.apply_features(testing_tweets)
sorted(sentim_analyzer.evaluate(testing_set).items())

Evaluating NaiveBayesClassifier results...


[('Accuracy', 0.6333333333333333),
 ('F-measure [-1]', 0.6837606837606838),
 ('F-measure [0]', 0.6304347826086957),
 ('F-measure [1]', 0.5714285714285714),
 ('Precision [-1]', 0.625),
 ('Precision [0]', 0.8055555555555556),
 ('Precision [1]', 0.52),
 ('Recall [-1]', 0.7547169811320755),
 ('Recall [0]', 0.5178571428571429),
 ('Recall [1]', 0.6341463414634146)]
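The per-class metrics in this report follow the usual definitions of precision, recall, and F-measure. A minimal sketch on made-up labels (the helper `per_class_metrics` is ours, not an NLTK function):

```python
def per_class_metrics(gold, predicted, label):
    """Precision, recall and F-measure for one label, treating it as the positive class."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, predicted) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, predicted) if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure

# Made-up labels for illustration, not the tweet test set
gold      = [1, 1, 1, 0, 0, -1, -1, -1]
predicted = [1, 1, 0, 0, -1, -1, -1, 1]

accuracy = sum(g == p for g, p in zip(gold, predicted)) / len(gold)
print("Accuracy", accuracy)
for label in (-1, 0, 1):
    print(label, per_class_metrics(gold, predicted, label))
```

The F-measure is the harmonic mean of precision and recall. For class -1 in the report above, $2 \cdot 0.625 \cdot 0.7547 / (0.625 + 0.7547) \approx 0.6838$, which matches the reported value.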

Now that we know it works, let's test it on the same three sentences we used with the linear model.

In [63]:
# Clearly positive sentence
sentim_analyzer.classify("I love being here!".split(" "))

1

In [64]:
# Clearly negative sentence
sentim_analyzer.classify("I hate this.".split(" "))

-1

In [65]:
# Ambiguous positive sentence with double negative
sentim_analyzer.classify("I do not dislike school.".split(" "))

1

# Exercises