# Learning Sentiment

Someone else's learned sentiment may not be appropriate for your research. This is a common enough issue when you begin to work in narrow niches where the nature of the text is substantially different from a general corpus (like newspapers) or where there are specific constraints on how something can be written (like the character limits on Twitter and other microblogs).

## How do you go about this?

The simple answer is that you want to accomplish a general task -- given $x$ estimate $y$, where $x$ is a token and $y$ is its sentiment score.  

In general, the process is the same as any other prediction task you've already done except in this one you will have to assemble your own prediction dataset and labels. The process is generally as follows with sentiment as a specific example:

1. Extract tokens
2. Show token in context and label token with sentiment (Repeat ad nauseam)
3. Run algorithm

A relatively simple process minus the effort that we have to put into token labelling. 

To construct an individual token's sentiment score, you could simply use the `mean` as your algorithm. But I'd like to take a detour and expand our horizons on how we can estimate a quantity.
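Before the detour, here is a minimal sketch of the `mean` approach. The tokens and scores are made up for illustration; in practice they would come from your own labelling effort.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical labels: (token, sentiment score given by one annotator)
labels = [
    ("happy", 2), ("happy", 3), ("happy", 1),
    ("sucks", -2), ("sucks", -3),
]

# Group the scores by token
scores_by_token = defaultdict(list)
for token, score in labels:
    scores_by_token[token].append(score)

# The "algorithm": the mean of the responses for each token
token_sentiment = {tok: mean(vals) for tok, vals in scores_by_token.items()}
print(token_sentiment)
```

The mean gives a single number per token but says nothing about how much the annotators disagreed, which is part of the motivation for the detour.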

# Bayesian estimation

We can assume that there is a true value of the sentiment for a specific word in a single context. The responses, and the spread in them, inform our approximation of that true value and capture the uncertainty we have in stating it.

When using a Bayesian approach, we are trying to estimate the probability distribution function for the real value (it inherently incorporates uncertainty - which is a good approach when considering something like quantifying the amount of sentiment a word encodes).

The basic idea is that we start with some prior knowledge/distribution of 'truth' for a value and then update it as we receive additional information (i.e. mturk responses). 

<img src='https://miro.medium.com/max/600/0*BwLwi0fgMY6m7b0V.png'>

Mathematically, we just need to follow Bayes' rule

$P(A\mbox{ | }B) = \frac{P(B\mbox{ | }A)P(A)}{P(B)}$

or stated in a data-centric way

$P(Model\mbox{ | }Data) = \frac{P(Data\mbox{ | }Model)P(Model)}{P(Data)}$

where $P(Model)$ is the prior probability for our model and $P(Model\mbox{ | }Data)$ is the posterior probability after we have incorporated the data. $P(Data\mbox{ | }Model)$ is the probability of observing the data given our current model (the likelihood) and $P(Data)$ is the marginal likelihood (which is the same for all models under consideration).
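To make the rule concrete, here is a small worked sketch with two candidate models for a word. All of the numbers (the priors and the chance that an annotator says "positive" under each model) are assumptions made up for illustration.

```python
# P(Model): before seeing any data we treat both models as equally likely
prior = {"positive_word": 0.5, "negative_word": 0.5}

# P(Data | Model), where Data = "one annotator labels the word positive"
likelihood = {"positive_word": 0.8, "negative_word": 0.3}

# P(Data): the same normalizing constant for every model
evidence = sum(likelihood[m] * prior[m] for m in prior)

# P(Model | Data) from Bayes' rule
posterior = {m: likelihood[m] * prior[m] / evidence for m in prior}
print(posterior)
```

After a single "positive" response the posterior shifts towards `positive_word`, and that posterior would serve as the prior for the next response.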

# Sounds complicated?

Fortunately, it's not that hard in practice. There are two ways to go about this - the first that I want you to explore is by hand with scipy.

In [None]:
import scipy.stats as st
import numpy as np

#Set the likelihood
likelihood = np.array([])

#Set our supports
params = np.linspace(-6, 6, 1201)

#And initialize the posterior
posterior = np.array([])

#Construct the prior: a unit-variance normal centered on one random draw,
#evaluated at every candidate value in params
prior_sample = np.random.normal(0, 0.2)
prior = st.norm.pdf(prior_sample, loc=params)
prior = prior / np.sum(prior)

In [None]:
def update_probability(datapoint, likelihood, prior, posterior, params):
    #Likelihood of the datapoint under each candidate mean (unit variance)
    likelihood = st.norm.pdf(datapoint, loc=params)
    #Construct the posterior
    posterior = prior * likelihood
    posterior = posterior / np.sum(posterior)
    #Reset the prior to the new posterior
    prior = np.copy(posterior)
    return likelihood, prior, posterior

#Graph setup
import matplotlib.pyplot as plt
fig = plt.figure(figsize = (16,8))
j = 1

for i in range(100):
    likelihood, prior, posterior = update_probability(2, likelihood, prior, posterior, params)
    if i % 10==0:
        ax = fig.add_subplot(2,5,j)
        ax.plot(params, posterior)
        j+=1
plt.tight_layout()

Of course you don't have to do this by hand -- you can use packages like PyMC or emcee to perform Bayesian estimation and do other MCMC fitting of model parameters.
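For this particular set-up (a normal likelihood with known unit variance and a normal prior) the posterior is also available in closed form, which gives a quick check on a grid calculation. The sketch below assumes a fixed prior center of 0 rather than a random draw, and the variable names are illustrative.

```python
import numpy as np
import scipy.stats as st

params = np.linspace(-6, 6, 1201)      # candidate values for the mean
prior_mean, n_obs, datapoint = 0.0, 100, 2.0

# Grid version: start from the prior, multiply in one likelihood per observation
posterior = st.norm.pdf(prior_mean, loc=params)
posterior /= posterior.sum()
for _ in range(n_obs):
    posterior = posterior * st.norm.pdf(datapoint, loc=params)
    posterior /= posterior.sum()

grid_mean = np.sum(params * posterior)
grid_sd = np.sqrt(np.sum((params - grid_mean) ** 2 * posterior))

# Closed form for a unit-variance normal prior and unit-variance normal likelihood
analytic_mean = (prior_mean + n_obs * datapoint) / (1 + n_obs)
analytic_sd = np.sqrt(1 / (1 + n_obs))

print(grid_mean, analytic_mean)
print(grid_sd, analytic_sd)
```

The posterior mean sits just below 2 because the prior still pulls slightly towards 0, and the spread shrinks as more observations are added.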

# Generalizing text prediction problems

We've started with the idea of estimating the sentiment of a single token -- e.g. `happy` is `+2`. But what if we wanted to expand that to a "document", like a tweet -- how could we do it?

`Blank airlines sucks. I'm so happy to be off`

Is that tweet happy or sad? We could make the prediction based off the individual tokens or we could move up the ladder and try to learn the relationship between all of the tokens and the label for the tweet (which in this case would be `-1` for its sentiment).
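As a sketch of the token-level route, suppose we had scores for just two of the tokens. The lexicon below is hypothetical and the tokenization is deliberately crude.

```python
# Hypothetical token scores (made up for illustration)
lexicon = {"sucks": -2, "happy": 2}

tweet = "Blank airlines sucks. I'm so happy to be off"

# Crude tokenization: split on whitespace, strip punctuation, lowercase
tokens = [t.strip(".,!?").lower() for t in tweet.split()]

# Keep only the tokens we have a score for, then average them
scored = [lexicon[t] for t in tokens if t in lexicon]
tweet_score = sum(scored) / len(scored) if scored else 0.0
print(scored, tweet_score)
```

With these made-up scores the two sentiment words cancel out and the average lands at 0 rather than the `-1` label, which is one reason to learn from the document as a whole.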

There are a host of machine learning algorithms that can handle this task of learning the relationship between tokens and a predicted score. The general way to think about this problem is:

<img src='../images/token_prediction.jpeg' width='600px'>

And what we need to do is learn the weighting of those connections from the tokens to the labels. Once we learn those weightings and we have a new single document, we can then simply feed it in, calculate, and out comes our predicted label. 

<img src='../images/token_prediction_1.jpeg' width='600px'>


# The world of `scikit-learn`

There's a vast world of machine-learning algorithms that can be implemented to perform that type of prediction task. One of the most popular libraries that collects these algorithms is `scikit-learn`. Algorithms that could be used to perform this task include:

* Naive Bayes
* Random Forests
* Boosted Forests
* SVM

and more. A typical question that students ask is which algorithm should I use? In truth, that answer should depend on the problem type and the amount of data that you have to train it. There have been rules of thumb like so:

<img src='https://scikit-learn.org/stable/_static/ml_map.png'>

but these of course grow outdated with time. However, the basic map identifying the core types of ML tasks is a fundamental piece of solving the potential problem universe (and a more important step than it may seem in real world problems).
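Because `scikit-learn` estimators share the same `fit`/`predict` interface, trying a different algorithm is usually a one-line change. The sketch below runs three of the classifiers listed above on a toy corpus made up for illustration; the data is far too small to say anything about which one is better.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# Toy documents and labels, made up for illustration (1 = positive, 0 = negative)
docs = [
    "happy great flight", "great crew happy",
    "flight sucks awful", "awful delay sucks",
]
labels = [1, 1, 0, 0]

vec = CountVectorizer()
X = vec.fit_transform(docs)
X_new = vec.transform(["happy great", "awful sucks"])

models = [
    MultinomialNB(),
    RandomForestClassifier(n_estimators=100, random_state=0),
    SVC(kernel="linear"),
]

# Same two calls for every algorithm
predictions = {}
for model in models:
    model.fit(X, labels)
    predictions[type(model).__name__] = model.predict(X_new)
print(predictions)
```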

## Predicting Sentiment for a Paragraph

Here we'll use the `PerSenT` data (https://stonybrooknlp.github.io/PerSenT/) -- this dataset captures the sentiment an author expresses towards the main entity in a news article. It has >5000 documents and 38,000 paragraphs.

In [None]:
import pandas as pd

traindf = pd.read_csv('../data/PerSenT/train.csv')
traindf.head()

In [None]:
for i,x in enumerate(traindf.DOCUMENT[0].split('\n')):
    print(i, x)

In [None]:
traindf.iloc[0]

So a straightforward set-up: we have our document, and each paragraph is marked by a newline character. We then have columns with the sentiment label for each paragraph and for the document overall. We can see that the longest piece should be 16 paragraphs, so we either have 3,355 documents or 3,355 documents * N_paragraphs as our training sample.

If we're going to use the full vocabulary of the corpus as our input set, we'll need to build that up.

**Exercise** Extract all individual tokens from the documents to form a vocabulary.

In [None]:
#Exercise 


Now that we've pulled out all of the vocabulary we could count it up and look at the counts

In [None]:
from collections import Counter
import matplotlib.pyplot as plt

print(f'Number of words: {len(vocab)}')
print(f'Unique words: {len(set(vocab))}')
cvocab = Counter(vocab)
plt.hist(list(cvocab.values()), log=True, bins=100);

And what do we see -- lots of words that are only used a handful of times (~11,000 out of 712,971) and a few words that are used thousands of times.

If we are going to build an input feature set of tokens -- is this what we want to submit? The raw token usage? 

Want to find out how well it works as-is anyways? Well why not, let's do it. 

First thing we need to do is to figure out what is our input matrix -- easiest thing to do is to get our vocab set

In [None]:
word_idx = sorted(list(set(vocab)))
print(len(word_idx))
word_idx[:10]

So for each document we will end up having a 30,359 length feature vector. For ease of use let's just use whole documents instead of the paragraphs within them -- that means our input matrix will be 3355 * 30359, which we can create with numpy.

In the interest of time though, let's only do the first 1,000 documents

In [None]:
import numpy as np

max_doc = 1000

#One row per document, one column per vocabulary word
X = np.zeros((max_doc, len(word_idx)))
X.shape

**Exercise** Now we need to figure out how to populate this matrix....

In [None]:
#Exercise


Now that we have our X we can go off to the races and attempt to train an algorithm to predict Y which is the document label (`TRUE_SENTIMENT`).

For our first attempt, let's use something simple like a random forest classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

#Encode the text labels to numbers to produce y
le = LabelEncoder()
y = le.fit_transform(traindf.TRUE_SENTIMENT[:max_doc])

#100 trees for speed......
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X, y)

And of course, we should see how well we're doing!!! To do that, we'll need to make predictions and score against the known labels.

In [None]:
from sklearn.metrics import classification_report, accuracy_score

target_names=['Negative', 'Neutral', 'Positive']

ypred = clf.predict(X)

print( classification_report(y, ypred, target_names=target_names) )
print( accuracy_score(y, ypred))

**WOWEE** Looks perfect!!!!!!!!!

....
....
....

yeah. That's obviously not right.

You can't test your algorithm's performance on the data that you trained it on, or else it'll look perfect (think cramming for a test by rote repetition). We don't care about how well it learned the given dataset; we want to be confident in its ability to predict what we have not labelled.

For that we need a test dataset. We can split our current dataset into train and test to accomplish this.

In [None]:
from sklearn.model_selection import train_test_split

Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2, random_state=9)

clf = RandomForestClassifier(n_estimators=100)
clf.fit(Xtrain, ytrain)
ypred = clf.predict(Xtest)

print(classification_report(ytest, ypred, target_names=target_names))
print(accuracy_score(ytest, ypred))

**WOWEE**, well now that sucks. Let's talk about the right way to do this

# TF-IDF

One of the most used processing steps when transforming unstructured text into a prediction problem is TF-IDF (Term Frequency-Inverse Document Frequency). This process involves the switch from counts to frequencies and the comparison of a term's overall frequency across the corpus to its frequency within a document. There's more than one operationalization of these terms, but for ease of use we'll use:

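$tf(t,d) = \frac{f(t,d)}{\sum_{t'\in d} f(t',d)}$

$idf(t, D) = \log\frac{N}{|\{d\in D:t\in d\}|}$

where $t$ is a term, $d$ is a single document, $f(t,d)$ is the number of times $t$ appears in $d$ (so the denominator of $tf$ is the total number of terms in $d$), $D$ is the corpus, $N$ is the total number of documents in the corpus, and $|\{d\in D:t\in d\}|$ is the number of documents where term $t$ appears. The final calculation is thus:

$tfidf(t,d,D) = tf(t,d) \cdot idf(t,D)$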
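To see the formulas in action, here is a by-hand sketch on a toy corpus made up for illustration. The function names are hypothetical helpers, not part of any library.

```python
import math
from collections import Counter

# Toy corpus, made up for illustration; each document is a list of tokens
corpus = [
    ["flight", "late", "late"],
    ["flight", "great"],
    ["crew", "great", "great", "great"],
]

def tf(term, doc):
    # Count of the term divided by the number of tokens in the document
    return Counter(doc)[term] / len(doc)

def idf(term, corpus):
    # log(number of documents / number of documents containing the term)
    # Assumes the term appears in at least one document
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_containing)

def tfidf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

late_score = tfidf("late", corpus[0], corpus)      # frequent here, absent elsewhere
flight_score = tfidf("flight", corpus[0], corpus)  # appears in two of three documents
print(late_score, flight_score)
```

A term that is frequent in one document but rare across the corpus gets a high score, while a term spread across many documents is pushed towards zero. Note that scikit-learn's `TfidfVectorizer` defaults to a slightly different variant (raw counts for $tf$, a smoothed $idf$ of $\ln\frac{1+N}{1+df}+1$, and L2 normalization of each row), so its numbers will not match this calculation exactly.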

What does this do for us?

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

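#stop (the stop-word collection) and nltk_cleaner are assumed to be defined earlier in the notebook
vectorizer = TfidfVectorizer(max_features=2500, stop_words=list(stop))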

#Pull the texts out
texts = []
for doc in traindf.DOCUMENT:
    texts.append( ' '.join(nltk_cleaner(doc.split(), stop)) )

features = vectorizer.fit_transform(texts).toarray()

And since we already know that we need to have a test set, let's do that

In [None]:
y = le.fit_transform(traindf.TRUE_SENTIMENT)

Xtrain, Xtest, ytrain, ytest = train_test_split(features, y, test_size=0.2, random_state=9)

And re-run our model

In [None]:
clf = RandomForestClassifier(n_estimators=100)
clf.fit(Xtrain, ytrain)
ypred = clf.predict(Xtest)

print(classification_report(ytest, ypred, target_names=target_names))
print(accuracy_score(ytest, ypred))

Hey! We got better!