# Sentiment Analysis

*Sentiment analysis* is the task of evaluating whether a given passage of text is primarily "positive" or "negative." The meanings of these terms can change in context. For example, a "positive" product review would indicate that the customer likes the product, whereas a "positive" tweet might just indicate that the user is happy that day. 

Today, we'll discuss how familiar machine learning tools can allow us to perform sentiment analysis on unstructured text. 

Our data set for this task comes from the `nltk` package again. It's a set of movie reviews. 

In [None]:
import numpy as np
import pandas as pd
import nltk

nltk.download('movie_reviews')

from nltk.corpus import movie_reviews

In [None]:
type(movie_reviews)

The `movie_reviews` object allows us to read in the data. 

In [None]:
movie_reviews

For today, the two most important methods of this object are `fileids()` and `raw()`. The first method will allow us to locate the files on disk in which the movie reviews are contained, and the second method will allow us to then obtain the full text of the reviews from the file path. 

Let's first look at the fileids. 

In [None]:
f = movie_reviews.fileids()[0]
f

Each review is contained in its own file, in one of two folders. The `neg` folder contains negative reviews, while the `pos` folder contains positive reviews. 

Once we have picked a file path, we can then use the `raw()` method to extract the raw text of the movie review. 

In [None]:
movie_reviews.raw(f)

Take a moment to think: how can we read in the complete data set? 

<br> 
<br> 
<br> 
<br> 
<br> 
<br> 

A `for`-loop would be one way. In this approach, we would create an empty list to hold the review texts, iterate over the list of file paths, and populate the list of texts as we go. For example: 

In [None]:
raw_texts = []
for p in movie_reviews.fileids():
    raw_texts.append(movie_reviews.raw(p))

This does work, but it requires three lines and still leaves us with the task of bringing the texts into a format (like a data frame) that we know how to work with. 

Using the `apply` method from `pandas` gives us a more concise way, and it leaves the texts in a data frame right away: 

In [None]:
# create a data frame whose only column contains the fileids
df = pd.DataFrame({"fileid" : movie_reviews.fileids()})

# create a new column by applying the movie_reviews.raw()
# method to each entry of df['fileid']
df['raw_text'] = df['fileid'].apply(movie_reviews.raw)

In [None]:
df
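
To see that `apply` gives the same result as the explicit loop, here is a small self-contained sketch. The dictionary `fake_corpus` and the function `fake_raw` are hypothetical stand-ins for the corpus and for `movie_reviews.raw()`, so that the example runs without downloading any data.

```python
import pandas as pd

# Hypothetical stand-in for the corpus: file id -> review text.
fake_corpus = {
    "neg/review_a.txt": "a dull and lifeless film",
    "pos/review_b.txt": "a warm and funny film",
}

def fake_raw(fileid):
    """Stand-in for movie_reviews.raw(): look up the text for one file id."""
    return fake_corpus[fileid]

# Loop version: build the list one element at a time.
loop_texts = []
for fileid in fake_corpus:
    loop_texts.append(fake_raw(fileid))

# apply version: call fake_raw on every entry of the column.
toy_df = pd.DataFrame({"fileid": list(fake_corpus)})
toy_df["raw_text"] = toy_df["fileid"].apply(fake_raw)

# Both approaches produce the same texts in the same order.
print(toy_df["raw_text"].tolist() == loop_texts)
```

Note that `apply` still calls the function once per row, so the gain here is in readability rather than speed.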

We have now read in the data. Do we have what we need for sentiment analysis? 

Not quite yet, but we're close! In this lecture, we'll treat sentiment analysis as a form of *classification*: our aim is to build a machine learning model that we can use to predict whether a given text is positive or negative. For this approach, we are going to need both target and predictor variables. Fortunately, we know how to obtain both of these. 

In [None]:
# check whether the text came from the pos folder. 
df['is_good'] = df['fileid'].str.split('/').str.get(0) == 'pos'
df
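
If the chained string methods look opaque, it may help to run them step by step on a few file ids. The ids below are hypothetical examples written in the same `folder/filename` format; they are not actual entries from the corpus.

```python
import pandas as pd

# Hypothetical file ids in the same "folder/filename" format.
toy_ids = pd.Series(["neg/review_a.txt", "pos/review_b.txt", "pos/review_c.txt"])

# Split each id on "/" and keep the first piece, which is the folder name.
folders = toy_ids.str.split("/").str.get(0)

# Compare against "pos" to get a boolean target column.
is_good = folders == "pos"

print(folders.tolist())
print(is_good.tolist())
```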

We can use tools from before to create a term-document matrix. This time, we treat each movie review as a document. 

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer(max_df = 0.2, min_df = 30, stop_words = 'english')

counts = vec.fit_transform(df['raw_text'])
count_df = pd.DataFrame(counts.toarray(), columns = vec.get_feature_names_out())
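
To make the roles of `min_df` and `max_df` concrete, here is a rough pure-Python sketch of document-frequency filtering on four invented "reviews." It is a simplification and not how `CountVectorizer` is implemented: it splits on whitespace only and skips stop-word removal. The thresholds are also scaled down to suit the tiny example.

```python
from collections import Counter

# Four invented documents, for illustration only.
docs = [
    "great film great cast",
    "dull film weak plot",
    "great plot dull cast",
    "film with a great ending",
]
tokenized = [doc.split() for doc in docs]

# Document frequency: in how many documents does each word appear?
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

min_df = 2     # an integer is read as an absolute number of documents
max_df = 0.5   # a float is read as a proportion of documents
n_docs = len(docs)

# Keep words that are neither too rare nor too common.
vocab = sorted(
    word for word, df in doc_freq.items()
    if df >= min_df and df <= max_df * n_docs
)

# Term-document matrix: one row per document, one column per kept word.
matrix = [[tokens.count(word) for word in vocab] for tokens in tokenized]

print(vocab)
for row in matrix:
    print(row)
```

Words like "great" and "film" are dropped for appearing in too many documents, while words like "weak" are dropped for appearing in too few.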

In [None]:
df = pd.concat((df, count_df), axis = 1)

In [None]:
df

We have now successfully read in and prepared our data. 

---

# On to Sentiment Analysis

These steps should be pretty familiar. We are going to split our data into training and test sets, create a logistic classifier, and evaluate the logistic classifier on the test set. 

In [None]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size = 0.4, random_state=0)

X_train = train.drop(['fileid', 'raw_text', 'is_good'], axis = 1)
y_train = train['is_good']

X_test = test.drop(['fileid', 'raw_text', 'is_good'], axis = 1)
y_test = test['is_good']

In [None]:
from sklearn.linear_model import LogisticRegression

LR = LogisticRegression()
LR.fit(X_train, y_train)
LR.score(X_train, y_train)

In [None]:
from sklearn.model_selection import cross_val_score

cross_val_score(LR, X_train, y_train, cv = 5).mean()
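
As a reminder of what `cv = 5` means, here is a simplified sketch of how the observations are divided into folds. It assumes plain, unshuffled folds over ten hypothetical row indices; for a classifier, `cross_val_score` with an integer `cv` uses stratified folds, which this sketch ignores.

```python
import numpy as np

n_obs, k = 10, 5
indices = np.arange(n_obs)

# Divide the row indices into k roughly equal validation folds.
folds = np.array_split(indices, k)

for i, val_idx in enumerate(folds):
    # Train on every row that is not held out in this fold.
    train_idx = np.setdiff1d(indices, val_idx)
    print(f"fold {i}: train on {train_idx.tolist()}, validate on {val_idx.tolist()}")
```

Each row is held out exactly once, and the reported score is the mean of the five validation scores.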

Our model perfectly fits the training data, but based on CV it looks like our predictive accuracy might only be around 80%. This looks like overfitting, which makes sense -- overfitting is a very common problem when we have many predictor columns (lots of words) and not that many data observations. 

There are multiple ways to address this. In this lecture, let's use the regularization parameter `C`, which controls model complexity in logistic regression. While one could be more systematic about this, here's a simple little loop: 

In [None]:
for C in np.linspace(0.005, 0.05, 10):
    print(str(np.round(C, 4)), end = ": ")
    LR = LogisticRegression(C = C)
    cv_score = cross_val_score(LR, X_train, y_train, cv = 5).mean()
    print(np.round(cv_score, 3))
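
To build some intuition for what `C` does, the following sketch fits logistic models with different values of `C` and compares the overall size of the coefficients. It uses synthetic data from `make_classification` rather than the movie reviews, and the specific values of `C` and `max_iter` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

norms = {}
for C in [0.01, 1.0, 100.0]:
    model = LogisticRegression(C=C, max_iter=1000)
    model.fit(X, y)
    # Euclidean norm of the coefficient vector: a rough measure of complexity.
    norms[C] = np.linalg.norm(model.coef_)
    print(C, np.round(norms[C], 3))
```

Smaller values of `C` pull the coefficients toward zero, which is why lowering `C` can reduce overfitting.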

Looks like we can improve our estimated accuracy using C = 0.01 or so. Let's do that and evaluate on the test set. 

In [None]:
LR = LogisticRegression(C = 0.01) # L2 penalty by default; the smaller C is, the stronger the regularization
LR.fit(X_train, y_train)
LR.score(X_test, y_test)

So, our simple logistic model is able to correctly identify positive vs. negative movie reviews about 82% of the time. Not bad! 

However, we're not done yet. 

One of the primary purposes of sentiment analysis is to determine which words carry positive or negative associations. It is common to assign each word a score that reflects how positive or negative it is. We can do this using the coefficients of the logistic model. First, let's make a data frame of the words and their scores. 

In [None]:
result_df = pd.DataFrame({"coef" : LR.coef_[0], "word" : X_train.columns})
result_df

Now let's sort the data frame to see the most negative words according to the model. 

In [None]:
result_df.sort_values('coef', ascending = True).head(10)

That makes sense! What about the most positive words? 

In [None]:
result_df.sort_values('coef', ascending = False).head(10)

This also looks pretty logical. We can conclude that our model has had some success in learning which words have positive and negative meanings. 
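
To see why the coefficients can be read as word scores, recall that a logistic model adds up coefficient times count for each word and passes the total through the sigmoid function. The sketch below illustrates this with hypothetical coefficients; the names `toy_coef`, `toy_intercept`, and `positive_probability` are made up for this example, and the numbers are not the values learned by our model.

```python
import math

# Hypothetical coefficients, NOT the values learned by the model above.
toy_coef = {"wonderful": 0.30, "boring": -0.40, "plot": 0.05}
toy_intercept = 0.0

def positive_probability(word_counts):
    """Sigmoid of intercept + sum of (coefficient * count) over the words."""
    log_odds = toy_intercept + sum(
        toy_coef.get(word, 0.0) * count for word, count in word_counts.items()
    )
    return 1.0 / (1.0 + math.exp(-log_odds))

# Log-odds 0.65: the positive word outweighs the neutral one.
print(positive_probability({"wonderful": 2, "plot": 1}))

# Log-odds -1.15: repeated use of a negative word pushes the score down.
print(positive_probability({"boring": 3, "plot": 1}))
```

Each occurrence of a word shifts the log-odds by that word's coefficient, so large positive coefficients mark positive words and large negative coefficients mark negative ones.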

Of course, the story isn't over: there are many different models that can be used for sentiment analysis, some of which highlight different features. 

Finally, the combination of term-document extraction with classification models isn't just for sentiment analysis! Essentially the same pipeline can work to produce a functioning spam classifier, in which a "negative" set of text is spam and a "positive" set of text is a legitimate email. 
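
As a rough sketch of that idea, here is the same pipeline applied to a tiny invented set of messages. The texts and labels are made up for illustration, and a real spam classifier would of course need far more data and careful evaluation.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny invented data set, for illustration only.
texts = [
    "win free money now",
    "free prize click now",
    "claim your free money prize",
    "meeting at noon tomorrow",
    "lunch tomorrow with the team",
    "project meeting notes attached",
]
# Spam plays the role of the "negative" class, legitimate email the "positive" one.
is_legit = [False, False, False, True, True, True]

# Term-document extraction followed by a logistic classifier, in one object.
spam_model = make_pipeline(CountVectorizer(), LogisticRegression())
spam_model.fit(texts, is_legit)

preds = spam_model.predict(["free money prize", "team meeting tomorrow"])
print(preds)
```

Wrapping the vectorizer and the classifier in a single pipeline means that new messages are converted to word counts with the same vocabulary that was learned during training.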