# Sentiment Analysis
Sentiment analysis is a machine learning application that aims to infer the sentiment expressed in a piece of text. It can be useful to businesses trying to gauge customer satisfaction, or to anyone tracking opinion on a particular issue on social media.

We will attempt to use the perceptron and logistic regression to automatically analyse the sentiment of movie reviews. The [data](https://web.stanford.edu/class/cs221/assignments/sentiment/index.html) are from [Percy Liang's](https://cs.stanford.edu/~pliang/) course.

Initially, let us look at a simple data set of four reviews of a university course:
1. easy (+ve)
1. very informative (+ve)
1. useless stuff (-ve)
1. hard (-ve)

Each review is a string, which we treat as our input $x$. A useful feature vector $\phi(x)$ holds the count of each of the six words ['easy', 'very', 'informative', 'useless', 'stuff', 'hard'] in the review. For example, $\phi('easy') = [1, 0, 0, 0, 0, 0]$.

To train a perceptron on this dataset, we first need a function that builds the feature vector from the input string.

We will use the `Counter` class from the standard-library `collections` module. Let's play around with this small data set before moving on to the movie review data set.

In [None]:
from collections import Counter

data = ['easy', 'very informative', 'useless stuff', 'hard']

# obtain the counts of words in each string
counts = []
for string in data:
    str_split_cnt = Counter(string.split())
    counts.append(str_split_cnt)
    
    
# get the dictionary of all words
cum_counts = Counter()
for count in counts:
    cum_counts += count
words = list(cum_counts.keys())

print(words)
print(counts)

We now define a function that maps a string's word counts to a feature vector holding the count of every word in the dictionary. This vector will often be very sparse, and more efficient representations exist; here we generate dense vectors for simplicity.

In [None]:
import numpy as np


def word_count(string_count, word_list):
    '''Compute a feature vector from the word counts of a string.

    Args:
        string_count: a Counter mapping each word in the string to its count
        word_list: list of all words in the dictionary
    Returns:
        word_count_vector: array of counts, one entry per word in word_list
    '''
    word_count_vector = np.zeros(len(word_list))
    for word, count in string_count.items():
        word_count_vector[word_list.index(word)] = count

    return word_count_vector
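
As a rough illustration of the sparse alternative mentioned above, a linear score $w \cdot \phi(x)$ can be computed directly from the `Counter`, without ever building the dense vector. The helper `sparse_dot` and the hand-picked `weights` below are hypothetical and only meant as a sketch.

```python
from collections import Counter


def sparse_dot(weights, features):
    """Dot product of two sparse vectors stored as word -> value mappings."""
    # Only words that occur in the review can contribute to the sum
    return sum(weights.get(word, 0.0) * count for word, count in features.items())


# Hypothetical weights, chosen by hand purely for illustration
weights = {'easy': 1.0, 'informative': 0.5, 'useless': -1.0, 'hard': -1.0}

features = Counter('very informative'.split())
score = sparse_dot(weights, features)
print(score)
```

The cost of each score depends on the number of distinct words in the review rather than on the size of the dictionary, which matters once the dictionary holds thousands of words.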

In [None]:
print(word_count(counts[3], words))
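
With the feature vectors in hand, a perceptron can be trained on the four toy reviews. The cell below is a minimal sketch, assuming labels of +1 for positive and -1 for negative reviews, no bias term, and an arbitrary cap of 10 passes over the data; it rebuilds the features itself so that it runs on its own.

```python
import numpy as np

# Toy data set from above; labels are +1 (positive) and -1 (negative)
data = ['easy', 'very informative', 'useless stuff', 'hard']
labels = np.array([1, 1, -1, -1])

# Dense count features: one column per distinct word, in order of first appearance
vocab = []
for review in data:
    for word in review.split():
        if word not in vocab:
            vocab.append(word)
X = np.array([[review.split().count(word) for word in vocab] for review in data],
             dtype=float)

w = np.zeros(len(vocab))  # weights start at zero
for epoch in range(10):   # the number of passes is an arbitrary choice
    mistakes = 0
    for x_i, y_i in zip(X, labels):
        # A review is misclassified if its score has the wrong sign (or is zero)
        if y_i * np.dot(w, x_i) <= 0:
            w += y_i * x_i  # perceptron update
            mistakes += 1
    if mistakes == 0:       # every review is classified correctly
        break

predictions = np.sign(X @ w)
print(dict(zip(vocab, w)))
print(predictions)
```

Because no word is shared between a positive and a negative review, the toy data are linearly separable and the loop stops after a pass with no mistakes.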

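We now move to the movie review data set. The cells below assume the data have already been downloaded to `../data/polarity.train`. We first look at a few raw lines, then extract the training strings and labels.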

In [None]:
# read the first five lines, strip out the final newline
with open('../data/polarity.train', 'r') as reviews_train:
    for _ in range(5):
        print(reviews_train.readline().strip())


We now obtain the entire training set.

In [None]:
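labels = []
word_counts = []
# the with-block closes the file once every line has been read
with open('../data/polarity.train', 'r', encoding="utf-8", errors='ignore') as reviews_train:
    for curr_line in reviews_train:
        curr_line = curr_line.strip()
        curr_review = curr_line[3:-2]    # text after the label, minus the final two characters
        curr_label = int(curr_line[:2])  # the first two characters hold the label
        labels.append(curr_label)
        word_counts.append(Counter(curr_review.split()))
        print(curr_label, curr_review)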
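
The fixed slice positions above depend on the exact width of the label. A possible alternative, sketched here under the assumption that every line consists of a label, whitespace, and then the review text, is to split on the first run of whitespace. The function `parse_line` and the example lines are made up for illustration and are not taken from the data set. Unlike the slicing above, this version keeps every token, including any final punctuation.

```python
from collections import Counter


def parse_line(line):
    """Split one line into an integer label and a Counter of its words."""
    # Assumed layout: label, whitespace, then the review text
    label_text, review = line.strip().split(maxsplit=1)
    return int(label_text), Counter(review.split())


# Made-up example line, not taken from the data set
label, counts = parse_line('+1 a very informative film\n')
print(label, counts)
```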

In [None]:
cum_counts = Counter()
for count in word_counts:
    cum_counts += count
words = list(cum_counts.keys())

print(words)
print(word_counts)

In [None]:
# How many words?
len(words)

In [None]:
X = np.zeros((len(word_counts), len(words)))
y = np.array(labels)
for index, count in enumerate(word_counts):
    X[index, :] = word_count(count, words)
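
Each call to `word_list.index` scans the word list from the start, which becomes slow for a large dictionary. One possible speed-up, shown here as a sketch, is to build a word-to-column mapping once and reuse it for every review. The names `build_count_matrix` and `word_index` are hypothetical; the check at the end uses the toy reviews from earlier.

```python
import numpy as np
from collections import Counter


def build_count_matrix(word_counts, word_list):
    """Stack one dense count vector per review into a matrix."""
    # Map each word to its column once, so every lookup is a dict access
    word_index = {word: col for col, word in enumerate(word_list)}
    X = np.zeros((len(word_counts), len(word_list)))
    for row, count in enumerate(word_counts):
        for word, n in count.items():
            X[row, word_index[word]] = n
    return X


# Small check on the toy reviews
toy_counts = [Counter(s.split())
              for s in ['easy', 'very informative', 'useless stuff', 'hard']]
toy_words = ['easy', 'very', 'informative', 'useless', 'stuff', 'hard']
X_toy = build_count_matrix(toy_counts, toy_words)
print(X_toy)
```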

### Training the models
We will use scikit-learn to train both a perceptron and a logistic regression classifier, and compare their performance on a held-out test set.
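
As a reminder of what the two models optimise, here is a brief sketch that assumes labels $y \in \{-1, +1\}$, a weight vector $w$, and no bias term. The perceptron changes its weights only when it makes a mistake on an example $(x, y)$:

$$
\text{if } y \, \big(w \cdot \phi(x)\big) \le 0: \qquad w \leftarrow w + y \, \phi(x)
$$

Logistic regression instead minimises the logistic loss summed over all training examples $(x_i, y_i)$:

$$
\min_w \; \sum_{i} \log\Big(1 + \exp\big(-y_i \, w \cdot \phi(x_i)\big)\Big)
$$

The scikit-learn implementations add further details, such as an intercept and, for `LogisticRegression`, an L2 regularisation term by default.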

In [None]:
from sklearn.model_selection import train_test_split  # to split the data into training and test sets
from sklearn.linear_model import Perceptron, LogisticRegression
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)  # 80% for training, 20% for testing

perceptron_clf = Perceptron()
lr_clf = LogisticRegression(random_state=0, solver='lbfgs')

models = [perceptron_clf, lr_clf]
model_labels = ['Perceptron', 'Logistic Regression']
model_train_score = np.zeros(len(models))
model_test_score = np.zeros(len(models))

for index, model in enumerate(models):
    model.fit(X_train, y_train)
    model_train_score[index] = model.score(X_train, y_train)
    model_test_score[index] = model.score(X_test, y_test)

In [None]:
for index, model in enumerate(models):
    print(model_labels[index], 'Training accuracy:', model_train_score[index])
    print(model_labels[index], 'Test accuracy:', model_test_score[index])