# Sentiment and Classification, Part 2 

We will now classify NLTK's movie review dataset with machine learning. Specifically, we'll use logistic regression, representing the corpus of reviews first as a bag of words (BoW) with a few different options and then with TF-IDF. At the end, we'll also try NLTK's Naive Bayes classifier.

In [None]:
import pandas as pd

import nltk
nltk.download('movie_reviews')
from nltk.corpus import movie_reviews

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [None]:
reviews = movie_reviews.fileids()

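# One row per review: its sentiment label ('pos' or 'neg') and its raw text.
# Newlines become spaces so that words on adjacent lines are never joined together.
df = pd.DataFrame({'review_sentiment': [movie_reviews.categories(review)[0]
                                        for review in reviews],
                   'review_text': [movie_reviews.raw(review).replace('\n', ' ')
                                   for review in reviews]})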

df.head()

What happens if we completely ignore the sentiment connotations of individual words? Does it make sense to set meaning aside and look only at the statistical occurrence of words across a given set of texts?

# BoW

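Make a document-term matrix: one row per review, one column per word, and each cell holding the number of times that word occurs in that review. The vocabulary is restricted by three settings: `min_df=5` drops words that appear in fewer than 5 reviews, `max_df=0.7` drops words that appear in more than 70% of reviews, and `max_features=1000` keeps only the 1000 most frequent of the remaining words.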
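
As a toy illustration of what such a matrix looks like, here is a minimal pure-Python sketch. The two example texts are made up, and the whitespace tokenization is a simplification of what `CountVectorizer` actually does.

```python
from collections import Counter

# Two toy "reviews" (hypothetical text, not taken from the corpus).
docs = ["great plot great cast", "dull plot"]

# Tokenize by whitespace (a simplification of CountVectorizer's tokenizer).
tokenized = [doc.lower().split() for doc in docs]

# The vocabulary is the sorted set of all words; each word becomes a column.
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# One row per document, one count per vocabulary word.
matrix = [[Counter(tokens)[word] for word in vocabulary] for tokens in tokenized]

print(vocabulary)
for row in matrix:
    print(row)
```

Each row is the feature vector of one text; word order within the text is discarded.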

In [None]:
vectorizer = CountVectorizer(lowercase=True, 
                             stop_words='english', 
                             max_features=1000, 
                             min_df=5, 
                             max_df=0.7)

bag_of_words = vectorizer.fit_transform(df['review_text'])

bag_of_words_df = pd.DataFrame(bag_of_words.toarray(), 
                               columns=vectorizer.get_feature_names_out())

bag_of_words_df.head()

In [None]:
non_zero_values = bag_of_words_df.loc[0][bag_of_words_df.loc[0] != 0]
print(non_zero_values)

In [None]:
# Names of the words that occur at least once in the first review
nonzero_columns = bag_of_words_df.columns[bag_of_words_df.loc[0] != 0]
print(nonzero_columns)

In [None]:
bag_of_words_df.shape

In [None]:
x_train, x_test, y_train, y_test = train_test_split(bag_of_words_df, 
                                                    df['review_sentiment'], 
                                                    test_size=0.2, 
                                                    random_state=42, 
                                                    stratify=df['review_sentiment'])

In [None]:
model = LogisticRegression()
model.fit(x_train, y_train)
model.score(x_test, y_test)

In [None]:
pos_records = (y_test == 'pos')
model.score(x_test[pos_records], y_test[pos_records])

In [None]:
neg_records = (y_test == 'neg')
model.score(x_test[neg_records], y_test[neg_records])
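
Scoring the model on only the positive (or only the negative) test records gives the fraction of that class that was labeled correctly, which is the per-class recall. The following pure-Python sketch uses made-up labels and predictions to show how an overall accuracy can hide a difference between the two classes.

```python
# Hypothetical ground-truth labels and predictions, for illustration only.
y_true = ['pos', 'pos', 'pos', 'neg', 'neg', 'neg', 'neg', 'pos']
y_pred = ['pos', 'neg', 'neg', 'neg', 'neg', 'neg', 'neg', 'pos']

def accuracy_on_class(label):
    """Accuracy restricted to the records whose true label is `label`."""
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t == label]
    return sum(t == p for t, p in pairs) / len(pairs)

overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
pos_recall = accuracy_on_class('pos')
neg_recall = accuracy_on_class('neg')

print(overall, pos_recall, neg_recall)
```

Here the overall accuracy sits between a perfect score on the negatives and a much weaker one on the positives.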

In [None]:
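# Positional indexing: after the shuffled split, row label 0 is not guaranteed to be in the training set.
x_train.iloc[[0]]

In [None]:
model.predict(x_train.iloc[[0]])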

In [None]:
# Predict all reviews in one call (this covers both the training and the test rows).
df['logregSentiment'] = model.predict(bag_of_words_df)

In [None]:
df.loc[df['review_sentiment']=='pos', 'logregSentiment'].value_counts()

In [None]:
df.loc[df['review_sentiment']=='neg', 'logregSentiment'].value_counts()

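We stratified on the target values above, which keeps the class proportions the same in the training and test sets. `train_test_split` already shuffles by default, so dropping `stratify` leaves a purely shuffled split. Because the corpus is balanced (1000 positive and 1000 negative reviews), this gives similar results: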

In [None]:
x_train, x_test, y_train, y_test = train_test_split(bag_of_words_df, 
                                                    df['review_sentiment'], 
                                                    test_size=0.2, 
                                                    random_state=42,
                                                    shuffle=True)

In [None]:
model = LogisticRegression()
model.fit(x_train, y_train)
model.score(x_test, y_test)

In [None]:
y_train.value_counts()
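
The difference between a shuffled split and a stratified one is easier to see on imbalanced labels. This is a minimal sketch with made-up labels (80 negative, 20 positive) rather than the review data.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 'neg' and 20 'pos'.
labels = ['neg'] * 80 + ['pos'] * 20
rows = list(range(len(labels)))  # stand-in for the feature rows

# Shuffled only: the class proportions in each part are left to chance.
_, _, _, y_test_shuffled = train_test_split(
    rows, labels, test_size=0.2, random_state=42)

# Shuffled and stratified: each part keeps the 80/20 class ratio.
_, _, _, y_test_stratified = train_test_split(
    rows, labels, test_size=0.2, random_state=42, stratify=labels)

print(Counter(y_test_shuffled))
print(Counter(y_test_stratified))
```

With stratification the 20 test rows always contain 16 negatives and 4 positives; without it the counts depend on the random seed.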

# BoW even simpler

With `binary=True`, we record only whether a word occurs in the review (1) or not (0), rather than how many times it occurs.
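
A small sketch of the effect on two made-up texts, comparing the default counts with the binary version:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy texts (hypothetical); columns come out in alphabetical order: bad, good, movie.
docs = ["good good good movie", "bad movie"]

# Default: each cell is the number of occurrences.
counts = CountVectorizer().fit_transform(docs).toarray()

# binary=True: each cell is 1 if the word occurs at all, else 0.
presence = CountVectorizer(binary=True).fit_transform(docs).toarray()

print(counts)
print(presence)
```

The repeated "good" contributes a 3 to the count matrix but only a 1 to the binary matrix.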

In [None]:
vectorizer = CountVectorizer(lowercase=True, 
                             stop_words='english', 
                             max_features=1000, 
                             min_df=5, 
                             max_df=0.7,
                             binary=True)

bag_of_words = vectorizer.fit_transform(df['review_text'])

bag_of_words_df = pd.DataFrame(bag_of_words.toarray(), 
                               columns=vectorizer.get_feature_names_out())

bag_of_words_df.head()

In [None]:
non_zero_values = bag_of_words_df.loc[0][bag_of_words_df.loc[0] != 0]
print(non_zero_values)

In [None]:
bag_of_words_df.shape

In [None]:
x_train, x_test, y_train, y_test = train_test_split(bag_of_words_df, 
                                                    df['review_sentiment'], 
                                                    test_size=0.2, 
                                                    random_state=42, 
                                                    stratify=df['review_sentiment'])

In [None]:
model = LogisticRegression()
model.fit(x_train, y_train)
model.score(x_test, y_test)

# BoW with n-grams

`ngram_range=(1, 2)` keeps both single words (1-grams) and pairs of adjacent words (2-grams), which preserves a little local context.

In [None]:
vectorizer = CountVectorizer(lowercase=True, 
                             stop_words='english', 
                             max_features=1000, 
                             min_df=5, 
                             max_df=0.7,
                             binary=True,
                             ngram_range=(1, 2))

bag_of_words = vectorizer.fit_transform(df['review_text'])

bag_of_words_df = pd.DataFrame(bag_of_words.toarray(), 
                               columns=vectorizer.get_feature_names_out())

bag_of_words_df.head()

In [None]:
non_zero_values = bag_of_words_df.loc[0][bag_of_words_df.loc[0] != 0]
print(non_zero_values)

In [None]:
bag_of_words_df.shape

In [None]:
x_train, x_test, y_train, y_test = train_test_split(bag_of_words_df, 
                                                    df['review_sentiment'], 
                                                    test_size=0.2, 
                                                    random_state=42, 
                                                    stratify=df['review_sentiment'])

In [None]:
model = LogisticRegression()
model.fit(x_train, y_train)
model.score(x_test, y_test)

# TF-IDF

Term frequency-inverse document frequency (TF-IDF) starts from the word counts but reweights them: words that occur in many reviews are downweighted, while words that distinguish the vocabulary of a text (or subset of texts) from the rest of the corpus are upweighted.
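
To make the weighting concrete, here is a pure-Python sketch on three made-up, pre-tokenized texts. It follows the smoothed formula that `TfidfVectorizer` uses by default, idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization of each row; it is an illustration, not scikit-learn's implementation.

```python
import math
from collections import Counter

# Toy pre-tokenized texts (hypothetical).
docs = [["great", "film", "great", "cast"],
        ["dull", "film"],
        ["great", "film"]]

n_docs = len(docs)
vocabulary = sorted({word for doc in docs for word in doc})

# Document frequency: in how many texts does each word occur?
doc_freq = {word: sum(word in doc for doc in docs) for word in vocabulary}

# Smoothed inverse document frequency.
idf = {word: math.log((1 + n_docs) / (1 + doc_freq[word])) + 1
       for word in vocabulary}

def tfidf(doc):
    """Return the L2-normalized TF-IDF vector of one tokenized text."""
    counts = Counter(doc)
    raw = [counts[word] * idf[word] for word in vocabulary]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]

weights = dict(zip(vocabulary, tfidf(docs[1])))
print(weights)
```

In the second text, "film" and "dull" each occur once, but "dull" gets the larger weight because "film" appears in every text.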

In [None]:
vectorizer = TfidfVectorizer(lowercase=True, 
                             stop_words='english', 
                             max_features=1000, 
                             min_df=5, 
                             max_df=0.5)

bag_of_words = vectorizer.fit_transform(df['review_text'])

bag_of_words_df = pd.DataFrame(bag_of_words.toarray(), 
                               columns=vectorizer.get_feature_names_out())

bag_of_words_df.head()

In [None]:
non_zero_values = bag_of_words_df.loc[0][bag_of_words_df.loc[0] != 0]
print(non_zero_values)

In [None]:
bag_of_words_df.shape

In [None]:
x_train, x_test, y_train, y_test = train_test_split(bag_of_words_df, 
                                                    df['review_sentiment'], 
                                                    test_size=0.2, 
                                                    random_state=42, 
                                                    stratify=df['review_sentiment'])

In [None]:
model = LogisticRegression()
model.fit(x_train, y_train)
model.score(x_test, y_test)

# Naive Bayes

In [None]:
df

We'll use the binary BoW from earlier, which records only whether each word occurs in a review.

In [None]:
vectorizer = CountVectorizer(lowercase=True, 
                             stop_words='english', 
                             max_features=1000, 
                             min_df=5, 
                             max_df=0.7,
                             binary=True)

bag_of_words = vectorizer.fit_transform(df['review_text'])

bag_of_words_df = pd.DataFrame(bag_of_words.toarray(), 
                               columns=vectorizer.get_feature_names_out())

bag_of_words_df.head()

In [None]:
bag_of_words_df.shape

In [None]:
x_train, x_test, y_train, y_test = train_test_split(bag_of_words_df, 
                                                    df['review_sentiment'], 
                                                    test_size=0.2, 
                                                    random_state=42, 
                                                    stratify=df['review_sentiment'])

NLTK has a class `NaiveBayesClassifier`. Rather than calling `fit` as we are used to from scikit-learn, here we call the `train` class method. Furthermore, the data passed to `train` contains both the independent variables (the review's feature vector, as a dictionary) and the dependent variable (the sentiment class), paired up per review.

In [None]:
x_train.shape

In [None]:
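# Feature dictionary of the first training row (positional, since label 0 may be in the test set)
x_train.iloc[0].to_dict()

In [None]:
# Sentiment label of that same row
y_train.iloc[0]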

In [None]:
# Pair each review's feature dictionary with its label, as expected by NLTK's `train`.
features = []
for i in range(x_train.shape[0]):
    features.append((x_train.iloc[i].to_dict(), y_train.iloc[i]))

In [None]:
trainedClassifier = nltk.NaiveBayesClassifier.train(features)
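
NLTK handles the estimation internally. As a rough picture of what a Naive Bayes classifier computes from such `(feature dictionary, label)` pairs, here is a pure-Python sketch on made-up data. The add-one smoothing is an assumption chosen for simplicity, and NLTK's own probability estimates may differ.

```python
import math
from collections import Counter, defaultdict

# Hypothetical training data in the (feature dict, label) shape.
train = [({'great': 1, 'dull': 0}, 'pos'),
         ({'great': 1, 'dull': 0}, 'pos'),
         ({'great': 0, 'dull': 1}, 'neg'),
         ({'great': 0, 'dull': 1}, 'neg')]

# How often each label occurs.
label_counts = Counter(label for _, label in train)

# value_counts[label][feature][value] = number of training rows with that value.
value_counts = defaultdict(lambda: defaultdict(Counter))
for features, label in train:
    for name, value in features.items():
        value_counts[label][name][value] += 1

def classify(features):
    """Return the label with the highest log-probability score."""
    scores = {}
    for label, n in label_counts.items():
        score = math.log(n / len(train))  # log prior of the label
        for name, value in features.items():
            # Add-one smoothing over the two possible values (0 and 1).
            seen = value_counts[label][name][value]
            score += math.log((seen + 1) / (n + 2))
        scores[label] = score
    return max(scores, key=scores.get)

prediction = classify({'great': 1, 'dull': 0})
print(prediction)
```

The "naive" part is that each feature contributes to the score independently, given the label.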

Now that we have trained our classifier, we can use it to predict the sentiment label of any review.

To make a prediction, we need to convert the review into a feature vector and then pass that feature vector into our trained classifier to get the prediction.

The following function carries out those two steps for a review that is already a row of the BoW matrix:

In [None]:
def naiveBayesSentimentCalculator(review):
    problemFeatureVector = review.to_dict()
    return trainedClassifier.classify(problemFeatureVector)
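
For a review that is not yet a row of the matrix, the raw text first has to be mapped onto the same vocabulary that was used for training. In this notebook that would presumably go through `vectorizer.transform`; the sketch below shows the idea in pure Python, with a hypothetical four-word vocabulary and whitespace tokenization as simplifying assumptions.

```python
# Hypothetical vocabulary standing in for the fitted vectorizer's feature names.
vocabulary = ['bad', 'boring', 'great', 'plot']

def text_to_features(text):
    """Map raw text to a binary feature dictionary over the fixed vocabulary."""
    tokens = set(text.lower().split())
    return {word: int(word in tokens) for word in vocabulary}

features = text_to_features("Great plot and a great cast")
print(features)
```

Words outside the training vocabulary ("cast" here) are simply ignored, so the dictionary always has the same keys the classifier was trained on.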

Here is a test example:

In [None]:
x_test.iloc[0]

In [None]:
naiveBayesSentimentCalculator(x_test.iloc[0])

To quantify how our classifier performs, we now pass in the test data to produce predicted sentiment labels, which we can compare against the test data's ground-truth labels.

In [None]:
[naiveBayesSentimentCalculator(review) for _, review in x_test.iterrows()]

In [None]:
accuracy_score(y_test, [naiveBayesSentimentCalculator(review) for _, review in x_test.iterrows()])