------
**You cannot save any changes you make to this file, so please make sure to save it on your Google Colab drive or download it as a .ipynb file.**

------

 

Practical 1: Sentiment Detection in Movie Reviews
========================================



This practical concerns detecting sentiment in movie reviews. This is a typical NLP classification task.
In [this file](https://gist.githubusercontent.com/bastings/d47423301cca214e3930061a5a75e177/raw/5113687382919e22b1f09ce71a8fecd1687a5760/reviews.json) (80MB) you will find 1000 positive and 1000 negative **movie reviews**.
Each review is a **document** and consists of one or more sentences.

To prepare yourself for this practical, you should
have a look at a few of these texts to understand the difficulties of
the task: how might one go about classifying the texts? You will write
code that decides whether a movie review conveys positive or
negative sentiment.

Please make sure you have read the following paper:

>   Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan
(2002). 
[Thumbs up? Sentiment Classification using Machine Learning
Techniques](https://dl.acm.org/citation.cfm?id=1118704). EMNLP.

Bo Pang et al. introduced the movie review sentiment
classification task, and the above paper was one of the first papers on
the topic. The first version of your sentiment classifier will do
something similar to Pang et al.'s system. If you have questions about it,
you should resolve you doubts as soon as possible with your TA.


In [1]:
import math
import os
import sys
from subprocess import call
from nltk import FreqDist
from nltk.util import ngrams
from nltk.stem.porter import PorterStemmer
import sklearn as sk
# from google.colab import drive
import pickle
import json
from collections import Counter
import requests
import matplotlib.pyplot as plt
import numpy as np

In [2]:
# file structure:
# [
#  {"cv": integer, "sentiment": str, "content": list} 
#  {"cv": integer, "sentiment": str, "content": list} 
#   ..
# ]
# where `content` is a list of sentences, 
# with a sentence being a list of (token, pos_tag) pairs.


with open("reviews.json", mode="r", encoding="utf-8") as f:
    reviews = json.load(f)

print("Total number of reviews:", len(reviews), '\n')

def print_sentence_with_pos(s):
    print(" ".join("%s/%s" % (token, pos_tag) for token, pos_tag in s))

for i, r in enumerate(reviews):
    print(r["cv"], r["sentiment"], len(r["content"]))  # cv, sentiment, num sents
    print_sentence_with_pos(r["content"][0])
    if i == 4: 
        break
    
c = Counter()
for review in reviews:
    for sentence in review["content"]:
        for token, pos_tag in sentence:
            c[token.lower()] += 1

print("\nNumber of word types:", len(c))
print("Number of word tokens:", sum(c.values()))

print("\nMost common tokens:")
for token, count in c.most_common(20):
      print("%10s : %8d" % (token, count))

Total number of reviews: 2000 

0 NEG 29
Two/CD teen/JJ couples/NNS go/VBP to/TO a/DT church/NN party/NN ,/, drink/NN and/CC then/RB drive/NN ./.
1 NEG 11
Damn/JJ that/IN Y2K/CD bug/NN ./.
2 NEG 24
It/PRP is/VBZ movies/NNS like/IN these/DT that/WDT make/VBP a/DT jaded/JJ movie/NN viewer/NN thankful/JJ for/IN the/DT invention/NN of/IN the/DT Timex/NNP IndiGlo/NNP watch/NN ./.
3 NEG 19
QUEST/NN FOR/IN CAMELOT/NNP ``/`` Quest/NNP for/IN Camelot/NNP ''/'' is/VBZ Warner/NNP Bros./NNP '/POS first/JJ feature-length/JJ ,/, fully-animated/JJ attempt/NN to/TO steal/VB clout/NN from/IN Disney/NNP 's/POS cartoon/NN empire/NN ,/, but/CC the/DT mouse/NN has/VBZ no/DT reason/NN to/TO be/VB worried/VBN ./.
4 NEG 38
Synopsis/NNPS :/: A/DT mentally/RB unstable/JJ man/NN undergoing/VBG psychotherapy/NN saves/VBZ a/DT boy/NN from/IN a/DT potentially/RB fatal/JJ accident/NN and/CC then/RB falls/VBZ in/IN love/NN with/IN the/DT boy/NN 's/POS mother/NN ,/, a/DT fledgling/NN restauranteur/NN ./.

Number of wo

In [3]:
with open("sent_lexicon", mode="r", encoding="utf-8") as f:
    line_cnt = 0
    for line in f:
        print(line.strip())
        line_cnt += 1
        if line_cnt > 4:
            break

type=weaksubj len=1 word1=abandoned pos1=adj stemmed1=n priorpolarity=negative
type=weaksubj len=1 word1=abandonment pos1=noun stemmed1=n priorpolarity=negative
type=weaksubj len=1 word1=abandon pos1=verb stemmed1=y priorpolarity=negative
type=strongsubj len=1 word1=abase pos1=verb stemmed1=y priorpolarity=negative
type=strongsubj len=1 word1=abasement pos1=anypos stemmed1=y priorpolarity=negative


## Features, overfitting, and the curse of dimensionality

In the Bag-of-Words model, ideally we would like each distinct word in
the text to be mapped to its own dimension in the output vector
representation. However, real world text is messy, and we need to decide
on what we consider to be a word. For example, is “`word`" different
from “`Word`", from “`word`”, or from “`words`"? Too strict a
definition, and the number of features explodes, while our algorithm
fails to learn anything generalisable. Too lax, and we risk destroying
our learning signal. In the following section, you will learn about
confronting the feature sparsity and the overfitting problems as they
occur in NLP classification tasks.

### Stemming (1.5pts)

To make your algorithm more robust, use stemming and hash different inflections of a word to the same feature in the BoW vector space. Please use the [Porter stemming
    algorithm](http://www.nltk.org/howto/stem.html) from NLTK.



In [4]:
stemmer = PorterStemmer()

# example usage
list(map(stemmer.stem, ["words", "Word", "word"]))

['word', 'word', 'word']

### N-grams (1.5pts)

A simple way of retaining some of the word
order information when using bag-of-words representations is to use **n-gram** features. 






#### (Q2.9) Retrain your classifier from (Q2.4) using **unigrams+bigrams** and **unigrams+bigrams+trigrams** as features. (1pt)
Report accuracy and compare it with that of the approaches you have previously implemented. You are allowed to use NLTK to build n-grams from sentences.

In [5]:
from nltk import word_tokenize
from nltk.util import ngrams

from tqdm import tqdm
import itertools

In [47]:
class BagOfWords:
    """BoW-based feature encoder."""
    def __init__(self, n_grams=[1], stemmer=None, verbose=True, classes=["POS", "NEG"]):
        self.n_grams = n_grams
        self.stemmer = stemmer
        self.verbose = verbose
        self.classes = classes
    
    def get_sentence_ngrams(self, sent: list):
        joined_ngrams = [list(ngrams(sent, k)) for k in self.n_grams]
        return list(itertools.chain.from_iterable(joined_ngrams))
    
    def get_ngram_freq_dict(self, document: list, use_pos=False):
        document = np.concatenate(document)

        # get all words in the document (in lowercase)
        words = list(np.char.lower(document[:, 0]))
        pos_tags = document[:, 1]

        # apply stemming to the words
        if self.stemmer is not None:
            words = list(map(self.stemmer.stem, words))
        
        # apply pos tags as suffixes to words
        if use_pos:
            # document = np.concatenate(document)
            # document[:, 0] = np.char.lower(document[:, 0])
            words = np.char.add(np.char.add(np.array(words), "_"), pos_tags)
            words = list(words)

        # get collection of n-grams, for different n
        n_grams = self.get_sentence_ngrams(words)
        n_grams = np.array(n_grams, dtype="object")

        # get counts of n-grams
        n_grams_unique, n_grams_counts = np.unique(n_grams, return_counts=True)
        n_grams_freq_dict = dict(zip(n_grams_unique, n_grams_counts))
        return n_grams_freq_dict
    
    def create_vocabulary(self, documents: list, labels: list, use_pos=False):
        """Create a combined vocabulary based on given n-grams, \forall n."""
        num_docs = len(documents)
        
        iterator = tqdm(
            range(num_docs),
            desc="Creating vocabulary",
            bar_format='{l_bar}{bar:20}{r_bar}{bar:-20b}',
        )
        if not self.verbose:
            iterator = range(num_docs)
        
        vocab = Counter()
        class_wise_vocab = {c: Counter() for c in self.classes}
        class_priors = {c: 0.0 for c in self.classes}
        for i in iterator:
            d, l = documents[i], labels[i]

            # update class counts
            class_priors[l] += (1.0 / num_docs)
            
            # update vocab
            n_grams_freq_dict = self.get_ngram_freq_dict(d, use_pos=use_pos)
            vocab.update(n_grams_freq_dict)
            class_wise_vocab[l].update(n_grams_freq_dict)
            
        self.vocab = vocab
        self.vocab_terms = np.array(list(vocab.keys()))
        self.class_wise_vocab = class_wise_vocab
        self.class_priors = class_priors
        self.vocab_size = len(self.vocab)

    
    def filter_vocab(self):
        """Removes words that occur only in one class but not others."""
        vocab_common = set.intersection(*[set(self.class_wise_vocab[c].keys()) for c in self.classes])
        self.vocab_size = len(vocab_common)
        for c in self.classes:
            vocab_class = {w:v for w, v in self.class_wise_vocab[c].items() if w in vocab_common}
            self.class_wise_vocab[c] = vocab_class

    def encode(self, document: list):
        """Encodes a document (list of sentences) into a BoW representation."""

        assert hasattr(self, "vocab"), "Vocabulary has not been created!"

        n_grams_freq_dict = self.get_ngram_freq_dict(document)
        n_grams_freq_dict = {k:v for k, v in n_grams_freq_dict.items() if k in self.vocab_terms}
        
        
        words_in_doc = np.array(list(n_grams_freq_dict.keys()))
        word_freqs_in_doc = np.array(list(n_grams_freq_dict.values()))
        
        bow_vector = np.zeros(len(self.vocab))
        indices = np.in1d(self.vocab_terms, words_in_doc)
        # import ipdb; ipdb.set_trace()
        bow_vector[np.where(indices == True)] = word_freqs_in_doc
        

#         bow_vector = np.zeros(len(self.vocab))
#         for key, count in n_grams_freq_dict.items():
#             bow_vector[self.vocab_terms.index(key)] = count

        return bow_vector

In [64]:
class NaiveBayesClassifier:
    """Implements the NBClassifier."""
    def __init__(self, classes, n_grams=[1], stemmer=None, smoothing_kappa=0.0, filter_vocab=False, use_pos=False):
        self.bow = BagOfWords(n_grams=n_grams, classes=classes, stemmer=stemmer)
        self.smoothing_kappa = smoothing_kappa
        self.classes = classes
        self.filter_vocab = filter_vocab
        self.use_pos = use_pos
    
    def train(self, documents: list, labels: list):
        assert len(documents) == len(labels)
        assert set(np.unique(labels)) == set(self.classes)

        self.bow.create_vocabulary(documents, labels, use_pos=self.use_pos)
        print(f"Training finished with vocabulary of size {len(self.bow.vocab)}.")

        if self.filter_vocab:
            self.bow.filter_vocab()
            print(f"Filtered vocabulary. Size of new vocabulary: {self.bow.vocab_size}")

    def check_word_in_vocab(self, word):
        for c in self.classes:
            if word not in self.bow.class_wise_vocab[c]:
                return False
        return True

    def predict(self, documents: list):
        """Predicts class label for each of the given documents."""
        num_docs = len(documents)
        
        class_wise_sum = {c: sum(list(self.bow.class_wise_vocab[c].values())) for c in self.classes}
        
        iterator = tqdm(
            range(num_docs),
            desc="Evaluating",
            bar_format='{l_bar}{bar:20}{r_bar}{bar:-20b}',
        )
        predictions = []
        for i in iterator:
            d = documents[i]
            ngram_frequency_dict = self.bow.get_ngram_freq_dict(d, use_pos=self.use_pos)


            score = {k: np.log(self.bow.class_priors[k]) for k in self.classes}
            for word in ngram_frequency_dict:
                if self.check_word_in_vocab(word):
                    for c in self.classes:
                        count_word_in_c = self.bow.class_wise_vocab[c][word]
                        count_all_words_in_c = class_wise_sum[c]
                        nume = count_word_in_c + self.smoothing_kappa
                        deno = count_all_words_in_c + self.smoothing_kappa * len(self.bow.vocab)
                        score[c] += np.log(nume/deno)
        
            predicted_class = max(score, key=score.get)
            predictions.append(predicted_class)
        
        return predictions


In [85]:
def filter_based_on_pos_tags(documents, valid_tags=["NN", "VB", "JJ", "RB"]):
    filtered_docs = []
    for d in documents:
        # import ipdb; ipdb.set_trace()
        # sent_lens = [len(x) for x in d]
        # sent_lens = np.cumsum(sent_len)[:-1]
        
        doc = np.concatenate(d)
        idx = np.in1d(doc[:, 1], np.array(valid_tags))
        doc = doc[idx]
        filtered_docs.append([doc.tolist()])
    
    return filtered_docs

In [93]:
def compute_accuracy(y_true: list, y_pred: list):
    y_true = np.array(y_true)
    y_pred = np.array(y_pred)
    return np.mean((y_true == y_pred).astype(int))


def train_and_evaluate_clf(clf, train_idx, test_idx, filter_on_tags=False):
    train_documents = [reviews[i]["content"] for i in train_idx]
    train_labels = [reviews[i]["sentiment"] for i in train_idx]

    if filter_on_tags:
        # import ipdb; ipdb.set_trace()
        train_documents = filter_based_on_pos_tags(train_documents)

    test_documents = [reviews[i]["content"] for i in test_idx]
    test_labels = [reviews[i]["sentiment"] for i in test_idx]

    if filter_on_tags:
        test_documents = filter_based_on_pos_tags(test_documents)

    # train the classifier
    clf.train(train_documents, train_labels)

    # compute train accuracy
    acc = compute_accuracy(train_labels, clf.predict(train_documents))
    print(f"Obtained accuracy using NaiveBayesClassifier on train set: {acc:.3f}")

    # compute test accuracy
    acc = compute_accuracy(test_labels, clf.predict(test_documents))
    print(f"Obtained accuracy using NaiveBayesClassifier on test set: {acc:.3f}")
    
    return acc

In [9]:
# split into training and testing data
cv = np.array([reviews[i]["cv"] for i in range(len(reviews))])
labels = np.array([reviews[i]["sentiment"] for i in range(len(reviews))])

train_idx = np.where(cv < 900)[0]
test_idx = np.where(cv >= 900)[0]

clf = NaiveBayesClassifier(classes=["POS", "NEG"], filter_vocab=True)
train_and_evaluate_clf(clf, train_idx, test_idx)

Creating vocabulary: 100%|████████████████████| 1800/1800 [00:02<00:00, 759.52it/s]                                                                      


Training finished with vocabulary of size 45348.
Filtered vocabulary. Size of new vocabulary: 18799


Evaluating: 100%|████████████████████| 1800/1800 [00:05<00:00, 334.28it/s]                                                                               


Obtained accuracy using NaiveBayesClassifier on train set: 0.950


Evaluating: 100%|████████████████████| 200/200 [00:00<00:00, 331.12it/s]                                                                                 

Obtained accuracy using NaiveBayesClassifier on test set: 0.875





0.875

In [10]:
cond = ((cv < 90) * (labels == "NEG")) + ((labels == "POS") * (cv < 900))
train_idx = [i for i, value in enumerate(cond) if value]

cond = ((cv >= 900) * (cv <= 909) * (labels == "NEG")) + ((labels == "POS") * (cv >= 900))
test_idx = [i for i, value in enumerate(cond) if value]

clf = NaiveBayesClassifier(classes=["POS", "NEG"], filter_vocab=True)
train_and_evaluate_clf(clf, train_idx, test_idx)

Creating vocabulary: 100%|████████████████████| 990/990 [00:01<00:00, 742.09it/s]                                                                        


Training finished with vocabulary of size 34662.
Filtered vocabulary. Size of new vocabulary: 7351


Evaluating: 100%|████████████████████| 990/990 [00:02<00:00, 351.76it/s]                                                                                 


Obtained accuracy using NaiveBayesClassifier on train set: 0.973


Evaluating: 100%|████████████████████| 110/110 [00:00<00:00, 356.83it/s]                                                                                 

Obtained accuracy using NaiveBayesClassifier on test set: 0.918





0.9181818181818182

In [11]:
# split into training and testing data
train_idx = np.where(cv < 900)[0]
test_idx = np.where(cv >= 900)[0]

clf = NaiveBayesClassifier(classes=["POS", "NEG"], smoothing_kappa=1.0)
train_and_evaluate_clf(clf, train_idx, test_idx)

Creating vocabulary: 100%|████████████████████| 1800/1800 [00:02<00:00, 784.51it/s]                                                                      


Training finished with vocabulary of size 45348.


Evaluating: 100%|████████████████████| 1800/1800 [00:05<00:00, 328.56it/s]                                                                               


Obtained accuracy using NaiveBayesClassifier on train set: 0.947


Evaluating: 100%|████████████████████| 200/200 [00:00<00:00, 314.18it/s]                                                                                 

Obtained accuracy using NaiveBayesClassifier on test set: 0.865





0.865

In [12]:
# YOUR CODE HERE

train_idx = np.where(cv < 900)[0]
test_idx = np.where(cv >= 900)[0]

# without stemming
clf = NaiveBayesClassifier(classes=["POS", "NEG"], stemmer=None, smoothing_kappa=1.0)
train_and_evaluate_clf(clf, train_idx, test_idx)
vocab_size_wout_stem = clf.bow.vocab_size

# with stemming
stemmer = PorterStemmer()
clf = NaiveBayesClassifier(classes=["POS", "NEG"], stemmer=stemmer, smoothing_kappa=1.0)
train_and_evaluate_clf(clf, train_idx, test_idx)
vocab_size_with_stem = clf.bow.vocab_size

print(":::: Vocabulary size ::::")
print(f":: Without stemming:\t {vocab_size_wout_stem}")
print(f":: With stemming:\t {vocab_size_with_stem}")

Creating vocabulary: 100%|████████████████████| 1800/1800 [00:02<00:00, 767.34it/s]                                                                      


Training finished with vocabulary of size 45348.


Evaluating: 100%|████████████████████| 1800/1800 [00:05<00:00, 328.87it/s]                                                                               


Obtained accuracy using NaiveBayesClassifier on train set: 0.947


Evaluating: 100%|████████████████████| 200/200 [00:00<00:00, 329.05it/s]                                                                                 


Obtained accuracy using NaiveBayesClassifier on test set: 0.865


Creating vocabulary: 100%|████████████████████| 1800/1800 [00:11<00:00, 152.27it/s]                                                                      


Training finished with vocabulary of size 32404.


Evaluating: 100%|████████████████████| 1800/1800 [00:15<00:00, 116.40it/s]                                                                               


Obtained accuracy using NaiveBayesClassifier on train set: 0.922


Evaluating: 100%|████████████████████| 200/200 [00:01<00:00, 114.12it/s]                                                                                 

Obtained accuracy using NaiveBayesClassifier on test set: 0.860
:::: Vocabulary size ::::
:: Without stemming:	 45348
:: With stemming:	 32404





In [None]:
# split into training and testing data
train_idx = np.where(cv < 900)[0]
test_idx = np.where(cv >= 900)[0]

clf = NaiveBayesClassifier(classes=["POS", "NEG"], n_grams=[1, 2], filter_vocab=True)
train_and_evaluate_clf(clf, train_idx, test_idx)

In [69]:
clf = NaiveBayesClassifier(classes=["POS", "NEG"], n_grams=[1, 2, 3], filter_vocab=True)
train_and_evaluate_clf(clf, train_idx, test_idx)

Creating vocabulary: 100%|████████████████████| 1800/1800 [00:07<00:00, 228.31it/s]                                                                      


Training finished with vocabulary of size 1416686.
Filtered vocabulary. Size of new vocabulary: 164071


Evaluating: 100%|████████████████████| 1800/1800 [00:16<00:00, 107.72it/s]                                                                               


Obtained accuracy using NaiveBayesClassifier on train set: 0.991


Evaluating: 100%|████████████████████| 200/200 [00:01<00:00, 111.59it/s]                                                                                 

Obtained accuracy using NaiveBayesClassifier on test set: 0.855





0.855

In [48]:
bow = BagOfWords(n_grams=[1], stemmer=stemmer)

In [25]:
bow.get_sentence_ngrams(["I", "am", "jongen"])

[('I',), ('am',), ('jongen',)]

In [26]:
train_documents = [reviews[i]["content"] for i in train_idx]
train_labels = [reviews[i]["sentiment"] for i in train_idx]

test_documents = [reviews[i]["content"] for i in test_idx]
test_labels = [reviews[i]["sentiment"] for i in test_idx]

bow.create_vocabulary(train_documents, train_labels)


Creating vocabulary:   0%|                    | 0/1800 [00:00<?, ?it/s]                                                                                  [A
Creating vocabulary:   1%|▏                   | 12/1800 [00:00<00:15, 117.62it/s]                                                                        [A
Creating vocabulary:   2%|▎                   | 27/1800 [00:00<00:13, 135.41it/s]                                                                        [A
Creating vocabulary:   2%|▍                   | 43/1800 [00:00<00:12, 144.21it/s]                                                                        [A
Creating vocabulary:   3%|▋                   | 62/1800 [00:00<00:11, 157.87it/s]                                                                        [A
Creating vocabulary:   4%|▊                   | 78/1800 [00:00<00:11, 156.11it/s]                                                                        [A
Creating vocabulary:   5%|█                   | 95/1800 [

In [27]:
len(bow.vocab), len(bow.class_wise_vocab["POS"]), len(bow.class_wise_vocab["NEG"])

(32404, 23501, 22229)

In [28]:
list(bow.class_wise_vocab["POS"].items())[100:110]

[('copiou', 2),
 ('copper', 4),
 ('could', 626),
 ('cours', 282),
 ('crack', 34),
 ('crazi', 42),
 ('creat', 308),
 ('creepi', 49),
 ('crime', 116),
 ('cring', 5)]

In [19]:
train_X = [bow.encode(d) for d in tqdm(train_documents, desc="Encoding train documents")]

Encoding train documents: 100%|██████████████████████████████████████████████████████████████████████████████████████| 1800/1800 [00:37<00:00, 48.14it/s]


In [29]:
test_X = [bow.encode(d) for d in tqdm(test_documents, desc="Encoding test documents")]


Encoding test documents:   0%|                                                                                                   | 0/200 [00:00<?, ?it/s][A
Encoding test documents:   0%|▍                                                                                          | 1/200 [00:00<00:22,  8.72it/s][A
Encoding test documents:   2%|█▎                                                                                         | 3/200 [00:00<00:15, 13.04it/s][A
Encoding test documents:   3%|██▋                                                                                        | 6/200 [00:00<00:12, 15.56it/s][A
Encoding test documents:   4%|███▋                                                                                       | 8/200 [00:00<00:13, 14.63it/s][A
Encoding test documents:   5%|████▌                                                                                     | 10/200 [00:00<00:13, 13.91it/s][A
Encoding test documents:   6%|█████▍                     

In [30]:
train_X = np.vstack(train_X)
test_X = np.vstack(test_X)

In [31]:
train_X.shape, test_X.shape

((1800, 32404), (200, 32404))


#### Q2.10: How many features does the BoW model have to take into account now? (0.5pt)
How would you expect the number of features to increase theoretically (e.g., linear, square, cubed, exponential)? How does this number compare, in practice, to the number of features at (Q2.8)?

Use the held-out training set once again for this.


*Write your answer here.*

In [None]:
# YOUR CODE HERE

# Support Vector Machines (4pts)

Though simple to understand, implement, and debug, one
major problem with the Naive Bayes classifier is that its performance
deteriorates (becomes skewed) when it is being used with features which
are not independent (i.e., are correlated). Another popular classifier
that doesn’t scale as well to big data, and is not as simple to debug as
Naive Bayes, but that doesn’t assume feature independence is the Support
Vector Machine (SVM) classifier.

You can find more details about SVMs in Chapter 7 of Bishop: Pattern Recognition and Machine Learning.
Other sources for learning SVM:
* http://web.mit.edu/zoya/www/SVM.pdf
* http://www.cs.columbia.edu/~kathy/cs4701/documents/jason_svm_tutorial.pdf
* https://pythonprogramming.net/support-vector-machine-intro-machine-learning-tutorial/







Use the scikit-learn implementation of 
[SVM](http://scikit-learn.org/stable/modules/svm.html) with the default parameters. (You are not expected to perform any hyperparameter tuning, but feel free to do it if you think it gives you good insights for the discussion in question 5.)



#### (Q3.1): Train SVM and compare to Naive Bayes (2pts)

Train an SVM classifier (sklearn.svm.LinearSVC) using the features collected for Naive Bayes. Compare the
classification performance of the SVM classifier to that of the Naive
Bayes classifier with smoothing.
Use cross-validation to evaluate the performance of the classifiers.



In [32]:
from sklearn.svm import LinearSVC

In [33]:
svc = LinearSVC()

svc.fit(train_X, train_labels)

LinearSVC()

In [35]:
compute_accuracy(test_labels, svc.predict(test_X))

0.77

### POS disambiguation (2pts)

Now add in part-of-speech features. You will find the
movie review dataset has already been POS-tagged for you ([here](https://catalog.ldc.upenn.edu/docs/LDC99T42/tagguid1.pdf) you find the tagset). Try to
replicate the results obtained by Pang et al. (2002).



####(Q3.2) Replace your features with word+POS features, and report performance with the SVM. Use cross-validation to evaluate the classifier and compare the results with (Q3.1). Does part-of-speech information help? Explain why this may be the case. (1pt)


In [11]:
train_idx

array([   0,    1,    2, ..., 1897, 1898, 1899])

In [49]:
train_documents = [reviews[i]["content"] for i in train_idx]
train_labels = [reviews[i]["sentiment"] for i in train_idx]

In [53]:
bow = BagOfWords(n_grams=[1], stemmer=stemmer)

In [54]:
train_documents[0][0]

[['Two', 'CD'],
 ['teen', 'JJ'],
 ['couples', 'NNS'],
 ['go', 'VBP'],
 ['to', 'TO'],
 ['a', 'DT'],
 ['church', 'NN'],
 ['party', 'NN'],
 [',', ','],
 ['drink', 'NN'],
 ['and', 'CC'],
 ['then', 'RB'],
 ['drive', 'NN'],
 ['.', '.']]

In [55]:
bow.get_ngram_freq_dict([train_documents[0][0]], use_pos=True)

{',_,': 1,
 '._.': 1,
 'a_DT': 1,
 'and_CC': 1,
 'church_NN': 1,
 'coupl_NNS': 1,
 'drink_NN': 1,
 'drive_NN': 1,
 'go_VBP': 1,
 'parti_NN': 1,
 'teen_JJ': 1,
 'then_RB': 1,
 'to_TO': 1,
 'two_CD': 1}

In [56]:
bow.get_ngram_freq_dict([train_documents[0][0]], use_pos=False)

{',': 1,
 '.': 1,
 'a': 1,
 'and': 1,
 'church': 1,
 'coupl': 1,
 'drink': 1,
 'drive': 1,
 'go': 1,
 'parti': 1,
 'teen': 1,
 'then': 1,
 'to': 1,
 'two': 1}

In [57]:
bow.create_vocabulary(train_documents, train_labels, use_pos=False)

Creating vocabulary: 100%|████████████████████| 1800/1800 [00:11<00:00, 150.01it/s]                                                                      


In [58]:
list(bow.class_wise_vocab["POS"].keys())[:10]

["'", "''", "'ll", "'re", "'s", ',', '-lrb-', '-rrb-', '.', '102']

In [59]:
bow.create_vocabulary(train_documents, train_labels, use_pos=True)

Creating vocabulary: 100%|████████████████████| 1800/1800 [00:13<00:00, 131.65it/s]                                                                      


In [60]:
list(bow.class_wise_vocab["POS"].keys())[:10]

["''_''",
 "'_''",
 "'_POS",
 "'ll_MD",
 "'re_VBP",
 "'s_POS",
 "'s_VBZ",
 ',_,',
 '-lrb-_-LRB-',
 '-rrb-_-RRB-']

In [65]:
# split into training and testing data
cv = np.array([reviews[i]["cv"] for i in range(len(reviews))])
labels = np.array([reviews[i]["sentiment"] for i in range(len(reviews))])

train_idx = np.where(cv < 900)[0]
test_idx = np.where(cv >= 900)[0]

clf = NaiveBayesClassifier(classes=["POS", "NEG"], filter_vocab=True, use_pos=True)
train_and_evaluate_clf(clf, train_idx, test_idx)

Creating vocabulary: 100%|████████████████████| 1800/1800 [00:04<00:00, 431.75it/s]                                                                      


Training finished with vocabulary of size 54555.
Filtered vocabulary. Size of new vocabulary: 21540


Evaluating: 100%|████████████████████| 1800/1800 [00:07<00:00, 247.83it/s]                                                                               


Obtained accuracy using NaiveBayesClassifier on train set: 0.953


Evaluating: 100%|████████████████████| 200/200 [00:00<00:00, 244.29it/s]                                                                                 

Obtained accuracy using NaiveBayesClassifier on test set: 0.880





0.88

In [None]:
# YOUR CODE HERE

*Write your answer here.*

#### (Q3.3) Discard all closed-class words from your data (keep only nouns, verbs, adjectives, and adverbs), and report performance. Does this help? Use cross-validation to evaluate the classifier and compare the results with (Q3.2). Are closed-class words detrimental to the classifier? Explain why this may be the case. (1pt)

In [87]:
X = filter_based_on_pos_tags(train_documents)

In [91]:
len(X[0][0]), len(X[0])

(221, 1)

In [88]:
len(X)

1800

In [94]:
# split into training and testing data
cv = np.array([reviews[i]["cv"] for i in range(len(reviews))])
labels = np.array([reviews[i]["sentiment"] for i in range(len(reviews))])

train_idx = np.where(cv < 900)[0]
test_idx = np.where(cv >= 900)[0]

clf = NaiveBayesClassifier(classes=["POS", "NEG"], filter_vocab=True, use_pos=True)
train_and_evaluate_clf(clf, train_idx, test_idx, filter_on_tags=True)

Creating vocabulary: 100%|████████████████████| 1800/1800 [00:01<00:00, 1456.10it/s]                                                                     


Training finished with vocabulary of size 26655.
Filtered vocabulary. Size of new vocabulary: 10647


Evaluating: 100%|████████████████████| 1800/1800 [00:02<00:00, 657.24it/s]                                                                               


Obtained accuracy using NaiveBayesClassifier on train set: 0.944


Evaluating: 100%|████████████████████| 200/200 [00:00<00:00, 649.73it/s]                                                                                 

Obtained accuracy using NaiveBayesClassifier on test set: 0.875





0.875

*Write your answer here.*

# (Q4) Discussion (max. 500 words). (5pts)

> Based on your experiments, what are the effective features and techniques in sentiment analysis? What information do different features encode?
Why is this important? What are the limitations of these features and techniques?
 


*Write your answer here in up to 500 words (-0.25pt for >50 extra words, -0.5 points for >100 extra words, ...)*.


# Submission 


In [None]:
# Write your names and student numbers here:
# Student 1 #12345
# Student 2 #12345

**That's it!**

- Check if you answered all questions fully and correctly. 
- Download your completed notebook using `File -> Download .ipynb` 
- Check if your answers are all included in the file you submit.
- Submit your .ipynb file via *Canvas*. One submission per group. 