# Homework 2: Sentiment Classifier

The second programming assignment will familiarize you with the use of machine learning for text
classification, and the use of prebuilt lexicons and bag of words to build custom features.

The primary objective for the assignment is the same as the first assignment: to predict the sentiment of a movie review. We will be providing you with the dataset containing the text of the movie reviews from IMDB, and for each review, you have to predict whether the review is positive or negative.

## Data

The data is the same as in the first homework, and the files follow the same structure.


## Kaggle

As with HW1, you can make at most three submissions each day, so we encourage you to test
your submission files early and observe the performance of your system. By the end of the submission period, you will have to select two submissions, one for default and one for custom (more on this later).


## Source Code

The starter code contains methods for loading the data and lexicons, and for running and evaluating your classifier. It also contains the code to write the submission file from your classifier (for example, ```rf_custom_test.csv```) that you will submit to Kaggle. Your directory structure should look like this.
```
hw2  
│
└───code
│    └───hw2_sentiment_classifier.ipynb
│    └───lexicon_reader.py
└───data
│    └───lexicon
│        │   inqtabs.txt
│        └───SentiWordNet_3.0.0_20130122.txt
│    └───test
│        │   0.txt
│        │   1.txt
│        │   ...   
│        └───24999.txt
│    └───train
│        │   0.txt
│        │   1.txt
│        │   ...   
│        └───24999.txt
│    └───train.csv
└───output
```





## What to submit?

Prepare and submit a single write-up ( **PDF, maximum 3 pages** ) with Python source code (custom_features.py, error_analysis.ipynb, and ml_sentiment.py) compressed in a zip file to Canvas. **Do not include your student ID number** , since we might share it with the class if it’s worth highlighting. The write-up pdf and code zip file should be submitted separately on Canvas. The pdf should include:

### Part 1. Preliminaries, 10 points
Kaggle team name and Kaggle accuracy of your best default and custom models. The team with the best score in the competition earns 10 points, the second team earns 9 points, the third earns 8, and all others earn 7 points. You can make ***at most _three_*** submissions each day, so we encourage you to test your submission files early and observe the performance of your system.

Each submission file should be formatted as follows:

- Start with a single-line header: ```Id, Category```
- For each of the unlabeled reviews (sorted by name) there is a line containing an increasing integer index (i.e. line number 1), then a comma, and then the string label prediction for that review.
- See ```sample_sol.csv``` for example.
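
Since daily submissions are limited, it can help to check a file against the format above before uploading it. The following is only a minimal sketch: `check_submission` is a hypothetical helper that is not part of the starter code, and it checks only the header, the column count, and that the `Id` values are increasing integers.

```python
import csv
import io

def check_submission(lines):
    """Return a list of format problems found in a submission CSV (empty if none)."""
    problems = []
    reader = csv.reader(lines)
    header = next(reader, None)
    # Accept both "Id,Category" and "Id, Category" by stripping whitespace.
    if header is None or [h.strip() for h in header] != ["Id", "Category"]:
        problems.append(f"unexpected header: {header}")
    ids = []
    for line_no, row in enumerate(reader, start=2):
        if len(row) != 2:
            problems.append(f"line {line_no}: expected 2 columns, got {len(row)}")
        elif not row[0].strip().isdigit():
            problems.append(f"line {line_no}: non-integer Id {row[0]!r}")
        else:
            ids.append(int(row[0]))
    # Ids must be strictly increasing from one line to the next.
    if any(a >= b for a, b in zip(ids, ids[1:])):
        problems.append("Ids are not strictly increasing")
    return problems

# Usage on an in-memory example; pass an open file handle for a real file.
example = io.StringIO("Id,Category\n0,1\n1,0\n2,1\n")
print(check_submission(example))
```

For a file on disk, something like `check_submission(open(path, newline=''))` should work the same way, where `path` is whichever prediction file you plan to upload.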

### Part 2. Default Features, 20 points
Tune classifiers on default features (include the selected range of parameters in the code), and submit your predictions to Kaggle. Include the accuracies of the best classifier for each (LR and RF) and the two plots in the write-up, and a few sentences on how you picked the range.

### Part 3. Custom Features, 35 points
Implement your custom features and vectorizers in the code, and train and tune the classifiers. Submit the predictions to Kaggle, identify the best classifier for each, and include the accuracy obtained in the report. Try at least five different sets of features and vectorizers with at least five other parameters. Include the description and comparison of your features and the vectorizers in a few sentences in the write-up.

### Part 4. Analysis, 30 points
Select the best classifier with default features and the best classifier with custom
features. The analysis will focus on comparing these two classifiers. First, use eli5 to generate the global importance weights for both classifiers in the notebook (show_weights), and in a few sentences in the write-up, describe what is different between them. Then, in the notebook, generate 2 examples each where the classifiers disagree: (i) positive reviews, where only the default is correct, (ii) positive reviews, where only the custom is correct, (iii) negative reviews, where only the default is correct, and (iv) negative reviews, where only the custom is correct. Also, include 2 examples each of when both classifiers are correct, and both are incorrect. In the write-up, include a paragraph describing the insights these errors provide about the differences between the classifiers, especially the advantages/disadvantages of your custom features.

### Part 5. Statement of Collaborations, 5 points

It is mandatory to include a Statement of Collaboration in each submission with respect to the guidelines below. Include the names of everyone involved in the discussions (especially in-person ones) and what was discussed. All students are required to follow the academic honesty guidelines posted on the course website. For programming assignments, in particular, we encourage the students to organize (perhaps using Piazza) to discuss the task descriptions, requirements, bugs in our code, and the relevant technical content before they start working on it. However, you should not discuss the specific solutions, and, as a guiding principle, you are not allowed to take anything written or drawn away from these discussions (i.e., no photographs of the blackboard, written notes, referring to Piazza, etc.). Especially after you have started working on the assignment, try to restrict the discussion to Piazza as much as possible, so that there is no doubt as to the extent of your collaboration.

In [None]:
import sys
import os
import csv
import pickle
import eli5
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import FeatureUnion, make_pipeline

import lexicon_reader

In [1]:
class Dataset:
    def __init__(self, data, start_idx, end_idx):
        # Keep only this split's rows so that .data lines up with .reviews and .labels
        self.data = data[start_idx:end_idx]
        self.reviews = [row['Review'] for row in self.data]
        self.labels = [row['Category'] for row in self.data]
        self.vecs = None

def get_training_and_dev_data(filedir, dev_rate=0.2):
    with open(os.path.join(filedir, 'train.csv'), 'r', encoding='utf-8') as csvfile:
        data = [row for row in csv.DictReader(csvfile, delimiter=',')]
        for entry in data:
            with open(os.path.join(filedir, 'train', entry['FileIndex'] + '.txt'), 'r', encoding='utf-8') as reviewfile:
                entry['Review'] = reviewfile.read()
    dev_idx = int(len(data) * (1 - dev_rate))
    return Dataset(data, 0, dev_idx), Dataset(data, dev_idx, len(data))

def get_test_data(filedir, output_file_name=None):
    # Yield (file index, review text) for every test file, in numeric file order.
    # output_file_name is unused; it is kept so that existing calls still work.
    testfiledir = os.path.join(filedir, 'test')
    for filename in sorted(os.listdir(testfiledir), key=lambda x: int(os.path.splitext(x)[0])):
        with open(os.path.join(testfiledir, filename), 'r', encoding='utf-8') as reviewfile:
            fileindex = os.path.splitext(filename)[0]
            review = reviewfile.read()
            yield (fileindex, review)

def write_predictions(filedir, classifier, output_file_name):
    # newline='' lets the csv module control line endings, as its documentation recommends.
    with open(output_file_name, 'w', encoding='utf-8', newline='') as csvfile:
        # Field names must match the keys of each row and the required 'Id,Category' header.
        writer = csv.DictWriter(csvfile, delimiter=',', fieldnames=['Id', 'Category'])
        writer.writeheader()
        for (fileindex, review) in get_test_data(filedir, output_file_name):
            prediction = dict()
            prediction['Id'] = fileindex
            prediction['Category'] = classifier.predict([review])[0]
            writer.writerow(prediction)

def get_trained_classifier(data, model, features):
    ppl = make_pipeline(features, model)
    return ppl.fit(data.reviews, data.labels)

def get_custom_features(filedir):
    return FeatureUnion([
        ('custom_feats', make_pipeline(CustomFeats(filedir), DictVectorizer())),
        ('bag_of_words', get_custom_vectorizer())
    ])

def save(classifier, filedir, output_file_path):
    with open(output_file_path + ".pkl", 'wb') as f:
        pickle.dump(classifier, f)
    write_predictions(filedir, classifier, output_file_path + "_test.csv")

def load_classifier(input_file_path):
    # Use a context manager so the file handle is closed after loading
    with open(input_file_path, 'rb') as f:
        return pickle.load(f)

def plot(xs, train_accuracy_list, dev_accuracy_list, output_file_path=None):
    plt.clf()
    plt.plot(xs, train_accuracy_list, label='train')
    plt.plot(xs, dev_accuracy_list, label='dev')
    plt.ylabel('Accuracy')
    plt.legend()
    if output_file_path is not None:
        plt.savefig(output_file_path)
    else:
        plt.show()

In [None]:
# Load data

filedir = '../data'
print("Reading data")
train_data, dev_data = get_training_and_dev_data(filedir)

## Part 2. Default Features
We are providing code for training a machine learning classifier for sentiment classification using unigrams as features, i.e. ```CountVectorizer()```. The first goal is to optimize the hyper-parameters of logistic regression by modifying ```get_tuned_lr```. The regularization weight C is the primary hyper-parameter for logistic regression. Currently, the range for the parameter is ```np.arange(0.5, 3.5, 0.5)``` but this should be modified. When running this function, you will see both training and dev accuracy printed. There will also be a plot ```lr.png``` that will be saved that you can view. Based on these, adjust the range for C. 
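
Because C acts multiplicatively, a range spaced evenly on a log scale often covers more ground than a linear one. The sketch below illustrates one way to search coarsely and then zoom in. The bounds, the helper `refine_around_best`, and the dev scores are all made up for illustration; they are not recommended values or real results.

```python
import numpy as np

# Coarse pass: six candidates spaced evenly in log10 between 1e-3 and 1e2.
# These bounds are illustrative assumptions, not tuned values.
coarse_cs = np.logspace(-3, 2, num=6)

def refine_around_best(values, dev_scores, num=5):
    """Return `num` log-spaced candidates between the neighbours of the best value."""
    best = int(np.argmax(dev_scores))
    lo = values[max(best - 1, 0)]
    hi = values[min(best + 1, len(values) - 1)]
    return np.geomspace(lo, hi, num=num)

# Placeholder dev accuracies, used only to exercise the helper.
placeholder_dev_scores = [0.80, 0.85, 0.88, 0.86, 0.85, 0.84]
fine_cs = refine_around_best(coarse_cs, placeholder_dev_scores)
print(coarse_cs)
print(fine_cs)
```

In ```get_tuned_lr```, the `cs` assignment could be replaced with a range built this way. If you do so, a logarithmic x-axis may make the saved plot easier to read.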

The next goal is to optimize the parameters for random forest. To do so, you need to modify ```get_tuned_rf```. ```n_estimators``` is the parameter of interest used by random forest. Currently, the range is set to ```np.arange(5, 35, 5)``` but this should also be modified. Like before, when running this function, you will see both training accuracy and dev accuracy, and the plot ```rf.png``` will be saved. Based on what you see, adjust the parameters accordingly.
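
If you want to tune more than ```n_estimators```, one option is a small grid over two parameters at once. The sketch below runs on synthetic data from `make_classification` so that it finishes quickly. The grid values and the choice of `max_depth` as the second parameter are assumptions for illustration, not recommended settings.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the vectorized reviews (not the IMDB data).
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_train, X_dev, y_train, y_dev = train_test_split(X, y, test_size=0.2, random_state=0)

# Illustrative grid; max_depth=None lets each tree grow until its leaves are pure.
grid = {"n_estimators": [10, 50], "max_depth": [5, None]}

results = []
for n, depth in product(grid["n_estimators"], grid["max_depth"]):
    model = RandomForestClassifier(n_estimators=n, max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    dev_acc = accuracy_score(y_dev, model.predict(X_dev))
    results.append(((n, depth), dev_acc))

# Pick the combination with the highest dev accuracy.
best_params, best_dev_acc = max(results, key=lambda r: r[1])
print(best_params, best_dev_acc)
```

Fixing `random_state` makes runs repeatable, which helps when comparing settings whose dev accuracies are close.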

Running ```save(tuned_lr, filedir, 'lr_default')``` will save the classifier as ```lr_default.pkl``` which you will need for your error analysis. It will also run the classifier on the test set and save the results as ```lr_default_test.csv```, which you can upload to Kaggle. Similarly, the next line will output the files ```rf_default.pkl``` and ```rf_default_test.csv```.

In [None]:
def get_tuned_lr(train, dev, features, output_file_path='./lr.png'):
    train_vecs = features.fit_transform(train.reviews)
    dev_vecs = features.transform(dev.reviews)
    train_accuracy_list = list()
    dev_accuracy_list = list()
    
    # -------------------------------------------------------------------------
    # TODO: You will change this range, or may want to use np.logspace instead of np.arange
    cs = np.arange(0.5, 3.5, 0.5)  
    
    for c in cs:
        model = LogisticRegression(C=c)
        model.fit(train_vecs, train.labels)
        train_preds = model.predict(train_vecs)
        dev_preds = model.predict(dev_vecs)
        (train_score, dev_score) = (accuracy_score(train.labels, train_preds), accuracy_score(dev.labels, dev_preds))
        print("Train Accuracy:", train_score, ", Dev Accuracy:", dev_score)
        train_accuracy_list.append(train_score)
        dev_accuracy_list.append(dev_score)
    plot(cs, train_accuracy_list, dev_accuracy_list, output_file_path)
    best_model = LogisticRegression(C=cs[np.argmax(dev_accuracy_list)])
    return get_trained_classifier(train, best_model, features)


def get_tuned_rf(train, dev, features, output_file_path='./rf.png'):
    train_vecs = features.fit_transform(train.reviews)
    dev_vecs = features.transform(dev.reviews)
    train_accuracy_list = list()
    dev_accuracy_list = list()
    
    # -------------------------------------------------------------------------
    # TODO: You will change this range, and try different parameters to tune RF model
    n_estimators = np.arange(5, 35, 5)  
    
    for num_estimator in n_estimators:
        model = RandomForestClassifier(n_estimators=num_estimator)
        model.fit(train_vecs, train.labels)
        train_preds = model.predict(train_vecs)
        dev_preds = model.predict(dev_vecs)
        (train_score, dev_score) = (accuracy_score(train.labels, train_preds), accuracy_score(dev.labels, dev_preds))
        print("Train Accuracy:", train_score, ", Dev Accuracy:", dev_score)
        train_accuracy_list.append(train_score)
        dev_accuracy_list.append(dev_score)
    plot(n_estimators, train_accuracy_list, dev_accuracy_list, output_file_path)
    best_model = RandomForestClassifier(n_estimators=n_estimators[np.argmax(dev_accuracy_list)])
    return get_trained_classifier(train, best_model, features)

In [None]:
# Some example code to get a trained classifier
print("Training model")
lr_with_default = get_trained_classifier(train_data, LogisticRegression(), CountVectorizer())
rf_with_default = get_trained_classifier(train_data, RandomForestClassifier(), CountVectorizer())

# You can see some of the predictions of the classifier by running the following code
print(lr_with_default.predict(["This movie sucks!", "This movie is great!"]))
print(rf_with_default.predict(["This movie sucks!", "This movie is great!"]))

# You can then experiment with tuning the classifiers
# Experiment with the parameters in the get_tuned_lr and get_tuned_rf methods
# Look at the files lr.png and rf.png that are saved after running each of these functions below
print("Tuning model")
tuned_lr = get_tuned_lr(train_data, dev_data, CountVectorizer())
tuned_rf = get_tuned_rf(train_data, dev_data, CountVectorizer())

# After tuning the parameters and finding a good classifier, you can save it.
# This saves the classifier as a pickle object, which you can load later for your error analysis.
# It also runs the classifier on the test set and writes predictions that you can upload to Kaggle.
print("Saving model and predictions")
save(tuned_lr, filedir, 'lr_default')
save(tuned_rf, filedir, 'rf_default')


## Part 3. Custom Features and Vectorizers
The next goal for the assignment is to improve upon these classifiers by introducing your own features. Similar to the in-class activity in week 4, you will design features that use lexicons, regular expressions, etc., and experiment with the vectorizer (e.g., unigrams vs. bigrams, counts vs. TF-IDF). To implement these features, you will need to modify the ```features``` method of ```CustomFeats```, and you can change the vectorizer returned by ```get_custom_vectorizer```.
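
To make the vectorizer choices concrete, the sketch below builds a few scikit-learn configurations and prints the size of the resulting feature matrix on two toy sentences. The particular combinations are illustrative assumptions, not a recommended set.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Two toy documents, just to show how the vocabulary changes.
docs = ["This movie is great!", "This movie is not great."]

candidates = {
    "unigram_counts": CountVectorizer(),
    "binary_unigrams": CountVectorizer(binary=True),
    "uni_bigram_counts": CountVectorizer(ngram_range=(1, 2)),
    "uni_bigram_tfidf": TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True),
}

shapes = {}
for name, vectorizer in candidates.items():
    matrix = vectorizer.fit_transform(docs)
    shapes[name] = matrix.shape
    print(name, matrix.shape)

# Bigrams can capture negation that unigrams miss, e.g. "not great".
print("not great" in candidates["uni_bigram_counts"].vocabulary_)
```

Any of these objects could be returned from ```get_custom_vectorizer```. Larger n-gram ranges increase the number of features, so training time grows accordingly.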

As before, tune the parameters for both logistic regression and random forest, but this time with your custom features. 
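
As a starting point for custom features, here is a minimal sketch of regular-expression features returned as a dictionary, the same shape that ```DictVectorizer``` consumes. The function `regex_features`, the patterns, and the feature names are hypothetical examples, not part of the starter code, and whether they help accuracy is something to verify on the dev set.

```python
import re

# Simple negation cues: standalone "not", "no", "never", or a trailing "n't".
NEGATION_RE = re.compile(r"\b(?:not|no|never)\b|n't\b", re.IGNORECASE)
# Explicit ratings written as "<number>/10", e.g. "8/10" or "3 / 10".
RATING_RE = re.compile(r"\b(\d{1,2})\s*/\s*10\b")

def regex_features(review):
    """Return a dict of hand-crafted features for one review."""
    rating = RATING_RE.search(review)
    return {
        "num_exclamations": review.count("!"),
        "num_negations": len(NEGATION_RE.findall(review)),
        "has_rating": int(rating is not None),
        # 0 doubles as "no rating found"; has_rating disambiguates the two cases.
        "rating_value": int(rating.group(1)) if rating else 0,
    }

print(regex_features("I didn't like it, not at all! 3/10"))
```

Entries like these could be merged into the dictionary returned by ```CustomFeats.features```.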

Run the save methods to save the classifiers (```lr_custom.pkl``` and ```rf_custom.pkl```) and predictions (```lr_custom_test.csv``` and ```rf_custom_test.csv```), and upload the latter files to Kaggle.

In [None]:
class CustomFeats(BaseEstimator, TransformerMixin):
    """Extract features from each document for DictVectorizer"""
    def __init__(self, filedir):
        self.feat_names = set()
        lexicon_dir = os.path.join(filedir, 'lexicon')
        self.inqtabs_dict = lexicon_reader.read_inqtabs(os.path.join(lexicon_dir, 'inqtabs.txt'))
        self.swn_dict = lexicon_reader.read_senti_word_net(os.path.join(lexicon_dir, 'SentiWordNet_3.0.0_20130122.txt'))

    def fit(self, x, y=None):
        return self

    @staticmethod
    def word_count(review):
        words = review.split(' ')
        return len(words)

    def pos_count(self, review):
        words = review.split(' ')
        count = 0
        for word in words:
            if word in self.inqtabs_dict.keys() and self.inqtabs_dict[word] == lexicon_reader.POS_LABEL:
                count += 1
        return count

    def features(self, review):
        return {
            # -------------------------------------------------------------------------
            # 4 example features 
            # TODO: Add your own here e.g. word_count, and pos_count
            'length': len(review),
            'num_sentences': review.count('.'),
            'num_words': self.word_count(review),
            'pos_count': self.pos_count(review)  
        }

    def get_feature_names(self):
        return list(self.feat_names)

    def transform(self, reviews):
        feats = []
        for review in reviews:
            f = self.features(review)
            self.feat_names.update(f)  # record every feature name seen so far
            feats.append(f)
        return feats


def get_custom_vectorizer():
    # -------------------------------------------------------------------------
    #TODO: Experiment with different vectorizers
    return CountVectorizer()


In [None]:
# Experiment with different features by modifying custom_features.py and test your accuracy by running:
# (Again, you can look at the lr.png  and rf.png that are saved after running each of these functions)
print("Tuning model")
tuned_lr = get_tuned_lr(train_data, dev_data, get_custom_features(filedir))
tuned_rf = get_tuned_rf(train_data, dev_data, get_custom_features(filedir))

print("Saving model and predictions")
save(tuned_lr, filedir, 'lr_custom')
save(tuned_rf, filedir, 'rf_custom')

## Part 4. Error Analysis
Along with tuning classifiers and designing features to achieve as high an accuracy as possible, you also need to analyze the classifiers in this assignment. For this analysis, modify the code below to get metrics and feature weights for all four of your best classifiers (note that it is currently set up to compute metrics for only two, so you will need to either modify it to handle all four or run it twice). You may need to update ```load_classifier('lr_default.pkl')```, replacing ```lr_default.pkl``` with the model you have (the one that was saved when you ran the save method earlier).

The rest of the notebook also contains the two primary approaches for analysis that use the ```eli5``` package: (1) global importance weights of individual classifiers, and (2) weights of individual words for example predictions. The notebook also includes code for easily comparing classifiers to each other.
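
When comparing two classifiers, boolean masks in pandas make it straightforward to pull out the groups of examples where only one of them is correct. The sketch below uses a tiny hand-made table in place of the real predictions. The column names mirror those used in this notebook, and the labels are assumed to be the strings '1' (positive) and '0' (negative).

```python
import pandas as pd

# Toy stand-in for the predictions dataframe (not real model output).
toy = pd.DataFrame({
    "Category":              ["1", "1", "0", "0", "1", "0"],
    "Classifier1Prediction": ["1", "0", "0", "1", "1", "1"],
    "Classifier2Prediction": ["0", "1", "1", "0", "1", "1"],
})

c1_correct = toy["Classifier1Prediction"] == toy["Category"]
c2_correct = toy["Classifier2Prediction"] == toy["Category"]
is_positive = toy["Category"] == "1"

# One boolean mask per group of interest.
groups = {
    "positive, only classifier1 correct": is_positive & c1_correct & ~c2_correct,
    "positive, only classifier2 correct": is_positive & ~c1_correct & c2_correct,
    "negative, only classifier1 correct": ~is_positive & c1_correct & ~c2_correct,
    "negative, only classifier2 correct": ~is_positive & ~c1_correct & c2_correct,
    "both correct": c1_correct & c2_correct,
    "both incorrect": ~c1_correct & ~c2_correct,
}

selected = {name: toy.index[mask].tolist() for name, mask in groups.items()}
for name, rows in selected.items():
    print(name, rows)
```

On the real dataframe, `predictions[mask].sample(2)` would draw two examples from a group, provided the group contains at least two rows.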

In [None]:
# None removes the column-width limit; the old value -1 is rejected by recent pandas versions
pd.set_option('display.max_colwidth', None)

def get_error_type(pred, label):
    # return the type of error: tp,fp,tn,fn
    if pred == label:
        return "tp" if pred == '1' else "tn"
    return "fp" if pred == '1' else "fn"

# Change this for your different classifiers
classifier1 = load_classifier('lr_default.pkl')
classifier2 = load_classifier('rf_custom.pkl')

# Create pandas dataframe
predictions = pd.DataFrame.from_dict(dev_data.data)

# Classify data points using classifier1
predictions['Classifier1Prediction'] = classifier1.predict(predictions['Review'])
predictions['Classifier1ErrorType'] = predictions.apply(lambda row: get_error_type(row['Classifier1Prediction'], row['Category']), axis=1)

# Classify data points using classifier 2
predictions['Classifier2Prediction'] = classifier2.predict(predictions['Review'])
predictions['Classifier2ErrorType'] = predictions.apply(lambda row: get_error_type(row['Classifier2Prediction'], row['Category']), axis=1)

# Get metrics for each classifier
def print_metrics(error_type_counts):
    accuracy = (error_type_counts['tp'] + error_type_counts['tn']) / sum(error_type_counts)
    precision = error_type_counts['tp'] / (error_type_counts['tp'] + error_type_counts['fp'])
    recall = error_type_counts['tp'] / (error_type_counts['tp'] + error_type_counts['fn'])
    print("Accuracy:", accuracy, "\nPrecision:", precision, "\nRecall:", recall, "\nF1:", 2 * precision*recall/(precision + recall))

print("Classifier1 Metrics")
print_metrics(predictions['Classifier1ErrorType'].value_counts())
print("\nClassifier2 Metrics")
print_metrics(predictions['Classifier2ErrorType'].value_counts())

In [None]:
eli5.show_weights(classifier1, top=25)

In [None]:
eli5.show_weights(classifier2, top=25)

In [None]:
# See some examples of errors for each classifier
# (Modify the code to get false negatives and errors for Classifier2)
predictions[predictions['Classifier1ErrorType'] == 'fp'].sample(10)

# See where they disagree
# Modify the code to find cases where one classifier's prediction is correct but the other is incorrect
predictions['ClassifiersAgree'] = predictions['Classifier1Prediction'] == predictions['Classifier2Prediction']
disagreements = predictions[~predictions['ClassifiersAgree']]
print("# Cases where the two classifiers disagree:", len(disagreements), "->", len(disagreements) / len(predictions) * 100, "%")
disagreements.sample(10)