# Sentiment analysis (Lab 4)

In [1]:
__author__ = "Alex Wang"
__version__ = "DSGA 1012, NYU, Spring 2019 term"

In this lab, we'll walk through preparing a dataset, designing features, fitting a model on the feature data (sort of), and evaluating on a held-out test set. For the **bonus**, we'll have a friendly competition to see who can get the highest performance on a held-out test set from a different distribution, so think throughout about how to improve our model's performance and how to make it generalize!

## Setup

First, let's load the Stanford Sentiment Treebank. Download it from here: [the train/dev/test Stanford Sentiment Treebank distribution](http://nlp.stanford.edu/sentiment/trainDevTestTrees_PTB.zip), unzip it, and put the resulting folder in the same directory as this notebook. (If you want to put it somewhere else, change `sst_home` below.)

In [2]:
import re
import random
import os
import numpy as np
import collections

In [3]:
sst_home = 'trees'

def load_sst_data(path):
    # Let's do 2-way positive/negative classification instead of 5-way
    EASY_LABEL_MAP = {0:0, 1:0, 2:None, 3:1, 4:1}
    
    data = []
    with open(path) as f:
        for i, line in enumerate(f): 
            example = {}
            example['label'] = EASY_LABEL_MAP[int(line[1])]
            if example['label'] is None:
                continue
            
            # Strip out the parse information and the phrase labels---we don't need those here
            text = re.sub(r'\s*(\(\d)|(\))\s*', '', line)
            example['text'] = text[1:]
            data.append(example)

    return data
     
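# Build the paths from sst_home so that changing it above is enough
train = load_sst_data(os.path.join(sst_home, 'train.txt'))
val = load_sst_data(os.path.join(sst_home, 'dev.txt'))
test = load_sst_data(os.path.join(sst_home, 'test.txt'))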
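
To see what the loader does to a single line, here is a minimal sketch on a made-up sentence written in the same bracketed style as the SST files. The sentence is an assumption for illustration only; the regular expression is the one used in `load_sst_data`.

```python
import re

# One made-up line in the bracketed tree style (not taken from the dataset).
line = "(3 (2 It) (4 (2 's) (3 good)))\n"

# The character right after the opening parenthesis is the sentence-level label (0-4).
label = int(line[1])

# Same pattern as in load_sst_data: drop "(<digit>" openers and ")" closers,
# then strip the single leading space that is left over.
text = re.sub(r'\s*(\(\d)|(\))\s*', '', line)[1:]

print(label, repr(text))
```

The label `3` maps to the positive class under `EASY_LABEL_MAP`, and only the words of the sentence survive the substitution.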


## Extracting features

Now that we have the data, we need to build some sort of feature representation of it. One of the simplest things we can do is to represent each sentence as a bag of its words. As part of determining what constitutes a word (or "token"), we'll have to choose how to tokenize the data. Let's do the simplest thing for now and just split on whitespace. More sophisticated methods might use a tokenizer from an outside library, such as NLTK or spaCy.
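
To get a feel for why the choice of tokenizer matters, here is a small sketch comparing whitespace splitting with a regex-based alternative that uses only the standard library. The function names and the example sentence are made up for illustration. Note that the SST text is already pre-tokenized, so the difference will be much smaller there than on raw text.

```python
import re

def tokenize_whitespace(string):
    # Same idea as the bare-bones tokenizer in this lab.
    return string.split()

def tokenize_regex(string):
    # Lowercase, then keep runs of word characters (optionally with an internal
    # apostrophe) and emit each punctuation mark as its own token.
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", string.lower())

sentence = "Great acting, but the plot wasn't great."
print(tokenize_whitespace(sentence))
print(tokenize_regex(sentence))
```

With whitespace splitting, `Great`, `great.` and `acting,` are all distinct vocabulary items; the regex version maps the first two to the same token and separates the punctuation.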

In [3]:
def tokenize(string):
    ''' Bare-bones tokenization '''
    return string.split()

def extract_feats(datasets):
    '''Annotates datasets with feature vectors.'''
                         
    # Extract vocabulary
    word_counter = collections.Counter()
    for example in datasets[0]: # assume first dataset is training set
        word_counter.update(tokenize(example['text']))
    vocabulary = set(word_counter.keys())

    features = set()
    for i, dataset in enumerate(datasets):
        for example in dataset:
            example['features'] = collections.defaultdict(float)
            
            #Extract features (by name) for one example:
            word2count = collections.Counter(tokenize(example['text']))
            for word, count in word2count.items():
                if word in vocabulary:
                    example["features"][word] = min(count, 1) # these are *binary* features
            
            features.update(example['features'].keys())
                            
    # By now, we know what all the features will be, so we can
    # assign indices to them.
    feat2idx = dict(zip(features, range(len(features))))
    idx2feat = {v: k for k, v in feat2idx.items()}
    dim = len(feat2idx)
                
    # Now we create actual vectors from those indices.
    for dataset in datasets:
        for example in dataset:
            example['input'] = np.zeros((dim))
            for feature in example['features']:
                example['input'][feat2idx[feature]] = example['features'][feature]
    return idx2feat
    
idx2feat = extract_feats([train, val, test]) # adds the features as a key in each example dict
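
To make the resulting representation concrete, here is a stripped-down sketch of the same idea on three toy sentences. The sentences and the helper `binary_bow` are hypothetical, and the vocabulary is sorted here only so that the indices are easy to read (`extract_feats` assigns indices in whatever order the set yields).

```python
toy_train = ["a good good movie", "a bad movie"]
toy_test = ["good acting , bad script"]

# The vocabulary comes from the training sentences only.
vocab = sorted({w for s in toy_train for w in s.split()})
word2idx = {w: i for i, w in enumerate(vocab)}

def binary_bow(sentence):
    """Binary bag-of-words vector: 1.0 if the word occurs at all, else 0.0."""
    vec = [0.0] * len(vocab)
    for word in set(sentence.split()):
        if word in word2idx:  # words unseen in training are silently dropped
            vec[word2idx[word]] = 1.0
    return vec

print(vocab)
for s in toy_train + toy_test:
    print(binary_bow(s), s)
```

Two things to notice: the repeated `good` still contributes only a 1, and the test sentence loses `acting`, `,` and `script` because they never appeared in training.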

## Building a Model: Logistic Regression

Let's build a classifier for this dataset. Because we haven't talked about optimization yet, we’ll use the LogisticRegression class from scikit-learn out-of-the-box.

You might need to install scikit-learn first if it isn't already available in your environment.

In [4]:
from sklearn.linear_model import LogisticRegression
log_model = LogisticRegression()

In order to learn the "best" parameters for our model based on the training data, we use scikit-learn’s `fit` method. Inside this method, the parameters are optimized according to some loss function (see slides).
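
The slides aren't reproduced here, so as a rough illustration, the sketch below computes the standard logistic (cross-entropy) loss by hand on two toy examples. This is an assumption about which loss is meant, and it is not scikit-learn's actual implementation; by default, `LogisticRegression` also adds an L2 penalty on the weights, which is left out here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(w, b, X, y):
    """Mean negative log-likelihood of 0/1 labels y under a logistic model."""
    p = sigmoid(X @ w + b)  # predicted probability of the positive class
    eps = 1e-12             # avoids log(0)
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

# Two toy examples with two binary features each.
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, 0.0])

loss_zero = log_loss(np.zeros(2), 0.0, X, y)           # uninformative weights
loss_fit = log_loss(np.array([2.0, -2.0]), 0.0, X, y)  # weights that fit the toy data
print(loss_zero, loss_fit)
```

Fitting amounts to searching for the weights that make this number small on the training set: all-zero weights predict 0.5 everywhere, while weights pointing in the right direction give a lower loss.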

In [5]:
X_train = [x['input'] for x in train]
y_train = [y['label'] for y in train]
log_model = log_model.fit(X=X_train, y=y_train)



We now have a trained sentiment analysis model!

## Evaluating a Model and Extensions

How well does our model do? Let's define a function to see our model's accuracy on some data split and see how well we fit the training data. We'll make use of the `model.predict()` interface for generating predictions.

In [6]:
from sklearn.metrics import accuracy_score

def evaluate(inputs, targs, model):
    preds = model.predict(inputs)
    # accuracy_score expects (y_true, y_pred); accuracy is symmetric, but keep the documented order
    return accuracy_score(targs, preds)

In [7]:
X_train = [x['input'] for x in train]
y_train = [y['label'] for y in train]
train_acc = evaluate(X_train, y_train, log_model)
print("Train acc: %.3f" % (100 * train_acc))

Train acc: 98.367


Nice, 98% accuracy. How well do we do on held-out data?

In [8]:
X_dev = [x['input'] for x in val]
y_dev = [y['label'] for y in val]
dev_acc = evaluate(X_dev, y_dev, log_model)
print("Dev acc: %.3f" % (100 * dev_acc))

Dev acc: 77.982


We see a big drop of about 20 accuracy points on held-out data, which suggests we overfit the training data. We can go back and revise our approach (e.g. by playing around with the different parameters for the [logistic regression classifier](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)), re-fit on the training data, and then see how well we do on the held-out validation data.
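
One parameter worth trying is `C`, the inverse regularization strength (smaller values regularize more). Here is a minimal sketch of a sweep over `C`; it runs on synthetic data from `make_classification` as a stand-in, so the numbers it prints say nothing about SST, and the grid of values is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for (X_train, y_train) and (X_dev, y_dev).
X, y = make_classification(n_samples=400, n_features=50, n_informative=5, random_state=0)
X_tr, X_dv, y_tr, y_dv = train_test_split(X, y, test_size=0.25, random_state=0)

results = {}
for C in [0.01, 0.1, 1.0, 10.0]:
    model = LogisticRegression(C=C, max_iter=1000).fit(X_tr, y_tr)
    # score() returns mean accuracy on the given split
    results[C] = (model.score(X_tr, y_tr), model.score(X_dv, y_dv))
    print("C=%-5g train acc: %.3f  dev acc: %.3f" % (C, *results[C]))

# Pick the setting that does best on the dev split, never on the test split.
best_C = max(results, key=lambda c: results[c][1])
print("Best C on the dev split:", best_C)
```

In the notebook, the same loop could be run with `X_train`, `y_train`, `X_dev` and `y_dev` in place of the synthetic arrays.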

By doing this, however, we'll be fitting to the validation data. At some point, we'll want to evaluate on completely new data, which is what the test split is for. The test split should be used as sparingly as possible!

In [9]:
X_test = [x['input'] for x in test]
y_test = [y['label'] for y in test]
test_acc = evaluate(X_test, y_test, log_model)
print("Test acc: %.3f" % (100 * test_acc))

Test acc: 79.791


## Exercise

In the remaining time, try to maximize your model's performance on the test split without evaluating on it (until the end of class). How you go about that is completely open (feature engineering, modeling, optimization, etc.), but do not use pretrained models or libraries outside the ones we have used today. You should work on this by yourself.
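
As one possible starting point for feature engineering (a sketch, not the intended solution), the hypothetical helper below adds bigram features on top of the unigrams, using nothing beyond plain Python. Bigrams let a linear model see short phrases such as "not good" that a bag of single words cannot distinguish from "good".

```python
def extract_ngram_feats(tokens, max_n=2):
    """Binary features for every n-gram of length 1..max_n in a token list."""
    feats = {}
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            feats[" ".join(tokens[i:i + n])] = 1.0
    return feats

feats = extract_ngram_feats("this movie is not good".split())
print(sorted(feats))
```

Keep in mind that richer features also enlarge the feature space, so they tend to make overfitting worse unless the regularization is adjusted along with them.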

## !?! ~ ~ * * BONUS * * ~ ~ !?!

We've been evaluating on data drawn roughly from the same data distribution. How do our models fare if we move out-of-distribution? 

Once the data is distributed, the following function reformats it in the same form as our SST data.

In [10]:
def load_mystery_data(path):
    
    pos_data, neg_data = [], []
    all_files = []
    _limit = 250
    
    for dirpath, dirnames, files in os.walk(path):
        for name in files:
            all_files.append(os.path.join(dirpath, name))
            
            
    for file_path in all_files:
        # Keep at most _limit examples per class
        if '/neg' in file_path and len(neg_data) < _limit:
            example = {}
            with open(file_path, 'r') as myfile:
                example['text'] = myfile.read().replace('\n', '')
            example['label'] = 0
            neg_data.append(example)
            
        if '/pos' in file_path and len(pos_data) < _limit:
            example = {}
            with open(file_path, 'r') as myfile:
                example['text'] = myfile.read().replace('\n', '')
            example['label'] = 1
            pos_data.append(example)
    data = neg_data + pos_data

    return data

            
mystery_test = load_mystery_data('path/to/data')
idx2feat = extract_feats([train, mystery_test]) # adds the features as a key in each example dict
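
Before looking at accuracy, one quick and rough way to gauge how far a new dataset is from the training data is the fraction of its tokens that never occur in the training vocabulary, since those tokens are simply dropped by `extract_feats`. The helper `oov_rate` and the toy sentences below are made up for illustration.

```python
def oov_rate(train_texts, other_texts):
    """Fraction of whitespace tokens in other_texts that never occur in train_texts."""
    vocab = {tok for text in train_texts for tok in text.split()}
    tokens = [tok for text in other_texts for tok in text.split()]
    if not tokens:
        return 0.0
    return sum(tok not in vocab for tok in tokens) / len(tokens)

toy_train = ["a good movie", "a bad movie"]
toy_other = ["a good phone with a bad battery"]
rate = oov_rate(toy_train, toy_other)
print("OOV rate: %.2f" % rate)
```

In the notebook, this could be called as `oov_rate([ex['text'] for ex in train], [ex['text'] for ex in mystery_test])` and compared with the same number for `val`.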