 The following Python libraries are used for this part, and have been tested on Python 3.9 and Python 3.7.
 See links for instructions on installation if not already installed.
  - [NLTK](https://www.nltk.org/install.html) (tested with 3.6.7 and with 3.2.5.)
  - [Scikit-Learn](https://scikit-learn.org/stable/install.html) (test with 1.0.2)
  - [SciPy](https://scipy.org/install/) (tested with 1.7.3 and with 1.4.1)

 You may uncomment and run the code cell below to download the data.  Otherwise, you may download the data [here](https://drive.google.com/file/d/1thWkUj7uGOApr_dXRvMr9TsEHpo_H_2q/view?usp=sharing).

In [None]:
# !pip install gdown
# !gdown --id 1thWkUj7uGOApr_dXRvMr9TsEHpo_H_2q -O sst2.zip
# !mkdir -p data
# !unzip sst2.zip -d data
# !rm sst2.zip


 ## 0. Building and Extracting features
 ### Build Vocabulary using __sst2.train__
 In the following cell, you will build a vocabulary to record all unique words (except the label) apearing in the training data. A vocabulary is a dictionary that maps a word to an unique integer. However, to better connect the scikit-learn model to the Pytorch Lightning model. We will need to add special tokens `<pad>` and `<unk>` to the vocabulary as the first two words. Basically, you will use NLTK WordPunctTokenizer to tokenize each row and lowercase every tokens. Then, you will aggregate tokens of all lines and then you can build your vocabulary dictionary.

 Save the vocab dictionary into a JSON file in the `data_dir` directory and name it `unigram_vocab.json` .

 Example vocabulary after processing a corpus `I have a cat.`:

 `{'<pad>': 0, '<unk>': 1, 'i': 2, 'have': 3, 'a': 4, 'cat': 5}`

 Note that each line of the file starts with the label ∈ (0, 1) followed by the text of the review.

In [28]:
from collections import Counter
import json
from pathlib import Path
from nltk.tokenize import WordPunctTokenizer

print("Build unigram vocab from sst2.train")
data_dir = Path('data/')  # Modify the path of `data_dir` as needed.
tokenizer = WordPunctTokenizer()
counter = Counter()
counter.update(['<pad>', '<unk>'])
data_train = open(data_dir.joinpath('sst2.train')).readlines()
print(f"Size of training data: {len(data_train)}")

# Hw-TODO: Update "counter" with occurances of words in the training data.
for line in data_train: 
    l_tokens = tokenizer.tokenize(line.lower())
    counter.update(l_tokens[1:])
print(f"Vocab size before frequency filtering: {len(counter)}")

vocab = {'<pad>': 0, '<unk>': 1}
# HW TODO: Create the vocabulary with words that appear at least 3 times.
id_ = 2
for token, count in counter.items():
    if count >= 3:
        vocab[token] = id_
        id_ += 1

print(f"Vocab size after frequency filtering: {len(vocab)}")
output_filepath = data_dir.joinpath('unigram_vocab.json')
json.dump(vocab, open(output_filepath, mode='w'))


Build unigram vocab from sst2.train
Size of training data: 6920
Vocab size before frequency filtering: 13850
Vocab size after frequency filtering: 4949


In [29]:
# sanity check
assert (vocab['<pad>'] == 0)
assert (vocab['<unk>'] == 1)
assert (len(vocab) == 4949)


In [None]:
# Hw-TODO: Create vocabularies using two new feature templates. 
#          This corresponds to Part 2 of the assignment "Feature Engineering".
#          You may come back to this after going through the rest of Part 1.
          


 ### Generate features and labels files
 In this part, you will convert each input text into a unigram binary representation. When process a line, you will first separate the label and the input text. Then, you will create a list filled with zeros of length that equals to the size of vocabulary (number of unique words including special tokens; feel free to explore sparse representation on your own). Remeber to lowercase the text and use WordPunctTokenizer for tokenization.

 In the end, for each data split, you will have a feature matrix $M$ with shape $(N, \text{vocab\_size})$ where $N$ is dataset size. Each column represents a word. For example, $M_{ij} \in \{0,1\}$ denotes whether $i$-th instance has the word with id $j$ from the vocabulary. Also, you will have a label array with shape $(N,)$.

 When you do feature engineering, keep the naming system consistant so that your feature files can be applied to Pytorch-Lightning code easily. For example, you may save the features with the filename `{train,dev,test}_unigram_binary_features.npz` and the labels with the filename `{train,dev,test}_labels.npz` in the `data_dir` directory. 
 And when you extract other features, the filename of the feature can be `{train,dev,test}_{FEATURE_NAME}_features.npz`.

 To reduce the storage of saving a whole dataset into a dense matrix, we highly recommend you to store features into a sparse matrix using `scipy.sparse.csr_matrix` ([api](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html)). Otherwise, it takes you several GBs for this assignment. See a tutorial [here](https://machinelearningmastery.com/sparse-matrices-for-machine-learning/). For the labels, you can store them as vanilla numpy arrays. (You may use `numpy.asarray()` to convert a Python `list` object to a `numpy.ndarray`.)

 To save the labels (as a `numpy.ndarray`), use `numpy.savez(filepath, labels_array)` ([api](https://numpy.org/doc/stable/reference/generated/numpy.savez.html)), and to save the features (as a `scipy.sparse.csr_matrix`, use `scipy.sparse.save_npz(filepath, features_sparse_matrix)` ([api](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.save_npz.html)) if you use the sparse matrix for features.

 As a sanity check, your `unigram_binary` matrix shapes for {train, dev, test} should look like:

 train feature matrix shape: (6920, 4949)
 train label array shape: (6920,)

 dev feature matrix shape: (872, 4949)
 dev label array shape: (872,)

 test feature matrix shape: (1821, 4949)
 test label array shape: (1821,)

In [47]:
# Generate `npz` files of features and of labels
import json

from nltk.tokenize import WordPunctTokenizer
import numpy as np
from scipy import sparse

def extract_features(vocab, data_dir, tokenizer, feature_name):
    """
    Extract and save different features based on vocab of the features.
    # Parameters
    vocab : `dict[str, int]`, required.
        A map from the word type to the index of the word.
    data_dir : `Path`, required.
        Directory of the dataset
    tokenizer : `Callable`, required.
        Tokenizer with a method `.tokenize` which returns list of tokens.
    feature_name : `str`, required.
        Name of the feature, such as unigram_binary.
    # Returns
        `None`
    """
    if feature_name == 'unigram_binary': 
        for set_ in ['train', 'dev', 'test']:
            data = open(data_dir.joinpath(f'sst2.{set_}')).readlines()
            row = []
            col = []
            values = []
            labels = []
            shape = (len(data), len(vocab))
            for i, line in enumerate(data): 
                l_tokens = tokenizer.tokenize(line)
                labels.append(int(l_tokens[0]))
                tokens = l_tokens[1:]
                unk_flag = True
                for token in tokens: 
                    if token in vocab: 
                        col.append(vocab[token])
                        row.append(i)
                        values.append(1)
                    elif unk_flag:
                        col.append(vocab['<unk>'])
                        row.append(i)
                        values.append(1)
                        unk_flag = False
            matrix = sparse.csr_matrix((values, (row, col)), shape=shape)
            labels = np.array(labels)
            label_path = data_dir.joinpath(f'{set_}_labels.npz')
            feature_path = data_dir.joinpath(f'{set_}_{feature_name}_features.npz')
            np.savez(label_path, labels)
            sparse.save_npz(feature_path, matrix)

In [48]:
data_dir = Path('data')
tokenizer = WordPunctTokenizer()
vocab_filepath = data_dir.joinpath('unigram_vocab.json')
extract_features(vocab=json.load(open(vocab_filepath)),
                 tokenizer=tokenizer,
                 data_dir=data_dir,
                 feature_name='unigram_binary')


In [None]:
# When you implement the other feature templates you choose for Part 2 "Feature Engineering", you'll copy the few lines of code above, adjusting the parameters as needed.


 We provide you the helper function below for feature weight analysis (1.1.2 and 1.2.2).

In [46]:
def print_important_weights(weights, words):
    """
    Print important pairs of weights and words.
    # Parameters
    weights : `Iterable`, required.
        Weights from a learned model.
    words : `Iterable`, required.
        Word types of the vocabulary.  
        It must be true that `len(weights) == len(words)`.
    # Returns
        `None`
    """

    def print_pairs(pairs):
        for weight, word in pairs:
            print("{: .4f} | {}".format(weight, word))

    assert len(weights) == len(words)
    pairs = list(zip(weights, words))
    pairs = sorted(pairs, key=lambda x: x[0], reverse=True)
    print("Most positive words:")
    print_pairs(pairs[:10])
    print("\nMost negative words:")
    print_pairs(reversed(pairs[-10:]))

    pairs = list(zip(abs(weights), words))
    pairs = sorted(pairs, key=lambda x: x[0], reverse=False)
    print("\nMost neutral words:")
    print_pairs(pairs[:10])



 # Scikit-learn specific part: logistic regression with scikit-learn
 ## 1.1.1 Logistic regression with scikit-learn

In [83]:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.utils._testing import ignore_warnings
from sklearn.exceptions import ConvergenceWarning

@ignore_warnings(category=ConvergenceWarning)
def fit_and_eval_logistic_regression(data_dir: Path,
                                     feature_name: str) -> LogisticRegression:
    """
    Fit and evaluate the logistic regression model using the scikit-learn library.
    # Parameters
    data_dir : `Path`, required
        The data directory.
    feature_name : `str`, required.
        Name of the feature, such as unigram_binary.
    # Returns
        model_trained: `LogisticRegression`
            The object of `LogisticRegression` after it is trained.
    """
    # Hw-TODO: Implement logistic regression with scikit-learn.
    #          Print out the accuracy scores on dev and test data.
    #          Feel free to add arguments to the functions as needed.

    train_data = sparse.load_npz(data_dir.joinpath(f'train_{feature_name}_features.npz'))
    dev_data = sparse.load_npz(data_dir.joinpath(f'dev_{feature_name}_features.npz'))
    test_data = sparse.load_npz(data_dir.joinpath(f'test_{feature_name}_features.npz'))

    train_labels = np.load(data_dir.joinpath(f'train_labels.npz'))['arr_0']
    dev_labels = np.load(data_dir.joinpath(f'dev_labels.npz'))['arr_0']
    test_labels = np.load(data_dir.joinpath(f'test_labels.npz'))['arr_0']
    
    clf = LogisticRegression(random_state=0).fit(train_data, train_labels)
    dev_pred = clf.predict(dev_data)
    test_pred = clf.predict(test_data)

    dev_accuracy =  accuracy_score(dev_labels, dev_pred)
    dev_f1 = f1_score(dev_labels, dev_pred)
    test_accuracy = accuracy_score(test_labels, test_pred)
    test_f1 = f1_score(test_labels, test_pred)


    print(f"Dev accuracy: {round(dev_accuracy, 4)}, Dev f1: {round(dev_f1, 4)}")
    print(f"Test accuracy: {round(test_accuracy, 4)}, Test f1: {round(test_f1, 4)}")

    return clf

In [84]:
fit_and_eval_logistic_regression(feature_name='unigram_binary',
                                 data_dir=Path('data'))

Dev accuracy: 0.7821, Dev f1: 0.7879
Test accuracy: 0.804, Test f1: 0.8059


LogisticRegression(random_state=0)

 ## 1.1.2 Weights Analysis

In [85]:
model_trained: LogisticRegression = fit_and_eval_logistic_regression(
    feature_name='unigram_binary', data_dir=Path('data'))
weights = model_trained.coef_[0]
vocab = json.load(open(data_dir.joinpath('unigram_vocab.json')))
print_important_weights(weights=weights, words=vocab.keys())


Dev accuracy: 0.7821, Dev f1: 0.7879
Test accuracy: 0.804, Test f1: 0.8059
Most positive words:
 1.9463 | solid
 1.9113 | powerful
 1.8853 | remarkable
 1.7473 | fun
 1.7471 | refreshing
 1.6852 | enjoyable
 1.6273 | terrific
 1.6078 | definitely
 1.5995 | appealing
 1.5932 | works

Most negative words:
-2.0515 | stupid
-1.9932 | suffers
-1.9469 | worst
-1.9229 | mess
-1.9169 | dull
-1.8535 | unfortunately
-1.7629 | lacking
-1.6785 | bland
-1.6536 | none
-1.6491 | flat

Most neutral words:
 0.0000 | <pad>
 0.0000 | prophecies
 0.0001 | maintain
 0.0001 | manners
 0.0001 | hawn
 0.0001 | misogyny
 0.0002 | fanciful
 0.0002 | filming
 0.0003 | bollywood
 0.0004 | situations


In [None]:
# When you implement the other feature templates you choose for Part 2 "Feature Engineering", you'll make the same two calls above, adjusting the parameters as needed.
