# Binary Text Classification with Perceptron

In [1]:
import random
import numpy as np
from tqdm.notebook import tqdm

# set this variable to a number to be used as the random seed
# or to None if you don't want to set a random seed
seed = 1234

if seed is not None:
    random.seed(seed)
    np.random.seed(seed)

The dataset is divided into two directories called `train` and `test`.
These directories contain the training and testing splits of the dataset, respectively.

In [2]:
!ls -lh data/aclImdb/

total 3432
-rw-r--r--@ 1 msurdeanu  staff   826K Sep 19  2022 imdb.vocab
-rw-r--r--@ 1 msurdeanu  staff   882K Sep 19  2022 imdbEr.txt
-rw-r--r--@ 1 msurdeanu  staff   3.9K Sep 19  2022 README
drwxr-xr-x@ 8 msurdeanu  staff   256B Mar 17 17:01 test
drwxr-xr-x@ 8 msurdeanu  staff   256B Mar 17 17:01 train


Both the `train` and `test` directories contain two directories called `pos` and `neg` that contain text files with the positive and negative reviews, respectively.

In [3]:
!ls -lh data/aclImdb/train/

total 43464
-rw-r--r--@     1 msurdeanu  staff    20M Sep 19  2022 labeledBow.feat
drwxr-xr-x@ 12502 msurdeanu  staff   391K Sep 19  2022 neg
drwxr-xr-x@ 12502 msurdeanu  staff   391K Sep 19  2022 pos
-rw-r--r--@     1 msurdeanu  staff   598K Sep 19  2022 urls_neg.txt
-rw-r--r--@     1 msurdeanu  staff   598K Sep 19  2022 urls_pos.txt


We will now read the filenames of the positive and negative examples.

In [5]:
from glob import glob

pos_files = glob('data/aclImdb/train/pos/*.txt')
neg_files = glob('data/aclImdb/train/neg/*.txt')

print('number of positive reviews:', len(pos_files))
print('number of negative reviews:', len(neg_files))

number of positive reviews: 12500
number of negative reviews: 12500


Now, we will use a [`CountVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to read the text files, tokenize them, and acquire a vocabulary from the training data. It then encodes the reviews in a document-term matrix in which each row represents a review and each column represents a term in the vocabulary. Each element $(i,j)$ in the matrix is the number of times term $j$ appears in example $i$.

In [6]:
from sklearn.feature_extraction.text import CountVectorizer

# initialize CountVectorizer indicating that we will give it a list of filenames that have to be read
cv = CountVectorizer(input='filename')

# learn vocabulary and return sparse document-term matrix
doc_term_matrix = cv.fit_transform(pos_files + neg_files)
doc_term_matrix

<25000x74849 sparse matrix of type '<class 'numpy.int64'>'
	with 3445861 stored elements in Compressed Sparse Row format>

Note in the message printed above that the matrix has shape (25000, 74849).
In other words, it has 1,871,225,000 elements.
However, only 3,445,861 elements were stored.
This is because most of the elements in the matrix are zeros:
each review is short and uses only a small fraction of the words in the vocabulary.
A matrix that stores only its non-zero values is called *sparse*.

Now we will convert it to a dense numpy array:

In [8]:
X_train = doc_term_matrix.toarray()
X_train.shape

(25000, 74849)
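The dense array is much larger than the sparse matrix it came from. The following back-of-the-envelope sketch uses only the shape, element count, and `int64` type reported in the outputs above; the estimate for the sparse side counts just the stored values and ignores the index arrays that the sparse format also keeps, so it is a rough lower bound.

```python
import numpy as np

# shape, stored-element count, and dtype reported in the notebook outputs
n_rows, n_cols = 25000, 74849
n_stored = 3445861
itemsize = np.dtype(np.int64).itemsize  # 8 bytes per element

# every element of the dense array occupies memory, zeros included
dense_bytes = n_rows * n_cols * itemsize
# rough lower bound for the sparse matrix: stored values only
stored_value_bytes = n_stored * itemsize

print(f"dense array: {dense_bytes / 1e9:.1f} GB")
print(f"stored values only: {stored_value_bytes / 1e6:.1f} MB")
```

If that much memory is not available, one possible alternative (not used in this notebook) is to keep the sparse matrix and convert one row at a time, for example with `doc_term_matrix[i].toarray().ravel()`.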

We will also create a numpy array with the binary labels for the reviews.
One indicates a positive review and zero a negative review.
The label `y_train[i]` corresponds to the review encoded in row `i` of the `X_train` matrix.

In [10]:
# training labels
y_pos = np.ones(len(pos_files))
y_neg = np.zeros(len(neg_files))
y_train = np.concatenate([y_pos, y_neg])
y_train

array([1., 1., 1., ..., 0., 0., 0.])

Now we will initialize our model, in the form of an array of weights `w` of the same size as the number of features in our dataset (i.e., the number of words in the vocabulary acquired by [`CountVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)), and a bias term `b`.
Both are initialized to zeros.

In [12]:
# initialize model: the weight vector and bias term are populated with zeros
n_examples, n_features = X_train.shape
w = np.zeros(n_features)
b = 0

Now we will use the perceptron learning algorithm to learn the values of `w` and `b` from our training data.

In [14]:
n_epochs = 10

indices = np.arange(n_examples)
for epoch in range(n_epochs):
    n_errors = 0
    # randomize the order in which training examples are seen in this epoch
    np.random.shuffle(indices)
    # traverse the training data
    for i in tqdm(indices, desc=f'epoch {epoch+1}'):
        x = X_train[i]
        y_true = y_train[i]
        # the perceptron decision based on the current model
        score = x @ w + b
        y_pred = 1 if score > 0 else 0
        # update the model if the prediction was incorrect
        if y_true == y_pred:
            continue
        elif y_true == 1 and y_pred == 0:
            w = w + x
            b = b + 1
            n_errors += 1
        elif y_true == 0 and y_pred == 1:
            w = w - x
            b = b - 1
            n_errors += 1
    if n_errors == 0:
        break
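The same update rule can be followed by hand on a tiny dataset. The sketch below is an illustration under stated assumptions: the data `X_toy` and `y_toy` are made up and linearly separable, the examples are visited in a fixed order (no shuffling) so the run is reproducible, and the two error branches are folded into a single update using `y_true - y_pred`, which is `+1`, `-1`, or `0`.

```python
import numpy as np

# hypothetical toy data: two count features, linearly separable
X_toy = np.array([[2, 0], [3, 1], [0, 2], [1, 3]])
y_toy = np.array([1, 1, 0, 0])

w_toy = np.zeros(X_toy.shape[1])
b_toy = 0

for epoch in range(10):
    n_errors = 0
    for x, y_true in zip(X_toy, y_toy):
        y_pred = 1 if x @ w_toy + b_toy > 0 else 0
        # +1: missed a positive, -1: false alarm, 0: correct (no change)
        update = y_true - y_pred
        w_toy = w_toy + update * x
        b_toy = b_toy + update
        n_errors += abs(update)
    print(f"epoch {epoch + 1}: {n_errors} errors")
    # stop as soon as a full pass makes no mistakes
    if n_errors == 0:
        break

print("weights:", w_toy, "bias:", b_toy)
```

On this toy data the first pass makes two mistakes and the second pass makes none, so the loop stops early. On data that is not linearly separable, the error count need not reach zero and the loop runs for all epochs.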


The next step is evaluating the model on the test dataset.
Note that this time we use the [`transform()`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer.transform) method of the [`CountVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html), instead of the [`fit_transform()`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer.fit_transform) method that we used above. This is because we want to use the learned vocabulary in the test set, instead of learning a new one.

In [15]:
pos_files = glob('data/aclImdb/test/pos/*.txt')
neg_files = glob('data/aclImdb/test/neg/*.txt')
doc_term_matrix = cv.transform(pos_files + neg_files)
X_test = doc_term_matrix.toarray()
y_pos = np.ones(len(pos_files))
y_neg = np.zeros(len(neg_files))
y_test = np.concatenate([y_pos, y_neg])

Using the model is easy: multiply the document-term matrix by the learned weights, add the bias, and predict the positive class wherever the resulting score is greater than zero.
We use Python's `@` operator to perform the matrix-vector multiplication.

In [16]:
y_pred = (X_test @ w + b) > 0
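As a sanity check, the vectorized expression gives the same decisions as scoring one example at a time, which is how the training loop works. The small matrix, weights, and bias below are made-up values chosen only for illustration.

```python
import numpy as np

# hypothetical small model and inputs
X_small = np.array([[1, 0, 2], [0, 3, 0], [1, 1, 1]])
w_small = np.array([1.0, -1.0, 0.5])
b_small = -1

# one example at a time, as in the training loop
loop_pred = np.array([(x @ w_small + b_small) > 0 for x in X_small])

# all examples at once: matrix-vector product, then broadcast the bias
vec_pred = (X_small @ w_small + b_small) > 0

print(loop_pred)
print(vec_pred)
```

Both arrays hold booleans, where `True` stands for the positive class and `False` for the negative class.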

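Now we evaluate the prediction results with a small function of our own that computes metrics similar to those reported by scikit-learn's [`classification_report()`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html) function: precision, recall, F1 score, support, and accuracy for the positive class.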

In [17]:
def binary_classification_report(y_true, y_pred):
    # count true positives, false positives, true negatives, and false negatives
    tp = fp = tn = fn = 0
    for gold, pred in zip(y_true, y_pred):
        if pred == True:
            if gold == True:
                tp += 1
            else:
                fp += 1
        else:
            if gold == False:
                tn += 1
            else:
                fn += 1
    # calculate precision and recall
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # calculate f1 score
    fscore = 2 * precision * recall / (precision + recall)
    # calculate accuracy
    accuracy = (tp + tn) / len(y_true)
    # number of positive labels in y_true
    support = sum(y_true)
    return {
        "precision": precision,
        "recall": recall,
        "f1-score": fscore,
        "support": support,
        "accuracy": accuracy,
    }

In [18]:
binary_classification_report(y_test, y_pred)

{'precision': 0.9507438971544129,
 'recall': 0.52656,
 'f1-score': 0.6777531792205118,
 'support': 12500.0,
 'accuracy': 0.74964}
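A hand-written metric function can be cross-checked against scikit-learn. The sketch below uses made-up labels (`gold` and `pred` are hypothetical, not the notebook's predictions) and the library functions `precision_recall_fscore_support()` and `accuracy_score()`; with `average="binary"` the scores refer to the positive class, as in the function above.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# hypothetical gold labels and predictions:
# 2 true positives, 1 false positive, 1 false negative, 2 true negatives
gold = [1, 1, 1, 0, 0, 0]
pred = [1, 1, 0, 1, 0, 0]

# average="binary" reports scores for the positive class (label 1);
# the fourth return value (support) is not used here
precision, recall, fscore, _ = precision_recall_fscore_support(
    gold, pred, average="binary"
)
accuracy = accuracy_score(gold, pred)

print(f"precision={precision:.3f} recall={recall:.3f} "
      f"f1={fscore:.3f} accuracy={accuracy:.3f}")
```

Running the same comparison on `y_test` and `y_pred` should reproduce the numbers returned by `binary_classification_report()`, assuming both are computed for the positive class.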