# Raw data

First, we need to get the raw data. The 20 Newsgroups dataset is so popular that Scikit-Learn comes with some tools for getting the data. No scraping the web or unzipping data folders. We'll [preprocess the data](http://scikit-learn.org/stable/datasets/twenty_newsgroups.html#filtering-text-for-more-realistic-training) by removing headers, footers, and quotes (which Scikit-Learn also handles for us). This gives us just the raw post text. For speed, we'll also take a random sample of 500 posts instead of all of them.

In [1]:
from sklearn.datasets import fetch_20newsgroups

CATEGORIES = ('alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space')
RM = ('headers', 'footers', 'quotes')

def get_newsgroups(subset, remove=RM, categories=CATEGORIES, n=None):
    raw = fetch_20newsgroups(subset=subset, remove=remove, categories=categories, shuffle=True)
    n = n or len(raw.data)
    return raw.data[:n], raw.target[:n]

train_raw, train_y = get_newsgroups('train', n=500)

In [2]:
print("Number of raw posts: {0}".format(len(train_raw)))
print("Shape of train_y: {0}".format(train_y.shape))

Number of raw posts: 500
Shape of train_y: (500,)


We want to define all of our operations on the training data as functions, since we need to apply the exact same sequence to process the test data. `get_newsgroups` returns a list of 500 strings (the raw post text) and a NumPy vector with 500 entries, where `train_y[i]` is the category of `train_raw[i]`. For example, if `train_y[i]` is `0`, the true category of `train_raw[i]` is `alt.atheism`. The first call takes a minute or two, since it needs to download the data.
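As a small sketch of how the integer labels line up with category names: this assumes the labels follow the alphabetical order of the selected categories, and `label_names` is a name introduced here for illustration. The object returned by `fetch_20newsgroups` also carries a `target_names` attribute, which is the safer thing to check in practice.

```python
# Category names passed to fetch_20newsgroups in this notebook
CATEGORIES = ('alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space')

# Assumption: integer labels follow the alphabetical order of the names
label_names = sorted(CATEGORIES)

for label, name in enumerate(label_names):
    print(label, name)
```

Under that assumption, a label of `1` would correspond to `comp.graphics`.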

In [3]:
print(type(train_raw[0]))
print("\n{0}\n".format('-'*40))
print(train_raw[0])
print("\n{0}\n".format('-'*40))
print("TRUE LABEL: {0}".format(train_y[0]))

<class 'str'>

----------------------------------------

Hi,

I've noticed that if you only save a model (with all your mapping planes
positioned carefully) to a .3DS file that when you reload it after restarting
3DS, they are given a default position and orientation.  But if you save
to a .PRJ file their positions/orientation are preserved.  Does anyone
know why this information is not stored in the .3DS file?  Nothing is
explicitly said in the manual about saving texture rules in the .PRJ file. 
I'd like to be able to read the texture rule information, does anyone have 
the format for the .PRJ file?

Is the .CEL file format available from somewhere?

Rych

----------------------------------------

TRUE LABEL: 1


# Structured data

Now it's time for spaCy. We need to parse the raw text (most importantly, tokenize it) so that we have a structured format we can create features with. We'll always want to do this, so let's just wrap `get_newsgroups(...)`.

In [4]:
import spacy

nlp = spacy.load('en')

def get_newsgroups_parsed(subset, nlp=nlp, remove=RM, categories=CATEGORIES, n=None):
    raw, y = get_newsgroups(subset, remove, categories, n)
    return [nlp(text) for text in raw], y

train_parsed, train_y = get_newsgroups_parsed('train', n=500)

In [5]:
print(type(train_parsed[0]))
print("\n{0}\n".format('-'*40))
print(train_parsed[0])
print("\n{0}\n".format('-'*40))
print("TRUE LABEL: {0}".format(train_y[0]))

<class 'spacy.tokens.doc.Doc'>

----------------------------------------

Hi,

I've noticed that if you only save a model (with all your mapping planes
positioned carefully) to a .3DS file that when you reload it after restarting
3DS, they are given a default position and orientation.  But if you save
to a .PRJ file their positions/orientation are preserved.  Does anyone
know why this information is not stored in the .3DS file?  Nothing is
explicitly said in the manual about saving texture rules in the .PRJ file. 
I'd like to be able to read the texture rule information, does anyone have 
the format for the .PRJ file?

Is the .CEL file format available from somewhere?

Rych

----------------------------------------

TRUE LABEL: 1


# Feature engineering

Our features will be bags of n-grams: for each post, we need the list of unigrams and bigrams that appear in it. But first, we need to do some engineering. The function below takes a post and performs the following operations:

* Filter out punctuation, numbers, etc. from unigrams and bigrams
* Filter out stopwords (words with no semantic information like "the" or "and") from unigrams, and drop bigrams in which both words are stopwords
* Lowercase all words
* Lemmatize all verbs (i.e. take the root)
* Use the named entity tags from spaCy to replace all person names with a special word (`~~PERSON~~`) in bigrams

This lets us generalize the information in the posts, so that our model won't overfit to the specific language used in the training set. The function isn't very general or efficient, but it illustrates the behavior we want pretty clearly.

In [6]:
PERSON = '~~PERSON~~'

# Check if is person, returns PERSON if wanted
# Runs is_alpha filter on token, returns None if failed
# Otherwise, lemmatizes if a verb and returns lowercased string
def process_token(token, swap_person=False):
    # Replace with person tag
    if swap_person and token.ent_type_ == 'PERSON': return PERSON
    # Run filter
    if not token.is_alpha: return None
    # Get lemma if verb
    s = token.lemma_ if token.pos_.startswith('V') else token.string
    # Lowercase
    return s.strip().lower()


# Get the list of unigrams and bigrams for each post
def featurize_post(post):
    unigrams, bigrams = [], []
    for sentence in post.sents:
        last_word, last_token = None, None
        for token in sentence:
            # First add unigram if not a stopword
            word = process_token(token, swap_person=False)
            if word is not None and not token.is_stop:
                unigrams.append(word)
            # Then add the preceding bigram unless both tokens are stopwords
            word_swap = process_token(token, swap_person=True)
            if (word_swap is not None) and (last_word is not None):
                if not token.is_stop or not last_token.is_stop:
                    bigrams.append((last_word, word_swap))
            last_word, last_token = word_swap, token
    return unigrams, bigrams


def featurize_all_posts(posts):
    return zip(*[featurize_post(post) for post in posts])
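
To see the filtering rules in isolation, here is a toy sketch that works on plain strings instead of spaCy tokens. It assumes a tiny hand-written stopword set and skips lemmatization and the person-name swap, so `STOPWORDS` and `toy_featurize` are illustrative stand-ins rather than part of the pipeline.

```python
# Hypothetical stopword list, for illustration only (spaCy ships its own)
STOPWORDS = {'the', 'a', 'of', 'is', 'to', 'and'}

def toy_featurize(words):
    """Return (unigrams, bigrams) for one sentence given as a list of strings."""
    unigrams, bigrams = [], []
    last = None
    for raw in words:
        # Drop anything that is not purely alphabetic, then lowercase
        word = raw.lower() if raw.isalpha() else None
        # Unigrams: keep only non-stopwords
        if word is not None and word not in STOPWORDS:
            unigrams.append(word)
        # Bigrams: keep unless both words are stopwords
        if word is not None and last is not None:
            if word not in STOPWORDS or last not in STOPWORDS:
                bigrams.append((last, word))
        last = word
    return unigrams, bigrams

unigrams, bigrams = toy_featurize(['The', 'age', 'of', 'the', 'Viking', '42', 'ships'])
print(unigrams)
print(bigrams)
```

Note that a filtered token such as `'42'` resets the bigram chain, so no bigram bridges across it.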

Let's featurize all of the posts and collect all of the unigrams and bigrams in the training set to create a dictionary.

In [7]:
train_unigrams, train_bigrams = featurize_all_posts(train_parsed)

In [8]:
all_unigrams = list(set(gram for post in train_unigrams for gram in post))
all_bigrams = list(set(gram for post in train_bigrams for gram in post))

In [9]:
print("Unigram dictionary size (i.e. total number of unigram features): {0}".format(len(all_unigrams)))
print("\nFirst 15 entries:")
print("\n".join("{0:<4}{1}".format(k, v) for k, v in enumerate(all_unigrams[:15])))

Unigram dictionary size (i.e. total number of unigram features): 8950

First 15 entries:
0   obey
1   greenbelt
2   nose
3   tag
4   tactic
5   leverrier
6   older
7   peanut
8   hatred
9   neal
10  coldest
11  disasters
12  anthony
13  compensation
14  constant


In [10]:
print("Bigram dictionary size (i.e. total number of bigram features): {0}".format(len(all_bigrams)))
print("\nFirst 15 entries:")
print("\n".join("{0:<4}{1}".format(k, v) for k, v in enumerate(all_bigrams[:15])))

Bigram dictionary size (i.e. total number of bigram features): 29917

First 15 entries:
0   ('age', 'of')
1   ('echo', 'balloons')
2   ('it', 'go')
3   ('en', 'studenten')
4   ('new', 'codecs')
5   ('data', 'system')
6   ('argument', 'like')
7   ('to', 'interstellar')
8   ('years', 'stomp')
9   ('the', 'viking')
10  ('you', 'steal')
11  ('sake', 'of')
12  ('announce', 'that')
13  ('analyze', 'an')
14  ('should', 'consider')


# Matrix format

## Bag-of-ngrams

Let's start with a simple representation: a bag-of-ngrams. We'll create a unigram matrix $X_{\text{unigram}}$ (dimensions $500 \times 8950$) and a bigram matrix $X_{\text{bigram}}$ (dimensions $500 \times 29917$), then just stack them side by side to create our final $X$ (dimensions $500 \times 38867$). We'll write a couple of functions to do this.

In [11]:
import numpy as np
from collections import Counter


# Construct X_unigram or X_bigram
def posts_to_matrix(posts, vocab):
    # Initialize matrix
    X = np.zeros((len(posts), len(vocab)))
    for i, post in enumerate(posts):
        # Get counts of ngrams in post that are also in vocab
        post_filtered = [p for p in Counter(post).items() if p[0] in vocab]
        if len(post_filtered) > 0:
            keys, cts = zip(*post_filtered)
            # Insert counts into row using vocabulary index
            X[i, np.ravel([vocab[k] for k in keys])] = np.ravel(cts)
    return X


# Constructs X_unigram and X_bigram, then merges
def construct_feature_matrix(unigrams, bigrams, unigram_vocab, bigram_vocab):
    X_unigram = posts_to_matrix(unigrams, unigram_vocab)
    X_bigram = posts_to_matrix(bigrams, bigram_vocab)
    return np.hstack((X_unigram, X_bigram))

Now we just construct the vocabularies as dictionaries mapping each n-gram to its fixed index, then build the matrix.

In [12]:
unigram_vocab = {unigram: i for i, unigram in enumerate(all_unigrams)}
bigram_vocab = {bigram: i for i, bigram in enumerate(all_bigrams)}

train_X = construct_feature_matrix(train_unigrams, train_bigrams, unigram_vocab, bigram_vocab)
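
One thing worth knowing: the iteration order of a Python `set` of strings can change between interpreter sessions, so indices built from `list(set(...))` are only stable within a single run. A minimal sketch of a reproducible alternative, assuming sorted order is acceptable (`build_vocab` is a hypothetical helper, not something defined elsewhere in this notebook):

```python
def build_vocab(grams):
    """Map each distinct n-gram to a fixed index, in sorted order."""
    return {gram: i for i, gram in enumerate(sorted(set(grams)))}

# Toy inputs: unigrams are strings, bigrams are tuples of strings
toy_unigrams = ['space', 'orbit', 'space', 'god']
toy_bigrams = [('the', 'orbit'), ('in', 'space'), ('the', 'orbit')]

toy_unigram_vocab = build_vocab(toy_unigrams)
toy_bigram_vocab = build_vocab(toy_bigrams)
print(toy_unigram_vocab)
print(toy_bigram_vocab)
```

This matters mostly if the vocabulary or a trained model is saved and reloaded later.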

In [13]:
print("X shape:                     {0}".format(train_X.shape))
print("# of non-zero entries in X:  {0}".format(np.sum(train_X != 0)))
print("% of non-zero entries in X:  {0:.2f}%".format(100 * np.mean(train_X != 0)))

X shape:                     (500, 38867)
# of non-zero entries in X:  66389
% of non-zero entries in X:  0.34%


The matrix is very large, but quite sparse (almost all of the entries are zero). Our representation is very inefficient, but it's fine for our purposes. If we're training a simple softmax regression model, that means we have to train one parameter per ngram per category:

$\theta_{1,1}, \theta_{1,2}, ...\theta_{4, 38867}$&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;or&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$\theta_{\text{Atheism},\text{ (able, )}}, \theta_{\text{Atheism},\text{ (alert, )}}, ...\theta_{\text{Space}, \text{ (zebra, run)}}$
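
For reference, here is a sketch of the softmax regression model these parameters belong to, in standard notation. The symbols $x$ (one row of $X$), $d$ (the number of n-gram features), and $K$ (the number of categories) are introduced here for illustration, and intercept terms are left out:

$$P(y = c \mid x) = \frac{\exp\left(\theta_c^\top x\right)}{\sum_{k=1}^{K} \exp\left(\theta_k^\top x\right)}, \qquad \theta_c = (\theta_{c,1}, \ldots, \theta_{c,d}), \qquad K = 4$$

Each category $c$ gets its own weight vector $\theta_c$, which is why the parameter count is the number of categories times the number of features.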

## Word vectors

Finally, let's create a different matrix format using word vectors: the 300-dimensional vectors we use to represent words. We'll just use unigrams for this one. We'll first construct the embedding table: the list of all the embeddings we care about.

In [14]:
U = np.vstack([nlp(word).vector for word in all_unigrams])

Now we'll write a function to collect all of the embeddings in a post. The X we return will be a list, but we can think of it like a matrix with shape $[\text{Number of posts}, \text{Length of post}, \text{Embedding dimension}]$. For our purposes, this is $[500, \textit{Varies}, 300]$ since spaCy defaults to the 300-dimensional GloVe embeddings.

In [15]:
def posts_to_embedding_matrix(posts, vocab, embedding_table):
    X = []
    for post in posts:
        post_filtered = np.ravel([vocab[word] for word in post if word in vocab])
        if len(post_filtered) == 0:
            X.append(np.zeros((1, embedding_table.shape[1])))
        else:
            X.append(embedding_table[post_filtered, :])
    return X

We want one vector for each post, so we can just take the means down the columns of the per-post matrices returned by `posts_to_embedding_matrix` and stack all of those to create one $500 \times 300$ matrix.

In [16]:
def construct_feature_matrix_embedding(posts, vocab, embedding_table):
    X_embedding_all = posts_to_embedding_matrix(posts, vocab, embedding_table)
    return np.vstack([np.mean(X_post, axis=0) for X_post in X_embedding_all])

train_X_embedding = construct_feature_matrix_embedding(train_unigrams, unigram_vocab, U)
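
To make the averaging step concrete, here is a toy sketch with a made-up 3-word vocabulary and 2-dimensional vectors. The names `toy_vocab`, `toy_table`, and `mean_embedding` are hypothetical and only mirror the idea of the functions in this notebook.

```python
import numpy as np

# Hypothetical toy vocabulary and embedding table (3 words, 2 dimensions)
toy_vocab = {'space': 0, 'orbit': 1, 'god': 2}
toy_table = np.array([[1.0, 0.0],
                      [3.0, 2.0],
                      [0.0, 4.0]])

def mean_embedding(post, vocab, table):
    """Average the embeddings of the in-vocabulary words in one post."""
    rows = [vocab[w] for w in post if w in vocab]
    if not rows:
        # No known words: fall back to a zero vector
        return np.zeros(table.shape[1])
    return table[rows].mean(axis=0)

known = mean_embedding(['space', 'orbit', 'unknown'], toy_vocab, toy_table)
empty = mean_embedding(['unknown'], toy_vocab, toy_table)
print(known)
print(empty)
```

Words outside the vocabulary are simply skipped, and a post with no known words maps to the zero vector.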

In [17]:
print("X_embedding shape:                     {0}".format(train_X_embedding.shape))
print("# of non-zero entries in X_embedding:  {0}".format(np.sum(train_X_embedding != 0)))
print("% of non-zero entries in X_embedding:  {0:.2f}%".format(100 * np.mean(train_X_embedding != 0)))

X_embedding shape:                     (500, 300)
# of non-zero entries in X_embedding:  144300
% of non-zero entries in X_embedding:  96.20%


This new representation is much smaller ($500 \times 300$ vs. $500 \times 40000$ish), but much denser. Overall, it has more non-zero entries than the bag-of-words matrix. The information is much less granular, since we took the mean of the word embeddings. If we're training a simple softmax regression model, that means we have to train one parameter per dimension per category:

$\theta_{1,1}, \theta_{1,2}, ...\theta_{4, 300}$&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;or&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$\theta_{\text{Atheism},\text{ Dim. 1}}, \theta_{\text{Atheism},\text{ Dim. 2}}, ...\theta_{\text{Space}, \text{ Dim. 300}}$

That's way fewer than we had to train for the bag-of-words model.
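
A quick back-of-the-envelope count, using the dictionary sizes printed earlier and ignoring intercept terms (the variable names are introduced here for illustration):

```python
n_categories = 4
n_ngram_features = 8950 + 29917  # unigram + bigram dictionary sizes
n_embedding_dims = 300

# One weight per feature per category
bow_weights = n_categories * n_ngram_features
embedding_weights = n_categories * n_embedding_dims

print(bow_weights, embedding_weights)
print("ratio: {0:.1f}".format(bow_weights / embedding_weights))
```

So the bag-of-words model has on the order of a hundred times more weights to fit from the same 500 posts.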

# Training a simple model (or two)

We now have two matrix representations of our posts. Let's try training a couple models and see which does better! We'll use sklearn.

In [18]:
from sklearn.linear_model import LogisticRegression

model_kwargs = {
    'penalty': 'l2',
    'C': 1.0,
    'solver': 'lbfgs',
    'multi_class': 'multinomial',
    'verbose': 0
}

In [19]:
F_bow = LogisticRegression(**model_kwargs)
F_bow.fit(train_X, train_y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='multinomial',
          n_jobs=1, penalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False)

In [20]:
F_embedding = LogisticRegression(**model_kwargs)
F_embedding.fit(train_X_embedding, train_y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='multinomial',
          n_jobs=1, penalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False)

## Model selection

So which of these models is better? They both trained without giving an error. We could compare how well they do on the training set.

In [21]:
bow_acc = F_bow.score(train_X, train_y)
embedding_acc = F_embedding.score(train_X_embedding, train_y)

print("Bag-of-words model training set accuracy: {0:.2f}%".format(100 * bow_acc))
print("Embedding model training set accuracy:    {0:.2f}%".format(100 * embedding_acc))

Bag-of-words model training set accuracy: 97.60%
Embedding model training set accuracy:    87.40%


So the bag-of-words model does much better on the training set. But maybe it's just overfit to that dataset and won't generalize well. The best measure of model performance is its score on an unseen validation set. The 20 Newsgroups dataset has a test set, which we'll pretend is a validation set. The two are used in the same way, but we really shouldn't look at a true test set until we want a final, final score. Side note: look how easy this is now that we've written those functions.

In [22]:
# Get data
test_parsed, test_y = get_newsgroups_parsed('test')
# Featurize
test_unigrams, test_bigrams = featurize_all_posts(test_parsed)
# Build BoW matrix
test_X = construct_feature_matrix(test_unigrams, test_bigrams, unigram_vocab, bigram_vocab)
# Build embedding matrix
test_X_embedding = construct_feature_matrix_embedding(test_unigrams, unigram_vocab, U)

In [23]:
bow_test_acc = F_bow.score(test_X, test_y)
embedding_test_acc = F_embedding.score(test_X_embedding, test_y)

print("Bag-of-words model test set accuracy: {0:.2f}%".format(100 * bow_test_acc))
print("Embedding model test set accuracy:    {0:.2f}%".format(100 * embedding_test_acc))

Bag-of-words model test set accuracy: 63.93%
Embedding model test set accuracy:    68.00%


Look at that: the bag-of-words model is overfit. It does much better on the training set, but the embedding model does better on the test set. So how can we improve these models? The best way is certainly to use the full training set. But we could also tune hyperparameters. In `model_kwargs`, `C` describes the strength of the $\ell_2$ regularizer. Increasing the strength (reducing `C`) could reduce overfitting for `F_bow`.
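
A minimal sketch of what tuning `C` could look like. To stay self-contained it uses synthetic data from `make_classification` as a stand-in for the newsgroup features, and a held-out split created with `train_test_split`; the grid of `C` values and the `scores` dictionary are arbitrary choices for illustration, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: NOT the newsgroup features)
X, y = make_classification(n_samples=400, n_features=50, n_informative=10,
                           n_classes=4, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Smaller C means a stronger l2 penalty
scores = {}
for C in (0.01, 0.1, 1.0, 10.0):
    model = LogisticRegression(C=C, max_iter=1000)
    model.fit(X_train, y_train)
    scores[C] = (model.score(X_train, y_train), model.score(X_val, y_val))

for C, (train_acc, val_acc) in scores.items():
    print("C={0:<5} train={1:.2f} val={2:.2f}".format(C, train_acc, val_acc))
```

The thing to watch is the gap between the training and validation columns: a value of `C` that narrows the gap without hurting validation accuracy is a reasonable candidate.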