In this notebook, we'll build a model to classify online posts using a Naive Bayes classifier and a neural network.

Below we download the posts from the 20 Newsgroups dataset using scikit-learn's `fetch_20newsgroups` and print the available newsgroup names.

In [1]:
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train')

from pprint import pprint
pprint(list(newsgroups_train.target_names))

 

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']


Here is a sample of typical text from one post:

>Recently, RAs have been ordered (and none have resisted or cared about
> it apparently) to post a religious flyer entitled _The Soul Scroll: Thoughts
> on religion, spirituality, and matters of the soul_ on the inside of bathroom
> stall doors. (at my school, the University of New Hampshire) It is some sort
> of newsletter assembled by a Hall Director somewhere on campus. It poses a


To limit the size of this experiment, we will restrict ourselves to four newsgroups.
The dataset comes already split into train and test subsets.
Here we load the training subset, which we will use to train a naive Bayes classifier.

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer
categories = ['alt.atheism', 'talk.religion.misc',
              'comp.graphics', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories)


Next we convert the training posts to TF-IDF vectors and print the newsgroup names we are using as a sanity check.

In [3]:
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(newsgroups_train.data)
list(newsgroups_train.target_names)

['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc']
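
To make the vectorization step concrete, here is a minimal sketch on a made-up three-document corpus (the documents and the names `toy_vectorizer`, `toy_vectors`, and `new_vec` are illustrative assumptions, not part of the notebook). It shows that `fit_transform` learns the vocabulary, while `transform` reuses it for unseen text.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, made up for illustration (not taken from 20 Newsgroups).
docs = [
    "the rocket reached orbit",
    "the shader renders the scene",
    "orbit of the moon",
]

toy_vectorizer = TfidfVectorizer()
toy_vectors = toy_vectorizer.fit_transform(docs)  # sparse matrix, one row per document

print(toy_vectors.shape)                    # (n_documents, n_vocabulary_terms)
print(sorted(toy_vectorizer.vocabulary_))   # the learned vocabulary

# Unseen text is mapped with transform(), which reuses the fitted vocabulary;
# words outside that vocabulary (here "lander") are simply ignored.
new_vec = toy_vectorizer.transform(["the moon lander"])
print(new_vec.shape)
```

This is why the later cells call `vectorizer.transform` rather than `fit_transform` on data the vectorizer was not fitted on.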

We will now train on the training dataset and, as a first check, score the classifier on that same data.

In [5]:
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

# Fit the classifier on the TF-IDF training vectors from the previous cell
clf = MultinomialNB(alpha=.01)
clf.fit(vectors, newsgroups_train.target)

# Score on the same training posts (despite its name, vectors_test holds training data here)
vectors_test = vectorizer.transform(newsgroups_train.data)
pred = clf.predict(vectors_test)
metrics.f1_score(newsgroups_train.target, pred, average='macro')


0.9994081859552928
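
With `average='macro'`, F1 is computed per class and then averaged without weighting by class size. A minimal pure-Python sketch on made-up labels follows; the helper `macro_f1` is a hypothetical name, not part of scikit-learn.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 = 2*TP / (2*TP + FP + FN)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Made-up labels for three classes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

score = macro_f1(y_true, y_pred)
print(round(score, 4))
```

Because the per-class formula treats false positives and false negatives alike, swapping the two arguments gives the same macro F1 in this sketch.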

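The macro F1 score on the training data is very high, but that is expected when a model is scored on the data it was fit on, and it may well be overfitted.
Now try the test data.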

In [6]:
newsgroups_test = fetch_20newsgroups(subset='test',
                                     categories=categories)
vectors_test = vectorizer.transform(newsgroups_test.data)
pred = clf.predict(vectors_test)
metrics.f1_score(newsgroups_test.target, pred, average='macro')

0.8821359240272957

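A macro F1 of 0.88 is pretty respectable, but there are issues with the data.
Headers and footers contain identifying information such as email addresses, and quotes are chains of previous posts copied into a new post.
The classifier can latch onto these instead of the content of the post itself.
Let's remove them from the test data and evaluate again.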

In [7]:
newsgroups_test = fetch_20newsgroups(subset='test',
                                     remove=('headers', 'footers', 'quotes'),
                                     categories=categories)
vectors_test = vectorizer.transform(newsgroups_test.data)
pred = clf.predict(vectors_test)
metrics.f1_score(newsgroups_test.target, pred, average='macro')

0.7731035068127478
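
For intuition about what `remove=('headers', 'footers', 'quotes')` discards, here is a simplified sketch on a made-up post. The regular expression and the helpers `strip_header` and `strip_quotes` are assumptions for illustration; scikit-learn's own heuristics differ in detail, and footer removal is left out.

```python
import re

# Simplified rule (assumed): a line is quoted material if it starts with
# '>' or '|', or if it is an attribution line such as "... writes:".
QUOTE_LINE = re.compile(r"^(>|\|)|writes:|wrote:")

def strip_header(text):
    """Drop everything up to the first blank line (the header block)."""
    _header, _sep, body = text.partition("\n\n")
    return body

def strip_quotes(text):
    """Drop quoted lines and the attribution lines that introduce them."""
    kept = [line for line in text.split("\n") if not QUOTE_LINE.search(line)]
    return "\n".join(kept)

# Made-up post with a header, an attribution line, a quote, and a reply.
post = (
    "From: someone@example.edu\n"
    "Subject: Re: orbits\n"
    "\n"
    "In article <1234> another@example.org writes:\n"
    "> Which orbit is the probe in?\n"
    "It depends on the mission.\n"
)

cleaned = strip_quotes(strip_header(post))
print(cleaned)
```

Only the reply line survives, so email addresses and copied text can no longer act as shortcuts for the classifier.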

Stripping headers, footers, and quotes from the test posts clearly lowered the score, from about 0.88 to 0.77. Let's retrain on training data with them removed as well.

In [8]:
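# Reload the training posts with headers, footers, and quotes stripped
newsgroups_train = fetch_20newsgroups(subset='train',
                                      remove=('headers', 'footers', 'quotes'),
                                      categories=categories)
# Features of the stripped training posts, used below for scoring only
vectors_test = vectorizer.transform(newsgroups_train.data)
clf2 = MultinomialNB(alpha=.01)
# Note: `vectors` still holds the features computed in In [3] from the unstripped posts
clf2.fit(vectors, newsgroups_train.target)
pred = clf2.predict(vectors_test)
metrics.f1_score(newsgroups_train.target, pred, average='macro')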

0.9655349072873998

Run the test data through the new classifier.

In [9]:
newsgroups_test = fetch_20newsgroups(subset='test',
                                     remove=('headers', 'footers', 'quotes'),
                                     categories=categories)
vectors_test = vectorizer.transform(newsgroups_test.data)
clf2 = MultinomialNB(alpha=.01)
clf2.fit(vectors, newsgroups_train.target)
pred = clf2.predict(vectors_test)
metrics.f1_score(newsgroups_test.target, pred, average='macro')

0.7731035068127478
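
A retraining pass on stripped text would typically refit the vectorizer as well, so that the vocabulary and IDF weights come from the stripped posts. The sketch below shows that pattern on made-up toy posts; `vectorizer2`, `vectors2`, `clf3`, and `pred3` are hypothetical names that do not appear in the notebook, and no claim is made about the score this would give on the real data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy stand-ins for the stripped train and test posts (made up for illustration).
train_texts = ["the rocket reached orbit", "launch window for the probe",
               "shader renders the scene", "polygon mesh and texture"]
train_labels = [0, 0, 1, 1]            # 0 = space, 1 = graphics
test_texts = ["probe reached orbit", "texture mesh scene"]

# Refit the vectorizer on the text the classifier is actually trained on.
vectorizer2 = TfidfVectorizer()
vectors2 = vectorizer2.fit_transform(train_texts)

clf3 = MultinomialNB(alpha=.01)
clf3.fit(vectors2, train_labels)

# Test text is only transformed with the fitted vectorizer, never refitted.
pred3 = clf3.predict(vectorizer2.transform(test_texts))
print(pred3)
```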


No improvement: the test score is unchanged, since `clf2` was fit on the same `vectors` as `clf`. Now let's look at the problem using GloVe vectors.

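Start by downloading the GloVe vectors we will use to represent our post data. The archive is the 6B set, which contains 50-, 100-, 200-, and 300-dimensional vectors; the code below loads the 100-dimensional file, `glove.6B.100d.txt`.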

In [10]:
!wget http://nlp.stanford.edu/data/glove.6B.zip

--2021-01-07 20:43:30--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2021-01-07 20:43:30--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2021-01-07 20:43:30--  http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


2021-0

Below we unzip the GloVe file we downloaded.

In [None]:
!unzip glove.6B.zip

Archive:  glove.6B.zip
  inflating: glove.6B.50d.txt        
  inflating: glove.6B.100d.txt       
  inflating: glove.6B.200d.txt       
  inflating: glove.6B.300d.txt       


Next, we load the GloVe vectors.

In [None]:
import numpy as np

embeddings_index = {}
# Each line is a token followed by its space-separated vector components
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))

Found 400000 word vectors.
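
The same parsing logic can be tried without the 822 MB download. The sketch below uses three made-up 4-dimensional vectors in the same text layout (the numbers, `toy_index`, and the `cosine` helper are assumptions for illustration) and compares words by cosine similarity.

```python
import io
import numpy as np

# Made-up vectors in the GloVe text layout: token, then space-separated floats.
fake_glove = io.StringIO(
    "space 0.1 0.2 0.3 0.4\n"
    "orbit 0.1 0.2 0.3 0.5\n"
    "pixel -0.4 0.3 -0.2 0.1\n"
)

toy_index = {}
for line in fake_glove:
    values = line.split()
    toy_index[values[0]] = np.asarray(values[1:], dtype='float32')

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

sim_close = cosine(toy_index['space'], toy_index['orbit'])
sim_far = cosine(toy_index['space'], toy_index['pixel'])
print(sim_close, sim_far)
```

In real GloVe vectors, related words tend to have higher cosine similarity, which is the property the embedding layer later relies on.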


Next, we tokenize the posts and build an embedding matrix that holds the GloVe vector for each word in our vocabulary.

In [None]:
!pip install keras=='2.3.1'
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

print('Preparing embedding matrix.')
MAX_NUM_WORDS = 20000
MAX_SEQUENCE_LENGTH = 1000
EMBEDDING_DIM = 100

tokenizer = Tokenizer(num_words=MAX_NUM_WORDS)
tokenizer.fit_on_texts(newsgroups_train.data)
sequences = tokenizer.texts_to_sequences(newsgroups_train.data)

word_index = tokenizer.word_index

# prepare embedding matrix
num_words = min(MAX_NUM_WORDS, len(word_index) + 1)
embedding_matrix = np.zeros((num_words, EMBEDDING_DIM))
for word, i in word_index.items():
    if i >= MAX_NUM_WORDS:
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector

Preparing embedding matrix.
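
The layout of the resulting matrix is easier to see on a tiny example. This sketch repeats the same loop with made-up inputs (`toy_word_index`, `toy_embeddings`, `max_words`, and `dim` are hypothetical stand-ins for the notebook's variables).

```python
import numpy as np

# Hypothetical stand-ins for tokenizer.word_index and embeddings_index.
toy_word_index = {'the': 1, 'orbit': 2, 'xyzzy': 3, 'pixel': 4}  # indices start at 1
toy_embeddings = {
    'the':   np.array([0.1, 0.1], dtype='float32'),
    'orbit': np.array([0.5, -0.2], dtype='float32'),
    'pixel': np.array([-0.3, 0.4], dtype='float32'),
}
max_words, dim = 4, 2   # keep only word indices below max_words

n_rows = min(max_words, len(toy_word_index) + 1)
toy_matrix = np.zeros((n_rows, dim))
for word, i in toy_word_index.items():
    if i >= max_words:
        continue                  # beyond the vocabulary cap: no row at all
    vec = toy_embeddings.get(word)
    if vec is not None:
        toy_matrix[i] = vec       # row i is looked up whenever token id i appears

print(toy_matrix)
```

Row 0 stays all zeros because index 0 is reserved for padding, and row 3 stays all zeros because 'xyzzy' has no pretrained vector.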


Next, we'll build our dataset for training, `data` and `labels`, as well as our test set, `data_test` and `labels_test`.  We will limit our training set to 200 examples.

In [None]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from sklearn.model_selection import train_test_split

# finally, vectorize the text samples into a 2D integer tensor
MAX_NUM_WORDS = 20000
MAX_SEQUENCE_LENGTH = 1000

tokenizer = Tokenizer(num_words=MAX_NUM_WORDS)
tokenizer.fit_on_texts(newsgroups_train.data)
sequences = tokenizer.texts_to_sequences(newsgroups_train.data)

word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
labels = to_categorical(np.asarray(newsgroups_train.target))

# print(data.shape)

data, data_test, labels, labels_test = train_test_split(data,labels,train_size=200)

print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)
print('Shape of data_test tensor:', data_test.shape)
print('Shape of label_test tensor:', labels_test.shape)



Found 27471 unique tokens.
Shape of data tensor: (200, 1000)
Shape of label tensor: (200, 4)
Shape of data_test tensor: (1834, 1000)
Shape of label_test tensor: (1834, 4)
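
The two preprocessing calls above can be mimicked in plain Python to show what they produce. This is a sketch assuming the Keras defaults for `pad_sequences` (zeros added at the front, and over-long sequences cut from the front); `pad_pre` and `one_hot` are hypothetical helpers, not Keras functions.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad or left-truncate seq to exactly maxlen items (maxlen > 0)."""
    seq = list(seq)[-maxlen:]                    # keep the last maxlen tokens
    return [value] * (maxlen - len(seq)) + seq

def one_hot(label, num_classes):
    """Encode an integer class label as a one-hot list of floats."""
    return [1.0 if k == label else 0.0 for k in range(num_classes)]

# Made-up token id sequences of different lengths, and made-up class labels.
toy_sequences = [[5, 2, 9], [7, 1, 4, 4, 8, 3]]
padded = [pad_pre(s, maxlen=5) for s in toy_sequences]
encoded = [one_hot(y, num_classes=4) for y in [2, 0]]

print(padded)    # [[0, 0, 5, 2, 9], [1, 4, 4, 8, 3]]
print(encoded)   # [[0.0, 0.0, 1.0, 0.0], [1.0, 0.0, 0.0, 0.0]]
```

Every post therefore becomes a row of exactly `MAX_SEQUENCE_LENGTH` integers, and every label a row with one entry per newsgroup, which matches the shapes printed above.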


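Next, we'll define a `train` function that builds and trains the model. Its `pretrain` flag selects between frozen, pretrained GloVe embeddings (`True`) and an embedding layer learned from scratch (`False`).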

In [None]:
from tensorflow.keras.layers import Dense, Input, GlobalMaxPooling1D
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Embedding
from tensorflow.keras.models import Model
from tensorflow.keras.initializers import Constant
# from keras.optimizers import RMSprop
# from keras.optimizers import Adam
from tensorflow.keras import optimizers

EMBEDDING_DIM = 100

# load pre-trained word embeddings into an Embedding layer
# note that we set trainable = False so as to keep the embeddings fixed
# num_words = len(vectorizer.vocabulary_)
# num_words = len(word_index)+1

def train(pretrain):
  if not pretrain:  # train your own embedding
    embedding_layer = Embedding(num_words,
                                EMBEDDING_DIM,
                                input_length=MAX_SEQUENCE_LENGTH,
                                trainable=True)
  else:  # use the frozen pre-trained GloVe vectors
    embedding_layer = Embedding(num_words,
                                EMBEDDING_DIM,
                                embeddings_initializer=Constant(embedding_matrix),
                                input_length=MAX_SEQUENCE_LENGTH,
                                trainable=False)
  print('Training model.')

  # train a 1D convnet with global maxpooling
  sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
  embedded_sequences = embedding_layer(sequence_input)
  x = Conv1D(128, 5, activation='relu')(embedded_sequences)
  x = MaxPooling1D(5)(x)
  x = Conv1D(128, 5, activation='relu')(x)
  x = MaxPooling1D(5)(x)
  x = Conv1D(128, 5, activation='relu')(x)
  x = GlobalMaxPooling1D()(x)
  x = Dense(128, activation='relu')(x)
  preds = Dense(len(categories), activation='softmax')(x)

  solver = optimizers.Adam(learning_rate=0.0005)

  model = Model(sequence_input, preds)
  model.compile(loss='categorical_crossentropy',
                optimizer=solver,
                metrics=['acc'])

  model.fit(data, labels,
            epochs=50,
            validation_data=(data_test, labels_test))
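
To see how the convolution and pooling layers shrink the sequence, the arithmetic can be worked through directly. This sketch assumes the Keras defaults used above ('valid' padding, convolution stride 1, pooling stride equal to the pool size); `conv1d_len` and `maxpool1d_len` are hypothetical helper names.

```python
def conv1d_len(n, kernel_size):
    """Output length of a stride-1 convolution with 'valid' padding."""
    return n - kernel_size + 1

def maxpool1d_len(n, pool_size):
    """Output length of max pooling with stride == pool_size, 'valid' padding."""
    return (n - pool_size) // pool_size + 1

n = 1000  # MAX_SEQUENCE_LENGTH
lengths = []
for layer, size in [('conv', 5), ('pool', 5), ('conv', 5), ('pool', 5), ('conv', 5)]:
    n = conv1d_len(n, size) if layer == 'conv' else maxpool1d_len(n, size)
    lengths.append(n)

print(lengths)   # [996, 199, 195, 39, 35]
```

The final `GlobalMaxPooling1D` then reduces the remaining 35 positions to a single value per filter, giving a 128-dimensional vector for the dense layers.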

Below we train the model without pretrained weights.

In [None]:
train(False)

Training model.
Epoch 1/50
...
Epoch 50/50


Next we train the model with pretrained weights.

In [None]:
train(True)

Training model.
Epoch 1/50
...
Epoch 50/50
