### 1. Comparing Naive Bayes and Neural Networks in Text Categorization of the Brown Corpus

The task I am trying to solve is text categorization of the Brown corpus. Previously on the course we classified the sentiment of movie reviews using both Naive Bayes algorithm and neural networks. In those tasks, a review was detected as either positive or negative. What makes this task different, is that there are multiple possible categories that can be assigned to the given text. I want to see how well both Naive Bayes and neural networks can solve this task. I also want to learn to implement neural networks on my own data, since we did not get to dive very deep on that on this course.

To accurately compare the models, both models will use supervised machine learning and a bag of words approach. 

### Partitioning the data

First, to do data partitioning and preprocessing it is essential to know what the data consists of. The data is imported from the NLTK Gutenberg corpus.

In [13]:
# Imports
import nltk
from nltk.corpus import gutenberg, brown

# Print categories and number of words in category
for category in brown.categories():
    print('{:<16}{}'.format(category, len(brown.words(categories=category))))

adventure       69342
belles_lettres  173096
editorial       61604
fiction         68488
government      70117
hobbies         82345
humor           21695
learned         181888
lore            110299
mystery         57169
news            100554
religion        39399
reviews         40704
romance         70022
science_fiction 14470


The Brown corpus consists of over a million words compiled from 500 samples from 15 different genres. As can be seen in the output of the code above, the words are not evenly divided between categories, and some categories contain considerably more words than other categories. Therefore, some categories will be more represented in the training data. However, all categories should be somewhat equally represented in the test set. The training and test set will consist of 90% and 10% of the data respectively. A validation/development set will not be used for the Naive Bayes algorithm, because I will be using the same features to accurately compare Naive Bayes and neural networks and therefore not doing any feature engineering.

Because the data is labelled, the models will be trained using supervised machine learning, where the model learns the connections between the data and its labels.

In [14]:
# Create a list of texts and categories
documents = [(list(brown.words(fileid)), category)
            for category in brown.categories()
            for fileid in brown.fileids(category)]

# Take every 10th element from the documents for the test set
test_set = documents[4::10] # Starting at 4th element results in every genre being included in the test set
print("Samples in the test set: ",len(test_set))

train_set = [d for d in documents if d not in test_set]
print("Samples in the train set: ",len(train_set))

Samples in the test set:  50
Samples in the train set:  450


I took 10% of the corpus for a test set by taking every 10th sample from the documents. This way, I ended up with 50 samples for testing and remaining 450 for training. I manually checked the distribution of categories, and ended up starting from the 4th document for a distribution that picks a sample even from the smallest category. Next, I will check the distribution of categories in the test set.

In [15]:
import pandas as pd
import numpy as np

# Create a list of categories
categories = []
for d, c in documents:
    categories.append(c)

# Count number of samples in each category using pandas
n_categories = pd.value_counts(np.array(categories)).sort_index()
print('--Categories and amount of samples in the dataset--', '\n')
print(n_categories, '\n')

# Create a list of categories in the test set
test_categories = []
for d, c in test_set:
    test_categories.append(c)
    
# Count number of samples in each category using pandas
n_test_categories = pd.value_counts(np.array(test_categories)).sort_index()
print('--Categories and amount of samples in the test set--', '\n')
print(n_test_categories, '\n')

# Count the percentage of each category in test set
percentage = n_test_categories.values / n_categories.values 
print("Percentage of each category in test set: ", percentage.round(2))

--Categories and amount of samples in the dataset-- 

adventure          29
belles_lettres     75
editorial          27
fiction            29
government         30
hobbies            36
humor               9
learned            80
lore               48
mystery            24
news               44
religion           17
reviews            17
romance            29
science_fiction     6
dtype: int64 

--Categories and amount of samples in the test set-- 

adventure          3
belles_lettres     7
editorial          3
fiction            3
government         3
hobbies            4
humor              1
learned            8
lore               4
mystery            3
news               4
religion           2
reviews            2
romance            2
science_fiction    1
dtype: int64 

Percentage of each category in test set:  [0.1  0.09 0.11 0.1  0.1  0.11 0.11 0.1  0.08 0.12 0.09 0.12 0.12 0.07
 0.17]


Now every genre is represented in the test set, and as can be seen from the output of the code, about 10% of each category is taken for the test set. The smallest category is slightly overrepresented with 17%, but that is only 1 sample out of 6, and I wanted to include each category in the test set, to see whether the algorithms could assign a label even to the smallest category.

### Naive Bayes Classifier

Now that the datasets are partitioned, I will implement the Naive Bayes classifier. First, word features need to be defined.

In [16]:
# Defining most frequent words
all_words = nltk.FreqDist(w.lower() for w in brown.words())
word_features = [w for w, f in all_words.most_common(2000)]

def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

# Featuresets
train_features = [(document_features(d), c) for (d,c) in train_set]
test_features = [(document_features(d), c) for (d,c) in test_set] 
classifier = nltk.NaiveBayesClassifier.train(train_features)

The 2000 most frequent words in the Brown corpus are taken as word features. Then the feature extractor is defined, and the classifier simply checks whether the text contains a certain word or not, and decides which category the text belongs to. This makes it a bag of words model, where the order of the words is not taken into consideration. This type of model is suitable for this dataset, where the samples consist of quite large text documents of over 2000 words each. The words the documents contain already give out a lot of useful information for recognizing the categories, because intuitively each category contains similar content, and therefore the classifier will be able to assign labels based on word features.

Now everything is set to train and evaluate the classifier.

In [17]:
# Train the classifier
classifier = nltk.NaiveBayesClassifier.train(train_features)

# Evaluate
print("Naive Bayes accuracy: ", nltk.classify.accuracy(classifier, test_features))
print(classifier.show_most_informative_features(30))

Naive Bayes accuracy:  0.6
Most Informative Features
        contains(didn't) = True           myster : learne =     45.3 : 1.0
      contains(wouldn't) = True           myster : learne =     34.3 : 1.0
        contains(walked) = True           myster : learne =     32.1 : 1.0
        contains(wasn't) = True           romanc : learne =     30.4 : 1.0
           contains(sat) = True           fictio : learne =     29.7 : 1.0
       contains(stopped) = True           romanc : learne =     28.7 : 1.0
      contains(watching) = True           romanc : learne =     28.7 : 1.0
        contains(killed) = True           scienc : learne =     28.4 : 1.0
         contains(sleep) = True           scienc : learne =     28.4 : 1.0
       contains(watched) = True           scienc : learne =     28.4 : 1.0
     contains(telephone) = True            humor : belles =     28.1 : 1.0
           contains(ran) = True           advent : learne =     27.9 : 1.0
        contains(season) = True           revie

The accuracy for the model is 60% which is significantly better compared to the naive baseline, i.e. assigning the most common category, learned, to all texts, which would get 8/50 of the genres correct or an accuracy of 16%.

Looking at the most informative features, they seem to make sense even just by knowing the labels of different genres: "killed" belongs to science fiction, "ran" belongs to adventure, "beautiful" to review and "baby" to romance.

Next I will look at the predicted labels and further analyse the performance of the model.

In [18]:
# Print correct labels and predicted labels
for d, c in test_features:
    print(c,' : ' ,classifier.classify(d))

adventure  :  adventure
adventure  :  adventure
adventure  :  adventure
belles_lettres  :  belles_lettres
belles_lettres  :  learned
belles_lettres  :  belles_lettres
belles_lettres  :  belles_lettres
belles_lettres  :  belles_lettres
belles_lettres  :  belles_lettres
belles_lettres  :  belles_lettres
editorial  :  editorial
editorial  :  news
editorial  :  editorial
fiction  :  fiction
fiction  :  fiction
fiction  :  adventure
government  :  learned
government  :  hobbies
government  :  belles_lettres
hobbies  :  news
hobbies  :  hobbies
hobbies  :  learned
hobbies  :  hobbies
humor  :  belles_lettres
learned  :  hobbies
learned  :  learned
learned  :  lore
learned  :  government
learned  :  learned
learned  :  hobbies
learned  :  learned
learned  :  learned
lore  :  lore
lore  :  lore
lore  :  lore
lore  :  learned
mystery  :  mystery
mystery  :  romance
mystery  :  romance
news  :  news
news  :  news
news  :  news
news  :  news
religion  :  belles_lettres
religion  :  belles_lettres

The correct label is printed for each sample in the test set, followed by the predicted label. Looking at these, the easiest categories to predict seem to be adventure, belles lettres, fiction, lore, romance and news. Adventure and news were labelled correct every time, and adventure also appears a lot in the most informative features. However, news does not, but it makes sense that it would be a very recognisable genre.

The hardest categories to predict appear to be government, learned, mystery, religion, and science fiction. Government, humor, religion and science fiction were never labelled correctly. The category government seems to be hard to predict although not being a small category and the category religion seems to be easily confused with belles lettres, meaning beautiful writing. Smaller categories, like science fiction, are hard to evaluate due to their small sample size. 

Because the amount of test samples is so small, accuracy of the model can vary a lot depending on the test set. Perhaps a solution to this to consider in future experiments would be to label sentences instead, but that might require a sequential model.

### Neural networks

Next, I will implement neural networks, using Keras from Tensorflow.

In [19]:
max_features = 20000
maxlen = 2000
batch_size = 32

The most frequent 2000 words are used as word features like in the Naive Bayes model. Maximum lenght is set to 2000 to achieve documents of equal lenght. Samples in the Brown corpus consist of slightly over 2000 words each. Batch size is set to 32. For this model, a validation set will be used to avoid overfitting.

The data is shuffled for random partitioning into training and validation sets. To create a validation set consisting of 10% of the data, 50 samples are taken from the training set. To implement a neural networks model, the data needs to be transformed into document vectors. 

In [20]:
# Shuffle train data
import random
random.shuffle(train_set)

# Define documents
x_train = [d for (d,c) in train_set[50:]]
x_val = [d for (d,c) in train_set[:50]]
x_test = [d for (d,c) in test_set]# Create the tokenizer
from keras.preprocessing.text import Tokenizer
t = Tokenizer(num_words = 2000)

# Encode documents
x_train = t.texts_to_matrix(x_train, mode='binary')
x_val = t.texts_to_matrix(x_val, mode='binary')
x_test = t.texts_to_matrix(x_test, mode='binary')

# Print shape
print('x_train shape:', x_train.shape)
print('x_val shape:', x_val.shape)
print('x_test shape:', x_test.shape)

x_train shape: (400, 2000)
x_val shape: (50, 2000)
x_test shape: (50, 2000)


First, I set up a tokenizer so I can use the text to matric function from Keras. It encodes each text sample into a vector. It has multiple modes, out of which the binary mode is similar to the word features in the Naive Bayes model, which checks whether the word occurs in the training data or not. The lenght of the vector is the total size of the vocabulary, which is 2000 words. The first integer of the shape tells the amount of vectors, and the second integer their lenght.

Next, the labels, in this case the different categories, need to be preprocessed as well. They need to be in a numerical form, so I will create binary representations of the labels.

In [21]:
# Define training labels
y_train = [c for (d, c) in train_set[50:]]
y_val = [c for (d, c) in train_set[:50]]
y_test = [c for (d, c) in test_set]

# Import MultiLabelBinarizer
from sklearn.preprocessing import MultiLabelBinarizer

# Create MultiLabelBinarizer object
mlb = MultiLabelBinarizer()

# One-hot encode data
y_train = mlb.fit_transform(y_train)
y_val = mlb.fit_transform(y_val)
y_test = mlb.fit_transform(y_test)

# Print shape
print('y_train shape:', y_train.shape)
print('y_val shape:', y_val.shape)
print('y_test shape:', y_test.shape)

y_train shape: (400, 21)
y_val shape: (50, 21)
y_test shape: (50, 21)


A MultiLabelBinarizer from scikit-learn is used to create binary representations of the labels. Now everything is set to create and run the model.

In [22]:
# Import model
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.models import Sequential

# Define model
model = Sequential()
model.add(Dropout(0.5))
model.add(Dense(21, activation='sigmoid',))

A Sequential model is used because it is a simple neural networks model and therefore fit for a simple classification task. A dropout rate of 0.5 is set to reduce overfitting and sigmoid is used as an activation function. The input shape is defined as 21 based on the shape of the binarized labels. Now, the final step is to train and evaluate the model.

In [23]:
# Compile and train model
model.compile('adam', 'binary_crossentropy', metrics=['accuracy'])
print('Train...')
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=8,
          validation_data=[x_val, y_val])
# Evaluate
loss, acc = model.evaluate(x_test, y_test, verbose=0)
print('Test Accuracy:',acc)

Train...
Train on 400 samples, validate on 50 samples
Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8
Test Accuracy: 0.7666667


The training accuracy goes up until the 3rd epoch and then stays at 77.4%. The validation accuracy stays at 76.3% throughout all epochs, which may be due to a small validation set. The training loss and validation loss keeps going down, which means that there is a better match between correct labels and predicted labels. Validation accuracy is slightly lower than training accuracy, which means that the model learned from the training data and may have started overfitting, but since validation accuracy does not change, it is hard to say. The test accuracy is 76.7%, which is nearly the same as the validation accuracy. Results might also vary due to different samples in the training, validation and test sets. 

I tested my hypothesis by changing the sizes of the test and validation sets from 50 or 10% to 100 or 20%. The training accuracy went slightly down to 77.2%, and the validation accuracy went up to 77.1%, which are almost the same, indicating that overfitting has not occured. Test accuracy changed to 76.8%, which is nearly the same as before. Typically, the model is not expected to perform better with a smaller training data, but the changes are very small and therefore probably due to differences in the sets of samples.

The test accuracy of 76.7% is significantly better than the Naive Bayes model with an accuracy of 60%. The results show that even a simple neural networks model performs better than a Naive Bayes model constructed in a similar way. 

### Conclusion

In conclusion, I trained two machine learning models to predict categories in the Brown corpus. First, a Naive Bayes model and then a neural networks model. To allow comparison between the models, both were built similarly using a supervised machine learning method and a bag of words approach. Naive Bayes performed significantly better than the naive baseline with an accuracy of 60%, yet the model fell short from the performance of the neural networks model with an accuracy of 76.6%. The model is very simple, and with some fine tuning, the neural networks model could probably perform better. 

The results were quite close to what I was expecting, since the task is not as simple as sentiment prediction. In addition, it is not surprising that neural networks performed better than the Naive Bayes classifier, since it is a more advanced algorithm.