## Predicting the Genre of Books from Summaries

The goal is to classify each book into one of five genres using the metadata provided.

We'll use a set of book summaries from the [CMU Book Summaries Corpus](http://www.cs.cmu.edu/~dbamman/booksummaries.html) in this experiment. The corpus contains 16,559 summaries and includes metadata about the genre of each book, taken from Freebase.


In [135]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

## Data Preparation

The first task is to read the data. It is made available in tab-separated format but has no column headings. The names come from the [ReadMe](data/booksummaries/README.txt) file.

In [48]:
names = ['wid', 'fid', 'title', 'author', 'date', 'genres', 'summary']

books = pd.read_csv("data/booksummaries/booksummaries.txt", sep="\t", header=None, names=names, keep_default_na=False)
books.head()

| | wid | fid | title | author | date | genres | summary |
|---|---|---|---|---|---|---|---|
| 0 | 620 | /m/0hhy | Animal Farm | George Orwell | 1945-08-17 | {"/m/016lj8": "Roman \u00e0 clef", "/m/06nbt":... | Old Major, the old boar on the Manor Farm, ca... |
| 1 | 843 | /m/0k36 | A Clockwork Orange | Anthony Burgess | 1962 | {"/m/06n90": "Science Fiction", "/m/0l67h": "N... | Alex, a teenager living in near-future Englan... |
| 2 | 986 | /m/0ldx | The Plague | Albert Camus | 1947 | {"/m/02m4t": "Existentialism", "/m/02xlf": "Fi... | The text of The Plague is divided into five p... |
| 3 | 1756 | /m/0sww | An Enquiry Concerning Human Understanding | David Hume | | | The argument of the Enquiry proceeds by a ser... |
| 4 | 2080 | /m/0wkt | A Fire Upon the Deep | Vernor Vinge | | {"/m/03lrw": "Hard science fiction", "/m/06n90... | The novel posits that space around the Milky ... |


We next filter the data so that only our target genre labels are included, and we assign each text to just one of them. A text can carry more than one of these labels (e.g. Science Fiction and Fantasy); in that case the loop below keeps whichever matching label comes last in target_genres.

In [49]:
target_genres = ["Children's literature",
                 'Science Fiction',
                 'Novel',
                 'Fantasy',
                 'Mystery']

# create a Series of empty strings the same length as the list of books
genre = pd.Series(np.repeat("", books.shape[0]))

# look for each target genre and set the corresponding entries in the genre series to the genre label
for g in target_genres:
    genre[books['genres'].str.contains(g)] = g

# add this to the book dataframe and then select only those rows that have a genre label
# drop some useless columns
books['genre'] = genre
genre_books = books[genre!=''].drop(['genres', 'fid', 'wid'], axis=1)

genre_books.shape

(8954, 5)
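The labelling loop has two side effects that are easy to miss: a later genre in the list overwrites an earlier one, and `str.contains` matches substrings, so a label such as "Novella" also counts as "Novel". The following is a minimal sketch on a made-up frame (the `toy` data and `targets` list are assumptions for illustration, not taken from the corpus):

```python
import pandas as pd

# Toy stand-in for the `genres` column (hypothetical values, not from the corpus)
toy = pd.DataFrame({
    "genres": [
        '{"a": "Science Fiction", "b": "Fantasy"}',
        '{"c": "Mystery"}',
        '{"d": "Novella"}',
        '{"e": "Poetry"}',
    ]
})
targets = ["Science Fiction", "Novel", "Fantasy", "Mystery"]

label = pd.Series([""] * len(toy), index=toy.index)
for g in targets:
    # regex=False makes this a plain substring test
    label[toy["genres"].str.contains(g, regex=False)] = g

print(label.tolist())
```

The first row ends up as Fantasy because Fantasy comes after Science Fiction in `targets`, and the "Novella" row is labelled Novel through the substring match.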

We inspect the number of books in each genre. Every genre has a reasonable number of samples, although the classes are only roughly balanced: Fantasy and Novel each have about twice as many books as Children's literature.

In [50]:
# check how many books we have in each genre category
genre_books.groupby('genre').count()

| genre | title | author | date | summary |
|---|---|---|---|---|
| Children's literature | 1092 | 1092 | 1092 | 1092 |
| Fantasy | 2311 | 2311 | 2311 | 2311 |
| Mystery | 1396 | 1396 | 1396 | 1396 |
| Novel | 2258 | 2258 | 2258 | 2258 |
| Science Fiction | 1897 | 1897 | 1897 | 1897 |


## Modelling

We use a Naive Bayes model to predict each book's genre. More specifically, we use the multinomial variant, which is designed for count features such as word counts.
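To make the multinomial model concrete, here is a minimal sketch that computes the smoothed word likelihoods and class priors by hand and compares them with scikit-learn's `MultinomialNB`. The count matrix `X_toy` and labels `y_toy` are made-up values used only for illustration:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy word-count matrix: 4 documents x 3 vocabulary words (made-up numbers)
X_toy = np.array([[2, 1, 0],
                  [3, 0, 1],
                  [0, 2, 3],
                  [1, 1, 4]])
y_toy = np.array(["a", "a", "b", "b"])
alpha = 1.0  # Laplace smoothing, the scikit-learn default

classes = np.unique(y_toy)
# per-class total count of each word
counts = np.vstack([X_toy[y_toy == c].sum(axis=0) for c in classes])
# smoothed log P(word | class)
log_likelihood = np.log((counts + alpha) /
                        (counts.sum(axis=1, keepdims=True) + alpha * X_toy.shape[1]))
# log P(class) from the class frequencies
log_prior = np.log(np.array([(y_toy == c).mean() for c in classes]))

# unnormalised log posterior for every document and class
joint = X_toy @ log_likelihood.T + log_prior
manual_pred = classes[joint.argmax(axis=1)]

model = MultinomialNB(alpha=alpha).fit(X_toy, y_toy)
sk_pred = model.predict(X_toy)
print(manual_pred, sk_pred)
```

Each document is assigned to the class with the largest sum of log prior and count-weighted log likelihoods, which is why raw word counts are a natural input for this model.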

In [274]:
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score

# class label for each book
y = genre_books['genre']

We extract the class label (genre) and use the book summary as the main predictor. There is also valuable information in the title, so we append it to the summary field. The author can also give a clue to the book's genre, because there are multiple books by the same author, so we append that too. We remove spaces from the author's name to create a single-word feature, e.g. "George Orwell" becomes "GeorgeOrwell".

In [52]:
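# add the title and author to the summary
# X is not defined in any cell shown here; we assume it holds the text columns of genre_books
X = genre_books[['title', 'author', 'summary']].copy()
X["summary"] = X.loc[:, "summary"].str.cat(X.loc[:, "title"], sep=" ").str.cat(X.loc[:, "author"].str.replace(" ", ""), sep=" ")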
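The effect of the concatenation is easiest to see on a single row. This is a small sketch on a hypothetical one-row frame that reuses the notebook's column names:

```python
import pandas as pd

# Hypothetical one-row frame with the same column names as the notebook
toy = pd.DataFrame({
    "title": ["Animal Farm"],
    "author": ["George Orwell"],
    "summary": ["Old Major, the old boar on the Manor Farm"],
})

# collapse the author name into a single token
author_token = toy["author"].str.replace(" ", "", regex=False)
combined = toy["summary"].str.cat(toy["title"], sep=" ").str.cat(author_token, sep=" ")
print(combined[0])
```

Because the author becomes one token, the vectorizer treats it as a single vocabulary entry rather than as separate first and last names.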

We create a matrix of word counts using scikit-learn's CountVectorizer. The result is a sparse matrix from the SciPy package: most entries are zero, so it is more memory efficient to store only the non-zero values. The vectorizer also has settings for minimum and maximum document frequency. We require a word to appear in at least 8 documents. Because we have 5 genres, we also keep only terms that appear in at most 20% of the documents.

From the shape of the matrix, we see that the vocabulary contains over 17,000 words.

In [310]:
# produce a word count matrix
from sklearn.feature_extraction.text import CountVectorizer
maximum_document_frequency = 0.2
minimum_document_frequency = 8
countV = CountVectorizer(max_df=maximum_document_frequency, min_df=minimum_document_frequency) 
# max_df: ignore words that appear in more than 20% of docs; min_df: ignore words that appear in fewer than 8 docs

Xt = countV.fit_transform(X.summary)
print("Type of word count vector: ", type(Xt))
print("Vector shape", Xt.shape)


Type of word count vector:  <class 'scipy.sparse.csr.csr_matrix'>
Vector shape (8954, 17510)
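The two document-frequency thresholds are easier to follow on a tiny corpus. Here is a minimal sketch with five made-up documents and thresholds chosen for the toy data (not the values used above):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Five tiny made-up documents
docs = [
    "the dragon guards the gold",
    "the knight fights the dragon",
    "the detective finds the gold",
    "the ship leaves the earth",
    "the knight leaves",
]

# keep words found in at least 2 documents and in at most 60% of them
cv = CountVectorizer(min_df=2, max_df=0.6)
counts = cv.fit_transform(docs)

print(sorted(cv.vocabulary_))   # surviving words
print(counts.shape)             # (documents, surviving words)
```

Here "the" is dropped because it occurs in every document, and words that occur in only one document are dropped by `min_df`.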


To inspect the word counts, we combine the word list with the sparse matrix to build up a sorted list. 

In [311]:
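# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() on newer versions
word_list = countV.get_feature_names()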

In [312]:
# create sorted word counts to inspect
# summing the sparse matrix directly avoids a slow dense conversion with Xt.toarray()
count_list = np.asarray(Xt.sum(axis=0))[0]

count_list = zip(word_list, count_list)
sorted_count_list = sorted(count_list, key=lambda x: x[1])

In [344]:
print("Least common words:")
print(sorted_count_list[:20])
print("\nMost common words:")
print(sorted_count_list[-20:])

Least common words:
[('12th', 8), ('130', 8), ('1857', 8), ('1870s', 8), ('1910', 8), ('1924', 8), ('1925', 8), ('1935', 8), ('1981', 8), ('46', 8), ('4th', 8), ('54', 8), ('666', 8), ('96', 8), ('abbreviated', 8), ('abject', 8), ('abnormal', 8), ('acceleration', 8), ('accountable', 8), ('admiring', 8)]

Most common words:
[('year', 2443), ('just', 2454), ('school', 2465), ('even', 2488), ('having', 2503), ('must', 2526), ('again', 2567), ('how', 2612), ('children', 2627), ('king', 2642), ('son', 2653), ('night', 2692), ('killed', 2722), ('earth', 2762), ('love', 2807), ('ship', 2930), ('were', 2974), ('war', 3151), ('city', 3167), ('house', 3932)]


Many of the least common words are numbers, which are unlikely to be good predictors. The exceptions are dates, which may have some historical significance in a novel or point to a futuristic setting in a science fiction story.

Only 215 of the roughly 17,500 vocabulary words start with a digit, so filtering them has little effect. Every little bit helps, though, so we pass the ones that do not look like dates or ordinals to the vectorizer as 'stop_words'.

In [314]:
# exclude tokens that start with a digit, except dates and ordinals
word_series = pd.Series(word_list)
words_containing_numbers = wcn = word_series[word_series.str.match(r'\d')]
print("Words containing numbers: ", len(words_containing_numbers))
date_words = wcn[wcn.str.match(r'([12][8901]\d{2})|(\d{1,2}(st|nd|rd|th))')]  # 1800, 1906, 21st, 2nd etc.
# recent scikit-learn versions expect stop_words as a list rather than a set
stop_words = sorted(set(wcn).difference(set(date_words)))

Words containing numbers:  215
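To see which numeric tokens the date pattern keeps, the same regular expression can be tried on a few sample tokens. This sketch uses Python's `re.match`, which, like pandas `str.match`, anchors only at the start of the string; the `tokens` list is made up for illustration:

```python
import re

# the date/ordinal pattern used above; re.match anchors only at the start
date_pattern = re.compile(r'([12][8901]\d{2})|(\d{1,2}(st|nd|rd|th))')

tokens = ["1857", "1981", "4th", "12th", "130", "46", "666", "2850"]
kept = [t for t in tokens if date_pattern.match(t)]
dropped = [t for t in tokens if not date_pattern.match(t)]

print("kept as dates/ordinals:", kept)
print("treated as stop words: ", dropped)
```

The pattern is deliberately loose: a token such as "2850" is also kept, because the second digit only has to be one of 8, 9, 0 or 1.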


In [315]:
countV = CountVectorizer(stop_words=stop_words, max_df=maximum_document_frequency, min_df=minimum_document_frequency)
Xt = countV.fit_transform(X.summary)
Xt.shape


(8954, 17431)

We reweight the word counts using term frequency and inverse document frequency (tf-idf). Term frequency gives a word more weight the more often it occurs in a document. Inverse document frequency reduces that weight the more documents the word appears in, so words that occur almost everywhere have the least effect.

In [316]:
# apply term frequency * inverse document frequency
from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer = TfidfTransformer()
Xtt = tfidf_transformer.fit_transform(Xt)
Xtt.shape

(8954, 17431)
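As a rough illustration of what the transformer computes with its default settings (smoothed idf followed by L2 normalisation of each row), the following sketch reproduces the weights by hand on a made-up count matrix and checks them against `TfidfTransformer`:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Toy count matrix: 3 documents x 3 words (made-up numbers)
counts = np.array([[3, 0, 1],
                   [2, 0, 0],
                   [3, 2, 0]])

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)                 # documents containing each word
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1     # smoothed idf (smooth_idf=True)
weighted = counts * idf                             # tf * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)  # L2-normalise rows

sk = TfidfTransformer().fit_transform(counts).toarray()
print(np.allclose(manual, sk))
```

The first word appears in every toy document, so it receives the smallest idf; the row normalisation also means the features are no longer raw counts.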

A cross-validation score of only 63% is achieved with the tf-idf features. This is quite poor, so we next try leaving out the term frequency/inverse document frequency (tf-idf) transformer.

In [317]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score
scores = cross_val_score(MultinomialNB(), Xtt, y, scoring='accuracy', cv=10)
print('Multinomial Naive Bayes accuracy: %.4f +- %.4f\n' % (scores.mean(), scores.std()))


Multinomial Naive Bayes accuracy: 0.6318 +- 0.0288



A better result is achieved by leaving out the tf-idf transformer. This could be because we have already filtered out very common and very uncommon words in the vectorizer using 'max_df' and 'min_df'.

In [319]:
# try without Tfidf transformation
NBmodel = MultinomialNB(alpha=1)
scores = cross_val_score(NBmodel, Xt, y, scoring='accuracy', cv=10)
print('Cross-validation accuracy: %.2f +- %.2f\n' % (scores.mean(), scores.std()))

Cross-validation accuracy: 0.69 +- 0.03
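In the cells above the vectorizer is fitted on the whole corpus before cross-validation. One alternative, sketched here under the assumption that refitting the vocabulary in every fold is acceptable, is to wrap the vectorizer and classifier in a scikit-learn pipeline. The two-genre corpus below is made up purely so that the example runs on its own:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up two-genre corpus, 10 short "summaries" per genre
fantasy = [f"the wizard casts spell {i} on the dragon" for i in range(10)]
scifi = [f"the ship leaves planet {i} for the stars" for i in range(10)]
texts = fantasy + scifi
labels = ["Fantasy"] * 10 + ["Science Fiction"] * 10

# vectorizer and classifier are refit together inside every fold
pipe = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1))
toy_scores = cross_val_score(pipe, texts, labels, scoring="accuracy", cv=5)
print("accuracy: %.2f +- %.2f" % (toy_scores.mean(), toy_scores.std()))
```

With this arrangement the vocabulary and document-frequency filters are learned only from the training part of each fold, at the cost of repeating the vectorization for every fold.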



In [320]:
#Final model evaluation
from sklearn.metrics import confusion_matrix
NBmodel = MultinomialNB()
X_train, X_test, y_train, y_test = train_test_split(Xt, y, test_size=0.1, random_state=42)
NBmodel.fit(X_train, y_train)


MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

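The cell above splits the data into training (90%) and test (10%) sets and fits the model on the training portion. We now score the model on each set.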

In [331]:
y_train_pred = NBmodel.predict(X_train)

train_score = accuracy_score(y_train, y_train_pred)
train_matrix = confusion_matrix(y_train, y_train_pred)
print("Training score: %.2f\n" % train_score)

Training score: 0.82



In [343]:
y_test_pred = NBmodel.predict(X_test)

test_score = accuracy_score(y_test, y_test_pred)
test_matrix = confusion_matrix(y_test, y_test_pred)
print("Test score: %.2f\n" % test_score)
genres = NBmodel.classes_
print(test_matrix)
print("\n")
for i, c in enumerate(test_matrix):
    print(f"{genres[i]} accuracy: {c[i]/c.sum()*100:.1f}%")

Test score: 0.68

[[ 63  13  10  17   2]
 [ 21 149   9  21  23]
 [  7   5  95  17   7]
 [ 24   8  24 153  36]
 [  7  14   7  14 150]]


Children's literature accuracy: 60.0%
Fantasy accuracy: 66.8%
Mystery accuracy: 72.5%
Novel accuracy: 62.4%
Science Fiction accuracy: 78.1%


The higher accuracy on the training data (0.82 versus 0.68 on the test data) suggests some overfitting of the model. The per-genre figures are the fraction of each true genre that was predicted correctly. Children's literature was the least accurate, most often misclassified as Novel or Fantasy. Fantasy was misclassified about equally often as Science Fiction, Children's literature and Novel. Mystery accuracy was relatively high, with most of its errors going to Novel. Novels were most often misclassified as Science Fiction, followed by Children's literature and Mystery, and only rarely as Fantasy. Science Fiction had the highest accuracy, with its errors mostly split between Fantasy and Novel.

One observation is that the classification of genres can be arbitrary. For example, the distinction between Science Fiction and Fantasy or Science fantasy is blurry, as is the one between Children's literature and Fantasy. Novel is the most general category, and a novel could plausibly be placed in any of the other categories, which is reflected in its low accuracy.
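The per-genre figures and the most common confusions can be read directly off the test confusion matrix. A short sketch that recomputes them from the matrix printed above (rows are true genres, columns are predicted genres, in the order of `NBmodel.classes_`):

```python
import numpy as np

genres = ["Children's literature", "Fantasy", "Mystery", "Novel", "Science Fiction"]
# test confusion matrix printed above (rows = true genre, columns = predicted genre)
cm = np.array([[ 63,  13,  10,  17,   2],
               [ 21, 149,   9,  21,  23],
               [  7,   5,  95,  17,   7],
               [ 24,   8,  24, 153,  36],
               [  7,  14,   7,  14, 150]])

recall = np.diag(cm) / cm.sum(axis=1)   # same figures as the per-genre accuracy above
overall = np.trace(cm) / cm.sum()       # overall test accuracy

errors = cm.copy()
np.fill_diagonal(errors, 0)             # keep only the misclassifications
for name, r, row in zip(genres, recall, errors):
    print(f"{name}: recall {r:.1%}, most often confused with {genres[row.argmax()]}")
print(f"overall accuracy: {overall:.2f}")
```

Note that `argmax` reports only the first of several tied columns, so ties such as Fantasy and Novel for Science Fiction are worth checking in the matrix itself.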

## Model 2

This model is based on a white list approach (as compared to a black list approach of the previous model) and is based on the idea that nouns will be better predictors as they will be more subject specific. We can also look at other parts of speech as predictors such as verbs and adjectives.

The multinomial Naive Bayes model is used again, but the input data is restricted to the relevant parts of speech. To help with this, Python's Natural Language Toolkit (NLTK) package is used to identify the part of speech (POS) of each word.

In [242]:
import nltk
from scipy import sparse
#nltk.download("punkt")
#nltk.download('averaged_perceptron_tagger')
#nltk.download('tagsets')

A new word count matrix is created with no filtering.

In [252]:
#countV = CountVectorizer(stop_words=stop_words, max_df=0.1, min_df=8)
countV = CountVectorizer()
multinomialNB = MultinomialNB()
Xt = countV.fit_transform(X.summary)

The words used in the vector are retrieved and converted to a Pandas Series. 

In [253]:
# on scikit-learn 1.2 or newer, use get_feature_names_out() instead
word_list = countV.get_feature_names()
word_series = pd.Series(word_list)

We create a new series identifying the part of speech of each word. Note that each word is tagged on its own, without its sentence context. The new series contains a short tag for each part of speech, for example 'NN' for nouns and 'JJ' for adjectives. Longer tags usually denote a sub-category; for example, NNP is a singular proper noun.

In [254]:
def pos(x):
    return nltk.pos_tag([x])[0][1]
position_of_speech = word_series.transform(pos)

Lists of verbs, nouns and adjectives are then created.

In [255]:
verbs = word_series[position_of_speech.str.match(r'VB[A-Z]{0,2}')]

In [256]:
nouns = word_series[position_of_speech.str.match(r'NN[A-Z]{0,2}')]

In [257]:
adjectives = word_series[position_of_speech.str.match(r'JJ[A-Z]{0,2}')]
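The three patterns each match a base tag plus up to two further capital letters, which is how the sub-categories are picked up. A small sketch using a hand-written list of Penn Treebank tags (not actual tagger output):

```python
import re

# A few Penn Treebank tags as examples (hand-written, not tagger output)
tags = ["NN", "NNS", "NNP", "NNPS", "VB", "VBD", "VBG", "JJ", "JJR", "RB", "IN"]

patterns = {
    "nouns": r"NN[A-Z]{0,2}",
    "verbs": r"VB[A-Z]{0,2}",
    "adjectives": r"JJ[A-Z]{0,2}",
}

# group the tags by the first pattern family they match
groups = {name: [t for t in tags if re.match(p, t)] for name, p in patterns.items()}
for name, matched in groups.items():
    print(name, matched)
```

Tags such as RB (adverb) and IN (preposition) match none of the patterns, so the corresponding words are left out of all three lists.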

The word vector (countV) has an attribute vocabulary_, which is a dictionary of words (as keys) with their column indexes in the sparse matrix as values. We use these to create a list of sparse matrix column indexes for nouns, verbs and adjectives. These will be used to create new matrices, depending on the part of speech we want to use in our model.

In [258]:
noun_column_indexes = [countV.vocabulary_.get(key) for key in nouns]
verb_column_indexes = [countV.vocabulary_.get(key) for key in verbs]
adjective_column_indexes = [countV.vocabulary_.get(key) for key in adjectives]
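The lookup from words to column indexes, and the column selection that follows from it, can be sketched on a toy corpus. The example below selects columns straight from the sparse matrix, which is one possible alternative to going through a dense DataFrame; the documents and the `chosen_words` list are made up for illustration:

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

docs = ["dragon fights knight", "knight rides horse", "dragon sleeps"]
cv = CountVectorizer()
counts = cv.fit_transform(docs)            # sparse, one column per word

chosen_words = ["dragon", "knight", "horse"]            # hypothetical "noun" list
cols = [cv.vocabulary_[w] for w in chosen_words]        # word -> column index

subset = counts[:, cols]                   # column selection, still sparse
print(sparse.issparse(subset), subset.shape)
print(subset.toarray())
```

Keeping the matrix sparse avoids allocating a dense array with one cell per document and vocabulary word, which matters once the vocabulary is unfiltered.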

We create a new noun matrix and carry out a cross-validation test for a multinomial Naive Bayes model.

In [259]:
df = pd.DataFrame(Xt.toarray())
noun_matrix = sparse.csr_matrix(df[noun_column_indexes])
scores = cross_val_score(multinomialNB, noun_matrix, y, scoring='accuracy', cv=10)
print('Accuracy for Nouns only: %.2f +- %.2f\n' % (scores.mean(), scores.std()))

Accuracy for Nouns only: 0.69 +- 0.03



We do the same thing again, this time with a word matrix for nouns and verbs combined.

In [355]:
noun_verb_matrix = sparse.csr_matrix(df[noun_column_indexes + verb_column_indexes])
scores = cross_val_score(multinomialNB, noun_verb_matrix, y, scoring='accuracy', cv=10)
print('Nouns + Verbs: %.2f +- %.2f\n' % (scores.mean(), scores.std()))

Nouns + Verbs: 0.70 +- 0.02



There is a slight improvement. Finally, we test nouns, verbs and adjectives combined in the model.

In [354]:
noun_verb_adj_matrix = sparse.csr_matrix(df[noun_column_indexes + verb_column_indexes + adjective_column_indexes])
scores = cross_val_score(multinomialNB, noun_verb_adj_matrix, y, scoring='accuracy', cv=10)
print('Nouns + verbs + adjectives: %.2f +- %.2f\n' % (scores.mean(), scores.std()))

Nouns + verbs + adjectives: 0.70 +- 0.03



From the accuracy figures, adjectives did not contribute anything to the model, except to introduce slightly wider variation. In the interests of parsimony, we fit the model on the noun + verb matrix.

In [356]:
X_train2, X_test2, y_train2, y_test2 = train_test_split(noun_verb_matrix, y, test_size=0.1, random_state=42)
multinomialNB.fit(X_train2, y_train2)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [357]:
y_train_pred2 = multinomialNB.predict(X_train2)

train_score = accuracy_score(y_train2, y_train_pred2)
#train_matrix = confusion_matrix(y_train2, y_train_pred2)
print(f"Training score: {train_score:.2f}")

Training score: 0.91


In [358]:
y_test_pred2 = multinomialNB.predict(X_test2)

test_score = accuracy_score(y_test2, y_test_pred2)
#train_matrix = confusion_matrix(y_train2, y_train_pred2)
print(f"Testing score {test_score:.2f}")

Testing score 0.70


When comparing training and testing accuracy, the large discrepancy (0.91 versus 0.70) suggests overfitting in the model. There is a slight improvement in testing accuracy (around 2 percentage points), but at the expense of a significantly larger predictor matrix (over 84,000 words compared to about 17,000 for model 1). Model 2 also required significantly more pre-processing. Overall, model 1 is preferable.

In [270]:
X_train2.shape[1]

84339

In [271]:
X_train.shape[1]

17590