In [2]:
# importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import sklearn

## Predicting the Genre of Books from Summaries

We'll use a set of book summaries from the [CMU Book Summaries Corpus](http://www.cs.cmu.edu/~dbamman/booksummaries.html) in this experiment. The corpus contains a large number of summaries (16,559) and includes metadata about the genre of each book, taken from Freebase. Each book can have more than one genre, and there are 227 genres listed in total. To simplify the problem of genre prediction, we will select a small number of target genres that occur frequently in the collection and keep only the books with these genre labels. This will give us one genre label per book.

Your goal in this portfolio is to take this data and build a predictive model to classify the books into one of the five target genres.  You will need to extract suitable features from the texts and select a suitable model to classify them. You should build at least one model but you could build two and compare the results if you have time.

You should report on each stage of your experiment as you work with the data.


## Data Preparation

The first task is to read the data. It is made available in tab-separated format but has no column headings. We can use `read_csv` to read this but we need to set the separator to `\t` (tab) and supply the column names.  The names come from the [ReadMe](data/booksummaries/README.txt) file.

In [3]:
names = ['wid', 'fid', 'title', 'author', 'date', 'genres', 'summary']

books = pd.read_csv("data/Portfolio3/booksummaries.txt", sep="\t", header=None, names=names, keep_default_na=False)
books.head()

| | wid | fid | title | author | date | genres | summary |
|---|---|---|---|---|---|---|---|
| 0 | 620 | /m/0hhy | Animal Farm | George Orwell | 1945-08-17 | {"/m/016lj8": "Roman \u00e0 clef", "/m/06nbt":... | Old Major, the old boar on the Manor Farm, ca... |
| 1 | 843 | /m/0k36 | A Clockwork Orange | Anthony Burgess | 1962 | {"/m/06n90": "Science Fiction", "/m/0l67h": "N... | Alex, a teenager living in near-future Englan... |
| 2 | 986 | /m/0ldx | The Plague | Albert Camus | 1947 | {"/m/02m4t": "Existentialism", "/m/02xlf": "Fi... | The text of The Plague is divided into five p... |
| 3 | 1756 | /m/0sww | An Enquiry Concerning Human Understanding | David Hume | | | The argument of the Enquiry proceeds by a ser... |
| 4 | 2080 | /m/0wkt | A Fire Upon the Deep | Vernor Vinge | | {"/m/03lrw": "Hard science fiction", "/m/06n90... | The novel posits that space around the Milky ... |


We next filter the data so that only books with one of our target genre labels are included, and we assign each text to just one of those labels. A text could carry two of the target labels (e.g. Science Fiction and Fantasy); in that case the loop below keeps whichever label comes later in `target_genres`.

In [23]:
target_genres = ["Children's literature",
                 'Science Fiction',
                 'Novel',
                 'Fantasy',
                 'Mystery']

# create a Series of empty strings the same length as the list of books
genre = pd.Series(np.repeat("", books.shape[0]))
# look for each target genre and set the corresponding entries in the genre series to the genre label
for g in target_genres:
    genre[books['genres'].str.contains(g)] = g

# add this to the book dataframe and then select only those rows that have a genre label
# drop some useless columns
books['genre'] = genre
genre_books = books[genre!=''].drop(['genres', 'fid', 'wid'], axis=1)

genre_books.shape

(8954, 5)
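To make the "one label per book" rule concrete, here is a minimal sketch on a toy frame. The three rows are made up for illustration and are not taken from the corpus; `regex=False` is an optional extra that treats each genre name as a literal string.

```python
import pandas as pd

# Hypothetical stand-in for the `genres` column (not real corpus rows)
toy = pd.DataFrame({
    "genres": [
        '{"a": "Science Fiction", "b": "Fantasy"}',  # carries two target genres
        '{"c": "Mystery"}',                          # carries one target genre
        '{"d": "Existentialism"}',                   # carries no target genre
    ]
})
target_genres = ["Children's literature", "Science Fiction",
                 "Novel", "Fantasy", "Mystery"]

# Start with empty labels, then overwrite wherever a target genre matches
label = pd.Series("", index=toy.index, dtype=object)
for g in target_genres:
    label[toy["genres"].str.contains(g, regex=False)] = g

print(label.tolist())
```

The first toy row ends up as 'Fantasy' because 'Fantasy' is visited after 'Science Fiction' in the loop, and the third row keeps its empty label and would be filtered out.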

In [24]:
# count the books in each genre category
# (not displayed: a cell only echoes its last expression)
genre_books.groupby('genre').count()
# display the books sorted by author
genre_books.sort_values('author')

| | title | author | date | summary | genre |
|---|---|---|---|---|---|
| 5167 | Master of the Void | | | In the post-war galaxy, ruined civilizations ... | Science Fiction |
| 14869 | Vintage Season | | 1946-09 | The story is set in an unnamed American city ... | Novel |
| 14862 | Alice in Verse: The Lost Rhymes of Wonderland | | 2010-01-11 | What distinguishes this variation on Lewis Ca... | Fantasy |
| 14861 | The Star-Crowned Kings | | | The book is about the adventures of accidenta... | Science Fiction |
| 8244 | Gooney Bird Greene | | | Gooney Bird Greene has just transferred to Mr... | Children's literature |
| ... | ... | ... | ... | ... | ... |
| 430 | La Curée | Émile Zola | 1871-02 | The book opens with scenes of astonishing opu... | Novel |
| 431 | L'Argent | Émile Zola | 1891 | The novel takes place in 1864-1869, beginning... | Novel |
| 1653 | Nana | Émile Zola | 1880 | Nana tells the story of Nana Coupeau's rise f... | Novel |
| 9744 | Le Docteur Pascal | Émile Zola | 1893 | Pascal, a physician in Plassans for 30 years,... | Novel |
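A quick way to see the class balance is `value_counts` on the genre column. The sketch below uses a small made-up frame rather than `genre_books`, so the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical miniature of `genre_books` (titles and genres are placeholders)
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "genre": ["Novel", "Fantasy", "Novel", "Mystery"],
})

# Number of books per genre, most frequent first
counts = toy["genre"].value_counts()
print(counts)
```

Applied to the real data, the same call on `genre_books['genre']` would show whether some genres are much rarer than others, which matters when reading the accuracy figures later on.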


## Modelling

Now you take over to build a suitable model and present your results.

In [25]:
# importing libraries
import re
from sklearn.feature_extraction.text import TfidfVectorizer

In [26]:
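# copy the title, author and genre columns into separate one-column DataFrames
# (note: this rebinds `books`, which previously held the full data set)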
books = pd.DataFrame(genre_books['title'])
author = pd.DataFrame(genre_books['author'])
genre = pd.DataFrame(genre_books['genre'])

In [27]:
print(books.shape)
print(genre.shape)

(8954, 1)
(8954, 1)


In [28]:
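# NOTE: the data was read with keep_default_na=False, so missing authors and titles
# are empty strings rather than NaN and these fillna calls leave them unchanged
genre_books['author'] = genre_books['author'].fillna('No Book')
genre_books['title'] = genre_books['title'].fillna('No Book')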

In [29]:
print(len(books))
print(len(genre))
genre.head(2)
books[5000:5011]

8954
8954


| | title |
|---|---|
| 8075 | The Last of the Jedi: Return of the Dark Side |
| 8076 | Brain |
| 8077 | Endymion |
| 8080 | A Girl Named Disaster |
| 8081 | The Egypt Game |
| 8082 | Thunder Oak |
| 8084 | Saturnalia |
| 8085 | War of the Twins |
| 8086 | Haters |
| 8087 | On Beauty |


In [30]:
genre['genre'].unique()


array(["Children's literature", 'Novel', 'Fantasy', 'Science Fiction',
       'Mystery'], dtype=object)

In [31]:
# Encode the genre names as integer codes; LabelEncoder assigns codes in sorted label order
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
genre['genre'] = le.fit_transform(genre['genre'])

In [32]:
genre['genre'].unique()


array([0, 3, 1, 4, 2], dtype=int64)
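The integer codes are only useful if we know which genre each one stands for. As a small sketch, the mapping can be read back from the encoder's `classes_` attribute; the label list here is just the five genre names in the order `unique()` printed them.

```python
from sklearn.preprocessing import LabelEncoder

# The five genre names, in the order they first appear in the data
labels = ["Children's literature", "Novel", "Fantasy",
          "Science Fiction", "Mystery"]

le = LabelEncoder().fit(labels)

# classes_ holds the label names sorted; position in the array is the integer code
mapping = {name: code for code, name in enumerate(le.classes_)}
for name, code in mapping.items():
    print(code, name)
```

This is the mapping needed to read the rows and columns of the confusion matrices further down.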

In [33]:
# combine title and author into a single text field
genre_books['Total'] = genre_books['title'] + ' ' + genre_books['author']
print(genre_books['Total'].head(5))

0                 Animal Farm George Orwell
1        A Clockwork Orange Anthony Burgess
2                   The Plague Albert Camus
4         A Fire Upon the Deep Vernor Vinge
6    A Wizard of Earthsea Ursula K. Le Guin
Name: Total, dtype: object


In [34]:
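# Remove stopwords from a string; relies on the `stop` list defined in the next cell.
# The comparison is case-sensitive, so capitalised words such as "The" are kept.
def change(t):
    tokens = t.split()
    return ' '.join(tok for tok in tokens if tok not in stop)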

In [35]:
# Using stopwords to clean data
from nltk.corpus import stopwords
stop = list(stopwords.words('english'))
stop[:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

In [36]:

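# preview the stopword-filtered text (the result is not stored,
# so the vectorizer below still sees the original text)
genre_books['Total'].apply(change)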


0                                Animal Farm George Orwell
1                       A Clockwork Orange Anthony Burgess
2                                  The Plague Albert Camus
4                            A Fire Upon Deep Vernor Vinge
6                      A Wizard Earthsea Ursula K. Le Guin
                               ...                        
16525                   Beautiful Creatures Margaret Stohl
16526                         Beautiful Chaos Gary Russell
16531    Guardians Ga'Hoole Book 4: The Siege Helen Dun...
16532                     The Casual Vacancy J. K. Rowling
16549                          The Third Lynx Timothy Zahn
Name: Total, Length: 8954, dtype: object
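The preview above still contains words like "A" and "The" because the stop list is lowercase and the comparison is case-sensitive. A minimal sketch of a case-insensitive variant follows; the tiny `STOP` set and the name `remove_stopwords` are placeholders for illustration, not the nltk list used in the notebook.

```python
# Hypothetical mini stop list for illustration (the notebook uses nltk's English list)
STOP = {"a", "the", "of"}

def remove_stopwords(text, stop=STOP):
    """Drop every whitespace-separated token whose lowercase form is in `stop`."""
    return " ".join(tok for tok in text.split() if tok.lower() not in stop)

print(remove_stopwords("A Fire Upon the Deep Vernor Vinge"))
print(remove_stopwords("The Plague Albert Camus"))
```

In this notebook the difference is mostly cosmetic, since `TfidfVectorizer` lowercases the text and applies its own `stop_words='english'` list.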

In [37]:
from sklearn.feature_extraction.text import TfidfVectorizer


In [61]:
# Build TF-IDF features from the combined title and author text
vectorizer = TfidfVectorizer(min_df=2, max_features=70000, strip_accents='unicode',lowercase =True,
                            analyzer='word', token_pattern=r'\w+', use_idf=True, 
                            smooth_idf=True, sublinear_tf=True, stop_words = 'english')
vectors = vectorizer.fit_transform(genre_books['Total'])
vectors.shape


(8954, 4134)
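The vocabulary is fairly small for almost 9,000 texts, and `min_df=2` is probably a large part of the reason: any term that occurs in only one title or author string is dropped. The sketch below shows the effect on three short strings built from titles and authors shown earlier; the second string is a made-up extra so that one author appears twice.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three short "title + author" strings; only the author is shared by two of them
docs = [
    "Animal Farm George Orwell",
    "Nineteen Eighty-Four George Orwell",
    "The Plague Albert Camus",
]

vec = TfidfVectorizer(min_df=2, stop_words="english", token_pattern=r"\w+")
X = vec.fit_transform(docs)

# Only terms that occur in at least two documents survive
print(sorted(vec.vocabulary_))
print(X.shape)
# The third document shares no surviving term, so its row is all zeros
print(X[2].nnz)
```

Rows that end up empty like this give the classifier nothing to work with, which is one limit of using titles and authors alone.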

In [62]:
# Using naive bayes model
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix


In [63]:
# Splitting the data into train and test sets (2% held out for testing)
X_train, X_test, y_train, y_test = train_test_split(vectors, genre['genre'], test_size=0.02)

In [64]:
print (X_train.shape)
print (y_train.shape)
print (X_test.shape)
print (y_test.shape)

(8774, 4134)
(8774,)
(180, 4134)
(180,)
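The split above is random and unseeded, so the exact numbers change from run to run. As a sketch of an alternative (not what was run here), `train_test_split` accepts `random_state` for reproducibility and `stratify` to keep the class proportions the same in both parts. The arrays below are synthetic.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 50 samples, 5 classes with 10 samples each
X = np.arange(100).reshape(50, 2)
y = np.repeat([0, 1, 2, 3, 4], 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y,
    test_size=0.2,     # hold out 20% of the samples
    stratify=y,        # preserve class proportions in both parts
    random_state=0,    # fixed seed so the split is repeatable
)

# Count of each class in the held-out part
print(np.bincount(y_te))
```

A larger `test_size` than 0.02 would also make the test accuracy less sensitive to which 180 books happen to be held out.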


### Multinomial NB


In [65]:
# Fitting a MultinomialNB model and scoring it on the test set
clf = MultinomialNB()
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
accuracy = metrics.accuracy_score(y_test, pred)
print('Accuracy is: %.4f\n' % accuracy)

Accuracy is: 0.6389



In [66]:
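# arguments are passed as (pred, y_test), so rows are predicted labels
# and columns are actual labels
mat = confusion_matrix(pred, y_test)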
mat

array([[ 8,  0,  0,  1,  0],
       [ 7, 39,  1,  6, 11],
       [ 3,  1, 16,  2,  1],
       [ 5,  2,  3, 26,  6],
       [ 2,  6,  3,  5, 26]], dtype=int64)

__Interpretation:-__ There are 5 genre categories. `LabelEncoder` numbers them in sorted order: 0 = "Children's literature", 1 = 'Fantasy', 2 = 'Mystery', 3 = 'Novel', 4 = 'Science Fiction'. The Multinomial NB model reaches an accuracy of 0.6389 (63.89%) on the 180-book test set. Because the matrix was computed as `confusion_matrix(pred, y_test)`, its rows are predicted labels and its columns are actual labels. The largest counts lie on the diagonal (115 of 180 books), so the model predicts the correct genre for most books, although roughly a third are misclassified. The largest off-diagonal entry is 11, in row 1 and column 4: 11 books whose actual genre is 'Science Fiction' were predicted as 'Fantasy'.
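The orientation of a confusion matrix depends on the argument order. scikit-learn's `confusion_matrix` expects the actual labels first and the predictions second; swapping them gives the transpose. A small check on made-up labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels for three classes
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])

cm = confusion_matrix(y_true, y_pred)          # rows = actual, columns = predicted
cm_swapped = confusion_matrix(y_pred, y_true)  # rows = predicted, columns = actual

print(cm)
print(np.array_equal(cm_swapped, cm.T))
```

Either orientation holds the same counts, but the reading of an off-diagonal cell flips, so it is worth stating which one is being used.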

### Logistic Regression

In [78]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn import linear_model
clf = linear_model.LogisticRegression(solver= 'sag',max_iter=200,random_state=450)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
y_pred_train = clf.predict(X_train)
accuracy_train = metrics.accuracy_score(y_train, y_pred_train)
accuracy = metrics.accuracy_score(y_test, pred)
print("Accuracy on training set:  %.4f\n"% accuracy_train)
print ("Accuracy Score on test data:  %.4f\n"% accuracy)

Accuracy on training set:  0.8084

Accuracy Score on test data:  0.6667
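With only 180 test books, a single test score is a noisy estimate. One option, sketched here on a made-up toy corpus rather than the real data, is stratified k-fold cross-validation with the vectorizer and classifier wrapped in a pipeline so that each fold fits its own vocabulary. The texts, labels and fold count are all illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: 3 genres x 6 short "titles"
texts = (
    ["dragon wizard quest", "wizard sword kingdom", "dragon kingdom magic",
     "magic sword quest", "wizard magic dragon", "kingdom quest sword"]
    + ["robot space colony", "space ship alien", "alien robot planet",
       "planet colony ship", "robot alien space", "ship planet colony"]
    + ["detective murder clue", "murder inspector case", "clue case detective",
       "inspector clue murder", "case detective inspector", "murder case clue"]
)
labels = ["Fantasy"] * 6 + ["Science Fiction"] * 6 + ["Mystery"] * 6

# The pipeline refits TF-IDF inside every fold, so no test text leaks into training
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=200))
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(model, texts, labels, cv=cv, scoring="accuracy")

print("mean accuracy: %.4f, std: %.4f" % (scores.mean(), scores.std()))
```

The mean and spread across folds give a better idea of how much a single split's accuracy can vary.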



In [85]:
matrix = confusion_matrix(y_test, pred)
print("Confusion Matrix on Test data is:-",matrix)

Confusion Matrix on Test data is:- [[ 9  3  3  9  1]
 [ 1 39  1  2  5]
 [ 0  0 16  4  3]
 [ 1  5  2 29  3]
 [ 1  8  1  7 27]]


In [86]:
train_matrix = confusion_matrix(y_train, y_pred_train)
print("Confusion Matrix on train data is:-",train_matrix)

Confusion Matrix on train data is:- [[ 679  130   29  175   54]
 [  23 1921   27  149  143]
 [  17   72 1068  159   57]
 [  31   85   49 1928  125]
 [  12  151   21  172 1497]]


__Interpretation:-__ The 5 genre categories are encoded by `LabelEncoder` in sorted order: 0 = "Children's literature", 1 = 'Fantasy', 2 = 'Mystery', 3 = 'Novel', 4 = 'Science Fiction'. Logistic regression reaches 0.8084 (80.84%) accuracy on the training data and 0.6667 (66.67%) on the test data. The gap of about 14 percentage points suggests some overfitting, so the test score is the better guide to how the model would perform on unseen books, and with only 180 test books that estimate is itself noisy. Both matrices here were computed with the actual labels first, so rows are actual labels and columns are predicted labels. The largest off-diagonal entry in the training matrix is 175, in row 0 and column 3: 175 books whose actual genre is "Children's literature" were predicted as 'Novel'.
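The figures quoted from the training confusion matrix can be recomputed directly from it. The sketch below copies the matrix printed above and assumes the label order that `LabelEncoder` produces (sorted genre names); the per-genre recall is an extra that the notebook does not report.

```python
import numpy as np

# Assumed label order: LabelEncoder sorts the genre names
labels = ["Children's literature", "Fantasy", "Mystery", "Novel", "Science Fiction"]

# Training confusion matrix as printed above (rows = actual, columns = predicted)
train_matrix = np.array([
    [ 679,  130,   29,  175,   54],
    [  23, 1921,   27,  149,  143],
    [  17,   72, 1068,  159,   57],
    [  31,   85,   49, 1928,  125],
    [  12,  151,   21,  172, 1497],
])

# Accuracy is the share of books on the diagonal
accuracy = np.trace(train_matrix) / train_matrix.sum()

# Largest off-diagonal cell: zero out the diagonal, then take the argmax
off_diag = train_matrix.copy()
np.fill_diagonal(off_diag, 0)
i, j = np.unravel_index(off_diag.argmax(), off_diag.shape)

# Recall per genre: correct predictions divided by the actual count in each row
recall = np.diag(train_matrix) / train_matrix.sum(axis=1)

print("accuracy: %.4f" % accuracy)
print("most common confusion: actual %r predicted as %r (%d books)"
      % (labels[i], labels[j], off_diag[i, j]))
for name, r in zip(labels, recall):
    print("%-22s recall %.3f" % (name, r))
```

Per-genre recall makes it easier to see which genres the model handles worst than a single accuracy number does.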

In [58]:
# predict the genre of a single title with the fitted vectorizer and classifier
text = ['A Girl Named Disaster']
text[0] = text[0].lower()
s = vectorizer.transform(text)
print(s.shape)
d = clf.predict(s)

(1, 4134)


In [59]:
le.inverse_transform(d)[0]

"Children's literature"

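__Interpretation:-__ Finally, we predict the genre of a book from its title alone. For the title 'A Girl Named Disaster', the model predicts "Children's literature". This title appears in the data set, so it may have been part of the training data and the prediction is not an independent test.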
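For predicting on new text, it can be convenient to bundle the vectorizer and classifier into one object that accepts raw strings. The following is a minimal sketch with a made-up training set and a made-up new title, and is not the model fitted above; it also prints the class probabilities, which show how confident the model is.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical training titles and genres (placeholders, not corpus data)
train_texts = [
    "dragon wizard quest", "wizard sword kingdom", "magic dragon kingdom",
    "robot space colony", "alien space ship", "planet robot alien",
    "detective murder clue", "inspector murder case", "clue case detective",
]
train_labels = ["Fantasy"] * 3 + ["Science Fiction"] * 3 + ["Mystery"] * 3

# One object that turns raw text into TF-IDF features and then classifies it
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

new_title = ["the wizard and the dragon"]  # made-up title not in the training list
pred = model.predict(new_title)[0]
proba = model.predict_proba(new_title)[0]

print("predicted genre:", pred)
for name, p in zip(model.classes_, proba):
    print("%-16s %.3f" % (name, p))
```

Testing on titles that were held out of training gives a fairer picture than re-predicting a title the model may already have seen.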