# Portfolio 3
Student Name: **Chi Thanh Liu**  
StudentID: **45728046**   
URL: (https://github.com/MQCOMP2200-S2-2020/portfolio-2020-Thanh-Liu)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

## Predicting the Genre of Books from Summaries

We'll use a set of book summaries from the [CMU Book Summaries Corpus](http://www.cs.cmu.edu/~dbamman/booksummaries.html) in this experiment.  This contains a large number of summaries (16,559) and includes meta-data about the genre of the books taken from Freebase.  Each book can have more than one genre and there are 227 genres listed in total.  To simplify the problem of genre prediction we will select a small number of target genres that occur frequently in the collection and select the books with these genre labels.  This will give us one genre label per book. 

Your goal in this portfolio is to take this data and build predictive models to classify the books into one of the five target genres.  You will need to extract suitable features from the texts and select suitable models to classify them. You should build and evaluate at least TWO models and compare the prediction results.

You should report on each stage of your experiment as you work with the data.


## Data Preparation

The first task is to read the data. It is made available in tab-separated format but has no column headings. We can use `read_csv` to read this but we need to set the separator to `\t` (tab) and supply the column names.  The names come from the [ReadMe](data/booksummaries/README.txt) file.

In [2]:
names = ['wid', 'fid', 'title', 'author', 'date', 'genres', 'summary']
books = pd.read_csv("data/booksummaries.txt", sep="\t", header=None, names=names, keep_default_na=False)
books.head()

| | wid | fid | title | author | date | genres | summary |
|---|---|---|---|---|---|---|---|
| 0 | 620 | /m/0hhy | Animal Farm | George Orwell | 1945-08-17 | {"/m/016lj8": "Roman \u00e0 clef", "/m/06nbt":... | Old Major, the old boar on the Manor Farm, ca... |
| 1 | 843 | /m/0k36 | A Clockwork Orange | Anthony Burgess | 1962 | {"/m/06n90": "Science Fiction", "/m/0l67h": "N... | Alex, a teenager living in near-future Englan... |
| 2 | 986 | /m/0ldx | The Plague | Albert Camus | 1947 | {"/m/02m4t": "Existentialism", "/m/02xlf": "Fi... | The text of The Plague is divided into five p... |
| 3 | 1756 | /m/0sww | An Enquiry Concerning Human Understanding | David Hume | | | The argument of the Enquiry proceeds by a ser... |
| 4 | 2080 | /m/0wkt | A Fire Upon the Deep | Vernor Vinge | | {"/m/03lrw": "Hard science fiction", "/m/06n90... | The novel posits that space around the Milky ... |
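
The `keep_default_na=False` argument in the call above matters for rows such as row 3, where `date` and `genres` are empty. The following is a minimal sketch on made-up, in-memory data (the three column names are a hypothetical subset, not the real file layout) showing how an empty field is read with and without that argument:

```python
import io
import pandas as pd

# Two made-up rows in a headerless, tab-separated layout;
# the second row has an empty last field.
sample = "1\tA Title\tFantasy\n2\tAnother Title\t\n"
cols = ["wid", "title", "genres"]  # hypothetical subset of the real columns

with_na = pd.read_csv(io.StringIO(sample), sep="\t", header=None, names=cols)
without_na = pd.read_csv(io.StringIO(sample), sep="\t", header=None,
                         names=cols, keep_default_na=False)

# Default behaviour: the empty field becomes a missing value (NaN)
print(with_na["genres"].isna().tolist())
# With keep_default_na=False: the empty field stays an empty string
print(without_na["genres"].tolist())
```

Keeping empty strings means later string operations on the `genres` column do not have to deal with missing values.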


We next filter the data so that only our target genre labels are included, and we assign each text to just one genre label. A text may carry more than one of these labels (e.g. Science Fiction and Fantasy); in that case the loop below keeps the last matching genre in `target_genres`, because later matches overwrite earlier ones.

In [3]:
target_genres = ["Children's literature",
                 'Science Fiction',
                 'Novel',
                 'Fantasy',
                 'Mystery']

# create a Series of empty strings the same length as the list of books
genre = pd.Series(np.repeat("", books.shape[0]))
# look for each target genre and set the corresponding entries in the genre series to the genre label
for g in target_genres:
    genre[books['genres'].str.contains(g)] = g

# add this to the book dataframe and then select only those rows that have a genre label
# drop some useless columns
books['genre'] = genre
genre_books = books[genre!=''].drop(['genres', 'fid', 'wid'], axis=1)

genre_books.shape


(8954, 5)
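
To make the label assignment above concrete, here is a small sketch on made-up genre strings (the `toy` frame and the shortened `targets` list are illustrative only). It follows the same loop pattern, with `regex=False` added as an assumption so that each genre name is matched literally:

```python
import numpy as np
import pandas as pd

# Made-up genre strings, only for illustration
toy = pd.DataFrame({"genres": ["Science Fiction, Fantasy", "Mystery", "Poetry", ""]})
targets = ["Science Fiction", "Fantasy", "Mystery"]

# Start with an empty label for every row
label = pd.Series(np.repeat("", toy.shape[0]))
for g in targets:
    # Rows whose genre string contains g get label g;
    # a later genre in the list overwrites an earlier one.
    label[toy["genres"].str.contains(g, regex=False)] = g

print(label.tolist())
```

The first row matches both Science Fiction and Fantasy and ends up labelled Fantasy, because Fantasy comes later in the list. Rows with no target genre keep the empty label and would be dropped by the `genre != ''` filter.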

In [4]:
genre_books.head()

| | title | author | date | summary | genre |
|---|---|---|---|---|---|
| 0 | Animal Farm | George Orwell | 1945-08-17 | Old Major, the old boar on the Manor Farm, ca... | Children's literature |
| 1 | A Clockwork Orange | Anthony Burgess | 1962 | Alex, a teenager living in near-future Englan... | Novel |
| 2 | The Plague | Albert Camus | 1947 | The text of The Plague is divided into five p... | Novel |
| 4 | A Fire Upon the Deep | Vernor Vinge | | The novel posits that space around the Milky ... | Fantasy |
| 6 | A Wizard of Earthsea | Ursula K. Le Guin | 1968 | Ged is a young boy on Gont, one of the larger... | Fantasy |


In [5]:
# check how many books we have in each genre category
genre_books.groupby('genre').count()

| genre | title | author | date | summary |
|---|---|---|---|---|
| Children's literature | 1092 | 1092 | 1092 | 1092 |
| Fantasy | 2311 | 2311 | 2311 | 2311 |
| Mystery | 1396 | 1396 | 1396 | 1396 |
| Novel | 2258 | 2258 | 2258 | 2258 |
| Science Fiction | 1897 | 1897 | 1897 | 1897 |


## Feature Extraction

Now you take over to build a suitable model and present your results.

Firstly, you need to perform feature extraction to produce feature vectors for the predictive models.

In [6]:
Y = genre_books.genre

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(max_features=500)
X = vectorizer.fit_transform(genre_books.summary).toarray()

In [8]:
print(vectorizer.get_feature_names())

['able', 'about', 'across', 'actually', 'after', 'again', 'against', 'age', 'agrees', 'all', 'almost', 'alone', 'along', 'already', 'also', 'although', 'american', 'among', 'an', 'and', 'another', 'any', 'appears', 'are', 'army', 'around', 'arrive', 'arrives', 'arthur', 'as', 'asks', 'at', 'attack', 'attacked', 'attempt', 'attempts', 'away', 'back', 'battle', 'be', 'because', 'become', 'becomes', 'been', 'before', 'begin', 'begins', 'behind', 'being', 'believe', 'believes', 'best', 'between', 'black', 'blood', 'body', 'book', 'both', 'boy', 'boys', 'bring', 'brother', 'brought', 'but', 'by', 'call', 'called', 'calls', 'can', 'cannot', 'captain', 'captured', 'car', 'case', 'castle', 'chapter', 'character', 'characters', 'child', 'children', 'city', 'close', 'come', 'comes', 'company', 'continues', 'control', 'could', 'country', 'crew', 'dark', 'daughter', 'david', 'day', 'days', 'dead', 'death', 'decide', 'decides', 'despite', 'destroy', 'destroyed', 'did', 'die', 'died', 'dies', 'diffe

In [9]:
X.shape, Y.shape

((8954, 500), (8954,))
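
As an illustration of what the TF-IDF step produces, the sketch below runs `TfidfVectorizer` on three made-up sentences with `max_features=3` instead of 500. The sentences and the `toy_` names are assumptions for the example only:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Three made-up "summaries"
docs = ["the wizard casts a spell",
        "the detective solves the case",
        "the wizard meets the detective"]

# Keep only the 3 terms with the highest total count across the documents
toy_vectorizer = TfidfVectorizer(max_features=3)
toy_X = toy_vectorizer.fit_transform(docs).toarray()

print(sorted(toy_vectorizer.vocabulary_))   # the terms that were kept
print(toy_X.shape)                          # one row per document, one column per term
print(np.linalg.norm(toy_X, axis=1))        # rows are L2-normalised by default
```

Each row is a document and each column a retained term, which is the same layout as the `(8954, 500)` matrix above. Note that very common words such as "the" survive the `max_features` cut, which matches the presence of words like 'about' and 'and' in the feature list printed earlier.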

## Model Training

Then, train two predictive models from the given data set.

In [10]:
#split the data into training dataset and testing dataset
from sklearn.model_selection import train_test_split
# split features and labels in one call so the rows stay aligned
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=142)

In [11]:
print(X_train.shape)
print(Y_train.shape)
print(X_test.shape)
print(Y_test.shape)

(7163, 500)
(7163,)
(1791, 500)
(1791,)
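
Features and labels must be split into the same rows. The sketch below, on made-up arrays, checks that a single `train_test_split` call on both arrays and two separate calls with the same `random_state` select the same rows. The `_demo` names are hypothetical:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(20).reshape(10, 2)   # 10 made-up rows; row i is [2i, 2i + 1]
y_demo = np.arange(10)                  # label i belongs to row i

# One call on both arrays
Xa_train, Xa_test, ya_train, ya_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=142)

# Two separate calls with the same seed
Xb_train, Xb_test = train_test_split(X_demo, test_size=0.2, random_state=142)
yb_train, yb_test = train_test_split(y_demo, test_size=0.2, random_state=142)

# Both approaches pick the same test rows
print(np.array_equal(Xa_test, Xb_test), np.array_equal(ya_test, yb_test))
# Each test row is still paired with its own label
print(np.array_equal(Xa_test[:, 0] // 2, ya_test))
```

Passing both arrays in one call is the more common idiom, since alignment then does not depend on remembering to reuse the seed.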


In [12]:
#Using logisticregression to classify
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, Y_train)
y_hat_test = model.predict(X_test)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
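
The warning above reports that the solver stopped at its iteration limit, and it suggests raising `max_iter` as one remedy. The following is a minimal sketch on synthetic data (generated with `make_classification`, not the book summaries), assuming 1000 iterations is a generous enough cap. Refitting the real model this way may shift the accuracy reported below slightly.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic 3-class problem, for illustration only
X_syn, y_syn = make_classification(n_samples=300, n_features=20,
                                   n_informative=10, n_classes=3,
                                   random_state=0)

# Allow the solver more iterations than the default of 100
model_demo = LogisticRegression(max_iter=1000)
model_demo.fit(X_syn, y_syn)

# n_iter_ holds the number of iterations the solver actually used
print(model_demo.n_iter_)
```

If `n_iter_` is below the cap, the solver stopped because it converged rather than because it ran out of iterations.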


In [13]:
#Using GaussianNB to classify
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(X_train, Y_train)
gnb_predicted = gnb.predict(X_test)

In [14]:
#Using RandomForestClassifier to classify
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()
clf.fit(X_train,Y_train)
clf_predicted= clf.predict(X_test)

## Model Evaluation

Finally, evaluate and compare the learned predictive models.

### LogisticRegression:

In [15]:
from sklearn.metrics import confusion_matrix, accuracy_score
#Calculate accuracy score and make a confusion matrix in LogisticRegression
print("Accuracy score: ", accuracy_score(Y_test, y_hat_test))
print("Confusion matrix: ")
print(confusion_matrix(Y_test, y_hat_test))

Accuracy score:  0.6152987158012284
Confusion matrix: 
[[101  52  18  64   2]
 [ 22 319  12  47  53]
 [ 11  29 150  58  23]
 [ 31  45  30 289  35]
 [  5  70  18  64 243]]


### Comment on the values in LogisticRegression:
* The accuracy of LogisticRegression on the test set is 0.615. This is only moderate, but it is the highest of the three models evaluated here.
* `confusion_matrix` orders its rows and columns by the sorted class labels: Children's literature, Fantasy, Mystery, Novel, Science Fiction. Each row is a true genre, and the diagonal holds the correct predictions. Of the 237 Children's literature books, 101 are classified correctly. Of the 453 Fantasy books, 319 are correct; of the 271 Mystery books, 150; of the 430 Novel books, 289; and of the 400 Science Fiction books, 243.
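
Because `confusion_matrix` sorts the class labels, each row can be paired with its genre name. The sketch below copies the LogisticRegression matrix printed above into an array (the name `cm_lr` is hypothetical) and derives the per-genre totals and the fraction classified correctly:

```python
import numpy as np

# Class labels in sorted order, as used by confusion_matrix
labels = sorted(["Children's literature", "Science Fiction",
                 "Novel", "Fantasy", "Mystery"])

# LogisticRegression confusion matrix copied from the output above
cm_lr = np.array([[101,  52,  18,  64,   2],
                  [ 22, 319,  12,  47,  53],
                  [ 11,  29, 150,  58,  23],
                  [ 31,  45,  30, 289,  35],
                  [  5,  70,  18,  64, 243]])

totals = cm_lr.sum(axis=1)    # number of test books per true genre
correct = np.diag(cm_lr)      # correctly classified books per genre

for name, t, c in zip(labels, totals, correct):
    print(f"{name:22s} total={t:3d} correct={c:3d} recall={c / t:.3f}")

print("overall accuracy:", correct.sum() / cm_lr.sum())
```

The same pattern applies to the other two matrices by swapping in their values.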

### RandomForestClassifier:

In [16]:
#Calculate accuracy score and make a confusion matrix in RandomForestClassifier
print("Accuracy score: ", accuracy_score(Y_test, clf_predicted))
print("Confusion matrix: ")
print(confusion_matrix(Y_test, clf_predicted))

Accuracy score:  0.6035734226689
Confusion matrix: 
[[ 64  62  15  90   6]
 [  9 326  16  57  45]
 [ 11  41 135  73  11]
 [ 12  43  27 312  36]
 [  3  73  19  61 244]]


### Comment on the values in RandomForestClassifier:
* The accuracy of RandomForestClassifier on the test set is 0.604, slightly below LogisticRegression.
* The total number of correctly classified books drops from 1102 to 1081. Fantasy (326) and Novel (312) are classified correctly more often than with LogisticRegression (319 and 289), while Children's literature falls from 101 to 64 and Mystery from 150 to 135.

### GaussianNB:

In [17]:
#Calculate accuracy score and make a confusion matrix in GaussianNB
print("Accuracy score: ", accuracy_score(Y_test, gnb_predicted))
print("Confusion matrix: ")
print(confusion_matrix(Y_test, gnb_predicted))

Accuracy score:  0.5388051367950866
Confusion matrix: 
[[117  39  32  34  15]
 [ 53 270  24  29  77]
 [ 23  24 164  37  23]
 [ 77  42  67 174  70]
 [ 12  76  30  42 240]]


### Comment on the values in GaussianNB:
* The accuracy of GaussianNB on the test set is 0.539, the lowest of the three models.
* The total number of correctly classified books is 965, which is 137 fewer than LogisticRegression (1102) and 116 fewer than RandomForestClassifier (1081).
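
The totals quoted in these comments can be rechecked from the diagonals of the three confusion matrices. The sketch below copies the diagonals from the outputs above and uses the test set size of 1791. The dictionary and variable names are hypothetical:

```python
# Diagonal (correctly classified) counts copied from the three matrices above
diagonals = {
    "LogisticRegression":     [101, 319, 150, 289, 243],
    "RandomForestClassifier": [ 64, 326, 135, 312, 244],
    "GaussianNB":             [117, 270, 164, 174, 240],
}
n_test = 1791  # size of the test set

# Total correct predictions and accuracy per model
correct_totals = {name: sum(d) for name, d in diagonals.items()}
accuracies = {name: c / n_test for name, c in correct_totals.items()}

for name in diagonals:
    print(f"{name:24s} correct={correct_totals[name]:4d} accuracy={accuracies[name]:.4f}")

# Gap between GaussianNB and the other two models
print(correct_totals["LogisticRegression"] - correct_totals["GaussianNB"])
print(correct_totals["RandomForestClassifier"] - correct_totals["GaussianNB"])
```

The recomputed accuracies agree with the `accuracy_score` outputs to four decimal places.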

# General:
I trained three classification models rather than the required two, computed the accuracy of each, and built a confusion matrix for each to count the correct and incorrect predictions per genre. Based on these results, logistic regression is the most effective of the three for classifying book genre (accuracy 0.615, against 0.604 for the random forest and 0.539 for Gaussian naive Bayes).