In [1]:
import re
import sklearn
import warnings
import pandas as pd
import numpy as np
from scipy.special import logsumexp
from abc import ABCMeta, abstractmethod

## 1. Understanding the data:

**Question:** You will be predicting the genre of a book by the book’s description. Is that feasible? Give 3 examples of specific keywords that may be useful, together with statistics on how often they appear in the book description may help to predict its genre.

**Answer:** Predicting the genre of a book solely from its description seems like a daunting task at first, but it should be feasible given the importance of finding genre-specific words that will help us to easily identify a genre of any given book. First, we should see whether we can find such words.

In [2]:
books = pd.read_csv("book_dataset_a2.csv", delimiter = "\t")

In [3]:
books

Unnamed: 0,title,author,description,coverImg,genre
0,The Hunger Games,Suzanne Collins,WINNING MEANS FAME AND FORTUNE.LOSING MEANS CE...,https://i.gr-assets.com/images/S/compressed.ph...,Young Adult
1,Harry Potter and the Order of the Phoenix,"J.K. Rowling, Mary GrandPré (Illustrator)",There is a door at the end of a silent corrido...,https://i.gr-assets.com/images/S/compressed.ph...,Fantasy
2,To Kill a Mockingbird,Harper Lee,The unforgettable novel of a childhood in a sl...,https://i.gr-assets.com/images/S/compressed.ph...,Classics
3,Pride and Prejudice,"Jane Austen, Anna Quindlen (Introduction)",Alternate cover edition of ISBN 9780679783268S...,https://i.gr-assets.com/images/S/compressed.ph...,Classics
4,Twilight,Stephenie Meyer,About three things I was absolutely positive.\...,https://i.gr-assets.com/images/S/compressed.ph...,Young Adult
...,...,...,...,...,...
21074,Elemental,Kim Richardson (Goodreads Author),When seventeen-year-old Kara Nightingale is su...,https://i.gr-assets.com/images/S/compressed.ph...,Fantasy
21075,Unbelievable,Sherry Gammon (Goodreads Author),Lilah Lopez Dreser's in town to take care of u...,https://i.gr-assets.com/images/S/compressed.ph...,Romance
21076,Anasazi,Emma Michaels,"'Anasazi', sequel to 'The Thirteenth Chime' by...",https://i.gr-assets.com/images/S/compressed.ph...,Mystery
21077,Marked,Kim Richardson (Goodreads Author),--READERS FAVORITE AWARDS WINNER 2011--Sixteen...,https://i.gr-assets.com/images/S/compressed.ph...,Fantasy


Here we have the description column which is full of string descriptions of the 21079 books we have in our dataset:

In [4]:
books['description']

0        WINNING MEANS FAME AND FORTUNE.LOSING MEANS CE...
1        There is a door at the end of a silent corrido...
2        The unforgettable novel of a childhood in a sl...
3        Alternate cover edition of ISBN 9780679783268S...
4        About three things I was absolutely positive.\...
                               ...                        
21074    When seventeen-year-old Kara Nightingale is su...
21075    Lilah Lopez Dreser's in town to take care of u...
21076    'Anasazi', sequel to 'The Thirteenth Chime' by...
21077    --READERS FAVORITE AWARDS WINNER 2011--Sixteen...
21078    A POWERFUL TREMOR UNEARTHS AN ANCIENT SECRETBu...
Name: description, Length: 21079, dtype: object

### Examples of some specific keywords that may be useful for the 'Fantasy' genre:

Now let's look at some words which could be important for the 'Fantasy' genre.

In [5]:
def wordCountinGenre (wordTosearch, genreOfchoice):
    true = 0
    for i,desc in enumerate(books['description']):
        if ((books["genre"][i] == genreOfchoice) & (wordTosearch in desc)):
            true += 1
    return(true)

In [6]:
possibleImportantWordsforFantasy = ['wand','sword','shield','magic','king','elf','dragon','dwarf','kingdom']

In [7]:
print('Number of some words in the Fantasy genre:')
for word in possibleImportantWordsforFantasy:
    print(word, 'count:', wordCountinGenre (word, "Fantasy"))

Number of some words in the Fantasy genre:
wand count: 65
sword count: 238
shield count: 25
magic count: 1399
king count: 1563
elf count: 1167
dragon count: 359
dwarf count: 38
kingdom count: 423


As we can see from the above experiment, top 3 most popular words are: **king** with **1563** times, **magic** with **1399** times, and **elf** with **1167** times occured in the descriptions of fantasy books we have in the dataset.

In [8]:
from sklearn.model_selection import train_test_split
descriptions, descriptions_test = train_test_split(books["description"], test_size = 0.2, shuffle=False)

In [9]:
for i,description in enumerate(descriptions):
    descriptions[i] = description.lower()
    descriptions[i] = re.sub('[^a-z]',' ',descriptions[i])
    descriptions[i] = re.sub('/  +/g',' ',descriptions[i])

In [10]:
len(descriptions)

16863

In [11]:
len(descriptions_test)

4216

In [12]:
bag_of_all_words = {}
for i in range(len(descriptions)):
    words = descriptions[i].split()
    for j in range(len(words)):
        try:
            count  = bag_of_all_words[str(words[j])]
            count += 1
            bag_of_all_words.update({str(words[j]):count})
        except:
            bag_of_all_words[str(words[j])] = 0

In [13]:
import time
def printDict(dictionary, length = 1):
    for i in range(length):
        key = list(dictionary)[i]
        val = list(dictionary.values())[i]
        print(key,':',val)
        time.sleep(10*0.001)

In [14]:
len(bag_of_all_words)

66083

In [15]:
printDict(bag_of_all_words,10)

winning : 533
means : 673
fame : 158
and : 86053
fortune : 256
losing : 293
certain : 341
death : 2242
the : 145032
hunger : 128


In [16]:
genres = books["genre"].unique()
bagOfbagsUnigram = dict.fromkeys(genres)

In [17]:
bagOfbagsUnigram

{'Young Adult': None,
 'Fantasy': None,
 'Classics': None,
 'Science Fiction': None,
 'Fiction': None,
 'Horror': None,
 'Romance': None,
 'Mystery': None,
 'History': None,
 'Thriller': None}

In [18]:
# Create nested dictionary for unigram method aptly named 'bag of bag' 
# because it is a big bag with smaller bag of words for different genres
for i in range(len(descriptions)):
    genre = books["genre"][i]
    bag_of_words = bagOfbagsUnigram[str(genre)]
    if (bag_of_words == None):
        bag_of_words = {}
    words = descriptions[i].split()
    for j in range(len(words)):
        try:
            count = bag_of_words[str(words[j])]
            count += 1
            bag_of_words.update({str(words[j]):count})
        except:
            bag_of_words[str(words[j])] = 0
    bagOfbagsUnigram[str(genre)] = bag_of_words

In [19]:
printDict(bagOfbagsUnigram['Young Adult'],5)

winning : 83
means : 156
fame : 18
and : 11312
fortune : 17


In [20]:
printDict(bagOfbagsUnigram['Classics'],5)

the : 6230
unforgettable : 20
novel : 317
of : 4342
a : 2601


In [21]:
printDict(bagOfbagsUnigram['Romance'],5)

set : 163
amid : 16
the : 14670
austere : 0
beauty : 136


In [22]:
from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer()
vector = vec.fit_transform(bagOfbagsUnigram['Romance']).toarray()
vector

array([[9.919e+03, 0.000e+00, 0.000e+00, ..., 2.000e+00, 0.000e+00,
        1.000e+00]])

In [23]:
def returnSizeofClasses (bagOfbagsUnigram):
    total_size_of_classes = dict.fromkeys(bagOfbagsUnigram.keys(),0)
    for genre in bagOfbagsUnigram.keys():
        genreBag = bagOfbagsUnigram[genre]
        size_of_class = 0 
        for count in genreBag.values():
            size_of_class += count
        total_size_of_classes[genre] = size_of_class
    return total_size_of_classes

## 2. Implementing Naive Bayes:

We have represented our data with listed approaches and used them to learn a classifier via
Naive Bayes algorithm. We have implemented our own Naive Bayes algorithm, as seen below:

In [38]:
returnSizeofClasses (bagOfbagsUnigram)

{'Young Adult': 334182,
 'Fantasy': 551334,
 'Classics': 80119,
 'Science Fiction': 123970,
 'Fiction': 567334,
 'Horror': 74828,
 'Romance': 342204,
 'Mystery': 171763,
 'History': 77406,
 'Thriller': 50526}

In [35]:
def NaiveBayes (bagOfdescription, bagOfbagsUnigram, books):
    total_unique_words = len(bag_of_words)
    outputDict = dict.fromkeys(bagOfbagsUnigram.keys())
    total_size_of_classes = returnSizeofClasses (bagOfbagsUnigram)
    N =  books['genre'].count()
    for genre in bagOfbagsUnigram.keys():
        genreBag = bagOfbagsUnigram[genre]
        total_size_of_class = total_size_of_classes[genre] # count(c)
        N_c = (books["genre"] == genre).sum()
        prior = np.log(float(N_c/N))
        for word in bagOfdescription:
            try:
                word_count_in_class = genreBag[word] # count(w,c)
            except:
                word_count_in_class = 0
            for i in range(int(bagOfdescription[word])):
                P_w_c = float(word_count_in_class + 1)/float(total_size_of_class + total_unique_words)
                prior += np.log(P_w_c)
        #print('Value for ',genre,': ',prior)
        outputDict[genre] = prior
    max_key = max(outputDict, key=outputDict.get)
    return max_key

(Note: We have also used laplace smoothing as can be seen in the '+1' addition on the divident next to **word_count_in_class** variable.)

Features: We have used Bag of Words (BoW) model which learns a vocabulary from all the
documents, then models each document by counting the number of times each word appears.

**strTobagOfwordsUnigram** function converts **string** descriptions to **nestedDict** (Unigram) bag of words.

In [25]:
def strTobagOfwordsUnigram (string):
    bag_of_words = {}
    words = string.split()
    for j in range(len(words)):
        try:
            count  = bag_of_words[str(words[j])]
            count += 1
            bag_of_words.update({str(words[j]):count})
        except:
            bag_of_words[str(words[j])] = 0
    return bag_of_words

In [26]:
NaiveBayesResultsUnigram = []
for desc in descriptions_test:
    NaiveBayesResultsUnigram.append(NaiveBayes(strTobagOfwordsUnigram(desc), bagOfbagsUnigram, books))

## 3. Error Analysis

**Question:** Find a few misclassified books and comment on why you think they were hard to classify

**Answer:** Here are three examples of misclassified books:

In [None]:
count = 0
for i in range(len(NaiveBayesResultsBigram)):
    if (books['genre'][i] != NaiveBayesResultsBigram[i]):
        print('='*16)
        print('Index of book:',i)
        print('Actual book genre:', books['genre'][i])
        print('Prediction:', NaiveBayesResultsBigram[i])
        count += 1
        if (count == 3):
            break;

**Text Descriptions of the misclassified books:**

Index of book: 0

Actual book genre: Young Adult

Prediction: Thriller

In [None]:
books['description'][0]

Index of book: 1

Actual book genre: Fantasy

Prediction: Fiction

In [None]:
books['description'][1]

Index of book: 2

Actual book genre: Classics

Prediction: Mystery

In [None]:
books['description'][2]

## 4. Modul Analysis

In [None]:
def createBagofBagforGenre (descriptions,genre,books):
    bag_of_words = {}
    for i in range(len(descriptions)):
        if(genre == books["genre"][i]):
            bag_of_words[i] = strTobagOfwordsUnigram(descriptions[i])
    return bag_of_words

In [None]:
fantasyBagofBag = createBagofBagforGenre (descriptions,'Fantasy',books)

In [None]:
def documentFrequencyforWord (bagOfbagGenre, word):
    count = 0
    for bagOfword in bagOfbagGenre.values():
        if(word in bagOfword.values()):
            count += 1
    return count

In [None]:
def tfIdfNormalizer (bagOfbagsUnigram,descriptions,books):
    bagOfbagsTfidf = dict.fromkeys(genres)
    total_size_of_classes = returnSizeofClasses(bagOfbagsUnigram)
    for genre in bagOfbagsUnigram.keys():
        n = (books["genre"] == genre).sum()
        genreBag = bagOfbagsUnigram[genre]
        bagOfbagGenre = createBagofBagforGenre (descriptions,genre,books) 
        bag_of_words = bagOfbagsTfidf[str(genre)]
        if (bag_of_words == None):
            bag_of_words = {}
        for word,count in genreBag.items():
            tf = count
            df = documentFrequencyforWord(bagOfbagGenre, word)
            idf = np.log(float(1 + n) / float(1 + df)) + 1
            bag_of_words.update({word: tf*idf })
        bagOfbagsTfidf[str(genre)] = bag_of_words
    return bagOfbagsTfidf

In [None]:
bagOfbagsTfidf = tfIdfNormalizer (bagOfbagsUnigram,descriptions,books)

In [None]:
printDict(bagOfbagsTfidf['Fantasy'], 100)

### a) Analyzing effect of the words on prediction

**Top 10 words whose _presence_ most strongly predicts the genre of the book:**

In [None]:
import heapq
print('Words whose presence are the strongest predictors for:')
for genre in bagOfbagsUnigram.keys():   
    print('='*16)
    print(genre, ':')
    output = heapq.nlargest(10, bagOfbagsTfidf[genre], key=bagOfbagsTfidf[genre].get)
    string = ''
    for val in output:
        string += (val + ', ')
    print(string)

**Top 10 words whose _absence_ most strongly predicts the genre of the book:**

In [None]:
print('Words whose absence are the strongest predictors for:')
for genre in bagOfbagsUnigram.keys():   
    print('='*16)
    print(genre, ':')
    output = heapq.nsmallest(10, bagOfbagsTfidf[genre], key=bagOfbagsTfidf[genre].get)
    string = ''
    for val in output:
        string += (val + ', ')
    print(string)

### b) Stop Words

First we import the set of English stop words from the nltk library.

In [None]:
import nltk
# nltk.download('stopwords')
from nltk.corpus import stopwords
print(stopwords.words('english'))

Create a set of stopwords from the list above:

In [None]:
setOfstopWords = set(stopwords.words('english'))

In [None]:
setOfnonStopwordsinFantasy = {}

for word,count in bagOfbagsUnigram['Fantasy'].items():   
    if (word not in setOfstopWords):
        setOfnonStopwordsinFantasy[word] = count

#### The 10 non-stop words that most strongly predict that the book genre is ’Fantasy’:

In [None]:
output = heapq.nsmallest(10, setOfnonStopwordsinFantasy, key=setOfnonStopwordsinFantasy.get)
string = ''
for val in output:
    string += (val + ', ')
print(string)

#### The 10 non-stop words that most strongly predict that the book genre is ’Mystery’:

In [None]:
setOfnonStopwordsinMystery = {}

for word,count in bagOfbagsUnigram['Mystery'].items():   
    if (word not in setOfstopWords):
        setOfnonStopwordsinMystery[word] = count

In [None]:
output = heapq.nsmallest(10, setOfnonStopwordsinMystery, key=setOfnonStopwordsinFantasy.get)
string = ''
for val in output:
    string += (val + ', ')
print(string)

#### As we can see in the above results, the non-stop words that most strongly predict a given genre changed drastically from the initial results which included stop words as well in the above section of 4.a.

============================================================================

**For example in 4.a, the top 10 words for Fantasy were:**

the, and, of, a, to, in, is, her, as, his

**Where all the top 10 words can be classified as a stop word. After we exclude the stop words, we can see that we have more meaningful words now compared to 4.a:**

corridor, pottter, gryffindor, staples, transcended, chronlogical, unambitious, hobbiton, tattoos, brandishing

============================================================================

**The top 10 words for Mystery were:**

the, a, and, of, to, as, in, is, her, his

**Where all the top 10 words can be classified as a stop word. After we exclude the stop words, we can see that we have more meaningful words now compared to 4.a:**

dine, overslept, chopping, chopped, hive, bumblebee, stung, chancery, herring, swallowed 

### Training the Models again with Stopwords and Calculating their Accuracy:

**1. Unigram model with stopwords:**

In [None]:
bagOfbagsUnigramStopwords = dict.fromkeys(genres)

In [None]:
# Create nested dictionary for unigram method aptly named 'bag of bag' 
# because it is a big bag with smaller bag of words for different genres
for i in range(len(descriptions)):
    genre = books["genre"][i]
    bag_of_words = bagOfbagsUnigramStopwords[str(genre)]
    if (bag_of_words == None):
        bag_of_words = {}
    words = descriptions[i].split()
    for j in range(len(words)):
        if (words[j] not in setOfstopWords):
            try:
                count = bag_of_words[str(words[j])]
                count += 1
                bag_of_words.update({str(words[j]):count})
            except:
                bag_of_words[str(words[j])] = 0
    bagOfbagsUnigramStopwords[str(genre)] = bag_of_words

In [None]:
NaiveBayesResultsUnigramStopwords = []
for desc in descriptions_test:
    NaiveBayesResultsUnigramStopwords.append(NaiveBayes(strTobagOfwordsUnigram(desc), bagOfbagsUnigramStopwords, books))

In [None]:
overall_accuracy = performanceMeasures (books['genre'],NaiveBayesResultsUnigramStopwords,"accuracy")
print('Overall Accuracy:',"{:.2f}".format(overall_accuracy))

**2. Bigram model with stopwords:**

In [None]:
bagOfbagsBigramStopwords = dict.fromkeys(genres)

In [None]:
# Create nested dictionary for unigram method aptly named 'bag of bag' 
# because it is a big bag with smaller bag of words for different genres
for i in range(len(descriptions)):
    genre = books["genre"][i]
    bag_of_words = bagOfbagsBigramStopwords[str(genre)]
    if (bag_of_words == None):
        bag_of_words = {}
    words = descriptions[i].split()
    for j in range(len(words)):
        if (words[j] not in setOfstopWords):
            try:
                count = bag_of_words[str(words[j])]
                count += 1
                bag_of_words.update({str(words[j]):count})
            except:
                bag_of_words[str(words[j])] = 0
    bagOfbagsBigramStopwords[str(genre)] = bag_of_words

In [None]:
NaiveBayesResultsBigramStopwords = []
for desc in descriptions_test:
    NaiveBayesResultsBigramStopwords.append(NaiveBayes(strTobagOfwordsBigram(desc), bagOfbagsBigramStopwords, books))

In [None]:
overall_accuracy = performanceMeasures (books['genre'],NaiveBayesResultsBigramStopwords,"accuracy")
print('Overall Accuracy:',"{:.2f}".format(overall_accuracy))

**3. Unigram model with TF-IDF without stopwords**

**(Words with more tf-idf score than 1000 are removed from the bag of words):**

In [None]:
bagOfbagsUnigramTfIdfStopwords = dict.fromkeys(genres)

In [None]:
# Create nested dictionary for unigram method aptly named 'bag of bag' 
# because it is a big bag with smaller bag of words for different genres
for i in range(len(descriptions)):
    genre = books["genre"][i]
    bag_of_words = bagOfbagsUnigramTfIdfStopwords[str(genre)]
    if (bag_of_words == None):
        bag_of_words = {}
    words = descriptions[i].split()
    for j in range(len(words)):
        if (words[j] not in setOfstopWords):
            try:
                count = bag_of_words[str(words[j])]
                count += 1
                bag_of_words.update({str(words[j]):count})
            except:
                bag_of_words[str(words[j])] = 0
    bagOfbagsUnigramTfIdfStopwords[str(genre)] = bag_of_words

In [None]:
NaiveBayesResultsUnigramTfIdfStopwords = []
for desc in descriptions_test:
    NaiveBayesResultsUnigramTfIdfStopwords.append(NaiveBayes(strTobagOfwordsUnigram(desc), bagOfbagsUnigramTfIdfStopwords, books))

In [None]:
overall_accuracy = performanceMeasures (books['genre'],NaiveBayesResultsUnigramTfIdfStopwords,"accuracy")
print('Overall Accuracy:',"{:.2f}".format(overall_accuracy))

**4. Unigram model with TF-IDF with stopwords**

**(Words with more tf-idf score than 1000 are removed from the bag of words):**

In [None]:
bagOfbagsUnigramTfIdf = dict.fromkeys(genres)

In [None]:
# Create nested dictionary for unigram method aptly named 'bag of bag' 
# because it is a big bag with smaller bag of words for different genres
for i in range(len(descriptions)):
    genre = books["genre"][i]
    bag_of_words = bagOfbagsUnigramTfIdf[str(genre)]
    if (bag_of_words == None):
        bag_of_words = {}
    words = descriptions[i].split()
    for j in range(len(words)):
        if (words[j] not in setOfstopWords):
            try:
                count = bag_of_words[str(words[j])]
                count += 1
                bag_of_words.update({str(words[j]):count})
            except:
                bag_of_words[str(words[j])] = 0
    bagOfbagsUnigramTfIdf[str(genre)] = bag_of_words

In [None]:
NaiveBayesResultsUnigramTfIdf = []
for desc in descriptions_test:
    NaiveBayesResultsUnigramTfIdf.append(NaiveBayes(strTobagOfwordsUnigram(desc), bagOfbagsUnigramTfIdf, books))

In [None]:
overall_accuracy = performanceMeasures (books['genre'],NaiveBayesResultsUnigramTfIdf,"accuracy")
print('Overall Accuracy:',"{:.2f}".format(overall_accuracy))

### c) Analyzing effect of the stop words

**Question:** Why might it make sense to remove stop words when interpreting the model? Why might
it make sense to keep stop words?

**Answer:** It definitely makes more sense to **_remove_** the stop words when interpreting the results of the models. As we seen in part **4.b**, most of the words whose presence most strongly predicts the genre usually comprised of words such as:

the, a, and, of, to, as, in, is, her, his

Which do not make sense on their own, hence provide no insight to predict the genre of a given book description.


#### Overall Conclusion for Stopwords' Effect on Model Performance:
Although removing the stopwords from the results in first part of 4.b seemed to make the outputs more meaningful by eliminating a high number meaningless stop words in the top 10 list of words with lowest tf-idf score, on the contrary, the performance tests show us that the overall model accuracy usually drops by a significant amount _**(i.e. from 0.15 to 0.13 or from 0.08 to 0.01)**_ when stopwords are removed from the training set, as seen on the below table:

Method | Stopword | Accuracy 
:-: |:-: |:-:
BoW-unigram | No | 0.15
BoW-unigram | Yes | 0.13 
BoW-bigram | No | 0.08
BoW-bigram | Yes | 0.01 
TF-IDF | No | 0.02
TF-IDF | Yes | 0.02

## 5. Calculation of Accuracy

**We have computed the accuracy of our models to measure the success of our classification
methods:**

**performanceMeasures** function calculates the overall performance of the models automatically. It calculates the **precision and recall metrics for each genre** as well as the **overall accuracy**.

In [None]:
def performanceMeasures (test_input,test_output,measure="accuracy",genre="Fantasy"):
    sample_size = len(test_output)
    TP = 0 # True positive
    TN = 0 # True negative
    FP = 0 # False positive
    FN = 0 # False negative
    for i in range(sample_size):
        if (test_input[i] == test_output[i]):
            if (test_output[i] == genre):
                TP += 1
            else:
                TN += 1
        else:
            if (test_output[i] == genre):
                FP += 1
            else:
                FN += 1

    if (measure == "accuracy"):
        accuracy = float(TP+TN)/float(sample_size)
        #return "{:.2f}".format(accuracy)
        return accuracy
    elif (measure == "precision"):
        precision = float(TP)/float(TP+FP)
        #return "{:.2f}".format(precision)
        return precision
    elif (measure == "recall"):
        recall = float(TP)/float(TP+FN)
        #return "{:.2f}".format(recall)
        return recall

### Performance Results for Unigram Naive Bayes Classifier:

In [None]:
overall_accuracy = performanceMeasures (books['genre'],NaiveBayesResultsUnigram,"accuracy")
print('Overall Accuracy:',"{:.2f}".format(overall_accuracy))
for genre in bagOfbagsUnigram.keys():
    recall = performanceMeasures (books['genre'],NaiveBayesResultsUnigram,"recall",genre)
    precision = performanceMeasures (books['genre'],NaiveBayesResultsUnigram,"precision",genre)
    print(genre,':')
    print('\tRecall: ',"{:.2f}".format(recall))
    print('\tPrecision: ',"{:.2f}".format(precision))

**strTobagOfwordsBigram** function converts **string** descriptions to **nestedDict** (Bigram) bag of words.

In [None]:
def strTobagOfwordsBigram (string):
    bag_of_words = {}
    words = string.split()
    for j in range(len(words)-1):
        word = words[j] + " " + words[j+1]
        try:
            count  = bag_of_test[str(word)]
            count += 1
            bag_of_test.update({str(word):count})
        except:
            bag_of_words[str(word)] = 1 
    return bag_of_words

In [None]:
bagOfbagsBigram = dict.fromkeys(genres)

In [None]:
bagOfbagsBigram

In [None]:
for i in range(len(descriptions)):
    genre = books["genre"][i]
    bag_of_words = bagOfbagsBigram[str(genre)]
    if (bag_of_words == None):
        bag_of_words = {}
    words = descriptions[i].split()
    for j in range(len(words) - 1):
        word = words[j] + " " + words[j+1]
        try:
            count = bag_of_words[str(word)]
            count += 1
            bag_of_words.update({str(word):count})
        except:
            bag_of_words[str(word)] = 0
    bagOfbagsBigram[str(genre)] = bag_of_words

In [None]:
NaiveBayesResultsBigram = []
for desc in descriptions_test:
    NaiveBayesResultsBigram.append(NaiveBayes(strTobagOfwordsBigram(desc), bagOfbagsBigram, books))

### Performance Results for Bigram Naive Bayes Classifier:

In [None]:
overall_accuracy = performanceMeasures (books['genre'],NaiveBayesResultsBigram,"accuracy")
print('Overall Accuracy:',"{:.2f}".format(overall_accuracy))
for genre in bagOfbagsBigram.keys():
    recall = performanceMeasures (books['genre'],NaiveBayesResultsBigram,"recall",genre)
    precision = performanceMeasures (books['genre'],NaiveBayesResultsBigram,"precision",genre)
    print(genre,':')
    print('\tRecall: ',"{:.2f}".format(recall))
    print('\tPrecision: ',"{:.2f}".format(precision))

### Performance Comparison of all Methods used in this Report:

Method | Stopword | Accuracy 
:-: |:-: |:-:
BoW-unigram | No | 0.15
BoW-unigram | Yes | 0.13 
BoW-bigram | No | 0.08
BoW-bigram | Yes | 0.01 
TF-IDF | No | 0.02
TF-IDF | Yes | 0.02

=END OF REPORT=

THANK YOU FOR YOUR TIME.