<a href="https://colab.research.google.com/github/feliciahf/NLP-Project/blob/main/amazon/colab_FH.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Naive Bayes from scratch

# Importing & Cleaning data

This section imports the data into a pandas dataframe and goes through the following preprocessing steps:

- Case collapsing
- Punctuation removal
- Tokenization
- N-grams: bigrams and trigrams
- Stemming (check whether this makes a difference)
- Lemmatization (check whether this makes a difference)
- Part-of-speech (POS) tagging
- Named entity recognition (NER)

## Import Data
We first import the data as a pandas dataframe and keep only the title and genre columns. We then create a list of all titles and display the 32 genres.

In order to import the data into Google Colab, we first mount Google Drive and then read the csv file stored there:

In [1]:
# mount Google Drive
from google.colab import drive
drive.mount('/drive')

Mounted at /drive


In [2]:
# import file from Google Drive
import pandas as pd
df = pd.read_csv('/drive/My Drive/book32listing.csv',encoding='latin1', header=None)

In [3]:
# drop columns that are not needed
#df = pd.read_csv("book32listing.csv", encoding='latin1', header=None)
df1 = df[[3,6]] # only columns with titles and genres
df1.columns = ['title', 'genre']
print(df1)

                                                    title      genre
0                         Mom's Family Wall Calendar 2016  Calendars
1                         Doug the Pug 2016 Wall Calendar  Calendars
2       Moleskine 2016 Weekly Notebook, 12M, Large, Bl...  Calendars
3                 365 Cats Color Page-A-Day Calendar 2016  Calendars
4                    Sierra Club Engagement Calendar 2016  Calendars
...                                                   ...        ...
207567  ADC the Map People Washington D.C.: Street Map...     Travel
207568  Washington, D.C., Then and Now: 69 Sites Photo...     Travel
207569  The Unofficial Guide to Washington, D.C. (Unof...     Travel
207570      Washington, D.C. For Dummies (Dummies Travel)     Travel
207571  Fodor's Where to Weekend Around Boston, 1st Ed...     Travel

[207572 rows x 2 columns]


In [4]:
titles = df1['title'] # list of all titles
titles1 = titles.values.tolist() # change to list of strings
print(titles1[0:6]) # test whether it worked

["Mom's Family Wall Calendar 2016", 'Doug the Pug 2016 Wall Calendar', 'Moleskine 2016 Weekly Notebook, 12M, Large, Black, Soft Cover (5 x 8.25)', '365 Cats Color Page-A-Day Calendar 2016', 'Sierra Club Engagement Calendar 2016', 'Sierra Club Wilderness Calendar 2016']


In [5]:
genres = df1['genre']
genres = genres.values.tolist()
genres = pd.DataFrame(genres)

In [6]:
df1.genre.unique() # list of all possible genres

array(['Calendars', 'Comics & Graphic Novels', 'Test Preparation',
       'Mystery, Thriller & Suspense', 'Science Fiction & Fantasy',
       'Romance', 'Humor & Entertainment', 'Literature & Fiction',
       'Gay & Lesbian', 'Engineering & Transportation',
       'Cookbooks, Food & Wine', 'Crafts, Hobbies & Home',
       'Arts & Photography', 'Education & Teaching',
       'Parenting & Relationships', 'Self-Help', 'Computers & Technology',
       'Medical Books', 'Science & Math', 'Health, Fitness & Dieting',
       'Business & Money', 'Law', 'Biographies & Memoirs', 'History',
       'Politics & Social Sciences', 'Reference',
       'Christian Books & Bibles', 'Religion & Spirituality',
       'Sports & Outdoors', 'Teen & Young Adult', "Children's Books",
       'Travel'], dtype=object)

## Case Collapsing
Convert all uppercase letters to lowercase.

In [7]:
case_collap = map(lambda x:x.lower(), titles1)
case_collap_list = list(case_collap)
print(case_collap_list[0:6])

["mom's family wall calendar 2016", 'doug the pug 2016 wall calendar', 'moleskine 2016 weekly notebook, 12m, large, black, soft cover (5 x 8.25)', '365 cats color page-a-day calendar 2016', 'sierra club engagement calendar 2016', 'sierra club wilderness calendar 2016']


## Remove Punctuation
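Remove punctuation using a translation table. The characters to be removed are those listed in `string.punctuation`. Note that this merges hyphenated words (`page-a-day` becomes `pageaday`) and drops decimal points (`8.25` becomes `825`).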

In [8]:
import string
trans = str.maketrans('', '', string.punctuation)
rem_punct = [s.translate(trans) for s in case_collap_list]
print(rem_punct[0:6])

['moms family wall calendar 2016', 'doug the pug 2016 wall calendar', 'moleskine 2016 weekly notebook 12m large black soft cover 5 x 825', '365 cats color pageaday calendar 2016', 'sierra club engagement calendar 2016', 'sierra club wilderness calendar 2016']


## Tokenization
This splits all titles into words (output: list of lists of strings)

In [9]:
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [10]:
tokenized_titles = [word_tokenize(i) for i in rem_punct]
print(tokenized_titles[:10])

[['moms', 'family', 'wall', 'calendar', '2016'], ['doug', 'the', 'pug', '2016', 'wall', 'calendar'], ['moleskine', '2016', 'weekly', 'notebook', '12m', 'large', 'black', 'soft', 'cover', '5', 'x', '825'], ['365', 'cats', 'color', 'pageaday', 'calendar', '2016'], ['sierra', 'club', 'engagement', 'calendar', '2016'], ['sierra', 'club', 'wilderness', 'calendar', '2016'], ['thomas', 'kinkade', 'the', 'disney', 'dreams', 'collection', '2016', 'wall', 'calendar'], ['ansel', 'adams', '2016', 'wall', 'calendar'], ['dilbert', '2016', 'daytoday', 'calendar'], ['mary', 'engelbreit', '2016', 'deluxe', 'wall', 'calendar', 'never', 'give', 'up']]
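
The three preprocessing steps above can be combined into one helper. The following is a minimal sketch and not part of the original notebook: the name `preprocess` is hypothetical, and it uses `str.split` instead of `word_tokenize`. For lowercase titles without punctuation the two should mostly agree, but NLTK treats a few special words differently, so the results are not guaranteed to be identical.

```python
import string

# translation table that deletes every character in string.punctuation
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def preprocess(title):
    """Lowercase a title, strip punctuation and split it on whitespace."""
    return title.lower().translate(_PUNCT_TABLE).split()

print(preprocess("Mom's Family Wall Calendar 2016"))
print(preprocess("365 Cats Color Page-A-Day Calendar 2016"))
```

Keeping the steps in one function makes it easier to apply exactly the same cleaning to any new title that should be classified later.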


# Classifier

## Split data into train/test

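In order to split the data into train and test sets, we create a dataframe containing the tokenized titles and their genres. We then assign each title to the test set at random with probability 0.2, which gives a split of approximately 80% for training and 20% for testing.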

In [11]:
# create dataframe containing tokenized titles and genres
tok_title = pd.DataFrame({0: tokenized_titles})
data_in = [tok_title[0], df1["genre"]]
headers = ["titles", "genres"]

data = pd.concat(data_in, axis=1, keys=headers)
print(data[:5])

                                              titles     genres
0               [moms, family, wall, calendar, 2016]  Calendars
1             [doug, the, pug, 2016, wall, calendar]  Calendars
2  [moleskine, 2016, weekly, notebook, 12m, large...  Calendars
3       [365, cats, color, pageaday, calendar, 2016]  Calendars
4         [sierra, club, engagement, calendar, 2016]  Calendars


In [12]:
# split data into train and test
import numpy as np

test_pct=0.2 # split into 80/20%

# create mask
mask = np.random.choice([0, 1], p=[1 - test_pct, test_pct], size=data.shape[0])

# apply mask
data["mask"] = mask
test = data[data["mask"] == 1]
train = data[data["mask"] == 0]

# removing column
test = test.drop("mask", axis="columns").reset_index()
train = train.drop("mask", axis="columns").reset_index()

# remove original indexing data (otherwise we have double indexing)
test = test.drop("index", axis="columns")
train = train.drop("index", axis="columns")
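
The mask above is drawn without a fixed seed, so every run produces a different split. A minimal sketch of a reproducible mask follows. It uses Python's `random` module and a toy list in place of the dataframe; the function name `split_mask` and the seed value are assumptions made for illustration.

```python
import random

def split_mask(n_rows, test_pct=0.2, seed=0):
    """Return a list of 0/1 flags; 1 marks a test row with probability test_pct."""
    rng = random.Random(seed)  # seeded generator, so the mask is reproducible
    return [1 if rng.random() < test_pct else 0 for _ in range(n_rows)]

rows = ['title %d' % i for i in range(1000)]  # toy stand-in for the titles
mask = split_mask(len(rows), test_pct=0.2, seed=42)
train_rows = [r for r, m in zip(rows, mask) if m == 0]
test_rows = [r for r, m in zip(rows, mask) if m == 1]

print(len(train_rows), len(test_rows))
```

In the notebook itself, calling `np.random.seed` with a fixed value before `np.random.choice` should have a similar effect.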

In [13]:
# save split datasets as csv files in Google Drive
test.to_csv('/drive/My Drive/test_NB.csv', index=False)
train.to_csv('/drive/My Drive/train_NB.csv', index=False)

# NOTE: the next parts give error messages when run on these csv files (the token lists are written out as plain strings), so the files are only saved for reference
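
A likely cause of those errors (an assumption, since the error messages are not shown) is that a CSV file stores each token list as its string representation, so a title read back from the file is one string instead of a list of tokens. The sketch below reproduces this with the standard `csv` module and restores the list with `ast.literal_eval`.

```python
import ast
import csv
import io

tokens = ['doug', 'the', 'pug', '2016', 'wall', 'calendar']

# write one row the way a column of lists is serialized: as str(list)
buffer = io.StringIO()
csv.writer(buffer).writerow([str(tokens), 'Calendars'])

# read it back: the first field is now a plain string
buffer.seek(0)
raw_title, genre = next(csv.reader(buffer))
print(type(raw_title).__name__)

# ast.literal_eval safely parses the string back into a list of tokens
restored = ast.literal_eval(raw_title)
print(restored == tokens)
```

With pandas, the same conversion could be applied to the reloaded column, for example `train['titles'].apply(ast.literal_eval)`.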

## Naive Bayes classifier

We build the Naive Bayes classifier from scratch. To do this, we compute the following from the training set: token frequencies over the total vocabulary, token frequencies within each genre, the number of distinct words in each genre, the size of the total vocabulary, the total number of titles, and the number of titles in each genre. From these, we compute the prior of each genre and the likelihood of each word given each genre.
Using both the priors and the likelihoods, we create a dataframe with prediction values. The genre with the highest prediction value for a particular word is the predicted genre for that word.

We then test out whether the model works for the test data.
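
For reference, the textbook form of a multinomial Naive Bayes classifier with add-one smoothing is given below. The notation is an assumption introduced here: $N_g$ is the number of training titles in genre $g$, $N$ the total number of training titles, $\mathrm{count}(w, g)$ the number of times word $w$ occurs in genre $g$, $V$ the size of the training vocabulary, and $w_1, \dots, w_n$ the tokens of one title. The code in this notebook is not claimed to follow this form exactly.

$$
\hat{g} = \arg\max_{g} \Big[ \log P(g) + \sum_{i=1}^{n} \log P(w_i \mid g) \Big],
\qquad
P(g) = \frac{N_g}{N},
\qquad
P(w \mid g) = \frac{\mathrm{count}(w, g) + 1}{\sum_{w'} \mathrm{count}(w', g) + V}
$$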

### Train model: Prep

In [14]:
from collections import Counter

In [15]:
# token frequencies within each title (we don't need this...)
unigram_count = []
for title in tokenized_titles:
    uni_title = Counter()
    for i in title:
        uni_title[i] += 1
    unigram_count.append(uni_title)

print(unigram_count[:5]) # test

[Counter({'moms': 1, 'family': 1, 'wall': 1, 'calendar': 1, '2016': 1}), Counter({'doug': 1, 'the': 1, 'pug': 1, '2016': 1, 'wall': 1, 'calendar': 1}), Counter({'moleskine': 1, '2016': 1, 'weekly': 1, 'notebook': 1, '12m': 1, 'large': 1, 'black': 1, 'soft': 1, 'cover': 1, '5': 1, 'x': 1, '825': 1}), Counter({'365': 1, 'cats': 1, 'color': 1, 'pageaday': 1, 'calendar': 1, '2016': 1}), Counter({'sierra': 1, 'club': 1, 'engagement': 1, 'calendar': 1, '2016': 1})]


In [16]:
# token frequencies in total vocab
tok_freq = Counter()
for title in train['titles']:
    for i in title:
        tok_freq[i] += 1
        
print(tok_freq)



In [17]:
# group data by genre
# could probably be done MUCH nicer, but couldn't get for-loop to work...

grouped = train.groupby(train.genres)

Calendars = grouped.get_group("Calendars")
Comics = grouped.get_group("Comics & Graphic Novels")
Test = grouped.get_group("Test Preparation")
Mystery = grouped.get_group("Mystery, Thriller & Suspense")
SciFi = grouped.get_group("Science Fiction & Fantasy")
Romance = grouped.get_group("Romance")
Humor = grouped.get_group("Humor & Entertainment")
Literature = grouped.get_group("Literature & Fiction")
LGBTQ = grouped.get_group("Gay & Lesbian")
Engineering = grouped.get_group("Engineering & Transportation")
Food = grouped.get_group("Cookbooks, Food & Wine")
Crafts = grouped.get_group("Crafts, Hobbies & Home")
Arts = grouped.get_group("Arts & Photography")
Education = grouped.get_group("Education & Teaching")
Parenting = grouped.get_group("Parenting & Relationships")
SelfHelp = grouped.get_group("Self-Help")
Computers = grouped.get_group("Computers & Technology")
Medical = grouped.get_group("Medical Books")
Science = grouped.get_group("Science & Math")
Health = grouped.get_group("Health, Fitness & Dieting")
Business = grouped.get_group("Business & Money")
Law = grouped.get_group("Law")
Biographies = grouped.get_group("Biographies & Memoirs")
History = grouped.get_group("History")
Politics = grouped.get_group("Politics & Social Sciences")
Reference = grouped.get_group("Reference")
Bibles = grouped.get_group("Christian Books & Bibles")
Religion = grouped.get_group("Religion & Spirituality")
Sports = grouped.get_group("Sports & Outdoors")
Teen = grouped.get_group("Teen & Young Adult")
Childrens = grouped.get_group("Children's Books")
Travel = grouped.get_group("Travel")

GenreGroups = [Calendars['titles'], Comics['titles'], Test['titles'], Mystery['titles'], SciFi['titles'], 
               Romance['titles'], Humor['titles'], Literature['titles'], LGBTQ['titles'], Engineering['titles'], 
               Food['titles'], Crafts['titles'], Arts['titles'], Education['titles'], Parenting['titles'], 
               SelfHelp['titles'], Computers['titles'], Medical['titles'], Science['titles'], Health['titles'], 
               Business['titles'], Law['titles'], Biographies['titles'], History['titles'], Politics['titles'], 
               Reference['titles'], Bibles['titles'], Religion['titles'], Sports['titles'], Teen['titles'], 
               Childrens['titles'], Travel['titles']]

In [18]:
# token frequencies in each genre
genre_count = []
for g in GenreGroups:
    genre_title = Counter()
    for title in g:
        for i in title:
            genre_title[i] += 1
    genre_title = dict(genre_title)
    genre_count.append(genre_title)
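
The grouping and counting above can also be done in a single loop. The following is a minimal sketch on toy data, using only `collections`; the lists `titles` and `genres` stand in for `train['titles']` and `train['genres']`, and the name `genre_counts` is hypothetical.

```python
from collections import Counter, defaultdict

# toy stand-ins for train['titles'] and train['genres']
titles = [['wall', 'calendar', '2016'], ['cat', 'calendar'], ['paris', 'travel', 'guide']]
genres = ['Calendars', 'Calendars', 'Travel']

# one Counter per genre, created on first use
genre_counts = defaultdict(Counter)
for tokens, genre in zip(titles, genres):
    genre_counts[genre].update(tokens)  # add this title's tokens to its genre

print(genre_counts['Calendars']['calendar'])
print(dict(genre_counts['Travel']))
```

Because the counts are keyed by genre name instead of list position, later steps do not depend on keeping several lists in the same order.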

In [19]:
# token frequencies in each genre
genre_count[0] # genres in same order as GenreGroups

{'moms': 10,
 'family': 10,
 'wall': 788,
 'calendar': 1935,
 '2016': 1116,
 'doug': 1,
 'the': 372,
 'pug': 4,
 'moleskine': 29,
 'weekly': 59,
 'notebook': 21,
 '12m': 8,
 'large': 24,
 'black': 25,
 'soft': 9,
 'cover': 28,
 '5': 14,
 'x': 26,
 '825': 13,
 '365': 70,
 'cats': 35,
 'color': 11,
 'pageaday': 60,
 'sierra': 3,
 'club': 5,
 'engagement': 71,
 'wilderness': 10,
 'thomas': 11,
 'kinkade': 8,
 'disney': 25,
 'dreams': 6,
 'collection': 12,
 'ansel': 3,
 'adams': 3,
 'dilbert': 5,
 'daytoday': 84,
 'mary': 8,
 'engelbreit': 5,
 'deluxe': 37,
 'never': 3,
 'give': 1,
 'up': 8,
 'cat': 23,
 'gallery': 13,
 'llewellyns': 4,
 'witches': 2,
 'datebook': 2,
 'amy': 2,
 'knapp': 2,
 'big': 11,
 'grid': 3,
 'essential': 1,
 'organization': 1,
 'and': 137,
 'communication': 1,
 'tool': 1,
 'for': 63,
 'entire': 1,
 'outlander': 2,
 'audubon': 6,
 'nature': 15,
 'national': 19,
 'park': 1,
 'foundation': 1,
 'extra': 4,
 '75': 4,
 '10': 4,
 'pocket': 55,
 '35': 7,
 '55': 7,
 'susan':

In [20]:
# number of distinct words in a class (size of the genre vocabulary, not the total token count)
len(genre_count[0]) # first genre

2384

In [21]:
# size of total vocabulary (training set)
V = len(tok_freq)
print(V)

74773


In [22]:
# number of titles (in training set)
N_titles = len(train)
print(N_titles)

165688


In [23]:
# number of titles in each genre
N_genre = train['genres'].value_counts()
print(N_genre[:5])

Travel                       14650
Children's Books             10886
Medical Books                 9601
Health, Fitness & Dieting     9483
Business & Money              7990
Name: genres, dtype: int64


### Priors

In [24]:
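# priors of each genre (probability of title being in specific genre)
# NOTE: N_genre is sorted by frequency, but the genre names zipped in below are in order
# of first appearance, so names and values are misaligned
prob_title = []
for g in range(len(N_genre)):
    prob = N_genre.iloc[g] / N_titles # positional access
    prob_title.append(prob)

zipped_values = zip(df1.genre.unique(), prob_title)
prior = list(zipped_values)

print(prior[:5]) # test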

[('Calendars', 0.08841919752788373), ('Comics & Graphic Novels', 0.06570180097532712), ('Test Preparation', 0.05794626044131138), ('Mystery, Thriller & Suspense', 0.057234078509004874), ('Science Fiction & Fantasy', 0.04822316643329631)]
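
The first value printed above, 0.0884, equals 14650 / 165688, which is the share of `Travel` titles and not of `Calendars`: `value_counts()` sorts by frequency, while `unique()` keeps the order of first appearance. One way to keep each prior attached to its genre is to key it by genre name. The sketch below shows this on toy data with hypothetical variable names.

```python
from collections import Counter

genres = ['Travel', 'Travel', 'Travel', 'Calendars', 'Law']  # toy stand-in for train['genres']

title_counts = Counter(genres)  # number of titles per genre, keyed by genre name
n_titles = len(genres)
priors = {genre: count / n_titles for genre, count in title_counts.items()}

print(priors)
```

In the notebook, `(N_genre / N_titles).to_dict()` should give the same kind of name-keyed mapping, because the division keeps the genre labels of the Series.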


### Likelihoods

In [25]:
# likelihoods of each word (probability of word given a specific genre)

# create empty dataframe
likelihood = pd.DataFrame(columns=df1.genre.unique(), index=list(tok_freq))

for len_gc, counts in enumerate(genre_count): # loop through genres with their index
    for word in counts:
        # add-one smoothing; the denominator uses the number of distinct words in the genre
        p = (counts[word] + 1) / (len(counts) + V) # probability
        likelihood.loc[word, likelihood.columns[len_gc]] = p # replace NaN in dataframe with p

In [26]:
# now fill likelihoods of words that haven't appeared in genre

for len_gc, c in enumerate(likelihood.columns): # loop through genres in dataframe (iteritems() was removed in pandas 2.0)
  p = 1 / (len(genre_count[len_gc]) + V) # probability of word not appearing in that genre
  likelihood[c] = likelihood[c].fillna(p) # replace remaining NaNs in dataframe with p
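
The likelihoods above divide by the number of distinct words in a genre plus `V`. The usual add-one (Laplace) estimate divides by the total number of tokens in the genre plus `V`, which makes the probabilities of all vocabulary words sum to one within a genre. A minimal sketch of that variant follows; the function name and the toy counts are assumptions for illustration.

```python
from collections import Counter

def smoothed_likelihood(word, genre_counter, vocab_size):
    """Add-one smoothed P(word | genre) using the total token count of the genre."""
    total_tokens = sum(genre_counter.values())  # all tokens, not distinct words
    return (genre_counter[word] + 1) / (total_tokens + vocab_size)

calendars = Counter({'calendar': 3, 'wall': 2, '2016': 1})  # toy counts
V = 5  # assumed vocabulary size for the toy example

print(smoothed_likelihood('calendar', calendars, V))  # (3 + 1) / (6 + 5)
print(smoothed_likelihood('unseen', calendars, V))    # (0 + 1) / (6 + 5)
```

A word that never occurred in the genre gets the value `1 / (total_tokens + V)`, so no separate fill step is needed.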

In [27]:
likelihood

Unnamed: 0,Calendars,Comics & Graphic Novels,Test Preparation,"Mystery, Thriller & Suspense",Science Fiction & Fantasy,Romance,Humor & Entertainment,Literature & Fiction,Gay & Lesbian,Engineering & Transportation,"Cookbooks, Food & Wine","Crafts, Hobbies & Home",Arts & Photography,Education & Teaching,Parenting & Relationships,Self-Help,Computers & Technology,Medical Books,Science & Math,"Health, Fitness & Dieting",Business & Money,Law,Biographies & Memoirs,History,Politics & Social Sciences,Reference,Christian Books & Bibles,Religion & Spirituality,Sports & Outdoors,Teen & Young Adult,Children's Books,Travel
moms,0.000143,0.000025,0.000013,0.000013,0.000013,0.000025,0.000060,0.000012,0.000052,0.000013,0.000132,0.000047,0.000036,0.000013,0.000292,0.000025,0.000012,0.000024,0.000024,0.000093,0.000024,0.000012,0.000012,0.000012,0.000025,0.000025,0.000108,0.000048,0.000012,0.000024,0.000047,0.000045
family,0.000143,0.000153,0.000232,0.000103,0.000050,0.000453,0.000441,0.000421,0.000426,0.000075,0.002273,0.000375,0.000266,0.000180,0.001625,0.000228,0.000110,0.000556,0.000178,0.001298,0.000642,0.001072,0.000815,0.000296,0.000212,0.000351,0.000987,0.000394,0.000122,0.000362,0.000654,0.001112
wall,0.010226,0.000013,0.000026,0.000077,0.000063,0.000013,0.000239,0.000048,0.000013,0.000038,0.000012,0.000269,0.000266,0.000013,0.000013,0.000025,0.000037,0.000047,0.000095,0.000058,0.000784,0.000073,0.000097,0.000178,0.000037,0.000151,0.000096,0.000108,0.000158,0.000060,0.000222,0.000352
calendar,0.025092,0.000013,0.000013,0.000013,0.000013,0.000013,0.000310,0.000036,0.000013,0.000126,0.000024,0.000258,0.000483,0.000013,0.000063,0.000089,0.000061,0.000012,0.000154,0.000035,0.000059,0.000012,0.000012,0.000012,0.000050,0.000050,0.000120,0.000227,0.000219,0.000024,0.000047,0.000341
2016,0.014477,0.000013,0.001302,0.000013,0.000025,0.000013,0.000370,0.000084,0.000039,0.000427,0.000060,0.000398,0.000495,0.000231,0.000089,0.000139,0.000564,0.000568,0.000119,0.000058,0.000321,0.000085,0.000024,0.000036,0.000062,0.000326,0.000192,0.000239,0.000231,0.000012,0.000128,0.001260
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
flashmaps,0.000013,0.000013,0.000013,0.000013,0.000013,0.000013,0.000012,0.000012,0.000013,0.000013,0.000012,0.000012,0.000012,0.000013,0.000013,0.000013,0.000012,0.000012,0.000012,0.000012,0.000012,0.000012,0.000012,0.000012,0.000012,0.000013,0.000012,0.000012,0.000012,0.000012,0.000012,0.000023
ballrooms,0.000013,0.000013,0.000013,0.000013,0.000013,0.000013,0.000012,0.000012,0.000013,0.000013,0.000012,0.000012,0.000012,0.000013,0.000013,0.000013,0.000012,0.000012,0.000012,0.000012,0.000012,0.000012,0.000012,0.000012,0.000012,0.000013,0.000012,0.000012,0.000012,0.000012,0.000012,0.000023
riverboat,0.000013,0.000013,0.000013,0.000013,0.000013,0.000013,0.000012,0.000012,0.000013,0.000013,0.000012,0.000012,0.000012,0.000013,0.000013,0.000013,0.000012,0.000012,0.000012,0.000012,0.000012,0.000012,0.000012,0.000012,0.000012,0.000013,0.000012,0.000012,0.000012,0.000012,0.000012,0.000023
historymapped,0.000013,0.000013,0.000013,0.000013,0.000013,0.000013,0.000012,0.000012,0.000013,0.000013,0.000012,0.000012,0.000012,0.000013,0.000013,0.000013,0.000012,0.000012,0.000012,0.000012,0.000012,0.000012,0.000012,0.000012,0.000012,0.000013,0.000012,0.000012,0.000012,0.000012,0.000012,0.000023


In [28]:
# prediction values on a word-by-word basis for all words in the training vocabulary

predictors = pd.DataFrame(columns=df1.genre.unique(), index=dict.keys(tok_freq)) # create empty dataframe
len_gc = -1
for i in prior:
  len_gc += 1 # create genre index
  # square all likelihoods then multiply them each with prior of that genre
  # NOTE: every pass overwrites predictors, so the final table holds the squared likelihoods
  # multiplied by the last prior only (the same scalar for every genre column)
  predictors = likelihood.loc[:, df1.genre.unique().tolist()] ** 2 # square likelihoods
  predictors = predictors.loc[:, df1.genre.unique().tolist()] * np.array(prior[len_gc][1]) # multiply by priors

In [29]:
predictors

Unnamed: 0,Calendars,Comics & Graphic Novels,Test Preparation,"Mystery, Thriller & Suspense",Science Fiction & Fantasy,Romance,Humor & Entertainment,Literature & Fiction,Gay & Lesbian,Engineering & Transportation,"Cookbooks, Food & Wine","Crafts, Hobbies & Home",Arts & Photography,Education & Teaching,Parenting & Relationships,Self-Help,Computers & Technology,Medical Books,Science & Math,"Health, Fitness & Dieting",Business & Money,Law,Biographies & Memoirs,History,Politics & Social Sciences,Reference,Christian Books & Bibles,Religion & Spirituality,Sports & Outdoors,Teen & Young Adult,Children's Books,Travel
moms,1.283144e-10,4.099988e-12,1.049593e-12,1.046272e-12,1.002164e-12,4.003106e-12,2.246176e-11,9.124807e-13,1.679652e-11,9.959304e-13,1.104739e-10,1.385861e-11,8.275282e-12,1.044120e-12,5.379378e-10,4.054296e-12,9.484389e-13,3.535104e-12,3.561095e-12,5.425494e-11,3.565582e-12,9.374680e-13,9.335959e-13,8.876788e-13,3.939560e-12,3.976127e-12,7.401412e-11,1.442022e-11,9.330058e-13,3.671343e-12,1.375690e-11,1.301453e-11
family,1.283144e-10,1.475996e-10,3.400682e-10,6.696140e-11,1.603462e-11,1.297006e-09,1.230006e-09,1.117789e-09,1.143213e-09,3.585350e-11,3.261354e-08,8.869507e-10,4.450263e-10,2.046475e-10,1.666082e-08,3.283980e-10,7.682355e-11,1.952261e-09,2.003116e-10,1.063397e-08,2.599309e-09,7.259752e-09,4.190912e-09,5.547992e-10,2.846332e-10,7.793209e-10,6.144086e-09,9.814764e-10,9.330058e-11,8.260521e-10,2.696353e-09,7.811972e-09
wall,6.601520e-07,1.024997e-12,4.198373e-12,3.766579e-11,2.505410e-11,1.000777e-12,3.593882e-10,1.459969e-11,1.049783e-12,8.963374e-12,9.130075e-13,4.582001e-10,4.450263e-10,1.044120e-12,1.016896e-12,4.054296e-12,8.535950e-12,1.414042e-11,5.697753e-11,2.119334e-11,3.882919e-09,3.374885e-11,5.975014e-11,1.997277e-10,8.864011e-12,1.431406e-10,5.848029e-11,7.300238e-11,1.576780e-10,2.294589e-11,3.103901e-10,7.816853e-10
calendar,3.974666e-06,1.024997e-12,1.049593e-12,1.046272e-12,1.002164e-12,1.000777e-12,6.073661e-10,8.212326e-12,1.049783e-12,9.959304e-11,3.652030e-12,4.192228e-10,1.471161e-09,1.044120e-12,2.542239e-11,4.966513e-11,2.371097e-11,8.837760e-13,1.504563e-10,7.629601e-12,2.228489e-11,9.374680e-13,9.335959e-13,8.876788e-13,1.575824e-11,1.590451e-11,9.137546e-11,3.253563e-10,3.022939e-10,3.671343e-12,1.375690e-11,7.320674e-10
2016,1.323111e-06,1.024997e-12,1.070690e-08,1.046272e-12,4.008656e-12,1.000777e-12,8.634302e-10,4.471155e-11,9.448045e-12,1.151296e-09,2.282519e-11,1.001284e-09,1.545639e-09,3.382949e-10,4.982789e-11,1.226425e-10,2.006897e-09,2.036220e-09,8.902738e-11,2.119334e-11,6.498274e-10,4.593593e-11,3.734383e-12,7.989109e-12,2.462225e-11,6.719655e-10,2.339212e-10,3.605056e-10,3.368151e-10,9.178357e-13,1.040366e-10,1.002200e-08
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
flashmaps,1.060449e-12,1.024997e-12,1.049593e-12,1.046272e-12,1.002164e-12,1.000777e-12,8.984706e-13,9.124807e-13,1.049783e-12,9.959304e-13,9.130075e-13,8.661628e-13,9.194758e-13,1.044120e-12,1.016896e-12,1.013574e-12,9.484389e-13,8.837760e-13,8.902738e-13,8.477334e-13,8.913955e-13,9.374680e-13,9.335959e-13,8.876788e-13,9.848901e-13,9.940318e-13,9.137546e-13,9.012639e-13,9.330058e-13,9.178357e-13,8.598063e-13,3.253633e-12
ballrooms,1.060449e-12,1.024997e-12,1.049593e-12,1.046272e-12,1.002164e-12,1.000777e-12,8.984706e-13,9.124807e-13,1.049783e-12,9.959304e-13,9.130075e-13,8.661628e-13,9.194758e-13,1.044120e-12,1.016896e-12,1.013574e-12,9.484389e-13,8.837760e-13,8.902738e-13,8.477334e-13,8.913955e-13,9.374680e-13,9.335959e-13,8.876788e-13,9.848901e-13,9.940318e-13,9.137546e-13,9.012639e-13,9.330058e-13,9.178357e-13,8.598063e-13,3.253633e-12
riverboat,1.060449e-12,1.024997e-12,1.049593e-12,1.046272e-12,1.002164e-12,1.000777e-12,8.984706e-13,9.124807e-13,1.049783e-12,9.959304e-13,9.130075e-13,8.661628e-13,9.194758e-13,1.044120e-12,1.016896e-12,1.013574e-12,9.484389e-13,8.837760e-13,8.902738e-13,8.477334e-13,8.913955e-13,9.374680e-13,9.335959e-13,8.876788e-13,9.848901e-13,9.940318e-13,9.137546e-13,9.012639e-13,9.330058e-13,9.178357e-13,8.598063e-13,3.253633e-12
historymapped,1.060449e-12,1.024997e-12,1.049593e-12,1.046272e-12,1.002164e-12,1.000777e-12,8.984706e-13,9.124807e-13,1.049783e-12,9.959304e-13,9.130075e-13,8.661628e-13,9.194758e-13,1.044120e-12,1.016896e-12,1.013574e-12,9.484389e-13,8.837760e-13,8.902738e-13,8.477334e-13,8.913955e-13,9.374680e-13,9.335959e-13,8.876788e-13,9.848901e-13,9.940318e-13,9.137546e-13,9.012639e-13,9.330058e-13,9.178357e-13,8.598063e-13,3.253633e-12


In [30]:
predictors.loc[["moms"]]

Unnamed: 0,Calendars,Comics & Graphic Novels,Test Preparation,"Mystery, Thriller & Suspense",Science Fiction & Fantasy,Romance,Humor & Entertainment,Literature & Fiction,Gay & Lesbian,Engineering & Transportation,"Cookbooks, Food & Wine","Crafts, Hobbies & Home",Arts & Photography,Education & Teaching,Parenting & Relationships,Self-Help,Computers & Technology,Medical Books,Science & Math,"Health, Fitness & Dieting",Business & Money,Law,Biographies & Memoirs,History,Politics & Social Sciences,Reference,Christian Books & Bibles,Religion & Spirituality,Sports & Outdoors,Teen & Young Adult,Children's Books,Travel
moms,1.283144e-10,4.099988e-12,1.049593e-12,1.046272e-12,1.002164e-12,4.003106e-12,2.246176e-11,9.124807e-13,1.679652e-11,9.959304e-13,1.104739e-10,1.385861e-11,8.275282e-12,1.04412e-12,5.379378e-10,4.054296e-12,9.484389e-13,3.535104e-12,3.561095e-12,5.425494e-11,3.565582e-12,9.37468e-13,9.335959e-13,8.876788e-13,3.93956e-12,3.976127e-12,7.401412e-11,1.442022e-11,9.330058e-13,3.671343e-12,1.37569e-11,1.301453e-11


In [31]:
# save predictors so these don't have to be created again when testing model
predictors.to_csv('/drive/My Drive/predictors_NB.csv')

In [32]:
# output genre that has highest prediction value for input word
maxValueIndexObj = predictors.idxmax(axis=1)
print(list(maxValueIndexObj[["high"]]))
print(list(maxValueIndexObj[["higher"]]))
print(list(maxValueIndexObj[["highest"]]))

['Health, Fitness & Dieting']
['Education & Teaching']
['Christian Books & Bibles']
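
Per-word predictions such as those above look at one word at a time. A standard Naive Bayes classifier instead scores a whole title by adding the log prior of a genre to the log likelihoods of all its tokens. The sketch below shows this on a toy model; all names and values are illustrative assumptions and not taken from the trained tables.

```python
import math
from collections import Counter

# toy model: priors and per-genre token counts (illustrative values only)
priors = {'Calendars': 0.5, 'Travel': 0.5}
genre_counts = {
    'Calendars': Counter({'wall': 2, 'calendar': 3, '2016': 1}),
    'Travel': Counter({'guide': 2, 'paris': 1, '2016': 1}),
}
vocab = set().union(*genre_counts.values())
V = len(vocab)

def log_score(tokens, genre):
    """log P(genre) + sum of log P(token | genre) with add-one smoothing."""
    counts = genre_counts[genre]
    total = sum(counts.values())
    score = math.log(priors[genre])
    for tok in tokens:
        if tok in vocab:  # skip words never seen in training
            score += math.log((counts[tok] + 1) / (total + V))
    return score

def predict(tokens):
    """Return the genre with the highest log score for the whole title."""
    return max(priors, key=lambda g: log_score(tokens, g))

print(predict(['wall', 'calendar', '2016']))
print(predict(['paris', 'guide']))
```

Working with sums of logarithms avoids the very small products (around 1e-12 in the table above) that arise when probabilities are multiplied directly.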


### Test model

In [33]:
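# function that will return the predicted genre (the most frequent per-word prediction) based on title input
# function is very slow because idxmax is recomputed over the whole table for every word -> OPTIMIZE (otherwise takes 11.5hrs for test dataset...)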
def pred_genre(title):
  prediction = []
  for w in title:
    ## CONDITION: if word exists in total vocabulary
    if w in predictors.index:
      maxValueIndexObj = predictors.idxmax(axis=1)
      prediction += list(maxValueIndexObj[[w]])
    ## CONDITION: if word does not exist in total vocabulary
    if w not in predictors.index:
      pass

  # probability of predicted genre being chosen
  d = dict(Counter(prediction))
  d1 = {k: v / total for total in (sum(d.values()),) for k, v in d.items()}

  # output most likely predicted genre
  if prediction == []:
    return 'There is no prediction for this title.'
  else:
    return max(d1, key=d1.get) 
  
  # what if there is no maximum? (all equally predicted)
  # gives first predicted genre as output...
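
Most of the run time of `pred_genre` comes from calling `idxmax` on the full table once per word. A possible optimization, sketched here under the assumption that the best genre per word is computed once and stored in a dictionary, is shown below. The names `best_genre` and `pred_genre_fast` are hypothetical, and the lookup is filled with toy values.

```python
from collections import Counter

# toy lookup: best genre per word; in the notebook this could be built once
# with predictors.idxmax(axis=1).to_dict() instead of inside the loop
best_genre = {'wall': 'Calendars', 'calendar': 'Calendars', 'guide': 'Travel'}

def pred_genre_fast(title, lookup):
    """Majority vote over the per-word genre predictions of a tokenized title."""
    votes = Counter(lookup[w] for w in title if w in lookup)
    if not votes:
        return 'There is no prediction for this title.'
    # on a tie, max returns the genre that was counted first
    return max(votes, key=votes.get)

print(pred_genre_fast(['wall', 'calendar', 'guide'], best_genre))
print(pred_genre_fast(['unflattening'], best_genre))
```

Each word then costs one dictionary lookup, and the tie behaviour matches the original function: the first predicted genre wins.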

In [34]:
# predicted genre for first test title
import time
time_start = time.perf_counter() # time.clock() was removed in Python 3.8

pred_genre(test['titles'][0])

time_elapsed = (time.perf_counter() - time_start)
print(time_elapsed)

1.0751810000000006


In [35]:
pred_genre(test['titles'][678])

'Comics & Graphic Novels'

In [36]:
# "real" genre of first test title
test['genres'][0]

'Calendars'

In [37]:
# titles that would give no prediction:
# these titles are one word long and not included in training set
for t in test['titles']:
  if len(t) == 1 and t[0] not in tok_freq: # membership test directly on the Counter
    print(t)

['dã¼sseldorf']
['unflattening']
['poseidont']
['verboten']
['peplum']
['megahex']
['escapo']
['mw']
['priapus']
['stroppy']
['cryptonomicon']
['cyberstorm']
['robogenesis']
['nightfolk']
['moonheart']
['abilites']
['binti']
['saucer']
['eifelheim']
['sharkman']
['pornucopia']
['cryptonomicon']
['cuckd']
['edenbrooke']
['lespada']
['genafinn']
['equivocation']
['hippochondriac']
['ctrlshift']
['shibumi']
['nyctophobia']
['cryptos']
['fingersmith']
['penpal']
['fantasmagoria']
['neverhome']
['habibi']
['pnin']
['kokoro']
['divisadero']
['downtime']
['laodicea']
['godlike']
['tacopedia']
['rockpool']
['koto']
['morito']
['balloonology']
['remodelista']
['skyshades']
['unbranded']
['plasticland']
['orthopantomography']
['amlodipine']
['schistosomiasis']
['nextinction']
['palomino']
['weology']
['coopetition']
['serpico']
['recapitulations']
['sleepers']
['oct64']
['mariology']
['dispensationalism']
['skatekey']
['boneseeker']
['mimus']
['guyliner']
['bunheads']
['edgewater']
['evertaster'

In [38]:
######## this function takes 9.25 hours!!
predicted_genre = []
for t in test['titles']:
  predicted_genre.append(pred_genre(t))


# change list to dataframe to enable saving as csv file
df2 = pd.DataFrame(predicted_genre)
df2.columns = ['Predicted Genre']
df2['True Genre'] = test['genres']

# create and save csv file in Google Drive
df2.to_csv('/drive/My Drive/predicted_genre_NB.csv')

In [39]:
# read csv file as pandas Dataframe
predicted_genre = pd.read_csv('/drive/My Drive/predicted_genre_NB.csv',encoding='latin1')

In [40]:
# dataframe with predicted and true genres
predicted_genre

Unnamed: 0.1,Unnamed: 0,Predicted Genre,True Genre
0,0,Calendars,Calendars
1,1,Calendars,Calendars
2,2,Calendars,Calendars
3,3,Religion & Spirituality,Calendars
4,4,Calendars,Calendars
...,...,...,...
41879,41879,Travel,Travel
41880,41880,Travel,Travel
41881,41881,Travel,Travel
41882,41882,Travel,Travel


### Accuracy of model

In [41]:
# read csv file as pandas Dataframe
predicted_genre = pd.read_csv('/drive/My Drive/predicted_genre_NB.csv',encoding='latin1')

In [42]:
# compute overall accuracy, precision, recall, f1 scores
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score

print('Accuracy: ', accuracy_score(predicted_genre['True Genre'], predicted_genre['Predicted Genre']))
print('Precision: ', precision_score(predicted_genre['True Genre'], predicted_genre['Predicted Genre'], average='weighted'))
print('Recall: ', recall_score(predicted_genre['True Genre'], predicted_genre['Predicted Genre'], average='weighted', zero_division=1))
print('F1:', f1_score(predicted_genre['True Genre'], predicted_genre['Predicted Genre'], average='weighted'))

Accuracy:  0.43608537866488395
Precision:  0.5212100104488911
Recall:  0.43608537866488395
F1: 0.4201469237616296


In [43]:
# compute precision, recall, f1 scores and support by genre

from sklearn.metrics import precision_recall_fscore_support as score

# precision, recall, fscore, support separated by genre
precision, recall, fscore, support = score(predicted_genre['True Genre'], predicted_genre['Predicted Genre'])

df_acc = pd.DataFrame()
df_acc['precision']=pd.Series(precision)
df_acc['recall']=pd.Series(recall)
df_acc['fscore']=pd.Series(fscore)
df_acc['support']=pd.Series(support)

print(df_acc)
# order corresponds to genre IDs, with some exceptions
# order: 0-9, 31, 10, 30, 11-28, no prediction, 29

    precision    recall    fscore  support
0    0.572277  0.213284  0.310753     1355
1    0.247253  0.054348  0.089109      828
2    0.560174  0.457215  0.503485     1975
3    0.822835  0.762774  0.791667      548
4    0.428305  0.507540  0.464568     2719
5    0.545894  0.372732  0.442992     1819
6    0.628125  0.323151  0.426752      622
7    0.689172  0.683081  0.686113     1584
8    0.700734  0.686774  0.693684     1807
9    0.665145  0.437000  0.527459     2000
10   0.578125  0.120915  0.200000      306
11   0.620690  0.203390  0.306383      531
12   0.389831  0.078498  0.130682      293
13   0.284167  0.709530  0.405807     2403
14   0.314345  0.333333  0.323561     1374
15   0.586895  0.155004  0.245238     1329
16   0.710946  0.447492  0.549261     1495
17   0.475120  0.199730  0.281235     1482
18   0.629484  0.642656  0.636002     2485
19   0.487179  0.128378  0.203209      444
20   0.410000  0.077652  0.130573      528
21   0.419048  0.065967  0.113990      667
22   0.5860

  _warn_prf(average, modifier, msg_start, len(result))
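
The rows of the table above are numbered and not named. When no `labels` argument is passed, scikit-learn reports per-class scores in sorted order of all labels that occur in the true or predicted values, so the placeholder string for titles without a prediction also gets a row. The sketch below computes per-genre precision and recall keyed by genre name on toy data; the function name `per_genre_scores` is hypothetical and the code uses only built-ins.

```python
def per_genre_scores(y_true, y_pred):
    """Precision and recall per label, keyed by label name in sorted order."""
    labels = sorted(set(y_true) | set(y_pred))  # sorted unique labels
    scores = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        actual = sum(t == label for t in y_true)
        precision = tp / predicted if predicted else 0.0  # 0.0 when never predicted
        recall = tp / actual if actual else 0.0           # 0.0 when never the true label
        scores[label] = (precision, recall)
    return scores

y_true = ['Travel', 'Travel', 'Calendars', 'Law']
y_pred = ['Travel', 'Calendars', 'Calendars', 'Travel']

for genre, (precision, recall) in per_genre_scores(y_true, y_pred).items():
    print(genre, precision, recall)
```

In the notebook, the same sorted list of labels could be used as the index of `df_acc` so that each row shows its genre name.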


In [44]:
# save model evaluation as csv file in Google Drive
df_acc.to_csv('/drive/My Drive/NB_model_eval.csv')