## Exploratory Analysis and Genre Classification of Books in the Amazon Book Dataset Using a Simple Feed-Forward Neural Network

### Instructions

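#### The following libraries need to be installed

- pandas
- numpy
- tensorflow (a 1.x release, since the notebook calls `tf.reset_default_graph`)
- tflearn
- scikit-learn (for `train_test_split`)
- jupyter notebook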

Download the book dataset from https://github.com/uchidalab/book-dataset/tree/master/Task2

### Importing our Dependencies

In [1]:
import pandas as pd
import numpy as np
import tensorflow as tf
import tflearn
import gc

### Loading the dataset

In [2]:
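# The CSV has no header row, so read_csv treats the first record as column names
# (visible in the head() output); pass header=None to keep that record as data.
data = pd.read_csv('data/book32-listing.csv', encoding="ISO-8859-1")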

In [3]:
data.head()

Unnamed: 0,761183272,0761183272.jpg,http://ecx.images-amazon.com/images/I/61Y5cOdHJbL.jpg,Mom's Family Wall Calendar 2016,Sandra Boynton,3,Calendars
0,1623439671,1623439671.jpg,http://ecx.images-amazon.com/images/I/61t-hrSw...,Doug the Pug 2016 Wall Calendar,Doug the Pug,3,Calendars
1,B00O80WC6I,B00O80WC6I.jpg,http://ecx.images-amazon.com/images/I/41X-KQqs...,"Moleskine 2016 Weekly Notebook, 12M, Large, Bl...",Moleskine,3,Calendars
2,761182187,0761182187.jpg,http://ecx.images-amazon.com/images/I/61j-4gxJ...,365 Cats Color Page-A-Day Calendar 2016,Workman Publishing,3,Calendars
3,1578052084,1578052084.jpg,http://ecx.images-amazon.com/images/I/51Ry4Tsq...,Sierra Club Engagement Calendar 2016,Sierra Club,3,Calendars
4,1578052076,1578052076.jpg,http://ecx.images-amazon.com/images/I/619KxYEq...,Sierra Club Wilderness Calendar 2016,Sierra Club,3,Calendars


### Renaming columns and splitting into feature (title) and target (genre) variables

In [4]:
# Assign descriptive column names
columns = ['Id', 'Image', 'Image_link', 'Title', 'Author', 'Class', 'Genre']
data.columns = columns

In [5]:
data[20000:20010]

Unnamed: 0,Id,Image,Image_link,Title,Author,Class,Genre
20000,446552313,0446552313.jpg,http://ecx.images-amazon.com/images/I/51qTC5YF...,The Bible of Unspeakable Truths,Greg Gutfeld,13,Humor & Entertainment
20001,140280243,0140280243.jpg,http://ecx.images-amazon.com/images/I/51CiSRkd...,A Treasury of Royal Scandals: The Shocking Tru...,Michael Farquhar,13,Humor & Entertainment
20002,316236772,0316236772.jpg,http://ecx.images-amazon.com/images/I/51G8BEDi...,"Beautifully Unique Sparkleponies: On Myths, Mo...",Chris Kluwe,13,Humor & Entertainment
20003,609804618,0609804618.jpg,http://ecx.images-amazon.com/images/I/519uo9GI...,Our Dumb Century: The Onion Presents 100 Years...,The Onion,13,Humor & Entertainment
20004,62320408,0062320408.jpg,http://ecx.images-amazon.com/images/I/51U1SfEk...,President Me: The America That's in My Head,Adam Carolla,13,Humor & Entertainment
20005,1400048575,1400048575.jpg,http://ecx.images-amazon.com/images/I/51MXTST8...,A Right to Be Hostile: The Boondocks Treasury,Aaron McGruder,13,Humor & Entertainment
20006,449208397,0449208397.jpg,http://ecx.images-amazon.com/images/I/51PwqZDv...,"If Life Is a Bowl of Cherries, What Am I Doing...",Erma Bombeck,13,Humor & Entertainment
20007,1101906073,1101906073.jpg,http://ecx.images-amazon.com/images/I/41ABM1D8...,The Deleted E-Mails of Hillary Clinton: A Parody,John Moe,13,Humor & Entertainment
20008,345336976,0345336976.jpg,http://ecx.images-amazon.com/images/I/51o7Ik9i...,Without Feathers,Woody Allen,13,Humor & Entertainment
20009,692501940,0692501940.jpg,http://ecx.images-amazon.com/images/I/51lDqKfc...,Trump 2016: Off-Color Coloring Book (Off-Color...,Tom F. O'Leary,13,Humor & Entertainment


In [6]:
books = pd.DataFrame(data['Title'])
genre = pd.DataFrame(data['Genre'])

In [7]:
print (type(books))
print (type(genre))

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>


In [8]:
print (books.shape)
print (genre.shape)

(207571, 1)
(207571, 1)


### Creating our vocabulary

Count how many times each word appears across all titles. Titles are split on single spaces, so case and punctuation are preserved.

In [9]:
from collections import Counter

total_counts = Counter()
for i in range(len(books)):
    for word in books.values[i][0].split(" "):
        total_counts[word] += 1

print("Total words in data set: ", len(total_counts))

Total words in data set:  151193


Sort the words by decreasing frequency, so the most frequent word comes first.

Only the 20000 most frequent words are kept as the vocabulary.

In [10]:
vocab = sorted(total_counts, key=total_counts.get, reverse=True)[:20000]
print(vocab[:60])

['and', 'of', 'The', 'the', 'to', 'A', 'in', 'for', 'Guide', '&', 'a', 'Your', 'Book', 'with', 'Edition', 'How', 'Edition)', 'on', '-', 'from', 'New', 'Series)', 'An', 'Life', 'You', 'World', 'History', 'by', 'Recipes', 'Complete', 'American', '(The', 'For', 'Story', 'My', 'Travel', 'Art', '', '(Volume', 'Volume', 'Law', 'Calendar', 'Health', 'From', 'To', 'And', 'Handbook', 'at', 'Best', 'Novel', 'In', 'What', 'I', 'With', 'Practice', 'Science', 'Great', 'Vol.', 'Introduction', 'Of']
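
The counting and truncation steps can be illustrated on a toy input. This is only a sketch: the three titles and the names `titles`, `counts`, `top_n` and `vocab_demo` are made up for the example and are not part of the notebook. It also shows why the real vocabulary contains both 'The' and 'the', and tokens such as 'Edition)': splitting on spaces keeps case and punctuation.

```python
from collections import Counter

# Hypothetical titles, chosen only to illustrate the counting step.
titles = [
    "The Art of War",
    "The History of Art (Second Edition)",
    "the art of cooking",
]

counts = Counter()
for title in titles:
    # Same whitespace split as the notebook: case and punctuation are kept.
    counts.update(title.split(" "))

# Keep the N most frequent tokens (the notebook uses N = 20000).
top_n = 3
vocab_demo = sorted(counts, key=counts.get, reverse=True)[:top_n]

print(vocab_demo)
print(counts["The"], counts["the"], counts["Edition)"])
```

Lowercasing and stripping punctuation before counting would merge such variants, at the cost of changing the vocabulary the rest of the notebook relies on.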


In [11]:
# Frequency of the last (least frequent) word kept in the vocabulary
print(vocab[-1], ': ', total_counts[vocab[-1]])

Trademarks :  8


Mapping from words to index

In [12]:
vocab_size = len(vocab)
word2idx = {}
#print vocab_size
for i, word in enumerate(vocab):
    word2idx[word] = i

Helper Function to convert all Titles to vectors

In [13]:
def text_to_vector(text):
    word_vector = np.zeros(vocab_size)
    for word in text.split(" "):
        if word2idx.get(word) is None:
            continue
        else:
            word_vector[word2idx.get(word)] += 1
    return np.array(word_vector)

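#### How the vector is built
Each position holds the count of one vocabulary word, with positions ordered from the most to the least frequent word.

E.g. a 1 at index 1 means the second word in the vocabulary ('of') occurs once in this sentence. The remaining words either map to positions beyond the first ten shown (for example 'a' at index 10) or are not in the vocabulary.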


In [14]:
text_to_vector("I am of a Legend")[:10]

array([ 0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.])
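
The same bag-of-words idea on a deliberately tiny vocabulary may make the mapping easier to follow. This is a sketch, not the notebook's code: `vocab_demo`, `word2idx_demo` and `text_to_counts` are hypothetical names, the vocabulary is just the first five entries printed earlier, and plain lists stand in for NumPy arrays so the example has no dependencies.

```python
# Hypothetical five-word vocabulary; the notebook's real one has 20000 entries.
vocab_demo = ["and", "of", "The", "the", "to"]
word2idx_demo = {word: i for i, word in enumerate(vocab_demo)}


def text_to_counts(text):
    """Return a list of per-word counts aligned with vocab_demo."""
    vector = [0] * len(vocab_demo)
    for word in text.split(" "):
        idx = word2idx_demo.get(word)
        if idx is not None:  # words outside the vocabulary are skipped
            vector[idx] += 1
    return vector


short_vec = text_to_counts("I am of a Legend")
long_vec = text_to_counts("The Lord of the Rings and the Return of the King")
print(short_vec)
print(long_vec)
```

Word order is discarded: only how often each vocabulary word occurs in the title survives.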

Convert all titles to vectors

In [15]:
word_vectors = np.zeros((len(books), len(vocab)), dtype=np.int_)
for ii, (_, text) in enumerate(books.iterrows()):
    word_vectors[ii] = text_to_vector(text[0])

In [16]:
#word_vectors[:5, :23]
word_vectors.shape

(207571, 20000)
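
A quick back-of-the-envelope calculation shows how much memory a dense matrix of this shape needs. The sketch assumes `np.int_` is a 64-bit integer (8 bytes), which is typical on Linux and macOS but not guaranteed on every platform.

```python
rows, cols = 207571, 20000  # shape reported above
bytes_per_value = 8         # assumption: np.int_ is a 64-bit integer here

total_bytes = rows * cols * bytes_per_value
size_gib = total_bytes / 2**30
print(f"{size_gib:.1f} GiB")
```

Since a title contains only a handful of words, almost every entry is zero, so a smaller dtype or a sparse representation could reduce the footprint considerably.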

In [17]:
books.describe()

Unnamed: 0,Title
count,207571
unique,203470
top,Tao Te Ching
freq,11


In [18]:
genre.describe()

Unnamed: 0,Genre
count,207571
unique,32
top,Travel
freq,18338


In [19]:
#genre_ = pd.get_dummies(genre)

In [20]:
type(genre)
len(genre)
genre

Unnamed: 0,Genre
0,Calendars
1,Calendars
2,Calendars
3,Calendars
4,Calendars
5,Calendars
6,Calendars
7,Calendars
8,Calendars
9,Calendars


### Implementing the same algorithm to map genres to vectors

In [21]:
## Counting how many times a word appears in the dataset
from collections import Counter

total_counts_genre = Counter()
for i in range(len(genre)):
    for word in genre.values[i][0].split(" "):
        total_counts_genre[word] += 1

print("Total words in data set: ", len(total_counts_genre))

Total words in data set:  63


In [22]:
vocab_genre = sorted(total_counts_genre, key=total_counts_genre.get, reverse=True)

## Remove '&', the most frequent token, which sorts first
vocab_genre = vocab_genre[1:]
print (vocab_genre)

['Books', 'Travel', "Children's", 'Science', 'Medical', 'Health,', 'Fitness', 'Dieting', 'Fiction', 'Business', 'Money', 'Crafts,', 'Hobbies', 'Home', 'Math', 'Christian', 'Bibles', 'Cookbooks,', 'Food', 'Wine', 'Computers', 'Technology', 'Literature', 'Religion', 'Spirituality', 'Teen', 'Young', 'Adult', 'Law', 'Humor', 'Entertainment', 'History', 'Arts', 'Photography', 'Sports', 'Outdoors', 'Romance', 'Biographies', 'Memoirs', 'Fantasy', 'Politics', 'Social', 'Sciences', 'Reference', 'Comics', 'Graphic', 'Novels', 'Test', 'Preparation', 'Self-Help', 'Engineering', 'Transportation', 'Calendars', 'Parenting', 'Relationships', 'Mystery,', 'Thriller', 'Suspense', 'Education', 'Teaching', 'Gay', 'Lesbian']


In [23]:
print(vocab_genre[-1], ': ', total_counts_genre[vocab_genre[-1]])

Lesbian :  1339


In [24]:
vocab_size_genre = len(vocab_genre)
word2idx_genre = {}
#print vocab_size
for i, word in enumerate(vocab_genre):
    word2idx_genre[word] = i

In [25]:
def text_to_vector_genre(text):
    word_vector_genre = np.zeros(vocab_size_genre)
    for word in text.split(" "):
        if word2idx_genre.get(word) is None:
            continue
        else:
            word_vector_genre[word2idx_genre.get(word)] += 1
    return np.array(word_vector_genre)

In [26]:
word_vectors_genre = np.zeros((len(genre), len(vocab_genre)), dtype=np.int_)
for ii, (_, text) in enumerate(genre.iterrows()):
    word_vectors_genre[ii] = text_to_vector_genre(text[0])

In [27]:
word_vectors_genre.shape
word_vectors_genre[1]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
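
Because genre names are split into words, a multi-word genre sets several positions at once. The sketch below uses a five-word subset of the genre vocabulary; `genre_vocab_demo`, `genre_idx_demo` and `genre_to_counts` are hypothetical names, and plain lists replace NumPy arrays to keep the example dependency-free.

```python
# Subset of the genre vocabulary printed above, for illustration only.
genre_vocab_demo = ["Books", "Travel", "Humor", "Entertainment", "Calendars"]
genre_idx_demo = {word: i for i, word in enumerate(genre_vocab_demo)}


def genre_to_counts(label):
    """Return per-word counts for one genre label."""
    vector = [0] * len(genre_vocab_demo)
    for word in label.split(" "):
        idx = genre_idx_demo.get(word)
        if idx is not None:  # '&' is not in the vocabulary, so it is skipped
            vector[idx] += 1
    return vector


single = genre_to_counts("Calendars")
multi = genre_to_counts("Humor & Entertainment")
print(single)
print(multi)
```

A one-word genre such as "Calendars" yields a one-hot vector, while "Humor & Entertainment" yields a vector with two active positions.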

In [28]:
print (word_vectors.shape)
word_vectors_genre[1]
list(word2idx_genre)

(207571, 20000)


['Books',
 'Travel',
 "Children's",
 'Science',
 'Medical',
 'Health,',
 'Fitness',
 'Dieting',
 'Fiction',
 'Business',
 'Money',
 'Crafts,',
 'Hobbies',
 'Home',
 'Math',
 'Christian',
 'Bibles',
 'Cookbooks,',
 'Food',
 'Wine',
 'Computers',
 'Technology',
 'Literature',
 'Religion',
 'Spirituality',
 'Teen',
 'Young',
 'Adult',
 'Law',
 'Humor',
 'Entertainment',
 'History',
 'Arts',
 'Photography',
 'Sports',
 'Outdoors',
 'Romance',
 'Biographies',
 'Memoirs',
 'Fantasy',
 'Politics',
 'Social',
 'Sciences',
 'Reference',
 'Comics',
 'Graphic',
 'Novels',
 'Test',
 'Preparation',
 'Self-Help',
 'Engineering',
 'Transportation',
 'Calendars',
 'Parenting',
 'Relationships',
 'Mystery,',
 'Thriller',
 'Suspense',
 'Education',
 'Teaching',
 'Gay',
 'Lesbian']

Splitting into training and testing sets

In [29]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(word_vectors, word_vectors_genre, test_size=0.25, random_state=42)

In [30]:
print (X_train.shape)
print (y_train.shape)
print (X_test.shape)
print (y_test.shape)

(155678, 20000)
(155678, 62)
(51893, 20000)
(51893, 62)
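
The split sizes above can be reproduced by hand. The sketch assumes that the test count is the requested fraction rounded up and that the remaining rows go to training, which is consistent with the shapes printed above.

```python
import math

n_samples = 207571
test_size = 0.25

# Assumption: the test count is rounded up and the rest goes to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)
```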


### Building the Model

In [31]:
def build_model():
    tf.reset_default_graph()
    
    # Input
    net = tflearn.input_data([None, 20000])
    
    #Hidden
    net = tflearn.fully_connected(net, 128, activation='ReLU')
    net = tflearn.fully_connected(net, 128, activation='ReLU')
    net = tflearn.fully_connected(net, 128, activation='ReLU')
    net = tflearn.fully_connected(net, 256, activation='ReLU')

    ## Dropout
    net = tflearn.layers.core.dropout (net, 0.5, noise_shape=None, name='Dropout')

    #Output
    net = tflearn.fully_connected(net, 62, activation='softmax') 
    net = tflearn.regression(net, optimizer='adam', learning_rate=0.003, loss='categorical_crossentropy')
    
    
    model = tflearn.DNN(net)
    return model
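
To get a feel for the size of this network, the number of trainable parameters can be counted layer by layer. The sketch assumes every fully connected layer has one bias per output unit; `layer_sizes` simply lists the input width followed by the layer widths used in `build_model`.

```python
# Input width followed by the width of each fully connected layer.
layer_sizes = [20000, 128, 128, 128, 256, 62]

total = 0
for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
    params = n_in * n_out + n_out  # weights plus one bias per output unit
    print(f"{n_in:>5} -> {n_out:<3}: {params:,}")
    total += params

print(f"total: {total:,}")
```

Under that assumption, the first layer accounts for almost all of the parameters, because it connects the 20000-dimensional input to 128 units.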

In [32]:
model = build_model()

### Training

In [33]:
hist = model.fit(X_train, y_train, validation_set=0.05, show_metric=True, batch_size=32, n_epoch=10)

Training Step: 46219  | total loss: 3.26348 | time: 183.693s
| Adam | epoch: 010 | loss: 3.26348 - acc: 0.4876 -- iter: 147872/147894
Training Step: 46220  | total loss: 3.17957 | time: 187.356s
| Adam | epoch: 010 | loss: 3.17957 - acc: 0.4920 | val_loss: 6.87093 - val_acc: 0.3231 -- iter: 147894/147894
--
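
The final step number in the log follows from the batch size and the number of epochs. The sketch takes the 147894 training examples reported in the log as given (the rest of `X_train` is held out by `validation_set=0.05`).

```python
import math

n_train = 147894  # training examples reported in the log above
batch_size = 32
n_epoch = 10

steps_per_epoch = math.ceil(n_train / batch_size)  # the last batch is partial
total_steps = steps_per_epoch * n_epoch
print(steps_per_epoch, total_steps)
```

The result matches the last `Training Step` shown in the log.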


In [34]:
## save the model
model.save('models/Basic.tfl')

INFO:tensorflow:/Users/akshaybhatia/Desktop/Spikeway AI Internship/book-dataset/Task2/models/Basic.tfl is not in all_model_checkpoint_paths. Manually adding it.


In [35]:
model.load("models/Basic.tfl")

INFO:tensorflow:Restoring parameters from /Users/akshaybhatia/Desktop/Spikeway AI Internship/book-dataset/Task2/models/Basic.tfl


In [36]:
## Length of one feature vector from the test set (row 1000)
print (len(X_test[1000]))

20000


### Predictions

In [41]:
text = "Lord of the Rings: Return of the King" ## Genre is Christian Books and Bible
prob = model.predict([text_to_vector(text)])
print (prob)

[[0.030702032148838043, 0.010740121826529503, 0.021121108904480934, 0.05105367675423622, 0.0013459111796692014, 0.0018938007997348905, 0.0018584177596494555, 0.0018679560162127018, 0.14497590065002441, 0.0005127931945025921, 0.0005039435345679522, 0.0015564949717372656, 0.0015988099621608853, 0.0016039994079619646, 0.002526502124965191, 0.00745767168700695, 0.007449390832334757, 0.0003610215208027512, 0.00037011198583059013, 0.00035533373011276126, 0.0010298164561390877, 0.001046090037561953, 0.12750568985939026, 0.006428136490285397, 0.006503743585199118, 0.04152499511837959, 0.04226037487387657, 0.04301370307803154, 0.0025409578811377287, 0.016943149268627167, 0.01668836921453476, 0.0035385936498641968, 0.005658373236656189, 0.005907632410526276, 0.0023282666224986315, 0.002264992566779256, 0.11412695050239563, 0.007894408889114857, 0.00796525552868843, 0.05198577418923378, 0.0019442636985331774, 0.0019404711201786995, 0.0019378394354134798, 0.002734596375375986, 0.007940965704619884

In [42]:
len(prob[0])

62

In [43]:
results = sorted(((value, index) for index, value in enumerate(prob[0])), reverse=True)[:5]
print (results)

[(0.14497590065002441, 8), (0.12750568985939026, 22), (0.11412695050239563, 36), (0.053220853209495544, 55), (0.0525556318461895, 56)]


Displaying the top 5 predicted genre words (each output index corresponds to one word of a genre name)

In [44]:
# Map each of the top indices back to its genre word.
# list(word2idx_genre) relies on dicts preserving insertion order (Python 3.7+).
original_genre_vocab = list(word2idx_genre)
for _, idx in results:
    print(original_genre_vocab[idx])

Fiction
Literature
Romance
Mystery,
Thriller
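
As an alternative to sorting every (value, index) pair, the top entries can be picked with `heapq.nlargest` from the standard library. This is a sketch on made-up numbers: `probs_demo` and `tokens_demo` are hypothetical and much shorter than the 62-entry output of the model.

```python
import heapq

# Hypothetical probabilities and genre words, for illustration only.
probs_demo = [0.05, 0.40, 0.10, 0.30, 0.15]
tokens_demo = ["Books", "Fiction", "Travel", "Romance", "Law"]

# Indices of the 3 largest probabilities, highest first.
top = heapq.nlargest(3, range(len(probs_demo)), key=probs_demo.__getitem__)
for idx in top:
    print(tokens_demo[idx], probs_demo[idx])
```

With only 62 outputs the difference in speed is negligible; the main benefit is that the intent (take the k largest) is stated directly.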
