## Exploratory Analysis and Categorisation by Genre of the Amazon Book Dataset

### Instructions

#### The following libraries need to be installed

- pandas
- numpy
- tensorflow
- tflearn
- scikit-learn
- jupyter notebook

Download the book dataset from https://github.com/uchidalab/book-dataset/tree/master/Task1

In [2]:
import pandas as pd
import numpy as np
import tensorflow as tf
import tflearn
from tflearn.data_utils import to_categorical

In [3]:
data = pd.read_csv('book32-listing.csv',encoding = "ISO-8859-1")

In [4]:
data.head()

Unnamed: 0,761183272,0761183272.jpg,http://ecx.images-amazon.com/images/I/61Y5cOdHJbL.jpg,Mom's Family Wall Calendar 2016,Sandra Boynton,3,Calendars
0,1623439671,1623439671.jpg,http://ecx.images-amazon.com/images/I/61t-hrSw...,Doug the Pug 2016 Wall Calendar,Doug the Pug,3,Calendars
1,B00O80WC6I,B00O80WC6I.jpg,http://ecx.images-amazon.com/images/I/41X-KQqs...,"Moleskine 2016 Weekly Notebook, 12M, Large, Bl...",Moleskine,3,Calendars
2,761182187,0761182187.jpg,http://ecx.images-amazon.com/images/I/61j-4gxJ...,365 Cats Color Page-A-Day Calendar 2016,Workman Publishing,3,Calendars
3,1578052084,1578052084.jpg,http://ecx.images-amazon.com/images/I/51Ry4Tsq...,Sierra Club Engagement Calendar 2016,Sierra Club,3,Calendars
4,1578052076,1578052076.jpg,http://ecx.images-amazon.com/images/I/619KxYEq...,Sierra Club Wilderness Calendar 2016,Sierra Club,3,Calendars


#### Renaming columns and splitting into feature(title) and target(genre) variables

In [5]:
## Assign readable column names (the CSV has no header row, so pandas used the first record as column names)
columns = ['Id', 'Image', 'Image_link', 'Title', 'Author', 'Class', 'Genre']
data.columns = columns
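
The first `data.head()` output above shows a book record (`761183272`, `Mom's Family Wall Calendar 2016`) in the header position, which suggests that record is consumed as column names when the file is read. A minimal sketch of one way to keep it, assuming the file really has no header row: pass `header=None` and `names=` to `pd.read_csv`. The two-record inline sample and its `example.com` links are hypothetical stand-ins for `book32-listing.csv`.

```python
import io
import pandas as pd

columns = ['Id', 'Image', 'Image_link', 'Title', 'Author', 'Class', 'Genre']

# Hypothetical two-record sample standing in for book32-listing.csv
sample_csv = io.StringIO(
    "761183272,0761183272.jpg,http://example.com/a.jpg,"
    "Mom's Family Wall Calendar 2016,Sandra Boynton,3,Calendars\n"
    "1623439671,1623439671.jpg,http://example.com/b.jpg,"
    "Doug the Pug 2016 Wall Calendar,Doug the Pug,3,Calendars\n"
)

# header=None stops pandas from treating the first record as column names;
# names= assigns the column labels in the same call
sample = pd.read_csv(sample_csv, header=None, names=columns)
print(sample.shape)
print(sample['Title'].iloc[0])
```

With this approach the separate `data.columns = columns` step would no longer be needed, and the first record stays in the data.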

In [6]:
data.head(10)

Unnamed: 0,Id,Image,Image_link,Title,Author,Class,Genre
0,1623439671,1623439671.jpg,http://ecx.images-amazon.com/images/I/61t-hrSw...,Doug the Pug 2016 Wall Calendar,Doug the Pug,3,Calendars
1,B00O80WC6I,B00O80WC6I.jpg,http://ecx.images-amazon.com/images/I/41X-KQqs...,"Moleskine 2016 Weekly Notebook, 12M, Large, Bl...",Moleskine,3,Calendars
2,761182187,0761182187.jpg,http://ecx.images-amazon.com/images/I/61j-4gxJ...,365 Cats Color Page-A-Day Calendar 2016,Workman Publishing,3,Calendars
3,1578052084,1578052084.jpg,http://ecx.images-amazon.com/images/I/51Ry4Tsq...,Sierra Club Engagement Calendar 2016,Sierra Club,3,Calendars
4,1578052076,1578052076.jpg,http://ecx.images-amazon.com/images/I/619KxYEq...,Sierra Club Wilderness Calendar 2016,Sierra Club,3,Calendars
5,1449468713,1449468713.jpg,http://ecx.images-amazon.com/images/I/61tNsiue...,Thomas Kinkade: The Disney Dreams Collection 2...,Thomas Kinkade,3,Calendars
6,316380652,0316380652.jpg,http://ecx.images-amazon.com/images/I/51JoXERf...,Ansel Adams 2016 Wall Calendar,,3,Calendars
7,1449465145,1449465145.jpg,http://ecx.images-amazon.com/images/I/51Iy7Wil...,Dilbert 2016 Day-to-Day Calendar,Scott Adams,3,Calendars
8,1449461581,1449461581.jpg,http://ecx.images-amazon.com/images/I/61-4emum...,Mary Engelbreit 2016 Deluxe Wall Calendar: Nev...,Mary Engelbreit,3,Calendars
9,761183558,0761183558.jpg,http://ecx.images-amazon.com/images/I/517kcRbF...,Cat Page-A-Day Gallery Calendar 2016,Workman Publishing,3,Calendars


In [7]:
books = pd.DataFrame(data['Title'])
genre = pd.DataFrame(data['Genre'])

In [8]:
print (type(books))
print (type(genre))

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>


In [9]:
print (books.shape)
print (genre.shape)

(207571, 1)
(207571, 1)


In [10]:
## Counting how many times a word appears in the dataset
from collections import Counter

total_counts = Counter()
for i in range(len(books)):
    for word in books.values[i][0].split(" "):
        total_counts[word] += 1

print("Total words in data set: ", len(total_counts))

Total words in data set:  151193


In [11]:
## Sorting in decreasing order (word with the highest frequency appears first)
## We only use the 40000 most frequent words for our vocab
vocab = sorted(total_counts, key=total_counts.get, reverse=True)[:40000]
print(vocab[:60])

['and', 'of', 'The', 'the', 'to', 'A', 'in', 'for', 'Guide', '&', 'a', 'Your', 'Book', 'with', 'Edition', 'How', 'Edition)', 'on', '-', 'from', 'New', 'Series)', 'An', 'Life', 'You', 'World', 'History', 'by', 'Recipes', 'Complete', 'American', '(The', 'For', 'Story', 'My', 'Travel', 'Art', '', '(Volume', 'Volume', 'Law', 'Calendar', 'Health', 'From', 'To', 'And', 'Handbook', 'at', 'Best', 'Novel', 'In', 'What', 'I', 'With', 'Practice', 'Science', 'Great', 'Vol.', 'Introduction', 'Of']
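
The vocabulary printed above treats `'The'` and `'the'`, or `'Edition'` and `'Edition)'`, as different words, because titles are split on spaces only. A minimal sketch of a normalising tokenizer follows; it is not what the notebook does, and `tokenize` and the sample titles are hypothetical names introduced for illustration.

```python
import re
from collections import Counter

def tokenize(title):
    """Lowercase a title and keep runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", title.lower())

# Hypothetical sample titles
sample_titles = [
    "The Complete Guide (2nd Edition)",
    "the guide to The World",
]

normalized_counts = Counter()
for title in sample_titles:
    normalized_counts.update(tokenize(title))

print(normalized_counts.most_common(3))
```

Whether this helps classification accuracy is untested here; it mainly shrinks the vocabulary by merging case and punctuation variants.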


In [12]:
# The least frequent word kept in the vocab, and how often it appears
print(vocab[-1], ': ', total_counts[vocab[-1]])

Diamonds: :  3


In [13]:
# Mapping from words to index

vocab_size = len(vocab)
word2idx = {}
#print vocab_size
for i, word in enumerate(vocab):
    word2idx[word] = i

In [14]:
### Text to Vector
def text_to_vector(text):
    word_vector = np.zeros(vocab_size)
    for word in text.split(" "):
        if word2idx.get(word) is None:
            continue
        else:
            word_vector[word2idx.get(word)] += 1
    return np.array(word_vector)

In [15]:
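## The vector is indexed by vocab rank (index 0 is the most frequent word)
# Eg : in the output below, the 1 at index 1 means 'of' (vocab[1]) occurs once in this sentence,
#      and the 1 at index 3 means 'the' (vocab[3]) occurs once; 'i' and 'am' either fall
#      outside the first 10 positions or are not in the vocab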

text_to_vector("of the i am")[:10]

array([ 0.,  1.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.])
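
The same bag-of-words idea can be traced by hand on a toy vocabulary. The sketch below uses only the first five vocab entries printed earlier; `toy_vocab`, `toy_word2idx` and `toy_text_to_vector` are hypothetical names, not part of the notebook.

```python
import numpy as np

# First five vocab entries as printed earlier; the real vocab has 40000 words
toy_vocab = ['and', 'of', 'The', 'the', 'to']
toy_word2idx = {word: i for i, word in enumerate(toy_vocab)}

def toy_text_to_vector(text):
    """Count occurrences of each known word; unknown words are skipped."""
    vector = np.zeros(len(toy_vocab), dtype=np.int_)
    for word in text.split(" "):
        idx = toy_word2idx.get(word)
        if idx is not None:
            vector[idx] += 1
    return vector

print(toy_text_to_vector("of the i am"))
print(toy_text_to_vector("the end of the road"))
```

Each position holds a count rather than a 0/1 flag, so a word that appears twice in a title contributes 2 at its index.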

In [16]:
## Convert all titles to vectors
word_vectors = np.zeros((len(books), len(vocab)), dtype=np.int_)
for ii, (_, text) in enumerate(books.iterrows()):
    word_vectors[ii] = text_to_vector(text.iloc[0])  # .iloc[0] reads the Title by position

In [17]:
#word_vectors[:5, :23]
word_vectors.shape

(207571, 40000)
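
A dense matrix of this shape is large. The back-of-the-envelope sketch below estimates its size, assuming `np.int_` is a 64-bit integer on the platform in use, and compares it with an 8-bit alternative, which would only be safe if no per-title word count exceeds 255.

```python
import numpy as np

n_titles, vocab_size = 207571, 40000   # shape reported above
n_cells = n_titles * vocab_size

# Assumption: np.int_ maps to a 64-bit integer on this platform
bytes_int64 = n_cells * np.dtype(np.int64).itemsize
# Assumption: word counts per title stay below 256, so uint8 would not overflow
bytes_uint8 = n_cells * np.dtype(np.uint8).itemsize

print(f"int64: {bytes_int64 / 1e9:.1f} GB, uint8: {bytes_uint8 / 1e9:.1f} GB")
```

Since almost every entry is zero, a sparse representation or a smaller vocabulary are other options worth considering.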

In [18]:
books.describe()

Unnamed: 0,Title
count,207571
unique,203470
top,Tao Te Ching
freq,11


In [19]:
genre.describe()

Unnamed: 0,Genre
count,207571
unique,32
top,Travel
freq,18338


In [20]:
#genre_ = pd.get_dummies(genre)

In [21]:
type(genre)
len(genre)
genre

Unnamed: 0,Genre
0,Calendars
1,Calendars
2,Calendars
3,Calendars
4,Calendars
5,Calendars
6,Calendars
7,Calendars
8,Calendars
9,Calendars


### Implementing the same algorithm to map genres to vectors

In [22]:
## Counting how many times each word appears in the genre labels
from collections import Counter

total_counts_genre = Counter()
for i in range(len(genre)):
    for word in genre.values[i][0].split(" "):
        total_counts_genre[word] += 1

print("Total words in data set: ", len(total_counts_genre))

Total words in data set:  63


In [23]:
vocab_genre = sorted(total_counts_genre, key=total_counts_genre.get, reverse=True)
#print(vocab_genre)
## Removing '&', which sorts first as the most frequent token
vocab_genre = vocab_genre[1:]
print (vocab_genre)

['Books', 'Travel', "Children's", 'Science', 'Medical', 'Health,', 'Fitness', 'Dieting', 'Fiction', 'Business', 'Money', 'Crafts,', 'Hobbies', 'Home', 'Math', 'Christian', 'Bibles', 'Cookbooks,', 'Food', 'Wine', 'Computers', 'Technology', 'Literature', 'Religion', 'Spirituality', 'Teen', 'Young', 'Adult', 'Law', 'Humor', 'Entertainment', 'History', 'Arts', 'Photography', 'Sports', 'Outdoors', 'Romance', 'Biographies', 'Memoirs', 'Fantasy', 'Politics', 'Social', 'Sciences', 'Reference', 'Comics', 'Graphic', 'Novels', 'Test', 'Preparation', 'Self-Help', 'Engineering', 'Transportation', 'Calendars', 'Parenting', 'Relationships', 'Mystery,', 'Thriller', 'Suspense', 'Education', 'Teaching', 'Gay', 'Lesbian']


In [24]:
print(vocab_genre[-1], ': ', total_counts_genre[vocab_genre[-1]])

Lesbian :  1339


In [25]:
vocab_size_genre = len(vocab_genre)
word2idx_genre = {}
#print vocab_size
for i, word in enumerate(vocab_genre):
    word2idx_genre[word] = i

In [26]:
def text_to_vector_genre(text):
    word_vector_genre = np.zeros(vocab_size_genre)
    for word in text.split(" "):
        if word2idx_genre.get(word) is None:
            continue
        else:
            word_vector_genre[word2idx_genre.get(word)] += 1
    return np.array(word_vector_genre)

In [27]:
word_vectors_genre = np.zeros((len(genre), len(vocab_genre)), dtype=np.int_)
for ii, (_, text) in enumerate(genre.iterrows()):
    word_vectors_genre[ii] = text_to_vector_genre(text.iloc[0])  # .iloc[0] reads the Genre by position

In [28]:
word_vectors_genre.shape
word_vectors_genre[1]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
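
Because genres are split into words, a multi-word genre produces a vector with several 1s, while `genre.describe()` reports only 32 unique genres. A softmax output trained with categorical cross-entropy usually assumes exactly one active class per example. The sketch below shows a possible alternative, not the notebook's method: one class index per full genre string. The sample genre strings and the names `class2idx` and `one_hot` are illustrative assumptions.

```python
import numpy as np

# Hypothetical sample of full genre strings
sample_genres = ["Calendars", "Health, Fitness & Dieting", "Calendars", "Travel"]

# One class per distinct genre string, in a stable (sorted) order
classes = sorted(set(sample_genres))
class2idx = {name: i for i, name in enumerate(classes)}

labels = np.array([class2idx[g] for g in sample_genres])
# Row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(len(classes), dtype=np.int_)[labels]

print(classes)
print(one_hot)
```

With this encoding the output layer would have one unit per genre (32 for this dataset, going by `genre.describe()`), and each predicted index would map straight back to a complete genre name.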

In [29]:
#Y = (genre).astype(np.int_)
#test_fraction = 0.9

#shuffle = np.arange(records)
#np.random.shuffle(shuffle)
#test_fraction = 0.9

#train_split, test_split = shuffle[:int(records*test_fraction)], shuffle[int(records*test_fraction):]
#trainX, trainY = word_vectors[train_split,:], to_categorical(Y.values[train_split], 2)
#testX, testY = word_vectors[test_split,:], to_categorical(Y.values[test_split], 2)

In [30]:
word_vectors_genre[0]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [31]:
print (word_vectors.shape)
word_vectors_genre[1]
list(word2idx_genre)

(207571, 40000)


['Books',
 'Travel',
 "Children's",
 'Science',
 'Medical',
 'Health,',
 'Fitness',
 'Dieting',
 'Fiction',
 'Business',
 'Money',
 'Crafts,',
 'Hobbies',
 'Home',
 'Math',
 'Christian',
 'Bibles',
 'Cookbooks,',
 'Food',
 'Wine',
 'Computers',
 'Technology',
 'Literature',
 'Religion',
 'Spirituality',
 'Teen',
 'Young',
 'Adult',
 'Law',
 'Humor',
 'Entertainment',
 'History',
 'Arts',
 'Photography',
 'Sports',
 'Outdoors',
 'Romance',
 'Biographies',
 'Memoirs',
 'Fantasy',
 'Politics',
 'Social',
 'Sciences',
 'Reference',
 'Comics',
 'Graphic',
 'Novels',
 'Test',
 'Preparation',
 'Self-Help',
 'Engineering',
 'Transportation',
 'Calendars',
 'Parenting',
 'Relationships',
 'Mystery,',
 'Thriller',
 'Suspense',
 'Education',
 'Teaching',
 'Gay',
 'Lesbian']

In [34]:
## Splitting into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(word_vectors, word_vectors_genre, test_size=0.30, random_state=42)
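
To make the split concrete, here is a NumPy-only sketch of a shuffled 70/30 split, similar in spirit to `train_test_split` but not a reimplementation of it. The toy arrays and the names `features`, `targets`, `X_tr` and `X_te` are hypothetical stand-ins for `word_vectors` and `word_vectors_genre`.

```python
import numpy as np

# Hypothetical toy data standing in for word_vectors / word_vectors_genre
features = np.arange(20).reshape(10, 2)
targets = np.arange(10)

rng = np.random.default_rng(42)          # fixed seed so the split is repeatable
order = rng.permutation(len(features))   # shuffled row indices
n_test = int(round(0.30 * len(features)))

test_idx, train_idx = order[:n_test], order[n_test:]
X_tr, X_te = features[train_idx], features[test_idx]
y_tr, y_te = targets[train_idx], targets[test_idx]

print(X_tr.shape, X_te.shape)
```

Features and targets are indexed with the same shuffled indices, which keeps every title paired with its own genre vector.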

In [35]:
## Building the model
def build_model():
    tf.reset_default_graph()
    
    # Input
    net = tflearn.input_data([None, 40000])
    
    #Hidden
    net = tflearn.fully_connected(net, 100, activation='ReLU')
    net = tflearn.fully_connected(net, 100, activation='ReLU')
    net = tflearn.fully_connected(net, 100, activation='ReLU')
    net = tflearn.fully_connected(net, 100, activation='ReLU')
    net = tflearn.fully_connected(net, 100, activation='ReLU')

    ## Dropout
    net = tflearn.layers.core.dropout (net, 0.5, noise_shape=None, name='Dropout')

    #Output
    net = tflearn.fully_connected(net, 62, activation='softmax') 
    net = tflearn.regression(net, optimizer='adam', learning_rate=0.001, loss='categorical_crossentropy')
    
    
    model = tflearn.DNN(net)
    return model
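
For a sense of scale, the number of trainable weights in this network can be worked out from the layer sizes alone. The sketch below assumes each `fully_connected` layer has one weight per input-output pair plus one bias per output unit, and that dropout adds no parameters.

```python
# Layer widths taken from build_model(): input, five hidden layers, output
layer_sizes = [40000, 100, 100, 100, 100, 100, 62]

total_params = 0
for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    layer_params = n_in * n_out + n_out   # weight matrix plus bias vector
    total_params += layer_params
    print(f"{n_in:>6} -> {n_out:<4} {layer_params:>10,} parameters")

print(f"total: {total_params:,}")
```

Nearly all of the parameters sit in the first layer, so the vocabulary size dominates the model's size.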

In [36]:
model = build_model()

In [37]:
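## Note: this trains on the full dataset (holding out 20% for validation),
## not on X_train/y_train, so X_test is not unseen data for this model
model.fit(word_vectors, word_vectors_genre, validation_set=0.2, show_metric=True, batch_size=32, n_epoch=10)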

Training Step: 31139  | total loss: 3.22854 | time: 300.947s
| Adam | epoch: 006 | loss: 3.22854 - acc: 0.4559 -- iter: 166048/166056


KeyboardInterrupt: 

In [130]:
#predictions = (np.array(model.predict(X_test))[:,0] >= 0.5).astype(np.int_)
#preds = np.array(model.predict(X_test))
#test_accuracy = np.mean(predictions == y_test[:,0], axis=0)
#print("Test accuracy: ", preds)
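
The commented-out evaluation above thresholds a single output column, which suits a binary classifier rather than a 62-way output. A minimal sketch of one possible multi-class check follows: a prediction counts as a hit when the most probable index is one of the active positions in the label vector. The small `probs` and `y_true` arrays are hypothetical stand-ins for `model.predict(X_test)` and `y_test`.

```python
import numpy as np

# Hypothetical stand-ins for model.predict(X_test) and y_test
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6],
                  [0.5, 0.4, 0.1]])
y_true = np.array([[1, 0, 0],
                   [0, 1, 1],    # multi-hot label, as produced for multi-word genres
                   [0, 1, 0]])

top1 = probs.argmax(axis=1)                        # most probable index per row
hits = y_true[np.arange(len(y_true)), top1] > 0    # is that index active in the label?
accuracy = hits.mean()

print(top1, accuracy)
```

This is a lenient metric, since it accepts any one word of a multi-word genre as correct.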

In [38]:
## save the model
model.save('model_large.tfl')

INFO:tensorflow:/Users/akshaybhatia/Desktop/Spikeway AI Internship/book-dataset/Task2/model_large.tfl is not in all_model_checkpoint_paths. Manually adding it.


In [12]:
model.load("model.tfl")

INFO:tensorflow:Restoring parameters from /Users/akshaybhatia/Desktop/Spikeway AI Internship/book-dataset/Task2/model.tfl


In [13]:
## Length of the feature vector for one book (index 1000) in the testing set
print (len(X_test[1000]))


NameError: name 'X_test' is not defined

In [70]:
text = "Lord Shiva adventures" ## Expected genre: Christian Books & Bibles
prob = model.predict([text_to_vector(text)])
print (prob)

[[0.028382688760757446, 0.04901164025068283, 0.004732618574053049, 0.00041911445441655815, 1.0030681146799836e-10, 9.594331405737844e-12, 9.518615062820146e-12, 9.716095115963608e-12, 0.00018510298104956746, 2.3761254075438387e-10, 2.366134788100993e-10, 1.318204656541866e-09, 1.3255414543777988e-09, 1.3008952803872376e-09, 0.0009322043042629957, 4.567833821056411e-05, 4.5436165237333626e-05, 1.697272253027137e-14, 1.9747033259335904e-14, 1.758029926333688e-14, 4.0316865514491984e-13, 4.328892387779615e-13, 9.873061935650185e-05, 0.007722774520516396, 0.00772290350869298, 1.806881118682213e-05, 1.806162799766753e-05, 1.8047026969725266e-05, 2.2709492952799337e-07, 2.1154376383947238e-07, 2.116620265724123e-07, 0.7527373433113098, 1.2382713521219557e-07, 1.238617386434271e-07, 7.395946255428498e-08, 7.405996882425825e-08, 7.704412041675823e-08, 0.07379534840583801, 0.07379986345767975, 1.0050019483287542e-07, 3.3553831144672586e-06, 3.349857934153988e-06, 3.3535789043526165e-06, 0.00016

In [71]:
len(prob[0])

62

In [72]:
#max(prob[0])

In [73]:
results = sorted(((value, index) for index, value in enumerate(prob[0])), reverse=True)[:5]
print (results)

[(0.7527373433113098, 31), (0.07379986345767975, 38), (0.07379534840583801, 37), (0.04901164025068283, 1), (0.028382688760757446, 0)]


In [74]:
### Predictions: map the top-5 indices back to genre words

original_genre_vocab = list(word2idx_genre)
#original_genre_vocab[:10]
for _, idx in results:
    print (original_genre_vocab[idx])

History
Memoirs
Biographies
Travel
Books
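
The top-5 lookup can also be written with `np.argsort`, which avoids building and sorting a list of tuples. This is a sketch on made-up values: `toy_probs` and `toy_labels` are hypothetical stand-ins for `prob[0]` and the genre vocabulary.

```python
import numpy as np

# Hypothetical stand-ins for prob[0] and list(word2idx_genre)
toy_probs = np.array([0.05, 0.60, 0.10, 0.25])
toy_labels = ['Books', 'History', 'Travel', 'Memoirs']

k = 2
top_k = np.argsort(toy_probs)[::-1][:k]   # indices of the k largest probabilities
top = [(toy_labels[i], float(toy_probs[i])) for i in top_k]

print(top)
```

Note that the predicted entries are genre words such as `History` or `Memoirs`, not complete genre names, because the targets were built from the individual words of each genre.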
