### Notebook purpose
Build a search engine over our list of books so that the recommendation page does not require the exact title of a book, which makes it more user friendly. <br>
The `isbn_index` of the matching book is then passed to our recommendation system to find similar books.
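
To make the idea concrete before the real data is loaded, here is a minimal from-scratch sketch of TF-IDF title search using only the standard library. It is an illustration, not the code this notebook runs: the helper names (`build_index`, `to_unit_vector`, `toy_search`) are hypothetical, tokens are split on whitespace, and the smoothed idf formula is assumed to match scikit-learn's default behaviour.

```python
import math
from collections import Counter

def build_index(titles):
    """Return (idf, unit-length TF-IDF vectors) for a list of titles."""
    docs = [Counter(t.lower().split()) for t in titles]
    n_docs = len(docs)
    # Document frequency: in how many titles each term appears.
    doc_freq = Counter(term for doc in docs for term in doc)
    # Smoothed inverse document frequency (assumed to mirror sklearn's default).
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in doc_freq.items()}
    return idf, [to_unit_vector(doc, idf) for doc in docs]

def to_unit_vector(counts, idf):
    """Weight term counts by idf, drop unknown terms, scale to unit length."""
    weights = {t: c * idf[t] for t, c in counts.items() if t in idf}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}

def toy_search(query, titles, top_n=3):
    """Rank titles by cosine similarity to the query, best match first."""
    idf, doc_vecs = build_index(titles)
    q = to_unit_vector(Counter(query.lower().split()), idf)
    # For unit vectors, cosine similarity reduces to a dot product.
    scores = [sum(w * d.get(t, 0.0) for t, w in q.items()) for d in doc_vecs]
    ranked = sorted(range(len(titles)), key=lambda i: scores[i], reverse=True)
    return [(titles[i], round(scores[i], 3)) for i in ranked[:top_n] if scores[i] > 0]

titles = ["Night", "Night Whispers", "Night Prey", "The Testament"]
print(toy_search("night", titles))
```

The exact-match title scores highest because its vector points in the same direction as the query, while longer titles that merely contain the word are diluted by their other terms.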

In [1]:
import logging
logging.captureWarnings(True)

import numpy as np
import pandas as pd

# dill mirrors the pickle API and can also serialize functions defined in the notebook
import dill as pickle
import re

from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.metrics.pairwise import cosine_similarity

In [2]:
df = pd.read_csv("data/clean_data.csv")
images = pd.read_csv("data/images.csv")

df.head()

Unnamed: 0,isbn,book_title,book_author,year_of_publication,publisher,mod_title,isbn_index,user_id,book_rating,location,age
0,440234743,The Testament,John Grisham,1999,Dell,the testament,87548,277478.0,0.0,"schiedam, zuid-holland, netherlands",31.0
1,440234743,The Testament,John Grisham,1999,Dell,the testament,87548,2977.0,0.0,"richland, washington, usa",25.0
2,440234743,The Testament,John Grisham,1999,Dell,the testament,87548,3363.0,0.0,"knoxville, tennessee, usa",29.0
3,440234743,The Testament,John Grisham,1999,Dell,the testament,87548,7346.0,9.0,"sunnyvale, california, usa",49.0
4,440234743,The Testament,John Grisham,1999,Dell,the testament,87548,9747.0,0.0,"o`fallon, missouri, usa",24.0


In [3]:
book_cols = ["isbn_index", "isbn", "book_title", "book_author", "year_of_publication", "publisher", "mod_title"]
books_clean = df[book_cols].drop_duplicates().sort_values(by="isbn")
books_clean.head()

Unnamed: 0,isbn_index,isbn,book_title,book_author,year_of_publication,publisher,mod_title
22196,806,000649840X,Angelas Ashes,Frank Mccourt,0,Harpercollins Uk,angelas ashes
70393,1111,0007110928,Billy,Pamela Stephenson,2002,HarperCollins Entertainment,billy
48417,1336,002026478X,AGE OF INNOCENCE (MOVIE TIE-IN),Edith Wharton,1993,Scribner,age of innocence movie tiein
84225,1472,0020442203,"Lion, the Witch and the Wardrobe",C.S. Lewis,1970,MacMillan Publishing Company.,lion the witch and the wardrobe
49023,1769,002542730X,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley and Sons Inc,politically correct bedtime stories modern tal...


In [4]:
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(books_clean["mod_title"])

In [5]:
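def make_clickable(val):
    # Wrap a URL in a link that opens in a new tab
    return '<a target="_blank" href="{}">Goodreads</a>'.format(val)

def show_image(val):
    # Render an image URL as a 50px thumbnail that links to the full image
    return '<a href="{}"><img src="{}" width=50></a>'.format(val, val)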

def search(query,vectorizer):
    # processed = re.sub("[^a-zA-Z0-9 ]", "", query.lower())
    processed = query.lower()
    query_vec = vectorizer.transform([processed])
    similarity = cosine_similarity(query_vec, tfidf).flatten()
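    # argpartition finds the 5 best matches but leaves them unordered,
    # so sort them by score (ascending; the reversal that follows puts the best first)
    indices = np.argpartition(similarity, -5)[-5:]
    indices = indices[np.argsort(similarity[indices])]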

    results = books_clean.iloc[indices].iloc[::-1]
    results = results.merge(images[["isbn","image_url_m"]], how = "left", on = "isbn")
    # results = results.sort_values("ratings", ascending=False)
    
    # return results.style.format({'image_url_s': show_image}) #use this if you want a dataframe as an output
    return results.head(5).set_index("isbn").T.to_dict() # use this if you want a dictionary as an output
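
`np.argpartition`, used in `search` to pick the best matches, only guarantees that the k largest values end up in the last k positions; it does not order them. The sketch below, with made-up similarity scores, shows one way to turn that unordered selection into a best-first ranking by sorting just the k candidates.

```python
import numpy as np

# Made-up similarity scores for six hypothetical titles.
similarity = np.array([0.10, 0.90, 0.30, 0.75, 0.05, 0.60])
k = 3

# The k largest scores occupy the last k slots, in no particular order.
top_unordered = np.argpartition(similarity, -k)[-k:]

# Sort only those k candidates by score, highest first.
top_ranked = top_unordered[np.argsort(similarity[top_unordered])[::-1]]
print(top_ranked)
```

Sorting only the k candidates keeps the cost low on a large catalogue compared with a full `argsort` over every title.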

In [6]:
search("night", vectorizer)

{'0553272535': {'isbn_index': 118219,
  'book_title': 'Night',
  'book_author': 'Elie Wiesel',
  'year_of_publication': 1982,
  'publisher': 'Bantam Books',
  'mod_title': 'night',
  'image_url_m': 'http://images.amazon.com/images/P/0553272535.01.MZZZZZZZ.jpg'},
 '0671525743': {'isbn_index': 135914,
  'book_title': 'Night Whispers',
  'book_author': 'Judith Mcnaught',
  'year_of_publication': 1999,
  'publisher': 'Pocket',
  'mod_title': 'night whispers',
  'image_url_m': 'http://images.amazon.com/images/P/0671525743.01.MZZZZZZZ.jpg'},
 '0425146413': {'isbn_index': 81650,
  'book_title': 'Night Prey',
  'book_author': 'John Sandford',
  'year_of_publication': 2004,
  'publisher': 'Berkley Publishing Group',
  'mod_title': 'night prey',
  'image_url_m': 'http://images.amazon.com/images/P/0425146413.01.MZZZZZZZ.jpg'},
 '1551669498': {'isbn_index': 219360,
  'book_title': 'Girls Night',
  'book_author': 'Stef Ann Holm',
  'year_of_publication': 2002,
  'publisher': 'Mira',
  'mod_title': 

## Pickle

Pickling is a serialization process: it saves the objects from our notebook to disk so that they can be loaded in `app.py` and used by the web app. <br>
We will pickle the data, the vectorizer, the result of the vectorizer's `fit_transform`, and the function used to search for books with similar titles.
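
As a minimal sketch of that round trip, the example below dumps a small dictionary to a file and loads it back. It uses the standard-library `pickle` module rather than `dill` (which exposes the same `dump`/`load` interface), and the temporary `catalog.pkl` path is an assumption made for illustration.

```python
import os
import pickle
import tempfile

# A tiny stand-in for the objects we want to hand over to the web app.
catalog = {"0553272535": {"book_title": "Night", "book_author": "Elie Wiesel"}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "catalog.pkl")  # hypothetical file name

    # Serialize to disk; the context manager closes the file for us.
    with open(path, "wb") as fh:
        pickle.dump(catalog, fh)

    # Deserialize into a new, independent object.
    with open(path, "rb") as fh:
        restored = pickle.load(fh)

print(restored == catalog, restored is catalog)
```

The restored object is equal to the original but is a separate copy, which is what lets another process such as the web app rebuild it from the file alone.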

In [7]:
# Context managers make sure each file is flushed and closed
with open('pickles/books_data.pkl', 'wb') as f:
    pickle.dump(books_clean, f)
with open('pickles/images_data.pkl', 'wb') as f:
    pickle.dump(images, f)

with open('pickles/books_data.pkl', 'rb') as f:
    books_data = pickle.load(f)
with open('pickles/images_data.pkl', 'rb') as f:
    images_data = pickle.load(f)

In [8]:
# Save the fitted vectorizer and the TF-IDF matrix it produced
with open('pickles/search_define.pkl', 'wb') as f:
    pickle.dump(vectorizer, f)
with open('pickles/search_fit_transform.pkl', 'wb') as f:
    pickle.dump(tfidf, f)

with open('pickles/search_define.pkl', 'rb') as f:
    search_define = pickle.load(f)
with open('pickles/search_fit_transform.pkl', 'rb') as f:
    search_fit_transform = pickle.load(f)

In [9]:
def new_search(query,vectorizer):
    # processed = re.sub("[^a-zA-Z0-9 ]", "", query.lower())
    processed = query.lower()
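    query_vec = vectorizer.transform([processed])
    similarity = cosine_similarity(query_vec, search_fit_transform).flatten()
    # argpartition returns the 10 best matches unordered; sort them by score
    # so the reversal puts the best first and head(5) keeps the true top 5
    indices = np.argpartition(similarity, -10)[-10:]
    indices = indices[np.argsort(similarity[indices])]
    results = books_data.iloc[indices].iloc[::-1]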
    results = results.merge(images_data[["isbn","image_url_m"]], how = "left", on = "isbn")
    # results = results.sort_values("ratings", ascending=False)
    
    # return results.style.format({'image_url_s': show_image}) #use this if you want a dataframe as an output
    return results.head(5).set_index("isbn").T.to_dict() # use this if you want a dictionary as an output

In [10]:
pickle.dump(new_search, open('pickles/search_result.pkl','wb'))

In [11]:
search_result = pickle.load(open('pickles/search_result.pkl', 'rb'))

In [12]:
book_rec = search_result("hunting",search_define)
book_rec

{'0140293248': {'isbn_index': 18466,
  'book_title': "The Girls' Guide to Hunting and Fishing",
  'book_author': 'Melissa Bank',
  'year_of_publication': 2000,
  'publisher': 'Penguin Books',
  'mod_title': 'the girls guide to hunting and fishing',
  'image_url_m': 'http://images.amazon.com/images/P/0140293248.01.MZZZZZZZ.jpg'},
 '0425188787': {'isbn_index': 83075,
  'book_title': 'Hunting Season (Anna Pigeon Novels (Paperback))',
  'book_author': 'Nevada Barr',
  'year_of_publication': 2003,
  'publisher': 'Berkley Publishing Group',
  'mod_title': 'hunting season anna pigeon novels paperback',
  'image_url_m': 'http://images.amazon.com/images/P/0425188787.01.MZZZZZZZ.jpg'},
 '067088300X': {'isbn_index': 132336,
  'book_title': "The Girls' Guide to Hunting and Fishing",
  'book_author': 'Melissa Bank',
  'year_of_publication': 1999,
  'publisher': 'Viking Books',
  'mod_title': 'the girls guide to hunting and fishing',
  'image_url_m': 'http://images.amazon.com/images/P/067088300X.01.

In [13]:
for record in book_rec.values():
    print(record)

{'isbn_index': 18466, 'book_title': "The Girls' Guide to Hunting and Fishing", 'book_author': 'Melissa Bank', 'year_of_publication': 2000, 'publisher': 'Penguin Books', 'mod_title': 'the girls guide to hunting and fishing', 'image_url_m': 'http://images.amazon.com/images/P/0140293248.01.MZZZZZZZ.jpg'}
{'isbn_index': 83075, 'book_title': 'Hunting Season (Anna Pigeon Novels (Paperback))', 'book_author': 'Nevada Barr', 'year_of_publication': 2003, 'publisher': 'Berkley Publishing Group', 'mod_title': 'hunting season anna pigeon novels paperback', 'image_url_m': 'http://images.amazon.com/images/P/0425188787.01.MZZZZZZZ.jpg'}
{'isbn_index': 132336, 'book_title': "The Girls' Guide to Hunting and Fishing", 'book_author': 'Melissa Bank', 'year_of_publication': 1999, 'publisher': 'Viking Books', 'mod_title': 'the girls guide to hunting and fishing', 'image_url_m': 'http://images.amazon.com/images/P/067088300X.01.MZZZZZZZ.jpg'}
{'isbn_index': 10356, 'book_title': 'Hunting Badger (Joe Leaphorn/Ji