# ABOUT:
- This notebook demonstrates the use of semanticSearch() and personalizedSearch().
- Background:
    - In previous notebooks, I created semanticSearch(), which first filters books by genre and then reranks the candidates with a CrossEncoder.
        - It is semantic because both stages use Transformer models.
    - I also created personalizedSearch(), which, given a user_id, returns book recommendations from a collaborative filtering model, specifically an AlternatingLeastSquares model.
- Motivation and advantages:
    - Many book recommendation or search engines use lexical matching.
        - Lexical matching can miss relevant results, e.g. "Fairies" would not be matched to "fantasy" even though the two are semantically similar.
    - The semanticSearch() pipeline is accurate yet efficient.
        - Filtering by genre drastically reduces the candidate pool.
        - This allows the CrossEncoder, a more powerful but costlier semantic similarity ranker, to rerank a much smaller set of items.
    - Our pipeline automatically extracts the desired genres from the user's query, which is useful because the list of genres is very large.
        - This makes search more seamless because the user does not need to scan and pick from a large list of genres.
    - personalizedSearch() makes relevant recommendations even without a query, improving the user experience.

In [27]:
import os
from pymongo import MongoClient
import certifi
ca = certifi.where()
# read the connection string from the environment instead of hardcoding credentials
# (MONGODB_URI is an assumed variable name)
client = MongoClient(os.environ["MONGODB_URI"], tlsCAFile=ca)
db = client["DP"] 
book_collection = db["books"]
full_genre_collection = db["full_genre"]

In [3]:
from sentence_transformers import SentenceTransformer
from sentence_transformers.cross_encoder import CrossEncoder
from numpy import dot
import numpy as np
import nanopq
import pickle
from rake_nltk import Rake

def search_genre_by_query(collection, query, embedding_model, keyword_model, topk = 3, return_scores = False):
    """
    FAST - Given a query, extract keywords. For each keyword, extract the top k genres. Utilizes matrix multiplication.
    """
    # extract query keywords
    keyword_model.extract_keywords_from_text(query)
    keywords = keyword_model.get_ranked_phrases()
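    # fall back to the whole query as a single keyword; wrap it in a list so that
    # encode() still returns a 2-D array and the loop over scores works per keyword
    keywords = [query] if len(keywords)==0 else keywords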
    # get document containing all genres
    document = collection.find_one({})
    # embed query keywords
    query_embedding = embedding_model.encode(keywords)
    # compute scores and sort
    scores = np.dot(query_embedding,np.array(document["embedding"]).T)
    # return top k genres for each keyword
    matched_genres = []
    for s in scores:
        temp = sorted(zip(document["genre"],s), key = lambda ele: ele[1], reverse = True)[:topk]
        # return scores too
        if not return_scores:
            matched_genres.append([ele[0] for ele in temp])
        else:
            matched_genres.append(temp)
    return matched_genres

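def _get_expression_genre(desired_genres):
    """
    Build the MongoDB filter expression: a book must match at least one genre
    from every keyword's genre list (AND across keywords, OR within a keyword).
    """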
    return {
        "$and":[
            {"$or":[{"Genre":g1} for g1 in g0]} for g0 in desired_genres
        ]
    }
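
The scoring step in search_genre_by_query() can be tried out without MongoDB or a Transformer model. Below is a minimal sketch, assuming toy 2-dimensional embeddings; the genre names, the vectors, and the helper `top_k_genres` are made up for illustration and are not part of the notebook.

```python
import numpy as np

def top_k_genres(keyword_embeddings, genre_embeddings, genres, topk=2):
    """Return the topk highest-scoring genres for each keyword (hypothetical helper)."""
    # one matrix multiplication scores every keyword against every genre
    scores = np.dot(keyword_embeddings, genre_embeddings.T)  # shape: (n_keywords, n_genres)
    # argsort on the negated scores yields indices in descending score order
    order = np.argsort(-scores, axis=1)[:, :topk]
    return [[genres[j] for j in row] for row in order]

# toy data (assumption): 3 genres and 2 keywords embedded in 2 dimensions
genres = ["Fantasy", "War", "Romance"]
genre_embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.6]])
keyword_embeddings = np.array([[0.9, 0.1], [0.2, 0.8]])

matched = top_k_genres(keyword_embeddings, genre_embeddings, genres, topk=2)
print(matched)
```

As in the notebook, the score is a raw dot product, so it behaves like cosine similarity only if the embeddings are normalized.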


### semanticSearch

In [17]:
### load models - 3 models
# path = r"C:\Users\tanch\Documents\NTU\NTU Year 4\Semester 1\CZ4125 - Developing Data Products\Assignments\Team Assignment - book recommendation\code\models\pq_model.pkl"
# with open(path, "rb") as f:
#     pq_model = pickle.load(f)
# summary_embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
genre_embedding_model = SentenceTransformer('whaleloops/phrase-bert')
ranking_model = CrossEncoder('cross-encoder/stsb-TinyBERT-L-4')
keyword_model = Rake()

In [72]:
def semanticSearch(query, book_collection, genre_collection, genre_embedder, ranking_model, keyword_model, stage_one_limit = 200, stage_two_limit = 10):
    # stage one - filter by genre (use the arguments passed in, not the notebook globals)
    stage_one_cursor, desired_genres = _stage_one(query, book_collection, genre_collection, genre_embedder, keyword_model, limit = stage_one_limit)
    # stage two - rerank with Transformer
    stage_two_cursor = _stage_two(stage_one_cursor, query, ranking_model, book_collection, limit = stage_two_limit)
    return stage_two_cursor, desired_genres
    
def _stage_one(query, book_collection, genre_collection, genre_embedder, keyword_model, limit = 200):
    desired_genres = search_genre_by_query(genre_collection, query, genre_embedder, keyword_model, topk = 5)
    filtering_expression = _get_expression_genre(desired_genres)
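    # if no book matches every keyword, relax the criteria:
    # merge all genres into one group so that matching any single genre is enough
    if book_collection.find_one(filtering_expression) is None:
        desired_genres = [[genre for group in desired_genres for genre in group]]
        filtering_expression = _get_expression_genre(desired_genres)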
    results_cursor = book_collection.find(filtering_expression).limit(limit)
    return results_cursor, desired_genres

def _stage_two(mongo_cursor, query, ranking_model, book_collection, limit = 10):
    _ids = []
    # get summaries of each candidate
    crossencoder_input = []
    for doc in mongo_cursor:
        _ids.append(doc["_id"])
        crossencoder_input.append([query, doc["Summary"][0]])
    # rank and sort
    ranking_scores = ranking_model.predict(crossencoder_input)
    topk_book_ids = sorted(zip(_ids, ranking_scores), reverse = True, key = lambda ele:ele[1])[:limit]
    topk_book_ids = [_id for _id, score in topk_book_ids]
    # return cursor of top k books
    return book_collection.find({"_id":{"$in":topk_book_ids}})
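
One caveat with the final lookup in _stage_two(): a MongoDB `$in` query does not return documents in the order of the supplied list, so the returned cursor is not guaranteed to follow the CrossEncoder ranking. Below is a minimal sketch of one way to restore that order, assuming the fetched documents fit in memory; `reorder_by_rank` and the toy documents are hypothetical.

```python
def reorder_by_rank(docs, ranked_ids):
    """Return docs arranged to follow ranked_ids; ids without a matching doc are skipped."""
    by_id = {doc["_id"]: doc for doc in docs}
    return [by_id[_id] for _id in ranked_ids if _id in by_id]

# toy stand-in (assumption) for list(book_collection.find({"_id": {"$in": ranked_ids}}))
fetched = [{"_id": "a"}, {"_id": "b"}, {"_id": "c"}]
ranked_ids = ["c", "a", "b"]  # best match first, as produced by the reranker

ordered = reorder_by_rank(fetched, ranked_ids)
print([doc["_id"] for doc in ordered])
```

Returning a list instead of a cursor keeps the downstream `for doc in ...` loop working unchanged.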

In [73]:
query = "fairy and invasion"

In [74]:
stage_two_cursor, desired_genres = semanticSearch(
    query = query, 
    book_collection = book_collection, 
    genre_collection = full_genre_collection, 
    genre_embedder = genre_embedding_model, 
    ranking_model = ranking_model, 
    keyword_model = keyword_model,
    stage_one_limit = 200, 
    stage_two_limit = 10, 
)

In [75]:
desired_genres

[['War',
  'Combat',
  'Civil War',
  'Military',
  'Terrorism',
  'Fairies',
  'Fae',
  'Fairy Tales',
  'Elves',
  'Witches']]
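
The output above is a single group of ten genres rather than one group of five per keyword, which suggests that _stage_one() fell back to its relaxed filter for this query. The sketch below illustrates the difference between the two filters, assuming a pure-Python stand-in for MongoDB's matching; `build_filter`, `matches`, and the toy book are hypothetical, and `build_filter` mirrors _get_expression_genre() from earlier.

```python
def build_filter(desired_genres):
    """AND across keyword groups, OR within each group (mirrors _get_expression_genre)."""
    return {"$and": [{"$or": [{"Genre": g} for g in group]} for group in desired_genres]}

def matches(book, expression):
    """Pure-Python stand-in for how the expression is evaluated against one book."""
    return all(
        any(clause["Genre"] in book["Genre"] for clause in group["$or"])
        for group in expression["$and"]
    )

book = {"Genre": ["History", "War", "Military"]}  # toy document (assumption)

# strict: one group per keyword, so the book needs a genre from both groups
strict = build_filter([["War", "Military"], ["Fairies", "Elves"]])
# relaxed: all genres merged into one group, so any single genre is enough
relaxed = build_filter([["War", "Military", "Fairies", "Elves"]])

print(matches(book, strict))
print(matches(book, relaxed))
```

Under the relaxed filter a book about war alone passes stage one, so books covering only one of the two query topics can reach the reranker.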

In [76]:
for doc in stage_two_cursor:
    print(doc["Summary"])
    print(doc["Genre"])

['Here, for the first time in paperback, is an outstanding military history that offers a dramatic new perspective on the Allied campaign that began with the invasion of the D-Day beaches of Normandy. Nationa advertising in Military History.']
['History', 'War', 'World War II', 'Nonfiction', 'Military', 'Military History', 'War', 'Military Fiction', 'War', 'History', 'World History', 'European Literature', 'British Literature', 'History', 'European History', 'Earth', 'The World']
["Part Dorothy Parker, part José Saramago, with shades of George Orwell, Sheila Heti has arrived on Canada's literary scene a fully formed artist. Balancing wisdom and innocence, joy and foreboding, each story in The Middle Stories leads us to surprising places. A frog doles out sage advice to a plumber infatuated with a princess, a boy falls hopelessly in love with a monkey, and a man with a hat keeps apocalyptic thoughts at bay by resolving to follow a plan that he admits he won't stick to. Globe and Mail cr

### personalizedSearch

In [2]:
from models import ALSRecommender
# from evaluation import evaluate
# from processing import *
# from utils import IDMAP
# import pandas as pd


In [4]:
# load model
directory = r"C:\Users\tanch\Documents\NTU\NTU Year 4\Semester 1\CZ4125 - Developing Data Products\Assignments\Team Assignment - book recommendation\code\models"
model = ALSRecommender(model_name = "AlternatingLeastSquares", 
                       config_dict = {}, 
                       train_csr = None, 
                       user_map = None, 
                       item_map = None)
model.load(directory)
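
According to the introduction, personalizedSearch() was created in an earlier notebook, so its body is not shown here. As a rough illustration of what a matrix-factorization recommender such as AlternatingLeastSquares does at recommendation time, here is a minimal sketch, assuming small made-up factor matrices; `recommend_top_n` and all values are hypothetical and do not reflect the ALSRecommender API. Excluding already-seen items is also an assumption.

```python
import numpy as np

def recommend_top_n(user_factors, item_factors, user_index, seen_items, n=2):
    """Score every item for one user by dot product and return the n best unseen item indices."""
    scores = (item_factors @ user_factors[user_index]).astype(float)  # shape: (n_items,)
    seen = np.array(sorted(seen_items), dtype=int)
    scores[seen] = -np.inf  # exclude items the user has already interacted with
    return np.argsort(-scores)[:n].tolist()

# toy latent factors (assumption): 2 users, 4 items, 2 latent dimensions
user_factors = np.array([[1.0, 0.0], [0.0, 1.0]])
item_factors = np.array([[0.9, 0.1], [0.8, 0.3], [0.1, 0.9], [0.5, 0.5]])

recs = recommend_top_n(user_factors, item_factors, user_index=0, seen_items={0}, n=2)
print(recs)
```

The returned values are matrix row indices; in a real pipeline they would still need to be mapped back to book identifiers before querying book_collection.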

In [8]:
personalizedSearch(book_collection, model, 210959, 20)

[{'_id': '0140293248',
  'ISBN': '0140293248',
  'URL': 'https://www.goodreads.com/book/show/227679.The_Girls_Guide_to_Hunting_and_Fishing',
  'Review': ['I couldn\'t believe the disconnect between the reviews of this book and its content.  actually compares it to Bridget Jones, because, you know, all books written by women with a female protagonist in her 20s are the same. I thought this read like serious literary fiction. If a guy had written this book, he\'d be called the next Salinger. If an older woman had written this book her name would be Abigail Thomas and it would be a memoir titled . To be fair, I picked this up at a thrift store while on vacation and given the chick-litty appearance of it I had low expectations. By the first few pages, I realized this was not going to be some silly dating stories. This was serious. Good serious.The only reason I\'m not giving it 5 stars is because of the bizarre placement of the story titled "The Best Possible Light." I got all excited to j