# ABOUT:
- this notebook tries out the pipeline: query -> filter by genre and quantized vector -> candidate books
- background:
    - in previous notebooks, I experimented with many ways to semantically narrow down the candidate space given a query
    - this notebook puts the most promising attempts together
- approach:
    - given a query
        - semantically guess the user's desired genres
        - semantically match the user's query to book summaries
- steps:
    - given a query
        1. guess desired genres:
            - convert the query to an embedding with a Phrase-BERT model, which handles phrases that are lexically dissimilar but close in meaning
            - retrieve the genre embedding matrix from mongo
            - multiply the matrix by the query embedding to score every genre at once
            - keep the top k genres with the highest similarity scores
        2. match user query to book summary:
            - convert the query to an embedding with a sentence transformer, which produces quality sentence embeddings
            - convert the embedding to a quantized vector (one id per subspace)
    - retrieve documents that:
        - have at least 1 matching genre
        - and have at least one matching quantized id
        
- motivation and advantages:
    - filtering by genre produces results that tend to stay on topic, but may return a large number of results
    - filtering semantically by summary can produce a smaller set of results, but it tends to be noisy, i.e. undesired results may also appear
    - using both methods together yields a smaller set of results that also stays within the desired topics
    - both methods are designed to be efficient, at low memory cost:
        - a single matrix multiplication scores all genres against the query
        - product quantization reduces the embedding size and removes the need to loop through the mongo database computing dot products
- outcome:
    - filtering only by genre produces too many results, and many have summaries that are semantically dissimilar to the user's query
    - filtering only by product quantization produces a small set of results, but some appear irrelevant
    - filtering by both genre and product quantization produces the fewest results, yet the result summaries are semantically similar to the user's query
    - this shows, to an extent, that **our pipeline is effective in yielding a small set of results that are also semantically similar**
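
The two query-side steps above can be sketched with NumPy on toy data. This is a minimal illustration, not the models used in this notebook: `top_k_genres`, `pq_encode`, the genre names, the embeddings and the codebooks are all made up for the example, and nearest-centroid assignment by Euclidean distance is assumed as the product quantization encoding rule.

```python
import numpy as np


def top_k_genres(query_embedding, genre_matrix, genre_names, k=3):
    """Score every genre with one matrix-vector product and keep the k best."""
    scores = genre_matrix @ query_embedding      # shape: (n_genres,)
    best = np.argsort(scores)[::-1][:k]          # indices of the k largest scores
    return [genre_names[i] for i in best]


def pq_encode(vector, codebooks):
    """Product-quantize one vector.

    codebooks has shape (n_subspaces, n_centroids, sub_dim).
    Returns one nearest-centroid id per subspace.
    """
    n_subspaces, _, sub_dim = codebooks.shape
    chunks = vector.reshape(n_subspaces, sub_dim)
    ids = []
    for chunk, centroids in zip(chunks, codebooks):
        distances = np.linalg.norm(centroids - chunk, axis=1)
        ids.append(int(np.argmin(distances)))
    return ids


# toy genre embeddings (hypothetical values, 2 dimensions for readability)
genre_names = ["Christianity", "Cooking", "Theology"]
genre_matrix = np.array([[1.0, 0.0], [0.0, 1.0], [0.8, 0.2]])
query_embedding = np.array([0.9, 0.1])
top_genres = top_k_genres(query_embedding, genre_matrix, genre_names, k=2)
print(top_genres)  # ['Christianity', 'Theology']

# toy codebooks: 2 subspaces, 2 centroids each, 2 dimensions per subspace
codebooks = np.array([
    [[0.0, 0.0], [1.0, 1.0]],   # centroids for subspace 0
    [[0.0, 0.0], [5.0, 5.0]],   # centroids for subspace 1
])
quantized = pq_encode(np.array([0.9, 1.1, 0.2, -0.1]), codebooks)
print(quantized)  # [1, 0]
```

The quantized vector is just a short list of integers, which is why it can be stored per book and matched with plain equality filters instead of dot products.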

In [1]:
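import os

from pymongo import MongoClient
import certifi
ca = certifi.where()
# read the connection string from the environment instead of hardcoding credentials
# (MONGODB_URI is an assumed variable name; set it to your own mongodb+srv:// URI)
client = MongoClient(os.environ["MONGODB_URI"], tlsCAFile=ca)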
db = client["DP"] 
book_collection = db["books"]
full_genre_collection = db["full_genre"]

In [31]:
from sentence_transformers import SentenceTransformer
import numpy as np
import nanopq
import pickle


def search_genre_by_query(collection, query, embedding_model, topk=10, return_scores=False):
    """FAST - given a query, return the top k genres (optionally with their scores)."""
    # get document containing all genres
    document = collection.find_one({})
    # embed query
    query_embedding = embedding_model.encode(query)
    # compute scores with one matrix-vector product and sort in descending order
    scores = np.dot(np.array(document["embedding"]), query_embedding)
    scores = sorted(zip(document["genre"], scores), key=lambda ele: ele[1], reverse=True)
    # drop the scores unless the caller asked for them
    if not return_scores:
        scores = [ele[0] for ele in scores]
    return scores[:topk]


def quantize_query(query, product_quantization_model, embedding_model):
    """Given a query, return its quantized vector as a list of ids, one per subspace."""
    query_embedding = embedding_model.encode(query)
    product_quantization_model.verbose = False
    # the model encodes a batch, so add a batch dimension and take the first row
    quantized_query = product_quantization_model.encode(np.expand_dims(query_embedding, 0))[0].tolist()
    return quantized_query

# """
# query mongo by an expression
# """
# def query_by_expression(collection, filter_dict):
#     return collection.find(filter_dict)


def merge_expressions(*expressions):
    """Merge multiple expressions: a result must match at least one clause from each expression."""
    return {
        "$and": [
            {"$or": expression} for expression in expressions
        ]
    }


def _get_expression_quantized_vector(quantized_vector):
    """Build the clauses used to filter mongo by quantized id, one clause per subspace."""
    return [{f"quantized_vector.subspace{i}": ID} for i, ID in enumerate(quantized_vector)]


def _get_expression_genre(genres):
    """Build the clauses used to filter mongo by genre, one clause per genre."""
    return [{"Genre": genre} for genre in genres]

In [26]:
results_cursor = book_collection.find({})

In [58]:
next(results_cursor)

{'_id': '0375406328',
 'ISBN': '0375406328',
 'URL': 'https://www.goodreads.com/book/show/899429.Lying_Awake',
 'Review': ["Sister John is a Carmelite nun who, after years of dedicated service, begins to experience visions of the divine in ecstatic, crystalline clarity. But these visions are accompanied by terrible headaches, and Sister John is encouraged to seek medical attention. When told by a neurologist that her experiences are likely the result of (curable) epileptic seizures, Sister John is forced to rethink everything she knows about spiritual life.On the surface, this seems like a(nother) religion vs. science narrative. But, this isn't simply a matter of conflicting paradigms. That's the easy part. It's the human element, the choices we have to make on our own that are difficult.Sister John is genuinely good and genuinely selfless. She's so selfless and good, I very nearly abandoned the book! I couldn't relate to her. I felt Mark Salzman was keeping too much of a respectful di

### query
- results must contain at least one matching genre and at least one matching quantized id
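
To make the "at least one from each group" rule concrete, here is a small pure-Python sketch that evaluates the same `$and`-of-`$or` filter shape against in-memory dictionaries. It is only an illustration of the matching semantics, not how mongo executes the query: `matches_clause`, `matches_filter` and the toy books are hypothetical, and the document layout (a `Genre` array and a nested `quantized_vector` document) is an assumption inferred from the field paths used in the filter.

```python
def matches_clause(doc, clause):
    """True if a single {dotted.path: value} equality clause matches doc.

    A list-valued field matches when any element equals the wanted value,
    mirroring how an equality filter behaves on an array field.
    """
    path, wanted = next(iter(clause.items()))
    value = doc
    for key in path.split("."):
        if not isinstance(value, dict) or key not in value:
            return False
        value = value[key]
    if isinstance(value, list):
        return wanted in value
    return value == wanted


def matches_filter(doc, filter_dict):
    """Every $or group must contribute at least one matching clause."""
    return all(
        any(matches_clause(doc, clause) for clause in group["$or"])
        for group in filter_dict["$and"]
    )


# toy documents (hypothetical values)
books = [
    {"_id": "1", "Genre": ["Christian Fiction", "Novels"],
     "quantized_vector": {"subspace0": 254, "subspace1": 3}},
    {"_id": "2", "Genre": ["Cooking"],
     "quantized_vector": {"subspace0": 254, "subspace1": 165}},
    {"_id": "3", "Genre": ["Theology"],
     "quantized_vector": {"subspace0": 7, "subspace1": 9}},
]

toy_filter = {
    "$and": [
        {"$or": [{"quantized_vector.subspace0": 254},
                 {"quantized_vector.subspace1": 165}]},
        {"$or": [{"Genre": "Christian Fiction"}, {"Genre": "Theology"}]},
    ]
}

matched = [doc["_id"] for doc in books if matches_filter(doc, toy_filter)]
print(matched)  # ['1']
```

Book 2 shares a quantized id but no genre, and book 3 shares a genre but no quantized id, so only book 1 passes both groups.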

In [4]:
### load models - 3 models
path = r"C:\Users\tanch\Documents\NTU\NTU Year 4\Semester 1\CZ4125 - Developing Data Products\Assignments\Team Assignment - book recommendation\code\models\pq_model.pkl"
with open(path, "rb") as f:
    pq_model = pickle.load(f)
genre_embedding_model = SentenceTransformer('whaleloops/phrase-bert')
summary_embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

In [5]:
query = "I am looking for a novel about christianity"
desired_genres = search_genre_by_query(full_genre_collection, query, genre_embedding_model)
quantized_vector = quantize_query(query, pq_model, summary_embedding_model)

In [68]:
double_filter = merge_expressions(_get_expression_quantized_vector(quantized_vector),_get_expression_genre(desired_genres))
genre_only = merge_expressions(_get_expression_genre(desired_genres))
pq_only = merge_expressions(_get_expression_quantized_vector(quantized_vector))
double_filter

{'$and': [{'$or': [{'quantized_vector.subspace0': 254},
    {'quantized_vector.subspace1': 165},
    {'quantized_vector.subspace2': 77},
    {'quantized_vector.subspace3': 31}]},
  {'$or': [{'Genre': 'Christian Fiction'},
    {'Genre': 'Christian Non Fiction'},
    {'Genre': 'Christian Romance'},
    {'Genre': 'Theology'},
    {'Genre': 'Christianity'},
    {'Genre': 'Christian Living'},
    {'Genre': 'Church History'},
    {'Genre': 'Novels'},
    {'Genre': 'Literary Fiction'},
    {'Genre': 'Nonfiction'}]}]}

## results
- insights:
    - filtering only by genre produces too many results (488), and many have summaries that are semantically dissimilar to the user's query
    - filtering only by product quantization produces a small set of results (36), but some appear irrelevant
    - filtering by both genre and product quantization produces the fewest results (23), yet the result summaries are semantically similar to the user's query
    - this shows, to an extent, that **our pipeline is effective in yielding a small set of results that are also semantically similar**
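
Because the combined filter is a logical AND of the two individual filters, its result set should be the intersection of the genre-only and PQ-only result sets. A minimal sketch of that comparison, keyed on `_id`, is shown here; `compare_result_sets` and the toy documents are hypothetical stand-ins for the real query results.

```python
def compare_result_sets(genre_results, pq_results):
    """Group document ids into both / genre-only / pq-only sets."""
    genre_ids = {doc["_id"] for doc in genre_results}
    pq_ids = {doc["_id"] for doc in pq_results}
    return {
        "both": genre_ids & pq_ids,
        "genre_only": genre_ids - pq_ids,
        "pq_only": pq_ids - genre_ids,
    }


# toy stand-ins for the lists of documents returned by the two single filters
toy_genre_results = [{"_id": "a"}, {"_id": "b"}, {"_id": "c"}]
toy_pq_results = [{"_id": "b"}, {"_id": "d"}]

groups = compare_result_sets(toy_genre_results, toy_pq_results)
print({name: sorted(ids) for name, ids in groups.items()})
# {'both': ['b'], 'genre_only': ['a', 'c'], 'pq_only': ['d']}
```

Comparing sets of ids also avoids repeated `doc in list` scans when picking example documents from each group.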

In [82]:
# materialize the cursor into a list so the results can be counted and reused
genre_plus_pq_results = list(book_collection.find(double_filter))
len(genre_plus_pq_results)

23

In [83]:
# materialize the cursor into a list so the results can be counted and reused
genre_only_results = list(book_collection.find(genre_only))
len(genre_only_results)

488

In [84]:
# materialize the cursor into a list so the results can be counted and reused
pq_only_results = list(book_collection.find(pq_only))
len(pq_only_results)

36

### example: document containing at least one matching genre AND at least one matching quantized id
- result summary is semantically similar to the user's query

In [96]:
import json
for doc in pq_only_results:
    if doc in genre_plus_pq_results:
        print(json.dumps(doc, indent=4))
        break

{
    "_id": "0060973129",
    "ISBN": "0060973129",
    "URL": "https://www.goodreads.com/book/show/3767482-decision-in-normandy",
    "Review": [
        "A great flip side to Ambrose's D Day book. This is HOW ccampaigns are PLANNED and then EXECUTED. But reading a great deal of background material will help enormausly",
        "",
        "This was a mixed bag for me. The positives are that D'Este is comprehensive in his research, provides a ton of sources to corroborate ideas, and is impartial throughout. These are all important things for providing worthwhile historical accounts of real events. The negatives are that D'Este focuses way too much on strategy rather than actual actions and what the events were like, he gets bogged down in providing a dozen corroborating sources when only three or four are needed, and he manages to strip away all of the emotion and drama of the events. It was informative but I prefer accounts that draw me in.",
        "This is not a history of the N

### example: at least one matching quantized id, but no matching genres
- appears to be an irrelevant result

In [97]:
for doc in pq_only_results:
    if doc not in genre_plus_pq_results:
        print(json.dumps(doc, indent=4))
        break

{
    "_id": "3150000335",
    "ISBN": "3150000335",
    "URL": "https://www.goodreads.com/book/show/41561009-kabale-und-liebe",
    "Review": [
        "",
        "Muscular language, noble ideals, here we have a tale of virtue beleaguered, ground down by the tyranny and despotic selfishness of an absolutist ruler and his corrupt court. Luise Miller is, however, not the victim of aristocratic sexual rapaciousness, her noble lover has every intention of marrying her: but such a marriage crosses social frontiers, they are both the object of ghastly intrigue that first separates and then kills them. Magnificent.Schiller wrote this in 1783, just one year after fleeing the repression of Duke Karl Eugen. Intrigue was a fact of life: Count Friedrich Samuel Monmartin had gained his position as leading minister at the W\u00fcrttemburg court by forging letters, and thus destroying his rival, von Rieger.",
        "the reason this isn\u2019t five stars is because i got a b+ on my essay on it (fu

### example: at least one matching genre, but no matching quantized id
- the result has a matching genre, "Nonfiction", which makes sense, but its summary is semantically dissimilar to the user's query

In [100]:
for doc in genre_only_results:
    if doc not in genre_plus_pq_results:
        print(json.dumps(doc, indent=4))
        break

{
    "_id": "0439095026",
    "ISBN": "0439095026",
    "URL": "https://www.goodreads.com/book/show/2587531-tell-me-this-isn-t-happening",
    "Review": [
        "I loved this book. I read it simply to have a few laughs but ended up taking away some cool information about how to deal with embarrassment. Along with outrageously funny stories by kids and teens, this book gives tips on how to find the humor in embarrassing situations. I enjoyed nearly every story in this book. There were so many stories I could relate to. There were also many which gave me a round of laughs. Reading some of these stories made me realize my embarrassing moments aren't so bad. This truly was a magnificent book. It was very unique and enjoyable.",
        "",
        "This book Tell Me This Isn't Happening is a book that got into me. This book tell us the most sillyest to the most sad storys. These books are embarrsing to people. So if you like to learn and hear about these funny,embarrsing storys, you hav