# ABOUT:
- this notebook puts together semanticSearch(): filter by genre and quantized ids, then rerank with a CrossEncoder
- background:
    - in previous notebooks, I found that filtering books by genre and quantized ids is efficient and produces a small set of relevant results
    - here, I rerank those results with a CrossEncoder to obtain a more accurate ordering
- approach:
    - given a query
        - semantically guess the user's desired genres
        - compute the product quantization of the query, embedded with the summary model
        - retrieve candidate results
    - rerank using the CrossEncoder
        - loop through the Mongo cursor
        - for each document, compute the similarity between the query and the book summary
        - use a PriorityQueue (or a full sort, as the code in this notebook does) to keep the top-k most similar results
- motivation and advantages:
    - as found before, filtering by genre and quantized ids efficiently produces a small set of relevant results
    - the CrossEncoder is slower but more accurate
    - as a result, this two-stage semantic retrieval is fast, yet accurate
- outcome:
    - the number of matching genres should be taken into consideration when ranking
    - quantization ids might be very noisy
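
The top-k step in the approach above can be sketched with a bounded min-heap. This is only an illustration: `top_k` and `toy_score` are hypothetical names, and the word-overlap scorer is a stand-in for the CrossEncoder so the snippet runs without any model or database.

```python
import heapq

def top_k(pairs, k):
    """Keep the k highest-scoring (score, item) pairs from a stream."""
    heap = []  # min-heap: the weakest of the current top k sits at heap[0]
    for index, (score, item) in enumerate(pairs):
        entry = (score, index, item)  # index breaks ties, so items are never compared
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif score > heap[0][0]:
            heapq.heapreplace(heap, entry)
    return [(score, item) for score, _, item in sorted(heap, reverse=True)]

def toy_score(query, text):
    """Stand-in scorer (assumption): fraction of query words found in the text."""
    query_words = set(query.lower().split())
    return len(query_words & set(text.lower().split())) / len(query_words)

# hypothetical summaries, for illustration only
summaries = {
    "book_a": "a novel about christianity and faith",
    "book_b": "a cookbook about pasta",
    "book_c": "a history of christianity",
}
query = "novel about christianity"
ranked = top_k(((toy_score(query, text), title) for title, text in summaries.items()), k=2)
```

The heap never holds more than k entries, so memory stays constant however many candidates stage one returns.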

In [1]:
import os
from pymongo import MongoClient
import certifi
ca = certifi.where()
# read the connection string from the environment instead of hard-coding credentials;
# MONGODB_URI is an assumed variable name holding the mongodb+srv://... URI
client = MongoClient(os.environ["MONGODB_URI"], tlsCAFile=ca)
db = client["DP"] 
book_collection = db["books"]
full_genre_collection = db["full_genre"]

In [2]:
from sentence_transformers import SentenceTransformer
from sentence_transformers.cross_encoder import CrossEncoder
import numpy as np
import nanopq
import pickle


def search_genre_by_query(collection, query, embedding_model, topk = 10, return_scores = False):
    """
    FAST - Given a query, return the top k genres
    """
    # get the single document containing all genres and their embeddings
    document = collection.find_one({})
    # embed query
    query_embedding = embedding_model.encode(query)
    # compute dot-product scores and sort, highest first
    scores = np.dot(np.array(document["embedding"]),query_embedding)
    scores = sorted(zip(document["genre"],scores), key = lambda ele: ele[1], reverse = True)
    # return topk genres (with scores only if requested)
    if not return_scores:
        scores = [ele[0] for ele in scores]
    return scores[:topk]


def quantize_query(query, product_quantization_model, embedding_model):
    """
    Given a query, return its quantized vector (one id per subspace)
    """
    query_embedding = embedding_model.encode(query)
    product_quantization_model.verbose = False
    quantized_query = product_quantization_model.encode(np.expand_dims(query_embedding,0))[0].tolist()
    return quantized_query

# """
# query mongo by an expression
# """
# def query_by_expression(collection, filter_dict):
#     return collection.find(filter_dict)

def merge_expressions(*expressions):
    """
    merge multiple expressions - results must match at least one clause from each expression
    """
    return {
        "$and":[
            {"$or":expression} for expression in expressions
        ]
    }

def _get_expression_quantized_vector(quantized_vector):
    """
    expression used to filter mongo - one clause per subspace id of the quantized vector
    """
    return [{f"quantized_vector.subspace{i}":ID} for i,ID in enumerate(quantized_vector)]

def _get_expression_genre(genres):
    """
    expression used to filter mongo - one clause per genre
    """
    return [{"Genre":genre} for genre in genres]



### query !
- results contain at least one matching genre and at least one matching quantized id
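
To make that condition concrete, here is a small sketch of the filter shape that gets sent to Mongo. The three helpers are local copies that mirror the notebook's expression builders under shorter, hypothetical names, and the subspace ids and genres are made-up values.

```python
def genre_clauses(genres):
    # one equality clause per desired genre
    return [{"Genre": genre} for genre in genres]

def quantized_clauses(quantized_vector):
    # one equality clause per subspace id of the quantized query
    return [{f"quantized_vector.subspace{i}": code} for i, code in enumerate(quantized_vector)]

def merge(*clause_lists):
    # AND across the lists, OR within each list
    return {"$and": [{"$or": clauses} for clauses in clause_lists]}

# hypothetical values, for illustration only
example_filter = merge(quantized_clauses([17, 203]), genre_clauses(["Theology", "Novels"]))
print(example_filter)
```

A book passes this filter if it shares at least one subspace id with the query and carries at least one of the desired genres.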

In [4]:
### load models - 3 models
path = r"C:\Users\tanch\Documents\NTU\NTU Year 4\Semester 1\CZ4125 - Developing Data Products\Assignments\Team Assignment - book recommendation\code\models\pq_model.pkl"
with open(path, "rb") as f:
    pq_model = pickle.load(f)
genre_embedding_model = SentenceTransformer('whaleloops/phrase-bert')
summary_embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
ranking_model = CrossEncoder('cross-encoder/stsb-TinyBERT-L-4')



In [157]:
from itertools import islice

def semanticSearch(query, book_collection, genre_collection, genre_embedder, summary_embedder, pq_model, ranking_model, stage_one_limit = 200, stage_two_limit = 10, stage_two_off = False, quantization_on = True):
    # stage one - filter by genre and quantization id
    # (pass this function's own arguments, not the notebook-level globals)
    mongo_cursor, desired_genres = _stage_one(query, book_collection, genre_collection, genre_embedder, summary_embedder, pq_model, limit = stage_one_limit, quantization_on = quantization_on)
    print(desired_genres)
    # stage two - rerank with Transformer
    if stage_two_off:
        # take the first results in cursor order; islice stops early if fewer are available
        output = list(islice(mongo_cursor, stage_two_limit))
    else:
        output = _stage_two(mongo_cursor, query, ranking_model, desired_genres = desired_genres, limit = stage_two_limit)
    return output
    
    
def _stage_one(query, book_collection, genre_collection, genre_embedder, summary_embedder, pq_model, limit = 200, quantization_on = True):
    desired_genres = search_genre_by_query(genre_collection, query, genre_embedder)
    if quantization_on:
        quantized_vector = quantize_query(query, pq_model, summary_embedder)
        filtering_expression = merge_expressions(_get_expression_quantized_vector(quantized_vector),_get_expression_genre(desired_genres))
    else:
        filtering_expression = merge_expressions(_get_expression_genre(desired_genres))
    results_cursor = book_collection.find(filtering_expression).limit(limit)
    return results_cursor, desired_genres

def _stage_two(mongo_cursor, query, ranking_model, desired_genres, limit = 10):
    # read cursor
    results = []
    num_matching_genres = []
    crossencoder_input = []
    desired_genres = set(desired_genres)
    for doc in mongo_cursor:
        results.append(doc)
        num_matching_genres.append(compute_num_matching_genre(doc, desired_genres))
        # pair the query with the book's genres plus its first summary entry
        crossencoder_input.append([query, ". ".join(doc["Genre"]+[doc["Summary"][0]])])
    print(len(results))
    # nothing to rank if stage one returned no candidates
    if not results:
        return []
    # rank and sort: CrossEncoder score plus the number of matching genres
    ranking_scores = ranking_model.predict(crossencoder_input)
    ranking_scores = ranking_scores + np.array(num_matching_genres)
    # return (document, score) pairs, best first
    results = sorted(zip(results, ranking_scores), reverse = True, key = lambda ele:ele[1])[:limit]
    return results

In [153]:
def compute_num_matching_genre(doc, desired_genres:set):
    return len(set(doc["Genre"]).intersection(desired_genres))

In [154]:
query = "I am looking for a novel about christianity"


In [158]:
results = semanticSearch(
    query = query, 
    book_collection = book_collection, 
    genre_collection = full_genre_collection, 
    genre_embedder = genre_embedding_model, 
    summary_embedder = summary_embedding_model, 
    pq_model = pq_model, 
    ranking_model = ranking_model, 
    stage_one_limit = 200, 
    stage_two_limit = 10, 
    stage_two_off = False,
    quantization_on = False
)


['Christian Fiction', 'Christian Non Fiction', 'Christian Romance', 'Theology', 'Christianity', 'Christian Living', 'Church History', 'Novels', 'Literary Fiction', 'Nonfiction']
200


In [159]:
results

[({'_id': '1561011525',
   'ISBN': '1561011525',
   'URL': 'https://www.goodreads.com/book/show/877969.Practice_of_Prayer',
   'Review': ["This volume is the best one so far in the New Church's Teaching series. Guenther's wisdom really shines through; the reader easily understands that the author speaks from a position of lived experience. She isn't afraid to talk about the struggles she has experienced in her lifetime of prayer practice. I'll likely return to this one again, and it will be in my mind as I make tweaks to my own practice.",
    '',
    '',
    "Worth it! Not only a great and heartfelt overview of prayer, but full of additional resources to build one's prayer life.",
    'In the fourth volume of the NCT series, Gunther focuses on prayer as a very deep, yet pervasive topic. She broadens the subject of prayer to art, poetry, reflection, any action that lends itself to communicating with God. Yet, she shows it is of utmost importance in the Christian life. Sprinkling the te

In [136]:
temp_genre = set(['Christian Fiction', 'Christian Non Fiction', 'Christian Romance', 'Theology', 'Christianity', 'Christian Living', 'Church History', 'Novels', 'Literary Fiction', 'Nonfiction'])
# results holds (document, score) pairs
for doc, _score in results:
    print(set(doc['Genre']).intersection(temp_genre))

{'Nonfiction', 'Christianity', 'Theology'}
{'Nonfiction'}
{'Novels', 'Literary Fiction'}
{'Nonfiction'}
{'Novels', 'Literary Fiction'}
{'Novels', 'Literary Fiction'}
{'Novels', 'Literary Fiction'}
{'Novels'}
{'Novels'}
{'Novels'}
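
One way to take the number of matching genres into account without letting it swamp the CrossEncoder score is to normalise the overlap and weight it. The sketch below is an assumption, not what the notebook does: `combined_score` and `weight` are hypothetical, and it assumes the model score lies roughly in [0, 1], as STS-B style cross-encoders typically produce.

```python
def combined_score(model_score, num_matching, num_desired, weight=0.5):
    """Add a bounded genre-overlap bonus to the model score.

    weight is a hypothetical tuning knob; the bonus is at most `weight`.
    """
    overlap = num_matching / num_desired if num_desired else 0.0
    return model_score + weight * overlap

# hypothetical scores: a strong textual match with 1 of 10 genres,
# and a weak textual match with 3 of 10 genres
strong_text = combined_score(0.8, 1, 10)
weak_text = combined_score(0.4, 3, 10)

# adding raw counts instead lets the genre count dominate the ordering
raw_strong_text = 0.8 + 1
raw_weak_text = 0.4 + 3
```

With the weighted form the strong textual match stays ahead, whereas adding raw counts puts the weak match first because it has more matching genres.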


In [115]:
results

[({'_id': '0316973742',
   'ISBN': '0316973742',
   'URL': 'https://www.goodreads.com/book/show/104284.The_Gospel_of_Judas',
   'Review': ["Started this book a couple years ago and now am starting to remember why I never finished it. I'm sorry, I just can't bring myself to finish it. A priest has an affair with what appears to be a flibbertygibbit, he gets an interesting scroll in the course of his research, and then as a fallen priest he takes up with a woman who doesn't speak English but models for porn and paints on the side. In both attempts to read this I have yet to get to anything to do with the Gospel of Judas in the manner the book cover suggests, unless it is some oblique symbol of all the betrayals going on in this priest's life. And I haven't the time nor inclination to plough through this to get to it so that the light dawns on me and I understand what all this initial tedium was about. If I want to read about priests having affairs with social twits then I'd go for some b

In [93]:
from keybert import KeyBERT
from rake_nltk import Rake
bert_model = KeyBERT()
rake_model = Rake()

In [96]:
keywords = bert_model.extract_keywords(
    docs = query,
    keyphrase_ngram_range = (1, 1),
    # top_n = 3,
    # use_mmr = True, diversity = 0.7,
)

In [97]:
keywords

[('christianity', 0.5749), ('novel', 0.4324), ('looking', 0.0465)]

In [86]:
summary = ". ".join(doc["Summary"])
reviews = ". ".join(doc["Review"])

In [92]:
rake_model.extract_keywords_from_text(query)
rake_model.get_ranked_phrases_with_scores()

[(1.0, 'novel'), (1.0, 'looking'), (1.0, 'christianity')]

In [91]:
dir(rake_model)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_build_frequency_dist',
 '_build_ranklist',
 '_build_word_co_occurance_graph',
 '_generate_phrases',
 '_get_phrase_list_from_words',
 '_tokenize_sentence_to_words',
 '_tokenize_text_to_sentences',
 'degree',
 'extract_keywords_from_sentences',
 'extract_keywords_from_text',
 'frequency_dist',
 'get_ranked_phrases',
 'get_ranked_phrases_with_scores',
 'get_word_degrees',
 'get_word_frequency_distribution',
 'include_repeated_phrases',
 'max_length',
 'metric',
 'min_length',
 'punctuations',
 'rank_list',
 'ranked_phrases',
 'sentence_tokenizer',
 'stopwords',
 'to_ignore',
 'word_tokenizer']

In [87]:
rake_model.extract_keywords_from_text(summary)
rake_model.get_ranked_phrases()

['woman whose nomination faces two serious obstacles',
 'newly elected president faces',
 'volatile abortion case',
 'unprecedented novelistic depiction',
 'richard north patterson',
 'patterson gives us',
 'new chief justice',
 'judicial system works',
 'important popular novelists',
 'equally momentous collision',
 'held personal secret',
 'compelling new novel',
 'work today ..',
 'senate majority leader',
 'nationally respected court',
 'majority leader',
 'personal ..',
 'definitive novel',
 'worse ..',
 'unexpected chance',
 'trial pitting',
 'supreme court',
 'stories intertwine',
 'resounding clash',
 'place among',
 'old girl',
 'nomination',
 'major departure',
 'life parents',
 'legal process',
 'judge rather',
 'first choice',
 'competing ambitions',
 'appeals judge',
 '21st century',
 'president',
 'president',
 'court',
 'year',
 'triumph',
 'thwart',
 'suspense',
 'science',
 'revelation',
 'reasons',
 'prospect',
 'pro',
 'politics',
 'politics',
 'political',
 'perspec

In [88]:
rake_model.extract_keywords_from_text(reviews)
rake_model.get_ranked_phrases()

['newly appointed president kerry kilcannon',
 'ld mary anne tierney ’',
 'stack grows ever skyward',
 'almost two years ago',
 'particularly senator chad palmer',
 'novel reflects considerable research',
 'two politically astute people',
 'cerebral cortex formation may',
 'new chief justice confirmed',
 'american political system works',
 'recommend patterson deliberately chose',
 'caroline masters plus senators',
 'defend discusses two issues',
 'plot always moves forward',
 'mary ann tierney',
 'pits mary anne',
 'women must play',
 'undermine another person',
 'saying something stupid',
 'requires parental consent',
 'recently elected president',
 'nominate caroline masters',
 'negative cultural response',
 'nationally televised trial',
 'always understanding exactly',
 'book moves forward',
 'washington political scene',
 'supreme court nominee',
 'supreme court justices',
 'mary anne ’',
 'use whatever means',
 'book could almost',
 'much maligned sixties',
 '“ people project',
 

In [33]:
output

[{'_id': '0679450440',
  'ISBN': '0679450440',
  'URL': 'https://www.goodreads.com/book/show/1750517.Protect_and_Defend',
  'Review': ['Four and a half stars.This novel takes an issue which people usually have very definite views about and turns it into a highly readable story. The main characters are Mary Ann Tierney who is fifteen and pregnant. For reasons that become clear as the book moves forward, Mary Ann wants a partial birth abortion. Opposing her desire for this are her parents, who are outspoken advocates against abortion under any circumstances. The situation ends up as a court battle, with Mary Ann and her lawyer Sarah against Mary Ann’s parents. Into the mix add the newly appointed president Kerry Kilcannon and his candidate for Chief Justice, Caroline Masters plus senators from the opposing political party and particularly Senator Chad Palmer. Each of these three have dangerous secrets in their past. This story highlights the way politicians and their families lives are d