# ABOUT:
- This notebook adds a genre collection, together with its embeddings, to MongoDB.
- background:
    - Previously, I found that the keywords extracted by Rake are very noisy, and so are their embeddings.
    - The first round of filtering should be as efficient as possible, i.e. the number of embedding comparisons should be kept minimal while still narrowing down candidates quickly.
    - Thus, this code tries the following: given a query, select the k most relevant genres, then select the documents that contain those genres. Since the number of genres is small, we can expect this to be much faster.
- insight:
    - Looping through a cursor is slow, even with only 400+ items!
        - Solution: save the genre embeddings as a single matrix and perform a matrix multiplication.
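
The idea above can be sketched as follows. This is a minimal illustration, not the notebook's actual retrieval code: the function names `top_k_genres` and `books_filter` are hypothetical, cosine similarity is an assumed choice of metric, and the toy vectors stand in for `doc["genre"]` and `doc["embedding"]`. The filter uses `$in`, which matches books whose `Genre` array contains any of the selected genres; if every genre must be present, `$all` would be the operator to use instead.

```python
import numpy as np


def top_k_genres(query_embedding, genre_names, genre_matrix, k=5):
    """Return the k genre names most similar to the query (cosine similarity)."""
    matrix = np.asarray(genre_matrix, dtype=np.float32)    # shape (n_genres, dim)
    query = np.asarray(query_embedding, dtype=np.float32)  # shape (dim,)
    # Normalise so that the dot product equals cosine similarity.
    matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    query = query / np.linalg.norm(query)
    scores = matrix @ query              # one matrix-vector product, no Python loop
    k = min(k, len(genre_names))
    best = np.argsort(-scores)[:k]       # indices of the k highest scores
    return [genre_names[i] for i in best]


def books_filter(genres):
    """Build a MongoDB filter for books whose Genre array contains any of the genres."""
    return {"Genre": {"$in": list(genres)}}


# Toy data (assumption): three genres with 2-dimensional embeddings.
names = ["Fantasy", "History", "Chess"]
matrix = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
selected = top_k_genres([0.9, 0.1], names, matrix, k=2)
query_filter = books_filter(selected)
```

With real data, `query_filter` could then be passed to `book_collection.find(...)` so that only the narrowed-down candidates go through the more expensive comparison.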

### connect mongo

In [1]:
from pymongo import MongoClient
import certifi
ca = certifi.where()
client = MongoClient("mongodb+srv://tartiniglia:W.I.T.C.H.@atlascluster.tv8xjir.mongodb.net/?retryWrites=true&w=majority",
                     serverSelectionTimeoutMS = 5000, tlsCAFile=ca)
db = client["bookEater"]
book_collection = db["Books"]
genre_collection = db["full_genre"]

In [2]:
# from pymongo import MongoClient
# import certifi
# ca = certifi.where()
# client = MongoClient("mongodb+srv://tanchingfhen:978775!Mj@dataproducts.hcjk1ct.mongodb.net/?retryWrites=true&w=majority", tlsCAFile=ca)
# db = client["DP"] 
# book_collection = db["books"] 
# genre_collection = db["genre"] 
# full_genre_collection = db["full_genre"] 

### upload genre collection

In [7]:
from sentence_transformers import SentenceTransformer
bi_encoder = SentenceTransformer('whaleloops/phrase-bert')

In [5]:
def generate_genre_collection(collection, embedding_model):
    """Make a list of genre documents, i.e. [{"genre": xx, "embedding": [...]}, ...]."""
    all_genres = get_all_genres(collection)
    # encode() returns one embedding row per genre; tolist() makes it storable in MongoDB.
    embeddings = embedding_model.encode(all_genres).tolist()
    return [{"genre": genre, "embedding": embedding}
            for genre, embedding in zip(all_genres, embeddings)]


def generate_full_genre_collection(collection, embedding_model):
    """Make one document containing all genres and their embedding matrix,
    i.e. {"genre": [...], "embedding": [[...], ...]}."""
    all_genres = get_all_genres(collection)
    embeddings = embedding_model.encode(all_genres).tolist()
    return {"genre": all_genres, "embedding": embeddings}


def get_all_genres(collection):
    """Get all distinct genres from the book collection."""
    all_genres = set()
    # Project only the Genre field so the cursor returns small documents.
    for doc in collection.find({}, {"Genre": 1}):
        all_genres.update(doc["Genre"])
    return list(all_genres)
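
Since looping through a cursor was found to be slow, one possible alternative to `get_all_genres` is to let the server deduplicate the values with PyMongo's `Collection.distinct`. The sketch below is an assumption about how that could look, not something this notebook ran; `get_all_genres_distinct` and the `InMemoryBooks` stand-in are hypothetical names used only so the example runs without a database.

```python
def get_all_genres_distinct(collection):
    """Hypothetical variant of get_all_genres that deduplicates on the server."""
    # For an array field, distinct() treats each array element as a separate value.
    return collection.distinct("Genre")


class InMemoryBooks:
    """Tiny stand-in for a PyMongo collection, for illustration only."""

    def __init__(self, docs):
        self.docs = docs

    def distinct(self, key):
        # Mimic the server: flatten array values and keep each one once.
        values = []
        for doc in self.docs:
            for value in doc.get(key, []):
                if value not in values:
                    values.append(value)
        return values


books = InMemoryBooks([
    {"Genre": ["Fantasy", "Steampunk"]},
    {"Genre": ["Fantasy", "History"]},
])
genres = get_all_genres_distinct(books)
```

Against a real collection the call would simply be `get_all_genres_distinct(book_collection)`.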

### upload: genre collection
- each document is one genre and its embedding

In [42]:
# genre_documents = generate_genre_collection(book_collection, bi_encoder)
# genre_collection.insert_many(genre_documents)

### upload: full_genre collection
- one document containing all genres and their matrix embedding

In [17]:
full_genre_document = generate_full_genre_collection(book_collection, bi_encoder)
genre_collection.insert_one(full_genre_document)


<pymongo.results.InsertOneResult at 0x14aa3891a00>
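
Storing every embedding in one document only works while that document stays under MongoDB's 16 MiB BSON document limit. The sketch below estimates the size of the `embedding` field from the BSON layout (arrays are stored as documents keyed by the element index, and Python floats are stored as 8-byte doubles). The helper names are hypothetical, and the figures in the last line (450 genres, 768 dimensions) are assumptions for illustration, not values measured in this notebook.

```python
MAX_BSON_BYTES = 16 * 1024 * 1024  # MongoDB's maximum BSON document size


def bson_double_array_size(n):
    """Bytes used by a BSON array of n doubles."""
    # Per element: 1 type byte + index key as a C string + 8-byte double.
    elements = sum(1 + len(str(i)) + 1 + 8 for i in range(n))
    return 4 + elements + 1  # int32 length prefix + elements + terminator


def embedding_field_size(n_genres, dim):
    """Bytes used by a BSON array of n_genres arrays, each holding dim doubles."""
    row = bson_double_array_size(dim)
    # Per row: 1 type byte + index key as a C string + the embedded array.
    elements = sum(1 + len(str(i)) + 1 + row for i in range(n_genres))
    return 4 + elements + 1


# Assumed sizes for illustration: 450 genres, 768-dimensional embeddings.
fits = embedding_field_size(450, 768) < MAX_BSON_BYTES
```

The estimate ignores the `genre` list and the field names, which are small next to the embedding matrix.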

### delete documents

In [48]:
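# Destructive: delete_many({}) removes every document in the collection.
# The call is left commented out as a guard; `x` below is the result of an earlier run.
# x = embedding_collection.delete_many({})
print(x.deleted_count, " documents deleted.")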

939  documents deleted.


### count_documents

In [19]:
genre_collection.count_documents({})

1

### all genres

In [20]:
doc = genre_collection.find_one({})

In [21]:
doc["genre"]

['Star Wars',
 'Romanticism',
 'Literary Fiction',
 'Climate Change Fiction',
 'Museology',
 'Steampunk',
 'Ethnic Studies',
 'Deception',
 'Theology',
 'Humanities',
 'School',
 'Climbing',
 'Plays',
 'Spider Man',
 'Fantasy',
 'Lds',
 'History',
 'Chapter Books',
 'Military Science Fiction',
 'Cryptozoology',
 'Nature',
 'Read For School',
 'France',
 'Pseudoscience',
 'Transport',
 'Ecclesiology',
 'Natural History',
 'American Fiction',
 'Linguistics',
 'Cinderella',
 'Southern Gothic',
 'Sexuality',
 'Marriage',
 'M F Romance',
 'International',
 'Erotic Romance',
 'Western Romance',
 'Accounting',
 'Spanish Civil War',
 'Divorce',
 'Anglo Saxon',
 'Naval History',
 'Relationships',
 'Emotion',
 'Apocalyptic',
 'Human Resources',
 'Lesbian Fiction',
 'American History',
 'Sociology',
 'Romantic',
 'Chess',
 'Science',
 '1st Grade',
 'Basketball',
 '2nd Grade',
 'Ethnography',
 'German Literature',
 'Abuse',
 'Military History',
 'Costume History',
 'Dc Comics',
 'Islam',
 'Europea