<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Importing-Parsed-Book-Articles" data-toc-modified-id="Importing-Parsed-Book-Articles-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Importing Parsed Book Articles</a></span></li><li><span><a href="#Preparing-BERT-and-TFIDF-Models" data-toc-modified-id="Preparing-BERT-and-TFIDF-Models-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Preparing BERT and TFIDF Models</a></span><ul class="toc-item"><li><span><a href="#BERT" data-toc-modified-id="BERT-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>BERT</a></span></li><li><span><a href="#TFIDF" data-toc-modified-id="TFIDF-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>TFIDF</a></span></li><li><span><a href="#Combining-Models" data-toc-modified-id="Combining-Models-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Combining Models</a></span></li></ul></li><li><span><a href="#Comparing-Outputs---TFIDF" data-toc-modified-id="Comparing-Outputs---TFIDF-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Comparing Outputs - TFIDF</a></span><ul class="toc-item"><li><span><a href="#No-Ratings" data-toc-modified-id="No-Ratings-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>No Ratings</a></span></li><li><span><a href="#Slight-Preference" data-toc-modified-id="Slight-Preference-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Slight Preference</a></span></li><li><span><a href="#Skewed-Preference" data-toc-modified-id="Skewed-Preference-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Skewed Preference</a></span></li></ul></li><li><span><a href="#Comparing-Outputs---BERT-and-TFIDF" data-toc-modified-id="Comparing-Outputs---BERT-and-TFIDF-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Comparing Outputs - BERT and TFIDF</a></span></li></ul></div>

**rec_ratings**

Demonstrates passing multiple inputs with assigned ratings so that book recommendations are weighted accordingly. See [examples/rec_books](https://github.com/andrewtavis/wikirec/blob/main/examples/rec_books.ipynb) for downloading and parsing steps.

If using this notebook in [Google Colab](https://colab.research.google.com/github/andrewtavis/wikirec/blob/main/examples/rec_ratings.ipynb), you can activate GPUs by following `Edit > Notebook settings > Hardware accelerator` and selecting `GPU`.

In [1]:
# pip install wikirec -U

The following gensim update might be necessary in Google Colab, as the default version is quite old.

In [2]:
# pip install gensim -U

In Colab you'll also need to download nltk's names data.

In [3]:
# import nltk
# nltk.download("names")

In [4]:
import os
import json
import pickle

from wikirec import data_utils, model, utils

from IPython.display import display, HTML

display(HTML("<style>.container { width:99% !important; }</style>"))

# Importing Parsed Book Articles

In [5]:
topic = "books"

In [7]:
# Make sure to extract the .zip file containing enwiki_books.ndjson
with open("./enwiki_books.ndjson", "r") as fin:
    books = [json.loads(l) for l in fin]

print(f"Found a total of {len(books)} books.")

Found a total of 41234 books.


In [8]:
titles = [b[0] for b in books]  # Title of each book
texts = [b[1] for b in books]  # Text of each book's English Wikipedia article

In [11]:
if os.path.isfile("./book_corpus_idxs.pkl"):
    print("Loading book corpus and selected indexes")
    with open("./book_corpus_idxs.pkl", "rb") as f:
        text_corpus, selected_idxs = pickle.load(f)
        selected_titles = [titles[i] for i in selected_idxs]

else:
    print("Creating book corpus and selected indexes")
    text_corpus, selected_idxs = data_utils.clean(
        texts=texts,
        language="en",
        min_token_freq=5,  # 0 for Bert
        min_token_len=3,  # 0 for Bert
        min_tokens=50,
        max_token_index=-1,
        min_ngram_count=3,
        remove_stopwords=True,  # False for Bert
        ignore_words=None,
        remove_names=True,
        sample_size=1,
        verbose=True,
    )

    selected_titles = [titles[i] for i in selected_idxs]

    with open("./book_corpus_idxs.pkl", "wb") as f:
        print("Pickling book corpus and selected indexes")
        pickle.dump([text_corpus, selected_idxs], f, protocol=4)

Loading book corpus and selected indexes


# Preparing BERT and TFIDF Models

In [12]:
def load_or_create_sim_matrix(
    method,
    corpus,
    metric,
    topic,
    path="./",
    bert_st_model="xlm-r-bert-base-nli-stsb-mean-tokens",
    **kwargs,
):
    """
    Loads or creates a similarity matrix used to deliver recommendations.

    NOTE: the .pkl files created can be 5-10 GB or more in size.
    """
    if os.path.isfile(f"{path}{topic}_{metric}_{method}_sim_matrix.pkl"):
        print(f"Loading {method} {topic} {metric} similarity matrix")
        with open(f"{path}{topic}_{metric}_{method}_sim_matrix.pkl", "rb") as f:
            sim_matrix = pickle.load(f)

    else:
        print(f"Creating {method} {topic} {metric} similarity matrix")
        embeddings = model.gen_embeddings(
            method=method, corpus=corpus, bert_st_model=bert_st_model, **kwargs,
        )
        sim_matrix = model.gen_sim_matrix(
            method=method, metric=metric, embeddings=embeddings,
        )

        with open(f"{path}{topic}_{metric}_{method}_sim_matrix.pkl", "wb") as f:
            print(f"Pickling {method} {topic} {metric} similarity matrix")
            pickle.dump(sim_matrix, f, protocol=4)

    return sim_matrix

## BERT

In [16]:
# Remove n-grams for BERT training
corpus_no_ngrams = [
    " ".join([t for t in text.split(" ") if "_" not in t]) for text in text_corpus
]
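
As a quick illustration of the filter above, here is a minimal sketch on toy data. It assumes, as the cell above implies, that n-grams in the cleaned corpus are single tokens joined with underscores; the `toy_corpus` strings are made up for this example.

```python
# Toy cleaned texts (hypothetical); tokens containing "_" stand in for n-grams
toy_corpus = [
    "harry_potter young wizard attend school",
    "bilbo leave shire lord_of_the_rings prequel",
]

# Same filter as above: drop every token that contains an underscore
toy_no_ngrams = [
    " ".join(t for t in text.split(" ") if "_" not in t) for text in toy_corpus
]

print(toy_no_ngrams)
# ['young wizard attend school', 'bilbo leave shire prequel']
```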

In [17]:
# We can pass kwargs for sentence_transformers.SentenceTransformer.encode
bert_sim_matrix = load_or_create_sim_matrix(
    method="bert",
    corpus=corpus_no_ngrams,
    metric="cosine",  # or "euclidean"
    topic=topic,
    path="./",
    bert_st_model="xlm-r-bert-base-nli-stsb-mean-tokens",
    show_progress_bar=True,
    batch_size=32,
)

Loading bert books cosine similarity matrix


## TFIDF

In [14]:
# We can pass kwargs for sklearn.feature_extraction.text.TfidfVectorizer
tfidf_sim_matrix = load_or_create_sim_matrix(
    method="tfidf",
    corpus=text_corpus,
    metric="cosine",  # or "euclidean"
    topic=topic,
    path="./",
    max_features=None,
    norm="l2",
)

Loading tfidf books cosine similarity matrix


## Combining Models

In [18]:
tfidf_weight = 0.35
bert_weight = 1.0 - tfidf_weight
bert_tfidf_sim_matrix = tfidf_weight * tfidf_sim_matrix + bert_weight * bert_sim_matrix
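
The combination above is an element-wise weighted sum. The sketch below shows the same arithmetic on two small made-up matrices, assuming both are NumPy arrays of the same shape whose rows and columns refer to the same titles in the same order. The names `toy_tfidf`, `toy_bert`, `w_tfidf` and `w_bert` are illustrative only.

```python
import numpy as np

# Hypothetical 2x2 similarity matrices for the same two titles
toy_tfidf = np.array([[1.0, 0.2], [0.2, 1.0]])
toy_bert = np.array([[1.0, 0.8], [0.8, 1.0]])

w_tfidf = 0.35
w_bert = 1.0 - w_tfidf  # weights sum to 1, so self-similarity stays at 1.0

# Element-wise weighted sum; both matrices must share shape and title order
assert toy_tfidf.shape == toy_bert.shape
combined = w_tfidf * toy_tfidf + w_bert * toy_bert

print(combined[0, 1])  # 0.35 * 0.2 + 0.65 * 0.8, approximately 0.59
```

Because the weights sum to one, the combined scores stay on the same scale as the inputs, which keeps them comparable across different weight choices.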

# Comparing Outputs - TFIDF

## No Ratings 

With no ratings, each recommendation's score is simply the average of its similarities to the inputs.

In [30]:
model.recommend(
    inputs=["Harry Potter and the Philosopher's Stone", "The Hobbit"],
    ratings=None,
    titles=selected_titles,
    sim_matrix=tfidf_sim_matrix,
    n=10,
    metric="cosine",
)

[['The History of The Hobbit', 0.4144937936077629],
 ['Harry Potter and the Chamber of Secrets', 0.34888387038976304],
 ['The Lord of the Rings', 0.3461664662907625],
 ['The Annotated Hobbit', 0.3431651523791515],
 ['Harry Potter and the Deathly Hallows', 0.3336208844683567],
 ['Harry Potter and the Goblet of Fire', 0.3323377108209634],
 ['Harry Potter and the Half-Blood Prince', 0.32972615751499673],
 ['Mr. Bliss', 0.3219122094772891],
 ['Harry Potter and the Order of the Phoenix', 0.3160426316664049],
 ['The Magical Worlds of Harry Potter', 0.30770960167033506]]
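
To make the averaging concrete, here is a minimal sketch on toy data. It is not wikirec's implementation: `recommend_by_average`, the titles and the similarity values are all hypothetical, and the only assumption carried over is that the score of a candidate is the mean of its similarities to the inputs.

```python
import numpy as np


def recommend_by_average(inputs, titles, sim_matrix, n=2):
    """Rank titles by their mean similarity to the inputs (hypothetical helper)."""
    idxs = [titles.index(t) for t in inputs]
    # Average the similarity rows of all inputs, column by column
    mean_sims = np.mean([sim_matrix[i] for i in idxs], axis=0)
    ranked = np.argsort(mean_sims)[::-1]  # highest score first
    # Skip the inputs themselves so they are not recommended back
    recs = [[titles[i], float(mean_sims[i])] for i in ranked if titles[i] not in inputs]
    return recs[:n]


toy_titles = ["A", "B", "C", "D"]
toy_sims = np.array(
    [
        [1.0, 0.1, 0.6, 0.2],
        [0.1, 1.0, 0.4, 0.7],
        [0.6, 0.4, 1.0, 0.3],
        [0.2, 0.7, 0.3, 1.0],
    ]
)

toy_recs = recommend_by_average(["A", "B"], toy_titles, toy_sims)
print(toy_recs)  # "C" ranks above "D"
```

Here "C" scores (0.6 + 0.4) / 2 and "D" scores (0.2 + 0.7) / 2, so a title that is moderately similar to both inputs can outrank one that is close to only a single input.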

## Slight Preference

Each rating must be between 0 and 10, inclusive. Notice the slight change in order, with Harry Potter books shifted higher.

In [31]:
model.recommend(
    inputs=["Harry Potter and the Philosopher's Stone", "The Hobbit"],
    ratings=[10, 7],
    titles=selected_titles,
    sim_matrix=tfidf_sim_matrix,
    n=10,
    metric="cosine",
)

[['Harry Potter and the Chamber of Secrets', 0.3338375326315423],
 ['Harry Potter and the Deathly Hallows', 0.3205803038084398],
 ['Harry Potter and the Goblet of Fire', 0.31891867694284576],
 ['Harry Potter and the Half-Blood Prince', 0.31590494471139013],
 ['Harry Potter and the Order of the Phoenix', 0.3061664463277075],
 ['The History of The Hobbit', 0.2983234055475572],
 ['The Magical Worlds of Harry Potter', 0.2918779267564048],
 ['Harry Potter and the Methods of Rationality', 0.27619951402732],
 ['Harry Potter and the Prisoner of Azkaban', 0.272304163328929],
 ['Fantastic Beasts and Where to Find Them', 0.2693397153375818]]
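
One way ratings could shift the ranking is to scale each input's similarities by `rating / 10` before averaging. The sketch below assumes that scheme; it appears consistent with the scores shown in this notebook, but it has not been checked against wikirec's source, and `recommend_with_ratings` and the toy data are hypothetical.

```python
import numpy as np


def recommend_with_ratings(inputs, ratings, titles, sim_matrix, n=2):
    """Rank titles by rating-weighted mean similarity (assumed scheme, not wikirec's API)."""
    if len(ratings) != len(inputs):
        raise ValueError("ratings must have one entry per input")
    if any(r < 0 or r > 10 for r in ratings):
        raise ValueError("each rating must be between 0 and 10")

    idxs = [titles.index(t) for t in inputs]
    # Scale each input's similarity row by its rating, then average the rows
    weighted = [sim_matrix[i] * (r / 10) for i, r in zip(idxs, ratings)]
    scores = np.mean(weighted, axis=0)
    ranked = np.argsort(scores)[::-1]
    recs = [[titles[i], float(scores[i])] for i in ranked if titles[i] not in inputs]
    return recs[:n]


toy_titles = ["A", "B", "C", "D"]
toy_sims = np.array(
    [
        [1.0, 0.1, 0.6, 0.2],
        [0.1, 1.0, 0.4, 0.7],
        [0.6, 0.4, 1.0, 0.3],
        [0.2, 0.7, 0.3, 1.0],
    ]
)

prefers_a = recommend_with_ratings(["A", "B"], [10, 2], toy_titles, toy_sims)
prefers_b = recommend_with_ratings(["A", "B"], [2, 10], toy_titles, toy_sims)
print(prefers_a[0][0], prefers_b[0][0])  # C D
```

Under this scheme a rating of 10 leaves an input's similarities unchanged, while lower ratings shrink its contribution, so all scores drop as ratings fall below 10 even when the ordering stays the same.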

## Skewed Preference

The recommendations are now dominated by Harry Potter-related books.

In [32]:
model.recommend(
    inputs=["Harry Potter and the Philosopher's Stone", "The Hobbit"],
    ratings=[10, 2],
    titles=selected_titles,
    sim_matrix=tfidf_sim_matrix,
    n=10,
    metric="cosine",
)

[['Harry Potter and the Chamber of Secrets', 0.3087603030345078],
 ['Harry Potter and the Deathly Hallows', 0.29884600270857836],
 ['Harry Potter and the Goblet of Fire', 0.2965536204793163],
 ['Harry Potter and the Half-Blood Prince', 0.29286959003871244],
 ['Harry Potter and the Order of the Phoenix', 0.2897061374298785],
 ['The Magical Worlds of Harry Potter', 0.2654918018998543],
 ['Harry Potter and the Methods of Rationality', 0.2580909354240481],
 ['Harry Potter and the Prisoner of Azkaban', 0.25155784850490504],
 ['Fantastic Beasts and Where to Find Them', 0.24842432392236208],
 ['The Casual Vacancy', 0.23260474042085055]]

# Comparing Outputs - BERT and TFIDF

In [37]:
model.recommend(
    inputs=["Harry Potter and the Philosopher's Stone", "The Hobbit", "The Hunger Games"],
    ratings=None,
    titles=selected_titles,
    sim_matrix=bert_tfidf_sim_matrix,
    n=20,
    metric="cosine",
)

[['The Lord of the Rings', 0.8129448240195865],
 ['Harry Potter and the Order of the Phoenix', 0.8058152446026797],
 ['Harry Potter and the Half-Blood Prince', 0.7899741862008424],
 ['Harry Potter and the Prisoner of Azkaban', 0.7795265344828326],
 ['Harry Potter and the Deathly Hallows', 0.774902922811441],
 ['The Weirdstone of Brisingamen', 0.7718548190275306],
 ['The Magical Worlds of Harry Potter', 0.7691708768967348],
 ['Harry Potter and the Chamber of Secrets', 0.7669100258159494],
 ['Gregor and the Curse of the Warmbloods', 0.762141807244329],
 ['The Marvellous Land of Snergs', 0.7591230587892558],
 ['Mockingjay', 0.7585438304114571],
 ['Fantastic Beasts and Where to Find Them', 0.757280478510476],
 ['The Children of Húrin', 0.7570409672927969],
 ['The Book of Three', 0.7497114647690369],
 ['Harry Potter and the Goblet of Fire', 0.7414905999564945],
 ['The Bone Season', 0.7401901103966402],
 ['A Wrinkle in Time', 0.7392014390129485],
 ['A Wizard of Earthsea', 0.7337085913181924]

In [38]:
model.recommend(
    inputs=[
        "Harry Potter and the Philosopher's Stone",
        "The Hobbit",
        "The Hunger Games",
    ],
    ratings=[7, 6, 9],
    titles=selected_titles,
    sim_matrix=bert_tfidf_sim_matrix,
    n=20,
    metric="cosine",
)

[['Mockingjay', 0.5847107027999907],
 ['Harry Potter and the Order of the Phoenix', 0.5846454899012506],
 ['The Lord of the Rings', 0.5758166462534925],
 ['Harry Potter and the Half-Blood Prince', 0.5677581220645922],
 ['Harry Potter and the Deathly Hallows', 0.5591667887082767],
 ['Harry Potter and the Prisoner of Azkaban', 0.5584267832698454],
 ['Catching Fire', 0.5582404750962344],
 ['Gregor and the Curse of the Warmbloods', 0.5527128074677247],
 ['Harry Potter and the Chamber of Secrets', 0.5524299731616052],
 ['The Weirdstone of Brisingamen', 0.5520358627555212],
 ['The Magical Worlds of Harry Potter', 0.5506942177737976],
 ['The Bone Season', 0.547984210564344],
 ['The Book of Three', 0.5459088891490478],
 ['Fantastic Beasts and Where to Find Them', 0.5443195045210549],
 ['The Marvellous Land of Snergs', 0.5398665287849369],
 ['A Wrinkle in Time', 0.5373739646822866],
 ['The Casual Vacancy', 0.5358385211606874],
 ['Harry Potter and the Goblet of Fire', 0.5346379229854734],
 ['The