## **Document Similarity**

In [1]:
import pandas as pd
import numpy as np

pd.set_option('max_colwidth', None)

In [2]:
books = pd.read_csv('../Data/childrens_books.csv')

In [3]:
books.head(2)

,Ranking,Title,Author,Year,Rating,Description
0,1,Where the Wild Things Are,Maurice Sendak,1963,4.25,"Where the Wild Things Are follows Max, a young boy who, after being sent to his room for misbehaving, imagines sailing to an island filled with wild creatures. As their king, Max tames the beasts and eventually returns home to find his supper waiting for him. This iconic book explores themes of imagination, adventure, and the complex emotions of childhood, all captured through Sendak's whimsical illustrations and story."
1,2,The Very Hungry Caterpillar,Eric Carle,1969,4.34,"The Very Hungry Caterpillar tells the story of a caterpillar who eats through a variety of foods before eventually becoming a butterfly. Eric Carle’s use of colorful collage illustrations and rhythmic text has made this book a beloved classic for young readers. The simple, engaging story introduces children to days of the week, counting, and the concept of metamorphosis. It’s a staple in early childhood education."


In [14]:
books[books.Title.str.contains("Harry Potter and the Sorcerer's Stone",case=False)].index

Index([13], dtype='int64')

In [18]:
## Suppress the warnings
import warnings
warnings.filterwarnings("ignore")

In [19]:
from transformers import pipeline, logging
logging.set_verbosity_error()
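
The feature-extraction cell that follows passes `device='mps'`, which only works on Apple Silicon. Below is a minimal sketch of choosing a device at runtime, assuming PyTorch is the backend. `pick_device` is a hypothetical helper, not part of the notebook or of `transformers`.

```python
def pick_device():
    """Return 'mps', 'cuda', or 'cpu' depending on what PyTorch reports."""
    try:
        import torch
    except ImportError:
        # Assumption: without PyTorch we simply fall back to the CPU label.
        return "cpu"

    # torch.backends.mps is missing on older PyTorch builds, so guard the lookup.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"


device = pick_device()
print(device)
```

The resulting string could then be passed as `device=device` when building the pipeline, so the same notebook runs on machines without an Apple GPU.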

In [29]:
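## Turn the description column into embeddings using feature extraction
feature_extractor = pipeline(
    "feature-extraction",
    model="sentence-transformers/all-MiniLM-L6-v2",
    device='mps'  # Apple Silicon GPU; use 'cpu' or a CUDA device elsewhere
)

# extract the embeddings: the pipeline returns [batch][token][dim], so
# [0][0] keeps the first token's ([CLS]) vector for each description
embeddings = books['Description'].apply(lambda x: feature_extractor(x)[0][0])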
embeddings_books = np.vstack(embeddings)
embeddings_books.shape

(100, 384)
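
The `[0][0]` indexing above keeps only the first token's vector for each description. A common alternative for sentence-embedding models is mean pooling, which averages the vectors of all tokens. The sketch below shows the difference on a toy nested list standing in for the pipeline output. `mean_pool` and `toy_output` are hypothetical names, and the sketch assumes a single unpadded input per call.

```python
import numpy as np


def mean_pool(pipeline_output):
    """Average the token vectors of one feature-extraction result.

    Assumes pipeline_output has shape [1][num_tokens][dim], as returned
    for a single, unpadded input string.
    """
    token_vectors = np.asarray(pipeline_output[0], dtype=float)  # (num_tokens, dim)
    return token_vectors.mean(axis=0)                            # (dim,)


# Toy stand-in for feature_extractor(text): 1 input, 3 tokens, 4 dimensions
toy_output = [[[1.0, 0.0, 2.0, 4.0],
               [3.0, 0.0, 2.0, 0.0],
               [2.0, 3.0, 2.0, 2.0]]]

first_token = np.asarray(toy_output[0][0])  # what [0][0] keeps
pooled = mean_pool(toy_output)              # average over all tokens

print(first_token)
print(pooled)
```

In this notebook the helper could be applied as `books['Description'].apply(lambda x: mean_pool(feature_extractor(x)))`. The similarity scores would then differ from the ones shown in the outputs here.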

In [35]:
# create a function to get book recommendations
from sklearn.metrics.pairwise import cosine_similarity

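def get_similar_books(embeddings, book_index, book_details, top_n=3):
    """Return the query book followed by its top_n most similar books."""

    # specify the book (book_index is a row position in embeddings)
    b_embedding = np.array(embeddings[book_index]).reshape(1, -1)

    # calculate cosine similarity scores against every book
    similarity_scores = cosine_similarity(b_embedding, embeddings)
    # reuse the index of book_details so the scores line up row by row
    similarity_scores_series = pd.Series(similarity_scores.flatten(),
                                         index=book_details.index,
                                         name='similarity_score')

    # combine book info with the similarity scores
    similarity_df = pd.concat([book_details, similarity_scores_series], axis=1)

    # sort descending; the first row is normally the query book itself,
    # so take top_n + 1 rows to get top_n recommendations
    return similarity_df.sort_values('similarity_score', ascending=False).iloc[0:top_n+1]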
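
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends only on direction and not on magnitude. The sketch below computes it by hand with NumPy on toy 2-D vectors. `cosine_sim` is a hypothetical helper used for illustration, not the scikit-learn function imported above.

```python
import numpy as np


def cosine_sim(a, b):
    """Cosine similarity of two 1-D vectors: a.b / (|a| * |b|)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


a = [1.0, 0.0]
same_direction = [3.0, 0.0]   # longer vector, same direction
orthogonal = [0.0, 2.0]       # at a right angle to a
opposite = [-1.0, 0.0]        # pointing the other way

print(cosine_sim(a, same_direction))  # 1.0
print(cosine_sim(a, orthogonal))      # 0.0
print(cosine_sim(a, opposite))        # -1.0
```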

In [36]:
# find the harry potter book index
books.Title[books.Title.str.contains('harry potter', case=False)]

13       Harry Potter and the Sorcerer's Stone (Harry Potter, #1)
97    Harry Potter and the Prisoner of Azkaban (Harry Potter, #3)
98     Harry Potter and the Chamber of Secrets (Harry Potter, #2)
Name: Title, dtype: object

In [37]:
# get book recommendations
get_similar_books(embeddings_books, 13, books[['Title', 'Description', 'Rating']], top_n=5)

,Title,Description,Rating,similarity_score
13,"Harry Potter and the Sorcerer's Stone (Harry Potter, #1)","Harry Potter and the Sorcerer’s Stone introduces readers to Harry Potter, an orphan who discovers that he is a wizard and attends the magical Hogwarts School of Witchcraft and Wizardry. Along with his new friends, Harry uncovers mysteries surrounding his past and the dark wizard who killed his parents. This book starts the beloved series and sets the stage for Harry’s journey, filled with magic, adventure, and friendship.",4.47,1.0
97,"Harry Potter and the Prisoner of Azkaban (Harry Potter, #3)","Harry Potter and the Prisoner of Azkaban is the third book in the Harry Potter series, where Harry returns to Hogwarts for his third year and uncovers secrets about his past. With the arrival of the mysterious Sirius Black, Harry must navigate dark truths and face his fears. This thrilling installment explores themes of loyalty, friendship, and identity, marking a turning point in the magical world of Harry Potter.",4.58,0.872638
98,"Harry Potter and the Chamber of Secrets (Harry Potter, #2)","Harry Potter and the Chamber of Secrets is the second book in the Harry Potter series, where Harry returns to Hogwarts for his second year and uncovers a hidden chamber within the school. As mysterious events unfold, Harry and his friends Ron and Hermione uncover dark secrets about the school’s past. Themes of courage, friendship, and standing up for what’s right are explored in this gripping magical adventure.",4.43,0.855368
63,The Witches,"The Witches tells the story of a young boy and his grandmother who uncover a secret society of witches who despise children and plot to turn them all into mice. With the help of his grandmother, the boy must outwit the witches and save the children. The book is known for its dark humor, thrilling suspense, and memorable characters. Though it can be a bit scary, it is beloved for its unique blend of fear, adventure, and courage.",4.18,0.799051
55,"The Wonderful Wizard of Oz (Oz, #1)","The Wonderful Wizard of Oz is the first book in Baum's Oz series and tells the story of Dorothy, a young girl from Kansas who is swept away to the magical land of Oz. Along with her new friends—the Scarecrow, Tin Man, and Cowardly Lion—Dorothy embarks on a journey to meet the Wizard and find her way home. The book is filled with themes of friendship, courage, and the belief in oneself, and has become an iconic tale in American literature.",4.0,0.788523
42,"The Hobbit, or There and Back Again (The Lord of the Rings, #0)","The Hobbit follows the journey of Bilbo Baggins, a hobbit who is thrust into an epic adventure with dwarves and the wizard Gandalf. Together, they set out to reclaim treasure guarded by the dragon Smaug. Along the way, Bilbo grows in courage, wisdom, and leadership. The book introduces readers to Tolkien's richly imagined world of Middle-earth and is the prelude to the Lord of the Rings trilogy, blending adventure, fantasy, and heroism.",4.29,0.773417
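
The first row of the result is the query book itself, with a similarity of 1.0. If only the recommendations are wanted, the query position can be filtered out before taking the top entries. Below is a minimal NumPy sketch of that idea on toy scores. `top_n_excluding_query` is a hypothetical helper, not part of the notebook.

```python
import numpy as np


def top_n_excluding_query(scores, query_index, top_n=3):
    """Return positions of the top_n highest scores, skipping query_index."""
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="stable")  # positions sorted by descending score
    order = order[order != query_index]         # drop the query itself
    return order[:top_n].tolist()


toy_scores = [0.2, 1.0, 0.9, 0.5, 0.7]  # position 1 plays the role of the query
print(top_n_excluding_query(toy_scores, query_index=1, top_n=3))  # [2, 4, 3]
```

Filtering by position rather than dropping the first sorted row avoids removing the wrong book if another description happens to tie with the query.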
