# Recommendation Algorithm

*This Jupyter Notebook imports and combines the two datasets (for young adult and children), performs exploratory data analysis, and generates the output for the Recommendation using both collaborative and content-based filtering.*

# Section 1: Data Preparation 

## 1.1 Importing & Installing Libraries

In [1]:
! pip install pandas
! pip install scikit-learn
! pip install nltk
! pip install spacy
! python -m spacy download en_core_web_sm
! pip install sentence-transformers
! pip install requests jupyter

















Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
     ---------------------------------------- 0.0/12.8 MB ? eta -:--:--
     ---------------------- ----------------- 7.3/12.8 MB 41.2 MB/s eta 0:00:01
     --------------------------------------  12.6/12.8 MB 32.9 MB/s eta 0:00:01
     --------------------------------------- 12.8/12.8 MB 30.9 MB/s eta 0:00:00
[38;5;2m[+] Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_sm')












In [2]:
import pandas as pd
import numpy as np
import re
import string
import json
import sklearn
import nltk
import spacy
import math
import random
import time
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors
from nltk import pos_tag
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize, word_tokenize
from collections import defaultdict
from collections import Counter
from sentence_transformers import SentenceTransformer

## 1.2 Importing Data

From goodreads, we are able to download three sets of data.

1. The `books` dataset outlines associated metadata to a specific book. Things of interest here would be `book_id`, `title`, `average_rating`, `ratings_count`,`description`, `num_pages`, `popular_shelves`,`image_url`,`authors`.
2. The `reviews` dataset consists of text reviews that may or may not be added after a rating. As the test reviews do not seem to be useful at this time, we will leave it out. We can consider the data here to scrape for genre.
3. The `interactions` dataset indicates whether or not a specific user has read and rated a specific book. It consists of columns `user_id`, `book_id`, `is_read` and `ratings`.

We have done so for two different age categories (`children` and `young_adult`), and will combine them.

### Importing and Filtering Books Data

In [3]:
start_loading = time.time()
columns_of_interest = ['book_id', 'title', 'average_rating', 'ratings_count','description', 'num_pages', 'popular_shelves','image_url','authors']
json_files = ['goodreads_books_children.json', 'goodreads_books_young_adult.json']
data = []

for json_file in json_files:
    with open(json_file, 'r') as file:
        for line in file:
            record = json.loads(line)
            filtered_record = {key:record[key] for key in columns_of_interest}
            data.append(filtered_record)

books = pd.DataFrame(data)
books['description_length'] = books['description'].apply(len)
books = books[books['description_length'] != 0] #filtering empty descriptions
books = books.drop('description_length', axis = 1)

In [4]:
columns_of_interest = ['author_id', 'name']
json_files = ['goodreads_book_authors.json']
data = []

for json_file in json_files:
    with open(json_file, 'r') as file:
        for line in file:
            record = json.loads(line)
            filtered_record = {key:record[key] for key in columns_of_interest}
            data.append(filtered_record)
            
authors = pd.DataFrame(data)

def get_name(author_id, authors = authors):
    if author_id in authors['author_id'].values:
        return authors.loc[authors['author_id'] == author_id, 'name'].values[0]
    else:
        return None
end_loading = time.time()
duration_loading_books = (end_loading - start_loading)/60

### Importing and Filtering Interactions Data

In [5]:
start_loading = time.time()
columns_of_interest = ['user_id','book_id','is_read','rating']
json_files = ['goodreads_interactions_children.json', 'goodreads_interactions_young_adult.json']
data = []

for json_file in json_files:
    with open(json_file, 'r') as file:
        for line in file:
            record = json.loads(line)
            filtered_record = {key:record[key] for key in columns_of_interest}
            data.append(filtered_record)
            
interactions = pd.DataFrame(data)
interactions = interactions[interactions['is_read'] != 0] #removing ratings by people who have not read the book

end_loading = time.time()
duration_loading_interactions = (end_loading - start_loading)/60

# Section 2: Analysis

There are a few ways we can consider hybridizing the approaches. We will now do the ensemble method, which generates two separate recommendation lists and then takes the intersection. The code below should generate 3 random books from the data, which will be used as a test set.

Other methods we could consider include (1) weighted hybrid, where a content-based score and a collaborative filtering score is calculated and subsequently combined with a weighted average, or (2) switching hybrid, where content-based filtering is used when the user is new, or when a book has very few ratings, and collaborative filtering is used when a user / book has sufficient history.

## 2.1 Generating Recommendations

### 2.1.1 Content-Based Filtering

For Content-Based Filtering, we use **TF-IDF** and **Cosine Similarity** as our core algorithms. 

**Term Frequency-Inverse Document Frequency (TF-IDF)** uses NLP to identify important words in the `description` attribute of the selected book by evaluating how frequently they appear, relative to the descriptions of all other books in the dataset. Once this is done, we sort the books by how similar they are using **cosine similarity**, which measures the angle between the two vectors (books). If they have a small angle, the books have similar `description` and is thus considered to be similar in content.

A better way to imagine this would be if Book A and Book B both have descriptions that talk about "magic", "spells" and "wizards", they would have similar TF-IDF vectors, and thus high cosine similarity scores.

We have also wrote two algorithms to detect genre and suitable age ranges, to identify books in the same genre and similar age ranges. Books in the same genre targeted at a similar age range would enjoy a boost in their similarity scores. The details of detecting genre and detecting age ranges are stipulated below.


#### Detecting Genre

To detect genre, we first:

1. Extract genres from user-assigned shelves (most reliable signal)
2. Apply multiple NLP techniques to analyze book description and title. This includes (1) TF-IDF analysis with genre-specific vocabulary and (2) Named entity recognition to identify genre-related entities
3. Combine all signals with appropriate weights (shelf data > NLP)
4. Return top genres that meet minimum confidence threshold

In [6]:
def detect_book_genre_with_advanced_nlp(book_data, genre_classifier=None, min_confidence=3, exclude_shelves=None):

    if exclude_shelves is None:
        exclude_shelves = get_default_excluded_shelves()

    # Extract genres from structured shelf data
    shelf_genres = extract_genres_from_shelves(book_data, get_genre_map(), exclude_shelves)
    title = str(book_data.get('title', ''))
    description = str(book_data.get('description', ''))

    nlp_genres = {}
    # Only perform NLP analysis if we have enough text
    if len (shelf_genres) < 4 & len(description) > 500:
        nlp_genres.update(analyze_with_tfidf(title, description))
        nlp_genres.update(extract_named_entities(title, description))


    # Combine all signals and apply minimum confidence threshold
    final_genres = combine_all_genre_signals(shelf_genres, nlp_genres, min_confidence)
    return final_genres[:3]

def get_default_excluded_shelves():
    #removes shelf names that aren't useful for genre classification
    return {
        'to-read', 'currently-reading', 'owned', 'default', 
        'favorites', 'books-i-own', 'ebook', 'kindle', 
        'library', 'audiobook', 'owned-books', 'to-buy', 
        'calibre', 're-read', 'unread', 'favourites', 'my-books'
    }

def get_genre_map():
    #Dictionary mapping shelf keywords to standardized genre names
    return {
        'fantasy': 'Fantasy',
        'sci-fi': 'Science Fiction',
        'science-fiction': 'Science Fiction',
        'mystery': 'Mystery/Thriller',
        'thriller': 'Mystery/Thriller',
        'romance': 'Romance',
        'historical': 'Historical Fiction',
        'history': 'History',
        'horror': 'Horror',
        'young-adult': 'Young Adult',
        'ya': 'Young Adult',
        'childrens': 'Children\'s',
        'children': 'Children\'s',
        'kids': 'Children\'s',
        'dystopian': 'Dystopian',
        'classic': 'Classics',
        'classics': 'Classics',
        'biography': 'Biography/Memoir',
        'memoir': 'Biography/Memoir',
        'autobiography': 'Biography/Memoir',
        'self-help': 'Self Help',
        'business': 'Business',
        'philosophy': 'Philosophy',
        'psychology': 'Psychology',
        'science': 'Science',
        'poetry': 'Poetry',
        'comic': 'Comics/Graphic Novels',
        'graphic-novel': 'Comics/Graphic Novels',
        'manga': 'Manga',
        'cooking': 'Cooking/Food',
        'cookbook': 'Cooking/Food',
        'food': 'Cooking/Food',
        'travel': 'Travel',
        'religion': 'Religion/Spirituality',
        'spirituality': 'Religion/Spirituality',
        'art': 'Art/Photography',
        'photography': 'Art/Photography',
        'reference': 'Reference',
        'textbook': 'Textbook/Education',
        'education': 'Textbook/Education'
    }

def extract_genres_from_shelves(book_data, genre_map, exclude_shelves):
    
    #Extract genre information from book's popular shelves data by using shelf counts as confidence scores (more users shelving = higher confidence)
    
    shelf_genres = {}
    
    popular_shelves = book_data.get('popular_shelves', [])
    if isinstance(popular_shelves, list) and popular_shelves:
        for shelf in popular_shelves:
            shelf_name = shelf.get('name', '').strip().lower()
            shelf_count = int(shelf.get('count', 0))
            
            if shelf_name in exclude_shelves:
                continue
            
            for keyword, genre_name in genre_map.items():
                if keyword in shelf_name:
                    if genre_name in shelf_genres:
                        shelf_genres[genre_name] += shelf_count
                    else:
                        shelf_genres[genre_name] = shelf_count
                    break
    
    return shelf_genres

def preprocess_text(text, lemmatize=True):
    """
    Preprocess text by removing special characters, lemmatizing, etc. We convert text to lowercase, remove URLs and HTML tags, remove non-alphabetic 
    characters, normalize whitespace and optionally lemmatize words (reduce to base form)
    """
    text = text.lower()
    text = re.sub(r'http\S+|www\S+|https\S+', '', text)
    text = re.sub(r'<.*?>', '', text)
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    
    if lemmatize:
        lemmatizer = WordNetLemmatizer()
        word_list = nltk.word_tokenize(text)
        text = ' '.join([lemmatizer.lemmatize(word) for word in word_list])
    
    return text


def load_nlp_models():
    
    #Load spaCy models required for named entity recognition
    try:
        nlp = spacy.load("en_core_web_sm")
    except:
        try:
            spacy.cli.download("en_core_web_sm")
            nlp = spacy.load("en_core_web_sm")
        except:
            nlp = None
            
    return nlp


def analyze_with_tfidf(title, description):
    """
    Analyze book text using TF-IDF comparison against genre-specific vocabulary.
    
    Algorithm:
    1. Define genre-specific keyword sets
    2. Preprocess the book text (title + description)
    3. Create TF-IDF vectors for genre keywords and book text
    4. Calculate cosine similarity between book vector and each genre vector
    5. Convert similarities to confidence scores and return top matches
    """

    try:
        # Define genre keyword sets
        genre_keywords = {
            'Fantasy': 'magic wizard dragon elf quest sword magical kingdom witch sorcery myth fantasy',
            'Science Fiction': 'space alien future technology robot dystopian sci-fi futuristic planet spacecraft',
            'Mystery/Thriller': 'murder detective crime case investigation killer suspense clue mystery conspiracy',
            'Romance': 'love relationship passion romantic heart affair marriage emotion desire dating romance',
            'Historical Fiction': 'century historical period king queen ancient war empire era medieval history',
            'Horror': 'fear terror ghost scary monster supernatural haunt nightmare blood evil dark horror',
            'Young Adult': 'teen school young coming-of-age adolescent teenage youth friendship high-school',
            'Children\'s': 'child kid young picture-book learning bedtime simple adventure colorful illustrated',
            'Biography/Memoir': 'life autobiography personal real journey memoir experience story true figure',
            'Self Help': 'improve success happiness guide advice life motivation habit inspiration growth',
            'Business': 'market company entrepreneur success management leadership strategy finance career investment',
            'Dystopian': 'dystopia future society control survival oppression rebellion totalitarian apocalyptic regime'
        }
        
        # Preprocess text
        processed_text = preprocess_text(f"{title} {description}")
        
        # Create TF-IDF vectorizer
        vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')
        
        # Create corpus with genre keywords and the book text
        corpus = list(genre_keywords.values())
        corpus.append(processed_text)
        
        # Calculate TF-IDF
        tfidf_matrix = vectorizer.fit_transform(corpus)
        
        # Calculate similarity between book and each genre
        last_row_index = tfidf_matrix.shape[0] - 1
        similarities = cosine_similarity(tfidf_matrix[last_row_index], tfidf_matrix[:-1])[0]
        
        # Map similarities to genres
        genres = {}
        for i, genre in enumerate(genre_keywords.keys()):
            # Convert similarity scores to a more intuitive range (0-10)
            score = int(similarities[i] * 10)
            if score > 3:  # Only consider reasonable matches
                genres[genre] = score
                
        return genres
    except:
        return {}


def extract_named_entities(title, description):
    #Extract named entities from book text, use spaCy NLP model to process text and extract named entities before and mapping them to potential genres.
    
    try:
        nlp = load_nlp_models()
        if not nlp:
            return {}
            
        # Process text with spaCy
        doc = nlp(f"{title} {description}")
        
        # Extract entities
        entities = [ent.text.lower() for ent in doc.ents]
        
        # Define entity-genre associations
        entity_genre_map = {
            'fantasy': ['magic', 'wizard', 'dragon', 'elf', 'fairy', 'kingdom', 'quest', 'sorcerer'],
            'science fiction': ['space', 'planet', 'alien', 'robot', 'future', 'technology'],
            'historical fiction': ['century', 'king', 'queen', 'empire', 'war', 'battle', 'medieval', 'ancient'],
            'biography': ['life', 'biography', 'autobiography', 'memoir', 'president', 'politician', 'artist'],
            'science': ['research', 'experiment', 'theory', 'physics', 'biology', 'chemistry', 'scientist'],
            'religion': ['god', 'church', 'bible', 'faith', 'spiritual', 'religion', 'prayer']
        }
        
        # Find genres based on entities
        genres = {}
        for entity in entities:
            for genre, keywords in entity_genre_map.items():
                if any(keyword in entity for keyword in keywords):
                    standardized_genre = standardize_genre(genre)
                    genres[standardized_genre] = genres.get(standardized_genre, 0) + 1
                    
        return genres
    except:
        return {}

def standardize_genre(genre):
    #Standardize genre names to a consistent format.
    
    genre_map = {
        'fantasy': 'Fantasy',
        'science fiction': 'Science Fiction',
        'historical fiction': 'Historical Fiction',
        'biography': 'Biography/Memoir',
        'science': 'Science',
        'religion': 'Religion/Spirituality'
    }
    return genre_map.get(genre.lower(), genre.title())


def combine_all_genre_signals(shelf_genres, nlp_genres, min_confidence):
    #Combine genre signals from different sources with appropriate weights.
    
    combined_scores = Counter()
    
    # Add shelf genres with highest weight (explicit human categorization)
    for genre, score in shelf_genres.items():
        combined_scores[genre] += score * 3
    
    # Add NLP-detected genres with medium weight
    for genre, score in nlp_genres.items():
        combined_scores[genre] += score * 2
    
    # Sort by final score
    sorted_genres = sorted(combined_scores.items(), key=lambda x: x[1], reverse=True)
    
    # Filter by minimum confidence
    final_genres = [genre for genre, score in sorted_genres if score >= min_confidence]
    
    # Fallback if no confident genres
    if not final_genres and sorted_genres:
        final_genres = [sorted_genres[0][0]]
    
    return final_genres

#### Detecting Suitable Age Range

To detect age ranges, we first:

1. Extract age ranges from `popular_shelves` (most reliable signal)
2. Failing that, we estimate the targeted age range by approximating the difficulty of the book (1) using the number of pages `num_pages`, (2) complexity of language in the book's `title` and `description` and (3) analysis of possible themes.

In [7]:
def detect_age_range(book_data):
    if book_data.get('popular_shelves') is None:
        book_data['popular_shelves'] = []
    
    title = str(book_data.get('title', ''))
    description = str(book_data.get('description', ''))
    num_pages = book_data['num_pages']

    try:
        num_pages = int(book_data['num_pages'])
    except (KeyError, ValueError, TypeError):
        num_pages = 0
    
    age_scores = {
        '0-5': 0,
        '5-10': 0,
        '10-15': 0,
        '15+': 0
    }
    
    title_lower = title.lower()
    description_lower = description.lower()
    
    # Advanced NLP analysis of title and description
    title_complexity = analyze_text_complexity(title)
    desc_complexity = analyze_text_complexity(description)
    
    # Apply title complexity to age scores
    if title_complexity < 0.3:
        age_scores['0-5'] += 10
        age_scores['5-10'] += 5
    elif title_complexity < 0.5:
        age_scores['5-10'] += 8
        age_scores['0-5'] += 4
    elif title_complexity < 0.7:
        age_scores['10-15'] += 8
        age_scores['5-10'] += 4
    else:
        age_scores['15+'] += 8
        age_scores['10-15'] += 4
    
    # Apply description complexity to age scores
    if desc_complexity < 0.3:
        age_scores['0-5'] += 12
        age_scores['5-10'] += 6
    elif desc_complexity < 0.5:
        age_scores['5-10'] += 10
        age_scores['0-5'] += 5
    elif desc_complexity < 0.7:
        age_scores['10-15'] += 10
        age_scores['5-10'] += 5
    else:
        age_scores['15+'] += 12
        age_scores['10-15'] += 6
    
    # POS tag patterns analysis for age appropriateness
    pos_patterns = analyze_pos_patterns(description)
    for age_range, score in pos_patterns.items():
        age_scores[age_range] += score
    
    board_book_terms = ['board book', 'bedtime', 'goodnight', 'naptime', 'toddler', 'baby', 
                        'alphabet', 'counting', 'colors', 'shapes', 'lullaby', 'nursery']
    
    if any(term in title_lower or term in description_lower for term in board_book_terms):
        age_scores['0-5'] += 12
        age_scores['5-10'] -= 5
    
    early_reader_terms = ['early reader', 'beginning reader', 'learn to read', 'level reader',
                          'first reader', 'step into reading', 'i can read', 'reading level']
    
    if any(term in title_lower or term in description_lower for term in early_reader_terms):
        age_scores['5-10'] += 12
        age_scores['0-5'] -= 2
    
    grade_terms = {
        '0-5': ['preschool', 'pre-k', 'kindergarten'],
        '5-10': ['grade 1', 'grade 2', 'grade 3', 'grade 4', 'first grade', 'second grade', 
                'third grade', 'fourth grade', 'fifth grade', 'elementary'],
        '10-15': ['grade 5', 'grade 6', 'grade 7', 'grade 8', 'middle school', 'middle-grade', 
                 'middle grade', 'tween'],
        '15+': ['grade 9', 'grade 10', 'grade 11', 'grade 12', 'high school', 'teen', 'young adult',
               'ya', 'college', 'university']
    }
    
    for age_range, terms in grade_terms.items():
        if any(term in title_lower or term in description_lower for term in terms):
            age_scores[age_range] += 10
    
    if num_pages <= 32:
        age_scores['0-5'] += 15
        age_scores['5-10'] -= 3
    elif 33 <= num_pages <= 48:
        age_scores['0-5'] += 10
        age_scores['5-10'] += 5
    elif 49 <= num_pages <= 80:
        age_scores['5-10'] += 12
        age_scores['0-5'] -= 2
    elif 81 <= num_pages <= 120:
        age_scores['5-10'] += 8
        age_scores['10-15'] += 4
    elif 121 <= num_pages <= 200:
        age_scores['10-15'] += 8
        age_scores['5-10'] += 4
    elif 201 <= num_pages <= 350:
        age_scores['10-15'] += 10
        age_scores['15+'] += 5
    elif num_pages > 350:
        age_scores['15+'] += 12
        age_scores['10-15'] += 6

    # Content theme analysis with increased weight for theme matches
    content_themes = analyze_text_themes(title_lower, description_lower)
    for age_range, score in content_themes.items():
        age_scores[age_range] += score * 1.5
    
    # Shelf analysis
    shelves = book_data.get('popular_shelves', [])
    shelf_age_indicators = analyze_shelves_for_age(shelves)
    for age_range, score in shelf_age_indicators.items():
        age_scores[age_range] += score
    
    max_score = max(age_scores.values())
    final_age_range = max(age_scores.items(), key=lambda x: x[1])[0]
    
    return final_age_range

def analyze_text_complexity(text):
    if not text or len(text) < 5:
        return 0.5
    
    try:
        sentences = sent_tokenize(text)
        words = word_tokenize(text)
        
        if not sentences or not words:
            return 0.5
        
        avg_sentence_length = len(words) / max(1, len(sentences))
        avg_word_length = sum(len(word) for word in words if word.isalpha()) / max(1, len([w for w in words if w.isalpha()]))
        
        # Calculate lexical diversity (larger vocabulary suggests more complex text)
        unique_words = len(set(word.lower() for word in words if word.isalpha()))
        lexical_diversity = unique_words / max(1, len([w for w in words if w.isalpha()]))
        
        # Calculate percentage of complex words (words with 3+ syllables)
        complex_words = sum(1 for word in words if word.isalpha() and textstat.syllable_count(word) >= 3)
        complex_words_pct = complex_words / max(1, len([w for w in words if w.isalpha()]))
        
        # Weighted complexity score
        complexity_score = (
            (avg_sentence_length / 25) * 0.3 + 
            (avg_word_length / 7) * 0.2 + 
            lexical_diversity * 0.25 + 
            complex_words_pct * 0.25
        )
        
        return min(1.0, complexity_score)
    except:
        return 0.5

def analyze_pos_patterns(text):
    try:
        age_patterns = {
            '0-5': 0,
            '5-10': 0,
            '10-15': 0,
            '15+': 0
        }
        
        # Get POS tags
        tokens = word_tokenize(text.lower())
        tagged = pos_tag(tokens)
        
        # Count parts of speech
        pos_counts = Counter(tag for word, tag in tagged)
        total_tokens = len(tagged)
        
        if total_tokens == 0:
            return age_patterns
        
        # Simple sentence structure (mainly nouns and verbs) - for young children
        simple_structure = (pos_counts.get('NN', 0) + pos_counts.get('NNS', 0) + 
                           pos_counts.get('VB', 0) + pos_counts.get('VBZ', 0) + 
                           pos_counts.get('VBP', 0)) / total_tokens
        
        # Complex sentence markers (conjunctions, relative pronouns, etc.)
        complex_markers = (pos_counts.get('IN', 0) + pos_counts.get('WDT', 0) + 
                          pos_counts.get('WP', 0) + pos_counts.get('WRB', 0)) / total_tokens
        
        # Advanced language features (adjectives, adverbs, etc.)
        advanced_features = (pos_counts.get('JJ', 0) + pos_counts.get('JJR', 0) + 
                            pos_counts.get('JJS', 0) + pos_counts.get('RB', 0) + 
                            pos_counts.get('RBR', 0) + pos_counts.get('RBS', 0)) / total_tokens
        
        # Score assignment based on POS patterns
        if simple_structure > 0.6 and complex_markers < 0.1:
            age_patterns['0-5'] += 8
            age_patterns['5-10'] += 4
        elif simple_structure > 0.5 and complex_markers < 0.15:
            age_patterns['5-10'] += 7
            age_patterns['0-5'] += 3
        elif complex_markers > 0.15 and advanced_features > 0.2:
            age_patterns['10-15'] += 6
            age_patterns['15+'] += 3
        elif complex_markers > 0.2 and advanced_features > 0.25:
            age_patterns['15+'] += 8
            age_patterns['10-15'] += 4
        
        return age_patterns
    except:
        return {
            '0-5': 0,
            '5-10': 0,
            '10-15': 0,
            '15+': 0
        }

def analyze_text_themes(title, description):
    combined_text = title + " " + description
    
    theme_scores = {
        '0-5': 0,
        '5-10': 0,
        '10-15': 0,
        '15+': 0
    }
    
    early_themes = ['sleep', 'bed', 'nap', 'dream', 'moon', 'star', 'night', 'bunny', 'teddy', 
                    'toy', 'farm', 'animal', 'cat', 'dog', 'duck', 'color', 'zoo', 'mommy', 
                    'daddy', 'parent', 'bath', 'diaper', 'potty', 'train', 'truck', 'car', 
                    'alphabet', 'abc', 'number', '123', 'count', 'rhyme']
    
    elementary_themes = ['school', 'teacher', 'friend', 'adventure', 'fun', 'magic', 'fairy', 
                        'dragon', 'dinosaur', 'spy', 'detective', 'mystery', 'solve', 'game', 
                        'play', 'team', 'sport', 'chapter', 'series', 'collect', 'comic', 
                        'joke', 'funny', 'humor', 'silly', 'prank', 'robot', 'space', 'science']
    
    middle_themes = ['friend', 'school', 'bully', 'crush', 'team', 'competition', 'journal', 
                    'diary', 'secret', 'club', 'grow', 'family', 'sibling', 'parent', 'problem', 
                    'solve', 'quest', 'mission', 'summer', 'camp', 'vacation', 'holiday', 
                    'fantasy', 'world', 'magic', 'spell', 'creature', 'monster', 'ghost']
    
    ya_themes = ['love', 'romance', 'relationship', 'kiss', 'boyfriend', 'girlfriend', 'dating', 
                'death', 'tragedy', 'war', 'battle', 'fight', 'survive', 'future', 'dystopian', 
                'apocalypse', 'society', 'rebellion', 'government', 'power', 'politics', 'identity', 
                'struggle', 'college', 'career', 'adult', 'mature', 'violence', 'blood']
    
    for theme in early_themes:
        if theme in combined_text:
            theme_scores['0-5'] += 1.5
    
    for theme in elementary_themes:
        if theme in combined_text:
            theme_scores['5-10'] += 1.5
    
    for theme in middle_themes:
        if theme in combined_text:
            theme_scores['10-15'] += 1.5
    
    for theme in ya_themes:
        if theme in combined_text:
            theme_scores['15+'] += 1.5
    
    return theme_scores

def analyze_shelves_for_age(shelves):
    shelf_patterns = {
        '0-5': ['picture book', 'board book', 'childrens', 'toddler', 'baby', 'preschool', 
               'bedtime', 'nursery', 'concept book'],
        '5-10': ['early reader', 'chapter book', 'childrens', 'kids', 'elementary', 'juvenile', 
                'easy reader'],
        '10-15': ['middle grade', 'middle-grade', 'tween', 'juvenile', 'preteen'],
        '15+': ['young adult', 'ya', 'teen', 'high school', 'new adult', 'adult']
    }
    
    shelf_scores = {
        '0-5': 0,
        '5-10': 0,
        '10-15': 0,
        '15+': 0
    }
    
    for shelf in shelves:
        shelf_name = shelf.get('name', '').lower()
        shelf_count = int(shelf.get('count', 0))
        
        for age_range, patterns in shelf_patterns.items():
            if any(pattern in shelf_name for pattern in patterns):
                shelf_scores[age_range] += min(12, math.log(shelf_count + 1) * 2)
    
    return shelf_scores

#### Combining Genre, Age and Description Signals

In [8]:
tfidf = TfidfVectorizer(stop_words = 'english') 
tfidf_matrix = tfidf.fit_transform(books['description']) #generating TF-IDF matrix

def get_content_recommendations (book_id, df = books, tfidf_matrix = tfidf_matrix, top_n = 3):
    book_row = df[df['book_id'] == book_id]
    index = df.index.get_loc(book_row.index[0])
    book_data = book_row.iloc[0].to_dict()
    
    target_genres = detect_book_genre_with_advanced_nlp(book_data)
    target_age_range = detect_age_range(book_data)

    sim_scores = cosine_similarity(tfidf_matrix[index], tfidf_matrix).flatten() #calculating cosine similarity

    for i in range(len(sim_scores)):
        if i == index:
            continue
        book_data_i = df.iloc[i].to_dict()
        genres_i = detect_book_genre_with_advanced_nlp(book_data_i)
        age_range_i = detect_age_range(book_data_i)

        if set(target_genres) & set(genres_i):
            sim_scores[i] *= 2
        if set(target_age_range) & set(age_range_i):
            sim_scores[i] *= 2        
    
    top_indices = np.argsort(sim_scores)[::1][1:top_n + 1]

    recommendations = df.iloc[top_indices][['book_id']]

    return recommendations


### 2.1.2 Collaborative Filtering

We will use K-Nearest Neighbours as an algorithm to perform Collaborative Filtering.

In [9]:
def create_user_item_matrix(df): #to create user-item matrix for collaborative filtering
    users = interactions['user_id'].nunique()
    items = interactions['book_id'].nunique()
    
    user_mapper = dict(zip(np.unique(interactions['user_id']), list(range(users))))
    user_inv_mapper = dict(zip(list(range(users)), np.unique(interactions['user_id'])))
    user_index = [user_mapper[i] for i in interactions['user_id']]

    
    item_mapper = dict(zip(np.unique(interactions['book_id']), list(range(items))))
    item_inv_mapper = dict(zip(list(range(items)), np.unique(interactions['book_id'])))
    item_index = [item_mapper[i] for i in interactions['book_id']]
    
    user_item_matrix = csr_matrix((interactions['rating'], (user_index, item_index)), shape = (users, items))
    return user_item_matrix, user_mapper, item_mapper, user_inv_mapper, item_inv_mapper

user_item_matrix, user_mapper, item_mapper, user_inv_mapper, item_inv_mapper = create_user_item_matrix(interactions)

def get_collaborative_recommendations(book_id, books = books, user_item_matrix = user_item_matrix, item_mapper = item_mapper, item_inv_mapper = item_inv_mapper, top_n = 3):
    user_item_matrix = user_item_matrix.T
    neighbor_ids = []
    recommendations = []

    item_ind = item_mapper[book_id]
    item_vec = user_item_matrix[item_ind]
    if isinstance(item_vec, (np.ndarray)):
        item_vec = item_vec.reshape(1,-1)

    kNN = NearestNeighbors(n_neighbors = top_n + 1, algorithm = "brute", metric = "cosine") #measuring similarity using K-Nearest-Neighbors
    kNN.fit(user_item_matrix)
    neighbor = kNN.kneighbors(item_vec, return_distance = False)
    for i in range (0, top_n + 1):
        n = neighbor.item(i)
        neighbor_ids.append(item_inv_mapper[n])
    neighbor_ids.pop(0)

    for id in neighbor_ids: #retrieving book titles
        recommended_books = books.loc[books['book_id'] == id, ['book_id','title']].values[0]
        recommendations.append({'book_id': recommended_books[0], 'title': recommended_books[1]})
        
    recommendations_df = pd.DataFrame(recommendations)
    return recommendations_df

## 2.2 Parsing Output

In [10]:
def get_recommendations(book_id, books = books, authors = authors):
    print("Fetching recommendations...")
    step0 = time.time()
    df1 = get_content_recommendations(book_id)
    step1 = time.time()
    print(f"Content recommendations loaded! Total time taken was {((step1 - step0) / 60):.4f} mins")
    df2 = get_collaborative_recommendations(book_id)
    step2 = time.time()
    print(f"Collaborative recommendations loaded! Total time taken was {((step2 - step1) / 60):.4f} mins")
    recommendations = pd.concat([df1, df2], ignore_index = True)

    result = []
    print('Writing to File...')

    for book_id in recommendations['book_id']:
        book_details = books[books['book_id'] == book_id].iloc[0]
        
        authors_list = book_details['authors']
        author_names = []
        for author in authors_list:
            author_id = author['author_id']
            author_name = get_name(author_id) 
            author_names.append(author_name)

        if len(author_names) > 1:
            concat_authors = " & ".join(author_names)
        else:
            concat_authors = author_names[0] if author_names else "Unknown"

        book_metadata = {
            "bookid": book_details['book_id'],
            "title": book_details['title'],
            "author": concat_authors,
            "coverimage": book_details['image_url']
        }

        result.append(book_metadata)

        with open('recommendations.json', 'w') as f:
            json.dump(result, f, indent=4)
    return result

## 2.3 Testing with Test Set

To ensure the algorithm works, we randomly select 3 books and compute their recommendations list.

In [11]:
print(f"Time taken for loading books dataset: {duration_loading_books:.4f} mins")
print(f"Time taken for loading interactions dataset: {duration_loading_interactions:.4f} mins")

Time taken for loading books dataset: 0.1788 mins
Time taken for loading interactions dataset: 2.3806 mins


In [12]:
def generate_test_set(df, n = 1):
    selected_books = df.sample(n)
    return selected_books['book_id']
test_set = generate_test_set(books)
print(test_set)

206456    16074780
Name: book_id, dtype: object


In [13]:
for book in test_set:
    recommendations = get_recommendations(book)
    print(recommendations)

Fetching recommendations...
Content recommendations loaded! Total time taken was 5.0150 mins
Collaborative recommendations loaded! Total time taken was 0.0252 mins
Writing to File...
[{'bookid': '18670062', 'title': 'Dream Dark (Caster Chronicles, #2.5)', 'author': 'Kami Garcia & Margaret Stohl', 'coverimage': 'https://images.gr-assets.com/books/1381774112m/18670062.jpg'}, {'bookid': '645967', 'title': 'Eggs Mark the Spot', 'author': 'Mary Jane Auch', 'coverimage': 'https://images.gr-assets.com/books/1298825696m/645967.jpg'}, {'bookid': '2834400', 'title': "My Grandmother's Stories: A Collection of Jewish Folk Tales", 'author': 'Adele Geras & Jael Jordan (Illustrator)', 'coverimage': 'https://s.gr-assets.com/assets/nophoto/book/111x148-bcc042a9c91a29c1d680899eff700a03.png'}, {'bookid': '24885888', 'title': 'The Boy I Love', 'author': 'Nina de Gramont', 'coverimage': 'https://images.gr-assets.com/books/1441586936m/24885888.jpg'}, {'bookid': '30313378', 'title': "The You I've Never Known

# Section 3: Evaluation

From exploratory data analysis, we deduce that the following list are the top 10 most rated books. We thus decide to evaluate the hybrid recommendation model using the Precision@6 metric. To do this, we used the top 10 most rated books, running the recommender for each test book to generate the 6 recommended titles. We then manually assessed how much of these 6 recommendations are genuinely relevant, and record that number. We then averaged across all ten books to get the overall precision score.

In [14]:
books['book_id'] = books['book_id'].astype(str)
test_books = ["2767052","41865","5","6148028","11870085","7260188", "13335037","49041","428263","1162543"]

In [15]:
def evaluate_recommender(test_books, books = books, authors = authors):
    records = []
    
    for book_id in test_books:
        try:
            test_title = books.loc[books['book_id'] == book_id, 'title'].values[0]
        except IndexError:
            print(f"Book ID {book_id} not found in books dataframe.")
            continue

        try:
            recs = get_recommendations(book_id, books=books, authors=authors)
        except Exception as e:
            print(f"Error generating recommendations for book_id {book_id}: {e}")
            continue
        
        # Extract the titles of the top 6 recommendations
        top_titles = [rec['title'] for rec in recs[:6]]
        
        # Pad with None if fewer than 6 recommendations
        while len(top_titles) < 6:
            top_titles.append(None)
        
        row = {
            'test_book_title': test_title,
            'recommendation 1': top_titles[0],
            'recommendation 2': top_titles[1],
            'recommendation 3': top_titles[2],
            'recommendation 4': top_titles[3],
            'recommendation 5': top_titles[4],
            'recommendation 6': top_titles[5],
        }

        records.append(row)

    return pd.DataFrame(records)

test_recommendations = evaluate_recommender(test_books)
display(test_recommendations)

Fetching recommendations...
Content recommendations loaded! Total time taken was 4.9561 mins
Collaborative recommendations loaded! Total time taken was 0.0227 mins
Writing to File...
Fetching recommendations...
Content recommendations loaded! Total time taken was 4.9032 mins
Collaborative recommendations loaded! Total time taken was 0.0202 mins
Writing to File...
Fetching recommendations...
Content recommendations loaded! Total time taken was 4.8549 mins
Collaborative recommendations loaded! Total time taken was 0.0218 mins
Writing to File...
Fetching recommendations...
Content recommendations loaded! Total time taken was 4.9250 mins
Collaborative recommendations loaded! Total time taken was 0.0192 mins
Writing to File...
Fetching recommendations...
Content recommendations loaded! Total time taken was 5.1777 mins
Collaborative recommendations loaded! Total time taken was 0.0225 mins
Writing to File...
Fetching recommendations...
Content recommendations loaded! Total time taken was 5.39

Unnamed: 0,test_book_title,recommendation 1,recommendation 2,recommendation 3,recommendation 4,recommendation 5,recommendation 6
0,"The Hunger Games (The Hunger Games, #1)",Being Me With OCD: How I Learned to Obsess Les...,"Everwild (Skinjacker, #2)","Brave the Betrayal (Everworld, #8)","Catching Fire (The Hunger Games, #2)","Mockingjay (The Hunger Games, #3)","Twilight (Twilight, #1)"
1,"Twilight (Twilight, #1)",Peggy Sue Contra Los Invisibles,"Supercomputer (Choose Your Own Adventure, #39)",Charlie Bone und das Geheimnis der sprechenden...,"New Moon (Twilight, #2)","Eclipse (Twilight, #3)","Breaking Dawn (Twilight, #4)"
2,Harry Potter and the Prisoner of Azkaban (Harr...,"Katso eteesi, Lotta!",The tale of Jemima puddle-duck,"Sjælefanger (Riley Bloom, #1)","Catching Fire (The Hunger Games, #2)","The Hunger Games (The Hunger Games, #1)","Mockingjay (The Hunger Games, #3)"
3,"Catching Fire (The Hunger Games, #2)","Alice's adventures in Wonderland ; and, Throug...",The Aeneid for Boys and Girls,"Brave the Betrayal (Everworld, #8)","Mockingjay (The Hunger Games, #3)","The Hunger Games (The Hunger Games, #1)","Insurgent (Divergent, #2)"
4,The Fault in Our Stars,"Ondine (Ondine Quartet, #0.5)","Ölmem Gerekirse (Revenants, #3)",Itsy Bitsy Spider,Looking for Alaska,"The Hunger Games (The Hunger Games, #1)","Divergent (Divergent, #1)"
5,"Mockingjay (The Hunger Games, #3)","Katso eteesi, Lotta!","Twinkle, Twinkle, Little Star",Itsy Bitsy Spider,"Catching Fire (The Hunger Games, #2)","The Hunger Games (The Hunger Games, #1)","Insurgent (Divergent, #2)"
6,"Divergent (Divergent, #1)",Kiki and Roo,Bloß nicht blinzeln!,The Princess and Curdie,"Insurgent (Divergent, #2)","The Hunger Games (The Hunger Games, #1)","Allegiant (Divergent, #3)"
7,"New Moon (Twilight, #2)",Fear Hall: The Conclusion,Sit Still!,Einstein the Class Hamster and the Very Real G...,"Eclipse (Twilight, #3)","Breaking Dawn (Twilight, #4)","Twilight (Twilight, #1)"
8,"Eclipse (Twilight, #3)","Insurgent (Divergent, #2)",Half bad - Det mörka ödet (The Half Bad Trilog...,The Island,"New Moon (Twilight, #2)","Breaking Dawn (Twilight, #4)","Twilight (Twilight, #1)"
9,"Breaking Dawn (Twilight, #4)",Sun Up,The Mystery of the Burnt Cottage (The Five Fin...,Woof!,"Eclipse (Twilight, #3)","New Moon (Twilight, #2)","Twilight (Twilight, #1)"
