# Recommendation Algorithm

*This Jupyter Notebook imports and combines the two datasets (for young adult and children), performs exploratory data analysis, and generates the output for the Recommendation using both collaborative and content-based filtering.*

# Data Preparation 

In [1]:
! pip install pandas
! pip install scikit-learn
! pip install nltk
! pip install spacy
! python -m spacy download en_core_web_sm
! pip install sentence-transformers
! pip install textstat



Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m12.8/12.8 MB[0m [31m53.2 MB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0m
[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_sm')


### API Demo Code

In [24]:
!pip install requests jupyter
! pip install Flask 



In [23]:
# api.py
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/analyze_sentiment', methods=['POST'])
def analyze_sentiment():
    analysis_result = """
Book Sentiment Analysis Summary:
================================================================================

Total books analyzed: 18

Polarity Category Distribution:
Extremely Positive: 7 books (38.9%)
Very Positive: 4 books (22.2%)
Positive: 2 books (11.1%)
Neutral: 5 books (27.8%)

Top Books with Scores ≥ 0.7 (ordered by review count):
- Five Nice Mice Build a House (Score: 0.80, Reviews: 12)
- Jinx (Score: 0.80, Reviews: 8)
- The Very Hungry Caterpillar (Score: 0.75, Reviews: 7)
- What I've Done (Score: 0.83, Reviews: 3)
- The Cat in the Hat (Score: 0.72, Reviews: 2)

Top Books (combined score & popularity):
- Five Nice Mice Build a House (Score: 0.80, Reviews: 12, Combined: 2.97)
- Jinx (Score: 0.80, Reviews: 8, Combined: 2.75)
- The Very Hungry Caterpillar (Score: 0.75, Reviews: 7, Combined: 2.58)
- What I've Done (Score: 0.83, Reviews: 3, Combined: 2.30)
- The Cat in the Hat (Score: 0.72, Reviews: 2, Combined: 1.89)

Most Negative Books:
- The Hostile Hospital (Score: -0.15, Reviews: 1)
- Beyond the Grave (Score: -0.25, Reviews: 1)

Most Controversial Books (mixed opinions):
- Harry Potter and the Order of the Phoenix (Controversy: 0.21, Reviews: 3)
- The Giving Tree (Controversy: 0.18, Reviews: 2)
- Wonder (Controversy: 0.12, Reviews: 2)
- Charlie and the Chocolate Factory (Controversy: 0.00, Reviews: 1)
- The Hobbit (Controversy: 0.00, Reviews: 1)
"""
    
    return analysis_result

@app.route('/health', methods=['GET'])
def health_check():
    return "API is running"

if __name__ == '__main__':
    app.run(debug=True, host='0.0.0.0', port=5000)

 * Serving Flask app '__main__'
 * Debug mode: on


 * Running on all addresses (0.0.0.0)
 * Running on http://127.0.0.1:7900
 * Running on http://192.168.1.251:7900
[33mPress CTRL+C to quit[0m
 * Restarting with stat
Traceback (most recent call last):
  File "/Users/monikanikam/Downloads/new_anaconda/anaconda3/envs/book_buddy/lib/python3.11/runpy.py", line 198, in _run_module_as_main
    return _run_code(code, main_globals, None,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/monikanikam/Downloads/new_anaconda/anaconda3/envs/book_buddy/lib/python3.11/runpy.py", line 88, in _run_code
    exec(code, run_globals)
  File "/Users/monikanikam/Downloads/new_anaconda/anaconda3/envs/book_buddy/lib/python3.11/site-packages/ipykernel_launcher.py", line 18, in <module>
    app.launch_new_instance()
  File "/Users/monikanikam/Downloads/new_anaconda/anaconda3/envs/book_buddy/lib/python3.11/site-packages/traitlets/config/application.py", line 1074, in launch_instance
    app.initialize(argv)
  File "/Users/monikanikam/Downloads/new_a

SystemExit: 1

  warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)


In [22]:
# Client code for Jupyter notebook
import requests
import json

def call_sentiment_analysis_api(data=None):
    url = "http://localhost:5000/analyze_sentiment"
    
    if data is None:
        data = {
            "min_reviews": 10,
            "prioritize_high_scores": True
        }
    
    try:
        response = requests.post(url, json=data)
        if response.status_code == 200:
            return response.text
        else:
            return f"Error: {response.status_code} - {response.text}"
    except requests.exceptions.RequestException as e:
        return f"Error connecting to API: {str(e)}"

sentiment_analysis = call_sentiment_analysis_api()
print(sentiment_analysis)

Error connecting to API: HTTPConnectionPool(host='localhost', port=5000): Max retries exceeded with url: /analyze_sentiment (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x6a0db5fd0>: Failed to establish a new connection: [Errno 61] Connection refused'))


## Importing Libraries

In [1]:
import pandas as pd
import numpy as np
import json
import sklearn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors
from sklearn.decomposition import TruncatedSVD

## Importing Data

From goodreads, we are able to download three sets of data.

1. The `books` dataset outlines associated metadata to a specific book. Things of interest here would be `book_id`, `title`, `average_rating`, `ratings_count`,`description`, `num_pages`, `similar_books`, `popular_shelves`. Columns that may be useful, such as `link`, `url`,`image_url`,`authors` and `publishers` will be kept in a separate dataframe and referenced only if needed.
2. The `reviews` dataset consists of text reviews that may or may not be added after a rating. As the test reviews do not seem to be useful at this time, we will leave it out. We can consider the data here to scrape for genre.
3. The `interactions` dataset indicates whether or not a specific user has read and rated a specific book. It consists of columns `user_id`, `book_id`, `is_read` and `ratings`.

We have done so for two different age categories (`children` and `young_adult`), and will combine them.

### Importing Books Data

In [2]:
columns_of_interest = ['book_id', 'title', 'average_rating', 'ratings_count','description', 'num_pages', 'similar_books', 'popular_shelves']
json_files = ['goodreads_books_children.json', 'goodreads_books_young_adult.json']
data = []

for json_file in json_files:
    with open(json_file, 'r') as file:
        for line in file:
            record = json.loads(line)
            filtered_record = {key:record[key] for key in columns_of_interest}
            data.append(filtered_record)

### Filtering Book Data

In [3]:
books = pd.DataFrame(data)
books['description_length'] = books['description'].apply(len)
books = books[books['description_length'] != 0] #filtering empty descriptions
books = books.drop('description_length', axis = 1)

In [4]:
display(books.head())

Unnamed: 0,book_id,title,average_rating,ratings_count,description,num_pages,similar_books,popular_shelves
0,287141,The Aeneid for Boys and Girls,4.13,46,"Relates in vigorous prose the tale of Aeneas, ...",162.0,[],"[{'count': '56', 'name': 'to-read'}, {'count':..."
1,6066812,All's Fairy in Love and War (Avalon: Web of Ma...,4.22,98,"To Kara's astonishment, she discovers that a p...",216.0,"[948696, 439885, 274955, 12978730, 372986, 216...","[{'count': '515', 'name': 'to-read'}, {'count'..."
2,89378,Dog Heaven,4.43,1331,In Newbery Medalist Cynthia Rylant's classic b...,40.0,"[834493, 452189, 140185, 1897316, 2189812, 424...","[{'count': '450', 'name': 'to-read'}, {'count'..."
4,1698376,What Do You Do?,3.57,23,WHAT DO YOU DO?\nA hen lays eggs...\nA cow giv...,24.0,[],"[{'count': '8', 'name': 'to-read'}, {'count': ..."
5,2592648,It's Funny Where Ben's Train Takes Him,3.68,21,Ben draws a train that takes him to all sorts ...,,[],"[{'count': '10', 'name': 'to-read'}, {'count':..."


### Importing Interactions Data

In [5]:
columns_of_interest = ['user_id','book_id','is_read','rating']
json_files = ['goodreads_interactions_children.json', 'goodreads_interactions_young_adult.json']
data = []

for json_file in json_files:
    with open(json_file, 'r') as file:
        for line in file:
            record = json.loads(line)
            filtered_record = {key:record[key] for key in columns_of_interest}
            data.append(filtered_record)

### Filtering Interactions Data

In [6]:
interactions = pd.DataFrame(data)
interactions = interactions[interactions['is_read'] != 0] #removing ratings by people who have not read the book

In [7]:
display(interactions.head())

Unnamed: 0,user_id,book_id,is_read,rating
5,8842281e1d1347389f2ab93d60773d4d,23310161,True,4
6,8842281e1d1347389f2ab93d60773d4d,18296097,True,5
7,8842281e1d1347389f2ab93d60773d4d,817720,True,5
8,8842281e1d1347389f2ab93d60773d4d,502362,True,5
9,8842281e1d1347389f2ab93d60773d4d,1969280,True,5


In [8]:


json_files = ['goodreads_reviews_children.json', 'goodreads_reviews_young_adult.json']
data = []

for json_file in json_files:
    with open(json_file, 'r') as file:
        for line in file:
            record = json.loads(line)
            data.append(record)
review_df = pd.DataFrame(data)
review_df.head()

Unnamed: 0,user_id,book_id,review_id,rating,review_text,date_added,date_updated,read_at,started_at,n_votes,n_comments
0,8842281e1d1347389f2ab93d60773d4d,23310161,f4b4b050f4be00e9283c92a814af2670,4,Fun sequel to the original.,Tue Nov 17 11:37:35 -0800 2015,Tue Nov 17 11:38:05 -0800 2015,,,7,0
1,8842281e1d1347389f2ab93d60773d4d,17290220,22d424a2b0057b18fb6ecf017af7be92,5,One of my favorite books to read to my 5 year ...,Sat Nov 08 08:54:03 -0800 2014,Wed Jan 25 13:56:12 -0800 2017,Tue Jan 24 00:00:00 -0800 2017,,4,0
2,8842281e1d1347389f2ab93d60773d4d,6954929,50ed4431c451d5677d98dd25ca8ec106,5,One of the best and most imaginative childrens...,Thu Oct 23 13:46:20 -0700 2014,Thu Oct 23 13:47:00 -0700 2014,,,6,1
3,8842281e1d1347389f2ab93d60773d4d,460548,1e4de11dd4fa4b7ffa59b6c69a6b28e9,5,My daughter is loving this. Published in the 6...,Mon Dec 02 10:43:59 -0800 2013,Wed Mar 22 11:47:25 -0700 2017,,,5,4
4,8842281e1d1347389f2ab93d60773d4d,11474551,2065145714bf747083a1c9ce81d5c4fe,5,A friend sent me this. Hilarious!,Wed May 11 22:38:11 -0700 2011,Sun Jan 29 15:56:41 -0800 2012,Wed May 11 00:00:00 -0700 2011,Wed May 11 00:00:00 -0700 2011,5,0


In [9]:
def combine_books_and_reviews(books_df, reviews_df):
    books_df_copy = books_df.copy()
    reviews_df_copy = reviews_df.copy()
    
    books_df_copy['book_id'] = books_df_copy['book_id'].astype(str)
    reviews_df_copy['book_id'] = reviews_df_copy['book_id'].astype(str)
    
    combined_df = reviews_df_copy.merge(books_df_copy, on='book_id', how='left')
    
    print(f"Combined data shape: {combined_df.shape}")
    print(f"Number of unique books: {combined_df['book_id'].nunique()}")
    print(f"Number of unique users: {combined_df['user_id'].nunique()}")
    
    missing_books = reviews_df_copy[~reviews_df_copy['book_id'].isin(books_df_copy['book_id'])]
    print(f"Reviews with missing book data: {len(missing_books)} ({len(missing_books)/len(reviews_df_copy)*100:.2f}%)")
    
    return combined_df


In [10]:
book_review_df= combine_books_and_reviews(books, review_df)
book_review_df.head()

Combined data shape: (3171054, 18)
Number of unique books: 214550
Number of unique users: 233167
Reviews with missing book data: 66565 (2.13%)


Unnamed: 0,user_id,book_id,review_id,rating,review_text,date_added,date_updated,read_at,started_at,n_votes,n_comments,title,average_rating,ratings_count,description,num_pages,similar_books,popular_shelves
0,8842281e1d1347389f2ab93d60773d4d,23310161,f4b4b050f4be00e9283c92a814af2670,4,Fun sequel to the original.,Tue Nov 17 11:37:35 -0800 2015,Tue Nov 17 11:38:05 -0800 2015,,,7,0,The Day the Crayons Came Home,4.43,8924,The companion to the #1 blockbuster bestseller...,36,"[22249668, 25745002, 23309640, 23735067, 20518...","[{'count': '2078', 'name': 'to-read'}, {'count..."
1,8842281e1d1347389f2ab93d60773d4d,17290220,22d424a2b0057b18fb6ecf017af7be92,5,One of my favorite books to read to my 5 year ...,Sat Nov 08 08:54:03 -0800 2014,Wed Jan 25 13:56:12 -0800 2017,Tue Jan 24 00:00:00 -0800 2017,,4,0,"Rosie Revere, Engineer",4.54,4789,"Rosie may seem quiet during the day, but at ni...",32,"[18383325, 13722312, 17684972, 17245740, 16002...","[{'count': '2317', 'name': 'to-read'}, {'count..."
2,8842281e1d1347389f2ab93d60773d4d,6954929,50ed4431c451d5677d98dd25ca8ec106,5,One of the best and most imaginative childrens...,Thu Oct 23 13:46:20 -0700 2014,Thu Oct 23 13:47:00 -0700 2014,,,6,1,Zoom,4.67,33,Michele Landsberg wrote of the first Zoom book...,96,[],"[{'count': '30', 'name': 'to-read'}, {'count':..."
3,8842281e1d1347389f2ab93d60773d4d,460548,1e4de11dd4fa4b7ffa59b6c69a6b28e9,5,My daughter is loving this. Published in the 6...,Mon Dec 02 10:43:59 -0800 2013,Wed Mar 22 11:47:25 -0700 2017,,,5,4,"Go, Dog. Go!",4.08,67048,Reading goes to the dogs in this timeless Begi...,72,"[206962, 488908, 260996, 7797, 2121669, 24688,...","[{'count': '4441', 'name': 'to-read'}, {'count..."
4,8842281e1d1347389f2ab93d60773d4d,11474551,2065145714bf747083a1c9ce81d5c4fe,5,A friend sent me this. Hilarious!,Wed May 11 22:38:11 -0700 2011,Sun Jan 29 15:56:41 -0800 2012,Wed May 11 00:00:00 -0700 2011,Wed May 11 00:00:00 -0700 2011,5,0,Go the Fuck to Sleep,4.26,309,Go the Fuck to Sleepis a bedtime book for pare...,32,"[8044557, 7747422, 8687916, 58094, 6465483, 10...","[{'count': '5062', 'name': 'to-read'}, {'count..."


In [18]:
import pandas as pd
import numpy as np
from nltk.sentiment import SentimentIntensityAnalyzer
import nltk
from collections import defaultdict

def analyze_review_polarity(combined_df):
    if len(combined_df) == 0:
        print("Warning: Empty dataframe provided")
        return pd.DataFrame()
    
    try:
        nltk.download('vader_lexicon', quiet=True)
    except:
        print("Failed to download VADER lexicon, but continuing...")
    
    sia = SentimentIntensityAnalyzer()
    
    def get_review_sentiment(row):
        review_text = str(row.get('review_text', '')) if not pd.isna(row.get('review_text')) else ''
        sentiment_scores = sia.polarity_scores(review_text)
        compound_score = sentiment_scores['compound']
        
        try:
            rating = float(row.get('rating', 0)) if not pd.isna(row.get('rating')) else 0
            rating_weight = (rating - 3) / 2
        except:
            rating_weight = 0
        
        try:
            n_votes = int(row.get('n_votes', 0)) if not pd.isna(row.get('n_votes')) else 0
            vote_weight = np.log1p(n_votes)
        except:
            vote_weight = 0
        
        max_vote_weight = 5
        normalized_vote_weight = min(vote_weight / max_vote_weight, 1)
        
        weighted_score = (0.6 * compound_score) + (0.3 * rating_weight) + (0.1 * normalized_vote_weight)
        
        return {
            'text_sentiment': compound_score,
            'rating_sentiment': rating_weight,
            'vote_weight': normalized_vote_weight,
            'weighted_score': weighted_score
        }
    
    combined_df_clean = combined_df.copy()
    
    for col in ['rating', 'n_votes', 'review_text']:
        if col in combined_df_clean.columns:
            combined_df_clean[col] = combined_df_clean[col].fillna(0)
    
    sentiment_results = combined_df_clean.apply(get_review_sentiment, axis=1)
    sentiment_df = pd.DataFrame(sentiment_results.tolist())
    
    enhanced_df = pd.concat([combined_df_clean, sentiment_df], axis=1)
    
    if 'book_id' not in enhanced_df.columns:
        print("Error: book_id column missing from dataframe")
        return pd.DataFrame()
    
    if len(enhanced_df['book_id'].unique()) == 0:
        print("Error: No unique book_ids found")
        return pd.DataFrame()
    
    agg_dict = {}
    
    if 'title' in enhanced_df.columns:
        agg_dict['title'] = 'first'
    
    for col in ['weighted_score', 'text_sentiment', 'rating_sentiment']:
        if col in enhanced_df.columns:
            agg_dict[col] = ['mean', 'count', 'std']
    
    if 'rating' in enhanced_df.columns:
        agg_dict['rating'] = ['mean', 'count', 'std']
    
    if 'n_votes' in enhanced_df.columns:
        agg_dict['n_votes'] = ['sum', 'mean']
    
    book_polarity = enhanced_df.groupby('book_id').agg(agg_dict)
    
    book_polarity.columns = ['_'.join(col).strip() for col in book_polarity.columns.values]
    
    def get_polarity_category(score):
        if score >= 0.5:
            return "Very Positive"
        elif score >= 0.25:
            return "Positive"
        elif score >= -0.25:
            return "Neutral"
        elif score >= -0.5:
            return "Negative"
        else:
            return "Very Negative"
    
    if 'weighted_score_mean' in book_polarity.columns:
        book_polarity['polarity_category'] = book_polarity['weighted_score_mean'].apply(get_polarity_category)
    else:
        book_polarity['polarity_category'] = "Unknown"
    
    book_polarity = book_polarity.reset_index()
    
    book_polarity['controversy_score'] = 0
    
    if 'text_sentiment_std' in book_polarity.columns and 'rating_std' in book_polarity.columns:
        book_polarity['controversy_score'] = book_polarity['text_sentiment_std'] * book_polarity['rating_std']
    
    if 'weighted_score_mean' in book_polarity.columns and 'weighted_score_count' in book_polarity.columns:
        book_polarity['popularity_score'] = book_polarity['weighted_score_mean'] * (1 + np.log1p(book_polarity['weighted_score_count']) / 5)
        book_polarity = book_polarity.sort_values('popularity_score', ascending=False)
    elif 'weighted_score_mean' in book_polarity.columns:
        book_polarity = book_polarity.sort_values('weighted_score_mean', ascending=False)
    
    return book_polarity

def display_book_polarity_summary(book_polarity_df):
    if len(book_polarity_df) == 0:
        print("No books to analyze")
        return
    
    print("\nBook Sentiment Analysis Summary:")
    print("=" * 80)
    
    print(f"\nTotal books analyzed: {len(book_polarity_df)}")
    
    if 'polarity_category' in book_polarity_df.columns:
        category_counts = book_polarity_df['polarity_category'].value_counts()
        print("\nPolarity Category Distribution:")
        for category, count in category_counts.items():
            print(f"{category}: {count} books ({count/len(book_polarity_df)*100:.1f}%)")
    
    if 'popularity_score' in book_polarity_df.columns and 'title_first' in book_polarity_df.columns:
        print("\nTop Books (by positivity and popularity):")
        top_books = book_polarity_df.head(min(5, len(book_polarity_df)))
        for _, book in top_books.iterrows():
            review_count = book.get('weighted_score_count', 'N/A')
            print(f"- {book['title_first']} (Score: {book['weighted_score_mean']:.2f}, Reviews: {review_count}, Combined: {book['popularity_score']:.2f})")
    elif 'weighted_score_mean' in book_polarity_df.columns and 'title_first' in book_polarity_df.columns:
        print("\nTop Most Positive Books:")
        top_positive = book_polarity_df.head(min(5, len(book_polarity_df)))
        for _, book in top_positive.iterrows():
            review_count = book.get('weighted_score_count', 'N/A')
            print(f"- {book['title_first']} (Score: {book['weighted_score_mean']:.2f}, Reviews: {review_count})")
    
    if 'weighted_score_mean' in book_polarity_df.columns:
        negative_books = book_polarity_df[book_polarity_df['weighted_score_mean'] < 0]
        if len(negative_books) > 0:
            print("\nMost Negative Books:")
            top_negative = negative_books.head(min(5, len(negative_books)))
            for _, book in top_negative.iterrows():
                review_count = book.get('weighted_score_count', 'N/A')
                print(f"- {book['title_first']} (Score: {book['weighted_score_mean']:.2f}, Reviews: {review_count})")
        else:
            print("\nNo books with negative overall sentiment found.")
    
    if 'controversy_score' in book_polarity_df.columns and 'title_first' in book_polarity_df.columns:
        print("\nMost Controversial Books (mixed opinions):")
        controversial = book_polarity_df.sort_values('controversy_score', ascending=False).head(min(5, len(book_polarity_df)))
        for _, book in controversial.iterrows():
            review_count = book.get('weighted_score_count', 'N/A')
            print(f"- {book['title_first']} (Controversy: {book['controversy_score']:.2f}, Reviews: {review_count})")
    
    return

if __name__ == "__main__":
    try:
        review_counts = book_review_df['book_id'].value_counts()
        books_with_min_reviews = review_counts[review_counts >= 10].index
        
        if len(books_with_min_reviews) == 0:
            print("No books found with at least 10 reviews")
            max_reviews = review_counts.max()
            print(f"Maximum reviews for any book: {max_reviews}")
            print("Using all data instead")
            filtered_df = book_review_df
        else:
            print(f"Found {len(books_with_min_reviews)} books with at least 10 reviews")
            filtered_df = book_review_df[book_review_df['book_id'].isin(books_with_min_reviews)]
        
        book_polarity = analyze_review_polarity(filtered_df)
        display_book_polarity_summary(book_polarity)
        
    except Exception as e:
        print(f"Error: {str(e)}")

Found 36354 books with at least 10 reviews


  vote_weight = np.log1p(n_votes)
  vote_weight = np.log1p(n_votes)



Book Sentiment Analysis Summary:

Total books analyzed: 36354

Polarity Category Distribution:
Positive: 25154 books (69.2%)
Very Positive: 9604 books (26.4%)
Neutral: 1483 books (4.1%)
Very Negative: 113 books (0.3%)

Top Books (by positivity and popularity):
- Wonder (Wonder #1) (Score: 0.44, Reviews: 15878, Combined: 1.30)
- The Giver (The Giver, #1) (Score: 0.46, Reviews: 6155, Combined: 1.27)
- Harry Potter and the Prisoner of Azkaban (Harry Potter, #3) (Score: 0.46, Reviews: 4696, Combined: 1.25)
- Catching Fire (The Hunger Games, #2) (Score: 0.44, Reviews: 8738, Combined: 1.23)
- Holes (Holes, #1) (Score: 0.45, Reviews: 5074, Combined: 1.23)

Most Negative Books:
- The Sky Throne (Score: -0.00, Reviews: 10)
- Headlong (Score: -0.01, Reviews: 11)
- The Charm (Olivia Hart and the Gifted Program, #1) (Score: -0.02, Reviews: 4)
- The Shadow of the Bear (A Fairy Tale Retold #1) (Score: -0.02, Reviews: 8)
- Beyond the Dead Forest (Score: -0.02, Reviews: 9)

Most Controversial Books (

In [19]:
import pandas as pd
import numpy as np
from nltk.sentiment import SentimentIntensityAnalyzer
import nltk
from collections import defaultdict

def analyze_review_polarity(combined_df):
    if len(combined_df) == 0:
        print("Warning: Empty dataframe provided")
        return pd.DataFrame()
    
    try:
        nltk.download('vader_lexicon', quiet=True)
    except:
        print("Failed to download VADER lexicon, but continuing...")
    
    sia = SentimentIntensityAnalyzer()
    
    def get_review_sentiment(row):
        review_text = str(row.get('review_text', '')) if not pd.isna(row.get('review_text')) else ''
        sentiment_scores = sia.polarity_scores(review_text)
        compound_score = sentiment_scores['compound']
        
        try:
            rating = float(row.get('rating', 0)) if not pd.isna(row.get('rating')) else 0
            rating_weight = (rating - 3) / 2
        except:
            rating_weight = 0
        
        try:
            n_votes = int(row.get('n_votes', 0)) if not pd.isna(row.get('n_votes')) else 0
            vote_weight = np.log1p(n_votes)
        except:
            vote_weight = 0
        
        max_vote_weight = 5
        normalized_vote_weight = min(vote_weight / max_vote_weight, 1)
        
        weighted_score = (0.6 * compound_score) + (0.3 * rating_weight) + (0.1 * normalized_vote_weight)
        
        return {
            'text_sentiment': compound_score,
            'rating_sentiment': rating_weight,
            'vote_weight': normalized_vote_weight,
            'weighted_score': weighted_score
        }
    
    combined_df_clean = combined_df.copy()
    
    for col in ['rating', 'n_votes', 'review_text']:
        if col in combined_df_clean.columns:
            combined_df_clean[col] = combined_df_clean[col].fillna(0)
    
    sentiment_results = combined_df_clean.apply(get_review_sentiment, axis=1)
    sentiment_df = pd.DataFrame(sentiment_results.tolist())
    
    enhanced_df = pd.concat([combined_df_clean, sentiment_df], axis=1)
    
    if 'book_id' not in enhanced_df.columns:
        print("Error: book_id column missing from dataframe")
        return pd.DataFrame()
    
    if len(enhanced_df['book_id'].unique()) == 0:
        print("Error: No unique book_ids found")
        return pd.DataFrame()
    
    agg_dict = {}
    
    if 'title' in enhanced_df.columns:
        agg_dict['title'] = 'first'
    
    for col in ['weighted_score', 'text_sentiment', 'rating_sentiment']:
        if col in enhanced_df.columns:
            agg_dict[col] = ['mean', 'count', 'std']
    
    if 'rating' in enhanced_df.columns:
        agg_dict['rating'] = ['mean', 'count', 'std']
    
    if 'n_votes' in enhanced_df.columns:
        agg_dict['n_votes'] = ['sum', 'mean']
    
    book_polarity = enhanced_df.groupby('book_id').agg(agg_dict)
    
    book_polarity.columns = ['_'.join(col).strip() for col in book_polarity.columns.values]
    
    def get_polarity_category(score):
        if score >= 0.7:
            return "Extremely Positive"
        elif score >= 0.5:
            return "Very Positive"
        elif score >= 0.25:
            return "Positive"
        elif score >= -0.25:
            return "Neutral"
        elif score >= -0.5:
            return "Negative"
        else:
            return "Very Negative"
    
    if 'weighted_score_mean' in book_polarity.columns:
        book_polarity['polarity_category'] = book_polarity['weighted_score_mean'].apply(get_polarity_category)
    else:
        book_polarity['polarity_category'] = "Unknown"
    
    book_polarity = book_polarity.reset_index()
    
    book_polarity['controversy_score'] = 0
    
    if 'text_sentiment_std' in book_polarity.columns and 'rating_std' in book_polarity.columns:
        book_polarity['controversy_score'] = book_polarity['text_sentiment_std'] * book_polarity['rating_std']
    
    if 'weighted_score_mean' in book_polarity.columns and 'weighted_score_count' in book_polarity.columns:
        # Create a high score multiplier that gives priority to scores >= 0.7
        book_polarity['high_score_bonus'] = book_polarity['weighted_score_mean'].apply(
            lambda x: 2.0 if x >= 0.7 else 1.0
        )
        
        # Popularity score with higher emphasis on review count for highly rated books
        book_polarity['popularity_score'] = (
            book_polarity['weighted_score_mean'] * 
            book_polarity['high_score_bonus'] * 
            (1 + np.log1p(book_polarity['weighted_score_count']) / 3)
        )
        
        # First sort by high_score_bonus (to prioritize 0.7+ scores), then by popularity score
        book_polarity = book_polarity.sort_values(
            ['high_score_bonus', 'popularity_score'], 
            ascending=[False, False]
        )
    elif 'weighted_score_mean' in book_polarity.columns:
        book_polarity = book_polarity.sort_values('weighted_score_mean', ascending=False)
    
    return book_polarity

def display_book_polarity_summary(book_polarity_df):
    if len(book_polarity_df) == 0:
        print("No books to analyze")
        return
    
    print("\nBook Sentiment Analysis Summary:")
    print("=" * 80)
    
    print(f"\nTotal books analyzed: {len(book_polarity_df)}")
    
    if 'polarity_category' in book_polarity_df.columns:
        category_counts = book_polarity_df['polarity_category'].value_counts()
        print("\nPolarity Category Distribution:")
        for category, count in category_counts.items():
            print(f"{category}: {count} books ({count/len(book_polarity_df)*100:.1f}%)")
    
    # Display highly rated books (score >= 0.7) first
    if 'weighted_score_mean' in book_polarity_df.columns:
        high_scoring_books = book_polarity_df[book_polarity_df['weighted_score_mean'] >= 0.7]
        if len(high_scoring_books) > 0:
            print("\nTop Books with Scores ≥ 0.7 (ordered by review count):")
            high_sorted = high_scoring_books.sort_values('weighted_score_count', ascending=False).head(
                min(5, len(high_scoring_books))
            )
            for _, book in high_sorted.iterrows():
                review_count = book.get('weighted_score_count', 'N/A')
                print(f"- {book['title_first']} (Score: {book['weighted_score_mean']:.2f}, Reviews: {review_count})")
    
    if 'popularity_score' in book_polarity_df.columns and 'title_first' in book_polarity_df.columns:
        print("\nTop Books (combined score & popularity):")
        top_books = book_polarity_df.head(min(5, len(book_polarity_df)))
        for _, book in top_books.iterrows():
            review_count = book.get('weighted_score_count', 'N/A')
            print(f"- {book['title_first']} (Score: {book['weighted_score_mean']:.2f}, Reviews: {review_count}, Combined: {book['popularity_score']:.2f})")
    
    if 'weighted_score_mean' in book_polarity_df.columns:
        negative_books = book_polarity_df[book_polarity_df['weighted_score_mean'] < 0]
        if len(negative_books) > 0:
            print("\nMost Negative Books:")
            top_negative = negative_books.head(min(5, len(negative_books)))
            for _, book in top_negative.iterrows():
                review_count = book.get('weighted_score_count', 'N/A')
                print(f"- {book['title_first']} (Score: {book['weighted_score_mean']:.2f}, Reviews: {review_count})")
        else:
            print("\nNo books with negative overall sentiment found.")
    
    if 'controversy_score' in book_polarity_df.columns and 'title_first' in book_polarity_df.columns:
        print("\nMost Controversial Books (mixed opinions):")
        controversial = book_polarity_df.sort_values('controversy_score', ascending=False).head(min(5, len(book_polarity_df)))
        for _, book in controversial.iterrows():
            review_count = book.get('weighted_score_count', 'N/A')
            print(f"- {book['title_first']} (Controversy: {book['controversy_score']:.2f}, Reviews: {review_count})")
    
    return

if __name__ == "__main__":
    try:
        review_counts = book_review_df['book_id'].value_counts()
        books_with_min_reviews = review_counts[review_counts >= 10].index
        
        if len(books_with_min_reviews) == 0:
            print("No books found with at least 10 reviews")
            max_reviews = review_counts.max()
            print(f"Maximum reviews for any book: {max_reviews}")
            print("Using all data instead")
            filtered_df = book_review_df
        else:
            print(f"Found {len(books_with_min_reviews)} books with at least 10 reviews")
            filtered_df = book_review_df[book_review_df['book_id'].isin(books_with_min_reviews)]
        
        book_polarity = analyze_review_polarity(filtered_df)
        display_book_polarity_summary(book_polarity)
        
    except Exception as e:
        print(f"Error: {str(e)}")

Found 36354 books with at least 10 reviews


  vote_weight = np.log1p(n_votes)
  vote_weight = np.log1p(n_votes)



Book Sentiment Analysis Summary:

Total books analyzed: 36354

Polarity Category Distribution:
Positive: 25154 books (69.2%)
Very Positive: 9419 books (25.9%)
Neutral: 1483 books (4.1%)
Extremely Positive: 185 books (0.5%)
Very Negative: 113 books (0.3%)

Top Books with Scores ≥ 0.7 (ordered by review count):
- Schim en Schaduw (De Grisha, #1) (Score: 0.71, Reviews: 24)
- Little Elliot, Big Fun (Score: 0.72, Reviews: 21)
- The Kiss That Missed (Score: 0.70, Reviews: 21)
- Fangirl (Score: 0.72, Reviews: 19)
- Green Pants (Score: 0.76, Reviews: 18)

Top Books (combined score & popularity):
- Green Pants (Score: 0.76, Reviews: 18, Combined: 3.03)
- Five Nice Mice Build a House (Score: 0.80, Reviews: 12, Combined: 2.96)
- Schim en Schaduw (De Grisha, #1) (Score: 0.71, Reviews: 24, Combined: 2.96)
- The Shepherd's Crown (Discworld, #41; Tiffany Aching, #5) (Score: 0.79, Reviews: 12, Combined: 2.94)
- Little Elliot, Big Fun (Score: 0.72, Reviews: 21, Combined: 2.91)

Most Negative Books:
- 

### Inspecting Interaction DF

In [None]:
import json
import pandas as pd

def load_interaction_data():
    json_files = ['goodreads_interactions_children.json', 'goodreads_interactions_young_adult.json']
    data = []

    for json_file in json_files:
        try:
            with open(json_file, 'r') as file:
                for line in file:
                    record = json.loads(line)
                    data.append(record)
            print(f"Successfully loaded {json_file}")
        except FileNotFoundError:
            print(f"File not found: {json_file}")
        except json.JSONDecodeError:
            print(f"JSON decode error in {json_file}")
        except Exception as e:
            print(f"Error processing {json_file}: {str(e)}")
    
    # Convert to DataFrame
    interactions_df = pd.DataFrame(data)
    
    # Print basic info
    print(f"\nLoaded {len(interactions_df)} interaction records")
    print("\nColumns:")
    for col in interactions_df.columns:
        print(f"- {col}")
    
    # Print sample data
    print("\nSample data (first 5 rows):")
    print(interactions_df.head())
    
    # Print summary statistics
    print("\nSummary statistics:")
    print(f"Number of unique users: {interactions_df['user_id'].nunique()}")
    print(f"Number of unique books: {interactions_df['book_id'].nunique()}")
    
    # Check if common statistics columns exist before calculating
    if 'rating' in interactions_df.columns:
        print(f"Average rating: {interactions_df['rating'].mean():.2f}")
    if 'is_read' in interactions_df.columns:
        print(f"Number of books read: {interactions_df['is_read'].sum()}")
    
    return interactions_df

# This can be called to load the interaction data
if __name__ == "__main__":
    interactions_df = load_interaction_data()

Successfully loaded goodreads_interactions_children.json
Successfully loaded goodreads_interactions_young_adult.json


In [None]:
interactions_df.head()

## Detect Book Genre using book attributes

In [None]:
import re
import numpy as np
import pandas as pd
from collections import Counter
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import spacy
from sentence_transformers import SentenceTransformer

def detect_book_genre_with_advanced_nlp(book_data, genre_classifier=None, min_confidence=3, exclude_shelves=None):
    """
    Detect book genres using a combination of structured data analysis and advanced NLP techniques.
    
    Algorithm:
    1. Extract genres from user-assigned shelves (most reliable signal)
    2. Apply multiple NLP techniques to analyze book description and title:
       - Semantic similarity using sentence embeddings
       - TF-IDF analysis with genre-specific vocabulary
       - Named entity recognition to identify genre-related entities
    3. Extract additional signals from book metadata (page count, title patterns)
    4. Combine all signals with appropriate weights (shelf data > NLP > metadata)
    5. Return top genres that meet minimum confidence threshold
    
    Parameters:
    -----------
    book_data : dict or pandas Series
        Book information containing title, description, popular_shelves and other metadata
    genre_classifier : object, optional
        Optional pre-trained genre classifier model
    min_confidence : int, optional
        Minimum confidence score required to include a genre in the results
    exclude_shelves : set, optional
        Set of shelf names to exclude from analysis (e.g., 'to-read')
        
    Returns:
    --------
    list
        List of up to 3 most likely genres for the book
    """
    if exclude_shelves is None:
        exclude_shelves = get_default_excluded_shelves()
    
    # Extract genres from structured shelf data
    shelf_genres = extract_genres_from_shelves(book_data, get_genre_map(), exclude_shelves)
    
    title = str(book_data.get('title', ''))
    description = str(book_data.get('description', ''))
    
    nlp_genres = {}
    
    # Only perform NLP analysis if we have enough text
    if len(description) > 20:
        nlp_genres.update(analyze_with_embeddings(title, description))
        nlp_genres.update(analyze_with_tfidf(title, description))
        nlp_genres.update(extract_named_entities(title, description))
    
    # Extract genre signals from book metadata
    metadata_genres = analyze_metadata(book_data)
    
    # Combine all signals and apply minimum confidence threshold
    final_genres = combine_all_genre_signals(shelf_genres, nlp_genres, metadata_genres, min_confidence)
    
    return final_genres[:3]


def get_default_excluded_shelves():
    """
    Get the default set of shelf names to exclude from genre analysis.
    
    Algorithm:
    - Return a predefined set of non-genre shelves that are commonly used but don't indicate genre
      (e.g., organizational shelves like "to-read" or format shelves like "ebook")
    
    Returns:
    --------
    set
        Set of shelf names that aren't useful for genre classification
    """
    return {
        'to-read', 'currently-reading', 'owned', 'default', 
        'favorites', 'books-i-own', 'ebook', 'kindle', 
        'library', 'audiobook', 'owned-books', 'to-buy', 
        'calibre', 're-read', 'unread', 'favourites', 'my-books'
    }


def get_genre_map():
    """
    Get mapping from common shelf keywords to standardized genre names.
    
    Algorithm:
    - Create a dictionary that maps various ways users might tag a genre (e.g., "sci-fi", "science-fiction") 
      to a standardized genre name ("Science Fiction")
    - This normalizes different variations of the same genre concept
    
    Returns:
    --------
    dict
        Dictionary mapping shelf keywords to standardized genre names
    """
    return {
        'fantasy': 'Fantasy',
        'sci-fi': 'Science Fiction',
        'science-fiction': 'Science Fiction',
        'mystery': 'Mystery/Thriller',
        'thriller': 'Mystery/Thriller',
        'romance': 'Romance',
        'historical': 'Historical Fiction',
        'history': 'History',
        'horror': 'Horror',
        'young-adult': 'Young Adult',
        'ya': 'Young Adult',
        'childrens': 'Children\'s',
        'children': 'Children\'s',
        'kids': 'Children\'s',
        'dystopian': 'Dystopian',
        'classic': 'Classics',
        'classics': 'Classics',
        'biography': 'Biography/Memoir',
        'memoir': 'Biography/Memoir',
        'autobiography': 'Biography/Memoir',
        'self-help': 'Self Help',
        'business': 'Business',
        'philosophy': 'Philosophy',
        'psychology': 'Psychology',
        'science': 'Science',
        'poetry': 'Poetry',
        'comic': 'Comics/Graphic Novels',
        'graphic-novel': 'Comics/Graphic Novels',
        'manga': 'Manga',
        'cooking': 'Cooking/Food',
        'cookbook': 'Cooking/Food',
        'food': 'Cooking/Food',
        'travel': 'Travel',
        'religion': 'Religion/Spirituality',
        'spirituality': 'Religion/Spirituality',
        'art': 'Art/Photography',
        'photography': 'Art/Photography',
        'reference': 'Reference',
        'textbook': 'Textbook/Education',
        'education': 'Textbook/Education'
    }


def get_genre_embeddings():
    """
    Get descriptions of genres for semantic similarity comparisons.
    
    Algorithm:
    - Define rich textual descriptions for each genre containing key concepts and vocabulary
    - These descriptions will be used to create embeddings for semantic similarity comparison
      with the book's content
    
    Returns:
    --------
    dict
        Dictionary mapping genre names to their textual descriptions
    """
    genre_descriptions = {
        'Fantasy': 'Magic, wizards, dragons, mythical creatures, quests, magical worlds and kingdoms',
        'Science Fiction': 'Space, technology, future, aliens, robots, artificial intelligence, dystopian societies',
        'Mystery/Thriller': 'Crime, murder, detective, investigation, suspense, secrets, conspiracy',
        'Romance': 'Love, relationships, passion, emotion, marriage, dating, feelings',
        'Historical Fiction': 'Past events, historical periods, ancient civilizations, history-based stories',
        'Horror': 'Fear, terror, supernatural, monsters, ghosts, nightmares, scary stories',
        'Young Adult': 'Teenage protagonists, coming of age, high school, identity, friendship, young romance',
        'Children\'s': 'Stories for kids, picture books, educational, simple stories, colorful illustrations',
        'Biography/Memoir': 'Real life stories, personal experiences, autobiographical, true events',
        'Self Help': 'Personal improvement, advice, motivation, success strategies, life guidance',
        'Business': 'Entrepreneurship, finance, management, marketing, career advice, economics',
        'History': 'Historical accounts, wars, civilizations, historical figures, factual accounts of the past',
        'Science': 'Scientific discoveries, research, theories, nature, biology, physics, academic',
        'Poetry': 'Poems, verse, rhymes, poetic language, collections of poetry',
        'Dystopian': 'Oppressive society, controlled world, rebellion, survival, future dystopia',
        'Classics': 'Literary works of lasting value, canonical literature, traditional important works',
        'Religion/Spirituality': 'Faith, belief systems, religious practices, spiritual growth, theology',
        'Comics/Graphic Novels': 'Illustrated stories, sequential art, comic book format, visual storytelling',
        'Cooking/Food': 'Recipes, culinary techniques, food culture, cooking instructions, nutrition',
        'Travel': 'Travel guides, destinations, journeys, cultural exploration, adventures abroad'
    }
    return genre_descriptions


def extract_genres_from_shelves(book_data, genre_map, exclude_shelves):
    """
    Extract genre information from book's popular shelves data.
    
    Algorithm:
    1. Iterate through the book's popular shelves data
    2. Filter out non-genre shelves (using exclude_shelves)
    3. Map shelf names to standardized genres using genre_map
    4. Use shelf counts as confidence scores (more users shelving = higher confidence)
    5. Return accumulated genre scores from shelf data
    
    Parameters:
    -----------
    book_data : dict or pandas Series
        Book information containing popular_shelves
    genre_map : dict
        Dictionary mapping shelf keywords to standardized genre names
    exclude_shelves : set
        Set of shelf names to exclude from analysis
        
    Returns:
    --------
    dict
        Dictionary of genre names with their confidence scores from shelf data
    """
    shelf_genres = {}
    
    popular_shelves = book_data.get('popular_shelves', [])
    if isinstance(popular_shelves, list) and popular_shelves:
        for shelf in popular_shelves:
            shelf_name = shelf.get('name', '').strip().lower()
            shelf_count = int(shelf.get('count', 0))
            
            if shelf_name in exclude_shelves:
                continue
            
            for keyword, genre_name in genre_map.items():
                if keyword in shelf_name:
                    if genre_name in shelf_genres:
                        shelf_genres[genre_name] += shelf_count
                    else:
                        shelf_genres[genre_name] = shelf_count
                    break
    
    return shelf_genres


def preprocess_text(text, lemmatize=True):
    """
    Preprocess text by removing special characters, lemmatizing, etc.
    
    Algorithm:
    1. Convert text to lowercase
    2. Remove URLs and HTML tags
    3. Remove non-alphabetic characters
    4. Normalize whitespace
    5. Optionally lemmatize words (reduce to base form)
    
    Parameters:
    -----------
    text : str
        Text to preprocess
    lemmatize : bool, optional
        Whether to apply lemmatization (default: True)
        
    Returns:
    --------
    str
        Preprocessed text
    """
    text = text.lower()
    text = re.sub(r'http\S+|www\S+|https\S+', '', text)
    text = re.sub(r'<.*?>', '', text)
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    
    if lemmatize:
        lemmatizer = WordNetLemmatizer()
        word_list = nltk.word_tokenize(text)
        text = ' '.join([lemmatizer.lemmatize(word) for word in word_list])
    
    return text


def load_nlp_models():
    """
    Load NLP models required for text analysis.
    
    Algorithm:
    1. Try to load spaCy model for named entity recognition
    2. Try to load sentence transformer model for text embeddings
    3. Handle any exceptions if models aren't available or can't be downloaded
    
    Returns:
    --------
    tuple
        (spacy_model, embedding_model) - loaded NLP models
    """
    try:
        nlp = spacy.load("en_core_web_sm")
    except:
        try:
            spacy.cli.download("en_core_web_sm")
            nlp = spacy.load("en_core_web_sm")
        except:
            nlp = None
            
    try:
        embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
    except:
        embedding_model = None
    
    return nlp, embedding_model


def analyze_with_embeddings(title, description):
    """
    Analyze book text using sentence embeddings for semantic similarity.
    
    Algorithm:
    1. Load pretrained sentence transformer model
    2. Create embeddings for genre descriptions
    3. Create embedding for the book (title + description)
    4. Calculate cosine similarity between book embedding and genre embeddings
    5. Convert similarities to confidence scores and return top matches
    
    Parameters:
    -----------
    title : str
        Book title
    description : str
        Book description
        
    Returns:
    --------
    dict
        Dictionary of genre names with their confidence scores from embedding analysis
    """
    try:
        model = SentenceTransformer('all-MiniLM-L6-v2')
        
        genre_descriptions = get_genre_embeddings()
        genre_texts = [f"{genre}: {desc}" for genre, desc in genre_descriptions.items()]
        genre_embeddings = model.encode(genre_texts)
        
        book_text = f"{title} {description}"
        book_embedding = model.encode([book_text])[0]
        
        similarities = cosine_similarity([book_embedding], genre_embeddings)[0]
        
        genres = {}
        for i, genre in enumerate(genre_descriptions.keys()):
            score = int(similarities[i] * 10)
            if score > 3:
                genres[genre] = score
                
        return genres
    except:
        return {}


def analyze_with_tfidf(title, description):
    """
    Analyze book text using TF-IDF comparison against genre-specific vocabulary.
    
    Algorithm:
    1. Define genre-specific keyword sets
    2. Preprocess the book text (title + description)
    3. Create TF-IDF vectors for genre keywords and book text
    4. Calculate cosine similarity between book vector and each genre vector
    5. Convert similarities to confidence scores and return top matches
    
    Parameters:
    -----------
    title : str
        Book title
    description : str
        Book description
        
    Returns:
    --------
    dict
        Dictionary of genre names with their confidence scores from TF-IDF analysis
    """
    try:
        # Define genre keyword sets
        genre_keywords = {
            'Fantasy': 'magic wizard dragon elf quest sword magical kingdom witch sorcery myth fantasy',
            'Science Fiction': 'space alien future technology robot dystopian sci-fi futuristic planet spacecraft',
            'Mystery/Thriller': 'murder detective crime case investigation killer suspense clue mystery conspiracy',
            'Romance': 'love relationship passion romantic heart affair marriage emotion desire dating romance',
            'Historical Fiction': 'century historical period king queen ancient war empire era medieval history',
            'Horror': 'fear terror ghost scary monster supernatural haunt nightmare blood evil dark horror',
            'Young Adult': 'teen school young coming-of-age adolescent teenage youth friendship high-school',
            'Children\'s': 'child kid young picture-book learning bedtime simple adventure colorful illustrated',
            'Biography/Memoir': 'life autobiography personal real journey memoir experience story true figure',
            'Self Help': 'improve success happiness guide advice life motivation habit inspiration growth',
            'Business': 'market company entrepreneur success management leadership strategy finance career investment',
            'Dystopian': 'dystopia future society control survival oppression rebellion totalitarian apocalyptic regime'
        }
        
        # Preprocess text
        processed_text = preprocess_text(f"{title} {description}")
        
        # Create TF-IDF vectorizer
        vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')
        
        # Create corpus with genre keywords and the book text
        corpus = list(genre_keywords.values())
        corpus.append(processed_text)
        
        # Calculate TF-IDF
        tfidf_matrix = vectorizer.fit_transform(corpus)
        
        # Calculate similarity between book and each genre
        last_row_index = tfidf_matrix.shape[0] - 1
        similarities = cosine_similarity(tfidf_matrix[last_row_index], tfidf_matrix[:-1])[0]
        
        # Map similarities to genres
        genres = {}
        for i, genre in enumerate(genre_keywords.keys()):
            # Convert similarity scores to a more intuitive range (0-10)
            score = int(similarities[i] * 10)
            if score > 3:  # Only consider reasonable matches
                genres[genre] = score
                
        return genres
    except:
        return {}


def extract_named_entities(title, description):
    """
    Extract named entities from book text and map them to potential genres.
    
    Algorithm:
    1. Load spaCy NLP model
    2. Process the text to extract named entities
    3. Match entities against genre-related keyword lists
    4. Score genres based on matched entities
    5. Return genres with their confidence scores
    
    Parameters:
    -----------
    title : str
        Book title
    description : str
        Book description
        
    Returns:
    --------
    dict
        Dictionary of genre names with their confidence scores from entity analysis
    """
    try:
        nlp, _ = load_nlp_models()
        if not nlp:
            return {}
            
        # Process text with spaCy
        doc = nlp(f"{title} {description}")
        
        # Extract entities
        entities = [ent.text.lower() for ent in doc.ents]
        
        # Define entity-genre associations
        entity_genre_map = {
            'fantasy': ['magic', 'wizard', 'dragon', 'elf', 'fairy', 'kingdom', 'quest', 'sorcerer'],
            'science fiction': ['space', 'planet', 'alien', 'robot', 'future', 'technology'],
            'historical fiction': ['century', 'king', 'queen', 'empire', 'war', 'battle', 'medieval', 'ancient'],
            'biography': ['life', 'biography', 'autobiography', 'memoir', 'president', 'politician', 'artist'],
            'science': ['research', 'experiment', 'theory', 'physics', 'biology', 'chemistry', 'scientist'],
            'religion': ['god', 'church', 'bible', 'faith', 'spiritual', 'religion', 'prayer']
        }
        
        # Find genres based on entities
        genres = {}
        for entity in entities:
            for genre, keywords in entity_genre_map.items():
                if any(keyword in entity for keyword in keywords):
                    standardized_genre = standardize_genre(genre)
                    genres[standardized_genre] = genres.get(standardized_genre, 0) + 1
                    
        return genres
    except:
        return {}


def standardize_genre(genre):
    """
    Standardize genre names to a consistent format.
    
    Algorithm:
    - Map various genre name formats to standardized genre names
    - Default to title case if no mapping exists
    
    Parameters:
    -----------
    genre : str
        Genre name to standardize
        
    Returns:
    --------
    str
        Standardized genre name
    """
    genre_map = {
        'fantasy': 'Fantasy',
        'science fiction': 'Science Fiction',
        'historical fiction': 'Historical Fiction',
        'biography': 'Biography/Memoir',
        'science': 'Science',
        'religion': 'Religion/Spirituality'
    }
    return genre_map.get(genre.lower(), genre.title())


def analyze_metadata(book_data):
    """
    Analyze book metadata for additional genre signals.
    
    Algorithm:
    1. Check page count (short books may be children's books)
    2. Look for series indicators in title (common in fantasy, sci-fi, YA)
    3. Look for children's book indicators in title
    4. Look for educational/textbook indicators in title
    5. Return genre scores derived from metadata
    
    Parameters:
    -----------
    book_data : dict or pandas Series
        Book information containing metadata
        
    Returns:
    --------
    dict
        Dictionary of genre names with their confidence scores from metadata
    """
    metadata_genres = {}
    
    # Check page count for children's books
    if 'num_pages' in book_data and book_data['num_pages'] and pd.notna(book_data['num_pages']):
        try:
            pages = int(book_data['num_pages'])
            if pages < 50:
                metadata_genres["Children's"] = 5
        except (ValueError, TypeError):
            pass
    
    # Check title for series indicators
    title = str(book_data.get('title', '')).lower()
    series_indicators = ['#1', '#2', '#3', 'book 1', 'book 2', 'trilogy', 'series', 'volume']
    
    if any(indicator in title for indicator in series_indicators):
        metadata_genres['Fantasy'] = metadata_genres.get('Fantasy', 0) + 2
        metadata_genres['Science Fiction'] = metadata_genres.get('Science Fiction', 0) + 2
        metadata_genres['Young Adult'] = metadata_genres.get('Young Adult', 0) + 2
    
    # Check for children's book indicators in title
    children_indicators = ['for kids', 'for children', 'children\'s', 'picture book', 'baby', 'toddler']
    if any(indicator in title for indicator in children_indicators):
        metadata_genres["Children's"] = metadata_genres.get("Children's", 0) + 5
    
    # Check for educational/textbook indicators
    educational_indicators = ['textbook', 'introduction to', 'principles of', 'fundamentals of', 'guide to']
    if any(indicator in title for indicator in educational_indicators):
        metadata_genres['Textbook/Education'] = 5
    
    return metadata_genres


def combine_all_genre_signals(shelf_genres, nlp_genres, metadata_genres, min_confidence):
    """
    Combine genre signals from different sources with appropriate weights.
    
    Algorithm:
    1. Start with shelf genres (highest weight x3)
    2. Add NLP-detected genres (medium weight x2)
    3. Add metadata-based genres (normal weight x1)
    4. Sort by final weighted score
    5. Filter by minimum confidence threshold
    6. Return top genres, or fallback to highest scoring genre if none meet threshold
    
    Parameters:
    -----------
    shelf_genres : dict
        Genres detected from shelves data
    nlp_genres : dict
        Genres detected from NLP analysis
    metadata_genres : dict
        Genres detected from metadata
    min_confidence : int
        Minimum confidence score to include a genre
        
    Returns:
    --------
    list
        Final list of detected genres in order of confidence
    """
    combined_scores = Counter()
    
    # Add shelf genres with highest weight (explicit human categorization)
    for genre, score in shelf_genres.items():
        combined_scores[genre] += score * 3
    
    # Add NLP-detected genres with medium weight
    for genre, score in nlp_genres.items():
        combined_scores[genre] += score * 2
    
    # Add metadata-based genres
    for genre, score in metadata_genres.items():
        combined_scores[genre] += score
    
    # Sort by final score
    sorted_genres = sorted(combined_scores.items(), key=lambda x: x[1], reverse=True)
    
    # Filter by minimum confidence
    final_genres = [genre for genre, score in sorted_genres if score >= min_confidence]
    
    # Fallback if no confident genres
    if not final_genres and sorted_genres:
        final_genres = [sorted_genres[0][0]]
    
    return final_genres


def test_genre_detection():
    """
    Test genre detection on 5 popular book examples.
    
    Algorithm:
    1. Download required NLTK resources
    2. Create test dataset of 5 popular books with metadata
    3. Apply genre detection algorithm to each book
    4. Print results for each book
    
    Returns:
    --------
    pandas.DataFrame
        DataFrame containing the test books
    """
    try:
        nltk.download('punkt', quiet=True)
        nltk.download('stopwords', quiet=True)
        nltk.download('wordnet', quiet=True)
    except:
        print("NLTK download failed, but continuing...")
        
    test_books = [
        {
            'book_id': 1,
            'title': 'Harry Potter and the Sorcerer\'s Stone',
            'average_rating': 4.47,
            'ratings_count': 6800000,
            'description': 'Harry Potter has never been the star of a Quidditch team, scoring points while riding a broom far above the ground. He knows no spells, has never helped to hatch a dragon, and has never worn a cloak of invisibility. All he knows is a miserable life with the Dursleys, his horrible aunt and uncle, and their abominable son, Dudley — a great big swollen spoiled bully. Harry\'s room is a tiny closet at the foot of the stairs, and he hasn\'t had a birthday party in eleven years. But all that is about to change when a mysterious letter arrives by owl messenger: a letter with an invitation to an incredible place that Harry — and anyone who reads about him — will find unforgettable.',
            'num_pages': 309,
            'similar_books': [],
            'popular_shelves': [
                {'count': '15000', 'name': 'to-read'},
                {'count': '12000', 'name': 'fantasy'},
                {'count': '8000', 'name': 'young-adult'},
                {'count': '4000', 'name': 'fiction'},
                {'count': '3000', 'name': 'favorites'}
            ]
        },
        {
            'book_id': 2,
            'title': 'The Da Vinci Code',
            'average_rating': 3.89,
            'ratings_count': 1900000,
            'description': 'While in Paris, Harvard symbologist Robert Langdon is awakened by a phone call in the dead of the night. The elderly curator of the Louvre has been murdered inside the museum, his body covered in baffling symbols. As Langdon and gifted French cryptologist Sophie Neveu sort through the bizarre riddles, they are stunned to discover a trail of clues hidden in the works of Leonardo da Vinci—clues visible for all to see and yet ingeniously disguised by the painter.',
            'num_pages': 489,
            'similar_books': [],
            'popular_shelves': [
                {'count': '10000', 'name': 'to-read'},
                {'count': '8000', 'name': 'thriller'},
                {'count': '7000', 'name': 'mystery'},
                {'count': '5000', 'name': 'fiction'},
                {'count': '2000', 'name': 'suspense'}
            ]
        },
        {
            'book_id': 3,
            'title': 'To Kill a Mockingbird',
            'average_rating': 4.28,
            'ratings_count': 4300000,
            'description': 'The unforgettable novel of a childhood in a sleepy Southern town and the crisis of conscience that rocked it. "To Kill A Mockingbird" became both an instant bestseller and a critical success when it was first published in 1960. It went on to win the Pulitzer Prize in 1961 and was later made into an Academy Award-winning film, also a classic.',
            'num_pages': 324,
            'similar_books': [],
            'popular_shelves': [
                {'count': '11000', 'name': 'to-read'},
                {'count': '9000', 'name': 'classics'},
                {'count': '7000', 'name': 'fiction'},
                {'count': '4000', 'name': 'school'},
                {'count': '3000', 'name': 'literature'}
            ]
        },
        {
            'book_id': 4,
            'title': 'The Hunger Games',
            'average_rating': 4.32,
            'ratings_count': 6100000,
            'description': 'In the ruins of a place once known as North America lies the nation of Panem, a shining Capitol surrounded by twelve outlying districts. The Capitol is harsh and cruel and keeps the districts in line by forcing them all to send one boy and one girl between the ages of twelve and eighteen to participate in the annual Hunger Games, a fight to the death on live TV.',
            'num_pages': 374,
            'similar_books': [],
            'popular_shelves': [
            ]
        },
        {
            'book_id': 5,
            'title': 'The Very Hungry Caterpillar',
            'average_rating': 4.3,
            'ratings_count': 527000,
            'description': 'Eric Carle\'s classic, The Very Hungry Caterpillar, with a simple counting concept, has been delighting young readers for more than 30 years. This board book edition is now available in a larger format, making it perfect for laptime reading.',
            'num_pages': 26,
            'similar_books': [],
            'popular_shelves': [
            ]
        }
    ]
    
    df = pd.DataFrame(test_books)
    
    print("Testing genre detection on 5 popular books...\n")
    
    for i, book in enumerate(test_books):
        print(f"Book {i+1}: {book['title']}")
        
        genres = detect_book_genre_with_advanced_nlp(book)
        
        print(f"Detected genres: {', '.join(genres)}")
        print("-" * 60)
        
    return df


if __name__ == "__main__":
    test_genre_detection()

In [None]:
import re
import numpy as np
import pandas as pd
from collections import Counter
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import spacy
from sentence_transformers import SentenceTransformer

def detect_book_genre_with_advanced_nlp(book_data, genre_classifier=None, min_confidence=3, exclude_shelves=None):
    """
    Detect book genres using a combination of structured data analysis and advanced NLP techniques.
    Handles cases where shelf data or similar books may be missing.
    
    Parameters:
    -----------
    book_data : dict or pandas Series
        Book information containing title, description, and other metadata
    genre_classifier : object, optional
        Optional pre-trained genre classifier model
    min_confidence : int, optional
        Minimum confidence score required to include a genre in the results
    exclude_shelves : set, optional
        Set of shelf names to exclude from analysis (e.g., 'to-read')
        
    Returns:
    --------
    list
        List of up to 3 most likely genres for the book
    """
    if exclude_shelves is None:
        exclude_shelves = get_default_excluded_shelves()
    
    title = str(book_data.get('title', ''))
    description = str(book_data.get('description', ''))
    
    has_shelf_data = bool(book_data.get('popular_shelves', []))
    
    adjusted_min_confidence = min_confidence
    if not has_shelf_data:
        adjusted_min_confidence = max(1, min_confidence - 2)
    
    shelf_genres = extract_genres_from_shelves(book_data, get_genre_map(), exclude_shelves)
    
    nlp_genres = {}
    if len(description) > 20:
        nlp_genres.update(analyze_with_embeddings(title, description, boost=not has_shelf_data))
        nlp_genres.update(analyze_with_tfidf(title, description, boost=not has_shelf_data))
        nlp_genres.update(extract_named_entities(title, description))
        nlp_genres.update(detect_specific_genre_patterns(title, description))
    
    metadata_genres = analyze_metadata(book_data)
    
    similar_book_genres = analyze_similar_books(book_data)
    
    final_genres = combine_all_genre_signals(shelf_genres, nlp_genres, metadata_genres, similar_book_genres, adjusted_min_confidence)
    
    if not final_genres:
        final_genres = genre_fallback_detection(title, description)
    
    return final_genres[:3]


def get_default_excluded_shelves():
    """
    Get the default set of shelf names to exclude from genre analysis.
    
    Returns:
    --------
    set
        Set of shelf names that aren't useful for genre classification
    """
    return {
        'to-read', 'currently-reading', 'owned', 'default', 
        'favorites', 'books-i-own', 'ebook', 'kindle', 
        'library', 'audiobook', 'owned-books', 'to-buy', 
        'calibre', 're-read', 'unread', 'favourites', 'my-books'
    }


def get_genre_map():
    """
    Get mapping from common shelf keywords to standardized genre names.
    
    Returns:
    --------
    dict
        Dictionary mapping shelf keywords to standardized genre names
    """
    return {
        'fantasy': 'Fantasy',
        'sci-fi': 'Science Fiction',
        'science-fiction': 'Science Fiction',
        'mystery': 'Mystery/Thriller',
        'thriller': 'Mystery/Thriller',
        'romance': 'Romance',
        'historical': 'Historical Fiction',
        'history': 'History',
        'horror': 'Horror',
        'young-adult': 'Young Adult',
        'ya': 'Young Adult',
        'childrens': 'Children\'s',
        'children': 'Children\'s',
        'kids': 'Children\'s',
        'dystopian': 'Dystopian',
        'classic': 'Classics',
        'classics': 'Classics',
        'biography': 'Biography/Memoir',
        'memoir': 'Biography/Memoir',
        'autobiography': 'Biography/Memoir',
        'self-help': 'Self Help',
        'business': 'Business',
        'philosophy': 'Philosophy',
        'psychology': 'Psychology',
        'science': 'Science',
        'poetry': 'Poetry',
        'comic': 'Comics/Graphic Novels',
        'graphic-novel': 'Comics/Graphic Novels',
        'manga': 'Manga',
        'cooking': 'Cooking/Food',
        'cookbook': 'Cooking/Food',
        'food': 'Cooking/Food',
        'travel': 'Travel',
        'religion': 'Religion/Spirituality',
        'spirituality': 'Religion/Spirituality',
        'art': 'Art/Photography',
        'photography': 'Art/Photography',
        'reference': 'Reference',
        'textbook': 'Textbook/Education',
        'education': 'Textbook/Education',
        'academic': 'Textbook/Education',
        'computer-science': 'Computer Science',
        'programming': 'Computer Science',
        'mathematics': 'Mathematics',
        'statistics': 'Mathematics'
    }


def get_genre_embeddings():
    """
    Get descriptions of genres for semantic similarity comparisons.
    
    Returns:
    --------
    dict
        Dictionary mapping genre names to their textual descriptions
    """
    genre_descriptions = {
        'Fantasy': 'Magic, wizards, dragons, mythical creatures, quests, magical worlds and kingdoms',
        'Science Fiction': 'Space, technology, future, aliens, robots, artificial intelligence, dystopian societies',
        'Mystery/Thriller': 'Crime, murder, detective, investigation, suspense, secrets, conspiracy',
        'Romance': 'Love, relationships, passion, emotion, marriage, dating, feelings',
        'Historical Fiction': 'Past events, historical periods, ancient civilizations, history-based stories',
        'Horror': 'Fear, terror, supernatural, monsters, ghosts, nightmares, scary stories',
        'Young Adult': 'Teenage protagonists, coming of age, high school, identity, friendship, young romance',
        'Children\'s': 'Stories for kids, picture books, educational, simple stories, colorful illustrations',
        'Biography/Memoir': 'Real life stories, personal experiences, autobiographical, true events',
        'Self Help': 'Personal improvement, advice, motivation, success strategies, life guidance',
        'Business': 'Entrepreneurship, finance, management, marketing, career advice, economics',
        'History': 'Historical accounts, wars, civilizations, historical figures, factual accounts of the past',
        'Science': 'Scientific discoveries, research, theories, nature, biology, physics, academic',
        'Poetry': 'Poems, verse, rhymes, poetic language, collections of poetry',
        'Dystopian': 'Oppressive society, controlled world, rebellion, survival, future dystopia, totalitarian government',
        'Classics': 'Literary works of lasting value, canonical literature, traditional important works',
        'Religion/Spirituality': 'Faith, belief systems, religious practices, spiritual growth, theology',
        'Comics/Graphic Novels': 'Illustrated stories, sequential art, comic book format, visual storytelling',
        'Cooking/Food': 'Recipes, culinary techniques, food culture, cooking instructions, nutrition',
        'Travel': 'Travel guides, destinations, journeys, cultural exploration, adventures abroad',
        'Textbook/Education': 'Academic topics, learning materials, educational content, textbooks, theories, concepts',
        'Computer Science': 'Programming, algorithms, data structures, computing theory, software development',
        'Mathematics': 'Mathematical concepts, equations, proofs, statistical methods, numerical analysis'
    }
    return genre_descriptions


def extract_genres_from_shelves(book_data, genre_map, exclude_shelves):
    """
    Extract genre information from book's popular shelves data.
    
    Parameters:
    -----------
    book_data : dict or pandas Series
        Book information containing popular_shelves
    genre_map : dict
        Dictionary mapping shelf keywords to standardized genre names
    exclude_shelves : set
        Set of shelf names to exclude
        
    Returns:
    --------
    dict
        Dictionary of genre names with confidence scores from shelf data
    """
    genre_scores = {}
    
    shelves = book_data.get('popular_shelves', [])
    if not shelves:
        return genre_scores
    
    for shelf in shelves:
        shelf_name = shelf.get('name', '').lower()
        shelf_count = int(shelf.get('count', 0))
        
        if shelf_name in exclude_shelves:
            continue
            
        for keyword, genre in genre_map.items():
            if keyword == shelf_name or keyword in shelf_name.split('-'):
                weight = min(5, shelf_count / 1000) if shelf_count else 1
                genre_scores[genre] = genre_scores.get(genre, 0) + weight
                break
    
    return genre_scores


def analyze_with_embeddings(title, description, boost=False):
    """
    Analyze book text using sentence embeddings for semantic similarity.
    
    Parameters:
    -----------
    title : str
        Book title
    description : str
        Book description
    boost : bool, optional
        Whether to boost confidence scores (default: False)
        
    Returns:
    --------
    dict
        Dictionary of genre names with confidence scores from embedding analysis
    """
    try:
        model = SentenceTransformer('all-MiniLM-L6-v2')
        
        genre_descriptions = get_genre_embeddings()
        genre_texts = [f"{genre}: {desc}" for genre, desc in genre_descriptions.items()]
        genre_embeddings = model.encode(genre_texts)
        
        book_text = f"{title} {description}"
        book_embedding = model.encode([book_text])[0]
        
        similarities = cosine_similarity([book_embedding], genre_embeddings)[0]
        
        genre_scores = {}
        for i, genre in enumerate(genre_descriptions.keys()):
            score = similarities[i] * 10
            if boost:
                score *= 1.5
            if score > 3:
                genre_scores[genre] = score
        
        return genre_scores
    except:
        return {}


def analyze_with_tfidf(title, description, boost=False):
    """
    Analyze book text using TF-IDF for keyword relevance.
    
    Parameters:
    -----------
    title : str
        Book title
    description : str
        Book description
    boost : bool, optional
        Whether to boost confidence scores (default: False)
        
    Returns:
    --------
    dict
        Dictionary of genre names with confidence scores from TF-IDF analysis
    """
    genre_scores = {}
    
    combined_text = (title + " " + description).lower()
    
    genre_patterns = {
        'Fantasy': [r'\bmagic\b', r'\bwizard', r'\bdragon', r'\bspell', r'\bkingdom', r'\bquest\b'],
        'Science Fiction': [r'\bspace\b', r'\balien', r'\brobot', r'\bfuture', r'\btechnology'],
        'Mystery/Thriller': [r'\bmurder', r'\bdetective', r'\bcrime', r'\bmystery', r'\binvestigation'],
        'Romance': [r'\blove\b', r'\bromance', r'\brelationship', r'\bheart', r'\bpassion'],
        'Young Adult': [r'\bteen', r'\byoung adult', r'\bcoming of age', r'\bhigh school'],
        'Dystopian': [r'\bdystopian', r'\bpost-apocalyptic', r'\bdictator', r'\bsurvival', r'\brebellion'],
        'Children\'s': [r'\bchildren', r'\bkid', r'\bpicture book', r'\billustrated', r'\byoung reader'],
        'Historical Fiction': [r'\bhistorical', r'\bcentury', r'\bancient', r'\bera\b', r'\bperiod\b'],
        'Textbook/Education': [r'\btextbook', r'\bhandbook', r'\bacademic', r'\btheory', r'\bprinciples'],
        'Computer Science': [r'\bprogramming', r'\balgorithm', r'\bcomputer', r'\bsoftware', r'\bcoding'],
        'Mathematics': [r'\bmathematics', r'\bstatistics', r'\bequation', r'\bnumerical', r'\btheorem']
    }
    
    for genre, patterns in genre_patterns.items():
        score = 0
        for pattern in patterns:
            matches = len(re.findall(pattern, combined_text))
            score += matches * (1.5 if boost else 1.0)
        
        if score > 0:
            genre_scores[genre] = score
    
    return genre_scores


def extract_named_entities(title, description):
    """
    Extract named entities from book text for genre detection.
    
    Parameters:
    -----------
    title : str
        Book title
    description : str
        Book description
        
    Returns:
    --------
    dict
        Dictionary of genre names with confidence scores from entity analysis
    """
    try:
        nlp = spacy.load("en_core_web_sm")
    except:
        try:
            spacy.cli.download("en_core_web_sm")
            nlp = spacy.load("en_core_web_sm")
        except:
            return {}
    
    genre_scores = {}
    combined_text = title + " " + description
    
    doc = nlp(combined_text)
    
    for ent in doc.ents:
        if ent.label_ == "GPE" or ent.label_ == "LOC":
            genre_scores["Travel"] = genre_scores.get("Travel", 0) + 1
        elif ent.label_ == "PERSON" and len(ent.text.split()) > 1:
            genre_scores["Biography/Memoir"] = genre_scores.get("Biography/Memoir", 0) + 1
        elif ent.label_ == "ORG" and "university" in ent.text.lower():
            genre_scores["Textbook/Education"] = genre_scores.get("Textbook/Education", 0) + 2
    
    return genre_scores


def detect_specific_genre_patterns(title, description):
    """
    Detect specific genres using pattern matching in title and description.
    
    Parameters:
    -----------
    title : str
        Book title
    description : str
        Book description
        
    Returns:
    --------
    dict
        Dictionary of genre names with their confidence scores
    """
    combined_text = (title + " " + description).lower()
    genre_scores = {}
    
    dystopian_indicators = [
        ('capitol', 'district'), 
        ('dystopian', 'society'),
        ('post-apocalyptic', 'survival'),
        ('oppressive', 'government'),
        ('totalitarian', 'regime'),
        ('controlled', 'society')
    ]
    for terms in dystopian_indicators:
        if all(term in combined_text for term in terms):
            genre_scores['Dystopian'] = genre_scores.get('Dystopian', 0) + 5
            genre_scores['Young Adult'] = genre_scores.get('Young Adult', 0) + 2
    
    fantasy_indicators = [
        ('magic', 'wizard'),
        ('dragon', 'kingdom'),
        ('spell', 'quest'),
        ('magical', 'creature'),
        ('enchanted', 'forest'),
        ('sword', 'sorcery')
    ]
    for terms in fantasy_indicators:
        if all(term in combined_text for term in terms):
            genre_scores['Fantasy'] = genre_scores.get('Fantasy', 0) + 4
    
    academic_indicators = [
        ('statistical', 'learning'),
        ('data', 'mining'),
        ('machine', 'learning'),
        ('algorithms', 'computational'),
        ('mathematics', 'theory'),
        ('programming', 'language'),
        ('computer', 'science'),
        ('neural', 'networks'),
        ('artificial', 'intelligence'),
        ('physics', 'principles'),
        ('engineering', 'design'),
        ('bioinformatics', 'genomics'),
        ('handbook', 'reference'),
        ('analysis', 'methods')
    ]
    for terms in academic_indicators:
        if all(term in combined_text for term in terms):
            genre_scores['Textbook/Education'] = genre_scores.get('Textbook/Education', 0) + 6
            genre_scores['Science'] = genre_scores.get('Science', 0) + 4
    
    if 'statistical' in combined_text or 'algorithms' in combined_text or 'mathematics' in combined_text:
        if 'Children\'s' in genre_scores:
            del genre_scores['Children\'s']
    
    return genre_scores


def analyze_metadata(book_data):
    """
    Analyze book metadata for genre signals.
    
    Parameters:
    -----------
    book_data : dict
        Book information with metadata fields
        
    Returns:
    --------
    dict
        Dictionary of genre names with confidence scores
    """
    metadata_genres = {}
    
    num_pages = book_data.get('num_pages', 0)
    
    if 0 < num_pages < 40:
        metadata_genres["Children's"] = 4
    
    if num_pages > 700:
        title = str(book_data.get('title', '')).lower()
        description = str(book_data.get('description', '')).lower()
        combined_text = title + " " + description
        
        academic_terms = ['handbook', 'textbook', 'guide', 'principles', 'introduction to', 
                        'elements of', 'fundamentals', 'statistics', 'mathematics', 
                        'engineering', 'biology', 'physics', 'chemistry']
        
        if any(term in combined_text for term in academic_terms):
            metadata_genres['Textbook/Education'] = 5
            metadata_genres['Science'] = 3
        else:
            metadata_genres['Fantasy'] = 2
    
    title = str(book_data.get('title', '')).lower()
    
    academic_patterns = [
        r'\bthe \w+ handbook\b', 
        r'\bprinciples of \w+\b',
        r'\belements of \w+\b', 
        r'\bintroduction to \w+\b',
        r'\bfundamentals of \w+\b',
        r'\bthe \w+ companion\b',
        r'\bguide to \w+\b'
    ]
    
    if any(re.search(pattern, title) for pattern in academic_patterns):
        metadata_genres['Textbook/Education'] = metadata_genres.get('Textbook/Education', 0) + 5
    
    return metadata_genres


def analyze_similar_books(book_data):
    """
    Analyze similar books for genre information.
    
    Parameters:
    -----------
    book_data : dict
        Book information containing similar_books
        
    Returns:
    --------
    dict
        Dictionary of genre names with confidence scores
    """
    genre_scores = {}
    
    similar_books = book_data.get('similar_books', [])
    if not similar_books:
        return genre_scores
    
    genre_counts = Counter()
    
    for book in similar_books:
        if 'genres' in book and book['genres']:
            for genre in book['genres']:
                genre_counts[genre] += 1
    
    total_books = len(similar_books)
    if total_books > 0:
        for genre, count in genre_counts.items():
            frequency = count / total_books
            genre_scores[genre] = frequency * 5
    
    return genre_scores


def combine_all_genre_signals(shelf_genres, nlp_genres, metadata_genres, similar_book_genres, min_confidence):
    """
    Combine genre signals from multiple sources with appropriate weighting.
    
    Parameters:
    -----------
    shelf_genres : dict
        Genres extracted from shelf data
    nlp_genres : dict
        Genres detected through NLP
    metadata_genres : dict
        Genres inferred from metadata
    similar_book_genres : dict
        Genres from similar books
    min_confidence : int
        Minimum confidence score to include a genre
        
    Returns:
    --------
    list
        Final list of detected genres in order of confidence
    """
    combined_scores = Counter()
    
    for genre, score in shelf_genres.items():
        combined_scores[genre] += score * 3
    
    for genre, score in nlp_genres.items():
        combined_scores[genre] += score * 2
    
    for genre, score in metadata_genres.items():
        combined_scores[genre] += score
    
    for genre, score in similar_book_genres.items():
        combined_scores[genre] += score * 1.5
    
    sorted_genres = sorted(combined_scores.items(), key=lambda x: x[1], reverse=True)
    
    final_genres = [genre for genre, score in sorted_genres if score >= min_confidence]
    
    if not final_genres and sorted_genres:
        final_genres = [sorted_genres[0][0]]
    
    return final_genres


def genre_fallback_detection(title, description):
    """
    Last-resort genre detection when all other methods fail.
    
    Parameters:
    -----------
    title : str
        Book title
    description : str
        Book description
        
    Returns:
    --------
    list
        List of detected genres from fallback method
    """
    simple_genres = []
    combined_text = (title + " " + description).lower()
    
    if any(word in combined_text for word in ['magic', 'wizard', 'dragon', 'spell', 'kingdom']):
        simple_genres.append('Fantasy')
    
    if any(word in combined_text for word in ['space', 'alien', 'future', 'technology', 'robot']):
        simple_genres.append('Science Fiction')
    
    if any(word in combined_text for word in ['murder', 'crime', 'detective', 'mystery', 'investigation']):
        simple_genres.append('Mystery/Thriller')
    
    if any(word in combined_text for word in ['love', 'romance', 'relationship', 'heart', 'passion']):
        simple_genres.append('Romance')
    
    if any(word in combined_text for word in ['dystopian', 'oppressive', 'survival', 'rebellion']):
        simple_genres.append('Dystopian')
    
    if any(word in combined_text for word in ['teen', 'young adult', 'coming of age', 'high school']):
        simple_genres.append('Young Adult')
    
    if any(word in combined_text for word in ['children', 'kid', 'picture', 'young reader']):
        simple_genres.append('Children\'s')
    
    if any(word in combined_text for word in ['textbook', 'theory', 'handbook', 'statistics', 'principles']):
        simple_genres.append('Textbook/Education')
    
    return simple_genres[:2]


def test_comprehensive_genre_detection():
    """
    Comprehensive test of genre detection with 10 diverse scenarios.
    
    Returns:
    --------
    pandas.DataFrame
        DataFrame containing the test books
    """
    try:
        nltk.download('punkt', quiet=True)
        nltk.download('stopwords', quiet=True)
        nltk.download('wordnet', quiet=True)
    except:
        print("NLTK download failed, but continuing...")
        
    test_books = [
        {
            'scenario': "Technical book with clear indicators",
            'book_id': 1,
            'title': 'The Elements of Statistical Learning',
            'average_rating': 4.7,
            'ratings_count': 5600,
            'description': 'This book describes the important ideas in areas such as data mining, machine learning, and bioinformatics in a statistical framework. It covers many methods including those from supervised learning, strategies for model selection, neural networks, and boosting.',
            'num_pages': 745,
            'similar_books': [
                {'title': 'Pattern Recognition and Machine Learning', 'genres': ['Textbook/Education', 'Computer Science']},
                {'title': 'Deep Learning', 'genres': ['Textbook/Education', 'Computer Science']}
            ],
            'popular_shelves': []
        },
        {
            'scenario': "Fantasy with rich shelf data",
            'book_id': 2,
            'title': 'The Name of the Wind',
            'average_rating': 4.55,
            'ratings_count': 720000,
            'description': 'The intimate narrative of his childhood in a troupe of traveling players, his years spent as a near-feral orphan in a crime-ridden city, his daringly brazen yet successful bid to enter a legendary school of magic, and his life as a fugitive after the murder of a king.',
            'num_pages': 662,
            'similar_books': [],
            'popular_shelves': [
                {'count': '12000', 'name': 'fantasy'},
                {'count': '6000', 'name': 'fiction'},
                {'count': '3000', 'name': 'favorites'},
                {'count': '2000', 'name': 'magic'}
            ]
        },
        {
            'scenario': "Children's book with minimal description",
            'book_id': 3,
            'title': 'Where the Wild Things Are',
            'average_rating': 4.22,
            'ratings_count': 850000,
            'description': 'A story of a young boy named Max.',
            'num_pages': 32,
            'similar_books': [
                {'title': 'Goodnight Moon', 'genres': ['Children\'s']},
                {'title': 'The Very Hungry Caterpillar', 'genres': ['Children\'s']}
            ],
            'popular_shelves': []
        },
        {
            'scenario': "Ambiguous genre with conflicting signals",
            'book_id': 4,
            'title': 'The Time Traveler\'s Wife',
            'average_rating': 4.1,
            'ratings_count': 1500000,
            'description': 'A love story about Henry, a librarian who involuntarily travels through time, and Clare, an artist whose life takes a natural sequential course. Henry and Clare\'s passionate affair endures across a sea of time and captures them in an impossibly romantic trap.',
            'num_pages': 546,
            'similar_books': [
                {'title': 'The Notebook', 'genres': ['Romance']},
                {'title': 'The Night Circus', 'genres': ['Fantasy', 'Romance']}
            ],
            'popular_shelves': [
                {'count': '9000', 'name': 'romance'},
                {'count': '7000', 'name': 'science-fiction'},
                {'count': '5000', 'name': 'fiction'},
                {'count': '3000', 'name': 'time-travel'}
            ]
        },
        {
            'scenario': "Non-fiction with specialized topic",
            'book_id': 5,
            'title': 'Sapiens: A Brief History of Humankind',
            'average_rating': 4.37,
            'ratings_count': 720000,
            'description': "From a renowned historian comes a groundbreaking narrative of humanity's creation and evolution that explores the ways in which biology and history have defined us and enhanced our understanding of what it means to be 'human.'",
            'num_pages': 443,
            'similar_books': [],
            'popular_shelves': [
                {'count': '8000', 'name': 'non-fiction'},
                {'count': '6000', 'name': 'history'},
                {'count': '5000', 'name': 'science'},
                {'count': '3000', 'name': 'anthropology'}
            ]
        },
        {
            'scenario': "Classic literature with minimal metadata",
            'book_id': 6,
            'title': 'Pride and Prejudice',
            'average_rating': 4.25,
            'ratings_count': 3100000,
            'description': 'Since its publication in 1813, Pride and Prejudice has become one of the most popular novels in English literature.',
            'num_pages': 279,
            'similar_books': [],
            'popular_shelves': []
        },
        {
            'scenario': "Clear dystopian signals in description",
            'book_id': 7,
            'title': 'The Giver',
            'average_rating': 4.13,
            'ratings_count': 1950000,
            'description': 'The story follows Jonas, a 12-year-old boy in a seemingly ideal world who is selected to inherit the position of Receiver of Memory, the person who stores all the past memories of the time before Sameness, in case they are ever needed to help make decisions. As Jonas receives the memories, he discovers the terrible truth about his community.',
            'num_pages': 179,
            'similar_books': [
                {'title': 'Divergent', 'genres': ['Young Adult', 'Dystopian']},
                {'title': 'The Hunger Games', 'genres': ['Young Adult', 'Dystopian']}
            ],
            'popular_shelves': []
        },
        {
            'scenario': "Cooking book with misleading science keyword",
            'book_id': 8,
            'title': 'The Science of Good Cooking',
            'average_rating': 4.32,
            'ratings_count': 8900,
            'description': 'In this radical new approach to home cooking, science is the principal tool of the chef. Discover the science behind basic cooking methods like grilling, roasting, and frying, and learn practical recipes that demonstrate each scientific principle.',
            'num_pages': 504,
            'similar_books': [],
            'popular_shelves': [
                {'count': '3000', 'name': 'cooking'},
                {'count': '2000', 'name': 'food'},
                {'count': '1000', 'name': 'reference'}
            ]
        },
        {
            'scenario': "Mathematics book with too-short description",
            'book_id': 9,
            'title': 'Introduction to Linear Algebra',
            'average_rating': 4.1,
            'ratings_count': 2100,
            'description': 'A comprehensive textbook.',
            'num_pages': 584,
            'similar_books': [],
            'popular_shelves': []
        },
        {
            'scenario': "Young adult with clear genre indicators",
            'book_id': 10,
            'title': 'Twilight',
            'average_rating': 3.59,
            'ratings_count': 5100000,
            'description': 'About three things I was absolutely positive. First, Edward was a vampire. Second, there was a part of him—and I didn\'t know how dominant that part might be—that thirsted for my blood. And third, I was unconditionally and irrevocably in love with him.',
            'num_pages': 498,
            'similar_books': [
                {'title': 'The Hunger Games', 'genres': ['Young Adult', 'Dystopian']},
                {'title': 'City of Bones', 'genres': ['Young Adult', 'Fantasy']}
            ],
            'popular_shelves': [
                {'count': '10000', 'name': 'young-adult'},
                {'count': '9000', 'name': 'romance'},
                {'count': '7000', 'name': 'fantasy'},
                {'count': '6000', 'name': 'vampires'}
            ]
        }
    ]
    df = pd.DataFrame(test_books)
    
    print("Testing genre detection across 10 diverse scenarios:\n")
    
    for i, book in enumerate(test_books):
        print(f"Scenario {i+1}: {book['scenario']}")
        print(f"Book: {book['title']}")
        print(f"Description: {book['description'][:100]}...")
        
        genres = detect_book_genre_with_advanced_nlp(book)
        
        print(f"Detected genres: {', '.join(genres)}")
        print("-" * 60)
        
    return df
    

if __name__ == "__main__":
    test_comprehensive_genre_detection()

## Detect Age range of book

In [None]:
import re
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
import pandas as pd
import numpy as np
from collections import Counter
import string
import math
import textstat

def detect_age_range(book_data):
    if book_data.get('popular_shelves') is None:
        book_data['popular_shelves'] = []
    
    title = str(book_data.get('title', ''))
    description = str(book_data.get('description', ''))
    num_pages = book_data.get('num_pages', 0)
    
    if isinstance(num_pages, str) and num_pages.strip():
        try:
            num_pages = int(num_pages)
        except ValueError:
            num_pages = 0
    elif num_pages is None or (isinstance(num_pages, float) and np.isnan(num_pages)):
        num_pages = 0
        
    age_scores = {
        '0-5': 0,
        '5-10': 0,
        '10-15': 0,
        '15+': 0
    }
    
    title_lower = title.lower()
    description_lower = description.lower()
    
    # Advanced NLP analysis of title and description
    title_complexity = analyze_text_complexity(title)
    desc_complexity = analyze_text_complexity(description)
    
    # Apply title complexity to age scores
    if title_complexity < 0.3:
        age_scores['0-5'] += 10
        age_scores['5-10'] += 5
    elif title_complexity < 0.5:
        age_scores['5-10'] += 8
        age_scores['0-5'] += 4
    elif title_complexity < 0.7:
        age_scores['10-15'] += 8
        age_scores['5-10'] += 4
    else:
        age_scores['15+'] += 8
        age_scores['10-15'] += 4
    
    # Apply description complexity to age scores
    if desc_complexity < 0.3:
        age_scores['0-5'] += 12
        age_scores['5-10'] += 6
    elif desc_complexity < 0.5:
        age_scores['5-10'] += 10
        age_scores['0-5'] += 5
    elif desc_complexity < 0.7:
        age_scores['10-15'] += 10
        age_scores['5-10'] += 5
    else:
        age_scores['15+'] += 12
        age_scores['10-15'] += 6
    
    # Sentiment analysis
    title_sentiment = analyze_sentiment(title)
    desc_sentiment = analyze_sentiment(description)
    
    # Very positive sentiment often indicates younger children's books
    if title_sentiment > 0.6:
        age_scores['0-5'] += 8
        age_scores['5-10'] += 4
    elif title_sentiment > 0.3:
        age_scores['5-10'] += 6
        age_scores['0-5'] += 3
    elif title_sentiment < -0.3:
        age_scores['15+'] += 6
        age_scores['10-15'] += 3
    
    if desc_sentiment > 0.6:
        age_scores['0-5'] += 6
        age_scores['5-10'] += 3
    elif desc_sentiment > 0.3:
        age_scores['5-10'] += 5
        age_scores['0-5'] += 2
    elif desc_sentiment < -0.3:
        age_scores['15+'] += 5
        age_scores['10-15'] += 3
    
    # POS tag patterns analysis for age appropriateness
    pos_patterns = analyze_pos_patterns(description)
    for age_range, score in pos_patterns.items():
        age_scores[age_range] += score
    
    board_book_terms = ['board book', 'bedtime', 'goodnight', 'naptime', 'toddler', 'baby', 
                        'alphabet', 'counting', 'colors', 'shapes', 'lullaby', 'nursery']
    
    if any(term in title_lower or term in description_lower for term in board_book_terms):
        age_scores['0-5'] += 12
        age_scores['5-10'] -= 5
    
    early_reader_terms = ['early reader', 'beginning reader', 'learn to read', 'level reader',
                          'first reader', 'step into reading', 'i can read', 'reading level']
    
    if any(term in title_lower or term in description_lower for term in early_reader_terms):
        age_scores['5-10'] += 12
        age_scores['0-5'] -= 2
    
    grade_terms = {
        '0-5': ['preschool', 'pre-k', 'kindergarten'],
        '5-10': ['grade 1', 'grade 2', 'grade 3', 'grade 4', 'first grade', 'second grade', 
                'third grade', 'fourth grade', 'fifth grade', 'elementary'],
        '10-15': ['grade 5', 'grade 6', 'grade 7', 'grade 8', 'middle school', 'middle-grade', 
                 'middle grade', 'tween'],
        '15+': ['grade 9', 'grade 10', 'grade 11', 'grade 12', 'high school', 'teen', 'young adult',
               'ya', 'college', 'university']
    }
    
    for age_range, terms in grade_terms.items():
        if any(term in title_lower or term in description_lower for term in terms):
            age_scores[age_range] += 10
    
    if num_pages <= 32:
        age_scores['0-5'] += 15
        age_scores['5-10'] -= 3
    elif 33 <= num_pages <= 48:
        age_scores['0-5'] += 10
        age_scores['5-10'] += 5
    elif 49 <= num_pages <= 80:
        age_scores['5-10'] += 12
        age_scores['0-5'] -= 2
    elif 81 <= num_pages <= 120:
        age_scores['5-10'] += 8
        age_scores['10-15'] += 4
    elif 121 <= num_pages <= 200:
        age_scores['10-15'] += 8
        age_scores['5-10'] += 4
    elif 201 <= num_pages <= 350:
        age_scores['10-15'] += 10
        age_scores['15+'] += 5
    elif num_pages > 350:
        age_scores['15+'] += 12
        age_scores['10-15'] += 6
    
    # Combine textstat metrics with more weight
    try:
        flesch_reading_ease = textstat.flesch_reading_ease(description)
        flesch_kincaid_grade = textstat.flesch_kincaid_grade(description)
        gunning_fog = textstat.gunning_fog(description)
        coleman_liau = textstat.coleman_liau_index(description)
        smog = textstat.smog_index(description)
        dale_chall = textstat.dale_chall_readability_score(description)
        ari = textstat.automated_readability_index(description)
        
        # Advanced readability score combination
        if flesch_reading_ease > 90:
            age_scores['0-5'] += 15
            age_scores['5-10'] += 5
        elif flesch_reading_ease > 80:
            age_scores['5-10'] += 12
            age_scores['0-5'] += 8
        elif flesch_reading_ease > 70:
            age_scores['5-10'] += 10
            age_scores['10-15'] += 8
        elif flesch_reading_ease > 60:
            age_scores['10-15'] += 12
            age_scores['15+'] += 5
        else:
            age_scores['15+'] += 15
            age_scores['10-15'] += 5
        
        # Weighted grade level metrics for better discrimination
        weighted_grade_level = (
            flesch_kincaid_grade * 0.35 + 
            gunning_fog * 0.2 + 
            coleman_liau * 0.15 + 
            smog * 0.15 + 
            dale_chall * 0.1 + 
            ari * 0.05
        )
        
        if weighted_grade_level < 1:
            age_scores['0-5'] += 15
            age_scores['5-10'] -= 2
        elif weighted_grade_level < 3:
            age_scores['0-5'] += 10
            age_scores['5-10'] += 5
        elif weighted_grade_level < 5:
            age_scores['5-10'] += 12
            age_scores['0-5'] += 3
        elif weighted_grade_level < 8:
            age_scores['10-15'] += 10
            age_scores['5-10'] += 5
        elif weighted_grade_level < 12:
            age_scores['15+'] += 8
            age_scores['10-15'] += 12
        else:
            age_scores['15+'] += 15
    except:
        pass
    
    # Content theme analysis with increased weight for theme matches
    content_themes = analyze_text_themes(title_lower, description_lower)
    for age_range, score in content_themes.items():
        age_scores[age_range] += score * 1.5
    
    # Shelf analysis
    shelves = book_data.get('popular_shelves', [])
    shelf_age_indicators = analyze_shelves_for_age(shelves)
    for age_range, score in shelf_age_indicators.items():
        age_scores[age_range] += score
    
    max_score = max(age_scores.values())
    final_age_range = max(age_scores.items(), key=lambda x: x[1])[0]
    
    return final_age_range

def analyze_text_complexity(text):
    if not text or len(text) < 5:
        return 0.5
    
    try:
        sentences = sent_tokenize(text)
        words = word_tokenize(text)
        
        if not sentences or not words:
            return 0.5
        
        avg_sentence_length = len(words) / max(1, len(sentences))
        avg_word_length = sum(len(word) for word in words if word.isalpha()) / max(1, len([w for w in words if w.isalpha()]))
        
        # Calculate lexical diversity (larger vocabulary suggests more complex text)
        unique_words = len(set(word.lower() for word in words if word.isalpha()))
        lexical_diversity = unique_words / max(1, len([w for w in words if w.isalpha()]))
        
        # Calculate percentage of complex words (words with 3+ syllables)
        complex_words = sum(1 for word in words if word.isalpha() and textstat.syllable_count(word) >= 3)
        complex_words_pct = complex_words / max(1, len([w for w in words if w.isalpha()]))
        
        # Weighted complexity score
        complexity_score = (
            (avg_sentence_length / 25) * 0.3 + 
            (avg_word_length / 7) * 0.2 + 
            lexical_diversity * 0.25 + 
            complex_words_pct * 0.25
        )
        
        return min(1.0, complexity_score)
    except:
        return 0.5

def analyze_sentiment(text):
    if not text:
        return 0
    
    try:
        positive_words = [
            'good', 'great', 'happy', 'love', 'fun', 'wonderful', 'joy', 'exciting', 'adventure',
            'magical', 'beautiful', 'sweet', 'friendly', 'amazing', 'delight', 'smile', 'laugh',
            'play', 'colorful', 'bright', 'gentle', 'kind', 'nice', 'fantastic', 'awesome'
        ]
        
        negative_words = [
            'bad', 'sad', 'angry', 'hate', 'fear', 'dark', 'scary', 'terrible', 'awful', 'horrible',
            'cruel', 'evil', 'death', 'fight', 'war', 'cry', 'pain', 'suffer', 'struggle', 'difficult',
            'harsh', 'violent', 'grim', 'tragic', 'disaster'
        ]
        
        words = word_tokenize(text.lower())
        
        pos_count = sum(1 for word in words if word in positive_words)
        neg_count = sum(1 for word in words if word in negative_words)
        
        total_matches = pos_count + neg_count
        if total_matches == 0:
            return 0
        
        return (pos_count - neg_count) / total_matches
    except:
        return 0

def analyze_pos_patterns(text):
    try:
        age_patterns = {
            '0-5': 0,
            '5-10': 0,
            '10-15': 0,
            '15+': 0
        }
        
        # Get POS tags
        tokens = word_tokenize(text.lower())
        tagged = pos_tag(tokens)
        
        # Count parts of speech
        pos_counts = Counter(tag for word, tag in tagged)
        total_tokens = len(tagged)
        
        if total_tokens == 0:
            return age_patterns
        
        # Simple sentence structure (mainly nouns and verbs) - for young children
        simple_structure = (pos_counts.get('NN', 0) + pos_counts.get('NNS', 0) + 
                           pos_counts.get('VB', 0) + pos_counts.get('VBZ', 0) + 
                           pos_counts.get('VBP', 0)) / total_tokens
        
        # Complex sentence markers (conjunctions, relative pronouns, etc.)
        complex_markers = (pos_counts.get('IN', 0) + pos_counts.get('WDT', 0) + 
                          pos_counts.get('WP', 0) + pos_counts.get('WRB', 0)) / total_tokens
        
        # Advanced language features (adjectives, adverbs, etc.)
        advanced_features = (pos_counts.get('JJ', 0) + pos_counts.get('JJR', 0) + 
                            pos_counts.get('JJS', 0) + pos_counts.get('RB', 0) + 
                            pos_counts.get('RBR', 0) + pos_counts.get('RBS', 0)) / total_tokens
        
        # Score assignment based on POS patterns
        if simple_structure > 0.6 and complex_markers < 0.1:
            age_patterns['0-5'] += 8
            age_patterns['5-10'] += 4
        elif simple_structure > 0.5 and complex_markers < 0.15:
            age_patterns['5-10'] += 7
            age_patterns['0-5'] += 3
        elif complex_markers > 0.15 and advanced_features > 0.2:
            age_patterns['10-15'] += 6
            age_patterns['15+'] += 3
        elif complex_markers > 0.2 and advanced_features > 0.25:
            age_patterns['15+'] += 8
            age_patterns['10-15'] += 4
        
        return age_patterns
    except:
        return {
            '0-5': 0,
            '5-10': 0,
            '10-15': 0,
            '15+': 0
        }

def analyze_text_themes(title, description):
    combined_text = title + " " + description
    
    theme_scores = {
        '0-5': 0,
        '5-10': 0,
        '10-15': 0,
        '15+': 0
    }
    
    early_themes = ['sleep', 'bed', 'nap', 'dream', 'moon', 'star', 'night', 'bunny', 'teddy', 
                    'toy', 'farm', 'animal', 'cat', 'dog', 'duck', 'color', 'zoo', 'mommy', 
                    'daddy', 'parent', 'bath', 'diaper', 'potty', 'train', 'truck', 'car', 
                    'alphabet', 'abc', 'number', '123', 'count', 'rhyme']
    
    elementary_themes = ['school', 'teacher', 'friend', 'adventure', 'fun', 'magic', 'fairy', 
                        'dragon', 'dinosaur', 'spy', 'detective', 'mystery', 'solve', 'game', 
                        'play', 'team', 'sport', 'chapter', 'series', 'collect', 'comic', 
                        'joke', 'funny', 'humor', 'silly', 'prank', 'robot', 'space', 'science']
    
    middle_themes = ['friend', 'school', 'bully', 'crush', 'team', 'competition', 'journal', 
                    'diary', 'secret', 'club', 'grow', 'family', 'sibling', 'parent', 'problem', 
                    'solve', 'quest', 'mission', 'summer', 'camp', 'vacation', 'holiday', 
                    'fantasy', 'world', 'magic', 'spell', 'creature', 'monster', 'ghost']
    
    ya_themes = ['love', 'romance', 'relationship', 'kiss', 'boyfriend', 'girlfriend', 'dating', 
                'death', 'tragedy', 'war', 'battle', 'fight', 'survive', 'future', 'dystopian', 
                'apocalypse', 'society', 'rebellion', 'government', 'power', 'politics', 'identity', 
                'struggle', 'college', 'career', 'adult', 'mature', 'violence', 'blood']
    
    for theme in early_themes:
        if theme in combined_text:
            theme_scores['0-5'] += 1.5
    
    for theme in elementary_themes:
        if theme in combined_text:
            theme_scores['5-10'] += 1.5
    
    for theme in middle_themes:
        if theme in combined_text:
            theme_scores['10-15'] += 1.5
    
    for theme in ya_themes:
        if theme in combined_text:
            theme_scores['15+'] += 1.5
    
    return theme_scores

def analyze_shelves_for_age(shelves):
    shelf_patterns = {
        '0-5': ['picture book', 'board book', 'childrens', 'toddler', 'baby', 'preschool', 
               'bedtime', 'nursery', 'concept book'],
        '5-10': ['early reader', 'chapter book', 'childrens', 'kids', 'elementary', 'juvenile', 
                'easy reader'],
        '10-15': ['middle grade', 'middle-grade', 'tween', 'juvenile', 'preteen'],
        '15+': ['young adult', 'ya', 'teen', 'high school', 'new adult', 'adult']
    }
    
    shelf_scores = {
        '0-5': 0,
        '5-10': 0,
        '10-15': 0,
        '15+': 0
    }
    
    for shelf in shelves:
        shelf_name = shelf.get('name', '').lower()
        shelf_count = int(shelf.get('count', 0))
        
        for age_range, patterns in shelf_patterns.items():
            if any(pattern in shelf_name for pattern in patterns):
                shelf_scores[age_range] += min(12, math.log(shelf_count + 1) * 2)
    
    return shelf_scores

def analyze_books_for_age_ranges(books_df):
    result_df = books_df.copy()
    age_ranges = []
    
    for _, book in books_df.iterrows():
        age_range = detect_age_range(book)
        age_ranges.append(age_range)
    
    result_df['age_range'] = age_ranges
    return result_df

if __name__ == "__main__":
    test_books = [
        {
            'book_id': 1,
            'title': 'Goodnight Moon',
            'average_rating': 4.3,
            'ratings_count': 310000,
            'description': 'In a great green room, tucked away in bed, is a little bunny. "Goodnight room, goodnight moon." And to all the familiar things in the softly lit room—to the picture of the three little bears sitting on chairs, to the clocks and his socks, to the mittens and the kittens, to everything one by one—the little bunny says goodnight.',
            'num_pages': 32,
            'popular_shelves': [
                {'count': '11000', 'name': 'picture-books'},
                {'count': '5000', 'name': 'childrens'}
            ]
        },
        {
            'book_id': 2,
            'title': 'The Very Hungry Caterpillar',
            'average_rating': 4.35,
            'ratings_count': 420000,
            'description': 'This is the classic edition of the bestselling story written for the very young. A newly hatched caterpillar eats his way through all kinds of food.',
            'num_pages': 24,
            'popular_shelves': [
                {'count': '9000', 'name': 'picture-books'},
                {'count': '4800', 'name': 'childrens'}
            ]
        },
        {
            'book_id': 3,
            'title': 'Brown Bear, Brown Bear, What Do You See?',
            'average_rating': 4.25,
            'ratings_count': 340000,
            'description': 'A big happy frog, a plump purple cat, a handsome blue horse, and a soft yellow duck--all parade across the pages of this delightful book. Children will immediately respond to Eric Carle\'s flat, boldly colored collages. Combined with Bill Martin\'s singsong text, they create unforgettable images of these endearing animals.',
            'num_pages': 28,
            'popular_shelves': [
                {'count': '8500', 'name': 'picture-books'},
                {'count': '4000', 'name': 'childrens'}
            ]
        },
        {
            'book_id': 4,
            'title': 'Chicka Chicka Boom Boom',
            'average_rating': 4.18,
            'ratings_count': 180000,
            'description': 'In this lively alphabet rhyme, the letters of the alphabet race up the coconut tree. Will there be enough room? Oh, no—Chicka Chicka Boom Boom! The well-known authors of Barn Dance and Knots on a Counting Rope have created a rhythmic alphabet chant that rolls along on waves of fun.',
            'num_pages': 36,
            'popular_shelves': [
                {'count': '6000', 'name': 'picture-books'},
                {'count': '3500', 'name': 'childrens'}
            ]
        },
        {
            'book_id': 5,
            'title': 'If You Give a Mouse a Cookie',
            'average_rating': 4.32,
            'ratings_count': 250000,
            'description': 'If a hungry little mouse shows up on your doorstep, you might want to give him a cookie. And if you give him a cookie, he\'ll ask for a glass of milk. He\'ll want to look in a mirror to make sure he doesn\'t have a milk mustache, and then he\'ll ask for a pair of scissors to give himself a trim....',
            'num_pages': 40,
            'popular_shelves': [
                {'count': '7000', 'name': 'picture-books'},
                {'count': '3800', 'name': 'childrens'}
            ]
        },
        {
            'book_id': 6,
            'title': 'Green Eggs and Ham',
            'average_rating': 4.29,
            'ratings_count': 630000,
            'description': '"Do you like green eggs and ham?" asks Sam-I-am in this Beginner Book by Dr. Seuss. In a house or with a mouse? In a boat or with a goat? On a train or in a tree? Sam keeps asking persistently. With unmistakable characters and signature rhymes, Dr. Seuss\'s beloved favorite has cemented its place as a children\'s classic.',
            'num_pages': 62,
            'popular_shelves': [
                {'count': '12000', 'name': 'childrens'},
                {'count': '7500', 'name': 'picture-books'}
            ]
        },
        {
            'book_id': 7,
            'title': 'Magic Tree House #1: Dinosaurs Before Dark',
            'average_rating': 4.12,
            'ratings_count': 180000,
            'description': 'Jack and his younger sister Annie find a magic tree house, which whisks them back to an ancient time zone where they see live dinosaurs.',
            'num_pages': 68,
            'popular_shelves': [
                {'count': '5000', 'name': 'chapter-books'},
                {'count': '3800', 'name': 'childrens'}
            ]
        },
        {
            'book_id': 8,
            'title': 'Charlotte\'s Web',
            'average_rating': 4.18,
            'ratings_count': 1600000,
            'description': 'Some Pig. Humble. Radiant. These are the words in Charlotte\'s Web, high up in Zuckerman\'s barn. Charlotte\'s spiderweb tells of her feelings for a little pig named Wilbur, who simply wants a friend. They also express the love of a girl named Fern, who saved Wilbur\'s life when he was born the runt of his litter.',
            'num_pages': 184,
            'popular_shelves': [
                {'count': '9000', 'name': 'childrens'},
                {'count': '6500', 'name': 'classics'}
            ]
        },
        {
            'book_id': 9,
            'title': 'How to Code: A Step-by-Step Guide to Computer Programming',
            'average_rating': 4.32,
            'ratings_count': 450,
            'description': 'This colorful guide teaches kids the basics of computer programming in a fun and easy-to-follow format. Perfect for children ages 8-12 who want to learn to code, this beginner\'s guide introduces core coding concepts through easy step-by-step instructions and simple projects.',
            'num_pages': 96,
            'popular_shelves': [
                {'count': '120', 'name': 'educational'},
                {'count': '85', 'name': 'computer-science'}
            ]
        },
        {
            'book_id': 10,
            'title': 'The Lion, the Witch and the Wardrobe',
            'average_rating': 4.22,
            'ratings_count': 2200000,
            'description': 'Four adventurous siblings—Peter, Susan, Edmund, and Lucy Pevensie—step through a wardrobe door and into the land of Narnia, a land frozen in eternal winter and enslaved by the power of the White Witch. But when almost all hope is lost, the return of the Great Lion, Aslan, signals a great change... and a great sacrifice.',
            'num_pages': 206,
            'popular_shelves': [
                {'count': '11000', 'name': 'fantasy'},
                {'count': '9000', 'name': 'childrens'}
            ]
        },
        {
            'book_id': 11,
            'title': 'Harry Potter and the Sorcerer\'s Stone',
            'average_rating': 4.47,
            'ratings_count': 7300000,
            'description': 'Harry Potter has no idea how famous he is. That\'s because he\'s being raised by his miserable aunt and uncle who are terrified Harry will learn that he\'s really a wizard, just as his parents were. But everything changes when Harry is summoned to attend an infamous school for wizards, and he begins to discover some clues about his illustrious birthright.',
            'num_pages': 309,
            'popular_shelves': [
                {'count': '25000', 'name': 'fantasy'},
                {'count': '18000', 'name': 'young-adult'}
            ]
        },
        {
            'book_id': 12,
            'title': 'Percy Jackson and the Lightning Thief',
            'average_rating': 4.25,
            'ratings_count': 2200000,
            'description': 'Percy Jackson is a good kid, but he can\'t seem to focus on his schoolwork or control his temper. And lately, being away at boarding school is only getting worse - Percy could have sworn his pre-algebra teacher turned into a monster and tried to kill him. When Percy\'s mom finds out, she knows it\'s time that he knew the truth about where he came from, and that he go to the one place he\'ll be safe.',
            'num_pages': 377,
            'popular_shelves': [
                {'count': '14000', 'name': 'fantasy'},
                {'count': '9500', 'name': 'young-adult'}
            ]
        },
        {
            'book_id': 13,
            'title': 'Diary of a Wimpy Kid',
            'average_rating': 4.11,
            'ratings_count': 680000,
            'description': 'Greg Heffley finds himself thrust into a new year and a new school where undersize weaklings share the corridors with kids who are taller, meaner and already shaving. Desperate to prove his new found maturity, which only going up a grade can bring, Greg is happy to have his not-quite-so-cool sidekick, Rowley, along for the ride.',
            'num_pages': 217,
            'popular_shelves': [
                {'count': '5000', 'name': 'middle-grade'},
                {'count': '3000', 'name': 'humor'}
            ]
        },
        {
            'book_id': 14,
            'title': 'Wonder',
            'average_rating': 4.42,
            'ratings_count': 925000,
            'description': 'August Pullman was born with a facial difference that, up until now, has prevented him from going to a mainstream school. Starting 5th grade at Beecher Prep, he wants nothing more than to be treated as an ordinary kid but his new classmates cannot get past Auggie\'s extraordinary face.',
            'num_pages': 315,
            'popular_shelves': [
                {'count': '7500', 'name': 'middle-grade'},
                {'count': '5200', 'name': 'realistic-fiction'}
            ]
        },
        {
            'book_id': 15,
            'title': 'Scratch Programming for Middle School Students',
            'average_rating': 4.4,
            'ratings_count': 180,
            'description': 'This guide introduces Scratch programming to middle school students through engaging projects and interactive games. Designed for classroom use or self-study for beginners aged 10-14, this book covers basic concepts and builds to intermediate challenges.',
            'num_pages': 145,
            'popular_shelves': [
                {'count': '35', 'name': 'educational'},
                {'count': '25', 'name': 'programming'}
            ]
        },
        {
            'book_id': 16,
            'title': 'The Hunger Games',
            'average_rating': 4.32,
            'ratings_count': 6100000,
            'description': 'In the ruins of a place once known as North America lies the nation of Panem, a shining Capitol surrounded by twelve outlying districts. The Capitol is harsh and cruel and keeps the districts in line by forcing them all to send one boy and one girl between the ages of twelve and eighteen to participate in the annual Hunger Games, a fight to the death on live TV.',
            'num_pages': 374,
            'popular_shelves': [
                {'count': '9000', 'name': 'young-adult'},
                {'count': '7000', 'name': 'dystopian'}
            ]
        },
        {
            'book_id': 17,
            'title': 'The Fault in Our Stars',
            'average_rating': 4.26,
            'ratings_count': 3700000,
            'description': 'Despite the tumor-shrinking medical miracle that has bought her a few years, Hazel has never been anything but terminal, her final chapter inscribed upon diagnosis. But when a gorgeous plot twist named Augustus Waters suddenly appears at Cancer Kid Support Group, Hazel\'s story is about to be completely rewritten.',
            'num_pages': 313,
            'popular_shelves': [
                {'count': '8500', 'name': 'young-adult'},
                {'count': '7200', 'name': 'romance'}
            ]
        },
        {
            'book_id': 18,
            'title': 'Six of Crows',
            'average_rating': 4.46,
            'ratings_count': 350000,
            'description': 'Ketterdam: a bustling hub of international trade where anything can be had for the right price—and no one knows that better than criminal prodigy Kaz Brekker. Kaz is offered a chance at a deadly heist that could make him rich beyond his wildest dreams. But he can\'t pull it off alone. A convict with a thirst for revenge. A sharpshooter who can\'t walk away from a wager. A runaway with a privileged past. A spy known as the Wraith. A Heartrender using her magic to survive the slums. A thief with a gift for unlikely escapes.',
            'num_pages': 465,
            'popular_shelves': [
                {'count': '7500', 'name': 'fantasy'},
                {'count': '6000', 'name': 'young-adult'}
            ]
        },
        {
            'book_id': 19,
            'title': 'Introduction to Algorithms for High School Students',
            'average_rating': 4.5,
            'ratings_count': 230,
            'description': 'A simplified introduction to computer algorithms designed specifically for high school students. This book covers basic searching and sorting algorithms, graphs, and dynamic programming with clear examples and exercises.',
            'num_pages': 210,
            'popular_shelves': [
                {'count': '45', 'name': 'computer-science'},
                {'count': '30', 'name': 'textbooks'}
            ]
        },
        {
            'book_id': 20,
            'title': 'Baby\'s First Animal Board Book',
            'average_rating': 4.7,
            'ratings_count': 830,
            'description': 'This durable board book introduces babies and toddlers to adorable animal friends with bright colors and simple text. Each spread features familiar animals with their names clearly labeled, perfect for early learning and vocabulary development for ages 0-2.',
            'num_pages': 12,
            'popular_shelves': [
                {'count': '150', 'name': 'board-book'},
                {'count': '95', 'name': 'baby-books'}
            ]
        }
    ]
    
    sample_df = pd.DataFrame(test_books)
    
    try:
        nltk.download('punkt', quiet=True)
        nltk.download('stopwords', quiet=True)
        nltk.download('wordnet', quiet=True)
        nltk.download('averaged_perceptron_tagger', quiet=True)
    except:
        print("NLTK download failed, but continuing...")
    
    result_df = analyze_books_for_age_ranges(sample_df)
    
    print("\nBook Age Range Detection Results:")
    print("=" * 80)
    for _, book in result_df.iterrows():
        print(f"Title: {book['title']}")
        print(f"Description: {book['description'][:50]}...")
        print(f"Pages: {book['num_pages']}")
        print(f"Detected Age Range: {book['age_range']}")
        print("-" * 80)

In [None]:
import re
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import pandas as pd
import numpy as np
from collections import Counter
import string
import math
import textstat

def detect_age_range(book_data):
    if book_data.get('popular_shelves') is None:
        book_data['popular_shelves'] = []
    
    title = str(book_data.get('title', ''))
    description = str(book_data.get('description', ''))
    num_pages = book_data.get('num_pages', 0)
    
    if isinstance(num_pages, str) and num_pages.strip():
        try:
            num_pages = int(num_pages)
        except ValueError:
            num_pages = 0
    elif num_pages is None or (isinstance(num_pages, float) and np.isnan(num_pages)):
        num_pages = 0
        
    age_scores = {
        '0-5': 0,
        '5-10': 0,
        '10-15': 0,
        '15+': 0
    }
    
    title_lower = title.lower()
    description_lower = description.lower()
    
    board_book_terms = ['board book', 'bedtime', 'goodnight', 'naptime', 'toddler', 'baby', 
                        'alphabet', 'counting', 'colors', 'shapes', 'lullaby', 'nursery']
    
    if any(term in title_lower or term in description_lower for term in board_book_terms):
        age_scores['0-5'] += 12
        age_scores['5-10'] -= 5
    
    early_reader_terms = ['early reader', 'beginning reader', 'learn to read', 'level reader',
                          'first reader', 'step into reading', 'i can read', 'reading level']
    
    if any(term in title_lower or term in description_lower for term in early_reader_terms):
        age_scores['5-10'] += 12
        age_scores['0-5'] -= 2
    
    grade_terms = {
        '0-5': ['preschool', 'pre-k', 'kindergarten'],
        '5-10': ['grade 1', 'grade 2', 'grade 3', 'grade 4', 'first grade', 'second grade', 
                'third grade', 'fourth grade', 'fifth grade', 'elementary'],
        '10-15': ['grade 5', 'grade 6', 'grade 7', 'grade 8', 'middle school', 'middle-grade', 
                 'middle grade', 'tween'],
        '15+': ['grade 9', 'grade 10', 'grade 11', 'grade 12', 'high school', 'teen', 'young adult',
               'ya', 'college', 'university']
    }
    
    for age_range, terms in grade_terms.items():
        if any(term in title_lower or term in description_lower for term in terms):
            age_scores[age_range] += 10
    
    if num_pages <= 32:
        age_scores['0-5'] += 15
        age_scores['5-10'] -= 3
    elif 33 <= num_pages <= 48:
        age_scores['0-5'] += 10
        age_scores['5-10'] += 5
    elif 49 <= num_pages <= 80:
        age_scores['5-10'] += 12
        age_scores['0-5'] -= 2
    elif 81 <= num_pages <= 120:
        age_scores['5-10'] += 8
        age_scores['10-15'] += 4
    elif 121 <= num_pages <= 200:
        age_scores['10-15'] += 8
        age_scores['5-10'] += 4
    elif 201 <= num_pages <= 350:
        age_scores['10-15'] += 10
        age_scores['15+'] += 5
    elif num_pages > 350:
        age_scores['15+'] += 12
        age_scores['10-15'] += 6
    
    try:
        flesch_reading_ease = textstat.flesch_reading_ease(description)
        flesch_kincaid_grade = textstat.flesch_kincaid_grade(description)
        gunning_fog = textstat.gunning_fog(description)
        coleman_liau = textstat.coleman_liau_index(description)
        smog = textstat.smog_index(description)
        
        if flesch_reading_ease > 90:
            age_scores['0-5'] += 15
            age_scores['5-10'] += 5
        elif flesch_reading_ease > 80:
            age_scores['5-10'] += 12
            age_scores['0-5'] += 8
        elif flesch_reading_ease > 70:
            age_scores['5-10'] += 10
            age_scores['10-15'] += 8
        elif flesch_reading_ease > 60:
            age_scores['10-15'] += 12
            age_scores['15+'] += 5
        else:
            age_scores['15+'] += 15
            age_scores['10-15'] += 5
        
        avg_grade_level = (flesch_kincaid_grade + gunning_fog + coleman_liau + smog) / 4
        
        if avg_grade_level < 1:
            age_scores['0-5'] += 15
        elif avg_grade_level < 3:
            age_scores['0-5'] += 10
            age_scores['5-10'] += 5
        elif avg_grade_level < 5:
            age_scores['5-10'] += 12
            age_scores['0-5'] += 3
        elif avg_grade_level < 8:
            age_scores['10-15'] += 10
            age_scores['5-10'] += 5
        elif avg_grade_level < 12:
            age_scores['15+'] += 8
            age_scores['10-15'] += 12
        else:
            age_scores['15+'] += 15
    except:
        pass
    
    content_themes = analyze_text_themes(title_lower, description_lower)
    for age_range, score in content_themes.items():
        age_scores[age_range] += score
    
    shelves = book_data.get('popular_shelves', [])
    shelf_age_indicators = analyze_shelves_for_age(shelves)
    for age_range, score in shelf_age_indicators.items():
        age_scores[age_range] += score
    
    max_score = max(age_scores.values())
    final_age_range = max(age_scores.items(), key=lambda x: x[1])[0]
    
    return final_age_range

def analyze_text_themes(title, description):
    combined_text = title + " " + description
    
    theme_scores = {
        '0-5': 0,
        '5-10': 0,
        '10-15': 0,
        '15+': 0
    }
    
    early_themes = ['sleep', 'bed', 'nap', 'dream', 'moon', 'star', 'night', 'bunny', 'teddy', 
                    'toy', 'farm', 'animal', 'cat', 'dog', 'duck', 'color', 'zoo', 'mommy', 
                    'daddy', 'parent', 'bath', 'diaper', 'potty', 'train', 'truck', 'car', 
                    'alphabet', 'abc', 'number', '123', 'count', 'rhyme']
    
    elementary_themes = ['school', 'teacher', 'friend', 'adventure', 'fun', 'magic', 'fairy', 
                        'dragon', 'dinosaur', 'spy', 'detective', 'mystery', 'solve', 'game', 
                        'play', 'team', 'sport', 'chapter', 'series', 'collect', 'comic', 
                        'joke', 'funny', 'humor', 'silly', 'prank', 'robot', 'space', 'science']
    
    middle_themes = ['friend', 'school', 'bully', 'crush', 'team', 'competition', 'journal', 
                    'diary', 'secret', 'club', 'grow', 'family', 'sibling', 'parent', 'problem', 
                    'solve', 'quest', 'mission', 'summer', 'camp', 'vacation', 'holiday', 
                    'fantasy', 'world', 'magic', 'spell', 'creature', 'monster', 'ghost']
    
    ya_themes = ['love', 'romance', 'relationship', 'kiss', 'boyfriend', 'girlfriend', 'dating', 
                'death', 'tragedy', 'war', 'battle', 'fight', 'survive', 'future', 'dystopian', 
                'apocalypse', 'society', 'rebellion', 'government', 'power', 'politics', 'identity', 
                'struggle', 'college', 'career', 'adult', 'mature', 'violence', 'blood']
    
    for theme in early_themes:
        if theme in combined_text:
            theme_scores['0-5'] += 1.5
    
    for theme in elementary_themes:
        if theme in combined_text:
            theme_scores['5-10'] += 1.5
    
    for theme in middle_themes:
        if theme in combined_text:
            theme_scores['10-15'] += 1.5
    
    for theme in ya_themes:
        if theme in combined_text:
            theme_scores['15+'] += 1.5
    
    return theme_scores

def analyze_shelves_for_age(shelves):
    shelf_patterns = {
        '0-5': ['picture book', 'board book', 'childrens', 'toddler', 'baby', 'preschool', 
               'bedtime', 'nursery', 'concept book'],
        '5-10': ['early reader', 'chapter book', 'childrens', 'kids', 'elementary', 'juvenile', 
                'easy reader'],
        '10-15': ['middle grade', 'middle-grade', 'tween', 'juvenile', 'preteen'],
        '15+': ['young adult', 'ya', 'teen', 'high school', 'new adult', 'adult']
    }
    
    shelf_scores = {
        '0-5': 0,
        '5-10': 0,
        '10-15': 0,
        '15+': 0
    }
    
    for shelf in shelves:
        shelf_name = shelf.get('name', '').lower()
        shelf_count = int(shelf.get('count', 0))
        
        for age_range, patterns in shelf_patterns.items():
            if any(pattern in shelf_name for pattern in patterns):
                shelf_scores[age_range] += min(12, math.log(shelf_count + 1) * 2)
    
    return shelf_scores

def analyze_books_for_age_ranges(books_df):
    result_df = books_df.copy()
    age_ranges = []
    
    for _, book in books_df.iterrows():
        age_range = detect_age_range(book)
        age_ranges.append(age_range)
    
    result_df['age_range'] = age_ranges
    return result_df

if __name__ == "__main__":
    test_books = [
        {
            'book_id': 1,
            'title': 'Goodnight Moon',
            'average_rating': 4.3,
            'ratings_count': 325000,
            'description': 'In a great green room, tucked away in bed, is a little bunny. "Goodnight room, goodnight moon." And to all the familiar things in the softly lit room—to the picture of the three little bears sitting on chairs, to the clocks and his socks, to the mittens and the kittens, to everything one by one—the little bunny says goodnight.',
            'num_pages': 32,
            'popular_shelves': [
                {'count': '12000', 'name': 'picture-book'},
                {'count': '8000', 'name': 'childrens'}
            ]
        },
        {
            'book_id': 2,
            'title': 'The Very Hungry Caterpillar',
            'average_rating': 4.35,
            'ratings_count': 420000,
            'description': 'This is the classic edition of the bestselling story written for the very young. A newly hatched caterpillar eats his way through all kinds of food, getting bigger and bigger, until eventually he turns into a beautiful butterfly. One of the most popular picture books of all time, no nursery bookshelf is complete without a copy.',
            'num_pages': 24,
            'popular_shelves': [
                {'count': '14000', 'name': 'picture-book'},
                {'count': '9000', 'name': 'childrens'}
            ]
        },
        {
            'book_id': 3,
            'title': 'Brown Bear, Brown Bear, What Do You See?',
            'average_rating': 4.25,
            'ratings_count': 315000,
            'description': 'A big happy frog, a plump purple cat, a handsome blue horse, and a soft yellow duck--all parade across the pages of this delightful book. Children will immediately respond to Eric Carle\'s flat, boldly colored collages. Combined with Bill Martin\'s singsong text, they create unforgettable images of these endearing animals.',
            'num_pages': 28,
            'popular_shelves': [
                {'count': '10000', 'name': 'picture-book'},
                {'count': '6500', 'name': 'childrens'}
            ]
        },
        {
            'book_id': 4,
            'title': 'Chicka Chicka Boom Boom',
            'average_rating': 4.15,
            'ratings_count': 250000,
            'description': 'In this lively alphabet rhyme, the letters of the alphabet race up the coconut tree. Will there be enough room? Oh, no - Chicka Chicka Boom Boom! The well-known authors of Barn Dance and Knots on a Counting Rope have created a rhythmic alphabet chant that rolls along on waves of fun.',
            'num_pages': 36,
            'popular_shelves': [
                {'count': '8500', 'name': 'picture-book'},
                {'count': '5500', 'name': 'childrens'}
            ]
        },
        {
            'book_id': 5,
            'title': 'If You Give a Mouse a Cookie',
            'average_rating': 4.32,
            'ratings_count': 275000,
            'description': 'If a hungry little mouse shows up on your doorstep, you might want to give him a cookie. And if you give him a cookie, he\'ll ask for a glass of milk. He\'ll want to look in a mirror to make sure he doesn\'t have a milk mustache, and then he\'ll ask for a pair of scissors to give himself a trim....',
            'num_pages': 40,
            'popular_shelves': [
                {'count': '9000', 'name': 'picture-book'},
                {'count': '6000', 'name': 'childrens'}
            ]
        },
        {
            'book_id': 6,
            'title': 'Green Eggs and Ham',
            'average_rating': 4.3,
            'ratings_count': 510000,
            'description': '"Do you like green eggs and ham?" asks Sam-I-am in this Beginner Book by Dr. Seuss. In a house or with a mouse? In a boat or with a goat? On a train or in a tree? Sam keeps asking persistently. With unmistakable characters and signature rhymes, Dr. Seuss\'s beloved favorite has cemented its place as a children\'s classic.',
            'num_pages': 62,
            'popular_shelves': [
                {'count': '11000', 'name': 'childrens'},
                {'count': '7000', 'name': 'picture-book'}
            ]
        },
        {
            'book_id': 7,
            'title': 'Magic Tree House #1: Dinosaurs Before Dark',
            'average_rating': 4.1,
            'ratings_count': 175000,
            'description': 'Jack and his younger sister Annie find a magic tree house, which whisks them back to an ancient time zone where they see live dinosaurs. They find a book that tells them about dinosaurs and as they are reading the book, the dinosaurs begin to approach them.',
            'num_pages': 68,
            'popular_shelves': [
                {'count': '4500', 'name': 'chapter-book'},
                {'count': '3000', 'name': 'childrens'}
            ]
        },
        {
            'book_id': 8,
            'title': 'Charlotte\'s Web',
            'average_rating': 4.18,
            'ratings_count': 1600000,
            'description': 'Some Pig. Humble. Radiant. These are the words in Charlotte\'s Web, high up in Zuckerman\'s barn. Charlotte\'s spiderweb tells of her feelings for a little pig named Wilbur, who simply wants a friend. They also express the love of a girl named Fern, who saved Wilbur\'s life when he was born the runt of his litter.',
            'num_pages': 184,
            'popular_shelves': [
                {'count': '9000', 'name': 'childrens'},
                {'count': '6500', 'name': 'classics'}
            ]
        },
        {
            'book_id': 9,
            'title': 'How to Code: A Step-by-Step Guide to Computer Programming',
            'average_rating': 4.32,
            'ratings_count': 450,
            'description': 'This colorful guide teaches kids the basics of computer programming in a fun and easy-to-follow format. Perfect for children ages 8-12 who want to learn to code, this beginner\'s guide introduces core coding concepts through easy step-by-step instructions and simple projects.',
            'num_pages': 96,
            'popular_shelves': [
                {'count': '120', 'name': 'educational'},
                {'count': '85', 'name': 'computer-science'}
            ]
        },
        {
            'book_id': 10,
            'title': 'The Lion, the Witch and the Wardrobe',
            'average_rating': 4.22,
            'ratings_count': 2200000,
            'description': 'Four adventurous siblings—Peter, Susan, Edmund, and Lucy Pevensie—step through a wardrobe door and into the land of Narnia, a land frozen in eternal winter and enslaved by the power of the White Witch. But when almost all hope is lost, the return of the Great Lion, Aslan, signals a great change... and a great sacrifice.',
            'num_pages': 206,
            'popular_shelves': [
                {'count': '11000', 'name': 'fantasy'},
                {'count': '9000', 'name': 'childrens'}
            ]
        },
        {
            'book_id': 11,
            'title': 'Harry Potter and the Sorcerer\'s Stone',
            'average_rating': 4.47,
            'ratings_count': 7300000,
            'description': 'Harry Potter has no idea how famous he is. That\'s because he\'s being raised by his miserable aunt and uncle who are terrified Harry will learn that he\'s really a wizard, just as his parents were. But everything changes when Harry is summoned to attend an infamous school for wizards, and he begins to discover some clues about his illustrious birthright.',
            'num_pages': 309,
            'popular_shelves': [
                {'count': '25000', 'name': 'fantasy'},
                {'count': '18000', 'name': 'young-adult'}
            ]
        },
        {
            'book_id': 12,
            'title': 'Percy Jackson and the Lightning Thief',
            'average_rating': 4.25,
            'ratings_count': 2200000,
            'description': 'Percy Jackson is a good kid, but he can\'t seem to focus on his schoolwork or control his temper. And lately, being away at boarding school is only getting worse - Percy could have sworn his pre-algebra teacher turned into a monster and tried to kill him. When Percy\'s mom finds out, she knows it\'s time that he knew the truth about where he came from, and that he go to the one place he\'ll be safe.',
            'num_pages': 377,
            'popular_shelves': [
                {'count': '14000', 'name': 'fantasy'},
                {'count': '9500', 'name': 'young-adult'}
            ]
        },
        {
            'book_id': 13,
            'title': 'Diary of a Wimpy Kid',
            'average_rating': 4.11,
            'ratings_count': 680000,
            'description': 'Greg Heffley finds himself thrust into a new year and a new school where undersize weaklings share the corridors with kids who are taller, meaner and already shaving. Desperate to prove his new found maturity, which only going up a grade can bring, Greg is happy to have his not-quite-so-cool sidekick, Rowley, along for the ride.',
            'num_pages': 217,
            'popular_shelves': [
                {'count': '5000', 'name': 'middle-grade'},
                {'count': '3000', 'name': 'humor'}
            ]
        },
        {
            'book_id': 14,
            'title': 'Wonder',
            'average_rating': 4.42,
            'ratings_count': 925000,
            'description': 'August Pullman was born with a facial difference that, up until now, has prevented him from going to a mainstream school. Starting 5th grade at Beecher Prep, he wants nothing more than to be treated as an ordinary kid but his new classmates cannot get past Auggie\'s extraordinary face.',
            'num_pages': 315,
            'popular_shelves': [
                {'count': '7500', 'name': 'middle-grade'},
                {'count': '5200', 'name': 'realistic-fiction'}
            ]
        },
        {
            'book_id': 15,
            'title': 'Scratch Programming for Middle School Students',
            'average_rating': 4.4,
            'ratings_count': 180,
            'description': 'This guide introduces Scratch programming to middle school students through engaging projects and interactive games. Designed for classroom use or self-study for beginners aged 10-14, this book covers basic concepts and builds to intermediate challenges.',
            'num_pages': 145,
            'popular_shelves': [
                {'count': '35', 'name': 'educational'},
                {'count': '25', 'name': 'programming'}
            ]
        },
        {
            'book_id': 16,
            'title': 'The Hunger Games',
            'average_rating': 4.32,
            'ratings_count': 6100000,
            'description': 'In the ruins of a place once known as North America lies the nation of Panem, a shining Capitol surrounded by twelve outlying districts. The Capitol is harsh and cruel and keeps the districts in line by forcing them all to send one boy and one girl between the ages of twelve and eighteen to participate in the annual Hunger Games, a fight to the death on live TV.',
            'num_pages': 374,
            'popular_shelves': [
                {'count': '9000', 'name': 'young-adult'},
                {'count': '7000', 'name': 'dystopian'}
            ]
        },
        {
            'book_id': 17,
            'title': 'The Fault in Our Stars',
            'average_rating': 4.26,
            'ratings_count': 3700000,
            'description': 'Despite the tumor-shrinking medical miracle that has bought her a few years, Hazel has never been anything but terminal, her final chapter inscribed upon diagnosis. But when a gorgeous plot twist named Augustus Waters suddenly appears at Cancer Kid Support Group, Hazel\'s story is about to be completely rewritten.',
            'num_pages': 313,
            'popular_shelves': [
                {'count': '8500', 'name': 'young-adult'},
                {'count': '7200', 'name': 'romance'}
            ]
        },
        {
            'book_id': 18,
            'title': 'Six of Crows',
            'average_rating': 4.46,
            'ratings_count': 350000,
            'description': 'Ketterdam: a bustling hub of international trade where anything can be had for the right price—and no one knows that better than criminal prodigy Kaz Brekker. Kaz is offered a chance at a deadly heist that could make him rich beyond his wildest dreams. But he can\'t pull it off alone. A convict with a thirst for revenge. A sharpshooter who can\'t walk away from a wager. A runaway with a privileged past. A spy known as the Wraith. A Heartrender using her magic to survive the slums. A thief with a gift for unlikely escapes.',
            'num_pages': 465,
            'popular_shelves': [
                {'count': '7500', 'name': 'fantasy'},
                {'count': '6000', 'name': 'young-adult'}
            ]
        },
        {
            'book_id': 19,
            'title': 'Introduction to Algorithms for High School Students',
            'average_rating': 4.5,
            'ratings_count': 230,
            'description': 'A simplified introduction to computer algorithms designed specifically for high school students. This book covers basic searching and sorting algorithms, graphs, and dynamic programming with clear examples and exercises.',
            'num_pages': 210,
            'popular_shelves': [
                {'count': '45', 'name': 'computer-science'},
                {'count': '30', 'name': 'textbooks'}
            ]
        },
        {
            'book_id': 20,
            'title': 'Baby\'s First Animal Board Book',
            'average_rating': 4.7,
            'ratings_count': 830,
            'description': 'This durable board book introduces babies and toddlers to adorable animal friends with bright colors and simple text. Each spread features familiar animals with their names clearly labeled, perfect for early learning and vocabulary development for ages 0-2.',
            'num_pages': 12,
            'popular_shelves': [
                {'count': '150', 'name': 'board-book'},
                {'count': '95', 'name': 'baby-books'}
            ]
        }
    ]
    
    sample_df = pd.DataFrame(test_books)
    
    try:
        nltk.download('punkt', quiet=True)
        nltk.download('stopwords', quiet=True)
        nltk.download('wordnet', quiet=True)
    except:
        print("NLTK download failed, but continuing...")
    
    result_df = analyze_books_for_age_ranges(sample_df)
    
    print("\nBook Age Range Detection Results:")
    print("=" * 80)
    for _, book in result_df.iterrows():
        print(f"Title: {book['title']}")
        print(f"Description: {book['description'][:50]}...")
        print(f"Pages: {book['num_pages']}")
        print(f"Detected Age Range: {book['age_range']}")
        print("-" * 80)

# Analysis

There are a few ways we can consider hybridizing the approaches. We will now do the ensemble method, which generates two separate recommendation lists and then takes the intersection. Other methods we could consider include (1) weighted hybrid, where a content-based score and a collaborative filtering score is calculated and subsequently combined with a weighted average, or (2) switching hybrid, where content-based filtering is used when the user is new, or when a book has very few ratings, and collaborative filtering is used when a user / book has sufficient history.

In [None]:
#purely for visualization before analysis; to be deleted
print(books.head())
print(interactions.head())

## Generating Recommendations

### Content-Based Filtering

For Content-Based Filtering, we use **TF-IDF** and **Cosine Similarity** as our core algorithms. 

**Term Frequency-Inverse Document Frequency (TF-IDF)** uses NLP to identify important words in the `description` attribute of the selected book by evaluating how frequently they appear, relative to the descriptions of all other books in the dataset. Once this is done, we sort the books by how similar they are using **cosine similarity**, which measures the angle between the two vectors (books). If they have a small angle, the books have similar `description` and is thus considered to be similar in content.

A better way to imagine this would be if Book A and Book B both have descriptions that talk about "magic", "spells" and "wizards", they would have similar TF-IDF vectors, and thus high cosine similarity scores.

In [None]:
tfidf = TfidfVectorizer(stop_words = 'english') 
tfidf_matrix = tfidf.fit_transform(books['description']) #generating TF-IDF matrix

def get_content_recommendations (book_id, df = books, tfidf_matrix = tfidf_matrix, top_n = 5):
    index = df[df['book_id'] == book_id].index[0]
    sim_scores = cosine_similarity(tfidf_matrix[index], tfidf_matrix).flatten() #calculating cosine similarity
    top_indices = np.argsort(sim_scores)[::1][1:top_n + 1]
    return df.iloc[top_indices][['book_id','title']]

similar_books = get_content_recommendations ("6066812")
print(similar_books)

### Collaborative Filtering

We will use K-Nearest Neighbours as an algorithm to perform Collaborative Filtering.

Will work on using SVD / Matrix Factorization tomorrow!

In [None]:
def create_user_item_matrix(df): #to create user-item matrix for collaborative filtering
    users = interactions['user_id'].nunique()
    items = interactions['book_id'].nunique()
    
    user_mapper = dict(zip(np.unique(interactions['user_id']), list(range(users))))
    user_inv_mapper = dict(zip(list(range(users)), np.unique(interactions['user_id'])))
    user_index = [user_mapper[i] for i in interactions['user_id']]

    
    item_mapper = dict(zip(np.unique(interactions['book_id']), list(range(items))))
    item_inv_mapper = dict(zip(list(range(items)), np.unique(interactions['book_id'])))
    item_index = [item_mapper[i] for i in interactions['book_id']]
    
    user_item_matrix = csr_matrix((interactions['rating'], (user_index, item_index)), shape = (users, items))
    return user_item_matrix, user_mapper, item_mapper, user_inv_mapper, item_inv_mapper

user_item_matrix, user_mapper, item_mapper, user_inv_mapper, item_inv_mapper = create_user_item_matrix(interactions)

def get_collaborative_recommendations(book_id, books = books, user_item_matrix = user_item_matrix, item_mapper = item_mapper, item_inv_mapper = item_inv_mapper, top_n = 5):
    user_item_matrix = user_item_matrix.T
    neighbor_ids = []
    recommendations = []

    item_ind = item_mapper[book_id]
    item_vec = user_item_matrix[item_ind]
    if isinstance(item_vec, (np.ndarray)):
        item_vec = item_vec.reshape(1,-1)

    kNN = NearestNeighbors(n_neighbors = top_n + 1, algorithm = "brute", metric = "cosine") #measuring similarity using K-Nearest-Neighbors
    kNN.fit(user_item_matrix)
    neighbor = kNN.kneighbors(item_vec, return_distance = False)
    for i in range (0, top_n):
        n = neighbor.item(i)
        neighbor_ids.append(item_inv_mapper[n])
    neighbor_ids.pop(0)

    for id in neighbor_ids: #retrieving book titles
        recommended_books = books.loc[books['book_id'] == id, ['book_id','title']].values[0]
        recommendations.append({'book_id': recommended_books[0], 'title': recommended_books[1]})
        
    recommendations_df = pd.DataFrame(recommendations)
    return recommendations_df

similar_books = get_collaborative_recommendations ("6066812")
print(similar_books)

Havent gotten to it yet, but the two lists of books produced should be combined to generate the final list!