<a href="https://colab.research.google.com/github/christinabrnn/Python-Study/blob/main/BA820/topic_modeling_unsolved.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


**Course: BA820 - Unsupervised and Unstructured ML**

**Notebook created by: Mohannad Elhamod**

# Book Rating Dataset

In this notebook, we will explore a dataset of individual book ratings from [BookCrossing](https://github.com/ipudu/book-rating-database?tab=readme-ov-file) and apply Topic Modeling to extract insights. Specifically, we aim to:  

- Identify different reading preferences and trends.  
- Generate book recommendations based on these insights.  

## Loading The Data

In [None]:
import pandas as pd

# Load books data
books = pd.read_csv('https://github.com/zygmuntz/goodbooks-10k/raw/refs/heads/master/books.csv', on_bad_lines='skip')

# Load ratings data
ratings = pd.read_csv('https://github.com/zygmuntz/goodbooks-10k/raw/refs/heads/master/ratings.csv',  on_bad_lines='skip')

# Select columns
books = books[["book_id", "authors", "original_publication_year", "original_title"]].set_index("book_id")
ratings = ratings[["book_id", "user_id", "rating"]]


# Filter to popular books
book_counts = ratings["book_id"].value_counts()
valid_books = book_counts[book_counts >= 185].index
ratings = ratings[ratings["book_id"].isin(valid_books)]

# Filter to avid readers
user_counts = ratings["user_id"].value_counts()
valid_users = user_counts[user_counts >= 185].index
ratings = ratings[ratings["user_id"].isin(valid_users)]

In [None]:
books_filtered = books[books.index.isin(valid_books)]
books_filtered

book_id,authors,original_publication_year,original_title
1,Suzanne Collins,2008.0,The Hunger Games
2,"J.K. Rowling, Mary GrandPré",1997.0,Harry Potter and the Philosopher's Stone
3,Stephenie Meyer,2005.0,Twilight
4,Harper Lee,1960.0,To Kill a Mockingbird
5,F. Scott Fitzgerald,1925.0,The Great Gatsby
...,...,...,...
9761,Richard Llewellyn,1939.0,How Green Was My Valley
9796,"Else Holmelund Minarik, Maurice Sendak",1968.0,A Kiss for Little Bear
9892,J.D. Robb,2008.0,Salvation in Death
9923,Stephen King,1996.0,"The Green Mile, Part 5: Night Journey"


In [None]:
ratings

,book_id,user_id,rating
62588,283,951,4
62589,282,951,4
62590,2170,951,3
62591,43,951,5
79326,4700,2487,4
...,...,...,...
5946011,1231,50999,3
5946015,3889,50999,5
5946017,1104,50999,2
5946018,3657,50999,4


In [None]:
from sklearn.model_selection import train_test_split
train_df, new_user_df = train_test_split(ratings, test_size=0.2, random_state=42)

## Task 1: Extracting Topics

### Data Formatting

We first need to reshape the data into a user-book cross tabulation. Here, the users are the data points (rows) and the books are the features (columns).
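
As an illustration of the cross-tabulation step, here is a minimal sketch on made-up data. The `toy_ratings` frame and its values are hypothetical; only the column names mirror the `ratings` table used in this notebook. `pivot_table` is one possible way to build such a matrix, not necessarily the intended solution.

```python
import pandas as pd

# Hypothetical toy ratings in the same long format as `ratings`
toy_ratings = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3],
    "book_id": [10, 20, 10, 20, 30],
    "rating":  [5, 3, 4, 2, 5],
})

# One row per user, one column per book; unrated pairs become 0
toy_matrix = toy_ratings.pivot_table(
    index="user_id", columns="book_id", values="rating"
).fillna(0)

print(toy_matrix)
```

Filling with zeros treats "not rated" the same as "no interest", which is a modeling choice worth keeping in mind when interpreting the topics.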

In [None]:
user_book_matrix = ### Don't forget to fillna with zeros


In [None]:
user_book_matrix

In [None]:
user_book_matrix.shape

Let's try to find the top *n* topics in this dataset.
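
To make the factorization concrete, the following sketch runs NMF on a tiny, made-up rating matrix. The matrix `X`, the choice of two components, and the `init`, `random_state`, and `max_iter` settings are all assumptions for illustration, not the notebook's intended configuration.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.preprocessing import MinMaxScaler

# Hypothetical 4 users x 5 books rating matrix (0 = not rated)
X = np.array([
    [5, 4, 0, 0, 1],
    [4, 5, 0, 1, 0],
    [0, 0, 5, 4, 0],
    [0, 1, 4, 5, 0],
], dtype=float)

# Map each book column to [0, 1]; NMF requires non-negative input
X_scaled = MinMaxScaler().fit_transform(X)

toy_model = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
W_toy = toy_model.fit_transform(X_scaled)   # users x topics
H_toy = toy_model.components_               # topics x books

print(W_toy.shape, H_toy.shape)
print(toy_model.reconstruction_err_)
```

Each row of `W_toy` describes a user as a mix of topics, and each row of `H_toy` describes a topic as a mix of books, so their product approximates the scaled matrix.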

In [None]:
n_components=5

In [None]:
from sklearn.decomposition import NMF
from sklearn.preprocessing import MinMaxScaler

# Build the model
model =

# Normalize ratings (optional but recommended)
scaler = MinMaxScaler()
scaled_user_book_matrix =

# Fit and transform
W = model. # User representation in terms of topics.

In [None]:
# Topic representation in terms of books
H = ### make columns as book ids
display(H)
print(H.shape)

In [None]:
# Reconstruction error
model.reconstruction_err_

Let's plot the *n* topics in terms of the top books that represent them.
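
Ranking the books within a topic comes down to sorting each row of the topic-by-book matrix in descending order. The sketch below shows that step on a small, hypothetical weight matrix (`H_demo` and `book_labels` are made up for illustration).

```python
import numpy as np

# Hypothetical topic-by-book weights (2 topics, 4 books)
H_demo = np.array([
    [0.1, 0.9, 0.0, 0.5],
    [0.7, 0.0, 0.8, 0.2],
])
book_labels = np.array(["Book A", "Book B", "Book C", "Book D"])

# argsort is ascending, so reverse each row to get descending weights
order = np.argsort(H_demo, axis=1)[:, ::-1]

# Print the two highest-weighted books for each topic
for topic_idx, cols in enumerate(order):
    print(f"Topic {topic_idx}:", list(book_labels[cols[:2]]))
```

The same `order` array is what a helper such as `print_topics` expects in its `sorting` argument.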

In [None]:
!pip install mglearn

In [None]:
# Auxiliary function
def get_book_author_names(bookid):
  book_name = books["original_title"].loc[bookid]
  author_name = books["authors"].loc[bookid]
  return str(book_name) + " --- " + str(author_name)

In [None]:
import mglearn
import numpy as np

feature_names = user_book_matrix.columns.map(lambda col: get_book_author_names(col)) # Get feature (i.e., book) names

mglearn.tools.print_topics(topics=range(n_components), feature_names=feature_names,
                           sorting=np.argsort(model.components_, axis=1)[:, ::-1], n_words=5, topics_per_chunk=1)

In [None]:
for i in range(n_components):
  print(f"Topic {i}:")
  mglearn.tools.visualize_coefficients(     , feature_names, n_top_features=10)


It might take some effort to interpret the topics (or we may show them to a librarian... or ChatGPT?). But, here is a guess:

*answer...*

In [None]:
topics =

## Task 2: Topic Similarity and Collaborative Filtering

Now that we have expressed the data in terms of an intermediate variable, the topics, we can compare books and users in topic space instead of in the raw rating space.




### Book Representation in terms of Topics.

In [None]:
books_filtered

In [None]:
def get_book_id(book_name):
  return books[books["original_title"] == book_name].index[0]

def get_book_topics(name):
  book_id = get_book_id(name)
  book_features = pd.DataFrame(H[book_id])
  book_features.index = topics
  return book_features

In [None]:
book_name = "The Princess Diaries"
# "The Princess Diaries"
# "Life of Pi"
# "The Great Gatsby"
#"Harry Potter and the Order of the Phoenix"
#"Brave New World"
#"Animal Farm: A Fairy Story"

get_book_topics(book_name)

### User Representation in terms of Topics.

In [None]:
# Auxiliary function
def get_user_books(user_id):
  book_ids_read_by_user = ratings[ratings["user_id"] == user_id]["book_id"].to_list()
  books_read_by_user  = books.index.isin(book_ids_read_by_user)
  return books.loc[books_read_by_user]

def get_user_topics(user_id):
  user_index = user_book_matrix.index.get_loc(user_id)
  user_features =
  user_features.index = topics
  return user_features

In [None]:
user_id = 53292  # other users to try: 2487, 951

display(get_user_books(user_id))
display(get_user_topics(user_id))

### What about new users and books?

Let's say you received a new set of users who read a mix of the books in your dataset and some new books.

You want to extract the topics of interest for these users based on their readings and based on the model you have created from your training data.

You will have to express the new users in terms of the books you have in your training dataset.

In [None]:
# Convert to Cross Tab format
user_book_matrix_new_users =

# Remove the books that were not in your training data
user_book_matrix_new_users = user_book_matrix_new_users[[column for column in user_book_matrix_new_users.columns if column in user_book_matrix.columns]]

# Fill in training books that were not read by these new users with zeros.
user_book_matrix_new_users = user_book_matrix_new_users.reindex(columns=user_book_matrix.columns, fill_value=0)

user_book_matrix_new_users

In [None]:
# Transform!
new_users_topics = ### don't forget to add column topic names and user ids.
new_users_topics

### Collaborative Filtering

Now, we can use the constructed model to *compare* books, users, or both!

In [None]:
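# Auxiliary function
def get_book_readers(name):
  # Look up the book's column position, since the scaled matrix may be a plain NumPy array
  book_col = user_book_matrix.columns.get_loc(get_book_id(name))
  return np.asarray(scaled_user_book_matrix)[:, book_col]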

In [None]:
training_books = books[books.index.isin(train_df['book_id'])]
training_books

#### How similar are two books?

In [None]:
book1 = "Ender's Game"
book2 = "Divergent"
#"The Adventures of Huckleberry Finn"
#"Le Comte de Monte-Cristo"
#"The Time Traveler's Wife"
#"The Book Thief"
#"Harry Potter and the Order of the Phoenix"
#"Nineteen Eighty-Four"
#"The Hitchhiker's Guide to the Galaxy"
#"A Tale of Two Cities"

# Get book_ids
book1_id = books[books["original_title"] == book1].index[0]
book2_id = books[books["original_title"] == book2].index[0]

# Get readers and topics
topics1 = get_book_topics(book1).values.reshape(1, -1)
readers1 = get_book_readers(book1).reshape(1, -1)

topics2 = get_book_topics(book2).values.reshape(1, -1)
readers2 = get_book_readers(book2).reshape(1, -1)



# Get cosine similarity in terms of readers and in terms of topics
from sklearn.metrics.pairwise import cosine_similarity
feature_similarity =
topics_similarity =

print("feature_similarity = ", feature_similarity)
print("topics_similarity = ", topics_similarity)

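Notice how topic similarity tends to be more meaningful than reader similarity: the reader vectors are sparse and high-dimensional, so two books with few readers in common look dissimilar even when they appeal to the same tastes.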
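
The contrast between the two similarity measures can be seen in a small, contrived example. All vectors below are hypothetical and chosen so that the two books share no readers but have similar topic profiles.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical reader vectors: ratings from 6 users for two books (0 = not rated)
readers_a = np.array([[5, 0, 4, 0, 0, 0]])
readers_b = np.array([[0, 5, 0, 4, 0, 0]])

# Hypothetical topic vectors for the same two books (2 topics)
topics_a = np.array([[0.9, 0.1]])
topics_b = np.array([[0.8, 0.2]])

# cosine_similarity expects 2D inputs and returns a matrix; take the single entry
reader_sim = cosine_similarity(readers_a, readers_b)[0, 0]
topic_sim = cosine_similarity(topics_a, topics_b)[0, 0]
print(reader_sim, topic_sim)
```

With no overlapping readers the reader-based similarity is zero, while the topic-based similarity stays close to one.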

#### Would a user be interested in a certain book?

In [None]:
pd.DataFrame(W, index=user_book_matrix.index, columns = topics)

In [None]:
user_id = 2487
# The model was fit on the scaled matrix, so transform the scaled matrix as well
user_vector = pd.DataFrame(model.transform(scaled_user_book_matrix), index=user_book_matrix.index).loc[user_id]  # or: pd.DataFrame(W, index=user_book_matrix.index).loc[user_id]

book_name = "Harry Potter and the Order of the Phoenix"
book_vector = get_book_topics(book_name)

In [None]:
cosine_similarity(user_vector.values.reshape(1, -1), book_vector.values.reshape(1, -1))[0, 0]

## **Questions:**

- How do you think Topic Modeling could be applied to association rules? Describe an example.
- What are the similarities and differences between Clustering and Topic Modeling? When would you use one or the other?
- What are the similarities and differences between PCA and Topic Modeling? When would you use one or the other?