In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Let's start by loading each file and summarizing its contents to understand the data we have.

In [2]:
# Load the data from the files
book_history_df = pd.read_csv('Book_History.csv')
book_ratings_df = pd.read_csv('Book_Ratings.csv')
book_users_df = pd.read_csv('Book_Users.csv')
books_df = pd.read_csv('Books.csv')

In [3]:
book_history_df.head()

Unnamed: 0,user,item,accessed
0,1,152,1
1,1,153,1
2,1,2176,1
3,1,154,1
4,1,734,1


In [4]:
book_ratings_df.head()

Unnamed: 0,user,item,rating
0,1,6264,7.0
1,1,4350,7.0
2,1,6252,5.0
3,1,202,9.0
4,1,6266,6.0


In [5]:
book_users_df.head()

Unnamed: 0,User-ID,Location,Age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",


In [6]:
books_df.head()

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...


In [7]:
book_history_df.shape

(272678, 3)

In [8]:
book_ratings_df.shape

(62656, 3)

In [9]:
book_users_df.shape

(278858, 3)

In [10]:
books_df.shape

(271379, 8)

Book_History.csv
- Columns: user, item, accessed
- Description: Tracks which items (books) users have accessed.

Book_Ratings.csv
- Columns: user, item, rating
- Description: Contains ratings given by users to items (books).

Book_Users.csv
- Columns: User-ID, Location, Age
- Description: Information about the users, including their location and age.

Books.csv
- Columns: ISBN, Book-Title, Book-Author, Year-Of-Publication, Publisher, Image-URL-S, Image-URL-M, Image-URL-L
- Description: Detailed information about the books, including title, author, year of publication, and publisher. It also contains URLs for book images.

For the content-based recommendation system, we will primarily focus on the Books.csv file to utilize book metadata (like title, author, and publisher) to find similarities between books. The other files could be used to enhance the system, for example, by considering user ratings (Book_Ratings.csv) for filtering or prioritizing recommendations.
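As a rough illustration of how ratings could prioritize recommendations, here is a minimal sketch on toy data. The `toy_ratings` and `candidates` frames, the item ids, and the minimum of two ratings are all assumptions made for the example; in particular, it assumes candidate books can be matched to the `item` ids used in Book_Ratings.csv, which has not been established here.

```python
import pandas as pd

# Hypothetical toy ratings in the same layout as Book_Ratings.csv (user, item, rating)
toy_ratings = pd.DataFrame({
    'user':   [1, 2, 3, 1, 2, 3],
    'item':   [10, 10, 10, 20, 20, 30],
    'rating': [7.0, 9.0, 8.0, 5.0, 6.0, 10.0],
})

# Hypothetical candidate list from a content-based step, keyed by the same item ids
candidates = pd.DataFrame({'item': [10, 20, 30, 40]})

# Mean rating and number of ratings per item
rating_stats = (toy_ratings.groupby('item')['rating']
                .agg(mean_rating='mean', n_ratings='count')
                .reset_index())

# Keep candidates with at least 2 ratings (an arbitrary threshold), best-rated first;
# unrated candidates get NaN from the left merge and are dropped by the filter
ranked = (candidates.merge(rating_stats, on='item', how='left')
          .query('n_ratings >= 2')
          .sort_values('mean_rating', ascending=False))
print(ranked)
```

The minimum-count filter keeps a single enthusiastic rating (item 30 in the toy data) from outranking books with broader support.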

In [11]:
median_age = book_users_df['Age'].median()
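# Assign the result back: calling fillna(inplace=True) on a selected column is
# deprecated in recent pandas and may not modify the original DataFrame
book_users_df['Age'] = book_users_df['Age'].fillna(median_age)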
median_age

32.0

In [12]:
books_df.isnull().sum()

ISBN                   0
Book-Title             0
Book-Author            2
Year-Of-Publication    0
Publisher              2
Image-URL-S            0
Image-URL-M            0
Image-URL-L            0
dtype: int64

In [13]:
# Assign the result back rather than using inplace=True on a selected column
books_df['Book-Author'] = books_df['Book-Author'].fillna("Unknown Author")
books_df['Publisher'] = books_df['Publisher'].fillna("Unknown Publisher")


In [14]:
books_df.duplicated().sum()

0

The missing values in the Book-Author and Publisher columns have been filled with placeholders. The Books dataset now has no missing values and no fully duplicated rows, so it is ready for the next steps in building the content-based recommendation system.

Next, we'll focus on Feature Extraction. For a content-based system, we can use features such as the book's title, author, and publisher to calculate similarities between books. 

In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
from sklearn.preprocessing import normalize

In [16]:
# Combining text features to form a single string for each book
# Note: We concatenate the title, author, and publisher with spaces in between
books_df['combined_features'] = books_df['Book-Title'] + ' ' + books_df['Book-Author'] + ' ' + books_df['Publisher']

In [17]:
# Text preprocessing and TF-IDF vectorization
tfidf_vectorizer = TfidfVectorizer(stop_words='english', lowercase=True, ngram_range=(1, 2))
tfidf_matrix = tfidf_vectorizer.fit_transform(books_df['combined_features'])

In [18]:
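# TfidfVectorizer already L2-normalizes each row by default (norm='l2'), so this call
# leaves the matrix effectively unchanged; it makes explicit that rows have unit length,
# which lets a plain dot product serve as cosine similarity later on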
tfidf_matrix_normalized = normalize(tfidf_matrix)

# Display the shape of the TF-IDF matrix to understand the number of features
tfidf_matrix_normalized.shape

(271379, 1042380)

The TF-IDF vectorization has transformed the combined text features (Book-Title, Book-Author, and Publisher) into a sparse numerical matrix. The matrix has 271,379 rows (one per book) and 1,042,380 columns (features). Because the vectorizer uses ngram_range=(1, 2), each feature is either a single word or a pair of adjacent words found in the combined text, after English stop words are removed.

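The next step is to calculate similarity scores. One common approach is cosine similarity, which measures the cosine of the angle between two vectors and therefore ignores differences in vector length. It works well with sparse, high-dimensional matrices like the one produced by TF-IDF vectorization. Because the TF-IDF rows are L2-normalized, the dot product computed by linear_kernel is equal to the cosine similarity.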
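To see why linear_kernel can stand in for cosine similarity on L2-normalized rows, here is a minimal numeric check. The two vectors are arbitrary values chosen for the example, not data from the books.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel
from sklearn.preprocessing import normalize

# Two small hypothetical feature vectors (not taken from the dataset)
vectors = np.array([[1.0, 2.0, 0.0],
                    [2.0, 0.0, 1.0]])

# Cosine similarity by definition: dot(a, b) / (||a|| * ||b||)
a, b = vectors
manual_cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# After L2 normalization every row has length 1, so the plain dot product
# (linear_kernel) gives the same value as cosine_similarity on the raw vectors
unit_vectors = normalize(vectors)
dot_product = linear_kernel(unit_vectors)[0, 1]
library_cosine = cosine_similarity(vectors)[0, 1]

print(manual_cosine, dot_product, library_cosine)
```

All three values agree, so on unit-length rows the cheaper dot product is sufficient.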

In [19]:
# Calculating cosine similarity scores for a subset of the TF-IDF matrix
# Note: This is a resource-intensive operation, so we'll limit it to a smaller subset for demonstration purposes

# Let's select a subset of the first 5000 books to calculate similarity scores
subset_tfidf_matrix = tfidf_matrix_normalized[:5000]

# Compute the cosine similarity matrix for the subset
cosine_sim = linear_kernel(subset_tfidf_matrix, subset_tfidf_matrix)

# Display the shape of the cosine similarity matrix to verify
cosine_sim.shape


(5000, 5000)

The cosine similarity scores have been successfully calculated for a subset of the first 5,000 books. The resulting cosine similarity matrix is of shape (5000, 5000), indicating each book's similarity score with every other book in this subset.
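The 5,000-book limit exists because a full pairwise matrix grows with the square of the number of books. One possible way around it, sketched here on toy data, is to score a single reference row against all rows on demand instead of precomputing every pair. The helper `top_similar_rows` and the toy strings are hypothetical and not part of the notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def top_similar_rows(matrix, row_idx, top_n=5):
    """Return indices of the top_n rows most similar to row_idx (hypothetical helper)."""
    # One sparse row times the transposed matrix: a single similarity score per row
    scores = (matrix[[row_idx]] @ matrix.T).toarray().ravel()
    scores[row_idx] = -np.inf  # never recommend the query row itself
    top_n = max(1, min(top_n, len(scores) - 1))
    # argpartition picks the top_n candidates without sorting every score
    candidates = np.argpartition(scores, -top_n)[-top_n:]
    # Sort only those candidates, highest similarity first
    return candidates[np.argsort(scores[candidates])[::-1]]

# Toy strings standing in for combined_features (hypothetical data)
toy_docs = [
    "Classical Mythology Oxford University Press",
    "Persuasion Oxford University Press",
    "Clara Callan HarperFlamingo Canada",
    "The Selfish Gene Oxford University Press",
]
toy_tfidf = TfidfVectorizer(stop_words='english').fit_transform(toy_docs)

result = top_similar_rows(toy_tfidf, 0, top_n=2)
print(result)
```

Applied to the full TF-IDF matrix, this approach would need memory for only one row of scores per query, at the cost of recomputing the scores for each request.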

With this similarity matrix, we can now proceed to generate book recommendations. Here's how we can approach this:

- Choose a book to start with: We'll need a reference book from which to make recommendations.
- Find the most similar books: Use the cosine similarity scores to find the most similar books to the reference book.

In [20]:
def get_book_recommendations(title, cosine_sim=cosine_sim, books_df=books_df, top_n=10):
    # Find rows matching the title; only rows covered by the similarity matrix are usable
    matches = books_df.index[books_df['Book-Title'] == title]
    matches = matches[matches < cosine_sim.shape[0]]
    if len(matches) == 0:
        raise ValueError(f"'{title}' is not among the first {cosine_sim.shape[0]} books")
    idx = matches[0]

    # Get the pairwise similarity scores of all books with that book
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Drop the reference book itself, then sort by similarity (highest first)
    sim_scores = [s for s in sim_scores if s[0] != idx]
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)[:top_n]

    # Get the book indices
    book_indices = [i[0] for i in sim_scores]

    # Return the top-n most similar books
    return books_df.iloc[book_indices][['Book-Title', 'Book-Author', 'Publisher']]

In [22]:
# Example usage: Let's find recommendations for a book.
# Note: We need to use a title from the first 5000 books due to our subset limitation.
example_title = books_df.iloc[0]['Book-Title']  # Using the first book title in the dataset as an example
example_title

'Classical Mythology'

In [23]:
get_book_recommendations(example_title, top_n=5)

Unnamed: 0,Book-Title,Book-Author,Publisher
4148,Persuasion (World's Classics),Jane Austen,Oxford University Press
2852,No Name (World's Classics),William Wilkie Collins,Oxford University Press
397,Julius Caesar (Oxford School Shakespeare),William Shakespeare,Oxford University Press
521,Cranford (The World's Classics),Elizabeth Gaskell,Oxford University Press
2231,The Selfish Gene,Richard Dawkins,Oxford University Press


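These recommendations were generated by ranking books by how similar their combined title, author, and publisher text is to that of the reference book, which is the basic mechanism of a content-based recommendation system. Note that all five results share the publisher Oxford University Press with 'Classical Mythology', so in this example the publisher terms drive the match more than the subject matter does.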