## Overview of dataset

Book recommendation dataset from Kaggle: http://www.kaggle.com/datasets/arashnic/book-recommendation-dataset

The Book-Crossing dataset comprises 3 files.

Users:

Contains the users. Note that user IDs (User-ID) have been anonymized and map to integers. Demographic data is provided (Location, Age) if available. Otherwise, these fields contain NULL-values.

Books:

Books are identified by their respective ISBN. Invalid ISBNs have already been removed from the dataset. Moreover, some content-based information is given (Book-Title, Book-Author, Year-Of-Publication, Publisher), obtained from Amazon Web Services. Note that in case of several authors, only the first is provided. URLs linking to cover images are also given, appearing in three different flavours (Image-URL-S, Image-URL-M, Image-URL-L), i.e., small, medium, large. These URLs point to the Amazon web site.

Ratings:

Contains the book rating information. Ratings (Book-Rating) are either explicit, expressed on a scale from 1-10 (higher values denoting higher appreciation), or implicit, expressed by 0.
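
Because a 0 marks an implicit interaction rather than a low score, the two kinds of rating probably need to be separated before any averaging. The following is a minimal sketch of that split. It uses a small made-up table with the Ratings.csv column names, not the real file.

```python
import pandas as pd

# Toy stand-in for Ratings.csv: real column names, made-up values
ratings_df = pd.DataFrame({
    "User-ID": [1, 1, 2, 3, 3],
    "ISBN": ["A", "B", "A", "C", "B"],
    "Book-Rating": [0, 8, 10, 0, 5],
})

# Explicit ratings sit on the 1-10 scale; a 0 marks an implicit interaction
explicit_df = ratings_df[ratings_df["Book-Rating"] > 0]
implicit_df = ratings_df[ratings_df["Book-Rating"] == 0]

print(len(explicit_df), "explicit;", len(implicit_df), "implicit")
print("Mean explicit rating:", explicit_df["Book-Rating"].mean())
```

Averaging over all rows instead would pull the mean down, since every implicit 0 would count as the lowest possible score.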

The data come from the Book-Crossing site (https://www.bookcrossing.com/howto). They were collected by Cai-Nicolas Ziegler in a 4-week crawl (August / September 2004) from the Book-Crossing community with kind permission from Ron Hornbaker, CTO of Humankind Systems. The dataset contains 278,858 users (anonymized but with demographic information) providing 1,149,780 ratings (explicit / implicit) about 271,379 books. The Kaggle uploader (Möbius) preprocessed the data and cleaned its format.

Books.csv has 271,360 observations and 8 variables (19 fewer than the 271,379 books cited in the original description). Predictors include Book-Author, Year-Of-Publication, and Publisher.

Ratings.csv has 1,149,780 observations and 3 variables. Predictors include Book-Rating.

Users.csv has 278,858 observations and 3 variables. Predictors include Location and Age.

The CSV files share key columns, so the tables can reference each other: User-ID links Ratings to Users, and ISBN links Ratings to Books.
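
As a rough illustration of how these keys could be used, the sketch below joins three small made-up tables that use the real column names. The `book_match` indicator column and the `merged_df` and `unmatched` names are assumptions introduced here. The left joins keep every rating, so any rating whose ISBN has no row in Books can be spotted.

```python
import pandas as pd

# Toy stand-ins for the three files: real column names, made-up rows
users_df = pd.DataFrame({
    "User-ID": [1, 2],
    "Location": ["city a, state a, usa", "city b, province b, canada"],
    "Age": [34.0, None],
})
books_df = pd.DataFrame({
    "ISBN": ["A", "B"],
    "Book-Title": ["Title One", "Title Two"],
    "Book-Author": ["Author One", "Author Two"],
})
ratings_df = pd.DataFrame({
    "User-ID": [1, 2, 2],
    "ISBN": ["A", "A", "Z"],
    "Book-Rating": [8, 0, 7],
})

# Ratings is the linking table: ISBN points at Books, User-ID at Users
merged_df = (
    ratings_df
    .merge(books_df, on="ISBN", how="left", indicator="book_match")
    .merge(users_df, on="User-ID", how="left")
)

# Ratings whose ISBN does not appear in Books
unmatched = merged_df[merged_df["book_match"] == "left_only"]
print(merged_df[["User-ID", "ISBN", "Book-Rating", "Book-Title", "Location"]])
print("Ratings without a matching book:", len(unmatched))
```

Running the same check on the real files would show whether every rated ISBN is actually present in Books.csv.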

We will likely be working with Book-Rating, Location, Book-Author, Year-Of-Publication, and Publisher (along with the User-ID and ISBN keys) to build our recommendation system.

Age has 110,762 missing values out of 278,858 users (110,762 / 278,858 ≈ 39.7%), so about 40% of Age is missing. With so much missing, we will likely drop this variable altogether. Book-Author and Publisher each have 2 missing values (Image-URL-L has 3, but it is not among the variables we plan to use). These counts are negligible, and dropping the affected rows should not change the overall dataset much, so that is what we will likely do.
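
The handling described here could look roughly like the sketch below. It runs on small made-up tables with the real column names, so the printed share describes the toy data only. The names `users_clean` and `books_clean` are introduced for illustration.

```python
import pandas as pd

# Toy stand-ins with the real column names; values are made up
users_df = pd.DataFrame({
    "User-ID": [1, 2, 3, 4, 5],
    "Location": ["a", "b", "c", "d", "e"],
    "Age": [18.0, None, 35.0, None, 52.0],
})
books_df = pd.DataFrame({
    "ISBN": ["A", "B", "C"],
    "Book-Title": ["Title One", "Title Two", "Title Three"],
    "Book-Author": ["Author One", None, "Author Three"],
    "Publisher": ["Pub One", "Pub Two", None],
})

# Share of missing values in Age (the mean of a boolean mask)
age_missing_share = users_df["Age"].isna().mean()
print(f"Share of Age missing: {age_missing_share:.1%}")

# Drop the mostly-missing column, and drop the few incomplete book rows
users_clean = users_df.drop(columns=["Age"])
books_clean = books_df.dropna(subset=["Book-Author", "Publisher"])
print(len(books_df) - len(books_clean), "book rows dropped")
```

Restricting `dropna` to the columns we intend to use avoids discarding books only because an image URL is missing.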

In [7]:
# import packages
import pandas as pd

In [11]:
# load data
users_df = pd.read_csv('data/Users.csv')
books_df = pd.read_csv('data/Books.csv', low_memory=False)
ratings_df = pd.read_csv('data/Ratings.csv')

In [14]:
# Check missingness for each dataframe
print("Missing values in ratings_df:")
print(ratings_df.isnull().sum())
print("\n")

print("Missing values in books_df:")
print(books_df.isnull().sum())
print("\n")

print("Missing values in users_df:")
print(users_df.isnull().sum())


Missing values in ratings_df:
User-ID        0
ISBN           0
Book-Rating    0
dtype: int64


Missing values in books_df:
ISBN                   0
Book-Title             0
Book-Author            2
Year-Of-Publication    0
Publisher              2
Image-URL-S            0
Image-URL-M            0
Image-URL-L            3
dtype: int64


Missing values in users_df:
User-ID          0
Location         0
Age         110762
dtype: int64


## Overview of research questions

Given a user's historical book interactions, can we recommend books they are likely to enjoy?

How similar are books based on their content (titles, authors, publishers; the dataset has no genre column), and can we use this similarity to make personalized recommendations?
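
To make the content-similarity question concrete, here is a minimal sketch that scores books by Jaccard similarity over the words in their title, author, and publisher. Jaccard similarity is only one possible measure and is chosen here for simplicity. The book records and the helper names (`tokens`, `jaccard`, `most_similar`) are made up for illustration.

```python
# Toy book records using Books.csv column names; values are made up
books = [
    {"ISBN": "A", "Book-Title": "The Silent River",
     "Book-Author": "Jane Doe", "Publisher": "Northwind"},
    {"ISBN": "B", "Book-Title": "The Silent Forest",
     "Book-Author": "Jane Doe", "Publisher": "Northwind"},
    {"ISBN": "C", "Book-Title": "Cooking Basics",
     "Book-Author": "Sam Roe", "Publisher": "Kitchen Press"},
]


def tokens(book):
    """Lowercased word set drawn from title, author, and publisher."""
    text = " ".join([book["Book-Title"], book["Book-Author"], book["Publisher"]])
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity: shared tokens divided by all distinct tokens."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def most_similar(isbn, books, top_n=2):
    """Rank the other books by Jaccard similarity to the given ISBN."""
    target = next(b for b in books if b["ISBN"] == isbn)
    target_tokens = tokens(target)
    scored = [
        (b["ISBN"], jaccard(target_tokens, tokens(b)))
        for b in books
        if b["ISBN"] != isbn
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


print(most_similar("A", books))
```

On the full dataset a weighted scheme such as TF-IDF would likely work better, since common words like "the" would otherwise inflate the scores.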

Users of the recommendation system will provide their location, age, several favorite books, and ratings of books as the basis for recommendations. Using the Users data, we can find the books rated by people with a similar location and age; that list is one candidate set of recommendations. We can also recommend based on a user's book preferences: users may like books favored by people with similar tastes, such as people who share favorite books or gave the same ratings to certain books. Finally, we may order the recommended list by variables such as author, publisher, and year of publication.
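
The preference-based idea (recommending what like-minded users rated highly) can be sketched as user-based collaborative filtering with cosine similarity. This is one possible approach and not a settled design. It covers only the shared-preference part of the paragraph and leaves out location and age. The ratings and the helper names (`cosine`, `recommend`) are made up for illustration.

```python
from collections import defaultdict
from math import sqrt

# Toy explicit ratings as (User-ID, ISBN, Book-Rating); values are made up
ratings = [
    (1, "A", 9), (1, "B", 8),
    (2, "A", 9), (2, "B", 7), (2, "C", 10),
    (3, "A", 2), (3, "D", 9),
]

# Build a {user: {isbn: rating}} lookup
by_user = defaultdict(dict)
for user_id, isbn, rating in ratings:
    by_user[user_id][isbn] = rating


def cosine(u, v):
    """Cosine similarity between two rating dicts (unrated books count as 0)."""
    shared = set(u) & set(v)
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = sqrt(sum(r * r for r in u.values()))
    norm_v = sqrt(sum(r * r for r in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def recommend(user_id, top_n=3):
    """Score unseen books by similarity-weighted ratings from other users."""
    seen = by_user[user_id]
    scores = defaultdict(float)
    for other_id, other in by_user.items():
        if other_id == user_id:
            continue
        sim = cosine(seen, other)
        for isbn, rating in other.items():
            if isbn not in seen:
                scores[isbn] += sim * rating
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


print(recommend(1))
```

In this toy data user 2 agrees closely with user 1, so user 2's extra book ranks above the book contributed by user 3, whose tastes differ.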

The goal of this project is to recommend books to system users, so it combines predictive and descriptive analysis: the system predicts which books a user may be interested in and presents them based on the dataset.

## Proposed Project Timeline

- Week 5: EDA
- Week 6: EDA
- Week 7: Run models
- Week 8: Run models
- Week 9: Rough Draft
- Week 10: Final Draft

## Questions or Concerns

Do we use content-based or collaborative recommendations? Both?

Do we need to split into train/test sets?

What are the pros and cons of using Python vs R for a recommendation system?