**BookCrossing Dataset Analysis**

**Summary**

According to GroupLens, the BookCrossing (BX) dataset was collected by Cai-Nicolas Ziegler over a four-week period (August/September 2004) from the Book-Crossing community, with permission from Ron Hornbaker, CTO of Humankind Systems. The dataset contains 278,858 users (anonymized, but with demographic information) who provided 1,149,780 ratings (explicit and implicit) of 271,379 books (https://grouplens.org/datasets/book-crossing/). The BookCrossing dataset is unusual in that its listings also include out-of-print books.

Many book-rating collections exist in which users' scores determine the recommendation ranking of each individual item. The BookCrossing data, however, was not gathered to focus exclusively on individual book ratings; it builds on prior research on book recommender systems. Accordingly, the purpose of this analysis is not simply to look up the rating a user gave a book. Rather, it is to predict unknown ratings of books, and thereby find items to recommend, by analyzing the similarities between users.
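
To make "predicting unknown ratings from similarities between users" concrete, here is a minimal sketch of user-based collaborative filtering in plain Python. It is not the Surprise implementation used later in this notebook; the users, books, and ratings are invented, and `cosine_similarity` and `predict_rating` are hypothetical helper names chosen for illustration.

```python
from math import sqrt

# Toy ratings: user -> {book: rating}. All names and values are made up.
toy_ratings = {
    "alice": {"book_a": 8, "book_b": 6, "book_c": 9},
    "bob":   {"book_a": 7, "book_b": 5, "book_c": 8, "book_d": 9},
    "carol": {"book_a": 2, "book_b": 9, "book_d": 3},
}

def cosine_similarity(u, v):
    """Cosine similarity computed over the books both users have rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[b] * v[b] for b in common)
    norm_u = sqrt(sum(u[b] ** 2 for b in common))
    norm_v = sqrt(sum(v[b] ** 2 for b in common))
    return dot / (norm_u * norm_v)

def predict_rating(user, book, ratings):
    """Similarity-weighted average of other users' ratings for `book`.

    Returns None when no other user has rated the book.
    """
    weighted_sum = 0.0
    similarity_sum = 0.0
    for other, other_ratings in ratings.items():
        if other == user or book not in other_ratings:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        weighted_sum += sim * other_ratings[book]
        similarity_sum += sim
    return weighted_sum / similarity_sum if similarity_sum else None

# alice has not rated book_d; estimate it from bob and carol.
prediction = predict_rating("alice", "book_d", toy_ratings)
print(prediction)
```

Because alice's ratings resemble bob's more closely than carol's, the estimate is pulled toward bob's rating of book_d. Library algorithms such as the KNN family imported below follow the same general idea but add refinements (neighbourhood size, baselines, different similarity measures).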

First, import the required libraries.

**Import the necessary libraries**

In [37]:
import pandas as pd
from surprise import Reader
from surprise import Dataset
from surprise.model_selection import cross_validate
from surprise import NormalPredictor
from surprise import KNNBasic
from surprise import KNNWithMeans
from surprise import KNNWithZScore
from surprise import KNNBaseline
from surprise import SVD
from surprise import BaselineOnly
from surprise import SVDpp
from surprise import NMF
from surprise import SlopeOne
from surprise import CoClustering
from surprise.accuracy import rmse
from surprise import accuracy
from surprise.model_selection import train_test_split

**Then it is possible to analyze the 3 datasets: books, ratings, and users**

First, I will import the books dataset.

In [38]:
#on_bad_lines='warn' (pandas >= 1.3) replaces the deprecated error_bad_lines=False:
#malformed rows are skipped and reported instead of raising an error
book = pd.read_csv('BX-Books.csv', delimiter=";", encoding="latin-1", on_bad_lines='warn')
book.columns = ['ISBN', 'bookTitle', 'bookAuthor', 'yearOfPublication', 'publisher', 'imageURLS', 'imageURLM', 'imageURLL']
#book.head()

print("These are the books column names\n", book.columns)
print("\nThese are the books datatypes \n", book.dtypes)

#Count the number of null values in each books column

count_null = book.isna().sum()

print("\nHere are the number of null values for the books columns\n", count_null)

#Drop the image columns in Books

book.drop(['imageURLS','imageURLM','imageURLL'], axis=1, inplace=True)
book.head()
#print("\nBelow is the books table\n", book.head())

Skipping line 6452: expected 8 fields, saw 9
Skipping line 43667: expected 8 fields, saw 10
Skipping line 51751: expected 8 fields, saw 9
Skipping line 92038: expected 8 fields, saw 9
Skipping line 104319: expected 8 fields, saw 9
Skipping line 121768: expected 8 fields, saw 9
Skipping line 144058: expected 8 fields, saw 9
Skipping line 150789: expected 8 fields, saw 9
Skipping line 157128: expected 8 fields, saw 9
Skipping line 180189: expected 8 fields, saw 9
Skipping line 185738: expected 8 fields, saw 9
Skipping line 209388: expected 8 fields, saw 9
Skipping line 220626: expected 8 fields, saw 9
Skipping line 227933: expected 8 fields, saw 11
Skipping line 228957: expected 8 fields, saw 10
Skipping line 245933: expected 8 fields, saw 9
Skipping line 251296: expected 8 fields, saw 9
Skipping line 259941: expected 8 fields, saw 9
Skipping line 261529: expected 8 fields, saw 9


These are the books column names
 Index(['ISBN', 'bookTitle', 'bookAuthor', 'yearOfPublication', 'publisher',
       'imageURLS', 'imageURLM', 'imageURLL'],
      dtype='object')

These are the books datatypes 
 ISBN                 object
bookTitle            object
bookAuthor           object
yearOfPublication    object
publisher            object
imageURLS            object
imageURLM            object
imageURLL            object
dtype: object

Here are the number of null values for the books columns
 ISBN                 0
bookTitle            0
bookAuthor           1
yearOfPublication    0
publisher            2
imageURLS            0
imageURLM            0
imageURLL            3
dtype: int64


Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
0,0195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press
1,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada
2,0060973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial
3,0374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux
4,0393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton &amp; Company


The data types are inconsistent: yearOfPublication is stored as object because the column mixes integers and strings, and it needs to be a single numeric type. Some records are also incorrect. For instance, the publisher was entered in place of the year of publication, as seen with 'DK Publishing Inc' and 'Gallimard'. A publication year of 0 makes no sense either.
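
One way to locate such problem values before correcting them is to coerce the column to numbers and inspect what fails. The following is a small sketch on a made-up Series standing in for yearOfPublication; the variable names are illustrative and are not used elsewhere in this notebook.

```python
import pandas as pd

# Toy stand-in for the yearOfPublication column (values are illustrative).
years = pd.Series(["2002", "1999", "DK Publishing Inc", "0", "Gallimard"])

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
parsed = pd.to_numeric(years, errors="coerce")

# Original values that are not numbers at all (e.g. publisher names).
non_numeric = years[parsed.isna()]

# Values that parse as numbers but hold the meaningless year 0.
zero_years = years[parsed == 0]

print(non_numeric.tolist())
print(zero_years.tolist())
```

On the real data, the same two filters applied to `book.yearOfPublication` would list the rows that need manual correction before the column is converted.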

In [73]:
#Start correcting the books data

#Find the books whose year of publication was incorrectly recorded as 'DK Publishing Inc'

book.loc[book.yearOfPublication == 'DK Publishing Inc',:]

#Input the proper record fields for the first DK Publishing Inc. record

book.loc[book.ISBN == '078946697X','yearOfPublication'] = 2000
book.loc[book.ISBN == '078946697X','bookAuthor'] = "Michael Teitelbaum"
book.loc[book.ISBN == '078946697X','publisher'] = "DK Publishing Inc"

#Input the proper record fields for the second DK Publishing Inc. record

book.loc[book.ISBN == '0789466953', 'yearOfPublication'] = 2000
book.loc[book.ISBN == '0789466953', 'bookAuthor'] = "James Buckley"
book.loc[book.ISBN == '0789466953', 'publisher'] = "DK Publishing Inc"

#Find the books whose year of publication was incorrectly recorded as 'Gallimard'

book.loc[book.yearOfPublication == 'Gallimard',:]

#Input the proper information

book.loc[book.ISBN == '2070426769','yearOfPublication'] = 2003
book.loc[book.ISBN == '2070426769','bookAuthor'] = 'Jean-Marie Gustave Le Clézio'
book.loc[book.ISBN == '2070426769','publisher'] = 'Gallimard'

#Change the yearOfPublication column datatype to numeric
book.yearOfPublication = pd.to_numeric(book.yearOfPublication)

#Find the unique publication years
sorted(book['yearOfPublication'].unique())

#Find the records for which the publication year is 0.
book.loc[book.yearOfPublication == 0,:]

#After researching the actual year, substitute the dates with the value 1376
#book.loc[book.yearOfPublication == 1376,:]

book.head()
print("The shape of books is", book.shape)

The shape of books is (271360, 5)


**Evaluate the ratings table**

In [44]:
#on_bad_lines='warn' (pandas >= 1.3) replaces the deprecated error_bad_lines=False
rating = pd.read_csv('BX-Book-Ratings.csv', sep=';', on_bad_lines='warn', encoding="latin-1")
rating.columns = ['User-ID', 'ISBN', 'bookRating']
#rating.info()
rating.head()

Unnamed: 0,User-ID,ISBN,bookRating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


**Evaluate the users table**

In [45]:
#on_bad_lines='warn' (pandas >= 1.3) replaces the deprecated error_bad_lines=False
user = pd.read_csv('BX-Users.csv', sep=';', on_bad_lines='warn', encoding="latin-1")
user.columns = ['User-ID', 'Location', 'Age']
#user.dtypes
user.head()

Unnamed: 0,User-ID,Location,Age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",


**Join all 3 tables**

In [74]:
#on_bad_lines='warn' (pandas >= 1.3) replaces the deprecated error_bad_lines=False
rating = pd.read_csv('BX-Book-Ratings.csv', sep=';', on_bad_lines='warn', encoding="latin-1")
rating.columns = ['User-ID', 'ISBN', 'bookRating']
#rating.head()
#rating.dtypes

#Join ratings to books on ISBN (inner join: ratings without a matching book are dropped)
rating_book = pd.merge(rating, book, on='ISBN')
rating_book.head()
user = pd.read_csv('BX-Users.csv', sep=';', on_bad_lines='warn', encoding="latin-1")
user.columns = ['User-ID', 'Location', 'Age']
#user.dtypes

#Join the result to users on User-ID
all_ratings = pd.merge(rating_book, user, on='User-ID')
all_ratings.head()

Unnamed: 0,User-ID,ISBN,bookRating,bookTitle,bookAuthor,yearOfPublication,publisher,Location,Age
0,276725,034545104X,0,Flesh Tones: A Novel,M. J. Rose,2002,Ballantine Books,"tyler, texas, usa",
1,2313,034545104X,5,Flesh Tones: A Novel,M. J. Rose,2002,Ballantine Books,"cincinnati, ohio, usa",23.0
2,2313,0812533550,9,Ender's Game (Ender Wiggins Saga (Paperback)),Orson Scott Card,1986,Tor Books,"cincinnati, ohio, usa",23.0
3,2313,0679745580,8,In Cold Blood (Vintage International),TRUMAN CAPOTE,1994,Vintage,"cincinnati, ohio, usa",23.0
4,2313,0060173289,9,Divine Secrets of the Ya-Ya Sisterhood : A Novel,Rebecca Wells,1996,HarperCollins,"cincinnati, ohio, usa",23.0
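
The joins above rely on the default behaviour of `pd.merge`, which is an inner join: a rating whose ISBN has no entry in the books table silently disappears from the result. The sketch below shows this on invented toy tables (the `*_toy` names and all values are hypothetical) and uses a left join with `indicator=True` to count how many ratings would be lost.

```python
import pandas as pd

# Toy tables; every value here is made up for illustration.
ratings_toy = pd.DataFrame({
    "User-ID": [1, 1, 2, 3],
    "ISBN": ["A", "B", "A", "Z"],   # "Z" has no entry in books_toy
    "bookRating": [5, 0, 8, 7],
})
books_toy = pd.DataFrame({"ISBN": ["A", "B"],
                          "bookTitle": ["Title A", "Title B"]})
users_toy = pd.DataFrame({"User-ID": [1, 2, 3],
                          "Age": [23.0, None, 41.0]})

# Same two-step join as in the cell above (default how="inner").
rating_book_toy = pd.merge(ratings_toy, books_toy, on="ISBN")
all_toy = pd.merge(rating_book_toy, users_toy, on="User-ID")

# A left join with indicator=True marks ratings that found no matching book.
check = pd.merge(ratings_toy, books_toy, on="ISBN", how="left", indicator=True)
unmatched = int((check["_merge"] == "left_only").sum())

print(len(ratings_toy), len(all_toy), unmatched)
```

Running the same check on the real `rating` and `book` tables would show how many ratings refer to ISBNs that are missing from the books file.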


Here, it is better to drop the Location and Age columns, since they are not critical to this analysis.

In [53]:
all_ratings.drop(['Location','Age'], axis=1, inplace=True)
all_ratings.head()

Unnamed: 0,User-ID,ISBN,bookRating,bookTitle,bookAuthor,yearOfPublication,publisher
0,276725,034545104X,0,Flesh Tones: A Novel,M. J. Rose,2002,Ballantine Books
1,2313,034545104X,5,Flesh Tones: A Novel,M. J. Rose,2002,Ballantine Books
2,2313,0812533550,9,Ender's Game (Ender Wiggins Saga (Paperback)),Orson Scott Card,1986,Tor Books
3,2313,0679745580,8,In Cold Blood (Vintage International),TRUMAN CAPOTE,1994,Vintage
4,2313,0060173289,9,Divine Secrets of the Ya-Ya Sisterhood : A Novel,Rebecca Wells,1996,HarperCollins


For the recommendation analysis itself, however, it is more productive to work with just two tables: the users table and the ratings table.