**About Book Crossing Dataset**

This dataset was compiled by Cai-Nicolas Ziegler in 2004 and comprises three tables: users, books and ratings. Explicit ratings are expressed on a scale from 1 to 10 (higher values denoting higher appreciation), and an implicit rating is expressed by 0.

Reference: http://www2.informatik.uni-freiburg.de/~cziegler/BX/ 

**Objective**

This project entails building a Book Recommender System for users based on user-based and item-based collaborative filtering approaches.

#### Execute the below cell to load the datasets

In [1]:
import pandas as pd 
import numpy as np 

In [2]:
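# Loading data (pandas >= 1.3: pass on_bad_lines="skip" instead of error_bad_lines=False, which was removed in 2.0)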
books = pd.read_csv("books.csv", sep=";", error_bad_lines=False, encoding="latin-1")
books.columns = ['ISBN', 'bookTitle', 'bookAuthor', 'yearOfPublication', 'publisher', 'imageUrlS', 'imageUrlM', 'imageUrlL']

b'Skipping line 6452: expected 8 fields, saw 9\nSkipping line 43667: expected 8 fields, saw 10\nSkipping line 51751: expected 8 fields, saw 9\n'
b'Skipping line 92038: expected 8 fields, saw 9\nSkipping line 104319: expected 8 fields, saw 9\nSkipping line 121768: expected 8 fields, saw 9\n'
b'Skipping line 144058: expected 8 fields, saw 9\nSkipping line 150789: expected 8 fields, saw 9\nSkipping line 157128: expected 8 fields, saw 9\nSkipping line 180189: expected 8 fields, saw 9\nSkipping line 185738: expected 8 fields, saw 9\n'
b'Skipping line 209388: expected 8 fields, saw 9\nSkipping line 220626: expected 8 fields, saw 9\nSkipping line 227933: expected 8 fields, saw 11\nSkipping line 228957: expected 8 fields, saw 10\nSkipping line 245933: expected 8 fields, saw 9\nSkipping line 251296: expected 8 fields, saw 9\nSkipping line 259941: expected 8 fields, saw 9\nSkipping line 261529: expected 8 fields, saw 9\n'


### Check the number of records and features in each dataset

In [3]:
books.shape

(271360, 8)

## Exploring books dataset

In [4]:
books.head()

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher,imageUrlS,imageUrlM,imageUrlL
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...


### Drop last three columns containing image URLs which will not be required for analysis

In [5]:
books.drop(['imageUrlS', 'imageUrlM', 'imageUrlL'], axis=1, inplace = True)

In [6]:
books.head()

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company


**yearOfPublication**

### Check unique values of yearOfPublication


In [7]:
books.yearOfPublication.unique()

array([2002, 2001, 1991, 1999, 2000, 1993, 1996, 1988, 2004, 1998, 1994,
       2003, 1997, 1983, 1979, 1995, 1982, 1985, 1992, 1986, 1978, 1980,
       1952, 1987, 1990, 1981, 1989, 1984, 0, 1968, 1961, 1958, 1974,
       1976, 1971, 1977, 1975, 1965, 1941, 1970, 1962, 1973, 1972, 1960,
       1966, 1920, 1956, 1959, 1953, 1951, 1942, 1963, 1964, 1969, 1954,
       1950, 1967, 2005, 1957, 1940, 1937, 1955, 1946, 1936, 1930, 2011,
       1925, 1948, 1943, 1947, 1945, 1923, 2020, 1939, 1926, 1938, 2030,
       1911, 1904, 1949, 1932, 1928, 1929, 1927, 1931, 1914, 2050, 1934,
       1910, 1933, 1902, 1924, 1921, 1900, 2038, 2026, 1944, 1917, 1901,
       2010, 1908, 1906, 1935, 1806, 2021, '2000', '1995', '1999', '2004',
       '2003', '1990', '1994', '1986', '1989', '2002', '1981', '1993',
       '1983', '1982', '1976', '1991', '1977', '1998', '1992', '1996',
       '0', '1997', '2001', '1974', '1968', '1987', '1984', '1988',
       '1963', '1956', '1970', '1985', '1978', '1973', '1980'

In [8]:
books.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271360 entries, 0 to 271359
Data columns (total 5 columns):
ISBN                 271360 non-null object
bookTitle            271360 non-null object
bookAuthor           271359 non-null object
yearOfPublication    271360 non-null object
publisher            271358 non-null object
dtypes: object(5)
memory usage: 10.4+ MB


As seen above, this field contains some incorrect entries. It looks like the publisher names 'DK Publishing Inc' and 'Gallimard' have been loaded as yearOfPublication because of errors in the CSV file.

Also, some entries are strings while the same years appear as numbers elsewhere. We will fix these issues in the steps that follow.
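
One way to spot non-numeric entries in a mixed column is `pd.to_numeric` with `errors="coerce"`. The sketch below runs on a small made-up Series (`raw_years` is hypothetical toy data, not the real column) and is only meant to illustrate the idea, not to replace the cells that follow.

```python
import pandas as pd

# Toy column mimicking the mix of ints, numeric strings and stray text seen above.
raw_years = pd.Series([2002, "1999", "0", "DK Publishing Inc", 1806], dtype="object")

# errors="coerce" turns anything that is not a number into NaN instead of raising.
years = pd.to_numeric(raw_years, errors="coerce")

# Entries that failed conversion are the ones to inspect or drop.
bad_rows = raw_years[years.isna()]
clean_years = years.dropna().astype(int)

print(bad_rows.tolist())
print(clean_years.tolist())
```

Coercing first makes the offending rows easy to list before deciding whether to drop or repair them.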

### Check the rows having 'DK Publishing Inc' as yearOfPublication

In [9]:
todrop = books[(books.yearOfPublication == 'DK Publishing Inc') | (books.yearOfPublication == 'Gallimard')]

In [10]:
todroplist = todrop.index.tolist()

### Drop the rows having `'DK Publishing Inc'` and `'Gallimard'` as `yearOfPublication`

In [11]:
books.drop(todroplist, axis=0, inplace= True)

In [12]:
books.shape

(271357, 5)

### Change the datatype of yearOfPublication to 'int'

In [13]:
books.yearOfPublication = books.yearOfPublication.astype('int')

In [14]:
books.dtypes

ISBN                 object
bookTitle            object
bookAuthor           object
yearOfPublication     int32
publisher            object
dtype: object

### Drop NaNs in `'publisher'` column


In [15]:
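# Caveat: this targets yearOfPublication, so no 'publisher' NaNs are dropped; books.dropna(subset=['publisher'], inplace=True) would match the heading
books.yearOfPublication.dropna(inplace=True)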

In [16]:
books.shape

(271357, 5)

## Exploring Users dataset

In [17]:
users = pd.read_csv('users.csv', sep=';', error_bad_lines=False, encoding="latin-1")
users.columns = ['userID', 'Location', 'Age']

In [18]:
users.shape

(278858, 3)

In [19]:
users.head()

Unnamed: 0,userID,Location,Age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",


### Get all unique values in ascending order for column `Age`

In [20]:
np.sort(users.Age.unique())

array([  0.,   1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.,  10.,
        11.,  12.,  13.,  14.,  15.,  16.,  17.,  18.,  19.,  20.,  21.,
        22.,  23.,  24.,  25.,  26.,  27.,  28.,  29.,  30.,  31.,  32.,
        33.,  34.,  35.,  36.,  37.,  38.,  39.,  40.,  41.,  42.,  43.,
        44.,  45.,  46.,  47.,  48.,  49.,  50.,  51.,  52.,  53.,  54.,
        55.,  56.,  57.,  58.,  59.,  60.,  61.,  62.,  63.,  64.,  65.,
        66.,  67.,  68.,  69.,  70.,  71.,  72.,  73.,  74.,  75.,  76.,
        77.,  78.,  79.,  80.,  81.,  82.,  83.,  84.,  85.,  86.,  87.,
        88.,  89.,  90.,  91.,  92.,  93.,  94.,  95.,  96.,  97.,  98.,
        99., 100., 101., 102., 103., 104., 105., 106., 107., 108., 109.,
       110., 111., 113., 114., 115., 116., 118., 119., 123., 124., 127.,
       128., 132., 133., 136., 137., 138., 140., 141., 143., 146., 147.,
       148., 151., 152., 156., 157., 159., 162., 168., 172., 175., 183.,
       186., 189., 199., 200., 201., 204., 207., 20

The Age column has some invalid entries: NaN, 0, and implausibly high values (100 and above).

### Values below 5 and above 90 do not make much sense for our book rating case, so replace these with NaNs

In [21]:
users.loc[users.Age <5,'Age'] = np.nan

In [22]:
users.loc[users.Age > 90,'Age'] = np.nan

In [23]:
users.Age.mean()

34.72384041634689

In [24]:
np.sort(users.Age.unique())

array([ 5.,  6.,  7.,  8.,  9., 10., 11., 12., 13., 14., 15., 16., 17.,
       18., 19., 20., 21., 22., 23., 24., 25., 26., 27., 28., 29., 30.,
       31., 32., 33., 34., 35., 36., 37., 38., 39., 40., 41., 42., 43.,
       44., 45., 46., 47., 48., 49., 50., 51., 52., 53., 54., 55., 56.,
       57., 58., 59., 60., 61., 62., 63., 64., 65., 66., 67., 68., 69.,
       70., 71., 72., 73., 74., 75., 76., 77., 78., 79., 80., 81., 82.,
       83., 84., 85., 86., 87., 88., 89., 90., nan])
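
The two range filters above can also be expressed in one step with `Series.between` and `Series.where`. This is a minimal sketch on toy values (`ages` is made up for illustration); the 5 and 90 bounds are the ones chosen in this notebook.

```python
import numpy as np
import pandas as pd

# Toy ages including missing, too-low and too-high values.
ages = pd.Series([np.nan, 0.0, 4.0, 5.0, 34.0, 90.0, 91.0, 204.0])

# between() is inclusive on both ends by default; where() keeps values
# that satisfy the mask and replaces everything else with NaN.
valid_ages = ages.where(ages.between(5, 90))

print(valid_ages.tolist())
```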

### Replace null values in column `Age` with mean

In [25]:
users.Age = users.Age.fillna(users.Age.mean())

### Change the datatype of `Age` to `int`

In [26]:
users.Age = users.Age.astype('int')

In [27]:
users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 278858 entries, 0 to 278857
Data columns (total 3 columns):
userID      278858 non-null int64
Location    278858 non-null object
Age         278858 non-null int32
dtypes: int32(1), int64(1), object(1)
memory usage: 5.3+ MB


## Exploring the Ratings Dataset

### Load the ratings and check the shape

In [28]:
ratings = pd.read_csv('ratings.csv', sep=';', error_bad_lines=False, encoding="latin-1")
ratings.columns = ['userID', 'ISBN', 'bookRating']

In [29]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1149780 entries, 0 to 1149779
Data columns (total 3 columns):
userID        1149780 non-null int64
ISBN          1149780 non-null object
bookRating    1149780 non-null int64
dtypes: int64(2), object(1)
memory usage: 26.3+ MB


In [30]:
n_users = users.shape[0]
n_books = books.shape[0]

In [31]:
ratings.shape

(1149780, 3)

In [32]:
ratings.head(5)

Unnamed: 0,userID,ISBN,bookRating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


### The ratings dataset should only contain books that exist in our books dataset. Drop the remaining rows

In [33]:
ratings = ratings[ratings.ISBN.isin(books.ISBN)]
print(ratings.shape)

(1031132, 3)


### The ratings dataset should only contain ratings from users that exist in the users dataset. Drop the remaining rows

In [34]:
ratings = ratings[ratings.userID.isin(users.userID)]
print(ratings.shape)

(1031132, 3)


### Consider only ratings from 1-10 and leave out the 0s in column `bookRating`

In [35]:
ratings = ratings[(ratings.bookRating >= 1)]
ratings = ratings[(ratings.bookRating <=10 )]

In [36]:
ratings.shape

(383841, 3)

### Find out which rating has been given highest number of times

In [37]:
import statistics as st

In [38]:
print(st.mode(ratings.bookRating))

8
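
As an alternative to `statistics.mode`, pandas' `value_counts` returns the full distribution, which also shows how far ahead the most frequent rating is. A small sketch on made-up ratings (`toy_ratings` is hypothetical):

```python
import pandas as pd

# Toy explicit ratings; the real column is ratings.bookRating.
toy_ratings = pd.Series([8, 10, 8, 7, 8, 10])

# value_counts() counts occurrences of each distinct rating.
counts = toy_ratings.value_counts()

# idxmax() returns the rating with the highest count.
most_common = counts.idxmax()

print(counts)
print(most_common)
```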


### **Collaborative Filtering Based Recommendation Systems**

### For more accurate results only consider users who have rated at least 100 books

In [66]:
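# Caveat: sum() totals each user's rating values; count() >= 100 would count rated books instead
df = ratings.groupby(['userID'])[['bookRating']].sum() > 100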

In [80]:
userlist = df[df.bookRating == True].index.tolist()

In [81]:
ratings = ratings[ratings.userID.isin(userlist)]

In [86]:
ratings.shape

(240503, 3)
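
Note that the filter above compares the sum of each user's rating values with 100. If the intent is to keep users by the number of books they rated, a count-based filter could look like the sketch below. The data and the `min_ratings` threshold are toy assumptions; the notebook's results were produced with the sum-based version, so switching would change the shapes reported here.

```python
import pandas as pd

# Toy ratings table with the same column names as the notebook.
toy = pd.DataFrame({
    "userID": [1, 1, 1, 2, 2, 3],
    "ISBN": ["a", "b", "c", "a", "d", "e"],
    "bookRating": [10, 9, 8, 5, 5, 10],
})
min_ratings = 3  # hypothetical threshold; the heading above uses 100

# Number of rated books per user (a count, not a sum of rating values).
ratings_per_user = toy.groupby("userID")["ISBN"].count()

# Keep only users who meet the threshold.
active_users = ratings_per_user[ratings_per_user >= min_ratings].index
filtered = toy[toy.userID.isin(active_users)]

print(ratings_per_user)
print(filtered)
```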

### Generating ratings matrix from explicit ratings


#### Note: since NaNs cannot be handled by training algorithms, replace these by 0, which indicates absence of ratings

In [85]:
ratings.bookRating = ratings.bookRating.fillna(0)

In [87]:
from sklearn.model_selection import train_test_split

In [88]:
traindf, testdf= train_test_split(ratings, test_size=.30, random_state = 12)

In [90]:
traindf.shape

(168352, 3)

In [91]:
testdf.shape

(72151, 3)

In [92]:
testdf_copy = testdf.copy()

In [94]:
testdf = testdf.assign(bookRating=np.nan)  # assign() returns a new frame, avoiding SettingWithCopyWarning


In [95]:
testdf.head()

Unnamed: 0,userID,ISBN,bookRating
260430,60244,0440998050,
533900,128835,0805029648,
211623,49144,8433920103,
61227,12538,0345423615,
492922,117873,037541309X,


In [96]:
ratings = pd.concat([traindf, testdf]).reset_index()

In [101]:
rf_d = ratings.pivot(index='userID', columns='ISBN', values='bookRating').fillna(0)

In [120]:
rf_d.head()

userID,0000913154,0001046438,000104687X,0001047213,0001047973,000104799X,0001048082,0001053736,0001053744,0001055607,...,B0000T6KIM,B0000VZEH8,B0000VZEJQ,B0000X8HIE,B00011SOXI,B00013AX9E,B0001FZGRQ,B0001GMSV2,B0001I1KOG,B000234N3A
242,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
243,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
254,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
388,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
503,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [122]:
rf_d.shape

(4742, 114930)
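
A dense 4742 x 114930 matrix of float64 values takes on the order of 4 GB, almost all of it zeros. The sketch below shows one possible way to build the same user-by-book matrix as a SciPy sparse matrix, which `svds` also accepts. The toy frame and variable names are assumptions for illustration only.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Toy ratings with the same column names as the notebook.
toy = pd.DataFrame({
    "userID": [242, 242, 243, 254],
    "ISBN": ["0001", "0002", "0002", "0003"],
    "bookRating": [8.0, 5.0, 10.0, 7.0],
})

# Categorical codes map each user and each book to a dense integer position.
user_cat = toy.userID.astype("category")
book_cat = toy.ISBN.astype("category")
rows = user_cat.cat.codes.to_numpy()
cols = book_cat.cat.codes.to_numpy()

# Only the observed ratings are stored; absent entries are implicit zeros.
sparse_ratings = csr_matrix(
    (toy.bookRating.to_numpy(), (rows, cols)),
    shape=(len(user_cat.cat.categories), len(book_cat.cat.categories)),
)

print(sparse_ratings.shape, sparse_ratings.nnz)
```

`user_cat.cat.categories` and `book_cat.cat.categories` keep the mapping from row and column positions back to user IDs and ISBNs.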

### Generate the predicted ratings using SVD with 50 singular values

In [105]:
from scipy.sparse.linalg import svds

In [106]:
u, sigma, vt = svds(rf_d.to_numpy(), k=50)  # pass a float NumPy array rather than the DataFrame

In [108]:
sigma = np.diag(sigma)

In [109]:
sigma

array([[126.51494072,   0.        ,   0.        , ...,   0.        ,
          0.        ,   0.        ],
       [  0.        , 126.89977161,   0.        , ...,   0.        ,
          0.        ,   0.        ],
       [  0.        ,   0.        , 127.56981882, ...,   0.        ,
          0.        ,   0.        ],
       ...,
       [  0.        ,   0.        ,   0.        , ..., 318.95987651,
          0.        ,   0.        ],
       [  0.        ,   0.        ,   0.        , ...,   0.        ,
        534.89319882,   0.        ],
       [  0.        ,   0.        ,   0.        , ...,   0.        ,
          0.        , 566.12387367]])

In [110]:
all_user_predicted_rating = np.dot(np.dot(u, sigma), vt)

In [111]:
all_user_predicted_rating

array([[-4.32442641e-04, -1.60385812e-04, -1.06923875e-04, ...,
         2.06417624e-04,  1.96668116e-04, -1.94569123e-03],
       [-1.94873802e-03,  1.67599491e-02,  1.11732994e-02, ...,
        -4.77380411e-05,  1.57255867e-03, -2.63654732e-03],
       [-6.21084949e-03, -1.02261086e-02, -6.81740570e-03, ...,
         8.56353181e-03, -2.28357497e-03,  5.33877361e-03],
       ...,
       [-3.32887377e-03, -2.69258696e-03, -1.79505797e-03, ...,
        -2.46243865e-04,  1.49386109e-03, -3.02834600e-03],
       [-1.54127441e-04, -7.61205212e-03, -5.07470141e-03, ...,
        -1.16415941e-03,  5.87175330e-03,  7.24745025e-04],
       [-8.54974069e-04, -3.04584433e-04, -2.03056288e-04, ...,
         4.23203300e-04, -3.23164404e-04,  1.77882104e-02]])
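
To make the reconstruction step concrete: multiplying `u`, `diag(sigma)` and `vt` from a truncated SVD gives the best rank-k approximation of the original matrix. The sketch below checks this on a small random matrix (toy data, with k=3 instead of 50) by comparing against the leading components of a full SVD.

```python
import numpy as np
from scipy.sparse.linalg import svds

# Small random "ratings" matrix standing in for rf_d.
rng = np.random.default_rng(0)
toy = rng.integers(0, 11, size=(6, 8)).astype(float)

k = 3  # must be smaller than min(toy.shape)
u, s, vt = svds(toy, k=k)

# Rank-k approximation, same formula as all_user_predicted_rating above.
approx = u @ np.diag(s) @ vt

# Reference: keep the k largest components of a full SVD.
u_full, s_full, vt_full = np.linalg.svd(toy, full_matrices=False)
reference = (u_full[:, :k] * s_full[:k]) @ vt_full[:k]

print(np.abs(approx - reference).max())
```

Because the reconstruction is low rank, predicted values for unrated books are small numbers near zero rather than values on the 1-10 scale, which is why they are used for ranking rather than read as ratings.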

In [123]:
pred_df = pd.DataFrame(all_user_predicted_rating, columns=rf_d.columns, index=rf_d.index)

In [124]:
pred_df.shape

(4742, 114930)
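
Since the true test ratings were saved in `testdf_copy`, the predictions could be scored against them, for example with RMSE. The notebook does not show this step, so the following is only a sketch under that assumption, using toy stand-ins (`pred` for `pred_df`, `held_out` for `testdf_copy`).

```python
import numpy as np
import pandas as pd

# Toy prediction matrix: rows are users, columns are ISBNs.
pred = pd.DataFrame([[7.5, 0.2], [0.1, 4.0]], index=[10, 20], columns=["a", "b"])

# Toy held-out ratings with their true values.
held_out = pd.DataFrame({"userID": [10, 20], "ISBN": ["a", "b"], "bookRating": [8, 5]})

# Translate each (user, book) pair into integer positions in the matrix.
row_pos = pred.index.get_indexer(held_out.userID.to_numpy())
col_pos = pred.columns.get_indexer(held_out.ISBN.to_numpy())
predicted = pred.to_numpy()[row_pos, col_pos]

# Root mean squared error between predicted and true ratings.
errors = predicted - held_out.bookRating.to_numpy()
rmse = float(np.sqrt(np.mean(errors ** 2)))

print(rmse)
```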

### Take a particular user_id

### Let's find the recommendations for user with id `2110`

#### Note: Execute the below cells to get the variables loaded

In [131]:
userID = 2110

In [132]:
user_id = 2 #2nd row in ratings matrix and predicted matrix

### Get the predicted ratings for userID `2110` and sort them in descending order

In [192]:
pred_df.loc[2110,:].sort_values(ascending=False)

ISBN
059035342X    1.939323
043935806X    1.193679
0439139597    0.976141
0439064872    0.947390
0439136369    0.889037
0439064864    0.744382
0439136350    0.676777
0439139600    0.533552
0446310786    0.408191
0385504209    0.383688
055321313X    0.381561
0843949945    0.370236
0345339681    0.329750
084394952X    0.289568
0553213148    0.285922
050552354X    0.275380
0671027360    0.266212
0316769487    0.264558
0316666343    0.259512
0553280341    0.241184
0505524503    0.240817
0505523728    0.240036
0843944463    0.238470
0505523752    0.237846
1558743669    0.226316
0449221512    0.225470
0590353403    0.219072
0671027573    0.214909
0140067477    0.210734
0440211727    0.208226
                ...   
055358099X   -0.118091
0373218486   -0.119335
0671744607   -0.119555
0440209633   -0.121646
0373218435   -0.122790
0399149392   -0.123238
0671674528   -0.123802
0380400634   -0.124072
067173976X   -0.127560
0671739794   -0.129234
0515136530   -0.129870
067168972X   -0.130830
038097

### Create a dataframe named `user_data` containing the books that userID `2110` has explicitly interacted with

In [193]:
user_data = pd.DataFrame(pred_df.loc[2110,:].sort_values(ascending=False)).reset_index()

In [194]:
user_data.head()

Unnamed: 0,ISBN,2110
0,059035342X,1.939323
1,043935806X,1.193679
2,0439139597,0.976141
3,0439064872,0.94739
4,0439136369,0.889037


In [195]:
user_data.shape

(114930, 2)
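
The `user_data` frame above holds predicted scores for every book. If the books the user actually rated are needed instead, they can be selected from the ratings table, as in this sketch. The frame and ISBN labels are toy placeholders; in the notebook the source would presumably be the training ratings, since test ratings were blanked out.

```python
import pandas as pd

# Toy explicit ratings with the notebook's column names.
toy_ratings = pd.DataFrame({
    "userID": [2110, 2110, 2110, 999],
    "ISBN": ["isbn_a", "isbn_b", "isbn_c", "isbn_a"],
    "bookRating": [9, 0, 7, 10],
})
target_user = 2110

# Keep this user's rows with an explicit (non-zero) rating, best first.
mask = (toy_ratings.userID == target_user) & (toy_ratings.bookRating > 0)
user_rated = (
    toy_ratings[mask]
    .sort_values("bookRating", ascending=False)
    .reset_index(drop=True)
)

print(user_rated)
```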

### Combine `user_data` and the corresponding book data (`books`) in a single dataframe named `user_full_info`

In [187]:
user_full_info = pd.merge(user_data, books, on='ISBN', how='inner')

In [188]:
user_full_info

Unnamed: 0,ISBN,2110,bookTitle,bookAuthor,yearOfPublication,publisher
0,059035342X,1.939323,Harry Potter and the Sorcerer's Stone (Harry P...,J. K. Rowling,1999,Arthur A. Levine Books
1,043935806X,1.193679,Harry Potter and the Order of the Phoenix (Boo...,J. K. Rowling,2003,Scholastic
2,0439139597,0.976141,Harry Potter and the Goblet of Fire (Book 4),J. K. Rowling,2000,Scholastic
3,0439064872,0.947390,Harry Potter and the Chamber of Secrets (Book 2),J. K. Rowling,2000,Scholastic
4,0439136369,0.889037,Harry Potter and the Prisoner of Azkaban (Book 3),J. K. Rowling,2001,Scholastic
5,0439064864,0.744382,Harry Potter and the Chamber of Secrets (Book 2),J. K. Rowling,1999,Scholastic
6,0439136350,0.676777,Harry Potter and the Prisoner of Azkaban (Book 3),J. K. Rowling,1999,Scholastic
7,0439139600,0.533552,Harry Potter and the Goblet of Fire (Book 4),J. K. Rowling,2002,Scholastic Paperbacks
8,0446310786,0.408191,To Kill a Mockingbird,Harper Lee,1988,Little Brown & Company
9,0385504209,0.383688,The Da Vinci Code,Dan Brown,2003,Doubleday


In [190]:
books.head()  # the book metadata lives in `books`; `book_data` is not defined in this notebook

In [191]:
user_full_info.head()

Unnamed: 0,ISBN,2110,bookTitle,bookAuthor,yearOfPublication,publisher
0,059035342X,1.939323,Harry Potter and the Sorcerer's Stone (Harry P...,J. K. Rowling,1999,Arthur A. Levine Books
1,043935806X,1.193679,Harry Potter and the Order of the Phoenix (Boo...,J. K. Rowling,2003,Scholastic
2,0439139597,0.976141,Harry Potter and the Goblet of Fire (Book 4),J. K. Rowling,2000,Scholastic
3,0439064872,0.94739,Harry Potter and the Chamber of Secrets (Book 2),J. K. Rowling,2000,Scholastic
4,0439136369,0.889037,Harry Potter and the Prisoner of Azkaban (Book 3),J. K. Rowling,2001,Scholastic


### Get the top 10 recommendations for the above userID from the books not already rated by that user
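
One possible way to do this is to drop the user's already-rated ISBNs from their row of predictions and take the highest remaining scores. The helper `top_n_unrated` and the toy inputs below are hypothetical names introduced for illustration, not part of the original notebook.

```python
import pandas as pd

def top_n_unrated(pred_row, rated_isbns, n=10):
    """Return the n highest predicted scores among books not in rated_isbns."""
    # errors="ignore" skips rated ISBNs that are missing from the prediction row.
    candidates = pred_row.drop(labels=list(rated_isbns), errors="ignore")
    return candidates.nlargest(n)

# Toy predictions for one user (index = ISBN) and the books they already rated.
toy_pred = pd.Series({"a": 1.9, "b": 1.2, "c": 0.9, "d": 0.4, "e": -0.1})
toy_rated = {"a", "c"}

recs = top_n_unrated(toy_pred, toy_rated, n=2)
print(recs)
```

In the notebook this might be called as `top_n_unrated(pred_df.loc[2110], rated_isbns)`, where `rated_isbns` is the set of ISBNs user `2110` has rated, and the result merged with `books` on `ISBN` to attach titles and authors.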