In [20]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data is contained in '../Data/Processed' directory
# (Further) processed data will be written to '../Data/Processed'
# This cell lists all files under the input directory

import os
INPUT_DIR = os.path.join(os.path.dirname(os.getcwd()), 'Data', 'Processed')
OUTPUT_DIR = os.path.join(os.path.dirname(os.getcwd()), 'Data', 'Processed')
for dirname, _, filenames in os.walk(INPUT_DIR):
    for filename in filenames:
        print(os.path.join(dirname, filename))

c:\Users\ASUS\Documents\Python Programming\Book recommendations\Data\Processed\Books_valid_ISBN_known_year_no_images.csv
c:\Users\ASUS\Documents\Python Programming\Book recommendations\Data\Processed\popular_books_with_descriptions.csv
c:\Users\ASUS\Documents\Python Programming\Book recommendations\Data\Processed\ratings_for_popular_books.csv
c:\Users\ASUS\Documents\Python Programming\Book recommendations\Data\Processed\Ratings_valid_ISBN.csv
c:\Users\ASUS\Documents\Python Programming\Book recommendations\Data\Processed\users_valid_age_with_country.csv


In this notebook we choose the 100 most popular books (by number of ratings) and collect information about them. First, we load the books dataframe and check what information about the books it contains.

In [21]:
books_df = pd.read_csv(os.path.join(INPUT_DIR, 'Books_valid_ISBN_known_year_no_images.csv'))
print(books_df.head())

         ISBN                                         Book-Title  \
0  0195153448                                Classical Mythology   
1  0002005018                                       Clara Callan   
2  0060973129                               Decision in Normandy   
3  0374157065  Flu: The Story of the Great Influenza Pandemic...   
4  0393045218                             The Mummies of Urumchi   

            Book-Author  Year-Of-Publication                   Publisher  
0    Mark P. O. Morford               2002.0     Oxford University Press  
1  Richard Bruce Wright               2001.0       HarperFlamingo Canada  
2          Carlo D'Este               1991.0             HarperPerennial  
3      Gina Bari Kolata               1999.0        Farrar Straus Giroux  
4       E. J. W. Barber               1999.0  W. W. Norton &amp; Company  


This does not tell us much about the books themselves. Let us take the first book and try to get more information about it from the Google Books API.

In [22]:
import requests
import json
import time

In [23]:
my_isbn = books_df['ISBN'].iloc[0]
res = requests.get('https://www.googleapis.com/books/v1/volumes?q=isbn:'+my_isbn)
print(res)
if res.ok:
    print(res.json())

<Response [200]>
{'kind': 'books#volumes', 'totalItems': 1, 'items': [{'kind': 'books#volume', 'id': 'KyLfwAEACAAJ', 'etag': 'NhvhflOSOCU', 'selfLink': 'https://www.googleapis.com/books/v1/volumes/KyLfwAEACAAJ', 'volumeInfo': {'title': 'Classical Mythology', 'authors': ['Mark P. O. Morford', 'Robert J. Lenardon'], 'publisher': 'Oxford University Press, USA', 'publishedDate': '2003', 'description': 'Provides an introduction to classical myths placing the addressed topics within their historical context, discussion of archaeological evidence as support for mythical events, and how these themes have been portrayed in literature, art, music, and film.', 'industryIdentifiers': [{'type': 'ISBN_10', 'identifier': '0195153448'}, {'type': 'ISBN_13', 'identifier': '9780195153446'}], 'readingModes': {'text': False, 'image': False}, 'pageCount': 808, 'printType': 'BOOK', 'categories': ['Social Science'], 'maturityRating': 'NOT_MATURE', 'allowAnonLogging': False, 'contentVersion': 'preview-1.0.0', 

Unfortunately, there is no detailed genre information (only a coarse 'categories' field), but there is a 'description' field. We are going to use that.
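
As a minimal sketch of how the description can be pulled out of such a response, the snippet below works on a hand-written, shortened stand-in for the JSON shown above, so it needs no network access. The helper name extract_fields and the sample dictionary are illustrative assumptions, not part of the notebook or of the Google Books API.

```python
# A trimmed, hand-written stand-in for the JSON returned by the volumes endpoint.
# Only fields visible in the output above are included; the description is shortened.
sample_response = {
    'kind': 'books#volumes',
    'totalItems': 1,
    'items': [{
        'volumeInfo': {
            'title': 'Classical Mythology',
            'authors': ['Mark P. O. Morford', 'Robert J. Lenardon'],
            'description': 'Provides an introduction to classical myths ...',
            'categories': ['Social Science'],
        }
    }],
}

def extract_fields(response, keys=('title', 'authors', 'description')):
    """Return the requested volumeInfo fields of the first item, skipping absent ones."""
    items = response.get('items', [])
    if not items:
        return {}
    volume_info = items[0].get('volumeInfo', {})
    return {key: volume_info[key] for key in keys if key in volume_info}

fields = extract_fields(sample_response)
print(fields['title'])
print(fields['description'])
```

Fields that are absent from a volume are simply left out of the result, so the caller has to check for a missing description before using it.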

We have many ratings for many books. Let us consider only the 100 books with the most ratings and the users who rated them. We will get the descriptions of those books and try to build recommendations from them. We will then check whether the recommendations make sense and think about scaling the system later.

First, we load the ratings dataframe.

In [24]:
ratings_df = pd.read_csv(os.path.join(INPUT_DIR, 'Ratings_valid_ISBN.csv'))
print(ratings_df.info())
print(ratings_df.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1136188 entries, 0 to 1136187
Data columns (total 3 columns):
 #   Column       Non-Null Count    Dtype 
---  ------       --------------    ----- 
 0   User-ID      1136188 non-null  int64 
 1   ISBN         1136188 non-null  object
 2   Book-Rating  1136188 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 26.0+ MB
None
   User-ID        ISBN  Book-Rating
0   276725  034545104X            0
1   276726  0155061224            5
2   276727  0446520802            0
3   276729  052165615X            3
4   276729  0521795028            6


Now we group the ratings by ISBN and sort the groups by the number of ratings. The 100 books with the most ratings form our list of top-reviewed books.

In [25]:
books_sorted = ratings_df.groupby('ISBN').size().sort_values(ascending=False)
print(books_sorted.head())
top_reviewed = books_sorted.index[:100]
print(top_reviewed)

ISBN
0971880107    2502
0316666343    1295
0385504209     883
0060928336     732
0312195516     723
dtype: int64
Index(['0971880107', '0316666343', '0385504209', '0060928336', '0312195516',
       '044023722X', '0679781587', '0142001740', '067976402X', '0671027360',
       '0446672211', '059035342X', '0316601950', '0375727345', '044021145X',
       '0452282152', '0440214041', '0804106304', '0440211727', '0345337662',
       '0060930535', '0440226430', '0312278586', '0743418174', '0671021001',
       '0345370775', '0446605239', '0156027321', '0440241073', '0671003755',
       '0060976845', '1400034779', '0786868716', '0440234743', '0440222656',
       '0440221471', '0345361792', '0440236673', '0345417623', '0316769487',
       '0446610038', '0385484518', '0446310786', '044022165X', '0375706771',
       '0440225701', '0440220602', '0060502258', '0446606812', '0345353145',
       '044651652X', '0140293248', '0440213525', '0345443284', '0440206154',
       '006101351X', '0316284955', '0375

Now that we have the list of books with the most ratings, let us keep only the ratings for those books. We will work only with explicit ratings, so we keep only the ratings that are non-zero.

In [26]:
ratings_popular_df = ratings_df[(ratings_df['ISBN'].isin(top_reviewed)) & (ratings_df['Book-Rating']!=0)]
print(ratings_popular_df.info())
print(ratings_popular_df.head())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 18847 entries, 80 to 1136050
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   User-ID      18847 non-null  int64 
 1   ISBN         18847 non-null  object
 2   Book-Rating  18847 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 589.0+ KB
None
     User-ID        ISBN  Book-Rating
80    276788  043935806X            7
414   276925  0385504209            8
624   276953  0446310786           10
665   276964  0440220602            9
785   277042  0971880107            2
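
The two selection steps above (pick the most-rated books, then keep only their non-zero ratings) can be illustrated on toy data using only the standard library. This is a sketch, not the notebook's method: the triples, the helper names top_n_isbns and explicit_ratings_for, and the ISBN placeholders 'A', 'B', 'C' are all made up. As in the cells above, popularity is counted over all ratings, including the zero ones.

```python
from collections import Counter

# Toy (user_id, isbn, rating) triples; a rating of 0 is treated as implicit,
# following the convention used in this notebook.
ratings = [
    (1, 'A', 0), (2, 'A', 8), (3, 'A', 5),
    (1, 'B', 7), (2, 'B', 0),
    (3, 'C', 9),
]

def top_n_isbns(rows, n):
    """Return the n ISBNs with the most ratings (zero ratings included)."""
    counts = Counter(isbn for _, isbn, _ in rows)
    return [isbn for isbn, _ in counts.most_common(n)]

def explicit_ratings_for(rows, isbns):
    """Keep only non-zero ratings whose ISBN is in the given collection."""
    wanted = set(isbns)
    return [row for row in rows if row[1] in wanted and row[2] != 0]

top = top_n_isbns(ratings, 2)
kept = explicit_ratings_for(ratings, top)
print(top)
print(kept)
```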


We see that this dataframe is much smaller than the original. How many users do we have here?

In [27]:
print(ratings_popular_df['User-ID'].nunique())

11030


Let us get the descriptions for the books. We will also get the title and authors so that we can later check what kind of books we are recommending.

In [28]:
def get_book_info(isbn, retries=3):
    """Requests information about the book from the Google Books API using the ISBN provided.
    Returns a dictionary with the title, authors and description (where available) and the ISBN.
    Returns an empty dictionary if the book is not found or the request fails."""
    url = 'https://www.googleapis.com/books/v1/volumes?q=isbn:'+isbn
    try:
        r = requests.get(url, timeout=3)
        r.raise_for_status()
        data = r.json()  # parse the response body only once
        if data['totalItems'] == 0:
            print('The book with ISBN:'+isbn+' is not found')
            return {}
        if data['totalItems'] > 1:
            print('There are more than one book with ISBN:'+isbn)
        # If several volumes match, use the first one
        book_info = data['items'][0]['volumeInfo']
        result = {}
        for key in ['title', 'authors', 'description']:
            if key in book_info:
                result[key] = book_info[key]
        result['ISBN'] = isbn
        return result
    except requests.exceptions.HTTPError as errh:
        # Status 429 means we are rate-limited: wait, then retry a limited number of times
        if r.status_code == 429 and retries > 0:
            time.sleep(10)
            return get_book_info(isbn, retries - 1)
        print("Http Error:", errh)
    except requests.exceptions.ConnectionError as errc:
        print("Error Connecting:", errc)
    except requests.exceptions.Timeout as errt:
        print("Timeout Error:", errt)
    except requests.exceptions.RequestException as err:
        print("OOps: Something Else", err)
    return {}  # reached only when the request failed

get_book_info(top_reviewed[0])
#get_book_info('059035342X')

{'title': 'Wild Animus',
 'authors': ['Rich Shapero'],
 'description': 'Wild animus is a search for the primordial, a test of human foundations and a journey to the breaking point.',
 'ISBN': '0971880107'}
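
The function above pauses when the API answers with status 429 (too many requests). A more general way to handle this is to retry a bounded number of times with a growing delay. The sketch below shows that idea in isolation. The names RateLimited and call_with_retries, the default delays, and the fake fetch function are all assumptions made for illustration, and the sleep function is injectable so the example runs without actually waiting.

```python
import time

class RateLimited(Exception):
    """Hypothetical marker for an HTTP 429 response."""

def call_with_retries(fetch, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds or max_attempts is reached.

    The delay doubles after every rate-limited attempt
    (base_delay, 2 * base_delay, ...). The last failure is re-raised.
    """
    for attempt in range(max_attempts):
        try:
            return fetch()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)

# Demonstration with a fake fetch that is rate-limited twice, then succeeds.
calls = {'count': 0}

def fake_fetch():
    calls['count'] += 1
    if calls['count'] < 3:
        raise RateLimited()
    return {'title': 'Wild Animus'}

delays = []  # records the requested delays instead of sleeping
result = call_with_retries(fake_fetch, sleep=delays.append)
print(result, delays)
```

Bounding the number of attempts keeps a persistently rate-limited request from retrying forever.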

In [29]:
book_info_list = [get_book_info(isbn) for isbn in top_reviewed]

print(book_info_list[:5])            

The book with ISBN:0446672211 is not found
There are more than one book with ISBN:059035342X
There are more than one book with ISBN:0345337662
There are more than one book with ISBN:0060930535
There are more than one book with ISBN:0446605239
The book with ISBN:0156027321 is not found
There are more than one book with ISBN:0385484518
There are more than one book with ISBN:0446310786
There are more than one book with ISBN:044022165X
There are more than one book with ISBN:0440213525
The book with ISBN:0440206154 is not found
There are more than one book with ISBN:0316284955
The book with ISBN:0439064872 is not found
There are more than one book with ISBN:0060938455
There are more than one book with ISBN:0345313860
There are more than one book with ISBN:0842329129
The book with ISBN:0061009059 is not found
There are more than one book with ISBN:0345339681
There are more than one book with ISBN:080410753X
[{'title': 'Wild Animus', 'authors': ['Rich Shapero'], 'description': 'Wild animus is

Only 5 books could not be found. A number of books are also found more than once. It seems the problem is that when a book has both an ISBN-10 and an ISBN-13, it is listed in the database twice, once for each ISBN.
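
For reference, the two identifiers are related in a fixed way: an ISBN-13 is formed from an ISBN-10 by prefixing 978, dropping the old check digit, and computing a new one with alternating weights 1 and 3. The sketch below shows this conversion. The function name isbn10_to_isbn13 is an illustrative choice, and the snippet does not confirm the explanation for the duplicate hits, it only shows how the two forms of one ISBN correspond.

```python
def isbn10_to_isbn13(isbn10):
    """Convert an ISBN-10 string to ISBN-13 (978 prefix, recomputed check digit)."""
    core = '978' + isbn10[:9]  # drop the ISBN-10 check digit (which may be 'X')
    # Weights alternate 1, 3, 1, 3, ... over the twelve digits
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(core))
    check = (10 - total % 10) % 10
    return core + str(check)

print(isbn10_to_isbn13('0195153448'))
```

For the first book in the books dataframe this gives 9780195153446, which matches the ISBN_13 identifier in the Google Books response shown earlier.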

We create a dataframe containing the information that we found.

In [30]:
book_info_list = list(filter(None, book_info_list))
top_reviewed_df = pd.DataFrame(book_info_list)
print(top_reviewed_df.info())
print(top_reviewed_df.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 95 entries, 0 to 94
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   title        95 non-null     object
 1   authors      95 non-null     object
 2   description  95 non-null     object
 3   ISBN         95 non-null     object
dtypes: object(4)
memory usage: 3.1+ KB
None
                                              title          authors  \
0                                       Wild Animus   [Rich Shapero]   
1                                  The Lovely Bones   [Alice Sebold]   
2                                 The Da Vinci Code      [Dan Brown]   
3  Divine secrets of the Ya-Ya Sisterhood : a novel  [Rebecca Wells]   
4                                      The Red Tent  [Anita Diamant]   

                                         description        ISBN  
0  Wild animus is a search for the primordial, a ...  0971880107  
1  The spirit of fourteen-year-old Susie

The dataframe has 95 rows, so the five books that were not found are missing from it. Let us filter the ratings dataframe again so that it contains no ratings for books that we have no information about.

In [31]:
ISBNs = list(top_reviewed_df['ISBN'])
ratings_final_df = ratings_popular_df[ratings_popular_df['ISBN'].isin(ISBNs)]
print(ratings_final_df.info())
print(ratings_final_df.head())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 17837 entries, 80 to 1136050
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   User-ID      17837 non-null  int64 
 1   ISBN         17837 non-null  object
 2   Book-Rating  17837 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 557.4+ KB
None
     User-ID        ISBN  Book-Rating
80    276788  043935806X            7
414   276925  0385504209            8
624   276953  0446310786           10
665   276964  0440220602            9
785   277042  0971880107            2


It remains to write our dataframes to disk.

In [32]:
top_reviewed_df.to_csv(os.path.join(OUTPUT_DIR, 'popular_books_with_descriptions.csv'), index=False)
ratings_final_df.to_csv(os.path.join(OUTPUT_DIR, 'ratings_for_popular_books.csv'), index=False)