### Iteration -1 

The first recommendation system, built with collaborative filtering, ran into the cold-start problem during the pre-testing phase, so a new system was developed using content-based filtering. It computes the cosine similarity between the description texts of the books.

#### Data set
The data set was collected from Goodreads.com in 2021 and published on the Kaggle platform. It contains the title, author, ISBN, publication year, publisher, ratings, description, and language of each book. In addition, each book is randomly assigned to one of the pop-up libraries in the Netherlands, so that the effect of distance can be examined in the prototype.

In [1]:
##Import required libraries

## To find the user's current geocode and calculate the distance between the user and a pop-up library
import pgeocode
import geopy.distance
import geocoder

## Data manipulation
import pandas as pd 

## To calculate cosine similarity among books 
from sklearn.metrics.pairwise import linear_kernel
from sklearn.feature_extraction.text import TfidfVectorizer

## To get book information from the ISBN number via web scraping
import requests
from bs4 import BeautifulSoup
import re

## Ignore the warnings 
import warnings
warnings.filterwarnings("ignore")

In [2]:
##Import book dataset
df_books = pd.read_csv(r'books_assigned_to_lfl.csv')

In [3]:
df_books.columns

Index(['Id', 'Name', 'Authors', 'ISBN', 'Rating', 'PublishYear', 'Publisher',
       'RatingDist5', 'RatingDist4', 'RatingDist3', 'RatingDist2',
       'RatingDist1', 'RatingDistTotal', 'Description', 'new_lang',
       'LFL_index', 'links', 'latitude', 'longitude'],
      dtype='object')

To calculate the distance between two geocodes, each geocode must be a tuple of the form (latitude, longitude).
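
To illustrate why the order of the tuple matters, here is a minimal standard-library sketch of a distance calculation between two (latitude, longitude) tuples. It uses the haversine great-circle formula, which is a simplification and not the geodesic method that `geopy.distance.geodesic` applies in this notebook, so the results will differ slightly. The function name `haversine_km` and the mean Earth radius constant are assumptions made for this example; the two coordinates are Geocode values that occur in the data set.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius, assumed for this sketch


def haversine_km(point_a, point_b):
    """Great-circle distance in km between two (latitude, longitude) tuples in degrees."""
    lat1, lon1 = map(math.radians, point_a)
    lat2, lon2 = map(math.radians, point_b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


# Two library locations in (latitude, longitude) order
library_a = (52.66686, 6.4187875)
library_b = (52.3413, 4.785793)

distance_km = haversine_km(library_a, library_b)
print(round(distance_km, 1), "km")
```

Swapping latitude and longitude inside one of the tuples would describe a different point on the globe and silently produce a wrong distance, which is why the Geocode column is built in a fixed order.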

In [4]:
## Build the (latitude, longitude) tuples in one step; this avoids chained assignment row by row
df_books['Geocode'] = list(zip(df_books['latitude'], df_books['longitude']))
    
df_books

Unnamed: 0,Id,Name,Authors,ISBN,Rating,PublishYear,Publisher,RatingDist5,RatingDist4,RatingDist3,RatingDist2,RatingDist1,RatingDistTotal,Description,new_lang,LFL_index,links,latitude,longitude,Geocode
0,1800000,Last Word: Media Coverage of the Supreme Court...,Florian Sauvageau,0774812435,5.00,2005,University of British Columbia Press,1,0,0,0,0,1,media coverage supreme court canada emerged cr...,en,0,https://minibieb.nl/minibieb/de-boeken-hove-zu...,52.66686,6.418787,"(52.66686, 6.4187875)"
1,1800010,Murder on a Mystery Tour,Marian Babson,0802756689,3.20,2000,Walker & Company,21,73,121,42,10,267,i they explored every avenue seemed murder sal...,en,0,https://minibieb.nl/minibieb/de-boeken-hove-zu...,52.66686,6.418787,"(52.66686, 6.4187875)"
2,1800011,Reel Murder: A Mystery,Marian Babson,0816144923,3.56,1988,G. K. Hall & Company,14,35,25,9,3,86,reel murder a mystery,no,0,https://minibieb.nl/minibieb/de-boeken-hove-zu...,52.66686,6.418787,"(52.66686, 6.4187875)"
3,1800012,Principles of Bloodstain Pattern Analysis: The...,Stuart H. James,0849320143,4.58,2005,CRC Press,12,6,1,0,0,19,bloodstain evidence become deciding factor out...,en,0,https://minibieb.nl/minibieb/de-boeken-hove-zu...,52.66686,6.418787,"(52.66686, 6.4187875)"
4,1800013,The Encyclopedia of Crime Scene Investigation,Michael Newton,0816068151,4.00,2007,Checkmark Books,3,2,3,0,0,8,recent years brought numerous developments cri...,en,0,https://minibieb.nl/minibieb/de-boeken-hove-zu...,52.66686,6.418787,"(52.66686, 6.4187875)"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
38185,1899616,"Altri abusi: Viaggi, sonnambulismi e giri dell...",Aldo Busi,880438025X,3.00,1994,Mondadori,1,3,0,3,1,8,un viaggio del tutto particolare attraverso il...,it,334,https://minibieb.nl/minibieb/buurtboekenkastje/,52.34130,4.785793,"(52.3413, 4.785793)"
38186,1899630,"Chicago Blues / Oh, Play That Thing",Roddy Doyle,8426415776,4.60,2006,Lumeneditorial,3,2,0,0,0,5,chicago blues oh play that thing,en,334,https://minibieb.nl/minibieb/buurtboekenkastje/,52.34130,4.785793,"(52.3413, 4.785793)"
38187,1899631,Sports Immortals: Stories of Inspiration and A...,Jim Platt,1572434600,3.60,2002,Triumph Books,3,2,3,2,0,10,spanning century athletic achievement collecti...,en,334,https://minibieb.nl/minibieb/buurtboekenkastje/,52.34130,4.785793,"(52.3413, 4.785793)"
38188,1899632,Julius Caesar (Discoveries),Richard Platt,0751358932,4.12,2001,Dorling Kindersley Publishers Ltd,2,5,1,0,0,8,witness rise fall ruthless leader i came i saw...,en,334,https://minibieb.nl/minibieb/buurtboekenkastje/,52.34130,4.785793,"(52.3413, 4.785793)"


### Web scraper
Given the ISBN of a book, the web scraper collects the name, author, publication year, publisher, language, edition, description, rating, number of voters, and number of pages from the bookfinder.com website.

In [5]:
def book_adder () :
    print('Please enter ISBN number of the book:')
    isbn = input()
    global new_user_df
    new_user_df = pd.DataFrame()
    base_url = 'https://www.bookfinder.com/search/?isbn='+isbn+'&mode=isbn&st=sr&ac=qr'
    page = requests.get(base_url)
    html = BeautifulSoup(page.content, "html.parser")
    info_isbn_check = html.find_all(align = 'center')
    text_isbn_check = (str(info_isbn_check))
    regex_isbn_check = r"Sorry, we found no matching results at this time"

    matches_isbn_check = re.finditer(regex_isbn_check, text_isbn_check, re.MULTILINE)

    isbn_check = ''
    for matchNum, match in enumerate(matches_isbn_check, start=1):

        isbn_check= ("{match}".format(match = match.group()))
        print(isbn_check)

    if isbn_check == '':
        print('Processing')
        info = html.find_all(class_ = "attributes")
        text = (str(info))

        regex_name = r"\"name\">[a-zA-z\s\W]*[a-zA-z\s0-9()]*</span>"
        matches_name = re.finditer(regex_name, text, re.MULTILINE)

        for matchNum, match in enumerate(matches_name, start=1):

            name_of_book =  ("{match}".format(match = match.group()))
            name_of_book=name_of_book.replace('"name">', '')
            name_of_book=name_of_book.replace('&amp;', '&')
            name_of_book=name_of_book.replace('</span>', '')


        regex_author = r"author\">[a-zA-Z\.\s\,]*"
        matches_name = re.finditer(regex_author, text, re.MULTILINE)

        for matchNum, match in enumerate(matches_name, start=1):

            name_of_author=("{match}".format(match = match.group()))
            name_of_author=name_of_author.replace('author\">', '')
            name_of_author=name_of_author.replace('&amp;', '&')
            name_of_author=name_of_author.replace(',', ' ')

        regex_publisher_and_year = r"publisher\">[a-zA-Z\s\W]*[0-9]*"
        matches_publisher_and_year = re.finditer(regex_publisher_and_year, text, re.MULTILINE)

        for matchNum, match in enumerate(matches_publisher_and_year, start=1):
            first_match=("{match}".format(match = match.group()))
            first_match=first_match.replace('publisher\">','')
            first_match=first_match.replace('&amp;', '&')
            first_match= first_match.split(',')
            publisher = first_match[0]
            publication_year = first_match[1]
            publication_year=publication_year.replace(' ', '')

        regex_language = r'lang=\w*'
        matches_language = re.finditer(regex_language, text, re.MULTILINE)

        for matchNum, match in enumerate(matches_language):
            language=("{match}".format(match = match.group()))
            language = language.replace('lang=', '')

        regex_edution = r'bookformat\"/>[a-zA-Z]*'
        matches_edution = re.finditer(regex_edution, text, re.MULTILINE)

        for matchNum, match in enumerate(matches_edution):
            edution=("{match}".format(match = match.group()))
            edution= edution.replace('bookformat\"/>', '')

        info_desciription = html.find_all(class_ = "description")
        info_desciription = (str(info_desciription))
        if len(info_desciription) > 100:
            regex_desciription = r'itemprop=\"description\">.*\.'
            matches_description = re.finditer(regex_desciription, info_desciription, re.MULTILINE)

            for matchNum, match in enumerate(matches_description, start=1):
                description=("{match}".format(match = match.group()))
                description= description.replace('itemprop=\"description">', '')
                description= description.replace('description"><p>', '')
                description= description.replace('><p>', '')
                description= description.replace('<strong>', '')
                description= description.replace('<br/><br/>', '')
                description= description.replace('</strong></p><p>', '')
                description= description.replace('</p><p>', '')
                description= description.replace('</p', '')
                description= description.replace('<p>', '')
                description= description.replace('<br/>', '')
        else:
            description = str('No description is available')


        info_rating = html.find_all(class_ = "rating")
        info_rating = (str(info_rating))

        regex_rating = r'book-rating-average text-muted\">[0-9\.]*'
        matches_rating = re.finditer(regex_rating, info_rating, re.MULTILINE)

        for matchNum, match in enumerate(matches_rating, start=1):
            rating=("{match}".format(match = match.group()))
            rating= rating.replace('book-rating-average text-muted\">', '')


        regex_voters = r'book-rating-provider text-muted\">[0-9\s]*'
        matches_voters = re.finditer(regex_voters, info_rating, re.MULTILINE)

        for matchNum, match in enumerate(matches_voters, start=1):
            voters=("{match}".format(match = match.group()))
            voters= voters.replace('book-rating-provider text-muted\">', '')
            voters= voters.replace(' ', '')

        info_number_of_page = html.find_all(class_ = "item-note")
        info_number_of_page = (str(info_number_of_page))

        regex_number_of_pages = r'[0-9\s]*pages'
        matches_number_of_pages = re.finditer(regex_number_of_pages, info_number_of_page, re.MULTILINE)
        matches_number_of_pages_list = []
        for matchNum, match in enumerate(matches_number_of_pages, start=1):
            number_of_pages=("{match}".format(match = match.group()))
            number_of_pages= number_of_pages.replace(' ', '')
            number_of_pages= number_of_pages.replace('pages', '')
            matches_number_of_pages_list.append(number_of_pages)
        ## Create a dictionary with the scraped variables
        dict_of_book_info = {"Name": name_of_book,"Authors":name_of_author,'Publisher':publisher,
                           'ISBN':isbn, 'PublishYear':publication_year,'new_lang': language,'edution': edution,'number_of_pages':number_of_pages,
                           'avg_rating':rating, 'voters':voters,'Description':description}

        ## Create the user dataframe with the dict_of_book_info
        ## (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
        new_user_df = pd.concat([new_user_df, pd.DataFrame([dict_of_book_info])], ignore_index=True)
    else:
        print(isbn_check, 'Do you want to add the book manually?')
        manually_adder_decision = input('yes/no: ')
        if manually_adder_decision == 'yes':
            name_of_book = input('Enter the name of the book: ')
            name_of_author = input('Enter the name of the author: ')
            isbn = input('Enter the ISBN number of the book: ')
            rating = input('Enter the rating of the book: ')
            publication_year = input('Enter the publication year: ')
            publisher = input('Enter the name of the publisher: ')
            language = input('Enter the language of the book: ')
            ## No description is available for a manual entry, so the title is used instead
            description = name_of_book
            dict_of_book_info = {"Name": name_of_book, "Authors": name_of_author, 'ISBN': isbn,
                                 'Publisher': publisher, 'PublishYear': publication_year, 'new_lang': language,
                                 'avg_rating': rating, 'Description': description}

            ## DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
            new_user_df = pd.concat([new_user_df, pd.DataFrame([dict_of_book_info])], ignore_index=True)
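
In `book_adder`, each scraped field is assigned only inside a loop over its regex matches, so a field that is missing from the page leaves its variable undefined and the dictionary construction fails. The sketch below shows one possible way to make each field optional with a small helper that returns a default when nothing matches. The helper name `first_match`, the patterns, and the HTML fragment are all hypothetical and written for this example; the fragment is not the actual bookfinder.com markup.

```python
import re


def first_match(pattern, text, default=None):
    """Return the first capture group of `pattern` in `text`, or `default` if there is no match."""
    match = re.search(pattern, text)
    return match.group(1).strip() if match else default


# Hypothetical HTML fragment used only to exercise the helper
sample_html = (
    '<span itemprop="name">The Playbook</span>'
    '<span itemprop="author">Stinson, Barney</span>'
    '<span itemprop="publisher">Pocket Books, 2010</span>'
)

book_info = {
    "Name": first_match(r'"name">([^<]*)<', sample_html),
    "Authors": first_match(r'"author">([^<]*)<', sample_html),
    "Publisher": first_match(r'"publisher">([^<,]*)', sample_html),
    "PublishYear": first_match(r'"publisher">[^<]*?(\d{4})', sample_html),
    # The fragment contains no page count, so this field falls back to None
    "number_of_pages": first_match(r'(\d+)\s*pages', sample_html),
}
print(book_info)
```

With this pattern, a missing field ends up as `None` in the resulting row instead of stopping the function.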




In [6]:
book_adder()

Please enter ISBN number of the book:
9781849832496
Processing


In [7]:
new_user_df

Unnamed: 0,Name,Authors,Publisher,ISBN,PublishYear,new_lang,edution,number_of_pages,avg_rating,voters,Description
0,The Playbook,Stinson Barney,Pocket Books,9781849832496,2010,en,Softcover,,3.79,4041,From the pen of Barney Stinson comes the indis...


In [8]:
## The dataframe created by web scraping is appended to the existing dataframe.
df_bookss = pd.concat([df_books,new_user_df])
df_bookss.reset_index(drop=True,inplace=True)
df_bookss

Unnamed: 0,Id,Name,Authors,ISBN,Rating,PublishYear,Publisher,RatingDist5,RatingDist4,RatingDist3,...,new_lang,LFL_index,links,latitude,longitude,Geocode,edution,number_of_pages,avg_rating,voters
0,1800000.0,Last Word: Media Coverage of the Supreme Court...,Florian Sauvageau,0774812435,5.00,2005,University of British Columbia Press,1.0,0.0,0.0,...,en,0.0,https://minibieb.nl/minibieb/de-boeken-hove-zu...,52.66686,6.418787,"(52.66686, 6.4187875)",,,,
1,1800010.0,Murder on a Mystery Tour,Marian Babson,0802756689,3.20,2000,Walker & Company,21.0,73.0,121.0,...,en,0.0,https://minibieb.nl/minibieb/de-boeken-hove-zu...,52.66686,6.418787,"(52.66686, 6.4187875)",,,,
2,1800011.0,Reel Murder: A Mystery,Marian Babson,0816144923,3.56,1988,G. K. Hall & Company,14.0,35.0,25.0,...,no,0.0,https://minibieb.nl/minibieb/de-boeken-hove-zu...,52.66686,6.418787,"(52.66686, 6.4187875)",,,,
3,1800012.0,Principles of Bloodstain Pattern Analysis: The...,Stuart H. James,0849320143,4.58,2005,CRC Press,12.0,6.0,1.0,...,en,0.0,https://minibieb.nl/minibieb/de-boeken-hove-zu...,52.66686,6.418787,"(52.66686, 6.4187875)",,,,
4,1800013.0,The Encyclopedia of Crime Scene Investigation,Michael Newton,0816068151,4.00,2007,Checkmark Books,3.0,2.0,3.0,...,en,0.0,https://minibieb.nl/minibieb/de-boeken-hove-zu...,52.66686,6.418787,"(52.66686, 6.4187875)",,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
38186,1899630.0,"Chicago Blues / Oh, Play That Thing",Roddy Doyle,8426415776,4.60,2006,Lumeneditorial,3.0,2.0,0.0,...,en,334.0,https://minibieb.nl/minibieb/buurtboekenkastje/,52.34130,4.785793,"(52.3413, 4.785793)",,,,
38187,1899631.0,Sports Immortals: Stories of Inspiration and A...,Jim Platt,1572434600,3.60,2002,Triumph Books,3.0,2.0,3.0,...,en,334.0,https://minibieb.nl/minibieb/buurtboekenkastje/,52.34130,4.785793,"(52.3413, 4.785793)",,,,
38188,1899632.0,Julius Caesar (Discoveries),Richard Platt,0751358932,4.12,2001,Dorling Kindersley Publishers Ltd,2.0,5.0,1.0,...,en,334.0,https://minibieb.nl/minibieb/buurtboekenkastje/,52.34130,4.785793,"(52.3413, 4.785793)",,,,
38189,1899642.0,Tampa Burn,Randy Wayne White,0786267216,4.05,2004,Thorndike Press,768.0,982.0,497.0,...,en,334.0,https://minibieb.nl/minibieb/buurtboekenkastje/,52.34130,4.785793,"(52.3413, 4.785793)",,,,


In [9]:
df_bookss.iloc[-1]

Id                                                               NaN
Name                                                    The Playbook
Authors                                              Stinson  Barney
ISBN                                                   9781849832496
Rating                                                           NaN
PublishYear                                                     2010
Publisher                                               Pocket Books
RatingDist5                                                      NaN
RatingDist4                                                      NaN
RatingDist3                                                      NaN
RatingDist2                                                      NaN
RatingDist1                                                      NaN
RatingDistTotal                                                  NaN
Description        From the pen of Barney Stinson comes the indis...
new_lang                          

### Calculating the Cosine Similarity of the books

The similarity between the descriptions of the books will be calculated using cosine similarity.

In [10]:
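## min_df=1 keeps every term, as min_df=0 did; recent scikit-learn versions reject an integer min_df below 1
tf = TfidfVectorizer(analyzer='word',
                     ngram_range=(1, 2),
                     min_df=1)

tfidf_matrix = tf.fit_transform(df_bookss['Description'])

## TF-IDF rows are L2-normalised by default, so their dot product (linear_kernel)
## equals the cosine similarity of the book descriptions
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)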
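
The cell above relies on the fact that the dot product of two unit-length vectors is their cosine similarity. The following toy sketch, written with the standard library only, shows this on three short descriptions. It is an illustration under simplifying assumptions and not the scikit-learn implementation: it tokenises on whitespace and uses unigrams only (the notebook uses unigrams and bigrams), and the helper names `tfidf_vectors` and `dot` are invented for this example. The IDF weighting follows the smoothed form ln((1 + n) / (1 + df)) + 1.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one L2-normalised TF-IDF dictionary per document (toy version)."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: number of documents in which each term occurs
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        term_freq = Counter(tokens)
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in term_freq.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        # Scale to unit length; an empty document stays an empty vector
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors


def dot(u, v):
    """Dot product of two sparse vectors stored as dictionaries."""
    return sum(weight * v.get(term, 0.0) for term, weight in u.items())


docs = ["murder mystery tour", "reel murder mystery", "crime scene investigation"]
vecs = tfidf_vectors(docs)

sim_01 = dot(vecs[0], vecs[1])  # the two descriptions share "murder" and "mystery"
sim_02 = dot(vecs[0], vecs[2])  # no shared terms
print(sim_01, sim_02)
```

Because every vector has length 1, no division by the vector norms is needed at comparison time, which is why a plain linear kernel is sufficient.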

In [11]:
## Create a dataframe with Name, Authors, ISBN, PublishYear, Publisher, new_lang, Description, and Geocode
titles = df_bookss[['Name', 'Authors','ISBN', 'PublishYear', 'Publisher','new_lang','Description','Geocode']]

## Map each ISBN to the row position of the book
indices = pd.Series(df_bookss.index, index=df_bookss['ISBN'])

In [13]:
def get_content_recommendations(ISBN):
    user_name = input('Please enter your name:')
    print('Welcome', user_name)
    nomi = pgeocode.Nominatim('nl')
    postal_code = input('Please enter your current postal(zip) code')
    ### Look up the user's coordinates from the postal (zip) code
    location = nomi.query_postal_code(str(postal_code))
    latitude = float(location['latitude'])
    longitude = float(location['longitude'])
    
    # Handle the case in which the same ISBN occurs more than once in the dataset
    idx = indices[ISBN]
    if isinstance(idx, pd.Series):
        idx = idx.iloc[0]
    
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:]
    book_indices = [i[0] for i in sim_scores]
    # Work on a fresh frame so that adding a column does not write into `titles`
    language_based_list = titles.iloc[book_indices].reset_index(drop=True)
    coords_user = (latitude, longitude)
    # Geodesic distance in km between the user and the library that holds each book
    language_based_list['distance'] = [
        geopy.distance.geodesic(coords_user, each).km
        for each in language_based_list.Geocode
    ]
    language_based_list_cols = ['Name','Authors','new_lang','distance']
    language_based_list = language_based_list[language_based_list_cols]
    sorted_recommendations = language_based_list.sort_values(by='distance').head(5)
    print('1: Distance')
    print('2: Similarity')
    user_input = input('How do you want to sort your recommendations?')
    if user_input == '1':
        print('####### Nearest books ##########')
        print(sorted_recommendations)
    else:
        print('####### Most Similar books ##########')
        print(language_based_list.head(5))
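
The function above sorts the similarity scores and drops the first entry, which assumes that the query book always ranks first. If another book has an identical description, the tie can leave the query book in the list. The sketch below shows one possible alternative that excludes the query book by its index. The function name `rank_by_similarity` and the example scores are hypothetical and only serve the illustration.

```python
def rank_by_similarity(similarities, query_index, top_n=5):
    """Return (index, score) pairs sorted by score, excluding the query book itself."""
    candidates = [
        (index, score)
        for index, score in enumerate(similarities)
        if index != query_index
    ]
    # sorted() is stable, so books with equal scores keep their original order
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates[:top_n]


# Hypothetical similarity row: the book at index 1 has the same description as the query book at index 3
scores = [0.1, 1.0, 0.4, 1.0, 0.0]
top_books = rank_by_similarity(scores, query_index=3, top_n=3)
print(top_books)
```

In this example, dropping the first sorted entry would remove the book at index 1 and keep the query book, whereas filtering by index keeps the duplicate and removes the query book.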
                                  

In [14]:
title = df_bookss.iloc[-1].Name
desc = df_bookss.iloc[-1].Description
ISBN = df_bookss.iloc[-1].ISBN
author = df_bookss.iloc[-1].Authors
year = df_bookss.iloc[-1].PublishYear
lang = df_bookss.iloc[-1].new_lang
print("Title:", title, "\nISBN:", ISBN, "\nDescription:", desc, "\nAuthor:", author, "\nYear:", year,"\nLang",lang)

Title: The Playbook 
ISBN: 9781849832496 
Description: From the pen of Barney Stinson comes the indispensable guide for every brother looking to score with the ladies. Featuring great plays from Barney Stinson's secret playbook of legendary moves, this book schools the reader in awsomeness. 
Author: Stinson  Barney 
Year: 2010 
Lang en


In [15]:
get_content_recommendations(ISBN)

Please enter your name:Firat
Welcome Firat
Please enter your current postal(zip) code3584
1: Distance
2: Similarity
How do you want to sort your recommendations?1
####### Nearest books ##########
                                                    Name           Authors  \
32183                                     Clay (Threads)   Annabelle Dixon   
8107   The Angry American: How Voter Rage Is Changing...  Susan J. Tolchin   
32177                                         God's Gift     Dee Henderson   
32178                       Wow! It's Great Being a Duck       Joan Rankin   
19871                                        Scaredy Cat       Joan Rankin   

      new_lang  distance  
32183       en  0.338289  
8107        en  0.338289  
32177       en  0.338289  
32178       en  0.338289  
19871       en  0.338289  


In [16]:
get_content_recommendations(ISBN)


Please enter your name:Firat
Welcome Firat
Please enter your current postal(zip) code3584
1: Distance
2: Similarity
How do you want to sort your recommendations?2
####### Most Similar books ##########
                                         Name                Authors new_lang  \
0                             Barney K. Riggs          Ellis Lindsey       no   
1  Barney Bear's Pizza Shop Super (Look-Look)          Larry Difiori       en   
2                          Hooray For Babies!  Maureen M. Valvassori       en   
3                                       Plays         George Chapman       tl   
4                        The Options Playbook           Brian Overby       en   

    distance  
0  37.069287  
1   157.9439  
2  37.050403  
3  37.088982  
4  37.725391  


#### Insights

Although the participant's book is in English, the recommendation system also recommends non-English books. Participants noticed this behaviour and did not like it. In addition, because the data set covers all pop-up libraries in the Netherlands, books that are far away are recommended as well. Since the participant has to travel to the library to pick up a book, participants did not find recommendations for distant books (for example, more than 100 km away) useful.
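
Both observations could be addressed by filtering the candidate list before it is shown. The sketch below is one possible approach and not part of the prototype evaluated here: it keeps only books in the language of the participant's own book and within a maximum distance. The function name `filter_recommendations` and the 100 km threshold are assumptions for this example; the records reuse the names, languages, and distances from the similarity-sorted output above.

```python
def filter_recommendations(candidates, language, max_distance_km):
    """Keep candidates in the given language that are no farther away than max_distance_km."""
    return [
        book
        for book in candidates
        if book["new_lang"] == language and book["distance"] <= max_distance_km
    ]


# Records taken from the similarity-sorted output, already ordered by similarity
candidates = [
    {"Name": "Barney K. Riggs", "new_lang": "no", "distance": 37.069287},
    {"Name": "Barney Bear's Pizza Shop Super (Look-Look)", "new_lang": "en", "distance": 157.9439},
    {"Name": "Hooray For Babies!", "new_lang": "en", "distance": 37.050403},
    {"Name": "Plays", "new_lang": "tl", "distance": 37.088982},
    {"Name": "The Options Playbook", "new_lang": "en", "distance": 37.725391},
]

# The 100 km limit is an assumed threshold, chosen to match the example in the text
nearby_english = filter_recommendations(candidates, language="en", max_distance_km=100)
print([book["Name"] for book in nearby_english])
```

Because the filter preserves the input order, the remaining books stay sorted by similarity.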