# A Librivox book recommender system

In this notebook, I will create a book recommender system using content-based filtering and the Librivox catalog. Librivox is a volunteer-run website producing audiobooks from public domain texts (https://librivox.org/).

First, I gather data using the Librivox API, clean it and prepare it. Then, I create a TF–IDF matrix and compute cosine similarities between book descriptions. Finally, recommendations for a selection of audiobooks are presented.

To begin, let's import the relevant libraries.

In [1]:
import pandas as pd # import data manipulation and analysis library
import re # import regular expression library
import requests # import HTTP/HTTPS request library

from bs4 import BeautifulSoup # parsing
from IPython.display import display, HTML # display
from sklearn.feature_extraction.text import TfidfVectorizer # TF–IDF matrix creation
from sklearn.metrics.pairwise import linear_kernel # linear kernel

The audiobook catalog is then queried through the Librivox API.

In [2]:
books = [] # create empty list of audiobook records

for offset in range(0, 1000000, 1000): # for offset 0, 1000, 2000...

    URL = 'https://librivox.org/api/feed/audiobooks/limit/1000/offset/'+str(offset)+'/format/json' # URL
    data = requests.get(URL).json() # retrieve data

    if 'books' in data: # unless no data
        books.extend(data['books']) # add records (DataFrame.append was removed in pandas 2.0)

    else: # when no more data
        break # stop iterating

df = pd.DataFrame(books) # create df from the collected records

print(df.shape) # print (rows, columns)
df.head(1) # print first row

(14069, 15)


| | authors | copyright_year | description | id | language | num_sections | title | totaltime | totaltimesecs | url_librivox | url_other | url_project | url_rss | url_text_source | url_zip_file |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | `[{'first_name': 'Alexandre', 'dod': '1870', 'l...` | 1844 | `<p><i>The Count of Monte Cristo</i> (French: <...` | 47 | English | 128 | Count of Monte Cristo | 49:43:15 | 178995 | https://librivox.org/the-count-of-monte-cristo... | | http://en.wikipedia.org/wiki/Count_of_Monte_Cr... | https://librivox.org/rss/47 | http://www.gutenberg.org/etext/1184 | http://www.archive.org/download/count_monte_cr... |
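
The loop above stops as soon as the API returns a payload without any books. A minimal sketch of the same pagination pattern, separated from the network call so that it can be tried offline, might look as follows. The names `fetch_all_books`, `fetch_page` and `fake_fetch_page` are hypothetical and introduced only for this illustration; in the notebook, `fetch_page` would wrap the `requests.get(URL).json()` call.

```python
def fetch_all_books(fetch_page, page_size=1000, max_pages=1000):
    """Collect records page by page until a page has no 'books' key.

    fetch_page is a hypothetical callable taking (limit, offset) and
    returning the decoded JSON payload as a dictionary.
    """
    books = []
    for page in range(max_pages):
        payload = fetch_page(page_size, page * page_size)
        if 'books' not in payload:  # e.g. an error payload once the catalog is exhausted
            break
        books.extend(payload['books'])
    return books


# Stand-in for the real API: 2,500 made-up records served in pages.
FAKE_CATALOG = [{'id': i, 'title': 'Book ' + str(i)} for i in range(2500)]


def fake_fetch_page(limit, offset):
    """Mimic the API: return a page of books, or an error when none are left."""
    page = FAKE_CATALOG[offset:offset + limit]
    if not page:
        return {'error': 'Audiobooks could not be found'}
    return {'books': page}


books = fetch_all_books(fake_fetch_page)
print(len(books))
```

Collecting plain dictionaries in a list and building the DataFrame once at the end avoids copying the whole table on every page.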


Next, the catalog is cleaned and reformatted.

In [3]:
df = df[df['language']=='English'] # retain audiobooks in English
df = df[df['url_librivox']!=''] # remove audiobooks not yet released
df['description'] = df['description'].str.replace('\n', '') # remove \n
df['description'] = [re.sub(r'<.*?>', '', i) for i in df.description] # remove <...> tags

authors_new = [] # create empty list

for row in df.index: # for each row

    authors = '' # create empty string

    for author_index in range(len(df['authors'][row])): # for every author in authors
        temp = df['authors'][row][author_index] # select author
        temp = str(temp['first_name']+' '+temp['last_name']+' (' + str(temp['dob']) + '-' + str(temp['dod']) + ')') # format as 'First Last (dob-dod)'
        if author_index > 0: # if more than 1 author
            authors += ', ' # add ', ' between authors
        authors += temp.strip() # add trimmed author string to string

    authors_new.append(authors) # append formatted author name(s) to list

df['authors'] = authors_new # update column values

df = df[['title', 'authors', 'description', 'url_librivox']] # retain relevant columns
df.to_csv('librivox_catalog_english.csv', index=False, encoding='utf-8-sig') # save to csv

print(df.shape) # print (rows, columns)
df.head(1) # print first row

(11721, 4)


| | title | authors | description | url_librivox |
|---|---|---|---|---|
| 0 | Count of Monte Cristo | Alexandre Dumas (1802-1870) | The Count of Monte Cristo (French: Le Comte de... | https://librivox.org/the-count-of-monte-cristo... |
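
The author formatting step can also be written as a small function and applied to the column in one pass. The sketch below is an illustration rather than the notebook's code: `format_authors` is a hypothetical name, and the field names (`first_name`, `last_name`, `dob`, `dod`) are taken from the API records used above. Unlike the loop above, it also collapses the double space left when a last name is empty.

```python
def format_authors(authors):
    """Join author records as 'First Last (dob-dod)', separated by ', '."""
    formatted = []
    for author in authors:
        # Strip so that a missing first or last name leaves no stray space
        name = (author.get('first_name', '') + ' ' + author.get('last_name', '')).strip()
        years = '(' + str(author.get('dob', '')) + '-' + str(author.get('dod', '')) + ')'
        formatted.append((name + ' ' + years).strip())
    return ', '.join(formatted)


# Sample records shaped like the API output
single = [{'first_name': 'Alexandre', 'last_name': 'Dumas', 'dob': '1802', 'dod': '1870'}]
collective = [{'first_name': '', 'last_name': 'Various', 'dob': '', 'dod': ''}]

print(format_authors(single))
print(format_authors(collective))
```

With such a function, the loop could be replaced by `df['authors'] = df['authors'].apply(format_authors)`.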


An additional column is then created, containing the audiobook description lowercased and stripped of punctuation, digits and LibriVox-specific boilerplate phrases.

In [4]:
df['description_clean'] = [re.sub(r'[^a-zA-Z\s]', '', i) for i in df.description] # retain alphabetical characters
df['description_clean'] = [i.lower() for i in df['description_clean']] # lowercase all words

librivox_noise = ["librivox volunteers bring you", "recordings of", "this was the",
                  "fortnightly", "weekly", "poetry project for"] # define librivox specific noise

for noise in librivox_noise: # for each noisy substring
    df["description_clean"] = df["description_clean"].str.replace(noise, "") # remove substring

df.reset_index(inplace=True, drop=True) # reset df index

print(df.shape) # print (rows, columns)
df.head(1) # print first row

(11721, 5)


| | title | authors | description | url_librivox | description_clean |
|---|---|---|---|---|---|
| 0 | Count of Monte Cristo | Alexandre Dumas (1802-1870) | The Count of Monte Cristo (French: Le Comte de... | https://librivox.org/the-count-of-monte-cristo... | the count of monte cristo french le comte de m... |
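
To see what this cleaning does to a single description, the same steps can be wrapped in a function and run on one string. This is a sketch under the same rules as the cell above; `clean_description` and `LIBRIVOX_NOISE` are hypothetical names, and the sample sentence is taken from a LibriVox poetry project description.

```python
import re

LIBRIVOX_NOISE = ["librivox volunteers bring you", "recordings of", "this was the",
                  "fortnightly", "weekly", "poetry project for"]


def clean_description(text, noise=LIBRIVOX_NOISE):
    """Keep letters and whitespace, lowercase, then drop noisy substrings."""
    text = re.sub(r'[^a-zA-Z\s]', '', text).lower()
    for phrase in noise:
        text = text.replace(phrase, '')
    return text


sample = "LibriVox volunteers bring you 12 recordings of Birches by Robert Frost."
cleaned = clean_description(sample)
print(cleaned.split())
```

The removed phrases leave extra spaces behind, which is harmless here because the vectorizer used later splits on word boundaries.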


Using this new column, a TF–IDF (Term Frequency–Inverse Document Frequency) matrix is created.

In this matrix, each row is an audiobook description and each column is a word or a pair of consecutive words (bigram). Each value reflects how informative that term is for that description relative to the corpus, that is, all audiobook descriptions taken together: terms that are frequent in a description but rare across the corpus score highest. Each row is also normalised to unit length (the scikit-learn default), so the length of an individual description does not influence its values.
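
As a sketch of what these values are, assuming scikit-learn's default `TfidfVectorizer` settings (`smooth_idf=True`, `sublinear_tf=False`, `norm='l2'`), the weight of term $t$ in description $d$ can be written as below. The symbols are notation introduced here: $n$ is the number of descriptions, $\mathrm{df}(t)$ the number of descriptions containing $t$, $\mathrm{tf}(t, d)$ the count of $t$ in $d$, $w_d$ the vector of weights for $d$, and $v_d$ the stored row.

```latex
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1, \qquad
w_{t,d} = \mathrm{tf}(t, d)\,\mathrm{idf}(t), \qquad
v_d = \frac{w_d}{\lVert w_d \rVert_2}
```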

In [5]:
tfidf = TfidfVectorizer( # initialise TF–IDF vectorizer
                        analyzer='word', # use word-based tokens
                        ngram_range=(1, 2), # use single words (unigrams) and bigrams
                        stop_words='english') # remove stop words

tfidf_matrix = tfidf.fit_transform(df['description_clean']) # fit and transform

print(tfidf_matrix.get_shape()) # print (rows, columns)
tfidf_matrix[0].todense() # print first row

(11721, 497218)


matrix([[0., 0., 0., ..., 0., 0., 0.]])
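
The full matrix is too large and too sparse to inspect by eye, so a toy corpus can help show what the vectorizer produces. The sketch below uses the same settings as the cell above on three made-up descriptions (the strings and the `toy_` names are invented for illustration). It checks that bigrams are built after stop words are removed and that every row has unit length.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Three made-up descriptions, for illustration only
docs = ["a haunted castle in the mountains",
        "a castle and a ghost",
        "poems about birch trees in winter"]

toy_tfidf = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), stop_words='english')
toy_matrix = toy_tfidf.fit_transform(docs)

# vocabulary_ maps each term (unigram or bigram) to its column index
has_bigram = 'haunted castle' in toy_tfidf.vocabulary_

# Euclidean length of each row of the dense matrix
row_norms = np.linalg.norm(toy_matrix.toarray(), axis=1)

print(toy_matrix.shape)
print(has_bigram)
print(np.allclose(row_norms, 1.0))
```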

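Cosine similarities between all pairs of descriptions are then computed from the TF–IDF matrix. Because each row of the matrix is already normalised to unit length, the dot product returned by linear_kernel equals the cosine similarity: a value of 1 means two descriptions use the same terms in the same proportions, and 0 means they share no terms.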

In [6]:
cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix) # compute cosine similarities

print(cosine_similarities.shape) # print (rows, columns)
cosine_similarities[0] # print first row

(11721, 11721)


array([1.        , 0.00121067, 0.01772197, ..., 0.00465328, 0.00563491,
       0.00563491])
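
Using `linear_kernel` for cosine similarity relies on the rows being unit-length vectors. A quick check on a toy corpus (made-up strings, hypothetical `toy_` names) compares it with scikit-learn's `cosine_similarity`, which normalises the rows itself:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Three made-up descriptions, for illustration only
docs = ["a haunted castle in the mountains",
        "a castle and a ghost",
        "poems about birch trees in winter"]

toy_matrix = TfidfVectorizer(stop_words='english').fit_transform(docs)

dot_products = linear_kernel(toy_matrix, toy_matrix)  # plain dot products
cosines = cosine_similarity(toy_matrix, toy_matrix)   # normalises rows, then dot products

print(np.allclose(dot_products, cosines))
print(np.allclose(np.diag(dot_products), 1.0))
```

The two castle descriptions share a term and get a positive score, while the poetry description shares none with the first and scores 0 against it.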

Now, let's define a function to visualise recommendations.

In [7]:
def visualise_recommendations(df_audiobook_id): # define function
    
    recommendation_indices = cosine_similarities[df_audiobook_id].argsort()[-2:-5:-1] # 3 most similar books, skipping the top match (the book itself)
    recommendations = df.iloc[recommendation_indices].copy() # create df
    recommendations.reset_index(inplace=True, drop=True) # reset df index
    
    for i in range(3): # for each recommendation
        data = requests.get(recommendations.url_librivox[i]) # retrieve webpage
        soup = BeautifulSoup(data.text, "lxml") # parse it
        temp_images = str(soup.find_all(class_='book-page-book-cover')) # retrieve illustration section
        temp_image = temp_images[temp_images.find('src=')+5:temp_images.find('\" width=')] # retrieve URL
        recommendations.loc[i, 'image'] = temp_image # store URL to audiobook cover
    
    for i in range(3): # for each recommendation
        if len(recommendations.loc[i, 'description']) > 500: # reformat long strings
            recommendations.loc[i, 'description'] = recommendations.loc[i, 'description'][0:500]+'...'
    
    display(HTML("\
    <table style='border: none;'>\
        <tr style='border: none;'>\
            <td style='width: 200px; border: none; vertical-align: top'>\
            <br><img src='"+recommendations.image[0]+"'><br>\
            </td>\
            <td style='width: 50px; border: none;'>\
            </td>\
            <td style='width: 600px; border: none; vertical-align: top; text-align:justify'>\
            <br><br>\
            <a href='"+recommendations.url_librivox[0]+"'>"+recommendations.title[0]+"</a>"+
            ", by "+recommendations.authors[0]+"\
            <br><br>"+recommendations.description[0]+"<br><br>\
            </td>\
        </tr>\
        <tr style='border: none;'>\
            <td style='width: 200px; border: none; vertical-align: top'>\
            <br><img src='"+recommendations.image[1]+"'><br>\
            </td>\
            <td style='width: 50px; border: none;'>\
            </td>\
            <td style='width: 600px; border: none; vertical-align: top; text-align:justify'>\
            <br><br>\
            <a href='"+recommendations.url_librivox[1]+"'>"+recommendations.title[1]+"</a>"+
            ", by "+recommendations.authors[1]+"\
            <br><br>"+recommendations.description[1]+"<br><br>\
            </td>\
        </tr>\
        <tr style='border: none;'>\
            <td style='width: 200px; border: none; vertical-align: top'>\
            <br><img src='"+recommendations.image[2]+"'><br>\
            </td>\
            <td style='width: 50px; border: none;'>\
            </td>\
            <td style='width: 600px; border: none; vertical-align: top; text-align:justify'>\
            <br><br>\
            <a href='"+recommendations.url_librivox[2]+"'>"+recommendations.title[2]+"</a>"+
            ", by "+recommendations.authors[2]+"\
            <br><br>"+recommendations.description[2]+"<br><br>\
            </td>\
        </tr>\
    </table>\
    ")) # display results in HTML format

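The slice `[-2:-5:-1]` in the function above assumes that the queried book always sorts into the last position. When another audiobook has an identical description (similarity 1.0), the order of the tie is not guaranteed and the queried book may end up among its own recommendations. A minimal sketch of a selection step that excludes the book explicitly is given below; `top_k_similar` and the toy matrix are hypothetical and only illustrate the idea.

```python
import numpy as np


def top_k_similar(similarities, item_index, k=3):
    """Return indices of the k highest-scoring items, excluding item_index."""
    scores = np.asarray(similarities[item_index], dtype=float).copy()
    scores[item_index] = -np.inf      # never recommend the item itself
    order = np.argsort(scores)[::-1]  # highest score first
    return order[:k]


# Toy similarity matrix in which items 0 and 4 are exact duplicates
toy_similarities = np.array([[1.0, 0.2, 0.9, 0.5, 1.0],
                             [0.2, 1.0, 0.1, 0.3, 0.2],
                             [0.9, 0.1, 1.0, 0.4, 0.9],
                             [0.5, 0.3, 0.4, 1.0, 0.5],
                             [1.0, 0.2, 0.9, 0.5, 1.0]])

recommended = top_k_similar(toy_similarities, 0)
print(recommended)
```

Here the duplicate (item 4) is still recommended, but item 0 itself never is.
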
The first example is a collection of short horror stories (https://librivox.org/seven-h-p-lovecraft-stories-by-h-p-lovecraft/) by American author Howard Phillips Lovecraft (1890-1937). It yields the following recommendations.

In [8]:
visualise_recommendations(6166) # call function

Lovecraft's Influences and Favorites, by Various (-) In 1927, H. P. Lovecraft wrote a long essay on "Supernatural Horror in Literature" in which he discussed the history of what came to be known as Weird Fiction. This collection includes many of the texts that Lovecraft mentioned in the essay, beginning with Edgar Allan Poe's Fall of the House of Usher, published in 1839 and ending with Walter de la Mare Seaton's Aunt from 1922. Included are 19 stories and 1 poem. - Summary by Alan Winterrowd

Willows (version 2), by Algernon Blackwood (1869-1951) "The Willows" is one of Algernon Blackwood's best known creepy stories. American horror author H.P. Lovecraft considered it to be the finest supernatural tale in English literature. He wrote in his treatise "Supernatural Horror in Literature", "Here art and restraint in narrative reach their very highest development, and an impression of lasting poignancy is produced without a single strained passage or a single false note." "The Willows" is an example of early modern horror and is connected wi...

Poems, by Leonard Cline (1893-1929) This is the first published volume of poetry by notable American journalist and author of horror stories Leonard Lanson Cline. These poems were published when Cline was only 21 years old, but the talent that would lead HP Lovecraft to admire his work is already clearly visible. - Summary by Carolin
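
The row position passed to `visualise_recommendations` (6166 above) is specific to this snapshot of the catalog and will shift whenever the catalog changes. A small lookup by title avoids hard-coding it. The sketch below is an assumption-laden illustration: `find_row_positions` is a hypothetical helper, and the toy DataFrame only borrows three titles from this notebook.

```python
import pandas as pd


def find_row_positions(catalog, title_fragment):
    """Return positional indices of rows whose title contains title_fragment."""
    mask = catalog['title'].str.contains(title_fragment, case=False, regex=False, na=False)
    return [position for position, match in enumerate(mask) if match]


toy_catalog = pd.DataFrame({'title': ['Count of Monte Cristo',
                                      'Mountain Interval',
                                      'Castle of the Carpathians']})

positions = find_row_positions(toy_catalog, 'mountain')
print(positions)
```

On the real catalog, the returned position could be passed directly to `visualise_recommendations`, since the DataFrame index was reset after cleaning.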


Then, we look at the poetry of Robert Frost (1874-1963) and more precisely at his 1916 poetry collection 'Mountain Interval' (https://librivox.org/mountain-interval-by-robert-frost-2/).

In [9]:
visualise_recommendations(10316) # call function

Birches, by Robert Frost (1874-1963) LibriVox volunteers bring you 12 recordings of Birches by Robert Frost. This was the Fortnightly Poetry project for February 21st, 2010.

New Hampshire - A Poem with Notes and Grace Notes, by Robert Frost (1874-1963) New Hampshire is a volume of poems written by Robert Frost, for which he received the Pulitzer Prize. The titular poem is the longest, and it has cross-references to 14 of the following poems. These are the "Notes" in the book title. The "Grace Notes" are the 30 final poems. Contained in this collection are some of Frost's best known works, such as "Fire and Ice", "Nothing Gold Can Stay", and "Stopping by Woods on a Snowy Evening". (Summary by TriciaG)

Hillside Thaw, by Robert Frost (1874-1963) LibriVox volunteers bring you 10 recordings of The Hillside Thaw by Robert Frost. This was the Fortnightly Poetry project for May 5th, 2013.


Finally, let's look at recommendations for the story of a mysterious castle. Written by Jules Verne (1828-1905), the book is entitled 'Castle of the Carpathians' (https://librivox.org/the-castle-of-the-carpathians-by-jules-verne/).

In [10]:
visualise_recommendations(10480) # call function

Carmilla (Version 2), by Joseph Sheridan Le Fanu (1814-1873) Laura grew up on a castle in the Austrian mountains with her father, slightly lonely as there are no potential companions around. Her loneliness is at an end when a carriage accindent close by their castle brings a mysterious visitor: Carmilla was injured in the accident, and remains at the castle to heal. But there is something dark about Carmilla. Is Laura in danger? - Summary by Carolin

Laodicean, by Thomas Hardy (1840-1928) The Laodicean (someone whose religious beliefs are “lukewarm”) of the title is Paula Power who bought the ancient castle De Stancy which she is determined to restore. Being of a modern frame of mind, she has the telegraph connected to the castle – and uses it all the time in the course of the story.George Somerset is a young architect who is invited to compete for the chance of the commission to restore the castle and who falls in love with Paula.However, the brother of Paula’s great friend Char...

Mysteries of Udolpho, by Ann Radcliffe (1764-1823) Considered a change agent in early Gothic romance; oft-referenced in later literary works or paid homage to by such authors as Jane Austen (influential novel ready by her heroine, Catherine Morland, in Northanger Abbey); Edgar Allen Poe (borrowed plot elements for the short story The Oval Portrait); and Sir Walter Scott. - In The Mysteries of Udolpho, one of the most famous and popular gothic novels of the eighteenth century, Ann Radcliffe took a new tack from her predecessors and portrayed her ...
