# Interactive EDA of the *Best Books Ever* 
An interactive notebook for readers and authors.

## Importing Data

This notebook uses data found on Goodreads, a social platform that caters to avid readers. 

Specifically, the dataset can be found on <a href="https://www.kaggle.com/austinreese/goodreads-books" title="Kaggle">Kaggle</a>. It covers the <a href="https://www.goodreads.com/list/show/1.Best_Books_Ever">Best Books Ever</a> list, as voted on by the general Goodreads community. 
Started in 2008, this is the most popular book list on the website, with more than 55,000 books and more than 200,000 different voters.

In [1]:
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
from plotly.offline import iplot, init_notebook_mode
from ipywidgets import interact, fixed
import ipywidgets as widgets
from io import BytesIO
import re
from re import sub
from PIL import Image
import textwrap
import urllib.request
from wordcloud import WordCloud, ImageColorGenerator
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
from nltk.tokenize import word_tokenize


[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /usr/share/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [2]:
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', lambda x: '%.2f' % x)

In [3]:
df = pd.read_csv("../input/goodreads-books/goodreads_books.csv")

## Exploring the *Best Books Ever* 


In [4]:
print(df.columns)
print('Shape:', df.shape)

Index(['id', 'title', 'link', 'series', 'cover_link', 'author', 'author_link',
       'rating_count', 'review_count', 'average_rating', 'five_star_ratings',
       'four_star_ratings', 'three_star_ratings', 'two_star_ratings',
       'one_star_ratings', 'number_of_pages', 'date_published', 'publisher',
       'original_title', 'genre_and_votes', 'isbn', 'isbn13', 'asin',
       'settings', 'characters', 'awards', 'amazon_redirect_link',
       'worldcat_redirect_link', 'recommended_books', 'books_in_series',
       'description'],
      dtype='object')
Shape: (52199, 31)


In [5]:
keep_cols = ['title', 'series', 'author', 'rating_count', 'review_count', 
             'average_rating', 'five_star_ratings', 'four_star_ratings', 
             'three_star_ratings', 'two_star_ratings', 'one_star_ratings', 
             'number_of_pages', 'date_published', 'publisher',
             'genre_and_votes', 'awards', 'books_in_series', 'description']

In [6]:
df = df[keep_cols]

In [7]:
df.describe()

| | rating_count | review_count | average_rating | five_star_ratings | four_star_ratings | three_star_ratings | two_star_ratings | one_star_ratings | number_of_pages |
|---|---|---|---|---|---|---|---|---|---|
| count | 52199.0 | 52199.0 | 52199.0 | 52199.0 | 52199.0 | 52199.0 | 52199.0 | 52199.0 | 49869.0 |
| mean | 18873.61 | 1012.98 | 4.02 | 7817.18 | 6250.78 | 3456.51 | 935.5 | 413.64 | 328.94 |
| std | 116397.83 | 4054.8 | 0.37 | 58763.73 | 34735.33 | 18249.3 | 5890.08 | 3843.36 | 252.79 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 340.0 | 31.0 | 3.82 | 117.0 | 109.0 | 63.0 | 15.0 | 6.0 | 210.0 |
| 50% | 2295.0 | 163.0 | 4.03 | 810.0 | 765.0 | 452.0 | 107.0 | 36.0 | 304.0 |
| 75% | 9297.5 | 622.0 | 4.23 | 3375.5 | 3190.5 | 1866.0 | 450.0 | 151.0 | 392.0 |
| max | 6801077.0 | 169511.0 | 5.0 | 4414877.0 | 1868421.0 | 980183.0 | 529060.0 | 537793.0 | 14777.0 |


## Data Cleaning


Unfortunately, the publication dates come in mixed formats. We extract the year where possible and drop the remaining rows. 

In [8]:
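df.dropna(subset=['date_published'], inplace = True)
# Find every run of four digits in the date string (raw string avoids an invalid escape)
df['year'] = [re.findall(r'(\d{4})', x) for x in df['date_published']]
# Keep the first match as the year; rows without a match become missing
df['year'] = df['year'].apply(lambda x: int(x[0]) if x else None)
df.dropna(subset=['year'], inplace = True)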

In [9]:
df['year']

0       2007.00
1       2006.00
2       2009.00
3       1997.00
4       1995.00
          ...  
52194   2007.00
52195   2014.00
52196   1961.00
52197   2007.00
52198   2009.00
Name: year, Length: 51190, dtype: float64
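
As a small illustration of the extraction step, the sketch below applies the same four-digit pattern to a few made-up date strings (the sample values are assumptions, not rows from the dataset). It uses `Series.str.extract`, one possible alternative to the list comprehension above that also keeps the first match.

```python
import pandas as pd

# Hypothetical date strings in mixed formats (not taken from the dataset)
sample = pd.Series(["January 1st 2007", "1997", "October 29th 2006", "Published long ago"])

# Capture the first run of four digits; rows without a match become NaN
years = sample.str.extract(r"(\d{4})", expand=False).astype(float)

# Keep only the rows where a year was found
years = years.dropna()
print(years.tolist())
```

Note that a pattern this simple treats any four consecutive digits as a year, so it is only as reliable as the date strings themselves.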

We will remove the parentheses from book series values, for cleaner text.

In [10]:
df['series'] = df['series'].str.strip('()')
df.head()

Unnamed: 0,title,series,author,rating_count,review_count,average_rating,five_star_ratings,four_star_ratings,three_star_ratings,two_star_ratings,one_star_ratings,number_of_pages,date_published,publisher,genre_and_votes,awards,books_in_series,description,year
0,Inner Circle,Private #5,"Kate Brian, Julian Peploe",7597,196,4.03,3045,2323,1748,389,92,220.0,January 1st 2007,Simon Schuster Books for Young Readers,"Young Adult 161, Mystery 45, Romance 32",,"381489, 381501, 352428, 630103, 1783281, 17832...",Reed Brennan arrived at Easton Academy expecti...,2007.0
1,A Time to Embrace,Timeless Love #2,Karen Kingsbury,4179,177,4.35,2255,1290,518,93,23,400.0,October 29th 2006,Thomas Nelson,"Christian Fiction 114, Christian 45, Fiction 3...",,115036,"Ideje az ölelésnek Történet a reményről,...",2006.0
2,Take Two,Above the Line #2,Karen Kingsbury,6288,218,4.23,3000,2020,1041,183,44,320.0,January 1st 2009,Zondervan,"Christian Fiction 174, Christian 81, Fiction 58",,"4010795, 40792877, 7306261",Filmmakers Chase Ryan and Keith Ellison have c...,2009.0
3,Reliquary,Pendergast #2,"Douglas Preston, Lincoln Child",38382,1424,4.01,12711,15407,8511,1429,324,464.0,1997,Tor Books,"Thriller 626, Mystery 493, Horror 432, Fiction...",,"67035, 39031, 39033, 136637, 136638, 30068, 39...",,1997.0
4,The Millionaire Next Door: The Surprising Secr...,,"Thomas J. Stanley, William D. Danko",72168,3217,4.04,27594,25219,14855,3414,1086,258.0,October 28th 1995,Gallery Books,"Economics-Finance 1162, Nonfiction 910, Busine...",Independent Publisher Book Award (IPPY) Nomine...,,The incredible national bestseller that is cha...,1995.0


We assume that entries with null values in the column books_in_series have no other books in their series. They are therefore classified as Standalones (i.e. a series of only one book).

In [11]:
# Count the other books listed for each entry (0 when missing), then add the book itself
df['books_in_series'] = [len(x.split(',')) if pd.notna(x) else 0
                         for x in df['books_in_series']]
df['books_in_series'] += 1

In [12]:
di = {1: 'Standalone', 2: 'Duology', 3: 'Trilogy'}
df['series_type'] = df['books_in_series'].map(di).fillna('Multiple Books')
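
To make the counting and labelling concrete, here is a minimal sketch on made-up values. It assumes, as the code above does, that books_in_series lists the ids of the other books in the series; the sample ids are hypothetical.

```python
import pandas as pd

# Hypothetical values: comma-separated ids of the other books in the series, or missing
others = pd.Series(["381489, 381501", None, "115036", "1, 2, 3, 4"])

# Count the listed ids (0 when missing) and add one for the book itself
n_books = others.apply(lambda x: len(x.split(",")) if pd.notna(x) else 0) + 1

# Name the common series lengths; anything longer falls back to 'Multiple Books'
labels = {1: "Standalone", 2: "Duology", 3: "Trilogy"}
series_type = n_books.map(labels).fillna("Multiple Books")
print(series_type.tolist())
```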

We also remove the year in which each award was given. Multiple awards are separated into a list later, in the award-based recommendation section.

In [13]:
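# Remove the parenthesised year; regex=True states explicitly that the pattern is a regex
# Note: the greedy pattern deletes everything from the first '(' to the last ')'
df['awards'] = df['awards'].str.replace(r"\(.*\)", "", regex=True)
awards = df['awards'].value_counts().index.tolist()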
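
The pattern used for the awards is greedy, which matters when a string holds more than one award. The sketch below compares it with a non-greedy variant on a made-up string (the award names are placeholders, not values from the dataset).

```python
import re

# Hypothetical awards string with two awards, each followed by a year
raw = "Award A (2001), Award B Nominee (2002)"

# Greedy: `.*` runs from the first "(" to the last ")", swallowing the second award
greedy = re.sub(r"\(.*\)", "", raw)

# Non-greedy: `.*?` stops at the nearest ")", so only the years are removed
lazy = re.sub(r"\(.*?\)", "", raw)

print(repr(greedy))
print(repr(lazy))
```

Which behaviour is preferable depends on the goal: the greedy form effectively keeps only the first award of each book, while the non-greedy form keeps all of them.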



Similarly, we also separate the authors.

In [14]:
df['author'] = [[idx for idx in x.split(',')] for x in df['author']]
df['author']

0                         [Kate Brian,  Julian Peploe]
1                                    [Karen Kingsbury]
2                                    [Karen Kingsbury]
3                    [Douglas Preston,  Lincoln Child]
4               [Thomas J. Stanley,  William D. Danko]
                             ...                      
52194                                     [Sylvia Day]
52195                                  [Marina Keegan]
52196                                  [Karl Bruckner]
52197                                     [Kate Brian]
52198    [Sarah Palin,  Lynn Vincent,  Dewey Whetsell]
Name: author, Length: 51190, dtype: object
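
One detail worth noting: splitting on "," alone keeps the space that follows each comma, so a co-author can appear under two spellings. The sketch below shows the effect on counts with made-up rows, and one possible fix; whether this changes any count reported later has not been checked here.

```python
import pandas as pd

# Hypothetical author strings in the "Name, Name" form shown above
authors = pd.Series(["Kate Brian, Julian Peploe", "Kate Brian", "Julian Peploe"])

# Plain split keeps the leading space: " Julian Peploe" != "Julian Peploe"
plain = authors.str.split(",").explode().value_counts()

# Stripping each name after the split merges the two spellings
stripped = authors.str.split(",").explode().str.strip().value_counts()

print(plain.to_dict())
print(stripped.to_dict())
```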

Lastly, we extract the genre names for each book and discard the vote counts.

In [15]:
df.dropna(subset=['genre_and_votes'], inplace = True)
# Strip the vote counts (a regex), then the literal 'user' token and all spaces
df['genre_and_votes'] = df['genre_and_votes'].str.replace(r'\d+', '', regex=True)
df['genre_and_votes'] = df['genre_and_votes'].str.replace('user', '', regex=False)
df['genre_and_votes'] = df['genre_and_votes'].str.replace(' ', '', regex=False)
# Split into a list of genres, dropping any trailing hyphen left by the cleanup
df['genre_and_votes'] = [[idx[:-1] if idx.endswith('-') else idx for idx in x.split(',')] 
                         for x in df['genre_and_votes']]



In [16]:
df.rename(columns = {'genre_and_votes' : 'genre'}, inplace = True)
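
The genre cleanup can be followed step by step on single strings. This is a minimal sketch with a hypothetical helper, `parse_genres`; the first two inputs mirror the format shown in the table above, while the "1user" token in the third is an assumed example of what the 'user' replacement targets.

```python
import re

def parse_genres(raw):
    """Turn a 'Genre votes, Genre votes' string into a list of genre names."""
    cleaned = re.sub(r"\d+", "", raw)      # drop the vote counts
    cleaned = cleaned.replace("user", "")  # drop the literal 'user' token
    cleaned = cleaned.replace(" ", "")     # drop all spaces
    # Remove a trailing hyphen if the cleanup left one behind
    return [g[:-1] if g.endswith("-") else g for g in cleaned.split(",")]

print(parse_genres("Young Adult 161, Mystery 45, Romance 32"))
print(parse_genres("Economics-Finance 1162, Nonfiction 910"))
print(parse_genres("Fantasy 1user"))
```

Because every space is removed, multi-word genres are collapsed into one token, which is why names such as YoungAdult appear in the results.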

## Are these the *Best* Books Ever? 
A book can be judged on several fairly objective elements, such as engaging writing, pacing, and absorbing storytelling. However, what constitutes the *best book ever written* is highly subjective, so we ought to examine whether these books are as good as advertised. 

To do that, we examine the star ratings given to the books. We group the books by publication year to see how ratings are distributed across the star categories for each year. 

In [17]:
df_ratings = df.groupby('year') \
        .agg({'rating_count' : 'sum', 
              'five_star_ratings' : 'sum', 
              'four_star_ratings': 'sum', 
              'three_star_ratings': 'sum', 
              'two_star_ratings': 'sum',
              'one_star_ratings': 'sum'}) \
        .reset_index()

Not all years have had books with many ratings, so we narrow down our search to the years since 1800.

In [18]:
# .copy() so the percentage columns are written to an independent frame, not a slice
df_ratings = df_ratings[df_ratings.year>1800].copy()
cats = ['five_star_ratings', 'four_star_ratings', 
        'three_star_ratings', 'two_star_ratings',
        'one_star_ratings']
# show rating categories as a percentage
for cat in cats: 
    df_ratings[cat] = df_ratings[cat]/df_ratings['rating_count']*100
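
The percentage step can also be written without a loop. The sketch below uses made-up yearly totals (not values from the dataset) and additionally computes the combined share of 4- and 5-star ratings, which is one way to quantify how favourable the ratings are.

```python
import pandas as pd

# Hypothetical yearly totals (not taken from the dataset)
totals = pd.DataFrame({
    "year": [2000, 2001],
    "rating_count": [1000, 500],
    "five_star_ratings": [400, 150],
    "four_star_ratings": [350, 200],
    "three_star_ratings": [150, 100],
    "two_star_ratings": [70, 30],
    "one_star_ratings": [30, 20],
})

star_cols = [c for c in totals.columns if c.endswith("_star_ratings")]

# Divide every star column by the row's total rating count in one step
shares = totals[star_cols].div(totals["rating_count"], axis=0) * 100

# Combined share of 4- and 5-star ratings per year
favourable = shares["five_star_ratings"] + shares["four_star_ratings"]
print(favourable.round(1).tolist())
```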

In [19]:
fig_1 = go.Figure()

fig_1.add_trace(
    go.Scatter(x=list(df_ratings.year), 
               y=list(df_ratings.five_star_ratings), name = '★★★★★'))
fig_1.add_trace(
    go.Scatter(x=list(df_ratings.year), 
               y=list(df_ratings.four_star_ratings), name = '★★★★'))
fig_1.add_trace(
    go.Scatter(x=list(df_ratings.year), 
               y=list(df_ratings.three_star_ratings), name = '★★★'))
fig_1.add_trace(
    go.Scatter(x=list(df_ratings.year), 
               y=list(df_ratings.two_star_ratings),name = '★★'))
fig_1.add_trace(
    go.Scatter(x=list(df_ratings.year), 
               y=list(df_ratings.one_star_ratings), name = '★'))

# Set title
fig_1.update_layout(
    title_text="Average rating distribution by star category per publishing year (since 1800)",
    legend_title="Stars"
)

# Add range slider
fig_1.update_layout(
    xaxis=dict(
        rangeselector=dict(
            buttons=list([
                dict(count=21,
                     label="21y",
                     step="year",
                     stepmode="backward"),
                dict(count=100,
                     label="100y",
                     step="year",
                     stepmode="backward"),
                dict(step="all")
            ])
        ),
        rangeslider=dict(
            visible=True
        ),
        type="date"
    )
)

fig_1.show()

The diagram above shows that, in most years, 5-star ratings outnumber 4-star ratings, which in turn outnumber 3-star ratings, and so on. That is a good sign for the quality of the books. 

Furthermore, 5- and 4-star ratings together account for roughly 70% of the distribution. In other words, about seven in ten ratings given to the books of each year are favourable. 

This suggests that the dataset, at the bare minimum, contains really good books. 

## Top genres & most prolific, beloved authors
Among the many genres, as well as authors, let's see the top 10 in the dataset. 

In [20]:
genres = df['genre'].explode().value_counts().index.tolist()
genres[:10]

['Fiction',
 'Romance',
 'Fantasy',
 'YoungAdult',
 'Nonfiction',
 'Contemporary',
 'Historical-HistoricalFiction',
 'Mystery',
 'Classics',
 'Fantasy-Paranormal']
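
For readers unfamiliar with the explode and value_counts combination used here, a minimal sketch on a made-up frame (titles and genres are hypothetical):

```python
import pandas as pd

# Hypothetical books, each with a list of genres
books = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genre": [["Fiction", "Fantasy"], ["Fiction"], ["Romance", "Fiction"]],
})

# explode() gives one row per (book, genre) pair; value_counts() then tallies the genres
counts = books["genre"].explode().value_counts()
print(counts)
```

Each book therefore contributes once to every genre it is tagged with, so the counts add up to more than the number of books.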

In [21]:
print('In total we have', len(genres), 'genres.')

In total we have 856 genres.


The 856 genres obviously include many sub-genres that apply to only a few books. However, that doesn't concern us at present. 

In [22]:
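# Top 10 authors and genres by number of books; the explicit column names
# ('index' for the label, then the count) do not depend on the pandas version
auth = (df['author'].explode().value_counts()
        .rename_axis('index').reset_index(name='author'))
auth = auth[:10]
gen = (df['genre'].explode().value_counts()
       .rename_axis('index').reset_index(name='genre'))
gen = gen[:10]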

In [23]:
# Setting the visualization parameters
fig_2 = make_subplots(rows=1, cols=2,
                      specs=[[{'type': 'xy'}, {"type": "xy"}]],
                      subplot_titles=("Top 10 most popular genres", 
                                      "Top 10 most popular authors"))
# Setting Bar parameters
fig_2.add_trace(go.Bar(x=gen['index'], 
                       y=gen['genre'],
                       name ='Books',
                       marker_color=px.colors.sequential.Plasma),
                       row=1, col=1)
# Setting Bar parameters
fig_2.add_trace(go.Bar(x=auth['index'], 
                       y=auth['author'],
                       name ='Books',
                       marker_color=px.colors.sequential.Plotly3),
                       row=1, col=2)
# Setting the parameters of the chart when displaying
fig_2.update_traces(marker_line_width=0)

# Setting the parameters of the chart when displaying
fig_2.update_layout(showlegend=False, 
                    plot_bgcolor='rgba(0,0,0,0)',
                    font=dict(family='Arial', 
                              size=12, 
                              color='black'))

# Displaying the graph
fig_2.show()

As we see, the *Best Books Ever* dataset is dominated by Fiction, Romance and Fantasy. 

Concerning the authors, Stephen King and Nora Roberts are tied for first place, each with 94 different books that the public regards as among the "best books". That is an incredible compliment to both authors. 

## Award-based Book Recommendation
Of the best books ever, nearly 20% have won or been nominated for at least one award. For readers who pick their next read based on awards, we create a treemap of the awards, authors, series, book titles, and descriptions.

In [24]:
# .copy() gives an independent frame, so the assignments below do not touch a slice of df
awarded = df.dropna(subset=['awards']).copy()
# Split the awards string into a list of individual awards
awarded['awards'] = [x.split(',') for x in awarded['awards']]
awarded = awarded[['awards','author','series','title','description']].copy()
awarded = awarded.dropna(subset=['description'])
# Wrap long descriptions so they fit inside the treemap labels
awarded['description'] = (awarded['description']
                          .apply(lambda s: '<br>'.join(textwrap.wrap(s, width=120))))
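
The description wrapping relies on the standard-library textwrap module, with the wrapped lines joined by the `<br>` tag that Plotly renders as a line break in text labels. A minimal sketch on filler text (the description is made up):

```python
import textwrap

# Hypothetical long description: 200 characters of filler text
description = "word " * 40

# Break the text into lines of at most 120 characters and join them with HTML line breaks
wrapped = "<br>".join(textwrap.wrap(description, width=120))

lines = wrapped.split("<br>")
print(len(lines), [len(line) for line in lines])
```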



In [25]:
awarded['author_joined'] = awarded.author.str.join(",")
# Assign the result back instead of filling in place on a column selection
awarded['series'] = awarded['series'].fillna('Standalone')

In [26]:
def makingtreemap(Award, frame):
    """Draw a treemap of the works in `frame` that hold the selected award."""
    title = 'Treemap of awarded works & their authors in '
    # Keep the rows whose award list contains the selected award
    mask = [Award in x for x in frame['awards']]
    newframe = frame[mask].astype(str)
    fig = px.treemap(newframe, 
                     path=[px.Constant(Award), 'author_joined','series','title','description'],
                     color='author_joined',
                     color_continuous_scale='Purples',
                    title = title + Award)
    fig.update_layout(margin = dict(t=50, l=25, r=25, b=25))
    fig.show()

In [27]:
interact(makingtreemap, Award=awards,
         frame=fixed(awarded))

interactive(children=(Dropdown(description='Award', options=('All About Romance ', 'Hugo Award Nominee for Bes…

<function __main__.makingtreemap(Award, frame)>

## Most rated Publishing houses by genre
Apart from awards, we also explore which publishers are the best in each genre. 

To define the *best* publishers, we rank them by the total number of ratings their books received (and, by extension, by how widely their books are read), and we use colour to show the books' average rating. 

In [28]:
df_grouped = df.groupby('publisher') \
       .agg({'title':'count', 
             'average_rating':'mean', 
             'rating_count': 'sum'}) \
       .reset_index()

In [29]:
df_popular = df_grouped.sort_values(by=['rating_count'], ascending=False)[:10]
df_popular

| | publisher | title | average_rating | rating_count |
|---|---|---|---|---|
| 6335 | Penguin Books | 493 | 3.96 | 25439700 |
| 5004 | Little, Brown and Company | 177 | 3.88 | 21155025 |
| 8740 | Vintage | 569 | 3.94 | 14314933 |
| 7402 | Scholastic Press | 162 | 4.09 | 14186630 |
| 7427 | Scribner | 165 | 3.91 | 13617858 |
| 6357 | Penguin Classics | 263 | 3.94 | 13457344 |
| 846 | Bantam | 373 | 3.97 | 13307327 |
| 824 | Ballantine Books | 395 | 3.96 | 12543216 |
| 3754 | HarperCollins | 559 | 4.07 | 11690798 |
| 3407 | Grand Central Publishing | 261 | 3.95 | 11158290 |
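
The grouping behind this ranking can be sketched on a tiny made-up frame. The publishers, numbers, and output column names below are assumptions for illustration; named aggregation is used here only to make the resulting columns self-describing.

```python
import pandas as pd

# Hypothetical books (publishers and numbers are made up)
books = pd.DataFrame({
    "publisher": ["P1", "P1", "P2", "P3"],
    "title": ["a", "b", "c", "d"],
    "average_rating": [4.0, 3.0, 4.5, 3.8],
    "rating_count": [100, 300, 250, 50],
})

# Named aggregation: one row per publisher with explicit output column names
summary = (books.groupby("publisher")
                .agg(n_books=("title", "count"),
                     mean_rating=("average_rating", "mean"),
                     total_ratings=("rating_count", "sum"))
                .reset_index())

# Rank publishers by total ratings and keep the top two
top = summary.sort_values("total_ratings", ascending=False).head(2)
print(top)
```

The mean here is an unweighted mean of per-book averages, so a book with a handful of ratings counts as much as a bestseller; weighting by rating_count would be a different design choice.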


In [30]:
def makingbarplot(Genre):
    mask = [Genre in x for x in df['genre']]
    newframe = df[mask]
    df_grouped = newframe.groupby('publisher') \
       .agg({'title':'count', 'average_rating':'mean', 'rating_count': 'sum'}) \
       .reset_index()
    df_popular = df_grouped.sort_values(by=['rating_count'], ascending=False)[:10]
    barplot = px.bar(data_frame = df_popular,
                    x = 'publisher',
                    y = 'rating_count',
                    color = 'average_rating',
                    opacity = 0.9,
                    orientation = 'v',
                    barmode = 'relative',
                    title = "Top publishing houses by rating count in "+Genre
                    )
    barplot.show()

In [31]:
interact(makingbarplot, Genre=genres)

interactive(children=(Dropdown(description='Genre', options=('Fiction', 'Romance', 'Fantasy', 'YoungAdult', 'N…

<function __main__.makingbarplot(Genre)>

## Are book ratings related to books' number of pages, per series type? 
Now we aim to see if there is a *sweet spot* for the number of pages, depending on whether the book is a Standalone, a Duology, etc.  
We combine this insight with the popularity of the series type over the years. 

In [32]:
toggle = widgets.ToggleButtons(options=['All', 
                                        'Standalone', 
                                        'Duology', 
                                        'Trilogy', 
                                        'Multiple Books'],
                               description='Series Type:',
                               disabled=False,
                               button_style='',
                               tooltips=['Any Number of Books',
                                        'One Book', 
                                        'Two Books', 
                                        'Three Books', 
                                        'More than Three Books']
                                )

In [33]:
def making_pages(df, toggle):
    df_pages = df[df['number_of_pages']<=1500].copy()
    if not toggle == 'All':
        mask = [toggle in x for x in df_pages['series_type']]
        colour = 'rating_count'
    else:
        mask = [True for x in df_pages['series_type']]
        colour = 'series_type'
    title = 'Rating count per Number of Pages in "'+ toggle + '" Type Series'
    fig = px.scatter(df_pages[mask], 
                     x="number_of_pages", 
                     y="rating_count",
                     size='review_count', 
                     color="review_count",
                     hover_data=['title', 'author'], 
                     facet_col="series_type",
                     title = title
                    )
    
    fig.show()
    
    df_scat = df_pages[mask].groupby(['year', 'series_type']) \
        .agg({'average_rating' : 'mean', 
              'rating_count' : 'sum', 
              'review_count' : 'sum'}) \
        .reset_index()
    df_scat = df_scat[df_scat.year>1950]
    title = 'Average rating in '+ toggle+ ' Type Series since 1950'
    fig = px.scatter(df_scat, 
                     y="average_rating", 
                     x="year",
                     log_x=True, 
                     log_y=True, 
                     color=colour, 
                     size="rating_count",
                     title = title)
    fig.show()

In [34]:
interact(making_pages, toggle = toggle, df=fixed(df))

interactive(children=(ToggleButtons(description='Series Type:', options=('All', 'Standalone', 'Duology', 'Tril…

<function __main__.making_pages(df, toggle)>

## Most popular words in book summaries by genre

Book summaries are important for attracting new readers. However creative the writer, patterns naturally arise, so we investigate the most popular words found in book descriptions by genre.

In [35]:
# A set makes the stopword membership test fast on long texts
sw = set(stopwords.words('english'))
def full_cleaning(Category,frame,col):
    """Return the cleaned, lemmatized text of `col` for all books in a genre."""
    print('Please wait....')
    # Select the rows of `frame` whose genre list contains the chosen category
    mask = [Category in x for x in frame['genre']]
    newframe = frame[mask].astype(str)
    text = ' '.join(newframe[col])
    text = text.lower()
    text = sub(r'\[.*?\]', '', text)           # drop [bracketed] fragments
    text = sub(r'([.!,?])', r' \1 ', text)     # pad punctuation with spaces
    text = sub(r'[^a-zA-Z.,!?]+', r' ', text)  # keep letters and basic punctuation
    # removing stopwords
    cleanlist = [word for word in text.split() if word not in sw]
    # lemmatizing
    lemmatizer = WordNetLemmatizer()
    cleantext = ' '.join([lemmatizer.lemmatize(w) for w in cleanlist])
    return cleantext
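
The cleaning steps can be tried on a single sentence without NLTK. The sketch below is a simplified stand-in: it uses a tiny hand-written stopword set (an assumption, in place of NLTK's English list), skips lemmatization, and drops punctuation outright. The helper name `clean` and the sample sentence are hypothetical.

```python
import re
from collections import Counter

# Tiny stand-in stopword list (assumption; the notebook uses NLTK's English list)
STOPWORDS = {"the", "a", "of", "and", "in", "is", "her"}

def clean(text):
    """Lowercase, drop bracketed notes, keep letters only, remove stopwords."""
    text = text.lower()
    text = re.sub(r"\[.*?\]", "", text)   # remove [bracketed] fragments
    text = re.sub(r"[^a-z]+", " ", text)  # replace anything that is not a letter
    return [w for w in text.split() if w not in STOPWORDS]

tokens = clean("The Secret of the Garden [Illustrated]: a secret garden, and her secret!")

# Count words longer than two characters, the same length filter the word clouds apply
freq = Counter(w for w in tokens if len(w) > 2)
print(freq.most_common(2))
```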

In [36]:
def apply_image_mask(title):
    if title == 'summaries':
        book_img = 'https://thumbs.dreamstime.com/b/books-cup-tea-icon-flat-style-isolated-white-background-read-drink-symbol-82478890.jpg'
    else:
        book_img = 'https://cdn-icons-png.flaticon.com/512/308/308184.png'
    with urllib.request.urlopen(book_img) as url:
        f = BytesIO(url.read())
    img = Image.open(f)
    mask = np.array(img)
    img_color = ImageColorGenerator(mask)
    return mask, img_color

In [37]:
def makingclouds(Genre,frame,col,Words,title):
    cloudtext=full_cleaning(Genre,frame,col)
    print('Word cloud for',Genre)
    word_freq = nltk.FreqDist([i for i in cloudtext.split() if len(i) > 2])
    mask, img_color = apply_image_mask(title)
    wc = WordCloud(background_color='white',
                   max_font_size=75,
                   max_words=Words,
                   mask = mask,
                   random_state=42)
    wordcloud = wc.generate_from_frequencies(word_freq)
    wordcloud = wordcloud.recolor(color_func=img_color)
    plt.figure(figsize=(14, 8))
    plt.title('Most popular words in book ' + title + ' ('+ Genre+')', 
              fontsize=25)
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis("off")

In [38]:
interact(makingclouds, Genre=genres, 
         col=fixed('description'), Words=[1000, 800, 500, 300],
         title=fixed('summaries'), 
         frame=fixed(df[['genre','description']]))

interactive(children=(Dropdown(description='Genre', options=('Fiction', 'Romance', 'Fantasy', 'YoungAdult', 'N…

<function __main__.makingclouds(Genre, frame, col, Words, title)>

## Most popular words in book titles by genre
Having seen clear patterns in the book descriptions, we also examine the book titles for common words by genre.

In [39]:
interact(makingclouds, Genre=genres, 
         col=fixed('title'),Words=[400, 200, 100],
         title=fixed('titles'), 
         frame=fixed(df[['genre','title']]))

interactive(children=(Dropdown(description='Genre', options=('Fiction', 'Romance', 'Fantasy', 'YoungAdult', 'N…

<function __main__.makingclouds(Genre, frame, col, Words, title)>

This concludes our exploration of the *Best Books Ever* dataset. I sincerely hope you enjoyed it as much as I did!