# Objective of the Analysis

The purpose of this analysis is to perform a quick exploratory data analysis (EDA) of the top 100 trending books dataset and to rank authors by the books they have in the top 100.

As some authors have multiple books in the top 100, we need a scoring system. A simple approach is to award points by placing: the 1st-place book gets 100 points, the 2nd-place book gets 99 points, and so on down to 1 point for 100th place. Each author's score is the sum of the points for all of their books in the top 100. This system rewards how prolific an author is as well as how highly their individual books rank.
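
As a minimal sketch of this scoring rule, the points for a placing are `101 - rank`, summed over an author's books. The helper names `book_points` and `author_points` and the example placings are illustrative assumptions, not part of the notebook:

```python
from collections import defaultdict

def book_points(rank: int) -> int:
    """Points for a single book: rank 1 -> 100, rank 2 -> 99, ..., rank 100 -> 1."""
    if not 1 <= rank <= 100:
        raise ValueError("rank must be between 1 and 100")
    return 101 - rank

def author_points(placings):
    """Sum points per author from (author, rank) pairs."""
    totals = defaultdict(int)
    for author, rank in placings:
        totals[author] += book_points(rank)
    return dict(totals)

# Hypothetical placings, for illustration only
example = [("Author A", 1), ("Author A", 2), ("Author B", 3)]
print(author_points(example))  # {'Author A': 199, 'Author B': 98}
```

An author with two mid-table books can therefore outscore an author with a single book near the top, which is the intended effect of summing.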

The remainder of this notebook presents the analysis. 

# Analysis Summary - Last run on 28th November 2023

Sarah J. Maas is the clear top author in this analysis, with 4 books and 1 box set in the top 100. Rebecca Yarros comes second, with both of her books inside the top 10; she is the only author among the top 10 ranked authors to achieve this.

This is unsurprising, as the A Court of Thorns and Roses series by Sarah J. Maas and the books Fourth Wing and Iron Flame by Rebecca Yarros have all been very popular and trending on TikTok.



# Import libraries

In [45]:
import numpy as np 
import pandas as pd
import plotly.express as px

import os

# Import Data

In [46]:
df = pd.read_csv('../DATA/Top-100 Trending Books.csv')

# Exploratory data analysis (EDA)

In [47]:
"""
Check first 5 rows
"""
df.head(5)

Unnamed: 0,Rank,book title,book price,rating,author,year of publication,genre,url
0,1,"Iron Flame (The Empyrean, 2)",18.42,4.1,Rebecca Yarros,2023,Fantasy Romance,amazon.com/Iron-Flame-Empyrean-Rebecca-Yarros/...
1,2,The Woman in Me,20.93,4.5,Britney Spears,2023,Memoir,amazon.com/Woman-Me-Britney-Spears/dp/16680090...
2,3,My Name Is Barbra,31.5,4.5,Barbra Streisand,2023,Autobiography,amazon.com/My-Name-Barbra-Streisand/dp/0525429...
3,4,"Friends, Lovers, and the Big Terrible Thing: A...",23.99,4.4,Matthew Perry,2023,Memoir,amazon.com/Friends-Lovers-Big-Terrible-Thing/d...
4,5,How to Catch a Turkey,5.65,4.8,Adam Wallace,2018,"Childrens, Fiction",amazon.com/How-Catch-Turkey-Adam-Wallace/dp/14...


In [48]:
"""
Basic information about the data
"""
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 8 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Rank                 100 non-null    int64  
 1   book title           100 non-null    object 
 2   book price           100 non-null    float64
 3   rating               97 non-null     float64
 4   author               100 non-null    object 
 5   year of publication  100 non-null    int64  
 6   genre                100 non-null    object 
 7   url                  100 non-null    object 
dtypes: float64(2), int64(2), object(4)
memory usage: 6.4+ KB


In [49]:
"""
Calculate some simple descriptive statistics
"""

# Average book rating
print(f"Average book rating -- {df['rating'].mean()}")
# Median book rating
print(f"Median book rating -- {df['rating'].median()}")
print()

# Average book price
print(f"Average Book Price -- {df['book price'].mean()} USD")
# Median book price
print(f"Median Book Price -- {df['book price'].median()} USD")
print()

# Earliest year of publication
print(f"Earliest Year of publication -- {df['year of publication'].min()}")
# Median year of publication
print(f"Median Year of publication -- {int(df['year of publication'].median())}")


    

Average book rating -- 4.689690721649485
Median book rating -- 4.7

Average Book Price -- 12.7086 USD
Median Book Price -- 11.48 USD

Earliest Year of publication -- 1947
Median Year of publication -- 2019
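
The same kind of summary can be produced in a single call with `DataFrame.agg`. The sketch below runs on a small stand-in frame: only the column names match the dataset, and the values are for illustration. One rating is left missing to show that `mean` and `median` skip `NaN` by default, which matters here because only 97 of the 100 ratings are non-null.

```python
import pandas as pd

# Stand-in data for illustration; only the column names match the dataset
sample = pd.DataFrame({
    "rating": [4.1, 4.5, None, 4.8],
    "book price": [18.42, 20.93, 31.50, 5.65],
    "year of publication": [2023, 2023, 2023, 2018],
})

# Each column gets its own list of statistics; combinations that were
# not requested (e.g. "min" of rating) appear as NaN in the result
summary = sample.agg({
    "rating": ["mean", "median"],
    "book price": ["mean", "median"],
    "year of publication": ["min", "median"],
})
print(summary.round(2))
```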


In [50]:
"""
Box plots for:
1- Book price
2- Book rating
"""


# Create the box plot
book_price_plot = px.box(df, y='book price', points="all", title="Book Price")
book_rating_plot = px.box(df,y='rating',points='all',title='Book ratings')

# Show the plot
book_price_plot.show()
book_rating_plot.show()

In [51]:
"""
Count how often each genre string occurs. A book can list multiple genres;
each distinct combination (e.g. "Childrens, literature") is counted as its own category.
"""
unique,counts = np.unique(df['genre'],return_counts = True)
df_genre = pd.DataFrame({
    "Genre":unique,
    "count":counts
})
df_genre.sort_values(by='count',ascending=False,inplace=True)
df_genre.reset_index(inplace=True,drop=True)

df_genre.head(5)

Unnamed: 0,Genre,count
0,Nonfiction,6
1,"Childrens, literature",5
2,Childrens,4
3,Memoir,3
4,Fantasy,3
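
Because the counts above treat each full genre string as one category, a book tagged `Childrens, Fiction` is not counted under `Childrens`. One possible way to count individual genres, assuming they are comma-separated as in the sample rows, is to split and explode the column. The series below is a stand-in built from the genre strings of the first five rows:

```python
import pandas as pd

# Genre strings copied from the first five rows shown earlier
genres = pd.Series([
    "Fantasy Romance", "Memoir", "Autobiography", "Memoir", "Childrens, Fiction",
], name="genre")

# Split on commas, put each genre on its own row, and strip whitespace
individual = genres.str.split(",").explode().str.strip()
genre_counts = individual.value_counts()
print(genre_counts)
```

Note that a space-separated tag such as `Fantasy Romance` stays as a single genre under this approach.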


In [52]:
"""

Looking at the counts of some of the common genres including
- Fantasy
- Fiction
- Nonfiction
- Childrens
- Romance
- Thriller

"""

fantasy_count = 0
fiction_count = 0
non_fiction_count = 0
childrens_count = 0
romance_count = 0
thriller_count = 0

for index,row in df.iterrows():
    genre = df.at[index,'genre'].lower()
    if 'fantasy' in genre:
        fantasy_count += 1
    
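    # Substring match: 'fiction' also matches 'nonfiction',
    # so this count includes every book counted as nonfiction below
    if 'fiction' in genre: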
        fiction_count += 1
        
    if 'nonfiction' in genre:
        non_fiction_count += 1
        
    if 'childrens' in genre:
        childrens_count += 1
        
    if 'romance' in genre:
        romance_count += 1
        
    if 'thriller' in genre:
        thriller_count += 1

# Run some checks to make sure the analysis is consistent
assert fantasy_count == 15
assert fiction_count == 52
assert non_fiction_count == 19
assert childrens_count == 32
assert romance_count == 10
assert thriller_count == 9

print(f"Fantasy count = {fantasy_count}")
print(f"Fiction count = {fiction_count}")
print(f"Nonfiction count = {non_fiction_count}")
print(f"Childrens count = {childrens_count}")
print(f"Romance count = {romance_count}")
print(f"Thriller count = {thriller_count}")



categories = ['Fantasy', 'Fiction', 'Nonfiction','Childrens','Romance','Thriller']
values = [fantasy_count, fiction_count, non_fiction_count,childrens_count,romance_count,thriller_count]

# Create the bar plot
fig = px.bar(x=categories, y=values,labels={'y': 'Count','x':'Genre'}, title="Counts for genres")

# Show the plot
fig.show()


Fantasy count = 15
Fiction count = 52
Nonfiction count = 19
Childrens count = 32
Romance count = 10
Thriller count = 9
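
The keyword counts can also be computed without an explicit loop using `Series.str.contains`. The sketch below uses genre strings copied from the top-10 output and contrasts a plain substring match, which counts `Nonfiction` as `fiction`, with a word-boundary regex that does not. The regex variant is an alternative, not what the loop above does, so applying it to the full dataset would give a lower fiction count than 52.

```python
import pandas as pd

# Genre strings copied from the book listings in this notebook
genres = pd.Series([
    "Fantasy Romance", "Nonfiction, Politics", "Childrens, Fiction",
    "Historical Fiction", "Nonfiction, True Crime",
])
lowered = genres.str.lower()

# Plain substring match: 'fiction' also matches 'nonfiction'
substring_fiction = lowered.str.contains("fiction", regex=False).sum()

# Word-boundary match: counts 'fiction' but not 'nonfiction'
strict_fiction = lowered.str.contains(r"\bfiction\b", regex=True).sum()

nonfiction = lowered.str.contains("nonfiction", regex=False).sum()

print(substring_fiction, strict_fiction, nonfiction)  # 4 2 2
```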


In [53]:
"""
Printing the details for the top 20 ranked books
"""
for index, row in df.iterrows():
    book_rank = df.at[index,'Rank']
    if book_rank > 20:
        break
    book_title = df.at[index,'book title']
    book_price = df.at[index,'book price']
    book_genre = df.at[index,'genre']
    book_yop = df.at[index,'year of publication']
    book_rating = df.at[index,'rating']
    print(f'rank = {book_rank} -- Title = {book_title} -- genre = {book_genre} -- rating = {book_rating}')
    print()
    print()

rank = 1 -- Title = Iron Flame (The Empyrean, 2) -- genre = Fantasy Romance -- rating = 4.1


rank = 2 -- Title = The Woman in Me -- genre = Memoir -- rating = 4.5


rank = 3 -- Title = My Name Is Barbra -- genre = Autobiography -- rating = 4.5


rank = 4 -- Title = Friends, Lovers, and the Big Terrible Thing: A Memoir -- genre = Memoir -- rating = 4.4


rank = 5 -- Title = How to Catch a Turkey -- genre = Childrens, Fiction -- rating = 4.8


rank = 6 -- Title = Fourth Wing (The Empyrean, 1) -- genre = Fantasy -- rating = 4.8


rank = 7 -- Title = Unwoke: How to Defeat Cultural Marxism in America -- genre = Nonfiction, Politics -- rating = 4.3


rank = 8 -- Title = No Brainer (Diary of a Wimpy Kid Book 18) -- genre = Humor, Middle Grade -- rating = 4.8


rank = 9 -- Title = Killers of the Flower Moon: The Osage Murders and the Birth of the FBI -- genre = Nonfiction, True Crime -- rating = 4.4


rank = 10 -- Title = All the Light We Cannot See: A Novel -- genre = Historical Fiction -- r

In [85]:
"""
Code that analyses each author and calculates the number of points they score.

For each book an author has in the top 100, their score is the sum of (101 - book_rank):

e.g. if an author has rank 1 and 2 books, their score would be
(101 - 1) + (101 - 2) = 100 + 99 = 199

Prints the per-book breakdown for every author.
"""

def calculate_points(df):
    
    unique,counts = np.unique(df['author'],return_counts = True)
    df_authors = pd.DataFrame({
        "Author":unique,
        "count":counts
    })
    df_authors['Points'] = 0
    author_list = []
    point_list = []
    df_authors.sort_values(by='count',ascending=False,inplace=True)
    df_authors.reset_index(inplace=True,drop=True)

    # Print each author's books and points, ordered by number of books in the top 100
    for index, _ in df_authors.iterrows():
        author = df_authors.at[index,'Author']
        count = df_authors.at[index,'count']
        print(f"Author = {author} -- Count = {count}")
        print()
        df_books_author = df[df['author'] == author]
        points = 0
        author_list.append(author)
        for i,_ in df_books_author.iterrows():
            score = 101 - df_books_author.at[i,'Rank']
            points = points + score
            print(f"Book = {df_books_author.at[i,'book title']} --- Rank = {df_books_author.at[i,'Rank']} --- Points = {score}")
        df_authors.at[index,'Points'] = points
        point_list.append(points)
        print()
        print(f"Total points = {points}")
        print()
        print('********')
    return df_authors

df_authors = calculate_points(df)

Author = Sarah J. Maas -- Count = 5

Book = A Court of Thorns and Roses (A Court of Thorns and Roses, 1) --- Rank = 23 --- Points = 78
Book = House of Flame and Shadow (Crescent City, 3) --- Rank = 25 --- Points = 76
Book = A Court of Thorns and Roses Paperback Box Set (5 books) --- Rank = 47 --- Points = 54
Book = A Court of Mist and Fury (A Court of Thorns and Roses, 2) --- Rank = 72 --- Points = 29
Book = A Court of Wings and Ruin (A Court of Thorns and Roses, 3) --- Rank = 88 --- Points = 13

Total points = 250

********
Author = Adam Wallace -- Count = 3

Book = How to Catch a Turkey --- Rank = 5 --- Points = 96
Book = How to Catch a Dinosaur --- Rank = 50 --- Points = 51
Book = How to Catch an Elf --- Rank = 67 --- Points = 34

Total points = 181

********
Author = David Grann -- Count = 2

Book = Killers of the Flower Moon: The Osage Murders and the Birth of the FBI --- Rank = 9 --- Points = 92
Book = The Wager: A Tale of Shipwreck, Mutiny and Murder --- Rank = 98 --- Points = 3

In [80]:
"""
Highest ranking authors based on the scoring system
"""

df_authors.sort_values(by='Points',ascending=False,inplace=True)
df_authors.reset_index(inplace=True,drop=True)
df_authors.head(10)

Unnamed: 0,Author,count,Points
0,Sarah J. Maas,5,250
1,Rebecca Yarros,2,195
2,Adam Wallace,3,181
3,Bill Martin Jr.,2,141
4,Alice Walstead,2,131
5,Suzanne Collins,2,104
6,Britney Spears,1,99
7,Barbra Streisand,1,98
8,Matthew Perry,1,97
9,David Grann,2,95
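
The same totals can be computed without explicit loops. The sketch below uses `groupby` with named aggregation and is checked against the Sarah J. Maas and Adam Wallace ranks printed above. The function name `author_scores` is hypothetical, and unlike `calculate_points` it does not print the per-book breakdown:

```python
import pandas as pd

def author_scores(books: pd.DataFrame) -> pd.DataFrame:
    """Return one row per author with book count and total points (101 - Rank)."""
    scored = books.assign(Points=101 - books["Rank"])
    return (
        scored.groupby("author")
        .agg(count=("Rank", "size"), Points=("Points", "sum"))
        .sort_values("Points", ascending=False)
        .reset_index()
        .rename(columns={"author": "Author"})
    )

# Ranks copied from the per-author output above
books = pd.DataFrame({
    "author": ["Sarah J. Maas"] * 5 + ["Adam Wallace"] * 3,
    "Rank": [23, 25, 47, 72, 88, 5, 50, 67],
})
result = author_scores(books)
print(result)
```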


In [83]:
"""
Import the data recently scraped from the Amazon Australia store, then calculate the number of points each author scores using the calculate_points function
"""

file = "aus_top_selling_books_16_02_2024_14_46_16.csv"
df_aus = pd.read_csv(f'../DATA/AUS/{file}')
print(f"File = {file}")
df_aus_points = calculate_points(df_aus)
df_aus_points.sort_values(by='Points',ascending=False,inplace=True)
df_aus_points.reset_index(inplace=True,drop=True)

***********
File = aus_top_selling_books_16_02_2024_14_46_16.csv
Author = Sarah J. Maas -- Count = 5

Book = A Court of Thorns and Roses Paperback Box Set (5 books): The first five books of the hottest fantasy series and TikTok sensation: 1-5 --- Rank = 11 --- Points = 90
Book = House of Flame and Shadow: The BESTSELLING and SMOULDERING third instalment in the Crescent City series --- Rank = 21 --- Points = 80
Book = A Court of Thorns and Roses: The hottest Tiktok sensation: 1 --- Rank = 37 --- Points = 64
Book = Throne of Glass Box Set (Paperback) --- Rank = 48 --- Points = 53
Book = House of Earth and Blood: The first instalment of the EPIC Crescent City series from multi-million and #1 Sunday Times bestselling author Sarah J. Maas --- Rank = 50 --- Points = 51

Total points = 338

********
Author = H. D. Carlton -- Count = 3

Book = Where's Molly --- Rank = 14 --- Points = 87
Book = Haunting Adeline --- Rank = 26 --- Points = 75
Book = Hunting Adeline --- Rank = 46 --- Points = 55



In [86]:
df_aus_points.head(10)

Unnamed: 0,Author,count,Points
0,Sarah J. Maas,5,338
1,H. D. Carlton,3,217
2,Trent Dalton,2,190
3,Rebecca Yarros,2,172
4,School Zone,2,170
5,Laura Nowlin,2,157
6,Freida McFadden,2,157
7,Nicole Maguire,1,100
8,James Clear,1,99
9,Nagi Maehashi,1,98
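
To compare the two stores side by side, the two points tables can be merged on author. The sketch below uses a few rows copied from the two tables above; the frame names `us` and `aus` are illustrative. Because only a handful of rows are copied, a 0 here means "not in the copied rows", whereas with the full tables it would mean the author has no book in that store's top 100.

```python
import pandas as pd

# A few rows copied from the two top-10 tables above
us = pd.DataFrame({
    "Author": ["Sarah J. Maas", "Rebecca Yarros", "Adam Wallace"],
    "Points": [250, 195, 181],
})
aus = pd.DataFrame({
    "Author": ["Sarah J. Maas", "H. D. Carlton", "Rebecca Yarros"],
    "Points": [338, 217, 172],
})

# Outer join keeps authors that appear in only one table
comparison = us.merge(aus, on="Author", how="outer", suffixes=("_US", "_AUS"))

# Treat an author missing from one table as scoring 0 points there
comparison = comparison.fillna(0).astype({"Points_US": int, "Points_AUS": int})
comparison["Difference"] = comparison["Points_AUS"] - comparison["Points_US"]

print(comparison.sort_values("Points_AUS", ascending=False))
```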
