In [None]:
---
title: "Amazon Books Analysis"
author: "Camryn Backes"
date: "2022-11-29"
output: html_document
---

## Amazon Bestselling Books Analysis

#### Goal: 
Identify trends within bestselling Amazon book data to determine the qualities of a bestselling book.

Data contains the top 50 bestselling Amazon books from 2009 to 2019.
Source: Sooter Saalu. (2020). <i>Amazon Top 50 Bestselling Books 2009 - 2019</i> [Data set]. Kaggle. https://doi.org/10.34740/KAGGLE/DSV/1556647

#### Data Cleaning: 

In [None]:
install.packages("tidyverse")
library(tidyverse)

In [None]:
books <- read_csv("/kaggle/input/amazon-top-50-bestselling-books-2009-2019/bestsellers with categories.csv")

In [None]:
install.packages("janitor")
library(janitor)

In [None]:
books_cleaned <- clean_names(books)

#### Visualizations and Analysis:

In [None]:
# ggplot2 is part of the tidyverse installed above, so loading tidyverse is enough.
library(tidyverse)

In [None]:
ggplot(books_cleaned, aes(x=user_rating)) + geom_bar() + labs(title = 'Number of Books Per Rating')

The majority of the books (71%) are rated higher than 4.5 stars.
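The 71% figure can be computed from the data rather than read off the chart. A minimal sketch, assuming the cleaned rating column is named `user_rating` (as used in the plot above); the helper `share_above()` and the toy ratings are hypothetical and only for illustration:

```r
# Hypothetical helper: fraction of values strictly greater than a threshold.
share_above <- function(x, threshold) {
  mean(x > threshold, na.rm = TRUE)
}

# Toy ratings (made up for illustration, not taken from the data set).
toy_ratings <- c(4.0, 4.6, 4.7, 4.8)
share_above(toy_ratings, 4.5)  # 3 of the 4 toy values exceed 4.5

# In this notebook the equivalent call would be:
# share_above(books_cleaned$user_rating, 4.5)
```

Taking the mean of a logical vector gives the proportion of `TRUE` values, which avoids counting rows by hand.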

In [None]:
ggplot(books_cleaned, aes(x=genre)) + geom_bar() + labs(title = 'Number of Books per Genre')

There are slightly more non-fiction books (56%) than fiction books (44%).
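The genre split can be tabulated alongside the chart. This is a sketch only, assuming a `genre` column as in the plot above; `genre_share()` and the toy rows are hypothetical:

```r
library(dplyr)

# Hypothetical helper: count rows per genre and add each genre's share of the total.
genre_share <- function(df) {
  df %>%
    count(genre) %>%
    mutate(share = n / sum(n))
}

# Toy data (made up for illustration).
toy <- data.frame(genre = c("Fiction", "Non Fiction", "Non Fiction", "Non Fiction"))
genre_share(toy)

# In this notebook the equivalent call would be:
# genre_share(books_cleaned)
```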

In [None]:
books_cleaned2 <- books_cleaned %>% count(year, genre)

In [None]:
ggplot(books_cleaned2, aes(x=year, y=n, fill=genre)) + 
  geom_bar(stat = "identity", position = "dodge") +
  ylab("Number of Books in Genre")+
  labs(title = 'Number of Books in Each Genre per Year')

Non-fiction books outnumber fiction books among the bestsellers in every year except 2014.

In [None]:
# dplyr is part of the tidyverse installed above, so it does not need to be reinstalled.
library(dplyr)

In [None]:
year = books_cleaned %>% 
  group_by(year) %>% 
  summarize(avg_user_rating = mean(user_rating), 
          avg_reviews = mean(reviews),
          avg_price = mean(price))

A preview of the new data frame used to compare years:

In [None]:
head(year)

In [None]:
ggplot(data = year, aes(x=year, y=avg_user_rating)) + geom_point() + labs(title = "Average User Rating Per Year") + geom_smooth()

There is a modest positive relationship between year and rating: in recent years, bestselling books tend to be rated higher.

In [None]:
ggplot(year, aes(x=year, y=avg_reviews)) + geom_point() + labs(title = "Average Number of Reviews Per Year") + geom_smooth()

Similarly, bestsellers from more recent years have a somewhat higher average number of reviews.

In [None]:
ggplot(books_cleaned, aes(x=reviews, y=user_rating)) + geom_point() + geom_smooth(method=lm) + labs(title = "User Rating by Number of Reviews")

However, the number of reviews a book receives is not correlated with its rating.
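A correlation coefficient can back up what the scatter plot suggests. The sketch below uses base R's `cor()` on toy values that are made up for illustration; the column names `reviews` and `user_rating` match those used in the plot above:

```r
# Toy data (made up): ratings that neither rise nor fall with review count.
toy <- data.frame(
  reviews     = c(100, 200, 300, 400),
  user_rating = c(4.5, 4.7, 4.7, 4.5)
)

# Pearson correlation; values near 0 indicate little linear association.
toy_cor <- cor(toy$reviews, toy$user_rating)
toy_cor  # approximately 0 for this toy data

# In this notebook the equivalent call would be:
# cor(books_cleaned$reviews, books_cleaned$user_rating, use = "complete.obs")
```

Pearson correlation only measures linear association, so it complements the plot rather than replacing it.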

In [None]:
ggplot(year, aes(x=year, y=avg_price)) + geom_point() + labs(title = 'Average Price of Book Per Year') + geom_smooth()

The average price of bestselling books is trending downward, from roughly 16.00 in 2009 to roughly 10.00 in 2019.

2015 stands out as an outlier, so I investigated further and saw that a lot of coloring books were bestsellers that year.

In [None]:
coloring_books <- books_cleaned %>% filter(grepl("Coloring", name))

* I filtered for the word "Coloring", which appears in the data set in 13 different book titles. 
* 9 out of 13 coloring books came from 2015, with two from 2016 and two from 2019. 
* 18% of bestselling books in 2015 were coloring books, falling to 4% in 2016. 
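The per-year counts and shares in the bullets above could be derived with a grouped summary. A sketch, assuming columns `name` and `year` as used elsewhere in this notebook; `title_share_by_year()` and the toy rows are hypothetical:

```r
library(dplyr)

# Hypothetical helper: per year, the number and share of titles matching a pattern.
title_share_by_year <- function(df, pattern) {
  df %>%
    group_by(year) %>%
    summarize(
      n_match = sum(grepl(pattern, name)),
      share   = mean(grepl(pattern, name))
    )
}

# Toy data (made-up titles for illustration).
toy <- data.frame(
  name = c("Coloring Book A", "Coloring Book B", "Novel C", "Novel D", "Novel E", "Novel F"),
  year = c(2015, 2015, 2015, 2015, 2016, 2016)
)
title_share_by_year(toy, "Coloring")

# In this notebook the equivalent call would be:
# title_share_by_year(books_cleaned, "Coloring")
```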

In [None]:
head(coloring_books)

In [None]:
mean(coloring_books$price)

At an average price of $6.15, coloring books certainly bring down the average book price for 2015.
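One way to quantify this effect is to compare a year's average price with and without the coloring books. This is a sketch under the assumption that the columns are named `name`, `year`, and `price`; `price_with_without()` and the toy prices are hypothetical:

```r
library(dplyr)

# Hypothetical helper: a year's average price, with and without titles matching a pattern.
price_with_without <- function(df, yr, pattern) {
  in_year <- df %>% filter(year == yr)
  c(
    all_books = mean(in_year$price),
    excluding = mean(in_year$price[!grepl(pattern, in_year$name)])
  )
}

# Toy data (made-up titles and prices for illustration).
toy <- data.frame(
  name  = c("Coloring Book A", "Coloring Book B", "Novel C", "Novel D"),
  year  = c(2015, 2015, 2015, 2015),
  price = c(5, 7, 12, 16)
)
price_with_without(toy, 2015, "Coloring")

# In this notebook the equivalent call would be:
# price_with_without(books_cleaned, 2015, "Coloring")
```

The gap between the two values shows how much the matching titles pull the yearly average down.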

In [None]:
author_count = books_cleaned %>% 
  count(author, year) %>% 
  arrange(desc(n))

These are authors with multiple bestselling books in a single year. 

In [None]:
head(author_count)

This indicates consumer interest in fictional multi-book series (especially romance and fantasy): Twilight, Fifty Shades of Grey, Harry Potter, Percy Jackson, and The Hunger Games.

I created a filter for just these five authors.

In [None]:
# Keep only rows written by the five series authors.
series <- books_cleaned %>%
  filter(author %in% c('Stephenie Meyer', 'E L James', 'J.K. Rowling',
                       'Rick Riordan', 'Suzanne Collins'))

In [None]:
ggplot(series, aes(x=year)) + geom_bar() + labs(title = 'Count of Bestselling Books in a Series', caption = 'number of books by Stephenie Meyer, E L James, J.K. Rowling, Rick Riordan, and Suzanne Collins')

* Books from these fiction series were most popular from 2009 to 2012 (16% of bestsellers).
* Their share falls drastically in the years that followed (3% of all bestsellers from 2013 to 2019).
* In 2016 there is a slight rise again due to two Harry Potter spin-offs: Fantastic Beasts and the Cursed Child.
* This suggests that multi-book series have become less likely to be bestsellers.
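The period shares quoted above could be computed with a small helper. A sketch, assuming columns `author` and `year`; `author_share()`, the placeholder author names, and the toy rows are hypothetical:

```r
library(dplyr)

series_authors <- c("Stephenie Meyer", "E L James", "J.K. Rowling",
                    "Rick Riordan", "Suzanne Collins")

# Hypothetical helper: share of rows by the given authors within the years [from, to].
author_share <- function(df, authors, from, to) {
  in_range <- df %>% filter(year >= from, year <= to)
  mean(in_range$author %in% authors)
}

# Toy data (made up; "Author X/Y/Z" are placeholders).
toy <- data.frame(
  author = c("Rick Riordan", "Author X", "Author Y", "Author Z"),
  year   = c(2010, 2011, 2014, 2015)
)
author_share(toy, series_authors, 2009, 2012)
author_share(toy, series_authors, 2013, 2019)

# In this notebook the equivalent calls would be:
# author_share(books_cleaned, series_authors, 2009, 2012)
# author_share(books_cleaned, series_authors, 2013, 2019)
```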

Next, I created two more data frames to compare prices between fiction and non-fiction.

In [None]:
fiction_filter <- filter(books_cleaned, genre == 'Fiction')

fiction = fiction_filter %>% 
  group_by(year) %>% 
  summarize(avg_user_rating = mean(user_rating), 
          avg_reviews = mean(reviews),
          avg_price = mean(price))

In [None]:
nonfiction_filter <- filter(books_cleaned, genre == 'Non Fiction')

nonfiction = nonfiction_filter %>% 
  group_by(year) %>% 
  summarize(avg_user_rating = mean(user_rating),
            avg_reviews = mean(reviews),
            avg_price = mean(price))

In [None]:
ggplot() + 
  geom_line(data = fiction, aes(x = year, y = avg_price), color = "blue") +
  geom_line(data = nonfiction, aes(x = year, y = avg_price), color = "red") +
  xlab('Year') +
  ylab('Average Price') +
  labs(title = 'Average Book Price Per Year', caption = 'Blue = Fiction, Red = NonFiction')
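As an alternative sketch, the same comparison can be drawn from a single grouped data frame, which lets ggplot2 generate the legend instead of describing the colors in a caption. The helper names `price_by_genre()` and `plot_price_by_genre()` are hypothetical, and the columns `year`, `genre`, and `price` are assumed to match those used above:

```r
library(dplyr)
library(ggplot2)

# Hypothetical helper: average price per year and genre in one data frame.
price_by_genre <- function(df) {
  df %>%
    group_by(year, genre) %>%
    summarize(avg_price = mean(price), .groups = "drop")
}

# Hypothetical helper: one line per genre, colored and labeled via the legend.
plot_price_by_genre <- function(df) {
  ggplot(price_by_genre(df), aes(x = year, y = avg_price, color = genre)) +
    geom_line() +
    labs(title = "Average Book Price Per Year",
         x = "Year", y = "Average Price", color = "Genre")
}

# In this notebook the equivalent call would be:
# plot_price_by_genre(books_cleaned)
```

Mapping `genre` to `color` inside `aes()` keeps the two lines and their labels in sync if more genres are ever added.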

#### Non-fiction
* DSM 5 priced at 105.00 in 2013 and 2014, bumping up the average price. 
* Official SAT Study Guide (40.00) was a bestseller from 2010 - 2014. 
* Prices drop drastically in 2015 since coloring books are considered non-fiction.

#### Fiction
* More expensive in 2009 due to The Twilight Saga Collection being sold for 82.00
* Similarly, the Harry Potter Series was being sold for 52.00 in 2016.

### Conclusion:

Bestselling books are...
* highly rated and reviewed (especially in more recent years)
* consistently majority non-fiction
* moving toward the lower end of the price range (around 10.00 dollars)
* trending away from romance/fantasy fiction series
* a reflection of short-term book trends, such as adult coloring books in 2015.