**Exploring Author Success in Amazon Bestselling Books**.

This R Markdown document explores author success among Amazon’s bestselling books. I analyze a dataset covering the top 50 bestselling books for each year from 2009 to 2019, with each book’s author, user rating, number of reviews, price, list year, and genre.

In [None]:
# This R environment comes with many helpful analytics packages installed
# It is defined by the kaggle/rstats Docker image: https://github.com/kaggle/docker-rstats
# For example, here's a helpful package to load

library(tidyverse) # metapackage of all tidyverse packages

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

list.files(path = "../input")

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# Load required packages
# lubridate for date functions
# ggplot2 for visualization (also attached by tidyverse above)
library(lubridate)  # helps wrangle date attributes
library(ggplot2)    # helps visualize data


**Loading and Inspecting the Dataset.**

I begin by loading the dataset and taking a quick look at its structure and the first few rows.

In [None]:
# Load the dataset
amazon_books <- read.csv("/kaggle/input/amazon-top-50-bestselling-books-2009-2019/bestsellers with categories.csv")

# View the first few rows of the dataset
head(amazon_books)

In [None]:
# Get the data types of the different variables
str(amazon_books)

In [None]:
# Get a summary of the dataset
summary(amazon_books)

The dataset consists of 550 observations (50 books for each of the 11 years) and 7 variables: book name, author, user rating, number of reviews, price, list year, and genre.

**Data Cleaning**.

Before proceeding with the analysis, I ensure data quality by checking for missing values and duplicates.

In [None]:
# Identify and correct any errors or inconsistencies in the data
# Check for missing values
sum(is.na(amazon_books))

In [None]:
# Check for duplicate rows
sum(duplicated(amazon_books))

There are no missing values or fully duplicated rows in the dataset, which simplifies the analysis. Note that the same book can still appear in several rows, once for each year it made the list.
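The single totals above do not show which column a problem sits in, and `read.csv()` keeps blank text fields as empty strings rather than `NA`. The sketch below shows one way to check both per column. It runs on `toy_books`, a small made-up data frame standing in for `amazon_books`, so the values are illustrative only.

```r
# Toy stand-in for amazon_books (hypothetical values, not from the dataset)
toy_books <- data.frame(
  Name    = c("Book A", "Book B", "", "Book D"),
  Author  = c("author one", " ", "author two", NA),
  Reviews = c(120, NA, 45, 300),
  stringsAsFactors = FALSE
)

# Count NA values in each column
na_per_column <- colSums(is.na(toy_books))

# Count blank (empty or whitespace-only) strings in character columns;
# these are not NA, so is.na() alone does not catch them
is_char <- vapply(toy_books, is.character, logical(1))
blank_per_column <- vapply(
  toy_books[is_char],
  function(col) sum(!is.na(col) & trimws(col) == ""),
  integer(1)
)

print(na_per_column)
print(blank_per_column)
```

Replacing `toy_books` with `amazon_books` applies the same checks to the real data.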

In [None]:
colnames(amazon_books)

**Standardizing Author Names.**

To make grouping by author insensitive to letter case, I standardize author names by converting them to lowercase.

In [None]:
# Standardize author names
amazon_books$Author <- tolower(amazon_books$Author)
head(amazon_books)
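Lowercasing only removes differences in letter case. If the Author column also contains variants that differ in spacing (for example around initials), those would still be grouped as separate authors. Whether such variants exist here is an assumption worth checking with `sort(unique(amazon_books$Author))`. The sketch below uses a hypothetical helper, `normalize_author()`, on made-up example strings.

```r
# Hypothetical helper (not part of the original notebook)
normalize_author <- function(x) {
  x <- tolower(x)
  x <- trimws(x)                 # drop leading and trailing whitespace
  x <- gsub("\\s+", " ", x)      # collapse runs of whitespace to one space
  # remove the space between consecutive initials: "j. k." becomes "j.k."
  x <- gsub("\\.\\s+(?=[a-z]\\.)", ".", x, perl = TRUE)
  x
}

# Made-up example strings showing spacing variants of the same name
toy_authors <- c("J. K. Rowling", "j.k. rowling ", "George  R. R. Martin")

cleaned <- normalize_author(toy_authors)
print(unique(cleaned))
```

The initials rule is deliberately narrow and may need adjusting for names with titles or other abbreviations.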

**Author Analysis.**

**Identifying Authors with Multiple Entries.**

I identify authors who have two or more entries across the yearly top 50 lists, since repeated entries indicate recurring success.

In [None]:
# Identify authors with multiple entries in the top 50 list
# count() groups by Author itself, so a separate group_by() is not needed
authors <- amazon_books %>%
  count(Author) %>%
  filter(n >= 2)
head(authors)

**Calculating Success Metrics**

I calculate success metrics for each author, including their total appearances, average ratings, and total reviews.

In [None]:
# Calculate success metrics for each author
success_metrics <- amazon_books %>%
  group_by(Author) %>%
  summarise(
    total_appearances = n(),
    average_rating = mean(User.Rating),
    total_reviews = sum(Reviews)
  ) %>%
  arrange(desc(total_reviews))
head(success_metrics)
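Because `total_appearances` counts rows, a title that made the list in several years is counted once per year. The sketch below shows how the number of distinct titles could be reported alongside it. `toy_books` is a made-up stand-in for `amazon_books`, and `distinct_titles` is a column name introduced here for illustration.

```r
library(dplyr)

# Toy stand-in for amazon_books (hypothetical values, not from the dataset)
toy_books <- data.frame(
  Name   = c("Book A", "Book A", "Book B", "Book C"),
  Author = c("author one", "author one", "author one", "author two"),
  Year   = c(2010, 2011, 2012, 2012),
  stringsAsFactors = FALSE
)

author_counts <- toy_books %>%
  group_by(Author) %>%
  summarise(
    total_appearances = n(),            # rows, i.e. list appearances
    distinct_titles = n_distinct(Name), # unique book titles
    .groups = "drop"
  )

print(author_counts)
```

In this toy data, "author one" has three appearances but only two distinct titles, because "Book A" is listed in two years.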

**Merging Author Data.**

I merge the data about authors with multiple entries and their success metrics, arranging the results by total reviews.

In [None]:
# Merge the author data with the success metrics by the 'Author' column
merged_data <- merge(authors, success_metrics, by = "Author") %>%
  arrange(desc(total_reviews))

# View the merged data
head(merged_data)

**Removing Unnecessary Columns.**

I remove the ‘n’ column from the merged data, since it holds the same count as total_appearances.

In [None]:
# Remove the 'n' column
modified_merged_data <- merged_data %>%
  select(-n)

# View the modified dataframe
head(modified_merged_data)
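Since `n` and `total_appearances` are both row counts per author, the same set of authors can also be obtained by filtering the success metrics directly, without a merge. The sketch below shows that alternative on `toy_metrics`, a made-up stand-in for `success_metrics`; `repeat_authors` is a name introduced here for illustration.

```r
library(dplyr)

# Toy stand-in for success_metrics (hypothetical values)
toy_metrics <- data.frame(
  Author = c("author one", "author two", "author three"),
  total_appearances = c(3, 1, 2),
  average_rating = c(4.6, 4.8, 4.2),
  total_reviews = c(9000, 1500, 20000),
  stringsAsFactors = FALSE
)

# Keep authors with at least two appearances, most reviewed first
repeat_authors <- toy_metrics %>%
  filter(total_appearances >= 2) %>%
  arrange(desc(total_reviews))

print(repeat_authors)
```

This avoids creating and then dropping the helper column, at the cost of no longer keeping a separate `authors` table.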

**Data Visualization.**


Now, I visualize the distributions of total appearances, average ratings, and total reviews for the authors with at least two entries.

**Distribution of Total Appearances**

In [None]:

# Plot the distribution of total appearances
ggplot(modified_merged_data, aes(x = total_appearances)) +
  geom_histogram(binwidth = 1, fill = "blue", color = "black") +
  labs(title = "Distribution of Total Appearances by Authors",
       x = "Total Appearances",
       y = "Frequency")


The histogram is skewed to the right (positively skewed): most authors have only a few appearances in the top 50 lists, and a small number of authors have many.

In [None]:
# Plot the distribution of average rating
ggplot(modified_merged_data, aes(x = average_rating)) +
  geom_histogram(binwidth = 0.1, fill = "green", color = "black") +
  labs(title = "Distribution of Average Rating by Authors",
       x = "Average Rating",
       y = "Frequency")




The histogram is skewed to the left (negatively skewed): most authors in this group have high average ratings, and only a few have noticeably lower ones.

In [None]:
# Plot the distribution of total reviews
ggplot(modified_merged_data, aes(x = total_reviews)) +
  geom_histogram(binwidth = 5000, fill = "orange", color = "black") +
  labs(title = "Distribution of Total Reviews by Authors",
       x = "Total Reviews",
       y = "Frequency")

The distribution is again positively skewed: most authors have relatively few total reviews, while a few authors have very large review counts.
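The skew direction read off the histograms can be checked numerically. The sketch below defines a hypothetical helper, `sample_skewness()`, that computes the moment-based skewness coefficient (third central moment divided by the 1.5 power of the second). A positive value indicates a longer right tail and a negative value a longer left tail. The two input vectors are made up for illustration and are not taken from the dataset.

```r
# Hypothetical helper: moment-based sample skewness
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  deviations <- x - mean(x)
  m2 <- mean(deviations^2)  # second central moment
  m3 <- mean(deviations^3)  # third central moment
  m3 / m2^1.5
}

# Made-up vectors: one with a long right tail, one with a long left tail
right_tailed <- c(2, 2, 2, 3, 3, 4, 12)
left_tailed  <- c(3.2, 4.5, 4.6, 4.7, 4.7, 4.8, 4.8)

skew_right <- sample_skewness(right_tailed)
skew_left  <- sample_skewness(left_tailed)

print(skew_right)
print(skew_left)
```

Applying the helper to `modified_merged_data$total_appearances`, `average_rating`, and `total_reviews` would give one number per plot to compare against the visual impression.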

**Conclusion**

This project explored author success using Amazon bestseller data covering both fiction and nonfiction titles. The findings suggest a high degree of inequality among authors with repeated entries in the top 50 lists: a small number of authors account for many appearances and very large review totals, while most have far fewer of both. Average ratings are the exception, since most of these authors are rated highly and only a few fall noticeably lower.