[Kaggle Dataset](https://www.kaggle.com/sootersaalu/amazon-top-50-bestselling-books-2009-2019)

# EDA of Amazon's Top 50 Bestselling Books 2009 - 2019

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

In [None]:
df = pd.read_csv('amazon-bestsellers.csv')
df.head()

This dataset contains 3 categorical columns (Name, Author, and Genre) and 4 numerical columns (User Rating, Reviews, Price, Year).

## Check for missing data

In [None]:
df.shape  # 550 rows (50 books x 11 years) would mean no books were omitted

In [None]:
df.info()

Looking at the non-null counts for all columns, it's clear that there are no missing values in the dataset.
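The same conclusion can be read off a direct count instead of the `info()` output. The snippet below is a minimal sketch on a made-up three-row frame (the `sample` data is an assumption for illustration, not part of the dataset); the same two calls would apply to `df`.

```python
import pandas as pd

# Made-up stand-in for the bestsellers table; the last row has a missing price.
sample = pd.DataFrame({
    'Name': ['Book A', 'Book B', 'Book C'],
    'Price': [8.0, 12.0, None],
    'Year': [2009, 2010, 2011],
})

# Count missing values per column; all zeros would mean nothing is missing.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Single yes/no answer for the whole frame.
has_missing = bool(sample.isna().any().any())
print(has_missing)
```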

The "Genre" column seems to only contain 2 values. It would make sense to convert it from an "object" dtype to a "category" dtype.

In [None]:
df['Genre'] = df['Genre'].astype('category')
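Before converting, it may be worth confirming that the column really holds only a handful of distinct values. The sketch below uses a small made-up Series (the values are assumptions for illustration) to show the check and what the conversion produces.

```python
import pandas as pd

# Made-up genre labels standing in for the real column.
genre = pd.Series(['Fiction', 'Non Fiction', 'Fiction', 'Non Fiction', 'Fiction'])

# A low distinct count relative to the row count suggests a categorical column.
n_distinct = genre.nunique()

# After conversion, each row stores a small integer code that points into
# the list of categories instead of a separate string.
genre_cat = genre.astype('category')
categories = list(genre_cat.cat.categories)
codes = genre_cat.cat.codes.tolist()

print(n_distinct, categories, codes)
```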

## Cleaning Data

### Name

In [None]:
categorical_columns = df.select_dtypes(exclude=[np.number]).columns

In [None]:
for col in categorical_columns:
    print(f'Number of values before title() in {col}: {len(set(df[col]))}', f'Number of values after title() in {col}: {len(set(df[col].str.title()))}')

The "Name" column seems to contain 1 value that has a capitalization difference. The following cell fixes it.

In [None]:
df['Name'] = df['Name'].str.title()

In [None]:
for col in categorical_columns:
    print(f'Number of values before title() in {col}: {len(set(df[col]))}', f'Number of values after title() in {col}: {len(set(df[col].str.title()))}')

Some titles may look odd afterwards, since `title()` capitalizes any letter that follows a non-letter character such as an apostrophe or a digit. For the purposes of this analysis, that is acceptable.
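To make the side effect of `title()` concrete, here is a small sketch on made-up titles (the strings are assumptions for illustration). It also shows one possible alternative: counting distinct titles on a case-folded key, which detects capitalization duplicates without altering the displayed text.

```python
import pandas as pd

# Made-up titles: the first two differ only in capitalization.
titles = pd.Series([
    "The Handmaid's Tale",
    "the handmaid's tale",
    "Publication Manual, 6th Edition",
])

# title() upper-cases the letter after an apostrophe or a digit.
titled = titles.str.title()
print(titled.tolist())

# Alternative check: compare on a case-folded key, leaving the text untouched.
n_unique_raw = titles.nunique()
n_unique_casefold = titles.str.casefold().nunique()
print(n_unique_raw, n_unique_casefold)
```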

In [None]:
for col in categorical_columns:
    print(f'Number of values before strip() in {col}: {len(set(df[col]))}', f'Number of values after strip() in {col}: {len(set(df[col].str.strip()))}')

The counts are unchanged, so no values contain leading or trailing whitespace.
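Note that `strip()` only removes whitespace at the start and end of a string. If repeated spaces inside a title were a concern, a regular expression could collapse them as well. A minimal sketch on made-up names (assumed for illustration):

```python
import pandas as pd

# Made-up names: same title with a double space and with padding.
names = pd.Series(['Gone Girl', 'Gone  Girl', ' Gone Girl '])

# strip() fixes the padded entry but not the double space.
stripped = names.str.strip()

# Collapse any run of whitespace to one space, then strip the ends.
normalized = names.str.replace(r'\s+', ' ', regex=True).str.strip()

print(stripped.nunique(), normalized.nunique())
```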

### Author

In [None]:
len(df['Author'].unique())

In [None]:
df['Author'].sort_values().unique()

There are 2 authors in this dataset that appear multiple times under different spellings: George R. R. Martin and J. K. Rowling. Let's change the applicable rows so that they display the same spelling for the respective authors.

In [None]:
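# Apply the fixes to the Author column only, so no other column is touched
author_fixes = {'George R.R. Martin': 'George R. R. Martin', 'J.K. Rowling': 'J. K. Rowling'}
df['Author'] = df['Author'].replace(author_fixes)
df['Author'].nunique()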
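The two variants above were found by reading the sorted list. As a rough sketch of how such near-duplicates could be flagged programmatically, the snippet below builds a comparison key with periods and whitespace removed and reports every spelling whose key is shared. The `authors` Series is made up for illustration, and this key would not catch other kinds of variation such as typos.

```python
import pandas as pd

# Made-up author column containing two pairs of spelling variants.
authors = pd.Series([
    'J.K. Rowling', 'J. K. Rowling',
    'George R.R. Martin', 'George R. R. Martin',
    'Jeff Kinney',
])

# Comparison key: drop periods and whitespace, ignore case.
key = authors.str.replace(r'[.\s]+', '', regex=True).str.casefold()

# Keys that map to more than one distinct spelling are suspicious.
spellings_per_key = authors.groupby(key).nunique()
ambiguous_keys = spellings_per_key[spellings_per_key > 1].index

variants = sorted(authors[key.isin(ambiguous_keys)].unique())
print(variants)
```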

### Year

In [None]:
df['Year'].value_counts()

The "Year" column does not require any changes.

### Genre

In [None]:
df['Genre'].unique()

The "Genre" column does not require any more changes after changing its dtype.

## Initial Insights

This section will look into a few questions that we may be curious about:
1. Which books have appeared most often on Amazon's yearly bestseller list?
2. Which authors have had the most works appear on Amazon's yearly bestseller list?
3. Which books have the highest and lowest user ratings?
4. Which genre has higher user ratings: fiction or non-fiction?

### Which books have appeared most often on Amazon's yearly bestseller list?

In [None]:
plt.figure(figsize=(5,2), dpi=180)

df['Name'].value_counts().head(10).plot(kind='barh')

plt.title('Top 10 Books With The Most Occurrences')
plt.xlabel('# of Occurrences')
plt.ylabel('Book Title')
plt.yticks(fontsize=6)

plt.show()

There were 2 titles that were on Amazon's bestseller list almost every year from 2009-2019. These were *The 5 Love Languages: The Secret To Love That Lasts* and *Publication Manual Of The American Psychological Association, 6th Edition*. Only 8 titles had more than 5 occurrences on the bestseller list.

However, *The 5 Love Languages: The Secret To Love That Lasts* seems to have been renamed between 2009 and 2010, so the book itself, under either title, has been on the bestseller list every year.

In [None]:
df[df['Name'].str.contains('Love Languages')].sort_values('Year')

### Which authors have had the most works appear on Amazon's yearly bestseller list?

In [None]:
plt.figure(figsize=(5,2), dpi=180)

df['Author'].value_counts().head(10).plot(kind='barh')

plt.title('Top 10 Authors With The Most Occurrences')
plt.xlabel('# of Occurrences')
plt.ylabel('Author')
plt.yticks(fontsize=6)

plt.show()

Jeff Kinney has the most appearances on the bestseller list. Note that this chart counts rows (one per book per year), not distinct titles. Since the data covers only 11 years (2009-2019), an author with more than 11 appearances must have had multiple titles on the list in at least one of those years.
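The reasoning above can be checked directly by counting distinct titles per author and year. This is a minimal sketch on a made-up frame (authors, titles, and years are assumptions for illustration); the same grouping would apply to `df`.

```python
import pandas as pd

# Made-up rows: Author A has two different books on the list in 2009.
sample = pd.DataFrame({
    'Author': ['Author A', 'Author A', 'Author A', 'Author B', 'Author B'],
    'Name':   ['Book 1', 'Book 2', 'Book 1', 'Book 3', 'Book 3'],
    'Year':   [2009, 2009, 2010, 2009, 2010],
})

# Distinct titles each author had on the list in each year.
titles_per_year = sample.groupby(['Author', 'Year'])['Name'].nunique()

# Author/year combinations with more than one title.
multi_title_years = titles_per_year[titles_per_year > 1]
print(multi_title_years)

# Distinct titles per author overall, as opposed to total appearances.
distinct_titles = sample.groupby('Author')['Name'].nunique()
print(distinct_titles)
```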

## Summary statistics of numerical data

In [None]:
df.describe()

There is nothing strange about these statistics. It's reasonable that the number of reviews would vary so much since it's highly dependent on popularity.

The following plots are box-and-whisker plots to provide a visualization of these summary statistics.
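For reading the plots: with matplotlib's default settings, the box spans the first to third quartile, and points farther than 1.5 times the interquartile range (IQR) from the box are drawn as individual outliers. The sketch below reproduces that arithmetic on made-up prices (the values are assumptions for illustration).

```python
import pandas as pd

# Made-up prices with one extreme value.
prices = pd.Series([4, 8, 9, 11, 13, 15, 16, 105])

# Quartiles and interquartile range (the width of the box).
q1, q3 = prices.quantile([0.25, 0.75])
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the box; the whiskers end at the most
# extreme data points that still lie inside these fences.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are the points drawn individually.
outliers = prices[(prices < lower_fence) | (prices > upper_fence)]
print(q1, q3, iqr, lower_fence, upper_fence, outliers.tolist())
```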

### User Rating

In [None]:
plt.figure(figsize=(5,2), dpi=180)

plt.boxplot(df['User Rating'], vert=False)

plt.title('Boxplot of User Rating')
plt.yticks([1], labels=['User Rating'])

plt.show()

### Reviews

In [None]:
plt.figure(figsize=(5,2), dpi=180)

plt.boxplot(df['Reviews'], vert=False)
plt.title('Boxplot of # of Reviews')
plt.yticks([1], labels=['# of Reviews'])

plt.show()

### Price

In [None]:
plt.figure(figsize=(5,2), dpi=180)

plt.boxplot(df['Price'], vert=False)
plt.title('Boxplot of Price')
plt.yticks([1], labels=['Price'])

plt.show()

## Examining User Ratings

In [None]:
user_rating_counts = df['User Rating'].value_counts().sort_index()
user_rating_counts

The majority of average user ratings for Amazon's bestselling books seem to fall within the range of 4.6-4.8 stars. The following is a visualization of these numbers:

In [None]:
plt.figure(figsize=(4,3), dpi=180)

user_rating_counts.plot.bar()

plt.title('User Rating Frequencies')

plt.xlabel('User Rating')
plt.ylabel('Frequency')

plt.show()
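The claim about the 4.6-4.8 range can be turned into a number by computing the share of rows that fall inside it. A minimal sketch on made-up ratings (assumed for illustration); on the real data the input would be `df['User Rating']`.

```python
import pandas as pd

# Made-up ratings standing in for the 'User Rating' column.
ratings = pd.Series([4.7, 4.6, 4.8, 4.3, 4.9, 4.7, 4.0, 4.8])

# between() includes both endpoints by default.
in_range = ratings.between(4.6, 4.8)

# The mean of a boolean Series is the fraction of True values.
share = in_range.mean()
print(f'{share:.1%} of ratings fall within 4.6-4.8')
```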

## Correlations

In [None]:
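# Restrict to numeric columns; recent pandas versions raise an error on text columns
df.select_dtypes(include=[np.number]).corr()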

In [None]:
numerical_df = df.select_dtypes(include=[np.number])

plt.figure(figsize=(3,3), dpi=180)

heatmap = plt.imshow(numerical_df.corr(), cmap=plt.get_cmap('plasma'))

plt.title('Correlations Between Numeric Data')

plt.xticks(ticks=np.arange(len(numerical_df.columns)), labels=numerical_df.columns, rotation=25, ha='right', rotation_mode='anchor')
plt.yticks(ticks=np.arange(len(numerical_df.columns)), labels=numerical_df.columns)

corr = numerical_df.corr()  # compute once instead of on every loop iteration
for i in range(len(corr.index)):        # i indexes rows (y axis)
    for j in range(len(corr.columns)):  # j indexes columns (x axis)
        plt.text(j, i, f'{corr.iloc[i, j]:.3f}', size=8, color='white', va='center', ha='center')

plt.colorbar(heatmap, shrink=0.80)
        
plt.show()
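As a complement to the heatmap, the pairwise correlations can be listed in order of absolute strength, which makes the strongest relationship easy to spot. This is a rough sketch on a made-up frame (the numbers are assumptions for illustration, not values from the dataset).

```python
from itertools import combinations

import pandas as pd

# Made-up numeric columns standing in for the real ones.
sample = pd.DataFrame({
    'User Rating': [4.7, 4.6, 4.8, 4.3, 4.9, 4.5],
    'Reviews': [17350, 2052, 18979, 21424, 7665, 12643],
    'Price': [8, 22, 15, 6, 12, 11],
})

# Pearson correlation for each unordered pair of columns.
pairs = {
    (a, b): sample[a].corr(sample[b])
    for a, b in combinations(sample.columns, 2)
}

# Strongest relationships first, regardless of sign.
ranked = sorted(pairs.items(), key=lambda item: abs(item[1]), reverse=True)
for (a, b), r in ranked:
    print(f'{a} vs {b}: {r:.3f}')
```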

At a glance, there do not seem to be any strong correlations among the numeric columns. Let's take a closer look with some scatter plots: