# Goodreads Books with Genre

In [25]:
import kagglehub
import pandas as pd
import statistics

Start with data processing: import the data, check that every column has the correct type, and create a DataFrame.

In [26]:
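# Download latest version; dataset_download returns the local directory it saved to
path = kagglehub.dataset_download("middlelight/goodreadsbookswithgenres")

# Read the CSV from the downloaded directory
# (assumes the file sits at the top level of that directory)
df = pd.read_csv(f"{path}/Goodreads_books_with_genres.csv")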
df.head()


Unnamed: 0,Book Id,Title,Author,average_rating,isbn,isbn13,language_code,num_pages,ratings_count,text_reviews_count,publication_date,publisher,genres
0,1,Harry Potter and the Half-Blood Prince (Harry ...,J.K. Rowling/Mary GrandPré,4.57,0439785960,9780439785969,eng,652,2095690,27591,9/16/2006,Scholastic Inc.,"Fantasy;Young Adult;Fiction;Fantasy,Magic;Chil..."
1,2,Harry Potter and the Order of the Phoenix (Har...,J.K. Rowling/Mary GrandPré,4.49,0439358078,9780439358071,eng,870,2153167,29221,9/1/2004,Scholastic Inc.,"Fantasy;Young Adult;Fiction;Fantasy,Magic;Chil..."
2,4,Harry Potter and the Chamber of Secrets (Harry...,J.K. Rowling,4.42,0439554896,9780439554893,eng,352,6333,244,11/1/2003,Scholastic,"Fantasy;Fiction;Young Adult;Fantasy,Magic;Chil..."
3,5,Harry Potter and the Prisoner of Azkaban (Harr...,J.K. Rowling/Mary GrandPré,4.56,043965548X,9780439655484,eng,435,2339585,36325,5/1/2004,Scholastic Inc.,"Fantasy;Fiction;Young Adult;Fantasy,Magic;Chil..."
4,8,Harry Potter Boxed Set Books 1-5 (Harry Potte...,J.K. Rowling/Mary GrandPré,4.78,0439682584,9780439682589,eng,2690,41428,164,9/13/2004,Scholastic,"Fantasy;Young Adult;Fiction;Fantasy,Magic;Adve..."


In [27]:
print("(rows, cols):", df.shape)
df.dtypes

(rows, cols): (11127, 13)


Book Id                 int64
Title                  object
Author                 object
average_rating        float64
isbn                   object
isbn13                  int64
language_code          object
num_pages               int64
ratings_count           int64
text_reviews_count      int64
publication_date       object
publisher              object
genres                 object
dtype: object

In [28]:
df.count()

Book Id               11127
Title                 11127
Author                11127
average_rating        11127
isbn                  11127
isbn13                11127
language_code         11127
num_pages             11127
ratings_count         11127
text_reviews_count    11127
publication_date      11127
publisher             11127
genres                11030
dtype: int64
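The non-null counts above can also be read directly as null counts. The following is a minimal sketch on a made-up three-row frame (the `toy` DataFrame and its values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy frame standing in for the real data (values are illustrative only)
toy = pd.DataFrame({
    "Title": ["A", "B", "C"],
    "genres": ["Fantasy;Fiction", None, "History"],
})

# Count nulls per column, then keep only the columns that have at least one
null_counts = toy.isna().sum()
null_counts = null_counts[null_counts > 0]
print(null_counts)
```

Run against the real `df`, the same two lines should list only the columns that need cleaning.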

The `count()` output shows that `genres` is the only column with null values (11,030 non-null entries out of 11,127). Since the majority of our questions deal with genre, we are going to remove the 97 books without one. Because the `genres` column is so important, we also clean it up by splitting each semicolon-separated string into a list.

In [29]:
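# Drop all the rows without a genre (genres is the only column with nulls);
# copy so that later assignments modify this frame rather than a view
df = df.dropna(subset=['genres']).copy()

# Turn the genres column from a ';'-separated string into a list
df['genres'] = df['genres'].apply(lambda genre_string: genre_string.split(';'))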
df.head()


Unnamed: 0,Book Id,Title,Author,average_rating,isbn,isbn13,language_code,num_pages,ratings_count,text_reviews_count,publication_date,publisher,genres
0,1,Harry Potter and the Half-Blood Prince (Harry ...,J.K. Rowling/Mary GrandPré,4.57,0439785960,9780439785969,eng,652,2095690,27591,9/16/2006,Scholastic Inc.,"[Fantasy, Young Adult, Fiction, Fantasy,Magic,..."
1,2,Harry Potter and the Order of the Phoenix (Har...,J.K. Rowling/Mary GrandPré,4.49,0439358078,9780439358071,eng,870,2153167,29221,9/1/2004,Scholastic Inc.,"[Fantasy, Young Adult, Fiction, Fantasy,Magic,..."
2,4,Harry Potter and the Chamber of Secrets (Harry...,J.K. Rowling,4.42,0439554896,9780439554893,eng,352,6333,244,11/1/2003,Scholastic,"[Fantasy, Fiction, Young Adult, Fantasy,Magic,..."
3,5,Harry Potter and the Prisoner of Azkaban (Harr...,J.K. Rowling/Mary GrandPré,4.56,043965548X,9780439655484,eng,435,2339585,36325,5/1/2004,Scholastic Inc.,"[Fantasy, Fiction, Young Adult, Fantasy,Magic,..."
4,8,Harry Potter Boxed Set Books 1-5 (Harry Potte...,J.K. Rowling/Mary GrandPré,4.78,0439682584,9780439682589,eng,2690,41428,164,9/13/2004,Scholastic,"[Fantasy, Young Adult, Fiction, Fantasy,Magic,..."
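As a small illustration of the cleaning step above, here is a sketch on made-up data using the vectorized `Series.str.split` instead of `apply`. The `toy` frame is hypothetical; only the `;` separator and the `Parent,Child` subgenre format come from the dataset shown above.

```python
import pandas as pd

# Toy frame (illustrative values); the second row has no genre
toy = pd.DataFrame({
    "Title": ["A", "B", "C"],
    "genres": ["Fantasy;Young Adult;Fantasy,Magic", None, "History"],
})

# Drop only the rows whose genres value is missing
toy = toy.dropna(subset=["genres"]).copy()

# Split on ';' only, so a subgenre tag such as 'Fantasy,Magic' stays intact
toy["genres"] = toy["genres"].str.split(";")
print(toy)
```

Splitting on `;` rather than `,` matters here: the comma is part of the subgenre tags, not a separator between genres.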


Next, since we will be using the publication date, we want to convert the string into a datetime object for easier computation.

In [30]:
# Convert to datetime
df.loc[:, 'publication_date'] = df['publication_date'].apply(lambda date: pd.to_datetime(date,  errors='coerce'))

# Look at rows that are NaT (not a time)
nat_rows = df[df['publication_date'].isna()]
nat_rows

Unnamed: 0,Book Id,Title,Author,average_rating,isbn,isbn13,language_code,num_pages,ratings_count,text_reviews_count,publication_date,publisher,genres
8180,31373,In Pursuit of the Proper Sinner (Inspector Lyn...,Elizabeth George,4.1,553575104,9780553575101,eng,718,10608,295,NaT,Bantam Books,"[Mystery, Fiction, Mystery,Crime, Thriller,Mys..."
11098,45531,Montaillou village occitan de 1294 à 1324,Emmanuel Le Roy Ladurie/Emmanuel Le Roy-Ladurie,3.96,2070323285,9782070323289,fre,640,15,2,NaT,Folio histoire,"[History, Nonfiction, Cultural,France, Histori..."
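For reference, a vectorized alternative to the per-row `apply` is sketched below on made-up strings. The explicit `format="%m/%d/%Y"` is an assumption based on the dates visible in the table above, and the third value is an invented impossible date used only to show how `errors="coerce"` produces `NaT`; it is not a claim about why the two rows above failed to parse.

```python
import pandas as pd

# Made-up date strings; the last one is deliberately not a real calendar date
dates = pd.Series(["9/16/2006", "9/1/2004", "2/30/2001"])

# Parse the whole column at once; unparseable values become NaT
parsed = pd.to_datetime(dates, format="%m/%d/%Y", errors="coerce")

print(parsed.dtype)         # a datetime64 dtype, so the .dt accessor is available
print(parsed.isna().sum())  # number of values that could not be parsed
print(parsed.dt.year)
```

Checking the resulting dtype is worthwhile, since date arithmetic and the `.dt` accessor rely on the column being a true datetime column rather than generic objects.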


Only two rows lack a valid date, so we feel comfortable removing them: dropping two of roughly 11,000 rows will not have a meaningful impact on our data.

In [31]:
# Remove the rows with NaT
df = df.dropna(subset=['publication_date'])
df.head()

Unnamed: 0,Book Id,Title,Author,average_rating,isbn,isbn13,language_code,num_pages,ratings_count,text_reviews_count,publication_date,publisher,genres
0,1,Harry Potter and the Half-Blood Prince (Harry ...,J.K. Rowling/Mary GrandPré,4.57,0439785960,9780439785969,eng,652,2095690,27591,2006-09-16 00:00:00,Scholastic Inc.,"[Fantasy, Young Adult, Fiction, Fantasy,Magic,..."
1,2,Harry Potter and the Order of the Phoenix (Har...,J.K. Rowling/Mary GrandPré,4.49,0439358078,9780439358071,eng,870,2153167,29221,2004-09-01 00:00:00,Scholastic Inc.,"[Fantasy, Young Adult, Fiction, Fantasy,Magic,..."
2,4,Harry Potter and the Chamber of Secrets (Harry...,J.K. Rowling,4.42,0439554896,9780439554893,eng,352,6333,244,2003-11-01 00:00:00,Scholastic,"[Fantasy, Fiction, Young Adult, Fantasy,Magic,..."
3,5,Harry Potter and the Prisoner of Azkaban (Harr...,J.K. Rowling/Mary GrandPré,4.56,043965548X,9780439655484,eng,435,2339585,36325,2004-05-01 00:00:00,Scholastic Inc.,"[Fantasy, Fiction, Young Adult, Fantasy,Magic,..."
4,8,Harry Potter Boxed Set Books 1-5 (Harry Potte...,J.K. Rowling/Mary GrandPré,4.78,0439682584,9780439682589,eng,2690,41428,164,2004-09-13 00:00:00,Scholastic,"[Fantasy, Young Adult, Fiction, Fantasy,Magic,..."


Now let's do some basic exploration of the data, looking at the range of publication dates, number of pages, average rating, and number of ratings, along with the lists of publishers and genres.

In [None]:
# Range of publication dates
min_date = df['publication_date'].min()
max_date = df['publication_date'].max()
print("Publication date ranges from", min_date, "to", max_date)

# Range of number of pages
min_pages = df['num_pages'].min()
max_pages = df['num_pages'].max()
print("Number of pages ranges from", min_pages, "to", max_pages)

# Range of average rating
min_average_rating = df['average_rating'].min()
max_average_rating = df['average_rating'].max()
print("Average rating ranges from", min_average_rating, "to", max_average_rating)

# Range of number of ratings
min_ratings = df['ratings_count'].min()
max_ratings = df['ratings_count'].max()
print("Number of ratings ranges from", min_ratings, "to", max_ratings)

# List of publishers
unique_publishers = df['publisher'].unique()
print("List of publishers:", unique_publishers)

# List of genres
all_genres = df['genres'].explode()
unique_genres = all_genres.unique()
print("List of genres:", unique_genres)

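def filter_genres(genres):
    # Subgenres are written as 'Parent,Child' (e.g. 'Fantasy,Magic');
    # keep only the tags without a comma, i.e. the main genres
    return [genre for genre in genres if ',' not in genre]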

main_genres = df['genres'].apply(filter_genres).explode()
unique_main_genres = main_genres.unique()
print("List of MAIN genres:", unique_main_genres)
print("Number of genres:", len(unique_genres))
print("Number of MAIN genres:", len(unique_main_genres))


Publication date ranges from 1900-01-01 00:00:00 to 2020-03-31 00:00:00
Number of pages ranges from 0 to 6576
Average rating ranges from 0.0 to 5.0
Number of ratings ranges from 0 to 4597666
List of publishers: ['Scholastic Inc.' 'Scholastic' 'Nimble Books' ... 'Suma'
 'Panamericana Editorial' 'Editorial Presença']
List of genres: ['Fantasy' 'Young Adult' 'Fiction' 'Fantasy,Magic' 'Childrens' 'Adventure'
 'Audiobook' 'Childrens,Middle Grade' 'Classics' 'Science Fiction Fantasy'
 'Fantasy,Supernatural' 'Mystery' 'Fantasy,Paranormal' 'Novels'
 'Paranormal,Wizards' 'Science Fiction' 'Humor' 'Humor,Comedy'
 'European Literature,British Literature' 'Nonfiction' 'Science' 'History'
 'Science,Physics' 'Science,Popular Science' 'Historical' 'Philosophy'
 'Unfinished' 'Travel' 'Cultural,Africa' 'Autobiography,Memoir'
 'Eastern Africa,Kenya' 'Biography' 'Travel,Travelogue' 'Language,Writing'
 'Humanities,Language' 'Reference' 'Humanities,Linguistics'
 'Language,Words' 'Reference,Dictionaries' 'W
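The numeric ranges printed above can also be collected in a single table. This is a minimal sketch on a made-up frame (the `toy` values are illustrative); it assumes the same column names as the dataset.

```python
import pandas as pd

# Toy frame with the three numeric columns explored above (illustrative values)
toy = pd.DataFrame({
    "num_pages": [652, 0, 352],
    "average_rating": [4.57, 0.0, 4.42],
    "ratings_count": [2095690, 0, 6333],
})

# One row per statistic, one column per variable
ranges = toy.agg(["min", "max"])
print(ranges)
```

A table like this makes suspicious minimums, such as zero pages or zero ratings, easy to spot side by side.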

- **<font color='red'>I think we should remove the subgenres like Fantasy,Epic and just keep the broader genre</font>**
    - or shrink to a smaller set of genres
- **<font color='red'>I think we should remove the rows that have 0 as the number of pages</font>**
- **<font color='red'>Should we do the same for rows with 0 as the average rating or number of ratings?</font>**
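If we go ahead with the ideas above, one possible way to do it is sketched below on made-up data. Nothing here has been decided: the `toy` frame, the thresholds, and the helper `to_broad_genres` are hypothetical. Note that this helper maps each `Parent,Child` tag to its parent, which differs from `filter_genres`, which drops such tags entirely.

```python
import pandas as pd

# Toy frame (illustrative values) with genres already split into lists
toy = pd.DataFrame({
    "Title": ["A", "B", "C"],
    "num_pages": [652, 0, 352],
    "ratings_count": [2095690, 12, 0],
    "genres": [["Fantasy", "Fantasy,Magic"], ["History"], ["Fiction"]],
})

# Keep only rows with a positive page count and at least one rating
kept = toy[(toy["num_pages"] > 0) & (toy["ratings_count"] > 0)].copy()

def to_broad_genres(genres):
    """Map 'Parent,Child' tags to 'Parent' and drop duplicates, keeping order."""
    return list(dict.fromkeys(genre.split(",")[0] for genre in genres))

kept["genres"] = kept["genres"].apply(to_broad_genres)
print(kept)
```

One caveat with mapping to the parent: for a tag like `Eastern Africa,Kenya`, the parent may not itself be one of the main genres, so dropping subgenres with `filter_genres` may still be the safer choice.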



In [62]:
# Genres with the most books
top_genres = main_genres.value_counts().head(25)
top_genres

genres
Fiction                    6933
Classics                   3333
Nonfiction                 3107
Literature                 2845
Fantasy                    2550
Novels                     2529
Audiobook                  1637
History                    1616
Mystery                    1603
Romance                    1547
Historical                 1396
Contemporary               1358
Adventure                  1304
Adult                      1263
Young Adult                1259
Philosophy                 1234
Science Fiction            1166
Childrens                  1157
Humor                      1129
Thriller                   1099
Biography                  1071
Literary Fiction           1049
Science Fiction Fantasy     968
Short Stories               871
Horror                      748
Name: count, dtype: int64

**<font color='red'>Which genres should we focus on?</font>**
- classics
- fantasy
- history
- mystery
- romance
- young adult
- science fiction
- childrens
- humor
- thriller
- biography
- short stories
- horror

Am I missing any that should be there?
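Once the list is settled, restricting the data to those genres could look something like the sketch below. The `selected` set simply copies the candidates listed above (spelled as they appear in the value counts), and the `toy` frame is made up for illustration.

```python
import pandas as pd

# Candidate genres from the list above, spelled as in the value counts
selected = {
    "Classics", "Fantasy", "History", "Mystery", "Romance", "Young Adult",
    "Science Fiction", "Childrens", "Humor", "Thriller", "Biography",
    "Short Stories", "Horror",
}

# Toy frame (illustrative values) with genres already split into lists
toy = pd.DataFrame({
    "Title": ["A", "B", "C"],
    "genres": [["Fantasy", "Fiction", "Young Adult"], ["Nonfiction"], ["Horror", "Thriller"]],
})

# One row per (book, genre) pair, then keep only the selected genres
exploded = toy.explode("genres")
in_scope = exploded[exploded["genres"].isin(selected)]

# Books per selected genre
counts = in_scope["genres"].value_counts()
print(counts)
```

Because a book can carry several selected genres, it appears once per genre in `in_scope`; per-genre statistics should be computed with that duplication in mind.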