# Part 2: Goodreads Dataset

Data source: https://sites.google.com/eng.ucsd.edu/ucsdbookgraph/home

The Goodreads dataset is split into four files: books, works, genres, and authors. We'll join these files into a single dataset that keeps only the fields relevant to our analysis.

Since the files are large, we'll first convert them to Parquet format for faster processing.

In [1]:
import json
import polars as pl
import pandas as pd
from rapidfuzz import process
import pickle

In [3]:
# Read the genres file, which holds one JSON object per line
genres = []
with open('./goodreads/json/goodreads genres.json') as f:
    for line in f:
        genres.append(json.loads(line))

genres_df = pd.DataFrame(genres)
# Each 'genres' value is a dict keyed by genre label; keep only the non-empty labels
genres_df['genres'] = genres_df['genres'].apply(lambda x: [i for i in x if i != ''])
genres_df.to_parquet('./goodreads/genres.parquet')
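
To make the genre clean-up concrete, here is a minimal sketch on made-up records. It assumes each line of the genres file is a JSON object whose `genres` field is a dict keyed by genre label; the sample lines and the `parse_genre_line` helper are hypothetical and not taken from the dataset.

```python
import json

# Hypothetical sample lines; the real file is assumed to have the same shape
sample_lines = [
    '{"book_id": "1", "genres": {"fiction": 12, "": 3, "romance": 5}}',
    '{"book_id": "2", "genres": {}}',
]


def parse_genre_line(line):
    """Parse one JSON line and reduce its genres dict to a list of non-empty labels."""
    record = json.loads(line)
    # Iterating a dict yields its keys, so no explicit .keys() call is needed
    record["genres"] = [label for label in record["genres"] if label != ""]
    return record


parsed = [parse_genre_line(line) for line in sample_lines]
print(parsed)
```

A book with no genre labels ends up with an empty list rather than being dropped, which matters for the inner join later on.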

In [4]:
authors_df = pl.read_ndjson('./goodreads/json/goodreads authors.json')

authorskeep = ['author_id', 'name']
authors_df = authors_df[authorskeep]

authors_df.write_parquet('./goodreads/authors.parquet')

author_id = authors_df['author_id'].to_list()
author_name = authors_df['name'].to_list()

author_dict = dict(zip(author_id, author_name))

# pickle author dict
with open('./goodreads/author_dict.pickle', 'wb') as handle:
    pickle.dump(author_dict, handle, protocol=pickle.HIGHEST_PROTOCOL)

In [5]:
books_df = pl.read_ndjson('./goodreads/json/Goodreads Books.json')
books_df.write_parquet('./goodreads/books.parquet')

In [12]:
books_df = pl.read_parquet('./goodreads/books.parquet')

In [13]:
genres_df = pl.read_parquet('./goodreads/genres.parquet')

In [14]:
keep = [
    'country_code',
    'language_code',
    'average_rating',
    'description',
    'authors',
    'publisher',
    'num_pages',
    'publication_year',
    'image_url',
    'ratings_count',
    'book_id',
    'work_id',
    'title'
]

books_df = books_df[keep]

# join genres df to books df
books_df = books_df.join(genres_df, on='book_id', how='inner')

books_df.write_parquet('./goodreads/booksgenres.parquet')
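
Because the join uses `how='inner'`, a book is kept only if its `book_id` appears in both tables. The plain-Python sketch below mimics that behavior on toy rows; the rows and the `inner_join` helper are illustrative assumptions and say nothing about how polars implements joins.

```python
def inner_join(left_rows, right_rows, key):
    """Combine rows from both lists whose `key` values match."""
    # Index the right-hand rows by key for constant-time lookups
    right_by_key = {}
    for row in right_rows:
        right_by_key.setdefault(row[key], []).append(row)

    joined = []
    for left in left_rows:
        # Left rows without a match on the right are dropped
        for right in right_by_key.get(left[key], []):
            joined.append({**left, **right})
    return joined


# Hypothetical toy data
books = [{"book_id": "1", "title": "A"}, {"book_id": "2", "title": "B"}]
genres = [
    {"book_id": "1", "genres": ["fiction"]},
    {"book_id": "3", "genres": ["poetry"]},
]

result = inner_join(books, genres, "book_id")
print(result)
```

Comparing the row counts before and after the real join is a quick way to see how many books had no genre record.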

In [15]:
books_df = pd.read_parquet('./goodreads/booksgenres.parquet')
authors_list = books_df['authors'].to_list()
with open('./goodreads/author_dict.pickle', 'rb') as handle:
    author_dict = pickle.load(handle)

# Replace each book's list of author records with the matching author names
for i in range(len(authors_list)):
    authors_list[i] = [author_dict[a['author_id']] for a in authors_list[i]]

books_df['authors'] = authors_list

books_df['author_string'] = books_df['authors'].apply(lambda x: ', '.join(x))
books_df['author_title'] = books_df['author_string'] + ' - ' + books_df['title']

books_df.to_parquet('./goodreads/booksgenresauthors.parquet')
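
The `author_dict[...]` lookup raises a `KeyError` if a book references an `author_id` that is missing from the authors file. Whether that ever happens in this dataset is not shown here. As a defensive sketch, a helper could skip unknown ids instead; `resolve_authors` and the toy data below are hypothetical.

```python
def resolve_authors(author_entries, author_dict):
    """Map a book's author records to names, skipping ids missing from author_dict."""
    names = []
    for entry in author_entries:
        name = author_dict.get(entry["author_id"])
        if name is not None:
            names.append(name)
    return names


# Hypothetical toy data: id "99" has no entry in the lookup table
author_dict = {"10": "Anita Diamant", "11": "Barbara Hambly"}
book_authors = [{"author_id": "10"}, {"author_id": "99"}]

names = resolve_authors(book_authors, author_dict)
author_string = ", ".join(names)
print(author_string)
```

Silently skipping ids hides data problems, so logging the missing ids may be preferable in practice.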

In [16]:
books_df.head()

| | country_code | language_code | average_rating | description | authors | publisher | num_pages | publication_year | image_url | ratings_count | book_id | work_id | title | genres | author_string | author_title |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | US | | 4.0 | | [Ronald J. Fields] | St. Martin's Press | 256.0 | 1984.0 | https://images.gr-assets.com/books/1310220028m... | 3 | 5333265 | 5400751 | W.C. Fields: A Life on Film | [history, historical fiction, biography] | Ronald J. Fields | Ronald J. Fields - W.C. Fields: A Life on Film |
| 1 | US | | 3.23 | Anita Diamant's international bestseller "The ... | [Anita Diamant] | Simon & Schuster Audio | | 2001.0 | https://s.gr-assets.com/assets/nophoto/book/11... | 10 | 1333909 | 1323437 | Good Harbor | [fiction, history, historical fiction, biography] | Anita Diamant | Anita Diamant - Good Harbor |
| 2 | US | eng | 4.03 | Omnibus book club edition containing the Ladie... | [Barbara Hambly] | Nelson Doubleday, Inc. | 600.0 | 1987.0 | https://images.gr-assets.com/books/1304100136m... | 140 | 7327624 | 8948723 | The Unschooled Wizard (Sun Wolf and Starhawk, ... | [fantasy, paranormal, fiction, mystery, thrill... | Barbara Hambly | Barbara Hambly - The Unschooled Wizard (Sun Wo... |
| 3 | US | eng | 3.49 | Addie Downs and Valerie Adler were eight when ... | [Jennifer Weiner] | Atria Books | 368.0 | 2009.0 | https://s.gr-assets.com/assets/nophoto/book/11... | 51184 | 6066819 | 6243154 | Best Friends Forever | [fiction, romance, mystery, thriller, crime] | Jennifer Weiner | Jennifer Weiner - Best Friends Forever |
| 4 | US | | 3.4 | | [Nigel Pennick] | | | | https://images.gr-assets.com/books/1413219371m... | 15 | 287140 | 278577 | Runic Astrology: Starcraft and Timekeeping in ... | [non-fiction] | Nigel Pennick | Nigel Pennick - Runic Astrology: Starcraft and... |



In [24]:
books_df = pd.read_parquet('./goodreads/booksgenresauthors.parquet')

authortitle = books_df['author_title'].to_list()

# pickle author title list
with open('goodreads_titles.pickle', 'wb') as handle:
    pickle.dump(authortitle, handle, protocol=pickle.HIGHEST_PROTOCOL)
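
`rapidfuzz` is imported at the top but not used in this part, so the pickled list presumably feeds an approximate title lookup later on; that is an assumption. The sketch below round-trips a small list through a pickle file and runs a fuzzy lookup with the standard library's `difflib`, used here as a stand-in rather than rapidfuzz. The temporary file name and the misspelled query are hypothetical; the titles come from the preview above.

```python
import difflib
import os
import pickle
import tempfile

titles = [
    "Anita Diamant - Good Harbor",
    "Jennifer Weiner - Best Friends Forever",
    "Ronald J. Fields - W.C. Fields: A Life on Film",
]

# Round-trip the list through a pickle file in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "titles.pickle")
    with open(path, "wb") as handle:
        pickle.dump(titles, handle, protocol=pickle.HIGHEST_PROTOCOL)
    with open(path, "rb") as handle:
        loaded = pickle.load(handle)

# Approximate lookup for a misspelled query (difflib stand-in, not rapidfuzz)
query = "Jennifer Weiner - Best Freinds Forever"
matches = difflib.get_close_matches(query, loaded, n=1, cutoff=0.6)
print(matches)
```

Only unpickle files you created yourself, since `pickle.load` can execute arbitrary code from untrusted input.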