Goodreads

Dataset Description
This dataset contains more than 1.3M book reviews covering 25,475 books and 18,892 users. It is a review subset built for spoiler detection, in which each book and each user has at least one associated spoiler review.

Goodreads Books Review Rating Prediction
Reviews are a good way to judge the quality of any product, whether books, clothes, or technology. When buying something online, the first things most people check are the reviews from past buyers and the product's overall rating.
Reader feedback, whether positive or negative, five stars or one star, gives the product owner a reason to make improvements.
Book reviews, whether left on Amazon, Goodreads, or social media, also encourage reader connection and engagement, and they let prospective readers see whether others are enjoying the book.

In this competition you will work with a challenging dataset consisting of reviews from the Goodreads book review website, along with a variety of attributes describing the items. The task is to predict the review rating, which ranges from 0 to 5.

In [17]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, f1_score
from textblob import TextBlob
from sklearn.preprocessing import LabelEncoder

In [25]:
# Load the data
# Raw string so the backslashes in the Windows path are not read as escape sequences
path = r".\goodreads-books-reviews-290312\goodreads_train_sample.csv"
df = pd.read_csv(path)
print(df.shape)
df.head()

(90000, 12)


Unnamed: 0.1,Unnamed: 0,user_id,book_id,review_id,rating,review_text,date_added,date_updated,read_at,started_at,n_votes,n_comments
0,898402,ed556b92506c3452b42fffed31697a1a,25125233,182718faad99666b70f73f3b7ffbbdb7,3,Jessica Broussard has only ever had her father...,Thu Nov 19 07:48:55 -0800 2015,Thu Nov 19 11:31:55 -0800 2015,Wed Nov 18 00:00:00 -0800 2015,Tue Nov 17 00:00:00 -0800 2015,0,0
1,853518,fcf6bca39e8f5333ba018b0e146ccfec,6837103,228c47ca18ed4c1598ae3c9214530b6b,5,"Set in the late 1700s and early 1800s, the sto...",Tue Dec 06 13:19:29 -0800 2011,Mon Dec 19 11:28:19 -0800 2011,Mon Dec 19 00:00:00 -0800 2011,Tue Dec 06 00:00:00 -0800 2011,0,0
2,366741,b8f6f163c2161555c6d887632b2ff4a2,17948485,6137b1fe0159b7eaa56a04293a00fd49,5,This book is the bomb! Love every single page ...,Fri Jul 26 17:43:19 -0700 2013,Sat Aug 10 20:06:19 -0700 2013,Sat Jul 27 00:00:00 -0700 2013,Fri Jul 26 00:00:00 -0700 2013,0,0
3,476233,33162c8e64b16bcbddc9808f3c716342,18405,c5b3dc0c0416d850380d80f5304be91f,5,Wherein I attempt to write a review using all ...,Wed Jun 30 08:01:44 -0700 2010,Tue Dec 31 06:07:21 -0800 2013,Fri Feb 18 00:00:00 -0800 2011,Sat Feb 12 00:00:00 -0800 2011,46,14
4,856723,37d8353490e210e2b3766336be99ebd4,26218626,234b51de9a79dfd51b5dc2b48df972ec,3,I can't believe I have only one volume left. T...,Wed Jun 29 19:48:42 -0700 2016,Thu Mar 30 07:18:31 -0700 2017,Thu Mar 30 07:18:31 -0700 2017,Thu Mar 30 00:00:00 -0700 2017,0,0


In [41]:
# Drop the leftover index column that came along with the CSV
df = df.drop(["Unnamed: 0"], axis=1)

In [42]:
# Check the dtype of each column
df.dtypes

user_id         object
book_id          int64
review_id       object
rating           int64
review_text     object
date_added      object
date_updated    object
read_at         object
started_at      object
n_votes          int64
n_comments       int64
dtype: object
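
The four date columns load as `object`, meaning plain strings, so they cannot be used numerically as they are. The following is a minimal sketch of one way to parse them, assuming every value follows the pattern seen in the sample rows above. The helper `parse_goodreads_dates` and the `days_to_read` column are hypothetical names and are not part of the original notebook.

```python
import pandas as pd

# Format inferred from the sample rows, e.g. "Thu Nov 19 07:48:55 -0800 2015"
DATE_FORMAT = "%a %b %d %H:%M:%S %z %Y"

# Two started_at / read_at pairs copied from the df.head() output
sample = pd.DataFrame({
    "started_at": ["Tue Nov 17 00:00:00 -0800 2015",
                   "Tue Dec 06 00:00:00 -0800 2011"],
    "read_at": ["Wed Nov 18 00:00:00 -0800 2015",
                "Mon Dec 19 00:00:00 -0800 2011"],
})

def parse_goodreads_dates(frame, columns):
    """Return a copy of frame with the given string columns parsed to UTC datetimes."""
    parsed = frame.copy()
    for col in columns:
        # utc=True normalises the mixed -0700 / -0800 offsets to a single timezone
        parsed[col] = pd.to_datetime(parsed[col], format=DATE_FORMAT, utc=True)
    return parsed

parsed = parse_goodreads_dates(sample, ["started_at", "read_at"])

# Hypothetical derived feature: whole days between starting and finishing a book
parsed["days_to_read"] = (parsed["read_at"] - parsed["started_at"]).dt.days
print(parsed["days_to_read"].tolist())
```

A numeric column such as `days_to_read` could then be kept as a feature instead of dropping all date columns, although whether it helps the prediction is untested here.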

In [None]:
# Count the non-null values in each column
df.count()

Unnamed: 0      61662
user_id         61662
book_id         61662
review_id       61662
rating          61662
review_text     61662
date_added      61662
date_updated    61662
read_at         61662
started_at      61662
n_votes         61662
n_comments      61662
dtype: int64

In [None]:
# Count the missing values in each column
df.isna().sum(axis=0)

Unnamed: 0      0
user_id         0
book_id         0
review_id       0
rating          0
review_text     0
date_added      0
date_updated    0
read_at         0
started_at      0
n_votes         0
n_comments      0
dtype: int64

In [47]:
# Drop rows in which every column is missing
df = df.dropna(how='all', axis=0)

In [48]:
df.describe(include='all')

Unnamed: 0,user_id,book_id,review_id,rating,review_text,date_added,date_updated,read_at,started_at,n_votes,n_comments
count,61662,61662.0,61662,61662.0,61662,61662,61662,61662,61662,61662.0,61662.0
unique,8906,,61662,,61381,61641,61513,28535,6242,,
top,843a44e2499ba9362b47a089b0b0ce75,,182718faad99666b70f73f3b7ffbbdb7,,Review to come.,Sat Jul 29 09:26:28 -0700 2017,Fri Jul 19 06:15:59 -0700 2013,Fri Jan 01 00:00:00 -0800 2016,Sun Jan 01 00:00:00 -0800 2017,,
freq,173,,1,,43,2,4,48,80,,
mean,,14575800.0,,3.793876,,,,,,3.631329,1.152833
std,,9167256.0,,1.1283,,,,,,15.489702,5.975086
min,,1.0,,0.0,,,,,,0.0,0.0
25%,,7746506.0,,3.0,,,,,,0.0,0.0
50%,,15767850.0,,4.0,,,,,,0.0,0.0
75%,,21857390.0,,5.0,,,,,,2.0,0.0


In [49]:
# Replace the raw review text with its TextBlob sentiment polarity (a float in [-1, 1])
df['review_text'] = df['review_text'].apply(lambda x: TextBlob(x).sentiment.polarity)

# The review_text column now holds the sentiment scores
df.head()

Unnamed: 0,user_id,book_id,review_id,rating,review_text,date_added,date_updated,read_at,started_at,n_votes,n_comments
0,ed556b92506c3452b42fffed31697a1a,25125233,182718faad99666b70f73f3b7ffbbdb7,3,0.143082,Thu Nov 19 07:48:55 -0800 2015,Thu Nov 19 11:31:55 -0800 2015,Wed Nov 18 00:00:00 -0800 2015,Tue Nov 17 00:00:00 -0800 2015,0,0
1,fcf6bca39e8f5333ba018b0e146ccfec,6837103,228c47ca18ed4c1598ae3c9214530b6b,5,0.101091,Tue Dec 06 13:19:29 -0800 2011,Mon Dec 19 11:28:19 -0800 2011,Mon Dec 19 00:00:00 -0800 2011,Tue Dec 06 00:00:00 -0800 2011,0,0
2,b8f6f163c2161555c6d887632b2ff4a2,17948485,6137b1fe0159b7eaa56a04293a00fd49,5,0.203858,Fri Jul 26 17:43:19 -0700 2013,Sat Aug 10 20:06:19 -0700 2013,Sat Jul 27 00:00:00 -0700 2013,Fri Jul 26 00:00:00 -0700 2013,0,0
3,33162c8e64b16bcbddc9808f3c716342,18405,c5b3dc0c0416d850380d80f5304be91f,5,0.07921,Wed Jun 30 08:01:44 -0700 2010,Tue Dec 31 06:07:21 -0800 2013,Fri Feb 18 00:00:00 -0800 2011,Sat Feb 12 00:00:00 -0800 2011,46,14
4,37d8353490e210e2b3766336be99ebd4,26218626,234b51de9a79dfd51b5dc2b48df972ec,3,-0.058333,Wed Jun 29 19:48:42 -0700 2016,Thu Mar 30 07:18:31 -0700 2017,Thu Mar 30 07:18:31 -0700 2017,Thu Mar 30 00:00:00 -0700 2017,0,0


In [50]:
# Encode the identifier columns as integer codes (book_id is already an integer)
labelencoder = LabelEncoder()
df['user_id'] = labelencoder.fit_transform(df['user_id'])
df['book_id'] = labelencoder.fit_transform(df['book_id'])
df['review_id'] = labelencoder.fit_transform(df['review_id'])
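
To make clear what `LabelEncoder` does to these columns, here is a small sketch on toy values. It assigns each distinct value an integer according to sorted order, so the resulting codes carry no real magnitude. The sketch also keeps one encoder per column, which is an assumption about what might be convenient and not what the cell above does: refitting a single encoder discards the earlier mapping. The toy identifiers and the `encoders` dictionary are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy identifiers standing in for the hashed user_id / review_id strings
toy = pd.DataFrame({
    "user_id": ["b", "a", "b", "c"],
    "review_id": ["r3", "r1", "r2", "r0"],
})

# One encoder per column so each mapping can be inverted later
encoders = {}
for col in ["user_id", "review_id"]:
    encoders[col] = LabelEncoder()
    toy[col] = encoders[col].fit_transform(toy[col])

print(toy["user_id"].tolist())      # codes follow the sorted order of the labels
print(toy["review_id"].tolist())

# Recover the original labels from the integer codes
original_users = encoders["user_id"].inverse_transform(toy["user_id"]).tolist()
print(original_users)
```

Because `review_id` is unique for every row, its code is effectively a row number, and a linear model will treat it as an ordered quantity.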

In [51]:
# Extract features and target variable
X = df.drop(["rating",'date_added','date_updated','read_at','started_at'], axis=1)
y = df["rating"]

In [52]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [53]:
# Fit a linear regression model to the training data
model = LinearRegression()
model.fit(X_train, y_train)

LinearRegression()

In [54]:
# Predict the star ratings on the test data
y_pred = model.predict(X_test)

In [55]:
# Round the predicted values to the nearest integer to obtain class labels
y_pred_class = np.round(y_pred)
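
A linear regression is unbounded, so rounding alone can produce labels outside the 0 to 5 rating range. The sketch below shows one possible guard on made-up prediction values: round, clip to the valid range, then cast to integers. The array `example_pred` is hypothetical, and this variant was not used to produce the score reported in this notebook.

```python
import numpy as np

# Made-up regression outputs, including values below 0 and above 5
example_pred = np.array([-0.4, 2.5, 3.5, 5.7, 4.49])

# np.round rounds halves to the nearest even integer (2.5 -> 2.0, 3.5 -> 4.0)
rounded = np.round(example_pred)

# Keep labels inside the valid rating range and store them as integers
example_class = np.clip(rounded, 0, 5).astype(int)
print(example_class.tolist())
```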

In [56]:
# Calculate the F1 score
f1 = f1_score(y_test, y_pred_class, average='weighted')
print("F1 Score:", f1)

F1 Score: 0.25887015534218
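
For reference, `average='weighted'` computes an F1 score per rating class and averages them, weighting each class by how often it appears in the true labels. The following plain-Python sketch reproduces that calculation on made-up labels. The function `weighted_f1` is a hypothetical helper written for illustration and is not scikit-learn's implementation.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)          # number of true samples per class
    total = len(y_true)
    score = 0.0
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # A class with no true or predicted samples contributes an F1 of 0
        f1 = 2 * tp / denom if denom else 0.0
        score += support[label] / total * f1
    return score

# Made-up ratings: the model over-predicts 4 stars
toy_true = [5, 5, 4, 3, 3, 0]
toy_pred = [5, 4, 4, 3, 4, 4]
toy_score = weighted_f1(toy_true, toy_pred)
print(round(toy_score, 4))
```

Classes that appear only in the predictions get zero weight, so the score mainly reflects how well the frequent ratings are recovered.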


In [57]:
# Check if the F1 score is over 70%
if f1 > 0.7:
    print("The model has a good F1 score.")
else:
    print("The model needs improvement. Try using a different model or adding more features.")

The model needs improvement. Try using a different model or adding more features.
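
Before trying other models, it can help to know what a trivial predictor scores, since that puts a weighted F1 number in context. The sketch below uses scikit-learn's `DummyClassifier` on made-up ratings. The toy arrays are hypothetical, and no baseline figure for the real data is claimed here.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score

# Made-up ratings; the features are ignored by the dummy model
toy_y_train = np.array([5, 4, 4, 3, 4, 5, 4, 2])
toy_y_test = np.array([4, 5, 4, 3])
toy_X_train = np.zeros((len(toy_y_train), 1))
toy_X_test = np.zeros((len(toy_y_test), 1))

# Always predict the most frequent rating seen during training
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(toy_X_train, toy_y_train)
baseline_pred = baseline.predict(toy_X_test)

# zero_division=0 silences warnings for classes that are never predicted
baseline_f1 = f1_score(toy_y_test, baseline_pred, average="weighted", zero_division=0)
print("Baseline F1:", baseline_f1)
```

Running the same two steps on `X_train`, `y_train`, `X_test`, and `y_test` would show whether the linear regression beats a constant guess.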
