![bookstore](bookstore.jpg)


Identifying popular products is incredibly important for e-commerce companies! Popular products generate more revenue and, therefore, play a key role in stock control.

You've been asked to support an online bookstore by building a model to predict whether a book will be popular or not. They've supplied you with an extensive dataset containing information about all books they've sold, including:

* `price`
* `popularity` (target variable)
* `review/summary`
* `review/text`
* `review/helpfulness`
* `authors`
* `categories`

You'll need to build a model that predicts whether a book will be rated as popular or not.

They have high expectations and have set a target of at least 70% accuracy! You are free to use as many features as you like, and you will need to engineer new features to reach this level of performance.

In [1]:
# Import some required packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RandomizedSearchCV

# Read in the dataset
books = pd.read_csv("data/books.csv")

# Preview the last five rows
books.tail(5)

Unnamed: 0,title,price,review/helpfulness,review/summary,review/text,description,authors,categories,popularity
15714,Attack of the Deranged Mutant Killer Monster S...,7.64,0/0,Great for Calvin lovers,"Bought as a Christmas gift, great book for kin...",Online: gocomics.com/calvinandhobbes/,'Bill Watterson','Comics & Graphic Novels',Unpopular
15715,Book Savvy,33.99,2/2,literary pleasure,I thoroughly enjoyed Ms. Katona's Book Savvy. ...,"Recounts the adventures of Mibs Beaumont, whos...",'Ingrid Law','Juvenile Fiction',Popular
15716,Organizing to Win: New Research on Union Strat...,24.95,3/4,Great Book for Union Organizers!,This is a good reference tool for Union Organi...,As the American labour movement mobilizes for ...,"'Kate Bronfenbrenner', 'Sheldon Friedman', 'Ri...",'Business & Economics',Popular
15717,The Dharma Bums,39.95,3/3,"The Sad, Beautiful, Joyful World of Jack Kerouac",Jack Kerouac was intensely alive and his fiery...,THE DHARMA BUMS appeared just one year after t...,'Jack Kerouac','Fiction',Popular
15718,Palomino,7.99,0/0,"""Palomino""","This is an older novel of Danielle Steele's, a...",Samantha Taylor is shattered when her husband ...,'Danielle Steel','Fiction',Popular


# Exploratory Data Analysis (EDA)


In [2]:
# Check the shape of the dataset
print(books.shape)
print("\n")



(15719, 9)




In [3]:
# Check the data types of the columns
print(books.dtypes)
print("\n")



title                  object
price                 float64
review/helpfulness     object
review/summary         object
review/text            object
description            object
authors                object
categories             object
popularity             object
dtype: object




In [4]:
# Summary statistics for the numeric columns
print(books.describe())
print("\n")



              price
count  15719.000000
mean      15.862783
std        8.464523
min        1.000000
25%       10.190000
50%       13.570000
75%       19.950000
max       41.770000




In [5]:
# Check for missing values
print(books.isnull().sum())
print("\n")

title                 0
price                 0
review/helpfulness    0
review/summary        0
review/text           0
description           0
authors               0
categories            0
popularity            0
dtype: int64




In [6]:
# Count the number of rows in each category to see which categories dominate the dataset
books['categories'].value_counts()


'Fiction'                      3520
'Religion'                     1053
'Biography & Autobiography'     852
'Juvenile Fiction'              815
'History'                       754
                               ... 
'Sunflowers'                      1
'Self-confidence'                 1
'United States'                   1
'Note-taking'                     1
'Asthma'                          1
Name: categories, Length: 313, dtype: int64

In [7]:
# Count the number of rows for each author (or group of co-authors)
books['authors'].value_counts()

'Charles Dickens'                                                                                        109
'Christopher Paolini'                                                                                     90
'Thomas Harris'                                                                                           85
'Charlotte Brontë', 'Marc Cactus'                                                                         70
'Charlotte Brontë'                                                                                        51
                                                                                                        ... 
'Gilbert Morris'                                                                                           1
'John Hick'                                                                                                1
'Ira Flatow'                                                                                               1
'Sam Deep', 'Lyle S

In [8]:
# Count the rows labelled 'Popular' for each author
books[books['popularity']=='Popular']['authors'].value_counts()



'Charles Dickens'                             36
'Charlotte Brontë', 'Marc Cactus'             27
'Charlotte Brontë'                            25
'Gary Chapman'                                25
'Michael Shaara'                              20
                                              ..
'Shana Pate'                                   1
'Lynn Kurland'                                 1
'Hattie Mae Clark'                             1
'Wm. Briggs and Company', 'Marion Nichols'     1
'Danielle Steel'                               1
Name: authors, Length: 3418, dtype: int64

# Data Preprocessing


In [9]:
# Drop the columns we don't need
books = books.drop(['review/text', 'title','description'], axis=1)


In [10]:
# Check for missing values
print(books.isnull().sum())

price                 0
review/helpfulness    0
review/summary        0
authors               0
categories            0
popularity            0
dtype: int64


In [11]:
# Drop missing values
books = books.dropna()

# Check for missing values
print(books.isnull().sum())

price                 0
review/helpfulness    0
review/summary        0
authors               0
categories            0
popularity            0
dtype: int64


## Fix the numerical columns


In [12]:
# First, split the review/helpfulness column ("x/y") into two separate integer columns
books['review'] = books['review/helpfulness'].apply(lambda x: x.split('/')[0]).astype('int')
books['helpfulness'] = books['review/helpfulness'].apply(lambda x: x.split('/')[1]).astype('int')
# The original review/helpfulness column is dropped later, together with the other raw columns
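
The two counts can also be combined into a single ratio feature. The sketch below is not part of the original notebook: it uses a small toy frame in place of `books`, assumes the string follows a "helpful/total" pattern, and the column names `helpful_votes`, `total_votes`, and `helpful_ratio` are hypothetical.

```python
import pandas as pd

# Toy stand-in for books['review/helpfulness']; values follow the "x/y" pattern seen in the preview
toy = pd.DataFrame({"review/helpfulness": ["0/0", "2/2", "3/4", "17/19"]})

# Split once into two integer columns (hypothetical names)
votes = toy["review/helpfulness"].str.split("/", expand=True).astype(int)
toy["helpful_votes"] = votes[0]
toy["total_votes"] = votes[1]

# Ratio of the first count to the second; rows with a zero denominator get 0.0
safe_total = toy["total_votes"].where(toy["total_votes"] > 0)
toy["helpful_ratio"] = (toy["helpful_votes"] / safe_total).fillna(0.0)

print(toy)
```

Masking the zero denominators before dividing avoids `inf` or `NaN` values reaching the model for rows such as "0/0".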


## Fix the categorical columns

In [13]:
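# Replace every non-letter character in the review/summary column with a space
# (regex=True is required in pandas 2.0+, where str.replace treats the pattern literally by default)
books['review/summary'] = books['review/summary'].str.replace('[^a-zA-Z]', ' ', regex=True)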

In [14]:
# Encode the target: 'Popular' becomes 1 and 'Unpopular' becomes 0
books['popularity'] = books['popularity'].apply(lambda x: 1 if x == 'Popular' else 0)

In [15]:
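# Replace each category with the mean popularity of the books in that category.
# Note: these means are computed on the full dataset, before the train/test split,
# so they include test-set labels and can inflate the reported scores.
books['categories_encoded'] = books.groupby('categories')['popularity'].transform('mean')

# Replace each author with the mean popularity of the books by that author (same caveat applies)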
books['authors_encoded'] = books.groupby('authors')['popularity'].transform('mean')
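
One way to keep test labels out of this kind of target encoding is to learn the per-group means from the training rows only and then map them onto both splits. The following is a minimal sketch on toy data, not the notebook's method; the frame `toy` and the fallback to a global mean for unseen categories are assumptions made for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the books frame: one categorical column and a binary target
toy = pd.DataFrame({
    "categories": ["Fiction", "Fiction", "History", "History",
                   "Fiction", "Religion", "History", "Fiction"],
    "popularity": [1, 0, 1, 1, 0, 0, 0, 1],
})

train, test = train_test_split(toy, test_size=0.25, random_state=0)
train, test = train.copy(), test.copy()

# Learn the per-category mean from the training rows only
category_means = train.groupby("categories")["popularity"].mean()
global_mean = train["popularity"].mean()

# Apply the same mapping to both splits; categories unseen in training
# fall back to the global training mean
train["categories_encoded"] = train["categories"].map(category_means)
test["categories_encoded"] = test["categories"].map(category_means).fillna(global_mean)

print(test)
```

Each training row still contributes its own label to its encoded value in this sketch; an out-of-fold encoding would remove that remaining leak as well.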


## Fix the text columns

In [16]:
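# Create a TfidfVectorizer that keeps the 100 most frequent terms (English stop words excluded),
# then turn the resulting matrix into a DataFrame with one column per term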
tfidf = TfidfVectorizer(stop_words='english', max_features=100)
summary_tfidf = tfidf.fit_transform(books['review/summary']).toarray()
summary_tfidf = pd.DataFrame(summary_tfidf, columns=tfidf.get_feature_names_out())


In [17]:
# Concatenate the original DataFrame with the summary_tfidf DataFrame
books = pd.concat([books, summary_tfidf], axis=1)

In [18]:
# Check for missing values
print(books.isnull().sum())

price                 0
review/helpfulness    0
review/summary        0
authors               0
categories            0
                     ..
world                 0
worth                 0
wow                   0
writing               0
written               0
Length: 110, dtype: int64


In [19]:
#drop the missing values
books.dropna(inplace=True)

## Split the data into training and testing sets

In [20]:
#drop the columns we don't need
books.drop(['review/helpfulness', 'review/summary', 'authors', 'categories'], axis=1, inplace=True)

In [21]:
books

Unnamed: 0,price,popularity,review,helpfulness,categories_encoded,authors_encoded,advice,amazing,american,author,awesome,bad,beautiful,beginners,best,better,bit,book,books,buy,classic,disappointed,disappointing,does,don,easy,edition,enjoyable,entertaining,excellent,fantastic,fascinating,favorite,fun,funny,good,great,guide,hard,helpful,...,practical,pretty,quot,read,reading,real,really,reference,resource,review.1,romance,self,series,short,simple,small,star,stars,start,stories,story,tale,thought,time,title,true,useful,ve,want,waste,way,woman,women,wonderful,work,world,worth,wow,writing,written
0,10.88,0,2,3,0.327586,0.375000,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.369047,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.441675,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.817758,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
1,9.35,0,0,0,0.356125,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.451156,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.892445,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
2,24.95,0,17,19,0.436111,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,1.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
3,7.99,0,0,1,0.327801,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
4,32.50,0,18,20,0.385965,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.492644,0.000000,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15714,7.64,0,0,0,0.318182,0.500000,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,1.000000,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
15715,33.99,1,2,2,0.357055,1.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
15716,24.95,1,3,4,0.352192,1.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.641193,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.767379,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
15717,39.95,1,3,3,0.265625,0.428571,0.0,0.0,0.0,0.0,0.0,0.0,0.685187,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.728367,0.0,0.0,0.0,0.0


In [22]:
# Split the data into training and testing sets
X = books.drop('popularity', axis=1)
y = books['popularity']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
#reshape the target variable
y_train = y_train.values.reshape(-1, 1)
y_test = y_test.values.reshape(-1, 1)
# Check the shape of the training and testing sets
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)



(12575, 105)
(3144, 105)
(12575, 1)
(3144, 1)
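
As a variation on the split above, `train_test_split` accepts a `stratify` argument that keeps the class proportions equal in both sets, and the target can stay one-dimensional, which is the shape scikit-learn classifiers expect. This is a sketch on toy arrays (`X_demo` and `y_demo` are hypothetical), not a change to the notebook's results.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy features and an imbalanced binary target (30% positives)
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.array([1] * 6 + [0] * 14)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=42, stratify=y_demo
)

# Both halves keep the same share of positives, and the target stays one-dimensional
print(y_tr.mean(), y_te.mean())
print(y_tr.shape)
```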


In [23]:
books.isnull().sum()

price                 0
popularity            0
review                0
helpfulness           0
categories_encoded    0
                     ..
world                 0
worth                 0
wow                   0
writing               0
written               0
Length: 106, dtype: int64

## Scale the data

In [24]:
# Create standard scaler object
scaler = StandardScaler()

# Fit the scaler to the training data and transform
X_train_scaled = scaler.fit_transform(X_train)

# Transform the testing data
X_test_scaled = scaler.transform(X_test)

# Check the shape of the scaled data
print(X_train_scaled.shape)

(12575, 105)


## Train the model and tune the hyperparameters using RandomizedSearchCV

In [25]:
# Create a Random Forest Classifier
classifier = RandomForestClassifier()

In [26]:
#use RandomizedSearchCV to find the best hyperparameters
param_distributions = {
    'n_estimators': np.arange(10, 100, 5),
    'max_depth': [None, 3, 5, 10, 15],
    'min_samples_split': np.arange(2, 20, 2),
    'min_samples_leaf': np.arange(1, 20, 2),
    'max_features': ['auto', 'sqrt', 'log2'],  # 'auto' means 'sqrt' for classifiers; it was removed in scikit-learn 1.3
    'criterion': ['gini', 'entropy']
}




search = RandomizedSearchCV(
    classifier,
    param_distributions=param_distributions,
    n_iter=50,
    cv=5,
    scoring='accuracy',
    verbose=10,
    n_jobs=-1
)
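
The search above samples candidates at random, so its best parameters can differ between runs. A minimal sketch of a reproducible variant follows. It runs on synthetic data from `make_classification` rather than the bookstore features, uses a deliberately small grid, and leaves out `'auto'` so that it also runs on recent scikit-learn releases; the names `X_demo`, `y_demo`, and `demo_search` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data standing in for X_train / y_train
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

param_distributions = {
    "n_estimators": np.arange(10, 50, 10),
    "max_depth": [None, 3, 5],
    "max_features": ["sqrt", "log2"],  # 'auto' omitted: removed in scikit-learn 1.3
}

demo_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),  # fixes the randomness inside each forest
    param_distributions=param_distributions,
    n_iter=3,
    cv=3,
    scoring="accuracy",
    random_state=0,  # fixes which candidates are sampled
)
demo_search.fit(X_demo, y_demo)
print(demo_search.best_params_)
```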

In [27]:
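# Random forests are insensitive to feature scaling, so the unscaled features are used here;
# ravel() passes the target as the 1-D array scikit-learn expects
search.fit(X_train, y_train.ravel())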

Fitting 5 folds for each of 50 candidates, totalling 250 fits
[CV 2/5; 1/50] START criterion=entropy, max_depth=10, max_features=sqrt, min_samples_leaf=3, min_samples_split=10, n_estimators=45
[CV 2/5; 1/50] END criterion=entropy, max_depth=10, max_features=sqrt, min_samples_leaf=3, min_samples_split=10, n_estimators=45;, score=0.852 total time=   0.3s
[CV 4/5; 1/50] START criterion=entropy, max_depth=10, max_features=sqrt, min_samples_leaf=3, min_samples_split=10, n_estimators=45
[CV 4/5; 1/50] END criterion=entropy, max_depth=10, max_features=sqrt, min_samples_leaf=3, min_samples_split=10, n_estimators=45;, score=0.855 total time=   0.3s
[CV 1/5; 2/50] START criterion=entropy, max_depth=None, max_features=auto, min_samples_leaf=9, min_samples_split=6, n_estimators=95
[CV 1/5; 2/50] END criterion=entropy, max_depth=None, max_features=auto, min_samples_leaf=9, min_samples_split=6, n_estimators=95;, score=0.865 total time=   0.9s
[CV 3/5; 2/50] START criterion=entropy, max_depth=None, m

## Evaluate the model

In [28]:
# Get the best hyperparameters
search.best_params_


{'n_estimators': 95,
 'min_samples_split': 8,
 'min_samples_leaf': 1,
 'max_features': 'auto',
 'max_depth': None,
 'criterion': 'gini'}

In [29]:

# Best mean cross-validated accuracy (printed as "training score") and accuracy on the held-out test set
print(f"Best training score: {search.best_score_}")
model_accuracy = search.score(X_test, y_test)
print(f"Test score: {model_accuracy}")


Best training score: 0.8755467196819084
Test score: 0.8797709923664122
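
Accuracy alone does not show how the errors are distributed across the two classes. The first cell imports `classification_report` and `confusion_matrix` without using them; the sketch below shows how they could be applied, together with a majority-class baseline for comparison. The labels and predictions here are made-up toy arrays, not output from the model above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical labels and predictions standing in for y_test and search.predict(X_test)
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

# Accuracy obtained by always predicting the most frequent class
baseline = np.bincount(y_true).max() / len(y_true)

print(f"Baseline accuracy: {baseline:.2f}")
print(f"Model accuracy: {accuracy_score(y_true, y_pred):.2f}")

# Rows are true classes, columns are predicted classes
print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred, target_names=["Unpopular", "Popular"]))
```

Comparing the model against the baseline puts a target such as 70% accuracy in context, since an imbalanced target can make a trivial classifier look strong.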
