## Predicting Book Ratings on GoodReads using Linear Regression Models

In this section, I will develop a linear regression model to predict the "rating" value for GoodReads books.

For this model, I will take advantage of the fact that each book in the original dataframe has an extensive list of genres, and that some books have also received multiple awards. Before reduction, there were 980 unique genres and 5,831 unique awards.

Since these features were extracted and converted to numerical variables in previous notebooks, I expect them to help me build a more robust model to predict the "rating".

The next steps in the process will be to import the cleaned dataframe, reduce, adapt and try to select the appropriate features for modeling, build the model, check the metrics, and make improvements if possible.

In [114]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
#%matplotlib inline

### Import Dataframe

Importing clean dataframe
- Cleaning and wrangling done in "1. goodreads_data_CLEANING_AND_WRANGLING.ipynb"

In [115]:
gr_data = pd.read_csv('./2. Clean_df/gr_data_CLEAN.csv')
gr_data = gr_data.drop(['Unnamed: 0'],axis=1)

In [116]:
display(gr_data.head(), gr_data.shape)

Unnamed: 0,title,series,author,description,language,pages,isbn,book_format,publish_year,first_publish_year,...,West Australian Young Readers' Book Award (WAYRBA) for Older Readers,West Australian Young Readers' Book Award (WAYRBA) for Younger Readers,William Allen White Children's Book Award,William C. Morris YA Debut Award Nominee,Women's Prize for Fiction Nominee,Women's Prize for Fiction Nominee for Longlist,World Fantasy Award Nominee for Best Novel,World Fantasy Award for Best Novel,Zilveren Griffel,الجائزة العالمية للرواية العربية (أي باف) / International Prize for Arabic Fiction (IPAF) Nominee
0,The Hunger Games,The Hunger Games,Suzanne Collins,WINNING MEANS FAME AND FORTUNE.LOSING MEANS CE...,English,200-300,9780440000000.0,Hardcover,2008,2008,...,1,0,0,0,0,0,0,0,0,0
1,Harry Potter and the Order of the Phoenix,Harry Potter,J.K. Rowling,There is a door at the end of a silent corrido...,English,700-800,9780440000000.0,Paperback,2004,2003,...,0,0,0,0,0,0,0,0,0,0
2,To Kill a Mockingbird,To Kill a Mockingbird,Harper Lee,The unforgettable novel of a childhood in a sl...,English,200-300,10000000000000.0,Paperback,2006,1960,...,0,0,0,0,0,0,0,0,0,0
3,Pride and Prejudice,Single Book,Jane Austen,Alternate cover edition of ISBN 9780679783268S...,English,100-200,10000000000000.0,Paperback,2000,2013,...,0,0,0,0,0,0,0,0,0,0
4,Twilight,The Twilight Saga,Stephenie Meyer,About three things I was absolutely positive.\...,English,400-500,9780320000000.0,Paperback,2006,2005,...,1,0,0,0,0,0,0,0,0,0


(48655, 862)

### Preprocessing data and features for modeling

#### Drop unnecessary columns

In [117]:
gr_data= gr_data.drop(['isbn','description'],axis=1)

In [118]:
gr_data.shape

(48655, 860)

#### NaN

In [119]:
gr_data.isna().sum().sum()

0

#### Types

In [120]:
gr_data.iloc[:, :18].dtypes

title                   object
series                  object
author                  object
language                object
pages                   object
book_format             object
publish_year             int64
first_publish_year       int64
publisher               object
liked_perc             float64
bbe_score              float64
bbe_votes              float64
log_num_ratings        float64
5_stars_num_ratings      int64
4_stars_num_ratings      int64
3_stars_num_ratings      int64
2_stars_num_ratings      int64
1_star_num_ratings       int64
dtype: object

In [121]:
numerical_columns = gr_data.select_dtypes(include=np.number).columns.tolist()
print(f"Numerical columns: {len(numerical_columns)}")

categorical_columns = gr_data.select_dtypes(include='object').columns.tolist()
print(f"Categorical columns: {len(categorical_columns)}")

Numerical columns: 853
Categorical columns: 7


#### Reduce genre columns

In [178]:
# Importing dataframes
award_df = pd.read_csv('./0. Drafts/award_df.csv')
award_df = award_df.drop(['Unnamed: 0'],axis=1)
award_df = award_df.drop(['index_gr_data'],axis=1)

genres_df = pd.read_csv('./0. Drafts/genres_clean_dataset.csv')
genres_df = genres_df.drop(['Unnamed: 0'],axis=1)
genres_df = genres_df.drop(['index_gr_data'],axis=1)

In [179]:
binary_columns = [column for column in genres_df.columns]
len(binary_columns)

609

In [180]:
binary_columns

['fantasy',
 'science fiction',
 'dystopia',
 'young adult',
 'fiction',
 'action',
 'adventure',
 'romance',
 'magic',
 'children',
 'audiobook',
 'school',
 'classics',
 'historical fiction',
 'novels',
 'literature',
 'historical',
 'historical romance',
 'adult',
 'urban fantasy',
 'paranormal',
 'paranormal romance',
 'vampires',
 'holocaust',
 'war',
 'politics',
 'christian fiction',
 'christianity',
 'epic fantasy',
 'civil war',
 'love',
 'drama',
 'coming of age',
 'contemporary',
 'comedy',
 'humor',
 'poetry',
 'picture books',
 'short stories',
 'gothic',
 'mystery',
 'thriller',
 'suspense',
 'crime',
 'mystery thriller',
 'asian countries',
 'asia',
 'horror',
 'european literature',
 'european countries',
 'plays',
 'esoterica',
 'philosophy',
 'self help',
 'inspirational',
 'mental health',
 'north american countries',
 'angels',
 'chick lit',
 'animals',
 'chapter books',
 'magical realism',
 'time travel',
 'iberoamerican literature',
 'fairy tales',
 'greek mytholo

In [188]:
to_group = ['young adult', 'fantasy','science fiction', 'fiction','romance',
            'historical fiction','poetry', 'comedy', 'suspense', 'thriller',
            'horror', 'philosophy','asia','europe','america','africa','christian',
           'war','drama','epic','crime','christ','wolves','children','music','medic',
           'health','mystery']

In [193]:
genre_dict = {}

for genre in to_group:
    if genre == 'science fiction':
        genre_dict[genre] = [name for name in binary_columns if 'science fiction' in name.lower()]
    elif genre == 'fiction':
        genre_dict[genre] = [name for name in binary_columns if 'fiction' in name.lower() and 'science fiction' not in name.lower()]
    else:
        genre_dict[genre] = [name for name in binary_columns if genre in name.lower()]

print(genre_dict)


{'young adult': ['young adult', 'young adult fantasy', 'young adult contemporary', 'young adult paranormal', 'young adult historical fiction', 'young adult science fiction'], 'fantasy': ['fantasy', 'urban fantasy', 'epic fantasy', 'young adult fantasy', 'dark fantasy', 'historical fantasy', 'comic fantasy', 'heroic fantasy', 'fantasy romance', 'sci fi fantasy', 'm m fantasy', 'paranormal urban fantasy', 'low fantasy'], 'science fiction': ['science fiction', 'science fiction romance', 'young adult science fiction', 'hard science fiction', 'military science fiction'], 'fiction': ['fiction', 'historical fiction', 'christian fiction', 'nonfiction', 'speculative fiction', 'military fiction', 'animal fiction', 'weird fiction', 'womens fiction', 'young adult historical fiction', 'fan fiction', 'lesbian fiction', 'gay fiction', 'bizarro fiction', 'climate change fiction', 'slash fiction', 'amish fiction', 'naval historical fiction', 'christian contemporary fiction'], 'romance': ['romance', 'hi

In [192]:
gr_data.shape

(48655, 860)
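The dictionary above maps each umbrella genre to the columns that match it, but the dataframe still has 860 columns, so the grouping has not been applied yet. One possible way to apply it is sketched below on hypothetical toy data. The frame `toy`, the dictionary `toy_genre_dict` and the helper `collapse_genres` are illustrative names, not part of the notebook, and the sketch assumes the genre columns in `gr_data` carry the same names as those in `genres_df`.

```python
import pandas as pd

# Hypothetical toy frame with binary genre columns (stand-in for gr_data)
toy = pd.DataFrame({
    "fantasy": [1, 0, 0],
    "urban fantasy": [0, 1, 0],
    "epic fantasy": [0, 0, 0],
    "poetry": [0, 0, 1],
    "rating": [4.1, 3.9, 4.3],
})

# Same structure as genre_dict: umbrella genre -> list of member columns
toy_genre_dict = {
    "fantasy": ["fantasy", "urban fantasy", "epic fantasy"],
    "poetry": ["poetry"],
}

def collapse_genres(df, groups):
    """Replace each group of binary columns with one 0/1 column per umbrella genre."""
    out = df.copy()
    grouped = {}
    to_drop = set()
    for umbrella, members in groups.items():
        present = [c for c in members if c in out.columns]
        if not present:
            continue
        # A book belongs to the umbrella genre if any member column is 1
        grouped[umbrella] = out[present].max(axis=1)
        to_drop.update(present)
    out = out.drop(columns=list(to_drop))
    for umbrella, values in grouped.items():
        out[umbrella] = values
    return out

collapsed = collapse_genres(toy, toy_genre_dict)
print(collapsed)
```

A column that matches several umbrella genres (for example 'young adult fantasy') contributes to each of them, because every umbrella column is computed before any member column is dropped.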

#### Numerical Data

In [122]:
numerical = gr_data.select_dtypes(include=np.number)
numerical.shape
#categorical = gr_data.select_dtypes(include=['object'])

(48655, 853)

In [123]:
# For easier analysis, I'm going to undo the log transformation in this subset
log_transformed_votes = numerical['bbe_votes']
original_votes = 10**(log_transformed_votes)
numerical['bbe_votes'] = original_votes

log_transformed_scores = numerical['bbe_score']
original_scores = 10**(log_transformed_scores)
numerical['bbe_score'] = original_scores

##### Correlation-Association amongst features

Correlation Matrix:

In [124]:
# Correlation matrix for numerical values (853 columns)
corr_matrix = numerical.corr()

In [125]:
# As there are so many columns, display only pairs with correlation > 0.6 or < -0.6:

In [126]:
# Indices of non-NaN values in the correlation matrix
indices = [(row, col) for row in corr_matrix.index for col in corr_matrix.columns if not pd.isna(corr_matrix.loc[row, col])]

# New df with the row and column's names, and correlation number
df = pd.DataFrame([(idx[0], idx[1], corr_matrix.loc[idx]) for idx in indices if (corr_matrix.loc[idx] > 0.6 or corr_matrix.loc[idx] < -0.6) and idx[0] != idx[1]], columns=['row name', 'col name', 'corr number'])

# The matrix is symmetric, so keep each pair once: only rows where row name sorts before col name
df = df[df['row name'] < df['col name']]

# Sort df by the correlation number in descending order
df = df.sort_values(by='corr number', ascending=False)

# Sorted df
display(df, len(df))


Unnamed: 0,row name,col name,corr number
0,bbe_score,bbe_votes,0.999308
24,3_stars_num_ratings,4_stars_num_ratings,0.957836
30,2_stars_num_ratings,3_stars_num_ratings,0.956072
35,1_star_num_ratings,2_stars_num_ratings,0.933824
17,4_stars_num_ratings,5_stars_num_ratings,0.931859
29,2_stars_num_ratings,4_stars_num_ratings,0.856281
23,3_stars_num_ratings,5_stars_num_ratings,0.83187
34,1_star_num_ratings,3_stars_num_ratings,0.820076
28,2_stars_num_ratings,5_stars_num_ratings,0.726403
33,1_star_num_ratings,4_stars_num_ratings,0.711987


40
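The cell above looks up every entry of the correlation matrix one at a time, which can be slow with more than 800 columns. A vectorized alternative is sketched below under the assumption that only the unique off-diagonal pairs are needed. The helper `high_corr_pairs` and the toy frame are hypothetical, and the pairs are deduplicated by position in the matrix rather than by alphabetical order of the names.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(frame, threshold=0.6):
    """Return unique column pairs whose absolute Pearson correlation exceeds the threshold."""
    corr = frame.corr()
    # Keep only the strict upper triangle so each pair appears once
    # and the diagonal (self-correlation) is excluded
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().reset_index()
    pairs.columns = ["row name", "col name", "corr number"]
    # NaN entries (masked cells, constant columns) fail this comparison and are dropped
    pairs = pairs[pairs["corr number"].abs() > threshold]
    return pairs.sort_values("corr number", ascending=False).reset_index(drop=True)

# Hypothetical toy data: b is almost a multiple of a, c is independent noise
rng = np.random.default_rng(1)
a = rng.normal(size=300)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=300),
    "c": rng.normal(size=300),
})
pairs = high_corr_pairs(toy)
print(pairs)
```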

 - Check counts from paired highly-correlated variables to choose which one to eliminate

In [127]:
display(numerical['Hugo Award for Best Novel '].value_counts(),numerical['hugo awards'].value_counts())

0    48593
1       62
Name: Hugo Award for Best Novel , dtype: int64

0    48620
1       35
Name: hugo awards, dtype: int64

In [128]:
display(numerical['love inspired'].value_counts(),numerical['love inspired historical'].value_counts())

0    48645
1       10
Name: love inspired, dtype: int64

0    48650
1        5
Name: love inspired historical, dtype: int64

In [129]:
# Highly correlated variables to drop (first pass, extended in the next cell)
high_correlated = ['2_stars_num_ratings','3_stars_num_ratings','4_stars_num_ratings',
                   'bbe_votes']

In [130]:
# Highly correlated variables to drop (final list)
high_correlated = ['2_stars_num_ratings','3_stars_num_ratings','4_stars_num_ratings',
                   'bbe_votes','hugo awards','booze','love inspired historical']

In [131]:
numerical.shape

(48655, 853)

In [132]:
# Drop highly correlated variables
numerical = numerical.drop(high_correlated, axis=1)

# New shape
numerical.shape

(48655, 846)

Note that the Pearson correlation matrix does not depend on whether the data is scaled. Correlation measures the strength of the linear relationship between two variables and is already normalized by their standard deviations, so standardizing or min-max scaling the features leaves every coefficient unchanged.

PCA is different: it works on variances, so features with larger scales can dominate the components. It is therefore generally recommended to scale the numerical features before applying PCA, so that they carry comparable weight in the analysis.
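A quick check of how scaling affects correlation and variance, on hypothetical toy data rather than the GoodReads columns: the correlation between two features is the same before and after standardization, while their variances, which are what PCA works with, differ by many orders of magnitude until they are scaled.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical toy features on very different scales
x = rng.normal(size=500)
toy = pd.DataFrame({
    "small_scale": x + rng.normal(scale=0.5, size=500),
    "large_scale": 1_000_000 * x + rng.normal(scale=200_000, size=500),
})

# Standardize each column: subtract the mean, divide by the standard deviation
toy_scaled = (toy - toy.mean()) / toy.std()

corr_raw = toy.corr()
corr_scaled = toy_scaled.corr()

# Pearson correlation is unchanged by scaling
same_corr = np.allclose(corr_raw.values, corr_scaled.values)

# Variances are wildly different before scaling and equal to 1 afterwards
var_raw = toy.var()
var_scaled = toy_scaled.var()

print("Same correlation matrix:", same_corr)
print("Raw variances:", var_raw.to_dict())
print("Scaled variances:", var_scaled.round(2).to_dict())
```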

#### Categorical Data

In [133]:
categorical = gr_data.select_dtypes(include=['object'])
categorical.shape

(48655, 7)

In [134]:
# Unique values
for col in categorical.columns:
    print(f"Unique values for {col}: {categorical[col].nunique()}")

Unique values for title: 46328
Unique values for series: 4405
Unique values for author: 21720
Unique values for language: 92
Unique values for pages: 12
Unique values for book_format: 10
Unique values for publisher: 4


- Reducing unique values

Author

In [135]:
categorical['author']

0        Suzanne Collins
1           J.K. Rowling
2             Harper Lee
3            Jane Austen
4        Stephenie Meyer
              ...       
48650      Sherry Gammon
48651      Emma Michaels
48652     Kim Richardson
48653        Tom Pollack
48654       Misty Moncur
Name: author, Length: 48655, dtype: object

Reducing the number of authors by grouping them into 'Top 1000', 'Top 1001-10000' and 'General Authors'. The two top groups rank authors with more than one published book by their average 'rating' (many authors have just one book and a 5.0 rating, which would otherwise dominate the ranking). Authors with a single book go into 'General Authors'.

In [136]:
# Reducing number of unique authors: TOP 1000 (average rating for authors with >1 book published)

# Count the number of books for each author
author_counts = gr_data['author'].value_counts()

# Filter the resulting series to only include authors with a count greater than 1
filtered_authors = author_counts[author_counts > 1].index

# Filter the original dataframe to only include titles with authors in the filtered series
filtered_gr_data = gr_data[gr_data['author'].isin(filtered_authors)]

# Group by author and calculate the mean rating for each author
author_avg_rating = filtered_gr_data.groupby('author')['rating'].mean()

# Sort the resulting series in descending order based on the average rating and select the top 1000 authors
top_1000_most_rated_authors = author_avg_rating.sort_values(ascending=False).head(1000)

# Convert the resulting series to a dataframe and reset the index
top_1000_most_rated_authors_df = top_1000_most_rated_authors.to_frame().reset_index()

display(len(top_1000_most_rated_authors_df))

top_1000_most_rated_authors_df.head(30)

1000

Unnamed: 0,author,rating
0,James Goodman,5.0
1,"Doug ""Hollywood"" Davis",5.0
2,Natasha Lukin,5.0
3,Rita Pam Tarachi,5.0
4,Granthana Sinha,5.0
5,Nadine May,5.0
6,Bernard Cenney,5.0
7,Luke A.M. Brown,5.0
8,Susan Davis,5.0
9,Rodolfo Martin Vitangcol,5.0


In [137]:
# Reducing number of unique authors: TOP 1001-10.000

# Count the number of books for each author
author_counts = gr_data['author'].value_counts()

# Filter the resulting series to only include authors with a count greater than 1
filtered_authors = author_counts[author_counts > 1].index

# Filter the original dataframe to only include titles with authors in the filtered series
filtered_gr_data = gr_data[gr_data['author'].isin(filtered_authors)]

# Group by author and calculate the mean rating for each author
author_avg_rating = filtered_gr_data.groupby('author')['rating'].mean()

# Sort the resulting series in descending order based on the average rating and select the top 1001 to 10000 authors
top_1001_to_10000_most_rated_authors = author_avg_rating.sort_values(ascending=False).iloc[1000:10000]

# Convert the resulting series to a dataframe and reset the index
top_1001_to_10000_most_rated_authors_df = top_1001_to_10000_most_rated_authors.to_frame().reset_index()

display(len(top_1001_to_10000_most_rated_authors_df))

top_1001_to_10000_most_rated_authors_df.head(30)


6180

Unnamed: 0,author,rating
0,Adrienne Rich,4.276667
1,E.F. Benson,4.276667
2,Lexie Xu,4.276667
3,Anna Zaires,4.276667
4,Mandi Beck,4.276667
5,Arnold Lobel,4.276667
6,Lois Walfrid Johnson,4.276667
7,Amy Clipston,4.276667
8,J.M. Darhower,4.276
9,Mike Mignola,4.276


In [138]:
# Other authors

# Count the number of books for each author
author_counts = gr_data['author'].value_counts()

# Filter the resulting series to only include authors with a count equal to 1
single_authors = author_counts[author_counts == 1].index

# Filter the original dataframe to only include single-book authors that are in neither top-rated list
general_authors_data = gr_data[~gr_data['author'].isin(top_1000_most_rated_authors.index) & ~gr_data['author'].isin(top_1001_to_10000_most_rated_authors.index) & gr_data['author'].isin(single_authors)]
#general_authors_data = gr_data[~gr_data['author'].isin(top_1000_most_rated_authors.index) & ~gr_data['author'].isin(top_1001_to_10000_most_rated_authors.index) & ~gr_data['author'].isin(author_counts[author_counts == 1].index)]

# Group by author and calculate the mean rating for each author
general_authors_avg_rating = general_authors_data.groupby('author')['rating'].mean()

# Convert the resulting series to a dataframe and reset the index
general_authors_avg_rating_df = general_authors_avg_rating.to_frame().reset_index()

display(len(general_authors_avg_rating_df))

general_authors_avg_rating_df.head(30)


14540

Unnamed: 0,author,rating
0,50 Cent,4.15
1,A. Bates,3.54
2,A. Cort Sinnes,4.5
3,A. Helwa,4.63
4,A. Merritt,3.33
5,A. Roger Ekirch,3.68
6,A. Scott Berg,4.0
7,A. Teeuw,4.06
8,A. Ubaidillah Alias,3.94
9,A. White,4.93


In [139]:
# Check result

# Count the number of unique authors in the original dataframe
num_unique_authors = len(gr_data['author'].unique())

# Count the number of unique authors in new authors dataframes
num_top_1000_authors = len(top_1000_most_rated_authors_df)
num_top_1001_to_10000_authors = len(top_1001_to_10000_most_rated_authors_df)
num_general_authors = len(general_authors_avg_rating_df)

# Check if the sum of the number of new dataframes is equal to unique authors in the original dataframe
if num_top_1000_authors + num_top_1001_to_10000_authors + num_general_authors == num_unique_authors:
    print("The number of unique authors in the three new dataframes is equal to the number of unique authors in the original gr_data dataframe.")
else:
    print("The number of unique authors in the three new dataframes is not equal to the number of unique authors in the original gr_data dataframe.")

The number of unique authors in the three new dataframes is equal to the number of unique authors in the original gr_data dataframe.


In [140]:
# Replace the author values in categorical['author'] with the appropriate category
categorical['author'] = np.where(categorical['author'].isin(top_1000_most_rated_authors_df['author']), 'top_1000_most_rated_authors',
                    np.where(categorical['author'].isin(top_1001_to_10000_most_rated_authors_df['author']), 'top_1001_to_10000_most_rated_authors',
                    np.where(categorical['author'].isin(general_authors_avg_rating_df['author']), 'general_authors',
                    categorical['author'])))

categorical.head(5)

Unnamed: 0,title,series,author,language,pages,book_format,publisher
0,The Hunger Games,The Hunger Games,top_1001_to_10000_most_rated_authors,English,200-300,Hardcover,Top Publishers
1,Harry Potter and the Order of the Phoenix,Harry Potter,top_1000_most_rated_authors,English,700-800,Paperback,Top Publishers
2,To Kill a Mockingbird,To Kill a Mockingbird,top_1001_to_10000_most_rated_authors,English,200-300,Paperback,Top Publishers
3,Pride and Prejudice,Single Book,top_1001_to_10000_most_rated_authors,English,100-200,Paperback,Top Publishers
4,Twilight,The Twilight Saga,top_1001_to_10000_most_rated_authors,English,400-500,Paperback,Top Publishers


In [141]:
categorical['author'].unique()

array(['top_1001_to_10000_most_rated_authors',
       'top_1000_most_rated_authors', 'general_authors'], dtype=object)
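The three-way author binning can also be written in one pass with `np.select`, which avoids nesting `np.where` calls. This is a minimal sketch on hypothetical toy data. The labels, the `top_n` value and the frame `toy` are placeholders chosen to keep the example readable, not the notebook's own values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (stand-in for gr_data[['author', 'rating']])
toy = pd.DataFrame({
    "author": ["A", "A", "B", "B", "C", "C", "D"],
    "rating": [4.8, 4.6, 4.0, 4.2, 3.5, 3.7, 5.0],
})

author_stats = toy.groupby("author")["rating"].agg(["count", "mean"])

# Only authors with more than one book are ranked by average rating
multi = author_stats[author_stats["count"] > 1].sort_values("mean", ascending=False)
top_n = 1  # the notebook uses 1000; 1 keeps the toy example small
top_authors = set(multi.index[:top_n])
other_ranked = set(multi.index[top_n:])

conditions = [toy["author"].isin(top_authors), toy["author"].isin(other_ranked)]
labels = ["top_authors", "other_ranked_authors"]

# Authors matching neither condition (single-book authors) get the default label
toy["author_group"] = np.select(conditions, labels, default="general_authors")
print(toy)
```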

Series

In [142]:
categorical['series'].nunique()

4405

Same idea as before, this time grouping series into 'Top 500 rated series' and 'Other series', and keeping 'Single Book' for books that do not belong to any series. Again, the binning criterion is the average 'rating' of each series.

In [143]:
# Filter the original dataframe to only include titles that are not 'Single Book'
series_data = gr_data[gr_data['series'] != 'Single Book']

# Group by series and calculate the mean rating for each series
series_avg_rating = series_data.groupby('series')['rating'].mean()

# Sort the resulting series in descending order based on the average rating and select the top 500 series
top_500_most_rated_series = series_avg_rating.sort_values(ascending=False).head(500)

# Filter the original dataframe to only include titles with series not in the top rated series lists or with empty series
other_series_data = series_data[~series_data['series'].isin(top_500_most_rated_series.index) & (series_data['series'] != '')]

# Group by series and calculate the mean rating for each series
other_series_avg_rating = other_series_data.groupby('series')['rating'].mean()

# Convert the resulting series to a dataframe and reset the index
top_500_most_rated_series_df = top_500_most_rated_series.to_frame().reset_index()
other_series_avg_rating_df = other_series_avg_rating.to_frame().reset_index()


In [144]:
# Replace series values in categorical['series'] with the appropriate category
categorical['series'] = np.where(categorical['series'] == 'Single Book', 'Single Book',
                    np.where(categorical['series'].isin(top_500_most_rated_series_df['series']), 'Top 500 rated series',
                    np.where(categorical['series'].isin(other_series_avg_rating_df['series']), 'Other series',
                    categorical['series'])))


In [145]:
categorical['series'].unique()

array(['Other series', 'Top 500 rated series', 'Single Book'],
      dtype=object)

Title

In [146]:
categorical['title'].nunique()

46328

- As there are a lot of unique titles (46,328), I have decided to drop that column

In [147]:
categorical = categorical.drop(['title'],axis=1)

Language

In [148]:
categorical['language'].value_counts()

English                                  43415
Arabic                                     894
Spanish                                    632
French                                     562
German                                     501
                                         ...  
Iranian (Other)                              1
Aromanian; Arumanian; Macedo-Romanian        1
gu                                           1
Faroese                                      1
Afrikaans                                    1
Name: language, Length: 92, dtype: int64

In [149]:
# Count the unique values in the 'language' column
language_counts = categorical['language'].value_counts()

# Top 5 most common languages
top_languages = language_counts.index[:5]
top_languages

Index(['English', 'Arabic', 'Spanish', 'French', 'German'], dtype='object')

All languages not included in top_languages are going to be grouped as 'Other languages'

In [150]:
# Replace all the other languages with 'Other languages'
categorical.loc[~categorical['language'].isin(top_languages), 'language'] = 'Other languages'

In [151]:
categorical['language'].unique()

array(['English', 'French', 'German', 'Other languages', 'Arabic',
       'Spanish'], dtype=object)

In [152]:
categorical['language'].nunique()

6
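The same "keep the most frequent categories, lump the rest" step can be wrapped in a small reusable helper. The function name `keep_top_categories` and the toy series are hypothetical. The helper returns a new Series instead of assigning into a slice with `.loc`, which may also avoid pandas copy warnings on some versions.

```python
import pandas as pd

def keep_top_categories(values, n, other_label):
    """Keep the n most frequent categories and relabel everything else."""
    top = values.value_counts().index[:n]
    # where() keeps entries that satisfy the condition and replaces the others
    return values.where(values.isin(top), other_label)

# Hypothetical toy column (stand-in for categorical['language'])
toy_lang = pd.Series(["English", "English", "English", "Arabic", "Arabic", "Faroese"])
grouped = keep_top_categories(toy_lang, n=2, other_label="Other languages")
print(grouped.value_counts())
```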

In [153]:
# Unique values
for col in categorical.columns:
    print(f"Unique values for {col}: {categorical[col].nunique()}")

Unique values for series: 3
Unique values for author: 3
Unique values for language: 6
Unique values for pages: 12
Unique values for book_format: 10
Unique values for publisher: 4


#### Concatenate numerical and categorical processed data

In [154]:
gr_data_for_modelling = pd.concat([categorical,numerical],axis=1)

In [155]:
display(gr_data_for_modelling.head(3),gr_data_for_modelling.shape)

Unnamed: 0,series,author,language,pages,book_format,publisher,publish_year,first_publish_year,liked_perc,bbe_score,...,West Australian Young Readers' Book Award (WAYRBA) for Older Readers,West Australian Young Readers' Book Award (WAYRBA) for Younger Readers,William Allen White Children's Book Award,William C. Morris YA Debut Award Nominee,Women's Prize for Fiction Nominee,Women's Prize for Fiction Nominee for Longlist,World Fantasy Award Nominee for Best Novel,World Fantasy Award for Best Novel,Zilveren Griffel,الجائزة العالمية للرواية العربية (أي باف) / International Prize for Arabic Fiction (IPAF) Nominee
0,Other series,top_1001_to_10000_most_rated_authors,English,200-300,Hardcover,Top Publishers,2008,2008,96.0,2993816.0,...,1,0,0,0,0,0,0,0,0,0
1,Top 500 rated series,top_1000_most_rated_authors,English,700-800,Paperback,Top Publishers,2004,2003,98.0,2632233.0,...,0,0,0,0,0,0,0,0,0,0
2,Other series,top_1001_to_10000_most_rated_authors,English,200-300,Paperback,Top Publishers,2006,1960,95.0,2269402.0,...,0,0,0,0,0,0,0,0,0,0


(48655, 852)

In [156]:
# NaN values
gr_data_for_modelling.isna().sum().sum()

0

In [157]:
# Saving data processed for modelling
#gr_data_for_modelling.to_csv('./2. Clean_df/_MODELLING.csv')

## Linear Regression Model

In [158]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

#import scipy.stats as stats

In [159]:
#gr_data = pd.read_csv('./2. Clean_df/gr_data_MODELLING.csv')
#gr_data = gr_data.drop(['Unnamed: 0'],axis=1)
gr_data_for_modelling.shape

(48655, 852)

In [160]:
gr_data_for_modelling.head()

Unnamed: 0,series,author,language,pages,book_format,publisher,publish_year,first_publish_year,liked_perc,bbe_score,...,West Australian Young Readers' Book Award (WAYRBA) for Older Readers,West Australian Young Readers' Book Award (WAYRBA) for Younger Readers,William Allen White Children's Book Award,William C. Morris YA Debut Award Nominee,Women's Prize for Fiction Nominee,Women's Prize for Fiction Nominee for Longlist,World Fantasy Award Nominee for Best Novel,World Fantasy Award for Best Novel,Zilveren Griffel,الجائزة العالمية للرواية العربية (أي باف) / International Prize for Arabic Fiction (IPAF) Nominee
0,Other series,top_1001_to_10000_most_rated_authors,English,200-300,Hardcover,Top Publishers,2008,2008,96.0,2993816.0,...,1,0,0,0,0,0,0,0,0,0
1,Top 500 rated series,top_1000_most_rated_authors,English,700-800,Paperback,Top Publishers,2004,2003,98.0,2632233.0,...,0,0,0,0,0,0,0,0,0,0
2,Other series,top_1001_to_10000_most_rated_authors,English,200-300,Paperback,Top Publishers,2006,1960,95.0,2269402.0,...,0,0,0,0,0,0,0,0,0,0
3,Single Book,top_1001_to_10000_most_rated_authors,English,100-200,Paperback,Top Publishers,2000,2013,94.0,1983116.0,...,0,0,0,0,0,0,0,0,0,0
4,Other series,top_1001_to_10000_most_rated_authors,English,400-500,Paperback,Top Publishers,2006,2005,78.0,1459448.0,...,1,0,0,0,0,0,0,0,0,0


#### X-y split

In [161]:
# Target = 'rating'
y = gr_data_for_modelling['rating']
X = gr_data_for_modelling.drop(['rating'],axis=1)

#### Train-Test Split

In [162]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,random_state=1234) #77

#### Split both Train and Test into numericals and categoricals

In [164]:
# X_train
X_train_num = X_train.select_dtypes(include=np.number)
X_train_cat = X_train.select_dtypes(include=['object'])

# X_test
X_test_num = X_test.select_dtypes(include=np.number)
X_test_cat = X_test.select_dtypes(include=['object'])

#### Numerical Transformations

- Fit only with numericals_train

In [165]:
scaler = StandardScaler().fit(X_train_num)

- Transform both numericals_train and numericals_test

In [166]:
X_train_num_scaled = scaler.transform(X_train_num)
X_test_num_scaled = scaler.transform(X_test_num)

In [167]:
# Convert X_train_num_scaled and X_test_num_scaled to dataframes
X_train_num_scaled_df = pd.DataFrame(X_train_num_scaled, columns=X_train_num.columns)
X_test_num_scaled_df = pd.DataFrame(X_test_num_scaled, columns=X_test_num.columns)

#### Categorical Encoding

Fit only with categoricals_train

In [168]:
encoder = OneHotEncoder().fit(X_train_cat)

Encode both categoricals_train and categoricals_test

In [169]:
X_train_cat_encoded = encoder.transform(X_train_cat)
X_test_cat_encoded = encoder.transform(X_test_cat)

In [170]:
# Convert X_train_cat_encoded and X_test_cat_encoded to dataframes
X_train_cat_encoded_df = pd.DataFrame(X_train_cat_encoded.toarray(), columns=encoder.get_feature_names_out())
X_test_cat_encoded_df = pd.DataFrame(X_test_cat_encoded.toarray(), columns=encoder.get_feature_names_out())

#### Combine numericals_train and categoricals_train into train_processed

In [171]:
train_processed = pd.concat([X_train_num_scaled_df, X_train_cat_encoded_df], axis=1)

#### Combine numericals_test and categoricals_test into test_processed

In [172]:
test_processed = pd.concat([X_test_num_scaled_df, X_test_cat_encoded_df], axis=1)
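The manual steps above (split by dtype, scale, encode, concatenate) can also be expressed with the `ColumnTransformer` that was imported earlier but not used. The sketch below is an alternative arrangement on hypothetical toy data, not the notebook's actual pipeline. The column names are borrowed for illustration only, and `handle_unknown="ignore"` is an added assumption so that categories missing from the training split do not raise an error at transform time.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data (not the GoodReads dataframe)
rng = np.random.default_rng(42)
n = 200
toy = pd.DataFrame({
    "publish_year": rng.integers(1950, 2020, size=n),
    "liked_perc": rng.uniform(60, 100, size=n),
    "language": rng.choice(["English", "Spanish", "Other languages"], size=n),
})
toy_y = 0.04 * toy["liked_perc"] + rng.normal(scale=0.1, size=n)

num_cols = ["publish_year", "liked_perc"]
cat_cols = ["language"]

preprocess = ColumnTransformer([
    ("num", StandardScaler(), num_cols),
    # Categories unseen during fit are encoded as all zeros instead of raising
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])
model = Pipeline([("preprocess", preprocess), ("lr", LinearRegression())])

X_tr, X_te, y_tr, y_te = train_test_split(toy, toy_y, test_size=0.2, random_state=1234)

# The scaler and encoder are fitted on the training split only
model.fit(X_tr, y_tr)
test_r2 = model.score(X_te, y_te)
print("Test R-squared on toy data:", round(test_r2, 2))
```

Keeping the preprocessing inside the pipeline means the same fitted transformers are applied to any new data passed to `model.predict`.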

#### Define Linear Regression Model

- Fit model on train_processed

In [173]:
lr = LinearRegression().fit(train_processed, y_train)

In [174]:
# Predictions on training data
y_train_pred = lr.predict(train_processed)

print("First 5 predictions on the training data:", y_train_pred[:5])

# Evaluation metrics
r2 = r2_score(y_train, y_train_pred)
rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
mse = mean_squared_error(y_train, y_train_pred)
mae = mean_absolute_error(y_train, y_train_pred)

print("\nR-squared score on the training data:", round(r2, 2))
print("Root mean squared error on the training data:", round(rmse, 2))
print("Mean squared error on the training data:", round(mse, 2))
print("Mean absolute error on the training data:", round(mae, 2))

First 5 predictions on the training data: [3.75805664 3.65991211 3.7175293  4.19458008 3.95654297]

R-squared score on the training data: 0.44
Root mean squared error on the training data: 0.26
Mean squared error on the training data: 0.07
Mean absolute error on the training data: 0.17


- Predictions on test_processed and scores

In [175]:
# Predictions on test data
y_test_pred = lr.predict(test_processed)

print("Actual data:", y_test[:6])
print("\nPredictions on the TEST data:", y_test_pred[:6])

# Evaluation metrics on test data
r2_test = r2_score(y_test, y_test_pred)
rmse_test = np.sqrt(mean_squared_error(y_test, y_test_pred))
mse_test = mean_squared_error(y_test, y_test_pred)
mae_test = mean_absolute_error(y_test, y_test_pred)

print("\nR-squared score on the test data:", round(r2_test, 2))
print("Root mean squared error on the test data:", round(rmse_test, 2))
print("Mean squared error on the test data:", round(mse_test, 2))
print("Mean absolute error on the test data:", round(mae_test, 2))

Actual data: 30447    3.90
9957     3.02
1075     4.20
11957    4.10
32085    4.27
17717    3.66
Name: rating, dtype: float64

Predictions on the TEST data: [4.00512695 3.4609375  4.19604492 4.07104492 4.69970703 3.69116211]

R-squared score on the test data: -3.1053025765579204e+21
Root mean squared error on the test data: 19153594350.87
Mean squared error on the test data: 3.668601765577344e+20
Mean absolute error on the test data: 411725683.9
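A test R-squared this far below zero, next to 0.44 on the training split and six sample predictions that look reasonable, suggests that a few test predictions are wildly off rather than that all of them are slightly worse. One possible cause, offered as an assumption and not as a diagnosis of this run, is that some rare binary columns never vary in the training split, or are nearly collinear with other columns, which leaves their coefficients poorly determined. A minimal check for the first case, using a hypothetical helper and toy data:

```python
import pandas as pd

def constant_in_train(train, test):
    """Columns that are constant in train but take a different value in test."""
    flagged = []
    for col in train.columns:
        train_values = set(train[col].unique())
        test_values = set(test[col].unique())
        if len(train_values) == 1 and not test_values <= train_values:
            flagged.append(col)
    return flagged

# Hypothetical toy splits: 'rare_award' is never 1 in the training rows
toy_train = pd.DataFrame({"rare_award": [0, 0, 0, 0], "fantasy": [1, 0, 1, 0]})
toy_test = pd.DataFrame({"rare_award": [0, 1], "fantasy": [0, 1]})

flagged = constant_in_train(toy_train, toy_test)
print(flagged)
```

On the real data this would be called on the unscaled `X_train_num` and `X_test_num`. An empty result would point towards collinearity or another cause instead.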


In [176]:
gr_data['rating'].describe()

count    48655.000000
mean         4.015514
std          0.347446
min          0.000000
25%          3.820000
50%          4.020000
75%          4.220000
max          5.000000
Name: rating, dtype: float64

After trying to use the 'genres' and 'awards' features (binary values) both as numerical and as categorical variables, the model is not performing well on the test data. Note that the test output shown above for this run (an R-squared of about -3.1e21) does not match the figures discussed below, which appear to come from a different run of the notebook:

An R-squared score of 0.42 indicates that the model explains 42% of the variance in the target variable.

The root mean squared error (RMSE) on the test data is 0.27 ('rating' std = 0.35), which means that the model's predictions are typically off by about 0.27 rating points, not much less than the spread of the target itself.

The mean squared error (MSE) on the test data is 0.07, which is the average squared difference between the predicted and actual values.

The mean absolute error (MAE) on the test data is 0.17, which is the average absolute difference between the predicted and actual values.
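To make the four metrics concrete, they can be computed by hand with NumPy. The ratings and predictions below are hypothetical values chosen for illustration, not results from the model.

```python
import numpy as np

# Hypothetical ratings and predictions, for illustration only
y_true = np.array([3.90, 3.02, 4.20, 4.10, 4.27, 3.66])
y_pred = np.array([4.00, 3.46, 4.20, 4.07, 4.40, 3.69])

residuals = y_true - y_pred

mse = np.mean(residuals ** 2)        # average squared error
rmse = np.sqrt(mse)                  # back in the units of the rating
mae = np.mean(np.abs(residuals))     # average absolute error

ss_res = np.sum(residuals ** 2)                    # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)     # total variation around the mean
r2 = 1 - ss_res / ss_tot                           # share of variance explained

print(f"MSE={mse:.3f} RMSE={rmse:.3f} MAE={mae:.3f} R2={r2:.3f}")
```

Because R-squared compares the model against always predicting the mean, it becomes negative whenever `ss_res` exceeds `ss_tot`, and a few extreme predictions are enough to push it far below zero.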

In terms of pros and cons, here are some for PCA:

Pros:

- Can help reduce the number of features in high-dimensional datasets, making it easier to fit a model and reducing the risk of overfitting.
- Can reveal patterns and relationships in the data that may not be apparent in the original feature space.
- Can help with data visualization by reducing the dimensionality of the data to 2 or 3 dimensions.

Cons:

- PCA assumes that the principal components represent the most important patterns in the data, which may not always be true.
- It can be difficult to interpret the meaning of the principal components, especially if there are many of them.
- Depending on the amount of variance you want to retain, PCA can result in a loss of information.

#### Attempt to improve the model

To check whether I can get better results, I am going to reduce the number of features by applying PCA. To choose the number of components, I will first explore the explained variance ratio.

#### PCA

In [None]:
from sklearn.decomposition import PCA

# Fit PCA on training data
pca = PCA() #n_components=100
pca.fit(train_processed)

#pca.explained_variance_ratio_.round(2)

In [None]:
# Get the explained variance ratios
exp_var_pca = pca.explained_variance_ratio_

# Calculate the cumulative sum of variance explained
cum_sum_eigenvalues = np.cumsum(exp_var_pca)
#cum_sum_eigenvalues

# Plot the cumulative explained variance vs number of components
plt.bar(range(0,len(exp_var_pca)), exp_var_pca, 
        alpha=0.5, align='center', label='Individual explained variance')
plt.step(range(0,len(cum_sum_eigenvalues)), cum_sum_eigenvalues, 
         where='mid',label='Cumulative explained variance')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal component index')
plt.legend(loc='best')
plt.tight_layout()
plt.show()

As the graph shows, the variance in the feature set is spread thinly across a large number of components: no small group of components captures most of it. This makes it difficult to improve the regression model just by reducing dimensionality.

In [None]:
pca.explained_variance_ratio_[:650].sum()

Number of dimensions: 650 (Variance explained 0.90)
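
Rather than finding this number by trial and error, it can be read directly from the cumulative explained variance. The following is a minimal sketch on random data; `X_demo` is a hypothetical stand-in for `train_processed`, so the resulting count has nothing to do with the 650 found above.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in for train_processed: 200 rows, 30 numeric features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 30))

# Fit PCA with all components and accumulate the explained variance ratios
pca_demo = PCA().fit(X_demo)
cum_var = np.cumsum(pca_demo.explained_variance_ratio_)

# First position at which the cumulative variance reaches 0.90 (+1 turns an index into a count)
n_components_90 = int(np.searchsorted(cum_var, 0.90) + 1)

print("Components needed for 90% of the variance:", n_components_90)
print("Variance explained by them:", round(cum_var[n_components_90 - 1], 3))
```

Passing a float between 0 and 1 as `n_components`, as in `PCA(0.90)`, asks scikit-learn to make this selection internally.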

In [None]:
# Refit PCA on the training data, keeping the smallest number of
# components that explains 90% of the variance
pca = PCA(n_components=0.90)
pca.fit(train_processed)

In [None]:
# Transform training and test data
X_train_pca = pca.transform(train_processed)
X_test_pca = pca.transform(test_processed)

##### Linear Regression with PCA

In [None]:
# Fit linear regression on transformed data
lr_pca = LinearRegression().fit(X_train_pca, y_train)

# Predict on test set
y_pred_pca = lr_pca.predict(X_test_pca)

# Calculate R-squared score
r2_pca = r2_score(y_test, y_pred_pca)
print("R-squared score (PCA transformed):", r2_pca)

In [None]:
#plt.scatter(X_train_pca[:,0],X_train_pca[:,1], c = y_train)

The R2 score of the linear regression model decreased after applying PCA, indicating that the PCA transformation did not improve the model's ability to predict the target variable.
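
A minimal sketch of why this outcome is expected, using synthetic data from `make_regression` rather than the GoodReads features (every name ending in `_demo`, and the chosen sizes, are assumptions for illustration). A linear regression fitted on a subset of principal components is a restricted version of the regression on the original features, so its training R-squared can never be higher; on the test set it may go either way.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic regression problem (hypothetical stand-in for the book features)
X_demo, y_demo = make_regression(n_samples=500, n_features=40, n_informative=10,
                                 noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Same estimator, with and without a PCA step that keeps 90% of the variance
full_model = LinearRegression().fit(X_tr, y_tr)
pca_model = make_pipeline(PCA(n_components=0.90), LinearRegression()).fit(X_tr, y_tr)

r2_train_full = full_model.score(X_tr, y_tr)
r2_train_pca = pca_model.score(X_tr, y_tr)
r2_test_full = full_model.score(X_te, y_te)
r2_test_pca = pca_model.score(X_te, y_te)

print("Components kept:", pca_model.named_steps["pca"].n_components_)
print("Train R-squared, full vs PCA:", round(r2_train_full, 3), round(r2_train_pca, 3))
print("Test R-squared, full vs PCA:", round(r2_test_full, 3), round(r2_test_pca, 3))
```

Wrapping PCA and the regressor in one pipeline also guarantees that the components are learned from the training rows only.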

##### Random Forest Regressor

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Random Forest Regressor
rf = RandomForestRegressor(max_depth=30, min_samples_split=200, min_samples_leaf=200)

# Fit the model on the scaled training data and target values
rf.fit(train_processed, y_train)

# Predictions on the test data
y_pred = rf.predict(test_processed)

# MSE
mse = mean_squared_error(y_test, y_pred)

print("Mean Squared Error: ", mse)


#### Recursive Feature Elimination (RFE)

- Builds a succession of models, starting with all features (original df), and at each step removes the features that contribute least to the solution
- Uses Linear Regression models

In [None]:
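from sklearn.preprocessing import MinMaxScaler
scaler2 = MinMaxScaler()
# fit_transform returns a NumPy array; wrap it in a DataFrame (assuming `numerical`
# is one) so that X.columns is available when listing the selected features
numerical_scaled = pd.DataFrame(scaler2.fit_transform(numerical),
                                columns=numerical.columns, index=numerical.index)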

In [None]:
X = numerical_scaled
y = gr_data['rating']

from sklearn.feature_selection import RFE
from sklearn import linear_model
lm = linear_model.LinearRegression() # base estimator that RFE refits at each elimination step
rfe = RFE(lm, n_features_to_select=100, verbose=False)
rfe.fit(X, y)

In [None]:
#rfe.ranking_

In [None]:
# Get the selected features
selected_features = X.columns[rfe.support_]

# Use the selected features to create a new dataset
X_selected = X[selected_features]

# Train a linear regression model on the selected features
lm_selected = linear_model.LinearRegression()
lm_selected.fit(X_selected, y)

# Make predictions on the same data used for training
y_pred = lm_selected.predict(X_selected)

# Check the performance of the model (in-sample: these are the rows
# the model was trained on, so the score is optimistic)
from sklearn.metrics import r2_score
r2 = r2_score(y, y_pred)
print('R2 score:', r2)
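
The cells above select features and score the model on the same rows. One way to get a less optimistic estimate is to hold out a test set and put scaling, RFE and the final regression in a single pipeline, so that nothing is learned from the held-out rows. The sketch below does this on synthetic data; the `feat_i` column names, the sample sizes and the choice of 10 features are assumptions for illustration, not values from the GoodReads dataframe.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic data with named columns (hypothetical stand-in for `numerical`)
X_arr, y_demo = make_regression(n_samples=400, n_features=30, n_informative=8,
                                noise=5.0, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(X_arr.shape[1])])

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Scaling, feature elimination and the final model are all fitted on the training rows only
model = make_pipeline(
    MinMaxScaler(),
    RFE(LinearRegression(), n_features_to_select=10),
    LinearRegression(),
)
model.fit(X_tr, y_tr)

# Boolean mask of the retained features, mapped back to column names
selected = X_demo.columns[model.named_steps["rfe"].support_]
r2_train = model.score(X_tr, y_tr)
r2_test = model.score(X_te, y_te)

print("Selected features:", list(selected))
print("R2 train:", round(r2_train, 3), "| R2 test:", round(r2_test, 3))
```

A large gap between the train and test scores would suggest that the selected features are overfitting the training rows.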
