# Data Science 346 Project Stellenbosch University
### Team:
- David Nicolay
- Kellen Mossner
- Matthew Holm

In [1]:
# Libraries
import pandas as pd
from transformers import pipeline, AutoTokenizer

import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras import layers

from langdetect import detect, DetectorFactory
import re
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

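# Set random seeds for reproducible results
random_seed = 100
tf.keras.utils.set_random_seed(random_seed)  # seeds Python, NumPy and TensorFlow
DetectorFactory.seed = random_seed  # makes langdetect deterministic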



In [2]:
# Additional Libraries


# Data Pre-processing

Load data:

In [3]:
# Import Data
reviews = pd.read_csv("../WebScrapingExplore/data/goodreads_reviews_all.csv")
reviews.head()

Unnamed: 0,Book Title,Link,Review Text,Review Date,Review Stars,Review Likes
0,Ways of Seeing,https://www.goodreads.com/book/show/2784.Ways_...,This book is based on a television series whic...,"September 29, 2014",5,513
1,Ways of Seeing,https://www.goodreads.com/book/show/2784.Ways_...,"I am not the audience for this book, mainly be...","June 3, 2014",3,216
2,Ways of Seeing,https://www.goodreads.com/book/show/2784.Ways_...,"Way of Seeing, John Berger Ways of Seeing is a...","October 21, 2021",4,0
3,Ways of Seeing,https://www.goodreads.com/book/show/2784.Ways_...,"First of all, this entire book is set in bold....","May 25, 2008",4,106
4,Ways of Seeing,https://www.goodreads.com/book/show/2784.Ways_...,This was a great introduction to the work of J...,"March 12, 2020",4,80


In [4]:
reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29519 entries, 0 to 29518
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Book Title    29519 non-null  object
 1   Link          29519 non-null  object
 2   Review Text   29247 non-null  object
 3   Review Date   29519 non-null  object
 4   Review Stars  29519 non-null  int64 
 5   Review Likes  29519 non-null  int64 
dtypes: int64(2), object(4)
memory usage: 1.4+ MB


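Investigating the dataset further shows that `Review Text` has missing values (29,247 non-null entries out of 29,519 rows). We also keep only English reviews, using `langdetect` to identify the language of each review.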

In [10]:
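# Function to detect language
def is_english(review):
    """Return True if langdetect classifies the review as English."""
    try:
        return detect(review) == 'en'
    except Exception:
        # Missing (NaN) or undetectable text is treated as non-English
        return False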

In [None]:
# Keep only English reviews (this may take a while)
reviews = reviews[reviews['Review Text'].apply(is_english)]

In [5]:
# Preprocess the review text
def preprocess_text(text):
    text = text.lower()  # Lowercase
    text = re.sub(r'\d+', '', text)  # Remove digits
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    text = text.strip()  # Remove leading/trailing spaces
    return text

In [6]:
# Drop reviews with missing text: NaN is a float and has no .lower()
processed_reviews = reviews.dropna(subset=['Review Text']).copy()
processed_reviews['Review Text'] = processed_reviews['Review Text'].apply(preprocess_text)
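Missing review text is loaded by pandas as a float NaN, and a float has no `.lower()` method, which is the kind of `AttributeError` that `preprocess_text` hits on such rows. As an alternative to dropping those rows, here is a minimal sketch of a type guard. The name `preprocess_text_safe` is hypothetical and not part of the notebook.

```python
import re


def preprocess_text_safe(text):
    """Hypothetical variant of preprocess_text that tolerates missing values."""
    # pandas represents a missing string as float('nan'); map any non-string to ''
    if not isinstance(text, str):
        return ""
    text = text.lower()                  # Lowercase
    text = re.sub(r"\d+", "", text)      # Remove digits
    text = re.sub(r"[^\w\s]", "", text)  # Remove punctuation
    return text.strip()                  # Remove leading/trailing spaces


print(preprocess_text_safe("Great book, 10/10!"))   # great book
print(repr(preprocess_text_safe(float("nan"))))     # ''
```

Returning an empty string keeps the row count unchanged, which may or may not be desirable; dropping the rows is the simpler choice when empty reviews carry no signal.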

# Part 1: Summarization

## Transformers

Initializing the pipeline takes a while on the first run, since it downloads the model weights (about 1.6 GB).

In [13]:
model_name = "facebook/bart-large-cnn"
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [4]:
summarizer = pipeline("summarization", model=model_name)



Because the summarizer has a limited input length (1,024 tokens for `facebook/bart-large-cnn`), the review text needs to be divided into chunks.

In [16]:
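def chunk_text(text, max_chunk_size=500):
    """Split text into chunks of roughly max_chunk_size characters on word boundaries."""
    words = text.split()
    chunks = []
    current_chunk = []
    current_size = 0
    for word in words:
        # Only flush a non-empty chunk, so an over-long first word cannot emit ''
        if current_chunk and current_size + len(word) > max_chunk_size: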
            chunks.append(' '.join(current_chunk))
            current_chunk = [word]
            current_size = len(word)
        else:
            current_chunk.append(word)
            current_size += len(word) + 1  # +1 for space
    if current_chunk:
        chunks.append(' '.join(current_chunk))
    return chunks

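def summarize_text(text, max_summary_length=150):
    """Summarize each chunk, then summarize the joined chunk summaries if needed."""
    chunks = chunk_text(text)
    summaries = []
    for chunk in chunks:
        # max_length/min_length are counted in tokens, not characters
        summary = summarizer(chunk, max_length=max_summary_length, min_length=10, truncation=True)[0]['summary_text']
        summaries.append(summary)

    final_summary = ' '.join(summaries)
    # Rough check: compares a character count against a token budget
    if len(final_summary) > max_summary_length:
        # truncation=True keeps the joined summaries within the model's input limit
        final_summary = summarizer(final_summary, max_length=max_summary_length, min_length=30, truncation=True)[0]['summary_text']
    return final_summary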
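`chunk_text` splits on word boundaries, so a chunk can end in the middle of a sentence. As a possible alternative, here is a minimal sketch that packs whole sentences into each chunk. It assumes a naive regex sentence splitter, and the name `chunk_by_sentence` is hypothetical, not part of the notebook.

```python
import re


def chunk_by_sentence(text, max_chars=500):
    """Pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars becomes its own oversized chunk.
    """
    # Naive splitter: break after '.', '!' or '?' when followed by whitespace
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)   # flush the full chunk
            current = sentence       # start a new chunk with this sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


demo = "First sentence here. Second one follows! Is this the third? Yes."
for chunk in chunk_by_sentence(demo, max_chars=45):
    print(repr(chunk))
```

The limit is still measured in characters, so it only approximates the model's token limit.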

Begin by summarizing the reviews of one book, "Ways of Seeing".

In [35]:
# TODO: switch to a model that can handle longer summarization inputs

# book_title and book_df stay active because later cells use book_df
book_title = "Ways of Seeing"
book_df = reviews[reviews['Book Title'] == book_title]

# # Concatenate all reviews for the book
# all_reviews = ' '.join(book_df['Review Text'].dropna())

# # Generate a summary of the concatenated reviews
# try:
#     summary = summarize_text(all_reviews)
# except Exception as e:
#     print(f"An error occurred: {e}")
#     summary = "Error generating summary"

# # Calculate average rating
# avg_rating = book_df['Review Stars'].mean()

# # Create a new dataframe with the results
# result_df = pd.DataFrame({
#     'Book Title': [book_title],
#     'Review Summary': [summary],
#     'Average Rating': [avg_rating],
#     'Number of Reviews': [len(book_df)]
# })

# # Display the results
# print(result_df)

# # Print some statistics
# print(f"\nTotal length of all reviews: {len(all_reviews)} characters")
# print(f"Length of summary: {len(summary)} characters")

The example below shows that the model produces a reasonable summary of a single review, although it is essentially extractive: it picks important sentences rather than rephrasing them. We still need to present the result in a format that conveys the overall sentiment of the readers.

In [36]:
summary_test = summarizer(book_df.loc[3]['Review Text'], max_length=50, min_length=10)

In [37]:
summary_test

[{'summary_text': '4 essays and 3 pictorial essays. It seems like museums are doing a lot of things wrong as well as right. Chapter on oil-painting was particularly interesting but it was the last one about advertising (or "publicity"'}]

In [38]:
book_df.loc[3]['Review Text']

'First of all, this entire book is set in bold. I don\'t know what crazy crazyman let that through the gate at Penguin but I just felt I had to point it out right away. It\'s still worth reading. 4 essays and 3 pictorial essays. Really interesting stuff cutting away some of the bullshit associated with our appreciation of art. It seems like museums are doing a lot of things wrong as well as right. Chapter on oil-painting was particularly interesting but it was the last one about advertising (or "publicity" as it\'s exclusively referred to in this book) that has me thinking. Advertising not only needs you to want this shirt, this car, the entire industry must endeavor to narrow the scope of your desires to make you amenable to the culture. The mindset must always be a future, better you achieved through important purchases. The essay is horrifying enough until you realise that it\'s thirty years old, and this is now only one facet of a business that\'s grown much more insidious. The ads
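The observation that the summarizer mostly copies sentences can be checked roughly by measuring what share of the summary's sentences occur verbatim in the source. The sketch below uses a naive regex sentence splitter and invented toy strings. The helper `extractive_fraction` is hypothetical, not part of the notebook or of any library.

```python
import re


def _normalise(text):
    """Lowercase and collapse whitespace so matching ignores formatting."""
    return " ".join(text.lower().split())


def extractive_fraction(summary, source):
    """Share of summary sentences that occur verbatim in the source text."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", summary.strip()) if s]
    if not sentences:
        return 0.0
    source_norm = _normalise(source)
    copied = sum(_normalise(s) in source_norm for s in sentences)
    return copied / len(sentences)


# Toy strings, invented for illustration only
source = "The plot is slow. The characters are vivid. I would read it again."
copied_summary = "The characters are vivid. I would read it again."
rephrased_summary = "A slow but vivid book worth rereading."

print(extractive_fraction(copied_summary, source))     # 1.0
print(extractive_fraction(rephrased_summary, source))  # 0.0
```

A value close to 1 suggests the summary is mostly copied sentences; lower values suggest more rephrasing.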

## Encoder-Decoder Approach

In [10]:
from nltk.tokenize import sent_tokenize
import skipthoughts

In [13]:
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min


def preprocess(emails):
    """
    Performs preprocessing operations such as:

        Removing new line characters.
    """
    n_emails = len(emails)
    print(n_emails)
    for i in range(n_emails):
        email = emails[i]
        lines = email.split('\n')
        for j in reversed(range(len(lines))):
            lines[j] = lines[j].strip()
            if lines[j] == '':
                lines.pop(j)
        emails[i] = ' '.join(lines)
        
        
def split_sentences(emails):
    """
    Splits the emails into individual sentences
    """
    n_emails = len(emails)
    for i in range(n_emails):
        email = emails[i]
        sentences = sent_tokenize(email)
        for j in reversed(range(len(sentences))):
            sentences[j] = sentences[j].strip()
            # Check the stripped sentence so whitespace-only entries are dropped
            if sentences[j] == '':
                sentences.pop(j)
        emails[i] = sentences
        
        
def skipthought_encode(emails):
    """
    Obtains sentence embeddings for each sentence in the emails
    """
    enc_emails = [None]*len(emails)
    cum_sum_sentences = [0]
    sent_count = 0
    for email in emails:
        sent_count += len(email)
        cum_sum_sentences.append(sent_count)

    all_sentences = [sent for email in emails for sent in email]
    print('Loading pre-trained models...')
    model = skipthoughts.load_model()
    encoder = skipthoughts.Encoder(model)
    print('Encoding sentences...')
    enc_sentences = encoder.encode(all_sentences, verbose=False)

    for i in range(len(emails)):
        begin = cum_sum_sentences[i]
        end = cum_sum_sentences[i+1]
        enc_emails[i] = enc_sentences[begin:end]
    return enc_emails
        
    
def summarize(emails):
    """
    Performs summarization of emails
    """
    n_emails = len(emails)
    summary = [None]*n_emails
    print('Preprocessing...')
    preprocess(emails)
    print('Splitting into sentences...')
    split_sentences(emails)
    print('Starting to encode...')
    enc_emails = skipthought_encode(emails)
    print('Encoding Finished')
    for i in range(n_emails):
        enc_email = enc_emails[i]
        n_clusters = int(np.ceil(len(enc_email)**0.5))
        kmeans = KMeans(n_clusters=n_clusters, random_state=0)
        kmeans = kmeans.fit(enc_email)
        avg = []
        closest = []
        for j in range(n_clusters):
            idx = np.where(kmeans.labels_ == j)[0]
            avg.append(np.mean(idx))
        # Index of the sentence embedding nearest to each cluster centre
        closest, _ = pairwise_distances_argmin_min(kmeans.cluster_centers_,
                                                   enc_email)
        ordering = sorted(range(n_clusters), key=lambda k: avg[k])
        summary[i] = ' '.join([emails[i][closest[idx]] for idx in ordering])
    print('Clustering Finished')
    return summary

### First Attempt on a single book's review

In [24]:
reviews_subset = reviews[reviews['Book Title'] == 'Ways of Seeing']

In [26]:
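# NOTE: this overwrites the single-book subset with the text of ALL reviews
# (see Length: 29519 below); use reviews_subset['Review Text'] to keep one book
reviews_subset = reviews['Review Text']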

In [27]:
reviews_subset

0        This book is based on a television series whic...
1        I am not the audience for this book, mainly be...
2        Way of Seeing, John Berger Ways of Seeing is a...
3        First of all, this entire book is set in bold....
4        This was a great introduction to the work of J...
                               ...                        
29514    I first discovered this book when I was 15 and...
29515    As per usual, I loved it, though I feel like t...
29516    This was actually the book that made a reader ...
29517    charlotte really put her whole brontussy into ...
29518    Jane Eyre was orphaned and left in the care of...
Name: Review Text, Length: 29519, dtype: object

# Part 2: Manual Deep Learning

## Star Prediction

In [27]:
# Setup target and predictor datasets
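# Rows with missing review text are dropped, since TF-IDF needs strings
labelled = reviews.dropna(subset=['Review Text'])
X = labelled['Review Text'].values
y = labelled['Review Stars'].values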

In [None]:
# TF-IDF Vectorization
vectorizer = TfidfVectorizer(max_features=10000, stop_words='english')
X_tfidf = vectorizer.fit_transform(X)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.2, random_state=random_seed)

# Encode labels
label_encoder = LabelEncoder()
y_train = label_encoder.fit_transform(y_train)
y_test = label_encoder.transform(y_test)

# Convert labels to categorical
num_classes = len(np.unique(y_train))
y_train = tf.keras.utils.to_categorical(y_train, num_classes)
y_test = tf.keras.utils.to_categorical(y_test, num_classes)

In [40]:
# Build the model
model = tf.keras.Sequential([
    layers.Input(shape=(X_train.shape[1],)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(num_classes, activation='softmax')
])

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Train the model
history = model.fit(X_train.toarray(), y_train, epochs=10, batch_size=32, validation_split=0.3, verbose=1)

Epoch 1/10
505/505 ━━━━━━━━━━━━━━━━━━━━ 3s 5ms/step - accuracy: 0.4730 - loss: 1.2667 - val_accuracy: 0.5401 - val_loss: 1.0601
Epoch 2/10
505/505 ━━━━━━━━━━━━━━━━━━━━ 3s 5ms/step - accuracy: 0.6598 - loss: 0.8296 - val_accuracy: 0.5418 - val_loss: 1.1059
Epoch 3/10
505/505 ━━━━━━━━━━━━━━━━━━━━ 2s 5ms/step - accuracy: 0.8133 - loss: 0.5163 - val_accuracy: 0.5057 - val_loss: 1.3474
Epoch 4/10
505/505 ━━━━━━━━━━━━━━━━━━━━ 2s 5ms/step - accuracy: 0.9334 - loss: 0.2233 - val_accuracy: 0.5001 - val_loss: 1.9749
Epoch 5/10
505/505 ━━━━━━━━━━━━━━━━━━━━ 2s 5ms/step - accuracy: 0.9863 - loss: 0.0553 - val_accuracy: 0.5105 - val_loss: 2.6528
Epoch 6/10
505/505 ━━━━━━━━━━━━━━━━━━━━ 2s 5ms/step - accuracy: 0.9957 - loss: 0.0185 - val_accuracy: 0.4947 - val_loss: 3.1868
Epoch 7/10
505/505

In [41]:
# Evaluate the model
test_loss, test_accuracy = model.evaluate(X_test.toarray(), y_test, verbose=0)
print(f'Test accuracy: {test_accuracy:.3f}')

Test accuracy: 0.481


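A test accuracy of 0.481 is better than uniform random guessing (0.2 if all five star ratings are present), but it has not yet been compared against a simple baseline such as always predicting the most common rating, so we cannot say how much the model has actually learned. Training accuracy climbs above 0.99 by epoch 6, while validation accuracy stays at around 0.5 and validation loss rises from 1.06 to 3.19. This indicates severe overfitting.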
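To judge whether an accuracy of 0.481 is good, one useful reference point is a majority-class baseline: always predict the most common star rating seen in training. The sketch below shows the calculation on invented toy labels. The function name `majority_baseline_accuracy` is hypothetical, and the numbers say nothing about the real dataset.

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most common training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(label == majority_label for label in test_labels)
    return majority_label, hits / len(test_labels)


# Toy star ratings, invented for illustration only
train = [5, 4, 4, 3, 4, 5, 2, 4]
test = [4, 5, 4, 1]

label, acc = majority_baseline_accuracy(train, test)
print(f"Always predicting {label} stars gives accuracy {acc:.2f}")
```

In the notebook this would need the integer star labels from before the one-hot encoding step, for example the arrays returned by `train_test_split`.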

TODO
- metric graphs
- improve model
- look online for different approaches and ways to improve
- try gradient descent etc
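One candidate improvement is early stopping, since the validation loss in the training log is lowest after the first epoch and rises afterwards. The sketch below applies a simple patience rule to the six complete `val_loss` values from that log. The helper `best_epoch_with_patience` is hypothetical and only illustrates the rule; it is not how Keras implements it.

```python
def best_epoch_with_patience(val_losses, patience=2):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    Training stops once `patience` epochs pass without a new best val_loss.
    """
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_losses), best_epoch


# val_loss for epochs 1 to 6, copied from the training log
val_losses = [1.0601, 1.1059, 1.3474, 1.9749, 2.6528, 3.1868]

stop_epoch, best_epoch = best_epoch_with_patience(val_losses, patience=2)
print(f"Stop after epoch {stop_epoch}; best weights are from epoch {best_epoch}")
```

In Keras the same idea is available as the `tf.keras.callbacks.EarlyStopping` callback (for example with `monitor="val_loss"`, `patience=2` and `restore_best_weights=True`), passed to `model.fit` through `callbacks`. Stopping early only limits the damage from overfitting; it would not by itself raise validation accuracy above the values already seen.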

### Example Predictions

Consider an example of a bad review. "This book was absolutely terrible! How could you think this was a good idea."

In [31]:
# Make predictions
sample_review = ["This book was absolutely terrible! How could you think this was a good idea."]
sample_review_tfidf = vectorizer.transform(sample_review)
prediction = model.predict(sample_review_tfidf.toarray())
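# Map the predicted class index back to a star rating via the fitted encoder
predicted_rating = label_encoder.inverse_transform([np.argmax(prediction)])[0]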
print(f'Predicted rating: {predicted_rating}')

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 10ms/step
Predicted rating: 1


Consider an example of a long, mixed review:

In [32]:
sample_review = ["I had high hopes for The Infinite Horizon after hearing so much about it. From the beginning, the premise seemed promising, and for the most part, it delivers on its intriguing concept. The plot revolves around a futuristic world where society grapples with the boundaries of artificial intelligence, humanity, and survival—concepts that have always fascinated me. The world-building is impressive, with detailed landscapes and a unique societal structure that keeps you hooked initially. The author has clearly put a lot of thought into constructing the futuristic world, and it shows in the vivid descriptions and creative technologies. However, while the world-building is rich, the characters left much to be desired. The protagonist, Lila, felt underdeveloped. I found myself frustrated at several points because her motivations were either unclear or inconsistent. In the beginning, she starts off as a strong, determined character, but midway through, her actions seem erratic and her growth stagnates. The dialogue, too, felt stilted at times, making it hard to connect with the characters emotionally. There were a few moments where I felt the conversations between key characters were forced, almost like they were inserted to explain plot points rather than feeling organic.On the flip side, I have to give credit where it’s due—the pacing of the story is solid for the most part. There are intense moments where you’re on the edge of your seat, particularly during the battle scenes. These scenes were written with such vivid detail that I could easily imagine them playing out in a movie. The action sequences are well thought out, and they definitely add excitement to the narrative. That being said, there were also moments where the pacing lagged, especially in the middle sections. Some chapters felt like filler, dragging on with unnecessary exposition and side plots that didn’t add much to the overarching story."]
sample_review_tfidf = vectorizer.transform(sample_review)
prediction = model.predict(sample_review_tfidf.toarray())
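# Map the predicted class index back to a star rating via the fitted encoder
predicted_rating = label_encoder.inverse_transform([np.argmax(prediction)])[0]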
print(f'Predicted rating: {predicted_rating}')

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 12ms/step
Predicted rating: 3


# Part 3: Category Generation