Classifying the books into 7 emotional categories (Ekman's six + neutral) for a refined search experience

Emotions: "anger", "disgust", "fear", "joy", "sadness", "surprise", "neutral"

# Importing Dataset

In [4]:
import os

# Changing to the project's main directory (if configured) for easy data access
working_directory = os.getenv("WORKING_DIRECTORY")
if working_directory:  # os.chdir(None) would raise a TypeError
    os.chdir(working_directory)

# Suppressing the Hugging Face Hub symlink warning
os.environ["HF_HUB_DISABLE_SYMLINKS_WARNING"] = "1"

In [2]:
import pandas as pd

# Loading the previously cleaned data
path = 'data/books_classified.csv'
books = pd.read_csv(path)

# Printing the data to check validity
print(books.shape)
books.head(2)

(5197, 14)


Unnamed: 0,isbn13,isbn10,title,authors,categories,thumbnail,description,published_year,average_rating,num_pages,ratings_count,words_in_description,title_and_subtitle,base_categories
0,9780002005883,2005883,Gilead,Marilynne Robinson,Fiction,http://books.google.com/books/content?id=KQZCP...,A NOVEL THAT READERS and critics have been eag...,2004.0,3.85,247.0,361.0,199,Gilead,Fiction
1,9780002261982,2261987,Spider's Web,Charles Osborne;Agatha Christie,Detective and mystery stories,http://books.google.com/books/content?id=gA5GP...,A new 'Christie for Christmas' -- a full-lengt...,2000.0,3.83,241.0,5164.0,205,Spider's Web: A Novel,Fiction


# Pipeline

In [None]:
from transformers import pipeline

# top_k set to `None` since we want all the labels
sentiment_classifier = pipeline(task="text-classification",
                                model="j-hartmann/emotion-english-distilroberta-base",
                                top_k=None)

# Test
sentiment_classifier("I love this!")

Device set to use cpu


[[{'label': 'joy', 'score': 0.9771687984466553},
  {'label': 'surprise', 'score': 0.008528691716492176},
  {'label': 'neutral', 'score': 0.005764589179307222},
  {'label': 'anger', 'score': 0.004419791977852583},
  {'label': 'sadness', 'score': 0.002092393347993493},
  {'label': 'disgust', 'score': 0.001611992483958602},
  {'label': 'fear', 'score': 0.0004138525982853025}]]
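
The result is nested: the outer list holds one entry per input text, and each entry is a list with one dict per emotion label. A minimal sketch of reading the top label from that structure is shown here. The helper name `top_emotion` is hypothetical (it is not part of `transformers`), and the scores are the ones printed above, rounded to four decimals.

```python
# Output copied from the cell above (scores rounded to four decimals).
sample_output = [[
    {"label": "joy", "score": 0.9772},
    {"label": "surprise", "score": 0.0085},
    {"label": "neutral", "score": 0.0058},
    {"label": "anger", "score": 0.0044},
    {"label": "sadness", "score": 0.0021},
    {"label": "disgust", "score": 0.0016},
    {"label": "fear", "score": 0.0004},
]]


def top_emotion(prediction):
    """Return the (label, score) pair with the highest score for one text."""
    best = max(prediction, key=lambda item: item["score"])
    return best["label"], best["score"]


# sample_output[0] is the prediction for the first (and only) input text.
label, score = top_emotion(sample_output[0])
print(label, score)
```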

In [12]:
# Test
print(f'This is horrible!: {sentiment_classifier("This is horrible!")}')
print(f'This is horrible!!!: {sentiment_classifier("This is horrible!!!")}')

This is horrible!: [[{'label': 'disgust', 'score': 0.48886463046073914}, {'label': 'fear', 'score': 0.4316416084766388}, {'label': 'anger', 'score': 0.02577614225447178}, {'label': 'surprise', 'score': 0.02349361777305603}, {'label': 'sadness', 'score': 0.0183604434132576}, {'label': 'neutral', 'score': 0.009904485195875168}, {'label': 'joy', 'score': 0.0019590831361711025}]]
This is horrible!!!: [[{'label': 'fear', 'score': 0.5684544444084167}, {'label': 'disgust', 'score': 0.3261587917804718}, {'label': 'surprise', 'score': 0.03909299895167351}, {'label': 'anger', 'score': 0.0317532904446125}, {'label': 'sadness', 'score': 0.0193428136408329}, {'label': 'neutral', 'score': 0.012686503119766712}, {'label': 'joy', 'score': 0.002511194907128811}]]


In [15]:
# Testing on the dataset
print(books["description"][0])
sentiment_classifier(books["description"][0])

A NOVEL THAT READERS and critics have been eagerly anticipating for over a decade, Gilead is an astonishingly imagined story of remarkable lives. John Ames is a preacher, the son of a preacher and the grandson (both maternal and paternal) of preachers. It’s 1956 in Gilead, Iowa, towards the end of the Reverend Ames’s life, and he is absorbed in recording his family’s story, a legacy for the young son he will never see grow up. Haunted by his grandfather’s presence, John tells of the rift between his grandfather and his father: the elder, an angry visionary who fought for the abolitionist cause, and his son, an ardent pacifist. He is troubled, too, by his prodigal namesake, Jack (John Ames) Boughton, his best friend’s lost son who returns to Gilead searching for forgiveness and redemption. Told in John Ames’s joyous, rambling voice that finds beauty, humour and truth in the smallest of life’s details, Gilead is a song of celebration and acceptance of the best and the worst the world has

[[{'label': 'fear', 'score': 0.6548399925231934},
  {'label': 'neutral', 'score': 0.1698525995016098},
  {'label': 'sadness', 'score': 0.11640939861536026},
  {'label': 'surprise', 'score': 0.02070068009197712},
  {'label': 'disgust', 'score': 0.019100721925497055},
  {'label': 'joy', 'score': 0.015161462128162384},
  {'label': 'anger', 'score': 0.003935154061764479}]]

# Improving Classification

The whole-description prediction does not represent the text well: the blurb speaks of a joyous voice, celebration and acceptance, yet `fear` dominates with a score of 0.65. To improve this, we split each description into individual sentences and classify every sentence separately. From the per-sentence results we then take the maximum score for each emotion, which gives one score per emotion for each book.

In [None]:
# Testing the idea
sentences = books["description"][0].split(".")
preds = sentiment_classifier(sentences)
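
One detail worth checking with this approach: `str.split(".")` keeps an empty string after the final period, and that empty fragment is passed to the classifier along with the real sentences. The sketch below shows the behaviour on a toy string and one possible way to filter such fragments out. `split_sentences` is a hypothetical helper, not something used elsewhere in this notebook, and applying it would change the scores shown in the outputs here.

```python
def split_sentences(text):
    """Split on '.' and drop fragments that are empty after stripping whitespace."""
    return [part.strip() for part in text.split(".") if part.strip()]


example = "First sentence. Second sentence."

# Plain split keeps leading spaces and a trailing empty string.
print(example.split("."))

# The filtered version keeps only the non-empty, stripped fragments.
print(split_sentences(example))
```

Splitting on `.` also breaks on abbreviations and decimal numbers, so this remains a rough approximation of sentence boundaries.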

In [None]:
# Test
print(sentences[0])
preds[0]

A NOVEL THAT READERS and critics have been eagerly anticipating for over a decade, Gilead is an astonishingly imagined story of remarkable lives


[{'label': 'surprise', 'score': 0.7296021580696106},
 {'label': 'neutral', 'score': 0.14038598537445068},
 {'label': 'fear', 'score': 0.06816227734088898},
 {'label': 'joy', 'score': 0.04794260486960411},
 {'label': 'anger', 'score': 0.009156367741525173},
 {'label': 'disgust', 'score': 0.0026284768246114254},
 {'label': 'sadness', 'score': 0.002122163772583008}]

In [None]:
# Test
print(sentences[3])
preds[3]

 Haunted by his grandfather’s presence, John tells of the rift between his grandfather and his father: the elder, an angry visionary who fought for the abolitionist cause, and his son, an ardent pacifist


[{'label': 'fear', 'score': 0.9281682968139648},
 {'label': 'anger', 'score': 0.03219081461429596},
 {'label': 'neutral', 'score': 0.012808660045266151},
 {'label': 'sadness', 'score': 0.008756875991821289},
 {'label': 'surprise', 'score': 0.008597892709076405},
 {'label': 'disgust', 'score': 0.008431807160377502},
 {'label': 'joy', 'score': 0.0010455832816660404}]

In [33]:
import numpy as np

emotion_labels = ["anger", "disgust", "fear", "joy", "sadness", "surprise", "neutral"]

# Processes a list of emotion prediction outputs and returns a dictionary
# containing the maximum score observed for each emotion label.
def calculate_max_scores(predictions, emotion_labels=emotion_labels):
    emotion_score_individual = {label : [] for label in emotion_labels}
    for prediction in predictions:
        for emotion in prediction:
            emotion_score_individual[emotion["label"]].append(emotion["score"])
    return {label: np.max(scores) for label, scores in emotion_score_individual.items()}

In [34]:
# Test
dict(sorted(calculate_max_scores(predictions=preds).items(), key=lambda x: x[1], reverse=True))

{'sadness': np.float64(0.9671575427055359),
 'joy': np.float64(0.9327981472015381),
 'fear': np.float64(0.9281682968139648),
 'surprise': np.float64(0.7296021580696106),
 'neutral': np.float64(0.6462159156799316),
 'disgust': np.float64(0.27359113097190857),
 'anger': np.float64(0.06413363665342331)}

In [None]:
from tqdm import tqdm 

# Predicts emotion scores for a given number of book descriptions.
# For each description:
#   - It splits the text into sentences.
#   - Applies the emotion classifier to each sentence.
#   - Aggregates the predictions and selects the maximum score per emotion.
# The result is a dictionary mapping each emotion to a list of max scores (one per book),
# along with the corresponding list of ISBNs.
def predict_emotions(dataset, size=5, emotion_labels=emotion_labels, model=sentiment_classifier):
    isbns = dataset["isbn13"][:size]
    emotion_scores = {label : [] for label in emotion_labels}
    for idx in tqdm(range(size)):
        # .iloc keeps the lookup positional even if the index is not 0..n-1
        predictions = model(dataset["description"].iloc[idx].split("."))
        predictions_max_scores = calculate_max_scores(predictions, emotion_labels)
        for label in emotion_labels:
            emotion_scores[label].append(predictions_max_scores[label])
    return isbns, emotion_scores

In [42]:
# Testing the function `predict_emotions`
isbns, emotion_scores = predict_emotions(books, size=5)

# Converting the result into a dataframe
test_output = pd.DataFrame(emotion_scores)
test_output["isbn13"] = isbns

# Printing the dataframe to validate the outputs
test_output

100%|██████████| 5/5 [00:04<00:00,  1.04it/s]


Unnamed: 0,anger,disgust,fear,joy,sadness,surprise,neutral,isbn13
0,0.064134,0.273591,0.928168,0.932798,0.967158,0.729602,0.646216,9780002005883
1,0.612619,0.348284,0.942528,0.704422,0.11169,0.252546,0.88794,9780002261982
2,0.064134,0.104007,0.972321,0.767239,0.11169,0.078765,0.549477,9780006178736
3,0.351485,0.150722,0.360705,0.251882,0.11169,0.078765,0.732685,9780006280897
4,0.081412,0.184496,0.095043,0.040564,0.47588,0.078765,0.88439,9780006280934


In [43]:
# Running on main dataset
isbns, emotion_scores = predict_emotions(books, size=len(books))

# Converting the result into a temp dataframe that can be later joined
# to the main dataset for usage
books_emotions_temp = pd.DataFrame(emotion_scores)
books_emotions_temp["isbn13"] = isbns

# Printing the dataframe
books_emotions_temp.head(2)

100%|██████████| 5197/5197 [30:17<00:00,  2.86it/s]   


Unnamed: 0,anger,disgust,fear,joy,sadness,surprise,neutral,isbn13
0,0.064134,0.273591,0.928168,0.932798,0.967158,0.729602,0.646216,9780002005883
1,0.612619,0.348284,0.942528,0.704422,0.11169,0.252546,0.88794,9780002261982


In [45]:
# Merging `books` and `books_emotions_temp`
books_emotions = pd.merge(books, books_emotions_temp, on="isbn13", how="left")

In [46]:
# Function to obtain summary of the data
def detailed_summary(df):
    summary = pd.DataFrame({
        'Data Type': df.dtypes,
        'Non-Null Count': df.notnull().sum(),
        'Null Count': df.isnull().sum(),
        'Unique Values': df.nunique(),
        # `.empty` checks for missing data; `.any()` would also be False for all-zero columns
        'Sample Value': df.apply(lambda x: x.dropna().unique()[0] if not x.dropna().empty else None)
    })

    # Add numeric stats if applicable
    numeric_cols = df.select_dtypes(include='number').columns
    if not numeric_cols.empty:
        stats = df[numeric_cols].describe().T
        summary = summary.join(stats, how='left')

    return summary

detailed_summary(books_emotions[emotion_labels])

Unnamed: 0,Data Type,Non-Null Count,Null Count,Unique Values,Sample Value,count,mean,std,min,25%,50%,75%,max
anger,float64,5197,0,2257,0.064134,5197.0,0.164808,0.218574,0.000606,0.064134,0.064134,0.138384,0.989582
disgust,float64,5197,0,2202,0.273591,5197.0,0.200597,0.212761,0.000821,0.104007,0.104007,0.187477,0.989417
fear,float64,5197,0,3138,0.928168,5197.0,0.308601,0.342392,0.000442,0.051363,0.093588,0.580464,0.995326
joy,float64,5197,0,3357,0.932798,5197.0,0.280208,0.317908,0.00055,0.040564,0.087731,0.498713,0.992068
sadness,float64,5197,0,2026,0.967158,5197.0,0.223608,0.248027,0.001251,0.11169,0.11169,0.177616,0.989361
surprise,float64,5197,0,2354,0.729602,5197.0,0.174044,0.189109,0.000779,0.078765,0.078765,0.198874,0.983455
neutral,float64,5197,0,3009,0.646216,5197.0,0.760011,0.204867,0.000981,0.549477,0.838376,0.936846,0.974344
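
In the summary above, the 25% and 50% quantiles coincide for anger (0.064134), disgust (0.104007), sadness (0.11169) and surprise (0.078765), which suggests that many books share exactly the same maximum for those emotions. One possible explanation, not verified here, is that `split(".")` leaves an empty fragment after the final period, and the score the classifier assigns to that fragment acts as a floor under the per-book maximum. The sketch below illustrates the effect with made-up scores. `max_per_label` is a hypothetical, dependency-free stand-in for the max aggregation, not the function used in this notebook.

```python
def max_per_label(predictions):
    """Maximum score per label across a list of per-sentence predictions."""
    best = {}
    for prediction in predictions:
        for item in prediction:
            label = item["label"]
            best[label] = max(best.get(label, 0.0), item["score"])
    return best


# Made-up scores for illustration only (two labels instead of seven).
real_sentence = [{"label": "anger", "score": 0.01}, {"label": "joy", "score": 0.90}]
empty_fragment = [{"label": "anger", "score": 0.06}, {"label": "joy", "score": 0.04}]

# Without the empty fragment, anger stays at the real sentence's score.
print(max_per_label([real_sentence]))

# With it, anger is lifted to the fragment's score, which is the same for every book.
print(max_per_label([real_sentence, empty_fragment]))
```

If this is what is happening, dropping empty fragments before classification should remove the shared floor values.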


# Saving Dataset

In [47]:
# Save Path
folder = "data"
filename = "books_classifed_with_emotions.csv"
filepath = os.path.join(folder, filename)

# Create folder if it doesn't exist
os.makedirs(folder, exist_ok=True)

# Saving the file in the data folder
books_emotions.to_csv(filepath, index=False)