In [1]:
import pandas as pd
import numpy as np
from transformers import pipeline
from tqdm import tqdm

In [2]:
books_df = pd.read_csv("../data/cleaned_categorized_books.csv")

books_df

Unnamed: 0,isbn13,isbn10,title,authors,categories,thumbnail,description,published_year,average_rating,num_pages,ratings_count,title_join_subtitle,simple_categories,predicted_categories
0,9780002005883,0002005883,Gilead,Marilynne Robinson,Fiction,http://books.google.com/books/content?id=KQZCP...,A NOVEL THAT READERS and critics have been eag...,2004.0,3.85,247.0,361.0,Gilead,fiction,
1,9780002261982,0002261987,Spider's Web,Charles Osborne;Agatha Christie,Detective and mystery stories,http://books.google.com/books/content?id=gA5GP...,A new 'Christie for Christmas' -- a full-lengt...,2000.0,3.83,241.0,5164.0,Spider's Web: A Novel,fiction,fiction
2,9780006178736,0006178731,Rage of angels,Sidney Sheldon,Fiction,http://books.google.com/books/content?id=FKo2T...,"A memorable, mesmerizing heroine Jennifer -- b...",1993.0,3.93,512.0,29532.0,Rage of angels,fiction,
3,9780006280897,0006280897,The Four Loves,Clive Staples Lewis,Christian life,http://books.google.com/books/content?id=XhQ5X...,Lewis' work on the nature of love divides love...,2002.0,4.15,170.0,33684.0,The Four Loves,non-fiction,non-fiction
4,9780006280934,0006280935,The Problem of Pain,Clive Staples Lewis,Christian life,http://books.google.com/books/content?id=Kk-uV...,"""In The Problem of Pain, C.S. Lewis, one of th...",2002.0,4.09,176.0,37569.0,The Problem of Pain,non-fiction,non-fiction
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5192,9788172235222,8172235224,Mistaken Identity,Nayantara Sahgal,Indic fiction (English),http://books.google.com/books/content?id=q-tKP...,On A Train Journey Home To North India After L...,2003.0,2.93,324.0,0.0,Mistaken Identity,non-fiction,non-fiction
5193,9788173031014,8173031010,Journey to the East,Hermann Hesse,Adventure stories,http://books.google.com/books/content?id=rq6JP...,This book tells the tale of a man who goes on ...,2002.0,3.70,175.0,24.0,Journey to the East,non-fiction,non-fiction
5194,9788179921623,817992162X,The Monk Who Sold His Ferrari: A Fable About F...,Robin Sharma,Health & Fitness,http://books.google.com/books/content?id=c_7mf...,"Wisdom to Create a Life of Passion, Purpose, a...",2003.0,3.82,198.0,1568.0,The Monk Who Sold His Ferrari: A Fable About F...,fiction,fiction
5195,9788185300535,8185300534,I Am that,Sri Nisargadatta Maharaj;Sudhakar S. Dikshit,Philosophy,http://books.google.com/books/content?id=Fv_JP...,This collection of the timeless teachings of o...,1999.0,4.51,531.0,104.0,I Am that: Talks with Sri Nisargadatta Maharaj,non-fiction,


We are going to use a text-classification pipeline to score the emotions expressed in each book description

This could also have been framed as a zero-shot classification task with the emotion labels specified, but there are models fine-tuned specifically for emotion classification, so we use one of those instead

We create a classifier pipeline with one such model (j-hartmann/emotion-english-distilroberta-base) and run it in text-classification mode

In [3]:
classifier = pipeline(
    "text-classification",
    model="j-hartmann/emotion-english-distilroberta-base",
    device="mps",
    top_k = None,
    truncation=True,
    max_length=512 # Model's limit is 512 tokens
)

Device set to use mps


In [4]:
classifier("I am going to school")[0][0]["label"] # Classifies it as fear lol

'fear'
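
With `top_k = None`, the pipeline returns one list per input, each holding a `{label, score}` dictionary for every label, which is why the cell above indexes `[0][0]`. The sketch below illustrates that shape and picks the top label explicitly rather than relying on the ordering. `sample_output` is copied from the output for the first Gilead sentence shown in this notebook; `top_label` is a hypothetical helper written for illustration, not part of `transformers`.

```python
# Shape of the pipeline output for one input string when top_k=None:
# a list (one entry per input) of lists of {"label", "score"} dicts.
sample_output = [[
    {"label": "surprise", "score": 0.7296021580696106},
    {"label": "neutral", "score": 0.14038625359535217},
    {"label": "fear", "score": 0.06816215068101883},
    {"label": "joy", "score": 0.0479423962533474},
    {"label": "anger", "score": 0.009156353771686554},
    {"label": "disgust", "score": 0.0026284793857485056},
    {"label": "sadness", "score": 0.0021221640054136515},
]]

def top_label(predictions):
    """Return (label, score) of the highest-scoring entry for one input."""
    best = max(predictions, key=lambda p: p["score"])
    return best["label"], best["score"]

label, score = top_label(sample_output[0])
print(label, round(score, 3))
```

Using `max` over the scores keeps the lookup correct even if the entries were not sorted by score.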

The problem with passing the whole description to the pipeline at once is that emotions spread across individual sentences may be lost, and that information could otherwise be valuable

This is especially concerning given that a lot of book descriptions are hyped up to really sell the book

An example

In [5]:
# Single quotes inside the f-string keep this valid on Python versions before 3.12
print(f"Description:\n{books_df['description'][0]}\nEmotion:\n{classifier(books_df['description'][0])}")

Description:
A NOVEL THAT READERS and critics have been eagerly anticipating for over a decade, Gilead is an astonishingly imagined story of remarkable lives. John Ames is a preacher, the son of a preacher and the grandson (both maternal and paternal) of preachers. It’s 1956 in Gilead, Iowa, towards the end of the Reverend Ames’s life, and he is absorbed in recording his family’s story, a legacy for the young son he will never see grow up. Haunted by his grandfather’s presence, John tells of the rift between his grandfather and his father: the elder, an angry visionary who fought for the abolitionist cause, and his son, an ardent pacifist. He is troubled, too, by his prodigal namesake, Jack (John Ames) Boughton, his best friend’s lost son who returns to Gilead searching for forgiveness and redemption. Told in John Ames’s joyous, rambling voice that finds beauty, humour and truth in the smallest of life’s details, Gilead is a song of celebration and acceptance of the best and the worst 

In [6]:
for sentence in books_df["description"][0].split("."):
    sentence = sentence.strip()
    if not sentence:  # splitting on "." can leave empty pieces, e.g. after the final period
        continue
    print(f"Sentence:\n{sentence}\n")
    print(f"Emotion:\n{classifier(sentence)}\n")

Sentence:
A NOVEL THAT READERS and critics have been eagerly anticipating for over a decade, Gilead is an astonishingly imagined story of remarkable lives

Emotion:
[[{'label': 'surprise', 'score': 0.7296021580696106}, {'label': 'neutral', 'score': 0.14038625359535217}, {'label': 'fear', 'score': 0.06816215068101883}, {'label': 'joy', 'score': 0.0479423962533474}, {'label': 'anger', 'score': 0.009156353771686554}, {'label': 'disgust', 'score': 0.0026284793857485056}, {'label': 'sadness', 'score': 0.0021221640054136515}]]

Sentence:
John Ames is a preacher, the son of a preacher and the grandson (both maternal and paternal) of preachers

Emotion:
[[{'label': 'neutral', 'score': 0.46625038981437683}, {'label': 'disgust', 'score': 0.3382384777069092}, {'label': 'joy', 'score': 0.08201313763856888}, {'label': 'sadness', 'score': 0.061116788536310196}, {'label': 'anger', 'score': 0.02964133210480213}, {'label': 'surprise', 'score': 0.017968833446502686}, {'label': 'fear', 'score': 0.0047711
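
One way the per-sentence scores could be combined is to keep, for each label, the highest score seen in any sentence. The following is a minimal sketch of that idea, not what this notebook goes on to do. It assumes a `classify` callable that maps a string to a list of `{label, score}` dictionaries; `split_sentences`, `max_scores_per_label` and `fake_classify` are hypothetical names, and the scores returned by `fake_classify` are made up so that the sketch runs without downloading a model.

```python
import re

def split_sentences(text):
    """Naive splitter: break after ., ! or ? followed by whitespace; drop empties."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def max_scores_per_label(text, classify):
    """Keep, for each label, the highest score seen across the sentences.

    `classify` is any callable mapping a string to a list of
    {"label", "score"} dicts (an assumption made for this sketch).
    """
    best = {}
    for sentence in split_sentences(text):
        for pred in classify(sentence):
            label, score = pred["label"], pred["score"]
            if score > best.get(label, 0.0):
                best[label] = score
    return best

# Stand-in classifier with made-up scores, so the sketch runs without a model.
def fake_classify(sentence):
    if "afraid" in sentence:
        return [{"label": "fear", "score": 0.9}, {"label": "joy", "score": 0.1}]
    return [{"label": "fear", "score": 0.2}, {"label": "joy", "score": 0.8}]

result = max_scores_per_label("She was afraid. Then the sun came out!", fake_classify)
print(result)
```

With the real pipeline, something like `lambda s: classifier(s)[0]` could be passed as `classify`, since the pipeline wraps each input's predictions in an outer list.
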

For this exploration, however, we classify each description as a whole to keep things simple

We will create a dataframe holding, for each description, the score of every label the model outputs: six emotions plus neutral

In [7]:
isbn = []
scores = {}

for i in tqdm(range(len(books_df))):
    isbn.append(books_df.loc[i, "isbn10"])
    text = books_df.loc[i, "description"]
    preds = classifier(text)[0]  # list of {label, score} dicts for this description
    for emotion in preds:
        # setdefault creates the per-label list the first time a label is seen
        scores.setdefault(emotion["label"], []).append(emotion["score"])

scores["isbn10"] = isbn
scores_df = pd.DataFrame(scores)

# Attach the emotion scores to the original rows, using isbn10 as the key
books_df = pd.merge(books_df, scores_df, on="isbn10", how="left")
books_df

  0%|          | 0/5197 [00:00<?, ?it/s]
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
100%|██████████| 5197/5197 [02:57<00:00, 29.29it/s]


Unnamed: 0,isbn13,isbn10,title,authors,categories,thumbnail,description,published_year,average_rating,num_pages,...,title_join_subtitle,simple_categories,predicted_categories,fear,neutral,sadness,surprise,disgust,joy,anger
0,9780002005883,0002005883,Gilead,Marilynne Robinson,Fiction,http://books.google.com/books/content?id=KQZCP...,A NOVEL THAT READERS and critics have been eag...,2004.0,3.85,247.0,...,Gilead,fiction,,0.654842,0.169852,0.116408,0.020701,0.019101,0.015161,0.003935
1,9780002261982,0002261987,Spider's Web,Charles Osborne;Agatha Christie,Detective and mystery stories,http://books.google.com/books/content?id=gA5GP...,A new 'Christie for Christmas' -- a full-lengt...,2000.0,3.83,241.0,...,Spider's Web: A Novel,fiction,fiction,0.755521,0.050591,0.085620,0.068844,0.018019,0.003123,0.018282
2,9780006178736,0006178731,Rage of angels,Sidney Sheldon,Fiction,http://books.google.com/books/content?id=FKo2T...,"A memorable, mesmerizing heroine Jennifer -- b...",1993.0,3.93,512.0,...,Rage of angels,fiction,,0.939291,0.007241,0.002299,0.003145,0.005369,0.018979,0.023676
3,9780006280897,0006280897,The Four Loves,Clive Staples Lewis,Christian life,http://books.google.com/books/content?id=XhQ5X...,Lewis' work on the nature of love divides love...,2002.0,4.15,170.0,...,The Four Loves,non-fiction,non-fiction,0.230527,0.201329,0.027787,0.004284,0.198186,0.005105,0.332782
4,9780006280934,0006280935,The Problem of Pain,Clive Staples Lewis,Christian life,http://books.google.com/books/content?id=Kk-uV...,"""In The Problem of Pain, C.S. Lewis, one of th...",2002.0,4.09,176.0,...,The Problem of Pain,non-fiction,non-fiction,0.004750,0.854798,0.015526,0.004517,0.068829,0.029622,0.021958
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5192,9788172235222,8172235224,Mistaken Identity,Nayantara Sahgal,Indic fiction (English),http://books.google.com/books/content?id=q-tKP...,On A Train Journey Home To North India After L...,2003.0,2.93,324.0,...,Mistaken Identity,non-fiction,non-fiction,0.163296,0.057973,0.713509,0.009185,0.008232,0.003564,0.044241
5193,9788173031014,8173031010,Journey to the East,Hermann Hesse,Adventure stories,http://books.google.com/books/content?id=rq6JP...,This book tells the tale of a man who goes on ...,2002.0,3.70,175.0,...,Journey to the East,non-fiction,non-fiction,0.013774,0.690091,0.005607,0.121302,0.008996,0.153166,0.007065
5194,9788179921623,817992162X,The Monk Who Sold His Ferrari: A Fable About F...,Robin Sharma,Health & Fitness,http://books.google.com/books/content?id=c_7mf...,"Wisdom to Create a Life of Passion, Purpose, a...",2003.0,3.82,198.0,...,The Monk Who Sold His Ferrari: A Fable About F...,fiction,fiction,0.005872,0.078451,0.005430,0.003024,0.004412,0.896033,0.006779
5195,9788185300535,8185300534,I Am that,Sri Nisargadatta Maharaj;Sudhakar S. Dikshit,Philosophy,http://books.google.com/books/content?id=Fv_JP...,This collection of the timeless teachings of o...,1999.0,4.51,531.0,...,I Am that: Talks with Sri Nisargadatta Maharaj,non-fiction,,0.093318,0.787642,0.010481,0.013733,0.007607,0.081673,0.005546
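
The reshaping step can also be written row by row: each book's predictions become one `{column: value}` dictionary keyed by label. The sketch below illustrates this with plain Python. `predictions_to_row` is a hypothetical helper, and the scores are an abridged copy (three of the seven labels) of the first two rows of the table above.

```python
def predictions_to_row(isbn10, predictions):
    """Flatten one book's predictions into a single {column: value} row."""
    row = {"isbn10": isbn10}
    for pred in predictions:
        row[pred["label"]] = pred["score"]
    return row

# Abridged scores copied from the first two rows of the merged table.
example_predictions = {
    "0002005883": [
        {"label": "fear", "score": 0.654842},
        {"label": "neutral", "score": 0.169852},
        {"label": "sadness", "score": 0.116408},
    ],
    "0002261987": [
        {"label": "fear", "score": 0.755521},
        {"label": "neutral", "score": 0.050591},
        {"label": "sadness", "score": 0.085620},
    ],
}

rows = [predictions_to_row(isbn, preds) for isbn, preds in example_predictions.items()]

# Highest-scoring label for each row, ignoring the key column.
dominant = {
    row["isbn10"]: max((k for k in row if k != "isbn10"), key=row.get)
    for row in rows
}
print(rows[0])
print(dominant)
```

A list of dictionaries like `rows` can be passed directly to `pd.DataFrame(rows)` to obtain one column per label.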


Finally, save the dataframe as CSV

In [8]:
books_df.to_csv("../data/cleaned_categorized_emotion_scored_books.csv", index=False)