# GENERAL SETUP
We start by importing the needed libraries.

For this we need [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) using CMake.
Installing it might take some time.

In [3]:
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python

Collecting llama-cpp-python
  Downloading llama_cpp_python-0.2.76.tar.gz (49.4 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m49.4/49.4 MB[0m [31m15.9 MB/s[0m eta [36m0:00:00[0m
[?25h  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Installing backend dependencies ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
Collecting diskcache>=5.6.1 (from llama-cpp-python)
  Downloading diskcache-5.6.3-py3-none-any.whl (45 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m45.5/45.5 kB[0m [31m6.3 MB/s[0m eta [36m0:00:00[0m
Building wheels for collected packages: llama-cpp-python
  Building wheel for llama-cpp-python (pyproject.toml) ... [?25l[?25hdone
  Created wheel for llama-cpp-python: filename=llama_cpp_python-0.2.76-cp310-cp310-linux_x86_64.whl size=80963763 sha256=3f31683b3c2d3a88c6bb472feb39d8d57dab14dbb7410bba1cf41bad49210bd5
  Stored in direct

In [4]:
# Provides random tools
import random
# Provides json files handling
import json
#P rovides regular expressions
import re
# Allows us to go grab models from hugging face
from huggingface_hub import hf_hub_download
# The Llama class is a wrapper for llama cpp models
from llama_cpp import Llama

## MODEL SETUP

This bit of code allows us to go grabe the model from Hugging Face.
This might take some time.

In [5]:
MODEL_NAME = "myclassunil/Emollama-chat-13b-v0.1.gguf"
MODEL_FILE = "Emollama-chat-13b-v0.1.gguf"
MODEL_PATH = hf_hub_download(MODEL_NAME,
                             filename=MODEL_FILE,
                             local_dir='/content')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Emollama-chat-13b-v0.1.gguf:   0%|          | 0.00/13.8G [00:00<?, ?B/s]

Loading the model and offloading all the layers to the GPU.

In [6]:
llm = Llama(model_path=MODEL_PATH,
            n_gpu_layers=-1)

llama_model_loader: loaded meta data with 21 key-value pairs and 363 tensors from /content/Emollama-chat-13b-v0.1.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = .
llama_model_loader: - kv   2:                           llama.vocab_size u32              = 32000
llama_model_loader: - kv   3:                       llama.context_length u32              = 2048
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 5120
llama_model_loader: - kv   5:                          llama.block_count u32              = 40
llama_model_loader: - kv   6:                  llama.feed_forward_length u32              = 13824
llama_model_loader: - kv   7:                 llama.rope.dimension_count u32   

The model is now usable!

## DATA SETUP

Grabbing the reviews from our json file.
If using the notebook via google colab you need to manually upload the  json review file.

In [27]:
FILENAME = 'reviews.json'
PATH_TO_DATA = '/content/' + str(FILENAME)

with open(PATH_TO_DATA, 'r', encoding='utf-8') as f:
    sites_data = json.load(f)

## USING EMOLLM

We'll start by defining a function for each one of the five tasks accessible through EmoLLM.
For transparency reasons, the names of the variables are taken from [EmoLLM's introductory paper](https://doi.org/10.48550/arXiv.2401.08508) (Zhiwei Liu and others, 2024).

In [22]:
# We need to define the values we'll deem acceptable from the LLM.

# Possible answers for the v_oc task.
acceptable_v_oc = [
    '3: very positive mental state can be inferred',
    '2: moderately positive mental state can be inferred',
    '1: slightly positive mental state can be inferred',
    '0: neutral or mixed mental state can be inferred',
    '-1: slightly negative mental state can be inferred',
    '-2: moderately negative mental state can be inferred',
    '-3: very negative mental state can be inferred'
    ]

# Possible answers for the e_c task.
acceptable_e_c = [
    'anger',
    'anticipation',
    'disgust',
    'fear',
    'joy',
    'love',
    'optimism',
    'pessimism',
    'sadness',
    'surprise',
    'trust'
    ]

# This is the list of sentiments that are supported for ei_reg and ei_oc.
main_sentiments = ['joy', 'anger', 'fear', 'sadness']

# Possible answers for the ei_oc task.
base_ei_oc = [
    '0: no E can be inferred',
    '1: low amount of E can be inferred',
    '2: moderate amount of E can be inferred',
    '3: high amount of E can be inferred'
    ]

# Adapting the possible answers to include particular emotions.
acceptable_ei_oc = []
for sentiment in main_sentiments:
    for message in base_ei_oc:
        acceptable_ei_oc.append(message.replace("E", sentiment))


def get_v_reg(text, max_tok):
    """Estimating the valence of the review and representing it as a float."""
    prompt = f'''
    Human:
    Task: Evaluate the valence intensity of the writer's mental state based on
     the text, assigning it a real-valued score from 0 (most negative)
     to 1 (most positive).
    Text: {text}
    Intensity Score:
    '''
    v_reg = "Aberrant answer from EmoLLM : "
    raw_v_reg = llm(prompt, max_tokens=max_tok)['choices'][0]['text']
    try:
        if 0 <= float(raw_v_reg) <= 1:
            v_reg = raw_v_reg
    except ValueError:
        print("Something went wrong with get_v_reg")
        v_reg += str(raw_v_reg)
    return v_reg


# For each task, the default value is "Aberrant answer from EmoLLM"
# if the model does well, we replace this message with its answer
# otherwise, we add the aberrant answer in case it holds useful data anyway.

def get_v_oc(text, max_tok):
    """Estimates the valence of the review and represents it
    as an ordinal class."""
    prompt = f'''
        Human:
        Task: Categorize the text into an ordinal class that best characterizes
        the writer's mental state, considering various degrees of positive and
        negative sentiment intensity. 3: very positive mental state can be
        inferred. 2: moderately positive mental state can be inferred. 1:
        slightly positive mental state can be inferred. 0: neutral or mixed
        mental state can be inferred. -1: slightly negative mental state can
        be inferred. -2: moderately negative mental state can be inferred. -3:
        very negative mental state can be inferred
        Text: {text}
        Intensity Class:
        '''
    v_oc = "Aberrant answer from EmoLLM : "
    raw_v_oc = llm(prompt, max_tokens=max_tok)['choices'][0]['text']
    if raw_v_oc in acceptable_v_oc:
        v_oc = raw_v_oc
    else:
        v_oc += str(raw_v_oc)
    return raw_v_oc


def get_e_c(text, max_tok):
    """Identifies which sentiments are present in the review."""
    prompt = f'''
        Task: Categorize the text's emotional tone as either 'neutral or
        no emotion' or identify the presence of one or more of the given
        emotions (anger, anticipation, disgust, fear, joy, love, optimism,
        pessimism, sadness, surprise, trust).
        Text: {text}
        This tweet contains emotions:
        '''
    e_c = "Aberrant answer from EmoLLM : "
    raw_e_c = llm(prompt, max_tokens=max_tok)['choices'][0]['text']
    temp_e_c = re.split(r'[^a-zA-Z]', raw_e_c)
    while '' in temp_e_c:
        temp_e_c.remove('')
    is_aberrant = False
    for sent in temp_e_c:
        if sent not in acceptable_e_c:
            is_aberrant = True
            break
    if not is_aberrant:
        e_c = temp_e_c
    else:
        e_c += str(raw_e_c)
    return e_c


def get_ei_reg(text, max_tok, sent):
    """Estimates the intensity of a sentiment and represents it as a float."""
    prompt = f'''
        Human:
        Task: Assign a numerical value between 0 (least E) and 1 (most E) to
        represent the intensity of emotion E expressed in the text.
        Text: {text}
        Emotion: {sent}
        Intensity Score:
        '''
    ei_reg = "Aberrant answer from EmoLLM : "
    raw_ei_reg = llm(prompt, max_tokens=max_tok)['choices'][0]['text']
    try:
        if 0 <= float(raw_ei_reg) <= 1:
            ei_reg = raw_ei_reg
    except ValueError:
        print("Something went wrong with get_ei_reg")
        ei_reg += str(raw_ei_reg)
    return ei_reg


def get_ei_oc(text, max_tok, sent):
    """Estimates the intensity of a sentiment and represents it
    as an ordinal class."""
    prompt = f'''
        Task: Categorize the tweet into an intensity level of the specified
        emotion E, representing the mental state of the tweeter. 0: no E can
        be inferred. 1: low amount of E can be inferred. 2: moderate amount of
        E can be inferred. 3: high amount of E can be inferred.
        Tweet: {text}
        Emotion: {sent}
        Intensity Score:
        '''
    ei_oc = "Aberrant answer from EmoLLM : "
    raw_ei_oc = llm(prompt, max_tokens=max_tok)['choices'][0]['text']
    if raw_ei_oc in acceptable_ei_oc:
        ei_oc = raw_ei_oc
    else:
        ei_oc += str(raw_ei_oc)
    return [raw_ei_oc, ei_oc]


We can now create a function that applies the five tools given by EmoLLM to our reviews.

In [19]:
def sentiment_analysis(sites, max_tok):
    """Processing the reviews using EmoLLM"""
    # We create a dict to hold the sentiment analysis results.
    sa_results = {}
    # We create a dict for each site results.
    for i, site in enumerate(sites):
        sa_site = {}
        # Processing the different reviews with the five EmoLLM tasks.
        for j, review in enumerate(sites[site]):
            sa_review = {
                #'rating' : sites[site]['rating'],
                'v_reg' : get_v_reg(review, max_tok),
                'v_oc' : get_v_oc(review, max_tok),
                'e_c' : get_e_c(review, max_tok),
                'ei_reg' : [],
                'ei_oc' : []
            }
            for sent in sa_review['e_c']:
                if sent in main_sentiments:
                    current_reg = {sent : \
                                   get_ei_reg(review, max_tok, sent)}
                    current_oc = {sent : \
                                  get_ei_oc(review, max_tok, sent)}
                    sa_review['ei_reg'].append(current_reg)
                    sa_review['ei_oc'].append(current_oc)
            sa_site[j] = sa_review
        sa_results[i] = sa_site
    return sa_results

Because we have limited ressources for this project, the possibility to use GPU was limited.
We needed to generate placeholder results.
This function is useless for the final state of the project.
It generates a mockup version of the output file.
This function is not perfect. For example, regressions and ordinal classifications are not corellated.
Furthermore, it does not generate aberrant answers.

In [11]:
def mockup_sentiment_analysis(sites):
    """As GPU ressource are scarce, this function creates
    a placeholder for data."""
    # We create a dict to hold the sentiment analysis results.
    sa_results = {}
    # We create a dict for each site results.
    for _, site in enumerate(sites):
        sa_site = {}
        # Processing the different reviews with the five EmoLLM tasks.
        for j in range(len(sites[site])):
            sa_review = {
            'v_reg' : str(round(random.random(), 3)),
            'v_oc' : random.choice(acceptable_v_oc),
            'e_c' : [],
            'ei_reg' : [],
            'ei_oc' : []
            }

            for _ in range(random.randrange(1, len(acceptable_e_c))):
                random_e_c = random.choice(acceptable_e_c)
                if random_e_c not in sa_review['e_c']:
                    sa_review['e_c'].append(random_e_c)
            for sent in sa_review['e_c']:
                if sent in main_sentiments:
                    current_reg = {sent : str(round(random.random(), 3))}
                    random_oc = random.choice(base_ei_oc).replace("E", sent)
                    current_oc = {sent : random_oc}
                    sa_review['ei_reg'].append(current_reg)
                    sa_review['ei_oc'].append(current_oc)
            sa_site[f'review_{j}'] = sa_review
        sa_results[site] = sa_site
    return sa_results


Finally, we can dump the output of this process in a json file.

In [28]:
outputs = sentiment_analysis(sites_data, 500)
print(outputs)
with open('output_sentiment_analysis_data.json', 'w', encoding='utf-8') as fp:
    json.dump(outputs, fp)


Llama.generate: prefix-match hit

llama_print_timings:        load time =    1328.97 ms
llama_print_timings:      sample time =       3.66 ms /     6 runs   (    0.61 ms per token,  1641.14 tokens per second)
llama_print_timings: prompt eval time =     561.09 ms /   192 tokens (    2.92 ms per token,   342.19 tokens per second)
llama_print_timings:        eval time =     304.12 ms /     5 runs   (   60.82 ms per token,    16.44 tokens per second)
llama_print_timings:       total time =     875.05 ms /   197 tokens
Llama.generate: prefix-match hit

llama_print_timings:        load time =    1328.97 ms
llama_print_timings:      sample time =       8.06 ms /    13 runs   (    0.62 ms per token,  1613.70 tokens per second)
llama_print_timings: prompt eval time =     671.89 ms /   277 tokens (    2.43 ms per token,   412.27 tokens per second)
llama_print_timings:        eval time =     741.95 ms /    12 runs   (   61.83 ms per token,    16.17 tokens per second)
llama_print_timings:       to

{0: {0: {'v_reg': '0.845', 'v_oc': '2: moderately positive emotional state can be inferred', 'e_c': ['joy', 'optimism'], 'ei_reg': [{'joy': '0.654'}], 'ei_oc': [{'joy': ['2.46', 'Aberrant answer from EmoLLM : 2.46']}]}, 1: {'v_reg': '0.85', 'v_oc': '3: very positive emotional state can be inferred', 'e_c': ['joy', 'optimism'], 'ei_reg': [{'joy': '0.75'}], 'ei_oc': [{'joy': ['3: high amount of joy can be inferred', '3: high amount of joy can be inferred']}]}, 2: {'v_reg': '0.583', 'v_oc': '2: moderately positive emotional state can be inferred', 'e_c': ['anger', 'disgust', 'optimism', 'sadness'], 'ei_reg': [{'anger': '0.521'}, {'sadness': '0.345'}], 'ei_oc': [{'anger': ['1.458', 'Aberrant answer from EmoLLM : 1.458']}, {'sadness': ['0.346', 'Aberrant answer from EmoLLM : 0.346']}]}}, 1: {0: {'v_reg': '0.917', 'v_oc': '3: very positive emotional state can be inferred', 'e_c': ['joy', 'optimism'], 'ei_reg': [{'joy': '0.729'}], 'ei_oc': [{'joy': ['2: moderate amount of joy can be inferred'