# Sentiment analysis of the IMDB dataset using Llama 3 70B

Student: Leandro Carísio Fernandes

The goal is to run sentiment analysis on the IMDB dataset using a few prompt engineering techniques.

According to the spreadsheet, I am responsible for the reviews at indices 6019:6482 and 18519:18982 (end-exclusive, so 463 reviews per range and 926 in total).

# Parameters


Groq API access: https://console.groq.com/playground . The same page lets you build prompts and get the code for the corresponding API calls.

In [87]:
from getpass import getpass
GROQ_API = getpass("API groq")

API groq ········


In [201]:
ARQUIVO_RESULTADOS_EXPERIMENTOS = 'resultados.csv'

EXP_1_ZERO_SHOT = {
    'NOME': 'llama_3_70b_zero_shot',
    'REFAZER': False
}

EXP_2_ZERO_SHOT = {
    'NOME': 'llama_3_70b_zero_shot_alternative_prompt',
    'REFAZER': False
}

EXP_3_FEW_SHOT = {
    'NOME': 'llama_3_70b_few_shot_2_samples',
    'REFAZER': False
}

EXP_4_COT = {
    'NOME': 'llama_3_70b_cot',
    'REFAZER': False
}

MODELO = "llama3-70b-8192"

In [2]:
!pip install groq



# Reading the dataset

Full training and test datasets:

In [15]:
import pandas as pd

train_dataset = pd.read_csv('imdb_train.csv')
test_dataset = pd.read_csv('imdb_test.csv')

Dataset filtered for the experiments.

I was assigned the classification of indices 6019:6482 and 18519:18982.

In [90]:
import os

def carrega_df_exprerimentos():
    return pd.read_csv(ARQUIVO_RESULTADOS_EXPERIMENTOS, index_col=0)
def salva_df_experimentos():
    df_experimentos.to_csv(ARQUIVO_RESULTADOS_EXPERIMENTOS, index=True)
    
if os.path.exists(ARQUIVO_RESULTADOS_EXPERIMENTOS):
    df_experimentos = carrega_df_exprerimentos()
else:
    print("No file with experiment results was found.")
    print("Creating a new file containing only the texts and labels.")
    # Select the text and label for indices 6019:6482 and 18519:18982
    indices = list(range(6019, 6482)) + list(range(18519, 18982))
    # An explicit copy keeps later column assignments from touching test_dataset
    df_experimentos = test_dataset.iloc[indices].copy()
    salva_df_experimentos()
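
As a quick sanity check on the slice sizes, the sketch below rebuilds the index list with plain Python (no dataset required) and counts the entries. Because `range` excludes its upper bound, each slice holds 463 reviews. The names `first_block` and `second_block` are illustrative only.

```python
# Illustrative check: range() excludes its end value, so 6019:6482 covers
# indices 6019..6481 and 18519:18982 covers 18519..18981.
first_block = list(range(6019, 6482))
second_block = list(range(18519, 18982))
indices = first_block + second_block

print(len(first_block), len(second_block), len(indices))  # 463 463 926
print(indices[0], indices[-1])  # 6019 18981
```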

In [91]:
df_experimentos.head()

Unnamed: 0,text,label,llama_3_70b_zero_shot
6019,I had the privilege of seeing this powerful pl...,Negative,Positive
6020,"************* SPOILERS BELOW ************* ""'N...",Negative,negative
6021,This film is likely to be a real letdown unles...,Negative,Negative
6022,Picking this up along with the rest of the Mar...,Negative,Negative
6023,"Jumpin' Butterballs, this movie stinks! It's a...",Negative,Negative


In [92]:
df_experimentos.tail()

Unnamed: 0,text,label,llama_3_70b_zero_shot
18977,I am so glad Zac was in 'The Suite life of Zac...,Positive,Positive
18978,"When I first saw the ad for this, I was like '...",Positive,Positive
18979,I read the above comment and cannot believe it...,Positive,positive
18980,"I was wandering through my local library, brow...",Positive,Positive
18981,"Wonderful songs, sprightly animation and authe...",Positive,positive


# Funções para acessar o modelo

In [217]:
from groq import Groq

client = Groq(api_key=GROQ_API)

def chat_completion(user_message, system_message=None, initial_messages=None, temperature=0, max_tokens=1024, top_p=1):
    # Documentation: https://console.groq.com/docs/text-chat
    messages = []
    # Add the system message only if one was provided
    if system_message is not None:
        messages.append({
            # Set an optional system message. This sets the behavior of the
            # assistant and can be used to provide specific instructions for
            # how it should behave throughout the conversation.
            "role": "system",
            "content": system_message
        })
    # Add any initial messages (useful for few-shot, for example).
    # The default is None rather than [] to avoid a shared mutable default argument.
    messages.extend(initial_messages or [])
    # Always add the user prompt
    messages.append({
        # Set a user message for the assistant to respond to.
        "role": "user",
        "content": user_message
    })
    chat_completion = client.chat.completions.create(
        messages=messages,
        
        # The language model which will generate the completion.
        model=MODELO,
        
        #
        # Optional parameters
        #
        
        # Controls randomness: lowering results in less random completions.
        # As the temperature approaches zero, the model will become deterministic
        # and repetitive.
        temperature=temperature,
        
        # The maximum number of tokens to generate. Prompt and completion share
        # the model's context window (8,192 tokens for llama3-70b-8192).
        max_tokens=max_tokens,
        
        # Controls diversity via nucleus sampling: 0.5 means half of all
        # likelihood-weighted options are considered.
        top_p=top_p,
        
        # A stop sequence is a predefined or user-specified text string that
        # signals an AI to stop generating content, ensuring its responses
        # remain focused and concise. Examples include punctuation marks and
        # markers like "[end]".
        stop=None,
        
        # If set, partial message deltas will be sent.
        stream=False,
    )
    # Return the text of the first choice and, in case it is needed later,
    # the full response object
    return chat_completion.choices[0].message.content, chat_completion
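
The message assembly inside `chat_completion` can be exercised without calling the API. The sketch below isolates that logic in a hypothetical helper, `build_messages` (not part of the notebook or of the Groq client), so the resulting conversation layout can be inspected offline. The example texts are made up for illustration.

```python
def build_messages(user_message, system_message=None, initial_messages=None):
    """Return the chat message list: optional system message, optional
    prior turns (e.g. few-shot examples), then the user prompt."""
    messages = []
    if system_message is not None:
        messages.append({"role": "system", "content": system_message})
    # extend() appends to the new list, so the caller's list is not modified.
    messages.extend(initial_messages or [])
    messages.append({"role": "user", "content": user_message})
    return messages


# Hypothetical few-shot conversation with one labeled example.
example_turns = [
    {"role": "user", "content": "A dull, lifeless film."},
    {"role": "assistant", "content": "negative"},
]
conversation = build_messages(
    "I loved every minute of it.",
    system_message="Classify the review as positive or negative.",
    initial_messages=example_turns,
)
print([m["role"] for m in conversation])
# ['system', 'user', 'assistant', 'user']
```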

In [228]:
from tqdm import tqdm
import time

def classifica_varias_sentencas(nome_experimento,
                                refazer_experimento,
                                prompt,
                                # Receives the position of the test row and returns a list of messages. Useful for few-shot
                                get_initial_messages = lambda x: [],
                                system_message = None,
                                temperature=0,
                                max_tokens=1024,
                                top_p=1,
                                msg=None,
                                delay=None):
    # Iterate over the experiments data frame
    for i, (index, row) in enumerate(tqdm(df_experimentos.iterrows(), total=df_experimentos.shape[0], desc=msg)):
        # Build the user message from the prompt template
        user_message = prompt.format(sentence=row['text'])
        
        # A call is needed if refazer_experimento is True or if there is no
        # result for this row yet.
        # Each response is saved to the CSV right after the call, so if something
        # fails, calling the function again with refazer_experimento=False
        # completes the experiment from where it stopped.
        resposta_atual_experimento = df_experimentos.at[index, nome_experimento]
        # Empty cells come back as NaN when the CSV is reloaded, so check for both
        if refazer_experimento or pd.isna(resposta_atual_experimento) or resposta_atual_experimento == '':
            resposta_ai, _ = chat_completion(user_message=user_message,
                                             system_message=system_message, 
                                             initial_messages=get_initial_messages(i), 
                                             temperature=temperature,
                                             max_tokens=max_tokens, 
                                             top_p=top_p)
            df_experimentos.at[index, nome_experimento] = resposta_ai
            salva_df_experimentos()
            if delay is not None:
                time.sleep(delay)
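
The resume logic depends on recognizing rows that still have no answer. Below is a minimal sketch of that check, assuming a missing answer may appear as an empty string, `None`, or a float NaN (which is how pandas represents empty CSV cells on reload). `needs_request` is a hypothetical helper, not a function from the notebook.

```python
import math


def needs_request(current_value, redo=False):
    """Return True when the model must be called for this row."""
    if redo:
        return True
    if current_value is None:
        return True
    # Empty CSV cells are read back as float NaN.
    if isinstance(current_value, float) and math.isnan(current_value):
        return True
    # Treat blank or whitespace-only strings as missing too.
    return isinstance(current_value, str) and current_value.strip() == ""


print(needs_request(""))                     # True
print(needs_request(float("nan")))           # True
print(needs_request("positive"))             # False
print(needs_request("positive", redo=True))  # True
```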

In [219]:
def inicializa_df_para_experimento(nome_experimento, refazer_experimento = False):
    if refazer_experimento or nome_experimento not in df_experimentos.columns:
        df_experimentos[nome_experimento] = ''
        salva_df_experimentos()

# Experiment 1 - Llama 3 - 70B - Zero-shot

The simplest experiment uses exactly the same prompt as in Class 6, with no system message.

In [79]:
prompt_zero_shot = """You are a sentiment classifier. Use only "positive" or "negative" in your answer.

Sentece: {sentence}"""

inicializa_df_para_experimento(EXP_1_ZERO_SHOT['NOME'])
classifica_varias_sentencas(EXP_1_ZERO_SHOT['NOME'], EXP_1_ZERO_SHOT['REFAZER'], prompt_zero_shot, get_initial_messages = lambda x: [], system_message = None, temperature=0, max_tokens=1024, top_p=1, msg='Experimento 1')

Experimento 1: 100%|█████████████████████████████████████████████████████████████████| 926/926 [56:36<00:00,  3.67s/it]


# Experiment 2 - Llama 3 - 70B - Zero-shot - More elaborate message

The previous experiment sent a simple instruction in the user prompt. This one moves the instruction into a more elaborate system message and sends only the review as the user prompt.

In [102]:
alt_msg_system_zero_shot = """You are a movie review sentiment analyzer. The user will send you a review and you will classify the text with JUST ONE WORD: "positive" or "negative"."""
alt_prompt_zero_shot = "{sentence}"

inicializa_df_para_experimento(EXP_2_ZERO_SHOT['NOME'])
classifica_varias_sentencas(EXP_2_ZERO_SHOT['NOME'], EXP_2_ZERO_SHOT['REFAZER'], alt_prompt_zero_shot, get_initial_messages = lambda x: [], system_message = alt_msg_system_zero_shot, temperature=0, max_tokens=1024, top_p=1, msg='Experimento 2')

Experimento 2: 100%|███████████████████████████████████████████████████████████████| 926/926 [1:03:03<00:00,  4.09s/it]


# Experiment 3 - Llama 3 - 70B - Few-shot

Experiment 3 (few-shot) uses two random examples per request, one negative and one positive. The conversation contains the system message, followed by user/assistant message pairs with the examples. The last message is the user prompt with the review to classify.

In [192]:
# The positive and negative examples must come from the training set.
# In that set, the negative reviews are the first 12,500 records and the positive ones are the last 12,500.
#
# random_state=42 makes the sampling reproducible, so if generation fails partway it can resume with the
# same examples
few_shot_exemplos_negativos = train_dataset.iloc[0:12500].sample(n=len(df_experimentos), replace=True, random_state=42)
few_shot_exemplos_positivos = train_dataset.iloc[12500:25000].sample(n=len(df_experimentos), replace=True, random_state=42)

def get_initial_messages_few_shot(i):
    return [
        {
            "role": "user",
            "content": few_shot_exemplos_negativos.iloc[i]['text']
        },
        {
            "role": "assistant",
            "content": few_shot_exemplos_negativos.iloc[i]['label'].lower()
        },
        {
            "role": "user",
            "content": few_shot_exemplos_positivos.iloc[i]['text']
        },
        {
            "role": "assistant",
            "content": few_shot_exemplos_positivos.iloc[i]['label'].lower()
        }
    ]

In [193]:
# Check that the messages are built correctly:

print(few_shot_exemplos_negativos.head())
print('*'*50)
print(get_initial_messages_few_shot(len(df_experimentos)-1))
print('*'*50)
print(get_initial_messages_few_shot(1))

                                                    text     label
7270   I had absolutely nothing to do the past weeken...  Negative
860    I have read the novel Reaper of Ben Mezrich a ...  Negative
5390   The story is similar to ET: an extraterrestria...  Negative
5191   When a film is independent and not rated, such...  Negative
11964  Although a film with Bruce Willis is always wo...  Negative
**************************************************
[{'role': 'user', 'content': 'Somewhere, out there, there must be a list of the all time worst gay films every made. One\'s that have overlong camera shots of the stars sitting and staring pensively into space, or one\'s where they focus unbearably long on kitty kats eating spaghetti. This motion sickness picture is a story of a boy and a boy and they live and love and swim and get stuck in grottos and one of them has a depressed mother and another has no mother and they talk and walk and swim and have sex and get drunk and then break up and

In [196]:
system_few_shot = """You are a movie review sentiment analyzer. The user will send you a review and you will classify the text with JUST ONE WORD: "positive" or "negative"."""
prompt_few_shot = "{sentence}"

inicializa_df_para_experimento(EXP_3_FEW_SHOT['NOME'])
classifica_varias_sentencas(EXP_3_FEW_SHOT['NOME'],
                            EXP_3_FEW_SHOT['REFAZER'],
                            prompt_few_shot,
                            get_initial_messages = get_initial_messages_few_shot,
                            system_message=system_few_shot,
                            temperature=0,
                            max_tokens=1024,
                            top_p=1,
                            msg='Experimento 3')

Experimento 3: 100%|█████████████████████████████████████████████████████████████████| 926/926 [15:37<00:00,  1.01s/it]


# Experiment 4 - Llama 3 - 70B - Chain of Thought

The idea behind Chain of Thought is simple. It is a few-shot prompt, but the assistant writes out its "reasoning" before giving the final answer.

To implement this, the example messages sent before the review to be evaluated need to change: each example answer must also include a "reasoning". I asked ChatGPT to write a rationale for one positive and one negative review from the training set. The negative example is the review train_dataset['text'][10]; the positive example is train_dataset['text'][12510].

In [210]:
# Negative and positive examples used:

print(train_dataset['text'][10])
print('*'*50)
print(train_dataset['text'][12510])

It was great to see some of my favorite stars of 30 years ago including John Ritter, Ben Gazarra and Audrey Hepburn. They looked quite wonderful. But that was it. They were not given any characters or good lines to work with. I neither understood or cared what the characters were doing.<br /><br />Some of the smaller female roles were fine, Patty Henson and Colleen Camp were quite competent and confident in their small sidekick parts. They showed some talent and it is sad they didn't go on to star in more and better films. Sadly, I didn't think Dorothy Stratten got a chance to act in this her only important film role.<br /><br />The film appears to have some fans, and I was very open-minded when I started watching it. I am a big Peter Bogdanovich fan and I enjoyed his last movie, "Cat's Meow" and all his early ones from "Targets" to "Nickleodeon". So, it really surprised me that I was barely able to keep awake watching this one.<br /><br />It is ironic that this movie is about a detect

In [214]:
cot_negative = """The review provided leans heavily towards a negative perspective, primarily criticizing various aspects of the film while sparingly acknowledging any positives. The reviewer starts by expressing disappointment that despite the presence of celebrated actors like John Ritter, Ben Gazarra, and Audrey Hepburn, these talents were underutilized as they were not given substantial characters or impactful dialogue. This suggests a failure in scriptwriting and character development. The review also points out a lack of engaging content, as indicated by the reviewer's struggle to stay awake and a failure to connect with the storyline or characters' actions. Additionally, while there's some appreciation for the acting of lesser-known performers, this is overshadowed by lamentations on their underrepresentation in cinema. The overall dissatisfaction is further underscored by comparisons to other works by the same director, Peter Bogdanovich, which the reviewer found far superior.

In conclusion, the review is decidedly negative. The reviewer's frustrations are clearly expressed through critiques of poor character development, lackluster dialogue, and an unengaging plot. The final summary bluntly states that the film is neither as memorable as "Paper Moon" nor as entertaining as "What’s Up, Doc?", indicating a significant letdown for the reviewer, especially given their admiration for the director's previous works. Thus, the overall impression of the film is disappointing, highlighting a failure to live up to the expectations set by the director's earlier successes.

Answer: Negative"""

cot_positive = """Lars von Trier's film "Europa" is portrayed in this review as a complex and stylistically unique cinematic experience that engages deeply with themes of post-war intrigue and personal dilemma. The protagonist, Leopold Kessler, portrayed by Jean-Marc Barr, is an American of German descent who finds himself embroiled in a post-WWII European setting fraught with danger and espionage. His character’s reluctance to take sides in an ongoing underground conflict reflects the film's exploration of moral ambiguity and identity, a theme common in the film noir genre. The reviewer highlights the film’s atmospheric use of black and white visuals interspersed with bursts of color and Max von Sydow’s hypnotic narration, enhancing its dreamlike quality. This narrative technique, along with the setting of a snowy, nocturnal Europe, contributes to a surreal, almost otherworldly atmosphere.

The review is overwhelmingly positive, detailing the film’s engaging plot and sophisticated cinematography. The reviewer appreciates how the narrative keeps the audience guessing, making the film "endlessly unpredictable." The depiction of Leopold’s journey from a passive observer to a more assertive figure who humorously and violently takes control is seen as a compelling transformation. This personal engagement with the character's evolution and the film's aesthetic qualities culminate in the reviewer declaring "Europa" as a personal favorite, indicating a strong emotional and intellectual resonance with the film. Overall, the review conveys a deep appreciation for the film’s artistic achievements and thematic depth.

Answer: Positive"""

def get_initial_messages_chain_of_thought(i):
    return [
        {
            "role": "user",
            "content": train_dataset['text'][10]
        },
        {
            "role": "assistant",
            "content": cot_negative
        },
        {
            "role": "user",
            "content": train_dataset['text'][12510]
        },
        {
            "role": "assistant",
            "content": cot_positive
        }
    ]


In [235]:
system_cot = """You are a movie review sentiment analyzer. The user will send you a movie review, and your role is to provide an answer indicating whether the review is positive or negative. Before giving your answer, you should explain your reasoning in one or two paragraphs. Then, after the reasoning, you will provide the final answer using one word: "positive" or "negative"."""
prompt_cot = "{sentence}"

inicializa_df_para_experimento(EXP_4_COT['NOME'])
classifica_varias_sentencas(EXP_4_COT['NOME'],
                            EXP_4_COT['REFAZER'],
                            prompt_cot,
                            get_initial_messages = get_initial_messages_chain_of_thought,
                            system_message=system_cot,
                            temperature=0,
                            max_tokens=2048,
                            top_p=1,
                            msg='Experimento 4',
                            delay=5)

Experimento 4: 100%|███████████████████████████████████████████████████████████████| 926/926 [1:17:14<00:00,  5.01s/it]
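
Experiment 4 waits a fixed `delay` between calls. An alternative worth considering is to retry only when a call fails, waiting longer after each failure. The sketch below shows that pattern in isolation. `call_with_retry`, its parameters, and the `flaky` function are hypothetical, and nothing here is specific to the Groq client, which may raise different exception types.

```python
import time


def call_with_retry(fn, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fn() and retry on any exception, doubling the wait each time."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            # Give up after the last attempt and surface the error.
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Simulated call that fails twice before succeeding.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return "positive"


waits = []
# Record the waits in a list instead of actually sleeping.
result = call_with_retry(flaky, sleep=waits.append)
print(result, waits)  # positive [1.0, 2.0]
```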


# Metrics

In [222]:
# Takes the text and checks whether the string positive or negative appears in it
def extrai_classificacao_naive(texto):
    texto = texto.lower()

    if 'positive' in texto:
        classificacao = 'Positive'
    elif 'negative' in texto:
        classificacao = 'Negative'
    else:
        classificacao =  ''

    return classificacao

# Checks whether the string positive or negative appears in the last paragraph (last line) of the generated text
def extrai_classificacao_final_texto(texto):
    texto = texto.lower()
    paragrafos = texto.split('\n')
    ultimo_paragrafo = paragrafos[-1]
    return extrai_classificacao_naive(ultimo_paragrafo)

def avalia_resposta_modelo(nome_experimento, classificador=extrai_classificacao_naive):
    # Iterate over the experiments data frame and count the correct predictions
    acc = 0
    for index, row in tqdm(df_experimentos.iterrows(), total=df_experimentos.shape[0], desc=f'Avaliando {nome_experimento}'):
        classe_esperada = row['label']
        classe_prevista_modelo = classificador(row[nome_experimento])
        acc += (1. if classe_esperada == classe_prevista_modelo else 0.)
    return acc / df_experimentos.shape[0]
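
Because `extrai_classificacao_naive` checks for "positive" first, a response that mentions both words is always labeled Positive, and `extrai_classificacao_final_texto` returns an empty label when the response ends with a blank line. A possible alternative, sketched here as an assumption rather than the method used for the reported numbers, takes the last occurrence of either word in the whole response. `extract_last_label` is a hypothetical name.

```python
import re

LABEL_PATTERN = re.compile(r"\b(positive|negative)\b", re.IGNORECASE)


def extract_last_label(text):
    """Return 'Positive' or 'Negative' based on the last label word found,
    or '' when neither word appears."""
    matches = LABEL_PATTERN.findall(text)
    if not matches:
        return ""
    return matches[-1].capitalize()


print(extract_last_label("Not positive at all.\n\nAnswer: Negative\n"))  # Negative
print(extract_last_label("positive"))  # Positive
print(repr(extract_last_label("I cannot tell.")))  # ''
```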

In [236]:
print('\nExperimento 1: Zero-shot')
print(avalia_resposta_modelo(EXP_1_ZERO_SHOT['NOME'], classificador=extrai_classificacao_naive))

print('\nExperimento 2: Zero-shot, mensagem mais elaborada')
print(avalia_resposta_modelo(EXP_2_ZERO_SHOT['NOME'], classificador=extrai_classificacao_naive))

print('\nExperimento 3: Few-shot, 2 exemplos')
print(avalia_resposta_modelo(EXP_3_FEW_SHOT['NOME'], classificador=extrai_classificacao_naive))

print('\nExperimento 4: Chain-of-thought')
print(avalia_resposta_modelo(EXP_4_COT['NOME'], classificador=extrai_classificacao_final_texto))



Experimento 1: Zero-shot


Avaliando llama_3_70b_zero_shot: 100%|████████████████████████████████████████████| 926/926 [00:00<00:00, 19067.45it/s]


0.9578833693304536

Experimento 2: Zero-shot, mensagem mais elaborada


Avaliando llama_3_70b_zero_shot_alternative_prompt: 100%|█████████████████████████| 926/926 [00:00<00:00, 21682.77it/s]


0.958963282937365

Experimento 3: Few-shot, 2 exemplos


Avaliando llama_3_70b_few_shot_2_samples: 100%|███████████████████████████████████| 926/926 [00:00<00:00, 17477.05it/s]


0.9622030237580994

Experimento 4: Chain-of-thought


Avaliando llama_3_70b_cot: 100%|██████████████████████████████████████████████████| 926/926 [00:00<00:00, 18885.18it/s]

0.9643628509719222





# Conclusions

The results were:


| Experiment                         | Accuracy (%) |
|------------------------------------|--------------|
| 1 - Zero-shot                      | 95.79        |
| 2 - Zero-shot (alternative prompt) | 95.90        |
| 3 - Few-shot (2 examples)          | 96.22        |
| 4 - Chain of Thought (2 examples)  | 96.44        |

Since the initial results were already very good (95.79% accuracy), the gains from the other techniques were marginal: the best experiment beats the worst by 0.65 percentage points, which is 6 reviews out of 926. It is very hard to tune the prompt for better results when the baseline is already this good.

In this notebook, although the more advanced techniques slightly improved the results, the difference is so small that I cannot tell whether it is due to the technique itself or just to the change in prompt wording. In other words, a zero-shot run with a slightly better prompt might have produced the same results.
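
One way to gauge how small these differences are is a rough 95% confidence interval for each accuracy, using the normal approximation for a proportion over the 926 reviews. This is only a sketch: the correct-prediction counts are back-computed from the accuracies printed in the Metrics section, and since all experiments score the same reviews, a paired test such as McNemar's would be a more appropriate comparison.

```python
import math

N_REVIEWS = 926
# Correct predictions implied by the reported accuracies (e.g. 887 / 926 = 0.95788...).
correct = {
    "1 - Zero-shot": 887,
    "2 - Zero-shot (alternative prompt)": 888,
    "3 - Few-shot": 891,
    "4 - Chain of Thought": 893,
}


def wald_interval(k, n, z=1.96):
    """Normal-approximation confidence interval for a proportion k / n."""
    p = k / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


for name, k in correct.items():
    low, high = wald_interval(k, N_REVIEWS)
    print(f"{name}: {k / N_REVIEWS:.4f} [{low:.4f}, {high:.4f}]")
```

Each interval extends roughly 1.2 to 1.3 percentage points on either side of the accuracy, which is wider than the 0.65-point gap between experiments 1 and 4, so the intervals overlap heavily.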

All Llama responses are available in the CSV file (resultados.csv) that accompanies this notebook. The results are in this format:

In [243]:
display(df_experimentos)

Unnamed: 0,text,label,llama_3_70b_zero_shot,llama_3_70b_zero_shot_alternative_prompt,llama_3_70b_few_shot_2_samples,llama_3_70b_cot
6019,I had the privilege of seeing this powerful pl...,Negative,Positive,Negative,positive,This review is a mixed assessment of the film ...
6020,"************* SPOILERS BELOW ************* ""'N...",Negative,negative,negative,negative,"This review is overwhelmingly negative, with t..."
6021,This film is likely to be a real letdown unles...,Negative,Negative,negative,negative,"This review is largely negative, with the revi..."
6022,Picking this up along with the rest of the Mar...,Negative,Negative,negative,negative,"This review is overwhelmingly negative, expres..."
6023,"Jumpin' Butterballs, this movie stinks! It's a...",Negative,Negative,negative,negative,"This review is extremely negative, expressing ..."
...,...,...,...,...,...,...
18977,I am so glad Zac was in 'The Suite life of Zac...,Positive,Positive,positive,positive,This review is an enthusiastic and glowing end...
18978,"When I first saw the ad for this, I was like '...",Positive,Positive,positive,positive,"This review is generally positive, despite the..."
18979,I read the above comment and cannot believe it...,Positive,positive,positive,positive,This review is extremely positive and enthusia...
18980,"I was wandering through my local library, brow...",Positive,Positive,positive,positive,"This review is overwhelmingly positive, with t..."
