# LLM Experiments with Twitter Sentiment

Early experiments with LLMs: try to estimate the sentiment of some tweets using a large language model.

## Libraries

In [3]:
import os
import re

import numpy as np
import pandas as pd

from kaggle.api.kaggle_api_extended import KaggleApi
from together import Together

# Easier display options for debugging: 

# Set the display width to a larger value
pd.set_option('display.width', 1000)

# Optionally, set the max column width to avoid truncating column data
pd.set_option('display.max_colwidth', None)

# Optionally, set the max number of columns to show all columns
pd.set_option('display.max_columns', None)

## Get test data

I initially downloaded the [Kaggle Sentiment140](https://www.kaggle.com/datasets/kazanova/sentiment140) dataset (a large set of tweets labelled with sentiment, where polarity is 0 for negative, 2 for neutral and 4 for positive). But when I looked through those data, the sentiment labels seemed pretty bad. So now I'm using the [Twitter Sentiment Analysis Dataset](https://www.kaggle.com/datasets/jp797498e/twitter-entity-sentiment-analysis) instead.

In [4]:
data_dir = '../data/tweet_data'
#dataset = 'kazanova/sentiment140'
dataset = 'jp797498e/twitter-entity-sentiment-analysis'
# filename = 'training.1600000.processed.noemoticon.csv'
filename = 'twitter_training.csv'


# Function to download dataset if not already downloaded
def download_kaggle_dataset(dataset, data_dir):
    # Create the data directory if it does not exist yet
    os.makedirs(data_dir, exist_ok=True)

    # Treat any existing file in the directory as a previous download
    if not os.listdir(data_dir):
        api = KaggleApi()
        api.authenticate()
        api.dataset_download_files(dataset, path=data_dir, unzip=True)
        print("Dataset downloaded and extracted.")
    else:
        print("Dataset already exists in the directory.")

# Run the function to download the dataset
download_kaggle_dataset(dataset, data_dir)


Dataset URL: https://www.kaggle.com/datasets/jp797498e/twitter-entity-sentiment-analysis
Dataset downloaded and extracted.


In [5]:

#tweets_df = pd.read_csv(os.path.join(data_dir, filename),
#                        header=None,
#                        names=["polarity", "id", "date", "query", "user", "text"],
#                        dtype={"polarity": int, "id": int, "date": str, "query": str, "user": str, "text": str},
#                        encoding='latin1',
#                        index_col=False  # Ensure the df is given an ascending, consecutive index
#                        )

tweets_df = pd.read_csv(os.path.join(data_dir, filename),
                        header=None,
                        names=["x", "entity", "sentiment", "text"],
                        dtype={"x": int, "entity": str, "sentiment": str, "text": str},
                        index_col=False  # Ensure the df is given an ascending, consecutive index
                        )

tweets_df

Unnamed: 0,x,entity,sentiment,text
0,2401,Borderlands,Positive,"im getting on borderlands and i will murder you all ,"
1,2401,Borderlands,Positive,"I am coming to the borders and I will kill you all,"
2,2401,Borderlands,Positive,"im getting on borderlands and i will kill you all,"
3,2401,Borderlands,Positive,"im coming on borderlands and i will murder you all,"
4,2401,Borderlands,Positive,"im getting on borderlands 2 and i will murder you me all,"
...,...,...,...,...
74677,9200,Nvidia,Positive,Just realized that the Windows partition of my Mac is like 6 years behind Nvidia drivers and I have no idea how I did not notice
74678,9200,Nvidia,Positive,Just realized that my Mac window partition is 6 years behind on Nvidia drivers and I have no idea how I didn't notice
74679,9200,Nvidia,Positive,Just realized the windows partition of my Mac is now 6 years behind on Nvidia drivers and I have no idea how he didn’t notice
74680,9200,Nvidia,Positive,Just realized between the windows partition of my Mac is like being 6 years behind on Nvidia drivers and cars I have no fucking idea how I ever didn ’ t notice


## Use the Together.AI API to batch classify them. 

A function that takes a batch of tweets and uses the Together API to classify them:

In [10]:
def get_sentiments(batch_tweets, batch_index=0):
    """
    Retrieves sentiment predictions for a batch of tweets using the Together AI API.

    Parameters
    ----------
    batch_tweets : pandas.DataFrame
        A DataFrame containing the tweets for the current batch. It must include a 'text' column with the tweet content.
    batch_index : int
        An optional starting index of the current batch. This is used to align the predicted sentiments with the original DataFrame indices.

    Returns
    -------
    ids : list of int
        A list of DataFrame indices corresponding to each tweet in the batch. These indices align with the main DataFrame `df`.
    sentiments : list of str
        A list of predicted sentiments for each tweet in the batch. Possible values are 'Positive', 'Negative', or 'Neutral'.
    """

    # Prepare the list of tweets
    tweet_list = "\n".join([f"{idx+1}. {tweet}" 
                            for idx, tweet in enumerate(batch_tweets.text.values)])
    
    # Create the system prompt
    system_prompt = (
        "Classify the sentiment (positive, negative, or neutral) of each of the following texts. "
        "Provide your answer in the format '1. Sentiment', '2. Sentiment', etc.\n\n"
        f"{tweet_list}"
    )

    # Prepare the messages
    messages = [
        {
            "role": "system",
            "content": system_prompt
        }
    ]

    
    # Call the API
    response = client.chat.completions.create(
        model="meta-llama/Llama-3.2-90B-Vision-Instruct-Turbo",
        messages=messages,
        max_tokens=300,
        temperature=0.7,
        top_p=0.7,
        top_k=50,
        repetition_penalty=1,
        stop=["<|eot_id|>", "<|eom_id|>"],
        truncate=130560,
        stream=False  # Set stream to False to get the full response
    )

    # Extract the assistant's reply
    assistant_reply = response.choices[0].message.content.strip()

    # Use regular expressions to extract the sentiments
    matches = re.findall(r"(\d+)\.\s*(Positive|Negative|Neutral)", assistant_reply, re.IGNORECASE)

    # Check that the numbering is correct (optional)
    # You can add code here to verify the numbering matches the tweets

    # Compute the actual DataFrame indices
    ids = [batch_index + int(idx) - 1 for idx, sentiment in matches]
    sentiments = [sentiment.capitalize() for idx, sentiment in matches]
    assert len(ids) == len(sentiments)

    return ids, sentiments
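
The function above leaves the numbering check as an optional step. A minimal sketch of what such a check could look like is shown below. The helper name `parse_numbered_sentiments` and the example reply are hypothetical and not part of the notebook; the regular expression is the same one used in `get_sentiments`.

```python
import re

# Same pattern as in get_sentiments
SENTIMENT_PATTERN = re.compile(r"(\d+)\.\s*(Positive|Negative|Neutral)", re.IGNORECASE)


def parse_numbered_sentiments(reply, n_tweets):
    """Hypothetical helper: map positions 1..n_tweets to sentiments and report gaps."""
    parsed = {}
    for number, sentiment in SENTIMENT_PATTERN.findall(reply):
        position = int(number)
        # Ignore numbers outside the batch and keep only the first label per position
        if 1 <= position <= n_tweets and position not in parsed:
            parsed[position] = sentiment.capitalize()
    # Positions the model skipped or answered in an unexpected format
    missing = [p for p in range(1, n_tweets + 1) if p not in parsed]
    return parsed, missing


# Made-up reply for a batch of 4 tweets: item 3 is missing and item 7 is out of range
example_reply = "1. Positive\n2. negative\n4. Neutral\n7. Positive"
parsed, missing = parse_numbered_sentiments(example_reply, n_tweets=4)
print(parsed)
print(missing)
```

Knowing which positions are missing makes it possible to resubmit just those tweets, or at least to count how many predictions were lost.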


In [12]:
# Get the API key from a file
with open('together.ai_key.txt', 'r') as f:
    api_key = f.readline().strip()

client = Together(api_key=api_key)

# Drop the 'irrelevant' sentiment
tweets_df = tweets_df[tweets_df.sentiment != 'Irrelevant']

# List of tweets to classify (only a few for now)
df = tweets_df.sample(60).copy()
# Ensure the index is consecutive and ascending
df = df.reset_index(drop=True)
# To store the results (None gives an object column, so string labels
# can be assigned later without a dtype warning)
df['sentiment_prediction'] = None

# Batch processing
batch_size = 20
for i in range(0, len(df), batch_size):
    # Get the batch of tweets
    batch_tweets = df.loc[i:i + batch_size - 1, :]

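    # Report progress; the batch count is rounded up so a short final batch is still counted
    print(f"Submitting batch {i // batch_size + 1} of {(len(df) + batch_size - 1) // batch_size}...")

    # Get sentiments using the function
    ids, sentiments = get_sentiments(batch_tweets, batch_index=i)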

    # Update the DataFrame with the predictions
    df.loc[ids, 'sentiment_prediction'] = sentiments


Submitting batch 1 of 3...
Submitting batch 2 of 3...
Submitting batch 3 of 3...
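
The batching loop can also be written as a small generator, which keeps the batch numbering separate from the API call. This is only a sketch: `iter_batches` is a hypothetical helper, and the toy DataFrame stands in for the sampled tweets.

```python
import math

import pandas as pd


def iter_batches(frame, batch_size):
    """Hypothetical helper: yield (batch_number, n_batches, start, batch) tuples."""
    # Round up so that a short final batch is still counted
    n_batches = math.ceil(len(frame) / batch_size)
    starts = range(0, len(frame), batch_size)
    for batch_number, start in enumerate(starts, start=1):
        # iloc slicing is end-exclusive, unlike the label-based .loc used above
        yield batch_number, n_batches, start, frame.iloc[start:start + batch_size]


# Toy data: 50 rows do not divide evenly into batches of 20
toy_df = pd.DataFrame({"text": [f"tweet {n}" for n in range(50)]})
batch_sizes = []
for batch_number, n_batches, start, batch in iter_batches(toy_df, batch_size=20):
    print(f"Submitting batch {batch_number} of {n_batches}...")
    batch_sizes.append(len(batch))
```

The `start` value plays the same role as `batch_index` in `get_sentiments`, provided the DataFrame index has been reset as in the cell above.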


In [13]:
#df = df.loc[df.sentiment != 'Irrelevant']
df

Unnamed: 0,x,entity,sentiment,text,sentiment_prediction
0,8627,NBA2K,Neutral,"Sure feels weird turning on NBA2K game and seeing Kobe on repeat, just to feel sad right over again. @NBA @NBA2KGames",Negative
1,1475,Battlefield,Negative,"No, of course not",Neutral
2,8576,NBA2K,Neutral,@Ronnie2K @NBA2K Can ya do plz u add sum old and new gold kobe jerseys we need the whole whole community just to show respect this one shit that hits different.,Positive
3,3809,Cyberpunk2077,Positive,Great headset if you want to hear johhy silverhand... u think they give a rat di * * * on how u look..... with V 17.09.2020,Positive
4,1991,CallOfDutyBlackopsColdWar,Positive,I'm really hyped.,Positive
5,9139,Nvidia,Positive,Everyone: Why training style gan is so slow.... *nvidia releases StyleGAN 3*: You can train it in just a few hours.,Positive
6,13072,Xbox(Xseries),Neutral,"Bruhn. That shit, so damn funny dawg",Positive
7,4420,Google,Neutral,"Google could face yet another class action lawsuit, this time over Pixel 3 issues androidpolice.com/2020/06/22/goo…",Negative
8,2075,CallOfDuty,Negative,Dirty cyber attack,Negative
9,9655,PlayStation5(PS5),Neutral,My Setup... My What's Up Next :.. - Your New Windows Desk. - - - -.. - Another Wild Screen.. - Sega PS5 / Xbox Series X.. - Cable Tidy [UNK] @ West Wakefield via instagram. au com / p / CAqRrTQpwbh / m …,Neutral


See how well that worked:

In [14]:
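# Leftover from the Sentiment140 experiments, which used numeric 'polarity' labels (0, 2, 4).
# Not used below: the current dataset has no 'polarity' column.
def check_sentiment_estimate(row):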
    if row.polarity == 0 and row.sentiment == "Negative":
        return 1
    elif row.polarity == 2 and row.sentiment == "Neutral":
        return 1
    elif row.polarity == 4 and row.sentiment == "Positive":
        return 1
    else:
        return 0
    
def check_sentiment_estimate2(row):
    if row.sentiment == row.sentiment_prediction:
        return 1
    elif row.sentiment == "Irrelevant" and row.sentiment_prediction == "Neutral":
        return 1
    else:
        return 0

df['correct'] = df.apply(check_sentiment_estimate2, axis=1)
print("Accuracy:", df.correct.mean())
df.loc[:,['text', 'sentiment', 'sentiment_prediction', 'correct']]


Accuracy: 0.5333333333333333


Unnamed: 0,text,sentiment,sentiment_prediction,correct
0,"Sure feels weird turning on NBA2K game and seeing Kobe on repeat, just to feel sad right over again. @NBA @NBA2KGames",Neutral,Negative,0
1,"No, of course not",Negative,Neutral,0
2,@Ronnie2K @NBA2K Can ya do plz u add sum old and new gold kobe jerseys we need the whole whole community just to show respect this one shit that hits different.,Neutral,Positive,0
3,Great headset if you want to hear johhy silverhand... u think they give a rat di * * * on how u look..... with V 17.09.2020,Positive,Positive,1
4,I'm really hyped.,Positive,Positive,1
5,Everyone: Why training style gan is so slow.... *nvidia releases StyleGAN 3*: You can train it in just a few hours.,Positive,Positive,1
6,"Bruhn. That shit, so damn funny dawg",Neutral,Positive,0
7,"Google could face yet another class action lawsuit, this time over Pixel 3 issues androidpolice.com/2020/06/22/goo…",Neutral,Negative,0
8,Dirty cyber attack,Negative,Negative,1
9,My Setup... My What's Up Next :.. - Your New Windows Desk. - - - -.. - Another Wild Screen.. - Sega PS5 / Xbox Series X.. - Cable Tidy [UNK] @ West Wakefield via instagram. au com / p / CAqRrTQpwbh / m …,Neutral,Neutral,1
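
A single accuracy figure hides which classes are being confused. One way to look closer, sketched here on only the ten rows displayed above (not the full sample of 60), is a confusion table with `pd.crosstab` plus a per-class accuracy. The variable names are illustrative.

```python
import pandas as pd

# Labels and predictions copied from the ten rows shown in the table above
shown = pd.DataFrame({
    "sentiment": ["Neutral", "Negative", "Neutral", "Positive", "Positive",
                  "Positive", "Neutral", "Neutral", "Negative", "Neutral"],
    "sentiment_prediction": ["Negative", "Neutral", "Positive", "Positive", "Positive",
                             "Positive", "Positive", "Negative", "Negative", "Neutral"],
})

# Rows are the dataset labels, columns are the model predictions
confusion = pd.crosstab(shown["sentiment"], shown["sentiment_prediction"],
                        rownames=["label"], colnames=["prediction"])
print(confusion)

# Fraction of correct predictions within each labelled class
is_correct = shown["sentiment"] == shown["sentiment_prediction"]
per_class_accuracy = is_correct.groupby(shown["sentiment"]).mean()
print(per_class_accuracy)
```

On these ten rows most of the errors are tweets labelled Neutral that the model assigned to Positive or Negative, so the disagreement may come partly from how the dataset defines Neutral.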
