# LLM Experiments

Early experiments with LLMs

 - Version 2 uses a new dataset that isn't from Kaggle, which removes the need for the code that downloads and reads the Kaggle data. (This version will probably replace version 1.)

## Libraries

In [3]:
import os
import re

import numpy as np
import pandas as pd

#from kaggle.api.kaggle_api_extended import KaggleApi  # pip install kaggle
from together import Together  # pip install together

# Access to Hugging Face data
#conda install huggingface::datasets
#conda install huggingface::huggingface_hub
#from huggingface_hub import login
#from datasets import load_dataset


# Easier display options for debugging: 

# Set the display width to a larger value
pd.set_option('display.width', 1000)

# Optionally, set the max column width to avoid truncating column data
pd.set_option('display.max_colwidth', None)

# Optionally, set the max number of columns to show all columns
pd.set_option('display.max_columns', None)

## Open test data

ChatGPT helpfully suggested using the SemEval tweet datasets. There are three versions:
 - SemEval-2013 Task 2
 - SemEval-2014 Task 9
 - SemEval-2017 Task 4 

I am currently using the training data from the 2017 version: [SemEval-2017 Task 4 Sentiment Analysis in Twitter](https://github.com/leelaylay/TweetSemEval). It is available on Hugging Face as [maxmoynan/SemEval2017-Task4aEnglish](https://huggingface.co/datasets/maxmoynan/SemEval2017-Task4aEnglish). There is also a test split that I can use.

In [5]:
#huggingface_token = "XXXX"
#login(token=huggingface_token, add_to_git_credential=True)
#dir = '../data/tweet_data'
splits = {'train': 'data/train-00000-of-00001-5cbc663900ee21a7.parquet', 'test': 'data/test-00000-of-00001-d46da9a2c99cc148.parquet', 'development': 'data/development-00000-of-00001-c0764befdd327185.parquet'}
tweets_df = pd.read_parquet("hf://datasets/maxmoynan/SemEval2017-Task4aEnglish/" + splits["train"])

In [10]:
tweets_df = tweets_df.rename(columns={"tweet": "text"})
tweets_df

Unnamed: 0,tweet_id,sentiment,text
0,520829332525441024,0,Saturday without Leeds United is like Sunday without a Sunday dinner it doesn't feel normal at all (Ryan)
1,522931511323275264,2,Catch Rainbow Valley at the @CBC #IMAF2014 Gala. Oct 26 @TheGuildPEI! http://t.co/S5aJAEpnkY
2,523099837936312321,2,"""@NiklaklePinkel it doesn't really count, I was decorating a haunted house for a Halloween program tomorrow...I get paid lots of overtime:-)"""
3,521384413217959937,2,"""#BEARDOWN Wish us luck...we may need it. (@ Georgia Dome for Atlanta Falcons vs. Chicago Bears in Atlanta, GA) https://t.co/D0nAGYO4nO"""
4,523076584497229824,2,We're so excited to be part of the Still We Rise Gala on Dec 3. Join us! @warriors4peace http://t.co/MsKuNg5VMH http://t.co/BdQm1uZe4J
...,...,...,...
30417,681877834982232064,1,"""@ShaquilleHoNeal from what I think you're asking, in no order. Future, Drake, Thug, Cole, Kendrick and Tiller a close 6th"""
30418,681879579129200640,2,"""Iran ranks 1st in liver surgeries, Allah bless the country."""
30419,681883903259357184,1,"""Hours before he arrived in Saudi Arabia on Tuesday, Turkish President Recep Tayyip Erdogan accused Syria's president of """"mercilessly""""..."""
30420,681904976860327936,0,@VanityFair Alex Kim Kardashian worth how to love Kim Kardashian she's so bad Sun Conure to


## Use the Together.AI API to batch-classify the tweets

A function that takes a batch of tweets and uses the Together API to classify them:

In [11]:
def get_sentiments(batch_tweets, batch_index=0):
    """
    Retrieves sentiment predictions for a batch of tweets using the Together AI API.

    Parameters
    ----------
    batch_tweets : pandas.DataFrame
        A DataFrame containing the tweets for the current batch. It must include a 'text' column with the tweet content.
    batch_index : int
        An optional starting index of the current batch. This is used to align the predicted sentiments with the original DataFrame indices.

    Returns
    -------
    ids : list of int
        A list of DataFrame indices corresponding to each tweet in the batch. These indices align with the main DataFrame `df`.
    sentiments : list of str
        A list of predicted sentiments for each tweet in the batch. Possible values are 'Positive', 'Negative', or 'Neutral'.
    """

    # Prepare the list of tweets
    tweet_list = "\n".join([f"{idx+1}. {tweet}" 
                            for idx, tweet in enumerate(batch_tweets.text.values)])
    
    # Create the system prompt
    system_prompt = (
        "Classify the sentiment (positive, negative, or neutral) of each of the following texts. "
        "Provide your answer in the format '1. Sentiment', '2. Sentiment', etc.\n\n"
        f"{tweet_list}"
    )

    # Prepare the messages
    messages = [
        {
            "role": "system",
            "content": system_prompt
        }
    ]

    
    # Call the API
    response = client.chat.completions.create(
        model="meta-llama/Llama-3.2-90B-Vision-Instruct-Turbo",
        messages=messages,
        max_tokens=300,
        temperature=0.7,
        top_p=0.7,
        top_k=50,
        repetition_penalty=1,
        stop=["<|eot_id|>", "<|eom_id|>"],
        truncate=130560,
        stream=False  # Set stream to False to get the full response
    )

    # Extract the assistant's reply
    assistant_reply = response.choices[0].message.content.strip()

    # Use regular expressions to extract the sentiments
    matches = re.findall(r"(\d+)\.\s*(Positive|Negative|Neutral)", assistant_reply, re.IGNORECASE)

    # Keep only item numbers that refer to a tweet in this batch, ignoring repeats,
    # so a stray number in the reply cannot point outside the batch
    ids, sentiments = [], []
    for number, sentiment in matches:
        idx = batch_index + int(number) - 1  # actual DataFrame index
        if 1 <= int(number) <= len(batch_tweets) and idx not in ids:
            ids.append(idx)
            sentiments.append(sentiment.capitalize())
    assert len(ids) == len(sentiments)

    return ids, sentiments
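
The parsing step in `get_sentiments` can be tried without calling the API. The sketch below is a stand-alone illustration, not the notebook's code: `parse_reply` is a hypothetical helper and `sample_reply` is made-up text standing in for a model reply. It uses the same regular expression as the function above and drops item numbers that are repeated or fall outside the batch.

```python
import re

# Same pattern as in get_sentiments above
SENTIMENT_PATTERN = re.compile(r"(\d+)\.\s*(Positive|Negative|Neutral)", re.IGNORECASE)


def parse_reply(reply, batch_len, batch_index=0):
    """Hypothetical helper: turn a numbered reply into (ids, sentiments).

    Item numbers outside 1..batch_len and repeated item numbers are skipped.
    """
    ids, sentiments = [], []
    seen = set()
    for number, sentiment in SENTIMENT_PATTERN.findall(reply):
        n = int(number)
        if not 1 <= n <= batch_len or n in seen:
            continue
        seen.add(n)
        ids.append(batch_index + n - 1)  # map item number to a DataFrame index
        sentiments.append(sentiment.capitalize())
    return ids, sentiments


# Made-up reply: item 2 is repeated and item 7 is outside a batch of 5
sample_reply = "1. Positive\n2. negative\n2. Neutral\n7. Neutral\n3. NEUTRAL"
ids, sentiments = parse_reply(sample_reply, batch_len=5, batch_index=10)
print(ids, sentiments)
```

Tweets whose number never appears in the reply simply get no entry, so their `sentiment_prediction` would stay `None`.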


In [18]:
# Get the API key from a file
with open('together.ai_key.txt', 'r') as f:
    api_key = f.readline().strip()

client = Together(api_key=api_key)

# List of tweets to classify (only a few for now)
df = tweets_df.sample(50).copy()
# Ensure the index is consecutive and ascending
df = df.reset_index(drop=True)
# To store the results
df['sentiment_prediction'] = None

# Batch processing
batch_size = 5
for i in range(0, len(df), batch_size):
    # Get the batch of tweets
    batch_tweets = df.loc[i:i + batch_size - 1, :]

    # Get sentiments using the function
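    # Count batches by batch_size (ceiling division) so a short final batch is reported correctly
    n_batches = -(-len(df) // batch_size)
    print(f"Submitting batch {i // batch_size + 1} of {n_batches}...")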

    ids, sentiments = get_sentiments(batch_tweets, batch_index=i)

    # Update the DataFrame with the predictions
    df.loc[ids, 'sentiment_prediction'] = sentiments

print("Finished")

Submitting batch 1 of 10...
Submitting batch 2 of 10...
Submitting batch 3 of 10...
Submitting batch 4 of 10...
Submitting batch 5 of 10...
Submitting batch 6 of 10...
Submitting batch 7 of 10...
Submitting batch 8 of 10...
Submitting batch 9 of 10...
Submitting batch 10 of 10...
Finished
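
The batching loop above slices the DataFrame in steps of `batch_size`. As a minimal sketch of the same idea, assuming a frame with a consecutive 0..n-1 index, the hypothetical generator `iter_batches` below yields the batch number, the starting row, and the slice, and handles a row count that is not a multiple of the batch size. `demo_df` is made-up data, not the tweet set.

```python
import math

import pandas as pd


def iter_batches(frame, batch_size):
    """Hypothetical helper: yield (batch_number, start_row, batch) tuples."""
    n_batches = math.ceil(len(frame) / batch_size)
    for batch_number in range(1, n_batches + 1):
        start = (batch_number - 1) * batch_size
        # iloc slicing is positional and end-exclusive
        yield batch_number, start, frame.iloc[start:start + batch_size]


# Made-up frame with 12 rows, so the last batch holds only 2 rows
demo_df = pd.DataFrame({"text": [f"tweet {k}" for k in range(12)]})
sizes = [(num, start, len(batch)) for num, start, batch in iter_batches(demo_df, 5)]
print(sizes)
```

In the notebook, `start` would play the role of `batch_index` when calling `get_sentiments`.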


See how well that worked by comparing the predictions with the dataset labels (0 = negative, 1 = neutral, 2 = positive):

In [19]:
def check_sentiment_estimate(row):
    if row.sentiment == 0 and row.sentiment_prediction == "Negative":
        return 1
    elif row.sentiment == 1 and row.sentiment_prediction == "Neutral":
        return 1
    elif row.sentiment == 2 and row.sentiment_prediction == "Positive":
        return 1
    else:
        return 0

df['correct'] = df.apply(check_sentiment_estimate, axis=1)
print("Accuracy:", df.correct.mean())
df.loc[:,['text', 'sentiment', 'sentiment_prediction', 'correct']]


Accuracy: 0.58


Unnamed: 0,text,sentiment,sentiment_prediction,correct
0,"""If someone wanted to find me at #saratoga tomorrow, it wouldn't be too hard. I stick out like a sore thumb with my PS4 lanyard thing.""",1,Neutral,1
1,"""Seth Rollins may have lost, but he broke Cena! #RAWSanJose""",1,Positive,0
2,Frank Ocean playing omg it's like November already,2,Positive,1
3,India has old ties with Iran. Lot of us speak Urdu which has a lot of Farsi in it. https://t.co/bO5UKt5ooi,1,Neutral,1
4,Missed @ken4london on #bbcqt last night blaming Tony Blair for July 7 attacks? Watch it again here: https://t.co/UrbXwizlw4,1,Neutral,1
5,I love modern dance at the UofU but watching nutcracker at home made me miss ballet. Still stoked to see Ballet West on Friday!!!,2,Positive,1
6,@iM_tgun @frayFenner @NvDox Anyone else noticed the uncanny resemblance between tgun and Petraeus? tgun's Dad may have a confession to make.,1,Negative,0
7,Yakub coverage: Outrage over notice to channels: Various journalists' bodies on Saturday expressed shock over the show-cause notice s...,0,Neutral,0
8,GUYS WERE ARE 17TH PLACE AND JUST YESTERDAY WE WERE 16TH PLACE PLEASE KEEP VOTING. WHAT'S WRONG? Make the boys proud! #MTVStars The Vamps.,1,Positive,0
9,so if anyone wants to bring me donuts or anything on monday id be gucci with it.,1,Positive,0
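
The row-wise check above can also be written as a column comparison, which makes it easy to add a per-class breakdown. This is only a sketch on made-up rows (`demo`), assuming the label coding used in `check_sentiment_estimate` (0 = Negative, 1 = Neutral, 2 = Positive); `LABEL_NAMES` is a name introduced here for illustration. A missing prediction (`None`) counts as incorrect, as it does in the original check.

```python
import pandas as pd

# Assumed label coding, taken from check_sentiment_estimate above
LABEL_NAMES = {0: "Negative", 1: "Neutral", 2: "Positive"}

# Made-up rows for illustration; the last one has no prediction
demo = pd.DataFrame({
    "sentiment": [1, 1, 2, 0, 1],
    "sentiment_prediction": ["Neutral", "Positive", "Positive", "Neutral", None],
})

# Translate the integer labels to names, then compare whole columns at once
demo["sentiment_label"] = demo["sentiment"].map(LABEL_NAMES)
demo["correct"] = (demo["sentiment_label"] == demo["sentiment_prediction"]).astype(int)
accuracy = demo["correct"].mean()
print("Accuracy:", accuracy)

# Rows are the true labels, columns the predicted labels
confusion = pd.crosstab(demo["sentiment_label"], demo["sentiment_prediction"])
print(confusion)
```

A cross-tabulation like this shows which classes get confused with each other, which the single accuracy figure hides.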
