## 498 Final - Model and Implementation (Zach Gendreau)

In [4]:
# Load in training dataset (Airline sentiment tweet data)
import pandas as pd
import numpy as np
from datasets import load_dataset

airline_df = pd.read_csv("hf://datasets/osanseviero/twitter-airline-sentiment/Tweets.csv")

### Dataset Description: large-twitter-tweets-sentiment
This dataset was pulled from Hugging Face (link: https://huggingface.co/datasets/gxb912/large-twitter-tweets-sentiment). It contains 224,994 tweets in total, with 179,995 tweets in the training split and 44,999 tweets in the testing split. The dataset includes nothing aside from the text and sentiment label of each tweet, and there is no information on its source aside from the fact that it was annotated specifically for sentiment analysis.

In the training split, there are 104,125 positive tweets and 75,870 negative tweets. In the test split, there are 26,032 positive tweets and 18,967 negative tweets. This is an uneven spread with more positive tweets (about 58% positive in each split), but nothing too extreme. The average length of tweets in both the training and testing data is about 15 words.

In [5]:
temp_train_df = pd.read_csv("hf://datasets/gxb912/large-twitter-tweets-sentiment/train.csv")
temp_test_df = pd.read_csv("hf://datasets/gxb912/large-twitter-tweets-sentiment/test.csv")

In [3]:
# 0 = negative, 1 = positive
# load twitter dataset into pandas 
from nltk.tokenize import TweetTokenizer 

print("Training: " + str(temp_train_df.shape))
print("Testing: " + str(temp_test_df.shape))
print()

# Get sentiment counts
print("Sentiment frequency")
print("Training data:")
print(temp_train_df['sentiment'].value_counts())
print() 
print("Testing data:")
print(temp_test_df['sentiment'].value_counts())

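# Remove a leading "@mention" token (e.g. "@airline") from a tokenized tweet
def _remove_airline_tok(tokens):
    # Guard against empty token lists, which would otherwise raise an IndexError
    return tokens[1:] if tokens and tokens[0].startswith('@') else tokens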

tokenizer = TweetTokenizer()

# Tokenize tweet text (training)
tweet_text = temp_train_df["text"].values
tweet_tokenized = [tokenizer.tokenize(tweet) for tweet in tweet_text]
train_clean_tweets = [_remove_airline_tok(tokens) for tokens in tweet_tokenized]

# Tokenize tweet text (testing)
tweet_text = temp_test_df["text"].values
tweet_tokenized = [tokenizer.tokenize(tweet) for tweet in tweet_text]
test_clean_tweets = [_remove_airline_tok(tokens) for tokens in tweet_tokenized]
    
# Find the average length (in tokens) of a list of tokenized tweets
def average_tweet_length_finder(tokenized_tweets):
    tweet_mean = np.mean([len(tweet) for tweet in tokenized_tweets])
    print("Average length of tweets:", tweet_mean)
    
print("Training:")
average_tweet_length_finder(train_clean_tweets)

print("Testing:")
average_tweet_length_finder(test_clean_tweets)
    

Training: (179995, 2)
Testing: (44999, 2)

Sentiment frequency
Training data:
sentiment
1    104125
0     75870
Name: count, dtype: int64

Testing data:
sentiment
1    26032
0    18967
Name: count, dtype: int64


KeyboardInterrupt: 
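As a point of reference for the classifiers that follow, the class counts printed above imply a majority-class baseline: the accuracy of a model that always predicts the positive label. The sketch below recomputes that baseline from the printed counts. The helper name `majority_share` is made up for illustration and is not part of the notebook.

```python
# Class counts copied from the value_counts() output above (0 = negative, 1 = positive)
train_counts = {1: 104125, 0: 75870}
test_counts = {1: 26032, 0: 18967}

def majority_share(counts):
    """Return the fraction of examples that belong to the most frequent class."""
    return max(counts.values()) / sum(counts.values())

train_baseline = majority_share(train_counts)
test_baseline = majority_share(test_counts)
print(f"Always-positive accuracy (train): {train_baseline:.3f}")
print(f"Always-positive accuracy (test):  {test_baseline:.3f}")
```

Both shares come out at roughly 0.58, so a classifier trained on this data needs to beat about 58% accuracy before it is doing anything useful.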

In [24]:
# Logistic regression on large-twitter-tweets dataset using word2vec vectors
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# NOTE: w2v_vectors is not defined in the cells shown here; it is assumed to be a
# pretrained 300-dimensional word2vec lookup loaded in an earlier cell

def tweet_to_avg_vector(tweet_tokens, w2v_vectors):
    vectors = []
    for word in tweet_tokens:
        if word in w2v_vectors:  # Check if word is in the Word2Vec model
            vectors.append(w2v_vectors[word])  # Get the word vector
        else:
            vectors.append(np.zeros(300))  # Use a zero vector for words not in the model
    if vectors:
        return np.mean(vectors, axis=0)  # Average the word vectors to represent the whole tweet
    else:
        return np.zeros(300)  # Use a zero vector for tweets with no tokens


# Generate word2vec representations
x_train_vectors = np.array([tweet_to_avg_vector(tweet, w2v_vectors) for tweet in train_clean_tweets])
x_test_vectors = np.array([tweet_to_avg_vector(tweet, w2v_vectors) for tweet in test_clean_tweets])

# Labels for training and testing data
label_train = temp_train_df["sentiment"]
label_test = temp_test_df["sentiment"]

# Train a LOGISTIC regression model (using word2vec vectors)
lr = LogisticRegression(max_iter=1000)
lr.fit(x_train_vectors, label_train)

# Get predictions then evaluate performance on test set
label_pred = lr.predict(x_test_vectors)
print(classification_report(label_test, label_pred))


              precision    recall  f1-score   support

           0       0.72      0.64      0.68     18967
           1       0.76      0.82      0.79     26032

    accuracy                           0.75     44999
   macro avg       0.74      0.73      0.74     44999
weighted avg       0.75      0.75      0.74     44999
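One design choice in `tweet_to_avg_vector` is worth spelling out: out-of-vocabulary words are replaced with zero vectors rather than skipped, so a tweet with many unknown words gets an average that is pulled toward zero. The dependency-free sketch below contrasts the two options on toy 2-dimensional vectors. The function `mean_pool`, its `skip_oov` flag, and the toy vocabulary are all hypothetical and only meant to illustrate the effect.

```python
def mean_pool(tokens, vectors, dim, skip_oov):
    """Average the word vectors of a tweet.

    Unknown tokens are either skipped (skip_oov=True) or counted as
    zero vectors (skip_oov=False, which mirrors the notebook's approach).
    """
    rows = []
    for tok in tokens:
        if tok in vectors:
            rows.append(vectors[tok])
        elif not skip_oov:
            rows.append([0.0] * dim)
    if not rows:
        # Nothing to average: fall back to a zero vector
        return [0.0] * dim
    # zip(*rows) walks the vectors column by column
    return [sum(col) / len(rows) for col in zip(*rows)]

# Toy vocabulary: two known words, everything else is out-of-vocabulary
toy_vectors = {"good": [1.0, 0.0], "flight": [0.0, 1.0]}
tokens = ["good", "flight", "lol", "omg"]

zero_filled = mean_pool(tokens, toy_vectors, dim=2, skip_oov=False)
skipped = mean_pool(tokens, toy_vectors, dim=2, skip_oov=True)
print("zero-filled:", zero_filled)
print("skipped:    ", skipped)
```

With two of the four tokens unknown, the zero-filled average has half the magnitude of the skipped one. Whether that helps or hurts the classifier is an empirical question that this notebook does not test.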



In [25]:
# Logistic regression on large-twitter-tweets dataset using TF-idf vectors
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()

# Convert tokenized tweets into strings that tf-idf vectorizer can actually use
train_tweets_clean_joined = [' '.join(tokens) for tokens in train_clean_tweets]
test_tweets_clean_joined = [' '.join(tokens) for tokens in test_clean_tweets]

train_tweets_tfidf = vectorizer.fit_transform(train_tweets_clean_joined)
test_tweets_tfidf = vectorizer.transform(test_tweets_clean_joined)


# Train a binary logistic regression model (using tf-idf vectors)
lr = LogisticRegression(max_iter=1000)
lr.fit(train_tweets_tfidf, label_train)

# Get predictions then evaluate performance on test set
label_pred = lr.predict(test_tweets_tfidf)
print(classification_report(label_test, label_pred))


              precision    recall  f1-score   support

           0       0.77      0.71      0.74     18967
           1       0.80      0.85      0.82     26032

    accuracy                           0.79     44999
   macro avg       0.79      0.78      0.78     44999
weighted avg       0.79      0.79      0.79     44999
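A caveat about joining the tokens back into strings: to my understanding, `TfidfVectorizer` with default settings lowercases the text and then re-tokenizes it with the pattern `(?u)\b\w\w+\b`, which keeps only runs of two or more word characters. That would mean emoticons, punctuation, and one-letter words produced by `TweetTokenizer` never reach the model. The sketch below applies that pattern with the standard `re` module to a made-up tweet. It imitates the vectorizer's tokenization step only and is not a substitute for checking the scikit-learn documentation.

```python
import re

# Assumed default token pattern of TfidfVectorizer (runs of 2+ word characters)
DEFAULT_TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

# Hypothetical TweetTokenizer output for one tweet
tweet_tokens = ["I", "love", "this", ":)", "#win", "!"]

# Same join step as in the notebook, followed by lowercasing and re-tokenizing
joined = " ".join(tweet_tokens)
retokenized = DEFAULT_TOKEN_PATTERN.findall(joined.lower())

# Tokens that were present before joining but are missing afterwards
dropped = [tok for tok in tweet_tokens if tok.lower().strip("#") not in retokenized]
print("kept:   ", retokenized)
print("dropped:", dropped)
```

If the emoticons turn out to carry useful sentiment signal, one option to look into is giving the vectorizer a custom analyzer so that the `TweetTokenizer` tokens are used as they are.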



### TF-idf or word2vec (large-twitter-tweets-sentiment)?

#### Word2vec
There are more positive tweets (104,125 in training, 26,032 in testing) than negative tweets (75,870 in training, 18,967 in testing) in the dataset, and that's reflected in the logistic regression classifier's performance. Negative tweets had an F1 score of 0.68, while positive tweets had an F1 score of 0.79. The macro and weighted F1 averages are the same though, both sitting at 0.74, with an overall accuracy of 0.75.

#### TF-idf
Again, the F1 score for the negative class (0.74) is lower than the F1 score for the positive class (0.82). In this case, the macro and weighted averages were very similar, with a macro average of 0.78 and a weighted average of 0.79.


Comparing the performance of the two models (logistic regression trained with word2vec vs tf-idf vectors), the TF-idf-trained model performed better in this case (0.79 vs 0.75 accuracy, and a higher F1 score on both classes), and it's faster, so I will be sticking with the tf-idf logistic regression model!
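The two averages at the bottom of each classification report differ only in how the per-class F1 scores are combined: the macro average treats both classes equally, while the weighted average weights each class by its support. The sketch below recomputes both from the rounded per-class values in the reports above. The helper name `macro_and_weighted` is made up for illustration, and because the inputs are already rounded to two decimals, the results can differ from scikit-learn's in the last digit.

```python
def macro_and_weighted(f1_by_class, support_by_class):
    """Return (macro average, support-weighted average) of per-class F1 scores."""
    macro = sum(f1_by_class.values()) / len(f1_by_class)
    total = sum(support_by_class.values())
    weighted = sum(f1_by_class[c] * support_by_class[c] for c in f1_by_class) / total
    return macro, weighted

# Test-set support per class, copied from the reports (0 = negative, 1 = positive)
support = {0: 18967, 1: 26032}

w2v_macro, w2v_weighted = macro_and_weighted({0: 0.68, 1: 0.79}, support)
tfidf_macro, tfidf_weighted = macro_and_weighted({0: 0.74, 1: 0.82}, support)

print(f"word2vec: macro={w2v_macro:.3f}  weighted={w2v_weighted:.3f}")
print(f"tf-idf:   macro={tfidf_macro:.3f}  weighted={tfidf_weighted:.3f}")
```

Because the positive class is both larger and easier for the models, the weighted average sits slightly above the macro average in both cases.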

In [6]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("kazanova/sentiment140")

print("Path to dataset files:", path)

Path to dataset files: /Users/zachgendreau/.cache/kagglehub/datasets/kazanova/sentiment140/versions/2


In [7]:
twitter_df = pd.read_csv(path + '/training.1600000.processed.noemoticon.csv', encoding='latin-1',  names=["sentiment", "id", "date", "flag", "user", "text"])
twitter_df.head()

| | sentiment | id | date | flag | user | text |
|---|---|---|---|---|---|---|
| 0 | 0 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | _TheSpecialOne_ | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
| 1 | 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | is upset that he can't update his Facebook by ... |
| 2 | 0 | 1467810917 | Mon Apr 06 22:19:53 PDT 2009 | NO_QUERY | mattycus | @Kenichan I dived many times for the ball. Man... |
| 3 | 0 | 1467811184 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | ElleCTF | my whole body feels itchy and like its on fire |
| 4 | 0 | 1467811193 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | Karoli | @nationwideclass no, it's not behaving at all.... |


### Dataset Description: Twitter Sentiment140
The Sentiment140 dataset is a collection of Tweets gathered as part of a research paper, which can be found at this link: (https://www-cs.stanford.edu/people/alecmgo/papers/TwitterDistantSupervision09.pdf).

The original use of the dataset was to test how machine learning models would perform at classifying the sentiment of Tweets when trained on data labeled using emoticons like ":)". Rather than hand-labeling data, the authors used the emoticons as a noisy label for the sentiment of each tweet. Tweets with emoticons like ":)" were labeled as containing positive sentiment, while tweets with emoticons like ":(" were labeled as negative.

The Tweets were gathered through various queries using Twitter's API, and tweets containing both positive and negative emoticons were removed from the dataset. Each tweet is also recorded with data like when it was created, the user who created the tweet, the sentiment label given to the tweet, and the query used to find the tweet, if any.
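The labeling rule described above can be sketched in a few lines. This is only an illustration of the idea: the emoticon lists and the function name `emoticon_label` are made up here and are not the exact lists or code used to build Sentiment140. The return values follow the raw label convention of the dataset (4 = positive, 0 = negative).

```python
# Illustrative emoticon lists (assumption: not the exact lists from the paper)
POSITIVE_EMOTICONS = (":)", ":-)", ":D")
NEGATIVE_EMOTICONS = (":(", ":-(")

def emoticon_label(text):
    """Return 4 (positive), 0 (negative), or None if the tweet should be left out."""
    has_pos = any(emo in text for emo in POSITIVE_EMOTICONS)
    has_neg = any(emo in text for emo in NEGATIVE_EMOTICONS)
    if has_pos == has_neg:
        # Either no emoticon at all, or both kinds: no usable noisy label
        return None
    return 4 if has_pos else 0

examples = [
    "great flight today :)",
    "delayed again :(",
    "good seats :) but awful food :(",
    "landing in ten minutes",
]
labels = [emoticon_label(tweet) for tweet in examples]
for tweet, label in zip(examples, labels):
    print(label, "|", tweet)
```

The file name used when loading the data (`training.1600000.processed.noemoticon.csv`) suggests that the emoticons were removed from the tweet text after labeling, so the classifier cannot simply read the label off the emoticon.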

There are 1,600,000 Tweets in the dataset, and there are tweets from 659,775 different users. The tweets are taken from the time period of April 17th, 2009 to May 27th, 2009. The distribution of sentiments is even, with 800,000 positive tweets and 800,000 negative tweets. The average length of tweets in the corpus is 14.85 words.

We will be using this dataset as the corpus we're analyzing.

##### Note: The Kaggle page for Sentiment140 is misleading here. It says "target: the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)", but there are actually only positive (4) and negative (0) labels in the dataset.

In [8]:
# Convert Sentiment140's numeric labels (4 = positive, 0 = negative) into binary labels (1 = positive, 0 = negative)
def map_sentiment(value):
    if value == 4:
        return 1
    else:
        return 0

twitter_df["sentiment_num_conv"] = twitter_df["sentiment"].apply(map_sentiment)

In [35]:
# Tokenize tweet text (uses the TweetTokenizer instance created earlier)
tweet_text = twitter_df["text"].values
tweet_tokenized = [tokenizer.tokenize(tweet) for tweet in tweet_text]
corpus_tweets_clean = [_remove_airline_tok(tokens) for tokens in tweet_tokenized]

# Convert tokenized tweets into strings that tf-idf vectorizer can actually use
corpus_tweets_clean_joined = [' '.join(tokens) for tokens in corpus_tweets_clean]

# Reuse the vectorizer already fitted on the training tweets (transform only, no refit)
corpus_tweets_tfidf = vectorizer.transform(corpus_tweets_clean_joined)

# Get predictions from the tf-idf logistic regression model, then evaluate on the Sentiment140 corpus
corpus_label_pred = lr.predict(corpus_tweets_tfidf)
print(classification_report(twitter_df["sentiment_num_conv"], corpus_label_pred))

              precision    recall  f1-score   support

           0       0.82      0.73      0.77    800000
           1       0.75      0.84      0.79    800000

    accuracy                           0.78   1600000
   macro avg       0.79      0.78      0.78   1600000
weighted avg       0.79      0.78      0.78   1600000



#### Research Questions

- What’s the relationship between sentiment and popularity?

- Potential sub-question: Which entities have the most positive or negative sentiment, on average? (bert)

- What are the most common entities in tweets associated with different sentiment labels?

- What types of entities are the most common? Which entity types are more associated with each sentiment label? 

- What's the relationship between length of tweets and their average sentiment?

### What’s the relationship between sentiment and popularity?

In [38]:
# Time for NER!

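# Load spacy models and using for NER
import spacy

# load a spacy model trained on web text
nlp_web = spacy.load("en_core_web_sm")

# Run the pipeline over every tweet in the 'text' column. nlp.pipe processes the
# texts in batches, which is faster than calling the model row by row with .apply,
# but this is still slow on 1.6 million tweets
doc_web = list(nlp_web.pipe(twitter_df['text'], batch_size=1000))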



KeyboardInterrupt: 

In [37]:
from collections import Counter
# Initialize entity span counter
entity_spans = Counter()

# Iterate over each tweet and count spans
for doc in doc_web:
    for ent in doc.ents:
        entity_spans.update([ent.text])  

# Get top 10 most common spans
most_common_ents = entity_spans.most_common(10)

# Print the top 10 most common spans with their counts
print("Top 10 most common entity spans:")
for span, count in most_common_ents:
    print(f"  - {span}: {count} instances")

NameError: name 'doc_web' is not defined

Based on the top 10 most common spans, there doesn't really seem to be a relationship! Since a label of 0 means negative sentiment and a label of 1 means positive sentiment, the average label over the tweets that mention an entity is the fraction of those tweets that are positive, so entities that appear in more positive tweets have an average sentiment closer to 1. Looking at the top 10 most common spans and their average sentiments, we get this:

Top 10 most common entity spans and their average sentiment (logistic regression prediction):
  - today: 57597 instances || Average sentiment = 0.46
  - tomorrow: 27349 instances || Average sentiment = 0.45
  - 2: 26003 instances || Average sentiment = 0.49
  - tonight: 23124 instances || Average sentiment = 0.52
  - one: 14757 instances || Average sentiment = 0.6
  - first: 12105 instances || Average sentiment = 0.69
  - 4: 9864 instances || Average sentiment = 0.55
  - last night: 9733 instances || Average sentiment = 0.43
  - morning: 8907 instances || Average sentiment = 0.78
  - yesterday: 8535 instances || Average sentiment = 0.43

Top 10 most common entity spans and their average sentiment (twitter_df ground truth):
  - today: 57597 instances || Average sentiment = 0.45
  - tomorrow: 27349 instances || Average sentiment = 0.44
  - 2: 26003 instances || Average sentiment = 0.46
  - tonight: 23124 instances || Average sentiment = 0.5
  - one: 14757 instances || Average sentiment = 0.56
  - first: 12105 instances || Average sentiment = 0.64
  - 4: 9864 instances || Average sentiment = 0.52
  - last night: 9733 instances || Average sentiment = 0.41
  - morning: 8907 instances || Average sentiment = 0.72
  - yesterday: 8535 instances || Average sentiment = 0.42

It looks like, aside from "one," "first," and "morning," the ten most common entity spans generally have average sentiment scores around 0.5 or slightly below it! It seems that an entity being more common or popular doesn't make it related to more positive or negative discussions. Another note is that, for all ten of these entities, the averages computed from the logistic regression predictions are a bit higher (by 0.01 to 0.06) than the averages computed from the ground truth sentiment labels (ex: "morning" has an average sentiment of 0.78 according to the logistic regression labels, but a 0.72 based on ground truth sentiment labels).
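The cell that produced the per-entity averages is not shown above, so here is a minimal sketch of how such averages could be computed, assuming each tweet comes with a list of entity strings and a 0/1 sentiment label. The function name `average_sentiment_by_entity` and the toy data are hypothetical.

```python
from collections import defaultdict

def average_sentiment_by_entity(entity_lists, labels, top_n=10):
    """Return (entity, count, average label) for the top_n most frequent entities.

    entity_lists: one list of entity strings per tweet
    labels: one 0/1 sentiment label per tweet (ground truth or model prediction)
    """
    counts = defaultdict(int)
    label_sums = defaultdict(int)
    for entities, label in zip(entity_lists, labels):
        for entity in entities:
            counts[entity] += 1          # every mention counts as one instance
            label_sums[entity] += label  # sum of labels of the mentioning tweets
    most_common = sorted(counts, key=counts.get, reverse=True)[:top_n]
    return [(e, counts[e], label_sums[e] / counts[e]) for e in most_common]

# Toy data: four tweets with their extracted entities and 0/1 labels
toy_entities = [["today"], ["today", "morning"], ["morning"], ["today"]]
toy_labels = [0, 1, 1, 0]

result = average_sentiment_by_entity(toy_entities, toy_labels)
for entity, count, avg in result:
    print(f"  - {entity}: {count} instances || Average sentiment = {avg:.2f}")
```

Running the same function once with the ground truth labels and once with the logistic regression predictions would give the two lists compared above.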


## BERT

In [9]:
from datasets import Dataset

In [10]:
train_dataset = Dataset.from_pandas(temp_train_df)

In [11]:
test_dataset = Dataset.from_pandas(temp_test_df)

In [12]:
from transformers import AutoTokenizer

# Use a separate name so the TweetTokenizer stored in `tokenizer` is not overwritten
bert_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

def preprocess(data):
    return bert_tokenizer(data['text'], padding='max_length', truncation=True)

train_dataset = train_dataset.map(preprocess, batched=True)
test_dataset = test_dataset.map(preprocess, batched=True)

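# The label column in these CSVs is named 'sentiment'; rename it to 'label' so the
# column selection below succeeds (without this, set_format raises a ValueError)
train_dataset = train_dataset.rename_column('sentiment', 'label')
test_dataset = test_dataset.rename_column('sentiment', 'label')

# Set the format for PyTorch tensors
train_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])
test_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])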

Map:   0%|          | 0/179995 [00:00<?, ? examples/s]

Map:   0%|          | 0/44999 [00:00<?, ? examples/s]

ValueError: Columns ['label'] not in the dataset. Current columns in the dataset: ['sentiment', 'text', 'input_ids', 'token_type_ids', 'attention_mask']

In [53]:
print(train_dataset[1])

KeyError: 1