In [None]:
import pandas as pd
import requests
import os

# Data Collection

## Annotating the Dataset

This section describes the sentiment annotation process, which uses the `twitter-roberta-base-sentiment` model from Hugging Face. The process assigns a sentiment label to each tweet, preparing the data for further analysis.

### Load and Prepare Data

In [1]:
# Load our data from a CSV file; 'utf-8-sig' reads UTF-8 and skips a leading byte-order mark (BOM)
df = pd.read_csv("./data/1000texts.csv", encoding='utf-8-sig')

# Display the first three rows of the dataframe to inspect the data
df.head(3)

Unnamed: 0,Name,Handle,Timestamp,Verified,Content,Comments,Retweets,Likes,Analytics,Tags,Mentions,Emojis,Profile Image,Tweet Link,Tweet ID
0,Binance,@binance,2024-04-03T00:00:06.000Z,True,The #Binance towel comes everywhere with me......,2.2K,589,2.1K,240K,['#Binance'],[],['\\U0001f373'],https://pbs.twimg.com/profile_images/174428939...,https://twitter.com/binance/status/17753122840...,tweet_id:1775312284052554156
1,Ash Crypto,@Ashcryptoreal,2024-04-04T00:24:48.000Z,True,Drop your $SOL address below and\nmake sure yo...,2.2K,518,1.4K,104K,[],[],['\\U0001f447\\U0001f3fc'],https://pbs.twimg.com/profile_images/169999220...,https://twitter.com/Ashcryptoreal/status/17756...,tweet_id:1775680884982616105
2,Bitcoin,@Bitcoin,2024-04-03T03:19:33.000Z,True,"£52,356.70",156,141,767,161K,[],[],[],https://pbs.twimg.com/profile_images/421692600...,https://twitter.com/Bitcoin/status/17753624737...,tweet_id:1775362473790447725


We start by loading the dataset with the `utf-8-sig` encoding, which reads UTF-8 text and strips a leading byte-order mark (BOM) if the file has one, so special characters in the text load correctly.

A quick look at the data with `df.head(3)` confirms the columns and the kind of values we are working with.

### Data Cleaning

In [2]:
# Drop rows with any missing values
df = df.dropna()

# Convert the 'Content' column into a list of sentences
sentences = df['Content'].tolist()

Next, we clean the data by removing every row that has a missing value in any column, which keeps the dataset consistent.

We then extract the tweet content into a list for the batch processing that follows.
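
Because `dropna()` with no arguments drops a row when any column is empty, it can remove tweets whose text is present but whose metadata is missing. The sketch below uses a made-up toy DataFrame (not the real dataset) to compare that behaviour with `dropna(subset=["Content"])`, which only requires the text column to be present. Whether the stricter or looser filter is appropriate depends on the analysis; this is only an illustration.

```python
import pandas as pd

# Toy data for illustration only; the values are invented.
toy = pd.DataFrame(
    {
        "Content": ["gm", None, "wagmi"],
        "Tags": ["[]", "[]", None],
    }
)

# Drops every row that has a missing value in any column.
all_columns = toy.dropna()

# Drops only the rows whose 'Content' value is missing.
content_only = toy.dropna(subset=["Content"])

print(len(all_columns), len(content_only))  # 1 2
```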

### Model Details

For our annotation, we use the [`twitter-roberta-base-sentiment`](https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment) model hosted on Hugging Face.

This is a RoBERTa-base model trained on approximately 58 million tweets and fine-tuned for sentiment analysis on the TweetEval benchmark, which makes it well suited to the informal language used in tweets.

**Labels Explained**
- 0: Negative
- 1: Neutral
- 2: Positive

These labels correspond to the sentiment expressed in each tweet.
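
The label extraction step later in this notebook splits the returned label on `_`, so the raw labels have the form `LABEL_0`, `LABEL_1`, `LABEL_2`. As a small sketch, the mapping above can be turned into a lookup that converts either form into a readable name. The names `ID_TO_SENTIMENT` and `sentiment_name` are hypothetical helpers introduced here, not part of the original notebook.

```python
# Mapping from the model's numeric label IDs to sentiment names,
# following the "Labels Explained" list above.
ID_TO_SENTIMENT = {"0": "negative", "1": "neutral", "2": "positive"}


def sentiment_name(raw_label: str) -> str:
    """Convert 'LABEL_2' (or a bare '2') into 'positive'."""
    label_id = raw_label.split("_")[-1]
    return ID_TO_SENTIMENT[label_id]


print(sentiment_name("LABEL_0"))  # negative
print(sentiment_name("2"))        # positive
```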

### API Configuration

The API requires us to send the sentences in groups of 10.

In [3]:
# Group sentences into sub-lists of 10 for batch processing
grouped_list = [sentences[n:n+10] for n in range(0, len(sentences), 10)]

### Set Up API for Annotation

In [6]:
# API token and endpoint for the Hugging Face model used for annotation
API_TOKEN = "###"  # actual API token goes here
API_URL = "https://api-inference.huggingface.co/models/cardiffnlp/twitter-roberta-base-sentiment"
headers = {"Authorization": f"Bearer {API_TOKEN}"}  # Authorization header for the API request

We configure the API with the required endpoint and authentication details. The `API_TOKEN` is the access token obtained from Hugging Face.
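
To keep the token out of the notebook itself, one option is to read it from an environment variable. The sketch below assumes a variable named `HF_API_TOKEN`. That name, and the helpers `load_api_token` and `build_headers`, are hypothetical choices for illustration, not something the original notebook defines.

```python
import os


def load_api_token(env_var: str = "HF_API_TOKEN") -> str:
    """Read the API token from an environment variable (name is an assumption)."""
    token = os.environ.get(env_var)
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Hugging Face token."
        )
    return token


def build_headers(token: str) -> dict:
    """Build the Authorization header in the same form used in this notebook."""
    return {"Authorization": f"Bearer {token}"}
```

With the variable set in the shell before starting the notebook, `headers = build_headers(load_api_token())` would replace the hard-coded token.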

### Batch Processing Setup

Tweets are grouped in batches of ten to optimize the API calls.
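
The slicing used for batching can also be written as a small generator, which makes the batch size an explicit parameter. This is a sketch only; `batched` is a hypothetical helper name and the example tweets are placeholders.

```python
from typing import Iterator, List


def batched(items: List[str], batch_size: int = 10) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# 23 placeholder tweets produce two full batches and one partial batch.
example = [f"tweet {i}" for i in range(23)]
sizes = [len(batch) for batch in batched(example)]
print(sizes)  # [10, 10, 3]
```

The last batch is shorter whenever the number of tweets is not a multiple of the batch size, which is the same behaviour as the list comprehension used in this notebook.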

### Annotation Execution

In [None]:
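# Define a function to send a batch to the sentiment analysis API and return the parsed JSON response
def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    response.raise_for_status()  # Fail fast on HTTP errors instead of storing an error payload
    return response.json()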

In [None]:
# Initialize an empty list to store outputs
output = []

# Loop through each group of sentences and perform sentiment analysis
for batch in grouped_list:
    output.append(query(batch))


We define a function to send each batch to the API and store the responses. Each response includes sentiment scores and labels for the batch of tweets processed.
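
A hosted API may occasionally return an error response, for example while the model is still loading. The sketch below retries a batch a few times before giving up, assuming that transient failures show up as non-200 status codes. `query_with_retry` and its parameters are hypothetical, and `send` stands for any callable with the same call shape as `requests.post`, so the function can be exercised without a network connection.

```python
import time
from typing import Any, Callable


def query_with_retry(
    payload: Any,
    url: str,
    headers: dict,
    send: Callable[..., Any],
    max_attempts: int = 3,
    wait_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Send `payload` and return the parsed JSON, retrying on non-200 responses."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_status = None
    for attempt in range(1, max_attempts + 1):
        response = send(url, headers=headers, json=payload)
        if response.status_code == 200:
            return response.json()
        last_status = response.status_code
        # Wait before the next attempt, but not after the final one.
        if attempt < max_attempts:
            sleep(wait_seconds)
    raise RuntimeError(
        f"API request failed after {max_attempts} attempts (last status: {last_status})"
    )
```

In this notebook the call would look like `query_with_retry(batch, API_URL, headers, send=requests.post)`. The retry count and wait time are arbitrary defaults and would need tuning for real use.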

### Understanding the Output

In [1]:
output[:5]


The output from the API provides a score for each sentiment category per tweet, indicating the confidence level of each sentiment prediction. This allows us to determine the most likely sentiment expressed in each tweet.

### Label Extraction and Assignment

In [10]:
# Initialize an empty list to hold the highest sentiment labels
highest_labels = []

# Extract the highest sentiment label from each result
for group in output:
    for result in group:
        highest = max(result, key=lambda x: x['score'])
        highest_labels.append(highest['label'].split('_')[1])

# Add the highest sentiment labels back to the dataframe
df['label'] = highest_labels


After processing, we extract the highest scoring label for each tweet and add this label back into our DataFrame. This step converts the raw output into a practical annotation of the dataset.
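
The extraction loop assumes every batch response is a list of per-tweet results. If a batch came back as an error object instead, the loop would fail partway or misalign the labels with the rows. The sketch below wraps the same logic in a function that checks the shape first. `extract_labels` is a hypothetical helper, and the scores in `example_output` are invented for illustration, not real model output.

```python
from typing import Any, List


def extract_labels(output: List[Any]) -> List[str]:
    """Return the ID of the highest-scoring label for every tweet in `output`."""
    labels = []
    for batch_index, group in enumerate(output):
        # An error payload is typically a dict rather than a list of results.
        if not isinstance(group, list):
            raise ValueError(f"Batch {batch_index} is not a list of results: {group!r}")
        for result in group:
            best = max(result, key=lambda item: item["score"])
            labels.append(best["label"].split("_")[1])
    return labels


# One batch holding two tweets; the scores are made up.
example_output = [
    [
        [
            {"label": "LABEL_0", "score": 0.1},
            {"label": "LABEL_1", "score": 0.2},
            {"label": "LABEL_2", "score": 0.7},
        ],
        [
            {"label": "LABEL_0", "score": 0.8},
            {"label": "LABEL_1", "score": 0.15},
            {"label": "LABEL_2", "score": 0.05},
        ],
    ]
]

print(extract_labels(example_output))  # ['2', '0']
```

Before assigning the result to `df['label']`, comparing `len(labels)` with `len(df)` gives a cheap check that every tweet received exactly one label.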

### Saving the Results

In [12]:
# Define the final dataframe to be saved
df_final = df[['Content', 'label']]

# Define the file path for the new CSV
file_path = os.path.join('data', 'labeled_texts_1000.csv')

# Save the dataframe to a CSV file without the index; 'utf-8-sig' writes a BOM so spreadsheet tools detect UTF-8
df_final.to_csv(file_path, index=False, encoding='utf-8-sig')

The fully annotated dataset is saved as a CSV file, preserving the original text alongside the newly assigned sentiment labels. This file can now be used for further analysis and training predictive models.
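
When the saved file is read back, pandas infers column types, so labels stored as the strings `'0'`, `'1'`, `'2'` come back as integers unless a dtype is given. The sketch below round-trips a small made-up DataFrame through a temporary file to show how `dtype={"label": str}` keeps them as strings. The file name and rows are placeholders, not the real dataset.

```python
import os
import tempfile

import pandas as pd

# Placeholder rows for illustration only.
labeled = pd.DataFrame(
    {"Content": ["gm", "rekt", "ok"], "label": ["2", "0", "1"]}
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "labeled_example.csv")
    labeled.to_csv(path, index=False, encoding="utf-8-sig")
    # Read the labels back as strings instead of letting pandas infer integers.
    reloaded = pd.read_csv(path, encoding="utf-8-sig", dtype={"label": str})

print(reloaded["label"].tolist())  # ['2', '0', '1']
```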

### Citation


Barbieri, F., Camacho-Collados, J., Espinosa Anke, L., & Neves, L. (2020). TweetEval: Unified Benchmark and Comparative Evaluation for Tweet Classification. In Findings of the Association for Computational Linguistics: EMNLP 2020 (pp. 1644–1650). Association for Computational Linguistics.

