# Labeling

## Automatic Sentiment Labeling with Twitter-RoBERTa

- Applies `cardiffnlp/twitter-roberta-base-sentiment-latest` to classify the reviews marked as real (`label == 1`) as **positive**, **neutral**, or **negative**.
- Applies Twitter-style preprocessing (user mentions become `@user`, links become `http`) and reports the top `softmax` probability as a confidence score.
- Saves the sentiment-labeled AliExpress reviews to CSV for downstream sentiment classification or model training.
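
As a rough illustration of the preprocessing and scoring steps listed above, the sketch below masks Twitter-style tokens and turns raw logits into a confidence score. The logits and the `LABELS` order are made-up values for illustration, not output from the model; in the real pipeline the label names come from `config.id2label`.

```python
import math

def preprocess(text):
    """Replace user mentions with '@user' and links with 'http'."""
    tokens = []
    for t in text.split(" "):
        if t.startswith("@") and len(t) > 1:
            t = "@user"
        elif t.startswith("http"):
            t = "http"
        tokens.append(t)
    return " ".join(tokens)

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical label order and logits, for illustration only
LABELS = ["negative", "neutral", "positive"]
logits = [-1.0, 0.5, 2.0]

probs = softmax(logits)
top = max(range(len(probs)), key=probs.__getitem__)

print(preprocess("@seller thanks, see https://example.com/item"))
print(LABELS[top], round(probs[top], 4))
```

The confidence is simply the largest softmax probability, so it says how strongly the model prefers one class over the others, not how likely the label is to be correct.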


In [None]:
import pandas as pd
from transformers import AutoModelForSequenceClassification, AutoTokenizer, AutoConfig
import numpy as np
from scipy.special import softmax
from tqdm import tqdm

# Enable tqdm in pandas apply
tqdm.pandas()

# Preprocess text for Twitter-style tokens
def preprocess(text):
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)

# Classify one review; returns (label, confidence)
def classify_sentiment(text, model, tokenizer, config):
    try:
        text = preprocess(text)
        # Note: truncation=True only takes effect if a maximum length is known
        # (see the tokenizer warning in the output below)
        encoded_input = tokenizer(text, return_tensors='pt', truncation=True)
        output = model(**encoded_input)
        # Logits for the single input -> class probabilities
        scores = softmax(output[0][0].detach().numpy())
        top_idx = int(np.argmax(scores))
        top_label = config.id2label[top_idx]
        return top_label, round(float(scores[top_idx]), 4)
    except Exception:
        # Fallback for non-string input (e.g. NaN) or tokenizer/model failures
        return "error", 0.0

# Load model and tokenizer
MODEL = "cardiffnlp/twitter-roberta-base-sentiment-latest"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
config = AutoConfig.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

# Load your labeled review dataset
df = pd.read_csv('/content/New_final_Processed_labeled_reviews.csv')

# Keep only real reviews (label == 1)
df = df[df['label'] == 1].copy()

# Optional: drop the now-constant label column
# df = df.drop(columns=['label'])

# Log dataset info
print("Column names:", df.columns.tolist())
print("Shape of the dataset:", df.shape)

# Apply sentiment classification with progress bar
df[['sentiment', 'confidence']] = df['reviewContent'].progress_apply(
    lambda x: pd.Series(classify_sentiment(x, model, tokenizer, config))
)

# Count sentiments
sentiment_counts = df['sentiment'].value_counts()
print("Sentiment Counts:")
print(sentiment_counts)

# Save to CSV
df.to_csv('Product Review Of AliExpress SENTIMENT Processed.csv', index=False)
print("Saved to 'Product Review Of AliExpress SENTIMENT Processed.csv'")


Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Column names: ['productURL', 'userName', 'userCountry', 'userStar', 'reviewContent', 'reviewTime', 'language', 'label']
Shape of the dataset: (12916, 8)


  0%|          | 0/12916 [00:00<?, ?it/s]
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
100%|██████████| 12916/12916 [26:24<00:00,  8.15it/s]
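
The tokenizer warning in this output says that `truncation=True` had no effect because no maximum length was available. One way to handle this, sketched below, is to pass `max_length` explicitly. The value 512 is an assumption (typical for RoBERTa-base checkpoints) and should be confirmed against the model's config; `encode_review` and `MAX_TOKENS` are hypothetical names, not part of the original notebook.

```python
MAX_TOKENS = 512  # assumption: RoBERTa-base style limit; verify against the model config

def encode_review(text, tokenizer, max_length=MAX_TOKENS):
    """Tokenize one review with an explicit length cap so truncation takes effect."""
    return tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=max_length,
    )
```

Inside `classify_sentiment`, the direct tokenizer call could then be replaced with `encoded_input = encode_review(text, tokenizer)`, so that very long reviews are cut to the limit instead of failing and falling back to the `"error"` label.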


Sentiment Counts:
sentiment
positive    8801
neutral     2101
negative    2014
Name: count, dtype: int64
Saved to 'Product Review Of AliExpress SENTIMENT Processed.csv'
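
The counts printed above can be turned into shares for a quick sanity check. The sketch below hard-codes the three counts from this run; with the DataFrame still in memory, `df['sentiment'].value_counts(normalize=True)` should give the same proportions.

```python
# Counts copied from the "Sentiment Counts" output of this run
counts = {"positive": 8801, "neutral": 2101, "negative": 2014}
total = sum(counts.values())  # should equal the 12916 filtered rows

# Percentage share of each sentiment class
shares = {label: round(100 * n / total, 2) for label, n in counts.items()}

for label, pct in shares.items():
    print(f"{label:<9}{pct:6.2f}%")
```

The three counts add up to the 12916 rows reported earlier, and no row received the `"error"` fallback label. Roughly two thirds of the reviews are labeled positive, which may be worth accounting for if the file is used to train a classifier.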
