### Sentiment Analysis with Transformers - Comprehensive Walkthrough

This notebook builds a binary sentiment classifier by fine-tuning a pre-trained BERT model with Hugging Face's Transformers library. Each step, from text cleaning to model training, is explained alongside its code.

#### Importing Required Libraries

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from transformers import DataCollatorWithPadding
from sklearn.metrics import accuracy_score, f1_score
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
from torch.utils.data import Dataset

In this step, we import libraries required for:
- Data handling (`pandas`).
- Splitting datasets into training, validation, and testing sets (`train_test_split`).
- Tokenization, model initialization, and training using Hugging Face's Transformers.
- Evaluation using accuracy and F1 scores.
- Data visualization (`matplotlib`, `seaborn`).
- Regular expressions (`re`) for cleaning text.
- Custom dataset creation for PyTorch integration.

#### Cleaning Text

In [None]:
def clean_text(text):
    if not isinstance(text, str):
        return ""
    text = re.sub(r"http\S+", "", text)  # Remove URLs
    text = re.sub(r"@\w+", "", text)    # Remove mentions
    text = re.sub(r"[^\w\s]", "", text) # Remove special characters
    return text.strip().lower()

Text preprocessing is essential for consistent and clean input to the model. Here:
1. **Non-string handling**: Ensures the input is a string; otherwise, it returns an empty string.
2. **Removing URLs**: Eliminates web links using a regex pattern (`http\S+`).
3. **Removing mentions**: Removes mentions such as `@username`.
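4. **Removing special characters**: Drops everything except word characters (letters, digits, and underscores, including non-ASCII letters such as `ş` or `ğ`) and whitespace.
5. **Trimming and lowercasing**: Strips leading and trailing whitespace and converts the text to lowercase for uniformity. Note that the default checkpoint used later is a cased model, so lowercasing discards casing information it could otherwise use.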
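
As a quick check of the steps above, the following sketch runs the same function on a few made-up inputs. The sample strings are illustrative assumptions, not rows from the dataset, and the function is repeated only so the snippet runs on its own.

```python
import re


def clean_text(text):
    # Same logic as the cell above, repeated so this snippet is self-contained.
    if not isinstance(text, str):
        return ""
    text = re.sub(r"http\S+", "", text)   # Remove URLs
    text = re.sub(r"@\w+", "", text)      # Remove mentions
    text = re.sub(r"[^\w\s]", "", text)   # Remove punctuation and symbols
    return text.strip().lower()


# Hypothetical inputs chosen to exercise each cleaning step.
samples = [
    "Harika bir ürün! @satici https://example.com/urun",
    "Bunu @ali gördü",
    "snake_case stays, 100% doesn't",
    None,
]
cleaned = [clean_text(s) for s in samples]
for original, result in zip(samples, cleaned):
    print(repr(original), "->", repr(result))
```

Interior whitespace is not collapsed, so removing a mention from the middle of a sentence leaves a double space, and underscores survive because `\w` matches them.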

#### Loading and Preparing Data

In [None]:
def load_and_prepare_data(file_path):
    df = pd.read_csv(file_path)
    sentiment_mapping = {'negative': 0, 'positive': 1}
    df['Label'] = df['Sentiment'].map(sentiment_mapping)
    df['Cleaned_Text'] = df['Text'].apply(clean_text)  # Requires a 'Text' column in the CSV
    return df

Here, we:
1. **Load the dataset**: Reads the CSV file into a Pandas DataFrame.
2. **Map sentiments**: Converts the textual labels (`negative`, `positive`) in the `Sentiment` column into numerical ones (`0`, `1`). Any other value becomes `NaN`.
3. **Clean the text**: Applies the `clean_text` function to preprocess the raw text.
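
Labels that are missing from the mapping dictionary turn into `NaN`, which also makes the `Label` column floating-point. The sketch below uses a small made-up DataFrame (the `neutral` row is a hypothetical example) to show one way to detect and drop such rows; whether dropping them is appropriate depends on the dataset.

```python
import pandas as pd

# Hypothetical data with one label that the mapping does not cover.
df = pd.DataFrame({
    "Text": ["Çok iyi", "Berbat", "Fena değil"],
    "Sentiment": ["positive", "negative", "neutral"],
})

sentiment_mapping = {"negative": 0, "positive": 1}
df["Label"] = df["Sentiment"].map(sentiment_mapping)

# Count rows whose sentiment had no entry in the mapping.
unmapped = int(df["Label"].isna().sum())
print(f"Rows with unmapped labels: {unmapped}")

# Drop those rows and restore an integer dtype for the classifier.
df = df.dropna(subset=["Label"]).astype({"Label": int})
print(df)
```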

#### Splitting Data

In [None]:
def split_data(df):
    train_data, temp_data = train_test_split(df, test_size=0.2, stratify=df['Label'], random_state=42)
    val_data, test_data = train_test_split(temp_data, test_size=0.5, stratify=temp_data['Label'], random_state=42)
    return train_data, val_data, test_data

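The dataset is split into training, validation, and test sets using stratified sampling, so each subset keeps the same class distribution as the full dataset. The first call holds out 20% of the rows, and the second call divides that hold-out in half, giving 80% training, 10% validation, and 10% test. Fixing `random_state=42` makes the split reproducible.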
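
To confirm the resulting proportions, the following sketch applies the same two calls to a made-up, perfectly balanced DataFrame of 100 rows. The data is an assumption for illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical balanced dataset: 100 rows, 50 per class.
df = pd.DataFrame({
    "Text": [f"sample {i}" for i in range(100)],
    "Label": [0, 1] * 50,
})

# Same two-stage split as split_data above.
train_data, temp_data = train_test_split(
    df, test_size=0.2, stratify=df["Label"], random_state=42
)
val_data, test_data = train_test_split(
    temp_data, test_size=0.5, stratify=temp_data["Label"], random_state=42
)

for name, part in [("train", train_data), ("val", val_data), ("test", test_data)]:
    print(name, len(part), part["Label"].value_counts().to_dict())
```

With this input the subsets contain 80, 10, and 10 rows, and each keeps the 50/50 class balance.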

#### Preprocessing Text for Tokenization

In [None]:
def preprocess_function(examples, tokenizer):
    encoded = tokenizer(
        list(examples['Cleaned_Text']),  # Ensure we pass a list of text data
        truncation=True,
        padding=True,
        max_length=128
    )
    encoded["labels"] = list(examples["Label"])
    return encoded

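We tokenize the text with the tokenizer that matches the pre-trained model:
- **Truncation**: Cuts any sequence longer than 128 tokens (`max_length=128`).
- **Padding**: With `padding=True`, pads every sequence to the longest one in the call, so all rows in the returned encoding have the same length (at most 128 tokens).
- **Labels**: Adds the numerical labels to the tokenized output. This step prepares the data for the model.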
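
The effect of truncation and padding can be sketched on plain lists of token IDs. This is a simplified stand-in, not the tokenizer's implementation: `pad_and_truncate` and the pad ID `0` are hypothetical, and a real tokenizer also handles special tokens when it truncates.

```python
def pad_and_truncate(sequences, max_length, pad_id=0):
    """Toy illustration of truncation followed by pad-to-longest."""
    # Truncation: keep at most max_length IDs per sequence.
    truncated = [seq[:max_length] for seq in sequences]
    # Padding: extend every sequence to the longest one in this call.
    longest = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in truncated]
    # The attention mask marks real tokens with 1 and padding with 0.
    attention_mask = [
        [1] * len(seq) + [0] * (longest - len(seq)) for seq in truncated
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token IDs: one sequence is too long, one is short.
batch = pad_and_truncate([[5, 6, 7, 8, 9], [3, 4]], max_length=4)
print(batch["input_ids"])
print(batch["attention_mask"])
```

The first sequence loses its fifth ID, and the second gains two pad IDs whose attention-mask entries are `0`.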

#### Computing Evaluation Metrics

In [None]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, predictions),
        "f1": f1_score(labels, predictions, average="weighted"),
    }

Evaluation metrics include:
- **Accuracy**: The proportion of correctly predicted labels.
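- **F1 score**: The harmonic mean of precision and recall, computed per class and then averaged with weights proportional to each class's number of true samples (`average="weighted"`).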
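
A small worked example makes the two metrics concrete. The logits and labels below are made up for illustration, and `compute_metrics` is repeated so the snippet runs on its own.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score


def compute_metrics(eval_pred):
    # Same function as the cell above.
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, predictions),
        "f1": f1_score(labels, predictions, average="weighted"),
    }


# Hypothetical model outputs for four samples (columns: class 0, class 1).
logits = np.array([[2.0, 0.1], [1.5, 0.3], [0.2, 0.9], [0.1, 1.2]])
labels = np.array([0, 0, 0, 1])

metrics = compute_metrics((logits, labels))
print(metrics)
```

The predictions are `[0, 0, 1, 1]`, so three of four are correct and accuracy is 0.75. Class 0 has an F1 of 0.8 over three true samples and class 1 has an F1 of about 0.667 over one, so the weighted F1 is (3 × 0.8 + 1 × 0.667) / 4 ≈ 0.767.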

#### Creating a Custom Dataset for PyTorch

In [None]:
class CustomDataset(Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __len__(self):
        return len(self.encodings["input_ids"])

    def __getitem__(self, idx):
        return {
            "input_ids": self.encodings["input_ids"][idx],
            "attention_mask": self.encodings["attention_mask"][idx],
            "labels": self.encodings["labels"][idx],
        }

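This class wraps the tokenized encodings in a map-style PyTorch `Dataset`. It provides:
- **Length**: `__len__` reports the number of samples.
- **Indexing**: `__getitem__` returns one sample's `input_ids`, `attention_mask`, and `labels` as a dictionary.
- **Integration**: Seamless usage with PyTorch data loaders and the Hugging Face `Trainer`.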
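
The sketch below shows the class feeding a plain PyTorch `DataLoader`. The encodings are made-up token IDs, and `collate` is a hypothetical helper written for this example; the notebook itself relies on `DataCollatorWithPadding` for batching.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class CustomDataset(Dataset):
    # Same class as the cell above, repeated so this snippet is self-contained.
    def __init__(self, encodings):
        self.encodings = encodings

    def __len__(self):
        return len(self.encodings["input_ids"])

    def __getitem__(self, idx):
        return {
            "input_ids": self.encodings["input_ids"][idx],
            "attention_mask": self.encodings["attention_mask"][idx],
            "labels": self.encodings["labels"][idx],
        }


# Hypothetical, already padded encodings for three samples.
encodings = {
    "input_ids": [[2, 10, 3], [2, 11, 3], [2, 12, 3]],
    "attention_mask": [[1, 1, 1], [1, 1, 1], [1, 1, 1]],
    "labels": [0, 1, 0],
}
dataset = CustomDataset(encodings)


def collate(batch):
    # Stack each field of the per-sample dictionaries into one tensor.
    return {key: torch.tensor([item[key] for item in batch]) for key in batch[0]}


loader = DataLoader(dataset, batch_size=2, shuffle=False, collate_fn=collate)
batches = list(loader)
for batch in batches:
    print(batch["input_ids"].shape, batch["labels"].tolist())
```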

#### Training the Model

In [None]:
def train_model(train_df, val_df, model_name="dbmdz/bert-base-turkish-cased"):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

    train_encodings = preprocess_function(train_df, tokenizer)
    val_encodings = preprocess_function(val_df, tokenizer)

    train_dataset = CustomDataset(train_encodings)
    val_dataset = CustomDataset(val_encodings)

    data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

    training_args = TrainingArguments(
        output_dir="./results",
        evaluation_strategy="epoch",  # Named eval_strategy in newer transformers releases
        save_strategy="epoch",
        learning_rate=2e-5,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=16,
        num_train_epochs=3,
        weight_decay=0.01,
        save_total_limit=2,
        load_best_model_at_end=True,
        metric_for_best_model="accuracy",
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
        data_collator=data_collator,
        compute_metrics=compute_metrics,
    )

    trainer.train()
    return trainer, tokenizer

We use Hugging Face's `Trainer` to simplify the training loop. Key configurations include:
- **Learning rate**: 2e-5 for fine-tuning.
- **Batch size**: 16 for both training and evaluation.
- **Epochs**: 3 for training.
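- **Weight decay**: 0.01 for regularization.
- **Model checkpointing**: Evaluates and saves a checkpoint after every epoch, keeps at most two checkpoints on disk (`save_total_limit=2`), and reloads the checkpoint with the best validation accuracy when training ends.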
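
These settings also determine how many optimizer steps a run takes. The following is a back-of-the-envelope sketch, assuming a single device, no gradient accumulation, and that the last partial batch is kept; `training_steps` is a hypothetical helper and the dataset sizes are made up.

```python
import math


def training_steps(num_examples, batch_size=16, epochs=3):
    """Estimate optimizer steps per epoch and in total."""
    # A partial final batch still counts as one step, hence the ceiling.
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs


# Hypothetical training-set sizes.
steps_per_epoch, total_steps = training_steps(8000)
print(steps_per_epoch, total_steps)

uneven_per_epoch, uneven_total = training_steps(8010)
print(uneven_per_epoch, uneven_total)
```

With 8,000 training rows this gives 500 steps per epoch and 1,500 in total, with an evaluation and a checkpoint after each epoch.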

#### Running the Entire Pipeline

In [None]:
if __name__ == "__main__":
    dataset_path = "/Users/onuryuksel/Desktop/cleaned_balanced_sentiment_data.csv"  # Update with your dataset path
    df = load_and_prepare_data(dataset_path)

    # Debug data
    print(df[['Cleaned_Text']].head())

    train_df, val_df, test_df = split_data(df)

    model_name = "dbmdz/bert-base-turkish-cased"
    trainer, tokenizer = train_model(train_df, val_df, model_name)

Finally, we run the complete pipeline, from data loading to model training. Adjust `dataset_path` and `model_name` to match your setup, and verify the preprocessing by inspecting the printed sample of cleaned text. Note that `test_df` is held out by the split but is not evaluated in this cell.
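
The held-out test split can be scored after training. The following is a minimal sketch, assuming the `trainer`, `tokenizer`, `preprocess_function`, and `CustomDataset` objects defined earlier are in scope; `evaluate_on_test` is a hypothetical helper, not part of the original pipeline.

```python
def evaluate_on_test(trainer, tokenizer, test_df, preprocess_fn, dataset_cls):
    """Tokenize the held-out test split and score it with the trained model."""
    test_encodings = preprocess_fn(test_df, tokenizer)
    test_dataset = dataset_cls(test_encodings)
    # Trainer.evaluate accepts an eval_dataset that overrides the one
    # passed when the Trainer was constructed.
    return trainer.evaluate(eval_dataset=test_dataset)


# Example call, to be run after the pipeline cell above has finished:
# test_metrics = evaluate_on_test(
#     trainer, tokenizer, test_df, preprocess_function, CustomDataset
# )
# print(test_metrics)
```

The returned dictionary uses the `eval_` prefix for its keys, so the values from `compute_metrics` appear as `eval_accuracy` and `eval_f1`.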

### Visualizing Weekly Negativity Percentage

This section compares the weekly negativity percentage, that is, the share of texts each model labels as negative in a given week, across three models:
- **Model 1**: Results visualized in `skyblue`
- **Model 2**: Results visualized in `lightcoral`
- **Model 3**: Results visualized in `lightgreen`

The bar chart allows for easy comparison of negativity trends over different weeks.

In [None]:
# Visualizing Weekly Negativity Percentage Across Models
file_path_1 = '/Users/onuryuksel/Downloads/dataset1result.csv'
file_path_2 = '/Users/onuryuksel/Downloads/dataset2result.csv'
file_path_3 = '/Users/onuryuksel/Downloads/dataset3result.csv'
data_1 = pd.read_csv(file_path_1)
data_2 = pd.read_csv(file_path_2)
data_3 = pd.read_csv(file_path_3)

# Calculate weekly negativity percentages
data_1['negative'] = data_1['sentiment'] == 'LABEL_2'
data_2['negative'] = data_2['sentiment'] == 'negative'
data_3['negative'] = data_3['sentiment'] == 'LABEL_2'

weekly_negativity_1 = data_1.groupby('week')['negative'].mean() * 100
weekly_negativity_2 = data_2.groupby('week')['negative'].mean() * 100
weekly_negativity_3 = data_3.groupby('week')['negative'].mean() * 100

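# Align all three series on the same set of weeks (assumes `week` is numeric)
weeks = sorted(set(weekly_negativity_1.index) | set(weekly_negativity_2.index) | set(weekly_negativity_3.index))
weekly_negativity_1 = weekly_negativity_1.reindex(weeks)
weekly_negativity_2 = weekly_negativity_2.reindex(weeks)
weekly_negativity_3 = weekly_negativity_3.reindex(weeks)

# Plot the negativity percentages for all datasets
plt.figure(figsize=(12, 8))
bar_width = 0.25  # Three bars per week, side by side without overlapping

# First dataset (Model 1), left bar
plt.bar([w - bar_width for w in weeks], weekly_negativity_1.values, width=bar_width, label='Model 1', color='skyblue')

# Second dataset (Model 2), center bar
plt.bar(weeks, weekly_negativity_2.values, width=bar_width, label='Model 2', color='lightcoral')

# Third dataset (Model 3), right bar
plt.bar([w + bar_width for w in weeks], weekly_negativity_3.values, width=bar_width, label='Model 3', color='lightgreen')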

# Add titles, labels, and legend
plt.title('Weekly Negativity Percentage (Comparison)', fontsize=16)
plt.xlabel('Week', fontsize=14)
plt.ylabel('Negativity Percentage (%)', fontsize=14)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.xticks(weeks, rotation=45)
plt.legend(fontsize=12)
plt.tight_layout()
plt.show()
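
The percentage calculation can be checked on a tiny example before plotting. The sketch below uses two made-up result tables; `weekly_negativity` is a hypothetical helper, and treating `LABEL_2` as the negative class simply follows the code above rather than any documented label scheme.

```python
import pandas as pd

# Hypothetical predictions from two models; weeks and labels are made up.
model_a = pd.DataFrame({
    "week": [1, 1, 2, 2, 2],
    "sentiment": ["LABEL_2", "LABEL_0", "LABEL_2", "LABEL_2", "LABEL_1"],
})
model_b = pd.DataFrame({
    "week": [2, 2, 3, 3],
    "sentiment": ["negative", "positive", "negative", "negative"],
})


def weekly_negativity(df, negative_label):
    # The mean of a boolean column is the fraction of True values.
    is_negative = df["sentiment"] == negative_label
    return is_negative.groupby(df["week"]).mean() * 100


neg_a = weekly_negativity(model_a, "LABEL_2")
neg_b = weekly_negativity(model_b, "negative")

# Align both series on the union of weeks; missing weeks become NaN.
all_weeks = sorted(set(neg_a.index) | set(neg_b.index))
aligned = pd.DataFrame({
    "Model A": neg_a.reindex(all_weeks),
    "Model B": neg_b.reindex(all_weeks),
})
print(aligned)
```

Aligning on a shared index matters when the result files do not cover exactly the same weeks, since each bar series must have one value per x position.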
