# **Sentiment Analysis with BERT**

This Jupyter notebook performs sentiment analysis on the Rotten Tomatoes dataset using the **BERT (Bidirectional Encoder Representations from Transformers)** model. The dataset consists of phrases labeled with sentiment scores from 0 to 4. Below is an overview of the workflow, model architecture, and results.


## **Workflow**

### **1. Import Libraries**

In [None]:
# import transformers
import numpy as np
import pandas as pd
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments, EarlyStoppingCallback, EvalPrediction
from sklearn.metrics import f1_score, roc_auc_score, accuracy_score
from sklearn.model_selection import train_test_split
from torch.utils.data import Dataset, DataLoader

### **2. Load and Preprocess Data**
- Calculate phrase length statistics.
- Load the BERT tokenizer and encode the input data.

In [5]:
# Load data
# `train_data` and `test_data` are assumed to be DataFrames loaded in an earlier cell (not shown here)
train = train_data.copy(deep=True)
test = test_data.copy(deep=True)

(156060, 4) (66292, 3)


| | PhraseId | SentenceId | Phrase | Sentiment |
|---|---|---|---|---|
| 0 | 1 | 1 | A series of escapades demonstrating the adage ... | 1 |
| 1 | 2 | 1 | A series of escapades demonstrating the adage ... | 2 |
| 2 | 3 | 1 | A series | 2 |
| 3 | 4 | 1 | A | 2 |
| 4 | 5 | 1 | series | 2 |


In [7]:
# Calculate phrase length statistics
train['Phrase_Length'] = train['Phrase'].apply(len)
length_stats = train['Phrase_Length'].describe()
print(length_stats)

min_length = train['Phrase_Length'].min()
max_length = train['Phrase_Length'].max()

print(f"Minimum length: {min_length}")
print(f"Maximum length: {max_length}")


count    156060.000000
mean         40.217224
std          38.154130
min           1.000000
25%          14.000000
50%          26.000000
75%          53.000000
max         283.000000
Name: Phrase_Length, dtype: float64
Minimum length: 1
Maximum length: 283
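
Note that `apply(len)` counts characters, not tokens, whereas the tokenizer's `max_length` is measured in tokens. As a rough proxy, whitespace-separated word counts can be compared with character counts. The two-row frame below is a stand-in built from phrases shown in the preview above, and the `Word_Count` column is an illustrative addition; actual WordPiece token counts will usually be somewhat higher than word counts.

```python
import pandas as pd

# Stand-in frame using two phrases from the preview above (not the full dataset).
sample = pd.DataFrame({"Phrase": ["A series", "series"]})

# Character length, as computed in the notebook.
sample["Phrase_Length"] = sample["Phrase"].str.len()

# Whitespace word count: roughly a lower bound on the WordPiece token count.
sample["Word_Count"] = sample["Phrase"].str.split().str.len()

print(sample)
```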


In [9]:
# Load BERT tokenizer
# Assumes bert-base-uncased has been downloaded from Hugging Face into ./bert_base_uncased
tokenizer = AutoTokenizer.from_pretrained("./bert_base_uncased")

# Tokenize and encode input data
encoded_inputs = tokenizer(
    train["Phrase"].tolist(),  # Input text list
    padding=True,              # Pad to the longest sequence in the batch
    truncation=True,           # Truncate sequences longer than max_length
    max_length=128,            # Maximum sequence length (in tokens)
    return_tensors="pt"        # Return PyTorch tensors
)

print(encoded_inputs.keys())   # Includes input_ids, token_type_ids, attention_mask


dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
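
To make the `input_ids` and `attention_mask` fields concrete, here is a minimal pure-Python sketch of padding a batch to its longest sequence. It is not the tokenizer's actual implementation: `pad_batch` is a hypothetical helper, and the short sequences are toy inputs assembled from IDs that appear in the tensors printed in this notebook (101 for `[CLS]`, 102 for `[SEP]`, 0 for padding).

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-ID lists to the longest one and build matching attention masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

# Toy batch: [CLS] a series [SEP] and [CLS] series [SEP]
batch = [
    [101, 1037, 2186, 102],
    [101, 2186, 102],
]
ids, mask = pad_batch(batch)
print(ids)
print(mask)
```

With `padding=True`, the tokenizer pads to the longest sequence it was given, and `max_length=128` only caps that length. This is consistent with the printed rows in this notebook having 80 entries rather than 128.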


In [10]:
# Print input_ids, token_type_ids, and attention_mask for the second example
print("Input IDs:", encoded_inputs['input_ids'][1])
print("Token Type IDs:", encoded_inputs['token_type_ids'][1])
print("Attention Mask:", encoded_inputs['attention_mask'][1])

Input IDs: tensor([  101,  1037,  2186,  1997,  9686, 17695, 18673, 14313,  1996, 15262,
         3351,  2008,  2054,  2003,  2204,  2005,  1996, 13020,   102,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0])
Token Type IDs: tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0])
Attention Mask: tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 

### **3. Load BERT Model**

In [11]:
# Load pre-trained BERT model
local_model_path = "./bert_base_uncased"

model = AutoModelForSequenceClassification.from_pretrained(
    local_model_path,  # Local checkpoint directory
    num_labels=5       # Five sentiment classes (0-4)
)
# Print model architecture
print(model)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at /lustre/user/liclab/lisky/buyf/Class/Introduce2Data/bert_base_uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.3, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.3, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e
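
As a quick check on the printed architecture, the embedding and classifier sizes can be worked out by hand. The sketch below assumes the classifier is a single linear layer from the 768-dimensional hidden state to the 5 labels, which is the usual `BertForSequenceClassification` head; the printout above is cut off before that layer, so treat that part as an assumption.

```python
hidden_size = 768     # embedding width from the printout
vocab_size = 30522    # rows in word_embeddings
max_positions = 512   # rows in position_embeddings
token_types = 2       # rows in token_type_embeddings
num_labels = 5        # sentiment classes 0-4

# Embedding tables only (LayerNorm parameters are excluded).
embedding_params = (vocab_size + max_positions + token_types) * hidden_size

# Assumed linear head: weight matrix plus bias vector.
classifier_params = hidden_size * num_labels + num_labels

print(f"embedding tables: {embedding_params:,}")
print(f"classifier head:  {classifier_params:,}")
```

The classifier head is the part reported as newly initialized in the warning, so it is the only component that starts from random weights.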

## **Training**

### **4. Define Training Arguments**

In [13]:
# Training arguments
batch_size = 64  # Adjust based on your machine's capabilities
metric_name = 'f1'  # Use F1 score as the evaluation metric

args = TrainingArguments(
    output_dir="./results",  # Directory to save the model
    eval_strategy="epoch",   # Evaluate after each epoch
    save_strategy="epoch",   # Save model after each epoch
    learning_rate=1e-5,      # Learning rate
    per_device_train_batch_size=batch_size,  # Batch size for training
    per_device_eval_batch_size=batch_size,   # Batch size for evaluation
    num_train_epochs=50,     # Total number of training epochs
    weight_decay=0.1,        # Weight decay applied by the optimizer
    load_best_model_at_end=True,  # Load the best model at the end of training
    metric_for_best_model=metric_name,  # Metric to determine the best model
    logging_dir='./logs',    # Directory for logs
    logging_steps=10,        # Log every 10 steps
    eval_steps=500,          # Only used when eval_strategy="steps"; ignored with "epoch"
    warmup_steps=500,        # Warmup steps for learning rate
    fp16=True,               # Enable mixed precision training
)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
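
The `warmup_steps=500` setting shapes the learning rate as sketched below, assuming the Trainer's default linear schedule (linear warmup, then linear decay to zero). This is a simplified illustration rather than the library's code: `lr_at_step` is a hypothetical helper, and `total_steps=24400` is an assumption derived from the training log later in this notebook (2440 steps over 5 epochs, scaled to the 50 configured epochs).

```python
def lr_at_step(step, base_lr=1e-5, warmup_steps=500, total_steps=24400):
    """Linear warmup to base_lr, then linear decay to zero (simplified)."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# Learning rate at a few points: start, mid-warmup, end of warmup,
# the step where training actually stopped, and the planned final step.
for step in (0, 250, 500, 2440, 24400):
    print(step, f"{lr_at_step(step):.2e}")
```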


### **5. Define Metrics and Dataset**

In [14]:
# Define metrics
def multi_class_metrics(predictions, labels):
    softmax = torch.nn.Softmax(dim=-1)
    probs = softmax(torch.Tensor(predictions)).numpy()  # Class probabilities as a NumPy array
    y_pred = np.argmax(probs, axis=1)  # Most probable class for each example
    y_true = labels

    # Calculate F1 score, ROC AUC, and accuracy
    # Note: for single-label multi-class data, micro-averaged F1 equals accuracy
    f1_micro_average = f1_score(y_true=y_true, y_pred=y_pred, average='micro')
    roc_auc = roc_auc_score(y_true, probs, multi_class='ovr', average='macro')
    accuracy = accuracy_score(y_true, y_pred)

    return {
        'f1': f1_micro_average,
        'roc_auc': roc_auc,
        'accuracy': accuracy
    }

def compute_metrics(p: EvalPrediction):
    preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
    return multi_class_metrics(predictions=preds, labels=p.label_ids)
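
One property of this metric choice is worth seeing directly: for single-label multi-class predictions, micro-averaged F1 is equal to accuracy, which is why the F1 and Accuracy columns in the training results are identical. The labels below are made-up toy values, and macro-F1 is shown only as a point of comparison, not as something the notebook uses.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Toy labels for the five sentiment classes (illustrative only).
y_true = np.array([0, 1, 2, 2, 3, 4, 2, 1])
y_pred = np.array([0, 2, 2, 1, 3, 4, 2, 1])

micro_f1 = f1_score(y_true, y_pred, average="micro")
macro_f1 = f1_score(y_true, y_pred, average="macro")
accuracy = accuracy_score(y_true, y_pred)

print(f"micro F1: {micro_f1:.4f}")
print(f"accuracy: {accuracy:.4f}")
print(f"macro F1: {macro_f1:.4f}")
```

If per-class balance matters, macro-averaged F1 may be a more informative value for `metric_for_best_model`, since it weights every class equally.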

In [15]:
# Convert sentiment labels to a tensor
labels = train['Sentiment'].tolist()
labels_tensor = torch.tensor(labels, dtype=torch.long)

In [16]:
print(labels_tensor)

tensor([1, 2, 2,  ..., 3, 2, 2])


In [18]:
# Prepare dataset
input_ids = encoded_inputs['input_ids']
attention_mask = encoded_inputs['attention_mask']
token_type_ids = encoded_inputs['token_type_ids']
labels = labels_tensor

# Split dataset into training and validation sets
train_inputs, val_inputs, train_attention_mask, val_attention_mask, train_token_type_ids, val_token_type_ids, train_labels, val_labels = train_test_split(
    input_ids, attention_mask, token_type_ids, labels, test_size=0.2, random_state=42
)

# Custom Dataset class
class CustomDataset(Dataset):
    def __init__(self, input_ids, attention_mask, token_type_ids, labels):
        self.input_ids = input_ids
        self.attention_mask = attention_mask
        self.token_type_ids = token_type_ids
        self.labels = labels
    
    def __len__(self):
        return len(self.input_ids)
    
    def __getitem__(self, idx):
        # Return a dictionary of tensors for one example
        return {
            'input_ids': self.input_ids[idx],
            'attention_mask': self.attention_mask[idx],
            'token_type_ids': self.token_type_ids[idx],
            'labels': self.labels[idx]
        }

# Create training and validation datasets
train_dataset = CustomDataset(train_inputs, train_attention_mask, train_token_type_ids, train_labels)
val_dataset = CustomDataset(val_inputs, val_attention_mask, val_token_type_ids, val_labels)

print(train_dataset[0]) 
print(f"Training dataset size: {len(train_dataset)}")
print(f"Validation dataset size: {len(val_dataset)}")


{'input_ids': tensor([  101, 18178,  2229,  2100,   102,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0]), 'attention_mask': tensor([1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0]), 'token_type_ids': tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
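
The split above is purely random. If the sentiment classes are imbalanced, a stratified split keeps the class proportions the same in both subsets. This is an optional variation, not what the notebook does; the sketch uses made-up labels and a hypothetical index array in place of the real tensors.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up, imbalanced labels for the five classes (illustrative only).
labels = np.array([2] * 50 + [1] * 20 + [3] * 20 + [0] * 5 + [4] * 5)
indices = np.arange(len(labels))

# stratify=labels preserves per-class proportions in both subsets.
train_idx, val_idx = train_test_split(
    indices, test_size=0.2, random_state=42, stratify=labels
)

print("train class counts:", np.bincount(labels[train_idx], minlength=5))
print("val class counts:  ", np.bincount(labels[val_idx], minlength=5))
```

Splitting an index array once and then indexing each tensor with it is also a compact alternative to passing four arrays through `train_test_split`.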

### **6. Train the Model**

In [19]:
# Set up Trainer
# Stop when the monitored metric (F1) fails to improve for 2 consecutive evaluations
early_stopping = EarlyStoppingCallback(early_stopping_patience=2)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    callbacks=[early_stopping]  # Reuse the callback defined above
)

Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


In [20]:
# Train the model
trainer.train()



| Epoch | Training Loss | Validation Loss | F1 | ROC AUC | Accuracy |
|---|---|---|---|---|---|
| 1 | 0.9479 | 0.877695 | 0.639017 | 0.873499 | 0.639017 |
| 2 | 0.8845 | 0.830779 | 0.645969 | 0.892717 | 0.645969 |
| 3 | 0.8043 | 0.804763 | 0.65965 | 0.900052 | 0.65965 |
| 4 | 0.8028 | 0.825921 | 0.647571 | 0.900903 | 0.647571 |
| 5 | 0.7475 | 0.822857 | 0.650583 | 0.903984 | 0.650583 |




TrainOutput(global_step=2440, training_loss=0.9067691251879832, metrics={'train_runtime': 1502.8815, 'train_samples_per_second': 4153.621, 'train_steps_per_second': 16.235, 'total_flos': 2.56638858205824e+16, 'train_loss': 0.9067691251879832, 'epoch': 5.0})
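
The run stops after epoch 5 even though 50 epochs were configured. That is consistent with early stopping at patience 2: F1 peaks at epoch 3 and fails to improve in epochs 4 and 5. The sketch below replays that logic on the F1 values from the table above. It is a simplified stand-in for `EarlyStoppingCallback` (it ignores the improvement threshold), and `stopping_epoch` is a hypothetical helper.

```python
def stopping_epoch(scores, patience=2):
    """Return (stop_epoch, best_epoch) for a higher-is-better metric."""
    best_score, best_epoch, bad_epochs = float("-inf"), 0, 0
    for epoch, score in enumerate(scores, start=1):
        if score > best_score:
            # New best: remember it and reset the patience counter.
            best_score, best_epoch, bad_epochs = score, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch, best_epoch
    return len(scores), best_epoch  # Patience never exhausted

f1_per_epoch = [0.639017, 0.645969, 0.65965, 0.647571, 0.650583]
stop, best = stopping_epoch(f1_per_epoch)
print(f"stopped after epoch {stop}, best epoch {best}")
```

Because `load_best_model_at_end=True`, the model held by the Trainer after training should be the epoch 3 checkpoint rather than the final one.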

In [24]:
# Save the best model
best_model = trainer.model 
model_save_path = "./best_model.pt"
torch.save(best_model, model_save_path)
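
`torch.save(model, ...)` pickles the whole model object, which ties the file to the class definitions available at load time. A common alternative is to save only the `state_dict`. The sketch below shows that pattern on a tiny stand-in module; `TinyClassifier` and the file name are hypothetical and are not part of the notebook. For Hugging Face models, `model.save_pretrained(...)` is another option.

```python
import os
import tempfile

import torch
from torch import nn

class TinyClassifier(nn.Module):
    """Hypothetical stand-in for the fine-tuned model."""
    def __init__(self, hidden_size=8, num_labels=5):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, x):
        return self.classifier(x)

model = TinyClassifier()
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "best_model_state.pt")
    torch.save(model.state_dict(), path)         # Save weights only

    restored = TinyClassifier()                  # Rebuild the architecture first
    restored.load_state_dict(torch.load(path))   # Then load the saved weights

x = torch.randn(3, 8)
same = torch.equal(model(x), restored(x))
print("restored model matches:", same)
```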

## **Testing and Predictions**

### **7. Predict on Test Set**

In [25]:
# Handle missing values in the test set
test["Phrase"] = test["Phrase"].fillna("")  

# Tokenize and encode test data
encoded_inputs_test = tokenizer(
    test["Phrase"].tolist(),
    padding=True,
    truncation=True,
    max_length=128,
    return_tensors="pt"
)
print(encoded_inputs_test.keys())  


dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])


In [29]:
# Move the best model (already loaded by the Trainer) to the CPU and switch to evaluation mode
best_model_cpu = trainer.model.to('cpu')
best_model_cpu.eval()
# Perform inference
with torch.no_grad():
    outputs = best_model_cpu(**encoded_inputs_test)

# Get predictions
predictions = torch.argmax(outputs.logits, dim=-1).cpu().numpy()
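
The cell above pushes all 66,292 test phrases through the model in a single forward pass, which can require a lot of memory. One alternative is to run inference in mini-batches. In the sketch below, `StandInModel` and `predict_in_batches` are hypothetical; the stand-in only mimics the `.logits` attribute of the real model's output, and the toy tensors take the place of `encoded_inputs_test`.

```python
from types import SimpleNamespace

import torch
from torch import nn

class StandInModel(nn.Module):
    """Hypothetical stand-in that returns an object with `.logits`."""
    def __init__(self, vocab_size=100, num_labels=5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 16, padding_idx=0)
        self.head = nn.Linear(16, num_labels)

    def forward(self, input_ids, attention_mask, token_type_ids=None):
        # Mean-pool embeddings over non-padding positions.
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (self.embed(input_ids) * mask).sum(1) / mask.sum(1).clamp(min=1)
        return SimpleNamespace(logits=self.head(pooled))

def predict_in_batches(model, encoded, batch_size=256):
    """Run the model over `encoded` in slices and return predicted class IDs."""
    model.eval()
    n_rows = encoded["input_ids"].shape[0]
    preds = []
    with torch.no_grad():
        for start in range(0, n_rows, batch_size):
            batch = {k: v[start:start + batch_size] for k, v in encoded.items()}
            preds.append(torch.argmax(model(**batch).logits, dim=-1))
    return torch.cat(preds).numpy()

# Toy inputs standing in for `encoded_inputs_test`.
toy = {
    "input_ids": torch.randint(1, 100, (10, 6)),
    "attention_mask": torch.ones(10, 6, dtype=torch.long),
}
predictions = predict_in_batches(StandInModel(), toy, batch_size=4)
print(predictions.shape)
```

With the notebook's objects, the call would look like `predict_in_batches(best_model_cpu, encoded_inputs_test)`, with `batch_size` chosen to fit the available memory.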

In [None]:
# Save predictions to CSV
test_results = pd.DataFrame({
    "PhraseId": test["PhraseId"],
    "Sentiment": predictions
})
output_path = "./results/test_results.csv"
test_results.to_csv(output_path, index=False)