<a href="https://colab.research.google.com/github/cipher-sayan/multimodial-sentiment-analysis/blob/main/vgg16.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# IMPORTANT: SOME KAGGLE DATA SOURCES ARE PRIVATE
# RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES.
#import kagglehub
#kagglehub.login()

In [None]:
!pip install opendatasets

Collecting opendatasets
  Downloading opendatasets-0.1.22-py3-none-any.whl.metadata (9.2 kB)
Downloading opendatasets-0.1.22-py3-none-any.whl (15 kB)
Installing collected packages: opendatasets
Successfully installed opendatasets-0.1.22


In [None]:
import opendatasets as od
import pandas

od.download("https://www.kaggle.com/datasets/sayan3270/mvsa-single")

Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds
Your Kaggle username: sayan3270
Your Kaggle Key: ··········
Dataset URL: https://www.kaggle.com/datasets/sayan3270/mvsa-single
Downloading mvsa-single.zip to ./mvsa-single


100%|██████████| 201M/201M [00:11<00:00, 19.0MB/s]





In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# Install required packages
!pip install opendatasets transformers torch torchvision matplotlib seaborn



In [None]:
# Import necessary libraries
import os
import numpy as np
import pandas as pd
import pickle
from PIL import Image
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, auc
import matplotlib.pyplot as plt
import seaborn as sns
import torch
import torchvision.transforms as transforms
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import ReduceLROnPlateau
from transformers import BertTokenizer, BertModel
from torch.utils.data import Dataset, DataLoader
from torchvision import models
from torch.amp import GradScaler, autocast  # torch.cuda.amp is deprecated in recent PyTorch releases


In [None]:
# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [None]:
# Data loading functions
def load_text_data(data_folder):
    """Read every .txt file and return the texts with their matching filenames."""
    texts = []
    # Keep only text files so that texts[i] always corresponds to filenames[i].
    filenames = sorted(
        (f for f in os.listdir(data_folder) if f.endswith(".txt")),
        key=lambda x: (0, int(x[:-4])) if x[:-4].isdigit() else (1, x),
    )
    for filename in filenames:
        with open(os.path.join(data_folder, filename), 'r', encoding='latin-1') as file:
            texts.append(file.read().strip())
    return texts, filenames

def load_labels(result_file):
    """Map each sample ID to its text sentiment label (the image label is ignored)."""
    labels = {}
    with open(result_file, 'r') as file:
        next(file)  # Skip header
        for line in file:
            parts = line.strip().split('\t')
            text_id = int(parts[0])
            text_label, image_label = parts[1].split(',')
            labels[text_id] = text_label.strip()
    return labels

def filter_existing_files(texts, filenames, labels, data_folder):
    """Keep samples that have a numeric ID, an image on disk, and a label."""
    existing_texts = []
    existing_images = []
    existing_labels = []
    for text, filename in zip(texts, filenames):
        stem = os.path.splitext(filename)[0]
        if not stem.isdigit():
            continue
        sample_id = int(stem)  # Take the ID from the filename, not the list position
        image_file = os.path.join(data_folder, f"{sample_id}.jpg")
        if os.path.exists(image_file) and sample_id in labels:
            existing_texts.append(text)
            existing_images.append(image_file)
            existing_labels.append(labels[sample_id])
    return existing_texts, existing_images, existing_labels


In [None]:
# Preprocess and save data
# Note: `transform` runs once per image here, so any random augmentation is fixed before training.
def preprocess_and_save(texts, image_paths, labels, tokenizer, transform, save_path):
    processed_data = []
    sentiment_to_label = {'negative': 0, 'neutral': 1, 'positive': 2}
    for text, image_path, label in zip(texts, image_paths, labels):
        encoded_text = tokenizer(text, padding='max_length', truncation=True, max_length=128, return_tensors='pt')
        image = transform(Image.open(image_path).convert('RGB'))
        processed_data.append((encoded_text, image, sentiment_to_label[label]))
    with open(save_path, 'wb') as f:
        pickle.dump(processed_data, f)

In [None]:
# Load processed data
def load_processed_data(file_path):
    with open(file_path, 'rb') as f:
        return pickle.load(f)


In [None]:
# Dataset class
class MultimodalDataset(Dataset):
    def __init__(self, processed_data):
        self.processed_data = processed_data

    def __len__(self):
        return len(self.processed_data)

    def __getitem__(self, idx):
        encoded_text, image, label = self.processed_data[idx]
        return {
            'text': encoded_text['input_ids'].squeeze(),
            'attention_mask': encoded_text['attention_mask'].squeeze(),
            'image': image,
            'label': torch.tensor(label, dtype=torch.long)
        }


In [None]:
# Paths
data_folder = "mvsa-single/MVSA_Single/data"
result_file = 'mvsa-single/MVSA_Single/labelResultAll.txt'
train_data_path = 'train_data.pkl'
val_data_path = 'val_data.pkl'

In [None]:
# Load and preprocess data
texts, filenames = load_text_data(data_folder)
labels = load_labels(result_file)
texts, image_paths, labels = filter_existing_files(texts, filenames, labels, data_folder)

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])
val_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])

train_texts, val_texts, train_images, val_images, train_labels, val_labels = train_test_split(
    texts, image_paths, labels, test_size=0.2, random_state=42, stratify=labels
)

preprocess_and_save(train_texts, train_images, train_labels, tokenizer, train_transform, train_data_path)
preprocess_and_save(val_texts, val_images, val_labels, tokenizer, val_transform, val_data_path)

train_data = load_processed_data(train_data_path)
val_data = load_processed_data(val_data_path)

train_dataset = MultimodalDataset(train_data)
val_dataset = MultimodalDataset(val_data)

train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=8, shuffle=False)



In [None]:
# Model class
class MultimodalSentimentModel(nn.Module):
    def __init__(self, bert_model, image_model, num_classes):
        super().__init__()
        self.text_model = bert_model
        self.image_model = image_model  # VGG16 with its classifier replaced by nn.Identity

        self.text_output_size = 768     # BERT-base pooled output
        self.image_output_size = 25088  # VGG16 flattened features: 512 * 7 * 7

        # fc1 consumes the concatenated features, which have shape
        # (batch_size, self.text_output_size + self.image_output_size).
        self.fc1 = nn.Linear(self.text_output_size + self.image_output_size, 512)
        self.fc2 = nn.Linear(512, num_classes)
        self.dropout = nn.Dropout(0.5)

    def forward(self, input_ids, attention_mask, image):
        # Index 1 of the BERT output is the pooled [CLS] representation.
        text_output = self.text_model(input_ids=input_ids, attention_mask=attention_mask)[1]
        # VGG16 flattens its feature map before the classifier, so this is already (batch_size, 25088).
        image_output = self.image_model(image)

        combined = torch.cat((text_output, image_output), dim=1)
        x = self.fc1(combined)
        x = self.dropout(x)
        x = self.fc2(x)
        return x

In [None]:

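from torchvision.models import vgg16, VGG16_Weights

# Initialize models
bert_model = BertModel.from_pretrained('bert-base-uncased')

# `pretrained=True` is deprecated in recent torchvision; request the ImageNet weights explicitly.
image_model = vgg16(weights=VGG16_Weights.IMAGENET1K_V1)
image_model.classifier = nn.Identity()  # Drop the classifier head; the model now returns flattened features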

model = MultimodalSentimentModel(bert_model, image_model, num_classes=3).to(device)



Downloading: "https://download.pytorch.org/models/vgg16-397923af.pth" to /root/.cache/torch/hub/checkpoints/vgg16-397923af.pth
100%|██████████| 528M/528M [00:07<00:00, 72.5MB/s]


In [None]:
# Loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=2e-5)
scheduler = ReduceLROnPlateau(optimizer, mode='min', patience=3, factor=0.1)
scaler = GradScaler()


In [None]:
# Training loop
num_epochs = 50
best_val_loss = float('inf')
patience = 10
patience_counter = 0

for epoch in range(num_epochs):
    model.train()
    train_loss = 0
    train_correct = 0
    train_total = 0

    for batch in train_loader:
        input_ids = batch['text'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        images = batch['image'].to(device)
        labels = batch['label'].to(device)

        optimizer.zero_grad()
        with torch.amp.autocast('cuda'):  # Mixed-precision forward pass
            outputs = model(input_ids, attention_mask, images)
            loss = criterion(outputs, labels)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()

        train_loss += loss.item()
        _, predicted = outputs.max(1)
        train_total += labels.size(0)
        train_correct += predicted.eq(labels).sum().item()

    train_loss /= len(train_loader)
    train_acc = train_correct / train_total

    model.eval()
    val_loss = 0
    val_correct = 0
    val_total = 0

    with torch.no_grad():
        for batch in val_loader:
            input_ids = batch['text'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            images = batch['image'].to(device)
            labels = batch['label'].to(device)

            outputs = model(input_ids, attention_mask, images)
            loss = criterion(outputs, labels)

            val_loss += loss.item()
            _, predicted = outputs.max(1)
            val_total += labels.size(0)
            val_correct += predicted.eq(labels).sum().item()

    val_loss /= len(val_loader)
    val_acc = val_correct / val_total

    print(f'Epoch {epoch+1}/{num_epochs}: Train Loss: {train_loss:.4f}, Train Acc: {train_acc:.4f}')
    print(f'Val Loss: {val_loss:.4f}, Val Acc: {val_acc:.4f}')

    scheduler.step(val_loss)

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        patience_counter = 0
        torch.save(model.state_dict(), 'best_model.pth')
    else:
        patience_counter += 1
        if patience_counter >= patience:
            print("Early stopping")
            break


Epoch 1/50: Train Loss: 1.0732, Train Acc: 0.4312
Val Loss: 1.0244, Val Acc: 0.4837
Epoch 2/50: Train Loss: 0.8421, Train Acc: 0.6197
Val Loss: 1.0288, Val Acc: 0.5195
Epoch 3/50: Train Loss: 0.4077, Train Acc: 0.8419
Val Loss: 1.4228, Val Acc: 0.4935
Epoch 4/50: Train Loss: 0.1090, Train Acc: 0.9639
Val Loss: 1.8125, Val Acc: 0.5011
Epoch 5/50: Train Loss: 0.0332, Train Acc: 0.9916
Val Loss: 2.6421, Val Acc: 0.4957
Epoch 6/50: Train Loss: 0.0167, Train Acc: 0.9959
Val Loss: 2.5447, Val Acc: 0.4892
Epoch 7/50: Train Loss: 0.0055, Train Acc: 0.9997
Val Loss: 2.6657, Val Acc: 0.5000
Epoch 8/50: Train Loss: 0.0032, Train Acc: 0.9997
Val Loss: 2.7613, Val Acc: 0.5000
Epoch 9/50: Train Loss: 0.0033, Train Acc: 0.9997
Val Loss: 2.8255, Val Acc: 0.5087
Epoch 10/50: Train Loss: 0.0021, Train Acc: 1.0000
Val Loss: 2.8268, Val Acc: 0.5022
Epoch 11/50: Train Loss: 0.0020, Train Acc: 1.0000
Val Loss: 2.8410, Val Acc: 0.5033
Early stopping


In [None]:
# Save the final-epoch weights under a separate name so the best checkpoint
# written during training (best_model.pth) is not overwritten.
torch.save(model.state_dict(), 'final_model.pth')

In [None]:
# Save the current PyTorch state_dict under a second filename.
# Note: despite the .keras extension, this is not a Keras model file.
torch.save(model.state_dict(), 'best_model.keras')

In [None]:
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, classification_report, confusion_matrix
import numpy as np

# Function to calculate sensitivity and specificity
def calculate_sensitivity_specificity(conf_matrix):
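    # Caution: these are binary formulas that read only the top-left 2x2 block
    # of the matrix (classes 0 and 1). With three classes every class-2 entry is
    # ignored, so the values are not true per-class sensitivity/specificity.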
    TP = conf_matrix[1, 1]
    TN = conf_matrix[0, 0]
    FP = conf_matrix[0, 1]
    FN = conf_matrix[1, 0]
    sensitivity = TP / (TP + FN) if (TP + FN) > 0 else 0
    specificity = TN / (TN + FP) if (TN + FP) > 0 else 0
    return sensitivity, specificity

model.eval()
all_labels = []
all_predictions = []
all_probabilities = []

with torch.no_grad():
    for batch in val_loader:
        input_ids = batch['text'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        images = batch['image'].to(device)
        labels = batch['label'].to(device)

        outputs = model(input_ids, attention_mask, images)
        probabilities = torch.softmax(outputs, dim=1)  # Convert logits to probabilities
        _, predicted = outputs.max(1)

        all_labels.extend(labels.cpu().numpy())
        all_predictions.extend(predicted.cpu().numpy())
        all_probabilities.extend(probabilities.cpu().numpy())

# Convert to numpy arrays
all_labels = np.array(all_labels)
all_predictions = np.array(all_predictions)
all_probabilities = np.array(all_probabilities)

# Calculate metrics
accuracy = accuracy_score(all_labels, all_predictions)
f1 = f1_score(all_labels, all_predictions, average='weighted')
conf_matrix = confusion_matrix(all_labels, all_predictions)
sensitivity, specificity = calculate_sensitivity_specificity(conf_matrix)
roc_auc = roc_auc_score(all_labels, all_probabilities, multi_class='ovr')  # Assuming multi-class classification

# Print metrics
print(f"Accuracy: {accuracy:.4f}")
print(f"F1 Score: {f1:.4f}")
print(f"Sensitivity: {sensitivity:.4f}")
print(f"Specificity: {specificity:.4f}")
print(f"AUC-ROC: {roc_auc:.4f}")
print("\nClassification Report:")
print(classification_report(all_labels, all_predictions))


Accuracy: 0.5033
F1 Score: 0.4964
Sensitivity: 0.6936
Specificity: 0.4765
AUC-ROC: 0.6649

Classification Report:
              precision    recall  f1-score   support

           0       0.42      0.35      0.38       232
           1       0.50      0.46      0.48       358
           2       0.54      0.66      0.60       332

    accuracy                           0.50       922
   macro avg       0.49      0.49      0.49       922
weighted avg       0.50      0.50      0.50       922



In [None]:
from PIL import Image
from transformers import BertTokenizer
import torch

# Example inputs
example_text = "How I feel today #legday #jelly #aching #gym "
example_image_path = "/content/mvsa-single/MVSA_Single/data/1.jpg"  # Replace with the actual image path

# Load the tokenizer and transformation (use the same as in preprocessing)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])

# Preprocess the inputs
encoded_text = tokenizer(example_text, padding='max_length', truncation=True, max_length=128, return_tensors='pt')
image = transform(Image.open(example_image_path).convert('RGB'))

# Move inputs to the appropriate device
input_ids = encoded_text['input_ids'].to(device)
attention_mask = encoded_text['attention_mask'].to(device)
image = image.unsqueeze(0).to(device)  # Add batch dimension

# Pass through the model
model.eval()
with torch.no_grad():
    outputs = model(input_ids, attention_mask, image)
    probabilities = torch.softmax(outputs, dim=1)
    predicted_class = torch.argmax(probabilities, dim=1).item()

# Map the prediction to sentiment
label_to_sentiment = {0: "Negative", 1: "Neutral", 2: "Positive"}  # Adjust based on your label mapping
predicted_sentiment = label_to_sentiment[predicted_class]

# Print the results
print(f"Predicted Sentiment: {predicted_sentiment}")
print(f"Probabilities: {probabilities.cpu().numpy()}")


Predicted Sentiment: Neutral
Probabilities: [[0.07889306 0.86198163 0.05912533]]


In [None]:
import pickle

# Save the model
with open('multimodal_sentiment_model.pkl', 'wb') as f:
    pickle.dump(model, f)

print("Model saved to multimodal_sentiment_model.pkl")


Model saved to multimodal_sentiment_model.pkl


***METHODOLOGY***
---

1. Dataset Preparation:

* Text Data Loading: Load textual data from the dataset, process it with a tokenizer (the BERT tokenizer in this case), and store the results.

* Image Data Loading: Load images from the dataset and preprocess them with transformations (resize, crop, normalization).

* Label Loading: Read labels from the dataset and map them to sentiment classes (negative, neutral, positive).

---

2. Data Preprocessing:

* Filter out incomplete samples, keeping only those that have text, an image, and a label.

* Tokenize text using BERT tokenizer and apply transformations to images.

* Encode the sentiment labels into numerical values.
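
The filtering and label-encoding steps above can be sketched without any file I/O. This is a minimal illustration, assuming samples are keyed by integer ID; `keep_complete_samples` and the toy dictionaries are hypothetical names, while the label mapping is the one the notebook uses.

```python
# Label mapping used in the notebook's preprocessing cell.
SENTIMENT_TO_LABEL = {"negative": 0, "neutral": 1, "positive": 2}

def keep_complete_samples(texts, images, labels):
    """Return (text, image, class_index) triples for IDs present in all three dicts.

    Hypothetical helper: the notebook checks files on disk instead of dicts.
    """
    complete_ids = sorted(set(texts) & set(images) & set(labels))
    return [
        (texts[i], images[i], SENTIMENT_TO_LABEL[labels[i]])
        for i in complete_ids
    ]

# Toy data: ID 2 has no image and ID 3 has no label, so only ID 1 survives.
texts = {1: "great day", 2: "meh", 3: "awful"}
images = {1: "1.jpg", 3: "3.jpg"}
labels = {1: "positive", 2: "neutral"}

samples = keep_complete_samples(texts, images, labels)
print(samples)
```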

---

3. Data Splitting:

* Split the dataset into training and validation sets using a stratified approach to maintain class balance.
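
The notebook relies on scikit-learn's `train_test_split(..., stratify=labels)` for this step. The idea behind stratification can be illustrated with the standard library alone: split each class separately so both subsets keep the same class proportions. The sketch below is not scikit-learn's algorithm, and `stratified_split` is a hypothetical helper.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_indices, val_indices), splitting each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, val_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_val = round(len(indices) * test_size)  # validation share of this class
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)

# Toy label list with a 1:2:2 class ratio.
labels = ["negative"] * 10 + ["neutral"] * 20 + ["positive"] * 20
train_idx, val_idx = stratified_split(labels)
val_counts = Counter(labels[i] for i in val_idx)
print(val_counts)
```

Each class contributes 20% of its samples to the validation set, so the 1:2:2 ratio of the toy labels is preserved in both subsets.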

---

4. Data Serialization:

* Save the preprocessed data (tokenized text, transformed images, and labels) into .pkl files for faster loading during training.
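
A minimal sketch of the serialization round trip, assuming plain Python lists stand in for the tensors the notebook stores and an in-memory buffer stands in for `train_data.pkl`. The token ids and pixel values are made up.

```python
import io
import pickle

# Stand-in for one preprocessed sample: (token ids, image pixels, class index).
# Plain lists with made-up values replace the tensors the notebook stores.
processed = [([101, 2307, 102, 0], [[0.1, 0.2], [0.3, 0.4]], 2)]

buffer = io.BytesIO()           # in-memory file instead of train_data.pkl
pickle.dump(processed, buffer)
buffer.seek(0)                  # rewind before reading back
restored = pickle.load(buffer)
print(restored == processed)
```

Anything applied before pickling, including the random crop and flip in `train_transform`, is computed once and then reused in every epoch.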

---

5. Model Design:

* Textual Model: Use the BERT model to encode the text input into a feature vector.

* Visual Model: Use a VGG16 model pre-trained on ImageNet, with its classifier head removed, to extract image features.

* Fusion Layer: Concatenate text and image features and pass them through fully connected layers to predict sentiment.
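
The feature sizes involved in the fusion step can be checked with simple arithmetic. The sketch below assumes the sizes used in the model cell (768 text features, a 512 x 7 x 7 VGG16 feature map, a 512-unit hidden layer, 3 classes); `linear_params` is a hypothetical helper.

```python
TEXT_FEATURES = 768                # BERT-base pooled output size
VGG_CHANNELS, VGG_POOL = 512, 7    # VGG16 feature map after adaptive average pooling

def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features

image_features = VGG_CHANNELS * VGG_POOL * VGG_POOL   # flattened image vector
fused_features = TEXT_FEATURES + image_features       # after concatenation
fc1_params = linear_params(fused_features, 512)
fc2_params = linear_params(512, 3)
print(image_features, fused_features, fc1_params, fc2_params)
```

The image branch contributes about 33 times as many inputs to the first fully connected layer as the text branch, and that layer alone holds roughly 13 million parameters.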

---

6. Training:

* Use CrossEntropyLoss as the loss function.

* Use the Adam optimizer (learning rate 2e-5) with ReduceLROnPlateau learning rate scheduling.

* Utilize mixed precision training for performance improvement.

* Implement early stopping based on validation loss (patience of 10 epochs) to limit overfitting.
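
The early-stopping rule can be replayed on the validation losses from the training log to see why the run ended after epoch 11. This is a minimal sketch of the counter logic only; `early_stopping_epoch` is a hypothetical helper, not part of the notebook.

```python
def early_stopping_epoch(val_losses, patience=10):
    """Return (stop_epoch, best_epoch), both 1-based.

    Mirrors the notebook's rule: reset the counter on a strictly lower
    validation loss, otherwise increment it and stop at `patience`.
    """
    best_loss = float("inf")
    best_epoch = 0
    counter = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, counter = loss, epoch, 0
        else:
            counter += 1
            if counter >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

# Validation losses copied from the training log.
val_losses = [1.0244, 1.0288, 1.4228, 1.8125, 2.6421, 2.5447,
              2.6657, 2.7613, 2.8255, 2.8268, 2.8410]
stop_epoch, best_epoch = early_stopping_epoch(val_losses)
print(stop_epoch, best_epoch)  # 11 1
```

The lowest validation loss occurs in epoch 1, so the checkpoint saved inside the training loop holds the epoch-1 weights, while the in-memory model at the end of training holds the epoch-11 weights.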

---

7. Evaluation:

* Compute evaluation metrics: Accuracy, F1-score, Sensitivity, Specificity, and AUC-ROC.

* Generate a confusion matrix and classification report for detailed performance analysis.
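
With three classes, sensitivity and specificity are usually computed one-vs-rest per class and then averaged. The sketch below shows that calculation on a made-up confusion matrix, assuming rows are true classes and columns are predicted classes; `one_vs_rest_rates` is a hypothetical helper, and the numbers are not the notebook's results.

```python
def one_vs_rest_rates(conf_matrix):
    """Per-class (sensitivity, specificity) from a square confusion matrix.

    Rows are true classes and columns are predicted classes.
    """
    n = len(conf_matrix)
    total = sum(sum(row) for row in conf_matrix)
    rates = []
    for k in range(n):
        tp = conf_matrix[k][k]
        fn = sum(conf_matrix[k]) - tp                        # class k predicted as another class
        fp = sum(conf_matrix[r][k] for r in range(n)) - tp   # other classes predicted as k
        tn = total - tp - fn - fp
        sensitivity = tp / (tp + fn) if tp + fn else 0.0
        specificity = tn / (tn + fp) if tn + fp else 0.0
        rates.append((sensitivity, specificity))
    return rates

# Made-up 3-class matrix (negative, neutral, positive).
toy_matrix = [
    [6, 2, 2],
    [1, 7, 2],
    [1, 1, 8],
]
rates = one_vs_rest_rates(toy_matrix)
macro_sensitivity = sum(sens for sens, _ in rates) / len(rates)
macro_specificity = sum(spec for _, spec in rates) / len(rates)
print(rates, macro_sensitivity, macro_specificity)
```

Macro-averaged sensitivity computed this way equals the macro-average recall in the classification report.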

---

8. Inference:

* Preprocess input text and image.

* Pass the inputs through the trained model to predict sentiment and output probabilities.
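
The last step of inference, turning logits into probabilities and a sentiment label, can be sketched in plain Python. The logits below are made up, and `softmax` and `predict` are hypothetical helpers standing in for `torch.softmax` and `torch.argmax`.

```python
import math

LABEL_TO_SENTIMENT = {0: "Negative", 1: "Neutral", 2: "Positive"}

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(logits):
    """Return the most probable sentiment and the full probability list."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABEL_TO_SENTIMENT[best], probs

# Made-up logits for one sample; a real run would take them from the model.
sentiment, probs = predict([-0.5, 1.9, -0.8])
print(sentiment, [round(p, 3) for p in probs])
```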

---

9. Model Serialization:

* Save the trained model into a .pkl file for future use.

---



***FLOWCHART***
---

1. Text Representation

* Text Input

Raw text → BERT Tokenizer → Tokenized Text → Embedding Representation
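
The tokenizer call in the notebook uses `padding='max_length'`, `truncation=True`, and `max_length=128`. The sketch below illustrates what those options do to a list of token ids. It assumes the `bert-base-uncased` special-token ids, and both `pad_or_truncate` and the example ids are hypothetical.

```python
PAD_ID, CLS_ID, SEP_ID = 0, 101, 102  # special-token ids in bert-base-uncased

def pad_or_truncate(token_ids, max_length=128):
    """Return (input_ids, attention_mask), each exactly `max_length` long."""
    body = token_ids[: max_length - 2]       # leave room for [CLS] and [SEP]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)    # 1 marks real tokens
    padding = max_length - len(input_ids)
    return input_ids + [PAD_ID] * padding, attention_mask + [0] * padding

# Hypothetical word-piece ids for a short sentence.
ids, mask = pad_or_truncate([2129, 1045, 2514, 2651], max_length=8)
print(ids)
print(mask)
```

The attention mask tells BERT which positions hold real tokens, so the padding does not influence the text features.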

---


2. Image Representation

Raw Image → Preprocessing (Resizing, Cropping, Normalization) → VGG16 → Feature Vector
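
The cropping and normalization steps reduce to simple arithmetic. The sketch below assumes the ImageNet mean and standard deviation used in the notebook's transforms and an image already resized to 256 x 256; `normalize_pixel` and `center_crop_box` are hypothetical helpers, not torchvision functions.

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def normalize_pixel(rgb):
    """Apply (value - mean) / std per channel to an RGB pixel scaled to [0, 1]."""
    return tuple((v - m) / s for v, m, s in zip(rgb, IMAGENET_MEAN, IMAGENET_STD))

def center_crop_box(height, width, size=224):
    """Top-left corner and size of a centred square crop (assumes even margins)."""
    top = (height - size) // 2
    left = (width - size) // 2
    return top, left, size, size

white = normalize_pixel((1.0, 1.0, 1.0))
box = center_crop_box(256, 256)
print(white)
print(box)
```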

---

3. Sentiment Prediction

* Feature Fusion

Text Features + Image Features → Concatenation → Fully Connected Layers → Sentiment Prediction

---

4. Output

* Sentiment Class: Negative, Neutral, Positive

* Metrics: Accuracy, F1-Score, Sensitivity, Specificity, AUC-ROC

---



***Example Workflow***
---

1. Input Preparation

* Example Text: "The product is amazing!"

* Example Image: Path to the product image.

---

2. Preprocessing

* Tokenize text and transform the image.

---

3. Model Inference

* Pass tokenized text and image through the model.

---

4. Prediction

* Output: Sentiment class (e.g., Positive) and probabilities for each class.

---

5. Evaluation

* Calculate metrics on the validation data and inspect the confusion matrix, classification report, and AUC-ROC.

---

| Metric | Option 1 (50 epochs) | Option 2 (10 epochs) |
| --- | --- | --- |
| Accuracy | 0.5304 | 0.5390 |
| F1 Score | 0.5291 | 0.5385 |
| Sensitivity | 0.6211 | 0.7283 |
| Specificity | 0.6425 | 0.6114 |
| AUC-ROC | 0.7102 | 0.7057 |

Key Considerations:

* Accuracy: Option 2 is slightly higher (0.5390 vs. 0.5304). Accuracy alone can be misleading when classes are imbalanced, so it is read alongside the other metrics.

* F1 Score: Option 2 is higher (0.5385 vs. 0.5291). Because the F1 score combines precision and recall, it gives a more balanced view under class imbalance.

* Sensitivity: Option 2 is higher (0.7283 vs. 0.6211). The evaluation code computes this value from the confusion-matrix entries for classes 0 and 1 only, and class 1 is "neutral" in the label mapping (0 = negative, 1 = neutral, 2 = positive), so it is not recall for the positive class.

* Specificity: Option 1 is higher (0.6425 vs. 0.6114). It is computed from the same two classes, as the share of class-0 (negative) samples predicted as class 0 rather than class 1.

* AUC-ROC: Option 1 is slightly higher (0.7102 vs. 0.7057). A higher AUC-ROC indicates better separation between classes, but the difference here is marginal.

Conclusion:

Option 1 has the better AUC-ROC and specificity, while Option 2 has the better sensitivity, F1 score, and accuracy. The classes are somewhat imbalanced (232, 358, and 332 validation samples), so F1 and sensitivity are weighted more heavily here, and Option 2 is preferred. The gaps in accuracy, F1, and AUC-ROC are all under 0.01.

Why Option 2 is better:

* It has higher sensitivity, so it misses fewer samples of the class being measured.

* Its F1 score is higher, indicating a better balance of precision and recall.

* It reaches this performance in fewer epochs (10 vs. 50), so it is cheaper to train.

In [None]:
from google.colab import files
files.download('/content/multimodal_sentiment_model.pkl')


In [None]:
from google.colab import files
files.download('/content/best_model.keras')  # Matches the filename written by the earlier save cell

In [None]:
from google.colab import files
files.download('/content/best_model.pth')