## Sentiment Analysis

This notebook performs sentiment analysis on a dataset of Chinese e-commerce reviews.

We tokenize the reviews with `jieba`, a Chinese word-segmentation library, embed the tokens with an `Embedding` layer, and then use a simple fully connected neural network to predict the sentiment of each review.
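
The `data_loader` helpers are not shown in this notebook, so the following is only a minimal sketch of what the word-index and padding step might look like. The names `build_word_index` and `to_indices`, the reserved padding index 0, and the choice to drop unknown tokens are all assumptions made for illustration. The token lists are segmented by hand to stand in for `jieba` output, which keeps the example dependency-free.

```python
from collections import Counter

PAD = 0  # index reserved for padding (an assumption of this sketch)


def build_word_index(tokenized_reviews):
    """Map each distinct token to an integer id, most frequent first.

    Ids start at 1 so that 0 stays free for padding.
    """
    counts = Counter(tok for review in tokenized_reviews for tok in review)
    return {tok: i for i, (tok, _) in enumerate(counts.most_common(), start=1)}


def to_indices(tokens, word_index, max_len):
    """Convert tokens to ids, drop unknown tokens, then truncate or pad to max_len."""
    ids = [word_index[t] for t in tokens if t in word_index][:max_len]
    return ids + [PAD] * (max_len - len(ids))


# Hand-segmented stand-ins for tokenizer output (hypothetical reviews).
reviews = [["质量", "很", "好"], ["质量", "太", "差", "了"]]

word_index = build_word_index(reviews)
encoded = [to_indices(r, word_index, max_len=5) for r in reviews]
print(encoded)
```

Every encoded review has the same length, which is what lets the reviews be stacked into a single integer array and fed to an `Embedding` layer.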

In [1]:
import data_loader
import lightning as L
import matplotlib.pyplot as plt
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils import data

In [2]:
x_train, y_train, x_test, y_test = data_loader.load_data()
vocalen, word_index = data_loader.createWordIndex(x_train, x_test)

Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache


Sample size:16595
Train set size:13276
Test set size:3319


Loading model cost 0.541 seconds.
Prefix dict has been built successfully.


voca: 9512


In [3]:
x_train_ix = data_loader.word2Index(x_train, word_index)
x_test_ix = data_loader.word2Index(x_test, word_index)

In [4]:
train = data.TensorDataset(torch.from_numpy(x_train_ix), torch.from_numpy(y_train[:, np.newaxis]))
test = data.TensorDataset(torch.from_numpy(x_test_ix), torch.from_numpy(y_test[:, np.newaxis]))
train_loader = data.DataLoader(train, batch_size=512, shuffle=True)
test_loader = data.DataLoader(test, batch_size=512, shuffle=False)  # evaluation does not need shuffling

In [5]:
class Model(nn.Module):
    """Embedding layer followed by a stack of fully connected layers."""

    def __init__(self, vocab_size, embed_dim=256, hidden_dim=256, num_layers=3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.sequential = nn.Sequential(
            nn.Flatten(),
        )
        
        for i in range(num_layers):
            self.sequential.add_module(f"linear{i}", nn.LazyLinear(hidden_dim))
            self.sequential.add_module(f"relu{i}", nn.ReLU())
        
        self.sequential.add_module("linear_final", nn.LazyLinear(1))
        self.sequential.add_module("sigmoid", nn.Sigmoid())

    def forward(self, x):
        x = self.embedding(x)
        x = self.sequential(x)
        return x
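
Because `nn.Flatten` collapses the `(seq_len, embed_dim)` output of the embedding into one long vector, the first `nn.LazyLinear` only learns its input width on the first forward pass. The sketch below traces the shapes on a dummy batch. All sizes here are hypothetical, since the notebook's real sequence length is set inside `data_loader` and is not shown.

```python
import torch
import torch.nn as nn

# Hypothetical sizes chosen for illustration only.
vocab_size, seq_len, embed_dim, hidden_dim = 100, 20, 8, 16

embedding = nn.Embedding(vocab_size, embed_dim)
head = nn.Sequential(
    nn.Flatten(),
    nn.LazyLinear(hidden_dim),  # in_features is unknown until the first forward pass
    nn.ReLU(),
    nn.LazyLinear(1),
    nn.Sigmoid(),
)

batch = torch.randint(0, vocab_size, (4, seq_len))  # (batch, seq_len) integer ids
embedded = embedding(batch)                         # (batch, seq_len, embed_dim)
out = head(embedded)                                # first call materializes the lazy weights

first_linear = head[1]
print(embedded.shape)            # torch.Size([4, 20, 8])
print(first_linear.in_features)  # 160, i.e. seq_len * embed_dim
print(out.shape)                 # torch.Size([4, 1])
```

Two consequences follow from this design: every review has to be padded or truncated to the same length, and parameter counts reported before the first forward pass can be inaccurate because the lazy layers are still uninitialized.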

In [6]:
class Module(L.LightningModule):
    def __init__(self, vocalen, embed_dim=256, hidden_dim=256, num_layers=3):
        super().__init__()
        self.model = Model(vocalen, embed_dim, hidden_dim, num_layers)
    
    def configure_optimizers(self):
        optimizer = optim.Adam(self.parameters(), lr=1e-3)
        return optimizer
    
    def _calculate_loss(self, batch, mode="train"):
        x, y = batch
        preds = self.model(x)
        loss = F.binary_cross_entropy(preds, y)
        acc = (preds.round() == y).float().mean()
        
        self.log(f"{mode}_loss", loss)
        self.log(f"{mode}_acc", acc)
        return loss
    
    def training_step(self, batch, batch_idx):
        return self._calculate_loss(batch)

    def test_step(self, batch, batch_idx):
        return self._calculate_loss(batch, mode="test")
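
To make the two logged metrics concrete, here is a hand-computed version of the same quantities on a tiny made-up batch, written in plain Python. The helper name `bce_and_accuracy` and the numbers are illustrative assumptions; the formulas follow the mean binary cross-entropy and the round-then-compare accuracy used in `_calculate_loss`.

```python
import math


def bce_and_accuracy(probs, labels):
    """Mean binary cross-entropy and accuracy for predicted probabilities."""
    losses = [
        -(y * math.log(p) + (1 - y) * math.log(1 - p))
        for p, y in zip(probs, labels)
    ]
    # A prediction counts as correct when rounding the probability gives the label.
    correct = [float(round(p) == y) for p, y in zip(probs, labels)]
    return sum(losses) / len(losses), sum(correct) / len(correct)


probs = [0.9, 0.2, 0.6, 0.4]   # made-up sigmoid outputs
labels = [1.0, 0.0, 0.0, 1.0]  # made-up ground-truth sentiment labels

loss, acc = bce_and_accuracy(probs, labels)
print(f"loss={loss:.4f} acc={acc:.2f}")  # loss=0.5403 acc=0.50
```

The first two predictions are confident and correct, so they contribute little loss; the last two fall on the wrong side of 0.5 and account for most of it.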


In [7]:
model = Module(vocalen)
trainer = L.Trainer(max_epochs=200)
trainer.fit(model=model, train_dataloaders=train_loader)

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
/home/wenqi/.local/lib/python3.12/site-packages/lightning/pytorch/utilities/model_summary/model_summary.py:461: The total number of parameters detected may be inaccurate because the model contains an instance of `UninitializedParameter`. To get an accurate number, set `self.example_input_array` in your LightningModule.

  | Name  | Type  | Params | Mode 
----------------------------------------
0 | model | Model | 2.4 M  | train
----------------------------------------
2.4 M     Trainable params
0         Non-trainable params
2.4 M     Total params
9.740     Total estimated model params size (MB)
/home/wenqi/.local/lib/python3.12/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:424: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` ar

`Trainer.fit` stopped: `max_epochs=200` reached.
