<a href="https://colab.research.google.com/github/andrewdk1123/KoSentiment/blob/main/simple_rnn_for_sentiment_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

이 Notebook에서는 PyTorch를 이용하여 주어진 문장의 감정 (label 0: neg, 1: pos)을 분류하는 간단한 RNN 모델을 만들어 보겠습니다.

RNN 모델은 입력 시퀀스에 대해 순차적으로 Hidden State를 계산하는 모델입니다. 즉, 현재 Timestep의 Hidden State를 계산할 때, 이전 Timestep에서 계산된 Hidden State를 함께 계산함으로써, 입력 시퀀스의 순서를 고려하여 학습할 수 있습니다.

모델은 다음과 같이 구성됩니다.

 * Embedding Layer: 입력 문장을 단어 임베딩으로 변환합니다.
 * RNN Layer: 입력 시퀀스에 대해 순차적으로 Hidden State를 계산합니다.
 * Linear Layer: 마지막 Hidden State를 입력으로 받아 감정을 분류합니다.

# Data Preparation

In [223]:
# Load packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import io

## Upload Train and Test Set

In [224]:
from google.colab import files

uploaded = files.upload()

Saving processed_training.csv to processed_training (4).csv


In [225]:
print(uploaded.keys())

dict_keys(['processed_training (4).csv'])


In [226]:
train_data = pd.read_csv(io.BytesIO(uploaded['processed_training (4).csv']), sep = ',')

cols_to_keep = ['label', 'sentence', 'tokenized_sentence', 'cleaned_tokens']
train_data = train_data.loc[:, cols_to_keep]
train_data.head()

Unnamed: 0,label,sentence,tokenized_sentence,cleaned_tokens
0,0,일은 왜 해도 해도 끝이 없을까? 화가 난다.,"['일은', '왜', '해도', '해도', '끝이', '없을까', '?', '화가'...","['일은', '왜', '해도', '해도', '끝이', '없을까', '화가', '난다']"
1,0,이번 달에 또 급여가 깎였어! 물가는 오르는데 월급만 자꾸 깎이니까 너무 화가 나.,"['이번', '달에', '또', '급여', '##가', '깎', '##였어', '!...","['이번', '달에', '또', '급여', '깎', '물가', '오르는', '월급'..."
2,0,회사에 신입이 들어왔는데 말투가 거슬려. 그런 애를 매일 봐야 한다고 생각하니까 스...,"['회사에', '신입', '##이', '들어왔', '##는데', '말투', '##가...","['회사에', '신입', '들어왔', '말투', '거슬', '그런', '애를', '..."
3,0,직장에서 막내라는 이유로 나에게만 온갖 심부름을 시켜. 일도 많은 데 정말 분하고 ...,"['직장', '##에서', '막내', '##라는', '이유로', '나에게', '##...","['직장', '막내', '이유로', '나에게', '온갖', '심', '시켜', '일..."
4,0,얼마 전 입사한 신입사원이 나를 무시하는 것 같아서 너무 화가 나.,"['얼마', '전', '입사', '##한', '신입', '##사원', '##이', ...","['얼마', '전', '입사', '신입', '나를', '무시', '것', '같아서'..."


In [227]:
uploaded = files.upload()

Saving processed_test.csv to processed_test (4).csv


In [228]:
print(uploaded.keys())

dict_keys(['processed_test (4).csv'])


In [229]:
test_data = pd.read_csv(io.BytesIO(uploaded['processed_test (4).csv']), sep = ',')

test_data = test_data.loc[:, cols_to_keep]
test_data.head()

Unnamed: 0,label,sentence,tokenized_sentence,cleaned_tokens
0,0,이번 프로젝트에서 발표를 하는데 내가 실수하는 바람에 우리 팀이 감점을 받았어. 너...,"['이번', '프로젝트', '##에서', '발표를', '하는데', '내가', '실수...","['이번', '프로젝트', '발표를', '하는데', '내가', '실수', '바람에'..."
1,0,회사에서 중요한 프로젝트를 혼자 하게 됐는데 솔직히 두렵고 무서워.,"['회사에서', '중요한', '프로젝트를', '혼자', '하게', '됐는데', '솔...","['회사에서', '중요한', '프로젝트를', '혼자', '하게', '됐는데', '솔..."
2,0,상사가 너무 무섭게 생겨서 친해지는 게 너무 두려워.,"['상', '##사가', '너무', '무섭게', '생겨서', '친', '##해지는'...","['상', '너무', '무섭게', '생겨서', '친', '게', '너무', '두려워']"
3,0,이번에 힘들게 들어간 첫 직장이거든. 첫 직장이라서 그런지 너무 긴장된다.,"['이번에', '힘들게', '들어간', '첫', '직장', '##이거', '##든'...","['이번에', '힘들게', '들어간', '첫', '직장', '첫', '직장', '그..."
4,0,직장에서 동료들이랑 관계가 안 좋아질까 봐 걱정돼.,"['직장', '##에서', '동료', '##들이랑', '관계가', '안', '좋아질...","['직장', '동료', '관계가', '안', '좋아질', '봐', '걱정']"


## Data Preparation

In [230]:
import ast
import torch
from torch.nn.utils.rnn import pad_sequence
from collections import Counter

In [231]:
# Convert the string representation of the list to an actual list
train_data['cleaned_tokens'] = train_data['cleaned_tokens'].apply(ast.literal_eval)
test_data['cleaned_tokens'] = test_data['cleaned_tokens'].apply(ast.literal_eval)

In [248]:
# Flatten the list of tokens
all_tokens = train_data['cleaned_tokens'].explode()

# Build a vocabulary using Counter
vocab = Counter(all_tokens)

# Create word-to-index and index-to-word dictionaries
word_to_index = {word: index + 1 for index, (word, _) in enumerate(vocab.most_common())}
index_to_word = {index: word for word, index in word_to_index.items()}

# Add a special token for padding with index 0
word_to_index['<PAD>'] = 0
index_to_word[0] = '<PAD>'

# Add a special token for out-of-vocabulary words with index len(vocab) + 1
word_to_index['<UNK>'] = len(vocab) + 1
index_to_word[len(vocab) + 1] = '<UNK>'

# Encode tokens using the word-to-index dictionary
#train_data['encoded_tokens'] = train_data['cleaned_tokens'].apply(lambda tokens: [word_to_index[token] for token in tokens])
# Encode tokens using the word-to-index dictionary, replace out-of-vocabulary words with '<UNK>'
train_data['encoded_tokens'] = train_data['cleaned_tokens'].apply(lambda tokens: [word_to_index.get(token, word_to_index['<UNK>']) for token in tokens])


# Pad sequences to a specific length (adjust maxlen as needed)
maxlen = 20
padded_tokens = pad_sequence([torch.tensor(tokens) for tokens in train_data['encoded_tokens']], batch_first=True, padding_value=word_to_index['<PAD>']).tolist()
train_data['padded_tokens'] = [list(tokens) for tokens in padded_tokens]

In [249]:
train_data.iloc[:,-2:].head()

Unnamed: 0,encoded_tokens,padded_tokens
0,"[684, 34, 322, 322, 2465, 2290, 54, 1641]","[684, 34, 322, 322, 2465, 2290, 54, 1641, 0, 0..."
1,"[121, 1328, 64, 2434, 4978, 4979, 4980, 525, 6...","[121, 1328, 64, 2434, 4978, 4979, 4980, 525, 6..."
2,"[255, 739, 1434, 3578, 4586, 109, 1657, 209, 2...","[255, 739, 1434, 3578, 4586, 109, 1657, 209, 2..."
3,"[42, 1834, 823, 137, 3460, 618, 1482, 564, 252...","[42, 1834, 823, 137, 3460, 618, 1482, 564, 252..."
4,"[148, 196, 645, 739, 18, 294, 6, 71, 1, 54, 4]","[148, 196, 645, 739, 18, 294, 6, 71, 1, 54, 4,..."


In [250]:
print("Length of the Vocabulary is: ", len(vocab))

Length of the Vocabulary is:  12989


In [251]:
train_data['padded_tokens'].apply(len).unique()

array([39])

In [252]:
# Example: Reconstruct tokens for the first row
example_row_index = 0
reconstructed_tokens = [index_to_word[index] for index in train_data.loc[example_row_index, 'padded_tokens'] if index != 0]

# Display the reconstructed tokens
print(reconstructed_tokens)

['일은', '왜', '해도', '해도', '끝이', '없을까', '화가', '난다']


# Build RNN Sentiment Classifier

In [254]:
import random
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [255]:
class BucketIterator:
    def __init__(self, data, batch_size, shuffle=True, seed=1123):
        self.data = data
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.seed = seed
        self.buckets = self.create_buckets()

    def create_buckets(self):
        buckets = {}
        for X, y in self.data:
            length = len(X)
            if length not in buckets:
                buckets[length] = []
            buckets[length].append((X, y))
        return buckets

    def shuffle_buckets(self):
        random.seed(self.seed)
        for key in self.buckets:
            random.shuffle(self.buckets[key])

    def __iter__(self):
        if self.shuffle:
            self.shuffle_buckets()

        for key in self.buckets:
            bucket = self.buckets[key]
            if not bucket:  # Check if the bucket is empty
                continue
            for i in range(0, len(bucket), self.batch_size):
                batch = bucket[i:i + self.batch_size]
                X_batch, y_batch = zip(*batch)

                X_batch = [torch.LongTensor(seq) for seq in X_batch]
                y_batch = torch.Tensor(y_batch).view(-1, 1)  # Ensure y_batch has shape (batch_size, 1)

                yield torch.stack(X_batch, dim=0), y_batch

In [256]:
train_X, val_X, train_y, val_y = train_test_split(train_data['padded_tokens'], train_data['label'], test_size=0.2, random_state=1123)

# Convert data and labels to numpy arrays and then PyTorch tensors
train_X = [torch.LongTensor(seq) for seq in train_X.values.tolist()]
train_y = torch.tensor(train_y.values.tolist())
val_X = [torch.LongTensor(seq) for seq in val_X.values.tolist()]
val_y = torch.tensor(val_y.values.tolist())

train_dataset = TensorDataset(torch.stack(train_X), train_y)
val_dataset = TensorDataset(torch.stack(val_X), val_y)

In [287]:
class SentimentRNN(nn.Module):
    def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim):
        super(SentimentRNN, self).__init__()

        self.embedding = nn.Embedding(input_dim, embedding_dim)
        self.rnn = nn.RNN(embedding_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, output_dim)
        self.sigmoid = nn.Sigmoid()

    def forward(self, text):
        embedded = self.embedding(text)
        output, hidden = self.rnn(embedded)

        # output shape: (batch_size, sequence_length, hidden_dim)
        # hidden shape: (num_layers * num_directions, batch_size, hidden_dim)
        # We are using the hidden state from the last time step as the representation of the sequence.

        assert torch.equal(output[:, -1, :], hidden.squeeze(0))

        # Take the last time step's hidden state
        last_hidden = hidden[-1, :, :]  # Assuming num_layers=1
        output = self.fc(last_hidden)
        output = self.sigmoid(output).squeeze(1)

        return output

In [288]:
INPUT_DIM = len(vocab)
EMBEDDING_DIM = 100
HIDDEN_DIM = 256
OUTPUT_DIM = 1

model = SentimentRNN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM)

In [286]:
print(model)

SentimentRNN(
  (embedding): Embedding(12989, 100)
  (rnn): RNN(100, 256, batch_first=True)
  (fc): Linear(in_features=256, out_features=1, bias=True)
  (sigmoid): Sigmoid()
)


In [260]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 1,390,805 trainable parameters


In [266]:
# Loss function and optimizer
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.SGD(model.parameters(), lr=1e-2)

In [262]:
# Function to calculate accuracy
def binary_accuracy(preds, y):
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float()
    acc = correct.sum() / len(correct)
    return acc

In [263]:
# Define a BucketIterator for training and validation
batch_size = 32
train_iterator = BucketIterator(train_dataset, batch_size=batch_size, shuffle=True)
val_iterator = BucketIterator(val_dataset, batch_size=batch_size, shuffle=False)

In [268]:
# Training loop
epochs = 10
for epoch in range(epochs):
    model.train()
    train_loss = 0.0
    train_acc = 0.0

    for batch in train_iterator:
        text, labels = batch
        optimizer.zero_grad()

        predictions = model(text).squeeze()  # Remove the dimension

        # Ensure that labels have the same shape as predictions
        labels = labels.squeeze(1)

        loss = criterion(predictions, labels.float())
        acc = binary_accuracy(predictions, labels)

        loss.backward()
        optimizer.step()

        train_loss += loss.item()
        train_acc += acc.item()

    # Calculate average training loss and accuracy per epoch
    num_train_batches = sum(1 for _ in train_iterator)
    avg_train_loss = train_loss / num_train_batches
    avg_train_acc = train_acc / num_train_batches

    # Validation loop
    model.eval()
    val_loss = 0.0
    val_acc = 0.0

    with torch.no_grad():
        for batch in val_iterator:
            text, labels = batch

            text = text.clamp(max=INPUT_DIM - 1)  # Ensure indices are within the vocabulary size
            text[text >= len(vocab) + 1] = word_to_index['<UNK>']  # Replace out-of-vocabulary indices with '<UNK>'


            # # Print the maximum index in the current batch
            # max_index = torch.max(text)
            # if max_index.item() >= INPUT_DIM:
            #     print(f"Maximum index {max_index.item()} exceeds vocabulary size {INPUT_DIM}.")

            # print("Unique indices in the current batch:", torch.unique(text))

            predictions = model(text).squeeze()  # Remove the dimension

            # Ensure that labels have the same shape as predictions
            labels = labels.squeeze(1)

            loss = criterion(predictions, labels.float())
            acc = binary_accuracy(predictions, labels)

            val_loss += loss.item()
            val_acc += acc.item()

    # Calculate average validation loss and accuracy per epoch
    num_val_batches = sum(1 for _ in val_iterator)
    avg_val_loss = val_loss / num_val_batches
    avg_val_acc = val_acc / num_val_batches

    # Print results for each epoch
    print(f'Epoch {epoch + 1}/{epochs}, Train Loss: {avg_train_loss:.4f}, Train Acc: {avg_train_acc:.4f}, Val Loss: {avg_val_loss:.4f}, Val Acc: {avg_val_acc:.4f}')

Epoch 1/10, Train Loss: 0.6932, Train Acc: 0.2136, Val Loss: 0.6932, Val Acc: 0.2062
Epoch 2/10, Train Loss: 0.6932, Train Acc: 0.2136, Val Loss: 0.6932, Val Acc: 0.2062
Epoch 3/10, Train Loss: 0.6932, Train Acc: 0.2136, Val Loss: 0.6932, Val Acc: 0.2062
Epoch 4/10, Train Loss: 0.6932, Train Acc: 0.2136, Val Loss: 0.6932, Val Acc: 0.2062
Epoch 5/10, Train Loss: 0.6932, Train Acc: 0.2136, Val Loss: 0.6932, Val Acc: 0.2062
Epoch 6/10, Train Loss: 0.6932, Train Acc: 0.2136, Val Loss: 0.6932, Val Acc: 0.2062
Epoch 7/10, Train Loss: 0.6932, Train Acc: 0.2136, Val Loss: 0.6932, Val Acc: 0.2062
Epoch 8/10, Train Loss: 0.6932, Train Acc: 0.2136, Val Loss: 0.6932, Val Acc: 0.2062
Epoch 9/10, Train Loss: 0.6932, Train Acc: 0.2136, Val Loss: 0.6932, Val Acc: 0.2062
Epoch 10/10, Train Loss: 0.6932, Train Acc: 0.2136, Val Loss: 0.6932, Val Acc: 0.2062


# Model Evaluation

In [270]:
test_data['encoded_tokens'] = test_data['cleaned_tokens'].apply(lambda tokens: [word_to_index.get(token, word_to_index['<UNK>']) for token in tokens])

# Pad sequences to a specific length (adjust maxlen as needed)
maxlen = 20
padded_tokens = pad_sequence([torch.tensor(tokens) for tokens in test_data['encoded_tokens']], batch_first=True, padding_value=word_to_index['<PAD>']).tolist()
test_data['padded_tokens'] = [list(tokens) for tokens in padded_tokens]

In [271]:
test_iterator = BucketIterator(train_dataset, batch_size=batch_size, shuffle=True)
test_X = test_data['padded_tokens']
test_y = test_data['label']

# Convert data and labels to numpy arrays and then PyTorch tensors
test_X = [torch.LongTensor(seq) for seq in test_X.values.tolist()]
test_y = torch.tensor(test_y.values.tolist())

test_dataset = TensorDataset(torch.stack(test_X), test_y)

In [273]:
model.eval()
test_loss = 0.0
test_acc = 0.0

with torch.no_grad():
    for batch in test_iterator:
        text, labels = batch

        # Preprocess the input (similar to what you did for the validation set)
        text = text.clamp(max=INPUT_DIM - 1)
        text[text >= len(vocab) + 1] = word_to_index['<UNK>']

        predictions = model(text).squeeze()
        labels = labels.squeeze(1)

        loss = criterion(predictions, labels.float())
        acc = binary_accuracy(predictions, labels)

        test_loss += loss.item()
        test_acc += acc.item()

# Calculate average test loss and accuracy
num_test_batches = sum(1 for _ in test_iterator)

avg_test_loss = test_loss / num_test_batches
avg_test_acc = test_acc / num_test_batches

print(f'Test Loss: {avg_test_loss:.4f}, Test Acc: {avg_test_acc:.4f}')


Test Loss: 0.6932, Test Acc: 0.2136
