<a href="https://colab.research.google.com/github/Yutongzhang20080108/SST-with-MLP/blob/main/SST.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This is a simple try to accomplish the SST2 task.
First we import the necessary APIs


In [2]:
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer
from torch.nn.utils.rnn import pad_sequence

In [5]:
train_data = pd.read_json("sample_data/train.jsonl", lines=True).to_dict("records")
test_data = pd.read_json("sample_data/test.jsonl", lines=True).to_dict("records")
print(f"The first dict of train_data is {train_data[0]}")
print(f"The first dict of test_data is {test_data[0]}")

The first dict of train_data is {'text': 'a stirring , funny and finally transporting re-imagining of beauty and the beast and 1930s horror films', 'label': 1, 'label_text': 'positive'}
The first dict of test_data is {'text': 'no movement , no yuks , not much of anything .', 'label': 0, 'label_text': 'negative'}


Here we create a function to split the original SST dataset into tokens by a pretrained BERT language model.

In [6]:
tokenizer = AutoTokenizer.from_pretrained("config")
max_length = 128
def my_collate_fn(batch):
  text = []
  label = []
  for sample in batch:
    text.append(sample["text"])
    label.append(sample["label"])

  text_list = []
  for texts in text:
    encoded_text = tokenizer.encode(texts, truncation=True, max_length=max_length, padding="max_length")
    text_list.append(encoded_text)
  tensor_text = []
  for text in text_list:
    text_tensor = torch.Tensor(text)
    tensor_text.append(text_tensor)
  tensor_text = pad_sequence(tensor_text, batch_first=True, padding_value=0)
  label_tensor = torch.Tensor(label)
  return {'text':tensor_text, "label":label_tensor}

In [7]:
train_loader = DataLoader(train_data, batch_size=8, shuffle=True, collate_fn=my_collate_fn)
test_loader = DataLoader(test_data, batch_size=8, shuffle=True, collate_fn=my_collate_fn)

In [8]:
for train in train_loader:
  text, label = train["text"], train["label"]
print(f"The size of the text is {text.size()}")
print(f"The size of the label is {label.size()}")
print(text[0,:])
print(label[0])

The size of the text is torch.Size([8, 128])
The size of the label is torch.Size([8])
tensor([ 101., 1045., 2031., 2025., 2042., 2023., 9364., 2011., 1037., 3185.,
        1999., 1037., 2146., 2051., 1012.,  102.,    0.,    0.,    0.,    0.,
           0.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,
           0.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,
           0.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,
           0.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,
           0.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,
           0.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,
           0.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,
           0.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,
           0.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,
           0.,    0.,    0.,    0.,    0.,    0.,    0.,

We have uploader all the necessay train and test data. And we are going to create a MLP as our main model

In [14]:
from torch import nn

class DL(nn.Module):
  def __init__(self):
    super().__init__()
    self.flatten = nn.Flatten()
    self.stack = nn.Sequential(
        nn.Linear(max_length, 512),
        nn.ReLU(),
        nn.Linear(512, 512),
        nn.ReLU(),
        nn.Linear(512, 2)
    )
  def forward(self, input):
    x = self.flatten(input)
    logits = self.stack(x)
    return logits
model = DL()

Create the train and test loop

In [15]:
learning_rate = 1e-2
batch_size = 8
epochs = 10

def trainloop(dataloader, optim, loss_fn, model):
  size = len(dataloader.dataset)
  model.train()
  for batch, train in enumerate(dataloader):
    pred = model(train['text'])
    loss = loss_fn(pred, train["label"].to(torch.int64))


    loss.backward()
    optim.step()
    optim.zero_grad()

  if batch % 100 == 0:
    loss, current = loss.item(), (batch + 1)*(len(train["label"]))
    print(f"loss:{loss}, current:[{current}/{size}]")
def testloop(dataloader, model, loss_fn):
  model.eval()
  size = len(dataloader.dataset)
  num_batches = len(dataloader)
  test_loss, correct = 0, 0

  with torch.no_grad():
      for inputs in dataloader:
          pred = model(inputs["text"])
          test_loss += loss_fn(pred, inputs["label"].to(torch.int64)).item()
          correct += (pred.argmax(1) == inputs["label"]).type(torch.float).sum().item()

  test_loss /= num_batches
  correct /= size
  print(f"Test Error: \n Accuracy: {(100*correct):>0.1f}%, Avg loss: {test_loss:>8f} \n")


We are going to start training now!!!

In [16]:
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

epochs = 5
for t in range(epochs):
    print(f"Epoch {t+1}\n-------------------------------")
    trainloop(train_loader, optimizer, loss_fn, model)
    testloop(test_loader, model, loss_fn)
print("Done!")

Epoch 1
-------------------------------
Test Error: 
 Accuracy: 50.0%, Avg loss: 0.693658 

Epoch 2
-------------------------------
Test Error: 
 Accuracy: 49.9%, Avg loss: 0.696987 

Epoch 3
-------------------------------
Test Error: 
 Accuracy: 49.9%, Avg loss: 0.699184 

Epoch 4
-------------------------------
Test Error: 
 Accuracy: 49.9%, Avg loss: 0.694391 

Epoch 5
-------------------------------
Test Error: 
 Accuracy: 49.9%, Avg loss: 0.694380 

Done!
