What is transfer learning?
- Reuse knowledge learned on one task for a related task
- Saves training time
- Shares expertise across tasks
- Reduces the need for large labeled datasets

Pre-trained model: BERT
- BERT: Bidirectional Encoder Representations from Transformers
- Pre-trained with masked language modeling and next-sentence prediction
- Stacks multiple Transformer encoder layers (12 in bert-base)
- Pre-trained on large text corpora

In this project, the main objective is to classify a sentence as negative, neutral, or positive. To do this, we fine-tune the pre-trained BERT model.

In [1]:
pip install transformers==4.18.0

Defaulting to user installation because normal site-packages is not writeable
You should consider upgrading via the '/Library/Developer/CommandLineTools/usr/bin/python3 -m pip install --upgrade pip' command.
Note: you may need to restart the kernel to use updated packages.


In [2]:
import torch 
import torch.nn as nn

In [3]:
from transformers import BertTokenizer, BertForSequenceClassification



In [4]:
#Read csv file
import pandas as pd
data = pd.read_csv('train_dataset.csv')
print(data)

                                                   text  label
0     technopoli plan develop stage area 100,000 squ...      0
1     international electronic industry company Elco...     -1
2     new production plant company increase capacity...      1
3     accord company update strategy year 2009 2012 ...      1
4     financing ASPOCOMP GROWTH Aspocomp aggressivel...      1
...                                                 ...    ...
4840  LONDON MarketWatch share price end lower Londo...     -1
4841  Rinkuskiai beer sale fall 6.5 cent 4.16 millio...      0
4842  operate profit fall EUR 35.4 mn EUR 68.8 mn 20...     -1
4843  net sale Paper segment decrease EUR 221.6 mn s...     -1
4844  sale Finland decrease 10.5 January sale outsid...     -1

[4845 rows x 2 columns]


In [5]:
data["label"].value_counts()

label
 0    2878
 1    1363
-1     604
Name: count, dtype: int64
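The counts above show that the classes are imbalanced: the neutral label (0) is almost five times as frequent as the negative label (-1). The notebook does not correct for this. As a minimal sketch of one common option, the snippet below computes inverse-frequency class weights with the formula n_samples / (n_classes * count). The function name `inverse_frequency_weights` is hypothetical; the counts are copied from the output above.

```python
# Class counts copied from the value_counts() output above.
class_counts = {0: 2878, 1: 1363, -1: 604}


def inverse_frequency_weights(counts):
    """Return n_samples / (n_classes * count) for every class."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * count) for label, count in counts.items()}


weights = inverse_frequency_weights(class_counts)
for label in sorted(weights):
    share = class_counts[label] / sum(class_counts.values())
    print(f"label {label:>2}: share={share:.3f}, weight={weights[label]:.3f}")
```

Weights like these could be passed to `torch.nn.CrossEntropyLoss(weight=...)` as a tensor ordered by encoded class id, which would mean computing the loss from the logits manually instead of using `outputs.loss`. Whether this helps on this dataset is untested here.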

In [6]:
texts_article = data["text"].tolist()
labels_article = data['label'].tolist()
print(texts_article)
print(labels_article)

[0, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

In [7]:
from sklearn.preprocessing import LabelEncoder

# Encode the original labels (-1, 0, 1) as integer class ids (0, 1, 2)
label_encoder = LabelEncoder()
data['label'] = label_encoder.fit_transform(data['label'])
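`LabelEncoder` assigns ids by sorting the distinct label values, so -1, 0, 1 become 0, 1, 2. The pure-Python sketch below reproduces that mapping so the later `id2label` dictionary is easier to check. The helper `fit_label_mapping` is a hypothetical name, not part of scikit-learn.

```python
def fit_label_mapping(labels):
    """Map each distinct label to its position in sorted order."""
    classes = sorted(set(labels))
    return {label: idx for idx, label in enumerate(classes)}


# First five labels from the DataFrame printed earlier.
raw_labels = [0, -1, 1, 1, 1]

mapping = fit_label_mapping([-1, 0, 1])
encoded = [mapping[label] for label in raw_labels]

print(mapping)
print(encoded)
```

With this ordering, id 0 corresponds to the original label -1, id 1 to 0, and id 2 to 1.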

In [8]:
print(data.head(5))

                                                text  label
0  technopoli plan develop stage area 100,000 squ...      1
1  international electronic industry company Elco...      0
2  new production plant company increase capacity...      2
3  accord company update strategy year 2009 2012 ...      2
4  financing ASPOCOMP GROWTH Aspocomp aggressivel...      2


In [9]:
data["label"].value_counts()

label
1    2878
2    1363
0     604
Name: count, dtype: int64

In [10]:
from sklearn.model_selection import train_test_split
# Note: test_size=0.7 holds out 70% of the rows for validation, leaving 30% for training
train_texts, val_texts, train_labels, val_labels = train_test_split(
    data["text"].astype(str).tolist(), data["label"].astype(int).tolist(), test_size=0.7, random_state=42)
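To make the effect of `test_size=0.7` concrete, the sketch below works out how many of the 4845 rows land in each split. It assumes scikit-learn rounds the test size up and gives the remainder to training, which is my understanding of its behavior when only `test_size` is set. `split_sizes` is a hypothetical helper.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size."""
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test


n_train, n_test = split_sizes(4845, 0.7)
print(f"train rows: {n_train}, validation rows: {n_test}")
```

If a larger training set is wanted, a smaller `test_size` such as 0.2 is a common choice, and passing `stratify=` with the label column to `train_test_split` keeps the class proportions similar in both splits.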

In [11]:
train_texts[0:5]

['registration require',
 'earn payment 4.0 mln euro $ 5.3 mln pay depend Intellibis financial performance 2007',
 'Swedish buyout firm sell remain 22.4 percent stake eighteen month take company public Finland',
 'value contract EUR 25mn',
 'domestic business Best close finnish dog owner']

In [12]:
train_labels[0:5]

[1, 1, 1, 1, 2]

In [13]:
# Load the BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=3)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

In [14]:
# Tokenize the training texts and return PyTorch tensors
inputs = tokenizer(train_texts, padding=True, truncation=True, return_tensors="pt", max_length=128)
inputs['labels'] = torch.tensor(train_labels)
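The tokenizer call above pads every sequence to the longest one in the batch and truncates anything longer than `max_length`. The sketch below imitates that behavior on plain lists to show where the zeros in `input_ids` and `attention_mask` come from. It is a simplified illustration, not the tokenizer's implementation: `pad_batch` is a hypothetical helper and the middle token ids are arbitrary placeholders. Only 0, 101 and 102 are the real [PAD], [CLS] and [SEP] ids of bert-base-uncased.

```python
PAD_ID, CLS_ID, SEP_ID = 0, 101, 102  # special-token ids in bert-base-uncased


def pad_batch(sequences, max_length):
    """Truncate to max_length (keeping a final [SEP]), then pad to the longest row."""
    truncated = [
        seq if len(seq) <= max_length else seq[:max_length - 1] + [SEP_ID]
        for seq in sequences
    ]
    width = max(len(seq) for seq in truncated)
    input_ids = [seq + [PAD_ID] * (width - len(seq)) for seq in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (width - len(seq)) for seq in truncated]
    return input_ids, attention_mask


batch = [
    [CLS_ID, 2001, 2002, SEP_ID],              # short sequence
    [CLS_ID, 2003, 2004, 2005, 2006, SEP_ID],  # longer than max_length below
]
input_ids, attention_mask = pad_batch(batch, max_length=5)
print(input_ids)
print(attention_mask)
```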

In [15]:
print(inputs)

{'input_ids': tensor([[ 101, 8819, 5478,  ...,    0,    0,    0],
        [ 101, 7796, 7909,  ...,    0,    0,    0],
        [ 101, 4467, 4965,  ...,    0,    0,    0],
        ...,
        [ 101, 4013, 3270,  ...,    0,    0,    0],
        [ 101, 2311, 2160,  ...,    0,    0,    0],
        [ 101, 4748, 2361,  ...,    0,    0,    0]]), 'token_type_ids': tensor([[0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        ...,
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]]), 'labels': tensor([1, 1, 1,  ..., 1, 1, 2])}


In [16]:
from torch.utils.data import TensorDataset, DataLoader
train_dataset = TensorDataset(inputs['input_ids'], inputs['attention_mask'], inputs['labels'])

# DataLoader that yields shuffled batches of 16 examples
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
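For readers new to `DataLoader`, the sketch below shows the same idea in plain Python: shuffle the example indices, then cut them into consecutive batches, with a smaller final batch when the size does not divide evenly. `iter_batches` is a hypothetical helper. Unlike `DataLoader`, which reshuffles every epoch, it uses a fixed seed for reproducibility.

```python
import math
import random


def iter_batches(items, batch_size, shuffle=True, seed=0):
    """Yield lists of at most batch_size items, optionally in shuffled order."""
    order = list(range(len(items)))
    if shuffle:
        random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        yield [items[i] for i in order[start:start + batch_size]]


examples = list(range(50))  # stand-in for 50 training examples
batches = list(iter_batches(examples, batch_size=16))
print([len(b) for b in batches])
print(len(batches) == math.ceil(len(examples) / 16))
```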

In [17]:
# Set up the optimizer over the model parameters
optimizer = torch.optim.AdamW(model.parameters(), lr=0.00001)
model.train()


BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12,

In [18]:
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=1e-5)

model.train()

for epoch in range(2):
    total_loss = 0
    for batch in train_loader:
        # unpack the batch from the DataLoader
        input_ids, attention_mask, labels = batch

        # forward
        outputs = model(input_ids=input_ids,
                        attention_mask=attention_mask,
                        labels=labels)

        loss = outputs.loss
        total_loss += loss.item()

        # backward
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    avg_loss = total_loss / len(train_loader)
    print(f"Epoch {epoch+1} finished, avg loss = {avg_loss:.4f}")


Epoch 1 finished, avg loss = 0.9039
Epoch 2 finished, avg loss = 0.6512
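The training loss drops between the two epochs, but the notebook never scores the model on `val_texts` and `val_labels`, so it does not show how well the model generalizes. As a sketch of what such a check could compute, the functions below calculate accuracy and macro-averaged F1 from two label lists. The example labels are hypothetical and are not results from this notebook. Producing real predictions would require running the model over the validation texts, which is not shown here.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred, classes=(0, 1, 2)):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels for illustration only.
y_true = [1, 1, 2, 0, 2, 1]
y_pred = [1, 1, 2, 1, 1, 1]

print(f"accuracy = {accuracy(y_true, y_pred):.4f}")
print(f"macro F1 = {macro_f1(y_true, y_pred):.4f}")
```

Macro F1 is worth reporting next to accuracy here because the neutral class dominates the dataset, so a model that mostly predicts neutral can still reach a high accuracy.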


# Save and Load model

In [19]:
import os

model_path = 'model/'
os.makedirs(model_path, exist_ok=True)  # create the directory if it does not exist yet

In [20]:
torch.save(model.state_dict(), model_path + 'bert_finetuned_sentiment.pth')

In [21]:
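# Load the fine-tuned weights into a freshly initialized BERT classifier.
# Note: the chained assignment also rebinds `model`, so any later cell that calls `model` uses this reloaded instance.
model_pretrained = model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=3)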

model_pretrained.load_state_dict(torch.load(model_path + 'bert_finetuned_sentiment.pth'))

print(model_pretrained)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12,

# Check with my data

In [22]:
with open("article_pbs cut off.txt", "r", encoding="utf-8") as f:
    article_text = f.read().strip()

print("Sample texts:", article_text)

Sample texts: PBS to Cut 15% of Its Staff
Congress voted this year to strip $500 million in annual funding from public broadcasters, including PBS stations.
PBS, which has its headquarters in Arlington, Va., said it was cutting 100 positions, including 34 immediate layoffs.
Credit...
PBS is cutting 100 positions, or roughly 15 percent of its staff, as a result of the major federal funding cuts to public broadcasting.
Paula Kerger, the chief executive of PBS, said in an email to station managers on Thursday that the staff reductions were a last resort. The organization had already frozen hiring, restricted travel and paused pay increases.
The cuts include 34 immediate layoffs, the closing of dozens of open positions and reductions made this summer in response to the elimination of federal funding for education programming.
“These decisions, while difficult, position PBS to weather the current challenges facing public media,” Ms. Kerger said in her email.

Like every public media organiz

In [23]:
# Tokenize the text and return PyTorch tensors
input_eval = tokenizer(article_text, return_tensors="pt", truncation=True, padding=True, max_length=128)
outputs_eval = model_pretrained(**input_eval)
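Because `truncation=True` with `max_length=128` keeps only the first 128 tokens, the rest of a long article is ignored by the prediction. One possible workaround, which this notebook does not use, is to split the token sequence into overlapping windows, classify each window, and average the probabilities. The sketch below shows only the windowing and averaging logic on plain lists. Both helper names and the window and stride values are hypothetical.

```python
def split_into_windows(tokens, window_size, stride):
    """Return overlapping slices of tokens that together cover the whole list."""
    if window_size <= 0 or stride <= 0:
        raise ValueError("window_size and stride must be positive")
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + window_size])
        if start + window_size >= len(tokens):
            break
        start += stride
    return windows


def average_probabilities(per_window_probs):
    """Average class probabilities element-wise across windows."""
    n = len(per_window_probs)
    return [sum(column) / n for column in zip(*per_window_probs)]


tokens = list(range(10))  # stand-in for a tokenized article
windows = split_into_windows(tokens, window_size=4, stride=3)
print(windows)

# Hypothetical per-window probabilities for a two-class toy case.
print(average_probabilities([[0.2, 0.8], [0.4, 0.6]]))
```

In practice the window would need to leave room for the [CLS] and [SEP] tokens, so at most 126 content tokens per window with `max_length=128`.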

In [24]:
print(input_eval)

{'input_ids': tensor([[  101, 13683,  2000,  3013,  2321,  1003,  1997,  2049,  3095,  3519,
          5444,  2023,  2095,  2000,  6167,  1002,  3156,  2454,  1999,  3296,
          4804,  2013,  2270, 18706,  1010,  2164, 13683,  3703,  1012, 13683,
          1010,  2029,  2038,  2049,  4075,  1999, 13929,  1010, 12436,  1012,
          1010,  2056,  2009,  2001,  6276,  2531,  4460,  1010,  2164,  4090,
          6234,  3913, 27475,  1012,  4923,  1012,  1012,  1012, 13683,  2003,
          6276,  2531,  4460,  1010,  2030,  5560,  2321,  3867,  1997,  2049,
          3095,  1010,  2004,  1037,  2765,  1997,  1996,  2350,  2976,  4804,
          7659,  2000,  2270,  5062,  1012, 13723, 17710, 25858,  1010,  1996,
          2708,  3237,  1997, 13683,  1010,  2056,  1999,  2019, 10373,  2000,
          2276, 10489,  2006,  9432,  2008,  1996,  3095, 25006,  2020,  1037,
          2197,  7001,  1012,  1996,  3029,  2018,  2525,  7708, 14763,  1010,
          7775,  3604,  1998,  5864,  

In [25]:
print(outputs_eval)

SequenceClassifierOutput(loss=None, logits=tensor([[ 0.3741, -0.2936, -0.1146]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
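The logits printed above are unnormalized scores, one per class. The sketch below applies softmax to them in plain Python and picks the largest probability, which is what `torch.nn.functional.softmax` followed by `torch.argmax` does on tensors. The logits are copied from the output above, and the `id2label` mapping assumes the LabelEncoder ordering (-1, 0, 1 mapped to 0, 1, 2).

```python
import math


def softmax(logits):
    """Convert logits to probabilities, subtracting the max for numerical stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


id2label = {0: "negative", 1: "neutral", 2: "positive"}

logits = [0.3741, -0.2936, -0.1146]  # copied from the printed SequenceClassifierOutput
probs = softmax(logits)
predicted_idx = max(range(len(probs)), key=probs.__getitem__)

print([round(p, 3) for p in probs])
print(id2label[predicted_idx])
```

For these particular logits the largest score is at index 0, so the sketch reports the negative class, although the margin over the other two classes is modest.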


In [26]:
# Forward pass with the reloaded model; no_grad() skips gradient tracking during inference
with torch.no_grad():
    output_eval = model_pretrained(**input_eval)

# Convert logits to probabilities
predictions = torch.nn.functional.softmax(output_eval.logits, dim=-1)

# Get predicted class
predicted_idx = torch.argmax(predictions, dim=-1).item()

# Map index -> label name
id2label = {0: "negative", 1: "neutral", 2: "positive"}
predicted_label = id2label[predicted_idx]

print(f"Text: {article_text}")
print(f"Probabilities: {predictions}")
print(f"Predicted sentiment: {predicted_label}")

Text: PBS to Cut 15% of Its Staff
Congress voted this year to strip $500 million in annual funding from public broadcasters, including PBS stations.
PBS, which has its headquarters in Arlington, Va., said it was cutting 100 positions, including 34 immediate layoffs.
Credit...
PBS is cutting 100 positions, or roughly 15 percent of its staff, as a result of the major federal funding cuts to public broadcasting.
Paula Kerger, the chief executive of PBS, said in an email to station managers on Thursday that the staff reductions were a last resort. The organization had already frozen hiring, restricted travel and paused pay increases.
The cuts include 34 immediate layoffs, the closing of dozens of open positions and reductions made this summer in response to the elimination of federal funding for education programming.
“These decisions, while difficult, position PBS to weather the current challenges facing public media,” Ms. Kerger said in her email.

Like every public media organization in