# **Sentiment Analysis using BERT**

## What is BERT?

BERT (Bidirectional Encoder Representations from Transformers) is a deep learning model designed for natural language processing (NLP). It is built on a neural network, specifically the Transformer encoder architecture.
A neural network is an artificial model used in machine learning and artificial intelligence that loosely imitates how the human brain works.
BERT is pre-trained on large amounts of text, so the model learns general language representations that can then be fine-tuned for a specific task such as sentiment classification.

**1. Exploratory Data Analysis and Preprocessing**

Here we inspect the data to see what it contains: what the tweets look like, how many words they have, and what information the text carries.
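
As a rough illustration of this kind of inspection, the sketch below counts words per tweet and rows per label using only the standard library. The first two rows are taken from the dataset preview; the third row and the names `rows` and `word_count` are hypothetical and exist only for this example.

```python
from collections import Counter

# Sample (text, category) rows; the last one is a hypothetical example
rows = [
    ("@princess_oats this is happening to me too", "neutral"),
    ("@ThaBillCollecta YEA I GOTTA BE UP AT 7:30", "neutral"),
    ("so tired today", "sadness"),
]

def word_count(text):
    """Number of whitespace-separated words in a tweet."""
    return len(text.split())

word_counts = [word_count(text) for text, _ in rows]
label_counts = Counter(category for _, category in rows)
average_words = sum(word_counts) / len(word_counts)

print(word_counts)
print(label_counts.most_common())
print(average_words)
```

With the real DataFrame, a similar summary should be available through `df.text.str.split().str.len().describe()`.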

In [None]:
! pip install torch torchvision



In [None]:
! pip install tqdm



In [None]:
import torch  # tensor library for deep learning
import pandas as pd
from tqdm.notebook import tqdm  # progress bars for loops

In [None]:
from google.colab import files
uploaded = files.upload()

Saving dataset tugas - 2.csv to dataset tugas - 2.csv


In [None]:
df = pd.read_csv('dataset tugas - 2.csv')
df.head(10)

Unnamed: 0,id,text,category
0,1957032051,@princess_oats this is happening to me too,neutral
1,1957032127,@oxygen8705 bored now because i was talking to...,neutral
2,1957032228,@xoshayzers i knoww things won't be the samee...,sadness
3,1957032539,OMG-ness it's 11:18 pm and I need to beup earl...,worry
4,1957033043,@vinylvickxen i kno i doooo!!!!!!!!!! yall par...,happiness
5,1957033103,"Okay, so twitter suddenly changed, how do I re...",worry
6,1957033219,ugh.. my dad just told me to read an article a...,neutral
7,1957033558,Decided that no matter how good my hair looks ...,worry
8,1957033776,Going to sleep. Gonna fall asleep playing apps...,sadness
9,1957033815,@ThaBillCollecta YEA I GOTTA BE UP AT 7:30,neutral


In [None]:
df.category.value_counts()  # category is the sentiment/emotion label in this dataset

worry        425
sadness      334
neutral      284
happiness     47
fun           21
Name: category, dtype: int64

In [None]:
df = df[~df.category.str.contains(r'\|')]  # drop rows whose category contains a '|' character

In [None]:
df = df[df.category != 'nocode']

In [None]:
df.category.value_counts()

worry        425
sadness      334
neutral      284
happiness     47
fun           21
Name: category, dtype: int64

In [None]:
possible_labels = df.category.unique()

In [None]:
label_dict = {}
for index, possible_label in enumerate(possible_labels):
    label_dict[possible_label] = index

In [None]:
label_dict

{'fun': 4, 'happiness': 3, 'neutral': 0, 'sadness': 1, 'worry': 2}

In [None]:
df['label'] = df.category.replace(label_dict)

In [None]:
df.head(20)

Unnamed: 0,id,text,category,label
0,1957032051,@princess_oats this is happening to me too,neutral,0
1,1957032127,@oxygen8705 bored now because i was talking to...,neutral,0
2,1957032228,@xoshayzers i knoww things won't be the samee...,sadness,1
3,1957032539,OMG-ness it's 11:18 pm and I need to beup earl...,worry,2
4,1957033043,@vinylvickxen i kno i doooo!!!!!!!!!! yall par...,happiness,3
5,1957033103,"Okay, so twitter suddenly changed, how do I re...",worry,2
6,1957033219,ugh.. my dad just told me to read an article a...,neutral,0
7,1957033558,Decided that no matter how good my hair looks ...,worry,2
8,1957033776,Going to sleep. Gonna fall asleep playing apps...,sadness,1
9,1957033815,@ThaBillCollecta YEA I GOTTA BE UP AT 7:30,neutral,0


**2. Training/Validation Split**
Split the dataset into a training set and a validation set:
most of the data is used for training and the rest is held out for validation.
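
To get a feel for what a stratified 15% hold-out does to classes this imbalanced, the sketch below applies the fraction to each class count reported by `value_counts()` above. It is only an approximation based on simple rounding; `train_test_split` uses its own allocation logic, so its exact numbers could differ by a row. The function name `approximate_stratified_sizes` is hypothetical.

```python
# Class counts reported by df.category.value_counts()
class_counts = {"worry": 425, "sadness": 334, "neutral": 284, "happiness": 47, "fun": 21}
test_size = 0.15  # same fraction that is passed to train_test_split

def approximate_stratified_sizes(counts, fraction):
    """Approximate per-class (train, val) sizes when each class keeps the same split ratio."""
    sizes = {}
    for label, n in counts.items():
        n_val = round(n * fraction)
        sizes[label] = (n - n_val, n_val)
    return sizes

sizes = approximate_stratified_sizes(class_counts, test_size)
total_train = sum(n_train for n_train, _ in sizes.values())
total_val = sum(n_val for _, n_val in sizes.values())

for label, (n_train, n_val) in sizes.items():
    print(f"{label:10s} train={n_train:3d} val={n_val:2d}")
print(total_train, total_val)
```

Under this approximation the smallest class, `fun`, ends up with only 3 validation examples, so per-class validation scores for it will be very noisy.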

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
x_train, x_val, y_train, y_val =  train_test_split(df.index.values,
                                                   df.label.values,
                                                   test_size=0.15,
                                                   random_state=17,
                                                   stratify=df.label.values
) 

In [None]:
df['data_type'] = ['not_set']*df.shape[0]

In [None]:
df.head()

Unnamed: 0,id,text,category,label,data_type
0,1957032051,@princess_oats this is happening to me too,neutral,0,not_set
1,1957032127,@oxygen8705 bored now because i was talking to...,neutral,0,not_set
2,1957032228,@xoshayzers i knoww things won't be the samee...,sadness,1,not_set
3,1957032539,OMG-ness it's 11:18 pm and I need to beup earl...,worry,2,not_set
4,1957033043,@vinylvickxen i kno i doooo!!!!!!!!!! yall par...,happiness,3,not_set


In [None]:
df.loc[x_train, 'data_type'] = 'train'
df.loc[x_val, 'data_type'] = 'val'

In [None]:
df.groupby(['category', 'label', 'data_type']).count() 
# Group the rows by category, label, and split, then count the rows in each group
# The train groups are larger because test_size above was set to 0.15 (15%)

category,label,data_type,id,text
fun,4,train,18,18
fun,4,val,3,3
happiness,3,train,40,40
happiness,3,val,7,7
neutral,0,train,241,241
neutral,0,val,43,43
sadness,1,train,284,284
sadness,1,val,50,50
worry,2,train,361,361
worry,2,val,64,64


**3. Loading Tokenizer and Encoding our Data**

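The `transformers` library provides the pretrained BERT tokenizer and model. The tokenizer takes the raw tweet text, including `@` mentions, usernames, and other symbols, and converts it into tokens the model can process.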

In [None]:
! pip install transformers

Collecting transformers
  Downloading transformers-4.12.5-py3-none-any.whl (3.1 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.2.1-py3-none-any.whl (61 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.46-py3-none-any.whl (895 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Installing collected packages: pyyaml, tokenizers, sacremoses, huggingface-hub, transformers

In [None]:
from transformers import BertTokenizer
from torch.utils.data import TensorDataset

**Tokenizer**
Tokenization splits the text: what starts as one sentence is broken into several tokens (words or word pieces).
BERT ships with a fixed vocabulary of thousands of entries; a word that is not in the vocabulary is split into smaller sub-word tokens.

The tokenizer then converts the text into numeric token IDs.
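
The sub-word splitting described above can be sketched as a greedy longest-match search. This is a simplified illustration of the WordPiece idea, not the `BertTokenizer` implementation; the function `wordpiece_tokenize` and the tiny `toy_vocab` are hypothetical and far smaller than the real vocabulary.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a '##' prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # try a shorter candidate
        if piece is None:
            return [unk_token]  # nothing matched, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# Toy vocabulary (an assumption for this example, NOT the real BERT vocabulary)
toy_vocab = {"know": 0, "##w": 1, "sleep": 2, "##ing": 3, "[UNK]": 4}

tokens = wordpiece_tokenize("knoww", toy_vocab)
token_ids = [toy_vocab[t] for t in tokens]
print(tokens, token_ids)
```

A misspelled tweet word such as `knoww` is not lost: it is split into a known stem plus a continuation piece, and each piece maps to a numeric ID.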

In [None]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', 
                                          do_lower_case=True)


**Encoding** 

Convert each tweet into its encoded form.
`batch_encode_plus` turns the strings into token IDs; it is run separately on the training data and the validation data.
For the training call we pass in the text of the rows marked `train`, and likewise for `val`.

- `add_special_tokens` adds BERT's special tokens, which tell the model where a sequence starts and where it ends
- `max_length` caps each sequence at 256 tokens (not words); the limit can be adjusted
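
The effect of these options can be sketched on a list of token IDs. This is a simplified stand-in for what the tokenizer does internally, not its actual code; the `encode` function and the input IDs `[7, 8, 9]` are hypothetical. The special-token IDs 101 (`[CLS]`), 102 (`[SEP]`), and 0 (`[PAD]`) are the ones used by `bert-base-uncased`.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token ids in bert-base-uncased

def encode(token_ids, max_length):
    """Add [CLS]/[SEP], truncate to max_length, pad, and build the attention mask."""
    body = token_ids[: max_length - 2]        # leave room for [CLS] and [SEP]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)     # 1 marks a real token
    padding = max_length - len(input_ids)
    input_ids += [PAD_ID] * padding
    attention_mask += [0] * padding           # 0 marks padding the model should ignore
    return input_ids, attention_mask

# Hypothetical token ids for a short tweet
ids, mask = encode([7, 8, 9], max_length=8)
print(ids)
print(mask)
```

Every encoded row has the same length, which is what allows the rows to be stacked into a single tensor.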

In [None]:
# Encode the training data
encoded_data_train = tokenizer.batch_encode_plus(
    df[df.data_type=='train'].text.values, 
    add_special_tokens=True, 
    return_attention_mask=True, 
    padding='max_length',  # replaces the deprecated pad_to_max_length=True
    truncation=True,       # explicitly truncate tweets longer than max_length
    max_length=256, 
    return_tensors='pt'
)

# Encode the validation data
encoded_data_val = tokenizer.batch_encode_plus(
    df[df.data_type=='val'].text.values, 
    add_special_tokens=True, 
    return_attention_mask=True, 
    padding='max_length', 
    truncation=True, 
    max_length=256, 
    return_tensors='pt'
)

# Split the encodings into the tensors BERT needs for training
input_ids_train = encoded_data_train['input_ids']
attention_masks_train = encoded_data_train['attention_mask']
labels_train = torch.tensor(df[df.data_type=='train'].label.values)

input_ids_val = encoded_data_val['input_ids']
attention_masks_val = encoded_data_val['attention_mask']
labels_val = torch.tensor(df[df.data_type=='val'].label.values)


**Converting the inputs into features BERT understands**

The training dataset and the validation dataset are kept separate.
`len()` shows how many examples each dataset contains.

In [None]:
# Create two separate datasets
dataset_train = TensorDataset(input_ids_train, attention_masks_train, labels_train)
dataset_val = TensorDataset(input_ids_val, attention_masks_val, labels_val)

In [None]:
len(dataset_train)

944

In [None]:
len(dataset_val)

167

**4. Setting up BERT Pretrained Model**

In [None]:
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=len(label_dict),
                                                      output_attentions=False,
                                                      output_hidden_states=False)


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

**5. Creating Data Loaders**

In [None]:

from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

In [None]:
batch_size = 32

# We need two separate dataloaders
dataloader_train = DataLoader(dataset_train, 
                              sampler=RandomSampler(dataset_train),
                              batch_size=batch_size)

# Evaluation does not need shuffling, so iterate in order
dataloader_validation = DataLoader(dataset_val, 
                              sampler=SequentialSampler(dataset_val),
                              batch_size=batch_size)

**6. Setting Up Optimiser and Scheduler**

In [None]:
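from torch.optim import AdamW  # the AdamW shipped with transformers is deprecated in favor of this one
from transformers import get_linear_schedule_with_warmup

In [None]:
optimizer = AdamW(model.parameters(),
                  lr=1e-5, 
                  eps=1e-8,
                  weight_decay=0.0)  # keeps the default of the deprecated transformers AdamW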

In [None]:
epochs = 10

scheduler = get_linear_schedule_with_warmup(optimizer, 
                                            num_warmup_steps=0,
                                            num_training_steps=len(dataloader_train)*epochs)

**7. Defining our Performance Metrics**

In [None]:
import numpy as np

In [None]:
from sklearn.metrics import f1_score

In [None]:
def f1_score_func(preds, labels):

    # Pick the highest-scoring class for each example (argmax over axis=1)
    # and flatten the result to a 1-D array
    preds_flat = np.argmax(preds, axis=1).flatten()

    # Flatten the true labels
    labels_flat = labels.flatten()

    # Return the F1 score as defined by sklearn, weighted by class support
    return f1_score(labels_flat, preds_flat, average='weighted')

In [None]:
def accuracy_per_class(preds, labels):
    label_dict_inverse = {v: k for k, v in label_dict.items()}
    
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()

    # Iterate over every class that occurs in the true labels
    # (labels_flat holds the true labels)
    for label in np.unique(labels_flat):
        # Select the predictions for examples whose true label is this class,
        # e.g. for 'happiness', all predictions made on tweets truly labeled 'happiness'
        y_preds = preds_flat[labels_flat==label]
        y_true = labels_flat[labels_flat==label]
        print(f'Class: {label_dict_inverse[label]}')
        print(f'Accuracy: {len(y_preds[y_preds==label])}/{len(y_true)}\n')
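
Because the classes are imbalanced, the notebook uses `average='weighted'`, which averages the per-class F1 scores with weights proportional to how many true examples each class has. The sketch below computes that quantity by hand to make the formula concrete; it is an illustrative reimplementation, not sklearn's code, and the function name `weighted_f1` and the sample labels are hypothetical.

```python
def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's number of true examples."""
    total = 0.0
    for label in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        support = tp + fn  # number of true examples of this class
        total += f1 * support
    return total / len(y_true)

# Hypothetical labels: class 2 is rare and never predicted correctly
y_true = [0, 0, 1, 1, 2]
y_pred = [0, 1, 1, 1, 0]
score = weighted_f1(y_true, y_pred)
print(round(score, 4))
```

A class that is never predicted correctly contributes an F1 of 0, but because its weight is small, the weighted score can still look acceptable. That is why the per-class accuracy function above is a useful complement.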

**8. Create a training loop to control PyTorch finetuning of BERT using CPU or GPU acceleration**

In [None]:
import random

seed_val = 17
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

print(device)

cuda


In [None]:
def evaluate(dataloader_val):

    model.eval()
    
    loss_val_total = 0
    predictions, true_vals = [], []
    
    for batch in tqdm(dataloader_val):
        
        batch = tuple(b.to(device) for b in batch)
        
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }

        with torch.no_grad():        
            outputs = model(**inputs)
            
        loss = outputs[0]
        logits = outputs[1]
        loss_val_total += loss.item()

        logits = logits.detach().cpu().numpy()
        label_ids = inputs['labels'].cpu().numpy()
        predictions.append(logits)
        true_vals.append(label_ids)
    
    loss_val_avg = loss_val_total/len(dataloader_val) 
    
    predictions = np.concatenate(predictions, axis=0)
    true_vals = np.concatenate(true_vals, axis=0)
            
    return loss_val_avg, predictions, true_vals

In [None]:
# Training loss measures the model's error on the training data
# Training and validation loss guide the next experiment; for example, to train further, double the number of epochs
# Ideally the loss keeps decreasing, but not too drastically
for epoch in tqdm(range(1, epochs+1)):
    
    model.train()          
    
    loss_train_total = 0   

    # Set up the progress bar to monitor training
    progress_bar = tqdm(dataloader_train, desc='Epoch {:1d}'.format(epoch), leave=False, disable=False)
    for batch in progress_bar:

        model.zero_grad()  # Clear the gradients left over from the previous step
        
        # Each batch is a tuple of 3 tensors: input ids, attention mask, labels
        batch = tuple(b.to(device) for b in batch)
        
        # INPUTS
        # Pack the batch into a dictionary of keyword arguments
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }       

        # OUTPUTS
        outputs = model(**inputs)  # '**' unpacks the dictionary into keyword arguments
        
        loss = outputs[0]
        loss_train_total += loss.item()
        loss.backward()           # backpropagation

        # Gradient clipping: rescale the gradients so their total norm is at most 1.0
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        optimizer.step()
        scheduler.step()
        
        # The loss is already averaged over the batch; len(batch) is 3 (the tuple size), so do not divide by it
        progress_bar.set_postfix({'training_loss': '{:.3f}'.format(loss.item())})
         
        
    torch.save(model.state_dict(), f'finetuned_BERT_epoch_{epoch}.model')
        
    tqdm.write(f'\nEpoch {epoch}')
    
    loss_train_avg = loss_train_total/len(dataloader_train)            
    tqdm.write(f'Training loss: {loss_train_avg}')
    
    val_loss, predictions, true_vals = evaluate(dataloader_validation)
    val_f1 = f1_score_func(predictions, true_vals)
    tqdm.write(f'Validation loss: {val_loss}')
    tqdm.write(f'F1 Score (Weighted): {val_f1}')

Epoch 1
Training loss: 1.4214892586072285
Validation loss: 1.382353961467743
F1 Score (Weighted): 0.3173630580035162

Epoch 2
Training loss: 1.2692373871803284
Validation loss: 1.271109938621521
F1 Score (Weighted): 0.2827193845157917

Epoch 3
Training loss: 1.2149660408496856
Validation loss: 1.3164257407188416
F1 Score (Weighted): 0.3384356126871097

Epoch 4
Training loss: 1.1433346072832744
Validation loss: 1.2585479219754536
F1 Score (Weighted): 0.37244517187768234

Epoch 5
Training loss: 1.0807270268599192
Validation loss: 1.327212115128835
F1 Score (Weighted): 0.35840969576125004

Epoch 6
Training loss: 1.0149738828341166
Validation loss: 1.3657069603602092
F1 Score (Weighted): 0.31239133647571726

Epoch 7
Training loss: 0.9552386025587718
Validation loss: 1.3124318917592366
F1 Score (Weighted): 0.3447045577020921

Epoch 8
Training loss: 0.920363332827886
Validation loss: 1.2808411220709484
F1 Score (Weighted): 0.3766794304686413

Epoch 9
Training loss: 0.8881966292858123
Validation loss: 1.306980013847351
F1 Score (Weighted): 0.36730379944852215

Epoch 10
Training loss: 0.8711750090122223
Validation loss: 1.3173609972000122
F1 Score (Weighted): 0.3542542938528581
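
In the log, training loss falls every epoch while validation loss stops improving after the first few epochs, which often indicates overfitting. One possible way to act on that is to pick the checkpoint with the best validation metric instead of always using the last one. The sketch below does this with the numbers from the log (rounded to four decimals); the `history` list and variable names are hypothetical, and the file name follows the pattern used in `torch.save` above.

```python
# (epoch, validation loss, weighted F1) copied from the training log, rounded
history = [
    (1, 1.3824, 0.3174),
    (2, 1.2711, 0.2827),
    (3, 1.3164, 0.3384),
    (4, 1.2585, 0.3724),
    (5, 1.3272, 0.3584),
    (6, 1.3657, 0.3124),
    (7, 1.3124, 0.3447),
    (8, 1.2808, 0.3767),
    (9, 1.3070, 0.3673),
    (10, 1.3174, 0.3543),
]

best_by_loss = min(history, key=lambda row: row[1])  # lowest validation loss
best_by_f1 = max(history, key=lambda row: row[2])    # highest weighted F1

best_checkpoint = f"finetuned_BERT_epoch_{best_by_f1[0]}.model"
print(best_by_loss[0], best_by_f1[0], best_checkpoint)
```

Which metric to select on is a design choice. With a validation set this small, the differences between epochs may be within noise.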


**9. Loading finetuned BERT model and evaluate its performance**

In [None]:
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=len(label_dict),
                                                      output_attentions=False,
                                                      output_hidden_states=False)

model.to(device)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, element

In [None]:
model.load_state_dict(torch.load('/content/finetuned_BERT_epoch_10.model', map_location=torch.device('cpu')))

<All keys matched successfully>

In [None]:
_, predictions, true_vals = evaluate(dataloader_validation)


In [None]:
accuracy_per_class(predictions, true_vals)

Class: neutral
Accuracy: 12/43

Class: sadness
Accuracy: 17/50

Class: worry
Accuracy: 33/64

Class: happiness
Accuracy: 0/7

Class: fun
Accuracy: 0/3

