# Sentiment Analysis with Deep Learning using BERT

## Introduction

### What is BERT

BERT is a large-scale, transformer-based language model that can be fine-tuned for a variety of tasks.

For more information, the original paper can be found [here](https://arxiv.org/abs/1810.04805). 

[HuggingFace documentation](https://huggingface.co/transformers/model_doc/bert.html)

[Bert documentation](https://characters.fandom.com/wiki/Bert_(Sesame_Street)) ;)

<img src="Images/BERT_diagrams.pdf" width="1000">

## Exploratory Data Analysis and Preprocessing

We will use the SMILE Twitter dataset.

_Wang, Bo; Tsakalidis, Adam; Liakata, Maria; Zubiaga, Arkaitz; Procter, Rob; Jensen, Eric (2016): SMILE Twitter Emotion dataset. figshare. Dataset. https://doi.org/10.6084/m9.figshare.3187909.v2_

In [9]:
import torch
import pandas as pd
from tqdm.notebook import tqdm
import random

from transformers import BertTokenizer
from torch.utils.data import TensorDataset

from transformers import BertForSequenceClassification

from torch.utils.data import DataLoader, RandomSampler, SequentialSampler
from transformers import AdamW, get_linear_schedule_with_warmup
import numpy as np
from sklearn.metrics import f1_score
import csv

In [1]:
import pandas as pd
import io

df = pd.read_csv('train.tsv', delimiter='\t')

In [2]:
df.head()

Unnamed: 0,text,label
0,gas by my house hit i'm going to chapel hill o...,positive
1,theo walcott is still shit watch rafa and john...,negative
2,its not that i'm a gsp fan i just hate nick di...,negative
3,iranian general says israel's iron dome can't ...,negative
4,tehran mon amour obama tried to establish ties...,neutral


In [3]:
df['label'].value_counts()

positive    9155
neutral     9075
negative    3400
Name: label, dtype: int64

In [4]:
possible_labels = df['label'].unique()

In [5]:
label_dict = {}
for index, possible_label in enumerate(possible_labels):
    label_dict[possible_label] = index

In [6]:
label_dict

{'positive': 0, 'negative': 1, 'neutral': 2}
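The same mapping can be built in one expression, and its inverse is what later turns predicted class indices back into label names. A minimal sketch, assuming the label order shown in the output above; `labels_seen` and the `*_sketch` names are illustrative and not part of the notebook:

```python
# Labels in first-seen order, matching the label_dict output above.
labels_seen = ['positive', 'negative', 'neutral']

# Forward mapping: label name -> integer class index.
label_dict_sketch = {label: index for index, label in enumerate(labels_seen)}

# Inverse mapping: integer class index -> label name.
label_dict_inverse_sketch = {index: label for label, index in label_dict_sketch.items()}

print(label_dict_sketch)
print(label_dict_inverse_sketch)
```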

In [7]:
df['label'] = df['label'].replace(label_dict)

In [8]:
df.head()

Unnamed: 0,text,label
0,gas by my house hit i'm going to chapel hill o...,0
1,theo walcott is still shit watch rafa and john...,1
2,its not that i'm a gsp fan i just hate nick di...,1
3,iranian general says israel's iron dome can't ...,1
4,tehran mon amour obama tried to establish ties...,2


## Training/Validation Split

In [20]:
from sklearn.model_selection import train_test_split

In [21]:
index = list(range(len(df)))  # positional row numbers 0..len(df)-1

In [22]:
df['index'] = index 
df.head()

Unnamed: 0,text,label,index
0,gas by my house hit i'm going to chapel hill o...,0,0
1,theo walcott is still shit watch rafa and john...,1,1
2,its not that i'm a gsp fan i just hate nick di...,1,2
3,iranian general says israel's iron dome can't ...,1,3
4,tehran mon amour obama tried to establish ties...,2,4


In [23]:
X_train, X_val, y_train, y_val = train_test_split(df['index'].values, 
                                                  df['label'].values, 
                                                  test_size=0.20, 
                                                  random_state=17, 
                                                  stratify=df['label'].values)

In [24]:
df['data_type'] = ['not_set']*df.shape[0]

In [25]:
df.loc[X_train, 'data_type'] = 'train'
df.loc[X_val, 'data_type'] = 'val'

In [26]:
df.groupby(['label', 'data_type']).count()

label,data_type,text,index
0,train,7324,7324
0,val,1831,1831
1,train,2720,2720
1,val,680,680
2,train,7260,7260
2,val,1815,1815
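The counts above are what a stratified 80/20 split aims for: each class contributes roughly 20% of its rows to validation, so the class balance is the same in both subsets. A rough sketch of that arithmetic, using the class counts reported earlier; this is plain rounding for illustration, not scikit-learn's exact allocation procedure:

```python
# Class counts taken from df['label'].value_counts() earlier in the notebook.
class_counts = {'positive': 9155, 'neutral': 9075, 'negative': 3400}
test_size = 0.20

split_sizes = {}
for label, count in class_counts.items():
    n_val = round(count * test_size)  # validation share of this class
    n_train = count - n_val           # the remainder goes to training
    split_sizes[label] = (n_train, n_val)
    print(f'{label}: train={n_train}, val={n_val}, val fraction={n_val / count:.3f}')

total_train = sum(n_train for n_train, _ in split_sizes.values())
total_val = sum(n_val for _, n_val in split_sizes.values())
print(f'total: train={total_train}, val={total_val}')
```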


In [18]:
!pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/27/3c/91ed8f5c4e7ef3227b4119200fc0ed4b4fd965b1f0172021c25701087825/transformers-3.0.2-py3-none-any.whl (769kB)

## Loading Tokenizer and Encoding our Data

In [27]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', 
                                          do_lower_case=True)

In [28]:
encoded_data_train = tokenizer.batch_encode_plus(
    df[df.data_type=='train'].text.values, 
    add_special_tokens=True, 
    return_attention_mask=True, 
    padding='max_length',  # pad every sequence to max_length
    truncation=True,       # cut sequences longer than max_length
    max_length=256, 
    return_tensors='pt'
)

encoded_data_val = tokenizer.batch_encode_plus(
    df[df.data_type=='val'].text.values, 
    add_special_tokens=True, 
    return_attention_mask=True, 
    padding='max_length',  # pad every sequence to max_length
    truncation=True,       # cut sequences longer than max_length
    max_length=256, 
    return_tensors='pt'
)


input_ids_train = encoded_data_train['input_ids']
attention_masks_train = encoded_data_train['attention_mask']
labels_train = torch.tensor(df[df.data_type=='train'].label.values)

input_ids_val = encoded_data_val['input_ids']
attention_masks_val = encoded_data_val['attention_mask']
labels_val = torch.tensor(df[df.data_type=='val'].label.values)

In [29]:
dataset_train = TensorDataset(input_ids_train, attention_masks_train, labels_train)
dataset_val = TensorDataset(input_ids_val, attention_masks_val, labels_val)

In [30]:
len(dataset_train)

17304

In [31]:
len(dataset_val)

4326
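To make the roles of `input_ids` and `attention_mask` concrete, here is a simplified pure-Python sketch of padding to a fixed length. The helper `pad_and_mask` and the word ids in the example are hypothetical; the real tokenizer also handles wordpiece splitting and keeps the closing `[SEP]` token when it truncates, which this sketch does not:

```python
def pad_and_mask(token_ids, max_length, pad_id=0):
    """Truncate or right-pad a list of token ids and build the matching attention mask."""
    token_ids = token_ids[:max_length]
    n_pad = max_length - len(token_ids)
    input_ids = token_ids + [pad_id] * n_pad
    # 1 marks a real token the model should attend to, 0 marks padding.
    attention_mask = [1] * len(token_ids) + [0] * n_pad
    return input_ids, attention_mask

# Illustrative ids for a short tweet: [CLS] w1 w2 w3 [SEP]
example_ids = [101, 2023, 2003, 2307, 102]
input_ids, attention_mask = pad_and_mask(example_ids, max_length=8)
print(input_ids)
print(attention_mask)
```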

## Setting up BERT Pretrained Model

In [26]:
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=len(label_dict),
                                                      output_attentions=False,
                                                      output_hidden_states=False)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

## Creating Data Loaders

In [34]:
batch_size = 32

dataloader_train = DataLoader(dataset_train, 
                              sampler=RandomSampler(dataset_train), 
                              batch_size=batch_size)

dataloader_validation = DataLoader(dataset_val, 
                                   sampler=SequentialSampler(dataset_val), 
                                   batch_size=batch_size)
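With a batch size of 32, the number of batches per pass follows directly from the dataset sizes. A quick check of that arithmetic, assuming the `DataLoader` default of keeping the final, smaller batch:

```python
import math

batch_size = 32
n_train, n_val = 17304, 4326  # dataset sizes reported above

batches_train = math.ceil(n_train / batch_size)
batches_val = math.ceil(n_val / batch_size)

# Size of the final, partially filled training batch.
last_batch_train = n_train - (batches_train - 1) * batch_size

print(batches_train, batches_val, last_batch_train)
```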

## Setting Up Optimiser and Scheduler

In [35]:
optimizer = AdamW(model.parameters(),
                  lr=1e-5, 
                  eps=1e-8)

In [36]:
epochs = 3

scheduler = get_linear_schedule_with_warmup(optimizer, 
                                            num_warmup_steps=0,
                                            num_training_steps=len(dataloader_train)*epochs)
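A linear schedule with warmup scales the base learning rate by a factor that ramps up linearly during warmup and then decays linearly to zero. The function below is a sketch of that rule in plain Python, not the library's code; `linear_warmup_decay` is a hypothetical name, and the step count assumes 541 training batches per epoch (17304 examples at batch size 32, rounded up):

```python
def linear_warmup_decay(step, num_warmup_steps, num_training_steps):
    """Learning-rate multiplier: linear ramp-up during warmup, then linear decay to zero."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 1e-5
total_steps = 541 * 3  # batches per epoch * epochs

# Learning rate at the start, the midpoint, and the end of training (no warmup).
for step in (0, total_steps // 2, total_steps):
    print(step, base_lr * linear_warmup_decay(step, 0, total_steps))
```

With `num_warmup_steps=0`, as in the cell above, training starts at the full learning rate and decays from the first step.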

## Defining our Performance Metrics

The per-class accuracy metric follows the approach of the accuracy function in [this tutorial](https://mccormickml.com/2019/07/22/BERT-fine-tuning/#41-bertforsequenceclassification).

In [15]:
def f1_score_func(preds, labels):
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return f1_score(labels_flat, preds_flat, average='weighted')
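For intuition, the weighted F1 score is the per-class F1 averaged with each class weighted by its number of true examples. A small pure-Python sketch of that definition, using toy labels; `weighted_f1` is a hypothetical helper written for illustration, and the notebook itself relies on scikit-learn's `f1_score`:

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for cls, n_cls in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = n_cls - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n_cls / total) * f1  # weight by the class's share of true labels
    return score

# Toy example with three classes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(weighted_f1(y_true, y_pred))
```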

In [16]:
def accuracy_per_class(preds, labels):
    label_dict_inverse = {v: k for k, v in label_dict.items()}
    
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()

    for label in np.unique(labels_flat):
        y_preds = preds_flat[labels_flat==label]
        y_true = labels_flat[labels_flat==label]
        print(f'Class: {label_dict_inverse[label]}')
        print(f'Accuracy: {len(y_preds[y_preds==label])}/{len(y_true)}\n')

## Creating our Training Loop

Approach adapted from an older version of HuggingFace's `run_glue.py` script. Accessible [here](https://github.com/huggingface/transformers/blob/5bfcd0485ece086ebcbed2d008813037968a9e58/examples/run_glue.py#L128).

In [36]:
seed_val = 17
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

In [38]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

print(device)

cuda


In [14]:
def evaluate(dataloader_val):

    model.eval()
    
    loss_val_total = 0
    predictions, true_vals = [], []
    
    for batch in dataloader_val:
        
        batch = tuple(b.to(device) for b in batch)
        
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }

        with torch.no_grad():        
            outputs = model(**inputs)
            
        loss = outputs[0]
        logits = outputs[1]
        loss_val_total += loss.item()

        logits = logits.detach().cpu().numpy()
        label_ids = inputs['labels'].cpu().numpy()
        predictions.append(logits)
        true_vals.append(label_ids)
    
    loss_val_avg = loss_val_total/len(dataloader_val) 
    
    predictions = np.concatenate(predictions, axis=0)
    true_vals = np.concatenate(true_vals, axis=0)
            
    return loss_val_avg, predictions, true_vals

In [39]:
for epoch in tqdm(range(1, epochs+1)):
    
    model.train()
    
    loss_train_total = 0

    progress_bar = tqdm(dataloader_train, desc='Epoch {:1d}'.format(epoch), leave=False, disable=False)
    for batch in progress_bar:

        model.zero_grad()
        
        batch = tuple(b.to(device) for b in batch)
        
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }       

        outputs = model(**inputs)
        
        loss = outputs[0]
        loss_train_total += loss.item()
        loss.backward()

        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        optimizer.step()
        scheduler.step()
        
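        # The loss returned by the model is already averaged over the examples in the batch;
        # len(batch) would be 3 (the number of tensors in the tuple), not the batch size.
        progress_bar.set_postfix({'training_loss': '{:.3f}'.format(loss.item())})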
         
        
    torch.save(model.state_dict(), f'finetuned_BERT_epoch_{epoch}.model')
        
    tqdm.write(f'\nEpoch {epoch}')
    
    loss_train_avg = loss_train_total/len(dataloader_train)            
    tqdm.write(f'Training loss: {loss_train_avg}')
    
    val_loss, predictions, true_vals = evaluate(dataloader_validation)
    val_f1 = f1_score_func(predictions, true_vals)
    tqdm.write(f'Validation loss: {val_loss}')
    tqdm.write(f'F1 Score (Weighted): {val_f1}')

Epoch 1
Training loss: 0.7824841025466178
Validation loss: 0.6628267835168278
F1 Score (Weighted): 0.6966105574694801

Epoch 2
Training loss: 0.6047380033350255
Validation loss: 0.6568797802662149
F1 Score (Weighted): 0.69959518610248

Epoch 3
Training loss: 0.49366907345719785
Validation loss: 0.7086067118627184
F1 Score (Weighted): 0.7074433485791581

Epoch 4
Training loss: 0.39836603142803123
Validation loss: 0.7403720062883461
F1 Score (Weighted): 0.7060703128423857

Epoch 5
Training loss: 0.3117725335598653
Validation loss: 0.850075601874029
F1 Score (Weighted): 0.7031086867685694

Epoch 6
Training loss: 0.24891496933101506
Validation loss: 0.9457907915553626
F1 Score (Weighted): 0.6954829682369427

Epoch 7
Training loss: 0.2044786468712878
Validation loss: 0.9884037531035788
F1 Score (Weighted): 0.6962943855525616

Epoch 8
Training loss: 0.16252872579887703
Validation loss: 1.0732494781122488
F1 Score (Weighted): 0.6991150317126474

Epoch 9
Training loss: 0.14439163454111095
Validation loss: 1.1454109133166426
F1 Score (Weighted): 0.6974275605368715

Epoch 10
Training loss: 0.12701566459511213
Validation loss: 1.1659360797527958
F1 Score (Weighted): 0.6982260221009241
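
In the log above, training loss keeps falling while validation loss rises after the second epoch, which is the usual sign of overfitting. Since a checkpoint is saved after every epoch, one reasonable approach is to pick the epoch to reload from the logged validation metrics. A small sketch, with the `history` values copied from the log; which criterion to prefer is a judgment call:

```python
# epoch -> (validation loss, weighted F1), copied from the training log above.
history = {
    1: (0.6628267835168278, 0.6966105574694801),
    2: (0.6568797802662149, 0.69959518610248),
    3: (0.7086067118627184, 0.7074433485791581),
    4: (0.7403720062883461, 0.7060703128423857),
    5: (0.850075601874029, 0.7031086867685694),
    6: (0.9457907915553626, 0.6954829682369427),
    7: (0.9884037531035788, 0.6962943855525616),
    8: (1.0732494781122488, 0.6991150317126474),
    9: (1.1454109133166426, 0.6974275605368715),
    10: (1.1659360797527958, 0.6982260221009241),
}

best_loss_epoch = min(history, key=lambda e: history[e][0])  # lowest validation loss
best_f1_epoch = max(history, key=lambda e: history[e][1])    # highest weighted F1

print(f'Lowest validation loss: epoch {best_loss_epoch}')
print(f'Highest weighted F1:    epoch {best_f1_epoch}')
```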



In [11]:
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=len(label_dict),
                                                      output_attentions=False,
                                                      output_hidden_states=False)


In [12]:
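model.to(device)  # the freshly created model is on the CPU; move it to the device evaluate() sends batches to
model.load_state_dict(torch.load('bert.model', map_location=torch.device('cpu')))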

<All keys matched successfully>

In [39]:
_, predictions, true_vals = evaluate(dataloader_validation)

In [40]:
accuracy_per_class(predictions, true_vals)

Class: positive
Accuracy: 1371/1831

Class: negative
Accuracy: 490/680

Class: neutral
Accuracy: 1302/1815
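
The counts above are easier to compare as fractions. Because each one is measured only over examples whose true label is that class, it is the per-class recall. A short sketch that converts the reported counts and derives the overall accuracy:

```python
# (correct, total) per class, copied from the accuracy_per_class output above.
per_class = {'positive': (1371, 1831), 'negative': (490, 680), 'neutral': (1302, 1815)}

for label, (correct, total) in per_class.items():
    print(f'{label}: {correct / total:.3f}')

overall_correct = sum(correct for correct, _ in per_class.values())
overall_total = sum(total for _, total in per_class.values())
overall_accuracy = overall_correct / overall_total
print(f'overall: {overall_accuracy:.3f}')
```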



In [41]:
df = pd.read_csv('test_samples.csv', delimiter=',')

In [42]:
df.head()

Unnamed: 0,tweet_id,tweet_text
0,264238274963451904,"@jjuueellzz down in the Atlantic city, ventnor..."
1,218775148495515649,Musical awareness: Great Big Beautiful Tomorro...
2,258965201766998017,On Radio786 100.4fm 7:10 Fri Oct 19 Labour ana...
3,262926411352903682,"Kapan sih lo ngebuktiin,jan ngomong doang Susa..."
4,171874368908050432,"Excuse the connectivity of this live stream, f..."


In [43]:
import re
from string import punctuation
print("DATA CLEANING -- \n")
# emojis defined
emoji_pattern = re.compile("["
         u"\U0001F300-\U0001F5FF"  # symbols & pictographs
         u"\U0001F680-\U0001F6FF"  # transport & map symbols
         u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
         u"\U00002702-\U000027B0"
         u"\U000024C2-\U0001F251"
         "]+", flags=re.UNICODE)

def replace_emojis(t):
  '''
  This function replaces happy unicode emojis with "happy" and sad unicode emojis with "sad".
  '''
  # Plain (non-raw) string literals are required so that each \UXXXXXXXX escape becomes the emoji itself.
  emoji_happy = ["\U0001F600", "\U0001F601", "\U0001F602","\U0001F603","\U0001F604","\U0001F605", "\U0001F606", "\U0001F607", "\U0001F609", 
                "\U0001F60A", "\U0001F642","\U0001F643","\U0001F923","\U0001F970","\U0001F60D", "\U0001F929","\U0001F618","\U0001F617",
                "\U0000263A", "\U0001F61A", "\U0001F619", "\U0001F972", "\U0001F60B", "\U0001F61B", "\U0001F61C", "\U0001F92A",
                "\U0001F61D", "\U0001F911", "\U0001F917", "\U0001F92D", "\U0001F92B","\U0001F914","\U0001F910", "\U0001F928", "\U0001F610", "\U0001F611",
                "\U0001F636", "\U0001F60F","\U0001F612", "\U0001F644","\U0001F62C","\U0001F925","\U0001F60C","\U0001F614","\U0001F62A",
                "\U0001F924","\U0001F634", "\U0001F920", "\U0001F973", "\U0001F978","\U0001F60E","\U0001F913", "\U0001F9D0"]

  emoji_sad = ["\U0001F637","\U0001F912","\U0001F915","\U0001F922", "\U0001F92E","\U0001F927", "\U0001F975", "\U0001F976", "\U0001F974",
                       "\U0001F635", "\U0001F92F", "\U0001F615","\U0001F61F","\U0001F641", "\U00002639","\U0001F62E","\U0001F62F","\U0001F632",
                       "\U0001F633", "\U0001F97A","\U0001F626","\U0001F627","\U0001F628","\U0001F630","\U0001F625","\U0001F622","\U0001F62D",
                       "\U0001F631","\U0001F616","\U0001F623","\U0001F61E","\U0001F613","\U0001F629","\U0001F62B", "\U0001F971",
                       "\U0001F624","\U0001F621","\U0001F620", "\U0001F92C","\U0001F608","\U0001F47F","\U0001F480", "\U00002620"]

  words = t.split()
  reformed = []
  for w in words:
    if w in emoji_happy:
      reformed.append("happy")
    elif w in emoji_sad:
      reformed.append("sad") 
    else:
      reformed.append(w)
  t = " ".join(reformed)
  return t


def replace_smileys(t):
  '''
  This function replaces happy smileys with "happy" and sad smileys with "sad".
  '''
  emoticons_happy = set([':-)', ':)', ';)', ':o)', ':]', ':3', ':c)', ':>', '=]', '8)', '=)', ':}', ':D',
    ':^)', ':-D', ':D', '8-D', '8D', 'x-D', 'xD', 'X-D', 'XD', '=-D', '=D',
    '=-3', '=3', ':-))', ":'-)", ":')", ':*', ':^*', '>:P', ':-P', ':P', 'X-P',
    'x-p', 'xp', 'XP', ':-p', ':p', '=p', ':-b', ':b', '>:)', '>;)', '>:-)', '<3'])

  emoticons_sad = set([':L', ':-/', '>:/', ':S', '>:[', ':@', ':-(', ':[', ':-||', '=L', ':<',
    ':-[', ':-<', '=\\', '=/', '>:(', ':(', '>.<', ":'-(", ":'(", ':\\', ':-c',
    ':c', ':{', '>:\\', ';('])  

  words = t.split()
  reformed = []
  for w in words:
    if w in emoticons_happy:
      reformed.append("happy")
    elif w in emoticons_sad:
      reformed.append("sad") 
    else:
      reformed.append(w)
  t = " ".join(reformed)
  return t

def replace_contractions(t):
  '''
  This function replaces English language contractions like "shouldn't" with "should not"
  '''
  cont = {"aren't" : 'are not', "can't" : 'cannot', "couldn't": 'could not', "didn't": 'did not', "doesn't" : 'does not',
  "hadn't": 'had not', "haven't": 'have not', "he's" : 'he is', "she's" : 'she is', "he'll" : "he will", 
  "she'll" : 'she will',"he'd": "he would", "she'd":"she would", "here's" : "here is", 
   "i'm" : 'i am', "i've"	: "i have", "i'll" : "i will", "i'd" : "i would", "isn't": "is not", 
   "it's" : "it is", "it'll": "it will", "mustn't" : "must not", "shouldn't" : "should not", "that's" : "that is", 
   "there's" : "there is", "they're" : "they are", "they've" : "they have", "they'll" : "they will",
   "they'd" : "they would", "wasn't" : "was not", "we're": "we are", "we've":"we have", "we'll": "we will", 
   "we'd" : "we would", "weren't" : "were not", "what's" : "what is", "where's" : "where is", "who's": "who is",
   "who'll" :"who will", "won't":"will not", "wouldn't" : "would not", "you're": "you are", "you've":"you have",
   "you'll" : "you will", "you'd" : "you would", "mayn't" : "may not"}
  words = t.split()
  reformed = []
  for w in words:
    if w in cont:
      reformed.append(cont[w])
    else:
      reformed.append(w)
  t = " ".join(reformed)
  return t  

def remove_single_letter_words(t):
  '''
  This function removes words that are single characters
  '''
  words = t.split()
  reformed = []
  for w in words:
    if len(w) > 1:
      reformed.append(w)
  t = " ".join(reformed)
  return t  

print("Cleaning the tweets from the data.\n")
print("Replacing handwritten emojis with their feeling associated.")
print("Convert to lowercase.")
print("Replace contractions.")
print("Replace unicode emojis with their feeling associated.")
print("Remove all other unicoded emojis.")
print("Remove NON- ASCII characters.")
print("Remove numbers.")
print("Remove \"#\". ")
print("Remove \"@\". ")
print("Remove usernames.")
print("Remove \'RT\'. ")
print("Remove all URLs and links.")
print("Remove all punctuations.")
print("Removes single letter words.\n")

def dataclean(t):
  '''
  This function cleans the tweets.
  '''
  t = replace_smileys(t) # replace handwritten emojis with their feeling associated
  t = t.lower() # convert to lowercase
  t = replace_contractions(t) # replace short forms used in english  with their actual words
  t = replace_emojis(t) # replace unicode emojis with their feeling associated
  t = emoji_pattern.sub(r'', t) # remove emojis other than smiley emojis
  t = re.sub('\\\\u[0-9A-Fa-f]{4}','', t) # remove literal \uXXXX escape sequences (escaped non-ASCII characters)
  t = re.sub("[0-9]", "", t) # remove numbers
  t = re.sub('#', '', t) # remove '#'
  t = re.sub('@[A-Za-z0-9]+', '', t) # remove '@' mentions (plain hyphen in the range, not an en dash)
  t = re.sub(r'@[^\s]+', '', t) # remove any remaining usernames
  t = re.sub(r'\brt\s+', '', t) # remove retweet marker; the text is already lowercase here, so match 'rt'
  t = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', '', t) # remove links (URLs/ links)
  t = re.sub('[!"$%&\'()*+,-./:@;<=>?[\\]^_`{|}~]', '', t) # remove punctuations
  t = t.replace('\\\\', '')
  t = t.replace('\\', '')
  t = remove_single_letter_words(t) # removes single letter words
  
  return t

df['tweet_text'] = df['tweet_text'].apply(dataclean)
print("Tweets have been cleaned.")

DATA CLEANING -- 

Cleaning the tweets from the data.

Replacing handwritten emojis with their feeling associated.
Convert to lowercase.
Replace contractions.
Replace unicode emojis with their feeling associated.
Remove all other unicoded emojis.
Remove NON- ASCII characters.
Remove numbers.
Remove "#". 
Remove "@". 
Remove usernames.
Remove 'RT'. 
Remove all URLs and links.
Remove all punctuations.
Removes single letter words.

Tweets have been cleaned.
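
The regex steps are easiest to check on a single made-up tweet. The sketch below isolates three of them (URLs, mentions, hashtag symbols) in a hypothetical helper, `strip_twitter_markup`; it is a reduced illustration, not a replacement for `dataclean`, and the sample text and URL are invented:

```python
import re

def strip_twitter_markup(text):
    """Remove URLs, @mentions, and '#' signs, then collapse repeated whitespace."""
    text = re.sub(r'(www\.\S+)|(https?://\S+)', '', text)  # drop whole URLs
    text = re.sub(r'@\S+', '', text)                       # drop whole @mentions, name included
    text = text.replace('#', '')                           # keep the hashtag word, drop the symbol
    return ' '.join(text.split())                          # normalise whitespace left behind

sample = 'RT @someone: loving the #weather today https://example.com/x :)'
cleaned = strip_twitter_markup(sample)
print(cleaned)
```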


In [44]:
df.head()

Unnamed: 0,tweet_id,tweet_text
0,264238274963451904,down in the atlantic city ventnor margate ocea...
1,218775148495515649,musical awareness great big beautiful tomorrow...
2,258965201766998017,on radio fm fri oct labour analyst shawn hatti...
3,262926411352903682,kapan sih lo ngebuktiinjan ngomong doang susah...
4,171874368908050432,excuse the connectivity of this live stream fr...


In [45]:
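# evaluate() expects a label tensor for every batch, so give the unlabeled test set placeholder labels
sentiment = [0]*len(df)
df['sentiment']=sentiment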

In [46]:
encoded_data_test = tokenizer.batch_encode_plus(
    df['tweet_text'].values, 
    add_special_tokens=True, 
    return_attention_mask=True, 
    padding='max_length',  # pad every sequence to max_length
    truncation=True,       # cut sequences longer than max_length
    max_length=256, 
    return_tensors='pt'
)

In [47]:
input_ids_test = encoded_data_test['input_ids']
attention_masks_test = encoded_data_test['attention_mask']
labels_test = torch.tensor(df['sentiment'].values)

In [48]:
dataset_test = TensorDataset(input_ids_test, attention_masks_test, labels_test)
len(dataset_test)

5398

In [49]:
dataloader_test = DataLoader(dataset_test, 
                                   sampler=SequentialSampler(dataset_test), 
                                   batch_size=batch_size)

In [50]:
_, predictions, true_vals = evaluate(dataloader_test)

In [51]:
predictions

array([[ 0.3503828 , -2.2721543 ,  2.076835  ],
       [ 3.0944602 , -2.0891685 , -0.59133905],
       [-1.2530087 , -0.69582   ,  2.0882952 ],
       ...,
       [ 2.8455448 , -2.5214615 ,  0.03245418],
       [ 0.4016742 , -2.212023  ,  2.0782993 ],
       [ 1.4930922 , -2.561375  ,  0.9839138 ]], dtype=float32)
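
Each row of this array holds the raw logits for the three classes; the predicted class is the index of the largest value, and a softmax turns the row into probabilities. A minimal sketch on the first row shown above, with the index-to-label order taken from `label_dict`; the `softmax` helper is written here for illustration:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

label_names = ['positive', 'negative', 'neutral']  # index order from label_dict
first_row = [0.3503828, -2.2721543, 2.076835]      # first row of the array above

probs = softmax(first_row)
predicted = label_names[probs.index(max(probs))]
print([round(p, 3) for p in probs], predicted)
```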

In [52]:
preds_flat = np.argmax(predictions, axis=1).flatten()
len(preds_flat)

5398

In [53]:
# Map each predicted class index back to its label name using the inverse of label_dict
label_dict_inverse = {v: k for k, v in label_dict.items()}
sentiment = [label_dict_inverse[i] for i in preds_flat]

In [54]:
df['label']=sentiment

In [55]:
df.head()

Unnamed: 0,tweet_id,tweet_text,sentiment,label
0,264238274963451904,down in the atlantic city ventnor margate ocea...,0,neutral
1,218775148495515649,musical awareness great big beautiful tomorrow...,0,positive
2,258965201766998017,on radio fm fri oct labour analyst shawn hatti...,0,neutral
3,262926411352903682,kapan sih lo ngebuktiinjan ngomong doang susah...,0,neutral
4,171874368908050432,excuse the connectivity of this live stream fr...,0,negative


In [56]:
testres = df[['tweet_id','label']]
test_list = []
heading = ['tweet_id', 'sentiment']
test_list.append(heading)
for i in range(len(testres['tweet_id'])):
    sub = []
    sub.append(testres['tweet_id'][i])
    sub.append(testres['label'][i])
    test_list.append(sub)

In [57]:
len(test_list)

5399

In [58]:
with open('bert_final.csv', 'w', newline='') as fp:
    a = csv.writer(fp, delimiter = ",")
    data = test_list
    a.writerows(data)
check = pd.read_csv("bert_final.csv")
check.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5398 entries, 0 to 5397
Data columns (total 2 columns):
tweet_id     5398 non-null int64
sentiment    5398 non-null object
dtypes: int64(1), object(1)
memory usage: 84.5+ KB
