# Sentiment analysis 

This project builds a movie review classifier.
Sentiment analysis is the task of classifying the polarity of a given text.

The dataset used is the IMDb dataset, compiled by Andrew Maas (http://ai.stanford.edu/~amaas/data/sentiment/).
It contains 50,000 reviews labelled as either positive or negative, and it is balanced, i.e. it has an equal number of positive and negative reviews. At most 30 reviews are included for any given movie, since reviews of the same movie tend to be correlated.
Finally, only reviews with a low score (at most 4 out of 10) or a high score (at least 7 out of 10) are included, so neutral reviews are left out.
For training and testing, we use only the labelled reviews.
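
The score-based selection rule described above can be sketched as follows. The helper name `label_from_score` is hypothetical and only illustrates the thresholds; it is not part of the dataset tooling.

```python
from typing import Optional


def label_from_score(score: int) -> Optional[int]:
    """Map a 1-10 rating to a sentiment label (illustrative helper)."""
    if score <= 4:
        return 0      # negative review
    if score >= 7:
        return 1      # positive review
    return None       # ratings of 5 and 6 are treated as neutral and left out


labels = [label_from_score(s) for s in (1, 4, 5, 6, 7, 10)]
print(labels)  # [0, 0, None, None, 1, 1]
```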


For this problem, we use BERT (Bidirectional Encoder Representations from Transformers), a state-of-the-art pre-trained language model, and apply transfer learning to adapt it to sentiment analysis.
The model is loaded from the transformers library (https://github.com/huggingface/transformers), and PyTorch is used as the framework.

## Data processing 
For the data processing task:
- Lowercasing is applied. Lowercasing all of the text is one of the simplest and most effective forms of text preprocessing. Some pre-trained models give different answers depending on whether the input is capitalized. Of course, in some specific cases it might be useful to keep the original casing.
- Noise removal is applied. Noise removal is about removing characters, digits and pieces of text that can interfere with the text analysis. For instance, punctuation marks and parentheses such as '.', '(', '?' or ':' are removed, and HTML line breaks (`<br />`) are replaced by spaces.
- Since we are using the BERT model, we keep the words as they are, i.e. stemming is **not** used. Stemming is the process of reducing inflected words (e.g. connected, connections) to their root form (e.g. connect).
- For the same reason, lemmatization is **not** used. Lemmatization is very similar to stemming. The difference is that lemmatization doesn’t just chop things off; it transforms words into their actual root (lemma).

Once the data is pre-processed, the training data (25,000 samples) is used to train the model and the testing data (25,000 samples) is used to evaluate the final model.

## Building the model
The model is the pre-trained BERT model followed by dropout and a final fully connected layer.
Specifically, the BERT base model is used, with a hidden size of 768, 12 layers (transformer blocks), 12 attention heads, and 110 million parameters.
Dropout is added for regularization and the linear layer for classification. The dropout probability is kept at 0.1.
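
As a rough check of the "110 million parameters" figure, the count can be reproduced by hand. The vocabulary size, number of positions and number of token types below match the embedding shapes printed for the loaded model further down; the feed-forward size of 3072 (4 times the hidden size) is an assumption taken from the standard BERT base configuration.

```python
# Sizes of BERT base; FFN = 4 * HIDDEN is assumed, the rest appears in the model printout.
HIDDEN, LAYERS, FFN = 768, 12, 3072
VOCAB, MAX_POS, TYPES = 30522, 512, 2


def linear(n_in, n_out):
    """Parameters of a fully connected layer: weights plus bias."""
    return n_in * n_out + n_out


layer_norm = 2 * HIDDEN  # scale and shift vectors

embeddings = (VOCAB + MAX_POS + TYPES) * HIDDEN + layer_norm
attention = 4 * linear(HIDDEN, HIDDEN) + layer_norm          # query, key, value, output
feed_forward = linear(HIDDEN, FFN) + linear(FFN, HIDDEN) + layer_norm
pooler = linear(HIDDEN, HIDDEN)

bert_params = embeddings + LAYERS * (attention + feed_forward) + pooler
head_params = linear(HIDDEN, 1)  # the final classification layer added on top

print(bert_params)  # 109482240
print(head_params)  # 769
```

The classification head therefore adds only 769 trainable parameters on top of roughly 110 million pre-trained ones; dropout has no parameters.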

The input to BERT is a sequence of tokens, and the output is a sequence of vectors.
BERT accepts at most 512 tokens, but some reviews are longer than that. To deal with this, we apply the strategy described in the paper "How to Fine-Tune BERT for Text Classification?" by Sun et al., who propose a truncation method that keeps part of the beginning and part of the end of the text.
The rationale is that the key information of an article tends to be at its beginning and end. Following their empirical choice, the first 128 and the last 382 tokens are kept when a review exceeds the limit; together with the special [CLS] and [SEP] tokens this fills the 512 positions.
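
A minimal sketch of this head+tail truncation on a list of tokens is shown below. The function name `head_tail_truncate` is hypothetical, and the placeholder strings stand in for word-piece tokens produced by a tokenizer.

```python
def head_tail_truncate(tokens, head=128, tail=382):
    """Keep the first `head` and the last `tail` tokens of an over-long sequence."""
    if len(tokens) <= head + tail:
        return list(tokens)
    return list(tokens[:head]) + list(tokens[-tail:])


tokens = [f"tok{i}" for i in range(1000)]  # stand-in for word-piece tokens
kept = head_tail_truncate(tokens)
print(len(kept), kept[127], kept[128])  # 510 tok127 tok618
```

Sequences that already fit are returned unchanged, so the function can be applied to every review.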




## Training the model
The output of the final layer is used to compute the loss and the accuracy of the model. The loss is binary cross-entropy, implemented as `BCEWithLogitsLoss` in PyTorch.
A learning rate of 2e-5 is used (`LEARNING_RATE` in the configuration cell) with a warm-up proportion of 0.1. The optimization algorithm is Adam with $\beta_1 = 0.9$ and $\beta_2 = 0.999$. The batch size is 12 and the number of epochs is 4.
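
To make the warm-up proportion concrete, the sketch below computes the number of optimization steps from the values in the notebook's configuration cell (25,000 training samples, batch size 12, 4 epochs, learning rate 2e-5) and evaluates a linear warm-up followed by a linear decay. The function names are hypothetical, and the schedule shape is only an illustration of the idea; the exact schedule implemented by `BertAdam` may differ in detail.

```python
import math


def steps_per_epoch(n_samples, batch_size):
    return math.ceil(n_samples / batch_size)


def total_steps(n_samples, batch_size, epochs, accum_steps=1):
    # Same formula as num_train_optimization_steps in the training cell
    return int(epochs * n_samples / batch_size / accum_steps)


def lr_at(step, total, base_lr=2e-5, warmup=0.1):
    """Linear warm-up to base_lr, then linear decay to zero (illustrative shape)."""
    progress = step / total
    if progress < warmup:
        return base_lr * progress / warmup
    return base_lr * max(0.0, (1.0 - progress) / (1.0 - warmup))


total = total_steps(25_000, 12, 4)
print(steps_per_epoch(25_000, 12), total)  # 2084 8333
print(lr_at(0, total), lr_at(total, total))  # 0.0 0.0
```

With a warm-up proportion of 0.1, the learning rate rises during roughly the first 833 steps, i.e. less than half of the first epoch.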

## Testing the model
Once the model is trained, it is evaluated on the test set and reaches an accuracy of $\approx 93\%$ after only 4 epochs. This shows how effective transfer learning is for adapting these models to additional tasks.


Thanks,
Gabriele Boncoraglio

Install and import the libraries needed to run the code.

In [0]:
!pip install -Iv transformers==2.2.2
!pip install pytorch_pretrained_bert==0.4.0

In [2]:
import re,os
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, classification_report
from sklearn import metrics
import pandas as pd
import numpy as np
from contextlib import contextmanager
import time, random 

import transformers
from transformers import BertTokenizer, BertModel, BertConfig
from pytorch_pretrained_bert.optimization import BertAdam, warmup_linear

import torch
from torch.utils.data import Dataset, DataLoader, RandomSampler, SequentialSampler
from torch import cuda
device = 'cuda' if cuda.is_available() else 'cpu'
print(device)

os.chdir('/content/drive/My Drive/Colab Notebooks/Sentiment Analysis')
#True if you already trained the model
trained = False

Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex.
cuda


## Data processing 

In [0]:
# Each file holds one review per line
with open('training_dataset.txt', 'r') as f:
    train = [line.strip() for line in f]

with open('testing_dataset.txt', 'r') as f:
    test = [line.strip() for line in f]

In [0]:
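# Data noise removal
# Punctuation is deleted; HTML line breaks, hyphens and slashes become spaces.
REMOVE_CHARS = re.compile(r"[.;:!'?,\"()\[\]]")
REPLACE_WITH_SPACE = re.compile(r"(<br\s*/><br\s*/>)|(-)|(/)")

def preprocess_reviews(reviews):
    reviews = [REMOVE_CHARS.sub("", line.lower()) for line in reviews]
    reviews = [REPLACE_WITH_SPACE.sub(" ", line) for line in reviews]
    return reviews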

train1 = preprocess_reviews(train)
test1 = preprocess_reviews(test)
# Creating labels for the dataset:
# the first 12500 reviews are positive and the rest are negative
target = [1 if i < 12500 else 0 for i in range(25000)]


In [5]:
# Dealing with long sentences
# Average and maximum length (in words) of the training reviews
maxLen = 0
avgLen = 0
for stri in train1:
  s = stri.split()
  maxLen = max(maxLen,len(s))
  avgLen += len(s)
avgLen /= len(train1)
print('(training) average is: ',round(avgLen))
print('(training) max length is: ',maxLen)


maxLen = 0
avgLen = 0
for stri in test1:
  s = stri.split()
  maxLen = max(maxLen,len(s))
  avgLen += len(s)
avgLen /= len(test1)
print('(testing) average is: ',round(avgLen))
print('(testing) max length is: ',maxLen)

(training) average is:  233
(training) max length is:  2473
(testing) average is:  228
(testing) max length is:  2245


In [6]:
# Visualizing some of the comments and the response to see our data
d_train = {'comment_text': train1, 'list': target}
d_test = {'comment_text': test1, 'list': target}

df_train = pd.DataFrame(data=d_train)
df_test = pd.DataFrame(data=d_test)

df_train.head()

|   | comment_text | list |
|---|---|---|
| 0 | bromwell high is a cartoon comedy it ran at th... | 1 |
| 1 | homelessness or houselessness as george carlin... | 1 |
| 2 | brilliant over acting by lesley ann warren bes... | 1 |
| 3 | this is easily the most underrated film inn th... | 1 |
| 4 | this is not the typical mel brooks film it was... | 1 |


In [7]:

# Visualizing the dataset size
print("FULL Dataset: {}".format(df_train.shape[0]+df_test.shape[0]))
print("TRAIN Dataset: {}".format(df_train.shape))
print("TEST Dataset: {}".format(df_test.shape))

FULL Dataset: 50000
TRAIN Dataset: (25000, 2)
TEST Dataset: (25000, 2)


## Building the model

In [0]:
# nice way to report running times
@contextmanager
def timer(name):
    t0 = time.time()
    yield
    print(f'[{name}] done in {time.time() - t0:.0f} s')

In [0]:
# Sections of config
# Defining some key variables that will be used later on in the training
MAX_LEN = 512
TRAIN_BATCH_SIZE = 12
TEST_BATCH_SIZE = 12
EPOCHS = 4
LEARNING_RATE = 2e-05
SIZE_TRAIN = df_train.shape[0]

ACCUM_STEPS = 1          # wait for several backward steps, then one optimization step
WARMUP = 0.1             # warmup helps to tackle instability in the initial phase of training
USE_APEX = True         

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')





In [10]:
if USE_APEX:
    with timer('install Nvidia apex'):
        # Installing Nvidia Apex
        os.system('git clone https://github.com/NVIDIA/apex; cd apex; pip install -v --no-cache-dir' + 
                  ' --global-option="--cpp_ext" --global-option="--cuda_ext" ./')
        os.system('rm -rf apex/.git') 
        from apex import amp


[install Nvidia apex] done in 520 s


In [0]:
'''
PyTorch supports two different types of datasets:
- map-style datasets,
- iterable-style datasets.

        - Map-style datasets
        A map-style dataset is one that implements the __getitem__() and __len__() protocols, 
        and represents a map from (possibly non-integral) indices/keys to data samples.
        For example, such a dataset, when accessed with dataset[idx], 
        could read the idx-th image and its corresponding label from a folder on the disk.
        
        - An iterable-style dataset is an instance of a subclass of IterableDataset that 
        implements the __iter__() protocol, and represents an iterable over data samples. 
        This type of datasets is particularly suitable for cases where random reads 
        are expensive or even improbable, and where the batch size depends on the fetched data.
        For example, such a dataset, when called iter(dataset), 
        could return a stream of data reading from a database, a remote server, 
        or even logs generated in real time.
'''

class CustomDataset(Dataset):

    def __init__(self, dataframe, tokenizer, max_len):
        self.tokenizer = tokenizer
        self.data = dataframe
        self.comment_text = dataframe.comment_text
        self.targets = self.data.list
        self.max_len = max_len


    def __len__(self):
        return len(self.comment_text)
    
 
    def __getitem__(self, index):
        comment_text = str(self.comment_text[index])
        comment_text = " ".join(comment_text.split())
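        # Head+tail truncation on word-piece tokens: keep the first 128 and the
        # last 382 tokens, leaving room for [CLS] and [SEP].
        # encode_plus also accepts an already tokenized list of strings.
        tokens = self.tokenizer.tokenize(comment_text)
        if len(tokens) > self.max_len - 2:
            comment_text = tokens[:128] + tokens[-382:]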

        inputs = self.tokenizer.encode_plus(
            comment_text,
            None,
            add_special_tokens=True,
            max_length=self.max_len,
            stride = 0,
            truncation_strategy = 'longest_first',
            pad_to_max_length = True,
            return_token_type_ids=True
        )
        ids = inputs['input_ids']
        mask = inputs['attention_mask']
        token_type_ids = inputs["token_type_ids"]


        return {
            'ids': torch.tensor(ids, dtype=torch.long),
            'mask': torch.tensor(mask, dtype=torch.long),
            'token_type_ids': torch.tensor(token_type_ids, dtype=torch.long),
            'targets': torch.tensor(self.targets[index], dtype=torch.float)
        }
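
The difference between the two dataset styles described in the docstring above can be illustrated without PyTorch. The class names and the tab-separated line format below are hypothetical; in PyTorch the first class would subclass `torch.utils.data.Dataset` (as `CustomDataset` does) and the second `torch.utils.data.IterableDataset`.

```python
class MapStyleReviews:
    """Map-style: random access through __getitem__ and a known __len__."""

    def __init__(self, texts, labels):
        self.texts, self.labels = texts, labels

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, index):
        return {"text": self.texts[index], "target": self.labels[index]}


class IterableStyleReviews:
    """Iterable-style: samples are only available as a stream through __iter__."""

    def __init__(self, lines):
        self.lines = lines  # could equally be a file handle or a network stream

    def __iter__(self):
        for line in self.lines:
            label, text = line.split("\t", 1)
            yield {"text": text, "target": int(label)}


map_ds = MapStyleReviews(["good film", "bad film"], [1, 0])
stream_ds = IterableStyleReviews(["1\tgood film", "0\tbad film"])

print(len(map_ds), map_ds[1])  # 2 {'text': 'bad film', 'target': 0}
print(next(iter(stream_ds)))   # {'text': 'good film', 'target': 1}
```

A map-style dataset suits this notebook because all reviews are already in memory and the `DataLoader` can shuffle them by index.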

In [0]:
train_params = {'batch_size': TRAIN_BATCH_SIZE,
                'shuffle': True,
                'num_workers': 0
                }



test_params = {'batch_size': TEST_BATCH_SIZE,
                'shuffle': True,
                'num_workers': 0
               }

# Creating a Map-style datasets for PyTorch
training_set = CustomDataset(df_train, tokenizer, MAX_LEN)
testing_set = CustomDataset(df_test, tokenizer, MAX_LEN)

training_loader = DataLoader(training_set, **train_params)
testing_loader = DataLoader(testing_set, **test_params)

In [13]:
class BERTClass(torch.nn.Module):
    def __init__(self):
        '''
            The BERT model returns two outputs.
            The second one, output_1, is the pooled output; it is passed
            through the dropout layer, and the result is fed to the linear layer.
        '''
        super(BERTClass, self).__init__()
        self.l1 = transformers.BertModel.from_pretrained('bert-base-uncased')
        # #Freeze bert layers
        # for p in self.l1.parameters():
        #     p.requires_grad = False
        self.l2 = torch.nn.Dropout(0.1)
        self.l3 = torch.nn.Linear(768, 1,bias=True)
    
    def forward(self, ids, mask, token_type_ids):
        _, output_1= self.l1(ids, attention_mask = mask, token_type_ids = token_type_ids)
        output_2 = self.l2(output_1)
        output = self.l3(output_2)
        return output

model = BERTClass()

if trained:
  # The file name must match the one passed to torch.save after training
  model.load_state_dict(torch.load("bert_sentimentAnaly"))
model.to(device)



BERTClass(
  (l1): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    

In [0]:
# make results fully reproducible
def seed_everything(seed=123):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True

In [15]:
import time
import datetime

def loss_fn(outputs, targets):
    return torch.nn.BCEWithLogitsLoss()(outputs, targets)


        
if USE_APEX:
    num_train_optimization_steps = int(EPOCHS * SIZE_TRAIN / TRAIN_BATCH_SIZE / ACCUM_STEPS)
    optimizer = BertAdam(model.parameters(),
                          lr=LEARNING_RATE, warmup=WARMUP,
                          t_total=num_train_optimization_steps)

    model, optimizer = amp.initialize(model, optimizer, opt_level="O1",verbosity=1)
    model = model.train()
else:
    # All the parameters are being trained
    optimizer = torch.optim.Adam(params =  model.parameters(), lr=LEARNING_RATE, weight_decay = 5e-4)



Selected optimization level O1:  Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
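
The `loss_fn` defined above relies on `BCEWithLogitsLoss`, which applies the sigmoid and the binary cross-entropy in a single, numerically stable step. The single-sample functions below are a hand-written illustration of that idea, not PyTorch's implementation; the names `bce_with_logits` and `naive_bce` are hypothetical.

```python
import math


def bce_with_logits(logit, target):
    """Numerically stable binary cross-entropy computed from a raw logit."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def naive_bce(logit, target):
    """Same quantity via an explicit sigmoid; breaks down for large |logit|."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


print(round(bce_with_logits(0.0, 1.0), 4))  # 0.6931
print(bce_with_logits(1000.0, 0.0))         # 1000.0
```

For a logit of 1000 and a target of 0, the naive version would attempt `log(0)`, whereas the stable form returns the expected loss of 1000.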


In [0]:
def format_time(elapsed):
    '''
    Takes a time in seconds and returns a string hh:mm:ss
    '''
    # Round to the nearest second.
    elapsed_rounded = int(round((elapsed)))
    
    # Format as hh:mm:ss
    return str(datetime.timedelta(seconds=elapsed_rounded))

def validation(loader):
    model.eval()
    fin_targets=[]
    fin_outputs=[]
    with torch.no_grad():
        for _, data in enumerate(loader, 0):
            ids = data['ids'].to(device, dtype = torch.long)
            mask = data['mask'].to(device, dtype = torch.long)
            token_type_ids = data['token_type_ids'].to(device, dtype = torch.long)
            targets = data['targets'].to(device, dtype = torch.float)
            targets = targets.unsqueeze(1)
            outputs = model(ids, mask, token_type_ids)
            fin_targets.extend(targets.cpu().detach().numpy().tolist())
            fin_outputs.extend(torch.sigmoid(outputs).cpu().detach().numpy().tolist())
    return fin_outputs, fin_targets

def computeScores(testing_loader,typeSet):
    outputs, targets = validation(testing_loader) 
    outputs = np.array(outputs) >= 0.5
    accuracy = metrics.accuracy_score(targets, outputs)
    f1_score_micro = metrics.f1_score(targets, outputs, average='micro')
    f1_score_macro = metrics.f1_score(targets, outputs, average='macro')
    print("{}".format(typeSet))
    print(f"  Accuracy Score = {accuracy}")
    print(f"  F1 Score (Micro) = {f1_score_micro}")
    print(f"  F1 Score (Macro) = {f1_score_macro}")
    print(" ")
    return outputs, targets
    # return accuracy,f1_score_micro,f1_score_macro

In [17]:
seed_everything(seed=123)
if not trained:
  for epoch in range(EPOCHS):
      # ========================================
      #               Training
      # ========================================
      
      # Perform one full pass over the training set.

      print("")
      print('======== Epoch {:} / {:} ========'.format(epoch + 1, EPOCHS))
      print('Training...')
      # Measure how long the training epoch takes.
      t0 = time.time()

      # Reset the total loss for this epoch.
      total_train_loss = 0

      # Put the model into training mode.
      model.train()

      # for _,data in enumerate(training_loader, 0):
      for step, data in enumerate(training_loader):

          ids = data['ids'].to(device, dtype = torch.long)
          mask = data['mask'].to(device, dtype = torch.long)
          token_type_ids = data['token_type_ids'].to(device, dtype = torch.long)
          targets = data['targets'].to(device, dtype = torch.float)
          targets = targets.unsqueeze(1)
          # target = target.float()
          outputs = model(ids, mask, token_type_ids)


          # We need to clear any previously calculated gradients before performing a
          # backward pass.
          optimizer.zero_grad()
          loss = loss_fn(outputs, targets)

          # Progress update every 100 batches.
          if step % 100 == 0 and not step == 0:
              # Calculate elapsed time in minutes.
              elapsed = format_time(time.time() - t0)
              
              # Report progress.
              print('  Batch {:>5,}  of  {:>5,}.    Elapsed: {:}.      Current Loss: {}'.format(step, len(training_loader), elapsed,loss.item()))

      
          total_train_loss += loss.item()

          # Backward pass (gradients were already cleared above)
          if USE_APEX:
                with amp.scale_loss(loss, optimizer) as scaled_loss:
                    scaled_loss.backward()
          else:
                loss.backward()

          optimizer.step()

      # Calculate the average loss over all of the batches.
      avg_train_loss = total_train_loss / len(training_loader)            
      
      # Measure how long this epoch took.
      training_time = format_time(time.time() - t0)

      print("")
      print("  Average training loss: {0:.2f}".format(avg_train_loss))
      print("  Training epoch took: {:}".format(training_time))
      print("")
      # computeScores(testing_loader,'Testing set')
      # computeScores(validating_loader,'Validation set')
      print("")

if not trained:
  torch.save(model.state_dict(), "bert_sentimentAnaly")


Training...


	add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
	add_(Tensor other, *, Number alpha)


Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 32768.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 16384.0
  Batch   100  of  2,084.    Elapsed: 0:01:26.      Current Loss: 0.4339156150817871
  Batch   200  of  2,084.    Elapsed: 0:02:53.      Current Loss: 0.3533976674079895
  Batch   300  of  2,084.    Elapsed: 0:04:19.      Current Loss: 0.07138269394636154
  Batch   400  of  2,084.    Elapsed: 0:05:45.      Current Loss: 0.5555235147476196
  Batch   500  of  2,084.    Elapsed: 0:07:11.      Current Loss: 0.3588411808013916
  Batch   600  of  2,084.    Elapsed: 0:08:38.      Current Loss: 0.6481459736824036
  Batch   700  of  2,084.    Elapsed: 0:10:04.      Current Loss: 0.39229482412338257
  Batch   800  of  2,084.    Elapsed: 0:11:30.      Current Loss: 0.10386266559362411
  Batch   900  of  2,084.    Elapsed: 0:12:56.      Current Loss: 0.08484383672475815
  Batch 1,000  of  2,084.    Elapsed: 0:14:22.      Current Loss: 0.307

In [18]:
predictions, targets = computeScores(testing_loader,'Testing set')

Testing set
  Accuracy Score = 0.93072
  F1 Score (Micro) = 0.93072
  F1 Score (Macro) = 0.9307180095561274
 


In [19]:
roc_auc_score(targets, predictions)

0.93072
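
Note that `roc_auc_score` receives the thresholded 0/1 predictions returned by `computeScores` here, not probabilities. With hard predictions, the AUC reduces to the mean of the true positive and true negative rates, which on a balanced test set coincides with the accuracy; this is why the value above equals the accuracy score. The small sketch below illustrates the effect with made-up scores and a hypothetical pairwise `roc_auc` helper (it is not scikit-learn's implementation).

```python
def roc_auc(targets, scores):
    """AUC as the probability that a positive outranks a negative (ties count 1/2)."""
    pos = [s for t, s in zip(targets, scores) if t == 1]
    neg = [s for t, s in zip(targets, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


targets = [1, 1, 1, 1, 0, 0, 0, 0]
probs = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]  # made-up sigmoid outputs
hard = [int(p >= 0.5) for p in probs]              # thresholded predictions

tpr = sum(h for t, h in zip(targets, hard) if t == 1) / 4
tnr = sum(1 - h for t, h in zip(targets, hard) if t == 0) / 4

print(roc_auc(targets, hard), (tpr + tnr) / 2)  # 0.75 0.75
print(roc_auc(targets, probs))                  # 0.875
```

To measure ranking quality rather than accuracy, the sigmoid outputs collected in `validation` would have to be passed to `roc_auc_score` before thresholding.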

In [20]:
# classification_report(y_true, y_pred)
print(classification_report(targets, predictions))

              precision    recall  f1-score   support

         0.0       0.94      0.93      0.93     12500
         1.0       0.93      0.94      0.93     12500

    accuracy                           0.93     25000
   macro avg       0.93      0.93      0.93     25000
weighted avg       0.93      0.93      0.93     25000

