# Torch Models

PyTorch is a user-friendly deep learning library for Python. In this notebook I use it with a FastText-style model for text classification.

## Aim of This Notebook

My aim in this notebook is to establish a baseline with a Torch model and then get better results by changing some parameters.

This notebook covers preparing the data for modeling and all details of the modeling steps. Future improvement plans are added at the end of the notebook.

### Metric

As the metric, I prefer to look at the loss values for the train and validation data. I also report accuracy to make the results easier to interpret.

### Best Results of This Notebook:

The best run obtained 91.66% accuracy with 0.380 loss on the train set and 91.68% accuracy with 0.305 loss on the validation set. Detailed parameters and values can be found in this notebook.

# Importing Libraries

In [1]:
# dataframe and series 
import pandas as pd
import numpy as np

# sklearn imports for modeling part
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score,balanced_accuracy_score
from sklearn.model_selection import train_test_split

from mlxtend.evaluate import confusion_matrix
from mlxtend.plotting import plot_confusion_matrix
from mlxtend.plotting import plot_decision_regions

# Note: this import shadows the confusion_matrix imported from mlxtend above
from sklearn.metrics import confusion_matrix

# To plot
import matplotlib.pyplot as plt  
%matplotlib inline    
import matplotlib as mpl
import seaborn as sns

from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
from PIL import Image

# torch model
import torch
from torchtext import data
from torchtext import datasets
import random

import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import time

import random
import os

import spacy

In [2]:
#for text augmentation
import nlpaug.augmenter.char as nac
import nlpaug.augmenter.word as naw
import nlpaug.augmenter.sentence as nas
import nlpaug.flow as nafc

from nlpaug.util import Action

I0520 13:01:43.983011 4716817856 file_utils.py:39] PyTorch version 1.4.0 available.


In [4]:
nlp = spacy.load('en')
# import en_core_web_sm
# nlp = en_core_web_sm.load()

In [5]:
df = pd.read_csv('cleaned_data.csv') # taking data

In [6]:
df.head()

Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,style,reviewText,summary,title,day,month,year,sentiment,review_clean
0,4,True,2014-07-03,A2LSKD2H9U8N0J,B000FA5KK0,{'Format:': ' Kindle Edition'},"pretty good story, a little exaggerated, but i...",pretty good story,,3,7,2014,2,pretty good story a little exaggerated but i l...
1,5,True,2014-05-26,A2QP13XTJND1QS,B000FA5KK0,{'Format:': ' Kindle Edition'},"if you've read other max brand westerns, you k...",A very good book,,26,5,2014,2,if youve read other max brand westerns you kno...
2,5,True,2016-09-16,A8WQ7MAG3HFOZ,B000FA5KK0,{'Format:': ' Kindle Edition'},"love max, always a fun twist",Five Stars,,16,9,2016,2,love max always a fun twist
3,5,True,2016-03-03,A1E0MODSRYP7O,B000FA5KK0,{'Format:': ' Kindle Edition'},"as usual for him, a good book",a good,,3,3,2016,2,as usual for him a good book
4,5,True,2015-09-10,AYUTCGVSM1H7T,B000FA5KK0,{'Format:': ' Kindle Edition'},mb is one of the original western writers and ...,A Western,,10,9,2015,2,mb is one of the original western writers and ...


# Changing Ternary Class to Binary Class

My aim is to compare products and identify the poorly selling ones, giving extra weight to negatively reviewed books so that action can be taken. To focus on low ratings, I will reduce my target to two classes, positive and negative: ratings of 1 and 2 count as negative (0), and all other ratings count as positive (1).

In [6]:
def calc_two_sentiment(overall):
    '''This function encodes the rating 1 and 2 as 0, others as 1'''
    if overall >= 3:
        return 1
    else:
        return 0

In [7]:
df['sentiment'] = df['overall'].apply(calc_two_sentiment) #applying function

In [8]:
df['sentiment'].value_counts()

1    2031419
0     109546
Name: sentiment, dtype: int64
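
The encoding rule and the resulting class balance can be checked without loading the full dataframe. The sketch below is a small pure-Python illustration: encode_sentiment is a hypothetical stand-in for calc_two_sentiment, and the two counts are copied from the value_counts output above.

```python
def encode_sentiment(overall):
    """Map ratings 1 and 2 to 0 (negative) and ratings 3 to 5 to 1 (positive)."""
    return 1 if overall >= 3 else 0

# One example per possible star rating
encoded = {rating: encode_sentiment(rating) for rating in range(1, 6)}

# Class counts copied from the value_counts output above
positive_count = 2031419
negative_count = 109546
negative_share = negative_count / (positive_count + negative_count)

print(encoded)
print(f"Share of negative reviews: {negative_share:.2%}")
```

Roughly one review in twenty is negative, which is worth keeping in mind when reading the accuracy values later in the notebook.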

# Taking Sample Data

My data is too big to run on an ordinary computer. I could use the cloud, but each run still takes hours there. Even with 100,000 rows, some models took more than 3 hours. If I tried models on the whole set, I would have to wait more than half a day for each change to a model. So I prefer to take a sample, find the best model on it, and then apply that model to the whole dataset. For deep learning models, I take the first 100,000 rows of the clean data as my sample and leave the classes unbalanced, because I expect a deep learning model to handle unbalanced classes better than linear or gradient boosting models. I would like to see how the model does with the unbalanced version.

In [9]:
df_torch = df.head(100000)

To keep the code simple and hold less data in memory, I will select only the columns I need for modeling.

In [10]:
df_torch= df_torch.loc[:, ['review_clean', 'sentiment']]

To make reuse easier, I will create the train-test split and write both parts to CSV files. Later, I will split a validation set from the train data and use it as unseen data to check how the model does. First, though, I split off the test set and keep it on disk, in case I need a small dataset or the same test set in the future.

In [11]:
train_data, test_data = train_test_split(df_torch, test_size=0.2,random_state = 42)

In [12]:
train_data.to_csv('train.csv', index = False) # my main df

In [13]:
test_data.to_csv('test.csv', index = False) # not for using now, for keeping just in case as small data example

In [14]:
print(f'Number of training examples: {len(train_data)}')
print(f'Number of testing examples: {len(test_data)}')

Number of training examples: 80000
Number of testing examples: 20000


## Preparing Data to Torch Model

These two pages gave me the idea for the baseline and showed me how to use the FastText class. The code of the FastText class is also taken from these two authors. I combined their work, adapted it to my data, and tried different parameters for the models.

https://github.com/bentrevett/pytorch-sentiment-analysis

https://www.kaggle.com/lalwaniabhishek/abhishek-lalwani-bits-twitter-text

One of the important concepts of TorchText is the Field class, which defines how the data should be processed.

I will use a TEXT field to define how the reviews are processed and a TARGET field to process the target. As a preprocessing technique, I will use bi-grams: pairs of adjacent words that are appended to the token list.

In [19]:
def generate_bigrams(text):
    '''Append the set of co-occurring word pairs (bi-grams) to the token list'''
    bi_grams = set(zip(*[text[i:] for i in range(2)]))
    for bi_gram in bi_grams:
        text.append(' '.join(bi_gram))
    return text

In [20]:
# To check whether the bi-gram function is working properly
generate_bigrams(['I', 'love', 'this', 'book'])

['I', 'love', 'this', 'book', 'I love', 'this book', 'love this']

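My bi-gram function is working properly: the two-word pairs are appended after the original tokens. Their order can vary because they are collected in a set.

I will define my fields so that the text is tokenized with SpaCy and preprocessed with bi-grams, and a LabelField handles the target.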
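
The function above changes the list it receives and returns the pairs in arbitrary set order. As an alternative sketch, not the version used in the rest of this notebook, the hypothetical helper generate_ngrams below returns a new list, keeps the n-grams in first-occurrence order, and takes the n-gram size as a parameter.

```python
def generate_ngrams(tokens, n=2):
    """Return a new list: the original tokens followed by each distinct n-gram once."""
    # Zipping n shifted copies of the token list yields consecutive n-tuples
    ngrams = zip(*[tokens[i:] for i in range(n)])
    # dict.fromkeys drops duplicates but keeps first-occurrence order
    unique_ngrams = dict.fromkeys(' '.join(gram) for gram in ngrams)
    return list(tokens) + list(unique_ngrams)

tokens = ['I', 'love', 'this', 'book']
print(generate_ngrams(tokens, n=2))
print(tokens)  # the input list is left unchanged
```

Calling it with n=3 gives the three-word groups instead of the pairs.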

In [21]:
SEED = 1234

torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

TEXT = data.Field(tokenize = 'spacy', preprocessing = generate_bigrams)
TARGET = data.LabelField(dtype = torch.float)

In [22]:
fields_train = [('review_clean', TEXT),('sentiment', TARGET)]

With TabularDataset, the train split can be loaded from its CSV file and preprocessed with bi-grams each time.

In [23]:
# Taking training data from train.csv
train_data = data.TabularDataset(path = 'train.csv',
                                 format = 'csv',
                                 fields = fields_train,
                                 skip_header = True)

In [18]:
# # To check the first elements in train
# print(vars(train_data[0]))

Now I want to split a validation set from my train data to check that my model is doing well. I will use the default split ratio and set my random seed to get the same split each time.

### Building Validation Set 

In [25]:
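# Creating validation set from train data
# Note: random.seed(SEED) itself returns None; it seeds Python's global random
# module, and the split appears to fall back on that global state

train_data, valid_data = train_data.split(random_state = random.seed(SEED))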

In [21]:
print(f'Number of training examples: {len(train_data)}')
print(f'Number of validation examples: {len(valid_data)}')
print(f'Number of testing examples: {len(test_data)}')

Number of training examples: 56000
Number of validation examples: 24000
Number of testing examples: 20000
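
The sizes above follow from the default split ratio, which keeps 70% of the examples for training (80000 x 0.7 = 56000). The sketch below imitates a seeded split in plain Python. split_indices is a hypothetical helper, not the TorchText implementation, and it only illustrates why a fixed seed gives the same split every time.

```python
import random

def split_indices(n_examples, split_ratio=0.7, seed=1234):
    """Shuffle example indices with a fixed seed and cut them into two parts."""
    rng = random.Random(seed)          # private generator, global state untouched
    indices = list(range(n_examples))
    rng.shuffle(indices)
    cut = int(round(split_ratio * n_examples))
    return indices[:cut], indices[cut:]

train_idx, valid_idx = split_indices(80000)
print(len(train_idx), len(valid_idx))

# Running it again with the same seed reproduces the same split
print(split_indices(80000)[0][:5] == train_idx[:5])
```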


Now I need to build a vocabulary. There are a lot of distinct words, so I will cap the vocabulary at the most frequent ones. Then I will load the pre-trained word embeddings.

### Building Vocabulary with Pre-Trained Embeddings

In [26]:
MAX_VOCAB_SIZE = 25_000

TEXT.build_vocab(train_data, max_size = MAX_VOCAB_SIZE, 
                 vectors = "glove.6B.100d", 
                 unk_init = torch.Tensor.normal_)

TARGET.build_vocab(train_data)

I0519 16:21:52.618643 4541910464 vocab.py:431] Loading vectors from .vector_cache/glove.6B.100d.txt.pt


I build the vocabulary only on the train set, because a model must not see the test set in any form before it is tested. I do not add the validation set either, because I want it to reflect the test set as much as possible.

In [23]:
print(f"# of unique tokens in TEXT vocabulary: {len(TEXT.vocab)}")
print(f"# of unique tokens in TARGET vocabulary: {len(TARGET.vocab)}")

# of unique tokens in TEXT vocabulary: 25002
# of unique tokens in TARGET vocabulary: 2


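I chose a maximum vocabulary size of 25,000, so the 25,002 entries include two additional special tokens that are added by default: the unknown token and the padding token (TEXT.unk_token and TEXT.pad_token). The padding token is needed because all sentences in a batch must have the same length, so shorter sentences are padded to the length of the longest one in the batch.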
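
To make the capped vocabulary and the padding concrete, here is a minimal pure-Python sketch. build_vocab and pad_batch are hypothetical helpers written for illustration, the toy reviews are invented, and the strings used for the two special tokens are assumptions of this sketch.

```python
from collections import Counter

UNK, PAD = "<unk>", "<pad>"   # special token strings assumed for this sketch

def build_vocab(token_lists, max_size):
    """Keep the max_size most frequent tokens, plus the two special tokens."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    itos = [UNK, PAD] + [tok for tok, _ in counts.most_common(max_size)]
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

def pad_batch(token_lists, stoi):
    """Convert tokens to indices and pad every review to the longest one."""
    max_len = max(len(toks) for toks in token_lists)
    unk, pad = stoi[UNK], stoi[PAD]
    return [[stoi.get(tok, unk) for tok in toks] + [pad] * (max_len - len(toks))
            for toks in token_lists]

reviews = [["good", "book"], ["good"], ["bad", "book", "really"]]
itos, stoi = build_vocab(reviews, max_size=2)

print(len(itos))                 # max_size plus the two special tokens
print(pad_batch(reviews, stoi))  # words outside the vocabulary map to the unknown index
```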

In [24]:
print(TEXT.vocab.freqs.most_common(25)) # to see most common words in the vocabulary with their frequencies

[('the', 237680), ('and', 149130), ('a', 139606), ('i', 135782), ('to', 130757), ('of', 98507), (' ', 88735), ('is', 78287), ('this', 76680), ('it', 73911), ('in', 65307), ('was', 60753), ('that', 56410), ('book', 51555), ('for', 43846), ('story', 39234), ('but', 37773), ('her', 37674), ('with', 37553), ('read', 35264), ('you', 34167), ('nt', 33702), ('\n\n', 32400), ('she', 28707), ('not', 28230)]


### Setting Iterators

Now my vocabulary is built with pre-trained embeddings. The final step of preparing data for the Torch model is creating iterators. The train and evaluation loops iterate over them, and each iteration returns a batch of examples that are already indexed and converted into tensors. I will use the Iterator class of TorchText. I also want the tensors returned by the iterators to live on the GPU when one is available, so I will use torch.device.

In [27]:
# To set batch size and iterators for train and validation data 

BATCH_SIZE = 64

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator = data.Iterator(dataset = train_data, batch_size = BATCH_SIZE,device = device, 
                               shuffle = None, train = True, sort_key = lambda x: len(x.review_clean), 
                               sort_within_batch = False)
valid_iterator = data.Iterator(dataset = valid_data, batch_size = BATCH_SIZE,device = device, 
                               shuffle = None, train = False, sort_key = lambda x: len(x.review_clean), 
                               sort_within_batch = False)

## Building the Model

There are many ready-made classes for building a model. I prefer the FastText class for the baseline model because it gets comparable results significantly faster and with around half of the parameters. The details about this class can be found in the [Bag of Tricks for Efficient Text Classification paper](https://arxiv.org/abs/1607.01759).

In [48]:
class FastText(nn.Module):
    def __init__(self, vocab_size, embedding_dim, output_dim, pad_idx):
        
        super().__init__()
        
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)
        
        self.fc = nn.Linear(embedding_dim, output_dim)
        
#         self.dropout = nn.Dropout(0.5) # for adding dropout
        
    def forward(self, text):
    
        
        embedded = self.embedding(text)
                
        embedded = embedded.permute(1, 0, 2)
        
        pooled = F.avg_pool2d(embedded, (embedded.shape[1], 1)).squeeze(1) 
        
        return self.fc(pooled)
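
The pooling step in forward can be checked on a dummy tensor. The sketch below uses invented sizes and random values in place of real embeddings, and it shows that the avg_pool2d call is simply an average over the words of each review.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

seq_len, batch_size, embedding_dim = 5, 3, 4

# Stand-in for the output of the embedding layer: [seq_len, batch, emb_dim]
embedded = torch.randn(seq_len, batch_size, embedding_dim)

# Same steps as in forward(): put the batch first, then pool over the words
batch_first = embedded.permute(1, 0, 2)                                    # [batch, seq_len, emb_dim]
pooled = F.avg_pool2d(batch_first, (batch_first.shape[1], 1)).squeeze(1)   # [batch, emb_dim]

# Averaging over the sequence dimension gives the same result
mean_over_words = batch_first.mean(dim=1)

print(pooled.shape)
print(torch.allclose(pooled, mean_over_words, atol=1e-6))
```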

This model has only 2 layers with parameters: the embedding layer and the linear layer. There is no RNN layer. It computes the word embeddings with the embedding layer, averages them, and feeds the average to the linear layer. Now I will create my FastText model by defining the dimensions and the padding token index.

In [49]:
INPUT_DIM = len(TEXT.vocab) #vocabulary size 
EMBEDDING_DIM = 100 # embedding dimension
OUTPUT_DIM = 1 # our output has only 2 classes - 0/1. So, it is one-dimensional.
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token] # string to integer method on padding tokens

model = FastText(INPUT_DIM, EMBEDDING_DIM, OUTPUT_DIM, PAD_IDX)

To compare the number of trainable parameters across models, I will use a count_parameters function.

In [50]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters.')

The model has 2,500,301 trainable parameters.
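
The printed number can be reproduced by hand from the two layers of the model. The short sketch below only repeats the arithmetic, using the vocabulary size and the dimensions reported earlier in this notebook.

```python
vocab_size = 25002      # len(TEXT.vocab) printed earlier
embedding_dim = 100
output_dim = 1

embedding_params = vocab_size * embedding_dim              # one vector per token
linear_params = embedding_dim * output_dim + output_dim    # weights plus bias

total_params = embedding_params + linear_params
print(f"{total_params:,}")
```

Almost all of the parameters sit in the embedding layer, so the vocabulary size is what drives the size of this model.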


Now I will copy the pre-trained vectors into my embedding layer.

In [51]:
pretrained_embeddings = TEXT.vocab.vectors

model.embedding.weight.data.copy_(pretrained_embeddings)

tensor([[-0.1117, -0.4966,  0.1631,  ...,  1.2647, -0.2753, -0.1325],
        [-0.8555, -0.7208,  1.3755,  ...,  0.0825, -1.1314,  0.3997],
        [-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],
        ...,
        [-0.6578,  0.9299,  0.0580,  ..., -0.9173,  1.2022,  0.2694],
        [-0.3626,  0.1501,  1.4050,  ...,  0.0213,  0.3717, -0.6314],
        [-1.3447, -1.4811,  0.7253,  ..., -0.5115, -0.9313, -0.3301]])

I set the initial weights of the unknown and padding tokens to zero. I have already defined the padding index as PAD_IDX, so I define the unknown index as UNK_IDX and set both rows to zeros.

In [52]:
UNK_IDX = TEXT.vocab.stoi[TEXT.unk_token]

model.embedding.weight.data[UNK_IDX] = torch.zeros(EMBEDDING_DIM)
model.embedding.weight.data[PAD_IDX] = torch.zeros(EMBEDDING_DIM)

## Training the Model

To train the model, I first create an optimizer and a criterion. The optimizer updates the parameters of the module. I will try SGD and Adam as optimizers. SGD is a variant of gradient descent: instead of computing the gradient on the whole dataset, it computes it on a small, randomly selected subset (a batch) at each step. It tends to behave well when the learning rate is low. The optimizer constructor needs two arguments here: the model parameters to update and the learning rate. Adam is an optimizer that adapts the learning rate for each parameter.

I tried both optimizers one at a time by switching the commented line in the cell below, also with different learning rates.

In [60]:
optimizer = optim.SGD(model.parameters(), lr=1e-4)
# optimizer = optim.Adam(model.parameters(),lr=1e-4)

Now I will define the loss function. My target contains binary labels, so I will choose a binary loss function as the criterion.
Cross-entropy loss is commonly used for classification problems, and BCEWithLogitsLoss combines a sigmoid layer with binary cross-entropy loss in one class. So, I will use this one.

In [54]:
criterion = nn.BCEWithLogitsLoss()

# keeping model and criterion in GPU
model = model.to(device)
criterion = criterion.to(device)
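
To see what this loss computes for a single review, the sketch below re-implements the formula in plain Python. bce_with_logits is a hypothetical helper written for illustration, not the PyTorch implementation, and the scores are invented.

```python
import math

def bce_with_logits(logit, target):
    """Binary cross-entropy on a raw score (logit), in a numerically stable form."""
    # Equivalent to -[y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z))]
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

print(round(bce_with_logits(0.0, 1.0), 3))   # uninformed score
print(round(bce_with_logits(3.0, 1.0), 3))   # confident and right
print(round(bce_with_logits(3.0, 0.0), 3))   # confident and wrong
```

A score of 0 gives a loss of ln 2, about 0.693, so loss values close to 0.69 in the training logs suggest a model that has barely moved away from an uninformed starting point.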

The loss is calculated by the criterion, but I also want to see accuracy to compare models. The function below passes the raw outputs through a sigmoid and rounds them to 0 or 1. Then it checks which rounded predictions equal the actual labels and takes the mean over the batch.

In [55]:
def binary_accuracy(pred, target):
      
    # rounding predictions to the closest integer
    rounded_pred = torch.round(torch.sigmoid(pred))
    true = (rounded_pred == target).float() # convert into float for taking mean 
    accuracy = true.sum() / len(true)
    return accuracy
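
Because the classes are unbalanced, accuracy on its own can look better than the model really is. The pure-Python sketch below uses invented toy labels to compare plain accuracy with balanced accuracy for a model that always predicts the positive class. Both functions are hypothetical helpers written for this illustration.

```python
def accuracy(predictions, targets):
    """Share of predictions that equal the target label."""
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)

def balanced_accuracy(predictions, targets):
    """Mean of the per-class recalls, so each class counts equally."""
    recalls = []
    for label in sorted(set(targets)):
        idx = [i for i, t in enumerate(targets) if t == label]
        recalls.append(sum(predictions[i] == label for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

# Toy labels: 11 positive reviews and 1 negative review
targets = [1] * 11 + [0]
always_positive = [1] * 12

print(f"accuracy: {accuracy(always_positive, targets):.4f}")
print(f"balanced accuracy: {balanced_accuracy(always_positive, targets):.4f}")
```

On this toy data, plain accuracy is high even though no negative review is found, while balanced accuracy stays at 0.5.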

In [56]:
# setting the train method

def train(model, iterator, optimizer, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.train()
    
    for batch in iterator: # for each batch
        
        optimizer.zero_grad() # zero gradient
        # PyTorch does not automatically zero the gradients calculated from the last gradient calculation
        
        predictions = model(batch.review_clean).squeeze(1) # with feeding batch with reviews no need to .forward
        
        #squeeze for removing dimension in the list and taking only batch size 
        #bec. torch wants predictions input as batch size
        
        loss = criterion(predictions, batch.sentiment) # calculating loss
        
        acc = binary_accuracy(predictions, batch.sentiment) # calculating accuracy with taking mean
        
        loss.backward() #gradient of each parameter
        
        optimizer.step() # update the parameters using the computed gradients
        
        # loss and accuracy by epoches
        epoch_loss += loss.item() 
        epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator) # returning loss and acc avg across epoch

I will write a similar function below to evaluate the validation set.

In [57]:
def evaluate(model, iterator, criterion):
    '''Evaluating validation set'''
    epoch_loss = 0
    epoch_acc = 0
    
    model.eval()
    
    with torch.no_grad():
    
        for batch in iterator:

            predictions = model(batch.review_clean).squeeze(1)
            
            loss = criterion(predictions, batch.sentiment)
            
            acc = binary_accuracy(predictions, batch.sentiment)

            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

I also use a function that reports how long each epoch takes.

In [58]:
import time

def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

# Training the Model for Baseline

I tried different numbers of epochs and, based on the results, chose 5, because most of the improvement happens in the first 5 epochs.

# Adam Optimizer

In [38]:
# with Adam optimizer
N_EPOCHS = 5

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()
    
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    
    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    # to keep model for test set
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut1-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

Epoch: 01 | Epoch Time: 2m 8s
	Train Loss: 0.454 | Train Acc: 88.82%
	 Val. Loss: 0.717 |  Val. Acc: 91.69%
Epoch: 02 | Epoch Time: 2m 11s
	Train Loss: 0.318 | Train Acc: 91.66%
	 Val. Loss: 0.446 |  Val. Acc: 91.62%
Epoch: 03 | Epoch Time: 2m 25s
	Train Loss: 0.236 | Train Acc: 91.84%
	 Val. Loss: 0.438 |  Val. Acc: 92.45%
Epoch: 04 | Epoch Time: 2m 23s
	Train Loss: 0.194 | Train Acc: 92.48%
	 Val. Loss: 0.549 |  Val. Acc: 92.58%
Epoch: 05 | Epoch Time: 2m 7s
	Train Loss: 0.171 | Train Acc: 93.04%
	 Val. Loss: 0.665 |  Val. Acc: 92.56%


The model looks overfit, so I added dropout and ran it again.
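
The FastText class above shows the dropout layer only as a commented-out line, and the exact change used for this run is not shown. One possible way to apply dropout, offered as an assumption rather than the code actually used here, is to drop units of the averaged embedding before the linear layer. FastTextWithDropout and its dropout argument are hypothetical names.

```python
import torch
import torch.nn as nn

class FastTextWithDropout(nn.Module):
    """FastText-style classifier with dropout on the averaged embedding (sketch)."""

    def __init__(self, vocab_size, embedding_dim, output_dim, pad_idx, dropout=0.5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(embedding_dim, output_dim)

    def forward(self, text):
        # text: [seq_len, batch]
        embedded = self.embedding(text)     # [seq_len, batch, emb_dim]
        pooled = embedded.mean(dim=0)       # average over the words: [batch, emb_dim]
        return self.fc(self.dropout(pooled))

sketch_model = FastTextWithDropout(vocab_size=50, embedding_dim=8, output_dim=1, pad_idx=1)
dummy_batch = torch.randint(0, 50, (7, 4))   # 7 tokens per review, 4 reviews

sketch_model.eval()                          # dropout is switched off in eval mode
with torch.no_grad():
    output = sketch_model(dummy_batch)
print(output.shape)
```

Dropout is only active after model.train() is called, which the train function in this notebook does at the start of every epoch.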

# Adam Optimizer with Dropout

In [41]:
# with Adam optimizer with dropout
N_EPOCHS = 5

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()
    
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    
    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    # to keep model for test set
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut1-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

Epoch: 01 | Epoch Time: 2m 23s
	Train Loss: 0.427 | Train Acc: 91.66%
	 Val. Loss: 0.641 |  Val. Acc: 91.71%
Epoch: 02 | Epoch Time: 2m 33s
	Train Loss: 0.302 | Train Acc: 91.67%
	 Val. Loss: 0.404 |  Val. Acc: 91.78%
Epoch: 03 | Epoch Time: 2m 26s
	Train Loss: 0.227 | Train Acc: 91.94%
	 Val. Loss: 0.441 |  Val. Acc: 92.55%
Epoch: 04 | Epoch Time: 2m 28s
	Train Loss: 0.189 | Train Acc: 92.58%
	 Val. Loss: 0.568 |  Val. Acc: 92.31%
Epoch: 05 | Epoch Time: 2m 27s
	Train Loss: 0.168 | Train Acc: 93.16%
	 Val. Loss: 0.687 |  Val. Acc: 92.38%


It is still overfitting, so I changed the learning rate and tried again.

# Adam Optimizer with Different Learning Rates

I ran the code with different learning rates to see which one gives better results.

In [43]:
# with Adam optimizer with dropout lr e-4
N_EPOCHS = 5

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()
    
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    
    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    # to keep model for test set
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut1-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

Epoch: 01 | Epoch Time: 2m 23s
	Train Loss: 0.157 | Train Acc: 93.55%
	 Val. Loss: 0.708 |  Val. Acc: 92.24%
Epoch: 02 | Epoch Time: 2m 36s
	Train Loss: 0.156 | Train Acc: 93.56%
	 Val. Loss: 0.723 |  Val. Acc: 92.22%
Epoch: 03 | Epoch Time: 2m 32s
	Train Loss: 0.154 | Train Acc: 93.71%
	 Val. Loss: 0.737 |  Val. Acc: 92.21%
Epoch: 04 | Epoch Time: 2m 19s
	Train Loss: 0.153 | Train Acc: 93.77%
	 Val. Loss: 0.747 |  Val. Acc: 92.29%
Epoch: 05 | Epoch Time: 2m 25s
	Train Loss: 0.152 | Train Acc: 93.75%
	 Val. Loss: 0.767 |  Val. Acc: 92.20%


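The overfitting problem is still present. Note that the first-epoch train loss (0.157) picks up where the previous run ended (0.168), which suggests that the model weights were not re-initialized before this run.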

# Changing Optimizer

Now, I will change my optimizer and try again.

In [36]:
# with SGD optimizer - lr e-3

N_EPOCHS = 5

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()
    
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    
    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    # to keep model for test set
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut2-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

Epoch: 01 | Epoch Time: 2m 45s
	Train Loss: 0.647 | Train Acc: 76.40%
	 Val. Loss: 0.527 |  Val. Acc: 91.68%
Epoch: 02 | Epoch Time: 2m 44s
	Train Loss: 0.531 | Train Acc: 91.66%
	 Val. Loss: 0.415 |  Val. Acc: 91.68%
Epoch: 03 | Epoch Time: 2m 32s
	Train Loss: 0.459 | Train Acc: 91.66%
	 Val. Loss: 0.355 |  Val. Acc: 91.68%
Epoch: 04 | Epoch Time: 2m 20s
	Train Loss: 0.412 | Train Acc: 91.66%
	 Val. Loss: 0.323 |  Val. Acc: 91.68%
Epoch: 05 | Epoch Time: 2m 10s
	Train Loss: 0.380 | Train Acc: 91.66%
	 Val. Loss: 0.305 |  Val. Acc: 91.68%


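Accuracy is lower than with the Adam optimizer, but these results are better in terms of loss and overfitting. Note that train accuracy stays at 91.66% from the second epoch on and validation accuracy at 91.68% in every epoch, which may mean that the model predicts the same class for every review.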

# SGD Optimizer with Dropout

In [45]:
# with SGD optimizer with dropout

N_EPOCHS = 5

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()
    
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    
    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    # to keep model for test set
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut2-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

Epoch: 01 | Epoch Time: 2m 14s
	Train Loss: 0.151 | Train Acc: 93.81%
	 Val. Loss: 0.767 |  Val. Acc: 92.21%
Epoch: 02 | Epoch Time: 2m 7s
	Train Loss: 0.151 | Train Acc: 93.80%
	 Val. Loss: 0.767 |  Val. Acc: 92.21%
Epoch: 03 | Epoch Time: 2m 11s
	Train Loss: 0.151 | Train Acc: 93.86%
	 Val. Loss: 0.766 |  Val. Acc: 92.22%
Epoch: 04 | Epoch Time: 2m 10s
	Train Loss: 0.151 | Train Acc: 93.85%
	 Val. Loss: 0.766 |  Val. Acc: 92.22%
Epoch: 05 | Epoch Time: 2m 7s
	Train Loss: 0.151 | Train Acc: 93.89%
	 Val. Loss: 0.766 |  Val. Acc: 92.22%


It is overfitting again.

# SGD Optimizer with Different Learning Rates

In [47]:
# with SGD optimizer with dropout different lr 

N_EPOCHS = 5

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()
    
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    
    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    # to keep model for test set
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut2-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

Epoch: 01 | Epoch Time: 1m 59s
	Train Loss: 0.150 | Train Acc: 93.80%
	 Val. Loss: 0.766 |  Val. Acc: 92.22%
Epoch: 02 | Epoch Time: 2m 23s
	Train Loss: 0.151 | Train Acc: 93.82%
	 Val. Loss: 0.766 |  Val. Acc: 92.22%
Epoch: 03 | Epoch Time: 2m 19s
	Train Loss: 0.150 | Train Acc: 93.88%
	 Val. Loss: 0.766 |  Val. Acc: 92.22%
Epoch: 04 | Epoch Time: 2m 10s
	Train Loss: 0.151 | Train Acc: 93.87%
	 Val. Loss: 0.766 |  Val. Acc: 92.22%
Epoch: 05 | Epoch Time: 2m 8s
	Train Loss: 0.151 | Train Acc: 93.81%
	 Val. Loss: 0.766 |  Val. Acc: 92.22%


Accuracy is higher, but the validation loss is higher too, so I decided to run again with different parameters.

# SGD Optimizer without Dropout Layer with Smaller Learning Rate

In [61]:
# with SGD optimizer without dropout lr e-4

N_EPOCHS = 5

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()
    
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    
    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    # to keep model for test set
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut2-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

Epoch: 01 | Epoch Time: 2m 9s
	Train Loss: 0.703 | Train Acc: 9.90%
	 Val. Loss: 0.705 |  Val. Acc: 30.88%
Epoch: 02 | Epoch Time: 2m 32s
	Train Loss: 0.686 | Train Acc: 85.03%
	 Val. Loss: 0.679 |  Val. Acc: 76.10%
Epoch: 03 | Epoch Time: 2m 57s
	Train Loss: 0.670 | Train Acc: 91.66%
	 Val. Loss: 0.655 |  Val. Acc: 88.40%
Epoch: 04 | Epoch Time: 3m 7s
	Train Loss: 0.655 | Train Acc: 91.66%
	 Val. Loss: 0.633 |  Val. Acc: 90.53%
Epoch: 05 | Epoch Time: 2m 35s
	Train Loss: 0.640 | Train Acc: 91.66%
	 Val. Loss: 0.612 |  Val. Acc: 91.02%


I found the best results with the SGD optimizer and a learning rate of 1e-3.

# Adding Tri-Gram Function

Now I would like to see what changes if I group my words in threes instead of twos. I will repeat some steps because the data prepared for Torch will change.

In [16]:
def generate_trigrams(text):
    '''Append the set of 3 co-occurring words (tri-grams) to the token list'''
    tri_grams = set(zip(*[text[i:] for i in range(3)]))
    for tri_gram in tri_grams:
        text.append(' '.join(tri_gram))
    return text

In [14]:
# To check whether the tri-gram function is working properly
generate_trigrams(['I', 'love', 'this', 'book'])

['I', 'love', 'this', 'book', 'love this book', 'I love this']

In [3]:
SEED = 1234

torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

TEXT = data.Field(tokenize = 'spacy', preprocessing = generate_trigrams)
TARGET = data.LabelField(dtype = torch.float)

In [16]:
fields_train = [('review_clean', TEXT),('sentiment', TARGET)]

In [17]:
# Taking training data from train.csv
train_data = data.TabularDataset(path = 'train.csv',
                                 format = 'csv',
                                 fields = fields_train,
                                 skip_header = True)

In [18]:
# print(vars(train_data[0])) # to check tri-grams

In [20]:
# Creating validation set from train data

train_data, valid_data = train_data.split(random_state = random.seed(SEED))

In [21]:
MAX_VOCAB_SIZE = 25_000

TEXT.build_vocab(train_data, max_size = MAX_VOCAB_SIZE, 
                 vectors = "glove.6B.100d", 
                 unk_init = torch.Tensor.normal_)

TARGET.build_vocab(train_data)

In [22]:
BATCH_SIZE = 64

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator = data.Iterator(dataset = train_data, batch_size = BATCH_SIZE,device = device, 
                               shuffle = None, train = True, sort_key = lambda x: len(x.review_clean), 
                               sort_within_batch = False)
valid_iterator = data.Iterator(dataset = valid_data, batch_size = BATCH_SIZE,device = device, 
                               shuffle = None, train = False, sort_key = lambda x: len(x.review_clean), 
                               sort_within_batch = False)

In [25]:
INPUT_DIM = len(TEXT.vocab) #vocabulary size 
EMBEDDING_DIM = 100 # embedding dimension
OUTPUT_DIM = 1 # our output has only 2 classes - 0/1. So, it is one-dimensional.
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token] # string to integer method on padding tokens

model = FastText(INPUT_DIM, EMBEDDING_DIM, OUTPUT_DIM, PAD_IDX)

In [27]:
pretrained_embeddings = TEXT.vocab.vectors

model.embedding.weight.data.copy_(pretrained_embeddings)

tensor([[-0.1117, -0.4966,  0.1631,  ...,  1.2647, -0.2753, -0.1325],
        [-0.8555, -0.7208,  1.3755,  ...,  0.0825, -1.1314,  0.3997],
        [-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],
        ...,
        [ 1.5221, -0.3108, -0.2902,  ..., -0.2051, -0.9059, -0.8559],
        [ 0.9666, -0.3822, -0.2585,  ..., -1.0574, -0.6668,  0.1646],
        [ 1.8935, -0.8303,  0.2935,  ..., -0.6399, -1.8376, -1.9168]])

In [28]:
UNK_IDX = TEXT.vocab.stoi[TEXT.unk_token]

model.embedding.weight.data[UNK_IDX] = torch.zeros(EMBEDDING_DIM)
model.embedding.weight.data[PAD_IDX] = torch.zeros(EMBEDDING_DIM)

In [29]:
# optimizer = optim.Adam(model.parameters())

In [37]:
optimizer = optim.SGD(model.parameters(), lr=1e-3)

In [38]:
criterion = nn.BCEWithLogitsLoss()

# moving model and criterion to the selected device (GPU if available)
model = model.to(device)
criterion = criterion.to(device)
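Because the model emits a single logit per review, `nn.BCEWithLogitsLoss` applies the sigmoid internally before computing binary cross-entropy. The pure-Python sketch below only illustrates that computation for one example. The helper names `stable_sigmoid`, `bce_with_logits` and `predict_label` are hypothetical; they are not part of this notebook or of PyTorch.

```python
import math

def stable_sigmoid(x):
    """Sigmoid that avoids overflow for large negative inputs."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def bce_with_logits(logit, target):
    """Binary cross-entropy on a raw logit, in the numerically stable form
    max(x, 0) - x * y + log(1 + exp(-|x|))."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

def predict_label(logit):
    """Turn one logit into a 0/1 prediction by thresholding the sigmoid at 0.5."""
    return 1 if stable_sigmoid(logit) >= 0.5 else 0

# A logit of 0 means "50/50", so the loss equals log(2) for either target
print(bce_with_logits(0.0, 1.0))
print(predict_label(2.0), predict_label(-2.0))
```

This is why `OUTPUT_DIM = 1` is enough for two classes: the sign of the single logit decides the label.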

### Results with Adam Optimizer

In [35]:
N_EPOCHS = 5

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()
    
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    
    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    # to keep model for test set
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut3-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

Epoch: 01 | Epoch Time: 2m 39s
	Train Loss: 0.443 | Train Acc: 91.66%
	 Val. Loss: 0.686 |  Val. Acc: 91.69%
Epoch: 02 | Epoch Time: 2m 28s
	Train Loss: 0.324 | Train Acc: 91.66%
	 Val. Loss: 0.422 |  Val. Acc: 91.63%
Epoch: 03 | Epoch Time: 2m 31s
	Train Loss: 0.248 | Train Acc: 91.74%
	 Val. Loss: 0.409 |  Val. Acc: 92.18%
Epoch: 04 | Epoch Time: 2m 46s
	Train Loss: 0.206 | Train Acc: 92.15%
	 Val. Loss: 0.539 |  Val. Acc: 92.41%
Epoch: 05 | Epoch Time: 2m 24s
	Train Loss: 0.182 | Train Acc: 92.63%
	 Val. Loss: 0.681 |  Val. Acc: 92.34%


The model looks overfit: train loss keeps falling, while validation loss rises again after epoch 3.
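To make this reading of the log concrete, here is a small sketch that takes the loss values printed for the Adam run and reports the epoch with the lowest validation loss, along with the epochs where train loss fell but validation loss rose. The helpers `best_epoch` and `overfit_epochs` are hypothetical names introduced only for this illustration.

```python
# Loss values copied from the Adam run printed above
train_losses = [0.443, 0.324, 0.248, 0.206, 0.182]
valid_losses = [0.686, 0.422, 0.409, 0.539, 0.681]

def best_epoch(valid_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    best_index = min(range(len(valid_losses)), key=lambda i: valid_losses[i])
    return best_index + 1

def overfit_epochs(train_losses, valid_losses):
    """Return 1-based epochs where train loss fell but validation loss rose."""
    return [
        i + 1
        for i in range(1, len(valid_losses))
        if train_losses[i] < train_losses[i - 1]
        and valid_losses[i] > valid_losses[i - 1]
    ]

print(best_epoch(valid_losses))                    # 3
print(overfit_epochs(train_losses, valid_losses))  # [4, 5]
```

The checkpoint saved in the training loop corresponds to the epoch returned by `best_epoch`, since the loop only saves when validation loss improves.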

### Results with SGD Optimizer

In [39]:
# SGD
N_EPOCHS = 5

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()
    
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    
    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    # to keep model for test set
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut4-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

Epoch: 01 | Epoch Time: 2m 15s
	Train Loss: 0.221 | Train Acc: 92.00%
	 Val. Loss: 0.410 |  Val. Acc: 92.22%
Epoch: 02 | Epoch Time: 2m 29s
	Train Loss: 0.221 | Train Acc: 91.98%
	 Val. Loss: 0.411 |  Val. Acc: 92.28%
Epoch: 03 | Epoch Time: 2m 15s
	Train Loss: 0.220 | Train Acc: 91.97%
	 Val. Loss: 0.411 |  Val. Acc: 92.27%
Epoch: 04 | Epoch Time: 2m 33s
	Train Loss: 0.220 | Train Acc: 91.96%
	 Val. Loss: 0.412 |  Val. Acc: 92.27%
Epoch: 05 | Epoch Time: 2m 34s
	Train Loss: 0.220 | Train Acc: 91.95%
	 Val. Loss: 0.412 |  Val. Acc: 92.27%


This is better than the Adam optimizer, but not better than the bi-gram model.

The code below shows how to load the saved model and evaluate it on different, unseen data.

In [40]:
# model.load_state_dict(torch.load('tut4-model.pt'))

# test_loss, test_acc = evaluate(model, test_iterator, criterion)

# print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

Test Loss: 0.402 | Test Acc: 92.25%


### Best Results : 
 - With the SGD optimizer and a learning rate of 1e-3, 91.66% train accuracy and 91.68% validation accuracy were obtained.

# Adding Augmentation To Data

To get better results, I will try adding augmentation to my data.

Further information can be found at https://github.com/makcedward/nlpaug/blob/master/example/textual_augmenter.ipynb.

Below, I try only one method, synonym augmentation.

In [19]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /Users/ezgi/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [22]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/ezgi/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [23]:
# to see differences in normal and augmented texts
aug = naw.SynonymAug(aug_src='wordnet')
text = 'I like this book'
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("Augmented Text:")
print(augmented_text)

Original:
I like this book
Augmented Text:
I care this book


In [24]:
def embedding(text):
    '''Return the text with some words replaced by WordNet synonyms (synonym augmentation).'''
    aug = naw.SynonymAug(aug_src='wordnet')
    augmented_text = aug.augment(text)
    return augmented_text

In [30]:
df.dropna(subset=['review_clean'], inplace=True) # dropping rows with missing reviews

In [31]:
df_aug = df.head(100000) # taking the first 100,000 rows as a sample

In [32]:
df_aug= df_aug.loc[:, ['review_clean', 'sentiment']]

In [33]:
train_aug, test_aug = train_test_split(df_aug, test_size=0.2, random_state = 42) # splitting to use the same part as train

In [34]:
train_aug['review_emb'] = train_aug['review_clean'].apply(embedding)
train_aug.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


| | review_clean | sentiment | review_emb |
|---|---|---|---|
| 75241 | new to me author but he spins a good tale i en... | 2 | new to me author but he spin around a serious ... |
| 48970 | had to keep reading addicted must know what ha... | 2 | get to keep reading hook mustiness know what h... |
| 44979 | good book that will be enjoyed by inmates for ... | 2 | good holy scripture that will live enjoyed by ... |
| 13571 | this book was a nice little slightly erotic re... | 2 | this book was a gracious little somewhat eroti... |
| 92751 | ive never read a work by mary jane clark befor... | 1 | ive never read a work by mary jane clark befor... |


In [35]:
train_aug.to_csv('train_aug.csv', index = False) # to keep augmented dataframe

The reviews were rewritten with synonym replacements. This was the first step toward understanding how data augmentation works. If we feed this data to the torch model as it is, it will not give good results, because the augmentation needs more careful work.
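To show the idea behind synonym replacement without the nlpaug and WordNet dependencies, here is a minimal sketch. It is not how `naw.SynonymAug` is implemented: the synonym table `TOY_SYNONYMS`, the function `toy_synonym_augment`, and its `aug_p` and `seed` parameters are all hypothetical and exist only for this illustration.

```python
import random

# Hypothetical toy synonym table; nlpaug draws its candidates from WordNet instead
TOY_SYNONYMS = {
    "like": ["enjoy", "love"],
    "book": ["novel", "volume"],
    "good": ["fine", "nice"],
}

def toy_synonym_augment(text, aug_p=0.3, seed=0):
    """Replace each word that has a known synonym with probability aug_p."""
    rng = random.Random(seed)  # local generator, so results are repeatable
    augmented = []
    for word in text.split():
        candidates = TOY_SYNONYMS.get(word)
        if candidates and rng.random() < aug_p:
            augmented.append(rng.choice(candidates))
        else:
            augmented.append(word)
    return " ".join(augmented)

# With aug_p=1.0 every word found in the table is replaced
print(toy_synonym_augment("i like this book", aug_p=1.0))
```

Words without an entry in the table are left untouched, which is also why some of the reviews above come back unchanged in places.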

# Future Improvements for This Notebook

After talking with our instructor, Bryan Arnold, we identified some steps to improve this model further:

- Tri-grams and bi-grams were applied to the model separately; a new vocabulary containing both can be formed.
- Data augmentation will be added to the training data.
- The dropout layer and other layers can be changed, or new layers can be added.
- Test-time augmentation can be added.
- I have already run the model with different learning rates, but more values can be tried.
- I will run the model for more epochs. Each epoch takes time, so I only tried 5; this can be increased.
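As a rough sketch of the first item, the helper below appends both bi-grams and tri-grams to a token list, so that one vocabulary could cover both. The name `generate_ngrams` and its `n_values` parameter are assumptions made for this illustration; it is not the n-gram function used earlier in this notebook.

```python
def generate_ngrams(tokens, n_values=(2, 3)):
    """Return the tokens followed by all n-grams for each n in n_values."""
    output = list(tokens)  # copy, so the caller's list is not modified
    for n in n_values:
        for start in range(len(tokens) - n + 1):
            output.append(" ".join(tokens[start:start + n]))
    return output

print(generate_ngrams(["this", "book", "is", "great"]))
# ['this', 'book', 'is', 'great', 'this book', 'book is', 'is great',
#  'this book is', 'book is great']
```

A function like this could presumably be plugged in wherever the current bi-gram or tri-gram preprocessing step is applied, at the cost of a larger vocabulary.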


I will continue to try other deep learning models to find better results and models that are easier to tune. My next step is to work on Keras models.