# 0 TorchText

Stanford Sentiment Treebank V1.0

This is the dataset of the paper:

Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank
Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher Manning, Andrew Ng and Christopher Potts
Conference on Empirical Methods in Natural Language Processing (EMNLP 2013)

If you use this dataset in your research, please cite the above paper.

@incollection{SocherEtAl2013:RNTN,
title = {{Parsing With Compositional Vector Grammars}},
author = {Richard Socher and Alex Perelygin and Jean Wu and Jason Chuang and Christopher Manning and Andrew Ng and Christopher Potts},
booktitle = {{EMNLP}},
year = {2013}
}

This file includes:
1. original_rt_snippets.txt contains 10,605 processed snippets from the original pool of Rotten Tomatoes HTML files. Please note that some snippet may contain multiple sentences.

2. dictionary.txt contains all phrases and their IDs, separated by a vertical line |

3. sentiment_labels.txt contains all phrase ids and the corresponding sentiment labels, separated by a vertical line.
Note that you can recover the 5 classes by mapping the positivity probability using the following cut-offs:
[0, 0.2], (0.2, 0.4], (0.4, 0.6], (0.6, 0.8], (0.8, 1.0]
for very negative, negative, neutral, positive, very positive, respectively.
Please note that phrase ids and sentence ids are not the same.

4. SOStr.txt and STree.txt encode the structure of the parse trees. 
STree encodes the trees in a parent pointer format. Each line corresponds to each sentence in the datasetSentences.txt file. The Matlab code of this paper will show you how to read this format if you are not familiar with it.

5. datasetSentences.txt contains the sentence index, followed by the sentence string separated by a tab. These are the sentences of the train/dev/test sets.

6. datasetSplit.txt contains the sentence index (corresponding to the index in datasetSentences.txt file) followed by the set label separated by a comma:
	1 = train
	2 = test
	3 = dev

Please note that the datasetSentences.txt file has more sentences/lines than the original_rt_snippet.txt. 
Each row in the latter represents a snippet as shown on RT, whereas the former is each sub sentence as determined by the Stanford parser.

For comparing research and training models, please use the provided train/dev/test splits.


## Dataset Preview

Your first step to deep learning in NLP. We will be mostly using PyTorch. Just like torchvision, PyTorch provides an official library, torchtext, for handling text-processing pipelines. 

We will be using previous session tweet dataset. Let's just preview the dataset.

In [None]:
import pandas as pd
df1 = pd.read_table('/content/datasetSentences.txt')
df1.head()

Unnamed: 0,sentence_index,sentence
0,1,The Rock is destined to be the 21st Century 's...
1,2,The gorgeously elaborate continuation of `` Th...
2,3,Effective but too-tepid biopic
3,4,If you sometimes like to go to the movies to h...
4,5,"Emerges as something rare , an issue movie tha..."


In [None]:
#sentence_idx = pd.read_csv('/content/datasetSentences.txt', sep="\t").set_index('sentence_index')
#sentence = pd.read_csv('/content/datasetSentences.txt', sep="\t").set_index('sentence_index')
split = pd.read_csv('/content/datasetSplit.txt', sep=",").set_index('sentence_index')
sentence = pd.read_csv('/content/datasetSentences.txt', sep="\t").set_index('sentence_index')

#df = pd.concat([sentence_idx,sentence, lab],axis=1 ,sort=False)
df = pd.concat([sentence, split],axis=1 ,sort=False)

df.to_csv('/content/Combined_sentiment.csv', index=False)

In [None]:
df.shape
#1364 tweets, 2 columns, tweet and label

(11855, 2)

In [None]:
df.columns

Index(['sentence', 'splitset_label'], dtype='object')

In [None]:
sentiment_label = pd.read_csv('/content/sentiment_labels.txt', sep="|").set_index('phrase ids', 'sentiment values')
dictionary = pd.read_csv('/content/dictionary.txt', sep="|").set_index('!')
sentiment_label.columns
dictionary.columns
df3 = pd.concat([sentiment_label, dictionary],axis=1 ,sort=False)

In [None]:
df3

Unnamed: 0,sentiment values,0
0,0.50000,
1,0.50000,
2,0.44444,
3,0.50000,
4,0.42708,
...,...,...
zoning ordinances to protect your community from the dullest science fiction,,220441.0
zzzzzzzzz,,179256.0
É,,220443.0
É um passatempo descompromissado,,220444.0


In [None]:
df = pd.read_csv('/content/Combined_sentiment.csv')

df['sentence'] = df.sentence.str.replace("'s", '') # Removes nonalphabetic
df['sentence'] = df.sentence.str.lower()

df['sentence'] = df.sentence.str.replace('[^a-zA-Z ]', '') # Removes nonalphabetic
df['sentence'] = df.sentence.str.replace('  ', '') # Removes double-space

In [None]:
df.head()

Unnamed: 0,sentence,splitset_label
0,the rock is destined to be the st centurynewco...,1
1,the gorgeously elaborate continuation ofthe lo...,1
2,effective but tootepid biopic,2
3,if you sometimes like to go to the movies to h...,2
4,emerges as something rarean issue movie thatso...,2


In [None]:
df.splitset_label.value_counts()

1    8544
2    2210
3    1101
Name: splitset_label, dtype: int64

## Defining Fields

Now we shall be defining LABEL as a LabelField, which is a subclass of Field that sets sequential to False (as it’s our numerical category class). TWEET is a standard Field object, where we have decided to use the spaCy tokenizer and convert all the text to lower‐ case.

In [None]:
# Import Library
import random
import torch, torchtext
from torchtext import data 

# Manual Seed
SEED = 43
torch.manual_seed(SEED)

<torch._C.Generator at 0x7ff857ed8d08>

Field class models common text processing datatypes that can be represented by tensors.  It holds a Vocab object that defines the set of possible values for elements of the field and their corresponding numerical representations.
The Field object also holds other parameters relating to how a datatype should be numericalized, such as a tokenization method and the kind of Tensor that should be produced.

Attributes:
sequential: Whether the datatype represents sequential data. If False, no tokenization is applied. Default: True.
use_vocab: Whether to use a Vocab object. If False, the data in this field should already be numerical. Default: True.
init_token: A token that will be prepended to every example using this field, or None for no initial token. Default: None.
fix_length: A fixed length that all examples using this field will bepadded to, or None for flexible sequence lengths. Default: None.
dtype: The torch.dtype class that represents a batch of examples of this kind of data. Default: torch.long.
preprocessing: The Pipeline that will be applied to examples using this field after tokenizing but before numericalizing. Many Datasets replace this attribute with a custom preprocessor.Default: None.
postprocessing: A Pipeline that will be applied to examples using this field after numericalizing but before the numbers are turned
into a Tensor. The pipeline function takes the batch as a list, and the field's Vocab. Default: None.
lower: Whether to lowercase the text in this field. Default: False.
tokenize: The function used to tokenize strings using this field into sequential examples. If "spacy", the SpaCy tokenizer is used. If a non-serializable function is passed as an argument,the field will not be able to be serialized. Default: string.split.
tokenizer_language: The language of the tokenizer to be constructed. Various languages currently supported only in SpaCy.
include_lengths: Whether to return a tuple of a padded minibatch and a list containing the lengths of each examples, or just a padded minibatch. Default: False.
batch_first: Whether to produce tensors with the batch dimension first.Default: False.
pad_token: The string token used as padding. Default: "<pad>".
unk_token: The string token used to represent OOV words. Default: "<unk>".
pad_first: Do the padding of the sequence at the beginning. Default: False.
truncate_first: Do the truncating of the sequence at the beginning. Default: False
stop_words: Tokens to discard during the preprocessing step. Default: None
is_target: Whether this field is a target variable. Affects iteration over batches. Default: False

A label field is a shallow wrapper around a standard field designed to hold labels for a classification task. Its only use is to set the unk_token and sequential to `None` by default

In [None]:
# Import Library
import random
import torch, torchtext
from torchtext import data 

# Manual Seed
SEED = 43
torch.manual_seed(SEED)

<torch._C.Generator at 0x7ff857ed8d08>

In [None]:
sentence = data.Field(sequential = True, tokenize = 'spacy', batch_first =True, include_lengths=True)
split = data.LabelField(tokenize ='spacy', is_target=True, batch_first =True, sequential =False)

Having defined those fields, we now need to produce a list that maps them onto the list of rows that are in the CSV:

In [None]:
fields = [('tweets', sentence),('labels',split)]


Armed with our declared fields, lets convert from pandas to list to torchtext. We could also use TabularDataset to apply that definition to the CSV directly but showing an alternative approach too.

In [None]:
example = [data.Example.fromlist([df.sentence[i],df.splitset_label[i]], fields) for i in range(df.shape[0])] 
vars(example[10])

{'labels': 2,
 'tweets': ['take',
  'care',
  'of',
  'my',
  'cat',
  'offers',
  'a',
  'refreshingly',
  'different',
  'slice',
  'of',
  'asian',
  'cinema']}

In [None]:
# Creating dataset
#twitterDataset = data.TabularDataset(path="tweets.csv", format="CSV", fields=fields, skip_header=True)

stanfordDataset = data.Dataset(example, fields)
vars(stanfordDataset[10])

{'labels': 2,
 'tweets': ['take',
  'care',
  'of',
  'my',
  'cat',
  'offers',
  'a',
  'refreshingly',
  'different',
  'slice',
  'of',
  'asian',
  'cinema']}

Finally, we can split into training, testing, and validation sets by using the split() method:

In [None]:
(train, valid) = stanfordDataset.split(split_ratio=[0.85, 0.15], random_state=random.seed(SEED))

In [None]:
(len(train), len(valid))

(10077, 1778)

In [None]:
#10077+1778

An example from the dataset:

In [None]:
vars(valid.examples[14])

{'labels': 3,
 'tweets': ['a',
  'biggorgeoussprawling',
  'swashbuckler',
  'that',
  'delivers',
  'its',
  'diversions',
  'in',
  'granduncomplicated',
  'fashion']}

In [None]:
vars(train.examples[10])


{'labels': 1,
 'tweets': ['in',
  'other',
  'wordsabout',
  'as',
  'bad',
  'a',
  'film',
  'you',
  're',
  'likely',
  'to',
  'see',
  'all',
  'year']}

## Building Vocabulary

At this point we would have built a one-hot encoding of each word that is present in the dataset—a rather tedious process. Thankfully, torchtext will do this for us, and will also allow a max_size parameter to be passed in to limit the vocabu‐ lary to the most common words. This is normally done to prevent the construction of a huge, memory-hungry model. We don’t want our GPUs too overwhelmed, after all. 

Let’s limit the vocabulary to a maximum of 5000 words in our training set:


In [None]:
sentence.build_vocab(train)
split.build_vocab(train)

By default, torchtext will add two more special tokens, <unk> for unknown words and <pad>, a padding token that will be used to pad all our text to roughly the same size to help with efficient batching on the GPU.

In [None]:
print('Size of input vocab : ', len(sentence.vocab))
print('Size of label vocab : ', len(split.vocab))
print('Top 20 words appreared repeatedly :', list(sentence.vocab.freqs.most_common(20)))
print('Labels : ', split.vocab.stoi)

Size of input vocab :  26545
Size of label vocab :  3
Top 20 words appreared repeatedly : [('the', 7944), ('a', 5730), ('of', 5043), ('and', 4266), ('to', 3524), ('is', 2762), ('in', 2089), ('that', 1954), ('it', 1662), ('as', 1420), ('with', 1170), ('for', 1162), ('its', 1065), ('this', 1057), ('an', 1051), ('film', 1035), ('movie', 937), ('you', 856), ('nt', 784), ('be', 779)]
Labels :  defaultdict(<function _default_unk_index at 0x7ff80783fd08>, {1: 0, 2: 1, 3: 2})


**Lots of stopwords!!**

Now we need to create a data loader to feed into our training loop. Torchtext provides the BucketIterator method that will produce what it calls a Batch, which is almost, but not quite, like the data loader we used on images.

But at first declare the device we are using.

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [None]:
train_iterator, valid_iterator = data.BucketIterator.splits((train, valid), batch_size = 32, 
                                                            sort_key = lambda x: len(x.tweets),
                                                            sort_within_batch=True, device = device)

Save the vocabulary for later use

In [None]:
import os, pickle
with open('tokenizer.pkl', 'wb') as tokens: 
    pickle.dump(sentence.vocab.stoi, tokens)

## Defining Our Model

We use the Embedding and LSTM modules in PyTorch to build a simple model for classifying tweets.

In this model we create three layers. 
1. First, the words in our tweets are pushed into an Embedding layer, which we have established as a 300-dimensional vector embedding. 
2. That’s then fed into a 2 stacked-LSTMs with 100 hidden features (again, we’re compressing down from the 300-dimensional input like we did with images). We are using 2 LSTMs for using the dropout.
3. Finally, the output of the LSTM (the final hidden state after processing the incoming tweet) is pushed through a standard fully connected layer with three outputs to correspond to our three possible classes (negative, positive, or neutral).

In [None]:
import torch.nn as nn
import torch.nn.functional as F

class classifier(nn.Module):
    
    # Define all the layers used in model
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, dropout):
        
        super().__init__()          
        
        # Embedding layer
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        
        # LSTM layer
        self.encoder = nn.LSTM(embedding_dim, 
                           hidden_dim, 
                           num_layers=n_layers, 
                           dropout=dropout,
                           batch_first=True)
        # try using nn.GRU or nn.RNN here and compare their performances
        # try bidirectional and compare their performances
        
        # Dense layer
        self.fc = nn.Linear(hidden_dim, output_dim)
        
    def forward(self, text, text_lengths):
        
        # text = [batch size, sent_length]
        embedded = self.embedding(text)
        # embedded = [batch size, sent_len, emb dim]
      
        # packed sequence
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, text_lengths.cpu(), batch_first=True)
        
        packed_output, (hidden, cell) = self.encoder(packed_embedded)
        #hidden = [batch size, num layers * num directions,hid dim]
        #cell = [batch size, num layers * num directions,hid dim]
    
        # Hidden = [batch size, hid dim * num directions]
        dense_outputs = self.fc(hidden)   
        
        # Final activation function softmax
        output = F.softmax(dense_outputs[0], dim=1)
            
        return output

In [None]:
# Define hyperparameters
size_of_vocab = len(sentence.vocab)
embedding_dim = 300
num_hidden_nodes = 100
num_output_nodes = 3
num_layers = 2
dropout = 0.2

# Instantiate the model
model = classifier(size_of_vocab, embedding_dim, num_hidden_nodes, num_output_nodes, num_layers, dropout = dropout)

In [None]:
print(model)

#No. of trianable parameters
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
    
print(f'The model has {count_parameters(model):,} trainable parameters')

classifier(
  (embedding): Embedding(26545, 300)
  (encoder): LSTM(300, 100, num_layers=2, batch_first=True, dropout=0.2)
  (fc): Linear(in_features=100, out_features=3, bias=True)
)
The model has 8,205,403 trainable parameters


## Model Training and Evaluation

First define the optimizer and loss functions

In [None]:
import torch.optim as optim

# define optimizer and loss
optimizer = optim.Adam(model.parameters(), lr=2e-4)
criterion = nn.CrossEntropyLoss()

# define metric
def binary_accuracy(preds, y):
    #round predictions to the closest integer
    _, predictions = torch.max(preds, 1)
    
    correct = (predictions == y).float() 
    acc = correct.sum() / len(correct)
    return acc
    
# push to cuda if available
model = model.to(device)
criterion = criterion.to(device)

The main thing to be aware of in this new training loop is that we have to reference `batch.tweets` and `batch.labels` to get the particular fields we’re interested in; they don’t fall out quite as nicely from the enumerator as they do in torchvision.

**Training Loop**

In [None]:
def train(model, iterator, optimizer, criterion):
    
    # initialize every epoch 
    epoch_loss = 0
    epoch_acc = 0
    
    # set the model in training phase
    model.train()  
    
    for batch in iterator:
        
        # resets the gradients after every batch
        optimizer.zero_grad()   
        
        # retrieve text and no. of words
        tweet, tweet_lengths = batch.tweets   
        
        # convert to 1D tensor
        predictions = model(tweet, tweet_lengths).squeeze()  
        
        # compute the loss
        loss = criterion(predictions, batch.labels)        
        
        # compute the binary accuracy
        acc = binary_accuracy(predictions, batch.labels)   
        
        # backpropage the loss and compute the gradients
        loss.backward()       
        
        # update the weights
        optimizer.step()      
        
        # loss and accuracy
        epoch_loss += loss.item()  
        epoch_acc += acc.item()    
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

**Evaluation Loop**

In [None]:
def evaluate(model, iterator, criterion):
    
    # initialize every epoch
    epoch_loss = 0
    epoch_acc = 0

    # deactivating dropout layers
    model.eval()
    
    # deactivates autograd
    with torch.no_grad():
    
        for batch in iterator:
        
            # retrieve text and no. of words
            tweet, tweet_lengths = batch.tweets
            
            # convert to 1d tensor
            predictions = model(tweet, tweet_lengths).squeeze()
            
            # compute loss and accuracy
            loss = criterion(predictions, batch.labels)
            acc = binary_accuracy(predictions, batch.labels)
            
            # keep track of loss and accuracy
            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

**Let's Train and Evaluate**

In [None]:
N_EPOCHS = 10
best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
     
    # train the model
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    
    # evaluate the model
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    
    # save the best model
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'saved_weights.pt')
    
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}% \n')

	Train Loss: 0.915 | Train Acc: 68.00%
	 Val. Loss: 0.836 |  Val. Acc: 72.34% 

	Train Loss: 0.838 | Train Acc: 71.99%
	 Val. Loss: 0.833 |  Val. Acc: 72.34% 

	Train Loss: 0.835 | Train Acc: 72.01%
	 Val. Loss: 0.832 |  Val. Acc: 72.34% 

	Train Loss: 0.833 | Train Acc: 72.03%
	 Val. Loss: 0.831 |  Val. Acc: 72.34% 

	Train Loss: 0.831 | Train Acc: 72.12%
	 Val. Loss: 0.831 |  Val. Acc: 72.34% 

	Train Loss: 0.830 | Train Acc: 72.32%
	 Val. Loss: 0.831 |  Val. Acc: 72.23% 

	Train Loss: 0.828 | Train Acc: 72.51%
	 Val. Loss: 0.831 |  Val. Acc: 72.23% 

	Train Loss: 0.827 | Train Acc: 72.79%
	 Val. Loss: 0.831 |  Val. Acc: 72.28% 

	Train Loss: 0.826 | Train Acc: 72.91%
	 Val. Loss: 0.831 |  Val. Acc: 72.23% 

	Train Loss: 0.824 | Train Acc: 73.04%
	 Val. Loss: 0.831 |  Val. Acc: 72.23% 



## Model Testing

In [None]:
#load weights and tokenizer

path='./saved_weights.pt'
model.load_state_dict(torch.load(path));
model.eval();
tokenizer_file = open('./tokenizer.pkl', 'rb')
tokenizer = pickle.load(tokenizer_file)

#inference 

import spacy
nlp = spacy.load('en')

def classify_tweet(tweet):
    
    categories = {0: "Negative", 1:"Positive", 2:"Neutral"}
    
    # tokenize the tweet 
    tokenized = [tok.text for tok in nlp.tokenizer(tweet)] 
    # convert to integer sequence using predefined tokenizer dictionary
    indexed = [tokenizer[t] for t in tokenized]        
    # compute no. of words        
    length = [len(indexed)]
    # convert to tensor                                    
    tensor = torch.LongTensor(indexed).to(device)   
    # reshape in form of batch, no. of words           
    tensor = tensor.unsqueeze(1).T  
    # convert to tensor                          
    length_tensor = torch.LongTensor(length)
    # Get the model prediction                  
    prediction = model(tensor, length_tensor)

    _, pred = torch.max(prediction, 1) 
    
    return categories[pred.item()]

In [None]:
classify_tweet("Take Care of My Cat offers a refreshingly different slice of Asian cinema ")

'Negative'

In [None]:
classify_tweet("This is a good test")

'Negative'

## Discussion on Data Augmentation Techniques 

You might wonder exactly how you can augment text data. After all, you can’t really flip it horizontally as you can an image! :D 

In contrast to data augmentation in images, augmentation techniques on data is very specific to final product you are building. As its general usage on any type of textual data doesn't provides a significant performance boost, that's why unlike torchvision, torchtext doesn’t offer a augmentation pipeline. Due to powerful models as transformers, augmentation tecnhiques are not so preferred now-a-days. But its better to know about some techniques with text that will provide your model with a little more information for training. 

### Synonym Replacement

First, you could replace words in the sentence with synonyms, like so:

    The dog slept on the mat

could become

    The dog slept on the rug

Aside from the dog's insistence that a rug is much softer than a mat, the meaning of the sentence hasn’t changed. But mat and rug will be mapped to different indices in the vocabulary, so the model will learn that the two sentences map to the same label, and hopefully that there’s a connection between those two words, as everything else in the sentences is the same.

### Random Insertion
A random insertion technique looks at a sentence and then randomly inserts synonyms of existing non-stopwords into the sentence n times. Assuming you have a way of getting a synonym of a word and a way of eliminating stopwords (common words such as and, it, the, etc.), shown, but not implemented, in this function via get_synonyms() and get_stopwords(), an implementation of this would be as follows:


In [None]:
def random_insertion(sentence, n): 
    words = remove_stopwords(sentence) 
    for _ in range(n):
        new_synonym = get_synonyms(random.choice(words))
        sentence.insert(randrange(len(sentence)+1), new_synonym) 
    return sentence

## Random Deletion
As the name suggests, random deletion deletes words from a sentence. Given a probability parameter p, it will go through the sentence and decide whether to delete a word or not based on that random probability. Consider of it as pixel dropouts while treating images.

In [None]:
def random_deletion(words, p=0.5):
    words.split(' ') 
    if len(words) == 1: # return if single word
        return ''.join(words)
    remaining = list(filter(lambda x: random.uniform(0,1) > p,words)) 
    if len(remaining) == 0: # if not left, sample a random word
        return ''.join([random.choice(words)]) 
    else:
        return ''.join([i for i in remaining])

### Random Swap
The random swap augmentation takes a sentence and then swaps words within it n times, with each iteration working on the previously swapped sentence. Here we sample two random numbers based on the length of the sentence, and then just keep swapping until we hit n.

In [None]:
def random_swap(sentence, n=5):
  new_sent = sentence.split(' ')
  #print(new_sent)
  length = range(len(new_sent))
  #print(length) 
  for _ in range(n):
      idx1, idx2 = random.sample(length, 2)
      #print(idx1, idx2)
      new_sent[idx1], new_sent[idx2] = new_sent[idx2], new_sent[idx1]
      listToStr = ' '.join([str(elem) for elem in new_sent])  
  return listToStr

For more on this please go through this [paper](https://arxiv.org/pdf/1901.11196.pdf).

### Back Translation

Another popular approach for augmenting text datasets is back translation. This involves translating a sentence from our target language into one or more other languages and then translating all of them back to the original language. We can use the Python library googletrans for this purpose. 

In [None]:
#!pip uninstall googletrans
#!git clone https://github.com/BoseCorp/py-googletrans.git


In [None]:
!pip install googletrans==3.1.0a0

In [None]:

#import random
import googletrans
from googletrans import Translator

translator = Translator()
sentence = ['The dog slept on the rug']

available_langs = list(googletrans.LANGUAGES.keys()) 
trans_lang = random.choice(available_langs) 
#trans_lang = 'ja'
print(f"Translating to {googletrans.LANGUAGES[trans_lang]}")

translations = translator.translate(sentence, dest=trans_lang) 
t_text = [t.text for t in translations]
print(t_text)

translations_en_random = translator.translate(t_text, src=trans_lang, dest='en') 
en_text = [t.text for t in translations_en_random]
print(en_text)

Translating to armenian
['Շունը քնում էր գորգի վրա']
['The dog was sleeping on the carpet']


In [None]:
#testing
import random
import googletrans
from googletrans import Translator

available_langs = list(googletrans.LANGUAGES.keys()) 
trans_lang = random.choice(available_langs) 

translator = Translator()
sentence = ['The dog slept on the rug']
translations = translator.translate(sentence, dest=trans_lang) 
type(translations)

t_text = []
for t in iter(translations):
  print(t.text)
  t_text.append(t.text)

הכלב ישן על השטיח


In [None]:
def back_translate(sentence):
  translator = Translator()
  available_langs = list(googletrans.LANGUAGES.keys()) 
  trans_lang = random.choice(available_langs)
  translations = translator.translate(sentence, dest=trans_lang)
  print(translations)
  #print(dir(translations))
  t_text = []
  for t in iter(translations):
    print(t.text)
    t_text.append(t.text)
  #t_text = [t.text for t in translations]
  translations_en_random = translator.translate(t_text, src=trans_lang, dest='en') 
  en_text = [t.text for t in translations_en_random]
  return en_text



In [None]:
sentence = ['The dog slept on the rug']
back_translate(sentence)


[<googletrans.models.Translated object at 0x7ff7a18b8978>]
Собака спала на килимку


['The dog slept on the mat']

In [None]:
sentence = ['The dog slept on the rug']
#sentence[0]
random_deletion(sentence[0], p=0.2)

'The dog slept on th ug'

In [None]:
sentence1 = 'The dog slept on the rug'
print(random_swap(sentence1))

['The', 'dog', 'slept', 'on', 'the', 'rug']
range(0, 6)
1 3
2 0
3 4
3 4
0 4
the on The dog slept rug


@inproceedings{wei-zou-2019-eda,
    title = "{EDA}: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks",
    author = "Wei, Jason  and
      Zou, Kai",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)",
    month = nov,
    year = "2019",
    address = "Hong Kong, China",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/D19-1670",
    pages = "6383--6389",
}

In [None]:
pip install -U nltk

Collecting nltk
[?25l  Downloading https://files.pythonhosted.org/packages/92/75/ce35194d8e3022203cca0d2f896dbb88689f9b3fce8e9f9cff942913519d/nltk-3.5.zip (1.4MB)
[K     |▎                               | 10kB 22.3MB/s eta 0:00:01[K     |▌                               | 20kB 28.6MB/s eta 0:00:01[K     |▊                               | 30kB 20.5MB/s eta 0:00:01[K     |█                               | 40kB 18.8MB/s eta 0:00:01[K     |█▏                              | 51kB 21.1MB/s eta 0:00:01[K     |█▍                              | 61kB 15.6MB/s eta 0:00:01[K     |█▋                              | 71kB 15.1MB/s eta 0:00:01[K     |█▉                              | 81kB 14.7MB/s eta 0:00:01[K     |██                              | 92kB 14.6MB/s eta 0:00:01[K     |██▎                             | 102kB 15.1MB/s eta 0:00:01[K     |██▌                             | 112kB 15.1MB/s eta 0:00:01[K     |██▊                             | 122kB 15.1MB/s eta 0:00:01[K 

In [None]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [None]:
#!python eda_nlp/code/augment.py --input=tweets.csv  --output=tweets_augmented.txt 
!python eda_nlp/code/augment.py --input=tweets.csv  --output=tweets_augmented.txt --num_aug=16 --alpha_sr=0.05 --alpha_rd=0.1 --alpha_ri=0.0 --alpha_rs=0.0

python3: can't open file 'eda_nlp/code/augment.py': [Errno 2] No such file or directory


In [None]:
  !git clone https://github.com/jasonwei20/eda_nlp.git

Cloning into 'eda_nlp'...
remote: Enumerating objects: 14, done.[K
remote: Counting objects: 100% (14/14), done.[K
remote: Compressing objects: 100% (10/10), done.[K
remote: Total 393 (delta 4), reused 10 (delta 4), pack-reused 379
Receiving objects: 100% (393/393), 20.41 MiB | 23.73 MiB/s, done.
Resolving deltas: 100% (187/187), done.


In [None]:

def read_text():
    ifname = 'SOStr.txt'
    lines = open(ifname, 'r').read().split('\n')

    texts = []
    for line in lines:
        params = line.split('|')
        if len(params) > 1:
            text = ' '.join(params)
            texts.append(text)

    return texts
texts=read_text()
len(texts)

11855

In [None]:
def read_splitlabel():
    ifname = 'datasetSplit.txt'
    lines = open(ifname, 'r').read().split('\n')

    splitlabels = []
    for line in lines[1:]:
        params = line.split(',')
        if len(params) == 2:
            splitlabels.append(int(params[1]))
    
    return splitlabels
splitlabels=read_splitlabel()
len(splitlabels)

11855

In [None]:
def read_sentiscore():
	ifname = 'sentiment_labels.txt'
	lines = open(ifname, 'r').read().split('\n')

	sentiscores = []
	for line in lines[1:]:
		params = line.split('|')
		if len(params) == 2:
			sentiscores.append(float(params[1]))

	return sentiscores
sentiscores=read_sentiscore()
len(sentiscores)

239232

In [None]:
def read_phraseid():
    ifname = 'dictionary.txt'
    lines = open(ifname, 'r').read().split('\n')

    phraseid = {}
    for line in lines:
        params = line.split('|')
        if len(params) == 2:
            phraseid[params[0]] = int(params[1])

    return phraseid
phraseid=read_phraseid()
len(phraseid)

239232

In [None]:
def prepare_valence():
    texts = read_text()
    splitlabels = read_splitlabel()
    sentiscores = read_sentiscore()
    phraseid = read_phraseid()

    train_text = []
    train_label = []
    
    valid_text = []
    valid_label = []

    test_text = []
    test_label = []

    n_sample = len(texts)
    if n_sample == len(splitlabels) and len(sentiscores) == len(phraseid):
        print('%d samples'%(n_sample))
    else:
        print('reading fail')

    for i, didx in enumerate(splitlabels):
        if didx == 1:
            list_text = train_text
            list_label = train_label
        elif didx == 3:
            list_text = valid_text
            list_label = valid_label
        elif didx == 2:
            list_text = test_text
            list_label = test_label

        list_text.append(texts[i])
        list_label.append(sentiscores[phraseid[texts[i]]])
        
    return train_text,train_label,test_text,test_label,valid_text,valid_label

In [None]:

def labelize(text,label):
        y = []
        for l in label:
            if l <= 0.2:
                y.append(0)
            elif l <= 0.4:
                y.append(1)
            elif l <= 0.6:
 fied               y.append(2)
            elif l <= 0.8:
                y.append(3)
            else:
                y.append(4)
        print(len(y))
        return (text, y)


In [None]:
import pandas as pd
testdf= pd.DataFrame(test_text,test_label)
testdf.head


NameError: ignored

In [None]:
train_tx,train_l,test_tx,test_l,valid_tx,valid_l=prepare_valence()
train_text,train_label=labelize(train_tx,train_l)    

test_text,test_label=labelize(test_tx,test_l)
valid_text,valid_label=labelize(valid_tx,valid_l)

11855 samples
8544
2210
1101


In [None]:
(len(train_tx), len(valid_tx))

(8544, 1101)

In [None]:
print(type(train_tx))

<class 'list'>


In [None]:
new_train_tx = []
count = 0 
for tweet in train_tx:
  translator = Translator()
  available_langs = list(googletrans.LANGUAGES.keys()) 
  trans_lang = random.choice(available_langs)
  translations = translator.translate(sentence, dest=trans_lang)
  #print(translations)
  #print(dir(translations))
  t_text = []
  for t in iter(translations):
    #print(t.text)
    t_text.append(t.text)
  #t_text = [t.text for t in translations]
  translations_en_random = translator.translate(t_text, src=trans_lang, dest='en') 
  tweet = [t.text for t in translations_en_random]
  #tweet = back_translate(tweet)
  #print("Tweet after tranlate: ", tweet)
  tweet = random_swap(tweet[0])
  #print("tweet after swap: ",tweet)
  tweet = random_deletion(tweet, p=0.5)
  #print("tweet after delete: ",tweet)
  new_train_tx.append(tweet)
  count += 1
  print("count: ", count)

# Pickling the augmented train set

In [None]:
import os, pickle
with open('augmented_train_set.pkl', 'wb') as tokens: 
    pickle.dump(new_train_tx, tokens)

In [None]:
import csv

with open('augmented_train_set.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerows(new_train_tx)

In [None]:
from google.colab import files
files.download('augmented_train_set.csv')

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

In [None]:
from google.colab import files
files.download('augmented_train_set.pkl')

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

In [None]:
print(new_train_tx)

['t g seprugon Te', 'ruep do en', 'hepg n rg', 'g dog tesl Te', 'Teogru te lt on', 'dgon thset g Te', 'sltg tehe oo', 'the  oset o The', 'dog  le oherug', 'r do spt he', 'Thel onrhdo', 'p e  on Th u', 'p d e o  u', 'sep  gon do he', 'sle do T oh u', 'setorug e', 'e sleptthe g', 'uept oheo', 'lp rug on dg h', 'udpt Thone', 'slt g Theh n', 'tledog he o u', 'g o The lpdogth', 'h o  slepdg rug', 'ug  thdg o slept', 'leptohehe g ', 'n doug sle teh', 'te hele  r', 'sleptrg oThdog', 'onp Te geg', 'The eu n lt', 'Thedoegslet', 'h the on rug sle', 'hrg let do t', 'rg slephe The o', 'doge sl on gh', 'ug g sptte he o', 'Ththe r eptgon', 'leptruggon h he', 'dg T set n thug', 'etru dg on T', 'tThruo onte', 'r Te g ote', 'rte  o h o', 'Th uo dot l', 'o he sp  en', 'Thdoote ru slt', 'sp r e T', 'herusl dh ', 'e oslpton r', 'uTeohe ep', 'Th g dog on st th', 'rgTheept th o ', 'ohe slru e on', 'slp n o Teh ug', 'u on t doset he', 'let ug  h g he', 'ug Tedg ehe', 'eptdog ohT', 'n rut gTh ep', ' og slepht

In [None]:
print(valid)

<torchtext.data.dataset.Dataset object at 0x7ff856255550>


In [None]:
# Import Library
import random
import torch, torchtext
from torchtext import data 

# Manual Seed
SEED = 43
torch.manual_seed(SEED)

<torch._C.Generator at 0x7ff857ed8d08>

In [None]:
Tweet = data.Field(sequential = True, tokenize = 'spacy', batch_first =True, include_lengths=True)
Label = data.LabelField(tokenize ='spacy', is_target=True, batch_first =True, sequential =False)

In [None]:
fields = [('tweets', Tweet),('labels',Label)]

In [None]:
example = [data.Example.fromlist([new_train_tx,train_label], fields) for i in range(len(new_train_tx))] 

In [None]:
twitterDataset = data.Dataset(example, fields)

In [None]:
(train, valid) = twitterDataset.split(split_ratio=[0.95, 0.05], random_state=random.seed(SEED))

In [None]:
vars(train.examples[10])

{'labels': [3,
  4,
  3,
  2,
  3,
  4,
  4,
  3,
  4,
  3,
  4,
  2,
  4,
  2,
  4,
  3,
  3,
  2,
  4,
  3,
  4,
  3,
  2,
  4,
  2,
  2,
  4,
  3,
  4,
  1,
  4,
  3,
  2,
  3,
  4,
  3,
  3,
  3,
  2,
  3,
  3,
  4,
  2,
  3,
  2,
  0,
  3,
  3,
  1,
  4,
  3,
  4,
  3,
  4,
  4,
  4,
  4,
  3,
  3,
  3,
  2,
  3,
  3,
  3,
  1,
  1,
  4,
  3,
  4,
  4,
  3,
  2,
  1,
  3,
  2,
  2,
  3,
  4,
  4,
  4,
  4,
  3,
  2,
  4,
  3,
  3,
  4,
  4,
  3,
  4,
  3,
  3,
  1,
  4,
  3,
  3,
  4,
  3,
  3,
  4,
  3,
  4,
  3,
  3,
  3,
  4,
  3,
  3,
  3,
  3,
  3,
  3,
  3,
  4,
  3,
  3,
  4,
  4,
  3,
  3,
  3,
  2,
  3,
  2,
  3,
  3,
  3,
  4,
  1,
  2,
  2,
  2,
  4,
  3,
  3,
  3,
  3,
  2,
  4,
  4,
  3,
  4,
  3,
  3,
  4,
  3,
  3,
  3,
  4,
  3,
  3,
  3,
  3,
  3,
  3,
  4,
  4,
  4,
  3,
  4,
  4,
  3,
  3,
  4,
  2,
  4,
  3,
  2,
  1,
  3,
  2,
  3,
  3,
  1,
  4,
  4,
  3,
  3,
  3,
  3,
  3,
  3,
  3,
  2,
  3,
  3,
  2,
  3,
  3,
  2,
  3,
  3,
  3,
  2,
  4,
  3,
  1,
  2,


In [None]:
Tweet.build_vocab(train)
Label.build_vocab(train)

In [None]:
print('Size of input vocab : ', len(Tweet.vocab))
print('Size of label vocab : ', len(Label.vocab))
print('Top 10 words appreared repeatedly :', list(Tweet.vocab.freqs.most_common(10)))
print('Labels : ', Label.vocab.stoi)

Size of input vocab :  8536
Size of label vocab :  5
Top 10 words appreared repeatedly : [('I like it .', 16234), ("For a film that 's being advertised as a comedy , Sweet Home Alabama is n't as funny as you 'd hoped .", 16234), ('See it .', 16234), ('True Hollywood Story .', 16234), ('Too bad .', 16234), ("What 's next ?", 16234), ('Every joke is repeated at least four times .', 16234), ("` Stock up on silver bullets for director Neil Marshall 's intense freight train of a film . '", 16234), ('One of the worst movies of the year .', 16234), ('... a pretentious mess ...', 16234)]
Labels :  defaultdict(<function _default_unk_index at 0x7ff80783fd08>, {3: 0, 1: 1, 2: 2, 4: 3, 0: 4})


In [None]:
train_iterator, valid_iterator = data.BucketIterator.splits((train, valid), batch_size = 32, 
                                                            sort_key = lambda x: len(x.tweets),
                                                            sort_within_batch=True, device = device)

In [None]:
import os, pickle
with open('tokenizer_aug.pkl', 'wb') as tokens: 
    pickle.dump(Tweet.vocab.stoi, tokens)

In [None]:
print(train_iterator)
for i in train_iterator:
  print(i)

<torchtext.data.iterator.BucketIterator object at 0x7ff72ef76b38>


TypeError: ignored

# Training Loop

In [None]:
def train(model, iterator, optimizer, criterion):
    
    # initialize every epoch 
    epoch_loss = 0
    epoch_acc = 0
    
    # set the model in training phase
    model.train()  
    
    for batch in iterator:
        
        # resets the gradients after every batch
        optimizer.zero_grad()   
        
        
        # retrieve text and no. of words
        tweet, tweet_lengths = batch

        # randomly deleting the words in the sentence
        #tweet = back_translate(tweet)
        # print(tweet)
        # tweet = random_swap(tweet)
        # print("tweet after swap",tweet)
        # tweet = random_deletion(tweet, p=0.5)
        # print("tweet after delete",tweet)
        

        
        # convert to 1D tensor
        predictions = model(tweet, tweet_lengths).squeeze()  
        
        # compute the loss
        loss = criterion(predictions, batch.labels)        
        
        # compute the binary accuracy
        acc = binary_accuracy(predictions, batch.labels)   
        
        # backpropage the loss and compute the gradients
        loss.backward()       
        
        # update the weights
        optimizer.step()      
        
        # loss and accuracy
        epoch_loss += loss.item()  
        epoch_acc += acc.item()    
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

# Evaluation

In [None]:
def evaluate(model, iterator, criterion):
    
    # initialize every epoch
    epoch_loss = 0
    epoch_acc = 0

    # deactivating dropout layers
    model.eval()
    
    # deactivates autograd
    with torch.no_grad():
    
        for batch in iterator:
        
            # retrieve text and no. of words
            tweet, tweet_lengths = batch.tweets
            
            # convert to 1d tensor
            predictions = model(tweet, tweet_lengths).squeeze()
            
            # compute loss and accuracy
            loss = criterion(predictions, batch.labels)
            acc = binary_accuracy(predictions, batch.labels)
            
            # keep track of loss and accuracy
            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

# Train and Evaluate

In [None]:
N_EPOCHS = 10
best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
     
    # train the model
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    
    # evaluate the model
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    
    # save the best model
    if valid_loss < best_valid_loss:
        best_valid_losSessions = valid_loss
        torch.save(mSessionodel.state_dict(), 'saved_weights.pt')
    
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}% \n')

TypeError: ignored