# 0 TorchText

## Dataset Preview

Your first step to deep learning in NLP. We will be mostly using PyTorch. Just like torchvision, PyTorch provides an official library, torchtext, for handling text-processing pipelines. 

We will be using previous session tweet dataset. Let's just preview the dataset.

In [2]:
from google.colab import drive
drive.mount('/content/gdrive/')

Mounted at /content/gdrive/


In [3]:
!cp '/content/gdrive/My Drive/EVA/stanfordSentimentTreebank.zip' stanfordSentimentTreebank.zip
!unzip -q -o stanfordSentimentTreebank.zip -d stanfordSentimentTreebank

In [4]:
import pandas as pd
df = pd.read_csv('/content/stanfordSentimentTreebank/stanfordSentimentTreebank/datasetSentences.txt',sep='\t')
df.tail()

Unnamed: 0,sentence_index,sentence
11850,11851,A real snooze .
11851,11852,No surprises .
11852,11853,We 've seen the hippie-turned-yuppie plot befo...
11853,11854,Her fans walked out muttering words like `` ho...
11854,11855,In this case zero .


In [5]:
df.shape

(11855, 2)

In [78]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11855 entries, 0 to 11854
Data columns (total 2 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   sentence_index  11855 non-null  int64 
 1   sentence        11855 non-null  object
dtypes: int64(1), object(1)
memory usage: 185.4+ KB


## Defining Fields

In [7]:
from torch.utils.data import DataLoader
from torch.optim.lr_scheduler import ReduceLROnPlateau

In [8]:
class StanfordDatasetReader():
  def __init__(self, sst_dir, split_idx):
    
    merged_dataset = self.get_merged_dataset(sst_dir)
    merged_dataset['sentiment values'] = merged_dataset['sentiment values'].astype(float)
    self.dataset = merged_dataset[merged_dataset["splitset_label"] == split_idx]
    # self.dataset["Revised_Sentiment"] = self.discretize_label(self.dataset.iloc[5])
    self.dataset['Revised_sentiment values'] = self.dataset.apply(lambda x: labelfunc(x["sentiment values"]), axis=1)
    # train_st_data['Revised_sentiment values'] = train_st_data.apply(lambda x: myfunc(x["sentiment values"]), axis=1)
  # https://github.com/iamsimha/conv-sentiment-analysis/blob/master/code/dataset_reader.py
  def get_merged_dataset(self, sst_dir):

    sentiment_labels = pd.read_csv(os.path.join(sst_dir, "sentiment_labels.txt"), sep="|")
    sentence_ids = pd.read_csv(os.path.join(sst_dir, "datasetSentences.txt"), sep="\t")
    dictionary = pd.read_csv(os.path.join(sst_dir, "dictionary.txt"), sep="|", names=['phrase', 'phrase ids'])
    train_test_split = pd.read_csv(os.path.join(sst_dir, "datasetSplit.txt"))
    sentence_phrase_merge = pd.merge(sentence_ids, dictionary, left_on='sentence', right_on='phrase')
    sentence_phrase_split = pd.merge(sentence_phrase_merge, train_test_split, on='sentence_index')
    return pd.merge(sentence_phrase_split, sentiment_labels, on='phrase ids').sample(frac=1)

  def discretize_label(self, label):
    print(type(label))
    if label <= 0.2: return 0
    if label <= 0.4: return 1
    if label <= 0.6: return 2
    if label <= 0.8: return 3
    return 4

  def word_to_index(self, word):
    if word in self.w2i:
      return self.w2i[word]
    else:
      return self.w2i["<OOV>"]

  def __len__(self):
    return self.dataset.shape[0]
    
  # def __getitem__(self, idx):
  #   return {"sentence": [self.word_to_index(x) for x in self.dataset.iloc[idx, 1].split()],
  #           "label": self.discretize_label(self.dataset.iloc[idx, 5])}
  def labelfunc(label):
    if label <= 0.5: return 0
    if label <= 0.4: return 1
    if label <= 0.6: return 2
    if label <= 0.8: return 3
    return 4

  def get_data(self):
    return self.dataset

  def __getitem__(self, idx):
    return {"sentence": [x for x in self.dataset.iloc[idx, 1].split()],
            "label": self.discretize_label(self.dataset.iloc[idx, 5])}

In [9]:
def labelfunc(label):
  if label <= 0.4: return 0
  if label <= 0.6: return 1
  if label <= 0.6: return 2
  if label <= 0.8: return 3
  return 4

In [10]:
import os
def load_data(sst_dir="/content/stanfordSentimentTreebank/stanfordSentimentTreebank/"):
  train_st_data_cl = StanfordDatasetReader(sst_dir, 1).get_data()
  # train_st_data_cl['Revised_sentiment values'] = train_st_data.apply(lambda x: labelfunc(x["sentiment values"]), axis=1)
  test_st_data_cl = StanfordDatasetReader(sst_dir, 2).get_data()
  # test_st_data_cl['Revised_sentiment values'] = test_st_data_cl.apply(lambda x: labelfunc(x["sentiment values"]), axis=1)
  validation_st_data_cl = StanfordDatasetReader(sst_dir, 3).get_data()
  # validation_st_data_cl['Revised_sentiment values'] = validation_st_data_cl.apply(lambda x: labelfunc(x["sentiment values"]), axis=1)
  return train_st_data_cl,test_st_data_cl,validation_st_data_cl

In [11]:
train_st_data,test_st_data,validation_st_data = load_data()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


In [12]:
train_st_data.head()

Unnamed: 0,sentence_index,sentence,phrase,phrase ids,splitset_label,sentiment values,Revised_sentiment values
8925,9349,Kung Pow seems like some futile concoction tha...,Kung Pow seems like some futile concoction tha...,185644,1,0.19444,0
6668,6976,Jaglom offers the none-too-original premise th...,Jaglom offers the none-too-original premise th...,146820,1,0.26389,0
6998,7322,"is the kind of movie that 's critic-proof , si...","is the kind of movie that 's critic-proof , si...",163462,1,0.26389,0
4918,5145,Waiting for Godard can be fruitful : ` In Prai...,Waiting for Godard can be fruitful : ` In Prai...,110911,1,0.75,3
10001,10492,"But , no , we get another scene , and then ano...","But , no , we get another scene , and then ano...",222713,1,0.41667,1


In [13]:
test_st_data.tail()

Unnamed: 0,sentence_index,sentence,phrase,phrase ids,splitset_label,sentiment values,Revised_sentiment values
7604,7963,The worst kind of independent ; the one where ...,The worst kind of independent ; the one where ...,149995,2,0.41667,1
682,708,Smith profiles five extraordinary American hom...,Smith profiles five extraordinary American hom...,26603,2,0.73958,3
776,806,The perfect film for those who like sick comed...,The perfect film for those who like sick comed...,26882,2,0.63889,3
395,405,Although it bangs a very cliched drum at times...,Although it bangs a very cliched drum at times...,18632,2,0.75,3
7981,8360,Extremely dumb .,Extremely dumb .,223315,2,0.055556,0


In [14]:
validation_st_data.head()

Unnamed: 0,sentence_index,sentence,phrase,phrase ids,splitset_label,sentiment values,Revised_sentiment values
7215,7554,A coarse and stupid gross-out .,A coarse and stupid gross-out .,143051,3,0.29167,0
7286,7633,Characters still need to function according to...,Characters still need to function according to...,144414,3,0.33333,0
7331,7681,This is the sort of burly action flick where o...,This is the sort of burly action flick where o...,150256,3,0.27778,0
7344,7697,By the miserable standards to which the slashe...,By the miserable standards to which the slashe...,222853,3,0.5,1
7105,7435,"For starters , the story is just too slim .","For starters , the story is just too slim .",223402,3,0.33333,0


### Further NLP Augemnattion

In [15]:
!pip install nlpaug

Collecting nlpaug
[?25l  Downloading https://files.pythonhosted.org/packages/66/40/18941536abc63578010e87808089eb3184a8d027df03bfc226894698f491/nlpaug-1.1.0-py3-none-any.whl (380kB)
[K     |▉                               | 10kB 19.1MB/s eta 0:00:01[K     |█▊                              | 20kB 18.4MB/s eta 0:00:01[K     |██▋                             | 30kB 11.0MB/s eta 0:00:01[K     |███▍                            | 40kB 9.3MB/s eta 0:00:01[K     |████▎                           | 51kB 9.8MB/s eta 0:00:01[K     |█████▏                          | 61kB 9.7MB/s eta 0:00:01[K     |██████                          | 71kB 9.9MB/s eta 0:00:01[K     |██████▉                         | 81kB 10.7MB/s eta 0:00:01[K     |███████▊                        | 92kB 10.8MB/s eta 0:00:01[K     |████████▋                       | 102kB 10.7MB/s eta 0:00:01[K     |█████████▌                      | 112kB 10.7MB/s eta 0:00:01[K     |██████████▎                     | 122kB 10.7MB/s e

In [16]:
# !pip install transformers

In [17]:
## Lets do the NLP data augmentation
import nlpaug.augmenter.char as nac
import nlpaug.augmenter.word as naw
import nlpaug.augmenter.sentence as nas
import nlpaug.flow as nafc

from nlpaug.util import Action

##### Some basic examples for understanding and then further data augmentation by these
- Substitute word by WordNet's synonym
- Swap word randomly
- Delete a set of contunous word will be removed randomly
- Delete word randomly augemnattion

In [18]:
# validation_st_data.head()
train_st_data['sentence'].iloc[0]

'Kung Pow seems like some futile concoction that was developed hastily after Oedekerk and his fellow moviemakers got through crashing a college keg party .'

In [19]:
aug = naw.SynonymAug(aug_src='wordnet') ## Substitute word by WordNet's synonym¶

augmented_text = aug.augment(train_st_data['sentence'].iloc[0])
print("Original:")
print(train_st_data['sentence'].iloc[0])
print("Augmented Text:")
print(augmented_text)
train_st_data_SynonymAug_aug = train_st_data
train_st_data_SynonymAug_aug['sentence_aug'] = train_st_data_SynonymAug_aug.apply(lambda x: aug.augment(x['sentence']),axis=1)  ## Swap word randomly¶

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
Original:
Kung Pow seems like some futile concoction that was developed hastily after Oedekerk and his fellow moviemakers got through crashing a college keg party .
Augmented Text:
Kung Pow appear comparable some futile concoction that was developed hastily after Oedekerk and his fellow moviemakers buzz off through crashing a college keg party.


In [20]:
aug = naw.RandomWordAug(action="swap") # Swap word randomly¶

augmented_text = aug.augment(train_st_data['sentence'].iloc[0])
print("Original:")
print(train_st_data['sentence'].iloc[0])
print("Augmented Text:")
print(augmented_text)
train_st_data_swap_aug = train_st_data
train_st_data_swap_aug['sentence_aug'] = train_st_data_swap_aug.apply(lambda x: aug.augment(x['sentence']),axis=1)  ## Swap word randomly¶

Original:
Kung Pow seems like some futile concoction that was developed hastily after Oedekerk and his fellow moviemakers got through crashing a college keg party .
Augmented Text:
Kung Pow seems like futile some concoction that was developed hastily after and Oedekerk his fellow moviemakers got through a crashing keg party college.


In [21]:
# aug = naw.RandomWordAug(action='crop',aug_p=0.5, aug_min=0)
# augmented_text = aug.augment(train_st_data['sentence'].iloc[0])  ## Delete a set of contunous word will be removed randomly¶

# print("Original:")
# print(train_st_data['sentence'].iloc[0])
# print("Augmented Text:")
# print(augmented_text)
# train_st_data_crop_aug = train_st_data
# train_st_data_crop_aug['sentence_aug'] = train_st_data_crop_aug.apply(lambda x: aug.augment(x['sentence']),axis=1)  ## Delete a set of contunous word will be removed randomly¶

In [22]:
text = 'The quick brown fox jumps over the lazy dog .'
# Augmenter that apply random word operation to textual input.Augmenter that apply randomly behavior for augmentation.
aug = naw.RandomWordAug()
augmented_data = aug.augment(text)
augmented_data

train_st_data_delete_aug = train_st_data
# train_st_data_aug[sentence_aug] = aug.augment(train_st_data_aug.loc["sentence"] )
#--Using position to slice Email using a lambda function
train_st_data_delete_aug['sentence_aug'] = train_st_data_delete_aug.apply(lambda x: aug.augment(x['sentence']),axis=1)  ## Delete word randomly augemnattion


In [23]:
print("Original:")
print(train_st_data_delete_aug['sentence'].iloc[0])
print("Augmented Text:")
print(train_st_data_delete_aug['sentence_aug'].iloc[0])

Original:
Kung Pow seems like some futile concoction that was developed hastily after Oedekerk and his fellow moviemakers got through crashing a college keg party .
Augmented Text:
Pow seems futile concoction was developed hastily after Oedekerk and his fellow moviemakers crashing a college keg.


In [24]:
train_st_data_delete_aug.head()

Unnamed: 0,sentence_index,sentence,phrase,phrase ids,splitset_label,sentiment values,Revised_sentiment values,sentence_aug
8925,9349,Kung Pow seems like some futile concoction tha...,Kung Pow seems like some futile concoction tha...,185644,1,0.19444,0,Pow seems futile concoction was developed hast...
6668,6976,Jaglom offers the none-too-original premise th...,Jaglom offers the none-too-original premise th...,146820,1,0.26389,0,Jaglom offers - too - premise that involved mo...
6998,7322,"is the kind of movie that 's critic-proof , si...","is the kind of movie that 's critic-proof , si...",163462,1,0.26389,0,"is the kind of ' critic -, simply because aims..."
4918,5145,Waiting for Godard can be fruitful : ` In Prai...,Waiting for Godard can be fruitful : ` In Prai...,110911,1,0.75,3,Waiting for can: ` In of Love ' is ' s epitaph...
10001,10492,"But , no , we get another scene , and then ano...","But , no , we get another scene , and then ano...",222713,1,0.41667,1,", no, we get another, and another."


In [25]:
## Now I need to add all these data frames
combined_data_aug = pd.concat([train_st_data_delete_aug, train_st_data_swap_aug, train_st_data_SynonymAug_aug], axis=0)
## after this, now I need to drop the sentence column and rename sentence_aug to sentence
combined_data_aug.drop('sentence', axis=1, inplace=True)
combined_data_aug.rename(columns = {'sentence_aug':'sentence'}, inplace = True) 

In [26]:
combined_data_aug.head()

Unnamed: 0,sentence_index,phrase,phrase ids,splitset_label,sentiment values,Revised_sentiment values,sentence
8925,9349,Kung Pow seems like some futile concoction tha...,185644,1,0.19444,0,Pow seems futile concoction was developed hast...
6668,6976,Jaglom offers the none-too-original premise th...,146820,1,0.26389,0,Jaglom offers - too - premise that involved mo...
6998,7322,"is the kind of movie that 's critic-proof , si...",163462,1,0.26389,0,"is the kind of ' critic -, simply because aims..."
4918,5145,Waiting for Godard can be fruitful : ` In Prai...,110911,1,0.75,3,Waiting for can: ` In of Love ' is ' s epitaph...
10001,10492,"But , no , we get another scene , and then ano...",222713,1,0.41667,1,", no, we get another, and another."


### Final Data Preparation

In [27]:
def get_final_data(train_st_data,test_st_data,validation_st_data,combined_data_aug):
  train_st_data_final = train_st_data.drop(['sentence_index','phrase','phrase ids','splitset_label','sentiment values'],axis=1)
  train_st_data_final.rename(columns = {'Revised_sentiment values': 'sentiment'}, inplace = True)

  combined_data_aug.drop(['sentence_index','phrase','phrase ids','splitset_label','sentiment values'],axis=1,inplace=True)
  combined_data_aug.rename(columns = {'Revised_sentiment values': 'sentiment'}, inplace = True)

  train_st_data_final_mixed = pd.concat([combined_data_aug, train_st_data_final], axis=0)

  train_st_data_final_mixed = train_st_data_final_mixed.reset_index(drop=True) ## This is being done because data.Example.fromlist was failing

  test_st_data_final = test_st_data.drop(['sentence_index','phrase','phrase ids','splitset_label','sentiment values'],axis=1)
  test_st_data_final.rename(columns = {'Revised_sentiment values': 'sentiment'}, inplace = True)
  test_st_data_final = test_st_data_final.reset_index(drop=True) ## This is being done because data.Example.fromlist was failing

  validation_st_data_final = validation_st_data.drop(['sentence_index','phrase','phrase ids','splitset_label','sentiment values'],axis=1)
  validation_st_data_final.rename(columns = {'Revised_sentiment values': 'sentiment'} , inplace = True)
  validation_st_data_final = validation_st_data_final.reset_index(drop=True) ## This is being done because data.Example.fromlist was failing

  return train_st_data_final_mixed, test_st_data_final, validation_st_data_final

In [28]:
train_st_data_final, test_st_data_final, validation_st_data_final = get_final_data(train_st_data,test_st_data,validation_st_data,combined_data_aug)

In [29]:
train_st_data_final.head()

Unnamed: 0,sentiment,sentence,sentence_aug
0,0,Pow seems futile concoction was developed hast...,
1,0,Jaglom offers - too - premise that involved mo...,
2,0,"is the kind of ' critic -, simply because aims...",
3,3,Waiting for can: ` In of Love ' is ' s epitaph...,
4,1,", no, we get another, and another.",


In [30]:
# train_st_data_final.to_csv(r'train_st_data_final.csv', index = False)

In [31]:
train_st_data_final.shape

(32468, 3)

In [32]:
train_st_data_final.sentiment.value_counts()

0    12488
3     8864
1     6196
4     4920
Name: sentiment, dtype: int64

In [33]:
validation_st_data_final.sentiment.value_counts()

0    408
3    259
1    219
4    158
Name: sentiment, dtype: int64

Now we shall be defining LABEL as a LabelField, which is a subclass of Field that sets sequen tial to False (as it’s our numerical category class). TWEET is a standard Field object, where we have decided to use the spaCy tokenizer and convert all the text to lower‐ case.

In [34]:
# Import Library
import random
import torch, torchtext
from torchtext import data 

# Manual Seed
SEED = 43
torch.manual_seed(SEED)

<torch._C.Generator at 0x7fb1cd2b0cd8>

In [35]:
Sentence = data.Field(sequential = True, tokenize = 'spacy', batch_first =True, include_lengths=True)
Sentiment = data.LabelField(tokenize ='spacy',is_target=True, batch_first =True, sequential =False)

Having defined those fields, we now need to produce a list that maps them onto the list of rows that are in the CSV:

In [36]:
fields = [('sentence', Sentence),('sentiment',Sentiment)]

In [37]:
# saving the dataframe 
train_st_data_final.to_csv('train_st_data_final.csv', index=False) 
test_st_data_final.to_csv('test_st_data_final.csv', index=False) 
validation_st_data_final.to_csv('validation_st_data_final.csv', index=False) 

In [38]:
# df = pd.read_csv('train_st_data_final.csv')
# df.head()
# example_trng = [data.Example.fromlist([df.sentence[i],df.sentiment[i]], fields) for i in range(df.shape[0])] 

Armed with our declared fields, lets convert from pandas to list to torchtext. We could also use TabularDataset to apply that definition to the CSV directly but showing an alternative approach too.

In [39]:
example_trng = [data.Example.fromlist([train_st_data_final.sentence[i],train_st_data_final.sentiment[i]], fields) for i in range(train_st_data_final.shape[0])] 
example_val = [data.Example.fromlist([validation_st_data_final.sentence[i],validation_st_data_final.sentiment[i]], fields) for i in range(validation_st_data_final.shape[0])] 

In [40]:
# Creating dataset
#twitterDataset = data.TabularDataset(path="tweets.csv", format="CSV", fields=fields, skip_header=True)

# twitterDataset = data.Dataset(example, fields)
train = data.Dataset(example_trng, fields)
valid = data.Dataset(example_val, fields)

Finally, we can split into training, testing, and validation sets by using the split() method:

In [41]:
# (train, valid) = twitterDataset.split(split_ratio=[0.85, 0.15], random_state=random.seed(SEED))

In [80]:
(len(train), len(valid))

TypeError: ignored

An example from the dataset:

In [79]:
vars(train.examples[10])

AttributeError: ignored

## Building Vocabulary

At this point we would have built a one-hot encoding of each word that is present in the dataset—a rather tedious process. Thankfully, torchtext will do this for us, and will also allow a max_size parameter to be passed in to limit the vocabu‐ lary to the most common words. This is normally done to prevent the construction of a huge, memory-hungry model. We don’t want our GPUs too overwhelmed, after all. 

Let’s limit the vocabulary to a maximum of 5000 words in our training set:


In [74]:
MAX_VOCAB_SIZE = 25_000

Sentence.build_vocab(train, 
                 max_size = MAX_VOCAB_SIZE, 
                 vectors = "glove.6B.100d", 
                 unk_init = torch.Tensor.normal_)

TypeError: ignored

In [45]:
Sentiment.build_vocab(train)

In [46]:
# Sentence.build_vocab(train)
# Sentiment.build_vocab(train)

By default, torchtext will add two more special tokens, <unk> for unknown words and <pad>, a padding token that will be used to pad all our text to roughly the same size to help with efficient batching on the GPU.

In [47]:
print('Size of input vocab : ', len(Sentence.vocab))
print('Size of label vocab : ', len(Sentiment.vocab))
print('Top 10 words appreared repeatedly :', list(Sentence.vocab.freqs.most_common(10)))
print('Labels : ', Sentiment.vocab.stoi)

Size of input vocab :  17052
Size of label vocab :  4
Top 10 words appreared repeatedly : [('.', 31935), (',', 26833), ('the', 17342), ('-', 13429), ("'", 13351), ('of', 12687), ('and', 12678), ('a', 12226), ('to', 8571), ('is', 7050)]
Labels :  defaultdict(<function _default_unk_index at 0x7fb15864db70>, {0: 0, 3: 1, 1: 2, 4: 3})


In [48]:
print('Size of input vocab : ', len(Sentence.vocab))
print('Size of label vocab : ', len(Sentiment.vocab))
print('Top 10 words appreared repeatedly :', list(Sentence.vocab.freqs.most_common(10)))
print('Labels : ', Sentiment.vocab.stoi)

Size of input vocab :  17052
Size of label vocab :  4
Top 10 words appreared repeatedly : [('.', 31935), (',', 26833), ('the', 17342), ('-', 13429), ("'", 13351), ('of', 12687), ('and', 12678), ('a', 12226), ('to', 8571), ('is', 7050)]
Labels :  defaultdict(<function _default_unk_index at 0x7fb15864db70>, {0: 0, 3: 1, 1: 2, 4: 3})


**Lots of stopwords!!**

Now we need to create a data loader to feed into our training loop. Torchtext provides the BucketIterator method that will produce what it calls a Batch, which is almost, but not quite, like the data loader we used on images.

But at first declare the device we are using.

In [49]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [50]:
train_iterator, valid_iterator = data.BucketIterator.splits((train, valid), batch_size = 32, 
                                                            sort_key = lambda x: len(x.sentence),
                                                            sort_within_batch=True, device = device)

Save the vocabulary for later use

In [51]:
import os, pickle
with open('tokenizer.pkl', 'wb') as tokens: 
    pickle.dump(Sentence.vocab.stoi, tokens)

## Defining Our Model

We use the Embedding and LSTM modules in PyTorch to build a simple model for classifying tweets.

In this model we create three layers. 
1. First, the words in our tweets are pushed into an Embedding layer, which we have established as a 300-dimensional vector embedding. 
2. That’s then fed into a 2 stacked-LSTMs with 100 hidden features (again, we’re compressing down from the 300-dimensional input like we did with images). We are using 2 LSTMs for using the dropout.
3. Finally, the output of the LSTM (the final hidden state after processing the incoming tweet) is pushed through a standard fully connected layer with three outputs to correspond to our three possible classes (negative, positive, or neutral).

In [52]:
import torch.nn as nn
import torch.nn.functional as F

class classifier(nn.Module):
    
    # Define all the layers used in model
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, dropout, bidirectional,pad_idx):
        
        super().__init__()          
        
        # Embedding layer
        self.embedding = nn.Embedding(vocab_size, embedding_dim,padding_idx = pad_idx)
        
        # LSTM layer
        self.encoder = nn.LSTM(embedding_dim, 
                           hidden_dim, 
                           num_layers=n_layers, 
                           dropout=dropout,
                           bidirectional = bidirectional,
                           batch_first=True)
        # try using nn.GRU or nn.RNN here and compare their performances
        # try bidirectional and compare their performances
        self.dropout = nn.Dropout(dropout)
        # Dense layer
        self.fc = nn.Linear(hidden_dim, output_dim)
        
    def forward(self, text, text_lengths):
        
        # text = [batch size, sent_length]
        embedded = self.embedding(text)
        # embedded = [batch size, sent_len, emb dim]
      
        # packed sequence
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, text_lengths.cpu(), batch_first=True)
        
        packed_output, (hidden, cell) = self.encoder(packed_embedded)
        #hidden = [batch size, num layers * num directions,hid dim]
        #cell = [batch size, num layers * num directions,hid dim]
    
        # Hidden = [batch size, hid dim * num directions]
        dense_outputs = self.dropout(self.fc(hidden))  
        # Final activation function softmax
        output = F.softmax(dense_outputs[0], dim=1)
        # output = F.softmax(dense_outputs, dim=1)
            
        return output

In [53]:
# Define hyperparameters
size_of_vocab = len(Sentence.vocab)
embedding_dim = 100
num_hidden_nodes = 256
num_output_nodes = 5
num_layers = 3
dropout = 0.2
bidirectional = True
PAD_IDX = Sentence.vocab.stoi[Sentence.pad_token]

# Instantiate the model
model = classifier(size_of_vocab, embedding_dim, num_hidden_nodes, num_output_nodes, num_layers, dropout, bidirectional,PAD_IDX)

In [54]:
print(model)

#No. of trianable parameters
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
    
print(f'The model has {count_parameters(model):,} trainable parameters')

classifier(
  (embedding): Embedding(17052, 100, padding_idx=1)
  (encoder): LSTM(100, 256, num_layers=3, batch_first=True, dropout=0.2, bidirectional=True)
  (dropout): Dropout(p=0.2, inplace=False)
  (fc): Linear(in_features=256, out_features=5, bias=True)
)
The model has 5,593,589 trainable parameters


## Model Training and Evaluation

First define the optimizer and loss functions

In [55]:
import torch.optim as optim

# define optimizer and loss
optimizer = optim.Adam(model.parameters(), lr=2e-3)
criterion = nn.CrossEntropyLoss()

# define metric
def binary_accuracy(preds, y):
    #round predictions to the closest integer
    _, predictions = torch.max(preds, 1)
    
    correct = (predictions == y).float() 
    acc = correct.sum() / len(correct)
    return acc
    


In [56]:
pretrained_embeddings = Sentence.vocab.vectors

print(pretrained_embeddings.shape)

torch.Size([17052, 100])


In [57]:
model.embedding.weight.data.copy_(pretrained_embeddings)

tensor([[-0.0166, -0.4668,  2.0909,  ..., -1.4692,  0.4476, -0.7223],
        [-0.0791, -0.2089, -0.3442,  ...,  0.4657,  0.6297, -1.7395],
        [-0.3398,  0.2094,  0.4635,  ..., -0.2339,  0.4730, -0.0288],
        ...,
        [-0.2805,  0.1506,  0.3955,  ...,  0.6393,  0.0779,  0.7722],
        [ 0.5732, -1.0756, -0.1600,  ...,  0.4548,  0.2344,  0.0364],
        [-0.3997, -0.2994, -0.3571,  ...,  0.1802, -1.3936, -1.6659]])

In [58]:
UNK_IDX = Sentence.vocab.stoi[Sentence.unk_token]

model.embedding.weight.data[UNK_IDX] = torch.zeros(embedding_dim)
model.embedding.weight.data[PAD_IDX] = torch.zeros(embedding_dim)

print(model.embedding.weight.data)

tensor([[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [-0.3398,  0.2094,  0.4635,  ..., -0.2339,  0.4730, -0.0288],
        ...,
        [-0.2805,  0.1506,  0.3955,  ...,  0.6393,  0.0779,  0.7722],
        [ 0.5732, -1.0756, -0.1600,  ...,  0.4548,  0.2344,  0.0364],
        [-0.3997, -0.2994, -0.3571,  ...,  0.1802, -1.3936, -1.6659]])


In [59]:
# push to cuda if available
model = model.to(device)
criterion = criterion.to(device)

100%|█████████▉| 399950/400000 [00:40<00:00, 16545.52it/s]

The main thing to be aware of in this new training loop is that we have to reference `batch.tweets` and `batch.labels` to get the particular fields we’re interested in; they don’t fall out quite as nicely from the enumerator as they do in torchvision.

**Training Loop**

In [60]:
def train(model, iterator, optimizer, criterion):
    
    # initialize every epoch 
    epoch_loss = 0
    epoch_acc = 0
    
    # set the model in training phase
    model.train()  
    
    for batch in iterator:
        
        # resets the gradients after every batch
        optimizer.zero_grad()   
        
        # retrieve text and no. of words
        tweet, tweet_lengths = batch.sentence   
        
        # convert to 1D tensor
        predictions = model(tweet, tweet_lengths).squeeze()  
        
        # compute the loss
        loss = criterion(predictions, batch.sentiment)        
        
        # compute the binary accuracy
        acc = binary_accuracy(predictions, batch.sentiment)   
        
        # backpropage the loss and compute the gradients
        loss.backward()       
        
        # update the weights
        optimizer.step()      
        
        # loss and accuracy
        epoch_loss += loss.item()  
        epoch_acc += acc.item()    
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

**Evaluation Loop**

In [61]:
def evaluate(model, iterator, criterion):
    
    # initialize every epoch
    epoch_loss = 0
    epoch_acc = 0

    # deactivating dropout layers
    model.eval()
    
    # deactivates autograd
    with torch.no_grad():
    
        for batch in iterator:
        
            # retrieve text and no. of words
            tweet, tweet_lengths = batch.sentence
            
            # convert to 1d tensor
            predictions = model(tweet, tweet_lengths).squeeze()
            
            # compute loss and accuracy
            loss = criterion(predictions, batch.sentiment)
            acc = binary_accuracy(predictions, batch.sentiment)
            
            # keep track of loss and accuracy
            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

**Let's Train and Evaluate**

In [62]:
N_EPOCHS = 15
best_valid_loss = float('inf')
#freeze embeddings
model.embedding.weight.requires_grad = unfrozen = False

for epoch in range(N_EPOCHS):
     
    # train the model
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    
    # evaluate the model
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    
    # save the best model
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'saved_weights.pt')
    
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}% \n')

	Train Loss: 1.500 | Train Acc: 39.46%
	 Val. Loss: 1.421 |  Val. Acc: 46.14% 

	Train Loss: 1.411 | Train Acc: 49.95%
	 Val. Loss: 1.399 |  Val. Acc: 49.79% 

	Train Loss: 1.358 | Train Acc: 55.63%
	 Val. Loss: 1.412 |  Val. Acc: 48.54% 

	Train Loss: 1.316 | Train Acc: 60.25%
	 Val. Loss: 1.404 |  Val. Acc: 49.89% 

	Train Loss: 1.280 | Train Acc: 63.87%
	 Val. Loss: 1.396 |  Val. Acc: 50.19% 

	Train Loss: 1.254 | Train Acc: 66.45%
	 Val. Loss: 1.377 |  Val. Acc: 52.41% 

	Train Loss: 1.232 | Train Acc: 68.61%
	 Val. Loss: 1.389 |  Val. Acc: 51.27% 

	Train Loss: 1.213 | Train Acc: 70.70%
	 Val. Loss: 1.389 |  Val. Acc: 51.08% 

	Train Loss: 1.193 | Train Acc: 72.76%
	 Val. Loss: 1.403 |  Val. Acc: 49.68% 

	Train Loss: 1.181 | Train Acc: 73.88%
	 Val. Loss: 1.400 |  Val. Acc: 49.85% 

	Train Loss: 1.172 | Train Acc: 74.87%
	 Val. Loss: 1.431 |  Val. Acc: 46.29% 

	Train Loss: 1.161 | Train Acc: 75.86%
	 Val. Loss: 1.418 |  Val. Acc: 48.18% 

	Train Loss: 1.152 | Train Acc: 77.01%
	

In [77]:
path='./saved_weights.pt'
model.load_state_dict(torch.load(path));

In [71]:
# N_EPOCHS = 15
# best_valid_loss = float('inf')
# #freeze embeddings
# model.embedding.weight.requires_grad = unfrozen = True

# for epoch in range(N_EPOCHS):
     
#     # train the model
#     train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    
#     # evaluate the model
#     valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    
#     # save the best model
#     if valid_loss < best_valid_loss:
#         best_valid_loss = valid_loss
#         torch.save(model.state_dict(), 'saved_weights.pt')
    
#     print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
#     print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}% \n')

	Train Loss: 1.213 | Train Acc: 70.59%
	 Val. Loss: 1.408 |  Val. Acc: 49.00% 

	Train Loss: 1.154 | Train Acc: 76.56%
	 Val. Loss: 1.387 |  Val. Acc: 51.65% 

	Train Loss: 1.114 | Train Acc: 80.64%
	 Val. Loss: 1.430 |  Val. Acc: 46.69% 

	Train Loss: 1.090 | Train Acc: 83.12%
	 Val. Loss: 1.449 |  Val. Acc: 44.64% 

	Train Loss: 1.073 | Train Acc: 84.69%
	 Val. Loss: 1.427 |  Val. Acc: 47.03% 

	Train Loss: 1.059 | Train Acc: 86.40%
	 Val. Loss: 1.407 |  Val. Acc: 49.20% 

	Train Loss: 1.051 | Train Acc: 87.06%
	 Val. Loss: 1.466 |  Val. Acc: 43.22% 

	Train Loss: 1.045 | Train Acc: 87.53%
	 Val. Loss: 1.440 |  Val. Acc: 45.70% 

	Train Loss: 1.038 | Train Acc: 88.33%
	 Val. Loss: 1.446 |  Val. Acc: 45.00% 

	Train Loss: 1.033 | Train Acc: 88.86%
	 Val. Loss: 1.461 |  Val. Acc: 43.66% 

	Train Loss: 1.026 | Train Acc: 89.44%
	 Val. Loss: 1.439 |  Val. Acc: 46.27% 

	Train Loss: 1.023 | Train Acc: 89.99%
	 Val. Loss: 1.448 |  Val. Acc: 45.45% 

	Train Loss: 1.020 | Train Acc: 90.36%
	

## Model Testing

In [63]:
#load weights and tokenizer

path='./saved_weights.pt'
model.load_state_dict(torch.load(path));
model.eval();
tokenizer_file = open('./tokenizer.pkl', 'rb')
tokenizer = pickle.load(tokenizer_file)

#inference 

import spacy
nlp = spacy.load('en')

def classify_tweet(tweet):
    
    categories = {0: "Negative", 1:"Positive", 2:"Neutral"}
    
    # tokenize the tweet 
    tokenized = [tok.text for tok in nlp.tokenizer(tweet)] 
    # convert to integer sequence using predefined tokenizer dictionary
    indexed = [tokenizer[t] for t in tokenized]        
    # compute no. of words        
    length = [len(indexed)]
    # convert to tensor                                    
    tensor = torch.LongTensor(indexed).to(device)   
    # reshape in form of batch, no. of words           
    tensor = tensor.unsqueeze(1).T  
    # convert to tensor                          
    length_tensor = torch.LongTensor(length)
    # Get the model prediction                  
    prediction = model(tensor, length_tensor)

    _, pred = torch.max(prediction, 1) 
    
    return categories[pred.item()]

In [76]:
classify_tweet("A valid explanation for why Trump won't let women on the golf course.")

'Negative'

## Discussion on Data Augmentation Techniques 

You might wonder exactly how you can augment text data. After all, you can’t really flip it horizontally as you can an image! :D 

In contrast to data augmentation in images, augmentation techniques on data is very specific to final product you are building. As its general usage on any type of textual data doesn't provides a significant performance boost, that's why unlike torchvision, torchtext doesn’t offer a augmentation pipeline. Due to powerful models as transformers, augmentation tecnhiques are not so preferred now-a-days. But its better to know about some techniques with text that will provide your model with a little more information for training. 

### Synonym Replacement

First, you could replace words in the sentence with synonyms, like so:

    The dog slept on the mat

could become

    The dog slept on the rug

Aside from the dog's insistence that a rug is much softer than a mat, the meaning of the sentence hasn’t changed. But mat and rug will be mapped to different indices in the vocabulary, so the model will learn that the two sentences map to the same label, and hopefully that there’s a connection between those two words, as everything else in the sentences is the same.

### Random Insertion
A random insertion technique looks at a sentence and then randomly inserts synonyms of existing non-stopwords into the sentence n times. Assuming you have a way of getting a synonym of a word and a way of eliminating stopwords (common words such as and, it, the, etc.), shown, but not implemented, in this function via get_synonyms() and get_stopwords(), an implementation of this would be as follows:


In [65]:
def random_insertion(sentence, n): 
    words = remove_stopwords(sentence) 
    for _ in range(n):
        new_synonym = get_synonyms(random.choice(words))
        sentence.insert(randrange(len(sentence)+1), new_synonym) 
    return sentence

## Random Deletion
As the name suggests, random deletion deletes words from a sentence. Given a probability parameter p, it will go through the sentence and decide whether to delete a word or not based on that random probability. Consider of it as pixel dropouts while treating images.

In [66]:
def random_deletion(words, p=0.5): 
    if len(words) == 1: # return if single word
        return words
    remaining = list(filter(lambda x: random.uniform(0,1) > p,words)) 
    if len(remaining) == 0: # if not left, sample a random word
        return [random.choice(words)] 
    else:
        return remaining

### Random Swap
The random swap augmentation takes a sentence and then swaps words within it n times, with each iteration working on the previously swapped sentence. Here we sample two random numbers based on the length of the sentence, and then just keep swapping until we hit n.

In [67]:
def random_swap(sentence, n=5): 
    length = range(len(sentence)) 
    for _ in range(n):
        idx1, idx2 = random.sample(length, 2)
        sentence[idx1], sentence[idx2] = sentence[idx2], sentence[idx1] 
    return sentence

For more on this please go through this [paper](https://arxiv.org/pdf/1901.11196.pdf).

### Back Translation

Another popular approach for augmenting text datasets is back translation. This involves translating a sentence from our target language into one or more other languages and then translating all of them back to the original language. We can use the Python library googletrans for this purpose. 

In [68]:
!pip install googletrans==3.1.0a0

Collecting googletrans==3.1.0a0
  Downloading https://files.pythonhosted.org/packages/19/3d/4e3a1609bf52f2f7b00436cc751eb977e27040665dde2bd57e7152989672/googletrans-3.1.0a0.tar.gz
Collecting httpx==0.13.3
[?25l  Downloading https://files.pythonhosted.org/packages/54/b4/698b284c6aed4d7c2b4fe3ba5df1fcf6093612423797e76fbb24890dd22f/httpx-0.13.3-py3-none-any.whl (55kB)
[K     |████████████████████████████████| 61kB 5.2MB/s 
Collecting httpcore==0.9.*
[?25l  Downloading https://files.pythonhosted.org/packages/dd/d5/e4ff9318693ac6101a2095e580908b591838c6f33df8d3ee8dd953ba96a8/httpcore-0.9.1-py3-none-any.whl (42kB)
[K     |████████████████████████████████| 51kB 4.8MB/s 
Collecting hstspreload
[?25l  Downloading https://files.pythonhosted.org/packages/d3/3c/cdeaf9ab0404853e77c45d9e8021d0d2c01f70a1bb26e460090926fe2a5e/hstspreload-2020.11.21-py3-none-any.whl (981kB)
[K     |████████████████████████████████| 983kB 16.3MB/s 
Collecting rfc3986<2,>=1.3
  Downloading https://files.pythonhosted

In [69]:
import random
import googletrans
from googletrans import Translator

# import googletrans.Translator

translator = Translator()
sentence = ['The dog slept on the rug']

available_langs = list(googletrans.LANGUAGES.keys()) 
trans_lang = random.choice(available_langs) 
print(f"Translating to {googletrans.LANGUAGES[trans_lang]}")

translations = translator.translate(sentence, dest=trans_lang) 
t_text = [t.text for t in translations]
print(t_text)

translations_en_random = translator.translate(t_text, src=trans_lang, dest='en') 
en_text = [t.text for t in translations_en_random]
print(en_text)

Translating to malay
['Anjing itu tidur di atas permaidani']
['The dog slept on the carpet']
