# **YZV405E - Natural Language Processing / Homework 4**
---

## **1. Introduction**

**Any modification to the structure of the notebook is strictly prohibited.**

In this notebook, you'll implement a recurrent neural network that performs sentiment analysis.
>An RNN is a better fit for this task than a strictly feedforward network because it can use information about the *sequence* of words.

Here we'll use a dataset of food reviews, accompanied by sentiment labels: positive or negative.

### **Network Architecture**

The architecture for this network is shown below.

<img src="https://drive.google.com/uc?id=11tNLfadUaI8bUrSQm-LoiQye_YEq7-vG" width=80%>

>**First, we'll pass in words to an embedding layer.** We need an embedding layer because we have tens of thousands of words, so we'll need a more efficient representation for our input data than one-hot encoded vectors. One approach is to add an embedding layer and let the network learn its own embedding table. *In this case, the embedding layer is for dimensionality reduction, rather than for learning semantic representations.* The other method you will implement uses a pre-trained model (such as fastText) and feeds the word embeddings obtained from it to your network.

>**After input words are passed to an embedding layer, the new embeddings will be passed to LSTM cells.** The LSTM cells will add *recurrent* connections to the network and give us the ability to include information about the *sequence* of words in the food review data.

>**Finally, the LSTM outputs will go to a sigmoid output layer.** We're using a sigmoid function because positive and negative are encoded as 1 and 0, respectively, and a sigmoid outputs predicted sentiment values between 0 and 1.

We don't care about the sigmoid outputs except for the **very last one**; we can ignore the rest. We'll calculate the loss by comparing the output at the last time step and the training label (pos or neg).
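
To make the last step concrete, here is a minimal sketch of how a final-time-step score could be turned into a sentiment label. The names `sigmoid`, `predict_label`, and `step_scores`, the example scores, and the 0.5 threshold are illustrative assumptions, not part of the notebook's code.

```python
import math

def sigmoid(score: float) -> float:
    """Squash a real-valued score into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))

def predict_label(score: float, threshold: float = 0.5) -> int:
    """Return 1 (positive) if the sigmoid output reaches the threshold, else 0."""
    return 1 if sigmoid(score) >= threshold else 0

# Hypothetical raw scores, one per time step of a single review
step_scores = [-0.3, 0.8, 2.1]

# Only the output at the last time step is compared with the label
last_prob = sigmoid(step_scores[-1])
print(round(last_prob, 3), predict_label(step_scores[-1]))
```

The earlier scores are computed but never used for the prediction, which mirrors the "ignore the rest" rule described above.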

## **2. Data**


The dataset consists of text reviews and corresponding sentiment labels. Here's a breakdown:

**Review:** This column contains text data representing customer reviews written in Turkish.

**Sentiment:** This column contains numerical values representing the sentiment or emotion associated with each review.

### **Load Data**




If you are using Google Colab, you may create a folder in your Google Drive and load the dataset from that folder. Please make sure you have at least 10GB of free space available in your Google Drive.

In [1]:
# Comment out this cell if you are not working on Google Colab
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

HW_PATH = '/content/drive/MyDrive/YZV405E_HW4/'

Mounted at /content/drive


If you are running this notebook locally, simply use the current directory.

In [None]:
# Uncomment this cell if you are not working on Google Colab
#import os
#HW_PATH = os.getcwd()
#print(HW_PATH)

In [2]:
import os
import pandas as pd

### START YOUR CODE HERE ###

# os.path.join works whether or not HW_PATH ends with a slash
df = pd.read_csv(os.path.join(HW_PATH, 'yorumsepeti.csv'))

### END YOUR CODE HERE ###

df.head()

Unnamed: 0,review,sentiment
0,Her zaman komşu fırından sipariş verdiğim için...,0
1,sosisli ürün isteyen adama peynirli bişey yol...,0
2,Iyi pişsin diye söylememe rağmen az pişmiş gel...,0
3,kokmuş hamburger getirdiniz be ayıp ulan resm...,0
4,Allah affetsin çok kötüydü hiç bir şey mi iyi ...,0


In [3]:
reviews = df["review"].tolist()
labels = df["sentiment"].tolist()

print("Sample review: " + reviews[42])
print("Label: " + str(labels[42]))

Sample review: Çok geç geldi.
Label: 0


You should get:


```
Sample review: Çok geç geldi.
Label: 0
```



### **Pre-processing**

The first step when building a neural network model is getting your data into the proper form to feed into the network.

Let's split every review into individual words, lemmatize those words and remove punctuation.

If you're interested, consider reading this paper before getting started: https://arxiv.org/pdf/2003.07082

This run takes approximately 55 minutes on an 8-thread Coffee Lake Refresh CPU with single-channel memory, and 12–13 minutes on a T4 GPU.
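
One detail of the punctuation step is worth a quick look. The cell below filters with `lemma not in punctuation`, which is a substring test against `string.punctuation`. The sketch here compares that test with a stricter per-character check; the helper names and the toy token list are assumptions for illustration, and whether multi-character punctuation lemmas actually occur in this dataset has not been checked.

```python
from string import punctuation

def is_punct_substring(token: str) -> bool:
    """Substring test against string.punctuation (the check used in the cell below)."""
    return token in punctuation

def is_punct_all_chars(token: str) -> bool:
    """Stricter alternative: every character of the token is a punctuation mark."""
    return len(token) > 0 and all(ch in punctuation for ch in token)

# Hypothetical lemmas for illustration
tokens = ["çok", "!", "...", "geç"]

kept_substring = [t for t in tokens if not is_punct_substring(t)]
kept_all_chars = [t for t in tokens if not is_punct_all_chars(t)]
print(kept_substring)
print(kept_all_chars)
```

With the substring test, a token such as `...` survives because `string.punctuation` contains only a single period; the per-character check removes it.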

In [4]:
!pip install stanza
import stanza
import torch
from string import punctuation
from tqdm import tqdm

# Download the Turkish Stanza model (should be run only once)
stanza.download('tr')

# Initialize the Stanza pipeline (includes lemmatizer)
lemmatizer = stanza.Pipeline('tr', processors='tokenize,mwt,pos,lemma', use_gpu=True)

reviews_processed = []

# iterate through reviews and process
for i in tqdm(range(len(reviews))):

    doc = lemmatizer(reviews[i].lower())  # convert to lowercase and analyze
    lemmas = [word.lemma for sent in doc.sentences for word in sent.words]  # extract lemmatized words

    # Remove punctuation
    filtered = [lemma for lemma in lemmas if lemma is not None and lemma not in punctuation]
    reviews_processed.append(' '.join(filtered))



Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json:   0%|  …

INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Downloading default packages for language: tr (Turkish) ...
INFO:stanza:File exists: /root/stanza_resources/tr/default.zip
INFO:stanza:Finished downloading models and saved to /root/stanza_resources
INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json:   0%|  …

INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Loading these models for language: tr (Turkish):
| Processor | Package       |
-----------------------------
| tokenize  | imst          |
| mwt       | imst          |
| pos       | imst_charlm   |
| lemma     | imst_nocharlm |

INFO:stanza:Using device: cuda
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: mwt
INFO:stanza:Loading: pos
INFO:stanza:Loading: lemma
INFO:stanza:Done loading processors!
100%|██████████| 13751/13751 [12:58<00:00, 17.66it/s]


Save `reviews_processed` for future use.







In [5]:
import pickle

with open(HW_PATH + '/reviews_processed.pkl', 'wb') as f:
  pickle.dump(reviews_processed, f)

Load `reviews_processed` if you are coming back.

In [6]:
import pickle

with open(HW_PATH + '/reviews_processed.pkl', 'rb') as f:
    reviews_processed = pickle.load(f)

In [7]:
reviews_processed[0]

'her zaman komşu fır sipariş ver için eksik gönder tespit et tamamla için bilgi ver gönder 2 adet vişneli kurabi özür ol nere eski komşu fır hizmet anlayış'

## **3. Word Embeddings**

### **Create vocabulary**

Find unique words in the dataset and build a dictionary that maps words to integers.

In [8]:
# feel free to use this import
from collections import Counter

### START YOUR CODE HERE ###

# get unique list of words
words = ' '.join(reviews_processed).split()
word_counts = Counter(words)
## Build a dictionary that maps words to integers
vocab = sorted(word_counts, key=word_counts.get, reverse=True)
vocab = {word: ii+1 for ii, word in enumerate(vocab)}

### END YOUR CODE HERE ###

### **Method 1: Encoding the words with integers**

The embedding lookup requires that we pass in integers to our network. The easiest way to do this is to create dictionaries that map the words in the vocabulary to integers. Then we can convert each of our reviews into integers so they can be passed into the network.

> **Exercise:** Now you're going to encode the words with integers. Build a dictionary that maps words to integers. Later we're going to pad our input vectors with zeros, so make sure the integers **start at 1, not 0**.
> Also, convert the words in reviews to integers and store the reviews in a new list called `reviews_ints`.
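
As a small illustration of the encoding described above, the following sketch builds a mapping on three toy reviews. The reviews and the `toy_` names are hypothetical; ties in frequency are broken by code point order only to keep the result deterministic.

```python
from collections import Counter

# Hypothetical pre-processed reviews
toy_reviews = ["yemek iyi", "yemek soğuk gel", "iyi"]

counts = Counter(" ".join(toy_reviews).split())

# Most frequent words get the smallest integers; 0 stays free for padding.
ordered = sorted(counts, key=lambda w: (-counts[w], w))
toy_vocab_to_int = {word: i for i, word in enumerate(ordered, start=1)}

# Replace every word with its integer
toy_reviews_ints = [[toy_vocab_to_int[w] for w in r.split()] for r in toy_reviews]
print(toy_vocab_to_int)
print(toy_reviews_ints)
```

Because the smallest assigned integer is 1, a 0 in a padded sequence can never be confused with a real word in this scheme.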

In [9]:
### START YOUR CODE HERE ###

vocab_to_int = vocab

## use the dict to tokenize each review in reviews_processed
## store the tokenized reviews in reviews_ints
reviews_ints = []
for review in reviews_processed:
    reviews_ints.append([vocab_to_int[word] for word in review.split() if word in vocab_to_int])

### END YOUR CODE HERE ###

**Test your code**

As a test that you've implemented the dictionary correctly, print out the number of unique words in your vocabulary and the contents of the first tokenized review.

In [10]:
# stats about vocabulary
print('Unique words: ', len(vocab_to_int))  # roughly 9,400 with the lemmatizer output shown in this notebook
print()

# print tokens in first review
print('Tokenized review: \n', reviews_ints[:1])

Unique words:  9422

Tokenized review: 
 [[64, 85, 1694, 642, 9, 12, 27, 126, 50, 3158, 6, 1970, 27, 629, 12, 50, 73, 248, 3159, 1127, 615, 7, 379, 278, 1694, 642, 219, 1090]]


### **Method 2: Using fastText**

This might take several minutes to run.

In [12]:
!wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.tr.300.vec.gz -P "{HW_PATH}" #Specify the download path based on your desired destination directory.

--2025-04-24 13:46:46--  https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.tr.300.vec.gz
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 108.157.254.124, 108.157.254.15, 108.157.254.102, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|108.157.254.124|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1261500728 (1.2G) [binary/octet-stream]
Saving to: ‘/content/drive/MyDrive/YZV405E_HW4/cc.tr.300.vec.gz.2’


2025-04-24 13:47:04 (66.4 MB/s) - ‘/content/drive/MyDrive/YZV405E_HW4/cc.tr.300.vec.gz.2’ saved [1261500728/1261500728]



In [13]:
import os
gz_file = os.path.join(HW_PATH, 'cc.tr.300.vec.gz')

!gzip -d "{gz_file}"

In [14]:
#The following versions must be used when working on Google Colab.
#!pip install torch==2.3.0+cu121 --index-url https://download.pytorch.org/whl/cu121
!pip install torchtext==0.18.0
HW_PATH = '/content/drive/MyDrive/YZV405E_HW4/'
from torchtext.vocab import Vectors

# Load the Turkish fastText word embeddings
FASTTEXT_EMB_FILE = HW_PATH + '/cc.tr.300.vec'
emb = Vectors(name=FASTTEXT_EMB_FILE, cache= HW_PATH)

Collecting torchtext==0.18.0
  Downloading torchtext-0.18.0-cp311-cp311-manylinux1_x86_64.whl.metadata (7.9 kB)
Downloading torchtext-0.18.0-cp311-cp311-manylinux1_x86_64.whl (2.0 MB)
Installing collected packages: torchtext
Successfully installed torchtext-0.18.0


100%|██████████| 2000000/2000000 [04:22<00:00, 7624.35it/s]


> **Exercise:** Now you're going to get word embeddings from fastText. Convert the words in reviews to integer indexes from fastText and store the reviews in a new list called `reviews_emb_ix`. You can use `emb.stoi[word]` to get the index of a word from fastText.
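
Words that fastText has never seen have no entry in `emb.stoi`, so a direct lookup would raise a `KeyError`. The sketch below shows one way to handle this, by skipping unknown words. `toy_stoi` and `to_indices` are hypothetical stand-ins; the real mapping holds about 2 million entries.

```python
# Hypothetical stand-in for `emb.stoi`
toy_stoi = {"yemek": 10, "iyi": 3, "gel": 7}

def to_indices(review: str, stoi: dict) -> list:
    """Map each known word to its index and silently drop unknown words."""
    return [stoi[word] for word in review.split() if word in stoi]

indices = to_indices("yemek bişey iyi", toy_stoi)
print(indices)
```

Dropping out-of-vocabulary words shortens some reviews, which is one reason the padded sequences for the two methods can differ.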

In [15]:
### START YOUR CODE HERE ###

reviews_emb_ix = []
for review in reviews_processed:
    reviews_emb_ix.append([emb.stoi[word] for word in review.split() if word in emb.stoi])

### END YOUR CODE HERE ###

### **Padding sequences**

To deal with both short and very long reviews, we'll pad or truncate all our reviews to a specific length. For reviews shorter than some `max_seq_length`, we'll pad with 0s. For reviews longer than `max_seq_length`, we can truncate them to the first `max_seq_length` words.

> **Exercise:** Define a function that returns an array `padded_sequence` that contains the padded data, of a standard size, that we'll pass to the network.
* The data should come from `reviews_ints` and `reviews_emb_ix`, since we want to feed integers to the network.
* Each row should be `max_seq_length` elements long.
* For reviews shorter than `max_seq_length` words, **left pad** with 0s. That is, if the review is `['en', 'iyi', 'kumru']`, `[117, 18, 128]` as integers, the row will look like `[0, 0, 0, ..., 0, 117, 18, 128]`.
* For reviews longer than `max_seq_length`, use only the first `max_seq_length` words as the feature vector.

As a small example, if the `max_seq_length=10` and an input review is:
```
[117, 18, 128]
```
The resultant, padded sequence should be:

```
[0, 0, 0, 0, 0, 0, 0, 117, 18, 128]
```

**Your final `padded_sequence` array should be a 2D array, with as many rows as there are reviews, and as many columns as the specified `max_seq_length`.**

This isn't trivial and there are a bunch of ways to do this. But, if you're going to be building your own deep learning networks, you're going to have to get used to preparing your data.
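
The padding rules above can be sketched on plain Python lists before moving to tensors. This is only an illustration of the left-pad and truncate behaviour; `left_pad` is a hypothetical helper and not the required `pad_sequence` implementation.

```python
def left_pad(review: list, max_seq_length: int) -> list:
    """Truncate to the first max_seq_length tokens, then left-pad with zeros."""
    truncated = review[:max_seq_length]
    return [0] * (max_seq_length - len(truncated)) + truncated

short_row = left_pad([117, 18, 128], 10)        # shorter than max_seq_length
long_row = left_pad(list(range(1, 13)), 10)     # 12 tokens, longer than max_seq_length
print(short_row)
print(long_row)
```

Truncating first keeps the arithmetic simple: the number of zeros is never negative, so the same expression covers both short and long reviews.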

In [18]:
import torch

def pad_sequence(reviews_vecs, max_sequence_length):
    ''' Return padded_sequence of review_vecs, where each review is padded with 0's
        or truncated to the input max_sequence_length.
    '''

    ### START YOUR CODE HERE ###

    # getting the correct rows x cols shape
    padded_sequence = torch.zeros((len(reviews_vecs), max_sequence_length), dtype=torch.long)

    # iterate through list of reviews
    for i, review in enumerate(reviews_vecs):

        # get the length of current review
        sequence_length = len(review)

        # iterate through words of current review
        for j, token in enumerate(review):

          # if token index exceeds max_sequence_length
          if j >= max_sequence_length:
            break

          padded_sequence[i, max_sequence_length - min(sequence_length, max_sequence_length) + j] = token

    ### END YOUR CODE HERE ###

    return padded_sequence

In [19]:
padded_reviews_ints = pad_sequence(reviews_ints, max_sequence_length=30)

padded_reviews_embs = pad_sequence(reviews_emb_ix, max_sequence_length=30)

See reviews before and after padding:

In [20]:
print(reviews_ints[10])
print(padded_reviews_ints[10])

[86, 324, 2, 362, 126, 1403, 2]
tensor([   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,   86,
         324,    2,  362,  126, 1403,    2])


In [21]:
print(reviews_emb_ix[10])
print(padded_reviews_embs[10])

[136, 40322, 4047, 5112, 2712, 496561, 4047]
tensor([     0,      0,      0,      0,      0,      0,      0,      0,      0,
             0,      0,      0,      0,      0,      0,      0,      0,      0,
             0,      0,      0,      0,      0,    136,  40322,   4047,   5112,
          2712, 496561,   4047])


**Test your code:** your `padded_sequence` should have as many rows as reviews and each `padded_sequence` row should contain `max_seq_length` values.

In [22]:
assert len(padded_reviews_ints)==len(reviews_ints), "Your padded_sequence should have as many rows as reviews."
assert len(padded_reviews_ints[0])==30, "Each padded_sequence row should contain max_seq_length values."

In [23]:
assert len(padded_reviews_embs)==len(reviews_emb_ix), "Your padded_sequence should have as many rows as reviews."
assert len(padded_reviews_embs[0])==30, "Each padded_sequence row should contain max_seq_length values."

## **4. Training, Validation, Test**

With our data in nice shape, we'll split it into training, validation, and test sets.

> **Exercise:** Create the training, validation, and test sets.
* You'll need to create sets for the features and the labels, `train_x` and `train_y`, for example.
* Define a split fraction, `split_frac` as the fraction of data to **keep** in the training set. Usually this is set to 0.8 or 0.9.
* Whatever data is left will be split in half to create the validation and *testing* data.
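
The split sizes follow directly from integer arithmetic. The sketch below assumes the 13,751 reviews reported by the pre-processing progress bar; `split_sizes` is a hypothetical helper that only computes counts and does not touch the data.

```python
def split_sizes(n_samples: int, split_frac: float = 0.8) -> tuple:
    """Return (train, validation, test) sizes for the split described above."""
    n_train = int(n_samples * split_frac)
    n_remaining = n_samples - n_train
    n_val = int(n_remaining * 0.5)      # first half of the remainder
    n_test = n_remaining - n_val        # whatever is left
    return n_train, n_val, n_test

sizes = split_sizes(13751)
print(sizes)
```

Because the remainder is odd, the halves differ by one sample, with the test set receiving the extra one.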

In [24]:
import numpy as np

def split(features, labels, split_frac=0.8):

  ## split data into training, validation, and test data (features and labels, x and y)

  ### START YOUR CODE HERE ###

  split_idx = int(len(features)*split_frac)
  train_x, remaining_x = features[:split_idx], features[split_idx:]
  train_y, remaining_y = labels[:split_idx], labels[split_idx:]

  test_idx = int(len(remaining_x)*0.5)
  val_x, test_x = remaining_x[:test_idx], remaining_x[test_idx:]
  val_y, test_y = remaining_y[:test_idx], remaining_y[test_idx:]

  ### END YOUR CODE HERE ###

  ## print out the shapes of your resultant feature data
  print("\t\t\tFeature Shapes:")
  print("Train set: \t\t{}".format(train_x.shape),
        "\nValidation set: \t{}".format(val_x.shape),
        "\nTest set: \t\t{}".format(test_x.shape))
  print("\n")

  return [np.array(train_x), np.array(train_y), np.array(test_x), np.array(test_y), np.array(val_x), np.array(val_y)]

In [25]:
reviews_ints_split = split(padded_reviews_ints, labels)

reviews_embs_split = split(padded_reviews_embs, labels)

			Feature Shapes:
Train set: 		torch.Size([11000, 30]) 
Validation set: 	torch.Size([1375, 30]) 
Test set: 		torch.Size([1376, 30])


			Feature Shapes:
Train set: 		torch.Size([11000, 30]) 
Validation set: 	torch.Size([1375, 30]) 
Test set: 		torch.Size([1376, 30])




**Check your work**

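With train, validation, and test fractions equal to 0.8, 0.1, 0.1, respectively, the final feature data shapes should look like the following (the 2,751 reviews left after the training split do not divide evenly, so the test set receives the extra one):
```
                    Feature Shapes:
Train set: 		 (11000, 30)
Validation set: 	(1375, 30)
Test set: 		  (1376, 30)
```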

### **DataLoaders and Batching**

After creating training, test, and validation data, we can create DataLoaders for this data by following two steps:
1. Create a known format for accessing our data, using [TensorDataset](https://pytorch.org/docs/stable/data.html#) which takes in an input set of data and a target set of data with the same first dimension, and creates a dataset.
2. Create DataLoaders and batch our training, validation, and test Tensor datasets.

```
train_data = TensorDataset(torch.from_numpy(train_x), torch.from_numpy(train_y))
train_loader = DataLoader(train_data, batch_size=batch_size)
```

This is an alternative to creating a generator function for batching our data into full batches.
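
For comparison, a hand-written generator that yields only full batches could look like the sketch below. `full_batches` and the dummy data are assumptions for illustration; dropping the final incomplete batch corresponds to what `drop_last=True` does in a `DataLoader`.

```python
def full_batches(features: list, labels: list, batch_size: int = 32):
    """Yield (features, labels) batches, dropping a final incomplete batch."""
    n_full = len(features) // batch_size
    for b in range(n_full):
        start = b * batch_size
        yield features[start:start + batch_size], labels[start:start + batch_size]

# 1375 dummy samples, the size of the validation split in this notebook
xs = list(range(1375))
ys = [i % 2 for i in xs]

batches = list(full_batches(xs, ys))
print(len(batches), len(batches[0][0]))
```

Full batches matter here because the hidden state is created for a fixed `batch_size`, so a smaller final batch would not match its shape.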

In [26]:
import torch
from torch.utils.data import TensorDataset, DataLoader

def batching(split_list, batch_size=32):

  ### START YOUR CODE HERE ###

  # create Tensor datasets
  train_data = TensorDataset(torch.from_numpy(split_list[0]), torch.from_numpy(split_list[1]))
  valid_data = TensorDataset(torch.from_numpy(split_list[4]), torch.from_numpy(split_list[5]))
  test_data = TensorDataset(torch.from_numpy(split_list[2]), torch.from_numpy(split_list[3]))

  # make sure to SHUFFLE your training data
  train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size, drop_last=True)
  valid_loader = DataLoader(valid_data, shuffle=True, batch_size=batch_size, drop_last=True)
  test_loader = DataLoader(test_data, shuffle=True, batch_size=batch_size, drop_last=True)

  ### END YOUR CODE HERE ###

  return [train_data, valid_data, test_data, train_loader, valid_loader, test_loader]

In [27]:
reviews_ints_batch = batching(reviews_ints_split)

reviews_embs_batch = batching(reviews_embs_split)

### **Sentiment Network**

Below is where you'll define the network.

The layers are as follows:
1. An [embedding layer](https://pytorch.org/docs/stable/nn.html#embedding) that converts our word tokens (integers) into embeddings of a specific size.
2. An [LSTM layer](https://pytorch.org/docs/stable/nn.html#lstm) defined by a hidden_state size and number of layers.
3. A fully-connected output layer that maps the LSTM layer outputs to a desired output_size.
4. A sigmoid activation layer which maps all outputs to values between 0 and 1; return **only the last sigmoid output** as the output of this network.

#### **The Embedding Layer**

We need to add an [embedding layer](https://pytorch.org/docs/stable/nn.html#embedding). It is massively inefficient to one-hot encode that many classes. So, instead of one-hot encoding, we can have an embedding layer and use that layer as a lookup table.
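
The lookup-table idea can be checked on a toy example: multiplying a one-hot vector by the embedding matrix selects exactly one row, so reading that row directly gives the same result without the multiplication. The table values and function names below are hypothetical.

```python
# Toy embedding table: 4 tokens, 3 dimensions each (values are arbitrary)
embedding_table = [
    [0.0, 0.1, 0.2],
    [0.3, 0.4, 0.5],
    [0.6, 0.7, 0.8],
    [0.9, 1.0, 1.1],
]

def one_hot_matmul(token: int, table: list) -> list:
    """Multiply a one-hot row vector by the table (the inefficient route)."""
    one_hot = [1.0 if i == token else 0.0 for i in range(len(table))]
    dim = len(table[0])
    return [sum(one_hot[i] * table[i][d] for i in range(len(table))) for d in range(dim)]

def lookup(token: int, table: list) -> list:
    """Read the row directly, which is what an embedding lookup amounts to."""
    return table[token]

same = one_hot_matmul(2, embedding_table) == lookup(2, embedding_table)
print(same)
```

The one-hot route touches every row of the table for every token, while the lookup touches one row, which is where the efficiency gain comes from.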

#### **The LSTM Layer(s)**

We'll create an [LSTM](https://pytorch.org/docs/stable/nn.html#lstm) to use in our recurrent network, which takes in an input_size, a hidden_dim, a number of layers, a dropout probability (for dropout between multiple layers), and a batch_first parameter.

Most of the time, your network will perform better with more layers, typically 2 or 3. Adding more layers allows the network to learn more complex relationships.

> **Exercise:** Complete the `__init__` and `forward` functions for the SentimentRNN model class.

Note: `init_hidden` should initialize the hidden and cell states of an LSTM layer to all zeros, and move those states to the GPU, if available.

In [28]:
# First checking if GPU is available
train_on_gpu=torch.cuda.is_available()

if(train_on_gpu):
    print('Training on GPU.')
else:
    print('No GPU available, training on CPU.')

Training on GPU.


In [29]:
import torch.nn as nn

class SentimentRNN(nn.Module):
    """
    The RNN model that will be used to perform Sentiment analysis.
    """

    def __init__(self, vocab_size, output_size, embedding_dim, hidden_dim, n_layers, drop_prob=0.5, pretrained=False):
        """
        Initialize the model by setting up the layers.
        """
        super(SentimentRNN, self).__init__()

        ### START YOUR CODE HERE ###

        self.output_size = output_size
        self.n_layers = n_layers
        self.hidden_dim = hidden_dim

        # embedding and LSTM layers
        if not pretrained:
          self.embedding = nn.Embedding(vocab_size, embedding_dim)
        else:
          self.embedding = nn.Embedding.from_pretrained(emb.vectors)

        self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers, dropout=drop_prob, batch_first=True)

        # dropout layer
        self.dropout = nn.Dropout(drop_prob)

        # linear and sigmoid layers
        self.fc = nn.Linear(hidden_dim, output_size)
        self.sig = nn.Sigmoid()

        ### END YOUR CODE HERE ###


    def forward(self, x, hidden):
        """
        Perform a forward pass of our model on some input and hidden state.
        """

        ### START YOUR CODE HERE ###

        batch_size = x.size(0)

        # embeddings and lstm_out
        x = x.long()
        embeds = self.embedding(x)
        lstm_out, hidden = self.lstm(embeds, hidden)

        lstm_out = lstm_out[:, -1, :] # getting the last time step output

        # dropout and fully-connected layer
        out = self.dropout(lstm_out)
        out = self.fc(out)
        # sigmoid function
        sig_out = self.sig(out)

        ### END YOUR CODE HERE ###

        # return last sigmoid output and hidden state
        return sig_out, hidden


    def init_hidden(self, batch_size):
        ''' Initializes hidden state '''
        # Create two new tensors with sizes n_layers x batch_size x hidden_dim,
        # initialized to zero, for hidden state and cell state of LSTM

        weight = next(self.parameters()).data

        if (train_on_gpu):
            hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda(),
                  weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda())
        else:
            hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_(),
                      weight.new(self.n_layers, batch_size, self.hidden_dim).zero_())

        return hidden


### **Instantiate the network**

Here, we'll instantiate the network. First up, defining the hyperparameters.

* `vocab_size`: Size of our vocabulary or the range of values for our input, word tokens.
* `output_size`: Size of our desired output; the number of class scores we want to output (pos/neg).
* `embedding_dim`: Number of columns in the embedding lookup table; size of our embeddings.
* `hidden_dim`: Number of units in the hidden layers of our LSTM cells. Larger values often improve performance, at the cost of more parameters and slower training. Common values are 128, 256, 512, etc.
* `n_layers`: Number of LSTM layers in the network. Typically between 1 and 3.
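
To get a feel for how these hyperparameters affect model size, the sketch below counts trainable parameters for an embedding, stacked LSTM, and linear output architecture. It assumes the `torch.nn.LSTM` layout of four gates with two bias vectors per layer; the helper names are hypothetical, and the example values (a 9,423-row embedding table, 300-dimensional embeddings, 256 hidden units, 2 layers) are taken from the model printout later in this notebook.

```python
def lstm_layer_params(input_size: int, hidden_size: int) -> int:
    """Parameters of one unidirectional LSTM layer:
    4 gates x (input weights + recurrent weights + two bias vectors)."""
    return 4 * (input_size * hidden_size + hidden_size * hidden_size + 2 * hidden_size)

def sentiment_rnn_params(vocab_size: int, output_size: int, embedding_dim: int,
                         hidden_dim: int, n_layers: int) -> int:
    """Total parameters: embedding table + stacked LSTM + linear output layer."""
    embedding = vocab_size * embedding_dim
    lstm = lstm_layer_params(embedding_dim, hidden_dim)
    lstm += (n_layers - 1) * lstm_layer_params(hidden_dim, hidden_dim)
    fc = hidden_dim * output_size + output_size
    return embedding + lstm + fc

total = sentiment_rnn_params(9423, 1, 300, 256, 2)
print(total)
```

Under these assumptions the embedding table accounts for the majority of the parameters, so vocabulary size and `embedding_dim` influence model size more than `n_layers` does.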

> **Exercise:** Define the model hyperparameters.


In [30]:
### START YOUR CODE HERE ###

# Instantiate the model w/ hyperparams
vocab_size = len(vocab_to_int) + 1
output_size = 1
embedding_dim = 300
hidden_dim = 256
n_layers = 2
### END YOUR CODE HERE ###

net_int = SentimentRNN(vocab_size, output_size, embedding_dim, hidden_dim, n_layers)

net_emb = SentimentRNN(vocab_size, output_size, embedding_dim, hidden_dim, n_layers, pretrained=True)

print(net_int)

print(net_emb)

SentimentRNN(
  (embedding): Embedding(9423, 300)
  (lstm): LSTM(300, 256, num_layers=2, batch_first=True, dropout=0.5)
  (dropout): Dropout(p=0.5, inplace=False)
  (fc): Linear(in_features=256, out_features=1, bias=True)
  (sig): Sigmoid()
)
SentimentRNN(
  (embedding): Embedding(2000000, 300)
  (lstm): LSTM(300, 256, num_layers=2, batch_first=True, dropout=0.5)
  (dropout): Dropout(p=0.5, inplace=False)
  (fc): Linear(in_features=256, out_features=1, bias=True)
  (sig): Sigmoid()
)


### **Training & Testing**
Below is the typical training code. You can add code to save a model by name.

>We'll also be using a kind of cross entropy loss, which is designed to work with a single Sigmoid output. [BCELoss](https://pytorch.org/docs/stable/nn.html#bceloss), or **Binary Cross Entropy Loss**, applies cross entropy loss to a single value between 0 and 1.
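
As a worked illustration of binary cross entropy on a single sigmoid output, the sketch below computes the loss by hand. The function names are hypothetical; averaging over the batch matches the default mean reduction of `BCELoss`.

```python
import math

def bce(prob: float, label: int) -> float:
    """Binary cross entropy for one prediction: -[y*log(p) + (1-y)*log(1-p)]."""
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))

def mean_bce(probs: list, labels: list) -> float:
    """Average the per-example losses over a batch (mean reduction)."""
    return sum(bce(p, y) for p, y in zip(probs, labels)) / len(probs)

confident_right = bce(0.9, 1)   # predicted 0.9, true label 1
confident_wrong = bce(0.9, 0)   # predicted 0.9, true label 0
print(round(confident_right, 4), round(confident_wrong, 4))
```

A confident wrong prediction is penalized far more heavily than a confident correct one, which is the behaviour that pushes the sigmoid output toward the true label.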

We also have some data and training hyperparameters:

* `lr`: Learning rate for our optimizer.
* `epochs`: Number of times to iterate through the training dataset.
* `clip`: The maximum gradient norm; gradients are rescaled when their combined norm exceeds it (to prevent exploding gradients).

* **Test data performance:** We'll see how our trained model performs on all of our defined test_data, above. We'll calculate the average loss and accuracy over the test data.
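
The effect of `clip` can be sketched with plain numbers. This is a simplified version of norm-based clipping that treats all gradients as one flat list; `clip_by_global_norm` is a hypothetical helper, and it omits the small epsilon that library implementations typically add for numerical stability.

```python
import math

def clip_by_global_norm(grads: list, max_norm: float) -> list:
    """Rescale all gradients together so their combined L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads)              # already small enough, leave untouched
    scale = max_norm / total_norm
    return [g * scale for g in grads]

# Norm of [6, 8] is 10, so both values are scaled by 0.5
clipped = clip_by_global_norm([6.0, 8.0], max_norm=5.0)
unchanged = clip_by_global_norm([0.3, 0.4], max_norm=5.0)
print(clipped, unchanged)
```

Rescaling preserves the direction of the gradient and only limits its length, unlike clipping each value independently.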


In [31]:
def train_test(net, train_loader, valid_loader, test_loader, batch_size=32):
  # loss and optimization functions
  lr=0.001

  criterion = nn.BCELoss()
  optimizer = torch.optim.Adam(net.parameters(), lr=lr)

  # training params

  epochs = 4

  counter = 0
  print_every = 100
  clip=5 # gradient clipping

  # move model to GPU, if available
  if(train_on_gpu):
      net.cuda()

  net.train()
  # train for some number of epochs
  for e in range(epochs):
      # initialize hidden state
      h = net.init_hidden(batch_size)

      # batch loop
      for inputs, labels in train_loader:
          counter += 1

          if(train_on_gpu):
              inputs, labels = inputs.cuda(), labels.cuda()

          # Creating new variables for the hidden state, otherwise
          # we'd backprop through the entire training history
          h = tuple([each.data for each in h])

          # zero accumulated gradients
          net.zero_grad()
          # get the output from the model
          output, h = net(inputs, h)

          # calculate the loss and perform backprop
          loss = criterion(output.squeeze(), labels.float())
          loss.backward()
          # `clip_grad_norm_` helps prevent the exploding gradient problem in RNNs / LSTMs.
          nn.utils.clip_grad_norm_(net.parameters(), clip)
          optimizer.step()

          # loss stats
          if counter % print_every == 0:
              # Get validation loss
              val_h = net.init_hidden(batch_size)
              val_losses = []
              net.eval()
              for inputs, labels in valid_loader:

                  # Creating new variables for the hidden state, otherwise
                  # we'd backprop through the entire training history
                  val_h = tuple([each.data for each in val_h])

                  if(train_on_gpu):
                      inputs, labels = inputs.cuda(), labels.cuda()

                  output, val_h = net(inputs, val_h)
                  val_loss = criterion(output.squeeze(), labels.float())

                  val_losses.append(val_loss.item())

              net.train()
              print("Epoch: {}/{}...".format(e+1, epochs),
                    "Step: {}...".format(counter),
                    "Loss: {:.6f}...".format(loss.item()),
                    "Val Loss: {:.6f}".format(np.mean(val_losses)))

  """
  STARTING TO TEST
  """

  # Get test data loss and accuracy

  test_losses = [] # track loss
  num_correct = 0

  # init hidden state
  h = net.init_hidden(batch_size)

  net.eval()
  # iterate over test data
  for inputs, labels in test_loader:

      # Creating new variables for the hidden state, otherwise
      # we'd backprop through the entire training history
      h = tuple([each.data for each in h])

      if(train_on_gpu):
          inputs, labels = inputs.cuda(), labels.cuda()

      # get predicted outputs
      output, h = net(inputs, h)

      # calculate loss
      test_loss = criterion(output.squeeze(), labels.float())
      test_losses.append(test_loss.item())

      # convert output probabilities to predicted class (0 or 1)
      pred = torch.round(output.squeeze())  # rounds to the nearest integer

      # compare predictions to true label
      correct_tensor = pred.eq(labels.float().view_as(pred))
      correct = np.squeeze(correct_tensor.numpy()) if not train_on_gpu else np.squeeze(correct_tensor.cpu().numpy())
      num_correct += np.sum(correct)


  # -- stats! -- ##
  # avg test loss
  print("Test loss: {:.3f}".format(np.mean(test_losses)))

  # accuracy over all test data
  test_acc = num_correct/len(test_loader.dataset)
  print("Test accuracy: {:.3f}".format(test_acc))

A minimum test accuracy of 84% is required.

In [32]:
train_test(net_int, reviews_ints_batch[3], reviews_ints_batch[4], reviews_ints_batch[5])

Epoch: 1/4... Step: 100... Loss: 0.355172... Val Loss: 0.429872
Epoch: 1/4... Step: 200... Loss: 0.246268... Val Loss: 0.321698
Epoch: 1/4... Step: 300... Loss: 0.249701... Val Loss: 0.296915
Epoch: 2/4... Step: 400... Loss: 0.258323... Val Loss: 0.287476
Epoch: 2/4... Step: 500... Loss: 0.280142... Val Loss: 0.266661
Epoch: 2/4... Step: 600... Loss: 0.116212... Val Loss: 0.266135
Epoch: 3/4... Step: 700... Loss: 0.112436... Val Loss: 0.279043
Epoch: 3/4... Step: 800... Loss: 0.364446... Val Loss: 0.331357
Epoch: 3/4... Step: 900... Loss: 0.329708... Val Loss: 0.281977
Epoch: 3/4... Step: 1000... Loss: 0.159596... Val Loss: 0.291328
Epoch: 4/4... Step: 1100... Loss: 0.008357... Val Loss: 0.345549
Epoch: 4/4... Step: 1200... Loss: 0.245313... Val Loss: 0.400551
Epoch: 4/4... Step: 1300... Loss: 0.051134... Val Loss: 0.370692
Test loss: 0.401
Test accuracy: 0.887


In [33]:
train_test(net_emb, reviews_embs_batch[3], reviews_embs_batch[4], reviews_embs_batch[5])

Epoch: 1/4... Step: 100... Loss: 0.537815... Val Loss: 0.522613
Epoch: 1/4... Step: 200... Loss: 0.345190... Val Loss: 0.370334
Epoch: 1/4... Step: 300... Loss: 0.343093... Val Loss: 0.338043
Epoch: 2/4... Step: 400... Loss: 0.259589... Val Loss: 0.316154
Epoch: 2/4... Step: 500... Loss: 0.477005... Val Loss: 0.325998
Epoch: 2/4... Step: 600... Loss: 0.277201... Val Loss: 0.327965
Epoch: 3/4... Step: 700... Loss: 0.204179... Val Loss: 0.286772
Epoch: 3/4... Step: 800... Loss: 0.203407... Val Loss: 0.289831
Epoch: 3/4... Step: 900... Loss: 0.330558... Val Loss: 0.287656
Epoch: 3/4... Step: 1000... Loss: 0.325915... Val Loss: 0.285366
Epoch: 4/4... Step: 1100... Loss: 0.220127... Val Loss: 0.325600
Epoch: 4/4... Step: 1200... Loss: 0.183294... Val Loss: 0.294614
Epoch: 4/4... Step: 1300... Loss: 0.185305... Val Loss: 0.323806
Test loss: 0.275
Test accuracy: 0.881
