
<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc" style="margin-top: 1em;"><ul class="toc-item"><li><span><a href="#Text-Classification" data-toc-modified-id="Text-Classification-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Text Classification</a></span><ul class="toc-item"><li><span><a href="#Subjectivity-Dataset" data-toc-modified-id="Subjectivity-Dataset-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Subjectivity Dataset</a></span></li><li><span><a href="#Tokenization" data-toc-modified-id="Tokenization-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Tokenization</a></span><ul class="toc-item"><li><span><a href="#Simple-Tokenization" data-toc-modified-id="Simple-Tokenization-1.2.1"><span class="toc-item-num">1.2.1&nbsp;&nbsp;</span>Simple Tokenization</a></span></li><li><span><a href="#Much-better-tokenization-with-Spacy" data-toc-modified-id="Much-better-tokenization-with-Spacy-1.2.2"><span class="toc-item-num">1.2.2&nbsp;&nbsp;</span>Much better tokenization with Spacy</a></span></li></ul></li><li><span><a href="#Split-dataset-in-train-and-validation" data-toc-modified-id="Split-dataset-in-train-and-validation-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Split dataset in train and validation</a></span></li><li><span><a href="#Word-to-index-mapping" data-toc-modified-id="Word-to-index-mapping-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Word to index mapping</a></span></li><li><span><a href="#Sentence-encoding" data-toc-modified-id="Sentence-encoding-1.5"><span class="toc-item-num">1.5&nbsp;&nbsp;</span>Sentence encoding</a></span></li><li><span><a href="#Embedding-layer" data-toc-modified-id="Embedding-layer-1.6"><span class="toc-item-num">1.6&nbsp;&nbsp;</span>Embedding layer</a></span></li><li><span><a href="#Continuous-Bag-of-Words-Model" data-toc-modified-id="Continuous-Bag-of-Words-Model-1.7"><span class="toc-item-num">1.7&nbsp;&nbsp;</span>Continuous Bag of Words Model</a></span></li></ul></li><li><span><a href="#Training-the-CBOW-model" 
data-toc-modified-id="Training-the-CBOW-model-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Training the CBOW model</a></span></li><li><span><a href="#Data-loaders-for-SGD" data-toc-modified-id="Data-loaders-for-SGD-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Data loaders for SGD</a></span></li></ul></div>

In [None]:
# import pytorch libraries
%matplotlib inline
import torch
import torch.autograd as autograd
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np

# Text Classification
In this part of the tutorial we develop a continuous bag of words (CBOW) model for a text classification task described [here](https://people.cs.umass.edu/~miyyer/pubs/2015_acl_dan.pdf). The CBOW model was first described [here](https://arxiv.org/pdf/1301.3781.pdf).

## Subjectivity Dataset
The subjectivity dataset has 5000 subjective and 5000 objective processed sentences. To get the data:
```
wget http://www.cs.cornell.edu/people/pabo/movie-review-data/rotten_imdb.tar.gz
```

In [15]:
def unpack_dataset():
    ! wget http://www.cs.cornell.edu/people/pabo/movie-review-data/rotten_imdb.tar.gz
    ! mkdir data
    ! tar -xvf rotten_imdb.tar.gz -C data

In [16]:
unpack_dataset()

URL transformed to HTTPS due to an HSTS policy
--2024-09-18 16:04:35--  https://www.cs.cornell.edu/people/pabo/movie-review-data/rotten_imdb.tar.gz
Resolving www.cs.cornell.edu (www.cs.cornell.edu)... 132.236.207.53
Connecting to www.cs.cornell.edu (www.cs.cornell.edu)|132.236.207.53|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 519599 (507K) [application/x-gzip]
Saving to: ‘rotten_imdb.tar.gz.1’


2024-09-18 16:04:36 (5.88 MB/s) - ‘rotten_imdb.tar.gz.1’ saved [519599/519599]

mkdir: cannot create directory ‘data’: File exists
quote.tok.gt9.5000
plot.tok.gt9.5000
subjdata.README.1.0


In [17]:
!ls data

plot.tok.gt9.5000  quote.tok.gt9.5000  subjdata.README.1.0


In [18]:
! head -2 data/plot.tok.gt9.5000

the movie begins in the past where a young boy named sam attempts to save celebi from a hunter . 
emerging from the human psyche and showing characteristics of abstract expressionism , minimalism and russian constructivism , graffiti removal has secured its place in the history of modern art while being created by artists who are unconscious of their artistic achievements . 


In [19]:
from pathlib import Path
PATH = Path("data")
list(PATH.iterdir())

[PosixPath('data/plot.tok.gt9.5000'),
 PosixPath('data/quote.tok.gt9.5000'),
 PosixPath('data/subjdata.README.1.0')]

## Tokenization
Tokenization is the task of chopping up text into pieces, called tokens.

spaCy is an open-source software library for advanced Natural Language Processing. Here we will use it for tokenization.  

### Simple Tokenization

In [21]:
# We need each line in the file
def read_file(path):
    """Read a file and return its lines as a list of strings."""
    with open(path, encoding = "ISO-8859-1") as f:
        content = f.readlines()
    return content

In [22]:
obj_lines = read_file(PATH/"plot.tok.gt9.5000")

In [23]:
obj_lines[0]

'the movie begins in the past where a young boy named sam attempts to save celebi from a hunter . \n'

In [None]:
np.array(obj_lines[0].strip().lower().split(" "))

array(['the', 'movie', 'begins', 'in', 'the', 'past', 'where', 'a',
       'young', 'boy', 'named', 'sam', 'attempts', 'to', 'save', 'celebi',
       'from', 'a', 'hunter', '.'], dtype='<U8')
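Splitting on spaces works above because the sentences in this file already have spaces around punctuation. On raw text it would leave punctuation glued to words. The following is a minimal sketch of a slightly more careful tokenizer using only the standard library; `simple_tokenize` and the sample sentence are illustrative and not part of the dataset or the notebook.

```python
import re

def simple_tokenize(text):
    """Lowercase the text and split it into words and single punctuation marks."""
    # \w+ matches a run of letters/digits/underscores; [^\w\s] matches one
    # character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text.lower())

raw = "The movie begins in the past, where Sam saves Celebi."

# Plain whitespace splitting keeps "past," and "celebi." as single tokens.
print(raw.lower().split(" "))

# The regex separates the comma and the final period into their own tokens.
print(simple_tokenize(raw))
```

Note that this pattern also splits contractions such as "director's" into three tokens, which is one reason to prefer a dedicated tokenizer like spaCy for real text.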

### Much better tokenization with Spacy

In [24]:
!pip install -U spacy



In [25]:
import spacy

In [26]:
# Run this the first time only. The 'en' shortcut is deprecated (see the
# warning below); the full pipeline name is 'en_core_web_sm'.
!python3 -m spacy download en

⚠ As of spaCy v3.0, shortcuts like 'en' are deprecated. Please use the
full pipeline package name 'en_core_web_sm' instead.
Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 89.9 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [28]:
tok = spacy.load('en_core_web_sm')

In [29]:
obj_lines = read_file(PATH/"plot.tok.gt9.5000")

In [30]:
len(obj_lines)

5000

In [31]:
obj_lines[0]

'the movie begins in the past where a young boy named sam attempts to save celebi from a hunter . \n'

In [33]:
test = tok(obj_lines[0])

In [None]:
# collect the text of each spaCy token
np.array([x.text for x in test])

## Split dataset in train and validation

In [34]:
from sklearn.model_selection import train_test_split

In [35]:
sub_content = read_file(PATH/"quote.tok.gt9.5000")
obj_content = read_file(PATH/"plot.tok.gt9.5000")
sub_content = np.array([line.strip().lower() for line in sub_content])
obj_content = np.array([line.strip().lower() for line in obj_content])
sub_y = np.zeros(len(sub_content))
obj_y = np.ones(len(obj_content))
X = np.append(sub_content, obj_content)
y = np.append(sub_y, obj_y)

In [36]:
X[0], y[0]

('smart and alert , thirteen conversations about one thing is a small gem .',
 0.0)

In [37]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

In [38]:
X_train[:5], y_train[:5]

(array(['will god let her fall or give her a new path ?',
        "the director's twitchy sketchbook style and adroit perspective shifts grow wearisome amid leaden pacing and indifferent craftsmanship ( most notably wretched sound design ) .",
        "welles groupie/scholar peter bogdanovich took a long time to do it , but he's finally provided his own broadside at publishing giant william randolph hearst .",
        'based on the 1997 john king novel of the same name with a rather odd synopsis : " a first novel about a seasoned chelsea football club hooligan who represents a disaffected society operating by brutal rules .',
        'yet , beneath an upbeat appearance , she is struggling desperately with the emotional and physical scars left by the attack .'],
       dtype='<U691'),
 array([1., 0., 0., 1., 1.]))

## Word to index mapping
In the interest of time we will tokenize without spaCy, simply splitting on whitespace. Here we compute a vocabulary of words from the training set and a mapping from each word to an index.
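As a minimal, self-contained sketch of the idea on a toy corpus: count in how many sentences each word appears, drop rare words, and assign an index to each remaining word, reserving 0 for padding and 1 for unknown words as the cells below do. The names `build_vocab`, `min_count`, and `toy_sentences` are illustrative assumptions, not part of the notebook.

```python
from collections import Counter

def build_vocab(sentences, min_count=2):
    """Map each sufficiently frequent word to an integer index."""
    doc_freq = Counter()
    for sentence in sentences:
        # set() counts a word at most once per sentence
        doc_freq.update(set(sentence.split()))
    word2index = {"<PAD>": 0, "UNK": 1}
    for word in sorted(doc_freq):  # sorted only to make the indices reproducible
        if doc_freq[word] >= min_count:
            word2index[word] = len(word2index)
    return word2index

toy_sentences = ["a good movie", "a bad movie", "a good good plot"]
toy_vocab = build_vocab(toy_sentences)
print(toy_vocab)
```

Here "bad" and "plot" each appear in only one sentence, so they are dropped and would later be encoded as `UNK`.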

In [39]:
from collections import defaultdict

In [40]:
def get_vocab(content):
    """Computes a dict of document frequencies.

    Maps each word to the number of lines (documents) that contain it.
    A word is counted at most once per line because of the set().
    """
    vocab = defaultdict(float)
    for line in content:
        words = set(line.split())
        for word in words:
            vocab[word] += 1
    return vocab

In [41]:
#Getting the vocabulary from the training set
word_count = get_vocab(X_train)

In [42]:
word_count

defaultdict(float,
            {'new': 301.0,
             'her': 664.0,
             'fall': 27.0,
             'path': 23.0,
             'give': 45.0,
             'or': 300.0,
             '?': 145.0,
             'god': 30.0,
             'will': 285.0,
             'a': 4295.0,
             'let': 16.0,
             '.': 7796.0,
             ')': 545.0,
             'pacing': 8.0,
             'shifts': 3.0,
             'and': 4116.0,
             'twitchy': 1.0,
             'grow': 14.0,
             "director's": 8.0,
             'wearisome': 1.0,
             'wretched': 1.0,
             'leaden': 4.0,
             'perspective': 8.0,
             '(': 543.0,
             'sketchbook': 2.0,
             'design': 4.0,
             'sound': 16.0,
             'craftsmanship': 2.0,
             'adroit': 2.0,
             'style': 32.0,
             'amid': 8.0,
             'the': 5247.0,
             'indifferent': 3.0,
             'notably': 4.0,
             'most': 205

In [43]:
len(word_count.keys())

21415

In [45]:
# let's delete infrequent words (those that appear in fewer than 5 training sentences)
for word in list(word_count):
    if word_count[word] < 5:
        del word_count[word]

In [46]:
len(word_count.keys())

4065

In [47]:
## Finally we need an index for each word in the vocab
vocab2index = {"<PAD>":0, "UNK":1} # init with padding and unknown
words = ["<PAD>", "UNK"]
for word in word_count:
    vocab2index[word] = len(words)
    words.append(word)

In [48]:
vocab2index

{'<PAD>': 0,
 'UNK': 1,
 'new': 2,
 'her': 3,
 'fall': 4,
 'path': 5,
 'give': 6,
 'or': 7,
 '?': 8,
 'god': 9,
 'will': 10,
 'a': 11,
 'let': 12,
 '.': 13,
 ')': 14,
 'pacing': 15,
 'and': 16,
 'grow': 17,
 "director's": 18,
 'perspective': 19,
 '(': 20,
 'sound': 21,
 'style': 22,
 'amid': 23,
 'the': 24,
 'most': 25,
 'long': 26,
 'william': 27,
 'own': 28,
 'peter': 29,
 'do': 30,
 'took': 31,
 'giant': 32,
 ',': 33,
 'finally': 34,
 'to': 35,
 'time': 36,
 'it': 37,
 'his': 38,
 "he's": 39,
 'but': 40,
 'at': 41,
 'provided': 42,
 'based': 43,
 'odd': 44,
 'represents': 45,
 '"': 46,
 'football': 47,
 'who': 48,
 'brutal': 49,
 'chelsea': 50,
 ':': 51,
 'name': 52,
 'seasoned': 53,
 'same': 54,
 'society': 55,
 'john': 56,
 'by': 57,
 'novel': 58,
 'of': 59,
 'about': 60,
 'club': 61,
 'first': 62,
 'on': 63,
 'king': 64,
 'rather': 65,
 'rules': 66,
 'with': 67,
 'physical': 68,
 'an': 69,
 'left': 70,
 'attack': 71,
 'beneath': 72,
 'emotional': 73,
 'is': 74,
 'desperately': 75

## Sentence encoding
Here we encode each sentence as a sequence of indices corresponding to each word.

In [49]:
x_train_len = np.array([len(x.split()) for x in X_train])
x_val_len = np.array([len(x.split()) for x in X_val])

In [50]:
np.percentile(x_train_len, 95) # let's set the max sequence length to N=40

43.0

In [51]:
X_train[0]

'will god let her fall or give her a new path ?'

In [52]:
# returns the index of the word or the index of "UNK" otherwise
vocab2index.get("?", vocab2index["UNK"])

8

In [53]:
np.array([vocab2index.get(w, vocab2index["UNK"]) for w in X_train[0].split()])

array([10,  9, 12,  3,  4,  7,  6,  3, 11,  2,  5,  8])

In [54]:
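def encode_sentence(s, N=40):
    """Encode sentence s as N word indices, padding with 0 or truncating."""
    enc = np.zeros(N, dtype=np.int32)  # 0 is the index of "<PAD>"
    enc1 = np.array([vocab2index.get(w, vocab2index["UNK"]) for w in s.split()])
    l = min(N, len(enc1))  # number of words actually kept
    enc[:l] = enc1[:l]
    return enc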

In [55]:
encode_sentence(X_train[0])

array([10,  9, 12,  3,  4,  7,  6,  3, 11,  2,  5,  8,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0], dtype=int32)

In [56]:
x_train_len = np.minimum(x_train_len, 40)
x_val_len = np.minimum(x_val_len, 40)

In [57]:
x_train = np.vstack([encode_sentence(x) for x in X_train])
x_train.shape

(8000, 40)

In [58]:
x_val = np.vstack([encode_sentence(x) for x in X_val])
x_val.shape

(2000, 40)
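The example sentence above is shorter than the maximum length and contains only known words. The following sketch exercises the two remaining cases, unknown words and truncation, on a toy vocabulary. The names `toy_vocab` and `encode` are illustrative; `encode` follows the same logic as `encode_sentence` but takes the vocabulary as an argument so that the block is self-contained.

```python
import numpy as np

toy_vocab = {"<PAD>": 0, "UNK": 1, "a": 2, "good": 3, "movie": 4}

def encode(sentence, vocab, max_len):
    """Return (padded or truncated index array, number of words kept)."""
    ids = [vocab.get(w, vocab["UNK"]) for w in sentence.split()][:max_len]
    enc = np.zeros(max_len, dtype=np.int32)  # 0 is the padding index
    enc[:len(ids)] = ids
    return enc, len(ids)

# Shorter than max_len: padded with zeros at the end.
short_enc, short_len = encode("a good movie", toy_vocab, max_len=5)

# Longer than max_len, with out-of-vocabulary words: truncated, unknowns become 1.
long_enc, long_len = encode("a truly good movie indeed , really", toy_vocab, max_len=5)

print(short_enc, short_len)
print(long_enc, long_len)
```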

## Embedding layer
Most deep learning models represent words as dense vectors of real numbers (word embeddings), as opposed to one-hot encodings. The module `torch.nn.Embedding` is used to represent word embeddings. It takes two required arguments: the vocabulary size and the dimensionality of the embeddings. The embeddings are initialized with random vectors.
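To make the contrast with one-hot encodings concrete, the sketch below checks that an embedding lookup gives the same result as multiplying a one-hot matrix by the embedding weight matrix. The lookup simply avoids building the large, sparse one-hot vectors. The sizes and indices are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, emb_size = 10, 4
embed = nn.Embedding(vocab_size, emb_size)

ids = torch.tensor([1, 4, 1])  # three word indices

# One-hot route: a (3, 10) matrix times the (10, 4) weight matrix.
one_hot = F.one_hot(ids, num_classes=vocab_size).float()
via_matmul = one_hot @ embed.weight

# Lookup route: select rows 1, 4 and 1 of the weight matrix directly.
via_lookup = embed(ids)

print(torch.allclose(via_matmul, via_lookup))
```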

In [59]:
# an Embedding module containing 10 words with embedding size 4
# embedding will be initialized at random
embed = nn.Embedding(10, 4, padding_idx=0)
embed.weight

Parameter containing:
tensor([[ 0.0000,  0.0000,  0.0000,  0.0000],
        [-0.5163, -1.0105,  2.1750, -0.3798],
        [-0.0627, -1.0944, -1.9079,  1.2036],
        [-0.5578,  0.8704,  1.0049,  0.9250],
        [-0.0140, -1.3842, -0.2061,  0.1309],
        [-0.3215,  1.1934, -1.9050, -0.7542],
        [ 0.0360, -1.2057, -0.3352,  1.0153],
        [ 0.0506, -0.5088,  0.2579, -1.5502],
        [ 0.6085,  1.0207,  0.5835, -0.2245],
        [-0.8248, -0.0253,  0.0620,  2.2008]], requires_grad=True)

Note that the row at `padding_idx` (index 0 here) is the zero vector.

In [60]:
# given a list of ids we can "look up" the embedding corresponding to each id
# can you see that some vectors are the same?
a = torch.LongTensor([[1,4,1,5,1,0]])
embed(a)

tensor([[[-0.5163, -1.0105,  2.1750, -0.3798],
         [-0.0140, -1.3842, -0.2061,  0.1309],
         [-0.5163, -1.0105,  2.1750, -0.3798],
         [-0.3215,  1.1934, -1.9050, -0.7542],
         [-0.5163, -1.0105,  2.1750, -0.3798],
         [ 0.0000,  0.0000,  0.0000,  0.0000]]], grad_fn=<EmbeddingBackward0>)

This would be the representation of a sentence whose words have indices [1,4,1,5,1], followed by one padding token. Below is an example with two sentences. The first sentence has length 3 and the second has length 2. To store both in a single tensor, we pad the end of the second sentence.

In [61]:
a = torch.LongTensor([[1,4,1], [1,3,0]])

Our model represents a sentence as the average of its word embeddings. Here is how we compute it.

In [62]:
s = torch.FloatTensor([3, 2]) # number of real (non-padding) words in each sentence

In [63]:
embed(a)

tensor([[[-0.5163, -1.0105,  2.1750, -0.3798],
         [-0.0140, -1.3842, -0.2061,  0.1309],
         [-0.5163, -1.0105,  2.1750, -0.3798]],

        [[-0.5163, -1.0105,  2.1750, -0.3798],
         [-0.5578,  0.8704,  1.0049,  0.9250],
         [ 0.0000,  0.0000,  0.0000,  0.0000]]], grad_fn=<EmbeddingBackward0>)

In [64]:
embed(a).sum(dim=1)

tensor([[-1.0466, -3.4053,  4.1439, -0.6287],
        [-1.0741, -0.1401,  3.1800,  0.5452]], grad_fn=<SumBackward1>)

In [65]:
sum_embs = embed(a).sum(dim=1)
sum_embs/ s.view(s.shape[0], 1)

tensor([[-0.3489, -1.1351,  1.3813, -0.2096],
        [-0.5371, -0.0701,  1.5900,  0.2726]], grad_fn=<DivBackward0>)
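The lengths do not have to be supplied by hand. As a sketch, assuming index 0 is used only for padding (as it is in this notebook's vocabulary), the lengths can be recovered from the index tensor itself by counting the non-zero entries in each row. The name `lengths` is illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embed = nn.Embedding(10, 4, padding_idx=0)
a = torch.LongTensor([[1, 4, 1], [1, 3, 0]])

# Count the non-padding positions per sentence; keepdim gives shape (2, 1)
# so the division below broadcasts across the embedding dimension.
lengths = (a != 0).sum(dim=1, keepdim=True).float()

# The padding rows are zero vectors, so they add nothing to the sum.
avg = embed(a).sum(dim=1) / lengths

print(lengths)
print(avg)
```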

## Continuous Bag of Words Model

In [66]:
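class CBOW(nn.Module):
    """Average the word embeddings of a sentence and apply a linear layer."""
    def __init__(self, vocab_size, emb_size=100):
        super(CBOW, self).__init__()
        # index 0 is padding: its embedding is the zero vector
        self.word_emb = nn.Embedding(vocab_size, emb_size, padding_idx=0)
        self.linear = nn.Linear(emb_size, 1)

    def forward(self, x, s):
        # x: (batch, seq_len) word indices; s: (batch, 1) sentence lengths
        x = self.word_emb(x)
        x = x.sum(dim=1)/ s   # mean over the real words only
        x = self.linear(x)    # one logit per sentence
        return x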

In [67]:
model = CBOW(vocab_size=5, emb_size=3)

In [68]:
model.word_emb.weight

Parameter containing:
tensor([[ 0.0000,  0.0000,  0.0000],
        [-1.6398, -0.2460,  0.1200],
        [-0.2734, -0.5105,  0.1026],
        [-1.4194,  1.2210, -1.3923],
        [-0.1688, -1.4079,  0.7841]], requires_grad=True)

In [69]:
s = s.view(s.shape[0], 1)
model(a, s)

tensor([[-1.3603],
        [-0.4704]], grad_fn=<AddmmBackward0>)

# Training the CBOW model

In [70]:
V = len(words)
model = CBOW(vocab_size=V, emb_size=50)
print(V)

4067


In [71]:
def val_metrics(model):
    """Return (loss, accuracy) of the model on the whole validation set."""
    model.eval()
    x = torch.LongTensor(x_val)
    y = torch.Tensor(y_val).unsqueeze(1)
    s = torch.Tensor(x_val_len).view(x_val_len.shape[0], 1)
    with torch.no_grad():  # gradients are not needed for evaluation
        y_hat = model(x, s)
        loss = F.binary_cross_entropy_with_logits(y_hat, y)
    y_pred = y_hat > 0  # a logit above 0 means a probability above 0.5
    correct = (y_pred.float() == y).float().sum()
    accuracy = correct/y_pred.shape[0]
    return loss.item(), accuracy.item()

In [72]:
# accuracy of a random model should be around 0.5
val_metrics(model)

(0.7114680409431458, 0.4909999966621399)
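The model outputs raw scores (logits), not probabilities, which is why `val_metrics` thresholds at 0 and uses the "with logits" version of the loss. The short sketch below, on made-up logits and labels, checks that thresholding the logit at 0 is the same as thresholding the sigmoid probability at 0.5, and that the two forms of binary cross-entropy agree.

```python
import torch
import torch.nn.functional as F

# Made-up logits and labels, for illustration only.
logits = torch.tensor([[-1.5], [-0.5], [0.0], [2.0]])
labels = torch.tensor([[0.0], [1.0], [0.0], [1.0]])

probs = torch.sigmoid(logits)

# sigmoid(0) = 0.5, so both thresholds select exactly the same rows.
pred_from_logits = logits > 0
pred_from_probs = probs > 0.5

# The "with logits" loss applies the sigmoid internally in a numerically stable way.
loss_with_logits = F.binary_cross_entropy_with_logits(logits, labels)
loss_from_probs = F.binary_cross_entropy(probs, labels)

print(torch.equal(pred_from_logits, pred_from_probs))
print(torch.allclose(loss_with_logits, loss_from_probs))
```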

In [73]:
def train_epocs(model, epochs=10, lr=0.01):
    """Full-batch training: each epoch is one gradient step on the whole training set."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for i in range(epochs):
        model.train()
        x = torch.LongTensor(x_train)
        y = torch.Tensor(y_train).unsqueeze(1)
        s = torch.Tensor(x_train_len).view(x_train_len.shape[0], 1)
        y_hat = model(x, s)
        loss = F.binary_cross_entropy_with_logits(y_hat, y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        val_loss, val_accuracy = val_metrics(model)
        print("train_loss %.3f val_loss %.3f val_accuracy %.3f" % (loss.item(), val_loss, val_accuracy))

In [74]:
train_epocs(model, epochs=10, lr=0.1)

train_loss 0.709 val_loss 0.687 val_accuracy 0.520
train_loss 0.685 val_loss 0.598 val_accuracy 0.732
train_loss 0.592 val_loss 0.519 val_accuracy 0.826
train_loss 0.504 val_loss 0.433 val_accuracy 0.851
train_loss 0.406 val_loss 0.346 val_accuracy 0.890
train_loss 0.309 val_loss 0.296 val_accuracy 0.890
train_loss 0.248 val_loss 0.264 val_accuracy 0.893
train_loss 0.201 val_loss 0.250 val_accuracy 0.902
train_loss 0.168 val_loss 0.255 val_accuracy 0.900
train_loss 0.150 val_loss 0.260 val_accuracy 0.905


In [75]:
train_epocs(model, epochs=10, lr=0.01)

train_loss 0.132 val_loss 0.255 val_accuracy 0.910
train_loss 0.127 val_loss 0.253 val_accuracy 0.909
train_loss 0.123 val_loss 0.253 val_accuracy 0.911
train_loss 0.120 val_loss 0.253 val_accuracy 0.910
train_loss 0.117 val_loss 0.252 val_accuracy 0.910
train_loss 0.114 val_loss 0.252 val_accuracy 0.910
train_loss 0.111 val_loss 0.253 val_accuracy 0.911
train_loss 0.108 val_loss 0.253 val_accuracy 0.908
train_loss 0.105 val_loss 0.254 val_accuracy 0.907
train_loss 0.102 val_loss 0.255 val_accuracy 0.908


# Data loaders for SGD

Nearly all of deep learning is powered by one very important algorithm: **stochastic gradient descent (SGD)**. SGD can be seen as an approximation of **gradient descent** (GD). In GD you have to run through *all* the samples in your training set to do a single iteration. In SGD you use *only one* or *a subset* of the training samples to update the parameters in a particular iteration. The subset used in each iteration is called a **batch** or **minibatch**.
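The difference can be sketched on a small linear regression problem in NumPy. This is an illustration under stated assumptions, not the training code of this notebook: the synthetic data, the function names `mse_gradient` and `fit`, and the learning rate are all made up for the example. With `batch_size=len(y)` the loop is plain GD (one update per epoch); with a smaller batch size it is minibatch SGD (several updates per epoch).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w  # noiseless targets, so the exact solution is true_w

def mse_gradient(w, X_part, y_part):
    """Gradient of the mean squared error on the given subset of samples."""
    residual = X_part @ w - y_part
    return 2 * X_part.T @ residual / len(y_part)

def fit(batch_size, epochs=50, lr=0.1):
    w = np.zeros(3)
    for _ in range(epochs):
        order = rng.permutation(len(y))  # reshuffle the samples every epoch
        for start in range(0, len(y), batch_size):
            idx = order[start:start + batch_size]
            w -= lr * mse_gradient(w, X[idx], y[idx])
    return w

w_gd = fit(batch_size=len(y))  # gradient descent: 1 update per epoch
w_sgd = fit(batch_size=20)     # minibatch SGD: 10 updates per epoch

print("GD error :", np.abs(w_gd - true_w).max())
print("SGD error:", np.abs(w_sgd - true_w).max())
```

For the same number of passes over the data, the minibatch version performs ten times as many parameter updates, which is the practical appeal of SGD on large training sets.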

In [76]:
from torch.utils.data import Dataset, DataLoader

Next we are going to create a data loader. The data loader provides the following features:
* Batching the data
* Shuffling the data
* Loading the data in parallel using multiprocessing workers
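Before building the dataset for this task, here is a minimal sketch of the `Dataset` and `DataLoader` contract on toy data. `ToyDataset` is a hypothetical class used only for illustration: a dataset needs `__len__` and `__getitem__`, and the loader groups the items returned by `__getitem__` into batches.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):
    """Items are the integers 0..n-1, labelled 1 if odd and 0 if even."""
    def __init__(self, n):
        self.data = torch.arange(n)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx], self.data[idx] % 2

# shuffle=False keeps the order fixed so the batches are easy to inspect;
# for training you would normally pass shuffle=True.
loader = DataLoader(ToyDataset(10), batch_size=4, shuffle=False)

batches = list(loader)
batch_sizes = [len(x) for x, y in batches]
print(batch_sizes)  # the last batch is smaller: 10 items do not divide evenly by 4
```

Passing `num_workers` greater than 0 to `DataLoader` enables the parallel loading mentioned in the list above; the default of 0 loads data in the main process.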

In [77]:
def encode_sentence2(s, N=40):
    enc = np.zeros(N, dtype=np.int32)
    enc1 = np.array([vocab2index.get(w, vocab2index["UNK"]) for w in s.split()])
    l = min(N, len(enc1))
    enc[:l] = enc1[:l]
    return enc, l

In [78]:
encode_sentence2(X_train[0])

(array([10,  9, 12,  3,  4,  7,  6,  3, 11,  2,  5,  8,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0], dtype=int32),
 12)

In [79]:
class SubjectivityDataset(Dataset):
    def __init__(self, X, y):
        self.x = X
        self.y = y

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        x = self.x[idx]
        x, s = encode_sentence2(x)
        return x, self.y[idx], s

sub_dataset_train = SubjectivityDataset(X_train, y_train)

In [80]:
train_loader = DataLoader(sub_dataset_train, batch_size=5, shuffle=True)
x, y, s = next(iter(train_loader))

In [81]:
x, y, s

(tensor([[   1, 2137,    1, 1870,   59,  428,   33,    1,   33,   16,    1,   13,
             0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
             0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
             0,    0,    0,    0],
         [   1,    1, 2149,    1,   33,  705,  164,  705,   96,  208,    1,    3,
            57,    1,  758, 3948,   63,  191,  110,   13,    0,    0,    0,    0,
             0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
             0,    0,    0,    0],
         [  37,  177,  497,   93,  612, 1037,   40,  524,  267,   33,  206,  694,
            37,   74,    8, 1037,   40,   37,  441,   94,  189,   59,   24,   25,
          3033, 2550,   16, 2814,    1,   41,   24,  740,  452,  654,   81,   11,
           379,   13,    0,    0],
         [  24,  308, 1755,   16,   15,  473,  184,  872,  163,   24,    1,   13,
             0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,

In [82]:
model = CBOW(vocab_size=V, emb_size=50)

In [83]:
train_loader = DataLoader(sub_dataset_train, batch_size=500, shuffle=True)

In [84]:
def train_epocs(model, epochs=10, lr=0.01):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for i in range(epochs):
        total_loss = 0
        total = 0
        model.train()
        for x, y, s in train_loader:
            x = x.type(torch.LongTensor)  #.cuda()
            y = y.type(torch.FloatTensor).unsqueeze(1)
            s = s.type(torch.Tensor).view(s.shape[0], 1)
            y_hat = model(x, s)
            loss = F.binary_cross_entropy_with_logits(y_hat, y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total_loss += x.size(0)*loss.item()
            total += x.size(0)
        train_loss = total_loss/total
        val_loss, val_accuracy = val_metrics(model)

        print("train_loss %.3f val_loss %.3f val_accuracy %.3f" % (train_loss, val_loss, val_accuracy))

In [85]:
train_epocs(model, epochs=10)

train_loss 0.656 val_loss 0.597 val_accuracy 0.780
train_loss 0.510 val_loss 0.431 val_accuracy 0.858
train_loss 0.341 val_loss 0.313 val_accuracy 0.885
train_loss 0.242 val_loss 0.263 val_accuracy 0.899
train_loss 0.189 val_loss 0.242 val_accuracy 0.905
train_loss 0.155 val_loss 0.234 val_accuracy 0.906
train_loss 0.131 val_loss 0.234 val_accuracy 0.905
train_loss 0.112 val_loss 0.234 val_accuracy 0.909
train_loss 0.097 val_loss 0.241 val_accuracy 0.907
train_loss 0.086 val_loss 0.246 val_accuracy 0.906
