##### Copyright 2018 The TensorFlow Authors (Original TensorFlow Version)
##### Translated to PyTorch 2025

In [1]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Introduction to Word Embeddings (PyTorch)

This tutorial shows how to train a sentiment classifier on the IMDB dataset using learned word embeddings with PyTorch.

First, some background. Before we can build a model to predict the sentiment of a review, we need a way to represent the words of the review as numbers so that they can be processed by our network. There are several strategies for converting words to numbers.

As a first attempt, we might one-hot encode each word. One problem with this approach is efficiency. A one-hot encoded vector is sparse (meaning, most indices are zero). Imagine we have 10,000 words in our vocabulary. To one-hot encode each one, we would create a vector where 99.99% of the elements are zero!
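To make the sparsity concrete, here is a minimal sketch of one-hot encoding a single word id with NumPy. The helper `one_hot` and the chosen id are illustrative assumptions, not part of the tutorial's pipeline.

```python
import numpy as np

vocab_size = 10_000  # vocabulary size used in the example above

def one_hot(word_id, vocab_size):
    """Return a vector of zeros with a single 1 at position word_id."""
    vec = np.zeros(vocab_size, dtype=np.float32)
    vec[word_id] = 1.0
    return vec

vec = one_hot(42, vocab_size)
zero_fraction = float((vec == 0).mean())
print(f"Non-zero entries: {int(vec.sum())}, zero fraction: {zero_fraction:.2%}")
```

Only one of the 10,000 entries carries information, which is why this representation wastes memory and compute for large vocabularies.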

Instead, we can encode each word using a unique number. For example, we might assign 1 to 'the', 42 to 'dog', and 96 to 'cat', and so on. Using these numbers, we could encode a sentence like "The dog and cat sat on the mat" as a sequence of integers in which 'the' becomes 1, 'dog' becomes 42, and 'cat' becomes 96. One problem still remains. Although we know dogs and cats are related, our representation doesn't encode that information for the classifier (the numbers 42 and 96 were arbitrarily chosen).
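The integer encoding described above can be sketched in a few lines of plain Python. The vocabulary `toy_vocab`, the ids for words other than 'the', 'dog', and 'cat', and the `UNK_ID` fallback are all hypothetical values chosen for illustration.

```python
# Hypothetical toy vocabulary; the ids are arbitrary, as in the text above.
toy_vocab = {"the": 1, "dog": 42, "cat": 96, "and": 7, "sat": 15, "on": 9, "mat": 120}
UNK_ID = 2  # assumed id for words missing from the vocabulary

def encode(sentence, vocab):
    """Lowercase, split on whitespace, and map each word to its integer id."""
    return [vocab.get(word, UNK_ID) for word in sentence.lower().split()]

encoded = encode("The dog and cat sat on the mat", toy_vocab)
print(encoded)  # [1, 42, 7, 96, 15, 9, 1, 120]
```

Nothing about the numbers 42 and 96 tells a model that the two words are related, which is the limitation the next approach addresses.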

Unlike the above methods, a word embedding is learned from data. An embedding represents each word as an n-dimensional vector of floating point values. These values are trainable parameters: weights learned while training the model. After training, we hope that similar words will be close together in the embedding space. We can visualize the learned embeddings by projecting them down to a 2- or 3-dimensional space.
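As a rough illustration, `nn.Embedding` in PyTorch is a trainable lookup table: each integer id selects one row of a weight matrix. The sizes below (100 words, 4 dimensions) are arbitrary values chosen for the sketch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # only for repeatable random initial weights

# Illustrative sizes: a 100-word vocabulary embedded in 4 dimensions.
embedding = nn.Embedding(num_embeddings=100, embedding_dim=4)

word_ids = torch.tensor([1, 42, 96])  # e.g. 'the', 'dog', 'cat'
vectors = embedding(word_ids)         # one row of the weight matrix per id

print(vectors.shape)                   # torch.Size([3, 4])
print(embedding.weight.requires_grad)  # True: the table is a trainable parameter
```

Before training, the rows are random; gradient updates during training are what move related words closer together.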

There are two ways to obtain word embeddings:

* Learn word embeddings jointly with the main task you care about (e.g. sentiment classification). In this case, you would start with random word vectors, then learn your word vectors in the same way that you learn the weights of a neural network.

* Load word embeddings into your model that were pre-computed using a different machine learning task than the one you are trying to solve. These are called "pre-trained word embeddings".

Here, we will take the first approach.
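In PyTorch terms, the two approaches differ mainly in how the embedding table is initialized and whether it is updated. The sketch below uses illustrative sizes, and `pretrained` is a random stand-in for vectors that would normally be loaded from a file.

```python
import torch
import torch.nn as nn

vocab_size, dim = 1000, 8  # illustrative sizes

# Approach 1: random initial vectors, updated by backpropagation with the rest of the model.
learned = nn.Embedding(vocab_size, dim)

# Approach 2: start from an existing matrix and keep it frozen.
# `pretrained` is a random stand-in for vectors loaded from disk.
pretrained = torch.randn(vocab_size, dim)
frozen = nn.Embedding.from_pretrained(pretrained, freeze=True)

print(learned.weight.requires_grad)  # True
print(frozen.weight.requires_grad)   # False
```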

In [2]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import numpy as np
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

PyTorch version: 2.9.0
CUDA available: False


# Download the IMDB dataset

The IMDB dataset comes packaged with TensorFlow/Keras. It has already been preprocessed such that the reviews (sequences of words) have been converted to sequences of integers, where each integer represents a specific word in a dictionary. We'll use the Keras datasets to load it.

In [3]:
# Install tensorflow for dataset loading only
# !pip install tensorflow -q

from tensorflow import keras
imdb = keras.datasets.imdb

# Number of words to consider as features
num_words = 20000

# load IMDB dataset as lists of integers
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=num_words)

The argument num_words=20000 keeps the top 20,000 most frequently occurring words in the training data.

In [4]:
print("Training examples: {}, labels: {}".format(len(train_data), len(train_labels)))

Training examples: 25000, labels: 25000


The text of each review has been converted to integers, where each integer represents a specific word in a dictionary. Here's what the first review looks like:

In [5]:
print(train_data[0])

[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 19193, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 10311, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 12118, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]


Movie reviews may have different lengths. The code below shows the number of words in the first and second reviews. Since the inputs in a batch must all have the same length, we'll need to resolve this.

In [6]:
len(train_data[0]), len(train_data[1])

(218, 189)

# Preprocess Data
We will pad the arrays so they all have the same length. PyTorch provides `torch.nn.utils.rnn.pad_sequence`, but it only pads and does not truncate, so we'll write a small NumPy helper that does both:

In [7]:

# Cut texts after this number of words
max_len = 500

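# Pad (or truncate) every sequence to exactly max_len items
def pad_sequences_pytorch(sequences, maxlen, padding='pre', truncating='pre', value=0):
    """Pad or truncate each sequence to `maxlen` items (mirrors Keras' pad_sequences)."""
    # Fill with `value` so the padding token is actually configurable
    result = np.full((len(sequences), maxlen), value, dtype=np.int64)
    for i, seq in enumerate(sequences):
        # 'pre' truncation keeps the last maxlen items; 'post' keeps the first
        trunc = seq[-maxlen:] if truncating == 'pre' else seq[:maxlen]
        if len(trunc) == 0:
            continue  # empty sequence: leave the row as padding
        if padding == 'pre':
            result[i, -len(trunc):] = trunc
        else:
            result[i, :len(trunc)] = trunc
    return result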

train_data = pad_sequences_pytorch(train_data, maxlen=max_len)
test_data = pad_sequences_pytorch(test_data, maxlen=max_len)

print(train_data.shape)

(25000, 500)
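To see what 'pre' padding and 'pre' truncation do, here is a self-contained toy version applied to two short lists. The helper `pad_pre` is a simplified sketch written for this illustration, not the function used in the cell above.

```python
import numpy as np

def pad_pre(seq, maxlen, value=0):
    """Keep the last `maxlen` items and left-pad with `value` (the 'pre' behaviour)."""
    kept = list(seq)[-maxlen:]
    return np.array([value] * (maxlen - len(kept)) + kept, dtype=np.int64)

short = pad_pre([5, 6, 7], maxlen=5)
long = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short)  # [0 0 5 6 7]
print(long)   # [3 4 5 6 7]
```

Short sequences gain zeros on the left, while long sequences lose their earliest items, so the end of a review is always preserved.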


Notice that the padding helper worked by prepending 0s to the start of the sequence:

In [8]:
print(train_data[0])

[    0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0   

# Build a Multi-Layer Perceptron
We are now ready to build our model. We will use an Embedding layer to map from an integer that corresponds to a word, to a vector of floating point weights (the embedding). These weights are learned when we train the model.

In [9]:
# Create PyTorch Dataset
class IMDBDataset(Dataset):
    def __init__(self, data, labels):
        self.data = torch.LongTensor(data)
        self.labels = torch.FloatTensor(labels)
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        return self.data[idx], self.labels[idx]

# Create datasets
train_dataset = IMDBDataset(train_data, train_labels)
test_dataset = IMDBDataset(test_data, test_labels)

# Create dataloaders
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)

In [10]:
# Define the model
class SentimentClassifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super(SentimentClassifier, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.global_avg_pool = nn.AdaptiveAvgPool1d(1)
        self.fc = nn.Linear(embedding_dim, 1)
        self.sigmoid = nn.Sigmoid()
    
    def forward(self, x):
        # x shape: (batch_size, seq_len)
        x = self.embedding(x)  # (batch_size, seq_len, embedding_dim)
        x = x.permute(0, 2, 1)  # (batch_size, embedding_dim, seq_len)
        x = self.global_avg_pool(x).squeeze(-1)  # (batch_size, embedding_dim)
        x = self.fc(x)  # (batch_size, 1)
        x = self.sigmoid(x)
        return x.squeeze(-1)  # drop only the last dim so a batch of size 1 keeps its shape

embedding_dimension = 16
model = SentimentClassifier(num_words, embedding_dimension)

# Move model to GPU if available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

print(model)

SentimentClassifier(
  (embedding): Embedding(20000, 16)
  (global_avg_pool): AdaptiveAvgPool1d(output_size=1)
  (fc): Linear(in_features=16, out_features=1, bias=True)
  (sigmoid): Sigmoid()
)


In [11]:
# Define loss function and optimizer
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters())

# Training function
def train_epoch(model, dataloader, criterion, optimizer, device):
    model.train()
    total_loss = 0
    correct = 0
    total = 0
    
    for data, labels in dataloader:
        data, labels = data.to(device), labels.to(device)
        
        optimizer.zero_grad()
        outputs = model(data)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        
        total_loss += loss.item()
        predicted = (outputs > 0.5).float()
        correct += (predicted == labels).sum().item()
        total += labels.size(0)
    
    return total_loss / len(dataloader), correct / total

# Validation function
def validate(model, dataloader, criterion, device):
    model.eval()
    total_loss = 0
    correct = 0
    total = 0
    
    with torch.no_grad():
        for data, labels in dataloader:
            data, labels = data.to(device), labels.to(device)
            outputs = model(data)
            loss = criterion(outputs, labels)
            
            total_loss += loss.item()
            predicted = (outputs > 0.5).float()
            correct += (predicted == labels).sum().item()
            total += labels.size(0)
    
    return total_loss / len(dataloader), correct / total

In [12]:
# Train the model
num_epochs = 10

# Split train data for validation
val_size = int(0.2 * len(train_dataset))
train_size = len(train_dataset) - val_size
train_subset, val_subset = torch.utils.data.random_split(train_dataset, [train_size, val_size])

train_loader = DataLoader(train_subset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_subset, batch_size=32, shuffle=False)

for epoch in range(num_epochs):
    train_loss, train_acc = train_epoch(model, train_loader, criterion, optimizer, device)
    val_loss, val_acc = validate(model, val_loader, criterion, device)
    
    print(f'Epoch {epoch+1}/{num_epochs}')
    print(f'Train Loss: {train_loss:.4f}, Train Acc: {train_acc:.4f}')
    print(f'Val Loss: {val_loss:.4f}, Val Acc: {val_acc:.4f}')
    print()

Epoch 1/10
Train Loss: 0.6918, Train Acc: 0.5388
Val Loss: 0.6839, Val Acc: 0.5800

Epoch 2/10
Train Loss: 0.6705, Train Acc: 0.6661
Val Loss: 0.6536, Val Acc: 0.7440

Epoch 3/10
Train Loss: 0.6258, Train Acc: 0.7678
Val Loss: 0.5993, Val Acc: 0.7832

Epoch 4/10
Train Loss: 0.5664, Train Acc: 0.8028
Val Loss: 0.5421, Val Acc: 0.8054

Epoch 5/10
Train Loss: 0.5096, Train Acc: 0.8268
Val Loss: 0.4919, Val Acc: 0.8224

Epoch 6/10
Train Loss: 0.4610, Train Acc: 0.8461
Val Loss: 0.4507, Val Acc: 0.8408

Epoch 7/10
Train Loss: 0.4201, Train Acc: 0.8597
Val Loss: 0.4170, Val Acc: 0.8516

Epoch 8/10
Train Loss: 0.3865, Train Acc: 0.8710
Val Loss: 0.3897, Val Acc: 0.8582

Epoch 9/10
Train Loss: 0.3587, Train Acc: 0.8786
Val Loss: 0.3674, Val Acc: 0.8660

Epoch 10/10
Train Loss: 0.3355, Train Acc: 0.8857
Val Loss: 0.3496, Val Acc: 0.8680



Our classifier reaches a validation accuracy of about 87%. Note that we use at most 500 words from each review (with 'pre' truncation, the last 500). We also apply global average pooling to the embeddings before passing the result to a single linear layer, which treats each word separately and ignores the order of the words in the sequence. To reach higher accuracy, it would help to use a recurrent layer or a 1D convolution, both of which take word order into account.
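The order-insensitivity of average pooling can be checked directly. The sketch below uses an arbitrary small embedding table and a made-up review, and averages over the sequence dimension, which is what `nn.AdaptiveAvgPool1d(1)` computes in the model above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

embedding = nn.Embedding(50, 4)  # illustrative sizes
review = torch.tensor([[3, 17, 8, 42, 5]])
shuffled = review[:, torch.randperm(review.size(1))]  # same words, different order

pooled = embedding(review).mean(dim=1)            # average over the sequence dimension
pooled_shuffled = embedding(shuffled).mean(dim=1)

print(torch.allclose(pooled, pooled_shuffled, atol=1e-6))
```

Because the pooled vectors match, any layer placed after the pooling step cannot tell "not good" from "good not".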

# Visualize Embeddings with the Embedding Projector

Recall that the reviews are encoded as series of integers in our training data. Before we can visualize the learned embeddings, we need to determine which word corresponds to each number. The IMDB dataset includes a utility method, `.get_word_index()`, that returns a mapping from words to numbers. We will use this to build a reversed word index, which maps from numbers to words.

In [13]:
# A dictionary mapping words to an integer index
word_index = imdb.get_word_index()

# The first indices are reserved
word_index = {k:(v+3) for k,v in word_index.items()}
word_index["<PAD>"] = 0
word_index["<START>"] = 1
word_index["<UNK>"] = 2  # unknown
word_index["<UNUSED>"] = 3

reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

def decode_review(text):
    return ' '.join([reverse_word_index.get(i, '?') for i in text])

Now we can use the `decode_review` function to display the text of the first review. You will see padding at the beginning, since this review was shorter than our 500-word maximum length.

In [14]:
decode_review(train_data[0])

"<PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PA

Now that we have the number-to-word mapping, we are ready to retrieve the learned embeddings from the model. This gives us a matrix of weights. Each row is the embedding for one integer index, and the corresponding word can be looked up in `reverse_word_index` above.

In PyTorch, we can get the embedding weights directly from the embedding layer.


In [15]:
# Get embedding weights
weights = model.embedding.weight.detach().cpu().numpy()
print(weights.shape)  # (20000, 16). Each word is mapped to an embedding vector.

(20000, 16)


Next, we will format these for visualization in the Embedding Projector. To do so, we need to provide two files in tab-separated format: a file of vectors (containing the embeddings) and a file of metadata (containing the words).

In [16]:
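# Write one row per word: the vector to vecs.tsv and the word itself to meta.tsv.
# The `with` block closes both files even if an error occurs.
with open('vecs.tsv', 'w', encoding='utf-8') as out_v, \
     open('meta.tsv', 'w', encoding='utf-8') as out_m:
    for word_num in range(num_words):
        word = reverse_word_index.get(word_num, '?')
        embeddings = weights[word_num]
        out_m.write(word + "\n")
        out_v.write('\t'.join([str(x) for x in embeddings]) + "\n")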

print("Embedding files saved: vecs.tsv and meta.tsv")

Embedding files saved: vecs.tsv and meta.tsv


Now, you can open the [Embedding Projector](http://projector.tensorflow.org/) in a new window, and click on 'Load data'. Upload the `vecs.tsv` and `meta.tsv` files from above. Next, click 'Search', and type in a word to find its closest neighbors. With this small dataset, not all of the learned embeddings will be interpretable, though some will be!

For example, try searching for 'beautiful'. The learned embeddings you see may differ, since they depend on the random weight initialization used by the model. When the author of this tutorial ran it, "loved" and "wonderful" were the closest neighbors. Likewise, the closest neighbors for "lame" were "awful" and "poorly".
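A neighbor search of this kind can also be approximated offline. The sketch below assumes cosine similarity as the distance measure and uses a tiny hand-made matrix in place of the learned weights; `nearest_neighbors` is a hypothetical helper, not part of the Embedding Projector.

```python
import numpy as np

def nearest_neighbors(weights, query_id, k=3):
    """Return the ids of the k rows most cosine-similar to row `query_id`."""
    norms = np.linalg.norm(weights, axis=1, keepdims=True)
    unit = weights / np.clip(norms, 1e-12, None)  # avoid division by zero
    sims = unit @ unit[query_id]
    sims[query_id] = -np.inf  # exclude the query itself
    return np.argsort(-sims)[:k]

# Toy 2-D "embeddings": rows 0 and 1 point almost the same way,
# row 2 is orthogonal to row 0, and row 3 points the opposite way.
toy = np.array([[1.0, 0.0], [2.0, 0.1], [0.0, 1.0], [-1.0, 0.0]])
neighbors = nearest_neighbors(toy, query_id=0, k=2)
print(neighbors)  # [1 2]
```

With the notebook's variables, a call such as `nearest_neighbors(weights, word_index['beautiful'])` followed by a lookup in `reverse_word_index` should give a comparable list, though the projector's own distance settings may produce different results.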


# A More Advanced Model
We will implement a more advanced model that demonstrates two things:
1. The use of pre-trained embeddings.
2. The use of a 1D CNN.

We will implement a depthwise separable convolutional neural network, a type of CNN described in a paper by Francois Chollet: https://arxiv.org/abs/1610.02357. Because a CNN uses a sliding window, it takes the order of the words in our text into consideration.
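One motivation for depthwise separable convolutions is their smaller parameter count. The sketch below compares a standard `nn.Conv1d` with a depthwise-plus-pointwise pair for one illustrative choice of sizes (100 input channels, 50 output channels, kernel size 5); the helper `count_params` is introduced only for this comparison.

```python
import torch.nn as nn

def count_params(module):
    """Total number of scalar parameters in a module."""
    return sum(p.numel() for p in module.parameters())

in_ch, out_ch, k = 100, 50, 5

# Standard convolution: every output channel mixes all input channels over the window.
standard = nn.Conv1d(in_ch, out_ch, k, padding=k // 2)

# Separable convolution: filter each channel on its own, then mix channels with 1x1 kernels.
separable = nn.Sequential(
    nn.Conv1d(in_ch, in_ch, k, groups=in_ch, padding=k // 2, bias=False),  # depthwise
    nn.Conv1d(in_ch, out_ch, 1),                                           # pointwise
)

print(count_params(standard))   # 25050
print(count_params(separable))  # 5550
```

For these sizes the separable version needs roughly a fifth of the parameters while producing an output of the same shape.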

In [17]:
# download pretrained GloVe embeddings
!wget http://nlp.stanford.edu/data/glove.6B.zip

zsh:1: command not found: wget


In [18]:
!unzip glove.6B.zip

Archive:  glove.6B.zip
  inflating: glove.6B.50d.txt        
  inflating: glove.6B.100d.txt       
  inflating: glove.6B.200d.txt       
  inflating: glove.6B.300d.txt       


In [19]:
import os
import numpy as np

In [20]:
glove_dir = './'

embeddings_index = {}  # maps each word to its GloVe vector
# GloVe files are UTF-8 encoded; say so explicitly rather than rely on the platform default
with open(os.path.join(glove_dir, 'glove.6B.100d.txt'), encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))

Found 400000 word vectors.


In [21]:
embedding_dim = 100

embedding_matrix = np.zeros((num_words, embedding_dim))  # every row starts as all zeros
for word, i in word_index.items():
    if i < num_words:
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
        # Words not found in the embedding index keep their all-zero row.
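It can be useful to check how many vocabulary rows actually received a pre-trained vector. The sketch below repeats the same fill logic on toy stand-ins; `toy_word_index` and `toy_embeddings_index` are hypothetical and only mimic the structure of `word_index` and `embeddings_index`.

```python
import numpy as np

# Toy stand-ins for the notebook's `word_index` and `embeddings_index`.
toy_word_index = {"<PAD>": 0, "good": 1, "bad": 2, "zzzx": 3}
toy_embeddings_index = {
    "good": np.ones(4, dtype="float32"),
    "bad": -np.ones(4, dtype="float32"),
}

toy_matrix = np.zeros((len(toy_word_index), 4), dtype="float32")
for word, i in toy_word_index.items():
    vector = toy_embeddings_index.get(word)
    if vector is not None:
        toy_matrix[i] = vector

# A row that is still all zeros had no pre-trained vector.
covered = int(np.any(toy_matrix != 0, axis=1).sum())
print(f"{covered} of {len(toy_matrix)} rows have a pre-trained vector")
```

Applying the same `np.any(...)` check to `embedding_matrix` would show how much of the 20,000-word vocabulary GloVe covers.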

In [22]:
# Define Separable Conv1D layers for PyTorch
class SeparableConv1d(nn.Module):
    """Depthwise separable convolution"""
    def __init__(self, in_channels, out_channels, kernel_size, padding=0, bias=True):
        super(SeparableConv1d, self).__init__()
        self.depthwise = nn.Conv1d(in_channels, in_channels, kernel_size, 
                                   groups=in_channels, padding=padding, bias=False)
        self.pointwise = nn.Conv1d(in_channels, out_channels, 1, bias=bias)
    
    def forward(self, x):
        x = self.depthwise(x)
        x = self.pointwise(x)
        return x

In [23]:
# Define the CNN model with pre-trained embeddings
class SentimentCNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, embedding_matrix, 
                 blocks=4, filters=50, kernel_size=5, dropout_rate=0.3):
        super(SentimentCNN, self).__init__()
        
        # Embedding layer with pre-trained weights
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.embedding.weight = nn.Parameter(torch.FloatTensor(embedding_matrix))
        self.embedding.weight.requires_grad = False  # Freeze embeddings
        
        # Build convolutional blocks
        self.blocks = nn.ModuleList()
        in_channels = embedding_dim
        
        for _ in range(blocks):
            block = nn.Sequential(
                nn.Dropout(dropout_rate),
                SeparableConv1d(in_channels, filters, kernel_size, padding=kernel_size//2),
                nn.ReLU(),
                SeparableConv1d(filters, filters, kernel_size, padding=kernel_size//2),
                nn.ReLU(),
                nn.MaxPool1d(kernel_size=2, stride=1, padding=0)
            )
            self.blocks.append(block)
            in_channels = filters
        
        # Final layers
        self.sep_conv1 = SeparableConv1d(filters, filters * 2, kernel_size, padding=kernel_size//2)
        self.sep_conv2 = SeparableConv1d(filters * 2, filters * 2, kernel_size, padding=kernel_size//2)
        self.relu = nn.ReLU()
        self.global_avg_pool = nn.AdaptiveAvgPool1d(1)
        self.dropout = nn.Dropout(dropout_rate)
        self.fc = nn.Linear(filters * 2, 1)
        self.sigmoid = nn.Sigmoid()
    
    def forward(self, x):
        # x shape: (batch_size, seq_len)
        x = self.embedding(x)  # (batch_size, seq_len, embedding_dim)
        x = x.permute(0, 2, 1)  # (batch_size, embedding_dim, seq_len)
        
        # Apply convolutional blocks
        for block in self.blocks:
            x = block(x)
        
        # Final convolutions
        x = self.relu(self.sep_conv1(x))
        x = self.relu(self.sep_conv2(x))
        
        # Global average pooling and classification
        x = self.global_avg_pool(x).squeeze(-1)
        x = self.dropout(x)
        x = self.fc(x)
        x = self.sigmoid(x)
        return x.squeeze(-1)  # drop only the last dim so a batch of size 1 keeps its shape

# Create the model
cnn_model = SentimentCNN(num_words, embedding_dim, embedding_matrix,
                         blocks=4, filters=50, kernel_size=5, dropout_rate=0.3)

# Move to device
cnn_model = cnn_model.to(device)

print(cnn_model)

SentimentCNN(
  (embedding): Embedding(20000, 100)
  (blocks): ModuleList(
    (0): Sequential(
      (0): Dropout(p=0.3, inplace=False)
      (1): SeparableConv1d(
        (depthwise): Conv1d(100, 100, kernel_size=(5,), stride=(1,), padding=(2,), groups=100, bias=False)
        (pointwise): Conv1d(100, 50, kernel_size=(1,), stride=(1,))
      )
      (2): ReLU()
      (3): SeparableConv1d(
        (depthwise): Conv1d(50, 50, kernel_size=(5,), stride=(1,), padding=(2,), groups=50, bias=False)
        (pointwise): Conv1d(50, 50, kernel_size=(1,), stride=(1,))
      )
      (4): ReLU()
      (5): MaxPool1d(kernel_size=2, stride=1, padding=0, dilation=1, ceil_mode=False)
    )
    (1-3): 3 x Sequential(
      (0): Dropout(p=0.3, inplace=False)
      (1): SeparableConv1d(
        (depthwise): Conv1d(50, 50, kernel_size=(5,), stride=(1,), padding=(2,), groups=50, bias=False)
        (pointwise): Conv1d(50, 50, kernel_size=(1,), stride=(1,))
      )
      (2): ReLU()
      (3): SeparableConv

Let's define the optimizer and loss function, then train the model.

In [24]:
# Define optimizer and loss
optimizer_cnn = optim.Adam(cnn_model.parameters(), lr=0.001)
criterion_cnn = nn.BCELoss()

In [25]:
# Train the CNN model
num_epochs_cnn = 3
batch_size_cnn = 512

# Create new dataloaders with larger batch size
train_loader_cnn = DataLoader(train_subset, batch_size=batch_size_cnn, shuffle=True)
val_loader_cnn = DataLoader(val_subset, batch_size=batch_size_cnn, shuffle=False)

for epoch in range(num_epochs_cnn):
    train_loss, train_acc = train_epoch(cnn_model, train_loader_cnn, criterion_cnn, optimizer_cnn, device)
    val_loss, val_acc = validate(cnn_model, val_loader_cnn, criterion_cnn, device)
    
    print(f'Epoch {epoch+1}/{num_epochs_cnn}')
    print(f'Train Loss: {train_loss:.4f}, Train Acc: {train_acc:.4f}')
    print(f'Val Loss: {val_loss:.4f}, Val Acc: {val_acc:.4f}')
    print()

Epoch 1/3
Train Loss: 0.6933, Train Acc: 0.4979
Val Loss: 0.6932, Val Acc: 0.4932

Epoch 2/3
Train Loss: 0.6932, Train Acc: 0.4980
Val Loss: 0.6932, Val Acc: 0.4932

Epoch 3/3
Train Loss: 0.6931, Train Acc: 0.5008
Val Loss: 0.6931, Val Acc: 0.5068



# Next steps
* To learn more about Word Embeddings in PyTorch, check out the [PyTorch tutorials on embeddings](https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html).

* [Hugging Face](https://huggingface.co/) hosts a large collection of pretrained embeddings and transformer models that you can download and reuse in your PyTorch projects.