# Word2Vec from Scratch and Transformer Implementation

In this notebook, I build a Word2Vec model from scratch and then implement a Transformer-style classifier.
I relied heavily on ChatGPT to write most of the functions and explanations, because I found few detailed resources on implementing Word2Vec from scratch. Using it helped me understand and code these concepts clearly.



In [1]:
import numpy as np
import pandas as pd

df = pd.read_csv("train.csv")
df

Unnamed: 0,Category,Text
0,Accountant,education omba executive leadership university...
1,Accountant,howard gerrard accountant deyjobcom birmingham...
2,Accountant,kevin frank senior accountant inforesumekraftc...
3,Accountant,place birth nationality olivia ogilvy accounta...
4,Accountant,stephen greet cpa senior accountant 9 year exp...
...,...,...
13384,Web Designing,jessica claire montgomery street san francisco...
13385,Web Designing,jessica claire montgomery street san francisco...
13386,Web Designing,summary jessica claire 100 montgomery st 10th ...
13387,Web Designing,jessica claire montgomery street san francisco...


In [2]:
print(df.shape)
df['Category'].value_counts()

(13389, 2)


Category
Education                    410
Electrical Engineering       384
Mechanical Engineer          384
Consultant                   368
Sales                        364
Civil Engineer               364
Management                   361
Human Resources              360
Digital Media                358
Accountant                   350
Java Developer               348
Operations Manager           345
Building and Construction    345
Testing                      344
Architecture                 344
Aviation                     340
Business Analyst             340
Finance                      339
SQL Developer                338
Public Relations             337
Health and Fitness           332
Arts                         332
Network Security Engineer    330
DotNet Developer             329
Apparel                      320
Banking                      314
Automobile                   313
Web Designing                309
SAP Developer                304
Data Science                 299
E

In [3]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('wordnet')
nltk.download('stopwords')

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)  # remove special characters and punctuations
    words = nltk.word_tokenize(text)
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    text =  ' '.join(words)

    return text

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ASUS\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\ASUS\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ASUS\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ASUS\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [4]:
df['Text'] = df['Text'].apply(preprocess_text)

### Tokenizing Text by Splitting Words

Create a new `'tokens'` column in the DataFrame by splitting each string in the `'Text'` column into a list of words with the simple `split()` method.

In [5]:
df['tokens'] = df['Text'].apply(lambda x: x.split())
print(df[['Text', 'tokens']].head())

                                                Text  \
0  education omba executive leadership university...   
1  howard gerrard accountant deyjobcom birmingham...   
2  kevin frank senior accountant inforesumekraftc...   
3  place birth nationality olivia ogilvy accounta...   
4  stephen greet cpa senior accountant 9 year exp...   

                                              tokens  
0  [education, omba, executive, leadership, unive...  
1  [howard, gerrard, accountant, deyjobcom, birmi...  
2  [kevin, frank, senior, accountant, inforesumek...  
3  [place, birth, nationality, olivia, ogilvy, ac...  
4  [stephen, greet, cpa, senior, accountant, 9, y...  


### Building Vocabulary from Tokenized Text

- Combine all token lists into one big list of words  
- Count how often each word appears using `Counter`  
- Create a vocabulary dictionary that maps each unique word to a unique index based on frequency  

In [6]:
from collections import Counter

# Flatten the list of token lists into one big list
all_tokens = [token for tokens in df['tokens'] for token in tokens]

# Count frequency of each word
word_counts = Counter(all_tokens)

# Create vocab: word → unique index
vocab = {word: idx for idx, (word, _) in enumerate(word_counts.most_common())}

print(f"Vocabulary size: {len(vocab)}")
print(f"Sample vocab items: {list(vocab.items())[:10]}")


Vocabulary size: 121352
Sample vocab items: [('management', 0), ('customer', 1), ('project', 2), ('data', 3), ('team', 4), ('service', 5), ('experience', 6), ('system', 7), ('skill', 8), ('business', 9)]


### Creating Context Pairs with Sliding Window

The `sliding_window` function pairs each center word with the context words that surround it within a given window size (2 here). I apply it to every token list in the dataset and collect all the resulting word pairs.

In [7]:
def sliding_window(tokens, window_size=2):
    pairs = []
    for i, center_word in enumerate(tokens):
        context_indices = list(range(max(0, i - window_size), i)) + \
                          list(range(i + 1, min(len(tokens), i + window_size + 1)))
        for j in context_indices:
            pairs.append((center_word, tokens[j]))
    return pairs

# Generate pairs for the entire dataset
all_pairs = []
for tokens in df['tokens']:
    all_pairs.extend(sliding_window(tokens, window_size=2))

print(f"Total training pairs: {len(all_pairs)}")
print(f"Example pairs: {all_pairs[:10]}")


Total training pairs: 25356034
Example pairs: [('education', 'omba'), ('education', 'executive'), ('omba', 'education'), ('omba', 'executive'), ('omba', 'leadership'), ('executive', 'education'), ('executive', 'omba'), ('executive', 'leadership'), ('executive', 'university'), ('leadership', 'omba')]
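
To see what the window does at the edges of a sentence, here is a small self-contained sketch. The function is re-implemented so the snippet runs on its own, and the token list is made up for illustration, not taken from the dataset.

```python
def sliding_window(tokens, window_size=2):
    """Return (center, context) pairs for every token and its neighbours."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window_size)
        stop = min(len(tokens), i + window_size + 1)
        for j in range(start, stop):
            if j != i:  # a word is never its own context
                pairs.append((center, tokens[j]))
    return pairs

# Made-up example sentence (an assumption, not a row from train.csv)
toy_tokens = ["senior", "java", "developer", "spring", "boot"]
toy_pairs = sliding_window(toy_tokens, window_size=2)

for center in toy_tokens:
    contexts = [ctx for c, ctx in toy_pairs if c == center]
    print(f"{center:>9} -> {contexts}")
print("total pairs:", len(toy_pairs))
```

Words at the ends get only 2 contexts each, while the middle word gets 4, so this five-token list yields 14 pairs in total.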


### Converting Word Pairs to Index Pairs

Convert each word pair into the pair of corresponding vocabulary indices, keeping only pairs in which both words are in the vocabulary.

In [None]:
def pairs_to_indices(pairs, vocab):
    pairs_idx = []
    for center_word, context_word in pairs:
        if center_word in vocab and context_word in vocab:
            pairs_idx.append((vocab[center_word], vocab[context_word]))
    return pairs_idx

training_pairs_idx = pairs_to_indices(all_pairs, vocab)

print(f"Number of indexed pairs: {len(training_pairs_idx)}")
print(f"Sample indexed pairs: {training_pairs_idx[:10]}")


Number of indexed pairs: 25356034
Sample indexed pairs: [(24, 25928), (24, 357), (25928, 24), (25928, 357), (25928, 121), (357, 24), (357, 25928), (357, 121), (357, 44), (121, 25928)]


### Creating Dataset and DataLoader for Word2Vec Training

Here, I create a custom PyTorch `Dataset` class that wraps the list of word-index pairs. It implements `__len__` and `__getitem__`, returning each pair as two `long` tensors.

Then, I create a `DataLoader` that loads the data in batches of 1024 and reshuffles it every epoch, so the model does not see the pairs in corpus order.
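
As a quick sanity check on the batch size, the number of batches per epoch can be worked out by hand. This is only arithmetic on the pair count printed above, assuming the `DataLoader` keeps the final partial batch (its default behaviour).

```python
import math

num_pairs = 25_356_034   # "Number of indexed pairs" printed above
batch_size = 1024        # batch size passed to the DataLoader

full_batches, remainder = divmod(num_pairs, batch_size)
num_batches = math.ceil(num_pairs / batch_size)  # the last partial batch is kept

print(f"full batches: {full_batches}, last batch size: {remainder}")
print(f"batches per epoch: {num_batches}")
```

This gives 24762 batches per epoch, which is the total that the tqdm progress bar reports in the training log.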


In [9]:
import torch
from torch.utils.data import Dataset, DataLoader

class Word2VecDataset(Dataset):
    def __init__(self, pairs):
        self.pairs = pairs

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        center, context = self.pairs[idx]
        return torch.tensor(center, dtype=torch.long), torch.tensor(context, dtype=torch.long)

dataset = Word2VecDataset(training_pairs_idx)
dataloader = DataLoader(dataset, batch_size=1024, shuffle=True)


### Word2Vec Model with Negative Sampling

This cell defines the Word2Vec model class. It has two embedding layers: one for the center words and one for the context words.

In the forward method:
- Look up embeddings for the center words, positive context words, and negative context words.
- Compute the score of each positive pair as a dot product and apply `logsigmoid` to get the positive term.
- For the negative samples, compute the dot products, negate them, apply `logsigmoid`, and sum over the samples to get the negative term.
- Finally, add the two terms, negate the result, and return the batch mean as the training loss.
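
Written out, the steps above correspond to the usual skip-gram negative-sampling objective. The symbols are my own notation, not names from the code: $v_c$ is the center vector from `input_embeddings`, $u_o$ and $u_{n_k}$ are the positive and negative context vectors from `output_embeddings`, $B$ is the batch size, and $K$ is the number of negative samples per pair.

$$
\mathcal{L} = -\frac{1}{B}\sum_{b=1}^{B}\left[\log\sigma\!\left(u_{o_b}^{\top} v_{c_b}\right) + \sum_{k=1}^{K}\log\sigma\!\left(-u_{n_{b,k}}^{\top} v_{c_b}\right)\right]
$$

The first term pushes real (center, context) pairs towards a high dot product, and the second pushes the sampled pairs towards a low one.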


In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class Word2Vec(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super(Word2Vec, self).__init__()
        self.input_embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.output_embeddings = nn.Embedding(vocab_size, embedding_dim)

    def forward(self, center_words, pos_context_words, neg_context_words):

        center_embeds = self.input_embeddings(center_words)               
        pos_embeds = self.output_embeddings(pos_context_words)            
        neg_embeds = self.output_embeddings(neg_context_words)             

        # Positive score: dot(center, pos_context)
        pos_score = torch.sum(center_embeds * pos_embeds, dim=1)           
        pos_loss = F.logsigmoid(pos_score)

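        # Negative score: -dot(center, neg_context) for each negative sample
        # squeeze(2) drops only the trailing dim, so a batch of size 1 keeps its batch dim
        neg_score = torch.bmm(neg_embeds.neg(), center_embeds.unsqueeze(2)).squeeze(2)   # (batch, neg_samples)
        neg_loss = F.logsigmoid(neg_score).sum(1)                      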

        loss = -(pos_loss + neg_loss).mean()

        return loss


### Negative Sampling Distribution and Function

Here, I first compute the relative frequency of each word in the vocabulary and smooth it by raising it to the power 0.75, a common Word2Vec trick that raises the sampling probability of rare words relative to frequent ones.

Then, I renormalize the smoothed frequencies so they sum to 1.

Finally, `get_negative_samples` randomly draws negative samples for each batch from this smoothed distribution. The model learns by contrasting real context words with these negative samples during training.
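
To get a feel for what the 0.75 power does, here is a tiny pure-Python sketch. The three counts are made up for illustration and are not taken from the dataset.

```python
# Toy counts (assumed values): one very common word, one mid, one rare
toy_counts = {"management": 900, "project": 90, "kubernetes": 10}

total = sum(toy_counts.values())
raw = {w: c / total for w, c in toy_counts.items()}

# Raise each frequency to the power 0.75, then renormalize to sum to 1
smoothed_unnorm = {w: p ** 0.75 for w, p in raw.items()}
norm = sum(smoothed_unnorm.values())
smoothed = {w: p / norm for w, p in smoothed_unnorm.items()}

for w in toy_counts:
    print(f"{w:<11} raw={raw[w]:.3f}  smoothed={smoothed[w]:.3f}")
```

The rare word's probability roughly triples (from 0.010 to about 0.028) while the most frequent word drops from 0.900 to about 0.825, so rare words show up as negatives more often than their raw frequency would allow.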


In [None]:
import numpy as np

# Calculate word frequencies, ordered by vocab index so that position i matches word index i
word_counts_array = np.array([word_counts[word] for word, _ in sorted(vocab.items(), key=lambda x: x[1])])
word_freqs = word_counts_array / word_counts_array.sum()

# Apply smoothing (power 3/4)
word_freqs = word_freqs ** 0.75
word_freqs = word_freqs / word_freqs.sum()

def get_negative_samples(batch_size, neg_samples):
    neg_samples_idx = np.random.choice(len(vocab), size=(batch_size, neg_samples), p=word_freqs)
    return torch.LongTensor(neg_samples_idx)


### Training loop

In [14]:
import torch.optim as optim
from tqdm import tqdm

embedding_dim = 100
neg_samples = 5
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = Word2Vec(len(vocab), embedding_dim).to(device)
optimizer = optim.Adam(model.parameters(), lr=0.001)

num_epochs = 3

for epoch in range(num_epochs):
    total_loss = 0
    loop = tqdm(dataloader, desc=f"Epoch {epoch+1}/{num_epochs}")
    for center_words, pos_context_words in loop:
        center_words = center_words.to(device)
        pos_context_words = pos_context_words.to(device)

        batch_size = center_words.size(0)
        neg_context_words = get_negative_samples(batch_size, neg_samples).to(device)

        optimizer.zero_grad()
        loss = model(center_words, pos_context_words, neg_context_words)
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    avg_loss = total_loss / len(dataloader)
    print(f"Epoch {epoch+1} finished. Average Loss: {avg_loss:.4f}")



Epoch 1/3: 100%|██████████| 24762/24762 [1:07:49<00:00,  6.09it/s]


Epoch 1 finished. Average Loss: 6.6390


Epoch 2/3: 100%|██████████| 24762/24762 [1:03:33<00:00,  6.49it/s]


Epoch 2 finished. Average Loss: 2.5436


Epoch 3/3: 100%|██████████| 24762/24762 [1:00:43<00:00,  6.80it/s]

Epoch 3 finished. Average Loss: 2.2090





In [None]:
import torch

embedding_dim = 100
vocab_size = len(vocab)

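model = Word2Vec(vocab_size, embedding_dim)

# Assumes the trained weights were saved beforehand to 'word2vec_checkpoint.pth'
# (the save step is not shown in this notebook).
# map_location='cpu' lets a checkpoint saved on a GPU load on a CPU-only machine.
model.load_state_dict(torch.load('word2vec_checkpoint.pth', map_location='cpu'))
model.eval()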

word_embeddings = model.input_embeddings.weight.data.cpu()

print("✅ Word embeddings extracted successfully. Shape:", word_embeddings.shape)


✅ Word embeddings extracted successfully. Shape: torch.Size([121352, 100])




### Finding Similar Words Using Cosine Similarity
`find_similar_words` takes a query word and finds the top N most similar words based on cosine similarity between embedding vectors.


In [None]:
import torch
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import cosine_similarity

embeddings = model.input_embeddings.weight.data.cpu().numpy()

# Function to find top-N similar words by cosine similarity
def find_similar_words(query_word, embeddings, word_id, id_word, top_n=10):
    if query_word not in word_id:
        print(f"'{query_word}' not in vocabulary.")
        return
    
    query_idx = word_id[query_word]
    query_vec = embeddings[query_idx].reshape(1, -1)
    
    cos_sim = cosine_similarity(query_vec, embeddings)[0]
    top_ids = cos_sim.argsort()[-top_n-1:][::-1]  # top_n+1 because query word will be most similar to itself
    
    print(f"Top {top_n} words similar to '{query_word}':")
    for idx in top_ids[1:]:  # skip the first, which is the query word itself
        print(f"{id_word[idx]} (score: {cos_sim[idx]:.4f})")


In [16]:
word_id = vocab
id_word = {idx: word for word, idx in vocab.items()}

find_similar_words("web", embeddings, word_id, id_word, top_n=10)

Top 10 words similar to 'web':
restful (score: 0.8151)
api (score: 0.7866)
soap (score: 0.7685)
ui (score: 0.7654)
application (score: 0.7514)
apis (score: 0.7450)
rest (score: 0.7418)
frontend (score: 0.7416)
html (score: 0.7237)
java (score: 0.7146)


In [None]:
import numpy as np

embeddings_np = word_embeddings.numpy()
np.save("word2vec_embeddings.npy", embeddings_np)  # Save for reuse

In [None]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()

df['label'] = label_encoder.fit_transform(df['Category'])

print("\nLabel Encoding Mapping:")
for i, category in enumerate(label_encoder.classes_):
    print(f"{category} → {i}")


Label Encoding Mapping:
Accountant → 0
Advocate → 1
Agriculture → 2
Apparel → 3
Architecture → 4
Arts → 5
Automobile → 6
Aviation → 7
BPO → 8
Banking → 9
Blockchain → 10
Building and Construction → 11
Business Analyst → 12
Civil Engineer → 13
Consultant → 14
Data Science → 15
Database → 16
Designing → 17
DevOps → 18
Digital Media → 19
DotNet Developer → 20
ETL Developer → 21
Education → 22
Electrical Engineering → 23
Finance → 24
Food and Beverages → 25
Health and Fitness → 26
Human Resources → 27
Information Technology → 28
Java Developer → 29
Management → 30
Mechanical Engineer → 31
Network Security Engineer → 32
Operations Manager → 33
PMO → 34
Public Relations → 35
Python Developer → 36
React Developer → 37
SAP Developer → 38
SQL Developer → 39
Sales → 40
Testing → 41
Web Designing → 42


### Implementing a Transformer-Style Classifier (Similar to the Previous Notebook)

In [19]:
from sklearn.model_selection import train_test_split

# Split the data (80% train, 20% validation)
train_df, val_df = train_test_split(df[['Category', 'Text', 'tokens', 'label']], test_size=0.2, random_state=42)

# Check the shapes
print(f"\nTraining set shape: {train_df.shape}")
print(f"Validation set shape: {val_df.shape}")


Training set shape: (10711, 4)
Validation set shape: (2678, 4)


In [None]:
train_texts = train_df["tokens"].tolist()
train_labels = train_df["label"].tolist()

val_texts = val_df["tokens"].tolist()
val_labels = val_df["label"].tolist()

In [None]:
from torch.utils.data import Dataset
import torch

class TextDataset(Dataset):
    def __init__(self, texts, labels, word_id, embeddings, max_len):
        self.texts = texts
        self.labels = labels
        self.word_id = word_id
        self.embeddings = embeddings  # stored, but not used in __getitem__
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        tokens = self.texts[idx]
        label = self.labels[idx]

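        # Map tokens to vocab indices; unknown tokens fall back to index 0.
        # Note: index 0 is also the id of the most frequent word ('management'),
        # and it doubles as the padding id below, so all three share one embedding.
        token_ids = [self.word_id.get(token, 0) for token in tokens]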

        # Pad or truncate to max_len
        if len(token_ids) < self.max_len:
            token_ids += [0] * (self.max_len - len(token_ids))
        else:
            token_ids = token_ids[:self.max_len]

        token_ids_tensor = torch.tensor(token_ids, dtype=torch.long)
        label_tensor = torch.tensor(label, dtype=torch.long)

        return token_ids_tensor, label_tensor


In [22]:
lengths = [len(tokens) for tokens in df['tokens']]
print(f"Max length: {max(lengths)}")
print(f"95th percentile length: {np.percentile(lengths, 95)}")

Max length: 6554
95th percentile length: 1035.5999999999985
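
The pad-or-truncate step inside `TextDataset.__getitem__` is easy to check on its own. Below is a minimal pure-Python sketch; `pad_or_truncate` and `pad_id` are names I made up for illustration. Keep in mind that the notebook pads with index 0, which is also the vocabulary index of 'management', so padded positions look like that word to the model.

```python
def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Return a list of exactly max_len ids: right-pad short inputs, cut long ones."""
    if len(token_ids) < max_len:
        return token_ids + [pad_id] * (max_len - len(token_ids))
    return token_ids[:max_len]

short_doc = [24, 357]              # made-up token ids
long_doc = [24, 357, 121, 44, 9]   # made-up token ids

padded = pad_or_truncate(short_doc, max_len=4)
truncated = pad_or_truncate(long_doc, max_len=4)

print(padded)
print(truncated)
```

With the `max_len` of 1000 used in the next cell sitting just under the 95th-percentile length of about 1036 tokens, roughly 5% of resumes (or slightly more) lose their tail to truncation.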


In [23]:
max_len = 1000

train_dataset = TextDataset(train_texts, train_labels, vocab, embeddings_np, max_len)
val_dataset = TextDataset(val_texts, val_labels, vocab, embeddings_np, max_len)

In [24]:
import torch
import math

class PositionalEncoding(torch.nn.Module):
    def __init__(self, d_model, max_len=5000):
        super(PositionalEncoding, self).__init__()
        pe = torch.zeros(max_len, d_model)  # (max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)  # (max_len, 1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))

        pe[:, 0::2] = torch.sin(position * div_term)  # even indices
        pe[:, 1::2] = torch.cos(position * div_term)  # odd indices
        pe = pe.unsqueeze(0)  # (1, max_len, d_model)
        self.register_buffer('pe', pe)

    def forward(self, x):
        # x shape: (batch_size, seq_len, d_model)
        seq_len = x.size(1)
        x = x + self.pe[:, :seq_len, :].to(x.device)
        return x
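
The buffer built in `__init__` follows the sinusoidal scheme: even dimensions hold sin(pos / 10000^(i / d_model)) and the following odd dimension holds the cosine of the same angle, where i is the even dimension index. As a sanity check, here is a slow pure-Python sketch of the same table; the function name `positional_encoding_table` is mine and the tiny sizes are only for illustration.

```python
import math

def positional_encoding_table(max_len, d_model):
    """Build a (max_len x d_model) sinusoidal table as nested lists."""
    table = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):          # i walks over the even dimensions
            angle = pos / (10000 ** (i / d_model))
            table[pos][i] = math.sin(angle)     # even index: sine
            if i + 1 < d_model:
                table[pos][i + 1] = math.cos(angle)  # odd index: cosine
    return table

table = positional_encoding_table(max_len=3, d_model=4)
for pos, row in enumerate(table):
    print(pos, [round(value, 4) for value in row])
```

Position 0 always encodes as alternating 0 and 1 (sin 0 and cos 0), and higher dimensions change more slowly with position because their angle is divided by a larger number.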


In [25]:
class MultiHeadSelfAttention(torch.nn.Module):
    def __init__(self, d_model, num_heads):
        super(MultiHeadSelfAttention, self).__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"

        self.d_model = d_model
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads

        self.q_linear = torch.nn.Linear(d_model, d_model)
        self.k_linear = torch.nn.Linear(d_model, d_model)
        self.v_linear = torch.nn.Linear(d_model, d_model)
        self.out_linear = torch.nn.Linear(d_model, d_model)

        self.scale = math.sqrt(self.head_dim)

    def forward(self, x):
        batch_size, seq_len, _ = x.size()

        # Linear projections
        Q = self.q_linear(x)  # (batch_size, seq_len, d_model)
        K = self.k_linear(x)
        V = self.v_linear(x)

        # Split into heads
        Q = Q.view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)  # (batch, heads, seq_len, head_dim)
        K = K.view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        V = V.view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)

        # Calculate scaled dot-product attention
        scores = torch.matmul(Q, K.transpose(-2, -1)) / self.scale  # (batch, heads, seq_len, seq_len)
        attn_weights = torch.nn.functional.softmax(scores, dim=-1)  # attention weights

        out = torch.matmul(attn_weights, V)  # (batch, heads, seq_len, head_dim)

        # Concatenate heads
        out = out.transpose(1, 2).contiguous().view(batch_size, seq_len, self.d_model)  # (batch, seq_len, d_model)

        out = self.out_linear(out)  # final linear layer
        return out
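
For reference, each head in the `forward` method above computes standard scaled dot-product attention. In my own notation (not names from the code), with $Q_h$, $K_h$, $V_h$ the per-head slices of the projected input:

$$
\mathrm{Attention}(Q_h, K_h, V_h) = \mathrm{softmax}\!\left(\frac{Q_h K_h^{\top}}{\sqrt{d_{\text{head}}}}\right) V_h, \qquad d_{\text{head}} = \frac{d_{\text{model}}}{n_{\text{heads}}}
$$

With the settings used later in this notebook (`embedding_dim=100`, `num_heads=4`), $d_{\text{head}} = 25$ and the scores are divided by 5. Note that this implementation applies no padding mask, so padded positions take part in the softmax like any other token.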


In [26]:
class TransformerEncoderLayer(torch.nn.Module):
    def __init__(self, d_model, num_heads, dim_feedforward=2048, dropout=0.1):
        super(TransformerEncoderLayer, self).__init__()
        self.self_attn = MultiHeadSelfAttention(d_model, num_heads)
        self.linear1 = torch.nn.Linear(d_model, dim_feedforward)
        self.dropout = torch.nn.Dropout(dropout)
        self.linear2 = torch.nn.Linear(dim_feedforward, d_model)

        self.norm1 = torch.nn.LayerNorm(d_model)
        self.norm2 = torch.nn.LayerNorm(d_model)
        self.dropout1 = torch.nn.Dropout(dropout)
        self.dropout2 = torch.nn.Dropout(dropout)

        self.activation = torch.nn.ReLU()

    def forward(self, src):
        # Self-attention + Add & Norm
        src2 = self.self_attn(src)
        src = src + self.dropout1(src2)
        src = self.norm1(src)

        # Feedforward + Add & Norm
        src2 = self.linear2(self.dropout(self.activation(self.linear1(src))))
        src = src + self.dropout2(src2)
        src = self.norm2(src)

        return src


In [28]:
class TransformerClassifier(torch.nn.Module):
    def __init__(self, vocab_size, embedding_dim, num_heads, num_layers, num_classes, max_seq_len=512):
        super(TransformerClassifier, self).__init__()
        # Randomly initialised and trained from scratch; the saved Word2Vec vectors are not loaded here
        self.embedding = torch.nn.Embedding(vocab_size, embedding_dim)
        self.pos_encoder = PositionalEncoding(embedding_dim, max_seq_len)

        self.layers = torch.nn.ModuleList([
            TransformerEncoderLayer(embedding_dim, num_heads) for _ in range(num_layers)
        ])

        self.classifier = torch.nn.Linear(embedding_dim, num_classes)

    def forward(self, x):
        # x: (batch_size, seq_len)
        embedded = self.embedding(x)  # (batch_size, seq_len, embedding_dim)
        x = self.pos_encoder(embedded)  # add positional encoding

        for layer in self.layers:
            x = layer(x)  # pass through transformer encoder layers

        # Pooling: take mean over sequence length
        x = x.mean(dim=1)  # (batch_size, embedding_dim)

        logits = self.classifier(x)  # (batch_size, num_classes)
        return logits


In [31]:
import torch
import torch.nn as nn
import torch.optim as optim
from tqdm import tqdm
from sklearn.metrics import f1_score, accuracy_score

batch_size = 32
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = TransformerClassifier(
    vocab_size=len(vocab),
    embedding_dim=100,
    num_heads=4,
    num_layers=1,
    num_classes=43,
    max_seq_len=1000
).to(device)

criterion = torch.nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)

num_epochs = 10

best_val_loss = float('inf')
patience = 3         # how many epochs to wait
counter = 0          # how many bad epochs seen in a row
delta = 0.01         # minimum improvement to reset patience

for epoch in range(num_epochs):
    model.train()
    total_loss = 0

    train_loop = tqdm(train_loader, desc=f"Epoch {epoch+1}/{num_epochs} [Training]")
    for step, (inputs, targets) in enumerate(train_loop, start=1):
        inputs = inputs.to(device)
        targets = targets.to(device)

        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
        # Running mean over the batches seen so far; counting steps explicitly
        # avoids depending on tqdm's internal counter
        avg_loss = total_loss / step
        train_loop.set_postfix({'Loss': f'{avg_loss:.4f}'})

    print(f"Epoch {epoch+1}, Final Training Loss: {avg_loss:.4f}")

    # Validation
    model.eval()
    val_total_loss = 0
    all_targets = []
    all_preds = []

    val_loop = tqdm(val_loader, desc=f"Epoch {epoch+1}/{num_epochs} [Validation]")
    with torch.no_grad():
        for inputs, targets in val_loop:
            inputs = inputs.to(device)
            targets = targets.to(device)
            outputs = model(inputs)
            val_loss = criterion(outputs, targets)
            val_total_loss += val_loss.item()

            _, predicted = torch.max(outputs, 1)
            all_targets.extend(targets.cpu().numpy())
            all_preds.extend(predicted.cpu().numpy())

    avg_val_loss = val_total_loss / len(val_loader)
    acc = accuracy_score(all_targets, all_preds)
    f1 = f1_score(all_targets, all_preds, average='weighted')

    print(f"Epoch {epoch+1}, Validation Loss: {avg_val_loss:.4f}, Accuracy: {acc:.4f}, F1 Score: {f1:.4f}")

    # ---- Early Stopping Check ----
    if best_val_loss - avg_val_loss > delta:
        best_val_loss = avg_val_loss
        counter = 0
        torch.save(model.state_dict(), 'best_model.pth')  # save the best model
        print("Validation loss improved, saving model.")
    else:
        counter += 1
        print(f"No improvement in validation loss for {counter} epoch(s).")
        if counter >= patience:
            print("Early stopping triggered.")
            break

Epoch 1/10 [Training]: 100%|██████████| 335/335 [09:10<00:00,  1.64s/it, Loss=2.7211]


Epoch 1, Final Training Loss: 2.7211


Epoch 1/10 [Validation]: 100%|██████████| 84/84 [00:44<00:00,  1.89it/s]


Epoch 1, Validation Loss: 1.7478, Accuracy: 0.5093, F1 Score: 0.4826
Validation loss improved, saving model.


Epoch 2/10 [Training]: 100%|██████████| 335/335 [09:03<00:00,  1.62s/it, Loss=1.2894]


Epoch 2, Final Training Loss: 1.2894


Epoch 2/10 [Validation]: 100%|██████████| 84/84 [00:45<00:00,  1.85it/s]


Epoch 2, Validation Loss: 1.0567, Accuracy: 0.7218, F1 Score: 0.7104
Validation loss improved, saving model.


Epoch 3/10 [Training]: 100%|██████████| 335/335 [10:27<00:00,  1.87s/it, Loss=0.8794]


Epoch 3, Final Training Loss: 0.8794


Epoch 3/10 [Validation]: 100%|██████████| 84/84 [00:49<00:00,  1.71it/s]


Epoch 3, Validation Loss: 0.9563, Accuracy: 0.7405, F1 Score: 0.7286
Validation loss improved, saving model.


Epoch 4/10 [Training]: 100%|██████████| 335/335 [10:09<00:00,  1.82s/it, Loss=0.6805]


Epoch 4, Final Training Loss: 0.6805


Epoch 4/10 [Validation]: 100%|██████████| 84/84 [00:52<00:00,  1.60it/s]


Epoch 4, Validation Loss: 0.8559, Accuracy: 0.7763, F1 Score: 0.7749
Validation loss improved, saving model.


Epoch 5/10 [Training]: 100%|██████████| 335/335 [10:04<00:00,  1.80s/it, Loss=0.5398]


Epoch 5, Final Training Loss: 0.5398


Epoch 5/10 [Validation]: 100%|██████████| 84/84 [00:48<00:00,  1.72it/s]


Epoch 5, Validation Loss: 0.8170, Accuracy: 0.7789, F1 Score: 0.7794
Validation loss improved, saving model.


Epoch 6/10 [Training]: 100%|██████████| 335/335 [09:26<00:00,  1.69s/it, Loss=0.4263]


Epoch 6, Final Training Loss: 0.4263


Epoch 6/10 [Validation]: 100%|██████████| 84/84 [00:46<00:00,  1.80it/s]


Epoch 6, Validation Loss: 0.7684, Accuracy: 0.7954, F1 Score: 0.7953
Validation loss improved, saving model.


Epoch 7/10 [Training]: 100%|██████████| 335/335 [09:30<00:00,  1.70s/it, Loss=0.3245]


Epoch 7, Final Training Loss: 0.3245


Epoch 7/10 [Validation]: 100%|██████████| 84/84 [00:51<00:00,  1.64it/s]


Epoch 7, Validation Loss: 0.8380, Accuracy: 0.7857, F1 Score: 0.7872
No improvement in validation loss for 1 epoch(s).


Epoch 8/10 [Training]: 100%|██████████| 335/335 [09:27<00:00,  1.69s/it, Loss=0.2467]


Epoch 8, Final Training Loss: 0.2467


Epoch 8/10 [Validation]: 100%|██████████| 84/84 [00:45<00:00,  1.86it/s]


Epoch 8, Validation Loss: 0.8658, Accuracy: 0.7984, F1 Score: 0.8006
No improvement in validation loss for 2 epoch(s).


Epoch 9/10 [Training]: 100%|██████████| 335/335 [08:59<00:00,  1.61s/it, Loss=0.1805]


Epoch 9, Final Training Loss: 0.1805


Epoch 9/10 [Validation]: 100%|██████████| 84/84 [00:44<00:00,  1.87it/s]

Epoch 9, Validation Loss: 0.9551, Accuracy: 0.7797, F1 Score: 0.7790
No improvement in validation loss for 3 epoch(s).
Early stopping triggered.
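
The early-stopping rule can be replayed on the validation losses logged above to confirm where it stops. This is a pure-Python sketch of the same check (`patience=3`, `delta=0.01`); the helper name `replay_early_stopping` is mine.

```python
# Validation losses copied from the training log above (epochs 1 to 9)
val_losses = [1.7478, 1.0567, 0.9563, 0.8559, 0.8170, 0.7684, 0.8380, 0.8658, 0.9551]

def replay_early_stopping(losses, patience=3, delta=0.01):
    """Return (best_epoch, stop_epoch); stop_epoch is None if never triggered."""
    best_loss = float("inf")
    best_epoch = None
    bad_epochs = 0
    for epoch, loss in enumerate(losses, start=1):
        if best_loss - loss > delta:      # improvement must exceed delta
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return best_epoch, epoch
    return best_epoch, None

best_epoch, stop_epoch = replay_early_stopping(val_losses)
print(f"best epoch: {best_epoch}, stopped after epoch: {stop_epoch}")
```

Replaying the log gives best epoch 6 and a stop after epoch 9, matching the output. So `best_model.pth` holds the epoch-6 weights, even though epoch 8 reports the highest accuracy (0.7984) and F1 score (0.8006).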





In [32]:
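# Saves the weights from the last epoch that ran (epoch 9 here, after early stopping);
# the checkpoint with the best validation loss is 'best_model.pth'.
torch.save(model.state_dict(), 'trained_model.pth')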