# Module 2: Word Vectors and Embeddings

After cleaning the text, we need to convert it into a numerical representation (vectors) so that it can be fed to a machine learning model for further processing.

**Scikit-learn**, also known as sklearn, is a free software machine learning library for the Python programming language. It’s one of the most useful and robust libraries for machine learning in Python. It provides a selection of efficient tools for machine learning and statistical modeling including classification, regression, clustering, and dimensionality reduction via a consistent interface in Python.

Documentation: https://scikit-learn.org/stable/

## 2.1 Bag of Words

The Bag of Words (BoW) model is a common way to represent text data in Natural Language Processing (NLP). Here's how it works:

1. **Tokenization**: The text is broken down into individual words or tokens.

2. **Counting**: The frequency of each word is counted.

3. **Storing**: The information is stored in a data structure, such as a dictionary or a vector.

The name "Bag of Words" comes from the fact that this model represents the document as a 'bag' of its words, disregarding grammar and word order but keeping track of frequency.

For example, if we have a vocabulary of 1000 words, then the whole document will be represented by a 1000-dimensional vector, where the vector’s ith entry represents the frequency of the ith vocabulary word in the document.
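The three steps above can be sketched in plain Python before turning to a library. This is a minimal illustration, not scikit-learn's implementation: the `tokenize` rule (lowercase, keep runs of two or more word characters) is an assumption chosen to resemble `CountVectorizer`'s default behaviour, and the helper names are hypothetical.

```python
import re
from collections import Counter

corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

def tokenize(text):
    # Lowercase and keep runs of two or more word characters
    # (an assumption that mimics CountVectorizer's default tokenization).
    return re.findall(r"\b\w\w+\b", text.lower())

# Vocabulary: every distinct token, sorted so each word gets a fixed index.
vocabulary = sorted({token for doc in corpus for token in tokenize(doc)})

def bag_of_words(text):
    counts = Counter(tokenize(text))
    # Entry i is the frequency of the i-th vocabulary word in the document.
    return [counts[word] for word in vocabulary]

vectors = [bag_of_words(doc) for doc in corpus]
print(vocabulary)
for row in vectors:
    print(row)
```

Each document becomes a vector whose length equals the vocabulary size, regardless of how long the document is.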


In [1]:
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
# Sample Corpus

corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?']


In [3]:
# Create a CountVectorizer object

vectorizer = CountVectorizer()

In [4]:
# Learn the vocabulary and return the document-term matrix

X = vectorizer.fit_transform(corpus)

In [5]:
# Show the learned vocabulary (one feature per distinct token)

print(vectorizer.get_feature_names_out())

['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']


In [6]:
# Display the document-term count matrix (rows: documents, columns: vocabulary words)

print(X.toarray())

[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]


Disadvantages:

- It ignores the order and context of the words, which can affect the meaning and semantics of the text.
- It suffers from high dimensionality and sparsity, which can make the model complex and inefficient.
- It does not capture synonyms, antonyms, or other linguistic features that can enrich the representation of the text.
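The first disadvantage is easy to see with a small sketch. The two sentences here are made up for illustration, and `bag` is a hypothetical helper that simply counts whitespace-separated words.

```python
from collections import Counter

def bag(text):
    # Lowercase, split on whitespace, and count each word.
    return Counter(text.lower().split())

a = "the dog bites the man"
b = "the man bites the dog"

# The sentences mean different things, yet their bags are identical.
print(bag(a) == bag(b))
```

Any model that sees only these counts cannot tell the two sentences apart.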

## 2.2 TF-IDF

**TF-IDF**, short for **Term Frequency-Inverse Document Frequency**, is a numerical statistic used in Natural Language Processing (NLP) to reflect how important a word is to a document in a collection or corpus. It's often used as a weighting factor in information retrieval, text mining, and user modeling.

The TF-IDF value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general.

TF-IDF consists of two components:

1. **Term Frequency (TF)**: This measures how frequently a term occurs in a document. Because documents differ in length, a term is likely to appear many more times in a long document than in a short one. The term frequency is therefore often divided by the document length (i.e., the total number of terms in the document) as a form of normalization. It's calculated as:

   $$\text{TF}(t) = \frac{\text{Number of times term t appears in a document}}{\text{Total number of terms in the document}}$$

2. **Inverse Document Frequency (IDF)**: This measures how important a term is. When computing TF, all terms are considered equally important. However, certain terms, such as “is”, “of”, and “that”, may appear many times but carry little importance. We therefore need to weigh down the frequent terms while scaling up the rare ones. It's calculated as:

   $$\text{IDF}(t) = \log_e \left(\frac{\text{Total number of documents}}{\text{Number of documents with term t in it}}\right)$$

The TF-IDF score for a word in a document is the product of its TF score and its IDF score. The higher the TF-IDF score, the more important that word is to that document.
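The two formulas can be checked by hand with a short sketch. It assumes simple whitespace tokenization of already-cleaned text, and the function names are illustrative rather than part of any library.

```python
import math

docs = [
    "this is the first document".split(),
    "this document is the second document".split(),
    "and this is the third one".split(),
    "is this the first document".split(),
]

def tf(term, doc):
    # Occurrences of the term divided by the document length.
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Natural log of (number of documents / documents containing the term).
    # Assumes the term occurs in at least one document.
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tf_idf("this", docs[1], docs))    # "this" occurs in every document
print(tf_idf("second", docs[1], docs))  # "second" occurs in one document only
```

A word that appears in every document gets an IDF of ln(1) = 0, so its TF-IDF score is zero however often it occurs.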


In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [8]:
# Sample Corpus

corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?']

In [9]:
# Create a TfidfVectorizer object

vectorizer = TfidfVectorizer()

In [10]:
# Learn the vocabulary and IDF weights, and return the document-term matrix

X = vectorizer.fit_transform(corpus)

In [11]:
# Show the learned vocabulary (one feature per distinct token)

print(vectorizer.get_feature_names_out())

['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']


In [12]:
# Display the TF-IDF matrix (rows: documents, columns: vocabulary words)

print(X.toarray())

[[0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]
 [0.         0.6876236  0.         0.28108867 0.         0.53864762
  0.28108867 0.         0.28108867]
 [0.51184851 0.         0.         0.26710379 0.51184851 0.
  0.26710379 0.51184851 0.26710379]
 [0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]]
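The values printed above do not match the plain TF × IDF formulas given earlier. With its default settings, `TfidfVectorizer` uses the raw count as TF, a smoothed IDF of ln((1 + n) / (1 + df)) + 1, and then scales each row to unit L2 length. The sketch below reproduces the first row under those assumptions; the helper names are hypothetical.

```python
import math

docs = [
    "this is the first document".split(),
    "this document is the second document".split(),
    "and this is the third one".split(),
    "is this the first document".split(),
]
vocabulary = sorted({term for doc in docs for term in doc})
n_docs = len(docs)

def smooth_idf(term):
    # df: number of documents that contain the term.
    df = sum(1 for doc in docs if term in doc)
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf_row(doc):
    # Raw count times smoothed IDF, then L2-normalize the row.
    raw = [doc.count(term) * smooth_idf(term) for term in vocabulary]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]

row0 = tfidf_row(docs[0])
print([round(value, 8) for value in row0])
```

Because of the "+ 1" in the smoothed IDF, words that occur in every document (such as "this") keep a non-zero weight.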


Disadvantages:

- It ignores the order and context of the words, which can affect the meaning and semantics of the text.
- Its weights are purely statistical: a word is scored by how often it occurs in the document and in the corpus, not by what it means or how relevant it is to the task.
- It suffers from high dimensionality and sparsity, which can make the model complex and inefficient.
- It does not capture synonyms, antonyms, or other linguistic features that can enrich the representation of the text.

## 2.3 Pre-Trained Embeddings


Pre-trained embeddings and methods like Bag of Words (BoW) and TF-IDF are all techniques used to convert text into numerical form that can be processed by machine learning algorithms.

**Bag of Words (BoW)** and **TF-IDF** are simple statistical vectorization techniques. They treat each word as a feature of the document and represent the document as a bag of these features. However, *they do not capture any semantic information or the context in which words appear*.

**Pre-trained embeddings**, on the other hand, are more advanced and capture a lot more information. They are trained on large amounts of data and are able to capture semantic meaning and context. Examples of pre-trained embeddings include:
- Word2Vec
- GloVe
- FastText
- BERT

Pre-trained embeddings let text classification models generalize to new datasets and tasks without learning word representations from scratch, which saves training time and resources. They provide dense vector representations of words whose dimensions jointly encode semantic and syntactic properties, although individual dimensions are generally not directly interpretable.

In summary, while BoW and TF-IDF provide a simple count-based representation of the text data, pre-trained embeddings provide a much richer representation that captures semantic meaning, context, and linguistic properties of the words.
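To make the contrast concrete, the sketch below compares cosine similarity under the two kinds of representation. The dense vectors are hypothetical values hand-picked for illustration; real embeddings have hundreds of dimensions and are learned from data.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# One-hot (BoW-style) vectors: every pair of distinct words is orthogonal.
one_hot = {"good": [1, 0, 0], "great": [0, 1, 0], "car": [0, 0, 1]}

# Hypothetical dense embeddings, hand-picked for illustration only.
dense = {
    "good": [0.9, 0.8, 0.1],
    "great": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

print(cosine_similarity(one_hot["good"], one_hot["great"]))
print(cosine_similarity(dense["good"], dense["great"]))
print(cosine_similarity(dense["good"], dense["car"]))
```

With count-based features, "good" and "great" are as unrelated as "good" and "car"; with dense embeddings, related words can sit close together.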


### 2.3.1 Word2Vec Algorithm
Continuous Bag of Words (CBOW) and Skip-gram are the two model architectures used in Word2Vec to create word embeddings from large text corpora. Both learn distributed representations of words, but they take opposite approaches:

**CBOW (Continuous Bag of Words)**

*   In CBOW, the model predicts a target word based on the context words (words that surround the target word).
*   CBOW is generally faster to train than Skip-gram because each context window yields a single prediction: the context words are combined to predict one target word.

**Skip-gram**

*   In Skip-gram, the model predicts context words based on the target word.
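*   Skip-gram is generally slower to train than CBOW because each target word yields one prediction per context word in its window, so the same text produces many more training examples. It is still a single model, not one model per context word.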
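The difference is easiest to see in the training examples each variant builds from the same sentence. The sketch below only shows how examples are formed with a fixed window; it is not Gensim's implementation, and it leaves out details such as subsampling and negative sampling. The function names are hypothetical.

```python
def context_window(tokens, index, window):
    # Words up to `window` positions to the left and right of the target.
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    return left + right

def cbow_examples(tokens, window):
    # One example per position: (all context words) -> target word.
    return [(context_window(tokens, i, window), tokens[i])
            for i in range(len(tokens))]

def skipgram_examples(tokens, window):
    # One example per (target word, single context word) pair.
    return [(tokens[i], context)
            for i in range(len(tokens))
            for context in context_window(tokens, i, window)]

sentence = ["this", "is", "the", "first", "document"]
cbow = cbow_examples(sentence, window=2)
skipgram = skipgram_examples(sentence, window=2)

print(cbow[2])
print(skipgram[:4])
print(len(cbow), len(skipgram))
```

For this five-word sentence, CBOW produces one example per word while Skip-gram produces one per word pair inside the window, which is why it needs more updates for the same text.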


Documentation: https://radimrehurek.com/gensim/models/word2vec.html

In [13]:
from gensim.models import Word2Vec
from nltk.tokenize import sent_tokenize, word_tokenize
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [14]:
# assuming 's' is your text data

s = """
This is the first document. This document is the second document.
And this is the third one. Is this the first document?
"""
# Preprocessing the data
data = []
for i in sent_tokenize(s):
    temp = []
    for j in word_tokenize(i):
        temp.append(j.lower())
    data.append(temp)

In [15]:
# Creating the model and setting values for the various parameters

model = Word2Vec(data, vector_size=100, window=5, min_count=1, sg=1)

**Note:** In Gensim's Word2Vec implementation, the default value of the `sg` (skip-gram) parameter is 0, which selects the Continuous Bag of Words (CBOW) variant. To train a Skip-gram model, set `sg=1` explicitly when creating the model, as in the cell above: `model = Word2Vec(data, vector_size=100, window=5, min_count=1, sg=1)`.

In [16]:
# Getting vector embeddings for a word

vector = model.wv['document']

In [17]:
print(vector)

[-5.3675898e-04  2.3377803e-04  5.1011126e-03  9.0111634e-03
 -9.3046296e-03 -7.1193478e-03  6.4571658e-03  8.9753037e-03
 -5.0158435e-03 -3.7650017e-03  7.3809996e-03 -1.5337780e-03
 -4.5382124e-03  6.5550413e-03 -4.8621432e-03 -1.8138700e-03
  2.8784047e-03  9.8939589e-04 -8.2832631e-03 -9.4508985e-03
  7.3122787e-03  5.0708465e-03  6.7557860e-03  7.6295465e-04
  6.3531208e-03 -3.4069265e-03 -9.4830303e-04  5.7722903e-03
 -7.5227362e-03 -3.9368360e-03 -7.5102518e-03 -9.2837535e-04
  9.5392996e-03 -7.3179631e-03 -2.3361896e-03 -1.9368406e-03
  8.0786077e-03 -5.9307734e-03  4.5256016e-05 -4.7519240e-03
 -9.6026259e-03  5.0088917e-03 -8.7609310e-03 -4.3932023e-03
 -3.5118584e-05 -2.9534783e-04 -7.6626088e-03  9.6165314e-03
  4.9838675e-03  9.2361327e-03 -8.1576584e-03  4.4982354e-03
 -4.1376660e-03  8.2723942e-04  8.4996941e-03 -4.4644130e-03
  4.5164754e-03 -6.7880703e-03 -3.5479118e-03  9.3994327e-03
 -1.5778678e-03  3.2395139e-04 -4.1392036e-03 -7.6834001e-03
 -1.5086082e-03  2.46869

In [18]:
data

[['this', 'is', 'the', 'first', 'document', '.'],
 ['this', 'document', 'is', 'the', 'second', 'document', '.'],
 ['and', 'this', 'is', 'the', 'third', 'one', '.'],
 ['is', 'this', 'the', 'first', 'document', '?']]

In this code:

- `vector_size` is the number of dimensions (N) of the N-dimensional space that gensim Word2Vec maps the words onto.
- `window` is the maximum distance between a target word and words around the target word.
- `min_count` is the minimum count of words to consider when training the model; words with occurrence less than this count will be ignored.

Note: A Word2Vec model trained from scratch on a small dataset usually performs worse than pre-trained embeddings, so consider loading pre-trained Word2Vec vectors instead.

Example: https://radimrehurek.com/gensim/models/word2vec.html

### 2.3.2 BERT (Bidirectional Encoder Representations from Transformers)


BERT is a deep learning model that generates contextualized word embeddings. Unlike Word2Vec, FastText, and GloVe which generate a single static embedding for each word, BERT generates different embeddings for a word based on its specific context within a sentence. For example, in the sentences “I deposited money in the bank” and “I sat by the river bank”, BERT would generate different embeddings for the word “bank” in each sentence.

In terms of complexity and computational requirements, BERT is significantly more complex and resource-intensive than Word2Vec, FastText, and GloVe.

Documentation: https://huggingface.co/bert-base-uncased

In [19]:
# Step 1: Library Installs
!pip install transformers
!pip install torch

Collecting transformers
  Downloading transformers-4.34.0-py3-none-any.whl (7.7 MB)
Collecting huggingface-hub<1.0,>=0.16.4 (from transformers)
  Downloading huggingface_hub-0.18.0-py3-none-any.whl (301 kB)
Collecting tokenizers<0.15,>=0.14 (from transformers)
  Downloading tokenizers-0.14.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.8 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.4.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Col

In [20]:
# Step 2: Necessary Imports

from transformers import BertModel, BertTokenizer
import torch

In [21]:
# Step 3: Load Pre-Trained BERT Model and Tokenizer

checkpoint = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(checkpoint)
model = BertModel.from_pretrained(checkpoint)


In [22]:
model

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
  

In [23]:
# Step 4.1: Just get the embeddings from a pretrained model

text = "Hugging Face's BERT is a powerful tool for natural language processing."
tokens = tokenizer(text, padding = True, truncation = True, return_tensors = "pt") # Tokenize your text


with torch.no_grad():
    embeddings = model(**tokens).last_hidden_state

In [24]:
embeddings

tensor([[[-0.5006, -0.0108,  0.0482,  ..., -0.3206, -0.0234,  0.4465],
         [ 0.0371, -0.3443,  0.9923,  ..., -0.0259,  0.6778,  0.5390],
         [ 0.1315, -0.2351,  0.9058,  ..., -0.2614,  0.4007,  0.2539],
         ...,
         [ 0.0892,  0.0058,  0.2358,  ..., -1.1112, -0.8665, -0.2763],
         [ 0.6234,  0.3100, -0.6328,  ...,  0.1791, -0.5473, -0.3639],
         [-0.2104,  0.1521,  0.0763,  ...,  0.3794, -0.6476, -0.3708]]])

In [25]:
embeddings.shape

torch.Size([1, 16, 768])

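Another major difference is that Word2Vec, FastText, and GloVe provide only word-level embeddings, whereas BERT can also provide a sentence-level embedding, for example by taking the final hidden state of the special `[CLS]` token that is prepended to every input.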

In [26]:
# Step 4.2:  Get sentence-level embeddings

from transformers import AutoTokenizer, AutoModel
import torch

# Load the BERT model
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('bert-base-uncased')

# Function to get the embeddings
def get_embeddings(sentence):
    # Tokenize the sentence
    inputs = tokenizer(sentence, return_tensors='pt', truncation=True, padding=True)

    # Get the embeddings
    with torch.no_grad():
        outputs = model(**inputs)

    # Use the embeddings of the [CLS] token for the sentence-level embedding
    embeddings = outputs.last_hidden_state[:, 0, :].numpy()

    return embeddings

# Test the function
sentence = "This is a sample sentence."
embeddings = get_embeddings(sentence)
print(embeddings)


[[-1.99308127e-01 -2.10060462e-01 -1.94994643e-01 -3.32378864e-01
  -5.21259665e-01 -3.73374075e-01  1.54189378e-01  4.00057852e-01
   3.53623852e-02 -1.50123253e-01 -3.52665842e-01 -2.28611425e-01
  -9.49477926e-02  1.03279911e-01  5.13516724e-01 -9.09007639e-02
   2.38635913e-01  4.82231945e-01  3.13785404e-01 -4.05656695e-01
   9.86269861e-02 -8.87676030e-02 -2.74609476e-01 -5.08711934e-01
   2.09130079e-01 -3.06265086e-01 -4.29530442e-02 -4.13105071e-01
  -4.07091305e-02  8.08255672e-02 -6.02922365e-02  3.57201546e-01
  -3.41745675e-01 -1.48240894e-01  3.90801311e-01 -1.31516546e-01
   3.67418706e-01 -9.12577286e-03  4.34040397e-01  1.41824752e-01
  -2.67848015e-01 -8.74635801e-02  2.66388327e-01  5.40870503e-02
  -5.58938347e-02 -5.74296772e-01 -2.81868768e+00 -3.43125105e-01
  -2.79367685e-01 -3.09405774e-01 -8.85440782e-02 -2.27909401e-01
   4.06301439e-01  5.09021819e-01 -2.27963448e-01  3.77304614e-01
  -2.90175945e-01  3.12225699e-01  1.94541186e-01 -1.05505832e-01
   1.51795

In [27]:
embeddings.shape

(1, 768)
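The cell above uses the `[CLS]` vector as the sentence embedding. Another common choice is mean pooling: average the token vectors and skip the padding positions marked 0 in the attention mask. The sketch below shows only the arithmetic on hypothetical 3-dimensional vectors; with real model outputs the same computation would be done with tensor operations on `last_hidden_state` and `attention_mask`.

```python
def mean_pool(token_vectors, attention_mask):
    # Keep only the vectors whose mask is 1 (real tokens, not padding).
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep == 1]
    dim = len(token_vectors[0])
    # Average each dimension over the kept tokens.
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

# Hypothetical token vectors; the last position is padding.
token_vectors = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0], [9.0, 9.0, 9.0]]
attention_mask = [1, 1, 0]

sentence_vector = mean_pool(token_vectors, attention_mask)
print(sentence_vector)
```

Ignoring the mask would let padding vectors distort the average, which matters as soon as sentences of different lengths are batched together.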

In [28]:
# Step 5 : Perform Text Classification

import torch
from transformers import BertTokenizer, BertForSequenceClassification

checkpoint = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(checkpoint)
model = BertForSequenceClassification.from_pretrained(checkpoint)

labels = ["Negative", "Positive"]
text = "I love using BERT for NLP tasks."

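inputs = tokenizer(text, return_tensors = "pt", padding = True, truncation = True)

# Inference only, so gradients are not needed
with torch.no_grad():
    outputs = model(**inputs)

logits = outputs.logits
predicted_class = torch.argmax(logits, dim = 1).item()  # 1-element tensor -> Python int
# Note: the classification head is randomly initialized (see the warning in the output),
# so this prediction is not meaningful until the model has been fine-tuned.
predicted_label = labels[predicted_class]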

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [29]:
print(f"Predicted Label: {predicted_label}")

Predicted Label: Negative


### Working with Domain-Specific BERT Models
Domain-specific BERT models are pre-trained, or further pre-trained, on text from a particular domain, which makes them better suited to tasks in that domain.

1. Legal-BERT: Tailored for legal text analysis, Legal-BERT is designed to handle legal documents, contracts, and other law-related text. (model card: https://huggingface.co/nlpaueb/legal-bert-base-uncased)
2. FinBERT: Targeted at financial news, reports, and documents, FinBERT is further trained on financial text and fine-tuned for financial sentiment analysis. (model card: https://huggingface.co/ProsusAI/finbert)
3. SciBERT: Pre-trained on scientific articles and well suited to NLP tasks within the scientific research domain. (model card: https://huggingface.co/allenai/scibert_scivocab_uncased)
4. Bio_ClinicalBERT: Intended for tasks in the medical and healthcare domain, this model is initialized from BioBERT (pre-trained on biomedical literature) and further trained on clinical notes. (model card: https://huggingface.co/emilyalsentzer/Bio_ClinicalBERT)




### Fine-tuning BERT Embeddings
In this lab, you will fine-tune a pre-trained BERT model on the SMS Spam Collection Dataset to build a spam detection system. You will use the Hugging Face Transformers library and PyTorch for fine-tuning.
1. **Load** the SMS Spam Collection **Dataset**.
2. Perform **data preprocessing steps**, including text cleaning, tokenization, and splitting into training, validation, and test sets.
3. Convert the text data into **BERT-compatible input format** (token IDs, attention masks).
4. **Load a pre-trained BERT model** (e.g., 'bert-base-uncased') from Hugging Face Transformers.
5. Define a **classification head** for spam/ham detection.
6. **Fine-tune** the model and **evaluate** it on the validation and test sets.
7. **Save embeddings** for later use.
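The "BERT-compatible input format" in step 3 can be pictured with a small sketch. It is a simplified stand-in for what the Hugging Face tokenizer does when `padding` and `truncation` are enabled: `encode_batch` is a hypothetical helper, and the word-piece IDs in `batch` are made up. Only the special-token IDs (`[PAD]` = 0, `[CLS]` = 101, `[SEP]` = 102) are those of `bert-base-uncased`.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token IDs in bert-base-uncased

def encode_batch(token_id_lists, max_length):
    input_ids, attention_masks = [], []
    for ids in token_id_lists:
        # Truncate so that [CLS] and [SEP] still fit, then add them.
        ids = [CLS_ID] + ids[:max_length - 2] + [SEP_ID]
        padding = max_length - len(ids)
        input_ids.append(ids + [PAD_ID] * padding)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_masks.append([1] * len(ids) + [0] * padding)
    return input_ids, attention_masks

# Hypothetical word-piece IDs standing in for two tokenized messages.
batch = [[2489, 4443], [7929, 2474, 2099, 8257, 2072]]
input_ids, attention_masks = encode_batch(batch, max_length=6)

print(input_ids)
print(attention_masks)
```

Every sequence ends up with the same length, which is what allows a batch to be stacked into a single tensor.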

In [30]:
import torch
from transformers import BertTokenizer, BertForSequenceClassification
from torch.optim import AdamW  # transformers.AdamW is deprecated in favor of torch.optim.AdamW
from torch.utils.data import DataLoader, TensorDataset
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
import pandas as pd
from tqdm.notebook import trange, tqdm

In [31]:
path_to_dataset = '/content/spam.csv'

In [32]:
data = pd.read_csv(path_to_dataset, encoding='ISO-8859-1')
data

| | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
|---|---|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | | | |
| 1 | ham | Ok lar... Joking wif u oni... | | | |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | | | |
| 3 | ham | U dun say so early hor... U c already then say... | | | |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | | | |
| ... | ... | ... | ... | ... | ... |
| 5567 | spam | This is the 2nd time we have tried 2 contact u... | | | |
| 5568 | ham | Will Ì_ b going to esplanade fr home? | | | |
| 5569 | ham | Pity, * was in mood for that. So...any other s... | | | |
| 5570 | ham | The guy did some bitching but I acted like i'd... | | | |


In [33]:
# List of columns to drop
columns_to_drop = ['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4']

# drop the specified columns
data = data.drop(columns = columns_to_drop)

# rename 'v1' to 'label' and 'v2' to 'text'
data = data.rename(columns={'v1': 'label', 'v2': 'text'})

In [34]:
data.head()

| | label | text |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [35]:
labels = LabelEncoder().fit_transform(data['label'])
texts = data['text']

In [36]:
# Split the data into train, validation, and test sets
train_texts, temp_texts, train_labels, temp_labels = train_test_split(texts, labels, stratify = labels, test_size = 0.2, random_state = 42)
val_texts, test_texts, val_labels, test_labels = train_test_split(temp_texts, temp_labels, stratify = temp_labels, test_size = 0.5, random_state = 42)

In [37]:
# Load the BERT tokenizer and encode the text data
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

train_encodings = tokenizer(train_texts.tolist(), truncation = True, padding = True, return_tensors = 'pt', max_length = 512)
val_encodings = tokenizer(val_texts.tolist(), truncation = True, padding = True, return_tensors = 'pt', max_length = 512)
test_encodings = tokenizer(test_texts.tolist(), truncation = True, padding = True, return_tensors = 'pt', max_length = 512)

In [38]:
# Create PyTorch data loaders
train_dataset = TensorDataset(train_encodings['input_ids'], train_encodings['attention_mask'], torch.tensor(train_labels))
val_dataset = TensorDataset(val_encodings['input_ids'], val_encodings['attention_mask'], torch.tensor(val_labels))
test_dataset = TensorDataset(test_encodings['input_ids'], test_encodings['attention_mask'], torch.tensor(test_labels))

train_loader = DataLoader(train_dataset, batch_size= 32, shuffle = True)
val_loader = DataLoader(val_dataset, batch_size = 16)
test_loader = DataLoader(test_dataset, batch_size = 16)
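As a sanity check on the split and batch sizes, the arithmetic below reproduces the batch counts that appear in the training and evaluation progress bars (140 and 35). It assumes the file has 5572 rows and that `train_test_split` rounds the held-out fraction up; both are assumptions inferred from the outputs in this notebook rather than values read from the data.

```python
import math

n_messages = 5572                     # assumed number of rows in spam.csv
n_temp = math.ceil(n_messages * 0.2)  # held out by the first split
n_train = n_messages - n_temp
n_test = math.ceil(n_temp * 0.5)      # the second split halves the held-out part
n_val = n_temp - n_test

def n_batches(n_examples, batch_size):
    # A DataLoader keeps the final, smaller batch by default.
    return math.ceil(n_examples / batch_size)

print(n_train, n_val, n_test)
print(n_batches(n_train, 32), n_batches(n_val, 16), n_batches(n_test, 16))
```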

In [39]:
# Load the pre-trained BERT model for sequence classification
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels = 2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [40]:
# Define training parameters
optimizer = AdamW(model.parameters(), lr = 1e-5)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)



BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12,

In [41]:
# Training loop
num_epochs = 2

for epoch in range(num_epochs):
    model.train()
    total_loss = 0
    for batch in tqdm(train_loader, desc="Epoch {}".format(epoch+1)):
        input_ids, attention_mask, labels = batch
        input_ids, attention_mask, labels = input_ids.to(device), attention_mask.to(device), labels.to(device)

        optimizer.zero_grad()
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        total_loss += loss.item()
        loss.backward()
        optimizer.step()

    average_loss = total_loss / len(train_loader)
    print("Average training loss: {:.4f}".format(average_loss))

Epoch 1:   0%|          | 0/140 [00:00<?, ?it/s]

Average training loss: 0.1526


Epoch 2:   0%|          | 0/140 [00:00<?, ?it/s]

Average training loss: 0.0318


In [None]:
# Evaluation
model.eval()
val_predictions = []
val_labels = []

for batch in tqdm(val_loader, desc="Validation"):
    input_ids, attention_mask, labels = batch
    input_ids, attention_mask, labels = input_ids.to(device), attention_mask.to(device), labels.to(device)

    with torch.no_grad():
        outputs = model(input_ids, attention_mask=attention_mask)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=1).cpu().numpy()
    val_predictions.extend(predictions)
    val_labels.extend(labels.cpu().numpy())

accuracy = accuracy_score(val_labels, val_predictions)
print("Validation Accuracy: {:.2f}%".format(accuracy * 100))

Validation:   0%|          | 0/35 [00:00<?, ?it/s]

Validation Accuracy: 99.10%


In [None]:
# Test the model on the test set
test_predictions = []
test_labels = []

for batch in tqdm(test_loader, desc="Testing"):
    input_ids, attention_mask, labels = batch
    input_ids, attention_mask, labels = input_ids.to(device), attention_mask.to(device), labels.to(device)

    with torch.no_grad():
        outputs = model(input_ids, attention_mask=attention_mask)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=1).cpu().numpy()
    test_predictions.extend(predictions)
    test_labels.extend(labels.cpu().numpy())

test_accuracy = accuracy_score(test_labels, test_predictions)
print("Test Accuracy: {:.2f}%".format(test_accuracy * 100))

Testing:   0%|          | 0/35 [00:00<?, ?it/s]

Test Accuracy: 98.92%


In [42]:
# After training, extract embeddings

import numpy as np

def extract_bert_embeddings(model, dataloader):
    embeddings = []

    model.eval()
    with torch.no_grad():
        for batch in dataloader:
            input_ids, attention_mask, _ = batch
            input_ids, attention_mask = input_ids.to(device), attention_mask.to(device)

            outputs = model(input_ids, attention_mask=attention_mask)
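            # Note: outputs.logits holds the 2 class scores per message, not the
            # 768-dimensional [CLS] embedding. For the embedding itself, call the model with
            # output_hidden_states=True and use outputs.hidden_states[-1][:, 0, :].
            pooled_output = outputs.logits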
            embeddings.append(pooled_output.cpu().numpy())

    embeddings = np.vstack(embeddings)
    return embeddings

In [43]:
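# Extract embeddings for train, validation, and test sets.
# train_loader shuffles its batches, so use a non-shuffled loader here to keep
# the rows aligned with the order of train_dataset (and its labels).
ordered_train_loader = DataLoader(train_dataset, batch_size = 32, shuffle = False)
train_embeddings = extract_bert_embeddings(model, ordered_train_loader)
val_embeddings = extract_bert_embeddings(model, val_loader)
test_embeddings = extract_bert_embeddings(model, test_loader)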

In [44]:
train_embeddings

array([[-2.4921439,  2.4942746],
       [ 2.9128332, -2.7602217],
       [ 3.0440102, -2.8486948],
       ...,
       [ 3.0164921, -2.9879208],
       [-2.4487917,  2.511305 ],
       [ 2.881147 , -2.6458378]], dtype=float32)
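Each row printed above holds two numbers because these are the classifier's logits, one score per class, rather than 768-dimensional vectors. A softmax turns a row of logits into class probabilities. The sketch below applies it to the first row shown above; since `LabelEncoder` assigns labels in alphabetical order, index 0 corresponds to ham and index 1 to spam.

```python
import math

def softmax(logits):
    # Subtract the maximum first for numerical stability.
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

first_row = [-2.4921439, 2.4942746]  # first row of the array printed above
probs = softmax(first_row)

print(probs)
print("predicted class index:", probs.index(max(probs)))
```

`torch.argmax` on the logits gives the same predicted class as taking the largest probability, which is why the evaluation cells skip the softmax.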

In [45]:
# Save embeddings to files (you can choose your desired format)
np.save('train_embeddings.npy', train_embeddings)
np.save('val_embeddings.npy', val_embeddings)
np.save('test_embeddings.npy', test_embeddings)

## Lab Task 2: Vectorization and Classification

### Lab Task 2.1: Vectorization using Bag of Words, TF-IDF, and Word2Vec (Skip-Gram)

In [46]:
from gensim.models import Word2Vec

spam_df = pd.read_csv(path_to_dataset, encoding='ISO-8859-1')

# List of columns to drop
columns_to_drop = ['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4']

# drop the specified columns
spam_df = spam_df.drop(columns = columns_to_drop)

# rename 'v1' to 'label' and 'v2' to 'text'
spam_df = spam_df.rename(columns={'v1': 'label', 'v2': 'text'})

# Tokenizing the processed text for Word2Vec
tokenized_texts = [text.split() for text in spam_df['text']]

# Training the Word2Vec model (sg=1 selects Skip-gram; the default sg=0 is CBOW)
model = Word2Vec(sentences=tokenized_texts, vector_size=100, window=5, min_count=1, workers=4, sg=1)
model.save("word2vec_model.model")

# To get the vector for a specific word
# vector = model.wv['example_word']


In [86]:
# 1. Transform the messages into Vectors using Bag of Words, TF-IDF, and Word2Vec (Skip-Gram)

# Bag of Words

# CountVectorizer tokenizes raw strings itself, so the text column can be passed in directly
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(spam_df['text'])
print(vectorizer.get_feature_names_out())

['00' '000' '000pes' ... 'ûïharry' 'ûò' 'ûówell']


In [87]:
print(X)

  (0, 3550)	1
  (0, 8030)	1
  (0, 4350)	1
  (0, 5920)	1
  (0, 2327)	1
  (0, 1303)	1
  (0, 5537)	1
  (0, 4087)	1
  (0, 1751)	1
  (0, 3634)	1
  (0, 8489)	1
  (0, 4476)	1
  (0, 1749)	1
  (0, 2048)	1
  (0, 7645)	1
  (0, 3594)	1
  (0, 1069)	1
  (0, 8267)	1
  (1, 5504)	1
  (1, 4512)	1
  (1, 4318)	1
  (1, 8392)	1
  (1, 5533)	1
  (2, 4087)	1
  (2, 3358)	1
  :	:
  (5570, 4218)	1
  (5570, 8313)	1
  (5570, 1084)	1
  (5570, 4615)	1
  (5570, 7039)	1
  (5570, 3308)	1
  (5570, 7627)	1
  (5570, 1438)	1
  (5570, 5334)	1
  (5570, 2592)	1
  (5570, 8065)	1
  (5570, 1778)	1
  (5570, 7049)	1
  (5570, 2892)	1
  (5570, 3470)	1
  (5570, 1786)	1
  (5570, 3687)	1
  (5570, 4161)	1
  (5570, 903)	1
  (5570, 1546)	1
  (5571, 7756)	1
  (5571, 5244)	1
  (5571, 4225)	2
  (5571, 7885)	1
  (5571, 6505)	1


In [88]:
# TF-IDF

# TfidfVectorizer tokenizes raw strings itself, so the text column can be passed in directly
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(spam_df['text'])
print(vectorizer.get_feature_names_out())

['00' '000' '000pes' ... 'ûïharry' 'ûò' 'ûówell']


In [89]:
print(X)

  (0, 8267)	0.18238655630689804
  (0, 1069)	0.3264252905795869
  (0, 3594)	0.15318864840197105
  (0, 7645)	0.15566431601878158
  (0, 2048)	0.2757654045621182
  (0, 1749)	0.3116082237740733
  (0, 4476)	0.2757654045621182
  (0, 8489)	0.22080132794235655
  (0, 3634)	0.1803175103691124
  (0, 1751)	0.2757654045621182
  (0, 4087)	0.10720385321563428
  (0, 5537)	0.15618023117358304
  (0, 1303)	0.24415547176756056
  (0, 2327)	0.25279391746019725
  (0, 5920)	0.2553151503985779
  (0, 4350)	0.3264252905795869
  (0, 8030)	0.22998520738984352
  (0, 3550)	0.1481298737377147
  (1, 5533)	0.5465881710238072
  (1, 8392)	0.4316010362639011
  (1, 4318)	0.5236458071582338
  (1, 4512)	0.4082988561907181
  (1, 5504)	0.27211951321382544
  (2, 77)	0.23012628226525952
  (2, 1156)	0.16541257593676326
  :	:
  (5570, 1786)	0.2829205787072918
  (5570, 3470)	0.2752778321471703
  (5570, 2892)	0.24400995680638932
  (5570, 7049)	0.20534386872930602
  (5570, 1778)	0.1366456751602606
  (5570, 8065)	0.20880862098597563
  

In [90]:
# Word2Vec (Skip-Gram)

# `model` is the Word2Vec model trained on the tokenized messages in the earlier cell

# Use the Word2Vec model to convert messages to vectors
def word2vec_transform(text, model):
    word_vectors = []
    for word in text.split():
        if word in model.wv:
            word_vectors.append(model.wv[word])
    if not word_vectors:
        # If no word vectors are found, return a NumPy vector of zeros
        return np.zeros(model.vector_size)
    return np.mean(word_vectors, axis=0)

spam_df['word2vec_vectors'] = spam_df['text'].apply(lambda x: word2vec_transform(x, model))

# Stack the per-message vectors into a single (n_messages, vector_size) matrix
X_word2vec = np.vstack(spam_df['word2vec_vectors'])

### Lab Task 2.2: Encode Labels and Split the Dataset into Train and Test Sets

In [91]:
# 2. Encode the labels

from sklearn.preprocessing import LabelEncoder

# Encoding labels
encoder = LabelEncoder()
y = encoder.fit_transform(spam_df['label'])



In [92]:
from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF Vectorization
tfidf_vectorizer = TfidfVectorizer(max_features=5000)  # Limiting to 5000 most frequent words for simplicity
X_tfidf = tfidf_vectorizer.fit_transform(spam_df['text'])

In [93]:
# 3. Split the data into train and test sets for each of the vector representation. Test set should be 20% of the entire dataset.

from sklearn.model_selection import train_test_split

# Using TF-IDF vectors for this example, but you can replace with BoW or Word2Vec vectors as needed
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.2, random_state=42)

### Lab Task 2.3: Train and Evaluate Models

#### Logistic Regression:
The logistic regression model is based on the logistic function, which is a type of S-shaped curve that maps any continuous input to a probability value between 0 and 1. The logistic function allows us to model the relationship between the independent variables and the probability of the dependent variable taking on the value of 1.
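The logistic function described above can be written in a few lines. This is a minimal sketch of the prediction step only; the weights and bias are hypothetical values rather than ones learned from the spam data, and `predict_proba` here is an illustrative helper, not scikit-learn's method.

```python
import math

def sigmoid(z):
    # Maps any real number to a value strictly between 0 and 1.
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    # Weighted sum of the features, passed through the logistic function.
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

# Hypothetical parameters for a model with two input features.
weights = [2.0, -1.0]
bias = -0.5

probability = predict_proba([1.0, 0.5], weights, bias)
print(probability)
print("class 1" if probability >= 0.5 else "class 0")
```

Training a logistic regression model means finding the weights and bias that make these probabilities match the labels as closely as possible.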

In [94]:
# 4. Use Logistic Regression with all 3 vector types to predict whether the sms is spam or ham.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Training a logistic regression model
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# Predictions
y_pred = clf.predict(X_test)

# Evaluation
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
print("\nClassification Report:\n", report)



Accuracy: 0.96

Classification Report:
               precision    recall  f1-score   support

           0       0.96      1.00      0.98       965
           1       1.00      0.73      0.84       150

    accuracy                           0.96      1115
   macro avg       0.98      0.86      0.91      1115
weighted avg       0.96      0.96      0.96      1115
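The report is worth reading past the accuracy line, because the classes are imbalanced. The short calculation below uses only the numbers printed above to show the accuracy of a trivial classifier that always predicts ham (class 0), and how the F1 score for spam (class 1) follows from its precision and recall. The helper name `f1_from` is hypothetical.

```python
ham_support, spam_support = 965, 150  # "support" column of the report
total = ham_support + spam_support

# Accuracy of a classifier that always predicts the majority class (ham).
baseline_accuracy = ham_support / total

def f1_from(precision, recall):
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

spam_f1 = f1_from(1.00, 0.73)  # precision and recall of class 1 in the report

print(round(baseline_accuracy, 3))
print(round(spam_f1, 2))
```

An accuracy of 0.96 is therefore only about ten points above the always-ham baseline, and the spam recall of 0.73 shows that roughly a quarter of spam messages are still missed.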

