### **Structured Approach for NLP Project: Text Embeddings**

#### **Explore and Compare Text Embeddings**

1. **Word2Vec:**
   - **Overview:** Shallow neural network model from Google that learns a dense vector for each word.
   - **Key Features:** Captures semantic similarity from local context windows; trained with either the CBOW or the Skip-Gram architecture.

2. **GloVe:**
   - **Overview:** Unsupervised algorithm from Stanford that learns word embeddings from global word co-occurrence statistics.
   - **Key Features:** Combines the advantages of global matrix factorization with those of local context-window methods such as Word2Vec.

3. **BERT:**
   - **Overview:** Transformer-based model by Google that generates context-aware embeddings.
   - **Key Features:** Pretrained on large corpora (e.g., Wikipedia) and can be fine-tuned for specific tasks; in this notebook it is used as a frozen feature extractor, without fine-tuning.
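
The difference between the two Word2Vec architectures is easiest to see in the training examples each one builds from a context window. The following is a minimal sketch in plain Python, not gensim's implementation; `context_windows` and the sample sentence are made up for illustration. In gensim the choice is made with the `sg` parameter of `Word2Vec` (`sg=0` for CBOW, the default, and `sg=1` for Skip-Gram).

```python
def context_windows(tokens, window=2):
    """Yield (center, context_words) for every position in the token list."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield center, left + right

tokens = "the budget sets the scene".split()

# Skip-Gram: one training example per (center word, single context word).
skipgram_pairs = [
    (center, ctx)
    for center, context in context_windows(tokens, window=1)
    for ctx in context
]

# CBOW: one training example per position, predicting the center word
# from all of its context words at once.
cbow_pairs = [
    (tuple(context), center)
    for center, context in context_windows(tokens, window=1)
]

print(skipgram_pairs[:3])
print(cbow_pairs[:2])
```

Skip-Gram produces more, finer-grained examples per sentence, while CBOW produces one example per position.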

In [None]:
!pip install pandas numpy scikit-learn gensim transformers torch


In [116]:
import pandas as pd
import numpy as np
import re
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from gensim.models import Word2Vec
from gensim.models import KeyedVectors
from transformers import BertTokenizer, BertModel
import torch

In [117]:
# Load the dataset (pandas is already imported above)
df = pd.read_csv("df_file.csv")

df.head()


Unnamed: 0,Text,Label
0,Budget to set scene for election\n \n Gordon B...,0
1,Army chiefs in regiments decision\n \n Militar...,0
2,Howard denies split over ID cards\n \n Michael...,0
3,Observers to monitor UK election\n \n Minister...,0
4,Kilroy names election seat target\n \n Ex-chat...,0


In [118]:
df.columns


Index(['Text', 'Label'], dtype='object')

In [119]:
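def preprocess_text(text):
    """Lowercase the text and strip punctuation and other special characters."""
    # Lowercase
    text = text.lower()
    # Delete every character that is neither alphanumeric nor whitespace.
    # Nothing is inserted in its place, so "ex-chat" becomes "exchat".
    text = ''.join(ch for ch in text if ch.isalnum() or ch.isspace())
    return text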

# Apply preprocessing
df['cleaned_text'] = df['Text'].apply(preprocess_text)  
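
Because the cleaning step deletes punctuation rather than replacing it, hyphenated or abbreviated words are fused into a single token (the `exchat` visible in the output further down). The sketch below contrasts that behaviour with a hypothetical variant, `preprocess_text_spaced`, which substitutes a space instead. It is not part of the notebook and was not used for the reported results. It also restricts the kept characters to ASCII letters and digits, whereas `str.isalnum` accepts any Unicode letter.

```python
import re

def preprocess_text_original(text):
    # Mirrors the notebook's logic: lowercase, then drop every character
    # that is neither alphanumeric nor whitespace.
    text = text.lower()
    return ''.join(ch for ch in text if ch.isalnum() or ch.isspace())

def preprocess_text_spaced(text):
    # Hypothetical variant: replace each run of other characters with a
    # space, then collapse repeated whitespace.
    text = re.sub(r"[^a-z0-9\s]+", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()

sample = "Ex-chat show host's U.K. seat"
print(preprocess_text_original(sample))  # exchat show hosts uk seat
print(preprocess_text_spaced(sample))    # ex chat show host s u k seat
```

Which behaviour is preferable depends on the vocabulary of the embedding model: pretrained vectors are more likely to contain `ex` and `chat` than `exchat`.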


In [120]:
df.head()

Unnamed: 0,Text,Label,cleaned_text
0,Budget to set scene for election\n \n Gordon B...,0,budget to set scene for election\n \n gordon b...
1,Army chiefs in regiments decision\n \n Militar...,0,army chiefs in regiments decision\n \n militar...
2,Howard denies split over ID cards\n \n Michael...,0,howard denies split over id cards\n \n michael...
3,Observers to monitor UK election\n \n Minister...,0,observers to monitor uk election\n \n minister...
4,Kilroy names election seat target\n \n Ex-chat...,0,kilroy names election seat target\n \n exchat ...


In [121]:
# Split the dataset into training and testing sets
X = df['cleaned_text']
y = df['Label'] 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### **Embedding Generation**

In [123]:
# Word2Vec
# Tokenize the training documents on whitespace
train_tokens = [text.split() for text in X_train]
# Train on the training split only (sg is left at its default, so this is CBOW)
word2vec_model = Word2Vec(sentences=train_tokens, vector_size=100, window=5, min_count=1, workers=4)

def get_word2vec_embeddings(texts):
    """Represent each document as the mean of its in-vocabulary word vectors."""
    embeddings = []
    for text in texts:
        # Look up the vector of every word the model knows
        word_vectors = [word2vec_model.wv[word] for word in text.split() if word in word2vec_model.wv]

        if word_vectors:
            vec = np.mean(word_vectors, axis=0)
        else:
            # No known words: fall back to a zero vector
            vec = np.zeros(word2vec_model.vector_size)

        embeddings.append(vec)
    return np.array(embeddings)

X_train_w2v = get_word2vec_embeddings(X_train)
X_test_w2v = get_word2vec_embeddings(X_test)
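
Averaging word vectors discards word order, so two documents containing the same words get the same embedding. The sketch below illustrates this with a made-up two-dimensional vocabulary; `toy_vectors` and `mean_pool` are hypothetical names, and plain lists stand in for the NumPy arrays used in the notebook.

```python
# Made-up 2-dimensional "embeddings" for illustration only.
toy_vectors = {"dog": [1.0, 0.0], "bites": [0.0, 1.0], "man": [1.0, 1.0]}

def mean_pool(text, vectors, dim=2):
    """Average the vectors of known words; zero vector if none are known."""
    found = [vectors[word] for word in text.split() if word in vectors]
    if not found:
        return [0.0] * dim
    # zip(*found) groups the values of each dimension together
    return [sum(column) / len(found) for column in zip(*found)]

a = mean_pool("dog bites man", toy_vectors)
b = mean_pool("man bites dog", toy_vectors)
print(a == b)                                   # True: word order is lost
print(mean_pool("zebra crossing", toy_vectors)) # [0.0, 0.0]: no known words
```

For topic classification of news articles this loss is usually tolerable, since the topic is carried mostly by which words occur rather than by their order.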

In [124]:
# GloVe
# Pretrained 100-dimensional vectors; the file must be downloaded separately
glove_file = 'glove.6B.100d.txt'
glove_dim = 100
glove_vectors = {}
with open(glove_file, 'r', encoding='utf-8') as f:
    for line in f:
        # Each line holds a word followed by its vector components
        values = line.split()
        word = values[0]
        glove_vectors[word] = np.array(values[1:], dtype='float32')

def get_glove_embeddings(texts):
    """Represent each document as the mean of its GloVe word vectors."""
    embeddings = []
    for text in texts:
        # Look up the vector of every word found in the GloVe vocabulary
        glove_vectors_found = [glove_vectors[word] for word in text.split() if word in glove_vectors]

        if glove_vectors_found:
            vec = np.mean(glove_vectors_found, axis=0)
        else:
            # No known words: fall back to a zero vector
            vec = np.zeros(glove_dim)

        embeddings.append(vec)
    return np.array(embeddings)


X_train_glove = get_glove_embeddings(X_train)
X_test_glove = get_glove_embeddings(X_test)
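
The GloVe text format is one entry per line: the token, then its vector components, separated by single spaces. Taking `values[0]` as the word works as long as no token contains whitespace. The sketch below is a slightly more defensive parser that reads the last `dim` fields as the vector and treats everything before them as the token. The function name `load_glove` and the three-dimensional sample data are made up for illustration, and plain lists stand in for NumPy arrays.

```python
import io

def load_glove(handle, dim):
    """Parse GloVe-style text lines into a {token: vector} dictionary."""
    vectors = {}
    for line in handle:
        parts = line.rstrip("\n").split(" ")
        if len(parts) < dim + 1:
            continue  # skip blank or malformed lines
        # The last `dim` fields are the vector; the rest is the token,
        # which may itself contain spaces.
        word = " ".join(parts[:-dim])
        vectors[word] = [float(x) for x in parts[-dim:]]
    return vectors

# In-memory stand-in for a GloVe file with 3-dimensional vectors.
sample = io.StringIO("the 0.1 0.2 0.3\nnew york 0.4 0.5 0.6\n\n")
vecs = load_glove(sample, dim=3)
print(sorted(vecs))
```

Passing the expected dimension explicitly also catches the case where a file of the wrong size is loaded by mistake.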

In [125]:
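# BERT
# Load the pretrained BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = BertModel.from_pretrained('bert-base-uncased')
bert_model.eval()  # explicitly select inference mode (dropout off)

def get_bert_embeddings(texts):
    """Represent each document as the mean of its final-layer token vectors."""
    embeddings = []
    for text in texts:
        # Documents longer than 512 tokens are truncated
        inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=512)
        # Feature extraction only, so no gradients are needed
        with torch.no_grad():
            outputs = bert_model(**inputs)
        # Mean-pool over the token dimension to get one vector per document
        vec = outputs.last_hidden_state.mean(dim=1).numpy()
        embeddings.append(vec.flatten())
    return np.array(embeddings)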

X_train_bert = get_bert_embeddings(X_train.tolist())
X_test_bert = get_bert_embeddings(X_test.tolist())
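
The function above encodes one document at a time, so a plain mean over the token dimension is safe. If several documents were encoded in one batch, the shorter ones would be padded, and the padding positions would have to be excluded from the average using the attention mask. The sketch below shows that masked mean with NumPy arrays standing in for `last_hidden_state` and `attention_mask`; `masked_mean_pool` and the toy values are hypothetical and not part of the notebook.

```python
import numpy as np

def masked_mean_pool(hidden, mask):
    """Average token vectors, ignoring positions where mask == 0.

    hidden: array of shape (batch, seq_len, dim)
    mask:   array of shape (batch, seq_len) with 1 for real tokens
    """
    mask = mask[:, :, None].astype(hidden.dtype)      # (batch, seq_len, 1)
    summed = (hidden * mask).sum(axis=1)              # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)    # avoid division by zero
    return summed / counts

hidden = np.array([
    [[1.0, 1.0], [3.0, 3.0], [100.0, 100.0]],  # third position is padding
    [[2.0, 0.0], [4.0, 2.0], [6.0, 4.0]],      # no padding
])
mask = np.array([[1, 1, 0], [1, 1, 1]])

pooled = masked_mean_pool(hidden, mask)
print(pooled)               # padding is ignored in the first row
print(hidden.mean(axis=1))  # naive mean is distorted by the padding vector
```

The same arithmetic carries over to PyTorch tensors, which is where it would be applied when batching the tokenizer calls.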



In [126]:
# Model Training and Evaluation

def evaluate_model(X_train, X_test, y_train, y_test):
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average='weighted')
    recall = recall_score(y_test, y_pred, average='weighted')
    f1 = f1_score(y_test, y_pred, average='weighted')
    
    return accuracy, precision, recall, f1
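
With `average='weighted'`, each per-class score is weighted by that class's share of the true labels. One consequence is that weighted recall always equals accuracy, which is why the two columns match in every result below. The sketch is a hand-rolled illustration of that averaging on made-up labels, not scikit-learn's implementation; `weighted_precision_recall` is a hypothetical helper.

```python
from collections import Counter

def weighted_precision_recall(y_true, y_pred):
    """Support-weighted average of per-class precision and recall."""
    support = Counter(y_true)   # number of true examples per class
    n = len(y_true)
    precision = recall = 0.0
    for label in sorted(support):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == label)
        predicted = sum(1 for p in y_pred if p == label)
        class_precision = tp / predicted if predicted else 0.0
        class_recall = tp / support[label]
        weight = support[label] / n
        precision += weight * class_precision
        recall += weight * class_recall
    return precision, recall

# Made-up labels for illustration
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]

precision, recall = weighted_precision_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"precision={precision:.4f} recall={recall:.4f} accuracy={accuracy:.4f}")
```

Since recall carries no extra information here, the precision and F1 columns are the ones worth comparing against accuracy.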

In [127]:
# Evaluate the models
print("Word2Vec Embeddings:")
results_w2v = evaluate_model(X_train_w2v, X_test_w2v, y_train, y_test)
print(f"Accuracy: {results_w2v[0]:.4f}, Precision: {results_w2v[1]:.4f}, Recall: {results_w2v[2]:.4f}, F1-score: {results_w2v[3]:.4f}")

Word2Vec Embeddings:
Accuracy: 0.8247, Precision: 0.8253, Recall: 0.8247, F1-score: 0.8237


In [128]:
print("\nGloVe Embeddings:")
results_glove = evaluate_model(X_train_glove, X_test_glove, y_train, y_test)
print(f"Accuracy: {results_glove[0]:.4f}, Precision: {results_glove[1]:.4f}, Recall: {results_glove[2]:.4f}, F1-score: {results_glove[3]:.4f}")


GloVe Embeddings:
Accuracy: 0.9483, Precision: 0.9491, Recall: 0.9483, F1-score: 0.9481


In [129]:
print("\nBERT Embeddings:")
results_bert = evaluate_model(X_train_bert, X_test_bert, y_train, y_test)
print(f"Accuracy: {results_bert[0]:.4f}, Precision: {results_bert[1]:.4f}, Recall: {results_bert[2]:.4f}, F1-score: {results_bert[3]:.4f}")


BERT Embeddings:
Accuracy: 0.9843, Precision: 0.9846, Recall: 0.9843, F1-score: 0.9842


### **Comparison of Embedding Types for Text Classification**

| **Embedding Type** | **Accuracy** | **Precision** | **Recall** | **F1-score** |
|--------------------|--------------|---------------|------------|--------------|
| **BERT**           | 0.9843       | 0.9846        | 0.9843     | 0.9842       |
| **GloVe**          | 0.9483       | 0.9491        | 0.9483     | 0.9481       |
| **Word2Vec**       | 0.8247       | 0.8253        | 0.8247     | 0.8237       |
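
Accuracies this close to 1.0 are easier to compare as error rates. The short calculation below uses only the accuracy values from the table; the variable names are arbitrary.

```python
accuracy = {"BERT": 0.9843, "GloVe": 0.9483, "Word2Vec": 0.8247}

# Error rate is the share of misclassified test documents.
error_rate = {name: round(1.0 - acc, 4) for name, acc in accuracy.items()}

# Express each error rate relative to the best model.
baseline = error_rate["BERT"]
for name, err in sorted(error_rate.items(), key=lambda item: item[1]):
    print(f"{name:<9} error={err:.4f}  ({err / baseline:.1f}x the BERT error)")
```

Read this way, the GloVe features lead to roughly three times as many misclassifications as the BERT features, and the Word2Vec features to roughly eleven times as many.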

### **Findings and Insights**

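- **BERT:** Best performer (accuracy 0.9843, F1-score 0.9842). Its context-aware embeddings are the likely reason for the margin, but it is also the most computationally expensive option, since every document requires a forward pass through the transformer.
- **GloVe:** Strong performance (accuracy 0.9483) from static pretrained vectors. Embedding a document is only a dictionary lookup and an average, but each word has a single vector regardless of context.
- **Word2Vec:** Lowest performance (accuracy 0.8247). This model was trained from scratch on the training split only, whereas the GloVe and BERT models were pretrained on much larger corpora, so the gap reflects the amount of training data as well as the algorithm. It is cheap to train and to apply.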

### **Conclusion**
- **Use BERT** for tasks needing deep contextual understanding, when the compute budget allows.
- **Choose GloVe** for a balance between efficiency and performance.
- **Apply Word2Vec** for simple tasks where speed matters more than accuracy.