# An Introduction to Embeddings

## Introduction

This notebook was created by [Jupyter AI](https://github.com/jupyterlab/jupyter-ai) with the following prompt:

> /generate a notebook that teaches about embeddings 

This Jupyter notebook is an outline-style introduction to embeddings. It covers what embeddings are and the role they play in machine learning; the main types (word, image, and graph embeddings); popular word embedding algorithms such as Word2Vec and GloVe; popular image embedding architectures, that is, convolutional networks such as VGG16 and ResNet; popular graph embedding algorithms such as GraphSAGE and Node2Vec; applications in natural language processing, computer vision, and network analysis; techniques for visualizing embeddings; and a guide to training your own embeddings on your own data. The aim is to teach what embeddings are and how they are used in practice.

## Types of embeddings

In [None]:
## Word embeddings
# Word embeddings are vector representations of words in a continuous vector space.
# They capture semantic and syntactic relationships between words.
# Popular word embedding models include Word2Vec, GloVe, and FastText.
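
The claim that word vectors capture relationships between words is usually made concrete with cosine similarity: related words should point in similar directions. The sketch below uses plain Python and three hand-made, hypothetical 3-dimensional vectors (real embeddings are learned and typically have 100 or more dimensions), so the numbers only illustrate the mechanics.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors, chosen by hand for illustration only
toy_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

sim_cat_dog = cosine_similarity(toy_vectors["cat"], toy_vectors["dog"])
sim_cat_car = cosine_similarity(toy_vectors["cat"], toy_vectors["car"])
print("cat vs dog:", round(sim_cat_dog, 3))
print("cat vs car:", round(sim_cat_car, 3))
```

With these toy values the "cat"/"dog" pair scores higher than the "cat"/"car" pair, which is the pattern a trained model is expected to show for related and unrelated words.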

In [None]:
# Example code for loading Word2Vec embeddings
from gensim.models import Word2Vec

In [None]:
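# Load a Word2Vec model that was previously saved with gensim's `model.save(...)`.
# "word2vec_model.bin" is a placeholder path. Vectors stored in the original
# word2vec C binary format are loaded with
# KeyedVectors.load_word2vec_format(path, binary=True) instead.
word2vec_model = Word2Vec.load("word2vec_model.bin")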

In [None]:
## Image embeddings
# Image embeddings are vector representations of images.
# They encode visual features of images and can be used for tasks like image similarity and classification.
# Popular image embedding models include VGG16, ResNet, and Inception.

In [None]:
# Example code for loading VGG16 embeddings
from tensorflow.keras.applications.vgg16 import VGG16

In [None]:
# Load pre-trained VGG16 model
vgg16_model = VGG16(weights="imagenet")

In [None]:
## Graph embeddings
# Graph embeddings are vector representations of nodes or subgraphs in a graph.
# They capture structural and relational information of nodes and can be used for tasks like node classification and link prediction.
# Popular graph embedding models include node2vec, GraphSAGE, and DeepWalk.

In [None]:
# Example code for training node2vec embeddings
import networkx
import node2vec

In [None]:
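# Create a small example graph using networkx (node2vec needs at least a few edges to walk)
graph = networkx.Graph()
graph.add_edges_from([(1, 2), (1, 3), (2, 3), (2, 4), (3, 4), (4, 5)])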

In [None]:
# Set up node2vec on the graph: the constructor precomputes transition probabilities and generates the random walks
n2v = node2vec.Node2Vec(graph, dimensions=128)

In [None]:
# Train the model; fit() returns a gensim Word2Vec model whose `.wv` attribute holds the node embeddings
graph_embeddings = n2v.fit()

## Word embeddings

In [None]:
# Word embeddings are a type of word representation that allows words with similar meaning to have a similar representation. They are widely used in natural language processing tasks such as sentiment analysis, machine translation, and named entity recognition.

In [None]:
# Popular algorithms for generating word embeddings include Word2Vec and GloVe. In this section, we will explore these algorithms and their implementation.

In [None]:
# Import the necessary libraries
import numpy as np
from gensim.models import Word2Vec
from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.models import KeyedVectors

In [None]:
# Word2Vec
# Word2Vec is a popular algorithm for learning word embeddings. It uses a neural network model to learn word representations based on their context in a given text corpus.
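
To make "based on their context" concrete, here is a minimal sketch of how the skip-gram variant of Word2Vec turns a sentence into (target, context) training pairs. The function name `skipgram_pairs` and the window size are illustrative assumptions. Real training adds refinements such as negative sampling that are left out here, and gensim's `Word2Vec` uses the CBOW variant unless `sg=1` is passed.

```python
def skipgram_pairs(sentence, window=2):
    """Return (target, context) pairs for every word within `window` positions."""
    pairs = []
    for i, target in enumerate(sentence):
        start = max(0, i - window)
        end = min(len(sentence), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, sentence[j]))
    return pairs

sentence = ['I', 'love', 'machine', 'learning']
pairs = skipgram_pairs(sentence, window=1)
for pair in pairs:
    print(pair)
```

The model is then trained to predict the context word from the target word, so words that occur in similar contexts end up with similar vectors.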

In [None]:
# Create a sample sentence corpus
corpus = [
    ['I', 'love', 'machine', 'learning'],
    ['machine', 'learning', 'is', 'fun'],
    ['I', 'love', 'coding'],
    ['coding', 'is', 'fun']
]

In [None]:
# Train Word2Vec model on the corpus
model = Word2Vec(corpus, min_count=1)

In [None]:
# Get the word vector for a specific word
word_vector = model.wv['machine']
print("Word vector for 'machine':", word_vector)

In [None]:
# Find similar words to a given word
similar_words = model.wv.most_similar('machine')
print("Similar words to 'machine':", similar_words)

In [None]:
# GloVe
# GloVe (Global Vectors for Word Representation) is another popular algorithm for generating word embeddings. It leverages global word co-occurrence statistics to learn word representations.
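
The "global word co-occurrence statistics" that GloVe starts from can be sketched as a table of how often each pair of words appears within a fixed window. The version below uses plain counts and a hypothetical helper name, `cooccurrence_counts`. GloVe itself then fits vectors whose dot products approximate the logarithm of these counts, a step that is not shown here.

```python
from collections import Counter

def cooccurrence_counts(corpus, window=2):
    """Count how often each ordered (word, neighbour) pair occurs within `window` positions."""
    counts = Counter()
    for sentence in corpus:
        for i, word in enumerate(sentence):
            start = max(0, i - window)
            end = min(len(sentence), i + window + 1)
            for j in range(start, end):
                if j != i:
                    counts[(word, sentence[j])] += 1
    return counts

corpus = [
    ['I', 'love', 'machine', 'learning'],
    ['machine', 'learning', 'is', 'fun'],
    ['I', 'love', 'coding'],
    ['coding', 'is', 'fun'],
]
counts = cooccurrence_counts(corpus, window=1)
print("machine/learning:", counts[('machine', 'learning')])
print("love/coding:", counts[('love', 'coding')])
```

Unlike Word2Vec, which streams over local windows one at a time, this table is computed once over the whole corpus before any vectors are learned.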

In [None]:
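# Convert GloVe pre-trained vectors to Word2Vec format
# (gensim >= 4.0 can also read the GloVe file directly with
# KeyedVectors.load_word2vec_format(path, binary=False, no_header=True))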
glove_input_file = 'glove.6B.100d.txt'  # Path to the GloVe file
word2vec_output_file = 'glove.6B.100d.word2vec.txt'  # Path to the output file
glove2word2vec(glove_input_file, word2vec_output_file)

In [None]:
# Load the converted GloVe vectors
glove_model = KeyedVectors.load_word2vec_format(word2vec_output_file, binary=False)

In [None]:
# Get the word vector for a specific word
word_vector = glove_model['machine']
print("Word vector for 'machine' (GloVe):", word_vector)

In [None]:
# Find similar words to a given word
similar_words = glove_model.most_similar('machine')
print("Similar words to 'machine' (GloVe):", similar_words)

## Image embeddings

In [None]:
import tensorflow as tf
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input
from tensorflow.keras.preprocessing import image
import numpy as np

In [None]:
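# include_top=False drops the ImageNet classifier; pooling='avg' averages the final
# feature map so that each image yields a single 2048-dimensional vector
model = ResNet50(weights='imagenet', include_top=False, pooling='avg')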

In [None]:
def extract_image_embeddings(image_path):
    # Load the image at ResNet50's expected input size and add a batch dimension
    img = image.load_img(image_path, target_size=(224, 224))
    img = image.img_to_array(img)
    img = np.expand_dims(img, axis=0)
    img = preprocess_input(img)

    # Run the network and drop the batch dimension
    embeddings = model.predict(img)
    embeddings = np.squeeze(embeddings)

    return embeddings

In [None]:
image_path = 'path/to/image.jpg'
embeddings = extract_image_embeddings(image_path)
print(embeddings)
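
A convolutional backbone used without its classification head produces a spatial feature map before any pooling is applied (7 x 7 x 2048 for ResNet50 on a 224 x 224 input). A common way to reduce that map to one vector per image is global average pooling, which is what Keras applies when `pooling='avg'` is passed. The sketch below shows the operation on a tiny, hypothetical 2 x 2 x 3 feature map in plain Python.

```python
def global_average_pool(feature_map):
    """Average a height x width x channels nested list over its spatial positions."""
    height = len(feature_map)
    width = len(feature_map[0])
    channels = len(feature_map[0][0])
    totals = [0.0] * channels
    for row in feature_map:
        for cell in row:
            for c, value in enumerate(cell):
                totals[c] += value
    return [total / (height * width) for total in totals]

# Hypothetical 2 x 2 feature map with 3 channels
feature_map = [
    [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]],
    [[5.0, 4.0, 2.0], [7.0, 0.0, 2.0]],
]
embedding = global_average_pool(feature_map)
print(embedding)
```

The result has one value per channel, so its length depends only on the network and not on the spatial size of the feature map.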

## Graph embeddings

In [None]:
# Import necessary libraries
import networkx as nx
import numpy as np
from gensim.models import Word2Vec
from node2vec import Node2Vec
from stellargraph.layer import GraphSAGE  # the GraphSAGE class lives in stellargraph.layer

In [None]:
# Define a function that builds a small, fixed example graph
def generate_graph():
    G = nx.Graph()
    G.add_edges_from([(1, 2), (1, 3), (2, 3), (2, 4), (3, 4), (4, 5)])
    return G

In [None]:
# Build the example graph
graph = generate_graph()

In [None]:
# Define a function to generate graph embeddings using Node2Vec algorithm
def generate_node2vec_embeddings(graph):
    # Set Node2Vec parameters
    p = 1.0
    q = 1.0
    num_walks = 10
    walk_length = 80
    dimensions = 128
    window_size = 10
    
    # Generate random walks
    node2vec = Node2Vec(graph, dimensions=dimensions, walk_length=walk_length, num_walks=num_walks, p=p, q=q)
    model = node2vec.fit(window=window_size, min_count=1, batch_words=4)
    
    # Get node embeddings
    node_embeddings = {}
    for node in graph.nodes():
        # node2vec stores node ids as strings, so look each node up by str(node)
        node_embeddings[node] = model.wv[str(node)]
    
    return node_embeddings

In [None]:
# Generate Node2Vec embeddings
node2vec_embeddings = generate_node2vec_embeddings(graph)
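
Node2Vec learns embeddings by sampling random walks over the graph and feeding them to Word2Vec as if they were sentences. With `p = 1.0` and `q = 1.0`, as set above, the biased walk reduces to a uniform random walk. The sketch below generates such walks in plain Python on the same edges as the example graph. The helper `random_walk` and the short walk length are illustrative assumptions, not the library's implementation.

```python
import random

def random_walk(adjacency, start, walk_length, rng):
    """Uniform random walk: repeatedly step to a randomly chosen neighbour."""
    walk = [start]
    while len(walk) < walk_length:
        neighbours = adjacency[walk[-1]]
        if not neighbours:  # isolated node: nowhere to go
            break
        walk.append(rng.choice(neighbours))
    return walk

# Same edges as the example graph above
edges = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4), (4, 5)]
adjacency = {}
for u, v in edges:
    adjacency.setdefault(u, []).append(v)
    adjacency.setdefault(v, []).append(u)

rng = random.Random(42)  # fixed seed so the walks are reproducible
walks = [random_walk(adjacency, node, walk_length=5, rng=rng)
         for node in sorted(adjacency)]
for walk in walks:
    print(walk)
```

Nodes that often appear close together in these walks play the role of words that share a context, which is why nearby or structurally similar nodes receive similar vectors.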

In [None]:
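# Define a function to generate graph embeddings using the GraphSAGE algorithm.
# NOTE: the constructor arguments and the fit()/predict() calls below are schematic
# pseudocode. StellarGraph's GraphSAGE class is configured with `layer_sizes` and a
# node generator and is trained as a Keras model; see the StellarGraph documentation.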
def generate_graphsage_embeddings(graph):
    # Set GraphSAGE parameters
    dimensions = 128
    agg_func = 'mean'
    num_samples = [10, 5]
    
    # Generate GraphSAGE embeddings
    graphsage = GraphSAGE(graph, dimensions=dimensions, agg_func=agg_func, num_samples=num_samples)
    model = graphsage.fit()
    
    # Get node embeddings
    node_embeddings = {}
    for node in graph.nodes():
        node_embeddings[node] = model.predict(node)
    
    return node_embeddings

In [None]:
# Generate GraphSAGE embeddings
graphsage_embeddings = generate_graphsage_embeddings(graph)

## Applications of embeddings

In [None]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [None]:
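# Placeholder: load_sentiment_analysis_dataset() is not defined in this notebook.
# It is assumed to return a pandas DataFrame with 'text' and 'labels' columns.
data = load_sentiment_analysis_dataset()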

In [None]:
X_train, X_test, y_train, y_test = train_test_split(data['text'], data['labels'], test_size=0.2, random_state=42)

In [None]:
# Bag-of-words count features: a sparse baseline representation, not learned dense embeddings
vectorizer = CountVectorizer()
X_train_embeddings = vectorizer.fit_transform(X_train)
X_test_embeddings = vectorizer.transform(X_test)

In [None]:
model = LogisticRegression()
model.fit(X_train_embeddings, y_train)

In [None]:
y_pred = model.predict(X_test_embeddings)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
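
The features in this baseline are word counts, which makes a useful contrast with the dense embeddings used elsewhere in the notebook. The sketch below shows the idea behind a count vectorizer in plain Python. It uses a simplified whitespace tokenizer and hypothetical helper names, so it does not reproduce scikit-learn's tokenization or vocabulary ordering.

```python
def build_vocabulary(documents):
    """Assign each distinct lowercase token an index, in order of first appearance."""
    vocabulary = {}
    for document in documents:
        for token in document.lower().split():
            vocabulary.setdefault(token, len(vocabulary))
    return vocabulary

def count_vector(document, vocabulary):
    """Return one count per vocabulary entry; unknown tokens are ignored."""
    vector = [0] * len(vocabulary)
    for token in document.lower().split():
        index = vocabulary.get(token)
        if index is not None:
            vector[index] += 1
    return vector

documents = ["the movie was great", "the movie was bad bad"]
vocabulary = build_vocabulary(documents)
vectors = [count_vector(document, vocabulary) for document in documents]
print(vocabulary)
print(vectors)
```

Each vector has one dimension per vocabulary word, so "great" and "bad" are simply different columns. Nothing in this representation says that two different words are related, which is the gap learned embeddings are meant to fill.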

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from transformers import BertTokenizer, BertModel
from torch.utils.data import DataLoader, TensorDataset

In [None]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

In [None]:
input_ids = []
attention_masks = []

In [None]:
for text in data['text']:
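    encoded_text = tokenizer.encode_plus(
        text,
        add_special_tokens=True,
        max_length=512,
        padding='max_length',  # replaces the deprecated pad_to_max_length=True
        truncation=True,       # cut texts that are longer than max_length
        return_attention_mask=True,
        return_tensors='pt'
    )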
    input_ids.append(encoded_text['input_ids'])
    attention_masks.append(encoded_text['attention_mask'])

In [None]:
input_ids = torch.cat(input_ids, dim=0)
attention_masks = torch.cat(attention_masks, dim=0)
labels = torch.tensor(data['labels'].values)

In [None]:
dataset = TensorDataset(input_ids, attention_masks, labels)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

In [None]:
class BERTClassifier(nn.Module):
    def __init__(self):
        super(BERTClassifier, self).__init__()
        self.bert = model  # the pre-trained BertModel loaded above
        self.dropout = nn.Dropout(0.1)
        self.linear = nn.Linear(768, 2)  # 768 is the hidden size of bert-base
    
    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled_output = outputs.pooler_output  # sentence-level vector derived from [CLS]
        output = self.dropout(pooled_output)
        # Return raw logits: nn.CrossEntropyLoss applies log-softmax internally
        return self.linear(output)

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = BERTClassifier().to(device)
optimizer = optim.AdamW(model.parameters(), lr=2e-5)

In [None]:
for epoch in range(10):
    total_loss = 0
    total_steps = 0
    
    for batch in dataloader:
        input_ids, attention_masks, labels = batch
        input_ids = input_ids.to(device)
        attention_masks = attention_masks.to(device)
        labels = labels.to(device)
        
        optimizer.zero_grad()
        
        output = model(input_ids, attention_masks)
        loss = nn.CrossEntropyLoss()(output, labels)
        loss.backward()
        optimizer.step()
        
        total_loss += loss.item()
        total_steps += 1
    
    print("Epoch:", epoch+1, "Loss:", total_loss/total_steps)
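
`nn.CrossEntropyLoss`, used in the training loop above, expects raw logits and applies log-softmax itself, so a model's forward pass should not add its own softmax before this loss. The plain-Python sketch below, with made-up logits, shows how the loss changes when softmax is applied twice. The helper names are illustrative.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of scores."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]

def cross_entropy(logits, target):
    """Negative log-probability of the target class, computed from logits."""
    return -log_softmax(logits)[target]

logits = [2.0, 0.5]  # hypothetical scores for a two-class problem
loss_from_logits = cross_entropy(logits, target=0)

# Feeding probabilities instead of logits applies softmax a second time
probabilities = [math.exp(x) for x in log_softmax(logits)]
loss_from_probabilities = cross_entropy(probabilities, target=0)

print("loss from logits:", round(loss_from_logits, 4))
print("loss after double softmax:", round(loss_from_probabilities, 4))
```

Because probabilities lie between 0 and 1, a second softmax squeezes them towards a uniform distribution, so even a confident, correct prediction keeps a sizeable loss and training signals are weakened.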

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision.models as models
import torchvision.transforms as transforms
from torch.utils.data import DataLoader
from PIL import Image

In [None]:
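# `weights=` replaces the deprecated `pretrained=True` (torchvision >= 0.13)
model = models.vgg16(weights=models.VGG16_Weights.DEFAULT)
# Drop the classifier head, keeping the convolutional features and average pooling
model = nn.Sequential(*list(model.children())[:-1])
model.eval()  # inference mode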

In [None]:
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

In [None]:
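# Placeholder: get_image_paths() is not defined in this notebook; supply a list of image file paths
image_paths = get_image_paths()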
image_embeddings = []

In [None]:
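with torch.no_grad():  # no gradients are needed for feature extraction
    for image_path in image_paths:
        image = Image.open(image_path).convert('RGB')  # ensure 3 channels
        image = transform(image)
        image = image.unsqueeze(0)  # add a batch dimension
        # Flatten the (1, 512, 7, 7) feature map into one vector per image
        image_embedding = torch.flatten(model(image), start_dim=1)
        image_embeddings.append(image_embedding)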

In [None]:
image_embeddings = torch.cat(image_embeddings, dim=0)

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision.models as models
from torch.utils.data import DataLoader

In [None]:
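# Keep the full network so that its final layer (`model.fc`) can be replaced below
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)

In [None]:
# Freeze the pre-trained backbone
for param in model.parameters():
    param.requires_grad = False

In [None]:
# Replace the final fully connected layer with a new, trainable classification head.
# `num_classes` is not defined in this notebook; set it to the number of target classes.
num_features = model.fc.in_features
model.fc = nn.Linear(num_features, num_classes)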

In [None]:
# Placeholder: supply a Dataset that yields (image_tensor, label) pairs
dataset = CustomDataset()
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

In [None]:
for epoch in range(10):
    total_loss = 0
    total_steps = 0
    
    for batch in dataloader:
        images, labels = batch
        images = images.to(device)
        labels = labels.to(device)
        
        optimizer.zero_grad()
        
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        
        total_loss += loss.item()
        total_steps += 1
    
    print("Epoch:", epoch+1, "Loss:", total_loss/total_steps)

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch_geometric.nn as gnn
from torch_geometric.loader import DataLoader  # moved from torch_geometric.data in PyG 2.0

In [None]:
class GraphNet(nn.Module):
    def __init__(self, num_features, num_classes):
        super(GraphNet, self).__init__()
        self.conv1 = gnn.GCNConv(num_features, 16)
        self.conv2 = gnn.GCNConv(16, num_classes)
    
    def forward(self, x, edge_index):
        x = self.conv1(x, edge_index)
        x = nn.functional.relu(x)
        x = self.conv2(x, edge_index)
        return x

In [None]:
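# Placeholders: load_graph_data(), GraphDataset, num_features and num_classes are not
# defined in this notebook and must be supplied for your own graph data
data = load_graph_data()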

In [None]:
dataset = GraphDataset(data)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = GraphNet(num_features, num_classes).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

In [None]:
for epoch in range(10):
    total_loss = 0
    total_steps = 0
    
    for batch in dataloader:
        # PyTorch Geometric loaders yield Batch objects, not tuples
        batch = batch.to(device)
        
        optimizer.zero_grad()
        
        output = model(batch.x, batch.edge_index)
        loss = criterion(output, batch.y)
        loss.backward()
        optimizer.step()
        
        total_loss += loss.item()
        total_steps += 1
    
    print("Epoch:", epoch+1, "Loss:", total_loss/total_steps)
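
Each `GCNConv` layer in the model above mixes a node's features with those of its neighbours before applying learned weights. The sketch below shows only that mixing step, the symmetric normalization with self-loops from the original GCN formulation, in plain Python on a three-node path graph. The learned weight matrix is left out and the helper name `gcn_aggregate` is an illustrative assumption.

```python
import math

def gcn_aggregate(edges, features):
    """One round of GCN-style neighbourhood mixing, without learned weights."""
    nodes = sorted(features)
    # Add self-loops so that every node keeps part of its own signal
    neighbours = {n: {n} for n in nodes}
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    degree = {n: len(neighbours[n]) for n in nodes}
    dim = len(features[nodes[0]])
    output = {}
    for n in nodes:
        mixed = [0.0] * dim
        for m in neighbours[n]:
            # Symmetric normalization: 1 / sqrt(deg(n) * deg(m))
            weight = 1.0 / math.sqrt(degree[n] * degree[m])
            for k in range(dim):
                mixed[k] += weight * features[m][k]
        output[n] = mixed
    return output

# Path graph 0 - 1 - 2 where only node 0 carries a signal
edges = [(0, 1), (1, 2)]
features = {0: [1.0], 1: [0.0], 2: [0.0]}
smoothed = gcn_aggregate(edges, features)
print(smoothed)
```

After one round, node 1 has picked up part of node 0's signal while node 2 is still at zero. Stacking two layers, as `GraphNet` does, lets information travel two hops.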

## Embedding visualization

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

In [None]:
# Random vectors stand in for real embeddings here (100 items, 10 dimensions each)
embeddings = np.random.rand(100, 10)

In [None]:
tsne = TSNE(n_components=2, random_state=42)
embedded_embeddings = tsne.fit_transform(embeddings)

In [None]:
plt.figure(figsize=(10, 10))
plt.scatter(embedded_embeddings[:, 0], embedded_embeddings[:, 1], s=10)
plt.title("Embedding Visualization")
plt.xlabel("Dimension 1")
plt.ylabel("Dimension 2")
plt.show()

## Training your own embeddings

In [None]:
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split
from gensim.models import Word2Vec

In [None]:
class CustomDataset(Dataset):
    def __init__(self, data):
        self.data = data
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        return self.data[idx]

In [None]:
data = pd.read_csv('your_dataset.csv')

In [None]:
train_data, val_data = train_test_split(data, test_size=0.2, random_state=42)

In [None]:
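def train_word2vec(train_data, embedding_size=100, window_size=5, num_epochs=10):
    # Whitespace-tokenize each text into a list of words
    sentences = [str(sentence).split() for sentence in train_data['text']]
    # gensim >= 4.0 uses `vector_size` and `epochs` (formerly `size` and `iter`)
    model = Word2Vec(sentences, vector_size=embedding_size, window=window_size,
                     min_count=1, workers=4, epochs=num_epochs)
    model.save('embeddings.bin')
    return model

In [None]:
# Named train_word2vec so that the `train_embeddings` tensor defined later does not shadow it
embeddings_model = train_word2vec(train_data)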

In [None]:
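def text_to_embeddings(text, embeddings_model):
    # Average the vectors of all in-vocabulary words in the text
    embeddings = []
    for word in str(text).split():
        try:
            embeddings.append(embeddings_model.wv[word])
        except KeyError:
            continue
    if not embeddings:
        # No known words: return a zero vector instead of NaN
        return np.zeros(embeddings_model.wv.vector_size, dtype=np.float32)
    return np.mean(embeddings, axis=0)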
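
Averaging word vectors is a simple way to turn a text of any length into one fixed-size vector. The edge case to watch is a text with no in-vocabulary words, where there is nothing to average. The plain-Python sketch below uses a hypothetical two-word vocabulary and returns a zero vector in that case, which is one common convention rather than the only option.

```python
def mean_pool(text, word_vectors, dim):
    """Average the vectors of known words; fall back to zeros if none are known."""
    vectors = [word_vectors[word] for word in text.split() if word in word_vectors]
    if not vectors:
        return [0.0] * dim
    # zip(*vectors) groups the values of each dimension together
    return [sum(column) / len(vectors) for column in zip(*vectors)]

# Hypothetical 2-dimensional vectors for illustration only
word_vectors = {"good": [1.0, 0.0], "movie": [0.0, 1.0]}

both_known = mean_pool("good movie", word_vectors, dim=2)
one_known = mean_pool("good film", word_vectors, dim=2)
none_known = mean_pool("unknown words", word_vectors, dim=2)
print(both_known, one_known, none_known)
```

Unknown words are skipped rather than counted as zeros, so a text with one known word gets that word's vector unchanged.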

In [None]:
train_data['embeddings'] = train_data['text'].apply(lambda x: text_to_embeddings(x, embeddings_model))
val_data['embeddings'] = val_data['text'].apply(lambda x: text_to_embeddings(x, embeddings_model))

In [None]:
class EmbeddingModel(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(EmbeddingModel, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, output_size)
    
    def forward(self, x):
        x = self.fc1(x)
        x = nn.ReLU()(x)
        x = self.fc2(x)
        return x

In [None]:
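# Stack the per-row vectors into one (num_rows, embedding_size) float tensor
train_embeddings = torch.tensor(np.stack(train_data['embeddings'].to_list()), dtype=torch.float32)
val_embeddings = torch.tensor(np.stack(val_data['embeddings'].to_list()), dtype=torch.float32)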

In [None]:
input_size = train_embeddings.shape[1]
hidden_size = 50
output_size = 1
learning_rate = 0.001
batch_size = 64
num_epochs = 10

In [None]:
train_dataset = CustomDataset(train_embeddings)
val_dataset = CustomDataset(val_embeddings)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size)

In [None]:
model = EmbeddingModel(input_size, hidden_size, output_size)

In [None]:
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

In [None]:
for epoch in range(num_epochs):
    running_loss = 0.0
    
    model.train()
    for embeddings in train_loader:
        outputs = model(embeddings)
        # Placeholder target of all ones: replace with real labels for an actual task
        loss = criterion(outputs, torch.ones(outputs.shape))
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        running_loss += loss.item()
    
    model.eval()
    with torch.no_grad():
        val_loss = 0.0
        for embeddings in val_loader:
            outputs = model(embeddings)
            loss = criterion(outputs, torch.ones(outputs.shape))
            
            val_loss += loss.item()
        
    print(f"Epoch {epoch+1}: Training Loss = {running_loss/len(train_loader)}, Validation Loss = {val_loss/len(val_loader)}")