# Word Embeddings
## Word embedding - origins and fundamentals
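-    collective name for a set of language modeling and feature learning techniques in natural language processing (NLP)
-    one-hot encoding is an early example, but it works poorly: the vectors are sparse, as long as the vocabulary, and carry no notion of similarity between words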
-    other examples borrow from Information Retrieval (IR): Term Frequency-Inverse Document Frequency (TF-IDF), Latent Semantic Analysis (LSA), and topic modeling
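
To make the one-hot limitation concrete, here is a minimal sketch. The three-word vocabulary is made up for illustration. Every pair of distinct one-hot vectors has a dot product of zero, so the encoding cannot say that two words are related.

```python
import numpy as np

# Hypothetical toy vocabulary, chosen only for illustration.
vocab = ["king", "queen", "banana"]
one_hot = np.eye(len(vocab))  # row i is the one-hot vector for vocab[i]

king, queen, banana = one_hot  # unpack the three rows

# Distinct one-hot vectors are orthogonal, so "king" looks exactly as
# unrelated to "queen" as it does to "banana".
sim_king_queen = float(king @ queen)
sim_king_banana = float(king @ banana)
print(sim_king_queen, sim_king_banana)
```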

### Distributed representations
-    Attempt to capture the meaning of a word by considering its relations with other words in its context.
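
As a rough illustration of "meaning from context", the sketch below counts which words appear within a small window of each word in a toy corpus. The corpus and the window size are assumptions made for the example. Words used in similar contexts ("cat" and "dog" here) end up with overlapping neighbour counts, which is the signal distributed representations try to compress into dense vectors.

```python
from collections import Counter, defaultdict

# Toy corpus and window size, both invented for illustration.
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
]
WINDOW = 2

# cooc[w][c] = how often c appears within WINDOW positions of w
cooc = defaultdict(Counter)
for sentence in corpus:
    for i, word in enumerate(sentence):
        lo, hi = max(0, i - WINDOW), min(len(sentence), i + WINDOW + 1)
        for j in range(lo, hi):
            if j != i:
                cooc[word][sentence[j]] += 1

# Context words shared by "cat" and "dog"
shared = set(cooc["cat"]) & set(cooc["dog"])
print(sorted(shared))
```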

### Static embeddings
-    Embeddings are generated against a large corpus, but the number of words, though large, is finite.
-    Think of a static embedding as a dictionary: each word maps to exactly one vector, regardless of the context it appears in
#### Word2Vec
-    Self-supervised
-    Continuous Bag of Words (CBOW) and Skip-gram
-    Skip-Gram with Negative Sampling (SGNS) model
-    GloVe - Global vectors for word representation
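
The difference between the two Word2Vec architectures is easiest to see in the training examples they consume. The sketch below is a simplified illustration, not Gensim's implementation: the helper names and the window size are assumptions. Skip-gram predicts each context word from the center word, while CBOW predicts the center word from its whole context.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token within the window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((center, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs


def cbow_examples(tokens, window=2):
    """Return (context_words, center) examples, the inverse prediction task."""
    examples = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, center))
    return examples


tokens = "the quick brown fox".split()
print(skipgram_pairs(tokens, window=1))
print(cbow_examples(tokens, window=1))
```

Negative sampling (SGNS) would additionally pair each center word with a few randomly drawn words labelled as non-context; that step is left out here.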
#### Creating your own embeddings using Gensim
-    Gensim is an open-source Python library designed to extract semantic meaning from text documents.

In [1]:
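import os
import gensim.downloader as api
from gensim.models import Word2Vec

dataset = api.load("text8")         # downloads the text8 corpus on first use
model = Word2Vec(dataset)           # train with Gensim's default hyperparameters
os.makedirs("data", exist_ok=True)  # make sure the output folder exists
model.save("data/text8-word2vec.bin")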

### Exploring the embedding space with Gensim
-    Reload the model we just built and explore it

In [2]:
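from gensim.models import Word2Vec
# The file was written by Word2Vec.save(), so load it back as a Word2Vec model
model = Word2Vec.load("data/text8-word2vec.bin")
word_vectors = model.wv  # KeyedVectors object holding the trained vectors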

-    Look at the first few words in the vocabulary

In [3]:
# gensim 4.x exposes the vocabulary through key_to_index / index_to_key
# (the older word_vectors.vocab attribute is no longer available)
assert "king" in word_vectors.key_to_index

# Collect the vectors for the first 10 words in the vocabulary
my_dict = {}
for key in word_vectors.index_to_key[:10]:
    my_dict[key] = word_vectors[key]

my_dict.keys()

dict_keys(['the', 'of', 'and', 'one', 'in', 'a', 'to', 'zero', 'nine', 'two'])

Look for the words most similar to a given word, "king".

In [4]:
def print_most_similar(word_conf_pairs, k):
    for i, (word, conf) in enumerate(word_conf_pairs):
        print("{:.3f} {:s}".format(conf, word))
        if i >= k-1:
            break
    if k < len(word_conf_pairs):
        print("...")

print_most_similar(word_vectors.most_similar("king"), 5)

0.729 prince
0.699 emperor
0.697 vii
0.695 constantine
0.690 throne
...


You can also do vector arithmetic similar to the country-capital example we described earlier

In [5]:
print_most_similar(word_vectors.most_similar(
    positive=["france", "berlin"], negative=["paris"]), 1
)

0.779 germany
...


The preceding similarity value is the cosine similarity. Alternatively, most_similar_cosmul() computes the distance on a log scale, amplifying the differences between shorter distances and reducing the differences between longer ones.

In [6]:
print_most_similar(word_vectors.most_similar_cosmul(
    positive=["france", "berlin"], negative=["paris"]), 1
)

0.940 germany
...
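
The two analogy scores can be contrasted on toy vectors. This is a sketch of my understanding of the additive and multiplicative objectives, not Gensim's code: the 2-d vectors are invented, and the shift of cosines into [0, 1] and the small `eps` in the denominator are assumptions about how the multiplicative form is usually stabilised.

```python
import numpy as np


def cos(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def score_additive(cand, positive, negative):
    """Sum of cosines to the positive words minus cosines to the negative words."""
    return sum(cos(cand, p) for p in positive) - sum(cos(cand, n) for n in negative)


def score_multiplicative(cand, positive, negative, eps=1e-6):
    """Product of shifted cosines to positives divided by product for negatives."""
    def shifted(u, v):
        return (1.0 + cos(u, v)) / 2.0  # map cosine from [-1, 1] to [0, 1]
    num = np.prod([shifted(cand, p) for p in positive])
    den = np.prod([shifted(cand, n) for n in negative]) + eps
    return float(num / den)


# Invented 2-d "embeddings", for illustration only.
france = np.array([1.0, 0.2])
paris = np.array([0.9, 0.8])
berlin = np.array([0.1, 0.9])
germany = np.array([0.2, 0.3])

add_score = score_additive(germany, [france, berlin], [paris])
mul_score = score_multiplicative(germany, [france, berlin], [paris])
print(add_score, mul_score)
```

Because the multiplicative form divides rather than subtracts, one very large term cannot dominate the score the way it can in the additive form.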


Gensim also provides a doesnt_match() function, which picks the word that does not belong with the others in a list.

In [7]:
print(word_vectors.doesnt_match(["hindus", "parsis", "singapore", "christians"]))

singapore
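
One plausible way to find the odd one out is to compare each word with the average of the group and return the least similar one. The sketch below follows that idea on invented 2-d vectors; `odd_one_out` is a hypothetical helper, and Gensim's own implementation may differ in its details.

```python
import numpy as np


def odd_one_out(words, vectors):
    """Return the word whose unit vector is least similar to the group mean."""
    unit = np.array([vectors[w] / np.linalg.norm(vectors[w]) for w in words])
    mean = unit.mean(axis=0)
    mean /= np.linalg.norm(mean)
    sims = unit @ mean  # cosine similarity of each word to the mean direction
    return words[int(np.argmin(sims))]


# Invented toy vectors: three fruits pointing one way, "car" pointing another.
toy_vectors = {
    "apple": np.array([1.0, 0.1]),
    "pear": np.array([0.9, 0.2]),
    "plum": np.array([0.95, 0.15]),
    "car": np.array([0.1, 1.0]),
}
result = odd_one_out(["apple", "pear", "plum", "car"], toy_vectors)
print(result)
```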


We can also calculate the similarity between two words.

In [8]:
for word in ["woman", "dog", "whale", "tree"]:
    print("similarity({:s}, {:s}) = {:.3f}".format(
        "man", word, word_vectors.similarity("man", word)
    ))

similarity(man, woman) = 0.724
similarity(man, dog) = 0.413
similarity(man, whale) = 0.245
similarity(man, tree) = 0.286


The similar_by_word() function is functionally similar to most_similar(), except that the latter normalizes the vectors before comparing by default.

In [9]:
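# print_most_similar() already prints its results and returns None,
# so it should not be wrapped in another print()
print_most_similar(
    word_vectors.similar_by_word("singapore"), 5
)

0.877 malaysia
0.847 indonesia
0.825 uganda
0.815 nepal
0.810 thailand
...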


In [10]:
print("distance(singapore, malaysia) = {:.3f}".format(
    word_vectors.distance("singapore", "malaysia")
))

distance(singapore, malaysia) = 0.123
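
The two outputs above are consistent with each other: a similarity of 0.877 and a distance of 0.123 add up to 1, which is what you expect if the distance is the cosine distance, that is, 1 minus the cosine similarity. A small sketch on made-up vectors:

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between u and v."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def cosine_distance(u, v):
    """Cosine distance, defined as 1 - cosine similarity."""
    return 1.0 - cosine_similarity(u, v)


u = np.array([1.0, 2.0, 3.0])
v = np.array([2.0, 4.0, 6.0])    # same direction as u
w = np.array([-3.0, 0.0, 1.0])   # orthogonal to u

print(cosine_similarity(u, v), cosine_distance(u, v))
print(cosine_similarity(u, w), cosine_distance(u, w))
```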


Look up the vector for a vocabulary word either by indexing the word_vectors object directly or by calling its get_vector() method.

In [11]:
vec_song = word_vectors["song"]
# gensim 4.x: word_vec(..., use_norm=True) raises an error;
# get_vector(..., norm=True) returns the unit-normalized vector instead
vec_song_2 = word_vectors.get_vector("song", norm=True)

### Using word embeddings for spam detection

Embeddings provide a dense, fixed-dimension vector for each token. Each token is replaced with its vector, which converts the sequence of text into a matrix of examples, each of which has a fixed number of features corresponding to the dimensionality of the embedding.
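
The token-to-matrix conversion is just a row lookup. The sketch below uses a small random matrix in place of a real embedding (the sizes and token ids are made up) to show how a sequence of integer ids becomes a matrix with one row per token.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_vocab_size, toy_embedding_dim = 6, 4   # toy sizes, for illustration only
toy_E = rng.normal(size=(toy_vocab_size, toy_embedding_dim))

# A padded sequence of token ids; id 0 is assumed to be the padding token.
token_ids = np.array([0, 0, 3, 1, 5])

# Fancy indexing selects one embedding row per token id.
features = toy_E[token_ids]
print(features.shape)  # (sequence length, embedding dimension)
```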

- The classifier is a Convolutional Neural Network (CNN)
- The data is a collection of Short Message Service (SMS) messages

We will see how the program learns an embedding from scratch

In [12]:
#!pip install scikit-learn
import argparse
import gensim.downloader as api
import numpy as np
import os
import shutil
import tensorflow as tf
from sklearn.metrics import accuracy_score, confusion_matrix



### Getting the data

In [13]:
def download_and_read(url):
    local_file = url.split('/')[-1]
    p = tf.keras.utils.get_file(local_file, url,
        extract=True, cache_dir=".")
    labels, texts = [], []
    local_file = os.path.join("datasets", "SMSSpamCollection")
    with open(local_file, "r") as fin:
        for line in fin:
            label, text = line.strip().split('\t')
            labels.append(1 if label == 'spam' else 0)
            texts.append(text)
    return texts, labels

DATASET_URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip"
texts, labels = download_and_read(DATASET_URL)

### Making the data ready for use

- The next step is to process the data so it can be consumed by the network.

In [14]:
# tokenize and pad text
tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(texts)
text_sequences = tokenizer.texts_to_sequences(texts)
text_sequences = tf.keras.preprocessing.sequence.pad_sequences(text_sequences)
num_records = len(text_sequences)
max_seqlen = len(text_sequences[0])
print("{:d} sentences, max length: {:d}".format(num_records, max_seqlen))

5574 sentences, max length: 189
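
To see what tokenizing and padding do, here is a pure-Python imitation on three made-up messages. It is not the Keras implementation: `build_word_index` and `pad_pre` are hypothetical helpers that mimic the default behaviour as I understand it (ids start at 1 with the most frequent word first, and shorter sequences are padded with zeros on the left).

```python
from collections import Counter


def build_word_index(texts):
    """Map each word to an integer id, most frequent word first, ids from 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}


def pad_pre(sequences, pad_id=0):
    """Left-pad every sequence with pad_id up to the longest sequence length."""
    max_len = max(len(s) for s in sequences)
    return [[pad_id] * (max_len - len(s)) + list(s) for s in sequences]


toy_texts = ["free prize now", "call me", "win a free prize now"]
word_index = build_word_index(toy_texts)
sequences = [[word_index[w] for w in t.lower().split()] for t in toy_texts]
padded = pad_pre(sequences)
for row in padded:
    print(row)
```

Every row ends up with the length of the longest message, which is why `max_seqlen` can be read off the first padded sequence.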


Convert our labels to categorical or one-hot encoding format.

In [15]:
# Labels
NUM_CLASSES = 2
cat_labels = tf.keras.utils.to_categorical(labels, num_classes=NUM_CLASSES)

The Tokenizer gives access to the vocabulary it created through its word_index attribute.

In [17]:
# vocabulary
word2idx = tokenizer.word_index
idx2word = {v:k for k, v in word2idx.items()}
word2idx["PAD"] = 0
idx2word[0] = "PAD"
vocab_size = len(word2idx)
print("vocab size: {:d}".format(vocab_size))

vocab size: 9010


Create the dataset object that our network will work with.

In [19]:
# dataset
dataset = tf.data.Dataset.from_tensor_slices((text_sequences, cat_labels))
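# shuffle once only: reshuffling on every pass would let take()/skip()
# draw different, overlapping train/val/test splits on each iteration
dataset = dataset.shuffle(10000, reshuffle_each_iteration=False)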
test_size = num_records // 4
val_size = (num_records - test_size) // 10
test_dataset = dataset.take(test_size)
val_dataset = dataset.skip(test_size).take(val_size)
train_dataset = dataset.skip(test_size + val_size)
BATCH_SIZE = 128
test_dataset = test_dataset.batch(BATCH_SIZE, drop_remainder=True)
val_dataset = val_dataset.batch(BATCH_SIZE, drop_remainder=True)
train_dataset = train_dataset.batch(BATCH_SIZE, drop_remainder=True)
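
The split arithmetic can be checked by hand. The sketch below assumes 5574 records (the count printed earlier) and that the validation size is meant to be derived from `test_size`. It also shows how many full batches survive `drop_remainder=True`.

```python
num_records = 5574
BATCH_SIZE = 128

test_size = num_records // 4                    # a quarter of the data
val_size = (num_records - test_size) // 10      # a tenth of what is left
train_size = num_records - test_size - val_size

sizes = {"test": test_size, "val": val_size, "train": train_size}

# drop_remainder=True keeps only full batches
batches = {name: n // BATCH_SIZE for name, n in sizes.items()}
dropped = {name: n % BATCH_SIZE for name, n in sizes.items()}

print(sizes)
print(batches)
print(dropped)
```

With these numbers the validation set shrinks to three full batches, so validation metrics will be fairly noisy.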

### Building the embedding matrix

- The Gensim toolkit provides access to various pretrained embedding models. You can list them by running
<code>import gensim.downloader as api
api.info("models").keys()</code>

- Some of the pretrained word embeddings available:
    - Word2Vec
    - GloVe
    - fastText
    - ConceptNet Numberbatch

<hr/>
For our example, we will choose the 300d GloVe embeddings trained on the Gigaword corpus.

In [23]:
def build_embedding_matrix(sequences, word2idx, embedding_dim, embedding_file):
    if os.path.exists(embedding_file):
        E = np.load(embedding_file)
    else:
        vocab_size = len(word2idx)
        E = np.zeros((vocab_size, embedding_dim))
        word_vectors = api.load(EMBEDDING_MODEL)
        for word, idx in word2idx.items():
            try:
                # get_vector() replaces the deprecated word_vec() in gensim 4.x
                E[idx] = word_vectors.get_vector(word)
            except KeyError: # word not in embedding, keep the zero row
                pass
        np.save(embedding_file, E)
    return E

EMBEDDING_DIM = 300
DATA_DIR = "data"
EMBEDDING_NUMPY_FILE = os.path.join(DATA_DIR, "E.npy")
EMBEDDING_MODEL = "glove-wiki-gigaword-300"
E = build_embedding_matrix(text_sequences, word2idx,
                           EMBEDDING_DIM, EMBEDDING_NUMPY_FILE)
print("Embedding matrix:", E.shape)

Embedding matrix: (9010, 300)
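
Words missing from the pretrained model keep an all-zero row in the matrix, so a quick sanity check is to measure what fraction of rows were actually filled. The helper below is a hypothetical addition, shown on a tiny made-up matrix rather than on the real `E`.

```python
import numpy as np


def coverage(matrix):
    """Fraction of rows that contain at least one non-zero entry."""
    filled_rows = np.any(matrix != 0, axis=1)
    return float(np.count_nonzero(filled_rows) / matrix.shape[0])


# Toy matrix: 5 "words", 3 dimensions, only rows 1 and 3 were found.
toy_matrix = np.zeros((5, 3))
toy_matrix[1] = [0.1, 0.2, 0.3]
toy_matrix[3] = [0.0, -0.5, 0.4]

print(coverage(toy_matrix))
```

Calling the same function on the real embedding matrix would show how much of the SMS vocabulary the pretrained vectors cover.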



### Defining the spam classifier

- We will use a one-dimensional Convolutional Neural Network or ConvNet (1D CNN)

In [24]:
class SpamClassifierModel(tf.keras.Model):
    def __init__(self, vocab_sz, embeded_sz, input_length,
                 num_filters, kernel_sz, output_sz,
                 run_mode, embedding_weights,
                 **kwargs):
        super(SpamClassifierModel, self).__init__(**kwargs)
        if run_mode == "scratch":
            self.embedding = tf.keras.layers.Embedding(vocab_sz,
                embeded_sz, input_length=input_length,
                trainable=True)
        elif run_mode == "vectorizer":
            # frozen pretrained embedding: load the weights, do not train them
            self.embedding = tf.keras.layers.Embedding(vocab_sz,
                embeded_sz, input_length=input_length,
                weights=[embedding_weights],
                trainable=False)
        else:
            self.embedding = tf.keras.layers.Embedding(vocab_sz,
                embeded_sz, input_length=input_length,
                weights=[embedding_weights],
                trainable=True)

        self.conv = tf.keras.layers.Conv1D(filters=num_filters,
            kernel_size=kernel_sz, activation="relu")

        self.dropout = tf.keras.layers.SpatialDropout1D(0.2)
        self.pool = tf.keras.layers.GlobalMaxPooling1D()
        self.dense = tf.keras.layers.Dense(output_sz,
            activation="softmax")

    def call(self, x):
        x = self.embedding(x)
        x = self.conv(x)
        x = self.dropout(x)
        x = self.pool(x)
        x = self.dense(x)
        return x

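# Model definition
conv_num_filters = 256
conv_kernel_size = 3
# run_mode is not defined anywhere in the book's listing; it is set by hand
# here (an assumption) so that the cell runs. Use "scratch", "vectorizer",
# or any other value to select the fine-tuning branch.
run_mode = "scratch"
model = SpamClassifierModel(
    vocab_size, EMBEDDING_DIM, max_seqlen,
    conv_num_filters, conv_kernel_size, NUM_CLASSES,
    run_mode, E)
model.build(input_shape=(None, max_seqlen))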
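
To follow the tensor shapes through the convolution and pooling layers, here is a NumPy sketch of a 1D convolution with ReLU followed by global max pooling. It is an illustration of the operations, not the Keras implementation: it handles a single sequence, uses random weights, and assumes no padding at the sequence edges. The sizes (189 tokens, 300 dimensions, 256 filters, kernel size 3) are taken from the notebook.

```python
import numpy as np


def conv1d_relu(x, kernels, bias):
    """x: (seq_len, emb_dim); kernels: (kernel_size, emb_dim, num_filters)."""
    k = kernels.shape[0]
    out_len = x.shape[0] - k + 1               # no padding at the edges
    out = np.empty((out_len, kernels.shape[2]))
    for t in range(out_len):
        window = x[t:t + k]                    # (kernel_size, emb_dim)
        # contract over window position and embedding dimension
        out[t] = np.tensordot(window, kernels, axes=([0, 1], [0, 1])) + bias
    return np.maximum(out, 0.0)                # ReLU


rng = np.random.default_rng(0)
seq_len, emb_dim, num_filters, kernel_size = 189, 300, 256, 3

x = rng.normal(size=(seq_len, emb_dim))        # one embedded message
kernels = rng.normal(size=(kernel_size, emb_dim, num_filters))
bias = np.zeros(num_filters)

conv_out = conv1d_relu(x, kernels, bias)
pooled = conv_out.max(axis=0)                  # global max pooling over time
print(conv_out.shape, pooled.shape)
```

Each filter slides over three consecutive word vectors at a time, and global max pooling keeps only its strongest response, so the dense layer sees one value per filter regardless of message length.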

In [None]:
# Compile
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

Note: the book's listing for this example is incomplete. As printed, the code above does not run because run_mode is never defined. Check the other folders in chapter 04 for the complete code.

### Training and evaluating the model

The dataset is imbalanced, with 747 instances of spam to 4827 instances of ham. We set class weights to indicate that an error on a spam SMS is eight times as expensive as an error on a ham SMS.

In [None]:
NUM_EPOCHS = 3
# data distribution is 4827 ham and 747 spam (total 5574), which
# works out to approx 87% ham and 13% spam, so we take reciprocals
# and this works out to each spam (1) item being
# approximately 8 times as important as each ham (0) message.
CLASS_WEIGHTS = { 0: 1, 1: 8 }
# train model
model.fit(train_dataset, epochs=NUM_EPOCHS,
          validation_data=val_dataset,
          class_weight=CLASS_WEIGHTS)
# evaluate against test set
labels, predictions = [], []
for Xtest, Ytest in test_dataset:
    Ytest_ = model.predict_on_batch(Xtest)
    # convert one-hot labels and softmax outputs to class ids
    ytest = np.argmax(Ytest, axis=1)
    ytest_ = np.argmax(Ytest_, axis=1)
    labels.extend(ytest.tolist())
    predictions.extend(ytest_.tolist())

print("test accuracy: {:.3f}".format(accuracy_score(labels, predictions)))
print("confusion matrix")
print(confusion_matrix(labels, predictions))
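
The class-weight reasoning can be reproduced with a few lines of arithmetic. The counts come from the text; everything else is plain division. Note that the exact ratio of the reciprocals is about 6.5, so the 1:8 weighting reflects rounding the percentages first (1/0.13 is about 7.7) and then rounding up.

```python
counts = {"ham": 4827, "spam": 747}
total = sum(counts.values())

# Share of each class in the dataset
fractions = {label: n / total for label, n in counts.items()}

# Reciprocal of the share: rarer classes get larger weights
reciprocals = {label: 1.0 / f for label, f in fractions.items()}

# How much more a spam example weighs than a ham example
ratio = reciprocals["spam"] / reciprocals["ham"]

print(total)
print({label: round(f, 2) for label, f in fractions.items()})
print({label: round(r, 2) for label, r in reciprocals.items()})
print(round(ratio, 2))
```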

### Running the spam detector

The three scenarios we want to look at are:
- letting the network learn the embedding for the task from scratch
- starting with a fixed external third-party embedding, where the embedding matrix is treated like a vectorizer to transform the sequence of integers into a sequence of vectors
- starting with an external third-party embedding that is further fine-tuned to the task during training
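
The three scenarios differ only in two switches on the embedding layer: whether it starts from pretrained weights and whether those weights are updated during training. The helper below is a hypothetical summary of that mapping. The mode names "scratch" and "vectorizer" appear in the classifier code, while "finetuning" is an assumed name for the third branch.

```python
def embedding_settings(run_mode):
    """Return (use_pretrained_weights, trainable) for a given run mode."""
    if run_mode == "scratch":
        return False, True     # random init, learned for the task
    if run_mode == "vectorizer":
        return True, False     # pretrained and frozen
    if run_mode == "finetuning":
        return True, True      # pretrained, then updated during training
    raise ValueError("unknown run mode: {}".format(run_mode))


for mode in ["scratch", "vectorizer", "finetuning"]:
    print(mode, embedding_settings(mode))
```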

## Neural embedding - not just for words

### Item2Vec
### node2vec

In [2]:
import gensim
import logging
import numpy as np
import os
import shutil
import tensorflow as tf
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)



Next, download the data from the UCI repository.

In [4]:
DATA_DIR = "./data"
UCI_DATA_URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/00371/NIPS_1987-2015.csv"

def download_and_read(url):
    local_file = url.split('/')[-1]
    p = tf.keras.utils.get_file(local_file, url, cache_dir=".")
    row_ids, col_ids, data = [], [], []
    rid = 0
    f = open(p, "r")

    for line in f:
        line = line.strip()
        if line.startswith("\"\","):
            # header
            continue

        # compute non-zero elements for current row
        counts = np.array([int(x) for x in line.split(',')[1:]])
        nz_col_ids = np.nonzero(counts)[0]
        nz_data = counts[nz_col_ids]
        nz_row_ids = np.repeat(rid, len(nz_col_ids))
        rid += 1

        # add data to big lists
        row_ids.extend(nz_row_ids.tolist())
        col_ids.extend(nz_col_ids.tolist())
        data.extend(nz_data.tolist())
    f.close()
    TD = csr_matrix((
        np.array(data), (
            np.array(row_ids), np.array(col_ids)
        )),
        shape=(rid, counts.shape[0]))

    return TD

# read data and convert to Term-Document matrix
TD = download_and_read(UCI_DATA_URL)

# compute undirected, unweighted edge matrix
E = TD.T * TD

#binarize
E[E > 0] = 1
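
The matrix product above turns the term-document counts into a graph. The dense NumPy sketch below shows the same two steps on a tiny made-up matrix, assuming (as the transpose suggests) that rows are terms and columns are documents. Two documents are connected whenever they share at least one term.

```python
import numpy as np

# Toy term-document counts: 4 terms (rows) x 3 documents (columns).
toy_TD = np.array([
    [2, 0, 1],   # term 0 appears in documents 0 and 2
    [0, 3, 0],   # term 1 appears in document 1 only
    [1, 0, 0],   # term 2 appears in document 0 only
    [0, 1, 0],   # term 3 appears in document 1 only
])

# Entry (i, j) is non-zero when documents i and j share a term.
toy_edges = toy_TD.T @ toy_TD

# Binarize: keep only whether an edge exists, not how strong it is.
toy_edges[toy_edges > 0] = 1
print(toy_edges)
```

The diagonal is also set to 1, because every document shares terms with itself; a random walk over this graph can therefore stay on the same vertex.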

Note this is a very slow process.

In [None]:
NUM_WALKS_PER_VERTEX = 32
MAX_PATH_LENGTH = 40
RESTART_PROB = 0.15
RANDOM_WALKS_FILE = os.path.join(DATA_DIR, "random-walks.txt")
def construct_random_walks(E, n, alpha, l, ofile):
    