# Word Embeddings in Julia

In Julia, the word embedding approach requires engaging a bit more directly with the underlying mechanisms, but this adds little complexity and doesn't compromise on speed.  The approach we'll take is as follows:
- Load and tokenize our data.
- Load some pre-trained word vectors (in the first section), and discuss training our own (in the second).
- For each token in our data, look up the word vectors, and represent each document as the mean of these vectors.
- Throw a small neural network at our data.
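
As a rough illustration of the lookup-and-average step, here is a minimal sketch on made-up data. The 3-dimensional `toy_vectors` table and the `toy_doc_vector` helper are hypothetical names used only for this illustration; the real pipeline below uses 300-dimensional GloVe vectors. As in the code further down, unknown words are assumed to contribute a zero vector.

```julia
# Toy 3-dimensional "embeddings" (made-up numbers, for illustration only).
toy_vectors = Dict(
    "arrived" => Float32[1, 0, 0],
    "broken"  => Float32[0, 1, 0],
    "legs"    => Float32[0, 0, 1],
)

# Hypothetical helper: average the word vectors of a tokenized document.
# Words missing from the table are treated as zero vectors.
function toy_doc_vector(tokens, vectors; dim=3)
    isempty(tokens) && return zeros(Float32, dim)
    looked_up = [get(vectors, t, zeros(Float32, dim)) for t in tokens]
    return sum(looked_up) ./ length(looked_up)
end

toy_tokens = split(lowercase("Arrived broken"))
toy_doc = toy_doc_vector(toy_tokens, toy_vectors)
```

Here `toy_doc` lands halfway between the vectors for "arrived" and "broken", which is the whole idea: a document is summarized by where its words sit, on average, in the embedding space.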

In [1]:
using Pkg
Pkg.activate(".")

[32m[1m  Activating[22m[39m project at `C:\Users\andersonh\Documents\UA Projects\LAK 2023\demos\julia`


In [2]:
# requirements
# Pkg.add("CSV")
# Pkg.add("CUDA")
# Pkg.add("DataFrames")
# Pkg.add("Embeddings")
# Pkg.add("Flux")
# Pkg.add("Pipe")
# Pkg.add("ProgressMeter")

In [3]:
# load the data
using CSV
using DataFrames

train = DataFrame(CSV.File("../../data/train.csv"))
test = DataFrame(CSV.File("../../data/test.csv"))
val = DataFrame(CSV.File("../../data/validation.csv"));

In [4]:
using Pipe # run `Pkg.add("Pipe")` if needed
using ProgressMeter

# very similar preprocessing to before, but without stemming
preprocess(s::String) :: Vector{String} = @pipe (
    s
    |> lowercase(_)
    |> replace(_, r"[^a-z]+" => " ")
    |> split(_)
    |> filter(x -> length(x) >= 3, _)
)

train_tokens = @showprogress "Preprocessing training data" [
     preprocess(i) for i in train[!, :review_body]
]
test_tokens = @showprogress "Preprocessing testing data" [
     preprocess(i) for i in test[!, :review_body]
]
val_tokens = @showprogress "Preprocessing validation data" [
     preprocess(i) for i in val[!, :review_body]
]

train_tokens[1]

Preprocessing training data 100%|████████████████████████| Time: 0:00:03
Preprocessing testing data 100%|█████████████████████████| Time: 0:00:00


81-element Vector{String}:
 "arrived"
 "broken"
 "manufacturer"
 "defect"
 "two"
 "the"
 "legs"
 "the"
 "base"
 "were"
 "not"
 "completely"
 "formed"
 ⋮
 "there"
 "aren"
 "missing"
 "structures"
 "and"
 "supports"
 "that"
 "don"
 "impede"
 "the"
 "assembly"
 "process"

In [5]:
# find and remove tokens that occur 5 or fewer times in the training data
function counter(it)
    counts = Dict()
    for i ∈ it
        counts[i] = get(counts, i, 0) + 1
    end
    return counts
end

whitelist = counter([tok for doc in train_tokens for tok in doc])
whitelist = Set(tok for (tok, count) ∈ whitelist if count > 5)

train_tokens = @showprogress [[tok for tok ∈ doc if tok ∈ whitelist] for doc in train_tokens]
test_tokens = @showprogress [[tok for tok ∈ doc if tok ∈ whitelist] for doc in test_tokens]
val_tokens = @showprogress [[tok for tok ∈ doc if tok ∈ whitelist] for doc in val_tokens];

Progress: 100%|█████████████████████████████████████████| Time: 0:00:00


# Load pre-trained word vectors with `Embeddings.jl`

`Embeddings.jl` provides a nice, simple interface to pre-trained word vectors.

In [6]:
# install
# Pkg.add("Embeddings")

In [7]:
using Embeddings

# downloads the vector files if needed
# The "4" specifies which of the GloVe embeddings to load; this loads
# the 300-dimensional ones.  Check the Embeddings.jl documentation for
# more information.
vectors = load_embeddings(GloVe{:en}, 4)

vectors

Embeddings.EmbeddingTable{Matrix{Float32}, Vector{String}}(Float32[0.04656 -0.25539 … 0.81451 0.429191; 0.21318 -0.25723 … -0.36221 -0.296897; … ; -0.20989 -0.12226 … 0.28408 0.32618; 0.053913 0.35499 … -0.17559 -0.0590532], ["the", ",", ".", "of", "to", "and", "in", "a", "\"", "'s"  …  "sigarms", "katuna", "aqm", "1.3775", "corythosaurus", "chanty", "kronik", "rolonda", "zsombor", "sandberger"])

The `EmbeddingTable` is a struct with two fields:
- `embeddings`: the table with one column per word, and one row per embedding dimension.
- `vocab`: a `Vector` of string names.  The $i^{th}$ string's vector is the $i^{th}$ column in the `embeddings` array.

We need to add one little mapping to convert a word into its corresponding vector.  Note that `Flux.jl`--which we'll use to build our small neural network--expects _one row per feature, one column per observation_, since Julia uses column-major ordering for arrays.
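
To make that layout concrete, the sketch below stacks a few made-up document vectors side by side with `reduce(hcat, ...)`, the same call used later in this notebook. The numbers and variable names are illustrative assumptions, not real data.

```julia
# Three made-up document vectors with 4 features each.
doc_vecs = [Float32[1, 2, 3, 4], Float32[5, 6, 7, 8], Float32[9, 10, 11, 12]]

# Stack them side by side: one row per feature, one column per document.
X = reduce(hcat, doc_vecs)
size(X)   # (4, 3): 4 features, 3 observations

# Column-major storage keeps each column (observation) contiguous in memory,
# so flattening the matrix walks through one document at a time.
first_doc = vec(X)[1:4]

# A mini-batch is just a slice of columns.
batch = X[:, 1:2]
size(batch)   # (4, 2)
```

With this layout, pulling one word's vector out of the embedding table or one batch out of the training matrix is a slice over columns.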

In [8]:
WORD_TO_IDX = Dict(reverse.(enumerate(vectors.vocab)))

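function get_vector(word, tok2id, embeddings)
    # out-of-vocabulary words map to a zero vector with the same
    # element type (Float32) as the embedding table
    if !haskey(tok2id, word)
        return zeros(eltype(embeddings.embeddings), size(embeddings.embeddings, 1))
    else
        return embeddings.embeddings[:, tok2id[word]]
    end
end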

get_vector("manufacturer", WORD_TO_IDX, vectors)

300-element Vector{Float32}:
  0.59205
  0.5055
 -0.19275
 -0.83702
 -0.20503
 -0.3296
 -0.20368
 -0.085202
 -0.27045
 -1.3407
  0.16294
 -0.37931
  0.30412
  ⋮
 -0.38281
  0.20347
  0.1666
 -0.25304
  0.33967
 -0.012803
 -0.11522
  0.63322
 -0.026877
  0.17706
  0.23072
  0.15622

In [9]:
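# represent a document as the mean of its word vectors
function doc2vec(doc, tok2id, embeddings)
    if length(doc) == 0
        # empty documents get a zero vector of the embeddings' element type
        return zeros(eltype(embeddings.embeddings), size(embeddings.embeddings, 1))
    end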
    return @pipe (
        [get_vector(tok, tok2id, embeddings) for tok ∈ doc]
        |> reduce(hcat, _)
        |> sum(_, dims=2) ./ size(_)[2]
    )
end

train_vectors = reduce(
    hcat,
    @showprogress [doc2vec(i, WORD_TO_IDX, vectors) for i ∈ train_tokens]
)
test_vectors = reduce(
    hcat,
    @showprogress [doc2vec(i, WORD_TO_IDX, vectors) for i ∈ test_tokens]
)
val_vectors = reduce(
    hcat,
    @showprogress [doc2vec(i, WORD_TO_IDX, vectors) for i ∈ val_tokens]
);

Progress: 100%|█████████████████████████████████████████| Time: 0:00:12
Progress: 100%|█████████████████████████████████████████| Time: 0:00:00
Progress: 100%|█████████████████████████████████████████| Time: 0:00:00


In [10]:
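# one-hot encode our y-values for cross-entropy loss
# (labels are star ratings 1-5, so each label indexes its own row;
# fixing n_classes keeps the shape right even if a split lacks a class)
function one_hot(labels; n_classes=5)
    encoded = zeros(Float32, n_classes, length(labels))
    for l ∈ 1:length(labels)
        encoded[labels[l], l] = 1
    end
    return encoded
end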
train_y = one_hot(train[!, :stars])
test_y = one_hot(test[!, :stars])
val_y = one_hot(val[!, :stars]);
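
To see what the one-hot targets are for, here is a hand-rolled sketch of softmax followed by cross-entropy on a single made-up example. `toy_softmax` and `toy_crossentropy` are hypothetical helpers written for illustration; they are not Flux's implementations, though for a single column they should give roughly the same numbers.

```julia
# Hypothetical helpers for illustration (not the Flux implementations).
function toy_softmax(z)
    e = exp.(z .- maximum(z))   # subtract the max for numerical stability
    return e ./ sum(e)
end
toy_crossentropy(p, y) = -sum(y .* log.(p))

scores = [2.0, 1.0, 0.0, 0.0, 0.0]   # made-up raw scores for the 5 star ratings
target = [1.0, 0.0, 0.0, 0.0, 0.0]   # one-hot target: a 1-star review

probs = toy_softmax(scores)
loss = toy_crossentropy(probs, target)

# Because the target is one-hot, only the true class contributes to the loss.
only_true_class = -log(probs[1])

# Loss of a model that spreads its probability evenly over the 5 classes.
uniform_loss = toy_crossentropy(fill(0.2, 5), target)
```

The uniform case works out to `log(5)`, about 1.609, which is a handy reference point: a validation loss near that value means the model is doing no better than guessing evenly across the five ratings.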

Now let's throw a small neural network at it using the `Flux.jl` library, which is (currently) Julia's primary neural network library.

In [11]:
# install if needed
# Pkg.add("Flux")
# Pkg.add("CUDA") # if you have a CUDA-compatible GPU
using Flux
using CUDA

In [12]:
# get our data into DataLoaders, which wrap the batching logic for us.
training_data = Flux.DataLoader(
    (train_vectors, train_y) |> gpu,
    batchsize=256,
    shuffle=true,
);

In [13]:
# our network
model = Chain(
    BatchNorm(size(vectors.embeddings)[1]),
    Dense(size(vectors.embeddings)[1] => 256, relu),
    Dense(256 => 256, relu),
    Dense(256 => 256, relu),
    Dense(256 => 5),
    softmax,
)
model = gpu(model)

# our optimizer
optim = Flux.setup(Flux.Adam(0.01), model);

# wrap the model evaluation logic: returns (loss, accuracy)
function evaluate_model(model, x, y, loss_fn)
    preds = cpu(model(gpu(x)))
    hard_preds = [i.I[1] for i ∈ argmax(preds, dims=1)]
    y_ = [i.I[1] for i ∈ argmax(y, dims=1)]
    acc = sum(y_ .== hard_preds) / size(y)[2]
    return loss_fn(preds, y), acc
end
    
val_loss, acc = evaluate_model(model, val_vectors, val_y, Flux.crossentropy)
println("Before training: val_loss=$val_loss acc=$acc")
for epoch in 1:5
    @showprogress "Epoch $epoch training loop" for (x, y) in training_data
        loss, grads = Flux.withgradient(model) do m
            # Evaluate model and loss inside gradient context:
            y_hat = m(x)
            Flux.crossentropy(y_hat, y)
        end
        Flux.update!(optim, model, grads[1])
    end
    val_loss, acc = evaluate_model(model, val_vectors, val_y, Flux.crossentropy)
    println("After epoch $epoch: val_loss=$val_loss acc=$acc")
end

evaluate_model(model, test_vectors, test_y, Flux.crossentropy)

Before training: val_loss=1.6120706644773484 acc=0.2154


Epoch 1 training loop 100%|██████████████████████████████| Time: 0:00:47


After epoch 1: val_loss=1.238588478843961 acc=0.458


Epoch 2 training loop 100%|██████████████████████████████| Time: 0:00:01


After epoch 2: val_loss=1.2207526426566764 acc=0.4642


Epoch 3 training loop 100%|██████████████████████████████| Time: 0:00:02


After epoch 3: val_loss=1.2161510502445512 acc=0.4668


Epoch 4 training loop 100%|██████████████████████████████| Time: 0:00:02


After epoch 4: val_loss=1.2101987258592155 acc=0.4714


Epoch 5 training loop 100%|██████████████████████████████| Time: 0:00:02


After epoch 5: val_loss=1.2141636575800134 acc=0.477


(1.1968655926110106, 0.483)

# Train your own word embeddings in Julia

Sadly, there doesn't seem to be a good library for training your own word embeddings in Julia right now, but you can always train your own using Flux!  You could re-implement Word2Vec, or just one-hot encode your words and let the model learn task-specific embeddings.  Both of those require a lot more code than I'm going to show here, though.
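
As a small illustration of the one-hot option mentioned above (not a training loop), the sketch below shows why one-hot encoding words and multiplying by a weight matrix amounts to an embedding lookup: the product simply selects one column of the matrix. The vocabulary, dimensions, and names here are assumptions made up for this sketch; in a real model the matrix `W` would be a trainable parameter updated by the optimizer.

```julia
using Random

toy_vocab = ["arrived", "broken", "legs"]
toy_index = Dict(word => i for (i, word) in enumerate(toy_vocab))

# A randomly initialized embedding matrix: 4 dimensions, one column per word.
rng = MersenneTwister(42)
W = randn(rng, Float32, 4, length(toy_vocab))

# Hypothetical helper: one-hot encode a word as a vector over the vocabulary.
function toy_one_hot(word, index)
    v = zeros(Float32, length(index))
    v[index[word]] = 1
    return v
end

# Multiplying by a one-hot vector picks out the matching column of W.
via_matmul = W * toy_one_hot("broken", toy_index)
via_lookup = W[:, toy_index["broken"]]
via_matmul ≈ via_lookup
```

Since the two are equivalent, gradients that flow back through the multiplication only touch the columns of the words that actually appeared, which is how the model would come to learn task-specific vectors.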