<img src="AIAYN_Draft_230426_85.png" alt="Eq (1) of Attention is All You Need" />

The 2017 "Attention is All You Need" paper can be found here: https://arxiv.org/abs/1706.03762

The aim of this lightning talk is to start from equation (1) of that paper and show that, with 1-hot encodings, it is equivalent to a Python dict lookup. Over 26 short code cells, we'll transform the familiar Python dict lookup, step by step, into the more mysterious equation (1).
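
For reference, equation (1) of the paper can be written out in LaTeX as follows. Here Q, K and V are the matrices of queries, keys and values, and d_k is the dimension of the keys:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V
```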

**Recommended:** come to a talk on **Tuesday, September 20th, at George Mason University's Virginia Square campus (Van Metre Hall Auditorium - room 134)** for a more complete discussion of **Transformer architectures.**

https://www.meetup.com/data-science-dc/events/295247462/

<img src="AIAYN_Draft_230426_86.png" alt="A simple python dict" />

In [None]:
import numpy as np

In [None]:
# Helper to show results in human readable form:
def print_result(human_readable, result):
    print(human_readable + " = '" + str(result) + "'")

In [None]:
# set query, q, to "attitude"
q = "attitude"
print_result("q", q)

In [None]:
# Make dict with keys and values (from 10 total tokens)
KVdict = {
             "cat" : "jinx",
        "medicine" : "meh",
           "taste" : "bitter",
        "attitude" : "yes",
     "disposition" : "fickle"}
KVdict

In [None]:
# Standard Python dict q, K, V result
print_result("KVdict[q]", KVdict[q])

<img src="AIAYN_Draft_230426_87.png" alt="1-hot vector mappings of keys and queries" />

*Why 1-hots? Rebecca Bilbro, Ph.D. has a minute on that: ->*  https://www.linkedin.com/posts/activity-7080150156112773120-VFD9

All K1map does is map a key token to its 1-hot, and all V1map does is map a 1-hot back to its value token.

In [None]:
# Make dict of mappings of tokens to 1-hot vectors (tuples)
Vec1hmap = {
            "cat" : (1,0,0,0,0,0,0,0,0,0),
       "medicine" : (0,1,0,0,0,0,0,0,0,0),
          "taste" : (0,0,1,0,0,0,0,0,0,0),
       "attitude" : (0,0,0,1,0,0,0,0,0,0),
    "disposition" : (0,0,0,0,1,0,0,0,0,0),
           "jinx" : (0,0,0,0,0,1,0,0,0,0),
            "meh" : (0,0,0,0,0,0,1,0,0,0),
         "bitter" : (0,0,0,0,0,0,0,1,0,0),
            "yes" : (0,0,0,0,0,0,0,0,1,0),
         "fickle" : (0,0,0,0,0,0,0,0,0,1)}
Vec1hmap

Vec1hmap just maps each key or value token to its 1-hot vector.
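
The same mapping could also be generated instead of typed by hand. The following is a minimal sketch, assuming the tokens are listed in the same order as above; the names `tokens`, `eye` and `vec1hmap_sketch` are hypothetical and not part of the notebook:

```python
import numpy as np

# Hypothetical token list, in the same order as the hand-written Vec1hmap
tokens = ["cat", "medicine", "taste", "attitude", "disposition",
          "jinx", "meh", "bitter", "yes", "fickle"]

# Row i of the identity matrix is the 1-hot vector for token i
eye = np.eye(len(tokens), dtype=int)

# Store each row as a tuple of plain ints so it can also serve as a dict key
vec1hmap_sketch = {tok: tuple(int(x) for x in eye[i])
                   for i, tok in enumerate(tokens)}

print(vec1hmap_sketch["attitude"])  # (0, 0, 0, 1, 0, 0, 0, 0, 0, 0)
```

Using tuples rather than arrays keeps the vectors hashable, which the later cells rely on when 1-hots are used as dict keys.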

In [None]:
# Make dict mapping 1-hot vectors (immutable tuples, so they are
# hashable) back to tokens
Vec2token = {
    (1,0,0,0,0,0,0,0,0,0) : "cat",
    (0,1,0,0,0,0,0,0,0,0) : "medicine",
    (0,0,1,0,0,0,0,0,0,0) : "taste",
    (0,0,0,1,0,0,0,0,0,0) : "attitude",
    (0,0,0,0,1,0,0,0,0,0) : "disposition",
    (0,0,0,0,0,1,0,0,0,0) : "jinx",
    (0,0,0,0,0,0,1,0,0,0) : "meh",
    (0,0,0,0,0,0,0,1,0,0) : "bitter",
    (0,0,0,0,0,0,0,0,1,0) : "yes",
    (0,0,0,0,0,0,0,0,0,1) : "fickle"}
# Equivalently:
# Vec2token = dict((v,k) for k,v in Vec1hmap.items())
Vec2token

Vec2token just maps each 1-hot vector back to its token.

In [None]:
# K1map is a mapping from each key to its 1-hot vector
K1map = {        "cat" : Vec1hmap["cat"],
            "medicine" : Vec1hmap["medicine"],
               "taste" : Vec1hmap["taste"],
            "attitude" : Vec1hmap["attitude"],
         "disposition" : Vec1hmap["disposition"],
        }
K1map

In [None]:
# V1map is a mapping from each value's 1-hot vector back to its value token
V1map = {  Vec1hmap["jinx"] : "jinx",
            Vec1hmap["meh"] : "meh",
         Vec1hmap["bitter"] : "bitter",
            Vec1hmap["yes"] : "yes",
         Vec1hmap["fickle"] : "fickle",
        }
V1map

<img src="AIAYN_Draft_230426_88.png" alt="1-hot vector mappings of keys and queries" />

In [None]:
# The 1-hot vector, q1h, for the query "attitude" is (0,0,0,1,0,0,0,0,0,0)
q1h = K1map[q]
print_result("q1h", q1h)

In [None]:
# Make dict with 1-hot vector mapped version of each key
K1Vdict = {
            K1map["cat"] : "jinx",
       K1map["medicine"] : "meh",
          K1map["taste"] : "bitter",
       K1map["attitude"] : "yes",
    K1map["disposition"] : "fickle"}
K1Vdict

In [None]:
# Standard Python dict lookup where both q and the keys K are replaced
# by their 1-hot mappings (q1h and K1h); the result matches KVdict[q]
print_result("KVdict[q]", KVdict[q])
print_result("K1Vdict[q1h]", K1Vdict[q1h])

<img src="AIAYN_Draft_230426_89.png" alt="Finding the value where the dot product between query and key is maximal" />

Note that the only key with a nonzero dot product with q1h is the key that matches the query, so the argmax over the dot products selects that key's value.
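
To make that concrete, here is a small self-contained sketch of the dot products themselves. It assumes the five keys occupy the first five 1-hot positions, as in the mapping above; the names `keys_1h`, `query_1h` and `scores` are hypothetical:

```python
import numpy as np

# 1-hot keys in dict order: cat, medicine, taste, attitude, disposition
keys_1h = np.eye(10, dtype=int)[:5]
# 1-hot query for "attitude" (position 3)
query_1h = np.eye(10, dtype=int)[3]

# One dot product per key: 1 where the key equals the query, 0 elsewhere
scores = keys_1h @ query_1h
print(scores)                   # [0 0 0 1 0]
print(int(np.argmax(scores)))   # 3
```

Because distinct 1-hots are orthogonal, the vector of dot products is itself a 1-hot over the keys, which is what lets argmax act like an exact-match lookup.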

In [None]:
# K1h is a list of 1-hots for the key tokens
K1h = list(K1Vdict.keys())
# Given K1Vdict 1-hot representations of keys, select value where dot
# product of q1h with K1h is maximal
K1Vmaxdot = K1Vdict[
    K1h[
        np.argmax(np.dot(q1h, np.array(K1h).T))
    ]
]
print_result("KVdict[q]", KVdict[q])
print_result("K1Vdict[argmax_i<q1h,k1h_i>]", K1Vmaxdot)

<img src="AIAYN_Draft_230426_90.png" alt="Mapping both keys and values to 1-hots" />

The only difference from the last example is that the values are now stored as 1-hots too, so the 1-hot that the lookup returns is mapped back to its token value via V1map.

In [None]:
# Make dict with 1-hot vector mapped versions of keys and values
K1V1dict = {
            K1map["cat"] : Vec1hmap["jinx"],
       K1map["medicine"] : Vec1hmap["meh"],
          K1map["taste"] : Vec1hmap["bitter"],
       K1map["attitude"] : Vec1hmap["yes"],
    K1map["disposition"] : Vec1hmap["fickle"]}
K1V1dict

In [None]:
# Given K1V1dict 1-hot representations of keys & values, select the
# v1h whose key has the max dot product with q1h, and show the
# token corresponding to that v1h with V1map
K1V1maxdot = V1map[
    K1V1dict[
        K1h[
            np.argmax(np.dot(q1h, np.array(K1h).T))
        ]
    ]
]
print_result("KVdict[q]", KVdict[q])
print_result("V1map[K1V1dict[argmax_i<q1h,k1h_i>]]", K1V1maxdot)

<img src="AIAYN_Draft_230426_91.png" alt="Weighted dot products of query vector with values vectors" />

In [None]:
# Extract matrix, K (1-hots)
K = np.array(list(K1V1dict.keys()))
K

In [None]:
# Extract matrix, V (1-hots)
V = np.array(list(K1V1dict.values()))
V

In [None]:
# Given K1V1dict 1-hot representations of keys and values, 
# *weight* V by dot product of q1h with K; map 1-hot value
# back to token value via V1map token lookup
KVweighted = V1map[
    tuple(
        np.matmul(np.dot(q1h, K.T),V)
    )
]
print_result("KVdict[q]", KVdict[q])
print_result("V1map[sum_i <q1h,k1h_i> v1h_i]", KVweighted)
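
Why the weighted sum gives the same answer as the argmax can be sketched on its own: the weights are a 1-hot over the keys, so summing the value rows with those weights just picks out one row of V. This is an illustrative sketch only; `K_sketch`, `V_sketch` and `q_sketch` are hypothetical stand-ins for K, V and q1h:

```python
import numpy as np

eye = np.eye(10, dtype=int)
K_sketch = eye[:5]   # keys: 1-hot positions 0..4
V_sketch = eye[5:]   # values: 1-hot positions 5..9, row i paired with key i
q_sketch = eye[3]    # query "attitude"

# Weights: dot product of the query with every key, i.e. [0 0 0 1 0]
weights = q_sketch @ K_sketch.T

# Explicit weighted sum over value rows versus the matrix product
weighted_sum = sum(w * v for w, v in zip(weights, V_sketch))
matrix_form = weights @ V_sketch

print(weighted_sum)
print(matrix_form)
```

Both printed vectors are row 3 of `V_sketch`, the 1-hot for "yes". With non-orthogonal keys the weights would no longer be a clean 1-hot, and the sum would blend several value rows instead.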

<img src="AIAYN_Draft_230426_92.png" alt="Scaled weighted dot products of query vector with and values matrix" />

In [None]:
# Normalizer, d_k, for no softmax and orthonormal vectors
d_k = 1.0
print_result("d_k", d_k)

In [None]:
# Given K1V1dict 1-hot representations of keys and values,
# compute the 1-hot encoding of the value with the AIAYN matrix
# form of scaled dot-product attention (no softmax yet); map the
# 1-hot value back to its token via V1map
K1V1maxdot = V1map[
    tuple(
        np.matmul(
            np.matmul(q1h, K.T)/np.sqrt(d_k),
            V)
    )
]
print_result("KVdict[q]", KVdict[q])
print_result("V1map[((q1h)(K^T) / sqrt(d_k)) V]", K1V1maxdot)

<img src="AIAYN_Draft_230426_93.png" alt="Scaled dot attention with query, keys and values matrices -- full equivalence to AIAYN" />

First, we'll do the computation without the softmax (with 1-hots), and then with it, adjusting d_k.

In [None]:
# Query matrix Q: one 1-hot row per key token in KVdict
Q = np.array([Vec1hmap[k] for k in KVdict])
Q

In [None]:
# Given K1V1dict 1-hot representations of keys and values,
# compute the 1-hot encoding of every value at once with the AIAYN
# matrix form of attention (no softmax yet); each row of Att1h
# is the 1-hot value for the matching row of Q
Att1h = np.matmul(np.matmul(Q, K.transpose())/np.sqrt(d_k),V)
print_result("Att1h", Att1h)

In [None]:
print("For each query, the corresponding Attention_1h(Q,K,V) \
\n maps to (query, value):")
for iv, k in enumerate(list(Q)): # for each query 1-hot in Q
    print((Vec2token[tuple(k)], Vec2token[tuple(Att1h[iv])]))
print("This was the original mapping:")
KVdict

In [None]:
def softmax(x):
    # Shift by max(x) for numerical stability (result is unchanged)
    e = np.exp(x - np.max(x))
    return e / e.sum()

In [None]:
# With the softmax, the weights are only approximately 1-hot, so make
# d_k small (say 0.001) to sharpen them, and round the result
d_ksm = 0.001
# Compute the softmax argument, QKd
QKd = np.matmul(Q, K.transpose())/np.sqrt(d_ksm)
# Compute the Attention with the softmax, rounded to 10 decimal places
Att = np.around(
    np.array([np.matmul(np.array(softmax(r)),V) for r in QKd]), 
    decimals=10)
print_result("Att", Att)
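
The reason for the small d_k can be sketched separately. In the paper, d_k is the dimension of the keys; this notebook instead treats it as a free scaling knob, and shrinking it pushes the softmax output toward a 1-hot. The d_k values and the names `softmax_sketch`, `scores` and `weights_by_dk` below are hypothetical choices for illustration:

```python
import numpy as np

def softmax_sketch(x):
    # Shift by the max for numerical stability; the result is unchanged
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Dot products of one 1-hot query against five 1-hot keys
scores = np.array([1.0, 0.0, 0.0, 0.0, 0.0])

# Softmax weights for progressively smaller d_k (hypothetical values)
weights_by_dk = {
    d_k_value: softmax_sketch(scores / np.sqrt(d_k_value))
    for d_k_value in (1.0, 0.1, 0.001)
}
for d_k_value, weights in weights_by_dk.items():
    print(d_k_value, np.round(weights, 4))
```

At d_k = 1.0 the matching key gets less than half of the total weight, while at d_k = 0.001 its weight is indistinguishable from 1 after rounding, which is why the rounded result can be looked up in Vec2token.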

In [None]:
print("For each query, the corresponding Attention(Q,K,V) \
\n maps to (query, value):")
for iv, k in enumerate(list(Q)): # for each query 1-hot in Q
    print((Vec2token[tuple(k)], Vec2token[tuple(Att[iv])]))
print("This was the original mapping:")
KVdict

<img src="AIAYN_Draft_230426_94.png" alt="Full equivalence to AIAYN example Q.E.D." />

<img src="AIAYN_Draft_230426_85.png" alt="Eq (1) of Attention is All You Need" />