In [1]:
text = '''Machine learning is the study of computer algorithms that \
improve automatically through experience. It is seen as a \
subset of artificial intelligence. Machine learning algorithms \
build a mathematical model based on sample data, known as \
training data, in order to make predictions or decisions without \
being explicitly programmed to do so. Machine learning algorithms \
are used in a wide variety of applications, such as email filtering \
and computer vision, where it is difficult or infeasible to develop \
conventional algorithms to perform the needed tasks.'''

In [2]:
import re

### Converting text corpus into tokens

In [3]:
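def tokenize(text):
    # A token is a run of word characters, apostrophes, or carets containing at least one letter.
    # Inside [...], a `^` that is not in first position is a literal caret, not a negation.
    pattern = re.compile(r'[A-Za-z]+[\w^\']*|[\w^\']*[A-Za-z]+[\w^\']*')
    return pattern.findall(text.lower())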
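
To see what the pattern keeps and drops, here is a small sketch that runs the same `tokenize` on a made-up sentence (the `sample` string is an assumption, not part of the corpus). Contractions stay whole, mixed tokens like `2nd` are kept, and runs with no letters such as `42` are dropped.

```python
import re

def tokenize(text):
    # Same pattern as the tokenize cell above
    pattern = re.compile(r'[A-Za-z]+[\w^\']*|[\w^\']*[A-Za-z]+[\w^\']*')
    return pattern.findall(text.lower())

# Hypothetical sample sentence, chosen to exercise apostrophes and digits
sample = "Don't stop; it's 2nd-best, 42 times!"
sample_tokens = tokenize(sample)
print(sample_tokens)
# ["don't", 'stop', "it's", '2nd', 'best', 'times']
```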

In [4]:
tokens = tokenize(text)

In [5]:
tokens

['machine',
 'learning',
 'is',
 'the',
 'study',
 'of',
 'computer',
 'algorithms',
 'that',
 'improve',
 'automatically',
 'through',
 'experience',
 'it',
 'is',
 'seen',
 'as',
 'a',
 'subset',
 'of',
 'artificial',
 'intelligence',
 'machine',
 'learning',
 'algorithms',
 'build',
 'a',
 'mathematical',
 'model',
 'based',
 'on',
 'sample',
 'data',
 'known',
 'as',
 'training',
 'data',
 'in',
 'order',
 'to',
 'make',
 'predictions',
 'or',
 'decisions',
 'without',
 'being',
 'explicitly',
 'programmed',
 'to',
 'do',
 'so',
 'machine',
 'learning',
 'algorithms',
 'are',
 'used',
 'in',
 'a',
 'wide',
 'variety',
 'of',
 'applications',
 'such',
 'as',
 'email',
 'filtering',
 'and',
 'computer',
 'vision',
 'where',
 'it',
 'is',
 'difficult',
 'or',
 'infeasible',
 'to',
 'develop',
 'conventional',
 'algorithms',
 'to',
 'perform',
 'the',
 'needed',
 'tasks']

In [64]:
len(set(tokens))

60

### Creating a mapping from tokens to indices

In [6]:
def mapping(tokens):
    word_to_id = {}
    id_to_word = {}

    # Note: set iteration order is arbitrary, so the ids can differ between runs
    for i, token in enumerate(set(tokens)):
        word_to_id[token] = i
        id_to_word[i] = token

    return word_to_id, id_to_word

In [26]:
word_to_id, id_to_word = mapping(tokens)

In [27]:
word_to_id

{'study': 0,
 'seen': 1,
 'so': 2,
 'conventional': 3,
 'through': 4,
 'or': 5,
 'without': 6,
 'model': 7,
 'wide': 8,
 'perform': 9,
 'to': 10,
 'tasks': 11,
 'where': 12,
 'difficult': 13,
 'is': 14,
 'experience': 15,
 'develop': 16,
 'data': 17,
 'algorithms': 18,
 'automatically': 19,
 'build': 20,
 'intelligence': 21,
 'mathematical': 22,
 'a': 23,
 'variety': 24,
 'it': 25,
 'needed': 26,
 'based': 27,
 'known': 28,
 'predictions': 29,
 'such': 30,
 'the': 31,
 'learning': 32,
 'being': 33,
 'artificial': 34,
 'training': 35,
 'explicitly': 36,
 'and': 37,
 'on': 38,
 'of': 39,
 'machine': 40,
 'computer': 41,
 'make': 42,
 'that': 43,
 'infeasible': 44,
 'email': 45,
 'applications': 46,
 'order': 47,
 'programmed': 48,
 'do': 49,
 'sample': 50,
 'filtering': 51,
 'decisions': 52,
 'used': 53,
 'as': 54,
 'subset': 55,
 'improve': 56,
 'vision': 57,
 'are': 58,
 'in': 59}
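
The ids above come from iterating over a `set`, so they are not guaranteed to be the same from one Python session to the next. If reproducible ids matter, one option is to sort the vocab first. This is only a sketch: `mapping_sorted` and `demo_tokens` are hypothetical names, not part of the notebook.

```python
def mapping_sorted(tokens):
    # Sort the unique tokens so every run assigns the same id to the same word
    vocab = sorted(set(tokens))
    word_to_id = {token: i for i, token in enumerate(vocab)}
    id_to_word = {i: token for token, i in word_to_id.items()}
    return word_to_id, id_to_word

# Hypothetical toy input, a prefix of the corpus with repeats
demo_tokens = ['machine', 'learning', 'is', 'the', 'study', 'of', 'machine', 'learning']
demo_word_to_id, demo_id_to_word = mapping_sorted(demo_tokens)
print(demo_word_to_id)
# {'is': 0, 'learning': 1, 'machine': 2, 'of': 3, 'study': 4, 'the': 5}
```

The trade-off is a sort over the vocab, which is negligible at this size.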

### Building the one-hot encoded (1HE) matrices for inputs and values

In [8]:
import numpy as np

np.random.seed(42)

Function for concatenation

`yield from` starts an inner loop over each iterable and yields that iterable's items one by one. The running time is therefore O(m × n), where m is the number of iterables and n is the number of items in each, or more simply O(T), where T is the total number of items across all iterables.

Think of `concat` as a for loop nested inside a for loop that iterates through a list of lists and processes each element exactly once.


In [12]:
def concat(*iterables):
    for iterable in iterables:
        yield from iterable

In [21]:
list1 = [1, 2, 3]
list2 = [4, 5, 6]
list3 = [7, 8, 9]

# Using the concat function to combine the lists
combined = concat(list1, list2, list3)

# Printing all elements in the combined sequence
for element in combined:
    print(element)


1
2
3
4
5
6
7
8
9
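
As a cross-check, the standard library's `itertools.chain` does the same job, and both are lazy: nothing is produced until an item is requested. A short sketch comparing the two (the `lazy`, `first`, and `rest` names are just for illustration):

```python
from itertools import chain

def concat(*iterables):
    for iterable in iterables:
        yield from iterable

list1 = [1, 2, 3]
list2 = [4, 5, 6]
list3 = [7, 8, 9]

# Both produce the same sequence of elements
from_concat = list(concat(list1, list2, list3))
from_chain = list(chain(list1, list2, list3))
print(from_concat == from_chain)  # True

# concat returns a generator, so items are produced one at a time on demand
lazy = concat(range(3), "ab")
first = next(lazy)   # 0
rest = list(lazy)    # [1, 2, 'a', 'b']
print(first, rest)
```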


Function for 1HE

In [25]:
def one_hot_encode(id,vocab_size):
    res = [0]*vocab_size
    res[id] = 1
    return res
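
For a whole batch of ids, indexing into an identity matrix gives the same rows without a Python loop. This is a sketch of an alternative, not what the notebook uses; `one_hot_rows` is a hypothetical helper.

```python
import numpy as np

def one_hot_encode(id, vocab_size):
    res = [0] * vocab_size
    res[id] = 1
    return res

def one_hot_rows(ids, vocab_size):
    # Row i of the identity matrix is the one-hot vector for id i
    return np.eye(vocab_size, dtype=int)[ids]

ids = [2, 0, 3]
batch = one_hot_rows(ids, 4)
print(batch)
# [[0 0 1 0]
#  [1 0 0 0]
#  [0 0 0 1]]
```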

In [52]:
n_tokens = len(tokens)
n_tokens

84

#### What are we doing below?
- We create two arrays of one-hot encoded (1HE) vectors. The first holds the 1HE vector for the vocab index of each input token; the second holds the 1HE vector for the vocab index of each value (a token inside the window around that input).

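#### How do we do this?
- We iterate through all our tokens first. For each token, we walk over the positions inside its window, skip the token's own position, and append one (input, value) pair of 1HE vectors for every remaining position.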

#### Things to keep in mind
- A token's position in the list of tokens is not the same as its position in the vocab list.

In [100]:
def generate_training_data(tokens,word_to_id,window_size):

    X = []
    y = []
    token_len = len(tokens)
    vocab_size = len(word_to_id)
    for index in range(token_len):
        ## Lazily iterate over position `index` and the positions within its window
        window = concat(range(max(0,index-window_size),index),range(index,min(token_len,index+window_size+1)))

        for value_index in window:
            if index==value_index:
                ## Skip the token's own position because a token can't be its own value
                ## Values are the tokens in the window before/after the input, not the input itself
                continue
            # X holds the inputs, y holds the values
            # With window_size=2 and 84 tokens:
            # the first and last tokens have only 2 values each,
            # the second and penultimate tokens have 3 values each,
            # and the remaining 80 tokens have 4 values each, i.e., 2 before and 2 after.
            # These add up to 2*2 + 3*2 + 4*80 = 330, which is shape[0] of X and y
            # shape[1] of X and y is 60, the size of our vocab
            X.append(one_hot_encode(word_to_id[tokens[index]],vocab_size))
            y.append(one_hot_encode(word_to_id[tokens[value_index]],vocab_size))

    return np.asarray(X),np.asarray(y)


In [94]:
X, y = generate_training_data(tokens,word_to_id,2)
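
To make the windowing concrete, here is a sketch that builds the same (input, value) pairs as plain tokens instead of 1HE vectors, and checks the 330 count from the comments. `context_pairs` is a hypothetical helper that mirrors the loop in `generate_training_data`; it is not part of the notebook.

```python
def context_pairs(tokens, window_size):
    # Pair every token with each neighbour inside its window, skipping itself
    pairs = []
    for index, center in enumerate(tokens):
        start = max(0, index - window_size)
        stop = min(len(tokens), index + window_size + 1)
        for value_index in range(start, stop):
            if value_index != index:
                pairs.append((center, tokens[value_index]))
    return pairs

toy_pairs = context_pairs(['a', 'b', 'c', 'd'], 1)
print(toy_pairs)
# [('a', 'b'), ('b', 'a'), ('b', 'c'), ('c', 'b'), ('c', 'd'), ('d', 'c')]

# 84 tokens with window_size=2 gives 2*2 + 3*2 + 4*80 pairs
n_pairs = len(context_pairs(list(range(84)), 2))
print(n_pairs)  # 330
```

With the 60-word vocab, `X.shape` and `y.shape` should therefore both be `(330, 60)`.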

### The Embedding Model

What's happening in this neural network and its two layers?

- We feed in the 1HE matrix of our input tokens, with each row being one token's 1HE representation.

**First layer**: Multiplying by the first weight matrix is what produces our embedding matrix.
- The 1HE matrix is matmul'd with a weight matrix, which converts the sparse 1HE matrix into a dense embedding matrix where each row is that token's embedding vector.
- Essentially, this first weight matrix is an embedding look-up table, where each row is a token's embedding vector and the number of columns is the dimension of the embedding space.

**Second layer**: Why do we multiply by a second weight matrix?
- This multiplication converts our embedding matrix into a matrix of logits, with one logit per vocab token scoring how related that token is to the input. Let's call this output matrix B.
- When we then apply softmax to B, we get our final output matrix, where each row contains one probability per vocab token, indicating how contextually related that token is to our input token.
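
The two layers described above can be sketched in a few lines of NumPy. This is a minimal illustration under assumptions, not the notebook's own forward pass: the tiny sizes, the `softmax` helper, `X_toy`, and the `rng` seed are all made up for the example.

```python
import numpy as np

def softmax(logits):
    # Subtract the row max for numerical stability, then normalise each row
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
vocab_size, n_embedding = 5, 3            # toy sizes, assumed for illustration
w1 = rng.standard_normal((vocab_size, n_embedding))
w2 = rng.standard_normal((n_embedding, vocab_size))

# 1HE rows for the tokens with ids 1 and 3
X_toy = np.eye(vocab_size)[[1, 3]]

embeddings = X_toy @ w1     # first layer: picks rows 1 and 3 of w1
logits = embeddings @ w2    # second layer: the matrix B from the text
probs = softmax(logits)     # each row sums to 1

print(np.allclose(embeddings, w1[[1, 3]]))  # True
print(probs.shape)                          # (2, 5)
```

The first check shows why `w1` acts as a look-up table: multiplying a 1HE row by `w1` simply selects the matching row.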

##### Initializing the network

In [None]:
def init_network(vocab_size,n_embedding):

    model = {
        'w1': np.random.randn(vocab_size,n_embedding),
        'w2': np.random.randn(n_embedding,vocab_size)
    }