# Lab2.3 Machine learning using embeddings

Copyright: Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

In this notebook, you are going to use word embeddings instead of the one-hot-encoding of words. Word embeddings have many advantages:

* they capture similarities across words that can be learned from massive amounts of text data without annotation
* machine learning can easily exploit similarity because the embeddings are also represented as vectors
* the word embedding vectors are much smaller (100 up to 500 dimensions) and more dense than one-hot-encodings, which results in more efficient and compact models that also generalize better.

At the end of this notebook, you should have learned:

* how to replace the words in your training set by their embeddings
* how to train a classifier enriched with embeddings
* how to represent the words for any unseen text as embeddings
* how to add embeddings to our NERC system
* how to work with some popular data sets for NERC with which such embeddings can be combined



## 1 Quick introduction to embeddings

Extracting features manually can get us a long way. In addition to lemma and part-of-speech, people have used other information: features of the previous words (on the left) or the next words (on the right), whether the current word starts with a capital, whether it is an abbreviation, etc.

A recent alternative way to create a 'semantic' representation of a word is by word embeddings: mapping words (or phrases) from the vocabulary to vectors of real numbers. Conceptually it involves a mathematical embedding from a space with many dimensions per word to a continuous vector space with a much lower dimension. For this reason, they are called dense representations.
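To make the contrast between sparse and dense representations concrete, here is a minimal sketch. The four-word vocabulary and the 3-dimensional dense values are made up for illustration; they are not taken from any trained model.

```python
import numpy as np

# Toy vocabulary; the words and the dense values below are made up for illustration.
vocabulary = ["cat", "tiger", "Germany", "beer"]

def one_hot(word, vocab):
    """Return a vector with a single 1 at the position of `word` in `vocab`."""
    vector = np.zeros(len(vocab))
    vector[vocab.index(word)] = 1.0
    return vector

# Hypothetical 3-dimensional dense vectors (real embeddings have 100-500 dimensions).
dense = {
    "cat": np.array([0.9, 0.8, 0.1]),
    "tiger": np.array([0.8, 0.9, 0.2]),
    "Germany": np.array([0.1, 0.0, 0.9]),
    "beer": np.array([0.2, 0.1, 0.3]),
}

cat_one_hot = one_hot("cat", vocabulary)
tiger_one_hot = one_hot("tiger", vocabulary)

# One-hot vectors of different words never overlap, so their dot product is 0.
one_hot_overlap = float(cat_one_hot @ tiger_one_hot)
# Dense vectors can overlap, which lets a model see that two words are related.
dense_overlap = float(dense["cat"] @ dense["tiger"])

print(len(cat_one_hot), len(dense["cat"]))
print(one_hot_overlap, dense_overlap)
```

A one-hot vector grows with the vocabulary (one position per word), whereas the dense vector keeps a fixed, small size regardless of how many words there are.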

In linguistics, word embeddings were discussed in the research area of distributional semantics. The idea is to quantify and categorize semantic similarities between linguistic items based on their distributional properties in large samples of language data. The underlying notion is that "a word is characterized by the company it keeps" (Firth). Embeddings are however the weights in the hidden layer of a neural network that is trained to predict the contexts rather than representing the context in a vector directly.
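The idea of "representing the context in a vector directly" can be sketched with simple co-occurrence counts. This is a count-based illustration only, using a made-up three-sentence corpus and a hypothetical helper `context_counts`; it is not how word2vec is trained, since word2vec learns its vectors as network weights.

```python
from collections import Counter, defaultdict

# A tiny made-up corpus; real distributional models use billions of tokens.
corpus = [
    "the cat drinks milk",
    "the tiger drinks water",
    "the cat chases the mouse",
]

def context_counts(sentences, window=1):
    """Count, for every word, the words that occur within `window` positions of it."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts

counts = context_counts(corpus)
print(counts["cat"])
print(counts["tiger"])
```

Words that share neighbours (here "cat" and "tiger" both occur next to "the" and "drinks") end up with similar count vectors, which is the "company it keeps" intuition in its simplest form.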

In [None]:
#!conda install gensim

In this section, we will load pre-trained word embeddings called word2vec, created by Google. The embeddings have 300 dimensions.

First, please download the file from [their google drive](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit). Then, create a folder in the same directory as this notebook and unpack the word2vec file in that folder.

We will load the embedding model with the Gensim package that we used before.

In [None]:
import gensim

We can now load the file using the gensim library (this takes a while):

In [33]:
## The Google model should be in the same directory. If not adapt the path accordingly.
path = r"C:\path\to\GoogleNews-vectors-negative300.bin"

word_embedding_model = gensim.models.KeyedVectors.load_word2vec_format(path, binary=True)  

If you have limited memory, you can add a limit to load only part of the vocabulary. In the next call, only the first 500K words are loaded:

```word_embedding_model = gensim.models.KeyedVectors.load_word2vec_format('/Users/piek/Desktop/ONDERWIJS/data/models/GoogleNews-vectors-negative300.bin', binary=True, limit=500000)  ```

Word embeddings capture certain aspects of word meaning. Previous research has shown that they can partially capture similarity ("tapas" is similar to "pintxos"), relatedness ("tapas" relates to "Spain"), and analogy ("Paris" is to "France" as "Rome" is to "Italy").

To get an idea of these properties of embeddings, we can compute the cosine similarity between two word vectors. We would expect, for example, that "cat" and "tiger" are more similar than "cat" and "Germany". Feel free to play a bit with word1 and word2 below to get a feeling for the information these embeddings capture.

In [38]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

In [40]:
word1='Apple'
word2='Google'
word1_vector=np.array(word_embedding_model[word1]).reshape(1, -1)
word2_vector=np.array(word_embedding_model[word2]).reshape(1, -1)
#print(word1_vector)
print(cosine_similarity(word1_vector, word2_vector))

[[0.5683571]]
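To see what the cosine similarity measures, it can be computed by hand as the dot product of two vectors divided by the product of their lengths. The sketch below uses made-up 3-dimensional vectors rather than the word2vec model, and the helper name `cosine` is our own.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up 3-dimensional vectors, not taken from the word2vec model.
same_direction = cosine([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
opposite = cosine([1.0, 2.0, 3.0], [-1.0, -2.0, -3.0])
print(same_direction, orthogonal, opposite)
```

The score depends only on the direction of the vectors, not on their length: vectors pointing the same way score close to 1, unrelated (orthogonal) vectors score 0, and opposite vectors score close to -1.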


We can also get the most similar words to some word, say 'slave_labor':

In [43]:
print(word_embedding_model.most_similar('slave_labor', topn=20))

[('slave_laborers', 0.6745792627334595), ('sweatshops', 0.6110795140266418), ('indentured_servitude', 0.5665937066078186), ('slaves', 0.5655405521392822), ('sweatshop', 0.555057942867279), ('indentured_labor', 0.5519518256187439), ('slavery', 0.5504497289657593), ('sex_slaves', 0.5418298244476318), ('indentured_servants', 0.528259813785553), ('enslavement', 0.5266590118408203), ('sweatshop_labor', 0.5248557925224304), ('concentration_camps', 0.5246363878250122), ('concentration_camp', 0.5233564972877502), ('laogai', 0.517486035823822), ('enslaved', 0.5159811973571777), ('gulag', 0.5139166116714478), ('slave', 0.5118827819824219), ('Gulags', 0.5074896216392517), ('Soviet_Gulag', 0.5051531791687012), ('cocoa_plantations', 0.5045973062515259)]


## 2 Using embeddings in a NERC model

Next, we will use the same example of Named Entity Recognition and Classification (NERC) as in the previous notebook, but now we replace the one-hot vectors for the words by their dense embeddings.

We use the same text as before and process it using spaCy to get the words and their parts of speech. We define the labels in the same way as well.

In [46]:
import spacy

nlp = spacy.load('en_core_web_sm')

text="Germany's representative to the European Union"

doc=nlp(text)

## The series of labels that go with the word tokens from the input text
y=['B-LOC', 'O', 'O', 'O', 'O', 'B-ORG', 'I-ORG']

We will now replace the one-hot input representation of our words with embeddings. We generate our input data by simply looking up each word in the embeddings model. If we find it, we add the embedding vector to the training input; if not, we add a vector with 300 zero values.

The following code creates an array from all the tokens in the spaCy document object "doc" by taking the embedding vectors for each word.

In [49]:
training_input=[]
for token in doc:
    word=token.text  # the current word from the tokenized text
    # Check whether the word is in the model vocabulary (the Google word2vec embeddings)
    if word in word_embedding_model:
        # The word was found: use its embedding vector
        vector=word_embedding_model[word]
    else:
        # The word is not in the embeddings vocabulary, so we create a vector with all zeros.
        # The Google word2vec model has 300 dimensions, so we create a vector with 300 zeros.
        vector=[0]*300
        print('This word is not in the word2vec vocabulary:', word)
    training_input.append(vector)

This word is not in the word2vec vocabulary: 's
This word is not in the word2vec vocabulary: to


We see that for two tokens from the spaCy output, we did not get an embedding.
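The lookup with a zero-vector fallback can be wrapped in a small reusable function. The following is a sketch under our own assumptions: the name `embed_tokens` and the 4-dimensional `toy_model` dictionary are hypothetical stand-ins so that the example runs without the large word2vec file.

```python
import numpy as np

def embed_tokens(tokens, model, dimensions=300):
    """Look up each token in `model`; use an all-zero vector for unknown tokens.

    `model` can be anything that supports `in` and `[]`, such as the
    word2vec model loaded above or a plain dict.
    Returns the matrix of vectors and the list of tokens that were not found.
    """
    vectors, missing = [], []
    for token in tokens:
        if token in model:
            vectors.append(np.asarray(model[token], dtype=float))
        else:
            vectors.append(np.zeros(dimensions))
            missing.append(token)
    return np.array(vectors), missing

# Hypothetical 4-dimensional stand-in for the word2vec model.
toy_model = {"Germany": [0.1, 0.2, 0.3, 0.4], "Union": [0.5, 0.6, 0.7, 0.8]}
matrix, missing = embed_tokens(["Germany", "'s", "Union"], toy_model, dimensions=4)
print(matrix.shape, missing)
```

Returning the missing tokens alongside the matrix makes it easy to check how much of a text falls outside the embedding vocabulary, which matters because all such tokens receive the same all-zero vector.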

We can now inspect our training_input. It has the same length as the tokenized sentence, but each word is replaced by its embedding. Below we print the first element.

In [53]:
print("The length of the training input = ",len(training_input))
#### the first token has the following embedding values
print(training_input[0])

The length of the training input =  7
[ 0.25976562  0.140625    0.24707031  0.00958252 -0.25       -0.08251953
 -0.09912109 -0.35351562 -0.1484375   0.1484375  -0.03540039 -0.05249023
  0.09277344 -0.14257812 -0.01483154  0.01647949  0.03710938  0.18847656
 -0.03955078 -0.05786133  0.26757812  0.10693359 -0.04345703  0.06738281
 -0.00177765  0.1328125  -0.16308594 -0.05908203 -0.22558594  0.12207031
  0.10791016 -0.19433594 -0.16210938 -0.14257812  0.09033203 -0.14648438
 -0.12109375  0.09960938  0.26367188  0.12695312  0.140625    0.11083984
  0.02697754 -0.01635742  0.00292969  0.14746094 -0.06542969 -0.16699219
  0.03662109  0.14941406 -0.14746094  0.06835938 -0.09228516  0.12207031
 -0.09179688  0.09082031 -0.38476562  0.03051758 -0.21679688 -0.12597656
 -0.08642578 -0.26171875 -0.08496094 -0.13964844 -0.02832031 -0.203125
  0.29101562 -0.13574219 -0.07226562  0.16308594 -0.19042969  0.22265625
  0.05566406  0.21289062  0.05053711 -0.09814453  0.12158203  0.01000977
  0.15234375 -0

As in the earlier cases, once we have the vector representations, we can use them to train our model.

In [56]:
from sklearn import svm

lin_clf = svm.LinearSVC()
lin_clf.fit(training_input, y)

**Testing the model** Let's say we want to test our model with the sentence: 'I love beer from Munich'. What we need to do is to preprocess the text in the same way as the training data by using spaCy (otherwise, we may get a mismatch in features), and next replace each word by an embedding vector as well.

In [59]:
test_sentence='I love beer from Munich'
test_doc=nlp(test_sentence)

test_input=[]

for token in test_doc:
    word=token.text
    if word in word_embedding_model:
        vector=word_embedding_model[word]
    else:
        vector=[0]*300
    test_input.append(vector)

Because our representation is the same, we can ask the classifier to make a prediction for it:

In [62]:
pred=lin_clf.predict(test_input)
print(pred)

['O' 'O' 'O' 'O' 'B-LOC']


The classifier assigned IOB tags to the tokens in order, and the final token obtained the label 'B-LOC', which is correct.
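Since the predictions are IOB tags in token order, they can be turned back into named entities by grouping each 'B-' token with the 'I-' tokens that follow it. The function below is one possible sketch of this decoding step; the name `iob_to_entities` is our own and not part of spaCy or scikit-learn.

```python
def iob_to_entities(tokens, labels):
    """Group tokens into (entity_text, entity_type) pairs based on IOB labels."""
    entities = []
    current_tokens, current_type = [], None
    for token, label in zip(tokens, labels):
        starts_new = label.startswith("B-") or (
            label.startswith("I-") and label[2:] != current_type
        )
        if label == "O" or starts_new:
            # Close the entity that was being built, if any.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
        if label != "O":
            if not current_tokens:
                current_type = label[2:]
            current_tokens.append(token)
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities

tokens = ["Germany", "'s", "representative", "to", "the", "European", "Union"]
labels = ["B-LOC", "O", "O", "O", "O", "B-ORG", "I-ORG"]
entities = iob_to_entities(tokens, labels)
print(entities)
```

Applied to the training sentence and its labels, this groups "European" and "Union" into a single ORG entity, which is exactly what the 'B-'/'I-' distinction is meant to encode.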

Congratulations! You have now trained and tested your first embeddings-based NERC model. Note that the word 'Munich' is not in the training data, but the system still managed to make a correct(!) prediction because its embedding is similar to the embeddings of location words seen in training.

So far you have worked with only a few toy examples. To obtain good performance, machine learning systems may need thousands and sometimes hundreds of thousands of training examples. The vocabulary of a language is large and there is also large variation in expressions. Collecting even a few examples for each word or expression requires massive amounts of text.

To some extent, word embeddings resolve the issue of *data sparseness*, as words unseen in the training data may still be similar to other words that are in the training data. Word embeddings are derived from millions of documents (billions of tokens) and are likely to have embeddings for most words.

## 3 Combining embeddings with one-hot encoding

So how can we combine the word embeddings with one-hot-encodings for other features?

We first get the one-hot encodings of the text, as we did in the previous notebook, using the DictVectorizer.

In [67]:
from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer()

In [69]:
training_instances=[]
for token in doc:
    one_training_instance={'part-of-speech': token.pos_, 'lemma': token.lemma_} # a feature dictionary with the PoS and lemma of the token
    training_instances.append(one_training_instance)
the_array = vec.fit_transform(training_instances).toarray() 

If we inspect the array, we see it holds 7 rows, each row representing one token, and 12 columns, each column representing a feature value.

In [72]:
the_array.shape
# ROWS are WORDS, COLUMNS are FEATURES

(7, 12)

In [74]:
# the first token values
print(the_array[0])

[0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1.]


In [82]:
print(vec.get_feature_names_out())

["lemma='s" 'lemma=European' 'lemma=Germany' 'lemma=Union'
 'lemma=representative' 'lemma=the' 'lemma=to' 'part-of-speech=ADP'
 'part-of-speech=DET' 'part-of-speech=NOUN' 'part-of-speech=PART'
 'part-of-speech=PROPN']


In [84]:
print(the_array)

[[0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1.]]


Our training_input array represented the same text with embeddings. Let's inspect the array for the embeddings using the numpy module (imported as np at the start of this notebook!).

In [87]:
np.array(training_input).shape
# ROWS are WORDS, COLUMNS are EMBEDDINGS

(7, 300)

It has the same number of rows but 300 columns, one for each embedding dimension. We can now *concatenate* the features for each word using numpy:

In [90]:
features_input=np.array(the_array)
embeddings_input=np.array(training_input)

We assume that the number of rows is the same across the two arrays and each row corresponds to the same token instance.

In [93]:
#### num_words is the number of rows
num_words=features_input.shape[0]
concat_input=[] # for storing the result of concatenating
for index in range(num_words):
    print('Combining the values for:', index, " from the features and the embeddings")
    representation=list(features_input[index]) + list(embeddings_input[index]) # concatenate features per word
    concat_input.append(representation)

Combining the values for: 0  from the features and the embeddings
Combining the values for: 1  from the features and the embeddings
Combining the values for: 2  from the features and the embeddings
Combining the values for: 3  from the features and the embeddings
Combining the values for: 4  from the features and the embeddings
Combining the values for: 5  from the features and the embeddings
Combining the values for: 6  from the features and the embeddings


If we check the shape, we see it has the same number of rows, but the combination of features now results in 312 columns.

In [96]:
np.array(concat_input).shape

(7, 312)
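The row-by-row loop above makes each step explicit. As an alternative sketch, numpy can do the same column-wise concatenation in one call with `np.hstack`. The arrays below are made-up placeholders with the same shapes as our one-hot and embedding arrays, so that the example runs on its own.

```python
import numpy as np

# Made-up stand-ins: 7 tokens with 12 one-hot columns and 300 embedding columns.
toy_features = np.zeros((7, 12))
toy_embeddings = np.ones((7, 300))

# Both arrays must describe the same tokens in the same order.
assert toy_features.shape[0] == toy_embeddings.shape[0]

# Stack the columns side by side: one-hot columns first, embedding columns after.
toy_concat = np.hstack([toy_features, toy_embeddings])
print(toy_concat.shape)
```

The explicit check on the number of rows guards against the assumption stated earlier: if the two arrays were built from different tokenizations, the concatenation would silently pair the wrong rows or fail.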

Let's inspect the concatenated vector for the first token.

In [99]:
print(concat_input[0])

[0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.259765625, 0.140625, 0.2470703125, 0.00958251953125, -0.25, -0.08251953125, -0.09912109375, -0.353515625, -0.1484375, 0.1484375, -0.035400390625, -0.052490234375, 0.0927734375, -0.142578125, -0.01483154296875, 0.0164794921875, 0.037109375, 0.1884765625, -0.03955078125, -0.057861328125, 0.267578125, 0.10693359375, -0.04345703125, 0.0673828125, -0.00177764892578125, 0.1328125, -0.1630859375, -0.05908203125, -0.2255859375, 0.1220703125, 0.10791015625, -0.1943359375, -0.162109375, -0.142578125, 0.09033203125, -0.146484375, -0.12109375, 0.099609375, 0.263671875, 0.126953125, 0.140625, 0.11083984375, 0.0269775390625, -0.016357421875, 0.0029296875, 0.1474609375, -0.0654296875, -0.1669921875, 0.03662109375, 0.1494140625, -0.1474609375, 0.068359375, -0.09228515625, 0.1220703125, -0.091796875, 0.0908203125, -0.384765625, 0.030517578125, -0.216796875, -0.1259765625, -0.08642578125, -0.26171875, -0.0849609375, -0.1396484375, -0.0283203

### 3.1 Representing the test data

Note that we need to represent the test data in the same way as the training data. So when testing, we also need to create an array with the same 312 features. We first use spaCy again to get the linguistic features.

In [103]:
test_sentence='I love beer from Munich'
test_doc=nlp(test_sentence)

test_instances=[]
for token in test_doc:
    one_test_instance={'part-of-speech': token.pos_, 'lemma': token.lemma_} # a feature dictionary with the PoS and lemma of the token
    test_instances.append(one_test_instance)

print(test_instances)
the_test_array = vec.fit_transform(test_instances).toarray()

the_test_array.shape


[{'part-of-speech': 'PRON', 'lemma': 'I'}, {'part-of-speech': 'VERB', 'lemma': 'love'}, {'part-of-speech': 'NOUN', 'lemma': 'beer'}, {'part-of-speech': 'ADP', 'lemma': 'from'}, {'part-of-speech': 'PROPN', 'lemma': 'Munich'}]


(5, 10)

In [105]:
print(the_test_array)

[[1. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 1.]
 [0. 0. 1. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 1. 0. 1. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 1. 0.]]


We have a problem. The training and test arrays do not have the same number of columns. For the training set we had 12 features, and now we have 10. The vectorizer derives the features and their values from the data it is fitted on. Since the training and test data are different, their vector representations are different as well: not only in size, but also in which column represents which feature value.

The problem is that we called *fit_transform*, the function we used for training, on the test data, and thereby refitted the model of *vec* to the test data. We thus ruined our model.

In [108]:
print(vec.get_feature_names_out())

['lemma=I' 'lemma=Munich' 'lemma=beer' 'lemma=from' 'lemma=love'
 'part-of-speech=ADP' 'part-of-speech=NOUN' 'part-of-speech=PRON'
 'part-of-speech=PROPN' 'part-of-speech=VERB']


We should have used the *transform* function instead of *fit_transform*, but it is now too late: *vec* has already been refitted on the test data, so calling *transform* gives the same 10 columns.

In [111]:
the_test_array = vec.transform(test_instances).toarray()

print(the_test_array.shape)
print(the_test_array)


(5, 10)
[[1. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 1.]
 [0. 0. 1. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 1. 0. 1. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 1. 0.]]


In [113]:
print(vec.get_feature_names_out())

['lemma=I' 'lemma=Munich' 'lemma=beer' 'lemma=from' 'lemma=love'
 'part-of-speech=ADP' 'part-of-speech=NOUN' 'part-of-speech=PRON'
 'part-of-speech=PROPN' 'part-of-speech=VERB']


If we rebuild the model *vec* from the training data, we can fix this.

In [116]:
vec = DictVectorizer()
the_array = vec.fit_transform(training_instances).toarray() 
print(vec.get_feature_names_out())

["lemma='s" 'lemma=European' 'lemma=Germany' 'lemma=Union'
 'lemma=representative' 'lemma=the' 'lemma=to' 'part-of-speech=ADP'
 'part-of-speech=DET' 'part-of-speech=NOUN' 'part-of-speech=PART'
 'part-of-speech=PROPN']


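Now we can transform the test data without fitting the model. Feature values that do not occur in the training data, such as the lemmas of the test sentence, get no column, so the corresponding rows contain only zeros for them.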

In [119]:
the_test_array = vec.transform(test_instances).toarray()

print(the_test_array.shape)
print(the_test_array)

(5, 12)
[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]]
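To see why *fit* and *transform* must be kept apart, the following sketch implements a very small one-hot vectorizer in plain Python. It is a simplified stand-in written for illustration (the class name `TinyDictVectorizer` is our own); it is not how scikit-learn implements DictVectorizer, but it follows the same idea: *fit* decides which columns exist, and *transform* only fills in those columns.

```python
class TinyDictVectorizer:
    """Simplified stand-in for a one-hot vectorizer; not scikit-learn's implementation."""

    def fit(self, instances):
        # Learn one column per distinct "feature=value" pair, in sorted order.
        names = {f"{key}={value}" for inst in instances for key, value in inst.items()}
        self.feature_names = sorted(names)
        self.column = {name: i for i, name in enumerate(self.feature_names)}
        return self

    def transform(self, instances):
        # Reuse the learned columns; pairs that were never seen are ignored.
        rows = []
        for inst in instances:
            row = [0.0] * len(self.feature_names)
            for key, value in inst.items():
                index = self.column.get(f"{key}={value}")
                if index is not None:
                    row[index] = 1.0
            rows.append(row)
        return rows

train = [{"pos": "PROPN", "lemma": "Germany"}, {"pos": "NOUN", "lemma": "beer"}]
test = [{"pos": "PROPN", "lemma": "Munich"}]

tiny_vec = TinyDictVectorizer().fit(train)   # fit on the training data only
test_rows = tiny_vec.transform(test)         # same columns as the training data
print(tiny_vec.feature_names)
print(test_rows)
```

Because the unseen lemma "Munich" has no column, only the part-of-speech value of the test token is represented. Calling `fit` again on the test data would replace the learned columns, which is the mistake made above with *fit_transform*.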


Another way to fix this is to fit the vectorizer on the training and test data together, which builds a model for all the data, and next to represent each set separately by transforming it to the vector dimensions of this overarching model.

### 3.2 Harmonizing one-hot-vectors across training and test sets

We first combine the training and test instances into a single data set.

In [124]:
vec = DictVectorizer()
## First we concatenate the training and test instances and fit these to a vector representation
train_and_test_instance = training_instances + test_instances
print(train_and_test_instance)

[{'part-of-speech': 'PROPN', 'lemma': 'Germany'}, {'part-of-speech': 'PART', 'lemma': "'s"}, {'part-of-speech': 'NOUN', 'lemma': 'representative'}, {'part-of-speech': 'ADP', 'lemma': 'to'}, {'part-of-speech': 'DET', 'lemma': 'the'}, {'part-of-speech': 'PROPN', 'lemma': 'European'}, {'part-of-speech': 'PROPN', 'lemma': 'Union'}, {'part-of-speech': 'PRON', 'lemma': 'I'}, {'part-of-speech': 'VERB', 'lemma': 'love'}, {'part-of-speech': 'NOUN', 'lemma': 'beer'}, {'part-of-speech': 'ADP', 'lemma': 'from'}, {'part-of-speech': 'PROPN', 'lemma': 'Munich'}]


We adapt the model by calling *fit_transform* over the combination of the data.

In [127]:
the_array = vec.fit_transform(train_and_test_instance).toarray()
the_array.shape

(12, 19)

We see that we now have 12 rows (tokens) and 19 columns (feature values). From this shared feature space, we need to recover the rows corresponding to the training data and the rows corresponding to the test data. Since the order is based on the concatenation, we can use the length of training_instances to separate the first part as the training data and the second part as the test data.

In [130]:
print(the_array)

[[0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]]


In [132]:
print(vec.get_feature_names_out())

["lemma='s" 'lemma=European' 'lemma=Germany' 'lemma=I' 'lemma=Munich'
 'lemma=Union' 'lemma=beer' 'lemma=from' 'lemma=love'
 'lemma=representative' 'lemma=the' 'lemma=to' 'part-of-speech=ADP'
 'part-of-speech=DET' 'part-of-speech=NOUN' 'part-of-speech=PART'
 'part-of-speech=PRON' 'part-of-speech=PROPN' 'part-of-speech=VERB']


In [134]:
# For the training set, we take the first part of the data up to the length of training_instances
training_onehot = the_array[:len(training_instances)]
# For the test set, we take the remaining part of the data, starting at the length of training_instances
# (remember that '0' is the index of the first data element)
test_onehot = the_array[len(training_instances):]

print('Number of training words =', training_onehot.shape)
print('Number of test words =', test_onehot.shape)

Number of training words = (7, 19)
Number of test words = (5, 19)


In [136]:
print(training_onehot)

[[0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]]


In [138]:
print(test_onehot)

[[0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]]


Now we have ensured that the feature space is the same for the training and test data. Next we get the embeddings for both sets and combine these with the one-hot-vector representations. We start with the training data again.

In [141]:
features_training_input=np.array(training_onehot)
embeddings_training_input=np.array(training_input)

In [143]:
num_words=training_onehot.shape[0]
concat_train_input=[]
for index in range(num_words):
    print(index)
    representation=list(training_onehot[index]) + list(embeddings_training_input[index]) # concatenate features per word
    concat_train_input.append(representation)

# we check the shape
np.array(concat_train_input).shape

0
1
2
3
4
5
6


(7, 319)

In [145]:
print(concat_train_input[0])

[0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.259765625, 0.140625, 0.2470703125, 0.00958251953125, -0.25, -0.08251953125, -0.09912109375, -0.353515625, -0.1484375, 0.1484375, -0.035400390625, -0.052490234375, 0.0927734375, -0.142578125, -0.01483154296875, 0.0164794921875, 0.037109375, 0.1884765625, -0.03955078125, -0.057861328125, 0.267578125, 0.10693359375, -0.04345703125, 0.0673828125, -0.00177764892578125, 0.1328125, -0.1630859375, -0.05908203125, -0.2255859375, 0.1220703125, 0.10791015625, -0.1943359375, -0.162109375, -0.142578125, 0.09033203125, -0.146484375, -0.12109375, 0.099609375, 0.263671875, 0.126953125, 0.140625, 0.11083984375, 0.0269775390625, -0.016357421875, 0.0029296875, 0.1474609375, -0.0654296875, -0.1669921875, 0.03662109375, 0.1494140625, -0.1474609375, 0.068359375, -0.09228515625, 0.1220703125, -0.091796875, 0.0908203125, -0.384765625, 0.030517578125, -0.216796875, -0.1259765625, -0.08642578125, -0.26171875, -0.08

In [147]:
features_test_input=np.array(test_onehot)
embeddings_test_input=np.array(test_input)

In [149]:
num_words=test_onehot.shape[0]
concat_test_input=[]
for index in range(num_words):
    print(index)
    representation=list(test_onehot[index]) + list(embeddings_test_input[index]) # concatenate features per word
    concat_test_input.append(representation)

# we check the shape
np.array(concat_test_input).shape

0
1
2
3
4


(5, 319)

We can now train the classifier in the same way as we did before but now with the concatenated features, where the one-hot-vectors are aligned.

In [152]:
lin_clf.fit(concat_train_input, y)

In [154]:
pred=lin_clf.predict(concat_test_input)
print(pred)

['O' 'O' 'O' 'O' 'B-LOC']


## End of this notebook