# NER Using Word2Vec
We will start off by training a fully connected neural network using Word2Vec embeddings as input.

You are free to use the English word2vec vectors by Google or the Danish ones by DaNLP. Either is fine. Remember that you can load the models using:

```
from gensim.models import KeyedVectors

w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)
```
Remember you can see how to use the Danish word2vec embeddings [here](https://github.com/alexandrainst/danlp/blob/master/docs/models/embeddings.md).

The tagged data for English is located in the folder for the class on GitHub and can be parsed using the following function:

In [2]:
import pandas as pd
def get_conll_eng():
    # read file
    with open("eng.train.txt") as f:
        raw = f.read()

    def filter_empty_string(t):
        return list(filter(lambda x: x, t))

    # split it into documents
    docs = raw.split("-DOCSTART- -X- O O")
    docs = filter_empty_string(docs)
    res = []
    for n, doc in enumerate(docs):
        sents = filter_empty_string(doc.split("\n\n")) # split into sentences
        for sent_n, sent in enumerate(sents):
            token_tags = filter_empty_string(sent.split("\n")) # split into tokens (w. tags)
            for t in token_tags:
                word, pos, dep, ne = t.split(" ") # split into tags and token
                if len(s := ne.split("-")) == 2:
                    ne = s[1]
                res.append((n, sent_n, word, pos, dep, ne))
    df = pd.DataFrame(res, columns="doc_n sent_n word pos dep ne".split()) # return as a df
    return df

df = get_conll_eng()
df.head()

Unnamed: 0,doc_n,sent_n,word,pos,dep,ne
0,0,0,EU,NNP,I-NP,ORG
1,0,0,rejects,VBZ,I-VP,O
2,0,0,German,JJ,I-NP,MISC
3,0,0,call,NN,I-NP,O
4,0,0,to,TO,I-VP,O


For Danish you can use this code instead (I would collapse the B-\* and I-\* prefixes, as these only denote the beginning (B) and the inside (I) of a named entity):



In [3]:
from danlp.datasets import DDT

def get_conll_da():
    # Loading the Danish Dependency Tree data
    ddt = DDT()
    conllu_format = ddt.load_as_conllu(predefined_splits = True)

    data = []
    for n in range(len(conllu_format)):
        # collect (sentence number, word, tag) for every token in split n
        data.append(
            [
                (i, token.form, token.misc.get("name").pop())
                for i, sent in enumerate(conllu_format[n])
                for token in sent
            ]
        )

    train = pd.DataFrame(data[0], columns = ['sentence_id', 'words', 'labels'])
    return train

df = get_conll_da()
df.head()

Unnamed: 0,sentence_id,words,labels
0,0,På,O
1,0,fredag,O
2,0,har,O
3,0,SID,B-ORG
4,0,inviteret,O
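
Collapsing the prefixes amounts to keeping only the entity type after the hyphen. A minimal sketch, assuming the labels follow the `B-XXX` / `I-XXX` / `O` pattern seen in the output above; `collapse_bio` is a hypothetical helper name and not part of DaNLP:

```python
def collapse_bio(label):
    """Drop a leading B- or I- prefix, leaving e.g. 'ORG'; 'O' is unchanged."""
    prefix, sep, entity = label.partition("-")
    if sep and prefix in ("B", "I"):
        return entity
    return label


labels = ["O", "B-ORG", "I-ORG", "B-PER", "O"]
collapsed = [collapse_bio(label) for label in labels]
print(collapsed)  # ['O', 'ORG', 'ORG', 'PER', 'O']
```

On the dataframe, the same helper could be applied with `df["labels"].apply(collapse_bio)`.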


# Preprocessing
We will start off by determining the output categories. You can use the following code to get categorical output vectors (one-hot encoded).

In [24]:
import numpy as np
def col_to_onehot(col):
    """
    turn column to one hot
    """
    preds = list(col.unique())
    n_predictions = len(preds)
    preds_one_hot = {pred: i for pred, i in zip(preds, range(n_predictions))}

    a = np.array(col.apply(lambda x: preds_one_hot[x]))
    b = np.zeros((a.size, a.max() + 1))
    b[np.arange(a.size), a] = 1
    return preds_one_hot, b

preds_one_hot, y = col_to_onehot(df["labels"])
print(preds_one_hot) # Again collapsing I-* and B-* is ideal
y[:10]

{'O': 0, 'B-ORG': 1, 'B-LOC': 2, 'B-PER': 3, 'I-PER': 4, 'B-MISC': 5, 'I-LOC': 6, 'I-MISC': 7, 'I-ORG': 8}


array([[1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0., 0., 0., 0.]])
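
When you later inspect predictions, you will probably want to go the other way, from a one-hot (or probability) row back to a label. A small sketch of that inversion, using a toy mapping and toy rows in the same format as `col_to_onehot` returns; the name `index_to_label` is an assumption made for illustration:

```python
import numpy as np

# Toy mapping and one-hot rows in the same format as col_to_onehot returns
preds_one_hot = {"O": 0, "B-ORG": 1, "B-LOC": 2}
y = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])

# Invert the label -> index mapping
index_to_label = {i: label for label, i in preds_one_hot.items()}

# argmax over the class axis recovers the position of the 1 in each row
decoded = [index_to_label[int(i)] for i in y.argmax(axis=1)]
print(decoded)  # ['O', 'B-ORG', 'B-LOC']
```

The same `argmax` step should also work on rows of predicted probabilities, where it picks the most probable class.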

## Exercises

1) Collapse the I-\* and B-\* in labels (if using Danish)

2) Create a matrix of word embeddings with the shape (n_samples, embedding_size). The following function might give you a hint of where to start:
```
    def get_embed(x):
        if x in w2v:
            return w2v[x]
        else:
            return w2v["UNK"]
```

In [None]:
# .shape = (number words, embedding size)
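
One possible way to stack per-word vectors into a single matrix is sketched below. It uses a hypothetical toy dictionary (`toy_w2v`, with 4-dimensional vectors) in place of the real word2vec model, so the shapes are easy to check by hand:

```python
import numpy as np

# Hypothetical stand-in for the word2vec model: a dict of 4-dimensional vectors
toy_w2v = {
    "EU": np.array([0.1, 0.2, 0.3, 0.4]),
    "rejects": np.array([0.5, 0.6, 0.7, 0.8]),
    "UNK": np.zeros(4),
}


def get_embed(x):
    # Fall back to the UNK vector for out-of-vocabulary words
    return toy_w2v[x] if x in toy_w2v else toy_w2v["UNK"]


words = ["EU", "rejects", "Brussels"]
# One row per word, one column per embedding dimension
X = np.vstack([get_embed(w) for w in words])
print(X.shape)  # (3, 4)
```

With the real model, the list of words would come from the word column of the dataframe, and the second dimension would be the embedding size (300 for the Google News vectors).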

# Creating a Fully connected Neural Network:
Here we will start off by building a neural network.

## Exercises
1) Start off by making a one-layer neural network. You can use the following code as inspiration. I recommend that you start with an input layer whose input shape equals the embedding size and end with a fully connected (dense) layer whose size equals the number of prediction classes.

2) Train the model for at least 10 epochs. While it is training, consider why one would use the categorical cross-entropy loss rather than the mean squared error (MSE), which you know from linear regression.

3) Check whether the model performs as intended. Consider the code in block 3 as inspiration.
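
As a starting point for thinking about the two losses in exercise 2, it can help to compute both by hand for a single one-hot target. The sketch below uses made-up probability vectors and plain Python versions of the two formulas; it is an illustration, not the Keras implementation:

```python
import math


def categorical_cross_entropy(y_true, y_pred):
    # Only the probability assigned to the true class contributes
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred))


def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [0.0, 1.0, 0.0]              # one-hot target, true class is index 1
unsure = [0.5, 0.4, 0.1]              # mildly wrong prediction
confident_wrong = [0.98, 0.01, 0.01]  # confidently wrong prediction

for name, y_pred in [("unsure", unsure), ("confident_wrong", confident_wrong)]:
    ce = categorical_cross_entropy(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    print(f"{name}: cross entropy = {ce:.3f}, MSE = {mse:.3f}")
```

Compare how much each loss grows between the two predictions: the cross entropy keeps increasing as the probability of the true class approaches zero, whereas the MSE on probability vectors stays bounded.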

In [27]:
import tensorflow as tf
tf.config.run_functions_eagerly(True)  # remember to run in eager mode, otherwise it will throw an error

# Make model
model = tf.keras.Sequential()
# add layers
model.add(...)

model.compile(optimizer="sgd", loss="categorical_crossentropy")
model.summary() # inspect model

In [None]:
# Train model
history = model.fit(X, y, validation_split=0.1, epochs=1)  # raise epochs to at least 10 for exercise 2

In [None]:
# 3) Evaluate performance
input_embedding = tf.constant(w2v["Kenneth"])  # w2v is the word2vec model loaded at the top
x = tf.reshape(input_embedding, shape=(1, 300))  # desired shape (1 is batch size)
model.predict(x).round() # make prediction

# LSTM
LSTM (long short-term memory) networks were widely used for a long time before the Transformer architecture and are still used today in robotics and translation tasks. Here we will implement one in Keras. Note that this is intended to be challenging, but not impossible.

## Exercises
1) We will have to change the input. Instead of a single embedding vector, each sentence should be a series of integers, e.g. `[1, 28, 302]`, corresponding to each word's index in the word2vec embedding. You can use the code in the first block for this. Similarly, the output should be a sequence of predictions, one for each token.

2) You will need to replace the input layer in the model above with the embedding layer in block 2. This layer simply applies the weights from gensim's word2vec to each of the indices and makes sure they are passed correctly to any recurrent layers.

3) Replace the dense layers in the previous model (except the last) with the LSTM layer from Keras. Remember to set `return_sequences=True`, and consider why you set it to true (feel free to ask me if you can't figure it out).

4) Lastly, you should wrap the last layer in the `TimeDistributed` layer ([ref](https://keras.io/api/layers/recurrent_layers/time_distributed/)) to ensure the output is properly matched up with the sequence.
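
To see why a per-token output needs one hidden state per time step, it can help to track the shapes involved. The NumPy sketch below is not how Keras implements these layers; it only mimics the shapes, with random numbers standing in for LSTM outputs and all sizes made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, n_tokens, hidden_size, n_classes = 2, 5, 8, 4

# With return_sequences=True a recurrent layer emits one hidden state per token
all_states = rng.normal(size=(batch_size, n_tokens, hidden_size))
# With return_sequences=False only the state after the last token is returned
last_state = all_states[:, -1, :]

# A dense layer applied at every time step shares one weight matrix across tokens
W = rng.normal(size=(hidden_size, n_classes))
b = np.zeros(n_classes)
per_token_scores = all_states @ W + b

print(last_state.shape)        # (2, 8)
print(per_token_scores.shape)  # (2, 5, 4)
```

Only the second shape has a tag score for every token in every sentence, which is what a sequence-labelling target of shape (batch, tokens, classes) requires.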

In [30]:
# 1)
index_word = w2v.index2word  # gensim < 4.0; in gensim >= 4.0 use w2v.index_to_key
word_to_index = {word: i for i, word in enumerate(index_word)}
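
Recurrent layers in Keras expect every sequence in a batch to have the same length, so the index lists usually need truncating or padding. A minimal pure-Python sketch, using a hypothetical toy vocabulary in place of the real `word_to_index` and a hypothetical helper `sentence_to_indices`; for simplicity the padding index is the same as the `UNK` index here, which you may want to change:

```python
# Hypothetical toy vocabulary standing in for word_to_index
word_to_index = {"UNK": 0, "EU": 1, "rejects": 2, "German": 3, "call": 4}


def sentence_to_indices(sentence, max_len, pad_index=0):
    """Map words to indices (UNK if missing), then truncate or pad to max_len."""
    unk = word_to_index["UNK"]
    indices = [word_to_index.get(word, unk) for word in sentence][:max_len]
    return indices + [pad_index] * (max_len - len(indices))


sentences = [["EU", "rejects", "German", "call"], ["EU", "protests"]]
X_seq = [sentence_to_indices(s, max_len=3) for s in sentences]
print(X_seq)  # [[1, 2, 3], [1, 0, 0]]
```

The label sequences would need the same truncation and padding so that each token still lines up with its tag.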

In [None]:
# 2) make embedding layer
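# get_keras_embedding is only available in gensim < 4.0; if you loaded a full
# Word2Vec model rather than KeyedVectors, call it on the model's .wv attribute
embedding_layer = w2v.get_keras_embedding(train_embeddings=False)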