Welcome to the Python Archeologist project, an expedition into the vast terrains of Natural Language Processing (NLP). As aspiring linguistic excavators, you will embark on a journey to discover obscured words, harnessing the power of the Word2Vec model.

As we've discussed in our lectures, in the realm of language, context reigns supreme. Words draw much of their meaning from the surrounding words. Your challenge for this project? Predicting words in documents that have been buried underground for thousands of years! Our goal is to help archaeologists make sense of documents in which some words are illegible:

![image](https://th.bing.com/th/id/OIG._ZvxAQdM.h2kWO.7ONMn?pid=ImgGn&w=1024&h=1024&rs=1)

To do that, we'll use some text to train a Word2Vec model that will be able to predict the center word based on context! After developing this model, we'll also be able to extract the latent meaning of our words by accessing the weights of the trained neural network.

In [83]:
# Libraries we may need: 
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
import string

from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam

from nltk.tokenize import word_tokenize

### Project - Predict the Hidden Word

To make sense of the hidden words, we first need to train our Word2Vec model. Let's start by loading our training corpus into Python!

Load the `wiki_pages.txt` file stored in the `data` folder using `python`. 
<br>
*Hint: Watch out for file encoding!*

In [1]:
with open('data/wiki_pages.txt', encoding='UTF-8') as f:
    wiki_file = f.read()

Remove all punctuation from the file you've just loaded into Python.

In [2]:
wiki_file = (
    wiki_file
    .translate(
        str.maketrans('', '', string.punctuation)
    )
)
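
To see what the translation table does, here is a small sketch on a made-up sentence (the `sample` string is an assumption for illustration, not text from `wiki_pages.txt`). Note that `string.punctuation` only covers ASCII punctuation, so characters such as curly quotes would survive this step.

```python
import string

# Hypothetical sample text, not taken from the project data
sample = "Lower Egypt's delta (the north) was fertile; Upper Egypt, less so."

# Map every ASCII punctuation character to None, i.e. delete it
table = str.maketrans('', '', string.punctuation)
cleaned = sample.translate(table)

print(cleaned)
```

Because the apostrophe is deleted rather than replaced by a space, `Egypt's` becomes the single token `Egypts`, which is worth keeping in mind when reading the vocabulary later.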

Tokenize the file you loaded using `nltk`'s `word_tokenize`:

In [4]:
token_archives = word_tokenize(wiki_file)

Lower case all tokens in the tokenized version of the text you've just created.

In [5]:
token_archives = [token.lower() for token in token_archives]

Generate the training base for the tokens, using a context of two neighbors on each side. For example, for the sentence 'much of Lower Egypt around', the features should be 'much of Egypt around' and the target should be 'lower'. You can average the one-hot vectors of the individual context words to build the feature array. The target array is the one-hot vector of the target word. 
<br>
<br>
*Hint: Check the code of the lectures where we've used wikipedia data!*

In [7]:
vocab = list(
    set(token_archives)
)

vocab.sort()

word_representations = np.identity(len(vocab))

vocab_vectors = {}

for index, element in enumerate(vocab):
    vocab_vectors[element] = word_representations[index]

In [8]:
def retrieve_word_neighbors(sentence, neighbors):
    '''
    Retrieves each word together with its context window of
    `neighbors` tokens on each side, as two parallel lists.
    
    Arguments:
    - sentence (list of str): The tokens to scan for
    words and context words.
    - neighbors (int): The number of context tokens to
    take on each side of a word.
    
    Returns:
    - word_keys_sentence (list): The words that have a
    full context window;
    - context_words (list): The context words for each
    word, at the same index as in word_keys_sentence.
    '''
    word_keys_sentence = []
    context_words = []    
    
    for index, word in enumerate(sentence):
        
        # Window bounds; clamp start at 0 so a negative
        # index never wraps around to the end of the list
        start = max(index-neighbors, 0)
        finish = index+neighbors
        
        # Get neighbor words
        neighbor_words = sentence[start:finish+1]
        # Generate context
        word_context = (
            neighbor_words[:neighbors]
            +
            neighbor_words[neighbors+1:]
        )
        
        # We only append the context if we have enough 
        # neighbors
        if len(word_context) >= neighbors*2:
            word_keys_sentence.append(word)
            context_words.append(word_context)
            
    return word_keys_sentence, context_words

In [9]:
word_keys, context_window = retrieve_word_neighbors(token_archives, 2)
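
The windowing logic can be checked on a tiny token list. The sketch below uses a compact stand-in, `window_pairs` (a hypothetical helper, not part of the notebook), which should yield the same targets and contexts as `retrieve_word_neighbors` for tokens that have a full window on both sides.

```python
def window_pairs(tokens, neighbors):
    """Return (target, context) pairs for tokens with a full window."""
    pairs = []
    # Only positions with `neighbors` tokens on both sides qualify
    for i in range(neighbors, len(tokens) - neighbors):
        context = tokens[i - neighbors:i] + tokens[i + 1:i + neighbors + 1]
        pairs.append((tokens[i], context))
    return pairs

# Toy sentence extending the example from the task description
tokens = "much of lower egypt around the nile".split()
pairs = window_pairs(tokens, 2)

for target, context in pairs:
    print(target, context)
```

With seven tokens and two neighbors per side, only the three middle tokens get a training example; the first and last two tokens are skipped because their windows are incomplete.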

In [10]:
train_word_size = len(word_keys)
vocab_size = len(vocab)

y = np.zeros([train_word_size, vocab_size])
X = np.zeros([train_word_size, vocab_size])

neighbors = 2
for index, word in enumerate(word_keys):
    if index % 1000 == 0:
        print(index)
    y[index,:] = vocab_vectors[word]
    aux_array = np.zeros([1, vocab_size])
    
    for neighbour in context_window[index]:
        aux_array = aux_array+vocab_vectors[neighbour]
    
    X[index,:] = (aux_array/(neighbors*2))[0]

0
...
60000
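
To make the shape of one training row concrete, here is a minimal sketch on a five-word toy vocabulary (the vocabulary and the `one_hot` helper are assumptions for illustration; the project vocabulary comes from `wiki_pages.txt`). It builds the averaged context vector and the one-hot target for the example sentence from the task description.

```python
import numpy as np

# Toy vocabulary (hypothetical), sorted like the notebook's vocab
vocab = sorted({"much", "of", "lower", "egypt", "around"})
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """One-hot vector for `word` over the toy vocabulary."""
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

context = ["much", "of", "egypt", "around"]
target = "lower"

# Feature row: mean of the context one-hot vectors
x_row = np.mean([one_hot(w) for w in context], axis=0)
# Target row: one-hot vector of the center word
y_row = one_hot(target)

print(vocab)
print(x_row)
print(y_row)
```

Each context word contributes 1/4 to its own position, so the feature row sums to 1 and carries no information about word order, which is the "bag of words" part of CBOW.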


Split the target and features data into train and test sets, holding out a randomly selected 20% of the data as the test set for evaluating the algorithm.
<br>
<br>
*Hint: Use train_test_split from sklearn!*

In [11]:
# Randomly hold out 20% of the rows with sklearn's train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

Train a CBOW (continuous bag-of-words) model using *keras*. Your word vectors (the hidden layer) should have 40 dimensions. Use any hyperparameters (`epochs`, `batch_size`, etc.) you like. 

In [12]:
model = Sequential()
model.add(Dense(40, input_dim=vocab_size, activation='relu'))
model.add(Dense(vocab_size, input_dim=40, activation='softmax'))

model.compile(
    loss='categorical_crossentropy',
    optimizer='adam',
    metrics=['accuracy']
)
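
For intuition, the two `Dense` layers above compute a ReLU hidden layer followed by a softmax over the vocabulary. The NumPy sketch below mirrors that forward pass with toy sizes and randomly initialised weights (both are assumptions; it is not the trained model and does not use keras).

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 5, 3  # toy sizes; the notebook uses len(vocab) and 40

# Random parameters standing in for trained weights (illustrative only)
W1 = rng.normal(size=(vocab_size, embedding_dim))
b1 = np.zeros(embedding_dim)
W2 = rng.normal(size=(embedding_dim, vocab_size))
b2 = np.zeros(vocab_size)

def forward(x):
    """Map an averaged context vector to a distribution over the vocabulary."""
    hidden = np.maximum(0.0, x @ W1 + b1)  # Dense(..., activation='relu')
    logits = hidden @ W2 + b2
    exp = np.exp(logits - logits.max())    # numerically stable softmax
    return exp / exp.sum()

x = np.array([0.25, 0.25, 0.0, 0.25, 0.25])  # averaged one-hot context
probs = forward(x)

print(probs.shape, probs.sum())
```

Since the input is an average of one-hot vectors, `x @ W1` is simply the average of the rows of `W1` for the context words, which is why those rows can later be read as word vectors.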

In [13]:
# Fit on the training split only, so x_test / y_test
# stay unseen for evaluation
model.fit(
    x_train, 
    y_train, 
    epochs=80, 
    batch_size=100,
    validation_split=0.1
)

Epoch 1/80
...
Epoch 80/80


<tensorflow.python.keras.callbacks.History at 0x221d7332188>

The archaeologists found documents with the following sentences: 
- `the muhammad ___ dynasty remained`
- `because the ___ empire was`
- `egypt and ___ formed a`

Using the trained machine learning model, try to predict the center words above and complete the sentences.

In [58]:
def predict_center_word(sentence):
    
    tokens = word_tokenize(sentence)
    
    features = np.zeros(len(vocab))
    
    for token in tokens:
        features += vocab_vectors[token]
    
    features = features / 4
    
    # Predict center word using model
    predicted_array = model.predict(features.reshape(1,-1))
    max_index = np.argmax(predicted_array)
    
    # Look up the predicted word by its index: vocab is sorted in the
    # same order as the rows of the one-hot identity matrix
    word = [vocab[max_index]]
    
    return ' '.join(tokens[0:2]+word+tokens[2:])

In [59]:
predict_center_word('the muhammad dynasty remained')

'the muhammad ali dynasty remained'

In [60]:
predict_center_word('because the empire was')

'because the roman empire was'

In [61]:
predict_center_word('egypt and formed a')

'egypt and was formed a'
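
`predict_center_word` keeps only the single most likely word, and the last result shows that this can be a frequent function word. One option is to look at several candidates instead. The sketch below assumes a hypothetical helper, `top_k_words`, and uses made-up probabilities rather than real model output.

```python
import numpy as np

def top_k_words(probabilities, vocab, k=3):
    """Return the k (word, probability) pairs with the highest probability."""
    probabilities = np.asarray(probabilities)
    # argsort is ascending, so reverse it and keep the first k indices
    best = np.argsort(probabilities)[::-1][:k]
    return [(vocab[i], float(probabilities[i])) for i in best]

# Made-up vocabulary and probabilities, for illustration only
toy_vocab = ["ali", "egypt", "roman", "sudan", "was"]
toy_probs = [0.05, 0.10, 0.15, 0.30, 0.40]

result = top_k_words(toy_probs, toy_vocab, k=3)
print(result)
```

In the notebook, the same helper could be fed `model.predict(features.reshape(1,-1))[0]` together with `vocab`, assuming the model and vocabulary from the cells above.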

Final task! The archaeologists want to understand which words are most similar to `egypt` (top 10) in our word vector space. 
<br>
<br>
Extract the word vectors from our trained model (use any method and any layer you like) and check which words are most similar to `egypt` using cosine similarity.

In [71]:
vocab.index('egypt')

3479

In [80]:
weights = model.get_weights()[0]
similarities = cosine_similarity(weights)

In [87]:
pd.DataFrame(
    # Look the row up by word instead of hard-coding index 3479
    similarities[vocab.index('egypt')],
    index = vocab
).sort_values(by=0, ascending=False).head(10)

word,cosine_similarity
egypt,1.0
morocco,0.595999
1912,0.534263
1947,0.533961
epidemics,0.533809
tunis,0.531562
divorce,0.526187
liberalism,0.486211
autonomousnpublic,0.485979
1591,0.483702
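
Computing the full similarity matrix is convenient but compares every word with every other word. When only one query word matters, a single matrix-vector product is enough. The sketch below assumes a hypothetical helper, `most_similar`, and a made-up two-dimensional embedding matrix, so the scores are illustrative and unrelated to the table above.

```python
import numpy as np

def most_similar(word, vocab, embeddings, top_n=3):
    """Rank the other words by cosine similarity to `word`."""
    idx = vocab.index(word)
    # Normalise each row to unit length; guard against zero vectors
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    # Dot products of unit vectors are cosine similarities
    scores = unit @ unit[idx]
    order = np.argsort(scores)[::-1]
    return [(vocab[i], float(scores[i])) for i in order if i != idx][:top_n]

# Made-up embeddings, one row per word (illustration only)
toy_vocab = ["egypt", "morocco", "tunis", "divorce"]
toy_embeddings = np.array([
    [1.0, 0.0],
    [0.8, 0.6],
    [0.6, 0.8],
    [0.0, 1.0],
])

result = most_similar("egypt", toy_vocab, toy_embeddings, top_n=2)
print(result)
```

In the notebook, `embeddings` would be `model.get_weights()[0]`, whose rows line up with `vocab`; the query word itself is excluded here so that it does not occupy the first slot with a similarity of 1.0.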
