
# Simple Language Model
We will explore creating a simple language model based on word (token) embeddings. We will use `tinyshakespeare.txt` to train the language model.

# Embeddings
We will use the `gensim` library, which provides straightforward implementations of `word2vec`, to create word embeddings using the CBOW model. To keep it simple, we will create embeddings of size 5.
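
To make the CBOW idea a little more concrete, here is a minimal NumPy sketch of a single CBOW forward pass: the context word vectors are averaged, and the average is scored against every word in the vocabulary. The toy vocabulary, the random weight matrices, and the function name `cbow_forward` are illustrative assumptions. `gensim` learns these weights for us and, by default, uses negative sampling rather than the full softmax shown here.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["let", "it", "be", "done", "away"]  # toy vocabulary (assumption)
embedding_size = 5                            # same size as in this notebook

# Input and output embedding matrices, randomly initialized for illustration
W_in = rng.normal(size=(len(vocab), embedding_size))
W_out = rng.normal(size=(len(vocab), embedding_size))

def cbow_forward(context_ids):
    """Return a probability for every vocabulary word being the center word."""
    hidden = W_in[context_ids].mean(axis=0)   # average the context embeddings
    scores = W_out @ hidden                   # one score per vocabulary word
    exp_scores = np.exp(scores - scores.max())
    return exp_scores / exp_scores.sum()      # softmax over the vocabulary

# Context "let it _ done": score candidates for the missing center word
probs = cbow_forward([0, 1, 3])
print(dict(zip(vocab, probs.round(3))))
```

With random weights the probabilities are meaningless; training adjusts `W_in` and `W_out` so that the true center word scores highly, and the rows of `W_in` become the word embeddings.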

In [1]:
import nltk
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
nltk.download('punkt')
import requests

# URL to the raw text file on GitHub
url = 'https://raw.githubusercontent.com/RDGopal/IB9CW0-Text-Analytics/main/Data/tinyshakespeare.txt'

# Use requests to get the content of the file
response = requests.get(url)

# Stop with a clear error if the download failed, so s_text is never undefined
response.raise_for_status()
s_text = response.text

# Print the first 500 characters
print(s_text[:500])

# Tokenize the text
tokens = word_tokenize(s_text)

# Word2Vec expects a list of token lists, so split the tokens into 100-token chunks that act as "sentences"
sentences = [tokens[i:i+100] for i in range(0, len(tokens), 100)]

# Train the CBOW model
word2vec_model = Word2Vec(sentences, vector_size=5, window=5, min_count=1, sg=0)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor


# Tokenizer
Once we get the embeddings, we will store the words (tokens), token ids, and embeddings in a dataframe.

Note that the number of distinct words (tokens) is 14310. This is the size of our vocabulary.

In [2]:
import pandas as pd
# Create a DataFrame to store word, token_id, and embedding
data = {
    'word': [],
    'token_id': [],
    'embedding': []
}

for idx, word in enumerate(word2vec_model.wv.index_to_key):
    data['word'].append(word)
    data['token_id'].append(idx)
    data['embedding'].append(word2vec_model.wv[word].tolist())  # convert numpy array to list for easier handling in DataFrame

df = pd.DataFrame(data)
print(df)

             word  token_id                                          embedding
0               ,         0  [-0.2139173150062561, 1.6443589925765991, 4.50...
1               :         1  [3.2466704845428467, 2.541710138320923, 4.8088...
2               .         2  [2.82372784614563, 2.2152528762817383, 4.49781...
3             the         3  [-1.2569156885147095, -1.560045599937439, 4.52...
4               I         4  [0.9207433462142944, 4.367117881774902, 7.9074...
...           ...       ...                                                ...
14305  misbehaved     14305  [-0.006695118732750416, 0.16023458540439606, -...
14306   Happiness     14306  [-0.08563660085201263, 0.14777077734470367, -0...
14307     slew'st     14307  [-0.14825233817100525, -0.09676516056060791, 0...
14308   dismember     14308  [0.028686027973890305, -0.05427681654691696, -...
14309        viol     14309  [0.11942050606012344, -0.10194143652915955, 0....

[14310 rows x 3 columns]
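
The DataFrame above is essentially a lookup table between words, ids, and vectors. As a rough sketch of the same idea in plain Python, the snippet below builds both directions of the word/id mapping for a toy token list. The names `toy_tokens`, `index_to_key`, and `key_to_index` are chosen for illustration (they mirror `gensim`'s attribute names). Ids are assigned by descending frequency, which is also how `gensim` orders its vocabulary by default.

```python
from collections import Counter

toy_tokens = [",", "the", ",", "I", "the", ","]  # toy data (assumption)

# The most frequent token gets id 0, the next gets id 1, and so on
counts = Counter(toy_tokens)
index_to_key = [word for word, _ in counts.most_common()]
key_to_index = {word: idx for idx, word in enumerate(index_to_key)}

print(index_to_key)          # [',', 'the', 'I']
print(key_to_index["the"])   # 1
```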


# Training Data
Our main objective is to predict the next word (token) from the previous 5 words (tokens). Thus, our context length is 5.

We will prepare the training data so that each input is 5 consecutive words (tokens) and the output to be predicted is the 6th word (token). If an input has fewer than 5 words (tokens), we pad it with `<pad>`. In the code that follows, windows that would run past the end of a chunk are skipped, so the padding step acts only as a safeguard.
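
Before looking at the full function, the sliding-window idea can be sketched on plain tokens, without embeddings. The helper name `sliding_windows` and the toy sentence are illustrative assumptions; the windowing rule (stop once no 6th token is available) is the same one the notebook's function applies.

```python
def sliding_windows(tokens, window_size=5):
    """Yield (context, next_token) pairs from one list of tokens."""
    for start in range(len(tokens) - window_size):
        context = tokens[start:start + window_size]
        target = tokens[start + window_size]
        yield context, target

toy = "Before we proceed any further , hear me".split()
pairs = list(sliding_windows(toy))
for context, target in pairs:
    print(" ".join(context), "->", target)
```

An 8-token list yields 3 pairs here, because only 3 starting positions have a full 5-token context followed by a 6th token.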

In [3]:
import numpy as np
import pandas as pd

def generate_training_data(sentences, model_wv, window_size=5):
    X, y = [], []
    sequence_texts = []  # For storing the actual sequences of words
    next_words = []  # For storing the actual next word
    for sentence in sentences:
        # Embed words using the Word2Vec model
        embedded_sentence = [model_wv[word] for word in sentence if word in model_wv]
        word_sentence = [word for word in sentence if word in model_wv]  # Keep the actual words for viewing
        # Create sequences
        for i in range(len(embedded_sentence)):
            end_ix = i + window_size
            if end_ix >= len(embedded_sentence):
                break
            seq_x, seq_y = embedded_sentence[i:end_ix], embedded_sentence[end_ix]
            seq_text, next_word = word_sentence[i:end_ix], word_sentence[end_ix]
            # Pad sequence if necessary
            seq_x += [np.zeros(model_wv.vector_size)] * (window_size - len(seq_x))
            seq_text += ['<pad>'] * (window_size - len(seq_text))  # Use <pad> for padding text
            X.append(np.concatenate(seq_x))
            y.append(seq_y)
            sequence_texts.append(' '.join(seq_text))
            next_words.append(next_word)
    return np.array(X), np.array(y), sequence_texts, next_words

# Assume 'sentences' and 'model.wv' have been defined
X_train, y_train, train_sequences, train_next_words = generate_training_data(sentences, word2vec_model.wv)

# Create DataFrame
train_df = pd.DataFrame({
    'Sequence': train_sequences,
    'Next Word': train_next_words,
    'X_train (Flattened Embeddings)': list(X_train),
    'y_train (Embedding)': list(y_train)
})



In [4]:
train_df.head()

| | Sequence | Next Word | X_train (Flattened Embeddings) | y_train (Embedding) |
|---|---|---|---|---|
| 0 | First Citizen : Before we | proceed | [3.0321896, 1.8070849, 3.4270294, -2.8053408, ... | [0.19172128, 0.21077667, 0.5719604, -0.0292124... |
| 1 | Citizen : Before we proceed | any | [1.7173121, 1.209647, 2.0685549, -1.018809, -1... | [-0.97842467, 1.684992, 3.7786226, -2.8050857,... |
| 2 | : Before we proceed any | further | [3.2466705, 2.5417101, 4.8088984, -2.1392398, ... | [-0.01992005, 0.6620079, 1.1291828, -0.7801183... |
| 3 | Before we proceed any further | , | [-0.024079662, 0.57426465, 1.1300638, -0.83964... | [-0.21391732, 1.644359, 4.5050793, -2.5268023,... |
| 4 | we proceed any further , | hear | [-0.023775179, 2.944243, 7.1471043, -2.0237226... | [0.30758998, 2.4186532, 4.124135, -0.95807487,... |


Our training dataset has 241,779 data points. Each data point has 6 words (tokens): 5 inputs and 1 target. Counting the overlap between consecutive windows, the training data therefore covers 241,779 $\times$ 6 = 1,450,674 word (token) slots.

In [5]:
train_df.shape
# train_df.head()

(241779, 4)
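
The number of data points follows from how the text was chunked: a chunk of n tokens yields n - 5 windows when n > 5, and none otherwise. The helper below is a small sketch of that arithmetic. `count_windows` is a hypothetical name, and the token counts passed to it are made up for illustration; they are not the real corpus size.

```python
def count_windows(n_tokens, chunk_size=100, window_size=5):
    """Number of (5-token context, next token) pairs after chunking."""
    full_chunks, remainder = divmod(n_tokens, chunk_size)
    per_full_chunk = max(chunk_size - window_size, 0)
    from_remainder = max(remainder - window_size, 0)
    return full_chunks * per_full_chunk + from_remainder

print(count_windows(100))    # 95 windows from one full chunk
print(count_windows(1000))   # 950 windows from ten full chunks
print(count_windows(1009))   # 954: the 9-token tail adds 4 more
```

Because neighboring windows share most of their tokens, the number of token slots (windows times 6) is much larger than the number of distinct token positions in the text.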

# Neural Network Design
The neural network we will train has the following structure.

1. Input layer: 25 nodes (5 words (tokens), each with an embedding of size 5).
2. Hidden layer: 10 nodes.
3. Output layer: 5 nodes (the embedding of the predicted next word (token)).

The total number of parameters to estimate is:

25 $\times$ 10 (edges from input to hidden layer) +

10 (bias terms in the hidden layer) +

10 $\times$ 5 (edges from hidden layer to output layer) +

5 (bias terms in the output layer),

which gives a total of 315 parameters.

![picture](https://drive.google.com/uc?export=view&id=1lAj53mvleR-XRGLuJZu21E86EjXKMbSO)
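
As a sanity check on the parameter count, the same 25-10-5 structure can be written out in NumPy. This is a sketch only: the weights are random rather than trained, the names `W1`, `b1`, `W2`, `b2`, and `forward` are illustrative, and the ReLU hidden layer with a linear output is assumed to match the Keras model defined in this section.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim = 25, 10, 5

# Randomly initialized weights and zero biases, for illustration only
W1, b1 = rng.normal(size=(input_dim, hidden_dim)), np.zeros(hidden_dim)
W2, b2 = rng.normal(size=(hidden_dim, output_dim)), np.zeros(output_dim)

def forward(x):
    """25-value input -> ReLU hidden layer -> 5-value predicted embedding."""
    hidden = np.maximum(0.0, x @ W1 + b1)
    return hidden @ W2 + b2

n_params = W1.size + b1.size + W2.size + b2.size
print(n_params)                           # 315
print(forward(np.ones(input_dim)).shape)  # (5,)
```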


In [6]:
from keras.models import Sequential
from keras.layers import Dense

def build_model(input_dim, hidden_neurons, output_dim):
    model = Sequential([
        Dense(hidden_neurons, input_dim=input_dim, activation='relu'),
        Dense(output_dim, activation='linear')  # Assuming you want the raw embedding as output
    ])
    model.compile(optimizer='adam', loss='mse')
    return model


# Build and train the model
**Training takes a long time: several hours on my machine.**

**I have saved the trained model (`language_model_1.h5`). To use it, simply load the saved model and run it.**

In [None]:
# Build model
nn_model = build_model(25, 10, 5)

# Train the model
nn_model.fit(X_train, y_train, epochs=20, batch_size=1)
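
One likely reason training is so slow is `batch_size=1`: the optimizer then takes one update step per training example. The sketch below simply counts update steps for different batch sizes, using the 241,779 examples and 20 epochs from this notebook. The helper name `update_steps` and the alternative batch size of 256 are arbitrary illustrations, and a model trained with a larger batch would not necessarily reach the same loss.

```python
import math

def update_steps(n_samples, batch_size, epochs):
    """Total number of optimizer updates performed during training."""
    return math.ceil(n_samples / batch_size) * epochs

n_samples, epochs = 241_779, 20
print(update_steps(n_samples, 1, epochs))     # 4835580
print(update_steps(n_samples, 256, epochs))   # 18900
```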

Save the model for later use.

In [None]:
from keras.models import load_model

# Save the trained model
nn_model.save('language_model_1.h5')  # Saves the model to your hard drive

# Load the trained model

In [8]:
from keras.models import load_model

# Load the model from the disk
loaded_model = load_model('language_model_1.h5')

Compute the loss function over the training data. This is useful to compare different models that predict the same outcome.

In [9]:
# The data must be prepared exactly as it was for training
loss = loaded_model.evaluate(X_train, y_train)
print(f"Loss: {loss}")

Loss: 2.4441378116607666
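
For reference, the reported loss is the mean squared error (MSE) between predicted and target embeddings. A minimal NumPy sketch of that calculation is shown below with made-up numbers; the arrays `y_true` and `y_pred` are illustrative and are not taken from the model.

```python
import numpy as np

y_true = np.array([[1.0, 2.0, 3.0, 4.0, 5.0],
                   [0.0, 0.0, 0.0, 0.0, 0.0]])
y_pred = np.array([[1.0, 2.0, 3.0, 4.0, 3.0],
                   [1.0, 1.0, 1.0, 1.0, 1.0]])

# Average the squared error over every value of every example
mse = np.mean((y_true - y_pred) ** 2)
print(mse)  # 0.9
```

The loss is measured in embedding space, so a lower value means the predicted vector is closer to the target vector; it does not directly tell us how often the correct word is ranked first.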


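# Next Word Prediction
Once the model is trained, we can use it to predict the next word (token). The network outputs an embedding, so we return the 5 vocabulary words (tokens) whose embeddings have the highest cosine similarity to it and treat them as the 5 most likely next words (tokens).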

In [10]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def predict_next_words(model, input_sequence, word_vectors, top_n=5):
    # Predict the embedding
    prediction = model.predict(np.array([input_sequence]))[0]

    # Calculate cosine similarity with all words
    all_similarities = cosine_similarity([prediction], word_vectors.vectors)[0]

    # Find the top_n words with the highest similarity
    top_indices = np.argsort(-all_similarities)[:top_n]  # Negative for descending order
    closest_words = [(word_vectors.index_to_key[i], all_similarities[i]) for i in top_indices]

    return closest_words
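
To see what the cosine-similarity lookup is doing, here is a small NumPy-only sketch with a made-up three-word vocabulary and 2-dimensional embeddings. The words, vectors, and the "predicted" vector are all assumptions for illustration.

```python
import numpy as np

words = ["king", "queen", "sword"]            # toy vocabulary (assumption)
vectors = np.array([[1.0, 0.0],
                    [0.8, 0.6],
                    [0.0, 1.0]])              # toy 2-d embeddings
prediction = np.array([2.0, 0.1])             # pretend network output

# Cosine similarity = dot product divided by the product of the norms
similarities = (vectors @ prediction) / (
    np.linalg.norm(vectors, axis=1) * np.linalg.norm(prediction)
)

ranked = [words[i] for i in np.argsort(-similarities)]
print(ranked)  # ['king', 'queen', 'sword']
```

Cosine similarity compares directions only, so the length of the predicted vector (here about 2) does not affect the ranking.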


Run the prediction

In [11]:
# Test the loaded model
test_sequence = "let it be done" # @param {type:"string"}
test_tokens = word_tokenize(test_sequence)
test_embedded = [word2vec_model.wv[word] for word in test_tokens if word in word2vec_model.wv]

vector_size = 5  # embedding size; the model expects 5 such vectors (25 values) as input

# Ensure there are exactly 5 embeddings, pad if fewer
if len(test_embedded) < 5:
    # Pad with zero-filled vectors
    test_embedded += [np.zeros(vector_size) for _ in range(5 - len(test_embedded))]

# Flatten the list of embeddings to match input shape, and ensure it's truncated to exactly 5 words
test_input = np.concatenate(test_embedded[:5])

predicted_words = predict_next_words(loaded_model, test_input, word2vec_model.wv)
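
The pad-then-flatten step can be isolated in a small helper, sketched below. `build_input` is a hypothetical name, and the stand-in embeddings are made up. The helper keeps the last 5 vectors (as the text generation code later in this notebook does) and appends zero vectors at the end when there are fewer than 5. Note that the training windows in this notebook are always full, so the network never saw zero-padded inputs during training; predictions for short prompts should be read with that in mind.

```python
import numpy as np

def build_input(embeddings, context_len=5, vector_size=5):
    """Keep the last `context_len` vectors, zero-pad at the end, and flatten."""
    kept = list(embeddings[-context_len:])
    kept += [np.zeros(vector_size)] * (context_len - len(kept))
    return np.concatenate(kept)

four_words = [np.full(5, float(i)) for i in range(1, 5)]  # stand-in embeddings
x = build_input(four_words)
print(x.shape)   # (25,)
print(x[-5:])    # the padded slot, all zeros
```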




In [12]:
print("Predicted next words:")
for word, similarity in predicted_words:
    print(f"{word}")

Predicted next words:
strike
Doricles
Look
Yes
CARLISLE


# Text Generation with Randomness
To make the generated text more interesting, we will inject some randomness. We will sample the next word in the sequence according to probabilities assigned to the 20 most likely next words (tokens).

Text generation function

In [13]:
def predict_next_words_with_probabilities(model, input_sequence, word_vectors, top_n=20):
    # Predict the embedding
    prediction = model.predict(np.array([input_sequence]))[0]

    # Calculate cosine similarity with all words
    all_similarities = cosine_similarity([prediction], word_vectors.vectors)[0]

    # Get the top_n indices and scores
    top_indices = np.argsort(-all_similarities)[:top_n]
    top_scores = all_similarities[top_indices]

    # Convert scores to probabilities using softmax
    top_probabilities = np.exp(top_scores) / np.sum(np.exp(top_scores))

    # Ensure the probabilities sum to 1
    top_probabilities /= top_probabilities.sum()

    return [(word_vectors.index_to_key[i], top_probabilities[j]) for j, i in enumerate(top_indices)]
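
The softmax-then-sample step can be tried on its own with a few made-up scores. In the sketch below, the scores, the candidate words, and the seed are assumptions for illustration. Because cosine similarities lie between -1 and 1 and the top candidates tend to have similar scores, the resulting probabilities are often close to uniform, which may partly explain why the sampled text looks fairly random.

```python
import numpy as np

def softmax(scores):
    """Convert scores to probabilities that sum to 1."""
    exp_scores = np.exp(scores - np.max(scores))
    return exp_scores / exp_scores.sum()

top_scores = np.array([0.99, 0.97, 0.95])   # made-up cosine similarities
probs = softmax(top_scores)
print(probs.round(3))   # roughly one third each

# Sample one candidate according to those probabilities
rng = np.random.default_rng(0)
print(rng.choice(["strike", "Look", "Yes"], p=probs))
```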



In [14]:
def generate_text(model, initial_text, word_vectors, num_words, vector_size=5):
    tokens = word_tokenize(initial_text)
    current_embeddings = [word_vectors[word] for word in tokens if word in word_vectors]

    generated_words = tokens.copy()

    for _ in range(num_words):
        if len(current_embeddings) < 5:
            padded_embeddings = current_embeddings + [np.zeros(vector_size) for _ in range(5 - len(current_embeddings))]
        else:
            padded_embeddings = current_embeddings[-5:]

        input_sequence = np.concatenate(padded_embeddings)

        next_word_options = predict_next_words_with_probabilities(model, input_sequence, word_vectors)

        words, probabilities = zip(*next_word_options)

        # Normalize probabilities to ensure they sum to 1
        probabilities = np.array(probabilities)
        probabilities /= probabilities.sum()

        next_word = np.random.choice(words, p=probabilities)

        generated_words.append(next_word)
        current_embeddings.append(word_vectors[next_word])

    return ' '.join(generated_words)



In [15]:
%%capture
initial_text = "let it be done" # @param {type:"string"}
num_words_to_generate = 40 # @param {type:"integer"}
generated_text = generate_text(loaded_model, initial_text, word2vec_model.wv, num_words_to_generate)

In [16]:
generated_text

"let it be done Marcius date Gentle Yes Love liest liest boiled Love forbid assured Look Ten Doricles scope assured hither cried liest cried prepare 'Twas Love liest Doricles sweet ignorance self inheritance forbid Look date to-morrow wonder Gentle prepare Look smile CARLISLE sweet"

Because this is a 'toy' example, the generated text is not very impressive. Still, it illustrates the right direction of travel for building a language model: predict the next word (token) from its context, sample it, and repeat.