# Basics of language modeling and its applications

---


## Introduction to recurrent neural networks (RNNs) for language modeling

---


Hands-on exercise: Building a simple language model using RNNs in Python (e.g., using TensorFlow or PyTorch)

https://medium.com/@rachel_95942/language-models-and-rnn-c516fab9545b

---


https://www.youtube.com/watch?v=lDkEC7H88_A




In [None]:
# Import the necessary libraries
import numpy as np
import tensorflow as tf

In [None]:
# Define the text corpus
text = """
The quick brown fox jumps over the lazy dog.
"""

# Preprocess the text
corpus = text.lower().split()
unique_words = sorted(set(corpus))
word_to_int = {word: i for i, word in enumerate(unique_words)}
int_to_word = {i: word for i, word in enumerate(unique_words)}
vocab_size = len(unique_words)

print(word_to_int)
print(int_to_word)
print(unique_words)
print("vocab size: ",vocab_size)
print("corpus len: ",len(corpus))

{'brown': 0, 'dog.': 1, 'fox': 2, 'jumps': 3, 'lazy': 4, 'over': 5, 'quick': 6, 'the': 7}
{0: 'brown', 1: 'dog.', 2: 'fox', 3: 'jumps', 4: 'lazy', 5: 'over', 6: 'quick', 7: 'the'}
['brown', 'dog.', 'fox', 'jumps', 'lazy', 'over', 'quick', 'the']
vocab size:  8
corpus len:  9
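The vocabulary printed above contains the token `dog.` with its trailing period, because `str.split()` only splits on whitespace. The following is a minimal sketch of one possible alternative, assuming punctuation should become separate tokens. The `tokenize` helper and its regular expression are illustrative choices, not part of the original notebook, and the outputs shown in the rest of this notebook assume the original whitespace split.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation mark
    return re.findall(r"\w+|[^\w\s]", text.lower())

text = "The quick brown fox jumps over the lazy dog."
tokens = tokenize(text)
print(tokens)
```

With this tokenizer, `dog` and `.` are separate vocabulary entries, so the vocabulary size and the integer encoding would differ from the ones printed above.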


In [None]:
# Generate training data
sequence_length = 5
input_sequences = []
output_labels = []
for i in range(len(corpus) - sequence_length):
    sequence = corpus[i:i+sequence_length]
    label = corpus[i+sequence_length]
    input_sequences.append([word_to_int[word] for word in sequence])
    output_labels.append(word_to_int[label])

print(input_sequences)
print(output_labels)


[[7, 6, 0, 2, 3], [6, 0, 2, 3, 5], [0, 2, 3, 5, 7], [2, 3, 5, 7, 4]]
[5, 7, 4, 1]
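The loop above slides a window of `sequence_length` words over the corpus and uses the word that follows each window as its label, so it produces `len(corpus) - sequence_length` training pairs. A small sketch that prints the pairs as words and counts how many pairs different window sizes would give; the `make_pairs` helper is a hypothetical name introduced only for this illustration:

```python
def make_pairs(tokens, sequence_length):
    """Return (context, next_word) pairs from a sliding window over tokens."""
    pairs = []
    for i in range(len(tokens) - sequence_length):
        context = tokens[i:i + sequence_length]
        target = tokens[i + sequence_length]
        pairs.append((context, target))
    return pairs

corpus = "the quick brown fox jumps over the lazy dog.".split()

# Show each training pair in readable form
for context, target in make_pairs(corpus, 5):
    print(" ".join(context), "->", target)

# Number of training pairs for a few window sizes
counts = {n: len(make_pairs(corpus, n)) for n in (3, 5, 7, 10)}
print(counts)  # {3: 6, 5: 4, 7: 2, 10: 0}
```

On a 9-word corpus a window of 10 yields no pairs at all, which is worth keeping in mind when experimenting with the sequence length.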


In [None]:
# Convert the training data to numpy arrays
input_sequences = np.array(input_sequences)
output_labels = np.array(output_labels)

print(input_sequences)
print(output_labels)

[[7 6 0 2 3]
 [6 0 2 3 5]
 [0 2 3 5 7]
 [2 3 5 7 4]]
[5 7 4 1]


# A neural network model using the Keras API from TensorFlow

tf.keras.Sequential: This Keras model class builds a linear stack of layers, in which the output of each layer feeds directly into the next.

---



tf.keras.layers.Embedding: This layer represents an embedding layer. It is commonly used in natural language processing tasks to convert integer-encoded input sequences into dense vectors of fixed size. The parameters used in this layer are:


---


vocab_size: The size of the vocabulary, which represents the total number of unique words in the input.
12: The output dimension of the embedding layer, which represents the size of the dense embedding vector for each word (the model code in this notebook uses 12).
input_length=sequence_length: The length of the input sequences, which determines the number of time steps in the recurrent neural network (RNN). In Keras 3 this argument is deprecated and can be omitted.
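Conceptually, an embedding layer is a lookup table with one row per word. A minimal NumPy sketch of that idea, assuming random values in place of the weights Keras would learn during training; the variable names are illustrative:

```python
import numpy as np

vocab_size = 8        # number of unique words in the toy corpus
embedding_dim = 12    # size of each word vector, as in the model code
rng = np.random.default_rng(0)

# One row per word; Keras learns these values, here they are random stand-ins
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))

sequence = np.array([7, 6, 0, 2, 3])   # "the quick brown fox jumps"
embedded = embedding_matrix[sequence]  # row lookup by integer index

print(embedded.shape)  # (5, 12)
```

Each of the 5 input words is replaced by its 12-dimensional vector, giving one vector per time step for the RNN to consume.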

---



tf.keras.layers.SimpleRNN: This layer represents a simple recurrent neural network (RNN). It processes the input sequence step by step and maintains an internal state that captures information about the past steps. The parameters used in this layer are:

32: The number of units (or hidden neurons) in the RNN. This determines the dimensionality of the output of the RNN layer.
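The step-by-step processing described above can be sketched in NumPy. This assumes the default `tanh` activation of `SimpleRNN` and uses random, untrained weights; the names `W_x`, `W_h`, and `b` are chosen for this illustration and are not Keras attribute names:

```python
import numpy as np

embedding_dim, units, steps = 12, 32, 5
rng = np.random.default_rng(0)

W_x = rng.normal(scale=0.1, size=(embedding_dim, units))  # input-to-hidden weights
W_h = rng.normal(scale=0.1, size=(units, units))          # hidden-to-hidden weights
b = np.zeros(units)                                       # bias

inputs = rng.normal(size=(steps, embedding_dim))  # stand-in for 5 embedded words
h = np.zeros(units)                               # initial hidden state

for x_t in inputs:
    # The new state mixes the current input with the previous state
    h = np.tanh(x_t @ W_x + h @ W_h + b)

print(h.shape)  # (32,)
```

The final state `h` is a single 32-dimensional vector summarizing the whole sequence, which is what the layer passes on when it returns only its last output.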

---



tf.keras.layers.Dense: This layer represents a fully connected (dense) layer.
It is used for mapping the output of the previous layer to the desired output shape. The parameters used in this layer are:

vocab_size: The number of units (or neurons) in this layer, which corresponds to the size of the output vocabulary.
activation='softmax': The activation function applied to the layer's outputs. In this case, the softmax activation function is used, which normalizes the output values into a probability distribution over the output vocabulary.

---

This allows the model to predict the probability of each word in the vocabulary.
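As a small illustration of how softmax turns raw scores into a probability distribution, here is a NumPy sketch with made-up scores for an 8-word vocabulary; the `softmax` function is written out by hand for clarity rather than taken from Keras:

```python
import numpy as np

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

# One made-up score per vocabulary word
logits = np.array([2.0, 1.0, 0.1, -1.0, 0.0, 0.5, 1.5, 3.0])
probs = softmax(logits)

print(probs.round(3))
print(int(np.argmax(probs)))  # index of the most likely word: 7
```

The largest score receives the largest probability, and all probabilities are positive and sum to 1.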

The model takes an input sequence of integer-encoded words, applies an embedding layer to convert them into dense vectors, processes them using a simple RNN layer, and finally maps the output to a probability distribution over the vocabulary using a dense layer with softmax activation. The model is trained to minimize the cross-entropy between its predicted distribution and the word that actually follows each input sequence.

In [None]:
# Create the RNN language model
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 12, input_length=sequence_length),
    tf.keras.layers.SimpleRNN(32),
    tf.keras.layers.Dense(vocab_size, activation='softmax')
])
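To get a feel for the size of this model, the number of trainable parameters can be worked out by hand. The sketch below assumes the standard layouts (one vector per word for the embedding, input weights plus recurrent weights plus bias for the RNN, weights plus bias for the dense layer); once the model is built, the totals should be comparable with what `model.summary()` reports.

```python
vocab_size = 8
embedding_dim = 12
units = 32

embedding_params = vocab_size * embedding_dim      # one 12-dim vector per word
rnn_params = units * (embedding_dim + units + 1)   # input weights + recurrent weights + bias
dense_params = units * vocab_size + vocab_size     # weights + bias

total = embedding_params + rnn_params + dense_params
print(embedding_params, rnn_params, dense_params, total)  # 96 1440 264 1800
```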

The code snippet model.compile(loss='sparse_categorical_crossentropy', optimizer='adam') is used to configure the training process for the neural network model. Let's understand the parameters used in this function:

loss='sparse_categorical_crossentropy': The loss parameter specifies the loss function that will be used to measure the discrepancy between the predicted output of the model and the true output during training. In this case, 'sparse_categorical_crossentropy' is used as the loss function. This loss function is suitable for multi-class classification problems where the target labels are integers (sparse labels), not one-hot encoded. It calculates the cross-entropy loss between the true labels and the predicted probabilities.

optimizer='adam': The optimizer parameter specifies the optimization algorithm that will be used to update the weights of the neural network during training in order to minimize the loss function. In this case, 'adam' is used as the optimizer. Adam is an optimization algorithm that combines the benefits of two other popular algorithms, AdaGrad and RMSProp. It adapts the learning rate during training and performs well on a wide range of problems.

By calling model.compile(loss='sparse_categorical_crossentropy', optimizer='adam'), you are configuring the model to use the sparse categorical cross-entropy loss function and the Adam optimizer. Once the model is compiled, it is ready to be trained on a specific dataset using the model.fit() function.
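The loss described above can be sketched in a few lines of NumPy: for each example, take the probability the model assigned to the correct integer label and average the negative logarithms. This is a simplified illustration with made-up probabilities, not the Keras implementation (which, among other things, guards against taking the log of zero):

```python
import numpy as np

def sparse_categorical_crossentropy(labels, probs):
    """Mean negative log-probability assigned to the correct class."""
    picked = probs[np.arange(len(labels)), labels]  # probability of each true label
    return float(np.mean(-np.log(picked)))

# Two examples over a 4-word vocabulary (made-up probabilities)
probs = np.array([
    [0.70, 0.10, 0.10, 0.10],
    [0.25, 0.25, 0.25, 0.25],
])
labels = np.array([0, 2])  # integer labels, not one-hot vectors

loss = sparse_categorical_crossentropy(labels, probs)
print(round(loss, 4))
```

A confident correct prediction (0.70) contributes a small loss, while the uniform guess (0.25) contributes a larger one.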

In [None]:
# Compile the model
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')

model.fit() is the Keras method that trains a model. It iteratively adjusts the model's weights using the input data and the corresponding labels.

The function takes several arguments:

input_sequences: The input data used to train the model, organized as an array in which each row is one training sample. Here each row is a sequence of sequence_length integer word indices, so the columns correspond to time steps rather than independent features.

output_labels: The target values associated with the input data. Here each label is the integer index of the word that follows the corresponding input sequence, which is the format that sparse_categorical_crossentropy expects.

epochs: This parameter specifies the number of times the model will iterate over the entire dataset during the training process. An epoch is defined as a complete pass through the training data. Increasing the number of epochs allows the model to potentially learn more from the data, but it can also lead to overfitting if the model starts to memorize the training examples.

verbose: This parameter controls how much is printed during training. A value of 1 displays a progress bar and logs for each epoch, while 0 trains silently without any output. Recent Keras versions default to 'auto', which behaves like 1 in most environments.

During training, the model's weights are adjusted by an iterative, gradient-based optimization algorithm using the input data and the associated labels. The goal is to minimize the loss function, which measures the discrepancy between the model's predicted outputs and the true labels. Both the optimizer and the loss function are the ones passed to model.compile(), here Adam and sparse categorical cross-entropy.
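The idea of iteratively reducing a loss can be illustrated with plain gradient descent on a toy problem. Note that this sketch uses basic gradient descent rather than Adam, and it treats four raw scores as the only "parameters"; it is meant to show the update loop, not to reproduce what Keras does internally.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.zeros(4)   # the "parameters" being trained in this toy example
target = 2             # index of the correct class
learning_rate = 0.5

losses = []
for step in range(20):
    probs = softmax(logits)
    losses.append(float(-np.log(probs[target])))  # cross-entropy loss
    grad = probs.copy()
    grad[target] -= 1.0                           # d(loss)/d(logits) = probs - one_hot
    logits -= learning_rate * grad                # gradient descent update

print(round(losses[0], 4), round(losses[-1], 4))
```

The loss starts at ln(4) because all four classes are initially equally likely, and it shrinks as the score of the correct class is pushed up.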

After the model.fit() function completes, the trained model will have adjusted its parameters to better fit the provided data, hopefully improving its ability to make accurate predictions or classifications on new, unseen data.

In [None]:
# Train the model
model.fit(input_sequences, output_labels, epochs=1000, verbose=0)

<keras.callbacks.History at 0x7f57350a81c0>

In [None]:
# Generate new text
seed_text = 'The fox jumps over the'
next_words = 10
for _ in range(next_words):
    seed_sequence = [word_to_int[word] for word in seed_text.lower().split()[-sequence_length:]]
    # verbose=0 suppresses the progress bar that predict() prints on every call
    predicted_probs = model.predict(np.array([seed_sequence]), verbose=0)
    predicted_index = np.argmax(predicted_probs)
    predicted_word = int_to_word[predicted_index]
    seed_text += ' ' + predicted_word

print(seed_text)

The fox jumps over the over dog. the lazy over lazy lazy over the lazy


# Exercise 1: Change the sequence length and experiment with different values (e.g., 3, 7, 10). How does it affect the quality and coherence of the generated text?
# Solution:
'''
# Example with sequence length of 3
sequence_length = 3

# Example with sequence length of 7
sequence_length = 7

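# Example with sequence length of 10
# Note: the toy corpus has only 9 words, so a window of 10 yields no training
# pairs; use a longer corpus (see Exercise 2) before trying this value.
sequence_length = 10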
'''

# Exercise 2: Try using a different text corpus of your choice. You can use a longer text to generate more meaningful results. Observe how the choice of corpus affects the output.
# Solution:
'''
# Define your own text corpus
text = "Your text corpus goes here."

# Preprocess the text and generate training data
# ...
'''

# Exercise 3: Experiment with different hyperparameters, such as embedding size, number of hidden units, and optimizer. Observe the impact of these changes on the training time and quality of generated text.
# Solution:
'''
# Example with embedding size of 20
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 20, input_length=sequence_length),
    tf.keras.layers.SimpleRNN(32),
    tf.keras.layers.Dense(vocab_size, activation='softmax')
])

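# Example with 64 hidden units
# (assumption: embedding size kept at 12 as in the main model; only the
# number of RNN units changes)
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 12, input_length=sequence_length),
    tf.keras.layers.SimpleRNN(64),
    tf.keras.layers.Dense(vocab_size, activation='softmax')
])
'''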

In [None]:
# Exercise 4: Create the RNN language model with stacked RNN layers
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 10, input_length=sequence_length),
    tf.keras.layers.SimpleRNN(32, return_sequences=True),
    tf.keras.layers.SimpleRNN(32),
    tf.keras.layers.Dense(vocab_size, activation='softmax')
])
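In the stacked model, the first RNN layer needs `return_sequences=True` so that the second layer still receives one vector per time step instead of a single final state. A NumPy sketch of the resulting shapes, using random untrained weights and hypothetical helper names (`init_weights`, `simple_rnn`) introduced only for this illustration:

```python
import numpy as np

def init_weights(rng, input_dim, units):
    """Small random weights for one RNN layer (illustrative, untrained)."""
    W_x = rng.normal(scale=0.1, size=(input_dim, units))
    W_h = rng.normal(scale=0.1, size=(units, units))
    b = np.zeros(units)
    return W_x, W_h, b

def simple_rnn(inputs, W_x, W_h, b, return_sequences=False):
    """Run a tanh RNN over inputs of shape (steps, input_dim)."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x_t in inputs:
        h = np.tanh(x_t @ W_x + h @ W_h + b)
        states.append(h)
    # Either every hidden state (one per time step) or only the last one
    return np.stack(states) if return_sequences else h

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 10))   # 5 time steps of 10-dimensional embeddings

layer1 = init_weights(rng, 10, 32)
layer2 = init_weights(rng, 32, 32)

seq_out = simple_rnn(x, *layer1, return_sequences=True)  # one state per time step
final = simple_rnn(seq_out, *layer2)                     # only the last state
print(seq_out.shape, final.shape)  # (5, 32) (32,)
```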

In [None]:
# Exercise 5: Create the RNN language model with bidirectional RNN
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 10, input_length=sequence_length),
    tf.keras.layers.Bidirectional(tf.keras.layers.SimpleRNN(32)),
    tf.keras.layers.Dense(vocab_size, activation='softmax')
])
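A bidirectional wrapper runs one RNN over the sequence from left to right and a second, separately weighted RNN from right to left, then merges the two results. The sketch below assumes concatenation as the merge step (the Keras default `merge_mode='concat'`) and uses random untrained weights; the helper names are hypothetical.

```python
import numpy as np

def init_weights(rng, input_dim, units):
    """Small random weights for one RNN direction (illustrative, untrained)."""
    W_x = rng.normal(scale=0.1, size=(input_dim, units))
    W_h = rng.normal(scale=0.1, size=(units, units))
    b = np.zeros(units)
    return W_x, W_h, b

def run_rnn(inputs, W_x, W_h, b):
    """Return the final hidden state of a tanh RNN."""
    h = np.zeros(W_h.shape[0])
    for x_t in inputs:
        h = np.tanh(x_t @ W_x + h @ W_h + b)
    return h

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 10))   # 5 time steps of 10-dimensional embeddings

forward_weights = init_weights(rng, 10, 32)
backward_weights = init_weights(rng, 10, 32)

h_forward = run_rnn(x, *forward_weights)         # reads the sequence left to right
h_backward = run_rnn(x[::-1], *backward_weights) # reads the sequence right to left

merged = np.concatenate([h_forward, h_backward])
print(merged.shape)  # (64,)
```

Because the two directions are concatenated, the dense layer that follows receives 64 features rather than 32.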