# Hidden Markov Model with Token and Numerical Sequences

This notebook demonstrates how to train a Hidden Markov Model (HMM) that takes both token sequences and numerical data sequences as input and produces output token sequences.

Here's a breakdown of what the notebook covers:

1. Data Preparation
   - Creation of sample token sequences (e.g., ['A', 'B', 'C', 'D'])
   - Generation of corresponding numerical data sequences (e.g., [0.5, 1.2, 0.8, 1.5])
   - Definition of target output token sequences (e.g., ['X', 'Y', 'X', 'Z'])
2. Feature Engineering
   - Encoding categorical token data
   - Combining token encodings with numerical data into a single feature matrix
   - Preparing the data structure expected by HMM algorithms
3. Model Training
   - Training a Gaussian Hidden Markov Model using the combined features
   - The model learns hidden states from the input features without supervision; those states are then interpreted as output tokens
4. Evaluation
   - Testing the model on new sequences
   - Comparing predictions to ground truth
   - Calculating accuracy metrics
5. Visualization
   - Visualizing transition probabilities between states
   - Examining emission distributions for each state
6. Making Predictions
   - Applying the trained model to new, unseen data
   - Converting predicted hidden states back to output tokens
7. Advanced Techniques
   - Discussion of alternative approaches for more complex scenarios

The notebook is designed to be interactive and educational. You can run each cell to see how the model performs, and the code includes comments explaining each step of the process.

You'll need the following Python packages to run the notebook:

- numpy
- matplotlib
- hmmlearn
- scikit-learn
- (optionally) pomegranate for the custom HMM implementation section

To try it out, open the notebook in your Jupyter environment and run the cells in sequence.


In [1]:
# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from hmmlearn import hmm
from sklearn.preprocessing import LabelEncoder

## 1. Data Preparation

We'll create sample data with:
- Input token sequences (e.g., ['A', 'B', 'C', 'D'])
- Corresponding numerical data (e.g., [0.5, 1.2, 0.8, 1.5])
- Target output token sequences (e.g., ['X', 'Y', 'X', 'Z'])

In [None]:
# Sample data generation
np.random.seed(42)

# Generate 100 sequences of length 4
n_sequences = 100
sequence_length = 4

# Define possible input tokens and output tokens
input_tokens = ['A', 'B', 'C', 'D']
output_tokens = ['X', 'Y', 'Z']

# Generate random input token sequences
input_token_sequences = []
for _ in range(n_sequences):
    sequence = np.random.choice(input_tokens, size=sequence_length)
    input_token_sequences.append(sequence)

# Generate corresponding numerical data (with some patterns)
numerical_data_sequences = []
for seq in input_token_sequences:
    numerical_seq = []
    for token in seq:
        if token == 'A':
            numerical_seq.append(np.random.normal(0.5, 0.1))  # Mean 0.5, std 0.1
        elif token == 'B':
            numerical_seq.append(np.random.normal(1.2, 0.15))  # Mean 1.2, std 0.15
        elif token == 'C':
            numerical_seq.append(np.random.normal(0.8, 0.12))  # Mean 0.8, std 0.12
        else:  # 'D'
            numerical_seq.append(np.random.normal(1.5, 0.2))  # Mean 1.5, std 0.2
    numerical_data_sequences.append(np.array(numerical_seq))

# Generate output token sequences based on patterns in the input
output_token_sequences = []
for i in range(n_sequences):
    output_seq = []
    for j in range(sequence_length):
        input_token = input_token_sequences[i][j]
        num_val = numerical_data_sequences[i][j]
        
        # Logic to determine output token based on input token and numerical value
        if input_token == 'A' and num_val < 0.5:
            output_seq.append('X')
        elif input_token == 'A' and num_val >= 0.5:
            output_seq.append('Y')
        elif input_token == 'B' and num_val < 1.2:
            output_seq.append('X')
        elif input_token == 'B' and num_val >= 1.2:
            output_seq.append('Z')
        elif input_token == 'C':
            output_seq.append('Y')
        else:  # 'D'
            output_seq.append('Z')
    
    output_token_sequences.append(np.array(output_seq))

# Display some examples
print("Sample data:")
for i in range(5):  # Show first 5 sequences
    print(f"Sequence {i+1}:")
    print(f"  Input tokens: {input_token_sequences[i]}")
    print(f"  Numerical data: {numerical_data_sequences[i]}")
    print(f"  Output tokens: {output_token_sequences[i]}")
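The labeling rule above is written out branch by branch, and the same branches are needed again whenever new data is labeled. As a minimal sketch, the rule can be factored into one vectorized function. The name `label_outputs` is hypothetical and not part of the notebook; the thresholds are copied from the cell above.

```python
import numpy as np

def label_outputs(tokens, values):
    """Apply the notebook's labeling rule elementwise (hypothetical helper)."""
    tokens = np.asarray(tokens)
    values = np.asarray(values, dtype=float)
    conditions = [
        (tokens == 'A') & (values < 0.5),
        (tokens == 'A') & (values >= 0.5),
        (tokens == 'B') & (values < 1.2),
        (tokens == 'B') & (values >= 1.2),
        tokens == 'C',
    ]
    choices = ['X', 'Y', 'X', 'Z', 'Y']
    # Anything not matched above (i.e. 'D') maps to 'Z', like the else branch.
    return np.select(conditions, choices, default='Z')

example = label_outputs(['A', 'B', 'C', 'D'], [0.45, 1.3, 0.75, 1.6])
print(example)
```

Keeping the rule in one place means the training and test labels cannot drift apart if a threshold is changed.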

## 2. Feature Engineering

We need to encode our token sequences and combine them with numerical data.

In [None]:
# Encode input tokens
input_encoder = LabelEncoder()
input_encoder.fit(input_tokens)

# Encode output tokens (the labels we want the hidden states to represent)
output_encoder = LabelEncoder()
output_encoder.fit(output_tokens)

# Prepare features by combining encoded input tokens and numerical data
X_combined = []
y = []

for i in range(n_sequences):
    # Encode input tokens for this sequence
    encoded_input = input_encoder.transform(input_token_sequences[i])
    
    # Combine encoded input with numerical data
    combined_features = np.column_stack((
        encoded_input,  # Input token (encoded)
        numerical_data_sequences[i]  # Numerical value
    ))
    
    X_combined.append(combined_features)
    
    # Encode output tokens for this sequence
    y.append(output_encoder.transform(output_token_sequences[i]))

# Reshape for HMM input - concatenate all sequences
X_train = np.vstack(X_combined)

# Create sequence lengths array for the HMM
lengths = [len(seq) for seq in X_combined]

# Print shapes to verify
print(f"X_train shape: {X_train.shape}")
print(f"Sequence lengths: {lengths}")
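To make the role of `lengths` concrete, here is a small sketch with toy sequences of unequal length (the values are made up for illustration). It shows that stacking the sequences and recording their lengths loses no information: the boundaries can be recovered from the cumulative sums.

```python
import numpy as np

# Three toy sequences; each row is (encoded_token, numerical_value).
sequences = [
    np.array([[0, 0.5], [1, 1.2]]),
    np.array([[2, 0.8], [3, 1.5], [0, 0.4]]),
    np.array([[1, 1.1]]),
]

X = np.vstack(sequences)                   # shape (6, 2): all time steps stacked
lengths = [len(seq) for seq in sequences]  # [2, 3, 1]

# Sequence boundaries are the cumulative sums of the lengths.
boundaries = np.cumsum(lengths)[:-1]       # [2, 5]
recovered = np.split(X, boundaries)

print(X.shape, lengths, [r.shape for r in recovered])
```

The sum of `lengths` must equal the number of rows in the stacked matrix; otherwise the sequence boundaries are ambiguous.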

## 3. Training the HMM

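We'll train a Gaussian HMM using the combined features (encoded input tokens + numerical data). Note that `fit` is unsupervised: it never sees the encoded output tokens in `y`, so the index assigned to each hidden state is arbitrary and has to be matched to an output token after training. The label-encoded token is also treated as a continuous Gaussian feature, which imposes an artificial ordering on the tokens A through D.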

In [None]:
# Create and train the HMM
n_states = len(output_tokens)  # Number of hidden states = number of output tokens
model = hmm.GaussianHMM(
    n_components=n_states,  # Number of hidden states
    covariance_type="full",  # Full covariance matrix for each state
    n_iter=100,  # Max iterations for training
    verbose=True,
    random_state=42
)

# Fit the model
model.fit(X_train, lengths)

print("Model trained successfully!")
print(f"Converged: {model.monitor_.converged}")
print(f"Log-likelihood: {model.score(X_train, lengths)}")
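Since EM training does not use the output labels, hidden state 0 is not guaranteed to correspond to encoded label 0. One simple way to align them, sketched below on made-up arrays, is a majority vote: map each state to the label it co-occurs with most often. The function name `map_states_to_labels` and the toy data are assumptions for illustration.

```python
import numpy as np

def map_states_to_labels(states, labels, n_states, n_labels):
    """Map each hidden-state index to the label it co-occurs with most often."""
    counts = np.zeros((n_states, n_labels), dtype=int)
    np.add.at(counts, (states, labels), 1)  # contingency table: state x label
    return counts.argmax(axis=1), counts

# Toy data: state 0 mostly coincides with label 2, state 1 with 0, state 2 with 1.
states = np.array([0, 0, 0, 1, 1, 2, 2, 2, 1, 0])
labels = np.array([2, 2, 2, 0, 0, 1, 1, 1, 0, 1])

state_to_label, counts = map_states_to_labels(states, labels, 3, 3)
aligned = state_to_label[states]           # predicted labels after alignment
accuracy = (aligned == labels).mean()

print(state_to_label, accuracy)
```

A majority vote can assign two states to the same label. If a one-to-one mapping is required, an assignment solver such as `scipy.optimize.linear_sum_assignment` applied to the negated count table is one alternative.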

## 4. Evaluating the Model

Let's create a test set and see how well our model performs.

In [None]:
# Generate a small test set
n_test = 10

# Create test sequences using the same logic as training set
test_input_token_sequences = []
for _ in range(n_test):
    sequence = np.random.choice(input_tokens, size=sequence_length)
    test_input_token_sequences.append(sequence)

test_numerical_data_sequences = []
for seq in test_input_token_sequences:
    numerical_seq = []
    for token in seq:
        if token == 'A':
            numerical_seq.append(np.random.normal(0.5, 0.1))
        elif token == 'B':
            numerical_seq.append(np.random.normal(1.2, 0.15))
        elif token == 'C':
            numerical_seq.append(np.random.normal(0.8, 0.12))
        else:  # 'D'
            numerical_seq.append(np.random.normal(1.5, 0.2))
    test_numerical_data_sequences.append(np.array(numerical_seq))

# Ground truth for test set
test_output_token_sequences = []
for i in range(n_test):
    output_seq = []
    for j in range(sequence_length):
        input_token = test_input_token_sequences[i][j]
        num_val = test_numerical_data_sequences[i][j]
        
        if input_token == 'A' and num_val < 0.5:
            output_seq.append('X')
        elif input_token == 'A' and num_val >= 0.5:
            output_seq.append('Y')
        elif input_token == 'B' and num_val < 1.2:
            output_seq.append('X')
        elif input_token == 'B' and num_val >= 1.2:
            output_seq.append('Z')
        elif input_token == 'C':
            output_seq.append('Y')
        else:  # 'D'
            output_seq.append('Z')
    
    test_output_token_sequences.append(np.array(output_seq))

# Prepare test data in the same way as training data
X_test_combined = []
y_test = []

for i in range(n_test):
    # Encode input tokens for this sequence
    encoded_input = input_encoder.transform(test_input_token_sequences[i])
    
    # Combine encoded input with numerical data
    combined_features = np.column_stack((
        encoded_input,
        test_numerical_data_sequences[i]
    ))
    
    X_test_combined.append(combined_features)
    y_test.append(output_encoder.transform(test_output_token_sequences[i]))

# Predict hidden states for each test sequence
predicted_states = []
for test_seq in X_test_combined:
    # Predict hidden states
    states = model.predict(test_seq)
    predicted_states.append(states)

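# Convert predicted states to output tokens.
# EM training is unsupervised, so state indices are arbitrary: map each state
# to the output label it most often coincides with on the training data.
train_states = model.predict(X_train, lengths)
train_labels = np.concatenate(y)
state_to_label = np.array([
    np.bincount(train_labels[train_states == s], minlength=n_states).argmax()
    for s in range(n_states)
])

predicted_output_sequences = []
for states in predicted_states:
    predicted_tokens = output_encoder.inverse_transform(state_to_label[states])
    predicted_output_sequences.append(predicted_tokens)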

# Compare predictions to ground truth
print("\nTest Results:")
correct_predictions = 0
total_predictions = 0

for i in range(n_test):
    print(f"\nTest Sequence {i+1}:")
    print(f"  Input tokens: {test_input_token_sequences[i]}")
    print(f"  Numerical data: {test_numerical_data_sequences[i]}")
    print(f"  True output: {test_output_token_sequences[i]}")
    print(f"  Predicted output: {predicted_output_sequences[i]}")
    
    # Count correct predictions
    for j in range(len(test_output_token_sequences[i])):
        total_predictions += 1
        if test_output_token_sequences[i][j] == predicted_output_sequences[i][j]:
            correct_predictions += 1

accuracy = correct_predictions / total_predictions
print(f"\nAccuracy: {accuracy:.2f} ({correct_predictions}/{total_predictions})")
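A single accuracy number hides which output tokens are being confused. As a sketch on made-up label arrays, a confusion matrix and per-class recall can be computed with plain NumPy once the per-sequence arrays are concatenated. The helper name `confusion_matrix` here is a local function, not the scikit-learn one.

```python
import numpy as np

def confusion_matrix(true_labels, pred_labels, n_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (true_labels, pred_labels), 1)
    return cm

# Toy encoded labels for two sequences of length 4.
y_true_seqs = [np.array([0, 1, 0, 2]), np.array([1, 1, 2, 2])]
y_pred_seqs = [np.array([0, 1, 1, 2]), np.array([1, 0, 2, 2])]

y_true = np.concatenate(y_true_seqs)
y_pred = np.concatenate(y_pred_seqs)

cm = confusion_matrix(y_true, y_pred, 3)
overall_accuracy = np.trace(cm) / cm.sum()
per_class_recall = np.diag(cm) / cm.sum(axis=1)

print(cm)
print(overall_accuracy, per_class_recall)
```

With only 10 test sequences of length 4, each class has few examples, so per-class figures should be read as rough indications.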

## 5. Visualizing the HMM

Let's visualize the transition and emission probabilities.

In [None]:
# Transition matrix visualization
plt.figure(figsize=(8, 6))
plt.imshow(model.transmat_, aspect='auto', cmap='YlGnBu')
plt.colorbar()
plt.title('Transition Probability Matrix')
plt.xlabel('To state')
plt.ylabel('From state')
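# Label axes by state index: states are not tied to specific output tokens
state_names = [f'State {k}' for k in range(n_states)]
plt.xticks(np.arange(n_states), state_names)
plt.yticks(np.arange(n_states), state_names)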

for i in range(n_states):
    for j in range(n_states):
        plt.text(j, i, f'{model.transmat_[i, j]:.2f}', 
                 ha='center', va='center', color='black')
plt.show()

# Means of emission distributions
plt.figure(figsize=(10, 6))
for i in range(n_states):
    plt.plot(model.means_[i], marker='o', label=f'State {i}')
plt.title('Mean of Emission Distributions for Each State')
plt.xlabel('Feature Index (0: Input Token, 1: Numerical Value)')
plt.ylabel('Mean Value')
plt.legend()
plt.grid(True)
plt.show()
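The means alone do not show how wide each state's emission distribution is. The sketch below assumes full covariance matrices stored in an array of shape `(n_states, n_features, n_features)`, which is the layout `covars_` has for `covariance_type="full"`. The numbers are made up for illustration and are not taken from a fitted model.

```python
import numpy as np

# Hypothetical per-state means and full covariance matrices (2 features).
means = np.array([[0.2, 0.5], [1.5, 1.0], [2.8, 1.5]])
covars = np.array([
    [[0.25, 0.01], [0.01, 0.04]],
    [[1.00, 0.00], [0.00, 0.09]],
    [[0.16, 0.02], [0.02, 0.01]],
])

# Standard deviation of each feature per state: sqrt of the diagonal entries.
stds = np.sqrt(np.diagonal(covars, axis1=1, axis2=2))  # shape (3, 2)

for k in range(len(means)):
    print(f"State {k}: token feature {means[k, 0]:.2f} +/- {stds[k, 0]:.2f}, "
          f"numerical value {means[k, 1]:.2f} +/- {stds[k, 1]:.2f}")
```

States whose mean-plus-or-minus-std ranges overlap heavily on both features are the ones the model is most likely to confuse.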

## 6. Using the Model with New Data

Now let's demonstrate how to use our trained HMM with new data.

In [None]:
# Create a new example sequence
new_input_tokens = np.array(['A', 'B', 'C', 'D'])
new_numerical_data = np.array([0.45, 1.3, 0.75, 1.6])

# Prepare the new data
encoded_new_input = input_encoder.transform(new_input_tokens)
new_features = np.column_stack((encoded_new_input, new_numerical_data))

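# Predict the hidden states
predicted_states = model.predict(new_features)

# State indices are arbitrary after unsupervised training, so map each state to
# the output label it most often coincides with on the training data
# (recomputed here so this cell only needs the model and the training arrays).
train_states = model.predict(X_train, lengths)
train_labels = np.concatenate(y)
state_to_label = np.array([
    np.bincount(train_labels[train_states == s], minlength=n_states).argmax()
    for s in range(n_states)
])
predicted_output = output_encoder.inverse_transform(state_to_label[predicted_states])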

print("New sequence prediction:")
print(f"  Input tokens: {new_input_tokens}")
print(f"  Numerical data: {new_numerical_data}")
print(f"  Predicted output: {predicted_output}")

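# We can also get the log-likelihood of the observations under the model.
# Note: score() sums over all possible state paths; it is not the probability
# of the predicted state sequence (model.decode() returns that for the Viterbi path).
log_prob = model.score(new_features)
print(f"  Log-likelihood of the observations: {log_prob:.2f}")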
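The difference between the likelihood of the observations and the probability of the single best state path is easy to see on a tiny discrete HMM. The following sketch implements the forward algorithm and Viterbi decoding in log space with NumPy; the two-state parameters are made up for illustration and are unrelated to the model trained above.

```python
import numpy as np

def logsumexp(a, axis):
    m = a.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(a - m).sum(axis=axis, keepdims=True))).squeeze(axis)

def forward_loglik(log_start, log_trans, log_emit):
    """Log-likelihood of the observations, summed over all state paths."""
    alpha = log_start + log_emit[0]
    for t in range(1, len(log_emit)):
        alpha = logsumexp(alpha[:, None] + log_trans, axis=0) + log_emit[t]
    return float(logsumexp(alpha, axis=0))

def viterbi(log_start, log_trans, log_emit):
    """Most likely state path and its joint log-probability."""
    T, K = log_emit.shape
    delta = log_start + log_emit[0]
    backptr = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans  # scores[i, j]: from state i to j
        backptr[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_emit[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1], float(delta.max())

# Toy two-state HMM with two observable symbols.
start = np.array([0.6, 0.4])
trans = np.array([[0.7, 0.3], [0.4, 0.6]])
emit_probs = np.array([[0.9, 0.1], [0.2, 0.8]])  # emit_probs[state, symbol]
obs = np.array([0, 1, 1])

log_emit = np.log(emit_probs[:, obs]).T           # shape (T, K)
loglik = forward_loglik(np.log(start), np.log(trans), log_emit)
best_path, best_logp = viterbi(np.log(start), np.log(trans), log_emit)

print(best_path, best_logp, loglik)
```

The Viterbi log-probability can never exceed the forward log-likelihood, because the best path is just one term of the sum over all paths.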

## 7. Alternative Approach: Custom HMM Implementation

If you need more flexibility, you can implement a custom HMM that directly models the relationship between tokens and numerical data.

In [None]:
# Example of a custom approach using the pomegranate library.
# Note: the class names imported below belong to the pre-1.0 pomegranate API;
# pomegranate 1.0 and later reorganized the package, so this import fails there.
# pip install "pomegranate<1.0"  # Uncomment and run this line if pomegranate is not installed

from pomegranate import HiddenMarkovModel, DiscreteDistribution, NormalDistribution, IndependentComponentsDistribution

def custom_hmm_approach():
    # This is a skeleton implementation - would need to be completed
    print("Custom HMM approach using pomegranate:")
    print("This would allow more complex emission distributions that model joint probabilities")
    print("of tokens and numerical values.")
    print("\nExample implementation would include:")
    print("1. Create states with joint distributions for (token, numerical_value)")
    print("2. Define transitions between states")
    print("3. Train the model on sequences of (token, numerical_value) pairs")
    print("4. Predict output tokens for new sequences")
    
    # Note: Full implementation would be more complex

# Uncomment to run the custom approach example
# custom_hmm_approach()
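To illustrate step 1 of the skeleton without depending on any particular pomegranate version, here is a plain NumPy sketch of a joint emission model. It assumes that, given the state, the token follows a categorical distribution and the numerical value follows an independent normal distribution. All parameter values and the function name `joint_log_emission` are made up for illustration.

```python
import numpy as np

def joint_log_emission(tokens, values, token_probs, means, stds):
    """log p(token, value | state) for every time step and state.

    Assumes token and value are independent given the state:
    Categorical(token) * Normal(value).
    """
    tokens = np.asarray(tokens)
    values = np.asarray(values, dtype=float)
    log_cat = np.log(token_probs[:, tokens]).T             # shape (T, K)
    z = (values[:, None] - means[None, :]) / stds[None, :]
    log_norm = -0.5 * z**2 - np.log(stds[None, :]) - 0.5 * np.log(2 * np.pi)
    return log_cat + log_norm

# Two hypothetical states over three encoded tokens.
token_probs = np.array([[0.7, 0.2, 0.1],   # state 0 prefers token 0
                        [0.1, 0.3, 0.6]])  # state 1 prefers token 2
means = np.array([0.5, 1.5])
stds = np.array([0.1, 0.2])

log_emit = joint_log_emission([0, 2], [0.5, 1.5], token_probs, means, stds)
print(log_emit)
print(log_emit.argmax(axis=1))  # most likely state per step, ignoring transitions
```

Unlike the label-encoded Gaussian feature, the categorical part treats tokens as unordered, which avoids implying that token 'B' lies between 'A' and 'C'. The resulting matrix has the `(T, K)` shape that forward or Viterbi recursions expect.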

## 8. Conclusion

We've demonstrated how to:

1. Prepare token sequences and numerical data for HMM training
2. Train an HMM that can predict output tokens based on input tokens and numerical data
3. Evaluate the model's performance
4. Use the model to make predictions on new data

The key insights are:
- We can combine categorical (token) and numerical data into a single feature vector
- The HMM can learn patterns in this combined feature space
- The output tokens are represented as hidden states, but because training is unsupervised, state indices must be mapped to tokens after fitting
- You can extend this approach to more complex scenarios with longer sequences and more features