# Neural Data Preprocessing Workflow

This notebook demonstrates a step-by-step workflow for preprocessing neural data stored in `.mat` files. The goal is to extract meaningful features and labels for training machine learning models. Below are the key steps:

1. **Load Data**: Import all `.mat` files from the training folder and store them in a structured format.
2. **Explore Data**: Inspect the contents of a sample `.mat` file to understand the structure and types of data available.
3. **Feature Extraction**: Extract neural features (e.g., `spikePow` and `tx1`) and corresponding labels (e.g., `sentenceText`) from all trials.
4. **Normalization**: Apply block-wise z-score normalization to ensure consistent scaling across trials.
5. **Dataset Preparation**: Combine features and labels into a format suitable for training machine learning models.

Let’s begin by importing the necessary libraries and loading the data.

In [2]:
import os
import scipy.io
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Dense, Embedding, TimeDistributed


🧠 Step 1 (Updated): Load All .mat Files from a Folder
Let’s write code to:

1. List all `.mat` files in your training folder
2. Load each file
3. Store them in a dictionary keyed by filename

In [3]:
# Replace with the actual path to your training folder
data_folder = 'dataset/train/'

# List all .mat files in the folder
mat_files = [f for f in os.listdir(data_folder) if f.endswith('.mat')]
print(f"Found {len(mat_files)} .mat files")

# Load all .mat files into a dictionary
data_list = {}

for file in mat_files:
    file_path = os.path.join(data_folder, file)
    data = scipy.io.loadmat(file_path)
    data_list[file] = data  # store using filename as key

# Show keys of one sample file
sample_file = mat_files[0]
print(f"Keys in one sample ({sample_file}):")
print(data_list[sample_file].keys())


Found 24 .mat files
Keys in one sample (t12.2022.04.28.mat):
dict_keys(['__header__', '__version__', '__globals__', 'sentenceText', 'tx1', 'tx2', 'tx3', 'tx4', 'spikePow', 'blockIdx'])
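The key listing above mixes MATLAB metadata (`__header__`, `__version__`, `__globals__`) with the actual variables. A minimal sketch for separating the two, assuming a hypothetical helper named `data_keys` and a stand-in dictionary in place of a real `loadmat` result:

```python
def data_keys(mat_dict):
    """Return only the variable names, skipping the double-underscore metadata keys."""
    return [key for key in mat_dict if not key.startswith('__')]

# Stand-in for the dictionary returned by scipy.io.loadmat (values omitted)
fake_mat = dict.fromkeys(['__header__', '__version__', '__globals__',
                          'sentenceText', 'tx1', 'tx2', 'tx3', 'tx4',
                          'spikePow', 'blockIdx'])

print(data_keys(fake_mat))
```

The same call could be applied to `data_list[sample_file]` to list only the neural features and labels.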


🧠 Step 2: Explore One Sample Trial
We want to:

1. Inspect the contents of one `.mat` file
2. Understand the shape and type of the neural features and text

In [4]:
# Choose one sample file from the loaded data
sample = list(data_list.values())[0]  # first file; each file holds many trials

# Print available variables in this file
print("Keys in sample trial:", sample.keys())

# Let's see what spikePow, tx1, and sentenceText look like
spike_pow = sample['spikePow']
tx1 = sample['tx1']
sentence_text = sample['sentenceText']

print("Type of spikePow:", type(spike_pow))
print("Shape of spikePow:", np.array(spike_pow).shape)

print("Type of tx1:", type(tx1))
print("Shape of tx1:", np.array(tx1).shape)

print("Sentence text:")
print(sentence_text)


Keys in sample trial: dict_keys(['__header__', '__version__', '__globals__', 'sentenceText', 'tx1', 'tx2', 'tx3', 'tx4', 'spikePow', 'blockIdx'])
Type of spikePow: <class 'numpy.ndarray'>
Shape of spikePow: (1, 280)
Type of tx1: <class 'numpy.ndarray'>
Shape of tx1: (1, 280)
Sentence text:
['Nuclear rockets can destroy airfields with ease.                                      '
 'The best way to learn is to solve extra problems.                                     '
 'The spray will be used in first division matches next season.                         '
 "Our experiment's positive outcome was unexpected.                                     "
 "Alimony harms a divorced man's wealth.                                                "
 'She uses both names interchangeably.                                                  '
 'The misquote was retracted with an apology.                                           '
 'Critics fear the erosion of consumer protections and environmental standards.
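The `(1, 280)` shape reported above is consistent with a MATLAB cell array loaded as a NumPy object array with one entry per trial, which is what the Step 3 loop relies on when it calls `.squeeze()` and indexes `spikePow[i]`. A minimal sketch with synthetic data shows how to look inside such an array. The `describe_trials` helper, the trial lengths, and the channel count are all assumptions for illustration:

```python
import numpy as np

def describe_trials(cell_array):
    """Return the (timesteps, channels) shape of every trial in a cell-style array."""
    trials = np.asarray(cell_array).squeeze()  # (1, N) -> (N,)
    return [np.asarray(trial).shape for trial in trials]

# Synthetic stand-in for mat['spikePow']: 3 trials of different lengths, 256 channels
rng = np.random.default_rng(0)
fake_spike_pow = np.empty((1, 3), dtype=object)
for k, n_steps in enumerate([478, 310, 906]):
    fake_spike_pow[0, k] = rng.normal(size=(n_steps, 256))

shapes = describe_trials(fake_spike_pow)
print(shapes)
```

Because the number of timesteps differs per trial, the trials cannot be stacked into one array until they are padded.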

🧠 Step 3: Extract Features & Labels from All Trials

1. Loop through all `.mat` files.
2. For each file, extract:
   - the neural data from `spikePow` and `tx1` (and optionally combine them)
   - the sentence text (as the label)
3. Store them in lists so we can build a training dataset.

In [5]:

# Area 6v only: use first 128 channels
area6v_indices = np.arange(0, 128)

# Hold all features and labels
X = []
y = []

# Loop over all .mat training files
for filename in os.listdir(data_folder):
    if filename.endswith('.mat'):
        file_path = os.path.join(data_folder, filename)
        mat = scipy.io.loadmat(file_path)

        spikePow = mat['spikePow'].squeeze()  # S-length array of matrices
        tx1 = mat['tx1'].squeeze()
        blockIdx = mat['blockIdx'].squeeze()
        sentenceText = mat['sentenceText']

        num_trials = len(spikePow)

        for i in range(num_trials):
            # Combine tx1 and spikePow: each is (T x F), we use first 128 channels (Area 6v)
            spike_trial = spikePow[i][:, area6v_indices]
            tx1_trial = tx1[i][:, area6v_indices]
            
            # Concatenate along feature axis: (T x 128) → (T x 256)
            trial_features = np.concatenate([spike_trial, tx1_trial], axis=1)

            # Normalize using block-wise z-score
            block_id = blockIdx[i]
            block_mask = (blockIdx == block_id)

            # Get all trials from this block
            block_trials = [np.concatenate([spikePow[j][:, area6v_indices], tx1[j][:, area6v_indices]], axis=1)
                            for j in range(num_trials) if block_mask[j]]

            # Stack all block trials to compute mean & std
            block_data = np.vstack(block_trials)
            mean = block_data.mean(axis=0)
            std = block_data.std(axis=0) + 1e-8  # avoid divide by zero

            # z-score the current trial
            trial_features = (trial_features - mean) / std

            # Extract sentenceText (handle both string and char matrix cases)
            sentence_entry = sentenceText[i]
            if isinstance(sentence_entry, np.ndarray):
                sentence = ''.join(chr(int(c)) for c in sentence_entry.flatten() if c != 0)
            elif isinstance(sentence_entry, str):
                sentence = sentence_entry
            else:
                sentence = str(sentence_entry)

            X.append(trial_features)
            y.append(sentence)


# Let's check one sample
print("✅ Done extracting.")
print("Sample feature shape:", X[0].shape)
print("Sample label:", y[0])
print("Total examples:", len(X))



✅ Done extracting.
Sample feature shape: (478, 256)
Sample label: Nuclear rockets can destroy airfields with ease.                                      
Total examples: 8800
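The extraction loop above rebuilds the stacked block data for every trial, so the same mean and standard deviation are recomputed many times. One possible restructuring computes the statistics once per block and then applies them. This is a sketch on synthetic data, and the `blockwise_zscore` helper is a hypothetical name rather than part of the notebook:

```python
import numpy as np

def blockwise_zscore(trials, block_ids, eps=1e-8):
    """Z-score each (T x F) trial with the mean/std of all timesteps in its block."""
    block_ids = np.asarray(block_ids)
    stats = {}
    for block in np.unique(block_ids):
        # Stack every trial of this block along the time axis
        stacked = np.vstack([t for t, b in zip(trials, block_ids) if b == block])
        stats[block] = (stacked.mean(axis=0), stacked.std(axis=0) + eps)
    return [(t - stats[b][0]) / stats[b][1] for t, b in zip(trials, block_ids)]

# Synthetic trials: two in block 1, one in block 2, four features each
rng = np.random.default_rng(0)
trials = [rng.normal(loc=5.0, scale=2.0, size=(n, 4)) for n in (30, 50, 40)]
block_ids = [1, 1, 2]

normalized = blockwise_zscore(trials, block_ids)
block1 = np.vstack(normalized[:2])
print(block1.mean(axis=0), block1.std(axis=0))
```

After normalization, each feature has approximately zero mean and unit variance within a block, while individual trials keep their original lengths.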


🧠 Step 4: Pad Sequences & Encode Labels
The features in `X` are variable-length sequences (one per trial), so they must be padded to a common length. The labels in `y` are sentences (strings), so we tokenize them at the character level and encode them as integer sequences.

Here’s what we’ll do:

1. Pad the `X` sequences to the same length.
2. Create a character-level tokenizer for `y`.
3. Convert `y` into sequences of integers.
4. Pad the `y` sequences to the same maximum length.

In [6]:

# Pad input sequences (X)
max_timesteps = max(x.shape[0] for x in X)
feature_dim = X[0].shape[1]  # 256: 128 spikePow + 128 tx1 channels
print(f"Max timesteps: {max_timesteps}, Feature dim: {feature_dim}")
X_padded = np.array([
    np.pad(x, ((0, max_timesteps - x.shape[0]), (0, 0)), mode='constant')
    for x in X
])

# Tokenize output sequences (y) at character level
tokenizer = Tokenizer(char_level=True)
tokenizer.fit_on_texts(y)
y_sequences = tokenizer.texts_to_sequences(y)

# Pad output sequences (y)
max_label_len = max(len(seq) for seq in y_sequences)
y_padded = pad_sequences(y_sequences, maxlen=max_label_len, padding='post')

# Optional: print shape info
print(f"Padded input shape: {X_padded.shape}")  # (num_samples, max_timesteps, 256)
print(f"Padded labels shape: {y_padded.shape}")  # (num_samples, max_label_len)
print(f"Vocab size: {len(tokenizer.word_index) + 1}")  # +1 for padding index 0

Max timesteps: 906, Feature dim: 256
Padded input shape: (8800, 906, 256)
Padded labels shape: (8800, 88)
Vocab size: 37
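The labels printed earlier carry trailing spaces, which the character-level tokenizer encodes as ordinary space characters and which count toward the padded label length. A minimal pure-Python sketch of stripping that whitespace before encoding follows. The `build_char_vocab` and `encode` helpers are hypothetical stand-ins for the Keras `Tokenizer` and `pad_sequences` calls, and they assign ids alphabetically rather than by frequency:

```python
def build_char_vocab(sentences):
    """Map each character to an integer id, reserving 0 for padding."""
    chars = sorted({ch for s in sentences for ch in s})
    return {ch: idx for idx, ch in enumerate(chars, start=1)}

def encode(sentence, vocab, max_len):
    """Convert a sentence to ids and right-pad with 0 up to max_len."""
    ids = [vocab[ch] for ch in sentence]
    return ids + [0] * (max_len - len(ids))

raw_labels = ["Theocracy reconsidered.          ",
              "She uses both names interchangeably.      "]
clean_labels = [s.strip().lower() for s in raw_labels]  # drop fixed-width padding

vocab = build_char_vocab(clean_labels)
max_len = max(len(s) for s in clean_labels)
encoded = [encode(s, vocab, max_len) for s in clean_labels]
print(max_len, len(vocab) + 1)  # longest label, vocab size including padding id
```

With the whitespace removed, the padded positions are all id 0, which an `Embedding` layer with `mask_zero=True` can ignore.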


In [7]:
# Parameters
latent_dim = 256  # Size of LSTM hidden states
vocab_size = len(tokenizer.word_index) + 1  # Number of unique characters + padding

# Encoder
encoder_inputs = Input(shape=(max_timesteps, feature_dim))
encoder_lstm = LSTM(latent_dim, return_sequences=False, return_state=True)
_, state_h, state_c = encoder_lstm(encoder_inputs)
encoder_states = [state_h, state_c]

# Decoder
decoder_inputs = Input(shape=(max_label_len,))  # input: padded character sequences
decoder_embedding = Embedding(input_dim=vocab_size, output_dim=latent_dim, mask_zero=True)(decoder_inputs)
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=False)
decoder_outputs = decoder_lstm(decoder_embedding, initial_state=encoder_states)
decoder_dense = TimeDistributed(Dense(vocab_size, activation='softmax'))
decoder_outputs = decoder_dense(decoder_outputs)

# Full model
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

# Compile
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Summary
model.summary()


Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_2 (InputLayer)           [(None, 88)]         0           []                               
                                                                                                  
 input_1 (InputLayer)           [(None, 906, 256)]   0           []                               
                                                                                                  
 embedding (Embedding)          (None, 88, 256)      9472        ['input_2[0][0]']                
                                                                                                  
 lstm (LSTM)                    [(None, 256),        525312      ['input_1[0][0]']                
                                 (None, 256),                                                 
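The parameter counts in the summary above can be checked by hand. The sketch below uses the standard formulas for an embedding table and for an LSTM layer with biases; the function names are illustrative only:

```python
def embedding_params(vocab_size, output_dim):
    """One output_dim-sized vector per vocabulary entry."""
    return vocab_size * output_dim

def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * (units * (input_dim + units) + units)

print(embedding_params(37, 256))  # 9472, matches the embedding row
print(lstm_params(256, 256))      # 525312, matches the encoder lstm row
```

The decoder LSTM receives 256-dimensional embeddings and has 256 units, so the same formula gives the same count for it.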

In [8]:
# Compile the model
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

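# Decoder targets: y_padded with a trailing axis for sparse_categorical_crossentropy.
# Note: no one-timestep shift is applied here, so at every position the target
# equals the decoder input passed to model.fit below.
decoder_target_data = np.expand_dims(y_padded, -1)  # (num_samples, max_label_len, 1)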

# Train the model
history = model.fit(
    [X_padded, y_padded], decoder_target_data,
    batch_size=64,
    epochs=20,
    validation_split=0.1
)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
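In the training cell above, `y_padded` serves as both the decoder input and the target, so the decoder can score well simply by copying its input. A conventional teacher-forcing setup offsets the two by one step, so that the input at position t is the target from position t-1. The sketch below is one way to build such a pair and is not the notebook's method. It assumes a hypothetical `start_id` that is not already used by the tokenizer, which would also require enlarging the `Embedding` layer's `input_dim` by one:

```python
import numpy as np

def make_teacher_forcing_pair(y_padded, start_id):
    """Build decoder inputs (shifted right, start token first) and targets."""
    y_padded = np.asarray(y_padded)
    decoder_input = np.zeros_like(y_padded)
    decoder_input[:, 0] = start_id           # every sequence begins with the start token
    decoder_input[:, 1:] = y_padded[:, :-1]  # input at step t is the target from step t-1
    decoder_target = np.expand_dims(y_padded, -1)  # trailing axis for the sparse loss
    return decoder_input, decoder_target

# Toy labels: two padded sequences, 0 is the padding id
y_toy = np.array([[5, 3, 7, 0, 0],
                  [2, 2, 0, 0, 0]])
dec_in, dec_target = make_teacher_forcing_pair(y_toy, start_id=9)
print(dec_in)
print(dec_target.shape)
```

With this layout the model must predict each character from the neural features and the preceding characters only.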


In [9]:
import numpy as np
import os
import scipy.io

def load_test_data(test_folder):
    # Area 6v only
    area6v_indices = np.arange(0, 128)

    X_test = []
    y_test = []

    for filename in os.listdir(test_folder):
        if filename.endswith('.mat'):
            file_path = os.path.join(test_folder, filename)
            mat = scipy.io.loadmat(file_path)

            spikePow = mat['spikePow'].squeeze()
            tx1 = mat['tx1'].squeeze()
            blockIdx = mat['blockIdx'].squeeze()
            sentenceText = mat['sentenceText']

            num_trials = len(spikePow)

            for i in range(num_trials):
                # Extract spikePow and tx1 for this trial (only Area 6v)
                spike_trial = spikePow[i][:, area6v_indices]
                tx1_trial = tx1[i][:, area6v_indices]

                # Concatenate features (T x 256)
                trial_features = np.concatenate([spike_trial, tx1_trial], axis=1)

                # Normalize with block-wise mean and std
                block_id = blockIdx[i]
                block_mask = (blockIdx == block_id)

                block_trials = [np.concatenate([spikePow[j][:, area6v_indices], tx1[j][:, area6v_indices]], axis=1)
                                for j in range(num_trials) if block_mask[j]]

                block_data = np.vstack(block_trials)
                mean = block_data.mean(axis=0)
                std = block_data.std(axis=0) + 1e-8

                trial_features = (trial_features - mean) / std

                # Decode sentence text
                sentence_entry = sentenceText[i]
                if isinstance(sentence_entry, np.ndarray):
                    sentence = ''.join(chr(int(c)) for c in sentence_entry.flatten() if c != 0)
                elif isinstance(sentence_entry, str):
                    sentence = sentence_entry
                else:
                    sentence = str(sentence_entry)

                X_test.append(trial_features)
                y_test.append(sentence)

    return X_test, y_test


# Load
X_test, y_test = load_test_data('dataset/test')
print("✅ Done loading.")
print("Test samples:", len(X_test))
print("First input shape:", X_test[0].shape)
print("First label:", y_test[0])


✅ Done loading.
Test samples: 880
First input shape: (254, 256)
First label: Theocracy reconsidered.                                                         


In [10]:

# Pad or truncate each test trial to the training max_timesteps
X_test_padded = np.array([
    x[:max_timesteps] if x.shape[0] >= max_timesteps else np.pad(x, ((0, max_timesteps - x.shape[0]), (0, 0)), mode='constant')
    for x in X_test
])


y_test_sequences = tokenizer.texts_to_sequences(y_test)
y_test_padded = pad_sequences(y_test_sequences, maxlen=max_label_len, padding='post')
decoder_target_test = np.expand_dims(y_test_padded, -1)

# Evaluate with the ground-truth labels also supplied as the decoder input
model.evaluate([X_test_padded, y_test_padded], decoder_target_test, batch_size=64)




[6.45776090095751e-05, 1.0]

In [11]:
# Predict with the ground-truth labels supplied as the decoder input
predictions = model.predict([X_test_padded, y_test_padded])
predicted_token_ids = np.argmax(predictions, axis=-1)  # shape: (num_samples, max_label_len)

decoded_sentences = tokenizer.sequences_to_texts(predicted_token_ids)
for pred, true in zip(decoded_sentences, y_test):
    print(f"Prediction: {pred}")
    print(f"Ground Truth: {true}")
    print("-" * 50)




Prediction: t h e o c r a c y   r e c o n s i d e r e d .                                                                                                                                  
Ground Truth: Theocracy reconsidered.                                                         
--------------------------------------------------
Prediction: r i c h   p u r c h a s e d   s e v e r a l   s i g n e d   l i t h o g r a p h s .                                                                                            
Ground Truth: Rich purchased several signed lithographs.                                      
--------------------------------------------------
Prediction: s o   r u l e s   w e   m a d e ,   i n   u n a b a s h e d   c o l l u s i o n .                                                                                              
Ground Truth: So rules we made, in unabashed collusion.                                       
-------------------------------------------------
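The predictions above are produced with `y_test_padded` as the decoder input, so the model sees the true characters while predicting them. When no labels are available, a sequence-to-sequence decoder is usually run autoregressively, feeding each predicted token back in as the next input. The sketch below shows greedy decoding under stated assumptions: `next_token_probs` is a hypothetical callable standing in for one decoder step of a trained model, and `start_id` is a hypothetical start token:

```python
import numpy as np

def greedy_decode(next_token_probs, start_id, max_len, pad_id=0):
    """Generate token ids one at a time, feeding each prediction back in."""
    tokens = [start_id]
    for _ in range(max_len):
        probs = next_token_probs(tokens)  # distribution over the vocabulary
        next_id = int(np.argmax(probs))
        if next_id == pad_id:             # treat the padding id as end of sequence
            break
        tokens.append(next_id)
    return tokens[1:]                     # drop the start token

# Toy stand-in for a trained decoder: emits 3, 1, 2 and then padding
def toy_model(prefix, script=(3, 1, 2, 0), vocab_size=5):
    probs = np.zeros(vocab_size)
    probs[script[len(prefix) - 1]] = 1.0
    return probs

decoded = greedy_decode(toy_model, start_id=4, max_len=10)
print(decoded)
```

Greedy decoding picks the single most likely token at each step; beam search is a common alternative when several candidates should be kept.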

In [12]:
# Save the model
model.save('lstm_model.h5')
print("Model saved as lstm_model.h5")
# Save the tokenizer
import json
with open('tokenizer.json', 'w') as f:
    json.dump(tokenizer.to_json(), f)
print("Tokenizer saved as tokenizer.json")

Model saved as lstm_model.h5
Tokenizer saved as tokenizer.json
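`tokenizer.to_json()` already returns a JSON-formatted string, so passing it to `json.dump` encodes it a second time. The round trip still works because `json.load` removes only the outer layer and hands a string to `tokenizer_from_json`. A small sketch using only the standard library illustrates this; the `config_json` value is a made-up stand-in for the real tokenizer configuration:

```python
import json

# Stand-in for tokenizer.to_json(), which already returns a JSON-formatted string
config_json = json.dumps({"class_name": "Tokenizer", "config": {"char_level": True}})

double_encoded = json.dumps(config_json)  # what json.dump(tokenizer.to_json(), f) writes
restored = json.loads(double_encoded)     # what json.load(f) gives back

print(type(restored).__name__)            # str, not dict
print(json.loads(restored)["config"]["char_level"])
```

Writing the string directly with `f.write(tokenizer.to_json())` would store a single layer of JSON, in which case the file would be read back with `f.read()` rather than `json.load(f)`.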


In [1]:
import numpy as np
import os
import json
import scipy.io
import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.text import tokenizer_from_json
from nltk.translate.bleu_score import corpus_bleu

# --- Load model and tokenizer ---
model = load_model('lstm_model.h5')
print("✅ Model loaded.")

with open('tokenizer.json') as f:
    tokenizer_json = json.load(f)
tokenizer = tokenizer_from_json(tokenizer_json)
print("✅ Tokenizer loaded.")

# --- Define test data loading ---
area6v_indices = np.arange(0, 128)

def load_test_data(test_folder):
    X_test = []
    y_test = []

    for filename in os.listdir(test_folder):
        if filename.endswith('.mat'):
            file_path = os.path.join(test_folder, filename)
            mat = scipy.io.loadmat(file_path)

            spikePow = mat['spikePow'].squeeze()
            tx1 = mat['tx1'].squeeze()
            blockIdx = mat['blockIdx'].squeeze()
            sentenceText = mat['sentenceText']

            num_trials = len(spikePow)

            for i in range(num_trials):
                spike_trial = spikePow[i][:, area6v_indices]
                tx1_trial = tx1[i][:, area6v_indices]
                trial_features = np.concatenate([spike_trial, tx1_trial], axis=1)

                block_id = blockIdx[i]
                block_mask = (blockIdx == block_id)
                block_trials = [np.concatenate([spikePow[j][:, area6v_indices], tx1[j][:, area6v_indices]], axis=1)
                                for j in range(num_trials) if block_mask[j]]
                block_data = np.vstack(block_trials)
                mean = block_data.mean(axis=0)
                std = block_data.std(axis=0) + 1e-8

                trial_features = (trial_features - mean) / std

                sentence_entry = sentenceText[i]
                if isinstance(sentence_entry, np.ndarray):
                    sentence = ''.join(chr(int(c)) for c in sentence_entry.flatten() if c != 0)
                elif isinstance(sentence_entry, str):
                    sentence = sentence_entry
                else:
                    sentence = str(sentence_entry)

                X_test.append(trial_features)
                y_test.append(sentence)

    return X_test, y_test

# --- Load and prepare test data ---
X_test, y_test = load_test_data('dataset/test')
print(f"✅ Loaded {len(X_test)} test samples.")



✅ Model loaded.
✅ Tokenizer loaded.
✅ Loaded 880 test samples.


In [2]:

# --- Pad inputs ---
max_timesteps = 906   # ✅ MATCHES model input

X_test_padded = np.array([
    np.pad(x[:max_timesteps], ((0, max(0, max_timesteps - x.shape[0])), (0, 0)), mode='constant')
    for x in X_test
])

# --- Prepare outputs ---
y_test_sequences = tokenizer.texts_to_sequences(y_test)
max_label_len = 88  # Replace with your training max_label_len
y_test_padded = pad_sequences(y_test_sequences, maxlen=max_label_len, padding='post')

# --- Make predictions ---
predictions = model.predict([X_test_padded, y_test_padded])

# Decode the predictions
predicted_token_ids = np.argmax(predictions, axis=-1)
decoded_sentences = tokenizer.sequences_to_texts(predicted_token_ids)

# --- Evaluate ---
# Drop any literal "<pad>" tokens; the Keras Tokenizer emits none (index 0 is
# simply skipped by sequences_to_texts), so this is effectively a no-op here
def remove_padding(tokens, pad_token="<pad>"):
    return [tok for tok in tokens if tok != pad_token]

# Tokenize by whitespace: predictions split into single lowercase characters,
# while the ground truth splits into original-case words
decoded_sentences_filtered = [remove_padding(pred.split()) for pred in decoded_sentences]
y_test_filtered = [remove_padding(true.split()) for true in y_test]

# Corpus BLEU over these token lists (default: up to 4-grams, no smoothing)
bleu_score = corpus_bleu([[true] for true in y_test_filtered], decoded_sentences_filtered)

# Print predictions with ground truth
for i, (pred, true) in enumerate(zip(decoded_sentences, y_test), 1):
    print(f"[{i}] Prediction   : {pred}")
    print(f"    Ground Truth : {true}")
    print("-" * 60)

# Output BLEU score
print(f"\n🎯 Final Corpus BLEU Score: {bleu_score:.4f}")


[1] Prediction   : t h e o c r a c y   r e c o n s i d e r e d .                                                                                                                                  
    Ground Truth : Theocracy reconsidered.                                                         
------------------------------------------------------------
[2] Prediction   : r i c h   p u r c h a s e d   s e v e r a l   s i g n e d   l i t h o g r a p h s .                                                                                            
    Ground Truth : Rich purchased several signed lithographs.                                      
------------------------------------------------------------
[3] Prediction   : s o   r u l e s   w e   m a d e ,   i n   u n a b a s h e d   c o l l u s i o n .                                                                                              
    Ground Truth : So rules we made, in unabashed collusion.                                 

The hypothesis contains 0 counts of 2-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
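The warnings above follow from a granularity mismatch: `sequences_to_texts` joins single characters with spaces, so each hypothesis token is one lowercase character, while each reference token is a mixed-case word, and no 2-gram or longer overlap is possible. A minimal sketch of comparing both sides at the character level follows. It swaps in character error rate (edit distance divided by reference length) in place of BLEU, and the helper names are hypothetical:

```python
def unspace(decoded):
    """Undo the single-space join that sequences_to_texts applies between characters."""
    return decoded[::2].strip()

def edit_distance(a, b):
    """Levenshtein distance between two strings, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def char_error_rate(prediction, reference):
    """Edit distance between the un-spaced prediction and the normalized reference."""
    reference = reference.strip().lower()
    return edit_distance(unspace(prediction), reference) / max(len(reference), 1)

pred = " ".join("theocracy reconsidered.   ")  # mimics the decoded output format
truth = "Theocracy reconsidered.      "
print(unspace(pred))
print(char_error_rate(pred, truth))
```

The reference is lowercased because the Keras `Tokenizer` lowercases text by default, so the model's vocabulary contains no capital letters.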
